AI Reliability Operations Guide
This runbook covers monitoring, incident response, and troubleshooting for olllo’s AI reliability infrastructure.
Dashboard Overview
The AI reliability dashboard, served at /api/admin/ai-dashboard, provides real-time visibility into AI operations.
Key Metrics to Monitor
| Metric | Description | Healthy Range |
|---|---|---|
| Success Rate | Percentage of AI requests completing successfully | > 99% |
| Avg Latency | Average request duration in milliseconds | < 5000ms |
| Token Usage | Total tokens consumed across all tiers | Varies by traffic |
| Cache Hit Rate | Percentage of requests served from cache | > 50% |
| Fallback Rate | Percentage of requests using fallback models | < 5% |
| Safety Incidents | Count of blocked/flagged content | < 5x 7-day baseline |
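The healthy ranges above can be checked mechanically against a metrics snapshot. The following is a minimal sketch, not the dashboard's actual logic: the `MetricsSnapshot` class and its field names are hypothetical, since the shape of the dashboard payload is not documented here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetricsSnapshot:
    # Hypothetical field names; the real dashboard payload may differ.
    success_rate_pct: float
    avg_latency_ms: float
    cache_hit_rate_pct: float
    fallback_rate_pct: float
    safety_incidents: int
    safety_baseline: float  # typical incident count used for comparison


def unhealthy_metrics(m: MetricsSnapshot) -> List[str]:
    """Return the names of metrics that fall outside the healthy ranges."""
    problems = []
    if m.success_rate_pct <= 99.0:
        problems.append("Success Rate")
    if m.avg_latency_ms >= 5000:
        problems.append("Avg Latency")
    if m.cache_hit_rate_pct <= 50.0:
        problems.append("Cache Hit Rate")
    if m.fallback_rate_pct >= 5.0:
        problems.append("Fallback Rate")
    # With no incidents at all there is nothing to flag, whatever the baseline.
    if m.safety_incidents > 0 and m.safety_incidents >= 5 * m.safety_baseline:
        problems.append("Safety Incidents")
    return problems
```

Token Usage is left out on purpose because the table gives it no fixed range.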
Dashboard API Endpoints
# Get dashboard metrics (last 24 hours)
curl -H "Authorization: Bearer $ADMIN_API_KEY" \
"https://app.olllo.app/api/admin/ai-dashboard?period=24h"
# Get metrics for specific feature
curl -H "Authorization: Bearer $ADMIN_API_KEY" \
"https://app.olllo.app/api/admin/ai-dashboard?period=7d&feature=reflection"
Alert Thresholds
Critical Alerts (Immediate Response)
| Condition | Threshold | Action |
|---|---|---|
| Success rate drops | < 95% for 5 minutes | Page on-call, investigate primary provider |
| All providers failing | 100% failure rate | Activate graceful degradation |
| Safety incident spike | > 5x 7-day baseline | Investigate potential attack or model issue |
Warning Alerts (Business Hours)
| Condition | Threshold | Action |
|---|---|---|
| Elevated fallback rate | > 10% for 1 hour | Check primary provider status |
| High latency | > 10s avg for 15 min | Review request patterns |
| Low cache hit rate | < 30% for 1 hour | Verify cache configuration |
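Every threshold in these tables pairs a condition with a duration, so an alert should fire only when the condition holds for the whole window. A minimal sketch of that check follows, assuming metrics arrive as `(timestamp, value)` samples; the function name and sample format are illustrative, not part of the alerting system.

```python
from typing import Callable, Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, metric value)


def breached_for(
    samples: Sequence[Sample],
    is_bad: Callable[[float], bool],
    window_s: float,
    now: float,
) -> bool:
    """True if the condition held for every sample in the last window_s seconds.

    Returns False when the history is shorter than the window, so a freshly
    started collector cannot trigger an alert from a single bad sample.
    """
    if not samples or now - min(t for t, _ in samples) < window_s:
        return False
    recent = [v for t, v in samples if t >= now - window_s]
    return bool(recent) and all(is_bad(v) for v in recent)
```

For the critical success-rate alert, the call would look like `breached_for(samples, lambda v: v < 95.0, 5 * 60, now)`.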
Fallback Behavior
olllo implements automatic fallback to maintain availability when AI providers experience issues.
Model Tier Fallback Chain
| Tier | Primary Model | Fallback Model |
|---|---|---|
| Fast | claude-3-5-haiku-latest | gpt-4o-mini |
| Balanced | claude-sonnet-4-5 | gpt-4o |
| Advanced | claude-sonnet-4-5 | gpt-4o |
When Fallback Activates
- Provider Error: Connection timeout, rate limiting, or service unavailable
- Retries Exhausted: After 2 retry attempts with exponential backoff
- Circuit Open: Provider circuit breaker triggered after repeated failures
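The retry-then-fallback path can be sketched as below. This is an illustration under stated assumptions, not olllo's implementation: `ProviderError`, `call_with_fallback`, the 0.5-second base delay, and the `fallbackReason` key are all hypothetical. Only the two retries, the exponential backoff, and the `usedFallback` flag come from this guide. The circuit breaker is not modeled.

```python
import time
from typing import Callable, Dict


class ProviderError(Exception):
    """Stand-in for timeouts, rate limiting, or service-unavailable errors."""


def call_with_fallback(
    primary: Callable[[], str],
    fallback: Callable[[], str],
    max_retries: int = 2,
    base_delay_s: float = 0.5,  # assumed value; the guide does not specify one
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, object]:
    """Try the primary model, retry with backoff, then use the fallback."""
    for attempt in range(max_retries + 1):  # first try plus max_retries retries
        try:
            return {"text": primary(), "usedFallback": False}
        except ProviderError:
            if attempt < max_retries:
                # Delay doubles on each retry: 0.5s, then 1s.
                sleep(base_delay_s * 2 ** attempt)
    return {
        "text": fallback(),
        "usedFallback": True,
        "fallbackReason": "retries_exhausted",  # hypothetical field name
    }
```

The `sleep` parameter is injectable so the backoff schedule can be verified without waiting.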
User Experience During Fallback
- Users see no visible difference in most cases
- Response quality may vary slightly between providers
- Telemetry records usedFallback: true with the reason
Graceful Degradation
When AI services are completely unavailable, each feature switches to a reduced behavior that does not depend on AI.
Feature-Specific Fallbacks
| Feature | Degraded Behavior |
|---|---|
| PII Detection | Skip detection, log warning |
| STAR Extraction | Accept raw text without AI enhancement |
| Reflection Prompts | Use pre-defined static prompts |
| Summary Generation | Display “Summary unavailable” message |
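One way to picture the table is as a dispatch on feature name. The sketch below is illustrative only: the feature keys, the response fields, and the static prompt texts are hypothetical, while the behaviors themselves mirror the table.

```python
import logging

logger = logging.getLogger("ai.degradation")

# Hypothetical examples; the real static prompts are not listed in this guide.
STATIC_REFLECTION_PROMPTS = [
    "What went well today?",
    "What would you do differently next time?",
]


def degraded_response(feature: str, raw_text: str = "") -> dict:
    """Return the non-AI behavior for a feature while AI is unavailable."""
    if feature == "pii_detection":
        logger.warning("PII detection skipped: AI services unavailable")
        return {"degraded": True, "piiChecked": False}
    if feature == "star_extraction":
        # Accept the entry as written, with no AI enhancement.
        return {"degraded": True, "text": raw_text}
    if feature == "reflection_prompts":
        return {"degraded": True, "prompts": list(STATIC_REFLECTION_PROMPTS)}
    if feature == "summary_generation":
        return {"degraded": True, "message": "Summary unavailable"}
    raise ValueError(f"unknown feature: {feature}")
```

Raising on an unknown feature keeps a new AI feature from shipping without a defined degraded behavior.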
User Notifications
Users see contextual messages like:
- “AI features are temporarily limited. Your data has been saved.”
- “We’ll enhance your entry when AI services are restored.”
Troubleshooting Guide
High Error Rate
- Check provider status pages (status.anthropic.com, status.openai.com)
- Review error codes in telemetry: errorCode field
- Check rate limit status: Are we hitting provider limits?
- Verify API keys are valid and not expired
Elevated Latency
- Check streaming vs non-streaming ratio
- Review token counts - large requests take longer
- Check provider latency (may be provider-side issue)
- Review concurrent request count
Low Cache Hit Rate
- Verify cache configuration is correct
- Check if request patterns have changed
- Review cache key generation logic
- Confirm Redis connection is healthy
Safety Incident Spike
- Review incident types in dashboard
- Check for coordinated abuse patterns
- Review affected users/sessions
- Consider temporary rate limit adjustments
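To judge whether a rise in incidents counts as a spike, compare the current count against the 5x 7-day baseline used by the critical alert. This guide does not define how the baseline is computed; the sketch below assumes it is the mean of the previous seven daily counts, and it treats any incident as a spike when that mean is zero. Both choices are assumptions.

```python
from statistics import mean
from typing import Sequence


def is_safety_spike(
    previous_7_days: Sequence[int], today: int, factor: float = 5.0
) -> bool:
    """True if today's incident count exceeds factor times the 7-day mean."""
    if len(previous_7_days) != 7:
        raise ValueError("expected exactly 7 daily counts")
    baseline = mean(previous_7_days)  # assumed definition of the baseline
    if baseline == 0:
        # Assumption: with a clean week, any incident is worth a look.
        return today > 0
    return today > factor * baseline
```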
Request Tracing
Every AI request includes a correlation ID for tracing.
Looking Up a Request
# Get full request trace
curl -H "Authorization: Bearer $ADMIN_API_KEY" \
"https://app.olllo.app/api/admin/ai-dashboard/trace/{correlationId}"
# Get human-readable format
curl -H "Authorization: Bearer $ADMIN_API_KEY" \
"https://app.olllo.app/api/admin/ai-dashboard/trace/{correlationId}?format=text"
Trace Information Includes
- Request timing and duration
- Model used (primary or fallback)
- Token usage and cost
- Retry attempts with timestamps
- Safety incidents if any
- Error details if failed
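For scripted lookups, the trace request can be built with the Python standard library. This is a minimal sketch: the URL, the `format=text` parameter, and the bearer header come from the curl examples in this section, while the helper name `trace_request` is hypothetical. The response shape is not documented here, so the sketch stops at building the request.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "https://app.olllo.app/api/admin/ai-dashboard"


def trace_request(
    correlation_id: str, api_key: str, text_format: bool = False
) -> Request:
    """Build an authenticated request for a single AI request trace."""
    # Percent-encode the ID so unexpected characters cannot alter the path.
    url = f"{BASE_URL}/trace/{quote(correlation_id, safe='')}"
    if text_format:
        url += "?" + urlencode({"format": "text"})
    return Request(url, headers={"Authorization": f"Bearer {api_key}"})
```

Passing the result to `urllib.request.urlopen` sends the request.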
Incident Response Procedures
Severity 1: Complete AI Outage
- Verify: Check provider status pages
- Communicate: Post status update
- Mitigate: Ensure graceful degradation is active
- Monitor: Watch for recovery signals
- Resolve: Verify normal operation restored
- Postmortem: Document and improve
Severity 2: Partial Degradation
- Identify: Which tier/feature is affected?
- Assess: Is fallback working correctly?
- Monitor: Track error rates and user impact
- Escalate: If fallback also failing, escalate to Sev 1
Severity 3: Performance Issues
- Measure: Capture baseline metrics
- Analyze: Identify bottleneck (provider, cache, etc.)
- Optimize: Apply targeted fixes
- Verify: Confirm improvement
Monitoring Thresholds for 99.9% Availability (SC-008)
Target: 99.9% availability = max 8.76 hours downtime/year
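The 8.76-hour figure is 0.1% of the 8,760 hours in a non-leap year. The sketch below reproduces that arithmetic and applies the request-based formula from the Availability Calculation section of this guide. The function names are illustrative, and returning 100% when there is no traffic is an assumption.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_budget_hours(
    target_pct: float, period_hours: float = HOURS_PER_YEAR
) -> float:
    """Hours of downtime allowed in a period at the given availability target."""
    return period_hours * (1 - target_pct / 100)


def availability_pct(successful: int, total: int) -> float:
    """Availability = (Successful Requests / Total Requests) * 100."""
    if total == 0:
        return 100.0  # assumption: no traffic counts as fully available
    return successful / total * 100
```

Over a 30-day window, the same target allows `downtime_budget_hours(99.9, 30 * 24)` hours, which is roughly 43 minutes.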
Baseline Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| Uptime | 99.9% | < 99.5% over 1 hour |
| Error Rate | < 0.1% | > 1% for 5 minutes |
| P95 Latency | < 5s | > 10s for 10 minutes |
| Fallback Success | > 99% | < 95% when fallback active |
Availability Calculation
Availability = (Successful Requests / Total Requests) * 100
Where "Successful" includes:
- Direct success (status: SUCCESS)
- Fallback success (usedFallback: true, status: SUCCESS)
- Graceful degradation (user received response, even if limited)
Contacts
- On-Call: Check PagerDuty rotation
- AI Infrastructure: #ai-platform Slack channel
- Escalation: engineering-leads@olllo.app