AI Reliability Operations Guide

This runbook covers monitoring, incident response, and troubleshooting for olllo’s AI reliability infrastructure.

Dashboard Overview

The AI reliability dashboard, served at /api/admin/ai-dashboard, provides real-time visibility into AI operations.

Key Metrics to Monitor

Metric	Description	Healthy Range
Success Rate	Percentage of AI requests completing successfully	> 99%
Avg Latency	Average request duration in milliseconds	< 5000ms
Token Usage	Total tokens consumed across all tiers	Varies by traffic
Cache Hit Rate	Percentage of requests served from cache	> 50%
Fallback Rate	Percentage of requests using fallback models	< 5%
Safety Incidents	Count of blocked/flagged content	< 5x 7-day baseline
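
The healthy ranges in the table can be checked mechanically against a metrics snapshot. The following is a minimal sketch only: the field names (`success_rate_pct`, `avg_latency_ms`, and so on) are hypothetical, since the dashboard's response schema is not documented here.

```python
from typing import Callable, Dict, List

# Healthy ranges taken from the table above. The keys are hypothetical field
# names; the real dashboard payload may name these differently.
HEALTHY: Dict[str, Callable[[float], bool]] = {
    "success_rate_pct": lambda v: v > 99.0,
    "avg_latency_ms": lambda v: v < 5000.0,
    "cache_hit_rate_pct": lambda v: v > 50.0,
    "fallback_rate_pct": lambda v: v < 5.0,
}


def unhealthy_metrics(snapshot: Dict[str, float]) -> List[str]:
    """Return the names of metrics that fall outside their healthy range."""
    return [
        name
        for name, is_healthy in HEALTHY.items()
        if name in snapshot and not is_healthy(snapshot[name])
    ]


snapshot = {
    "success_rate_pct": 99.4,
    "avg_latency_ms": 6200.0,
    "cache_hit_rate_pct": 61.0,
    "fallback_rate_pct": 2.1,
}
print(unhealthy_metrics(snapshot))  # ['avg_latency_ms']
```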

Dashboard API Endpoints


# Get dashboard metrics (last 24 hours)
curl -H "Authorization: Bearer $ADMIN_API_KEY" \
  "https://app.olllo.app/api/admin/ai-dashboard?period=24h"
 
# Get metrics for specific feature
curl -H "Authorization: Bearer $ADMIN_API_KEY" \
  "https://app.olllo.app/api/admin/ai-dashboard?period=7d&feature=reflection"

Alert Thresholds

Critical Alerts (Immediate Response)

Condition	Threshold	Action
Success rate drops	< 95% for 5 minutes	Page on-call, investigate primary provider
All providers failing	100% failure rate	Activate graceful degradation
Safety incident spike	> 5x 7-day baseline	Investigate potential attack or model issue
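
The safety spike condition compares the current incident count with a 7-day baseline. This runbook does not say how the baseline is computed, so the sketch below assumes it is the mean of the previous seven daily counts; the function name and the zero-baseline rule are illustrative choices.

```python
from statistics import mean
from typing import Sequence


def is_safety_spike(daily_counts_last_7_days: Sequence[int], current_count: int,
                    multiplier: float = 5.0) -> bool:
    """True when the current count exceeds `multiplier` times the 7-day mean."""
    if len(daily_counts_last_7_days) != 7:
        raise ValueError("expected exactly 7 daily counts")
    baseline = mean(daily_counts_last_7_days)
    # Assumption: with a zero baseline, any incident at all counts as a spike.
    if baseline == 0:
        return current_count > 0
    return current_count > multiplier * baseline


history = [4, 6, 5, 5, 3, 7, 5]  # mean is 5, so the threshold is 25
print(is_safety_spike(history, 26))  # True
print(is_safety_spike(history, 25))  # False
```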

Warning Alerts (Business Hours)

Condition	Threshold	Action
Elevated fallback rate	> 10% for 1 hour	Check primary provider status
High latency	> 10s avg for 15 min	Review request patterns
Low cache hit rate	< 30% for 1 hour	Verify cache configuration
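
Each threshold above pairs a condition with a duration, so an alert should fire only once the condition has held continuously for that long. A minimal sketch of that check, assuming metrics arrive as `(timestamp, value)` samples; the function names and the sampling interval are hypothetical.

```python
from typing import Callable, Iterable, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, metric value)


def breach_duration(samples: Iterable[Sample], breach: Callable[[float], bool],
                    now: float) -> float:
    """Seconds the metric has been continuously in breach, measured back from `now`."""
    breach_start = None
    for timestamp, value in sorted(samples):
        if timestamp > now:
            break
        if breach(value):
            if breach_start is None:
                breach_start = timestamp
        else:
            breach_start = None  # a healthy sample resets the streak
    return 0.0 if breach_start is None else now - breach_start


def should_alert(samples: Iterable[Sample], breach: Callable[[float], bool],
                 window_seconds: float, now: float) -> bool:
    """True when the breach has lasted at least `window_seconds`."""
    return breach_duration(samples, breach, now) >= window_seconds


# Fallback rate (%) sampled every 15 minutes; the warning fires at > 10% for 1 hour.
fallback_rate = [(0, 4.0), (900, 12.0), (1800, 11.0), (2700, 13.0), (3600, 14.0), (4500, 12.0)]
print(should_alert(fallback_rate, lambda v: v > 10.0, window_seconds=3600, now=4500))  # True
```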

Fallback Behavior

olllo implements automatic fallback to maintain availability when AI providers experience issues.

Model Tier Fallback Chain

Tier	Primary Model	Fallback Model
Fast	claude-3-5-haiku-latest	gpt-4o-mini
Balanced	claude-sonnet-4-5	gpt-4o
Advanced	claude-sonnet-4-5	gpt-4o

When Fallback Activates

Provider Error: Connection timeout, rate limiting, or service unavailable
Retries Exhausted: After 2 retry attempts with exponential backoff
Circuit Open: Provider circuit breaker triggered after repeated failures
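
The retry-then-fallback path can be sketched as follows. This is an illustration, not olllo's implementation: it assumes "2 retry attempts" means one initial call plus two retries, and `ProviderError`, `base_delay`, and the `fallbackReason` field are hypothetical. Only `usedFallback` comes from the telemetry described in this guide. Circuit breaker state is left out for brevity.

```python
import time
from typing import Callable, Dict, Optional


class ProviderError(Exception):
    """Hypothetical error type for timeouts, rate limits, and unavailable services."""


def call_with_fallback(primary: Callable[[], str], fallback: Callable[[], str],
                       max_retries: int = 2, base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> Dict[str, object]:
    """Try the primary once plus `max_retries` retries, then call the fallback."""
    last_error: Optional[ProviderError] = None
    for attempt in range(max_retries + 1):
        try:
            return {"text": primary(), "usedFallback": False, "fallbackReason": None}
        except ProviderError as exc:
            last_error = exc
            if attempt < max_retries:
                # Exponential backoff: base_delay, then 2x base_delay, and so on.
                sleep(base_delay * (2 ** attempt))
    return {"text": fallback(), "usedFallback": True,
            "fallbackReason": f"retries_exhausted: {last_error}"}


def flaky_primary() -> str:
    raise ProviderError("timeout")


delays = []  # record delays instead of actually sleeping
result = call_with_fallback(flaky_primary, lambda: "fallback answer", sleep=delays.append)
print(result["usedFallback"], delays)  # True [0.5, 1.0]
```

Passing `sleep` as a parameter keeps the backoff schedule easy to inspect in tests without waiting.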

User Experience During Fallback

Users see no visible difference in most cases
Response quality may vary slightly between providers
Telemetry records usedFallback: true along with the fallback reason

Graceful Degradation

When AI services are completely unavailable, the system degrades gracefully:

Feature-Specific Fallbacks

Feature	Degraded Behavior
PII Detection	Skip detection, log warning
STAR Extraction	Accept raw text without AI enhancement
Reflection Prompts	Use pre-defined static prompts
Summary Generation	Display “Summary unavailable” message
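
The table maps each feature to one degraded behavior, which lends itself to a single dispatch function. A minimal sketch, assuming hypothetical feature keys, a hypothetical return shape, and placeholder prompt text (the real static prompts are not listed in this guide):

```python
import logging
from typing import Any, Dict

logger = logging.getLogger("ai.degradation")

# Placeholder text only; the actual pre-defined prompts are not part of this guide.
PLACEHOLDER_PROMPTS = ["<static prompt 1>", "<static prompt 2>"]


def degraded_response(feature: str, raw_text: str = "") -> Dict[str, Any]:
    """Return the degraded result for a feature when AI services are unavailable."""
    if feature == "pii_detection":
        logger.warning("PII detection skipped: AI services unavailable")
        return {"degraded": True, "piiChecked": False}
    if feature == "star_extraction":
        # Accept the user's text as written, without AI enhancement.
        return {"degraded": True, "text": raw_text}
    if feature == "reflection_prompts":
        return {"degraded": True, "prompts": list(PLACEHOLDER_PROMPTS)}
    if feature == "summary_generation":
        return {"degraded": True, "summary": "Summary unavailable"}
    raise ValueError(f"no degraded behavior defined for feature: {feature}")


print(degraded_response("summary_generation")["summary"])  # Summary unavailable
```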

User Notifications

Users see contextual messages like:

“AI features are temporarily limited. Your data has been saved.”
“We’ll enhance your entry when AI services are restored.”

Troubleshooting Guide

High Error Rate

Check provider status pages (status.anthropic.com, status.openai.com)
Review error codes in telemetry: errorCode field
Check rate limit status: Are we hitting provider limits?
Verify API keys are valid and not expired
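
When reviewing the errorCode field, a quick tally usually shows whether one failure mode dominates. A minimal sketch, assuming telemetry records are available as dictionaries; the error code values shown are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def top_error_codes(records: Iterable[Dict[str, object]],
                    limit: int = 3) -> List[Tuple[str, int]]:
    """Count errorCode values across telemetry records, most frequent first."""
    codes = Counter(str(r["errorCode"]) for r in records if r.get("errorCode"))
    return codes.most_common(limit)


# Hypothetical records; successful requests carry no errorCode.
records = [
    {"status": "FAILED", "errorCode": "RATE_LIMITED"},
    {"status": "FAILED", "errorCode": "TIMEOUT"},
    {"status": "SUCCESS"},
    {"status": "FAILED", "errorCode": "RATE_LIMITED"},
    {"status": "FAILED", "errorCode": "RATE_LIMITED"},
]
print(top_error_codes(records))  # [('RATE_LIMITED', 3), ('TIMEOUT', 1)]
```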

Elevated Latency

Check streaming vs non-streaming ratio
Review token counts; large requests take longer
Check provider latency (may be a provider-side issue)
Review concurrent request count

Low Cache Hit Rate

Verify cache configuration is correct
Check if request patterns have changed
Review cache key generation logic
Confirm Redis connection is healthy
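
A common cause of a low hit rate is a cache key that changes for requests that are logically identical, for example when parameter order or whitespace differs. The sketch below shows one way to build a stable key. It is not olllo's actual key logic, and the inputs (`feature`, `model`, `prompt`, `params`) are assumptions.

```python
import hashlib
import json
from typing import Any, Dict


def cache_key(feature: str, model: str, prompt: str, params: Dict[str, Any]) -> str:
    """Build a deterministic cache key from the parts of a request that affect its output."""
    payload = {
        "feature": feature,
        "model": model,
        "prompt": " ".join(prompt.split()),  # collapse whitespace differences
        "params": params,
    }
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = cache_key("reflection", "claude-sonnet-4-5", "Summarize  my week",
              {"temperature": 0, "max_tokens": 256})
b = cache_key("reflection", "claude-sonnet-4-5", "Summarize my week",
              {"max_tokens": 256, "temperature": 0})
print(a == b)  # True
```

Anything that varies per request but does not change the output, such as timestamps or correlation IDs, should stay out of the key.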

Safety Incident Spike

Review incident types in dashboard
Check for coordinated abuse patterns
Review affected users/sessions
Consider temporary rate limit adjustments

Request Tracing

Every AI request includes a correlation ID for tracing.

Looking Up a Request


# Get full request trace (replace {correlationId} with the request's correlation ID)
curl -H "Authorization: Bearer $ADMIN_API_KEY" \
  "https://app.olllo.app/api/admin/ai-dashboard/trace/{correlationId}"
 
# Get human-readable format
curl -H "Authorization: Bearer $ADMIN_API_KEY" \
  "https://app.olllo.app/api/admin/ai-dashboard/trace/{correlationId}?format=text"

Trace Information Includes

Request timing and duration
Model used (primary or fallback)
Token usage and cost
Retry attempts with timestamps
Safety incidents, if any
Error details, if the request failed

Incident Response Procedures

Severity 1: Complete AI Outage

Verify: Check provider status pages
Communicate: Post status update
Mitigate: Ensure graceful degradation is active
Monitor: Watch for recovery signals
Resolve: Verify normal operation restored
Postmortem: Document and improve

Severity 2: Partial Degradation

Identify: Which tier/feature is affected?
Assess: Is fallback working correctly?
Monitor: Track error rates and user impact
Escalate: If fallback is also failing, escalate to Severity 1

Severity 3: Performance Issues

Measure: Capture baseline metrics
Analyze: Identify bottleneck (provider, cache, etc.)
Optimize: Apply targeted fixes
Verify: Confirm improvement

Monitoring Thresholds for 99.9% Availability (SC-008)

Target: 99.9% availability, which allows at most 8.76 hours of downtime per year (0.1% of 8,760 hours)
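
The downtime budget follows directly from the target: the allowed unavailable fraction multiplied by the length of the period. A short worked example, where the function name and the 30-day month are illustrative choices:

```python
def downtime_budget_hours(availability_pct: float, period_hours: float) -> float:
    """Maximum downtime, in hours, allowed by an availability target over a period."""
    return round((100.0 - availability_pct) / 100.0 * period_hours, 4)


print(downtime_budget_hours(99.9, 365 * 24))  # 8.76 hours per year
print(downtime_budget_hours(99.9, 30 * 24))   # 0.72 hours (about 43 minutes) per 30 days
```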

Baseline Metrics

Metric	Target	Alert Threshold
Uptime	99.9%	< 99.5% over 1 hour
Error Rate	< 0.1%	> 1% for 5 minutes
P95 Latency	< 5s	> 10s for 10 minutes
Fallback Success	> 99%	< 95% when fallback is active

Availability Calculation


Availability = (Successful Requests / Total Requests) * 100

Where "Successful" includes:
- Direct success (status: SUCCESS)
- Fallback success (usedFallback: true, status: SUCCESS)
- Graceful degradation (user received response, even if limited)
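
The calculation above can be sketched in a few lines. The `status` and `usedFallback` fields come from the telemetry described in this guide; the `degraded` flag is a hypothetical marker for requests answered through graceful degradation.

```python
from typing import Dict, Iterable


def availability_pct(requests: Iterable[Dict[str, object]]) -> float:
    """Availability as the percentage of requests where the user received a response."""
    total = 0
    successful = 0
    for request in requests:
        total += 1
        # Fallback successes also carry status SUCCESS, so one check covers both.
        if request.get("status") == "SUCCESS" or request.get("degraded") is True:
            successful += 1
    if total == 0:
        raise ValueError("no requests in the measurement window")
    return 100.0 * successful / total


sample = (
    [{"status": "SUCCESS", "usedFallback": False}] * 990
    + [{"status": "SUCCESS", "usedFallback": True}] * 6
    + [{"status": "FAILED", "degraded": True}] * 3
    + [{"status": "FAILED"}] * 1
)
print(availability_pct(sample))  # 99.9
```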

Contacts

On-Call: Check PagerDuty rotation
AI Infrastructure: #ai-platform Slack channel
Escalation: engineering-leads@olllo.app