
AI Reliability Operations Guide

This runbook covers monitoring, incident response, and troubleshooting for olllo’s AI reliability infrastructure.


Dashboard Overview

The AI reliability dashboard, served at /api/admin/ai-dashboard, provides real-time visibility into AI operations.

Key Metrics to Monitor

Metric | Description | Healthy Range
Success Rate | Percentage of AI requests completing successfully | > 99%
Avg Latency | Average request duration in milliseconds | < 5000ms
Token Usage | Total tokens consumed across all tiers | Varies by traffic
Cache Hit Rate | Percentage of requests served from cache | > 50%
Fallback Rate | Percentage of requests using fallback models | < 5%
Safety Incidents | Count of blocked/flagged content | < 5x baseline
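
As a rough illustration of how the healthy ranges above could be checked in a script, the following sketch compares one metrics snapshot against them. The dictionary keys and the shape of the snapshot are assumptions made for this example, not the dashboard's actual response format.

```python
# Healthy ranges copied from the table above.
# The metric names used as keys are hypothetical.
HEALTHY_RANGES = {
    "success_rate_pct": lambda v: v > 99.0,
    "avg_latency_ms": lambda v: v < 5000,
    "cache_hit_rate_pct": lambda v: v > 50.0,
    "fallback_rate_pct": lambda v: v < 5.0,
}


def unhealthy_metrics(snapshot: dict, safety_baseline: float) -> list[str]:
    """Return the names of metrics that fall outside their healthy range."""
    failing = [
        name
        for name, is_healthy in HEALTHY_RANGES.items()
        if name in snapshot and not is_healthy(snapshot[name])
    ]
    # Safety incidents are healthy only while below 5x the baseline count.
    incidents = snapshot.get("safety_incidents")
    if incidents is not None and incidents >= 5 * safety_baseline:
        failing.append("safety_incidents")
    return failing
```

Token usage is left out on purpose because its healthy range varies by traffic.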

Dashboard API Endpoints

# Get dashboard metrics (last 24 hours)
curl -H "Authorization: Bearer $ADMIN_API_KEY" \
  "https://app.olllo.app/api/admin/ai-dashboard?period=24h"

# Get metrics for a specific feature
curl -H "Authorization: Bearer $ADMIN_API_KEY" \
  "https://app.olllo.app/api/admin/ai-dashboard?period=7d&feature=reflection"

Alert Thresholds

Critical Alerts (Immediate Response)

Condition | Threshold | Action
Success rate drops | < 95% for 5 minutes | Page on-call, investigate primary provider
All providers failing | 100% failure rate | Activate graceful degradation
Safety incident spike | > 5x 7-day baseline | Investigate potential attack or model issue
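
The safety spike condition depends on how the 7-day baseline is defined, which this guide does not specify. The sketch below assumes the baseline is the mean daily incident count over the previous seven days; the function name and input shape are hypothetical.

```python
from statistics import mean


def is_safety_spike(previous_7_days: list[int], today: int,
                    multiplier: float = 5.0) -> bool:
    """Return True when today's count exceeds `multiplier` times the baseline.

    Assumption: the baseline is the mean of the previous seven daily counts.
    With a baseline of zero, any incident today counts as a spike.
    """
    if len(previous_7_days) != 7:
        raise ValueError("expected exactly 7 daily counts")
    baseline = mean(previous_7_days)
    return today > multiplier * baseline
```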

Warning Alerts (Business Hours)

Condition | Threshold | Action
Elevated fallback rate | > 10% for 1 hour | Check primary provider status
High latency | > 10s avg for 15 min | Review request patterns
Low cache hit rate | < 30% for 1 hour | Verify cache configuration
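
Most of the thresholds in the two tables above pair a value with a duration ("for 5 minutes", "for 1 hour"). One possible way to evaluate such sustained conditions is sketched below. It assumes metric samples arrive as (timestamp in seconds, value) pairs; the rule names and the sampling model are illustrative, not taken from the actual alerting system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AlertRule:
    name: str                          # hypothetical metric name
    breached: Callable[[float], bool]  # True when one sample violates the threshold
    window_seconds: int                # how long the violation must persist


# Thresholds copied from the alert tables above.
RULES = [
    AlertRule("success_rate_pct", lambda v: v < 95.0, 5 * 60),
    AlertRule("fallback_rate_pct", lambda v: v > 10.0, 60 * 60),
    AlertRule("avg_latency_s", lambda v: v > 10.0, 15 * 60),
    AlertRule("cache_hit_rate_pct", lambda v: v < 30.0, 60 * 60),
]


def is_firing(rule: AlertRule, samples: list[tuple[float, float]],
              now: float) -> bool:
    """Fire only if every sample in the trailing window breaches the threshold.

    A sample at or before the window start is required, so a rule cannot
    fire before a full window of history exists.
    """
    window_start = now - rule.window_seconds
    in_window = [v for t, v in samples if window_start <= t <= now]
    has_full_history = any(t <= window_start for t, _ in samples)
    return (has_full_history and bool(in_window)
            and all(rule.breached(v) for v in in_window))
```

Requiring every sample in the window to breach is a conservative choice: a single healthy sample resets the alert, which reduces paging on brief dips.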

Fallback Behavior

olllo implements automatic fallback to maintain availability when AI providers experience issues.

Model Tier Fallback Chain

Tier | Primary Model | Fallback Model
Fast | claude-3-5-haiku-latest | gpt-4o-mini
Balanced | claude-sonnet-4-5 | gpt-4o
Advanced | claude-sonnet-4-5 | gpt-4o
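
For scripts that need the chain programmatically, the table above could be expressed as a simple lookup. The structure and function name below are assumptions for illustration; the application's actual configuration format is not shown in this guide.

```python
# Tier -> (primary model, fallback model), copied from the table above.
FALLBACK_CHAIN = {
    "fast": ("claude-3-5-haiku-latest", "gpt-4o-mini"),
    "balanced": ("claude-sonnet-4-5", "gpt-4o"),
    "advanced": ("claude-sonnet-4-5", "gpt-4o"),
}


def models_for_tier(tier: str) -> list[str]:
    """Return the models to try for a tier, in order of preference."""
    primary, fallback = FALLBACK_CHAIN[tier.lower()]
    return [primary, fallback]
```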

When Fallback Activates

  1. Provider Error: Connection timeout, rate limiting, or service unavailable
  2. Retries Exhausted: After 2 retry attempts with exponential backoff
  3. Circuit Open: Provider circuit breaker triggered after repeated failures
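
The three triggers above can be sketched as follows. This is a minimal illustration, not the actual implementation: `ProviderError`, `CircuitBreaker`, the `fallbackReason` field, the failure threshold, and the base delay are all assumptions. Only the 2 retry attempts, the exponential backoff, and the `usedFallback` flag come from this guide.

```python
import time
from typing import Any, Callable


class ProviderError(Exception):
    """Hypothetical error type: timeout, rate limit, or service unavailable."""


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures (threshold assumed).

    Recovery (half-open probing, timed reset) is omitted to keep the sketch short.
    """

    def __init__(self, failure_threshold: int = 5) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def record(self, success: bool) -> None:
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1


def call_with_fallback(
    primary: Callable[[], Any],
    fallback: Callable[[], Any],
    breaker: CircuitBreaker,
    max_retries: int = 2,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Try the primary provider with retries, then switch to the fallback."""

    def use_fallback(reason: str) -> dict:
        return {"result": fallback(), "usedFallback": True, "fallbackReason": reason}

    if breaker.is_open:
        # Skip the primary entirely while its circuit is open.
        return use_fallback("circuit_open")

    for attempt in range(max_retries + 1):  # initial attempt plus retries
        try:
            result = primary()
        except ProviderError:
            breaker.record(success=False)
            if attempt < max_retries:
                sleep(base_delay * 2 ** attempt)  # 0.5s, then 1.0s by default
        else:
            breaker.record(success=True)
            return {"result": result, "usedFallback": False, "fallbackReason": None}
    return use_fallback("retries_exhausted")
```

Passing `sleep` as a parameter keeps the function testable without real delays.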

User Experience During Fallback

  • Users see no visible difference in most cases
  • Response quality may vary slightly between providers
  • Telemetry records usedFallback: true with reason

Graceful Degradation

When AI services are completely unavailable, the system degrades gracefully:

Feature-Specific Fallbacks

Feature | Degraded Behavior
PII Detection | Skip detection, log warning
STAR Extraction | Accept raw text without AI enhancement
Reflection Prompts | Use pre-defined static prompts
Summary Generation | Display “Summary unavailable” message
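
One way to picture the per-feature behavior in the table above is a dispatch function like the sketch below. The feature identifiers, return shapes, and the static prompt text are placeholders invented for this example.

```python
import logging

logger = logging.getLogger("ai.degradation")  # hypothetical logger name

# Placeholder prompts; the real pre-defined prompts are not listed in this guide.
STATIC_REFLECTION_PROMPTS = [
    "Placeholder prompt 1",
    "Placeholder prompt 2",
]


def degraded_response(feature: str, raw_text: str = "") -> dict:
    """Return the degraded result for a feature while AI services are down."""
    if feature == "pii_detection":
        logger.warning("PII detection skipped: AI services unavailable")
        return {"degraded": True, "detections": None}
    if feature == "star_extraction":
        # Accept the user's text as-is, with no AI enhancement.
        return {"degraded": True, "text": raw_text}
    if feature == "reflection_prompts":
        return {"degraded": True, "prompts": list(STATIC_REFLECTION_PROMPTS)}
    if feature == "summary_generation":
        return {"degraded": True, "message": "Summary unavailable"}
    raise ValueError(f"unknown feature: {feature}")
```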

User Notifications

Users see contextual messages like:

  • “AI features are temporarily limited. Your data has been saved.”
  • “We’ll enhance your entry when AI services are restored.”

Troubleshooting Guide

High Error Rate

  1. Check provider status pages (status.anthropic.com, status.openai.com)
  2. Review error codes in telemetry: errorCode field
  3. Check rate limit status: Are we hitting provider limits?
  4. Verify API keys are valid and not expired

Elevated Latency

  1. Check streaming vs non-streaming ratio
  2. Review token counts: large requests take longer
  3. Check provider latency (may be a provider-side issue)
  4. Review concurrent request count

Low Cache Hit Rate

  1. Verify cache configuration is correct
  2. Check if request patterns have changed
  3. Review cache key generation logic
  4. Confirm Redis connection is healthy

Safety Incident Spike

  1. Review incident types in dashboard
  2. Check for coordinated abuse patterns
  3. Review affected users/sessions
  4. Consider temporary rate limit adjustments

Request Tracing

Every AI request includes a correlation ID for tracing.

Looking Up a Request

# Get full request trace
curl -H "Authorization: Bearer $ADMIN_API_KEY" \
  "https://app.olllo.app/api/admin/ai-dashboard/trace/{correlationId}"

# Get human-readable format
curl -H "Authorization: Bearer $ADMIN_API_KEY" \
  "https://app.olllo.app/api/admin/ai-dashboard/trace/{correlationId}?format=text"
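
When trace lookups are scripted, the correlation ID should be URL-encoded before it is placed in the path. The helper below is a small sketch that builds the two URLs shown above; the function name is hypothetical, and it only constructs the URL without sending a request.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://app.olllo.app/api/admin/ai-dashboard"


def trace_url(correlation_id: str, text_format: bool = False) -> str:
    """Build the trace lookup URL for a correlation ID."""
    # safe="" also encodes "/" so the ID cannot alter the path structure.
    url = f"{BASE_URL}/trace/{quote(correlation_id, safe='')}"
    if text_format:
        url += "?" + urlencode({"format": "text"})
    return url
```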

Trace Information Includes

  • Request timing and duration
  • Model used (primary or fallback)
  • Token usage and cost
  • Retry attempts with timestamps
  • Safety incidents if any
  • Error details if failed

Incident Response Procedures

Severity 1: Complete AI Outage

  1. Verify: Check provider status pages
  2. Communicate: Post status update
  3. Mitigate: Ensure graceful degradation is active
  4. Monitor: Watch for recovery signals
  5. Resolve: Verify normal operation restored
  6. Postmortem: Document and improve

Severity 2: Partial Degradation

  1. Identify: Which tier/feature is affected?
  2. Assess: Is fallback working correctly?
  3. Monitor: Track error rates and user impact
  4. Escalate: If fallback is also failing, escalate to Severity 1

Severity 3: Performance Issues

  1. Measure: Capture baseline metrics
  2. Analyze: Identify bottleneck (provider, cache, etc.)
  3. Optimize: Apply targeted fixes
  4. Verify: Confirm improvement

Monitoring Thresholds for 99.9% Availability (SC-008)

Target: 99.9% availability, which allows at most 8.76 hours of downtime per year (0.1% of 8,760 hours)
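
The downtime budget follows directly from the target. The short sketch below reproduces the yearly figure and can be reused for other periods; for a 30-day month, the same 99.9% target works out to 0.72 hours (43.2 minutes). The function name is illustrative.

```python
def downtime_budget_hours(availability_pct: float, period_hours: float) -> float:
    """Return the maximum downtime, in hours, allowed within a period."""
    return period_hours * (1 - availability_pct / 100)


yearly = downtime_budget_hours(99.9, 365 * 24)   # about 8.76 hours
monthly = downtime_budget_hours(99.9, 30 * 24)   # about 0.72 hours
```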

Baseline Metrics

Metric | Target | Alert Threshold
Uptime | 99.9% | < 99.5% over 1 hour
Error Rate | < 0.1% | > 1% for 5 minutes
P95 Latency | < 5s | > 10s for 10 minutes
Fallback Success | > 99% | < 95% when fallback active

Availability Calculation

Availability = (Successful Requests / Total Requests) * 100

Where "Successful" includes:
  - Direct success (status: SUCCESS)
  - Fallback success (usedFallback: true, status: SUCCESS)
  - Graceful degradation (user received response, even if limited)
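
As a worked illustration of the formula above, the sketch below computes availability from a list of request records. The `status` and `usedFallback` fields appear in this guide; the `degraded` flag is a hypothetical marker for requests answered through graceful degradation.

```python
def availability_pct(requests: list[dict]) -> float:
    """Compute availability as the percentage of requests counted as successful."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    successful = sum(
        1
        for r in requests
        # Direct and fallback successes both carry status SUCCESS;
        # "degraded" (assumed field) marks a limited but delivered response.
        if r.get("status") == "SUCCESS" or r.get("degraded", False)
    )
    return successful / len(requests) * 100
```

Under this definition, a request served by a fallback model or by a degraded path does not count against availability; only requests where the user received no response do.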

Contacts
