Email Delivery Operations Guide
This runbook covers monitoring, incident response, and troubleshooting for olllo’s email queue and delivery infrastructure.
Architecture Overview
olllo uses a queue-based email delivery system to handle Resend API rate limits and ensure reliable delivery.
Components
| Component | Purpose | Location |
|---|---|---|
| Email Queue | Database-backed queue for pending emails | packages/email/services/email-queue.ts |
| Batch Sender | Sends up to 100 emails per Resend API call | packages/email/lib/batch-sender.ts |
| Cron Processor | Processes queue every minute | /api/cron/process-email-queue |
| Cleanup Job | Removes expired items (7-day retention) | /api/cron/ai-telemetry-cleanup |
Flow
┌────────────────┐     ┌─────────────────┐     ┌──────────────────┐
│ Trigger        │────▶│ queueEmail()    │────▶│ EmailQueueItem   │
│ (notification) │     │ (deduplication) │     │ (PENDING)        │
└────────────────┘     └─────────────────┘     └──────────────────┘
                                                        │
        ┌───────────────────────────────────────────────┘
        ▼ (every 1 minute)
┌──────────────────┐     ┌─────────────────┐     ┌──────────────────┐
│ processQueue()   │────▶│ sendBatch()     │────▶│ Resend API       │
│ (100 per batch)  │     │ (batch.send)    │     │ (batch endpoint) │
└──────────────────┘     └─────────────────┘     └──────────────────┘
        │
        └──▶ Update status: SENT, RETRYING, or FAILED
Queue Status Reference
| Status | Description | Retention |
|---|---|---|
| PENDING | Waiting to be processed | Until processed |
| PROCESSING | Currently being sent | Transient |
| RETRYING | Failed, scheduled for retry | Until max retries |
| SENT | Successfully delivered | 7 days |
| FAILED | Permanently failed | 7 days |
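The status lifecycle can be sketched as a small transition map. The status names come from the table above; the map itself is an assumption inferred from the flow diagram and the manual reset procedure, not a copy of the logic in email-queue.ts, and `canTransition` and `isTerminal` are hypothetical helper names.

```typescript
// Status names are from the Queue Status Reference table.
type EmailQueueStatus = "PENDING" | "PROCESSING" | "RETRYING" | "SENT" | "FAILED";

// ASSUMPTION: transitions inferred from the flow diagram, not from email-queue.ts.
const ALLOWED_TRANSITIONS: Record<EmailQueueStatus, readonly EmailQueueStatus[]> = {
  PENDING: ["PROCESSING"],
  // PROCESSING -> PENDING is only expected via the manual "Reset Stuck Items" query.
  PROCESSING: ["SENT", "RETRYING", "FAILED", "PENDING"],
  RETRYING: ["PROCESSING"],
  SENT: [],
  FAILED: [],
};

// True when moving an item from one status to another is expected.
function canTransition(from: EmailQueueStatus, to: EmailQueueStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}

// SENT and FAILED have no outgoing transitions; they wait for the 7-day cleanup.
function isTerminal(status: EmailQueueStatus): boolean {
  return ALLOWED_TRANSITIONS[status].length === 0;
}
```

A map like this is mainly useful when reading logs: a row that appears to have moved from SENT back to PENDING, for example, would point to a manual edit rather than normal processing.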
Monitoring
Queue Stats Endpoint
# Get current queue statistics
curl -H "Authorization: Bearer $ADMIN_API_KEY" \
  "https://app.olllo.app/api/admin/email-queue/stats"
Response:
{
  "pending": 12,
  "processing": 0,
  "retrying": 2,
  "sentLast24h": 458,
  "failedLast24h": 3
}
Key Metrics to Monitor
| Metric | Description | Healthy Range |
|---|---|---|
| Pending Count | Emails waiting to be sent | < 500 |
| Processing Count | Currently being sent | < 100 |
| Retrying Count | Failed, pending retry | < 50 |
| Sent (24h) | Successfully delivered | Varies by traffic |
| Failed (24h) | Permanently failed | < 1% of sent |
Alert Thresholds
| Condition | Threshold | Action |
|---|---|---|
| High pending count | > 1000 for 5 minutes | Check cron job running |
| High retry count | > 100 | Check Resend API status |
| Elevated failures | > 5% of sent | Investigate error messages |
| Processing stuck | > 100 for 10 minutes | Check for cron timeout |
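The thresholds above can be checked against a single stats response with a sketch like the following. The `QueueStats` shape mirrors the sample response from the stats endpoint; `evaluateAlerts` and the `Alert` type are hypothetical names introduced for illustration. The duration parts of the conditions ("for 5 minutes", "for 10 minutes") are not modelled here, because they need several consecutive samples rather than one snapshot.

```typescript
// Shape of the stats endpoint response shown earlier in this guide.
interface QueueStats {
  pending: number;
  processing: number;
  retrying: number;
  sentLast24h: number;
  failedLast24h: number;
}

// Hypothetical alert record: the condition and action text come from the table.
interface Alert {
  condition: string;
  action: string;
}

// Checks one snapshot against the numeric thresholds. Durations are NOT checked.
function evaluateAlerts(stats: QueueStats): Alert[] {
  const alerts: Alert[] = [];
  if (stats.pending > 1000) {
    alerts.push({ condition: "High pending count", action: "Check cron job running" });
  }
  if (stats.retrying > 100) {
    alerts.push({ condition: "High retry count", action: "Check Resend API status" });
  }
  // ASSUMPTION: failures with zero sent emails count as an elevated failure rate.
  const failureRatio =
    stats.sentLast24h > 0
      ? stats.failedLast24h / stats.sentLast24h
      : stats.failedLast24h > 0
        ? Infinity
        : 0;
  if (failureRatio > 0.05) {
    alerts.push({ condition: "Elevated failures", action: "Investigate error messages" });
  }
  if (stats.processing > 100) {
    alerts.push({ condition: "Processing stuck", action: "Check for cron timeout" });
  }
  return alerts;
}
```

Run against the sample response (3 failures out of 458 sent, roughly 0.7%), this returns no alerts.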
Troubleshooting Guide
Emails Stuck in PENDING
Symptoms: Pending count growing, no emails being delivered
Diagnosis:
- Check cron job is running: Vercel Dashboard → Cron Jobs
- Verify CRON_SECRET is configured
- Check process-email-queue logs for errors
Resolution:
# Manually trigger queue processing
curl -X POST \
  -H "Authorization: Bearer $CRON_SECRET" \
  "https://app.olllo.app/api/cron/process-email-queue"
High Failure Rate
Symptoms: Many emails in FAILED status, elevated error count
Diagnosis:
- Check Resend dashboard for API errors
- Review error messages in failed items:
SELECT id, "userId", "errorMessage", "retryCount", "createdAt"
FROM "EmailQueueItem"
WHERE status = 'FAILED'
ORDER BY "createdAt" DESC
LIMIT 20;
Common Causes:
- Invalid email addresses
- Resend API key issues
- Domain verification problems
- Rate limit exceeded (429)
Resolution:
- For rate limits: System auto-retries with exponential backoff
- For invalid addresses: Clean user data
- For API issues: Check Resend status page
Rate Limiting (429 Errors)
Symptoms: Emails going to RETRYING status, rate limit errors in logs
How the System Handles It:
- Batch sender detects 429 response
- Marks items as RETRYING with exponential backoff
- Backoff delay: 500ms → 1s → 2s → 4s → … (max 30s)
- Jitter applied (75%-125%) to prevent thundering herd
Monitoring:
-- Check retry distribution
SELECT
  "retryCount",
  COUNT(*) AS count
FROM "EmailQueueItem"
WHERE status = 'RETRYING'
GROUP BY "retryCount";
Duplicate Emails Not Being Prevented
Symptoms: Users receiving duplicate emails
Diagnosis:
- Check if emails have same userId and content
- Verify email hash is being generated correctly
How Deduplication Works:
- SHA256 hash of from + to + subject + html
- Only first 16 characters used
- Checks for existing PENDING/PROCESSING/RETRYING items
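A minimal sketch of the hash step described above, using Node's built-in crypto module. Two details are assumptions, since the source does not state them: the four fields are concatenated directly with no separator, and the 16 characters are taken from the hex encoding of the digest. `emailHash` and `EmailContent` are hypothetical names, not the identifiers used in email-queue.ts.

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shape: the four fields the guide lists as hash inputs.
interface EmailContent {
  from: string;
  to: string;
  subject: string;
  html: string;
}

// ASSUMPTIONS: plain concatenation without a separator, hex digest encoding.
function emailHash(email: EmailContent): string {
  return createHash("sha256")
    .update(email.from + email.to + email.subject + email.html, "utf8")
    .digest("hex")
    .slice(0, 16); // keep only the first 16 characters
}
```

Under the hex assumption, 16 characters keep 64 bits of the digest, which is consistent with collisions being treated as extremely rare. When diagnosing duplicates, any difference in the rendered html (a timestamp or tracking parameter, for instance) produces a different hash, so two emails that look identical to the user may not be deduplicated.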
Resolution:
- If legitimate duplicates: Review trigger code
- If hash collision: Extremely rare, monitor
Retry Logic
Exponential Backoff Configuration
| Retry # | Base Delay | With Jitter (75%-125%) |
|---|---|---|
| 1 | 500ms | 375ms - 625ms |
| 2 | 1s | 750ms - 1.25s |
| 3 | 2s | 1.5s - 2.5s |
| 4 | 4s | 3s - 5s |
| 5 | 8s | 6s - 10s |
| 6 | 16s | 12s - 20s |
| 7+ | 30s (cap) | 22.5s - 37.5s |
Max Retry Configuration
Default: 5 retries
Configurable per email via maxRetries parameter
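The backoff schedule and retry limit can be sketched as follows. This assumes pure doubling from a 500ms base, with the 30s cap applied before jitter (which matches the jittered range listed for the capped row); the real constants and function names in batch-sender.ts may differ. `baseDelayMs`, `delayWithJitterMs`, and `nextStatus` are hypothetical names, and the meaning of `retryCount` (retries already attempted) is also an assumption.

```typescript
const BASE_DELAY_MS = 500;      // first retry delay, from the table above
const MAX_DELAY_MS = 30_000;    // cap, from the table above
const DEFAULT_MAX_RETRIES = 5;  // default from this guide

// Base delay for a 1-based retry number: 500ms doubled each retry, capped at 30s.
function baseDelayMs(retryNumber: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** (retryNumber - 1), MAX_DELAY_MS);
}

// Applies a jitter factor between 0.75 and 1.25 to the (already capped) base delay.
// The random source is injectable so the bounds can be checked deterministically.
function delayWithJitterMs(retryNumber: number, random: () => number = Math.random): number {
  const jitterFactor = 0.75 + random() * 0.5;
  return Math.round(baseDelayMs(retryNumber) * jitterFactor);
}

// ASSUMPTION: retryCount is the number of retries already attempted.
function nextStatus(retryCount: number, maxRetries: number = DEFAULT_MAX_RETRIES): "RETRYING" | "FAILED" {
  return retryCount < maxRetries ? "RETRYING" : "FAILED";
}
```

With the default limit, an item that keeps failing accumulates base delays of 0.5s + 1s + 2s + 4s + 8s = 15.5s before it is marked FAILED, not counting the wait for the next cron run.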
Partial Batch Failures
When a batch partially fails:
- Successful emails: Marked as SENT
- Failed emails: Individually marked as RETRYING
- No retry of entire batch (only failed items)
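The per-item handling above can be illustrated with a small mapping function. `SendOutcome` is a hypothetical, simplified result shape introduced for this sketch; it does not represent the actual Resend batch response, and `resolveBatch` is not the name used in batch-sender.ts.

```typescript
// Hypothetical per-item outcome after a batch send attempt.
type SendOutcome =
  | { id: string; ok: true }
  | { id: string; ok: false; error: string };

// Status change to apply to one queue item.
interface StatusUpdate {
  id: string;
  status: "SENT" | "RETRYING";
  errorMessage?: string;
}

// Maps each outcome to its own status update: successes become SENT,
// failures become RETRYING individually. The batch is never retried as a whole.
function resolveBatch(outcomes: SendOutcome[]): StatusUpdate[] {
  return outcomes.map((outcome): StatusUpdate =>
    outcome.ok
      ? { id: outcome.id, status: "SENT" }
      : { id: outcome.id, status: "RETRYING", errorMessage: outcome.error },
  );
}
```

This sketch leaves out the retry limit: an item that has exhausted its retries would presumably move to FAILED rather than RETRYING.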
Manual Operations
Cancel a Pending Email
import { cancelEmail } from "@repo/email";
const result = await cancelEmail("queue-item-id");
// { success: true } or { success: false, error: "Cannot cancel..." }
Only PENDING emails can be cancelled. PROCESSING/RETRYING/SENT/FAILED cannot be cancelled.
Force Process Queue
# Trigger immediate queue processing
curl -X POST \
  -H "Authorization: Bearer $CRON_SECRET" \
  "https://app.olllo.app/api/cron/process-email-queue"
Manual Cleanup
Cleanup runs automatically during daily ai-telemetry-cleanup cron.
To manually trigger:
curl -X POST \
  -H "Authorization: Bearer $CRON_SECRET" \
  "https://app.olllo.app/api/cron/ai-telemetry-cleanup"
Database Queries
Check Queue Health
-- Queue status distribution
SELECT status, COUNT(*) as count
FROM "EmailQueueItem"
GROUP BY status;
-- Recent failures with errors
SELECT id, "userId", "errorMessage", "retryCount", "createdAt"
FROM "EmailQueueItem"
WHERE status = 'FAILED'
ORDER BY "createdAt" DESC
LIMIT 20;
-- Stuck processing items (older than 5 minutes)
SELECT id, "userId", "batchId", "updatedAt"
FROM "EmailQueueItem"
WHERE status = 'PROCESSING'
AND "updatedAt" < NOW() - INTERVAL '5 minutes';
Reset Stuck Items
If items are stuck in PROCESSING (cron crashed mid-batch):
-- Reset stuck processing items to PENDING
UPDATE "EmailQueueItem"
SET status = 'PENDING', "batchId" = NULL
WHERE status = 'PROCESSING'
AND "updatedAt" < NOW() - INTERVAL '10 minutes';
Incident Response
Severity 1: No Emails Being Delivered
- Verify: Check Resend status page
- Mitigate: Emails queue up, will deliver when resolved
- Monitor: Watch pending count growth rate
- Communicate: Notify users if extended outage
- Resolve: System auto-recovers when Resend is back
Severity 2: Elevated Failure Rate
- Identify: Check error messages in failed items
- Assess: Is it specific email types or all?
- Fix: Address root cause (invalid addresses, etc.)
- Recover: FAILED items are not retried automatically; re-trigger them manually if needed
Severity 3: Queue Backlog
- Measure: Check pending count growth rate
- Analyze: Is cron running? Is batch size optimal?
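- Optimize: Consider temporarily raising the number of batches per run (DEFAULT_MAX_BATCHES); MAX_BATCH_SIZE already matches the 100-email limit per Resend batch call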
- Monitor: Verify backlog clearing
Configuration Reference
Environment Variables
| Variable | Description | Default |
|---|---|---|
| RESEND_API_KEY | Resend API key (starts with re_) | Required |
| RESEND_FROM | Default sender address (use friendly name format: olllo <hi@notifications.olllo.ai>) | Required |
| CRON_SECRET | Secret for cron authentication | Required |
Constants (in code)
| Constant | Value | Location |
|---|---|---|
| MAX_BATCH_SIZE | 100 | email-queue.ts |
| DEFAULT_MAX_BATCHES | 5 | email-queue.ts |
| QUEUE_EXPIRATION_DAYS | 7 | email-queue.ts |
| DEFAULT_MAX_RETRIES | 5 | email-queue.ts |
| INTER_BATCH_DELAY_MS | 500 | email-queue.ts |
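These constants imply a rough throughput ceiling, which helps when judging how long a backlog will take to clear. The sketch below assumes that each cron run sends at most DEFAULT_MAX_BATCHES batches of MAX_BATCH_SIZE emails and that one run completes per minute; both are inferred from the constant names and the cron schedule rather than confirmed from email-queue.ts. `estimateDrainMinutes` is a hypothetical helper.

```typescript
const MAX_BATCH_SIZE = 100;     // from the constants table
const DEFAULT_MAX_BATCHES = 5;  // from the constants table

// ASSUMPTION: one cron run per minute sends at most this many emails.
const maxEmailsPerRun = MAX_BATCH_SIZE * DEFAULT_MAX_BATCHES; // 500

// Estimated minutes to clear a backlog, given how many new emails
// are queued per minute while it drains. Returns Infinity when new
// emails arrive at least as fast as they can be sent.
function estimateDrainMinutes(pending: number, inflowPerMinute: number = 0): number {
  const netPerMinute = maxEmailsPerRun - inflowPerMinute;
  if (netPerMinute <= 0) return Infinity;
  return Math.ceil(pending / netPerMinute);
}
```

Under these assumptions, a backlog of 1,000 pending emails with no new traffic would clear in about 2 cron runs, while sustained inflow of 500 or more emails per minute would never drain without raising the per-run limit.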
Cron Schedule
| Job | Schedule | Purpose |
|---|---|---|
| process-email-queue | Every 1 minute | Process pending emails |
| ai-telemetry-cleanup | Daily at 3 AM | Cleanup expired items |
Contacts
- On-Call: Check PagerDuty rotation
- Email Infrastructure: #email-platform Slack channel
- Escalation: engineering-leads@olllo.app
- Resend Status: status.resend.com