Email Delivery Operations Guide

This runbook covers monitoring, incident response, and troubleshooting for olllo’s email queue and delivery infrastructure.


Architecture Overview

olllo uses a queue-based email delivery system to handle Resend API rate limits and ensure reliable delivery.

Components

| Component | Purpose | Location |
|---|---|---|
| Email Queue | Database-backed queue for pending emails | packages/email/services/email-queue.ts |
| Batch Sender | Sends up to 100 emails per Resend API call | packages/email/lib/batch-sender.ts |
| Cron Processor | Processes queue every minute | /api/cron/process-email-queue |
| Cleanup Job | Removes expired items (7-day retention) | /api/cron/ai-telemetry-cleanup |

Flow

    ┌────────────────┐     ┌─────────────────┐     ┌──────────────────┐
    │ Trigger        │────▶│ queueEmail()    │────▶│ EmailQueueItem   │
    │ (notification) │     │ (deduplication) │     │ (PENDING)        │
    └────────────────┘     └─────────────────┘     └──────────────────┘
            ┌───────────────────────────────────────────────┘
            ▼ (every 1 minute)
    ┌──────────────────┐     ┌─────────────────┐     ┌──────────────────┐
    │ processQueue()   │────▶│ sendBatch()     │────▶│ Resend API       │
    │ (100 per batch)  │     │ (batch.send)    │     │ (batch endpoint) │
    └──────────────────┘     └─────────────────┘     └──────────────────┘
            └──▶ Update status: SENT, RETRYING, or FAILED

Queue Status Reference

| Status | Description | Retention |
|---|---|---|
| PENDING | Waiting to be processed | Until processed |
| PROCESSING | Currently being sent | Transient |
| RETRYING | Failed, scheduled for retry | Until max retries |
| SENT | Successfully delivered | 7 days |
| FAILED | Permanently failed | 7 days |
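
The status lifecycle above can be read as a small state machine. The sketch below is an illustration only: the `EmailQueueStatus` type, the `ALLOWED_TRANSITIONS` map, and both helper functions are hypothetical names, and the transition set is inferred from the flow diagram and the manual reset procedure in this guide rather than taken from the codebase.

```typescript
// Statuses as listed in the Queue Status Reference table.
type EmailQueueStatus = "PENDING" | "PROCESSING" | "RETRYING" | "SENT" | "FAILED";

// ASSUMPTION: transitions inferred from this guide, not read from the source code.
const ALLOWED_TRANSITIONS: Record<EmailQueueStatus, readonly EmailQueueStatus[]> = {
  PENDING: ["PROCESSING"],
  // PROCESSING -> PENDING covers the manual "Reset Stuck Items" operation.
  PROCESSING: ["SENT", "RETRYING", "FAILED", "PENDING"],
  RETRYING: ["PROCESSING"],
  SENT: [],
  FAILED: [],
};

// True when moving from one status to another is allowed by the map above.
function canTransition(from: EmailQueueStatus, to: EmailQueueStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

// Terminal statuses have no outgoing transitions; they stay until cleanup removes them.
function isTerminal(status: EmailQueueStatus): boolean {
  return ALLOWED_TRANSITIONS[status].length === 0;
}
```

In this reading, SENT and FAILED are the only terminal statuses, which matches the table: both are kept for 7 days and then removed by the cleanup job.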

Monitoring

Queue Stats Endpoint

    # Get current queue statistics
    curl -H "Authorization: Bearer $ADMIN_API_KEY" \
      "https://app.olllo.app/api/admin/email-queue/stats"

Response:

    {
      "pending": 12,
      "processing": 0,
      "retrying": 2,
      "sentLast24h": 458,
      "failedLast24h": 3
    }

Key Metrics to Monitor

| Metric | Description | Healthy Range |
|---|---|---|
| Pending Count | Emails waiting to be sent | < 500 |
| Processing Count | Currently being sent | < 100 |
| Retrying Count | Failed, pending retry | < 50 |
| Sent (24h) | Successfully delivered | Varies by traffic |
| Failed (24h) | Permanently failed | < 1% of sent |

Alert Thresholds

| Condition | Threshold | Action |
|---|---|---|
| High pending count | > 1000 for 5 minutes | Check cron job running |
| High retry count | > 100 | Check Resend API status |
| Elevated failures | > 5% of sent | Investigate error messages |
| Processing stuck | > 100 for 10 minutes | Check for cron timeout |
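
As a rough illustration of how these thresholds could be applied to a single response from the stats endpoint, here is a minimal sketch. The `QueueStats` interface mirrors the JSON fields shown earlier; `evaluateAlerts` is a hypothetical helper, not part of the codebase. One snapshot cannot express the "for 5 minutes" and "for 10 minutes" conditions, so a real monitor would need to see the same breach across consecutive samples before paging.

```typescript
// Shape of the stats endpoint response, as shown in this guide.
interface QueueStats {
  pending: number;
  processing: number;
  retrying: number;
  sentLast24h: number;
  failedLast24h: number;
}

// Hypothetical helper: returns the names of alert conditions breached by one snapshot.
// Duration-based conditions are only checked for their numeric threshold here.
function evaluateAlerts(stats: QueueStats): string[] {
  const alerts: string[] = [];

  if (stats.pending > 1000) alerts.push("High pending count");
  if (stats.retrying > 100) alerts.push("High retry count");

  // Avoid dividing by zero: failures with nothing sent count as a breach.
  const failureRatio =
    stats.sentLast24h > 0
      ? stats.failedLast24h / stats.sentLast24h
      : stats.failedLast24h > 0
        ? Number.POSITIVE_INFINITY
        : 0;
  if (failureRatio > 0.05) alerts.push("Elevated failures");

  if (stats.processing > 100) alerts.push("Processing stuck");

  return alerts;
}
```

With the example response above (3 failures against 458 sent, roughly 0.65%), no condition is breached and the function returns an empty list.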

Troubleshooting Guide

Emails Stuck in PENDING

Symptoms: Pending count growing, no emails being delivered

Diagnosis:

  1. Check cron job is running: Vercel Dashboard → Cron Jobs
  2. Verify CRON_SECRET is configured
  3. Check process-email-queue logs for errors

Resolution:

    # Manually trigger queue processing
    curl -X POST \
      -H "Authorization: Bearer $CRON_SECRET" \
      "https://app.olllo.app/api/cron/process-email-queue"

High Failure Rate

Symptoms: Many emails in FAILED status, elevated error count

Diagnosis:

  1. Check Resend dashboard for API errors
  2. Review error messages in failed items:
    -- camelCase column names must be double-quoted in PostgreSQL
    SELECT id, "userId", "errorMessage", "retryCount", "createdAt"
    FROM "EmailQueueItem"
    WHERE status = 'FAILED'
    ORDER BY "createdAt" DESC
    LIMIT 20;

Common Causes:

  • Invalid email addresses
  • Resend API key issues
  • Domain verification problems
  • Rate limit exceeded (429)

Resolution:

  • For rate limits: System auto-retries with exponential backoff
  • For invalid addresses: Clean user data
  • For API issues: Check Resend status page

Rate Limiting (429 Errors)

Symptoms: Emails going to RETRYING status, rate limit errors in logs

How the System Handles It:

  1. Batch sender detects 429 response
  2. Marks items as RETRYING with exponential backoff
  3. Backoff delay: 500ms → 1s → 2s → 4s → … (max 30s)
  4. Jitter applied (75%-125%) to prevent thundering herd
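
The delay calculation described in steps 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the batch sender's actual code: `backoffDelayMs`, `BASE_DELAY_MS`, and `MAX_DELAY_MS` are hypothetical names, and the sketch assumes plain doubling from 500ms with the 30s cap applied before jitter. The injectable `random` parameter exists only to make the function testable.

```typescript
// ASSUMPTION: names and structure are illustrative; values come from this guide.
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 30000;

// Delay before the given retry (1-based), with 75%-125% jitter applied.
function backoffDelayMs(retryNumber: number, random: () => number = Math.random): number {
  // 500ms, 1s, 2s, 4s, ... doubling per retry.
  const exponential = BASE_DELAY_MS * Math.pow(2, retryNumber - 1);
  // Cap the base delay at 30s before applying jitter.
  const capped = Math.min(exponential, MAX_DELAY_MS);
  // random() is in [0, 1), so the factor is in [0.75, 1.25).
  const jitterFactor = 0.75 + random() * 0.5;
  return Math.round(capped * jitterFactor);
}
```

The jitter spreads retries from the same failed batch across a window, so they do not all hit the Resend API at the same instant. Note that plain doubling reaches 16s on the sixth retry and only hits the 30s cap on the seventh, so the real implementation may cap earlier than this sketch does.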

Monitoring:

    -- Check retry distribution
    SELECT "retryCount", COUNT(*) AS count
    FROM "EmailQueueItem"
    WHERE status = 'RETRYING'
    GROUP BY "retryCount";

Duplicate Emails Not Being Prevented

Symptoms: Users receiving duplicate emails

Diagnosis:

  1. Check if emails have same userId and content
  2. Verify email hash is being generated correctly

How Deduplication Works:

  • SHA256 hash of from + to + subject + html
  • Only first 16 characters used
  • Checks for existing PENDING/PROCESSING/RETRYING items
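
To make the deduplication rules concrete, here is a minimal sketch using Node's built-in `crypto` module. It assumes the four fields are concatenated with no separator and that the digest is hex-encoded before truncation; neither detail is confirmed by this guide. `emailDedupHash`, `isDuplicate`, and the `emailHash` field name are hypothetical.

```typescript
import { createHash } from "node:crypto";

interface EmailContent {
  from: string;
  to: string;
  subject: string;
  html: string;
}

// ASSUMPTION: plain concatenation and hex encoding; the real queue may differ.
function emailDedupHash(email: EmailContent): string {
  return createHash("sha256")
    .update(email.from + email.to + email.subject + email.html)
    .digest("hex")
    .slice(0, 16); // only the first 16 characters are kept
}

// Statuses that count as "already queued" for deduplication purposes.
const ACTIVE_STATUSES = ["PENDING", "PROCESSING", "RETRYING"];

// True when an item with the same hash is still active in the queue.
function isDuplicate(hash: string, existing: { emailHash: string; status: string }[]): boolean {
  return existing.some(
    (item) => item.emailHash === hash && ACTIVE_STATUSES.indexOf(item.status) !== -1
  );
}
```

Because only active statuses are checked, an identical email queued after the first copy has reached SENT is not treated as a duplicate. That is worth ruling out before suspecting a hashing problem.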

Resolution:

  • If legitimate duplicates: Review trigger code
  • If hash collision: Extremely rare, monitor

Retry Logic

Exponential Backoff Configuration

| Retry # | Base Delay | With Jitter (75%-125%) |
|---|---|---|
| 1 | 500ms | 375ms - 625ms |
| 2 | 1s | 750ms - 1.25s |
| 3 | 2s | 1.5s - 2.5s |
| 4 | 4s | 3s - 5s |
| 5 | 8s | 6s - 10s |
| 6+ | 30s (cap) | 22.5s - 37.5s |

Max Retry Configuration

Default: 5 retries. Configurable per email via the maxRetries parameter.

Partial Batch Failures

When a batch partially fails:

  • Successful emails: Marked as SENT
  • Failed emails: Individually marked as RETRYING
  • No retry of entire batch (only failed items)
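
A sketch of how per-item outcomes might be resolved after a partially failed batch is shown below. All names (`BatchItem`, `SendResult`, `StatusUpdate`, `resolveBatchOutcome`) are hypothetical. The sketch also assumes that an item which has already used all of its retries moves to FAILED instead of RETRYING, which this guide implies but does not spell out.

```typescript
type FinalStatus = "SENT" | "RETRYING" | "FAILED";

interface BatchItem { id: string; retryCount: number; maxRetries: number; }
interface SendResult { id: string; ok: boolean; error?: string; }
interface StatusUpdate { id: string; status: FinalStatus; retryCount: number; errorMessage?: string; }

// Hypothetical helper: maps each item in a batch to its next status independently,
// so one failed email never forces the whole batch to be resent.
function resolveBatchOutcome(items: BatchItem[], results: SendResult[]): StatusUpdate[] {
  const resultById: Record<string, SendResult | undefined> = {};
  for (const result of results) resultById[result.id] = result;

  return items.map((item): StatusUpdate => {
    const result = resultById[item.id];
    if (result && result.ok) {
      return { id: item.id, status: "SENT", retryCount: item.retryCount };
    }
    const errorMessage = result && result.error ? result.error : "No result returned for this item";
    // ASSUMPTION: once retries are exhausted, the item fails permanently.
    if (item.retryCount >= item.maxRetries) {
      return { id: item.id, status: "FAILED", retryCount: item.retryCount, errorMessage };
    }
    return { id: item.id, status: "RETRYING", retryCount: item.retryCount + 1, errorMessage };
  });
}
```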

Manual Operations

Cancel a Pending Email

    import { cancelEmail } from "@repo/email";

    const result = await cancelEmail("queue-item-id");
    // { success: true } or { success: false, error: "Cannot cancel..." }

Only PENDING emails can be cancelled. PROCESSING/RETRYING/SENT/FAILED cannot be cancelled.

Force Process Queue

    # Trigger immediate queue processing
    curl -X POST \
      -H "Authorization: Bearer $CRON_SECRET" \
      "https://app.olllo.app/api/cron/process-email-queue"

Manual Cleanup

Cleanup runs automatically as part of the daily ai-telemetry-cleanup cron job.

To manually trigger:

    curl -X POST \
      -H "Authorization: Bearer $CRON_SECRET" \
      "https://app.olllo.app/api/cron/ai-telemetry-cleanup"

Database Queries

Check Queue Health

    -- Queue status distribution
    SELECT status, COUNT(*) AS count
    FROM "EmailQueueItem"
    GROUP BY status;

    -- Recent failures with errors
    SELECT id, "userId", "errorMessage", "retryCount", "createdAt"
    FROM "EmailQueueItem"
    WHERE status = 'FAILED'
    ORDER BY "createdAt" DESC
    LIMIT 20;

    -- Stuck processing items (older than 5 minutes)
    SELECT id, "userId", "batchId", "updatedAt"
    FROM "EmailQueueItem"
    WHERE status = 'PROCESSING'
      AND "updatedAt" < NOW() - INTERVAL '5 minutes';

Reset Stuck Items

If items are stuck in PROCESSING (cron crashed mid-batch):

    -- Reset stuck processing items to PENDING
    UPDATE "EmailQueueItem"
    SET status = 'PENDING', "batchId" = NULL
    WHERE status = 'PROCESSING'
      AND "updatedAt" < NOW() - INTERVAL '10 minutes';

Incident Response

Severity 1: No Emails Being Delivered

  1. Verify: Check Resend status page
  2. Mitigate: Emails queue up, will deliver when resolved
  3. Monitor: Watch pending count growth rate
  4. Communicate: Notify users if extended outage
  5. Resolve: System auto-recovers when Resend is back

Severity 2: Elevated Failure Rate

  1. Identify: Check error messages in failed items
  2. Assess: Is it specific email types or all?
  3. Fix: Address root cause (invalid addresses, etc.)
  4. Recover: Failed items not auto-retried; may need manual re-trigger

Severity 3: Queue Backlog

  1. Measure: Check pending count growth rate
  2. Analyze: Is cron running? Is batch size optimal?
  3. Optimize: Consider temporary batch size increase
  4. Monitor: Verify backlog clearing

Configuration Reference

Environment Variables

| Variable | Description | Default |
|---|---|---|
| RESEND_API_KEY | Resend API key (starts with re_) | Required |
| RESEND_FROM | Default sender address (use friendly name format: olllo <hi@notifications.olllo.ai>) | Required |
| CRON_SECRET | Secret for cron authentication | Required |

Constants (in code)

| Constant | Value | Location |
|---|---|---|
| MAX_BATCH_SIZE | 100 | email-queue.ts |
| DEFAULT_MAX_BATCHES | 5 | email-queue.ts |
| QUEUE_EXPIRATION_DAYS | 7 | email-queue.ts |
| DEFAULT_MAX_RETRIES | 5 | email-queue.ts |
| INTER_BATCH_DELAY_MS | 500 | email-queue.ts |
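
Taken together, these constants bound how quickly a backlog can drain. The sketch below assumes that each cron run sends at most DEFAULT_MAX_BATCHES batches of MAX_BATCH_SIZE emails and that the cron fires once per minute. The per-run limit is an interpretation of the constant names, not something this guide states. `estimateDrainMinutes` and its parameters are hypothetical names.

```typescript
// Values copied from the constants table; the per-run interpretation is an ASSUMPTION.
const MAX_BATCH_SIZE = 100;
const DEFAULT_MAX_BATCHES = 5;
const EMAILS_PER_RUN = MAX_BATCH_SIZE * DEFAULT_MAX_BATCHES; // 500 per one-minute cron run

// Rough number of cron runs (minutes) needed to clear a backlog,
// given a steady inflow of newly queued emails per minute.
function estimateDrainMinutes(pending: number, inflowPerMinute: number = 0): number {
  const netPerMinute = EMAILS_PER_RUN - inflowPerMinute;
  // If emails arrive at least as fast as they are sent, the backlog never clears.
  if (netPerMinute <= 0) return Number.POSITIVE_INFINITY;
  return Math.ceil(pending / netPerMinute);
}
```

Under that assumption a single run handles up to 500 emails, so a backlog of 1,000 pending emails with no new inflow would clear in about two runs. If the estimate comes back infinite, the queue is receiving emails at least as fast as it can send them, and the backlog will keep growing until throughput is raised or inflow drops.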

Cron Schedule

| Job | Schedule | Purpose |
|---|---|---|
| process-email-queue | Every 1 minute | Process pending emails |
| ai-telemetry-cleanup | Daily at 3 AM | Cleanup expired items |

Contacts

  • On-Call: Check PagerDuty rotation
  • Email Infrastructure: #email-platform Slack channel
  • Escalation: engineering-leads@olllo.app
  • Resend Status: status.resend.com