---
title: "Manual Recovery"
description: "Recover stuck observations after worker crashes or restarts"
---

# Manual Recovery Guide

## Overview

Claude-mem's manual recovery system helps you recover observations that get stuck in the processing queue after worker crashes, system restarts, or unexpected shutdowns.

**Key Change in v5.x**: Automatic recovery on worker startup is now disabled. This gives you explicit control over when reprocessing happens, preventing unexpected duplicate observations.

## When Do You Need Manual Recovery?

You should trigger manual recovery when:

- **Worker crashed or restarted** - Observations were queued but the worker stopped before processing them
- **No new summaries appearing** - Observations are being saved but not processed into summaries
- **Stuck messages detected** - Messages showing as "processing" for >5 minutes
- **System crashes** - Unexpected shutdowns left messages in incomplete states

## Quick Start

### Using the CLI Tool (Recommended)

The interactive CLI tool is the safest and easiest way to recover stuck observations:

```bash
# Check status and prompt for recovery
bun scripts/check-pending-queue.ts
```

This will:

1. Check worker health
2. Show queue summary (pending, processing, failed, stuck counts)
3. Display sessions with pending work
4. Prompt you to confirm recovery
5. Show recently processed messages for feedback

### Auto-Process Without Prompts

For scripting or when you're confident recovery is needed:

```bash
# Auto-process without prompting
bun scripts/check-pending-queue.ts --process

# Limit to 5 sessions
bun scripts/check-pending-queue.ts --process --limit 5
```

## Understanding Queue States

Messages progress through these lifecycle states:

1. **pending** → Queued, waiting to process
2. **processing** → Currently being processed by SDK agent
3. **processed** → Completed successfully
4. **failed** → Failed after 3 retry attempts

### Stuck Detection

Messages in `processing` state for **>5 minutes** are considered stuck:

- They're automatically reset to `pending` on worker startup
- They're NOT automatically reprocessed (requires manual trigger)
- They appear in the `stuckCount` field when checking queue status

## Recovery Methods

### Method 1: Interactive CLI Tool

**Best for**: Regular users, interactive sessions, when you want visibility into what's happening

```bash
bun scripts/check-pending-queue.ts
```

**Example Output**:

```
Checking worker health...
Worker is healthy ✓

Queue Summary:
  Pending: 12 messages
  Processing: 2 messages (1 stuck)
  Failed: 0 messages
  Recently Processed: 5 messages in last 30 minutes

Sessions with pending work: 3
  Session 44: 5 pending, 1 processing (age: 2m)
  Session 45: 4 pending, 1 processing (age: 7m - STUCK)
  Session 46: 2 pending

Would you like to process these pending queues? (y/n)
```

**Features**:

- ✅ Pre-flight health check (verifies worker is running)
- ✅ Detailed queue breakdown by session
- ✅ Age tracking for stuck detection
- ✅ Confirmation prompt (prevents accidental reprocessing)
- ✅ Non-interactive mode with `--process` flag
- ✅ Session limit control with `--limit N`

### Method 2: HTTP API

**Best for**: Automation, scripting, integration with monitoring systems

#### Check Queue Status

```bash
curl http://localhost:37777/api/pending-queue
```

**Response**:

```json
{
  "queue": {
    "messages": [
      {
        "id": 123,
        "session_db_id": 45,
        "claude_session_id": "abc123",
        "message_type": "observation",
        "status": "pending",
        "retry_count": 0,
        "created_at_epoch": 1730886600000
      }
    ],
    "totalPending": 12,
    "totalProcessing": 2,
    "totalFailed": 0,
    "stuckCount": 1
  },
  "recentlyProcessed": [...],
  "sessionsWithPendingWork": [44, 45, 46]
}
```

**Key Fields**:

- `totalPending` - Messages waiting to process
- `totalProcessing` - Messages currently processing
- `stuckCount` - Messages that have been in `processing` for more than 5 minutes
- `sessionsWithPendingWork` - Session IDs needing recovery

#### Trigger Recovery

```bash
curl -X POST http://localhost:37777/api/pending-queue/process \
  -H "Content-Type: application/json" \
  -d '{"sessionLimit": 10}'
```

**Response**:

```json
{
  "success": true,
  "totalPendingSessions": 15,
  "sessionsStarted": 10,
  "sessionsSkipped": 2,
  "startedSessionIds": [44, 45, 46, 47, 48, 49, 50, 51, 52, 53]
}
```

**Response Fields**:

- `totalPendingSessions` - Total sessions with pending messages in the database
- `sessionsStarted` - Sessions started by this request
- `sessionsSkipped` - Sessions already processing (prevents duplicate agents)
- `startedSessionIds` - Database IDs of the sessions that were started

## Best Practices

### 1. Always Check Before Recovery

```bash
# Check queue status first
curl http://localhost:37777/api/pending-queue

# Or use CLI tool which checks automatically
bun scripts/check-pending-queue.ts
```

### 2. Start with Low Session Limits

```bash
# Process only 5 sessions at a time
bun scripts/check-pending-queue.ts --process --limit 5
```

This prevents overwhelming the worker with too many concurrent SDK agents.

### 3. Monitor During Recovery

Watch worker logs while recovery runs:

```bash
npm run worker:logs
```

Look for:

- SDK agent starts: `Starting SDK agent for session...`
- Processing completions: `Processed observation...`
- Errors: `ERROR` or `Failed to process...`

### 4. Verify Recovery Success

Check recently processed messages:

```bash
curl http://localhost:37777/api/pending-queue | jq '.recentlyProcessed'
```

Or use the CLI tool, which shows this automatically.

### 5. Handle Failed Messages

Messages that fail 3 times are marked `failed` and won't auto-retry:

```bash
# View failed messages
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT id, session_db_id, message_type, retry_count
  FROM pending_messages
  WHERE status = 'failed'
  ORDER BY completed_at_epoch DESC;
"
```

You can manually reset them if needed:

```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  UPDATE pending_messages
  SET status = 'pending', retry_count = 0
  WHERE status = 'failed';
"
```

## Troubleshooting

### Recovery Not Working

**Symptom**: Triggered recovery but messages still pending

**Solutions**:

1. **Verify worker health**:
   ```bash
   curl http://localhost:37777/health
   ```

2. **Check worker logs for errors**:
   ```bash
   npm run worker:logs | grep -i error
   ```

3. **Restart worker**:
   ```bash
   npm run worker:restart
   ```

4. **Check database integrity**:
   ```bash
   sqlite3 ~/.claude-mem/claude-mem.db "PRAGMA integrity_check;"
   ```

### Messages Stuck Forever

**Symptom**: Messages show as "processing" for hours

**Solution**: Force reset stuck messages

```bash
# Reset all messages in 'processing' state to pending
# (this also resets any that are actively being processed)
sqlite3 ~/.claude-mem/claude-mem.db "
  UPDATE pending_messages
  SET status = 'pending', started_processing_at_epoch = NULL
  WHERE status = 'processing';
"

# Then trigger recovery
bun scripts/check-pending-queue.ts --process
```

### Worker Crashes During Recovery

**Symptom**: Worker stops while processing recovered messages

**Solutions**:

1. **Check available memory**:
   ```bash
   npm run worker:status
   ```

2. **Reduce session limit**:
   ```bash
   bun scripts/check-pending-queue.ts --process --limit 3
   ```

3. **Check for SDK errors in logs**:
   ```bash
   npm run worker:logs | grep -i "sdk"
   ```

4. **Increase worker memory** (if using custom runner):
   ```bash
   export NODE_OPTIONS="--max-old-space-size=4096"
   npm run worker:restart
   ```

## Advanced Usage

### Direct Database Inspection

View all pending messages:

```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    id,
    session_db_id,
    message_type,
    status,
    retry_count,
    datetime(created_at_epoch/1000, 'unixepoch') as created_at,
    datetime(started_processing_at_epoch/1000, 'unixepoch') as started_at,
    CAST((strftime('%s', 'now') * 1000 - started_processing_at_epoch) / 60000 AS INTEGER) as age_minutes
  FROM pending_messages
  WHERE status IN ('pending', 'processing')
  ORDER BY created_at_epoch;
"
```

### Count Messages by Status

```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT status, COUNT(*) as count
  FROM pending_messages
  GROUP BY status;
"
```

### Find Sessions with Pending Work

```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    session_db_id,
    COUNT(*) as pending_count,
    GROUP_CONCAT(message_type) as message_types
  FROM pending_messages
  WHERE status IN ('pending', 'processing')
  GROUP BY session_db_id;
"
```

### View Recent Failures

```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    id,
    session_db_id,
    message_type,
    retry_count,
    datetime(completed_at_epoch/1000, 'unixepoch') as failed_at
  FROM pending_messages
  WHERE status = 'failed'
  ORDER BY completed_at_epoch DESC
  LIMIT 10;
"
```

## Integration Examples

### Cron Job for Automatic Recovery

```bash
#!/bin/bash
# Run every hour to process stuck queues

# Check if worker is healthy
if curl -f http://localhost:37777/health > /dev/null 2>&1; then
  # Auto-process up to 5 sessions
  bun scripts/check-pending-queue.ts --process --limit 5
else
  echo "Worker not healthy, skipping recovery"
  exit 1
fi
```

### Monitoring Script

```bash
#!/bin/bash
# Alert if stuck count exceeds threshold

# Fall back to 0 if the field is missing or the worker is unreachable,
# so the numeric comparison below never sees an empty string
STUCK_COUNT=$(curl -s http://localhost:37777/api/pending-queue | jq -r '.queue.stuckCount // 0')

if [ "${STUCK_COUNT:-0}" -gt 5 ]; then
  echo "WARNING: $STUCK_COUNT stuck messages detected"
  # Send alert (email, Slack, etc.)
fi
```

### Pre-Shutdown Recovery

```bash
#!/bin/bash
# Process pending queues before system shutdown

echo "Processing pending queues before shutdown..."
bun scripts/check-pending-queue.ts --process --limit 20

# Fixed delay; this does not guarantee the queue has fully drained
echo "Waiting for processing to complete..."
sleep 10

echo "Stopping worker..."
claude-mem stop
```

## Migration Note

If you're upgrading from v4.x to v5.x:

**v4.x Behavior** (Automatic Recovery):

- Worker automatically recovered stuck messages on startup
- No user control over reprocessing timing

**v5.x Behavior** (Manual Recovery):

- Stuck messages detected but NOT automatically reprocessed
- User must explicitly trigger recovery via CLI or API
- Prevents unexpected duplicate observations
- Provides explicit control over when processing happens

**Migration Steps**:

1. Upgrade to v5.x
2. Check for stuck messages: `bun scripts/check-pending-queue.ts`
3. Process if needed: `bun scripts/check-pending-queue.ts --process`
4. Add recovery to your workflow (cron job, pre-shutdown script, etc.)

## See Also

- [Worker Service Architecture](../architecture/worker-service) - Technical details on queue processing
- [Troubleshooting - Manual Recovery](../troubleshooting#manual-recovery-for-stuck-observations) - Common issues and solutions
- [Database Schema](../architecture/database) - Pending messages table structure
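
## Sketch: Check-Then-Recover Script

The stuck rule from Stuck Detection and the first two Best Practices (check before recovering, start with a low session limit) can be combined into one small script. What follows is a minimal sketch, not part of claude-mem: the function names `is_stuck` and `should_recover`, the `RUN_RECOVERY` guard variable, and the choice to treat exactly 5 minutes as not yet stuck are assumptions made for illustration. The endpoints, the `sessionLimit` body field, and the `queue.totalPending` response field are taken from the HTTP API section above.

```bash
#!/bin/bash
# Illustrative sketch only. is_stuck, should_recover and RUN_RECOVERY are
# hypothetical names, not part of claude-mem.

# Stuck rule: in `processing` for more than 5 minutes (epochs in milliseconds)
STUCK_AFTER_MS=$((5 * 60 * 1000))

# is_stuck STARTED_MS NOW_MS -> exit status 0 if the message counts as stuck
is_stuck() {
  local started_ms=$1 now_ms=$2
  [ $((now_ms - started_ms)) -gt "$STUCK_AFTER_MS" ]
}

# should_recover PENDING_COUNT -> exit status 0 if there is pending work
should_recover() {
  local pending=$1
  [ "$pending" -gt 0 ]
}

# Only contact the worker when explicitly asked, so the helpers above can be
# reused or tested without a running worker.
if [ "${RUN_RECOVERY:-0}" = "1" ]; then
  status=$(curl -sf http://localhost:37777/api/pending-queue) || {
    echo "Worker not reachable, skipping recovery" >&2
    exit 1
  }
  pending=$(echo "$status" | jq -r '.queue.totalPending // 0')

  if should_recover "$pending"; then
    # Low session limit, as recommended under Best Practices
    curl -sf -X POST http://localhost:37777/api/pending-queue/process \
      -H "Content-Type: application/json" \
      -d '{"sessionLimit": 5}'
  else
    echo "Nothing pending, no recovery needed"
  fi
fi
```

Saved under a name of your choosing (for example `recover.sh`, also hypothetical), it would be run as `RUN_RECOVERY=1 bash recover.sh`. Keeping the decision logic in small functions makes the threshold easy to adjust if your worker uses a different stuck window.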