feat: Fix observation timestamps, refactor session management, and enhance worker reliability (#437)

* Refactor worker version checks and increase timeout settings
  - Updated the default hook timeout from 5000ms to 120000ms for improved stability.
  - Modified the worker version check to log a warning instead of restarting the worker on version mismatch.
  - Removed legacy PM2 cleanup and worker start logic, simplifying the ensureWorkerRunning function.
  - Enhanced polling mechanism for worker readiness with increased retries and reduced interval.
* feat: implement worker queue polling to ensure processing completion before proceeding
* refactor: change worker command from start to restart in hooks configuration
* refactor: remove session management complexity
  - Simplify createSDKSession to pure INSERT OR IGNORE
  - Remove auto-create logic from storeObservation/storeSummary
  - Delete 11 unused session management methods
  - Derive prompt_number from user_prompts count
  - Keep sdk_sessions table schema unchanged for compatibility
* refactor: simplify session management by removing unused methods and auto-creation logic
* Refactor session prompt number retrieval in SessionRoutes
  - Updated the method of obtaining the prompt number from the session.
  - Replaced `store.getPromptCounter(sessionDbId)` with `store.getPromptNumberFromUserPrompts(claudeSessionId)` for better clarity and accuracy.
  - Adjusted the logic for incrementing the prompt number to derive it from the user prompts count instead of directly incrementing a counter.
* refactor: replace getPromptCounter with getPromptNumberFromUserPrompts in SessionManager
  Phase 7 of session management simplification. Updates SessionManager to derive prompt numbers from user_prompts table count instead of using the deprecated prompt_counter column.
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* refactor: simplify SessionCompletionHandler to use direct SQL query
  Phase 8: Remove call to findActiveSDKSession() and replace with direct database query in SessionCompletionHandler.completeByClaudeId(). This removes dependency on the deleted findActiveSDKSession() method and simplifies the code by using a straightforward SELECT query.
* refactor: remove markSessionCompleted call from SDKAgent
  - Delete call to markSessionCompleted() in SDKAgent.ts
  - Session status is no longer tracked or updated
  - Part of phase 9: simplifying session management
* refactor: remove markSessionComplete method (Phase 10)
  - Deleted markSessionComplete() method from DatabaseManager
  - Removed markSessionComplete call from SessionCompletionHandler
  - Session completion status no longer tracked in database
  - Part of session management simplification effort
* refactor: replace deleted updateSDKSessionId calls in import script (Phase 11)
  - Replace updateSDKSessionId() calls with direct SQL UPDATE statements
  - Method was deleted in Phase 3 as part of session management simplification
  - Import script now uses direct database access consistently
* test: add validation for SQL updates in sdk_sessions table
* refactor: enhance worker-cli to support manual and automated runs
* Remove cleanup hook and associated session completion logic
  - Deleted the cleanup-hook implementation from the hooks directory.
  - Removed the session completion endpoint that was used by the cleanup hook.
  - Updated the SessionCompletionHandler to eliminate the completeByClaudeId method and its dependencies.
  - Adjusted the SessionRoutes to reflect the removal of the session completion route.
* fix: update worker-cli command to use bun for consistency
* feat: Implement timestamp fix for observations and enhance processing logic
  - Added `earliestPendingTimestamp` to `ActiveSession` to track the original timestamp of the earliest pending message.
  - Updated `SDKAgent` to capture and utilize the earliest pending timestamp during response processing.
  - Modified `SessionManager` to track the earliest timestamp when yielding messages.
  - Created scripts for fixing corrupted timestamps, validating fixes, and investigating timestamp issues.
  - Verified that all corrupted observations have been repaired and logic for future processing is sound.
  - Ensured orphan processing can be safely re-enabled after validation.
* feat: Enhance SessionStore to support custom database paths and add timestamp fields for observations and summaries
* Refactor pending queue processing and add management endpoints
  - Disabled automatic recovery of orphaned queues on startup; users must now use the new /api/pending-queue/process endpoint.
  - Updated processOrphanedQueues method to processPendingQueues with improved session handling and return detailed results.
  - Added new API endpoints for managing pending queues: GET /api/pending-queue and POST /api/pending-queue/process.
  - Introduced a new script (check-pending-queue.ts) for checking and processing pending observation queues interactively or automatically.
  - Enhanced logging and error handling for better monitoring of session processing.
* updated agent sdk
* feat: Add manual recovery guide and queue management endpoints to documentation

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-25 15:36:46 -05:00
parent 4cf2c1bdb1
commit 266c746d50
47 changed files with 3417 additions and 1026 deletions
@@ -19,7 +19,7 @@ The worker service is a long-running HTTP API built with Express.js and managed

 ## REST API Endpoints

-The worker service exposes 20 HTTP endpoints organized into five categories:
+The worker service exposes 22 HTTP endpoints organized into six categories:

 ### Viewer & Health Endpoints

@@ -385,9 +385,106 @@ POST /api/settings
 }
 ```

+### Queue Management Endpoints
+
+#### 16. Get Pending Queue Status
+```
+GET /api/pending-queue
+```
+
+**Purpose**: View current processing queue status and identify stuck messages
+
+**Response**:
+```json
+{
+  "queue": {
+    "messages": [
+      {
+        "id": 123,
+        "session_db_id": 45,
+        "claude_session_id": "abc123",
+        "message_type": "observation",
+        "status": "pending",
+        "retry_count": 0,
+        "created_at_epoch": 1730886600000,
+        "started_processing_at_epoch": null,
+        "completed_at_epoch": null
+      }
+    ],
+    "totalPending": 5,
+    "totalProcessing": 2,
+    "totalFailed": 0,
+    "stuckCount": 1
+  },
+  "recentlyProcessed": [
+    {
+      "id": 122,
+      "session_db_id": 44,
+      "status": "processed",
+      "completed_at_epoch": 1730886500000
+    }
+  ],
+  "sessionsWithPendingWork": [44, 45, 46]
+}
+```
+
+**Status Definitions**:
+- `pending`: Message queued, not yet processed
+- `processing`: Message currently being processed by SDK agent
+- `processed`: Message completed successfully
+- `failed`: Message failed after max retry attempts (3 by default)
+
+**Stuck Detection**: Messages in `processing` status for >5 minutes are considered stuck and included in `stuckCount`
+
+**Use Case**: Check queue health after worker crashes or restarts to identify unprocessed observations
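
The stuck rule described above can be sketched in TypeScript. This is a minimal illustration rather than the worker's actual implementation: the `PendingMessage` interface and the `isStuck` and `countStuck` helpers are hypothetical names, and measuring the 5-minute window from `started_processing_at_epoch` is an assumption based on the response fields shown.

```typescript
// Subset of one `queue.messages` entry, limited to the fields this sketch needs.
interface PendingMessage {
  id: number;
  status: "pending" | "processing" | "processed" | "failed";
  started_processing_at_epoch: number | null;
}

// Assumption: the threshold is measured from started_processing_at_epoch (milliseconds).
const STUCK_THRESHOLD_MS = 5 * 60 * 1000;

// A message counts as stuck only while it is in `processing` and older than the threshold.
function isStuck(msg: PendingMessage, nowMs: number): boolean {
  if (msg.status !== "processing" || msg.started_processing_at_epoch === null) {
    return false;
  }
  return nowMs - msg.started_processing_at_epoch > STUCK_THRESHOLD_MS;
}

// Client-side equivalent of the `stuckCount` field.
function countStuck(messages: PendingMessage[], nowMs: number): number {
  return messages.filter((m) => isStuck(m, nowMs)).length;
}
```

Passing the current time as a parameter keeps the check deterministic, which makes it easy to exercise against a saved API response.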
+
+#### 17. Trigger Manual Recovery
+```
+POST /api/pending-queue/process
+```
+
+**Purpose**: Manually trigger processing of pending queues (replaces automatic recovery in v5.x+)
+
+**Request Body**:
+```json
+{
+  "sessionLimit": 10
+}
+```
+
+**Body Parameters**:
+- `sessionLimit` (optional): Maximum number of sessions to process (default: 10, max: 100)
+
+**Response**:
+```json
+{
+  "success": true,
+  "totalPendingSessions": 15,
+  "sessionsStarted": 10,
+  "sessionsSkipped": 2,
+  "startedSessionIds": [44, 45, 46, 47, 48, 49, 50, 51, 52, 53]
+}
+```
+
+**Response Fields**:
+- `totalPendingSessions`: Total sessions with pending messages in database
+- `sessionsStarted`: Number of sessions this request started processing
+- `sessionsSkipped`: Sessions already actively processing (not restarted)
+- `startedSessionIds`: Database IDs of sessions started
+
+**Behavior**:
+- Processes up to `sessionLimit` sessions with pending work
+- Skips sessions already actively processing (prevents duplicate agents)
+- Starts non-blocking SDK agents for each session
+- Returns immediately with status (processing continues in background)
+
+**Use Case**: Recover stuck observations after worker crashes or restarts, now that recovery no longer runs automatically on startup
+
+**Recovery Strategy Note**: As of v5.x, automatic recovery on worker startup is disabled. To reprocess pending work, trigger recovery manually through this endpoint or the CLI tool (`bun scripts/check-pending-queue.ts`); this keeps reprocessing under explicit user control.
+
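A client might normalize `sessionLimit` before sending the request. The sketch below is illustrative only: `normalizeSessionLimit` and `buildRecoveryRequest` are hypothetical helpers, and because this page does not say how the server treats out-of-range values, clamping to the range 1 to 100 on the client is an assumption.

```typescript
const DEFAULT_SESSION_LIMIT = 10; // documented default
const MAX_SESSION_LIMIT = 100; // documented maximum

// Fall back to the default when no usable number is given, otherwise clamp to 1..100.
function normalizeSessionLimit(requested?: number): number {
  if (requested === undefined || !Number.isFinite(requested)) {
    return DEFAULT_SESSION_LIMIT;
  }
  const whole = Math.floor(requested);
  return Math.min(Math.max(whole, 1), MAX_SESSION_LIMIT);
}

// Build the method, headers, and JSON body for POST /api/pending-queue/process.
function buildRecoveryRequest(sessionLimit?: number): {
  method: string;
  headers: Record<string, string>;
  body: string;
} {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ sessionLimit: normalizeSessionLimit(sessionLimit) }),
  };
}
```

The returned object can be passed as the second argument to `fetch` together with the endpoint URL.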
 ### Session Management Endpoints

-#### 16. Initialize Session
+#### 18. Initialize Session
 ```
 POST /sessions/:sessionDbId/init
 ```
@@ -408,7 +505,7 @@ POST /sessions/:sessionDbId/init
 }
 ```

-#### 17. Add Observation
+#### 19. Add Observation
 ```
 POST /sessions/:sessionDbId/observations
 ```
@@ -431,7 +528,7 @@ POST /sessions/:sessionDbId/observations
 }
 ```

-#### 18. Generate Summary
+#### 20. Generate Summary
 ```
 POST /sessions/:sessionDbId/summarize
 ```
@@ -451,7 +548,7 @@ POST /sessions/:sessionDbId/summarize
 }
 ```

-#### 19. Session Status
+#### 21. Session Status
 ```
 GET /sessions/:sessionDbId/status
 ```
@@ -466,7 +563,7 @@ GET /sessions/:sessionDbId/status
 }
 ```

-#### 20. Delete Session
+#### 22. Delete Session
 ```
 DELETE /sessions/:sessionDbId
 ```
@@ -371,45 +371,149 @@ npm test

 ## Testing

-### Running Tests
+### Testing Philosophy

+Claude-mem relies on **real-world usage and manual testing** rather than traditional unit tests. The project philosophy prioritizes:
+
+1. **Manual verification** - Testing features in actual Claude Code sessions
+2. **Integration testing** - Running the full system end-to-end
+3. **Database inspection** - Verifying data correctness via SQLite queries
+4. **CLI tools** - Interactive tools for checking system state
+5. **Observability** - Comprehensive logging and worker health checks
+
+This approach was chosen because:
+- Hook behavior depends heavily on Claude Code's runtime environment
+- SDK interactions require real API calls and responses
+- SQLite and Bun runtime provide stability guarantees
+- Manual testing catches integration issues that unit tests miss
+
+### Manual Testing Workflow
+
+When developing new features:
+
+1. **Build and sync**:
+   ```bash
+   npm run build
+   npm run sync-marketplace
+   claude-mem restart
+   ```
+
+2. **Test in real session**:
+   - Start Claude Code
+   - Trigger the feature you're testing
+   - Verify expected behavior
+
+3. **Check database state**:
+   ```bash
+   sqlite3 ~/.claude-mem/claude-mem.db "SELECT * FROM your_table;"
+   ```
+
+4. **Monitor worker logs**:
+   ```bash
+   npm run worker:logs
+   ```
+
+5. **Verify queue health** (for recovery features):
+   ```bash
+   bun scripts/check-pending-queue.ts
+   ```
+
+### Testing Tools
+
+**Health Checks**:
 ```bash
-# All tests
-npm test
+# Worker status
+npm run worker:status

-# Specific test file
-node --test tests/your-test.test.ts
+# Queue inspection
+curl http://localhost:37777/api/pending-queue

-# With coverage (if configured)
-npm test -- --coverage
+# Database integrity
+sqlite3 ~/.claude-mem/claude-mem.db "PRAGMA integrity_check;"
 ```

-### Writing Tests
+**Hook Testing**:
+```bash
+# Test context hook manually (quoting $(pwd) keeps paths that contain spaces intact)
+echo '{"session_id":"test-123","cwd":"'"$(pwd)"'","source":"startup"}' | node plugin/scripts/context-hook.js

-Create test files in `tests/`:
-
-```typescript
-import { describe, it } from 'node:test';
-import assert from 'node:assert';
-
-describe('YourFeature', () => {
-  it('should do something', () => {
-    // Test implementation
-    assert.strictEqual(result, expected);
-  });
-});
+# Test new hook (quoting $(pwd) keeps paths that contain spaces intact)
+echo '{"session_id":"test-123","cwd":"'"$(pwd)"'","prompt":"test"}' | node plugin/scripts/new-hook.js
 ```

-### Test Database
+**Data Verification**:
+```bash
+# Check recent observations
+sqlite3 ~/.claude-mem/claude-mem.db "
+  SELECT id, tool_name, created_at
+  FROM observations
+  ORDER BY created_at_epoch DESC
+  LIMIT 10;
+"

-Use a separate test database:
-
-```typescript
-import { SessionStore } from '../src/services/sqlite/SessionStore';
-
-const store = new SessionStore(':memory:'); // In-memory database
+# Check summaries
+sqlite3 ~/.claude-mem/claude-mem.db "
+  SELECT id, request, completed
+  FROM session_summaries
+  ORDER BY created_at_epoch DESC
+  LIMIT 5;
+"
 ```

+### Recovery Feature Testing
+
+For manual recovery features specifically:
+
+1. **Simulate stuck messages**:
+   ```bash
+   # Manually create stuck message (for testing only)
+   sqlite3 ~/.claude-mem/claude-mem.db "
+     UPDATE pending_messages
+     SET status = 'processing',
+         started_processing_at_epoch = strftime('%s', 'now', '-10 minutes') * 1000
+     WHERE id = 123;
+   "
+   ```
+
+2. **Test recovery**:
+   ```bash
+   bun scripts/check-pending-queue.ts
+   ```
+
+3. **Verify results**:
+   ```bash
+   curl http://localhost:37777/api/pending-queue | jq '.queue'
+   ```
+
+### Regression Testing
+
+Before releasing:
+
+1. **Test all hook triggers**:
+   - SessionStart: Start new Claude Code session
+   - UserPromptSubmit: Submit a prompt
+   - PostToolUse: Use a tool like Read
+   - Summary: Let session complete
+   - SessionEnd: Close Claude Code
+
+2. **Test core features**:
+   - Context injection (recent sessions appear)
+   - Observation processing (summaries generated)
+   - MCP search tools (search returns results)
+   - Viewer UI (loads at http://localhost:37777)
+   - Manual recovery (stuck messages recovered)
+
+3. **Test edge cases**:
+   - Worker crash recovery
+   - Database locks
+   - Port conflicts
+   - Large databases
+
+4. **Cross-platform** (if applicable):
+   - macOS
+   - Linux
+   - Windows
+
 ## Code Style

 ### TypeScript Guidelines
@@ -40,6 +40,7 @@
          "usage/claude-desktop",
          "usage/private-tags",
          "usage/export-import",
+          "usage/manual-recovery",
          "beta-features",
          "endless-mode"
        ]
@@ -285,6 +285,201 @@ The skill includes comprehensive diagnostics, automated repair sequences, and de
   claude-mem restart
   ```

+### Manual Recovery for Stuck Observations
+
+**Symptoms**: Observations remain stuck in the processing queue after a worker crash or restart, and no new summaries appear even though the worker is running.
+
+**Background**: As of v5.x, automatic queue recovery on worker startup is disabled. Users must manually trigger recovery to maintain explicit control over reprocessing and prevent unexpected duplicate observations.
+
+**Solutions**:
+
+#### Option 1: Use CLI Recovery Tool (Recommended)
+
+The interactive CLI tool provides the safest and most user-friendly recovery experience:
+
+```bash
+# Check queue status and prompt for recovery
+bun scripts/check-pending-queue.ts
+
+# Auto-process without prompting
+bun scripts/check-pending-queue.ts --process
+
+# Process up to 5 sessions
+bun scripts/check-pending-queue.ts --process --limit 5
+```
+
+**What it does**:
+- ✅ Checks worker health before proceeding
+- ✅ Shows detailed queue summary (pending, processing, failed, stuck)
+- ✅ Groups messages by session with age and status breakdown
+- ✅ Prompts user to confirm processing (unless `--process` flag used)
+- ✅ Shows recently processed messages for feedback
+
+**Interactive Example**:
+```
+Worker is healthy ✓
+
+Queue Summary:
+  Pending: 12 messages
+  Processing: 2 messages (1 stuck)
+  Failed: 0 messages
+  Recently Processed: 5 messages in last 30 minutes
+
+Sessions with pending work: 3
+  Session 44: 5 pending, 1 processing (age: 2m)
+  Session 45: 4 pending, 1 processing (age: 7m - STUCK)
+  Session 46: 2 pending
+
+Would you like to process these pending queues? (y/n)
+```
+
+#### Option 2: Use HTTP API Directly
+
+For automation or scripting scenarios:
+
+1. **Check queue status**:
+   ```bash
+   curl http://localhost:37777/api/pending-queue
+   ```
+
+   Response shows:
+   - `queue.totalPending`: Messages waiting to process
+   - `queue.totalProcessing`: Messages currently processing
+   - `queue.stuckCount`: Processing messages >5 minutes old
+   - `sessionsWithPendingWork`: Session IDs needing recovery
+
+2. **Trigger manual recovery**:
+   ```bash
+   curl -X POST http://localhost:37777/api/pending-queue/process \
+     -H "Content-Type: application/json" \
+     -d '{"sessionLimit": 10}'
+   ```
+
+   Response includes:
+   - `totalPendingSessions`: Total sessions with pending messages
+   - `sessionsStarted`: Number of sessions this request started processing
+   - `sessionsSkipped`: Sessions already processing (not restarted)
+   - `startedSessionIds`: Database IDs of sessions started
+
+#### Understanding Queue States
+
+Messages progress through these states:
+
+1. **pending** - Queued, waiting to process
+2. **processing** - Currently being processed by SDK agent
+3. **processed** - Completed successfully
+4. **failed** - Failed after 3 retry attempts
+
+**Stuck Detection**: Messages in `processing` state for >5 minutes are considered stuck and automatically reset to `pending` on worker startup (but not automatically reprocessed).
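
The retry limit in the state list above could be modeled as follows. This is a sketch under stated assumptions: `afterFailedAttempt` is a hypothetical helper, and treating the third failed attempt as the one that moves a message to `failed` is one reading of "3 retry attempts", not confirmed behavior.

```typescript
type QueueStatus = "pending" | "processing" | "processed" | "failed";

interface QueueEntry {
  status: QueueStatus;
  retry_count: number;
}

// Documented maximum number of retry attempts.
const MAX_RETRIES = 3;

// Hypothetical transition applied when a processing attempt fails:
// the message goes back to `pending` until the retry budget is used up.
function afterFailedAttempt(entry: QueueEntry): QueueEntry {
  const retry_count = entry.retry_count + 1;
  const status: QueueStatus = retry_count >= MAX_RETRIES ? "failed" : "pending";
  return { status, retry_count };
}
```

Returning a new object rather than mutating the input keeps each transition easy to inspect in isolation.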
+
+#### Recovery Strategy
+
+**When to use manual recovery**:
+- After worker crashes or unexpected restarts
+- When observations appear saved but no summaries generated
+- When queue status shows stuck messages (processing >5 minutes)
+- After system crashes or forced shutdowns
+
+**Best practices**:
+1. Always check queue status before triggering recovery
+2. Use the CLI tool for interactive sessions (provides feedback)
+3. Use the HTTP API for automation/scripting
+4. Start with a low session limit (5-10) to avoid overwhelming the worker
+5. Monitor worker logs during recovery: `npm run worker:logs`
+6. Check recently processed messages to confirm recovery worked
+
+#### Troubleshooting Recovery Issues
+
+If recovery fails or messages remain stuck:
+
+1. **Verify worker is healthy**:
+   ```bash
+   curl http://localhost:37777/health
+   # Should return: {"status":"ok","uptime":12345,"port":37777}
+   ```
+
+2. **Check database for corruption**:
+   ```bash
+   sqlite3 ~/.claude-mem/claude-mem.db "PRAGMA integrity_check;"
+   ```
+
+3. **View stuck messages directly**:
+   ```bash
+   sqlite3 ~/.claude-mem/claude-mem.db "
+     SELECT id, session_db_id, status, retry_count,
+            (strftime('%s', 'now') * 1000 - started_processing_at_epoch) / 60000 as age_minutes
+     FROM pending_messages
+     WHERE status = 'processing'
+     ORDER BY started_processing_at_epoch;
+   "
+   ```
+
+4. **Force reset stuck messages** (nuclear option):
+   ```bash
+   sqlite3 ~/.claude-mem/claude-mem.db "
+     UPDATE pending_messages
+     SET status = 'pending', started_processing_at_epoch = NULL
+     WHERE status = 'processing';
+   "
+   ```
+
+   Then trigger recovery:
+   ```bash
+   bun scripts/check-pending-queue.ts --process
+   ```
+
+5. **Check worker logs for SDK errors**:
+   ```bash
+   npm run worker:logs | grep -i error
+   ```
+
+#### Understanding the Queue Table
+
+The `pending_messages` table tracks all messages with these key fields:
+
+```sql
+CREATE TABLE pending_messages (
+  id INTEGER PRIMARY KEY,
+  session_db_id INTEGER,          -- Foreign key to sdk_sessions
+  claude_session_id TEXT,          -- Claude session ID
+  message_type TEXT,               -- 'observation' | 'summarize'
+  status TEXT,                     -- 'pending' | 'processing' | 'processed' | 'failed'
+  retry_count INTEGER,             -- Current retry attempt (max: 3)
+  created_at_epoch INTEGER,        -- When message was queued
+  started_processing_at_epoch INTEGER,  -- When marked 'processing'
+  completed_at_epoch INTEGER       -- When completed/failed
+)
+```
+
+**Query examples**:
+
+```bash
+# Count messages by status
+sqlite3 ~/.claude-mem/claude-mem.db "
+  SELECT status, COUNT(*)
+  FROM pending_messages
+  GROUP BY status;
+"
+
+# Find sessions with pending work
+sqlite3 ~/.claude-mem/claude-mem.db "
+  SELECT session_db_id, COUNT(*) as pending_count
+  FROM pending_messages
+  WHERE status IN ('pending', 'processing')
+  GROUP BY session_db_id;
+"
+
+# View recent failures
+sqlite3 ~/.claude-mem/claude-mem.db "
+  SELECT id, session_db_id, message_type, retry_count,
+         datetime(completed_at_epoch/1000, 'unixepoch') as failed_at
+  FROM pending_messages
+  WHERE status = 'failed'
+  ORDER BY completed_at_epoch DESC
+  LIMIT 10;
+"
+```
+
 ## Hook Issues

 ### Hooks Not Firing
@@ -0,0 +1,450 @@
+---
+title: "Manual Recovery"
+description: "Recover stuck observations after worker crashes or restarts"
+---
+
+# Manual Recovery Guide
+
+## Overview
+
+Claude-mem's manual recovery system helps you recover observations that get stuck in the processing queue after worker crashes, system restarts, or unexpected shutdowns.
+
+**Key Change in v5.x**: Automatic recovery on worker startup is now disabled. This gives you explicit control over when reprocessing happens, preventing unexpected duplicate observations.
+
+## When Do You Need Manual Recovery?
+
+You should trigger manual recovery when:
+
+- **Worker crashed or restarted** - Observations were queued but worker stopped before processing
+- **No new summaries appearing** - Observations are being saved but not processed into summaries
+- **Stuck messages detected** - Messages showing as "processing" for >5 minutes
+- **System crashes** - Unexpected shutdowns left messages in incomplete states
+
+## Quick Start
+
+### Using the CLI Tool (Recommended)
+
+The interactive CLI tool is the safest and easiest way to recover stuck observations:
+
+```bash
+# Check status and prompt for recovery
+bun scripts/check-pending-queue.ts
+```
+
+This will:
+1. Check worker health
+2. Show queue summary (pending, processing, failed, stuck counts)
+3. Display sessions with pending work
+4. Prompt you to confirm recovery
+5. Show recently processed messages for feedback
+
+### Auto-Process Without Prompts
+
+For scripting or when you're confident recovery is needed:
+
+```bash
+# Auto-process without prompting
+bun scripts/check-pending-queue.ts --process
+
+# Limit to 5 sessions
+bun scripts/check-pending-queue.ts --process --limit 5
+```
+
+## Understanding Queue States
+
+Messages progress through these lifecycle states:
+
+1. **pending** → Queued, waiting to process
+2. **processing** → Currently being processed by SDK agent
+3. **processed** → Completed successfully
+4. **failed** → Failed after 3 retry attempts
+
+### Stuck Detection
+
+Messages in `processing` state for **>5 minutes** are considered stuck:
+
+- They're automatically reset to `pending` on worker startup
+- They're NOT automatically reprocessed (requires manual trigger)
+- They appear in the `stuckCount` field when checking queue status
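
The startup behavior in the list above, where stuck messages return to `pending` without being reprocessed, can be sketched like this. `QueueRow` and `resetStuckRows` are hypothetical names for illustration; the actual worker code may differ, and clearing `started_processing_at_epoch` is an assumption made here for consistency.

```typescript
type Status = "pending" | "processing" | "processed" | "failed";

interface QueueRow {
  id: number;
  status: Status;
  started_processing_at_epoch: number | null;
}

const STUCK_AFTER_MS = 5 * 60 * 1000; // the documented 5-minute threshold

// Hypothetical startup pass: returns a new array in which stuck rows are pending again.
function resetStuckRows(rows: QueueRow[], nowMs: number): QueueRow[] {
  return rows.map((row): QueueRow => {
    const stuck =
      row.status === "processing" &&
      row.started_processing_at_epoch !== null &&
      nowMs - row.started_processing_at_epoch > STUCK_AFTER_MS;
    // Only the row state changes; nothing here starts an SDK agent.
    return stuck
      ? { ...row, status: "pending", started_processing_at_epoch: null }
      : row;
  });
}
```

Processing the reset rows still requires an explicit trigger through the CLI tool or the HTTP API.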
+
+## Recovery Methods
+
+### Method 1: Interactive CLI Tool
+
+**Best for**: Regular users, interactive sessions, when you want visibility into what's happening
+
+```bash
+bun scripts/check-pending-queue.ts
+```
+
+**Example Output**:
+```
+Checking worker health...
+Worker is healthy ✓
+
+Queue Summary:
+  Pending: 12 messages
+  Processing: 2 messages (1 stuck)
+  Failed: 0 messages
+  Recently Processed: 5 messages in last 30 minutes
+
+Sessions with pending work: 3
+  Session 44: 5 pending, 1 processing (age: 2m)
+  Session 45: 4 pending, 1 processing (age: 7m - STUCK)
+  Session 46: 2 pending
+
+Would you like to process these pending queues? (y/n)
+```
+
+**Features**:
+- ✅ Pre-flight health check (verifies worker is running)
+- ✅ Detailed queue breakdown by session
+- ✅ Age tracking for stuck detection
+- ✅ Confirmation prompt (prevents accidental reprocessing)
+- ✅ Non-interactive mode with `--process` flag
+- ✅ Session limit control with `--limit N`
+
+### Method 2: HTTP API
+
+**Best for**: Automation, scripting, integration with monitoring systems
+
+#### Check Queue Status
+
+```bash
+curl http://localhost:37777/api/pending-queue
+```
+
+**Response**:
+```json
+{
+  "queue": {
+    "messages": [
+      {
+        "id": 123,
+        "session_db_id": 45,
+        "claude_session_id": "abc123",
+        "message_type": "observation",
+        "status": "pending",
+        "retry_count": 0,
+        "created_at_epoch": 1730886600000
+      }
+    ],
+    "totalPending": 12,
+    "totalProcessing": 2,
+    "totalFailed": 0,
+    "stuckCount": 1
+  },
+  "recentlyProcessed": [...],
+  "sessionsWithPendingWork": [44, 45, 46]
+}
+```
+
+**Key Fields**:
+- `totalPending` - Messages waiting to process
+- `totalProcessing` - Messages currently processing
+- `stuckCount` - Processing messages >5 minutes old
+- `sessionsWithPendingWork` - Session IDs needing recovery
+
+#### Trigger Recovery
+
+```bash
+curl -X POST http://localhost:37777/api/pending-queue/process \
+  -H "Content-Type: application/json" \
+  -d '{"sessionLimit": 10}'
+```
+
+**Response**:
+```json
+{
+  "success": true,
+  "totalPendingSessions": 15,
+  "sessionsStarted": 10,
+  "sessionsSkipped": 2,
+  "startedSessionIds": [44, 45, 46, 47, 48, 49, 50, 51, 52, 53]
+}
+```
+
+**Response Fields**:
+- `totalPendingSessions` - Total sessions with pending messages in database
+- `sessionsStarted` - Sessions this request started processing
+- `sessionsSkipped` - Sessions already processing (prevents duplicate agents)
+- `startedSessionIds` - Database IDs of sessions we started
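
A script can use these fields to decide whether another recovery request is worthwhile. The sketch below is hedged: `sessionsStillWaiting` is a hypothetical helper, and it assumes that `totalPendingSessions` includes both the started and the skipped sessions, which this page does not state explicitly.

```typescript
interface ProcessResult {
  success: boolean;
  totalPendingSessions: number;
  sessionsStarted: number;
  sessionsSkipped: number;
  startedSessionIds: number[];
}

// Sessions that were neither started nor skipped, for example because of sessionLimit.
// Assumption: started and skipped sessions are both counted in totalPendingSessions.
function sessionsStillWaiting(result: ProcessResult): number {
  const remaining =
    result.totalPendingSessions - result.sessionsStarted - result.sessionsSkipped;
  return Math.max(0, remaining);
}
```

Under that assumption, the example response above (15 pending, 10 started, 2 skipped) leaves 3 sessions for a follow-up request.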
+
+## Best Practices
+
+### 1. Always Check Before Recovery
+
+```bash
+# Check queue status first
+curl http://localhost:37777/api/pending-queue
+
+# Or use CLI tool which checks automatically
+bun scripts/check-pending-queue.ts
+```
+
+### 2. Start with Low Session Limits
+
+```bash
+# Process only 5 sessions at a time
+bun scripts/check-pending-queue.ts --process --limit 5
+```
+
+This prevents overwhelming the worker with too many concurrent SDK agents.
+
+### 3. Monitor During Recovery
+
+Watch worker logs while recovery runs:
+
+```bash
+npm run worker:logs
+```
+
+Look for:
+- SDK agent starts: `Starting SDK agent for session...`
+- Processing completions: `Processed observation...`
+- Errors: `ERROR` or `Failed to process...`
+
+### 4. Verify Recovery Success
+
+Check recently processed messages:
+
+```bash
+curl http://localhost:37777/api/pending-queue | jq '.recentlyProcessed'
+```
+
+Or use the CLI tool which shows this automatically.
+
+### 5. Handle Failed Messages
+
+Messages that fail 3 times are marked `failed` and won't auto-retry:
+
+```bash
+# View failed messages
+sqlite3 ~/.claude-mem/claude-mem.db "
+  SELECT id, session_db_id, message_type, retry_count
+  FROM pending_messages
+  WHERE status = 'failed'
+  ORDER BY completed_at_epoch DESC;
+"
+```
+
+You can manually reset them if needed:
+
+```bash
+sqlite3 ~/.claude-mem/claude-mem.db "
+  UPDATE pending_messages
+  SET status = 'pending', retry_count = 0
+  WHERE status = 'failed';
+"
+```
+
+## Troubleshooting
+
+### Recovery Not Working
+
+**Symptom**: Triggered recovery but messages still pending
+
+**Solutions**:
+
+1. **Verify worker health**:
+   ```bash
+   curl http://localhost:37777/health
+   ```
+
+2. **Check worker logs for errors**:
+   ```bash
+   npm run worker:logs | grep -i error
+   ```
+
+3. **Restart worker**:
+   ```bash
+   claude-mem restart
+   ```
+
+4. **Check database integrity**:
+   ```bash
+   sqlite3 ~/.claude-mem/claude-mem.db "PRAGMA integrity_check;"
+   ```
+
+### Messages Stuck Forever
+
+**Symptom**: Messages show as "processing" for hours
+
+**Solution**: Force reset stuck messages
+
+```bash
+# Reset all stuck messages to pending
+sqlite3 ~/.claude-mem/claude-mem.db "
+  UPDATE pending_messages
+  SET status = 'pending', started_processing_at_epoch = NULL
+  WHERE status = 'processing';
+"
+
+# Then trigger recovery
+bun scripts/check-pending-queue.ts --process
+```
+
+### Worker Crashes During Recovery
+
+**Symptom**: Worker stops while processing recovered messages
+
+**Solutions**:
+
+1. **Check available memory**:
+   ```bash
+   npm run worker:status
+   ```
+
+2. **Reduce session limit**:
+   ```bash
+   bun scripts/check-pending-queue.ts --process --limit 3
+   ```
+
+3. **Check for SDK errors in logs**:
+   ```bash
+   npm run worker:logs | grep -i "sdk"
+   ```
+
+4. **Increase worker memory** (if using custom runner):
+   ```bash
+   export NODE_OPTIONS="--max-old-space-size=4096"
+   claude-mem restart
+   ```
+
+## Advanced Usage
+
+### Direct Database Inspection
+
+View all pending messages:
+
+```bash
+sqlite3 ~/.claude-mem/claude-mem.db "
+  SELECT
+    id,
+    session_db_id,
+    message_type,
+    status,
+    retry_count,
+    datetime(created_at_epoch/1000, 'unixepoch') as created_at,
+    datetime(started_processing_at_epoch/1000, 'unixepoch') as started_at,
+    CAST((strftime('%s', 'now') * 1000 - started_processing_at_epoch) / 60000 AS INTEGER) as age_minutes
+  FROM pending_messages
+  WHERE status IN ('pending', 'processing')
+  ORDER BY created_at_epoch;
+"
+```
+
+### Count Messages by Status
+
+```bash
+sqlite3 ~/.claude-mem/claude-mem.db "
+  SELECT status, COUNT(*) as count
+  FROM pending_messages
+  GROUP BY status;
+"
+```
+
+### Find Sessions with Pending Work
+
+```bash
+sqlite3 ~/.claude-mem/claude-mem.db "
+  SELECT
+    session_db_id,
+    COUNT(*) as pending_count,
+    GROUP_CONCAT(message_type) as message_types
+  FROM pending_messages
+  WHERE status IN ('pending', 'processing')
+  GROUP BY session_db_id;
+"
+```
+
+### View Recent Failures
+
+```bash
+sqlite3 ~/.claude-mem/claude-mem.db "
+  SELECT
+    id,
+    session_db_id,
+    message_type,
+    retry_count,
+    datetime(completed_at_epoch/1000, 'unixepoch') as failed_at
+  FROM pending_messages
+  WHERE status = 'failed'
+  ORDER BY completed_at_epoch DESC
+  LIMIT 10;
+"
+```
+
+## Integration Examples
+
+### Cron Job for Automatic Recovery
+
+```bash
+#!/bin/bash
+# Run every hour to process stuck queues.
+# Run from the claude-mem repository root so that scripts/check-pending-queue.ts resolves.
+
+# Check if worker is healthy
+if curl -f http://localhost:37777/health > /dev/null 2>&1; then
+  # Auto-process up to 5 sessions
+  bun scripts/check-pending-queue.ts --process --limit 5
+else
+  echo "Worker not healthy, skipping recovery"
+  exit 1
+fi
+```
+
+### Monitoring Script
+
+```bash
+#!/bin/bash
+# Alert if stuck count exceeds threshold
+
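+# -f makes curl fail on HTTP errors; "// 0" substitutes 0 if the field is missing
+STUCK_COUNT=$(curl -sf http://localhost:37777/api/pending-queue | jq -r '.queue.stuckCount // 0')
+
+# Default to 0 when the worker is unreachable so the numeric test does not error out
+if [ "${STUCK_COUNT:-0}" -gt 5 ]; then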
+  echo "WARNING: $STUCK_COUNT stuck messages detected"
+  # Send alert (email, Slack, etc.)
+fi
+```
+
+### Pre-Shutdown Recovery
+
+```bash
+#!/bin/bash
+# Process pending queues before system shutdown
+
+echo "Processing pending queues before shutdown..."
+bun scripts/check-pending-queue.ts --process --limit 20
+
+echo "Waiting for processing to complete..."
+sleep 10
+
+echo "Stopping worker..."
+claude-mem stop
+```
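
A fixed `sleep 10` does not guarantee that processing has finished. As an alternative, a script could poll the queue status until it drains. The following TypeScript sketch is illustrative only: `isDrained` and `waitForDrain` are hypothetical helpers, the status fetcher is injected by the caller, and the default of 30 attempts at 1-second intervals is an arbitrary assumption.

```typescript
// The two counters from the `queue` object that matter for draining.
interface QueueCounts {
  totalPending: number;
  totalProcessing: number;
}

// The queue is drained when nothing is waiting and nothing is in flight.
function isDrained(counts: QueueCounts): boolean {
  return counts.totalPending === 0 && counts.totalProcessing === 0;
}

// Poll until drained or until maxAttempts is reached.
// `fetchCounts` is supplied by the caller, e.g. a wrapper around GET /api/pending-queue.
async function waitForDrain(
  fetchCounts: () => Promise<QueueCounts>,
  maxAttempts = 30,
  intervalMs = 1000,
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (isDrained(await fetchCounts())) {
      return true;
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  return false; // Timed out with work still queued.
}
```

A `false` result lets the calling script decide whether to stop the worker anyway or to abort the shutdown.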
+
+## Migration Note
+
+If you're upgrading from v4.x to v5.x:
+
+**v4.x Behavior** (Automatic Recovery):
+- Worker automatically recovered stuck messages on startup
+- No user control over reprocessing timing
+
+**v5.x Behavior** (Manual Recovery):
+- Stuck messages detected but NOT automatically reprocessed
+- User must explicitly trigger recovery via CLI or API
+- Prevents unexpected duplicate observations
+- Provides explicit control over when processing happens
+
+**Migration Steps**:
+1. Upgrade to v5.x
+2. Check for stuck messages: `bun scripts/check-pending-queue.ts`
+3. Process if needed: `bun scripts/check-pending-queue.ts --process`
+4. Add recovery to your workflow (cron job, pre-shutdown script, etc.)
+
+## See Also
+
+- [Worker Service Architecture](../architecture/worker-service) - Technical details on queue processing
+- [Troubleshooting - Manual Recovery](../troubleshooting#manual-recovery-for-stuck-observations) - Common issues and solutions
+- [Database Schema](../architecture/database) - Pending messages table structure