feat: Fix observation timestamps, refactor session management, and enhance worker reliability (#437)

* Refactor worker version checks and increase timeout settings

- Updated the default hook timeout from 5000ms to 120000ms for improved stability.
- Modified the worker version check to log a warning instead of restarting the worker on version mismatch.
- Removed legacy PM2 cleanup and worker start logic, simplifying the ensureWorkerRunning function.
- Enhanced polling mechanism for worker readiness with increased retries and reduced interval.

* feat: implement worker queue polling to ensure processing completion before proceeding

* refactor: change worker command from start to restart in hooks configuration

* refactor: remove session management complexity

- Simplify createSDKSession to pure INSERT OR IGNORE
- Remove auto-create logic from storeObservation/storeSummary
- Delete 11 unused session management methods
- Derive prompt_number from user_prompts count
- Keep sdk_sessions table schema unchanged for compatibility

* refactor: simplify session management by removing unused methods and auto-creation logic

* Refactor session prompt number retrieval in SessionRoutes

- Updated the method of obtaining the prompt number from the session.
- Replaced `store.getPromptCounter(sessionDbId)` with `store.getPromptNumberFromUserPrompts(claudeSessionId)` for better clarity and accuracy.
- Adjusted the logic for incrementing the prompt number to derive it from the user prompts count instead of directly incrementing a counter.

* refactor: replace getPromptCounter with getPromptNumberFromUserPrompts in SessionManager

Phase 7 of session management simplification. Updates SessionManager to derive
prompt numbers from user_prompts table count instead of using the deprecated
prompt_counter column.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* refactor: simplify SessionCompletionHandler to use direct SQL query

Phase 8: Remove call to findActiveSDKSession() and replace with direct
database query in SessionCompletionHandler.completeByClaudeId().

This removes dependency on the deleted findActiveSDKSession() method
and simplifies the code by using a straightforward SELECT query.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* refactor: remove markSessionCompleted call from SDKAgent

- Delete call to markSessionCompleted() in SDKAgent.ts
- Session status is no longer tracked or updated
- Part of phase 9: simplifying session management

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* refactor: remove markSessionComplete method (Phase 10)

- Deleted markSessionComplete() method from DatabaseManager
- Removed markSessionComplete call from SessionCompletionHandler
- Session completion status no longer tracked in database
- Part of session management simplification effort

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* refactor: replace deleted updateSDKSessionId calls in import script (Phase 11)

- Replace updateSDKSessionId() calls with direct SQL UPDATE statements
- Method was deleted in Phase 3 as part of session management simplification
- Import script now uses direct database access consistently

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* test: add validation for SQL updates in sdk_sessions table

* refactor: enhance worker-cli to support manual and automated runs

* Remove cleanup hook and associated session completion logic

- Deleted the cleanup-hook implementation from the hooks directory.
- Removed the session completion endpoint that was used by the cleanup hook.
- Updated the SessionCompletionHandler to eliminate the completeByClaudeId method and its dependencies.
- Adjusted the SessionRoutes to reflect the removal of the session completion route.

* fix: update worker-cli command to use bun for consistency

* feat: Implement timestamp fix for observations and enhance processing logic

- Added `earliestPendingTimestamp` to `ActiveSession` to track the original timestamp of the earliest pending message.
- Updated `SDKAgent` to capture and utilize the earliest pending timestamp during response processing.
- Modified `SessionManager` to track the earliest timestamp when yielding messages.
- Created scripts for fixing corrupted timestamps, validating fixes, and investigating timestamp issues.
- Verified that all corrupted observations have been repaired and logic for future processing is sound.
- Ensured orphan processing can be safely re-enabled after validation.

* feat: Enhance SessionStore to support custom database paths and add timestamp fields for observations and summaries

* Refactor pending queue processing and add management endpoints

- Disabled automatic recovery of orphaned queues on startup; users must now use the new /api/pending-queue/process endpoint.
- Updated processOrphanedQueues method to processPendingQueues with improved session handling and return detailed results.
- Added new API endpoints for managing pending queues: GET /api/pending-queue and POST /api/pending-queue/process.
- Introduced a new script (check-pending-queue.ts) for checking and processing pending observation queues interactively or automatically.
- Enhanced logging and error handling for better monitoring of session processing.

* updated agent sdk

* feat: Add manual recovery guide and queue management endpoints to documentation

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
Alex Newman
2025-12-25 15:36:46 -05:00
committed by GitHub
parent 4cf2c1bdb1
commit 266c746d50
47 changed files with 3417 additions and 1026 deletions
+103 -6
View File
@@ -19,7 +19,7 @@ The worker service is a long-running HTTP API built with Express.js and managed
## REST API Endpoints
The worker service exposes 20 HTTP endpoints organized into five categories:
The worker service exposes 22 HTTP endpoints organized into six categories:
### Viewer & Health Endpoints
@@ -385,9 +385,106 @@ POST /api/settings
}
```
### Queue Management Endpoints
#### 16. Get Pending Queue Status
```
GET /api/pending-queue
```
**Purpose**: View current processing queue status and identify stuck messages
**Response**:
```json
{
"queue": {
"messages": [
{
"id": 123,
"session_db_id": 45,
"claude_session_id": "abc123",
"message_type": "observation",
"status": "pending",
"retry_count": 0,
"created_at_epoch": 1730886600000,
"started_processing_at_epoch": null,
"completed_at_epoch": null
}
],
"totalPending": 5,
"totalProcessing": 2,
"totalFailed": 0,
"stuckCount": 1
},
"recentlyProcessed": [
{
"id": 122,
"session_db_id": 44,
"status": "processed",
"completed_at_epoch": 1730886500000
}
],
"sessionsWithPendingWork": [44, 45, 46]
}
```
**Status Definitions**:
- `pending`: Message queued, not yet processed
- `processing`: Message currently being processed by SDK agent
- `processed`: Message completed successfully
- `failed`: Message failed after max retry attempts (3 by default)
**Stuck Detection**: Messages in `processing` status for >5 minutes are considered stuck and included in `stuckCount`
**Use Case**: Check queue health after worker crashes or restarts to identify unprocessed observations
#### 17. Trigger Manual Recovery
```
POST /api/pending-queue/process
```
**Purpose**: Manually trigger processing of pending queues (replaces automatic recovery in v5.x+)
**Request Body**:
```json
{
"sessionLimit": 10
}
```
**Body Parameters**:
- `sessionLimit` (optional): Maximum number of sessions to process (default: 10, max: 100)
**Response**:
```json
{
"success": true,
"totalPendingSessions": 15,
"sessionsStarted": 10,
"sessionsSkipped": 2,
"startedSessionIds": [44, 45, 46, 47, 48, 49, 50, 51, 52, 53]
}
```
**Response Fields**:
- `totalPendingSessions`: Total sessions with pending messages in database
- `sessionsStarted`: Number of sessions we started processing this request
- `sessionsSkipped`: Sessions already actively processing (not restarted)
- `startedSessionIds`: Database IDs of sessions started
**Behavior**:
- Processes up to `sessionLimit` sessions with pending work
- Skips sessions already actively processing (prevents duplicate agents)
- Starts non-blocking SDK agents for each session
- Returns immediately with status (processing continues in background)
**Use Case**: Manually recover stuck observations after worker crashes, or when automatic recovery was disabled
**Recovery Strategy Note**: As of v5.x, automatic recovery on worker startup is disabled by default. Users must manually trigger recovery using this endpoint or the CLI tool (`bun scripts/check-pending-queue.ts`) to maintain explicit control over reprocessing.
### Session Management Endpoints
#### 16. Initialize Session
#### 19. Initialize Session
```
POST /sessions/:sessionDbId/init
```
@@ -408,7 +505,7 @@ POST /sessions/:sessionDbId/init
}
```
#### 17. Add Observation
#### 20. Add Observation
```
POST /sessions/:sessionDbId/observations
```
@@ -431,7 +528,7 @@ POST /sessions/:sessionDbId/observations
}
```
#### 18. Generate Summary
#### 21. Generate Summary
```
POST /sessions/:sessionDbId/summarize
```
@@ -451,7 +548,7 @@ POST /sessions/:sessionDbId/summarize
}
```
#### 19. Session Status
#### 22. Session Status
```
GET /sessions/:sessionDbId/status
```
@@ -466,7 +563,7 @@ GET /sessions/:sessionDbId/status
}
```
#### 20. Delete Session
#### 23. Delete Session
```
DELETE /sessions/:sessionDbId
```
+131 -27
View File
@@ -371,45 +371,149 @@ npm test
## Testing
### Running Tests
### Testing Philosophy
Claude-mem relies on **real-world usage and manual testing** rather than traditional unit tests. The project philosophy prioritizes:
1. **Manual verification** - Testing features in actual Claude Code sessions
2. **Integration testing** - Running the full system end-to-end
3. **Database inspection** - Verifying data correctness via SQLite queries
4. **CLI tools** - Interactive tools for checking system state
5. **Observability** - Comprehensive logging and worker health checks
This approach was chosen because:
- Hook behavior depends heavily on Claude Code's runtime environment
- SDK interactions require real API calls and responses
- SQLite and Bun runtime provide stability guarantees
- Manual testing catches integration issues that unit tests miss
### Manual Testing Workflow
When developing new features:
1. **Build and sync**:
```bash
npm run build
npm run sync-marketplace
claude-mem restart
```
2. **Test in real session**:
- Start Claude Code
- Trigger the feature you're testing
- Verify expected behavior
3. **Check database state**:
```bash
sqlite3 ~/.claude-mem/claude-mem.db "SELECT * FROM your_table;"
```
4. **Monitor worker logs**:
```bash
npm run worker:logs
```
5. **Verify queue health** (for recovery features):
```bash
bun scripts/check-pending-queue.ts
```
### Testing Tools
**Health Checks**:
```bash
# All tests
npm test
# Worker status
npm run worker:status
# Specific test file
node --test tests/your-test.test.ts
# Queue inspection
curl http://localhost:37777/api/pending-queue
# With coverage (if configured)
npm test -- --coverage
# Database integrity
sqlite3 ~/.claude-mem/claude-mem.db "PRAGMA integrity_check;"
```
### Writing Tests
**Hook Testing**:
```bash
# Test context hook manually
echo '{"session_id":"test-123","cwd":"'$(pwd)'","source":"startup"}' | node plugin/scripts/context-hook.js
Create test files in `tests/`:
```typescript
import { describe, it } from 'node:test';
import assert from 'node:assert';
describe('YourFeature', () => {
it('should do something', () => {
// Test implementation
assert.strictEqual(result, expected);
});
});
# Test new hook
echo '{"session_id":"test-123","cwd":"'$(pwd)'","prompt":"test"}' | node plugin/scripts/new-hook.js
```
### Test Database
**Data Verification**:
```bash
# Check recent observations
sqlite3 ~/.claude-mem/claude-mem.db "
SELECT id, tool_name, created_at
FROM observations
ORDER BY created_at_epoch DESC
LIMIT 10;
"
Use a separate test database:
```typescript
import { SessionStore } from '../src/services/sqlite/SessionStore';
const store = new SessionStore(':memory:'); // In-memory database
# Check summaries
sqlite3 ~/.claude-mem/claude-mem.db "
SELECT id, request, completed
FROM session_summaries
ORDER BY created_at_epoch DESC
LIMIT 5;
"
```
### Recovery Feature Testing
For manual recovery features specifically:
1. **Simulate stuck messages**:
```bash
# Manually create stuck message (for testing only)
sqlite3 ~/.claude-mem/claude-mem.db "
UPDATE pending_messages
SET status = 'processing',
started_processing_at_epoch = strftime('%s', 'now', '-10 minutes') * 1000
WHERE id = 123;
"
```
2. **Test recovery**:
```bash
bun scripts/check-pending-queue.ts
```
3. **Verify results**:
```bash
curl http://localhost:37777/api/pending-queue | jq '.queue'
```
### Regression Testing
Before releasing:
1. **Test all hook triggers**:
- SessionStart: Start new Claude Code session
- UserPromptSubmit: Submit a prompt
- PostToolUse: Use a tool like Read
- Summary: Let session complete
- SessionEnd: Close Claude Code
2. **Test core features**:
- Context injection (recent sessions appear)
- Observation processing (summaries generated)
- MCP search tools (search returns results)
- Viewer UI (loads at http://localhost:37777)
- Manual recovery (stuck messages recovered)
3. **Test edge cases**:
- Worker crash recovery
- Database locks
- Port conflicts
- Large databases
4. **Cross-platform** (if applicable):
- macOS
- Linux
- Windows
## Code Style
### TypeScript Guidelines
+1
View File
@@ -40,6 +40,7 @@
"usage/claude-desktop",
"usage/private-tags",
"usage/export-import",
"usage/manual-recovery",
"beta-features",
"endless-mode"
]
+195
View File
@@ -285,6 +285,201 @@ The skill includes comprehensive diagnostics, automated repair sequences, and de
claude-mem restart
```
### Manual Recovery for Stuck Observations
**Symptoms**: Observations stuck in processing queue after worker crash or restart, no new summaries appearing despite worker running.
**Background**: As of v5.x, automatic queue recovery on worker startup is disabled. Users must manually trigger recovery to maintain explicit control over reprocessing and prevent unexpected duplicate observations.
**Solutions**:
#### Option 1: Use CLI Recovery Tool (Recommended)
The interactive CLI tool provides the safest and most user-friendly recovery experience:
```bash
# Check queue status and prompt for recovery
bun scripts/check-pending-queue.ts
# Auto-process without prompting
bun scripts/check-pending-queue.ts --process
# Process up to 5 sessions
bun scripts/check-pending-queue.ts --process --limit 5
```
**What it does**:
- ✅ Checks worker health before proceeding
- ✅ Shows detailed queue summary (pending, processing, failed, stuck)
- ✅ Groups messages by session with age and status breakdown
- ✅ Prompts user to confirm processing (unless `--process` flag used)
- ✅ Shows recently processed messages for feedback
**Interactive Example**:
```
Worker is healthy ✓
Queue Summary:
Pending: 12 messages
Processing: 2 messages (1 stuck)
Failed: 0 messages
Recently Processed: 5 messages in last 30 minutes
Sessions with pending work: 3
Session 44: 5 pending, 1 processing (age: 2m)
Session 45: 4 pending, 1 processing (age: 7m - STUCK)
Session 46: 2 pending
Would you like to process these pending queues? (y/n)
```
#### Option 2: Use HTTP API Directly
For automation or scripting scenarios:
1. **Check queue status**:
```bash
curl http://localhost:37777/api/pending-queue
```
Response shows:
- `queue.totalPending`: Messages waiting to process
- `queue.totalProcessing`: Messages currently processing
- `queue.stuckCount`: Processing messages >5 minutes old
- `sessionsWithPendingWork`: Session IDs needing recovery
2. **Trigger manual recovery**:
```bash
curl -X POST http://localhost:37777/api/pending-queue/process \
-H "Content-Type: application/json" \
-d '{"sessionLimit": 10}'
```
Response includes:
- `totalPendingSessions`: Total sessions with pending messages
- `sessionsStarted`: Number of sessions we started processing
- `sessionsSkipped`: Sessions already processing (not restarted)
- `startedSessionIds`: Database IDs of sessions started
#### Understanding Queue States
Messages progress through these states:
1. **pending** - Queued, waiting to process
2. **processing** - Currently being processed by SDK agent
3. **processed** - Completed successfully
4. **failed** - Failed after 3 retry attempts
**Stuck Detection**: Messages in `processing` state for >5 minutes are considered stuck and automatically reset to `pending` on worker startup (but not automatically reprocessed).
#### Recovery Strategy
**When to use manual recovery**:
- After worker crashes or unexpected restarts
- When observations appear saved but no summaries generated
- When queue status shows stuck messages (processing >5 minutes)
- After system crashes or forced shutdowns
**Best practices**:
1. Always check queue status before triggering recovery
2. Use the CLI tool for interactive sessions (provides feedback)
3. Use the HTTP API for automation/scripting
4. Start with a low session limit (5-10) to avoid overwhelming the worker
5. Monitor worker logs during recovery: `npm run worker:logs`
6. Check recently processed messages to confirm recovery worked
#### Troubleshooting Recovery Issues
If recovery fails or messages remain stuck:
1. **Verify worker is healthy**:
```bash
curl http://localhost:37777/health
# Should return: {"status":"ok","uptime":12345,"port":37777}
```
2. **Check database for corruption**:
```bash
sqlite3 ~/.claude-mem/claude-mem.db "PRAGMA integrity_check;"
```
3. **View stuck messages directly**:
```bash
sqlite3 ~/.claude-mem/claude-mem.db "
SELECT id, session_db_id, status, retry_count,
(strftime('%s', 'now') * 1000 - started_processing_at_epoch) / 60000 as age_minutes
FROM pending_messages
WHERE status = 'processing'
ORDER BY started_processing_at_epoch;
"
```
4. **Force reset stuck messages** (nuclear option):
```bash
sqlite3 ~/.claude-mem/claude-mem.db "
UPDATE pending_messages
SET status = 'pending', started_processing_at_epoch = NULL
WHERE status = 'processing';
"
```
Then trigger recovery:
```bash
bun scripts/check-pending-queue.ts --process
```
5. **Check worker logs for SDK errors**:
```bash
npm run worker:logs | grep -i error
```
#### Understanding the Queue Table
The `pending_messages` table tracks all messages with these key fields:
```sql
CREATE TABLE pending_messages (
id INTEGER PRIMARY KEY,
session_db_id INTEGER, -- Foreign key to sdk_sessions
claude_session_id TEXT, -- Claude session ID
message_type TEXT, -- 'observation' | 'summarize'
status TEXT, -- 'pending' | 'processing' | 'processed' | 'failed'
retry_count INTEGER, -- Current retry attempt (max: 3)
created_at_epoch INTEGER, -- When message was queued
started_processing_at_epoch INTEGER, -- When marked 'processing'
completed_at_epoch INTEGER -- When completed/failed
)
```
**Query examples**:
```bash
# Count messages by status
sqlite3 ~/.claude-mem/claude-mem.db "
SELECT status, COUNT(*)
FROM pending_messages
GROUP BY status;
"
# Find sessions with pending work
sqlite3 ~/.claude-mem/claude-mem.db "
SELECT session_db_id, COUNT(*) as pending_count
FROM pending_messages
WHERE status IN ('pending', 'processing')
GROUP BY session_db_id;
"
# View recent failures
sqlite3 ~/.claude-mem/claude-mem.db "
SELECT id, session_db_id, message_type, retry_count,
datetime(completed_at_epoch/1000, 'unixepoch') as failed_at
FROM pending_messages
WHERE status = 'failed'
ORDER BY completed_at_epoch DESC
LIMIT 10;
"
```
## Hook Issues
### Hooks Not Firing
+450
View File
@@ -0,0 +1,450 @@
---
title: "Manual Recovery"
description: "Recover stuck observations after worker crashes or restarts"
---
# Manual Recovery Guide
## Overview
Claude-mem's manual recovery system helps you recover observations that get stuck in the processing queue after worker crashes, system restarts, or unexpected shutdowns.
**Key Change in v5.x**: Automatic recovery on worker startup is now disabled. This gives you explicit control over when reprocessing happens, preventing unexpected duplicate observations.
## When Do You Need Manual Recovery?
You should trigger manual recovery when:
- **Worker crashed or restarted** - Observations were queued but worker stopped before processing
- **No new summaries appearing** - Observations are being saved but not processed into summaries
- **Stuck messages detected** - Messages showing as "processing" for >5 minutes
- **System crashes** - Unexpected shutdowns left messages in incomplete states
## Quick Start
### Using the CLI Tool (Recommended)
The interactive CLI tool is the safest and easiest way to recover stuck observations:
```bash
# Check status and prompt for recovery
bun scripts/check-pending-queue.ts
```
This will:
1. Check worker health
2. Show queue summary (pending, processing, failed, stuck counts)
3. Display sessions with pending work
4. Prompt you to confirm recovery
5. Show recently processed messages for feedback
### Auto-Process Without Prompts
For scripting or when you're confident recovery is needed:
```bash
# Auto-process without prompting
bun scripts/check-pending-queue.ts --process
# Limit to 5 sessions
bun scripts/check-pending-queue.ts --process --limit 5
```
## Understanding Queue States
Messages progress through these lifecycle states:
1. **pending** → Queued, waiting to process
2. **processing** → Currently being processed by SDK agent
3. **processed** → Completed successfully
4. **failed** → Failed after 3 retry attempts
### Stuck Detection
Messages in `processing` state for **>5 minutes** are considered stuck:
- They're automatically reset to `pending` on worker startup
- They're NOT automatically reprocessed (requires manual trigger)
- They appear in the `stuckCount` field when checking queue status
## Recovery Methods
### Method 1: Interactive CLI Tool
**Best for**: Regular users, interactive sessions, when you want visibility into what's happening
```bash
bun scripts/check-pending-queue.ts
```
**Example Output**:
```
Checking worker health...
Worker is healthy ✓
Queue Summary:
Pending: 12 messages
Processing: 2 messages (1 stuck)
Failed: 0 messages
Recently Processed: 5 messages in last 30 minutes
Sessions with pending work: 3
Session 44: 5 pending, 1 processing (age: 2m)
Session 45: 4 pending, 1 processing (age: 7m - STUCK)
Session 46: 2 pending
Would you like to process these pending queues? (y/n)
```
**Features**:
- ✅ Pre-flight health check (verifies worker is running)
- ✅ Detailed queue breakdown by session
- ✅ Age tracking for stuck detection
- ✅ Confirmation prompt (prevents accidental reprocessing)
- ✅ Non-interactive mode with `--process` flag
- ✅ Session limit control with `--limit N`
### Method 2: HTTP API
**Best for**: Automation, scripting, integration with monitoring systems
#### Check Queue Status
```bash
curl http://localhost:37777/api/pending-queue
```
**Response**:
```json
{
"queue": {
"messages": [
{
"id": 123,
"session_db_id": 45,
"claude_session_id": "abc123",
"message_type": "observation",
"status": "pending",
"retry_count": 0,
"created_at_epoch": 1730886600000
}
],
"totalPending": 12,
"totalProcessing": 2,
"totalFailed": 0,
"stuckCount": 1
},
"recentlyProcessed": [...],
"sessionsWithPendingWork": [44, 45, 46]
}
```
**Key Fields**:
- `totalPending` - Messages waiting to process
- `totalProcessing` - Messages currently processing
- `stuckCount` - Processing messages >5 minutes old
- `sessionsWithPendingWork` - Session IDs needing recovery
#### Trigger Recovery
```bash
curl -X POST http://localhost:37777/api/pending-queue/process \
-H "Content-Type: application/json" \
-d '{"sessionLimit": 10}'
```
**Response**:
```json
{
"success": true,
"totalPendingSessions": 15,
"sessionsStarted": 10,
"sessionsSkipped": 2,
"startedSessionIds": [44, 45, 46, 47, 48, 49, 50, 51, 52, 53]
}
```
**Response Fields**:
- `totalPendingSessions` - Total sessions with pending messages in database
- `sessionsStarted` - Sessions we started processing this request
- `sessionsSkipped` - Sessions already processing (prevents duplicate agents)
- `startedSessionIds` - Database IDs of sessions we started
## Best Practices
### 1. Always Check Before Recovery
```bash
# Check queue status first
curl http://localhost:37777/api/pending-queue
# Or use CLI tool which checks automatically
bun scripts/check-pending-queue.ts
```
### 2. Start with Low Session Limits
```bash
# Process only 5 sessions at a time
bun scripts/check-pending-queue.ts --process --limit 5
```
This prevents overwhelming the worker with too many concurrent SDK agents.
### 3. Monitor During Recovery
Watch worker logs while recovery runs:
```bash
npm run worker:logs
```
Look for:
- SDK agent starts: `Starting SDK agent for session...`
- Processing completions: `Processed observation...`
- Errors: `ERROR` or `Failed to process...`
### 4. Verify Recovery Success
Check recently processed messages:
```bash
curl http://localhost:37777/api/pending-queue | jq '.recentlyProcessed'
```
Or use the CLI tool which shows this automatically.
### 5. Handle Failed Messages
Messages that fail 3 times are marked `failed` and won't auto-retry:
```bash
# View failed messages
sqlite3 ~/.claude-mem/claude-mem.db "
SELECT id, session_db_id, message_type, retry_count
FROM pending_messages
WHERE status = 'failed'
ORDER BY completed_at_epoch DESC;
"
```
You can manually reset them if needed:
```bash
sqlite3 ~/.claude-mem/claude-mem.db "
UPDATE pending_messages
SET status = 'pending', retry_count = 0
WHERE status = 'failed';
"
```
## Troubleshooting
### Recovery Not Working
**Symptom**: Triggered recovery but messages still pending
**Solutions**:
1. **Verify worker health**:
```bash
curl http://localhost:37777/health
```
2. **Check worker logs for errors**:
```bash
npm run worker:logs | grep -i error
```
3. **Restart worker**:
```bash
claude-mem restart
```
4. **Check database integrity**:
```bash
sqlite3 ~/.claude-mem/claude-mem.db "PRAGMA integrity_check;"
```
### Messages Stuck Forever
**Symptom**: Messages show as "processing" for hours
**Solution**: Force reset stuck messages
```bash
# Reset all stuck messages to pending
sqlite3 ~/.claude-mem/claude-mem.db "
UPDATE pending_messages
SET status = 'pending', started_processing_at_epoch = NULL
WHERE status = 'processing';
"
# Then trigger recovery
bun scripts/check-pending-queue.ts --process
```
### Worker Crashes During Recovery
**Symptom**: Worker stops while processing recovered messages
**Solutions**:
1. **Check available memory**:
```bash
npm run worker:status
```
2. **Reduce session limit**:
```bash
bun scripts/check-pending-queue.ts --process --limit 3
```
3. **Check for SDK errors in logs**:
```bash
npm run worker:logs | grep -i "sdk"
```
4. **Increase worker memory** (if using custom runner):
```bash
export NODE_OPTIONS="--max-old-space-size=4096"
claude-mem restart
```
## Advanced Usage
### Direct Database Inspection
View all pending messages:
```bash
sqlite3 ~/.claude-mem/claude-mem.db "
SELECT
id,
session_db_id,
message_type,
status,
retry_count,
datetime(created_at_epoch/1000, 'unixepoch') as created_at,
datetime(started_processing_at_epoch/1000, 'unixepoch') as started_at,
CAST((strftime('%s', 'now') * 1000 - started_processing_at_epoch) / 60000 AS INTEGER) as age_minutes
FROM pending_messages
WHERE status IN ('pending', 'processing')
ORDER BY created_at_epoch;
"
```
### Count Messages by Status
```bash
sqlite3 ~/.claude-mem/claude-mem.db "
SELECT status, COUNT(*) as count
FROM pending_messages
GROUP BY status;
"
```
### Find Sessions with Pending Work
```bash
sqlite3 ~/.claude-mem/claude-mem.db "
SELECT
session_db_id,
COUNT(*) as pending_count,
GROUP_CONCAT(message_type) as message_types
FROM pending_messages
WHERE status IN ('pending', 'processing')
GROUP BY session_db_id;
"
```
### View Recent Failures
```bash
sqlite3 ~/.claude-mem/claude-mem.db "
SELECT
id,
session_db_id,
message_type,
retry_count,
datetime(completed_at_epoch/1000, 'unixepoch') as failed_at
FROM pending_messages
WHERE status = 'failed'
ORDER BY completed_at_epoch DESC
LIMIT 10;
"
```
## Integration Examples
### Cron Job for Automatic Recovery
```bash
#!/bin/bash
# Run every hour to process stuck queues
# Check if worker is healthy
if curl -f http://localhost:37777/health > /dev/null 2>&1; then
# Auto-process up to 5 sessions
bun scripts/check-pending-queue.ts --process --limit 5
else
echo "Worker not healthy, skipping recovery"
exit 1
fi
```
### Monitoring Script
```bash
#!/bin/bash
# Alert if stuck count exceeds threshold
STUCK_COUNT=$(curl -s http://localhost:37777/api/pending-queue | jq '.queue.stuckCount')
if [ "$STUCK_COUNT" -gt 5 ]; then
echo "WARNING: $STUCK_COUNT stuck messages detected"
# Send alert (email, Slack, etc.)
fi
```
### Pre-Shutdown Recovery
```bash
#!/bin/bash
# Process pending queues before system shutdown
echo "Processing pending queues before shutdown..."
bun scripts/check-pending-queue.ts --process --limit 20
echo "Waiting for processing to complete..."
sleep 10
echo "Stopping worker..."
claude-mem stop
```
## Migration Note
If you're upgrading from v4.x to v5.x:
**v4.x Behavior** (Automatic Recovery):
- Worker automatically recovered stuck messages on startup
- No user control over reprocessing timing
**v5.x Behavior** (Manual Recovery):
- Stuck messages detected but NOT automatically reprocessed
- User must explicitly trigger recovery via CLI or API
- Prevents unexpected duplicate observations
- Provides explicit control over when processing happens
**Migration Steps**:
1. Upgrade to v5.x
2. Check for stuck messages: `bun scripts/check-pending-queue.ts`
3. Process if needed: `bun scripts/check-pending-queue.ts --process`
4. Add recovery to your workflow (cron job, pre-shutdown script, etc.)
## See Also
- [Worker Service Architecture](../architecture/worker-service) - Technical details on queue processing
- [Troubleshooting - Manual Recovery](../troubleshooting#manual-recovery-for-stuck-observations) - Common issues and solutions
- [Database Schema](../architecture/database) - Pending messages table structure