feat: Fix observation timestamps, refactor session management, and enhance worker reliability (#437)
* Refactor worker version checks and increase timeout settings
  - Updated the default hook timeout from 5000ms to 120000ms for improved stability.
  - Modified the worker version check to log a warning instead of restarting the worker on version mismatch.
  - Removed legacy PM2 cleanup and worker start logic, simplifying the ensureWorkerRunning function.
  - Enhanced polling mechanism for worker readiness with increased retries and reduced interval.

* feat: implement worker queue polling to ensure processing completion before proceeding

* refactor: change worker command from start to restart in hooks configuration

* refactor: remove session management complexity
  - Simplify createSDKSession to pure INSERT OR IGNORE
  - Remove auto-create logic from storeObservation/storeSummary
  - Delete 11 unused session management methods
  - Derive prompt_number from user_prompts count
  - Keep sdk_sessions table schema unchanged for compatibility

* refactor: simplify session management by removing unused methods and auto-creation logic

* Refactor session prompt number retrieval in SessionRoutes
  - Updated the method of obtaining the prompt number from the session.
  - Replaced `store.getPromptCounter(sessionDbId)` with `store.getPromptNumberFromUserPrompts(claudeSessionId)` for better clarity and accuracy.
  - Adjusted the logic for incrementing the prompt number to derive it from the user prompts count instead of directly incrementing a counter.

* refactor: replace getPromptCounter with getPromptNumberFromUserPrompts in SessionManager

  Phase 7 of session management simplification. Updates SessionManager to derive prompt numbers from user_prompts table count instead of using the deprecated prompt_counter column.

  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* refactor: simplify SessionCompletionHandler to use direct SQL query

  Phase 8: Remove call to findActiveSDKSession() and replace with direct database query in SessionCompletionHandler.completeByClaudeId(). This removes dependency on the deleted findActiveSDKSession() method and simplifies the code by using a straightforward SELECT query.

* refactor: remove markSessionCompleted call from SDKAgent
  - Delete call to markSessionCompleted() in SDKAgent.ts
  - Session status is no longer tracked or updated
  - Part of phase 9: simplifying session management

* refactor: remove markSessionComplete method (Phase 10)
  - Deleted markSessionComplete() method from DatabaseManager
  - Removed markSessionComplete call from SessionCompletionHandler
  - Session completion status no longer tracked in database
  - Part of session management simplification effort

* refactor: replace deleted updateSDKSessionId calls in import script (Phase 11)
  - Replace updateSDKSessionId() calls with direct SQL UPDATE statements
  - Method was deleted in Phase 3 as part of session management simplification
  - Import script now uses direct database access consistently

* test: add validation for SQL updates in sdk_sessions table

* refactor: enhance worker-cli to support manual and automated runs

* Remove cleanup hook and associated session completion logic
  - Deleted the cleanup-hook implementation from the hooks directory.
  - Removed the session completion endpoint that was used by the cleanup hook.
  - Updated the SessionCompletionHandler to eliminate the completeByClaudeId method and its dependencies.
  - Adjusted the SessionRoutes to reflect the removal of the session completion route.

* fix: update worker-cli command to use bun for consistency

* feat: Implement timestamp fix for observations and enhance processing logic
  - Added `earliestPendingTimestamp` to `ActiveSession` to track the original timestamp of the earliest pending message.
  - Updated `SDKAgent` to capture and utilize the earliest pending timestamp during response processing.
  - Modified `SessionManager` to track the earliest timestamp when yielding messages.
  - Created scripts for fixing corrupted timestamps, validating fixes, and investigating timestamp issues.
  - Verified that all corrupted observations have been repaired and logic for future processing is sound.
  - Ensured orphan processing can be safely re-enabled after validation.

* feat: Enhance SessionStore to support custom database paths and add timestamp fields for observations and summaries

* Refactor pending queue processing and add management endpoints
  - Disabled automatic recovery of orphaned queues on startup; users must now use the new /api/pending-queue/process endpoint.
  - Updated processOrphanedQueues method to processPendingQueues with improved session handling and return detailed results.
  - Added new API endpoints for managing pending queues: GET /api/pending-queue and POST /api/pending-queue/process.
  - Introduced a new script (check-pending-queue.ts) for checking and processing pending observation queues interactively or automatically.
  - Enhanced logging and error handling for better monitoring of session processing.

* updated agent sdk

* feat: Add manual recovery guide and queue management endpoints to documentation

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
@@ -19,7 +19,7 @@ The worker service is a long-running HTTP API built with Express.js and managed
|
||||
|
||||
## REST API Endpoints
|
||||
|
||||
-The worker service exposes 20 HTTP endpoints organized into five categories:
+The worker service exposes 22 HTTP endpoints organized into six categories:
|
||||
|
||||
### Viewer & Health Endpoints
|
||||
|
||||
@@ -385,9 +385,106 @@ POST /api/settings
|
||||
}
|
||||
```
|
||||
|
||||
### Queue Management Endpoints
|
||||
|
||||
#### 16. Get Pending Queue Status
|
||||
```
|
||||
GET /api/pending-queue
|
||||
```
|
||||
|
||||
**Purpose**: View current processing queue status and identify stuck messages
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"queue": {
|
||||
"messages": [
|
||||
{
|
||||
"id": 123,
|
||||
"session_db_id": 45,
|
||||
"claude_session_id": "abc123",
|
||||
"message_type": "observation",
|
||||
"status": "pending",
|
||||
"retry_count": 0,
|
||||
"created_at_epoch": 1730886600000,
|
||||
"started_processing_at_epoch": null,
|
||||
"completed_at_epoch": null
|
||||
}
|
||||
],
|
||||
"totalPending": 5,
|
||||
"totalProcessing": 2,
|
||||
"totalFailed": 0,
|
||||
"stuckCount": 1
|
||||
},
|
||||
"recentlyProcessed": [
|
||||
{
|
||||
"id": 122,
|
||||
"session_db_id": 44,
|
||||
"status": "processed",
|
||||
"completed_at_epoch": 1730886500000
|
||||
}
|
||||
],
|
||||
"sessionsWithPendingWork": [44, 45, 46]
|
||||
}
|
||||
```
|
||||
|
||||
**Status Definitions**:
|
||||
- `pending`: Message queued, not yet processed
|
||||
- `processing`: Message currently being processed by SDK agent
|
||||
- `processed`: Message completed successfully
|
||||
- `failed`: Message failed after max retry attempts (3 by default)
|
||||
|
||||
**Stuck Detection**: Messages in `processing` status for >5 minutes are considered stuck and included in `stuckCount`
|
||||
|
||||
**Use Case**: Check queue health after worker crashes or restarts to identify unprocessed observations
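The stuck-detection rule described above can be sketched as a small pure function. This is only an illustration of the 5-minute rule, not the worker's actual implementation: the `QueueMessage` interface, `isStuck`, and `countStuck` are hypothetical names, and only the field names are taken from the JSON response shown earlier.

```typescript
// Hypothetical shape: a subset of the message fields from the response above.
interface QueueMessage {
  id: number;
  status: "pending" | "processing" | "processed" | "failed";
  started_processing_at_epoch: number | null;
}

// Threshold from the "Stuck Detection" note: more than 5 minutes in `processing`.
const STUCK_THRESHOLD_MS = 5 * 60 * 1000;

// A message is stuck only if it is in `processing` and started over 5 minutes ago.
function isStuck(msg: QueueMessage, nowEpochMs: number): boolean {
  if (msg.status !== "processing" || msg.started_processing_at_epoch === null) {
    return false;
  }
  return nowEpochMs - msg.started_processing_at_epoch > STUCK_THRESHOLD_MS;
}

// Counts stuck messages, analogous to the `stuckCount` field.
function countStuck(messages: QueueMessage[], nowEpochMs: number): number {
  return messages.filter((m) => isStuck(m, nowEpochMs)).length;
}
```

Passing the current time as a parameter keeps the check deterministic, which makes it easy to verify against known timestamps.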
|
||||
|
||||
#### 17. Trigger Manual Recovery
|
||||
```
|
||||
POST /api/pending-queue/process
|
||||
```
|
||||
|
||||
**Purpose**: Manually trigger processing of pending queues (replaces automatic recovery in v5.x+)
|
||||
|
||||
**Request Body**:
|
||||
```json
|
||||
{
|
||||
"sessionLimit": 10
|
||||
}
|
||||
```
|
||||
|
||||
**Body Parameters**:
|
||||
- `sessionLimit` (optional): Maximum number of sessions to process (default: 10, max: 100)
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"success": true,
|
||||
"totalPendingSessions": 15,
|
||||
"sessionsStarted": 10,
|
||||
"sessionsSkipped": 2,
|
||||
"startedSessionIds": [44, 45, 46, 47, 48, 49, 50, 51, 52, 53]
|
||||
}
|
||||
```
|
||||
|
||||
**Response Fields**:
|
||||
- `totalPendingSessions`: Total sessions with pending messages in database
|
||||
- `sessionsStarted`: Number of sessions started by this request
|
||||
- `sessionsSkipped`: Sessions already actively processing (not restarted)
|
||||
- `startedSessionIds`: Database IDs of sessions started
|
||||
|
||||
**Behavior**:
|
||||
- Processes up to `sessionLimit` sessions with pending work
|
||||
- Skips sessions already actively processing (prevents duplicate agents)
|
||||
- Starts non-blocking SDK agents for each session
|
||||
- Returns immediately with status (processing continues in background)
|
||||
|
||||
**Use Case**: Manually recover stuck observations after worker crashes, or when automatic recovery was disabled
|
||||
|
||||
**Recovery Strategy Note**: As of v5.x, automatic recovery on worker startup is disabled by default. Users must manually trigger recovery using this endpoint or the CLI tool (`bun scripts/check-pending-queue.ts`) to maintain explicit control over reprocessing.
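As a rough sketch of the behavior described for this endpoint, the selection of sessions might look like the following. The function names (`normalizeSessionLimit`, `planRecovery`) are hypothetical, and two details are assumptions rather than documented behavior: that invalid limits fall back to the default, and that skipped sessions do not count against `sessionLimit`.

```typescript
// Response shape taken from the example response above.
interface RecoveryResult {
  success: boolean;
  totalPendingSessions: number;
  sessionsStarted: number;
  sessionsSkipped: number;
  startedSessionIds: number[];
}

const DEFAULT_SESSION_LIMIT = 10; // documented default
const MAX_SESSION_LIMIT = 100; // documented maximum

// Assumption: missing or invalid values fall back to the default; large values are capped.
function normalizeSessionLimit(requested?: number): number {
  if (requested === undefined || !Number.isInteger(requested) || requested < 1) {
    return DEFAULT_SESSION_LIMIT;
  }
  return Math.min(requested, MAX_SESSION_LIMIT);
}

// Picks up to `limit` sessions to start, skipping those that already have an active agent.
function planRecovery(
  pendingSessionIds: number[],
  activeSessionIds: Set<number>,
  requestedLimit?: number
): RecoveryResult {
  const limit = normalizeSessionLimit(requestedLimit);
  const startedSessionIds: number[] = [];
  let sessionsSkipped = 0;

  for (const id of pendingSessionIds) {
    if (startedSessionIds.length >= limit) break;
    if (activeSessionIds.has(id)) {
      sessionsSkipped += 1; // already processing: do not start a duplicate agent
      continue;
    }
    startedSessionIds.push(id);
  }

  return {
    success: true,
    totalPendingSessions: pendingSessionIds.length,
    sessionsStarted: startedSessionIds.length,
    sessionsSkipped,
    startedSessionIds,
  };
}
```

In the real worker, starting a session launches a non-blocking SDK agent; the sketch only models which sessions would be chosen.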
|
||||
|
||||
### Session Management Endpoints
|
||||
|
||||
-#### 16. Initialize Session
+#### 19. Initialize Session
|
||||
```
|
||||
POST /sessions/:sessionDbId/init
|
||||
```
|
||||
@@ -408,7 +505,7 @@ POST /sessions/:sessionDbId/init
|
||||
}
|
||||
```
|
||||
|
||||
-#### 17. Add Observation
+#### 20. Add Observation
|
||||
```
|
||||
POST /sessions/:sessionDbId/observations
|
||||
```
|
||||
@@ -431,7 +528,7 @@ POST /sessions/:sessionDbId/observations
|
||||
}
|
||||
```
|
||||
|
||||
-#### 18. Generate Summary
+#### 21. Generate Summary
|
||||
```
|
||||
POST /sessions/:sessionDbId/summarize
|
||||
```
|
||||
@@ -451,7 +548,7 @@ POST /sessions/:sessionDbId/summarize
|
||||
}
|
||||
```
|
||||
|
||||
-#### 19. Session Status
+#### 22. Session Status
|
||||
```
|
||||
GET /sessions/:sessionDbId/status
|
||||
```
|
||||
@@ -466,7 +563,7 @@ GET /sessions/:sessionDbId/status
|
||||
}
|
||||
```
|
||||
|
||||
-#### 20. Delete Session
+#### 23. Delete Session
|
||||
```
|
||||
DELETE /sessions/:sessionDbId
|
||||
```
|
||||
|
||||
+131
-27
@@ -371,45 +371,149 @@ npm test
|
||||
|
||||
## Testing
|
||||
|
||||
-### Running Tests
+### Testing Philosophy
|
||||
|
||||
Claude-mem relies on **real-world usage and manual testing** rather than traditional unit tests. The project philosophy prioritizes:
|
||||
|
||||
1. **Manual verification** - Testing features in actual Claude Code sessions
|
||||
2. **Integration testing** - Running the full system end-to-end
|
||||
3. **Database inspection** - Verifying data correctness via SQLite queries
|
||||
4. **CLI tools** - Interactive tools for checking system state
|
||||
5. **Observability** - Comprehensive logging and worker health checks
|
||||
|
||||
This approach was chosen because:
|
||||
- Hook behavior depends heavily on Claude Code's runtime environment
|
||||
- SDK interactions require real API calls and responses
|
||||
- SQLite and Bun runtime provide stability guarantees
|
||||
- Manual testing catches integration issues that unit tests miss
|
||||
|
||||
### Manual Testing Workflow
|
||||
|
||||
When developing new features:
|
||||
|
||||
1. **Build and sync**:
|
||||
```bash
|
||||
npm run build
|
||||
npm run sync-marketplace
|
||||
claude-mem restart
|
||||
```
|
||||
|
||||
2. **Test in real session**:
|
||||
- Start Claude Code
|
||||
- Trigger the feature you're testing
|
||||
- Verify expected behavior
|
||||
|
||||
3. **Check database state**:
|
||||
```bash
|
||||
sqlite3 ~/.claude-mem/claude-mem.db "SELECT * FROM your_table;"
|
||||
```
|
||||
|
||||
4. **Monitor worker logs**:
|
||||
```bash
|
||||
npm run worker:logs
|
||||
```
|
||||
|
||||
5. **Verify queue health** (for recovery features):
|
||||
```bash
|
||||
bun scripts/check-pending-queue.ts
|
||||
```
|
||||
|
||||
### Testing Tools
|
||||
|
||||
**Health Checks**:
|
||||
 ```bash
-# All tests
-npm test
+# Worker status
+npm run worker:status

-# Specific test file
-node --test tests/your-test.test.ts
+# Queue inspection
+curl http://localhost:37777/api/pending-queue

-# With coverage (if configured)
-npm test -- --coverage
+# Database integrity
+sqlite3 ~/.claude-mem/claude-mem.db "PRAGMA integrity_check;"
 ```

-### Writing Tests
+**Hook Testing**:
+```bash
+# Test context hook manually
+echo '{"session_id":"test-123","cwd":"'$(pwd)'","source":"startup"}' | node plugin/scripts/context-hook.js

-Create test files in `tests/`:
-
-```typescript
-import { describe, it } from 'node:test';
-import assert from 'node:assert';
-
-describe('YourFeature', () => {
-  it('should do something', () => {
-    // Test implementation
-    assert.strictEqual(result, expected);
-  });
-});
+# Test new hook
+echo '{"session_id":"test-123","cwd":"'$(pwd)'","prompt":"test"}' | node plugin/scripts/new-hook.js
 ```

-### Test Database
+**Data Verification**:
+```bash
+# Check recent observations
+sqlite3 ~/.claude-mem/claude-mem.db "
+  SELECT id, tool_name, created_at
+  FROM observations
+  ORDER BY created_at_epoch DESC
+  LIMIT 10;
+"

-Use a separate test database:
-
-```typescript
-import { SessionStore } from '../src/services/sqlite/SessionStore';
-
-const store = new SessionStore(':memory:'); // In-memory database
+# Check summaries
+sqlite3 ~/.claude-mem/claude-mem.db "
+  SELECT id, request, completed
+  FROM session_summaries
+  ORDER BY created_at_epoch DESC
+  LIMIT 5;
+"
 ```
|
||||
|
||||
### Recovery Feature Testing
|
||||
|
||||
For manual recovery features specifically:
|
||||
|
||||
1. **Simulate stuck messages**:
|
||||
```bash
|
||||
# Manually create stuck message (for testing only)
|
||||
sqlite3 ~/.claude-mem/claude-mem.db "
|
||||
UPDATE pending_messages
|
||||
SET status = 'processing',
|
||||
started_processing_at_epoch = strftime('%s', 'now', '-10 minutes') * 1000
|
||||
WHERE id = 123;
|
||||
"
|
||||
```
|
||||
|
||||
2. **Test recovery**:
|
||||
```bash
|
||||
bun scripts/check-pending-queue.ts
|
||||
```
|
||||
|
||||
3. **Verify results**:
|
||||
```bash
|
||||
curl http://localhost:37777/api/pending-queue | jq '.queue'
|
||||
```
|
||||
|
||||
### Regression Testing
|
||||
|
||||
Before releasing:
|
||||
|
||||
1. **Test all hook triggers**:
|
||||
- SessionStart: Start new Claude Code session
|
||||
- UserPromptSubmit: Submit a prompt
|
||||
- PostToolUse: Use a tool like Read
|
||||
- Summary: Let session complete
|
||||
- SessionEnd: Close Claude Code
|
||||
|
||||
2. **Test core features**:
|
||||
- Context injection (recent sessions appear)
|
||||
- Observation processing (summaries generated)
|
||||
- MCP search tools (search returns results)
|
||||
- Viewer UI (loads at http://localhost:37777)
|
||||
- Manual recovery (stuck messages recovered)
|
||||
|
||||
3. **Test edge cases**:
|
||||
- Worker crash recovery
|
||||
- Database locks
|
||||
- Port conflicts
|
||||
- Large databases
|
||||
|
||||
4. **Cross-platform** (if applicable):
|
||||
- macOS
|
||||
- Linux
|
||||
- Windows
|
||||
|
||||
## Code Style
|
||||
|
||||
### TypeScript Guidelines
|
||||
|
||||
@@ -40,6 +40,7 @@
|
||||
"usage/claude-desktop",
|
||||
"usage/private-tags",
|
||||
"usage/export-import",
|
||||
"usage/manual-recovery",
|
||||
"beta-features",
|
||||
"endless-mode"
|
||||
]
|
||||
|
||||
@@ -285,6 +285,201 @@ The skill includes comprehensive diagnostics, automated repair sequences, and de
|
||||
claude-mem restart
|
||||
```
|
||||
|
||||
### Manual Recovery for Stuck Observations
|
||||
|
||||
**Symptoms**: Observations stuck in processing queue after worker crash or restart, no new summaries appearing despite worker running.
|
||||
|
||||
**Background**: As of v5.x, automatic queue recovery on worker startup is disabled. Users must manually trigger recovery to maintain explicit control over reprocessing and prevent unexpected duplicate observations.
|
||||
|
||||
**Solutions**:
|
||||
|
||||
#### Option 1: Use CLI Recovery Tool (Recommended)
|
||||
|
||||
The interactive CLI tool provides the safest and most user-friendly recovery experience:
|
||||
|
||||
```bash
|
||||
# Check queue status and prompt for recovery
|
||||
bun scripts/check-pending-queue.ts
|
||||
|
||||
# Auto-process without prompting
|
||||
bun scripts/check-pending-queue.ts --process
|
||||
|
||||
# Process up to 5 sessions
|
||||
bun scripts/check-pending-queue.ts --process --limit 5
|
||||
```
|
||||
|
||||
**What it does**:
|
||||
- ✅ Checks worker health before proceeding
|
||||
- ✅ Shows detailed queue summary (pending, processing, failed, stuck)
|
||||
- ✅ Groups messages by session with age and status breakdown
|
||||
- ✅ Prompts user to confirm processing (unless `--process` flag used)
|
||||
- ✅ Shows recently processed messages for feedback
|
||||
|
||||
**Interactive Example**:
|
||||
```
|
||||
Worker is healthy ✓
|
||||
|
||||
Queue Summary:
|
||||
Pending: 12 messages
|
||||
Processing: 2 messages (1 stuck)
|
||||
Failed: 0 messages
|
||||
Recently Processed: 5 messages in last 30 minutes
|
||||
|
||||
Sessions with pending work: 3
|
||||
Session 44: 5 pending, 1 processing (age: 2m)
|
||||
Session 45: 4 pending, 1 processing (age: 7m - STUCK)
|
||||
Session 46: 2 pending
|
||||
|
||||
Would you like to process these pending queues? (y/n)
|
||||
```
|
||||
|
||||
#### Option 2: Use HTTP API Directly
|
||||
|
||||
For automation or scripting scenarios:
|
||||
|
||||
1. **Check queue status**:
|
||||
```bash
|
||||
curl http://localhost:37777/api/pending-queue
|
||||
```
|
||||
|
||||
Response shows:
|
||||
- `queue.totalPending`: Messages waiting to process
|
||||
- `queue.totalProcessing`: Messages currently processing
|
||||
- `queue.stuckCount`: Processing messages >5 minutes old
|
||||
- `sessionsWithPendingWork`: Session IDs needing recovery
|
||||
|
||||
2. **Trigger manual recovery**:
|
||||
```bash
|
||||
curl -X POST http://localhost:37777/api/pending-queue/process \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"sessionLimit": 10}'
|
||||
```
|
||||
|
||||
Response includes:
|
||||
- `totalPendingSessions`: Total sessions with pending messages
|
||||
- `sessionsStarted`: Number of sessions we started processing
|
||||
- `sessionsSkipped`: Sessions already processing (not restarted)
|
||||
- `startedSessionIds`: Database IDs of sessions started
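For scripted use of the two calls above, a minimal sketch might separate the decision from the request construction. The helper names (`needsRecovery`, `buildProcessRequest`) are hypothetical; only the endpoint path, header, body, and response field names come from the examples above, and treating any non-empty `sessionsWithPendingWork` as a reason to recover is an assumption.

```typescript
// Subset of the GET /api/pending-queue response used by this sketch.
interface QueueStatus {
  queue: {
    totalPending: number;
    totalProcessing: number;
    stuckCount: number;
  };
  sessionsWithPendingWork: number[];
}

// Plain description of the POST request, suitable for passing to fetch().
interface ProcessRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Assumption: recovery is worthwhile whenever any session still has pending work.
function needsRecovery(status: QueueStatus): boolean {
  return status.sessionsWithPendingWork.length > 0;
}

// Builds the same request as the curl command above.
function buildProcessRequest(baseUrl: string, sessionLimit: number): ProcessRequest {
  return {
    url: `${baseUrl}/api/pending-queue/process`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ sessionLimit }),
  };
}
```

A script could then call `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })` only when `needsRecovery` returns true.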
|
||||
|
||||
#### Understanding Queue States
|
||||
|
||||
Messages progress through these states:
|
||||
|
||||
1. **pending** - Queued, waiting to process
|
||||
2. **processing** - Currently being processed by SDK agent
|
||||
3. **processed** - Completed successfully
|
||||
4. **failed** - Failed after 3 retry attempts
|
||||
|
||||
**Stuck Detection**: Messages in `processing` state for >5 minutes are considered stuck and automatically reset to `pending` on worker startup (but not automatically reprocessed).
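The state progression listed above can be illustrated with a small transition function. This is a sketch under assumptions, not the worker's code: `afterAttempt` is a hypothetical name, and it assumes a failed attempt increments `retry_count` and returns the message to `pending` until the third failure.

```typescript
type MessageStatus = "pending" | "processing" | "processed" | "failed";

// Documented retry limit: a message is marked failed after 3 attempts.
const MAX_RETRIES = 3;

interface RetryState {
  status: MessageStatus;
  retry_count: number;
}

// Computes the next state after one processing attempt finishes.
function afterAttempt(state: RetryState, succeeded: boolean): RetryState {
  if (state.status !== "processing") {
    throw new Error("only a message in `processing` can finish an attempt");
  }
  if (succeeded) {
    return { status: "processed", retry_count: state.retry_count };
  }
  // Assumption: each failure increments retry_count and requeues until the limit.
  const retry_count = state.retry_count + 1;
  return { status: retry_count >= MAX_RETRIES ? "failed" : "pending", retry_count };
}
```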
|
||||
|
||||
#### Recovery Strategy
|
||||
|
||||
**When to use manual recovery**:
|
||||
- After worker crashes or unexpected restarts
|
||||
- When observations appear saved but no summaries generated
|
||||
- When queue status shows stuck messages (processing >5 minutes)
|
||||
- After system crashes or forced shutdowns
|
||||
|
||||
**Best practices**:
|
||||
1. Always check queue status before triggering recovery
|
||||
2. Use the CLI tool for interactive sessions (provides feedback)
|
||||
3. Use the HTTP API for automation/scripting
|
||||
4. Start with a low session limit (5-10) to avoid overwhelming the worker
|
||||
5. Monitor worker logs during recovery: `npm run worker:logs`
|
||||
6. Check recently processed messages to confirm recovery worked
|
||||
|
||||
#### Troubleshooting Recovery Issues
|
||||
|
||||
If recovery fails or messages remain stuck:
|
||||
|
||||
1. **Verify worker is healthy**:
|
||||
```bash
|
||||
curl http://localhost:37777/health
|
||||
# Should return: {"status":"ok","uptime":12345,"port":37777}
|
||||
```
|
||||
|
||||
2. **Check database for corruption**:
|
||||
```bash
|
||||
sqlite3 ~/.claude-mem/claude-mem.db "PRAGMA integrity_check;"
|
||||
```
|
||||
|
||||
3. **View stuck messages directly**:
|
||||
```bash
|
||||
sqlite3 ~/.claude-mem/claude-mem.db "
|
||||
SELECT id, session_db_id, status, retry_count,
|
||||
(strftime('%s', 'now') * 1000 - started_processing_at_epoch) / 60000 as age_minutes
|
||||
FROM pending_messages
|
||||
WHERE status = 'processing'
|
||||
ORDER BY started_processing_at_epoch;
|
||||
"
|
||||
```
|
||||
|
||||
4. **Force reset stuck messages** (nuclear option):
|
||||
```bash
|
||||
sqlite3 ~/.claude-mem/claude-mem.db "
|
||||
UPDATE pending_messages
|
||||
SET status = 'pending', started_processing_at_epoch = NULL
|
||||
WHERE status = 'processing';
|
||||
"
|
||||
```
|
||||
|
||||
Then trigger recovery:
|
||||
```bash
|
||||
bun scripts/check-pending-queue.ts --process
|
||||
```
|
||||
|
||||
5. **Check worker logs for SDK errors**:
|
||||
```bash
|
||||
npm run worker:logs | grep -i error
|
||||
```
|
||||
|
||||
#### Understanding the Queue Table
|
||||
|
||||
The `pending_messages` table tracks all messages with these key fields:
|
||||
|
||||
```sql
|
||||
CREATE TABLE pending_messages (
|
||||
id INTEGER PRIMARY KEY,
|
||||
session_db_id INTEGER, -- Foreign key to sdk_sessions
|
||||
claude_session_id TEXT, -- Claude session ID
|
||||
message_type TEXT, -- 'observation' | 'summarize'
|
||||
status TEXT, -- 'pending' | 'processing' | 'processed' | 'failed'
|
||||
retry_count INTEGER, -- Current retry attempt (max: 3)
|
||||
created_at_epoch INTEGER, -- When message was queued
|
||||
started_processing_at_epoch INTEGER, -- When marked 'processing'
|
||||
completed_at_epoch INTEGER -- When completed/failed
|
||||
)
|
||||
```
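When reading this table from TypeScript, the row could be typed roughly as follows. The interface name `PendingMessageRow` and the helper `countByStatus` are hypothetical, and the nullability of the timestamp columns is an assumption based on the `null` values in the API examples, since the schema above does not state constraints.

```typescript
type MessageType = "observation" | "summarize";
type MessageStatus = "pending" | "processing" | "processed" | "failed";

// Mirrors the columns listed in the schema above.
interface PendingMessageRow {
  id: number;
  session_db_id: number;
  claude_session_id: string;
  message_type: MessageType;
  status: MessageStatus;
  retry_count: number;
  created_at_epoch: number; // milliseconds since the Unix epoch
  started_processing_at_epoch: number | null; // assumed null until processing starts
  completed_at_epoch: number | null; // assumed null until completed or failed
}

// In-memory equivalent of: SELECT status, COUNT(*) ... GROUP BY status
function countByStatus(rows: PendingMessageRow[]): Record<MessageStatus, number> {
  const counts: Record<MessageStatus, number> = {
    pending: 0,
    processing: 0,
    processed: 0,
    failed: 0,
  };
  for (const row of rows) {
    counts[row.status] += 1;
  }
  return counts;
}
```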
|
||||
|
||||
**Query examples**:
|
||||
|
||||
```bash
|
||||
# Count messages by status
|
||||
sqlite3 ~/.claude-mem/claude-mem.db "
|
||||
SELECT status, COUNT(*)
|
||||
FROM pending_messages
|
||||
GROUP BY status;
|
||||
"
|
||||
|
||||
# Find sessions with pending work
|
||||
sqlite3 ~/.claude-mem/claude-mem.db "
|
||||
SELECT session_db_id, COUNT(*) as pending_count
|
||||
FROM pending_messages
|
||||
WHERE status IN ('pending', 'processing')
|
||||
GROUP BY session_db_id;
|
||||
"
|
||||
|
||||
# View recent failures
|
||||
sqlite3 ~/.claude-mem/claude-mem.db "
|
||||
SELECT id, session_db_id, message_type, retry_count,
|
||||
datetime(completed_at_epoch/1000, 'unixepoch') as failed_at
|
||||
FROM pending_messages
|
||||
WHERE status = 'failed'
|
||||
ORDER BY completed_at_epoch DESC
|
||||
LIMIT 10;
|
||||
"
|
||||
```
|
||||
|
||||
## Hook Issues
|
||||
|
||||
### Hooks Not Firing
|
||||
|
||||
@@ -0,0 +1,450 @@
|
||||
---
|
||||
title: "Manual Recovery"
|
||||
description: "Recover stuck observations after worker crashes or restarts"
|
||||
---
|
||||
|
||||
# Manual Recovery Guide
|
||||
|
||||
## Overview
|
||||
|
||||
Claude-mem's manual recovery system helps you recover observations that get stuck in the processing queue after worker crashes, system restarts, or unexpected shutdowns.
|
||||
|
||||
**Key Change in v5.x**: Automatic recovery on worker startup is now disabled. This gives you explicit control over when reprocessing happens, preventing unexpected duplicate observations.
|
||||
|
||||
## When Do You Need Manual Recovery?
|
||||
|
||||
You should trigger manual recovery when:
|
||||
|
||||
- **Worker crashed or restarted** - Observations were queued but worker stopped before processing
|
||||
- **No new summaries appearing** - Observations are being saved but not processed into summaries
|
||||
- **Stuck messages detected** - Messages showing as "processing" for >5 minutes
|
||||
- **System crashes** - Unexpected shutdowns left messages in incomplete states
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Using the CLI Tool (Recommended)
|
||||
|
||||
The interactive CLI tool is the safest and easiest way to recover stuck observations:
|
||||
|
||||
```bash
|
||||
# Check status and prompt for recovery
|
||||
bun scripts/check-pending-queue.ts
|
||||
```
|
||||
|
||||
This will:
|
||||
1. Check worker health
|
||||
2. Show queue summary (pending, processing, failed, stuck counts)
|
||||
3. Display sessions with pending work
|
||||
4. Prompt you to confirm recovery
|
||||
5. Show recently processed messages for feedback
|
||||
|
||||
### Auto-Process Without Prompts
|
||||
|
||||
For scripting or when you're confident recovery is needed:
|
||||
|
||||
```bash
|
||||
# Auto-process without prompting
|
||||
bun scripts/check-pending-queue.ts --process
|
||||
|
||||
# Limit to 5 sessions
|
||||
bun scripts/check-pending-queue.ts --process --limit 5
|
||||
```
|
||||
|
||||
## Understanding Queue States
|
||||
|
||||
Messages progress through these lifecycle states:
|
||||
|
||||
1. **pending** → Queued, waiting to process
|
||||
2. **processing** → Currently being processed by SDK agent
|
||||
3. **processed** → Completed successfully
|
||||
4. **failed** → Failed after 3 retry attempts
|
||||
|
||||
### Stuck Detection
|
||||
|
||||
Messages in `processing` state for **>5 minutes** are considered stuck:
|
||||
|
||||
- They're automatically reset to `pending` on worker startup
|
||||
- They're NOT automatically reprocessed (requires manual trigger)
|
||||
- They appear in the `stuckCount` field when checking queue status
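The startup reset described above could be sketched as follows. It only models the state change (stuck `processing` messages return to `pending`, and nothing is reprocessed); `resetStuckMessages` and the `QueuedMessage` interface are hypothetical names, not the worker's API.

```typescript
interface QueuedMessage {
  id: number;
  status: "pending" | "processing" | "processed" | "failed";
  started_processing_at_epoch: number | null;
}

// Messages in `processing` for more than 5 minutes are treated as stuck.
const STUCK_THRESHOLD_MS = 5 * 60 * 1000;

// Returns a new list in which stuck messages are back in `pending`.
// It does not start any processing; that still requires a manual trigger.
function resetStuckMessages(messages: QueuedMessage[], nowEpochMs: number): QueuedMessage[] {
  return messages.map((m): QueuedMessage => {
    const stuck =
      m.status === "processing" &&
      m.started_processing_at_epoch !== null &&
      nowEpochMs - m.started_processing_at_epoch > STUCK_THRESHOLD_MS;
    return stuck ? { ...m, status: "pending", started_processing_at_epoch: null } : m;
  });
}
```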
|
||||
|
||||
## Recovery Methods
|
||||
|
||||
### Method 1: Interactive CLI Tool
|
||||
|
||||
**Best for**: Regular users, interactive sessions, when you want visibility into what's happening
|
||||
|
||||
```bash
|
||||
bun scripts/check-pending-queue.ts
|
||||
```
|
||||
|
||||
**Example Output**:
|
||||
```
|
||||
Checking worker health...
|
||||
Worker is healthy ✓
|
||||
|
||||
Queue Summary:
|
||||
Pending: 12 messages
|
||||
Processing: 2 messages (1 stuck)
|
||||
Failed: 0 messages
|
||||
Recently Processed: 5 messages in last 30 minutes
|
||||
|
||||
Sessions with pending work: 3
|
||||
Session 44: 5 pending, 1 processing (age: 2m)
|
||||
Session 45: 4 pending, 1 processing (age: 7m - STUCK)
|
||||
Session 46: 2 pending
|
||||
|
||||
Would you like to process these pending queues? (y/n)
|
||||
```
|
||||
|
||||
**Features**:
|
||||
- ✅ Pre-flight health check (verifies worker is running)
|
||||
- ✅ Detailed queue breakdown by session
|
||||
- ✅ Age tracking for stuck detection
|
||||
- ✅ Confirmation prompt (prevents accidental reprocessing)
|
||||
- ✅ Non-interactive mode with `--process` flag
|
||||
- ✅ Session limit control with `--limit N`
|
||||
|
||||
### Method 2: HTTP API
|
||||
|
||||
**Best for**: Automation, scripting, integration with monitoring systems
|
||||
|
||||
#### Check Queue Status
|
||||
|
||||
```bash
|
||||
curl http://localhost:37777/api/pending-queue
|
||||
```
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"queue": {
|
||||
"messages": [
|
||||
{
|
||||
"id": 123,
|
||||
"session_db_id": 45,
|
||||
"claude_session_id": "abc123",
|
||||
"message_type": "observation",
|
||||
"status": "pending",
|
||||
"retry_count": 0,
|
||||
"created_at_epoch": 1730886600000
|
||||
}
|
||||
],
|
||||
"totalPending": 12,
|
||||
"totalProcessing": 2,
|
||||
"totalFailed": 0,
|
||||
"stuckCount": 1
|
||||
},
|
||||
"recentlyProcessed": [...],
|
||||
"sessionsWithPendingWork": [44, 45, 46]
|
||||
}
|
||||
```
|
||||
|
||||
**Key Fields**:
|
||||
- `totalPending` - Messages waiting to process
|
||||
- `totalProcessing` - Messages currently processing
|
||||
- `stuckCount` - Processing messages >5 minutes old
|
||||
- `sessionsWithPendingWork` - Session IDs needing recovery
|
||||
|
||||
#### Trigger Recovery
|
||||
|
||||
```bash
|
||||
curl -X POST http://localhost:37777/api/pending-queue/process \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"sessionLimit": 10}'
|
||||
```
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"success": true,
|
||||
"totalPendingSessions": 15,
|
||||
"sessionsStarted": 10,
|
||||
"sessionsSkipped": 2,
|
||||
"startedSessionIds": [44, 45, 46, 47, 48, 49, 50, 51, 52, 53]
|
||||
}
|
||||
```
|
||||
|
||||
**Response Fields**:
|
||||
- `totalPendingSessions` - Total sessions with pending messages in database
|
||||
- `sessionsStarted` - Sessions started by this request
|
||||
- `sessionsSkipped` - Sessions already processing (prevents duplicate agents)
|
||||
- `startedSessionIds` - Database IDs of sessions we started
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. Always Check Before Recovery
|
||||
|
||||
```bash
|
||||
# Check queue status first
|
||||
curl http://localhost:37777/api/pending-queue
|
||||
|
||||
# Or use CLI tool which checks automatically
|
||||
bun scripts/check-pending-queue.ts
|
||||
```
|
||||
|
||||
### 2. Start with Low Session Limits
|
||||
|
||||
```bash
|
||||
# Process only 5 sessions at a time
|
||||
bun scripts/check-pending-queue.ts --process --limit 5
|
||||
```
|
||||
|
||||
This prevents overwhelming the worker with too many concurrent SDK agents.
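To estimate how many recovery rounds a backlog needs at a given limit, a script could use a helper like the one below. `planRounds` is a hypothetical name; the sketch assumes each round starts at most `limit` sessions and that you wait for a round to drain before triggering the next.

```typescript
// Splits a backlog of sessions into per-round counts of at most `limit` each.
function planRounds(totalPendingSessions: number, limit: number): number[] {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const rounds: number[] = [];
  let remaining = Math.max(0, totalPendingSessions);
  while (remaining > 0) {
    const batch = Math.min(limit, remaining);
    rounds.push(batch);
    remaining -= batch;
  }
  return rounds;
}
```

With 12 pending sessions and a limit of 5, this yields three rounds of 5, 5, and 2 sessions.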
|
||||
|
||||
### 3. Monitor During Recovery
|
||||
|
||||
Watch worker logs while recovery runs:
|
||||
|
||||
```bash
|
||||
npm run worker:logs
|
||||
```
|
||||
|
||||
Look for:
|
||||
- SDK agent starts: `Starting SDK agent for session...`
|
||||
- Processing completions: `Processed observation...`
|
||||
- Errors: `ERROR` or `Failed to process...`
|
||||
|
||||
### 4. Verify Recovery Success
|
||||
|
||||
Check recently processed messages:
|
||||
|
||||
```bash
|
||||
curl http://localhost:37777/api/pending-queue | jq '.recentlyProcessed'
|
||||
```
|
||||
|
||||
Or use the CLI tool which shows this automatically.
|
||||
|
||||
### 5. Handle Failed Messages
|
||||
|
||||
Messages that fail 3 times are marked `failed` and won't auto-retry:
|
||||
|
||||
```bash
|
||||
# View failed messages
|
||||
sqlite3 ~/.claude-mem/claude-mem.db "
|
||||
SELECT id, session_db_id, message_type, retry_count
|
||||
FROM pending_messages
|
||||
WHERE status = 'failed'
|
||||
ORDER BY completed_at_epoch DESC;
|
||||
"
|
||||
```
|
||||
|
||||
You can manually reset them if needed:
|
||||
|
||||
```bash
|
||||
sqlite3 ~/.claude-mem/claude-mem.db "
|
||||
UPDATE pending_messages
|
||||
SET status = 'pending', retry_count = 0
|
||||
WHERE status = 'failed';
|
||||
"
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Recovery Not Working
|
||||
|
||||
**Symptom**: Triggered recovery but messages still pending
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. **Verify worker health**:
|
||||
```bash
|
||||
curl http://localhost:37777/health
|
||||
```
|
||||
|
||||
2. **Check worker logs for errors**:
|
||||
```bash
|
||||
npm run worker:logs | grep -i error
|
||||
```
|
||||
|
||||
3. **Restart worker**:
|
||||
```bash
|
||||
claude-mem restart
|
||||
```
|
||||
|
||||
4. **Check database integrity**:
|
||||
```bash
|
||||
sqlite3 ~/.claude-mem/claude-mem.db "PRAGMA integrity_check;"
|
||||
```
|
||||
|
||||
### Messages Stuck Forever

**Symptom**: Messages show as "processing" for hours

**Solution**: Force reset stuck messages

```bash
# Reset all stuck messages to pending
sqlite3 ~/.claude-mem/claude-mem.db "
  UPDATE pending_messages
  SET status = 'pending', started_processing_at_epoch = NULL
  WHERE status = 'processing';
"

# Then trigger recovery
bun scripts/check-pending-queue.ts --process
```

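The reset above touches every `processing` row, including messages the worker may have picked up only moments ago. A minimal sketch of an age-based variant follows. It assumes `started_processing_at_epoch` is stored in milliseconds (as the `/1000` conversions elsewhere in this guide suggest); the helper name `stuck_cutoff_ms` and the 30-minute threshold are illustrative, not part of claude-mem.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the epoch-millisecond cutoff for "stuck" messages.
# $1 = age threshold in minutes
# $2 = current time in epoch seconds (optional, defaults to now)
stuck_cutoff_ms() {
  local minutes="$1"
  local now_s="${2:-$(date +%s)}"
  echo $(( (now_s - minutes * 60) * 1000 ))
}

# Usage (uncomment to apply): reset only messages that have been
# processing for more than 30 minutes.
# cutoff="$(stuck_cutoff_ms 30)"
# sqlite3 ~/.claude-mem/claude-mem.db "
#   UPDATE pending_messages
#   SET status = 'pending', started_processing_at_epoch = NULL
#   WHERE status = 'processing'
#     AND started_processing_at_epoch < $cutoff;
# "
```
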
### Worker Crashes During Recovery

**Symptom**: Worker stops while processing recovered messages

**Solutions**:

1. **Check available memory**:
   ```bash
   npm run worker:status
   ```

2. **Reduce session limit**:
   ```bash
   bun scripts/check-pending-queue.ts --process --limit 3
   ```

3. **Check for SDK errors in logs**:
   ```bash
   npm run worker:logs | grep -i "sdk"
   ```

4. **Increase worker memory** (if using custom runner):
   ```bash
   export NODE_OPTIONS="--max-old-space-size=4096"
   claude-mem restart
   ```

## Advanced Usage

### Direct Database Inspection

View all pending messages:

```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    id,
    session_db_id,
    message_type,
    status,
    retry_count,
    datetime(created_at_epoch/1000, 'unixepoch') as created_at,
    datetime(started_processing_at_epoch/1000, 'unixepoch') as started_at,
    CAST((strftime('%s', 'now') * 1000 - started_processing_at_epoch) / 60000 AS INTEGER) as age_minutes
  FROM pending_messages
  WHERE status IN ('pending', 'processing')
  ORDER BY created_at_epoch;
"
```

### Count Messages by Status

```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT status, COUNT(*) as count
  FROM pending_messages
  GROUP BY status;
"
```

### Find Sessions with Pending Work

```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    session_db_id,
    COUNT(*) as pending_count,
    GROUP_CONCAT(message_type) as message_types
  FROM pending_messages
  WHERE status IN ('pending', 'processing')
  GROUP BY session_db_id;
"
```

### View Recent Failures

```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    id,
    session_db_id,
    message_type,
    retry_count,
    datetime(completed_at_epoch/1000, 'unixepoch') as failed_at
  FROM pending_messages
  WHERE status = 'failed'
  ORDER BY completed_at_epoch DESC
  LIMIT 10;
"
```

## Integration Examples

### Cron Job for Automatic Recovery

```bash
#!/bin/bash
# Run every hour to process stuck queues

# Check if worker is healthy
if curl -f http://localhost:37777/health > /dev/null 2>&1; then
  # Auto-process up to 5 sessions
  bun scripts/check-pending-queue.ts --process --limit 5
else
  echo "Worker not healthy, skipping recovery"
  exit 1
fi
```

### Monitoring Script

```bash
#!/bin/bash
# Alert if stuck count exceeds threshold

# Fall back to 0 if the worker is unreachable or the field is missing,
# so the numeric comparison below never sees an empty value
STUCK_COUNT=$(curl -s http://localhost:37777/api/pending-queue | jq -r '.queue.stuckCount // 0')
STUCK_COUNT=${STUCK_COUNT:-0}

if [ "$STUCK_COUNT" -gt 5 ]; then
  echo "WARNING: $STUCK_COUNT stuck messages detected"
  # Send alert (email, Slack, etc.)
fi
```

### Pre-Shutdown Recovery

```bash
#!/bin/bash
# Process pending queues before system shutdown

echo "Processing pending queues before shutdown..."
bun scripts/check-pending-queue.ts --process --limit 20

echo "Waiting for processing to complete..."
sleep 10

echo "Stopping worker..."
claude-mem stop
```

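A fixed `sleep 10` may be too short for a large backlog and needlessly long for an empty one. One alternative is to poll the queue until it drains, with a timeout. The sketch below assumes that counting `pending` and `processing` rows in `pending_messages` is a reasonable proxy for unfinished work; the names `count_unfinished`, `wait_for_drain`, and `COUNT_CMD` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count messages that are still pending or processing.
count_unfinished() {
  sqlite3 ~/.claude-mem/claude-mem.db \
    "SELECT COUNT(*) FROM pending_messages WHERE status IN ('pending', 'processing');"
}

# Poll until the count reaches zero or the timeout expires.
# $1 = timeout in seconds, $2 = poll interval in seconds.
# Set COUNT_CMD to the name of another command to change the count source.
wait_for_drain() {
  local timeout="$1" interval="$2" waited=0 count
  while :; do
    # Treat a failing count command as an error rather than "drained"
    count="$("${COUNT_CMD:-count_unfinished}")" || return 1
    if [ "$count" -eq 0 ]; then
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out with $count messages still unfinished" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Usage, in place of the fixed sleep:
# wait_for_drain 60 2 || echo "Stopping worker with unfinished messages" >&2
```
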
## Migration Note

If you're upgrading from v4.x to v5.x:

**v4.x Behavior** (Automatic Recovery):
- Worker automatically recovered stuck messages on startup
- No user control over reprocessing timing

**v5.x Behavior** (Manual Recovery):
- Stuck messages detected but NOT automatically reprocessed
- User must explicitly trigger recovery via CLI or API
- Prevents unexpected duplicate observations
- Provides explicit control over when processing happens

**Migration Steps**:
1. Upgrade to v5.x
2. Check for stuck messages: `bun scripts/check-pending-queue.ts`
3. Process if needed: `bun scripts/check-pending-queue.ts --process`
4. Add recovery to your workflow (cron job, pre-shutdown script, etc.)

## See Also

- [Worker Service Architecture](../architecture/worker-service) - Technical details on queue processing
- [Troubleshooting - Manual Recovery](../troubleshooting#manual-recovery-for-stuck-observations) - Common issues and solutions
- [Database Schema](../architecture/database) - Pending messages table structure