feat: Fix observation timestamps, refactor session management, and enhance worker reliability (#437)

* Refactor worker version checks and increase timeout settings

- Updated the default hook timeout from 5000ms to 120000ms for improved stability.
- Modified the worker version check to log a warning instead of restarting the worker on version mismatch.
- Removed legacy PM2 cleanup and worker start logic, simplifying the ensureWorkerRunning function.
- Enhanced polling mechanism for worker readiness with increased retries and reduced interval.

* feat: implement worker queue polling to ensure processing completion before proceeding

* refactor: change worker command from start to restart in hooks configuration

* refactor: remove session management complexity

- Simplify createSDKSession to pure INSERT OR IGNORE
- Remove auto-create logic from storeObservation/storeSummary
- Delete 11 unused session management methods
- Derive prompt_number from user_prompts count
- Keep sdk_sessions table schema unchanged for compatibility

* refactor: simplify session management by removing unused methods and auto-creation logic

* Refactor session prompt number retrieval in SessionRoutes

- Updated the method of obtaining the prompt number from the session.
- Replaced `store.getPromptCounter(sessionDbId)` with `store.getPromptNumberFromUserPrompts(claudeSessionId)` for better clarity and accuracy.
- Adjusted the logic for incrementing the prompt number to derive it from the user prompts count instead of directly incrementing a counter.

* refactor: replace getPromptCounter with getPromptNumberFromUserPrompts in SessionManager

Phase 7 of session management simplification. Updates SessionManager to derive
prompt numbers from user_prompts table count instead of using the deprecated
prompt_counter column.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* refactor: simplify SessionCompletionHandler to use direct SQL query

Phase 8: Remove call to findActiveSDKSession() and replace with direct
database query in SessionCompletionHandler.completeByClaudeId().

This removes dependency on the deleted findActiveSDKSession() method
and simplifies the code by using a straightforward SELECT query.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* refactor: remove markSessionCompleted call from SDKAgent

- Delete call to markSessionCompleted() in SDKAgent.ts
- Session status is no longer tracked or updated
- Part of phase 9: simplifying session management

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* refactor: remove markSessionComplete method (Phase 10)

- Deleted markSessionComplete() method from DatabaseManager
- Removed markSessionComplete call from SessionCompletionHandler
- Session completion status no longer tracked in database
- Part of session management simplification effort

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* refactor: replace deleted updateSDKSessionId calls in import script (Phase 11)

- Replace updateSDKSessionId() calls with direct SQL UPDATE statements
- Method was deleted in Phase 3 as part of session management simplification
- Import script now uses direct database access consistently

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* test: add validation for SQL updates in sdk_sessions table

* refactor: enhance worker-cli to support manual and automated runs

* Remove cleanup hook and associated session completion logic

- Deleted the cleanup-hook implementation from the hooks directory.
- Removed the session completion endpoint that was used by the cleanup hook.
- Updated the SessionCompletionHandler to eliminate the completeByClaudeId method and its dependencies.
- Adjusted the SessionRoutes to reflect the removal of the session completion route.

* fix: update worker-cli command to use bun for consistency

* feat: Implement timestamp fix for observations and enhance processing logic

- Added `earliestPendingTimestamp` to `ActiveSession` to track the original timestamp of the earliest pending message.
- Updated `SDKAgent` to capture and utilize the earliest pending timestamp during response processing.
- Modified `SessionManager` to track the earliest timestamp when yielding messages.
- Created scripts for fixing corrupted timestamps, validating fixes, and investigating timestamp issues.
- Verified that all corrupted observations have been repaired and logic for future processing is sound.
- Ensured orphan processing can be safely re-enabled after validation.

* feat: Enhance SessionStore to support custom database paths and add timestamp fields for observations and summaries

* Refactor pending queue processing and add management endpoints

- Disabled automatic recovery of orphaned queues on startup; users must now use the new /api/pending-queue/process endpoint.
- Updated processOrphanedQueues method to processPendingQueues with improved session handling and return detailed results.
- Added new API endpoints for managing pending queues: GET /api/pending-queue and POST /api/pending-queue/process.
- Introduced a new script (check-pending-queue.ts) for checking and processing pending observation queues interactively or automatically.
- Enhanced logging and error handling for better monitoring of session processing.

* updated agent sdk

* feat: Add manual recovery guide and queue management endpoints to documentation

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
Alex Newman
2025-12-25 15:36:46 -05:00
committed by GitHub
parent 4cf2c1bdb1
commit 266c746d50
47 changed files with 3417 additions and 1026 deletions
+195
View File
@@ -285,6 +285,201 @@ The skill includes comprehensive diagnostics, automated repair sequences, and de
claude-mem restart
```
### Manual Recovery for Stuck Observations
**Symptoms**: Observations stuck in processing queue after worker crash or restart, no new summaries appearing despite worker running.
**Background**: As of v5.x, automatic queue recovery on worker startup is disabled. Users must manually trigger recovery to maintain explicit control over reprocessing and prevent unexpected duplicate observations.
**Solutions**:
#### Option 1: Use CLI Recovery Tool (Recommended)
The interactive CLI tool provides the safest and most user-friendly recovery experience:
```bash
# Check queue status and prompt for recovery
bun scripts/check-pending-queue.ts
# Auto-process without prompting
bun scripts/check-pending-queue.ts --process
# Process up to 5 sessions
bun scripts/check-pending-queue.ts --process --limit 5
```
**What it does**:
- ✅ Checks worker health before proceeding
- ✅ Shows detailed queue summary (pending, processing, failed, stuck)
- ✅ Groups messages by session with age and status breakdown
- ✅ Prompts user to confirm processing (unless `--process` flag used)
- ✅ Shows recently processed messages for feedback
**Interactive Example**:
```
Worker is healthy ✓
Queue Summary:
Pending: 12 messages
Processing: 2 messages (1 stuck)
Failed: 0 messages
Recently Processed: 5 messages in last 30 minutes
Sessions with pending work: 3
Session 44: 5 pending, 1 processing (age: 2m)
Session 45: 4 pending, 1 processing (age: 7m - STUCK)
Session 46: 2 pending
Would you like to process these pending queues? (y/n)
```
#### Option 2: Use HTTP API Directly
For automation or scripting scenarios:
1. **Check queue status**:
```bash
curl http://localhost:37777/api/pending-queue
```
Response shows:
- `queue.totalPending`: Messages waiting to process
- `queue.totalProcessing`: Messages currently processing
- `queue.stuckCount`: Processing messages >5 minutes old
- `sessionsWithPendingWork`: Session IDs needing recovery
2. **Trigger manual recovery**:
```bash
curl -X POST http://localhost:37777/api/pending-queue/process \
-H "Content-Type: application/json" \
-d '{"sessionLimit": 10}'
```
Response includes:
- `totalPendingSessions`: Total sessions with pending messages
- `sessionsStarted`: Number of sessions we started processing
- `sessionsSkipped`: Sessions already processing (not restarted)
- `startedSessionIds`: Database IDs of sessions started
#### Understanding Queue States
Messages progress through these states:
1. **pending** - Queued, waiting to process
2. **processing** - Currently being processed by SDK agent
3. **processed** - Completed successfully
4. **failed** - Failed after 3 retry attempts
**Stuck Detection**: Messages in `processing` state for >5 minutes are considered stuck and automatically reset to `pending` on worker startup (but not automatically reprocessed).
#### Recovery Strategy
**When to use manual recovery**:
- After worker crashes or unexpected restarts
- When observations appear saved but no summaries generated
- When queue status shows stuck messages (processing >5 minutes)
- After system crashes or forced shutdowns
**Best practices**:
1. Always check queue status before triggering recovery
2. Use the CLI tool for interactive sessions (provides feedback)
3. Use the HTTP API for automation/scripting
4. Start with a low session limit (5-10) to avoid overwhelming the worker
5. Monitor worker logs during recovery: `npm run worker:logs`
6. Check recently processed messages to confirm recovery worked
#### Troubleshooting Recovery Issues
If recovery fails or messages remain stuck:
1. **Verify worker is healthy**:
```bash
curl http://localhost:37777/health
# Should return: {"status":"ok","uptime":12345,"port":37777}
```
2. **Check database for corruption**:
```bash
sqlite3 ~/.claude-mem/claude-mem.db "PRAGMA integrity_check;"
```
3. **View stuck messages directly**:
```bash
sqlite3 ~/.claude-mem/claude-mem.db "
SELECT id, session_db_id, status, retry_count,
(strftime('%s', 'now') * 1000 - started_processing_at_epoch) / 60000 as age_minutes
FROM pending_messages
WHERE status = 'processing'
ORDER BY started_processing_at_epoch;
"
```
4. **Force reset stuck messages** (nuclear option):
```bash
sqlite3 ~/.claude-mem/claude-mem.db "
UPDATE pending_messages
SET status = 'pending', started_processing_at_epoch = NULL
WHERE status = 'processing';
"
```
Then trigger recovery:
```bash
bun scripts/check-pending-queue.ts --process
```
5. **Check worker logs for SDK errors**:
```bash
npm run worker:logs | grep -i error
```
#### Understanding the Queue Table
The `pending_messages` table tracks all messages with these key fields:
```sql
CREATE TABLE pending_messages (
id INTEGER PRIMARY KEY,
session_db_id INTEGER, -- Foreign key to sdk_sessions
claude_session_id TEXT, -- Claude session ID
message_type TEXT, -- 'observation' | 'summarize'
status TEXT, -- 'pending' | 'processing' | 'processed' | 'failed'
retry_count INTEGER, -- Current retry attempt (max: 3)
created_at_epoch INTEGER, -- When message was queued
started_processing_at_epoch INTEGER, -- When marked 'processing'
completed_at_epoch INTEGER -- When completed/failed
)
```
**Query examples**:
```bash
# Count messages by status
sqlite3 ~/.claude-mem/claude-mem.db "
SELECT status, COUNT(*)
FROM pending_messages
GROUP BY status;
"
# Find sessions with pending work
sqlite3 ~/.claude-mem/claude-mem.db "
SELECT session_db_id, COUNT(*) as pending_count
FROM pending_messages
WHERE status IN ('pending', 'processing')
GROUP BY session_db_id;
"
# View recent failures
sqlite3 ~/.claude-mem/claude-mem.db "
SELECT id, session_db_id, message_type, retry_count,
datetime(completed_at_epoch/1000, 'unixepoch') as failed_at
FROM pending_messages
WHERE status = 'failed'
ORDER BY completed_at_epoch DESC
LIMIT 10;
"
```
## Hook Issues
### Hooks Not Firing