- Issue #514: Documented analysis of orphaned observer session files, including root cause, evidence, and recommended fixes. - Issue #517: Analyzed PowerShell escaping issues in cleanupOrphanedProcesses() on Windows, with recommended fixes using WMIC. - Issue #520: Confirmed resolution of stuck messages issue through architectural changes to a claim-and-delete pattern. - Issue #527: Identified detection failure of uv on Apple Silicon Macs with Homebrew installation, proposed path updates for detection. - Issue #532: Analyzed memory leak issues in SessionManager, detailing session cleanup and conversationHistory growth concerns, with recommended fixes.
7.8 KiB
Issue #520: Stuck Messages Analysis
Date: January 4, 2026 Status: RESOLVED - Issue no longer exists in current codebase Original Issue: Messages in 'processing' status never recovered after worker crash
Executive Summary
The issue described in GitHub #520 has been fully resolved in the current codebase through a fundamental architectural change. The system now uses a claim-and-delete pattern instead of the old claim-process-mark pattern, which eliminates the stuck 'processing' state problem entirely.
Original Issue Description
The issue claimed that after a worker crash:
getSessionsWithPendingMessages()returns sessions withstatus IN ('pending', 'processing')- But
claimNextMessage()only looks forstatus = 'pending' - So 'processing' messages are orphaned
Proposed Fix: Add resetStuckMessages(0) at start of processPendingQueues()
Current Code Analysis
1. Queue Processing Pattern: Claim-and-Delete
The current architecture uses claimAndDelete() instead of claimNextMessage():
File: /Users/alexnewman/Scripts/claude-mem/src/services/sqlite/PendingMessageStore.ts
// Lines 85-104
claimAndDelete(sessionDbId: number): PersistentPendingMessage | null {
const claimTx = this.db.transaction((sessionId: number) => {
const peekStmt = this.db.prepare(`
SELECT * FROM pending_messages
WHERE session_db_id = ? AND status = 'pending'
ORDER BY id ASC
LIMIT 1
`);
const msg = peekStmt.get(sessionId) as PersistentPendingMessage | null;
if (msg) {
// Delete immediately - no "processing" state needed
const deleteStmt = this.db.prepare('DELETE FROM pending_messages WHERE id = ?');
deleteStmt.run(msg.id);
}
return msg;
});
return claimTx(sessionDbId) as PersistentPendingMessage | null;
}
Key insight: Messages are atomically selected and deleted in a single transaction. There is no 'processing' state for messages being actively worked on - they simply don't exist in the database anymore.
2. Iterator Uses claimAndDelete
File: /Users/alexnewman/Scripts/claude-mem/src/services/queue/SessionQueueProcessor.ts
// Lines 18-38
async *createIterator(sessionDbId: number, signal: AbortSignal): AsyncIterableIterator<PendingMessageWithId> {
while (!signal.aborted) {
try {
// Atomically claim AND DELETE next message from DB
// Message is now in memory only - no "processing" state tracking needed
const persistentMessage = this.store.claimAndDelete(sessionDbId);
if (persistentMessage) {
// Yield the message for processing (it's already deleted from queue)
yield this.toPendingMessageWithId(persistentMessage);
} else {
// Queue empty - wait for wake-up event
await this.waitForMessage(signal);
}
} catch (error) {
// ... error handling
}
}
}
3. getSessionsWithPendingMessages Still Checks Both States
File: /Users/alexnewman/Scripts/claude-mem/src/services/sqlite/PendingMessageStore.ts
// Lines 319-326
getSessionsWithPendingMessages(): number[] {
const stmt = this.db.prepare(`
SELECT DISTINCT session_db_id FROM pending_messages
WHERE status IN ('pending', 'processing')
`);
const results = stmt.all() as { session_db_id: number }[];
return results.map(r => r.session_db_id);
}
This is technically vestigial code - with the claim-and-delete pattern, messages should never be in 'processing' state. However, it provides backward compatibility and defense-in-depth.
4. Startup Recovery Still Exists
File: /Users/alexnewman/Scripts/claude-mem/src/services/worker-service.ts
// Lines 236-242
// Recover stuck messages from previous crashes
const { PendingMessageStore } = await import('./sqlite/PendingMessageStore.js');
const pendingStore = new PendingMessageStore(this.dbManager.getSessionStore().db, 3);
const STUCK_THRESHOLD_MS = 5 * 60 * 1000;
const resetCount = pendingStore.resetStuckMessages(STUCK_THRESHOLD_MS);
if (resetCount > 0) {
logger.info('SYSTEM', `Recovered ${resetCount} stuck messages from previous session`, { thresholdMinutes: 5 });
}
This runs BEFORE processPendingQueues() is called (line 281), which addresses the original fix request.
Verification of Issue Status
Does the Issue Exist?
NO - The issue as described no longer exists because:
-
No 'processing' state during normal operation: With claim-and-delete, messages go directly from 'pending' to 'deleted'. They never enter a 'processing' state.
-
Startup recovery handles legacy stuck messages: Even if 'processing' messages exist (from old code or edge cases),
resetStuckMessages()is called BEFOREprocessPendingQueues()ininitializeBackground()(lines 236-241 run before line 281). -
Architecture fundamentally changed: The old
claimNextMessage()function that only looked forstatus = 'pending'no longer exists. It was replaced withclaimAndDelete().
GeminiAgent and OpenRouterAgent Behavior
Both agents use the same SessionManager.getMessageIterator() which calls SessionQueueProcessor.createIterator() which uses claimAndDelete(). All three agents (SDKAgent, GeminiAgent, OpenRouterAgent) use identical queue processing:
// GeminiAgent.ts:174, OpenRouterAgent.ts:134
for await (const message of this.sessionManager.getMessageIterator(session.sessionDbId)) {
// ...
}
They do NOT handle recovery differently - they all rely on the shared infrastructure.
What v8.5.7 Changed
Looking at the git history:
v8.5.7 (ac03901):
- Minor ESM/CommonJS compatibility fix for isMainModule detection
- No queue-related changes
v8.5.6 -> v8.5.7:
- f21ea97 refactor: decompose monolith into modular architecture with comprehensive test suite (#538)
The major refactor happened before v8.5.7. The claim-and-delete pattern was already in place.
Timeline of Resolution
Based on git history, the issue was likely resolved through these commits:
b8ce27b-feat(queue): Simplify queue processing and enhance reliability- eb1a78b -
fix: eliminate duplicate observations by simplifying message queue - d72a81e -
Refactor session queue processing and database interactions
These commits appear to have introduced the claim-and-delete pattern that eliminates the original bug.
Conclusion
Issue #520 should be closed as resolved.
The described bug (claimNextMessage() only checking status = 'pending') no longer exists because:
claimNextMessage()was replaced withclaimAndDelete()which atomically removes messagesresetStuckMessages()is already called at startup BEFOREprocessPendingQueues()- The 'processing' status is now only used for legacy compatibility and edge cases
No Fix Needed
The proposed fix ("Add resetStuckMessages(0) at start of processPendingQueues()") is:
- Unnecessary - The recovery happens in
initializeBackground()beforeprocessPendingQueues()is called - Using wrong threshold -
resetStuckMessages(0)would reset ALL processing messages immediately, which could cause issues if called during normal operation (not just startup)
The current implementation with a 5-minute threshold is more robust - it only recovers truly stuck messages, not messages that are actively being processed.
Appendix: File References
| Component | File | Key Lines |
|---|---|---|
| claimAndDelete | src/services/sqlite/PendingMessageStore.ts |
85-104 |
| Queue Iterator | src/services/queue/SessionQueueProcessor.ts |
18-38 |
| Startup Recovery | src/services/worker-service.ts |
236-242 |
| processPendingQueues | src/services/worker-service.ts |
326-375 |
| getSessionsWithPendingMessages | src/services/sqlite/PendingMessageStore.ts |
319-326 |
| resetStuckMessages | src/services/sqlite/PendingMessageStore.ts |
279-290 |