bb033b95f1
- Issue #514: Documented analysis of orphaned observer session files, including root cause, evidence, and recommended fixes. - Issue #517: Analyzed PowerShell escaping issues in cleanupOrphanedProcesses() on Windows, with recommended fixes using WMIC. - Issue #520: Confirmed resolution of stuck messages issue through architectural changes to a claim-and-delete pattern. - Issue #527: Identified detection failure of uv on Apple Silicon Macs with Homebrew installation, proposed path updates for detection. - Issue #532: Analyzed memory leak issues in SessionManager, detailing session cleanup and conversationHistory growth concerns, with recommended fixes.
211 lines
7.8 KiB
Markdown
211 lines
7.8 KiB
Markdown
# Issue #520: Stuck Messages Analysis
|
|
|
|
**Date:** January 4, 2026
|
|
**Status:** RESOLVED - Issue no longer exists in current codebase
|
|
**Original Issue:** Messages in 'processing' status never recovered after worker crash
|
|
|
|
---
|
|
|
|
## Executive Summary
|
|
|
|
The issue described in GitHub #520 has been **fully resolved** in the current codebase through a fundamental architectural change. The system now uses a **claim-and-delete** pattern instead of the old **claim-process-mark** pattern, which eliminates the stuck 'processing' state problem entirely.
|
|
|
|
---
|
|
|
|
## Original Issue Description
|
|
|
|
The issue claimed that after a worker crash:
|
|
|
|
1. `getSessionsWithPendingMessages()` returns sessions with `status IN ('pending', 'processing')`
|
|
2. But `claimNextMessage()` only looks for `status = 'pending'`
|
|
3. So 'processing' messages are orphaned
|
|
|
|
**Proposed Fix:** Add `resetStuckMessages(0)` at start of `processPendingQueues()`
|
|
|
|
---
|
|
|
|
## Current Code Analysis
|
|
|
|
### 1. Queue Processing Pattern: Claim-and-Delete
|
|
|
|
The current architecture uses `claimAndDelete()` instead of `claimNextMessage()`:
|
|
|
|
**File:** `/Users/alexnewman/Scripts/claude-mem/src/services/sqlite/PendingMessageStore.ts`
|
|
|
|
```typescript
|
|
// Lines 85-104
|
|
claimAndDelete(sessionDbId: number): PersistentPendingMessage | null {
|
|
const claimTx = this.db.transaction((sessionId: number) => {
|
|
const peekStmt = this.db.prepare(`
|
|
SELECT * FROM pending_messages
|
|
WHERE session_db_id = ? AND status = 'pending'
|
|
ORDER BY id ASC
|
|
LIMIT 1
|
|
`);
|
|
const msg = peekStmt.get(sessionId) as PersistentPendingMessage | null;
|
|
|
|
if (msg) {
|
|
// Delete immediately - no "processing" state needed
|
|
const deleteStmt = this.db.prepare('DELETE FROM pending_messages WHERE id = ?');
|
|
deleteStmt.run(msg.id);
|
|
}
|
|
return msg;
|
|
});
|
|
|
|
return claimTx(sessionDbId) as PersistentPendingMessage | null;
|
|
}
|
|
```
|
|
|
|
**Key insight:** Messages are atomically selected and deleted in a single transaction. There is no 'processing' state for messages being actively worked on - they simply don't exist in the database anymore.
|
|
|
|
### 2. Iterator Uses claimAndDelete
|
|
|
|
**File:** `/Users/alexnewman/Scripts/claude-mem/src/services/queue/SessionQueueProcessor.ts`
|
|
|
|
```typescript
|
|
// Lines 18-38
|
|
async *createIterator(sessionDbId: number, signal: AbortSignal): AsyncIterableIterator<PendingMessageWithId> {
|
|
while (!signal.aborted) {
|
|
try {
|
|
// Atomically claim AND DELETE next message from DB
|
|
// Message is now in memory only - no "processing" state tracking needed
|
|
const persistentMessage = this.store.claimAndDelete(sessionDbId);
|
|
|
|
if (persistentMessage) {
|
|
// Yield the message for processing (it's already deleted from queue)
|
|
yield this.toPendingMessageWithId(persistentMessage);
|
|
} else {
|
|
// Queue empty - wait for wake-up event
|
|
await this.waitForMessage(signal);
|
|
}
|
|
} catch (error) {
|
|
// ... error handling
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
### 3. getSessionsWithPendingMessages Still Checks Both States
|
|
|
|
**File:** `/Users/alexnewman/Scripts/claude-mem/src/services/sqlite/PendingMessageStore.ts`
|
|
|
|
```typescript
|
|
// Lines 319-326
|
|
getSessionsWithPendingMessages(): number[] {
|
|
const stmt = this.db.prepare(`
|
|
SELECT DISTINCT session_db_id FROM pending_messages
|
|
WHERE status IN ('pending', 'processing')
|
|
`);
|
|
const results = stmt.all() as { session_db_id: number }[];
|
|
return results.map(r => r.session_db_id);
|
|
}
|
|
```
|
|
|
|
**This is technically vestigial code** - with the claim-and-delete pattern, messages should never be in 'processing' state. However, it provides backward compatibility and defense-in-depth.
|
|
|
|
### 4. Startup Recovery Still Exists
|
|
|
|
**File:** `/Users/alexnewman/Scripts/claude-mem/src/services/worker-service.ts`
|
|
|
|
```typescript
|
|
// Lines 236-242
|
|
// Recover stuck messages from previous crashes
|
|
const { PendingMessageStore } = await import('./sqlite/PendingMessageStore.js');
|
|
const pendingStore = new PendingMessageStore(this.dbManager.getSessionStore().db, 3);
|
|
const STUCK_THRESHOLD_MS = 5 * 60 * 1000;
|
|
const resetCount = pendingStore.resetStuckMessages(STUCK_THRESHOLD_MS);
|
|
if (resetCount > 0) {
|
|
logger.info('SYSTEM', `Recovered ${resetCount} stuck messages from previous session`, { thresholdMinutes: 5 });
|
|
}
|
|
```
|
|
|
|
This runs BEFORE `processPendingQueues()` is called (line 281), which addresses the original fix request.
|
|
|
|
---
|
|
|
|
## Verification of Issue Status
|
|
|
|
### Does the Issue Exist?
|
|
|
|
**NO** - The issue as described no longer exists because:
|
|
|
|
1. **No 'processing' state during normal operation**: With claim-and-delete, messages go directly from 'pending' to 'deleted'. They never enter a 'processing' state.
|
|
|
|
2. **Startup recovery handles legacy stuck messages**: Even if 'processing' messages exist (from old code or edge cases), `resetStuckMessages()` is called BEFORE `processPendingQueues()` in `initializeBackground()` (lines 236-241 run before line 281).
|
|
|
|
3. **Architecture fundamentally changed**: The old `claimNextMessage()` function that only looked for `status = 'pending'` no longer exists. It was replaced with `claimAndDelete()`.
|
|
|
|
### GeminiAgent and OpenRouterAgent Behavior
|
|
|
|
Both agents use the same `SessionManager.getMessageIterator()` which calls `SessionQueueProcessor.createIterator()` which uses `claimAndDelete()`. All three agents (SDKAgent, GeminiAgent, OpenRouterAgent) use identical queue processing:
|
|
|
|
```typescript
|
|
// GeminiAgent.ts:174, OpenRouterAgent.ts:134
|
|
for await (const message of this.sessionManager.getMessageIterator(session.sessionDbId)) {
|
|
// ...
|
|
}
|
|
```
|
|
|
|
They do NOT handle recovery differently - they all rely on the shared infrastructure.
|
|
|
|
### What v8.5.7 Changed
|
|
|
|
Looking at the git history:
|
|
|
|
```
|
|
v8.5.7 (ac03901):
|
|
- Minor ESM/CommonJS compatibility fix for isMainModule detection
|
|
- No queue-related changes
|
|
|
|
v8.5.6 -> v8.5.7:
|
|
- f21ea97 refactor: decompose monolith into modular architecture with comprehensive test suite (#538)
|
|
```
|
|
|
|
The major refactor happened before v8.5.7. The claim-and-delete pattern was already in place.
|
|
|
|
---
|
|
|
|
## Timeline of Resolution
|
|
|
|
Based on git history, the issue was likely resolved through these commits:
|
|
|
|
1. **b8ce27b** - `feat(queue): Simplify queue processing and enhance reliability`
|
|
2. **eb1a78b** - `fix: eliminate duplicate observations by simplifying message queue`
|
|
3. **d72a81e** - `Refactor session queue processing and database interactions`
|
|
|
|
These commits appear to have introduced the claim-and-delete pattern that eliminates the original bug.
|
|
|
|
---
|
|
|
|
## Conclusion
|
|
|
|
**Issue #520 should be closed as resolved.**
|
|
|
|
The described bug (`claimNextMessage()` only checking `status = 'pending'`) no longer exists because:
|
|
|
|
1. `claimNextMessage()` was replaced with `claimAndDelete()` which atomically removes messages
|
|
2. `resetStuckMessages()` is already called at startup BEFORE `processPendingQueues()`
|
|
3. The 'processing' status is now only used for legacy compatibility and edge cases
|
|
|
|
### No Fix Needed
|
|
|
|
The proposed fix ("Add `resetStuckMessages(0)` at start of `processPendingQueues()`") is:
|
|
|
|
1. **Unnecessary** - The recovery happens in `initializeBackground()` before `processPendingQueues()` is called
|
|
2. **Using wrong threshold** - `resetStuckMessages(0)` would reset ALL processing messages immediately, which could cause issues if called during normal operation (not just startup)
|
|
|
|
The current implementation with a 5-minute threshold is more robust - it only recovers truly stuck messages, not messages that are actively being processed.
|
|
|
|
---
|
|
|
|
## Appendix: File References
|
|
|
|
| Component | File | Key Lines |
|
|
|-----------|------|-----------|
|
|
| claimAndDelete | `src/services/sqlite/PendingMessageStore.ts` | 85-104 |
|
|
| Queue Iterator | `src/services/queue/SessionQueueProcessor.ts` | 18-38 |
|
|
| Startup Recovery | `src/services/worker-service.ts` | 236-242 |
|
|
| processPendingQueues | `src/services/worker-service.ts` | 326-375 |
|
|
| getSessionsWithPendingMessages | `src/services/sqlite/PendingMessageStore.ts` | 319-326 |
|
|
| resetStuckMessages | `src/services/sqlite/PendingMessageStore.ts` | 279-290 |
|