0862f2970c
- Documented six critical gaps in the message lifecycle leading to stuck observations. - Detailed analysis of message lifecycle architecture, including status states and normal flow. - Identified critical stuck points with code paths and required fixes. - Proposed recovery mechanisms and recommendations for critical fixes, robustness improvements, and observability enhancements. - Included file references for database, processing, HTTP, worker layers, hooks, and constants.
385 lines
12 KiB
Markdown
385 lines
12 KiB
Markdown
# Investigation Report: Stuck Observations in Processing State
|
|
|
|
**Date:** January 2, 2026
|
|
**Investigator:** Claude
|
|
**Status:** Complete
|
|
**Severity:** High - Observations can get permanently stuck until worker restart
|
|
|
|
---
|
|
|
|
## Executive Summary
|
|
|
|
Observations get stuck in "processing" state due to **six critical gaps** in the message lifecycle:
|
|
|
|
1. **In-memory tracking set not cleared on error** - `pendingProcessingIds` retains stale IDs after crashes
|
|
2. **No try-catch around database updates** - Partial updates leave system in inconsistent state
|
|
3. **Hook exit code inconsistency** - Some hooks exit explicitly, others rely on implicit Node.js behavior
|
|
4. **5-minute recovery threshold only on startup** - No continuous monitoring during runtime
|
|
5. **Iterator doesn't resume after yield errors** - Messages left in "processing" forever
|
|
6. **No global error handlers in hooks** - Unhandled promise rejections crash without cleanup
|
|
|
|
---
|
|
|
|
## Message Lifecycle Architecture
|
|
|
|
### Status States
|
|
|
|
The `pending_messages` table uses 4 states:
|
|
|
|
| Status | Description | Transition From | Transition To |
|
|
|--------|-------------|-----------------|---------------|
|
|
| `pending` | Queued, awaiting processing | (created) | `processing` |
|
|
| `processing` | Actively being processed by SDK | `pending` | `processed`, `failed`, or stuck |
|
|
| `processed` | Successfully completed | `processing` | (deleted after retention) |
|
|
| `failed` | Max retries exceeded | `processing` | (permanent) |
|
|
|
|
### Normal Flow
|
|
|
|
```
|
|
HTTP Request → enqueue() → pending
|
|
↓
|
|
claimNextMessage() → processing
|
|
↓
|
|
SDK processes → markProcessed() → processed
|
|
↓
|
|
cleanup → deleted
|
|
```
|
|
|
|
### Key Files
|
|
|
|
| Component | File | Lines |
|
|
|-----------|------|-------|
|
|
| Status enum | `src/services/sqlite/PendingMessageStore.ts` | 19 |
|
|
| Claim message | `src/services/sqlite/PendingMessageStore.ts` | 87-118 |
|
|
| Mark processed | `src/services/sqlite/PendingMessageStore.ts` | 252-264 |
|
|
| Mark failed | `src/services/sqlite/PendingMessageStore.ts` | 271-296 |
|
|
| In-memory tracking | `src/services/worker/SessionManager.ts` | 386 |
|
|
| Clear tracking | `src/services/worker/SDKAgent.ts` | 497 |
|
|
| Error handler | `src/services/worker/http/routes/SessionRoutes.ts` | 137-168 |
|
|
|
|
---
|
|
|
|
## Critical Stuck Points
|
|
|
|
### Stuck Point #1: In-Memory Set Not Cleared on Error
|
|
|
|
**Location:** `src/services/worker/http/routes/SessionRoutes.ts:137-168`
|
|
|
|
**Problem:** When a generator crashes, the error handler marks database messages as failed but **never resets `session.pendingProcessingIds`**.
|
|
|
|
**Code Path:**
|
|
```typescript
|
|
session.generatorPromise = agent.startSession(session, this.workerService)
|
|
.catch(error => {
|
|
// Mark all processing messages as failed in DB
|
|
for (const msg of processingMessages) {
|
|
pendingStore.markFailed(msg.id); // ✓ DB updated
|
|
}
|
|
// ✗ session.pendingProcessingIds.clear() - MISSING!
|
|
});
|
|
```
|
|
|
|
**Result:**
|
|
- Database shows messages as `failed`
|
|
- In-memory set still contains stale message IDs
|
|
- On generator restart, same IDs added again (duplicates possible)
|
|
- Memory-database state divergence
|
|
|
|
**Fix Required:** Add `session.pendingProcessingIds.clear()` in catch block.
|
|
|
|
---
|
|
|
|
### Stuck Point #2: No Try-Catch Around markProcessed()
|
|
|
|
**Location:** `src/services/worker/SDKAgent.ts:487-516`
|
|
|
|
**Problem:** The `markMessagesProcessed()` function loops through all pending IDs but has no error handling around individual `markProcessed()` calls.
|
|
|
|
**Code Path:**
|
|
```typescript
|
|
private async markMessagesProcessed(session, worker): Promise<void> {
|
|
for (const messageId of session.pendingProcessingIds) {
|
|
pendingMessageStore.markProcessed(messageId); // ✗ No try-catch
|
|
}
|
|
session.pendingProcessingIds.clear(); // Never reached if above throws
|
|
}
|
|
```
|
|
|
|
**Result:**
|
|
- If DB error occurs on message N, messages N+1...M never marked
|
|
- `pendingProcessingIds.clear()` never called
|
|
- Partial database update + stale in-memory set
|
|
|
|
**Fix Required:** Wrap individual `markProcessed()` calls in try-catch, continue on error, log failures.
|
|
|
|
---
|
|
|
|
### Stuck Point #3: Hook Exit Code Inconsistency
|
|
|
|
**Location:** All hooks in `src/hooks/`
|
|
|
|
**Problem:** Hooks have inconsistent exit patterns:
|
|
|
|
| Hook | Explicit Exit? | Method | Timeout |
|
|
|------|----------------|--------|---------|
|
|
| context-hook | YES | `process.exit(0)` | 15s |
|
|
| user-message-hook | YES | `process.exit(3)` | 15s |
|
|
| new-hook | NO | Implicit | 15s |
|
|
| save-hook | NO | Implicit | 300s |
|
|
| summary-hook | NO | Implicit | 300s |
|
|
|
|
**Critical Issues:**
|
|
|
|
1. **No global error handlers** - No `process.on('unhandledRejection', ...)` in any hook
|
|
2. **Async errors bubble to Node.js** - Causes exit(1) with stack trace to stderr
|
|
3. **save-hook fire-and-forget pattern** - Errors may not surface
|
|
|
|
**save-hook.ts Entry Point (lines 75-85):**
|
|
```typescript
|
|
stdin.on('end', async () => {
|
|
// No try-catch wrapper!
|
|
try {
|
|
parsed = input.trim() ? JSON.parse(input) : undefined;
|
|
} catch (error) {
|
|
throw new Error(`Failed to parse...`); // Unhandled!
|
|
}
|
|
await saveHook(parsed); // Also can throw, unhandled!
|
|
});
|
|
```
|
|
|
|
**summary-hook.ts Bug (line 68):**
|
|
```typescript
|
|
if (!response.ok) {
|
|
console.log(STANDARD_HOOK_RESPONSE); // Outputs success BEFORE throwing!
|
|
throw new Error(`Summary generation failed: ${response.status}`);
|
|
}
|
|
```
|
|
|
|
This sends success response to Claude Code, then crashes.
|
|
|
|
---
|
|
|
|
### Stuck Point #4: Iterator Doesn't Resume After Yield Error
|
|
|
|
**Location:** `src/services/queue/SessionQueueProcessor.ts:17-38`
|
|
|
|
**Problem:** The async iterator stops completely if the consuming agent throws while processing a yielded message.
|
|
|
|
**Code Path:**
|
|
```typescript
|
|
async *createIterator(sessionDbId, signal) {
|
|
while (!signal.aborted) {
|
|
const message = this.store.claimNextMessage(sessionDbId); // → processing
|
|
if (message) {
|
|
yield message; // Agent throws here = iterator stops
|
|
} else {
|
|
await this.waitForMessage(signal);
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
**Result:**
|
|
- Message claimed → status = `processing`
|
|
- Message yielded → agent throws during processing
|
|
- Iterator stops, never resumes
|
|
- Message stuck until 5-minute timeout
|
|
|
|
**Fix Required:** Wrap yield in try-catch, mark failed on error, continue loop.
|
|
|
|
---
|
|
|
|
### Stuck Point #5: 5-Minute Recovery Only on Startup
|
|
|
|
**Location:** `src/services/worker-service.ts:686-690`
|
|
|
|
**Problem:** Stuck message recovery only runs when worker initializes.
|
|
|
|
**Code Path:**
|
|
```typescript
|
|
// In initializeWorker()
|
|
const STUCK_THRESHOLD_MS = 5 * 60 * 1000;
|
|
const resetCount = pendingStore.resetStuckMessages(STUCK_THRESHOLD_MS);
|
|
```
|
|
|
|
**Result:**
|
|
- During normal operation, no continuous monitoring
|
|
- Messages can stay stuck for hours if worker doesn't restart
|
|
- User must manually restart worker or wait
|
|
|
|
**Fix Required:** Add periodic stuck message check (every 60 seconds) during runtime.
|
|
|
|
---
|
|
|
|
### Stuck Point #6: markFailed() Not Transactional
|
|
|
|
**Location:** `src/services/sqlite/PendingMessageStore.ts:271-296`
|
|
|
|
**Problem:** The `markFailed()` method does SELECT then UPDATE without a transaction wrapper.
|
|
|
|
**Code Path:**
|
|
```typescript
|
|
markFailed(messageId: number): void {
|
|
const msg = this.db.prepare(`SELECT retry_count FROM pending_messages WHERE id = ?`).get(messageId);
|
|
|
|
// Race condition window here!
|
|
|
|
if (msg.retry_count < this.maxRetries) {
|
|
this.db.prepare(`UPDATE pending_messages SET status = 'pending', retry_count = retry_count + 1...`).run(messageId);
|
|
} else {
|
|
this.db.prepare(`UPDATE pending_messages SET status = 'failed'...`).run(messageId);
|
|
}
|
|
}
|
|
```
|
|
|
|
**Result:**
|
|
- If process crashes between SELECT and UPDATE, retry_count may be stale
|
|
- Could lead to wrong retry decision
|
|
|
|
**Fix Required:** Wrap in `this.db.transaction(() => { ... })()`.
|
|
|
|
---
|
|
|
|
## Stuck Scenarios
|
|
|
|
### Scenario A: SDK Hangs During Processing
|
|
|
|
1. Message claimed → `status = 'processing'`
|
|
2. Added to `pendingProcessingIds`
|
|
3. Yielded to SDK agent
|
|
4. SDK hangs (e.g., network timeout, infinite loop)
|
|
5. **Result:** Stuck forever until 5-minute timeout on worker restart
|
|
|
|
### Scenario B: Generator Crash After Yielding
|
|
|
|
1. Message claimed and yielded
|
|
2. Agent throws error before `markProcessed()`
|
|
3. Error handler marks DB messages as `failed`
|
|
4. `pendingProcessingIds` NOT cleared
|
|
5. Generator restarts
|
|
6. Same message IDs added to set again
|
|
7. **Result:** Duplicate tracking, potential double-processing
|
|
|
|
### Scenario C: Partial Database Update
|
|
|
|
1. 5 messages being marked processed
|
|
2. Messages 1-3 succeed
|
|
3. Database connection drops
|
|
4. Message 4 throws error
|
|
5. Loop breaks, messages 4-5 never marked
|
|
6. `pendingProcessingIds.clear()` never called
|
|
7. **Result:** Mixed state - some processed, some stuck
|
|
|
|
### Scenario D: Hook Throws Without Cleanup
|
|
|
|
1. `save-hook.ts` receives observation
|
|
2. HTTP request to worker succeeds
|
|
3. Output `STANDARD_HOOK_RESPONSE` sent
|
|
4. Later code throws (e.g., Chroma sync fails)
|
|
5. Node.js exits with code 1
|
|
6. **Result:** Claude Code sees success, but observation may be partial
|
|
|
|
---
|
|
|
|
## Recovery Mechanisms
|
|
|
|
### Current Mechanisms
|
|
|
|
| Mechanism | Location | Trigger | Limitation |
|
|
|-----------|----------|---------|------------|
|
|
| Startup stuck reset | worker-service.ts:687 | Worker restart | Only on restart |
|
|
| Generator crash recovery | SessionRoutes.ts:183-216 | Generator exit | Requires full exit |
|
|
| Manual retry | (needs verification) | User action | Requires UI intervention |
|
|
| Old message cleanup | SDKAgent.ts:504 | After processing | Only cleans processed |
|
|
|
|
### Missing Mechanisms
|
|
|
|
1. **Continuous stuck monitoring** - No runtime detection
|
|
2. **Per-message timeout** - No kill switch for hung SDK
|
|
3. **UI stuck count display** - User can't see stuck messages
|
|
4. **Manual recovery API** - No endpoint to retry individual messages
|
|
|
|
---
|
|
|
|
## Recommendations
|
|
|
|
### Priority 1: Critical Fixes
|
|
|
|
1. **Clear pendingProcessingIds in error handler**
|
|
- File: `SessionRoutes.ts:168`
|
|
- Add: `session.pendingProcessingIds.clear()`
|
|
|
|
2. **Add try-catch around markProcessed loop**
|
|
- File: `SDKAgent.ts:489`
|
|
- Wrap individual calls, continue on error
|
|
|
|
3. **Add global error handler to hooks**
|
|
- All hooks in `src/hooks/`
|
|
- Add `process.on('unhandledRejection', ...)` at entry
|
|
|
|
### Priority 2: Robustness Improvements
|
|
|
|
4. **Add continuous stuck message monitor**
|
|
- Check every 60 seconds during runtime
|
|
- Reset messages stuck > 5 minutes
|
|
|
|
5. **Make markFailed transactional**
|
|
- Wrap SELECT + UPDATE in transaction
|
|
|
|
6. **Fix summary-hook output-before-throw bug**
|
|
- Move `console.log(STANDARD_HOOK_RESPONSE)` after error check
|
|
|
|
### Priority 3: Observability
|
|
|
|
7. **Add stuck message count to viewer UI**
|
|
- Show processing messages > 2 minutes old
|
|
|
|
8. **Add manual retry API endpoint**
|
|
- Allow user to retry stuck messages without restart
|
|
|
|
9. **Add explicit exit to all hooks**
|
|
- Consistent `process.exit(0)` on success path
|
|
|
|
---
|
|
|
|
## Appendix: File Reference
|
|
|
|
### Database Layer
|
|
- `src/services/sqlite/PendingMessageStore.ts` - Message queue persistence
|
|
- `src/services/sqlite/SessionStore.ts` - Session management, table schemas
|
|
|
|
### Processing Layer
|
|
- `src/services/queue/SessionQueueProcessor.ts` - Async iterator for claiming
|
|
- `src/services/worker/SessionManager.ts` - Session state, message iterator
|
|
- `src/services/worker/SDKAgent.ts` - SDK interaction, response processing
|
|
|
|
### HTTP Layer
|
|
- `src/services/worker/http/routes/SessionRoutes.ts` - Generator lifecycle, error handling
|
|
|
|
### Worker Layer
|
|
- `src/services/worker-service.ts` - Startup recovery, health checks
|
|
|
|
### Hooks
|
|
- `src/hooks/context-hook.ts` - SessionStart (explicit exit)
|
|
- `src/hooks/user-message-hook.ts` - SessionStart parallel (explicit exit)
|
|
- `src/hooks/new-hook.ts` - UserPromptSubmit (implicit exit)
|
|
- `src/hooks/save-hook.ts` - PostToolUse (implicit exit, fire-and-forget)
|
|
- `src/hooks/summary-hook.ts` - Stop (implicit exit, output bug)
|
|
|
|
### Constants
|
|
- `src/shared/hook-constants.ts` - Exit codes, timeouts
|
|
|
|
---
|
|
|
|
## Conclusion
|
|
|
|
The primary cause of stuck observations is the **disconnect between in-memory tracking (`pendingProcessingIds`) and database state**. When errors occur, the database may be updated but the in-memory set is not cleared, leading to:
|
|
|
|
1. Duplicate tracking on restart
|
|
2. Memory-database state divergence
|
|
3. Messages appearing stuck in UI
|
|
|
|
Secondary causes include inconsistent hook exit patterns and the lack of runtime stuck message monitoring.
|
|
|
|
The 5-minute startup recovery is a safety net, but it only works when the worker restarts. For a production system, continuous monitoring and proper error handling at all state transition points are essential.
|