Investigation Report: Stuck Observations in Processing State

Date: January 2, 2026
Investigator: Claude
Status: Complete
Severity: High - Observations can get permanently stuck until worker restart


Executive Summary

Observations get stuck in "processing" state due to six critical gaps in the message lifecycle:

  1. In-memory tracking set not cleared on error - pendingProcessingIds retains stale IDs after crashes
  2. No try-catch around database updates - Partial updates leave system in inconsistent state
  3. Hook exit code inconsistency - Some hooks exit explicitly, others rely on implicit Node.js behavior
  4. 5-minute recovery threshold only on startup - No continuous monitoring during runtime
  5. Iterator doesn't resume after yield errors - Messages left in "processing" forever
  6. No global error handlers in hooks - Unhandled promise rejections crash without cleanup

Message Lifecycle Architecture

Status States

The pending_messages table uses 4 states:

| Status | Description | Transition From | Transition To |
|---|---|---|---|
| pending | Queued, awaiting processing | (created) | processing |
| processing | Actively being processed by SDK | pending | processed, failed, or stuck |
| processed | Successfully completed | processing | (deleted after retention) |
| failed | Max retries exceeded | processing | (permanent) |
Normal Flow

HTTP Request → enqueue() → pending
                              ↓
           claimNextMessage() → processing
                              ↓
              SDK processes → markProcessed() → processed
                              ↓
                      cleanup → deleted

Key Files

| Component | File | Lines |
|---|---|---|
| Status enum | src/services/sqlite/PendingMessageStore.ts | 19 |
| Claim message | src/services/sqlite/PendingMessageStore.ts | 87-118 |
| Mark processed | src/services/sqlite/PendingMessageStore.ts | 252-264 |
| Mark failed | src/services/sqlite/PendingMessageStore.ts | 271-296 |
| In-memory tracking | src/services/worker/SessionManager.ts | 386 |
| Clear tracking | src/services/worker/SDKAgent.ts | 497 |
| Error handler | src/services/worker/http/routes/SessionRoutes.ts | 137-168 |

Critical Stuck Points

Stuck Point #1: In-Memory Set Not Cleared on Error

Location: src/services/worker/http/routes/SessionRoutes.ts:137-168

Problem: When a generator crashes, the error handler marks database messages as failed but never resets session.pendingProcessingIds.

Code Path:

session.generatorPromise = agent.startSession(session, this.workerService)
  .catch(error => {
    // Mark all processing messages as failed in DB
    for (const msg of processingMessages) {
      pendingStore.markFailed(msg.id);  // ✓ DB updated
    }
    // ✗ session.pendingProcessingIds.clear() - MISSING!
  });

Result:

  • Database shows messages as failed
  • In-memory set still contains stale message IDs
  • On generator restart, same IDs added again (duplicates possible)
  • Memory-database state divergence

Fix Required: Add session.pendingProcessingIds.clear() in catch block.
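
A minimal sketch of the corrected handler, reusing the names from the snippet above; the try/finally ensures the in-memory set is cleared even if one of the markFailed() calls itself throws (details are illustrative, not the actual implementation):

session.generatorPromise = agent.startSession(session, this.workerService)
  .catch(error => {
    try {
      // Mark all processing messages as failed in DB
      for (const msg of processingMessages) {
        pendingStore.markFailed(msg.id);
      }
    } finally {
      // Keep the in-memory tracking set consistent with the database
      session.pendingProcessingIds.clear();
    }
  });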


Stuck Point #2: No Try-Catch Around markProcessed()

Location: src/services/worker/SDKAgent.ts:487-516

Problem: The markMessagesProcessed() function loops through all pending IDs but has no error handling around individual markProcessed() calls.

Code Path:

private async markMessagesProcessed(session, worker): Promise<void> {
  for (const messageId of session.pendingProcessingIds) {
    pendingMessageStore.markProcessed(messageId);  // ✗ No try-catch
  }
  session.pendingProcessingIds.clear();  // Never reached if above throws
}

Result:

  • If DB error occurs on message N, messages N+1...M never marked
  • pendingProcessingIds.clear() never called
  • Partial database update + stale in-memory set

Fix Required: Wrap individual markProcessed() calls in try-catch, continue on error, log failures.
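
A sketch of that fix, keeping the loop shape from the snippet above (console.error stands in for whatever logger the worker actually uses):

private async markMessagesProcessed(session, worker): Promise<void> {
  for (const messageId of session.pendingProcessingIds) {
    try {
      pendingMessageStore.markProcessed(messageId);
    } catch (error) {
      // A single DB failure should not strand the remaining messages
      console.error(`Failed to mark message ${messageId} as processed`, error);
    }
  }
  // Always reached, even if individual updates failed
  session.pendingProcessingIds.clear();
}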


Stuck Point #3: Hook Exit Code Inconsistency

Location: All hooks in src/hooks/

Problem: Hooks have inconsistent exit patterns:

| Hook | Explicit Exit? | Method | Timeout |
|---|---|---|---|
| context-hook | YES | process.exit(0) | 15s |
| user-message-hook | YES | process.exit(3) | 15s |
| new-hook | NO | Implicit | 15s |
| save-hook | NO | Implicit | 300s |
| summary-hook | NO | Implicit | 300s |

Critical Issues:

  1. No global error handlers - No process.on('unhandledRejection', ...) in any hook
  2. Async errors bubble to Node.js - Causes exit(1) with stack trace to stderr
  3. save-hook fire-and-forget pattern - Errors may not surface

save-hook.ts Entry Point (lines 75-85):

stdin.on('end', async () => {
  // No try-catch around the whole async callback!
  try {
    parsed = input.trim() ? JSON.parse(input) : undefined;
  } catch (error) {
    throw new Error(`Failed to parse...`);  // Unhandled!
  }
  await saveHook(parsed);  // Also can throw, unhandled!
});
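
One way to close the global-handler gap (issue 1 above) is a process-level guard at the top of each hook entry point. This is a sketch only; the correct exit code and output for each hook should follow the contracts in src/shared/hook-constants.ts:

// Illustrative guard, e.g. at the top of save-hook.ts
process.on('unhandledRejection', (reason) => {
  console.error('[save-hook] unhandled rejection:', reason);
  process.exit(0); // exit code is an assumption; use the hook's documented contract
});

process.on('uncaughtException', (error) => {
  console.error('[save-hook] uncaught exception:', error);
  process.exit(0);
});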

summary-hook.ts Bug (line 68):

if (!response.ok) {
  console.log(STANDARD_HOOK_RESPONSE);  // Outputs success BEFORE throwing!
  throw new Error(`Summary generation failed: ${response.status}`);
}

This sends success response to Claude Code, then crashes.
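
The straightforward fix is to reorder the two statements so the standard response is only emitted on the success path, roughly (assuming this is where the success response is emitted):

if (!response.ok) {
  throw new Error(`Summary generation failed: ${response.status}`);
}
console.log(STANDARD_HOOK_RESPONSE);  // success path only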


Stuck Point #4: Iterator Doesn't Resume After Yield Error

Location: src/services/queue/SessionQueueProcessor.ts:17-38

Problem: The async iterator stops completely if the consuming agent throws while processing a yielded message.

Code Path:

async *createIterator(sessionDbId, signal) {
  while (!signal.aborted) {
    const message = this.store.claimNextMessage(sessionDbId);  // → processing
    if (message) {
      yield message;  // Agent throws here = iterator stops
    } else {
      await this.waitForMessage(signal);
    }
  }
}

Result:

  • Message claimed → status = processing
  • Message yielded → agent throws during processing
  • Iterator stops, never resumes
  • Message stuck until 5-minute timeout

Fix Required: Wrap yield in try-catch, mark failed on error, continue loop.
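
A sketch of that fix, assuming the consuming agent propagates its error back into the generator (e.g. via iterator.throw()); if the consumer instead abandons iteration, a finally block would also be needed:

async *createIterator(sessionDbId, signal) {
  while (!signal.aborted) {
    const message = this.store.claimNextMessage(sessionDbId);  // → processing
    if (message) {
      try {
        yield message;
      } catch (error) {
        // The agent threw while handling this message; fail it and keep the loop alive
        this.store.markFailed(message.id);
        continue;
      }
    } else {
      await this.waitForMessage(signal);
    }
  }
}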


Stuck Point #5: 5-Minute Recovery Only on Startup

Location: src/services/worker-service.ts:686-690

Problem: Stuck message recovery only runs when worker initializes.

Code Path:

// In initializeWorker()
const STUCK_THRESHOLD_MS = 5 * 60 * 1000;
const resetCount = pendingStore.resetStuckMessages(STUCK_THRESHOLD_MS);

Result:

  • During normal operation, no continuous monitoring
  • Messages can stay stuck for hours if worker doesn't restart
  • User must manually restart worker or wait

Fix Required: Add periodic stuck message check (every 60 seconds) during runtime.
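
A sketch of such a monitor, reusing the existing resetStuckMessages() call from startup (interval handling and logging are illustrative):

// In initializeWorker(), alongside the existing startup reset
const STUCK_THRESHOLD_MS = 5 * 60 * 1000;
const CHECK_INTERVAL_MS = 60 * 1000;

const stuckMonitor = setInterval(() => {
  const resetCount = pendingStore.resetStuckMessages(STUCK_THRESHOLD_MS);
  if (resetCount > 0) {
    console.warn(`Reset ${resetCount} stuck message(s) back to pending`);
  }
}, CHECK_INTERVAL_MS);

// Don't let the monitor alone keep the process alive
stuckMonitor.unref();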


Stuck Point #6: markFailed() Not Transactional

Location: src/services/sqlite/PendingMessageStore.ts:271-296

Problem: The markFailed() method does SELECT then UPDATE without a transaction wrapper.

Code Path:

markFailed(messageId: number): void {
  const msg = this.db.prepare(`SELECT retry_count FROM pending_messages WHERE id = ?`).get(messageId);

  // Race condition window here!

  if (msg.retry_count < this.maxRetries) {
    this.db.prepare(`UPDATE pending_messages SET status = 'pending', retry_count = retry_count + 1...`).run(messageId);
  } else {
    this.db.prepare(`UPDATE pending_messages SET status = 'failed'...`).run(messageId);
  }
}

Result:

  • If another caller modifies the row between the SELECT and the UPDATE, the retry_count read here is stale
  • Could lead to wrong retry decision

Fix Required: Wrap in this.db.transaction(() => { ... })().
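
A sketch of the transactional version, assuming the better-sqlite3-style transaction() API implied by the surrounding prepare/get/run calls (SQL elisions from the snippet above are preserved):

markFailed(messageId: number): void {
  const txn = this.db.transaction((id: number) => {
    const msg = this.db.prepare(`SELECT retry_count FROM pending_messages WHERE id = ?`).get(id);
    if (!msg) return;  // already deleted or never enqueued

    if (msg.retry_count < this.maxRetries) {
      this.db.prepare(`UPDATE pending_messages SET status = 'pending', retry_count = retry_count + 1...`).run(id);
    } else {
      this.db.prepare(`UPDATE pending_messages SET status = 'failed'...`).run(id);
    }
  });

  txn(messageId);
}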


Stuck Scenarios

Scenario A: SDK Hangs During Processing

  1. Message claimed → status = 'processing'
  2. Added to pendingProcessingIds
  3. Yielded to SDK agent
  4. SDK hangs (e.g., network timeout, infinite loop)
  5. Result: Stuck indefinitely; only recovered by the 5-minute threshold check when the worker next restarts

Scenario B: Generator Crash After Yielding

  1. Message claimed and yielded
  2. Agent throws error before markProcessed()
  3. Error handler marks DB messages as failed
  4. pendingProcessingIds NOT cleared
  5. Generator restarts
  6. Same message IDs added to set again
  7. Result: Duplicate tracking, potential double-processing

Scenario C: Partial Database Update

  1. 5 messages being marked processed
  2. Messages 1-3 succeed
  3. Database connection drops
  4. Message 4 throws error
  5. Loop breaks, messages 4-5 never marked
  6. pendingProcessingIds.clear() never called
  7. Result: Mixed state - some processed, some stuck

Scenario D: Hook Throws Without Cleanup

  1. save-hook.ts receives observation
  2. HTTP request to worker succeeds
  3. Output STANDARD_HOOK_RESPONSE sent
  4. Later code throws (e.g., Chroma sync fails)
  5. Node.js exits with code 1
  6. Result: Claude Code sees success, but observation may be partial

Recovery Mechanisms

Current Mechanisms

| Mechanism | Location | Trigger | Limitation |
|---|---|---|---|
| Startup stuck reset | worker-service.ts:687 | Worker restart | Only on restart |
| Generator crash recovery | SessionRoutes.ts:183-216 | Generator exit | Requires full exit |
| Manual retry | (needs verification) | User action | Requires UI intervention |
| Old message cleanup | SDKAgent.ts:504 | After processing | Only cleans processed |

Missing Mechanisms

  1. Continuous stuck monitoring - No runtime detection
  2. Per-message timeout - No kill switch for hung SDK
  3. UI stuck count display - User can't see stuck messages
  4. Manual recovery API - No endpoint to retry individual messages

Recommendations

Priority 1: Critical Fixes

  1. Clear pendingProcessingIds in error handler

    • File: SessionRoutes.ts:168
    • Add: session.pendingProcessingIds.clear()
  2. Add try-catch around markProcessed loop

    • File: SDKAgent.ts:489
    • Wrap individual calls, continue on error
  3. Add global error handler to hooks

    • All hooks in src/hooks/
    • Add process.on('unhandledRejection', ...) at entry

Priority 2: Robustness Improvements

  1. Add continuous stuck message monitor

    • Check every 60 seconds during runtime
    • Reset messages stuck > 5 minutes
  2. Make markFailed transactional

    • Wrap SELECT + UPDATE in transaction
  3. Fix summary-hook output-before-throw bug

    • Move console.log(STANDARD_HOOK_RESPONSE) after error check

Priority 3: Observability

  1. Add stuck message count to viewer UI

    • Show processing messages > 2 minutes old
  2. Add manual retry API endpoint

    • Allow user to retry stuck messages without restart
  3. Add explicit exit to all hooks

    • Consistent process.exit(0) on success path

Appendix: File Reference

Database Layer

  • src/services/sqlite/PendingMessageStore.ts - Message queue persistence
  • src/services/sqlite/SessionStore.ts - Session management, table schemas

Processing Layer

  • src/services/queue/SessionQueueProcessor.ts - Async iterator for claiming
  • src/services/worker/SessionManager.ts - Session state, message iterator
  • src/services/worker/SDKAgent.ts - SDK interaction, response processing

HTTP Layer

  • src/services/worker/http/routes/SessionRoutes.ts - Generator lifecycle, error handling

Worker Layer

  • src/services/worker-service.ts - Startup recovery, health checks

Hooks

  • src/hooks/context-hook.ts - SessionStart (explicit exit)
  • src/hooks/user-message-hook.ts - SessionStart parallel (explicit exit)
  • src/hooks/new-hook.ts - UserPromptSubmit (implicit exit)
  • src/hooks/save-hook.ts - PostToolUse (implicit exit, fire-and-forget)
  • src/hooks/summary-hook.ts - Stop (implicit exit, output bug)

Constants

  • src/shared/hook-constants.ts - Exit codes, timeouts

Conclusion

The primary cause of stuck observations is the disconnect between in-memory tracking (pendingProcessingIds) and database state. When errors occur, the database may be updated but the in-memory set is not cleared, leading to:

  1. Duplicate tracking on restart
  2. Memory-database state divergence
  3. Messages appearing stuck in UI

Secondary causes include inconsistent hook exit patterns and the lack of runtime stuck message monitoring.

The 5-minute startup recovery is a safety net, but it only works when the worker restarts. For a production system, continuous monitoring and proper error handling at all state transition points are essential.