Investigation Report: Stuck Observations in Processing State

Date: January 2, 2026
Investigator: Claude
Status: Complete
Severity: High - Observations can get permanently stuck until worker restart


Executive Summary

Observations get stuck in "processing" state due to six critical gaps in the message lifecycle:

  1. In-memory tracking set not cleared on error - pendingProcessingIds retains stale IDs after crashes
  2. No try-catch around database updates - Partial updates leave system in inconsistent state
  3. Hook exit code inconsistency - Some hooks exit explicitly, others rely on implicit Node.js behavior
  4. 5-minute recovery threshold only on startup - No continuous monitoring during runtime
  5. Iterator doesn't resume after yield errors - Messages left in "processing" forever
  6. No global error handlers in hooks - Unhandled promise rejections crash without cleanup

Message Lifecycle Architecture

Status States

The pending_messages table uses 4 states:

| Status | Description | Transition From | Transition To |
|---|---|---|---|
| pending | Queued, awaiting processing | (created) | processing |
| processing | Actively being processed by SDK | pending | processed, failed, or stuck |
| processed | Successfully completed | processing | (deleted after retention) |
| failed | Max retries exceeded | processing | (permanent) |
Normal Flow

HTTP Request → enqueue() → pending
                              ↓
           claimNextMessage() → processing
                              ↓
              SDK processes → markProcessed() → processed
                              ↓
                      cleanup → deleted

Key Files

| Component | File | Lines |
|---|---|---|
| Status enum | src/services/sqlite/PendingMessageStore.ts | 19 |
| Claim message | src/services/sqlite/PendingMessageStore.ts | 87-118 |
| Mark processed | src/services/sqlite/PendingMessageStore.ts | 252-264 |
| Mark failed | src/services/sqlite/PendingMessageStore.ts | 271-296 |
| In-memory tracking | src/services/worker/SessionManager.ts | 386 |
| Clear tracking | src/services/worker/SDKAgent.ts | 497 |
| Error handler | src/services/worker/http/routes/SessionRoutes.ts | 137-168 |

Critical Stuck Points

Stuck Point #1: In-Memory Set Not Cleared on Error

Location: src/services/worker/http/routes/SessionRoutes.ts:137-168

Problem: When a generator crashes, the error handler marks database messages as failed but never resets session.pendingProcessingIds.

Code Path:

session.generatorPromise = agent.startSession(session, this.workerService)
  .catch(error => {
    // Mark all processing messages as failed in DB
    for (const msg of processingMessages) {
      pendingStore.markFailed(msg.id);  // ✓ DB updated
    }
    // ✗ session.pendingProcessingIds.clear() - MISSING!
  });

Result:

  • Database shows messages as failed
  • In-memory set still contains stale message IDs
  • On generator restart, same IDs added again (duplicates possible)
  • Memory-database state divergence

Fix Required: Add session.pendingProcessingIds.clear() in catch block.
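
A minimal sketch of the corrected handler, reusing the names from the snippet above; the try/finally ensures the in-memory set is cleared even if one of the markFailed() calls itself throws (details are illustrative, not the actual implementation):

session.generatorPromise = agent.startSession(session, this.workerService)
  .catch(error => {
    try {
      // Mark all processing messages as failed in DB
      for (const msg of processingMessages) {
        pendingStore.markFailed(msg.id);
      }
    } finally {
      // Keep the in-memory tracking set consistent with the database
      session.pendingProcessingIds.clear();
    }
  });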


Stuck Point #2: No Try-Catch Around markProcessed()

Location: src/services/worker/SDKAgent.ts:487-516

Problem: The markMessagesProcessed() function loops through all pending IDs but has no error handling around individual markProcessed() calls.

Code Path:

private async markMessagesProcessed(session, worker): Promise<void> {
  for (const messageId of session.pendingProcessingIds) {
    pendingMessageStore.markProcessed(messageId);  // ✗ No try-catch
  }
  session.pendingProcessingIds.clear();  // Never reached if above throws
}

Result:

  • If DB error occurs on message N, messages N+1...M never marked
  • pendingProcessingIds.clear() never called
  • Partial database update + stale in-memory set

Fix Required: Wrap individual markProcessed() calls in try-catch, continue on error, log failures.
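
A sketch of that fix, keeping the loop shape from the snippet above (console.error stands in for whatever logger the worker actually uses):

private async markMessagesProcessed(session, worker): Promise<void> {
  for (const messageId of session.pendingProcessingIds) {
    try {
      pendingMessageStore.markProcessed(messageId);
    } catch (error) {
      // A single DB failure should not strand the remaining messages
      console.error(`Failed to mark message ${messageId} as processed`, error);
    }
  }
  // Always reached, even if individual updates failed
  session.pendingProcessingIds.clear();
}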


Stuck Point #3: Hook Exit Code Inconsistency

Location: All hooks in src/hooks/

Problem: Hooks have inconsistent exit patterns:

| Hook | Explicit Exit? | Method | Timeout |
|---|---|---|---|
| context-hook | YES | process.exit(0) | 15s |
| user-message-hook | YES | process.exit(3) | 15s |
| new-hook | NO | Implicit | 15s |
| save-hook | NO | Implicit | 300s |
| summary-hook | NO | Implicit | 300s |

Critical Issues:

  1. No global error handlers - No process.on('unhandledRejection', ...) in any hook
  2. Async errors bubble to Node.js - Causes exit(1) with stack trace to stderr
  3. save-hook fire-and-forget pattern - Errors may not surface

save-hook.ts Entry Point (lines 75-85):

stdin.on('end', async () => {
  // No try-catch around the whole async callback!
  try {
    parsed = input.trim() ? JSON.parse(input) : undefined;
  } catch (error) {
    throw new Error(`Failed to parse...`);  // Unhandled!
  }
  await saveHook(parsed);  // Also can throw, unhandled!
});
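
One way to close the global-handler gap (issue 1 above) is a process-level guard at the top of each hook entry point. This is a sketch only; the correct exit code and output for each hook should follow the contracts in src/shared/hook-constants.ts:

// Illustrative guard, e.g. at the top of save-hook.ts
process.on('unhandledRejection', (reason) => {
  console.error('[save-hook] unhandled rejection:', reason);
  process.exit(0); // exit code is an assumption; use the hook's documented contract
});

process.on('uncaughtException', (error) => {
  console.error('[save-hook] uncaught exception:', error);
  process.exit(0);
});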

summary-hook.ts Bug (line 68):

if (!response.ok) {
  console.log(STANDARD_HOOK_RESPONSE);  // Outputs success BEFORE throwing!
  throw new Error(`Summary generation failed: ${response.status}`);
}

This sends success response to Claude Code, then crashes.
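
The straightforward fix is to reorder the two statements so the standard response is only emitted on the success path, roughly (assuming this is where the success response is emitted):

if (!response.ok) {
  throw new Error(`Summary generation failed: ${response.status}`);
}
console.log(STANDARD_HOOK_RESPONSE);  // success path only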


Stuck Point #4: Iterator Doesn't Resume After Yield Error

Location: src/services/queue/SessionQueueProcessor.ts:17-38

Problem: The async iterator stops completely if the consuming agent throws while processing a yielded message.

Code Path:

async *createIterator(sessionDbId, signal) {
  while (!signal.aborted) {
    const message = this.store.claimNextMessage(sessionDbId);  // → processing
    if (message) {
      yield message;  // Agent throws here = iterator stops
    } else {
      await this.waitForMessage(signal);
    }
  }
}

Result:

  • Message claimed → status = processing
  • Message yielded → agent throws during processing
  • Iterator stops, never resumes
  • Message stuck until 5-minute timeout

Fix Required: Wrap yield in try-catch, mark failed on error, continue loop.
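
A sketch of that fix, assuming the consuming agent propagates its error back into the generator (e.g. via iterator.throw()); if the consumer instead abandons iteration, a finally block would also be needed:

async *createIterator(sessionDbId, signal) {
  while (!signal.aborted) {
    const message = this.store.claimNextMessage(sessionDbId);  // → processing
    if (message) {
      try {
        yield message;
      } catch (error) {
        // The agent threw while handling this message; fail it and keep the loop alive
        this.store.markFailed(message.id);
        continue;
      }
    } else {
      await this.waitForMessage(signal);
    }
  }
}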


Stuck Point #5: 5-Minute Recovery Only on Startup

Location: src/services/worker-service.ts:686-690

Problem: Stuck message recovery only runs when worker initializes.

Code Path:

// In initializeWorker()
const STUCK_THRESHOLD_MS = 5 * 60 * 1000;
const resetCount = pendingStore.resetStuckMessages(STUCK_THRESHOLD_MS);

Result:

  • During normal operation, no continuous monitoring
  • Messages can stay stuck for hours if worker doesn't restart
  • User must manually restart worker or wait

Fix Required: Add periodic stuck message check (every 60 seconds) during runtime.
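
A sketch of such a monitor, reusing the existing resetStuckMessages() call from startup (interval handling and logging are illustrative):

// In initializeWorker(), alongside the existing startup reset
const STUCK_THRESHOLD_MS = 5 * 60 * 1000;
const CHECK_INTERVAL_MS = 60 * 1000;

const stuckMonitor = setInterval(() => {
  const resetCount = pendingStore.resetStuckMessages(STUCK_THRESHOLD_MS);
  if (resetCount > 0) {
    console.warn(`Reset ${resetCount} stuck message(s) back to pending`);
  }
}, CHECK_INTERVAL_MS);

// Don't let the monitor alone keep the process alive
stuckMonitor.unref();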


Stuck Point #6: markFailed() Not Transactional

Location: src/services/sqlite/PendingMessageStore.ts:271-296

Problem: The markFailed() method does SELECT then UPDATE without a transaction wrapper.

Code Path:

markFailed(messageId: number): void {
  const msg = this.db.prepare(`SELECT retry_count FROM pending_messages WHERE id = ?`).get(messageId);

  // Race condition window here!

  if (msg.retry_count < this.maxRetries) {
    this.db.prepare(`UPDATE pending_messages SET status = 'pending', retry_count = retry_count + 1...`).run(messageId);
  } else {
    this.db.prepare(`UPDATE pending_messages SET status = 'failed'...`).run(messageId);
  }
}

Result:

  • If another caller modifies the row between the SELECT and the UPDATE, the retry_count read here is stale
  • Could lead to wrong retry decision

Fix Required: Wrap in this.db.transaction(() => { ... })().
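
A sketch of the transactional version, assuming the better-sqlite3-style transaction() API implied by the surrounding prepare/get/run calls (SQL elisions from the snippet above are preserved):

markFailed(messageId: number): void {
  const txn = this.db.transaction((id: number) => {
    const msg = this.db.prepare(`SELECT retry_count FROM pending_messages WHERE id = ?`).get(id);
    if (!msg) return;  // already deleted or never enqueued

    if (msg.retry_count < this.maxRetries) {
      this.db.prepare(`UPDATE pending_messages SET status = 'pending', retry_count = retry_count + 1...`).run(id);
    } else {
      this.db.prepare(`UPDATE pending_messages SET status = 'failed'...`).run(id);
    }
  });

  txn(messageId);
}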


Stuck Scenarios

Scenario A: SDK Hangs During Processing

  1. Message claimed → status = 'processing'
  2. Added to pendingProcessingIds
  3. Yielded to SDK agent
  4. SDK hangs (e.g., network timeout, infinite loop)
  5. Result: Stuck indefinitely; only recovered by the 5-minute threshold check when the worker next restarts

Scenario B: Generator Crash After Yielding

  1. Message claimed and yielded
  2. Agent throws error before markProcessed()
  3. Error handler marks DB messages as failed
  4. pendingProcessingIds NOT cleared
  5. Generator restarts
  6. Same message IDs added to set again
  7. Result: Duplicate tracking, potential double-processing

Scenario C: Partial Database Update

  1. 5 messages being marked processed
  2. Messages 1-3 succeed
  3. Database connection drops
  4. Message 4 throws error
  5. Loop breaks, messages 4-5 never marked
  6. pendingProcessingIds.clear() never called
  7. Result: Mixed state - some processed, some stuck

Scenario D: Hook Throws Without Cleanup

  1. save-hook.ts receives observation
  2. HTTP request to worker succeeds
  3. Output STANDARD_HOOK_RESPONSE sent
  4. Later code throws (e.g., Chroma sync fails)
  5. Node.js exits with code 1
  6. Result: Claude Code sees success, but observation may be partial

Recovery Mechanisms

Current Mechanisms

| Mechanism | Location | Trigger | Limitation |
|---|---|---|---|
| Startup stuck reset | worker-service.ts:687 | Worker restart | Only on restart |
| Generator crash recovery | SessionRoutes.ts:183-216 | Generator exit | Requires full exit |
| Manual retry | (needs verification) | User action | Requires UI intervention |
| Old message cleanup | SDKAgent.ts:504 | After processing | Only cleans processed |

Missing Mechanisms

  1. Continuous stuck monitoring - No runtime detection
  2. Per-message timeout - No kill switch for hung SDK
  3. UI stuck count display - User can't see stuck messages
  4. Manual recovery API - No endpoint to retry individual messages

Recommendations

Priority 1: Critical Fixes

  1. Clear pendingProcessingIds in error handler

    • File: SessionRoutes.ts:168
    • Add: session.pendingProcessingIds.clear()
  2. Add try-catch around markProcessed loop

    • File: SDKAgent.ts:489
    • Wrap individual calls, continue on error
  3. Add global error handler to hooks

    • All hooks in src/hooks/
    • Add process.on('unhandledRejection', ...) at entry

Priority 2: Robustness Improvements

  1. Add continuous stuck message monitor

    • Check every 60 seconds during runtime
    • Reset messages stuck > 5 minutes
  2. Make markFailed transactional

    • Wrap SELECT + UPDATE in transaction
  3. Fix summary-hook output-before-throw bug

    • Move console.log(STANDARD_HOOK_RESPONSE) after error check

Priority 3: Observability

  1. Add stuck message count to viewer UI

    • Show processing messages > 2 minutes old
  2. Add manual retry API endpoint

    • Allow user to retry stuck messages without restart
  3. Add explicit exit to all hooks

    • Consistent process.exit(0) on success path

Appendix: File Reference

Database Layer

  • src/services/sqlite/PendingMessageStore.ts - Message queue persistence
  • src/services/sqlite/SessionStore.ts - Session management, table schemas

Processing Layer

  • src/services/queue/SessionQueueProcessor.ts - Async iterator for claiming
  • src/services/worker/SessionManager.ts - Session state, message iterator
  • src/services/worker/SDKAgent.ts - SDK interaction, response processing

HTTP Layer

  • src/services/worker/http/routes/SessionRoutes.ts - Generator lifecycle, error handling

Worker Layer

  • src/services/worker-service.ts - Startup recovery, health checks

Hooks

  • src/hooks/context-hook.ts - SessionStart (explicit exit)
  • src/hooks/user-message-hook.ts - SessionStart parallel (explicit exit)
  • src/hooks/new-hook.ts - UserPromptSubmit (implicit exit)
  • src/hooks/save-hook.ts - PostToolUse (implicit exit, fire-and-forget)
  • src/hooks/summary-hook.ts - Stop (implicit exit, output bug)

Constants

  • src/shared/hook-constants.ts - Exit codes, timeouts

Conclusion

The primary cause of stuck observations is the disconnect between in-memory tracking (pendingProcessingIds) and database state. When errors occur, the database may be updated but the in-memory set is not cleared, leading to:

  1. Duplicate tracking on restart
  2. Memory-database state divergence
  3. Messages appearing stuck in UI

Secondary causes include inconsistent hook exit patterns and the lack of runtime stuck message monitoring.

The 5-minute startup recovery is a safety net, but it only works when the worker restarts. For a production system, continuous monitoring and proper error handling at all state transition points are essential.