Files
claude-mem/docs/reports/2026-01-03--observation-saving-failure.md
T
Alex Newman 817b9e8f27 Improve error handling and logging across worker services (#528)
* fix: prevent memory_session_id from equaling content_session_id

The bug: memory_session_id was initialized to contentSessionId as a
"placeholder for FK purposes". This caused the SDK resume logic to
inject memory agent messages into the USER's Claude Code transcript,
corrupting their conversation history.

Root cause:
- SessionStore.createSDKSession initialized memory_session_id = contentSessionId
- SDKAgent checked memorySessionId !== contentSessionId but this check
  only worked if the session was fetched fresh from DB

The fix:
- SessionStore: Initialize memory_session_id as NULL, not contentSessionId
- SDKAgent: Simple truthy check !!session.memorySessionId (NULL = fresh start)
- Database migration: Ran UPDATE to set memory_session_id = NULL for 1807
  existing sessions that had the bug

Also adds [ALIGNMENT] logging across the session lifecycle to help debug
session continuity issues:
- Hook entry: contentSessionId + promptNumber
- DB lookup: contentSessionId → memorySessionId mapping proof
- Resume decision: shows which memorySessionId will be used for resume
- Capture: logs when memorySessionId is captured from first SDK response

UI: Added "Alignment" quick filter button in LogsModal to show only
alignment logs for debugging session continuity.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: improve error handling in worker-service.ts

- Fix GENERIC_CATCH anti-patterns by logging full error objects instead of just messages
- Add [ANTI-PATTERN IGNORED] markers for legitimate cases (cleanup, hot paths)
- Simplify error handling comments to be more concise
- Improve httpShutdown() error discrimination for ECONNREFUSED
- Reduce LARGE_TRY_BLOCK issues in initialization code

Part of anti-pattern cleanup plan (132 total issues)

* refactor: improve error logging in SearchManager.ts

- Pass full error objects to logger instead of just error.message
- Fixes PARTIAL_ERROR_LOGGING anti-patterns (10 instances)
- Better debugging visibility when Chroma queries fail

Part of anti-pattern cleanup (133 remaining)

* refactor: improve error logging across SessionStore and mcp-server

- SessionStore.ts: Fix error logging in column rename utility
- mcp-server.ts: Log full error objects instead of just error.message
- Improve error handling in Worker API calls and tool execution

Part of anti-pattern cleanup (133 remaining)

* Refactor hooks to streamline error handling and loading states

- Simplified error handling in useContextPreview by removing try-catch and directly checking response status.
- Refactored usePagination to eliminate try-catch, improving readability and maintaining error handling through response checks.
- Cleaned up useSSE by removing unnecessary try-catch around JSON parsing, ensuring clarity in message handling.
- Enhanced useSettings by streamlining the saving process, removing try-catch, and directly checking the result for success.

* refactor: add error handling back to SearchManager Chroma calls

- Wrap queryChroma calls in try-catch to prevent generator crashes
- Log Chroma errors as warnings and fall back gracefully
- Fixes generator failures when Chroma has issues
- Part of anti-pattern cleanup recovery

* feat: Add generator failure investigation report and observation duplication regression report

- Created a comprehensive investigation report detailing the root cause of generator failures during anti-pattern cleanup, including the impact, investigation process, and implemented fixes.
- Documented the critical regression causing observation duplication due to race conditions in the SDK agent, outlining symptoms, root cause analysis, and proposed fixes.

* fix: address PR #528 review comments - atomic cleanup and detector improvements

This commit addresses critical review feedback from PR #528:

## 1. Atomic Message Cleanup (Fix Race Condition)

**Problem**: SessionRoutes.ts generator error handler had race condition
- Queried messages then marked failed in loop
- If crash during loop → partial marking → inconsistent state

**Solution**:
- Added `markSessionMessagesFailed()` to PendingMessageStore.ts
- Single atomic UPDATE statement replaces loop
- Follows existing pattern from `resetProcessingToPending()`

**Files**:
- src/services/sqlite/PendingMessageStore.ts (new method)
- src/services/worker/http/routes/SessionRoutes.ts (use new method)

## 2. Anti-Pattern Detector Improvements

**Problem**: Detector didn't recognize logger.failure() method
- Lines 212 & 335 already included "failure"
- Lines 112-113 (PARTIAL_ERROR_LOGGING detection) did not

**Solution**: Updated regex patterns to include "failure" for consistency

**Files**:
- scripts/anti-pattern-test/detect-error-handling-antipatterns.ts

## 3. Documentation

**PR Comment**: Added clarification on memory_session_id fix location
- Points to SessionStore.ts:1155
- Explains why NULL initialization prevents message injection bug

## Review Response

Addresses "Must Address Before Merge" items from review:
 Clarified memory_session_id bug fix location (via PR comment)
 Made generator error handler message cleanup atomic
 Deferred comprehensive test suite to follow-up PR (keeps PR focused)

## Testing

- Build passes with no errors
- Anti-pattern detector runs successfully
- Atomic cleanup follows proven pattern from existing methods

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix: FOREIGN KEY constraint and missing failed_at_epoch column

Two critical bugs fixed:

1. Missing failed_at_epoch column in pending_messages table
   - Added migration 20 to create the column
   - Fixes error when trying to mark messages as failed

2. FOREIGN KEY constraint failed when storing observations
   - All three agents (SDK, Gemini, OpenRouter) were passing
     session.contentSessionId instead of session.memorySessionId
   - storeObservationsAndMarkComplete expects memorySessionId
   - Added null check and clear error message

However, observations still not saving - see investigation report.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Refactor hook input parsing to improve error handling

- Added a nested try-catch block in new-hook.ts, save-hook.ts, and summary-hook.ts to handle JSON parsing errors more gracefully.
- Replaced direct error throwing with logging of the error details using logger.error.
- Ensured that the process exits cleanly after handling input in all three hooks.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-03 18:51:59 -05:00

6.5 KiB

Observation Saving Failure Investigation

Date: 2026-01-03 Severity: CRITICAL Status: Bugs fixed, but observations still not saving

Summary

Despite fixing two critical bugs (missing failed_at_epoch column and FOREIGN KEY constraint errors), observations are still not being saved. Last observation was saved at 2026-01-03 20:44:49 (over an hour ago as of this report).

Bugs Fixed

Bug #1: Missing failed_at_epoch Column

  • Root Cause: Code in PendingMessageStore.markSessionMessagesFailed() tried to set failed_at_epoch column that didn't exist in schema
  • Fix: Added migration 20 to create the column
  • Status: Fixed and verified

Bug #2: FOREIGN KEY Constraint Failed

  • Root Cause: ALL THREE agents (SDKAgent, GeminiAgent, OpenRouterAgent) were passing session.contentSessionId to storeObservationsAndMarkComplete() but function expected session.memorySessionId
  • Location:
    • src/services/worker/SDKAgent.ts:354
    • src/services/worker/GeminiAgent.ts:397
    • src/services/worker/OpenRouterAgent.ts:440
  • Fix: Changed all three agents to pass session.memorySessionId with null check
  • Status: Fixed and verified

Current State (as of investigation)

Database State

  • Total observations: 34,734
  • Latest observation: 2026-01-03 20:44:49 (1+ hours ago)
  • Pending messages: 0 (queue is empty)
  • Recent sessions: Multiple sessions created but no observations saved

Recent Sessions

76292 | c5fd263d-d9ae-4f49-8caf-3f7bb4857804 | 4227fb34-ba37-4625-b18c-bc073044ea73 | 2026-01-03T20:50:51.930Z
76269 | 227c4af2-6c64-45cd-8700-4bb8309038a4 | 3ce5f8ff-85d0-4d1a-9c40-c0d8b905fce8 | 2026-01-03T20:47:10.637Z

Both have valid memory_session_id values captured, suggesting SDK communication is working.

Root Cause Analysis

Potential Issues

  1. Worker Not Processing Messages

    • Queue is empty (0 pending messages)
    • Either messages aren't being created, or they're being processed and deleted immediately without creating observations
  2. Hooks Not Creating Messages

    • PostToolUse hook may not be firing
    • Or hook is failing silently before creating pending messages
  3. Generator Failing Before Observations

    • SDK may be failing to return observations
    • Or parsing is failing silently
  4. The FIFO Queue Design Itself

    • Current system has complex status tracking that hides failures
    • Messages can be marked "processed" even if no observations were created
    • No clear indication of what actually happened

Evidence of Deeper Problems

Architectural Issues Found

The queue processing system violates basic FIFO principles:

Current Overcomplicated Design:

  • Status tracking: pendingprocessingprocessed/failed
  • Multiple timestamps: created_at_epoch, started_processing_at_epoch, completed_at_epoch, failed_at_epoch
  • Retry counts and stuck message detection
  • Complex recovery logic for different failure scenarios

What a FIFO Queue Should Be:

  1. INSERT message
  2. Process it
  3. DELETE when done
  4. If worker crashes → message stays in queue → gets reprocessed

The complexity is masking failures. Messages are being marked "processed" but no observations are being created.

Critical Questions Needing Investigation

  1. Are PostToolUse hooks even firing?

    • Check hook execution logs
    • Verify tool usage is being captured
  2. Are pending messages being created?

    • Check message creation in hooks
    • Look for silent failures in message insertion
  3. Is the generator even starting?

    • Check worker logs for session processing
    • Verify SDK connections are established
  4. Why is the queue always empty?

    • Messages processed instantly? (unlikely)
    • Messages never created? (more likely)
    • Messages created then immediately deleted? (possible)

Immediate Next Steps

  1. Add Logging

    • Add detailed logging to PostToolUse hook
    • Log every step of message creation
    • Log generator startup and SDK responses
  2. Check Hook Execution

    • Verify hooks are actually running
    • Check for silent failures in hook code
  3. Test Message Creation Manually

    • Create a test message directly in database
    • Verify worker picks it up and processes it
  4. Simplify the Queue (Long-term)

    • Remove status tracking complexity
    • Make it a true FIFO queue
    • Make failures obvious instead of silent

Code Changes Made

SessionStore.ts

// Migration 20: Add failed_at_epoch column
private addFailedAtEpochColumn(): void {
  const applied = this.db.prepare('SELECT version FROM schema_versions WHERE version = ?').get(20);
  if (applied) return;

  const tableInfo = this.db.query('PRAGMA table_info(pending_messages)').all();
  const hasColumn = tableInfo.some(col => col.name === 'failed_at_epoch');

  if (!hasColumn) {
    this.db.run('ALTER TABLE pending_messages ADD COLUMN failed_at_epoch INTEGER');
    logger.info('DB', 'Added failed_at_epoch column to pending_messages table');
  }

  this.db.prepare('INSERT OR IGNORE INTO schema_versions (version, applied_at) VALUES (?, ?)').run(20, new Date().toISOString());
}

SDKAgent.ts, GeminiAgent.ts, OpenRouterAgent.ts

// BEFORE (WRONG):
const result = sessionStore.storeObservationsAndMarkComplete(
  session.contentSessionId,  // ❌ Wrong session ID
  session.project,
  observations,
  // ...
);

// AFTER (FIXED):
if (!session.memorySessionId) {
  throw new Error('Cannot store observations: memorySessionId not yet captured');
}

const result = sessionStore.storeObservationsAndMarkComplete(
  session.memorySessionId,  // ✅ Correct session ID
  session.project,
  observations,
  // ...
);

Conclusion

The two bugs are fixed, but observations still aren't being saved. The problem is likely earlier in the pipeline:

  • Hooks not executing
  • Messages not being created
  • Or the overly complex queue system is hiding failures

The queue design itself is fundamentally flawed - it tracks too much state and makes failures invisible. A proper FIFO queue would make these issues obvious immediately.

  1. Immediate: Add comprehensive logging to PostToolUse hook and message creation
  2. Short-term: Manual testing of queue processing
  3. Long-term: Rip out status tracking and implement proper FIFO queue

Investigation needed: This report documents what was fixed and what's still broken. The actual root cause of why observations stopped saving needs deeper investigation of the hook execution and message creation pipeline.