Files

T

Alex Newman 817b9e8f27 Improve error handling and logging across worker services (#528 )

* fix: prevent memory_session_id from equaling content_session_id

The bug: memory_session_id was initialized to contentSessionId as a
"placeholder for FK purposes". This caused the SDK resume logic to
inject memory agent messages into the USER's Claude Code transcript,
corrupting their conversation history.

Root cause:
- SessionStore.createSDKSession initialized memory_session_id = contentSessionId
- SDKAgent checked memorySessionId !== contentSessionId but this check
  only worked if the session was fetched fresh from DB

The fix:
- SessionStore: Initialize memory_session_id as NULL, not contentSessionId
- SDKAgent: Simple truthy check !!session.memorySessionId (NULL = fresh start)
- Database migration: Ran UPDATE to set memory_session_id = NULL for 1807
  existing sessions that had the bug

Also adds [ALIGNMENT] logging across the session lifecycle to help debug
session continuity issues:
- Hook entry: contentSessionId + promptNumber
- DB lookup: contentSessionId → memorySessionId mapping proof
- Resume decision: shows which memorySessionId will be used for resume
- Capture: logs when memorySessionId is captured from first SDK response

UI: Added "Alignment" quick filter button in LogsModal to show only
alignment logs for debugging session continuity.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: improve error handling in worker-service.ts

- Fix GENERIC_CATCH anti-patterns by logging full error objects instead of just messages
- Add [ANTI-PATTERN IGNORED] markers for legitimate cases (cleanup, hot paths)
- Simplify error handling comments to be more concise
- Improve httpShutdown() error discrimination for ECONNREFUSED
- Reduce LARGE_TRY_BLOCK issues in initialization code

Part of anti-pattern cleanup plan (132 total issues)

* refactor: improve error logging in SearchManager.ts

- Pass full error objects to logger instead of just error.message
- Fixes PARTIAL_ERROR_LOGGING anti-patterns (10 instances)
- Better debugging visibility when Chroma queries fail

Part of anti-pattern cleanup (133 remaining)

* refactor: improve error logging across SessionStore and mcp-server

- SessionStore.ts: Fix error logging in column rename utility
- mcp-server.ts: Log full error objects instead of just error.message
- Improve error handling in Worker API calls and tool execution

Part of anti-pattern cleanup (133 remaining)

* Refactor hooks to streamline error handling and loading states

- Simplified error handling in useContextPreview by removing try-catch and directly checking response status.
- Refactored usePagination to eliminate try-catch, improving readability and maintaining error handling through response checks.
- Cleaned up useSSE by removing unnecessary try-catch around JSON parsing, ensuring clarity in message handling.
- Enhanced useSettings by streamlining the saving process, removing try-catch, and directly checking the result for success.

* refactor: add error handling back to SearchManager Chroma calls

- Wrap queryChroma calls in try-catch to prevent generator crashes
- Log Chroma errors as warnings and fall back gracefully
- Fixes generator failures when Chroma has issues
- Part of anti-pattern cleanup recovery

* feat: Add generator failure investigation report and observation duplication regression report

- Created a comprehensive investigation report detailing the root cause of generator failures during anti-pattern cleanup, including the impact, investigation process, and implemented fixes.
- Documented the critical regression causing observation duplication due to race conditions in the SDK agent, outlining symptoms, root cause analysis, and proposed fixes.

* fix: address PR #528 review comments - atomic cleanup and detector improvements

This commit addresses critical review feedback from PR #528:

## 1. Atomic Message Cleanup (Fix Race Condition)

**Problem**: SessionRoutes.ts generator error handler had race condition
- Queried messages then marked failed in loop
- If crash during loop → partial marking → inconsistent state

**Solution**:
- Added `markSessionMessagesFailed()` to PendingMessageStore.ts
- Single atomic UPDATE statement replaces loop
- Follows existing pattern from `resetProcessingToPending()`

**Files**:
- src/services/sqlite/PendingMessageStore.ts (new method)
- src/services/worker/http/routes/SessionRoutes.ts (use new method)

## 2. Anti-Pattern Detector Improvements

**Problem**: Detector didn't recognize logger.failure() method
- Lines 212 & 335 already included "failure"
- Lines 112-113 (PARTIAL_ERROR_LOGGING detection) did not

**Solution**: Updated regex patterns to include "failure" for consistency

**Files**:
- scripts/anti-pattern-test/detect-error-handling-antipatterns.ts

## 3. Documentation

**PR Comment**: Added clarification on memory_session_id fix location
- Points to SessionStore.ts:1155
- Explains why NULL initialization prevents message injection bug

## Review Response

Addresses "Must Address Before Merge" items from review:
✅ Clarified memory_session_id bug fix location (via PR comment)
✅ Made generator error handler message cleanup atomic
❌ Deferred comprehensive test suite to follow-up PR (keeps PR focused)

## Testing

- Build passes with no errors
- Anti-pattern detector runs successfully
- Atomic cleanup follows proven pattern from existing methods

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix: FOREIGN KEY constraint and missing failed_at_epoch column

Two critical bugs fixed:

1. Missing failed_at_epoch column in pending_messages table
   - Added migration 20 to create the column
   - Fixes error when trying to mark messages as failed

2. FOREIGN KEY constraint failed when storing observations
   - All three agents (SDK, Gemini, OpenRouter) were passing
     session.contentSessionId instead of session.memorySessionId
   - storeObservationsAndMarkComplete expects memorySessionId
   - Added null check and clear error message

However, observations still not saving - see investigation report.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Refactor hook input parsing to improve error handling

- Added a nested try-catch block in new-hook.ts, save-hook.ts, and summary-hook.ts to handle JSON parsing errors more gracefully.
- Replaced direct error throwing with logging of the error details using logger.error.
- Ensured that the process exits cleanly after handling input in all three hooks.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-03 18:51:59 -05:00

6.5 KiB

Raw Blame History

Observation Saving Failure Investigation

Date: 2026-01-03 Severity: CRITICAL Status: Bugs fixed, but observations still not saving

Summary

Despite fixing two critical bugs (missing failed_at_epoch column and FOREIGN KEY constraint errors), observations are still not being saved. Last observation was saved at 2026-01-03 20:44:49 (over an hour ago as of this report).

Bugs Fixed

Bug #1: Missing `failed_at_epoch` Column

Root Cause: Code in PendingMessageStore.markSessionMessagesFailed() tried to set failed_at_epoch column that didn't exist in schema
Fix: Added migration 20 to create the column
Status: ✅ Fixed and verified

Bug #2: FOREIGN KEY Constraint Failed

Root Cause: ALL THREE agents (SDKAgent, GeminiAgent, OpenRouterAgent) were passing session.contentSessionId to storeObservationsAndMarkComplete() but function expected session.memorySessionId
Location:
- src/services/worker/SDKAgent.ts:354
- src/services/worker/GeminiAgent.ts:397
- src/services/worker/OpenRouterAgent.ts:440
Fix: Changed all three agents to pass session.memorySessionId with null check
Status: ✅ Fixed and verified

Current State (as of investigation)

Database State

Total observations: 34,734
Latest observation: 2026-01-03 20:44:49 (1+ hours ago)
Pending messages: 0 (queue is empty)
Recent sessions: Multiple sessions created but no observations saved

Recent Sessions

76292 | c5fd263d-d9ae-4f49-8caf-3f7bb4857804 | 4227fb34-ba37-4625-b18c-bc073044ea73 | 2026-01-03T20:50:51.930Z
76269 | 227c4af2-6c64-45cd-8700-4bb8309038a4 | 3ce5f8ff-85d0-4d1a-9c40-c0d8b905fce8 | 2026-01-03T20:47:10.637Z

Both have valid memory_session_id values captured, suggesting SDK communication is working.

Root Cause Analysis

Potential Issues

Worker Not Processing Messages
- Queue is empty (0 pending messages)
- Either messages aren't being created, or they're being processed and deleted immediately without creating observations
Hooks Not Creating Messages
- PostToolUse hook may not be firing
- Or hook is failing silently before creating pending messages
Generator Failing Before Observations
- SDK may be failing to return observations
- Or parsing is failing silently
The FIFO Queue Design Itself
- Current system has complex status tracking that hides failures
- Messages can be marked "processed" even if no observations were created
- No clear indication of what actually happened

Evidence of Deeper Problems

Architectural Issues Found

The queue processing system violates basic FIFO principles:

Current Overcomplicated Design:

Status tracking: pending → processing → processed/failed
Multiple timestamps: created_at_epoch, started_processing_at_epoch, completed_at_epoch, failed_at_epoch
Retry counts and stuck message detection
Complex recovery logic for different failure scenarios

What a FIFO Queue Should Be:

INSERT message
Process it
DELETE when done
If worker crashes → message stays in queue → gets reprocessed

The complexity is masking failures. Messages are being marked "processed" but no observations are being created.

Critical Questions Needing Investigation

Are PostToolUse hooks even firing?
- Check hook execution logs
- Verify tool usage is being captured
Are pending messages being created?
- Check message creation in hooks
- Look for silent failures in message insertion
Is the generator even starting?
- Check worker logs for session processing
- Verify SDK connections are established
Why is the queue always empty?
- Messages processed instantly? (unlikely)
- Messages never created? (more likely)
- Messages created then immediately deleted? (possible)

Immediate Next Steps

Add Logging
- Add detailed logging to PostToolUse hook
- Log every step of message creation
- Log generator startup and SDK responses
Check Hook Execution
- Verify hooks are actually running
- Check for silent failures in hook code
Test Message Creation Manually
- Create a test message directly in database
- Verify worker picks it up and processes it
Simplify the Queue (Long-term)
- Remove status tracking complexity
- Make it a true FIFO queue
- Make failures obvious instead of silent

Code Changes Made

SessionStore.ts

// Migration 20: Add failed_at_epoch column
private addFailedAtEpochColumn(): void {
  const applied = this.db.prepare('SELECT version FROM schema_versions WHERE version = ?').get(20);
  if (applied) return;

  const tableInfo = this.db.query('PRAGMA table_info(pending_messages)').all();
  const hasColumn = tableInfo.some(col => col.name === 'failed_at_epoch');

  if (!hasColumn) {
    this.db.run('ALTER TABLE pending_messages ADD COLUMN failed_at_epoch INTEGER');
    logger.info('DB', 'Added failed_at_epoch column to pending_messages table');
  }

  this.db.prepare('INSERT OR IGNORE INTO schema_versions (version, applied_at) VALUES (?, ?)').run(20, new Date().toISOString());
}

SDKAgent.ts, GeminiAgent.ts, OpenRouterAgent.ts

// BEFORE (WRONG):
const result = sessionStore.storeObservationsAndMarkComplete(
  session.contentSessionId,  // ❌ Wrong session ID
  session.project,
  observations,
  // ...
);

// AFTER (FIXED):
if (!session.memorySessionId) {
  throw new Error('Cannot store observations: memorySessionId not yet captured');
}

const result = sessionStore.storeObservationsAndMarkComplete(
  session.memorySessionId,  // ✅ Correct session ID
  session.project,
  observations,
  // ...
);

Conclusion

The two bugs are fixed, but observations still aren't being saved. The problem is likely earlier in the pipeline:

Hooks not executing
Messages not being created
Or the overly complex queue system is hiding failures

The queue design itself is fundamentally flawed - it tracks too much state and makes failures invisible. A proper FIFO queue would make these issues obvious immediately.

Recommended Action

Immediate: Add comprehensive logging to PostToolUse hook and message creation
Short-term: Manual testing of queue processing
Long-term: Rip out status tracking and implement proper FIFO queue

Investigation needed: This report documents what was fixed and what's still broken. The actual root cause of why observations stopped saving needs deeper investigation of the hook execution and message creation pipeline.

6.5 KiB Raw Blame History