* fix: prevent memory_session_id from equaling content_session_id The bug: memory_session_id was initialized to contentSessionId as a "placeholder for FK purposes". This caused the SDK resume logic to inject memory agent messages into the USER's Claude Code transcript, corrupting their conversation history. Root cause: - SessionStore.createSDKSession initialized memory_session_id = contentSessionId - SDKAgent checked memorySessionId !== contentSessionId but this check only worked if the session was fetched fresh from DB The fix: - SessionStore: Initialize memory_session_id as NULL, not contentSessionId - SDKAgent: Simple truthy check !!session.memorySessionId (NULL = fresh start) - Database migration: Ran UPDATE to set memory_session_id = NULL for 1807 existing sessions that had the bug Also adds [ALIGNMENT] logging across the session lifecycle to help debug session continuity issues: - Hook entry: contentSessionId + promptNumber - DB lookup: contentSessionId → memorySessionId mapping proof - Resume decision: shows which memorySessionId will be used for resume - Capture: logs when memorySessionId is captured from first SDK response UI: Added "Alignment" quick filter button in LogsModal to show only alignment logs for debugging session continuity. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * refactor: improve error handling in worker-service.ts - Fix GENERIC_CATCH anti-patterns by logging full error objects instead of just messages - Add [ANTI-PATTERN IGNORED] markers for legitimate cases (cleanup, hot paths) - Simplify error handling comments to be more concise - Improve httpShutdown() error discrimination for ECONNREFUSED - Reduce LARGE_TRY_BLOCK issues in initialization code Part of anti-pattern cleanup plan (132 total issues) * refactor: improve error logging in SearchManager.ts - Pass full error objects to logger instead of just error.message - Fixes PARTIAL_ERROR_LOGGING anti-patterns (10 instances) - Better debugging visibility when Chroma queries fail Part of anti-pattern cleanup (133 remaining) * refactor: improve error logging across SessionStore and mcp-server - SessionStore.ts: Fix error logging in column rename utility - mcp-server.ts: Log full error objects instead of just error.message - Improve error handling in Worker API calls and tool execution Part of anti-pattern cleanup (133 remaining) * Refactor hooks to streamline error handling and loading states - Simplified error handling in useContextPreview by removing try-catch and directly checking response status. - Refactored usePagination to eliminate try-catch, improving readability and maintaining error handling through response checks. - Cleaned up useSSE by removing unnecessary try-catch around JSON parsing, ensuring clarity in message handling. - Enhanced useSettings by streamlining the saving process, removing try-catch, and directly checking the result for success. * refactor: add error handling back to SearchManager Chroma calls - Wrap queryChroma calls in try-catch to prevent generator crashes - Log Chroma errors as warnings and fall back gracefully - Fixes generator failures when Chroma has issues - Part of anti-pattern cleanup recovery * feat: Add generator failure investigation report and observation duplication regression report - Created a comprehensive investigation report detailing the root cause of generator failures during anti-pattern cleanup, including the impact, investigation process, and implemented fixes. - Documented the critical regression causing observation duplication due to race conditions in the SDK agent, outlining symptoms, root cause analysis, and proposed fixes. * fix: address PR #528 review comments - atomic cleanup and detector improvements This commit addresses critical review feedback from PR #528: ## 1. Atomic Message Cleanup (Fix Race Condition) **Problem**: SessionRoutes.ts generator error handler had race condition - Queried messages then marked failed in loop - If crash during loop → partial marking → inconsistent state **Solution**: - Added `markSessionMessagesFailed()` to PendingMessageStore.ts - Single atomic UPDATE statement replaces loop - Follows existing pattern from `resetProcessingToPending()` **Files**: - src/services/sqlite/PendingMessageStore.ts (new method) - src/services/worker/http/routes/SessionRoutes.ts (use new method) ## 2. Anti-Pattern Detector Improvements **Problem**: Detector didn't recognize logger.failure() method - Lines 212 & 335 already included "failure" - Lines 112-113 (PARTIAL_ERROR_LOGGING detection) did not **Solution**: Updated regex patterns to include "failure" for consistency **Files**: - scripts/anti-pattern-test/detect-error-handling-antipatterns.ts ## 3. Documentation **PR Comment**: Added clarification on memory_session_id fix location - Points to SessionStore.ts:1155 - Explains why NULL initialization prevents message injection bug ## Review Response Addresses "Must Address Before Merge" items from review: ✅ Clarified memory_session_id bug fix location (via PR comment) ✅ Made generator error handler message cleanup atomic ❌ Deferred comprehensive test suite to follow-up PR (keeps PR focused) ## Testing - Build passes with no errors - Anti-pattern detector runs successfully - Atomic cleanup follows proven pattern from existing methods 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * fix: FOREIGN KEY constraint and missing failed_at_epoch column Two critical bugs fixed: 1. Missing failed_at_epoch column in pending_messages table - Added migration 20 to create the column - Fixes error when trying to mark messages as failed 2. FOREIGN KEY constraint failed when storing observations - All three agents (SDK, Gemini, OpenRouter) were passing session.contentSessionId instead of session.memorySessionId - storeObservationsAndMarkComplete expects memorySessionId - Added null check and clear error message However, observations still not saving - see investigation report. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Refactor hook input parsing to improve error handling - Added a nested try-catch block in new-hook.ts, save-hook.ts, and summary-hook.ts to handle JSON parsing errors more gracefully. - Replaced direct error throwing with logging of the error details using logger.error. - Ensured that the process exits cleanly after handling input in all three hooks. --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
6.5 KiB
Observation Saving Failure Investigation
Date: 2026-01-03 Severity: CRITICAL Status: Bugs fixed, but observations still not saving
Summary
Despite fixing two critical bugs (missing failed_at_epoch column and FOREIGN KEY constraint errors), observations are still not being saved. Last observation was saved at 2026-01-03 20:44:49 (over an hour ago as of this report).
Bugs Fixed
Bug #1: Missing failed_at_epoch Column
- Root Cause: Code in
PendingMessageStore.markSessionMessagesFailed()tried to setfailed_at_epochcolumn that didn't exist in schema - Fix: Added migration 20 to create the column
- Status: ✅ Fixed and verified
Bug #2: FOREIGN KEY Constraint Failed
- Root Cause: ALL THREE agents (SDKAgent, GeminiAgent, OpenRouterAgent) were passing
session.contentSessionIdtostoreObservationsAndMarkComplete()but function expectedsession.memorySessionId - Location:
src/services/worker/SDKAgent.ts:354src/services/worker/GeminiAgent.ts:397src/services/worker/OpenRouterAgent.ts:440
- Fix: Changed all three agents to pass
session.memorySessionIdwith null check - Status: ✅ Fixed and verified
Current State (as of investigation)
Database State
- Total observations: 34,734
- Latest observation: 2026-01-03 20:44:49 (1+ hours ago)
- Pending messages: 0 (queue is empty)
- Recent sessions: Multiple sessions created but no observations saved
Recent Sessions
76292 | c5fd263d-d9ae-4f49-8caf-3f7bb4857804 | 4227fb34-ba37-4625-b18c-bc073044ea73 | 2026-01-03T20:50:51.930Z
76269 | 227c4af2-6c64-45cd-8700-4bb8309038a4 | 3ce5f8ff-85d0-4d1a-9c40-c0d8b905fce8 | 2026-01-03T20:47:10.637Z
Both have valid memory_session_id values captured, suggesting SDK communication is working.
Root Cause Analysis
Potential Issues
-
Worker Not Processing Messages
- Queue is empty (0 pending messages)
- Either messages aren't being created, or they're being processed and deleted immediately without creating observations
-
Hooks Not Creating Messages
- PostToolUse hook may not be firing
- Or hook is failing silently before creating pending messages
-
Generator Failing Before Observations
- SDK may be failing to return observations
- Or parsing is failing silently
-
The FIFO Queue Design Itself
- Current system has complex status tracking that hides failures
- Messages can be marked "processed" even if no observations were created
- No clear indication of what actually happened
Evidence of Deeper Problems
Architectural Issues Found
The queue processing system violates basic FIFO principles:
Current Overcomplicated Design:
- Status tracking:
pending→processing→processed/failed - Multiple timestamps:
created_at_epoch,started_processing_at_epoch,completed_at_epoch,failed_at_epoch - Retry counts and stuck message detection
- Complex recovery logic for different failure scenarios
What a FIFO Queue Should Be:
- INSERT message
- Process it
- DELETE when done
- If worker crashes → message stays in queue → gets reprocessed
The complexity is masking failures. Messages are being marked "processed" but no observations are being created.
Critical Questions Needing Investigation
-
Are PostToolUse hooks even firing?
- Check hook execution logs
- Verify tool usage is being captured
-
Are pending messages being created?
- Check message creation in hooks
- Look for silent failures in message insertion
-
Is the generator even starting?
- Check worker logs for session processing
- Verify SDK connections are established
-
Why is the queue always empty?
- Messages processed instantly? (unlikely)
- Messages never created? (more likely)
- Messages created then immediately deleted? (possible)
Immediate Next Steps
-
Add Logging
- Add detailed logging to PostToolUse hook
- Log every step of message creation
- Log generator startup and SDK responses
-
Check Hook Execution
- Verify hooks are actually running
- Check for silent failures in hook code
-
Test Message Creation Manually
- Create a test message directly in database
- Verify worker picks it up and processes it
-
Simplify the Queue (Long-term)
- Remove status tracking complexity
- Make it a true FIFO queue
- Make failures obvious instead of silent
Code Changes Made
SessionStore.ts
// Migration 20: Add failed_at_epoch column
private addFailedAtEpochColumn(): void {
const applied = this.db.prepare('SELECT version FROM schema_versions WHERE version = ?').get(20);
if (applied) return;
const tableInfo = this.db.query('PRAGMA table_info(pending_messages)').all();
const hasColumn = tableInfo.some(col => col.name === 'failed_at_epoch');
if (!hasColumn) {
this.db.run('ALTER TABLE pending_messages ADD COLUMN failed_at_epoch INTEGER');
logger.info('DB', 'Added failed_at_epoch column to pending_messages table');
}
this.db.prepare('INSERT OR IGNORE INTO schema_versions (version, applied_at) VALUES (?, ?)').run(20, new Date().toISOString());
}
SDKAgent.ts, GeminiAgent.ts, OpenRouterAgent.ts
// BEFORE (WRONG):
const result = sessionStore.storeObservationsAndMarkComplete(
session.contentSessionId, // ❌ Wrong session ID
session.project,
observations,
// ...
);
// AFTER (FIXED):
if (!session.memorySessionId) {
throw new Error('Cannot store observations: memorySessionId not yet captured');
}
const result = sessionStore.storeObservationsAndMarkComplete(
session.memorySessionId, // ✅ Correct session ID
session.project,
observations,
// ...
);
Conclusion
The two bugs are fixed, but observations still aren't being saved. The problem is likely earlier in the pipeline:
- Hooks not executing
- Messages not being created
- Or the overly complex queue system is hiding failures
The queue design itself is fundamentally flawed - it tracks too much state and makes failures invisible. A proper FIFO queue would make these issues obvious immediately.
Recommended Action
- Immediate: Add comprehensive logging to PostToolUse hook and message creation
- Short-term: Manual testing of queue processing
- Long-term: Rip out status tracking and implement proper FIFO queue
Investigation needed: This report documents what was fixed and what's still broken. The actual root cause of why observations stopped saving needs deeper investigation of the hook execution and message creation pipeline.