817b9e8f27
* fix: prevent memory_session_id from equaling content_session_id The bug: memory_session_id was initialized to contentSessionId as a "placeholder for FK purposes". This caused the SDK resume logic to inject memory agent messages into the USER's Claude Code transcript, corrupting their conversation history. Root cause: - SessionStore.createSDKSession initialized memory_session_id = contentSessionId - SDKAgent checked memorySessionId !== contentSessionId but this check only worked if the session was fetched fresh from DB The fix: - SessionStore: Initialize memory_session_id as NULL, not contentSessionId - SDKAgent: Simple truthy check !!session.memorySessionId (NULL = fresh start) - Database migration: Ran UPDATE to set memory_session_id = NULL for 1807 existing sessions that had the bug Also adds [ALIGNMENT] logging across the session lifecycle to help debug session continuity issues: - Hook entry: contentSessionId + promptNumber - DB lookup: contentSessionId → memorySessionId mapping proof - Resume decision: shows which memorySessionId will be used for resume - Capture: logs when memorySessionId is captured from first SDK response UI: Added "Alignment" quick filter button in LogsModal to show only alignment logs for debugging session continuity. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * refactor: improve error handling in worker-service.ts - Fix GENERIC_CATCH anti-patterns by logging full error objects instead of just messages - Add [ANTI-PATTERN IGNORED] markers for legitimate cases (cleanup, hot paths) - Simplify error handling comments to be more concise - Improve httpShutdown() error discrimination for ECONNREFUSED - Reduce LARGE_TRY_BLOCK issues in initialization code Part of anti-pattern cleanup plan (132 total issues) * refactor: improve error logging in SearchManager.ts - Pass full error objects to logger instead of just error.message - Fixes PARTIAL_ERROR_LOGGING anti-patterns (10 instances) - Better debugging visibility when Chroma queries fail Part of anti-pattern cleanup (133 remaining) * refactor: improve error logging across SessionStore and mcp-server - SessionStore.ts: Fix error logging in column rename utility - mcp-server.ts: Log full error objects instead of just error.message - Improve error handling in Worker API calls and tool execution Part of anti-pattern cleanup (133 remaining) * Refactor hooks to streamline error handling and loading states - Simplified error handling in useContextPreview by removing try-catch and directly checking response status. - Refactored usePagination to eliminate try-catch, improving readability and maintaining error handling through response checks. - Cleaned up useSSE by removing unnecessary try-catch around JSON parsing, ensuring clarity in message handling. - Enhanced useSettings by streamlining the saving process, removing try-catch, and directly checking the result for success. * refactor: add error handling back to SearchManager Chroma calls - Wrap queryChroma calls in try-catch to prevent generator crashes - Log Chroma errors as warnings and fall back gracefully - Fixes generator failures when Chroma has issues - Part of anti-pattern cleanup recovery * feat: Add generator failure investigation report and observation duplication regression report - Created a comprehensive investigation report detailing the root cause of generator failures during anti-pattern cleanup, including the impact, investigation process, and implemented fixes. - Documented the critical regression causing observation duplication due to race conditions in the SDK agent, outlining symptoms, root cause analysis, and proposed fixes. * fix: address PR #528 review comments - atomic cleanup and detector improvements This commit addresses critical review feedback from PR #528: ## 1. Atomic Message Cleanup (Fix Race Condition) **Problem**: SessionRoutes.ts generator error handler had race condition - Queried messages then marked failed in loop - If crash during loop → partial marking → inconsistent state **Solution**: - Added `markSessionMessagesFailed()` to PendingMessageStore.ts - Single atomic UPDATE statement replaces loop - Follows existing pattern from `resetProcessingToPending()` **Files**: - src/services/sqlite/PendingMessageStore.ts (new method) - src/services/worker/http/routes/SessionRoutes.ts (use new method) ## 2. Anti-Pattern Detector Improvements **Problem**: Detector didn't recognize logger.failure() method - Lines 212 & 335 already included "failure" - Lines 112-113 (PARTIAL_ERROR_LOGGING detection) did not **Solution**: Updated regex patterns to include "failure" for consistency **Files**: - scripts/anti-pattern-test/detect-error-handling-antipatterns.ts ## 3. Documentation **PR Comment**: Added clarification on memory_session_id fix location - Points to SessionStore.ts:1155 - Explains why NULL initialization prevents message injection bug ## Review Response Addresses "Must Address Before Merge" items from review: ✅ Clarified memory_session_id bug fix location (via PR comment) ✅ Made generator error handler message cleanup atomic ❌ Deferred comprehensive test suite to follow-up PR (keeps PR focused) ## Testing - Build passes with no errors - Anti-pattern detector runs successfully - Atomic cleanup follows proven pattern from existing methods 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * fix: FOREIGN KEY constraint and missing failed_at_epoch column Two critical bugs fixed: 1. Missing failed_at_epoch column in pending_messages table - Added migration 20 to create the column - Fixes error when trying to mark messages as failed 2. FOREIGN KEY constraint failed when storing observations - All three agents (SDK, Gemini, OpenRouter) were passing session.contentSessionId instead of session.memorySessionId - storeObservationsAndMarkComplete expects memorySessionId - Added null check and clear error message However, observations still not saving - see investigation report. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * Refactor hook input parsing to improve error handling - Added a nested try-catch block in new-hook.ts, save-hook.ts, and summary-hook.ts to handle JSON parsing errors more gracefully. - Replaced direct error throwing with logging of the error details using logger.error. - Ensured that the process exits cleanly after handling input in all three hooks. --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
185 lines
6.5 KiB
Markdown
185 lines
6.5 KiB
Markdown
# Observation Saving Failure Investigation
|
|
|
|
**Date**: 2026-01-03
|
|
**Severity**: CRITICAL
|
|
**Status**: Bugs fixed, but observations still not saving
|
|
|
|
## Summary
|
|
|
|
Despite fixing two critical bugs (missing `failed_at_epoch` column and FOREIGN KEY constraint errors), observations are still not being saved. Last observation was saved at **2026-01-03 20:44:49** (over an hour ago as of this report).
|
|
|
|
## Bugs Fixed
|
|
|
|
### Bug #1: Missing `failed_at_epoch` Column
|
|
- **Root Cause**: Code in `PendingMessageStore.markSessionMessagesFailed()` tried to set `failed_at_epoch` column that didn't exist in schema
|
|
- **Fix**: Added migration 20 to create the column
|
|
- **Status**: ✅ Fixed and verified
|
|
|
|
### Bug #2: FOREIGN KEY Constraint Failed
|
|
- **Root Cause**: ALL THREE agents (SDKAgent, GeminiAgent, OpenRouterAgent) were passing `session.contentSessionId` to `storeObservationsAndMarkComplete()` but function expected `session.memorySessionId`
|
|
- **Location**:
|
|
- `src/services/worker/SDKAgent.ts:354`
|
|
- `src/services/worker/GeminiAgent.ts:397`
|
|
- `src/services/worker/OpenRouterAgent.ts:440`
|
|
- **Fix**: Changed all three agents to pass `session.memorySessionId` with null check
|
|
- **Status**: ✅ Fixed and verified
|
|
|
|
## Current State (as of investigation)
|
|
|
|
### Database State
|
|
- **Total observations**: 34,734
|
|
- **Latest observation**: 2026-01-03 20:44:49 (1+ hours ago)
|
|
- **Pending messages**: 0 (queue is empty)
|
|
- **Recent sessions**: Multiple sessions created but no observations saved
|
|
|
|
### Recent Sessions
|
|
```
|
|
76292 | c5fd263d-d9ae-4f49-8caf-3f7bb4857804 | 4227fb34-ba37-4625-b18c-bc073044ea73 | 2026-01-03T20:50:51.930Z
|
|
76269 | 227c4af2-6c64-45cd-8700-4bb8309038a4 | 3ce5f8ff-85d0-4d1a-9c40-c0d8b905fce8 | 2026-01-03T20:47:10.637Z
|
|
```
|
|
|
|
Both have valid `memory_session_id` values captured, suggesting SDK communication is working.
|
|
|
|
## Root Cause Analysis
|
|
|
|
### Potential Issues
|
|
|
|
1. **Worker Not Processing Messages**
|
|
- Queue is empty (0 pending messages)
|
|
- Either messages aren't being created, or they're being processed and deleted immediately without creating observations
|
|
|
|
2. **Hooks Not Creating Messages**
|
|
- PostToolUse hook may not be firing
|
|
- Or hook is failing silently before creating pending messages
|
|
|
|
3. **Generator Failing Before Observations**
|
|
- SDK may be failing to return observations
|
|
- Or parsing is failing silently
|
|
|
|
4. **The FIFO Queue Design Itself**
|
|
- Current system has complex status tracking that hides failures
|
|
- Messages can be marked "processed" even if no observations were created
|
|
- No clear indication of what actually happened
|
|
|
|
## Evidence of Deeper Problems
|
|
|
|
### Architectural Issues Found
|
|
|
|
The queue processing system violates basic FIFO principles:
|
|
|
|
**Current Overcomplicated Design:**
|
|
- Status tracking: `pending` → `processing` → `processed`/`failed`
|
|
- Multiple timestamps: `created_at_epoch`, `started_processing_at_epoch`, `completed_at_epoch`, `failed_at_epoch`
|
|
- Retry counts and stuck message detection
|
|
- Complex recovery logic for different failure scenarios
|
|
|
|
**What a FIFO Queue Should Be:**
|
|
1. INSERT message
|
|
2. Process it
|
|
3. DELETE when done
|
|
4. If worker crashes → message stays in queue → gets reprocessed
|
|
|
|
The complexity is masking failures. Messages are being marked "processed" but no observations are being created.
|
|
|
|
## Critical Questions Needing Investigation
|
|
|
|
1. **Are PostToolUse hooks even firing?**
|
|
- Check hook execution logs
|
|
- Verify tool usage is being captured
|
|
|
|
2. **Are pending messages being created?**
|
|
- Check message creation in hooks
|
|
- Look for silent failures in message insertion
|
|
|
|
3. **Is the generator even starting?**
|
|
- Check worker logs for session processing
|
|
- Verify SDK connections are established
|
|
|
|
4. **Why is the queue always empty?**
|
|
- Messages processed instantly? (unlikely)
|
|
- Messages never created? (more likely)
|
|
- Messages created then immediately deleted? (possible)
|
|
|
|
## Immediate Next Steps
|
|
|
|
1. **Add Logging**
|
|
- Add detailed logging to PostToolUse hook
|
|
- Log every step of message creation
|
|
- Log generator startup and SDK responses
|
|
|
|
2. **Check Hook Execution**
|
|
- Verify hooks are actually running
|
|
- Check for silent failures in hook code
|
|
|
|
3. **Test Message Creation Manually**
|
|
- Create a test message directly in database
|
|
- Verify worker picks it up and processes it
|
|
|
|
4. **Simplify the Queue (Long-term)**
|
|
- Remove status tracking complexity
|
|
- Make it a true FIFO queue
|
|
- Make failures obvious instead of silent
|
|
|
|
## Code Changes Made
|
|
|
|
### SessionStore.ts
|
|
```typescript
|
|
// Migration 20: Add failed_at_epoch column
|
|
private addFailedAtEpochColumn(): void {
|
|
const applied = this.db.prepare('SELECT version FROM schema_versions WHERE version = ?').get(20);
|
|
if (applied) return;
|
|
|
|
const tableInfo = this.db.query('PRAGMA table_info(pending_messages)').all();
|
|
const hasColumn = tableInfo.some(col => col.name === 'failed_at_epoch');
|
|
|
|
if (!hasColumn) {
|
|
this.db.run('ALTER TABLE pending_messages ADD COLUMN failed_at_epoch INTEGER');
|
|
logger.info('DB', 'Added failed_at_epoch column to pending_messages table');
|
|
}
|
|
|
|
this.db.prepare('INSERT OR IGNORE INTO schema_versions (version, applied_at) VALUES (?, ?)').run(20, new Date().toISOString());
|
|
}
|
|
```
|
|
|
|
### SDKAgent.ts, GeminiAgent.ts, OpenRouterAgent.ts
|
|
```typescript
|
|
// BEFORE (WRONG):
|
|
const result = sessionStore.storeObservationsAndMarkComplete(
|
|
session.contentSessionId, // ❌ Wrong session ID
|
|
session.project,
|
|
observations,
|
|
// ...
|
|
);
|
|
|
|
// AFTER (FIXED):
|
|
if (!session.memorySessionId) {
|
|
throw new Error('Cannot store observations: memorySessionId not yet captured');
|
|
}
|
|
|
|
const result = sessionStore.storeObservationsAndMarkComplete(
|
|
session.memorySessionId, // ✅ Correct session ID
|
|
session.project,
|
|
observations,
|
|
// ...
|
|
);
|
|
```
|
|
|
|
## Conclusion
|
|
|
|
The two bugs are fixed, but observations still aren't being saved. The problem is likely earlier in the pipeline:
|
|
- Hooks not executing
|
|
- Messages not being created
|
|
- Or the overly complex queue system is hiding failures
|
|
|
|
**The queue design itself is fundamentally flawed** - it tracks too much state and makes failures invisible. A proper FIFO queue would make these issues obvious immediately.
|
|
|
|
## Recommended Action
|
|
|
|
1. **Immediate**: Add comprehensive logging to PostToolUse hook and message creation
|
|
2. **Short-term**: Manual testing of queue processing
|
|
3. **Long-term**: Rip out status tracking and implement proper FIFO queue
|
|
|
|
---
|
|
|
|
**Investigation needed**: This report documents what was fixed and what's still broken. The actual root cause of why observations stopped saving needs deeper investigation of the hook execution and message creation pipeline.
|