Files
claude-mem/docs/reports/2026-01-03--observation-saving-failure.md
T
Alex Newman 817b9e8f27 Improve error handling and logging across worker services (#528)
* fix: prevent memory_session_id from equaling content_session_id

The bug: memory_session_id was initialized to contentSessionId as a
"placeholder for FK purposes". This caused the SDK resume logic to
inject memory agent messages into the USER's Claude Code transcript,
corrupting their conversation history.

Root cause:
- SessionStore.createSDKSession initialized memory_session_id = contentSessionId
- SDKAgent checked memorySessionId !== contentSessionId but this check
  only worked if the session was fetched fresh from DB

The fix:
- SessionStore: Initialize memory_session_id as NULL, not contentSessionId
- SDKAgent: Simple truthy check !!session.memorySessionId (NULL = fresh start)
- Database migration: Ran UPDATE to set memory_session_id = NULL for 1807
  existing sessions that had the bug

Also adds [ALIGNMENT] logging across the session lifecycle to help debug
session continuity issues:
- Hook entry: contentSessionId + promptNumber
- DB lookup: contentSessionId → memorySessionId mapping proof
- Resume decision: shows which memorySessionId will be used for resume
- Capture: logs when memorySessionId is captured from first SDK response

UI: Added "Alignment" quick filter button in LogsModal to show only
alignment logs for debugging session continuity.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: improve error handling in worker-service.ts

- Fix GENERIC_CATCH anti-patterns by logging full error objects instead of just messages
- Add [ANTI-PATTERN IGNORED] markers for legitimate cases (cleanup, hot paths)
- Simplify error handling comments to be more concise
- Improve httpShutdown() error discrimination for ECONNREFUSED
- Reduce LARGE_TRY_BLOCK issues in initialization code

Part of anti-pattern cleanup plan (132 total issues)

* refactor: improve error logging in SearchManager.ts

- Pass full error objects to logger instead of just error.message
- Fixes PARTIAL_ERROR_LOGGING anti-patterns (10 instances)
- Better debugging visibility when Chroma queries fail

Part of anti-pattern cleanup (133 remaining)

* refactor: improve error logging across SessionStore and mcp-server

- SessionStore.ts: Fix error logging in column rename utility
- mcp-server.ts: Log full error objects instead of just error.message
- Improve error handling in Worker API calls and tool execution

Part of anti-pattern cleanup (133 remaining)

* Refactor hooks to streamline error handling and loading states

- Simplified error handling in useContextPreview by removing try-catch and directly checking response status.
- Refactored usePagination to eliminate try-catch, improving readability and maintaining error handling through response checks.
- Cleaned up useSSE by removing unnecessary try-catch around JSON parsing, ensuring clarity in message handling.
- Enhanced useSettings by streamlining the saving process, removing try-catch, and directly checking the result for success.

* refactor: add error handling back to SearchManager Chroma calls

- Wrap queryChroma calls in try-catch to prevent generator crashes
- Log Chroma errors as warnings and fall back gracefully
- Fixes generator failures when Chroma has issues
- Part of anti-pattern cleanup recovery

* feat: Add generator failure investigation report and observation duplication regression report

- Created a comprehensive investigation report detailing the root cause of generator failures during anti-pattern cleanup, including the impact, investigation process, and implemented fixes.
- Documented the critical regression causing observation duplication due to race conditions in the SDK agent, outlining symptoms, root cause analysis, and proposed fixes.

* fix: address PR #528 review comments - atomic cleanup and detector improvements

This commit addresses critical review feedback from PR #528:

## 1. Atomic Message Cleanup (Fix Race Condition)

**Problem**: SessionRoutes.ts generator error handler had race condition
- Queried messages then marked failed in loop
- If crash during loop → partial marking → inconsistent state

**Solution**:
- Added `markSessionMessagesFailed()` to PendingMessageStore.ts
- Single atomic UPDATE statement replaces loop
- Follows existing pattern from `resetProcessingToPending()`

**Files**:
- src/services/sqlite/PendingMessageStore.ts (new method)
- src/services/worker/http/routes/SessionRoutes.ts (use new method)

## 2. Anti-Pattern Detector Improvements

**Problem**: Detector didn't recognize logger.failure() method
- Lines 212 & 335 already included "failure"
- Lines 112-113 (PARTIAL_ERROR_LOGGING detection) did not

**Solution**: Updated regex patterns to include "failure" for consistency

**Files**:
- scripts/anti-pattern-test/detect-error-handling-antipatterns.ts

## 3. Documentation

**PR Comment**: Added clarification on memory_session_id fix location
- Points to SessionStore.ts:1155
- Explains why NULL initialization prevents message injection bug

## Review Response

Addresses "Must Address Before Merge" items from review:
 Clarified memory_session_id bug fix location (via PR comment)
 Made generator error handler message cleanup atomic
 Deferred comprehensive test suite to follow-up PR (keeps PR focused)

## Testing

- Build passes with no errors
- Anti-pattern detector runs successfully
- Atomic cleanup follows proven pattern from existing methods

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix: FOREIGN KEY constraint and missing failed_at_epoch column

Two critical bugs fixed:

1. Missing failed_at_epoch column in pending_messages table
   - Added migration 20 to create the column
   - Fixes error when trying to mark messages as failed

2. FOREIGN KEY constraint failed when storing observations
   - All three agents (SDK, Gemini, OpenRouter) were passing
     session.contentSessionId instead of session.memorySessionId
   - storeObservationsAndMarkComplete expects memorySessionId
   - Added null check and clear error message

However, observations still not saving - see investigation report.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Refactor hook input parsing to improve error handling

- Added a nested try-catch block in new-hook.ts, save-hook.ts, and summary-hook.ts to handle JSON parsing errors more gracefully.
- Replaced direct error throwing with logging of the error details using logger.error.
- Ensured that the process exits cleanly after handling input in all three hooks.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-03 18:51:59 -05:00

185 lines
6.5 KiB
Markdown

# Observation Saving Failure Investigation
**Date**: 2026-01-03
**Severity**: CRITICAL
**Status**: Bugs fixed, but observations still not saving
## Summary
Despite fixing two critical bugs (missing `failed_at_epoch` column and FOREIGN KEY constraint errors), observations are still not being saved. Last observation was saved at **2026-01-03 20:44:49** (over an hour ago as of this report).
## Bugs Fixed
### Bug #1: Missing `failed_at_epoch` Column
- **Root Cause**: Code in `PendingMessageStore.markSessionMessagesFailed()` tried to set `failed_at_epoch` column that didn't exist in schema
- **Fix**: Added migration 20 to create the column
- **Status**: ✅ Fixed and verified
### Bug #2: FOREIGN KEY Constraint Failed
- **Root Cause**: ALL THREE agents (SDKAgent, GeminiAgent, OpenRouterAgent) were passing `session.contentSessionId` to `storeObservationsAndMarkComplete()` but function expected `session.memorySessionId`
- **Location**:
- `src/services/worker/SDKAgent.ts:354`
- `src/services/worker/GeminiAgent.ts:397`
- `src/services/worker/OpenRouterAgent.ts:440`
- **Fix**: Changed all three agents to pass `session.memorySessionId` with null check
- **Status**: ✅ Fixed and verified
## Current State (as of investigation)
### Database State
- **Total observations**: 34,734
- **Latest observation**: 2026-01-03 20:44:49 (1+ hours ago)
- **Pending messages**: 0 (queue is empty)
- **Recent sessions**: Multiple sessions created but no observations saved
### Recent Sessions
```
76292 | c5fd263d-d9ae-4f49-8caf-3f7bb4857804 | 4227fb34-ba37-4625-b18c-bc073044ea73 | 2026-01-03T20:50:51.930Z
76269 | 227c4af2-6c64-45cd-8700-4bb8309038a4 | 3ce5f8ff-85d0-4d1a-9c40-c0d8b905fce8 | 2026-01-03T20:47:10.637Z
```
Both have valid `memory_session_id` values captured, suggesting SDK communication is working.
## Root Cause Analysis
### Potential Issues
1. **Worker Not Processing Messages**
- Queue is empty (0 pending messages)
- Either messages aren't being created, or they're being processed and deleted immediately without creating observations
2. **Hooks Not Creating Messages**
- PostToolUse hook may not be firing
- Or hook is failing silently before creating pending messages
3. **Generator Failing Before Observations**
- SDK may be failing to return observations
- Or parsing is failing silently
4. **The FIFO Queue Design Itself**
- Current system has complex status tracking that hides failures
- Messages can be marked "processed" even if no observations were created
- No clear indication of what actually happened
## Evidence of Deeper Problems
### Architectural Issues Found
The queue processing system violates basic FIFO principles:
**Current Overcomplicated Design:**
- Status tracking: `pending``processing``processed`/`failed`
- Multiple timestamps: `created_at_epoch`, `started_processing_at_epoch`, `completed_at_epoch`, `failed_at_epoch`
- Retry counts and stuck message detection
- Complex recovery logic for different failure scenarios
**What a FIFO Queue Should Be:**
1. INSERT message
2. Process it
3. DELETE when done
4. If worker crashes → message stays in queue → gets reprocessed
The complexity is masking failures. Messages are being marked "processed" but no observations are being created.
## Critical Questions Needing Investigation
1. **Are PostToolUse hooks even firing?**
- Check hook execution logs
- Verify tool usage is being captured
2. **Are pending messages being created?**
- Check message creation in hooks
- Look for silent failures in message insertion
3. **Is the generator even starting?**
- Check worker logs for session processing
- Verify SDK connections are established
4. **Why is the queue always empty?**
- Messages processed instantly? (unlikely)
- Messages never created? (more likely)
- Messages created then immediately deleted? (possible)
## Immediate Next Steps
1. **Add Logging**
- Add detailed logging to PostToolUse hook
- Log every step of message creation
- Log generator startup and SDK responses
2. **Check Hook Execution**
- Verify hooks are actually running
- Check for silent failures in hook code
3. **Test Message Creation Manually**
- Create a test message directly in database
- Verify worker picks it up and processes it
4. **Simplify the Queue (Long-term)**
- Remove status tracking complexity
- Make it a true FIFO queue
- Make failures obvious instead of silent
## Code Changes Made
### SessionStore.ts
```typescript
// Migration 20: Add failed_at_epoch column
private addFailedAtEpochColumn(): void {
const applied = this.db.prepare('SELECT version FROM schema_versions WHERE version = ?').get(20);
if (applied) return;
const tableInfo = this.db.query('PRAGMA table_info(pending_messages)').all();
const hasColumn = tableInfo.some(col => col.name === 'failed_at_epoch');
if (!hasColumn) {
this.db.run('ALTER TABLE pending_messages ADD COLUMN failed_at_epoch INTEGER');
logger.info('DB', 'Added failed_at_epoch column to pending_messages table');
}
this.db.prepare('INSERT OR IGNORE INTO schema_versions (version, applied_at) VALUES (?, ?)').run(20, new Date().toISOString());
}
```
### SDKAgent.ts, GeminiAgent.ts, OpenRouterAgent.ts
```typescript
// BEFORE (WRONG):
const result = sessionStore.storeObservationsAndMarkComplete(
session.contentSessionId, // ❌ Wrong session ID
session.project,
observations,
// ...
);
// AFTER (FIXED):
if (!session.memorySessionId) {
throw new Error('Cannot store observations: memorySessionId not yet captured');
}
const result = sessionStore.storeObservationsAndMarkComplete(
session.memorySessionId, // ✅ Correct session ID
session.project,
observations,
// ...
);
```
## Conclusion
The two bugs are fixed, but observations still aren't being saved. The problem is likely earlier in the pipeline:
- Hooks not executing
- Messages not being created
- Or the overly complex queue system is hiding failures
**The queue design itself is fundamentally flawed** - it tracks too much state and makes failures invisible. A proper FIFO queue would make these issues obvious immediately.
## Recommended Action
1. **Immediate**: Add comprehensive logging to PostToolUse hook and message creation
2. **Short-term**: Manual testing of queue processing
3. **Long-term**: Rip out status tracking and implement proper FIFO queue
---
**Investigation needed**: This report documents what was fixed and what's still broken. The actual root cause of why observations stopped saving needs deeper investigation of the hook execution and message creation pipeline.