Files
claude-mem/docs/reports/2026-01-02--observation-duplication-regression.md
T
Alex Newman 817b9e8f27 Improve error handling and logging across worker services (#528)
* fix: prevent memory_session_id from equaling content_session_id

The bug: memory_session_id was initialized to contentSessionId as a
"placeholder for FK purposes". This caused the SDK resume logic to
inject memory agent messages into the USER's Claude Code transcript,
corrupting their conversation history.

Root cause:
- SessionStore.createSDKSession initialized memory_session_id = contentSessionId
- SDKAgent checked memorySessionId !== contentSessionId but this check
  only worked if the session was fetched fresh from DB

The fix:
- SessionStore: Initialize memory_session_id as NULL, not contentSessionId
- SDKAgent: Simple truthy check !!session.memorySessionId (NULL = fresh start)
- Database migration: Ran UPDATE to set memory_session_id = NULL for 1807
  existing sessions that had the bug

Also adds [ALIGNMENT] logging across the session lifecycle to help debug
session continuity issues:
- Hook entry: contentSessionId + promptNumber
- DB lookup: contentSessionId → memorySessionId mapping proof
- Resume decision: shows which memorySessionId will be used for resume
- Capture: logs when memorySessionId is captured from first SDK response

UI: Added "Alignment" quick filter button in LogsModal to show only
alignment logs for debugging session continuity.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: improve error handling in worker-service.ts

- Fix GENERIC_CATCH anti-patterns by logging full error objects instead of just messages
- Add [ANTI-PATTERN IGNORED] markers for legitimate cases (cleanup, hot paths)
- Simplify error handling comments to be more concise
- Improve httpShutdown() error discrimination for ECONNREFUSED
- Reduce LARGE_TRY_BLOCK issues in initialization code

Part of anti-pattern cleanup plan (132 total issues)

* refactor: improve error logging in SearchManager.ts

- Pass full error objects to logger instead of just error.message
- Fixes PARTIAL_ERROR_LOGGING anti-patterns (10 instances)
- Better debugging visibility when Chroma queries fail

Part of anti-pattern cleanup (133 remaining)

* refactor: improve error logging across SessionStore and mcp-server

- SessionStore.ts: Fix error logging in column rename utility
- mcp-server.ts: Log full error objects instead of just error.message
- Improve error handling in Worker API calls and tool execution

Part of anti-pattern cleanup (133 remaining)

* Refactor hooks to streamline error handling and loading states

- Simplified error handling in useContextPreview by removing try-catch and directly checking response status.
- Refactored usePagination to eliminate try-catch, improving readability and maintaining error handling through response checks.
- Cleaned up useSSE by removing unnecessary try-catch around JSON parsing, ensuring clarity in message handling.
- Enhanced useSettings by streamlining the saving process, removing try-catch, and directly checking the result for success.

* refactor: add error handling back to SearchManager Chroma calls

- Wrap queryChroma calls in try-catch to prevent generator crashes
- Log Chroma errors as warnings and fall back gracefully
- Fixes generator failures when Chroma has issues
- Part of anti-pattern cleanup recovery

* feat: Add generator failure investigation report and observation duplication regression report

- Created a comprehensive investigation report detailing the root cause of generator failures during anti-pattern cleanup, including the impact, investigation process, and implemented fixes.
- Documented the critical regression causing observation duplication due to race conditions in the SDK agent, outlining symptoms, root cause analysis, and proposed fixes.

* fix: address PR #528 review comments - atomic cleanup and detector improvements

This commit addresses critical review feedback from PR #528:

## 1. Atomic Message Cleanup (Fix Race Condition)

**Problem**: SessionRoutes.ts generator error handler had race condition
- Queried messages then marked failed in loop
- If crash during loop → partial marking → inconsistent state

**Solution**:
- Added `markSessionMessagesFailed()` to PendingMessageStore.ts
- Single atomic UPDATE statement replaces loop
- Follows existing pattern from `resetProcessingToPending()`

**Files**:
- src/services/sqlite/PendingMessageStore.ts (new method)
- src/services/worker/http/routes/SessionRoutes.ts (use new method)

## 2. Anti-Pattern Detector Improvements

**Problem**: Detector didn't recognize logger.failure() method
- Lines 212 & 335 already included "failure"
- Lines 112-113 (PARTIAL_ERROR_LOGGING detection) did not

**Solution**: Updated regex patterns to include "failure" for consistency

**Files**:
- scripts/anti-pattern-test/detect-error-handling-antipatterns.ts

## 3. Documentation

**PR Comment**: Added clarification on memory_session_id fix location
- Points to SessionStore.ts:1155
- Explains why NULL initialization prevents message injection bug

## Review Response

Addresses "Must Address Before Merge" items from review:
 Clarified memory_session_id bug fix location (via PR comment)
 Made generator error handler message cleanup atomic
 Deferred comprehensive test suite to follow-up PR (keeps PR focused)

## Testing

- Build passes with no errors
- Anti-pattern detector runs successfully
- Atomic cleanup follows proven pattern from existing methods

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix: FOREIGN KEY constraint and missing failed_at_epoch column

Two critical bugs fixed:

1. Missing failed_at_epoch column in pending_messages table
   - Added migration 20 to create the column
   - Fixes error when trying to mark messages as failed

2. FOREIGN KEY constraint failed when storing observations
   - All three agents (SDK, Gemini, OpenRouter) were passing
     session.contentSessionId instead of session.memorySessionId
   - storeObservationsAndMarkComplete expects memorySessionId
   - Added null check and clear error message

However, observations still not saving - see investigation report.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Refactor hook input parsing to improve error handling

- Added a nested try-catch block in new-hook.ts, save-hook.ts, and summary-hook.ts to handle JSON parsing errors more gracefully.
- Replaced direct error throwing with logging of the error details using logger.error.
- Ensured that the process exits cleanly after handling input in all three hooks.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-03 18:51:59 -05:00

13 KiB

Observation Duplication Regression - 2026-01-02

Executive Summary

A critical regression is causing the same observation to be created multiple times (2-11 duplicates per observation). This occurred after recent error handling refactoring work that removed try-catch blocks. The root cause is a race condition between observation persistence and message completion marking in the SDK agent, exacerbated by crash recovery logic.

Symptoms

  • 11 observations about "session generator failure" created between 10:01-10:09 PM (same content, different timestamps)
  • 8 observations about "fixed missing closing brace" created between 9:32 PM-9:55 PM
  • 2 observations about "remove large try-catch blocks" created at 9:33 PM
  • Multiple other duplicates across different sessions

Example from database:

-- Same observation created 8 times over 23 minutes
id     | title                                          | created_at
-------|------------------------------------------------|-------------------
36050  | Fixed Missing Closing Brace in SearchManager  | 2026-01-02 21:32:43
36040  | Fixed Missing Closing Brace in SearchManager  | 2026-01-02 21:33:34
36047  | Fixed missing closing brace...                | 2026-01-02 21:33:38
36041  | Fixed missing closing brace...                | 2026-01-02 21:34:33
36060  | Fixed Missing Closing Brace...                | 2026-01-02 21:41:23
36062  | Fixed Missing Closing Brace...                | 2026-01-02 21:53:02
36063  | Fixed Missing Closing Brace...                | 2026-01-02 21:53:33
36065  | Fixed missing closing brace...                | 2026-01-02 21:55:06

Root Cause Analysis

The Critical Race Condition

The SDK agent has a fatal ordering issue in message processing:

File: /Users/alexnewman/Scripts/claude-mem/src/services/worker/SDKAgent.ts

// Line 328-410: processSDKResponse()
private async processSDKResponse(...): Promise<void> {
  // Parse observations from SDK response
  const observations = parseObservations(text, session.contentSessionId);

  // Store observations IMMEDIATELY
  for (const obs of observations) {
    const { id: obsId } = this.dbManager.getSessionStore().storeObservation(...);
    // ⚠️ OBSERVATION IS NOW IN DATABASE
  }

  // Parse and store summary
  const summary = parseSummary(text, session.sessionDbId);
  if (summary) {
    this.dbManager.getSessionStore().storeSummary(...);
    // ⚠️ SUMMARY IS NOW IN DATABASE
  }

  // ONLY NOW mark the message as processed
  await this.markMessagesProcessed(session, worker);  // ⚠️ LINE 487
}
// Line 494-502: markMessagesProcessed()
private async markMessagesProcessed(...): Promise<void> {
  const pendingMessageStore = this.sessionManager.getPendingMessageStore();
  if (session.pendingProcessingIds.size > 0) {
    for (const messageId of session.pendingProcessingIds) {
      pendingMessageStore.markProcessed(messageId);  // ⚠️ TOO LATE!
    }
  }
}

The Window of Vulnerability

Between storing observations (line ~340) and marking the message as processed (line 498), there is a critical window where:

  1. Observations exist in database
  2. Message is still in 'processing' status ⚠️
  3. If SDK crashes/exits → Message remains stuck in 'processing'

How Crash Recovery Makes It Worse

File: /Users/alexnewman/Scripts/claude-mem/src/services/worker/http/routes/SessionRoutes.ts

// Line 183-205: Generator .finally() block
.finally(() => {
  // Crash recovery: If not aborted and still has work, restart
  if (!wasAborted) {
    const pendingStore = this.sessionManager.getPendingMessageStore();
    const pendingCount = pendingStore.getPendingCount(sessionDbId);

    if (pendingCount > 0) {  // ⚠️ Counts 'processing' messages too!
      logger.info('SESSION', `Restarting generator after crash/exit`);

      // Restart generator
      setTimeout(() => {
        this.startGeneratorWithProvider(stillExists, ...);
      }, 1000);
    }
  }
});

File: /Users/alexnewman/Scripts/claude-mem/src/services/sqlite/PendingMessageStore.ts

// Line 319-326: getPendingCount()
getPendingCount(sessionDbId: number): number {
  const stmt = this.db.prepare(`
    SELECT COUNT(*) as count FROM pending_messages
    WHERE session_db_id = ? AND status IN ('pending', 'processing')  // ⚠️
  `);
  return result.count;
}

// Line 299-314: resetStuckMessages()
resetStuckMessages(thresholdMs: number): number {
  const stmt = this.db.prepare(`
    UPDATE pending_messages
    SET status = 'pending', started_processing_at_epoch = NULL
    WHERE status = 'processing' AND started_processing_at_epoch < ?  // ⚠️
  `);
  return result.changes;
}

The Duplication Sequence

  1. SDK processes message #1 (e.g., "Read tool on SearchManager.ts")

    • Marks message as 'processing' in database
    • Sends observation prompt to SDK agent
  2. SDK returns response with observation

    • parseObservations() extracts: "Fixed missing closing brace..."
    • storeObservation() saves observation #1 to database
    • CRASH or ERROR occurs (e.g., from recent error handling changes)
    • markMessagesProcessed() NEVER CALLED ⚠️
    • Message remains in 'processing' status
  3. Crash recovery triggers (line 184-204)

    • getPendingCount() finds message still in 'processing'
    • Generator restarts with 1-second delay
  4. Worker restart or stuck message recovery

    • resetStuckMessages() resets message to 'pending'
    • Generator processes the SAME message again
  5. SDK processes message #1 AGAIN

    • Same observation prompt sent to SDK
    • SDK returns SAME observation (deterministic from same file state)
    • storeObservation() saves observation #2 (DUPLICATE!)
    • Process may crash again, creating observation #3, #4, etc.

Why No Database Deduplication?

File: /Users/alexnewman/Scripts/claude-mem/src/services/sqlite/SessionStore.ts

// Line 1224-1229: storeObservation() - NO deduplication!
const stmt = this.db.prepare(`
  INSERT INTO observations
  (memory_session_id, project, type, title, subtitle, ...)
  VALUES (?, ?, ?, ?, ?, ...)  // ⚠️ No INSERT OR IGNORE, no uniqueness check
`);

The database table has:

  • No UNIQUE constraint on (memory_session_id, title, subtitle, type)
  • No INSERT OR IGNORE logic
  • No deduplication check before insertion

Compare to the IMPORT logic which DOES have deduplication:

// Line ~1440: importObservation() HAS deduplication
const existing = this.checkObservationExists(
  obs.memory_session_id,
  obs.title,
  obs.subtitle,
  obs.type
);

if (existing) {
  return { imported: false, id: existing.id };  // ✅ Prevents duplicates
}

Connection to Anti-Pattern Cleanup Work

What Changed

Recent commits removed try-catch blocks as part of anti-pattern mitigation:

0123b15 refactor: add error handling back to SearchManager Chroma calls
776f4ea Refactor hooks to streamline error handling and loading states
0ea82bd refactor: improve error logging across SessionStore and mcp-server
379b0c1 refactor: improve error logging in SearchManager.ts
4c0cdec refactor: improve error handling in worker-service.ts

Commit 776f4ea made significant changes:

  • Removed try-catch blocks from hooks (useContextPreview, usePagination, useSSE, useSettings)
  • Modified SessionStore.ts error handling
  • Modified SearchManager.ts error handling (3000+ lines changed)

How This Triggered the Bug

The duplication regression was latent - the race condition always existed. However:

  1. Before: Large try-catch blocks suppressed errors

    • SDK errors were caught and logged
    • Generator continued running
    • Messages got marked as processed (eventually)
  2. After: Error handling removed/streamlined

    • SDK errors now crash the generator
    • Generator exits before marking messages processed
    • Crash recovery restarts generator repeatedly
    • Same message processed multiple times

Evidence from Database

Session 75894 (content_session_id: 56f94e5d-2514-4d44-aa43-f5e31d9b4c38):

  • 26 pending messages queued (all unique)
  • Only 7 observations should have been created
  • But 8+ duplicates of "Fixed missing closing brace" were created
  • Created over 23-minute window (9:32 PM - 9:55 PM)
  • Indicates repeated crashes and recoveries

Fix Strategy

Short-term Fix (Critical)

Option 1: Transaction-based atomic completion (RECOMMENDED)

Wrap observation storage and message completion in a single transaction:

// In SDKAgent.ts processSDKResponse()
private async processSDKResponse(...): Promise<void> {
  const pendingStore = this.sessionManager.getPendingMessageStore();

  // Start transaction
  const db = this.dbManager.getSessionStore().db;
  const saveTransaction = db.transaction(() => {
    // Parse and store observations
    const observations = parseObservations(text, session.contentSessionId);
    const observationIds = [];

    for (const obs of observations) {
      const { id } = this.dbManager.getSessionStore().storeObservation(...);
      observationIds.push(id);
    }

    // Parse and store summary
    const summary = parseSummary(text, session.sessionDbId);
    if (summary) {
      this.dbManager.getSessionStore().storeSummary(...);
    }

    // CRITICAL: Mark messages as processed IN SAME TRANSACTION
    for (const messageId of session.pendingProcessingIds) {
      pendingStore.markProcessed(messageId);
    }

    return observationIds;
  });

  // Execute transaction atomically
  const observationIds = saveTransaction();

  // Broadcast to SSE AFTER transaction commits
  for (const obsId of observationIds) {
    worker?.sseBroadcaster.broadcast(...);
  }
}

Option 2: Mark processed BEFORE storing (SIMPLER)

// In SDKAgent.ts processSDKResponse()
private async processSDKResponse(...): Promise<void> {
  // Mark messages as processed FIRST
  await this.markMessagesProcessed(session, worker);

  // Then store observations (idempotent)
  const observations = parseObservations(text, session.contentSessionId);
  for (const obs of observations) {
    this.dbManager.getSessionStore().storeObservation(...);
  }
}

Risk: If storage fails, message is marked complete but observation is lost. However, this is better than duplicates.

Medium-term Fix (Important)

Add database-level deduplication:

-- Add unique constraint
CREATE UNIQUE INDEX idx_observations_unique
ON observations(memory_session_id, title, subtitle, type);

-- Modify storeObservation() to use INSERT OR IGNORE
INSERT OR IGNORE INTO observations (...) VALUES (...);

Or use the existing checkObservationExists() logic:

// In SessionStore.ts storeObservation()
storeObservation(...): { id: number; createdAtEpoch: number } {
  // Check for existing observation
  const existing = this.checkObservationExists(
    memorySessionId,
    observation.title,
    observation.subtitle,
    observation.type
  );

  if (existing) {
    logger.debug('DB', 'Observation already exists, skipping', {
      obsId: existing.id,
      title: observation.title
    });
    return { id: existing.id, createdAtEpoch: existing.created_at_epoch };
  }

  // Insert new observation...
}

Long-term Fix (Architectural)

Redesign crash recovery to be idempotent:

  1. Message status flow should be:

    • pendingprocessingprocessed (one-way, no resets)
  2. Stuck message recovery should:

    • Create NEW message for retry (with retry_count)
    • Mark old message as 'failed' or 'abandoned'
    • Never reset 'processing' → 'pending'
  3. SDK agent should:

    • Track which observations were created for each message
    • Skip observation creation if message was already processed
    • Use message ID as idempotency key

Testing Plan

  1. Reproduce the regression:

    • Create session with multiple tool uses
    • Force SDK crash during observation processing
    • Verify duplicates are NOT created with fix
  2. Edge cases:

    • Test worker restart during observation storage
    • Test network failure during Chroma sync
    • Test database write failure scenarios
  3. Performance:

    • Verify transaction doesn't slow down processing
    • Test with high observation volume (100+ per session)

Cleanup Required

Run the existing cleanup script to remove current duplicates:

cd /Users/alexnewman/Scripts/claude-mem
npm run cleanup-duplicates

This script identifies duplicates by (memory_session_id, title, subtitle, type) and keeps the earliest (MIN(id)).

Files Requiring Changes

  1. src/services/worker/SDKAgent.ts - Add transaction or reorder completion
  2. src/services/sqlite/SessionStore.ts - Add deduplication check
  3. src/services/sqlite/migrations.ts - Add unique index (optional)
  4. src/services/worker/http/routes/SessionRoutes.ts - Improve crash recovery logging

Estimated Impact

  • Severity: Critical (data integrity)
  • Scope: All sessions since 2026-01-02 ~9:30 PM
  • User impact: Confusing duplicate memories, inflated token counts
  • Database impact: ~50-100+ duplicate rows

References

  • Original issue: Generator failure observations (11 duplicates)
  • Related commit: 776f4ea "Refactor hooks to streamline error handling"
  • Cleanup script: /Users/alexnewman/Scripts/claude-mem/src/bin/cleanup-duplicates.ts
  • Related report: docs/reports/2026-01-02--stuck-observations.md