Files
claude-mem/docs/reports/2026-01-02--observation-duplication-regression.md
T
Alex Newman 817b9e8f27 Improve error handling and logging across worker services (#528)
* fix: prevent memory_session_id from equaling content_session_id

The bug: memory_session_id was initialized to contentSessionId as a
"placeholder for FK purposes". This caused the SDK resume logic to
inject memory agent messages into the USER's Claude Code transcript,
corrupting their conversation history.

Root cause:
- SessionStore.createSDKSession initialized memory_session_id = contentSessionId
- SDKAgent checked memorySessionId !== contentSessionId but this check
  only worked if the session was fetched fresh from DB

The fix:
- SessionStore: Initialize memory_session_id as NULL, not contentSessionId
- SDKAgent: Simple truthy check !!session.memorySessionId (NULL = fresh start)
- Database migration: Ran UPDATE to set memory_session_id = NULL for 1807
  existing sessions that had the bug

Also adds [ALIGNMENT] logging across the session lifecycle to help debug
session continuity issues:
- Hook entry: contentSessionId + promptNumber
- DB lookup: contentSessionId → memorySessionId mapping proof
- Resume decision: shows which memorySessionId will be used for resume
- Capture: logs when memorySessionId is captured from first SDK response

UI: Added "Alignment" quick filter button in LogsModal to show only
alignment logs for debugging session continuity.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: improve error handling in worker-service.ts

- Fix GENERIC_CATCH anti-patterns by logging full error objects instead of just messages
- Add [ANTI-PATTERN IGNORED] markers for legitimate cases (cleanup, hot paths)
- Simplify error handling comments to be more concise
- Improve httpShutdown() error discrimination for ECONNREFUSED
- Reduce LARGE_TRY_BLOCK issues in initialization code

Part of anti-pattern cleanup plan (132 total issues)

* refactor: improve error logging in SearchManager.ts

- Pass full error objects to logger instead of just error.message
- Fixes PARTIAL_ERROR_LOGGING anti-patterns (10 instances)
- Better debugging visibility when Chroma queries fail

Part of anti-pattern cleanup (133 remaining)

* refactor: improve error logging across SessionStore and mcp-server

- SessionStore.ts: Fix error logging in column rename utility
- mcp-server.ts: Log full error objects instead of just error.message
- Improve error handling in Worker API calls and tool execution

Part of anti-pattern cleanup (133 remaining)

* Refactor hooks to streamline error handling and loading states

- Simplified error handling in useContextPreview by removing try-catch and directly checking response status.
- Refactored usePagination to eliminate try-catch, improving readability and maintaining error handling through response checks.
- Cleaned up useSSE by removing unnecessary try-catch around JSON parsing, ensuring clarity in message handling.
- Enhanced useSettings by streamlining the saving process, removing try-catch, and directly checking the result for success.

* refactor: add error handling back to SearchManager Chroma calls

- Wrap queryChroma calls in try-catch to prevent generator crashes
- Log Chroma errors as warnings and fall back gracefully
- Fixes generator failures when Chroma has issues
- Part of anti-pattern cleanup recovery

* feat: Add generator failure investigation report and observation duplication regression report

- Created a comprehensive investigation report detailing the root cause of generator failures during anti-pattern cleanup, including the impact, investigation process, and implemented fixes.
- Documented the critical regression causing observation duplication due to race conditions in the SDK agent, outlining symptoms, root cause analysis, and proposed fixes.

* fix: address PR #528 review comments - atomic cleanup and detector improvements

This commit addresses critical review feedback from PR #528:

## 1. Atomic Message Cleanup (Fix Race Condition)

**Problem**: SessionRoutes.ts generator error handler had race condition
- Queried messages then marked failed in loop
- If crash during loop → partial marking → inconsistent state

**Solution**:
- Added `markSessionMessagesFailed()` to PendingMessageStore.ts
- Single atomic UPDATE statement replaces loop
- Follows existing pattern from `resetProcessingToPending()`

**Files**:
- src/services/sqlite/PendingMessageStore.ts (new method)
- src/services/worker/http/routes/SessionRoutes.ts (use new method)

## 2. Anti-Pattern Detector Improvements

**Problem**: Detector didn't recognize logger.failure() method
- Lines 212 & 335 already included "failure"
- Lines 112-113 (PARTIAL_ERROR_LOGGING detection) did not

**Solution**: Updated regex patterns to include "failure" for consistency

**Files**:
- scripts/anti-pattern-test/detect-error-handling-antipatterns.ts

## 3. Documentation

**PR Comment**: Added clarification on memory_session_id fix location
- Points to SessionStore.ts:1155
- Explains why NULL initialization prevents message injection bug

## Review Response

Addresses "Must Address Before Merge" items from review:
 Clarified memory_session_id bug fix location (via PR comment)
 Made generator error handler message cleanup atomic
 Deferred comprehensive test suite to follow-up PR (keeps PR focused)

## Testing

- Build passes with no errors
- Anti-pattern detector runs successfully
- Atomic cleanup follows proven pattern from existing methods

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix: FOREIGN KEY constraint and missing failed_at_epoch column

Two critical bugs fixed:

1. Missing failed_at_epoch column in pending_messages table
   - Added migration 20 to create the column
   - Fixes error when trying to mark messages as failed

2. FOREIGN KEY constraint failed when storing observations
   - All three agents (SDK, Gemini, OpenRouter) were passing
     session.contentSessionId instead of session.memorySessionId
   - storeObservationsAndMarkComplete expects memorySessionId
   - Added null check and clear error message

However, observations still not saving - see investigation report.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Refactor hook input parsing to improve error handling

- Added a nested try-catch block in new-hook.ts, save-hook.ts, and summary-hook.ts to handle JSON parsing errors more gracefully.
- Replaced direct error throwing with logging of the error details using logger.error.
- Ensured that the process exits cleanly after handling input in all three hooks.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-03 18:51:59 -05:00

400 lines
13 KiB
Markdown

# Observation Duplication Regression - 2026-01-02
## Executive Summary
A critical regression is causing the same observation to be created multiple times (2-11 duplicates per observation). This occurred after recent error handling refactoring work that removed try-catch blocks. The root cause is a **race condition between observation persistence and message completion marking** in the SDK agent, exacerbated by crash recovery logic.
## Symptoms
- **11 observations** about "session generator failure" created between 10:01-10:09 PM (same content, different timestamps)
- **8 observations** about "fixed missing closing brace" created between 9:32 PM-9:55 PM
- **2 observations** about "remove large try-catch blocks" created at 9:33 PM
- Multiple other duplicates across different sessions
Example from database:
```sql
-- Same observation created 8 times over 23 minutes
id | title | created_at
-------|------------------------------------------------|-------------------
36050 | Fixed Missing Closing Brace in SearchManager | 2026-01-02 21:32:43
36040 | Fixed Missing Closing Brace in SearchManager | 2026-01-02 21:33:34
36047 | Fixed missing closing brace... | 2026-01-02 21:33:38
36041 | Fixed missing closing brace... | 2026-01-02 21:34:33
36060 | Fixed Missing Closing Brace... | 2026-01-02 21:41:23
36062 | Fixed Missing Closing Brace... | 2026-01-02 21:53:02
36063 | Fixed Missing Closing Brace... | 2026-01-02 21:53:33
36065 | Fixed missing closing brace... | 2026-01-02 21:55:06
```
## Root Cause Analysis
### The Critical Race Condition
The SDK agent has a fatal ordering issue in message processing:
**File: `/Users/alexnewman/Scripts/claude-mem/src/services/worker/SDKAgent.ts`**
```typescript
// Line 328-410: processSDKResponse()
private async processSDKResponse(...): Promise<void> {
// Parse observations from SDK response
const observations = parseObservations(text, session.contentSessionId);
// Store observations IMMEDIATELY
for (const obs of observations) {
const { id: obsId } = this.dbManager.getSessionStore().storeObservation(...);
// ⚠️ OBSERVATION IS NOW IN DATABASE
}
// Parse and store summary
const summary = parseSummary(text, session.sessionDbId);
if (summary) {
this.dbManager.getSessionStore().storeSummary(...);
// ⚠️ SUMMARY IS NOW IN DATABASE
}
// ONLY NOW mark the message as processed
await this.markMessagesProcessed(session, worker); // ⚠️ LINE 487
}
```
```typescript
// Line 494-502: markMessagesProcessed()
private async markMessagesProcessed(...): Promise<void> {
const pendingMessageStore = this.sessionManager.getPendingMessageStore();
if (session.pendingProcessingIds.size > 0) {
for (const messageId of session.pendingProcessingIds) {
pendingMessageStore.markProcessed(messageId); // ⚠️ TOO LATE!
}
}
}
```
### The Window of Vulnerability
Between storing observations (line ~340) and marking the message as processed (line 498), there is a **critical window** where:
1. **Observations exist in database**
2. **Message is still in 'processing' status** ⚠️
3. **If SDK crashes/exits** → Message remains stuck in 'processing'
### How Crash Recovery Makes It Worse
**File: `/Users/alexnewman/Scripts/claude-mem/src/services/worker/http/routes/SessionRoutes.ts`**
```typescript
// Line 183-205: Generator .finally() block
.finally(() => {
// Crash recovery: If not aborted and still has work, restart
if (!wasAborted) {
const pendingStore = this.sessionManager.getPendingMessageStore();
const pendingCount = pendingStore.getPendingCount(sessionDbId);
if (pendingCount > 0) { // ⚠️ Counts 'processing' messages too!
logger.info('SESSION', `Restarting generator after crash/exit`);
// Restart generator
setTimeout(() => {
this.startGeneratorWithProvider(stillExists, ...);
}, 1000);
}
}
});
```
**File: `/Users/alexnewman/Scripts/claude-mem/src/services/sqlite/PendingMessageStore.ts`**
```typescript
// Line 319-326: getPendingCount()
getPendingCount(sessionDbId: number): number {
const stmt = this.db.prepare(`
SELECT COUNT(*) as count FROM pending_messages
WHERE session_db_id = ? AND status IN ('pending', 'processing') // ⚠️
`);
return result.count;
}
// Line 299-314: resetStuckMessages()
resetStuckMessages(thresholdMs: number): number {
const stmt = this.db.prepare(`
UPDATE pending_messages
SET status = 'pending', started_processing_at_epoch = NULL
WHERE status = 'processing' AND started_processing_at_epoch < ? // ⚠️
`);
return result.changes;
}
```
### The Duplication Sequence
1. **SDK processes message #1** (e.g., "Read tool on SearchManager.ts")
- Marks message as 'processing' in database
- Sends observation prompt to SDK agent
2. **SDK returns response** with observation
- `parseObservations()` extracts: "Fixed missing closing brace..."
- `storeObservation()` saves observation #1 to database ✅
- **CRASH or ERROR occurs** (e.g., from recent error handling changes)
- `markMessagesProcessed()` NEVER CALLED ⚠️
- Message remains in 'processing' status
3. **Crash recovery triggers** (line 184-204)
- `getPendingCount()` finds message still in 'processing'
- Generator restarts with 1-second delay
4. **Worker restart or stuck message recovery**
- `resetStuckMessages()` resets message to 'pending'
- Generator processes the SAME message again
5. **SDK processes message #1 AGAIN**
- Same observation prompt sent to SDK
- SDK returns SAME observation (deterministic from same file state)
- `storeObservation()` saves observation #2 ✅ (DUPLICATE!)
- Process may crash again, creating observation #3, #4, etc.
### Why No Database Deduplication?
**File: `/Users/alexnewman/Scripts/claude-mem/src/services/sqlite/SessionStore.ts`**
```typescript
// Line 1224-1229: storeObservation() - NO deduplication!
const stmt = this.db.prepare(`
INSERT INTO observations
(memory_session_id, project, type, title, subtitle, ...)
VALUES (?, ?, ?, ?, ?, ...) // ⚠️ No INSERT OR IGNORE, no uniqueness check
`);
```
The database table has:
- ❌ No UNIQUE constraint on (memory_session_id, title, subtitle, type)
- ❌ No INSERT OR IGNORE logic
- ❌ No deduplication check before insertion
Compare to the IMPORT logic which DOES have deduplication:
```typescript
// Line ~1440: importObservation() HAS deduplication
const existing = this.checkObservationExists(
obs.memory_session_id,
obs.title,
obs.subtitle,
obs.type
);
if (existing) {
return { imported: false, id: existing.id }; // ✅ Prevents duplicates
}
```
## Connection to Anti-Pattern Cleanup Work
### What Changed
Recent commits removed try-catch blocks as part of anti-pattern mitigation:
```bash
0123b15 refactor: add error handling back to SearchManager Chroma calls
776f4ea Refactor hooks to streamline error handling and loading states
0ea82bd refactor: improve error logging across SessionStore and mcp-server
379b0c1 refactor: improve error logging in SearchManager.ts
4c0cdec refactor: improve error handling in worker-service.ts
```
Commit `776f4ea` made significant changes:
- Removed try-catch blocks from hooks (useContextPreview, usePagination, useSSE, useSettings)
- Modified SessionStore.ts error handling
- Modified SearchManager.ts error handling (3000+ lines changed)
### How This Triggered the Bug
The duplication regression was **latent** - the race condition always existed. However:
1. **Before**: Large try-catch blocks suppressed errors
- SDK errors were caught and logged
- Generator continued running
- Messages got marked as processed (eventually)
2. **After**: Error handling removed/streamlined
- SDK errors now crash the generator
- Generator exits before marking messages processed
- Crash recovery restarts generator repeatedly
- Same message processed multiple times
### Evidence from Database
Session 75894 (content_session_id: 56f94e5d-2514-4d44-aa43-f5e31d9b4c38):
- **26 pending messages** queued (all unique)
- **Only 7 observations** should have been created
- **But 8+ duplicates** of "Fixed missing closing brace" were created
- Created over 23-minute window (9:32 PM - 9:55 PM)
- Indicates **repeated crashes and recoveries**
## Fix Strategy
### Short-term Fix (Critical)
**Option 1: Transaction-based atomic completion** (RECOMMENDED)
Wrap observation storage and message completion in a single transaction:
```typescript
// In SDKAgent.ts processSDKResponse()
private async processSDKResponse(...): Promise<void> {
const pendingStore = this.sessionManager.getPendingMessageStore();
// Start transaction
const db = this.dbManager.getSessionStore().db;
const saveTransaction = db.transaction(() => {
// Parse and store observations
const observations = parseObservations(text, session.contentSessionId);
const observationIds = [];
for (const obs of observations) {
const { id } = this.dbManager.getSessionStore().storeObservation(...);
observationIds.push(id);
}
// Parse and store summary
const summary = parseSummary(text, session.sessionDbId);
if (summary) {
this.dbManager.getSessionStore().storeSummary(...);
}
// CRITICAL: Mark messages as processed IN SAME TRANSACTION
for (const messageId of session.pendingProcessingIds) {
pendingStore.markProcessed(messageId);
}
return observationIds;
});
// Execute transaction atomically
const observationIds = saveTransaction();
// Broadcast to SSE AFTER transaction commits
for (const obsId of observationIds) {
worker?.sseBroadcaster.broadcast(...);
}
}
```
**Option 2: Mark processed BEFORE storing** (SIMPLER)
```typescript
// In SDKAgent.ts processSDKResponse()
private async processSDKResponse(...): Promise<void> {
// Mark messages as processed FIRST
await this.markMessagesProcessed(session, worker);
// Then store observations (idempotent)
const observations = parseObservations(text, session.contentSessionId);
for (const obs of observations) {
this.dbManager.getSessionStore().storeObservation(...);
}
}
```
Risk: If storage fails, message is marked complete but observation is lost. However, this is better than duplicates.
### Medium-term Fix (Important)
**Add database-level deduplication:**
```sql
-- Add unique constraint
CREATE UNIQUE INDEX idx_observations_unique
ON observations(memory_session_id, title, subtitle, type);
-- Modify storeObservation() to use INSERT OR IGNORE
INSERT OR IGNORE INTO observations (...) VALUES (...);
```
Or use the existing `checkObservationExists()` logic:
```typescript
// In SessionStore.ts storeObservation()
storeObservation(...): { id: number; createdAtEpoch: number } {
// Check for existing observation
const existing = this.checkObservationExists(
memorySessionId,
observation.title,
observation.subtitle,
observation.type
);
if (existing) {
logger.debug('DB', 'Observation already exists, skipping', {
obsId: existing.id,
title: observation.title
});
return { id: existing.id, createdAtEpoch: existing.created_at_epoch };
}
// Insert new observation...
}
```
### Long-term Fix (Architectural)
**Redesign crash recovery to be idempotent:**
1. **Message status flow should be:**
- `pending``processing``processed` (one-way, no resets)
2. **Stuck message recovery should:**
- Create NEW message for retry (with retry_count)
- Mark old message as 'failed' or 'abandoned'
- Never reset 'processing' → 'pending'
3. **SDK agent should:**
- Track which observations were created for each message
- Skip observation creation if message was already processed
- Use message ID as idempotency key
## Testing Plan
1. **Reproduce the regression:**
- Create session with multiple tool uses
- Force SDK crash during observation processing
- Verify duplicates are NOT created with fix
2. **Edge cases:**
- Test worker restart during observation storage
- Test network failure during Chroma sync
- Test database write failure scenarios
3. **Performance:**
- Verify transaction doesn't slow down processing
- Test with high observation volume (100+ per session)
## Cleanup Required
Run the existing cleanup script to remove current duplicates:
```bash
cd /Users/alexnewman/Scripts/claude-mem
npm run cleanup-duplicates
```
This script identifies duplicates by `(memory_session_id, title, subtitle, type)` and keeps the earliest (MIN(id)).
## Files Requiring Changes
1. **src/services/worker/SDKAgent.ts** - Add transaction or reorder completion
2. **src/services/sqlite/SessionStore.ts** - Add deduplication check
3. **src/services/sqlite/migrations.ts** - Add unique index (optional)
4. **src/services/worker/http/routes/SessionRoutes.ts** - Improve crash recovery logging
## Estimated Impact
- **Severity**: Critical (data integrity)
- **Scope**: All sessions since 2026-01-02 ~9:30 PM
- **User impact**: Confusing duplicate memories, inflated token counts
- **Database impact**: ~50-100+ duplicate rows
## References
- Original issue: Generator failure observations (11 duplicates)
- Related commit: `776f4ea` "Refactor hooks to streamline error handling"
- Cleanup script: `/Users/alexnewman/Scripts/claude-mem/src/bin/cleanup-duplicates.ts`
- Related report: `docs/reports/2026-01-02--stuck-observations.md`