* fix: prevent memory_session_id from equaling content_session_id

  The bug: memory_session_id was initialized to contentSessionId as a "placeholder for FK purposes". This caused the SDK resume logic to inject memory agent messages into the USER's Claude Code transcript, corrupting their conversation history.

  Root cause:
  - SessionStore.createSDKSession initialized memory_session_id = contentSessionId
  - SDKAgent checked memorySessionId !== contentSessionId, but this check only worked if the session was fetched fresh from the DB

  The fix:
  - SessionStore: Initialize memory_session_id as NULL, not contentSessionId
  - SDKAgent: Simple truthy check !!session.memorySessionId (NULL = fresh start)
  - Database migration: Ran UPDATE to set memory_session_id = NULL for 1807 existing sessions that had the bug

  Also adds [ALIGNMENT] logging across the session lifecycle to help debug session continuity issues:
  - Hook entry: contentSessionId + promptNumber
  - DB lookup: contentSessionId → memorySessionId mapping proof
  - Resume decision: shows which memorySessionId will be used for resume
  - Capture: logs when memorySessionId is captured from the first SDK response

  UI: Added an "Alignment" quick filter button in LogsModal to show only alignment logs for debugging session continuity.

  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: improve error handling in worker-service.ts

  - Fix GENERIC_CATCH anti-patterns by logging full error objects instead of just messages
  - Add [ANTI-PATTERN IGNORED] markers for legitimate cases (cleanup, hot paths)
  - Simplify error-handling comments to be more concise
  - Improve httpShutdown() error discrimination for ECONNREFUSED
  - Reduce LARGE_TRY_BLOCK issues in initialization code

  Part of anti-pattern cleanup plan (132 total issues)

* refactor: improve error logging in SearchManager.ts

  - Pass full error objects to the logger instead of just error.message
  - Fixes PARTIAL_ERROR_LOGGING anti-patterns (10 instances)
  - Better debugging visibility when Chroma queries fail

  Part of anti-pattern cleanup (133 remaining)

* refactor: improve error logging across SessionStore and mcp-server

  - SessionStore.ts: Fix error logging in the column rename utility
  - mcp-server.ts: Log full error objects instead of just error.message
  - Improve error handling in Worker API calls and tool execution

  Part of anti-pattern cleanup (133 remaining)

* Refactor hooks to streamline error handling and loading states

  - Simplified error handling in useContextPreview by removing try-catch and directly checking response status.
  - Refactored usePagination to eliminate try-catch, improving readability and maintaining error handling through response checks.
  - Cleaned up useSSE by removing an unnecessary try-catch around JSON parsing, ensuring clarity in message handling.
  - Enhanced useSettings by streamlining the saving process, removing try-catch, and directly checking the result for success.

* refactor: add error handling back to SearchManager Chroma calls

  - Wrap queryChroma calls in try-catch to prevent generator crashes
  - Log Chroma errors as warnings and fall back gracefully
  - Fixes generator failures when Chroma has issues
  - Part of anti-pattern cleanup recovery

* feat: Add generator failure investigation report and observation duplication regression report

  - Created a comprehensive investigation report detailing the root cause of generator failures during anti-pattern cleanup, including the impact, investigation process, and implemented fixes.
  - Documented the critical regression causing observation duplication due to race conditions in the SDK agent, outlining symptoms, root cause analysis, and proposed fixes.

* fix: address PR #528 review comments - atomic cleanup and detector improvements

  This commit addresses critical review feedback from PR #528:

  ## 1. Atomic Message Cleanup (Fix Race Condition)

  **Problem**: The SessionRoutes.ts generator error handler had a race condition
  - Queried messages, then marked them failed in a loop
  - If it crashed during the loop → partial marking → inconsistent state

  **Solution**:
  - Added `markSessionMessagesFailed()` to PendingMessageStore.ts
  - A single atomic UPDATE statement replaces the loop
  - Follows the existing pattern from `resetProcessingToPending()`

  **Files**:
  - src/services/sqlite/PendingMessageStore.ts (new method)
  - src/services/worker/http/routes/SessionRoutes.ts (use new method)

  ## 2. Anti-Pattern Detector Improvements

  **Problem**: The detector didn't recognize the logger.failure() method
  - Lines 212 & 335 already included "failure"
  - Lines 112-113 (PARTIAL_ERROR_LOGGING detection) did not

  **Solution**: Updated regex patterns to include "failure" for consistency

  **Files**:
  - scripts/anti-pattern-test/detect-error-handling-antipatterns.ts

  ## 3. Documentation

  **PR Comment**: Added clarification on the memory_session_id fix location
  - Points to SessionStore.ts:1155
  - Explains why NULL initialization prevents the message injection bug

  ## Review Response

  Addresses "Must Address Before Merge" items from the review:
  ✅ Clarified memory_session_id bug fix location (via PR comment)
  ✅ Made generator error handler message cleanup atomic
  ❌ Deferred comprehensive test suite to a follow-up PR (keeps this PR focused)

  ## Testing

  - Build passes with no errors
  - Anti-pattern detector runs successfully
  - Atomic cleanup follows the proven pattern from existing methods

  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix: FOREIGN KEY constraint and missing failed_at_epoch column

  Two critical bugs fixed:

  1. Missing failed_at_epoch column in the pending_messages table
     - Added migration 20 to create the column
     - Fixes the error when trying to mark messages as failed

  2. FOREIGN KEY constraint failed when storing observations
     - All three agents (SDK, Gemini, OpenRouter) were passing session.contentSessionId instead of session.memorySessionId
     - storeObservationsAndMarkComplete expects memorySessionId
     - Added a null check and a clear error message

  However, observations are still not saving - see the investigation report.

  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Refactor hook input parsing to improve error handling

  - Added a nested try-catch block in new-hook.ts, save-hook.ts, and summary-hook.ts to handle JSON parsing errors more gracefully.
  - Replaced direct error throwing with logging of the error details using logger.error.
  - Ensured that the process exits cleanly after handling input in all three hooks.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
# Observation Duplication Regression - 2026-01-02

## Executive Summary

A critical regression is causing the same observation to be created multiple times (2-11 duplicates per observation). It appeared after recent error-handling refactoring work that removed try-catch blocks. The root cause is a race condition between observation persistence and message-completion marking in the SDK agent, exacerbated by crash recovery logic.
## Symptoms

- 11 observations about "session generator failure" created between 10:01-10:09 PM (same content, different timestamps)
- 8 observations about "fixed missing closing brace" created between 9:32 PM and 9:55 PM
- 2 observations about "remove large try-catch blocks" created at 9:33 PM
- Multiple other duplicates across different sessions

Example from the database:

```
-- Same observation created 8 times over 23 minutes
id    | title                                         | created_at
------|-----------------------------------------------|--------------------
36050 | Fixed Missing Closing Brace in SearchManager  | 2026-01-02 21:32:43
36040 | Fixed Missing Closing Brace in SearchManager  | 2026-01-02 21:33:34
36047 | Fixed missing closing brace...                | 2026-01-02 21:33:38
36041 | Fixed missing closing brace...                | 2026-01-02 21:34:33
36060 | Fixed Missing Closing Brace...                | 2026-01-02 21:41:23
36062 | Fixed Missing Closing Brace...                | 2026-01-02 21:53:02
36063 | Fixed Missing Closing Brace...                | 2026-01-02 21:53:33
36065 | Fixed missing closing brace...                | 2026-01-02 21:55:06
```
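A grouping query along these lines surfaces such duplicates (a sketch using better-sqlite3, to match the store code shown later; the database path is a placeholder, and the column names follow the dump above):

```typescript
import Database from 'better-sqlite3';

// Sketch: list observation groups sharing the same identity tuple more than
// once, with their time window. Adjust the path to the actual claude-mem DB.
const db = new Database('claude-mem.db', { readonly: true });
const duplicates = db.prepare(`
  SELECT memory_session_id, title, COUNT(*) AS copies,
         MIN(created_at) AS first_seen, MAX(created_at) AS last_seen
  FROM observations
  GROUP BY memory_session_id, title, subtitle, type
  HAVING COUNT(*) > 1
  ORDER BY copies DESC
`).all();
console.table(duplicates);
```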
## Root Cause Analysis

### The Critical Race Condition

The SDK agent has a fatal ordering issue in message processing:

**File**: /Users/alexnewman/Scripts/claude-mem/src/services/worker/SDKAgent.ts

```typescript
// Line 328-410: processSDKResponse()
private async processSDKResponse(...): Promise<void> {
  // Parse observations from SDK response
  const observations = parseObservations(text, session.contentSessionId);

  // Store observations IMMEDIATELY
  for (const obs of observations) {
    const { id: obsId } = this.dbManager.getSessionStore().storeObservation(...);
    // ⚠️ OBSERVATION IS NOW IN DATABASE
  }

  // Parse and store summary
  const summary = parseSummary(text, session.sessionDbId);
  if (summary) {
    this.dbManager.getSessionStore().storeSummary(...);
    // ⚠️ SUMMARY IS NOW IN DATABASE
  }

  // ONLY NOW mark the message as processed
  await this.markMessagesProcessed(session, worker); // ⚠️ LINE 487
}

// Line 494-502: markMessagesProcessed()
private async markMessagesProcessed(...): Promise<void> {
  const pendingMessageStore = this.sessionManager.getPendingMessageStore();
  if (session.pendingProcessingIds.size > 0) {
    for (const messageId of session.pendingProcessingIds) {
      pendingMessageStore.markProcessed(messageId); // ⚠️ TOO LATE!
    }
  }
}
```
### The Window of Vulnerability

Between storing the observations (line ~340) and marking the message as processed (line 498), there is a critical window in which:

- Observations exist in the database ✅
- The message is still in 'processing' status ⚠️
- If the SDK crashes or exits → the message remains stuck in 'processing'
### How Crash Recovery Makes It Worse

**File**: /Users/alexnewman/Scripts/claude-mem/src/services/worker/http/routes/SessionRoutes.ts

```typescript
// Line 183-205: Generator .finally() block
.finally(() => {
  // Crash recovery: if not aborted and still has work, restart
  if (!wasAborted) {
    const pendingStore = this.sessionManager.getPendingMessageStore();
    const pendingCount = pendingStore.getPendingCount(sessionDbId);
    if (pendingCount > 0) { // ⚠️ Counts 'processing' messages too!
      logger.info('SESSION', `Restarting generator after crash/exit`);
      // Restart generator
      setTimeout(() => {
        this.startGeneratorWithProvider(stillExists, ...);
      }, 1000);
    }
  }
});
```

**File**: /Users/alexnewman/Scripts/claude-mem/src/services/sqlite/PendingMessageStore.ts

```typescript
// Line 319-326: getPendingCount()
getPendingCount(sessionDbId: number): number {
  const stmt = this.db.prepare(`
    SELECT COUNT(*) as count FROM pending_messages
    WHERE session_db_id = ? AND status IN ('pending', 'processing') -- ⚠️
  `);
  const result = stmt.get(sessionDbId);
  return result.count;
}

// Line 299-314: resetStuckMessages()
resetStuckMessages(thresholdMs: number): number {
  const stmt = this.db.prepare(`
    UPDATE pending_messages
    SET status = 'pending', started_processing_at_epoch = NULL
    WHERE status = 'processing' AND started_processing_at_epoch < ? -- ⚠️
  `);
  const result = stmt.run(Date.now() - thresholdMs);
  return result.changes;
}
```
### The Duplication Sequence

1. **SDK processes message #1** (e.g., "Read tool on SearchManager.ts")
   - Marks the message as 'processing' in the database
   - Sends the observation prompt to the SDK agent

2. **SDK returns a response with an observation**
   - parseObservations() extracts: "Fixed missing closing brace..."
   - storeObservation() saves observation #1 to the database ✅
   - CRASH or ERROR occurs (e.g., from the recent error handling changes)
   - markMessagesProcessed() is NEVER CALLED ⚠️
   - The message remains in 'processing' status

3. **Crash recovery triggers** (line 184-204)
   - getPendingCount() finds the message still in 'processing'
   - The generator restarts with a 1-second delay

4. **Worker restart or stuck message recovery**
   - resetStuckMessages() resets the message to 'pending'
   - The generator processes the SAME message again

5. **SDK processes message #1 AGAIN**
   - The same observation prompt is sent to the SDK
   - The SDK returns the SAME observation (deterministic from the same file state)
   - storeObservation() saves observation #2 ✅ (DUPLICATE!)
   - The process may crash again, creating observation #3, #4, etc.
### Why No Database Deduplication?

**File**: /Users/alexnewman/Scripts/claude-mem/src/services/sqlite/SessionStore.ts

```typescript
// Line 1224-1229: storeObservation() - NO deduplication!
const stmt = this.db.prepare(`
  INSERT INTO observations
    (memory_session_id, project, type, title, subtitle, ...)
  VALUES (?, ?, ?, ?, ?, ...) -- ⚠️ No INSERT OR IGNORE, no uniqueness check
`);
```

The observations table has:

- ❌ No UNIQUE constraint on (memory_session_id, title, subtitle, type)
- ❌ No INSERT OR IGNORE logic
- ❌ No deduplication check before insertion

Compare this to the IMPORT logic, which DOES have deduplication:

```typescript
// Line ~1440: importObservation() HAS deduplication
const existing = this.checkObservationExists(
  obs.memory_session_id,
  obs.title,
  obs.subtitle,
  obs.type
);
if (existing) {
  return { imported: false, id: existing.id }; // ✅ Prevents duplicates
}
```
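checkObservationExists() itself is not shown in this report. A minimal sketch of what it presumably does, inferred from the call site above and written as a free function over a better-sqlite3 handle (the real method lives on SessionStore and may differ):

```typescript
import Database from 'better-sqlite3';

// Hypothetical sketch: look up an observation by the four-column identity
// tuple used above. Not the actual SessionStore implementation.
function checkObservationExists(
  db: Database.Database,
  memorySessionId: string,
  title: string,
  subtitle: string,
  type: string
): { id: number; created_at_epoch: number } | undefined {
  return db.prepare(`
    SELECT id, created_at_epoch FROM observations
    WHERE memory_session_id = ? AND title = ? AND subtitle = ? AND type = ?
    LIMIT 1
  `).get(memorySessionId, title, subtitle, type) as
    { id: number; created_at_epoch: number } | undefined;
}
```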
## Connection to Anti-Pattern Cleanup Work

### What Changed

Recent commits removed try-catch blocks as part of anti-pattern mitigation:

```
0123b15 refactor: add error handling back to SearchManager Chroma calls
776f4ea Refactor hooks to streamline error handling and loading states
0ea82bd refactor: improve error logging across SessionStore and mcp-server
379b0c1 refactor: improve error logging in SearchManager.ts
4c0cdec refactor: improve error handling in worker-service.ts
```

Commit 776f4ea made significant changes:

- Removed try-catch blocks from hooks (useContextPreview, usePagination, useSSE, useSettings)
- Modified SessionStore.ts error handling
- Modified SearchManager.ts error handling (3000+ lines changed)
### How This Triggered the Bug

The duplication regression was latent - the race condition always existed. However (see the sketch after this list):

1. **Before**: Large try-catch blocks suppressed errors
   - SDK errors were caught and logged
   - The generator continued running
   - Messages got marked as processed (eventually)

2. **After**: Error handling was removed/streamlined
   - SDK errors now crash the generator
   - The generator exits before marking messages processed
   - Crash recovery restarts the generator repeatedly
   - The same message is processed multiple times
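A minimal sketch of the mechanism (a hypothetical loop shape, not the actual generator code): with the broad catch in place, an SDK error is swallowed and the loop still reaches the completion step; with it removed, the exception unwinds the generator before markMessagesProcessed() ever runs.

```typescript
// Hypothetical sketch of the generator loop shape, not the actual code.
type PendingMessage = { id: number };
declare function processSDKResponse(msg: PendingMessage): Promise<void>;
declare const logger: { error: (scope: string, msg: string, err: unknown) => void };

// BEFORE: the broad try-catch swallowed SDK errors, the loop kept going,
// and the message was eventually marked processed.
async function processLoopBefore(messages: PendingMessage[]): Promise<void> {
  for (const message of messages) {
    try {
      await processSDKResponse(message); // stores observations, then marks processed
    } catch (error) {
      logger.error('SDK', 'processing failed', error); // suppressed; loop continues
    }
  }
}

// AFTER: with the catch removed, an SDK error unwinds the whole generator -
// after observations were stored but BEFORE markMessagesProcessed() ran,
// leaving the message stuck in 'processing' for crash recovery to replay.
async function processLoopAfter(messages: PendingMessage[]): Promise<void> {
  for (const message of messages) {
    await processSDKResponse(message);
  }
}
```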
### Evidence from the Database

Session 75894 (content_session_id: 56f94e5d-2514-4d44-aa43-f5e31d9b4c38):

- 26 pending messages queued (all unique)
- Only 7 observations should have been created
- But 8+ duplicates of "Fixed missing closing brace" were created
- Created over a 23-minute window (9:32 PM - 9:55 PM)
- Indicates repeated crashes and recoveries
## Fix Strategy

### Short-term Fix (Critical)

**Option 1: Transaction-based atomic completion (RECOMMENDED)**

Wrap observation storage and message completion in a single transaction:

```typescript
// In SDKAgent.ts processSDKResponse()
private async processSDKResponse(...): Promise<void> {
  const pendingStore = this.sessionManager.getPendingMessageStore();

  // Build the transaction
  const db = this.dbManager.getSessionStore().db;
  const saveTransaction = db.transaction(() => {
    // Parse and store observations
    const observations = parseObservations(text, session.contentSessionId);
    const observationIds = [];
    for (const obs of observations) {
      const { id } = this.dbManager.getSessionStore().storeObservation(...);
      observationIds.push(id);
    }

    // Parse and store summary
    const summary = parseSummary(text, session.sessionDbId);
    if (summary) {
      this.dbManager.getSessionStore().storeSummary(...);
    }

    // CRITICAL: mark messages as processed IN THE SAME TRANSACTION
    for (const messageId of session.pendingProcessingIds) {
      pendingStore.markProcessed(messageId);
    }

    return observationIds;
  });

  // Execute the transaction atomically
  const observationIds = saveTransaction();

  // Broadcast to SSE AFTER the transaction commits
  for (const obsId of observationIds) {
    worker?.sseBroadcaster.broadcast(...);
  }
}
```
**Option 2: Mark processed BEFORE storing (SIMPLER)**

```typescript
// In SDKAgent.ts processSDKResponse()
private async processSDKResponse(...): Promise<void> {
  // Mark messages as processed FIRST
  await this.markMessagesProcessed(session, worker);

  // Then store observations (idempotent)
  const observations = parseObservations(text, session.contentSessionId);
  for (const obs of observations) {
    this.dbManager.getSessionStore().storeObservation(...);
  }
}
```

**Risk**: If storage fails, the message is marked complete but the observation is lost. However, this is better than duplicates.
### Medium-term Fix (Important)

Add database-level deduplication:

```sql
-- Add a unique constraint
CREATE UNIQUE INDEX idx_observations_unique
  ON observations(memory_session_id, title, subtitle, type);

-- Modify storeObservation() to use INSERT OR IGNORE
INSERT OR IGNORE INTO observations (...) VALUES (...);
```

Or use the existing checkObservationExists() logic:

```typescript
// In SessionStore.ts storeObservation()
storeObservation(...): { id: number; createdAtEpoch: number } {
  // Check for an existing observation
  const existing = this.checkObservationExists(
    memorySessionId,
    observation.title,
    observation.subtitle,
    observation.type
  );
  if (existing) {
    logger.debug('DB', 'Observation already exists, skipping', {
      obsId: existing.id,
      title: observation.title
    });
    return { id: existing.id, createdAtEpoch: existing.created_at_epoch };
  }
  // Insert new observation...
}
```
### Long-term Fix (Architectural)

Redesign crash recovery to be idempotent (a sketch follows this list):

1. The message status flow should be:
   pending → processing → processed (one-way, no resets)

2. Stuck message recovery should:
   - Create a NEW message for retry (with a retry_count)
   - Mark the old message as 'failed' or 'abandoned'
   - Never reset 'processing' → 'pending'

3. The SDK agent should:
   - Track which observations were created for each message
   - Skip observation creation if the message was already processed
   - Use the message ID as an idempotency key
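A minimal sketch of the idempotency-key idea combined with the one-way status transition, assuming a better-sqlite3 handle; the source_message_id column is invented here for illustration and does not exist in the current schema:

```typescript
import Database from 'better-sqlite3';

// Hypothetical sketch: skip observation creation when the source message was
// already processed, using the pending-message ID as the idempotency key.
// Table names follow the report; 'source_message_id' is an assumption.
function storeObservationIdempotent(
  db: Database.Database,
  messageId: number,
  obs: { memorySessionId: string; title: string; subtitle: string; type: string }
): number {
  const insert = db.transaction(() => {
    // One-way status transition: only 'processing' → 'processed' succeeds.
    const marked = db.prepare(`
      UPDATE pending_messages
      SET status = 'processed'
      WHERE id = ? AND status = 'processing'
    `).run(messageId);

    if (marked.changes === 0) {
      // Already processed (or reset): do not store the observation again.
      return -1;
    }

    const result = db.prepare(`
      INSERT INTO observations (memory_session_id, title, subtitle, type, source_message_id)
      VALUES (?, ?, ?, ?, ?)
    `).run(obs.memorySessionId, obs.title, obs.subtitle, obs.type, messageId);
    return Number(result.lastInsertRowid);
  });
  return insert();
}
```

Because the status flip and the insert commit together, a replay of the same message finds it already 'processed' and stores nothing.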
## Testing Plan

1. **Reproduce the regression** (a test sketch follows this list):
   - Create a session with multiple tool uses
   - Force an SDK crash during observation processing
   - Verify duplicates are NOT created with the fix

2. **Edge cases**:
   - Test worker restart during observation storage
   - Test network failure during Chroma sync
   - Test database write failure scenarios

3. **Performance**:
   - Verify the transaction doesn't slow down processing
   - Test with high observation volume (100+ per session)
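A self-contained sketch of the reproduction test for item 1, exercising the medium-term fix (unique index + INSERT OR IGNORE) against an in-memory database with a minimal schema; the crash is simulated by simply never marking the message processed before replaying it:

```typescript
import Database from 'better-sqlite3';

// Sketch: store an observation, "crash" before marking the message processed,
// replay the same message, and assert the duplicate insert is a no-op.
function reproduceAndVerifyFix(): void {
  const db = new Database(':memory:');
  db.exec(`
    CREATE TABLE observations (
      id INTEGER PRIMARY KEY,
      memory_session_id TEXT, title TEXT, subtitle TEXT, type TEXT
    );
    CREATE UNIQUE INDEX idx_observations_unique
      ON observations(memory_session_id, title, subtitle, type);
  `);

  const store = db.prepare(`
    INSERT OR IGNORE INTO observations (memory_session_id, title, subtitle, type)
    VALUES (?, ?, ?, ?)
  `);

  // First pass stores the observation; the crash happens before completion.
  store.run('mem-1', 'Fixed Missing Closing Brace', 'SearchManager.ts', 'fix');

  // Crash recovery replays the same message; the unique index rejects it.
  const replay = store.run('mem-1', 'Fixed Missing Closing Brace', 'SearchManager.ts', 'fix');
  console.assert(replay.changes === 0, 'duplicate insert should be ignored');

  const { n } = db.prepare(`SELECT COUNT(*) AS n FROM observations`).get() as { n: number };
  console.assert(n === 1, `expected 1 observation, got ${n}`);
}

reproduceAndVerifyFix();
```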
## Cleanup Required

Run the existing cleanup script to remove the current duplicates:

```bash
cd /Users/alexnewman/Scripts/claude-mem
npm run cleanup-duplicates
```

This script identifies duplicates by (memory_session_id, title, subtitle, type) and keeps the earliest row (MIN(id)).
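The underlying delete presumably looks something like this (a sketch of the keep-MIN(id) strategy described above, not the actual cleanup-duplicates.ts implementation):

```typescript
import Database from 'better-sqlite3';

// Sketch: delete every row that is not the earliest (MIN(id)) member of its
// (memory_session_id, title, subtitle, type) group.
function removeDuplicateObservations(db: Database.Database): number {
  const result = db.prepare(`
    DELETE FROM observations
    WHERE id NOT IN (
      SELECT MIN(id) FROM observations
      GROUP BY memory_session_id, title, subtitle, type
    )
  `).run();
  return result.changes; // number of duplicate rows removed
}
```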
## Files Requiring Changes

- src/services/worker/SDKAgent.ts - Add transaction or reorder completion
- src/services/sqlite/SessionStore.ts - Add deduplication check
- src/services/sqlite/migrations.ts - Add unique index (optional)
- src/services/worker/http/routes/SessionRoutes.ts - Improve crash recovery logging
## Estimated Impact

- **Severity**: Critical (data integrity)
- **Scope**: All sessions since 2026-01-02 ~9:30 PM
- **User impact**: Confusing duplicate memories, inflated token counts
- **Database impact**: ~50-100+ duplicate rows
## References

- Original issue: Generator failure observations (11 duplicates)
- Related commit: 776f4ea "Refactor hooks to streamline error handling"
- Cleanup script: /Users/alexnewman/Scripts/claude-mem/src/bin/cleanup-duplicates.ts
- Related report: docs/reports/2026-01-02--stuck-observations.md