diff --git a/docs/plans/session-logic-fixes.md b/docs/plans/session-logic-fixes.md new file mode 100644 index 00000000..fd79ab35 --- /dev/null +++ b/docs/plans/session-logic-fixes.md @@ -0,0 +1,1123 @@ +# Session Logic Fixes - Claude-Mem + +**Status:** Planning +**Created:** 2025-10-16 +**Priority:** High +**Estimated Effort:** 2-3 days + +## Executive Summary + +The claude-mem session logic architecture is fundamentally sound, using Claude Agent SDK in streaming input mode with Unix socket IPC for real-time observation processing. However, **we need to verify the basic happy path works end-to-end before addressing edge cases**. + +**Critical Goal:** Session ends → summary generated → next session immediately sees summary in context + +**Overall Assessment:** Architecture is correct, but needs systematic verification that the happy path works, then resilience improvements + +**Current Status:** Unknown if basic cycle works - need to test and debug the core flow first + +## Feedback Applied (2025-10-16) + +### Round 1: Technical Corrections +- ✅ Confirmed architectural approach is sound +- ❌ **Corrected:** SessionEnd hooks already exist in Claude Code - we're configuring, not implementing +- ✅ Technical fixes for resilience issues are sound + +### Round 2: Priority Reordering (MAJOR CHANGE) +**Critical realization:** The document focused on edge cases (zombies, crashes) when the basic happy path might not even work yet. + +**Complete restructure:** +1. **Phase 0 (NEW - TOP PRIORITY):** Verify the basic cycle works + - Does Stop hook fire on normal exit? + - Does worker generate and store summary? + - Does context hook load summaries on next session? + - End-to-end integration test + +2. **Phase 1 (SECOND PRIORITY):** Fix resilience issues + - Zombie workers, race conditions, stale sockets + - All the original issues moved here + +**Key principle:** Everything else is irrelevant if "session ends → next session sees summary" doesn't work. 
+ +**Revised Focus:** Get the fucking happy path working first, then worry about edge cases. + +## Architecture Overview + +### Current Flow + +``` +SessionStart (startup) + → context-hook.ts:15 + → Loads recent summaries from DB + → Outputs markdown to stdout (becomes context) + +UserPromptSubmit + → new-hook.ts:16 + → Creates SDK session in DB (status='active') + → Spawns detached worker process + → Worker starts immediately, hooks return + +Worker Process (worker.ts:75) + → Starts Unix socket server at /tmp/claude-mem-worker-{id}.sock + → Runs SDK agent with streaming input (async generator) + → Yields init prompt to SDK agent + → Waits for messages from hooks + +PostToolUse (fired for each tool) + → save-hook.ts:24 + → Sends observation to worker via Unix socket + → Worker receives → yields to SDK agent + → SDK agent analyzes → returns XML + → Worker parses XML → stores in observations table + +Stop (session ends) + → summary-hook.ts:15 + → Sends FINALIZE message to worker via socket + → Worker yields finalize prompt to SDK agent + → SDK agent generates XML + → Worker parses → stores in session_summaries table + → Worker marks session completed, closes socket, exits +``` + +### Key Components + +**Hook Files:** +- `src/hooks/context.ts` - SessionStart hook logic +- `src/hooks/new.ts` - UserPromptSubmit hook logic +- `src/hooks/save.ts` - PostToolUse hook logic +- `src/hooks/summary.ts` - Stop hook logic +- `src/bin/hooks/*.ts` - Entry point wrappers for each hook + +**Worker:** +- `src/sdk/worker.ts` - Main worker process with SDK integration +- `src/sdk/prompts.ts` - Prompt generation for SDK agent +- `src/sdk/parser.ts` - XML parser for SDK responses + +**Database:** +- `src/services/sqlite/HooksDatabase.ts` - Lightweight DB interface for hooks +- `src/services/sqlite/migrations.ts` - Schema definitions + +**Configuration:** +- `hooks/hooks.json` - Hook configuration for Claude Code plugin + +### Technologies + +- **IPC:** Unix domain sockets 
(`/tmp/claude-mem-worker-{id}.sock`) +- **SDK Mode:** Streaming input (async generator pattern) +- **Output Format:** XML blocks (`` and ``) +- **Process Model:** Detached worker (spawn with detached: true, stdio: 'ignore') +- **Database:** SQLite with Bun + +## Identified Issues + +### Phase 0: Verify Happy Path Works (DO THIS FIRST) + +**Priority:** CRITICAL - Everything else is irrelevant if the basic cycle doesn't work + +**Goal:** Prove that when a session ends normally, the next session immediately sees the summary in its context. + +#### Test 0.1: Does Stop Hook Fire on Normal Exit? + +**What to test:** +```bash +# Start Claude Code session +claude + +# Do some work (read files, etc) + +# Exit normally +exit + +# Check logs - did Stop hook run? +``` + +**Expected behavior:** +- Stop hook (`summary-hook`) should fire +- Should send FINALIZE message to worker socket +- Worker should receive it and generate summary + +**How to verify:** +1. Add logging to `src/hooks/summary.ts` at the top of `summaryHook()` +2. Add logging when sending socket message +3. Exit session normally and check logs + +**If it doesn't work:** Debug why Stop hook isn't firing or why socket message fails + +--- + +#### Test 0.2: Does Worker Receive FINALIZE and Generate Summary? + +**What to test:** +After Stop hook fires, does the worker: +1. Receive the FINALIZE message +2. Yield finalize prompt to SDK agent +3. Get back a summary from SDK +4. Parse the XML +5. Store it in `session_summaries` table + +**How to verify:** +1. Add console.error logging in `src/sdk/worker.ts:239` in the message handler +2. Log when FINALIZE is received +3. Log the SDK agent response +4. Log when summary is parsed +5. 
Query DB after session ends: + ```bash + sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM session_summaries ORDER BY created_at DESC LIMIT 1" + ``` + +**If it doesn't work:** +- Check if worker is even running (ps aux | grep worker) +- Check if socket message arrived +- Check if SDK agent returned valid XML +- Check if parser worked +- Check if DB insert succeeded + +--- + +#### Test 0.3: Does Context Hook Load Summaries? + +**What to test:** +When starting a new session, does context hook: +1. Query recent summaries from DB +2. Format them as markdown +3. Output to stdout (becomes context) + +**How to verify:** +1. Add logging to `src/hooks/context.ts:24` +2. Log the summaries retrieved from DB +3. Log the markdown output +4. Start new session and check: + - Console output (should see markdown) + - Claude's context (ask "what did we do last session?") + +**If it doesn't work:** +- Check if SessionStart hook is firing +- Check if DB query returns results +- Check if markdown is being formatted correctly +- Check if output is going to stdout properly + +--- + +#### Test 0.4: End-to-End Integration Test + +**What to test:** +Full cycle from start to finish: + +```bash +# Session 1 +claude +# Do some work +echo "test file" > test.txt +cat test.txt +exit + +# Verify summary was stored +sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT summary_text FROM session_summaries ORDER BY created_at DESC LIMIT 1" + +# Session 2 +claude +# Ask Claude: "What did we do last session?" +# Expected: Claude should know we created and read test.txt +``` + +**Success criteria:** +- ✅ Summary appears in DB after session 1 +- ✅ Session 2 context includes summary from session 1 +- ✅ Claude can answer questions about previous session + +**If it doesn't work:** +- Review logs from Tests 0.1-0.3 +- Add more granular logging +- Check each step of the pipeline + +--- + +#### Common Failure Points & Debugging + +**If summaries aren't showing up in new sessions:** + +1. 
**Stop hook not configured/firing:** + ```bash + # Check hooks config + cat ~/.claude/plugins/claude-mem/hooks.json | jq '.hooks.Stop' + + # Should see summary-hook configured + # If not, hooks.json is wrong or plugin not installed + ``` + +2. **Worker not running:** + ```bash + ps aux | grep claude-mem-worker + + # If no worker, UserPromptSubmit hook failed to spawn it + # Check new-hook logs + ``` + +3. **Socket communication failing:** + ```bash + # Check socket exists + ls /tmp/claude-mem-worker-*.sock + + # Try to connect manually + echo '{"type":"finalize"}' | nc -U /tmp/claude-mem-worker-*.sock + ``` + +4. **SDK agent not returning summary:** + - Check API key is set + - Check SDK agent prompt is valid + - Check XML parser is working + - Add logging to see SDK response + +5. **DB write failing:** + ```bash + # Check DB exists and is writable + sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM sdk_sessions WHERE status='active'" + + # If no active session, new-hook didn't create it + ``` + +6. **Context hook not loading:** + ```bash + # Check SessionStart hook configured + cat ~/.claude/plugins/claude-mem/hooks.json | jq '.hooks.SessionStart' + + # Start session and check for context output + # Should see markdown in initial context + ``` + +**Debugging Checklist:** +- [ ] Verify all hooks are configured in hooks.json +- [ ] Verify plugin is installed correctly +- [ ] Add console.error logging to all hooks (goes to stderr, visible in terminal) +- [ ] Check each step of the pipeline systematically +- [ ] Don't assume anything works - verify each piece + +--- + +### Phase 1: Critical Resilience Issues (Fix After Happy Path Works) + +#### 1. Zombie Worker Processes + +**Severity:** High +**Impact:** Memory/CPU waste, orphaned processes accumulate + +**Problem:** +If Stop hook never fires (user Ctrl-C, Claude Code crash), worker runs forever waiting for FINALIZE message. 
+ +**Location:** `src/sdk/worker.ts:239` +```typescript +// Current code - infinite loop with no timeout +while (!this.isFinalized) { + if (this.pendingMessages.length === 0) { + await this.sleep(100); + continue; + } + // ... process messages +} +``` + +**Fix Required:** +```typescript +// Add watchdog timer +class SDKWorker { + private maxIdleTime = 2 * 60 * 60 * 1000; // 2 hours + private lastActivityTime = Date.now(); + + private updateActivity(): void { + this.lastActivityTime = Date.now(); + } + + private async* createMessageGenerator(): AsyncIterable<...> { + // Yield initial prompt + const initPrompt = buildInitPrompt(...); + yield { type: 'user', message: { role: 'user', content: initPrompt } }; + this.updateActivity(); + + while (!this.isFinalized) { + // Check for timeout + const idleTime = Date.now() - this.lastActivityTime; + if (idleTime > this.maxIdleTime) { + console.error(`[SDK Worker] Timeout - no activity for ${this.maxIdleTime / 1000}s`); + this.isFinalized = true; + break; + } + + if (this.pendingMessages.length === 0) { + await this.sleep(100); + continue; + } + + // Process messages and update activity + this.updateActivity(); + // ... existing message processing + } + } +} +``` + +**Testing:** +1. Start claude-mem session +2. Kill Claude Code process (kill -9) +3. Verify worker exits after 2 hours +4. Check no orphaned processes remain + +--- + +#### 2. SessionEnd Hook Not Configured + +**Severity:** High +**Impact:** No cleanup on abrupt exit, sessions stuck in "active" status + +**Problem:** +SessionEnd hooks are a built-in Claude Code feature that "run when a session ends" and "cannot block session termination but can perform cleanup tasks" ([docs](https://docs.claude.com/en/docs/claude-code/hooks#hook-events)). However, claude-mem's `hooks/hooks.json` does NOT configure this hook. Worker doesn't get cleaned up when Claude Code exits abruptly. + +**Note:** This is NOT a missing feature in Claude Code - SessionEnd hooks already exist. 
We just need to configure them. + +**Current Configuration:** `hooks/hooks.json:1-51` +```json +{ + "hooks": { + "SessionStart": [...], + "UserPromptSubmit": [...], + "PostToolUse": [...], + "Stop": [...] + // SessionEnd is MISSING + } +} +``` + +**Fix Required:** + +SessionEnd hooks receive structured input including: +```json +{ + "session_id": "abc123", + "transcript_path": "~/.claude/projects/.../transcript.jsonl", + "cwd": "/Users/...", + "hook_event_name": "SessionEnd", + "reason": "exit" // or "clear", "logout", "prompt_input_exit", etc. +} +``` + +**Implementation Steps:** + +1. **Add SessionEnd configuration to hooks/hooks.json:** + +For events like SessionEnd that don't use matchers, we can omit the matcher field: + +```json +{ + "hooks": { + "SessionEnd": [ + { + "hooks": [ + { + "type": "command", + "command": "bun ${CLAUDE_PLUGIN_ROOT}/scripts/hooks/cleanup-hook.js", + "timeout": 60000 + } + ] + } + ] + } +} +``` + +2. **Create src/hooks/cleanup.ts:** +```typescript +import { HooksDatabase } from '../services/sqlite/HooksDatabase.js'; +import { getWorkerSocketPath } from '../shared/paths.js'; +import { existsSync, unlinkSync } from 'fs'; +import { execSync } from 'child_process'; + +export interface SessionEndInput { + session_id: string; + cwd: string; + reason: 'clear' | 'logout' | 'prompt_input_exit' | 'other'; + [key: string]: any; +} + +/** + * Cleanup Hook - SessionEnd + * Cleans up worker process and marks session as terminated + */ +export function cleanupHook(input?: SessionEndInput): void { + try { + if (!input) { + console.log('No input provided - this script is designed to run as a Claude Code SessionEnd hook'); + process.exit(0); + } + + const { session_id, reason } = input; + + // Find active SDK session + const db = new HooksDatabase(); + const session = db.findActiveSDKSession(session_id); + + if (!session) { + db.close(); + console.log('{"suppressOutput": true}'); + process.exit(0); + } + + // Get socket path and clean up socket file + 
const socketPath = getWorkerSocketPath(session.id); + + // Find and signal the worker BEFORE removing the socket file: + // lsof locates the worker by socket path, so the file must still exist + try { + const lsofOutput = execSync(`lsof ${socketPath} 2>/dev/null || true`, { encoding: 'utf8' }); + const pidMatch = lsofOutput.match(/\s+(\d+)\s+/); + if (pidMatch) { + const pid = pidMatch[1]; + console.error(`[claude-mem cleanup] Killing worker process ${pid}`); + process.kill(parseInt(pid, 10), 'SIGTERM'); + } + } catch (err) { + // Worker already dead or couldn't find it - that's fine + } + + // Remove the socket file now that the worker has been signaled + if (existsSync(socketPath)) { + try { + unlinkSync(socketPath); + } catch (err: any) { + console.error(`[claude-mem cleanup] Failed to remove socket: ${err.message}`); + } + } + + // Mark session as failed (not completed since it was terminated) + db.markSessionFailed(session.id); + db.close(); + + console.log('{"suppressOutput": true}'); + process.exit(0); + + } catch (error: any) { + console.error(`[claude-mem cleanup error: ${error.message}]`); + console.log('{"suppressOutput": true}'); + process.exit(0); + } +} +``` + +3. **Create src/bin/hooks/cleanup-hook.ts:** +```typescript +#!/usr/bin/env bun + +/** + * Cleanup Hook Entry Point - SessionEnd + * Standalone executable for plugin hooks + */ + +import { cleanupHook } from '../../hooks/cleanup.js'; + +// Read input from stdin +const input = await Bun.stdin.text(); + +try { + const parsed = input.trim() ? JSON.parse(input) : undefined; + cleanupHook(parsed); +} catch (error: any) { + console.error(`[claude-mem cleanup-hook error: ${error.message}]`); + console.log('{"suppressOutput": true}'); + process.exit(0); +} +``` + +4. **Update build process to compile cleanup-hook.ts to scripts/hooks/cleanup-hook.js** + +**Testing:** +1. Start claude-mem session +2. Exit Claude Code with Ctrl-C +3. Verify worker process is killed +4. Verify socket file is removed +5. 
Verify session marked as "failed" in DB + +--- + +#### 3. Stale Socket Files Block New Sessions + +**Severity:** Medium +**Impact:** Worker fails to start if previous worker crashed + +**Problem:** +If worker crashes, socket file persists at `/tmp/claude-mem-worker-{id}.sock`. Next worker with same session ID fails with EADDRINUSE. + +**Location:** `src/sdk/worker.ts:111-163` +```typescript +private async startSocketServer(): Promise<void> { + // Current code only removes if exists + if (existsSync(this.socketPath)) { + unlinkSync(this.socketPath); + } + + return new Promise((resolve, reject) => { + this.server = net.createServer((socket) => { ... }); + this.server.listen(this.socketPath, () => { resolve(); }); + }); +} +``` + +**Fix Required:** +```typescript +private async startSocketServer(): Promise<void> { + // Clean up stale socket if it exists + if (existsSync(this.socketPath)) { + // Test if socket is responsive + const isStale = await this.testSocketStale(this.socketPath); + if (isStale) { + console.error(`[SDK Worker] Removing stale socket: ${this.socketPath}`); + unlinkSync(this.socketPath); + } else { + // Socket is active - another worker is using this session ID + throw new Error(`Socket already in use: ${this.socketPath}`); + } + } + + return new Promise((resolve, reject) => { + this.server = net.createServer((socket) => { + let buffer = ''; + socket.on('data', (chunk) => { + // ... 
existing code + }); + }); + + this.server.on('error', (err: any) => { + if (err.code === 'EADDRINUSE') { + console.error(`[SDK Worker] Socket already in use: ${this.socketPath}`); + } + reject(err); + }); + + this.server.listen(this.socketPath, () => { + resolve(); + }); + }); +} + +/** + * Test if socket file is stale (no process listening) + */ +private async testSocketStale(socketPath: string): Promise<boolean> { + return new Promise((resolve) => { + const testClient = net.connect(socketPath); + + testClient.on('connect', () => { + // Socket is responsive - not stale + testClient.end(); + resolve(false); + }); + + testClient.on('error', () => { + // Socket exists but not responsive - stale + resolve(true); + }); + + // Timeout after 100ms + setTimeout(() => { + testClient.destroy(); + resolve(true); + }, 100); + }); +} +``` + +**Testing:** +1. Start worker, kill it with kill -9 +2. Verify socket file persists +3. Start new worker with same session ID +4. Verify old socket is detected as stale and removed +5. Verify new worker starts successfully + +--- + +#### 4. Race Condition on First Observation + +**Severity:** Medium +**Impact:** First observation might be lost if socket not ready + +**Problem:** +Worker startup is async (socket creation, SDK initialization). PostToolUse can fire immediately after UserPromptSubmit returns, before socket is ready. + +**Current Flow:** +1. UserPromptSubmit → creates session → spawns worker → returns immediately +2. PostToolUse fires (Claude reads a file) +3. save-hook tries to connect → ENOENT (socket not ready yet) +4. Connection fails → logs error, continues +5. 
First observation lost + +**Location:** `src/hooks/save.ts:71` +```typescript +const client = net.connect(socketPath, () => { + client.write(JSON.stringify(message) + '\n'); + client.end(); +}); + +client.on('error', (err) => { + // Currently just logs and continues - observation lost + console.error(`[claude-mem save] Socket error: ${err.message}`); +}); +``` + +**Fix Required:** +```typescript +/** + * Save Hook - PostToolUse + * Sends tool observations to worker via Unix socket with retry logic + */ +export function saveHook(input?: PostToolUseInput): void { + try { + if (!input) { + console.log('No input provided - this script is designed to run as a Claude Code PostToolUse hook'); + process.exit(0); + } + + const { session_id, tool_name, tool_input, tool_output } = input; + + if (SKIP_TOOLS.has(tool_name)) { + console.log('{"continue": true, "suppressOutput": true}'); + process.exit(0); + } + + const db = new HooksDatabase(); + const session = db.findActiveSDKSession(session_id); + db.close(); + + if (!session) { + console.log('{"continue": true, "suppressOutput": true}'); + process.exit(0); + } + + const socketPath = getWorkerSocketPath(session.id); + const message = { + type: 'observation', + tool_name, + tool_input: JSON.stringify(tool_input), + tool_output: JSON.stringify(tool_output) + }; + + // Try to send with retries + sendWithRetry(socketPath, message, 5).then(() => { + console.log('{"continue": true, "suppressOutput": true}'); + process.exit(0); + }).catch((err) => { + console.error(`[claude-mem save] Failed after retries: ${err.message}`); + console.log('{"continue": true, "suppressOutput": true}'); + process.exit(0); + }); + + } catch (error: any) { + console.error(`[claude-mem save error: ${error.message}]`); + console.log('{"continue": true, "suppressOutput": true}'); + process.exit(0); + } +} + +/** + * Send message to socket with exponential backoff retry + */ +async function sendWithRetry( + socketPath: string, + message: any, + maxRetries: 
number +): Promise<void> { + let retries = maxRetries; + let delay = 100; // Start with 100ms + + while (retries > 0) { + try { + await sendMessage(socketPath, message); + return; // Success + } catch (err: any) { + retries--; + if (retries === 0) { + throw err; // Out of retries + } + + // Exponential backoff + await sleep(delay); + delay = Math.min(delay * 2, 2000); // Cap at 2s + } + } +} + +/** + * Send single message to socket + */ +function sendMessage(socketPath: string, message: any): Promise<void> { + return new Promise((resolve, reject) => { + const client = net.connect(socketPath, () => { + client.write(JSON.stringify(message) + '\n'); + client.end(); + resolve(); + }); + + client.on('error', (err) => { + reject(err); + }); + }); +} + +function sleep(ms: number): Promise<void> { + return new Promise(resolve => setTimeout(resolve, ms)); +} +``` + +**Testing:** +1. Add artificial delay in worker startup +2. Fire PostToolUse immediately after UserPromptSubmit +3. Verify save-hook retries and succeeds +4. Verify observation is captured + +--- + +### Medium Priority (Should Fix) + +#### 5. Orphaned Active Sessions in Database + +**Severity:** Low +**Impact:** DB bloat, confusion about session status + +**Problem:** +Sessions marked "active" never transition to "completed" or "failed" if worker crashes or is killed. + +**Fix Required:** + +Create cleanup script: `src/commands/cleanup-sessions.ts` +```typescript +import { HooksDatabase } from '../services/sqlite/HooksDatabase.js'; + +/** + * Mark old active sessions as failed + */ +export function cleanupSessions(maxAgeHours: number = 24): void { + const db = new HooksDatabase(); + const maxAgeMs = maxAgeHours * 60 * 60 * 1000; + const cutoffEpoch = Date.now() - maxAgeMs; + + const query = (db as any).db.query(` + UPDATE sdk_sessions + SET status = 'failed', completed_at = datetime('now'), completed_at_epoch = ? + WHERE status = 'active' AND started_at_epoch < ? 
+ `); + + const result = query.run(Date.now(), cutoffEpoch); + console.log(`Marked ${result.changes} old active sessions as failed`); + + db.close(); +} +``` + +Add to CLI: `src/bin/cli.ts` +```typescript +.command('cleanup-sessions') +.description('Mark old active sessions as failed') +.option('--max-age <hours>', 'Maximum age in hours', '24') +.action((options) => { + cleanupSessions(parseInt(options.maxAge, 10)); +}) +``` + +**Alternative:** Add auto-expiry check in `context-hook`: +```typescript +// Before loading summaries, clean up stale sessions +const maxAgeMs = 24 * 60 * 60 * 1000; +const cutoffEpoch = Date.now() - maxAgeMs; +db.db.query(` + UPDATE sdk_sessions + SET status = 'failed' + WHERE status = 'active' AND started_at_epoch < ? +`).run(cutoffEpoch); +``` + +--- + +#### 6. SessionStart Only Runs on "startup" + +**Severity:** Low +**Impact:** No context loaded on /resume + +**Problem:** +`context-hook` only loads context on "startup" source, skips "resume", "clear", and "compact". + +**Location:** `src/hooks/context.ts:24` +```typescript +// Only run on startup (not on resume) +if (input.source && input.source !== 'startup') { + console.log(''); + process.exit(0); +} +``` + +**Fix Required:** +```typescript +// Load context on startup and resume +if (input.source && input.source !== 'startup' && input.source !== 'resume') { + console.log(''); // Skip for clear/compact + process.exit(0); +} +``` + +**Rationale:** +- **startup:** Load context (project overview) +- **resume:** Load context (user continuing work) +- **clear:** Skip (user wants fresh start) +- **compact:** Skip (just memory optimization, context preserved) + +--- + +### Low Priority (Nice to Have) + +#### 7. No Cost Control or Observation Limits + +**Severity:** Low +**Impact:** Long sessions can be expensive + +**Problem:** +No limits on SDK agent API calls. A session with thousands of tool calls could rack up significant costs. + +**Fix Ideas:** +1. 
Add observation counter, warn after N observations +2. Add cost estimation based on token usage +3. Add budget limit in config +4. Batch observations (send N at once instead of one-by-one) + +**Example:** +```typescript +class SDKWorker { + private observationCount = 0; + private maxObservations = 1000; + + private handleMessage(message: WorkerMessage): void { + if (message.type === 'observation') { + this.observationCount++; + if (this.observationCount > this.maxObservations) { + console.error(`[SDK Worker] Exceeded max observations: ${this.maxObservations}`); + this.isFinalized = true; + return; + } + } + this.pendingMessages.push(message); + } +} +``` + +--- + +#### 8. No Health Check Mechanism + +**Severity:** Low +**Impact:** Can't tell if worker is alive/healthy + +**Fix Ideas:** +1. Add `/status` command that checks for active workers +2. Add health check endpoint on socket (ping/pong) +3. Add metrics to DB (last_activity_at) + +--- + +#### 9. No Observation Deduplication + +**Severity:** Low +**Impact:** Duplicate observations if same tool executed multiple times + +**Fix Ideas:** +1. Hash tool_name + tool_input + tool_output +2. Check for duplicate hash before storing +3. Or let SDK agent handle deduplication naturally + +--- + +## Implementation Checklist + +### Phase 0: Verify Happy Path (DO THIS FIRST - HIGHEST PRIORITY) + +**Goal:** Prove the basic cycle works end-to-end before fixing edge cases. 
+ +- [ ] **Test 0.1: Verify Stop Hook Fires** + - [ ] Add logging to `src/hooks/summary.ts` + - [ ] Exit session normally and verify hook runs + - [ ] Verify FINALIZE message is sent to socket + +- [ ] **Test 0.2: Verify Worker Generates Summary** + - [ ] Add logging to worker message handler + - [ ] Verify FINALIZE message received + - [ ] Verify SDK agent response + - [ ] Verify summary parsed and stored in DB + - [ ] Query DB to confirm summary exists + +- [ ] **Test 0.3: Verify Context Hook Loads Summaries** + - [ ] Add logging to `src/hooks/context.ts` + - [ ] Start new session, verify summaries loaded + - [ ] Verify markdown output to stdout + - [ ] Verify Claude has context from previous session + +- [ ] **Test 0.4: End-to-End Integration Test** + - [ ] Run session 1 with test work + - [ ] Verify summary in DB + - [ ] Run session 2 + - [ ] Ask Claude about previous session + - [ ] Confirm Claude has correct context + +**STOP HERE:** Only proceed to Phase 1 after confirming all Phase 0 tests pass. 
+ +--- + +### Phase 1: Critical Resilience Fixes (Do After Phase 0) + +- [ ] Add watchdog timer to worker (Issue #1) + - [ ] Add lastActivityTime tracking + - [ ] Add timeout check in message generator loop + - [ ] Test with zombie worker scenario + +- [ ] Configure existing SessionEnd hook (Issue #2) + - [ ] Add SessionEnd configuration to hooks/hooks.json + - [ ] Create src/hooks/cleanup.ts (implements cleanup logic) + - [ ] Create src/bin/hooks/cleanup-hook.ts (entry point) + - [ ] Update build process to compile cleanup-hook + - [ ] Test with Ctrl-C exit and verify worker cleanup + +- [ ] Fix stale socket detection (Issue #3) + - [ ] Add testSocketStale method + - [ ] Update startSocketServer to check for stale sockets + - [ ] Test with crashed worker scenario + +- [ ] Fix save-hook race condition (Issue #4) + - [ ] Add sendWithRetry function + - [ ] Add exponential backoff logic + - [ ] Update save-hook to use retry logic + - [ ] Test with immediate PostToolUse + +### Phase 2: Medium Priority + +- [ ] Add session cleanup script (Issue #5) + - [ ] Create cleanup-sessions command + - [ ] Add to CLI + - [ ] Optional: Add auto-cleanup to context-hook + +- [ ] Fix SessionStart source handling (Issue #6) + - [ ] Update context-hook to load on "resume" + - [ ] Test with /resume command + +### Phase 3: Low Priority (Optional) + +- [ ] Add cost control (Issue #7) +- [ ] Add health checks (Issue #8) +- [ ] Add observation deduplication (Issue #9) + +## Testing Strategy + +### Unit Tests + +Create tests for each fix: +- `test/hooks/cleanup.test.ts` - SessionEnd hook +- `test/sdk/worker-timeout.test.ts` - Watchdog timer +- `test/hooks/save-retry.test.ts` - Retry logic + +### Integration Tests + +Test complete flows: +1. **Normal flow:** SessionStart → UserPromptSubmit → PostToolUse → Stop +2. **Crash recovery:** Worker crash → SessionEnd cleanup +3. **Zombie worker:** No Stop hook → Worker timeout +4. 
**Socket race:** Immediate PostToolUse → Retry success + +### Manual Testing Scenarios + +1. **Zombie Worker Test:** + ```bash + # Start session + claude + # Kill Claude with Ctrl-C + # Check for worker process + ps aux | grep claude-mem-worker + # Wait 2 hours, verify worker exits + ``` + +2. **SessionEnd Test:** + ```bash + # Start session + claude + # Exit normally or Ctrl-C + # Verify worker killed + # Verify socket removed + # Check DB for session status + sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM sdk_sessions" + ``` + +3. **Stale Socket Test:** + ```bash + # Start session + claude + # Kill worker with kill -9 + # Verify socket exists + ls /tmp/claude-mem-worker-*.sock + # Start new session + # Verify old socket removed, new session starts + ``` + +4. **Race Condition Test:** + ```bash + # Add delay to worker startup (for testing) + # Start session, immediately run command + claude "list all files" + # Verify first observation captured + ``` + +## File Modifications Required + +### New Files +- `src/hooks/cleanup.ts` - SessionEnd hook logic +- `src/bin/hooks/cleanup-hook.ts` - SessionEnd entry point +- `src/commands/cleanup-sessions.ts` - Session cleanup script +- `test/hooks/cleanup.test.ts` - Tests for SessionEnd hook +- `test/sdk/worker-timeout.test.ts` - Tests for watchdog timer +- `test/hooks/save-retry.test.ts` - Tests for retry logic + +### Modified Files +- `hooks/hooks.json` - Add SessionEnd configuration +- `src/sdk/worker.ts` - Add watchdog timer, stale socket detection +- `src/hooks/save.ts` - Add retry logic +- `src/hooks/context.ts` - Load context on resume +- `src/bin/cli.ts` - Add cleanup-sessions command + +## Dependencies + +No new dependencies required. All fixes use existing: +- `net` (Unix sockets) +- `fs` (file operations) +- `child_process` (process management) +- `bun:sqlite` (database) + +## Success Criteria + +### Phase 0 (Must Pass First) +1. ✅ Stop hook fires on normal exit +2. 
✅ Worker receives FINALIZE and generates summary +3. ✅ Summary is stored in DB correctly +4. ✅ Context hook loads summaries on next session +5. ✅ New session immediately sees previous session's summary in context +6. ✅ End-to-end integration test passes + +### Phase 1 (After Phase 0 Passes) +1. ✅ Worker processes never become zombies (exit after 2h max) +2. ✅ SessionEnd hook cleans up worker and socket on exit +3. ✅ Stale sockets don't block new sessions +4. ✅ First observation always captured (no race condition) +5. ✅ No orphaned "active" sessions in DB after 24h +6. ✅ Context loads on /resume +7. ✅ All tests pass + +## References + +- Claude Code Hooks Documentation: https://docs.claude.com/en/docs/claude-code/hooks +- Claude Agent SDK Streaming: https://docs.claude.com/en/api/agent-sdk/streaming-vs-single-mode +- Unix Domain Sockets: Node.js `net` module +- SQLite Best Practices: Bun SQLite documentation + +## Notes + +- All hooks must return `{"continue": true, "suppressOutput": true}` on error +- Hooks have 60s default timeout (configurable) +- Worker is detached process, doesn't block Claude Code +- SessionEnd hooks "cannot block session termination" per Claude Code docs +- Streaming input mode is the recommended SDK approach for this architecture
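
The first note above (hooks always emit `{"continue": true, "suppressOutput": true}` on error) could be enforced in one place rather than repeated in every `catch` block. The following is a minimal sketch, not existing code: `runHookSafely` is a hypothetical helper name, and the injectable `emit` parameter is an assumption added only so the behavior can be checked without capturing stdout.

```typescript
// Hypothetical helper (not part of claude-mem): guarantees that a hook
// always emits the "continue" response, even if its body throws.
const HOOK_OK = JSON.stringify({ continue: true, suppressOutput: true });

function runHookSafely(
  hookName: string,
  body: () => void,
  emit: (line: string) => void = (line) => console.log(line),
): void {
  try {
    body();
  } catch (error: unknown) {
    // Errors go to stderr so they never corrupt the JSON on stdout.
    const message = error instanceof Error ? error.message : String(error);
    console.error(`[claude-mem ${hookName} error: ${message}]`);
  }
  // Emitted on both the success and the failure path.
  emit(HOOK_OK);
}
```

A synchronous hook entry point might wrap its logic as `runHookSafely('save', () => { ... })` and then exit with code 0. Asynchronous hook bodies, such as the retry logic in Issue #4, would need a variant that awaits the body before emitting.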