# Session Logic Fixes - Claude-Mem **Status:** Planning **Created:** 2025-10-16 **Priority:** High **Estimated Effort:** 2-3 days ## Executive Summary The claude-mem session logic architecture is fundamentally sound, using Claude Agent SDK in streaming input mode with Unix socket IPC for real-time observation processing. However, **we need to verify the basic happy path works end-to-end before addressing edge cases**. **Critical Goal:** Session ends → summary generated → next session immediately sees summary in context **Overall Assessment:** Architecture is correct, but needs systematic verification that the happy path works, then resilience improvements **Current Status:** Unknown if basic cycle works - need to test and debug the core flow first ## Feedback Applied (2025-10-16) ### Round 1: Technical Corrections - ✅ Confirmed architectural approach is sound - ❌ **Corrected:** SessionEnd hooks already exist in Claude Code - we're configuring, not implementing - ✅ Technical fixes for resilience issues are sound ### Round 2: Priority Reordering (MAJOR CHANGE) **Critical realization:** The document focused on edge cases (zombies, crashes) when the basic happy path might not even work yet. **Complete restructure:** 1. **Phase 0 (NEW - TOP PRIORITY):** Verify the basic cycle works - Does Stop hook fire on normal exit? - Does worker generate and store summary? - Does context hook load summaries on next session? - End-to-end integration test 2. **Phase 1 (SECOND PRIORITY):** Fix resilience issues - Zombie workers, race conditions, stale sockets - All the original issues moved here **Key principle:** Everything else is irrelevant if "session ends → next session sees summary" doesn't work. **Revised Focus:** Get the fucking happy path working first, then worry about edge cases. ## Architecture Overview ### Current Flow ``` SessionStart (startup) → context-hook.ts:15 → Loads recent summaries from DB → Outputs markdown to stdout (becomes context) UserPromptSubmit → new-hook.ts:16 → Creates SDK session in DB (status='active') → Spawns detached worker process → Worker starts immediately, hooks return Worker Process (worker.ts:75) → Starts Unix socket server at /tmp/claude-mem-worker-{id}.sock → Runs SDK agent with streaming input (async generator) → Yields init prompt to SDK agent → Waits for messages from hooks PostToolUse (fired for each tool) → save-hook.ts:24 → Sends observation to worker via Unix socket → Worker receives → yields to SDK agent → SDK agent analyzes → returns XML → Worker parses XML → stores in observations table Stop (session ends) → summary-hook.ts:15 → Sends FINALIZE message to worker via socket → Worker yields finalize prompt to SDK agent → SDK agent generates XML → Worker parses → stores in session_summaries table → Worker marks session completed, closes socket, exits ``` ### Key Components **Hook Files:** - `src/hooks/context.ts` - SessionStart hook logic - `src/hooks/new.ts` - UserPromptSubmit hook logic - `src/hooks/save.ts` - PostToolUse hook logic - `src/hooks/summary.ts` - Stop hook logic - `src/bin/hooks/*.ts` - Entry point wrappers for each hook **Worker:** - `src/sdk/worker.ts` - Main worker process with SDK integration - `src/sdk/prompts.ts` - Prompt generation for SDK agent - `src/sdk/parser.ts` - XML parser for SDK responses **Database:** - `src/services/sqlite/HooksDatabase.ts` - Lightweight DB interface for hooks - `src/services/sqlite/migrations.ts` - Schema definitions **Configuration:** - `hooks/hooks.json` - Hook configuration for Claude Code plugin ### Technologies - **IPC:** Unix domain sockets (`/tmp/claude-mem-worker-{id}.sock`) - **SDK Mode:** Streaming input (async generator pattern) - **Output Format:** XML blocks (`` and ``) - **Process Model:** Detached worker (spawn with detached: true, stdio: 'ignore') - **Database:** SQLite with Bun ## Identified Issues ### Phase 0: Verify Happy Path Works (DO THIS FIRST) **Priority:** CRITICAL - Everything else is irrelevant if the basic cycle doesn't work **Goal:** Prove that when a session ends normally, the next session immediately sees the summary in its context. #### Test 0.1: Does Stop Hook Fire on Normal Exit? **What to test:** ```bash # Start Claude Code session claude # Do some work (read files, etc) # Exit normally exit # Check logs - did Stop hook run? ``` **Expected behavior:** - Stop hook (`summary-hook`) should fire - Should send FINALIZE message to worker socket - Worker should receive it and generate summary **How to verify:** 1. Add logging to `src/hooks/summary.ts` at the top of `summaryHook()` 2. Add logging when sending socket message 3. Exit session normally and check logs **If it doesn't work:** Debug why Stop hook isn't firing or why socket message fails --- #### Test 0.2: Does Worker Receive FINALIZE and Generate Summary? **What to test:** After Stop hook fires, does the worker: 1. Receive the FINALIZE message 2. Yield finalize prompt to SDK agent 3. Get back a summary from SDK 4. Parse the XML 5. Store it in `session_summaries` table **How to verify:** 1. Add console.error logging in `src/sdk/worker.ts:239` in the message handler 2. Log when FINALIZE is received 3. Log the SDK agent response 4. Log when summary is parsed 5. Query DB after session ends: ```bash sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM session_summaries ORDER BY created_at DESC LIMIT 1" ``` **If it doesn't work:** - Check if worker is even running (ps aux | grep worker) - Check if socket message arrived - Check if SDK agent returned valid XML - Check if parser worked - Check if DB insert succeeded --- #### Test 0.3: Does Context Hook Load Summaries? **What to test:** When starting a new session, does context hook: 1. Query recent summaries from DB 2. Format them as markdown 3. Output to stdout (becomes context) **How to verify:** 1. Add logging to `src/hooks/context.ts:24` 2. Log the summaries retrieved from DB 3. Log the markdown output 4. Start new session and check: - Console output (should see markdown) - Claude's context (ask "what did we do last session?") **If it doesn't work:** - Check if SessionStart hook is firing - Check if DB query returns results - Check if markdown is being formatted correctly - Check if output is going to stdout properly --- #### Test 0.4: End-to-End Integration Test **What to test:** Full cycle from start to finish: ```bash # Session 1 claude # Do some work echo "test file" > test.txt cat test.txt exit # Verify summary was stored sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT summary_text FROM session_summaries ORDER BY created_at DESC LIMIT 1" # Session 2 claude # Ask Claude: "What did we do last session?" # Expected: Claude should know we created and read test.txt ``` **Success criteria:** - ✅ Summary appears in DB after session 1 - ✅ Session 2 context includes summary from session 1 - ✅ Claude can answer questions about previous session **If it doesn't work:** - Review logs from Tests 0.1-0.3 - Add more granular logging - Check each step of the pipeline --- #### Common Failure Points & Debugging **If summaries aren't showing up in new sessions:** 1. **Stop hook not configured/firing:** ```bash # Check hooks config cat ~/.claude/plugins/claude-mem/hooks.json | jq '.hooks.Stop' # Should see summary-hook configured # If not, hooks.json is wrong or plugin not installed ``` 2. **Worker not running:** ```bash ps aux | grep claude-mem-worker # If no worker, UserPromptSubmit hook failed to spawn it # Check new-hook logs ``` 3. **Socket communication failing:** ```bash # Check socket exists ls /tmp/claude-mem-worker-*.sock # Try to connect manually echo '{"type":"finalize"}' | nc -U /tmp/claude-mem-worker-*.sock ``` 4. **SDK agent not returning summary:** - Check API key is set - Check SDK agent prompt is valid - Check XML parser is working - Add logging to see SDK response 5. **DB write failing:** ```bash # Check DB exists and is writable sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM sdk_sessions WHERE status='active'" # If no active session, new-hook didn't create it ``` 6. **Context hook not loading:** ```bash # Check SessionStart hook configured cat ~/.claude/plugins/claude-mem/hooks.json | jq '.hooks.SessionStart' # Start session and check for context output # Should see markdown in initial context ``` **Debugging Checklist:** - [ ] Verify all hooks are configured in hooks.json - [ ] Verify plugin is installed correctly - [ ] Add console.error logging to all hooks (goes to stderr, visible in terminal) - [ ] Check each step of the pipeline systematically - [ ] Don't assume anything works - verify each piece --- ### Phase 1: Critical Resilience Issues (Fix After Happy Path Works) #### 1. Zombie Worker Processes **Severity:** High **Impact:** Memory/CPU waste, orphaned processes accumulate **Problem:** If Stop hook never fires (user Ctrl-C, Claude Code crash), worker runs forever waiting for FINALIZE message. **Location:** `src/sdk/worker.ts:239` ```typescript // Current code - infinite loop with no timeout while (!this.isFinalized) { if (this.pendingMessages.length === 0) { await this.sleep(100); continue; } // ... process messages } ``` **Fix Required:** ```typescript // Add watchdog timer class SDKWorker { private maxIdleTime = 2 * 60 * 60 * 1000; // 2 hours private lastActivityTime = Date.now(); private updateActivity(): void { this.lastActivityTime = Date.now(); } private async* createMessageGenerator(): AsyncIterable<...> { // Yield initial prompt const initPrompt = buildInitPrompt(...); yield { type: 'user', message: { role: 'user', content: initPrompt } }; this.updateActivity(); while (!this.isFinalized) { // Check for timeout const idleTime = Date.now() - this.lastActivityTime; if (idleTime > this.maxIdleTime) { console.error(`[SDK Worker] Timeout - no activity for ${this.maxIdleTime / 1000}s`); this.isFinalized = true; break; } if (this.pendingMessages.length === 0) { await this.sleep(100); continue; } // Process messages and update activity this.updateActivity(); // ... existing message processing } } } ``` **Testing:** 1. Start claude-mem session 2. Kill Claude Code process (kill -9) 3. Verify worker exits after 2 hours 4. Check no orphaned processes remain --- #### 2. SessionEnd Hook Not Configured **Severity:** High **Impact:** No cleanup on abrupt exit, sessions stuck in "active" status **Problem:** SessionEnd hooks are a built-in Claude Code feature that "run when a session ends" and "cannot block session termination but can perform cleanup tasks" ([docs](https://docs.claude.com/en/docs/claude-code/hooks#hook-events)). However, claude-mem's `hooks/hooks.json` does NOT configure this hook. Worker doesn't get cleaned up when Claude Code exits abruptly. **Note:** This is NOT a missing feature in Claude Code - SessionEnd hooks already exist. We just need to configure them. **Current Configuration:** `hooks/hooks.json:1-51` ```json { "hooks": { "SessionStart": [...], "UserPromptSubmit": [...], "PostToolUse": [...], "Stop": [...] // SessionEnd is MISSING } } ``` **Fix Required:** SessionEnd hooks receive structured input including: ```json { "session_id": "abc123", "transcript_path": "~/.claude/projects/.../transcript.jsonl", "cwd": "/Users/...", "hook_event_name": "SessionEnd", "reason": "exit" // or "clear", "logout", "prompt_input_exit", etc. } ``` **Implementation Steps:** 1. **Add SessionEnd configuration to hooks/hooks.json:** For events like SessionEnd that don't use matchers, we can omit the matcher field: ```json { "hooks": { "SessionEnd": [ { "hooks": [ { "type": "command", "command": "bun ${CLAUDE_PLUGIN_ROOT}/scripts/hooks/cleanup-hook.js", "timeout": 60000 } ] } ] } } ``` 2. **Create src/hooks/cleanup.ts:** ```typescript import { HooksDatabase } from '../services/sqlite/HooksDatabase.js'; import { getWorkerSocketPath } from '../shared/paths.js'; import { existsSync, unlinkSync } from 'fs'; import { execSync } from 'child_process'; export interface SessionEndInput { session_id: string; cwd: string; reason: 'clear' | 'logout' | 'prompt_input_exit' | 'other'; [key: string]: any; } /** * Cleanup Hook - SessionEnd * Cleans up worker process and marks session as terminated */ export function cleanupHook(input?: SessionEndInput): void { try { if (!input) { console.log('No input provided - this script is designed to run as a Claude Code SessionEnd hook'); process.exit(0); } const { session_id, reason } = input; // Find active SDK session const db = new HooksDatabase(); const session = db.findActiveSDKSession(session_id); if (!session) { db.close(); console.log('{"suppressOutput": true}'); process.exit(0); } // Get socket path and clean up socket file const socketPath = getWorkerSocketPath(session.id); if (existsSync(socketPath)) { try { unlinkSync(socketPath); } catch (err) { console.error(`[claude-mem cleanup] Failed to remove socket: ${err.message}`); } } // Mark session as failed (not completed since it was terminated) db.markSessionFailed(session.id); db.close(); // Try to kill worker process if still running // Worker socket path includes session ID, so we can find it try { // Find worker process by socket file in lsof output const lsofOutput = execSync(`lsof ${socketPath} 2>/dev/null || true`, { encoding: 'utf8' }); const pidMatch = lsofOutput.match(/\s+(\d+)\s+/); if (pidMatch) { const pid = pidMatch[1]; console.error(`[claude-mem cleanup] Killing worker process ${pid}`); process.kill(parseInt(pid, 10), 'SIGTERM'); } } catch (err) { // Worker already dead or couldn't find it - that's fine } console.log('{"suppressOutput": true}'); process.exit(0); } catch (error: any) { console.error(`[claude-mem cleanup error: ${error.message}]`); console.log('{"suppressOutput": true}'); process.exit(0); } } ``` 3. **Create src/bin/hooks/cleanup-hook.ts:** ```typescript #!/usr/bin/env bun /** * Cleanup Hook Entry Point - SessionEnd * Standalone executable for plugin hooks */ import { cleanupHook } from '../../hooks/cleanup.js'; // Read input from stdin const input = await Bun.stdin.text(); try { const parsed = input.trim() ? JSON.parse(input) : undefined; cleanupHook(parsed); } catch (error: any) { console.error(`[claude-mem cleanup-hook error: ${error.message}]`); console.log('{"suppressOutput": true}'); process.exit(0); } ``` 4. **Update build process to compile cleanup-hook.ts to scripts/hooks/cleanup-hook.js** **Testing:** 1. Start claude-mem session 2. Exit Claude Code with Ctrl-C 3. Verify worker process is killed 4. Verify socket file is removed 5. Verify session marked as "failed" in DB --- #### 3. Stale Socket Files Block New Sessions **Severity:** Medium **Impact:** Worker fails to start if previous worker crashed **Problem:** If worker crashes, socket file persists at `/tmp/claude-mem-worker-{id}.sock`. Next worker with same session ID fails with EADDRINUSE. **Location:** `src/sdk/worker.ts:111-163` ```typescript private async startSocketServer(): Promise { // Current code only removes if exists if (existsSync(this.socketPath)) { unlinkSync(this.socketPath); } return new Promise((resolve, reject) => { this.server = net.createServer((socket) => { ... }); this.server.listen(this.socketPath, () => { resolve(); }); }); } ``` **Fix Required:** ```typescript private async startSocketServer(): Promise { // Clean up stale socket if it exists if (existsSync(this.socketPath)) { // Test if socket is responsive const isStale = await this.testSocketStale(this.socketPath); if (isStale) { console.error(`[SDK Worker] Removing stale socket: ${this.socketPath}`); unlinkSync(this.socketPath); } else { // Socket is active - another worker is using this session ID throw new Error(`Socket already in use: ${this.socketPath}`); } } return new Promise((resolve, reject) => { this.server = net.createServer((socket) => { let buffer = ''; socket.on('data', (chunk) => { // ... existing code }); }); this.server.on('error', (err: any) => { if (err.code === 'EADDRINUSE') { console.error(`[SDK Worker] Socket already in use: ${this.socketPath}`); } reject(err); }); this.server.listen(this.socketPath, () => { resolve(); }); }); } /** * Test if socket file is stale (no process listening) */ private async testSocketStale(socketPath: string): Promise { return new Promise((resolve) => { const testClient = net.connect(socketPath); testClient.on('connect', () => { // Socket is responsive - not stale testClient.end(); resolve(false); }); testClient.on('error', () => { // Socket exists but not responsive - stale resolve(true); }); // Timeout after 100ms setTimeout(() => { testClient.destroy(); resolve(true); }, 100); }); } ``` **Testing:** 1. Start worker, kill it with kill -9 2. Verify socket file persists 3. Start new worker with same session ID 4. Verify old socket is detected as stale and removed 5. Verify new worker starts successfully --- #### 4. Race Condition on First Observation **Severity:** Medium **Impact:** First observation might be lost if socket not ready **Problem:** Worker startup is async (socket creation, SDK initialization). PostToolUse can fire immediately after UserPromptSubmit returns, before socket is ready. **Current Flow:** 1. UserPromptSubmit → creates session → spawns worker → returns immediately 2. PostToolUse fires (Claude reads a file) 3. save-hook tries to connect → ENOENT (socket not ready yet) 4. Connection fails → logs error, continues 5. First observation lost **Location:** `src/hooks/save.ts:71` ```typescript const client = net.connect(socketPath, () => { client.write(JSON.stringify(message) + '\n'); client.end(); }); client.on('error', (err) => { // Currently just logs and continues - observation lost console.error(`[claude-mem save] Socket error: ${err.message}`); }); ``` **Fix Required:** ```typescript /** * Save Hook - PostToolUse * Sends tool observations to worker via Unix socket with retry logic */ export function saveHook(input?: PostToolUseInput): void { try { if (!input) { console.log('No input provided - this script is designed to run as a Claude Code PostToolUse hook'); process.exit(0); } const { session_id, tool_name, tool_input, tool_output } = input; if (SKIP_TOOLS.has(tool_name)) { console.log('{"continue": true, "suppressOutput": true}'); process.exit(0); } const db = new HooksDatabase(); const session = db.findActiveSDKSession(session_id); db.close(); if (!session) { console.log('{"continue": true, "suppressOutput": true}'); process.exit(0); } const socketPath = getWorkerSocketPath(session.id); const message = { type: 'observation', tool_name, tool_input: JSON.stringify(tool_input), tool_output: JSON.stringify(tool_output) }; // Try to send with retries sendWithRetry(socketPath, message, 5).then(() => { console.log('{"continue": true, "suppressOutput": true}'); process.exit(0); }).catch((err) => { console.error(`[claude-mem save] Failed after retries: ${err.message}`); console.log('{"continue": true, "suppressOutput": true}'); process.exit(0); }); } catch (error: any) { console.error(`[claude-mem save error: ${error.message}]`); console.log('{"continue": true, "suppressOutput": true}'); process.exit(0); } } /** * Send message to socket with exponential backoff retry */ async function sendWithRetry( socketPath: string, message: any, maxRetries: number ): Promise { let retries = maxRetries; let delay = 100; // Start with 100ms while (retries > 0) { try { await sendMessage(socketPath, message); return; // Success } catch (err: any) { retries--; if (retries === 0) { throw err; // Out of retries } // Exponential backoff await sleep(delay); delay = Math.min(delay * 2, 2000); // Cap at 2s } } } /** * Send single message to socket */ function sendMessage(socketPath: string, message: any): Promise { return new Promise((resolve, reject) => { const client = net.connect(socketPath, () => { client.write(JSON.stringify(message) + '\n'); client.end(); resolve(); }); client.on('error', (err) => { reject(err); }); }); } function sleep(ms: number): Promise { return new Promise(resolve => setTimeout(resolve, ms)); } ``` **Testing:** 1. Add artificial delay in worker startup 2. Fire PostToolUse immediately after UserPromptSubmit 3. Verify save-hook retries and succeeds 4. Verify observation is captured --- ### Medium Priority (Should Fix) #### 5. Orphaned Active Sessions in Database **Severity:** Low **Impact:** DB bloat, confusion about session status **Problem:** Sessions marked "active" never transition to "completed" or "failed" if worker crashes or is killed. **Fix Required:** Create cleanup script: `src/commands/cleanup-sessions.ts` ```typescript import { HooksDatabase } from '../services/sqlite/HooksDatabase.js'; /** * Mark old active sessions as failed */ export function cleanupSessions(maxAgeHours: number = 24): void { const db = new HooksDatabase(); const maxAgeMs = maxAgeHours * 60 * 60 * 1000; const cutoffEpoch = Date.now() - maxAgeMs; const query = (db as any).db.query(` UPDATE sdk_sessions SET status = 'failed', completed_at = datetime('now'), completed_at_epoch = ? WHERE status = 'active' AND started_at_epoch < ? `); const result = query.run(Date.now(), cutoffEpoch); console.log(`Marked ${result.changes} old active sessions as failed`); db.close(); } ``` Add to CLI: `src/bin/cli.ts` ```typescript .command('cleanup-sessions') .description('Mark old active sessions as failed') .option('--max-age ', 'Maximum age in hours', '24') .action((options) => { cleanupSessions(parseInt(options.maxAge, 10)); }) ``` **Alternative:** Add auto-expiry check in `context-hook`: ```typescript // Before loading summaries, clean up stale sessions const maxAgeMs = 24 * 60 * 60 * 1000; const cutoffEpoch = Date.now() - maxAgeMs; db.db.query(` UPDATE sdk_sessions SET status = 'failed' WHERE status = 'active' AND started_at_epoch < ? `).run(cutoffEpoch); ``` --- #### 6. SessionStart Only Runs on "startup" **Severity:** Low **Impact:** No context loaded on /resume **Problem:** `context-hook` only loads context on "startup" source, skips "resume", "clear", and "compact". **Location:** `src/hooks/context.ts:24` ```typescript // Only run on startup (not on resume) if (input.source && input.source !== 'startup') { console.log(''); process.exit(0); } ``` **Fix Required:** ```typescript // Load context on startup and resume if (input.source && input.source !== 'startup' && input.source !== 'resume') { console.log(''); // Skip for clear/compact process.exit(0); } ``` **Rationale:** - **startup:** Load context (project overview) - **resume:** Load context (user continuing work) - **clear:** Skip (user wants fresh start) - **compact:** Skip (just memory optimization, context preserved) --- ### Low Priority (Nice to Have) #### 7. No Cost Control or Observation Limits **Severity:** Low **Impact:** Long sessions can be expensive **Problem:** No limits on SDK agent API calls. A session with thousands of tools could rack up significant costs. **Fix Ideas:** 1. Add observation counter, warn after N observations 2. Add cost estimation based on token usage 3. Add budget limit in config 4. Batch observations (send N at once instead of one-by-one) **Example:** ```typescript class SDKWorker { private observationCount = 0; private maxObservations = 1000; private handleMessage(message: WorkerMessage): void { if (message.type === 'observation') { this.observationCount++; if (this.observationCount > this.maxObservations) { console.error(`[SDK Worker] Exceeded max observations: ${this.maxObservations}`); this.isFinalized = true; return; } } this.pendingMessages.push(message); } } ``` --- #### 8. No Health Check Mechanism **Severity:** Low **Impact:** Can't tell if worker is alive/healthy **Fix Ideas:** 1. Add `/status` command that checks for active workers 2. Add health check endpoint on socket (ping/pong) 3. Add metrics to DB (last_activity_at) --- #### 9. No Observation Deduplication **Severity:** Low **Impact:** Duplicate observations if same tool executed multiple times **Fix Ideas:** 1. Hash tool_name + tool_input + tool_output 2. Check for duplicate hash before storing 3. Or let SDK agent handle deduplication naturally --- ## Implementation Checklist ### Phase 0: Verify Happy Path (DO THIS FIRST - HIGHEST PRIORITY) **Goal:** Prove the basic cycle works end-to-end before fixing edge cases. - [ ] **Test 0.1: Verify Stop Hook Fires** - [ ] Add logging to `src/hooks/summary.ts` - [ ] Exit session normally and verify hook runs - [ ] Verify FINALIZE message is sent to socket - [ ] **Test 0.2: Verify Worker Generates Summary** - [ ] Add logging to worker message handler - [ ] Verify FINALIZE message received - [ ] Verify SDK agent response - [ ] Verify summary parsed and stored in DB - [ ] Query DB to confirm summary exists - [ ] **Test 0.3: Verify Context Hook Loads Summaries** - [ ] Add logging to `src/hooks/context.ts` - [ ] Start new session, verify summaries loaded - [ ] Verify markdown output to stdout - [ ] Verify Claude has context from previous session - [ ] **Test 0.4: End-to-End Integration Test** - [ ] Run session 1 with test work - [ ] Verify summary in DB - [ ] Run session 2 - [ ] Ask Claude about previous session - [ ] Confirm Claude has correct context **STOP HERE:** Only proceed to Phase 1 after confirming all Phase 0 tests pass. --- ### Phase 1: Critical Resilience Fixes (Do After Phase 0) - [ ] Add watchdog timer to worker (Issue #1) - [ ] Add lastActivityTime tracking - [ ] Add timeout check in message generator loop - [ ] Test with zombie worker scenario - [ ] Configure existing SessionEnd hook (Issue #2) - [ ] Add SessionEnd configuration to hooks/hooks.json - [ ] Create src/hooks/cleanup.ts (implements cleanup logic) - [ ] Create src/bin/hooks/cleanup-hook.ts (entry point) - [ ] Update build process to compile cleanup-hook - [ ] Test with Ctrl-C exit and verify worker cleanup - [ ] Fix stale socket detection (Issue #3) - [ ] Add testSocketStale method - [ ] Update startSocketServer to check for stale sockets - [ ] Test with crashed worker scenario - [ ] Fix save-hook race condition (Issue #4) - [ ] Add sendWithRetry function - [ ] Add exponential backoff logic - [ ] Update save-hook to use retry logic - [ ] Test with immediate PostToolUse ### Phase 2: Medium Priority - [ ] Add session cleanup script (Issue #5) - [ ] Create cleanup-sessions command - [ ] Add to CLI - [ ] Optional: Add auto-cleanup to context-hook - [ ] Fix SessionStart source handling (Issue #6) - [ ] Update context-hook to load on "resume" - [ ] Test with /resume command ### Phase 3: Low Priority (Optional) - [ ] Add cost control (Issue #7) - [ ] Add health checks (Issue #8) - [ ] Add observation deduplication (Issue #9) ## Testing Strategy ### Unit Tests Create tests for each fix: - `test/hooks/cleanup.test.ts` - SessionEnd hook - `test/sdk/worker-timeout.test.ts` - Watchdog timer - `test/hooks/save-retry.test.ts` - Retry logic ### Integration Tests Test complete flows: 1. **Normal flow:** SessionStart → UserPromptSubmit → PostToolUse → Stop 2. **Crash recovery:** Worker crash → SessionEnd cleanup 3. **Zombie worker:** No Stop hook → Worker timeout 4. **Socket race:** Immediate PostToolUse → Retry success ### Manual Testing Scenarios 1. **Zombie Worker Test:** ```bash # Start session claude # Kill Claude with Ctrl-C # Check for worker process ps aux | grep claude-mem-worker # Wait 2 hours, verify worker exits ``` 2. **SessionEnd Test:** ```bash # Start session claude # Exit normally or Ctrl-C # Verify worker killed # Verify socket removed # Check DB for session status sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM sdk_sessions" ``` 3. **Stale Socket Test:** ```bash # Start session claude # Kill worker with kill -9 # Verify socket exists ls /tmp/claude-mem-worker-*.sock # Start new session # Verify old socket removed, new session starts ``` 4. **Race Condition Test:** ```bash # Add delay to worker startup (for testing) # Start session, immediately run command claude "list all files" # Verify first observation captured ``` ## File Modifications Required ### New Files - `src/hooks/cleanup.ts` - SessionEnd hook logic - `src/bin/hooks/cleanup-hook.ts` - SessionEnd entry point - `src/commands/cleanup-sessions.ts` - Session cleanup script - `test/hooks/cleanup.test.ts` - Tests for SessionEnd hook - `test/sdk/worker-timeout.test.ts` - Tests for watchdog timer - `test/hooks/save-retry.test.ts` - Tests for retry logic ### Modified Files - `hooks/hooks.json` - Add SessionEnd configuration - `src/sdk/worker.ts` - Add watchdog timer, stale socket detection - `src/hooks/save.ts` - Add retry logic - `src/hooks/context.ts` - Load context on resume - `src/bin/cli.ts` - Add cleanup-sessions command ## Dependencies No new dependencies required. All fixes use existing: - `net` (Unix sockets) - `fs` (file operations) - `child_process` (process management) - `bun:sqlite` (database) ## Success Criteria ### Phase 0 (Must Pass First) 1. ✅ Stop hook fires on normal exit 2. ✅ Worker receives FINALIZE and generates summary 3. ✅ Summary is stored in DB correctly 4. ✅ Context hook loads summaries on next session 5. ✅ New session immediately sees previous session's summary in context 6. ✅ End-to-end integration test passes ### Phase 1 (After Phase 0 Passes) 1. ✅ Worker processes never become zombies (exit after 2h max) 2. ✅ SessionEnd hook cleans up worker and socket on exit 3. ✅ Stale sockets don't block new sessions 4. ✅ First observation always captured (no race condition) 5. ✅ No orphaned "active" sessions in DB after 24h 6. ✅ Context loads on /resume 7. ✅ All tests pass ## References - Claude Code Hooks Documentation: https://docs.claude.com/en/docs/claude-code/hooks - Claude Agent SDK Streaming: https://docs.claude.com/en/api/agent-sdk/streaming-vs-single-mode - Unix Domain Sockets: Node.js `net` module - SQLite Best Practices: Bun SQLite documentation ## Notes - All hooks must return `{"continue": true, "suppressOutput": true}` on error - Hooks have 60s default timeout (configurable) - Worker is detached process, doesn't block Claude Code - SessionEnd hooks "cannot block session termination" per Claude Code docs - Streaming input mode is the recommended SDK approach for this architecture