# Session Logic Fixes - Claude-Mem
**Status:** Planning
**Created:** 2025-10-16
**Priority:** High
**Estimated Effort:** 2-3 days
## Executive Summary
The claude-mem session logic architecture is fundamentally sound, using Claude Agent SDK in streaming input mode with Unix socket IPC for real-time observation processing. However, **we need to verify the basic happy path works end-to-end before addressing edge cases**.
**Critical Goal:** Session ends → summary generated → next session immediately sees summary in context
**Overall Assessment:** Architecture is correct, but needs systematic verification that the happy path works, then resilience improvements
**Current Status:** Unknown if basic cycle works - need to test and debug the core flow first
## Feedback Applied (2025-10-16)
### Round 1: Technical Corrections
- ✅ Confirmed architectural approach is sound
- **Corrected:** SessionEnd hooks already exist in Claude Code - we're configuring, not implementing
- ✅ Technical fixes for resilience issues are sound
### Round 2: Priority Reordering (MAJOR CHANGE)
**Critical realization:** The document focused on edge cases (zombies, crashes) when the basic happy path might not even work yet.
**Complete restructure:**
1. **Phase 0 (NEW - TOP PRIORITY):** Verify the basic cycle works
   - Does Stop hook fire on normal exit?
   - Does worker generate and store summary?
   - Does context hook load summaries on next session?
   - End-to-end integration test
2. **Phase 1 (SECOND PRIORITY):** Fix resilience issues
   - Zombie workers, race conditions, stale sockets
   - All the original issues moved here
**Key principle:** Everything else is irrelevant if "session ends → next session sees summary" doesn't work.
**Revised Focus:** Get the fucking happy path working first, then worry about edge cases.
## Architecture Overview
### Current Flow
```
SessionStart (startup)
→ context-hook.ts:15
→ Loads recent summaries from DB
→ Outputs markdown to stdout (becomes context)
UserPromptSubmit
→ new-hook.ts:16
→ Creates SDK session in DB (status='active')
→ Spawns detached worker process
→ Worker starts immediately, hooks return
Worker Process (worker.ts:75)
→ Starts Unix socket server at /tmp/claude-mem-worker-{id}.sock
→ Runs SDK agent with streaming input (async generator)
→ Yields init prompt to SDK agent
→ Waits for messages from hooks
PostToolUse (fired for each tool)
→ save-hook.ts:24
→ Sends observation to worker via Unix socket
→ Worker receives → yields to SDK agent
→ SDK agent analyzes → returns <observation> XML
→ Worker parses XML → stores in observations table
Stop (session ends)
→ summary-hook.ts:15
→ Sends FINALIZE message to worker via socket
→ Worker yields finalize prompt to SDK agent
→ SDK agent generates <summary> XML
→ Worker parses → stores in session_summaries table
→ Worker marks session completed, closes socket, exits
```
### Key Components
**Hook Files:**
- `src/hooks/context.ts` - SessionStart hook logic
- `src/hooks/new.ts` - UserPromptSubmit hook logic
- `src/hooks/save.ts` - PostToolUse hook logic
- `src/hooks/summary.ts` - Stop hook logic
- `src/bin/hooks/*.ts` - Entry point wrappers for each hook
**Worker:**
- `src/sdk/worker.ts` - Main worker process with SDK integration
- `src/sdk/prompts.ts` - Prompt generation for SDK agent
- `src/sdk/parser.ts` - XML parser for SDK responses
**Database:**
- `src/services/sqlite/HooksDatabase.ts` - Lightweight DB interface for hooks
- `src/services/sqlite/migrations.ts` - Schema definitions
**Configuration:**
- `hooks/hooks.json` - Hook configuration for Claude Code plugin
### Technologies
- **IPC:** Unix domain sockets (`/tmp/claude-mem-worker-{id}.sock`)
- **SDK Mode:** Streaming input (async generator pattern)
- **Output Format:** XML blocks (`<observation>` and `<summary>`)
- **Process Model:** Detached worker (spawn with detached: true, stdio: 'ignore')
- **Database:** SQLite with Bun
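
The hooks write one JSON object per line to the socket (`JSON.stringify(message) + '\n'`), so the worker has to buffer incoming chunks and split on newlines. The following is a minimal sketch of that framing, not the code in `worker.ts`; `WorkerMessage` and `drainMessages` are hypothetical names, and the message fields are assumed from the hook snippets in this document.

```typescript
// Hypothetical message shape; the real worker may carry more fields.
type WorkerMessage =
  | { type: 'observation'; tool_name: string; tool_input: string; tool_output: string }
  | { type: 'finalize' };

// Split buffered socket data into complete newline-terminated JSON messages.
// Returns the parsed messages plus the trailing partial line, which the caller
// keeps and prepends to the next chunk.
function drainMessages(buffer: string): { messages: WorkerMessage[]; rest: string } {
  const lines = buffer.split('\n');
  const rest = lines.pop() ?? ''; // last element is an incomplete line (or '')
  const messages: WorkerMessage[] = [];
  for (const line of lines) {
    if (!line.trim()) continue; // skip blank lines
    try {
      messages.push(JSON.parse(line) as WorkerMessage);
    } catch {
      console.error(`[sketch] Dropping malformed line: ${line}`);
    }
  }
  return { messages, rest };
}
```

In a `socket.on('data', ...)` handler this would be used as `buffer += chunk.toString(); const { messages, rest } = drainMessages(buffer); buffer = rest;`.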
## Identified Issues
### Phase 0: Verify Happy Path Works (DO THIS FIRST)
**Priority:** CRITICAL - Everything else is irrelevant if the basic cycle doesn't work
**Goal:** Prove that when a session ends normally, the next session immediately sees the summary in its context.
#### Test 0.1: Does Stop Hook Fire on Normal Exit?
**What to test:**
```bash
# Start Claude Code session
claude
# Do some work (read files, etc)
# Exit normally
exit
# Check logs - did Stop hook run?
```
**Expected behavior:**
- Stop hook (`summary-hook`) should fire
- Should send FINALIZE message to worker socket
- Worker should receive it and generate summary
**How to verify:**
1. Add logging to `src/hooks/summary.ts` at the top of `summaryHook()`
2. Add logging when sending socket message
3. Exit session normally and check logs
**If it doesn't work:** Debug why Stop hook isn't firing or why socket message fails
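
Because hook stdout is interpreted by Claude Code, diagnostic output is safer on stderr or in a file. A minimal sketch of a file logger that each hook could call is shown here; `debugLog` and the log path are hypothetical and not part of the existing codebase.

```typescript
import { appendFileSync, mkdirSync } from 'fs';
import { dirname, join } from 'path';
import { tmpdir } from 'os';

// Hypothetical log location; adjust to wherever claude-mem keeps its data.
const DEFAULT_LOG = join(tmpdir(), 'claude-mem-debug.log');

// Append one timestamped line per event, tagged with the hook name.
// Returns the line so callers (and tests) can inspect what was written.
function debugLog(hook: string, event: string, logPath: string = DEFAULT_LOG): string {
  const line = `${new Date().toISOString()} [${hook}] ${event}\n`;
  try {
    mkdirSync(dirname(logPath), { recursive: true });
    appendFileSync(logPath, line);
  } catch {
    // Logging must never break the hook itself
  }
  return line;
}
```

Calling `debugLog('summary-hook', 'fired')` at the top of `summaryHook()` and again just before the socket write gives two markers to look for after exiting the session.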
---
#### Test 0.2: Does Worker Receive FINALIZE and Generate Summary?
**What to test:**
After Stop hook fires, does the worker:
1. Receive the FINALIZE message
2. Yield finalize prompt to SDK agent
3. Get back a summary from SDK
4. Parse the XML
5. Store it in `session_summaries` table
**How to verify:**
1. Add console.error logging in `src/sdk/worker.ts:239` in the message handler
2. Log when FINALIZE is received
3. Log the SDK agent response
4. Log when summary is parsed
5. Query DB after session ends:
```bash
sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM session_summaries ORDER BY created_at DESC LIMIT 1"
```
**If it doesn't work:**
- Check if worker is even running (ps aux | grep worker)
- Check if socket message arrived
- Check if SDK agent returned valid XML
- Check if parser worked
- Check if DB insert succeeded
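
To separate "the SDK agent returned no XML" from "the parser rejected it", it can help to log whether the raw response contains a `<summary>` block at all. The helper below is a debugging sketch only, not the parser in `src/sdk/parser.ts`; `extractBlocks` is a hypothetical name and it assumes the tags appear without attributes.

```typescript
// Return the trimmed inner text of every <tag>...</tag> block in an SDK response.
function extractBlocks(response: string, tag: 'observation' | 'summary'): string[] {
  // [\s\S]*? matches across newlines, non-greedily, so adjacent blocks stay separate
  const pattern = new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`, 'g');
  const blocks: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(response)) !== null) {
    blocks.push((match[1] ?? '').trim());
  }
  return blocks;
}
```

If `extractBlocks(response, 'summary')` comes back empty, the problem is upstream of the parser (prompt or SDK response); if it returns a block but nothing reaches the DB, the parser or the insert is the place to look.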
---
#### Test 0.3: Does Context Hook Load Summaries?
**What to test:**
When starting a new session, does context hook:
1. Query recent summaries from DB
2. Format them as markdown
3. Output to stdout (becomes context)
**How to verify:**
1. Add logging to `src/hooks/context.ts:24`
2. Log the summaries retrieved from DB
3. Log the markdown output
4. Start new session and check:
- Console output (should see markdown)
- Claude's context (ask "what did we do last session?")
**If it doesn't work:**
- Check if SessionStart hook is firing
- Check if DB query returns results
- Check if markdown is being formatted correctly
- Check if output is going to stdout properly
---
#### Test 0.4: End-to-End Integration Test
**What to test:**
Full cycle from start to finish:
```bash
# Session 1
claude
# Do some work
echo "test file" > test.txt
cat test.txt
exit
# Verify summary was stored
sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT summary_text FROM session_summaries ORDER BY created_at DESC LIMIT 1"
# Session 2
claude
# Ask Claude: "What did we do last session?"
# Expected: Claude should know we created and read test.txt
```
**Success criteria:**
- ✅ Summary appears in DB after session 1
- ✅ Session 2 context includes summary from session 1
- ✅ Claude can answer questions about previous session
**If it doesn't work:**
- Review logs from Tests 0.1-0.3
- Add more granular logging
- Check each step of the pipeline
---
#### Common Failure Points & Debugging
**If summaries aren't showing up in new sessions:**
1. **Stop hook not configured/firing:**
```bash
# Check hooks config
cat ~/.claude/plugins/claude-mem/hooks.json | jq '.hooks.Stop'
# Should see summary-hook configured
# If not, hooks.json is wrong or plugin not installed
```
2. **Worker not running:**
```bash
ps aux | grep claude-mem-worker
# If no worker, UserPromptSubmit hook failed to spawn it
# Check new-hook logs
```
3. **Socket communication failing:**
```bash
# Check socket exists
ls /tmp/claude-mem-worker-*.sock
# Try to connect manually
echo '{"type":"finalize"}' | nc -U /tmp/claude-mem-worker-*.sock
```
4. **SDK agent not returning summary:**
- Check API key is set
- Check SDK agent prompt is valid
- Check XML parser is working
- Add logging to see SDK response
5. **DB write failing:**
```bash
# Check DB exists and is writable
sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM sdk_sessions WHERE status='active'"
# If no active session, new-hook didn't create it
```
6. **Context hook not loading:**
```bash
# Check SessionStart hook configured
cat ~/.claude/plugins/claude-mem/hooks.json | jq '.hooks.SessionStart'
# Start session and check for context output
# Should see markdown in initial context
```
**Debugging Checklist:**
- [ ] Verify all hooks are configured in hooks.json
- [ ] Verify plugin is installed correctly
- [ ] Add console.error logging to all hooks (goes to stderr, visible in terminal)
- [ ] Check each step of the pipeline systematically
- [ ] Don't assume anything works - verify each piece
---
### Phase 1: Critical Resilience Issues (Fix After Happy Path Works)
#### 1. Zombie Worker Processes
**Severity:** High
**Impact:** Memory/CPU waste, orphaned processes accumulate
**Problem:**
If Stop hook never fires (user Ctrl-C, Claude Code crash), worker runs forever waiting for FINALIZE message.
**Location:** `src/sdk/worker.ts:239`
```typescript
// Current code - infinite loop with no timeout
while (!this.isFinalized) {
  if (this.pendingMessages.length === 0) {
    await this.sleep(100);
    continue;
  }
  // ... process messages
}
```
**Fix Required:**
```typescript
// Add watchdog timer
class SDKWorker {
  private maxIdleTime = 2 * 60 * 60 * 1000; // 2 hours
  private lastActivityTime = Date.now();

  private updateActivity(): void {
    this.lastActivityTime = Date.now();
  }

  private async* createMessageGenerator(): AsyncIterable<...> {
    // Yield initial prompt
    const initPrompt = buildInitPrompt(...);
    yield { type: 'user', message: { role: 'user', content: initPrompt } };
    this.updateActivity();

    while (!this.isFinalized) {
      // Check for timeout
      const idleTime = Date.now() - this.lastActivityTime;
      if (idleTime > this.maxIdleTime) {
        console.error(`[SDK Worker] Timeout - no activity for ${this.maxIdleTime / 1000}s`);
        this.isFinalized = true;
        break;
      }

      if (this.pendingMessages.length === 0) {
        await this.sleep(100);
        continue;
      }

      // Process messages and update activity
      this.updateActivity();
      // ... existing message processing
    }
  }
}
```
**Testing:**
1. Start claude-mem session
2. Kill Claude Code process (kill -9)
3. Verify worker exits after 2 hours
4. Check no orphaned processes remain
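
Waiting two hours is impractical for automated tests. One option, sketched here under the assumption that the idle check can be pulled out of `SDKWorker`, is a small watchdog with an injectable clock so a test can advance time by hand. `IdleWatchdog` is a hypothetical helper, not existing code.

```typescript
// Tracks time since the last activity; the clock is injectable for tests.
class IdleWatchdog {
  private readonly maxIdleMs: number;
  private readonly now: () => number;
  private lastActivity: number;

  constructor(maxIdleMs: number, now: () => number = Date.now) {
    this.maxIdleMs = maxIdleMs;
    this.now = now;
    this.lastActivity = now();
  }

  // Record activity (call whenever a message is processed).
  touch(): void {
    this.lastActivity = this.now();
  }

  // True once strictly more than maxIdleMs has passed since the last activity.
  isExpired(): boolean {
    return this.now() - this.lastActivity > this.maxIdleMs;
  }
}
```

The worker loop would call `touch()` where it currently calls `updateActivity()` and check `isExpired()` at the top of each iteration; a unit test passes a fake clock instead of sleeping.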
---
#### 2. SessionEnd Hook Not Configured
**Severity:** High
**Impact:** No cleanup on abrupt exit, sessions stuck in "active" status
**Problem:**
SessionEnd hooks are a built-in Claude Code feature that "run when a session ends" and "cannot block session termination but can perform cleanup tasks" ([docs](https://docs.claude.com/en/docs/claude-code/hooks#hook-events)). However, claude-mem's `hooks/hooks.json` does NOT configure this hook. Worker doesn't get cleaned up when Claude Code exits abruptly.
**Note:** This is NOT a missing feature in Claude Code - SessionEnd hooks already exist. We just need to configure them.
**Current Configuration:** `hooks/hooks.json:1-51`
```json
{
  "hooks": {
    "SessionStart": [...],
    "UserPromptSubmit": [...],
    "PostToolUse": [...],
    "Stop": [...]
    // SessionEnd is MISSING
  }
}
```
**Fix Required:**
SessionEnd hooks receive structured input including:
```json
{
  "session_id": "abc123",
  "transcript_path": "~/.claude/projects/.../transcript.jsonl",
  "cwd": "/Users/...",
  "hook_event_name": "SessionEnd",
  "reason": "other" // or "clear", "logout", "prompt_input_exit"
}
```
**Implementation Steps:**
1. **Add SessionEnd configuration to hooks/hooks.json:**
For events like SessionEnd that don't use matchers, we can omit the matcher field:
```json
{
  "hooks": {
    "SessionEnd": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "bun ${CLAUDE_PLUGIN_ROOT}/scripts/hooks/cleanup-hook.js",
            "timeout": 60
          }
        ]
      }
    ]
  }
}
```
2. **Create src/hooks/cleanup.ts:**
```typescript
import { HooksDatabase } from '../services/sqlite/HooksDatabase.js';
import { getWorkerSocketPath } from '../shared/paths.js';
import { existsSync, unlinkSync } from 'fs';
import { execSync } from 'child_process';

export interface SessionEndInput {
  session_id: string;
  cwd: string;
  reason: 'clear' | 'logout' | 'prompt_input_exit' | 'other';
  [key: string]: any;
}

/**
 * Cleanup Hook - SessionEnd
 * Cleans up worker process and marks session as terminated
 */
export function cleanupHook(input?: SessionEndInput): void {
  try {
    if (!input) {
      console.log('No input provided - this script is designed to run as a Claude Code SessionEnd hook');
      process.exit(0);
    }

    const { session_id } = input;

    // Find active SDK session
    const db = new HooksDatabase();
    const session = db.findActiveSDKSession(session_id);
    if (!session) {
      db.close();
      console.log('{"suppressOutput": true}');
      process.exit(0);
    }

    const socketPath = getWorkerSocketPath(session.id);

    // Kill the worker first, while the socket file still exists:
    // lsof can only find the owning process through the socket path
    try {
      // -t prints bare PIDs, one per line
      const lsofOutput = execSync(`lsof -t ${socketPath} 2>/dev/null || true`, { encoding: 'utf8' });
      for (const line of lsofOutput.split('\n')) {
        const pid = parseInt(line, 10);
        if (Number.isNaN(pid) || pid === process.pid) continue;
        console.error(`[claude-mem cleanup] Killing worker process ${pid}`);
        process.kill(pid, 'SIGTERM');
      }
    } catch (err) {
      // Worker already dead or couldn't find it - that's fine
    }

    // Remove the socket file if the worker left it behind
    if (existsSync(socketPath)) {
      try {
        unlinkSync(socketPath);
      } catch (err: any) {
        console.error(`[claude-mem cleanup] Failed to remove socket: ${err.message}`);
      }
    }

    // Mark session as failed (not completed since it was terminated)
    db.markSessionFailed(session.id);
    db.close();

    console.log('{"suppressOutput": true}');
    process.exit(0);
  } catch (error: any) {
    console.error(`[claude-mem cleanup error: ${error.message}]`);
    console.log('{"suppressOutput": true}');
    process.exit(0);
  }
}
```
3. **Create src/bin/hooks/cleanup-hook.ts:**
```typescript
#!/usr/bin/env bun
/**
 * Cleanup Hook Entry Point - SessionEnd
 * Standalone executable for plugin hooks
 */
import { cleanupHook } from '../../hooks/cleanup.js';

// Read input from stdin
const input = await Bun.stdin.text();

try {
  const parsed = input.trim() ? JSON.parse(input) : undefined;
  cleanupHook(parsed);
} catch (error: any) {
  console.error(`[claude-mem cleanup-hook error: ${error.message}]`);
  console.log('{"suppressOutput": true}');
  process.exit(0);
}
```
4. **Update build process to compile cleanup-hook.ts to scripts/hooks/cleanup-hook.js**
**Testing:**
1. Start claude-mem session
2. Exit Claude Code with Ctrl-C
3. Verify worker process is killed
4. Verify socket file is removed
5. Verify session marked as "failed" in DB
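
The entry point passes whatever `JSON.parse` returns straight to the hook. A minimal sketch of validating the payload first is shown below; `parseSessionEndInput` is a hypothetical helper, and it checks only the fields this plan relies on (`session_id`, `cwd`, `reason`).

```typescript
// Subset of the SessionEnd payload used by the cleanup hook.
interface ParsedSessionEnd {
  session_id: string;
  cwd: string;
  reason: string;
}

// Parse and validate raw stdin; returns undefined for empty or malformed input.
function parseSessionEndInput(raw: string): ParsedSessionEnd | undefined {
  if (!raw.trim()) return undefined;
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return undefined;
  }
  if (typeof value !== 'object' || value === null) return undefined;
  const record = value as Record<string, unknown>;
  const sessionId = record['session_id'];
  const cwd = record['cwd'];
  const reason = record['reason'];
  if (typeof sessionId !== 'string' || typeof reason !== 'string') return undefined;
  return { session_id: sessionId, cwd: typeof cwd === 'string' ? cwd : '', reason };
}
```

Returning `undefined` instead of throwing keeps the hook on its "log and exit 0" path, which matches the requirement that SessionEnd never blocks termination.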
---
#### 3. Stale Socket Files Block New Sessions
**Severity:** Medium
**Impact:** Worker fails to start if previous worker crashed
**Problem:**
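If the worker crashes, the socket file persists at `/tmp/claude-mem-worker-{id}.sock`. The next worker with the same session ID can fail with EADDRINUSE if that file is still present when it calls `listen()`, while deleting the file unconditionally risks taking the path away from a worker that is still running.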
**Location:** `src/sdk/worker.ts:111-163`
```typescript
private async startSocketServer(): Promise<void> {
  // Current code removes any existing socket file without checking
  // whether it is stale or still owned by a live worker
  if (existsSync(this.socketPath)) {
    unlinkSync(this.socketPath);
  }

  return new Promise((resolve, reject) => {
    this.server = net.createServer((socket) => { ... });
    this.server.listen(this.socketPath, () => { resolve(); });
  });
}
```
**Fix Required:**
```typescript
private async startSocketServer(): Promise<void> {
  // Clean up stale socket if it exists
  if (existsSync(this.socketPath)) {
    // Test if socket is responsive
    const isStale = await this.testSocketStale(this.socketPath);
    if (isStale) {
      console.error(`[SDK Worker] Removing stale socket: ${this.socketPath}`);
      unlinkSync(this.socketPath);
    } else {
      // Socket is active - another worker is using this session ID
      throw new Error(`Socket already in use: ${this.socketPath}`);
    }
  }

  return new Promise((resolve, reject) => {
    this.server = net.createServer((socket) => {
      let buffer = '';
      socket.on('data', (chunk) => {
        // ... existing code
      });
    });

    this.server.on('error', (err: any) => {
      if (err.code === 'EADDRINUSE') {
        console.error(`[SDK Worker] Socket already in use: ${this.socketPath}`);
      }
      reject(err);
    });

    this.server.listen(this.socketPath, () => {
      resolve();
    });
  });
}

/**
 * Test if socket file is stale (no process listening)
 */
private async testSocketStale(socketPath: string): Promise<boolean> {
  return new Promise((resolve) => {
    const testClient = net.connect(socketPath);

    // Timeout after 100ms; cleared as soon as the connection attempt settles
    const timer = setTimeout(() => {
      testClient.destroy();
      resolve(true);
    }, 100);

    testClient.on('connect', () => {
      // Socket is responsive - not stale
      clearTimeout(timer);
      testClient.end();
      resolve(false);
    });

    testClient.on('error', () => {
      // Socket exists but not responsive - stale
      clearTimeout(timer);
      resolve(true);
    });
  });
}
```
**Testing:**
1. Start worker, kill it with kill -9
2. Verify socket file persists
3. Start new worker with same session ID
4. Verify old socket is detected as stale and removed
5. Verify new worker starts successfully
---
#### 4. Race Condition on First Observation
**Severity:** Medium
**Impact:** First observation might be lost if socket not ready
**Problem:**
Worker startup is async (socket creation, SDK initialization). PostToolUse can fire immediately after UserPromptSubmit returns, before socket is ready.
**Current Flow:**
1. UserPromptSubmit → creates session → spawns worker → returns immediately
2. PostToolUse fires (Claude reads a file)
3. save-hook tries to connect → ENOENT (socket not ready yet)
4. Connection fails → logs error, continues
5. First observation lost
**Location:** `src/hooks/save.ts:71`
```typescript
const client = net.connect(socketPath, () => {
  client.write(JSON.stringify(message) + '\n');
  client.end();
});

client.on('error', (err) => {
  // Currently just logs and continues - observation lost
  console.error(`[claude-mem save] Socket error: ${err.message}`);
});
```
**Fix Required:**
```typescript
/**
 * Save Hook - PostToolUse
 * Sends tool observations to worker via Unix socket with retry logic
 */
export function saveHook(input?: PostToolUseInput): void {
  try {
    if (!input) {
      console.log('No input provided - this script is designed to run as a Claude Code PostToolUse hook');
      process.exit(0);
    }

    const { session_id, tool_name, tool_input, tool_output } = input;

    if (SKIP_TOOLS.has(tool_name)) {
      console.log('{"continue": true, "suppressOutput": true}');
      process.exit(0);
    }

    const db = new HooksDatabase();
    const session = db.findActiveSDKSession(session_id);
    db.close();

    if (!session) {
      console.log('{"continue": true, "suppressOutput": true}');
      process.exit(0);
    }

    const socketPath = getWorkerSocketPath(session.id);
    const message = {
      type: 'observation',
      tool_name,
      tool_input: JSON.stringify(tool_input),
      tool_output: JSON.stringify(tool_output)
    };

    // Try to send with retries
    sendWithRetry(socketPath, message, 5).then(() => {
      console.log('{"continue": true, "suppressOutput": true}');
      process.exit(0);
    }).catch((err) => {
      console.error(`[claude-mem save] Failed after retries: ${err.message}`);
      console.log('{"continue": true, "suppressOutput": true}');
      process.exit(0);
    });
  } catch (error: any) {
    console.error(`[claude-mem save error: ${error.message}]`);
    console.log('{"continue": true, "suppressOutput": true}');
    process.exit(0);
  }
}

/**
 * Send message to socket with exponential backoff retry
 */
async function sendWithRetry(
  socketPath: string,
  message: any,
  maxRetries: number
): Promise<void> {
  let retries = maxRetries;
  let delay = 100; // Start with 100ms

  while (retries > 0) {
    try {
      await sendMessage(socketPath, message);
      return; // Success
    } catch (err: any) {
      retries--;
      if (retries === 0) {
        throw err; // Out of retries
      }
      // Exponential backoff
      await sleep(delay);
      delay = Math.min(delay * 2, 2000); // Cap at 2s
    }
  }
}

/**
 * Send single message to socket
 */
function sendMessage(socketPath: string, message: any): Promise<void> {
  return new Promise((resolve, reject) => {
    const client = net.connect(socketPath, () => {
      // Resolve only once the payload has been flushed, because the caller
      // exits the process immediately after this promise settles
      client.end(JSON.stringify(message) + '\n', () => resolve());
    });

    client.on('error', (err) => {
      reject(err);
    });
  });
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
```
**Testing:**
1. Add artificial delay in worker startup
2. Fire PostToolUse immediately after UserPromptSubmit
3. Verify save-hook retries and succeeds
4. Verify observation is captured
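
When choosing the retry count, it helps to know how long the hook can block in the worst case. The sketch below reproduces the delay schedule of `sendWithRetry` as a pure function; `backoffDelays` is a hypothetical helper, and the figures exclude the time spent in each connection attempt.

```typescript
// Delays slept between attempts: start at initialMs, double each time, cap at capMs.
// With N attempts there are N - 1 sleeps, because the final failure throws immediately.
function backoffDelays(maxAttempts: number, initialMs: number = 100, capMs: number = 2000): number[] {
  const delays: number[] = [];
  let delay = initialMs;
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    delays.push(delay);
    delay = Math.min(delay * 2, capMs);
  }
  return delays;
}

const delays = backoffDelays(5); // [100, 200, 400, 800]
const worstCaseMs = delays.reduce((sum, ms) => sum + ms, 0); // 1500
```

With five attempts the hook waits at most about 1.5 seconds in total before giving up, so the worker needs to have its socket listening within that window for the first observation to be captured.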
---
### Medium Priority (Should Fix)
#### 5. Orphaned Active Sessions in Database
**Severity:** Low
**Impact:** DB bloat, confusion about session status
**Problem:**
Sessions marked "active" never transition to "completed" or "failed" if worker crashes or is killed.
**Fix Required:**
Create cleanup script: `src/commands/cleanup-sessions.ts`
```typescript
import { HooksDatabase } from '../services/sqlite/HooksDatabase.js';

/**
 * Mark old active sessions as failed
 */
export function cleanupSessions(maxAgeHours: number = 24): void {
  const db = new HooksDatabase();
  const maxAgeMs = maxAgeHours * 60 * 60 * 1000;
  const cutoffEpoch = Date.now() - maxAgeMs;

  const query = (db as any).db.query(`
    UPDATE sdk_sessions
    SET status = 'failed', completed_at = datetime('now'), completed_at_epoch = ?
    WHERE status = 'active' AND started_at_epoch < ?
  `);
  const result = query.run(Date.now(), cutoffEpoch);

  console.log(`Marked ${result.changes} old active sessions as failed`);
  db.close();
}
```
Add to CLI: `src/bin/cli.ts`
```typescript
.command('cleanup-sessions')
  .description('Mark old active sessions as failed')
  .option('--max-age <hours>', 'Maximum age in hours', '24')
  .action((options) => {
    cleanupSessions(parseInt(options.maxAge, 10));
  })
```
**Alternative:** Add auto-expiry check in `context-hook`:
```typescript
// Before loading summaries, clean up stale sessions
const maxAgeMs = 24 * 60 * 60 * 1000;
const cutoffEpoch = Date.now() - maxAgeMs;
db.db.query(`
  UPDATE sdk_sessions
  SET status = 'failed'
  WHERE status = 'active' AND started_at_epoch < ?
`).run(cutoffEpoch);
```
---
#### 6. SessionStart Only Runs on "startup"
**Severity:** Low
**Impact:** No context loaded on /resume
**Problem:**
`context-hook` only loads context on "startup" source, skips "resume", "clear", and "compact".
**Location:** `src/hooks/context.ts:24`
```typescript
// Only run on startup (not on resume)
if (input.source && input.source !== 'startup') {
  console.log('');
  process.exit(0);
}
```
**Fix Required:**
```typescript
// Load context on startup and resume
if (input.source && input.source !== 'startup' && input.source !== 'resume') {
  console.log(''); // Skip for clear/compact
  process.exit(0);
}
```
**Rationale:**
- **startup:** Load context (project overview)
- **resume:** Load context (user continuing work)
- **clear:** Skip (user wants fresh start)
- **compact:** Skip (just memory optimization, context preserved)
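
The rationale above amounts to a small decision table. A sketch of the same rule as a pure function, which is easier to unit test than an inline condition, is shown here; `shouldLoadContext` is a hypothetical name, and a missing `source` is treated like startup to match the existing `input.source &&` guard.

```typescript
// Sources for which the context hook should emit summaries.
const CONTEXT_SOURCES: ReadonlySet<string> = new Set(['startup', 'resume']);

// True when context should be loaded for the given SessionStart source.
function shouldLoadContext(source?: string): boolean {
  // No source at all falls through to loading, as in the current guard
  return !source || CONTEXT_SOURCES.has(source);
}
```

The hook body would then read `if (!shouldLoadContext(input.source)) { console.log(''); process.exit(0); }`, and adding another source later is a one-line change to the set.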
---
### Low Priority (Nice to Have)
#### 7. No Cost Control or Observation Limits
**Severity:** Low
**Impact:** Long sessions can be expensive
**Problem:**
No limits on SDK agent API calls. A session with thousands of tools could rack up significant costs.
**Fix Ideas:**
1. Add observation counter, warn after N observations
2. Add cost estimation based on token usage
3. Add budget limit in config
4. Batch observations (send N at once instead of one-by-one)
**Example:**
```typescript
class SDKWorker {
  private observationCount = 0;
  private maxObservations = 1000;

  private handleMessage(message: WorkerMessage): void {
    if (message.type === 'observation') {
      this.observationCount++;
      if (this.observationCount > this.maxObservations) {
        console.error(`[SDK Worker] Exceeded max observations: ${this.maxObservations}`);
        this.isFinalized = true;
        return;
      }
    }
    this.pendingMessages.push(message);
  }
}
```
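
Fix idea 4 (batching) could start from a plain chunking helper applied to the pending queue before yielding to the SDK agent. This is a minimal sketch; `batch` is a hypothetical helper, and how a batch would be rendered into a single prompt is left open.

```typescript
// Split items into consecutive groups of at most `size` elements.
function batch<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const groups: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}
```

Draining `pendingMessages` through `batch(pending, N)` would cut the number of SDK calls by roughly a factor of N, at the cost of observations being analyzed slightly later.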
---
#### 8. No Health Check Mechanism
**Severity:** Low
**Impact:** Can't tell if worker is alive/healthy
**Fix Ideas:**
1. Add `/status` command that checks for active workers
2. Add health check endpoint on socket (ping/pong)
3. Add metrics to DB (last_activity_at)
---
#### 9. No Observation Deduplication
**Severity:** Low
**Impact:** Duplicate observations if same tool executed multiple times
**Fix Ideas:**
1. Hash tool_name + tool_input + tool_output
2. Check for duplicate hash before storing
3. Or let SDK agent handle deduplication naturally
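
Fix ideas 1 and 2 can be sketched with Node's built-in `crypto` module. This assumes an in-memory set per worker is enough (duplicates across sessions are not detected); `observationKey` and `ObservationDeduper` are hypothetical names.

```typescript
import { createHash } from 'crypto';

// Stable key for an observation. Encoding the fields as a JSON array keeps
// the boundaries unambiguous, so ("ab", "c") and ("a", "bc") hash differently.
function observationKey(toolName: string, toolInput: string, toolOutput: string): string {
  return createHash('sha256')
    .update(JSON.stringify([toolName, toolInput, toolOutput]))
    .digest('hex');
}

// Remembers which observations were already seen during this worker's lifetime.
class ObservationDeduper {
  private readonly seen = new Set<string>();

  // Returns true the first time an observation is seen, false for repeats.
  isNew(toolName: string, toolInput: string, toolOutput: string): boolean {
    const key = observationKey(toolName, toolInput, toolOutput);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```

Whether repeats should really be dropped is a judgment call: reading the same file twice with identical output is noise, but a repeated test run with the same result may still be meaningful to the summary.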
---
## Implementation Checklist
### Phase 0: Verify Happy Path (DO THIS FIRST - HIGHEST PRIORITY)
**Goal:** Prove the basic cycle works end-to-end before fixing edge cases.
- [ ] **Test 0.1: Verify Stop Hook Fires**
  - [ ] Add logging to `src/hooks/summary.ts`
  - [ ] Exit session normally and verify hook runs
  - [ ] Verify FINALIZE message is sent to socket
- [ ] **Test 0.2: Verify Worker Generates Summary**
  - [ ] Add logging to worker message handler
  - [ ] Verify FINALIZE message received
  - [ ] Verify SDK agent response
  - [ ] Verify summary parsed and stored in DB
  - [ ] Query DB to confirm summary exists
- [ ] **Test 0.3: Verify Context Hook Loads Summaries**
  - [ ] Add logging to `src/hooks/context.ts`
  - [ ] Start new session, verify summaries loaded
  - [ ] Verify markdown output to stdout
  - [ ] Verify Claude has context from previous session
- [ ] **Test 0.4: End-to-End Integration Test**
  - [ ] Run session 1 with test work
  - [ ] Verify summary in DB
  - [ ] Run session 2
  - [ ] Ask Claude about previous session
  - [ ] Confirm Claude has correct context
**STOP HERE:** Only proceed to Phase 1 after confirming all Phase 0 tests pass.
---
### Phase 1: Critical Resilience Fixes (Do After Phase 0)
- [ ] Add watchdog timer to worker (Issue #1)
  - [ ] Add lastActivityTime tracking
  - [ ] Add timeout check in message generator loop
  - [ ] Test with zombie worker scenario
- [ ] Configure existing SessionEnd hook (Issue #2)
  - [ ] Add SessionEnd configuration to hooks/hooks.json
  - [ ] Create src/hooks/cleanup.ts (implements cleanup logic)
  - [ ] Create src/bin/hooks/cleanup-hook.ts (entry point)
  - [ ] Update build process to compile cleanup-hook
  - [ ] Test with Ctrl-C exit and verify worker cleanup
- [ ] Fix stale socket detection (Issue #3)
  - [ ] Add testSocketStale method
  - [ ] Update startSocketServer to check for stale sockets
  - [ ] Test with crashed worker scenario
- [ ] Fix save-hook race condition (Issue #4)
  - [ ] Add sendWithRetry function
  - [ ] Add exponential backoff logic
  - [ ] Update save-hook to use retry logic
  - [ ] Test with immediate PostToolUse
### Phase 2: Medium Priority
- [ ] Add session cleanup script (Issue #5)
  - [ ] Create cleanup-sessions command
  - [ ] Add to CLI
  - [ ] Optional: Add auto-cleanup to context-hook
- [ ] Fix SessionStart source handling (Issue #6)
  - [ ] Update context-hook to load on "resume"
  - [ ] Test with /resume command
### Phase 3: Low Priority (Optional)
- [ ] Add cost control (Issue #7)
- [ ] Add health checks (Issue #8)
- [ ] Add observation deduplication (Issue #9)
## Testing Strategy
### Unit Tests
Create tests for each fix:
- `test/hooks/cleanup.test.ts` - SessionEnd hook
- `test/sdk/worker-timeout.test.ts` - Watchdog timer
- `test/hooks/save-retry.test.ts` - Retry logic
### Integration Tests
Test complete flows:
1. **Normal flow:** SessionStart → UserPromptSubmit → PostToolUse → Stop
2. **Crash recovery:** Worker crash → SessionEnd cleanup
3. **Zombie worker:** No Stop hook → Worker timeout
4. **Socket race:** Immediate PostToolUse → Retry success
### Manual Testing Scenarios
1. **Zombie Worker Test:**
```bash
# Start session
claude
# Kill Claude with Ctrl-C
# Check for worker process
ps aux | grep claude-mem-worker
# Wait 2 hours, verify worker exits
```
2. **SessionEnd Test:**
```bash
# Start session
claude
# Exit normally or Ctrl-C
# Verify worker killed
# Verify socket removed
# Check DB for session status
sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM sdk_sessions"
```
3. **Stale Socket Test:**
```bash
# Start session
claude
# Kill worker with kill -9 <pid>
# Verify socket exists
ls /tmp/claude-mem-worker-*.sock
# Start new session
# Verify old socket removed, new session starts
```
4. **Race Condition Test:**
```bash
# Add delay to worker startup (for testing)
# Start session, immediately run command
claude "list all files"
# Verify first observation captured
```
## File Modifications Required
### New Files
- `src/hooks/cleanup.ts` - SessionEnd hook logic
- `src/bin/hooks/cleanup-hook.ts` - SessionEnd entry point
- `src/commands/cleanup-sessions.ts` - Session cleanup script
- `test/hooks/cleanup.test.ts` - Tests for SessionEnd hook
- `test/sdk/worker-timeout.test.ts` - Tests for watchdog timer
- `test/hooks/save-retry.test.ts` - Tests for retry logic
### Modified Files
- `hooks/hooks.json` - Add SessionEnd configuration
- `src/sdk/worker.ts` - Add watchdog timer, stale socket detection
- `src/hooks/save.ts` - Add retry logic
- `src/hooks/context.ts` - Load context on resume
- `src/bin/cli.ts` - Add cleanup-sessions command
## Dependencies
No new dependencies required. All fixes use existing:
- `net` (Unix sockets)
- `fs` (file operations)
- `child_process` (process management)
- `bun:sqlite` (database)
## Success Criteria
### Phase 0 (Must Pass First)
1. ✅ Stop hook fires on normal exit
2. ✅ Worker receives FINALIZE and generates summary
3. ✅ Summary is stored in DB correctly
4. ✅ Context hook loads summaries on next session
5. ✅ New session immediately sees previous session's summary in context
6. ✅ End-to-end integration test passes
### Phase 1 (After Phase 0 Passes)
1. ✅ Worker processes never become zombies (exit after 2h max)
2. ✅ SessionEnd hook cleans up worker and socket on exit
3. ✅ Stale sockets don't block new sessions
4. ✅ First observation always captured (no race condition)
5. ✅ No orphaned "active" sessions in DB after 24h
6. ✅ Context loads on /resume
7. ✅ All tests pass
## References
- Claude Code Hooks Documentation: https://docs.claude.com/en/docs/claude-code/hooks
- Claude Agent SDK Streaming: https://docs.claude.com/en/api/agent-sdk/streaming-vs-single-mode
- Unix Domain Sockets: Node.js `net` module
- SQLite Best Practices: Bun SQLite documentation
## Notes
- All hooks must return `{"continue": true, "suppressOutput": true}` on error
- Hooks have 60s default timeout (configurable)
- Worker is detached process, doesn't block Claude Code
- SessionEnd hooks "cannot block session termination" per Claude Code docs
- Streaming input mode is the recommended SDK approach for this architecture