Session Logic Fixes - Claude-Mem
Status: Planning
Created: 2025-10-16
Priority: High
Estimated Effort: 2-3 days
Executive Summary
The claude-mem session logic architecture is fundamentally sound, using Claude Agent SDK in streaming input mode with Unix socket IPC for real-time observation processing. However, we need to verify the basic happy path works end-to-end before addressing edge cases.
Critical Goal: Session ends → summary generated → next session immediately sees summary in context
Overall Assessment: Architecture is correct, but the happy path needs systematic verification before any resilience improvements.
Current Status: Unknown whether the basic cycle works. The core flow must be tested and debugged first.
Feedback Applied (2025-10-16)
Round 1: Technical Corrections
- ✅ Confirmed architectural approach is sound
- ❌ Corrected: SessionEnd hooks already exist in Claude Code - we're configuring, not implementing
- ✅ Technical fixes for resilience issues are sound
Round 2: Priority Reordering (MAJOR CHANGE)
Critical realization: The document focused on edge cases (zombies, crashes) when the basic happy path might not even work yet.
Complete restructure:
- Phase 0 (NEW - TOP PRIORITY): Verify the basic cycle works
  - Does Stop hook fire on normal exit?
  - Does worker generate and store summary?
  - Does context hook load summaries on next session?
  - End-to-end integration test
- Phase 1 (SECOND PRIORITY): Fix resilience issues
  - Zombie workers, race conditions, stale sockets
  - All the original issues moved here
Key principle: Everything else is irrelevant if "session ends → next session sees summary" doesn't work.
Revised Focus: Get the happy path working first, then worry about edge cases.
Architecture Overview
Current Flow
SessionStart (startup)
  → context-hook.ts:15
  → Loads recent summaries from DB
  → Outputs markdown to stdout (becomes context)

UserPromptSubmit
  → new-hook.ts:16
  → Creates SDK session in DB (status='active')
  → Spawns detached worker process
  → Worker starts immediately, hooks return

Worker Process (worker.ts:75)
  → Starts Unix socket server at /tmp/claude-mem-worker-{id}.sock
  → Runs SDK agent with streaming input (async generator)
  → Yields init prompt to SDK agent
  → Waits for messages from hooks

PostToolUse (fired for each tool)
  → save-hook.ts:24
  → Sends observation to worker via Unix socket
  → Worker receives → yields to SDK agent
  → SDK agent analyzes → returns <observation> XML
  → Worker parses XML → stores in observations table

Stop (session ends)
  → summary-hook.ts:15
  → Sends FINALIZE message to worker via socket
  → Worker yields finalize prompt to SDK agent
  → SDK agent generates <summary> XML
  → Worker parses → stores in session_summaries table
  → Worker marks session completed, closes socket, exits
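The hooks write each message to the socket as one JSON object followed by a newline, so the worker has to buffer incoming chunks and split on newlines. A minimal sketch of that framing follows. The `drainFrames` helper and the exact `WorkerMessage` shape are assumptions for illustration, not the actual code in src/sdk/worker.ts.

```typescript
// Message shapes are assumed from the flow above: save-hook sends
// 'observation', summary-hook sends 'finalize'.
type WorkerMessage =
  | { type: 'observation'; tool_name: string; tool_input: string; tool_output: string }
  | { type: 'finalize' };

interface FrameResult {
  messages: WorkerMessage[];
  rest: string; // incomplete trailing data, kept for the next chunk
}

/** Split buffered socket data into complete newline-terminated JSON messages. */
function drainFrames(buffer: string): FrameResult {
  const lines = buffer.split('\n');
  const rest = lines.pop() ?? ''; // last element is the partial line (or '')
  const messages: WorkerMessage[] = [];
  for (const line of lines) {
    if (line.trim() === '') continue;
    try {
      messages.push(JSON.parse(line) as WorkerMessage);
    } catch {
      // Skip malformed lines rather than crashing the worker
    }
  }
  return { messages, rest };
}
```

In a socket 'data' handler this would be used as `buffer += chunk`, then `const { messages, rest } = drainFrames(buffer); buffer = rest;`.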
Key Components
Hook Files:
- src/hooks/context.ts - SessionStart hook logic
- src/hooks/new.ts - UserPromptSubmit hook logic
- src/hooks/save.ts - PostToolUse hook logic
- src/hooks/summary.ts - Stop hook logic
- src/bin/hooks/*.ts - Entry point wrappers for each hook
Worker:
- src/sdk/worker.ts - Main worker process with SDK integration
- src/sdk/prompts.ts - Prompt generation for SDK agent
- src/sdk/parser.ts - XML parser for SDK responses
Database:
- src/services/sqlite/HooksDatabase.ts - Lightweight DB interface for hooks
- src/services/sqlite/migrations.ts - Schema definitions
Configuration:
- hooks/hooks.json - Hook configuration for Claude Code plugin
Technologies
- IPC: Unix domain sockets (/tmp/claude-mem-worker-{id}.sock)
- SDK Mode: Streaming input (async generator pattern)
- Output Format: XML blocks (<observation> and <summary>)
- Process Model: Detached worker (spawn with detached: true, stdio: 'ignore')
- Database: SQLite with Bun
Identified Issues
Phase 0: Verify Happy Path Works (DO THIS FIRST)
Priority: CRITICAL - Everything else is irrelevant if the basic cycle doesn't work
Goal: Prove that when a session ends normally, the next session immediately sees the summary in its context.
Test 0.1: Does Stop Hook Fire on Normal Exit?
What to test:
# Start Claude Code session
claude
# Do some work (read files, etc)
# Exit normally
exit
# Check logs - did Stop hook run?
Expected behavior:
- Stop hook (summary-hook) should fire
- Should send FINALIZE message to worker socket
- Worker should receive it and generate summary
How to verify:
- Add logging to src/hooks/summary.ts at the top of summaryHook()
- Add logging when sending socket message
- Exit session normally and check logs
If it doesn't work: Debug why Stop hook isn't firing or why socket message fails
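A small logging helper keeps the verification logs consistent across hooks. This is a sketch only: `formatHookLog` and `logHook` are hypothetical names, and the prefix simply follows the `[claude-mem save]` style already used in save.ts.

```typescript
/** Build one log line; kept pure so the format is easy to test. */
function formatHookLog(hook: string, event: string, at: Date): string {
  return `[claude-mem ${hook}] ${at.toISOString()} ${event}`;
}

/** Write to stderr so the hook's stdout (which Claude Code consumes) stays clean. */
function logHook(hook: string, event: string): void {
  console.error(formatHookLog(hook, event, new Date()));
}
```

Calling `logHook('summary', 'summaryHook() entered')` at the top of summaryHook() and `logHook('summary', 'FINALIZE sent')` after the socket write gives two timestamps that show whether the hook fired and how far it got.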
Test 0.2: Does Worker Receive FINALIZE and Generate Summary?
What to test: After Stop hook fires, does the worker:
- Receive the FINALIZE message
- Yield finalize prompt to SDK agent
- Get back a summary from SDK
- Parse the XML
- Store it in session_summaries table
How to verify:
- Add console.error logging in src/sdk/worker.ts:239 in the message handler
- Log when FINALIZE is received
- Log the SDK agent response
- Log when summary is parsed
- Query DB after session ends:
  sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM session_summaries ORDER BY created_at DESC LIMIT 1"
If it doesn't work:
- Check if worker is even running (ps aux | grep worker)
- Check if socket message arrived
- Check if SDK agent returned valid XML
- Check if parser worked
- Check if DB insert succeeded
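When checking whether the SDK agent returned valid XML, it helps to log the raw block before the real parser touches it. The sketch below is a debugging aid under that assumption. `extractBlock` is a hypothetical helper, not the logic in src/sdk/parser.ts, and it makes no assumption about the fields inside the block.

```typescript
/** Return the inner text of the first <tag>...</tag> block, or null if absent. */
function extractBlock(response: string, tag: 'summary' | 'observation'): string | null {
  // [\s\S] matches across newlines; the lazy quantifier stops at the first closing tag
  const pattern = new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`);
  const match = pattern.exec(response);
  return match ? match[1].trim() : null;
}
```

A null result for `extractBlock(sdkResponse, 'summary')` points at the prompt or the SDK response. A non-null result with nothing stored in the DB points at the parser or the insert.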
Test 0.3: Does Context Hook Load Summaries?
What to test: When starting a new session, does context hook:
- Query recent summaries from DB
- Format them as markdown
- Output to stdout (becomes context)
How to verify:
- Add logging to src/hooks/context.ts:24
- Log the summaries retrieved from DB
- Log the markdown output
- Start new session and check:
  - Console output (should see markdown)
  - Claude's context (ask "what did we do last session?")
If it doesn't work:
- Check if SessionStart hook is firing
- Check if DB query returns results
- Check if markdown is being formatted correctly
- Check if output is going to stdout properly
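To separate "DB query returned nothing" from "formatting is broken", the markdown step can be isolated as a pure function. This is a sketch: the row shape uses only the `summary_text` and `created_at` columns queried elsewhere in this plan, and the heading text is an assumption, not the actual output of context.ts.

```typescript
// Assumed row shape; the real session_summaries schema may have more columns.
interface SummaryRow {
  summary_text: string;
  created_at: string;
}

/** Render summaries as markdown; returns '' when there is nothing to show. */
function renderContext(rows: SummaryRow[]): string {
  if (rows.length === 0) return '';
  const sections = rows.map(
    (row) => `## Session ${row.created_at}\n\n${row.summary_text.trim()}`
  );
  return ['# Recent session summaries', ...sections].join('\n\n');
}
```

Logging both `rows.length` and the rendered string to stderr before writing to stdout shows which of the two steps failed.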
Test 0.4: End-to-End Integration Test
What to test: Full cycle from start to finish:
# Session 1
claude
# Do some work
echo "test file" > test.txt
cat test.txt
exit
# Verify summary was stored
sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT summary_text FROM session_summaries ORDER BY created_at DESC LIMIT 1"
# Session 2
claude
# Ask Claude: "What did we do last session?"
# Expected: Claude should know we created and read test.txt
Success criteria:
- ✅ Summary appears in DB after session 1
- ✅ Session 2 context includes summary from session 1
- ✅ Claude can answer questions about previous session
If it doesn't work:
- Review logs from Tests 0.1-0.3
- Add more granular logging
- Check each step of the pipeline
Common Failure Points & Debugging
If summaries aren't showing up in new sessions:
- Stop hook not configured/firing:
  # Check hooks config
  cat ~/.claude/plugins/claude-mem/hooks.json | jq '.hooks.Stop'
  # Should see summary-hook configured
  # If not, hooks.json is wrong or plugin not installed
- Worker not running:
  ps aux | grep claude-mem-worker
  # If no worker, UserPromptSubmit hook failed to spawn it
  # Check new-hook logs
- Socket communication failing:
  # Check socket exists
  ls /tmp/claude-mem-worker-*.sock
  # Try to connect manually
  echo '{"type":"finalize"}' | nc -U /tmp/claude-mem-worker-*.sock
- SDK agent not returning summary:
  - Check API key is set
  - Check SDK agent prompt is valid
  - Check XML parser is working
  - Add logging to see SDK response
- DB write failing:
  # Check DB exists and is writable
  sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM sdk_sessions WHERE status='active'"
  # If no active session, new-hook didn't create it
- Context hook not loading:
  # Check SessionStart hook configured
  cat ~/.claude/plugins/claude-mem/hooks.json | jq '.hooks.SessionStart'
  # Start session and check for context output
  # Should see markdown in initial context
Debugging Checklist:
- Verify all hooks are configured in hooks.json
- Verify plugin is installed correctly
- Add console.error logging to all hooks (goes to stderr, visible in terminal)
- Check each step of the pipeline systematically
- Don't assume anything works - verify each piece
Phase 1: Critical Resilience Issues (Fix After Happy Path Works)
1. Zombie Worker Processes
Severity: High
Impact: Memory/CPU waste, orphaned processes accumulate
Problem: If Stop hook never fires (user Ctrl-C, Claude Code crash), worker runs forever waiting for FINALIZE message.
Location: src/sdk/worker.ts:239
// Current code - infinite loop with no timeout
while (!this.isFinalized) {
if (this.pendingMessages.length === 0) {
await this.sleep(100);
continue;
}
// ... process messages
}
Fix Required:
// Add watchdog timer
class SDKWorker {
private maxIdleTime = 2 * 60 * 60 * 1000; // 2 hours
private lastActivityTime = Date.now();
private updateActivity(): void {
this.lastActivityTime = Date.now();
}
private async* createMessageGenerator(): AsyncIterable<...> {
// Yield initial prompt
const initPrompt = buildInitPrompt(...);
yield { type: 'user', message: { role: 'user', content: initPrompt } };
this.updateActivity();
while (!this.isFinalized) {
// Check for timeout
const idleTime = Date.now() - this.lastActivityTime;
if (idleTime > this.maxIdleTime) {
console.error(`[SDK Worker] Timeout - no activity for ${this.maxIdleTime / 1000}s`);
this.isFinalized = true;
break;
}
if (this.pendingMessages.length === 0) {
await this.sleep(100);
continue;
}
// Process messages and update activity
this.updateActivity();
// ... existing message processing
}
}
}
Testing:
- Start claude-mem session
- Kill Claude Code process (kill -9)
- Verify worker exits after 2 hours
- Check no orphaned processes remain
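Waiting two hours is impractical for an automated test. One option is to move the idle check behind an injectable clock so a unit test can advance time instantly. A minimal sketch, assuming a hypothetical `IdleWatchdog` class that the worker would own; it mirrors the `idleTime > maxIdleTime` comparison above.

```typescript
/** Tracks idle time against an injectable clock (defaults to Date.now). */
class IdleWatchdog {
  private readonly maxIdleMs: number;
  private readonly now: () => number;
  private lastActivity: number;

  constructor(maxIdleMs: number, now: () => number = Date.now) {
    this.maxIdleMs = maxIdleMs;
    this.now = now;
    this.lastActivity = now();
  }

  /** Record activity; call whenever a message is processed. */
  touch(): void {
    this.lastActivity = this.now();
  }

  /** True once more than maxIdleMs has passed since the last activity. */
  isExpired(): boolean {
    return this.now() - this.lastActivity > this.maxIdleMs;
  }
}
```

In production the worker would construct `new IdleWatchdog(2 * 60 * 60 * 1000)`. A test passes a fake clock and moves it forward past the limit.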
2. SessionEnd Hook Not Configured
Severity: High
Impact: No cleanup on abrupt exit, sessions stuck in "active" status
Problem:
SessionEnd hooks are a built-in Claude Code feature that "run when a session ends" and "cannot block session termination but can perform cleanup tasks" (docs). However, claude-mem's hooks/hooks.json does NOT configure this hook. Worker doesn't get cleaned up when Claude Code exits abruptly.
Note: This is NOT a missing feature in Claude Code - SessionEnd hooks already exist. We just need to configure them.
Current Configuration: hooks/hooks.json:1-51
{
"hooks": {
"SessionStart": [...],
"UserPromptSubmit": [...],
"PostToolUse": [...],
"Stop": [...]
// SessionEnd is MISSING
}
}
Fix Required:
SessionEnd hooks receive structured input including:
{
"session_id": "abc123",
"transcript_path": "~/.claude/projects/.../transcript.jsonl",
"cwd": "/Users/...",
"hook_event_name": "SessionEnd",
"reason": "exit" // or "clear", "logout", "prompt_input_exit", etc.
}
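Note that the sample payload uses "exit" as the reason, which is not one of the values in the SessionEndInput union used by cleanup.ts. A tolerant approach is to validate only the fields the hook needs and map unrecognized reasons to 'other'. The sketch below assumes that approach; `isSessionEndPayload` and `normalizeReason` are hypothetical helpers.

```typescript
// Reason values taken from the SessionEndInput type in this plan.
const KNOWN_REASONS = ['clear', 'logout', 'prompt_input_exit', 'other'] as const;
type KnownReason = (typeof KNOWN_REASONS)[number];

interface SessionEndPayload {
  session_id: string;
  reason: string;
}

/** Check only the fields the cleanup hook actually reads. */
function isSessionEndPayload(value: unknown): value is SessionEndPayload {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.session_id === 'string' && typeof v.reason === 'string';
}

/** Map any unrecognized reason string to 'other' instead of rejecting it. */
function normalizeReason(reason: string): KnownReason {
  const known = KNOWN_REASONS.find((r) => r === reason);
  return known ?? 'other';
}
```

This keeps the hook from silently skipping cleanup when Claude Code sends a reason string the type did not anticipate.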
Implementation Steps:
- Add SessionEnd configuration to hooks/hooks.json:
For events like SessionEnd that don't use matchers, we can omit the matcher field:
{
"hooks": {
"SessionEnd": [
{
"hooks": [
{
"type": "command",
"command": "bun ${CLAUDE_PLUGIN_ROOT}/scripts/hooks/cleanup-hook.js",
"timeout": 60
}
]
}
]
}
}
- Create src/hooks/cleanup.ts:
import { HooksDatabase } from '../services/sqlite/HooksDatabase.js';
import { getWorkerSocketPath } from '../shared/paths.js';
import { existsSync, unlinkSync } from 'fs';
import { execSync } from 'child_process';
export interface SessionEndInput {
session_id: string;
cwd: string;
reason: 'clear' | 'logout' | 'prompt_input_exit' | 'other';
[key: string]: any;
}
/**
* Cleanup Hook - SessionEnd
* Cleans up worker process and marks session as terminated
*/
export function cleanupHook(input?: SessionEndInput): void {
try {
if (!input) {
console.log('No input provided - this script is designed to run as a Claude Code SessionEnd hook');
process.exit(0);
}
const { session_id, reason } = input;
// Find active SDK session
const db = new HooksDatabase();
const session = db.findActiveSDKSession(session_id);
if (!session) {
db.close();
console.log('{"suppressOutput": true}');
process.exit(0);
}
// Resolve the socket path, then stop the worker BEFORE removing the socket file:
// lsof can only find the owning process while the path still exists
const socketPath = getWorkerSocketPath(session.id);
try {
// -t prints only the PIDs of processes holding the socket open
const lsofOutput = execSync(`lsof -t ${socketPath} 2>/dev/null || true`, { encoding: 'utf8' });
const pid = parseInt(lsofOutput.trim().split('\n')[0], 10);
if (!Number.isNaN(pid)) {
console.error(`[claude-mem cleanup] Killing worker process ${pid}`);
process.kill(pid, 'SIGTERM');
}
} catch (err) {
// Worker already dead or couldn't find it - that's fine
}
// Clean up the socket file
if (existsSync(socketPath)) {
try {
unlinkSync(socketPath);
} catch (err: any) {
console.error(`[claude-mem cleanup] Failed to remove socket: ${err.message}`);
}
}
// Mark session as failed (not completed since it was terminated)
db.markSessionFailed(session.id);
db.close();
console.log('{"suppressOutput": true}');
process.exit(0);
} catch (error: any) {
console.error(`[claude-mem cleanup error: ${error.message}]`);
console.log('{"suppressOutput": true}');
process.exit(0);
}
}
- Create src/bin/hooks/cleanup-hook.ts:
#!/usr/bin/env bun
/**
* Cleanup Hook Entry Point - SessionEnd
* Standalone executable for plugin hooks
*/
import { cleanupHook } from '../../hooks/cleanup.js';
// Read input from stdin
const input = await Bun.stdin.text();
try {
const parsed = input.trim() ? JSON.parse(input) : undefined;
cleanupHook(parsed);
} catch (error: any) {
console.error(`[claude-mem cleanup-hook error: ${error.message}]`);
console.log('{"suppressOutput": true}');
process.exit(0);
}
- Update build process to compile cleanup-hook.ts to scripts/hooks/cleanup-hook.js
Testing:
- Start claude-mem session
- Exit Claude Code with Ctrl-C
- Verify worker process is killed
- Verify socket file is removed
- Verify session marked as "failed" in DB
3. Stale Socket Files Block New Sessions
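Severity: Medium
Impact: Worker fails to start if previous worker crashed
Problem:
If a worker crashes, its socket file persists at /tmp/claude-mem-worker-{id}.sock, and the next worker with the same session ID can fail with EADDRINUSE if the file is not removed first. The current code unlinks any existing socket file without checking whether a live worker still owns it, and it registers no 'error' handler on the server, so a listen failure never rejects the startup promise.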
Location: src/sdk/worker.ts:111-163
private async startSocketServer(): Promise<void> {
// Current code only removes if exists
if (existsSync(this.socketPath)) {
unlinkSync(this.socketPath);
}
return new Promise((resolve, reject) => {
this.server = net.createServer((socket) => { ... });
this.server.listen(this.socketPath, () => { resolve(); });
});
}
Fix Required:
private async startSocketServer(): Promise<void> {
// Clean up stale socket if it exists
if (existsSync(this.socketPath)) {
// Test if socket is responsive
const isStale = await this.testSocketStale(this.socketPath);
if (isStale) {
console.error(`[SDK Worker] Removing stale socket: ${this.socketPath}`);
unlinkSync(this.socketPath);
} else {
// Socket is active - another worker is using this session ID
throw new Error(`Socket already in use: ${this.socketPath}`);
}
}
return new Promise((resolve, reject) => {
this.server = net.createServer((socket) => {
let buffer = '';
socket.on('data', (chunk) => {
// ... existing code
});
});
this.server.on('error', (err: any) => {
if (err.code === 'EADDRINUSE') {
console.error(`[SDK Worker] Socket already in use: ${this.socketPath}`);
}
reject(err);
});
this.server.listen(this.socketPath, () => {
resolve();
});
});
}
/**
* Test if socket file is stale (no process listening)
*/
private async testSocketStale(socketPath: string): Promise<boolean> {
return new Promise((resolve) => {
const testClient = net.connect(socketPath);
testClient.on('connect', () => {
// Socket is responsive - not stale
testClient.end();
resolve(false);
});
testClient.on('error', () => {
// Socket exists but not responsive - stale
resolve(true);
});
// Timeout after 100ms
setTimeout(() => {
testClient.destroy();
resolve(true);
}, 100);
});
}
Testing:
- Start worker, kill it with kill -9
- Verify socket file persists
- Start new worker with same session ID
- Verify old socket is detected as stale and removed
- Verify new worker starts successfully
4. Race Condition on First Observation
Severity: Medium
Impact: First observation might be lost if socket not ready
Problem: Worker startup is async (socket creation, SDK initialization). PostToolUse can fire immediately after UserPromptSubmit returns, before socket is ready.
Current Flow:
- UserPromptSubmit → creates session → spawns worker → returns immediately
- PostToolUse fires (Claude reads a file)
- save-hook tries to connect → ENOENT (socket not ready yet)
- Connection fails → logs error, continues
- First observation lost
Location: src/hooks/save.ts:71
const client = net.connect(socketPath, () => {
client.write(JSON.stringify(message) + '\n');
client.end();
});
client.on('error', (err) => {
// Currently just logs and continues - observation lost
console.error(`[claude-mem save] Socket error: ${err.message}`);
});
Fix Required:
/**
* Save Hook - PostToolUse
* Sends tool observations to worker via Unix socket with retry logic
*/
export function saveHook(input?: PostToolUseInput): void {
try {
if (!input) {
console.log('No input provided - this script is designed to run as a Claude Code PostToolUse hook');
process.exit(0);
}
const { session_id, tool_name, tool_input, tool_output } = input;
if (SKIP_TOOLS.has(tool_name)) {
console.log('{"continue": true, "suppressOutput": true}');
process.exit(0);
}
const db = new HooksDatabase();
const session = db.findActiveSDKSession(session_id);
db.close();
if (!session) {
console.log('{"continue": true, "suppressOutput": true}');
process.exit(0);
}
const socketPath = getWorkerSocketPath(session.id);
const message = {
type: 'observation',
tool_name,
tool_input: JSON.stringify(tool_input),
tool_output: JSON.stringify(tool_output)
};
// Try to send with retries
sendWithRetry(socketPath, message, 5).then(() => {
console.log('{"continue": true, "suppressOutput": true}');
process.exit(0);
}).catch((err) => {
console.error(`[claude-mem save] Failed after retries: ${err.message}`);
console.log('{"continue": true, "suppressOutput": true}');
process.exit(0);
});
} catch (error: any) {
console.error(`[claude-mem save error: ${error.message}]`);
console.log('{"continue": true, "suppressOutput": true}');
process.exit(0);
}
}
/**
* Send message to socket with exponential backoff retry
*/
async function sendWithRetry(
socketPath: string,
message: any,
maxRetries: number
): Promise<void> {
let retries = maxRetries;
let delay = 100; // Start with 100ms
while (retries > 0) {
try {
await sendMessage(socketPath, message);
return; // Success
} catch (err: any) {
retries--;
if (retries === 0) {
throw err; // Out of retries
}
// Exponential backoff
await sleep(delay);
delay = Math.min(delay * 2, 2000); // Cap at 2s
}
}
}
/**
* Send single message to socket
*/
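function sendMessage(socketPath: string, message: any): Promise<void> {
return new Promise((resolve, reject) => {
const client = net.connect(socketPath, () => {
// end() flushes the payload and then closes our side of the connection
client.end(JSON.stringify(message) + '\n');
});
client.on('error', (err) => {
reject(err);
});
// Resolve only once the socket has closed cleanly, so the caller's
// process.exit(0) cannot cut off a write that is still in flight
client.on('close', (hadError) => {
if (!hadError) resolve();
});
});
}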
function sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
Testing:
- Add artificial delay in worker startup
- Fire PostToolUse immediately after UserPromptSubmit
- Verify save-hook retries and succeeds
- Verify observation is captured
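When choosing the artificial startup delay for this test, it helps to know how long sendWithRetry actually waits before giving up. The sketch below reproduces the same schedule (start at 100ms, double, cap at 2s) as a pure function. `backoffDelays` is a hypothetical helper, not part of save.ts.

```typescript
/** Delays slept between attempts: doubles each time, capped at capMs. */
function backoffDelays(maxAttempts: number, initialMs = 100, capMs = 2000): number[] {
  const delays: number[] = [];
  let delay = initialMs;
  // One sleep between each pair of attempts, so maxAttempts - 1 sleeps in total
  for (let i = 1; i < maxAttempts; i++) {
    delays.push(delay);
    delay = Math.min(delay * 2, capMs);
  }
  return delays;
}

// With the 5 attempts used by saveHook: sleeps of 100, 200, 400, 800 ms
const retryDelays = backoffDelays(5);
const totalWaitMs = retryDelays.reduce((sum, d) => sum + d, 0); // 1500
console.log(retryDelays, totalWaitMs);
```

With 5 attempts the hook waits about 1.5 seconds in total, so an artificial worker delay longer than that should still lose the observation, and a shorter one should succeed.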
Medium Priority (Should Fix)
5. Orphaned Active Sessions in Database
Severity: Low
Impact: DB bloat, confusion about session status
Problem: Sessions marked "active" never transition to "completed" or "failed" if worker crashes or is killed.
Fix Required:
Create cleanup script: src/commands/cleanup-sessions.ts
import { HooksDatabase } from '../services/sqlite/HooksDatabase.js';
/**
* Mark old active sessions as failed
*/
export function cleanupSessions(maxAgeHours: number = 24): void {
const db = new HooksDatabase();
const maxAgeMs = maxAgeHours * 60 * 60 * 1000;
const cutoffEpoch = Date.now() - maxAgeMs;
const query = (db as any).db.query(`
UPDATE sdk_sessions
SET status = 'failed', completed_at = datetime('now'), completed_at_epoch = ?
WHERE status = 'active' AND started_at_epoch < ?
`);
const result = query.run(Date.now(), cutoffEpoch);
console.log(`Marked ${result.changes} old active sessions as failed`);
db.close();
}
Add to CLI: src/bin/cli.ts
.command('cleanup-sessions')
.description('Mark old active sessions as failed')
.option('--max-age <hours>', 'Maximum age in hours', '24')
.action((options) => {
cleanupSessions(parseInt(options.maxAge, 10));
})
Alternative: Add auto-expiry check in context-hook:
// Before loading summaries, clean up stale sessions
const maxAgeMs = 24 * 60 * 60 * 1000;
const cutoffEpoch = Date.now() - maxAgeMs;
db.db.query(`
UPDATE sdk_sessions
SET status = 'failed'
WHERE status = 'active' AND started_at_epoch < ?
`).run(cutoffEpoch);
6. SessionStart Only Runs on "startup"
Severity: Low
Impact: No context loaded on /resume
Problem:
context-hook only loads context on "startup" source, skips "resume", "clear", and "compact".
Location: src/hooks/context.ts:24
// Only run on startup (not on resume)
if (input.source && input.source !== 'startup') {
console.log('');
process.exit(0);
}
Fix Required:
// Load context on startup and resume
if (input.source && input.source !== 'startup' && input.source !== 'resume') {
console.log(''); // Skip for clear/compact
process.exit(0);
}
Rationale:
- startup: Load context (project overview)
- resume: Load context (user continuing work)
- clear: Skip (user wants fresh start)
- compact: Skip (just memory optimization, context preserved)
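As the list of sources grows, an allow-list is easier to read and test than a chain of `!==` comparisons. A minimal sketch with the same behavior as the guard above; `shouldLoadContext` and `CONTEXT_SOURCES` are hypothetical names.

```typescript
// Sources that should load context, per the rationale above
const CONTEXT_SOURCES: ReadonlySet<string> = new Set(['startup', 'resume']);

/** Decide whether the context hook should emit summaries for this source. */
function shouldLoadContext(source?: string): boolean {
  // A missing or empty source loads context, matching the existing guard
  if (!source) return true;
  return CONTEXT_SOURCES.has(source);
}
```

The hook would then skip with `if (!shouldLoadContext(input.source)) { console.log(''); process.exit(0); }`.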
Low Priority (Nice to Have)
7. No Cost Control or Observation Limits
Severity: Low
Impact: Long sessions can be expensive
Problem: No limits on SDK agent API calls. A session with thousands of tool calls could rack up significant costs.
Fix Ideas:
- Add observation counter, warn after N observations
- Add cost estimation based on token usage
- Add budget limit in config
- Batch observations (send N at once instead of one-by-one)
Example:
class SDKWorker {
private observationCount = 0;
private maxObservations = 1000;
private handleMessage(message: WorkerMessage): void {
if (message.type === 'observation') {
this.observationCount++;
if (this.observationCount > this.maxObservations) {
console.error(`[SDK Worker] Exceeded max observations: ${this.maxObservations}`);
this.isFinalized = true;
return;
}
}
this.pendingMessages.push(message);
}
}
8. No Health Check Mechanism
Severity: Low
Impact: Can't tell if worker is alive/healthy
Fix Ideas:
- Add /status command that checks for active workers
- Add health check endpoint on socket (ping/pong)
- Add metrics to DB (last_activity_at)
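The ping/pong idea could be sketched as one more message type on the existing socket. Everything here is an assumption for illustration: the `ping` and `pong` type names, the reply fields, and the `handlePing` helper do not exist in the current worker.

```typescript
interface HealthReply {
  type: 'pong';
  uptime_ms: number; // time since the worker started
  pending: number;   // messages queued but not yet yielded to the SDK agent
}

/** Build a reply for a {"type":"ping"} message; returns null for anything else. */
function handlePing(
  message: { type: string },
  startedAt: number,
  pendingCount: number,
  now: number = Date.now()
): HealthReply | null {
  if (message.type !== 'ping') return null;
  return { type: 'pong', uptime_ms: now - startedAt, pending: pendingCount };
}
```

The worker would write the reply back on the same connection as a JSON line, and a status command would treat a missing reply as an unhealthy worker.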
9. No Observation Deduplication
Severity: Low
Impact: Duplicate observations if same tool executed multiple times
Fix Ideas:
- Hash tool_name + tool_input + tool_output
- Check for duplicate hash before storing
- Or let SDK agent handle deduplication naturally
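The hashing option could look like the sketch below. It assumes the observation fields are the already-stringified values save-hook sends. `ObservationDeduper` is a hypothetical in-memory helper, so duplicates are only detected within a single worker's lifetime.

```typescript
import { createHash } from 'node:crypto';

interface Observation {
  tool_name: string;
  tool_input: string;
  tool_output: string;
}

/** SHA-256 over the three fields, hex encoded. */
function observationKey(obs: Observation): string {
  // JSON array encoding keeps field boundaries unambiguous before hashing
  const canonical = JSON.stringify([obs.tool_name, obs.tool_input, obs.tool_output]);
  return createHash('sha256').update(canonical).digest('hex');
}

class ObservationDeduper {
  private readonly seen = new Set<string>();

  /** Returns true the first time an observation is seen, false for repeats. */
  accept(obs: Observation): boolean {
    const key = observationKey(obs);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```

Because the output is part of the hash, re-reading a file after it changed still counts as a new observation. Only exact repeats are dropped.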
Implementation Checklist
Phase 0: Verify Happy Path (DO THIS FIRST - HIGHEST PRIORITY)
Goal: Prove the basic cycle works end-to-end before fixing edge cases.
- Test 0.1: Verify Stop Hook Fires
  - Add logging to src/hooks/summary.ts
  - Exit session normally and verify hook runs
  - Verify FINALIZE message is sent to socket
- Test 0.2: Verify Worker Generates Summary
  - Add logging to worker message handler
  - Verify FINALIZE message received
  - Verify SDK agent response
  - Verify summary parsed and stored in DB
  - Query DB to confirm summary exists
- Test 0.3: Verify Context Hook Loads Summaries
  - Add logging to src/hooks/context.ts
  - Start new session, verify summaries loaded
  - Verify markdown output to stdout
  - Verify Claude has context from previous session
- Test 0.4: End-to-End Integration Test
  - Run session 1 with test work
  - Verify summary in DB
  - Run session 2
  - Ask Claude about previous session
  - Confirm Claude has correct context
STOP HERE: Only proceed to Phase 1 after confirming all Phase 0 tests pass.
Phase 1: Critical Resilience Fixes (Do After Phase 0)
- Add watchdog timer to worker (Issue #1)
  - Add lastActivityTime tracking
  - Add timeout check in message generator loop
  - Test with zombie worker scenario
- Configure existing SessionEnd hook (Issue #2)
  - Add SessionEnd configuration to hooks/hooks.json
  - Create src/hooks/cleanup.ts (implements cleanup logic)
  - Create src/bin/hooks/cleanup-hook.ts (entry point)
  - Update build process to compile cleanup-hook
  - Test with Ctrl-C exit and verify worker cleanup
- Fix stale socket detection (Issue #3)
  - Add testSocketStale method
  - Update startSocketServer to check for stale sockets
  - Test with crashed worker scenario
- Fix save-hook race condition (Issue #4)
  - Add sendWithRetry function
  - Add exponential backoff logic
  - Update save-hook to use retry logic
  - Test with immediate PostToolUse
Phase 2: Medium Priority
- Add session cleanup script (Issue #5)
  - Create cleanup-sessions command
  - Add to CLI
  - Optional: Add auto-cleanup to context-hook
- Fix SessionStart source handling (Issue #6)
  - Update context-hook to load on "resume"
  - Test with /resume command
Phase 3: Low Priority (Optional)
- Add cost control (Issue #7)
- Add health checks (Issue #8)
- Add observation deduplication (Issue #9)
Testing Strategy
Unit Tests
Create tests for each fix:
- test/hooks/cleanup.test.ts - SessionEnd hook
- test/sdk/worker-timeout.test.ts - Watchdog timer
- test/hooks/save-retry.test.ts - Retry logic
Integration Tests
Test complete flows:
- Normal flow: SessionStart → UserPromptSubmit → PostToolUse → Stop
- Crash recovery: Worker crash → SessionEnd cleanup
- Zombie worker: No Stop hook → Worker timeout
- Socket race: Immediate PostToolUse → Retry success
Manual Testing Scenarios
- Zombie Worker Test:
  # Start session
  claude
  # Kill Claude with Ctrl-C
  # Check for worker process
  ps aux | grep claude-mem-worker
  # Wait 2 hours, verify worker exits
- SessionEnd Test:
  # Start session
  claude
  # Exit normally or Ctrl-C
  # Verify worker killed
  # Verify socket removed
  # Check DB for session status
  sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM sdk_sessions"
- Stale Socket Test:
  # Start session
  claude
  # Kill worker with kill -9 <pid>
  # Verify socket exists
  ls /tmp/claude-mem-worker-*.sock
  # Start new session
  # Verify old socket removed, new session starts
- Race Condition Test:
  # Add delay to worker startup (for testing)
  # Start session, immediately run command
  claude "list all files"
  # Verify first observation captured
File Modifications Required
New Files
- src/hooks/cleanup.ts - SessionEnd hook logic
- src/bin/hooks/cleanup-hook.ts - SessionEnd entry point
- src/commands/cleanup-sessions.ts - Session cleanup script
- test/hooks/cleanup.test.ts - Tests for SessionEnd hook
- test/sdk/worker-timeout.test.ts - Tests for watchdog timer
- test/hooks/save-retry.test.ts - Tests for retry logic
Modified Files
- hooks/hooks.json - Add SessionEnd configuration
- src/sdk/worker.ts - Add watchdog timer, stale socket detection
- src/hooks/save.ts - Add retry logic
- src/hooks/context.ts - Load context on resume
- src/bin/cli.ts - Add cleanup-sessions command
Dependencies
No new dependencies required. All fixes use existing:
- net (Unix sockets)
- fs (file operations)
- child_process (process management)
- bun:sqlite (database)
Success Criteria
Phase 0 (Must Pass First)
- ✅ Stop hook fires on normal exit
- ✅ Worker receives FINALIZE and generates summary
- ✅ Summary is stored in DB correctly
- ✅ Context hook loads summaries on next session
- ✅ New session immediately sees previous session's summary in context
- ✅ End-to-end integration test passes
Phase 1 (After Phase 0 Passes)
- ✅ Worker processes never become zombies (exit after 2h max)
- ✅ SessionEnd hook cleans up worker and socket on exit
- ✅ Stale sockets don't block new sessions
- ✅ First observation always captured (no race condition)
- ✅ No orphaned "active" sessions in DB after 24h
- ✅ Context loads on /resume
- ✅ All tests pass
References
- Claude Code Hooks Documentation: https://docs.claude.com/en/docs/claude-code/hooks
- Claude Agent SDK Streaming: https://docs.claude.com/en/api/agent-sdk/streaming-vs-single-mode
- Unix Domain Sockets: Node.js net module
- SQLite Best Practices: Bun SQLite documentation
Notes
- All hooks must return {"continue": true, "suppressOutput": true} on error
- Hooks have 60s default timeout (configurable)
- Worker is detached process, doesn't block Claude Code
- SessionEnd hooks "cannot block session termination" per Claude Code docs
- Streaming input mode is the recommended SDK approach for this architecture