Session Logic Fixes - Claude-Mem

Status: Planning
Created: 2025-10-16
Priority: High
Estimated Effort: 2-3 days

Executive Summary

The claude-mem session logic architecture is fundamentally sound: it uses the Claude Agent SDK in streaming input mode, with Unix socket IPC for real-time observation processing. However, we need to verify that the basic happy path works end-to-end before addressing edge cases.

Critical Goal: Session ends → summary generated → next session immediately sees summary in context

Overall Assessment: Architecture is correct, but needs systematic verification that the happy path works, then resilience improvements

Current Status: Unknown if basic cycle works - need to test and debug the core flow first

Feedback Applied (2025-10-16)

Round 1: Technical Corrections

  • Confirmed architectural approach is sound
  • Corrected: SessionEnd hooks already exist in Claude Code - we're configuring, not implementing
  • Technical fixes for resilience issues are sound

Round 2: Priority Reordering (MAJOR CHANGE)

Critical realization: The document focused on edge cases (zombies, crashes) when the basic happy path might not even work yet.

Complete restructure:

  1. Phase 0 (NEW - TOP PRIORITY): Verify the basic cycle works

    • Does Stop hook fire on normal exit?
    • Does worker generate and store summary?
    • Does context hook load summaries on next session?
    • End-to-end integration test
  2. Phase 1 (SECOND PRIORITY): Fix resilience issues

    • Zombie workers, race conditions, stale sockets
    • All the original issues moved here

Key principle: Everything else is irrelevant if "session ends → next session sees summary" doesn't work.

Revised Focus: Get the fucking happy path working first, then worry about edge cases.

Architecture Overview

Current Flow

SessionStart (startup)
  → context-hook.ts:15
  → Loads recent summaries from DB
  → Outputs markdown to stdout (becomes context)

UserPromptSubmit
  → new-hook.ts:16
  → Creates SDK session in DB (status='active')
  → Spawns detached worker process
  → Worker starts immediately, hooks return

Worker Process (worker.ts:75)
  → Starts Unix socket server at /tmp/claude-mem-worker-{id}.sock
  → Runs SDK agent with streaming input (async generator)
  → Yields init prompt to SDK agent
  → Waits for messages from hooks

PostToolUse (fired for each tool)
  → save-hook.ts:24
  → Sends observation to worker via Unix socket
  → Worker receives → yields to SDK agent
  → SDK agent analyzes → returns <observation> XML
  → Worker parses XML → stores in observations table

Stop (session ends)
  → summary-hook.ts:15
  → Sends FINALIZE message to worker via socket
  → Worker yields finalize prompt to SDK agent
  → SDK agent generates <summary> XML
  → Worker parses → stores in session_summaries table
  → Worker marks session completed, closes socket, exits
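
The hooks in this flow write each message to the socket as `JSON.stringify(message) + '\n'`, so the worker has to reassemble newline-delimited JSON from arbitrary socket chunks. A minimal sketch of that framing, assuming only the two message shapes that appear in this plan; `splitMessages` is an illustrative name, not the actual worker.ts code:

```typescript
// Message shapes taken from the hook code in this plan; anything else is assumed.
type WorkerMessage =
  | { type: 'observation'; tool_name: string; tool_input: string; tool_output: string }
  | { type: 'finalize' };

// Hypothetical helper: append a chunk to the buffer and split off complete lines.
// Returns the parsed messages plus the unfinished tail to carry into the next call.
function splitMessages(
  buffer: string,
  chunk: string
): { messages: WorkerMessage[]; rest: string } {
  const lines = (buffer + chunk).split('\n');
  const rest = lines.pop() ?? ''; // last element is an incomplete line (or '')
  const messages: WorkerMessage[] = [];
  for (const line of lines) {
    if (line.trim() === '') continue; // ignore blank lines
    try {
      messages.push(JSON.parse(line) as WorkerMessage);
    } catch {
      console.error(`[SDK Worker] Dropping malformed message: ${line}`);
    }
  }
  return { messages, rest };
}

// A message split across two socket chunks is only emitted once it is complete.
const first = splitMessages('', '{"type":"fin');
const second = splitMessages(first.rest, 'alize"}\n');
console.log(first.messages.length, second.messages[0]?.type); // 0 finalize
```

Keeping the framing in a pure function makes it easy to check in Phase 0 whether a lost FINALIZE is a transport problem or a parsing problem.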

Key Components

Hook Files:

  • src/hooks/context.ts - SessionStart hook logic
  • src/hooks/new.ts - UserPromptSubmit hook logic
  • src/hooks/save.ts - PostToolUse hook logic
  • src/hooks/summary.ts - Stop hook logic
  • src/bin/hooks/*.ts - Entry point wrappers for each hook

Worker:

  • src/sdk/worker.ts - Main worker process with SDK integration
  • src/sdk/prompts.ts - Prompt generation for SDK agent
  • src/sdk/parser.ts - XML parser for SDK responses

Database:

  • src/services/sqlite/HooksDatabase.ts - Lightweight DB interface for hooks
  • src/services/sqlite/migrations.ts - Schema definitions

Configuration:

  • hooks/hooks.json - Hook configuration for Claude Code plugin

Technologies

  • IPC: Unix domain sockets (/tmp/claude-mem-worker-{id}.sock)
  • SDK Mode: Streaming input (async generator pattern)
  • Output Format: XML blocks (<observation> and <summary>)
  • Process Model: Detached worker (spawn with detached: true, stdio: 'ignore')
  • Database: SQLite with Bun
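
To make the XML output format concrete, here is a rough sketch of pulling `<observation>` and `<summary>` blocks out of an SDK response. The real parser lives in src/sdk/parser.ts and is not shown in this plan; `extractBlocks` and the sample response are assumptions for illustration only:

```typescript
// Hypothetical helper: return the inner text of every <tag>...</tag> block in a response.
// A regex is enough for a sketch; it does not handle nested tags of the same name.
function extractBlocks(text: string, tag: 'observation' | 'summary'): string[] {
  const pattern = new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`, 'g');
  const blocks: string[] = [];
  for (const match of text.matchAll(pattern)) {
    blocks.push(match[1].trim());
  }
  return blocks;
}

// Made-up response text; the real inner structure of <summary> is not specified here.
const response =
  'Some preamble.\n<summary>\n  Created and read test.txt\n</summary>\nTrailing text.';
console.log(extractBlocks(response, 'summary')); // [ 'Created and read test.txt' ]
```

An empty result is a useful debugging signal: it means the SDK agent answered, but not in the expected XML shape.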

Identified Issues

Phase 0: Verify Happy Path Works (DO THIS FIRST)

Priority: CRITICAL - Everything else is irrelevant if the basic cycle doesn't work

Goal: Prove that when a session ends normally, the next session immediately sees the summary in its context.

Test 0.1: Does Stop Hook Fire on Normal Exit?

What to test:

# Start Claude Code session
claude

# Do some work (read files, etc)

# Exit normally
exit

# Check logs - did Stop hook run?

Expected behavior:

  • Stop hook (summary-hook) should fire
  • Should send FINALIZE message to worker socket
  • Worker should receive it and generate summary

How to verify:

  1. Add logging to src/hooks/summary.ts at the top of summaryHook()
  2. Add logging when sending socket message
  3. Exit session normally and check logs

If it doesn't work: Debug why Stop hook isn't firing or why socket message fails
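
One way to add the logging described above is a small helper shared by all hooks. This is a sketch: `formatHookLog` and `logHook` are hypothetical names, and the prefix simply mirrors the `[claude-mem ...]` style used elsewhere in this plan:

```typescript
// Hypothetical log-line formatter; `now` is a parameter so the output is testable.
function formatHookLog(
  hook: string,
  stage: string,
  detail?: unknown,
  now: Date = new Date()
): string {
  const suffix = detail === undefined ? '' : ` ${JSON.stringify(detail)}`;
  return `${now.toISOString()} [claude-mem ${hook}] ${stage}${suffix}`;
}

// console.error writes to stderr, so stdout stays reserved for the hook's JSON response.
function logHook(hook: string, stage: string, detail?: unknown): void {
  console.error(formatHookLog(hook, stage, detail));
}

logHook('summary', 'hook invoked');
logHook('summary', 'sending message', { type: 'finalize' });
```

Timestamps make it possible to line up hook logs with worker logs when checking whether FINALIZE arrived.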


Test 0.2: Does Worker Receive FINALIZE and Generate Summary?

What to test: After Stop hook fires, does the worker:

  1. Receive the FINALIZE message
  2. Yield finalize prompt to SDK agent
  3. Get back a summary from SDK
  4. Parse the XML
  5. Store it in session_summaries table

How to verify:

  1. Add console.error logging in src/sdk/worker.ts:239 in the message handler
  2. Log when FINALIZE is received
  3. Log the SDK agent response
  4. Log when summary is parsed
  5. Query DB after session ends:
    sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM session_summaries ORDER BY created_at DESC LIMIT 1"
    

If it doesn't work:

  • Check if worker is even running (ps aux | grep worker)
  • Check if socket message arrived
  • Check if SDK agent returned valid XML
  • Check if parser worked
  • Check if DB insert succeeded

Test 0.3: Does Context Hook Load Summaries?

What to test: When starting a new session, does context hook:

  1. Query recent summaries from DB
  2. Format them as markdown
  3. Output to stdout (becomes context)

How to verify:

  1. Add logging to src/hooks/context.ts:24
  2. Log the summaries retrieved from DB
  3. Log the markdown output
  4. Start new session and check:
    • Console output (should see markdown)
    • Claude's context (ask "what did we do last session?")

If it doesn't work:

  • Check if SessionStart hook is firing
  • Check if DB query returns results
  • Check if markdown is being formatted correctly
  • Check if output is going to stdout properly
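
For reference while debugging the formatting step, a minimal sketch of turning summary rows into markdown. The row shape assumes only the `summary_text` and `created_at` columns used in this plan's queries; the heading layout and the name `formatSummaries` are invented for illustration and may differ from what src/hooks/context.ts actually emits:

```typescript
// Assumed row shape: only summary_text and created_at appear in this plan's queries.
interface SummaryRow {
  summary_text: string;
  created_at: string;
}

// Hypothetical formatter: each row becomes one markdown section, in the order given.
function formatSummaries(rows: SummaryRow[]): string {
  if (rows.length === 0) return ''; // nothing to inject
  const sections = rows.map(
    (row) => `## Session ${row.created_at}\n\n${row.summary_text.trim()}`
  );
  return ['# Recent Session Summaries', ...sections].join('\n\n');
}

const markdown = formatSummaries([
  { summary_text: 'Created and read test.txt', created_at: '2025-10-16 10:00:00' },
]);
console.log(markdown); // the hook's stdout is what becomes session context
```

If the DB query returns rows but the formatted string is empty, the bug is in this step rather than in the hook wiring.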

Test 0.4: End-to-End Integration Test

What to test: Full cycle from start to finish:

# Session 1
claude
# Do some work
echo "test file" > test.txt
cat test.txt
exit

# Verify summary was stored
sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT summary_text FROM session_summaries ORDER BY created_at DESC LIMIT 1"

# Session 2
claude
# Ask Claude: "What did we do last session?"
# Expected: Claude should know we created and read test.txt

Success criteria:

  • Summary appears in DB after session 1
  • Session 2 context includes summary from session 1
  • Claude can answer questions about previous session

If it doesn't work:

  • Review logs from Tests 0.1-0.3
  • Add more granular logging
  • Check each step of the pipeline

Common Failure Points & Debugging

If summaries aren't showing up in new sessions:

  1. Stop hook not configured/firing:

    # Check hooks config
    cat ~/.claude/plugins/claude-mem/hooks.json | jq '.hooks.Stop'
    
    # Should see summary-hook configured
    # If not, hooks.json is wrong or plugin not installed
    
  2. Worker not running:

    ps aux | grep claude-mem-worker
    
    # If no worker, UserPromptSubmit hook failed to spawn it
    # Check new-hook logs
    
  3. Socket communication failing:

    # Check socket exists
    ls /tmp/claude-mem-worker-*.sock
    
    # Try to connect manually
    echo '{"type":"finalize"}' | nc -U /tmp/claude-mem-worker-*.sock
    
  4. SDK agent not returning summary:

    • Check API key is set
    • Check SDK agent prompt is valid
    • Check XML parser is working
    • Add logging to see SDK response
  5. DB write failing:

    # Check DB exists and is writable
    sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM sdk_sessions WHERE status='active'"
    
    # If no active session, new-hook didn't create it
    
  6. Context hook not loading:

    # Check SessionStart hook configured
    cat ~/.claude/plugins/claude-mem/hooks.json | jq '.hooks.SessionStart'
    
    # Start session and check for context output
    # Should see markdown in initial context
    

Debugging Checklist:

  • Verify all hooks are configured in hooks.json
  • Verify plugin is installed correctly
  • Add console.error logging to all hooks (goes to stderr, visible in terminal)
  • Check each step of the pipeline systematically
  • Don't assume anything works - verify each piece

Phase 1: Critical Resilience Issues (Fix After Happy Path Works)

1. Zombie Worker Processes

Severity: High
Impact: Memory/CPU waste, orphaned processes accumulate

Problem: If Stop hook never fires (user Ctrl-C, Claude Code crash), worker runs forever waiting for FINALIZE message.

Location: src/sdk/worker.ts:239

// Current code - infinite loop with no timeout
while (!this.isFinalized) {
  if (this.pendingMessages.length === 0) {
    await this.sleep(100);
    continue;
  }
  // ... process messages
}

Fix Required:

// Add watchdog timer
class SDKWorker {
  private maxIdleTime = 2 * 60 * 60 * 1000; // 2 hours
  private lastActivityTime = Date.now();

  private updateActivity(): void {
    this.lastActivityTime = Date.now();
  }

  private async* createMessageGenerator(): AsyncIterable<...> {
    // Yield initial prompt
    const initPrompt = buildInitPrompt(...);
    yield { type: 'user', message: { role: 'user', content: initPrompt } };
    this.updateActivity();

    while (!this.isFinalized) {
      // Check for timeout
      const idleTime = Date.now() - this.lastActivityTime;
      if (idleTime > this.maxIdleTime) {
        console.error(`[SDK Worker] Timeout - no activity for ${this.maxIdleTime / 1000}s`);
        this.isFinalized = true;
        break;
      }

      if (this.pendingMessages.length === 0) {
        await this.sleep(100);
        continue;
      }

      // Process messages and update activity
      this.updateActivity();
      // ... existing message processing
    }
  }
}

Testing:

  1. Start claude-mem session
  2. Kill Claude Code process (kill -9)
  3. Verify worker exits after the idle timeout (2 hours by default; temporarily lower maxIdleTime to make this test practical)
  4. Check no orphaned processes remain
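
Waiting two real hours is awkward for an automated test. One option, sketched here under the assumption that the idle check can be pulled out of the generator loop, is to inject the clock so a test can fake the passage of time. `IdleWatchdog` is a hypothetical class, not existing worker code:

```typescript
// Hypothetical watchdog that takes the clock as a parameter so tests can fake time.
class IdleWatchdog {
  private lastActivityTime: number;
  private readonly maxIdleMs: number;
  private readonly now: () => number;

  constructor(maxIdleMs: number, now: () => number = () => Date.now()) {
    this.maxIdleMs = maxIdleMs;
    this.now = now;
    this.lastActivityTime = this.now();
  }

  // Call whenever a message is processed.
  touch(): void {
    this.lastActivityTime = this.now();
  }

  // True once more than maxIdleMs has passed since the last activity.
  isExpired(): boolean {
    return this.now() - this.lastActivityTime > this.maxIdleMs;
  }
}

const TWO_HOURS_MS = 2 * 60 * 60 * 1000;
let fakeNow = 0;
const watchdog = new IdleWatchdog(TWO_HOURS_MS, () => fakeNow);

fakeNow = TWO_HOURS_MS; // exactly 2 hours idle: not yet expired
console.log(watchdog.isExpired()); // false
fakeNow += 1; // 1 ms past the limit
console.log(watchdog.isExpired()); // true
```

The generator loop would then call `touch()` where the fix calls `updateActivity()` and break when `isExpired()` returns true.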

2. SessionEnd Hook Not Configured

Severity: High
Impact: No cleanup on abrupt exit, sessions stuck in "active" status

Problem: SessionEnd hooks are a built-in Claude Code feature that "run when a session ends" and "cannot block session termination but can perform cleanup tasks" (docs). However, claude-mem's hooks/hooks.json does NOT configure this hook. Worker doesn't get cleaned up when Claude Code exits abruptly.

Note: This is NOT a missing feature in Claude Code - SessionEnd hooks already exist. We just need to configure them.

Current Configuration: hooks/hooks.json:1-51

{
  "hooks": {
    "SessionStart": [...],
    "UserPromptSubmit": [...],
    "PostToolUse": [...],
    "Stop": [...]
    // SessionEnd is MISSING
  }
}

Fix Required:

SessionEnd hooks receive structured input including:

{
  "session_id": "abc123",
  "transcript_path": "~/.claude/projects/.../transcript.jsonl",
  "cwd": "/Users/...",
  "hook_event_name": "SessionEnd",
  "reason": "exit"  // or "clear", "logout", "prompt_input_exit", etc.
}

Implementation Steps:

  1. Add SessionEnd configuration to hooks/hooks.json:

For events like SessionEnd that don't use matchers, we can omit the matcher field:

{
  "hooks": {
    "SessionEnd": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "bun ${CLAUDE_PLUGIN_ROOT}/scripts/hooks/cleanup-hook.js",
            "timeout": 60
          }
        ]
      }
    ]
  }
}

  2. Create src/hooks/cleanup.ts:

import { HooksDatabase } from '../services/sqlite/HooksDatabase.js';
import { getWorkerSocketPath } from '../shared/paths.js';
import { existsSync, unlinkSync } from 'fs';
import { execSync } from 'child_process';

export interface SessionEndInput {
  session_id: string;
  cwd: string;
  // 'exit' is included because the sample SessionEnd input above uses it
  reason: 'exit' | 'clear' | 'logout' | 'prompt_input_exit' | 'other';
  [key: string]: any;
}

/**
 * Cleanup Hook - SessionEnd
 * Cleans up worker process and marks session as terminated
 */
export function cleanupHook(input?: SessionEndInput): void {
  try {
    if (!input) {
      console.log('No input provided - this script is designed to run as a Claude Code SessionEnd hook');
      process.exit(0);
    }

    const { session_id, reason } = input;

    // Find active SDK session
    const db = new HooksDatabase();
    const session = db.findActiveSDKSession(session_id);

    if (!session) {
      db.close();
      console.log('{"suppressOutput": true}');
      process.exit(0);
    }

    const socketPath = getWorkerSocketPath(session.id);

    // Kill the worker first: lsof can only find it while the socket file still exists
    try {
      // -t prints bare PIDs (one per line) of processes that have the socket open
      const lsofOutput = execSync(`lsof -t "${socketPath}" 2>/dev/null || true`, { encoding: 'utf8' });
      for (const line of lsofOutput.split('\n')) {
        const pid = parseInt(line.trim(), 10);
        if (Number.isNaN(pid) || pid === process.pid) continue;
        console.error(`[claude-mem cleanup] Killing worker process ${pid}`);
        process.kill(pid, 'SIGTERM');
      }
    } catch (err) {
      // Worker already dead or couldn't find it - that's fine
    }

    // Remove the socket file (the worker may already have removed it on exit)
    if (existsSync(socketPath)) {
      try {
        unlinkSync(socketPath);
      } catch (err: any) {
        console.error(`[claude-mem cleanup] Failed to remove socket: ${err.message}`);
      }
    }

    // Mark session as failed (not completed since it was terminated)
    db.markSessionFailed(session.id);
    db.close();

    console.log('{"suppressOutput": true}');
    process.exit(0);

  } catch (error: any) {
    console.error(`[claude-mem cleanup error: ${error.message}]`);
    console.log('{"suppressOutput": true}');
    process.exit(0);
  }
}

  3. Create src/bin/hooks/cleanup-hook.ts:

#!/usr/bin/env bun

/**
 * Cleanup Hook Entry Point - SessionEnd
 * Standalone executable for plugin hooks
 */

import { cleanupHook } from '../../hooks/cleanup.js';

// Read input from stdin
const input = await Bun.stdin.text();

try {
  const parsed = input.trim() ? JSON.parse(input) : undefined;
  cleanupHook(parsed);
} catch (error: any) {
  console.error(`[claude-mem cleanup-hook error: ${error.message}]`);
  console.log('{"suppressOutput": true}');
  process.exit(0);
}

  4. Update build process to compile cleanup-hook.ts to scripts/hooks/cleanup-hook.js

Testing:

  1. Start claude-mem session
  2. Exit Claude Code with Ctrl-C
  3. Verify worker process is killed
  4. Verify socket file is removed
  5. Verify session marked as "failed" in DB
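
The PID lookup is the fiddly part of the cleanup hook and is worth unit-testing separately. `lsof -t` prints bare PIDs, one per line, which is easier to parse than matching lsof's default table with a regex. A sketch, with `parsePids` and `terminateAll` as hypothetical helpers and the kill function injected so the test never signals a real process:

```typescript
// Hypothetical helper: turn `lsof -t <path>` output (one PID per line) into numbers.
function parsePids(lsofOutput: string): number[] {
  return lsofOutput
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => /^\d+$/.test(line))
    .map((line) => parseInt(line, 10));
}

// Send SIGTERM to each PID except `selfPid`; returns the PIDs that were signalled.
function terminateAll(
  pids: number[],
  selfPid: number,
  kill: (pid: number, signal: string) => void
): number[] {
  const signalled: number[] = [];
  for (const pid of pids) {
    if (pid === selfPid) continue; // never kill the cleanup hook itself
    try {
      kill(pid, 'SIGTERM');
      signalled.push(pid);
    } catch {
      // Process already gone: nothing to do
    }
  }
  return signalled;
}

// Demo with a fake kill function; real code would pass (pid, sig) => process.kill(pid, sig).
const killed = terminateAll(parsePids('4242\n4243\n'), 4243, () => {});
console.log(killed); // [ 4242 ]
```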

3. Stale Socket Files Block New Sessions

Severity: Medium
Impact: Worker fails to start if previous worker crashed

Problem: If worker crashes, socket file persists at /tmp/claude-mem-worker-{id}.sock. Next worker with same session ID fails with EADDRINUSE.

Location: src/sdk/worker.ts:111-163

private async startSocketServer(): Promise<void> {
  // Current code only removes if exists
  if (existsSync(this.socketPath)) {
    unlinkSync(this.socketPath);
  }

  return new Promise((resolve, reject) => {
    this.server = net.createServer((socket) => { ... });
    this.server.listen(this.socketPath, () => { resolve(); });
  });
}

Fix Required:

private async startSocketServer(): Promise<void> {
  // Clean up stale socket if it exists
  if (existsSync(this.socketPath)) {
    // Test if socket is responsive
    const isStale = await this.testSocketStale(this.socketPath);
    if (isStale) {
      console.error(`[SDK Worker] Removing stale socket: ${this.socketPath}`);
      unlinkSync(this.socketPath);
    } else {
      // Socket is active - another worker is using this session ID
      throw new Error(`Socket already in use: ${this.socketPath}`);
    }
  }

  return new Promise((resolve, reject) => {
    this.server = net.createServer((socket) => {
      let buffer = '';
      socket.on('data', (chunk) => {
        // ... existing code
      });
    });

    this.server.on('error', (err: any) => {
      if (err.code === 'EADDRINUSE') {
        console.error(`[SDK Worker] Socket already in use: ${this.socketPath}`);
      }
      reject(err);
    });

    this.server.listen(this.socketPath, () => {
      resolve();
    });
  });
}

/**
 * Test if socket file is stale (no process listening)
 */
private async testSocketStale(socketPath: string): Promise<boolean> {
  return new Promise((resolve) => {
    const testClient = net.connect(socketPath);

    // Settle exactly once and always release the timer and the probe socket
    const finish = (isStale: boolean) => {
      clearTimeout(timer);
      testClient.destroy();
      resolve(isStale);
    };

    // No response within 100ms - treat as stale
    const timer = setTimeout(() => finish(true), 100);

    // Socket is responsive - not stale
    testClient.on('connect', () => finish(false));

    // Socket exists but not responsive - stale
    testClient.on('error', () => finish(true));
  });
}

Testing:

  1. Start worker, kill it with kill -9
  2. Verify socket file persists
  3. Start new worker with same session ID
  4. Verify old socket is detected as stale and removed
  5. Verify new worker starts successfully
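
testSocketStale treats every connect error as "stale". A slightly more conservative variant, sketched below, looks at the error code first. ECONNREFUSED (file exists, nobody listening) and ENOENT (file missing) are standard codes for Unix socket connects; the `SocketProbe` type and `classifyProbeError` are hypothetical names:

```typescript
// Hypothetical classifier for the error code from a failed net.connect() probe.
type SocketProbe = 'stale' | 'missing' | 'unknown';

function classifyProbeError(code: string | undefined): SocketProbe {
  switch (code) {
    case 'ECONNREFUSED':
      return 'stale'; // socket file exists but no process is listening
    case 'ENOENT':
      return 'missing'; // file vanished between existsSync() and connect()
    default:
      return 'unknown'; // e.g. EACCES: do not delete what we cannot inspect
  }
}

console.log(classifyProbeError('ECONNREFUSED')); // stale
console.log(classifyProbeError('EACCES')); // unknown
```

Only the 'stale' result should lead to unlinkSync; 'missing' needs no cleanup, and 'unknown' is safer to surface as an error than to guess.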

4. Race Condition on First Observation

Severity: Medium
Impact: First observation might be lost if socket not ready

Problem: Worker startup is async (socket creation, SDK initialization). PostToolUse can fire immediately after UserPromptSubmit returns, before socket is ready.

Current Flow:

  1. UserPromptSubmit → creates session → spawns worker → returns immediately
  2. PostToolUse fires (Claude reads a file)
  3. save-hook tries to connect → ENOENT (socket not ready yet)
  4. Connection fails → logs error, continues
  5. First observation lost

Location: src/hooks/save.ts:71

const client = net.connect(socketPath, () => {
  client.write(JSON.stringify(message) + '\n');
  client.end();
});

client.on('error', (err) => {
  // Currently just logs and continues - observation lost
  console.error(`[claude-mem save] Socket error: ${err.message}`);
});

Fix Required:

/**
 * Save Hook - PostToolUse
 * Sends tool observations to worker via Unix socket with retry logic
 */
export function saveHook(input?: PostToolUseInput): void {
  try {
    if (!input) {
      console.log('No input provided - this script is designed to run as a Claude Code PostToolUse hook');
      process.exit(0);
    }

    const { session_id, tool_name, tool_input, tool_output } = input;

    if (SKIP_TOOLS.has(tool_name)) {
      console.log('{"continue": true, "suppressOutput": true}');
      process.exit(0);
    }

    const db = new HooksDatabase();
    const session = db.findActiveSDKSession(session_id);
    db.close();

    if (!session) {
      console.log('{"continue": true, "suppressOutput": true}');
      process.exit(0);
    }

    const socketPath = getWorkerSocketPath(session.id);
    const message = {
      type: 'observation',
      tool_name,
      tool_input: JSON.stringify(tool_input),
      tool_output: JSON.stringify(tool_output)
    };

    // Try to send with retries
    sendWithRetry(socketPath, message, 5).then(() => {
      console.log('{"continue": true, "suppressOutput": true}');
      process.exit(0);
    }).catch((err) => {
      console.error(`[claude-mem save] Failed after retries: ${err.message}`);
      console.log('{"continue": true, "suppressOutput": true}');
      process.exit(0);
    });

  } catch (error: any) {
    console.error(`[claude-mem save error: ${error.message}]`);
    console.log('{"continue": true, "suppressOutput": true}');
    process.exit(0);
  }
}

/**
 * Send message to socket with exponential backoff retry
 */
async function sendWithRetry(
  socketPath: string,
  message: any,
  maxRetries: number
): Promise<void> {
  let retries = maxRetries;
  let delay = 100; // Start with 100ms

  while (retries > 0) {
    try {
      await sendMessage(socketPath, message);
      return; // Success
    } catch (err: any) {
      retries--;
      if (retries === 0) {
        throw err; // Out of retries
      }

      // Exponential backoff
      await sleep(delay);
      delay = Math.min(delay * 2, 2000); // Cap at 2s
    }
  }
}

/**
 * Send single message to socket
 */
function sendMessage(socketPath: string, message: any): Promise<void> {
  return new Promise((resolve, reject) => {
    const client = net.connect(socketPath, () => {
      // Resolve only after the payload is flushed; the caller exits the process right after
      client.end(JSON.stringify(message) + '\n', () => resolve());
    });

    client.on('error', (err) => {
      reject(err);
    });
  });
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

Testing:

  1. Add artificial delay in worker startup
  2. Fire PostToolUse immediately after UserPromptSubmit
  3. Verify save-hook retries and succeeds
  4. Verify observation is captured
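
Because the retry runs inside a hook, its worst-case wait adds directly to tool latency. The schedule implied by sendWithRetry (100ms start, doubling, 2s cap, no sleep after the final attempt) can be computed up front. `backoffDelays` is a hypothetical helper that mirrors that loop:

```typescript
// Hypothetical helper: the sleeps that occur between `maxAttempts` failed attempts.
// There is one fewer sleep than attempts, since the last failure throws immediately.
function backoffDelays(maxAttempts: number, initialMs = 100, capMs = 2000): number[] {
  const delays: number[] = [];
  let delay = initialMs;
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    delays.push(delay);
    delay = Math.min(delay * 2, capMs); // exponential growth, capped
  }
  return delays;
}

const delays = backoffDelays(5);
const totalMs = delays.reduce((sum, d) => sum + d, 0);
console.log(delays, totalMs); // [ 100, 200, 400, 800 ] 1500
```

With 5 attempts the hook waits at most 1.5s in total before giving up, well inside the 60s hook timeout noted at the end of this plan.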

Medium Priority (Should Fix)

5. Orphaned Active Sessions in Database

Severity: Low
Impact: DB bloat, confusion about session status

Problem: Sessions marked "active" never transition to "completed" or "failed" if worker crashes or is killed.

Fix Required:

Create cleanup script: src/commands/cleanup-sessions.ts

import { HooksDatabase } from '../services/sqlite/HooksDatabase.js';

/**
 * Mark old active sessions as failed
 */
export function cleanupSessions(maxAgeHours: number = 24): void {
  const db = new HooksDatabase();
  const maxAgeMs = maxAgeHours * 60 * 60 * 1000;
  const cutoffEpoch = Date.now() - maxAgeMs;

  const query = (db as any).db.query(`
    UPDATE sdk_sessions
    SET status = 'failed', completed_at = datetime('now'), completed_at_epoch = ?
    WHERE status = 'active' AND started_at_epoch < ?
  `);

  const result = query.run(Date.now(), cutoffEpoch);
  console.log(`Marked ${result.changes} old active sessions as failed`);

  db.close();
}

Add to CLI: src/bin/cli.ts

.command('cleanup-sessions')
.description('Mark old active sessions as failed')
.option('--max-age <hours>', 'Maximum age in hours', '24')
.action((options) => {
  cleanupSessions(parseInt(options.maxAge, 10));
})

Alternative: Add auto-expiry check in context-hook:

// Before loading summaries, clean up stale sessions
const maxAgeMs = 24 * 60 * 60 * 1000;
const cutoffEpoch = Date.now() - maxAgeMs;
db.db.query(`
  UPDATE sdk_sessions
  SET status = 'failed'
  WHERE status = 'active' AND started_at_epoch < ?
`).run(cutoffEpoch);

6. SessionStart Only Runs on "startup"

Severity: Low
Impact: No context loaded on /resume

Problem: context-hook only loads context on "startup" source, skips "resume", "clear", and "compact".

Location: src/hooks/context.ts:24

// Only run on startup (not on resume)
if (input.source && input.source !== 'startup') {
  console.log('');
  process.exit(0);
}

Fix Required:

// Load context on startup and resume
if (input.source && input.source !== 'startup' && input.source !== 'resume') {
  console.log(''); // Skip for clear/compact
  process.exit(0);
}

Rationale:

  • startup: Load context (project overview)
  • resume: Load context (user continuing work)
  • clear: Skip (user wants fresh start)
  • compact: Skip (just memory optimization, context preserved)
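
The rationale above can also be expressed as an allow-list, which reads more directly than chained `!==` checks. A sketch; `shouldLoadContext` is a hypothetical helper, and a missing source is treated as "load" to match the existing `input.source &&` guard:

```typescript
// The four source values named in this plan.
type SessionStartSource = 'startup' | 'resume' | 'clear' | 'compact';

// Sources for which summaries should be loaded into context.
const LOAD_CONTEXT_ON: ReadonlySet<SessionStartSource> = new Set<SessionStartSource>([
  'startup',
  'resume',
]);

// Hypothetical helper: a missing source loads context, as the existing guard does.
function shouldLoadContext(source?: SessionStartSource): boolean {
  return source === undefined || LOAD_CONTEXT_ON.has(source);
}

console.log(shouldLoadContext('resume')); // true
console.log(shouldLoadContext('compact')); // false
```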

Low Priority (Nice to Have)

7. No Cost Control or Observation Limits

Severity: Low
Impact: Long sessions can be expensive

Problem: No limits on SDK agent API calls. A session with thousands of tool calls could rack up significant costs.

Fix Ideas:

  1. Add observation counter, warn after N observations
  2. Add cost estimation based on token usage
  3. Add budget limit in config
  4. Batch observations (send N at once instead of one-by-one)

Example:

class SDKWorker {
  private observationCount = 0;
  private maxObservations = 1000;

  private handleMessage(message: WorkerMessage): void {
    if (message.type === 'observation') {
      this.observationCount++;
      if (this.observationCount > this.maxObservations) {
        console.error(`[SDK Worker] Exceeded max observations: ${this.maxObservations}`);
        this.isFinalized = true;
        return;
      }
    }
    this.pendingMessages.push(message);
  }
}

8. No Health Check Mechanism

Severity: Low
Impact: Can't tell if worker is alive/healthy

Fix Ideas:

  1. Add /status command that checks for active workers
  2. Add health check endpoint on socket (ping/pong)
  3. Add metrics to DB (last_activity_at)
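
As a rough illustration of idea 2, a ping/pong exchange could reuse the existing newline-delimited JSON socket protocol. Everything below is hypothetical (message names, fields, and `buildPong`); it only shows what a reply might carry:

```typescript
// All names here are hypothetical; the worker's real state fields are not shown in this plan.
interface WorkerHealth {
  startedAt: number; // epoch ms when the worker started
  lastActivityAt: number; // epoch ms of the last processed message
  pendingMessages: number; // messages queued but not yet yielded to the SDK agent
}

interface PongMessage {
  type: 'pong';
  uptimeMs: number;
  idleMs: number;
  pendingMessages: number;
}

// Build the reply to a {"type":"ping"} message from the worker's current state.
function buildPong(health: WorkerHealth, now: number = Date.now()): PongMessage {
  return {
    type: 'pong',
    uptimeMs: now - health.startedAt,
    idleMs: now - health.lastActivityAt,
    pendingMessages: health.pendingMessages,
  };
}

const pong = buildPong({ startedAt: 1_000, lastActivityAt: 4_000, pendingMessages: 2 }, 5_000);
console.log(JSON.stringify(pong));
// {"type":"pong","uptimeMs":4000,"idleMs":1000,"pendingMessages":2}
```

A /status command could send the ping to each socket under /tmp and report any worker that fails to answer.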

9. No Observation Deduplication

Severity: Low
Impact: Duplicate observations if same tool executed multiple times

Fix Ideas:

  1. Hash tool_name + tool_input + tool_output
  2. Check for duplicate hash before storing
  3. Or let SDK agent handle deduplication naturally
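
Ideas 1 and 2 could look roughly like this. It is a sketch with hypothetical names (`observationHash`, `ObservationDeduper`) that keeps hashes in memory rather than in the DB; the fields are serialized as a JSON array before hashing so that different field boundaries cannot collide:

```typescript
import { createHash } from 'node:crypto';

// Hash the three fields as a JSON array so ("ab", "c") and ("a", "bc") differ.
function observationHash(toolName: string, toolInput: string, toolOutput: string): string {
  return createHash('sha256')
    .update(JSON.stringify([toolName, toolInput, toolOutput]))
    .digest('hex');
}

// Hypothetical in-memory deduper; a real version might store the hash in the observations table.
class ObservationDeduper {
  private seen = new Set<string>();

  // Returns true the first time an observation is seen, false for repeats.
  isNew(toolName: string, toolInput: string, toolOutput: string): boolean {
    const hash = observationHash(toolName, toolInput, toolOutput);
    if (this.seen.has(hash)) return false;
    this.seen.add(hash);
    return true;
  }
}

const deduper = new ObservationDeduper();
console.log(deduper.isNew('Read', '{"path":"test.txt"}', '"test file"')); // true
console.log(deduper.isNew('Read', '{"path":"test.txt"}', '"test file"')); // false
```

Whether exact-match deduplication is desirable is a separate question: re-reading a file after an edit produces a different output and would correctly count as new.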

Implementation Checklist

Phase 0: Verify Happy Path (DO THIS FIRST - HIGHEST PRIORITY)

Goal: Prove the basic cycle works end-to-end before fixing edge cases.

  • Test 0.1: Verify Stop Hook Fires

    • Add logging to src/hooks/summary.ts
    • Exit session normally and verify hook runs
    • Verify FINALIZE message is sent to socket
  • Test 0.2: Verify Worker Generates Summary

    • Add logging to worker message handler
    • Verify FINALIZE message received
    • Verify SDK agent response
    • Verify summary parsed and stored in DB
    • Query DB to confirm summary exists
  • Test 0.3: Verify Context Hook Loads Summaries

    • Add logging to src/hooks/context.ts
    • Start new session, verify summaries loaded
    • Verify markdown output to stdout
    • Verify Claude has context from previous session
  • Test 0.4: End-to-End Integration Test

    • Run session 1 with test work
    • Verify summary in DB
    • Run session 2
    • Ask Claude about previous session
    • Confirm Claude has correct context

STOP HERE: Only proceed to Phase 1 after confirming all Phase 0 tests pass.


Phase 1: Critical Resilience Fixes (Do After Phase 0)

  • Add watchdog timer to worker (Issue #1)

    • Add lastActivityTime tracking
    • Add timeout check in message generator loop
    • Test with zombie worker scenario
  • Configure existing SessionEnd hook (Issue #2)

    • Add SessionEnd configuration to hooks/hooks.json
    • Create src/hooks/cleanup.ts (implements cleanup logic)
    • Create src/bin/hooks/cleanup-hook.ts (entry point)
    • Update build process to compile cleanup-hook
    • Test with Ctrl-C exit and verify worker cleanup
  • Fix stale socket detection (Issue #3)

    • Add testSocketStale method
    • Update startSocketServer to check for stale sockets
    • Test with crashed worker scenario
  • Fix save-hook race condition (Issue #4)

    • Add sendWithRetry function
    • Add exponential backoff logic
    • Update save-hook to use retry logic
    • Test with immediate PostToolUse

Phase 2: Medium Priority

  • Add session cleanup script (Issue #5)

    • Create cleanup-sessions command
    • Add to CLI
    • Optional: Add auto-cleanup to context-hook
  • Fix SessionStart source handling (Issue #6)

    • Update context-hook to load on "resume"
    • Test with /resume command

Phase 3: Low Priority (Optional)

  • Add cost control (Issue #7)
  • Add health checks (Issue #8)
  • Add observation deduplication (Issue #9)

Testing Strategy

Unit Tests

Create tests for each fix:

  • test/hooks/cleanup.test.ts - SessionEnd hook
  • test/sdk/worker-timeout.test.ts - Watchdog timer
  • test/hooks/save-retry.test.ts - Retry logic

Integration Tests

Test complete flows:

  1. Normal flow: SessionStart → UserPromptSubmit → PostToolUse → Stop
  2. Crash recovery: Worker crash → SessionEnd cleanup
  3. Zombie worker: No Stop hook → Worker timeout
  4. Socket race: Immediate PostToolUse → Retry success

Manual Testing Scenarios

  1. Zombie Worker Test:

    # Start session
    claude
    # Kill Claude with Ctrl-C
    # Check for worker process
    ps aux | grep claude-mem-worker
    # Wait 2 hours, verify worker exits
    
  2. SessionEnd Test:

    # Start session
    claude
    # Exit normally or Ctrl-C
    # Verify worker killed
    # Verify socket removed
    # Check DB for session status
    sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM sdk_sessions"
    
  3. Stale Socket Test:

    # Start session
    claude
    # Kill worker with kill -9 <pid>
    # Verify socket exists
    ls /tmp/claude-mem-worker-*.sock
    # Start new session
    # Verify old socket removed, new session starts
    
  4. Race Condition Test:

    # Add delay to worker startup (for testing)
    # Start session, immediately run command
    claude "list all files"
    # Verify first observation captured
    

File Modifications Required

New Files

  • src/hooks/cleanup.ts - SessionEnd hook logic
  • src/bin/hooks/cleanup-hook.ts - SessionEnd entry point
  • src/commands/cleanup-sessions.ts - Session cleanup script
  • test/hooks/cleanup.test.ts - Tests for SessionEnd hook
  • test/sdk/worker-timeout.test.ts - Tests for watchdog timer
  • test/hooks/save-retry.test.ts - Tests for retry logic

Modified Files

  • hooks/hooks.json - Add SessionEnd configuration
  • src/sdk/worker.ts - Add watchdog timer, stale socket detection
  • src/hooks/save.ts - Add retry logic
  • src/hooks/context.ts - Load context on resume
  • src/bin/cli.ts - Add cleanup-sessions command

Dependencies

No new dependencies required. All fixes use existing:

  • net (Unix sockets)
  • fs (file operations)
  • child_process (process management)
  • bun:sqlite (database)

Success Criteria

Phase 0 (Must Pass First)

  1. Stop hook fires on normal exit
  2. Worker receives FINALIZE and generates summary
  3. Summary is stored in DB correctly
  4. Context hook loads summaries on next session
  5. New session immediately sees previous session's summary in context
  6. End-to-end integration test passes

Phase 1 (After Phase 0 Passes)

  1. Worker processes never become zombies (exit after 2h max)
  2. SessionEnd hook cleans up worker and socket on exit
  3. Stale sockets don't block new sessions
  4. First observation always captured (no race condition)
  5. No orphaned "active" sessions in DB after 24h
  6. Context loads on /resume
  7. All tests pass

Notes

  • All hooks must return {"continue": true, "suppressOutput": true} on error
  • Hooks have 60s default timeout (configurable)
  • Worker is detached process, doesn't block Claude Code
  • SessionEnd hooks "cannot block session termination" per Claude Code docs
  • Streaming input mode is the recommended SDK approach for this architecture