airkjw/claude-mem

Alex Newman 307c87b9f6 refactor: restructure session logic documentation and prioritize happy path verification

2025-10-16 17:49:35 -04:00

Session Logic Fixes - Claude-Mem

Status: Planning
Created: 2025-10-16
Priority: High
Estimated Effort: 2-3 days

Executive Summary

The claude-mem session logic architecture is fundamentally sound, using Claude Agent SDK in streaming input mode with Unix socket IPC for real-time observation processing. However, we need to verify the basic happy path works end-to-end before addressing edge cases.

Critical Goal: Session ends → summary generated → next session immediately sees summary in context

Overall Assessment: Architecture is correct, but needs systematic verification that the happy path works, then resilience improvements

Current Status: Unknown if basic cycle works - need to test and debug the core flow first

Feedback Applied (2025-10-16)

Round 1: Technical Corrections

✅ Confirmed architectural approach is sound
❌ Corrected: SessionEnd hooks already exist in Claude Code - we're configuring, not implementing
✅ Technical fixes for resilience issues are sound

Round 2: Priority Reordering (MAJOR CHANGE)

Critical realization: The document focused on edge cases (zombies, crashes) when the basic happy path might not even work yet.

Complete restructure:

Phase 0 (NEW - TOP PRIORITY): Verify the basic cycle works
- Does Stop hook fire on normal exit?
- Does worker generate and store summary?
- Does context hook load summaries on next session?
- End-to-end integration test
Phase 1 (SECOND PRIORITY): Fix resilience issues
- Zombie workers, race conditions, stale sockets
- All the original issues moved here

Key principle: Everything else is irrelevant if "session ends → next session sees summary" doesn't work.

Revised Focus: Get the happy path working first, then worry about edge cases.

Architecture Overview

Current Flow

SessionStart (startup)
  → context-hook.ts:15
  → Loads recent summaries from DB
  → Outputs markdown to stdout (becomes context)

UserPromptSubmit
  → new-hook.ts:16
  → Creates SDK session in DB (status='active')
  → Spawns detached worker process
  → Worker starts immediately, hooks return

Worker Process (worker.ts:75)
  → Starts Unix socket server at /tmp/claude-mem-worker-{id}.sock
  → Runs SDK agent with streaming input (async generator)
  → Yields init prompt to SDK agent
  → Waits for messages from hooks

PostToolUse (fired for each tool)
  → save-hook.ts:24
  → Sends observation to worker via Unix socket
  → Worker receives → yields to SDK agent
  → SDK agent analyzes → returns <observation> XML
  → Worker parses XML → stores in observations table

Stop (session ends)
  → summary-hook.ts:15
  → Sends FINALIZE message to worker via socket
  → Worker yields finalize prompt to SDK agent
  → SDK agent generates <summary> XML
  → Worker parses → stores in session_summaries table
  → Worker marks session completed, closes socket, exits

Key Components

Hook Files:

src/hooks/context.ts - SessionStart hook logic
src/hooks/new.ts - UserPromptSubmit hook logic
src/hooks/save.ts - PostToolUse hook logic
src/hooks/summary.ts - Stop hook logic
src/bin/hooks/*.ts - Entry point wrappers for each hook

Worker:

src/sdk/worker.ts - Main worker process with SDK integration
src/sdk/prompts.ts - Prompt generation for SDK agent
src/sdk/parser.ts - XML parser for SDK responses

Database:

src/services/sqlite/HooksDatabase.ts - Lightweight DB interface for hooks
src/services/sqlite/migrations.ts - Schema definitions

Configuration:

hooks/hooks.json - Hook configuration for Claude Code plugin

Technologies

IPC: Unix domain sockets (/tmp/claude-mem-worker-{id}.sock)
SDK Mode: Streaming input (async generator pattern)
Output Format: XML blocks (<observation> and <summary>)
Process Model: Detached worker (spawn with detached: true, stdio: 'ignore')
Database: SQLite with Bun
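The hook snippets later in this document write `JSON.stringify(message) + '\n'` to the socket and the worker keeps a string `buffer`, which suggests newline-delimited JSON framing. A minimal sketch of that framing follows, assuming only the two message shapes that appear in this document; `encodeMessage` and `createLineDecoder` are hypothetical names, not functions from the codebase.

```typescript
// Message shapes inferred from the hook snippets in this document;
// the real worker may carry additional fields (assumption).
type WorkerMessage =
  | { type: 'observation'; tool_name: string; tool_input: string; tool_output: string }
  | { type: 'finalize' };

/** Serialize one message as a single newline-terminated JSON line. */
function encodeMessage(message: WorkerMessage): string {
  // JSON.stringify escapes newlines inside strings, so one message is one line.
  return JSON.stringify(message) + '\n';
}

/**
 * Hypothetical helper: buffers socket chunks and returns every complete
 * message received so far. A partial trailing line stays in the buffer.
 */
function createLineDecoder(): (chunk: string) => WorkerMessage[] {
  let buffer = '';
  return (chunk: string): WorkerMessage[] => {
    buffer += chunk;
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // the last element is the incomplete remainder
    return lines
      .filter((line) => line.trim().length > 0)
      .map((line) => JSON.parse(line) as WorkerMessage);
  };
}
```

Buffering matters because a single `data` event may carry half a message or several messages at once; parsing each chunk directly would fail intermittently.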

Identified Issues

Phase 0: Verify Happy Path Works (DO THIS FIRST)

Priority: CRITICAL - Everything else is irrelevant if the basic cycle doesn't work

Goal: Prove that when a session ends normally, the next session immediately sees the summary in its context.

Test 0.1: Does Stop Hook Fire on Normal Exit?

What to test:

# Start Claude Code session
claude

# Do some work (read files, etc)

# Exit normally
exit

# Check logs - did Stop hook run?

Expected behavior:

Stop hook (summary-hook) should fire
Should send FINALIZE message to worker socket
Worker should receive it and generate summary

How to verify:

Add logging to src/hooks/summary.ts at the top of summaryHook()
Add logging when sending socket message
Exit session normally and check logs

If it doesn't work: Debug why Stop hook isn't firing or why socket message fails

Test 0.2: Does Worker Receive FINALIZE and Generate Summary?

What to test: After Stop hook fires, does the worker:

Receive the FINALIZE message
Yield finalize prompt to SDK agent
Get back a summary from SDK
Parse the XML
Store it in session_summaries table

How to verify:

Add console.error logging in src/sdk/worker.ts:239 in the message handler
Log when FINALIZE is received
Log the SDK agent response
Log when summary is parsed

Query DB after session ends:

sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM session_summaries ORDER BY created_at DESC LIMIT 1"

If it doesn't work:

Check if worker is even running (ps aux | grep worker)
Check if socket message arrived
Check if SDK agent returned valid XML
Check if parser worked
Check if DB insert succeeded
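To tell "the SDK agent returned no summary" apart from "the parser rejected it", it may help to classify the raw response before parsing. The sketch below assumes the `<summary>` tag carries no attributes and treats the inner content as opaque text; `extractBlocks` and `diagnoseSummaryResponse` are hypothetical helpers, and the real logic in src/sdk/parser.ts may differ.

```typescript
/**
 * Returns the trimmed inner text of every <tag>...</tag> block.
 * Assumes tags without attributes (assumption, not confirmed by the source).
 */
function extractBlocks(response: string, tag: 'observation' | 'summary'): string[] {
  // [\s\S] matches across newlines; the lazy quantifier stops at the first closing tag.
  const pattern = new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`, 'g');
  const blocks: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(response)) !== null) {
    blocks.push(match[1].trim());
  }
  return blocks;
}

type SummaryDiagnosis = 'ok' | 'no-summary-tag' | 'unclosed-summary' | 'empty-summary';

/** Classifies a finalize response so a log line can name the failing step. */
function diagnoseSummaryResponse(response: string): SummaryDiagnosis {
  if (!response.includes('<summary>')) return 'no-summary-tag';
  const blocks = extractBlocks(response, 'summary');
  if (blocks.length === 0) return 'unclosed-summary';
  return blocks[0].length > 0 ? 'ok' : 'empty-summary';
}
```

Logging the diagnosis next to the raw response narrows a failure to the prompt, the response, or the parser in one run.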

Test 0.3: Does Context Hook Load Summaries?

What to test: When starting a new session, does context hook:

Query recent summaries from DB
Format them as markdown
Output to stdout (becomes context)

How to verify:

Add logging to src/hooks/context.ts:24
Log the summaries retrieved from DB
Log the markdown output
Start new session and check:
- Console output (should see markdown)
- Claude's context (ask "what did we do last session?")

If it doesn't work:

Check if SessionStart hook is firing
Check if DB query returns results
Check if markdown is being formatted correctly
Check if output is going to stdout properly
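The formatting step can be checked in isolation from the DB and from Claude Code. The following is a sketch only: the row shape uses the `summary_text` and `created_at` column names from the sqlite3 queries in this document, and the heading text is invented for illustration rather than taken from src/hooks/context.ts.

```typescript
// Assumed row shape; the real session_summaries schema may have more columns.
interface SummaryRow {
  summary_text: string;
  created_at: string;
}

/** Renders summaries as markdown; returns '' when there is nothing to show. */
function formatSummariesAsMarkdown(rows: SummaryRow[]): string {
  if (rows.length === 0) return '';
  const sections = rows.map(
    (row) => `## Session ${row.created_at}\n\n${row.summary_text.trim()}`
  );
  // Heading text is hypothetical.
  return ['# Recent Sessions', ...sections].join('\n\n') + '\n';
}
```

When wiring this up, write the markdown to stdout and any debug output (row counts, query timing) to stderr via `console.error`, so the debug lines do not end up in the session context.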

Test 0.4: End-to-End Integration Test

What to test: Full cycle from start to finish:

# Session 1
claude
# Do some work
echo "test file" > test.txt
cat test.txt
exit

# Verify summary was stored
sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT summary_text FROM session_summaries ORDER BY created_at DESC LIMIT 1"

# Session 2
claude
# Ask Claude: "What did we do last session?"
# Expected: Claude should know we created and read test.txt

Success criteria:

✅ Summary appears in DB after session 1
✅ Session 2 context includes summary from session 1
✅ Claude can answer questions about previous session

If it doesn't work:

Review logs from Tests 0.1-0.3
Add more granular logging
Check each step of the pipeline

Common Failure Points & Debugging

If summaries aren't showing up in new sessions:

Stop hook not configured/firing:

# Check hooks config
cat ~/.claude/plugins/claude-mem/hooks.json | jq '.hooks.Stop'

# Should see summary-hook configured
# If not, hooks.json is wrong or plugin not installed

Worker not running:

ps aux | grep claude-mem-worker

# If no worker, UserPromptSubmit hook failed to spawn it
# Check new-hook logs

Socket communication failing:

# Check socket exists
ls /tmp/claude-mem-worker-*.sock

# Try to connect manually
echo '{"type":"finalize"}' | nc -U /tmp/claude-mem-worker-*.sock

SDK agent not returning summary:
- Check API key is set
- Check SDK agent prompt is valid
- Check XML parser is working
- Add logging to see SDK response

DB write failing:

# Check DB exists and contains an active session row
sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM sdk_sessions WHERE status='active'"

# If no active session, new-hook didn't create it

Context hook not loading:

# Check SessionStart hook configured
cat ~/.claude/plugins/claude-mem/hooks.json | jq '.hooks.SessionStart'

# Start session and check for context output
# Should see markdown in initial context

Debugging Checklist:

Verify all hooks are configured in hooks.json
Verify plugin is installed correctly
Add console.error logging to all hooks (goes to stderr, visible in terminal)
Check each step of the pipeline systematically
Don't assume anything works - verify each piece
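Since the checklist calls for `console.error` logging in every hook, a shared formatter keeps the lines greppable across hooks. This is a sketch under assumptions: `formatHookLog` and `logHook` are hypothetical helpers, and the prefix simply mirrors the `[claude-mem save]` style used in the snippets in this document.

```typescript
/**
 * Formats one debug line for a hook. The `now` parameter exists so the
 * output is deterministic in tests.
 */
function formatHookLog(
  hook: string,
  step: string,
  details: Record<string, unknown> = {},
  now: Date = new Date()
): string {
  return `${now.toISOString()} [claude-mem ${hook}] ${step} ${JSON.stringify(details)}`;
}

/** Writes to stderr so stdout stays reserved for the hook's JSON or markdown output. */
function logHook(hook: string, step: string, details: Record<string, unknown> = {}): void {
  console.error(formatHookLog(hook, step, details));
}
```

A call such as `logHook('summary', 'finalize-sent', { session_id })` at each pipeline step makes it possible to see where the chain breaks by reading one log from top to bottom.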

Phase 1: Critical Resilience Issues (Fix After Happy Path Works)

1. Zombie Worker Processes

Severity: High
Impact: Memory/CPU waste, orphaned processes accumulate

Problem: If Stop hook never fires (user Ctrl-C, Claude Code crash), worker runs forever waiting for FINALIZE message.

Location: src/sdk/worker.ts:239

// Current code - infinite loop with no timeout
while (!this.isFinalized) {
  if (this.pendingMessages.length === 0) {
    await this.sleep(100);
    continue;
  }
  // ... process messages
}

Fix Required:

// Add watchdog timer
class SDKWorker {
  private maxIdleTime = 2 * 60 * 60 * 1000; // 2 hours
  private lastActivityTime = Date.now();

  private updateActivity(): void {
    this.lastActivityTime = Date.now();
  }

  private async* createMessageGenerator(): AsyncIterable<...> {
    // Yield initial prompt
    const initPrompt = buildInitPrompt(...);
    yield { type: 'user', message: { role: 'user', content: initPrompt } };
    this.updateActivity();

    while (!this.isFinalized) {
      // Check for timeout
      const idleTime = Date.now() - this.lastActivityTime;
      if (idleTime > this.maxIdleTime) {
        console.error(`[SDK Worker] Timeout - no activity for ${this.maxIdleTime / 1000}s`);
        this.isFinalized = true;
        break;
      }

      if (this.pendingMessages.length === 0) {
        await this.sleep(100);
        continue;
      }

      // Process messages and update activity
      this.updateActivity();
      // ... existing message processing
    }
  }
}

Testing:

Start claude-mem session
Kill Claude Code process (kill -9)
Verify worker exits after 2 hours
Check no orphaned processes remain
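Waiting two hours is impractical for an automated test. One option is to isolate the idle check behind an injectable clock so the timeout can be exercised instantly. The class below is a sketch; `IdleWatchdog` and its method names are hypothetical and do not exist in src/sdk/worker.ts.

```typescript
/** Idle watchdog with an injectable clock (hypothetical helper). */
class IdleWatchdog {
  private readonly maxIdleMs: number;
  private readonly now: () => number;
  private lastActivityTime: number;

  constructor(maxIdleMs: number, now: () => number = () => Date.now()) {
    this.maxIdleMs = maxIdleMs;
    this.now = now;
    this.lastActivityTime = now();
  }

  /** Call whenever a message is processed. */
  touch(): void {
    this.lastActivityTime = this.now();
  }

  /** True once more than maxIdleMs has elapsed since the last activity. */
  isExpired(): boolean {
    return this.now() - this.lastActivityTime > this.maxIdleMs;
  }
}
```

In the message loop, `isExpired()` would replace the inline `idleTime > this.maxIdleTime` comparison and `touch()` would replace `updateActivity()`; a test can then pass a fake clock and a short limit.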

2. SessionEnd Hook Not Configured

Severity: High
Impact: No cleanup on abrupt exit, sessions stuck in "active" status

Problem: SessionEnd hooks are a built-in Claude Code feature that "run when a session ends" and "cannot block session termination but can perform cleanup tasks" (docs). However, claude-mem's hooks/hooks.json does NOT configure this hook. Worker doesn't get cleaned up when Claude Code exits abruptly.

Note: This is NOT a missing feature in Claude Code - SessionEnd hooks already exist. We just need to configure them.

Current Configuration: hooks/hooks.json:1-51

{
  "hooks": {
    "SessionStart": [...],
    "UserPromptSubmit": [...],
    "PostToolUse": [...],
    "Stop": [...]
    // SessionEnd is MISSING
  }
}

Fix Required:

SessionEnd hooks receive structured input including:

{
  "session_id": "abc123",
  "transcript_path": "~/.claude/projects/.../transcript.jsonl",
  "cwd": "/Users/...",
  "hook_event_name": "SessionEnd",
  "reason": "exit"  // or "clear", "logout", "prompt_input_exit", etc.
}
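Because the hook reads this payload from stdin as untyped JSON, a runtime check before use can prevent confusing failures. The guard below is a sketch based only on the fields in the example payload above; `SessionEndPayload` and `isSessionEndPayload` are hypothetical names, and `reason` is kept as a plain string because the list of values here ends in "etc.".

```typescript
// Fields taken from the example payload in this document.
interface SessionEndPayload {
  session_id: string;
  transcript_path: string;
  cwd: string;
  hook_event_name: 'SessionEnd';
  reason: string;
}

/** Runtime check for JSON parsed from stdin (hypothetical helper). */
function isSessionEndPayload(value: unknown): value is SessionEndPayload {
  if (typeof value !== 'object' || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    typeof record.session_id === 'string' &&
    typeof record.transcript_path === 'string' &&
    typeof record.cwd === 'string' &&
    record.hook_event_name === 'SessionEnd' &&
    typeof record.reason === 'string'
  );
}
```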

Implementation Steps:

Add SessionEnd configuration to hooks/hooks.json:

For events like SessionEnd that don't use matchers, we can omit the matcher field:

{
  "hooks": {
    "SessionEnd": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "bun ${CLAUDE_PLUGIN_ROOT}/scripts/hooks/cleanup-hook.js",
            "timeout": 60
          }
        ]
      }
    ]
  }
}

Create src/hooks/cleanup.ts:

import { HooksDatabase } from '../services/sqlite/HooksDatabase.js';
import { getWorkerSocketPath } from '../shared/paths.js';
import { existsSync, unlinkSync } from 'fs';
import { execSync } from 'child_process';

export interface SessionEndInput {
  session_id: string;
  cwd: string;
  reason: 'exit' | 'clear' | 'logout' | 'prompt_input_exit' | 'other';
  [key: string]: any;
}

/**
 * Cleanup Hook - SessionEnd
 * Cleans up worker process and marks session as terminated
 */
export function cleanupHook(input?: SessionEndInput): void {
  try {
    if (!input) {
      console.log('No input provided - this script is designed to run as a Claude Code SessionEnd hook');
      process.exit(0);
    }

    const { session_id, reason } = input;

    // Find active SDK session
    const db = new HooksDatabase();
    const session = db.findActiveSDKSession(session_id);

    if (!session) {
      db.close();
      console.log('{"suppressOutput": true}');
      process.exit(0);
    }

    const socketPath = getWorkerSocketPath(session.id);

    // Mark session as failed (not completed since it was terminated)
    db.markSessionFailed(session.id);
    db.close();

    // Kill the worker BEFORE removing the socket file: lsof locates the
    // worker by the socket path, so the file must still exist at this point.
    if (existsSync(socketPath)) {
      try {
        // `lsof -t` prints only the PIDs of processes that have the file open
        const lsofOutput = execSync(`lsof -t "${socketPath}" 2>/dev/null || true`, { encoding: 'utf8' });
        for (const line of lsofOutput.split('\n')) {
          const pid = parseInt(line, 10);
          if (!Number.isNaN(pid)) {
            console.error(`[claude-mem cleanup] Killing worker process ${pid}`);
            process.kill(pid, 'SIGTERM');
          }
        }
      } catch (err) {
        // Worker already dead or couldn't find it - that's fine
      }

      // Remove the socket file last
      try {
        unlinkSync(socketPath);
      } catch (err: any) {
        console.error(`[claude-mem cleanup] Failed to remove socket: ${err.message}`);
      }
    }

    console.log('{"suppressOutput": true}');
    process.exit(0);

  } catch (error: any) {
    console.error(`[claude-mem cleanup error: ${error.message}]`);
    console.log('{"suppressOutput": true}');
    process.exit(0);
  }
}

Create src/bin/hooks/cleanup-hook.ts:

#!/usr/bin/env bun

/**
 * Cleanup Hook Entry Point - SessionEnd
 * Standalone executable for plugin hooks
 */

import { cleanupHook } from '../../hooks/cleanup.js';

// Read input from stdin
const input = await Bun.stdin.text();

try {
  const parsed = input.trim() ? JSON.parse(input) : undefined;
  cleanupHook(parsed);
} catch (error: any) {
  console.error(`[claude-mem cleanup-hook error: ${error.message}]`);
  console.log('{"suppressOutput": true}');
  process.exit(0);
}

Update build process to compile cleanup-hook.ts to scripts/hooks/cleanup-hook.js

Testing:

Start claude-mem session
Exit Claude Code with Ctrl-C
Verify worker process is killed
Verify socket file is removed
Verify session marked as "failed" in DB

3. Stale Socket Files Block New Sessions

Severity: Medium
Impact: Worker fails to start if previous worker crashed

Problem: If worker crashes, socket file persists at /tmp/claude-mem-worker-{id}.sock. Next worker with same session ID fails with EADDRINUSE.

Location: src/sdk/worker.ts:111-163

private async startSocketServer(): Promise<void> {
  // Current code unlinks any existing socket file without checking whether a live worker owns it
  if (existsSync(this.socketPath)) {
    unlinkSync(this.socketPath);
  }

  return new Promise((resolve, reject) => {
    this.server = net.createServer((socket) => { ... });
    this.server.listen(this.socketPath, () => { resolve(); });
  });
}

Fix Required:

private async startSocketServer(): Promise<void> {
  // Clean up stale socket if it exists
  if (existsSync(this.socketPath)) {
    // Test if socket is responsive
    const isStale = await this.testSocketStale(this.socketPath);
    if (isStale) {
      console.error(`[SDK Worker] Removing stale socket: ${this.socketPath}`);
      unlinkSync(this.socketPath);
    } else {
      // Socket is active - another worker is using this session ID
      throw new Error(`Socket already in use: ${this.socketPath}`);
    }
  }

  return new Promise((resolve, reject) => {
    this.server = net.createServer((socket) => {
      let buffer = '';
      socket.on('data', (chunk) => {
        // ... existing code
      });
    });

    this.server.on('error', (err: any) => {
      if (err.code === 'EADDRINUSE') {
        console.error(`[SDK Worker] Socket already in use: ${this.socketPath}`);
      }
      reject(err);
    });

    this.server.listen(this.socketPath, () => {
      resolve();
    });
  });
}

/**
 * Test if socket file is stale (no process listening)
 */
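private async testSocketStale(socketPath: string): Promise<boolean> {
  return new Promise((resolve) => {
    const testClient = net.connect(socketPath);

    // Settle once, and always release the timer and the probe connection
    const finish = (isStale: boolean) => {
      clearTimeout(timer);
      testClient.destroy();
      resolve(isStale);
    };

    // Timeout after 100ms: treat an unresponsive socket as stale
    const timer = setTimeout(() => finish(true), 100);

    // Connected: a live worker is listening, so the socket is not stale
    testClient.on('connect', () => finish(false));

    // Connection failed: nothing is listening, so the socket is stale
    testClient.on('error', () => finish(true));
  });
}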

Testing:

Start worker, kill it with kill -9
Verify socket file persists
Start new worker with same session ID
Verify old socket is detected as stale and removed
Verify new worker starts successfully
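The stale check treats every connection error as "stale". A narrower variant could look at the error code first and only unlink when nothing is listening. The mapping below is a sketch; `classifyProbeError` is a hypothetical helper, while the code names are standard errno values that Node.js exposes on `err.code`.

```typescript
type ProbeResult = 'stale' | 'missing' | 'unknown';

/** Maps the error code from a failed connect probe to a decision. */
function classifyProbeError(code: string | undefined): ProbeResult {
  switch (code) {
    case 'ECONNREFUSED':
      return 'stale'; // socket file exists but no process is listening
    case 'ENOENT':
      return 'missing'; // file vanished between the existence check and connect
    default:
      return 'unknown'; // e.g. EACCES: avoid deleting what cannot be diagnosed
  }
}
```

Only the `'stale'` result would lead to `unlinkSync`; `'missing'` needs no cleanup, and `'unknown'` is probably better surfaced as an error than silently removed.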

4. Race Condition on First Observation

Severity: Medium
Impact: First observation might be lost if socket not ready

Problem: Worker startup is async (socket creation, SDK initialization). PostToolUse can fire immediately after UserPromptSubmit returns, before socket is ready.

Current Flow:

UserPromptSubmit → creates session → spawns worker → returns immediately
PostToolUse fires (Claude reads a file)
save-hook tries to connect → ENOENT (socket not ready yet)
Connection fails → logs error, continues
First observation lost

Location: src/hooks/save.ts:71

const client = net.connect(socketPath, () => {
  client.write(JSON.stringify(message) + '\n');
  client.end();
});

client.on('error', (err) => {
  // Currently just logs and continues - observation lost
  console.error(`[claude-mem save] Socket error: ${err.message}`);
});

Fix Required:

/**
 * Save Hook - PostToolUse
 * Sends tool observations to worker via Unix socket with retry logic
 */
export function saveHook(input?: PostToolUseInput): void {
  try {
    if (!input) {
      console.log('No input provided - this script is designed to run as a Claude Code PostToolUse hook');
      process.exit(0);
    }

    const { session_id, tool_name, tool_input, tool_output } = input;

    if (SKIP_TOOLS.has(tool_name)) {
      console.log('{"continue": true, "suppressOutput": true}');
      process.exit(0);
    }

    const db = new HooksDatabase();
    const session = db.findActiveSDKSession(session_id);
    db.close();

    if (!session) {
      console.log('{"continue": true, "suppressOutput": true}');
      process.exit(0);
    }

    const socketPath = getWorkerSocketPath(session.id);
    const message = {
      type: 'observation',
      tool_name,
      tool_input: JSON.stringify(tool_input),
      tool_output: JSON.stringify(tool_output)
    };

    // Try to send with retries
    sendWithRetry(socketPath, message, 5).then(() => {
      console.log('{"continue": true, "suppressOutput": true}');
      process.exit(0);
    }).catch((err) => {
      console.error(`[claude-mem save] Failed after retries: ${err.message}`);
      console.log('{"continue": true, "suppressOutput": true}');
      process.exit(0);
    });

  } catch (error: any) {
    console.error(`[claude-mem save error: ${error.message}]`);
    console.log('{"continue": true, "suppressOutput": true}');
    process.exit(0);
  }
}

/**
 * Send message to socket with exponential backoff retry
 */
async function sendWithRetry(
  socketPath: string,
  message: any,
  maxRetries: number
): Promise<void> {
  let retries = maxRetries;
  let delay = 100; // Start with 100ms

  while (retries > 0) {
    try {
      await sendMessage(socketPath, message);
      return; // Success
    } catch (err: any) {
      retries--;
      if (retries === 0) {
        throw err; // Out of retries
      }

      // Exponential backoff
      await sleep(delay);
      delay = Math.min(delay * 2, 2000); // Cap at 2s
    }
  }
}

/**
 * Send single message to socket
 */
function sendMessage(socketPath: string, message: any): Promise<void> {
  return new Promise((resolve, reject) => {
    const client = net.connect(socketPath, () => {
      client.write(JSON.stringify(message) + '\n');
      client.end();
      resolve();
    });

    client.on('error', (err) => {
      reject(err);
    });
  });
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

Testing:

Add artificial delay in worker startup
Fire PostToolUse immediately after UserPromptSubmit
Verify save-hook retries and succeeds
Verify observation is captured
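To pick a sensible retry count, it helps to know how long the hook can spend sleeping. The sketch below reproduces the delay schedule of a loop shaped like `sendWithRetry` (one sleep after every failed attempt except the last, doubling from 100ms with a 2s cap); `backoffDelays` and `totalBackoffMs` are hypothetical helpers for this calculation only.

```typescript
/** Delays slept between attempts: one after each failure except the last. */
function backoffDelays(maxAttempts: number, initialMs = 100, capMs = 2000): number[] {
  const delays: number[] = [];
  let delay = initialMs;
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    delays.push(delay);
    delay = Math.min(delay * 2, capMs);
  }
  return delays;
}

/** Worst-case total time spent sleeping, excluding connection attempts. */
function totalBackoffMs(maxAttempts: number, initialMs = 100, capMs = 2000): number {
  return backoffDelays(maxAttempts, initialMs, capMs).reduce((sum, ms) => sum + ms, 0);
}
```

With 5 attempts the schedule is 100, 200, 400, 800 ms, so the hook sleeps for at most 1.5s in total. With 8 attempts the cap applies and the total rises to 7.1s, still far below the 60s hook timeout.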

Medium Priority (Should Fix)

5. Orphaned Active Sessions in Database

Severity: Low
Impact: DB bloat, confusion about session status

Problem: Sessions marked "active" never transition to "completed" or "failed" if worker crashes or is killed.

Fix Required:

Create cleanup script: src/commands/cleanup-sessions.ts

import { HooksDatabase } from '../services/sqlite/HooksDatabase.js';

/**
 * Mark old active sessions as failed
 */
export function cleanupSessions(maxAgeHours: number = 24): void {
  const db = new HooksDatabase();
  const maxAgeMs = maxAgeHours * 60 * 60 * 1000;
  const cutoffEpoch = Date.now() - maxAgeMs;

  const query = (db as any).db.query(`
    UPDATE sdk_sessions
    SET status = 'failed', completed_at = datetime('now'), completed_at_epoch = ?
    WHERE status = 'active' AND started_at_epoch < ?
  `);

  const result = query.run(Date.now(), cutoffEpoch);
  console.log(`Marked ${result.changes} old active sessions as failed`);

  db.close();
}

Add to CLI: src/bin/cli.ts

.command('cleanup-sessions')
.description('Mark old active sessions as failed')
.option('--max-age <hours>', 'Maximum age in hours', '24')
.action((options) => {
  cleanupSessions(parseInt(options.maxAge, 10));
})

Alternative: Add auto-expiry check in context-hook:

// Before loading summaries, clean up stale sessions
const maxAgeMs = 24 * 60 * 60 * 1000;
const cutoffEpoch = Date.now() - maxAgeMs;
db.db.query(`
  UPDATE sdk_sessions
  SET status = 'failed'
  WHERE status = 'active' AND started_at_epoch < ?
`).run(cutoffEpoch);
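The cutoff arithmetic is easy to get wrong by a factor of 1000 if epochs are stored in seconds rather than milliseconds. A pure predicate that mirrors the WHERE clause allows it to be unit tested without a database. This sketch assumes `started_at_epoch` is in milliseconds (it is compared against `Date.now()` above); `SdkSessionRow` and `isStaleSession` are hypothetical names.

```typescript
// Only the columns used by the UPDATE statements in this document.
interface SdkSessionRow {
  status: 'active' | 'completed' | 'failed';
  started_at_epoch: number; // milliseconds (assumption)
}

/** True when the row would be matched by the cleanup UPDATE's WHERE clause. */
function isStaleSession(row: SdkSessionRow, nowMs: number, maxAgeHours = 24): boolean {
  const cutoffEpoch = nowMs - maxAgeHours * 60 * 60 * 1000;
  return row.status === 'active' && row.started_at_epoch < cutoffEpoch;
}
```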

6. SessionStart Only Runs on "startup"

Severity: Low
Impact: No context loaded on /resume

Problem: context-hook only loads context on "startup" source, skips "resume", "clear", and "compact".

Location: src/hooks/context.ts:24

// Only run on startup (not on resume)
if (input.source && input.source !== 'startup') {
  console.log('');
  process.exit(0);
}

Fix Required:

// Load context on startup and resume
if (input.source && input.source !== 'startup' && input.source !== 'resume') {
  console.log(''); // Skip for clear/compact
  process.exit(0);
}

Rationale:

startup: Load context (project overview)
resume: Load context (user continuing work)
clear: Skip (user wants fresh start)
compact: Skip (just memory optimization, context preserved)
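The rationale above can be captured as a small predicate, which is easier to test than a negated compound condition. This is a sketch; `shouldLoadContext` is a hypothetical name, and it keeps the existing behavior of loading context when no source is provided.

```typescript
type SessionStartSource = 'startup' | 'resume' | 'clear' | 'compact';

/**
 * Load when the source is absent, 'startup', or 'resume';
 * skip for 'clear' and 'compact'.
 */
function shouldLoadContext(source?: SessionStartSource): boolean {
  if (!source) return true; // the existing check only skips when a source is present
  return source === 'startup' || source === 'resume';
}
```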

Low Priority (Nice to Have)

7. No Cost Control or Observation Limits

Severity: Low
Impact: Long sessions can be expensive

Problem: No limits on SDK agent API calls. A session with thousands of tools could rack up significant costs.

Fix Ideas:

Add observation counter, warn after N observations
Add cost estimation based on token usage
Add budget limit in config
Batch observations (send N at once instead of one-by-one)

Example:

class SDKWorker {
  private observationCount = 0;
  private maxObservations = 1000;

  private handleMessage(message: WorkerMessage): void {
    if (message.type === 'observation') {
      this.observationCount++;
      if (this.observationCount > this.maxObservations) {
        console.error(`[SDK Worker] Exceeded max observations: ${this.maxObservations}`);
        this.isFinalized = true;
        return;
      }
    }
    this.pendingMessages.push(message);
  }
}
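The batching idea from the list above could look roughly like this. It is a sketch only: `ObservationBatcher` is a hypothetical class that is not taken from the existing worker, and how a batch would be turned into a single prompt is left open.

```typescript
/** Collects items and hands them to `flush` in groups of `batchSize`. */
class ObservationBatcher<T> {
  private pending: T[] = [];
  private readonly batchSize: number;
  private readonly flush: (batch: T[]) => void;

  constructor(batchSize: number, flush: (batch: T[]) => void) {
    if (batchSize < 1) throw new RangeError('batchSize must be at least 1');
    this.batchSize = batchSize;
    this.flush = flush;
  }

  /** Queues one item and flushes as soon as a full batch is available. */
  add(item: T): void {
    this.pending.push(item);
    if (this.pending.length >= this.batchSize) this.drain();
  }

  /** Flushes whatever is left, e.g. right before the finalize prompt. */
  drain(): void {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.flush(batch);
  }
}
```

Calling `drain()` before finalizing matters: otherwise the last partial batch would never reach the SDK agent and would be missing from the summary.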

8. No Health Check Mechanism

Severity: Low
Impact: Can't tell if worker is alive/healthy

Fix Ideas:

Add /status command that checks for active workers
Add health check endpoint on socket (ping/pong)
Add metrics to DB (last_activity_at)

9. No Observation Deduplication

Severity: Low
Impact: Duplicate observations if same tool executed multiple times

Fix Ideas:

Hash tool_name + tool_input + tool_output
Check for duplicate hash before storing
Or let SDK agent handle deduplication naturally
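A sketch of the hash-based option follows. To stay dependency-free it uses 32-bit FNV-1a, which is a non-cryptographic hash swapped in purely for illustration; a real implementation would more likely use a cryptographic hash such as SHA-256, since 32 bits leaves room for accidental collisions. All names here are hypothetical.

```typescript
/** 32-bit FNV-1a over UTF-16 code units, returned as 8 hex digits. */
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ('00000000' + hash.toString(16)).slice(-8);
}

interface ObservationFields {
  tool_name: string;
  tool_input: string;
  tool_output: string;
}

/** Builds the dedup key from the three fields named in the list above. */
function observationKey(o: ObservationFields): string {
  // The NUL separator keeps ("ab", "c") distinct from ("a", "bc") before hashing.
  return fnv1a([o.tool_name, o.tool_input, o.tool_output].join('\u0000'));
}

class ObservationDeduplicator {
  private seen = new Set<string>();

  /** Returns true the first time a key is seen, false for repeats. */
  isNew(o: ObservationFields): boolean {
    const key = observationKey(o);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```

Note that identical tool calls are sometimes meaningful (for example, reading a file before and after an edit produces different output and therefore a different key, but running the same test twice does not), so dropping repeats may lose information the summary would otherwise use.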

Implementation Checklist

Phase 0: Verify Happy Path (DO THIS FIRST - HIGHEST PRIORITY)

Goal: Prove the basic cycle works end-to-end before fixing edge cases.

Test 0.1: Verify Stop Hook Fires
- Add logging to src/hooks/summary.ts
- Exit session normally and verify hook runs
- Verify FINALIZE message is sent to socket
Test 0.2: Verify Worker Generates Summary
- Add logging to worker message handler
- Verify FINALIZE message received
- Verify SDK agent response
- Verify summary parsed and stored in DB
- Query DB to confirm summary exists
Test 0.3: Verify Context Hook Loads Summaries
- Add logging to src/hooks/context.ts
- Start new session, verify summaries loaded
- Verify markdown output to stdout
- Verify Claude has context from previous session
Test 0.4: End-to-End Integration Test
- Run session 1 with test work
- Verify summary in DB
- Run session 2
- Ask Claude about previous session
- Confirm Claude has correct context

STOP HERE: Only proceed to Phase 1 after confirming all Phase 0 tests pass.

Phase 1: Critical Resilience Fixes (Do After Phase 0)

Add watchdog timer to worker (Issue #1)
- Add lastActivityTime tracking
- Add timeout check in message generator loop
- Test with zombie worker scenario
Configure existing SessionEnd hook (Issue #2)
- Add SessionEnd configuration to hooks/hooks.json
- Create src/hooks/cleanup.ts (implements cleanup logic)
- Create src/bin/hooks/cleanup-hook.ts (entry point)
- Update build process to compile cleanup-hook
- Test with Ctrl-C exit and verify worker cleanup
Fix stale socket detection (Issue #3)
- Add testSocketStale method
- Update startSocketServer to check for stale sockets
- Test with crashed worker scenario
Fix save-hook race condition (Issue #4)
- Add sendWithRetry function
- Add exponential backoff logic
- Update save-hook to use retry logic
- Test with immediate PostToolUse

Phase 2: Medium Priority

Add session cleanup script (Issue #5)
- Create cleanup-sessions command
- Add to CLI
- Optional: Add auto-cleanup to context-hook
Fix SessionStart source handling (Issue #6)
- Update context-hook to load on "resume"
- Test with /resume command

Phase 3: Low Priority (Optional)

Add cost control (Issue #7)
Add health checks (Issue #8)
Add observation deduplication (Issue #9)

Testing Strategy

Unit Tests

Create tests for each fix:

test/hooks/cleanup.test.ts - SessionEnd hook
test/sdk/worker-timeout.test.ts - Watchdog timer
test/hooks/save-retry.test.ts - Retry logic

Integration Tests

Test complete flows:

Normal flow: SessionStart → UserPromptSubmit → PostToolUse → Stop
Crash recovery: Worker crash → SessionEnd cleanup
Zombie worker: No Stop hook → Worker timeout
Socket race: Immediate PostToolUse → Retry success

Manual Testing Scenarios

Zombie Worker Test:

# Start session
claude
# Kill Claude with Ctrl-C
# Check for worker process
ps aux | grep claude-mem-worker
# Wait 2 hours, verify worker exits

SessionEnd Test:

# Start session
claude
# Exit normally or Ctrl-C
# Verify worker killed
# Verify socket removed
# Check DB for session status
sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM sdk_sessions"

Stale Socket Test:

# Start session
claude
# Kill worker with kill -9 <pid>
# Verify socket exists
ls /tmp/claude-mem-worker-*.sock
# Start new session
# Verify old socket removed, new session starts

Race Condition Test:

# Add delay to worker startup (for testing)
# Start session, immediately run command
claude "list all files"
# Verify first observation captured

File Modifications Required

New Files

src/hooks/cleanup.ts - SessionEnd hook logic
src/bin/hooks/cleanup-hook.ts - SessionEnd entry point
src/commands/cleanup-sessions.ts - Session cleanup script
test/hooks/cleanup.test.ts - Tests for SessionEnd hook
test/sdk/worker-timeout.test.ts - Tests for watchdog timer
test/hooks/save-retry.test.ts - Tests for retry logic

Modified Files

hooks/hooks.json - Add SessionEnd configuration
src/sdk/worker.ts - Add watchdog timer, stale socket detection
src/hooks/save.ts - Add retry logic
src/hooks/context.ts - Load context on resume
src/bin/cli.ts - Add cleanup-sessions command

Dependencies

No new dependencies required. All fixes use existing:

net (Unix sockets)
fs (file operations)
child_process (process management)
bun:sqlite (database)

Success Criteria

Phase 0 (Must Pass First)

✅ Stop hook fires on normal exit
✅ Worker receives FINALIZE and generates summary
✅ Summary is stored in DB correctly
✅ Context hook loads summaries on next session
✅ New session immediately sees previous session's summary in context
✅ End-to-end integration test passes

Phase 1 (After Phase 0 Passes)

✅ Worker processes never become zombies (exit after 2h max)
✅ SessionEnd hook cleans up worker and socket on exit
✅ Stale sockets don't block new sessions
✅ First observation always captured (no race condition)
✅ No orphaned "active" sessions in DB after 24h
✅ Context loads on /resume
✅ All tests pass

References

Claude Code Hooks Documentation: https://docs.claude.com/en/docs/claude-code/hooks
Claude Agent SDK Streaming: https://docs.claude.com/en/api/agent-sdk/streaming-vs-single-mode
Unix Domain Sockets: Node.js net module
SQLite Best Practices: Bun SQLite documentation

Notes

All hooks must return {"continue": true, "suppressOutput": true} on error
Hooks have 60s default timeout (configurable)
Worker is detached process, doesn't block Claude Code
SessionEnd hooks "cannot block session termination" per Claude Code docs
Streaming input mode is the recommended SDK approach for this architecture
