
# Session Logic Fixes - Claude-Mem

**Status:** Planning
**Created:** 2025-10-16
**Priority:** High
**Estimated Effort:** 2-3 days

## Executive Summary

The claude-mem session logic architecture is fundamentally sound: it uses the Claude Agent SDK in streaming input mode, with Unix socket IPC for real-time observation processing. However, **we need to verify the basic happy path works end-to-end before addressing edge cases**.

**Critical Goal:** Session ends → summary generated → next session immediately sees summary in context

**Overall Assessment:** Architecture is correct, but needs systematic verification that the happy path works, then resilience improvements

**Current Status:** Unknown if basic cycle works - need to test and debug the core flow first

## Feedback Applied (2025-10-16)

### Round 1: Technical Corrections
- ✅ Confirmed architectural approach is sound
- ❌ **Corrected:** SessionEnd hooks already exist in Claude Code - we're configuring, not implementing
- ✅ Technical fixes for resilience issues are sound

### Round 2: Priority Reordering (MAJOR CHANGE)
**Critical realization:** The document focused on edge cases (zombies, crashes) when the basic happy path might not even work yet.

**Complete restructure:**
1. **Phase 0 (NEW - TOP PRIORITY):** Verify the basic cycle works
   - Does Stop hook fire on normal exit?
   - Does worker generate and store summary?
   - Does context hook load summaries on next session?
   - End-to-end integration test

2. **Phase 1 (SECOND PRIORITY):** Fix resilience issues
   - Zombie workers, race conditions, stale sockets
   - All the original issues moved here

**Key principle:** Everything else is irrelevant if "session ends → next session sees summary" doesn't work.

**Revised Focus:** Get the happy path working first, then worry about edge cases.

## Architecture Overview

### Current Flow

```
SessionStart (startup)
  → context-hook.ts:15
  → Loads recent summaries from DB
  → Outputs markdown to stdout (becomes context)

UserPromptSubmit
  → new-hook.ts:16
  → Creates SDK session in DB (status='active')
  → Spawns detached worker process
  → Worker starts immediately, hooks return

Worker Process (worker.ts:75)
  → Starts Unix socket server at /tmp/claude-mem-worker-{id}.sock
  → Runs SDK agent with streaming input (async generator)
  → Yields init prompt to SDK agent
  → Waits for messages from hooks

PostToolUse (fired for each tool)
  → save-hook.ts:24
  → Sends observation to worker via Unix socket
  → Worker receives → yields to SDK agent
  → SDK agent analyzes → returns <observation> XML
  → Worker parses XML → stores in observations table

Stop (session ends)
  → summary-hook.ts:15
  → Sends FINALIZE message to worker via socket
  → Worker yields finalize prompt to SDK agent
  → SDK agent generates <summary> XML
  → Worker parses → stores in session_summaries table
  → Worker marks session completed, closes socket, exits
```

### Key Components

**Hook Files:**
- `src/hooks/context.ts` - SessionStart hook logic
- `src/hooks/new.ts` - UserPromptSubmit hook logic
- `src/hooks/save.ts` - PostToolUse hook logic
- `src/hooks/summary.ts` - Stop hook logic
- `src/bin/hooks/*.ts` - Entry point wrappers for each hook

**Worker:**
- `src/sdk/worker.ts` - Main worker process with SDK integration
- `src/sdk/prompts.ts` - Prompt generation for SDK agent
- `src/sdk/parser.ts` - XML parser for SDK responses

**Database:**
- `src/services/sqlite/HooksDatabase.ts` - Lightweight DB interface for hooks
- `src/services/sqlite/migrations.ts` - Schema definitions

**Configuration:**
- `hooks/hooks.json` - Hook configuration for Claude Code plugin

### Technologies

- **IPC:** Unix domain sockets (`/tmp/claude-mem-worker-{id}.sock`)
- **SDK Mode:** Streaming input (async generator pattern)
- **Output Format:** XML blocks (`<observation>` and `<summary>`)
- **Process Model:** Detached worker (spawned with `detached: true`, `stdio: 'ignore'`)
- **Database:** SQLite with Bun
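
The hooks in this plan write each socket message as one JSON object followed by a newline, so the receiving side has to reassemble complete lines from arbitrary chunks. The sketch below illustrates that framing under the assumption of newline-delimited JSON; `createLineDecoder` and the `WorkerMessage` shape are illustrative names, not the worker's actual code.

```typescript
// Hypothetical message shape; the real worker may carry more fields.
type WorkerMessage =
  | { type: "observation"; tool_name: string; tool_input: string; tool_output: string }
  | { type: "finalize" };

// Returns a function that accepts raw socket chunks and returns every
// complete message found so far. Partial lines stay buffered until the
// terminating newline arrives.
function createLineDecoder(): (chunk: string) => WorkerMessage[] {
  let buffer = "";
  return (chunk: string): WorkerMessage[] => {
    buffer += chunk;
    const lines = buffer.split("\n");
    // The last element is "" (chunk ended on a newline) or a partial line.
    buffer = lines.pop() ?? "";
    const messages: WorkerMessage[] = [];
    for (const line of lines) {
      if (line.trim() === "") continue;
      try {
        messages.push(JSON.parse(line) as WorkerMessage);
      } catch {
        // Malformed line: skip it rather than crash the worker.
      }
    }
    return messages;
  };
}
```

Keeping the decoder free of socket code makes it possible to unit-test chunk boundaries without starting a worker.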

## Identified Issues

### Phase 0: Verify Happy Path Works (DO THIS FIRST)

**Priority:** CRITICAL - Everything else is irrelevant if the basic cycle doesn't work

**Goal:** Prove that when a session ends normally, the next session immediately sees the summary in its context.

#### Test 0.1: Does Stop Hook Fire on Normal Exit?

**What to test:**
```bash
# Start Claude Code session
claude

# Do some work (read files, etc)

# Exit normally
exit

# Check logs - did Stop hook run?
```

**Expected behavior:**
- Stop hook (`summary-hook`) should fire
- Should send FINALIZE message to worker socket
- Worker should receive it and generate summary

**How to verify:**
1. Add logging to `src/hooks/summary.ts` at the top of `summaryHook()`
2. Add logging when sending socket message
3. Exit session normally and check logs

**If it doesn't work:** Debug why Stop hook isn't firing or why socket message fails
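
For the logging steps above, one option is a small helper that writes to stderr so that stdout stays reserved for the hook's JSON output. This is only a sketch; `formatHookLog` and `logHook` are hypothetical names, and the log format is an assumption.

```typescript
// Builds one log line. Kept pure (the clock is injectable) so it can be unit-tested.
function formatHookLog(
  hook: string,
  event: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date()
): string {
  return `[claude-mem ${hook}] ${now.toISOString()} ${event} ${JSON.stringify(fields)}`;
}

// Writes to stderr; stdout is left untouched for the hook's own output.
function logHook(hook: string, event: string, fields?: Record<string, unknown>): void {
  console.error(formatHookLog(hook, event, fields));
}
```

A call such as `logHook("summary", "finalize-sent", { session_id })` at the top of `summaryHook()` and again after the socket write would show whether the hook ran and how far it got.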

---

#### Test 0.2: Does Worker Receive FINALIZE and Generate Summary?

**What to test:**
After Stop hook fires, does the worker:
1. Receive the FINALIZE message
2. Yield finalize prompt to SDK agent
3. Get back a summary from SDK
4. Parse the XML
5. Store it in `session_summaries` table

**How to verify:**
1. Add console.error logging in `src/sdk/worker.ts:239` in the message handler
2. Log when FINALIZE is received
3. Log the SDK agent response
4. Log when summary is parsed
5. Query DB after session ends:
   ```bash
   sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM session_summaries ORDER BY created_at DESC LIMIT 1"
   ```

**If it doesn't work:**
- Check if worker is even running (`ps aux | grep worker`)
- Check if socket message arrived
- Check if SDK agent returned valid XML
- Check if parser worked
- Check if DB insert succeeded
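
To check whether the SDK agent returned a usable block before blaming the parser, a quick diagnostic like the one below can be logged next to the raw response. It is a rough sketch and not the logic in `src/sdk/parser.ts`; `extractXmlBlock` is a hypothetical helper that assumes the tag appears without attributes.

```typescript
// Returns the trimmed inner text of the first <tag>...</tag> block,
// or null when the response contains no such block.
function extractXmlBlock(response: string, tag: "summary" | "observation"): string | null {
  const match = response.match(new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`));
  return match?.[1]?.trim() ?? null;
}
```

A `null` result for the finalize response points at the prompt or the SDK call; a non-null result that still fails to store points at the parser or the DB insert.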

---

#### Test 0.3: Does Context Hook Load Summaries?

**What to test:**
When starting a new session, does context hook:
1. Query recent summaries from DB
2. Format them as markdown
3. Output to stdout (becomes context)

**How to verify:**
1. Add logging to `src/hooks/context.ts:24`
2. Log the summaries retrieved from DB
3. Log the markdown output
4. Start new session and check:
   - Console output (should see markdown)
   - Claude's context (ask "what did we do last session?")

**If it doesn't work:**
- Check if SessionStart hook is firing
- Check if DB query returns results
- Check if markdown is being formatted correctly
- Check if output is going to stdout properly
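
The formatting step can be isolated into a pure function so that the DB query and the markdown output are debugged separately. A minimal sketch, assuming rows expose the `summary_text` and `created_at` columns used in this plan's queries; `SummaryRow`, `formatSummariesAsMarkdown`, and the heading text are illustrative, not the existing `context.ts` output.

```typescript
// Hypothetical row shape; only these two columns are assumed.
interface SummaryRow {
  summary_text: string;
  created_at: string;
}

// Renders summaries as markdown. Returns an empty string when there is
// nothing to show, so the hook can print it unconditionally.
function formatSummariesAsMarkdown(rows: SummaryRow[]): string {
  if (rows.length === 0) return "";
  const sections = rows.map(
    (row) => `## Session ${row.created_at}\n\n${row.summary_text.trim()}`
  );
  return ["# Recent Sessions", ...sections].join("\n\n") + "\n";
}
```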

---

#### Test 0.4: End-to-End Integration Test

**What to test:**
Full cycle from start to finish:

```bash
# Session 1
claude
# Do some work
echo "test file" > test.txt
cat test.txt
exit

# Verify summary was stored
sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT summary_text FROM session_summaries ORDER BY created_at DESC LIMIT 1"

# Session 2
claude
# Ask Claude: "What did we do last session?"
# Expected: Claude should know we created and read test.txt
```

**Success criteria:**
- ✅ Summary appears in DB after session 1
- ✅ Session 2 context includes summary from session 1
- ✅ Claude can answer questions about previous session

**If it doesn't work:**
- Review logs from Tests 0.1-0.3
- Add more granular logging
- Check each step of the pipeline

---

#### Common Failure Points & Debugging

**If summaries aren't showing up in new sessions:**

1. **Stop hook not configured/firing:**
   ```bash
   # Check hooks config
   cat ~/.claude/plugins/claude-mem/hooks.json | jq '.hooks.Stop'

   # Should see summary-hook configured
   # If not, hooks.json is wrong or plugin not installed
   ```

2. **Worker not running:**
   ```bash
   ps aux | grep claude-mem-worker

   # If no worker, UserPromptSubmit hook failed to spawn it
   # Check new-hook logs
   ```

3. **Socket communication failing:**
   ```bash
   # Check socket exists
   ls /tmp/claude-mem-worker-*.sock

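   # Try to connect manually. nc -U takes a single path, so replace {id} with
   # the real session ID. Caution: a finalize message ends that worker's session.
   echo '{"type":"finalize"}' | nc -U /tmp/claude-mem-worker-{id}.sock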
   ```

4. **SDK agent not returning summary:**
   - Check API key is set
   - Check SDK agent prompt is valid
   - Check XML parser is working
   - Add logging to see SDK response

5. **DB write failing:**
   ```bash
   # Check the DB exists and whether it has an active session row
   sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM sdk_sessions WHERE status='active'"

   # If no active session, new-hook didn't create it
   ```

6. **Context hook not loading:**
   ```bash
   # Check SessionStart hook configured
   cat ~/.claude/plugins/claude-mem/hooks.json | jq '.hooks.SessionStart'

   # Start session and check for context output
   # Should see markdown in initial context
   ```

**Debugging Checklist:**
- [ ] Verify all hooks are configured in hooks.json
- [ ] Verify plugin is installed correctly
- [ ] Add console.error logging to all hooks (goes to stderr, visible in terminal)
- [ ] Check each step of the pipeline systematically
- [ ] Don't assume anything works - verify each piece
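
The first checklist item can be automated with a small check over the parsed `hooks.json`. This sketch assumes the file has a top-level `hooks` object keyed by event name, as shown in this plan; `missingHookEvents` and `REQUIRED_EVENTS` are hypothetical names.

```typescript
type HooksConfig = { hooks?: Record<string, unknown[]> };

// Events the happy path depends on, per the flow described in this plan.
const REQUIRED_EVENTS = ["SessionStart", "UserPromptSubmit", "PostToolUse", "Stop"] as const;

// Returns the required events that are absent or have no entries.
function missingHookEvents(config: HooksConfig): string[] {
  const configured = config.hooks ?? {};
  return REQUIRED_EVENTS.filter((event) => {
    const entries = configured[event];
    return !Array.isArray(entries) || entries.length === 0;
  });
}
```

Feeding it `JSON.parse` of the installed `hooks.json` gives a one-line answer to "is every hook configured?" before any deeper debugging.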

---

### Phase 1: Critical Resilience Issues (Fix After Happy Path Works)

#### 1. Zombie Worker Processes

**Severity:** High
**Impact:** Memory/CPU waste, orphaned processes accumulate

**Problem:**
If Stop hook never fires (user Ctrl-C, Claude Code crash), worker runs forever waiting for FINALIZE message.

**Location:** `src/sdk/worker.ts:239`
```typescript
// Current code - infinite loop with no timeout
while (!this.isFinalized) {
  if (this.pendingMessages.length === 0) {
    await this.sleep(100);
    continue;
  }
  // ... process messages
}
```

**Fix Required:**
```typescript
// Add watchdog timer
class SDKWorker {
  private maxIdleTime = 2 * 60 * 60 * 1000; // 2 hours
  private lastActivityTime = Date.now();

  private updateActivity(): void {
    this.lastActivityTime = Date.now();
  }

  private async* createMessageGenerator(): AsyncIterable<...> {
    // Yield initial prompt
    const initPrompt = buildInitPrompt(...);
    yield { type: 'user', message: { role: 'user', content: initPrompt } };
    this.updateActivity();

    while (!this.isFinalized) {
      // Check for timeout
      const idleTime = Date.now() - this.lastActivityTime;
      if (idleTime > this.maxIdleTime) {
        console.error(`[SDK Worker] Timeout - no activity for ${this.maxIdleTime / 1000}s`);
        this.isFinalized = true;
        break;
      }

      if (this.pendingMessages.length === 0) {
        await this.sleep(100);
        continue;
      }

      // Process messages and update activity
      this.updateActivity();
      // ... existing message processing
    }
  }
}
```

**Testing:**
1. Start claude-mem session
2. Kill Claude Code process (kill -9)
3. Verify worker exits once the idle timeout elapses (2 hours by default; temporarily lower `maxIdleTime` to keep the test practical)
4. Check no orphaned processes remain
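
Waiting two hours is impractical in an automated test. One way around it, sketched here as an assumption rather than the planned implementation, is to pull the idle check into a small class with an injectable clock; `IdleWatchdog` is a hypothetical name.

```typescript
// Tracks time since the last activity. The clock is injectable so tests
// can advance time without sleeping.
class IdleWatchdog {
  private readonly maxIdleMs: number;
  private readonly now: () => number;
  private lastActivity: number;

  constructor(maxIdleMs: number, now: () => number = () => Date.now()) {
    this.maxIdleMs = maxIdleMs;
    this.now = now;
    this.lastActivity = now();
  }

  // Call whenever a message is processed.
  touch(): void {
    this.lastActivity = this.now();
  }

  // True once the idle period strictly exceeds the limit.
  isExpired(): boolean {
    return this.now() - this.lastActivity > this.maxIdleMs;
  }
}
```

The message generator loop would call `touch()` where it currently calls `updateActivity()` and break when `isExpired()` returns true.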

---

#### 2. SessionEnd Hook Not Configured

**Severity:** High
**Impact:** No cleanup on abrupt exit, sessions stuck in "active" status

**Problem:**
SessionEnd hooks are a built-in Claude Code feature that "run when a session ends" and "cannot block session termination but can perform cleanup tasks" ([docs](https://docs.claude.com/en/docs/claude-code/hooks#hook-events)). However, claude-mem's `hooks/hooks.json` does NOT configure this hook. Worker doesn't get cleaned up when Claude Code exits abruptly.

**Note:** This is NOT a missing feature in Claude Code - SessionEnd hooks already exist. We just need to configure them.

**Current Configuration:** `hooks/hooks.json:1-51`
```json
{
  "hooks": {
    "SessionStart": [...],
    "UserPromptSubmit": [...],
    "PostToolUse": [...],
    "Stop": [...]
    // SessionEnd is MISSING
  }
}
```

**Fix Required:**

SessionEnd hooks receive structured input including the fields below. The `reason` field can also be `"clear"`, `"logout"`, `"prompt_input_exit"`, etc.
```json
{
  "session_id": "abc123",
  "transcript_path": "~/.claude/projects/.../transcript.jsonl",
  "cwd": "/Users/...",
  "hook_event_name": "SessionEnd",
  "reason": "exit"
}
```

**Implementation Steps:**

1. **Add SessionEnd configuration to hooks/hooks.json:**

For events like SessionEnd that don't use matchers, we can omit the matcher field:

```json
{
  "hooks": {
    "SessionEnd": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "bun ${CLAUDE_PLUGIN_ROOT}/scripts/hooks/cleanup-hook.js",
            "timeout": 60
          }
        ]
      }
    ]
  }
}
```

2. **Create src/hooks/cleanup.ts:**
```typescript
import { HooksDatabase } from '../services/sqlite/HooksDatabase.js';
import { getWorkerSocketPath } from '../shared/paths.js';
import { existsSync, unlinkSync } from 'fs';
import { execSync } from 'child_process';

export interface SessionEndInput {
  session_id: string;
  cwd: string;
  reason: 'exit' | 'clear' | 'logout' | 'prompt_input_exit' | 'other';
  [key: string]: any;
}

/**
 * Cleanup Hook - SessionEnd
 * Cleans up worker process and marks session as terminated
 */
export function cleanupHook(input?: SessionEndInput): void {
  try {
    if (!input) {
      console.log('No input provided - this script is designed to run as a Claude Code SessionEnd hook');
      process.exit(0);
    }

    const { session_id, reason } = input;

    // Find active SDK session
    const db = new HooksDatabase();
    const session = db.findActiveSDKSession(session_id);

    if (!session) {
      db.close();
      console.log('{"suppressOutput": true}');
      process.exit(0);
    }

    const socketPath = getWorkerSocketPath(session.id);

    // Try to kill the worker first, while the socket file still exists:
    // lsof finds the worker by socket path, so unlinking first would hide it
    try {
      // -t prints bare PIDs, one per line
      const lsofOutput = execSync(`lsof -t ${socketPath} 2>/dev/null || true`, { encoding: 'utf8' });
      for (const line of lsofOutput.split('\n')) {
        const pid = parseInt(line, 10);
        if (Number.isInteger(pid)) {
          console.error(`[claude-mem cleanup] Killing worker process ${pid}`);
          process.kill(pid, 'SIGTERM');
        }
      }
    } catch (err) {
      // Worker already dead or couldn't find it - that's fine
    }

    // Remove the socket file if the worker did not clean it up itself
    if (existsSync(socketPath)) {
      try {
        unlinkSync(socketPath);
      } catch (err: any) {
        console.error(`[claude-mem cleanup] Failed to remove socket: ${err.message}`);
      }
    }

    // Mark session as failed (not completed since it was terminated)
    db.markSessionFailed(session.id);
    db.close();

    console.log('{"suppressOutput": true}');
    process.exit(0);

  } catch (error: any) {
    console.error(`[claude-mem cleanup error: ${error.message}]`);
    console.log('{"suppressOutput": true}');
    process.exit(0);
  }
}
```

3. **Create src/bin/hooks/cleanup-hook.ts:**
```typescript
#!/usr/bin/env bun

/**
 * Cleanup Hook Entry Point - SessionEnd
 * Standalone executable for plugin hooks
 */

import { cleanupHook } from '../../hooks/cleanup.js';

// Read input from stdin
const input = await Bun.stdin.text();

try {
  const parsed = input.trim() ? JSON.parse(input) : undefined;
  cleanupHook(parsed);
} catch (error: any) {
  console.error(`[claude-mem cleanup-hook error: ${error.message}]`);
  console.log('{"suppressOutput": true}');
  process.exit(0);
}
```

4. **Update build process to compile cleanup-hook.ts to scripts/hooks/cleanup-hook.js**

**Testing:**
1. Start claude-mem session
2. Exit Claude Code with Ctrl-C
3. Verify worker process is killed
4. Verify socket file is removed
5. Verify session marked as "failed" in DB
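
Because the hook must never block or crash session termination, it helps to validate stdin defensively before touching the DB. The sketch below assumes the input fields listed in this section; `parseSessionEndInput` is a hypothetical helper and its fallback values (`""` for `cwd`, `"other"` for `reason`) are assumptions.

```typescript
interface ParsedSessionEnd {
  session_id: string;
  cwd: string;
  reason: string;
}

// Returns a validated input object, or undefined for empty, malformed,
// or incomplete input. Never throws.
function parseSessionEndInput(raw: string): ParsedSessionEnd | undefined {
  if (raw.trim() === "") return undefined;
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return undefined;
  }
  if (typeof value !== "object" || value === null) return undefined;
  const record = value as Record<string, unknown>;
  const sessionId = record["session_id"];
  const cwd = record["cwd"];
  const reason = record["reason"];
  if (typeof sessionId !== "string" || sessionId === "") return undefined;
  return {
    session_id: sessionId,
    cwd: typeof cwd === "string" ? cwd : "",
    reason: typeof reason === "string" ? reason : "other",
  };
}
```

Keeping `reason` as a plain string here avoids rejecting values the type union does not list yet.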

---

#### 3. Stale Socket Files Block New Sessions

**Severity:** Medium
**Impact:** Worker fails to start if previous worker crashed

**Problem:**
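If a worker crashes, its socket file persists at `/tmp/claude-mem-worker-{id}.sock`, and a new worker that tries to listen on that path fails with EADDRINUSE. The current code avoids that error by unlinking any existing file unconditionally, but this cannot distinguish a stale file from a socket that a live worker is still serving.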

**Location:** `src/sdk/worker.ts:111-163`
```typescript
private async startSocketServer(): Promise<void> {
  // Current code only removes if exists
  if (existsSync(this.socketPath)) {
    unlinkSync(this.socketPath);
  }

  return new Promise((resolve, reject) => {
    this.server = net.createServer((socket) => { ... });
    this.server.listen(this.socketPath, () => { resolve(); });
  });
}
```

**Fix Required:**
```typescript
private async startSocketServer(): Promise<void> {
  // Clean up stale socket if it exists
  if (existsSync(this.socketPath)) {
    // Test if socket is responsive
    const isStale = await this.testSocketStale(this.socketPath);
    if (isStale) {
      console.error(`[SDK Worker] Removing stale socket: ${this.socketPath}`);
      unlinkSync(this.socketPath);
    } else {
      // Socket is active - another worker is using this session ID
      throw new Error(`Socket already in use: ${this.socketPath}`);
    }
  }

  return new Promise((resolve, reject) => {
    this.server = net.createServer((socket) => {
      let buffer = '';
      socket.on('data', (chunk) => {
        // ... existing code
      });
    });

    this.server.on('error', (err: any) => {
      if (err.code === 'EADDRINUSE') {
        console.error(`[SDK Worker] Socket already in use: ${this.socketPath}`);
      }
      reject(err);
    });

    this.server.listen(this.socketPath, () => {
      resolve();
    });
  });
}

/**
 * Test if socket file is stale (no process listening)
 */
private async testSocketStale(socketPath: string): Promise<boolean> {
  return new Promise((resolve) => {
    const testClient = net.connect(socketPath);

    // Release the timer and the probe socket however the probe ends;
    // only the first resolve() call has any effect
    const finish = (stale: boolean) => {
      clearTimeout(timer);
      testClient.destroy();
      resolve(stale);
    };

    // No answer within 100ms - treat as stale
    const timer = setTimeout(() => finish(true), 100);

    // Socket is responsive - not stale
    testClient.on('connect', () => finish(false));

    // Socket file exists but nothing is listening - stale
    testClient.on('error', () => finish(true));
  });
}
```

**Testing:**
1. Start worker, kill it with kill -9
2. Verify socket file persists
3. Start new worker with same session ID
4. Verify old socket is detected as stale and removed
5. Verify new worker starts successfully

---

#### 4. Race Condition on First Observation

**Severity:** Medium
**Impact:** First observation might be lost if socket not ready

**Problem:**
Worker startup is async (socket creation, SDK initialization). PostToolUse can fire immediately after UserPromptSubmit returns, before socket is ready.

**Current Flow:**
1. UserPromptSubmit → creates session → spawns worker → returns immediately
2. PostToolUse fires (Claude reads a file)
3. save-hook tries to connect → ENOENT (socket not ready yet)
4. Connection fails → logs error, continues
5. First observation lost

**Location:** `src/hooks/save.ts:71`
```typescript
const client = net.connect(socketPath, () => {
  client.write(JSON.stringify(message) + '\n');
  client.end();
});

client.on('error', (err) => {
  // Currently just logs and continues - observation lost
  console.error(`[claude-mem save] Socket error: ${err.message}`);
});
```

**Fix Required:**
```typescript
/**
 * Save Hook - PostToolUse
 * Sends tool observations to worker via Unix socket with retry logic
 */
export function saveHook(input?: PostToolUseInput): void {
  try {
    if (!input) {
      console.log('No input provided - this script is designed to run as a Claude Code PostToolUse hook');
      process.exit(0);
    }

    const { session_id, tool_name, tool_input, tool_output } = input;

    if (SKIP_TOOLS.has(tool_name)) {
      console.log('{"continue": true, "suppressOutput": true}');
      process.exit(0);
    }

    const db = new HooksDatabase();
    const session = db.findActiveSDKSession(session_id);
    db.close();

    if (!session) {
      console.log('{"continue": true, "suppressOutput": true}');
      process.exit(0);
    }

    const socketPath = getWorkerSocketPath(session.id);
    const message = {
      type: 'observation',
      tool_name,
      tool_input: JSON.stringify(tool_input),
      tool_output: JSON.stringify(tool_output)
    };

    // Try to send with retries
    sendWithRetry(socketPath, message, 5).then(() => {
      console.log('{"continue": true, "suppressOutput": true}');
      process.exit(0);
    }).catch((err) => {
      console.error(`[claude-mem save] Failed after retries: ${err.message}`);
      console.log('{"continue": true, "suppressOutput": true}');
      process.exit(0);
    });

  } catch (error: any) {
    console.error(`[claude-mem save error: ${error.message}]`);
    console.log('{"continue": true, "suppressOutput": true}');
    process.exit(0);
  }
}

/**
 * Send message to socket with exponential backoff retry
 */
async function sendWithRetry(
  socketPath: string,
  message: any,
  maxRetries: number
): Promise<void> {
  let retries = maxRetries;
  let delay = 100; // Start with 100ms

  while (retries > 0) {
    try {
      await sendMessage(socketPath, message);
      return; // Success
    } catch (err: any) {
      retries--;
      if (retries === 0) {
        throw err; // Out of retries
      }

      // Exponential backoff
      await sleep(delay);
      delay = Math.min(delay * 2, 2000); // Cap at 2s
    }
  }
}

/**
 * Send single message to socket
 */
function sendMessage(socketPath: string, message: any): Promise<void> {
  return new Promise((resolve, reject) => {
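    const client = net.connect(socketPath, () => {
      // Resolve only once the payload is flushed and the write side is closed;
      // the caller exits the process right after this promise resolves
      client.end(JSON.stringify(message) + '\n', () => resolve());
    });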

    client.on('error', (err) => {
      reject(err);
    });
  });
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
```

**Testing:**
1. Add artificial delay in worker startup
2. Fire PostToolUse immediately after UserPromptSubmit
3. Verify save-hook retries and succeeds
4. Verify observation is captured
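
The retry budget matters because the hook blocks Claude Code while it waits. The helper below reproduces the delay schedule of the `sendWithRetry` loop above (start at 100 ms, double, cap at 2 s) so the worst-case wait can be computed up front; `backoffDelays` is an illustrative name.

```typescript
// Returns the sleep durations between consecutive attempts.
// There is one sleep between each pair of attempts, so maxAttempts - 1 in total.
function backoffDelays(maxAttempts: number, initialMs = 100, capMs = 2000): number[] {
  const delays: number[] = [];
  let delay = initialMs;
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    delays.push(delay);
    delay = Math.min(delay * 2, capMs);
  }
  return delays;
}

// Sum of all sleeps: the longest the hook can wait before giving up.
function worstCaseWaitMs(maxAttempts: number): number {
  return backoffDelays(maxAttempts).reduce((sum, ms) => sum + ms, 0);
}
```

With the five attempts used above, the hook sleeps 100 + 200 + 400 + 800 = 1500 ms in the worst case, which is the figure to weigh against the hook timeout and against how slow worker startup can be.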

---

### Medium Priority (Should Fix)

#### 5. Orphaned Active Sessions in Database

**Severity:** Low
**Impact:** DB bloat, confusion about session status

**Problem:**
Sessions marked "active" never transition to "completed" or "failed" if worker crashes or is killed.

**Fix Required:**

Create cleanup script: `src/commands/cleanup-sessions.ts`
```typescript
import { HooksDatabase } from '../services/sqlite/HooksDatabase.js';

/**
 * Mark old active sessions as failed
 */
export function cleanupSessions(maxAgeHours: number = 24): void {
  const db = new HooksDatabase();
  const maxAgeMs = maxAgeHours * 60 * 60 * 1000;
  const cutoffEpoch = Date.now() - maxAgeMs;

  const query = (db as any).db.query(`
    UPDATE sdk_sessions
    SET status = 'failed', completed_at = datetime('now'), completed_at_epoch = ?
    WHERE status = 'active' AND started_at_epoch < ?
  `);

  const result = query.run(Date.now(), cutoffEpoch);
  console.log(`Marked ${result.changes} old active sessions as failed`);

  db.close();
}
```

Add to CLI: `src/bin/cli.ts`
```typescript
.command('cleanup-sessions')
.description('Mark old active sessions as failed')
.option('--max-age <hours>', 'Maximum age in hours', '24')
.action((options) => {
  cleanupSessions(parseInt(options.maxAge, 10));
})
```

**Alternative:** Add auto-expiry check in `context-hook`:
```typescript
// Before loading summaries, clean up stale sessions
const maxAgeMs = 24 * 60 * 60 * 1000;
const cutoffEpoch = Date.now() - maxAgeMs;
(db as any).db.query(`
  UPDATE sdk_sessions
  SET status = 'failed'
  WHERE status = 'active' AND started_at_epoch < ?
`).run(cutoffEpoch);
```

---

#### 6. SessionStart Only Runs on "startup"

**Severity:** Low
**Impact:** No context loaded on /resume

**Problem:**
`context-hook` only loads context on "startup" source, skips "resume", "clear", and "compact".

**Location:** `src/hooks/context.ts:24`
```typescript
// Only run on startup (not on resume)
if (input.source && input.source !== 'startup') {
  console.log('');
  process.exit(0);
}
```

**Fix Required:**
```typescript
// Load context on startup and resume
if (input.source && input.source !== 'startup' && input.source !== 'resume') {
  console.log(''); // Skip for clear/compact
  process.exit(0);
}
```

**Rationale:**
- **startup:** Load context (project overview)
- **resume:** Load context (user continuing work)
- **clear:** Skip (user wants fresh start)
- **compact:** Skip (just memory optimization, context preserved)
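
The rationale above can be captured as a single predicate, which keeps the guard readable and easy to test. A sketch mirroring the proposed condition; `shouldLoadContext` is a hypothetical name.

```typescript
// Decides whether the context hook should load summaries for a given
// SessionStart source. A missing source is treated like startup, matching
// the proposed guard; any other value (clear, compact, unknown) is skipped.
function shouldLoadContext(source?: string): boolean {
  if (!source) return true;
  return source === "startup" || source === "resume";
}
```

The guard in `context.ts` would then read `if (!shouldLoadContext(input.source)) { ... exit }`.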

---

### Low Priority (Nice to Have)

#### 7. No Cost Control or Observation Limits

**Severity:** Low
**Impact:** Long sessions can be expensive

**Problem:**
No limits on SDK agent API calls. A session with thousands of tool calls could rack up significant costs.

**Fix Ideas:**
1. Add observation counter, warn after N observations
2. Add cost estimation based on token usage
3. Add budget limit in config
4. Batch observations (send N at once instead of one-by-one)

**Example:**
```typescript
class SDKWorker {
  private observationCount = 0;
  private maxObservations = 1000;

  private handleMessage(message: WorkerMessage): void {
    if (message.type === 'observation') {
      this.observationCount++;
      if (this.observationCount > this.maxObservations) {
        console.error(`[SDK Worker] Exceeded max observations: ${this.maxObservations}`);
        this.isFinalized = true;
        return;
      }
    }
    this.pendingMessages.push(message);
  }
}
```
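
Fix idea 4 (batching) could look roughly like the following. This is a sketch under the assumption that observations can be grouped before being yielded to the SDK agent; `ObservationBatcher` is a hypothetical class and the batch size is arbitrary.

```typescript
// Buffers items and hands them back in fixed-size groups.
class ObservationBatcher<T> {
  private pending: T[] = [];
  private readonly batchSize: number;

  constructor(batchSize: number) {
    this.batchSize = batchSize;
  }

  // Returns a full batch when one is ready, otherwise null.
  add(item: T): T[] | null {
    this.pending.push(item);
    if (this.pending.length < this.batchSize) return null;
    return this.flush();
  }

  // Returns whatever is buffered (possibly empty) and resets the buffer.
  // Intended to be called on finalize so trailing observations are not lost.
  flush(): T[] {
    const batch = this.pending;
    this.pending = [];
    return batch;
  }
}
```

The trade-off is latency: observations reach the agent later, so a finalize that arrives mid-batch must flush before the summary prompt is sent.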

---

#### 8. No Health Check Mechanism

**Severity:** Low
**Impact:** Can't tell if worker is alive/healthy

**Fix Ideas:**
1. Add `/status` command that checks for active workers
2. Add health check endpoint on socket (ping/pong)
3. Add metrics to DB (last_activity_at)

---

#### 9. No Observation Deduplication

**Severity:** Low
**Impact:** Duplicate observations if same tool executed multiple times

**Fix Ideas:**
1. Hash tool_name + tool_input + tool_output
2. Check for duplicate hash before storing
3. Or let SDK agent handle deduplication naturally
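
Fix ideas 1 and 2 could be sketched as below, assuming the three fields are already strings (the save hook stringifies `tool_input` and `tool_output` before sending). `observationKey` and `ObservationDeduper` are hypothetical names, and an in-memory `Set` is an assumption; a DB column with a unique index would be the persistent equivalent.

```typescript
import { createHash } from "node:crypto";

// SHA-256 over the three fields. Each field is length-prefixed so that
// ("ab", "c", "") and ("a", "bc", "") cannot produce the same digest.
function observationKey(toolName: string, toolInput: string, toolOutput: string): string {
  const hash = createHash("sha256");
  for (const field of [toolName, toolInput, toolOutput]) {
    hash.update(`${field.length}:${field}`);
  }
  return hash.digest("hex");
}

// Remembers keys seen during this worker's lifetime.
class ObservationDeduper {
  private readonly seen = new Set<string>();

  // Returns true if this exact observation was seen before; records it otherwise.
  isDuplicate(toolName: string, toolInput: string, toolOutput: string): boolean {
    const key = observationKey(toolName, toolInput, toolOutput);
    if (this.seen.has(key)) return true;
    this.seen.add(key);
    return false;
  }
}
```

Including the output in the key means re-reading a file after it changed still counts as a new observation.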

---

## Implementation Checklist

### Phase 0: Verify Happy Path (DO THIS FIRST - HIGHEST PRIORITY)

**Goal:** Prove the basic cycle works end-to-end before fixing edge cases.

- [ ] **Test 0.1: Verify Stop Hook Fires**
  - [ ] Add logging to `src/hooks/summary.ts`
  - [ ] Exit session normally and verify hook runs
  - [ ] Verify FINALIZE message is sent to socket

- [ ] **Test 0.2: Verify Worker Generates Summary**
  - [ ] Add logging to worker message handler
  - [ ] Verify FINALIZE message received
  - [ ] Verify SDK agent response
  - [ ] Verify summary parsed and stored in DB
  - [ ] Query DB to confirm summary exists

- [ ] **Test 0.3: Verify Context Hook Loads Summaries**
  - [ ] Add logging to `src/hooks/context.ts`
  - [ ] Start new session, verify summaries loaded
  - [ ] Verify markdown output to stdout
  - [ ] Verify Claude has context from previous session

- [ ] **Test 0.4: End-to-End Integration Test**
  - [ ] Run session 1 with test work
  - [ ] Verify summary in DB
  - [ ] Run session 2
  - [ ] Ask Claude about previous session
  - [ ] Confirm Claude has correct context

**STOP HERE:** Only proceed to Phase 1 after confirming all Phase 0 tests pass.

---

### Phase 1: Critical Resilience Fixes (Do After Phase 0)

- [ ] Add watchdog timer to worker (Issue #1)
  - [ ] Add lastActivityTime tracking
  - [ ] Add timeout check in message generator loop
  - [ ] Test with zombie worker scenario

- [ ] Configure existing SessionEnd hook (Issue #2)
  - [ ] Add SessionEnd configuration to hooks/hooks.json
  - [ ] Create src/hooks/cleanup.ts (implements cleanup logic)
  - [ ] Create src/bin/hooks/cleanup-hook.ts (entry point)
  - [ ] Update build process to compile cleanup-hook
  - [ ] Test with Ctrl-C exit and verify worker cleanup

- [ ] Fix stale socket detection (Issue #3)
  - [ ] Add testSocketStale method
  - [ ] Update startSocketServer to check for stale sockets
  - [ ] Test with crashed worker scenario

- [ ] Fix save-hook race condition (Issue #4)
  - [ ] Add sendWithRetry function
  - [ ] Add exponential backoff logic
  - [ ] Update save-hook to use retry logic
  - [ ] Test with immediate PostToolUse

### Phase 2: Medium Priority

- [ ] Add session cleanup script (Issue #5)
  - [ ] Create cleanup-sessions command
  - [ ] Add to CLI
  - [ ] Optional: Add auto-cleanup to context-hook

- [ ] Fix SessionStart source handling (Issue #6)
  - [ ] Update context-hook to load on "resume"
  - [ ] Test with /resume command

### Phase 3: Low Priority (Optional)

- [ ] Add cost control (Issue #7)
- [ ] Add health checks (Issue #8)
- [ ] Add observation deduplication (Issue #9)

## Testing Strategy

### Unit Tests

Create tests for each fix:
- `test/hooks/cleanup.test.ts` - SessionEnd hook
- `test/sdk/worker-timeout.test.ts` - Watchdog timer
- `test/hooks/save-retry.test.ts` - Retry logic

### Integration Tests

Test complete flows:
1. **Normal flow:** SessionStart → UserPromptSubmit → PostToolUse → Stop
2. **Crash recovery:** Worker crash → SessionEnd cleanup
3. **Zombie worker:** No Stop hook → Worker timeout
4. **Socket race:** Immediate PostToolUse → Retry success

### Manual Testing Scenarios

1. **Zombie Worker Test:**
   ```bash
   # Start session
   claude
   # Kill Claude with Ctrl-C
   # Check for worker process
   ps aux | grep claude-mem-worker
   # Wait 2 hours, verify worker exits
   ```

2. **SessionEnd Test:**
   ```bash
   # Start session
   claude
   # Exit normally or Ctrl-C
   # Verify worker killed
   # Verify socket removed
   # Check DB for session status
   sqlite3 ~/.claude-mem/data/claude-mem.db "SELECT * FROM sdk_sessions"
   ```

3. **Stale Socket Test:**
   ```bash
   # Start session
   claude
   # Kill worker with kill -9 <pid>
   # Verify socket exists
   ls /tmp/claude-mem-worker-*.sock
   # Start new session
   # Verify old socket removed, new session starts
   ```

4. **Race Condition Test:**
   ```bash
   # Add delay to worker startup (for testing)
   # Start session, immediately run command
   claude "list all files"
   # Verify first observation captured
   ```

## File Modifications Required

### New Files
- `src/hooks/cleanup.ts` - SessionEnd hook logic
- `src/bin/hooks/cleanup-hook.ts` - SessionEnd entry point
- `src/commands/cleanup-sessions.ts` - Session cleanup script
- `test/hooks/cleanup.test.ts` - Tests for SessionEnd hook
- `test/sdk/worker-timeout.test.ts` - Tests for watchdog timer
- `test/hooks/save-retry.test.ts` - Tests for retry logic

### Modified Files
- `hooks/hooks.json` - Add SessionEnd configuration
- `src/sdk/worker.ts` - Add watchdog timer, stale socket detection
- `src/hooks/save.ts` - Add retry logic
- `src/hooks/context.ts` - Load context on resume
- `src/bin/cli.ts` - Add cleanup-sessions command

## Dependencies

No new dependencies required. All fixes use existing:
- `net` (Unix sockets)
- `fs` (file operations)
- `child_process` (process management)
- `bun:sqlite` (database)

## Success Criteria

### Phase 0 (Must Pass First)
1. ✅ Stop hook fires on normal exit
2. ✅ Worker receives FINALIZE and generates summary
3. ✅ Summary is stored in DB correctly
4. ✅ Context hook loads summaries on next session
5. ✅ New session immediately sees previous session's summary in context
6. ✅ End-to-end integration test passes

### Phases 1-2 (After Phase 0 Passes)
1. ✅ Worker processes never become zombies (exit after at most 2h of inactivity)
2. ✅ SessionEnd hook cleans up worker and socket on exit
3. ✅ Stale sockets don't block new sessions
4. ✅ First observation always captured (no race condition)
5. ✅ No orphaned "active" sessions in DB after 24h
6. ✅ Context loads on /resume
7. ✅ All tests pass

## References

- Claude Code Hooks Documentation: https://docs.claude.com/en/docs/claude-code/hooks
- Claude Agent SDK Streaming: https://docs.claude.com/en/api/agent-sdk/streaming-vs-single-mode
- Unix Domain Sockets: Node.js `net` module
- SQLite Best Practices: Bun SQLite documentation

## Notes

- All hooks must return `{"continue": true, "suppressOutput": true}` on error
- Hooks have 60s default timeout (configurable)
- Worker is detached process, doesn't block Claude Code
- SessionEnd hooks "cannot block session termination" per Claude Code docs
- Streaming input mode is the recommended SDK approach for this architecture