claude-mem/plans/03-worker-lifecycle.md

# Plan 03 — Worker / Daemon Lifecycle Hardening

> **Scope**: Fix accumulated worker / daemon lifecycle bugs in claude-mem.
> Address DB bloat, chroma-mcp leaks, retry storms, port/PID races, queue zombies, missing supervision, and observability gaps.
>
> **Non-implementation**: This document is a plan. Each phase is self-contained; an executing agent should be able to run a single phase without re-discovering context.
>
> **Audience**: Subsequent agents executing one phase per session.

---

## Phase 0 — Documentation Discovery & Allowed APIs

**Goal**: Anchor every implementation phase in real APIs that exist in the current codebase or in vetted libraries. Prevent phantom-method invention.

### 0.1 Read these files end-to-end before touching code

| File | Why |
| --- | --- |
| `CLAUDE.md` (project root) | Architecture, exit-code strategy, Pro/OSS boundary, settings conventions |
| `src/services/worker-service.ts` | `WorkerService` class, `--daemon` `main()`, signal registration, all CLI subcommands |
| `src/services/worker-spawner.ts` | `ensureWorkerStarted` 3-state machine (`ready`/`warming`/`dead`) |
| `src/services/infrastructure/ProcessManager.ts` | `spawnDaemon`, PID file ops, `captureProcessStartToken`, `isProcessAlive` |
| `src/services/infrastructure/HealthMonitor.ts` | `isPortInUse`, `waitForHealth`, `waitForReadiness`, `httpShutdown` |
| `src/services/infrastructure/GracefulShutdown.ts` | `performGracefulShutdown` ordering |
| `src/services/infrastructure/CleanupV12_4_3.ts` | `runOneTimeV12_4_3Cleanup`, `STUCK_PENDING_THRESHOLD = 10`, observer-purge SQL |
| `src/services/sync/ChromaMcpManager.ts` | `ensureConnected`, `connectInternal`, `stop`, `killProcessTree`, `collectDescendantPids`, `RECONNECT_BACKOFF_MS = 10_000`, `MCP_CONNECTION_TIMEOUT_MS = 30_000` |
| `src/supervisor/index.ts` | `Supervisor` class, `validateWorkerPidFile`, signal-handler config |
| `src/supervisor/process-registry.ts` | `ProcessRegistry`, `getSdkProcessForSession`, `ensureSdkProcessExit`, `waitForSlot`, `TOTAL_PROCESS_HARD_CAP = 10` |
| `src/supervisor/health-checker.ts` | 30s `pruneDeadEntries` loop (already present — extend, don't replace) |
| `src/supervisor/shutdown.ts` | `runShutdownCascade`, `signalProcess`, `loadTreeKill` |
| `src/services/worker/SessionManager.ts` | In-memory session map, `deleteSession`, queue/pending integration |
| `src/services/worker/RestartGuard.ts` | Per-session restart cap (10/60s window, 5 consecutive) |
| `src/services/worker/retry.ts` | Provider-level retry (`withRetry`, classified errors) — DO NOT mutate; circuit breaker layers ABOVE this |
| `src/shared/worker-utils.ts` | `recordWorkerUnreachable` (line 401), `executeWithWorkerFallback` (line 443), fail-loud counter file at `~/.claude-mem/state/hook-failures.json` |
| `src/services/sqlite/Database.ts` | PRAGMA setup (lines 27-32, 69-74) — single source of truth for DB pragmas |
| `src/services/server/Server.ts` | `/api/health` (line 161), `/api/readiness` (line 178), `/api/version` (line 192) |
| `src/shared/SettingsDefaultsManager.ts` | Where every new setting key MUST be declared with a default |
| `src/shared/hook-constants.ts` | `HOOK_TIMEOUTS`, `HOOK_EXIT_CODES` — extend here, don't inline |
| `plugin/bun-runner.js`, `plugin/scripts/worker-service.cjs` | Built worker entrypoint — note the build pipeline (`scripts/build-hooks.js`) |

### 0.2 Allowed APIs (use these, do NOT invent siblings)

**SQLite (bun:sqlite)** — pragma calls are `db.run('PRAGMA …')` or `db.prepare('PRAGMA …').get()`. Existing pragmas: `journal_mode=WAL`, `synchronous=NORMAL`, `foreign_keys=ON`, `temp_store=memory`, `mmap_size`, `cache_size`. **VACUUM** runs only outside a transaction. `VACUUM INTO 'path'` is the backup form already used in `CleanupV12_4_3.ts:135`. `wal_checkpoint(TRUNCATE)` is the truncating-checkpoint form.

**Process supervision** — `getSupervisor()`, `getProcessRegistry()`, `registerProcess(id, info, processRef?)`, `unregisterProcess(id)`, `pruneDeadEntries()`, `assertCanSpawn(type)`, `runShutdownCascade(...)`. Tree-kill on POSIX uses `pgrep -P` recursion + `process.kill(-pgid, signal)`; on Windows uses `taskkill /T /F /PID` or `tree-kill` npm.

**HTTP/Express** — `Server.app.get('/api/...', handler)` via `registerRoutes` (handlers implement `setupRoutes(app)` on a `RouteHandler` interface). Every new endpoint must follow the existing `RouteHandler` pattern under `src/services/worker/http/routes/`.

**Settings** — `SettingsDefaultsManager.get('CLAUDE_MEM_…')`, `SettingsDefaultsManager.loadFromFile(path)`. New keys require: (a) type added to the interface in `SettingsDefaultsManager.ts`, (b) default value declared in the same file, (c) documented in CLAUDE.md if user-tunable.

**Logging** — `logger.info(category, msg, fields)`, `logger.warn`, `logger.error(category, msg, fields, error)`. Categories used here: `SYSTEM`, `WORKER`, `SESSION`, `CHROMA_MCP`, `SDK`, `DB`, `QUEUE`, `PROCESS`. Add new category `MAINTENANCE` for VACUUM / reaper events.

### 0.3 Anti-patterns — explicitly forbidden

- **Do not** add a new singleton supervisor — extend `getSupervisor()`.
- **Do not** spawn child processes without going through `getSupervisor().assertCanSpawn(...)` and `registerProcess(...)`.
- **Do not** call `process.exit(1)` on hook-side error paths — it accumulates Windows Terminal tabs (CLAUDE.md exit-code strategy). Use `0` for graceful, `2` only for blocking-error paths that need to surface stderr to Claude.
- **Do not** delete `sdk_sessions` rows if `observations` or `session_summaries` still reference their `memory_session_id` without an explicit user-opt-in flag.
- **Do not** hold a SQLite write lock during `VACUUM` while ingestion is hot. Pause queue processing first.
- **Do not** introduce setInterval timers that keep the event loop alive — every new timer must call `.unref()`.
- **Do not** invent settings keys — declare them in `SettingsDefaultsManager.ts` first.

### 0.4 Confidence note

Confidence: HIGH on file/API inventory (read-pass complete on all referenced files). MEDIUM on Windows behavior of new advisory locks (Windows mandatory locking via `lockf` is bun-runtime-dependent — verify via spike before committing).

---

## Phase 1 — Inventory & Instrumentation (read-only, safe)

**Goal**: Produce a written state-machine diagram and an exit-site catalog that subsequent phases reference. No code changes; create a scratch document at `docs/internal/worker-lifecycle-state-machine.md` if the executor wants an artifact, otherwise capture findings in commit messages.

### 1.1 Tasks

1. **Trace the worker daemon spawn → terminate path** end-to-end. Source order:
   - Hook entry → `src/shared/worker-utils.ts:ensureWorkerRunning` (lazy spawn) OR `src/services/worker-spawner.ts:ensureWorkerStarted` (explicit)
   - `spawnDaemon` (`src/services/infrastructure/ProcessManager.ts:408`) — POSIX uses `setsid` if available, Windows uses `Start-Process -WindowStyle Hidden`
   - `--daemon` branch in `src/services/worker-service.ts:937` — duplicate-PID/duplicate-port guard
   - `WorkerService.start()` (line 258) → `startSupervisor()` → `server.listen()` → `writePidFile()` → `getSupervisor().registerProcess('worker', ...)` → `initializeBackground()`
   - Signal handlers via `configureSupervisorSignalHandlers` (`src/supervisor/index.ts:49`) — SIGTERM/SIGINT; SIGHUP ignored in `--daemon` mode on POSIX
   - Shutdown: `WorkerService.shutdown()` → `performGracefulShutdown` → server close → `sessionManager.shutdownAll()` → mcp client close → chroma stop → db close → `getSupervisor().stop()` → `runShutdownCascade` → PID file unlink

2. **Catalog every `process.exit(...)` site** in worker-service.ts (already mapped — 21 sites; lines 764, 772, 794, 804, 810, 813, 828, 835, 842, 853, 870, 878, 888, 895, 916, 933, 945, 950, 971, 975, 991). Annotate each with: code, intent, whether it leaks the worker on the same path, whether shutdown ran first.

3. **Catalog every retry / unreachable site**:
   - `src/shared/worker-utils.ts:401 recordWorkerUnreachable` (the #1874 counter)
   - `src/cli/handlers/{context,file-context,file-edit,summarize,observation,user-message,session-init}.ts` — every `executeWithWorkerFallback` caller
   - `src/servers/mcp-server.ts:72,100,145` — direct `workerHttpRequest`
   - `src/services/transcripts/processor.ts:331,371,373` — direct `workerHttpRequest`
   - `src/services/integrations/CursorHooksInstaller.ts:64,349,352` — direct `workerHttpRequest`
   - `src/utils/claude-md-utils.ts:305` — direct `workerHttpRequest`

4. **Catalog every spawn site**:
   - `spawnDaemon` (worker self-spawn)
   - `ChromaMcpManager.connectInternal` (chroma-mcp via uvx → uv → python → chroma-mcp)
   - `spawnSdkProcess` (`src/supervisor/process-registry.ts:532`) — Claude SDK subprocesses
   - `runMcpSelfCheck` (`src/services/worker-service.ts:405`) — MCP loopback probe via `process.execPath`
   - Any `execSync` / `execFile` / `spawnSync` in `ChromaMcpManager` (cert resolution) or `ProcessManager` (binary lookup, cwd-remap)

### 1.2 Acceptance criteria

- Markdown table written (commit message or scratch doc) listing every spawn and exit site with file:line.
- A 1-paragraph English description of the worker state machine (states + transitions) suitable to paste into PR descriptions.
- Confirmed list of which `executeWithWorkerFallback` callers run inside hooks (Claude Code's strict timeout window) vs. inside the worker (no timeout pressure) — this drives Phase 4 circuit-breaker scoping.

### 1.3 Verification

- `grep -rn "process.exit" src/ --include="*.ts" | wc -l` matches the catalog.
- `grep -rn "executeWithWorkerFallback\|workerHttpRequest" src/ --include="*.ts" | grep -v worker-utils.ts | wc -l` matches the catalog.

### 1.4 Deliverable

Hand-off note for Phase 2-8 executors with file/line anchors; no code committed.

---

## Phase 5 — PID/Port Reclamation & Race-Free Startup

> Shipping order: **Phase 5 first** (per Phase 8 ordering). Idempotent and safe.

**Goal**: Eliminate the silent-exit-0 case where a fresh `--daemon` spawn loses the port race; harden cross-platform PID-reuse detection; serialize concurrent spawns with an OS-level advisory lock.

### 5.1 Files to modify

| File | Change |
| --- | --- |
| `src/supervisor/process-registry.ts` | Extend `captureProcessStartToken` for macOS (already partial via `ps -o lstart`) and Windows (`wmic process where ProcessId=X get CreationDate /value`). Add unit test for each platform branch. |
| `src/supervisor/index.ts:validateWorkerPidFile` | Add port-on-pid match check — if `pidInfo.port !== currentExpectedPort`, treat as `'stale'`. |
| `src/services/infrastructure/ProcessManager.ts` | Add new exports: `acquireDaemonLock()` / `releaseDaemonLock()` using POSIX `flock` (via `fcntl`/`flock` syscall through `bun:ffi` or shelling to `flock(1)` on Linux only) and Windows mandatory file lock via `LockFile` (or fall back to atomic-rename sentinel on Windows). |
| `src/services/worker-service.ts:937` (`--daemon` branch) | Wrap startup in `acquireDaemonLock()`. If port is in use, perform a `/api/version` probe; if the listener returns OUR `BUILT_IN_VERSION` → exit 0 (legit duplicate); if it returns a different version → log a warning and exit 0 (stale worker, will be restarted by version-mismatch path); if the listener doesn't respond → wait `HOOK_TIMEOUTS.PORT_IN_USE_WAIT` then write a clear stderr line with diagnostic before exiting. |
| `src/services/worker-spawner.ts` | Same lock acquisition before `spawnDaemon`. Release on success or error. |

### 5.2 Detailed tasks

1. **macOS start-time token**: extend `captureProcessStartToken` (registry line 56). On Darwin, prefer `ps -p <pid> -o lstart=` (already in fallback path). Verify with `LC_ALL=C LANG=C` env so locale doesn't change the timestamp format. Add a comment explaining that `ps lstart` resolution is 1-second — collisions still possible but vastly less likely than no-token.

2. **Windows start-time token**: add a Win32 branch using `wmic process where ProcessId=<pid> get CreationDate /value`. Parse the `CreationDate=YYYYMMDDHHMMSS.ffffff+TZ` line. Cache the wmic resolution per-pid for 5s (avoid re-shelling on repeat checks).

3. **Port-on-pid match**: in `validateWorkerPidFile`, after confirming `isPidAlive(pidInfo.pid)`, verify the recorded `pidInfo.port` is reachable via `isPortInUse(pidInfo.port)` AND the listener's `/api/version` returns a version string. If port is dead but PID alive → return `'stale'` (worker crashed mid-listen, PID about to be reused).

4. **Advisory lock**:
   - POSIX: open `<DATA_DIR>/.worker-spawn.lock` with `O_RDWR | O_CREAT`, `flock(fd, LOCK_EX | LOCK_NB)`. On EAGAIN, log `Another spawn in progress, waiting up to 5s` and retry with `LOCK_EX` (blocking) under a `setTimeout` race. Implement via `bun:ffi` for POSIX `flock(2)` if available, otherwise shell `flock -n -x <path> <command>`. **Spike first**: confirm bun's `bun:ffi` exposes `flock`. If not, use a watch-and-rename sentinel (less ideal but works).
   - Windows: Use `LockFile` via Win32 API or fall back to atomic `mkdirSync` of `<DATA_DIR>/.worker-spawn.lock.dir` (fails if exists) with stale-timeout cleanup at 30s.

5. **Diagnostic stderr**: when port-in-use without our worker responding, write to stderr (and log INFO) with: `claude-mem worker port <N> in use by an unidentified process; not spawning duplicate`. This must NOT block the hook — exit 0 still per CLAUDE.md.

### 5.3 New settings

| Key | Default | Range | Purpose |
| --- | --- | --- | --- |
| `CLAUDE_MEM_DAEMON_LOCK_TIMEOUT_MS` | `5000` | 0–60000 | Max wait for the spawn lock |
| `CLAUDE_MEM_PID_PORT_RECHECK_MS` | `2000` | 500–30000 | Wait window before treating port-in-use without `/api/version` response as "unknown listener" |

### 5.4 Acceptance criteria

- Run two `claude-mem start` commands in parallel → exactly one daemon ends up alive; the other exits cleanly with a log line referencing the lock.
- Kill the worker `-9` (skip cleanup), reuse the PID with `python -c 'import time; time.sleep(60)'` → `validateWorkerPidFile` returns `'stale'` and removes the file.
- On macOS, run worker, capture token, kill, spawn unrelated process with same PID, spawn worker again → token mismatch detected; old PID file ignored.
- `/api/version` probe path: spawn a fake server on the worker port → daemon exits 0 with the new diagnostic stderr, NOT silently.

### 5.5 Observability hooks

- Log `SYSTEM` INFO `Daemon spawn lock acquired` on success.
- Log `SYSTEM` WARN `Daemon spawn lock contention`, fields `{waitedMs}`.
- Log `SYSTEM` WARN `Worker port occupied by foreign listener`, fields `{port, probeStatus}`.
- New `/api/healthz` fields (added in Phase 7): `pid_file_path`, `pid_start_token`, `daemon_lock_held: bool`.

### 5.6 Verification checklist

- [ ] `grep "process.exit(0)" src/services/worker-service.ts` — count unchanged (no new silent exits introduced).
- [ ] Manual two-process race test (Linux + macOS + Windows VM).
- [ ] Existing health-check tests still pass.
- [ ] No new always-on `setInterval` introduced.

---

## Phase 6 — DB Maintenance (VACUUM / WAL)

> Ships alongside Phase 5 (idempotent).

**Goal**: Recover the 504 MB of free pages, prevent recurrence, surface DB-size metrics.

### 6.1 Files to modify

| File | Change |
| --- | --- |
| `src/services/sqlite/Database.ts:27-32` and `:69-74` | Add `PRAGMA auto_vacuum = INCREMENTAL` BEFORE the first table is created (only takes effect on a fresh DB; harmless on existing DBs but logs a no-op). For existing DBs, the migration path is the one-shot Phase-6 startup VACUUM. |
| `src/services/maintenance/DbMaintenance.ts` (new) | Periodic maintenance task: on a 24h timer (configurable), call `PRAGMA incremental_vacuum`, `PRAGMA wal_checkpoint(TRUNCATE)`, then collect metrics (`page_count`, `freelist_count`, file size). Emit `MAINTENANCE` INFO log. Acquire `dbMaintenanceMutex` so other writers wait. |
| `src/services/maintenance/DbMaintenance.ts` | Startup check: if `freelist_count / page_count > FREE_RATIO_VACUUM_THRESHOLD` (default 0.40), perform full `VACUUM` after `VACUUM INTO` backup to `<DATA_DIR>/backups/claude-mem-pre-vacuum-<ts>.db`. Pause queue processor first. |
| `src/services/worker-service.ts:initializeBackground` | Wire the maintenance task — start after `dbManager.initialize()`. Timer must `.unref()`. |
| `src/services/worker/SessionManager.ts` | Expose `pauseQueueProcessing(): Promise<void>` and `resumeQueueProcessing(): void`. Use the existing AbortController + emitter to drain in-flight work; don't introduce new state. Maintenance acquires; readers continue (WAL allows them). |
| `src/services/infrastructure/CleanupV12_4_3.ts:135` | Reuse the existing `VACUUM INTO` backup pattern verbatim — copy the disk-space pre-flight check (`statfsSync`, line 115). |

### 6.2 Detailed tasks

1. **Auto-vacuum on new DBs**: Add `PRAGMA auto_vacuum = INCREMENTAL` in `Database.ts` BEFORE `migrationRunner.runAllMigrations()`. Verify with a comment that this is no-op on existing DBs (sqlite docs say a full VACUUM is required to flip auto_vacuum mode after tables exist). Document the migration path: existing users get the freed-page reclamation via the startup full VACUUM in step 3.

2. **Periodic incremental vacuum + WAL checkpoint**:
   - Schedule via `setInterval` with `.unref()`. Default cadence: 24h. Setting: `CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS` (default `24`, min `1`, max `168`).
   - Each tick: acquire mutex → `db.run('PRAGMA incremental_vacuum')` → `db.run('PRAGMA wal_checkpoint(TRUNCATE)')` → snapshot metrics → release.
   - Skip the tick if a `VACUUM` is in progress.

3. **Startup full VACUUM (one-shot per session) when free-ratio is high**:
   - Read `page_count` (`PRAGMA page_count`) and `freelist_count` (`PRAGMA freelist_count`).
   - If `freelist_count / page_count >= CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO` (default `0.40`), schedule a deferred VACUUM (5 minutes after worker becomes ready) to avoid slowing startup.
   - VACUUM steps: pause queue → `VACUUM INTO '<backup>'` → verify backup → `VACUUM` (full) → resume queue → log freed pages and ms taken.
   - Disk-space pre-flight: `statfsSync` (mirror `CleanupV12_4_3.ts:115`). Skip if free space < `1.2 * dbSize + 100MB`. Log `MAINTENANCE` ERROR in that case so the user sees actionable info.

4. **Pause/resume hook in SessionManager**: The existing `for await ... of getMessageIterator()` loop in queue processor needs a "pause" semaphore. Implementation: add a `Promise<void>` gate that the iterator awaits before yielding. Maintenance flips it to a pending promise during VACUUM; resolve to release. **Do not** abort in-flight messages — they can complete; new messages wait.

5. **Cleanup-V12.4.3 regression detection**: Re-scan `sdk_sessions WHERE project = OBSERVER_SESSIONS_PROJECT` and `pending_messages` matching the stuck-pending pattern at maintenance ticks. If any match AND the marker exists, log `MAINTENANCE` WARN and re-run the purge (idempotent). Setting: `CLAUDE_MEM_CLEANUP_REGRESSION_CHECK = true`.

### 6.3 New settings

| Key | Default | Range | Purpose |
| --- | --- | --- | --- |
| `CLAUDE_MEM_DB_MAINTENANCE_ENABLED` | `true` | bool | Master kill-switch |
| `CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS` | `24` | 1–168 | Periodic cadence |
| `CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO` | `0.40` | 0.05–0.95 | Free-ratio above which we auto-VACUUM at startup |
| `CLAUDE_MEM_DB_VACUUM_STARTUP_DELAY_MS` | `300000` (5 min) | 0–3600000 | Defer startup VACUUM so it doesn't block readiness |
| `CLAUDE_MEM_CLEANUP_REGRESSION_CHECK` | `true` | bool | Re-scan v12.4.3-shaped pollution |

### 6.4 Acceptance criteria

- Reproduce the bloat scenario: stuff `pending_messages` with 100k stuck `processing` rows, run worker → startup VACUUM fires within 5 min after readiness, freed-pages log line appears, file size drops.
- Existing 532 MB DBs reclaim ≥ 95% of free pages on first run (matches the 28 MB target observed manually).
- Hot-ingestion test: enqueue 1000 observations during a maintenance tick → no `SQLITE_BUSY` or `database is locked` errors; queue resumes after VACUUM.
- `PRAGMA auto_vacuum` returns `2` (incremental) on freshly-created DBs.
- Maintenance loop ticks honor `.unref()` — `process.exit(0)` from a clean shutdown returns immediately, not after the 24h interval.

### 6.5 Observability hooks

- New log category: `MAINTENANCE`.
- Events: `MaintenanceStart`, `MaintenanceTick`, `VacuumStart`, `VacuumComplete` (`{freedPages, ms, dbSizeBeforeMb, dbSizeAfterMb}`), `VacuumSkippedLowDisk`, `RegressionDetected`, `MaintenanceComplete`.
- `/api/healthz` fields (Phase 7): `db_page_count`, `db_freelist_count`, `db_free_ratio_pct`, `db_size_bytes`, `db_last_vacuum_at`, `db_last_vacuum_freed_pages`, `db_last_maintenance_at`.

### 6.6 Anti-pattern guards

- **Do not** call `VACUUM` inside a transaction (sqlite errors).
- **Do not** hold the queue pause across the `VACUUM INTO` backup phase — only the final full `VACUUM` needs the writer-lock window. (`VACUUM INTO` works on a read-only snapshot.)
- **Do not** call `PRAGMA wal_checkpoint(FULL)` — TRUNCATE is required to actually shrink the WAL file.

### 6.7 Verification checklist

- [ ] Backup created at `<DATA_DIR>/backups/` before every full VACUUM.
- [ ] Maintenance timer registered with `.unref()` (grep for `setInterval` in the new file → `unref()` follows each).
- [ ] No new direct `setInterval` outside the maintenance file.
- [ ] PRAGMA list in `Database.ts` extended with `auto_vacuum` and includes a comment about migration.

---

## Phase 2 — Stuck-Session Reaper (fix v12.4.3 bloat)

**Goal**: Stop `pending_messages` and `sdk_sessions` from accumulating zombies.

### 2.1 Files to modify

| File | Change |
| --- | --- |
| `src/services/maintenance/SessionReaper.ts` (new) | Periodic reaper. Plugs into the supervisor's existing `health-checker.ts` 30s tick (extend, do not replace). |
| `src/supervisor/health-checker.ts:9 runHealthCheck` | Call `SessionReaper.tick()` after `pruneDeadEntries()`. |
| `src/services/worker/SessionManager.ts:deleteSession` | After in-memory delete, call `pendingStore.clearPendingForSession(sessionDbId)` synchronously (it already does this via `clearPendingForSession` on a separate path — verify and unify). |
| `src/services/sqlite/PendingMessageStore.ts` | Add `reapStuckProcessing(olderThanMs: number): number` returning the count of rows reset to `pending`. |
| `src/services/sqlite/SessionStore.ts` | Add `findInactiveSdkSessions(olderThanDays: number): Array<{id, project, contentSessionId, memorySessionId, lastActivityAt}>`. |
| `src/services/sqlite/SessionStore.ts` | Add `markSdkSessionInactive(id: number)` — adds an `inactive_at` column or sets a sentinel. |
| `src/services/sqlite/migrations/runner.ts` | New migration: add `inactive_at TEXT NULL` to `sdk_sessions` if absent. |

### 2.2 Reaper logic

Per tick (default 30s, gated by `CLAUDE_MEM_REAPER_ENABLED`):

1. **Stuck-processing sweep**: `UPDATE pending_messages SET status='pending' WHERE status='processing' AND updated_at < <now - PROCESSING_STUCK_MS>` (default 5 minutes). Log count if > 0.

2. **Orphan-pending sweep**: `DELETE FROM pending_messages WHERE session_db_id NOT IN (SELECT id FROM sdk_sessions)` (defensive — should already be FK-protected but log if any deleted).

3. **Inactive-session detection** (does NOT delete):
   - SELECT sdk_sessions where `id NOT IN <in-memory session ids>` AND `last_activity > N days ago` (computed from MAX of related observations / pending_messages / session_summaries timestamps).
   - For each: `UPDATE sdk_sessions SET inactive_at = <now> WHERE id = ? AND inactive_at IS NULL`.

4. **Observer-pollution regression check** (matches Phase 6 task 5):
   - If `OBSERVER_SESSIONS_PROJECT` rows reappear after the v12.4.3 marker is present, re-run the purge SQL from `CleanupV12_4_3.runObserverSessionsPurge` (lines 196-218).
   - Log `MAINTENANCE` WARN with counts.

5. **Hard delete is opt-in** via `CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS` (default `0` = disabled; nonzero = days threshold). When enabled and a session has `inactive_at` older than the threshold AND no FK-referencing rows, hard-delete the session row. Default-off because user data safety > disk space.

### 2.3 New settings

| Key | Default | Range | Purpose |
| --- | --- | --- | --- |
| `CLAUDE_MEM_REAPER_ENABLED` | `true` | bool | Master switch |
| `CLAUDE_MEM_REAPER_TICK_MS` | `30000` | 5000–600000 | Tick cadence (piggy-backs supervisor; this value gates whether the reaper runs each tick) |
| `CLAUDE_MEM_REAPER_PROCESSING_STUCK_MS` | `300000` (5 min) | 30000–86400000 | Threshold for a `processing` row to be considered stuck |
| `CLAUDE_MEM_REAPER_INACTIVE_DAYS` | `30` | 1–365 | When to mark a session `inactive_at` |
| `CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS` | `0` | 0–365 | 0 = never; otherwise, hard-delete inactive rows older than N days |

### 2.4 Acceptance criteria

- Inject 50 stuck `processing` rows older than 5 minutes → next reaper tick resets them → `/api/healthz` shows `oldest_pending_processing_age_sec` drop to 0.
- Inject `OBSERVER_SESSIONS_PROJECT` rows post-marker → next tick logs regression and purges them.
- Reaper survives a worker restart without losing state (everything is DB-backed).
- Active sessions (in-memory) are NEVER marked inactive even if their last DB write is old (in-memory presence wins).

### 2.5 Observability

- Log: `MAINTENANCE` INFO `ReaperTick`, fields `{stuckProcessing, orphanPending, markedInactive, hardDeleted, observerRegression}`.
- New `/api/healthz` fields (Phase 7): `oldest_processing_pending_age_sec`, `processing_pending_count`, `pending_count_total`, `sdk_sessions_total`, `sdk_sessions_inactive`, `sdk_sessions_by_project: { [project]: count }`.

### 2.6 Verification checklist

- [ ] Migration adds `inactive_at` column without breaking existing data (test on a copy of a real DB).
- [ ] In-memory active sessions never appear in `findInactiveSdkSessions`.
- [ ] Reaper does NOT cascade-delete `observations` / `session_summaries` unless explicit hard-delete + zero-FK-reference precondition.
- [ ] `/api/healthz` shows reaper metrics.

---

## Phase 3 — chroma-mcp Child-Process Supervisor

**Goal**: Stop the 23-concurrent-chroma-mcp leak. Bound concurrency, reap idle, scan for orphans at startup.

### 3.1 Files to modify

| File | Change |
| --- | --- |
| `src/services/sync/ChromaMcpManager.ts` | Add idle reaper; enforce single-instance via supervisor registry; add startup orphan scan; add `lastCallAt` timestamp updated by `callTool`. |
| `src/services/sync/ChromaMcpManager.ts:ensureConnected` (line 43) | Before connect, check `getProcessRegistry().getAll().filter(r => r.type === 'chroma')` — if non-empty AND PID alive AND PID not the current `_process.pid`, refuse to spawn (alert + reuse existing if possible; otherwise wait for backoff). |
| `src/services/sync/ChromaMcpManager.ts:registerManagedProcess` (line 613) | Already calls `getSupervisor().registerProcess(CHROMA_SUPERVISOR_ID, ...)` — verify the supervisor enforces single-instance for this id. (Currently `register` is keyed by id so same id replaces; document this.) |
| `src/supervisor/process-registry.ts` | Add `getActiveCountByType(type: string): number`. Add `findChromaOrphans(): Promise<number[]>` — POSIX `pgrep -af 'chroma-mcp'` filtered by PPID == 1. |
| `src/services/worker-service.ts:initializeBackground` | After `ChromaMcpManager.getInstance()`, kick off `await ChromaMcpManager.scanAndReapOrphans()` (best-effort; never throws). |

### 3.2 Detailed tasks

1. **Startup orphan scan**: New static method `ChromaMcpManager.scanAndReapOrphans()`:
   - POSIX: `pgrep -af 'chroma-mcp'` → for each PID, check PPID. If PPID == 1 (re-parented to init), call `killProcessTree(pid)` (existing function at line 388). Log `CHROMA_MCP` INFO `ReapedOrphan`, fields `{pid, ageSec}`.
   - Windows: `Get-CimInstance Win32_Process -Filter "Name='chroma-mcp.exe'"` filter by parent process state, kill with taskkill.
   - Bound the scan to processes whose command-line includes `chroma-mcp==<CHROMA_MCP_PINNED_VERSION>` to avoid killing unrelated chroma installations.

2. **Idle reaper**: Add `lastCallAt: number = 0` field to `ChromaMcpManager`. Update on every `callTool`. Run a `setInterval(checkIdle, 60_000)` (`.unref()`) — if `connected && Date.now() - lastCallAt > CHROMA_MCP_IDLE_SHUTDOWN_MS` (default 15 min), call `await this.stop()`. Lazy-reconnect resumes on next `callTool`.

3. **Single-instance guard on reconnect**: In `ensureConnected`, before `connectInternal`, call `getProcessRegistry().getActiveCountByType('chroma')`. If > 0 AND the registered PID is alive but `this.connected === false`, this is a stale process (we lost track). Tear it down via `killProcessTree(registeredPid)` first, then proceed with fresh spawn. Otherwise the count grows by one each reconnect — exactly the leak observed.

4. **Hard cap**: extend `getSupervisor().assertCanSpawn('chroma mcp')` (already called at line 87) to actually count and reject. Cap = 1 chroma-mcp per worker. Cap = `TOTAL_PROCESS_HARD_CAP` (10) overall — already enforced for SDK processes; extend to chroma-mcp.

5. **Tighten close path**: in `connectInternal` (line 74), after `transport.close()` / `client.close()`, if the underlying `_process.pid` is still in the registry, call `killProcessTree` and `unregisterProcess` explicitly. Don't rely on `transport.onclose` alone — it has the stale-callback guard but doesn't always fire on connect-time failures.

### 3.3 New settings

| Key | Default | Range | Purpose |
| --- | --- | --- | --- |
| `CLAUDE_MEM_CHROMA_IDLE_SHUTDOWN_MS` | `900000` (15 min) | 60000–86400000 | Idle reaper threshold |
| `CLAUDE_MEM_CHROMA_ORPHAN_SCAN_ON_START` | `true` | bool | Master switch for startup scan |
| `CLAUDE_MEM_CHROMA_MAX_CONCURRENT` | `1` | 1–4 | Cap chroma-mcp instances per worker |

### 3.4 Acceptance criteria

- Spawn 5 chroma-mcp processes manually parented to init; restart worker → all 5 are reaped at startup.
- Force connect-time failure (kill transport mid-connect) 10 times → registry count never exceeds 1.
- Run worker for 30 min with no chroma calls → process is reaped after 15 min and `getProcessRegistry().getActiveCountByType('chroma')` returns 0.
- `callTool` after idle-shutdown lazy-reconnects successfully.

### 3.5 Observability

- Log: `CHROMA_MCP` INFO `OrphanScan` `{found, killed}`.
- Log: `CHROMA_MCP` INFO `IdleShutdown` `{idleMs}`.
- Log: `CHROMA_MCP` WARN `RegistryStale` when single-instance guard tears down a phantom.
- `/api/healthz` fields (Phase 7): `chroma_mcp_pid_count`, `chroma_mcp_last_call_at`, `chroma_mcp_state` ('connected'|'disconnected'|'backoff'), `chroma_mcp_backoff_remaining_ms`.

### 3.6 Anti-pattern guards

- **Do not** kill chroma processes whose command-line doesn't match `chroma-mcp==<PINNED_VERSION>` — could match unrelated user installs.
- **Do not** spin up the idle-reaper timer if `chromaMcpManager` is null (chroma disabled via `CLAUDE_MEM_CHROMA_ENABLED=false`).
- **Do not** call `getProcessRegistry()` from outside the worker process — it's worker-internal.

### 3.7 Verification checklist

- [ ] After 2.5 hours of normal use, `ps aux | grep chroma-mcp | wc -l` ≤ 1.
- [ ] Idle-reaper timer is `.unref()`d.
- [ ] Orphan scan tolerates `pgrep` returning empty (no false-error logs).
- [ ] Build still passes on Windows (Win32 branch compiles even if not unit-tested).

---

## Phase 4 — Circuit Breaker for Retry Storms

**Goal**: Replace the unbounded counter at `worker-utils.ts:401` with a real circuit breaker. Stop hooks from hammering the worker when it's down.

### 4.1 Files to modify

| File | Change |
| --- | --- |
| `src/shared/worker-circuit-breaker.ts` (new) | `CircuitBreaker` class: states `CLOSED`, `OPEN`, `HALF_OPEN`. Persist to `~/.claude-mem/state/circuit-breaker.json`. |
| `src/shared/worker-utils.ts:executeWithWorkerFallback` (line 443) | Wrap the call in `breaker.run(...)`. On `OPEN`, return `WorkerFallback` immediately (no HTTP). |
| `src/shared/worker-utils.ts:recordWorkerUnreachable` (line 401) | Becomes a thin shim that calls `breaker.recordFailure()`. Hard cap (`MAX_LIFETIME_FAILURES = 50`) trips the breaker permanently until manual reset. |
| `src/shared/worker-utils.ts:resetWorkerFailureCounter` (line 419) | Becomes `breaker.recordSuccess()`. |
| `src/cli/hook-command.ts` | Verify the swallowed-stderr fix from observation 2026-05-07 is applied (it's marked as a "no-op replacement bug"). The breaker's stderr-fail-loud path must actually write to `process.stderr.write()`, not a stub. |
| `src/services/server/Server.ts` | Add `/api/admin/breaker/reset` POST endpoint (gated by localhost only) for manual unsticking. |

### 4.2 Breaker semantics

States and transitions:

```
CLOSED ──[N consecutive failures]──> OPEN
OPEN   ──[reset_timeout_ms elapsed]──> HALF_OPEN
HALF_OPEN ──[1 success]──> CLOSED
HALF_OPEN ──[1 failure]──> OPEN  (resets timer)
ANY    ──[lifetime failures > MAX_LIFETIME_FAILURES]──> OPEN_PERMANENT (until manual reset via API or settings reload)
```

Defaults:

| Setting | Default | Range |
| --- | --- | --- |
| `CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD` | `5` | 1–50 |
| `CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS` | `30000` | 1000–600000 |
| `CLAUDE_MEM_BREAKER_HALF_OPEN_MAX_PROBES` | `1` | 1–10 |
| `CLAUDE_MEM_BREAKER_LIFETIME_CAP` | `50` | 0–10000 (0 = no cap) |

Persistent state file shape:

```json
{
  "state": "CLOSED|OPEN|HALF_OPEN|OPEN_PERMANENT",
  "consecutiveFailures": 0,
  "lifetimeFailures": 0,
  "openedAt": null,
  "lastFailureAt": null,
  "lastSuccessAt": null,
  "lastTrippedAt": null
}
```

### 4.3 Detailed tasks

1. **CircuitBreaker class**: pure logic class, no I/O. Methods: `getState()`, `canAttempt()`, `recordFailure(reason)`, `recordSuccess()`, `forceReset()`. Atomic file writes (write tmp + rename) for the JSON snapshot, mirroring `writeHookFailureStateAtomic` (worker-utils.ts:372).

2. **Wire into `executeWithWorkerFallback`**:
   ```
   if (!breaker.canAttempt()) {
     // Optional: print one-line stderr if state changed during this call
     return { continue: true, reason: 'circuit_breaker_open', [WORKER_FALLBACK_BRAND]: true };
   }
   const alive = await ensureWorkerAliveOnce();
   if (!alive) { breaker.recordFailure('unreachable'); ... }
   ...
   if (response.ok) breaker.recordSuccess();
   ```

3. **Fail-loud stderr fix**: The 2026-05-07 observation mentions a "stderr no-op replacement bug" in `hookCommand`. Investigate `src/cli/hook-command.ts` for any `process.stderr.write` shim that suppresses output. The breaker's diagnostic ("Worker unreachable; circuit breaker OPEN; will retry in Xs") MUST appear on the user's terminal so they know what's happening. Test by intentionally killing the worker and running a hook — message should appear on stderr.

4. **Manual reset endpoint**: `POST /api/admin/breaker/reset` (no body required). Restricted to `127.0.0.1` only. Logs `SYSTEM` WARN `BreakerForceReset` with caller info.

5. **Lifetime cap**: when `lifetimeFailures > CLAUDE_MEM_BREAKER_LIFETIME_CAP`, transition to `OPEN_PERMANENT`. The only way out is the manual-reset API or restarting the worker with a fresh state file. Print prominent stderr: `claude-mem: 50 lifetime worker failures detected. Disabling memory hooks until reset. Run: claude-mem worker doctor`.

### 4.4 Acceptance criteria

- Kill the worker, run 100 hooks → exactly `CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD` HTTP attempts made; rest short-circuit.
- After 30s idle, next hook makes ONE probe (HALF_OPEN); if probe succeeds, breaker closes.
- Lifetime cap (set to 5 for testing): 6th lifetime failure → permanent open until `POST /api/admin/breaker/reset` clears it.
- Stderr message visible to user when breaker opens (manual repro: kill worker, run 5+ hooks).
- Existing hook-failures.json file is migrated to the new breaker JSON format on first run (one-shot migration in `worker-utils.ts`).

### 4.5 Observability

- Log: `SYSTEM` WARN `BreakerOpened`, fields `{lifetime, consecutiveBefore}`.
- Log: `SYSTEM` INFO `BreakerHalfOpen`.
- Log: `SYSTEM` INFO `BreakerClosed`, fields `{recoveredAfterMs}`.
- Log: `SYSTEM` ERROR `BreakerOpenedPermanent`.
- `/api/healthz` fields (Phase 7): `breaker_state`, `breaker_consecutive_failures`, `breaker_lifetime_failures`, `breaker_opened_at`, `breaker_total_trips`.

### 4.6 Anti-pattern guards

- **Do not** call the breaker from inside the worker process — it's a hook-side concern. The worker has `RestartGuard` for its own session-level limits.
- **Do not** auto-reset the lifetime counter on restart; persist it. Otherwise restart-loops mask the underlying failure.
- **Do not** block the breaker reset endpoint on initialization (`/api/admin/breaker/reset` should work even if `initializationCompleteFlag === false`).

### 4.7 Verification checklist

- [ ] No call site bypasses the breaker (grep for `workerHttpRequest` outside `executeWithWorkerFallback` and audit each — some integrations may need `breaker.canAttempt()` guards added).
- [ ] State file readable/writable across process restarts.
- [ ] Stderr fail-loud path verified end-to-end on Linux + macOS + Windows Terminal.
- [ ] No `process.exit(1)` introduced — breaker tripping returns `WorkerFallback`, not exit codes.

---

## Phase 7 — `/api/healthz` Endpoint with Concrete Metrics

**Goal**: Centralized observability so future regressions are detectable at a glance.

### 7.1 Files to modify

| File | Change |
| --- | --- |
| `src/services/worker/http/routes/HealthzRoutes.ts` (new) | Implements `RouteHandler`. GET `/api/healthz` and `/api/healthz?format=prom`. |
| `src/services/worker-service.ts:registerRoutes` | Register the new `HealthzRoutes(...)`. |
| `src/services/worker/MetricsCollector.ts` (new) | Aggregates metrics; refreshed on the supervisor's existing 30s health-check tick to avoid amplifying load. |
| `src/supervisor/health-checker.ts:runHealthCheck` | Call `MetricsCollector.refresh()` after `pruneDeadEntries`. |

### 7.2 Endpoint contract

`GET /api/healthz` → 200 JSON:

```json
{
  "status": "ok|degraded|unhealthy",
  "ts": "2026-05-07T21:30:00.000Z",
  "uptime_sec": 12345,
  "versions": {
    "plugin": "12.7.5",
    "worker": "12.7.5",
    "matches": true
  },
  "process": {
    "pid": 12345,
    "rss_mb": 145.2,
    "event_loop_lag_ms": 3.1,
    "managed": true,
    "platform": "darwin"
  },
  "pid_file": {
    "path": "/Users/.../worker.pid",
    "start_token": "Wed May  7 14:23:15 2026",
    "daemon_lock_held": true
  },
  "db": {
    "path": "/Users/.../claude-mem.db",
    "size_bytes": 31457280,
    "page_count": 7680,
    "freelist_count": 12,
    "free_ratio_pct": 0.16,
    "last_vacuum_at": "2026-05-07T20:00:00.000Z",
    "last_vacuum_freed_pages": 130000,
    "last_maintenance_at": "2026-05-07T20:00:00.000Z",
    "oldest_processing_pending_age_sec": 4,
    "processing_pending_count": 1,
    "pending_count_total": 12,
    "sdk_sessions_total": 145,
    "sdk_sessions_inactive": 13,
    "sdk_sessions_by_project": { "claude-mem": 25, "...": 120 }
  },
  "child_processes": {
    "chroma_mcp_pid_count": 1,
    "chroma_mcp_last_call_at": "2026-05-07T21:25:11.000Z",
    "chroma_mcp_state": "connected",
    "chroma_mcp_backoff_remaining_ms": 0,
    "sdk_process_count": 0,
    "supervisor_registry_size": 2
  },
  "network": {
    "hook_consecutive_failures": 0,
    "breaker_state": "CLOSED",
    "breaker_consecutive_failures": 0,
    "breaker_lifetime_failures": 3,
    "breaker_opened_at": null,
    "breaker_total_trips": 1,
    "last_request_at": "2026-05-07T21:29:55.000Z",
    "request_rate_per_min": 12.3
  },
  "ai": {
    "provider": "claude",
    "auth_method": "...",
    "last_interaction": { ... }
  }
}
```

`GET /api/healthz?format=prom` → 200 `text/plain` with Prometheus text format. One metric per JSON leaf (e.g. `claude_mem_db_free_ratio_pct 0.16`).

`status` derivation:
- `unhealthy` if breaker is OPEN_PERMANENT, OR DB initialization failed, OR chroma-mcp pid count > `CLAUDE_MEM_CHROMA_MAX_CONCURRENT`.
- `degraded` if breaker is OPEN, OR free_ratio > 0.4, OR oldest_processing_pending > 1 hour, OR worker version mismatches plugin version.
- `ok` otherwise.

### 7.3 Detailed tasks

1. **MetricsCollector class**: a `Map<string, unknown>` snapshot. Public `refresh()` collects fresh data; public `getSnapshot()` returns the cached object. Refresh is called by the 30s health-check tick AND on-demand if last refresh > 5s ago (debounced).

2. **DB metrics queries** (use `db.prepare` + `.get()`):
   - `PRAGMA page_count` → `{ page_count: number }`
   - `PRAGMA freelist_count` → `{ freelist_count: number }`
   - `PRAGMA page_size` → for size_bytes computation
   - `SELECT MIN(updated_at) FROM pending_messages WHERE status='processing'` (with `julianday` math for age in seconds)
   - `SELECT COUNT(*) FROM sdk_sessions GROUP BY project`

3. **Process metrics**: `process.memoryUsage().rss / 1024 / 1024`. Event-loop lag via `perf_hooks.monitorEventLoopDelay` (Node API, available in bun) — sample over 30s window.

4. **Network metrics**: maintain a rolling 1-min request counter in middleware (existing `createMiddleware` in `Server.ts:156`). Increment on each `/api/*` request.

5. **Prometheus format**: emit `# HELP` and `# TYPE` lines per metric. Use the same naming convention (`claude_mem_<group>_<name>`).

6. **Compatibility**: leave `/api/health` UNCHANGED (existing integrations break otherwise). `/api/healthz` is the new richer endpoint.

### 7.4 Acceptance criteria

- `curl 127.0.0.1:<port>/api/healthz | jq .status` returns `ok` on a healthy worker.
- After Phase 6 ships, `db.free_ratio_pct` updates at 30s cadence (verify by manually inflating freelist).
- Phase 4 breaker state changes are visible within 30s.
- `?format=prom` parses with `promtool check metrics`.
- No new endpoint blocks for > 50ms (snapshot is cached; refresh is async).

### 7.5 Observability hooks (yes, for the observability endpoint itself)

- Log `WORKER` DEBUG `MetricsRefresh`, fields `{durationMs}`.
- Log `WORKER` WARN `MetricsRefreshSlow` if refresh > 250ms (DB query stall signal).

### 7.6 Verification checklist

- [ ] `/api/health` response body unchanged byte-for-byte (regression test).
- [ ] All Phase 2-6 metrics exposed (cross-check the field list in those phases).
- [ ] `?format=prom` output validates with `promtool` if available; otherwise visual inspection.
- [ ] Endpoint mounted via `RouteHandler` pattern (no direct `app.get` in worker-service.ts).

---

## Phase 8 — Observability, CLI, & Rollout

**Goal**: User-facing surface so operators can see what the new machinery did. Ordered last to allow phases 2-7 to stabilize.

### 8.1 Files to modify

| File | Change |
| --- | --- |
| `src/cli/handlers/worker-doctor.ts` (new) | New CLI subcommand `claude-mem worker doctor` — fetches `/api/healthz`, formats it for terminals, includes recent reaper actions. |
| `src/services/worker-service.ts:main()` | Register the `worker doctor` CLI route (alongside existing `cursor`, `gemini-cli` cases). |
| `plugin/scripts/worker-cli.js` | Wire to the new doctor command. |
| `CLAUDE.md` (project root) | Document new settings under a "Worker Maintenance" section. |
| `docs/public/` (optional) | User-facing explanation of the breaker, reaper, and health endpoint. |

### 8.2 `worker doctor` output (example)

```
claude-mem worker doctor

Status:           OK
Version:          plugin=12.7.5 worker=12.7.5 (match)
Uptime:           3h 25m
PID:              12345  (lock held: yes)

Database:
  Size:             32 MB    (free: 0.16%)
  Last vacuum:      4h ago, freed 130k pages
  Pending:          12 total / 1 processing (oldest 4s)
  SDK sessions:     145 total / 13 inactive

Child processes:
  chroma-mcp:       1  (last call: 5s ago, state: connected)
  SDK processes:    0
  Supervisor:       2 entries

Circuit breaker:
  State:            CLOSED
  Consecutive:      0
  Lifetime:         3
  Total trips:      1

Recent maintenance (last 24h):
  2026-05-07 20:00  Vacuum: freed 130k pages in 1.4s
  2026-05-07 19:30  Reaper: 5 stuck-processing reset, 2 inactive marked
  2026-05-07 18:00  Chroma orphan scan: 0 found
```

If `status != ok`, append a "Recommended actions" block:
- breaker open → `claude-mem worker reset-breaker`
- DB free ratio high → mention next vacuum window
- chroma orphans → `claude-mem worker reap-chroma`

### 8.3 Detailed tasks

1. **Doctor command**: GET `/api/healthz` via `workerHttpRequest`. Format as the table above. Color-code (red/yellow/green) using existing chalk integration if present, otherwise plain text. JSON pass-through via `--json` flag.

2. **Recent-actions feed**: store the last 50 maintenance events in a circular buffer in `MetricsCollector` (in-memory only — survives one worker lifetime; not persistent). Expose at `/api/healthz/events` (separate to avoid bloating the main response).

3. **Update CLAUDE.md**: add a "Worker Maintenance" section with: settings reference table, the doctor command, a brief description of the reaper/breaker/vacuum behavior. Per CLAUDE.md "Important: No need to edit the changelog ever" — only edit CLAUDE.md, never CHANGELOG.

4. **Rollout ordering** (per problem statement constraint):
   - Wave 1 (idempotent, low-risk): Phase 5 (PID/port reclamation), Phase 6 (DB maintenance).
   - Wave 2 (reapers — needs careful testing on busy DBs): Phase 2 (session reaper), Phase 3 (chroma supervisor).
   - Wave 3 (user-visible behavior change): Phase 4 (circuit breaker), Phase 7 (`/api/healthz`).
   - Wave 4 (CLI surface): Phase 8 (doctor command, docs).

   Each wave can ship as a separate release. Inter-wave dependencies: Phase 7 depends on data sources from Phases 2/3/4/6 — but the endpoint can ship with partial data (fields gated by phase availability).

### 8.4 Acceptance criteria

- `claude-mem worker doctor` prints a green-OK summary on a healthy worker.
- `claude-mem worker doctor --json` returns valid JSON pipeable to `jq`.
- Killing the worker → `claude-mem worker doctor` cleanly reports `Worker unreachable` instead of hanging.
- CLAUDE.md updates are limited to a new section; no churn elsewhere.

### 8.5 Verification checklist

- [ ] `claude-mem worker doctor` exits 0 on healthy state, 1 on unhealthy, 2 if worker unreachable (mirrors hook-exit-codes convention).
- [ ] No new public marketplace API surface beyond what's documented.
- [ ] Doctor command works without the worker running (unreachable path covered).

---

## Final Phase — Cross-Phase Verification

**Goal**: Prove the system works end-to-end before declaring victory.

### F.1 Soak test (24h)

Run the worker for 24 hours under realistic Claude Code usage. After 24h:

| Metric | Pass criterion |
| --- | --- |
| `ps aux \| grep chroma-mcp \| wc -l` | ≤ 1 |
| `ps aux \| grep claude-mem \| wc -l` | ≤ a small constant (1-2) |
| DB size growth rate | < 5 MB/hr; free_ratio < 20% |
| `/api/healthz` `breaker.lifetime_failures` | < 10 (vs. the #1874 starting baseline) |
| Stuck `processing` rows older than 10 min | 0 |
| Worker memory RSS | < 300 MB (no leak) |

### F.2 Failure-injection tests

| Inject | Expected behavior |
| --- | --- |
| Kill worker via `kill -9` | Lazy-respawn on next hook; PID file cleaned |
| Two parallel `claude-mem start` | Exactly one daemon survives; lock log line visible |
| 100 stuck processing rows | Reaper resets all within `REAPER_PROCESSING_STUCK_MS + REAPER_TICK_MS` |
| Spawn fake listener on worker port | New `--daemon` exits 0 with diagnostic stderr (no silent exit) |
| Fork 5 chroma-mcp orphans | Worker startup reaps all 5 |
| Pull network during 10 hooks | Breaker opens after threshold; subsequent hooks short-circuit |

### F.3 Anti-pattern grep

```
# No new always-on intervals
grep -rn "setInterval" src/ --include="*.ts" | grep -v "unref()" | grep -v "^src/.*test"

# No new process.exit(1) on hook paths
git diff main -- src/shared/worker-utils.ts src/cli/ | grep "process.exit(1)"

# No invented settings
git diff main -- src/shared/SettingsDefaultsManager.ts | grep "CLAUDE_MEM_"
# Cross-reference with all phases' settings tables.

# No hardcoded magic numbers in business logic
git diff main | grep -E "[0-9]{4,}" | grep -v SettingsDefaultsManager | grep -v test
```

### F.4 Documentation diff

- `CLAUDE.md` adds: Worker Maintenance section (Phase 8.3).
- `docs/public/` (optional): user-facing explanation.
- No CHANGELOG edits (auto-generated per CLAUDE.md).

### F.5 Sign-off checklist

- [ ] All 8 phases shipped.
- [ ] `/api/healthz` reports `status: "ok"` 24h after deployment.
- [ ] No new ERROR-level logs in production for 24h (excluding pre-existing).
- [ ] Manual `worker doctor` on 3 production-like environments confirms expected output.
- [ ] Phase 0 doc-discovery anti-patterns not violated (grep `git log -p`).

---

## Appendix A — Settings Reference (consolidated)

All settings declared in `src/shared/SettingsDefaultsManager.ts`:

| Setting | Phase | Default | Range |
| --- | --- | --- | --- |
| `CLAUDE_MEM_DAEMON_LOCK_TIMEOUT_MS` | 5 | `5000` | 0–60000 |
| `CLAUDE_MEM_PID_PORT_RECHECK_MS` | 5 | `2000` | 500–30000 |
| `CLAUDE_MEM_DB_MAINTENANCE_ENABLED` | 6 | `true` | bool |
| `CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS` | 6 | `24` | 1–168 |
| `CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO` | 6 | `0.40` | 0.05–0.95 |
| `CLAUDE_MEM_DB_VACUUM_STARTUP_DELAY_MS` | 6 | `300000` | 0–3600000 |
| `CLAUDE_MEM_CLEANUP_REGRESSION_CHECK` | 6 | `true` | bool |
| `CLAUDE_MEM_REAPER_ENABLED` | 2 | `true` | bool |
| `CLAUDE_MEM_REAPER_TICK_MS` | 2 | `30000` | 5000–600000 |
| `CLAUDE_MEM_REAPER_PROCESSING_STUCK_MS` | 2 | `300000` | 30000–86400000 |
| `CLAUDE_MEM_REAPER_INACTIVE_DAYS` | 2 | `30` | 1–365 |
| `CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS` | 2 | `0` | 0–365 |
| `CLAUDE_MEM_CHROMA_IDLE_SHUTDOWN_MS` | 3 | `900000` | 60000–86400000 |
| `CLAUDE_MEM_CHROMA_ORPHAN_SCAN_ON_START` | 3 | `true` | bool |
| `CLAUDE_MEM_CHROMA_MAX_CONCURRENT` | 3 | `1` | 1–4 |
| `CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD` | 4 | `5` | 1–50 |
| `CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS` | 4 | `30000` | 1000–600000 |
| `CLAUDE_MEM_BREAKER_HALF_OPEN_MAX_PROBES` | 4 | `1` | 1–10 |
| `CLAUDE_MEM_BREAKER_LIFETIME_CAP` | 4 | `50` | 0–10000 |

## Appendix B — File Change Summary

| File | Phases that touch it |
| --- | --- |
| `src/services/worker-service.ts` | 3 (initializeBackground), 5 (--daemon), 6 (maintenance wiring), 7 (route registration), 8 (CLI) |
| `src/services/worker-spawner.ts` | 5 |
| `src/services/infrastructure/ProcessManager.ts` | 5 (lock + start-token) |
| `src/services/infrastructure/HealthMonitor.ts` | 5 (port-on-pid match) |
| `src/services/infrastructure/CleanupV12_4_3.ts` | 6 (regression detection — read only) |
| `src/services/sync/ChromaMcpManager.ts` | 3 |
| `src/supervisor/index.ts` | 5 (validateWorkerPidFile) |
| `src/supervisor/process-registry.ts` | 3 (orphan scan), 5 (start-token) |
| `src/supervisor/health-checker.ts` | 2 (reaper), 7 (metrics refresh) |
| `src/services/worker/SessionManager.ts` | 2 (delete hook), 6 (pause/resume) |
| `src/shared/worker-utils.ts` | 4 (breaker integration) |
| `src/services/sqlite/Database.ts` | 6 (auto_vacuum) |
| `src/services/sqlite/PendingMessageStore.ts` | 2 (reapStuckProcessing) |
| `src/services/sqlite/SessionStore.ts` | 2 (findInactiveSdkSessions) |
| `src/services/sqlite/migrations/runner.ts` | 2 (inactive_at column) |
| `src/services/server/Server.ts` | 4 (breaker reset), 7 (healthz route) |
| `src/shared/SettingsDefaultsManager.ts` | 2-6 (settings keys) |
| `src/services/maintenance/DbMaintenance.ts` | 6 (NEW) |
| `src/services/maintenance/SessionReaper.ts` | 2 (NEW) |
| `src/shared/worker-circuit-breaker.ts` | 4 (NEW) |
| `src/services/worker/MetricsCollector.ts` | 7 (NEW) |
| `src/services/worker/http/routes/HealthzRoutes.ts` | 7 (NEW) |
| `src/cli/handlers/worker-doctor.ts` | 8 (NEW) |
| `CLAUDE.md` | 8 (Worker Maintenance section) |

## Appendix C — Open Questions for Executor

1. **`bun:ffi` flock support**: confirm via spike before committing Phase 5.4. If unavailable, fall back to `flock(1)` shell on Linux + atomic `mkdirSync` sentinel on macOS/Windows.
2. **Event-loop lag sampling on bun**: verify `perf_hooks.monitorEventLoopDelay` works in bun's Node-compat layer. If not, fall back to a setImmediate-based heuristic.
3. **Existing-DB auto_vacuum migration**: verify that the startup full VACUUM in Phase 6.3 is sufficient to reclaim the 504 MB without requiring users to run `PRAGMA auto_vacuum = INCREMENTAL; VACUUM;` manually. (It should — full VACUUM with auto_vacuum already set takes effect.)
4. **Pro-features compatibility**: confirm with maintainers that `/api/healthz` does not duplicate any planned Pro endpoint. Per CLAUDE.md "Pro Features Architecture", the worker's local HTTP API stays open — `/api/healthz` is fine to add OSS-side.