claude-mem/plans/03-worker-lifecycle.md
Alex Newman a10d1b342f docs(plans): add architectural plan files for issues #2376-#2381
Six numbered plan documents covering:
- 01 Hook IO Discipline (#2376)
- 02 Spawn-Contract Templating (#2377)
- 03 Worker / Daemon Lifecycle Hardening (#2378)
- 04 Installer Failure Transparency (#2379)
- 05 Observer SDK Tool Enforcement (#2380)
- 06 Worker Env Isolation (#2381)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:31:02 -07:00

# Plan 03 — Worker / Daemon Lifecycle Hardening
> **Scope**: Fix accumulated worker / daemon lifecycle bugs in claude-mem.
> Address DB bloat, chroma-mcp leaks, retry storms, port/PID races, queue zombies, missing supervision, and observability gaps.
>
> **Non-implementation**: This document is a plan. Each phase is self-contained; an executing agent should be able to run a single phase without re-discovering context.
>
> **Audience**: Subsequent agents executing one phase per session.
---
## Phase 0 — Documentation Discovery & Allowed APIs
**Goal**: Anchor every implementation phase in real APIs that exist in the current codebase or in vetted libraries. Prevent phantom-method invention.
### 0.1 Read these files end-to-end before touching code
| File | Why |
| --- | --- |
| `CLAUDE.md` (project root) | Architecture, exit-code strategy, Pro/OSS boundary, settings conventions |
| `src/services/worker-service.ts` | `WorkerService` class, `--daemon` `main()`, signal registration, all CLI subcommands |
| `src/services/worker-spawner.ts` | `ensureWorkerStarted` 3-state machine (`ready`/`warming`/`dead`) |
| `src/services/infrastructure/ProcessManager.ts` | `spawnDaemon`, PID file ops, `captureProcessStartToken`, `isProcessAlive` |
| `src/services/infrastructure/HealthMonitor.ts` | `isPortInUse`, `waitForHealth`, `waitForReadiness`, `httpShutdown` |
| `src/services/infrastructure/GracefulShutdown.ts` | `performGracefulShutdown` ordering |
| `src/services/infrastructure/CleanupV12_4_3.ts` | `runOneTimeV12_4_3Cleanup`, `STUCK_PENDING_THRESHOLD = 10`, observer-purge SQL |
| `src/services/sync/ChromaMcpManager.ts` | `ensureConnected`, `connectInternal`, `stop`, `killProcessTree`, `collectDescendantPids`, `RECONNECT_BACKOFF_MS = 10_000`, `MCP_CONNECTION_TIMEOUT_MS = 30_000` |
| `src/supervisor/index.ts` | `Supervisor` class, `validateWorkerPidFile`, signal-handler config |
| `src/supervisor/process-registry.ts` | `ProcessRegistry`, `getSdkProcessForSession`, `ensureSdkProcessExit`, `waitForSlot`, `TOTAL_PROCESS_HARD_CAP = 10` |
| `src/supervisor/health-checker.ts` | 30s `pruneDeadEntries` loop (already present — extend, don't replace) |
| `src/supervisor/shutdown.ts` | `runShutdownCascade`, `signalProcess`, `loadTreeKill` |
| `src/services/worker/SessionManager.ts` | In-memory session map, `deleteSession`, queue/pending integration |
| `src/services/worker/RestartGuard.ts` | Per-session restart cap (10/60s window, 5 consecutive) |
| `src/services/worker/retry.ts` | Provider-level retry (`withRetry`, classified errors) — DO NOT mutate; circuit breaker layers ABOVE this |
| `src/shared/worker-utils.ts` | `recordWorkerUnreachable` (line 401), `executeWithWorkerFallback` (line 443), fail-loud counter file at `~/.claude-mem/state/hook-failures.json` |
| `src/services/sqlite/Database.ts` | PRAGMA setup (lines 27-32, 69-74) — single source of truth for DB pragmas |
| `src/services/server/Server.ts` | `/api/health` (line 161), `/api/readiness` (line 178), `/api/version` (line 192) |
| `src/shared/SettingsDefaultsManager.ts` | Where every new setting key MUST be declared with a default |
| `src/shared/hook-constants.ts` | `HOOK_TIMEOUTS`, `HOOK_EXIT_CODES` — extend here, don't inline |
| `plugin/bun-runner.js`, `plugin/scripts/worker-service.cjs` | Built worker entrypoint — note the build pipeline (`scripts/build-hooks.js`) |
### 0.2 Allowed APIs (use these, do NOT invent siblings)
**SQLite (bun:sqlite)** — pragma calls are `db.run('PRAGMA …')` or `db.prepare('PRAGMA …').get()`. Existing pragmas: `journal_mode=WAL`, `synchronous=NORMAL`, `foreign_keys=ON`, `temp_store=memory`, `mmap_size`, `cache_size`. **VACUUM** runs only outside a transaction. `VACUUM INTO 'path'` is the backup form already used in `CleanupV12_4_3.ts:135`. `wal_checkpoint(TRUNCATE)` is the truncating-checkpoint form.
**Process supervision**: `getSupervisor()`, `getProcessRegistry()`, `registerProcess(id, info, processRef?)`, `unregisterProcess(id)`, `pruneDeadEntries()`, `assertCanSpawn(type)`, `runShutdownCascade(...)`. Tree-kill on POSIX uses `pgrep -P` recursion plus `process.kill(-pgid, signal)`; on Windows it uses `taskkill /T /F /PID` or the `tree-kill` npm package.
**HTTP/Express**: `Server.app.get('/api/...', handler)` via `registerRoutes` (handlers implement `setupRoutes(app)` on a `RouteHandler` interface). Every new endpoint must follow the existing `RouteHandler` pattern under `src/services/worker/http/routes/`.
**Settings**: `SettingsDefaultsManager.get('CLAUDE_MEM_…')`, `SettingsDefaultsManager.loadFromFile(path)`. New keys require: (a) type added to the interface in `SettingsDefaultsManager.ts`, (b) default value declared in the same file, (c) documented in CLAUDE.md if user-tunable.
**Logging**: `logger.info(category, msg, fields)`, `logger.warn`, `logger.error(category, msg, fields, error)`. Categories used here: `SYSTEM`, `WORKER`, `SESSION`, `CHROMA_MCP`, `SDK`, `DB`, `QUEUE`, `PROCESS`. Add new category `MAINTENANCE` for VACUUM / reaper events.
### 0.3 Anti-patterns — explicitly forbidden
- **Do not** add a new singleton supervisor — extend `getSupervisor()`.
- **Do not** spawn child processes without going through `getSupervisor().assertCanSpawn(...)` and `registerProcess(...)`.
- **Do not** call `process.exit(1)` on hook-side error paths — it accumulates Windows Terminal tabs (CLAUDE.md exit-code strategy). Use `0` for graceful, `2` only for blocking-error paths that need to surface stderr to Claude.
- **Do not** delete `sdk_sessions` rows if `observations` or `session_summaries` still reference their `memory_session_id` without an explicit user-opt-in flag.
- **Do not** hold a SQLite write lock during `VACUUM` while ingestion is hot. Pause queue processing first.
- **Do not** introduce setInterval timers that keep the event loop alive — every new timer must call `.unref()`.
- **Do not** invent settings keys — declare them in `SettingsDefaultsManager.ts` first.
### 0.4 Confidence note
Confidence: HIGH on file/API inventory (read-pass complete on all referenced files). MEDIUM on Windows behavior of new advisory locks (Windows mandatory locking via `LockFile` is bun-runtime-dependent; verify via spike before committing).

---
## Phase 1 — Inventory & Instrumentation (read-only, safe)
**Goal**: Produce a written state-machine diagram and an exit-site catalog that subsequent phases reference. No code changes; create a scratch document at `docs/internal/worker-lifecycle-state-machine.md` if the executor wants an artifact, otherwise capture findings in commit messages.
### 1.1 Tasks
1. **Trace the worker daemon spawn → terminate path** end-to-end. Source order:
- Hook entry → `src/shared/worker-utils.ts:ensureWorkerRunning` (lazy spawn) OR `src/services/worker-spawner.ts:ensureWorkerStarted` (explicit)
- `spawnDaemon` (`src/services/infrastructure/ProcessManager.ts:408`) — POSIX uses `setsid` if available, Windows uses `Start-Process -WindowStyle Hidden`
- `--daemon` branch in `src/services/worker-service.ts:937` — duplicate-PID/duplicate-port guard
- `WorkerService.start()` (line 258) → `startSupervisor()` → `server.listen()` → `writePidFile()` → `getSupervisor().registerProcess('worker', ...)` → `initializeBackground()`
- Signal handlers via `configureSupervisorSignalHandlers` (`src/supervisor/index.ts:49`) — SIGTERM/SIGINT; SIGHUP ignored in `--daemon` mode on POSIX
- Shutdown: `WorkerService.shutdown()` → `performGracefulShutdown` → server close → `sessionManager.shutdownAll()` → mcp client close → chroma stop → db close → `getSupervisor().stop()` → `runShutdownCascade` → PID file unlink
2. **Catalog every `process.exit(...)` site** in worker-service.ts (already mapped — 21 sites; lines 764, 772, 794, 804, 810, 813, 828, 835, 842, 853, 870, 878, 888, 895, 916, 933, 945, 950, 971, 975, 991). Annotate each with: code, intent, whether it leaks the worker on the same path, whether shutdown ran first.
3. **Catalog every retry / unreachable site**:
- `src/shared/worker-utils.ts:401 recordWorkerUnreachable` (the #1874 counter)
- `src/cli/handlers/{context,file-context,file-edit,summarize,observation,user-message,session-init}.ts` — every `executeWithWorkerFallback` caller
- `src/servers/mcp-server.ts:72,100,145` — direct `workerHttpRequest`
- `src/services/transcripts/processor.ts:331,371,373` — direct `workerHttpRequest`
- `src/services/integrations/CursorHooksInstaller.ts:64,349,352` — direct `workerHttpRequest`
- `src/utils/claude-md-utils.ts:305` — direct `workerHttpRequest`
4. **Catalog every spawn site**:
- `spawnDaemon` (worker self-spawn)
- `ChromaMcpManager.connectInternal` (chroma-mcp via uvx → uv → python → chroma-mcp)
- `spawnSdkProcess` (`src/supervisor/process-registry.ts:532`) — Claude SDK subprocesses
- `runMcpSelfCheck` (`src/services/worker-service.ts:405`) — MCP loopback probe via `process.execPath`
- Any `execSync` / `execFile` / `spawnSync` in `ChromaMcpManager` (cert resolution) or `ProcessManager` (binary lookup, cwd-remap)
### 1.2 Acceptance criteria
- Markdown table written (commit message or scratch doc) listing every spawn and exit site with file:line.
- A 1-paragraph English description of the worker state machine (states + transitions) suitable to paste into PR descriptions.
- Confirmed list of which `executeWithWorkerFallback` callers run inside hooks (Claude Code's strict timeout window) vs. inside the worker (no timeout pressure) — this drives Phase 4 circuit-breaker scoping.
### 1.3 Verification
- `grep -rn "process.exit" src/ --include="*.ts" | wc -l` matches the catalog.
- `grep -rn "executeWithWorkerFallback\|workerHttpRequest" src/ --include="*.ts" | grep -v worker-utils.ts | wc -l` matches the catalog.
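
The exit-site catalog can also be produced mechanically. The scanner below is a rough sketch, not an existing script in the repository (`findExitSites` and `ExitSite` are hypothetical names); it records every `process.exit(...)` call in a source string with its 1-based line number and literal argument.

```typescript
interface ExitSite {
  line: number; // 1-based line number within the scanned source
  code: string; // literal argument text, or "" when exit() is called bare
  text: string; // trimmed source line, for the catalog table
}

/**
 * Finds process.exit(...) calls line by line. Arguments that contain nested
 * parentheses are cut at the first ")" because this is a regex, not a parser.
 */
function findExitSites(source: string): ExitSite[] {
  const sites: ExitSite[] = [];
  const lines = source.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const pattern = /process\.exit\(\s*([^)]*)\)/g;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(lines[i])) !== null) {
      sites.push({ line: i + 1, code: match[1].trim(), text: lines[i].trim() });
    }
  }
  return sites;
}
```

Unlike `grep ... | wc -l`, which counts matching lines, this counts individual calls, so the two numbers differ when one line contains more than one call.
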
### 1.4 Deliverable
Hand-off note for Phase 2-8 executors with file/line anchors; no code committed.

---
## Phase 5 — PID/Port Reclamation & Race-Free Startup
> Shipping order: **Phase 5 first** (per Phase 8 ordering). Idempotent and safe.
**Goal**: Eliminate the silent-exit-0 case where a fresh `--daemon` spawn loses the port race; harden cross-platform PID-reuse detection; serialize concurrent spawns with an OS-level advisory lock.
### 5.1 Files to modify
| File | Change |
| --- | --- |
| `src/supervisor/process-registry.ts` | Extend `captureProcessStartToken` for macOS (already partial via `ps -o lstart`) and Windows (`wmic process where ProcessId=X get CreationDate /value`). Add unit test for each platform branch. |
| `src/supervisor/index.ts:validateWorkerPidFile` | Add port-on-pid match check — if `pidInfo.port !== currentExpectedPort`, treat as `'stale'`. |
| `src/services/infrastructure/ProcessManager.ts` | Add new exports: `acquireDaemonLock()` / `releaseDaemonLock()` using POSIX `flock` (via `fcntl`/`flock` syscall through `bun:ffi` or shelling to `flock(1)` on Linux only) and Windows mandatory file lock via `LockFile` (or fall back to atomic-rename sentinel on Windows). |
| `src/services/worker-service.ts:937` (`--daemon` branch) | Wrap startup in `acquireDaemonLock()`. If port is in use, perform a `/api/version` probe; if the listener returns OUR `BUILT_IN_VERSION` → exit 0 (legit duplicate); if it returns a different version → log a warning and exit 0 (stale worker, will be restarted by version-mismatch path); if the listener doesn't respond → wait `HOOK_TIMEOUTS.PORT_IN_USE_WAIT` then write a clear stderr line with diagnostic before exiting. |
| `src/services/worker-spawner.ts` | Same lock acquisition before `spawnDaemon`. Release on success or error. |
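
The `--daemon` port-in-use handling described in the table can be reduced to a small pure decision, which makes the three outcomes easy to unit-test. A sketch only: `decideOnOccupiedPort`, `OccupiedPortDecision`, and the `reason` labels are hypothetical, and `probedVersion` is assumed to be `null` when the `/api/version` probe got no answer within the wait window.

```typescript
interface OccupiedPortDecision {
  exitCode: 0; // every branch exits 0 so a hook is never blocked
  reason: "duplicate-worker" | "stale-worker" | "unknown-listener";
  stderrLine?: string; // only set when the listener could not be identified
}

function decideOnOccupiedPort(
  port: number,
  probedVersion: string | null,
  ourVersion: string,
): OccupiedPortDecision {
  if (probedVersion === null) {
    // Nobody answered /api/version: say so on stderr instead of exiting silently.
    return {
      exitCode: 0,
      reason: "unknown-listener",
      stderrLine: `claude-mem worker port ${port} in use by an unidentified process; not spawning duplicate`,
    };
  }
  if (probedVersion === ourVersion) {
    return { exitCode: 0, reason: "duplicate-worker" }; // legitimate duplicate spawn
  }
  // A different version is listening; the version-mismatch path restarts it.
  return { exitCode: 0, reason: "stale-worker" };
}
```

Keeping the decision separate from the HTTP probe lets the probe itself stay a thin wrapper around the existing health helpers.
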
### 5.2 Detailed tasks
1. **macOS start-time token**: extend `captureProcessStartToken` (registry line 56). On Darwin, prefer `ps -p <pid> -o lstart=` (already in fallback path). Verify with `LC_ALL=C LANG=C` env so locale doesn't change the timestamp format. Add a comment explaining that `ps lstart` resolution is 1-second — collisions still possible but vastly less likely than no-token.
2. **Windows start-time token**: add a Win32 branch using `wmic process where ProcessId=<pid> get CreationDate /value`. Parse the `CreationDate=YYYYMMDDHHMMSS.ffffff+TZ` line. Cache the wmic resolution per-pid for 5s (avoid re-shelling on repeat checks).
3. **Port-on-pid match**: in `validateWorkerPidFile`, after confirming `isPidAlive(pidInfo.pid)`, verify the recorded `pidInfo.port` is reachable via `isPortInUse(pidInfo.port)` AND the listener's `/api/version` returns a version string. If port is dead but PID alive → return `'stale'` (worker crashed mid-listen, PID about to be reused).
4. **Advisory lock**:
- POSIX: open `<DATA_DIR>/.worker-spawn.lock` with `O_RDWR | O_CREAT`, `flock(fd, LOCK_EX | LOCK_NB)`. On EAGAIN, log `Another spawn in progress, waiting up to 5s` and retry with `LOCK_EX` (blocking) under a `setTimeout` race. Implement via `bun:ffi` for POSIX `flock(2)` if available, otherwise shell `flock -n -x <path> <command>`. **Spike first**: confirm bun's `bun:ffi` exposes `flock`. If not, use a watch-and-rename sentinel (less ideal but works).
- Windows: Use `LockFile` via Win32 API or fall back to atomic `mkdirSync` of `<DATA_DIR>/.worker-spawn.lock.dir` (fails if exists) with stale-timeout cleanup at 30s.
5. **Diagnostic stderr**: when port-in-use without our worker responding, write to stderr (and log INFO) with: `claude-mem worker port <N> in use by an unidentified process; not spawning duplicate`. This must NOT block the hook — exit 0 still per CLAUDE.md.
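
The directory-based fallback from task 4 (atomic `mkdirSync` with a 30s stale timeout) might look roughly like the following. This is a sketch under stated assumptions, not the planned `acquireDaemonLock()` itself: `tryAcquireSpawnLock` and `STALE_LOCK_MS` are hypothetical names, staleness is judged from the lock directory's mtime, and the caller passes the data directory explicitly.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

const STALE_LOCK_MS = 30_000; // stale-timeout cleanup window from task 4

/**
 * Tries to take the spawn lock by creating a directory. mkdirSync without
 * `recursive` fails with EEXIST when the directory already exists, which is
 * what makes it usable as a lock. Returns a release function on success, or
 * null when another process holds a fresh lock.
 */
function tryAcquireSpawnLock(
  dataDir: string,
  now: number = Date.now(),
): (() => void) | null {
  const lockDir = path.join(dataDir, ".worker-spawn.lock.dir");
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      fs.mkdirSync(lockDir);
      return () => fs.rmSync(lockDir, { recursive: true, force: true });
    } catch (error) {
      if ((error as NodeJS.ErrnoException).code !== "EEXIST") throw error;
      let ageMs: number;
      try {
        ageMs = now - fs.statSync(lockDir).mtimeMs;
      } catch {
        continue; // holder released between mkdir and stat; try once more
      }
      if (ageMs < STALE_LOCK_MS) return null; // fresh lock: someone is spawning
      fs.rmSync(lockDir, { recursive: true, force: true }); // stale: reclaim
    }
  }
  return null;
}
```

Reclaiming a stale lock is itself racy (two processes can both decide the lock is stale), which is one reason to prefer a real `flock`/`LockFile` lock wherever the spike shows it is available.
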
### 5.3 New settings
| Key | Default | Range | Purpose |
| --- | --- | --- | --- |
| `CLAUDE_MEM_DAEMON_LOCK_TIMEOUT_MS` | `5000` | 0-60000 | Max wait for the spawn lock |
| `CLAUDE_MEM_PID_PORT_RECHECK_MS` | `2000` | 500-30000 | Wait window before treating port-in-use without `/api/version` response as "unknown listener" |
### 5.4 Acceptance criteria
- Run two `claude-mem start` commands in parallel → exactly one daemon ends up alive; the other exits cleanly with a log line referencing the lock.
- Kill the worker `-9` (skip cleanup), reuse the PID with `python -c 'import time; time.sleep(60)'` → `validateWorkerPidFile` returns `'stale'` and removes the file.
- On macOS, run worker, capture token, kill, spawn unrelated process with same PID, spawn worker again → token mismatch detected; old PID file ignored.
- `/api/version` probe path: spawn a fake server on the worker port → daemon exits 0 with the new diagnostic stderr, NOT silently.
### 5.5 Observability hooks
- Log `SYSTEM` INFO `Daemon spawn lock acquired` on success.
- Log `SYSTEM` WARN `Daemon spawn lock contention`, fields `{waitedMs}`.
- Log `SYSTEM` WARN `Worker port occupied by foreign listener`, fields `{port, probeStatus}`.
- New `/api/healthz` fields (added in Phase 7): `pid_file_path`, `pid_start_token`, `daemon_lock_held: bool`.
### 5.6 Verification checklist
- [ ] `grep "process.exit(0)" src/services/worker-service.ts` — count unchanged (no new silent exits introduced).
- [ ] Manual two-process race test (Linux + macOS + Windows VM).
- [ ] Existing health-check tests still pass.
- [ ] No new always-on `setInterval` introduced.
---
## Phase 6 — DB Maintenance (VACUUM / WAL)
> Ships alongside Phase 5 (idempotent).
**Goal**: Recover the 504 MB of free pages, prevent recurrence, surface DB-size metrics.
### 6.1 Files to modify
| File | Change |
| --- | --- |
| `src/services/sqlite/Database.ts:27-32` and `:69-74` | Add `PRAGMA auto_vacuum = INCREMENTAL` BEFORE the first table is created (only takes effect on a fresh DB; harmless on existing DBs but logs a no-op). For existing DBs, the migration path is the one-shot Phase-6 startup VACUUM. |
| `src/services/maintenance/DbMaintenance.ts` (new) | Periodic maintenance task: on a 24h timer (configurable), call `PRAGMA incremental_vacuum`, `PRAGMA wal_checkpoint(TRUNCATE)`, then collect metrics (`page_count`, `freelist_count`, file size). Emit `MAINTENANCE` INFO log. Acquire `dbMaintenanceMutex` so other writers wait. |
| `src/services/maintenance/DbMaintenance.ts` | Startup check: if `freelist_count / page_count >= CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO` (default `0.40`), perform a full `VACUUM` after a `VACUUM INTO` backup to `<DATA_DIR>/backups/claude-mem-pre-vacuum-<ts>.db`. Pause the queue processor before the full `VACUUM`. |
| `src/services/worker-service.ts:initializeBackground` | Wire the maintenance task — start after `dbManager.initialize()`. Timer must `.unref()`. |
| `src/services/worker/SessionManager.ts` | Expose `pauseQueueProcessing(): Promise<void>` and `resumeQueueProcessing(): void`. Use the existing AbortController + emitter to drain in-flight work; don't introduce new state. Maintenance acquires; readers continue (WAL allows them). |
| `src/services/infrastructure/CleanupV12_4_3.ts:135` | Reuse the existing `VACUUM INTO` backup pattern verbatim — copy the disk-space pre-flight check (`statfsSync`, line 115). |
### 6.2 Detailed tasks
1. **Auto-vacuum on new DBs**: Add `PRAGMA auto_vacuum = INCREMENTAL` in `Database.ts` BEFORE `migrationRunner.runAllMigrations()`. Verify with a comment that this is no-op on existing DBs (sqlite docs say a full VACUUM is required to flip auto_vacuum mode after tables exist). Document the migration path: existing users get the freed-page reclamation via the startup full VACUUM in step 3.
2. **Periodic incremental vacuum + WAL checkpoint**:
- Schedule via `setInterval` with `.unref()`. Default cadence: 24h. Setting: `CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS` (default `24`, min `1`, max `168`).
- Each tick: acquire mutex → `db.run('PRAGMA incremental_vacuum')` → `db.run('PRAGMA wal_checkpoint(TRUNCATE)')` → snapshot metrics → release.
- Skip the tick if a `VACUUM` is in progress.
3. **Startup full VACUUM (one-shot per session) when free-ratio is high**:
- Read `page_count` (`PRAGMA page_count`) and `freelist_count` (`PRAGMA freelist_count`).
- If `freelist_count / page_count >= CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO` (default `0.40`), schedule a deferred VACUUM (5 minutes after worker becomes ready) to avoid slowing startup.
- VACUUM steps: `VACUUM INTO '<backup>'` → verify backup → pause queue → `VACUUM` (full) → resume queue → log freed pages and ms taken. The pause covers only the full `VACUUM`, matching the guard in 6.6.
- Disk-space pre-flight: `statfsSync` (mirror `CleanupV12_4_3.ts:115`). Skip if free space < `1.2 * dbSize + 100MB`. Log `MAINTENANCE` ERROR in that case so the user sees actionable info.
4. **Pause/resume hook in SessionManager**: The existing `for await ... of getMessageIterator()` loop in queue processor needs a "pause" semaphore. Implementation: add a `Promise<void>` gate that the iterator awaits before yielding. Maintenance flips it to a pending promise during VACUUM; resolve to release. **Do not** abort in-flight messages — they can complete; new messages wait.
5. **Cleanup-V12.4.3 regression detection**: Re-scan `sdk_sessions WHERE project = OBSERVER_SESSIONS_PROJECT` and `pending_messages` matching the stuck-pending pattern at maintenance ticks. If any match AND the marker exists, log `MAINTENANCE` WARN and re-run the purge (idempotent). Setting: `CLAUDE_MEM_CLEANUP_REGRESSION_CHECK = true`.
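
The pause gate from task 4 can be sketched as a small class plus a wrapper around the message iterator. The names here (`PauseGate`, `gated`) are hypothetical and the real `getMessageIterator()` is not reproduced; the point is only that in-flight work is never aborted while new messages wait.

```typescript
class PauseGate {
  private resume: (() => void) | null = null;
  private gate: Promise<void> = Promise.resolve();

  /** Close the gate: subsequent wait() calls block until release(). */
  pause(): void {
    if (this.resume !== null) return; // already paused
    this.gate = new Promise<void>((resolve) => {
      this.resume = () => resolve();
    });
  }

  /** Open the gate and wake everything that is waiting on it. */
  release(): void {
    if (this.resume === null) return;
    this.resume();
    this.resume = null;
  }

  isPaused(): boolean {
    return this.resume !== null;
  }

  wait(): Promise<void> {
    return this.gate;
  }
}

/** Wraps a message source so each item is held back while the gate is closed. */
async function* gated<T>(source: AsyncIterable<T>, gate: PauseGate): AsyncGenerator<T> {
  for await (const item of source) {
    await gate.wait(); // awaited before yielding, as described in task 4
    yield item;
  }
}
```

Maintenance would call `pause()` right before the full `VACUUM` and `release()` in a `finally` block, so a failed VACUUM cannot leave the queue stuck.
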
### 6.3 New settings
| Key | Default | Range | Purpose |
| --- | --- | --- | --- |
| `CLAUDE_MEM_DB_MAINTENANCE_ENABLED` | `true` | bool | Master kill-switch |
| `CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS` | `24` | 1-168 | Periodic cadence |
| `CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO` | `0.40` | 0.05-0.95 | Free-ratio above which we auto-VACUUM at startup |
| `CLAUDE_MEM_DB_VACUUM_STARTUP_DELAY_MS` | `300000` (5 min) | 0-3600000 | Defer startup VACUUM so it doesn't block readiness |
| `CLAUDE_MEM_CLEANUP_REGRESSION_CHECK` | `true` | bool | Re-scan v12.4.3-shaped pollution |
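
How the threshold setting and the disk-space pre-flight combine can be shown with two pure functions. A minimal sketch, assuming the page counts come from `PRAGMA page_count` / `PRAGMA freelist_count` and the free bytes from `statfsSync`; the function names are hypothetical, and treating "100MB" as 100 MiB is an assumption.

```typescript
const MIB = 1024 * 1024; // assumption: the plan's "100MB" is read as 100 MiB

interface DbPageStats {
  pageCount: number;     // PRAGMA page_count
  freelistCount: number; // PRAGMA freelist_count
}

function freeRatio(stats: DbPageStats): number {
  return stats.pageCount > 0 ? stats.freelistCount / stats.pageCount : 0;
}

/** True when the free-page ratio reaches the configured threshold. */
function shouldVacuumAtStartup(stats: DbPageStats, thresholdRatio: number = 0.4): boolean {
  return freeRatio(stats) >= thresholdRatio;
}

/** Pre-flight: a backup plus rewritten copy needs roughly 1.2x the DB size plus headroom. */
function hasDiskSpaceForVacuum(freeBytes: number, dbSizeBytes: number): boolean {
  return freeBytes >= 1.2 * dbSizeBytes + 100 * MIB;
}
```

For example, a database with 1000 pages of which 400 are free sits exactly at the default threshold and would be scheduled for the deferred VACUUM, provided the disk check also passes.
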
### 6.4 Acceptance criteria
- Reproduce the bloat scenario: stuff `pending_messages` with 100k stuck `processing` rows, run worker → startup VACUUM fires within 5 min after readiness, freed-pages log line appears, file size drops.
- Existing 532 MB DBs reclaim ≥ 95% of free pages on first run (matches the 28 MB target observed manually).
- Hot-ingestion test: enqueue 1000 observations during a maintenance tick → no `SQLITE_BUSY` or `database is locked` errors; queue resumes after VACUUM.
- `PRAGMA auto_vacuum` returns `2` (incremental) on freshly-created DBs.
- Maintenance loop ticks honor `.unref()`: `process.exit(0)` from a clean shutdown returns immediately, not after the 24h interval.
### 6.5 Observability hooks
- New log category: `MAINTENANCE`.
- Events: `MaintenanceStart`, `MaintenanceTick`, `VacuumStart`, `VacuumComplete` (`{freedPages, ms, dbSizeBeforeMb, dbSizeAfterMb}`), `VacuumSkippedLowDisk`, `RegressionDetected`, `MaintenanceComplete`.
- `/api/healthz` fields (Phase 7): `db_page_count`, `db_freelist_count`, `db_free_ratio_pct`, `db_size_bytes`, `db_last_vacuum_at`, `db_last_vacuum_freed_pages`, `db_last_maintenance_at`.
### 6.6 Anti-pattern guards
- **Do not** call `VACUUM` inside a transaction (sqlite errors).
- **Do not** hold the queue pause across the `VACUUM INTO` backup phase — only the final full `VACUUM` needs the writer-lock window. (`VACUUM INTO` works on a read-only snapshot.)
- **Do not** call `PRAGMA wal_checkpoint(FULL)` — TRUNCATE is required to actually shrink the WAL file.
### 6.7 Verification checklist
- [ ] Backup created at `<DATA_DIR>/backups/` before every full VACUUM.
- [ ] Maintenance timer registered with `.unref()` (grep for `setInterval` in the new file → `unref()` follows each).
- [ ] No new direct `setInterval` outside the maintenance file.
- [ ] PRAGMA list in `Database.ts` extended with `auto_vacuum` and includes a comment about migration.
---
## Phase 2 — Stuck-Session Reaper (fix v12.4.3 bloat)
**Goal**: Stop `pending_messages` and `sdk_sessions` from accumulating zombies.
### 2.1 Files to modify
| File | Change |
| --- | --- |
| `src/services/maintenance/SessionReaper.ts` (new) | Periodic reaper. Plugs into the supervisor's existing `health-checker.ts` 30s tick (extend, do not replace). |
| `src/supervisor/health-checker.ts:9 runHealthCheck` | Call `SessionReaper.tick()` after `pruneDeadEntries()`. |
| `src/services/worker/SessionManager.ts:deleteSession` | After in-memory delete, call `pendingStore.clearPendingForSession(sessionDbId)` synchronously (it already does this via `clearPendingForSession` on a separate path — verify and unify). |
| `src/services/sqlite/PendingMessageStore.ts` | Add `reapStuckProcessing(olderThanMs: number): number` returning the count of rows reset to `pending`. |
| `src/services/sqlite/SessionStore.ts` | Add `findInactiveSdkSessions(olderThanDays: number): Array<{id, project, contentSessionId, memorySessionId, lastActivityAt}>`. |
| `src/services/sqlite/SessionStore.ts` | Add `markSdkSessionInactive(id: number)` — adds an `inactive_at` column or sets a sentinel. |
| `src/services/sqlite/migrations/runner.ts` | New migration: add `inactive_at TEXT NULL` to `sdk_sessions` if absent. |
### 2.2 Reaper logic
Per tick (default 30s, gated by `CLAUDE_MEM_REAPER_ENABLED`):
1. **Stuck-processing sweep**: `UPDATE pending_messages SET status='pending' WHERE status='processing' AND updated_at < <now - PROCESSING_STUCK_MS>` (default 5 minutes). Log count if > 0.
2. **Orphan-pending sweep**: `DELETE FROM pending_messages WHERE session_db_id NOT IN (SELECT id FROM sdk_sessions)` (defensive — should already be FK-protected but log if any deleted).
3. **Inactive-session detection** (does NOT delete):
- SELECT sdk_sessions where `id NOT IN <in-memory session ids>` AND the last activity is more than N days ago (last activity computed as the MAX of related observations / pending_messages / session_summaries timestamps).
- For each: `UPDATE sdk_sessions SET inactive_at = <now> WHERE id = ? AND inactive_at IS NULL`.
4. **Observer-pollution regression check** (matches Phase 6 task 5):
- If `OBSERVER_SESSIONS_PROJECT` rows reappear after the v12.4.3 marker is present, re-run the purge SQL from `CleanupV12_4_3.runObserverSessionsPurge` (lines 196-218).
- Log `MAINTENANCE` WARN with counts.
5. **Hard delete is opt-in** via `CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS` (default `0` = disabled; nonzero = days threshold). When enabled and a session has `inactive_at` older than the threshold AND no FK-referencing rows, hard-delete the session row. Default-off because user data safety > disk space.
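
Step 3's selection rule (in-memory presence wins, mark once, never delete) can be expressed as a pure function over rows already read from the database. This is an illustrative sketch: `SessionActivityRow` and `selectSessionsToMarkInactive` are hypothetical, and timestamps are assumed to be epoch milliseconds even though the planned `inactive_at` column is `TEXT`.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

interface SessionActivityRow {
  id: number;
  lastActivityAt: number;    // epoch ms; MAX over related tables (assumed precomputed)
  inactiveAt: number | null; // null until the reaper marks the session
}

/**
 * Returns ids that should receive inactive_at = now. Sessions present in the
 * in-memory map are skipped regardless of how old their last DB write is.
 */
function selectSessionsToMarkInactive(
  rows: SessionActivityRow[],
  activeInMemoryIds: ReadonlySet<number>,
  now: number,
  inactiveDays: number = 30,
): number[] {
  const cutoff = now - inactiveDays * DAY_MS;
  return rows
    .filter((row) => !activeInMemoryIds.has(row.id)) // in-memory presence wins
    .filter((row) => row.inactiveAt === null)        // mark each session only once
    .filter((row) => row.lastActivityAt < cutoff)    // older than the threshold
    .map((row) => row.id);
}
```

Because the function only selects ids, the actual `UPDATE` stays a single parameterized statement per id and hard deletion remains a separate, opt-in step.
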
### 2.3 New settings
| Key | Default | Range | Purpose |
| --- | --- | --- | --- |
| `CLAUDE_MEM_REAPER_ENABLED` | `true` | bool | Master switch |
| `CLAUDE_MEM_REAPER_TICK_MS` | `30000` | 5000-600000 | Tick cadence (piggy-backs supervisor; this value gates whether the reaper runs each tick) |
| `CLAUDE_MEM_REAPER_PROCESSING_STUCK_MS` | `300000` (5 min) | 30000-86400000 | Threshold for a `processing` row to be considered stuck |
| `CLAUDE_MEM_REAPER_INACTIVE_DAYS` | `30` | 1-365 | When to mark a session `inactive_at` |
| `CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS` | `0` | 0-365 | 0 = never; otherwise, hard-delete inactive rows older than N days |
### 2.4 Acceptance criteria
- Inject 50 stuck `processing` rows older than 5 minutes → next reaper tick resets them → `/api/healthz` shows `oldest_processing_pending_age_sec` drop to 0.
- Inject `OBSERVER_SESSIONS_PROJECT` rows post-marker → next tick logs regression and purges them.
- Reaper survives a worker restart without losing state (everything is DB-backed).
- Active sessions (in-memory) are NEVER marked inactive even if their last DB write is old (in-memory presence wins).
### 2.5 Observability
- Log: `MAINTENANCE` INFO `ReaperTick`, fields `{stuckProcessing, orphanPending, markedInactive, hardDeleted, observerRegression}`.
- New `/api/healthz` fields (Phase 7): `oldest_processing_pending_age_sec`, `processing_pending_count`, `pending_count_total`, `sdk_sessions_total`, `sdk_sessions_inactive`, `sdk_sessions_by_project: { [project]: count }`.
### 2.6 Verification checklist
- [ ] Migration adds `inactive_at` column without breaking existing data (test on a copy of a real DB).
- [ ] In-memory active sessions never appear in `findInactiveSdkSessions`.
- [ ] Reaper does NOT cascade-delete `observations` / `session_summaries` unless explicit hard-delete + zero-FK-reference precondition.
- [ ] `/api/healthz` shows reaper metrics.
---
## Phase 3 — chroma-mcp Child-Process Supervisor
**Goal**: Stop the 23-concurrent-chroma-mcp leak. Bound concurrency, reap idle, scan for orphans at startup.
### 3.1 Files to modify
| File | Change |
| --- | --- |
| `src/services/sync/ChromaMcpManager.ts` | Add idle reaper; enforce single-instance via supervisor registry; add startup orphan scan; add `lastCallAt` timestamp updated by `callTool`. |
| `src/services/sync/ChromaMcpManager.ts:ensureConnected` (line 43) | Before connect, check `getProcessRegistry().getAll().filter(r => r.type === 'chroma')` — if non-empty AND PID alive AND PID not the current `_process.pid`, refuse to spawn (alert + reuse existing if possible; otherwise wait for backoff). |
| `src/services/sync/ChromaMcpManager.ts:registerManagedProcess` (line 613) | Already calls `getSupervisor().registerProcess(CHROMA_SUPERVISOR_ID, ...)` — verify the supervisor enforces single-instance for this id. (Currently `register` is keyed by id so same id replaces; document this.) |
| `src/supervisor/process-registry.ts` | Add `getActiveCountByType(type: string): number`. Add `findChromaOrphans(): Promise<number[]>` — POSIX `pgrep -af 'chroma-mcp'` filtered by PPID == 1. |
| `src/services/worker-service.ts:initializeBackground` | After `ChromaMcpManager.getInstance()`, kick off `await ChromaMcpManager.scanAndReapOrphans()` (best-effort; never throws). |
### 3.2 Detailed tasks
1. **Startup orphan scan**: New static method `ChromaMcpManager.scanAndReapOrphans()`:
- POSIX: `pgrep -af 'chroma-mcp'` → for each PID, check PPID. If PPID == 1 (re-parented to init), call `killProcessTree(pid)` (existing function at line 388). Log `CHROMA_MCP` INFO `ReapedOrphan`, fields `{pid, ageSec}`.
- Windows: `Get-CimInstance Win32_Process -Filter "Name='chroma-mcp.exe'"` filter by parent process state, kill with taskkill.
- Bound the scan to processes whose command-line includes `chroma-mcp==<CHROMA_MCP_PINNED_VERSION>` to avoid killing unrelated chroma installations.
2. **Idle reaper**: Add `lastCallAt: number = 0` field to `ChromaMcpManager`. Update on every `callTool`. Run a `setInterval(checkIdle, 60_000)` (`.unref()`) — if `connected && Date.now() - lastCallAt > CHROMA_MCP_IDLE_SHUTDOWN_MS` (default 15 min), call `await this.stop()`. Lazy-reconnect resumes on next `callTool`.
3. **Single-instance guard on reconnect**: In `ensureConnected`, before `connectInternal`, call `getProcessRegistry().getActiveCountByType('chroma')`. If > 0 AND the registered PID is alive but `this.connected === false`, this is a stale process (we lost track). Tear it down via `killProcessTree(registeredPid)` first, then proceed with fresh spawn. Otherwise the count grows by one each reconnect — exactly the leak observed.
4. **Hard cap**: extend `getSupervisor().assertCanSpawn('chroma mcp')` (already called at line 87) to actually count and reject. Cap = 1 chroma-mcp per worker. Cap = `TOTAL_PROCESS_HARD_CAP` (10) overall — already enforced for SDK processes; extend to chroma-mcp.
5. **Tighten close path**: in `connectInternal` (line 74), after `transport.close()` / `client.close()`, if the underlying `_process.pid` is still in the registry, call `killProcessTree` and `unregisterProcess` explicitly. Don't rely on `transport.onclose` alone — it has the stale-callback guard but doesn't always fire on connect-time failures.
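
The idle check in task 2 is easy to isolate from the timer that drives it. The following is a sketch with hypothetical names (`IdleTracker`), using an injectable clock so the 15-minute threshold can be tested without waiting.

```typescript
class IdleTracker {
  private lastCallAt = 0;
  private readonly idleShutdownMs: number;
  private readonly clock: () => number;

  constructor(idleShutdownMs: number = 15 * 60_000, clock: () => number = Date.now) {
    this.idleShutdownMs = idleShutdownMs;
    this.clock = clock;
  }

  /** Call from every tool invocation so the idle window restarts. */
  recordCall(): void {
    this.lastCallAt = this.clock();
  }

  /** True when connected and idle for longer than the threshold. */
  shouldShutdown(connected: boolean): boolean {
    return connected && this.clock() - this.lastCallAt > this.idleShutdownMs;
  }
}
```

A 60-second unref'd interval would call `shouldShutdown(this.connected)` and, when it returns true, `await this.stop()`; the next `callTool` then reconnects lazily. Note that with `lastCallAt` starting at 0, a connection that never served a call counts as idle immediately, so stamping the time on connect may be worth considering.
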
### 3.3 New settings
| Key | Default | Range | Purpose |
| --- | --- | --- | --- |
| `CLAUDE_MEM_CHROMA_IDLE_SHUTDOWN_MS` | `900000` (15 min) | 60000-86400000 | Idle reaper threshold |
| `CLAUDE_MEM_CHROMA_ORPHAN_SCAN_ON_START` | `true` | bool | Master switch for startup scan |
| `CLAUDE_MEM_CHROMA_MAX_CONCURRENT` | `1` | 1-4 | Cap chroma-mcp instances per worker |
### 3.4 Acceptance criteria
- Spawn 5 chroma-mcp processes manually parented to init; restart worker → all 5 are reaped at startup.
- Force connect-time failure (kill transport mid-connect) 10 times → registry count never exceeds 1.
- Run worker for 30 min with no chroma calls → process is reaped after 15 min and `getProcessRegistry().getActiveCountByType('chroma')` returns 0.
- `callTool` after idle-shutdown lazy-reconnects successfully.
### 3.5 Observability
- Log: `CHROMA_MCP` INFO `OrphanScan` `{found, killed}`.
- Log: `CHROMA_MCP` INFO `IdleShutdown` `{idleMs}`.
- Log: `CHROMA_MCP` WARN `RegistryStale` when single-instance guard tears down a phantom.
- `/api/healthz` fields (Phase 7): `chroma_mcp_pid_count`, `chroma_mcp_last_call_at`, `chroma_mcp_state` ('connected'|'disconnected'|'backoff'), `chroma_mcp_backoff_remaining_ms`.
### 3.6 Anti-pattern guards
- **Do not** kill chroma processes whose command-line doesn't match `chroma-mcp==<PINNED_VERSION>` — could match unrelated user installs.
- **Do not** spin up the idle-reaper timer if `chromaMcpManager` is null (chroma disabled via `CLAUDE_MEM_CHROMA_ENABLED=false`).
- **Do not** call `getProcessRegistry()` from outside the worker process — it's worker-internal.
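
The version-pinned match can be kept in one place so the orphan scan cannot drift from the first guard above. A sketch only: `ProcessRow` and `selectChromaOrphans` are hypothetical, the rows are assumed to be parsed already from `pgrep`/`ps` output (or `Get-CimInstance` on Windows), and any version string passed in tests or examples is a placeholder rather than the real pinned version.

```typescript
interface ProcessRow {
  pid: number;
  ppid: number;    // parent PID; 1 means re-parented to init on POSIX
  command: string; // full command line
}

/**
 * Returns PIDs that are safe to reap: orphaned (PPID 1) and launched with the
 * exact pinned package spec. The spec is compared as a whole whitespace-
 * delimited token so that "==1.2" does not also match "==1.2.3".
 */
function selectChromaOrphans(processes: ProcessRow[], pinnedVersion: string): number[] {
  const marker = `chroma-mcp==${pinnedVersion}`;
  return processes
    .filter((proc) => proc.ppid === 1)
    .filter((proc) => proc.command.split(/\s+/).includes(marker))
    .map((proc) => proc.pid);
}
```

Anything the filter rejects is left alone, which keeps unrelated user installs of chroma out of reach of the reaper.
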
### 3.7 Verification checklist
- [ ] After 2.5 hours of normal use, `ps aux | grep chroma-mcp | wc -l` ≤ 1.
- [ ] Idle-reaper timer is `.unref()`d.
- [ ] Orphan scan tolerates `pgrep` returning empty (no false-error logs).
- [ ] Build still passes on Windows (Win32 branch compiles even if not unit-tested).
---
## Phase 4 — Circuit Breaker for Retry Storms
**Goal**: Replace the unbounded counter at `worker-utils.ts:401` with a real circuit breaker. Stop hooks from hammering the worker when it's down.
### 4.1 Files to modify
| File | Change |
| --- | --- |
| `src/shared/worker-circuit-breaker.ts` (new) | `CircuitBreaker` class: states `CLOSED`, `OPEN`, `HALF_OPEN`. Persist to `~/.claude-mem/state/circuit-breaker.json`. |
| `src/shared/worker-utils.ts:executeWithWorkerFallback` (line 443) | Wrap the call in `breaker.run(...)`. On `OPEN`, return `WorkerFallback` immediately (no HTTP). |
| `src/shared/worker-utils.ts:recordWorkerUnreachable` (line 401) | Becomes a thin shim that calls `breaker.recordFailure()`. Hard cap (`MAX_LIFETIME_FAILURES = 50`) trips the breaker permanently until manual reset. |
| `src/shared/worker-utils.ts:resetWorkerFailureCounter` (line 419) | Becomes `breaker.recordSuccess()`. |
| `src/cli/hook-command.ts` | Verify the swallowed-stderr fix from observation 2026-05-07 is applied (it's marked as a "no-op replacement bug"). The breaker's stderr-fail-loud path must actually write to `process.stderr.write()`, not a stub. |
| `src/services/server/Server.ts` | Add `/api/admin/breaker/reset` POST endpoint (gated by localhost only) for manual unsticking. |
### 4.2 Breaker semantics
States and transitions:
```
CLOSED ──[N consecutive failures]──> OPEN
OPEN ──[reset_timeout_ms elapsed]──> HALF_OPEN
HALF_OPEN ──[1 success]──> CLOSED
HALF_OPEN ──[1 failure]──> OPEN (resets timer)
ANY ──[lifetime failures > MAX_LIFETIME_FAILURES]──> OPEN_PERMANENT (until manual reset via API or settings reload)
```
Defaults:
| Setting | Default | Range |
| --- | --- | --- |
| `CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD` | `5` | 1–50 |
| `CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS` | `30000` | 1000–600000 |
| `CLAUDE_MEM_BREAKER_HALF_OPEN_MAX_PROBES` | `1` | 1–10 |
| `CLAUDE_MEM_BREAKER_LIFETIME_CAP` | `50` | 0–10000 (0 = no cap) |
Persistent state file shape:
```json
{
"state": "CLOSED|OPEN|HALF_OPEN|OPEN_PERMANENT",
"consecutiveFailures": 0,
"lifetimeFailures": 0,
"openedAt": null,
"lastFailureAt": null,
"lastSuccessAt": null,
"lastTrippedAt": null
}
```
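
The transitions and state shape above might translate into a pure state machine along these lines. This is a sketch under assumptions: the `BreakerConfig` object, epoch-millisecond timestamps, and `forceReset()` clearing the lifetime counter are not specified by the plan, and half-open probe limiting (`CLAUDE_MEM_BREAKER_HALF_OPEN_MAX_PROBES`) is left out for brevity.

```typescript
// Sketch only. Field names follow the state-file shape above; the config
// object, the epoch-ms timestamps, and forceReset() clearing the lifetime
// counter are assumptions.
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN" | "OPEN_PERMANENT";

interface BreakerSnapshot {
  state: BreakerState;
  consecutiveFailures: number;
  lifetimeFailures: number;
  openedAt: number | null;
  lastFailureAt: number | null;
  lastSuccessAt: number | null;
  lastTrippedAt: number | null;
}

interface BreakerConfig {
  failureThreshold: number; // CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD
  resetTimeoutMs: number; // CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS
  lifetimeCap: number; // CLAUDE_MEM_BREAKER_LIFETIME_CAP, 0 = no cap
}

function freshSnapshot(): BreakerSnapshot {
  return {
    state: "CLOSED", consecutiveFailures: 0, lifetimeFailures: 0,
    openedAt: null, lastFailureAt: null, lastSuccessAt: null, lastTrippedAt: null,
  };
}

class CircuitBreaker {
  private snap: BreakerSnapshot = freshSnapshot();
  private readonly cfg: BreakerConfig;
  private readonly now: () => number;

  constructor(cfg: BreakerConfig, now: () => number = Date.now) {
    this.cfg = cfg;
    this.now = now;
  }

  /** OPEN lazily becomes HALF_OPEN once the reset timeout has elapsed. */
  getState(): BreakerState {
    const { state, openedAt } = this.snap;
    if (state === "OPEN" && openedAt !== null && this.now() - openedAt >= this.cfg.resetTimeoutMs) {
      this.snap.state = "HALF_OPEN";
    }
    return this.snap.state;
  }

  canAttempt(): boolean {
    const state = this.getState();
    return state === "CLOSED" || state === "HALF_OPEN";
  }

  recordFailure(_reason?: string): void {
    const at = this.now();
    const state = this.getState();
    this.snap.consecutiveFailures += 1;
    this.snap.lifetimeFailures += 1;
    this.snap.lastFailureAt = at;
    if (state === "OPEN_PERMANENT") return;
    const { lifetimeCap, failureThreshold } = this.cfg;
    if (lifetimeCap > 0 && this.snap.lifetimeFailures > lifetimeCap) {
      this.trip("OPEN_PERMANENT", at);
    } else if (state === "HALF_OPEN" || this.snap.consecutiveFailures >= failureThreshold) {
      this.trip("OPEN", at); // also restarts the reset timer
    }
  }

  recordSuccess(): void {
    this.snap.lastSuccessAt = this.now();
    this.snap.consecutiveFailures = 0;
    if (this.getState() !== "OPEN_PERMANENT") {
      this.snap.state = "CLOSED";
      this.snap.openedAt = null;
    }
  }

  forceReset(): void {
    this.snap = freshSnapshot();
  }

  snapshot(): BreakerSnapshot {
    return { ...this.snap };
  }

  private trip(state: "OPEN" | "OPEN_PERMANENT", at: number): void {
    this.snap.state = state;
    this.snap.openedAt = at;
    this.snap.lastTrippedAt = at;
  }
}
```

Keeping the clock injectable means the timeout transitions can be unit-tested without sleeping; persistence would wrap `snapshot()` separately.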
### 4.3 Detailed tasks
1. **CircuitBreaker class**: keep the state-transition logic pure (no I/O) so it can be unit-tested in isolation. Methods: `getState()`, `canAttempt()`, `recordFailure(reason)`, `recordSuccess()`, `forceReset()`. Persist the JSON snapshot through a thin layer that uses atomic file writes (write tmp + rename), mirroring `writeHookFailureStateAtomic` (worker-utils.ts:372).
2. **Wire into `executeWithWorkerFallback`**:
```
if (!breaker.canAttempt()) {
// Optional: print one-line stderr if state changed during this call
return { continue: true, reason: 'circuit_breaker_open', [WORKER_FALLBACK_BRAND]: true };
}
const alive = await ensureWorkerAliveOnce();
if (!alive) { breaker.recordFailure('unreachable'); ... }
...
if (response.ok) breaker.recordSuccess();
```
3. **Fail-loud stderr fix**: The 2026-05-07 observation mentions a "stderr no-op replacement bug" in `hookCommand`. Investigate `src/cli/hook-command.ts` for any `process.stderr.write` shim that suppresses output. The breaker's diagnostic ("Worker unreachable; circuit breaker OPEN; will retry in Xs") MUST appear on the user's terminal so they know what's happening. Test by intentionally killing the worker and running a hook — message should appear on stderr.
4. **Manual reset endpoint**: `POST /api/admin/breaker/reset` (no body required). Restricted to `127.0.0.1` only. Logs `SYSTEM` WARN `BreakerForceReset` with caller info.
5. **Lifetime cap**: when `lifetimeFailures > CLAUDE_MEM_BREAKER_LIFETIME_CAP`, transition to `OPEN_PERMANENT`. The only way out is the manual-reset API or restarting the worker with a fresh state file. Print prominent stderr: `claude-mem: 50 lifetime worker failures detected. Disabling memory hooks until reset. Run: claude-mem worker doctor`.
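
The write-tmp-then-rename persistence mentioned in task 1 could look roughly like this. `writeJsonAtomic` and `readJsonOr` are hypothetical helper names (the plan only says to mirror `writeHookFailureStateAtomic`), and the demo path under the OS temp directory is purely illustrative.

```typescript
import { mkdirSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical helper: write to a sibling temp file, then rename over the
// target so readers never observe a half-written JSON document.
function writeJsonAtomic(filePath: string, value: unknown): void {
  mkdirSync(dirname(filePath), { recursive: true });
  const tmpPath = `${filePath}.${process.pid}.tmp`;
  writeFileSync(tmpPath, JSON.stringify(value, null, 2), "utf8");
  renameSync(tmpPath, filePath);
}

// Hypothetical helper: a missing or corrupt state file yields the fallback
// instead of throwing, so a bad file cannot break the hook path.
function readJsonOr<T>(filePath: string, fallback: T): T {
  try {
    return JSON.parse(readFileSync(filePath, "utf8")) as T;
  } catch {
    return fallback;
  }
}

// Usage: round-trip a snapshot through a scratch directory.
const demoPath = join(tmpdir(), "claude-mem-breaker-demo", "circuit-breaker.json");
writeJsonAtomic(demoPath, { state: "CLOSED", consecutiveFailures: 0 });
const loaded = readJsonOr(demoPath, { state: "UNKNOWN", consecutiveFailures: -1 });
```

Including the PID in the temp file name is one way to keep concurrent hook processes from clobbering each other's temp files; the last rename still wins.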
### 4.4 Acceptance criteria
- Kill the worker, run 100 hooks → exactly `CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD` HTTP attempts made; rest short-circuit.
- After 30s idle, next hook makes ONE probe (HALF_OPEN); if probe succeeds, breaker closes.
- Lifetime cap (set to 5 for testing): 6th lifetime failure → permanent open until `POST /api/admin/breaker/reset` clears it.
- Stderr message visible to user when breaker opens (manual repro: kill worker, run 5+ hooks).
- Existing hook-failures.json file is migrated to the new breaker JSON format on first run (one-shot migration in `worker-utils.ts`).
### 4.5 Observability
- Log: `SYSTEM` WARN `BreakerOpened`, fields `{lifetime, consecutiveBefore}`.
- Log: `SYSTEM` INFO `BreakerHalfOpen`.
- Log: `SYSTEM` INFO `BreakerClosed`, fields `{recoveredAfterMs}`.
- Log: `SYSTEM` ERROR `BreakerOpenedPermanent`.
- `/api/healthz` fields (Phase 7): `breaker_state`, `breaker_consecutive_failures`, `breaker_lifetime_failures`, `breaker_opened_at`, `breaker_total_trips`.
### 4.6 Anti-pattern guards
- **Do not** call the breaker from inside the worker process — it's a hook-side concern. The worker has `RestartGuard` for its own session-level limits.
- **Do not** auto-reset the lifetime counter on restart; persist it. Otherwise restart-loops mask the underlying failure.
- **Do not** block the breaker reset endpoint on initialization (`/api/admin/breaker/reset` should work even if `initializationCompleteFlag === false`).
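
The localhost restriction on `POST /api/admin/breaker/reset` might be checked with a small predicate like the one below. `isLoopbackAddress` is a hypothetical name, and accepting `::1` and the IPv4-mapped form alongside `127.0.0.1` is an assumption beyond the plan's wording.

```typescript
// Sketch: accepts only loopback peers. Treating "::1" and the IPv4-mapped
// form as equivalent to 127.0.0.1 is an assumption beyond the plan's wording.
function isLoopbackAddress(remoteAddress: string | undefined): boolean {
  if (!remoteAddress) return false;
  // Dual-stack sockets report IPv4 peers as "::ffff:a.b.c.d".
  const addr = remoteAddress.startsWith("::ffff:")
    ? remoteAddress.slice("::ffff:".length)
    : remoteAddress;
  return addr === "127.0.0.1" || addr === "::1";
}
```

The check is meant to run against the socket's `remoteAddress` rather than any forwarded header, since headers can be set by the caller.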
### 4.7 Verification checklist
- [ ] No call site bypasses the breaker (grep for `workerHttpRequest` outside `executeWithWorkerFallback` and audit each — some integrations may need `breaker.canAttempt()` guards added).
- [ ] State file readable/writable across process restarts.
- [ ] Stderr fail-loud path verified end-to-end on Linux + macOS + Windows Terminal.
- [ ] No `process.exit(1)` introduced — breaker tripping returns `WorkerFallback`, not exit codes.
---
## Phase 7 — `/api/healthz` Endpoint with Concrete Metrics
**Goal**: Centralized observability so future regressions are detectable at a glance.
### 7.1 Files to modify
| File | Change |
| --- | --- |
| `src/services/worker/http/routes/HealthzRoutes.ts` (new) | Implements `RouteHandler`. GET `/api/healthz` and `/api/healthz?format=prom`. |
| `src/services/worker-service.ts:registerRoutes` | Register the new `HealthzRoutes(...)`. |
| `src/services/worker/MetricsCollector.ts` (new) | Aggregates metrics; refreshed on the supervisor's existing 30s health-check tick to avoid amplifying load. |
| `src/supervisor/health-checker.ts:runHealthCheck` | Call `MetricsCollector.refresh()` after `pruneDeadEntries`. |
### 7.2 Endpoint contract
`GET /api/healthz` → 200 JSON:
```json
{
"status": "ok|degraded|unhealthy",
"ts": "2026-05-07T21:30:00.000Z",
"uptime_sec": 12345,
"versions": {
"plugin": "12.7.5",
"worker": "12.7.5",
"matches": true
},
"process": {
"pid": 12345,
"rss_mb": 145.2,
"event_loop_lag_ms": 3.1,
"managed": true,
"platform": "darwin"
},
"pid_file": {
"path": "/Users/.../worker.pid",
"start_token": "Wed May 7 14:23:15 2026",
"daemon_lock_held": true
},
"db": {
"path": "/Users/.../claude-mem.db",
"size_bytes": 31457280,
"page_count": 7680,
"freelist_count": 12,
"free_ratio_pct": 0.16,
"last_vacuum_at": "2026-05-07T20:00:00.000Z",
"last_vacuum_freed_pages": 130000,
"last_maintenance_at": "2026-05-07T20:00:00.000Z",
"oldest_processing_pending_age_sec": 4,
"processing_pending_count": 1,
"pending_count_total": 12,
"sdk_sessions_total": 145,
"sdk_sessions_inactive": 13,
"sdk_sessions_by_project": { "claude-mem": 25, "...": 120 }
},
"child_processes": {
"chroma_mcp_pid_count": 1,
"chroma_mcp_last_call_at": "2026-05-07T21:25:11.000Z",
"chroma_mcp_state": "connected",
"chroma_mcp_backoff_remaining_ms": 0,
"sdk_process_count": 0,
"supervisor_registry_size": 2
},
"network": {
"hook_consecutive_failures": 0,
"breaker_state": "CLOSED",
"breaker_consecutive_failures": 0,
"breaker_lifetime_failures": 3,
"breaker_opened_at": null,
"breaker_total_trips": 1,
"last_request_at": "2026-05-07T21:29:55.000Z",
"request_rate_per_min": 12.3
},
"ai": {
"provider": "claude",
"auth_method": "...",
"last_interaction": { ... }
}
}
```
`GET /api/healthz?format=prom` → 200 `text/plain` with Prometheus text format. One metric per JSON leaf (e.g. `claude_mem_db_free_ratio_pct 0.16`).
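
One way to derive the Prometheus output from the JSON snapshot is to walk the object and emit one gauge per numeric or boolean leaf. The sketch below is an assumption-laden illustration: `toPrometheus` is a hypothetical name, string and null leaves are skipped because Prometheus sample values must be numeric, and per-project counts are flattened into metric names rather than labels.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Hypothetical flattener: nested keys are joined with "_" under a common prefix.
function toPrometheus(snapshot: { [key: string]: Json }, prefix = "claude_mem"): string {
  const lines: string[] = [];

  const emit = (name: string, path: string, value: number): void => {
    lines.push(`# HELP ${name} healthz field ${path}`);
    lines.push(`# TYPE ${name} gauge`);
    lines.push(`${name} ${value}`);
  };

  const walk = (node: Json, name: string, path: string): void => {
    if (typeof node === "number") {
      if (Number.isFinite(node)) emit(name, path, node);
    } else if (typeof node === "boolean") {
      emit(name, path, node ? 1 : 0);
    } else if (node !== null && typeof node === "object" && !Array.isArray(node)) {
      for (const [key, child] of Object.entries(node)) {
        // Metric names may only contain [a-zA-Z0-9_:], so sanitize each key.
        const safeKey = key.replace(/[^a-zA-Z0-9_]/g, "_");
        walk(child, `${name}_${safeKey}`, path ? `${path}.${key}` : key);
      }
    }
    // Strings, nulls, and arrays are skipped: they have no numeric sample value.
  };

  walk(snapshot, prefix, "");
  return lines.join("\n") + "\n";
}
```

For example, `toPrometheus({ db: { free_ratio_pct: 0.16 } })` yields a `claude_mem_db_free_ratio_pct 0.16` sample preceded by its `# HELP` and `# TYPE` lines.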
`status` derivation:
- `unhealthy` if breaker is OPEN_PERMANENT, OR DB initialization failed, OR chroma-mcp pid count > `CLAUDE_MEM_CHROMA_MAX_CONCURRENT`.
- `degraded` if breaker is OPEN, OR the DB free ratio > 0.4 (i.e. `db.free_ratio_pct` > 40), OR `oldest_processing_pending_age_sec` > 3600 (1 hour), OR worker version mismatches plugin version.
- `ok` otherwise.
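
The derivation rules above can be expressed as a small pure function. `StatusInputs` and `deriveStatus` are hypothetical names; the free ratio is taken as a fraction between 0 and 1, so a percentage field would need dividing by 100 first.

```typescript
type HealthStatus = "ok" | "degraded" | "unhealthy";

// Hypothetical input shape gathering just the fields the rules need.
interface StatusInputs {
  breakerState: "CLOSED" | "OPEN" | "HALF_OPEN" | "OPEN_PERMANENT";
  dbInitialized: boolean;
  chromaPidCount: number;
  chromaMaxConcurrent: number; // CLAUDE_MEM_CHROMA_MAX_CONCURRENT
  freeRatio: number; // fraction 0..1, not a percentage
  oldestProcessingPendingAgeSec: number | null;
  versionsMatch: boolean;
}

function deriveStatus(i: StatusInputs): HealthStatus {
  // Unhealthy conditions are checked first so they always win over degraded ones.
  if (
    i.breakerState === "OPEN_PERMANENT" ||
    !i.dbInitialized ||
    i.chromaPidCount > i.chromaMaxConcurrent
  ) {
    return "unhealthy";
  }
  if (
    i.breakerState === "OPEN" ||
    i.freeRatio > 0.4 ||
    (i.oldestProcessingPendingAgeSec !== null && i.oldestProcessingPendingAgeSec > 3600) ||
    !i.versionsMatch
  ) {
    return "degraded";
  }
  return "ok";
}
```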
### 7.3 Detailed tasks
1. **MetricsCollector class**: a `Map<string, unknown>` snapshot. Public `refresh()` collects fresh data; public `getSnapshot()` returns the cached object. Refresh is called by the 30s health-check tick AND on-demand if last refresh > 5s ago (debounced).
2. **DB metrics queries** (use `db.prepare` + `.get()`):
- `PRAGMA page_count` → `{ page_count: number }`
- `PRAGMA freelist_count` → `{ freelist_count: number }`
- `PRAGMA page_size` → for size_bytes computation
- `SELECT MIN(updated_at) FROM pending_messages WHERE status='processing'` (with `julianday` math for age in seconds)
- `SELECT project, COUNT(*) AS n FROM sdk_sessions GROUP BY project` (returns one row per project, so use `.all()` rather than `.get()`)
3. **Process metrics**: `process.memoryUsage().rss / 1024 / 1024`. Event-loop lag via `perf_hooks.monitorEventLoopDelay` (Node API; bun support still needs verifying, see Appendix C), sampled over a 30s window.
4. **Network metrics**: maintain a rolling 1-min request counter in middleware (existing `createMiddleware` in `Server.ts:156`). Increment on each `/api/*` request.
5. **Prometheus format**: emit `# HELP` and `# TYPE` lines per metric. Use the same naming convention (`claude_mem_<group>_<name>`).
6. **Compatibility**: leave `/api/health` UNCHANGED (existing integrations break otherwise). `/api/healthz` is the new richer endpoint.
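
The cached-snapshot behavior from task 1 could be sketched as below. The constructor signature and the injected `collect` callback are assumptions, and the refresh is synchronous here purely to keep the example short; the plan calls for an async refresh.

```typescript
// Sketch of the caching described in task 1. The injected collect() callback
// and clock are assumptions made to keep the example self-contained.
class MetricsCollector {
  private snapshot = new Map<string, unknown>();
  private lastRefreshAt = Number.NEGATIVE_INFINITY;
  private readonly collect: () => Record<string, unknown>;
  private readonly now: () => number;
  private readonly maxAgeMs: number;

  constructor(
    collect: () => Record<string, unknown>,
    now: () => number = Date.now,
    maxAgeMs = 5000,
  ) {
    this.collect = collect;
    this.now = now;
    this.maxAgeMs = maxAgeMs;
  }

  /** Called by the periodic health-check tick; always collects fresh data. */
  refresh(): void {
    this.snapshot = new Map(Object.entries(this.collect()));
    this.lastRefreshAt = this.now();
  }

  /** Returns the cached data, refreshing first only if it is older than maxAgeMs. */
  getSnapshot(): Record<string, unknown> {
    if (this.now() - this.lastRefreshAt > this.maxAgeMs) this.refresh();
    return Object.fromEntries(this.snapshot);
  }
}
```

With this shape, a burst of `/api/healthz` requests triggers at most one collection per `maxAgeMs` window.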
### 7.4 Acceptance criteria
- `curl 127.0.0.1:<port>/api/healthz | jq .status` returns `ok` on a healthy worker.
- After Phase 6 ships, `db.free_ratio_pct` updates at 30s cadence (verify by manually inflating freelist).
- Phase 4 breaker state changes are visible within 30s.
- `?format=prom` parses with `promtool check metrics`.
- No new endpoint blocks for > 50ms (snapshot is cached; refresh is async).
### 7.5 Observability hooks (yes, for the observability endpoint itself)
- Log `WORKER` DEBUG `MetricsRefresh`, fields `{durationMs}`.
- Log `WORKER` WARN `MetricsRefreshSlow` if refresh > 250ms (DB query stall signal).
### 7.6 Verification checklist
- [ ] `/api/health` response body unchanged byte-for-byte (regression test).
- [ ] All Phase 2-6 metrics exposed (cross-check the field list in those phases).
- [ ] `?format=prom` output validates with `promtool` if available; otherwise visual inspection.
- [ ] Endpoint mounted via `RouteHandler` pattern (no direct `app.get` in worker-service.ts).
---
## Phase 8 — Observability, CLI, & Rollout
**Goal**: User-facing surface so operators can see what the new machinery did. Ordered last to allow phases 2-7 to stabilize.
### 8.1 Files to modify
| File | Change |
| --- | --- |
| `src/cli/handlers/worker-doctor.ts` (new) | New CLI subcommand `claude-mem worker doctor` — fetches `/api/healthz`, formats it for terminals, includes recent reaper actions. |
| `src/services/worker-service.ts:main()` | Register the `worker doctor` CLI route (alongside existing `cursor`, `gemini-cli` cases). |
| `plugin/scripts/worker-cli.js` | Wire to the new doctor command. |
| `CLAUDE.md` (project root) | Document new settings under a "Worker Maintenance" section. |
| `docs/public/` (optional) | User-facing explanation of the breaker, reaper, and health endpoint. |
### 8.2 `worker doctor` output (example)
```
claude-mem worker doctor
Status: OK
Version: plugin=12.7.5 worker=12.7.5 (match)
Uptime: 3h 25m
PID: 12345 (lock held: yes)
Database:
Size: 30 MB (free: 0.16%)
Last vacuum: 4h ago, freed 130k pages
Pending: 12 total / 1 processing (oldest 4s)
SDK sessions: 145 total / 13 inactive
Child processes:
chroma-mcp: 1 (last call: 5s ago, state: connected)
SDK processes: 0
Supervisor: 2 entries
Circuit breaker:
State: CLOSED
Consecutive: 0
Lifetime: 3
Total trips: 1
Recent maintenance (last 24h):
2026-05-07 20:00 Vacuum: freed 130k pages in 1.4s
2026-05-07 19:30 Reaper: 5 stuck-processing reset, 2 inactive marked
2026-05-07 18:00 Chroma orphan scan: 0 found
```
If `status != ok`, append a "Recommended actions" block:
- breaker open → `claude-mem worker reset-breaker`
- DB free ratio high → mention next vacuum window
- chroma orphans → `claude-mem worker reap-chroma`
### 8.3 Detailed tasks
1. **Doctor command**: GET `/api/healthz` via `workerHttpRequest`. Format the response as in the example output in 8.2. Color-code (red/yellow/green) using existing chalk integration if present, otherwise plain text. JSON pass-through via `--json` flag.
2. **Recent-actions feed**: store the last 50 maintenance events in a circular buffer in `MetricsCollector` (in-memory only — survives one worker lifetime; not persistent). Expose at `/api/healthz/events` (separate to avoid bloating the main response).
3. **Update CLAUDE.md**: add a "Worker Maintenance" section with: settings reference table, the doctor command, a brief description of the reaper/breaker/vacuum behavior. Per CLAUDE.md "Important: No need to edit the changelog ever" — only edit CLAUDE.md, never CHANGELOG.
4. **Rollout ordering** (per problem statement constraint):
- Wave 1 (idempotent, low-risk): Phase 5 (PID/port reclamation), Phase 6 (DB maintenance).
- Wave 2 (reapers — needs careful testing on busy DBs): Phase 2 (session reaper), Phase 3 (chroma supervisor).
- Wave 3 (user-visible behavior change): Phase 4 (circuit breaker), Phase 7 (`/api/healthz`).
- Wave 4 (CLI surface): Phase 8 (doctor command, docs).
Each wave can ship as a separate release. Inter-wave dependencies: Phase 7 depends on data sources from Phases 2/3/4/6 — but the endpoint can ship with partial data (fields gated by phase availability).
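
The in-memory recent-actions feed from task 2 is a natural fit for a fixed-capacity ring buffer. The sketch below uses hypothetical names (`RingBuffer`, `MaintenanceEvent`) and an assumed event shape.

```typescript
// Assumed event shape; the plan does not specify one.
interface MaintenanceEvent {
  at: string; // ISO timestamp
  kind: string; // e.g. "vacuum", "reaper", "chroma-orphan-scan"
  detail: string;
}

// Hypothetical fixed-capacity buffer: once full, each push overwrites the oldest item.
class RingBuffer<T> {
  private readonly items: T[] = [];
  private start = 0; // index of the oldest item once the buffer is full
  private readonly capacity: number;

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
  }

  push(item: T): void {
    if (this.items.length < this.capacity) {
      this.items.push(item);
    } else {
      this.items[this.start] = item;
      this.start = (this.start + 1) % this.capacity;
    }
  }

  /** Returns the buffered items, oldest first. */
  toArray(): T[] {
    return this.items.slice(this.start).concat(this.items.slice(0, this.start));
  }
}

// Usage: keep the last 50 maintenance events for the events endpoint.
const recentEvents = new RingBuffer<MaintenanceEvent>(50);
recentEvents.push({ at: "2026-05-07T18:00:00.000Z", kind: "chroma-orphan-scan", detail: "0 found" });
```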
### 8.4 Acceptance criteria
- `claude-mem worker doctor` prints a green-OK summary on a healthy worker.
- `claude-mem worker doctor --json` returns valid JSON pipeable to `jq`.
- Killing the worker → `claude-mem worker doctor` cleanly reports `Worker unreachable` instead of hanging.
- CLAUDE.md updates are limited to a new section; no churn elsewhere.
### 8.5 Verification checklist
- [ ] `claude-mem worker doctor` exits 0 on healthy state, 1 on unhealthy, 2 if worker unreachable (mirrors hook-exit-codes convention).
- [ ] No new public marketplace API surface beyond what's documented.
- [ ] Doctor command works without the worker running (unreachable path covered).
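
The exit-code contract above, together with the "report unreachable instead of hanging" criterion, could be sketched as follows. Everything here is hypothetical: the real command would go through `workerHttpRequest` (its signature is not shown in this plan), so plain `fetch` with a timeout stands in for it, and mapping `degraded` to exit code 0 is an assumption the plan does not settle.

```typescript
type HealthStatus = "ok" | "degraded" | "unhealthy";

type DoctorResult =
  | { reachable: false }
  | { reachable: true; status: HealthStatus };

// Stand-in probe: bounded by a timeout so a dead worker cannot hang the CLI.
async function probeHealthz(baseUrl: string, timeoutMs = 2000): Promise<DoctorResult> {
  try {
    const res = await fetch(`${baseUrl}/api/healthz`, { signal: AbortSignal.timeout(timeoutMs) });
    const body = (await res.json()) as { status?: HealthStatus };
    return { reachable: true, status: body.status ?? "unhealthy" };
  } catch {
    // Connection refused, timeout, or malformed JSON all count as unreachable.
    return { reachable: false };
  }
}

// 0 = healthy, 1 = unhealthy, 2 = worker unreachable.
// Assumption: "degraded" still exits 0, since the plan only names three cases.
function doctorExitCode(result: DoctorResult): 0 | 1 | 2 {
  if (!result.reachable) return 2;
  return result.status === "unhealthy" ? 1 : 0;
}
```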
---
## Final Phase — Cross-Phase Verification
**Goal**: Prove the system works end-to-end before declaring victory.
### F.1 Soak test (24h)
Run the worker for 24 hours under realistic Claude Code usage. After 24h:
| Metric | Pass criterion |
| --- | --- |
| `pgrep -f chroma-mcp \| wc -l` | ≤ 1 |
| `pgrep -f claude-mem \| wc -l` | ≤ a small constant (1-2) |
| DB size growth rate | < 5 MB/hr; free_ratio < 20% |
| `/api/healthz` `network.breaker_lifetime_failures` | < 10 (vs. the #1874 starting baseline) |
| Stuck `processing` rows older than 10 min | 0 |
| Worker memory RSS | < 300 MB (no leak) |
### F.2 Failure-injection tests
| Inject | Expected behavior |
| --- | --- |
| Kill worker via `kill -9` | Lazy-respawn on next hook; PID file cleaned |
| Two parallel `claude-mem start` | Exactly one daemon survives; lock log line visible |
| 100 stuck processing rows | Reaper resets all within `REAPER_PROCESSING_STUCK_MS + REAPER_TICK_MS` |
| Spawn fake listener on worker port | New `--daemon` exits 0 with diagnostic stderr (no silent exit) |
| Fork 5 chroma-mcp orphans | Worker startup reaps all 5 |
| Pull network during 10 hooks | Breaker opens after threshold; subsequent hooks short-circuit |
### F.3 Anti-pattern grep
```
# No new always-on intervals
grep -rn "setInterval" src/ --include="*.ts" | grep -v "unref()" | grep -v "^src/.*test"
# No new process.exit(1) on hook paths
git diff main -- src/shared/worker-utils.ts src/cli/ | grep -E '^\+.*process\.exit\(1\)'
# No invented settings
git diff main -- src/shared/SettingsDefaultsManager.ts | grep "CLAUDE_MEM_"
# Cross-reference with all phases' settings tables.
# No hardcoded magic numbers in business logic
git diff main | grep -E "[0-9]{4,}" | grep -v SettingsDefaultsManager | grep -v test
```
### F.4 Documentation diff
- `CLAUDE.md` adds: Worker Maintenance section (Phase 8.3).
- `docs/public/` (optional): user-facing explanation.
- No CHANGELOG edits (auto-generated per CLAUDE.md).
### F.5 Sign-off checklist
- [ ] All 8 phases shipped.
- [ ] `/api/healthz` reports `status: "ok"` 24h after deployment.
- [ ] No new ERROR-level logs in production for 24h (excluding pre-existing).
- [ ] Manual `worker doctor` on 3 production-like environments confirms expected output.
- [ ] Phase 0 doc-discovery anti-patterns not violated (grep `git log -p`).
---
## Appendix A — Settings Reference (consolidated)
All settings declared in `src/shared/SettingsDefaultsManager.ts`:
| Setting | Phase | Default | Range |
| --- | --- | --- | --- |
| `CLAUDE_MEM_DAEMON_LOCK_TIMEOUT_MS` | 5 | `5000` | 0–60000 |
| `CLAUDE_MEM_PID_PORT_RECHECK_MS` | 5 | `2000` | 500–30000 |
| `CLAUDE_MEM_DB_MAINTENANCE_ENABLED` | 6 | `true` | bool |
| `CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS` | 6 | `24` | 1–168 |
| `CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO` | 6 | `0.40` | 0.05–0.95 |
| `CLAUDE_MEM_DB_VACUUM_STARTUP_DELAY_MS` | 6 | `300000` | 0–3600000 |
| `CLAUDE_MEM_CLEANUP_REGRESSION_CHECK` | 6 | `true` | bool |
| `CLAUDE_MEM_REAPER_ENABLED` | 2 | `true` | bool |
| `CLAUDE_MEM_REAPER_TICK_MS` | 2 | `30000` | 5000–600000 |
| `CLAUDE_MEM_REAPER_PROCESSING_STUCK_MS` | 2 | `300000` | 30000–86400000 |
| `CLAUDE_MEM_REAPER_INACTIVE_DAYS` | 2 | `30` | 1–365 |
| `CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS` | 2 | `0` | 0–365 |
| `CLAUDE_MEM_CHROMA_IDLE_SHUTDOWN_MS` | 3 | `900000` | 60000–86400000 |
| `CLAUDE_MEM_CHROMA_ORPHAN_SCAN_ON_START` | 3 | `true` | bool |
| `CLAUDE_MEM_CHROMA_MAX_CONCURRENT` | 3 | `1` | 1–4 |
| `CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD` | 4 | `5` | 1–50 |
| `CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS` | 4 | `30000` | 1000–600000 |
| `CLAUDE_MEM_BREAKER_HALF_OPEN_MAX_PROBES` | 4 | `1` | 1–10 |
| `CLAUDE_MEM_BREAKER_LIFETIME_CAP` | 4 | `50` | 0–10000 |
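
The Range column implies some validation when settings are read. The plan does not say whether out-of-range values are clamped or rejected, so the sketch below assumes clamping, with unparsable values falling back to the default; `NumericSettingSpec` and `resolveNumericSetting` are hypothetical names, not `SettingsDefaultsManager` APIs.

```typescript
// Hypothetical spec mirroring one row of the table above.
interface NumericSettingSpec {
  key: string;
  defaultValue: number;
  min: number;
  max: number;
}

// Assumption: out-of-range values are clamped; missing or non-numeric
// values fall back to the default.
function resolveNumericSetting(spec: NumericSettingSpec, raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return spec.defaultValue;
  const parsed = Number(raw);
  if (!Number.isFinite(parsed)) return spec.defaultValue;
  return Math.min(spec.max, Math.max(spec.min, parsed));
}

// Usage with the breaker failure threshold (default 5, range 1 to 50).
const failureThresholdSpec: NumericSettingSpec = {
  key: "CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD",
  defaultValue: 5,
  min: 1,
  max: 50,
};
const failureThreshold = resolveNumericSetting(
  failureThresholdSpec,
  process.env[failureThresholdSpec.key],
);
```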
## Appendix B — File Change Summary
| File | Phases that touch it |
| --- | --- |
| `src/services/worker-service.ts` | 3 (initializeBackground), 5 (--daemon), 6 (maintenance wiring), 7 (route registration), 8 (CLI) |
| `src/services/worker-spawner.ts` | 5 |
| `src/services/infrastructure/ProcessManager.ts` | 5 (lock + start-token) |
| `src/services/infrastructure/HealthMonitor.ts` | 5 (port-on-pid match) |
| `src/services/infrastructure/CleanupV12_4_3.ts` | 6 (regression detection — read only) |
| `src/services/sync/ChromaMcpManager.ts` | 3 |
| `src/supervisor/index.ts` | 5 (validateWorkerPidFile) |
| `src/supervisor/process-registry.ts` | 3 (orphan scan), 5 (start-token) |
| `src/supervisor/health-checker.ts` | 2 (reaper), 7 (metrics refresh) |
| `src/services/worker/SessionManager.ts` | 2 (delete hook), 6 (pause/resume) |
| `src/shared/worker-utils.ts` | 4 (breaker integration) |
| `src/services/sqlite/Database.ts` | 6 (auto_vacuum) |
| `src/services/sqlite/PendingMessageStore.ts` | 2 (reapStuckProcessing) |
| `src/services/sqlite/SessionStore.ts` | 2 (findInactiveSdkSessions) |
| `src/services/sqlite/migrations/runner.ts` | 2 (inactive_at column) |
| `src/services/server/Server.ts` | 4 (breaker reset), 7 (request-rate middleware) |
| `src/shared/SettingsDefaultsManager.ts` | 2-6 (settings keys) |
| `src/services/maintenance/DbMaintenance.ts` | 6 (NEW) |
| `src/services/maintenance/SessionReaper.ts` | 2 (NEW) |
| `src/shared/worker-circuit-breaker.ts` | 4 (NEW) |
| `src/services/worker/MetricsCollector.ts` | 7 (NEW) |
| `src/services/worker/http/routes/HealthzRoutes.ts` | 7 (NEW) |
| `src/cli/handlers/worker-doctor.ts` | 8 (NEW) |
| `CLAUDE.md` | 8 (Worker Maintenance section) |
## Appendix C — Open Questions for Executor
1. **`bun:ffi` flock support**: confirm via spike before committing Phase 5.4. If unavailable, fall back to `flock(1)` shell on Linux + atomic `mkdirSync` sentinel on macOS/Windows.
2. **Event-loop lag sampling on bun**: verify `perf_hooks.monitorEventLoopDelay` works in bun's Node-compat layer. If not, fall back to a setImmediate-based heuristic.
3. **Existing-DB auto_vacuum migration**: verify that the startup full VACUUM in Phase 6.3 is sufficient to reclaim the 504 MB without requiring users to run `PRAGMA auto_vacuum = INCREMENTAL; VACUUM;` manually. (It should — full VACUUM with auto_vacuum already set takes effect.)
4. **Pro-features compatibility**: confirm with maintainers that `/api/healthz` does not duplicate any planned Pro endpoint. Per CLAUDE.md "Pro Features Architecture", the worker's local HTTP API stays open — `/api/healthz` is fine to add OSS-side.
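
For question 2, a `setImmediate`-based fallback might look like the sketch below. All names are hypothetical, and the approach only approximates loop delay: it measures how long a callback waits between being scheduled and running, which grows when the loop is busy.

```typescript
// Fallback sketch for runtimes where perf_hooks.monitorEventLoopDelay is
// unavailable. All names here are hypothetical.
function sampleLoopLagMs(): Promise<number> {
  const scheduledAt = performance.now();
  return new Promise((resolve) => {
    // The callback runs on a later loop iteration; the wait approximates lag.
    setImmediate(() => resolve(performance.now() - scheduledAt));
  });
}

// Reduce a window of samples to the figures a health endpoint might report.
function summarizeLag(samplesMs: number[]): { meanMs: number; maxMs: number } {
  if (samplesMs.length === 0) return { meanMs: 0, maxMs: 0 };
  const total = samplesMs.reduce((sum, value) => sum + value, 0);
  return { meanMs: total / samplesMs.length, maxMs: Math.max(...samplesMs) };
}

// Take sampleCount consecutive samples and summarize them.
async function measureLag(sampleCount: number): Promise<{ meanMs: number; maxMs: number }> {
  const samples: number[] = [];
  for (let i = 0; i < sampleCount; i++) {
    samples.push(await sampleLoopLagMs());
  }
  return summarizeLag(samples);
}
```

A production version would presumably spread samples across the 30s window with an unref'd timer instead of taking them back to back.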