# Plan 03 — Worker / Daemon Lifecycle Hardening

> **Scope**: Fix accumulated worker / daemon lifecycle bugs in claude-mem.
> Address DB bloat, chroma-mcp leaks, retry storms, port/PID races, queue zombies, missing supervision, and observability gaps.
>
> **Non-implementation**: This document is a plan. Each phase is self-contained; an executing agent should be able to run a single phase without re-discovering context.
>
> **Audience**: Subsequent agents executing one phase per session.

---

## Phase 0 — Documentation Discovery & Allowed APIs

**Goal**: Anchor every implementation phase in real APIs that exist in the current codebase or in vetted libraries. Prevent phantom-method invention.

### 0.1 Read these files end-to-end before touching code

| File | Why |
| --- | --- |
| `CLAUDE.md` (project root) | Architecture, exit-code strategy, Pro/OSS boundary, settings conventions |
| `src/services/worker-service.ts` | `WorkerService` class, `--daemon` `main()`, signal registration, all CLI subcommands |
| `src/services/worker-spawner.ts` | `ensureWorkerStarted` 3-state machine (`ready`/`warming`/`dead`) |
| `src/services/infrastructure/ProcessManager.ts` | `spawnDaemon`, PID file ops, `captureProcessStartToken`, `isProcessAlive` |
| `src/services/infrastructure/HealthMonitor.ts` | `isPortInUse`, `waitForHealth`, `waitForReadiness`, `httpShutdown` |
| `src/services/infrastructure/GracefulShutdown.ts` | `performGracefulShutdown` ordering |
| `src/services/infrastructure/CleanupV12_4_3.ts` | `runOneTimeV12_4_3Cleanup`, `STUCK_PENDING_THRESHOLD = 10`, observer-purge SQL |
| `src/services/sync/ChromaMcpManager.ts` | `ensureConnected`, `connectInternal`, `stop`, `killProcessTree`, `collectDescendantPids`, `RECONNECT_BACKOFF_MS = 10_000`, `MCP_CONNECTION_TIMEOUT_MS = 30_000` |
| `src/supervisor/index.ts` | `Supervisor` class, `validateWorkerPidFile`, signal-handler config |
| `src/supervisor/process-registry.ts` | `ProcessRegistry`, `getSdkProcessForSession`, `ensureSdkProcessExit`, `waitForSlot`, `TOTAL_PROCESS_HARD_CAP = 10` |
| `src/supervisor/health-checker.ts` | 30s `pruneDeadEntries` loop (already present — extend, don't replace) |
| `src/supervisor/shutdown.ts` | `runShutdownCascade`, `signalProcess`, `loadTreeKill` |
| `src/services/worker/SessionManager.ts` | In-memory session map, `deleteSession`, queue/pending integration |
| `src/services/worker/RestartGuard.ts` | Per-session restart cap (10/60s window, 5 consecutive) |
| `src/services/worker/retry.ts` | Provider-level retry (`withRetry`, classified errors) — DO NOT mutate; circuit breaker layers ABOVE this |
| `src/shared/worker-utils.ts` | `recordWorkerUnreachable` (line 401), `executeWithWorkerFallback` (line 443), fail-loud counter file at `~/.claude-mem/state/hook-failures.json` |
| `src/services/sqlite/Database.ts` | PRAGMA setup (lines 27-32, 69-74) — single source of truth for DB pragmas |
| `src/services/server/Server.ts` | `/api/health` (line 161), `/api/readiness` (line 178), `/api/version` (line 192) |
| `src/shared/SettingsDefaultsManager.ts` | Where every new setting key MUST be declared with a default |
| `src/shared/hook-constants.ts` | `HOOK_TIMEOUTS`, `HOOK_EXIT_CODES` — extend here, don't inline |
| `plugin/bun-runner.js`, `plugin/scripts/worker-service.cjs` | Built worker entrypoint — note the build pipeline (`scripts/build-hooks.js`) |

### 0.2 Allowed APIs (use these, do NOT invent siblings)

**SQLite (bun:sqlite)** — pragma calls are `db.run('PRAGMA …')` or `db.prepare('PRAGMA …').get()`. Existing pragmas: `journal_mode=WAL`, `synchronous=NORMAL`, `foreign_keys=ON`, `temp_store=memory`, `mmap_size`, `cache_size`. **VACUUM** runs only outside a transaction. `VACUUM INTO 'path'` is the backup form already used in `CleanupV12_4_3.ts:135`. `wal_checkpoint(TRUNCATE)` is the truncating-checkpoint form.
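
As an illustration of the two pragma call forms named above, here is a minimal sketch. `PragmaDb` is a hypothetical structural interface standing in for the `bun:sqlite` handle so the snippet runs without a database; `readPragmaNumber` and `truncateWal` are illustrative names, not existing helpers. It assumes a single-value PRAGMA row comes back as an object keyed by the pragma name.

```typescript
// Hypothetical minimal shape of the database handle (assumption: mirrors the
// `db.run(...)` / `db.prepare(...).get()` forms listed in section 0.2).
interface PragmaDb {
  run(sql: string): unknown;
  prepare(sql: string): { get(): unknown };
}

// Read a single-value PRAGMA such as page_count or freelist_count.
// Assumes the row is keyed by the pragma name, e.g. { page_count: 1000 }.
function readPragmaNumber(db: PragmaDb, name: string): number {
  const row = db.prepare(`PRAGMA ${name}`).get() as Record<string, unknown> | null | undefined;
  const value = row ? row[name] : undefined;
  if (typeof value !== 'number') {
    throw new Error(`PRAGMA ${name} returned no numeric value`);
  }
  return value;
}

// Issue the truncating checkpoint form. Must not be wrapped in a transaction.
function truncateWal(db: PragmaDb): void {
  db.run('PRAGMA wal_checkpoint(TRUNCATE)');
}
```

In the worker, the real database handle would be passed where the sketch takes `PragmaDb`; the interface exists only to keep the example self-contained.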

**Process supervision** — `getSupervisor()`, `getProcessRegistry()`, `registerProcess(id, info, processRef?)`, `unregisterProcess(id)`, `pruneDeadEntries()`, `assertCanSpawn(type)`, `runShutdownCascade(...)`. Tree-kill on POSIX uses `pgrep -P` recursion + `process.kill(-pgid, signal)`; on Windows uses `taskkill /T /F /PID` or `tree-kill` npm.

**HTTP/Express** — `Server.app.get('/api/...', handler)` via `registerRoutes` (handlers implement `setupRoutes(app)` on a `RouteHandler` interface). Every new endpoint must follow the existing `RouteHandler` pattern under `src/services/worker/http/routes/`.

**Settings** — `SettingsDefaultsManager.get('CLAUDE_MEM_…')`, `SettingsDefaultsManager.loadFromFile(path)`. New keys require: (a) type added to the interface in `SettingsDefaultsManager.ts`, (b) default value declared in the same file, (c) documented in CLAUDE.md if user-tunable.

**Logging** — `logger.info(category, msg, fields)`, `logger.warn`, `logger.error(category, msg, fields, error)`. Categories used here: `SYSTEM`, `WORKER`, `SESSION`, `CHROMA_MCP`, `SDK`, `DB`, `QUEUE`, `PROCESS`. Add new category `MAINTENANCE` for VACUUM / reaper events.

### 0.3 Anti-patterns — explicitly forbidden

- **Do not** add a new singleton supervisor — extend `getSupervisor()`.
- **Do not** spawn child processes without going through `getSupervisor().assertCanSpawn(...)` and `registerProcess(...)`.
- **Do not** call `process.exit(1)` on hook-side error paths — it accumulates Windows Terminal tabs (CLAUDE.md exit-code strategy). Use `0` for graceful, `2` only for blocking-error paths that need to surface stderr to Claude.
- **Do not** delete `sdk_sessions` rows if `observations` or `session_summaries` still reference their `memory_session_id` without an explicit user-opt-in flag.
- **Do not** hold a SQLite write lock during `VACUUM` while ingestion is hot. Pause queue processing first.
- **Do not** introduce `setInterval` timers that keep the event loop alive — every new timer must call `.unref()`.
- **Do not** invent settings keys — declare them in `SettingsDefaultsManager.ts` first.

### 0.4 Confidence note

Confidence: HIGH on file/API inventory (read-pass complete on all referenced files). MEDIUM on Windows behavior of new advisory locks (Windows mandatory locking via `lockf` is bun-runtime-dependent — verify via spike before committing).

---

## Phase 1 — Inventory & Instrumentation (read-only, safe)

**Goal**: Produce a written state-machine diagram and an exit-site catalog that subsequent phases reference. No code changes; create a scratch document at `docs/internal/worker-lifecycle-state-machine.md` if the executor wants an artifact, otherwise capture findings in commit messages.

### 1.1 Tasks

1. **Trace the worker daemon spawn → terminate path** end-to-end. Source order:
   - Hook entry → `src/shared/worker-utils.ts:ensureWorkerRunning` (lazy spawn) OR `src/services/worker-spawner.ts:ensureWorkerStarted` (explicit)
   - `spawnDaemon` (`src/services/infrastructure/ProcessManager.ts:408`) — POSIX uses `setsid` if available, Windows uses `Start-Process -WindowStyle Hidden`
   - `--daemon` branch in `src/services/worker-service.ts:937` — duplicate-PID/duplicate-port guard
   - `WorkerService.start()` (line 258) → `startSupervisor()` → `server.listen()` → `writePidFile()` → `getSupervisor().registerProcess('worker', ...)` → `initializeBackground()`
   - Signal handlers via `configureSupervisorSignalHandlers` (`src/supervisor/index.ts:49`) — SIGTERM/SIGINT; SIGHUP ignored in `--daemon` mode on POSIX
   - Shutdown: `WorkerService.shutdown()` → `performGracefulShutdown` → server close → `sessionManager.shutdownAll()` → mcp client close → chroma stop → db close → `getSupervisor().stop()` → `runShutdownCascade` → PID file unlink
2. **Catalog every `process.exit(...)` site** in worker-service.ts (already mapped — 21 sites; lines 764, 772, 794, 804, 810, 813, 828, 835, 842, 853, 870, 878, 888, 895, 916, 933, 945, 950, 971, 975, 991). Annotate each with: code, intent, whether it leaks the worker on the same path, whether shutdown ran first.
3. **Catalog every retry / unreachable site**:
   - `src/shared/worker-utils.ts:401 recordWorkerUnreachable` (the #1874 counter)
   - `src/cli/handlers/{context,file-context,file-edit,summarize,observation,user-message,session-init}.ts` — every `executeWithWorkerFallback` caller
   - `src/servers/mcp-server.ts:72,100,145` — direct `workerHttpRequest`
   - `src/services/transcripts/processor.ts:331,371,373` — direct `workerHttpRequest`
   - `src/services/integrations/CursorHooksInstaller.ts:64,349,352` — direct `workerHttpRequest`
   - `src/utils/claude-md-utils.ts:305` — direct `workerHttpRequest`
4. **Catalog every spawn site**:
   - `spawnDaemon` (worker self-spawn)
   - `ChromaMcpManager.connectInternal` (chroma-mcp via uvx → uv → python → chroma-mcp)
   - `spawnSdkProcess` (`src/supervisor/process-registry.ts:532`) — Claude SDK subprocesses
   - `runMcpSelfCheck` (`src/services/worker-service.ts:405`) — MCP loopback probe via `process.execPath`
   - Any `execSync` / `execFile` / `spawnSync` in `ChromaMcpManager` (cert resolution) or `ProcessManager` (binary lookup, cwd-remap)

### 1.2 Acceptance criteria

- Markdown table written (commit message or scratch doc) listing every spawn and exit site with file:line.
- A 1-paragraph English description of the worker state machine (states + transitions) suitable to paste into PR descriptions.
- Confirmed list of which `executeWithWorkerFallback` callers run inside hooks (Claude Code's strict timeout window) vs. inside the worker (no timeout pressure) — this drives Phase 4 circuit-breaker scoping.

### 1.3 Verification

- `grep -rn "process.exit" src/ --include="*.ts" | wc -l` matches the catalog.
- `grep -rn "executeWithWorkerFallback\|workerHttpRequest" src/ --include="*.ts" | grep -v worker-utils.ts | wc -l` matches the catalog.

### 1.4 Deliverable

Hand-off note for Phase 2-8 executors with file/line anchors; no code committed.

---

## Phase 5 — PID/Port Reclamation & Race-Free Startup

> Shipping order: **Phase 5 first** (per Phase 8 ordering). Idempotent and safe.

**Goal**: Eliminate the silent-exit-0 case where a fresh `--daemon` spawn loses the port race; harden cross-platform PID-reuse detection; serialize concurrent spawns with an OS-level advisory lock.

### 5.1 Files to modify

| File | Change |
| --- | --- |
| `src/supervisor/process-registry.ts` | Extend `captureProcessStartToken` for macOS (already partial via `ps -o lstart`) and Windows (`wmic process where ProcessId=X get CreationDate /value`). Add unit test for each platform branch. |
| `src/supervisor/index.ts:validateWorkerPidFile` | Add port-on-pid match check — if `pidInfo.port !== currentExpectedPort`, treat as `'stale'`. |
| `src/services/infrastructure/ProcessManager.ts` | Add new exports: `acquireDaemonLock()` / `releaseDaemonLock()` using POSIX `flock` (via `fcntl`/`flock` syscall through `bun:ffi` or shelling to `flock(1)` on Linux only) and Windows mandatory file lock via `LockFile` (or fall back to atomic-rename sentinel on Windows). |
| `src/services/worker-service.ts:937` (`--daemon` branch) | Wrap startup in `acquireDaemonLock()`. If port is in use, perform a `/api/version` probe; if the listener returns OUR `BUILT_IN_VERSION` → exit 0 (legit duplicate); if it returns a different version → log a warning and exit 0 (stale worker, will be restarted by version-mismatch path); if the listener doesn't respond → wait `HOOK_TIMEOUTS.PORT_IN_USE_WAIT` then write a clear stderr line with diagnostic before exiting. |
| `src/services/worker-spawner.ts` | Same lock acquisition before `spawnDaemon`. Release on success or error. |

### 5.2 Detailed tasks

1. **macOS start-time token**: extend `captureProcessStartToken` (registry line 56). On Darwin, prefer `ps -p -o lstart=` (already in fallback path). Verify with `LC_ALL=C LANG=C` env so locale doesn't change the timestamp format. Add a comment explaining that `ps lstart` resolution is 1-second — collisions still possible but vastly less likely than no-token.
2. **Windows start-time token**: add a Win32 branch using `wmic process where ProcessId= get CreationDate /value`. Parse the `CreationDate=YYYYMMDDHHMMSS.ffffff+TZ` line. Cache the wmic resolution per-pid for 5s (avoid re-shelling on repeat checks).
3. **Port-on-pid match**: in `validateWorkerPidFile`, after confirming `isPidAlive(pidInfo.pid)`, verify the recorded `pidInfo.port` is reachable via `isPortInUse(pidInfo.port)` AND the listener's `/api/version` returns a version string. If port is dead but PID alive → return `'stale'` (worker crashed mid-listen, PID about to be reused).
4. **Advisory lock**:
   - POSIX: open `/.worker-spawn.lock` with `O_RDWR | O_CREAT`, `flock(fd, LOCK_EX | LOCK_NB)`. On EAGAIN, log `Another spawn in progress, waiting up to 5s` and retry with `LOCK_EX` (blocking) under a `setTimeout` race. Implement via `bun:ffi` for POSIX `flock(2)` if available, otherwise shell `flock -n -x `. **Spike first**: confirm bun's `bun:ffi` exposes `flock`. If not, use a watch-and-rename sentinel (less ideal but works).
   - Windows: Use `LockFile` via Win32 API or fall back to atomic `mkdirSync` of `/.worker-spawn.lock.dir` (fails if exists) with stale-timeout cleanup at 30s.
5. **Diagnostic stderr**: when port-in-use without our worker responding, write to stderr (and log INFO) with: `claude-mem worker port in use by an unidentified process; not spawning duplicate`. This must NOT block the hook — exit 0 still per CLAUDE.md.
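
The port-in-use branching described in 5.1 and task 5 can be sketched as a pure decision function. This is an illustrative sketch under stated assumptions, not the worker's actual code: `VersionProbe`, `PortInUseDecision`, and `decidePortInUse` are hypothetical names, and the probe itself (the HTTP call to `/api/version`) is assumed to happen elsewhere.

```typescript
// Hypothetical result of probing GET /api/version on the occupied port.
type VersionProbe =
  | { kind: 'responded'; version: string }
  | { kind: 'no_response' };

type PortInUseOutcome = 'legit_duplicate' | 'stale_version' | 'foreign_listener';

interface PortInUseDecision {
  outcome: PortInUseOutcome;
  exitCode: 0;                 // every branch exits 0 so the hook is never blocked
  waitBeforeReportMs: number;  // how long to wait before writing the diagnostic
  stderrLine: string | null;   // diagnostic line, only for the foreign-listener case
}

// Message text taken from task 5 of section 5.2.
const FOREIGN_LISTENER_MESSAGE =
  'claude-mem worker port in use by an unidentified process; not spawning duplicate';

function decidePortInUse(
  probe: VersionProbe,
  builtInVersion: string,   // stands in for BUILT_IN_VERSION
  portInUseWaitMs: number,  // stands in for HOOK_TIMEOUTS.PORT_IN_USE_WAIT
): PortInUseDecision {
  if (probe.kind === 'responded') {
    // Same version: a legitimate duplicate. Different version: a stale worker
    // that the version-mismatch path is expected to restart.
    const outcome: PortInUseOutcome =
      probe.version === builtInVersion ? 'legit_duplicate' : 'stale_version';
    return { outcome, exitCode: 0, waitBeforeReportMs: 0, stderrLine: null };
  }
  // Nothing answered /api/version: wait, then report loudly instead of exiting silently.
  return {
    outcome: 'foreign_listener',
    exitCode: 0,
    waitBeforeReportMs: portInUseWaitMs,
    stderrLine: FOREIGN_LISTENER_MESSAGE,
  };
}
```

Keeping the decision separate from the probe makes the three branches unit-testable without binding a port.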

### 5.3 New settings

| Key | Default | Range | Purpose |
| --- | --- | --- | --- |
| `CLAUDE_MEM_DAEMON_LOCK_TIMEOUT_MS` | `5000` | 0–60000 | Max wait for the spawn lock |
| `CLAUDE_MEM_PID_PORT_RECHECK_MS` | `2000` | 500–30000 | Wait window before treating port-in-use without `/api/version` response as "unknown listener" |

### 5.4 Acceptance criteria

- Run two `claude-mem start` commands in parallel → exactly one daemon ends up alive; the other exits cleanly with a log line referencing the lock.
- Kill the worker `-9` (skip cleanup), reuse the PID with `python -c 'import time; time.sleep(60)'` → `validateWorkerPidFile` returns `'stale'` and removes the file.
- On macOS, run worker, capture token, kill, spawn unrelated process with same PID, spawn worker again → token mismatch detected; old PID file ignored.
- `/api/version` probe path: spawn a fake server on the worker port → daemon exits 0 with the new diagnostic stderr, NOT silently.

### 5.5 Observability hooks

- Log `SYSTEM` INFO `Daemon spawn lock acquired` on success.
- Log `SYSTEM` WARN `Daemon spawn lock contention`, fields `{waitedMs}`.
- Log `SYSTEM` WARN `Worker port occupied by foreign listener`, fields `{port, probeStatus}`.
- New `/api/healthz` fields (added in Phase 7): `pid_file_path`, `pid_start_token`, `daemon_lock_held: bool`.

### 5.6 Verification checklist

- [ ] `grep "process.exit(0)" src/services/worker-service.ts` — count unchanged (no new silent exits introduced).
- [ ] Manual two-process race test (Linux + macOS + Windows VM).
- [ ] Existing health-check tests still pass.
- [ ] No new always-on `setInterval` introduced.

---

## Phase 6 — DB Maintenance (VACUUM / WAL)

> Ships alongside Phase 5 (idempotent).

**Goal**: Recover the 504 MB of free pages, prevent recurrence, surface DB-size metrics.

### 6.1 Files to modify

| File | Change |
| --- | --- |
| `src/services/sqlite/Database.ts:27-32` and `:69-74` | Add `PRAGMA auto_vacuum = INCREMENTAL` BEFORE the first table is created (only takes effect on a fresh DB; harmless on existing DBs but logs a no-op). For existing DBs, the migration path is the one-shot Phase-6 startup VACUUM. |
| `src/services/maintenance/DbMaintenance.ts` (new) | Periodic maintenance task: on a 24h timer (configurable), call `PRAGMA incremental_vacuum`, `PRAGMA wal_checkpoint(TRUNCATE)`, then collect metrics (`page_count`, `freelist_count`, file size). Emit `MAINTENANCE` INFO log. Acquire `dbMaintenanceMutex` so other writers wait. |
| `src/services/maintenance/DbMaintenance.ts` | Startup check: if `freelist_count / page_count >= CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO` (default 0.40), perform full `VACUUM` after `VACUUM INTO` backup to `/backups/claude-mem-pre-vacuum-.db`. Pause queue processor first. |
| `src/services/worker-service.ts:initializeBackground` | Wire the maintenance task — start after `dbManager.initialize()`. Timer must `.unref()`. |
| `src/services/worker/SessionManager.ts` | Expose `pauseQueueProcessing(): Promise` and `resumeQueueProcessing(): void`. Use the existing AbortController + emitter to drain in-flight work; don't introduce new state. Maintenance acquires; readers continue (WAL allows them). |
| `src/services/infrastructure/CleanupV12_4_3.ts:135` | Reuse the existing `VACUUM INTO` backup pattern verbatim — copy the disk-space pre-flight check (`statfsSync`, line 115). |

### 6.2 Detailed tasks

1. **Auto-vacuum on new DBs**: Add `PRAGMA auto_vacuum = INCREMENTAL` in `Database.ts` BEFORE `migrationRunner.runAllMigrations()`. Add a comment noting that this is a no-op on existing DBs (sqlite docs say a full VACUUM is required to flip auto_vacuum mode after tables exist). Document the migration path: existing users get the freed-page reclamation via the startup full VACUUM in step 3.
2. **Periodic incremental vacuum + WAL checkpoint**:
   - Schedule via `setInterval` with `.unref()`. Default cadence: 24h. Setting: `CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS` (default `24`, min `1`, max `168`).
   - Each tick: acquire mutex → `db.run('PRAGMA incremental_vacuum')` → `db.run('PRAGMA wal_checkpoint(TRUNCATE)')` → snapshot metrics → release.
   - Skip the tick if a `VACUUM` is in progress.
3. **Startup full VACUUM (one-shot per session) when free-ratio is high**:
   - Read `page_count` (`PRAGMA page_count`) and `freelist_count` (`PRAGMA freelist_count`).
   - If `freelist_count / page_count >= CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO` (default `0.40`), schedule a deferred VACUUM (5 minutes after worker becomes ready) to avoid slowing startup.
   - VACUUM steps: pause queue → `VACUUM INTO ''` → verify backup → `VACUUM` (full) → resume queue → log freed pages and ms taken.
   - Disk-space pre-flight: `statfsSync` (mirror `CleanupV12_4_3.ts:115`). Skip if free space < `1.2 * dbSize + 100MB`. Log `MAINTENANCE` ERROR in that case so the user sees actionable info.
4. **Pause/resume hook in SessionManager**: The existing `for await ... of getMessageIterator()` loop in queue processor needs a "pause" semaphore. Implementation: add a `Promise` gate that the iterator awaits before yielding. Maintenance flips it to a pending promise during VACUUM; resolve to release. **Do not** abort in-flight messages — they can complete; new messages wait.
5. **Cleanup-V12.4.3 regression detection**: Re-scan `sdk_sessions WHERE project = OBSERVER_SESSIONS_PROJECT` and `pending_messages` matching the stuck-pending pattern at maintenance ticks. If any match AND the marker exists, log `MAINTENANCE` WARN and re-run the purge (idempotent). Setting: `CLAUDE_MEM_CLEANUP_REGRESSION_CHECK = true`.
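
The startup-VACUUM checks in task 3 and the pause gate in task 4 can be sketched as follows. This is a minimal illustration under stated assumptions: the function and class names are hypothetical, "100MB" is read as 100 MiB, and the real implementation would obtain `freeBytes` from `statfsSync` and the page counts from the PRAGMAs.

```typescript
const MIB = 1024 * 1024; // assumption: "100MB" in the plan is treated as 100 MiB

// Free-page ratio; an empty database (page_count = 0) is treated as ratio 0.
function freeRatio(pageCount: number, freelistCount: number): number {
  return pageCount > 0 ? freelistCount / pageCount : 0;
}

// True when the ratio reaches CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO (default 0.40).
function shouldScheduleVacuum(pageCount: number, freelistCount: number, thresholdRatio: number): boolean {
  return freeRatio(pageCount, freelistCount) >= thresholdRatio;
}

// Disk-space pre-flight: VACUUM is skipped if free space < 1.2 * dbSize + 100MB.
function hasDiskSpaceForVacuum(freeBytes: number, dbSizeBytes: number): boolean {
  return freeBytes >= 1.2 * dbSizeBytes + 100 * MIB;
}

// Promise gate awaited by the queue iterator before it yields the next message.
// In-flight messages are never aborted; only new yields wait while paused.
class PauseGate {
  private gate: Promise<void> = Promise.resolve();
  private release: (() => void) | null = null;

  isPaused(): boolean {
    return this.release !== null;
  }

  pause(): void {
    if (this.release !== null) return; // already paused, keep the same pending gate
    this.gate = new Promise<void>((resolve) => {
      this.release = resolve;
    });
  }

  resume(): void {
    const release = this.release;
    this.release = null;
    if (release) release(); // resolves the pending gate and unblocks waiters
  }

  waitIfPaused(): Promise<void> {
    return this.gate;
  }
}
```

With 4 KiB pages, the 532 MB database holding 504 MB of free pages corresponds to a ratio of roughly 0.95, well above the 0.40 default, so the deferred VACUUM would be scheduled.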

### 6.3 New settings

| Key | Default | Range | Purpose |
| --- | --- | --- | --- |
| `CLAUDE_MEM_DB_MAINTENANCE_ENABLED` | `true` | bool | Master kill-switch |
| `CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS` | `24` | 1–168 | Periodic cadence |
| `CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO` | `0.40` | 0.05–0.95 | Free-ratio above which we auto-VACUUM at startup |
| `CLAUDE_MEM_DB_VACUUM_STARTUP_DELAY_MS` | `300000` (5 min) | 0–3600000 | Defer startup VACUUM so it doesn't block readiness |
| `CLAUDE_MEM_CLEANUP_REGRESSION_CHECK` | `true` | bool | Re-scan v12.4.3-shaped pollution |

### 6.4 Acceptance criteria

- Reproduce the bloat scenario: stuff `pending_messages` with 100k stuck `processing` rows, run worker → startup VACUUM fires within 5 min after readiness, freed-pages log line appears, file size drops.
- Existing 532 MB DBs reclaim ≥ 95% of free pages on first run (matches the 28 MB target observed manually).
- Hot-ingestion test: enqueue 1000 observations during a maintenance tick → no `SQLITE_BUSY` or `database is locked` errors; queue resumes after VACUUM.
- `PRAGMA auto_vacuum` returns `2` (incremental) on freshly-created DBs.
- Maintenance loop ticks honor `.unref()` — `process.exit(0)` from a clean shutdown returns immediately, not after the 24h interval.

### 6.5 Observability hooks

- New log category: `MAINTENANCE`.
- Events: `MaintenanceStart`, `MaintenanceTick`, `VacuumStart`, `VacuumComplete` (`{freedPages, ms, dbSizeBeforeMb, dbSizeAfterMb}`), `VacuumSkippedLowDisk`, `RegressionDetected`, `MaintenanceComplete`.
- `/api/healthz` fields (Phase 7): `db_page_count`, `db_freelist_count`, `db_free_ratio_pct`, `db_size_bytes`, `db_last_vacuum_at`, `db_last_vacuum_freed_pages`, `db_last_maintenance_at`.

### 6.6 Anti-pattern guards

- **Do not** call `VACUUM` inside a transaction (sqlite errors).
- **Do not** hold the queue pause across the `VACUUM INTO` backup phase — only the final full `VACUUM` needs the writer-lock window. (`VACUUM INTO` works on a read-only snapshot.)
- **Do not** call `PRAGMA wal_checkpoint(FULL)` — TRUNCATE is required to actually shrink the WAL file.

### 6.7 Verification checklist

- [ ] Backup created at `/backups/` before every full VACUUM.
- [ ] Maintenance timer registered with `.unref()` (grep for `setInterval` in the new file → `unref()` follows each).
- [ ] No new direct `setInterval` outside the maintenance file.
- [ ] PRAGMA list in `Database.ts` extended with `auto_vacuum` and includes a comment about migration.

---

## Phase 2 — Stuck-Session Reaper (fix v12.4.3 bloat)

**Goal**: Stop `pending_messages` and `sdk_sessions` from accumulating zombies.

### 2.1 Files to modify

| File | Change |
| --- | --- |
| `src/services/maintenance/SessionReaper.ts` (new) | Periodic reaper. Plugs into the supervisor's existing `health-checker.ts` 30s tick (extend, do not replace). |
| `src/supervisor/health-checker.ts:9 runHealthCheck` | Call `SessionReaper.tick()` after `pruneDeadEntries()`. |
| `src/services/worker/SessionManager.ts:deleteSession` | After in-memory delete, call `pendingStore.clearPendingForSession(sessionDbId)` synchronously (it already does this via `clearPendingForSession` on a separate path — verify and unify). |
| `src/services/sqlite/PendingMessageStore.ts` | Add `reapStuckProcessing(olderThanMs: number): number` returning the count of rows reset to `pending`. |
| `src/services/sqlite/SessionStore.ts` | Add `findInactiveSdkSessions(olderThanDays: number): Array<{id, project, contentSessionId, memorySessionId, lastActivityAt}>`. |
| `src/services/sqlite/SessionStore.ts` | Add `markSdkSessionInactive(id: number)` — adds an `inactive_at` column or sets a sentinel. |
| `src/services/sqlite/migrations/runner.ts` | New migration: add `inactive_at TEXT NULL` to `sdk_sessions` if absent. |

### 2.2 Reaper logic

Per tick (default 30s, gated by `CLAUDE_MEM_REAPER_ENABLED`):

1. **Stuck-processing sweep**: `UPDATE pending_messages SET status='pending' WHERE status='processing' AND updated_at < ` (default 5 minutes). Log count if > 0.
2. **Orphan-pending sweep**: `DELETE FROM pending_messages WHERE session_db_id NOT IN (SELECT id FROM sdk_sessions)` (defensive — should already be FK-protected but log if any deleted).
3. **Inactive-session detection** (does NOT delete):
   - SELECT sdk_sessions where `id NOT IN ` AND `last_activity > N days ago` (computed from MAX of related observations / pending_messages / session_summaries timestamps).
   - For each: `UPDATE sdk_sessions SET inactive_at = WHERE id = ? AND inactive_at IS NULL`.
4. **Observer-pollution regression check** (matches Phase 6 task 5):
   - If `OBSERVER_SESSIONS_PROJECT` rows reappear after the v12.4.3 marker is present, re-run the purge SQL from `CleanupV12_4_3.runObserverSessionsPurge` (lines 196-218).
   - Log `MAINTENANCE` WARN with counts.
5. **Hard delete is opt-in** via `CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS` (default `0` = disabled; nonzero = days threshold). When enabled and a session has `inactive_at` older than the threshold AND no FK-referencing rows, hard-delete the session row. Default-off because user data safety > disk space.
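
The selection rules in steps 3 and 5 can be sketched as pure functions over already-loaded rows. This is an illustrative sketch, not the reaper's actual code: `SdkSessionRow` and its fields (`lastActivityAtMs`, `inactiveAtMs`, `fkReferenceCount`) are hypothetical, and the real implementation would express the same conditions in SQL.

```typescript
// Hypothetical in-memory view of an sdk_sessions row (field names are assumptions).
interface SdkSessionRow {
  id: number;
  lastActivityAtMs: number;     // MAX of related observation / pending / summary timestamps
  inactiveAtMs: number | null;  // null until the reaper marks the session inactive
  fkReferenceCount: number;     // rows in other tables still referencing this session
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Step 3: sessions to mark inactive. In-memory presence always wins, so any id in
// `activeInMemoryIds` is skipped even if its last DB write is old.
function selectSessionsToMarkInactive(
  rows: readonly SdkSessionRow[],
  activeInMemoryIds: ReadonlySet<number>,
  nowMs: number,
  inactiveDays: number,
): number[] {
  return rows
    .filter((r) => r.inactiveAtMs === null)
    .filter((r) => !activeInMemoryIds.has(r.id))
    .filter((r) => nowMs - r.lastActivityAtMs > inactiveDays * DAY_MS)
    .map((r) => r.id);
}

// Step 5: hard delete is opt-in. A threshold of 0 disables it entirely; otherwise a
// session qualifies only if it has been inactive long enough AND nothing references it.
function selectSessionsToHardDelete(
  rows: readonly SdkSessionRow[],
  activeInMemoryIds: ReadonlySet<number>,
  nowMs: number,
  hardDeleteInactiveDays: number,
): number[] {
  if (hardDeleteInactiveDays <= 0) return [];
  return rows
    .filter((r) => r.inactiveAtMs !== null && nowMs - r.inactiveAtMs > hardDeleteInactiveDays * DAY_MS)
    .filter((r) => !activeInMemoryIds.has(r.id))
    .filter((r) => r.fkReferenceCount === 0)
    .map((r) => r.id);
}
```

Separating selection from deletion keeps the default-off safety property easy to test: with the threshold at `0`, the second function returns an empty list regardless of input.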

### 2.3 New settings

| Key | Default | Range | Purpose |
| --- | --- | --- | --- |
| `CLAUDE_MEM_REAPER_ENABLED` | `true` | bool | Master switch |
| `CLAUDE_MEM_REAPER_TICK_MS` | `30000` | 5000–600000 | Tick cadence (piggy-backs supervisor; this value gates whether the reaper runs each tick) |
| `CLAUDE_MEM_REAPER_PROCESSING_STUCK_MS` | `300000` (5 min) | 30000–86400000 | Threshold for a `processing` row to be considered stuck |
| `CLAUDE_MEM_REAPER_INACTIVE_DAYS` | `30` | 1–365 | When to mark a session `inactive_at` |
| `CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS` | `0` | 0–365 | 0 = never; otherwise, hard-delete inactive rows older than N days |

### 2.4 Acceptance criteria

- Inject 50 stuck `processing` rows older than 5 minutes → next reaper tick resets them → `/api/healthz` shows `oldest_processing_pending_age_sec` drop to 0.
- Inject `OBSERVER_SESSIONS_PROJECT` rows post-marker → next tick logs regression and purges them.
- Reaper survives a worker restart without losing state (everything is DB-backed).
- Active sessions (in-memory) are NEVER marked inactive even if their last DB write is old (in-memory presence wins).

### 2.5 Observability

- Log: `MAINTENANCE` INFO `ReaperTick`, fields `{stuckProcessing, orphanPending, markedInactive, hardDeleted, observerRegression}`.
- New `/api/healthz` fields (Phase 7): `oldest_processing_pending_age_sec`, `processing_pending_count`, `pending_count_total`, `sdk_sessions_total`, `sdk_sessions_inactive`, `sdk_sessions_by_project: { [project]: count }`.

### 2.6 Verification checklist

- [ ] Migration adds `inactive_at` column without breaking existing data (test on a copy of a real DB).
- [ ] In-memory active sessions never appear in `findInactiveSdkSessions`.
- [ ] Reaper does NOT cascade-delete `observations` / `session_summaries` unless explicit hard-delete + zero-FK-reference precondition.
- [ ] `/api/healthz` shows reaper metrics.
--- ## Phase 3 — chroma-mcp Child-Process Supervisor **Goal**: Stop the 23-concurrent-chroma-mcp leak. Bound concurrency, reap idle, scan for orphans at startup. ### 3.1 Files to modify | File | Change | | --- | --- | | `src/services/sync/ChromaMcpManager.ts` | Add idle reaper; enforce single-instance via supervisor registry; add startup orphan scan; add `lastCallAt` timestamp updated by `callTool`. | | `src/services/sync/ChromaMcpManager.ts:ensureConnected` (line 43) | Before connect, check `getProcessRegistry().getAll().filter(r => r.type === 'chroma')` — if non-empty AND PID alive AND PID not the current `_process.pid`, refuse to spawn (alert + reuse existing if possible; otherwise wait for backoff). | | `src/services/sync/ChromaMcpManager.ts:registerManagedProcess` (line 613) | Already calls `getSupervisor().registerProcess(CHROMA_SUPERVISOR_ID, ...)` — verify the supervisor enforces single-instance for this id. (Currently `register` is keyed by id so same id replaces; document this.) | | `src/supervisor/process-registry.ts` | Add `getActiveCountByType(type: string): number`. Add `findChromaOrphans(): Promise` — POSIX `pgrep -af 'chroma-mcp'` filtered by PPID == 1. | | `src/services/worker-service.ts:initializeBackground` | After `ChromaMcpManager.getInstance()`, kick off `await ChromaMcpManager.scanAndReapOrphans()` (best-effort; never throws). | ### 3.2 Detailed tasks 1. **Startup orphan scan**: New static method `ChromaMcpManager.scanAndReapOrphans()`: - POSIX: `pgrep -af 'chroma-mcp'` → for each PID, check PPID. If PPID == 1 (re-parented to init), call `killProcessTree(pid)` (existing function at line 388). Log `CHROMA_MCP` INFO `ReapedOrphan`, fields `{pid, ageSec}`. - Windows: `Get-CimInstance Win32_Process -Filter "Name='chroma-mcp.exe'"` filter by parent process state, kill with taskkill. - Bound the scan to processes whose command-line includes `chroma-mcp==` to avoid killing unrelated chroma installations. 2. 
**Idle reaper**: Add `lastCallAt: number = 0` field to `ChromaMcpManager`. Update on every `callTool`. Run a `setInterval(checkIdle, 60_000)` (`.unref()`) — if `connected && Date.now() - lastCallAt > CHROMA_MCP_IDLE_SHUTDOWN_MS` (default 15 min), call `await this.stop()`. Lazy-reconnect resumes on next `callTool`. 3. **Single-instance guard on reconnect**: In `ensureConnected`, before `connectInternal`, call `getProcessRegistry().getActiveCountByType('chroma')`. If > 0 AND the registered PID is alive but `this.connected === false`, this is a stale process (we lost track). Tear it down via `killProcessTree(registeredPid)` first, then proceed with fresh spawn. Otherwise the count grows by one each reconnect — exactly the leak observed. 4. **Hard cap**: extend `getSupervisor().assertCanSpawn('chroma mcp')` (already called at line 87) to actually count and reject. Cap = 1 chroma-mcp per worker. Cap = `TOTAL_PROCESS_HARD_CAP` (10) overall — already enforced for SDK processes; extend to chroma-mcp. 5. **Tighten close path**: in `connectInternal` (line 74), after `transport.close()` / `client.close()`, if the underlying `_process.pid` is still in the registry, call `killProcessTree` and `unregisterProcess` explicitly. Don't rely on `transport.onclose` alone — it has the stale-callback guard but doesn't always fire on connect-time failures. ### 3.3 New settings | Key | Default | Range | Purpose | | --- | --- | --- | --- | | `CLAUDE_MEM_CHROMA_IDLE_SHUTDOWN_MS` | `900000` (15 min) | 60000–86400000 | Idle reaper threshold | | `CLAUDE_MEM_CHROMA_ORPHAN_SCAN_ON_START` | `true` | bool | Master switch for startup scan | | `CLAUDE_MEM_CHROMA_MAX_CONCURRENT` | `1` | 1–4 | Cap chroma-mcp instances per worker | ### 3.4 Acceptance criteria - Spawn 5 chroma-mcp processes manually parented to init; restart worker → all 5 are reaped at startup. - Force connect-time failure (kill transport mid-connect) 10 times → registry count never exceeds 1. 
- Run worker for 30 min with no chroma calls → process is reaped after 15 min and `getProcessRegistry().getActiveCountByType('chroma')` returns 0.
- `callTool` after idle-shutdown lazy-reconnects successfully.

### 3.5 Observability

- Log: `CHROMA_MCP` INFO `OrphanScan` `{found, killed}`.
- Log: `CHROMA_MCP` INFO `IdleShutdown` `{idleMs}`.
- Log: `CHROMA_MCP` WARN `RegistryStale` when single-instance guard tears down a phantom.
- `/api/healthz` fields (Phase 7): `chroma_mcp_pid_count`, `chroma_mcp_last_call_at`, `chroma_mcp_state` ('connected'|'disconnected'|'backoff'), `chroma_mcp_backoff_remaining_ms`.

### 3.6 Anti-pattern guards

- **Do not** kill chroma processes whose command line doesn't match `chroma-mcp==`; anything else could be an unrelated user install.
- **Do not** spin up the idle-reaper timer if `chromaMcpManager` is null (chroma disabled via `CLAUDE_MEM_CHROMA_ENABLED=false`).
- **Do not** call `getProcessRegistry()` from outside the worker process — it's worker-internal.

### 3.7 Verification checklist

- [ ] After 2.5 hours of normal use, `ps aux | grep '[c]hroma-mcp' | wc -l` ≤ 1 (the bracket keeps `grep` from counting itself).
- [ ] Idle-reaper timer is `.unref()`d.
- [ ] Orphan scan tolerates `pgrep` returning empty (no false-error logs).
- [ ] Build still passes on Windows (Win32 branch compiles even if not unit-tested).

---

## Phase 4 — Circuit Breaker for Retry Storms

**Goal**: Replace the unbounded counter at `worker-utils.ts:401` with a real circuit breaker. Stop hooks from hammering the worker when it's down.

### 4.1 Files to modify

| File | Change |
| --- | --- |
| `src/shared/worker-circuit-breaker.ts` (new) | `CircuitBreaker` class: states `CLOSED`, `OPEN`, `HALF_OPEN`. Persist to `~/.claude-mem/state/circuit-breaker.json`. |
| `src/shared/worker-utils.ts:executeWithWorkerFallback` (line 443) | Wrap the call in `breaker.run(...)`. On `OPEN`, return `WorkerFallback` immediately (no HTTP). |
| `src/shared/worker-utils.ts:recordWorkerUnreachable` (line 401) | Becomes a thin shim that calls `breaker.recordFailure()`. Hard cap (`MAX_LIFETIME_FAILURES = 50`) trips the breaker permanently until manual reset. |
| `src/shared/worker-utils.ts:resetWorkerFailureCounter` (line 419) | Becomes `breaker.recordSuccess()`. |
| `src/cli/hook-command.ts` | Verify the swallowed-stderr fix from observation 2026-05-07 is applied (it's marked as a "no-op replacement bug"). The breaker's stderr-fail-loud path must actually write via `process.stderr.write()`, not a stub. |
| `src/services/server/Server.ts` | Add `/api/admin/breaker/reset` POST endpoint (restricted to localhost) for manual unsticking. |

### 4.2 Breaker semantics

States and transitions:

```
CLOSED ──[N consecutive failures]──> OPEN
OPEN ──[reset_timeout_ms elapsed]──> HALF_OPEN
HALF_OPEN ──[1 success]──> CLOSED
HALF_OPEN ──[1 failure]──> OPEN (resets timer)
ANY ──[lifetime failures > MAX_LIFETIME_FAILURES]──> OPEN_PERMANENT (until manual reset via API or settings reload)
```

Defaults:

| Setting | Default | Range |
| --- | --- | --- |
| `CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD` | `5` | 1–50 |
| `CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS` | `30000` | 1000–600000 |
| `CLAUDE_MEM_BREAKER_HALF_OPEN_MAX_PROBES` | `1` | 1–10 |
| `CLAUDE_MEM_BREAKER_LIFETIME_CAP` | `50` | 0–10000 (0 = no cap) |

Persistent state file shape:

```json
{
  "state": "CLOSED|OPEN|HALF_OPEN|OPEN_PERMANENT",
  "consecutiveFailures": 0,
  "lifetimeFailures": 0,
  "openedAt": null,
  "lastFailureAt": null,
  "lastSuccessAt": null,
  "lastTrippedAt": null
}
```

### 4.3 Detailed tasks

1. **CircuitBreaker class**: the state-transition logic is pure (no I/O). Methods: `getState()`, `canAttempt()`, `recordFailure(reason)`, `recordSuccess()`, `forceReset()`. Persisting the JSON snapshot is the only I/O and uses atomic file writes (write tmp + rename), mirroring `writeHookFailureStateAtomic` (worker-utils.ts:372).
2. **Wire into `executeWithWorkerFallback`**:

   ```
   if (!breaker.canAttempt()) {
     // Optional: print one-line stderr if state changed during this call
     return { continue: true, reason: 'circuit_breaker_open', [WORKER_FALLBACK_BRAND]: true };
   }
   const alive = await ensureWorkerAliveOnce();
   if (!alive) { breaker.recordFailure('unreachable'); ... }
   ...
   if (response.ok) breaker.recordSuccess();
   ```

3. **Fail-loud stderr fix**: The 2026-05-07 observation mentions a "stderr no-op replacement bug" in `hookCommand`. Investigate `src/cli/hook-command.ts` for any `process.stderr.write` shim that suppresses output. The breaker's diagnostic ("Worker unreachable; circuit breaker OPEN; will retry in Xs") MUST appear on the user's terminal so they know what's happening. Test by intentionally killing the worker and running a hook; the message should appear on stderr.
4. **Manual reset endpoint**: `POST /api/admin/breaker/reset` (no body required). Restricted to `127.0.0.1` only. Logs `SYSTEM` WARN `BreakerForceReset` with caller info.
5. **Lifetime cap**: when `lifetimeFailures > CLAUDE_MEM_BREAKER_LIFETIME_CAP`, transition to `OPEN_PERMANENT`. The only way out is the manual-reset API or restarting the worker with a fresh state file. Print prominent stderr: `claude-mem: 50 lifetime worker failures detected. Disabling memory hooks until reset. Run: claude-mem worker doctor`.

### 4.4 Acceptance criteria

- Kill the worker, run 100 hooks → exactly `CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD` HTTP attempts made; rest short-circuit.
- After 30s idle, next hook makes ONE probe (HALF_OPEN); if probe succeeds, breaker closes.
- Lifetime cap (set to 5 for testing): 6th lifetime failure → permanent open until `POST /api/admin/breaker/reset` clears it.
- Stderr message visible to user when breaker opens (manual repro: kill worker, run 5+ hooks).
- Existing hook-failures.json file is migrated to the new breaker JSON format on first run (one-shot migration in `worker-utils.ts`).
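The transitions in 4.2 can be sketched as a small in-memory state machine. This is a minimal sketch under stated assumptions, not the planned implementation: the `BreakerConfig` shape and the injectable `now` clock are hypothetical, persistence and the `HALF_OPEN_MAX_PROBES` limit are left out, and having `forceReset()` clear the lifetime counter is an assumption about how manual reset should behave.

```typescript
// Sketch only: state names follow section 4.2; everything else is illustrative.
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN" | "OPEN_PERMANENT";

interface BreakerConfig {
  failureThreshold: number; // CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD
  resetTimeoutMs: number; // CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS
  lifetimeCap: number; // CLAUDE_MEM_BREAKER_LIFETIME_CAP (0 = no cap)
}

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private consecutiveFailures = 0;
  private lifetimeFailures = 0;
  private openedAt: number | null = null;
  private readonly cfg: BreakerConfig;
  private readonly now: () => number;

  // `now` is injectable so tests can advance time without sleeping (assumption).
  constructor(cfg: BreakerConfig, now: () => number = Date.now) {
    this.cfg = cfg;
    this.now = now;
  }

  getState(): BreakerState {
    return this.state;
  }

  // True when the caller may attempt HTTP; moves OPEN to HALF_OPEN once the timeout elapsed.
  canAttempt(): boolean {
    if (this.state === "OPEN_PERMANENT") return false;
    if (this.state === "OPEN") {
      const elapsed = this.now() - (this.openedAt ?? 0);
      if (elapsed < this.cfg.resetTimeoutMs) return false;
      this.state = "HALF_OPEN";
    }
    return true;
  }

  recordFailure(): void {
    if (this.state === "OPEN_PERMANENT") return;
    this.consecutiveFailures += 1;
    this.lifetimeFailures += 1;
    // A cap of 0 disables the lifetime check, matching the settings table.
    if (this.cfg.lifetimeCap > 0 && this.lifetimeFailures > this.cfg.lifetimeCap) {
      this.state = "OPEN_PERMANENT";
      this.openedAt = this.now();
      return;
    }
    // A failed probe, or N consecutive failures, (re)opens the breaker and restarts the timer.
    if (this.state === "HALF_OPEN" || this.consecutiveFailures >= this.cfg.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = this.now();
    }
  }

  recordSuccess(): void {
    if (this.state === "OPEN_PERMANENT") return;
    this.state = "CLOSED";
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  // Assumption: manual reset also clears the lifetime counter, otherwise the next failure re-trips.
  forceReset(): void {
    this.state = "CLOSED";
    this.consecutiveFailures = 0;
    this.lifetimeFailures = 0;
    this.openedAt = null;
  }
}
```

Keeping the clock injectable lets the acceptance criteria in 4.4 (threshold, 30s probe, lifetime cap) be exercised in unit tests without real delays.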
### 4.5 Observability

- Log: `SYSTEM` WARN `BreakerOpened`, fields `{lifetime, consecutiveBefore}`.
- Log: `SYSTEM` INFO `BreakerHalfOpen`.
- Log: `SYSTEM` INFO `BreakerClosed`, fields `{recoveredAfterMs}`.
- Log: `SYSTEM` ERROR `BreakerOpenedPermanent`.
- `/api/healthz` fields (Phase 7): `breaker_state`, `breaker_consecutive_failures`, `breaker_lifetime_failures`, `breaker_opened_at`, `breaker_total_trips`.

### 4.6 Anti-pattern guards

- **Do not** call the breaker from inside the worker process — it's a hook-side concern. The worker has `RestartGuard` for its own session-level limits.
- **Do not** auto-reset the lifetime counter on restart; persist it. Otherwise restart-loops mask the underlying failure.
- **Do not** block the breaker reset endpoint on initialization (`/api/admin/breaker/reset` should work even if `initializationCompleteFlag === false`).

### 4.7 Verification checklist

- [ ] No call site bypasses the breaker (grep for `workerHttpRequest` outside `executeWithWorkerFallback` and audit each — some integrations may need `breaker.canAttempt()` guards added).
- [ ] State file readable/writable across process restarts.
- [ ] Stderr fail-loud path verified end-to-end on Linux + macOS + Windows Terminal.
- [ ] No `process.exit(1)` introduced — breaker tripping returns `WorkerFallback`, not exit codes.

---

## Phase 7 — `/api/healthz` Endpoint with Concrete Metrics

**Goal**: Centralized observability so future regressions are detectable at a glance.

### 7.1 Files to modify

| File | Change |
| --- | --- |
| `src/services/worker/http/routes/HealthzRoutes.ts` (new) | Implements `RouteHandler`. GET `/api/healthz` and `/api/healthz?format=prom`. |
| `src/services/worker-service.ts:registerRoutes` | Register the new `HealthzRoutes(...)`. |
| `src/services/worker/MetricsCollector.ts` (new) | Aggregates metrics; refreshed on the supervisor's existing 30s health-check tick to avoid amplifying load. |
| `src/supervisor/health-checker.ts:runHealthCheck` | Call `MetricsCollector.refresh()` after `pruneDeadEntries`. |

### 7.2 Endpoint contract

`GET /api/healthz` → 200 JSON:

```json
{
  "status": "ok|degraded|unhealthy",
  "ts": "2026-05-07T21:30:00.000Z",
  "uptime_sec": 12345,
  "versions": { "plugin": "12.7.5", "worker": "12.7.5", "matches": true },
  "process": {
    "pid": 12345,
    "rss_mb": 145.2,
    "event_loop_lag_ms": 3.1,
    "managed": true,
    "platform": "darwin"
  },
  "pid_file": {
    "path": "/Users/.../worker.pid",
    "start_token": "Wed May 7 14:23:15 2026",
    "daemon_lock_held": true
  },
  "db": {
    "path": "/Users/.../claude-mem.db",
    "size_bytes": 31457280,
    "page_count": 7680,
    "freelist_count": 12,
    "free_ratio_pct": 0.16,
    "last_vacuum_at": "2026-05-07T20:00:00.000Z",
    "last_vacuum_freed_pages": 130000,
    "last_maintenance_at": "2026-05-07T20:00:00.000Z",
    "oldest_processing_pending_age_sec": 4,
    "processing_pending_count": 1,
    "pending_count_total": 12,
    "sdk_sessions_total": 145,
    "sdk_sessions_inactive": 13,
    "sdk_sessions_by_project": { "claude-mem": 25, "...": 120 }
  },
  "child_processes": {
    "chroma_mcp_pid_count": 1,
    "chroma_mcp_last_call_at": "2026-05-07T21:25:11.000Z",
    "chroma_mcp_state": "connected",
    "chroma_mcp_backoff_remaining_ms": 0,
    "sdk_process_count": 0,
    "supervisor_registry_size": 2
  },
  "network": {
    "hook_consecutive_failures": 0,
    "breaker_state": "CLOSED",
    "breaker_consecutive_failures": 0,
    "breaker_lifetime_failures": 3,
    "breaker_opened_at": null,
    "breaker_total_trips": 1,
    "last_request_at": "2026-05-07T21:29:55.000Z",
    "request_rate_per_min": 12.3
  },
  "ai": { "provider": "claude", "auth_method": "...", "last_interaction": { ... } }
}
```

`GET /api/healthz?format=prom` → 200 `text/plain` with Prometheus text format. One metric per JSON leaf (e.g. `claude_mem_db_free_ratio_pct 0.16`).

`status` derivation:

- `unhealthy` if breaker is OPEN_PERMANENT, OR DB initialization failed, OR chroma-mcp pid count > `CLAUDE_MEM_CHROMA_MAX_CONCURRENT`.
- `degraded` if breaker is OPEN, OR `db.free_ratio_pct` > 40 (free ratio above 0.4), OR `oldest_processing_pending_age_sec` > 3600 (1 hour), OR worker version mismatches plugin version.
- `ok` otherwise.

### 7.3 Detailed tasks

1. **MetricsCollector class**: a `Map` snapshot. Public `refresh()` collects fresh data; public `getSnapshot()` returns the cached object. Refresh is called by the 30s health-check tick AND on-demand if last refresh > 5s ago (debounced).
2. **DB metrics queries** (use `db.prepare` + `.get()`, or `.all()` for the grouped query):
   - `PRAGMA page_count` → `{ page_count: number }`
   - `PRAGMA freelist_count` → `{ freelist_count: number }`
   - `PRAGMA page_size` → for size_bytes computation
   - `SELECT MIN(updated_at) FROM pending_messages WHERE status='processing'` (with `julianday` math for age in seconds)
   - `SELECT project, COUNT(*) FROM sdk_sessions GROUP BY project`
3. **Process metrics**: `process.memoryUsage().rss / 1024 / 1024`. Event-loop lag via `perf_hooks.monitorEventLoopDelay` (Node API; availability in bun is to be verified, see Appendix C), sampled over a 30s window.
4. **Network metrics**: maintain a rolling 1-min request counter in middleware (existing `createMiddleware` in `Server.ts:156`). Increment on each `/api/*` request.
5. **Prometheus format**: emit `# HELP` and `# TYPE` lines per metric. Use the same naming convention as the 7.2 example (`claude_mem_<section>_<field>`).
6. **Compatibility**: leave `/api/health` UNCHANGED (existing integrations break otherwise). `/api/healthz` is the new richer endpoint.

### 7.4 Acceptance criteria

- `curl 127.0.0.1:<port>/api/healthz | jq .status` returns `ok` on a healthy worker.
- After Phase 6 ships, `db.free_ratio_pct` updates at 30s cadence (verify by manually inflating freelist).
- Phase 4 breaker state changes are visible within 30s.
- `?format=prom` parses with `promtool check metrics`.
- No new endpoint blocks for > 50ms (snapshot is cached; refresh is async).

### 7.5 Observability hooks (yes, for the observability endpoint itself)

- Log `WORKER` DEBUG `MetricsRefresh`, fields `{durationMs}`.
- Log `WORKER` WARN `MetricsRefreshSlow` if refresh > 250ms (DB query stall signal).

### 7.6 Verification checklist

- [ ] `/api/health` response body unchanged byte-for-byte (regression test).
- [ ] All Phase 2-6 metrics exposed (cross-check the field list in those phases).
- [ ] `?format=prom` output validates with `promtool` if available; otherwise visual inspection.
- [ ] Endpoint mounted via `RouteHandler` pattern (no direct `app.get` in worker-service.ts).

---

## Phase 8 — Observability, CLI, & Rollout

**Goal**: User-facing surface so operators can see what the new machinery did. Ordered last to allow phases 2-7 to stabilize.

### 8.1 Files to modify

| File | Change |
| --- | --- |
| `src/cli/handlers/worker-doctor.ts` (new) | New CLI subcommand `claude-mem worker doctor` — fetches `/api/healthz`, formats it for terminals, includes recent reaper actions. |
| `src/services/worker-service.ts:main()` | Register the `worker doctor` CLI route (alongside existing `cursor`, `gemini-cli` cases). |
| `plugin/scripts/worker-cli.js` | Wire to the new doctor command. |
| `CLAUDE.md` (project root) | Document new settings under a "Worker Maintenance" section. |
| `docs/public/` (optional) | User-facing explanation of the breaker, reaper, and health endpoint. |

### 8.2 `worker doctor` output (example)

```
claude-mem worker doctor

Status:   OK
Version:  plugin=12.7.5 worker=12.7.5 (match)
Uptime:   3h 25m
PID:      12345 (lock held: yes)

Database:
  Size:          32 MB (free: 0.16%)
  Last vacuum:   4h ago, freed 130k pages
  Pending:       12 total / 1 processing (oldest 4s)
  SDK sessions:  145 total / 13 inactive

Child processes:
  chroma-mcp:     1 (last call: 5s ago, state: connected)
  SDK processes:  0
  Supervisor:     2 entries

Circuit breaker:
  State:        CLOSED
  Consecutive:  0
  Lifetime:     3
  Total trips:  1

Recent maintenance (last 24h):
  2026-05-07 20:00  Vacuum: freed 130k pages in 1.4s
  2026-05-07 19:30  Reaper: 5 stuck-processing reset, 2 inactive marked
  2026-05-07 18:00  Chroma orphan scan: 0 found
```

If `status != ok`, append a "Recommended actions" block:

- breaker open → `claude-mem worker reset-breaker`
- DB free ratio high → mention next vacuum window
- chroma orphans → `claude-mem worker reap-chroma`

### 8.3 Detailed tasks

1. **Doctor command**: GET `/api/healthz` via `workerHttpRequest`. Format as the table above. Color-code (red/yellow/green) using existing chalk integration if present, otherwise plain text. JSON pass-through via `--json` flag.
2. **Recent-actions feed**: store the last 50 maintenance events in a circular buffer in `MetricsCollector` (in-memory only — survives one worker lifetime; not persistent). Expose at `/api/healthz/events` (separate to avoid bloating the main response).
3. **Update CLAUDE.md**: add a "Worker Maintenance" section with: settings reference table, the doctor command, a brief description of the reaper/breaker/vacuum behavior. Per CLAUDE.md "Important: No need to edit the changelog ever" — only edit CLAUDE.md, never CHANGELOG.
4. **Rollout ordering** (per problem statement constraint):
   - Wave 1 (idempotent, low-risk): Phase 5 (PID/port reclamation), Phase 6 (DB maintenance).
   - Wave 2 (reapers — needs careful testing on busy DBs): Phase 2 (session reaper), Phase 3 (chroma supervisor).
   - Wave 3 (user-visible behavior change): Phase 4 (circuit breaker), Phase 7 (`/api/healthz`).
   - Wave 4 (CLI surface): Phase 8 (doctor command, docs).

   Each wave can ship as a separate release. Inter-wave dependencies: Phase 7 depends on data sources from Phases 2/3/4/6, but the endpoint can ship with partial data (fields gated by phase availability).

### 8.4 Acceptance criteria

- `claude-mem worker doctor` prints a green-OK summary on a healthy worker.
- `claude-mem worker doctor --json` returns valid JSON pipeable to `jq`.
- Killing the worker → `claude-mem worker doctor` cleanly reports `Worker unreachable` instead of hanging.
- CLAUDE.md updates are limited to a new section; no churn elsewhere.

### 8.5 Verification checklist

- [ ] `claude-mem worker doctor` exits 0 on healthy state, 1 on unhealthy, 2 if worker unreachable (mirrors hook-exit-codes convention).
- [ ] No new public marketplace API surface beyond what's documented.
- [ ] Doctor command works without the worker running (unreachable path covered).

---

## Final Phase — Cross-Phase Verification

**Goal**: Prove the system works end-to-end before declaring victory.

### F.1 Soak test (24h)

Run the worker for 24 hours under realistic Claude Code usage. After 24h:

| Metric | Pass criterion |
| --- | --- |
| `ps aux \| grep '[c]hroma-mcp' \| wc -l` | ≤ 1 |
| `ps aux \| grep '[c]laude-mem' \| wc -l` | ≤ a small constant (1-2) |
| DB size growth rate | < 5 MB/hr; free_ratio < 20% |
| `/api/healthz` `network.breaker_lifetime_failures` | < 10 (vs. the #1874 starting baseline) |
| Stuck `processing` rows older than 10 min | 0 |
| Worker memory RSS | < 300 MB (no leak) |

### F.2 Failure-injection tests

| Inject | Expected behavior |
| --- | --- |
| Kill worker via `kill -9` | Lazy-respawn on next hook; PID file cleaned |
| Two parallel `claude-mem start` | Exactly one daemon survives; lock log line visible |
| 100 stuck processing rows | Reaper resets all within `REAPER_PROCESSING_STUCK_MS + REAPER_TICK_MS` |
| Spawn fake listener on worker port | New `--daemon` exits 0 with diagnostic stderr (no silent exit) |
| Fork 5 chroma-mcp orphans | Worker startup reaps all 5 |
| Pull network during 10 hooks | Breaker opens after threshold; subsequent hooks short-circuit |

### F.3 Anti-pattern grep

```
# No new always-on intervals
grep -rn "setInterval" src/ --include="*.ts" | grep -v "unref()" | grep -v "^src/.*test"

# No new process.exit(1) on hook paths
git diff main -- src/shared/worker-utils.ts src/cli/ | grep "process.exit(1)"

# No invented settings
git diff main -- src/shared/SettingsDefaultsManager.ts | grep "CLAUDE_MEM_"
# Cross-reference with all phases' settings tables.

# No hardcoded magic numbers in business logic
git diff main | grep -E "[0-9]{4,}" | grep -v SettingsDefaultsManager | grep -v test
```

### F.4 Documentation diff

- `CLAUDE.md` adds: Worker Maintenance section (Phase 8.3).
- `docs/public/` (optional): user-facing explanation.
- No CHANGELOG edits (auto-generated per CLAUDE.md).

### F.5 Sign-off checklist

- [ ] All 8 phases shipped.
- [ ] `/api/healthz` reports `status: "ok"` 24h after deployment.
- [ ] No new ERROR-level logs in production for 24h (excluding pre-existing).
- [ ] Manual `worker doctor` on 3 production-like environments confirms expected output.
- [ ] Phase 0 doc-discovery anti-patterns not violated (grep `git log -p`).
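The F.1 pass criteria that are visible through `/api/healthz` could be checked mechanically at the end of a soak run. The following is a minimal sketch, assuming the field names from the 7.2 contract; the `HealthzSnapshot` subset type and the `checkSoakCriteria` helper are hypothetical, and treating a `null` pending age as "no processing rows" is an assumption. The DB growth rate needs two snapshots over time and is not covered here.

```typescript
// Hypothetical subset of the 7.2 response; only the fields the F.1 table needs.
interface HealthzSnapshot {
  process: { rss_mb: number };
  db: { free_ratio_pct: number; oldest_processing_pending_age_sec: number | null };
  child_processes: { chroma_mcp_pid_count: number };
  network: { breaker_lifetime_failures: number };
}

// Returns one human-readable line per violated F.1 criterion (empty array = pass).
function checkSoakCriteria(h: HealthzSnapshot): string[] {
  const violations: string[] = [];

  if (h.child_processes.chroma_mcp_pid_count > 1) {
    violations.push(`chroma-mcp processes: ${h.child_processes.chroma_mcp_pid_count} (want <= 1)`);
  }
  if (h.db.free_ratio_pct >= 20) {
    violations.push(`db free ratio: ${h.db.free_ratio_pct}% (want < 20%)`);
  }
  if (h.network.breaker_lifetime_failures >= 10) {
    violations.push(`breaker lifetime failures: ${h.network.breaker_lifetime_failures} (want < 10)`);
  }
  // Assumption: null means there are no rows in 'processing' at all.
  const age = h.db.oldest_processing_pending_age_sec;
  if (age !== null && age > 600) {
    violations.push(`oldest processing row: ${age}s (want <= 600s)`);
  }
  if (h.process.rss_mb >= 300) {
    violations.push(`worker RSS: ${h.process.rss_mb} MB (want < 300 MB)`);
  }
  return violations;
}

// Usage with the sample values from the 7.2 contract.
const sample: HealthzSnapshot = {
  process: { rss_mb: 145.2 },
  db: { free_ratio_pct: 0.16, oldest_processing_pending_age_sec: 4 },
  child_processes: { chroma_mcp_pid_count: 1 },
  network: { breaker_lifetime_failures: 3 },
};
console.log(checkSoakCriteria(sample).length === 0 ? "soak criteria met" : checkSoakCriteria(sample));
```

Returning the violations as strings rather than a boolean keeps the result directly usable in a sign-off report.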
---

## Appendix A — Settings Reference (consolidated)

All settings declared in `src/shared/SettingsDefaultsManager.ts`:

| Setting | Phase | Default | Range |
| --- | --- | --- | --- |
| `CLAUDE_MEM_DAEMON_LOCK_TIMEOUT_MS` | 5 | `5000` | 0–60000 |
| `CLAUDE_MEM_PID_PORT_RECHECK_MS` | 5 | `2000` | 500–30000 |
| `CLAUDE_MEM_DB_MAINTENANCE_ENABLED` | 6 | `true` | bool |
| `CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS` | 6 | `24` | 1–168 |
| `CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO` | 6 | `0.40` | 0.05–0.95 |
| `CLAUDE_MEM_DB_VACUUM_STARTUP_DELAY_MS` | 6 | `300000` | 0–3600000 |
| `CLAUDE_MEM_CLEANUP_REGRESSION_CHECK` | 6 | `true` | bool |
| `CLAUDE_MEM_REAPER_ENABLED` | 2 | `true` | bool |
| `CLAUDE_MEM_REAPER_TICK_MS` | 2 | `30000` | 5000–600000 |
| `CLAUDE_MEM_REAPER_PROCESSING_STUCK_MS` | 2 | `300000` | 30000–86400000 |
| `CLAUDE_MEM_REAPER_INACTIVE_DAYS` | 2 | `30` | 1–365 |
| `CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS` | 2 | `0` | 0–365 |
| `CLAUDE_MEM_CHROMA_IDLE_SHUTDOWN_MS` | 3 | `900000` | 60000–86400000 |
| `CLAUDE_MEM_CHROMA_ORPHAN_SCAN_ON_START` | 3 | `true` | bool |
| `CLAUDE_MEM_CHROMA_MAX_CONCURRENT` | 3 | `1` | 1–4 |
| `CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD` | 4 | `5` | 1–50 |
| `CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS` | 4 | `30000` | 1000–600000 |
| `CLAUDE_MEM_BREAKER_HALF_OPEN_MAX_PROBES` | 4 | `1` | 1–10 |
| `CLAUDE_MEM_BREAKER_LIFETIME_CAP` | 4 | `50` | 0–10000 (0 = no cap) |

## Appendix B — File Change Summary

| File | Phases that touch it |
| --- | --- |
| `src/services/worker-service.ts` | 3 (initializeBackground), 5 (--daemon), 6 (maintenance wiring), 7 (route registration), 8 (CLI) |
| `src/services/worker-spawner.ts` | 5 |
| `src/services/infrastructure/ProcessManager.ts` | 5 (lock + start-token) |
| `src/services/infrastructure/HealthMonitor.ts` | 5 (port-on-pid match) |
| `src/services/infrastructure/CleanupV12_4_3.ts` | 6 (regression detection — read only) |
| `src/services/sync/ChromaMcpManager.ts` | 3 |
| `src/supervisor/index.ts` | 5 (validateWorkerPidFile) |
| `src/supervisor/process-registry.ts` | 3 (orphan scan), 5 (start-token) |
| `src/supervisor/health-checker.ts` | 2 (reaper), 7 (metrics refresh) |
| `src/services/worker/SessionManager.ts` | 2 (delete hook), 6 (pause/resume) |
| `src/shared/worker-utils.ts` | 4 (breaker integration) |
| `src/services/sqlite/Database.ts` | 6 (auto_vacuum) |
| `src/services/sqlite/PendingMessageStore.ts` | 2 (reapStuckProcessing) |
| `src/services/sqlite/SessionStore.ts` | 2 (findInactiveSdkSessions) |
| `src/services/sqlite/migrations/runner.ts` | 2 (inactive_at column) |
| `src/services/server/Server.ts` | 4 (breaker reset), 7 (healthz route) |
| `src/shared/SettingsDefaultsManager.ts` | 2-6 (settings keys) |
| `src/services/maintenance/DbMaintenance.ts` | 6 (NEW) |
| `src/services/maintenance/SessionReaper.ts` | 2 (NEW) |
| `src/shared/worker-circuit-breaker.ts` | 4 (NEW) |
| `src/services/worker/MetricsCollector.ts` | 7 (NEW) |
| `src/services/worker/http/routes/HealthzRoutes.ts` | 7 (NEW) |
| `src/cli/handlers/worker-doctor.ts` | 8 (NEW) |
| `CLAUDE.md` | 8 (Worker Maintenance section) |

## Appendix C — Open Questions for Executor

1. **`bun:ffi` flock support**: confirm via spike before committing Phase 5.4. If unavailable, fall back to `flock(1)` shell on Linux + atomic `mkdirSync` sentinel on macOS/Windows.
2. **Event-loop lag sampling on bun**: verify `perf_hooks.monitorEventLoopDelay` works in bun's Node-compat layer. If not, fall back to a setImmediate-based heuristic.
3. **Existing-DB auto_vacuum migration**: verify that the startup full VACUUM in Phase 6.3 is sufficient to reclaim the 504 MB without requiring users to run `PRAGMA auto_vacuum = INCREMENTAL; VACUUM;` manually. It should be: on an existing database, a changed `auto_vacuum` mode takes effect once a full VACUUM runs.
4. **Pro-features compatibility**: confirm with maintainers that `/api/healthz` does not duplicate any planned Pro endpoint. Per CLAUDE.md "Pro Features Architecture", the worker's local HTTP API stays open, so `/api/healthz` is fine to add OSS-side.
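For open question 1, the `mkdirSync` sentinel fallback relies on directory creation being atomic: exactly one caller succeeds and every other caller gets `EEXIST`. A minimal sketch, assuming Node-compatible `fs` in bun; the helper names `tryAcquireDirLock` and `releaseDirLock`, the `owner.pid` file, and the demo path are all hypothetical, and stale-lock recovery (a holder that died without releasing) is deliberately not handled.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Try to take the lock by creating the sentinel directory.
// Returns false if another process already holds it; rethrows unexpected errors.
function tryAcquireDirLock(lockDir: string): boolean {
  try {
    fs.mkdirSync(lockDir); // non-recursive: throws EEXIST if the directory exists
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") return false;
    throw err;
  }
  // Record the owner for diagnostics only; the directory itself is the lock.
  fs.writeFileSync(path.join(lockDir, "owner.pid"), String(process.pid));
  return true;
}

// Release the lock by removing the sentinel directory and its contents.
function releaseDirLock(lockDir: string): void {
  fs.rmSync(lockDir, { recursive: true, force: true });
}

// Usage: a second acquire fails until the first holder releases.
const demoLock = path.join(os.tmpdir(), `claude-mem-demo-${process.pid}.lock`);
releaseDirLock(demoLock); // start from a clean slate
const first = tryAcquireDirLock(demoLock); // true: we created the directory
const second = tryAcquireDirLock(demoLock); // false: the directory already exists
releaseDirLock(demoLock);
console.log({ first, second });
```

A real implementation would also need a policy for stale sentinels, for example comparing the recorded PID against the start-token check that Phase 5 adds.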