Alex Newman a10d1b342f docs(plans): add architectural plan files for issues #2376-#2381

Six numbered plan documents covering:
- 01 Hook IO Discipline (#2376)
- 02 Spawn-Contract Templating (#2377)
- 03 Worker / Daemon Lifecycle Hardening (#2378)
- 04 Installer Failure Transparency (#2379)
- 05 Observer SDK Tool Enforcement (#2380)
- 06 Worker Env Isolation (#2381)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-10 00:31:02 -07:00

Plan 03 — Worker / Daemon Lifecycle Hardening

Scope: Fix accumulated worker / daemon lifecycle bugs in claude-mem. Address DB bloat, chroma-mcp leaks, retry storms, port/PID races, queue zombies, missing supervision, and observability gaps.

Non-implementation: This document is a plan. Each phase is self-contained; an executing agent should be able to run a single phase without re-discovering context.

Audience: Subsequent agents executing one phase per session.

Phase 0 — Documentation Discovery & Allowed APIs

Goal: Anchor every implementation phase in real APIs that exist in the current codebase or in vetted libraries. Prevent phantom-method invention.

0.1 Read these files end-to-end before touching code

File	Why
`CLAUDE.md` (project root)	Architecture, exit-code strategy, Pro/OSS boundary, settings conventions
`src/services/worker-service.ts`	`WorkerService` class, `--daemon` `main()`, signal registration, all CLI subcommands
`src/services/worker-spawner.ts`	`ensureWorkerStarted` 3-state machine (`ready`/`warming`/`dead`)
`src/services/infrastructure/ProcessManager.ts`	`spawnDaemon`, PID file ops, `captureProcessStartToken`, `isProcessAlive`
`src/services/infrastructure/HealthMonitor.ts`	`isPortInUse`, `waitForHealth`, `waitForReadiness`, `httpShutdown`
`src/services/infrastructure/GracefulShutdown.ts`	`performGracefulShutdown` ordering
`src/services/infrastructure/CleanupV12_4_3.ts`	`runOneTimeV12_4_3Cleanup`, `STUCK_PENDING_THRESHOLD = 10`, observer-purge SQL
`src/services/sync/ChromaMcpManager.ts`	`ensureConnected`, `connectInternal`, `stop`, `killProcessTree`, `collectDescendantPids`, `RECONNECT_BACKOFF_MS = 10_000`, `MCP_CONNECTION_TIMEOUT_MS = 30_000`
`src/supervisor/index.ts`	`Supervisor` class, `validateWorkerPidFile`, signal-handler config
`src/supervisor/process-registry.ts`	`ProcessRegistry`, `getSdkProcessForSession`, `ensureSdkProcessExit`, `waitForSlot`, `TOTAL_PROCESS_HARD_CAP = 10`
`src/supervisor/health-checker.ts`	30s `pruneDeadEntries` loop (already present — extend, don't replace)
`src/supervisor/shutdown.ts`	`runShutdownCascade`, `signalProcess`, `loadTreeKill`
`src/services/worker/SessionManager.ts`	In-memory session map, `deleteSession`, queue/pending integration
`src/services/worker/RestartGuard.ts`	Per-session restart cap (10/60s window, 5 consecutive)
`src/services/worker/retry.ts`	Provider-level retry (`withRetry`, classified errors) — DO NOT mutate; circuit breaker layers ABOVE this
`src/shared/worker-utils.ts`	`recordWorkerUnreachable` (line 401), `executeWithWorkerFallback` (line 443), fail-loud counter file at `~/.claude-mem/state/hook-failures.json`
`src/services/sqlite/Database.ts`	PRAGMA setup (lines 27-32, 69-74) — single source of truth for DB pragmas
`src/services/server/Server.ts`	`/api/health` (line 161), `/api/readiness` (line 178), `/api/version` (line 192)
`src/shared/SettingsDefaultsManager.ts`	Where every new setting key MUST be declared with a default
`src/shared/hook-constants.ts`	`HOOK_TIMEOUTS`, `HOOK_EXIT_CODES` — extend here, don't inline
`plugin/bun-runner.js`, `plugin/scripts/worker-service.cjs`	Built worker entrypoint — note the build pipeline (`scripts/build-hooks.js`)

0.2 Allowed APIs (use these, do NOT invent siblings)

SQLite (bun:sqlite) — pragma calls are db.run('PRAGMA …') or db.prepare('PRAGMA …').get(). Existing pragmas: journal_mode=WAL, synchronous=NORMAL, foreign_keys=ON, temp_store=memory, mmap_size, cache_size. VACUUM runs only outside a transaction. VACUUM INTO 'path' is the backup form already used in CleanupV12_4_3.ts:135. wal_checkpoint(TRUNCATE) is the truncating-checkpoint form.

Process supervision — getSupervisor(), getProcessRegistry(), registerProcess(id, info, processRef?), unregisterProcess(id), pruneDeadEntries(), assertCanSpawn(type), runShutdownCascade(...). Tree-kill on POSIX uses pgrep -P recursion + process.kill(-pgid, signal); on Windows uses taskkill /T /F /PID or tree-kill npm.

HTTP/Express — Server.app.get('/api/...', handler) via registerRoutes (handlers implement setupRoutes(app) on a RouteHandler interface). Every new endpoint must follow the existing RouteHandler pattern under src/services/worker/http/routes/.

Settings — SettingsDefaultsManager.get('CLAUDE_MEM_…'), SettingsDefaultsManager.loadFromFile(path). New keys require: (a) type added to the interface in SettingsDefaultsManager.ts, (b) default value declared in the same file, (c) documented in CLAUDE.md if user-tunable.

Logging — logger.info(category, msg, fields), logger.warn, logger.error(category, msg, fields, error). Categories used here: SYSTEM, WORKER, SESSION, CHROMA_MCP, SDK, DB, QUEUE, PROCESS. Add new category MAINTENANCE for VACUUM / reaper events.

0.3 Anti-patterns — explicitly forbidden

Do not add a new singleton supervisor — extend getSupervisor().
Do not spawn child processes without going through getSupervisor().assertCanSpawn(...) and registerProcess(...).
Do not call process.exit(1) on hook-side error paths — it accumulates Windows Terminal tabs (CLAUDE.md exit-code strategy). Use 0 for graceful, 2 only for blocking-error paths that need to surface stderr to Claude.
Do not delete sdk_sessions rows if observations or session_summaries still reference their memory_session_id without an explicit user-opt-in flag.
Do not hold a SQLite write lock during VACUUM while ingestion is hot. Pause queue processing first.
Do not introduce setInterval timers that keep the event loop alive — every new timer must call .unref().
Do not invent settings keys — declare them in SettingsDefaultsManager.ts first.

0.4 Confidence note

Confidence: HIGH on file/API inventory (read-pass complete on all referenced files). MEDIUM on Windows behavior of new advisory locks (Windows mandatory locking via LockFile is bun-runtime-dependent; verify via spike before committing).

Phase 1 — Inventory & Instrumentation (read-only, safe)

Goal: Produce a written state-machine diagram and an exit-site catalog that subsequent phases reference. No code changes; create a scratch document at docs/internal/worker-lifecycle-state-machine.md if the executor wants an artifact, otherwise capture findings in commit messages.

1.1 Tasks

Trace the worker daemon spawn → terminate path end-to-end. Source order:
- Hook entry → src/shared/worker-utils.ts:ensureWorkerRunning (lazy spawn) OR src/services/worker-spawner.ts:ensureWorkerStarted (explicit)
- spawnDaemon (src/services/infrastructure/ProcessManager.ts:408) — POSIX uses setsid if available, Windows uses Start-Process -WindowStyle Hidden
- --daemon branch in src/services/worker-service.ts:937 — duplicate-PID/duplicate-port guard
- WorkerService.start() (line 258) → startSupervisor() → server.listen() → writePidFile() → getSupervisor().registerProcess('worker', ...) → initializeBackground()
- Signal handlers via configureSupervisorSignalHandlers (src/supervisor/index.ts:49) — SIGTERM/SIGINT; SIGHUP ignored in --daemon mode on POSIX
- Shutdown: WorkerService.shutdown() → performGracefulShutdown → server close → sessionManager.shutdownAll() → mcp client close → chroma stop → db close → getSupervisor().stop() → runShutdownCascade → PID file unlink
Catalog every process.exit(...) site in worker-service.ts (already mapped — 21 sites; lines 764, 772, 794, 804, 810, 813, 828, 835, 842, 853, 870, 878, 888, 895, 916, 933, 945, 950, 971, 975, 991). Annotate each with: code, intent, whether it leaks the worker on the same path, whether shutdown ran first.
Catalog every retry / unreachable site:
- src/shared/worker-utils.ts:401 recordWorkerUnreachable (the #1874 counter)
- src/cli/handlers/{context,file-context,file-edit,summarize,observation,user-message,session-init}.ts — every executeWithWorkerFallback caller
- src/servers/mcp-server.ts:72,100,145 — direct workerHttpRequest
- src/services/transcripts/processor.ts:331,371,373 — direct workerHttpRequest
- src/services/integrations/CursorHooksInstaller.ts:64,349,352 — direct workerHttpRequest
- src/utils/claude-md-utils.ts:305 — direct workerHttpRequest
Catalog every spawn site:
- spawnDaemon (worker self-spawn)
- ChromaMcpManager.connectInternal (chroma-mcp via uvx → uv → python → chroma-mcp)
- spawnSdkProcess (src/supervisor/process-registry.ts:532) — Claude SDK subprocesses
- runMcpSelfCheck (src/services/worker-service.ts:405) — MCP loopback probe via process.execPath
- Any execSync / execFile / spawnSync in ChromaMcpManager (cert resolution) or ProcessManager (binary lookup, cwd-remap)

1.2 Acceptance criteria

Markdown table written (commit message or scratch doc) listing every spawn and exit site with file:line.
A 1-paragraph English description of the worker state machine (states + transitions) suitable to paste into PR descriptions.
Confirmed list of which executeWithWorkerFallback callers run inside hooks (Claude Code's strict timeout window) vs. inside the worker (no timeout pressure) — this drives Phase 4 circuit-breaker scoping.

1.3 Verification

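grep -n "process.exit" src/services/worker-service.ts | wc -l matches the worker-service.ts catalog (21 sites). If grep -rn "process.exit" src/ --include="*.ts" turns up sites in other files, add them to the catalog.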
grep -rn "executeWithWorkerFallback\|workerHttpRequest" src/ --include="*.ts" | grep -v worker-utils.ts | wc -l matches the catalog.

1.4 Deliverable

Hand-off note for Phase 2-8 executors with file/line anchors; no code committed.

Phase 5 — PID/Port Reclamation & Race-Free Startup

Shipping order: Phase 5 first (per Phase 8 ordering). Idempotent and safe.

Goal: Eliminate the silent-exit-0 case where a fresh --daemon spawn loses the port race; harden cross-platform PID-reuse detection; serialize concurrent spawns with an OS-level advisory lock.

5.1 Files to modify

File	Change
`src/supervisor/process-registry.ts`	Extend `captureProcessStartToken` for macOS (already partial via `ps -o lstart`) and Windows (`wmic process where ProcessId=X get CreationDate /value`). Add unit test for each platform branch.
`src/supervisor/index.ts:validateWorkerPidFile`	Add port-on-pid match check — if `pidInfo.port !== currentExpectedPort`, treat as `'stale'`.
`src/services/infrastructure/ProcessManager.ts`	Add new exports: `acquireDaemonLock()` / `releaseDaemonLock()` using POSIX `flock` (via `fcntl`/`flock` syscall through `bun:ffi` or shelling to `flock(1)` on Linux only) and Windows mandatory file lock via `LockFile` (or fall back to atomic-rename sentinel on Windows).
`src/services/worker-service.ts:937` (`--daemon` branch)	Wrap startup in `acquireDaemonLock()`. If port is in use, perform a `/api/version` probe; if the listener returns OUR `BUILT_IN_VERSION` → exit 0 (legit duplicate); if it returns a different version → log a warning and exit 0 (stale worker, will be restarted by version-mismatch path); if the listener doesn't respond → wait `HOOK_TIMEOUTS.PORT_IN_USE_WAIT` then write a clear stderr line with diagnostic before exiting.
`src/services/worker-spawner.ts`	Same lock acquisition before `spawnDaemon`. Release on success or error.

5.2 Detailed tasks

macOS start-time token: extend captureProcessStartToken (registry line 56). On Darwin, prefer ps -p <pid> -o lstart= (already in fallback path). Verify with LC_ALL=C LANG=C env so locale doesn't change the timestamp format. Add a comment explaining that ps lstart resolution is 1-second — collisions still possible but vastly less likely than no-token.
Windows start-time token: add a Win32 branch using wmic process where ProcessId=<pid> get CreationDate /value. Parse the CreationDate=YYYYMMDDHHMMSS.ffffff+TZ line. Cache the wmic resolution per-pid for 5s (avoid re-shelling on repeat checks).
Port-on-pid match: in validateWorkerPidFile, after confirming isPidAlive(pidInfo.pid), verify the recorded pidInfo.port is reachable via isPortInUse(pidInfo.port) AND the listener's /api/version returns a version string. If port is dead but PID alive → return 'stale' (the worker stopped listening, or the PID now belongs to an unrelated process).
Advisory lock:
- POSIX: open <DATA_DIR>/.worker-spawn.lock with O_RDWR | O_CREAT, flock(fd, LOCK_EX | LOCK_NB). On EAGAIN, log Another spawn in progress, waiting up to 5s and retry with LOCK_EX (blocking) under a setTimeout race. Implement via bun:ffi for POSIX flock(2) if available, otherwise shell flock -n -x <path> <command>. Spike first: confirm bun's bun:ffi exposes flock. If not, use a watch-and-rename sentinel (less ideal but works).
- Windows: Use LockFile via Win32 API or fall back to atomic mkdirSync of <DATA_DIR>/.worker-spawn.lock.dir (fails if exists) with stale-timeout cleanup at 30s.
Diagnostic stderr: when port-in-use without our worker responding, write to stderr (and log INFO) with: claude-mem worker port <N> in use by an unidentified process; not spawning duplicate. This must NOT block the hook — exit 0 still per CLAUDE.md.
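The Windows start-time token step above can be sketched as a small parser. This is a minimal sketch, assuming the wmic output contains a single CreationDate=YYYYMMDDHHMMSS.ffffff+TZ line as described; parseWmicCreationDate is a hypothetical name, not an existing export, and the 5s per-pid cache is left out.

```typescript
// Hypothetical helper (not in the codebase): extracts the start-time token from
// the output of `wmic process where ProcessId=<pid> get CreationDate /value`.
function parseWmicCreationDate(stdout: string): string | null {
  // wmic pads its output with blank lines and CRLF endings, so scan line by line.
  for (const rawLine of stdout.split(/\r?\n/)) {
    const match = /^CreationDate=(\d{14}\.\d{6}[+-]\d{3})$/.exec(rawLine.trim());
    if (match) {
      // Treat the value as an opaque token and compare by string equality.
      return match[1] ?? null;
    }
  }
  // Process already gone, or output in an unexpected shape.
  return null;
}
```

Returning the raw string rather than a parsed date keeps the token comparison exact and avoids timezone handling.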

5.3 New settings

Key	Default	Range	Purpose
`CLAUDE_MEM_DAEMON_LOCK_TIMEOUT_MS`	`5000`	0–60000	Max wait for the spawn lock
`CLAUDE_MEM_PID_PORT_RECHECK_MS`	`2000`	500–30000	Wait window before treating port-in-use without `/api/version` response as "unknown listener"

5.4 Acceptance criteria

Run two claude-mem start commands in parallel → exactly one daemon ends up alive; the other exits cleanly with a log line referencing the lock.
Kill the worker -9 (skip cleanup), reuse the PID with python -c 'import time; time.sleep(60)' → validateWorkerPidFile returns 'stale' and removes the file.
On macOS, run worker, capture token, kill, spawn unrelated process with same PID, spawn worker again → token mismatch detected; old PID file ignored.
/api/version probe path: spawn a fake server on the worker port → daemon exits 0 with the new diagnostic stderr, NOT silently.
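The three port-in-use outcomes from 5.1 and 5.2 can be expressed as one pure decision function, which makes the "never silent" criterion easy to unit-test. This is a sketch under assumptions: VersionProbe, OccupiedPortDecision, and decideOnOccupiedPort are hypothetical names, and the caller is assumed to have already waited HOOK_TIMEOUTS.PORT_IN_USE_WAIT before reporting a non-responding listener.

```typescript
// Hypothetical types and function; none of these names exist in the codebase.
type VersionProbe =
  | { kind: "responded"; version: string }
  | { kind: "no-response" };

interface OccupiedPortDecision {
  exitCode: 0; // every branch exits 0 so the hook is never blocked
  channel: "log-info" | "log-warn" | "stderr";
  message: string;
}

function decideOnOccupiedPort(
  port: number,
  probe: VersionProbe,
  builtInVersion: string,
): OccupiedPortDecision {
  if (probe.kind === "responded" && probe.version === builtInVersion) {
    // Legitimate duplicate: our own worker already owns the port.
    return { exitCode: 0, channel: "log-info", message: "worker already running with matching version" };
  }
  if (probe.kind === "responded") {
    // Stale worker: the version-mismatch path is expected to restart it.
    return {
      exitCode: 0,
      channel: "log-warn",
      message: `stale worker (version ${probe.version}) holds port ${port}`,
    };
  }
  // Unknown listener: surface a diagnostic instead of exiting silently.
  return {
    exitCode: 0,
    channel: "stderr",
    message: `claude-mem worker port ${port} in use by an unidentified process; not spawning duplicate`,
  };
}
```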

5.5 Observability hooks

Log SYSTEM INFO Daemon spawn lock acquired on success.
Log SYSTEM WARN Daemon spawn lock contention, fields {waitedMs}.
Log SYSTEM WARN Worker port occupied by foreign listener, fields {port, probeStatus}.
New /api/healthz fields (added in Phase 7): pid_file_path, pid_start_token, daemon_lock_held: bool.

5.6 Verification checklist

grep "process.exit(0)" src/services/worker-service.ts — count unchanged (no new silent exits introduced).
Manual two-process race test (Linux + macOS + Windows VM).
Existing health-check tests still pass.
No new always-on setInterval introduced.

Phase 6 — DB Maintenance (VACUUM / WAL)

Ships alongside Phase 5 (idempotent).

Goal: Recover the 504 MB of free pages, prevent recurrence, surface DB-size metrics.

6.1 Files to modify

File	Change
`src/services/sqlite/Database.ts:27-32` and `:69-74`	Add `PRAGMA auto_vacuum = INCREMENTAL` BEFORE the first table is created (only takes effect on a fresh DB; harmless on existing DBs but logs a no-op). For existing DBs, the migration path is the one-shot Phase-6 startup VACUUM.
`src/services/maintenance/DbMaintenance.ts` (new)	Periodic maintenance task: on a 24h timer (configurable), call `PRAGMA incremental_vacuum`, `PRAGMA wal_checkpoint(TRUNCATE)`, then collect metrics (`page_count`, `freelist_count`, file size). Emit `MAINTENANCE` INFO log. Acquire `dbMaintenanceMutex` so other writers wait.
`src/services/maintenance/DbMaintenance.ts`	Startup check: if `freelist_count / page_count >= CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO` (default 0.40), perform full `VACUUM` after `VACUUM INTO` backup to `<DATA_DIR>/backups/claude-mem-pre-vacuum-<ts>.db`. Pause the queue processor before the full `VACUUM`.
`src/services/worker-service.ts:initializeBackground`	Wire the maintenance task — start after `dbManager.initialize()`. Timer must `.unref()`.
`src/services/worker/SessionManager.ts`	Expose `pauseQueueProcessing(): Promise<void>` and `resumeQueueProcessing(): void`. Use the existing AbortController + emitter to drain in-flight work; don't introduce new state. Maintenance acquires; readers continue (WAL allows them).
`src/services/infrastructure/CleanupV12_4_3.ts:135`	Reuse the existing `VACUUM INTO` backup pattern verbatim — copy the disk-space pre-flight check (`statfsSync`, line 115).

6.2 Detailed tasks

Auto-vacuum on new DBs: Add PRAGMA auto_vacuum = INCREMENTAL in Database.ts BEFORE migrationRunner.runAllMigrations(). Verify with a comment that this is no-op on existing DBs (sqlite docs say a full VACUUM is required to flip auto_vacuum mode after tables exist). Document the migration path: existing users get the freed-page reclamation via the startup full VACUUM in step 3.
Periodic incremental vacuum + WAL checkpoint:
- Schedule via setInterval with .unref(). Default cadence: 24h. Setting: CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS (default 24, min 1, max 168).
- Each tick: acquire mutex → db.run('PRAGMA incremental_vacuum') → db.run('PRAGMA wal_checkpoint(TRUNCATE)') → snapshot metrics → release.
- Skip the tick if a VACUUM is in progress.
Startup full VACUUM (one-shot per session) when free-ratio is high:
- Read page_count (PRAGMA page_count) and freelist_count (PRAGMA freelist_count).
- If freelist_count / page_count >= CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO (default 0.40), schedule a deferred VACUUM (5 minutes after worker becomes ready) to avoid slowing startup.
- VACUUM steps: VACUUM INTO '<backup>' → verify backup → pause queue → VACUUM (full) → resume queue → log freed pages and ms taken. The pause covers only the full VACUUM, per the 6.6 guard.
- Disk-space pre-flight: statfsSync (mirror CleanupV12_4_3.ts:115). Skip if free space < 1.2 * dbSize + 100MB. Log MAINTENANCE ERROR in that case so the user sees actionable info.
Pause/resume hook in SessionManager: The existing for await ... of getMessageIterator() loop in queue processor needs a "pause" semaphore. Implementation: add a Promise<void> gate that the iterator awaits before yielding. Maintenance flips it to a pending promise during VACUUM; resolve to release. Do not abort in-flight messages — they can complete; new messages wait.
Cleanup-V12.4.3 regression detection: Re-scan sdk_sessions WHERE project = OBSERVER_SESSIONS_PROJECT and pending_messages matching the stuck-pending pattern at maintenance ticks. If any match AND the marker exists, log MAINTENANCE WARN and re-run the purge (idempotent). Setting: CLAUDE_MEM_CLEANUP_REGRESSION_CHECK = true.
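The pause/resume gate described in the SessionManager task above might look like the following. This is a minimal sketch, not the existing queue processor: PauseGate and gated are hypothetical names, and the sketch only holds back new messages. It does not drain in-flight work, which pauseQueueProcessing() is also expected to do.

```typescript
// Hypothetical gate: while paused, wait() returns a pending promise;
// resume() resolves it so waiting iterators continue.
class PauseGate {
  private gate: Promise<void> = Promise.resolve();
  private release: (() => void) | null = null;

  pause(): void {
    if (this.release) return; // already paused
    this.gate = new Promise<void>((resolve) => {
      this.release = resolve;
    });
  }

  resume(): void {
    const release = this.release;
    this.release = null;
    if (release) release();
  }

  wait(): Promise<void> {
    return this.gate;
  }
}

// Wraps any async message source so each item is held until the gate is open.
async function* gated<T>(source: AsyncIterable<T>, gate: PauseGate): AsyncGenerator<T> {
  for await (const item of source) {
    await gate.wait(); // blocks here while maintenance holds the gate
    yield item;
  }
}
```

Messages already handed to a consumer are unaffected; only the next yield waits, which matches the "in-flight can complete, new messages wait" requirement.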

6.3 New settings

Key	Default	Range	Purpose
`CLAUDE_MEM_DB_MAINTENANCE_ENABLED`	`true`	bool	Master kill-switch
`CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS`	`24`	1–168	Periodic cadence
`CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO`	`0.40`	0.05–0.95	Free-ratio above which we auto-VACUUM at startup
`CLAUDE_MEM_DB_VACUUM_STARTUP_DELAY_MS`	`300000` (5 min)	0–3600000	Defer startup VACUUM so it doesn't block readiness
`CLAUDE_MEM_CLEANUP_REGRESSION_CHECK`	`true`	bool	Re-scan v12.4.3-shaped pollution
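The threshold setting above feeds the startup check from 6.2. As a sketch, the free-ratio test and the disk-space pre-flight can be combined in one pure function. decideStartupVacuum and its input shape are hypothetical, and 100MB is assumed here to mean 100 × 1024 × 1024 bytes.

```typescript
// Hypothetical decision helper; names and the MB definition are assumptions.
const MB = 1024 * 1024;

interface VacuumInputs {
  pageCount: number;      // PRAGMA page_count
  freelistCount: number;  // PRAGMA freelist_count
  dbSizeBytes: number;    // size of the main DB file
  freeDiskBytes: number;  // free space on the volume holding the backup
  thresholdRatio: number; // CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO
}

type VacuumDecision = "vacuum" | "skip-below-threshold" | "skip-low-disk";

function decideStartupVacuum(input: VacuumInputs): VacuumDecision {
  if (input.pageCount <= 0) return "skip-below-threshold"; // empty or unreadable DB
  const freeRatio = input.freelistCount / input.pageCount;
  if (freeRatio < input.thresholdRatio) return "skip-below-threshold";
  // Pre-flight from 6.2: need 1.2 * dbSize + 100MB of headroom.
  const requiredBytes = 1.2 * input.dbSizeBytes + 100 * MB;
  if (input.freeDiskBytes < requiredBytes) return "skip-low-disk";
  return "vacuum";
}
```

For the bloat case in this plan (a 532 MB file with about 504 MB of free pages), the ratio is roughly 0.95, well above the 0.40 default, and the pre-flight needs about 738 MB of free disk.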

6.4 Acceptance criteria

Reproduce the bloat scenario: stuff pending_messages with 100k stuck processing rows, run worker → startup VACUUM fires within 5 min after readiness, freed-pages log line appears, file size drops.
Existing 532 MB DBs reclaim ≥ 95% of free pages on first run (matches the 28 MB target observed manually).
Hot-ingestion test: enqueue 1000 observations during a maintenance tick → no SQLITE_BUSY or database is locked errors; queue resumes after VACUUM.
PRAGMA auto_vacuum returns 2 (incremental) on freshly-created DBs.
Maintenance timer honors .unref(): after a clean shutdown the process exits as soon as its other handles close, without waiting out the 24h interval.

6.5 Observability hooks

New log category: MAINTENANCE.
Events: MaintenanceStart, MaintenanceTick, VacuumStart, VacuumComplete ({freedPages, ms, dbSizeBeforeMb, dbSizeAfterMb}), VacuumSkippedLowDisk, RegressionDetected, MaintenanceComplete.
/api/healthz fields (Phase 7): db_page_count, db_freelist_count, db_free_ratio_pct, db_size_bytes, db_last_vacuum_at, db_last_vacuum_freed_pages, db_last_maintenance_at.

6.6 Anti-pattern guards

Do not call VACUUM inside a transaction (sqlite errors).
Do not hold the queue pause across the VACUUM INTO backup phase — only the final full VACUUM needs the writer-lock window. (VACUUM INTO works on a read-only snapshot.)
Do not call PRAGMA wal_checkpoint(FULL) — TRUNCATE is required to actually shrink the WAL file.

6.7 Verification checklist

Backup created at <DATA_DIR>/backups/ before every full VACUUM.
Maintenance timer registered with .unref() (grep for setInterval in the new file → unref() follows each).
No new direct setInterval outside the maintenance file.
PRAGMA list in Database.ts extended with auto_vacuum and includes a comment about migration.

Phase 2 — Stuck-Session Reaper (fix v12.4.3 bloat)

Goal: Stop pending_messages and sdk_sessions from accumulating zombies.

2.1 Files to modify

File	Change
`src/services/maintenance/SessionReaper.ts` (new)	Periodic reaper. Plugs into the supervisor's existing `health-checker.ts` 30s tick (extend, do not replace).
`src/supervisor/health-checker.ts:9 runHealthCheck`	Call `SessionReaper.tick()` after `pruneDeadEntries()`.
`src/services/worker/SessionManager.ts:deleteSession`	After in-memory delete, call `pendingStore.clearPendingForSession(sessionDbId)` synchronously (it already does this via `clearPendingForSession` on a separate path — verify and unify).
`src/services/sqlite/PendingMessageStore.ts`	Add `reapStuckProcessing(olderThanMs: number): number` returning the count of rows reset to `pending`.
`src/services/sqlite/SessionStore.ts`	Add `findInactiveSdkSessions(olderThanDays: number): Array<{id, project, contentSessionId, memorySessionId, lastActivityAt}>`.
`src/services/sqlite/SessionStore.ts`	Add `markSdkSessionInactive(id: number)` — adds an `inactive_at` column or sets a sentinel.
`src/services/sqlite/migrations/runner.ts`	New migration: add `inactive_at TEXT NULL` to `sdk_sessions` if absent.

2.2 Reaper logic

Per tick (default 30s, gated by CLAUDE_MEM_REAPER_ENABLED):

Stuck-processing sweep: UPDATE pending_messages SET status='pending' WHERE status='processing' AND updated_at < <now - PROCESSING_STUCK_MS> (default 5 minutes). Log count if > 0.
Orphan-pending sweep: DELETE FROM pending_messages WHERE session_db_id NOT IN (SELECT id FROM sdk_sessions) (defensive — should already be FK-protected but log if any deleted).
Inactive-session detection (does NOT delete):
- SELECT sdk_sessions where id NOT IN <in-memory session ids> AND last activity is older than N days (last activity = MAX of related observations / pending_messages / session_summaries timestamps).
- For each: UPDATE sdk_sessions SET inactive_at = <now> WHERE id = ? AND inactive_at IS NULL.
Observer-pollution regression check (matches Phase 6 task 5):
- If OBSERVER_SESSIONS_PROJECT rows reappear after the v12.4.3 marker is present, re-run the purge SQL from CleanupV12_4_3.runObserverSessionsPurge (lines 196-218).
- Log MAINTENANCE WARN with counts.
Hard delete is opt-in via CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS (default 0 = disabled; nonzero = days threshold). When enabled and a session has inactive_at older than the threshold AND no FK-referencing rows, hard-delete the session row. Default-off because user data safety > disk space.
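The hard-delete preconditions above are easy to get subtly wrong, so a pure eligibility check is worth isolating. This is a sketch under assumptions: isHardDeleteEligible and the InactiveSessionView shape are hypothetical, and inactive_at is assumed to be already parsed into epoch milliseconds.

```typescript
// Hypothetical eligibility check; field names are illustrative only.
const DAY_MS = 24 * 60 * 60 * 1000;

interface InactiveSessionView {
  id: number;
  inMemory: boolean;           // session is present in the in-memory map
  inactiveAtMs: number | null; // parsed inactive_at, null if never marked
  referencingRowCount: number; // observations + session_summaries + pending_messages
}

function isHardDeleteEligible(
  session: InactiveSessionView,
  hardDeleteInactiveDays: number, // CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS
  nowMs: number,
): boolean {
  if (hardDeleteInactiveDays <= 0) return false; // 0 = disabled (default)
  if (session.inMemory) return false;            // in-memory presence wins
  if (session.inactiveAtMs === null) return false;
  if (session.referencingRowCount > 0) return false; // never orphan user data
  return nowMs - session.inactiveAtMs > hardDeleteInactiveDays * DAY_MS;
}
```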

2.3 New settings

Key	Default	Range	Purpose
`CLAUDE_MEM_REAPER_ENABLED`	`true`	bool	Master switch
`CLAUDE_MEM_REAPER_TICK_MS`	`30000`	5000–600000	Tick cadence (piggy-backs supervisor; this value gates whether the reaper runs each tick)
`CLAUDE_MEM_REAPER_PROCESSING_STUCK_MS`	`300000` (5 min)	30000–86400000	Threshold for a `processing` row to be considered stuck
`CLAUDE_MEM_REAPER_INACTIVE_DAYS`	`30`	1–365	When to mark a session `inactive_at`
`CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS`	`0`	0–365	0 = never; otherwise, hard-delete inactive rows older than N days

2.4 Acceptance criteria

Inject 50 stuck processing rows older than 5 minutes → next reaper tick resets them → /api/healthz shows oldest_processing_pending_age_sec drop to 0.
Inject OBSERVER_SESSIONS_PROJECT rows post-marker → next tick logs regression and purges them.
Reaper survives a worker restart without losing state (everything is DB-backed).
Active sessions (in-memory) are NEVER marked inactive even if their last DB write is old (in-memory presence wins).

2.5 Observability

Log: MAINTENANCE INFO ReaperTick, fields {stuckProcessing, orphanPending, markedInactive, hardDeleted, observerRegression}.
New /api/healthz fields (Phase 7): oldest_processing_pending_age_sec, processing_pending_count, pending_count_total, sdk_sessions_total, sdk_sessions_inactive, sdk_sessions_by_project: { [project]: count }.

2.6 Verification checklist

Migration adds inactive_at column without breaking existing data (test on a copy of a real DB).
In-memory active sessions never appear in findInactiveSdkSessions.
Reaper does NOT cascade-delete observations / session_summaries unless explicit hard-delete + zero-FK-reference precondition.
/api/healthz shows reaper metrics.

Phase 3 — chroma-mcp Child-Process Supervisor

Goal: Stop the 23-concurrent-chroma-mcp leak. Bound concurrency, reap idle, scan for orphans at startup.

3.1 Files to modify

File	Change
`src/services/sync/ChromaMcpManager.ts`	Add idle reaper; enforce single-instance via supervisor registry; add startup orphan scan; add `lastCallAt` timestamp updated by `callTool`.
`src/services/sync/ChromaMcpManager.ts:ensureConnected` (line 43)	Before connect, check `getProcessRegistry().getAll().filter(r => r.type === 'chroma')` — if non-empty AND PID alive AND PID not the current `_process.pid`, refuse to spawn (alert + reuse existing if possible; otherwise wait for backoff).
`src/services/sync/ChromaMcpManager.ts:registerManagedProcess` (line 613)	Already calls `getSupervisor().registerProcess(CHROMA_SUPERVISOR_ID, ...)` — verify the supervisor enforces single-instance for this id. (Currently `register` is keyed by id so same id replaces; document this.)
`src/supervisor/process-registry.ts`	Add `getActiveCountByType(type: string): number`. Add `findChromaOrphans(): Promise<number[]>` — POSIX `pgrep -af 'chroma-mcp'` filtered by PPID == 1.
`src/services/worker-service.ts:initializeBackground`	After `ChromaMcpManager.getInstance()`, kick off `await ChromaMcpManager.scanAndReapOrphans()` (best-effort; never throws).

3.2 Detailed tasks

Startup orphan scan: New static method ChromaMcpManager.scanAndReapOrphans():
- POSIX: pgrep -af 'chroma-mcp' → for each PID, check PPID. If PPID == 1 (re-parented to init), call killProcessTree(pid) (existing function at line 388). Log CHROMA_MCP INFO ReapedOrphan, fields {pid, ageSec}.
- Windows: Get-CimInstance Win32_Process -Filter "Name='chroma-mcp.exe'" filter by parent process state, kill with taskkill.
- Bound the scan to processes whose command-line includes chroma-mcp==<CHROMA_MCP_PINNED_VERSION> to avoid killing unrelated chroma installations.
Idle reaper: Add lastCallAt: number = 0 field to ChromaMcpManager. Update on every callTool. Run a setInterval(checkIdle, 60_000) (.unref()) — if connected && Date.now() - lastCallAt > CHROMA_MCP_IDLE_SHUTDOWN_MS (default 15 min), call await this.stop(). Lazy-reconnect resumes on next callTool.
Single-instance guard on reconnect: In ensureConnected, before connectInternal, call getProcessRegistry().getActiveCountByType('chroma'). If > 0 AND the registered PID is alive but this.connected === false, this is a stale process (we lost track). Tear it down via killProcessTree(registeredPid) first, then proceed with fresh spawn. Otherwise the count grows by one each reconnect — exactly the leak observed.
Hard cap: extend getSupervisor().assertCanSpawn('chroma mcp') (already called at line 87) to actually count and reject. Cap = 1 chroma-mcp per worker. Cap = TOTAL_PROCESS_HARD_CAP (10) overall — already enforced for SDK processes; extend to chroma-mcp.
Tighten close path: in connectInternal (line 74), after transport.close() / client.close(), if the underlying _process.pid is still in the registry, call killProcessTree and unregisterProcess explicitly. Don't rely on transport.onclose alone — it has the stale-callback guard but doesn't always fire on connect-time failures.
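The orphan-scan filter from the first task above can be kept separate from the process listing so it is testable without spawning anything. This is a minimal sketch: ProcessRow and selectChromaOrphans are hypothetical names, and the caller is assumed to have already turned pgrep / ps output into rows.

```typescript
// Hypothetical row shape produced by whatever parses the process listing.
interface ProcessRow {
  pid: number;
  ppid: number;
  command: string; // full command line
}

// Returns PIDs that are safe to reap: re-parented to init (PPID 1) and
// launched with the pinned chroma-mcp version in their command line.
function selectChromaOrphans(
  rows: ProcessRow[],
  pinnedVersion: string, // CHROMA_MCP_PINNED_VERSION
  ownChromaPid: number | null,
): number[] {
  const marker = `chroma-mcp==${pinnedVersion}`;
  return rows
    .filter((row) => row.ppid === 1)
    .filter((row) => row.pid !== ownChromaPid)
    .filter((row) => row.command.includes(marker))
    .map((row) => row.pid);
}
```

Matching on the pinned-version marker rather than the bare name is what keeps unrelated user installs of chroma out of the kill list.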

3.3 New settings

Key	Default	Range	Purpose
`CLAUDE_MEM_CHROMA_IDLE_SHUTDOWN_MS`	`900000` (15 min)	60000–86400000	Idle reaper threshold
`CLAUDE_MEM_CHROMA_ORPHAN_SCAN_ON_START`	`true`	bool	Master switch for startup scan
`CLAUDE_MEM_CHROMA_MAX_CONCURRENT`	`1`	1–4	Cap chroma-mcp instances per worker
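The idle-shutdown threshold above drives the 60s idle check from 3.2. One detail worth sketching: with lastCallAt starting at 0, the first check after a connect would see an enormous idle time. The sketch below treats the connect time as activity to avoid that. connectedAt is an assumption not present in the plan, and shouldIdleShutdown is a hypothetical name.

```typescript
// Hypothetical state view; connectedAt is an added assumption.
interface ChromaIdleState {
  connected: boolean;
  connectedAt: number; // epoch ms of the last successful connect
  lastCallAt: number;  // epoch ms of the last callTool, 0 if none yet
}

function shouldIdleShutdown(
  state: ChromaIdleState,
  nowMs: number,
  idleShutdownMs: number, // CLAUDE_MEM_CHROMA_IDLE_SHUTDOWN_MS
): boolean {
  if (!state.connected) return false; // nothing to stop
  // A fresh connection counts as activity even before the first callTool.
  const lastActivityAt = Math.max(state.lastCallAt, state.connectedAt);
  return nowMs - lastActivityAt > idleShutdownMs;
}
```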

3.4 Acceptance criteria

Spawn 5 chroma-mcp processes manually parented to init; restart worker → all 5 are reaped at startup.
Force connect-time failure (kill transport mid-connect) 10 times → registry count never exceeds 1.
Run worker for 30 min with no chroma calls → process is reaped after 15 min and getProcessRegistry().getActiveCountByType('chroma') returns 0.
callTool after idle-shutdown lazy-reconnects successfully.

3.5 Observability

Log: CHROMA_MCP INFO OrphanScan {found, killed}.
Log: CHROMA_MCP INFO IdleShutdown {idleMs}.
Log: CHROMA_MCP WARN RegistryStale when single-instance guard tears down a phantom.
/api/healthz fields (Phase 7): chroma_mcp_pid_count, chroma_mcp_last_call_at, chroma_mcp_state ('connected'|'disconnected'|'backoff'), chroma_mcp_backoff_remaining_ms.

3.6 Anti-pattern guards

Do not kill chroma processes whose command-line doesn't match chroma-mcp==<PINNED_VERSION> — could match unrelated user installs.
Do not spin up the idle-reaper timer if chromaMcpManager is null (chroma disabled via CLAUDE_MEM_CHROMA_ENABLED=false).
Do not call getProcessRegistry() from outside the worker process — it's worker-internal.

3.7 Verification checklist

After 2.5 hours of normal use, ps aux | grep '[c]hroma-mcp' | wc -l ≤ 1 (the bracket keeps grep from counting its own process).
Idle-reaper timer is .unref()d.
Orphan scan tolerates pgrep returning empty (no false-error logs).
Build still passes on Windows (Win32 branch compiles even if not unit-tested).

Phase 4 — Circuit Breaker for Retry Storms

Goal: Replace the unbounded counter at worker-utils.ts:401 with a real circuit breaker. Stop hooks from hammering the worker when it's down.

4.1 Files to modify

File	Change
`src/shared/worker-circuit-breaker.ts` (new)	`CircuitBreaker` class: states `CLOSED`, `OPEN`, `HALF_OPEN`, `OPEN_PERMANENT`. Persist to `~/.claude-mem/state/circuit-breaker.json`.
`src/shared/worker-utils.ts:executeWithWorkerFallback` (line 443)	Wrap the call in `breaker.run(...)`. On `OPEN`, return `WorkerFallback` immediately (no HTTP).
`src/shared/worker-utils.ts:recordWorkerUnreachable` (line 401)	Becomes a thin shim that calls `breaker.recordFailure()`. Hard cap (`MAX_LIFETIME_FAILURES = 50`) trips the breaker permanently until manual reset.
`src/shared/worker-utils.ts:resetWorkerFailureCounter` (line 419)	Becomes `breaker.recordSuccess()`.
`src/cli/hook-command.ts`	Verify the swallowed-stderr fix from observation 2026-05-07 is applied (it's marked as a "no-op replacement bug"). The breaker's stderr-fail-loud path must actually write to `process.stderr.write()`, not a stub.
`src/services/server/Server.ts`	Add `/api/admin/breaker/reset` POST endpoint (restricted to localhost) for manual unsticking.

4.2 Breaker semantics

States and transitions:

CLOSED ──[N consecutive failures]──> OPEN
OPEN   ──[reset_timeout_ms elapsed]──> HALF_OPEN
HALF_OPEN ──[1 success]──> CLOSED
HALF_OPEN ──[1 failure]──> OPEN  (resets timer)
ANY    ──[lifetime failures > MAX_LIFETIME_FAILURES]──> OPEN_PERMANENT (until manual reset via API or settings reload)

Defaults:

Setting	Default	Range
`CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD`	`5`	1–50
`CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS`	`30000`	1000–600000
`CLAUDE_MEM_BREAKER_HALF_OPEN_MAX_PROBES`	`1`	1–10
`CLAUDE_MEM_BREAKER_LIFETIME_CAP`	`50`	0–10000 (0 = no cap)

Persistent state file shape:

{
  "state": "CLOSED|OPEN|HALF_OPEN|OPEN_PERMANENT",
  "consecutiveFailures": 0,
  "lifetimeFailures": 0,
  "openedAt": null,
  "lastFailureAt": null,
  "lastSuccessAt": null,
  "lastTrippedAt": null
}
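The transitions and state shape above can be sketched as an in-memory class with an injectable clock. This is a sketch under stated assumptions, not the planned implementation: persistence and CLAUDE_MEM_BREAKER_HALF_OPEN_MAX_PROBES are omitted, BreakerConfig is a hypothetical shape, and lifetimeFailures is assumed not to reset on success.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN" | "OPEN_PERMANENT";

interface BreakerSnapshot {
  state: BreakerState;
  consecutiveFailures: number;
  lifetimeFailures: number;
  openedAt: number | null;
  lastFailureAt: number | null;
  lastSuccessAt: number | null;
  lastTrippedAt: number | null;
}

// Hypothetical config shape mirroring the CLAUDE_MEM_BREAKER_* settings.
interface BreakerConfig {
  failureThreshold: number;
  resetTimeoutMs: number;
  lifetimeCap: number; // 0 = no cap
}

function freshSnapshot(): BreakerSnapshot {
  return {
    state: "CLOSED", consecutiveFailures: 0, lifetimeFailures: 0,
    openedAt: null, lastFailureAt: null, lastSuccessAt: null, lastTrippedAt: null,
  };
}

class CircuitBreaker {
  private readonly cfg: BreakerConfig;
  private readonly now: () => number;
  private snap: BreakerSnapshot = freshSnapshot();

  constructor(cfg: BreakerConfig, now: () => number = () => Date.now()) {
    this.cfg = cfg;
    this.now = now;
  }

  getState(): BreakerState {
    return this.snap.state;
  }

  canAttempt(): boolean {
    const s = this.snap;
    if (s.state === "OPEN_PERMANENT") return false;
    if (s.state !== "OPEN") return true; // CLOSED or HALF_OPEN
    if (s.openedAt !== null && this.now() - s.openedAt >= this.cfg.resetTimeoutMs) {
      s.state = "HALF_OPEN"; // reset timeout elapsed: allow a probe
      return true;
    }
    return false;
  }

  recordFailure(): void {
    const s = this.snap;
    const t = this.now();
    s.consecutiveFailures += 1;
    s.lifetimeFailures += 1;
    s.lastFailureAt = t;
    if (this.cfg.lifetimeCap > 0 && s.lifetimeFailures > this.cfg.lifetimeCap) {
      this.trip("OPEN_PERMANENT", t);
    } else if (s.state === "HALF_OPEN" || s.consecutiveFailures >= this.cfg.failureThreshold) {
      this.trip("OPEN", t); // a failed probe also restarts the timer
    }
  }

  recordSuccess(): void {
    const s = this.snap;
    s.lastSuccessAt = this.now();
    s.consecutiveFailures = 0;
    if (s.state !== "OPEN_PERMANENT") {
      s.state = "CLOSED";
      s.openedAt = null;
    }
  }

  forceReset(): void {
    this.snap = freshSnapshot();
  }

  private trip(state: "OPEN" | "OPEN_PERMANENT", t: number): void {
    this.snap.state = state;
    this.snap.openedAt = t;
    this.snap.lastTrippedAt = t;
  }
}
```

Injecting the clock keeps the reset-timeout behaviour testable without real waits.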

4.3 Detailed tasks

CircuitBreaker class: state-transition logic with no network I/O. Methods: getState(), canAttempt(), recordFailure(reason), recordSuccess(), forceReset(). Its only I/O is the JSON snapshot, written atomically (write tmp + rename), mirroring writeHookFailureStateAtomic (worker-utils.ts:372).

Wire into executeWithWorkerFallback:

if (!breaker.canAttempt()) {
  // Optional: print one-line stderr if state changed during this call
  return { continue: true, reason: 'circuit_breaker_open', [WORKER_FALLBACK_BRAND]: true };
}
const alive = await ensureWorkerAliveOnce();
if (!alive) { breaker.recordFailure('unreachable'); ... }
...
if (response.ok) breaker.recordSuccess();

Fail-loud stderr fix: The 2026-05-07 observation mentions a "stderr no-op replacement bug" in hookCommand. Investigate src/cli/hook-command.ts for any process.stderr.write shim that suppresses output. The breaker's diagnostic ("Worker unreachable; circuit breaker OPEN; will retry in Xs") MUST appear on the user's terminal so they know what's happening. Test by intentionally killing the worker and running a hook — message should appear on stderr.
Manual reset endpoint: POST /api/admin/breaker/reset (no body required). Restricted to 127.0.0.1 only. Logs SYSTEM WARN BreakerForceReset with caller info.
Lifetime cap: when lifetimeFailures > CLAUDE_MEM_BREAKER_LIFETIME_CAP, transition to OPEN_PERMANENT. The only way out is the manual-reset API or restarting the worker with a fresh state file. Print prominent stderr: claude-mem: 50 lifetime worker failures detected. Disabling memory hooks until reset. Run: claude-mem worker doctor.
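
The write-tmp-then-rename persistence mentioned in the first task can be sketched as follows. The helper names `writeJsonAtomic` and `readJsonOr` are hypothetical, as is the temp-file naming scheme; the real code is expected to mirror `writeHookFailureStateAtomic`, which is not reproduced here.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical helper: write to a sibling temp file, then rename over the target.
// A same-directory rename replaces the target in one step, so a concurrent
// reader sees either the old snapshot or the new one, never a partial file.
function writeJsonAtomic(targetPath: string, value: unknown): void {
  mkdirSync(dirname(targetPath), { recursive: true });
  const tmpPath = `${targetPath}.${process.pid}.tmp`;
  writeFileSync(tmpPath, JSON.stringify(value, null, 2), "utf8");
  renameSync(tmpPath, targetPath);
}

// Hypothetical helper: a missing or corrupt state file falls back to a default
// instead of throwing, so a damaged snapshot cannot break a hook.
function readJsonOr<T>(targetPath: string, fallback: T): T {
  try {
    return JSON.parse(readFileSync(targetPath, "utf8")) as T;
  } catch {
    return fallback;
  }
}

// Usage: round-trip a snapshot through a scratch directory.
const scratchDir = mkdtempSync(join(tmpdir(), "claude-mem-breaker-"));
const statePath = join(scratchDir, "state", "circuit-breaker.json");
writeJsonAtomic(statePath, { state: "OPEN", consecutiveFailures: 5 });
const reloaded = readJsonOr<{ state: string; consecutiveFailures: number } | null>(statePath, null);
```

Including the PID in the temp name is one way to keep two hook processes from writing the same temp file; whichever rename lands last wins.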

4.4 Acceptance criteria

Kill the worker, run 100 hooks within one reset window (CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS) → exactly CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD HTTP attempts made; the rest short-circuit.
After 30s idle, next hook makes ONE probe (HALF_OPEN); if probe succeeds, breaker closes.
Lifetime cap (set to 5 for testing): 6th lifetime failure → permanent open until POST /api/admin/breaker/reset clears it.
Stderr message visible to user when breaker opens (manual repro: kill worker, run 5+ hooks).
Existing hook-failures.json file is migrated to the new breaker JSON format on first run (one-shot migration in worker-utils.ts).

4.5 Observability

Log: SYSTEM WARN BreakerOpened, fields {lifetime, consecutiveBefore}.
Log: SYSTEM INFO BreakerHalfOpen.
Log: SYSTEM INFO BreakerClosed, fields {recoveredAfterMs}.
Log: SYSTEM ERROR BreakerOpenedPermanent.
/api/healthz fields (Phase 7): breaker_state, breaker_consecutive_failures, breaker_lifetime_failures, breaker_opened_at, breaker_total_trips.

4.6 Anti-pattern guards

Do not call the breaker from inside the worker process — it's a hook-side concern. The worker has RestartGuard for its own session-level limits.
Do not auto-reset the lifetime counter on restart; persist it. Otherwise restart-loops mask the underlying failure.
Do not block the breaker reset endpoint on initialization (/api/admin/breaker/reset should work even if initializationCompleteFlag === false).

4.7 Verification checklist

No call site bypasses the breaker (grep for workerHttpRequest outside executeWithWorkerFallback and audit each — some integrations may need breaker.canAttempt() guards added).
State file readable/writable across process restarts.
Stderr fail-loud path verified end-to-end on Linux + macOS + Windows Terminal.
No process.exit(1) introduced — breaker tripping returns WorkerFallback, not exit codes.

Phase 7 — `/api/healthz` Endpoint with Concrete Metrics

Goal: Centralized observability so future regressions are detectable at a glance.

7.1 Files to modify

File	Change
`src/services/worker/http/routes/HealthzRoutes.ts` (new)	Implements `RouteHandler`. GET `/api/healthz` and `/api/healthz?format=prom`.
`src/services/worker-service.ts:registerRoutes`	Register the new `HealthzRoutes(...)`.
`src/services/worker/MetricsCollector.ts` (new)	Aggregates metrics; refreshed on the supervisor's existing 30s health-check tick to avoid amplifying load.
`src/supervisor/health-checker.ts:runHealthCheck`	Call `MetricsCollector.refresh()` after `pruneDeadEntries`.

7.2 Endpoint contract

GET /api/healthz → 200 JSON:

{
  "status": "ok|degraded|unhealthy",
  "ts": "2026-05-07T21:30:00.000Z",
  "uptime_sec": 12345,
  "versions": {
    "plugin": "12.7.5",
    "worker": "12.7.5",
    "matches": true
  },
  "process": {
    "pid": 12345,
    "rss_mb": 145.2,
    "event_loop_lag_ms": 3.1,
    "managed": true,
    "platform": "darwin"
  },
  "pid_file": {
    "path": "/Users/.../worker.pid",
    "start_token": "Wed May  7 14:23:15 2026",
    "daemon_lock_held": true
  },
  "db": {
    "path": "/Users/.../claude-mem.db",
    "size_bytes": 31457280,
    "page_count": 7680,
    "freelist_count": 12,
    "free_ratio_pct": 0.16,
    "last_vacuum_at": "2026-05-07T20:00:00.000Z",
    "last_vacuum_freed_pages": 130000,
    "last_maintenance_at": "2026-05-07T20:00:00.000Z",
    "oldest_processing_pending_age_sec": 4,
    "processing_pending_count": 1,
    "pending_count_total": 12,
    "sdk_sessions_total": 145,
    "sdk_sessions_inactive": 13,
    "sdk_sessions_by_project": { "claude-mem": 25, "...": 120 }
  },
  "child_processes": {
    "chroma_mcp_pid_count": 1,
    "chroma_mcp_last_call_at": "2026-05-07T21:25:11.000Z",
    "chroma_mcp_state": "connected",
    "chroma_mcp_backoff_remaining_ms": 0,
    "sdk_process_count": 0,
    "supervisor_registry_size": 2
  },
  "network": {
    "hook_consecutive_failures": 0,
    "breaker_state": "CLOSED",
    "breaker_consecutive_failures": 0,
    "breaker_lifetime_failures": 3,
    "breaker_opened_at": null,
    "breaker_total_trips": 1,
    "last_request_at": "2026-05-07T21:29:55.000Z",
    "request_rate_per_min": 12.3
  },
  "ai": {
    "provider": "claude",
    "auth_method": "...",
    "last_interaction": { ... }
  }
}

GET /api/healthz?format=prom → 200 text/plain with Prometheus text format. One metric per JSON leaf (e.g. claude_mem_db_free_ratio_pct 0.16).
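
One way to derive the Prometheus output from the JSON payload is to walk the object and emit a gauge for each numeric or boolean leaf. The sketch below is illustrative only: `toPromLines` is a hypothetical name, treating every metric as a gauge is an assumption, and string, null, and array leaves are skipped because they have no numeric sample value.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Hypothetical helper: flattens a healthz-style object into Prometheus text lines.
function toPromLines(value: Json, path: string[] = []): string[] {
  // Metric names may only contain [a-zA-Z0-9_], so other characters become "_".
  const name = ["claude_mem", ...path].join("_").replace(/[^a-zA-Z0-9_]/g, "_");
  const emit = (sample: number): string[] => [
    `# HELP ${name} ${path.join(".")} from /api/healthz`,
    `# TYPE ${name} gauge`,
    `${name} ${sample}`,
  ];
  if (typeof value === "number" && Number.isFinite(value)) return emit(value);
  if (typeof value === "boolean") return emit(value ? 1 : 0);
  if (value !== null && typeof value === "object" && !Array.isArray(value)) {
    return Object.entries(value).flatMap(([key, child]) => toPromLines(child, [...path, key]));
  }
  return []; // strings, nulls, and arrays are not emitted
}

// Usage with a small slice of the payload shown above.
const promLines = toPromLines({
  db: { path: "/tmp/claude-mem.db", free_ratio_pct: 0.16, page_count: 7680 },
  process: { managed: true },
});
```

Per-project counts such as `sdk_sessions_by_project` would usually be better expressed as one metric with a `project` label than as one metric name per project; that refinement is not shown.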

status derivation:

unhealthy if breaker is OPEN_PERMANENT, OR DB initialization failed, OR chroma-mcp pid count > CLAUDE_MEM_CHROMA_MAX_CONCURRENT.
degraded if breaker is OPEN, OR the DB free ratio > 0.4 (i.e. free_ratio_pct > 40), OR oldest_processing_pending_age_sec > 3600 (1 hour), OR worker version mismatches plugin version.
ok otherwise.
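
The derivation rules above translate directly into a pure function. This is a sketch under assumptions: the `StatusInputs` field names are hypothetical, and `freeRatio` is taken as a 0..1 fraction (`freelist_count / page_count`), not the percentage reported in `free_ratio_pct`.

```typescript
type HealthStatus = "ok" | "degraded" | "unhealthy";

// Hypothetical input shape; the real MetricsCollector snapshot may differ.
interface StatusInputs {
  breakerState: "CLOSED" | "OPEN" | "HALF_OPEN" | "OPEN_PERMANENT";
  dbInitFailed: boolean;
  chromaPidCount: number;
  chromaMaxConcurrent: number; // CLAUDE_MEM_CHROMA_MAX_CONCURRENT
  freeRatio: number; // fraction 0..1, not a percentage
  oldestProcessingAgeSec: number;
  versionsMatch: boolean;
}

function deriveStatus(i: StatusInputs): HealthStatus {
  // Unhealthy conditions are checked first so they win over degraded ones.
  if (
    i.breakerState === "OPEN_PERMANENT" ||
    i.dbInitFailed ||
    i.chromaPidCount > i.chromaMaxConcurrent
  ) {
    return "unhealthy";
  }
  if (
    i.breakerState === "OPEN" ||
    i.freeRatio > 0.4 ||
    i.oldestProcessingAgeSec > 3600 ||
    !i.versionsMatch
  ) {
    return "degraded";
  }
  return "ok";
}

// Usage: a baseline healthy snapshot matching the example payload.
const healthyInputs: StatusInputs = {
  breakerState: "CLOSED",
  dbInitFailed: false,
  chromaPidCount: 1,
  chromaMaxConcurrent: 1,
  freeRatio: 12 / 7680,
  oldestProcessingAgeSec: 4,
  versionsMatch: true,
};
```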

7.3 Detailed tasks

MetricsCollector class: a Map<string, unknown> snapshot. Public refresh() collects fresh data; public getSnapshot() returns the cached object. Refresh is called by the 30s health-check tick AND on-demand if last refresh > 5s ago (debounced).
DB metrics queries (use db.prepare + .get()):
- PRAGMA page_count → { page_count: number }
- PRAGMA freelist_count → { freelist_count: number }
- PRAGMA page_size → for size_bytes computation
- SELECT MIN(updated_at) FROM pending_messages WHERE status='processing' (with julianday math for age in seconds)
- SELECT project, COUNT(*) FROM sdk_sessions GROUP BY project (the project column must be selected to build sdk_sessions_by_project)
Process metrics: process.memoryUsage().rss / 1024 / 1024. Event-loop lag via perf_hooks.monitorEventLoopDelay (a Node API; confirm it works under bun before relying on it, see Appendix C), sampled over a 30s window.
Network metrics: maintain a rolling 1-min request counter in middleware (existing createMiddleware in Server.ts:156). Increment on each /api/* request.
Prometheus format: emit # HELP and # TYPE lines per metric. Use the same naming convention (claude_mem_<group>_<name>).
Compatibility: leave /api/health UNCHANGED (existing integrations break otherwise). /api/healthz is the new richer endpoint.
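
The rolling 1-minute request counter from the network-metrics task can be kept in constant memory with 60 one-second buckets. The class below is a sketch; `RollingRequestCounter` and its method names are hypothetical, and it reports a whole-request count over the last 60 seconds instead of a smoothed rate.

```typescript
// Hypothetical helper: fixed-size ring of per-second counts.
class RollingRequestCounter {
  private readonly counts: number[] = new Array<number>(60).fill(0);
  // The wall-clock second each slot currently represents (-1 = never used).
  private readonly seconds: number[] = new Array<number>(60).fill(-1);

  // Call once per /api/* request, for example from middleware.
  record(nowMs: number = Date.now()): void {
    const sec = Math.floor(nowMs / 1000);
    const slot = sec % 60;
    if (this.seconds[slot] !== sec) {
      // The slot holds data from an earlier minute; recycle it.
      this.seconds[slot] = sec;
      this.counts[slot] = 0;
    }
    this.counts[slot] += 1;
  }

  // Number of requests recorded in the 60 seconds ending at nowMs.
  perMinute(nowMs: number = Date.now()): number {
    const sec = Math.floor(nowMs / 1000);
    let total = 0;
    for (let slot = 0; slot < 60; slot++) {
      const age = sec - this.seconds[slot];
      if (this.seconds[slot] >= 0 && age >= 0 && age < 60) total += this.counts[slot];
    }
    return total;
  }
}
```

Because stale slots are filtered by age at read time, no timer is needed to expire old buckets, which fits the plan's goal of avoiding new always-on intervals.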

7.4 Acceptance criteria

curl 127.0.0.1:<port>/api/healthz | jq .status returns ok on a healthy worker.
After Phase 6 ships, db.free_ratio_pct updates at 30s cadence (verify by manually inflating freelist).
Phase 4 breaker state changes are visible within 30s.
?format=prom parses with promtool check metrics.
No new endpoint blocks for > 50ms (snapshot is cached; refresh is async).

7.5 Observability hooks (yes, for the observability endpoint itself)

Log WORKER DEBUG MetricsRefresh, fields {durationMs}.
Log WORKER WARN MetricsRefreshSlow if refresh > 250ms (DB query stall signal).

7.6 Verification checklist

/api/health response body unchanged byte-for-byte (regression test).
All Phase 2-6 metrics exposed (cross-check the field list in those phases).
?format=prom output validates with promtool if available; otherwise visual inspection.
Endpoint mounted via RouteHandler pattern (no direct app.get in worker-service.ts).

Phase 8 — Observability, CLI, & Rollout

Goal: User-facing surface so operators can see what the new machinery did. Ordered last to allow phases 2-7 to stabilize.

8.1 Files to modify

File	Change
`src/cli/handlers/worker-doctor.ts` (new)	New CLI subcommand `claude-mem worker doctor` — fetches `/api/healthz`, formats it for terminals, includes recent reaper actions.
`src/services/worker-service.ts:main()`	Register the `worker doctor` CLI route (alongside existing `cursor`, `gemini-cli` cases).
`plugin/scripts/worker-cli.js`	Wire to the new doctor command.
`CLAUDE.md` (project root)	Document new settings under a "Worker Maintenance" section.
`docs/public/` (optional)	User-facing explanation of the breaker, reaper, and health endpoint.

8.2 `worker doctor` output (example)

claude-mem worker doctor

Status:           OK
Version:          plugin=12.7.5 worker=12.7.5 (match)
Uptime:           3h 25m
PID:              12345  (lock held: yes)

Database:
  Size:             30 MB    (free: 0.16%)
  Last vacuum:      1h 30m ago, freed 130k pages
  Pending:          12 total / 1 processing (oldest 4s)
  SDK sessions:     145 total / 13 inactive

Child processes:
  chroma-mcp:       1  (last call: 5s ago, state: connected)
  SDK processes:    0
  Supervisor:       2 entries

Circuit breaker:
  State:            CLOSED
  Consecutive:      0
  Lifetime:         3
  Total trips:      1

Recent maintenance (last 24h):
  2026-05-07 20:00  Vacuum: freed 130k pages in 1.4s
  2026-05-07 19:30  Reaper: 5 stuck-processing reset, 2 inactive marked
  2026-05-07 18:00  Chroma orphan scan: 0 found

If status != ok, append a "Recommended actions" block:

breaker open → claude-mem worker reset-breaker
DB free ratio high → mention next vacuum window
chroma orphans → claude-mem worker reap-chroma
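
A minimal sketch of how the "Recommended actions" block might be assembled from the healthz payload. The `DoctorInputs` shape, the default thresholds, and the message wording are assumptions; in particular, detecting "chroma orphans" as a pid count above the configured maximum is one possible reading of the rule above.

```typescript
// Hypothetical subset of the /api/healthz payload used by the doctor command.
interface DoctorInputs {
  breakerState: string; // network.breaker_state
  freeRatioPct: number; // db.free_ratio_pct
  chromaPidCount: number; // child_processes.chroma_mcp_pid_count
}

function recommendedActions(
  h: DoctorInputs,
  vacuumThresholdPct: number = 40, // CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO as a percentage
  chromaMax: number = 1, // CLAUDE_MEM_CHROMA_MAX_CONCURRENT
): string[] {
  const actions: string[] = [];
  if (h.breakerState === "OPEN" || h.breakerState === "OPEN_PERMANENT") {
    actions.push("Circuit breaker is open. Run: claude-mem worker reset-breaker");
  }
  if (h.freeRatioPct > vacuumThresholdPct) {
    actions.push("DB free ratio is high. Space is reclaimed at the next vacuum window.");
  }
  if (h.chromaPidCount > chromaMax) {
    actions.push("Extra chroma-mcp processes detected. Run: claude-mem worker reap-chroma");
  }
  return actions;
}
```

An empty array means the block is omitted entirely, which matches the healthy example output.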

8.3 Detailed tasks

Doctor command: GET /api/healthz via workerHttpRequest. Format as the table above. Color-code (red/yellow/green) using existing chalk integration if present, otherwise plain text. JSON pass-through via --json flag.
Recent-actions feed: store the last 50 maintenance events in a circular buffer in MetricsCollector (in-memory only — survives one worker lifetime; not persistent). Expose at /api/healthz/events (separate to avoid bloating the main response).
Update CLAUDE.md: add a "Worker Maintenance" section with: settings reference table, the doctor command, a brief description of the reaper/breaker/vacuum behavior. Per CLAUDE.md "Important: No need to edit the changelog ever" — only edit CLAUDE.md, never CHANGELOG.
Rollout ordering (per problem statement constraint):
- Wave 1 (idempotent, low-risk): Phase 5 (PID/port reclamation), Phase 6 (DB maintenance).
- Wave 2 (reapers — needs careful testing on busy DBs): Phase 2 (session reaper), Phase 3 (chroma supervisor).
- Wave 3 (user-visible behavior change): Phase 4 (circuit breaker), Phase 7 (/api/healthz).
- Wave 4 (CLI surface): Phase 8 (doctor command, docs).
Each wave can ship as a separate release. Inter-wave dependencies: Phase 7 depends on data sources from Phases 2/3/4/6 — but the endpoint can ship with partial data (fields gated by phase availability).
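
The recent-actions feed described above needs a bounded, in-memory buffer. A minimal ring buffer sketch follows; `EventRing` and `MaintenanceEvent` are hypothetical names, and the event fields are assumptions based on the example doctor output.

```typescript
// Hypothetical event shape, loosely modeled on the "Recent maintenance" lines.
interface MaintenanceEvent {
  at: string; // ISO timestamp
  kind: string; // e.g. "vacuum", "reaper", "chroma-orphan-scan"
  detail: string;
}

class EventRing<T> {
  private readonly capacity: number;
  private readonly items: T[] = [];
  private next = 0; // index that the next push will write to

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity < 1) throw new RangeError("capacity must be >= 1");
    this.capacity = capacity;
  }

  push(item: T): void {
    if (this.items.length < this.capacity) {
      this.items.push(item);
    } else {
      this.items[this.next] = item; // overwrite the oldest entry
    }
    this.next = (this.next + 1) % this.capacity;
  }

  // Returns a copy ordered oldest to newest.
  toArray(): T[] {
    if (this.items.length < this.capacity) return [...this.items];
    return [...this.items.slice(this.next), ...this.items.slice(0, this.next)];
  }
}

// Usage: the plan calls for the last 50 events.
const recentEvents = new EventRing<MaintenanceEvent>(50);
recentEvents.push({ at: "2026-05-07T20:00:00.000Z", kind: "vacuum", detail: "freed 130k pages in 1.4s" });
```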

8.4 Acceptance criteria

claude-mem worker doctor prints a green-OK summary on a healthy worker.
claude-mem worker doctor --json returns valid JSON pipeable to jq.
Killing the worker → claude-mem worker doctor cleanly reports Worker unreachable instead of hanging.
CLAUDE.md updates are limited to a new section; no churn elsewhere.

8.5 Verification checklist

claude-mem worker doctor exits 0 on healthy state, 1 on unhealthy, 2 if worker unreachable (mirrors hook-exit-codes convention).
No new public marketplace API surface beyond what's documented.
Doctor command works without the worker running (unreachable path covered).
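
The exit-code rule in the first checklist item can be expressed as a small mapping. The `DoctorResult` type is hypothetical, and the plan does not say what `degraded` should return; treating it as exit 0 is an assumption that should be confirmed before implementation.

```typescript
// Hypothetical result type for the doctor command's healthz fetch.
type DoctorResult =
  | { reachable: false }
  | { reachable: true; status: "ok" | "degraded" | "unhealthy" };

function doctorExitCode(result: DoctorResult): 0 | 1 | 2 {
  if (!result.reachable) return 2; // worker unreachable
  if (result.status === "unhealthy") return 1;
  return 0; // "ok", and by assumption "degraded"
}
```

Keeping the mapping in one function makes it easy to test the unreachable path without starting a worker.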

Final Phase — Cross-Phase Verification

Goal: Prove the system works end-to-end before declaring victory.

F.1 Soak test (24h)

Run the worker for 24 hours under realistic Claude Code usage. After 24h:

Metric	Pass criterion
`ps aux \| grep '[c]hroma-mcp' \| wc -l`	≤ 1
`ps aux \| grep '[c]laude-mem' \| wc -l`	≤ a small constant (1-2)
DB size growth rate	< 5 MB/hr; free_ratio < 20%
`/api/healthz` `network.breaker_lifetime_failures`	< 10 (vs. the #1874 starting baseline)
Stuck `processing` rows older than 10 min	0
Worker memory RSS	< 300 MB (no leak)

F.2 Failure-injection tests

Inject	Expected behavior
Kill worker via `kill -9`	Lazy-respawn on next hook; PID file cleaned
Two parallel `claude-mem start`	Exactly one daemon survives; lock log line visible
100 stuck processing rows	Reaper resets all within `REAPER_PROCESSING_STUCK_MS + REAPER_TICK_MS`
Spawn fake listener on worker port	New `--daemon` exits 0 with diagnostic stderr (no silent exit)
Fork 5 chroma-mcp orphans	Worker startup reaps all 5
Pull network during 10 hooks	Breaker opens after threshold; subsequent hooks short-circuit

F.3 Anti-pattern grep

# No new always-on intervals
grep -rn "setInterval" src/ --include="*.ts" | grep -v "unref()" | grep -v "^src/.*test"

# No new process.exit(1) on hook paths
git diff main -- src/shared/worker-utils.ts src/cli/ | grep "process.exit(1)"

# No invented settings
git diff main -- src/shared/SettingsDefaultsManager.ts | grep "CLAUDE_MEM_"
# Cross-reference with all phases' settings tables.

# No hardcoded magic numbers in business logic
git diff main | grep -E "[0-9]{4,}" | grep -v SettingsDefaultsManager | grep -v test

F.4 Documentation diff

CLAUDE.md adds: Worker Maintenance section (Phase 8.3).
docs/public/ (optional): user-facing explanation.
No CHANGELOG edits (auto-generated per CLAUDE.md).

F.5 Sign-off checklist

All 8 phases shipped.
/api/healthz reports status: "ok" 24h after deployment.
No new ERROR-level logs in production for 24h (excluding pre-existing).
Manual worker doctor on 3 production-like environments confirms expected output.
Phase 0 doc-discovery anti-patterns not violated (grep git log -p).

Appendix A — Settings Reference (consolidated)

All settings declared in src/shared/SettingsDefaultsManager.ts:

Setting	Phase	Default	Range
`CLAUDE_MEM_DAEMON_LOCK_TIMEOUT_MS`	5	`5000`	0–60000
`CLAUDE_MEM_PID_PORT_RECHECK_MS`	5	`2000`	500–30000
`CLAUDE_MEM_DB_MAINTENANCE_ENABLED`	6	`true`	bool
`CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS`	6	`24`	1–168
`CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO`	6	`0.40`	0.05–0.95
`CLAUDE_MEM_DB_VACUUM_STARTUP_DELAY_MS`	6	`300000`	0–3600000
`CLAUDE_MEM_CLEANUP_REGRESSION_CHECK`	6	`true`	bool
`CLAUDE_MEM_REAPER_ENABLED`	2	`true`	bool
`CLAUDE_MEM_REAPER_TICK_MS`	2	`30000`	5000–600000
`CLAUDE_MEM_REAPER_PROCESSING_STUCK_MS`	2	`300000`	30000–86400000
`CLAUDE_MEM_REAPER_INACTIVE_DAYS`	2	`30`	1–365
`CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS`	2	`0`	0–365
`CLAUDE_MEM_CHROMA_IDLE_SHUTDOWN_MS`	3	`900000`	60000–86400000
`CLAUDE_MEM_CHROMA_ORPHAN_SCAN_ON_START`	3	`true`	bool
`CLAUDE_MEM_CHROMA_MAX_CONCURRENT`	3	`1`	1–4
`CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD`	4	`5`	1–50
`CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS`	4	`30000`	1000–600000
`CLAUDE_MEM_BREAKER_HALF_OPEN_MAX_PROBES`	4	`1`	1–10
`CLAUDE_MEM_BREAKER_LIFETIME_CAP`	4	`50`	0–10000
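
The Range column implies that raw setting values need validation before use. The sketch below shows one possible policy for the Phase 4 numeric settings: fall back to the default when the value is missing or not a number, and clamp when it is out of range. Both the policy and the names `NumericSetting` and `resolveNumericSetting` are assumptions; `SettingsDefaultsManager` may handle this differently.

```typescript
interface NumericSetting {
  def: number;
  min: number;
  max: number;
}

// Defaults and ranges copied from the Phase 4 rows of the table above.
const BREAKER_SETTINGS: Record<string, NumericSetting> = {
  CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD: { def: 5, min: 1, max: 50 },
  CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS: { def: 30000, min: 1000, max: 600000 },
  CLAUDE_MEM_BREAKER_HALF_OPEN_MAX_PROBES: { def: 1, min: 1, max: 10 },
  CLAUDE_MEM_BREAKER_LIFETIME_CAP: { def: 50, min: 0, max: 10000 },
};

// Hypothetical resolver: default on missing or non-numeric input, clamp otherwise.
function resolveNumericSetting(key: string, raw: string | undefined): number {
  const spec = BREAKER_SETTINGS[key];
  if (spec === undefined) throw new Error(`Unknown setting: ${key}`);
  if (raw === undefined || raw.trim() === "") return spec.def;
  const parsed = Number(raw);
  if (!Number.isFinite(parsed)) return spec.def;
  return Math.min(spec.max, Math.max(spec.min, parsed));
}
```

Clamping keeps a typo such as an extra zero from disabling the breaker outright, at the cost of silently adjusting the value; logging the adjustment would make that visible.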

Appendix B — File Change Summary

File	Phases that touch it
`src/services/worker-service.ts`	3 (initializeBackground), 5 (--daemon), 6 (maintenance wiring), 7 (route registration), 8 (CLI)
`src/services/worker-spawner.ts`	5
`src/services/infrastructure/ProcessManager.ts`	5 (lock + start-token)
`src/services/infrastructure/HealthMonitor.ts`	5 (port-on-pid match)
`src/services/infrastructure/CleanupV12_4_3.ts`	6 (regression detection — read only)
`src/services/sync/ChromaMcpManager.ts`	3
`src/supervisor/index.ts`	5 (validateWorkerPidFile)
`src/supervisor/process-registry.ts`	3 (orphan scan), 5 (start-token)
`src/supervisor/health-checker.ts`	2 (reaper), 7 (metrics refresh)
`src/services/worker/SessionManager.ts`	2 (delete hook), 6 (pause/resume)
`src/shared/worker-utils.ts`	4 (breaker integration)
`src/services/sqlite/Database.ts`	6 (auto_vacuum)
`src/services/sqlite/PendingMessageStore.ts`	2 (reapStuckProcessing)
`src/services/sqlite/SessionStore.ts`	2 (findInactiveSdkSessions)
`src/services/sqlite/migrations/runner.ts`	2 (inactive_at column)
`src/services/server/Server.ts`	4 (breaker reset), 7 (request-rate middleware)
`src/shared/SettingsDefaultsManager.ts`	2-6 (settings keys)
`src/services/maintenance/DbMaintenance.ts`	6 (NEW)
`src/services/maintenance/SessionReaper.ts`	2 (NEW)
`src/shared/worker-circuit-breaker.ts`	4 (NEW)
`src/services/worker/MetricsCollector.ts`	7 (NEW)
`src/services/worker/http/routes/HealthzRoutes.ts`	7 (NEW)
`src/cli/handlers/worker-doctor.ts`	8 (NEW)
`CLAUDE.md`	8 (Worker Maintenance section)

Appendix C — Open Questions for Executor

bun:ffi flock support: confirm via spike before committing Phase 5.4. If unavailable, fall back to flock(1) shell on Linux + atomic mkdirSync sentinel on macOS/Windows.
Event-loop lag sampling on bun: verify perf_hooks.monitorEventLoopDelay works in bun's Node-compat layer. If not, fall back to a setImmediate-based heuristic.
Existing-DB auto_vacuum migration: verify that the startup full VACUUM in Phase 6.3 is sufficient to reclaim the 504 MB without requiring users to run PRAGMA auto_vacuum = INCREMENTAL; VACUUM; manually. (It should — full VACUUM with auto_vacuum already set takes effect.)
Pro-features compatibility: confirm with maintainers that /api/healthz does not duplicate any planned Pro endpoint. Per CLAUDE.md "Pro Features Architecture", the worker's local HTTP API stays open — /api/healthz is fine to add OSS-side.
