Six numbered plan documents covering: - 01 Hook IO Discipline (#2376) - 02 Spawn-Contract Templating (#2377) - 03 Worker / Daemon Lifecycle Hardening (#2378) - 04 Installer Failure Transparency (#2379) - 05 Observer SDK Tool Enforcement (#2380) - 06 Worker Env Isolation (#2381) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan 03 — Worker / Daemon Lifecycle Hardening
Scope: Fix accumulated worker / daemon lifecycle bugs in claude-mem. Address DB bloat, chroma-mcp leaks, retry storms, port/PID races, queue zombies, missing supervision, and observability gaps.
Non-implementation: This document is a plan. Each phase is self-contained; an executing agent should be able to run a single phase without re-discovering context.
Audience: Subsequent agents executing one phase per session.
Phase 0 — Documentation Discovery & Allowed APIs
Goal: Anchor every implementation phase in real APIs that exist in the current codebase or in vetted libraries. Prevent phantom-method invention.
0.1 Read these files end-to-end before touching code
| File | Why |
|---|---|
| CLAUDE.md (project root) | Architecture, exit-code strategy, Pro/OSS boundary, settings conventions |
| src/services/worker-service.ts | WorkerService class, --daemon main(), signal registration, all CLI subcommands |
| src/services/worker-spawner.ts | ensureWorkerStarted 3-state machine (ready/warming/dead) |
| src/services/infrastructure/ProcessManager.ts | spawnDaemon, PID file ops, captureProcessStartToken, isProcessAlive |
| src/services/infrastructure/HealthMonitor.ts | isPortInUse, waitForHealth, waitForReadiness, httpShutdown |
| src/services/infrastructure/GracefulShutdown.ts | performGracefulShutdown ordering |
| src/services/infrastructure/CleanupV12_4_3.ts | runOneTimeV12_4_3Cleanup, STUCK_PENDING_THRESHOLD = 10, observer-purge SQL |
| src/services/sync/ChromaMcpManager.ts | ensureConnected, connectInternal, stop, killProcessTree, collectDescendantPids, RECONNECT_BACKOFF_MS = 10_000, MCP_CONNECTION_TIMEOUT_MS = 30_000 |
| src/supervisor/index.ts | Supervisor class, validateWorkerPidFile, signal-handler config |
| src/supervisor/process-registry.ts | ProcessRegistry, getSdkProcessForSession, ensureSdkProcessExit, waitForSlot, TOTAL_PROCESS_HARD_CAP = 10 |
| src/supervisor/health-checker.ts | 30s pruneDeadEntries loop (already present; extend, don't replace) |
| src/supervisor/shutdown.ts | runShutdownCascade, signalProcess, loadTreeKill |
| src/services/worker/SessionManager.ts | In-memory session map, deleteSession, queue/pending integration |
| src/services/worker/RestartGuard.ts | Per-session restart cap (10/60s window, 5 consecutive) |
| src/services/worker/retry.ts | Provider-level retry (withRetry, classified errors). DO NOT mutate; circuit breaker layers ABOVE this |
| src/shared/worker-utils.ts | recordWorkerUnreachable (line 401), executeWithWorkerFallback (line 443), fail-loud counter file at ~/.claude-mem/state/hook-failures.json |
| src/services/sqlite/Database.ts | PRAGMA setup (lines 27-32, 69-74); single source of truth for DB pragmas |
| src/services/server/Server.ts | /api/health (line 161), /api/readiness (line 178), /api/version (line 192) |
| src/shared/SettingsDefaultsManager.ts | Where every new setting key MUST be declared with a default |
| src/shared/hook-constants.ts | HOOK_TIMEOUTS, HOOK_EXIT_CODES; extend here, don't inline |
| plugin/bun-runner.js, plugin/scripts/worker-service.cjs | Built worker entrypoint; note the build pipeline (scripts/build-hooks.js) |
0.2 Allowed APIs (use these, do NOT invent siblings)
SQLite (bun:sqlite) — pragma calls are db.run('PRAGMA …') or db.prepare('PRAGMA …').get(). Existing pragmas: journal_mode=WAL, synchronous=NORMAL, foreign_keys=ON, temp_store=memory, mmap_size, cache_size. VACUUM runs only outside a transaction. VACUUM INTO 'path' is the backup form already used in CleanupV12_4_3.ts:135. wal_checkpoint(TRUNCATE) is the truncating-checkpoint form.
Process supervision — getSupervisor(), getProcessRegistry(), registerProcess(id, info, processRef?), unregisterProcess(id), pruneDeadEntries(), assertCanSpawn(type), runShutdownCascade(...). Tree-kill on POSIX uses pgrep -P recursion + process.kill(-pgid, signal); on Windows uses taskkill /T /F /PID or tree-kill npm.
HTTP/Express — Server.app.get('/api/...', handler) via registerRoutes (handlers implement setupRoutes(app) on a RouteHandler interface). Every new endpoint must follow the existing RouteHandler pattern under src/services/worker/http/routes/.
Settings — SettingsDefaultsManager.get('CLAUDE_MEM_…'), SettingsDefaultsManager.loadFromFile(path). New keys require: (a) type added to the interface in SettingsDefaultsManager.ts, (b) default value declared in the same file, (c) documented in CLAUDE.md if user-tunable.
Logging — logger.info(category, msg, fields), logger.warn, logger.error(category, msg, fields, error). Categories used here: SYSTEM, WORKER, SESSION, CHROMA_MCP, SDK, DB, QUEUE, PROCESS. Add new category MAINTENANCE for VACUUM / reaper events.
0.3 Anti-patterns — explicitly forbidden
- Do not add a new singleton supervisor; extend `getSupervisor()`.
- Do not spawn child processes without going through `getSupervisor().assertCanSpawn(...)` and `registerProcess(...)`.
- Do not call `process.exit(1)` on hook-side error paths: it accumulates Windows Terminal tabs (CLAUDE.md exit-code strategy). Use `0` for graceful exits and `2` only for blocking-error paths that need to surface stderr to Claude.
- Do not delete `sdk_sessions` rows if `observations` or `session_summaries` still reference their `memory_session_id` without an explicit user-opt-in flag.
- Do not hold a SQLite write lock during `VACUUM` while ingestion is hot. Pause queue processing first.
- Do not introduce setInterval timers that keep the event loop alive; every new timer must call `.unref()`.
- Do not invent settings keys; declare them in `SettingsDefaultsManager.ts` first.
0.4 Confidence note
Confidence: HIGH on file/API inventory (read-pass complete on all referenced files). MEDIUM on Windows behavior of new advisory locks (Windows mandatory locking via lockf is bun-runtime-dependent — verify via spike before committing).
Phase 1 — Inventory & Instrumentation (read-only, safe)
Goal: Produce a written state-machine diagram and an exit-site catalog that subsequent phases reference. No code changes; create a scratch document at docs/internal/worker-lifecycle-state-machine.md if the executor wants an artifact, otherwise capture findings in commit messages.
1.1 Tasks
1. Trace the worker daemon spawn → terminate path end-to-end. Source order:
   - Hook entry → `src/shared/worker-utils.ts:ensureWorkerRunning` (lazy spawn) OR `src/services/worker-spawner.ts:ensureWorkerStarted` (explicit)
   - `spawnDaemon` (`src/services/infrastructure/ProcessManager.ts:408`): POSIX uses `setsid` if available, Windows uses `Start-Process -WindowStyle Hidden`
   - `--daemon` branch in `src/services/worker-service.ts:937`: duplicate-PID/duplicate-port guard
   - `WorkerService.start()` (line 258) → `startSupervisor()` → `server.listen()` → `writePidFile()` → `getSupervisor().registerProcess('worker', ...)` → `initializeBackground()`
   - Signal handlers via `configureSupervisorSignalHandlers` (`src/supervisor/index.ts:49`): SIGTERM/SIGINT; SIGHUP ignored in `--daemon` mode on POSIX
   - Shutdown: `WorkerService.shutdown()` → `performGracefulShutdown` → server close → `sessionManager.shutdownAll()` → mcp client close → chroma stop → db close → `getSupervisor().stop()` → `runShutdownCascade` → PID file unlink
2. Catalog every `process.exit(...)` site in worker-service.ts (already mapped: 21 sites; lines 764, 772, 794, 804, 810, 813, 828, 835, 842, 853, 870, 878, 888, 895, 916, 933, 945, 950, 971, 975, 991). Annotate each with: code, intent, whether it leaks the worker on the same path, whether shutdown ran first.
3. Catalog every retry / unreachable site:
   - `src/shared/worker-utils.ts:401` `recordWorkerUnreachable` (the #1874 counter)
   - `src/cli/handlers/{context,file-context,file-edit,summarize,observation,user-message,session-init}.ts`: every `executeWithWorkerFallback` caller
   - `src/servers/mcp-server.ts:72,100,145`: direct `workerHttpRequest`
   - `src/services/transcripts/processor.ts:331,371,373`: direct `workerHttpRequest`
   - `src/services/integrations/CursorHooksInstaller.ts:64,349,352`: direct `workerHttpRequest`
   - `src/utils/claude-md-utils.ts:305`: direct `workerHttpRequest`
4. Catalog every spawn site:
   - `spawnDaemon` (worker self-spawn)
   - `ChromaMcpManager.connectInternal` (chroma-mcp via uvx → uv → python → chroma-mcp)
   - `spawnSdkProcess` (`src/supervisor/process-registry.ts:532`): Claude SDK subprocesses
   - `runMcpSelfCheck` (`src/services/worker-service.ts:405`): MCP loopback probe via `process.execPath`
   - Any `execSync` / `execFile` / `spawnSync` in `ChromaMcpManager` (cert resolution) or `ProcessManager` (binary lookup, cwd-remap)
1.2 Acceptance criteria
- Markdown table written (commit message or scratch doc) listing every spawn and exit site with file:line.
- A 1-paragraph English description of the worker state machine (states + transitions) suitable to paste into PR descriptions.
- Confirmed list of which `executeWithWorkerFallback` callers run inside hooks (Claude Code's strict timeout window) vs. inside the worker (no timeout pressure). This drives Phase 4 circuit-breaker scoping.
1.3 Verification
- `grep -rn "process.exit" src/ --include="*.ts" | wc -l` matches the catalog.
- `grep -rn "executeWithWorkerFallback\|workerHttpRequest" src/ --include="*.ts" | grep -v worker-utils.ts | wc -l` matches the catalog.
1.4 Deliverable
Hand-off note for Phase 2-8 executors with file/line anchors; no code committed.
Phase 5 — PID/Port Reclamation & Race-Free Startup
Shipping order: Phase 5 first (per Phase 8 ordering). Idempotent and safe.
Goal: Eliminate the silent-exit-0 case where a fresh --daemon spawn loses the port race; harden cross-platform PID-reuse detection; serialize concurrent spawns with an OS-level advisory lock.
5.1 Files to modify
| File | Change |
|---|---|
| src/supervisor/process-registry.ts | Extend captureProcessStartToken for macOS (already partial via ps -o lstart) and Windows (wmic process where ProcessId=X get CreationDate /value). Add unit test for each platform branch. |
| src/supervisor/index.ts:validateWorkerPidFile | Add port-on-pid match check: if pidInfo.port !== currentExpectedPort, treat as 'stale'. |
| src/services/infrastructure/ProcessManager.ts | Add new exports: acquireDaemonLock() / releaseDaemonLock() using POSIX flock (via fcntl/flock syscall through bun:ffi or shelling to flock(1) on Linux only) and Windows mandatory file lock via LockFile (or fall back to atomic-rename sentinel on Windows). |
| src/services/worker-service.ts:937 (--daemon branch) | Wrap startup in acquireDaemonLock(). If port is in use, perform a /api/version probe; if the listener returns OUR BUILT_IN_VERSION → exit 0 (legit duplicate); if it returns a different version → log a warning and exit 0 (stale worker, will be restarted by version-mismatch path); if the listener doesn't respond → wait HOOK_TIMEOUTS.PORT_IN_USE_WAIT then write a clear stderr line with diagnostic before exiting. |
| src/services/worker-spawner.ts | Same lock acquisition before spawnDaemon. Release on success or error. |
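
The three port-in-use outcomes in the `--daemon` row can be sketched as a pure decision function. This is only an illustration: `ProbeResult`, `PortInUseAction`, and `decidePortInUseAction` are hypothetical names, not existing exports, and the probe itself (HTTP call, wait window) is assumed to happen elsewhere.

```typescript
// Hypothetical types; only the three outcomes come from the plan.
type ProbeResult =
  | { kind: "version"; version: string } // listener answered /api/version
  | { kind: "no-response" };             // port is taken but nothing answered

type PortInUseAction =
  | { exitCode: 0; reason: "legit-duplicate" }
  | { exitCode: 0; reason: "stale-worker"; warn: string }
  | { exitCode: 0; reason: "foreign-listener"; stderr: string };

function decidePortInUseAction(
  probe: ProbeResult,
  builtInVersion: string,
  port: number,
): PortInUseAction {
  if (probe.kind === "version") {
    if (probe.version === builtInVersion) {
      // Our own build is already listening: nothing to do.
      return { exitCode: 0, reason: "legit-duplicate" };
    }
    // A different build is listening; the version-mismatch path restarts it.
    return {
      exitCode: 0,
      reason: "stale-worker",
      warn: `Worker on port ${port} reports version ${probe.version}, expected ${builtInVersion}`,
    };
  }
  // Unknown listener: still exit 0, but say so on stderr instead of exiting silently.
  return {
    exitCode: 0,
    reason: "foreign-listener",
    stderr: `claude-mem worker port ${port} in use by an unidentified process; not spawning duplicate.`,
  };
}
```

Keeping the decision separate from the probe makes each branch unit-testable without opening a socket; every branch still exits 0, so the hook is never blocked.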
5.2 Detailed tasks
1. macOS start-time token: extend `captureProcessStartToken` (registry line 56). On Darwin, prefer `ps -p <pid> -o lstart=` (already in fallback path). Verify with `LC_ALL=C LANG=C` env so locale doesn't change the timestamp format. Add a comment explaining that `ps lstart` resolution is 1 second: collisions are still possible but vastly less likely than with no token.
2. Windows start-time token: add a Win32 branch using `wmic process where ProcessId=<pid> get CreationDate /value`. Parse the `CreationDate=YYYYMMDDHHMMSS.ffffff+TZ` line. Cache the wmic resolution per-pid for 5s (avoid re-shelling on repeat checks).
3. Port-on-pid match: in `validateWorkerPidFile`, after confirming `isPidAlive(pidInfo.pid)`, verify the recorded `pidInfo.port` is reachable via `isPortInUse(pidInfo.port)` AND the listener's `/api/version` returns a version string. If port is dead but PID alive → return `'stale'` (worker crashed mid-listen, PID about to be reused).
4. Advisory lock:
   - POSIX: open `<DATA_DIR>/.worker-spawn.lock` with `O_RDWR | O_CREAT`, `flock(fd, LOCK_EX | LOCK_NB)`. On EAGAIN, log `Another spawn in progress, waiting up to 5s` and retry with `LOCK_EX` (blocking) under a `setTimeout` race. Implement via `bun:ffi` for POSIX `flock(2)` if available, otherwise shell `flock -n -x <path> <command>`. Spike first: confirm bun's `bun:ffi` exposes `flock`. If not, use a watch-and-rename sentinel (less ideal but works).
   - Windows: Use `LockFile` via Win32 API or fall back to atomic `mkdirSync` of `<DATA_DIR>/.worker-spawn.lock.dir` (fails if exists) with stale-timeout cleanup at 30s.
5. Diagnostic stderr: when port-in-use without our worker responding, write to stderr (and log INFO) with: `claude-mem worker port <N> in use by an unidentified process; not spawning duplicate.` This must NOT block the hook; exit 0 still per CLAUDE.md.
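
For task 2, a minimal sketch of the `CreationDate` parsing and the 5s per-pid cache might look like the following. `parseWmicCreationDate` and `StartTokenCache` are hypothetical names, the clock is injected so the cache can be tested, and the sketch assumes the raw `YYYYMMDDHHMMSS.ffffff+TZ` string is used directly as the start token (the plan does not say whether it should be converted to a timestamp).

```typescript
// Extracts the raw CreationDate value from `wmic ... get CreationDate /value` output.
// wmic pads its output with blank lines and CRLF endings, hence the multiline regex.
function parseWmicCreationDate(output: string): string | null {
  const match = /^CreationDate=(\d{14}\.\d{6}[+-]\d{3})\s*$/m.exec(output);
  return match?.[1] ?? null;
}

// Hypothetical per-pid cache so repeat checks within ttlMs do not re-shell to wmic.
class StartTokenCache {
  private readonly entries = new Map<number, { token: string | null; cachedAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // `resolve` stands in for the real wmic call (an assumption of this sketch).
  get(pid: number, resolve: (pid: number) => string | null): string | null {
    const at = this.now();
    const hit = this.entries.get(pid);
    if (hit && at - hit.cachedAt < this.ttlMs) return hit.token;
    const token = resolve(pid);
    this.entries.set(pid, { token, cachedAt: at });
    return token;
  }
}
```

Comparing the raw string avoids time zone arithmetic entirely: two processes that reuse a PID differ in the token as long as their creation times differ by at least a microsecond.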
5.3 New settings
| Key | Default | Range | Purpose |
|---|---|---|---|
| CLAUDE_MEM_DAEMON_LOCK_TIMEOUT_MS | 5000 | 0–60000 | Max wait for the spawn lock |
| CLAUDE_MEM_PID_PORT_RECHECK_MS | 2000 | 500–30000 | Wait window before treating port-in-use without /api/version response as "unknown listener" |
5.4 Acceptance criteria
- Run two `claude-mem start` commands in parallel → exactly one daemon ends up alive; the other exits cleanly with a log line referencing the lock.
- Kill the worker with `-9` (skip cleanup), reuse the PID with `python -c 'import time; time.sleep(60)'` → `validateWorkerPidFile` returns `'stale'` and removes the file.
- On macOS, run worker, capture token, kill, spawn unrelated process with same PID, spawn worker again → token mismatch detected; old PID file ignored.
- `/api/version` probe path: spawn a fake server on the worker port → daemon exits 0 with the new diagnostic stderr, NOT silently.
5.5 Observability hooks
- Log `SYSTEM` INFO `Daemon spawn lock acquired` on success.
- Log `SYSTEM` WARN `Daemon spawn lock contention`, fields `{waitedMs}`.
- Log `SYSTEM` WARN `Worker port occupied by foreign listener`, fields `{port, probeStatus}`.
- New `/api/healthz` fields (added in Phase 7): `pid_file_path`, `pid_start_token`, `daemon_lock_held: bool`.
5.6 Verification checklist
- `grep "process.exit(0)" src/services/worker-service.ts`: count unchanged (no new silent exits introduced).
- Manual two-process race test (Linux + macOS + Windows VM).
- Existing health-check tests still pass.
- No new always-on `setInterval` introduced.
Phase 6 — DB Maintenance (VACUUM / WAL)
Ships alongside Phase 5 (idempotent).
Goal: Recover the 504 MB of free pages, prevent recurrence, surface DB-size metrics.
6.1 Files to modify
| File | Change |
|---|---|
| src/services/sqlite/Database.ts:27-32 and :69-74 | Add PRAGMA auto_vacuum = INCREMENTAL BEFORE the first table is created (only takes effect on a fresh DB; harmless on existing DBs but logs a no-op). For existing DBs, the migration path is the one-shot Phase-6 startup VACUUM. |
| src/services/maintenance/DbMaintenance.ts (new) | Periodic maintenance task: on a 24h timer (configurable), call PRAGMA incremental_vacuum, PRAGMA wal_checkpoint(TRUNCATE), then collect metrics (page_count, freelist_count, file size). Emit MAINTENANCE INFO log. Acquire dbMaintenanceMutex so other writers wait. |
| src/services/maintenance/DbMaintenance.ts | Startup check: if freelist_count / page_count > FREE_RATIO_VACUUM_THRESHOLD (default 0.40), perform full VACUUM after VACUUM INTO backup to <DATA_DIR>/backups/claude-mem-pre-vacuum-<ts>.db. Pause queue processor first. |
| src/services/worker-service.ts:initializeBackground | Wire the maintenance task: start after dbManager.initialize(). Timer must .unref(). |
| src/services/worker/SessionManager.ts | Expose pauseQueueProcessing(): Promise<void> and resumeQueueProcessing(): void. Use the existing AbortController + emitter to drain in-flight work; don't introduce new state. Maintenance acquires; readers continue (WAL allows them). |
| src/services/infrastructure/CleanupV12_4_3.ts:135 | Reuse the existing VACUUM INTO backup pattern verbatim; copy the disk-space pre-flight check (statfsSync, line 115). |
6.2 Detailed tasks
1. Auto-vacuum on new DBs: Add `PRAGMA auto_vacuum = INCREMENTAL` in `Database.ts` BEFORE `migrationRunner.runAllMigrations()`. Add a code comment noting that this is a no-op on existing DBs (the SQLite docs say a full VACUUM is required to change the auto_vacuum mode once tables exist). Document the migration path: existing users get the freed-page reclamation via the startup full VACUUM in step 3.
2. Periodic incremental vacuum + WAL checkpoint:
   - Schedule via `setInterval` with `.unref()`. Default cadence: 24h. Setting: `CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS` (default `24`, min `1`, max `168`).
   - Each tick: acquire mutex → `db.run('PRAGMA incremental_vacuum')` → `db.run('PRAGMA wal_checkpoint(TRUNCATE)')` → snapshot metrics → release.
   - Skip the tick if a `VACUUM` is in progress.
3. Startup full VACUUM (one-shot per session) when free-ratio is high:
   - Read `page_count` (`PRAGMA page_count`) and `freelist_count` (`PRAGMA freelist_count`).
   - If `freelist_count / page_count >= CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO` (default `0.40`), schedule a deferred VACUUM (5 minutes after worker becomes ready) to avoid slowing startup.
   - VACUUM steps: `VACUUM INTO '<backup>'` → verify backup → pause queue → `VACUUM` (full) → resume queue → log freed pages and ms taken. Per the 6.6 guard, the queue pause covers only the full `VACUUM`, not the backup.
   - Disk-space pre-flight: `statfsSync` (mirror `CleanupV12_4_3.ts:115`). Skip if free space < `1.2 * dbSize + 100MB`. Log `MAINTENANCE` ERROR in that case so the user sees actionable info.
4. Pause/resume hook in SessionManager: The existing `for await ... of getMessageIterator()` loop in the queue processor needs a "pause" semaphore. Implementation: add a `Promise<void>` gate that the iterator awaits before yielding. Maintenance flips it to a pending promise during VACUUM and resolves it to release. Do not abort in-flight messages; they can complete while new messages wait.
5. Cleanup-V12.4.3 regression detection: Re-scan `sdk_sessions WHERE project = OBSERVER_SESSIONS_PROJECT` and `pending_messages` matching the stuck-pending pattern at maintenance ticks. If any match AND the marker exists, log `MAINTENANCE` WARN and re-run the purge (idempotent). Setting: `CLAUDE_MEM_CLEANUP_REGRESSION_CHECK = true`.
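
The `Promise<void>` gate described in the pause/resume task could be sketched roughly as follows. `PauseGate` and `gated` are hypothetical names, and the real `getMessageIterator()` is stood in for by any `AsyncIterable`; this shows the gating technique, not the SessionManager integration.

```typescript
// Hypothetical gate: resolved while running, pending while paused.
class PauseGate {
  private gate: Promise<void> = Promise.resolve();
  private release: (() => void) | null = null;

  get paused(): boolean {
    return this.release !== null;
  }

  // Swap in a pending promise; consumers block on it until resume().
  pause(): void {
    if (this.release) return; // already paused, keep the same gate
    this.gate = new Promise<void>((resolve) => {
      this.release = resolve;
    });
  }

  // Resolve the pending promise so waiting consumers continue.
  resume(): void {
    const release = this.release;
    this.release = null;
    if (release) release();
  }

  wait(): Promise<void> {
    return this.gate;
  }
}

// Wraps a message source so each item waits on the gate before it is yielded.
// Items already handed to the consumer are unaffected; only new items wait.
async function* gated<T>(source: AsyncIterable<T>, gate: PauseGate): AsyncGenerator<T> {
  for await (const item of source) {
    await gate.wait();
    yield item;
  }
}
```

Because the wrapper only delays the next `yield`, a message that is mid-processing when `pause()` is called runs to completion, which matches the "do not abort in-flight messages" requirement.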
6.3 New settings
| Key | Default | Range | Purpose |
|---|---|---|---|
| CLAUDE_MEM_DB_MAINTENANCE_ENABLED | true | bool | Master kill-switch |
| CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS | 24 | 1–168 | Periodic cadence |
| CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO | 0.40 | 0.05–0.95 | Free-ratio above which we auto-VACUUM at startup |
| CLAUDE_MEM_DB_VACUUM_STARTUP_DELAY_MS | 300000 (5 min) | 0–3600000 | Defer startup VACUUM so it doesn't block readiness |
| CLAUDE_MEM_CLEANUP_REGRESSION_CHECK | true | bool | Re-scan v12.4.3-shaped pollution |
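
As a worked illustration of the startup check, the free-ratio threshold and the disk-space pre-flight reduce to two small predicates. The function names are hypothetical, the `>=` comparison follows task 3 in 6.2, and treating "100MB" as 100 MiB is an assumption of this sketch.

```typescript
interface DbPageStats {
  pageCount: number;     // PRAGMA page_count
  freelistCount: number; // PRAGMA freelist_count
}

// Fraction of pages sitting on the freelist; 0 for an empty database.
function freeRatio(stats: DbPageStats): number {
  return stats.pageCount > 0 ? stats.freelistCount / stats.pageCount : 0;
}

// True when the startup VACUUM should be scheduled (default threshold 0.40).
function shouldScheduleStartupVacuum(stats: DbPageStats, thresholdRatio = 0.4): boolean {
  return freeRatio(stats) >= thresholdRatio;
}

// Pre-flight: require 1.2x the DB size plus a fixed reserve before VACUUM.
// The reserve is interpreted as 100 MiB here (assumption).
function hasDiskHeadroom(freeBytes: number, dbSizeBytes: number): boolean {
  const reserveBytes = 100 * 1024 * 1024;
  return freeBytes >= 1.2 * dbSizeBytes + reserveBytes;
}
```

For the database described in this phase (about 504 MB free out of 532 MB), the ratio is roughly 0.95, well above the 0.40 default, so the deferred VACUUM would be scheduled.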
6.4 Acceptance criteria
- Reproduce the bloat scenario: stuff `pending_messages` with 100k stuck `processing` rows, run worker → startup VACUUM fires within 5 min after readiness, freed-pages log line appears, file size drops.
- Existing 532 MB DBs reclaim ≥ 95% of free pages on first run (matches the 28 MB target observed manually).
- Hot-ingestion test: enqueue 1000 observations during a maintenance tick → no `SQLITE_BUSY` or `database is locked` errors; queue resumes after VACUUM.
- `PRAGMA auto_vacuum` returns `2` (incremental) on freshly-created DBs.
- Maintenance loop ticks honor `.unref()`: `process.exit(0)` from a clean shutdown returns immediately, not after the 24h interval.
6.5 Observability hooks
- New log category: `MAINTENANCE`.
- Events: `MaintenanceStart`, `MaintenanceTick`, `VacuumStart`, `VacuumComplete` (`{freedPages, ms, dbSizeBeforeMb, dbSizeAfterMb}`), `VacuumSkippedLowDisk`, `RegressionDetected`, `MaintenanceComplete`.
- `/api/healthz` fields (Phase 7): `db_page_count`, `db_freelist_count`, `db_free_ratio_pct`, `db_size_bytes`, `db_last_vacuum_at`, `db_last_vacuum_freed_pages`, `db_last_maintenance_at`.
6.6 Anti-pattern guards
- Do not call `VACUUM` inside a transaction (sqlite errors).
- Do not hold the queue pause across the `VACUUM INTO` backup phase; only the final full `VACUUM` needs the writer-lock window. (`VACUUM INTO` works on a read-only snapshot.)
- Do not call `PRAGMA wal_checkpoint(FULL)`; TRUNCATE is required to actually shrink the WAL file.
6.7 Verification checklist
- Backup created at `<DATA_DIR>/backups/` before every full VACUUM.
- Maintenance timer registered with `.unref()` (grep for `setInterval` in the new file → `unref()` follows each).
- No new direct `setInterval` outside the maintenance file.
- PRAGMA list in `Database.ts` extended with `auto_vacuum` and includes a comment about migration.
Phase 2 — Stuck-Session Reaper (fix v12.4.3 bloat)
Goal: Stop pending_messages and sdk_sessions from accumulating zombies.
2.1 Files to modify
| File | Change |
|---|---|
| src/services/maintenance/SessionReaper.ts (new) | Periodic reaper. Plugs into the supervisor's existing health-checker.ts 30s tick (extend, do not replace). |
| src/supervisor/health-checker.ts:9 runHealthCheck | Call SessionReaper.tick() after pruneDeadEntries(). |
| src/services/worker/SessionManager.ts:deleteSession | After in-memory delete, call pendingStore.clearPendingForSession(sessionDbId) synchronously (it already does this via clearPendingForSession on a separate path; verify and unify). |
| src/services/sqlite/PendingMessageStore.ts | Add reapStuckProcessing(olderThanMs: number): number returning the count of rows reset to pending. |
| src/services/sqlite/SessionStore.ts | Add findInactiveSdkSessions(olderThanDays: number): Array<{id, project, contentSessionId, memorySessionId, lastActivityAt}>. |
| src/services/sqlite/SessionStore.ts | Add markSdkSessionInactive(id: number), which adds an inactive_at column or sets a sentinel. |
| src/services/sqlite/migrations/runner.ts | New migration: add inactive_at TEXT NULL to sdk_sessions if absent. |
2.2 Reaper logic
Per tick (default 30s, gated by CLAUDE_MEM_REAPER_ENABLED):
1. Stuck-processing sweep: `UPDATE pending_messages SET status='pending' WHERE status='processing' AND updated_at < <now - PROCESSING_STUCK_MS>` (default 5 minutes). Log count if > 0.
2. Orphan-pending sweep: `DELETE FROM pending_messages WHERE session_db_id NOT IN (SELECT id FROM sdk_sessions)` (defensive; should already be FK-protected, but log if any are deleted).
3. Inactive-session detection (does NOT delete):
   - SELECT sdk_sessions where `id NOT IN <in-memory session ids>` AND the last activity is more than N days old (last activity is computed as the MAX of related observations / pending_messages / session_summaries timestamps).
   - For each: `UPDATE sdk_sessions SET inactive_at = <now> WHERE id = ? AND inactive_at IS NULL`.
4. Observer-pollution regression check (matches Phase 6 task 5):
   - If `OBSERVER_SESSIONS_PROJECT` rows reappear after the v12.4.3 marker is present, re-run the purge SQL from `CleanupV12_4_3.runObserverSessionsPurge` (lines 196-218).
   - Log `MAINTENANCE` WARN with counts.
5. Hard delete is opt-in via `CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS` (default `0` = disabled; nonzero = days threshold). When enabled and a session has `inactive_at` older than the threshold AND no FK-referencing rows, hard-delete the session row. Default-off because user data safety > disk space.
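
The hard-delete preconditions in the last step can be expressed as a single predicate. This is a sketch under assumptions: `InactiveSessionRow` and `isHardDeleteEligible` are hypothetical names, `inactive_at` is assumed to hold an ISO-8601 timestamp (the plan only says `TEXT`), and the count of FK-referencing rows is assumed to be computed by the caller.

```typescript
// Hypothetical row shape for the eligibility check.
interface InactiveSessionRow {
  id: number;
  inactiveAt: string | null;   // ISO-8601 text from sdk_sessions.inactive_at (assumed format)
  referencingRowCount: number; // observations + session_summaries rows that still point here
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns true only when every opt-in precondition for a hard delete holds.
function isHardDeleteEligible(
  row: InactiveSessionRow,
  hardDeleteDays: number, // CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS; 0 disables
  nowMs: number,
): boolean {
  if (hardDeleteDays <= 0) return false;         // feature is off by default
  if (row.inactiveAt === null) return false;     // never marked inactive
  if (row.referencingRowCount > 0) return false; // user data still references it
  const inactiveMs = Date.parse(row.inactiveAt);
  if (Number.isNaN(inactiveMs)) return false;    // unparseable timestamp: keep the row
  return nowMs - inactiveMs > hardDeleteDays * DAY_MS;
}
```

Every uncertain case falls through to `false`, which mirrors the plan's stance that user data safety outranks disk space.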
2.3 New settings
| Key | Default | Range | Purpose |
|---|---|---|---|
| CLAUDE_MEM_REAPER_ENABLED | true | bool | Master switch |
| CLAUDE_MEM_REAPER_TICK_MS | 30000 | 5000–600000 | Tick cadence (piggy-backs supervisor; this value gates whether the reaper runs each tick) |
| CLAUDE_MEM_REAPER_PROCESSING_STUCK_MS | 300000 (5 min) | 30000–86400000 | Threshold for a processing row to be considered stuck |
| CLAUDE_MEM_REAPER_INACTIVE_DAYS | 30 | 1–365 | When to mark a session inactive_at |
| CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS | 0 | 0–365 | 0 = never; otherwise, hard-delete inactive rows older than N days |
2.4 Acceptance criteria
- Inject 50 stuck `processing` rows older than 5 minutes → next reaper tick resets them → `/api/healthz` shows `oldest_processing_pending_age_sec` drop to 0.
- Inject `OBSERVER_SESSIONS_PROJECT` rows post-marker → next tick logs regression and purges them.
- Reaper survives a worker restart without losing state (everything is DB-backed).
- Active sessions (in-memory) are NEVER marked inactive even if their last DB write is old (in-memory presence wins).
2.5 Observability
- Log: `MAINTENANCE` INFO `ReaperTick`, fields `{stuckProcessing, orphanPending, markedInactive, hardDeleted, observerRegression}`.
- New `/api/healthz` fields (Phase 7): `oldest_processing_pending_age_sec`, `processing_pending_count`, `pending_count_total`, `sdk_sessions_total`, `sdk_sessions_inactive`, `sdk_sessions_by_project: { [project]: count }`.
2.6 Verification checklist
- Migration adds `inactive_at` column without breaking existing data (test on a copy of a real DB).
- In-memory active sessions never appear in `findInactiveSdkSessions`.
- Reaper does NOT cascade-delete `observations` / `session_summaries` unless explicit hard-delete + zero-FK-reference precondition.
- `/api/healthz` shows reaper metrics.
Phase 3 — chroma-mcp Child-Process Supervisor
Goal: Stop the 23-concurrent-chroma-mcp leak. Bound concurrency, reap idle, scan for orphans at startup.
3.1 Files to modify
| File | Change |
|---|---|
| src/services/sync/ChromaMcpManager.ts | Add idle reaper; enforce single-instance via supervisor registry; add startup orphan scan; add lastCallAt timestamp updated by callTool. |
| src/services/sync/ChromaMcpManager.ts:ensureConnected (line 43) | Before connect, check getProcessRegistry().getAll().filter(r => r.type === 'chroma'): if non-empty AND PID alive AND PID not the current _process.pid, refuse to spawn (alert + reuse existing if possible; otherwise wait for backoff). |
| src/services/sync/ChromaMcpManager.ts:registerManagedProcess (line 613) | Already calls getSupervisor().registerProcess(CHROMA_SUPERVISOR_ID, ...); verify the supervisor enforces single-instance for this id. (Currently register is keyed by id so same id replaces; document this.) |
| src/supervisor/process-registry.ts | Add getActiveCountByType(type: string): number. Add findChromaOrphans(): Promise<number[]> using POSIX pgrep -af 'chroma-mcp' filtered by PPID == 1. |
| src/services/worker-service.ts:initializeBackground | After ChromaMcpManager.getInstance(), kick off await ChromaMcpManager.scanAndReapOrphans() (best-effort; never throws). |
3.2 Detailed tasks
1. Startup orphan scan: New static method `ChromaMcpManager.scanAndReapOrphans()`:
   - POSIX: `pgrep -af 'chroma-mcp'` → for each PID, check PPID. If PPID == 1 (re-parented to init), call `killProcessTree(pid)` (existing function at line 388). Log `CHROMA_MCP` INFO `ReapedOrphan`, fields `{pid, ageSec}`.
   - Windows: `Get-CimInstance Win32_Process -Filter "Name='chroma-mcp.exe'"`, filter by parent process state, kill with taskkill.
   - Bound the scan to processes whose command-line includes `chroma-mcp==<CHROMA_MCP_PINNED_VERSION>` to avoid killing unrelated chroma installations.
2. Idle reaper: Add `lastCallAt: number = 0` field to `ChromaMcpManager`. Update on every `callTool`. Run a `setInterval(checkIdle, 60_000)` (`.unref()`): if `connected && Date.now() - lastCallAt > CHROMA_MCP_IDLE_SHUTDOWN_MS` (default 15 min), call `await this.stop()`. Lazy-reconnect resumes on next `callTool`.
3. Single-instance guard on reconnect: In `ensureConnected`, before `connectInternal`, call `getProcessRegistry().getActiveCountByType('chroma')`. If > 0 AND the registered PID is alive but `this.connected === false`, this is a stale process (we lost track). Tear it down via `killProcessTree(registeredPid)` first, then proceed with fresh spawn. Otherwise the count grows by one each reconnect, which is exactly the leak observed.
4. Hard cap: extend `getSupervisor().assertCanSpawn('chroma mcp')` (already called at line 87) to actually count and reject. Cap = 1 chroma-mcp per worker. Cap = `TOTAL_PROCESS_HARD_CAP` (10) overall; already enforced for SDK processes, extend to chroma-mcp.
5. Tighten close path: in `connectInternal` (line 74), after `transport.close()` / `client.close()`, if the underlying `_process.pid` is still in the registry, call `killProcessTree` and `unregisterProcess` explicitly. Don't rely on `transport.onclose` alone: it has the stale-callback guard but doesn't always fire on connect-time failures.
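
The orphan-selection rule from task 1 (PPID == 1 and a command line pinned to our version) can be isolated from the process listing itself. The sketch below assumes the listing comes from something like `ps -e -o pid=,ppid=,args=` rather than `pgrep`, so that the PPID is available in one pass; `parsePsOutput` and `selectChromaOrphans` are hypothetical names.

```typescript
interface ProcessInfo {
  pid: number;
  ppid: number;
  command: string;
}

// Parses lines of the form "<pid> <ppid> <command line>"; unparseable lines are skipped.
function parsePsOutput(output: string): ProcessInfo[] {
  const rows: ProcessInfo[] = [];
  for (const line of output.split("\n")) {
    const match = /^\s*(\d+)\s+(\d+)\s+(.+?)\s*$/.exec(line);
    if (!match) continue;
    rows.push({ pid: Number(match[1]), ppid: Number(match[2]), command: match[3] ?? "" });
  }
  return rows;
}

// Selects PIDs that were re-parented to init AND carry our pinned package spec.
// Matching a whole argument (not a substring) keeps "==1.0" from matching "==1.0.5".
function selectChromaOrphans(processes: ProcessInfo[], pinnedVersion: string): number[] {
  const marker = `chroma-mcp==${pinnedVersion}`;
  return processes
    .filter((p) => p.ppid === 1)
    .filter((p) => p.command.split(/\s+/).some((arg) => arg === marker))
    .map((p) => p.pid);
}
```

An empty listing simply yields an empty array, so the "pgrep returning empty" case in the verification checklist needs no special handling in this shape.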
3.3 New settings
| Key | Default | Range | Purpose |
|---|---|---|---|
| CLAUDE_MEM_CHROMA_IDLE_SHUTDOWN_MS | 900000 (15 min) | 60000–86400000 | Idle reaper threshold |
| CLAUDE_MEM_CHROMA_ORPHAN_SCAN_ON_START | true | bool | Master switch for startup scan |
| CLAUDE_MEM_CHROMA_MAX_CONCURRENT | 1 | 1–4 | Cap chroma-mcp instances per worker |
3.4 Acceptance criteria
- Spawn 5 chroma-mcp processes manually parented to init; restart worker → all 5 are reaped at startup.
- Force connect-time failure (kill transport mid-connect) 10 times → registry count never exceeds 1.
- Run worker for 30 min with no chroma calls → process is reaped after 15 min and `getProcessRegistry().getActiveCountByType('chroma')` returns 0.
- `callTool` after idle-shutdown lazy-reconnects successfully.
3.5 Observability
- Log: `CHROMA_MCP` INFO `OrphanScan` `{found, killed}`.
- Log: `CHROMA_MCP` INFO `IdleShutdown` `{idleMs}`.
- Log: `CHROMA_MCP` WARN `RegistryStale` when single-instance guard tears down a phantom.
- `/api/healthz` fields (Phase 7): `chroma_mcp_pid_count`, `chroma_mcp_last_call_at`, `chroma_mcp_state` (`'connected'|'disconnected'|'backoff'`), `chroma_mcp_backoff_remaining_ms`.
3.6 Anti-pattern guards
- Do not kill chroma processes whose command-line doesn't match `chroma-mcp==<PINNED_VERSION>`; the pattern could match unrelated user installs.
- Do not spin up the idle-reaper timer if `chromaMcpManager` is null (chroma disabled via `CLAUDE_MEM_CHROMA_ENABLED=false`).
- Do not call `getProcessRegistry()` from outside the worker process; it's worker-internal.
3.7 Verification checklist
- After 2.5 hours of normal use, `ps aux | grep '[c]hroma-mcp' | wc -l` ≤ 1 (the bracketed pattern keeps `grep` from counting its own process).
- Idle-reaper timer is `.unref()`d.
- Orphan scan tolerates `pgrep` returning empty (no false-error logs).
- Build still passes on Windows (Win32 branch compiles even if not unit-tested).
Phase 4 — Circuit Breaker for Retry Storms
Goal: Replace the unbounded counter at worker-utils.ts:401 with a real circuit breaker. Stop hooks from hammering the worker when it's down.
4.1 Files to modify
| File | Change |
|---|---|
| src/shared/worker-circuit-breaker.ts (new) | CircuitBreaker class: states CLOSED, OPEN, HALF_OPEN. Persist to ~/.claude-mem/state/circuit-breaker.json. |
| src/shared/worker-utils.ts:executeWithWorkerFallback (line 443) | Wrap the call in breaker.run(...). On OPEN, return WorkerFallback immediately (no HTTP). |
| src/shared/worker-utils.ts:recordWorkerUnreachable (line 401) | Becomes a thin shim that calls breaker.recordFailure(). Hard cap (MAX_LIFETIME_FAILURES = 50) trips the breaker permanently until manual reset. |
| src/shared/worker-utils.ts:resetWorkerFailureCounter (line 419) | Becomes breaker.recordSuccess(). |
| src/cli/hook-command.ts | Verify the swallowed-stderr fix from observation 2026-05-07 is applied (it's marked as a "no-op replacement bug"). The breaker's stderr-fail-loud path must actually write to process.stderr.write(), not a stub. |
| src/services/server/Server.ts | Add /api/admin/breaker/reset POST endpoint (gated by localhost only) for manual unsticking. |
4.2 Breaker semantics
States and transitions:
CLOSED ──[N consecutive failures]──> OPEN
OPEN ──[reset_timeout_ms elapsed]──> HALF_OPEN
HALF_OPEN ──[1 success]──> CLOSED
HALF_OPEN ──[1 failure]──> OPEN (resets timer)
ANY ──[lifetime failures > MAX_LIFETIME_FAILURES]──> OPEN_PERMANENT (until manual reset via API or settings reload)
Defaults:
| Setting | Default | Range |
|---|---|---|
| CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD | 5 | 1–50 |
| CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS | 30000 | 1000–600000 |
| CLAUDE_MEM_BREAKER_HALF_OPEN_MAX_PROBES | 1 | 1–10 |
| CLAUDE_MEM_BREAKER_LIFETIME_CAP | 50 | 0–10000 (0 = no cap) |
Persistent state file shape:
{
"state": "CLOSED|OPEN|HALF_OPEN|OPEN_PERMANENT",
"consecutiveFailures": 0,
"lifetimeFailures": 0,
"openedAt": null,
"lastFailureAt": null,
"lastSuccessAt": null,
"lastTrippedAt": null
}
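
The transitions and state shape above can be sketched as a small in-memory class. This is an illustration under assumptions, not the planned `CircuitBreaker`: the class name `CircuitBreakerSketch` is hypothetical, timestamps are assumed to be epoch milliseconds, persistence is left out, and the half-open probe limit (`CLAUDE_MEM_BREAKER_HALF_OPEN_MAX_PROBES`) is not modeled.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN" | "OPEN_PERMANENT";

interface BreakerSnapshot {
  state: BreakerState;
  consecutiveFailures: number;
  lifetimeFailures: number;
  openedAt: number | null; // epoch ms (assumed representation)
  lastFailureAt: number | null;
  lastSuccessAt: number | null;
  lastTrippedAt: number | null;
}

interface BreakerConfig {
  failureThreshold: number;
  resetTimeoutMs: number;
  lifetimeCap: number; // 0 = no cap
}

function freshSnapshot(): BreakerSnapshot {
  return {
    state: "CLOSED", consecutiveFailures: 0, lifetimeFailures: 0,
    openedAt: null, lastFailureAt: null, lastSuccessAt: null, lastTrippedAt: null,
  };
}

class CircuitBreakerSketch {
  private snap: BreakerSnapshot = freshSnapshot();
  private readonly config: BreakerConfig;
  private readonly now: () => number;

  constructor(config: BreakerConfig, now: () => number = () => Date.now()) {
    this.config = config;
    this.now = now;
  }

  // OPEN is promoted to HALF_OPEN lazily, once the reset timeout has elapsed.
  getState(): BreakerState {
    const { state, openedAt } = this.snap;
    if (state === "OPEN" && openedAt !== null && this.now() - openedAt >= this.config.resetTimeoutMs) {
      this.snap.state = "HALF_OPEN";
    }
    return this.snap.state;
  }

  canAttempt(): boolean {
    const state = this.getState();
    return state === "CLOSED" || state === "HALF_OPEN";
  }

  recordFailure(): void {
    const at = this.now();
    const state = this.getState();
    this.snap.consecutiveFailures += 1;
    this.snap.lifetimeFailures += 1;
    this.snap.lastFailureAt = at;
    if (state === "OPEN_PERMANENT") return;
    const { lifetimeCap, failureThreshold } = this.config;
    if (lifetimeCap > 0 && this.snap.lifetimeFailures > lifetimeCap) {
      this.trip("OPEN_PERMANENT", at);
    } else if (state === "HALF_OPEN" || this.snap.consecutiveFailures >= failureThreshold) {
      this.trip("OPEN", at); // a failed half-open probe restarts the timer
    }
  }

  recordSuccess(): void {
    if (this.getState() === "OPEN_PERMANENT") return; // only forceReset() leaves this state
    this.snap.state = "CLOSED";
    this.snap.consecutiveFailures = 0;
    this.snap.openedAt = null;
    this.snap.lastSuccessAt = this.now();
  }

  forceReset(): void {
    this.snap = freshSnapshot();
  }

  snapshot(): BreakerSnapshot {
    return { ...this.snap };
  }

  private trip(state: "OPEN" | "OPEN_PERMANENT", at: number): void {
    this.snap.state = state;
    this.snap.openedAt = at;
    this.snap.lastTrippedAt = at;
  }
}
```

Injecting the clock keeps the transition logic deterministic under test; a persistence layer could serialize `snapshot()` after each call that changes state.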
4.3 Detailed tasks
- `CircuitBreaker` class: pure state-machine logic; the only I/O is persisting the JSON snapshot. Methods: `getState()`, `canAttempt()`, `recordFailure(reason)`, `recordSuccess()`, `forceReset()`. Persist the snapshot with atomic file writes (write tmp + rename), mirroring `writeHookFailureStateAtomic` (`worker-utils.ts:372`).
- Wire into `executeWithWorkerFallback`:

      if (!breaker.canAttempt()) {
        // Optional: print one-line stderr if state changed during this call
        return { continue: true, reason: 'circuit_breaker_open', [WORKER_FALLBACK_BRAND]: true };
      }
      const alive = await ensureWorkerAliveOnce();
      if (!alive) { breaker.recordFailure('unreachable'); ... }
      ...
      if (response.ok) breaker.recordSuccess();

- Fail-loud stderr fix: The 2026-05-07 observation mentions a "stderr no-op replacement bug" in `hookCommand`. Investigate `src/cli/hook-command.ts` for any `process.stderr.write` shim that suppresses output. The breaker's diagnostic ("Worker unreachable; circuit breaker OPEN; will retry in Xs") MUST appear on the user's terminal so they know what's happening. Test by intentionally killing the worker and running a hook; the message should appear on stderr.
- Manual reset endpoint: `POST /api/admin/breaker/reset` (no body required). Restricted to `127.0.0.1` only. Logs `SYSTEM` `WARN` `BreakerForceReset` with caller info.
- Lifetime cap: when `lifetimeFailures > CLAUDE_MEM_BREAKER_LIFETIME_CAP`, transition to `OPEN_PERMANENT`. The only way out is the manual-reset API or restarting the worker with a fresh state file. Print prominent stderr: `claude-mem: 50 lifetime worker failures detected. Disabling memory hooks until reset. Run: claude-mem worker doctor.`
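The write-tmp-then-rename persistence mentioned in the first task could look roughly like the sketch below. `writeJsonAtomic` and `readJsonOrDefault` are hypothetical helper names (the plan only says to mirror `writeHookFailureStateAtomic`, whose body is not shown here), and the demo uses a throwaway temp path rather than the real state file. Rename is atomic on POSIX filesystems when source and target share a filesystem; behavior on Windows when another process holds the target open should be verified separately.

```typescript
import { mkdirSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical helper: write JSON so readers never observe a half-written file.
function writeJsonAtomic(filePath: string, data: unknown): void {
  mkdirSync(dirname(filePath), { recursive: true });
  // The temp file lives in the same directory so rename() stays on one filesystem.
  const tmpPath = `${filePath}.${process.pid}.tmp`;
  writeFileSync(tmpPath, JSON.stringify(data, null, 2), "utf8");
  renameSync(tmpPath, filePath);
}

// Hypothetical helper: a missing or corrupt file falls back to the given default.
function readJsonOrDefault<T>(filePath: string, fallback: T): T {
  try {
    return JSON.parse(readFileSync(filePath, "utf8")) as T;
  } catch {
    return fallback;
  }
}

// Usage with a throwaway path (the real file would be ~/.claude-mem/state/circuit-breaker.json).
const demoPath = join(tmpdir(), `claude-mem-demo-${process.pid}`, "circuit-breaker.json");
writeJsonAtomic(demoPath, { state: "CLOSED", consecutiveFailures: 0, lifetimeFailures: 3 });
const restored = readJsonOrDefault(demoPath, { state: "CLOSED", consecutiveFailures: 0, lifetimeFailures: 0 });
```

Falling back to a default on a corrupt file keeps hooks running, but it also silently drops the lifetime counter, so an executor may prefer to log that case.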
4.4 Acceptance criteria
- Kill the worker, run 100 hooks → exactly `CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD` HTTP attempts made; rest short-circuit.
- After 30s idle, next hook makes ONE probe (HALF_OPEN); if probe succeeds, breaker closes.
- Lifetime cap (set to 5 for testing): 6th lifetime failure → permanent open until `POST /api/admin/breaker/reset` clears it.
- Stderr message visible to user when breaker opens (manual repro: kill worker, run 5+ hooks).
- Existing `hook-failures.json` file is migrated to the new breaker JSON format on first run (one-shot migration in `worker-utils.ts`).
4.5 Observability
- Log: `SYSTEM` `WARN` `BreakerOpened`, fields `{lifetime, consecutiveBefore}`.
- Log: `SYSTEM` `INFO` `BreakerHalfOpen`.
- Log: `SYSTEM` `INFO` `BreakerClosed`, fields `{recoveredAfterMs}`.
- Log: `SYSTEM` `ERROR` `BreakerOpenedPermanent`.
- `/api/healthz` fields (Phase 7): `breaker_state`, `breaker_consecutive_failures`, `breaker_lifetime_failures`, `breaker_opened_at`, `breaker_total_trips`.
4.6 Anti-pattern guards
- Do not call the breaker from inside the worker process; it's a hook-side concern. The worker has `RestartGuard` for its own session-level limits.
- Do not auto-reset the lifetime counter on restart; persist it. Otherwise restart-loops mask the underlying failure.
- Do not block the breaker reset endpoint on initialization (`/api/admin/breaker/reset` should work even if `initializationCompleteFlag === false`).
4.7 Verification checklist
- No call site bypasses the breaker (grep for `workerHttpRequest` outside `executeWithWorkerFallback` and audit each; some integrations may need `breaker.canAttempt()` guards added).
- State file readable/writable across process restarts.
- Stderr fail-loud path verified end-to-end on Linux + macOS + Windows Terminal.
- No `process.exit(1)` introduced; breaker tripping returns `WorkerFallback`, not exit codes.
Phase 7 — /api/healthz Endpoint with Concrete Metrics
Goal: Centralized observability so future regressions are detectable at a glance.
7.1 Files to modify
| File | Change |
|---|---|
| `src/services/worker/http/routes/HealthzRoutes.ts` (new) | Implements `RouteHandler`. `GET /api/healthz` and `/api/healthz?format=prom`. |
| `src/services/worker-service.ts:registerRoutes` | Register the new `HealthzRoutes(...)`. |
| `src/services/worker/MetricsCollector.ts` (new) | Aggregates metrics; refreshed on the supervisor's existing 30s health-check tick to avoid amplifying load. |
| `src/supervisor/health-checker.ts:runHealthCheck` | Call `MetricsCollector.refresh()` after `pruneDeadEntries`. |
7.2 Endpoint contract
GET /api/healthz → 200 JSON:
{
"status": "ok|degraded|unhealthy",
"ts": "2026-05-07T21:30:00.000Z",
"uptime_sec": 12345,
"versions": {
"plugin": "12.7.5",
"worker": "12.7.5",
"matches": true
},
"process": {
"pid": 12345,
"rss_mb": 145.2,
"event_loop_lag_ms": 3.1,
"managed": true,
"platform": "darwin"
},
"pid_file": {
"path": "/Users/.../worker.pid",
"start_token": "Wed May 7 14:23:15 2026",
"daemon_lock_held": true
},
"db": {
"path": "/Users/.../claude-mem.db",
"size_bytes": 31457280,
"page_count": 7680,
"freelist_count": 12,
"free_ratio_pct": 0.16,
"last_vacuum_at": "2026-05-07T20:00:00.000Z",
"last_vacuum_freed_pages": 130000,
"last_maintenance_at": "2026-05-07T20:00:00.000Z",
"oldest_processing_pending_age_sec": 4,
"processing_pending_count": 1,
"pending_count_total": 12,
"sdk_sessions_total": 145,
"sdk_sessions_inactive": 13,
"sdk_sessions_by_project": { "claude-mem": 25, "...": 120 }
},
"child_processes": {
"chroma_mcp_pid_count": 1,
"chroma_mcp_last_call_at": "2026-05-07T21:25:11.000Z",
"chroma_mcp_state": "connected",
"chroma_mcp_backoff_remaining_ms": 0,
"sdk_process_count": 0,
"supervisor_registry_size": 2
},
"network": {
"hook_consecutive_failures": 0,
"breaker_state": "CLOSED",
"breaker_consecutive_failures": 0,
"breaker_lifetime_failures": 3,
"breaker_opened_at": null,
"breaker_total_trips": 1,
"last_request_at": "2026-05-07T21:29:55.000Z",
"request_rate_per_min": 12.3
},
"ai": {
"provider": "claude",
"auth_method": "...",
"last_interaction": { ... }
}
}
GET /api/healthz?format=prom → 200 text/plain with Prometheus text format. One metric per JSON leaf (e.g. claude_mem_db_free_ratio_pct 0.16).
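The "one metric per JSON leaf" rule could be sketched as a recursive walk over the snapshot. This is an illustrative assumption rather than the planned implementation: `toPrometheus` is a hypothetical name, only finite numbers and booleans are emitted (the text format needs numeric sample values, so string and null leaves are skipped), every metric is typed as a gauge, and keys such as project names are folded into the metric name where a real exporter would more likely use labels.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Hypothetical helper: flatten numeric/boolean leaves into Prometheus text lines.
function toPrometheus(snapshot: { [key: string]: Json }, prefix = "claude_mem"): string {
  const lines: string[] = [];

  const walk = (node: Json, path: string[]): void => {
    if (typeof node === "boolean" || (typeof node === "number" && Number.isFinite(node))) {
      // Metric names may only contain [a-zA-Z0-9_:], so sanitize each path segment.
      const name = [prefix, ...path].join("_").replace(/[^a-zA-Z0-9_]/g, "_");
      lines.push(`# HELP ${name} ${path.join(".")}`);
      lines.push(`# TYPE ${name} gauge`);
      lines.push(`${name} ${Number(node)}`); // booleans become 0 or 1
    } else if (node !== null && typeof node === "object" && !Array.isArray(node)) {
      for (const [key, value] of Object.entries(node)) {
        walk(value, [...path, key]);
      }
    }
    // Strings, nulls, arrays, and non-finite numbers are skipped in this sketch.
  };

  walk(snapshot, []);
  return lines.join("\n") + "\n";
}

// Usage with a small slice of the JSON contract above.
const promText = toPrometheus({
  db: { free_ratio_pct: 0.16, path: "/tmp/claude-mem.db" },
  process: { managed: true },
});
```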
status derivation:
- `unhealthy` if breaker is OPEN_PERMANENT, OR DB initialization failed, OR chroma-mcp pid count > `CLAUDE_MEM_CHROMA_MAX_CONCURRENT`.
- `degraded` if breaker is OPEN, OR free ratio > 0.4 (that is, `free_ratio_pct` > 40, since the JSON field is a percentage), OR oldest_processing_pending > 1 hour, OR worker version mismatches plugin version.
- `ok` otherwise.
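As a minimal sketch, the derivation rules can be expressed as a pure function. The `StatusInputs` shape and the `deriveStatus` name are hypothetical, and the free-space input is assumed to be a 0..1 ratio (the `free_ratio_pct` JSON field divided by 100).

```typescript
type HealthStatus = "ok" | "degraded" | "unhealthy";

// Hypothetical input shape; field names are illustrative, not from the codebase.
interface StatusInputs {
  breakerState: "CLOSED" | "OPEN" | "HALF_OPEN" | "OPEN_PERMANENT";
  dbInitFailed: boolean;
  chromaPidCount: number;
  chromaMaxConcurrent: number; // CLAUDE_MEM_CHROMA_MAX_CONCURRENT
  freeRatio: number;           // 0..1, i.e. free_ratio_pct / 100
  oldestProcessingPendingAgeSec: number;
  versionsMatch: boolean;
}

function deriveStatus(i: StatusInputs): HealthStatus {
  // "unhealthy" conditions are checked first so they always win over "degraded".
  if (
    i.breakerState === "OPEN_PERMANENT" ||
    i.dbInitFailed ||
    i.chromaPidCount > i.chromaMaxConcurrent
  ) {
    return "unhealthy";
  }
  if (
    i.breakerState === "OPEN" ||
    i.freeRatio > 0.4 ||
    i.oldestProcessingPendingAgeSec > 3600 ||
    !i.versionsMatch
  ) {
    return "degraded";
  }
  return "ok";
}

// Usage: values taken from the example response above.
const healthyInputs: StatusInputs = {
  breakerState: "CLOSED",
  dbInitFailed: false,
  chromaPidCount: 1,
  chromaMaxConcurrent: 1,
  freeRatio: 0.0016,
  oldestProcessingPendingAgeSec: 4,
  versionsMatch: true,
};
const exampleStatus = deriveStatus(healthyInputs);
```

Note that HALF_OPEN maps to `ok` here only because the rules above do not mention it; whether a probing breaker should count as degraded is left to the executor.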
7.3 Detailed tasks
- `MetricsCollector` class: a `Map<string, unknown>` snapshot. Public `refresh()` collects fresh data; public `getSnapshot()` returns the cached object. Refresh is called by the 30s health-check tick AND on-demand if last refresh > 5s ago (debounced).
- DB metrics queries (use `db.prepare` + `.get()` unless noted):
  - `PRAGMA page_count` → `{ page_count: number }`
  - `PRAGMA freelist_count` → `{ freelist_count: number }`
  - `PRAGMA page_size` → for size_bytes computation
  - `SELECT MIN(updated_at) FROM pending_messages WHERE status='processing'` (with `julianday` math for age in seconds)
  - `SELECT project, COUNT(*) FROM sdk_sessions GROUP BY project` (use `.all()`; this returns one row per project, and the `project` column is needed to build `sdk_sessions_by_project`)
- Process metrics: `process.memoryUsage().rss / 1024 / 1024`. Event-loop lag via `perf_hooks.monitorEventLoopDelay` (Node API; bun support is still to be verified, see Appendix C), sampled over a 30s window.
- Network metrics: maintain a rolling 1-min request counter in middleware (existing `createMiddleware` in `Server.ts:156`). Increment on each `/api/*` request.
- Prometheus format: emit `# HELP` and `# TYPE` lines per metric. Use the same naming convention (`claude_mem_<group>_<name>`).
- Compatibility: leave `/api/health` UNCHANGED (existing integrations break otherwise). `/api/healthz` is the new richer endpoint.
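The rolling 1-minute request counter from the network-metrics task could be sketched as follows. `RollingRequestCounter` is a hypothetical class, not an existing one, and how it hooks into `createMiddleware` is left open; the sketch keeps raw timestamps, which is simple but assumes request volume on a local worker stays small.

```typescript
// Hypothetical helper: counts requests seen within the trailing window.
class RollingRequestCounter {
  private timestamps: number[] = [];

  constructor(
    private readonly windowMs: number = 60_000,
    private readonly now: () => number = Date.now, // injectable clock for tests
  ) {}

  // Call once per /api/* request.
  record(): void {
    const t = this.now();
    this.evict(t);
    this.timestamps.push(t);
  }

  // Requests per minute, normalized in case the window is not exactly 60s.
  ratePerMin(): number {
    this.evict(this.now());
    return this.timestamps.length * (60_000 / this.windowMs);
  }

  // Drop timestamps that have aged out of the window.
  private evict(t: number): void {
    const cutoff = t - this.windowMs;
    this.timestamps = this.timestamps.filter((ts) => ts > cutoff);
  }
}

// Usage with a fake clock so the window behavior is visible.
let fakeNow = 0;
const counter = new RollingRequestCounter(60_000, () => fakeNow);
counter.record();
counter.record();
counter.record();
fakeNow = 30_000;
counter.record();
const rateAt30s = counter.ratePerMin();
fakeNow = 60_000;
const rateAt60s = counter.ratePerMin();
```

A fixed array of per-second buckets would bound memory more tightly if request rates turn out higher than expected.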
7.4 Acceptance criteria
- `curl 127.0.0.1:<port>/api/healthz | jq .status` returns `ok` on a healthy worker.
- After Phase 6 ships, `db.free_ratio_pct` updates at 30s cadence (verify by manually inflating freelist).
- Phase 4 breaker state changes are visible within 30s.
- `?format=prom` parses with `promtool check metrics`.
- No new endpoint blocks for > 50ms (snapshot is cached; refresh is async).
7.5 Observability hooks (yes, for the observability endpoint itself)
- Log `WORKER` `DEBUG` `MetricsRefresh`, fields `{durationMs}`.
- Log `WORKER` `WARN` `MetricsRefreshSlow` if refresh > 250ms (DB query stall signal).
7.6 Verification checklist
- `/api/health` response body unchanged byte-for-byte (regression test).
- All Phase 2-6 metrics exposed (cross-check the field list in those phases).
- `?format=prom` output validates with `promtool` if available; otherwise visual inspection.
- Endpoint mounted via `RouteHandler` pattern (no direct `app.get` in worker-service.ts).
Phase 8 — Observability, CLI, & Rollout
Goal: User-facing surface so operators can see what the new machinery did. Ordered last to allow phases 2-7 to stabilize.
8.1 Files to modify
| File | Change |
|---|---|
| `src/cli/handlers/worker-doctor.ts` (new) | New CLI subcommand `claude-mem worker doctor`: fetches `/api/healthz`, formats it for terminals, includes recent reaper actions. |
| `src/services/worker-service.ts:main()` | Register the `worker doctor` CLI route (alongside existing `cursor`, `gemini-cli` cases). |
| `plugin/scripts/worker-cli.js` | Wire to the new doctor command. |
| `CLAUDE.md` (project root) | Document new settings under a "Worker Maintenance" section. |
| `docs/public/` (optional) | User-facing explanation of the breaker, reaper, and health endpoint. |
8.2 worker doctor output (example)
claude-mem worker doctor
Status: OK
Version: plugin=12.7.5 worker=12.7.5 (match)
Uptime: 3h 25m
PID: 12345 (lock held: yes)
Database:
Size: 32 MB (free: 0.16%)
Last vacuum: 4h ago, freed 130k pages
Pending: 12 total / 1 processing (oldest 4s)
SDK sessions: 145 total / 13 inactive
Child processes:
chroma-mcp: 1 (last call: 5s ago, state: connected)
SDK processes: 0
Supervisor: 2 entries
Circuit breaker:
State: CLOSED
Consecutive: 0
Lifetime: 3
Total trips: 1
Recent maintenance (last 24h):
2026-05-07 20:00 Vacuum: freed 130k pages in 1.4s
2026-05-07 19:30 Reaper: 5 stuck-processing reset, 2 inactive marked
2026-05-07 18:00 Chroma orphan scan: 0 found
If status != ok, append a "Recommended actions" block:
- breaker open → `claude-mem worker reset-breaker`
- DB free ratio high → mention next vacuum window
- chroma orphans → `claude-mem worker reap-chroma`
8.3 Detailed tasks
- Doctor command: GET `/api/healthz` via `workerHttpRequest`. Format as in the 8.2 example output. Color-code (red/yellow/green) using existing chalk integration if present, otherwise plain text. JSON pass-through via `--json` flag.
- Recent-actions feed: store the last 50 maintenance events in a circular buffer in `MetricsCollector` (in-memory only; it survives one worker lifetime and is not persistent). Expose at `/api/healthz/events` (separate to avoid bloating the main response).
- Update CLAUDE.md: add a "Worker Maintenance" section with: settings reference table, the doctor command, a brief description of the reaper/breaker/vacuum behavior. Per CLAUDE.md "Important: No need to edit the changelog ever", only edit CLAUDE.md, never CHANGELOG.
- Rollout ordering (per problem statement constraint):
  - Wave 1 (idempotent, low-risk): Phase 5 (PID/port reclamation), Phase 6 (DB maintenance).
  - Wave 2 (reapers, needs careful testing on busy DBs): Phase 2 (session reaper), Phase 3 (chroma supervisor).
  - Wave 3 (user-visible behavior change): Phase 4 (circuit breaker), Phase 7 (`/api/healthz`).
  - Wave 4 (CLI surface): Phase 8 (doctor command, docs).
Each wave can ship as a separate release. Inter-wave dependencies: Phase 7 depends on data sources from Phases 2/3/4/6 — but the endpoint can ship with partial data (fields gated by phase availability).
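For the recent-actions feed in the tasks above, a bounded in-memory buffer is enough. The sketch below is one possible shape under stated assumptions: `EventRing` and `MaintenanceEvent` are hypothetical names, and the event fields are guessed from the 8.2 example output rather than taken from the codebase.

```typescript
// Hypothetical event shape, modeled on the "Recent maintenance" lines in 8.2.
interface MaintenanceEvent {
  ts: string;     // ISO timestamp
  kind: string;   // e.g. "vacuum", "reaper", "chroma-orphan-scan"
  detail: string; // human-readable summary
}

// Hypothetical bounded buffer: keeps only the newest `capacity` items.
class EventRing<T> {
  private items: T[] = [];

  constructor(private readonly capacity: number = 50) {}

  push(item: T): void {
    this.items.push(item);
    if (this.items.length > this.capacity) {
      this.items.shift(); // drop the oldest entry
    }
  }

  // Oldest first; returns a copy so callers cannot mutate the buffer.
  toArray(): T[] {
    return [...this.items];
  }
}

// Usage: a capacity of 3 makes the eviction easy to see.
const ring = new EventRing<MaintenanceEvent>(3);
for (let i = 1; i <= 5; i++) {
  ring.push({ ts: `2026-05-07T20:0${i}:00.000Z`, kind: "reaper", detail: `run ${i}` });
}
const recentDetails = ring.toArray().map((e) => e.detail);
```

With a capacity of 50, the array-shift approach is cheap enough that a true index-based ring is probably unnecessary.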
8.4 Acceptance criteria
- `claude-mem worker doctor` prints a green-OK summary on a healthy worker.
- `claude-mem worker doctor --json` returns valid JSON pipeable to `jq`.
- Killing the worker → `claude-mem worker doctor` cleanly reports `Worker unreachable` instead of hanging.
- CLAUDE.md updates are limited to a new section; no churn elsewhere.
8.5 Verification checklist
- `claude-mem worker doctor` exits 0 on healthy state, 1 on unhealthy, 2 if worker unreachable (mirrors hook-exit-codes convention).
- No new public marketplace API surface beyond what's documented.
- Doctor command works without the worker running (unreachable path covered).
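The exit-code rule in the checklist can be captured in a small pure function. `DoctorProbe` and `doctorExitCode` are hypothetical names, and treating `degraded` as exit 0 is an assumption: the checklist only specifies healthy, unhealthy, and unreachable.

```typescript
type HealthStatus = "ok" | "degraded" | "unhealthy";

// Hypothetical result of trying to fetch /api/healthz.
type DoctorProbe =
  | { reachable: true; status: HealthStatus }
  | { reachable: false };

function doctorExitCode(probe: DoctorProbe): 0 | 1 | 2 {
  if (!probe.reachable) return 2; // worker unreachable
  // Assumption: "degraded" still exits 0; only "unhealthy" exits 1.
  return probe.status === "unhealthy" ? 1 : 0;
}

// Usage: one code per outcome.
const exitCodes = [
  doctorExitCode({ reachable: true, status: "ok" }),
  doctorExitCode({ reachable: true, status: "unhealthy" }),
  doctorExitCode({ reachable: false }),
];
```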
Final Phase — Cross-Phase Verification
Goal: Prove the system works end-to-end before declaring victory.
F.1 Soak test (24h)
Run the worker for 24 hours under realistic Claude Code usage. After 24h:
| Metric | Pass criterion |
|---|---|
| `ps aux \| grep chroma-mcp \| wc -l` | ≤ 1 |
| `ps aux \| grep claude-mem \| wc -l` | ≤ a small constant (1-2) |
| DB size growth rate | < 5 MB/hr; free_ratio < 20% |
| `/api/healthz` `network.breaker_lifetime_failures` | < 10 (vs. the #1874 starting baseline) |
| Stuck `processing` rows older than 10 min | 0 |
| Worker memory RSS | < 300 MB (no leak) |
F.2 Failure-injection tests
| Inject | Expected behavior |
|---|---|
| Kill worker via `kill -9` | Lazy-respawn on next hook; PID file cleaned |
| Two parallel `claude-mem start` | Exactly one daemon survives; lock log line visible |
| 100 stuck processing rows | Reaper resets all within REAPER_PROCESSING_STUCK_MS + REAPER_TICK_MS |
| Spawn fake listener on worker port | New --daemon exits 0 with diagnostic stderr (no silent exit) |
| Fork 5 chroma-mcp orphans | Worker startup reaps all 5 |
| Pull network during 10 hooks | Breaker opens after threshold; subsequent hooks short-circuit |
F.3 Anti-pattern grep
# No new always-on intervals
grep -rn "setInterval" src/ --include="*.ts" | grep -v "unref()" | grep -v "^src/.*test"
# No new process.exit(1) on hook paths
git diff main -- src/shared/worker-utils.ts src/cli/ | grep "process.exit(1)"
# No invented settings
git diff main -- src/shared/SettingsDefaultsManager.ts | grep "CLAUDE_MEM_"
# Cross-reference with all phases' settings tables.
# No hardcoded magic numbers in business logic
git diff main | grep -E "[0-9]{4,}" | grep -v SettingsDefaultsManager | grep -v test
F.4 Documentation diff
- `CLAUDE.md` adds: Worker Maintenance section (Phase 8.3).
- `docs/public/` (optional): user-facing explanation.
- No CHANGELOG edits (auto-generated per CLAUDE.md).
F.5 Sign-off checklist
- All 8 phases shipped.
- `/api/healthz` reports `status: "ok"` 24h after deployment.
- No new ERROR-level logs in production for 24h (excluding pre-existing).
- Manual `worker doctor` on 3 production-like environments confirms expected output.
- Phase 0 doc-discovery anti-patterns not violated (grep `git log -p`).
Appendix A — Settings Reference (consolidated)
All settings declared in src/shared/SettingsDefaultsManager.ts:
| Setting | Phase | Default | Range |
|---|---|---|---|
| `CLAUDE_MEM_DAEMON_LOCK_TIMEOUT_MS` | 5 | `5000` | `0–60000` |
| `CLAUDE_MEM_PID_PORT_RECHECK_MS` | 5 | `2000` | `500–30000` |
| `CLAUDE_MEM_DB_MAINTENANCE_ENABLED` | 6 | `true` | `bool` |
| `CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS` | 6 | `24` | `1–168` |
| `CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO` | 6 | `0.40` | `0.05–0.95` |
| `CLAUDE_MEM_DB_VACUUM_STARTUP_DELAY_MS` | 6 | `300000` | `0–3600000` |
| `CLAUDE_MEM_CLEANUP_REGRESSION_CHECK` | 6 | `true` | `bool` |
| `CLAUDE_MEM_REAPER_ENABLED` | 2 | `true` | `bool` |
| `CLAUDE_MEM_REAPER_TICK_MS` | 2 | `30000` | `5000–600000` |
| `CLAUDE_MEM_REAPER_PROCESSING_STUCK_MS` | 2 | `300000` | `30000–86400000` |
| `CLAUDE_MEM_REAPER_INACTIVE_DAYS` | 2 | `30` | `1–365` |
| `CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS` | 2 | `0` | `0–365` |
| `CLAUDE_MEM_CHROMA_IDLE_SHUTDOWN_MS` | 3 | `900000` | `60000–86400000` |
| `CLAUDE_MEM_CHROMA_ORPHAN_SCAN_ON_START` | 3 | `true` | `bool` |
| `CLAUDE_MEM_CHROMA_MAX_CONCURRENT` | 3 | `1` | `1–4` |
| `CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD` | 4 | `5` | `1–50` |
| `CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS` | 4 | `30000` | `1000–600000` |
| `CLAUDE_MEM_BREAKER_HALF_OPEN_MAX_PROBES` | 4 | `1` | `1–10` |
| `CLAUDE_MEM_BREAKER_LIFETIME_CAP` | 4 | `50` | `0–10000` |
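One way to honor the Default and Range columns is to parse each raw value, fall back to the default when it is missing or not a number, and clamp it into range otherwise. This is a sketch under assumptions: `NumericSetting` and `resolveNumericSetting` are hypothetical, and whether `SettingsDefaultsManager` should clamp or reject out-of-range values is not specified in this plan.

```typescript
// Hypothetical description of one numeric row from the table above.
interface NumericSetting {
  key: string;
  defaultValue: number;
  min: number;
  max: number;
}

// Hypothetical resolver: missing/invalid input -> default, otherwise clamp into [min, max].
function resolveNumericSetting(spec: NumericSetting, raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return spec.defaultValue;
  const parsed = Number(raw);
  if (!Number.isFinite(parsed)) return spec.defaultValue;
  return Math.min(spec.max, Math.max(spec.min, parsed));
}

// Usage with the CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS row.
const resetTimeoutSpec: NumericSetting = {
  key: "CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS",
  defaultValue: 30000,
  min: 1000,
  max: 600000,
};
const resolvedExamples = {
  unset: resolveNumericSetting(resetTimeoutSpec, undefined),
  tooLow: resolveNumericSetting(resetTimeoutSpec, "500"),
  tooHigh: resolveNumericSetting(resetTimeoutSpec, "999999999"),
  garbage: resolveNumericSetting(resetTimeoutSpec, "abc"),
  valid: resolveNumericSetting(resetTimeoutSpec, "45000"),
};
```

Clamping keeps a misconfigured worker running with a sane value; rejecting with a logged warning would surface typos sooner, so the choice is a trade-off for the executor.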
Appendix B — File Change Summary
| File | Phases that touch it |
|---|---|
| `src/services/worker-service.ts` | 3 (initializeBackground), 5 (--daemon), 6 (maintenance wiring), 7 (route registration), 8 (CLI) |
| `src/services/worker-spawner.ts` | 5 |
| `src/services/infrastructure/ProcessManager.ts` | 5 (lock + start-token) |
| `src/services/infrastructure/HealthMonitor.ts` | 5 (port-on-pid match) |
| `src/services/infrastructure/CleanupV12_4_3.ts` | 6 (regression detection, read only) |
| `src/services/sync/ChromaMcpManager.ts` | 3 |
| `src/supervisor/index.ts` | 5 (validateWorkerPidFile) |
| `src/supervisor/process-registry.ts` | 3 (orphan scan), 5 (start-token) |
| `src/supervisor/health-checker.ts` | 2 (reaper), 7 (metrics refresh) |
| `src/services/worker/SessionManager.ts` | 2 (delete hook), 6 (pause/resume) |
| `src/shared/worker-utils.ts` | 4 (breaker integration) |
| `src/services/sqlite/Database.ts` | 6 (auto_vacuum) |
| `src/services/sqlite/PendingMessageStore.ts` | 2 (reapStuckProcessing) |
| `src/services/sqlite/SessionStore.ts` | 2 (findInactiveSdkSessions) |
| `src/services/sqlite/migrations/runner.ts` | 2 (inactive_at column) |
| `src/services/server/Server.ts` | 4 (breaker reset), 7 (healthz route) |
| `src/shared/SettingsDefaultsManager.ts` | 2-6 (settings keys) |
| `src/services/maintenance/DbMaintenance.ts` | 6 (NEW) |
| `src/services/maintenance/SessionReaper.ts` | 2 (NEW) |
| `src/shared/worker-circuit-breaker.ts` | 4 (NEW) |
| `src/services/worker/MetricsCollector.ts` | 7 (NEW) |
| `src/services/worker/http/routes/HealthzRoutes.ts` | 7 (NEW) |
| `src/cli/handlers/worker-doctor.ts` | 8 (NEW) |
| `CLAUDE.md` | 8 (Worker Maintenance section) |
Appendix C — Open Questions for Executor
- `bun:ffi` flock support: confirm via spike before committing Phase 5.4. If unavailable, fall back to `flock(1)` shell on Linux + atomic `mkdirSync` sentinel on macOS/Windows.
- Event-loop lag sampling on bun: verify `perf_hooks.monitorEventLoopDelay` works in bun's Node-compat layer. If not, fall back to a setImmediate-based heuristic.
- Existing-DB auto_vacuum migration: verify that the startup full VACUUM in Phase 6.3 is sufficient to reclaim the 504 MB without requiring users to run `PRAGMA auto_vacuum = INCREMENTAL; VACUUM;` manually. (It should: a full VACUUM with auto_vacuum already set takes effect.)
- Pro-features compatibility: confirm with maintainers that `/api/healthz` does not duplicate any planned Pro endpoint. Per CLAUDE.md "Pro Features Architecture", the worker's local HTTP API stays open, so `/api/healthz` is fine to add OSS-side.