Observations were 100% failing on Claude Code 2.1.109+ because the Agent
SDK emits ["--setting-sources", ""] when settingSources defaults to [].
The existing Bun-workaround filter stripped the empty string but left
the orphan --setting-sources flag, which then consumed --permission-mode
as its value, crashing the subprocess with:
Error processing --setting-sources:
Invalid setting source: --permission-mode.
Make the filter pair-aware: when an empty arg follows a --flag, drop
both so the SDK default (no setting sources) is preserved by omission.
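A minimal sketch of a pair-aware filter, assuming flags are recognized by a leading "--"; the helper name dropEmptyArgPairs is hypothetical and not taken from the codebase:

```typescript
// Hypothetical helper: removes empty-string args, and also removes the
// --flag that immediately precedes an empty value.
function dropEmptyArgPairs(args: readonly string[]): string[] {
  const result: string[] = [];
  for (let i = 0; i < args.length; i++) {
    const arg = args[i];
    const next = args[i + 1];
    if (arg.startsWith("--") && next === "") {
      i++; // skip the flag and its empty value together
      continue;
    }
    if (arg === "") {
      continue; // a stray empty arg with no preceding flag
    }
    result.push(arg);
  }
  return result;
}
```

With this rule, ["--setting-sources", "", "--permission-mode", "default"] reduces to ["--permission-mode", "default"] (the value "default" is only an illustration), so no flag is left behind to swallow its neighbor.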
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: add session lifecycle guards to prevent runaway API spend (#1590)
Three root causes allowed 30+ subprocess accumulation over 36 hours:
1. SIGTERM-killed processes (code 143) triggered crash recovery and
immediately respawned — now detected and treated as intentional
termination (aborts controller so wasAborted=true in .finally).
2. No wall-clock limit: sessions ran for 13+ hours continuously
spending tokens — now refuses new generators after 4 hours and
drains the pending queue to prevent further spawning.
3. Duplicate --resume processes for the same session UUID — now
killed and unregistered before a new spawn is registered.
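The first guard can be sketched roughly as follows. The function name and return shape are assumptions for illustration; only the exit code 143 and the abort-on-SIGTERM behavior come from the description above:

```typescript
// 128 + 15 (SIGTERM) = 143, the conventional exit status for a
// SIGTERM-terminated process.
const SIGTERM_EXIT_CODE = 143;

// Hypothetical helper: returns true when the exit should be treated as
// intentional termination, and aborts the controller so later cleanup
// code can observe signal.aborted === true instead of triggering recovery.
function isIntentionalTermination(
  exitCode: number | null,
  controller: AbortController
): boolean {
  if (exitCode === SIGTERM_EXIT_CODE) {
    controller.abort();
    return true;
  }
  return false;
}
```

A caller would skip crash recovery whenever this returns true, which is what stops the immediate respawn described in item 1.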
Generated by Claude Code
Vibe coded by ousamabenyounes
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: use normalized errorMsg in logger.error payload and annotate SIGTERM override
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: use persisted createdAt for wall-clock guard and bind abortController locally to prevent stale abort
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: re-trigger CodeRabbit review after rate limit reset
* fix: defer process unregistration until exit and align boundary test with strict > (#1693)
- ProcessRegistry: don't unregister PID immediately after SIGTERM — let the
existing 'exit' handler clean up when the process actually exits, preventing
tracking loss for still-live processes.
- Test: align wall-clock boundary test with production's strict `>` operator
(exactly 4h is NOT terminated, only >4h is).
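The boundary behavior can be illustrated with a small sketch. The constant and function names are hypothetical; the 4-hour limit and the strict `>` comparison are taken from the commit text:

```typescript
// Assumed name for the 4-hour wall-clock limit, in milliseconds.
const MAX_SESSION_WALL_CLOCK_MS = 4 * 60 * 60 * 1000;

// Hypothetical guard: compares the persisted creation time against "now".
// Strict ">" means a session that is exactly 4h old is still allowed.
function hasExceededWallClock(createdAtMs: number, nowMs: number): boolean {
  return nowMs - createdAtMs > MAX_SESSION_WALL_CLOCK_MS;
}
```

Passing the clock in as a parameter, rather than calling Date.now() inside, is a design choice made here so the boundary can be tested deterministically.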
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
Bun's child_process.spawn() silently drops empty string arguments from
argv, unlike Node which preserves them. When the Agent SDK defaults
settingSources to [] (empty array), [].join(",") produces "" which gets
pushed as ["--setting-sources", ""]. Bun drops the "", causing
--permission-mode to be consumed as the value for --setting-sources:
Error processing --setting-sources: Invalid setting source: --permission-mode
This caused 100% observation failure (exit code 1 on every SDK subprocess
spawn), resulting in 0 observations stored across all sessions.
The fix filters empty string args before passing to spawn(), making the
behavior consistent between Node and Bun runtimes.
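The origin of the empty argument, and the simple filter described here, can be reproduced in a few lines. Variable names and the "default" permission-mode value are assumptions for illustration:

```typescript
// An empty settingSources array joins to an empty string.
const settingSources: string[] = [];
const spawnArgs: string[] = [
  "--setting-sources",
  settingSources.join(","), // "" when the array is empty
  "--permission-mode",
  "default",
];

// The filter from this commit: drop empty strings before calling spawn(),
// so both runtimes see the same argv.
const filteredArgs = spawnArgs.filter((arg) => arg !== "");
```

Note that this plain filter leaves "--setting-sources" in place without a value; it only makes Node and Bun agree on what is passed.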
Fixes #1779
Related: #1660
Co-authored-by: bswnth48 <69203760+bswnth48@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add embedded Process Supervisor for unified process lifecycle management
Consolidates scattered process management (ProcessManager, GracefulShutdown,
HealthMonitor, ProcessRegistry) into a unified src/supervisor/ module.
New: ProcessRegistry with JSON persistence, env sanitizer (strips CLAUDECODE_*
vars), graceful shutdown cascade (SIGTERM → 5s wait → SIGKILL with tree-kill
on Windows), PID file liveness validation, and singleton Supervisor API.
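As an illustration of the env sanitizer idea, here is a minimal sketch. The list contents and the function name sanitizeEnv are assumptions; the real module may match different keys:

```typescript
// Assumed contents; the real lists may differ.
const ENV_PREFIXES: readonly string[] = ["CLAUDECODE_"];
const ENV_EXACT_MATCHES: readonly string[] = ["CLAUDECODE"];

// Hypothetical sanitizer: returns a copy of env without the blocked keys,
// so a spawned worker does not look like a nested Claude Code session.
function sanitizeEnv(
  env: Record<string, string | undefined>
): Record<string, string> {
  const clean: Record<string, string> = {};
  for (const [key, value] of Object.entries(env)) {
    if (value === undefined) continue;
    if (ENV_EXACT_MATCHES.includes(key)) continue;
    if (ENV_PREFIXES.some((prefix) => key.startsWith(prefix))) continue;
    clean[key] = value;
  }
  return clean;
}
```

The result could be passed as the env option when spawning a child, for example sanitizeEnv(process.env).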
Fixes #1352 (worker inherits CLAUDECODE env causing nested sessions)
Fixes #1356 (zombie TCP socket after Windows reboot)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add session-scoped process reaping to supervisor
Adds reapSession(sessionId) to ProcessRegistry for killing session-tagged
processes on session end. SessionManager.deleteSession() now triggers reaping.
Tightens orphan reaper interval from 60s to 30s.
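A rough sketch of session-scoped reaping, assuming an in-memory list of tagged entries and an injected kill function. Everything except the reapSession name is hypothetical:

```typescript
interface TrackedProcess {
  pid: number;
  sessionId?: string; // untagged processes are never reaped by session
}

type KillFn = (pid: number, signal: "SIGTERM") => void;

// Hypothetical registry: signals every process tagged with the session.
class SessionTaggedRegistry {
  private entries: TrackedProcess[] = [];
  private readonly kill: KillFn;

  constructor(kill: KillFn) {
    this.kill = kill;
  }

  register(entry: TrackedProcess): void {
    this.entries.push(entry);
  }

  // Returns the PIDs that were signaled.
  reapSession(sessionId: string): number[] {
    const signaled: number[] = [];
    for (const entry of this.entries) {
      if (entry.sessionId !== sessionId) continue;
      try {
        this.kill(entry.pid, "SIGTERM");
      } catch {
        // The process may already be gone; treat that as reaped.
      }
      signaled.push(entry.pid);
    }
    return signaled;
  }
}
```

In real use the kill function could be (pid, signal) => process.kill(pid, signal); injecting it keeps the sketch testable without spawning anything.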
Fixes #1351 (MCP server processes leak on session end)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add Unix domain socket support for worker communication
Introduces socket-manager.ts for UDS-based worker communication, eliminating
port 37777 collisions between concurrent sessions. Worker listens on
~/.claude-mem/sockets/worker.sock by default with TCP fallback.
All hook handlers, MCP server, health checks, and admin commands updated to
use socket-aware workerHttpRequest(). Backwards compatible — settings can
force TCP mode via CLAUDE_MEM_WORKER_TRANSPORT=tcp.
Fixes #1346 (port 37777 collision across concurrent sessions)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: remove in-process worker fallback from hook command
Removes the fallback path where hook scripts started WorkerService in-process,
making the worker a grandchild of Claude Code (killed by sandbox). Hooks now
always delegate to ensureWorkerStarted() which spawns a fully detached daemon.
Fixes #1249 (grandchild process killed by sandbox)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add health checker and /api/admin/doctor endpoint
Adds 30-second periodic health sweep that prunes dead processes from the
supervisor registry and cleans stale socket files. Adds /api/admin/doctor
endpoint exposing supervisor state, process liveness, and environment health.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add comprehensive supervisor test suite
64 tests covering all supervisor modules: process registry (18 tests),
env sanitizer (8), shutdown cascade (10), socket manager (15), health
checker (5), and supervisor API (6). Includes persistence, isolation,
edge cases, and cross-module integration scenarios.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: revert Unix domain socket transport, restore TCP on port 37777
The socket-manager introduced UDS as default transport, but this broke
the HTTP server's TCP accessibility (viewer UI, curl, external monitoring).
Since there's only ever one worker process handling all sessions, the
port collision rationale for UDS doesn't apply. Reverts to TCP-only,
removing ~900 lines of unnecessary complexity.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: remove dead code found in pre-landing review
Remove unused `acceptingSpawns` field from Supervisor class (written but
never read — assertCanSpawn uses stopPromise instead) and unused
`buildWorkerUrl` import from context handler.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* updated gitignore
* fix: address PR review feedback - downgrade HTTP logging, clean up gitignore, harden supervisor
- Downgrade request/response HTTP logging from info to debug to reduce noise
- Remove unused getWorkerPort imports, use buildWorkerUrl helper
- Export ENV_PREFIXES/ENV_EXACT_MATCHES from env-sanitizer, reuse in Server.ts
- Fix isPidAlive(0) returning true (should be false)
- Add shutdownInitiated flag to prevent signal handler race condition
- Make validateWorkerPidFile testable with pidFilePath option
- Remove unused dataDir from ShutdownCascadeOptions
- Upgrade reapSession log from debug to warn
- Rename zombiePidFiles to deadProcessPids (returns actual PIDs)
- Clean up gitignore: remove duplicate datasets/, stale ~*/ and http*/ patterns
- Fix tests to use temp directories instead of relying on real PID file
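The isPidAlive(0) item above is worth a sketch, since signal 0 sent to PID 0 targets the caller's own process group and therefore appears to succeed. The body below is an assumption about how such a check is commonly written, not the project's actual code:

```typescript
// Liveness probe using signal 0, which performs error checking only and
// does not deliver a signal.
function isPidAlive(pid: number): boolean {
  // PID 0 and negative values address process groups, not one process,
  // so they are rejected up front.
  if (!Number.isInteger(pid) || pid <= 0) {
    return false;
  }
  try {
    process.kill(pid, 0);
    return true;
  } catch (error) {
    // EPERM: the process exists but we may not signal it.
    return (error as { code?: string }).code === "EPERM";
  }
}
```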
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
proc.killed only means Node sent a signal — the process can still be alive.
This caused premature pool slot release, allowing unbounded process spawning.
- ensureProcessExit: remove proc.killed from early-exit checks, only trust exitCode
- Fix 3 call-site guards that skipped cleanup for signaled-but-alive processes
- Add TOTAL_PROCESS_HARD_CAP=10 safety net in waitForSlot()
- After SIGKILL, wait up to 1s via exit event instead of blind 200ms sleep
- Reduce reaper interval from 5min to 1min, idle threshold from 2min to 1min
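The distinction between "signaled" and "exited" can be sketched against a minimal process-like interface. The names below are hypothetical; the behavior follows the bullets above (trust exitCode, wait on the exit event with a bound):

```typescript
// Minimal shape of the child-process fields this sketch relies on.
interface ExitAware {
  exitCode: number | null;
  killed: boolean; // true once a signal was sent; says nothing about exit
  once(event: "exit", listener: () => void): unknown;
}

// Only exitCode is trusted, as in the commit. (Node also exposes
// signalCode for signal-terminated children; it is not used here.)
function hasExited(proc: ExitAware): boolean {
  return proc.exitCode !== null;
}

// Resolves true if the process exits within timeoutMs, false otherwise.
function waitForExit(proc: ExitAware, timeoutMs: number): Promise<boolean> {
  if (hasExited(proc)) return Promise.resolve(true);
  return new Promise<boolean>((resolve) => {
    const timer = setTimeout(() => resolve(false), timeoutMs);
    proc.once("exit", () => {
      clearTimeout(timer);
      resolve(true);
    });
  });
}
```

After a SIGKILL, a caller could use waitForExit(proc, 1000) in place of a fixed sleep, and release the pool slot only when it resolves true.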
Closes #1226
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* feat: configurable subprocess pool limit for SDK agents
Prevents runaway accumulation of Claude SDK agent subprocesses by
enforcing a configurable concurrency limit.
- New CLAUDE_MEM_MAX_CONCURRENT_AGENTS setting (default: 2)
- Promise-based waitForSlot() in ProcessRegistry (not polling per
review feedback on #830)
- Waiters are notified via unregisterProcess when a slot frees up
- SDKAgent.startSession() waits for a slot before spawning
- 60s timeout prevents indefinite waits
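A promise-based slot wait, as opposed to polling, might look like the sketch below. The class name SlotPool and its methods are hypothetical; the default limit of 2 and the 60s timeout come from the list above:

```typescript
// Hypothetical pool: waiters are queued as callbacks and woken when a
// slot is released, so nothing polls.
class SlotPool {
  private active = 0;
  private waiters: Array<() => void> = [];
  private readonly limit: number;

  constructor(limit = 2) {
    this.limit = limit;
  }

  register(): void {
    this.active++;
  }

  unregister(): void {
    this.active = Math.max(0, this.active - 1);
    const next = this.waiters.shift();
    if (next) next(); // notify the oldest waiter
  }

  waitForSlot(timeoutMs = 60000): Promise<void> {
    if (this.active < this.limit) return Promise.resolve();
    return new Promise<void>((resolve, reject) => {
      const onFree = () => {
        clearTimeout(timer);
        resolve();
      };
      const timer = setTimeout(() => {
        // Remove this waiter so a later release does not wake it.
        this.waiters = this.waiters.filter((w) => w !== onFree);
        reject(new Error("Timed out waiting for an agent slot"));
      }, timeoutMs);
      this.waiters.push(onFree);
    });
  }
}
```

A caller would await pool.waitForSlot() before spawning and call register() immediately afterwards; this sketch does not reserve the slot between those two steps.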
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
* fix: remove unused originalUnregister const and getActiveCount import
Cleanup from Greptile review:
- Remove dead `originalUnregister` variable in ProcessRegistry
- Remove unused `getActiveCount` import in SDKAgent
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Happy <yesreply@happy.engineering>
* fix: SDK Agent fails on Windows when username contains spaces
Fixes spawn failure on Windows when the user's path contains spaces
(e.g., C:\Users\Anderson Wang\).
Root cause:
- SDKAgent.ts returns full auto-detected path with spaces
- ProcessRegistry.ts cannot execute .cmd files when path contains spaces
Solution:
- SDKAgent: On Windows, prefer "claude.cmd" via PATH instead of full path
- ProcessRegistry: Use cmd.exe /d /c wrapper for .cmd files on Windows
This preserves argument boundaries (e.g., empty string values) while
properly handling paths with spaces.
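The wrapper approach can be sketched as a pure function that decides what to hand to spawn(). The function name and return shape are assumptions; "/d" (skip AutoRun commands) and "/c" (run the command, then exit) are standard cmd.exe switches:

```typescript
interface SpawnTarget {
  file: string;
  args: string[];
}

// Hypothetical helper: on Windows, route .cmd scripts through cmd.exe and
// keep every argument as a separate array element.
function resolveSpawnTarget(
  command: string,
  args: readonly string[],
  platform: string
): SpawnTarget {
  const isCmdScript =
    platform === "win32" && command.toLowerCase().endsWith(".cmd");
  if (!isCmdScript) {
    return { file: command, args: [...args] };
  }
  return { file: "cmd.exe", args: ["/d", "/c", command, ...args] };
}
```

In practice the platform argument would be process.platform; it is a parameter here so the Windows branch can be exercised on any host.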
Fixes #1014
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* docs: add Windows spawn path with spaces fix documentation
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
- Strip CLAUDECODE env var from SDK subprocesses to prevent "cannot be
launched inside another Claude Code session" error (Claude Code 2.1.42+)
- Lazy-load @chroma-core/default-embed to avoid eagerly pulling in
sharp native binaries at bundle startup (fixes ERR_DLOPEN_FAILED)
- Add stderr capture to SDK spawn for diagnosing future process failures
- Exclude lockfiles from marketplace rsync and delete stale lockfiles
before npm install to prevent native dep version mismatches
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
During SIGHUP testing with 6+ active sessions, ChromaSync.ensureConnection()
had no mutex — concurrent fire-and-forget syncObservation() calls each spawned
a chroma-mcp subprocess via StdioClientTransport, creating 641 orphans in ~5min.
Error-driven reconnection formed a positive feedback loop amplifying the storm.
Defense layers:
- Layer 0: Connection mutex via promise memoization (prevents concurrent spawns)
- Layer 1: Pre-spawn process count guard using execFileSync('ps') (kills excess)
- Layer 2: Hardened close() with try-finally + Unix pkill in GracefulShutdown
- Layer 3: Count-based orphan reaper in ProcessManager (not age-based)
- Layer 4: Circuit breaker stops retries after 3 consecutive failures for 60s
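Layer 0 can be illustrated with a small sketch of promise memoization. The class name and the injected connect function are hypothetical; the point is that concurrent callers share one in-flight promise instead of each starting a subprocess:

```typescript
// Hypothetical wrapper around an expensive, spawn-backed connect().
class MemoizedConnector<T> {
  private pending: Promise<T> | null = null;
  private readonly connect: () => Promise<T>;

  constructor(connect: () => Promise<T>) {
    this.connect = connect;
  }

  ensureConnection(): Promise<T> {
    if (this.pending === null) {
      this.pending = this.connect().catch((error: unknown) => {
        // Clear the memo so a later call may retry after a failure.
        this.pending = null;
        throw error;
      });
    }
    return this.pending;
  }
}
```

Because the memo is assigned synchronously before any await, two fire-and-forget callers in the same tick cannot both reach connect().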
Closes #1063, closes #695
Relates to #1010, #707
Add killIdleDaemonChildren() to clean up SDK-spawned claude processes
that don't terminate after completing their work.
Problem:
- Worker-service daemon spawns Claude SDK processes
- These processes remain alive after work completes
- They accumulate over time, consuming significant memory
- Existing killSystemOrphans() only handles PPID=1 orphans
Solution:
- Add killIdleDaemonChildren() that finds claude processes where:
- Parent PID = daemon's PID (children of worker-service)
- CPU = 0% (idle, not actively working)
- Running > 2 minutes (completed their work)
- Call it from reapOrphanedProcesses() (runs every 5 minutes)
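The selection criteria can be expressed as a pure filter over already-parsed process rows. The row shape and function name are assumptions, and parsing of ps output is left out of this sketch:

```typescript
// Assumed shape of one parsed process-table row.
interface ProcessRow {
  pid: number;
  ppid: number;
  cpuPercent: number;
  elapsedSeconds: number;
  command: string;
}

// Hypothetical selector: claude processes that are direct children of the
// daemon, show 0% CPU, and have been running longer than the threshold.
function selectIdleDaemonChildren(
  rows: readonly ProcessRow[],
  daemonPid: number,
  minElapsedSeconds = 120
): number[] {
  return rows
    .filter((row) => row.ppid === daemonPid)
    .filter((row) => row.command.includes("claude"))
    .filter((row) => row.cpuPercent === 0)
    .filter((row) => row.elapsedSeconds > minElapsedSeconds)
    .map((row) => row.pid);
}
```

Matching on the command string is what keeps other daemon children, such as the MCP server or Chroma, out of the result.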
Testing:
- Verified locally: 15+ zombie processes cleaned up
- Memory saved: ~2GB
- Normal processes (MCP server, Chroma) unaffected
Co-Authored-By: Claude <noreply@anthropic.com>
The previous approach (PR #837) set CLAUDE_CONFIG_DIR to isolate observer
sessions from `claude --resume`. However, this broke authentication because
Claude Code stores credentials in the config directory.
This fix uses the SDK's `cwd` option instead:
- Observer sessions run with cwd=~/.claude-mem/observer-sessions/
- Project name = path.basename(cwd) = "observer-sessions"
- Sessions won't appear when running `claude --resume` from actual projects
- Authentication works because ~/.claude/ config is preserved
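A short sketch of why the cwd approach yields a fixed project name. The constant name follows the rename mentioned in this commit; how it is defined in paths.ts is an assumption:

```typescript
import * as os from "node:os";
import * as path from "node:path";

// Assumed definition; the real paths.ts may build this differently.
const OBSERVER_SESSIONS_DIR = path.join(
  os.homedir(),
  ".claude-mem",
  "observer-sessions"
);

// The project name is derived from the last path segment of cwd, so every
// observer session lands under the same name.
const observerProjectName = path.basename(OBSERVER_SESSIONS_DIR);
```

Since no config directory is overridden, credentials under ~/.claude/ remain where Claude Code expects them.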
Changes:
- ProcessRegistry.ts: Remove CLAUDE_CONFIG_DIR override from spawn
- SDKAgent.ts: Add cwd option to query() pointing to observer dir
- paths.ts: Rename OBSERVER_CONFIG_DIR to OBSERVER_SESSIONS_DIR
Fixes regression from #837
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Observer sessions created by claude-mem were appearing in the main Claude Code
session picker (`claude --resume`), cluttering the list with internal plugin
sessions that users never intend to resume.
In one user's case: 74 observer sessions out of ~220 total (34% noise).
## Solution
Set `CLAUDE_CONFIG_DIR` to `~/.claude-mem/observer-config/` when spawning
observer Claude processes. This stores observer session files in a separate
location, isolating them from user sessions.
## Changes
1. Added `OBSERVER_CONFIG_DIR` to paths.ts
2. Modified `createPidCapturingSpawn()` in ProcessRegistry.ts to inject
`CLAUDE_CONFIG_DIR` environment variable
Observer sessions now write their `.jsonl` files to:
`~/.claude-mem/observer-config/projects/*/`
Instead of the user's:
`~/.claude/projects/*/`
Fixes #832
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* Fix zombie process accumulation (Issue #737)
Problem: Claude haiku subprocesses spawned by the SDK weren't terminating
properly, causing zombie process accumulation (user reported 155 processes
consuming 51GB RAM).
Root causes:
1. SDK's SpawnedProcess interface hides subprocess PIDs
2. deleteSession() didn't verify subprocess exit
3. abort() was fire-and-forget with no confirmation
4. No mechanism to track or clean up orphaned processes
Solution:
- Add ProcessRegistry module to track spawned Claude subprocesses
- Use SDK's spawnClaudeCodeProcess option to capture PIDs via custom spawn
- Pass signal parameter to enable AbortController integration
- Wait for subprocess exit in deleteSession() with 5s timeout
- Escalate to SIGKILL if graceful exit fails
- Add orphan reaper running every 5 minutes as safety net
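The wait-then-escalate step can be sketched against a minimal process-like interface. The function name and return values are hypothetical; the 5s grace period and the SIGKILL escalation are from the list above:

```typescript
// Minimal shape of the child-process members this sketch relies on.
interface Killable {
  exitCode: number | null;
  kill(signal: "SIGTERM" | "SIGKILL"): boolean;
  once(event: "exit", listener: () => void): unknown;
}

// Sends SIGTERM, waits up to graceMs for the exit event, then escalates.
async function terminateWithEscalation(
  proc: Killable,
  graceMs = 5000
): Promise<"exited" | "killed"> {
  if (proc.exitCode !== null) return "exited";

  const exitedInTime = await new Promise<boolean>((resolve) => {
    const timer = setTimeout(() => resolve(false), graceMs);
    // Listen before signaling so a fast exit is not missed.
    proc.once("exit", () => {
      clearTimeout(timer);
      resolve(true);
    });
    proc.kill("SIGTERM");
  });

  if (exitedInTime) return "exited";
  proc.kill("SIGKILL");
  return "killed";
}
```

A session-delete path could await this before dropping its reference to the subprocess, so deletion confirms exit instead of assuming it.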
Files changed:
- src/services/worker/ProcessRegistry.ts (new): PID registry and reaper
- src/services/worker/SDKAgent.ts: Use custom spawn to capture PIDs
- src/services/worker/SessionManager.ts: Verify subprocess exit on delete
- src/services/worker-service.ts: Start/stop orphan reaper
Fixes #737
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix: address code review feedback
- Replace busy-wait polling with event-based proc.once('exit')
- Detect and warn about multiple processes per session (race condition)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---------
Co-authored-by: bigphoot <bigphoot@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>