c7abb01dfc1d6ac298d888fdebe4a6742e2ecd65
2 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
80a8c90a1a |
feat: add embedded Process Supervisor for unified process lifecycle (#1370)
* feat: add embedded Process Supervisor for unified process lifecycle management Consolidates scattered process management (ProcessManager, GracefulShutdown, HealthMonitor, ProcessRegistry) into a unified src/supervisor/ module. New: ProcessRegistry with JSON persistence, env sanitizer (strips CLAUDECODE_* vars), graceful shutdown cascade (SIGTERM → 5s wait → SIGKILL with tree-kill on Windows), PID file liveness validation, and singleton Supervisor API. Fixes #1352 (worker inherits CLAUDECODE env causing nested sessions) Fixes #1356 (zombie TCP socket after Windows reboot) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add session-scoped process reaping to supervisor Adds reapSession(sessionId) to ProcessRegistry for killing session-tagged processes on session end. SessionManager.deleteSession() now triggers reaping. Tightens orphan reaper interval from 60s to 30s. Fixes #1351 (MCP server processes leak on session end) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add Unix domain socket support for worker communication Introduces socket-manager.ts for UDS-based worker communication, eliminating port 37777 collisions between concurrent sessions. Worker listens on ~/.claude-mem/sockets/worker.sock by default with TCP fallback. All hook handlers, MCP server, health checks, and admin commands updated to use socket-aware workerHttpRequest(). Backwards compatible — settings can force TCP mode via CLAUDE_MEM_WORKER_TRANSPORT=tcp. Fixes #1346 (port 37777 collision across concurrent sessions) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: remove in-process worker fallback from hook command Removes the fallback path where hook scripts started WorkerService in-process, making the worker a grandchild of Claude Code (killed by sandbox). Hooks now always delegate to ensureWorkerStarted() which spawns a fully detached daemon. Fixes #1249 (grandchild process killed by sandbox) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add health checker and /api/admin/doctor endpoint Adds 30-second periodic health sweep that prunes dead processes from the supervisor registry and cleans stale socket files. Adds /api/admin/doctor endpoint exposing supervisor state, process liveness, and environment health. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add comprehensive supervisor test suite 64 tests covering all supervisor modules: process registry (18 tests), env sanitizer (8), shutdown cascade (10), socket manager (15), health checker (5), and supervisor API (6). Includes persistence, isolation, edge cases, and cross-module integration scenarios. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: revert Unix domain socket transport, restore TCP on port 37777 The socket-manager introduced UDS as default transport, but this broke the HTTP server's TCP accessibility (viewer UI, curl, external monitoring). Since there's only ever one worker process handling all sessions, the port collision rationale for UDS doesn't apply. Reverts to TCP-only, removing ~900 lines of unnecessary complexity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: remove dead code found in pre-landing review Remove unused `acceptingSpawns` field from Supervisor class (written but never read — assertCanSpawn uses stopPromise instead) and unused `buildWorkerUrl` import from context handler. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * updated gitignore * fix: address PR review feedback - downgrade HTTP logging, clean up gitignore, harden supervisor - Downgrade request/response HTTP logging from info to debug to reduce noise - Remove unused getWorkerPort imports, use buildWorkerUrl helper - Export ENV_PREFIXES/ENV_EXACT_MATCHES from env-sanitizer, reuse in Server.ts - Fix isPidAlive(0) returning true (should be false) - Add shutdownInitiated flag to prevent signal handler race condition - Make validateWorkerPidFile testable with pidFilePath option - Remove unused dataDir from ShutdownCascadeOptions - Upgrade reapSession log from debug to warn - Rename zombiePidFiles to deadProcessPids (returns actual PIDs) - Clean up gitignore: remove duplicate datasets/, stale ~*/ and http*/ patterns - Fix tests to use temp directories instead of relying on real PID file Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
c6f932988a |
Fix 30+ root-cause bugs across 10 triage phases (#1214)
* MAESTRO: fix ChromaDB core issues — Python pinning, Windows paths, disable toggle, metadata sanitization, transport errors - Add --python version pinning to uvx args in both local and remote mode (fixes #1196, #1206, #1208) - Convert backslash paths to forward slashes for --data-dir on Windows (fixes #1199) - Add CLAUDE_MEM_CHROMA_ENABLED setting for SQLite-only fallback mode (fixes #707) - Sanitize metadata in addDocuments() to filter null/undefined/empty values (fixes #1183, #1188) - Wrap callTool() in try/catch for transport errors with auto-reconnect (fixes #1162) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix data integrity — content-hash deduplication, project name collision, empty project guard, stuck isProcessing - Add SHA-256 content-hash deduplication to observations INSERT (store.ts, transactions.ts, SessionStore.ts) - Add content_hash column via migration 22 with backfill and index - Fix project name collision: getCurrentProjectName() now returns parent/basename - Guard against empty project string with cwd-derived fallback - Fix stuck isProcessing: hasAnyPendingWork() resets processing messages older than 5 minutes - Add 12 new tests covering all four fixes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix hook lifecycle — stderr suppression, output isolation, conversation pollution prevention - Suppress process.stderr.write in hookCommand() to prevent Claude Code showing diagnostic output as error UI (#1181). Restores stderr in finally block for worker-continues case. - Convert console.error() to logger.warn()/error() in hook-command.ts and handlers/index.ts so all diagnostics route to log file instead of stderr. - Verified all 7 handlers return suppressOutput: true (prevents conversation pollution #598, #784). - Verified session-complete is a recognized event type (fixes #984). - Verified unknown event types return no-op handler with exit 0 (graceful degradation). - Added 10 new tests in tests/hook-lifecycle.test.ts covering event dispatch, adapter defaults, stderr suppression, and standard response constants. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix worker lifecycle — restart loop coordination, stale transport retry, ENOENT shutdown race - Add PID file mtime guard to prevent concurrent restart storms (#1145): isPidFileRecent() + touchPidFile() coordinate across sessions - Add transparent retry in ChromaMcpManager.callTool() on transport error — reconnects and retries once instead of failing (#1131) - Wrap getInstalledPluginVersion() with ENOENT/EBUSY handling (#1042) - Verified ChromaMcpManager.stop() already called on all shutdown paths Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix Windows platform support — uvx.cmd spawn, PowerShell $_ elimination, windowsHide, FTS5 fallback - Route uvx spawn through cmd.exe /c on Windows since MCP SDK lacks shell:true (#1190, #1192, #1199) - Replace all PowerShell Where-Object {$_} pipelines with WQL -Filter server-side filtering (#1024, #1062) - Add windowsHide: true to all exec/spawn calls missing it to prevent console popups (#1048) - Add FTS5 runtime probe with graceful fallback when unavailable on Windows (#791) - Guard FTS5 table creation in migrations, SessionSearch, and SessionStore with try/catch Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix skills/ distribution — build-time verification and regression tests (#1187) Add post-build verification in build-hooks.js that fails if critical distribution files (skills, hooks, plugin manifest) are missing. Add 10 regression tests covering skill file presence, YAML frontmatter, hooks.json integrity, and package.json files field. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix MigrationRunner schema initialization (#979) — version conflict between parallel migration systems Root cause: old DatabaseManager migrations 1-7 shared schema_versions table with MigrationRunner's 4-22, causing version number collisions (5=drop tables vs add column, 6=FTS5 vs prompt tracking, 7=discovery_tokens vs remove UNIQUE). initializeSchema() was gated behind maxApplied===0, so core tables were never created when old versions were present. Fixes: - initializeSchema() always creates core tables via CREATE TABLE IF NOT EXISTS - Migrations 5-7 check actual DB state (columns/constraints) not just version tracking - Crash-safe temp table rebuilds (DROP IF EXISTS _new before CREATE) - Added missing migration 21 (ON UPDATE CASCADE) to MigrationRunner - Added ON UPDATE CASCADE to FK definitions in initializeSchema() - All changes applied to both runner.ts and SessionStore.ts Tests: 13 new tests in migration-runner.test.ts covering fresh DB, idempotency, version conflicts, crash recovery, FK constraints, and data integrity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix 21 test failures — stale mocks, outdated assertions, missing OpenClaw guards Server tests (12): Added missing workerPath and getAiStatus to ServerOptions mocks after interface expansion. ChromaSync tests (3): Updated to verify transport cleanup in ChromaMcpManager after architecture refactor. OpenClaw (2): Added memory_ tool skipping and response truncation to prevent recursive loops and oversized payloads. MarkdownFormatter (2): Updated assertions to match current output. SettingsDefaultsManager (1): Used correct default key for getBool test. Logger standards (1): Excluded CLI transcript command from background service check. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix Codex CLI compatibility (#744) — session_id fallbacks, unknown platform tolerance, undefined guard Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix Cursor IDE integration (#838, #1049) — adapter field fallbacks, tolerant session-init validation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix /api/logs OOM (#1203) — tail-read replaces full-file readFileSync Replace readFileSync (loads entire file into memory) with readLastLines() that reads only from the end of the file in expanding chunks (64KB → 10MB cap). Prevents OOM on large log files while preserving the same API response shape. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix Settings CORS error (#1029) — explicit methods and allowedHeaders in CORS config Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: add session custom_title for agent attribution (#1213) — migration 23, endpoint + store support Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: prevent CLAUDE.md/AGENTS.md writes inside .git/ directories (#1165) Add .git path guard to all 4 write sites to prevent ref corruption when paths resolve inside .git internals. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix plugin disabled state not respected (#781) — early exit check in all hook entry points Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix UserPromptSubmit context re-injection on every turn (#1079) — contextInjected session flag Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix stale AbortController queue stall (#1099) — lastGeneratorActivity tracking + 30s timeout Three-layer fix: 1. Added lastGeneratorActivity timestamp to ActiveSession, updated by processAgentResponse (all agents), getMessageIterator (queue yields), and startGeneratorWithProvider (generator launch) 2. Added stale generator detection in ensureGeneratorRunning — if no activity for >30s, aborts stale controller, resets state, restarts 3. Added AbortSignal.timeout(30000) in deleteSession to prevent indefinite hang when awaiting a stuck generator promise Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> |