feat: add embedded Process Supervisor for unified process lifecycle (#1370)

* feat: add embedded Process Supervisor for unified process lifecycle management

Consolidates scattered process management (ProcessManager, GracefulShutdown,
HealthMonitor, ProcessRegistry) into a unified src/supervisor/ module.

New: ProcessRegistry with JSON persistence, env sanitizer (strips CLAUDECODE_*
vars), graceful shutdown cascade (SIGTERM → 5s wait → SIGKILL with tree-kill
on Windows), PID file liveness validation, and singleton Supervisor API.
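The env sanitizer above can be sketched roughly as follows. This is a hedged illustration: the real module reportedly exports ENV_PREFIXES and ENV_EXACT_MATCHES, but the exact lists used here are assumptions, not the shipped values.

```typescript
// Illustrative env sanitizer: copy the environment minus anything matching a
// stripped prefix, so a spawned worker cannot mistake itself for a nested
// Claude Code session. Lists below are assumed, not the real module's values.
type Env = Record<string, string | undefined>;

const ENV_PREFIXES = ['CLAUDECODE'];      // strips CLAUDECODE_* variables
const ENV_EXACT_MATCHES: string[] = [];   // exact names to strip (assumed empty)

function sanitizeEnv(env: Env): Env {
  const clean: Env = {};
  for (const [key, value] of Object.entries(env)) {
    if (ENV_PREFIXES.some((prefix) => key.startsWith(prefix))) continue;
    if (ENV_EXACT_MATCHES.includes(key)) continue;
    clean[key] = value;
  }
  return clean;
}
```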

Fixes #1352 (worker inherits CLAUDECODE env causing nested sessions)
Fixes #1356 (zombie TCP socket after Windows reboot)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add session-scoped process reaping to supervisor

Adds reapSession(sessionId) to ProcessRegistry for killing session-tagged
processes on session end. SessionManager.deleteSession() now triggers reaping.
Tightens orphan reaper interval from 60s to 30s.
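The reaping flow above can be sketched as a simplified stand-in for ProcessRegistry.reapSession. The entry shape and the injected kill function are assumptions for testability; the real registry would signal via process.kill (SIGTERM, with the supervisor's SIGKILL escalation behind it).

```typescript
// Hedged sketch of session-scoped reaping: kill and deregister every tracked
// process tagged with the ending session's id. killFn is injected here so the
// logic is testable without signalling real processes.
interface RegistryEntry {
  pid: number;
  type: string;
  sessionId?: string;
}

class SessionReaper {
  private entries = new Map<string, RegistryEntry>();

  constructor(private killFn: (pid: number) => void) {}

  register(name: string, entry: RegistryEntry): void {
    this.entries.set(name, entry);
  }

  /** Kill and deregister every process tagged with sessionId; return reaped PIDs. */
  reapSession(sessionId: string): number[] {
    const reaped: number[] = [];
    for (const [name, entry] of [...this.entries]) {
      if (entry.sessionId !== sessionId) continue;
      try {
        this.killFn(entry.pid);
      } catch {
        // Process already gone; still drop the stale entry.
      }
      this.entries.delete(name);
      reaped.push(entry.pid);
    }
    return reaped;
  }
}
```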

Fixes #1351 (MCP server processes leak on session end)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add Unix domain socket support for worker communication

Introduces socket-manager.ts for UDS-based worker communication, eliminating
port 37777 collisions between concurrent sessions. Worker listens on
~/.claude-mem/sockets/worker.sock by default with TCP fallback.

All hook handlers, MCP server, health checks, and admin commands updated to
use socket-aware workerHttpRequest(). Backwards compatible — settings can
force TCP mode via CLAUDE_MEM_WORKER_TRANSPORT=tcp.
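The transport selection described above might look roughly like this (note that a later commit in this PR reverts to TCP-only). The socket path and env var follow the commit text; the helper name and the Windows fallback are assumptions.

```typescript
// Illustrative transport selection: prefer a Unix domain socket under
// ~/.claude-mem/sockets/, falling back to TCP when forced via
// CLAUDE_MEM_WORKER_TRANSPORT=tcp or on Windows (assumed, given limited
// AF_UNIX support there).
import * as os from 'os';
import * as path from 'path';

type Transport =
  | { kind: 'uds'; socketPath: string }
  | { kind: 'tcp'; port: number };

function resolveWorkerTransport(
  env: Record<string, string | undefined>,
  port: number
): Transport {
  if (env.CLAUDE_MEM_WORKER_TRANSPORT === 'tcp' || process.platform === 'win32') {
    return { kind: 'tcp', port };
  }
  return {
    kind: 'uds',
    socketPath: path.join(os.homedir(), '.claude-mem', 'sockets', 'worker.sock'),
  };
}
```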

Fixes #1346 (port 37777 collision across concurrent sessions)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove in-process worker fallback from hook command

Removes the fallback path where hook scripts started WorkerService in-process,
which made the worker a grandchild of Claude Code (and therefore killed by its
sandbox). Hooks now always delegate to ensureWorkerStarted(), which spawns a
fully detached daemon.
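The detached-daemon spawn that ensureWorkerStarted relies on can be sketched as below (the helper name and signature are assumptions beyond the commit text). The combination of detached: true, stdio: 'ignore', and unref() is what lets the daemon outlive the hook process instead of dying with Claude Code's sandbox.

```typescript
// Minimal sketch of a fully detached daemon spawn.
import { spawn } from 'child_process';

function spawnDetachedDaemon(script: string, args: string[] = []): number | undefined {
  const child = spawn(process.execPath, [script, ...args], {
    detached: true,   // own process group: not reaped with the hook's tree
    stdio: 'ignore',  // no shared pipes tying the two lifetimes together
  });
  child.unref();      // allow the hook process to exit immediately
  return child.pid;
}
```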

Fixes #1249 (grandchild process killed by sandbox)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add health checker and /api/admin/doctor endpoint

Adds a 30-second periodic health sweep that prunes dead processes from the
supervisor registry and cleans stale socket files. Adds /api/admin/doctor
endpoint exposing supervisor state, process liveness, and environment health.
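The periodic sweep above can be sketched as follows. The liveness probe is injected for testability; the real checker presumably uses a signal-0 check, and the class and method names here are assumptions.

```typescript
// Hedged sketch of the 30-second health sweep: probe each registered PID and
// prune entries whose process is gone.
interface Registered {
  pid: number;
}

class HealthChecker {
  constructor(
    private registry: Map<string, Registered>,
    private isAlive: (pid: number) => boolean
  ) {}

  /** One sweep: remove dead entries, return the pruned PIDs. */
  sweep(): number[] {
    const pruned: number[] = [];
    for (const [name, entry] of [...this.registry]) {
      if (!this.isAlive(entry.pid)) {
        this.registry.delete(name);
        pruned.push(entry.pid);
      }
    }
    return pruned;
  }

  start(intervalMs = 30_000): ReturnType<typeof setInterval> {
    const timer = setInterval(() => this.sweep(), intervalMs);
    timer.unref(); // don't keep the worker alive just for the sweep
    return timer;
  }
}
```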

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add comprehensive supervisor test suite

64 tests covering all supervisor modules: process registry (18 tests),
env sanitizer (8), shutdown cascade (10), socket manager (15), health
checker (5), and supervisor API (6). Includes persistence, isolation,
edge cases, and cross-module integration scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: revert Unix domain socket transport, restore TCP on port 37777

The socket-manager introduced UDS as default transport, but this broke
the HTTP server's TCP accessibility (viewer UI, curl, external monitoring).
Since there's only ever one worker process handling all sessions, the
port collision rationale for UDS doesn't apply. Reverts to TCP-only,
removing ~900 lines of unnecessary complexity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove dead code found in pre-landing review

Remove unused `acceptingSpawns` field from Supervisor class (written but
never read — assertCanSpawn uses stopPromise instead) and unused
`buildWorkerUrl` import from context handler.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update .gitignore

* fix: address PR review feedback - downgrade HTTP logging, clean up gitignore, harden supervisor

- Downgrade request/response HTTP logging from info to debug to reduce noise
- Remove unused getWorkerPort imports, use buildWorkerUrl helper
- Export ENV_PREFIXES/ENV_EXACT_MATCHES from env-sanitizer, reuse in Server.ts
- Fix isPidAlive(0) returning true (should be false)
- Add shutdownInitiated flag to prevent signal handler race condition
- Make validateWorkerPidFile testable with pidFilePath option
- Remove unused dataDir from ShutdownCascadeOptions
- Upgrade reapSession log from debug to warn
- Rename zombiePidFiles to deadProcessPids (returns actual PIDs)
- Clean up gitignore: remove duplicate datasets/, stale ~*/ and http*/ patterns
- Fix tests to use temp directories instead of relying on real PID file
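The isPidAlive fix called out above matters because process.kill(0, 0) signals the calling process's own group and always succeeds, so pid 0 (and negative pids, which address process groups) must be rejected before probing. A minimal sketch:

```typescript
// Liveness probe via signal 0 (existence check only, delivers nothing),
// guarding against pid 0 and negative/non-integer pids up front.
function isPidAlive(pid: number): boolean {
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but we may not signal it; still alive.
    return (err as { code?: string }).code === 'EPERM';
  }
}
```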

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Alex Newman, committed by GitHub, 2026-03-16 14:49:23 -07:00
parent 237a4c37f8
commit 80a8c90a1a
44 changed files with 2385 additions and 636 deletions
@@ -20,6 +20,8 @@ import { getAuthMethodDescription } from '../shared/EnvManager.js';
import { logger } from '../utils/logger.js';
import { ChromaMcpManager } from './sync/ChromaMcpManager.js';
import { ChromaSync } from './sync/ChromaSync.js';
import { configureSupervisorSignalHandlers, getSupervisor, startSupervisor } from '../supervisor/index.js';
import { sanitizeEnv } from '../supervisor/env-sanitizer.js';
// Windows: avoid repeated spawn popups when startup fails (issue #921)
const WINDOWS_SPAWN_COOLDOWN_MS = 2 * 60 * 1000;
@@ -78,7 +80,6 @@ import {
cleanStalePidFile,
isProcessAlive,
spawnDaemon,
createSignalHandler,
isPidFileRecent,
touchPidFile
} from './infrastructure/ProcessManager.js';
@@ -263,33 +264,10 @@ export class WorkerService {
* Register signal handlers for graceful shutdown
*/
private registerSignalHandlers(): void {
const shutdownRef = { value: this.isShuttingDown };
const handler = createSignalHandler(() => this.shutdown(), shutdownRef);
process.on('SIGTERM', () => {
this.isShuttingDown = shutdownRef.value;
handler('SIGTERM');
configureSupervisorSignalHandlers(async () => {
this.isShuttingDown = true;
await this.shutdown();
});
process.on('SIGINT', () => {
this.isShuttingDown = shutdownRef.value;
handler('SIGINT');
});
// SIGHUP: sent by kernel when controlling terminal closes.
// Daemon mode: ignore it (survive parent shell exit).
// Interactive mode: treat like SIGTERM (graceful shutdown).
if (process.platform !== 'win32') {
if (process.argv.includes('--daemon')) {
process.on('SIGHUP', () => {
logger.debug('SYSTEM', 'Ignoring SIGHUP in daemon mode');
});
} else {
process.on('SIGHUP', () => {
this.isShuttingDown = shutdownRef.value;
handler('SIGHUP');
});
}
}
}
/**
@@ -351,7 +329,9 @@ export class WorkerService {
const port = getWorkerPort();
const host = getWorkerHost();
// Start HTTP server FIRST - make port available immediately
await startSupervisor();
// Start HTTP server FIRST - make it available immediately
await this.server.listen(port, host);
// Worker writes its own PID - reliable on all platforms
@@ -363,6 +343,12 @@ export class WorkerService {
startedAt: new Date().toISOString()
});
getSupervisor().registerProcess('worker', {
pid: process.pid,
type: 'worker',
startedAt: new Date().toISOString()
});
logger.info('SYSTEM', 'Worker started', { host, port, pid: process.pid });
// Do slow initialization in background (non-blocking)
@@ -446,19 +432,50 @@ export class WorkerService {
// Connect to MCP server
const mcpServerPath = path.join(__dirname, 'mcp-server.cjs');
getSupervisor().assertCanSpawn('mcp server');
const transport = new StdioClientTransport({
command: 'node',
args: [mcpServerPath],
env: process.env
env: sanitizeEnv(process.env)
});
const MCP_INIT_TIMEOUT_MS = 300000;
const mcpConnectionPromise = this.mcpClient.connect(transport);
const timeoutPromise = new Promise<never>((_, reject) =>
setTimeout(() => reject(new Error('MCP connection timeout after 5 minutes')), MCP_INIT_TIMEOUT_MS)
);
let timeoutId: ReturnType<typeof setTimeout>;
const timeoutPromise = new Promise<never>((_, reject) => {
timeoutId = setTimeout(
() => reject(new Error('MCP connection timeout after 5 minutes')),
MCP_INIT_TIMEOUT_MS
);
});
await Promise.race([mcpConnectionPromise, timeoutPromise]);
try {
await Promise.race([mcpConnectionPromise, timeoutPromise]);
} catch (connectionError) {
clearTimeout(timeoutId!);
logger.warn('WORKER', 'MCP server connection failed, cleaning up subprocess', {
error: connectionError instanceof Error ? connectionError.message : String(connectionError)
});
try {
await transport.close();
} catch {
// Best effort: the supervisor handles later process cleanup for survivors.
}
throw connectionError;
}
clearTimeout(timeoutId!);
const mcpProcess = (transport as unknown as { _process?: import('child_process').ChildProcess })._process;
if (mcpProcess?.pid) {
getSupervisor().registerProcess('mcp-server', {
pid: mcpProcess.pid,
type: 'mcp',
startedAt: new Date().toISOString()
}, mcpProcess);
mcpProcess.once('exit', () => {
getSupervisor().unregisterProcess('mcp-server');
});
}
this.mcpReady = true;
logger.success('WORKER', 'MCP server connected');
@@ -470,7 +487,7 @@ export class WorkerService {
}
return activeIds;
});
logger.info('SYSTEM', 'Started orphan reaper (runs every 1 minute)');
logger.info('SYSTEM', 'Started orphan reaper (runs every 30 seconds)');
// Reap stale sessions to unblock orphan process cleanup (Issue #1168)
this.staleSessionReaperInterval = setInterval(async () => {
@@ -916,12 +933,22 @@ export class WorkerService {
* Ensures the worker is started and healthy.
* This function can be called by both 'start' and 'hook' commands.
*
* @param port - The port the worker should run on
* @param port - The TCP port (used for port-in-use checks and daemon spawn)
* @returns true if worker is healthy (existing or newly started), false on failure
*/
async function ensureWorkerStarted(port: number): Promise<boolean> {
// Clean stale PID file first (cheap: 1 fs read + 1 signal-0 check)
cleanStalePidFile();
const pidFileStatus = cleanStalePidFile();
if (pidFileStatus === 'alive') {
logger.info('SYSTEM', 'Worker PID file points to a live process, skipping duplicate spawn');
const healthy = await waitForHealth(port, getPlatformTimeout(HOOK_TIMEOUTS.PORT_IN_USE_WAIT));
if (healthy) {
logger.info('SYSTEM', 'Worker became healthy while waiting on live PID');
return true;
}
logger.warn('SYSTEM', 'Live PID detected but worker did not become healthy before timeout');
return false;
}
// Check if worker is already running and healthy
if (await waitForHealth(port, 1000)) {
@@ -1065,11 +1092,9 @@ async function main() {
case 'restart': {
logger.info('SYSTEM', 'Restarting worker');
await httpShutdown(port);
const freed = await waitForPortFree(port, getPlatformTimeout(15000));
if (!freed) {
const restartFreed = await waitForPortFree(port, getPlatformTimeout(15000));
if (!restartFreed) {
logger.error('SYSTEM', 'Port did not free up after shutdown, aborting restart', { port });
// Exit gracefully: Windows Terminal won't keep tab open on exit 0
// The wrapper/plugin will handle restart logic if needed
process.exit(0);
}
removePidFile();
@@ -1100,9 +1125,9 @@ async function main() {
}
case 'status': {
const running = await isPortInUse(port);
const portInUse = await isPortInUse(port);
const pidInfo = readPidFile();
if (running && pidInfo) {
if (portInUse && pidInfo) {
console.log('Worker is running');
console.log(` PID: ${pidInfo.pid}`);
console.log(` Port: ${pidInfo.port}`);
@@ -1122,13 +1147,7 @@ async function main() {
}
case 'hook': {
// Auto-start worker if not running
const workerReady = await ensureWorkerStarted(port);
if (!workerReady) {
logger.warn('SYSTEM', 'Worker failed to start before hook, handler will retry');
}
// Existing logic unchanged
// Validate CLI args first (before any I/O)
const platform = process.argv[3];
const event = process.argv[4];
if (!platform || !event) {
@@ -1138,32 +1157,20 @@ async function main() {
process.exit(1);
}
// Check if worker is already running on port
const portInUse = await isPortInUse(port);
let startedWorkerInProcess = false;
if (!portInUse) {
// Port free - start worker IN THIS PROCESS (no spawn!)
// This process becomes the worker and stays alive
try {
logger.info('SYSTEM', 'Starting worker in-process for hook', { event });
const worker = new WorkerService();
await worker.start();
startedWorkerInProcess = true;
// Worker is now running in this process on the port
} catch (error) {
logger.failure('SYSTEM', 'Worker failed to start in hook', {}, error as Error);
removePidFile();
process.exit(0);
}
// Ensure worker is running as a detached daemon (#1249).
//
// IMPORTANT: The hook process MUST NOT become the worker. Starting the
// worker in-process makes it a grandchild of Claude Code, which the
// sandbox kills. Instead, ensureWorkerStarted() spawns a fully detached
// daemon (detached: true, stdio: 'ignore', child.unref()) that survives
// the hook process's exit and is invisible to Claude Code's sandbox.
const workerReady = await ensureWorkerStarted(port);
if (!workerReady) {
logger.warn('SYSTEM', 'Worker failed to start before hook, handler will proceed gracefully');
}
// If port in use, we'll use HTTP to the existing worker
const { hookCommand } = await import('../cli/hook-command.js');
// If we started the worker in this process, skip process.exit() so we stay alive as the worker
await hookCommand(platform, event, { skipExit: startedWorkerInProcess });
// Note: if we started worker in-process, this process stays alive as the worker
// The break allows the event loop to continue serving requests
await hookCommand(platform, event);
break;
}