feat: add embedded Process Supervisor for unified process lifecycle (#1370)

* feat: add embedded Process Supervisor for unified process lifecycle management

Consolidates scattered process management (ProcessManager, GracefulShutdown,
HealthMonitor, ProcessRegistry) into a unified src/supervisor/ module.

New: ProcessRegistry with JSON persistence, env sanitizer (strips CLAUDECODE_*
vars), graceful shutdown cascade (SIGTERM → 5s wait → SIGKILL with tree-kill
on Windows), PID file liveness validation, and singleton Supervisor API.

Fixes #1352 (worker inherits CLAUDECODE env causing nested sessions)
Fixes #1356 (zombie TCP socket after Windows reboot)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add session-scoped process reaping to supervisor

Adds reapSession(sessionId) to ProcessRegistry for killing session-tagged
processes on session end. SessionManager.deleteSession() now triggers reaping.
Tightens orphan reaper interval from 60s to 30s.

Fixes #1351 (MCP server processes leak on session end)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add Unix domain socket support for worker communication

Introduces socket-manager.ts for UDS-based worker communication, eliminating
port 37777 collisions between concurrent sessions. Worker listens on
~/.claude-mem/sockets/worker.sock by default with TCP fallback.

All hook handlers, MCP server, health checks, and admin commands updated to
use socket-aware workerHttpRequest(). Backwards compatible — settings can
force TCP mode via CLAUDE_MEM_WORKER_TRANSPORT=tcp.

Fixes #1346 (port 37777 collision across concurrent sessions)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove in-process worker fallback from hook command

Removes the fallback path where hook scripts started WorkerService in-process,
which made the worker a grandchild of Claude Code (and thus killed by the
sandbox). Hooks now always delegate to ensureWorkerStarted(), which spawns a
fully detached daemon.

Fixes #1249 (grandchild process killed by sandbox)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add health checker and /api/admin/doctor endpoint

Adds 30-second periodic health sweep that prunes dead processes from the
supervisor registry and cleans stale socket files. Adds /api/admin/doctor
endpoint exposing supervisor state, process liveness, and environment health.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add comprehensive supervisor test suite

64 tests covering all supervisor modules: process registry (18 tests),
env sanitizer (8), shutdown cascade (10), socket manager (15), health
checker (5), and supervisor API (6). Includes persistence, isolation,
edge cases, and cross-module integration scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: revert Unix domain socket transport, restore TCP on port 37777

The socket-manager introduced UDS as default transport, but this broke
the HTTP server's TCP accessibility (viewer UI, curl, external monitoring).
Since there's only ever one worker process handling all sessions, the
port collision rationale for UDS doesn't apply. Reverts to TCP-only,
removing ~900 lines of unnecessary complexity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove dead code found in pre-landing review

Remove unused `acceptingSpawns` field from Supervisor class (written but
never read — assertCanSpawn uses stopPromise instead) and unused
`buildWorkerUrl` import from context handler.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update .gitignore

* fix: address PR review feedback - downgrade HTTP logging, clean up gitignore, harden supervisor

- Downgrade request/response HTTP logging from info to debug to reduce noise
- Remove unused getWorkerPort imports, use buildWorkerUrl helper
- Export ENV_PREFIXES/ENV_EXACT_MATCHES from env-sanitizer, reuse in Server.ts
- Fix isPidAlive(0) returning true (should be false)
- Add shutdownInitiated flag to prevent signal handler race condition
- Make validateWorkerPidFile testable with pidFilePath option
- Remove unused dataDir from ShutdownCascadeOptions
- Upgrade reapSession log from debug to warn
- Rename zombiePidFiles to deadProcessPids (returns actual PIDs)
- Clean up gitignore: remove duplicate datasets/, stale ~*/ and http*/ patterns
- Fix tests to use temp directories instead of relying on real PID file

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Alex Newman
2026-03-16 14:49:23 -07:00
committed by GitHub
parent 237a4c37f8
commit 80a8c90a1a
44 changed files with 2385 additions and 636 deletions
@@ -0,0 +1,20 @@
/**
 * Env Sanitizer - Strips Claude Code session variables from child env
 *
 * Children spawned with an unsanitized env inherit CLAUDECODE_* vars and
 * start nested sessions (#1352). Constants are exported for reuse in Server.ts.
 */
export const ENV_PREFIXES = ['CLAUDECODE_', 'CLAUDE_CODE_'];
export const ENV_EXACT_MATCHES = new Set([
'CLAUDECODE',
'CLAUDE_CODE_SESSION',
'CLAUDE_CODE_ENTRYPOINT',
'MCP_SESSION_ID',
]);
export function sanitizeEnv(env: NodeJS.ProcessEnv = process.env): NodeJS.ProcessEnv {
const sanitized: NodeJS.ProcessEnv = {};
for (const [key, value] of Object.entries(env)) {
if (value === undefined) continue;
if (ENV_EXACT_MATCHES.has(key)) continue;
if (ENV_PREFIXES.some(prefix => key.startsWith(prefix))) continue;
sanitized[key] = value;
}
return sanitized;
}
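The stripping rules can be exercised standalone. A minimal sketch with the same prefix and exact-match lists as the module above (the env type is widened to a plain record so the snippet needs no Node type declarations):

```typescript
// Same filtering rules as sanitizeEnv above, reproduced standalone.
const ENV_PREFIXES = ['CLAUDECODE_', 'CLAUDE_CODE_'];
const ENV_EXACT_MATCHES = new Set([
  'CLAUDECODE',
  'CLAUDE_CODE_SESSION',
  'CLAUDE_CODE_ENTRYPOINT',
  'MCP_SESSION_ID',
]);

function sanitizeEnv(env: Record<string, string | undefined>): Record<string, string> {
  const sanitized: Record<string, string> = {};
  for (const [key, value] of Object.entries(env)) {
    if (value === undefined) continue;
    if (ENV_EXACT_MATCHES.has(key)) continue;                          // exact blocklist
    if (ENV_PREFIXES.some(prefix => key.startsWith(prefix))) continue; // prefix blocklist
    sanitized[key] = value;
  }
  return sanitized;
}

const result = sanitizeEnv({
  PATH: '/usr/bin',
  HOME: '/home/user',
  CLAUDECODE: '1',            // exact match: stripped
  CLAUDECODE_FOO: 'x',        // prefix match: stripped
  CLAUDE_CODE_SESSION: 'abc', // exact match: stripped
});
console.log(Object.keys(result)); // ['PATH', 'HOME']
```

Passing the sanitized record as the `env` option when spawning a worker is what keeps the child from opening a nested Claude Code session.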
@@ -0,0 +1,40 @@
/**
* Health Checker - Periodic background cleanup of dead processes
*
* Runs every 30 seconds to prune dead processes from the supervisor registry.
* The interval is unref'd so it does not keep the process alive.
*/
import { logger } from '../utils/logger.js';
import { getProcessRegistry } from './process-registry.js';
const HEALTH_CHECK_INTERVAL_MS = 30_000;
let healthCheckInterval: ReturnType<typeof setInterval> | null = null;
function runHealthCheck(): void {
const registry = getProcessRegistry();
const removedProcessCount = registry.pruneDeadEntries();
if (removedProcessCount > 0) {
logger.info('SYSTEM', `Health check: pruned ${removedProcessCount} dead process(es) from registry`);
}
}
export function startHealthChecker(): void {
if (healthCheckInterval !== null) return;
healthCheckInterval = setInterval(runHealthCheck, HEALTH_CHECK_INTERVAL_MS);
healthCheckInterval.unref();
logger.debug('SYSTEM', 'Health checker started', { intervalMs: HEALTH_CHECK_INTERVAL_MS });
}
export function stopHealthChecker(): void {
if (healthCheckInterval === null) return;
clearInterval(healthCheckInterval);
healthCheckInterval = null;
logger.debug('SYSTEM', 'Health checker stopped');
}
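The start/stop guard pattern above (nullable interval handle, unref'd timer) can be sketched in isolation; the check body here is a stand-in, and `startCalls` is added only to make the idempotency observable:

```typescript
// Idempotent start/stop around an unref'd interval, mirroring the
// health checker above. The interval body is a placeholder.
let interval: ReturnType<typeof setInterval> | null = null;
let startCalls = 0;

function start(): void {
  if (interval !== null) return;          // already running: no-op
  startCalls += 1;
  interval = setInterval(() => {}, 30_000);
  interval.unref();                        // don't keep the process alive
}

function stop(): void {
  if (interval === null) return;           // not running: no-op
  clearInterval(interval);
  interval = null;
}

start();
start();                                   // ignored: guard short-circuits
console.log(startCalls);                   // 1
stop();
stop();                                    // safe to call twice
console.log(interval === null);            // true
```

The `unref()` call is the important detail: without it, a stopped worker would hang for up to 30 seconds waiting for the next sweep before the event loop could drain.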
@@ -0,0 +1,188 @@
import { existsSync, readFileSync, rmSync } from 'fs';
import { homedir } from 'os';
import path from 'path';
import { logger } from '../utils/logger.js';
import { getProcessRegistry, isPidAlive, type ManagedProcessInfo, type ProcessRegistry } from './process-registry.js';
import { runShutdownCascade } from './shutdown.js';
import { startHealthChecker, stopHealthChecker } from './health-checker.js';
const DATA_DIR = path.join(homedir(), '.claude-mem');
const PID_FILE = path.join(DATA_DIR, 'worker.pid');
interface PidInfo {
pid: number;
port: number;
startedAt: string;
}
interface ValidateWorkerPidOptions {
logAlive?: boolean;
pidFilePath?: string;
}
export type ValidateWorkerPidStatus = 'missing' | 'alive' | 'stale' | 'invalid';
class Supervisor {
private readonly registry: ProcessRegistry;
private started = false;
private stopPromise: Promise<void> | null = null;
private signalHandlersRegistered = false;
private shutdownInitiated = false;
private shutdownHandler: (() => Promise<void>) | null = null;
constructor(registry: ProcessRegistry) {
this.registry = registry;
}
async start(): Promise<void> {
if (this.started) return;
this.registry.initialize();
const pidStatus = validateWorkerPidFile({ logAlive: false });
if (pidStatus === 'alive') {
throw new Error('Worker already running');
}
this.started = true;
startHealthChecker();
}
configureSignalHandlers(shutdownHandler: () => Promise<void>): void {
this.shutdownHandler = shutdownHandler;
if (this.signalHandlersRegistered) return;
this.signalHandlersRegistered = true;
const handleSignal = async (signal: string): Promise<void> => {
if (this.shutdownInitiated) {
logger.warn('SYSTEM', `Received ${signal} but shutdown already in progress`);
return;
}
this.shutdownInitiated = true;
logger.info('SYSTEM', `Received ${signal}, shutting down...`);
try {
if (this.shutdownHandler) {
await this.shutdownHandler();
} else {
await this.stop();
}
} catch (error) {
logger.error('SYSTEM', 'Error during shutdown', {}, error as Error);
try {
await this.stop();
} catch (stopError) {
logger.debug('SYSTEM', 'Supervisor shutdown fallback failed', {}, stopError as Error);
}
}
process.exit(0);
};
process.on('SIGTERM', () => void handleSignal('SIGTERM'));
process.on('SIGINT', () => void handleSignal('SIGINT'));
if (process.platform !== 'win32') {
if (process.argv.includes('--daemon')) {
process.on('SIGHUP', () => {
logger.debug('SYSTEM', 'Ignoring SIGHUP in daemon mode');
});
} else {
process.on('SIGHUP', () => void handleSignal('SIGHUP'));
}
}
}
async stop(): Promise<void> {
if (this.stopPromise) {
await this.stopPromise;
return;
}
stopHealthChecker();
this.stopPromise = runShutdownCascade({
registry: this.registry,
currentPid: process.pid
}).finally(() => {
this.started = false;
this.stopPromise = null;
});
await this.stopPromise;
}
assertCanSpawn(type: string): void {
if (this.stopPromise !== null) {
throw new Error(`Supervisor is shutting down, refusing to spawn ${type}`);
}
}
registerProcess(id: string, processInfo: ManagedProcessInfo, processRef?: Parameters<ProcessRegistry['register']>[2]): void {
this.registry.register(id, processInfo, processRef);
}
unregisterProcess(id: string): void {
this.registry.unregister(id);
}
getRegistry(): ProcessRegistry {
return this.registry;
}
}
const supervisorSingleton = new Supervisor(getProcessRegistry());
export async function startSupervisor(): Promise<void> {
await supervisorSingleton.start();
}
export async function stopSupervisor(): Promise<void> {
await supervisorSingleton.stop();
}
export function getSupervisor(): Supervisor {
return supervisorSingleton;
}
export function configureSupervisorSignalHandlers(shutdownHandler: () => Promise<void>): void {
supervisorSingleton.configureSignalHandlers(shutdownHandler);
}
export function validateWorkerPidFile(options: ValidateWorkerPidOptions = {}): ValidateWorkerPidStatus {
const pidFilePath = options.pidFilePath ?? PID_FILE;
if (!existsSync(pidFilePath)) {
return 'missing';
}
let pidInfo: PidInfo | null = null;
try {
pidInfo = JSON.parse(readFileSync(pidFilePath, 'utf-8')) as PidInfo;
} catch (error) {
logger.warn('SYSTEM', 'Failed to parse worker PID file, removing it', { path: pidFilePath }, error as Error);
rmSync(pidFilePath, { force: true });
return 'invalid';
}
if (isPidAlive(pidInfo.pid)) {
if (options.logAlive ?? true) {
logger.info('SYSTEM', 'Worker already running (PID alive)', {
existingPid: pidInfo.pid,
existingPort: pidInfo.port,
startedAt: pidInfo.startedAt
});
}
return 'alive';
}
logger.info('SYSTEM', 'Removing stale PID file (worker process is dead)', {
pid: pidInfo.pid,
port: pidInfo.port,
startedAt: pidInfo.startedAt
});
rmSync(pidFilePath, { force: true });
return 'stale';
}
@@ -0,0 +1,253 @@
import { ChildProcess } from 'child_process';
import { existsSync, mkdirSync, readFileSync, writeFileSync } from 'fs';
import { homedir } from 'os';
import path from 'path';
import { logger } from '../utils/logger.js';
const REAP_SESSION_SIGTERM_TIMEOUT_MS = 5_000;
const REAP_SESSION_SIGKILL_TIMEOUT_MS = 1_000;
const DATA_DIR = path.join(homedir(), '.claude-mem');
const DEFAULT_REGISTRY_PATH = path.join(DATA_DIR, 'supervisor.json');
export interface ManagedProcessInfo {
pid: number;
type: string;
sessionId?: string | number;
startedAt: string;
}
export interface ManagedProcessRecord extends ManagedProcessInfo {
id: string;
}
interface PersistedRegistry {
processes: Record<string, ManagedProcessInfo>;
}
export function isPidAlive(pid: number): boolean {
if (!Number.isInteger(pid) || pid < 0) return false;
if (pid === 0) return false;
try {
process.kill(pid, 0);
return true;
} catch (error: unknown) {
const code = (error as NodeJS.ErrnoException).code;
return code === 'EPERM';
}
}
export class ProcessRegistry {
private readonly registryPath: string;
private readonly entries = new Map<string, ManagedProcessInfo>();
private readonly runtimeProcesses = new Map<string, ChildProcess>();
private initialized = false;
constructor(registryPath: string = DEFAULT_REGISTRY_PATH) {
this.registryPath = registryPath;
}
initialize(): void {
if (this.initialized) return;
this.initialized = true;
mkdirSync(path.dirname(this.registryPath), { recursive: true });
if (!existsSync(this.registryPath)) {
this.persist();
return;
}
try {
const raw = JSON.parse(readFileSync(this.registryPath, 'utf-8')) as PersistedRegistry;
const processes = raw.processes ?? {};
for (const [id, info] of Object.entries(processes)) {
this.entries.set(id, info);
}
} catch (error) {
logger.warn('SYSTEM', 'Failed to parse supervisor registry, rebuilding', {
path: this.registryPath
}, error as Error);
this.entries.clear();
}
const removed = this.pruneDeadEntries();
if (removed > 0) {
logger.info('SYSTEM', 'Removed dead processes from supervisor registry', { removed });
}
this.persist();
}
register(id: string, processInfo: ManagedProcessInfo, processRef?: ChildProcess): void {
this.initialize();
this.entries.set(id, processInfo);
if (processRef) {
this.runtimeProcesses.set(id, processRef);
}
this.persist();
}
unregister(id: string): void {
this.initialize();
this.entries.delete(id);
this.runtimeProcesses.delete(id);
this.persist();
}
clear(): void {
this.entries.clear();
this.runtimeProcesses.clear();
this.persist();
}
getAll(): ManagedProcessRecord[] {
this.initialize();
return Array.from(this.entries.entries())
.map(([id, info]) => ({ id, ...info }))
.sort((a, b) => {
const left = Date.parse(a.startedAt);
const right = Date.parse(b.startedAt);
return (Number.isNaN(left) ? 0 : left) - (Number.isNaN(right) ? 0 : right);
});
}
getBySession(sessionId: string | number): ManagedProcessRecord[] {
const normalized = String(sessionId);
return this.getAll().filter(record => record.sessionId !== undefined && String(record.sessionId) === normalized);
}
getRuntimeProcess(id: string): ChildProcess | undefined {
return this.runtimeProcesses.get(id);
}
getByPid(pid: number): ManagedProcessRecord[] {
return this.getAll().filter(record => record.pid === pid);
}
pruneDeadEntries(): number {
this.initialize();
let removed = 0;
for (const [id, info] of this.entries) {
if (isPidAlive(info.pid)) continue;
this.entries.delete(id);
this.runtimeProcesses.delete(id);
removed += 1;
}
if (removed > 0) {
this.persist();
}
return removed;
}
/**
* Kill and unregister all processes tagged with the given sessionId.
* Sends SIGTERM first, waits up to 5s, then SIGKILL for survivors.
* Called when a session is deleted to prevent leaked child processes (#1351).
*/
async reapSession(sessionId: string | number): Promise<number> {
this.initialize();
const sessionRecords = this.getBySession(sessionId);
if (sessionRecords.length === 0) {
return 0;
}
const sessionIdNum = typeof sessionId === 'number' ? sessionId : Number(sessionId) || undefined;
logger.info('SYSTEM', `Reaping ${sessionRecords.length} process(es) for session ${sessionId}`, {
sessionId: sessionIdNum,
pids: sessionRecords.map(r => r.pid)
});
// Phase 1: SIGTERM all alive processes
const aliveRecords = sessionRecords.filter(r => isPidAlive(r.pid));
for (const record of aliveRecords) {
try {
process.kill(record.pid, 'SIGTERM');
} catch (error: unknown) {
const code = (error as NodeJS.ErrnoException).code;
if (code !== 'ESRCH') {
logger.debug('SYSTEM', `Failed to SIGTERM session process PID ${record.pid}`, {
pid: record.pid
}, error as Error);
}
}
}
// Phase 2: Wait for processes to exit
const deadline = Date.now() + REAP_SESSION_SIGTERM_TIMEOUT_MS;
while (Date.now() < deadline) {
const survivors = aliveRecords.filter(r => isPidAlive(r.pid));
if (survivors.length === 0) break;
await new Promise(resolve => setTimeout(resolve, 100));
}
// Phase 3: SIGKILL any survivors
const survivors = aliveRecords.filter(r => isPidAlive(r.pid));
for (const record of survivors) {
logger.warn('SYSTEM', `Session process PID ${record.pid} did not exit after SIGTERM, sending SIGKILL`, {
pid: record.pid,
sessionId: sessionIdNum
});
try {
process.kill(record.pid, 'SIGKILL');
} catch (error: unknown) {
const code = (error as NodeJS.ErrnoException).code;
if (code !== 'ESRCH') {
logger.debug('SYSTEM', `Failed to SIGKILL session process PID ${record.pid}`, {
pid: record.pid
}, error as Error);
}
}
}
// Brief wait for SIGKILL to take effect
if (survivors.length > 0) {
const sigkillDeadline = Date.now() + REAP_SESSION_SIGKILL_TIMEOUT_MS;
while (Date.now() < sigkillDeadline) {
const remaining = survivors.filter(r => isPidAlive(r.pid));
if (remaining.length === 0) break;
await new Promise(resolve => setTimeout(resolve, 100));
}
}
// Phase 4: Unregister all session records
for (const record of sessionRecords) {
this.entries.delete(record.id);
this.runtimeProcesses.delete(record.id);
}
this.persist();
logger.info('SYSTEM', `Reaped ${sessionRecords.length} process(es) for session ${sessionId}`, {
sessionId: sessionIdNum,
reaped: sessionRecords.length
});
return sessionRecords.length;
}
private persist(): void {
const payload: PersistedRegistry = {
processes: Object.fromEntries(this.entries.entries())
};
mkdirSync(path.dirname(this.registryPath), { recursive: true });
writeFileSync(this.registryPath, JSON.stringify(payload, null, 2));
}
}
let registrySingleton: ProcessRegistry | null = null;
export function getProcessRegistry(): ProcessRegistry {
if (!registrySingleton) {
registrySingleton = new ProcessRegistry();
}
return registrySingleton;
}
export function createProcessRegistry(registryPath: string): ProcessRegistry {
return new ProcessRegistry(registryPath);
}
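The registry's persisted layout and `pruneDeadEntries()` behavior can be sketched with a plain `Map` in place of the class, again using `pid: 0` as a guaranteed-dead entry:

```typescript
// Dead-entry pruning plus the { processes: { id: info } } layout
// persisted to supervisor.json above, condensed to a Map.
interface Info { pid: number; type: string; sessionId?: string; startedAt: string }

function isPidAlive(pid: number): boolean {
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try { process.kill(pid, 0); return true; }
  catch (error) { return (error as { code?: string }).code === 'EPERM'; }
}

const entries = new Map<string, Info>([
  ['worker', { pid: process.pid, type: 'worker', startedAt: new Date().toISOString() }],
  ['mcp-1', { pid: 0, type: 'mcp', sessionId: 's1', startedAt: new Date().toISOString() }], // dead
]);

// pruneDeadEntries: drop anything whose PID no longer answers signal 0.
let removed = 0;
for (const [id, info] of entries) {
  if (isPidAlive(info.pid)) continue;
  entries.delete(id);
  removed += 1;
}

// persist(): serialize the surviving entries in the registry's JSON shape.
const payload = JSON.stringify({ processes: Object.fromEntries(entries) }, null, 2);
console.log(removed);                 // 1
console.log(entries.has('worker'));   // true
```

`reapSession` follows the same liveness check but adds the SIGTERM/SIGKILL escalation before unregistering; the timing of that escalation is sketched after the shutdown module below.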
@@ -0,0 +1,157 @@
import { execFile } from 'child_process';
import { rmSync } from 'fs';
import { homedir } from 'os';
import path from 'path';
import { promisify } from 'util';
import { logger } from '../utils/logger.js';
import { HOOK_TIMEOUTS } from '../shared/hook-constants.js';
import { isPidAlive, type ManagedProcessRecord, type ProcessRegistry } from './process-registry.js';
const execFileAsync = promisify(execFile);
const DATA_DIR = path.join(homedir(), '.claude-mem');
const PID_FILE = path.join(DATA_DIR, 'worker.pid');
type TreeKillFn = (pid: number, signal?: string, callback?: (error?: Error | null) => void) => void;
export interface ShutdownCascadeOptions {
registry: ProcessRegistry;
currentPid?: number;
pidFilePath?: string;
}
export async function runShutdownCascade(options: ShutdownCascadeOptions): Promise<void> {
const currentPid = options.currentPid ?? process.pid;
const pidFilePath = options.pidFilePath ?? PID_FILE;
const allRecords = options.registry.getAll();
const childRecords = [...allRecords]
.filter(record => record.pid !== currentPid)
.sort((a, b) => Date.parse(b.startedAt) - Date.parse(a.startedAt));
for (const record of childRecords) {
if (!isPidAlive(record.pid)) {
options.registry.unregister(record.id);
continue;
}
try {
await signalProcess(record.pid, 'SIGTERM');
} catch (error) {
logger.debug('SYSTEM', 'Failed to send SIGTERM to child process', {
pid: record.pid,
type: record.type
}, error as Error);
}
}
await waitForExit(childRecords, 5000);
const survivors = childRecords.filter(record => isPidAlive(record.pid));
for (const record of survivors) {
try {
await signalProcess(record.pid, 'SIGKILL');
} catch (error) {
logger.debug('SYSTEM', 'Failed to force kill child process', {
pid: record.pid,
type: record.type
}, error as Error);
}
}
await waitForExit(survivors, 1000);
for (const record of childRecords) {
options.registry.unregister(record.id);
}
for (const record of allRecords.filter(record => record.pid === currentPid)) {
options.registry.unregister(record.id);
}
try {
rmSync(pidFilePath, { force: true });
} catch (error) {
logger.debug('SYSTEM', 'Failed to remove PID file during shutdown', { pidFilePath }, error as Error);
}
options.registry.pruneDeadEntries();
}
async function waitForExit(records: ManagedProcessRecord[], timeoutMs: number): Promise<void> {
const deadline = Date.now() + timeoutMs;
while (Date.now() < deadline) {
const survivors = records.filter(record => isPidAlive(record.pid));
if (survivors.length === 0) {
return;
}
await new Promise(resolve => setTimeout(resolve, 100));
}
}
async function signalProcess(pid: number, signal: 'SIGTERM' | 'SIGKILL'): Promise<void> {
if (signal === 'SIGTERM') {
try {
process.kill(pid, signal);
} catch (error) {
const errno = (error as NodeJS.ErrnoException).code;
if (errno === 'ESRCH') {
return;
}
throw error;
}
return;
}
if (process.platform === 'win32') {
const treeKill = await loadTreeKill();
if (treeKill) {
await new Promise<void>((resolve, reject) => {
treeKill(pid, signal, (error) => {
if (!error) {
resolve();
return;
}
const errno = (error as NodeJS.ErrnoException).code;
if (errno === 'ESRCH') {
resolve();
return;
}
reject(error);
});
});
return;
}
const args = ['/PID', String(pid), '/T'];
if (signal === 'SIGKILL') {
args.push('/F');
}
await execFileAsync('taskkill', args, {
timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND,
windowsHide: true
});
return;
}
try {
process.kill(pid, signal);
} catch (error) {
const errno = (error as NodeJS.ErrnoException).code;
if (errno === 'ESRCH') {
return;
}
throw error;
}
}
async function loadTreeKill(): Promise<TreeKillFn | null> {
const moduleName = 'tree-kill';
try {
const treeKillModule = await import(moduleName);
return (treeKillModule.default ?? treeKillModule) as TreeKillFn;
} catch {
return null;
}
}
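The cascade's escalation decision — SIGTERM, poll for up to 5 seconds, SIGKILL only the survivors — can be modeled deterministically with a fake clock standing in for `isPidAlive`, so the logic runs instantly on any platform. `needsSigkill` is a hypothetical helper written for this sketch, not part of the module above:

```typescript
// Mirrors waitForExit above: poll at 100 ms steps until the deadline,
// then report whether escalation to SIGKILL is needed. The liveness
// check takes the simulated elapsed time instead of a real PID.
function needsSigkill(
  isAlive: (elapsedMs: number) => boolean,
  timeoutMs: number,
  stepMs = 100
): boolean {
  for (let elapsed = 0; elapsed < timeoutMs; elapsed += stepMs) {
    if (!isAlive(elapsed)) return false; // exited within the SIGTERM grace period
  }
  return isAlive(timeoutMs);             // still alive at the deadline: escalate
}

// A well-behaved process that exits ~250 ms after SIGTERM: no SIGKILL.
const wellBehaved = needsSigkill(elapsed => elapsed < 250, 5_000);
// A hung process that survives the whole 5 s window: SIGKILL it.
const hung = needsSigkill(() => true, 5_000);
console.log(wellBehaved, hung); // false true
```

In the real module the SIGKILL path additionally routes through `tree-kill` (or `taskkill /T /F`) on Windows, because plain `process.kill` there would leave the child's own descendants running.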