fix: prevent zombie subprocess accumulation by only trusting exitCode (#1226) (#1325)

proc.killed only means Node sent a signal; the process can still be alive.
Treating it as proof of exit released pool slots prematurely, allowing
unbounded process spawning.
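
A minimal sketch of the difference between the old and new guard, using a
plain snapshot object instead of a real child process. The names ProcState,
oldGuard and newGuard are hypothetical and exist only for illustration; they
are not identifiers from this repository.

```typescript
// Hypothetical snapshot of the two ChildProcess fields the guards look at.
interface ProcState {
  killed: boolean;          // true once a signal was successfully sent
  exitCode: number | null;  // null until the process has exited with a code
}

// Old guard: skipped cleanup as soon as a signal had been sent.
const oldGuard = (p: ProcState): boolean => !p.killed && p.exitCode === null;

// New guard: only an exit code counts as proof that the process is gone.
const newGuard = (p: ProcState): boolean => p.exitCode === null;

// A process that was signaled but has not exited yet.
const signaledButAlive: ProcState = { killed: true, exitCode: null };

console.log(oldGuard(signaledButAlive)); // false: cleanup skipped
console.log(newGuard(signaledButAlive)); // true: cleanup still runs
```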

- ensureProcessExit: remove proc.killed from early-exit checks, only trust exitCode
- Fix 3 call-site guards that skipped cleanup for signaled-but-alive processes
- Add TOTAL_PROCESS_HARD_CAP=10 safety net in waitForSlot()
- After SIGKILL, wait up to 1s via exit event instead of blind 200ms sleep
- Reduce reaper interval from 5min to 1min, idle threshold from 2min to 1min
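
The exit-verification flow described above could look roughly like the
following. This is a sketch under stated assumptions, not the code in this
commit: ProcLike, waitForExit and ensureExitSketch are hypothetical names, and
the real ensureProcessExit takes a tracked-process wrapper rather than a bare
process object.

```typescript
import { EventEmitter } from 'node:events';

// Hypothetical minimal shape of the ChildProcess members this sketch uses.
interface ProcLike extends EventEmitter {
  exitCode: number | null;
  kill(signal?: string): boolean;
}

// Resolves true if 'exit' fires within timeoutMs, false if the timer wins.
function waitForExit(proc: ProcLike, timeoutMs: number): Promise<boolean> {
  return new Promise((resolve) => {
    const onExit = () => {
      clearTimeout(timer);
      resolve(true);
    };
    const timer = setTimeout(() => {
      proc.removeListener('exit', onExit);
      resolve(false);
    }, timeoutMs);
    proc.once('exit', onExit);
  });
}

async function ensureExitSketch(proc: ProcLike, gracefulMs: number): Promise<void> {
  // Only an exit code is trusted; a "signal was sent" flag is ignored.
  if (proc.exitCode !== null) return;

  // Give the process a chance to exit on its own.
  if (await waitForExit(proc, gracefulMs)) return;

  // Escalate, then wait up to 1s on the exit event instead of sleeping blindly.
  proc.kill('SIGKILL');
  await waitForExit(proc, 1000);
}
```

One caveat of this sketch: a process that was already terminated by a signal
before the call reports an exitCode of null and will not emit exit again, so
the function waits out both timeouts for it.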

Closes #1226

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Nir Alfasi
2026-03-13 04:59:42 +02:00
committed by GitHub
parent 23058d4b0c
commit 38d9ac7adb
5 changed files with 231 additions and 13 deletions
+2 -2
@@ -470,7 +470,7 @@ export class WorkerService {
}
return activeIds;
});
-logger.info('SYSTEM', 'Started orphan reaper (runs every 5 minutes)');
+logger.info('SYSTEM', 'Started orphan reaper (runs every 1 minute)');
// Reap stale sessions to unblock orphan process cleanup (Issue #1168)
this.staleSessionReaperInterval = setInterval(async () => {
@@ -618,7 +618,7 @@ export class WorkerService {
.finally(async () => {
// CRITICAL: Verify subprocess exit to prevent zombie accumulation (Issue #1168)
const trackedProcess = getProcessBySession(session.sessionDbId);
-if (trackedProcess && !trackedProcess.process.killed && trackedProcess.process.exitCode === null) {
+if (trackedProcess && trackedProcess.process.exitCode === null) {
await ensureProcessExit(trackedProcess, 5000);
}