Files
claude-mem/.claude/plans/pr-458-fixes.md
T
Alex Newman 2a60794485 fix: handle Unix kill errors in orphaned process cleanup
Replace execAsync kill command with individual process.kill calls wrapped
in try/catch to gracefully handle processes that have already exited.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-26 20:51:41 -05:00

5.7 KiB

PR #458 Fix Plan: Worker Lifecycle Critical Issues

Branch: bugfix/spawn-worker Created: 2025-12-26 Context: PR review found 3 critical bugs, 5 important issues in the worker self-spawn implementation


Phase 1: Fix Critical waitForProcessesExit Bug

Goal: Fix the logic bug that crashes shutdown when child processes exit

File: src/services/worker-service.ts

What to do:

  1. Find the waitForProcessesExit method (around line 752-771)
  2. The pids.filter() callback calls process.kill(pid, 0) which throws when the process is dead
  3. Wrap in try/catch to return false when process doesn't exist

Current code:

const stillAlive = pids.filter(pid => {
  process.kill(pid, 0); // Signal 0 checks if process exists - throws if dead
  return true;
});

Fixed code:

const stillAlive = pids.filter(pid => {
  try {
    process.kill(pid, 0);
    return true;
  } catch {
    return false;
  }
});

Verification: The filter should now correctly remove PIDs of exited processes instead of throwing.


Phase 2: Fix Spawn PID Validation

Goal: Prevent invalid PID file writes when spawn fails

File: src/services/worker-service.ts

What to do:

  1. Find the start case in the CLI switch (around line 816-844)
  2. After spawn(), add check for undefined pid before writing PID file
  3. Exit with error if spawn failed

Current code:

const child = spawn(process.execPath, [__filename, '--daemon'], {...});
child.unref();
writePidFile({ pid: child.pid!, port, startedAt: new Date().toISOString() });

Fixed code:

const child = spawn(process.execPath, [__filename, '--daemon'], {...});

if (child.pid === undefined) {
  console.error('Failed to spawn worker daemon');
  process.exit(1);
}

child.unref();
writePidFile({ pid: child.pid, port, startedAt: new Date().toISOString() });

Verification: Remove the ! non-null assertion since we now properly check.


Phase 3: Fix Unix Process Cleanup Error Handling

Goal: Handle errors in orphaned process cleanup

File: src/services/worker-service.ts

What to do:

  1. Find cleanupOrphanedProcesses method (around line 389-458)
  2. The Unix branch at line 454 has await execAsync(\kill ${pids.join(' ')}`)` with no error handling
  3. Replace with individual process.kill calls wrapped in try/catch

Current code:

} else {
  await execAsync(`kill ${pids.join(' ')}`);
}

Fixed code:

} else {
  for (const pid of pids) {
    try {
      process.kill(pid, 'SIGKILL');
    } catch {
      // Process already exited - that's fine
    }
  }
}

Phase 4: Fix Windows taskkill Error Handling

Goal: Prevent cleanup abort when one process fails to kill

File: src/services/worker-service.ts

What to do:

  1. Find the Windows taskkill loop in cleanupOrphanedProcesses (around line 444-452)
  2. Wrap execSync in try/catch so one failure doesn't abort the loop

Current code:

for (const pid of pids) {
  if (!Number.isInteger(pid) || pid <= 0) {
    logger.warn('SYSTEM', 'Skipping invalid PID', { pid });
    continue;
  }
  execSync(`taskkill /PID ${pid} /T /F`, { timeout: 60000, stdio: 'ignore' });
}

Fixed code:

for (const pid of pids) {
  if (!Number.isInteger(pid) || pid <= 0) {
    logger.warn('SYSTEM', 'Skipping invalid PID', { pid });
    continue;
  }
  try {
    execSync(`taskkill /PID ${pid} /T /F`, { timeout: 60000, stdio: 'ignore' });
  } catch {
    // Process may have already exited - continue cleanup
  }
}

Phase 5: Use Readiness Endpoint for Health Checks

Goal: Ensure waitForHealth checks full initialization, not just HTTP server

File: src/services/worker-service.ts

What to do:

  1. Find waitForHealth function (around line 61-68)
  2. Change endpoint from /api/health to /api/readiness
  3. /api/readiness returns 503 until background init completes, /api/health returns 200 immediately

Current code:

async function waitForHealth(port: number, timeoutMs: number = 30000): Promise<boolean> {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    if (await isPortInUse(port)) return true;
    await new Promise(r => setTimeout(r, 500));
  }
  return false;
}

Fixed code:

async function waitForHealth(port: number, timeoutMs: number = 30000): Promise<boolean> {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    try {
      const response = await fetch(`http://127.0.0.1:${port}/api/readiness`, {
        signal: AbortSignal.timeout(2000)
      });
      if (response.ok) return true;
    } catch {
      // Not ready yet
    }
    await new Promise(r => setTimeout(r, 500));
  }
  return false;
}

Phase 6: Build, Test, and Commit

Goal: Verify fixes work and commit changes

Commands:

npm run build-and-sync

Commit message:

fix: address critical error handling issues in worker lifecycle

- Fix waitForProcessesExit crash when child processes exit
- Add spawn pid validation before writing PID file
- Handle Unix kill errors in orphaned process cleanup
- Handle Windows taskkill errors in cleanup loop
- Use /api/readiness for health checks instead of /api/health

Fixes issues found in PR #458 review.

Summary

Phase Priority Description
1 CRITICAL Fix waitForProcessesExit filter bug
2 CRITICAL Validate child.pid after spawn
3 CRITICAL Fix Unix kill error handling
4 HIGH Fix Windows taskkill error handling
5 HIGH Use readiness endpoint for health
6 - Build and commit

Estimated changes: ~30 lines modified in worker-service.ts