diff --git a/.claude/plans/pr-458-fixes.md b/.claude/plans/pr-458-fixes.md deleted file mode 100644 index 6d0a120a..00000000 --- a/.claude/plans/pr-458-fixes.md +++ /dev/null @@ -1,229 +0,0 @@ -# PR #458 Fix Plan: Worker Lifecycle Critical Issues - -**Branch:** `bugfix/spawn-worker` -**Created:** 2025-12-26 -**Context:** PR review found 3 critical bugs, 5 important issues in the worker self-spawn implementation - ---- - -## Phase 1: Fix Critical `waitForProcessesExit` Bug - -**Goal:** Fix the logic bug that crashes shutdown when child processes exit - -**File:** `src/services/worker-service.ts` - -**What to do:** -1. Find the `waitForProcessesExit` method (around line 752-771) -2. The `pids.filter()` callback calls `process.kill(pid, 0)` which throws when the process is dead -3. Wrap in try/catch to return false when process doesn't exist - -**Current code:** -```typescript -const stillAlive = pids.filter(pid => { - process.kill(pid, 0); // Signal 0 checks if process exists - throws if dead - return true; -}); -``` - -**Fixed code:** -```typescript -const stillAlive = pids.filter(pid => { - try { - process.kill(pid, 0); - return true; - } catch { - return false; - } -}); -``` - -**Verification:** The filter should now correctly remove PIDs of exited processes instead of throwing. - ---- - -## Phase 2: Fix Spawn PID Validation - -**Goal:** Prevent invalid PID file writes when spawn fails - -**File:** `src/services/worker-service.ts` - -**What to do:** -1. Find the `start` case in the CLI switch (around line 816-844) -2. After spawn(), add check for undefined pid before writing PID file -3. Exit with error if spawn failed - -**Current code:** -```typescript -const child = spawn(process.execPath, [__filename, '--daemon'], {...}); -child.unref(); -writePidFile({ pid: child.pid!, port, startedAt: new Date().toISOString() }); -``` - -**Fixed code:** -```typescript -const child = spawn(process.execPath, [__filename, '--daemon'], {...}); - -if (child.pid === undefined) { - console.error('Failed to spawn worker daemon'); - process.exit(1); -} - -child.unref(); -writePidFile({ pid: child.pid, port, startedAt: new Date().toISOString() }); -``` - -**Verification:** Remove the `!` non-null assertion since we now properly check. - ---- - -## Phase 3: Fix Unix Process Cleanup Error Handling - -**Goal:** Handle errors in orphaned process cleanup - -**File:** `src/services/worker-service.ts` - -**What to do:** -1. Find `cleanupOrphanedProcesses` method (around line 389-458) -2. The Unix branch at line 454 has `await execAsync(\`kill ${pids.join(' ')}\`)` with no error handling -3. Replace with individual process.kill calls wrapped in try/catch - -**Current code:** -```typescript -} else { - await execAsync(`kill ${pids.join(' ')}`); -} -``` - -**Fixed code:** -```typescript -} else { - for (const pid of pids) { - try { - process.kill(pid, 'SIGKILL'); - } catch { - // Process already exited - that's fine - } - } -} -``` - ---- - -## Phase 4: Fix Windows taskkill Error Handling - -**Goal:** Prevent cleanup abort when one process fails to kill - -**File:** `src/services/worker-service.ts` - -**What to do:** -1. Find the Windows taskkill loop in `cleanupOrphanedProcesses` (around line 444-452) -2. Wrap `execSync` in try/catch so one failure doesn't abort the loop - -**Current code:** -```typescript -for (const pid of pids) { - if (!Number.isInteger(pid) || pid <= 0) { - logger.warn('SYSTEM', 'Skipping invalid PID', { pid }); - continue; - } - execSync(`taskkill /PID ${pid} /T /F`, { timeout: 60000, stdio: 'ignore' }); -} -``` - -**Fixed code:** -```typescript -for (const pid of pids) { - if (!Number.isInteger(pid) || pid <= 0) { - logger.warn('SYSTEM', 'Skipping invalid PID', { pid }); - continue; - } - try { - execSync(`taskkill /PID ${pid} /T /F`, { timeout: 60000, stdio: 'ignore' }); - } catch { - // Process may have already exited - continue cleanup - } -} -``` - ---- - -## Phase 5: Use Readiness Endpoint for Health Checks - -**Goal:** Ensure `waitForHealth` checks full initialization, not just HTTP server - -**File:** `src/services/worker-service.ts` - -**What to do:** -1. Find `waitForHealth` function (around line 61-68) -2. Change endpoint from `/api/health` to `/api/readiness` -3. `/api/readiness` returns 503 until background init completes, `/api/health` returns 200 immediately - -**Current code:** -```typescript -async function waitForHealth(port: number, timeoutMs: number = 30000): Promise { - const start = Date.now(); - while (Date.now() - start < timeoutMs) { - if (await isPortInUse(port)) return true; - await new Promise(r => setTimeout(r, 500)); - } - return false; -} -``` - -**Fixed code:** -```typescript -async function waitForHealth(port: number, timeoutMs: number = 30000): Promise { - const start = Date.now(); - while (Date.now() - start < timeoutMs) { - try { - const response = await fetch(`http://127.0.0.1:${port}/api/readiness`, { - signal: AbortSignal.timeout(2000) - }); - if (response.ok) return true; - } catch { - // Not ready yet - } - await new Promise(r => setTimeout(r, 500)); - } - return false; -} -``` - ---- - -## Phase 6: Build, Test, and Commit - -**Goal:** Verify fixes work and commit changes - -**Commands:** -```bash -npm run build-and-sync -``` - -**Commit message:** -``` -fix: address critical error handling issues in worker lifecycle - -- Fix waitForProcessesExit crash when child processes exit -- Add spawn pid validation before writing PID file -- Handle Unix kill errors in orphaned process cleanup -- Handle Windows taskkill errors in cleanup loop -- Use /api/readiness for health checks instead of /api/health - -Fixes issues found in PR #458 review. -``` - ---- - -## Summary - -| Phase | Priority | Description | -|-------|----------|-------------| -| 1 | CRITICAL | Fix `waitForProcessesExit` filter bug | -| 2 | CRITICAL | Validate `child.pid` after spawn | -| 3 | CRITICAL | Fix Unix kill error handling | -| 4 | HIGH | Fix Windows taskkill error handling | -| 5 | HIGH | Use readiness endpoint for health | -| 6 | - | Build and commit | - -**Estimated changes:** ~30 lines modified in `worker-service.ts`