fix: prevent duplicate worker daemons and zombie processes (#1178)

* fix: prevent duplicate worker daemons and zombie processes

Three root causes of chroma-mcp timeouts:

1. HTTP shutdown (POST /api/admin/shutdown) closed resources but never
   called process.exit(). Zombie workers stayed alive, background tasks
   reconnected to chroma-mcp, spawning duplicate subprocesses that all
   contended for the same persistent data directory.

2. No guard against concurrent daemon startup. When hooks fired
   simultaneously, multiple daemons started before either wrote a PID
   file. The loser got EADDRINUSE but stayed alive because signal
   handlers registered in the constructor prevented exit.

3. Corrupt 147GB HNSW index file caused all chroma queries to timeout
   (MCP error -32001). Data fix: deleted corrupt collection, backfill
   rebuilds from SQLite.

Code fixes:
- Add PID-based guard in daemon startup: exit if PID file process alive
- Add port-based guard in daemon startup: exit if port already bound
  (runs before WorkerService constructor registers keepalive handlers)
- Add process.exit(0) after HTTP shutdown/restart completes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: aggressive startup cleanup and one-time chroma wipe for upgrade

Kill orphaned worker-service.cjs and chroma-mcp processes immediately
at startup (no age gate) while keeping 30-min threshold for mcp-server.
Wipe corrupt chroma data once on upgrade from pre-v10.3 versions —
backfill rebuilds from SQLite automatically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: wrap shutdown handlers in try/finally to guarantee process.exit

If onShutdown() or onRestart() threw, process.exit(0) was never reached,
leaving the daemon alive as a zombie. Also removed redundant require('fs')
calls in process-manager tests where ESM imports already existed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Alex Newman
2026-02-18 20:10:28 -05:00
committed by GitHub
parent 44cdbec173
commit e788fd3676
5 changed files with 397 additions and 129 deletions
+15 -2
View File
@@ -248,8 +248,14 @@ export class Server {
process.send!({ type: 'restart' });
} else {
// Unix or standalone Windows - handle restart ourselves
// The spawner (ensureWorkerStarted/restart command) handles spawning the new daemon.
// This process just needs to shut down and exit.
setTimeout(async () => {
await this.options.onRestart();
try {
await this.options.onRestart();
} finally {
process.exit(0);
}
}, 100);
}
});
@@ -268,7 +274,14 @@ export class Server {
} else {
// Unix or standalone Windows - handle shutdown ourselves
setTimeout(async () => {
await this.options.onShutdown();
try {
await this.options.onShutdown();
} finally {
// CRITICAL: Exit the process after shutdown completes (or fails).
// Without this, the daemon stays alive as a zombie — background tasks
// (backfill, reconnects) keep running and respawn chroma-mcp subprocesses.
process.exit(0);
}
}, 100);
}
});