fix: prevent duplicate worker daemons and zombie processes (#1178)
* fix: prevent duplicate worker daemons and zombie processes Three root causes of chroma-mcp timeouts: 1. HTTP shutdown (POST /api/admin/shutdown) closed resources but never called process.exit(). Zombie workers stayed alive, background tasks reconnected to chroma-mcp, spawning duplicate subprocesses that all contended for the same persistent data directory. 2. No guard against concurrent daemon startup. When hooks fired simultaneously, multiple daemons started before either wrote a PID file. The loser got EADDRINUSE but stayed alive because signal handlers registered in the constructor prevented exit. 3. Corrupt 147GB HNSW index file caused all chroma queries to timeout (MCP error -32001). Data fix: deleted corrupt collection, backfill rebuilds from SQLite. Code fixes: - Add PID-based guard in daemon startup: exit if PID file process alive - Add port-based guard in daemon startup: exit if port already bound (runs before WorkerService constructor registers keepalive handlers) - Add process.exit(0) after HTTP shutdown/restart completes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: aggressive startup cleanup and one-time chroma wipe for upgrade Kill orphaned worker-service.cjs and chroma-mcp processes immediately at startup (no age gate) while keeping 30-min threshold for mcp-server. Wipe corrupt chroma data once on upgrade from pre-v10.3 versions — backfill rebuilds from SQLite automatically. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: wrap shutdown handlers in try/finally to guarantee process.exit If onShutdown() or onRestart() threw, process.exit(0) was never reached, leaving the daemon alive as a zombie. Also removed redundant require('fs') calls in process-manager tests where ESM imports already existed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -69,8 +69,10 @@ import {
|
||||
readPidFile,
|
||||
removePidFile,
|
||||
getPlatformTimeout,
|
||||
cleanupOrphanedProcesses,
|
||||
aggressiveStartupCleanup,
|
||||
runOneTimeChromaMigration,
|
||||
cleanStalePidFile,
|
||||
isProcessAlive,
|
||||
spawnDaemon,
|
||||
createSignalHandler
|
||||
} from './infrastructure/ProcessManager.js';
|
||||
@@ -367,7 +369,7 @@ export class WorkerService {
|
||||
*/
|
||||
private async initializeBackground(): Promise<void> {
|
||||
try {
|
||||
await cleanupOrphanedProcesses();
|
||||
await aggressiveStartupCleanup();
|
||||
|
||||
// Load mode configuration
|
||||
const { ModeManager } = await import('./domain/ModeManager.js');
|
||||
@@ -376,6 +378,12 @@ export class WorkerService {
|
||||
|
||||
const settings = SettingsDefaultsManager.loadFromFile(USER_SETTINGS_PATH);
|
||||
|
||||
// One-time chroma wipe for users upgrading from versions with duplicate worker bugs.
|
||||
// Only runs in local mode (chroma is local-only). Backfill at line ~414 rebuilds from SQLite.
|
||||
if (settings.CLAUDE_MEM_MODE === 'local' || !settings.CLAUDE_MEM_MODE) {
|
||||
runOneTimeChromaMigration();
|
||||
}
|
||||
|
||||
// Initialize ChromaMcpManager (lazy - connects on first use via ChromaSync)
|
||||
this.chromaMcpManager = ChromaMcpManager.getInstance();
|
||||
logger.info('SYSTEM', 'ChromaMcpManager initialized (lazy - connects on first use)');
|
||||
@@ -1097,6 +1105,28 @@ async function main() {
|
||||
|
||||
case '--daemon':
|
||||
default: {
|
||||
// GUARD 1: Refuse to start if another worker is already alive (PID check).
|
||||
// Instant check (kill -0) — no HTTP dependency.
|
||||
const existingPidInfo = readPidFile();
|
||||
if (existingPidInfo && isProcessAlive(existingPidInfo.pid)) {
|
||||
logger.info('SYSTEM', 'Worker already running (PID alive), refusing to start duplicate', {
|
||||
existingPid: existingPidInfo.pid,
|
||||
existingPort: existingPidInfo.port,
|
||||
startedAt: existingPidInfo.startedAt
|
||||
});
|
||||
process.exit(0);
|
||||
}
|
||||
|
||||
// GUARD 2: Refuse to start if the port is already bound.
|
||||
// Catches the race where two daemons start simultaneously before
|
||||
// either writes a PID file. Must run BEFORE constructing WorkerService
|
||||
// because the constructor registers signal handlers and timers that
|
||||
// prevent the process from exiting even if listen() fails later.
|
||||
if (await isPortInUse(port)) {
|
||||
logger.info('SYSTEM', 'Port already in use, refusing to start duplicate', { port });
|
||||
process.exit(0);
|
||||
}
|
||||
|
||||
// Prevent daemon from dying silently on unhandled errors.
|
||||
// The HTTP server can continue serving even if a background task throws.
|
||||
process.on('unhandledRejection', (reason) => {
|
||||
|
||||
Reference in New Issue
Block a user