Compare commits

...

3 Commits

Author SHA1 Message Date
Alex Newman 097035de6c chore: bump version to 10.3.1
Publish to npm / publish (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 20:12:17 -05:00
Alex Newman e788fd3676 fix: prevent duplicate worker daemons and zombie processes (#1178)
* fix: prevent duplicate worker daemons and zombie processes

Three root causes of chroma-mcp timeouts:

1. HTTP shutdown (POST /api/admin/shutdown) closed resources but never
   called process.exit(). Zombie workers stayed alive, background tasks
   reconnected to chroma-mcp, spawning duplicate subprocesses that all
   contended for the same persistent data directory.

2. No guard against concurrent daemon startup. When hooks fired
   simultaneously, multiple daemons started before either wrote a PID
   file. The loser got EADDRINUSE but stayed alive because signal
   handlers registered in the constructor prevented exit.

3. Corrupt 147GB HNSW index file caused all chroma queries to timeout
   (MCP error -32001). Data fix: deleted corrupt collection, backfill
   rebuilds from SQLite.

Code fixes:
- Add PID-based guard in daemon startup: exit if PID file process alive
- Add port-based guard in daemon startup: exit if port already bound
  (runs before WorkerService constructor registers keepalive handlers)
- Add process.exit(0) after HTTP shutdown/restart completes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: aggressive startup cleanup and one-time chroma wipe for upgrade

Kill orphaned worker-service.cjs and chroma-mcp processes immediately
at startup (no age gate) while keeping 30-min threshold for mcp-server.
Wipe corrupt chroma data once on upgrade from pre-v10.3 versions —
backfill rebuilds from SQLite automatically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: wrap shutdown handlers in try/finally to guarantee process.exit

If onShutdown() or onRestart() threw, process.exit(0) was never reached,
leaving the daemon alive as a zombie. Also removed redundant require('fs')
calls in process-manager tests where ESM imports already existed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 20:10:28 -05:00
Alex Newman 44cdbec173 docs: update CHANGELOG.md for v10.3.0
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 18:34:33 -05:00
11 changed files with 427 additions and 148 deletions
+1 -1
View File
@@ -10,7 +10,7 @@
"plugins": [
{
"name": "claude-mem",
"version": "10.3.0",
"version": "10.3.1",
"source": "./plugin",
"description": "Persistent memory system for Claude Code - context compression across sessions"
}
+25 -14
View File
@@ -2,6 +2,31 @@
All notable changes to claude-mem.
## [v10.3.0] - 2026-02-18
## Replace WASM Embeddings with Persistent chroma-mcp MCP Connection
### Highlights
- **New: ChromaMcpManager** — Singleton stdio MCP client communicating with chroma-mcp via `uvx`, replacing the previous ChromaServerManager (`npx chroma run` + `chromadb` npm + ONNX/WASM)
- **Eliminates native binary issues** — No more segfaults, WASM embedding failures, or cross-platform install headaches
- **Graceful subprocess lifecycle** — Wired into GracefulShutdown for clean teardown; zombie process prevention with kill-on-failure and stale `onclose` handler guards
- **Connection backoff** — 10-second reconnect backoff prevents chroma-mcp spawn storms
- **SQL injection guards** — Added parameterization to ChromaSync ID exclusion queries
- **Simplified ChromaSync** — Reduced complexity by delegating embedding concerns to chroma-mcp
### Breaking Changes
None — backward compatible. ChromaDB data is preserved; only the connection mechanism changed.
### Files Changed
- `src/services/sync/ChromaMcpManager.ts` (new) — MCP client singleton
- `src/services/sync/ChromaServerManager.ts` (deleted) — Old WASM/native approach
- `src/services/sync/ChromaSync.ts` — Simplified to use MCP client
- `src/services/worker-service.ts` — Updated startup sequence
- `src/services/infrastructure/GracefulShutdown.ts` — Subprocess cleanup integration
## [v10.2.6] - 2026-02-18
## Bug Fixes
@@ -1426,17 +1451,3 @@ This patch release addresses a race condition where SIGTERM/SIGINT signals arriv
**Full Changelog**: https://github.com/thedotmack/claude-mem/compare/v8.2.7...v8.2.8
## [v8.2.7] - 2025-12-29
## What's Changed
### Token Optimizations
- Simplified MCP server tool definitions for reduced token usage
- Removed outdated troubleshooting and mem-search skill documentation
- Enhanced search parameter descriptions for better clarity
- Streamlined MCP workflows for improved efficiency
This release significantly reduces the token footprint of the plugin's MCP tools and documentation.
**Full Changelog**: https://github.com/thedotmack/claude-mem/compare/v8.2.6...v8.2.7
+1 -1
View File
@@ -1,6 +1,6 @@
{
"name": "claude-mem",
"version": "10.3.0",
"version": "10.3.1",
"description": "Memory compression system for Claude Code - persist context across sessions",
"keywords": [
"claude",
+1 -1
View File
@@ -1,6 +1,6 @@
{
"name": "claude-mem",
"version": "10.3.0",
"version": "10.3.1",
"description": "Persistent memory system for Claude Code - seamlessly preserve context across sessions",
"author": {
"name": "Alex Newman"
+1 -1
View File
@@ -1,6 +1,6 @@
{
"name": "claude-mem-plugin",
"version": "10.3.0",
"version": "10.3.1",
"private": true,
"description": "Runtime dependencies for claude-mem bundled hooks",
"type": "module",
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
+177 -1
View File
@@ -10,7 +10,7 @@
import path from 'path';
import { homedir } from 'os';
import { existsSync, writeFileSync, readFileSync, unlinkSync, mkdirSync } from 'fs';
import { existsSync, writeFileSync, readFileSync, unlinkSync, mkdirSync, rmSync } from 'fs';
import { exec, execSync, spawn } from 'child_process';
import { promisify } from 'util';
import { logger } from '../../utils/logger.js';
@@ -426,6 +426,182 @@ export async function cleanupOrphanedProcesses(): Promise<void> {
logger.info('SYSTEM', 'Orphaned processes cleaned up', { count: pidsToKill.length });
}
// Patterns that should be killed immediately at startup (no age gate)
// These are child processes that should not outlive their parent worker
const AGGRESSIVE_CLEANUP_PATTERNS = ['worker-service.cjs', 'chroma-mcp'];
// Patterns that keep the age-gated threshold (may be legitimately running)
const AGE_GATED_CLEANUP_PATTERNS = ['mcp-server.cjs'];
/**
* Aggressive startup cleanup for orphaned claude-mem processes.
*
* Unlike cleanupOrphanedProcesses() which age-gates everything at 30 minutes,
* this function kills worker-service.cjs and chroma-mcp processes immediately
* (they should not outlive their parent worker). Only mcp-server.cjs keeps
* the age threshold since it may be legitimately running.
*
* Called once at daemon startup.
*/
export async function aggressiveStartupCleanup(): Promise<void> {
const isWindows = process.platform === 'win32';
const currentPid = process.pid;
const pidsToKill: number[] = [];
const allPatterns = [...AGGRESSIVE_CLEANUP_PATTERNS, ...AGE_GATED_CLEANUP_PATTERNS];
try {
if (isWindows) {
const patternConditions = allPatterns
.map(p => `$_.CommandLine -like '*${p}*'`)
.join(' -or ');
const cmd = `powershell -NoProfile -NonInteractive -Command "Get-CimInstance Win32_Process | Where-Object { (${patternConditions}) -and $_.ProcessId -ne ${currentPid} } | Select-Object ProcessId, CommandLine, CreationDate | ConvertTo-Json"`;
const { stdout } = await execAsync(cmd, { timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND });
if (!stdout.trim() || stdout.trim() === 'null') {
logger.debug('SYSTEM', 'No orphaned claude-mem processes found (Windows)');
return;
}
const processes = JSON.parse(stdout);
const processList = Array.isArray(processes) ? processes : [processes];
const now = Date.now();
for (const proc of processList) {
const pid = proc.ProcessId;
if (!Number.isInteger(pid) || pid <= 0 || pid === currentPid) continue;
const commandLine = proc.CommandLine || '';
const isAggressive = AGGRESSIVE_CLEANUP_PATTERNS.some(p => commandLine.includes(p));
if (isAggressive) {
// Kill immediately — no age check
pidsToKill.push(pid);
logger.debug('SYSTEM', 'Found orphaned process (aggressive)', { pid, commandLine: commandLine.substring(0, 80) });
} else {
// Age-gated: only kill if older than threshold
const creationMatch = proc.CreationDate?.match(/\/Date\((\d+)\)\//);
if (creationMatch) {
const creationTime = parseInt(creationMatch[1], 10);
const ageMinutes = (now - creationTime) / (1000 * 60);
if (ageMinutes >= ORPHAN_MAX_AGE_MINUTES) {
pidsToKill.push(pid);
logger.debug('SYSTEM', 'Found orphaned process (age-gated)', { pid, ageMinutes: Math.round(ageMinutes) });
}
}
}
}
} else {
// Unix: Use ps with elapsed time
const patternRegex = allPatterns.join('|');
const { stdout } = await execAsync(
`ps -eo pid,etime,command | grep -E "${patternRegex}" | grep -v grep || true`
);
if (!stdout.trim()) {
logger.debug('SYSTEM', 'No orphaned claude-mem processes found (Unix)');
return;
}
const lines = stdout.trim().split('\n');
for (const line of lines) {
const match = line.trim().match(/^(\d+)\s+(\S+)\s+(.*)$/);
if (!match) continue;
const pid = parseInt(match[1], 10);
const etime = match[2];
const command = match[3];
if (!Number.isInteger(pid) || pid <= 0 || pid === currentPid) continue;
const isAggressive = AGGRESSIVE_CLEANUP_PATTERNS.some(p => command.includes(p));
if (isAggressive) {
// Kill immediately — no age check
pidsToKill.push(pid);
logger.debug('SYSTEM', 'Found orphaned process (aggressive)', { pid, command: command.substring(0, 80) });
} else {
// Age-gated: only kill if older than threshold
const ageMinutes = parseElapsedTime(etime);
if (ageMinutes >= ORPHAN_MAX_AGE_MINUTES) {
pidsToKill.push(pid);
logger.debug('SYSTEM', 'Found orphaned process (age-gated)', { pid, ageMinutes, command: command.substring(0, 80) });
}
}
}
}
} catch (error) {
logger.error('SYSTEM', 'Failed to enumerate orphaned processes during aggressive cleanup', {}, error as Error);
return;
}
if (pidsToKill.length === 0) {
return;
}
logger.info('SYSTEM', 'Aggressive startup cleanup: killing orphaned processes', {
platform: isWindows ? 'Windows' : 'Unix',
count: pidsToKill.length,
pids: pidsToKill
});
if (isWindows) {
for (const pid of pidsToKill) {
if (!Number.isInteger(pid) || pid <= 0) continue;
try {
execSync(`taskkill /PID ${pid} /T /F`, { timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND, stdio: 'ignore' });
} catch (error) {
logger.debug('SYSTEM', 'Failed to kill process, may have already exited', { pid }, error as Error);
}
}
} else {
for (const pid of pidsToKill) {
try {
process.kill(pid, 'SIGKILL');
} catch (error) {
logger.debug('SYSTEM', 'Process already exited', { pid }, error as Error);
}
}
}
logger.info('SYSTEM', 'Aggressive startup cleanup complete', { count: pidsToKill.length });
}
const CHROMA_MIGRATION_MARKER_FILENAME = '.chroma-cleaned-v10.3';
/**
* One-time chroma data wipe for users upgrading from versions with duplicate
* worker bugs that could corrupt chroma data. Since chroma is always rebuildable
* from SQLite (via backfillAllProjects), this is safe.
*
* Checks for a marker file. If absent, wipes ~/.claude-mem/chroma/ and writes
* the marker. If present, skips. Idempotent.
*
* @param dataDirectory - Override for DATA_DIR (used in tests)
*/
export function runOneTimeChromaMigration(dataDirectory?: string): void {
const effectiveDataDir = dataDirectory ?? DATA_DIR;
const markerPath = path.join(effectiveDataDir, CHROMA_MIGRATION_MARKER_FILENAME);
const chromaDir = path.join(effectiveDataDir, 'chroma');
if (existsSync(markerPath)) {
logger.debug('SYSTEM', 'Chroma migration marker exists, skipping wipe');
return;
}
logger.warn('SYSTEM', 'Running one-time chroma data wipe (upgrade from pre-v10.3)', { chromaDir });
if (existsSync(chromaDir)) {
rmSync(chromaDir, { recursive: true, force: true });
logger.info('SYSTEM', 'Chroma data directory removed', { chromaDir });
}
// Write marker file to prevent future wipes
mkdirSync(effectiveDataDir, { recursive: true });
writeFileSync(markerPath, new Date().toISOString());
logger.info('SYSTEM', 'Chroma migration marker written', { markerPath });
}
/**
* Spawn a detached daemon process
* Returns the child PID or undefined if spawn failed
+15 -2
View File
@@ -248,8 +248,14 @@ export class Server {
process.send!({ type: 'restart' });
} else {
// Unix or standalone Windows - handle restart ourselves
// The spawner (ensureWorkerStarted/restart command) handles spawning the new daemon.
// This process just needs to shut down and exit.
setTimeout(async () => {
await this.options.onRestart();
try {
await this.options.onRestart();
} finally {
process.exit(0);
}
}, 100);
}
});
@@ -268,7 +274,14 @@ export class Server {
} else {
// Unix or standalone Windows - handle shutdown ourselves
setTimeout(async () => {
await this.options.onShutdown();
try {
await this.options.onShutdown();
} finally {
// CRITICAL: Exit the process after shutdown completes (or fails).
// Without this, the daemon stays alive as a zombie — background tasks
// (backfill, reconnects) keep running and respawn chroma-mcp subprocesses.
process.exit(0);
}
}, 100);
}
});
+32 -2
View File
@@ -69,8 +69,10 @@ import {
readPidFile,
removePidFile,
getPlatformTimeout,
cleanupOrphanedProcesses,
aggressiveStartupCleanup,
runOneTimeChromaMigration,
cleanStalePidFile,
isProcessAlive,
spawnDaemon,
createSignalHandler
} from './infrastructure/ProcessManager.js';
@@ -367,7 +369,7 @@ export class WorkerService {
*/
private async initializeBackground(): Promise<void> {
try {
await cleanupOrphanedProcesses();
await aggressiveStartupCleanup();
// Load mode configuration
const { ModeManager } = await import('./domain/ModeManager.js');
@@ -376,6 +378,12 @@ export class WorkerService {
const settings = SettingsDefaultsManager.loadFromFile(USER_SETTINGS_PATH);
// One-time chroma wipe for users upgrading from versions with duplicate worker bugs.
// Only runs in local mode (chroma is local-only). Backfill at line ~414 rebuilds from SQLite.
if (settings.CLAUDE_MEM_MODE === 'local' || !settings.CLAUDE_MEM_MODE) {
runOneTimeChromaMigration();
}
// Initialize ChromaMcpManager (lazy - connects on first use via ChromaSync)
this.chromaMcpManager = ChromaMcpManager.getInstance();
logger.info('SYSTEM', 'ChromaMcpManager initialized (lazy - connects on first use)');
@@ -1097,6 +1105,28 @@ async function main() {
case '--daemon':
default: {
// GUARD 1: Refuse to start if another worker is already alive (PID check).
// Instant check (kill -0) — no HTTP dependency.
const existingPidInfo = readPidFile();
if (existingPidInfo && isProcessAlive(existingPidInfo.pid)) {
logger.info('SYSTEM', 'Worker already running (PID alive), refusing to start duplicate', {
existingPid: existingPidInfo.pid,
existingPort: existingPidInfo.port,
startedAt: existingPidInfo.startedAt
});
process.exit(0);
}
// GUARD 2: Refuse to start if the port is already bound.
// Catches the race where two daemons start simultaneously before
// either writes a PID file. Must run BEFORE constructing WorkerService
// because the constructor registers signal handlers and timers that
// prevent the process from exiting even if listen() fails later.
if (await isPortInUse(port)) {
logger.info('SYSTEM', 'Port already in use, refusing to start duplicate', { port });
process.exit(0);
}
// Prevent daemon from dying silently on unhandled errors.
// The HTTP server can continue serving even if a background task throws.
process.on('unhandledRejection', (reason) => {
+52 -3
View File
@@ -1,6 +1,7 @@
import { describe, it, expect, beforeEach, afterEach } from 'bun:test';
import { existsSync, readFileSync } from 'fs';
import { existsSync, readFileSync, mkdirSync, writeFileSync, rmSync } from 'fs';
import { homedir } from 'os';
import { tmpdir } from 'os';
import path from 'path';
import {
writePidFile,
@@ -12,6 +13,7 @@ import {
cleanStalePidFile,
spawnDaemon,
resolveWorkerRuntimePath,
runOneTimeChromaMigration,
type PidInfo
} from '../../src/services/infrastructure/index.js';
@@ -32,7 +34,6 @@ describe('ProcessManager', () => {
afterEach(() => {
// Restore original PID file or remove test one
if (originalPidContent !== null) {
const { writeFileSync } = require('fs');
writeFileSync(PID_FILE, originalPidContent);
originalPidContent = null;
} else {
@@ -105,7 +106,6 @@ describe('ProcessManager', () => {
});
it('should return null for corrupted JSON', () => {
const { writeFileSync } = require('fs');
writeFileSync(PID_FILE, 'not valid json {{{');
const result = readPidFile();
@@ -415,4 +415,53 @@ describe('ProcessManager', () => {
// This is a logic verification test — actual signal delivery is tested manually
});
});
describe('runOneTimeChromaMigration', () => {
let testDataDir: string;
beforeEach(() => {
testDataDir = path.join(tmpdir(), `claude-mem-test-${Date.now()}-${Math.random().toString(36).slice(2)}`);
mkdirSync(testDataDir, { recursive: true });
});
afterEach(() => {
rmSync(testDataDir, { recursive: true, force: true });
});
it('should wipe chroma directory and write marker file', () => {
// Create a fake chroma directory with data
const chromaDir = path.join(testDataDir, 'chroma');
mkdirSync(chromaDir, { recursive: true });
writeFileSync(path.join(chromaDir, 'test-data.bin'), 'fake chroma data');
runOneTimeChromaMigration(testDataDir);
// Chroma dir should be gone
expect(existsSync(chromaDir)).toBe(false);
// Marker file should exist
expect(existsSync(path.join(testDataDir, '.chroma-cleaned-v10.3'))).toBe(true);
});
it('should skip when marker file already exists (idempotent)', () => {
// Write marker file first
writeFileSync(path.join(testDataDir, '.chroma-cleaned-v10.3'), 'already done');
// Create a chroma directory that should NOT be wiped
const chromaDir = path.join(testDataDir, 'chroma');
mkdirSync(chromaDir, { recursive: true });
writeFileSync(path.join(chromaDir, 'important.bin'), 'should survive');
runOneTimeChromaMigration(testDataDir);
// Chroma dir should still exist (migration was skipped)
expect(existsSync(chromaDir)).toBe(true);
expect(existsSync(path.join(chromaDir, 'important.bin'))).toBe(true);
});
it('should handle missing chroma directory gracefully', () => {
// No chroma dir exists — should just write marker without error
expect(() => runOneTimeChromaMigration(testDataDir)).not.toThrow();
expect(existsSync(path.join(testDataDir, '.chroma-cleaned-v10.3'))).toBe(true);
});
});
});