chore: bump version to 10.3.2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: rename save_memory and fix MCP search instructions + startup hook (#1210)
2026-02-23 03:32:22 -05:00 · 2026-02-23 03:30:31 -05:00 · 2026-02-19 22:39:36 -05:00 · 2026-02-19 22:08:43 -05:00 · 2026-02-19 22:06:05 -05:00 · 2026-02-18 20:12:46 -05:00
27 changed files with 4933 additions and 364 deletions
@@ -10,7 +10,7 @@
  "plugins": [
    {
      "name": "claude-mem",
-      "version": "10.3.0",
+      "version": "10.3.2",
      "source": "./plugin",
      "description": "Persistent memory system for Claude Code - context compression across sessions"
    }
@@ -1,6 +1,7 @@
 datasets/
 node_modules/
 dist/
+!installer/dist/
 *.log
 .DS_Store
 .env
@@ -2,6 +2,46 @@

 All notable changes to claude-mem.

+## [v10.3.1] - 2026-02-19
+
+## Fix: Prevent Duplicate Worker Daemons and Zombie Processes
+
+Three root causes of chroma-mcp timeouts identified and fixed:
+
+### PID-based daemon guard
+Exit immediately on startup if PID file points to a live process. Prevents the race condition where hooks firing simultaneously could start multiple daemons before either wrote a PID file.
+
+### Port-based daemon guard
+Exit if port 37777 is already bound — runs before WorkerService constructor registers keepalive signal handlers that previously prevented exit on EADDRINUSE.
+
+### Guaranteed process.exit() after HTTP shutdown
+HTTP shutdown (POST /api/admin/shutdown) now calls `process.exit(0)` in a `try/finally` block. Previously, zombie workers stayed alive after shutdown, and background tasks reconnected to chroma-mcp, spawning duplicate subprocesses contending for the same data directory.
+
+## [v10.3.0] - 2026-02-18
+
+## Replace WASM Embeddings with Persistent chroma-mcp MCP Connection
+
+### Highlights
+
+- **New: ChromaMcpManager** — Singleton stdio MCP client communicating with chroma-mcp via `uvx`, replacing the previous ChromaServerManager (`npx chroma run` + `chromadb` npm + ONNX/WASM)
+- **Eliminates native binary issues** — No more segfaults, WASM embedding failures, or cross-platform install headaches
+- **Graceful subprocess lifecycle** — Wired into GracefulShutdown for clean teardown; zombie process prevention with kill-on-failure and stale `onclose` handler guards
+- **Connection backoff** — 10-second reconnect backoff prevents chroma-mcp spawn storms
+- **SQL injection guards** — Added parameterization to ChromaSync ID exclusion queries
+- **Simplified ChromaSync** — Reduced complexity by delegating embedding concerns to chroma-mcp
+
+### Breaking Changes
+
+None — backward compatible. ChromaDB data is preserved; only the connection mechanism changed.
+
+### Files Changed
+
+- `src/services/sync/ChromaMcpManager.ts` (new) — MCP client singleton
+- `src/services/sync/ChromaServerManager.ts` (deleted) — Old WASM/native approach
+- `src/services/sync/ChromaSync.ts` — Simplified to use MCP client
+- `src/services/worker-service.ts` — Updated startup sequence
+- `src/services/infrastructure/GracefulShutdown.ts` — Subprocess cleanup integration
+
 ## [v10.2.6] - 2026-02-18

 ## Bug Fixes
@@ -1411,32 +1451,3 @@ Thanks @yungweng for the detailed bug report!
 - Updated worker CLI scripts to reference worker-service.cjs directly
 - Simplified hook command configurations

-## [v8.2.8] - 2025-12-29
-
-## Bug Fixes
-
-- Fixed orphaned chroma-mcp processes during shutdown (#489)
-  - Added graceful shutdown handling with signal handlers registered early in WorkerService lifecycle
-  - Ensures ChromaSync subprocess cleanup even when interrupted during initialization
-  - Removes PID file during shutdown to prevent stale process tracking
-
-## Technical Details
-
-This patch release addresses a race condition where SIGTERM/SIGINT signals arriving during ChromaSync initialization could leave orphaned chroma-mcp processes. The fix moves signal handler registration from the start() method to the constructor, ensuring cleanup handlers exist throughout the entire initialization lifecycle.
-
-**Full Changelog**: https://github.com/thedotmack/claude-mem/compare/v8.2.7...v8.2.8
-
-## [v8.2.7] - 2025-12-29
-
-## What's Changed
-
-### Token Optimizations
-- Simplified MCP server tool definitions for reduced token usage
-- Removed outdated troubleshooting and mem-search skill documentation
-- Enhanced search parameter descriptions for better clarity
-- Streamlined MCP workflows for improved efficiency
-
-This release significantly reduces the token footprint of the plugin's MCP tools and documentation.
-
-**Full Changelog**: https://github.com/thedotmack/claude-mem/compare/v8.2.6...v8.2.7
-
@@ -198,7 +198,7 @@ See [Architecture Overview](https://docs.claude-mem.ai/architecture/overview) fo

 ## MCP Search Tools

-Claude-Mem provides intelligent memory search through **5 MCP tools** following a token-efficient **3-layer workflow pattern**:
+Claude-Mem provides intelligent memory search through **4 MCP tools** following a token-efficient **3-layer workflow pattern**:

 **The 3-Layer Workflow:**

@@ -211,7 +211,6 @@ Claude-Mem provides intelligent memory search through **5 MCP tools** following
 - Start with `search` to get an index of results
 - Use `timeline` to see what was happening around specific observations
 - Use `get_observations` to fetch full details for relevant IDs
-- Use `save_memory` to manually store important information
 - **~10x token savings** by filtering before fetching details

 **Available MCP Tools:**
@@ -219,8 +218,6 @@ Claude-Mem provides intelligent memory search through **5 MCP tools** following
 1. **`search`** - Search memory index with full-text queries, filters by type/date/project
 2. **`timeline`** - Get chronological context around a specific observation or query
 3. **`get_observations`** - Fetch full observation details by IDs (always batch multiple IDs)
-4. **`save_memory`** - Manually save a memory/observation for semantic search
-5. **`__IMPORTANT`** - Workflow documentation (always visible to Claude)

 **Example Usage:**

@@ -232,9 +229,6 @@ search(query="authentication bug", type="bugfix", limit=10)

 // Step 3: Fetch full details
 get_observations(ids=[123, 456])
-
-// Save important information manually
-save_memory(text="API requires auth header X-API-Key", title="API Auth")
 ```

 See [Search Tools Guide](https://docs.claude-mem.ai/usage/search-tools) for detailed examples.
@@ -5,7 +5,7 @@ set -euo pipefail
 # Usage: curl -fsSL https://install.cmem.ai | bash
 #   or:  curl -fsSL https://install.cmem.ai | bash -s -- --provider=gemini --api-key=YOUR_KEY

-INSTALLER_URL="https://raw.githubusercontent.com/thedotmack/claude-mem/main/installer/dist/index.js"
+INSTALLER_URL="https://install.cmem.ai/installer.js"

 # Colors
 RED='\033[0;31m'
@@ -1,5 +1,8 @@
 {
  "$schema": "https://openapi.vercel.sh/vercel.json",
+  "rewrites": [
+    { "source": "/", "destination": "/install.sh" }
+  ],
  "headers": [
    {
      "source": "/(.*)\\.sh",
@@ -1,6 +1,6 @@
 {
  "name": "claude-mem",
-  "version": "10.3.0",
+  "version": "10.3.2",
  "description": "Memory compression system for Claude Code - persist context across sessions",
  "keywords": [
    "claude",
@@ -0,0 +1,52 @@
+# Fix: SessionStart Hook "startup hook error" — Worker Not Waiting
+
+## Root Cause
+
+The **installed plugin** (`~/.claude/plugins/marketplaces/thedotmack/`) is version **10.2.5** and has **none** of the recent fixes:
+
+| Fix | Repo Status | Installed Status |
+|-----|-------------|-----------------|
+| Hook group split (smart-install isolated from worker start) | In `plugin/hooks/hooks.json` | **Missing** — all 3 hooks in one group, smart-install failure blocks worker |
+| `waitForReadiness()` after spawn | In `src/services/infrastructure/HealthMonitor.ts` | **Missing** — 0 occurrences in installed `worker-service.cjs` |
+| Early `initializationCompleteFlag` (after DB+search, not MCP) | In `src/services/worker-service.ts` | **Missing** — flag set after MCP connection (5+ minute wait) |
+
+The changes exist in source code but were **never built and synced** to the installed location.
+
+---
+
+## Phase 1: Build and Sync
+
+```bash
+npm run build-and-sync
+```
+
+### Verification
+
+```bash
+# 1. Confirm waitForReadiness exists in installed build
+grep -c "waitForReadiness" ~/.claude/plugins/marketplaces/thedotmack/plugin/scripts/worker-service.cjs
+# Expected: > 0
+
+# 2. Confirm hooks.json has two SessionStart groups (the split)
+python3 -c "import json; d=json.load(open('$(echo $HOME)/.claude/plugins/marketplaces/thedotmack/plugin/hooks/hooks.json')); print('SessionStart groups:', len(d['hooks']['SessionStart']))"
+# Expected: 2
+
+# 3. Confirm initializationCompleteFlag is set before MCP connection
+grep -n "Core initialization complete" ~/.claude/plugins/marketplaces/thedotmack/plugin/scripts/worker-service.cjs | head -1
+# Expected: appears BEFORE "MCP server connected"
+```
+
+## Phase 2: Restart Worker and Test
+
+```bash
+# Stop existing worker
+bun plugin/scripts/worker-service.cjs stop
+
+# Verify stopped
+curl -s http://127.0.0.1:37777/api/health && echo "STILL RUNNING" || echo "STOPPED"
+```
+
+Then start a new Claude Code session and verify:
+- No "SessionStart:startup hook error" messages
+- Worker is running: `curl http://127.0.0.1:37777/api/health`
+- Readiness endpoint works: `curl http://127.0.0.1:37777/api/readiness`
@@ -1,6 +1,6 @@
 {
  "name": "claude-mem",
-  "version": "10.3.0",
+  "version": "10.3.2",
  "description": "Persistent memory system for Claude Code - seamlessly preserve context across sessions",
  "author": {
    "name": "Alex Newman"
@@ -21,7 +21,12 @@
            "type": "command",
            "command": "node \"${CLAUDE_PLUGIN_ROOT}/scripts/smart-install.js\"",
            "timeout": 300
-          },
+          }
+        ]
+      },
+      {
+        "matcher": "startup|clear|compact",
+        "hooks": [
          {
            "type": "command",
            "command": "node \"${CLAUDE_PLUGIN_ROOT}/scripts/bun-runner.js\" \"${CLAUDE_PLUGIN_ROOT}/scripts/worker-service.cjs\" start",
@@ -1,6 +1,6 @@
 {
  "name": "claude-mem-plugin",
-  "version": "10.3.0",
+  "version": "10.3.2",
  "private": true,
  "description": "Runtime dependencies for claude-mem bundled hooks",
  "type": "module",
@@ -93,20 +93,6 @@ get_observations(ids=[11131, 10942])

 **Returns:** Complete observation objects with title, subtitle, narrative, facts, concepts, files (~500-1000 tokens each)

-## Saving Memories
-
-Use the `save_memory` MCP tool to store manual observations:
-
-```
-save_memory(text="Important discovery about the auth system", title="Auth Architecture", project="my-project")
-```
-
-**Parameters:**
-
-- `text` (string, required) - Content to remember
-- `title` (string, optional) - Short title, auto-generated if omitted
-- `project` (string, optional) - Project name, defaults to "claude-mem"
-
 ## Examples

 **Find recent bug fixes:**
@@ -235,8 +235,8 @@ NEVER fetch full details without filtering first. 10x token savings.`,
    }
  },
  {
-    name: 'save_memory',
-    description: 'Save a manual memory/observation for semantic search. Use this to remember important information.',
+    name: 'save_observation',
+    description: 'Save an observation to the database. Params: text (required), title, project',
    inputSchema: {
      type: 'object',
      properties: {
@@ -74,8 +74,8 @@ export function renderColorContextIndex(): string[] {
    `${colors.dim}Context Index: This semantic index (titles, types, files, tokens) is usually sufficient to understand past work.${colors.reset}`,
    '',
    `${colors.dim}When you need implementation details, rationale, or debugging context:${colors.reset}`,
-    `${colors.dim}  - Use MCP tools (search, get_observations) to fetch full observations on-demand${colors.reset}`,
-    `${colors.dim}  - Critical types ( bugfix, decision) often need detailed fetching${colors.reset}`,
+    `${colors.dim}  - Fetch by ID: get_observations([IDs]) for observations visible in this index${colors.reset}`,
+    `${colors.dim}  - Search history: Use the mem-search skill for past decisions, bugs, and deeper research${colors.reset}`,
    `${colors.dim}  - Trust this index over re-reading code for past decisions and learnings${colors.reset}`,
    ''
  ];
@@ -72,8 +72,8 @@ export function renderMarkdownContextIndex(): string[] {
    `**Context Index:** This semantic index (titles, types, files, tokens) is usually sufficient to understand past work.`,
    '',
    `When you need implementation details, rationale, or debugging context:`,
-    `- Use MCP tools (search, get_observations) to fetch full observations on-demand`,
-    `- Critical types ( bugfix, decision) often need detailed fetching`,
+    `- Fetch by ID: get_observations([IDs]) for observations visible in this index`,
+    `- Search history: Use the mem-search skill for past decisions, bugs, and deeper research`,
    `- Trust this index over re-reading code for past decisions and learnings`,
    ''
  ];
@@ -29,31 +29,49 @@ export async function isPortInUse(port: number): Promise<boolean> {
 }

 /**
- * Wait for the worker HTTP server to become responsive (liveness check)
- * Uses /api/health instead of /api/readiness because:
- * - /api/health returns 200 as soon as HTTP server is listening
- * - /api/readiness waits for full initialization (MCP connection can take 5+ minutes)
- * See: https://github.com/thedotmack/claude-mem/issues/811
- * @param port Worker port to check
- * @param timeoutMs Maximum time to wait in milliseconds
- * @returns true if worker became responsive, false if timeout
+ * Poll a localhost endpoint until it returns 200 OK or timeout.
+ * Shared implementation for liveness and readiness checks.
 */
-export async function waitForHealth(port: number, timeoutMs: number = 30000): Promise<boolean> {
+async function pollEndpointUntilOk(
+  port: number,
+  endpointPath: string,
+  timeoutMs: number,
+  retryLogMessage: string
+): Promise<boolean> {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    try {
      // Note: Removed AbortSignal.timeout to avoid Windows Bun cleanup issue (libuv assertion)
-      const response = await fetch(`http://127.0.0.1:${port}/api/health`);
+      const response = await fetch(`http://127.0.0.1:${port}${endpointPath}`);
      if (response.ok) return true;
    } catch (error) {
      // [ANTI-PATTERN IGNORED]: Retry loop - expected failures during startup, will retry
-      logger.debug('SYSTEM', 'Service not ready yet, will retry', { port }, error as Error);
+      logger.debug('SYSTEM', retryLogMessage, { port }, error as Error);
    }
    await new Promise(r => setTimeout(r, 500));
  }
  return false;
 }

+/**
+ * Wait for the worker HTTP server to become responsive (liveness check).
+ * Uses /api/health which returns 200 as soon as the HTTP server is listening.
+ * For full initialization (DB + search), use waitForReadiness() instead.
+ */
+export function waitForHealth(port: number, timeoutMs: number = 30000): Promise<boolean> {
+  return pollEndpointUntilOk(port, '/api/health', timeoutMs, 'Service not ready yet, will retry');
+}
+
+/**
+ * Wait for the worker to be fully initialized (DB + search ready).
+ * Uses /api/readiness which returns 200 only after core initialization completes.
+ * Now that initializationCompleteFlag is set after DB/search init (not MCP),
+ * this typically completes in a few seconds.
+ */
+export function waitForReadiness(port: number, timeoutMs: number = 30000): Promise<boolean> {
+  return pollEndpointUntilOk(port, '/api/readiness', timeoutMs, 'Worker not ready yet, will retry');
+}
+
 /**
 * Wait for a port to become free (no longer responding to health checks)
 * Used after shutdown to confirm the port is available for restart
@@ -10,7 +10,7 @@

 import path from 'path';
 import { homedir } from 'os';
-import { existsSync, writeFileSync, readFileSync, unlinkSync, mkdirSync } from 'fs';
+import { existsSync, writeFileSync, readFileSync, unlinkSync, mkdirSync, rmSync } from 'fs';
 import { exec, execSync, spawn } from 'child_process';
 import { promisify } from 'util';
 import { logger } from '../../utils/logger.js';
@@ -426,6 +426,182 @@ export async function cleanupOrphanedProcesses(): Promise<void> {
  logger.info('SYSTEM', 'Orphaned processes cleaned up', { count: pidsToKill.length });
 }

+// Patterns that should be killed immediately at startup (no age gate)
+// These are child processes that should not outlive their parent worker
+const AGGRESSIVE_CLEANUP_PATTERNS = ['worker-service.cjs', 'chroma-mcp'];
+
+// Patterns that keep the age-gated threshold (may be legitimately running)
+const AGE_GATED_CLEANUP_PATTERNS = ['mcp-server.cjs'];
+
+/**
+ * Aggressive startup cleanup for orphaned claude-mem processes.
+ *
+ * Unlike cleanupOrphanedProcesses() which age-gates everything at 30 minutes,
+ * this function kills worker-service.cjs and chroma-mcp processes immediately
+ * (they should not outlive their parent worker). Only mcp-server.cjs keeps
+ * the age threshold since it may be legitimately running.
+ *
+ * Called once at daemon startup.
+ */
+export async function aggressiveStartupCleanup(): Promise<void> {
+  const isWindows = process.platform === 'win32';
+  const currentPid = process.pid;
+  const pidsToKill: number[] = [];
+  const allPatterns = [...AGGRESSIVE_CLEANUP_PATTERNS, ...AGE_GATED_CLEANUP_PATTERNS];
+
+  try {
+    if (isWindows) {
+      const patternConditions = allPatterns
+        .map(p => `$_.CommandLine -like '*${p}*'`)
+        .join(' -or ');
+
+      const cmd = `powershell -NoProfile -NonInteractive -Command "Get-CimInstance Win32_Process | Where-Object { (${patternConditions}) -and $_.ProcessId -ne ${currentPid} } | Select-Object ProcessId, CommandLine, CreationDate | ConvertTo-Json"`;
+      const { stdout } = await execAsync(cmd, { timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND });
+
+      if (!stdout.trim() || stdout.trim() === 'null') {
+        logger.debug('SYSTEM', 'No orphaned claude-mem processes found (Windows)');
+        return;
+      }
+
+      const processes = JSON.parse(stdout);
+      const processList = Array.isArray(processes) ? processes : [processes];
+      const now = Date.now();
+
+      for (const proc of processList) {
+        const pid = proc.ProcessId;
+        if (!Number.isInteger(pid) || pid <= 0 || pid === currentPid) continue;
+
+        const commandLine = proc.CommandLine || '';
+        const isAggressive = AGGRESSIVE_CLEANUP_PATTERNS.some(p => commandLine.includes(p));
+
+        if (isAggressive) {
+          // Kill immediately — no age check
+          pidsToKill.push(pid);
+          logger.debug('SYSTEM', 'Found orphaned process (aggressive)', { pid, commandLine: commandLine.substring(0, 80) });
+        } else {
+          // Age-gated: only kill if older than threshold
+          const creationMatch = proc.CreationDate?.match(/\/Date\((\d+)\)\//);
+          if (creationMatch) {
+            const creationTime = parseInt(creationMatch[1], 10);
+            const ageMinutes = (now - creationTime) / (1000 * 60);
+            if (ageMinutes >= ORPHAN_MAX_AGE_MINUTES) {
+              pidsToKill.push(pid);
+              logger.debug('SYSTEM', 'Found orphaned process (age-gated)', { pid, ageMinutes: Math.round(ageMinutes) });
+            }
+          }
+        }
+      }
+    } else {
+      // Unix: Use ps with elapsed time
+      const patternRegex = allPatterns.join('|');
+      const { stdout } = await execAsync(
+        `ps -eo pid,etime,command | grep -E "${patternRegex}" | grep -v grep || true`
+      );
+
+      if (!stdout.trim()) {
+        logger.debug('SYSTEM', 'No orphaned claude-mem processes found (Unix)');
+        return;
+      }
+
+      const lines = stdout.trim().split('\n');
+      for (const line of lines) {
+        const match = line.trim().match(/^(\d+)\s+(\S+)\s+(.*)$/);
+        if (!match) continue;
+
+        const pid = parseInt(match[1], 10);
+        const etime = match[2];
+        const command = match[3];
+
+        if (!Number.isInteger(pid) || pid <= 0 || pid === currentPid) continue;
+
+        const isAggressive = AGGRESSIVE_CLEANUP_PATTERNS.some(p => command.includes(p));
+
+        if (isAggressive) {
+          // Kill immediately — no age check
+          pidsToKill.push(pid);
+          logger.debug('SYSTEM', 'Found orphaned process (aggressive)', { pid, command: command.substring(0, 80) });
+        } else {
+          // Age-gated: only kill if older than threshold
+          const ageMinutes = parseElapsedTime(etime);
+          if (ageMinutes >= ORPHAN_MAX_AGE_MINUTES) {
+            pidsToKill.push(pid);
+            logger.debug('SYSTEM', 'Found orphaned process (age-gated)', { pid, ageMinutes, command: command.substring(0, 80) });
+          }
+        }
+      }
+    }
+  } catch (error) {
+    logger.error('SYSTEM', 'Failed to enumerate orphaned processes during aggressive cleanup', {}, error as Error);
+    return;
+  }
+
+  if (pidsToKill.length === 0) {
+    return;
+  }
+
+  logger.info('SYSTEM', 'Aggressive startup cleanup: killing orphaned processes', {
+    platform: isWindows ? 'Windows' : 'Unix',
+    count: pidsToKill.length,
+    pids: pidsToKill
+  });
+
+  if (isWindows) {
+    for (const pid of pidsToKill) {
+      if (!Number.isInteger(pid) || pid <= 0) continue;
+      try {
+        execSync(`taskkill /PID ${pid} /T /F`, { timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND, stdio: 'ignore' });
+      } catch (error) {
+        logger.debug('SYSTEM', 'Failed to kill process, may have already exited', { pid }, error as Error);
+      }
+    }
+  } else {
+    for (const pid of pidsToKill) {
+      try {
+        process.kill(pid, 'SIGKILL');
+      } catch (error) {
+        logger.debug('SYSTEM', 'Process already exited', { pid }, error as Error);
+      }
+    }
+  }
+
+  logger.info('SYSTEM', 'Aggressive startup cleanup complete', { count: pidsToKill.length });
+}
+
+const CHROMA_MIGRATION_MARKER_FILENAME = '.chroma-cleaned-v10.3';
+
+/**
+ * One-time chroma data wipe for users upgrading from versions with duplicate
+ * worker bugs that could corrupt chroma data. Since chroma is always rebuildable
+ * from SQLite (via backfillAllProjects), this is safe.
+ *
+ * Checks for a marker file. If absent, wipes ~/.claude-mem/chroma/ and writes
+ * the marker. If present, skips. Idempotent.
+ *
+ * @param dataDirectory - Override for DATA_DIR (used in tests)
+ */
+export function runOneTimeChromaMigration(dataDirectory?: string): void {
+  const effectiveDataDir = dataDirectory ?? DATA_DIR;
+  const markerPath = path.join(effectiveDataDir, CHROMA_MIGRATION_MARKER_FILENAME);
+  const chromaDir = path.join(effectiveDataDir, 'chroma');
+
+  if (existsSync(markerPath)) {
+    logger.debug('SYSTEM', 'Chroma migration marker exists, skipping wipe');
+    return;
+  }
+
+  logger.warn('SYSTEM', 'Running one-time chroma data wipe (upgrade from pre-v10.3)', { chromaDir });
+
+  if (existsSync(chromaDir)) {
+    rmSync(chromaDir, { recursive: true, force: true });
+    logger.info('SYSTEM', 'Chroma data directory removed', { chromaDir });
+  }
+
+  // Write marker file to prevent future wipes
+  mkdirSync(effectiveDataDir, { recursive: true });
+  writeFileSync(markerPath, new Date().toISOString());
+  logger.info('SYSTEM', 'Chroma migration marker written', { markerPath });
+}
+
 /**
 * Spawn a detached daemon process
 * Returns the child PID or undefined if spawn failed
@@ -248,8 +248,14 @@ export class Server {
        process.send!({ type: 'restart' });
      } else {
        // Unix or standalone Windows - handle restart ourselves
+        // The spawner (ensureWorkerStarted/restart command) handles spawning the new daemon.
+        // This process just needs to shut down and exit.
        setTimeout(async () => {
-          await this.options.onRestart();
+          try {
+            await this.options.onRestart();
+          } finally {
+            process.exit(0);
+          }
        }, 100);
      }
    });
@@ -268,7 +274,14 @@ export class Server {
      } else {
        // Unix or standalone Windows - handle shutdown ourselves
        setTimeout(async () => {
-          await this.options.onShutdown();
+          try {
+            await this.options.onShutdown();
+          } finally {
+            // CRITICAL: Exit the process after shutdown completes (or fails).
+            // Without this, the daemon stays alive as a zombie — background tasks
+            // (backfill, reconnects) keep running and respawn chroma-mcp subprocesses.
+            process.exit(0);
+          }
        }, 100);
      }
    });
@@ -69,14 +69,17 @@ import {
  readPidFile,
  removePidFile,
  getPlatformTimeout,
-  cleanupOrphanedProcesses,
+  aggressiveStartupCleanup,
+  runOneTimeChromaMigration,
  cleanStalePidFile,
+  isProcessAlive,
  spawnDaemon,
  createSignalHandler
 } from './infrastructure/ProcessManager.js';
 import {
  isPortInUse,
  waitForHealth,
+  waitForReadiness,
  waitForPortFree,
  httpShutdown,
  checkVersionMatch
@@ -367,7 +370,7 @@ export class WorkerService {
   */
  private async initializeBackground(): Promise<void> {
    try {
-      await cleanupOrphanedProcesses();
+      await aggressiveStartupCleanup();

      // Load mode configuration
      const { ModeManager } = await import('./domain/ModeManager.js');
@@ -376,6 +379,12 @@ export class WorkerService {

      const settings = SettingsDefaultsManager.loadFromFile(USER_SETTINGS_PATH);

+      // One-time chroma wipe for users upgrading from versions with duplicate worker bugs.
+      // Only runs in local mode (chroma is local-only). Backfill at line ~414 rebuilds from SQLite.
+      if (settings.CLAUDE_MEM_MODE === 'local' || !settings.CLAUDE_MEM_MODE) {
+        runOneTimeChromaMigration();
+      }
+
      // Initialize ChromaMcpManager (lazy - connects on first use via ChromaSync)
      this.chromaMcpManager = ChromaMcpManager.getInstance();
      logger.info('SYSTEM', 'ChromaMcpManager initialized (lazy - connects on first use)');
@@ -408,6 +417,13 @@ export class WorkerService {
      this.server.registerRoutes(this.searchRoutes);
      logger.info('WORKER', 'SearchManager initialized and search routes registered');

+      // DB and search are ready — mark initialization complete so hooks can proceed.
+      // MCP connection is tracked separately via mcpReady and is NOT required for
+      // the worker to serve context/search requests.
+      this.initializationCompleteFlag = true;
+      this.resolveInitialization();
+      logger.info('SYSTEM', 'Core initialization complete (DB + search ready)');
+
      // Auto-backfill Chroma for all projects if out of sync with SQLite (fire-and-forget)
      if (this.chromaMcpManager) {
        ChromaSync.backfillAllProjects().then(() => {
@@ -433,11 +449,7 @@ export class WorkerService {

      await Promise.race([mcpConnectionPromise, timeoutPromise]);
      this.mcpReady = true;
-      logger.success('WORKER', 'Connected to MCP server');
-
-      this.initializationCompleteFlag = true;
-      this.resolveInitialization();
-      logger.info('SYSTEM', 'Background initialization complete');
+      logger.success('WORKER', 'MCP server connected');

      // Start orphan reaper to clean up zombie processes (Issue #737)
      this.stopOrphanReaper = startOrphanReaper(() => {
@@ -937,6 +949,13 @@ async function ensureWorkerStarted(port: number): Promise<boolean> {
    return false;
  }

+  // Health passed (HTTP listening). Now wait for DB + search initialization
+  // so hooks that run immediately after can actually use the worker.
+  const ready = await waitForReadiness(port, getPlatformTimeout(HOOK_TIMEOUTS.READINESS_WAIT));
+  if (!ready) {
+    logger.warn('SYSTEM', 'Worker is alive but readiness timed out — proceeding anyway');
+  }
+
  clearWorkerSpawnAttempted();
  logger.info('SYSTEM', 'Worker started successfully');
  return true;
@@ -1097,6 +1116,28 @@ async function main() {

    case '--daemon':
    default: {
+      // GUARD 1: Refuse to start if another worker is already alive (PID check).
+      // Instant check (kill -0) — no HTTP dependency.
+      const existingPidInfo = readPidFile();
+      if (existingPidInfo && isProcessAlive(existingPidInfo.pid)) {
+        logger.info('SYSTEM', 'Worker already running (PID alive), refusing to start duplicate', {
+          existingPid: existingPidInfo.pid,
+          existingPort: existingPidInfo.port,
+          startedAt: existingPidInfo.startedAt
+        });
+        process.exit(0);
+      }
+
+      // GUARD 2: Refuse to start if the port is already bound.
+      // Catches the race where two daemons start simultaneously before
+      // either writes a PID file. Must run BEFORE constructing WorkerService
+      // because the constructor registers signal handlers and timers that
+      // prevent the process from exiting even if listen() fails later.
+      if (await isPortInUse(port)) {
+        logger.info('SYSTEM', 'Port already in use, refusing to start duplicate', { port });
+        process.exit(0);
+      }
+
      // Prevent daemon from dying silently on unhandled errors.
      // The HTTP server can continue serving even if a background task throws.
      process.on('unhandledRejection', (reason) => {
@@ -2,6 +2,7 @@ export const HOOK_TIMEOUTS = {
  DEFAULT: 300000,            // Standard HTTP timeout (5 min for slow systems)
  HEALTH_CHECK: 3000,         // Worker health check (3s — healthy worker responds in <100ms)
  POST_SPAWN_WAIT: 5000,      // Wait for daemon to start after spawn (starts in <1s on Linux)
+  READINESS_WAIT: 30000,      // Wait for DB + search init after spawn (typically <5s)
  PORT_IN_USE_WAIT: 3000,     // Wait when port occupied but health failing
  WORKER_STARTUP_WAIT: 1000,
  PRE_RESTART_SETTLE_DELAY: 2000,  // Give files time to sync before restart
@@ -1,6 +1,7 @@
 import { describe, it, expect, beforeEach, afterEach } from 'bun:test';
-import { existsSync, readFileSync } from 'fs';
+import { existsSync, readFileSync, mkdirSync, writeFileSync, rmSync } from 'fs';
 import { homedir } from 'os';
+import { tmpdir } from 'os';
 import path from 'path';
 import {
  writePidFile,
@@ -12,6 +13,7 @@ import {
  cleanStalePidFile,
  spawnDaemon,
  resolveWorkerRuntimePath,
+  runOneTimeChromaMigration,
  type PidInfo
 } from '../../src/services/infrastructure/index.js';

@@ -32,7 +34,6 @@ describe('ProcessManager', () => {
  afterEach(() => {
    // Restore original PID file or remove test one
    if (originalPidContent !== null) {
-      const { writeFileSync } = require('fs');
      writeFileSync(PID_FILE, originalPidContent);
      originalPidContent = null;
    } else {
@@ -105,7 +106,6 @@ describe('ProcessManager', () => {
    });

    it('should return null for corrupted JSON', () => {
-      const { writeFileSync } = require('fs');
      writeFileSync(PID_FILE, 'not valid json {{{');

      const result = readPidFile();
@@ -415,4 +415,53 @@ describe('ProcessManager', () => {
      // This is a logic verification test — actual signal delivery is tested manually
    });
  });
+
+  describe('runOneTimeChromaMigration', () => {
+    let testDataDir: string;
+
+    beforeEach(() => {
+      testDataDir = path.join(tmpdir(), `claude-mem-test-${Date.now()}-${Math.random().toString(36).slice(2)}`);
+      mkdirSync(testDataDir, { recursive: true });
+    });
+
+    afterEach(() => {
+      rmSync(testDataDir, { recursive: true, force: true });
+    });
+
+    it('should wipe chroma directory and write marker file', () => {
+      // Create a fake chroma directory with data
+      const chromaDir = path.join(testDataDir, 'chroma');
+      mkdirSync(chromaDir, { recursive: true });
+      writeFileSync(path.join(chromaDir, 'test-data.bin'), 'fake chroma data');
+
+      runOneTimeChromaMigration(testDataDir);
+
+      // Chroma dir should be gone
+      expect(existsSync(chromaDir)).toBe(false);
+      // Marker file should exist
+      expect(existsSync(path.join(testDataDir, '.chroma-cleaned-v10.3'))).toBe(true);
+    });
+
+    it('should skip when marker file already exists (idempotent)', () => {
+      // Write marker file first
+      writeFileSync(path.join(testDataDir, '.chroma-cleaned-v10.3'), 'already done');
+
+      // Create a chroma directory that should NOT be wiped
+      const chromaDir = path.join(testDataDir, 'chroma');
+      mkdirSync(chromaDir, { recursive: true });
+      writeFileSync(path.join(chromaDir, 'important.bin'), 'should survive');
+
+      runOneTimeChromaMigration(testDataDir);
+
+      // Chroma dir should still exist (migration was skipped)
+      expect(existsSync(chromaDir)).toBe(true);
+      expect(existsSync(path.join(chromaDir, 'important.bin'))).toBe(true);
+    });
+
+    it('should handle missing chroma directory gracefully', () => {
+      // No chroma dir exists — should just write marker without error
+      expect(() => runOneTimeChromaMigration(testDataDir)).not.toThrow();
+      expect(existsSync(path.join(testDataDir, '.chroma-cleaned-v10.3'))).toBe(true);
+    });
+  });
 });
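
The three tests above pin down the migration's contract: skip when the marker exists, otherwise remove `chroma/` and write the marker, and tolerate a missing `chroma/` directory. A minimal sketch of a function that would satisfy them follows. It is not the code in `src/services/infrastructure`. The name `runOneTimeChromaMigrationSketch`, the marker contents, and the demo directory prefix are assumptions made for illustration; only the marker filename and the `chroma` subdirectory come from the tests.

```typescript
import { existsSync, mkdirSync, mkdtempSync, rmSync, writeFileSync } from 'fs';
import { tmpdir } from 'os';
import { join } from 'path';

// Marker filename as asserted in the tests; its contents are an assumption.
const MARKER_FILE = '.chroma-cleaned-v10.3';

// Hypothetical stand-in for runOneTimeChromaMigration(dataDir).
function runOneTimeChromaMigrationSketch(dataDir: string): void {
  const markerPath = join(dataDir, MARKER_FILE);

  // Idempotency: once the marker exists, never touch chroma data again.
  if (existsSync(markerPath)) return;

  // force: true makes this a no-op when the directory is missing.
  rmSync(join(dataDir, 'chroma'), { recursive: true, force: true });

  // Record that the wipe happened so later startups skip it.
  writeFileSync(markerPath, new Date().toISOString());
}

// Usage: run twice against a throwaway directory.
const demoDir = mkdtempSync(join(tmpdir(), 'claude-mem-sketch-'));
mkdirSync(join(demoDir, 'chroma'));

runOneTimeChromaMigrationSketch(demoDir); // first run: wipes chroma/, writes marker
const wipedOnFirstRun = !existsSync(join(demoDir, 'chroma'));
const markerWritten = existsSync(join(demoDir, MARKER_FILE));

mkdirSync(join(demoDir, 'chroma'));
runOneTimeChromaMigrationSketch(demoDir); // second run: marker present, no-op
const keptOnSecondRun = existsSync(join(demoDir, 'chroma'));

rmSync(demoDir, { recursive: true, force: true });
```

Checking the marker before deleting anything is what makes the wipe safe to call on every startup: data rebuilt after the first run is never removed by a later one.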
Author	SHA1	Message	Date
Alex Newman	c2c3e3069c	chore: bump version to 10.3.2 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-23 03:32:22 -05:00
Alex Newman	7966c6cba9	fix: rename save_memory and fix MCP search instructions + startup hook (#1210) * fix: rename save_memory to save_observation and fix MCP search instructions Stop the primary agent from proactively saving memories by renaming save_memory to save_observation with a neutral description. Remove "Saving Memories" section from SKILL.md. Update context formatters and output styles to reference the mem-search skill instead of raw MCP tool names. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: split SessionStart hooks so smart-install failure doesn't block worker start smart-install.js and worker-start were in the same hook group, so if smart-install exited non-zero the worker never started. Split into separate hook groups so they run independently. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: worker startup waits for readiness before hooks fire Move initializationCompleteFlag to set after DB/search init (not MCP), add waitForReadiness() polling /api/readiness, and extract shared pollEndpointUntilOk helper to DRY up health/readiness checks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-23 03:30:31 -05:00
Alex Newman	e4e735d3ff	fix: add rewrite rule so install.cmem.ai root serves install.sh Without this, curl https://install.cmem.ai returns 404 because Vercel has no index file mapping for the root path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 22:39:36 -05:00
Alex Newman	780cc3894e	fix: serve installer JS from install.cmem.ai instead of GitHub raw Copied compiled installer to install/public/installer.js so Vercel serves it at install.cmem.ai/installer.js. Updated install.sh to fetch from same domain instead of raw.githubusercontent.com. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 22:08:43 -05:00
Alex Newman	8d46c00dd8	fix: add compiled installer dist so CLI installation works The bootstrap script (install.sh) fetches installer/dist/index.js from main, but it was never committed due to the global dist/ gitignore rule. Added negation rule and the compiled installer bundle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 22:06:05 -05:00
Alex Newman	4ab601fc9f	docs: update CHANGELOG.md for v10.3.1 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 20:12:46 -05:00
Alex Newman	097035de6c	chore: bump version to 10.3.1 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 20:12:17 -05:00
Alex Newman	e788fd3676	fix: prevent duplicate worker daemons and zombie processes (#1178) * fix: prevent duplicate worker daemons and zombie processes Three root causes of chroma-mcp timeouts: 1. HTTP shutdown (POST /api/admin/shutdown) closed resources but never called process.exit(). Zombie workers stayed alive, background tasks reconnected to chroma-mcp, spawning duplicate subprocesses that all contended for the same persistent data directory. 2. No guard against concurrent daemon startup. When hooks fired simultaneously, multiple daemons started before either wrote a PID file. The loser got EADDRINUSE but stayed alive because signal handlers registered in the constructor prevented exit. 3. Corrupt 147GB HNSW index file caused all chroma queries to timeout (MCP error -32001). Data fix: deleted corrupt collection, backfill rebuilds from SQLite. Code fixes: - Add PID-based guard in daemon startup: exit if PID file process alive - Add port-based guard in daemon startup: exit if port already bound (runs before WorkerService constructor registers keepalive handlers) - Add process.exit(0) after HTTP shutdown/restart completes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: aggressive startup cleanup and one-time chroma wipe for upgrade Kill orphaned worker-service.cjs and chroma-mcp processes immediately at startup (no age gate) while keeping 30-min threshold for mcp-server. Wipe corrupt chroma data once on upgrade from pre-v10.3 versions — backfill rebuilds from SQLite automatically. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: wrap shutdown handlers in try/finally to guarantee process.exit If onShutdown() or onRestart() threw, process.exit(0) was never reached, leaving the daemon alive as a zombie. Also removed redundant require('fs') calls in process-manager tests where ESM imports already existed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 20:10:28 -05:00
Alex Newman	44cdbec173	docs: update CHANGELOG.md for v10.3.0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 18:34:33 -05:00
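
Commit e788fd3676 describes a PID-based guard: on daemon startup, exit if the PID file points at a live process. One common way to implement that check in Node is to send signal `0`, which tests for existence without delivering a signal. The sketch below assumes the PID file is JSON with a numeric `pid` field. The names `PidInfoSketch`, `isProcessAlive`, and `daemonAlreadyRunning` are hypothetical and do not come from the repository.

```typescript
import { existsSync, readFileSync } from 'fs';

// Hypothetical shape; the real PidInfo type is exported from
// src/services/infrastructure and may carry more fields.
interface PidInfoSketch {
  pid: number;
}

// Liveness probe: signal 0 performs the permission and existence checks only.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but is owned by another user.
    return (err as { code?: string }).code === 'EPERM';
  }
}

// Returns true when startup should bail out because a daemon already runs.
function daemonAlreadyRunning(pidFilePath: string): boolean {
  if (!existsSync(pidFilePath)) return false;
  try {
    const info = JSON.parse(readFileSync(pidFilePath, 'utf-8')) as PidInfoSketch;
    return Number.isInteger(info.pid) && info.pid > 0 && isProcessAlive(info.pid);
  } catch {
    // Unreadable or corrupted PID file: treat it as stale, not as a live daemon.
    return false;
  }
}

// Usage (the path argument is illustrative only):
// if (daemonAlreadyRunning(pidFilePath)) process.exit(0);
```

A PID check alone cannot close the window in which two daemons start before either has written its PID file, which is presumably why the same commit pairs it with a port-based guard.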