refactor: decompose monolithic services into modular architecture (#534)

* docs: add monolith refactor report with system breakdown

Comprehensive analysis of codebase identifying:
- 14 files over 500 lines requiring refactoring
- 3 critical monoliths (SessionStore, SearchManager, worker-service)
- 80% code duplication across agent files
- 5-phase refactoring roadmap with domain-based architecture

* fix: prevent memory_session_id from equaling content_session_id

The bug: memory_session_id was initialized to contentSessionId as a "placeholder for FK purposes". This caused the SDK resume logic to inject memory agent messages into the USER's Claude Code transcript, corrupting their conversation history.

Root cause:
- SessionStore.createSDKSession initialized memory_session_id = contentSessionId
- SDKAgent checked memorySessionId !== contentSessionId but this check only worked if the session was fetched fresh from DB

The fix:
- SessionStore: Initialize memory_session_id as NULL, not contentSessionId
- SDKAgent: Simple truthy check !!session.memorySessionId (NULL = fresh start)
- Database migration: Ran UPDATE to set memory_session_id = NULL for 1807 existing sessions that had the bug

Also adds [ALIGNMENT] logging across the session lifecycle to help debug session continuity issues:
- Hook entry: contentSessionId + promptNumber
- DB lookup: contentSessionId → memorySessionId mapping proof
- Resume decision: shows which memorySessionId will be used for resume
- Capture: logs when memorySessionId is captured from first SDK response

UI: Added "Alignment" quick filter button in LogsModal to show only alignment logs for debugging session continuity.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: improve error handling in worker-service.ts

- Fix GENERIC_CATCH anti-patterns by logging full error objects instead of just messages
- Add [ANTI-PATTERN IGNORED] markers for legitimate cases (cleanup, hot paths)
- Simplify error handling comments to be more concise
- Improve httpShutdown() error discrimination for ECONNREFUSED
- Reduce LARGE_TRY_BLOCK issues in initialization code

Part of anti-pattern cleanup plan (132 total issues)

* refactor: improve error logging in SearchManager.ts

- Pass full error objects to logger instead of just error.message
- Fixes PARTIAL_ERROR_LOGGING anti-patterns (10 instances)
- Better debugging visibility when Chroma queries fail

Part of anti-pattern cleanup (133 remaining)

* refactor: improve error logging across SessionStore and mcp-server

- SessionStore.ts: Fix error logging in column rename utility
- mcp-server.ts: Log full error objects instead of just error.message
- Improve error handling in Worker API calls and tool execution

Part of anti-pattern cleanup (133 remaining)

* Refactor hooks to streamline error handling and loading states

- Simplified error handling in useContextPreview by removing try-catch and directly checking response status.
- Refactored usePagination to eliminate try-catch, improving readability and maintaining error handling through response checks.
- Cleaned up useSSE by removing unnecessary try-catch around JSON parsing, ensuring clarity in message handling.
- Enhanced useSettings by streamlining the saving process, removing try-catch, and directly checking the result for success.

* refactor: add error handling back to SearchManager Chroma calls

- Wrap queryChroma calls in try-catch to prevent generator crashes
- Log Chroma errors as warnings and fall back gracefully
- Fixes generator failures when Chroma has issues
- Part of anti-pattern cleanup recovery

* feat: Add generator failure investigation report and observation duplication regression report

- Created a comprehensive investigation report detailing the root cause of generator failures during anti-pattern cleanup, including the impact, investigation process, and implemented fixes.
- Documented the critical regression causing observation duplication due to race conditions in the SDK agent, outlining symptoms, root cause analysis, and proposed fixes.

* fix: address PR #528 review comments - atomic cleanup and detector improvements

This commit addresses critical review feedback from PR #528:

## 1. Atomic Message Cleanup (Fix Race Condition)

**Problem**: SessionRoutes.ts generator error handler had race condition
- Queried messages then marked failed in loop
- If crash during loop → partial marking → inconsistent state

**Solution**:
- Added `markSessionMessagesFailed()` to PendingMessageStore.ts
- Single atomic UPDATE statement replaces loop
- Follows existing pattern from `resetProcessingToPending()`

**Files**:
- src/services/sqlite/PendingMessageStore.ts (new method)
- src/services/worker/http/routes/SessionRoutes.ts (use new method)

## 2. Anti-Pattern Detector Improvements

**Problem**: Detector didn't recognize logger.failure() method
- Lines 212 & 335 already included "failure"
- Lines 112-113 (PARTIAL_ERROR_LOGGING detection) did not

**Solution**: Updated regex patterns to include "failure" for consistency

**Files**:
- scripts/anti-pattern-test/detect-error-handling-antipatterns.ts

## 3. Documentation

**PR Comment**: Added clarification on memory_session_id fix location
- Points to SessionStore.ts:1155
- Explains why NULL initialization prevents message injection bug

## Review Response

Addresses "Must Address Before Merge" items from review:
✅ Clarified memory_session_id bug fix location (via PR comment)
✅ Made generator error handler message cleanup atomic
❌ Deferred comprehensive test suite to follow-up PR (keeps PR focused)

## Testing

- Build passes with no errors
- Anti-pattern detector runs successfully
- Atomic cleanup follows proven pattern from existing methods

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix: FOREIGN KEY constraint and missing failed_at_epoch column

Two critical bugs fixed:

1. Missing failed_at_epoch column in pending_messages table
   - Added migration 20 to create the column
   - Fixes error when trying to mark messages as failed

2. FOREIGN KEY constraint failed when storing observations
   - All three agents (SDK, Gemini, OpenRouter) were passing session.contentSessionId instead of session.memorySessionId
   - storeObservationsAndMarkComplete expects memorySessionId
   - Added null check and clear error message

However, observations still not saving - see investigation report.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Refactor hook input parsing to improve error handling

- Added a nested try-catch block in new-hook.ts, save-hook.ts, and summary-hook.ts to handle JSON parsing errors more gracefully.
- Replaced direct error throwing with logging of the error details using logger.error.
- Ensured that the process exits cleanly after handling input in all three hooks.

* docs: update monolith report post session-logging merge

- SessionStore grew to 2,011 lines (49 methods) - highest priority
- SearchManager reduced to 1,778 lines (improved)
- Agent files reduced by ~45 lines combined
- Added trend indicators and post-merge observations
- Core refactoring proposal remains valid

* refactor(sqlite): decompose SessionStore into modular architecture

Extract the 2011-line SessionStore.ts monolith into focused, single-responsibility modules following grep-optimized progressive disclosure pattern:

New module structure:
- sessions/ - Session creation and retrieval (create.ts, get.ts, types.ts)
- observations/ - Observation storage and queries (store.ts, get.ts, recent.ts, files.ts, types.ts)
- summaries/ - Summary storage and queries (store.ts, get.ts, recent.ts, types.ts)
- prompts/ - User prompt management (store.ts, get.ts, types.ts)
- timeline/ - Cross-entity timeline queries (queries.ts)
- import/ - Bulk import operations (bulk.ts)
- migrations/ - Database migrations (runner.ts)

New coordinator files:
- Database.ts - ClaudeMemDatabase class with re-exports
- transactions.ts - Atomic cross-entity transactions
- Named re-export facades (Sessions.ts, Observations.ts, etc.)

Key design decisions:
- All functions take `db: Database` as first parameter (functional style)
- Named re-exports instead of index.ts for grep-friendliness
- SessionStore retained as backward-compatible wrapper
- Target file size: 50-150 lines (60% compliance)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor(agents): extract shared logic into modular architecture

Consolidate duplicate code across SDKAgent, GeminiAgent, and OpenRouterAgent into focused utility modules. Total reduction: 500 lines (29%).

New modules in src/services/worker/agents/:
- ResponseProcessor.ts: Atomic DB transactions, Chroma sync, SSE broadcast
- ObservationBroadcaster.ts: SSE event formatting and dispatch
- SessionCleanupHelper.ts: Session state cleanup and stuck message reset
- FallbackErrorHandler.ts: Provider error detection for fallback logic
- types.ts: Shared interfaces (WorkerRef, SSE payloads, StorageResult)

Bug fix: SDKAgent was incorrectly using obs.files instead of obs.files_read and hardcoding files_modified to empty array.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor(search): extract search strategies into modular architecture

Decompose SearchManager into focused strategy pattern with:
- SearchOrchestrator: Coordinates strategy selection and fallback
- ChromaSearchStrategy: Vector semantic search via ChromaDB
- SQLiteSearchStrategy: Filter-only queries for date/project/type
- HybridSearchStrategy: Metadata filtering + semantic ranking
- ResultFormatter: Markdown table formatting for results
- TimelineBuilder: Chronological timeline construction
- Filter modules: DateFilter, ProjectFilter, TypeFilter

SearchManager now delegates to new infrastructure while maintaining full backward compatibility with existing public API.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor(context): decompose context-generator into modular architecture

Extract 660-line monolith into focused components:
- ContextBuilder: Main orchestrator (~160 lines)
- ContextConfigLoader: Configuration loading
- TokenCalculator: Token budget calculations
- ObservationCompiler: Data retrieval and query building
- MarkdownFormatter/ColorFormatter: Output formatting
- Section renderers: Header, Timeline, Summary, Footer

Maintains full backward compatibility - context-generator.ts now delegates to new ContextBuilder while preserving public API.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor(worker): decompose worker-service into modular infrastructure

Split 2000+ line monolith into focused modules:

Infrastructure:
- ProcessManager: PID files, signal handlers, child process cleanup
- HealthMonitor: Port checks, health polling, version matching
- GracefulShutdown: Coordinated cleanup on exit

Server:
- Server: Express app setup, core routes, route registration
- Middleware: Re-exports from existing middleware
- ErrorHandler: Centralized error handling with AppError class

Integrations:
- CursorHooksInstaller: Full Cursor IDE integration (registry, hooks, MCP)

WorkerService now acts as thin coordinator wiring all components together. Maintains full backward compatibility with existing public API.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Refactor session queue processing and database interactions

- Implement claim-and-delete pattern in SessionQueueProcessor to simplify message handling and eliminate duplicate processing.
- Update PendingMessageStore to support atomic claim-and-delete operations, removing the need for intermediate processing states.
- Introduce storeObservations method in SessionStore for simplified observation and summary storage without message tracking.
- Remove deprecated methods and clean up session state management in worker agents.
- Adjust response processing to accommodate new storage patterns, ensuring atomic transactions for observations and summaries.
- Remove unnecessary reset logic for stuck messages due to the new queue handling approach.

* Add duplicate observation cleanup script

Script to clean up duplicate observations created by the batching bug where observations were stored once per message ID instead of once per observation. Includes safety checks to always keep at least one copy.

Usage:
  bun scripts/cleanup-duplicates.ts              # Dry run
  bun scripts/cleanup-duplicates.ts --execute    # Delete duplicates
  bun scripts/cleanup-duplicates.ts --aggressive # Ignore time window

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-01-03 21:22:27 -05:00
parent 7748e62387
commit 2fc4153bef
87 changed files with 9933 additions and 3492 deletions
@@ -0,0 +1,115 @@
+/**
+ * GracefulShutdown - Cleanup utilities for graceful exit
+ *
+ * Extracted from worker-service.ts to provide centralized shutdown coordination.
+ * Handles:
+ * - HTTP server closure (with Windows-specific delays)
+ * - Session manager shutdown coordination
+ * - Child process cleanup (Windows zombie port fix)
+ */
+
+import http from 'http';
+import { logger } from '../../utils/logger.js';
+import {
+  getChildProcesses,
+  forceKillProcess,
+  waitForProcessesExit,
+  removePidFile
+} from './ProcessManager.js';
+
+export interface ShutdownableService {
+  shutdownAll(): Promise<void>;
+}
+
+export interface CloseableClient {
+  close(): Promise<void>;
+}
+
+export interface CloseableDatabase {
+  close(): Promise<void>;
+}
+
+/**
+ * Configuration for graceful shutdown
+ */
+export interface GracefulShutdownConfig {
+  server: http.Server | null;
+  sessionManager: ShutdownableService;
+  mcpClient?: CloseableClient;
+  dbManager?: CloseableDatabase;
+}
+
+/**
+ * Perform graceful shutdown of all services
+ *
+ * IMPORTANT: On Windows, we must kill all child processes before exiting
+ * to prevent zombie ports. The socket handle can be inherited by children,
+ * and if not properly closed, the port stays bound after process death.
+ */
+export async function performGracefulShutdown(config: GracefulShutdownConfig): Promise<void> {
+  logger.info('SYSTEM', 'Shutdown initiated');
+
+  // Clean up PID file on shutdown
+  removePidFile();
+
+  // STEP 1: Enumerate all child processes BEFORE we start closing things
+  const childPids = await getChildProcesses(process.pid);
+  logger.info('SYSTEM', 'Found child processes', { count: childPids.length, pids: childPids });
+
+  // STEP 2: Close HTTP server first
+  if (config.server) {
+    await closeHttpServer(config.server);
+    logger.info('SYSTEM', 'HTTP server closed');
+  }
+
+  // STEP 3: Shutdown active sessions
+  await config.sessionManager.shutdownAll();
+
+  // STEP 4: Close MCP client connection (signals child to exit gracefully)
+  if (config.mcpClient) {
+    await config.mcpClient.close();
+    logger.info('SYSTEM', 'MCP client closed');
+  }
+
+  // STEP 5: Close database connection (includes ChromaSync cleanup)
+  if (config.dbManager) {
+    await config.dbManager.close();
+  }
+
+  // STEP 6: Force kill any remaining child processes (Windows zombie port fix)
+  if (childPids.length > 0) {
+    logger.info('SYSTEM', 'Force killing remaining children');
+    for (const pid of childPids) {
+      await forceKillProcess(pid);
+    }
+    // Wait for children to fully exit
+    await waitForProcessesExit(childPids, 5000);
+  }
+
+  logger.info('SYSTEM', 'Worker shutdown complete');
+}
+
+/**
+ * Close HTTP server with Windows-specific delays
+ * Windows needs extra time to release sockets properly
+ */
+async function closeHttpServer(server: http.Server): Promise<void> {
+  // Close all active connections
+  server.closeAllConnections();
+
+  // Give Windows time to close connections before closing server (prevents zombie ports)
+  if (process.platform === 'win32') {
+    await new Promise(r => setTimeout(r, 500));
+  }
+
+  // Close the server
+  await new Promise<void>((resolve, reject) => {
+    server.close(err => err ? reject(err) : resolve());
+  });
+
+  // Extra delay on Windows to ensure port is fully released
+  if (process.platform === 'win32') {
+    await new Promise(r => setTimeout(r, 500));
+    logger.info('SYSTEM', 'Waited for Windows port cleanup');
+  }
+}
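The step ordering in `performGracefulShutdown` can be illustrated with stand-in services. This is a minimal sketch, not the worker's actual wiring: `DemoConfig`, `shutdownInOrder`, `makeRecorder`, and `runDemo` are hypothetical names, the stand-ins only record when they were called, and the PID-file and child-process steps are left out.

```typescript
// Hypothetical stand-ins mirroring the shapes in GracefulShutdownConfig.
interface Closeable { close(): Promise<void>; }
interface Shutdownable { shutdownAll(): Promise<void>; }

interface DemoConfig {
  server: Closeable | null;      // stands in for http.Server
  sessionManager: Shutdownable;
  mcpClient?: Closeable;
  dbManager?: Closeable;
}

// Same step order as performGracefulShutdown, minus PID-file and child-process handling.
async function shutdownInOrder(config: DemoConfig): Promise<void> {
  if (config.server) await config.server.close();
  await config.sessionManager.shutdownAll();
  if (config.mcpClient) await config.mcpClient.close();
  if (config.dbManager) await config.dbManager.close();
}

// Build a stand-in that appends its name to a shared log when closed.
function makeRecorder(name: string, log: string[]): Closeable {
  return { close: async () => { log.push(name); } };
}

async function runDemo(): Promise<string[]> {
  const log: string[] = [];
  await shutdownInOrder({
    server: makeRecorder('server', log),
    sessionManager: { shutdownAll: async () => { log.push('sessions'); } },
    dbManager: makeRecorder('db', log) // mcpClient omitted on purpose
  });
  return log;
}
```

Because `mcpClient` and `dbManager` are optional in the config, a caller that has not yet initialized them can still invoke the shutdown path; the sketch skips whichever members are absent.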
@@ -0,0 +1,143 @@
+/**
+ * HealthMonitor - Port monitoring, health checks, and version checking
+ *
+ * Extracted from worker-service.ts monolith to provide centralized health monitoring.
+ * Handles:
+ * - Port availability checking
+ * - Worker health/readiness polling
+ * - Version mismatch detection (critical for plugin updates)
+ * - HTTP-based shutdown requests
+ */
+
+import path from 'path';
+import { homedir } from 'os';
+import { readFileSync } from 'fs';
+import { logger } from '../../utils/logger.js';
+
+/**
+ * Check if a port is in use by querying the health endpoint
+ */
+export async function isPortInUse(port: number): Promise<boolean> {
+  try {
+    // Note: Removed AbortSignal.timeout to avoid Windows Bun cleanup issue (libuv assertion)
+    const response = await fetch(`http://127.0.0.1:${port}/api/health`);
+    return response.ok;
+  } catch (error) {
+    // [ANTI-PATTERN IGNORED]: Health check polls every 500ms, logging would flood
+    return false;
+  }
+}
+
+/**
+ * Wait for the worker to become fully ready (passes readiness check)
+ * @param port Worker port to check
+ * @param timeoutMs Maximum time to wait in milliseconds
+ * @returns true if worker became ready, false if timeout
+ */
+export async function waitForHealth(port: number, timeoutMs: number = 30000): Promise<boolean> {
+  const start = Date.now();
+  while (Date.now() - start < timeoutMs) {
+    try {
+      // Note: Removed AbortSignal.timeout to avoid Windows Bun cleanup issue (libuv assertion)
+      const response = await fetch(`http://127.0.0.1:${port}/api/readiness`);
+      if (response.ok) return true;
+    } catch (error) {
+      // [ANTI-PATTERN IGNORED]: Retry loop - expected failures during startup, will retry
+      logger.debug('SYSTEM', 'Service not ready yet, will retry', { port }, error as Error);
+    }
+    await new Promise(r => setTimeout(r, 500));
+  }
+  return false;
+}
+
+/**
+ * Wait for a port to become free (no longer responding to health checks)
+ * Used after shutdown to confirm the port is available for restart
+ */
+export async function waitForPortFree(port: number, timeoutMs: number = 10000): Promise<boolean> {
+  const start = Date.now();
+  while (Date.now() - start < timeoutMs) {
+    if (!(await isPortInUse(port))) return true;
+    await new Promise(r => setTimeout(r, 500));
+  }
+  return false;
+}
+
+/**
+ * Send HTTP shutdown request to a running worker
+ * @param port Worker port
+ * @returns true if shutdown request was acknowledged, false otherwise
+ */
+export async function httpShutdown(port: number): Promise<boolean> {
+  try {
+    // Note: Removed AbortSignal.timeout to avoid Windows Bun cleanup issue (libuv assertion)
+    const response = await fetch(`http://127.0.0.1:${port}/api/admin/shutdown`, {
+      method: 'POST'
+    });
+    if (!response.ok) {
+      logger.warn('SYSTEM', 'Shutdown request returned error', { port, status: response.status });
+      return false;
+    }
+    return true;
+  } catch (error) {
+    // Connection refused is expected if worker already stopped
+    if (error instanceof Error && error.message?.includes('ECONNREFUSED')) {
+      logger.debug('SYSTEM', 'Worker already stopped', { port }, error);
+      return false;
+    }
+    // Unexpected error - log full details
+    logger.warn('SYSTEM', 'Shutdown request failed unexpectedly', { port }, error as Error);
+    return false;
+  }
+}
+
+/**
+ * Get the plugin version from the installed marketplace package.json
+ * This is the "expected" version that should be running
+ */
+export function getInstalledPluginVersion(): string {
+  const marketplaceRoot = path.join(homedir(), '.claude', 'plugins', 'marketplaces', 'thedotmack');
+  const packageJsonPath = path.join(marketplaceRoot, 'package.json');
+  const packageJson = JSON.parse(readFileSync(packageJsonPath, 'utf-8'));
+  return packageJson.version;
+}
+
+/**
+ * Get the running worker's version via API
+ * This is the "actual" version currently running
+ */
+export async function getRunningWorkerVersion(port: number): Promise<string | null> {
+  try {
+    const response = await fetch(`http://127.0.0.1:${port}/api/version`);
+    if (!response.ok) return null;
+    const data = await response.json() as { version: string };
+    return data.version;
+  } catch {
+    // Expected: worker not running or version endpoint unavailable
+    logger.debug('SYSTEM', 'Could not fetch worker version', { port });
+    return null;
+  }
+}
+
+export interface VersionCheckResult {
+  matches: boolean;
+  pluginVersion: string;
+  workerVersion: string | null;
+}
+
+/**
+ * Check if worker version matches plugin version
+ * Critical for detecting when plugin is updated but worker is still running old code
+ * The result's `matches` field is true if versions match or if the worker version can't be determined (assume match for graceful degradation)
+ */
+export async function checkVersionMatch(port: number): Promise<VersionCheckResult> {
+  const pluginVersion = getInstalledPluginVersion();
+  const workerVersion = await getRunningWorkerVersion(port);
+
+  // If we can't get worker version, assume it matches (graceful degradation)
+  if (!workerVersion) {
+    return { matches: true, pluginVersion, workerVersion };
+  }
+
+  return { matches: pluginVersion === workerVersion, pluginVersion, workerVersion };
+}
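The comparison inside `checkVersionMatch` can be separated from its file and network access, which makes the "unknown means match" rule easy to see. The sketch below is illustrative only: `evaluateVersions` and `shouldRestartWorker` are hypothetical names, and the restart policy is an assumption about how a caller might use the result, not something this diff shows.

```typescript
interface VersionCheck {
  matches: boolean;
  pluginVersion: string;
  workerVersion: string | null;
}

// Pure form of the comparison in checkVersionMatch: no file or network access.
function evaluateVersions(pluginVersion: string, workerVersion: string | null): VersionCheck {
  // An unknown worker version is treated as a match (graceful degradation).
  if (!workerVersion) {
    return { matches: true, pluginVersion, workerVersion };
  }
  return { matches: pluginVersion === workerVersion, pluginVersion, workerVersion };
}

// Hypothetical caller policy: restart only on a confirmed mismatch.
function shouldRestartWorker(check: VersionCheck): boolean {
  return !check.matches;
}
```

With this rule, a worker whose `/api/version` endpoint is unreachable is never restarted on version grounds alone; only a definite difference between the two strings counts as a mismatch.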
@@ -0,0 +1,306 @@
+/**
+ * ProcessManager - PID files, signal handlers, and child process lifecycle management
+ *
+ * Extracted from worker-service.ts monolith to provide centralized process management.
+ * Handles:
+ * - PID file management for daemon coordination
+ * - Signal handler registration for graceful shutdown
+ * - Child process enumeration and cleanup (especially for Windows zombie port fix)
+ */
+
+import path from 'path';
+import { homedir } from 'os';
+import { existsSync, writeFileSync, readFileSync, unlinkSync, mkdirSync } from 'fs';
+import { exec, execSync, spawn } from 'child_process';
+import { promisify } from 'util';
+import { logger } from '../../utils/logger.js';
+
+const execAsync = promisify(exec);
+
+// Standard paths for PID file management
+const DATA_DIR = path.join(homedir(), '.claude-mem');
+const PID_FILE = path.join(DATA_DIR, 'worker.pid');
+
+export interface PidInfo {
+  pid: number;
+  port: number;
+  startedAt: string;
+}
+
+/**
+ * Write PID info to the standard PID file location
+ */
+export function writePidFile(info: PidInfo): void {
+  mkdirSync(DATA_DIR, { recursive: true });
+  writeFileSync(PID_FILE, JSON.stringify(info, null, 2));
+}
+
+/**
+ * Read PID info from the standard PID file location
+ * Returns null if file doesn't exist or is corrupted
+ */
+export function readPidFile(): PidInfo | null {
+  if (!existsSync(PID_FILE)) return null;
+
+  try {
+    return JSON.parse(readFileSync(PID_FILE, 'utf-8'));
+  } catch (error) {
+    logger.warn('SYSTEM', 'Failed to parse PID file', { path: PID_FILE }, error as Error);
+    return null;
+  }
+}
+
+/**
+ * Remove the PID file (called during shutdown)
+ */
+export function removePidFile(): void {
+  if (!existsSync(PID_FILE)) return;
+
+  try {
+    unlinkSync(PID_FILE);
+  } catch (error) {
+    // [ANTI-PATTERN IGNORED]: Cleanup function - PID file removal failure is non-critical
+    logger.warn('SYSTEM', 'Failed to remove PID file', { path: PID_FILE }, error as Error);
+  }
+}
+
+/**
+ * Get platform-adjusted timeout (Windows socket cleanup is slower)
+ */
+export function getPlatformTimeout(baseMs: number): number {
+  const WINDOWS_MULTIPLIER = 2.0;
+  return process.platform === 'win32' ? Math.round(baseMs * WINDOWS_MULTIPLIER) : baseMs;
+}
+
+/**
+ * Get all child process PIDs (Windows only; returns an empty array on other platforms)
+ * Used for cleanup to prevent zombie ports when parent exits
+ */
+export async function getChildProcesses(parentPid: number): Promise<number[]> {
+  if (process.platform !== 'win32') {
+    return [];
+  }
+
+  // SECURITY: Validate PID is a positive integer to prevent command injection
+  if (!Number.isInteger(parentPid) || parentPid <= 0) {
+    logger.warn('SYSTEM', 'Invalid parent PID for child process enumeration', { parentPid });
+    return [];
+  }
+
+  try {
+    const cmd = `powershell -Command "Get-CimInstance Win32_Process | Where-Object { $_.ParentProcessId -eq ${parentPid} } | Select-Object -ExpandProperty ProcessId"`;
+    const { stdout } = await execAsync(cmd, { timeout: 60000 });
+    return stdout
+      .trim()
+      .split('\n')
+      .map(s => parseInt(s.trim(), 10))
+      .filter(n => !isNaN(n) && Number.isInteger(n) && n > 0);
+  } catch (error) {
+    // Shutdown cleanup - failure is non-critical, continue without child process cleanup
+    logger.warn('SYSTEM', 'Failed to enumerate child processes', { parentPid }, error as Error);
+    return [];
+  }
+}
+
+/**
+ * Force kill a process by PID
+ * Windows: uses taskkill /F /T to kill process tree
+ * Unix: uses SIGKILL
+ */
+export async function forceKillProcess(pid: number): Promise<void> {
+  // SECURITY: Validate PID is a positive integer to prevent command injection
+  if (!Number.isInteger(pid) || pid <= 0) {
+    logger.warn('SYSTEM', 'Invalid PID for force kill', { pid });
+    return;
+  }
+
+  try {
+    if (process.platform === 'win32') {
+      // /T kills entire process tree, /F forces termination
+      await execAsync(`taskkill /PID ${pid} /T /F`, { timeout: 60000 });
+    } else {
+      process.kill(pid, 'SIGKILL');
+    }
+    logger.info('SYSTEM', 'Killed process', { pid });
+  } catch (error) {
+    // [ANTI-PATTERN IGNORED]: Shutdown cleanup - process already exited, continue
+    logger.debug('SYSTEM', 'Process already exited during force kill', { pid }, error as Error);
+  }
+}
+
+/**
+ * Wait for processes to fully exit
+ */
+export async function waitForProcessesExit(pids: number[], timeoutMs: number): Promise<void> {
+  const start = Date.now();
+
+  while (Date.now() - start < timeoutMs) {
+    const stillAlive = pids.filter(pid => {
+      try {
+        process.kill(pid, 0);
+        return true;
+      } catch (error) {
+        // [ANTI-PATTERN IGNORED]: Tight loop checking 100s of PIDs every 100ms during cleanup
+        return false;
+      }
+    });
+
+    if (stillAlive.length === 0) {
+      logger.info('SYSTEM', 'All child processes exited');
+      return;
+    }
+
+    logger.debug('SYSTEM', 'Waiting for processes to exit', { stillAlive });
+    await new Promise(r => setTimeout(r, 100));
+  }
+
+  logger.warn('SYSTEM', 'Timeout waiting for child processes to exit');
+}
+
+/**
+ * Clean up orphaned chroma-mcp processes from previous worker sessions
+ * Prevents process accumulation and memory leaks
+ */
+export async function cleanupOrphanedProcesses(): Promise<void> {
+  const isWindows = process.platform === 'win32';
+  const pids: number[] = [];
+
+  try {
+    if (isWindows) {
+      // Windows: Use PowerShell Get-CimInstance to find chroma-mcp processes
+      const cmd = `powershell -Command "Get-CimInstance Win32_Process | Where-Object { $_.Name -like '*python*' -and $_.CommandLine -like '*chroma-mcp*' } | Select-Object -ExpandProperty ProcessId"`;
+      const { stdout } = await execAsync(cmd, { timeout: 60000 });
+
+      if (!stdout.trim()) {
+        logger.debug('SYSTEM', 'No orphaned chroma-mcp processes found (Windows)');
+        return;
+      }
+
+      const pidStrings = stdout.trim().split('\n');
+      for (const pidStr of pidStrings) {
+        const pid = parseInt(pidStr.trim(), 10);
+        // SECURITY: Validate PID is positive integer before adding to list
+        if (!isNaN(pid) && Number.isInteger(pid) && pid > 0) {
+          pids.push(pid);
+        }
+      }
+    } else {
+      // Unix: Use ps aux | grep
+      const { stdout } = await execAsync('ps aux | grep "chroma-mcp" | grep -v grep || true');
+
+      if (!stdout.trim()) {
+        logger.debug('SYSTEM', 'No orphaned chroma-mcp processes found (Unix)');
+        return;
+      }
+
+      const lines = stdout.trim().split('\n');
+      for (const line of lines) {
+        const parts = line.trim().split(/\s+/);
+        if (parts.length > 1) {
+          const pid = parseInt(parts[1], 10);
+          // SECURITY: Validate PID is positive integer before adding to list
+          if (!isNaN(pid) && Number.isInteger(pid) && pid > 0) {
+            pids.push(pid);
+          }
+        }
+      }
+    }
+  } catch (error) {
+    // Orphan cleanup is non-critical - log and continue
+    logger.warn('SYSTEM', 'Failed to enumerate orphaned processes', {}, error as Error);
+    return;
+  }
+
+  if (pids.length === 0) {
+    return;
+  }
+
+  logger.info('SYSTEM', 'Cleaning up orphaned chroma-mcp processes', {
+    platform: isWindows ? 'Windows' : 'Unix',
+    count: pids.length,
+    pids
+  });
+
+  // Kill all found processes
+  if (isWindows) {
+    for (const pid of pids) {
+      // SECURITY: Double-check PID validation before using in taskkill command
+      if (!Number.isInteger(pid) || pid <= 0) {
+        logger.warn('SYSTEM', 'Skipping invalid PID', { pid });
+        continue;
+      }
+      try {
+        execSync(`taskkill /PID ${pid} /T /F`, { timeout: 60000, stdio: 'ignore' });
+      } catch (error) {
+        // [ANTI-PATTERN IGNORED]: Cleanup loop - process may have exited, continue to next PID
+        logger.debug('SYSTEM', 'Failed to kill process, may have already exited', { pid }, error as Error);
+      }
+    }
+  } else {
+    for (const pid of pids) {
+      try {
+        process.kill(pid, 'SIGKILL');
+      } catch (error) {
+        // [ANTI-PATTERN IGNORED]: Cleanup loop - process may have exited, continue to next PID
+        logger.debug('SYSTEM', 'Process already exited', { pid }, error as Error);
+      }
+    }
+  }
+
+  logger.info('SYSTEM', 'Orphaned processes cleaned up', { count: pids.length });
+}
+
+/**
+ * Spawn a detached daemon process
+ * Returns the child PID or undefined if spawn failed
+ */
+export function spawnDaemon(
+  scriptPath: string,
+  port: number,
+  extraEnv: Record<string, string> = {}
+): number | undefined {
+  const child = spawn(process.execPath, [scriptPath, '--daemon'], {
+    detached: true,
+    stdio: 'ignore',
+    windowsHide: true,
+    env: {
+      ...process.env,
+      CLAUDE_MEM_WORKER_PORT: String(port),
+      ...extraEnv
+    }
+  });
+
+  if (child.pid === undefined) {
+    return undefined;
+  }
+
+  child.unref();
+  return child.pid;
+}
+
+/**
+ * Factory that creates a signal handler for graceful shutdown
+ * Returns a handler function that can be passed to process.on('SIGTERM') etc.
+ */
+export function createSignalHandler(
+  shutdownFn: () => Promise<void>,
+  isShuttingDownRef: { value: boolean }
+): (signal: string) => Promise<void> {
+  return async (signal: string) => {
+    if (isShuttingDownRef.value) {
+      logger.warn('SYSTEM', `Received ${signal} but shutdown already in progress`);
+      return;
+    }
+    isShuttingDownRef.value = true;
+
+    logger.info('SYSTEM', `Received ${signal}, shutting down...`);
+    try {
+      await shutdownFn();
+      process.exit(0);
+    } catch (error) {
+      // Top-level signal handler - log any shutdown error and exit
+      logger.error('SYSTEM', 'Error during shutdown', {}, error as Error);
+      process.exit(1);
+    }
+  };
+}
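`getChildProcesses` and `cleanupOrphanedProcesses` both parse line-oriented command output into PIDs and validate each one before it reaches `taskkill`. A minimal sketch of that parsing step follows. `parsePidList` is a hypothetical helper, and it is deliberately stricter than the `parseInt`-based code above: it accepts only lines made entirely of digits, whereas `parseInt` would read `12abc` as `12`.

```typescript
// Hypothetical helper: turn line-oriented command output into validated PIDs.
function parsePidList(stdout: string): number[] {
  return stdout
    .split('\n')
    // trim() also removes the trailing '\r' from Windows line endings.
    .map(line => line.trim())
    // Accept only plain decimal digits; this rejects blanks, signs, and mixed text.
    .filter(line => /^\d+$/.test(line))
    .map(line => Number(line))
    // PIDs must be positive integers before they are interpolated into a command.
    .filter(pid => Number.isSafeInteger(pid) && pid > 0);
}
```

Keeping the validation in one place would also remove the need to re-check each PID immediately before the `taskkill` call, since nothing other than a positive integer can leave the parser.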
@@ -0,0 +1,7 @@
+/**
+ * Infrastructure module - Process management, health monitoring, and shutdown utilities
+ */
+
+export * from './ProcessManager.js';
+export * from './HealthMonitor.js';
+export * from './GracefulShutdown.js';
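The re-entrancy guard in `createSignalHandler` is what keeps a second signal from starting a second shutdown. The sketch below shows the same guard with an injectable exit function so it can run without terminating the process. That injection is an assumption made for illustration (the original calls `process.exit` directly), and `createGuardedHandler` and the demo variables are hypothetical names.

```typescript
type ExitFn = (code: number) => void;

// Variant of createSignalHandler with an injectable exit function.
function createGuardedHandler(
  shutdownFn: () => Promise<void>,
  state: { value: boolean },
  exit: ExitFn
): (signal: string) => Promise<void> {
  return async (signal: string) => {
    // The flag is checked and set synchronously, before any await,
    // so a second signal arriving mid-shutdown is ignored.
    if (state.value) {
      console.warn(`Received ${signal} but shutdown already in progress`);
      return;
    }
    state.value = true;
    console.info(`Received ${signal}, shutting down...`);
    try {
      await shutdownFn();
      exit(0);
    } catch {
      exit(1);
    }
  };
}

// Demo: two signals in quick succession trigger a single shutdown.
let shutdownCalls = 0;
const exitCodes: number[] = [];
const handler = createGuardedHandler(
  async () => { shutdownCalls += 1; },
  { value: false },
  code => { exitCodes.push(code); }
);
const pending = Promise.all([handler('SIGTERM'), handler('SIGINT')]);
```

In real use the returned handler would be registered with `process.on('SIGTERM', ...)` and `process.on('SIGINT', ...)`, sharing one state object so both signals see the same flag.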