# Memory Leak Fixes - Process Cleanup

## Problem Summary

Multiple `uvx` and Python processes were accumulating over time, eventually consuming excessive system resources. The root cause was improper cleanup of child processes spawned by:

1. **ChromaSync** - Each instance spawns a `uvx chroma-mcp` process via the MCP `StdioClientTransport`
2. **Search Server** - Spawns a `uvx chroma-mcp` process for semantic search
3. **Worker Service** - Creates an MCP client connection to the search server
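
All three leaks share the same pattern: an MCP stdio client spawns a child process and owns its lifetime, so the child dies only if the client is closed. A minimal sketch of that pattern, assuming the MCP TypeScript SDK (the exact `chroma-mcp` arguments here are illustrative, not the real invocation):

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// The transport spawns `uvx chroma-mcp` as a child process and owns its
// lifetime; if client.close() is never called, the child outlives its parent.
const transport = new StdioClientTransport({
  command: "uvx",
  args: ["chroma-mcp"], // illustrative; the real invocation may pass more flags
});
const client = new Client({ name: "claude-mem", version: "1.0.0" });

await client.connect(transport);
// ... issue MCP requests ...
await client.close(); // closes the transport and terminates the uvx process
```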

## Root Causes

### 1. ChromaSync Not Closed in DatabaseManager

**Location:** `src/services/worker/DatabaseManager.ts:42-52`

**Problem:** The `close()` method did not call `chromaSync.close()`, leaving the `uvx` process running even after the worker shut down.

**Fix:** Added explicit ChromaSync cleanup in the `close()` method:

```typescript
async close(): Promise<void> {
  // Close ChromaSync first (terminates uvx/python processes)
  if (this.chromaSync) {
    try {
      await this.chromaSync.close();
      this.chromaSync = null;
    } catch (error) {
      logger.error('DB', 'Failed to close ChromaSync', {}, error as Error);
    }
  }
  // ... rest of cleanup
}
```

### 2. Search Server Had No Cleanup Handlers

**Location:** `src/servers/search-server.ts:1743-1781`

**Problem:** The search server registered no SIGTERM/SIGINT handlers, so its child processes were orphaned whenever the server was terminated (especially during PM2 restarts).

**Fix:** Added a comprehensive cleanup function:

```typescript
let shuttingDown = false;

async function cleanup() {
  if (shuttingDown) return; // guard against a second signal re-entering cleanup
  shuttingDown = true;
  console.error('[search-server] Shutting down...');

  // Close Chroma client (terminates uvx/python processes); a failure here
  // must not prevent the rest of the shutdown from running
  try {
    if (chromaClient) {
      await chromaClient.close();
    }
  } catch (error) {
    console.error('[search-server] Failed to close Chroma client', error);
  }

  // Close database connections
  if (search) search.close();
  if (store) store.close();

  process.exit(0);
}

// Register cleanup handlers
process.on('SIGTERM', cleanup);
process.on('SIGINT', cleanup);
```

### 3. Worker Service Not Closing MCP Client

**Location:** `src/services/worker-service.ts:214-230`

**Problem:** The worker service connected to the search server via an MCP client but never closed the connection, keeping the search server process alive.

**Fix:** Added MCP client cleanup in `shutdown()`:

```typescript
async shutdown(): Promise<void> {
  await this.sessionManager.shutdownAll();

  // Close MCP client connection (terminates search server process)
  if (this.mcpClient) {
    try {
      await this.mcpClient.close();
      logger.info('SYSTEM', 'MCP client closed');
    } catch (error) {
      logger.error('SYSTEM', 'Failed to close MCP client', {}, error as Error);
    }
  }

  // ... rest of shutdown
}
```
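
The document doesn't show how `shutdown()` gets invoked; with the PM2 settings below (`kill_signal: 'SIGTERM'`), the natural wiring is a signal handler in the worker's entry point. A hedged sketch, with `workerService` as an illustrative name rather than the actual export:

```typescript
// Illustrative handle to the service shown above (not the actual export).
declare const workerService: { shutdown(): Promise<void> };

// PM2 sends SIGTERM on restart/stop (see kill_signal below) and waits
// kill_timeout ms before escalating to SIGKILL.
process.on('SIGTERM', async () => {
  try {
    await workerService.shutdown();
  } finally {
    process.exit(0); // exit explicitly so PM2 doesn't wait out the timeout
  }
});
```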

### 4. PM2 Configuration Not Optimized for Graceful Shutdown

**Location:** `ecosystem.config.cjs`

**Problem:** PM2's watch mode was restarting the worker frequently, and without configuration for graceful shutdown, each restart could orphan child processes.

**Fix:** Enhanced the PM2 configuration:

```javascript
{
  kill_timeout: 5000,        // Extra time for cleanup
  wait_ready: true,          // Wait for process to be ready
  kill_signal: 'SIGTERM',    // Use graceful shutdown signal
  ignore_watch: [
    'vector-db',             // Don't restart on Chroma DB changes
    '.claude-mem'            // Don't restart on data changes
  ]
}
```
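
Note that `wait_ready: true` only helps if the application signals readiness itself: PM2 holds the restart cycle until the child sends a `'ready'` message (or `listen_timeout` elapses). A minimal sketch of what the worker's startup path would need to include:

```typescript
// Call once startup work (DB connections, MCP clients, ...) has finished.
function signalReady(): void {
  // process.send is only defined when the process was forked by PM2
  if (process.send) {
    process.send('ready'); // satisfies wait_ready: true in ecosystem.config.cjs
  }
}
```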

## Process Lifecycle

### Before Fixes

```
SessionStart -> Worker -> DatabaseManager -> ChromaSync -> uvx (orphaned)
                    \-> MCP Client -> Search Server -> uvx (orphaned)
                                           \-> Chroma Client -> uvx (orphaned)

Worker restart -> 3 new orphaned processes per restart
```

### After Fixes

```
SessionStart -> Worker -> DatabaseManager -> ChromaSync -> uvx
Shutdown -> DatabaseManager.close() -> chromaSync.close() -> terminates uvx

Worker -> MCP Client -> Search Server -> Chroma Client -> uvx
Worker.shutdown() -> mcpClient.close()
                  -> Search Server cleanup() -> chromaClient.close()
                  -> terminates uvx
```

## Testing Process Cleanup

### Manual Test

1. Start the worker: `pm2 start ecosystem.config.cjs`
2. Check processes: `ps aux | grep -E "(uvx|python.*chroma)" | grep -v grep`
3. Create a session (this triggers ChromaSync)
4. Check the process count again
5. Restart the worker: `pm2 restart claude-mem-worker`
6. Wait 5 seconds for cleanup
7. Check the final process count; it should return to baseline

### Expected Behavior

- **Baseline:** 0-1 uvx/python processes (the persistent PM2 worker)
- **During Session:** +2-3 processes (ChromaSync, Search Server, Chroma)
- **After Restart:** returns to baseline within 5 seconds

### Verification

Run the test script:

```bash
chmod +x tests/test-process-cleanup.sh
./tests/test-process-cleanup.sh
```

Expected output:

```
=== Process Cleanup Test ===
1. Initial process count: 0
2. Starting test process...
   During execution: 3 processes
3. Final process count: 0
✅ PASS: No process leaks detected
```
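
The script's core check reduces to comparing process counts before and after a restart. A hedged sketch of that check (not the actual script), reusing the `ps`/`grep` pattern from the manual test:

```typescript
import { execSync } from "node:child_process";

// Count uvx/chroma processes with the same pattern as the manual test.
// `wc -l` terminates the pipeline, so the command exits 0 even with no matches.
function countChromaProcesses(): number {
  const out = execSync(
    'ps aux | grep -E "(uvx|python.*chroma)" | grep -v grep | wc -l',
    { encoding: "utf8" }
  );
  return parseInt(out.trim(), 10);
}

const baseline = countChromaProcesses();
// ... start the worker, run a session, restart, wait ~5s for cleanup ...
if (countChromaProcesses() > baseline) {
  throw new Error("Process leak detected");
}
```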

## Monitoring

To monitor for leaks in production:

```bash
# Watch process count over time
watch -n 5 'ps aux | grep -E "(uvx|python.*chroma)" | grep -v grep | wc -l'

# Detailed process list
ps aux | grep -E "(uvx|python.*chroma)" | grep -v grep

# PM2 process monitoring
pm2 monit
```

## Additional Safeguards

1. **Error Handling:** Every cleanup operation is wrapped in a try/catch block, so the remaining components are still cleaned up even if one fails
2. **Logging:** Cleanup operations are logged comprehensively for debugging
3. **Timeout Configuration:** PM2's `kill_timeout` allows enough time for graceful shutdown
4. **Signal Handling:** Both SIGTERM and SIGINT handlers are registered for flexibility

## Future Improvements

1. **Process Monitoring:** Add metrics to track child process count over time
2. **Health Checks:** Periodically verify that the process count stays within expected bounds
3. **Automatic Cleanup:** Detect and clean up orphaned processes on worker startup (see the sketch after this list)
4. **Resource Limits:** Set memory/CPU limits on child processes to prevent runaway resource usage
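
For the automatic-cleanup idea (item 3 above), one possible approach is a startup sweep for `chroma-mcp` processes that have been re-parented to PID 1, a telltale sign of an orphan from a previous run on most Unix systems. This is a sketch of the idea, not shipped code:

```typescript
import { execSync } from "node:child_process";

// Find chroma-mcp processes whose parent is PID 1 (orphans) and terminate them.
function killOrphanedChromaProcesses(): void {
  let pids: string[];
  try {
    pids = execSync('pgrep -f "chroma-mcp"', { encoding: "utf8" })
      .trim()
      .split("\n")
      .filter(Boolean);
  } catch {
    return; // pgrep exits non-zero when nothing matches
  }
  for (const pid of pids) {
    const ppid = execSync(`ps -o ppid= -p ${pid}`, { encoding: "utf8" }).trim();
    if (ppid === "1") {
      process.kill(Number(pid), "SIGTERM");
    }
  }
}
```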