Files
claude-mem/MEMORY_LEAK_FIXES.md
T
Copilot c46e4a341a Fix memory leaks from orphaned uvx/python processes (#120)
This fixes memory leak, will remove one unnecessary MCP after this in a new PR but this is mission critical fix

* Initial plan

* Fix memory leaks: Add proper cleanup for ChromaSync and search server processes

Co-authored-by: thedotmack <683968+thedotmack@users.noreply.github.com>

* Add comprehensive process cleanup and PM2 configuration improvements

Co-authored-by: thedotmack <683968+thedotmack@users.noreply.github.com>

* Add comprehensive summary and recommendations for memory leak fixes

Co-authored-by: thedotmack <683968+thedotmack@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: thedotmack <683968+thedotmack@users.noreply.github.com>
2025-11-16 22:16:41 -05:00

190 lines
6.1 KiB
Markdown

# Memory Leak Fixes - Process Cleanup
## Problem Summary
Multiple `uvx` and Python processes were accumulating over time, eventually consuming excessive system resources. The root cause was improper cleanup of child processes spawned by:
1. **ChromaSync** - Each instance spawns a `uvx chroma-mcp` process via MCP StdioClientTransport
2. **Search Server** - Spawns a `uvx chroma-mcp` process for semantic search
3. **Worker Service** - Creates an MCP client connection to the search server
## Root Causes
### 1. ChromaSync Not Closed in DatabaseManager
**Location**: `src/services/worker/DatabaseManager.ts:42-52`
**Problem**: The `close()` method did not call `chromaSync.close()`, leaving the uvx process running even after the worker shut down.
**Fix**: Added explicit ChromaSync cleanup in the close() method:
```typescript
async close(): Promise<void> {
// Close ChromaSync first (terminates uvx/python processes)
if (this.chromaSync) {
try {
await this.chromaSync.close();
this.chromaSync = null;
} catch (error) {
logger.error('DB', 'Failed to close ChromaSync', {}, error as Error);
}
}
// ... rest of cleanup
}
```
### 2. Search Server No Cleanup Handlers
**Location**: `src/servers/search-server.ts:1743-1781`
**Problem**: The search server had no SIGTERM/SIGINT handlers, so child processes were orphaned when the server was terminated (especially during PM2 restarts).
**Fix**: Added comprehensive cleanup function:
```typescript
async function cleanup() {
console.error('[search-server] Shutting down...');
// Close Chroma client (terminates uvx/python processes)
if (chromaClient) {
await chromaClient.close();
}
// Close database connections
if (search) search.close();
if (store) store.close();
process.exit(0);
}
// Register cleanup handlers
process.on('SIGTERM', cleanup);
process.on('SIGINT', cleanup);
```
### 3. Worker Service Not Closing MCP Client
**Location**: `src/services/worker-service.ts:214-230`
**Problem**: The worker service connected to the search server via MCP client but never closed the connection, keeping the search server process alive.
**Fix**: Added MCP client cleanup in shutdown:
```typescript
async shutdown(): Promise<void> {
await this.sessionManager.shutdownAll();
// Close MCP client connection (terminates search server process)
if (this.mcpClient) {
try {
await this.mcpClient.close();
logger.info('SYSTEM', 'MCP client closed');
} catch (error) {
logger.error('SYSTEM', 'Failed to close MCP client', {}, error as Error);
}
}
// ... rest of shutdown
}
```
### 4. PM2 Configuration Not Optimized for Graceful Shutdown
**Location**: `ecosystem.config.cjs`
**Problem**: PM2 watch mode was restarting the worker frequently, but without proper configuration for graceful shutdown, child processes could be orphaned.
**Fix**: Enhanced PM2 configuration:
```javascript
{
kill_timeout: 5000, // Extra time for cleanup
wait_ready: true, // Wait for process to be ready
kill_signal: 'SIGTERM', // Use graceful shutdown signal
ignore_watch: [
'vector-db', // Don't restart on Chroma DB changes
'.claude-mem' // Don't restart on data changes
]
}
```
## Process Lifecycle
### Before Fixes
```
SessionStart -> Worker -> DatabaseManager -> ChromaSync -> uvx (orphaned)
\-> MCP Client -> Search Server -> uvx (orphaned)
\-> Chroma Client -> uvx (orphaned)
Worker Restart -> 3 new orphaned processes per restart
```
### After Fixes
```
SessionStart -> Worker -> DatabaseManager -> ChromaSync -> uvx
Shutdown -> DatabaseManager.close() -> chromaSync.close() -> terminates uvx
Worker -> MCP Client -> Search Server -> Chroma Client -> uvx
↓ ↓
Worker.shutdown() -> mcpClient.close() ↓
↓ ↓
Search Server cleanup() -> chromaClient.close()
terminates uvx
```
## Testing Process Cleanup
### Manual Test
1. Start worker: `pm2 start ecosystem.config.cjs`
2. Check processes: `ps aux | grep -E "(uvx|python.*chroma)" | grep -v grep`
3. Create a session (trigger ChromaSync)
4. Check process count again
5. Restart worker: `pm2 restart claude-mem-worker`
6. Wait 5 seconds for cleanup
7. Check final process count - should return to baseline
### Expected Behavior
- **Baseline**: 0-1 uvx/python processes (persistent PM2 worker)
- **During Session**: +2-3 processes (ChromaSync, Search Server, Chroma)
- **After Restart**: Returns to baseline within 5 seconds
## Verification
Run the test script:
```bash
chmod +x tests/test-process-cleanup.sh
./tests/test-process-cleanup.sh
```
Expected output:
```
=== Process Cleanup Test ===
1. Initial process count: 0
2. Starting test process...
During execution: 3 processes
3. Final process count: 0
✅ PASS: No process leaks detected
```
## Monitoring
To monitor for leaks in production:
```bash
# Watch process count over time
watch -n 5 'ps aux | grep -E "(uvx|python.*chroma)" | grep -v grep | wc -l'
# Detailed process list
ps aux | grep -E "(uvx|python.*chroma)" | grep -v grep
# PM2 process monitoring
pm2 monit
```
## Additional Safeguards
1. **Error Handling**: All cleanup operations have try-catch blocks to ensure partial cleanup succeeds even if one component fails
2. **Logging**: Comprehensive logging of cleanup operations for debugging
3. **Timeout Configuration**: PM2 kill_timeout ensures enough time for graceful shutdown
4. **Signal Handling**: Both SIGTERM and SIGINT handlers registered for flexibility
## Future Improvements
1. **Process Monitoring**: Add metrics to track child process count over time
2. **Health Checks**: Periodic verification that process count stays within expected bounds
3. **Automatic Cleanup**: Detect and clean up orphaned processes on worker startup
4. **Resource Limits**: Set memory/CPU limits on child processes to prevent runaway resource usage