Fix memory leaks from orphaned uvx/python processes (#120)
This fixes memory leak, will remove one unnecessary MCP after this in a new PR but this is mission critical fix * Initial plan * Fix memory leaks: Add proper cleanup for ChromaSync and search server processes Co-authored-by: thedotmack <683968+thedotmack@users.noreply.github.com> * Add comprehensive process cleanup and PM2 configuration improvements Co-authored-by: thedotmack <683968+thedotmack@users.noreply.github.com> * Add comprehensive summary and recommendations for memory leak fixes Co-authored-by: thedotmack <683968+thedotmack@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: thedotmack <683968+thedotmack@users.noreply.github.com>
This commit is contained in:
@@ -0,0 +1,189 @@
|
||||
# Memory Leak Fixes - Process Cleanup
|
||||
|
||||
## Problem Summary
|
||||
|
||||
Multiple `uvx` and Python processes were accumulating over time, eventually consuming excessive system resources. The root cause was improper cleanup of child processes spawned by:
|
||||
|
||||
1. **ChromaSync** - Each instance spawns a `uvx chroma-mcp` process via MCP StdioClientTransport
|
||||
2. **Search Server** - Spawns a `uvx chroma-mcp` process for semantic search
|
||||
3. **Worker Service** - Creates an MCP client connection to the search server
|
||||
|
||||
## Root Causes
|
||||
|
||||
### 1. ChromaSync Not Closed in DatabaseManager
|
||||
**Location**: `src/services/worker/DatabaseManager.ts:42-52`
|
||||
|
||||
**Problem**: The `close()` method did not call `chromaSync.close()`, leaving the uvx process running even after the worker shut down.
|
||||
|
||||
**Fix**: Added explicit ChromaSync cleanup in the close() method:
|
||||
```typescript
|
||||
async close(): Promise<void> {
|
||||
// Close ChromaSync first (terminates uvx/python processes)
|
||||
if (this.chromaSync) {
|
||||
try {
|
||||
await this.chromaSync.close();
|
||||
this.chromaSync = null;
|
||||
} catch (error) {
|
||||
logger.error('DB', 'Failed to close ChromaSync', {}, error as Error);
|
||||
}
|
||||
}
|
||||
// ... rest of cleanup
|
||||
}
|
||||
```
|
||||
|
||||
### 2. Search Server No Cleanup Handlers
|
||||
**Location**: `src/servers/search-server.ts:1743-1781`
|
||||
|
||||
**Problem**: The search server had no SIGTERM/SIGINT handlers, so child processes were orphaned when the server was terminated (especially during PM2 restarts).
|
||||
|
||||
**Fix**: Added comprehensive cleanup function:
|
||||
```typescript
|
||||
async function cleanup() {
|
||||
console.error('[search-server] Shutting down...');
|
||||
|
||||
// Close Chroma client (terminates uvx/python processes)
|
||||
if (chromaClient) {
|
||||
await chromaClient.close();
|
||||
}
|
||||
|
||||
// Close database connections
|
||||
if (search) search.close();
|
||||
if (store) store.close();
|
||||
|
||||
process.exit(0);
|
||||
}
|
||||
|
||||
// Register cleanup handlers
|
||||
process.on('SIGTERM', cleanup);
|
||||
process.on('SIGINT', cleanup);
|
||||
```
|
||||
|
||||
### 3. Worker Service Not Closing MCP Client
|
||||
**Location**: `src/services/worker-service.ts:214-230`
|
||||
|
||||
**Problem**: The worker service connected to the search server via MCP client but never closed the connection, keeping the search server process alive.
|
||||
|
||||
**Fix**: Added MCP client cleanup in shutdown:
|
||||
```typescript
|
||||
async shutdown(): Promise<void> {
|
||||
await this.sessionManager.shutdownAll();
|
||||
|
||||
// Close MCP client connection (terminates search server process)
|
||||
if (this.mcpClient) {
|
||||
try {
|
||||
await this.mcpClient.close();
|
||||
logger.info('SYSTEM', 'MCP client closed');
|
||||
} catch (error) {
|
||||
logger.error('SYSTEM', 'Failed to close MCP client', {}, error as Error);
|
||||
}
|
||||
}
|
||||
|
||||
// ... rest of shutdown
|
||||
}
|
||||
```
|
||||
|
||||
### 4. PM2 Configuration Not Optimized for Graceful Shutdown
|
||||
**Location**: `ecosystem.config.cjs`
|
||||
|
||||
**Problem**: PM2 watch mode was restarting the worker frequently, but without proper configuration for graceful shutdown, child processes could be orphaned.
|
||||
|
||||
**Fix**: Enhanced PM2 configuration:
|
||||
```javascript
|
||||
{
|
||||
kill_timeout: 5000, // Extra time for cleanup
|
||||
wait_ready: true, // Wait for process to be ready
|
||||
kill_signal: 'SIGTERM', // Use graceful shutdown signal
|
||||
ignore_watch: [
|
||||
'vector-db', // Don't restart on Chroma DB changes
|
||||
'.claude-mem' // Don't restart on data changes
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
## Process Lifecycle
|
||||
|
||||
### Before Fixes
|
||||
```
|
||||
SessionStart -> Worker -> DatabaseManager -> ChromaSync -> uvx (orphaned)
|
||||
\-> MCP Client -> Search Server -> uvx (orphaned)
|
||||
\-> Chroma Client -> uvx (orphaned)
|
||||
Worker Restart -> 3 new orphaned processes per restart
|
||||
```
|
||||
|
||||
### After Fixes
|
||||
```
|
||||
SessionStart -> Worker -> DatabaseManager -> ChromaSync -> uvx
|
||||
↓
|
||||
Shutdown -> DatabaseManager.close() -> chromaSync.close() -> terminates uvx
|
||||
|
||||
Worker -> MCP Client -> Search Server -> Chroma Client -> uvx
|
||||
↓ ↓
|
||||
Worker.shutdown() -> mcpClient.close() ↓
|
||||
↓ ↓
|
||||
Search Server cleanup() -> chromaClient.close()
|
||||
↓
|
||||
terminates uvx
|
||||
```
|
||||
|
||||
## Testing Process Cleanup
|
||||
|
||||
### Manual Test
|
||||
1. Start worker: `pm2 start ecosystem.config.cjs`
|
||||
2. Check processes: `ps aux | grep -E "(uvx|python.*chroma)" | grep -v grep`
|
||||
3. Create a session (trigger ChromaSync)
|
||||
4. Check process count again
|
||||
5. Restart worker: `pm2 restart claude-mem-worker`
|
||||
6. Wait 5 seconds for cleanup
|
||||
7. Check final process count - should return to baseline
|
||||
|
||||
### Expected Behavior
|
||||
- **Baseline**: 0-1 uvx/python processes (persistent PM2 worker)
|
||||
- **During Session**: +2-3 processes (ChromaSync, Search Server, Chroma)
|
||||
- **After Restart**: Returns to baseline within 5 seconds
|
||||
|
||||
## Verification
|
||||
|
||||
Run the test script:
|
||||
```bash
|
||||
chmod +x tests/test-process-cleanup.sh
|
||||
./tests/test-process-cleanup.sh
|
||||
```
|
||||
|
||||
Expected output:
|
||||
```
|
||||
=== Process Cleanup Test ===
|
||||
1. Initial process count: 0
|
||||
2. Starting test process...
|
||||
During execution: 3 processes
|
||||
3. Final process count: 0
|
||||
✅ PASS: No process leaks detected
|
||||
```
|
||||
|
||||
## Monitoring
|
||||
|
||||
To monitor for leaks in production:
|
||||
|
||||
```bash
|
||||
# Watch process count over time
|
||||
watch -n 5 'ps aux | grep -E "(uvx|python.*chroma)" | grep -v grep | wc -l'
|
||||
|
||||
# Detailed process list
|
||||
ps aux | grep -E "(uvx|python.*chroma)" | grep -v grep
|
||||
|
||||
# PM2 process monitoring
|
||||
pm2 monit
|
||||
```
|
||||
|
||||
## Additional Safeguards
|
||||
|
||||
1. **Error Handling**: All cleanup operations have try-catch blocks to ensure partial cleanup succeeds even if one component fails
|
||||
2. **Logging**: Comprehensive logging of cleanup operations for debugging
|
||||
3. **Timeout Configuration**: PM2 kill_timeout ensures enough time for graceful shutdown
|
||||
4. **Signal Handling**: Both SIGTERM and SIGINT handlers registered for flexibility
|
||||
|
||||
## Future Improvements
|
||||
|
||||
1. **Process Monitoring**: Add metrics to track child process count over time
|
||||
2. **Health Checks**: Periodic verification that process count stays within expected bounds
|
||||
3. **Automatic Cleanup**: Detect and clean up orphaned processes on worker startup
|
||||
4. **Resource Limits**: Set memory/CPU limits on child processes to prevent runaway resource usage
|
||||
Reference in New Issue
Block a user