2659ec3231
* Refactor CLAUDE.md and related files for December 2025 updates - Updated CLAUDE.md in src/services/worker with new entries for December 2025, including changes to Search.ts, GeminiAgent.ts, SDKAgent.ts, and SessionManager.ts. - Revised CLAUDE.md in src/shared to reflect updates and new entries for December 2025, including paths.ts and worker-utils.ts. - Modified hook-constants.ts to clarify exit codes and their behaviors. - Added comprehensive hooks reference documentation for Claude Code, detailing usage, events, and examples. - Created initial CLAUDE.md files in various directories to track recent activity. * fix: Merge user-message-hook output into context-hook hookSpecificOutput - Add footer message to additionalContext in context-hook.ts - Remove user-message-hook from SessionStart hooks array - Fixes issue where stderr+exit(1) approach was silently discarded Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Update logs and documentation for recent plugin and worker service changes - Added detailed logs for worker service activities from Dec 10, 2025 to Jan 7, 2026, including initialization patterns, cleanup confirmations, and diagnostic logging. - Updated plugin documentation with recent activities, including plugin synchronization and configuration changes from Dec 3, 2025 to Jan 7, 2026. - Enhanced the context hook and worker service logs to reflect improvements and fixes in the plugin architecture. - Documented the migration and verification processes for the Claude memory system and its integration with the marketplace. * Refactor hooks architecture and remove deprecated user-message-hook - Updated hook configurations in CLAUDE.md and hooks.json to reflect changes in session start behavior. - Removed user-message-hook functionality as it is no longer utilized in Claude Code 2.1.0; context is now injected silently. - Enhanced context-hook to handle session context injection without user-visible messages. - Cleaned up documentation across multiple files to align with the new hook structure and removed references to obsolete hooks. - Adjusted timing and command execution for hooks to improve performance and reliability. * fix: Address PR #610 review issues - Replace USER_MESSAGE_ONLY test with BLOCKING_ERROR test in hook-constants.test.ts - Standardize Claude Code 2.1.0 note wording across all three documentation files - Exclude deprecated user-message-hook.ts from logger-usage-standards test Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix: Remove hardcoded fake token counts from context injection Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Address PR #610 review issues by fixing test files, standardizing documentation notes, and verifying code quality improvements. * fix: Add path validation to CLAUDE.md distribution to prevent invalid directory creation - Add isValidPathForClaudeMd() function to reject invalid paths: - Tilde paths (~) that Node.js doesn't expand - URLs (http://, https://) - Paths with spaces (likely command text or PR references) - Paths with # (GitHub issue/PR references) - Relative paths that escape project boundary - Integrate validation in updateFolderClaudeMdFiles loop - Add 6 unit tests for path validation - Update .gitignore to prevent accidental commit of malformed directories - Clean up existing invalid directories (~/, PR #610..., git diff..., https:) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * fix: Implement path validation in CLAUDE.md generation to prevent invalid directory creation - Added `isValidPathForClaudeMd()` function to validate file paths in `src/utils/claude-md-utils.ts`. - Integrated path validation in `updateFolderClaudeMdFiles` to skip invalid paths. - Added 6 new unit tests in `tests/utils/claude-md-utils.test.ts` to cover various rejection cases. - Updated `.gitignore` to prevent tracking of invalid directories. - Cleaned up existing invalid directories in the repository. * feat: Promote critical WARN logs to ERROR level across codebase Comprehensive log-level audit promoting 38+ WARN messages to ERROR for improved debugging and incident response: - Parser: observation type errors, data contamination - SDK/Agents: empty init responses (Gemini, OpenRouter) - Worker/Queue: session recovery, auto-recovery failures - Chroma: sync failures, search failures (now treated as critical) - SQLite: search failures (primary data store) - Session/Generator: failures, missing context - Infrastructure: shutdown, process management failures - File Operations: CLAUDE.md updates, config reads - Branch Management: recovery checkout failures Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * fix: Address PR #614 review issues - Remove incorrectly tracked tilde-prefixed files from git - Fix absolute path validation to check projectRoot boundaries - Add test coverage for absolute path validation edge cases Closes review issues: - Issue 1: ~/ prefixed files removed from tracking - Issue 3: Absolute paths now validated against projectRoot - Issue 4: Added 3 new test cases for absolute path scenarios Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * build assets and context --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
428 lines
14 KiB
Markdown
428 lines
14 KiB
Markdown
# Technical Report: Worker Daemon Child Process Leak
|
|
|
|
**Issue:** #603 - Bug: worker-service daemon leaks child claude processes
|
|
**Author:** raulk
|
|
**Created:** 2026-01-07
|
|
**Report Version:** 1.0
|
|
**Severity:** Critical
|
|
**Priority:** P0 - Immediate attention required
|
|
|
|
---
|
|
|
|
## 1. Executive Summary
|
|
|
|
The `worker-service.cjs --daemon` process spawns Claude subagent processes via the Claude Agent SDK that are not being properly terminated when their tasks complete. Over the course of normal usage (6+ hours), this results in the accumulation of orphaned child processes that consume significant system memory.
|
|
|
|
**Key Findings:**
|
|
- 121 orphaned `claude` processes accumulated over ~6 hours
|
|
- Total memory consumption: ~44GB RSS
|
|
- Average memory per process: ~372MB
|
|
- Root cause: Missing child process cleanup after SDK query completion
|
|
- The issue affects Linux systems and potentially all platforms
|
|
|
|
**Recommendation:** Implement explicit child process tracking and cleanup in the SDK agent lifecycle, and add process reaping on generator completion.
|
|
|
|
---
|
|
|
|
## 2. Problem Analysis
|
|
|
|
### 2.1 Observed Behavior
|
|
|
|
The reporter documented the following scenario:
|
|
|
|
**Parent daemon process (running 7+ hours):**
|
|
```
|
|
PID PPID RSS(KB) ELAPSED COMMAND
|
|
4118969 1 161656 07:28:16 bun ~/.claude/plugins/cache/thedotmack/claude-mem/9.0.0/scripts/worker-service.cjs --daemon
|
|
```
|
|
|
|
**Sample of leaked children (121 total, all parented to daemon):**
|
|
```
|
|
PID PPID RSS(KB) ELAPSED COMMAND
|
|
1927 4118969 377308 06:21:16 claude --output-format stream-json --verbose --input-format stream-json --model claude-sonnet-4-5 --disallowedTools Bash,Read,Write,Edit,Grep,Glob,WebFetch,WebSearch,Task,NotebookEdit,AskUserQuestion,TodoWrite --setting-sources --permission-mode default
|
|
2834 4118969 384716 06:20:44 claude --output-format stream-json [...]
|
|
3988 4118969 381844 06:20:15 claude --output-format stream-json --resume <session-id> [...]
|
|
5938 4118969 382816 06:19:37 claude --output-format stream-json --resume <session-id> [...]
|
|
11503 4118969 381276 06:16:12 claude --output-format stream-json --resume <session-id> [...]
|
|
```
|
|
|
|
### 2.2 Reproduction Steps
|
|
|
|
1. Use claude-mem normally throughout a work session
|
|
2. Run: `ps -o pid,ppid,rss,etime --no-headers | awk '$2 == '$(pgrep -f worker-service.cjs)`
|
|
3. Count grows over time without bound
|
|
|
|
### 2.3 Expected Behavior
|
|
|
|
Child claude processes should terminate when their task completes, or the daemon should reap them.
|
|
|
|
---
|
|
|
|
## 3. Technical Details
|
|
|
|
### 3.1 Architecture Overview
|
|
|
|
The claude-mem worker service uses a modular architecture:
|
|
|
|
```
|
|
WorkerService (worker-service.ts)
|
|
|
|
|
+-- SDKAgent (SDKAgent.ts)
|
|
| |
|
|
| +-- query() from @anthropic-ai/claude-agent-sdk
|
|
| |
|
|
| +-- Spawns `claude` CLI subprocess
|
|
|
|
|
+-- SessionManager (SessionManager.ts)
|
|
| |
|
|
| +-- Manages active sessions
|
|
| +-- Event-driven message queues
|
|
|
|
|
+-- ProcessManager (ProcessManager.ts)
|
|
|
|
|
+-- Child process enumeration
|
|
+-- Graceful shutdown cleanup
|
|
```
|
|
|
|
### 3.2 SDK Agent Child Process Spawning
|
|
|
|
The `SDKAgent.startSession()` method invokes the Claude Agent SDK's `query()` function:
|
|
|
|
```typescript
|
|
// src/services/worker/SDKAgent.ts (lines 100-114)
|
|
const queryResult = query({
|
|
prompt: messageGenerator,
|
|
options: {
|
|
model: modelId,
|
|
...(hasRealMemorySessionId && session.lastPromptNumber > 1 && { resume: session.memorySessionId }),
|
|
disallowedTools,
|
|
abortController: session.abortController,
|
|
pathToClaudeCodeExecutable: claudePath
|
|
}
|
|
});
|
|
```
|
|
|
|
The `query()` function internally spawns a `claude` CLI subprocess with the parameters visible in the leaked process list:
|
|
- `--output-format stream-json`
|
|
- `--verbose`
|
|
- `--input-format stream-json`
|
|
- `--model claude-sonnet-4-5`
|
|
- `--disallowedTools ...`
|
|
- `--setting-sources`
|
|
- `--permission-mode default`
|
|
|
|
### 3.3 Session Lifecycle
|
|
|
|
Sessions are managed through the following flow:
|
|
|
|
1. **Initialization:** `SessionRoutes.handleSessionInit()` creates a session and starts a generator
|
|
2. **Processing:** `SDKAgent.startSession()` runs the query loop, processing messages from the queue
|
|
3. **Completion:** Generator promise resolves, triggering cleanup in `finally` block
|
|
|
|
The relevant generator lifecycle code in `SessionRoutes.ts` (lines 137-216):
|
|
|
|
```typescript
|
|
session.generatorPromise = agent.startSession(session, this.workerService)
|
|
.catch(error => { /* error handling */ })
|
|
.finally(() => {
|
|
session.generatorPromise = null;
|
|
session.currentProvider = null;
|
|
this.workerService.broadcastProcessingStatus();
|
|
|
|
// Crash recovery logic...
|
|
if (!wasAborted) {
|
|
// Check for pending work and potentially restart
|
|
}
|
|
});
|
|
```
|
|
|
|
### 3.4 Graceful Shutdown Implementation
|
|
|
|
The existing shutdown mechanism in `GracefulShutdown.ts` (lines 49-90) does handle child processes, but **only during daemon shutdown**:
|
|
|
|
```typescript
|
|
export async function performGracefulShutdown(config: GracefulShutdownConfig): Promise<void> {
|
|
// STEP 1: Enumerate all child processes BEFORE we start closing things
|
|
const childPids = await getChildProcesses(process.pid);
|
|
|
|
// ... other cleanup steps ...
|
|
|
|
// STEP 6: Force kill any remaining child processes (Windows zombie port fix)
|
|
if (childPids.length > 0) {
|
|
for (const pid of childPids) {
|
|
await forceKillProcess(pid);
|
|
}
|
|
await waitForProcessesExit(childPids, 5000);
|
|
}
|
|
}
|
|
```
|
|
|
|
**Critical Gap:** This cleanup only runs when the daemon itself shuts down, not when individual SDK sessions complete.
|
|
|
|
---
|
|
|
|
## 4. Impact Assessment
|
|
|
|
### 4.1 Resource Consumption
|
|
|
|
| Metric | Value |
|
|
|--------|-------|
|
|
| Leaked processes | 121 |
|
|
| Total RSS | ~44GB |
|
|
| Average per process | ~372MB |
|
|
| Accumulation rate | ~20 processes/hour |
|
|
| Time to exhaustion (64GB system) | ~3 hours |
|
|
|
|
### 4.2 System Effects
|
|
|
|
1. **Memory Exhaustion:** Systems with limited RAM will experience OOM conditions
|
|
2. **Performance Degradation:** Swap thrashing as memory fills
|
|
3. **Process Table Pollution:** Maximum PID limits may be approached
|
|
4. **User Experience:** System becomes unresponsive during extended sessions
|
|
|
|
### 4.3 Affected Platforms
|
|
|
|
- **Linux (confirmed):** Ubuntu reported by issue author
|
|
- **macOS (likely):** Same process spawning mechanism
|
|
- **Windows (potentially different):** Uses different child process tracking
|
|
|
|
---
|
|
|
|
## 5. Root Cause Analysis
|
|
|
|
### 5.1 Primary Root Cause
|
|
|
|
**The SDK's `query()` function spawns a child `claude` process that is not being explicitly terminated when the async iterator completes.**
|
|
|
|
The `SDKAgent.startSession()` method:
|
|
1. Creates an async generator via `query()`
|
|
2. Iterates over messages via `for await (const message of queryResult)`
|
|
3. When iteration completes (naturally or via abort), the generator resolves
|
|
4. **No explicit cleanup of the underlying child process occurs**
|
|
|
|
### 5.2 Contributing Factors
|
|
|
|
1. **No Child Process Tracking:** The codebase does not maintain a registry of spawned child processes during normal operation - only during shutdown enumeration.
|
|
|
|
2. **AbortController Not Triggering Process Kill:** While sessions have an `abortController`, signaling abort to the SDK iterator does not guarantee the underlying `claude` process terminates.
|
|
|
|
3. **Generator Finally Block Missing Process Cleanup:** The `finally` block in `SessionRoutes.startGeneratorWithProvider()` handles state cleanup but does not explicitly kill child processes.
|
|
|
|
4. **SDK Abstraction Hiding Process Details:** The `@anthropic-ai/claude-agent-sdk` abstracts the subprocess management, making it difficult to access and terminate the child process directly.
|
|
|
|
### 5.3 Code Path Analysis
|
|
|
|
```
|
|
User Session Complete
|
|
|
|
|
v
|
|
SDKAgent.startSession() completes for-await loop
|
|
|
|
|
v
|
|
Generator promise resolves
|
|
|
|
|
v
|
|
SessionRoutes finally block executes
|
|
|
|
|
+-- session.generatorPromise = null
|
|
+-- session.currentProvider = null
|
|
+-- broadcastProcessingStatus()
|
|
+-- Check pending work
|
|
|
|
|
v
|
|
[MISSING] Child process termination
|
|
|
|
|
v
|
|
Claude subprocess continues running (LEAKED)
|
|
```
|
|
|
|
---
|
|
|
|
## 6. Recommended Solutions
|
|
|
|
### 6.1 Solution A: SDK-Level Child Process Tracking (Preferred)
|
|
|
|
Add explicit child process tracking to the SDKAgent class:
|
|
|
|
```typescript
|
|
// src/services/worker/SDKAgent.ts
|
|
|
|
export class SDKAgent {
|
|
private activeChildProcesses: Map<number, { pid: number, sessionDbId: number }> = new Map();
|
|
|
|
async startSession(session: ActiveSession, worker?: WorkerRef): Promise<void> {
|
|
// Before query(), track that we're about to spawn
|
|
const queryResult = query({...});
|
|
|
|
// After first message, capture the PID if available
|
|
// Note: May require SDK modification to expose PID
|
|
|
|
try {
|
|
for await (const message of queryResult) {
|
|
// ... existing message handling
|
|
}
|
|
} finally {
|
|
// Cleanup: Kill any child process for this session
|
|
this.cleanupSessionProcess(session.sessionDbId);
|
|
}
|
|
}
|
|
|
|
private cleanupSessionProcess(sessionDbId: number): void {
|
|
// Find and terminate process for this session
|
|
// Requires either SDK enhancement or platform-specific process enumeration
|
|
}
|
|
}
|
|
```
|
|
|
|
**Challenges:** The SDK does not currently expose the child process PID.
|
|
|
|
### 6.2 Solution B: Session-Level Process Enumeration and Cleanup
|
|
|
|
Add process cleanup to the session completion flow:
|
|
|
|
```typescript
|
|
// src/services/worker/http/routes/SessionRoutes.ts
|
|
|
|
private startGeneratorWithProvider(session, provider, source): void {
|
|
const parentPid = process.pid;
|
|
const preExistingPids = new Set(await getChildProcessesForSession(parentPid, 'claude'));
|
|
|
|
session.generatorPromise = agent.startSession(session, this.workerService)
|
|
.finally(async () => {
|
|
// Find new child processes that appeared during this session
|
|
const currentPids = await getChildProcessesForSession(parentPid, 'claude');
|
|
const newPids = currentPids.filter(pid => !preExistingPids.has(pid));
|
|
|
|
// Terminate orphaned processes
|
|
for (const pid of newPids) {
|
|
await forceKillProcess(pid);
|
|
}
|
|
|
|
// ... existing cleanup
|
|
});
|
|
}
|
|
```
|
|
|
|
### 6.3 Solution C: Periodic Orphan Reaper (Mitigation)
|
|
|
|
Add a background task that periodically identifies and terminates leaked processes:
|
|
|
|
```typescript
|
|
// src/services/worker/OrphanReaper.ts
|
|
|
|
export class OrphanReaper {
|
|
private interval: NodeJS.Timer | null = null;
|
|
|
|
start(intervalMs: number = 60000): void {
|
|
this.interval = setInterval(async () => {
|
|
const orphans = await this.findOrphanedClaudeProcesses();
|
|
for (const pid of orphans) {
|
|
await forceKillProcess(pid);
|
|
}
|
|
}, intervalMs);
|
|
}
|
|
|
|
private async findOrphanedClaudeProcesses(): Promise<number[]> {
|
|
// Find claude processes parented to the worker daemon
|
|
// that have been running longer than expected (e.g., > 30 minutes)
|
|
}
|
|
}
|
|
```
|
|
|
|
**Pros:** Works without SDK modifications
|
|
**Cons:** Reactive rather than proactive; processes leak for up to interval duration
|
|
|
|
### 6.4 Solution D: Request SDK Enhancement
|
|
|
|
File an issue with the Claude Agent SDK requesting:
|
|
1. Exposure of child process PID in query result
|
|
2. Built-in cleanup on iterator completion
|
|
3. Explicit `close()` or `terminate()` method
|
|
|
|
### 6.5 Recommended Implementation Order
|
|
|
|
1. **Immediate (P0):** Implement Solution C (Orphan Reaper) as a mitigation
|
|
2. **Short-term (P1):** Implement Solution B (Session-Level Cleanup)
|
|
3. **Medium-term (P2):** Pursue Solution D (SDK Enhancement) with Anthropic
|
|
4. **Long-term (P3):** Implement Solution A once SDK provides PID access
|
|
|
|
---
|
|
|
|
## 7. Priority/Severity Assessment
|
|
|
|
### 7.1 Severity: Critical
|
|
|
|
- **Data Loss:** No
|
|
- **System Instability:** Yes - memory exhaustion
|
|
- **User Impact:** High - system becomes unusable
|
|
- **Scope:** All users with extended sessions
|
|
|
|
### 7.2 Priority: P0 - Immediate
|
|
|
|
- **Frequency:** Every session creates leaked processes
|
|
- **Accumulation:** Unbounded growth
|
|
- **Workaround:** Manual daemon restart (disruptive)
|
|
- **Business Impact:** Renders product unusable for long sessions
|
|
|
|
### 7.3 Effort Estimate
|
|
|
|
| Solution | Effort | Risk |
|
|
|----------|--------|------|
|
|
| Orphan Reaper (C) | 2-4 hours | Low |
|
|
| Session Cleanup (B) | 4-8 hours | Medium |
|
|
| SDK Enhancement (D) | External dependency | - |
|
|
| Full Tracking (A) | 8-16 hours | Medium |
|
|
|
|
---
|
|
|
|
## 8. References
|
|
|
|
- **Issue:** https://github.com/thedotmack/claude-mem/issues/603
|
|
- **Source Files:**
|
|
- `/src/services/worker/SDKAgent.ts` - SDK query invocation
|
|
- `/src/services/worker/SessionManager.ts` - Session lifecycle
|
|
- `/src/services/worker/http/routes/SessionRoutes.ts` - Generator management
|
|
- `/src/services/infrastructure/ProcessManager.ts` - Process utilities
|
|
- `/src/services/infrastructure/GracefulShutdown.ts` - Shutdown cleanup
|
|
- **Related Code:**
|
|
- `@anthropic-ai/claude-agent-sdk` - External SDK spawning processes
|
|
|
|
---
|
|
|
|
## 9. Appendix: Process Enumeration Reference
|
|
|
|
### Current getChildProcesses Implementation
|
|
|
|
```typescript
|
|
// src/services/infrastructure/ProcessManager.ts
|
|
export async function getChildProcesses(parentPid: number): Promise<number[]> {
|
|
if (process.platform !== 'win32') {
|
|
return []; // NOTE: Only implemented for Windows!
|
|
}
|
|
|
|
// Windows implementation using wmic
|
|
const cmd = `wmic process where "parentprocessid=${parentPid}" get processid /format:list`;
|
|
// ...
|
|
}
|
|
```
|
|
|
|
**Critical Finding:** The `getChildProcesses` function is currently **Windows-only** and returns an empty array on Linux/macOS. This means the Linux user reporting the issue has no built-in cleanup mechanism.
|
|
|
|
### Required Fix for Linux/macOS
|
|
|
|
```typescript
|
|
export async function getChildProcesses(parentPid: number): Promise<number[]> {
|
|
if (process.platform === 'win32') {
|
|
// Existing Windows implementation
|
|
} else {
|
|
// Unix implementation
|
|
const { stdout } = await execAsync(`pgrep -P ${parentPid}`);
|
|
return stdout.trim().split('\n').map(Number).filter(n => !isNaN(n));
|
|
}
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
*Report prepared by Claude Code analysis of codebase and issue #603*
|