282345f379
This is a backup of all work done by the 3 Phase 1 agents: Agent A - Command Injection Fix (Issue #354): - Fixed command injection in BranchManager.ts - Fixed unnecessary shell usage in bun-path.ts - Added comprehensive security test suite - Created SECURITY.md and SECURITY_AUDIT_REPORT.md Agent B - Observation Persistence Fix (Issue #353): - Added PendingMessageStore from PR #335 - Integrated persistent queue into SessionManager - Modified SDKAgent to mark messages complete - Updated SessionStore with pending_messages migration - Updated worker-types.ts with new interfaces Agent C - Batch Endpoint Verification (Issue #348): - Created batch-observations.test.ts - Updated worker-service.mdx documentation Also includes: - Documentation context files (biomimetic, windows struggles) - Build artifacts from agent testing This work will be re-evaluated after v7.3.0 release.
333 lines
12 KiB
Markdown
333 lines
12 KiB
Markdown
# Windows, Bun, and Worker Service Struggles
|
|
|
|
A comprehensive chronicle of platform-specific issues, attempted fixes, and architectural decisions.
|
|
|
|
## Executive Summary
|
|
|
|
The claude-mem project has faced persistent Windows-specific issues centered around three core problems:
|
|
|
|
1. **Console Window Popups**: Blank terminal windows appearing when spawning worker and SDK subprocess
|
|
2. **Zombie Socket Issues**: Bun leaving TCP sockets in LISTEN state after termination on Windows
|
|
3. **Process Management Complexity**: Platform-specific spawning logic and reliability issues
|
|
|
|
These issues have driven multiple PRs, architectural pivots, and significant debate about runtime switching (Bun → Node.js).
|
|
|
|
---
|
|
|
|
## Timeline of Issues
|
|
|
|
### Issue #209: Windows Worker Startup Failures (Dec 12-13, 2025)
|
|
|
|
**Problem**: Worker service failed to start on Windows using PowerShell Start-Process approach.
|
|
|
|
**Symptoms**:
|
|
- Worker startup attempted via `powershell.exe -NoProfile -NonInteractive -Command Start-Process`
|
|
- Health check retries exhausted (15 attempts over 15 seconds)
|
|
- Users left unable to start worker manually
|
|
|
|
**Root Causes**:
|
|
- Platform-conditional process spawning (PowerShell for Windows, PM2 for Unix)
|
|
- PowerShell spawning without `-PassThru` to capture PID
|
|
- Inconsistent process management across platforms
|
|
|
|
**Resolution**: Issue was marked as closed, suggesting it was resolved in v7.1.0 through architectural unification with Bun-based ProcessManager using PID file tracking consistently across all platforms.
|
|
|
|
**Status**: ✅ Resolved (pre-PR #335)
|
|
|
|
---
|
|
|
|
### Issue #309 & PR #315: Console Window Popups (Dec 14-15, 2025)
|
|
|
|
**Problem**: Blank terminal windows appear when spawning worker processes and SDK subprocesses on Windows.
|
|
|
|
**First Attempted Fix (PR #315)**: Add `windowsHide: true` to spawn options
|
|
|
|
**Why It Failed**: Node.js bug #21825 - `windowsHide: true` is **ignored** when `detached: true` is also set. Both flags are required:
|
|
- `detached: true` - Needed for background process
|
|
- `windowsHide: true` - Needed to hide window (but doesn't work when detached)
|
|
|
|
**Testing Results** (by ToxMox):
|
|
- Tested PR #315 on Windows 11
|
|
- Confirmed blank terminal windows still appear for both worker and SDK subprocess spawns
|
|
- Affects both `ProcessManager.ts` (worker) and `SDKAgent.ts` (SDK subprocess)
|
|
|
|
**Working Solution**: Use PowerShell's `Start-Process` with `-WindowStyle Hidden` flag instead of standard spawn.
|
|
|
|
**Status**: ❌ PR #315 closed in favor of more comprehensive solution
|
|
|
|
---
|
|
|
|
### Bun Zombie Socket Issue (Dec 15, 2025)
|
|
|
|
**Problem**: Bun leaves TCP sockets in zombie LISTEN state on Windows after worker termination.
|
|
|
|
**Symptoms**:
|
|
- Port remains bound even though no process owns it
|
|
- `OwningProcess` shows 0 or dead PID
|
|
- New worker instances cannot start due to `EADDRINUSE` errors
|
|
- Happens regardless of termination method (process.exit(), external kill, Ctrl+C)
|
|
- **Only system reboot clears zombie ports**
|
|
|
|
**Upstream Tracking**:
|
|
- Bun issue #12127
|
|
- Bun issue #5774
|
|
- Bun issue #8786
|
|
|
|
**Impact**: Windows users may need to reboot their systems when worker crashes or is restarted.
|
|
|
|
**Proposed Solution**: Switch worker runtime from Bun to Node.js on Windows (or globally).
|
|
|
|
**Status**: 🟡 Unresolved - Platform-specific bug in Bun's Windows socket cleanup
|
|
|
|
---
|
|
|
|
### SDK Subprocess Hang Issue (Dec 15, 2025)
|
|
|
|
**Problem**: SDK subprocesses can hang indefinitely, blocking observation processing.
|
|
|
|
**Root Cause**: `AbortController.abort()` does not actually terminate child processes.
|
|
|
|
**Symptoms**:
|
|
- For-await loop blocks forever waiting for output from hung subprocess
|
|
- Observation processing halts
|
|
- No recovery mechanism
|
|
|
|
**Solution**: Implement watchdog timer that explicitly kills child processes using platform-specific commands:
|
|
- **Windows**: `wmic process where ParentProcessId=<pid> delete`
|
|
- **Unix**: `pkill -P <pid>`
|
|
|
|
**Timeout**: `SDK_QUERY_TIMEOUT_MS` set to 2 minutes
|
|
|
|
**Status**: ✅ Fixed in PR #335 (watchdog implementation)
|
|
|
|
---
|
|
|
|
## PR #335: Comprehensive Windows Fix (Dec 15, 2025)
|
|
|
|
### What It Attempted
|
|
|
|
ToxMox developed a comprehensive PR addressing all Windows issues simultaneously:
|
|
|
|
1. **PowerShell-based spawning** to fix popup windows
|
|
2. **Runtime switch** from Bun to Node.js (globally) to fix zombie sockets
|
|
3. **Queue monitoring system** with persistent message queue
|
|
4. **Watchdog service** for stuck message recovery
|
|
5. **SQLite compatibility layer** for Node.js support
|
|
|
|
### Architecture Decisions
|
|
|
|
**ProcessManager Changes**:
|
|
- Switched from `startWithBun()` to `startWithNode()`
|
|
- Windows: Uses PowerShell `Start-Process -WindowStyle Hidden -PassThru`
|
|
- Unix: Uses standard `spawn()` with `detached: true`
|
|
- Captures PID via PowerShell `Select-Object -ExpandProperty Id`
|
|
- Comment states: "Use Node on all platforms (Bun has zombie socket issues on Windows)"
|
|
|
|
**SQLite Compatibility Layer**:
|
|
- Created `sqlite-compat.ts` adapter pattern
|
|
- Provides `bun:sqlite` API compatibility via `better-sqlite3`
|
|
- Allows code to work with both Bun and Node.js runtimes
|
|
|
|
### Critical Issues Identified
|
|
|
|
#### 1. **Global vs Platform-Conditional Runtime**
|
|
|
|
**The Inconsistency**: Code comment explicitly states zombie sockets occur "on Windows", yet solution applies Node.js universally across all platforms.
|
|
|
|
**Questions Raised**:
|
|
- Why sacrifice Bun's performance on macOS/Linux where no issues documented?
|
|
- Platform-specific spawning already implemented - why not platform-specific runtime?
|
|
- No documented Bun reliability issues on non-Windows platforms
|
|
|
|
#### 2. **Performance Regressions**
|
|
|
|
**better-sqlite3 Blocking**:
|
|
- Synchronous-only API blocks Node.js event loop during all DB operations
|
|
- Contrasts with Bun's async SQLite support
|
|
- Affects: enqueue, markProcessing, markProcessed, watchdog checks
|
|
|
|
**Watchdog Polling Overhead**:
|
|
- Full table scans every 30 seconds even when idle
|
|
- Constant database I/O overhead
|
|
- No max queue size limits = unbounded growth
|
|
|
|
**Startup Latency**:
|
|
- Node.js initialization (slower than Bun)
|
|
- Native module loading (better-sqlite3)
|
|
- Database migrations
|
|
- Stuck message scan
|
|
- Watchdog initialization
|
|
- HTTP server startup
|
|
|
|
#### 3. **Build Dependencies**
|
|
|
|
**better-sqlite3 Requirements**:
|
|
- node-gyp
|
|
- Python
|
|
- C++ compiler toolchains
|
|
- Visual Studio Build Tools (Windows)
|
|
|
|
**Impact**:
|
|
- Local development machines without build tools fail
|
|
- CI/CD pipelines need updated Docker images
|
|
- Restricted environments where compilers not permitted
|
|
- ARM/M1 Mac compatibility issues
|
|
|
|
#### 4. **Migration Risks**
|
|
|
|
**Breaking Changes**:
|
|
- Automatic database migration adds `pending_messages` table
|
|
- Runtime switch not documented in PR
|
|
- Node.js becomes undocumented hard requirement
|
|
- No migration guide or rollback procedure
|
|
|
|
**Unanswered Questions**:
|
|
- What happens to in-flight messages during upgrade?
|
|
- Can users safely downgrade?
|
|
- Is migration idempotent?
|
|
|
|
#### 5. **Code Quality Issues**
|
|
|
|
**Command Injection Risk** (ProcessManager.ts:67):
|
|
- PowerShell commands use template literal concatenation
|
|
- Vulnerable if `MARKETPLACE_ROOT` or script paths attacker-controlled
|
|
- Should use array-based argument passing
|
|
|
|
**Missing Error Handling** (WatchdogService.ts:61):
|
|
- `setInterval` callback lacks error handling
|
|
- Timer continues running if `check()` throws
|
|
- Creates zombie watchdog scenario
|
|
|
|
**No Queue Size Limits**:
|
|
- Unbounded database growth if messages accumulate
|
|
- Failed messages (exceeding `maxRetries`) accumulate indefinitely
|
|
- Only 24-hour retention for processed messages
|
|
|
|
---
|
|
|
|
## Assessment and Recommendations
|
|
|
|
### What Was Validated
|
|
|
|
**Legitimate Windows Issues**:
|
|
- ✅ Console window popups are real (Node.js bug #21825)
|
|
- ✅ PowerShell `Start-Process` solution works
|
|
- ✅ Bun zombie socket issue is real and Windows-specific
|
|
- ✅ SDK subprocess hang issue is real
|
|
|
|
### What Remains Questionable
|
|
|
|
**Global Runtime Switch**:
|
|
- ❌ No evidence Bun problematic on macOS/Linux
|
|
- ❌ Platform-conditional runtime not considered
|
|
- ❌ Performance trade-offs not documented
|
|
- ❌ "Windows-only" issue applied globally
|
|
|
|
**Zombie Socket Root Cause**:
|
|
- 🟡 May be fixable with proper cleanup handlers:
|
|
- Missing `server.close()` calls before exit
|
|
- Processes killed with `SIGKILL` before cleanup finishes
|
|
- Missing `SIGTERM` signal handlers for graceful shutdown
|
|
- 🟡 Runtime switch may be unnecessary over-engineering
|
|
|
|
### Salvageable Components
|
|
|
|
**If Extracted into Separate PRs**:
|
|
|
|
1. **PowerShell Spawning for Windows Worker**
|
|
- Focused PR: "Windows: Use Node.js instead of Bun for worker process"
|
|
- Platform-conditional logic (Node.js on Windows, Bun elsewhere)
|
|
- Independent justification required
|
|
|
|
2. **SQLite Compatibility Layer**
|
|
- Well-designed adapter pattern
|
|
- Requires independent justification for Node.js runtime need
|
|
- Should not be bundled with other changes
|
|
|
|
3. **Queue Monitoring UI Concept**
|
|
- Valuable visibility into worker state
|
|
- Should build on in-memory state first
|
|
- Remove database persistence requirement initially
|
|
|
|
4. **Watchdog Improvements**
|
|
- SDK subprocess timeout handling
|
|
- Evidence of superiority over current approach needed
|
|
|
|
---
|
|
|
|
## Current Status
|
|
|
|
### Resolved
|
|
- ✅ Issue #209: Windows worker startup (v7.1.0)
|
|
- ✅ SDK subprocess hang issue (watchdog implementation)
|
|
|
|
### In Progress
|
|
- 🔄 PR #339: Windows console popup fix (extracted from PR #335)
|
|
- 🔄 PR #338: Queue monitoring system (extracted from PR #335)
|
|
|
|
### Open Questions
|
|
- ❓ Should runtime switch be global or Windows-only?
|
|
- ❓ Can zombie socket issue be fixed without runtime switch?
|
|
- ❓ Is better-sqlite3's synchronous blocking acceptable?
|
|
- ❓ Should queue persistence be in-memory first?
|
|
|
|
---
|
|
|
|
## Lessons Learned
|
|
|
|
### Architectural Principles Violated
|
|
|
|
**YAGNI**: Queue persistence, watchdog service, and comprehensive monitoring added without proven need.
|
|
|
|
**Happy Path**: Should have started with simplest Windows fix (PowerShell spawning), validated, then added complexity if needed.
|
|
|
|
**Incremental Validation**: Bundling multiple architectural changes prevents isolating what actually solves the problem.
|
|
|
|
### What Should Have Happened
|
|
|
|
1. **Phase 1**: PowerShell spawning fix for Windows console popups (targeted, testable)
|
|
2. **Phase 2**: Investigate zombie socket root cause (cleanup handlers vs runtime switch)
|
|
3. **Phase 3**: If runtime switch justified, implement as Windows-conditional first
|
|
4. **Phase 4**: Add queue monitoring as optional feature with in-memory state
|
|
5. **Phase 5**: Add persistence only if in-memory insufficient
|
|
|
|
### Key Takeaways
|
|
|
|
- **Windows-specific issues don't justify global architectural changes** without clear evidence
|
|
- **Platform-conditional logic is acceptable** when solving platform-specific problems
|
|
- **Native module dependencies are heavy** - avoid unless necessary
|
|
- **Performance regressions need explicit justification** - synchronous blocking, startup latency, polling overhead all impact UX
|
|
- **Bundle size matters** - build tools, compilers, Python are significant requirements
|
|
|
|
---
|
|
|
|
## References
|
|
|
|
**GitHub Issues**:
|
|
- #209: Windows worker startup failures
|
|
- #309: Console window popups
|
|
- #315: windowsHide approach (closed)
|
|
|
|
**PRs**:
|
|
- #335: Comprehensive Windows fix (under review)
|
|
- #338: Queue monitoring system (extracted)
|
|
- #339: Windows console popup fix (extracted)
|
|
|
|
**Upstream Bugs**:
|
|
- Node.js #21825: windowsHide ignored with detached
|
|
- Bun #12127, #5774, #8786: Windows zombie sockets
|
|
|
|
**Related Observations**:
|
|
- #27302: PR #315 windowsHide failure analysis
|
|
- #27233: Bun zombie socket discovery
|
|
- #27232: Windows background window root cause
|
|
- #27286: Runtime switch assessment
|
|
- #27283: PowerShell process spawn fix
|
|
- #27190: ProcessManager Node.js implementation
|
|
- #24532: Issue #209 resolution
|
|
|
|
---
|
|
|
|
**Last Updated**: 2025-12-16
|
|
**Document Status**: Comprehensive review based on memory search through #S3485
|