Files
claude-mem/docs/context/windows-bun-worker-struggles.md
T
Alex Newman 282345f379 backup: Phase 1 agent work (security, persistence, batch endpoint)
This is a backup of all work done by the 3 Phase 1 agents:

Agent A - Command Injection Fix (Issue #354):
- Fixed command injection in BranchManager.ts
- Fixed unnecessary shell usage in bun-path.ts
- Added comprehensive security test suite
- Created SECURITY.md and SECURITY_AUDIT_REPORT.md

Agent B - Observation Persistence Fix (Issue #353):
- Added PendingMessageStore from PR #335
- Integrated persistent queue into SessionManager
- Modified SDKAgent to mark messages complete
- Updated SessionStore with pending_messages migration
- Updated worker-types.ts with new interfaces

Agent C - Batch Endpoint Verification (Issue #348):
- Created batch-observations.test.ts
- Updated worker-service.mdx documentation

Also includes:
- Documentation context files (biomimetic, windows struggles)
- Build artifacts from agent testing

This work will be re-evaluated after v7.3.0 release.
2025-12-16 15:44:06 -05:00

333 lines
12 KiB
Markdown

# Windows, Bun, and Worker Service Struggles
A comprehensive chronicle of platform-specific issues, attempted fixes, and architectural decisions.
## Executive Summary
The claude-mem project has faced persistent Windows-specific issues centered around three core problems:
1. **Console Window Popups**: Blank terminal windows appearing when spawning worker and SDK subprocess
2. **Zombie Socket Issues**: Bun leaving TCP sockets in LISTEN state after termination on Windows
3. **Process Management Complexity**: Platform-specific spawning logic and reliability issues
These issues have driven multiple PRs, architectural pivots, and significant debate about runtime switching (Bun → Node.js).
---
## Timeline of Issues
### Issue #209: Windows Worker Startup Failures (Dec 12-13, 2025)
**Problem**: Worker service failed to start on Windows using PowerShell Start-Process approach.
**Symptoms**:
- Worker startup attempted via `powershell.exe -NoProfile -NonInteractive -Command Start-Process`
- Health check retries exhausted (15 attempts over 15 seconds)
- Users left unable to start worker manually
**Root Causes**:
- Platform-conditional process spawning (PowerShell for Windows, PM2 for Unix)
- PowerShell spawning without `-PassThru` to capture PID
- Inconsistent process management across platforms
**Resolution**: Issue was marked as closed, suggesting it was resolved in v7.1.0 through architectural unification with Bun-based ProcessManager using PID file tracking consistently across all platforms.
**Status**: ✅ Resolved (pre-PR #335)
---
### Issue #309 & PR #315: Console Window Popups (Dec 14-15, 2025)
**Problem**: Blank terminal windows appear when spawning worker processes and SDK subprocesses on Windows.
**First Attempted Fix (PR #315)**: Add `windowsHide: true` to spawn options
**Why It Failed**: Node.js bug #21825 - `windowsHide: true` is **ignored** when `detached: true` is also set. Both flags are required:
- `detached: true` - Needed for background process
- `windowsHide: true` - Needed to hide window (but doesn't work when detached)
**Testing Results** (by ToxMox):
- Tested PR #315 on Windows 11
- Confirmed blank terminal windows still appear for both worker and SDK subprocess spawns
- Affects both `ProcessManager.ts` (worker) and `SDKAgent.ts` (SDK subprocess)
**Working Solution**: Use PowerShell's `Start-Process` with `-WindowStyle Hidden` flag instead of standard spawn.
**Status**: ❌ PR #315 closed in favor of more comprehensive solution
---
### Bun Zombie Socket Issue (Dec 15, 2025)
**Problem**: Bun leaves TCP sockets in zombie LISTEN state on Windows after worker termination.
**Symptoms**:
- Port remains bound even though no process owns it
- `OwningProcess` shows 0 or dead PID
- New worker instances cannot start due to `EADDRINUSE` errors
- Happens regardless of termination method (process.exit(), external kill, Ctrl+C)
- **Only system reboot clears zombie ports**
**Upstream Tracking**:
- Bun issue #12127
- Bun issue #5774
- Bun issue #8786
**Impact**: Windows users may need to reboot their systems when worker crashes or is restarted.
**Proposed Solution**: Switch worker runtime from Bun to Node.js on Windows (or globally).
**Status**: 🟡 Unresolved - Platform-specific bug in Bun's Windows socket cleanup
---
### SDK Subprocess Hang Issue (Dec 15, 2025)
**Problem**: SDK subprocesses can hang indefinitely, blocking observation processing.
**Root Cause**: `AbortController.abort()` does not actually terminate child processes.
**Symptoms**:
- For-await loop blocks forever waiting for output from hung subprocess
- Observation processing halts
- No recovery mechanism
**Solution**: Implement watchdog timer that explicitly kills child processes using platform-specific commands:
- **Windows**: `wmic process where ParentProcessId=<pid> delete`
- **Unix**: `pkill -P <pid>`
**Timeout**: `SDK_QUERY_TIMEOUT_MS` set to 2 minutes
**Status**: ✅ Fixed in PR #335 (watchdog implementation)
---
## PR #335: Comprehensive Windows Fix (Dec 15, 2025)
### What It Attempted
ToxMox developed a comprehensive PR addressing all Windows issues simultaneously:
1. **PowerShell-based spawning** to fix popup windows
2. **Runtime switch** from Bun to Node.js (globally) to fix zombie sockets
3. **Queue monitoring system** with persistent message queue
4. **Watchdog service** for stuck message recovery
5. **SQLite compatibility layer** for Node.js support
### Architecture Decisions
**ProcessManager Changes**:
- Switched from `startWithBun()` to `startWithNode()`
- Windows: Uses PowerShell `Start-Process -WindowStyle Hidden -PassThru`
- Unix: Uses standard `spawn()` with `detached: true`
- Captures PID via PowerShell `Select-Object -ExpandProperty Id`
- Comment states: "Use Node on all platforms (Bun has zombie socket issues on Windows)"
**SQLite Compatibility Layer**:
- Created `sqlite-compat.ts` adapter pattern
- Provides `bun:sqlite` API compatibility via `better-sqlite3`
- Allows code to work with both Bun and Node.js runtimes
### Critical Issues Identified
#### 1. **Global vs Platform-Conditional Runtime**
**The Inconsistency**: Code comment explicitly states zombie sockets occur "on Windows", yet solution applies Node.js universally across all platforms.
**Questions Raised**:
- Why sacrifice Bun's performance on macOS/Linux where no issues documented?
- Platform-specific spawning already implemented - why not platform-specific runtime?
- No documented Bun reliability issues on non-Windows platforms
#### 2. **Performance Regressions**
**better-sqlite3 Blocking**:
- Synchronous-only API blocks Node.js event loop during all DB operations
- Contrasts with Bun's async SQLite support
- Affects: enqueue, markProcessing, markProcessed, watchdog checks
**Watchdog Polling Overhead**:
- Full table scans every 30 seconds even when idle
- Constant database I/O overhead
- No max queue size limits = unbounded growth
**Startup Latency**:
- Node.js initialization (slower than Bun)
- Native module loading (better-sqlite3)
- Database migrations
- Stuck message scan
- Watchdog initialization
- HTTP server startup
#### 3. **Build Dependencies**
**better-sqlite3 Requirements**:
- node-gyp
- Python
- C++ compiler toolchains
- Visual Studio Build Tools (Windows)
**Impact**:
- Local development machines without build tools fail
- CI/CD pipelines need updated Docker images
- Restricted environments where compilers not permitted
- ARM/M1 Mac compatibility issues
#### 4. **Migration Risks**
**Breaking Changes**:
- Automatic database migration adds `pending_messages` table
- Runtime switch not documented in PR
- Node.js becomes undocumented hard requirement
- No migration guide or rollback procedure
**Unanswered Questions**:
- What happens to in-flight messages during upgrade?
- Can users safely downgrade?
- Is migration idempotent?
#### 5. **Code Quality Issues**
**Command Injection Risk** (ProcessManager.ts:67):
- PowerShell commands use template literal concatenation
- Vulnerable if `MARKETPLACE_ROOT` or script paths attacker-controlled
- Should use array-based argument passing
**Missing Error Handling** (WatchdogService.ts:61):
- `setInterval` callback lacks error handling
- Timer continues running if `check()` throws
- Creates zombie watchdog scenario
**No Queue Size Limits**:
- Unbounded database growth if messages accumulate
- Failed messages (exceeding `maxRetries`) accumulate indefinitely
- Only 24-hour retention for processed messages
---
## Assessment and Recommendations
### What Was Validated
**Legitimate Windows Issues**:
- ✅ Console window popups are real (Node.js bug #21825)
- ✅ PowerShell `Start-Process` solution works
- ✅ Bun zombie socket issue is real and Windows-specific
- ✅ SDK subprocess hang issue is real
### What Remains Questionable
**Global Runtime Switch**:
- ❌ No evidence Bun problematic on macOS/Linux
- ❌ Platform-conditional runtime not considered
- ❌ Performance trade-offs not documented
- ❌ "Windows-only" issue applied globally
**Zombie Socket Root Cause**:
- 🟡 May be fixable with proper cleanup handlers:
- Missing `server.close()` calls before exit
- Processes killed with `SIGKILL` before cleanup finishes
- Missing `SIGTERM` signal handlers for graceful shutdown
- 🟡 Runtime switch may be unnecessary over-engineering
### Salvageable Components
**If Extracted into Separate PRs**:
1. **PowerShell Spawning for Windows Worker**
- Focused PR: "Windows: Use Node.js instead of Bun for worker process"
- Platform-conditional logic (Node.js on Windows, Bun elsewhere)
- Independent justification required
2. **SQLite Compatibility Layer**
- Well-designed adapter pattern
- Requires independent justification for Node.js runtime need
- Should not be bundled with other changes
3. **Queue Monitoring UI Concept**
- Valuable visibility into worker state
- Should build on in-memory state first
- Remove database persistence requirement initially
4. **Watchdog Improvements**
- SDK subprocess timeout handling
- Evidence of superiority over current approach needed
---
## Current Status
### Resolved
- ✅ Issue #209: Windows worker startup (v7.1.0)
- ✅ SDK subprocess hang issue (watchdog implementation)
### In Progress
- 🔄 PR #339: Windows console popup fix (extracted from PR #335)
- 🔄 PR #338: Queue monitoring system (extracted from PR #335)
### Open Questions
- ❓ Should runtime switch be global or Windows-only?
- ❓ Can zombie socket issue be fixed without runtime switch?
- ❓ Is better-sqlite3's synchronous blocking acceptable?
- ❓ Should queue persistence be in-memory first?
---
## Lessons Learned
### Architectural Principles Violated
**YAGNI**: Queue persistence, watchdog service, and comprehensive monitoring added without proven need.
**Happy Path**: Should have started with simplest Windows fix (PowerShell spawning), validated, then added complexity if needed.
**Incremental Validation**: Bundling multiple architectural changes prevents isolating what actually solves the problem.
### What Should Have Happened
1. **Phase 1**: PowerShell spawning fix for Windows console popups (targeted, testable)
2. **Phase 2**: Investigate zombie socket root cause (cleanup handlers vs runtime switch)
3. **Phase 3**: If runtime switch justified, implement as Windows-conditional first
4. **Phase 4**: Add queue monitoring as optional feature with in-memory state
5. **Phase 5**: Add persistence only if in-memory insufficient
### Key Takeaways
- **Windows-specific issues don't justify global architectural changes** without clear evidence
- **Platform-conditional logic is acceptable** when solving platform-specific problems
- **Native module dependencies are heavy** - avoid unless necessary
- **Performance regressions need explicit justification** - synchronous blocking, startup latency, polling overhead all impact UX
- **Bundle size matters** - build tools, compilers, Python are significant requirements
---
## References
**GitHub Issues**:
- #209: Windows worker startup failures
- #309: Console window popups
- #315: windowsHide approach (closed)
**PRs**:
- #335: Comprehensive Windows fix (under review)
- #338: Queue monitoring system (extracted)
- #339: Windows console popup fix (extracted)
**Upstream Bugs**:
- Node.js #21825: windowsHide ignored with detached
- Bun #12127, #5774, #8786: Windows zombie sockets
**Related Observations**:
- #27302: PR #315 windowsHide failure analysis
- #27233: Bun zombie socket discovery
- #27232: Windows background window root cause
- #27286: Runtime switch assessment
- #27283: PowerShell process spawn fix
- #27190: ProcessManager Node.js implementation
- #24532: Issue #209 resolution
---
**Last Updated**: 2025-12-16
**Document Status**: Comprehensive review based on memory search through #S3485