Windows, Bun, and Worker Service Struggles
A comprehensive chronicle of platform-specific issues, attempted fixes, and architectural decisions.
Executive Summary
The claude-mem project has faced persistent Windows-specific issues centered around three core problems:
- Console Window Popups: Blank terminal windows appearing when spawning the worker and SDK subprocesses
- Zombie Socket Issues: Bun leaving TCP sockets in LISTEN state after termination on Windows
- Process Management Complexity: Platform-specific spawning logic and reliability issues
These issues have driven multiple PRs, architectural pivots, and significant debate about runtime switching (Bun → Node.js).
Timeline of Issues
Issue #209: Windows Worker Startup Failures (Dec 12-13, 2025)
Problem: Worker service failed to start on Windows using PowerShell Start-Process approach.
Symptoms:
- Worker startup attempted via powershell.exe -NoProfile -NonInteractive -Command Start-Process
- Health check retries exhausted (15 attempts over 15 seconds)
- Users left unable to start worker manually
Root Causes:
- Platform-conditional process spawning (PowerShell for Windows, PM2 for Unix)
- PowerShell spawning without -PassThru to capture PID
- Inconsistent process management across platforms
Resolution: Issue was marked as closed, suggesting it was resolved in v7.1.0 through architectural unification with Bun-based ProcessManager using PID file tracking consistently across all platforms.
Status: ✅ Resolved (pre-PR #335)
Issue #309 & PR #315: Console Window Popups (Dec 14-15, 2025)
Problem: Blank terminal windows appear when spawning worker processes and SDK subprocesses on Windows.
First Attempted Fix (PR #315): Add windowsHide: true to spawn options
Why It Failed: Node.js bug #21825 - windowsHide: true is ignored when detached: true is also set. Both flags are required:
- detached: true - Needed for background process
- windowsHide: true - Needed to hide window (but doesn't work when detached)
Testing Results (by ToxMox):
- Tested PR #315 on Windows 11
- Confirmed blank terminal windows still appear for both worker and SDK subprocess spawns
- Affects both ProcessManager.ts (worker) and SDKAgent.ts (SDK subprocess)
Working Solution: Use PowerShell's Start-Process with -WindowStyle Hidden flag instead of standard spawn.
Status: ❌ PR #315 closed in favor of more comprehensive solution
Bun Zombie Socket Issue (Dec 15, 2025)
Problem: Bun leaves TCP sockets in zombie LISTEN state on Windows after worker termination.
Symptoms:
- Port remains bound even though no process owns it
- OwningProcess shows 0 or dead PID
- New worker instances cannot start due to EADDRINUSE errors
- Happens regardless of termination method (process.exit(), external kill, Ctrl+C)
- Only system reboot clears zombie ports
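The symptoms above can be expressed as a small check. This is a minimal sketch, assuming the LISTEN entries have already been collected elsewhere (for example from PowerShell's Get-NetTCPConnection, which reports an OwningProcess field). The ListenerRecord type, the function names, and the injected isAlive predicate are hypothetical and not taken from the claude-mem code.

```typescript
// Hypothetical shape of one LISTEN entry, e.g. parsed from Get-NetTCPConnection output.
interface ListenerRecord {
  port: number;
  owningProcess: number; // 0 when Windows reports no owner
}

// Returns true when a listening port has no live owner.
// isAlive is injected so the check can be exercised without touching real processes.
function isZombieListener(
  record: ListenerRecord,
  isAlive: (pid: number) => boolean,
): boolean {
  if (record.owningProcess === 0) return true;
  return !isAlive(record.owningProcess);
}

// Narrows a list of LISTEN entries to the zombie entries on one port.
function findZombies(
  records: ListenerRecord[],
  port: number,
  isAlive: (pid: number) => boolean,
): ListenerRecord[] {
  return records.filter((r) => r.port === port && isZombieListener(r, isAlive));
}
```

A check like this only detects the condition; per the symptoms above, it does not free the port.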
Upstream Tracking:
- Bun issue #12127
- Bun issue #5774
- Bun issue #8786
Impact: Windows users may need to reboot their systems when the worker crashes or is restarted.
Proposed Solution: Switch worker runtime from Bun to Node.js on Windows (or globally).
Status: 🟡 Unresolved - Platform-specific bug in Bun's Windows socket cleanup
SDK Subprocess Hang Issue (Dec 15, 2025)
Problem: SDK subprocesses can hang indefinitely, blocking observation processing.
Root Cause: AbortController.abort() does not actually terminate child processes.
Symptoms:
- For-await loop blocks forever waiting for output from hung subprocess
- Observation processing halts
- No recovery mechanism
Solution: Implement watchdog timer that explicitly kills child processes using platform-specific commands:
- Windows: wmic process where ParentProcessId=<pid> delete
- Unix: pkill -P <pid>
Timeout: SDK_QUERY_TIMEOUT_MS set to 2 minutes
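The watchdog described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR #335 implementation: buildKillChildrenCommand and startWatchdog are hypothetical names, and only the two commands and the 2 minute timeout come from the text above.

```typescript
interface KillCommand {
  command: string;
  args: string[];
}

// 2 minutes, matching the timeout quoted above.
const SDK_QUERY_TIMEOUT_MS = 2 * 60 * 1000;

// Builds the platform-specific command that terminates the children of parentPid.
// The PID is validated so that only a positive integer ever reaches the command line.
function buildKillChildrenCommand(platform: string, parentPid: number): KillCommand {
  if (!Number.isInteger(parentPid) || parentPid <= 0) {
    throw new Error(`Invalid parent PID: ${parentPid}`);
  }
  if (platform === "win32") {
    return {
      command: "wmic",
      args: ["process", "where", `ParentProcessId=${parentPid}`, "delete"],
    };
  }
  return { command: "pkill", args: ["-P", String(parentPid)] };
}

// Arms a timer that calls onTimeout unless the returned cancel function runs first.
function startWatchdog(
  onTimeout: () => void,
  timeoutMs: number = SDK_QUERY_TIMEOUT_MS,
): () => void {
  const timer = setTimeout(onTimeout, timeoutMs);
  return () => clearTimeout(timer);
}
```

In use, onTimeout would run the built command (for instance through child_process.spawn with the args array) and the caller would invoke the cancel function once the SDK query completes normally.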
Status: ✅ Fixed in PR #335 (watchdog implementation)
PR #335: Comprehensive Windows Fix (Dec 15, 2025)
What It Attempted
ToxMox developed a comprehensive PR addressing all Windows issues simultaneously:
- PowerShell-based spawning to fix popup windows
- Runtime switch from Bun to Node.js (globally) to fix zombie sockets
- Queue monitoring system with persistent message queue
- Watchdog service for stuck message recovery
- SQLite compatibility layer for Node.js support
Architecture Decisions
ProcessManager Changes:
- Switched from startWithBun() to startWithNode()
- Windows: Uses PowerShell Start-Process -WindowStyle Hidden -PassThru
- Unix: Uses standard spawn() with detached: true
- Captures PID via PowerShell Select-Object -ExpandProperty Id
- Comment states: "Use Node on all platforms (Bun has zombie socket issues on Windows)"
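To illustrate the PID capture step, here is a minimal sketch of selecting the launch strategy and parsing the PID that Select-Object -ExpandProperty Id writes to stdout. The names chooseLauncher and parseStartProcessPid are hypothetical, and the assumption that stdout contains only the numeric PID (plus whitespace) is mine rather than something the PR documents.

```typescript
type Launcher = "powershell-start-process" | "detached-spawn";

// Mirrors the split described above: PowerShell on Windows, plain spawn() elsewhere.
function chooseLauncher(platform: string): Launcher {
  return platform === "win32" ? "powershell-start-process" : "detached-spawn";
}

// Parses the PID printed by the PowerShell pipeline.
// Assumes the output is just the number, possibly followed by CRLF.
function parseStartProcessPid(stdout: string): number {
  const trimmed = stdout.trim();
  if (!/^\d+$/.test(trimmed)) {
    throw new Error(`Unexpected PID output: ${JSON.stringify(stdout)}`);
  }
  const pid = Number(trimmed);
  if (pid <= 0) {
    throw new Error(`PID must be positive, got ${pid}`);
  }
  return pid;
}
```

Rejecting anything that is not a bare number keeps a PowerShell error message from being written into the PID file as if it were a process ID.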
SQLite Compatibility Layer:
- Created sqlite-compat.ts adapter pattern
- Provides bun:sqlite API compatibility via better-sqlite3
- Allows code to work with both Bun and Node.js runtimes
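The adapter idea can be sketched without importing either driver. This is not the contents of sqlite-compat.ts; it assumes, for illustration only, that the gap being bridged is bun:sqlite's db.query(), which better-sqlite3 does not provide (both drivers expose prepare() and statements with run, get, and all). StatementLike, PrepareDriver, and SqliteCompat are hypothetical names.

```typescript
// Statement surface common to both drivers.
interface StatementLike {
  run(...params: unknown[]): unknown;
  get(...params: unknown[]): unknown;
  all(...params: unknown[]): unknown[];
}

// What the adapter needs from the underlying driver (hypothetical interface).
interface PrepareDriver {
  prepare(sql: string): StatementLike;
  close(): void;
}

// Wraps a prepare()-only driver and adds a bun:sqlite-style query() method.
class SqliteCompat {
  private readonly driver: PrepareDriver;
  private readonly cache = new Map<string, StatementLike>();

  constructor(driver: PrepareDriver) {
    this.driver = driver;
  }

  // Prepares each distinct SQL string once and reuses the statement afterwards.
  query(sql: string): StatementLike {
    let statement = this.cache.get(sql);
    if (!statement) {
      statement = this.driver.prepare(sql);
      this.cache.set(sql, statement);
    }
    return statement;
  }

  // Uncached pass-through for callers that already use prepare().
  prepare(sql: string): StatementLike {
    return this.driver.prepare(sql);
  }

  close(): void {
    this.cache.clear();
    this.driver.close();
  }
}
```

Calling code would construct the wrapper around a better-sqlite3 database under Node.js and keep using query() unchanged.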
Critical Issues Identified
1. Global vs Platform-Conditional Runtime
The Inconsistency: Code comment explicitly states zombie sockets occur "on Windows", yet solution applies Node.js universally across all platforms.
Questions Raised:
- Why sacrifice Bun's performance on macOS/Linux where no issues documented?
- Platform-specific spawning already implemented - why not platform-specific runtime?
- No documented Bun reliability issues on non-Windows platforms
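The platform-conditional alternative raised by these questions could be as small as the sketch below. chooseWorkerRuntime and its optional override parameter are hypothetical; PR #335 itself uses Node.js on all platforms.

```typescript
type WorkerRuntime = "bun" | "node";

// Picks Node.js only where the zombie socket issue is documented (Windows)
// and keeps Bun elsewhere. An explicit override wins, e.g. for testing.
function chooseWorkerRuntime(platform: string, override?: WorkerRuntime): WorkerRuntime {
  if (override) return override;
  return platform === "win32" ? "node" : "bun";
}
```

The platform string is passed in rather than read from process.platform so that the decision can be tested on any machine.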
2. Performance Regressions
better-sqlite3 Blocking:
- Synchronous-only API blocks Node.js event loop during all DB operations
- The review contrasted this with Bun's SQLite support, though bun:sqlite is also a synchronous API, so the contrast may be overstated
- Affects: enqueue, markProcessing, markProcessed, watchdog checks
Watchdog Polling Overhead:
- Full table scans every 30 seconds even when idle
- Constant database I/O overhead
- No max queue size limits = unbounded growth
Startup Latency:
- Node.js initialization (slower than Bun)
- Native module loading (better-sqlite3)
- Database migrations
- Stuck message scan
- Watchdog initialization
- HTTP server startup
3. Build Dependencies
better-sqlite3 Requirements:
- node-gyp
- Python
- C++ compiler toolchains
- Visual Studio Build Tools (Windows)
Impact:
- Local development machines without build tools fail
- CI/CD pipelines need updated Docker images
- Restricted environments where compilers not permitted
- ARM/M1 Mac compatibility issues
4. Migration Risks
Breaking Changes:
- Automatic database migration adds pending_messages table
- Runtime switch not documented in PR
- Node.js becomes undocumented hard requirement
- No migration guide or rollback procedure
Unanswered Questions:
- What happens to in-flight messages during upgrade?
- Can users safely downgrade?
- Is migration idempotent?
5. Code Quality Issues
Command Injection Risk (ProcessManager.ts:67):
- PowerShell commands use template literal concatenation
- Vulnerable if MARKETPLACE_ROOT or script paths attacker-controlled
- Should use array-based argument passing
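One way to reduce this risk is sketched below. It assumes the process is still started through powershell.exe -Command, so the outer arguments are passed as an array and each path is embedded as a PowerShell single-quoted literal. psQuote and buildHiddenStartArgs are hypothetical helpers, not the code at ProcessManager.ts:67.

```typescript
// PowerShell accepts these characters as single-quote delimiters: ' and the
// typographic variants U+2018, U+2019, U+201A, U+201B.
const PS_SINGLE_QUOTES = /['\u2018\u2019\u201A\u201B]/g;

// Wraps a value in a PowerShell single-quoted literal, doubling any embedded
// quote character so the value cannot terminate the literal early.
function psQuote(value: string): string {
  return `'${value.replace(PS_SINGLE_QUOTES, "$&$&")}'`;
}

// Builds the argument array for powershell.exe. Only the final -Command string
// contains the paths, and both are quoted.
function buildHiddenStartArgs(exePath: string, scriptPath: string): string[] {
  if (scriptPath.includes('"')) {
    // The script path is wrapped in double quotes for the child command line.
    throw new Error("Script path must not contain double quotes");
  }
  const command = [
    "Start-Process",
    "-FilePath", psQuote(exePath),
    "-ArgumentList", psQuote(`"${scriptPath}"`),
    "-WindowStyle", "Hidden",
    "-PassThru",
    "|", "Select-Object", "-ExpandProperty", "Id",
  ].join(" ");
  return ["-NoProfile", "-NonInteractive", "-Command", command];
}
```

The resulting array would be handed to spawn("powershell.exe", args) without a shell, so no second layer of quoting applies to the outer arguments.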
Missing Error Handling (WatchdogService.ts:61):
- setInterval callback lacks error handling
- Timer continues running if check() throws
- Creates zombie watchdog scenario
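A guarded callback is one way to address this. The sketch below is illustrative rather than a patch for WatchdogService.ts: guardCheck and startGuardedInterval are hypothetical names, and the 30 second default is taken from the polling interval mentioned earlier.

```typescript
type Check = () => void | Promise<void>;

// Wraps check so that neither a synchronous throw nor a rejected promise
// escapes the timer callback. Every failure is routed to onError.
function guardCheck(check: Check, onError: (err: unknown) => void): () => void {
  return () => {
    try {
      Promise.resolve(check()).catch(onError);
    } catch (err) {
      onError(err);
    }
  };
}

// Runs the guarded check on an interval and returns a function that stops it.
function startGuardedInterval(
  check: Check,
  onError: (err: unknown) => void,
  intervalMs: number = 30_000,
): () => void {
  const timer = setInterval(guardCheck(check, onError), intervalMs);
  return () => clearInterval(timer);
}
```

With failures surfaced through onError, the caller can log them, count consecutive failures, or stop the interval instead of letting it keep firing silently.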
No Queue Size Limits:
- Unbounded database growth if messages accumulate
- Failed messages (exceeding maxRetries) accumulate indefinitely
- Only 24-hour retention for processed messages
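The missing limits can be illustrated with an in-memory queue. This is a minimal sketch, assuming a fixed capacity and a drop-on-exhaustion policy; BoundedQueue and its method names are hypothetical and do not mirror PendingMessageStore.

```typescript
interface QueuedMessage {
  id: number;
  payload: string;
  retries: number;
}

class BoundedQueue {
  private readonly items: QueuedMessage[] = [];
  private readonly maxSize: number;
  private readonly maxRetries: number;
  private nextId = 1;

  constructor(maxSize: number, maxRetries: number) {
    this.maxSize = maxSize;
    this.maxRetries = maxRetries;
  }

  get size(): number {
    return this.items.length;
  }

  // Returns the new message id, or null when the queue is full so the caller
  // can decide between dropping and applying backpressure.
  enqueue(payload: string): number | null {
    if (this.items.length >= this.maxSize) return null;
    const id = this.nextId++;
    this.items.push({ id, payload, retries: 0 });
    return id;
  }

  // Removes a message that was handled successfully.
  markProcessed(id: number): boolean {
    const index = this.items.findIndex((m) => m.id === id);
    if (index === -1) return false;
    this.items.splice(index, 1);
    return true;
  }

  // Records a failed attempt and drops the message once it exceeds maxRetries.
  markFailed(id: number): "retry" | "dropped" | "unknown" {
    const message = this.items.find((m) => m.id === id);
    if (!message) return "unknown";
    message.retries += 1;
    if (message.retries > this.maxRetries) {
      this.markProcessed(id);
      return "dropped";
    }
    return "retry";
  }
}
```

The same two rules (a capacity check on insert and removal after the retry budget is spent) would apply equally to a database-backed queue.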
Assessment and Recommendations
What Was Validated
Legitimate Windows Issues:
- ✅ Console window popups are real (Node.js bug #21825)
- ✅ PowerShell Start-Process solution works
- ✅ Bun zombie socket issue is real and Windows-specific
- ✅ SDK subprocess hang issue is real
What Remains Questionable
Global Runtime Switch:
- ❌ No evidence Bun problematic on macOS/Linux
- ❌ Platform-conditional runtime not considered
- ❌ Performance trade-offs not documented
- ❌ "Windows-only" issue applied globally
Zombie Socket Root Cause:
- 🟡 May be fixable with proper cleanup handlers:
  - Missing server.close() calls before exit
  - Processes killed with SIGKILL before cleanup finishes
  - Missing SIGTERM signal handlers for graceful shutdown
- 🟡 Runtime switch may be unnecessary over-engineering
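The cleanup-handler hypothesis could be tested with something like the sketch below. Whether this actually prevents zombie sockets under Bun on Windows is unverified; the code only shows the shape of a graceful shutdown. makeShutdown, ClosableServer, and the 5 second force-exit delay are assumptions made for illustration.

```typescript
// Minimal surface of a server that can be closed, matching the callback form
// of close() on Node.js HTTP servers.
interface ClosableServer {
  close(callback?: (err?: Error) => void): unknown;
}

// Returns an idempotent shutdown function: it closes the server first and only
// then exits, with a forced exit if close() does not finish in time.
function makeShutdown(
  server: ClosableServer,
  exit: (code: number) => void,
  forceAfterMs: number = 5000,
): () => void {
  let started = false;
  return () => {
    if (started) return; // ignore repeated signals
    started = true;
    const timer = setTimeout(() => exit(1), forceAfterMs);
    server.close((err) => {
      clearTimeout(timer);
      exit(err ? 1 : 0);
    });
  };
}

// Wiring in a worker entry point (not executed here):
//   const shutdown = makeShutdown(server, (code) => process.exit(code));
//   process.on("SIGTERM", shutdown);
//   process.on("SIGINT", shutdown);
```

Note that a process terminated with SIGKILL never runs these handlers, which is why the list above treats SIGKILL as a separate cause.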
Salvageable Components
If Extracted into Separate PRs:
- PowerShell Spawning for Windows Worker
  - Focused PR: "Windows: Use Node.js instead of Bun for worker process"
  - Platform-conditional logic (Node.js on Windows, Bun elsewhere)
  - Independent justification required
- SQLite Compatibility Layer
  - Well-designed adapter pattern
  - Requires independent justification for Node.js runtime need
  - Should not be bundled with other changes
- Queue Monitoring UI Concept
  - Valuable visibility into worker state
  - Should build on in-memory state first
  - Remove database persistence requirement initially
- Watchdog Improvements
  - SDK subprocess timeout handling
  - Evidence of superiority over current approach needed
Current Status
Resolved
- ✅ Issue #209: Windows worker startup (v7.1.0)
- ✅ SDK subprocess hang issue (watchdog implementation)
In Progress
- 🔄 PR #339: Windows console popup fix (extracted from PR #335)
- 🔄 PR #338: Queue monitoring system (extracted from PR #335)
Open Questions
- ❓ Should runtime switch be global or Windows-only?
- ❓ Can zombie socket issue be fixed without runtime switch?
- ❓ Is better-sqlite3's synchronous blocking acceptable?
- ❓ Should queue persistence be in-memory first?
Lessons Learned
Architectural Principles Violated
YAGNI: Queue persistence, watchdog service, and comprehensive monitoring added without proven need.
Happy Path: Should have started with simplest Windows fix (PowerShell spawning), validated, then added complexity if needed.
Incremental Validation: Bundling multiple architectural changes prevents isolating what actually solves the problem.
What Should Have Happened
- Phase 1: PowerShell spawning fix for Windows console popups (targeted, testable)
- Phase 2: Investigate zombie socket root cause (cleanup handlers vs runtime switch)
- Phase 3: If runtime switch justified, implement as Windows-conditional first
- Phase 4: Add queue monitoring as optional feature with in-memory state
- Phase 5: Add persistence only if in-memory insufficient
Key Takeaways
- Windows-specific issues don't justify global architectural changes without clear evidence
- Platform-conditional logic is acceptable when solving platform-specific problems
- Native module dependencies are heavy - avoid unless necessary
- Performance regressions need explicit justification - synchronous blocking, startup latency, polling overhead all impact UX
- Bundle size matters - build tools, compilers, Python are significant requirements
References
GitHub Issues:
- #209: Windows worker startup failures
- #309: Console window popups
- #315: windowsHide approach (closed)
PRs:
- #335: Comprehensive Windows fix (under review)
- #338: Queue monitoring system (extracted)
- #339: Windows console popup fix (extracted)
Upstream Bugs:
- Node.js #21825: windowsHide ignored with detached
- Bun #12127, #5774, #8786: Windows zombie sockets
Related Observations:
- #27302: PR #315 windowsHide failure analysis
- #27233: Bun zombie socket discovery
- #27232: Windows background window root cause
- #27286: Runtime switch assessment
- #27283: PowerShell process spawn fix
- #27190: ProcessManager Node.js implementation
- #24532: Issue #209 resolution
Last Updated: 2025-12-16
Document Status: Comprehensive review based on memory search through #S3485