backup: Phase 1 agent work (security, persistence, batch endpoint)
This is a backup of all work done by the 3 Phase 1 agents: Agent A - Command Injection Fix (Issue #354): - Fixed command injection in BranchManager.ts - Fixed unnecessary shell usage in bun-path.ts - Added comprehensive security test suite - Created SECURITY.md and SECURITY_AUDIT_REPORT.md Agent B - Observation Persistence Fix (Issue #353): - Added PendingMessageStore from PR #335 - Integrated persistent queue into SessionManager - Modified SDKAgent to mark messages complete - Updated SessionStore with pending_messages migration - Updated worker-types.ts with new interfaces Agent C - Batch Endpoint Verification (Issue #348): - Created batch-observations.test.ts - Updated worker-service.mdx documentation Also includes: - Documentation context files (biomimetic, windows struggles) - Build artifacts from agent testing This work will be re-evaluated after v7.3.0 release.
This commit is contained in:
@@ -0,0 +1,332 @@
|
||||
# Windows, Bun, and Worker Service Struggles
|
||||
|
||||
A comprehensive chronicle of platform-specific issues, attempted fixes, and architectural decisions.
|
||||
|
||||
## Executive Summary
|
||||
|
||||
The claude-mem project has faced persistent Windows-specific issues centered around three core problems:
|
||||
|
||||
1. **Console Window Popups**: Blank terminal windows appearing when spawning worker and SDK subprocess
|
||||
2. **Zombie Socket Issues**: Bun leaving TCP sockets in LISTEN state after termination on Windows
|
||||
3. **Process Management Complexity**: Platform-specific spawning logic and reliability issues
|
||||
|
||||
These issues have driven multiple PRs, architectural pivots, and significant debate about runtime switching (Bun → Node.js).
|
||||
|
||||
---
|
||||
|
||||
## Timeline of Issues
|
||||
|
||||
### Issue #209: Windows Worker Startup Failures (Dec 12-13, 2025)
|
||||
|
||||
**Problem**: Worker service failed to start on Windows using PowerShell Start-Process approach.
|
||||
|
||||
**Symptoms**:
|
||||
- Worker startup attempted via `powershell.exe -NoProfile -NonInteractive -Command Start-Process`
|
||||
- Health check retries exhausted (15 attempts over 15 seconds)
|
||||
- Users left unable to start worker manually
|
||||
|
||||
**Root Causes**:
|
||||
- Platform-conditional process spawning (PowerShell for Windows, PM2 for Unix)
|
||||
- PowerShell spawning without `-PassThru` to capture PID
|
||||
- Inconsistent process management across platforms
|
||||
|
||||
**Resolution**: Issue was marked as closed, suggesting it was resolved in v7.1.0 through architectural unification with Bun-based ProcessManager using PID file tracking consistently across all platforms.
|
||||
|
||||
**Status**: ✅ Resolved (pre-PR #335)
|
||||
|
||||
---
|
||||
|
||||
### Issue #309 & PR #315: Console Window Popups (Dec 14-15, 2025)
|
||||
|
||||
**Problem**: Blank terminal windows appear when spawning worker processes and SDK subprocesses on Windows.
|
||||
|
||||
**First Attempted Fix (PR #315)**: Add `windowsHide: true` to spawn options
|
||||
|
||||
**Why It Failed**: Node.js bug #21825 - `windowsHide: true` is **ignored** when `detached: true` is also set. Both flags are required:
|
||||
- `detached: true` - Needed for background process
|
||||
- `windowsHide: true` - Needed to hide window (but doesn't work when detached)
|
||||
|
||||
**Testing Results** (by ToxMox):
|
||||
- Tested PR #315 on Windows 11
|
||||
- Confirmed blank terminal windows still appear for both worker and SDK subprocess spawns
|
||||
- Affects both `ProcessManager.ts` (worker) and `SDKAgent.ts` (SDK subprocess)
|
||||
|
||||
**Working Solution**: Use PowerShell's `Start-Process` with `-WindowStyle Hidden` flag instead of standard spawn.
|
||||
|
||||
**Status**: ❌ PR #315 closed in favor of more comprehensive solution
|
||||
|
||||
---
|
||||
|
||||
### Bun Zombie Socket Issue (Dec 15, 2025)
|
||||
|
||||
**Problem**: Bun leaves TCP sockets in zombie LISTEN state on Windows after worker termination.
|
||||
|
||||
**Symptoms**:
|
||||
- Port remains bound even though no process owns it
|
||||
- `OwningProcess` shows 0 or dead PID
|
||||
- New worker instances cannot start due to `EADDRINUSE` errors
|
||||
- Happens regardless of termination method (process.exit(), external kill, Ctrl+C)
|
||||
- **Only system reboot clears zombie ports**
|
||||
|
||||
**Upstream Tracking**:
|
||||
- Bun issue #12127
|
||||
- Bun issue #5774
|
||||
- Bun issue #8786
|
||||
|
||||
**Impact**: Windows users may need to reboot their systems when worker crashes or is restarted.
|
||||
|
||||
**Proposed Solution**: Switch worker runtime from Bun to Node.js on Windows (or globally).
|
||||
|
||||
**Status**: 🟡 Unresolved - Platform-specific bug in Bun's Windows socket cleanup
|
||||
|
||||
---
|
||||
|
||||
### SDK Subprocess Hang Issue (Dec 15, 2025)
|
||||
|
||||
**Problem**: SDK subprocesses can hang indefinitely, blocking observation processing.
|
||||
|
||||
**Root Cause**: `AbortController.abort()` does not actually terminate child processes.
|
||||
|
||||
**Symptoms**:
|
||||
- For-await loop blocks forever waiting for output from hung subprocess
|
||||
- Observation processing halts
|
||||
- No recovery mechanism
|
||||
|
||||
**Solution**: Implement watchdog timer that explicitly kills child processes using platform-specific commands:
|
||||
- **Windows**: `wmic process where ParentProcessId=<pid> delete`
|
||||
- **Unix**: `pkill -P <pid>`
|
||||
|
||||
**Timeout**: `SDK_QUERY_TIMEOUT_MS` set to 2 minutes
|
||||
|
||||
**Status**: ✅ Fixed in PR #335 (watchdog implementation)
|
||||
|
||||
---
|
||||
|
||||
## PR #335: Comprehensive Windows Fix (Dec 15, 2025)
|
||||
|
||||
### What It Attempted
|
||||
|
||||
ToxMox developed a comprehensive PR addressing all Windows issues simultaneously:
|
||||
|
||||
1. **PowerShell-based spawning** to fix popup windows
|
||||
2. **Runtime switch** from Bun to Node.js (globally) to fix zombie sockets
|
||||
3. **Queue monitoring system** with persistent message queue
|
||||
4. **Watchdog service** for stuck message recovery
|
||||
5. **SQLite compatibility layer** for Node.js support
|
||||
|
||||
### Architecture Decisions
|
||||
|
||||
**ProcessManager Changes**:
|
||||
- Switched from `startWithBun()` to `startWithNode()`
|
||||
- Windows: Uses PowerShell `Start-Process -WindowStyle Hidden -PassThru`
|
||||
- Unix: Uses standard `spawn()` with `detached: true`
|
||||
- Captures PID via PowerShell `Select-Object -ExpandProperty Id`
|
||||
- Comment states: "Use Node on all platforms (Bun has zombie socket issues on Windows)"
|
||||
|
||||
**SQLite Compatibility Layer**:
|
||||
- Created `sqlite-compat.ts` adapter pattern
|
||||
- Provides `bun:sqlite` API compatibility via `better-sqlite3`
|
||||
- Allows code to work with both Bun and Node.js runtimes
|
||||
|
||||
### Critical Issues Identified
|
||||
|
||||
#### 1. **Global vs Platform-Conditional Runtime**
|
||||
|
||||
**The Inconsistency**: Code comment explicitly states zombie sockets occur "on Windows", yet solution applies Node.js universally across all platforms.
|
||||
|
||||
**Questions Raised**:
|
||||
- Why sacrifice Bun's performance on macOS/Linux where no issues documented?
|
||||
- Platform-specific spawning already implemented - why not platform-specific runtime?
|
||||
- No documented Bun reliability issues on non-Windows platforms
|
||||
|
||||
#### 2. **Performance Regressions**
|
||||
|
||||
**better-sqlite3 Blocking**:
|
||||
- Synchronous-only API blocks Node.js event loop during all DB operations
|
||||
- Contrasts with Bun's async SQLite support
|
||||
- Affects: enqueue, markProcessing, markProcessed, watchdog checks
|
||||
|
||||
**Watchdog Polling Overhead**:
|
||||
- Full table scans every 30 seconds even when idle
|
||||
- Constant database I/O overhead
|
||||
- No max queue size limits = unbounded growth
|
||||
|
||||
**Startup Latency**:
|
||||
- Node.js initialization (slower than Bun)
|
||||
- Native module loading (better-sqlite3)
|
||||
- Database migrations
|
||||
- Stuck message scan
|
||||
- Watchdog initialization
|
||||
- HTTP server startup
|
||||
|
||||
#### 3. **Build Dependencies**
|
||||
|
||||
**better-sqlite3 Requirements**:
|
||||
- node-gyp
|
||||
- Python
|
||||
- C++ compiler toolchains
|
||||
- Visual Studio Build Tools (Windows)
|
||||
|
||||
**Impact**:
|
||||
- Local development machines without build tools fail
|
||||
- CI/CD pipelines need updated Docker images
|
||||
- Restricted environments where compilers not permitted
|
||||
- ARM/M1 Mac compatibility issues
|
||||
|
||||
#### 4. **Migration Risks**
|
||||
|
||||
**Breaking Changes**:
|
||||
- Automatic database migration adds `pending_messages` table
|
||||
- Runtime switch not documented in PR
|
||||
- Node.js becomes undocumented hard requirement
|
||||
- No migration guide or rollback procedure
|
||||
|
||||
**Unanswered Questions**:
|
||||
- What happens to in-flight messages during upgrade?
|
||||
- Can users safely downgrade?
|
||||
- Is migration idempotent?
|
||||
|
||||
#### 5. **Code Quality Issues**
|
||||
|
||||
**Command Injection Risk** (ProcessManager.ts:67):
|
||||
- PowerShell commands use template literal concatenation
|
||||
- Vulnerable if `MARKETPLACE_ROOT` or script paths attacker-controlled
|
||||
- Should use array-based argument passing
|
||||
|
||||
**Missing Error Handling** (WatchdogService.ts:61):
|
||||
- `setInterval` callback lacks error handling
|
||||
- Timer continues running if `check()` throws
|
||||
- Creates zombie watchdog scenario
|
||||
|
||||
**No Queue Size Limits**:
|
||||
- Unbounded database growth if messages accumulate
|
||||
- Failed messages (exceeding `maxRetries`) accumulate indefinitely
|
||||
- Only 24-hour retention for processed messages
|
||||
|
||||
---
|
||||
|
||||
## Assessment and Recommendations
|
||||
|
||||
### What Was Validated
|
||||
|
||||
**Legitimate Windows Issues**:
|
||||
- ✅ Console window popups are real (Node.js bug #21825)
|
||||
- ✅ PowerShell `Start-Process` solution works
|
||||
- ✅ Bun zombie socket issue is real and Windows-specific
|
||||
- ✅ SDK subprocess hang issue is real
|
||||
|
||||
### What Remains Questionable
|
||||
|
||||
**Global Runtime Switch**:
|
||||
- ❌ No evidence Bun problematic on macOS/Linux
|
||||
- ❌ Platform-conditional runtime not considered
|
||||
- ❌ Performance trade-offs not documented
|
||||
- ❌ "Windows-only" issue applied globally
|
||||
|
||||
**Zombie Socket Root Cause**:
|
||||
- 🟡 May be fixable with proper cleanup handlers:
|
||||
- Missing `server.close()` calls before exit
|
||||
- Processes killed with `SIGKILL` before cleanup finishes
|
||||
- Missing `SIGTERM` signal handlers for graceful shutdown
|
||||
- 🟡 Runtime switch may be unnecessary over-engineering
|
||||
|
||||
### Salvageable Components
|
||||
|
||||
**If Extracted into Separate PRs**:
|
||||
|
||||
1. **PowerShell Spawning for Windows Worker**
|
||||
- Focused PR: "Windows: Use Node.js instead of Bun for worker process"
|
||||
- Platform-conditional logic (Node.js on Windows, Bun elsewhere)
|
||||
- Independent justification required
|
||||
|
||||
2. **SQLite Compatibility Layer**
|
||||
- Well-designed adapter pattern
|
||||
- Requires independent justification for Node.js runtime need
|
||||
- Should not be bundled with other changes
|
||||
|
||||
3. **Queue Monitoring UI Concept**
|
||||
- Valuable visibility into worker state
|
||||
- Should build on in-memory state first
|
||||
- Remove database persistence requirement initially
|
||||
|
||||
4. **Watchdog Improvements**
|
||||
- SDK subprocess timeout handling
|
||||
- Evidence of superiority over current approach needed
|
||||
|
||||
---
|
||||
|
||||
## Current Status
|
||||
|
||||
### Resolved
|
||||
- ✅ Issue #209: Windows worker startup (v7.1.0)
|
||||
- ✅ SDK subprocess hang issue (watchdog implementation)
|
||||
|
||||
### In Progress
|
||||
- 🔄 PR #339: Windows console popup fix (extracted from PR #335)
|
||||
- 🔄 PR #338: Queue monitoring system (extracted from PR #335)
|
||||
|
||||
### Open Questions
|
||||
- ❓ Should runtime switch be global or Windows-only?
|
||||
- ❓ Can zombie socket issue be fixed without runtime switch?
|
||||
- ❓ Is better-sqlite3's synchronous blocking acceptable?
|
||||
- ❓ Should queue persistence be in-memory first?
|
||||
|
||||
---
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
### Architectural Principles Violated
|
||||
|
||||
**YAGNI**: Queue persistence, watchdog service, and comprehensive monitoring added without proven need.
|
||||
|
||||
**Happy Path**: Should have started with simplest Windows fix (PowerShell spawning), validated, then added complexity if needed.
|
||||
|
||||
**Incremental Validation**: Bundling multiple architectural changes prevents isolating what actually solves the problem.
|
||||
|
||||
### What Should Have Happened
|
||||
|
||||
1. **Phase 1**: PowerShell spawning fix for Windows console popups (targeted, testable)
|
||||
2. **Phase 2**: Investigate zombie socket root cause (cleanup handlers vs runtime switch)
|
||||
3. **Phase 3**: If runtime switch justified, implement as Windows-conditional first
|
||||
4. **Phase 4**: Add queue monitoring as optional feature with in-memory state
|
||||
5. **Phase 5**: Add persistence only if in-memory insufficient
|
||||
|
||||
### Key Takeaways
|
||||
|
||||
- **Windows-specific issues don't justify global architectural changes** without clear evidence
|
||||
- **Platform-conditional logic is acceptable** when solving platform-specific problems
|
||||
- **Native module dependencies are heavy** - avoid unless necessary
|
||||
- **Performance regressions need explicit justification** - synchronous blocking, startup latency, polling overhead all impact UX
|
||||
- **Bundle size matters** - build tools, compilers, Python are significant requirements
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
**GitHub Issues**:
|
||||
- #209: Windows worker startup failures
|
||||
- #309: Console window popups
|
||||
- #315: windowsHide approach (closed)
|
||||
|
||||
**PRs**:
|
||||
- #335: Comprehensive Windows fix (under review)
|
||||
- #338: Queue monitoring system (extracted)
|
||||
- #339: Windows console popup fix (extracted)
|
||||
|
||||
**Upstream Bugs**:
|
||||
- Node.js #21825: windowsHide ignored with detached
|
||||
- Bun #12127, #5774, #8786: Windows zombie sockets
|
||||
|
||||
**Related Observations**:
|
||||
- #27302: PR #315 windowsHide failure analysis
|
||||
- #27233: Bun zombie socket discovery
|
||||
- #27232: Windows background window root cause
|
||||
- #27286: Runtime switch assessment
|
||||
- #27283: PowerShell process spawn fix
|
||||
- #27190: ProcessManager Node.js implementation
|
||||
- #24532: Issue #209 resolution
|
||||
|
||||
---
|
||||
|
||||
**Last Updated**: 2025-12-16
|
||||
**Document Status**: Comprehensive review based on memory search through #S3485
|
||||
Reference in New Issue
Block a user