Files
claude-mem/src/shared/worker-utils.ts
T
Alex Newman bff10d49c9 fix(windows): Windows platform stabilization improvements (#378)
* chore: bump version to 7.3.6 in package.json

* Enhance worker readiness checks and MCP connection handling

- Updated health check endpoint to /api/readiness for better initialization tracking.
- Increased timeout for health checks and worker startup retries, especially for Windows.
- Added initialization flags to track MCP readiness and overall worker initialization status.
- Implemented a timeout guard for MCP connection to prevent hanging.
- Adjusted logging to reflect readiness state and errors more accurately.

* fix(windows): use Bun PATH detection in worker wrapper

Phase 2/8: Fix Bun PATH Detection in Worker Wrapper

- Import getBunPath() in worker-wrapper.ts for Bun detection
- Add Bun path resolution before spawning inner worker process
- Update spawn call to use detected Bun path instead of process.execPath
- Add logging to bun-path.ts when PATH detection succeeds
- Add logging when fallback paths are used
- Add Windows-specific validation for .exe extension
- Log warning with searched paths when Bun not found
- Fail fast with clear error message if Bun cannot be detected

This ensures worker-wrapper uses the correct Bun executable on Windows
even when Bun is not in PATH, fixing issue #371 where users reported
"Bun not in PATH" errors despite Bun being installed.

Addresses: #371

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix(windows): standardize child process spawning with windowsHide

Phase 3/8: Standardize Child Process Spawning (Windows)

Changes:
- Added windowsHide flag to ChromaSync MCP subprocess spawn
- Added Windows-specific process tracking (childPid) in ChromaSync
- Force-kill subprocess on Windows before closing transport to prevent zombie processes
- Updated cleanupOrphanedProcesses() to support Windows using PowerShell Get-CimInstance
- Use taskkill /T /F for proper process tree cleanup on Windows
- Audited BranchManager - confirmed windowsHide already present on all spawn calls

This prevents PowerShell windows from appearing during ChromaSync operations
and ensures proper cleanup of subprocess trees on Windows.

Addresses: #363, #361, #367, #371, #373, #374

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix(windows): enhance socket cleanup with recursive process tree management

Phase 4/8: Enhanced Socket Cleanup & Process Tree Management

Changes:
- Added recursive process tree enumeration in worker-wrapper.ts for Windows
- Enhanced killInner() to enumerate all descendants before killing
- Added fallback individual process kill if taskkill /T fails
- Added 10s timeout to ChromaSync.close() in DatabaseManager to prevent hangs
- Force nullify ChromaSync even on close failure to prevent resource leaks
- Improved logging to show full process tree during cleanup

This ensures complete cleanup of all child processes (ChromaSync MCP subprocess,
Python processes, etc.) preventing socket leaks and CLOSE_WAIT states.

Addresses: #363, #361

* fix(windows): consolidate project name extraction with drive root handling

Phase 5/8: Project Name Extraction Consolidation

- Created shared getProjectName() utility in src/utils/project-name.ts
- Handles edge case: drive roots (C:\, J:\) now return "drive-X" format
- Handles edge case: null/undefined/empty cwd now returns "unknown-project"
- Fixed missing null check bug in new-hook.ts
- Replaced duplicated path.basename(cwd) logic in:
  - src/hooks/context-hook.ts
  - src/hooks/new-hook.ts
  - src/services/context-generator.ts

Addresses: #374

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix(windows): increase timeouts and improve error messages

Phase 6/8: Increase Timeouts & Improve Error Messages

- Enhanced logger.ts with platform prefix (WIN32/DARWIN) and PID in all logs
- Added comprehensive Windows troubleshooting to ProcessManager error messages
- Enhanced Bun detection error message with Windows-specific troubleshooting
- All error messages now include GitHub issue numbers and docs links
- Windows timeout already increased to 2.0x multiplier in previous phases

Changes:
- src/utils/logger.ts: Added platform prefix and PID to all log output
- src/services/process/ProcessManager.ts: Enhanced error messages with troubleshooting steps
- src/utils/bun-path.ts: Added Windows-specific Bun detection error guidance

Addresses: #363, #361, #367, #371, #373, #374

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix(windows): add comprehensive Windows CI testing

Phase 7/8: Add Windows CI Testing

- Create automated Windows testing workflow
- Test worker startup/shutdown cycles
- Verify Bun PATH detection on Windows
- Test rapid restart scenarios
- Validate port cleanup after shutdown
- Check for zombie processes
- Run on all pushes and PRs to main/fix/feature branches

Addresses: #363, #361, #367, #371, #373, #374

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* ci(windows): remove build steps from Windows CI workflow

Build files are already included in the plugin folder, so npm install
and npm run build are unnecessary steps in the CI workflow.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* revert: remove Windows CI workflow

The CI workflow cannot be properly implemented in the current architecture
due to limitations in testing the worker service in CI environments.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* security: add PID validation and improve ChromaSync timeout handling

Address critical security and reliability issues identified in PR review:

**Security Fixes:**
- Add PID validation before all PowerShell/taskkill command execution
- Validate PIDs are positive integers to prevent command injection
- Apply validation in worker-wrapper.ts, worker-service.ts, and ChromaSync.ts

**Reliability Improvements:**
- Add timeout handling to ChromaSync client.close() (10s timeout)
- Add timeout handling to ChromaSync transport.close() (5s timeout)
- Implement force-kill fallback when ChromaSync close operations timeout
- Prevents hanging on shutdown and ensures subprocess cleanup

**Implementation Details:**
- PID validation checks: Number.isInteger(pid) && pid > 0
- Applied before all execSync taskkill calls on Windows
- Applied in process enumeration (Get-CimInstance) PowerShell commands
- ChromaSync.close() uses Promise.race for timeout enforcement
- Graceful degradation with force-kill fallback on timeout

Addresses PR #378 review feedback

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Refactor ChromaSync client and transport closure logic

- Removed timeout handling for closing the Chroma client and transport.
- Simplified error logging for client and transport closure.
- Ensured subprocess cleanup logic is more straightforward.

* fix(worker): streamline Windows process management and cleanup

* revert: remove speculative LLM-generated complexity

Reverts defensive code that was added speculatively without user-reported issues:

- ChromaSync: Remove PID extraction and explicit taskkill (wrapper handles this)
- worker-wrapper: Restore simple taskkill /T /F (validated in v7.3.5)
- DatabaseManager: Remove Promise.race timeout wrapper
- hook-constants: Restore original timeout values
- logger: Remove platform/PID additions to every log line
- bun-path: Remove speculative logging

Keeps only changes that map to actual GitHub issues:
- #374: Drive root project name fix (getProjectName utility)
- #363: Readiness endpoint and Windows orphan cleanup
- #367: windowsHide on ChromaSync transport

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-17 18:44:04 -05:00

240 lines
7.4 KiB
TypeScript

import path from "path";
import { homedir } from "os";
import { spawnSync } from "child_process";
import { existsSync, writeFileSync, readFileSync, mkdirSync } from "fs";
import { logger } from "../utils/logger.js";
import { HOOK_TIMEOUTS, getTimeout } from "./hook-constants.js";
import { ProcessManager } from "../services/process/ProcessManager.js";
import { SettingsDefaultsManager } from "./SettingsDefaultsManager.js";
import { getWorkerRestartInstructions } from "../utils/error-messages.js";
const MARKETPLACE_ROOT = path.join(homedir(), '.claude', 'plugins', 'marketplaces', 'thedotmack');
// Named constants for health checks
const HEALTH_CHECK_TIMEOUT_MS = getTimeout(HOOK_TIMEOUTS.HEALTH_CHECK);
// Port cache to avoid repeated settings file reads
let cachedPort: number | null = null;
/**
* Get the worker port number from settings
* Uses CLAUDE_MEM_WORKER_PORT from settings file or default (37777)
* Caches the port value to avoid repeated file reads
*/
export function getWorkerPort(): number {
if (cachedPort !== null) {
return cachedPort;
}
try {
const settingsPath = path.join(SettingsDefaultsManager.get('CLAUDE_MEM_DATA_DIR'), 'settings.json');
const settings = SettingsDefaultsManager.loadFromFile(settingsPath);
cachedPort = parseInt(settings.CLAUDE_MEM_WORKER_PORT, 10);
return cachedPort;
} catch (error) {
// Fallback to default if settings load fails
logger.debug('SYSTEM', 'Failed to load port from settings, using default', { error });
cachedPort = parseInt(SettingsDefaultsManager.get('CLAUDE_MEM_WORKER_PORT'), 10);
return cachedPort;
}
}
/**
* Clear the cached port value
* Call this when settings are updated to force re-reading from file
*/
export function clearPortCache(): void {
cachedPort = null;
}
/**
* Get the worker host address
* Priority: ~/.claude-mem/settings.json > env var > default (127.0.0.1)
*/
export function getWorkerHost(): string {
const settingsPath = path.join(homedir(), '.claude-mem', 'settings.json');
const settings = SettingsDefaultsManager.loadFromFile(settingsPath);
return settings.CLAUDE_MEM_WORKER_HOST;
}
/**
* Check if worker is responsive and fully initialized by trying the readiness endpoint
* Changed from /health to /api/readiness to ensure MCP initialization is complete
*/
async function isWorkerHealthy(): Promise<boolean> {
try {
const port = getWorkerPort();
const response = await fetch(`http://127.0.0.1:${port}/api/readiness`, {
signal: AbortSignal.timeout(HEALTH_CHECK_TIMEOUT_MS)
});
return response.ok;
} catch (error) {
logger.debug('SYSTEM', 'Worker readiness check failed', {
error: error instanceof Error ? error.message : String(error),
errorType: error?.constructor?.name
});
return false;
}
}
/**
* Get the current plugin version from package.json
*/
function getPluginVersion(): string | null {
try {
const packageJsonPath = path.join(MARKETPLACE_ROOT, 'package.json');
const packageJson = JSON.parse(readFileSync(packageJsonPath, 'utf-8'));
return packageJson.version;
} catch (error) {
logger.debug('SYSTEM', 'Failed to read plugin version', {
error: error instanceof Error ? error.message : String(error)
});
return null;
}
}
/**
* Get the running worker's version from the API
*/
async function getWorkerVersion(): Promise<string | null> {
try {
const port = getWorkerPort();
const response = await fetch(`http://127.0.0.1:${port}/api/version`, {
signal: AbortSignal.timeout(HEALTH_CHECK_TIMEOUT_MS)
});
if (!response.ok) return null;
const data = await response.json() as { version: string };
return data.version;
} catch (error) {
logger.debug('SYSTEM', 'Failed to get worker version', {
error: error instanceof Error ? error.message : String(error)
});
return null;
}
}
/**
* Check if worker version matches plugin version
* If mismatch detected, restart the worker automatically
*/
async function ensureWorkerVersionMatches(): Promise<void> {
const pluginVersion = getPluginVersion();
const workerVersion = await getWorkerVersion();
if (!pluginVersion || !workerVersion) {
// Can't determine versions, skip check
return;
}
if (pluginVersion !== workerVersion) {
logger.info('SYSTEM', 'Worker version mismatch detected - restarting worker', {
pluginVersion,
workerVersion
});
// Give files time to sync before restart
await new Promise(resolve => setTimeout(resolve, getTimeout(HOOK_TIMEOUTS.PRE_RESTART_SETTLE_DELAY)));
// Restart the worker
await ProcessManager.restart(getWorkerPort());
// Give it a moment to start
await new Promise(resolve => setTimeout(resolve, 1000));
// Verify it's healthy
if (!await isWorkerHealthy()) {
logger.error('SYSTEM', 'Worker failed to restart after version mismatch', {
expectedVersion: pluginVersion,
runningVersion: workerVersion,
port: getWorkerPort()
});
}
}
}
/**
* Start the worker service using ProcessManager
* Handles both Unix (Bun) and Windows (compiled exe) platforms
*/
async function startWorker(): Promise<boolean> {
// Clean up legacy PM2 (one-time migration)
const dataDir = SettingsDefaultsManager.get('CLAUDE_MEM_DATA_DIR');
const pm2MigratedMarker = path.join(dataDir, '.pm2-migrated');
// Ensure data directory exists (may not exist on fresh install)
mkdirSync(dataDir, { recursive: true });
if (!existsSync(pm2MigratedMarker)) {
try {
spawnSync('pm2', ['delete', 'claude-mem-worker'], { stdio: 'ignore' });
// Mark migration as complete
writeFileSync(pm2MigratedMarker, new Date().toISOString(), 'utf-8');
logger.debug('SYSTEM', 'PM2 cleanup completed and marked');
} catch {
// PM2 not installed or process doesn't exist - still mark as migrated
writeFileSync(pm2MigratedMarker, new Date().toISOString(), 'utf-8');
}
}
const port = getWorkerPort();
const result = await ProcessManager.start(port);
if (!result.success) {
logger.error('SYSTEM', 'Failed to start worker', {
platform: process.platform,
port,
error: result.error,
marketplaceRoot: MARKETPLACE_ROOT
});
}
return result.success;
}
/**
* Ensure worker service is running
* Checks health and auto-starts if not running
* Also ensures worker version matches plugin version
*/
export async function ensureWorkerRunning(): Promise<void> {
// Check if already healthy
if (await isWorkerHealthy()) {
// Worker is healthy, but check if version matches
await ensureWorkerVersionMatches();
return;
}
// Try to start the worker
const started = await startWorker();
if (!started) {
const port = getWorkerPort();
throw new Error(
getWorkerRestartInstructions({
port,
customPrefix: `Worker service failed to start on port ${port}.`
})
);
}
// Wait for worker to become responsive after starting
// Try up to 5 times with 500ms delays (2.5 seconds total)
for (let i = 0; i < 5; i++) {
await new Promise(resolve => setTimeout(resolve, 500));
if (await isWorkerHealthy()) {
await ensureWorkerVersionMatches();
return;
}
}
// Worker started but isn't responding
const port = getWorkerPort();
logger.error('SYSTEM', 'Worker started but not responding to health checks');
throw new Error(
getWorkerRestartInstructions({
port,
customPrefix: `Worker service started but is not responding on port ${port}.`
})
);
}