Fix 30+ root-cause bugs across 10 triage phases (#1214)

* MAESTRO: fix ChromaDB core issues — Python pinning, Windows paths, disable toggle, metadata sanitization, transport errors

- Add --python version pinning to uvx args in both local and remote mode (fixes #1196, #1206, #1208)
- Convert backslash paths to forward slashes for --data-dir on Windows (fixes #1199)
- Add CLAUDE_MEM_CHROMA_ENABLED setting for SQLite-only fallback mode (fixes #707)
- Sanitize metadata in addDocuments() to filter null/undefined/empty values (fixes #1183, #1188)
- Wrap callTool() in try/catch for transport errors with auto-reconnect (fixes #1162)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix data integrity — content-hash deduplication, project name collision, empty project guard, stuck isProcessing

- Add SHA-256 content-hash deduplication to observations INSERT (store.ts, transactions.ts, SessionStore.ts)
- Add content_hash column via migration 22 with backfill and index
- Fix project name collision: getCurrentProjectName() now returns parent/basename
- Guard against empty project string with cwd-derived fallback
- Fix stuck isProcessing: hasAnyPendingWork() resets processing messages older than 5 minutes
- Add 12 new tests covering all four fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix hook lifecycle — stderr suppression, output isolation, conversation pollution prevention

- Suppress process.stderr.write in hookCommand() to prevent Claude Code showing diagnostic
  output as error UI (#1181). Restores stderr in finally block for worker-continues case.
- Convert console.error() to logger.warn()/error() in hook-command.ts and handlers/index.ts
  so all diagnostics route to log file instead of stderr.
- Verified all 7 handlers return suppressOutput: true (prevents conversation pollution #598, #784).
- Verified session-complete is a recognized event type (fixes #984).
- Verified unknown event types return no-op handler with exit 0 (graceful degradation).
- Added 10 new tests in tests/hook-lifecycle.test.ts covering event dispatch, adapter defaults,
  stderr suppression, and standard response constants.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix worker lifecycle — restart loop coordination, stale transport retry, ENOENT shutdown race

- Add PID file mtime guard to prevent concurrent restart storms (#1145):
  isPidFileRecent() + touchPidFile() coordinate across sessions
- Add transparent retry in ChromaMcpManager.callTool() on transport
  error — reconnects and retries once instead of failing (#1131)
- Wrap getInstalledPluginVersion() with ENOENT/EBUSY handling (#1042)
- Verified ChromaMcpManager.stop() already called on all shutdown paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix Windows platform support — uvx.cmd spawn, PowerShell $_ elimination, windowsHide, FTS5 fallback

- Route uvx spawn through cmd.exe /c on Windows since MCP SDK lacks shell:true (#1190, #1192, #1199)
- Replace all PowerShell Where-Object {$_} pipelines with WQL -Filter server-side filtering (#1024, #1062)
- Add windowsHide: true to all exec/spawn calls missing it to prevent console popups (#1048)
- Add FTS5 runtime probe with graceful fallback when unavailable on Windows (#791)
- Guard FTS5 table creation in migrations, SessionSearch, and SessionStore with try/catch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix skills/ distribution — build-time verification and regression tests (#1187)

Add post-build verification in build-hooks.js that fails if critical
distribution files (skills, hooks, plugin manifest) are missing. Add
10 regression tests covering skill file presence, YAML frontmatter,
hooks.json integrity, and package.json files field.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix MigrationRunner schema initialization (#979) — version conflict between parallel migration systems

Root cause: old DatabaseManager migrations 1-7 shared schema_versions table with
MigrationRunner's 4-22, causing version number collisions (5=drop tables vs add column,
6=FTS5 vs prompt tracking, 7=discovery_tokens vs remove UNIQUE).  initializeSchema()
was gated behind maxApplied===0, so core tables were never created when old versions
were present.

Fixes:
- initializeSchema() always creates core tables via CREATE TABLE IF NOT EXISTS
- Migrations 5-7 check actual DB state (columns/constraints) not just version tracking
- Crash-safe temp table rebuilds (DROP IF EXISTS _new before CREATE)
- Added missing migration 21 (ON UPDATE CASCADE) to MigrationRunner
- Added ON UPDATE CASCADE to FK definitions in initializeSchema()
- All changes applied to both runner.ts and SessionStore.ts

Tests: 13 new tests in migration-runner.test.ts covering fresh DB, idempotency,
version conflicts, crash recovery, FK constraints, and data integrity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix 21 test failures — stale mocks, outdated assertions, missing OpenClaw guards

Server tests (12): Added missing workerPath and getAiStatus to ServerOptions
mocks after interface expansion. ChromaSync tests (3): Updated to verify
transport cleanup in ChromaMcpManager after architecture refactor. OpenClaw (2):
Added memory_ tool skipping and response truncation to prevent recursive loops
and oversized payloads. MarkdownFormatter (2): Updated assertions to match
current output. SettingsDefaultsManager (1): Used correct default key for
getBool test. Logger standards (1): Excluded CLI transcript command from
background service check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix Codex CLI compatibility (#744) — session_id fallbacks, unknown platform tolerance, undefined guard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix Cursor IDE integration (#838, #1049) — adapter field fallbacks, tolerant session-init validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix /api/logs OOM (#1203) — tail-read replaces full-file readFileSync

Replace readFileSync (loads entire file into memory) with readLastLines()
that reads only from the end of the file in expanding chunks (64KB → 10MB cap).
Prevents OOM on large log files while preserving the same API response shape.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix Settings CORS error (#1029) — explicit methods and allowedHeaders in CORS config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: add session custom_title for agent attribution (#1213) — migration 23, endpoint + store support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: prevent CLAUDE.md/AGENTS.md writes inside .git/ directories (#1165)

Add .git path guard to all 4 write sites to prevent ref corruption when
paths resolve inside .git internals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix plugin disabled state not respected (#781) — early exit check in all hook entry points

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix UserPromptSubmit context re-injection on every turn (#1079) — contextInjected session flag

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix stale AbortController queue stall (#1099) — lastGeneratorActivity tracking + 30s timeout

Three-layer fix:
1. Added lastGeneratorActivity timestamp to ActiveSession, updated by
   processAgentResponse (all agents), getMessageIterator (queue yields),
   and startGeneratorWithProvider (generator launch)
2. Added stale generator detection in ensureGeneratorRunning — if no
   activity for >30s, aborts stale controller, resets state, restarts
3. Added AbortSignal.timeout(30000) in deleteSession to prevent
   indefinite hang when awaiting a stuck generator promise

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Alex Newman
2026-02-23 19:34:35 -05:00
committed by GitHub
parent d9a30cc7d4
commit c6f932988a
62 changed files with 3639 additions and 793 deletions
+16 -6
View File
@@ -115,12 +115,22 @@ export async function httpShutdown(port: number): Promise<boolean> {
/**
* Get the plugin version from the installed marketplace package.json
* This is the "expected" version that should be running
* This is the "expected" version that should be running.
* Returns 'unknown' on ENOENT/EBUSY (shutdown race condition, fix #1042).
*/
export function getInstalledPluginVersion(): string {
const packageJsonPath = path.join(MARKETPLACE_ROOT, 'package.json');
const packageJson = JSON.parse(readFileSync(packageJsonPath, 'utf-8'));
return packageJson.version;
try {
const packageJsonPath = path.join(MARKETPLACE_ROOT, 'package.json');
const packageJson = JSON.parse(readFileSync(packageJsonPath, 'utf-8'));
return packageJson.version;
} catch (error: unknown) {
const code = (error as NodeJS.ErrnoException).code;
if (code === 'ENOENT' || code === 'EBUSY') {
logger.debug('SYSTEM', 'Could not read plugin version (shutdown race)', { code });
return 'unknown';
}
throw error;
}
}
/**
@@ -155,8 +165,8 @@ export async function checkVersionMatch(port: number): Promise<VersionCheckResul
const pluginVersion = getInstalledPluginVersion();
const workerVersion = await getRunningWorkerVersion(port);
// If we can't get worker version, assume it matches (graceful degradation)
if (!workerVersion) {
// If either version is unknown/null, assume match (graceful degradation, fix #1042)
if (!workerVersion || pluginVersion === 'unknown') {
return { matches: true, pluginVersion, workerVersion };
}
+59 -22
View File
@@ -10,7 +10,7 @@
import path from 'path';
import { homedir } from 'os';
import { existsSync, writeFileSync, readFileSync, unlinkSync, mkdirSync, rmSync } from 'fs';
import { existsSync, writeFileSync, readFileSync, unlinkSync, mkdirSync, rmSync, statSync, utimesSync } from 'fs';
import { exec, execSync, spawn } from 'child_process';
import { promisify } from 'util';
import { logger } from '../../utils/logger.js';
@@ -54,7 +54,8 @@ function lookupBinaryInPath(binaryName: string, platform: NodeJS.Platform): stri
try {
const output = execSync(command, {
stdio: ['ignore', 'pipe', 'ignore'],
encoding: 'utf-8'
encoding: 'utf-8',
windowsHide: true
});
const firstMatch = output
@@ -191,10 +192,10 @@ export async function getChildProcesses(parentPid: number): Promise<number[]> {
}
try {
// PowerShell Get-Process instead of WMIC (deprecated in Windows 11)
const cmd = `powershell -NoProfile -NonInteractive -Command "Get-Process | Where-Object { $_.ParentProcessId -eq ${parentPid} } | Select-Object -ExpandProperty Id"`;
const { stdout } = await execAsync(cmd, { timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND });
// PowerShell outputs just numbers (one per line), simpler than WMIC's "ProcessId=1234" format
// Use WQL -Filter to avoid $_ pipeline syntax that breaks in Git Bash (#1062, #1024).
// Get-CimInstance with server-side filtering is also more efficient than piping through Where-Object.
const cmd = `powershell -NoProfile -NonInteractive -Command "Get-CimInstance Win32_Process -Filter 'ParentProcessId=${parentPid}' | Select-Object -ExpandProperty ProcessId"`;
const { stdout } = await execAsync(cmd, { timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND, windowsHide: true });
return stdout
.split('\n')
.map(line => line.trim())
@@ -223,7 +224,7 @@ export async function forceKillProcess(pid: number): Promise<void> {
try {
if (process.platform === 'win32') {
// /T kills entire process tree, /F forces termination
await execAsync(`taskkill /PID ${pid} /T /F`, { timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND });
await execAsync(`taskkill /PID ${pid} /T /F`, { timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND, windowsHide: true });
} else {
process.kill(pid, 'SIGKILL');
}
@@ -315,13 +316,14 @@ export async function cleanupOrphanedProcesses(): Promise<void> {
try {
if (isWindows) {
// Windows: Use PowerShell Get-CimInstance with JSON output for age filtering
const patternConditions = ORPHAN_PROCESS_PATTERNS
.map(p => `$_.CommandLine -like '*${p}*'`)
.join(' -or ');
// Windows: Use WQL -Filter for server-side filtering (no $_ pipeline syntax).
// Avoids Git Bash $_ interpretation (#1062) and PowerShell syntax errors (#1024).
const wqlPatternConditions = ORPHAN_PROCESS_PATTERNS
.map(p => `CommandLine LIKE '%${p}%'`)
.join(' OR ');
const cmd = `powershell -NoProfile -NonInteractive -Command "Get-CimInstance Win32_Process | Where-Object { (${patternConditions}) -and $_.ProcessId -ne ${currentPid} } | Select-Object ProcessId, CreationDate | ConvertTo-Json"`;
const { stdout } = await execAsync(cmd, { timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND });
const cmd = `powershell -NoProfile -NonInteractive -Command "Get-CimInstance Win32_Process -Filter '(${wqlPatternConditions}) AND ProcessId != ${currentPid}' | Select-Object ProcessId, CreationDate | ConvertTo-Json"`;
const { stdout } = await execAsync(cmd, { timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND, windowsHide: true });
if (!stdout.trim() || stdout.trim() === 'null') {
logger.debug('SYSTEM', 'No orphaned claude-mem processes found (Windows)');
@@ -406,7 +408,7 @@ export async function cleanupOrphanedProcesses(): Promise<void> {
continue;
}
try {
execSync(`taskkill /PID ${pid} /T /F`, { timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND, stdio: 'ignore' });
execSync(`taskkill /PID ${pid} /T /F`, { timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND, stdio: 'ignore', windowsHide: true });
} catch (error) {
// [ANTI-PATTERN IGNORED]: Cleanup loop - process may have exited, continue to next PID
logger.debug('SYSTEM', 'Failed to kill process, may have already exited', { pid }, error as Error);
@@ -451,12 +453,14 @@ export async function aggressiveStartupCleanup(): Promise<void> {
try {
if (isWindows) {
const patternConditions = allPatterns
.map(p => `$_.CommandLine -like '*${p}*'`)
.join(' -or ');
// Use WQL -Filter for server-side filtering (no $_ pipeline syntax).
// Avoids Git Bash $_ interpretation (#1062) and PowerShell syntax errors (#1024).
const wqlPatternConditions = allPatterns
.map(p => `CommandLine LIKE '%${p}%'`)
.join(' OR ');
const cmd = `powershell -NoProfile -NonInteractive -Command "Get-CimInstance Win32_Process | Where-Object { (${patternConditions}) -and $_.ProcessId -ne ${currentPid} } | Select-Object ProcessId, CommandLine, CreationDate | ConvertTo-Json"`;
const { stdout } = await execAsync(cmd, { timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND });
const cmd = `powershell -NoProfile -NonInteractive -Command "Get-CimInstance Win32_Process -Filter '(${wqlPatternConditions}) AND ProcessId != ${currentPid}' | Select-Object ProcessId, CommandLine, CreationDate | ConvertTo-Json"`;
const { stdout } = await execAsync(cmd, { timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND, windowsHide: true });
if (!stdout.trim() || stdout.trim() === 'null') {
logger.debug('SYSTEM', 'No orphaned claude-mem processes found (Windows)');
@@ -549,7 +553,7 @@ export async function aggressiveStartupCleanup(): Promise<void> {
for (const pid of pidsToKill) {
if (!Number.isInteger(pid) || pid <= 0) continue;
try {
execSync(`taskkill /PID ${pid} /T /F`, { timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND, stdio: 'ignore' });
execSync(`taskkill /PID ${pid} /T /F`, { timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND, stdio: 'ignore', windowsHide: true });
} catch (error) {
logger.debug('SYSTEM', 'Failed to kill process, may have already exited', { pid }, error as Error);
}
@@ -699,10 +703,10 @@ export function spawnDaemon(
*
* EPERM is treated as "alive" because it means the process exists but
* belongs to a different user/session (common in multi-user setups).
* PID 0 (Windows WMIC sentinel for unknown PID) is treated as alive.
* PID 0 (Windows sentinel for unknown PID) is treated as alive.
*/
export function isProcessAlive(pid: number): boolean {
// PID 0 is the Windows WMIC sentinel value — process was spawned but PID unknown
// PID 0 is the Windows sentinel value — process was spawned but PID unknown
if (pid === 0) return true;
// Invalid PIDs are not alive
@@ -720,6 +724,39 @@ export function isProcessAlive(pid: number): boolean {
}
}
/**
* Check if the PID file was written recently (within thresholdMs).
*
* Used to coordinate restarts across concurrent sessions: if the PID file
* was recently written, another session likely just restarted the worker.
* Callers should poll /api/health instead of attempting their own restart.
*
* @param thresholdMs - Maximum age in ms to consider "recent" (default: 15000)
* @returns true if the PID file exists and was modified within thresholdMs
*/
export function isPidFileRecent(thresholdMs: number = 15000): boolean {
try {
const stats = statSync(PID_FILE);
return (Date.now() - stats.mtimeMs) < thresholdMs;
} catch {
return false;
}
}
/**
* Touch the PID file to update its mtime without changing contents.
* Used after a restart to signal other sessions that a restart just completed.
*/
export function touchPidFile(): void {
try {
if (!existsSync(PID_FILE)) return;
const now = new Date();
utimesSync(PID_FILE, now, now);
} catch {
// Best-effort — failure to touch doesn't affect correctness
}
}
/**
* Read the PID file and remove it if the recorded process is dead (stale).
*