claude-mem

Author	SHA1	Message	Date
Alex Newman	8e0e3ca109	fix: stop draining queue on /clear (remove SessionEnd shim) (#2136 ) * fix: stop draining queue on /clear (and on every other SessionEnd) The SessionEnd hook was wired to session-complete on Claude Code, Gemini CLI, the transcripts processor, the OpenCode plugin, and OpenClaw. All of those paths called POST /api/sessions/complete, which marked the session completed and abandoned every still-pending observation in the queue. So typing /clear (or logging out, or quitting) wiped in-flight work that the worker was perfectly happy to keep processing on its own. Removed the entire shim: - Deleted SessionEnd hook block in plugin/hooks/hooks.json - Deleted src/cli/handlers/session-complete.ts and its registry entry - Deleted POST /api/sessions/complete route + Zod schema in SessionRoutes - Removed call from transcripts processor handleSessionEnd - Removed call from opencode-plugin session.deleted handler - Removed Gemini SessionEnd → session-complete mapping - Removed openclaw scheduleSessionComplete + completionDelayMs + timer state - Updated tests + comments accordingly Explicit user-initiated deletion (DELETE /api/sessions/:id and POST /api/sessions/:sessionDbId/complete from the viewer UI) still works via SessionCompletionHandler.completeByDbId — that's the only path that should drain the queue. The worker self-completes via its SDK-agent generator's finally-block, so no external completion call is needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: clarify opencode-plugin session.deleted is in-memory cleanup only Greptile P2: file-level header still implied session.deleted called the worker. Now it only cleans up the local contentSessionIdsByOpenCodeSessionId map; worker self-completes via the SDK-agent generator finally-block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 17:08:35 -07:00
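Illustrative note on the commit above: the message says the worker "self-completes via its SDK-agent generator's finally-block", so no external completion call is needed. A minimal sketch of that pattern follows. All names here (`SessionState`, `processSession`, `drain`) are hypothetical and are not taken from the claude-mem source; the real worker is asynchronous and considerably more involved.

```typescript
// Hypothetical sketch: a generator that owns session completion.
// None of these names come from the claude-mem codebase.
interface Observation {
  id: number;
  text: string;
}

interface SessionState {
  pending: Observation[]; // queue of not-yet-processed observations
  processed: number[];    // ids handled so far
  completed: boolean;     // set only by the generator itself
}

function createSession(pending: Observation[]): SessionState {
  return { pending: [...pending], processed: [], completed: false };
}

// The generator drains the queue one item at a time. The `finally` block
// runs whether the loop ends normally, throws, or the consumer calls
// `.return()`, so completion never depends on an outside caller.
function* processSession(session: SessionState): Generator<number, void, void> {
  try {
    while (session.pending.length > 0) {
      const next = session.pending.shift() as Observation;
      session.processed.push(next.id);
      yield next.id;
    }
  } finally {
    session.completed = true;
  }
}

// Consume the generator to exhaustion and collect the processed ids.
function drain(session: SessionState): number[] {
  return [...processSession(session)];
}
```

In this sketch a SessionEnd-style event has nothing to call: the queue is only consumed by the generator, and the session is marked complete as a side effect of the generator finishing.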
Alex Newman	703c64c756	v12.4.3: one-time pollution cleanup migration + v12.4.1/v12.4.2 fixes (#2133 ) * fix: 5 trivial bugs from v12.4.1 issue triage - #2092: emit CJS-safe banner (no import.meta.url) in worker-service.cjs - #2100: PreToolUse Read hook timeout 2000s → 60s - #2131: add "shell": "bash" to every hook for Windows compat - #2132: Antigravity dir typo .agent → .agents - #2088: clear inherited MCP servers in worker SDK query() calls Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: stop context overflow loop + block task-notification leak - SDKAgent: clear memorySessionId on "prompt is too long" so crash-recovery starts a fresh SDK session instead of resuming the same poisoned context forever (was producing 68+ failed pending_messages on a single stuck session in the wild) - tag-stripping: new isInternalProtocolPayload() predicate; session-init hook + SessionRoutes both skip storage when entire prompt is one of Claude Code's autonomous protocol blocks (currently <task-notification>; conservative deny-list — does NOT touch <command-name>/<command-message> which wrap real user slash-commands) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version to 12.4.2 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: update CHANGELOG.md for v12.4.2 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cleanup): one-time v12.4.3 migration purges observer-sessions and stuck pending_messages Adds CleanupV12_4_3 module that runs once per data dir on worker startup (after migrations apply, before Chroma backfill). 
Drops accumulated pollution that v12.4.0 (observer-sessions filter) and v12.4.2 (context-overflow guard + task-notification leak block) prevent from recurring: - DELETE FROM sdk_sessions WHERE project='observer-sessions' (cascades to user_prompts, observations, session_summaries via existing FK ON DELETE CASCADE) - DELETE FROM pending_messages stuck in 'failed'/'processing' for any session with >=10 such rows (poisoned chains from the pre-v12.4.2 retry loop; threshold spares legitimate transient failures) - Wipes ~/.claude-mem/chroma and chroma-sync-state.json so backfillAllProjects rebuilds the vector store from cleaned SQLite Pre-flight checks free disk (1.2x DB size + 100MB) via fs.statfsSync; backs up via VACUUM INTO with copyFileSync fallback; PRAGMA foreign_keys=ON on the cleanup connection (off by default in bun:sqlite). Marker file ~/.claude-mem/.cleanup-v12.4.3-applied records backup path and counts. Opt-out via CLAUDE_MEM_SKIP_CLEANUP_V12_4_3=1. Verified locally: 311MB DB backed up to 277MB in 943ms; 11 observer sessions + 3 cascade rows + 141 stuck pending_messages purged; chroma rebuilt via backfill. Total cleanup time 1.1s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: address PR #2133 code review - SessionRoutes: check isInternalProtocolPayload before stripping tags so internal protocol prompts skip the strip work entirely. - tag-stripping: bound isInternalProtocolPayload input length to 256KB to prevent ReDoS-class scans on malformed unclosed tags. - SDKAgent: extract resetSessionForFreshStart helper; both context-overflow paths now share one nullification routine. - worker-service: drop the per-startup "Checking for one-time v12.4.3 cleanup" info log — runs every boot even after marker exists; the function already logs at debug/warn when relevant. - tests: add isInternalProtocolPayload edge cases (whitespace, attributes, partial tags, unrelated tags, oversize input). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: address Greptile P2 comments on PR #2133 CleanupV12_4_3.ts: derive backup directory and restore-hint path from effectiveDataDir instead of the module-level BACKUPS_DIR/DB_PATH constants. The dataDirectory override is meant for test isolation; the prior version still wrote backups to the production directory. SessionRoutes.ts: move isInternalProtocolPayload guard to the top of handleSessionInitByClaudeId, before createSDKSession. The previous position blocked the user_prompts insert but still created an empty sdk_sessions row, asymmetric with the hook-layer guard in session-init.ts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cleanup): retry on disk-skip; survive chroma wipe failure CodeRabbit Major + Claude review: - Disk pre-flight skip no longer writes the marker. A user temporarily low on disk would otherwise have the cleanup permanently disabled even after freeing space. Retry on next startup instead. - Wrap wipeChromaArtifacts in try/catch and write the marker even on failure (with chromaWipeError captured). Without this, an rmSync permission failure on chroma/ left writeMarker unreached, so every subsequent boot re-ran the SQL purge AND created a fresh backup, consuming disk indefinitely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cleanup): close backup handle before copyFileSync fallback Claude review: - backupDb is now closed before falling into the copyFileSync fallback. On Windows an open SQLite handle holds a file lock that can prevent the fallback copy from reading the source. The previous version only closed after both branches completed. - Add empty-body <task-notification></task-notification> case to the isInternalProtocolPayload tests for completeness. 
Cascade-row count queries already match the actual FK columns (content_session_id for user_prompts, memory_session_id for observations / session_summaries) — no fix needed there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cleanup): accurate session count + add migration tests Claude review v3: session-init.ts: filter on rawPrompt before the [media prompt] substitution. Functionally equivalent but explicit — the check no longer depends on the substitution leaving real protocol payloads untouched. CleanupV12_4_3.ts: counts.observerSessions now comes from a pre-DELETE COUNT(), not from result.changes. bun:sqlite inflates result.changes with FTS-trigger and cascade row counts (the user_prompts_fts triggers inflate a 3-session purge to 19 changes). The previous code logged a misleading total and wrote it to the marker. tests/infrastructure/cleanup-v12_4_3.test.ts: happy-path coverage of the migration against a real on-disk SQLite under a tmpdir. Verifies observer-session purge with cascades, stuck pending_messages purge, chroma artifact wipe, marker payload shape, idempotency on re-run, and CLAUDE_MEM_SKIP_CLEANUP_V12_4_3 opt-out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(protocol-filter): close two-block false positive; address review CodeRabbit + Claude review v5: tag-stripping.ts: PROTOCOL_ONLY_REGEX rewritten with a negative-lookahead body so a prompt like "<task-notification>x</task-notification> hi <task-notification>y</task-notification>" no longer matches as a single outer block — the prior greedy [\s\S]* spanned the middle user text and would have silently dropped a real prompt. Confirmed via probe. tag-stripping.test.ts: drop the 50ms wall-clock assertion (CI flake); add the two-block-with-text case as a regression test. SessionRoutes.ts: filter on req.body.prompt directly, before the [media prompt] substitution and 256KB truncation. 
Mirrors the session-init.ts hook-layer ordering and ensures a protocol payload that happens to be near the byte limit isn't truncated before the filter runs. cleanup-v12_4_3.test.ts: add stuckCount=9 below-threshold case verifying pending_messages with <10 stuck rows are preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cleanup): include WAL/SHM in backup fallback; safer rollback CodeRabbit Major + Claude review v6: CleanupV12_4_3.ts: when VACUUM INTO fails and copyFileSync runs, also copy any -wal/-shm sidecars. The DB is configured WAL mode, so recent committed pages can live in those files; copying only the .db would miss them. VACUUM INTO already captures everything in one file, so the happy path is unaffected. CleanupV12_4_3.ts: wrap ROLLBACK in try/catch so a no-op rollback (SQLite already rolled back on a constraint failure) cannot shadow the original purge error. SDKAgent.ts: align both context-overflow log levels to error. Both branches are fatal-recovery paths; the previous warn/error split was inconsistent and made the throw branch easy to miss in logs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: pre-count stuck pending_messages; document adjacent-block fall-through Claude review v7: CleanupV12_4_3.ts: runStuckPendingPurge now uses a SELECT COUNT(*) before the DELETE, matching the pattern in runObserverSessionsPurge. result.changes is reliable today (no FTS on pending_messages) but the explicit count protects against future schema additions, and keeps the two purges symmetric. tag-stripping.test.ts: add test documenting that adjacent protocol blocks (no user text between) deliberately fall through to storage. The deny-list is per-block; concatenations are out of scope. 
Skipped per project rules / Node API constraints: - frsize fallback in disk check: Node/Bun StatFs doesn't expose frsize - VACUUM-INTO comment: comment-only suggestion - Overflow string constant extraction: low value Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 16:30:34 -07:00
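Illustrative note on the commit above: the protocol filter is described as a length-bounded, whole-prompt match whose body uses a negative lookahead, so that two `<task-notification>` blocks (with or without user text between them) are not treated as one outer block. The sketch below is one way such a predicate could look. The regex, the constant name, and the choice to treat oversized input as "not a protocol payload" are assumptions for illustration, not the project's actual `PROTOCOL_ONLY_REGEX` or `isInternalProtocolPayload`.

```typescript
// Hypothetical sketch; the real regex and limits in tag-stripping.ts may differ.
const MAX_SCAN_LENGTH = 256 * 1024; // assumption: measured in UTF-16 code units

// One <task-notification> block spanning the entire prompt.
// The body is a "tempered" repetition: each character is consumed only if a
// closing tag does not start at that position, so the body cannot run past
// the first </task-notification>.
const PROTOCOL_ONLY =
  /^\s*<task-notification(?:\s[^>]*)?>(?:(?!<\/task-notification>)[\s\S])*<\/task-notification>\s*$/;

function isProtocolOnly(prompt: string): boolean {
  // Assumption: oversized input is stored rather than scanned.
  if (prompt.length > MAX_SCAN_LENGTH) return false;
  return PROTOCOL_ONLY.test(prompt);
}
```

With this shape, a single block (including an empty body or attributes on the opening tag) matches, while a prompt with user text between two blocks, two adjacent blocks, a partial tag, or an unrelated tag such as `<command-name>` does not match and would fall through to storage.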
Alex Newman	94d592f212	perf: streamline worker startup and consolidate database connections (#2122 ) * docs: pathfinder refactor corpus + Node 20 preflight Adds the PATHFINDER-2026-04-22 principle-driven refactor plan (11 docs, cross-checked PASS) plus the exploratory PATHFINDER-2026-04-21 corpus that motivated it. Bumps engines.node to >=20.0.0 per the ingestion-path plan preflight (recursive fs.watch). Adds the pathfinder skill. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: land PATHFINDER Plan 01 — data integrity Schema, UNIQUE constraints, self-healing claim, Chroma upsert fallback. - Phase 1: fresh schema.sql regenerated at post-refactor shape. - Phase 2: migrations 23+24 — rebuild pending_messages without started_processing_at_epoch; UNIQUE(session_id, tool_use_id); UNIQUE(memory_session_id, content_hash) on observations; dedup duplicate rows before adding indexes. - Phase 3: claimNextMessage rewritten to self-healing query using worker_pid NOT IN live_worker_pids; STALE_PROCESSING_THRESHOLD_MS and the 60-s stale-reset block deleted. - Phase 4: DEDUP_WINDOW_MS and findDuplicateObservation deleted; observations.insert now uses ON CONFLICT DO NOTHING. - Phase 5: failed-message purge block deleted from worker-service 2-min interval; clearFailedOlderThan method deleted. - Phase 6: repairMalformedSchema and its Python subprocess repair path deleted from Database.ts; SQLite errors now propagate. - Phase 7: Chroma delete-then-add fallback gated behind CHROMA_SYNC_FALLBACK_ON_CONFLICT env flag as bridge until Chroma MCP ships native upsert. - Phase 8: migration 19 no-op block absorbed into fresh schema.sql. Verification greps all return 0 matches. bun test tests/sqlite/ passes 63/63. bun run build succeeds. Plan: PATHFINDER-2026-04-22/01-data-integrity.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: land PATHFINDER Plan 02 — process lifecycle OS process groups replace hand-rolled reapers. 
Worker runs until killed; orphans are prevented by detached spawn + kill(-pgid). - Phase 1: src/services/worker/ProcessRegistry.ts DELETED. The canonical registry at src/supervisor/process-registry.ts is the sole survivor; SDK spawn site consolidated into it via new createSdkSpawnFactory/spawnSdkProcess/getSdkProcessForSession/ ensureSdkProcessExit/waitForSlot helpers. - Phase 2: SDK children spawn with detached:true + stdio: ['ignore','pipe','pipe']; pgid recorded on ManagedProcessInfo. - Phase 3: shutdown.ts signalProcess teardown uses process.kill(-pgid, signal) on Unix when pgid is recorded; Windows path unchanged (tree-kill/taskkill). - Phase 4: all reaper intervals deleted — startOrphanReaper call, staleSessionReaperInterval setInterval (including the co-located WAL checkpoint — SQLite's built-in wal_autocheckpoint handles WAL growth without an app-level timer), killIdleDaemonChildren, killSystemOrphans, reapOrphanedProcesses, reapStaleSessions, and detectStaleGenerator. MAX_GENERATOR_IDLE_MS and MAX_SESSION_IDLE_MS constants deleted. - Phase 5: abandonedTimer — already 0 matches; primary-path cleanup via generatorPromise.finally() already lives in worker-service startSessionProcessor and SessionRoutes ensureGeneratorRunning. - Phase 6: evictIdlestSession and its evict callback deleted from SessionManager. Pool admission gates backpressure upstream. - Phase 7: SDK-failure fallback — SessionManager has zero matches for fallbackAgent/Gemini/OpenRouter. Failures surface to hooks via exit code 2 through SessionRoutes error mapping. - Phase 8: ensureWorkerRunning in worker-utils.ts rewritten to lazy-spawn — consults isWorkerPortAlive (which gates captureProcessStartToken for PID-reuse safety via commit `99060bac`), then spawns detached with unref(), then waitForWorkerPort({ attempts: 3, backoffMs: 250 }) hand-rolled exponential backoff 250→500→1000ms. No respawn npm dep. 
- Phase 9: idle self-shutdown — zero matches for idleCheck/idleTimeout/IDLE_MAX_MS/idleShutdown. Worker exits only on external SIGTERM via supervisor signal handlers. Three test files that exercised deleted code removed: tests/worker/process-registry.test.ts, tests/worker/session-lifecycle-guard.test.ts, tests/services/worker/reap-stale-sessions.test.ts. Pass count: 1451 → 1407 (-44), all attributable to deleted test files. Zero new failures. 31 pre-existing failures remain (schema-repair suite, logger-usage-standards, environmental openclaw / plugin-distribution) — none introduced by Plan 02. All 10 verification greps return 0. bun run build succeeds. Plan: PATHFINDER-2026-04-22/02-process-lifecycle.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: land PATHFINDER Plan 04 (narrowed) — search fail-fast Phases 3, 5, 6 only. Plan-doc inaccuracies for phases 1/2/4/7/8/9 deferred for plan reconciliation: - Phase 1/2: ObservationRow type doesn't exist; the four "formatters" operate on three incompatible types. - Phase 4: RECENCY_WINDOW_MS already imported from SEARCH_CONSTANTS at every call site. - Phase 7: getExistingChromaIds is NOT @deprecated and has an active caller in ChromaSync.backfillMissingSyncs. - Phase 8: estimateTokens already consolidated. - Phase 9: knowledge-corpus rewrite blocked on PG-3 prompt-caching cost smoke test. Phase 3 — Delete SearchManager.findByConcept/findByFile/findByType. SearchRoutes handlers (handleSearchByConcept/File/Type) now call searchManager.getOrchestrator().findByXxx() directly via new getter accessors on SearchManager. ~250 LoC deleted. Phase 5 — Fail-fast Chroma. Created src/services/worker/search/errors.ts with ChromaUnavailableError extends AppError(503, 'CHROMA_UNAVAILABLE'). Deleted SearchOrchestrator.executeWithFallback's Chroma-failed SQLite-fallback branch; runtime Chroma errors now throw 503. 
"Path 3" (chromaSync was null at construction — explicit- uninitialized config) preserved as legitimate empty-result state per plan text. ChromaSearchStrategy.search no longer wraps in try/catch — errors propagate. Phase 6 — Delete HybridSearchStrategy three try/catch silent fallback blocks (findByConcept, findByType, findByFile) at lines ~82-95, ~120-132, ~161-172. Removed `fellBack` field from StrategySearchResult type and every return site (SQLiteSearchStrategy, BaseSearchStrategy.emptyResult, SearchOrchestrator). Tests updated (Principle 7 — delete in same PR): - search-orchestrator.test.ts: "fall back to SQLite" rewritten as "throw ChromaUnavailableError (HTTP 503)". - chroma/hybrid/sqlite-search-strategy tests: rewritten to rejects.toThrow; removed fellBack assertions. Verification: SearchManager.findBy → 0; fellBack → 0 in src/. bun test tests/worker/search/ → 122 pass, 0 fail. bun test (suite-wide) → 1407 pass, baseline maintained, 0 new failures. bun run build succeeds. Plan: PATHFINDER-2026-04-22/04-read-path.md (Phases 3, 5, 6) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: land PATHFINDER Plan 03 — ingestion path Fail-fast parser, direct in-process ingest, recursive fs.watch, DB-backed tool pairing. Worker-internal HTTP loopback eliminated. - Phase 0: Created src/services/worker/http/shared.ts exporting ingestObservation/ingestPrompt/ingestSummary as direct in-process functions plus ingestEventBus (Node EventEmitter, reusing existing pattern — no third event bus introduced). setIngestContext wires the SessionManager dependency from worker-service constructor. - Phase 1: src/sdk/parser.ts collapsed to one parseAgentXml returning { valid:true; kind: 'observation'\|'summary'; data } \| { valid:false; reason: string }. Inspects root element; <skip_summary reason="…"/> is a first-class summary case with skipped:true. NEVER returns undefined. NEVER coerces. 
- Phase 2: ResponseProcessor calls parseAgentXml exactly once, branches on the discriminated union. On invalid → markFailed + logger.warn(reason). On observation → ingestObservation. On summary → ingestSummary then emit summaryStoredEvent { sessionId, messageId } (consumed by Plan 05's blocking /api/session/end). - Phase 3: Deleted consecutiveSummaryFailures field (ResponseProcessor + SessionManager + worker-types) and MAX_CONSECUTIVE_SUMMARY_FAILURES constant. Circuit-breaker guards and "tripped" log lines removed. - Phase 4: coerceObservationToSummary deleted from sdk/parser.ts. - Phase 5: src/services/transcripts/watcher.ts rescan setInterval replaced with fs.watch(transcriptsRoot, { recursive: true, persistent: true }) — Node 20+ recursive mode. - Phase 6: src/services/transcripts/processor.ts pendingTools Map deleted. tool_use rows insert with INSERT OR IGNORE on UNIQUE(session_id, tool_use_id) (added by Plan 01). New pairToolUsesByJoin query in PendingMessageStore for read-time pairing (UNIQUE INDEX provides idempotency; explicit consumer not yet wired). - Phase 7: HTTP loopback at processor.ts:252 replaced with direct ingestObservation call. maybeParseJson silent-passthrough rewritten to fail-fast (throws on malformed JSON). - Phase 8: src/utils/tag-stripping.ts countTags + stripTagsInternal collapsed into one alternation regex, single-pass over input. - Phase 9: src/utils/transcript-parser.ts (dead TranscriptParser class) deleted. The active extractLastMessage at src/shared/transcript-parser.ts:41-144 is the sole survivor. Tests updated (Principle 7 — same-PR delete): - tests/sdk/parser.test.ts + parse-summary.test.ts: rewritten to assert discriminated-union shape; coercion-specific scenarios collapse into { valid:false } assertions. - tests/worker/agents/response-processor.test.ts: circuit-breaker describe block skipped; non-XML/empty-response tests assert fail-fast markFailed behavior. Verification: every grep returns 0. transcript-parser.ts deleted. 
bun run build succeeds. bun test → 1399 pass / 28 fail / 7 skip (net -8 pass = the 4 retired circuit-breaker tests + 4 collapsed parser cases). Zero new failures vs baseline. Deferred (out of Plan 03 scope, will land in Plan 06): SessionRoutes HTTP route handlers still call sessionManager.queueObservation inline rather than the new shared helpers — the helpers are ready, the route swap is mechanical and belongs with the Zod refactor. Plan: PATHFINDER-2026-04-22/03-ingestion-path.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: land PATHFINDER Plan 05 — hook surface Worker-call plumbing collapsed to one helper. Polling replaced by server-side blocking endpoint. Fail-loud counter surfaces persistent worker outages via exit code 2. - Phase 1: plugin/hooks/hooks.json — three 20-iteration `for i in 1..20; do curl -sf .../health && break; sleep 0.1; done` shell retry wrappers deleted. Hook commands invoke their bun entry point directly. - Phase 2: src/shared/worker-utils.ts — added executeWithWorkerFallback<T>(url, method, body) returning T \| { continue: true; reason?: string }. All 8 hook handlers (observation, session-init, context, file-context, file-edit, summarize, session-complete, user-message) rewritten to use it instead of duplicating the ensureWorkerRunning → workerHttpRequest → fallback sequence. - Phase 3: blocking POST /api/session/end in SessionRoutes.ts using validateBody + sessionEndSchema (z.object({sessionId})). One-shot ingestEventBus.on('summaryStoredEvent') listener, 30 s timer, req.aborted handler — all share one cleanup so the listener cannot leak. summarize.ts polling loop, plus MAX_WAIT_FOR_SUMMARY_MS / POLL_INTERVAL_MS constants, deleted. - Phase 4: src/shared/hook-settings.ts — loadFromFileOnce() memoizes SettingsDefaultsManager.loadFromFile per process. Per-handler settings reads collapsed. 
- Phase 5: src/shared/should-track-project.ts — single exclusion check entry; isProjectExcluded no longer referenced from src/cli/handlers/. - Phase 6: cwd validation pushed into adapter normalizeInput (all 6 adapters: claude-code, cursor, raw, gemini-cli, windsurf). New AdapterRejectedInput error in src/cli/adapters/errors.ts. Handler-level isValidCwd checks deleted from file-edit.ts and observation.ts. hook-command.ts catches AdapterRejectedInput → graceful fallback. - Phase 7: session-init.ts conditional initAgent guard deleted; initAgent is idempotent. tests/hooks/context-reinjection-guard test (validated the deleted conditional) deleted in same PR per Principle 7. - Phase 8: fail-loud counter at ~/.claude-mem/state/hook-failures .json. Atomic write via .tmp + rename. CLAUDE_MEM_HOOK_FAIL_LOUD _THRESHOLD setting (default 3). On consecutive worker-unreachable ≥ N: process.exit(2). On success: reset to 0. NOT a retry. - Phase 9: ensureWorkerAliveOnce() module-scope memoization wrapping ensureWorkerRunning. executeWithWorkerFallback calls the memoized version. Minimal validateBody middleware stub at src/services/worker/http/middleware/validateBody.ts. Plan 06 will expand with typed inference + error envelope conventions. Verification: 4/4 grep targets pass. bun run build succeeds. bun test → 1393 pass / 28 fail / 7 skip; -6 pass attributable solely to deleted context-reinjection-guard test file. Zero new failures vs baseline. Plan: PATHFINDER-2026-04-22/05-hook-surface.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: land PATHFINDER Plan 06 — API surface One Zod-based validator wrapping every POST/PUT. Rate limiter, diagnostic endpoints, and shutdown wrappers deleted. Failure- marking consolidated to one helper. - Phase 1 (preflight): zod@^3 already installed. 
- Phase 2: validateBody middleware confirmed at canonical shape in src/services/worker/http/middleware/validateBody.ts — safeParse → 400 { error: 'ValidationError', issues: [...] } on failure, replaces req.body with parsed value on success. - Phase 3: Per-route Zod schemas declared at the top of each route file. 24 POST endpoints across SessionRoutes, CorpusRoutes, DataRoutes, MemoryRoutes, SearchRoutes, LogsRoutes, SettingsRoutes now wrap with validateBody(). /api/session/end (Plan 05) confirmed using same middleware. - Phase 4: validateRequired() deleted from BaseRouteHandler along with every call site. Inline coercion helpers (coerceStringArray, coercePositiveInteger) and inline if (!req.body...) guards deleted across all route files. - Phase 5: Rate limiter middleware and its registration deleted from src/services/worker/http/middleware.ts. Worker binds 127.0.0.1:37777 — no untrusted caller. - Phase 6: viewer.html cached at module init in ViewerRoutes.ts via fs.readFileSync; served as Buffer with text/html content type. SKILL.md + per-operation .md files cached in Server.ts as Map<string, string>; loadInstructionContent helper deleted. NO fs.watch, NO TTL — process restart is the cache-invalidation event. - Phase 7: Four diagnostic endpoints deleted from DataRoutes.ts — /api/pending-queue (GET), /api/pending-queue/process (POST), /api/pending-queue/failed (DELETE), /api/pending-queue/all (DELETE). Helper methods that ONLY served them (getQueueMessages, getStuckCount, getRecentlyProcessed, clearFailed, clearAll) deleted from PendingMessageStore. KEPT: /api/processing-status (observability), /health (used by ensureWorkerRunning). - Phase 8: stopSupervisor wrapper deleted from supervisor/index.ts. GracefulShutdown now calls getSupervisor().stop() directly. 
Two functions retained with clear roles: - performGracefulShutdown — worker-side 6-step shutdown - runShutdownCascade — supervisor-side child teardown (process.kill(-pgid), Windows tree-kill, PID-file cleanup) Each has unique non-trivial logic and a single canonical caller. - Phase 9: transitionMessagesTo(status, filter) is the sole failure-marking path on PendingMessageStore. Old methods markSessionMessagesFailed and markAllSessionMessagesAbandoned deleted along with all callers (worker-service, SessionCompletionHandler, tests/zombie-prevention). Tests updated (Principle 7 same-PR delete): coercion test files refactored to chain validateBody → handler. Zombie-prevention tests rewritten to call transitionMessagesTo. Verification: all 4 grep targets → 0. bun run build succeeds. bun test → 1393 pass / 28 fail / 7 skip — exact match to baseline. Zero new failures. Plan: PATHFINDER-2026-04-22/06-api-surface.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: land PATHFINDER Plan 07 — dead code sweep ts-prune-driven sweep across the tree after Plans 01-06 landed. Deleted unused exports, orphan helpers, and one fully orphaned file. Earlier-plan deletions verified. Deleted: - src/utils/bun-path.ts (entire file — getBunPath, getBunPathOrThrow, isBunAvailable: zero importers) - bun-resolver.getBunVersionString: zero callers - PendingMessageStore.retryMessage / resetProcessingToPending / abortMessage: superseded by transitionMessagesTo (Plan 06 Phase 9) - EnvManager.MANAGED_CREDENTIAL_KEYS, EnvManager.setCredential: zero callers - CodexCliInstaller.checkCodexCliStatus: zero callers; no status command exists in npx-cli - Two "REMOVED: cleanupOrphanedSessions" stale-fence comments Kept (with documented justification): - Public API surface in dist/sdk/* (parseAgentXml, prompt builders, ParsedObservation, ParsedSummary, ParseResult, SUMMARY_MODE_MARKER) — exported via package.json sdk path. 
- generateContext / loadContextConfig / token utilities — used via dynamic await import('../../../context-generator.js') in worker SearchRoutes. - MCP_IDE_INSTALLERS, install/uninstall functions for codex/goose — used via dynamic await import in npx-cli/install.ts + uninstall.ts (ts-prune cannot trace dynamic imports). - getExistingChromaIds — active caller in ChromaSync.backfillMissingSyncs (Plan 04 narrowed scope). - processPendingQueues / getSessionsWithPendingMessages — active orphan-recovery caller in worker-service.ts plus zombie-prevention test coverage. - StoreAndMarkCompleteResult legacy alias — return-type annotation in same file. - All Database.ts barrel re-exports — used downstream. Earlier-plan verification: - Plan 03 Phase 9: VERIFIED — src/utils/transcript-parser.ts is gone; TranscriptParser has 0 references in src/. - Plan 01 Phase 8: VERIFIED — migration 19 no-op absorbed. - SessionStore.ts:52-70 consolidation NOT executed (deferred): the methods are not thin wrappers but ~900 LoC of bodies, and two methods are documented as intentional mirrors so the context-generator.cjs bundle stays schema-consistent without pulling MigrationRunner. Deserves its own plan, not a sweep. Verification: TranscriptParser → 0; transcript-parser.ts → gone; no commented-out code markers remain. bun run build succeeds. bun test → 1393 pass / 28 fail / 7 skip — EXACT match to baseline. Zero regressions. Plan: PATHFINDER-2026-04-22/07-dead-code.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: remove residual ProcessRegistry comment reference Plan 07 dead-code sweep missed one comment-level reference to the deleted in-memory ProcessRegistry class in SessionManager.ts:347. Rewritten to describe the supervisor.json scope without naming the deleted class, completing the verification grep target. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: address Greptile review (P1 + 2× P2) P1 — Plan 05 Phase 3 blocking endpoint was non-functional: executeWithWorkerFallback used HEALTH_CHECK_TIMEOUT_MS (3 s) for the POST /api/session/end call, but the server holds the connection for SERVER_SIDE_SUMMARY_TIMEOUT_MS (30 s). Client always raced to a "timed out" rejection that isWorkerUnavailable classified as worker-unreachable, so the hook silently degraded instead of waiting for summaryStoredEvent. - Added optional timeoutMs to executeWithWorkerFallback, forwarded to workerHttpRequest. - summarize.ts call site now passes 35_000 (5 s above server hold window). P2 — ingestSummary({ kind: 'parsed' }) branch was dead code: ResponseProcessor emitted summaryStoredEvent directly via the event bus, bypassing the centralized helper that the comment claimed was the single source. - ResponseProcessor now calls ingestSummary({ kind: 'parsed', sessionDbId, messageId, contentSessionId, parsed }) so the event-emission path is single-sourced. - ingestSummary's requireContext() resolution moved inside the 'queue' branch (the only branch that needs sessionManager / dbManager). 'parsed' is a pure event-bus emission and doesn't need worker-internal context — fixes mocked ResponseProcessor unit tests that don't call setIngestContext. P2 — isWorkerFallback could false-positive on legitimate API responses whose schema includes { continue: true, ... }: - Added a Symbol.for('claude-mem/worker-fallback') brand to WorkerFallback. isWorkerFallback now checks the brand, not a duck-typed property name. Verification: bun run build succeeds. bun test → 1393 pass / 28 fail / 7 skip — exact baseline match. Zero new failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: address Greptile iteration 2 (P1 + P2) P1 — summaryStoredEvent fired regardless of whether the row was persisted. 
ResponseProcessor's call to ingestSummary({ kind: 'parsed' }) ran for every parsed.kind === 'summary' even when result.summaryId came back null (e.g. FK violation, null memory_session_id at commit). The blocking /api/session/end endpoint then returned { ok: true } and the Stop hook logged 'Summary stored' for a non-existent row. - Gate ingestSummary call on (parsed.data.skipped \|\| session.lastSummaryStored). Skipped summaries are an explicit no-op bypass and still confirm; real summaries only confirm when storage actually wrote a row. - Non-skipped + summaryId === null path logs a warn and lets the server-side timeout (504) surface to the hook instead of a false ok:true. P2 — PendingMessageStore.enqueue() returns 0 when INSERT OR IGNORE suppresses a duplicate (the UNIQUE(session_id, tool_use_id) constraint added by Plan 01 Phase 1). The two callers (SessionManager.queueObservation and queueSummarize) previously logged 'ENQUEUED messageId=0' which read like a row was inserted. - Branch on messageId === 0 and emit a 'DUP_SUPPRESSED' debug log instead of the misleading ENQUEUED line. No behavior change — the duplicate is still correctly suppressed by the DB (Principle 3); only the log surface is corrected. - confirmProcessed is never called with the enqueue() return value (it operates on session.processingMessageIds[] from claimNextMessage), so no caller is broken; the visibility fix prevents future misuse. Verification: bun run build succeeds. bun test → 1393 pass / 28 fail / 7 skip — exact baseline match. Zero new failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: address Greptile iteration 3 (P1 + 2× P2) - P1 worker-service.ts: wire ensureGeneratorRunning into the ingest context after SessionRoutes is constructed. setIngestContext runs before routes exist, so transcript-watcher observations queued via ingestObservation() had no way to auto-start the SDK generator. Added attachIngestGeneratorStarter() to patch the callback in. 
- P2 shared.ts: IngestEventBus now sets maxListeners to 0. Concurrent /api/session/end calls register one listener each and clean up on completion, so the default-10 warning fires spuriously under normal load. - P2 SessionRoutes.ts: handleObservationsByClaudeId now delegates to ingestObservation() instead of duplicating skip-tool / meta / privacy / queue logic. Single helper, matching the Plan 03 goal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: address Greptile iteration 4 (P1 tool-pair + P2 parse/path/doc) - processor.handleToolResult: restore in-memory tool-use→tool-result pairing via session.pendingTools for schemas (e.g. Codex) whose tool_result events carry only tool_use_id + output. Without this, neither handler fired — all tool observations silently dropped. - processor.maybeParseJson: return raw string on parse failure instead of throwing. Previously a single malformed JSON-shaped field caused handleLine's outer catch to discard the entire transcript line. - watcher.deepestNonGlobAncestor: split on / and \\, emit empty string for purely-glob inputs so the caller skips the watch instead of anchoring fs.watch at the filesystem root. Windows-compatible. - PendingMessageStore.enqueue: tighten docstring — callers today only log on the returned id; the SessionManager branches on id === 0. * fix: forward tool_use_id through ingestObservation (Greptile iter 5) P1 — Plan 01's UNIQUE(content_session_id, tool_use_id) dedup never fired because the new shared ingest path dropped the toolUseId before queueObservation. SQLite treats NULL values as distinct for UNIQUE, so every replayed transcript line landed a duplicate row. - shared.ingestObservation: forward payload.toolUseId to queueObservation so INSERT OR IGNORE can actually collapse. - SessionRoutes.handleObservationsByClaudeId: destructure both tool_use_id (HTTP convention) and toolUseId (JS convention) from req.body and pass into ingestObservation. 
- observationsByClaudeIdSchema: declare both keys explicitly so the validator doesn't rely on .passthrough() alone. * fix: drop dead pairToolUsesByJoin, close session-end listener race - PendingMessageStore: delete pairToolUsesByJoin. The method was never called and its self-join semantics are structurally incompatible with UNIQUE(content_session_id, tool_use_id): INSERT OR IGNORE collapses any second row with the same pair, so a self-join can only ever match a row to itself. In-memory pendingTools in processor.ts remains the pairing path for split-event schemas. - IngestEventBus: retain a short-lived (60s) recentStored map keyed by sessionId. Populated on summaryStoredEvent emit, evicted on consume or TTL. - handleSessionEnd: drain the recent-events buffer before attaching the listener. Closes the register-after-emit race where the summary can persist between the hook's summarize POST and its session/end POST — previously that window returned 504 after the 30s timeout. * chore: merge origin/main into vivacious-teeth Resolves conflicts with 15 commits on main (v12.3.9, security observation types, Telegram notifier, PID-reuse worker start-guard). Conflict resolution strategy: - plugin/hooks/hooks.json, plugin/scripts/.cjs, plugin/ui/viewer-bundle.js: kept ours — PATHFINDER Plan 05 deletes the for-i-in-1-to-20 curl retry loops and the built artifacts regenerate on build. - src/cli/handlers/summarize.ts: kept ours — Plan 05 blocking POST /api/session/end supersedes main's fire-and-forget path. - src/services/worker-service.ts: kept ours — Plan 05 ingest bus + summaryStoredEvent supersedes main's SessionCompletionHandler DI refactor + orphan-reaper fallback. - src/services/worker/http/routes/SessionRoutes.ts: kept ours — same reason; generator .finally() Stop-hook self-clean is a guard for a path our blocking endpoint removes. 
- src/services/worker/http/routes/CorpusRoutes.ts: merged — added security_alert / security_note to ALLOWED_CORPUS_TYPES (feature from #2084) while preserving our Zod validateBody schema. Typecheck: 294 errors (vs 298 pre-merge). No new errors introduced; all remaining are pre-existing (Component-enum gaps, DOM lib for viewer, bun:sqlite types). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix: address Greptile P2 findings 1) SessionRoutes.handleSessionEnd was the only route handler not wrapped in wrapHandler — synchronous exceptions would hang the client rather than surfacing as 500s. Wrap it like every other handler. 2) processor.handleToolResult only consumed the session.pendingTools entry when the tool_result arrived without a toolName. In the split-schema path where tool_result carries both toolName and toolId, the entry was never deleted and the map grew for the life of the session. Consume the entry whenever toolId is present. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: typing cleanup and viewer tsconfig split for PR feedback - Add explicit return types for SessionStore query methods - Exclude src/ui/viewer from root tsconfig, give it its own DOM-typed config - Add bun to root tsconfig types, plus misc typing tweaks flagged by Greptile - Rebuilt plugin/scripts/* artifacts Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: address Greptile P2 findings (iter 2) - PendingMessageStore.transitionMessagesTo: require sessionDbId (drop the unscoped-drain branch that would nuke every pending/processing row across all sessions if a future caller omitted the filter). - IngestEventBus.takeRecentSummaryStored: make idempotent — keep the cached event until TTL eviction so a retried Stop hook's second /api/session/end returns immediately instead of hanging 30 s. 
- TranscriptWatcher fs.watch callback: skip full glob scan for paths already tailed (JSONL appends fire on every line; only unknown paths warrant a rescan). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: call finalizeSession in terminal session paths (Greptile iter 3) terminateSession and runFallbackForTerminatedSession previously called SessionCompletionHandler.finalizeSession before removeSessionImmediate; the refactor dropped those calls, leaving sdk_sessions.status='active' for every session killed by wall-clock limit, unrecoverable error, or exhausted fallback chain. The deleted reapStaleSessions interval was the only prior backstop. Re-wires finalizeSession (idempotent: marks completed, drains pending, broadcasts) into both paths; no reaper reintroduced. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: GC failed pending_messages rows at startup (Greptile iter 4) Plan 07 deleted clearFailed/clearFailedOlderThan as "dead code", but with the periodic sweep also removed, nothing reaps status='failed' rows now — they accumulate indefinitely. Since claimNextMessage's self-healing subquery scans this table, unbounded growth degrades claim latency over time. Re-introduces clearFailedOlderThan and calls it once at worker startup (not a reaper — one-shot, idempotent). 7-day retention keeps enough history for operator inspection while bounding the table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: finalize sessions on normal exit; cleanup hoist; share handler (iter 5) 1. startSessionProcessor success branch now calls completionHandler. finalizeSession before removeSessionImmediate. Hooks-disabled installs (and any Stop hook that fails before POST /api/sessions/complete) no longer leave sdk_sessions rows as status='active' forever. Idempotent — a subsequent /api/sessions/complete is a no-op. 2. 
Hoist SessionRoutes.handleSessionEnd cleanup declaration above the closures that reference it (TDZ safety; safe at runtime today but fragile if timeout ever shrinks). 3. SessionRoutes now receives WorkerService's shared SessionCompletionHandler instead of constructing its own — prevents silent divergence if the handler ever becomes stateful. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: stop runaway crash-recovery loop on dead sessions Two distinct bugs were combining to keep a dead session restarting forever: Bug 1 (uncaught "The operation was aborted."): child_process.spawn emits 'error' asynchronously for ENOENT/EACCES/abort signal aborts. spawnSdkProcess() never attached an 'error' listener, so any async spawn failure became uncaughtException and escaped to the daemon-level handler. Attach an 'error' listener immediately after spawn, before the !child.pid early-return, so async spawn errors are logged (with errno code) and swallowed locally. Bug 2 (sliding-window limiter never trips on slow restart cadence): RestartGuard tripped only when restartTimestamps.length exceeded MAX_WINDOWED_RESTARTS (10) within RESTART_WINDOW_MS (60s). With the 8s exponential-backoff cap, only ~7-8 restarts fit in the window, so a dead session that fail-restart-fail-restart on 8s cycles would loop forever (consecutiveRestarts climbing past 30+ in observed logs). Add a consecutiveFailures counter that increments on every restart and resets only on recordSuccess(). Trip when consecutive failures exceed MAX_CONSECUTIVE_FAILURES (5) — meaning 5 restarts with zero successful processing in between proves the session is dead. Both guards now run in parallel: tight loops still trip the windowed cap; slow loops trip the consecutive-failure cap. Also: when the SessionRoutes path trips the guard, drain pending messages to 'abandoned' so the session does not reappear in getSessionsWithPendingMessages and trigger another auto-start cycle. 
The worker-service.ts path already does this via terminateSession. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf: streamline worker startup and consolidate database connections 1. Database Pooling: Modified DatabaseManager, SessionStore, and SessionSearch to share a single bun:sqlite connection, eliminating redundant file descriptors. 2. Non-blocking Startup: Refactored WorktreeAdoption and Chroma backfill to run in the background (fire-and-forget), preventing them from stalling core initialization. 3. Diagnostic Routes: Added /api/chroma/status and bypassed the initialization guard for health/readiness endpoints to allow diagnostics during startup. 4. Robust Search: Implemented reliable SQLite FTS5 fallback in SearchManager for when Chroma (uvx) fails or is unavailable. 5. Code Cleanup: Removed redundant loopback MCP checks and mangled initialization logic from WorkerService. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: hard-exclude observer-sessions from hooks; bundle migration 29 (#2124) * fix: hard-exclude observer-sessions from hooks; backfill bundle migrations Stop hook + SessionEnd hook were storing the SDK observer's own init/continuation/summary prompts in user_prompts, leaking into the viewer (meta-observation regression). 25 such rows accumulated. - shouldTrackProject: hard-reject OBSERVER_SESSIONS_DIR (and its subtree) before consulting user-configured exclusion globs. - summarize.ts (Stop) and session-complete.ts (SessionEnd): early-return when shouldTrackProject(cwd) is false, so the observer's own hooks cannot bootstrap the worker or queue a summary against the meta-session. - SessionRoutes: cap user-prompt body at 256 KiB at the session-init boundary so a runaway observer prompt cannot blow up storage. 
- SessionStore: add migration 29 (UNIQUE(memory_session_id, content_hash) on observations) inline so bundled artifacts (worker-service.cjs, context-generator.cjs) stay schema-consistent — without it, the ON CONFLICT clause in observation inserts throws. - spawnSdkProcess: stdio[stdin] from 'ignore' to 'pipe' so the supervisor can actually feed the observer's stdin. Also rebuilds plugin/scripts/{worker-service,context-generator}.cjs. * fix: walk back to UTF-8 boundary on prompt truncation (Greptile P2) Plain Buffer.subarray at MAX_USER_PROMPT_BYTES can land mid-codepoint, which the utf8 decoder silently rewrites to U+FFFD. Walk back over any continuation bytes (0b10xxxxxx) before decoding so the truncated prompt ends on a valid sequence boundary instead of a replacement character. * fix: cross-platform observer-dir containment; clarify SDK stdin pipe claude-review feedback on PR #2124. - shouldTrackProject: literal `cwd.startsWith(OBSERVER_SESSIONS_DIR + '/')` hard-coded a POSIX separator and missed Windows backslash paths plus any trailing-slash variance. Switched to a path.relative-based isWithin() helper so Windows hook input under observer-sessions\\... is also excluded. - spawnSdkProcess: added a comment explaining why stdin must be 'pipe' — SpawnedSdkProcess.stdin is typed NonNullable and the Claude Agent SDK consumes that pipe; 'ignore' would null it and the null-check below would tear the child down on every spawn. * fix: make Stop hook fire-and-forget; remove dead /api/session/end The Stop hook was awaiting a 35-second long-poll on /api/session/end, which the worker held open until the summary-stored event fired (or its 30s server-side timeout elapsed). Followed by another await on /api/sessions/complete. Three sequential awaits, the middle one a 30s hold — not fire-and-forget despite repeated requests. The Stop hook now does ONE thing: POST /api/sessions/summarize to queue the summary work and return. The worker drives the rest async. 
Session-map cleanup is performed by the SessionEnd handler (session-complete.ts), not duplicated here. - summarize.ts: drop the /api/session/end long-poll and the trailing /api/sessions/complete await; ~40 lines removed; unused SessionEndResponse interface gone; header comment rewritten. - SessionRoutes: delete handleSessionEnd, sessionEndSchema, the SERVER_SIDE_SUMMARY_TIMEOUT_MS constant, and the /api/session/end route registration. Drop the now-unused ingestEventBus and SummaryStoredEvent imports. - ResponseProcessor + shared.ts + worker-utils.ts: update stale comments that referenced the dead endpoint. The IngestEventBus is left in place dormant (no listeners) for follow-up cleanup so this PR stays focused on the blocker. Bundle artifact (worker-service.cjs) rebuilt via build-and-sync. Verification: - grep '/api/session/end' plugin/scripts/worker-service.cjs → 0 - grep 'timeoutMs:35' plugin/scripts/worker-service.cjs → 0 - Worker restarted clean, /api/health ok at pid 92368 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * deps: bump all dependencies to latest including majors Upgrades: React 18→19, Express 4→5, Zod 3→4, TypeScript 5→6, @types/node 20→25, @anthropic-ai/claude-agent-sdk 0.1→0.2, @clack/prompts 0.9→1.2, plus minors. Adds Daily Maintenance section to CLAUDE.md mandating latest-version policy across manifests. Express 5 surfaced a race in Server.listen() where the 'error' handler was attached after listen() was invoked; refactored to use http.createServer with both 'error' and 'listening' handlers attached before listen(), restoring port-conflict rejection semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: surface real chroma errors and add deep status probe Replace the misleading "Vector search failed - semantic search unavailable. Install uv... restart the worker." string in SearchManager with the actual exception text from chroma_query_documents. 
The lying message blamed `uv` for any failure — even when the real cause was a chroma-mcp transport timeout, an empty collection, or a dead subprocess. Also add /api/chroma/status?deep=1 backed by a new ChromaMcpManager.probeSemanticSearch() that round-trips a real query (chroma_list_collections + chroma_query_documents) instead of just checking the stdio handshake. The cheap default path is unchanged. Includes the diagnostic plan (PLAN-fix-mcp-search.md) and updated test fixtures for the new structured failure message. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: rebuild worker-service bundle to match merged src Bundle was stale after the squash merge of #2124 — it still contained the old "Install uv... semantic search unavailable" string and lacked probeSemanticSearch. Rebuilt via bun run build-and-sync. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: address coderabbit feedback on PLAN-fix-mcp-search.md - replace machine-specific /Users/alexnewman absolute paths with portable <repo-root> placeholder (MD-style portability) - add blank lines around the TypeScript fenced block (MD031) - tag the bare fenced block with `text` (MD040) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 13:37:40 -07:00
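Note on the prompt-truncation fix in the commit above (walk back to a UTF-8 boundary): a minimal sketch of the technique, not the repository's code. The helper name `truncateUtf8` and its signature are hypothetical, and it uses `TextEncoder`/`TextDecoder` where the commit describes `Buffer.subarray`.

```typescript
// Hypothetical helper illustrating the "walk back over continuation bytes" idea.
// The name and signature are assumptions, not taken from the codebase.
function truncateUtf8(text: string, maxBytes: number): string {
  const bytes = new TextEncoder().encode(text);
  if (bytes.length <= maxBytes) return text;

  let end = maxBytes;
  // bytes[end] is the first byte that would be dropped. If it is a
  // continuation byte (0b10xxxxxx), the cut falls inside a multi-byte
  // sequence, so step back until the cut sits in front of a lead byte.
  while (end > 0 && (bytes[end] & 0xc0) === 0x80) {
    end--;
  }
  // The slice now ends on a sequence boundary, so decoding cannot
  // produce a U+FFFD replacement character at the tail.
  return new TextDecoder("utf-8").decode(bytes.subarray(0, end));
}
```

The result can be shorter than `maxBytes` by up to three bytes, which is the price of never emitting a partial codepoint.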
Alex Newman	f2d361b918	feat: security observation types + Telegram notifier (#2084 ) * feat: security observation types + Telegram notifier Adds two severity-axis security observation types (security_alert, security_note) to the code mode and a fire-and-forget Telegram notifier that posts when a saved observation matches configured type or concept triggers. Default trigger fires on security_alert only; notifier is disabled until BOT_TOKEN and CHAT_ID are set. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(telegram): honor CLAUDE_MEM_TELEGRAM_ENABLED master toggle Adds an explicit on/off flag (default 'true') so users can disable the notifier without clearing credentials. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf(stop-hook): make summarize handler fire-and-forget Stop hook previously blocked the Claude Code session for up to 110 seconds while polling the worker for summary completion. The handler now returns as soon as the enqueue POST is acked. - summarize.ts: drop the 500ms polling loop and /api/sessions/complete call; tighten SUMMARIZE_TIMEOUT_MS from 300s to 5s since the worker acks the enqueue synchronously. - SessionCompletionHandler: extract idempotent finalizeSession() for DB mark + orphaned-pending-queue drain + broadcast. completeByDbId now delegates so the /api/sessions/complete HTTP route is backward compatible. - SessionRoutes: wire finalizeSession into the SDK-agent generator's finally block, gated on lastSummaryStored + empty pending queue so only Stop events produce finalize (not every idle tick). - WorkerService: own the single SessionCompletionHandler instance and inject it into SessionRoutes to avoid duplicate construction. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(pr2084): address reviewer findings CodeRabbit: - SessionStore.getSessionById now returns status; without it, the finalizeSession idempotency guard always evaluated false and re-fired drain/broadcast on every call. - worker-service.ts: three call sites that remove the in-memory session after finalizeSession now do so only on success. On failure the session is left in place so the 60s orphan reaper can retry; removing it would orphan an 'active' DB row indefinitely under the fire-and- forget Stop hook. - runFallbackForTerminatedSession no longer emits a second session_completed event; finalizeSession already broadcasts one. The explicit broadcast now runs only on the finalize-failure fallback. Greptile: - TelegramNotifier reads via loadFromFile(USER_SETTINGS_PATH) so values in ~/.claude-mem/settings.json actually take effect; SettingsDefaultsManager.get() alone skipped the file and silently ignored user-configured credentials. - Emoji is derived from obs.type (security_alert → 🚨, security_note → 🔐, fallback 🔔) instead of hardcoded 🚨 for every observation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(hooks): worker-port mismatch on Windows and settings.json overrides (#2086) Hooks computed the health-check port as \$((37700 + id -u % 100)), ignoring ~/.claude-mem/settings.json. Two failure modes resulted: 1. Users upgrading from pre-per-uid builds kept CLAUDE_MEM_WORKER_PORT set to '37777' in settings.json. The worker bound 37777 (settings wins), but hooks queried 37701 (uid 501 on macOS), so every SessionStart/UserPromptSubmit health check failed. 2. Windows Git Bash/PowerShell returns a real Windows UID for 'id -u' (e.g. 209), producing port 37709 while the Node worker fell back to 37777 (process.getuid?.() ?? 77). Every prompt hit the 60s hook timeout. hooks.json now resolves the port in this order, matching how the worker itself resolves it: 1. 
sed CLAUDE_MEM_WORKER_PORT from ~/.claude-mem/settings.json 2. If absent, and uname is MINGW/CYGWIN/MSYS → 37777 3. Otherwise 37700 + (id -u || 77) % 100 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(pr2084): sync DatabaseManager.getSessionById return type CodeRabbit round 2: the DatabaseManager.getSessionById return type was missing platform_source, custom_title, and status fields that SessionStore.getSessionById actually returns. Structural typing hid the mismatch at compile time, but it prevents callers going through DatabaseManager from seeing the status field that the idempotency guard in SessionCompletionHandler relies on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(pr2084): hooks honor env vars and host; looser port regex (#2086 followup) CodeRabbit round 3: match the worker's env > file > defaults precedence and resolve host the same way as port. - Env: CLAUDE_MEM_WORKER_PORT and CLAUDE_MEM_WORKER_HOST win first. - File: sed now accepts both quoted ('"37777"') and unquoted (37777) JSON values for the port; a separate sed reads CLAUDE_MEM_WORKER_HOST. - Defaults: port per-uid formula (Windows: 37777), host 127.0.0.1. - Health-check URL uses the resolved $HOST instead of hardcoded localhost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 16:08:28 -07:00
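Note on the port resolution described in the commit above (env > file > defaults, per-uid default, 37777 on Windows): a minimal sketch of that precedence as a pure function. `PortInputs`, `parsePort`, and `resolveWorkerPort` are hypothetical names for illustration. Using `platform === "win32"` stands in for the hooks' `uname` check and is an assumption.

```typescript
// Hypothetical inputs; the caller would fill these from process.env,
// ~/.claude-mem/settings.json, process.platform and process.getuid?.().
interface PortInputs {
  envPort?: string;               // CLAUDE_MEM_WORKER_PORT from the environment
  settingsPort?: string | number; // same key from settings.json, quoted or unquoted
  platform: string;
  uid?: number;
}

// Accepts "37777" or 37777; rejects empty, non-numeric, or out-of-range values.
function parsePort(value: string | number | undefined): number | undefined {
  if (value === undefined) return undefined;
  const n = Number(String(value).trim());
  return Number.isInteger(n) && n > 0 && n <= 65535 ? n : undefined;
}

function resolveWorkerPort(inputs: PortInputs): number {
  // 1. Environment wins, 2. then the settings file, 3. then defaults.
  const explicit = parsePort(inputs.envPort) ?? parsePort(inputs.settingsPort);
  if (explicit !== undefined) return explicit;
  if (inputs.platform === "win32") return 37777;
  // Per-uid default; a missing uid falls back to 77, which also yields 37777.
  return 37700 + ((inputs.uid ?? 77) % 100);
}
```

Keeping the resolution in one pure function makes it straightforward to check that hooks and worker agree on the same port for the same inputs.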
Alex Newman	99060bac1a	fix: detect PID reuse in worker start-guard (container restarts) (#2082 ) * fix: detect PID reuse in worker start-guard to survive container restarts The 'Worker already running' guard checked PID liveness with kill(0), which false-positives when a persistent PID file outlives the PID namespace (docker stop / docker start, pm2 graceful reloads). The new worker comes up with the same low PID (e.g. 11) as the old one, kill(0) says 'alive', and the worker refuses to start against its own prior incarnation. Capture a process-start token alongside the PID and verify identity, not just liveness: - Linux: /proc/<pid>/stat field 22 (starttime, jiffies since boot) - macOS/POSIX: `ps -p <pid> -o lstart=` - Windows: unchanged (returns null, falls back to liveness) PID files written by older versions are token-less, so verifyPidFileOwnership falls back to the current liveness-only behavior for backwards compatibility. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: apply review feedback to PID identity helpers - Collapse ProcessManager re-export down to a single import/export statement. - Make verifyPidFileOwnership a type predicate (info is PidInfo) so callers don't need non-null assertions on the narrowed value. - Drop the `!` assertions at the worker-service GUARD 1 call site now that the predicate narrows. - Tighten the captureProcessStartToken platform doc comment to enumerate process.platform values explicitly. No behavior change — esbuild output is byte-identical (type-only edits). Addresses items 1-3 of the claude-review comment on PR #2082. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: pin LC_ALL=C for `ps lstart=` in captureProcessStartToken Without a locale pin, `ps -o lstart=` emits month/weekday names in the system locale. 
A bind-mounted PID file written under one locale and read under another would hash to different tokens and the live worker would incorrectly appear stale — reintroducing the very bug this helper exists to prevent. Flagged by Greptile on PR #2082. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: address second-round review on PID identity helpers - verifyPidFileOwnership: log a DEBUG diagnostic when the PID is alive but the start-token mismatches. Without it, callers can't distinguish the "process dead" path from the "PID reused" path in production logs — the exact case this helper exists to catch. - writePidFile: drop the redundant `?? undefined` coercion. `null` and `undefined` are both falsy for the subsequent ternary, so the coercion was purely cosmetic noise that suggested an important distinction. - Add a unit test for the win32 fallback path in captureProcessStartToken (mocks process.platform) — previously uncovered in CI. Addresses items 1, 2, and 5 of the second claude-review on PR #2082. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 19:49:03 -07:00
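Note on the Linux start-token capture described in the commit above (/proc/<pid>/stat field 22): a minimal sketch of extracting that field from a stat line. `parseStartTime` is a hypothetical name, and how the repository's captureProcessStartToken actually parses the file is not shown in this log. The one detail worth illustrating is that field 2 (the command name) can contain spaces and parentheses, so a naive whitespace split can pick the wrong field.

```typescript
// Hypothetical parser for a single /proc/<pid>/stat line.
// Returns the starttime field (field 22) as a string, or null if malformed.
function parseStartTime(statLine: string): string | null {
  // Field 2 (comm) is wrapped in parentheses and may itself contain
  // spaces or ')', so only split what follows the LAST ')'.
  const close = statLine.lastIndexOf(")");
  if (close === -1) return null;

  const rest = statLine.slice(close + 1).trim().split(/\s+/);
  // rest[0] is field 3 (state), so field N lives at rest[N - 3].
  const starttime = rest[22 - 3];
  return starttime !== undefined && /^\d+$/.test(starttime) ? starttime : null;
}
```

Comparing this value with the one recorded in the PID file distinguishes "same process still alive" from "PID reused by a different process".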
Alex Newman	8d166b47c1	Revert "revert: roll back v12.3.3 (Issue Blowout 2026)" This reverts commit `bfc7de377a`.	2026-04-20 12:18:55 -07:00
Alex Newman	bfc7de377a	revert: roll back v12.3.3 (Issue Blowout 2026) SessionStart context injection regressed in v12.3.3 — no memory context is being delivered to new sessions. Rolling back to the v12.3.2 tree state while the regression is investigated. Reverts #2080. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 11:59:15 -07:00
Alex Newman	ba1ef6c42c	fix: Issue Blowout 2026 — 25 bugs across worker, hooks, security, and search (#2080 ) * fix: resolve search, database, and docker bugs (#1913, #1916, #1956, #1957, #2048) - Fix concept/concepts param mismatch in SearchManager.normalizeParams (#1916) - Add FTS5 keyword fallback when ChromaDB is unavailable (#1913, #2048) - Add periodic WAL checkpoint and journal_size_limit to prevent unbounded WAL growth (#1956) - Add periodic clearFailed() to purge stale pending_messages (#1957) - Fix nounset-safe TTY_ARGS expansion in docker/claude-mem/run.sh Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: prevent silent data loss on non-XML responses, add queue info to /health (#1867, #1874) - ResponseProcessor: mark messages as failed (with retry) instead of confirming when the LLM returns non-XML garbage (auth errors, rate limits) (#1874) - Health endpoint: include activeSessions count for queue liveness monitoring (#1867) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: cache isFts5Available() at construction time Addresses Greptile review: avoid DDL probe (CREATE + DROP) on every text query. Result is now cached in _fts5Available at construction. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: resolve worker stability bugs — pool deadlock, MCP loopback, restart guard (#1868, #1876, #2053) - Replace flat consecutiveRestarts counter with time-windowed RestartGuard: only counts restarts within 60s window (cap=10), decays after 5min of success. Prevents stranding pending messages on long-running sessions. (#2053) - Add idle session eviction to pool slot allocation: when all slots are full, evict the idlest session (no pending work, oldest activity) to free a slot for new requests, preventing 60s timeout deadlock. (#1868) - Fix MCP loopback self-check: use process.execPath instead of bare 'node' which fails on non-interactive PATH. 
Fix crash misclassification by removing false "Generator exited unexpectedly" error log on normal completion. (#1876) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: resolve hooks reliability bugs — summarize exit code, session-init health wait (#1896, #1901, #1903, #1907) - Wrap summarize hook's workerHttpRequest in try/catch to prevent exit code 2 (blocking error) on network failures or malformed responses. Session exit no longer blocks on worker errors. (#1901) - Add health-check wait loop to UserPromptSubmit session-init command in hooks.json. On Linux/WSL where hook ordering fires UserPromptSubmit before SessionStart, session-init now waits up to 10s for worker health before proceeding. Also wrap session-init HTTP call in try/catch. (#1907) - Close #1896 as already-fixed: mtime comparison at file-context.ts:255-267 bypasses truncation when file is newer than latest observation. - Close #1903 as no-repro: hooks.json correctly declares all hook events. Issue was Claude Code 12.0.1/macOS platform event-dispatch bug. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: security hardening — bearer auth, path validation, rate limits, per-user port (#1932, #1933, #1934, #1935, #1936) - Add bearer token auth to all API endpoints: auto-generated 32-byte token stored at ~/.claude-mem/worker-auth-token (mode 0600). All hook, MCP, viewer, and OpenCode requests include Authorization header. Health/readiness endpoints exempt for polling. (#1932, #1933) - Add path traversal protection: watch.context.path validated against project root and ~/.claude-mem/ before write. Rejects ../../../etc style attacks. (#1934) - Reduce JSON body limit from 50MB to 5MB. Add in-memory rate limiter (300 req/min/IP) to prevent abuse. (#1935) - Derive default worker port from UID (37700 + uid%100) to prevent cross-user data leakage on multi-user macOS. Windows falls back to 37777. Shell hooks use same formula via id -u. 
(#1936) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: resolve search project filtering and import Chroma sync (#1911, #1912, #1914, #1918) - Fix per-type search endpoints to pass project filter to Chroma queries and SQLite hydration. searchObservations/Sessions/UserPrompts now use $or clause matching project + merged_into_project. (#1912) - Fix timeline/search methods to pass project to Chroma anchor queries. Prevents cross-project result leakage when project param omitted. (#1911) - Sync imported observations to ChromaDB after FTS rebuild. Import endpoint now calls chromaSync.syncObservation() for each imported row, making them visible to MCP search(). (#1914) - Fix session-init cwd fallback to match context.ts (process.cwd()). Prevents project key mismatch that caused "no previous sessions" on fresh sessions. (#1918) - Fix sync-marketplace restart to include auth token and per-user port. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: resolve all CodeRabbit and Greptile review comments on PR #2080 - Fix run.sh comment mismatch (no-op flag vs empty array) - Gate session-init on health check success (prevent running when worker unreachable) - Fix date_desc ordering ignored in FTS session search - Age-scope failed message purge (1h retention) instead of clearing all - Anchor RestartGuard decay to real successes (null init, not Date.now()) - Add recordSuccess() calls in ResponseProcessor and completion path - Prevent caller headers from overriding bearer auth token - Add lazy cleanup for rate limiter map to prevent unbounded growth - Bound post-import Chroma sync with concurrency limit of 8 - Add doc_type:'observation' filter to Chroma queries feeding observation hydration - Add FTS fallback to all specialized search handlers (observations, sessions, prompts, timeline) - Add response.ok check and error handling in viewer saveSettings Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: 
resolve CodeRabbit round-2 review comments - Use failure timestamp (COALESCE) instead of created_at_epoch for stale purge - Downgrade _fts5Available flag when FTS table creation fails - Escape FTS5 MATCH input by quoting user queries as literal phrases - Escape LIKE metacharacters (%, _, \) in prompt text search - Add response.ok check in initial settings load (matches save flow) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: resolve CodeRabbit round-3 review comments - Include failed_at_epoch in COALESCE for age-scoped purge - Re-throw FTS5 errors so callers can distinguish failure from no-results - Wrap all FTS fallback calls in SearchManager with try/catch Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-20 11:42:09 -07:00
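The round-2 items above mention quoting user queries as literal FTS5 phrases and escaping LIKE metacharacters. A minimal sketch of what that kind of escaping could look like follows; the helper names, table name, and column name are hypothetical and are not taken from the claude-mem source.

```typescript
// Hypothetical helpers for illustration only; not the repository's actual code.

/** Quote arbitrary user input as one FTS5 phrase so operators like OR or * are inert. */
function toFts5Phrase(input: string): string {
  // Inside an FTS5 double-quoted string, a literal `"` is written as `""`.
  return `"${input.replace(/"/g, '""')}"`;
}

/** Escape the LIKE metacharacters %, _ and \ so they match literally. */
function escapeLike(input: string): string {
  // Prefix each metacharacter with a backslash; the query must declare ESCAPE '\'.
  return input.replace(/[\\%_]/g, "\\$&");
}

/** Build a parameterized substring search (table and column names are assumed). */
function buildPromptSearch(term: string): { sql: string; params: string[] } {
  return {
    sql: "SELECT id FROM user_prompts WHERE prompt_text LIKE ? ESCAPE '\\'",
    params: [`%${escapeLike(term)}%`],
  };
}
```

Both helpers only transform the bound parameter value, so the SQL text itself stays constant and parameterized.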
Alex Newman	be99a5d690	fix: resolve search, database, and docker bugs (#2079) * fix: resolve search, database, and docker bugs (#1913, #1916, #1956, #1957, #2048) - Fix concept/concepts param mismatch in SearchManager.normalizeParams (#1916) - Add FTS5 keyword fallback when ChromaDB is unavailable (#1913, #2048) - Add periodic WAL checkpoint and journal_size_limit to prevent unbounded WAL growth (#1956) - Add periodic clearFailed() to purge stale pending_messages (#1957) - Fix nounset-safe TTY_ARGS expansion in docker/claude-mem/run.sh Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: prevent silent data loss on non-XML responses, add queue info to /health (#1867, #1874) - ResponseProcessor: mark messages as failed (with retry) instead of confirming when the LLM returns non-XML garbage (auth errors, rate limits) (#1874) - Health endpoint: include activeSessions count for queue liveness monitoring (#1867) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: cache isFts5Available() at construction time Addresses Greptile review: avoid DDL probe (CREATE + DROP) on every text query. Result is now cached in _fts5Available at construction. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-19 22:19:18 -07:00
Alex Newman	a0dd516cd5	fix: resolve all 301 error handling anti-patterns across codebase Systematic cleanup of every error handling anti-pattern detected by the automated scanner. 289 issues fixed via code changes, 12 approved with specific technical justifications. Changes across 90 files: - GENERIC_CATCH (141): Added instanceof Error type discrimination - LARGE_TRY_BLOCK (82): Extracted helper methods to narrow try scope to ≤10 lines - NO_LOGGING_IN_CATCH (65): Added logger/console calls for error visibility - CATCH_AND_CONTINUE_CRITICAL_PATH (10): Added throw/return or approved overrides - ERROR_STRING_MATCHING (2): Approved with rationale (no typed error classes) - ERROR_MESSAGE_GUESSING (1): Replaced chained .includes() with documented pattern array - PROMISE_CATCH_NO_LOGGING (1): Added logging to .catch() handler Also fixes a detector bug where nested try/catch inside a catch block corrupted brace-depth tracking, causing false positives. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-19 19:57:00 -07:00
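Several of the anti-pattern categories named in this commit (GENERIC_CATCH, LARGE_TRY_BLOCK, NO_LOGGING_IN_CATCH) can be illustrated together. The sketch below is an assumption about the general shape of such a fix; `describeError` and `parseSettings` are hypothetical names, not functions from the repository.

```typescript
// Illustrative only: hypothetical helpers, not code from the claude-mem codebase.

/** Discriminate on `instanceof Error` instead of assuming every throwable has .message. */
function describeError(error: unknown): string {
  if (error instanceof Error) {
    return `${error.name}: ${error.message}`;
  }
  // Strings, numbers, and plain objects can be thrown too; stringify them safely.
  return `Non-Error thrown: ${String(error)}`;
}

/** Narrow try scope, log inside the catch, and return an explicit failure value. */
function parseSettings(
  raw: string,
  log: (message: string) => void,
): Record<string, unknown> | null {
  try {
    return JSON.parse(raw) as Record<string, unknown>;
  } catch (error: unknown) {
    log(`Failed to parse settings: ${describeError(error)}`);
    return null;
  }
}
```

Passing the logger in as a parameter keeps the sketch free of any particular logging library; a real caller would presumably hand in its own logger.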
Alex Newman	7a66cb310f	fix(worktree): address PR review — schema guard, startup adoption, query parity Addresses six CodeRabbit/Greptile findings on PR #2052: - Schema guard in adoptMergedWorktrees probes for merged_into_project columns before preparing statements; returns early when absent so first boot after upgrade (pre-migration) doesn't silently fail. - Startup adoption now iterates distinct cwds from pending_messages and dedupes via resolveMainRepoPath — the worker daemon runs with cwd=plugin scripts dir, so process.cwd() fallback was a no-op. - ObservationCompiler single-project queries (queryObservations / querySummaries) OR merged_into_project into WHERE so injected context surfaces adopted worktree rows, matching the Multi variants. - SessionStore constructor now calls ensureMergedIntoProjectColumns so bundled artifacts (context-generator.cjs) that embed SessionStore get the merged_into_project column on DBs that only went through the bundled migration chain. - OBSERVER_SESSIONS_PROJECT constant is now derived from basename(OBSERVER_SESSIONS_DIR) and used across PaginationHelper, SessionStore, and timeline queries instead of hardcoded strings. - Corrected misleading Chroma retry docstring in WorktreeAdoption to match actual behavior (no auto-retry once SQL commits).	2026-04-16 21:31:30 -07:00
Alex Newman	f6fda8fff4	fix(worktree): address CodeRabbit PR review feedback - Document --branch override in npx-cli help text - Guard ContextBuilder against empty projects[] override; fall back to cwd-derived primary - Ensure merged_into_project indexes are created even if ALTER ran in a prior partial migration - Reject adopt --branch/--cwd flags with missing or flag-like values - Use defined --color-border-primary token for merged badge border Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-16 20:03:27 -07:00
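The "reject flags with missing or flag-like values" item can be sketched as a small argument reader. This is a guess at the general technique only; `readFlagValue` is a hypothetical name and the real CLI parsing in npx-cli may be structured differently.

```typescript
// Hypothetical sketch; the actual adopt command may validate its flags differently.

/**
 * Returns the value following `flag`, or undefined when the flag is absent.
 * Throws when the flag is present but its value is missing, empty, or looks
 * like another flag (for example `--branch --dry-run`).
 */
function readFlagValue(argv: string[], flag: string): string | undefined {
  const index = argv.indexOf(flag);
  if (index === -1) return undefined; // flag not supplied at all

  const value = argv[index + 1];
  if (value === undefined || value === "" || value.startsWith("-")) {
    throw new Error(`${flag} requires a value`);
  }
  return value;
}
```

For example, `readFlagValue(["adopt", "--branch", "feature/x"], "--branch")` yields `"feature/x"`, while `["adopt", "--branch", "--dry-run"]` is rejected instead of silently treating `--dry-run` as a branch name.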
Alex Newman	d24f3a7019	fix(worktree): address PR review — test assertion, dry-run sentinel, git timeouts - Update allProjects test expectation to match [parent, composite] (matches JSDoc + callers in ContextBuilder/context handlers). - Replace string-matched __DRY_RUN_ROLLBACK__ sentinel with dedicated DryRunRollback class to avoid swallowing unrelated errors. - Add 5000ms timeout to spawnSync git calls in WorktreeAdoption and ProcessManager so worker startup can't hang on a stuck git process. - Drop unreachable break after process.exit(0) in adopt case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-16 19:50:01 -07:00
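Replacing a string-matched sentinel with a dedicated error class, as described above, might look roughly like the following. Only the class name `DryRunRollback` comes from the commit message; the constructor body and the `runAdoption` wrapper are assumptions made for illustration.

```typescript
// Sketch under assumptions: only the class name is taken from the commit message.
class DryRunRollback extends Error {
  constructor() {
    super("dry run: rolling back");
    this.name = "DryRunRollback";
    // Keeps instanceof reliable if the code is compiled to an ES5 target.
    Object.setPrototypeOf(this, DryRunRollback.prototype);
  }
}

/**
 * Runs `work` in a stand-in for a transaction. Returns true when the work would
 * be committed and false when a dry run deliberately rolled it back.
 */
function runAdoption(work: () => void, dryRun: boolean): boolean {
  try {
    work();
    if (dryRun) throw new DryRunRollback(); // abort on purpose after doing the work
    return true;
  } catch (error: unknown) {
    if (error instanceof DryRunRollback) return false; // the expected rollback signal
    throw error; // anything else is a real failure and must propagate
  }
}
```

Because the check is `instanceof` rather than a message comparison, an unrelated error whose text happens to contain the old sentinel string is no longer swallowed.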
Alex Newman	5664fabce4	feat(cli): npx claude-mem adopt [--dry-run] [--branch X] Adds a manual escape hatch for the worktree adoption engine. Covers squash-merges where git branch --merged HEAD returns nothing, and lets users re-run adoption on demand. Wired through worker-service.cjs (same pattern as generate/clean) so the command runs under Bun with bun:sqlite, keeping npx-cli/ pure Node. --cwd flag passes the user's working directory through the spawn so the engine resolves the correct parent repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-16 19:28:17 -07:00
Alex Newman	0b90495391	feat(worktree): auto-adopt merged worktrees on worker startup Invokes adoptMergedWorktrees() right after runOneTimeCwdRemap() and before dbManager.initialize(), wrapped in try/catch so adoption failures never block startup. Idempotent, so running every startup is cheap — the SQL UPDATE only touches rows where merged_into_project IS NULL. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-16 19:24:43 -07:00
Alex Newman	193e7e0719	feat(worktree): auto-apply cwd-based project remap on worker startup Ports scripts/cwd-remap.ts into ProcessManager.runOneTimeCwdRemap() and invokes it in initializeBackground() alongside the existing chroma migration. Uses pending_messages.cwd as the source of truth to rewrite pre-worktree bare project names into the parent/worktree composite format so search and context are consistent. - Backs up the DB to .bak-cwd-remap-<ts> before any writes. - Idempotent: marker file .cwd-remap-applied-v1 short-circuits reruns. - No-ops on fresh installs (no DB, or no pending_messages table). - On failure, logs and skips the marker so the next restart retries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-16 17:51:33 -07:00
Alex Newman	216d17879d	Merge pull request #1680 from ousamabenyounes/fix/issue-1447 fix: suppress false ERROR when duplicate daemon loses port bind race (#1447)	2026-04-14 18:41:25 -07:00
Ousama Ben Younes	08cf2ba3bd	fix: suppress false ERROR when duplicate daemon loses port bind race (#1447) When the MCP server and SessionStart hook both spawn a worker daemon concurrently, one loses the bind race (EADDRINUSE / Bun's port-in-use error). The loser now checks if the winner is healthy; if so, it logs INFO and exits cleanly instead of logging a misleading ERROR on every first session start. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-10 10:01:08 +00:00
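The bind-race handling described in this commit boils down to a decision made after the bind fails. The sketch below models only that decision as a pure function; the function name, the return shape, and the non-zero exit code are assumptions, and the real code also has to recognize Bun's own port-in-use error, which is simplified here to the `EADDRINUSE` code.

```typescript
// Hypothetical decision helper; the actual daemon logic is not shown in the commit log.
interface BindFailureDecision {
  level: "INFO" | "ERROR";
  exitCode: 0 | 1; // 1 as the failure code is an assumption
  message: string;
}

/**
 * Decide how the losing daemon should react to a failed port bind.
 * `winnerHealthy` is the result of a health check against the process
 * that already holds the port.
 */
function classifyBindFailure(
  errorCode: string | undefined,
  winnerHealthy: boolean,
): BindFailureDecision {
  if (errorCode === "EADDRINUSE" && winnerHealthy) {
    // Another worker won the race and is serving; this is not an error.
    return { level: "INFO", exitCode: 0, message: "Worker already running; exiting duplicate." };
  }
  // The port is taken by something unhealthy, or the bind failed for another reason.
  return { level: "ERROR", exitCode: 1, message: `Failed to bind worker port (${errorCode ?? "unknown"}).` };
}
```

Keeping the decision separate from the actual bind and health-check calls makes the INFO-versus-ERROR rule easy to test without opening sockets.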
Alex Newman	c648d5d8d2	feat: Knowledge Agents — queryable corpora from claude-mem (#1653 ) * feat: add knowledge agent types, store, builder, and renderer Phase 1 of Knowledge Agents feature. Introduces corpus compilation pipeline that filters observations from the database into portable corpus files stored at ~/.claude-mem/corpora/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add corpus CRUD HTTP endpoints and wire into worker service Phase 2 of Knowledge Agents. Adds CorpusRoutes with 5 endpoints (build, list, get, delete, rebuild) and registers them during worker background initialization alongside SearchRoutes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add KnowledgeAgent with V1 SDK prime/query/reprime Phase 3 of Knowledge Agents. Uses Agent SDK V1 query() with resume and disallowedTools for Q&A-only knowledge sessions. Auto-reprimes on session expiry. Adds prime, query, and reprime HTTP endpoints to CorpusRoutes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add MCP tools and skill for knowledge agents Phase 4 of Knowledge Agents. Adds build_corpus, list_corpora, prime_corpus, and query_corpus MCP tools delegating to worker HTTP endpoints. Includes /knowledge-agent skill with workflow docs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: handle SDK process exit in KnowledgeAgent, add e2e test The Agent SDK may throw after yielding all messages when the Claude process exits with a non-zero code. Now tolerates this if session_id/answer were already captured. Adds comprehensive e2e test script (31 assertions) orchestrated via tmux-cli. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: use settings model ID instead of hardcoded model in KnowledgeAgent Reads CLAUDE_MEM_MODEL from user settings via getModelId(), matching the existing SDKAgent pattern. No more hardcoded model assumptions. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: improve knowledge agents developer experience Add public documentation page, rebuild/reprime MCP tools, and actionable error messages. DX review scored knowledge agents 4/10 — core engineering works (31/31 e2e) but the feature was invisible. This addresses discoverability (docs, cross-links), API completeness (missing MCP tools), and error quality (fix/example fields in error responses). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add quick start guide to knowledge agents page Covers the three main use cases upfront: creating an agent, asking a single question, and starting a fresh conversation with reprime. Includes keeping-it-current section for rebuild + reprime workflow. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: address code review issues — path traversal, session safety, prompt injection - Block path traversal in CorpusStore with alphanumeric name validation and resolved path check - Harden system prompt against instruction injection from untrusted corpus content - Validate question field as non-empty string in query endpoint - Only persist session_id after successful prime (not null on failure) - Persist refreshed session_id after query execution - Only auto-reprime on session resume errors, not all query failures - Add fenced code block language tags to SKILL.md Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: address remaining code review issues — e2e robustness, MCP validation, docs - Harden e2e curl wrappers with connect-timeout, fallback to HTTP 000 on transport failure - Use curl_post wrapper consistently for all long-running POST calls - Add runtime name validation to all corpus MCP tool handlers - Fix docs: soften hallucination guarantee to probabilistic claim - Fix architecture diagram: add missing rebuild_corpus and reprime_corpus tools Co-Authored-By: Claude Opus 4.6 (1M context) 
<noreply@anthropic.com> * fix: enforce string[] type in safeParseJsonArray for corpus data integrity Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: add blank line before fenced code blocks in SKILL.md maintenance section Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 17:30:20 -07:00
Alex Newman	abd55977ca	fix(mcp): MCP server crashes with Cannot find module 'bun:sqlite' under Node (#1645 ) * fix(mcp): MCP server crashes with Cannot find module 'bun:sqlite' under Node The MCP server bundle (mcp-server.cjs) ships with `#!/usr/bin/env node` so it must run under Node, but commit `2b60dd29` added an import of `ensureWorkerStarted` from worker-service.ts. That import transitively pulls in DatabaseManager → bun:sqlite, blowing up at top-level require under Node. The bundle ballooned from ~358KB (v11.0.1) to ~1.96MB (v12.0.0) and crashed on every spawn, breaking the MCP server entirely for Codex/MCP-only clients and any flow that boots the MCP tool surface. Fix: 1. Extract `ensureWorkerStarted` and the Windows spawn-cooldown helpers into a new lightweight module `src/services/worker-spawner.ts` that only imports from infrastructure/ProcessManager, infrastructure/HealthMonitor, shared/, and utils/logger — no SQLite, no ChromaSync, no DatabaseManager. 2. The new helper takes the worker script path explicitly so callers running under Node (mcp-server) can pass `worker-service.cjs` while callers already inside the worker (worker-service self-spawn) pass `__filename`. worker-service.ts keeps a thin wrapper for back-compat. 3. mcp-server.ts now imports from worker-spawner.js and resolves WORKER_SCRIPT_PATH via __dirname so the daemon can be auto-started for MCP-only clients without dragging in the entire worker bundle. 4. resolveWorkerRuntimePath() now searches for Bun on every platform (not just Windows). worker-service.cjs requires Bun at runtime, so when the spawner is invoked from a Node process the Unix branch can no longer fall through to process.execPath (= node). 5. spawnDaemon's Unix branch now calls resolveWorkerRuntimePath() instead of hardcoding process.execPath, fixing the same Node-spawning-Node bug for the actual subprocess launch on Linux/macOS. 
After: - mcp-server.cjs is 384KB again with zero `bun:sqlite` references - node mcp-server.cjs initializes and serves tools/list + tools/call (verified via JSON-RPC against the running worker) - ProcessManager test suite updated for the new cross-platform Bun resolution behavior; full suite has the same pre-existing failures as main, no regressions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix(mcp): address PR #1645 review feedback (round 1) Per Claude Code Review on PR #1645: 1. mcp-server.ts: log a warning when both __dirname and import.meta.url resolution fail. The cwd() fallback is essentially dead code for the CJS bundle but if it ever fires it gives the user a breadcrumb instead of a silently-wrong WORKER_SCRIPT_PATH. 2. mcp-server.ts: existsSync check on WORKER_SCRIPT_PATH at module load. Surfaces a clear "worker-service.cjs not found at expected path" log line for partial installs / dev environments instead of letting the failure surface as a generic spawnDaemon error later. 3. ProcessManager.ts: explanatory comment on the Windows `return 0` sentinel in spawnDaemon. Documents that PowerShell Start-Process doesn't return a PID and that callers MUST use `pid === undefined` for failure detection — never falsy checks like `if (!pid)`. Items 4 (no direct unit tests for the worker-spawner Windows cooldown helpers) and 5 (process-manager.test.ts uses real ~/.claude-mem path) are deferred — the reviewer flagged the latter as out of scope, and the former needs an injectable-I/O refactor that isn't appropriate for a hotfix bugfix PR. Verified: build clean, mcp-server.cjs still 384KB / zero bun:sqlite, JSON-RPC tools/list still returns the 7-tool surface, ProcessManager test suite still 43/43. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(spawner): mkdir CLAUDE_MEM_DATA_DIR before writing Windows cooldown marker Per CodeRabbit on PR #1645: on a fresh user profile, the data dir may not exist yet when markWorkerSpawnAttempted() runs. writeFileSync would throw ENOENT, the catch would swallow it, and the marker would never be created — defeating the popup-loop protection this helper exists to provide. mkdirSync(dir, { recursive: true }) is a no-op when the directory already exists, so it's safe to call on every spawn attempt. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(spawner): add APPROVED OVERRIDE annotations for cooldown marker catches Per CodeRabbit on PR #1645: silent catch blocks at spawn-cooldown sites should carry the APPROVED OVERRIDE annotation that the rest of the codebase uses (see ProcessManager.ts:689, BaseRouteHandler.ts:82, ChromaSync.ts:288). Both catches are intentional best-effort: - markWorkerSpawnAttempted: if mkdir/writeFileSync fails, the worker spawn itself will almost certainly fail too. Surfacing that downstream is far more useful than a noisy log line about a lock file. - clearWorkerSpawnAttempted: a stale marker is harmless. Worst case is one suppressed retry within the cooldown window, then self-heals. No behaviour change. Resolves the second half of CodeRabbit's lines 38-65 comment on worker-spawner.ts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(mcp): address PR #1645 review feedback (round 2) Round 2 of Claude Code Review feedback on PR #1645: Build guardrail (most important — protects the regression this PR fixes): - scripts/build-hooks.js: post-build check that fails the build if mcp-server.cjs ever contains a `bun:sqlite` reference. This is the exact regression PR #1645 fixed; future contributors will get an immediate, actionable error if a transitive import re-introduces it. Verified the check trips when violated. 
Code clarity: - src/servers/mcp-server.ts: drop dead `_originalLog` capture — it was never restored. Less code is fewer bugs. - src/servers/mcp-server.ts: elevate `cwd()` fallback log from WARN to ERROR. Per reviewer: a wrong WORKER_SCRIPT_PATH means worker auto-start silently fails, so the breadcrumb should be loud and searchable. - src/services/worker-service.ts: extended doc comment on the `ensureWorkerStartedShared(port, __filename)` wrapper explaining why `__filename` is the correct script path here (CJS bundle = compiled worker-service.cjs) and why mcp-server.ts can't use the same trick. - src/services/infrastructure/ProcessManager.ts: inline comment on the `env.BUN === 'bun'` bare-command guard explaining why it's reachable even though `isBunExecutablePath('bun')` is true (pathExists returns false for relative names, so the second branch is what fires). Coverage: - src/services/infrastructure/ProcessManager.ts: add `/usr/bin/bun` to the Linux candidate paths so apt-installed Bun on Debian/Ubuntu is found without falling through to the PATH lookup. Out-of-scope items (deferred with rationale in PR replies): - Unit tests for ensureWorkerStarted / Windows cooldown helpers — needs injectable-I/O refactor unsuitable for a hotfix. - Sentinel object for Windows spawnDaemon `0` — broader API change. - Windows Scoop install path — follow-up for a future PR. - runOneTimeChromaMigration placement, aggressiveStartupCleanup, console.log redirect timing, platform timeout multiplier — all pre-existing and unrelated to this regression. Verified: build clean, guardrail trips on simulated violation, mcp-server.cjs still 0 bun:sqlite refs, ProcessManager tests 43/43. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(mcp): address PR #1645 review feedback (round 3) Round 3 of Claude Code Review feedback on PR #1645: ProcessManager.ts: improve actionability of "Bun not found" errors Both Windows and Unix branches of spawnDaemon previously logged a vague "Failed to locate Bun runtime" message when resolveWorkerRuntimePath() returned null. Replaced with an actionable message that names the install URL and explains why Bun is required (worker uses bun:sqlite). The existing null-guard at the call sites already prevents passing null to child_process.spawn — only the error text changed. scripts/build-hooks.js: refine bun:sqlite guardrail to match actual require() calls only The previous coarse `includes('bun:sqlite')` check tripped on its own improved error message, which legitimately mentions "bun:sqlite" by name. Switched to a regex that matches `require("bun:sqlite")` / `require('bun:sqlite')` (with optional whitespace, handles both quote styles, handles minified output) so error messages and inline comments can reference the module name without false positives. Verified the regex still trips on real violations (both spaced and minified forms) and correctly ignores string-literal mentions. Other round-3 items (verified, not changed): - TOOL_ENDPOINT_MAP: reviewer flagged as dead code, but it IS used at lines 250 and 263 by the search and timeline tool handlers. False positive — kept as-is. - if (!pid) callsites: grepped src/, zero offenders. The Windows `0` PID sentinel contract is safe; only the in-line documentation comment in ProcessManager.ts mentions the anti-pattern. - callWorkerAPIPost double-wrapping: pre-existing intentional behavior (only used by /api/observations/batch which returns raw data, not the MCP {content:[...]} shape). Unrelated to this regression. 
- Snap path / startParentHeartbeat / main().catch / test for non-existent workerScriptPath / etc — pre-existing or out of scope for this hotfix, deferred per established disposition. Verified: build clean, guardrail still trips on real violations, mcp-server.cjs has 0 require("bun:sqlite") calls, JSON-RPC tools/list returns the 7-tool surface, ProcessManager tests 43/43. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test(spawnDaemon): contract test for Windows 0 PID success sentinel Per CodeRabbit nitpick on PR #1645 commit 7a96b3b9: add a focused test that documents the spawnDaemon return contract so any future contributor who introduces `if (!pid)` against a spawnDaemon return value (or its wrapper) sees a failing assertion explaining why the falsy check is incorrect. The test deliberately exercises the JS-level semantics rather than mocking PowerShell — a true mocked Windows test would require refactoring spawnDaemon to take an injectable execSync, which is a larger change than this hotfix should carry. The contract assertions here catch the same regression class (treating Windows success as failure) without that refactor. Verified: bun test tests/infrastructure/process-manager.test.ts now passes 44/44 (was 43/43). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(mcp): address PR #1645 review feedback (round 4) Round 4 of Claude Code Review feedback on PR #1645 (review of round-3 commit 193286f9): tests/infrastructure/process-manager.test.ts: replace require('fs') with the already-imported statSync. Reviewer correctly flagged that the file uses ESM-style named imports everywhere else and the inline require() calls would break under strict ESM. Two callsites updated in the touchPidFile test. src/services/infrastructure/ProcessManager.ts: hoist resolveWorkerRuntimePath() and the `Bun runtime not found` error handling out of both branches in spawnDaemon.
Both Windows and Unix branches need the same Bun lookup, and resolving once before the OS branch split avoids a duplicate execSync('which bun')/where bun in the no-well-known-path fallback. The error message is also DRY now — single source of truth instead of two near-identical strings. CodeRabbit confirmed in its previous reply that "All actionable items across all four review rounds are fully resolved" — these two minor items from claude-review of round 3 are the only remaining cleanup. Verified: build clean, ProcessManager tests still 44/44. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(mcp): address PR #1645 review feedback (round 5) Round 5 of Claude Code Review feedback on PR #1645: src/services/worker-spawner.ts: drop `export` from internal helpers `shouldSkipSpawnOnWindows`, `markWorkerSpawnAttempted`, and `clearWorkerSpawnAttempted` were exported even though they were private in worker-service.ts and nothing outside this module needs them. Removing the `export` keyword keeps the public surface to just `ensureWorkerStarted` and prevents future callers from bypassing the spawn lifecycle. scripts/build-hooks.js: broaden guardrail to all bun:* modules Previously the regex only caught `require("bun:sqlite")`, but every module in the `bun:` namespace (bun:ffi, bun:test, etc.) is Bun-only and would crash mcp-server.cjs the same way under Node. Generalized the regex to `require("bun:[a-z][a-z0-9_-]")` so a transitive import of any Bun-only module fails the build instead of shipping a broken bundle. Verified the new regex still trips on bun:sqlite, bun:ffi, bun:test, and correctly ignores string-literal mentions in error messages. src/servers/mcp-server.ts: attribute root cause when dirname resolution fails Previously, if `__dirname`/`import.meta.url` resolution failed and we fell back to `process.cwd()`, the user would see two warnings: an error about the dirname fallback AND a separate warning about the missing worker bundle. 
The second warning hides the root cause — someone debugging would assume the install is broken when really it's a dirname-resolution failure. Track the failure with a flag and emit a single root-cause-attributing log line in the existence-check branch instead. The dirname fallback paths are still functionally unreachable in CJS deployment; this just makes the failure mode unmistakable if it ever does fire. Out of scope (consistent with prior rounds): - darwin/linux split for non-Windows candidate paths (benign today) - Integration test for non-existent workerScriptPath (test coverage gap deferred since rounds 1-2) - Defer existsSync check to first ensureWorkerStarted call (current module-init check is the loud signal we want) Already addressed in earlier rounds: - resolveWorkerRuntimePath() called twice in spawnDaemon → hoisted in round 4 (b2c114b4) - _originalLog dead code → removed in round 2 (7a96b3b9) Verified: build clean, broadened guardrail trips on bun:sqlite, bun:ffi, and bun:test (and ignores string literals), MCP server serves the 7-tool surface, ProcessManager tests still 44/44. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix(mcp): address PR #1645 review feedback (round 6) Round 6 of Claude Code Review feedback on PR #1645: src/services/worker-spawner.ts: validate workerScriptPath at entry Add an empty-string + existsSync guard at the top of ensureWorkerStarted. Without this, a partial install or upstream path-resolution regression just surfaces as a low-signal child_process error from spawnDaemon. The explicit log line at the entry point makes that class of bug much easier to diagnose. The mcp-server.ts module-init existsSync check already covers this for the MCP-server caller, but defending at the spawner level reinforces the contract for any future caller. 
src/services/worker-spawner.ts: document SettingsDefaultsManager dependency boundary in the module header The spawner imports from SettingsDefaultsManager, ProcessManager, and HealthMonitor. None of those currently touch bun:sqlite, but if any of them ever does, the spawner's SQLite-free contract silently breaks. The build guardrail in build-hooks.js is the only thing that catches it. Header comment now flags this so future contributors audit transitive imports when adding helpers from the shared/infrastructure layers. src/services/infrastructure/ProcessManager.ts: add /snap/bin/bun Ubuntu Snap install path. Now alongside the existing apt path (/usr/bin/bun) and Homebrew/Linuxbrew paths. The PATH lookup catches it as fallback, but listing it explicitly avoids paying for an execSync('which bun') in the common case. src/servers/mcp-server.ts: elevate missing-bundle log warn → error A missing worker-service.cjs means EVERY MCP tool call that needs the worker silently fails. That's a broken-install state, not a transient condition — match the severity of the dirname-fallback branch above (which is already ERROR). Out of scope (consistent with prior rounds, reviewer agrees these are appropriately deferred): - Streaming bundle read in build-hooks.js (nit at current 384KB size) - Unit tests for ensureWorkerStarted / cooldown helpers - Integration test for non-existent workerScriptPath Verified: build clean, broadened guardrail still trips on bun:* imports and ignores string literals, MCP server serves the 7-tool surface, ProcessManager tests still 44/44. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(mcp): defer WORKER_SCRIPT_PATH check to first call (round 7) Round 7 of Claude Code Review feedback on PR #1645: src/servers/mcp-server.ts: extract module-level existsSync check into checkWorkerScriptPath() and call it lazily from ensureWorkerConnection() instead of at module load. 
The early-warning intent is preserved (the check still fires before any actual spawn attempt), but tests/tools that import this module without booting the MCP server no longer see noisy ERROR-level log lines for a worker bundle they never intended to start. The check is cheap and idempotent, so calling it on every auto-start attempt is fine. The two failure-mode branches (dirname-resolution failure vs simple missing-bundle) remain unchanged — the function body is identical to the previous module-level if-block, just hoisted into a function and called from ensureWorkerConnection(). False positive (no change needed): - Reviewer flagged `mkdirSync` as a dead import in worker-spawner.ts, but it IS used at line 71 in markWorkerSpawnAttempted (the round-1 ENOENT fix CodeRabbit explicitly asked for). Out of scope: - Volta path (~/.volta/bin/bun) — PATH fallback handles it; nit per reviewer - worker-spawner.ts unit tests — needs injectable I/O, deferred consistently since round 1 Verified: build clean, tests 44/44, smoke test 7-tool surface. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(mcp): address PR #1645 review feedback (round 8) Round 8 of Claude Code Review feedback on PR #1645: tests/services/worker-spawner.test.ts: NEW FILE — unit tests for the ensureWorkerStarted entry-point validation guards added in round 6. Covers the empty-string and non-existent-path cases without requiring the broader injectable-I/O refactor that the deeper spawn lifecycle tests would need. 2 new passing tests. src/services/infrastructure/ProcessManager.ts: memoize resolveWorkerRuntimePath() for the no-options call site (which is what spawnDaemon uses). Caches both successful resolutions and the not-found result so repeated spawn attempts (crash loops, health thrashing) don't repeatedly hit statSync on candidate paths. Tests that pass options bypass the cache entirely so existing test cases remain deterministic. 
Added resetWorkerRuntimePathCache() exported for test isolation only. src/servers/mcp-server.ts: rename checkWorkerScriptPath() → warnIfWorkerScriptMissing(). Per reviewer: the old name implied a boolean check but the function returns void and has side effects. New name is more accurate. DEFENDED (no change made): - Reviewer asked to elevate process.cwd() fallback to a synchronous throw at module load. This conflicts with round 7 feedback which asked to defer the existsSync check to first call to avoid noisy test logs. The current lazy approach is the right compromise: it fires before any actual spawn attempt, attributes the root cause, and doesn't pollute test imports. Throwing at module load would crash before stdio is wired up, which is much harder to debug than the lazy log line. - Reviewer asked to grep for `if (!pid)` callsites — already verified in round 3, zero offenders in src/. Out of scope: - Volta path (~/.volta/bin/bun) — PATH fallback handles it; reviewer marked as nit - Deeper unit tests for ensureWorkerStarted spawn lifecycle (PID file cleanup, health checks, etc.) — needs injectable I/O, deferred consistently since round 1 Verified: build clean, ProcessManager tests still 44/44, new worker-spawner tests 2/2, smoke test serves 7 tools. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(spawner): clear Windows cooldown marker on all healthy paths (round 9) Round 9 of PR #1645 review feedback. src/services/worker-spawner.ts: clear stale Windows cooldown marker on every healthy-return path Per CodeRabbit (genuine bug): The .worker-start-attempted marker was previously only cleared after a spawn initiated by ensureWorkerStarted itself succeeded. If a previous auto-start failed, then the worker became healthy via another session or a manual start, the early-return success branches (existing live PID, fast-path health check, port-in-use waitForHealth) would leave the stale marker behind. 
A subsequent genuine outage inside the 2-minute cooldown window would then be incorrectly suppressed on Windows. Now calls clearWorkerSpawnAttempted() on all three healthy success paths in addition to the existing post-spawn path. The function is already a no-op on non-Windows, so the change is risk-free for Linux and macOS callers. src/servers/mcp-server.ts: more actionable error when auto-start fails Per claude-review: when ensureWorkerStarted returns false (or throws), the caller currently logs a generic "Worker auto-start failed" line. Updated both error sites to explicitly call out which MCP tools will fail (search/timeline/get_observations) and to point at earlier log lines for the specific cause. Helps users distinguish "worker is just not running" from "tools are broken". DEFENDED (no change): - Sentinel object for Windows spawnDaemon 0 PID — broader API change, out of scope, deferred consistently since round 1 - Spawner lifecycle tests beyond input validation — needs injectable I/O, deferred consistently - Concurrent cooldown marker race on Windows — pre-existing, out of scope - stripHardcodedDirname() regex fragility assertion — pre-existing, out of scope Verified: build clean, ProcessManager tests 44/44, worker-spawner tests 2/2, smoke test 7-tool surface. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(spawner): don't cache null Bun-not-found result (round 10) Round 10 of PR #1645 review feedback. src/services/infrastructure/ProcessManager.ts: only cache successful resolveWorkerRuntimePath() results Genuine bug from claude-review: the round-8 memoization cached BOTH successful resolutions AND the not-found `null` result. 
If Bun isn't on PATH at the moment the MCP server first tries to spawn the worker — e.g., on a fresh install where the user installs Bun in another terminal and retries — every subsequent ensureWorkerConnection call would return the cached `null` and fail with a misleading "Bun not found" error even though Bun is now available. The fix is the one-line change the reviewer suggested: only cache when `result !== null`. Crash loops still get the fast-path memoized success; recovery from a fresh-install Bun install still works. src/servers/mcp-server.ts: rename warnIfWorkerScriptMissing → errorIfWorkerScriptMissing Per claude-review: the function uses logger.error but the name says "warn" — name/level mismatch. Renamed to match. The function still serves the same purpose (defensive lazy check), just with an accurate name. DEFENDED (no change): - Discriminated union for mcpServerDirResolutionFailed flag — current approach works, the noise is minimal, and the alternative would add type complexity for a path that's functionally unreachable in CJS deployment - macOS /usr/local/bin/bun "missing" — already in the Linux/macOS candidate list at line 137 (false positive from reviewer) - nix store path — out of scope, PATH fallback handles it - Long build-hooks.js error message — verbosity is intentional, this message only fires on a real regression and the diagnostic value is worth the line wrap - Spawner lifecycle test coverage gap — needs injectable I/O, deferred consistently Verified: build clean, ProcessManager tests 44/44, worker-spawner tests 2/2, smoke test 7-tool surface. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(mcp): bundle size budget guardrail (round 11) Round 11 of PR #1645 review feedback. 
scripts/build-hooks.js: secondary bundle-size budget guardrail Per claude-review: the existing `require("bun:")` regex catches the specific regression class we already know about, but if esbuild ever changes how it emits external module specifiers, the regex could silently miss the regression. A bundle-size budget catches the structural symptom (worker-service.ts dragged into the bundle blew the size from ~358KB to ~1.96MB) regardless of how the imports look. Set the ceiling at 600KB. Current size is ~384KB; the broken v12.0.0 bundle was ~1920KB. Plenty of headroom for legitimate growth without incentivizing bundle bloat or false positives. Both guardrails fire independently — one is regex-based, one is size-based — so a regression has to defeat both to ship. tests/services/worker-spawner.test.ts: comment about port irrelevance Per claude-review: the hardcoded port values in the validation-guard tests are arbitrary because the path validation short-circuits before any network I/O. Added a comment explaining this so future readers don't waste time wondering why specific ports were picked. DEFENDED (no change): - clearWorkerSpawnAttempted on the unhealthy-live-PID return path: reviewer asked to clear the marker here too, but the current behavior is correct. The marker tracks "recently attempted a spawn" and exists to prevent rapid PowerShell-popup loops. If a wedged process is currently using the port, the spawn isn't actually happening on this code path (the helper returns false without reaching the spawn step). When the wedged process eventually dies and a subsequent call hits the spawn path, the marker correctly suppresses repeated retry attempts within the 2-minute cooldown. Clearing the marker on the unhealthy-return path would defeat exactly the popup-loop protection the marker exists to provide. - execSync in lookupBinaryInPath blocks event loop: pre-existing concern, not introduced by this PR. Reviewer notes "fires once, result cached". 
Not in scope for a hotfix. - Tracking issue for spawner lifecycle test gap: out of scope for this PR; the gap is documented in the test file's header comment with a back-reference to PR #1645. Verified: build clean, both guardrails functional (size budget is under the new ceiling), ProcessManager tests 44/44, worker-spawner tests 2/2, smoke test 7-tool surface. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix(mcp): eliminate double error log when worker bundle is missing (round 12) Round 12 of PR #1645 review feedback. src/servers/mcp-server.ts: errorIfWorkerScriptMissing() now only logs when the dirname-fallback attribution path is needed Previously a missing worker-service.cjs would produce two ERROR log lines on the same code path: 1. errorIfWorkerScriptMissing() in ensureWorkerConnection() 2. The existsSync guard inside ensureWorkerStarted() The simple "missing bundle" case is fully covered by the spawner's own existsSync guard. The mcp-server.ts function now ONLY logs when mcpServerDirResolutionFailed is true — that's the mcp-server-specific root-cause attribution that the spawner cannot provide on its own. Net effect: same single error log per bug class, cleaner triage. DEFENDED (no change): - mkdirSync error propagation in markWorkerSpawnAttempted: reviewer worried that mkdirSync/writeFileSync exceptions could escape, but the entire body is already wrapped in try/catch with an APPROVED OVERRIDE annotation. False positive. - clearWorkerSpawnAttempted on healthy paths: reviewer asked a clarifying question, not a change request. The behavior is intentional — the cooldown marker exists to prevent rapid PowerShell-popup loops from a series of failed spawns; a healthy worker means the marker has served its purpose and a future outage should NOT be suppressed. Will explain in PR reply. 
- __filename ESM concern in worker-service.ts wrapper: already documented in round 4 with an extended comment about the CJS bundle context and why mcp-server.ts can't use the same trick. - Spawn lifecycle integration tests: deferred consistently since round 1; gap is documented in worker-spawner.test.ts header. Verified: build clean, ProcessManager tests 44/44, worker-spawner tests 2/2, smoke test 7-tool surface. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test(spawner): add bare-command BUN env override coverage Final round of PR #1645 review feedback: while preparing to merge, I noticed CodeRabbit's round-5 CHANGES_REQUESTED review on commit 3570d2f0 included an unaddressed nitpick — the env-driven bare-command branch in resolveWorkerRuntimePath() (returning a bare 'bun' unchanged when BUN or BUN_PATH is set that way) had no test coverage and could regress without any failing assertion. Added a focused test that exercises the env: { BUN: 'bun' } branch specifically. 47/47 tests pass (was 46/46). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 18:08:36 -07:00
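The round-10 fix above (cache only successful `resolveWorkerRuntimePath()` results) can be illustrated with a minimal sketch. The helper name `memoizeSuccessOnly`, the `Resolver` type, and the example path are hypothetical stand-ins, not the code in ProcessManager.ts:

```typescript
// Hypothetical resolver shape: returns a runtime path, or null when nothing is found.
type Resolver = () => string | null;

// Wrap a resolver so that only successful (non-null) results are cached.
function memoizeSuccessOnly(resolve: Resolver): { get: Resolver; reset: () => void } {
  let cached: string | null = null;
  return {
    get: () => {
      if (cached !== null) return cached;
      const result = resolve();
      if (result !== null) cached = result; // never cache the not-found result
      return result;
    },
    // Test-isolation hook, analogous in spirit to resetWorkerRuntimePathCache().
    reset: () => {
      cached = null;
    },
  };
}

// Usage: Bun is missing on the first lookup, then gets installed.
let installed = false;
let lookups = 0;
const runtimePath = memoizeSuccessOnly(() => {
  lookups += 1;
  return installed ? "/usr/local/bin/bun" : null;
});

const first = runtimePath.get(); // not found, and not cached
installed = true;
const second = runtimePath.get(); // found, now cached
const third = runtimePath.get(); // served from the cache
```

Because the not-found result is never stored, a retry after the user installs Bun re-runs the lookup, while repeated calls after a success skip it.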
Alex Newman	25bb93a995	fix: address PR #1641 review comments (round 2) - Remove duplicate TranscriptWatcher/config imports in worker-service.ts - Use normalizePlatformSource in handleSessionInitByClaudeId for consistency - Don't skip DB completion when session not in memory (completeByClaudeId) - Add try-catch around fetch in useContextPreview refresh callback - Deduplicate store.getAllProjects() call in DataRoutes - Fix malformed comment separators in migration runner - Fix missing closing brace and JSDoc opener (merge artifact) in migration runner Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 13:22:58 -07:00
Alex Newman	cbb68ad9e1	fix: worker startup crash and missing observation columns Two bugs fixed: 1. SessionCompletionHandler called dbManager.getSessionStore() during WorkerService construction, before DB initialization. Changed to accept DatabaseManager and defer the call to runtime. 2. migration009 (generated_by_model, relevance_count columns) only ran via the deprecated MigrationRunner path, never through SessionStore's migration chain. Added addObservationModelColumns() to SessionStore constructor. Checks column existence directly since schema_versions may have been marked applied without the ALTER TABLE succeeding. Also removed duplicate transcriptWatcher declaration and shutdown block (merge artifact). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 12:20:10 -07:00
Alex Newman	6250a194dd	Merge branch 'pr-1472' into integration/validation-batch # Conflicts: # plugin/scripts/context-generator.cjs # plugin/scripts/mcp-server.cjs # plugin/scripts/worker-service.cjs # plugin/ui/viewer-bundle.js # src/cli/handlers/context.ts # src/services/sqlite/SessionStore.ts # src/services/sqlite/migrations/runner.ts # src/services/worker-service.ts # src/shared/SettingsDefaultsManager.ts	2026-04-06 14:23:18 -07:00
Alex Newman	d570909bf1	Merge branch 'pr-1491' into integration/validation-batch # Conflicts: # plugin/scripts/mcp-server.cjs # plugin/scripts/worker-service.cjs # src/shared/hook-constants.ts	2026-04-06 14:20:05 -07:00
Henry Gimenez da Costa	753837bff3	fix(windows): isMainModule CJS branch fails on Bun — add CLAUDE_MEM_MANAGED fallback On Bun/Windows, `require.main !== module` in CJS mode causes the worker to exit silently with code 0. The wrapper already sets CLAUDE_MEM_MANAGED=true when spawning the inner worker, so checking this env var is a safe fallback that doesn't affect standalone execution. Ref #1450 (incomplete fix in PR #1518 — ESM path fixed but CJS branch not). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-05 02:45:57 -03:00
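A rough sketch of the fallback described in this commit, written as a pure function so it can run outside a CJS bundle. The `MainModuleInputs` shape is an assumption for illustration; the real check reads `require.main`, `module`, and `process.env` directly:

```typescript
// Hypothetical input shape standing in for require.main, module, and process.env.
interface MainModuleInputs {
  requireMain: object | undefined;
  currentModule: object;
  env: Record<string, string | undefined>;
}

function isMainModule({ requireMain, currentModule, env }: MainModuleInputs): boolean {
  // Standard CJS entry-point check.
  if (requireMain === currentModule) return true;
  // Fallback: the wrapper sets CLAUDE_MEM_MANAGED=true when it spawns the inner worker.
  return env.CLAUDE_MEM_MANAGED === "true";
}

const workerModule = {};
const viaRequireMain = isMainModule({ requireMain: workerModule, currentModule: workerModule, env: {} });
const viaEnvFallback = isMainModule({
  requireMain: undefined,
  currentModule: workerModule,
  env: { CLAUDE_MEM_MANAGED: "true" },
});
const plainImport = isMainModule({ requireMain: {}, currentModule: workerModule, env: {} });
```

A plain import without the environment variable still evaluates to false, which is why the fallback does not change standalone behavior.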
Alex Newman	2495f98496	refactor: consolidate MCP factory, add non-TTY support, auto-detect transcript watchers - Phase 1: Replace 5 duplicate MCP installers with config-driven factory, extract shared context-injection and json-utils utilities, fix process.execPath usage - Phase 2: Add non-TTY fallback for @clack/prompts to prevent ENOENT in CI/Docker - Phase 3: Wire GeminiCliHooksInstaller through hook command framework with adapter - Phase 4: Auto-start transcript watchers on worker boot when config exists Net -107 lines via DRY consolidation of duplicated installer logic. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 00:35:55 -07:00
Oracle Public Cloud User	4589b34eab	fix: decouple mcp health from loopback self-check	2026-03-30 16:37:56 +00:00
Ryan Malia	b0f1a458cf	fix: log warning when readiness times out on reused-worker path (#1491) Mirror the fresh-spawn path's timeout logging for debugging parity. CodeRabbit nitpick on PR #1491. Co-Authored-By: CC <noreply@anthropic.com>	2026-03-30 03:47:08 -07:00
Ryan Malia	83f61177c7	fix: address CodeRabbit review feedback on PR #1491 - Update POST_SPAWN_WAIT test assertion from 5000 to 15000 to match the constant change in hook-constants.ts - Remove redundant readPidFile() from aggressiveStartupCleanup() — start() writes the new PID before this runs, so it always returns process.pid (already protected) - Add waitForReadiness() to the reused-worker path in ensureWorkerStarted() to prevent concurrent hooks from racing past a cold-starting worker's initialization guard Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-30 03:43:36 -07:00
Ryan Malia	88b47f9e9c	fix: prevent worker daemon from being killed by its own hooks (#1490 ) Three independent fixes for worker daemon instability: 1. Remove version mismatch auto-restart from ensureWorkerStarted() (#1435). The marketplace bundle ships with __DEFAULT_PACKAGE_VERSION__ unbaked, causing BUILT_IN_VERSION to fall back to "development". This creates a 100% reproducible mismatch on every hook call, killing a healthy worker and often failing to restart. Same pattern across #566, #665, #667, #669, #689, #1124, #1145 (8+ releases). 2. Add process.ppid and PID-file PID to aggressiveStartupCleanup() exclusions (#1426). Without this, a newly spawned daemon SIGKILLs the hook process that spawned it and any already-running worker the PID file points to. 3. Increase POST_SPAWN_WAIT from 5s to 15s (#1423). The 5s timeout was sized for Linux (<1s startup) but macOS ARM64 cold starts take 6-8s with Chroma enabled.	2026-03-30 03:43:36 -07:00
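The exclusion rule in fix 2 can be sketched as a small filter. The function name `selectCleanupTargets` and its parameters are hypothetical; the actual `aggressiveStartupCleanup()` may gather and kill processes differently:

```typescript
// Return the candidate PIDs that are safe to kill, assuming the caller has already
// enumerated orphan candidates. pidFilePid is null when no PID file exists.
function selectCleanupTargets(
  candidatePids: number[],
  selfPid: number,
  parentPid: number,
  pidFilePid: number | null
): number[] {
  const protectedPids: number[] = [selfPid, parentPid];
  if (pidFilePid !== null) protectedPids.push(pidFilePid);
  // Keep only PIDs that are not the daemon itself, its spawning hook,
  // or the worker the PID file points to.
  return candidatePids.filter((pid) => protectedPids.indexOf(pid) === -1);
}

// 100 = this daemon, 200 = the hook that spawned it, 300 = PID-file worker.
const targets = selectCleanupTargets([100, 200, 300, 400], 100, 200, 300);
```

In the real code `selfPid` and `parentPid` would presumably come from `process.pid` and `process.ppid`.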
Alex Newman	07ab7000a8	fix: patch 7 critical bugs affecting all non-dev-machine users and Windows 1. Fix esbuild inlining build-machine __dirname as string literal — use CJS-compatible runtime banner with require("node:url").fileURLToPath across worker-service, mcp-server, and context-generator builds. 2. Fix isMainModule check missing .cjs extension and Windows backslash path normalization. 3. Wrap extractLastMessage in try-catch to prevent infinite Stop hook feedback loop on malformed transcripts (exit 0 instead of exit 2). 4. Replace heavy SessionEnd hook (Node→Bun→1.7MB CJS→HTTP) with lightweight inline node -e one-liner (~200ms vs >1s). 5. Add 7 Gemini/OpenRouter error patterns to unrecoverablePatterns circuit breaker to prevent 77K+ retry loops on expired API keys. 6. Preserve CLAUDE_CODE_OAUTH_TOKEN and CLAUDE_CODE_GIT_BASH_PATH in sanitizeEnv instead of stripping them with the CLAUDE_CODE_ prefix. 7. Use PowerShell -EncodedCommand for spawnDaemon to fix path quoting when Windows usernames contain spaces. Closes #1515, #1495, #1475, #1465, #1500, #1513, #1512, #1450, #1460, #1486, #1449, #1481, #1451, #1480, #1453, #1445 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 15:20:29 -07:00
huakson	2b60dd2932	feat: isolate Claude and Codex session sources Persist platform_source across session creation, transcript ingestion, API query paths, and viewer state so Claude and Codex data can coexist without bleeding into each other. - add platform-source normalization helpers and persist platform_source in sdk_sessions via migration 24 with backfill and indexing - thread platformSource through CLI hooks, transcript processing, context generation, pagination, search routes, SSE payloads, and session management - expose source-aware project catalogs, viewer tabs, context preview selectors, and source badges for observations, prompts, and summaries - start the transcript watcher from the worker for transcript-based clients and preserve platform source during Codex ingestion - auto-start the worker from the MCP server for MCP-only clients and tighten stdio-driven cleanup during shutdown - keep createSDKSession backward compatible with existing custom-title callers while allowing explicit platform source forwarding	2026-03-24 08:46:18 -03:00
Alex Newman	4d7bec4d05	fix: stop spinner from spinning forever (#1440 ) * fix: stop spinner from spinning forever due to orphaned DB messages The activity spinner never stopped because isAnySessionProcessing() queried ALL pending/processing messages in the database, including orphaned messages from dead sessions that no generator would ever process. Root cause: isAnySessionProcessing() used hasAnyPendingWork() which is a global DB scan. Changed it to use getTotalQueueDepth() which only checks sessions in the active in-memory Map. Additional fixes: - Add terminateSession() to enforce restart-or-terminate invariant - Fix 3 zombie paths in .finally() handler that left sessions alive - Clean up idle sessions from memory on successful completion - Remove redundant bare isProcessing:true broadcast - Replace inline require() with proper accessor - Add 8 regression tests for session termination invariant Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: address review findings — idle-timeout race, double broadcast, query amplification - Move pendingCount check before idle-timeout termination to prevent abandoning fresh messages that arrive between idle abort and .finally() - Move broadcastProcessingStatus() inside restart branch only — the else branch already broadcasts via removeSessionImmediate callback - Compute queueDepth once in broadcastProcessingStatus() and derive isProcessing from it, eliminating redundant double iteration Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 14:13:10 -07:00
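The core of this fix, deriving `isProcessing` from the queue depth of active in-memory sessions only, might look roughly like the sketch below. The `SessionQueue` shape is assumed, and a plain array stands in for the in-memory Map the commit mentions:

```typescript
// Hypothetical per-session view: only sessions that are active in memory appear here.
interface SessionQueue {
  sessionDbId: number;
  pendingCount: number;
}

function getTotalQueueDepth(activeSessions: SessionQueue[]): number {
  return activeSessions.reduce((sum, session) => sum + session.pendingCount, 0);
}

// Compute queueDepth once and derive isProcessing from it.
function buildProcessingStatus(activeSessions: SessionQueue[]): { isProcessing: boolean; queueDepth: number } {
  const queueDepth = getTotalQueueDepth(activeSessions);
  return { isProcessing: queueDepth > 0, queueDepth };
}

const busy = buildProcessingStatus([
  { sessionDbId: 1, pendingCount: 2 },
  { sessionDbId: 2, pendingCount: 0 },
]);
// Orphaned DB rows from dead sessions are never in the active list, so they cannot
// keep the spinner alive.
const idle = buildProcessingStatus([]);
```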
Alex Newman	80a8c90a1a	feat: add embedded Process Supervisor for unified process lifecycle (#1370 ) * feat: add embedded Process Supervisor for unified process lifecycle management Consolidates scattered process management (ProcessManager, GracefulShutdown, HealthMonitor, ProcessRegistry) into a unified src/supervisor/ module. New: ProcessRegistry with JSON persistence, env sanitizer (strips CLAUDECODE_* vars), graceful shutdown cascade (SIGTERM → 5s wait → SIGKILL with tree-kill on Windows), PID file liveness validation, and singleton Supervisor API. Fixes #1352 (worker inherits CLAUDECODE env causing nested sessions) Fixes #1356 (zombie TCP socket after Windows reboot) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add session-scoped process reaping to supervisor Adds reapSession(sessionId) to ProcessRegistry for killing session-tagged processes on session end. SessionManager.deleteSession() now triggers reaping. Tightens orphan reaper interval from 60s to 30s. Fixes #1351 (MCP server processes leak on session end) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add Unix domain socket support for worker communication Introduces socket-manager.ts for UDS-based worker communication, eliminating port 37777 collisions between concurrent sessions. Worker listens on ~/.claude-mem/sockets/worker.sock by default with TCP fallback. All hook handlers, MCP server, health checks, and admin commands updated to use socket-aware workerHttpRequest(). Backwards compatible — settings can force TCP mode via CLAUDE_MEM_WORKER_TRANSPORT=tcp. Fixes #1346 (port 37777 collision across concurrent sessions) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: remove in-process worker fallback from hook command Removes the fallback path where hook scripts started WorkerService in-process, making the worker a grandchild of Claude Code (killed by sandbox). 
Hooks now always delegate to ensureWorkerStarted() which spawns a fully detached daemon. Fixes #1249 (grandchild process killed by sandbox) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add health checker and /api/admin/doctor endpoint Adds 30-second periodic health sweep that prunes dead processes from the supervisor registry and cleans stale socket files. Adds /api/admin/doctor endpoint exposing supervisor state, process liveness, and environment health. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add comprehensive supervisor test suite 64 tests covering all supervisor modules: process registry (18 tests), env sanitizer (8), shutdown cascade (10), socket manager (15), health checker (5), and supervisor API (6). Includes persistence, isolation, edge cases, and cross-module integration scenarios. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: revert Unix domain socket transport, restore TCP on port 37777 The socket-manager introduced UDS as default transport, but this broke the HTTP server's TCP accessibility (viewer UI, curl, external monitoring). Since there's only ever one worker process handling all sessions, the port collision rationale for UDS doesn't apply. Reverts to TCP-only, removing ~900 lines of unnecessary complexity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: remove dead code found in pre-landing review Remove unused `acceptingSpawns` field from Supervisor class (written but never read — assertCanSpawn uses stopPromise instead) and unused `buildWorkerUrl` import from context handler. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * updated gitignore * fix: address PR review feedback - downgrade HTTP logging, clean up gitignore, harden supervisor - Downgrade request/response HTTP logging from info to debug to reduce noise - Remove unused getWorkerPort imports, use buildWorkerUrl helper - Export ENV_PREFIXES/ENV_EXACT_MATCHES from env-sanitizer, reuse in Server.ts - Fix isPidAlive(0) returning true (should be false) - Add shutdownInitiated flag to prevent signal handler race condition - Make validateWorkerPidFile testable with pidFilePath option - Remove unused dataDir from ShutdownCascadeOptions - Upgrade reapSession log from debug to warn - Rename zombiePidFiles to deadProcessPids (returns actual PIDs) - Clean up gitignore: remove duplicate datasets/, stale ~/ and http/ patterns - Fix tests to use temp directories instead of relying on real PID file Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 14:49:23 -07:00
laihenyi	626654f816	fix: prevent infinite restart loop on FOREIGN KEY constraint errors (#1334) The pending-work-restart logic had no retry limit, causing infinite loops when sessions encountered FOREIGN KEY constraint failures. This led to 2000+ error log entries per minute and eventual worker crash via SIGTERM. Two fixes: 1. Add 'FOREIGN KEY constraint failed' to unrecoverable error patterns so it short-circuits immediately instead of falling through to restart 2. Add MAX_PENDING_RESTARTS (3) limit to pending-work-restart path as a safety net for any future unhandled persistent errors Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 20:03:48 -07:00
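The two fixes combine into a single restart decision, sketched here under assumptions: `decideRestart` and `RestartDecision` are hypothetical names, and `restartCount` is taken to mean the number of restarts already performed for the session:

```typescript
const MAX_PENDING_RESTARTS = 3;
const UNRECOVERABLE_PATTERNS: string[] = ["FOREIGN KEY constraint failed"];

type RestartDecision = "restart" | "terminate";

function decideRestart(errorMessage: string, restartCount: number): RestartDecision {
  // Fix 1: known-unrecoverable errors short-circuit immediately.
  const unrecoverable = UNRECOVERABLE_PATTERNS.some(
    (pattern) => errorMessage.indexOf(pattern) !== -1
  );
  if (unrecoverable) return "terminate";
  // Fix 2: any other persistent error is capped by the restart limit.
  return restartCount < MAX_PENDING_RESTARTS ? "restart" : "terminate";
}

const onForeignKey = decideRestart("SQLITE_CONSTRAINT: FOREIGN KEY constraint failed", 0);
const onTransient = decideRestart("socket hang up", 1);
const onExhausted = decideRestart("socket hang up", 3);
```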
Nir Alfasi	38d9ac7adb	fix: prevent zombie subprocess accumulation by only trusting exitCode (#1226) (#1325) proc.killed only means Node sent a signal — the process can still be alive. This caused premature pool slot release, allowing unbounded process spawning. - ensureProcessExit: remove proc.killed from early-exit checks, only trust exitCode - Fix 3 call-site guards that skipped cleanup for signaled-but-alive processes - Add TOTAL_PROCESS_HARD_CAP=10 safety net in waitForSlot() - After SIGKILL, wait up to 1s via exit event instead of blind 200ms sleep - Reduce reaper interval from 5min to 1min, idle threshold from 2min to 1min Closes #1226 Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 19:59:42 -07:00
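A minimal sketch of the rule this commit states, using a hypothetical `ChildLike` shape in place of a real child-process handle. The helper names `hasExited` and `canSpawn` are illustrative and are not taken from the repository:

```typescript
// Hypothetical subset of a child-process handle.
interface ChildLike {
  exitCode: number | null; // null while the process has not reported an exit code
  killed: boolean; // true once a signal was sent; says nothing about liveness
}

const TOTAL_PROCESS_HARD_CAP = 10;

// Only exitCode is trusted; `killed` is deliberately ignored.
function hasExited(proc: ChildLike): boolean {
  return proc.exitCode !== null;
}

// Safety net: refuse to spawn once the count of not-yet-exited processes hits the cap.
function canSpawn(procs: ChildLike[]): boolean {
  const alive = procs.filter((proc) => !hasExited(proc)).length;
  return alive < TOTAL_PROCESS_HARD_CAP;
}

const signaledButAlive: ChildLike = { exitCode: null, killed: true };
const cleanlyExited: ChildLike = { exitCode: 0, killed: false };
const slotStillHeld = !hasExited(signaledButAlive);
const slotReleased = hasExited(cleanlyExited);
```

The signaled-but-alive case is the one that previously released a pool slot too early.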
Alex Newman	ad3d236cec	fix: resolve hook crashes and CLAUDE_PLUGIN_ROOT fallback (#1215 , #1220 ) (#1229 ) * fix: resolve PostToolUse hook crashes and 5s latency (#1220) Three compounding bugs caused hook failures: 1. Missing break statements in worker-service.ts switch — if async code threw before process.exit(), execution fell through to subsequent cases. Added break to all 7 cases missing them. 2. Unhandled promise rejection on main() — added .catch() that logs the error and exits 0 (per project exit code strategy: don't block Claude Code or leave Windows Terminal tabs open). 3. Redundant start commands in hooks.json — PostToolUse, UserPromptSubmit, and Stop groups each had a standalone start command that was redundant (the hook case already calls ensureWorkerStarted internally). The redundant start also caused 5s latency via bun-runner.js collectStdin() timeout since Claude Code never closes stdin. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: add CLAUDE_PLUGIN_ROOT fallback for Stop hooks (#1215) Upstream Claude Code bug (anthropics/claude-code#24529) leaves CLAUDE_PLUGIN_ROOT unset for Stop hooks on macOS and ALL hooks on Linux. Two-layer defense: 1. Shell-level: hooks.json commands now use inline fallback _R="${CLAUDE_PLUGIN_ROOT}"; [ -z "$_R" ] && _R="$HOME/..."; falling back to the known marketplace install path. 2. Script-level: bun-runner.js self-resolves plugin root from its own filesystem location via import.meta.url, and fixes broken /scripts/... paths that result from empty expansion. Added test to verify all hook commands include the fallback path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 19:31:26 -05:00
Alex Newman	c6f932988a	Fix 30+ root-cause bugs across 10 triage phases (#1214 ) * MAESTRO: fix ChromaDB core issues — Python pinning, Windows paths, disable toggle, metadata sanitization, transport errors - Add --python version pinning to uvx args in both local and remote mode (fixes #1196, #1206, #1208) - Convert backslash paths to forward slashes for --data-dir on Windows (fixes #1199) - Add CLAUDE_MEM_CHROMA_ENABLED setting for SQLite-only fallback mode (fixes #707) - Sanitize metadata in addDocuments() to filter null/undefined/empty values (fixes #1183, #1188) - Wrap callTool() in try/catch for transport errors with auto-reconnect (fixes #1162) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix data integrity — content-hash deduplication, project name collision, empty project guard, stuck isProcessing - Add SHA-256 content-hash deduplication to observations INSERT (store.ts, transactions.ts, SessionStore.ts) - Add content_hash column via migration 22 with backfill and index - Fix project name collision: getCurrentProjectName() now returns parent/basename - Guard against empty project string with cwd-derived fallback - Fix stuck isProcessing: hasAnyPendingWork() resets processing messages older than 5 minutes - Add 12 new tests covering all four fixes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix hook lifecycle — stderr suppression, output isolation, conversation pollution prevention - Suppress process.stderr.write in hookCommand() to prevent Claude Code showing diagnostic output as error UI (#1181). Restores stderr in finally block for worker-continues case. - Convert console.error() to logger.warn()/error() in hook-command.ts and handlers/index.ts so all diagnostics route to log file instead of stderr. - Verified all 7 handlers return suppressOutput: true (prevents conversation pollution #598, #784). - Verified session-complete is a recognized event type (fixes #984). 
- Verified unknown event types return no-op handler with exit 0 (graceful degradation). - Added 10 new tests in tests/hook-lifecycle.test.ts covering event dispatch, adapter defaults, stderr suppression, and standard response constants. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix worker lifecycle — restart loop coordination, stale transport retry, ENOENT shutdown race - Add PID file mtime guard to prevent concurrent restart storms (#1145): isPidFileRecent() + touchPidFile() coordinate across sessions - Add transparent retry in ChromaMcpManager.callTool() on transport error — reconnects and retries once instead of failing (#1131) - Wrap getInstalledPluginVersion() with ENOENT/EBUSY handling (#1042) - Verified ChromaMcpManager.stop() already called on all shutdown paths Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix Windows platform support — uvx.cmd spawn, PowerShell $_ elimination, windowsHide, FTS5 fallback - Route uvx spawn through cmd.exe /c on Windows since MCP SDK lacks shell:true (#1190, #1192, #1199) - Replace all PowerShell Where-Object {$_} pipelines with WQL -Filter server-side filtering (#1024, #1062) - Add windowsHide: true to all exec/spawn calls missing it to prevent console popups (#1048) - Add FTS5 runtime probe with graceful fallback when unavailable on Windows (#791) - Guard FTS5 table creation in migrations, SessionSearch, and SessionStore with try/catch Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix skills/ distribution — build-time verification and regression tests (#1187) Add post-build verification in build-hooks.js that fails if critical distribution files (skills, hooks, plugin manifest) are missing. Add 10 regression tests covering skill file presence, YAML frontmatter, hooks.json integrity, and package.json files field. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix MigrationRunner schema initialization (#979) — version conflict between parallel migration systems Root cause: old DatabaseManager migrations 1-7 shared schema_versions table with MigrationRunner's 4-22, causing version number collisions (5=drop tables vs add column, 6=FTS5 vs prompt tracking, 7=discovery_tokens vs remove UNIQUE). initializeSchema() was gated behind maxApplied===0, so core tables were never created when old versions were present. Fixes: - initializeSchema() always creates core tables via CREATE TABLE IF NOT EXISTS - Migrations 5-7 check actual DB state (columns/constraints) not just version tracking - Crash-safe temp table rebuilds (DROP IF EXISTS _new before CREATE) - Added missing migration 21 (ON UPDATE CASCADE) to MigrationRunner - Added ON UPDATE CASCADE to FK definitions in initializeSchema() - All changes applied to both runner.ts and SessionStore.ts Tests: 13 new tests in migration-runner.test.ts covering fresh DB, idempotency, version conflicts, crash recovery, FK constraints, and data integrity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix 21 test failures — stale mocks, outdated assertions, missing OpenClaw guards Server tests (12): Added missing workerPath and getAiStatus to ServerOptions mocks after interface expansion. ChromaSync tests (3): Updated to verify transport cleanup in ChromaMcpManager after architecture refactor. OpenClaw (2): Added memory_ tool skipping and response truncation to prevent recursive loops and oversized payloads. MarkdownFormatter (2): Updated assertions to match current output. SettingsDefaultsManager (1): Used correct default key for getBool test. Logger standards (1): Excluded CLI transcript command from background service check. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix Codex CLI compatibility (#744) — session_id fallbacks, unknown platform tolerance, undefined guard Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix Cursor IDE integration (#838, #1049) — adapter field fallbacks, tolerant session-init validation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix /api/logs OOM (#1203) — tail-read replaces full-file readFileSync Replace readFileSync (loads entire file into memory) with readLastLines() that reads only from the end of the file in expanding chunks (64KB → 10MB cap). Prevents OOM on large log files while preserving the same API response shape. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix Settings CORS error (#1029) — explicit methods and allowedHeaders in CORS config Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: add session custom_title for agent attribution (#1213) — migration 23, endpoint + store support Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: prevent CLAUDE.md/AGENTS.md writes inside .git/ directories (#1165) Add .git path guard to all 4 write sites to prevent ref corruption when paths resolve inside .git internals. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix plugin disabled state not respected (#781) — early exit check in all hook entry points Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix UserPromptSubmit context re-injection on every turn (#1079) — contextInjected session flag Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * MAESTRO: fix stale AbortController queue stall (#1099) — lastGeneratorActivity tracking + 30s timeout Three-layer fix: 1. Added lastGeneratorActivity timestamp to ActiveSession, updated by processAgentResponse (all agents), getMessageIterator (queue yields), and startGeneratorWithProvider (generator launch) 2. 
Added stale generator detection in ensureGeneratorRunning — if no activity for >30s, aborts stale controller, resets state, restarts 3. Added AbortSignal.timeout(30000) in deleteSession to prevent indefinite hang when awaiting a stuck generator promise Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-23 19:34:35 -05:00
Alex Newman	7966c6cba9	fix: rename save_memory and fix MCP search instructions + startup hook (#1210 ) * fix: rename save_memory to save_observation and fix MCP search instructions Stop the primary agent from proactively saving memories by renaming save_memory to save_observation with a neutral description. Remove "Saving Memories" section from SKILL.md. Update context formatters and output styles to reference the mem-search skill instead of raw MCP tool names. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: split SessionStart hooks so smart-install failure doesn't block worker start smart-install.js and worker-start were in the same hook group, so if smart-install exited non-zero the worker never started. Split into separate hook groups so they run independently. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: worker startup waits for readiness before hooks fire Move initializationCompleteFlag to set after DB/search init (not MCP), add waitForReadiness() polling /api/readiness, and extract shared pollEndpointUntilOk helper to DRY up health/readiness checks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-23 03:30:31 -05:00
Alex Newman	e788fd3676	fix: prevent duplicate worker daemons and zombie processes (#1178 ) * fix: prevent duplicate worker daemons and zombie processes Three root causes of chroma-mcp timeouts: 1. HTTP shutdown (POST /api/admin/shutdown) closed resources but never called process.exit(). Zombie workers stayed alive, background tasks reconnected to chroma-mcp, spawning duplicate subprocesses that all contended for the same persistent data directory. 2. No guard against concurrent daemon startup. When hooks fired simultaneously, multiple daemons started before either wrote a PID file. The loser got EADDRINUSE but stayed alive because signal handlers registered in the constructor prevented exit. 3. Corrupt 147GB HNSW index file caused all chroma queries to timeout (MCP error -32001). Data fix: deleted corrupt collection, backfill rebuilds from SQLite. Code fixes: - Add PID-based guard in daemon startup: exit if PID file process alive - Add port-based guard in daemon startup: exit if port already bound (runs before WorkerService constructor registers keepalive handlers) - Add process.exit(0) after HTTP shutdown/restart completes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: aggressive startup cleanup and one-time chroma wipe for upgrade Kill orphaned worker-service.cjs and chroma-mcp processes immediately at startup (no age gate) while keeping 30-min threshold for mcp-server. Wipe corrupt chroma data once on upgrade from pre-v10.3 versions — backfill rebuilds from SQLite automatically. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: wrap shutdown handlers in try/finally to guarantee process.exit If onShutdown() or onRestart() threw, process.exit(0) was never reached, leaving the daemon alive as a zombie. Also removed redundant require('fs') calls in process-manager tests where ESM imports already existed. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 20:10:28 -05:00
Alex Newman	40daf8f3fa	feat: replace WASM embeddings with persistent chroma-mcp MCP connection (#1176 ) * feat: replace WASM embeddings with persistent chroma-mcp MCP connection Replace ChromaServerManager (npx chroma run + chromadb npm + ONNX/WASM) with ChromaMcpManager, a singleton stdio MCP client that communicates with chroma-mcp via uvx. This eliminates native binary issues, segfaults, and WASM embedding failures that plagued cross-platform installs. Key changes: - Add ChromaMcpManager: singleton MCP client with lazy connect, auto-reconnect, connection lock, and Zscaler SSL cert support - Rewrite ChromaSync to use MCP tool calls instead of chromadb npm client - Handle chroma-mcp's non-JSON responses (plain text success/error messages) - Treat "collection already exists" as idempotent success - Wire ChromaMcpManager into GracefulShutdown for clean subprocess teardown - Delete ChromaServerManager (no longer needed) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: address PR review — connection guard leak, timer leak, async reset - Clear connecting guard in finally block to prevent permanent reconnection block - Clear timeout after successful connection to prevent timer leak - Make reset() async to await stop() before nullifying instance - Delete obsolete chroma-server-manager test (imports deleted class) - Update graceful-shutdown test to use chromaMcpManager property name Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: prevent chroma-mcp spawn storm — zombie cleanup, stale onclose guard, reconnect backoff Three bugs caused chroma-mcp processes to accumulate (92+ observed): 1. Zombie on timeout: failed connections left subprocess alive because only the timer was cleared, not the transport. Now catch block explicitly closes transport+client before rethrowing. 2. Stale onclose race: old transport's onclose handler captured `this` and overwrote the current connection reference after reconnect, orphaning the new subprocess. 
Now guarded with reference check. 3. No backoff: every failure triggered immediate reconnect. With backfill doing hundreds of MCP calls, this created rapid-fire spawning. Added 10s backoff on both connection failure and unexpected process death. Also includes ChromaSync fixes from PR review: - queryChroma deduplication now preserves index-aligned arrays - SQL injection guard on backfill ID exclusion lists Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 18:32:38 -05:00
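
The reconnect backoff described in commit 40daf8f3fa can be sketched as a small gate that refuses new connection attempts for a fixed window after a failure. This is a minimal illustration, not the code in ChromaMcpManager: the class name `ReconnectBackoff` and its methods are hypothetical, and only the 10s window comes from the commit message. The clock is passed in so the behaviour is easy to check.

```typescript
// Hypothetical sketch of a fixed reconnect backoff; names are assumptions,
// only the 10s window is taken from the commit message.
class ReconnectBackoff {
  private blockedUntil: number = 0;
  private backoffMs: number;

  constructor(backoffMs: number = 10_000) {
    this.backoffMs = backoffMs;
  }

  // Call on a connection failure or an unexpected subprocess death.
  recordFailure(now: number): void {
    this.blockedUntil = now + this.backoffMs;
  }

  // True once the backoff window has elapsed (or no failure was recorded).
  canConnect(now: number): boolean {
    return now >= this.blockedUntil;
  }
}

const backoff = new ReconnectBackoff();
backoff.recordFailure(1_000);
console.log(backoff.canConnect(5_000));  // false: still inside the window
console.log(backoff.canConnect(11_000)); // true: 10s have passed
```

With a gate like this, a backfill issuing hundreds of calls would see at most one spawn attempt per window instead of one per failed call.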
Alex Newman	5d79bb7a7a	fix: prevent zombie process accumulation by verifying subprocess exit (#1168 ) (#1175 ) Two changes fix the observer process resource leak: 1. Add ensureProcessExit to generator finally blocks in SessionRoutes and worker-service, matching the pattern already working in SDKAgent. 2. Add stale session reaper (every 2m) that removes sessions with no active generator and no pending work after 15m idle. This unblocks the orphan reaper which previously skipped processes for "active" sessions. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 16:33:23 -05:00
Alex Newman	b88251bc8b	fix: self-healing claimNextMessage prevents stuck processing messages (#1159 ) * fix: self-healing claimNextMessage prevents stuck processing messages claimAndDelete → claimNextMessage with atomic self-healing: resets stale processing messages (>60s) back to pending before claiming. Eliminates stuck messages from generator crashes without external timers. Removes redundant idle-timeout reset in worker-service.ts. Adds QUEUE to logger Component type. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: update stale comments in SessionQueueProcessor to reflect claim-confirm pattern Comments still referenced the old claim-and-delete pattern after the claimNextMessage rename. Updated to accurately describe the current lifecycle where messages are marked as processing and stay in DB until confirmProcessed() is called. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: move Date.now() inside transaction and extract stale threshold constant - Move Date.now() inside claimNextMessage transaction closure so timestamp is fresh if WAL contention causes retry - Extract STALE_PROCESSING_THRESHOLD_MS to module-level constant - Add comment clarifying strict < boundary semantics Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 23:15:46 -05:00
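
The self-healing claim in commit b88251bc8b can be sketched against an in-memory queue. This is an illustration under stated assumptions: the real implementation runs inside a SQLite transaction, whereas the `QueuedMessage` shape and the array-backed queue here are hypothetical. The names `claimNextMessage`, `confirmProcessed`, and `STALE_PROCESSING_THRESHOLD_MS`, the 60s threshold, and the strict `<` comparison are taken from the commit message; the exact direction of the comparison is an assumption.

```typescript
// Hypothetical in-memory model of the claim-confirm lifecycle.
type Status = "pending" | "processing";

interface QueuedMessage {
  id: number;
  status: Status;
  startedAt: number | null; // when the message was last claimed
}

const STALE_PROCESSING_THRESHOLD_MS = 60_000;

function claimNextMessage(
  queue: QueuedMessage[],
  now: number = Date.now(),
): QueuedMessage | undefined {
  // Self-heal: anything stuck in "processing" for longer than the threshold
  // goes back to "pending". Strict "<" means exactly 60s old is not yet stale.
  for (const message of queue) {
    if (
      message.status === "processing" &&
      message.startedAt !== null &&
      message.startedAt < now - STALE_PROCESSING_THRESHOLD_MS
    ) {
      message.status = "pending";
      message.startedAt = null;
    }
  }
  // Claim the oldest pending message; it stays in the queue until confirmed.
  const next = queue.find((message) => message.status === "pending");
  if (next) {
    next.status = "processing";
    next.startedAt = now;
  }
  return next;
}

function confirmProcessed(queue: QueuedMessage[], id: number): void {
  const index = queue.findIndex((message) => message.id === id);
  if (index !== -1) queue.splice(index, 1);
}
```

Because the reset happens inside the claim itself, a message orphaned by a crashed generator is recovered on the next claim, with no external timer involved.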
Alex Newman	ca8421611c	fix: backfill Chroma vector DB for all projects on startup (#1154 ) * fix: backfill all Chroma projects on worker startup ChromaSync.ensureBackfilled() existed but was never called. After v10.2.2's bun cache clear destroyed the ONNX model cache, Chroma only had ~2 days of embeddings while SQLite had 49k+ observations. - Add static backfillAllProjects() to ChromaSync — iterates all projects in SQLite, creates temporary ChromaSync per project, runs smart diff - Call backfillAllProjects() fire-and-forget on worker startup - Add 'CHROMA_SYNC' to logger Component type (pre-existing gap) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: sanitize project names for Chroma collection naming Replace characters outside [a-zA-Z0-9._-] with underscores so projects like "YC Stuff" map to collection "cm__YC_Stuff" instead of failing Chroma's collection name validation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: route backfill to shared cm__claude-mem collection, harden sanitization - Use single ChromaSync('claude-mem') in backfillAllProjects() instead of per-project instances, matching how DatabaseManager and SearchManager operate — fixes critical bug where backfilled data landed in orphaned collections that no search path reads from - Strip trailing non-alphanumeric chars from sanitized collection names to satisfy Chroma's end-character constraint - Guard backfill behind Chroma server readiness to avoid N spurious error logs when Chroma failed to start - Use CHROMA_SYNC log component consistently for backfill messages Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * refactor: pass project as parameter to ensureBackfilled instead of mutating instance state Eliminates shared mutable state in backfillAllProjects() loop. 
Project scoping is now passed explicitly via parameter to both ensureBackfilled() and getExistingChromaIds(), keeping a single Chroma connection while avoiding fragile instance property mutation across iterations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 22:47:46 -05:00
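
The collection-name sanitization described in commit ca8421611c can be sketched as a pure function. This is an illustration, not the code in ChromaSync: the function name `sanitizeCollectionName` is hypothetical, and the `cm__` prefix is inferred from the "YC Stuff" example in the commit message. The two rules (replace characters outside `[a-zA-Z0-9._-]` with underscores, then strip trailing non-alphanumeric characters) come from the commit text.

```typescript
// Hypothetical helper; name and prefix handling are assumptions.
function sanitizeCollectionName(project: string): string {
  // Rule 1: anything outside [a-zA-Z0-9._-] becomes an underscore.
  const replaced = project.replace(/[^a-zA-Z0-9._-]/g, "_");
  // Rule 2: drop trailing non-alphanumeric characters so the name
  // ends on a letter or digit.
  const trimmed = replaced.replace(/[^a-zA-Z0-9]+$/, "");
  return `cm__${trimmed}`;
}

console.log(sanitizeCollectionName("YC Stuff"));    // cm__YC_Stuff
console.log(sanitizeCollectionName("my project!")); // cm__my_project
```

A name made up entirely of disallowed characters reduces to the bare prefix in this sketch; how the real code treats that case is not stated in the commit message.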
Alex Newman	f24251118e	fix: bun install, node-addon-api for sharp, consolidate PendingMessageStore (#1140 ) * fix: use bun install in sync, add node-addon-api for sharp, consolidate PendingMessageStore - Switch sync-marketplace from npm to bun install - Add node-addon-api as dev dep so sharp builds under bun - Consolidate duplicate PendingMessageStore instantiation in worker-service finally block Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * build assets --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 18:05:42 -05:00
Alex Newman	d2e926fbf7	fix: post-merge breakage (Gemini, idle timeout, sharp cache) (#1138 ) * fix: add gemini-3-flash to validModels array The model was defined in the type union and RPM limits but missing from the runtime validModels array, causing silent fallback to gemini-2.5-flash. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: skip processing when Gemini returns empty observation response Empty responses were silently consuming messages from the queue via processAgentResponse. Now skips processing on empty content, leaving the message in processing status for stale recovery. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: prevent idle timeout from triggering infinite restart loop When a session hits the 3-minute idle timeout, the finally block was seeing stale processing messages and restarting the generator endlessly. Now tracks idle timeout as a distinct exit reason via session flag, resets stale messages, and skips restart. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: clear stale Bun native module cache on update Bun's global cache retains sharp/libvips native binaries with broken dylib references after version upgrades. Clear ~/.bun/install/cache/@img/ before install in both the end-user (smart-install) and dev (sync-marketplace) paths to prevent ERR_DLOPEN_FAILED errors in Chroma sync. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: address PR review feedback (empty summary response, session-scoped reset, shell injection) - Apply same empty-response guard to summary path as observation path in GeminiAgent - Add optional sessionDbId param to resetStaleProcessingMessages for session-scoped resets - Use JSON.stringify for gitignore pattern escaping, filter negation patterns Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 17:46:30 -05:00
Alex Newman	c27314f896	fix: address PR review comments for chroma server lifecycle	2026-02-13 23:39:30 -05:00
Alex Newman	ed313db742	Merge main into feat/chroma-http-server Resolve conflicts between Chroma HTTP server PR and main branch changes (folder CLAUDE.md, exclusion settings, Zscaler SSL, transport cleanup). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 21:02:54 -05:00
Alex Newman	5de728612e	chore: bump version to 10.0.6 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 00:02:37 -05:00
Alex Newman	f05f9ca735	Merge remote-tracking branch 'origin/main' into openclaw-installer # Conflicts: # plugin/scripts/mcp-server.cjs # plugin/scripts/worker-service.cjs	2026-02-12 22:04:03 -05:00


200 Commits