Commit Graph

22 Commits

Author SHA1 Message Date
Ben Younes 05232ff091 fix: reap stuck generators in reapStaleSessions (fixes #1652) (#1698)
* fix: reap stuck generators in reapStaleSessions (fixes #1652)

Sessions whose SDK subprocess hung would stay in the active sessions
map forever because `reapStaleSessions()` unconditionally skipped any
session with a non-null `generatorPromise`.  The generator was blocked
on `for await (const msg of queryResult)` inside SDKAgent and could
never unblock itself — the idle-timeout only fires when the generator
is in `waitForMessage()`, and the orphan reaper skips processes whose
session is still in the map.

Add `MAX_GENERATOR_IDLE_MS` (5 min).  When `reapStaleSessions()` sees
a session whose `generatorPromise` is set but `lastGeneratorActivity`
has not advanced in over 5 minutes, it now:
1. SIGKILLs the tracked subprocess to unblock the stuck `for await`
2. Calls `session.abortController.abort()` so the generator loop exits
3. Calls `deleteSession()` which waits up to 30 s for the generator to
   finish, then cleans up supervisor-tracked children

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: freeze time in stale-generator test and import constants from production source

- Export MAX_GENERATOR_IDLE_MS, MAX_SESSION_IDLE_MS, StaleGeneratorCandidate,
  StaleGeneratorProcess, and detectStaleGenerator from SessionManager.ts so
  tests no longer duplicate production constants or detection logic.
- Use setSystemTime() from bun:test to freeze Date.now() in the
  "exactly at threshold" test, eliminating the flaky double-Date.now() race.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 00:58:35 -07:00
Alex Newman 4c2ab98d90 Merge pull request #1679 from ousamabenyounes/fix/issue-1297
fix: set cwd to homedir when spawning chroma-mcp to prevent pydantic .env.local crash (#1297)
2026-04-14 18:41:48 -07:00
Alex Newman 216d17879d Merge pull request #1680 from ousamabenyounes/fix/issue-1447
fix: suppress false ERROR when duplicate daemon loses port bind race (#1447)
2026-04-14 18:41:25 -07:00
Ousama Ben Younes 08cf2ba3bd fix: suppress false ERROR when duplicate daemon loses port bind race (#1447)
When the MCP server and SessionStart hook both spawn a worker daemon
concurrently, one loses the bind race (EADDRINUSE / Bun's port-in-use
error). The loser now checks if the winner is healthy; if so, it logs
INFO and exits cleanly instead of logging a misleading ERROR on every
first session start.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 10:01:08 +00:00
Ousama Ben Younes c7c4fd54d6 fix: set cwd to homedir when spawning chroma-mcp to prevent pydantic .env.local crash (#1297)
chroma-mcp uses pydantic-settings which auto-reads .env/.env.local from
the CWD. When the project directory contains non-chroma variables (e.g.
CELERY_TASK_ALWAYS_EAGER), pydantic rejects them with "Extra inputs are
not permitted", crashing the subprocess and triggering a permanent
backoff loop. Passing cwd: os.homedir() to StdioClientTransport ensures
pydantic never reads project env files.

Generated by Claude Code
Vibe coded by ousamabenyounes

Co-Authored-By: Claude <noreply@anthropic.com>
2026-04-10 09:55:02 +00:00
Alex Newman abd55977ca fix(mcp): MCP server crashes with Cannot find module 'bun:sqlite' under Node (#1645)
* fix(mcp): MCP server crashes with Cannot find module 'bun:sqlite' under Node

The MCP server bundle (mcp-server.cjs) ships with `#!/usr/bin/env node` so
it must run under Node, but commit 2b60dd29 added an import of
`ensureWorkerStarted` from worker-service.ts. That import transitively pulls
in DatabaseManager → bun:sqlite, blowing up at top-level require under Node.

The bundle ballooned from ~358KB (v11.0.1) to ~1.96MB (v12.0.0) and crashed
on every spawn, breaking the MCP server entirely for Codex/MCP-only clients
and any flow that boots the MCP tool surface.

Fix:

1. Extract `ensureWorkerStarted` and the Windows spawn-cooldown helpers
   into a new lightweight module `src/services/worker-spawner.ts` that
   only imports from infrastructure/ProcessManager, infrastructure/HealthMonitor,
   shared/*, and utils/logger — no SQLite, no ChromaSync, no DatabaseManager.

2. The new helper takes the worker script path explicitly so callers
   running under Node (mcp-server) can pass `worker-service.cjs` while
   callers already inside the worker (worker-service self-spawn) pass
   `__filename`. worker-service.ts keeps a thin wrapper for back-compat.

3. mcp-server.ts now imports from worker-spawner.js and resolves
   WORKER_SCRIPT_PATH via __dirname so the daemon can be auto-started
   for MCP-only clients without dragging in the entire worker bundle.

4. resolveWorkerRuntimePath() now searches for Bun on every platform
   (not just Windows). worker-service.cjs requires Bun at runtime, so
   when the spawner is invoked from a Node process the Unix branch can
   no longer fall through to process.execPath (= node).

5. spawnDaemon's Unix branch now calls resolveWorkerRuntimePath() instead
   of hardcoding process.execPath, fixing the same Node-spawning-Node bug
   for the actual subprocess launch on Linux/macOS.

After:
- mcp-server.cjs is 384KB again with zero `bun:sqlite` references
- node mcp-server.cjs initializes and serves tools/list + tools/call
  (verified via JSON-RPC against the running worker)
- ProcessManager test suite updated for the new cross-platform Bun
  resolution behavior; full suite has the same pre-existing failures
  as main, no regressions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(mcp): address PR #1645 review feedback (round 1)

Per Claude Code Review on PR #1645:

1. mcp-server.ts: log a warning when both __dirname and import.meta.url
   resolution fail. The cwd() fallback is essentially dead code for the
   CJS bundle but if it ever fires it gives the user a breadcrumb instead
   of a silently-wrong WORKER_SCRIPT_PATH.

2. mcp-server.ts: existsSync check on WORKER_SCRIPT_PATH at module load.
   Surfaces a clear "worker-service.cjs not found at expected path" log
   line for partial installs / dev environments instead of letting the
   failure surface as a generic spawnDaemon error later.

3. ProcessManager.ts: explanatory comment on the Windows `return 0`
   sentinel in spawnDaemon. Documents that PowerShell Start-Process
   doesn't return a PID and that callers MUST use `pid === undefined`
   for failure detection — never falsy checks like `if (!pid)`.

Items 4 (no direct unit tests for the worker-spawner Windows cooldown
helpers) and 5 (process-manager.test.ts uses real ~/.claude-mem path)
are deferred — the reviewer flagged the latter as out of scope, and
the former needs an injectable-I/O refactor that isn't appropriate
for a hotfix bugfix PR.

Verified: build clean, mcp-server.cjs still 384KB / zero bun:sqlite,
JSON-RPC tools/list still returns the 7-tool surface, ProcessManager
test suite still 43/43.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(spawner): mkdir CLAUDE_MEM_DATA_DIR before writing Windows cooldown marker

Per CodeRabbit on PR #1645: on a fresh user profile, the data dir may not
exist yet when markWorkerSpawnAttempted() runs. writeFileSync would throw
ENOENT, the catch would swallow it, and the marker would never be created
— defeating the popup-loop protection this helper exists to provide.

mkdirSync(dir, { recursive: true }) is a no-op when the directory already
exists, so it's safe to call on every spawn attempt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(spawner): add APPROVED OVERRIDE annotations for cooldown marker catches

Per CodeRabbit on PR #1645: silent catch blocks at spawn-cooldown sites
should carry the APPROVED OVERRIDE annotation that the rest of the
codebase uses (see ProcessManager.ts:689, BaseRouteHandler.ts:82,
ChromaSync.ts:288).

Both catches are intentional best-effort:
- markWorkerSpawnAttempted: if mkdir/writeFileSync fails, the worker
  spawn itself will almost certainly fail too. Surfacing that downstream
  is far more useful than a noisy log line about a lock file.
- clearWorkerSpawnAttempted: a stale marker is harmless. Worst case is
  one suppressed retry within the cooldown window, then self-heals.

No behaviour change. Resolves the second half of CodeRabbit's lines
38-65 comment on worker-spawner.ts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(mcp): address PR #1645 review feedback (round 2)

Round 2 of Claude Code Review feedback on PR #1645:

Build guardrail (most important — protects the regression this PR fixes):

- scripts/build-hooks.js: post-build check that fails the build if
  mcp-server.cjs ever contains a `bun:sqlite` reference. This is the
  exact regression PR #1645 fixed; future contributors will get an
  immediate, actionable error if a transitive import re-introduces it.
  Verified the check trips when violated.

Code clarity:

- src/servers/mcp-server.ts: drop dead `_originalLog` capture — it was
  never restored. Less code is fewer bugs.

- src/servers/mcp-server.ts: elevate `cwd()` fallback log from WARN to
  ERROR. Per reviewer: a wrong WORKER_SCRIPT_PATH means worker auto-start
  silently fails, so the breadcrumb should be loud and searchable.

- src/services/worker-service.ts: extended doc comment on the
  `ensureWorkerStartedShared(port, __filename)` wrapper explaining why
  `__filename` is the correct script path here (CJS bundle = compiled
  worker-service.cjs) and why mcp-server.ts can't use the same trick.

- src/services/infrastructure/ProcessManager.ts: inline comment on the
  `env.BUN === 'bun'` bare-command guard explaining why it's reachable
  even though `isBunExecutablePath('bun')` is true (pathExists returns
  false for relative names, so the second branch is what fires).

Coverage:

- src/services/infrastructure/ProcessManager.ts: add `/usr/bin/bun` to
  the Linux candidate paths so apt-installed Bun on Debian/Ubuntu is
  found without falling through to the PATH lookup.

Out-of-scope items (deferred with rationale in PR replies):

- Unit tests for ensureWorkerStarted / Windows cooldown helpers — needs
  injectable-I/O refactor unsuitable for a hotfix.
- Sentinel object for Windows spawnDaemon `0` — broader API change.
- Windows Scoop install path — follow-up for a future PR.
- runOneTimeChromaMigration placement, aggressiveStartupCleanup,
  console.log redirect timing, platform timeout multiplier — all
  pre-existing and unrelated to this regression.

Verified: build clean, guardrail trips on simulated violation,
mcp-server.cjs still 0 bun:sqlite refs, ProcessManager tests 43/43.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(mcp): address PR #1645 review feedback (round 3)

Round 3 of Claude Code Review feedback on PR #1645:

ProcessManager.ts: improve actionability of "Bun not found" errors

Both Windows and Unix branches of spawnDaemon previously logged a vague
"Failed to locate Bun runtime" message when resolveWorkerRuntimePath()
returned null. Replaced with an actionable message that names the install
URL and explains *why* Bun is required (worker uses bun:sqlite). The
existing null-guard at the call sites already prevents passing null to
child_process.spawn — only the error text changed.

scripts/build-hooks.js: refine bun:sqlite guardrail to match actual
require() calls only

The previous coarse `includes('bun:sqlite')` check tripped on its own
improved error message, which legitimately mentions "bun:sqlite" by name.
Switched to a regex that matches `require("bun:sqlite")` /
`require('bun:sqlite')` (with optional whitespace, handles both quote
styles, handles minified output) so error messages and inline comments
can reference the module name without false positives. Verified the
regex still trips on real violations (both spaced and minified forms)
and correctly ignores string-literal mentions.

Other round-3 items (verified, not changed):

- TOOL_ENDPOINT_MAP: reviewer flagged as dead code, but it IS used at
  lines 250 and 263 by the search and timeline tool handlers. False
  positive — kept as-is.
- if (!pid) callsites: grepped src/, zero offenders. The Windows `0`
  PID sentinel contract is safe; only the in-line documentation comment
  in ProcessManager.ts mentions the anti-pattern.
- callWorkerAPIPost double-wrapping: pre-existing intentional behavior
  (only used by /api/observations/batch which returns raw data, not
  the MCP {content:[...]} shape). Unrelated to this regression.
- Snap path / startParentHeartbeat / main().catch / test for non-
  existent workerScriptPath / etc — pre-existing or out of scope for
  this hotfix, deferred per established disposition.

Verified: build clean, guardrail still trips on real violations,
mcp-server.cjs has 0 require("bun:sqlite") calls, JSON-RPC tools/list
returns the 7-tool surface, ProcessManager tests 43/43.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(spawnDaemon): contract test for Windows 0 PID success sentinel

Per CodeRabbit nitpick on PR #1645 commit 7a96b3b9: add a focused test
that documents the spawnDaemon return contract so any future contributor
who introduces `if (!pid)` against a spawnDaemon return value (or its
wrapper) sees a failing assertion explaining why the falsy check is
incorrect.

The test deliberately exercises the JS-level semantics rather than
mocking PowerShell — a true mocked Windows test would require
refactoring spawnDaemon to take an injectable execSync, which is a
larger change than this hotfix should carry. The contract assertions
here catch the same regression class (treating Windows success as
failure) without that refactor.

Verified: bun test tests/infrastructure/process-manager.test.ts now
passes 44/44 (was 43/43).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(mcp): address PR #1645 review feedback (round 4)

Round 4 of Claude Code Review feedback on PR #1645 (review of round-3
commit 193286f9):

tests/infrastructure/process-manager.test.ts: replace require('fs')
with the already-imported statSync. Reviewer correctly flagged that
the file uses ESM-style named imports everywhere else and the inline
require() calls would break under strict ESM. Two callsites updated
in the touchPidFile test.

src/services/infrastructure/ProcessManager.ts: hoist
resolveWorkerRuntimePath() and the `Bun runtime not found` error
handling out of both branches in spawnDaemon. Both Windows and Unix
branches need the same Bun lookup, and resolving once before the OS
branch split avoids a duplicate execSync('which bun')/where bun in the
no-well-known-path fallback. The error message is also DRY now —
single source of truth instead of two near-identical strings.

CodeRabbit confirmed in its previous reply that "All actionable items
across all four review rounds are fully resolved" — these two minor
items from claude-review of round 3 are the only remaining cleanup.

Verified: build clean, ProcessManager tests still 44/44.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(mcp): address PR #1645 review feedback (round 5)

Round 5 of Claude Code Review feedback on PR #1645:

src/services/worker-spawner.ts: drop `export` from internal helpers

`shouldSkipSpawnOnWindows`, `markWorkerSpawnAttempted`, and
`clearWorkerSpawnAttempted` were exported even though they were
private in worker-service.ts and nothing outside this module needs
them. Removing the `export` keyword keeps the public surface to just
`ensureWorkerStarted` and prevents future callers from bypassing the
spawn lifecycle.

scripts/build-hooks.js: broaden guardrail to all bun:* modules

Previously the regex only caught `require("bun:sqlite")`, but every
module in the `bun:` namespace (bun:ffi, bun:test, etc.) is Bun-only
and would crash mcp-server.cjs the same way under Node. Generalized
the regex to `require("bun:[a-z][a-z0-9_-]*")` so a transitive import
of any Bun-only module fails the build instead of shipping a broken
bundle. Verified the new regex still trips on bun:sqlite, bun:ffi,
bun:test, and correctly ignores string-literal mentions in error
messages.

src/servers/mcp-server.ts: attribute root cause when dirname resolution fails

Previously, if `__dirname`/`import.meta.url` resolution failed and we
fell back to `process.cwd()`, the user would see two warnings: an
error about the dirname fallback AND a separate warning about the
missing worker bundle. The second warning hides the root cause —
someone debugging would assume the install is broken when really it's
a dirname-resolution failure. Track the failure with a flag and emit
a single root-cause-attributing log line in the existence-check
branch instead. The dirname fallback paths are still functionally
unreachable in CJS deployment; this just makes the failure mode
unmistakable if it ever does fire.

Out of scope (consistent with prior rounds):
- darwin/linux split for non-Windows candidate paths (benign today)
- Integration test for non-existent workerScriptPath (test coverage
  gap deferred since rounds 1-2)
- Defer existsSync check to first ensureWorkerStarted call (current
  module-init check is the loud signal we want)

Already addressed in earlier rounds:
- resolveWorkerRuntimePath() called twice in spawnDaemon → hoisted in
  round 4 (b2c114b4)
- _originalLog dead code → removed in round 2 (7a96b3b9)

Verified: build clean, broadened guardrail trips on bun:sqlite,
bun:ffi, and bun:test (and ignores string literals), MCP server
serves the 7-tool surface, ProcessManager tests still 44/44.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(mcp): address PR #1645 review feedback (round 6)

Round 6 of Claude Code Review feedback on PR #1645:

src/services/worker-spawner.ts: validate workerScriptPath at entry

Add an empty-string + existsSync guard at the top of ensureWorkerStarted.
Without this, a partial install or upstream path-resolution regression
just surfaces as a low-signal child_process error from spawnDaemon. The
explicit log line at the entry point makes that class of bug much
easier to diagnose. The mcp-server.ts module-init existsSync check
already covers this for the MCP-server caller, but defending at the
spawner level reinforces the contract for any future caller.

src/services/worker-spawner.ts: document SettingsDefaultsManager
dependency boundary in the module header

The spawner imports from SettingsDefaultsManager, ProcessManager, and
HealthMonitor. None of those currently touch bun:sqlite, but if any
of them ever does, the spawner's SQLite-free contract silently breaks.
The build guardrail in build-hooks.js is the only thing that catches
it. Header comment now flags this so future contributors audit
transitive imports when adding helpers from the shared/infrastructure
layers.

src/services/infrastructure/ProcessManager.ts: add /snap/bin/bun

Ubuntu Snap install path. Now alongside the existing apt path
(/usr/bin/bun) and Homebrew/Linuxbrew paths. The PATH lookup catches
it as fallback, but listing it explicitly avoids paying for an
execSync('which bun') in the common case.

src/servers/mcp-server.ts: elevate missing-bundle log warn → error

A missing worker-service.cjs means EVERY MCP tool call that needs the
worker silently fails. That's a broken-install state, not a transient
condition — match the severity of the dirname-fallback branch above
(which is already ERROR).

Out of scope (consistent with prior rounds, reviewer agrees these are
appropriately deferred):
- Streaming bundle read in build-hooks.js (nit at current 384KB size)
- Unit tests for ensureWorkerStarted / cooldown helpers
- Integration test for non-existent workerScriptPath

Verified: build clean, broadened guardrail still trips on bun:* imports
and ignores string literals, MCP server serves the 7-tool surface,
ProcessManager tests still 44/44.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(mcp): defer WORKER_SCRIPT_PATH check to first call (round 7)

Round 7 of Claude Code Review feedback on PR #1645:

src/servers/mcp-server.ts: extract module-level existsSync check into
checkWorkerScriptPath() and call it lazily from ensureWorkerConnection()
instead of at module load.

The early-warning intent is preserved (the check still fires before any
actual spawn attempt), but tests/tools that import this module without
booting the MCP server no longer see noisy ERROR-level log lines for a
worker bundle they never intended to start. The check is cheap and
idempotent, so calling it on every auto-start attempt is fine.

The two failure-mode branches (dirname-resolution failure vs simple
missing-bundle) remain unchanged — the function body is identical to
the previous module-level if-block, just hoisted into a function and
called from ensureWorkerConnection().

False positive (no change needed):
- Reviewer flagged `mkdirSync` as a dead import in worker-spawner.ts,
  but it IS used at line 71 in markWorkerSpawnAttempted (the round-1
  ENOENT fix CodeRabbit explicitly asked for).

Out of scope:
- Volta path (~/.volta/bin/bun) — PATH fallback handles it; nit per
  reviewer
- worker-spawner.ts unit tests — needs injectable I/O, deferred
  consistently since round 1

Verified: build clean, tests 44/44, smoke test 7-tool surface.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(mcp): address PR #1645 review feedback (round 8)

Round 8 of Claude Code Review feedback on PR #1645:

tests/services/worker-spawner.test.ts: NEW FILE — unit tests for the
ensureWorkerStarted entry-point validation guards added in round 6.
Covers the empty-string and non-existent-path cases without requiring
the broader injectable-I/O refactor that the deeper spawn lifecycle
tests would need. 2 new passing tests.

src/services/infrastructure/ProcessManager.ts: memoize
resolveWorkerRuntimePath() for the no-options call site (which is what
spawnDaemon uses). Caches both successful resolutions and the
not-found result so repeated spawn attempts (crash loops, health
thrashing) don't repeatedly hit statSync on candidate paths. Tests
that pass options bypass the cache entirely so existing test cases
remain deterministic. Added resetWorkerRuntimePathCache() exported
for test isolation only.

src/servers/mcp-server.ts: rename checkWorkerScriptPath() →
warnIfWorkerScriptMissing(). Per reviewer: the old name implied a
boolean check but the function returns void and has side effects. New
name is more accurate.

DEFENDED (no change made):
- Reviewer asked to elevate process.cwd() fallback to a synchronous
  throw at module load. This conflicts with round 7 feedback which
  asked to defer the existsSync check to first call to avoid noisy
  test logs. The current lazy approach is the right compromise: it
  fires before any actual spawn attempt, attributes the root cause,
  and doesn't pollute test imports. Throwing at module load would
  crash before stdio is wired up, which is much harder to debug than
  the lazy log line.
- Reviewer asked to grep for `if (!pid)` callsites — already verified
  in round 3, zero offenders in src/.

Out of scope:
- Volta path (~/.volta/bin/bun) — PATH fallback handles it; reviewer
  marked as nit
- Deeper unit tests for ensureWorkerStarted spawn lifecycle (PID file
  cleanup, health checks, etc.) — needs injectable I/O, deferred
  consistently since round 1

Verified: build clean, ProcessManager tests still 44/44, new
worker-spawner tests 2/2, smoke test serves 7 tools.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(spawner): clear Windows cooldown marker on all healthy paths (round 9)

Round 9 of PR #1645 review feedback.

src/services/worker-spawner.ts: clear stale Windows cooldown marker
on every healthy-return path

Per CodeRabbit (genuine bug):

The .worker-start-attempted marker was previously only cleared after
a spawn initiated by ensureWorkerStarted itself succeeded. If a
previous auto-start failed, then the worker became healthy via
another session or a manual start, the early-return success branches
(existing live PID, fast-path health check, port-in-use waitForHealth)
would leave the stale marker behind. A subsequent genuine outage
inside the 2-minute cooldown window would then be incorrectly
suppressed on Windows.

Now calls clearWorkerSpawnAttempted() on all three healthy success
paths in addition to the existing post-spawn path. The function is
already a no-op on non-Windows, so the change is risk-free for Linux
and macOS callers.

src/servers/mcp-server.ts: more actionable error when auto-start fails

Per claude-review: when ensureWorkerStarted returns false (or throws),
the caller currently logs a generic "Worker auto-start failed" line.
Updated both error sites to explicitly call out which MCP tools will
fail (search/timeline/get_observations) and to point at earlier log
lines for the specific cause. Helps users distinguish "worker is just
not running" from "tools are broken".

DEFENDED (no change):
- Sentinel object for Windows spawnDaemon 0 PID — broader API change,
  out of scope, deferred consistently since round 1
- Spawner lifecycle tests beyond input validation — needs injectable
  I/O, deferred consistently
- Concurrent cooldown marker race on Windows — pre-existing,
  out of scope
- stripHardcodedDirname() regex fragility assertion — pre-existing,
  out of scope

Verified: build clean, ProcessManager tests 44/44, worker-spawner
tests 2/2, smoke test 7-tool surface.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(spawner): don't cache null Bun-not-found result (round 10)

Round 10 of PR #1645 review feedback.

src/services/infrastructure/ProcessManager.ts: only cache successful
resolveWorkerRuntimePath() results

Genuine bug from claude-review: the round-8 memoization cached BOTH
successful resolutions AND the not-found `null` result. If Bun isn't
on PATH at the moment the MCP server first tries to spawn the worker
— e.g., on a fresh install where the user installs Bun in another
terminal and retries — every subsequent ensureWorkerConnection call
would return the cached `null` and fail with a misleading "Bun not
found" error even though Bun is now available.

The fix is the one-line change the reviewer suggested: only cache
when `result !== null`. Crash loops still get the fast-path memoized
success; recovery from a fresh-install Bun install still works.

src/servers/mcp-server.ts: rename warnIfWorkerScriptMissing →
errorIfWorkerScriptMissing

Per claude-review: the function uses logger.error but the name says
"warn" — name/level mismatch. Renamed to match. The function still
serves the same purpose (defensive lazy check), just with an accurate
name.

DEFENDED (no change):
- Discriminated union for mcpServerDirResolutionFailed flag — current
  approach works, the noise is minimal, and the alternative would
  add type complexity for a path that's functionally unreachable in
  CJS deployment
- macOS /usr/local/bin/bun "missing" — already in the Linux/macOS
  candidate list at line 137 (false positive from reviewer)
- nix store path — out of scope, PATH fallback handles it
- Long build-hooks.js error message — verbosity is intentional, this
  message only fires on a real regression and the diagnostic value is
  worth the line wrap
- Spawner lifecycle test coverage gap — needs injectable I/O,
  deferred consistently

Verified: build clean, ProcessManager tests 44/44, worker-spawner
tests 2/2, smoke test 7-tool surface.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(mcp): bundle size budget guardrail (round 11)

Round 11 of PR #1645 review feedback.

scripts/build-hooks.js: secondary bundle-size budget guardrail

Per claude-review: the existing `require("bun:*")` regex catches the
specific regression class we already know about, but if esbuild ever
changes how it emits external module specifiers, the regex could
silently miss the regression. A bundle-size budget catches the
structural symptom (worker-service.ts dragged into the bundle blew
the size from ~358KB to ~1.96MB) regardless of how the imports look.

Set the ceiling at 600KB. Current size is ~384KB; the broken v12.0.0
bundle was ~1920KB. Plenty of headroom for legitimate growth without
incentivizing bundle bloat or false positives. Both guardrails fire
independently — one is regex-based, one is size-based — so a
regression has to defeat both to ship.

tests/services/worker-spawner.test.ts: comment about port irrelevance

Per claude-review: the hardcoded port values in the validation-guard
tests are arbitrary because the path validation short-circuits before
any network I/O. Added a comment explaining this so future readers
don't waste time wondering why specific ports were picked.

DEFENDED (no change):

- clearWorkerSpawnAttempted on the unhealthy-live-PID return path:
  reviewer asked to clear the marker here too, but the current
  behavior is correct. The marker tracks "recently attempted a spawn"
  and exists to prevent rapid PowerShell-popup loops. If a wedged
  process is currently using the port, the spawn isn't actually
  happening on this code path (the helper returns false without
  reaching the spawn step). When the wedged process eventually dies
  and a subsequent call hits the spawn path, the marker correctly
  suppresses repeated retry attempts within the 2-minute cooldown.
  Clearing the marker on the unhealthy-return path would defeat
  exactly the popup-loop protection the marker exists to provide.

- execSync in lookupBinaryInPath blocks event loop: pre-existing
  concern, not introduced by this PR. Reviewer notes "fires once,
  result cached". Not in scope for a hotfix.

- Tracking issue for spawner lifecycle test gap: out of scope for
  this PR; the gap is documented in the test file's header comment
  with a back-reference to PR #1645.

Verified: build clean, both guardrails functional (size budget is
under the new ceiling), ProcessManager tests 44/44, worker-spawner
tests 2/2, smoke test 7-tool surface.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(mcp): eliminate double error log when worker bundle is missing (round 12)

Round 12 of PR #1645 review feedback.

src/servers/mcp-server.ts: errorIfWorkerScriptMissing() now only logs
when the dirname-fallback attribution path is needed

Previously a missing worker-service.cjs would produce two ERROR log
lines on the same code path:
1. errorIfWorkerScriptMissing() in ensureWorkerConnection()
2. The existsSync guard inside ensureWorkerStarted()

The simple "missing bundle" case is fully covered by the spawner's
own existsSync guard. The mcp-server.ts function now ONLY logs when
mcpServerDirResolutionFailed is true — that's the mcp-server-specific
root-cause attribution that the spawner cannot provide on its own.

Net effect: same single error log per bug class, cleaner triage.

DEFENDED (no change):

- mkdirSync error propagation in markWorkerSpawnAttempted: reviewer
  worried that mkdirSync/writeFileSync exceptions could escape, but
  the entire body is already wrapped in try/catch with an APPROVED
  OVERRIDE annotation. False positive.
- clearWorkerSpawnAttempted on healthy paths: reviewer asked a
  clarifying question, not a change request. The behavior is
  intentional — the cooldown marker exists to prevent rapid
  PowerShell-popup loops from a series of failed spawns; a healthy
  worker means the marker has served its purpose and a future
  outage should NOT be suppressed. Will explain in PR reply.
- __filename ESM concern in worker-service.ts wrapper: already
  documented in round 4 with an extended comment about the CJS
  bundle context and why mcp-server.ts can't use the same trick.
- Spawn lifecycle integration tests: deferred consistently since
  round 1; gap is documented in worker-spawner.test.ts header.

Verified: build clean, ProcessManager tests 44/44, worker-spawner
tests 2/2, smoke test 7-tool surface.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(spawner): add bare-command BUN env override coverage

Final round of PR #1645 review feedback: while preparing to merge, I
noticed CodeRabbit's round-5 CHANGES_REQUESTED review on commit
3570d2f0 included an unaddressed nitpick — the env-driven bare-command
branch in resolveWorkerRuntimePath() (returning a bare 'bun' unchanged
when BUN or BUN_PATH is set that way) had no test coverage and could
regress without any failing assertion.

Added a focused test that exercises the env: { BUN: 'bun' } branch
specifically. 47/47 tests pass (was 46/46).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 18:08:36 -07:00
Alex Newman 6250a194dd Merge branch 'pr-1472' into integration/validation-batch
# Conflicts:
#	plugin/scripts/context-generator.cjs
#	plugin/scripts/mcp-server.cjs
#	plugin/scripts/worker-service.cjs
#	plugin/ui/viewer-bundle.js
#	src/cli/handlers/context.ts
#	src/services/sqlite/SessionStore.ts
#	src/services/sqlite/migrations/runner.ts
#	src/services/worker-service.ts
#	src/shared/SettingsDefaultsManager.ts
2026-04-06 14:23:18 -07:00
Alex Newman 5dd2a6f758 Merge branch 'pr-1553' into integration/validation-batch
# Conflicts:
#	src/services/worker/session/SessionCompletionHandler.ts
2026-04-06 14:19:50 -07:00
Alex Newman 85f57e6440 Merge branch 'pr-1554' into integration/validation-batch 2026-04-06 14:18:28 -07:00
Ousama Ben Younes 2a304d59eb fix: handle bare path strings in files_modified/files_read columns (#1359)
JSON.parse('/path/to/file') throws SyntaxError, crashing the viewer and
any code reading observations with legacy bare-path data in those columns.

- Add parseFileList() helper in observations/files.ts — tries JSON.parse,
  falls back to wrapping bare strings in an array
- Replace unsafe JSON.parse calls in files.ts, SessionStore.ts, ChromaSync.ts
- Add 9 unit tests covering null, empty, valid JSON, bare paths, invalid JSON

Closes #1359

Co-Authored-By: Claude <noreply@anthropic.com>
2026-04-01 06:17:35 +00:00
Ousama Ben Younes 12501412b9 fix: persist session completion to database in completeByDbId (#1532)
completeByDbId only cleaned up in-memory state, leaving sdk_sessions rows
with status='active' and completed_at=NULL indefinitely. Ghost sessions
accumulated and exhausted the agent pool, causing 60s timeout errors.

- Add SessionStore.markSessionCompleted() to set status/completed_at/completed_at_epoch
- Call it at the start of completeByDbId before in-memory cleanup
- Inject SessionStore into SessionCompletionHandler via constructor
- Add 4 tests covering status, timestamps, isolation, and non-existent IDs

Closes #1532

Co-Authored-By: Claude <noreply@anthropic.com>
2026-04-01 06:02:14 +00:00
huakson 4f6fb9e614 fix: address platform source review feedback
Tighten platform source persistence so legacy callers cannot silently relabel existing sessions, repair migration 24 when schema_versions drifts from the real schema, and polish the follow-up UI/error-handler review nits.

- only backfill platform_source when it is blank and raise on explicit source conflicts for an existing session
- make migration 24 verify both the sdk_sessions column and its index before treating it as applied
- expose platform_source from the functional session getters and add regression tests for source preservation and schema drift recovery
- add the required APPROVED OVERRIDE annotation for centralized HTTP error translation
- keep mobile source pills on a single horizontal row
2026-03-24 10:46:48 -03:00
Vincent Leraitre 237a4c37f8 fix: always pass --ssl flag to chroma-mcp in remote mode (#1286)
* fix: always pass --ssl flag to chroma-mcp in remote mode

The chroma-mcp CLI defaults to SSL when using --client-type http.
When CLAUDE_MEM_CHROMA_SSL is false (the common case for local
ChromaDB servers), buildCommandArgs() omitted --ssl entirely,
causing chroma-mcp to attempt an SSL connection to a plain HTTP
server and fail with "Could not connect to a Chroma server".

Always pass --ssl with an explicit true/false value so the user's
CLAUDE_MEM_CHROMA_SSL setting is faithfully forwarded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add regression tests for ChromaMcpManager SSL flag fix

Adds 4 focused test cases verifying buildCommandArgs() produces correct
--ssl args, covering SSL=false, SSL=true, unset (defaults to false), and
local mode (no --ssl flag). Requested by @xkonjin in PR #1286 review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: rebuild checked-in bundles to include SSL flag fix

Rebuild all bundles against upstream/main so the --ssl <true|false>
fix is present in the runtime artifacts that hooks and the marketplace
plugin actually execute.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:03:58 -07:00
antmid ad902bedd9 fix: auto-repair malformed database schema from cross-version sync (#1308)
When a claude-mem DB is synced between machines running different versions,
orphaned indexes can reference non-existent columns (e.g. idx_observations_content_hash
referencing content_hash). This causes SQLite to throw "malformed database schema"
on ALL queries, including PRAGMAs, creating a silent 503 failure loop.

The fix detects this on startup, uses Python's sqlite3 module to drop the
orphaned schema objects (bun:sqlite doesn't support writable_schema modifications),
resets migration versions, and lets the idempotent migration system recreate
everything properly.

Fixes #1307

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:01:51 -07:00
Alex Newman c6f932988a Fix 30+ root-cause bugs across 10 triage phases (#1214)
* MAESTRO: fix ChromaDB core issues — Python pinning, Windows paths, disable toggle, metadata sanitization, transport errors

- Add --python version pinning to uvx args in both local and remote mode (fixes #1196, #1206, #1208)
- Convert backslash paths to forward slashes for --data-dir on Windows (fixes #1199)
- Add CLAUDE_MEM_CHROMA_ENABLED setting for SQLite-only fallback mode (fixes #707)
- Sanitize metadata in addDocuments() to filter null/undefined/empty values (fixes #1183, #1188)
- Wrap callTool() in try/catch for transport errors with auto-reconnect (fixes #1162)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix data integrity — content-hash deduplication, project name collision, empty project guard, stuck isProcessing

- Add SHA-256 content-hash deduplication to observations INSERT (store.ts, transactions.ts, SessionStore.ts)
- Add content_hash column via migration 22 with backfill and index
- Fix project name collision: getCurrentProjectName() now returns parent/basename
- Guard against empty project string with cwd-derived fallback
- Fix stuck isProcessing: hasAnyPendingWork() resets processing messages older than 5 minutes
- Add 12 new tests covering all four fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix hook lifecycle — stderr suppression, output isolation, conversation pollution prevention

- Suppress process.stderr.write in hookCommand() to prevent Claude Code showing diagnostic
  output as error UI (#1181). Restores stderr in finally block for worker-continues case.
- Convert console.error() to logger.warn()/error() in hook-command.ts and handlers/index.ts
  so all diagnostics route to log file instead of stderr.
- Verified all 7 handlers return suppressOutput: true (prevents conversation pollution #598, #784).
- Verified session-complete is a recognized event type (fixes #984).
- Verified unknown event types return no-op handler with exit 0 (graceful degradation).
- Added 10 new tests in tests/hook-lifecycle.test.ts covering event dispatch, adapter defaults,
  stderr suppression, and standard response constants.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix worker lifecycle — restart loop coordination, stale transport retry, ENOENT shutdown race

- Add PID file mtime guard to prevent concurrent restart storms (#1145):
  isPidFileRecent() + touchPidFile() coordinate across sessions
- Add transparent retry in ChromaMcpManager.callTool() on transport
  error — reconnects and retries once instead of failing (#1131)
- Wrap getInstalledPluginVersion() with ENOENT/EBUSY handling (#1042)
- Verified ChromaMcpManager.stop() already called on all shutdown paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix Windows platform support — uvx.cmd spawn, PowerShell $_ elimination, windowsHide, FTS5 fallback

- Route uvx spawn through cmd.exe /c on Windows since MCP SDK lacks shell:true (#1190, #1192, #1199)
- Replace all PowerShell Where-Object {$_} pipelines with WQL -Filter server-side filtering (#1024, #1062)
- Add windowsHide: true to all exec/spawn calls missing it to prevent console popups (#1048)
- Add FTS5 runtime probe with graceful fallback when unavailable on Windows (#791)
- Guard FTS5 table creation in migrations, SessionSearch, and SessionStore with try/catch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix skills/ distribution — build-time verification and regression tests (#1187)

Add post-build verification in build-hooks.js that fails if critical
distribution files (skills, hooks, plugin manifest) are missing. Add
10 regression tests covering skill file presence, YAML frontmatter,
hooks.json integrity, and package.json files field.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix MigrationRunner schema initialization (#979) — version conflict between parallel migration systems

Root cause: old DatabaseManager migrations 1-7 shared schema_versions table with
MigrationRunner's 4-22, causing version number collisions (5=drop tables vs add column,
6=FTS5 vs prompt tracking, 7=discovery_tokens vs remove UNIQUE).  initializeSchema()
was gated behind maxApplied===0, so core tables were never created when old versions
were present.

Fixes:
- initializeSchema() always creates core tables via CREATE TABLE IF NOT EXISTS
- Migrations 5-7 check actual DB state (columns/constraints) not just version tracking
- Crash-safe temp table rebuilds (DROP IF EXISTS _new before CREATE)
- Added missing migration 21 (ON UPDATE CASCADE) to MigrationRunner
- Added ON UPDATE CASCADE to FK definitions in initializeSchema()
- All changes applied to both runner.ts and SessionStore.ts

Tests: 13 new tests in migration-runner.test.ts covering fresh DB, idempotency,
version conflicts, crash recovery, FK constraints, and data integrity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix 21 test failures — stale mocks, outdated assertions, missing OpenClaw guards

Server tests (12): Added missing workerPath and getAiStatus to ServerOptions
mocks after interface expansion. ChromaSync tests (3): Updated to verify
transport cleanup in ChromaMcpManager after architecture refactor. OpenClaw (2):
Added memory_ tool skipping and response truncation to prevent recursive loops
and oversized payloads. MarkdownFormatter (2): Updated assertions to match
current output. SettingsDefaultsManager (1): Used correct default key for
getBool test. Logger standards (1): Excluded CLI transcript command from
background service check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix Codex CLI compatibility (#744) — session_id fallbacks, unknown platform tolerance, undefined guard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix Cursor IDE integration (#838, #1049) — adapter field fallbacks, tolerant session-init validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix /api/logs OOM (#1203) — tail-read replaces full-file readFileSync

Replace readFileSync (loads entire file into memory) with readLastLines()
that reads only from the end of the file in expanding chunks (64KB → 10MB cap).
Prevents OOM on large log files while preserving the same API response shape.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix Settings CORS error (#1029) — explicit methods and allowedHeaders in CORS config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: add session custom_title for agent attribution (#1213) — migration 23, endpoint + store support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: prevent CLAUDE.md/AGENTS.md writes inside .git/ directories (#1165)

Add .git path guard to all 4 write sites to prevent ref corruption when
paths resolve inside .git internals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix plugin disabled state not respected (#781) — early exit check in all hook entry points

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix UserPromptSubmit context re-injection on every turn (#1079) — contextInjected session flag

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix stale AbortController queue stall (#1099) — lastGeneratorActivity tracking + 30s timeout

Three-layer fix:
1. Added lastGeneratorActivity timestamp to ActiveSession, updated by
   processAgentResponse (all agents), getMessageIterator (queue yields),
   and startGeneratorWithProvider (generator launch)
2. Added stale generator detection in ensureGeneratorRunning — if no
   activity for >30s, aborts stale controller, resets state, restarts
3. Added AbortSignal.timeout(30000) in deleteSession to prevent
   indefinite hang when awaiting a stuck generator promise

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 19:34:35 -05:00
Alex Newman 40daf8f3fa feat: replace WASM embeddings with persistent chroma-mcp MCP connection (#1176)
* feat: replace WASM embeddings with persistent chroma-mcp MCP connection

Replace ChromaServerManager (npx chroma run + chromadb npm + ONNX/WASM)
with ChromaMcpManager, a singleton stdio MCP client that communicates with
chroma-mcp via uvx. This eliminates native binary issues, segfaults, and
WASM embedding failures that plagued cross-platform installs.

Key changes:
- Add ChromaMcpManager: singleton MCP client with lazy connect, auto-reconnect,
  connection lock, and Zscaler SSL cert support
- Rewrite ChromaSync to use MCP tool calls instead of chromadb npm client
- Handle chroma-mcp's non-JSON responses (plain text success/error messages)
- Treat "collection already exists" as idempotent success
- Wire ChromaMcpManager into GracefulShutdown for clean subprocess teardown
- Delete ChromaServerManager (no longer needed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review — connection guard leak, timer leak, async reset

- Clear connecting guard in finally block to prevent permanent reconnection block
- Clear timeout after successful connection to prevent timer leak
- Make reset() async to await stop() before nullifying instance
- Delete obsolete chroma-server-manager test (imports deleted class)
- Update graceful-shutdown test to use chromaMcpManager property name

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: prevent chroma-mcp spawn storm — zombie cleanup, stale onclose guard, reconnect backoff

Three bugs caused chroma-mcp processes to accumulate (92+ observed):

1. Zombie on timeout: failed connections left subprocess alive because
   only the timer was cleared, not the transport. Now catch block
   explicitly closes transport+client before rethrowing.

2. Stale onclose race: old transport's onclose handler captured `this`
   and overwrote the current connection reference after reconnect,
   orphaning the new subprocess. Now guarded with reference check.

3. No backoff: every failure triggered immediate reconnect. With
   backfill doing hundreds of MCP calls, this created rapid-fire
   spawning. Added 10s backoff on both connection failure and
   unexpected process death.

Also includes ChromaSync fixes from PR review:
- queryChroma deduplication now preserves index-aligned arrays
- SQL injection guard on backfill ID exclusion lists

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 18:32:38 -05:00
Alex Newman b88251bc8b fix: self-healing claimNextMessage prevents stuck processing messages (#1159)
* fix: self-healing claimNextMessage prevents stuck processing messages

claimAndDelete → claimNextMessage with atomic self-healing: resets stale
processing messages (>60s) back to pending before claiming. Eliminates
stuck messages from generator crashes without external timers. Removes
redundant idle-timeout reset in worker-service.ts. Adds QUEUE to logger
Component type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update stale comments in SessionQueueProcessor to reflect claim-confirm pattern

Comments still referenced the old claim-and-delete pattern after the
claimNextMessage rename. Updated to accurately describe the current
lifecycle where messages are marked as processing and stay in DB until
confirmProcessed() is called.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: move Date.now() inside transaction and extract stale threshold constant

- Move Date.now() inside claimNextMessage transaction closure so timestamp
  is fresh if WAL contention causes retry
- Extract STALE_PROCESSING_THRESHOLD_MS to module-level constant
- Add comment clarifying strict < boundary semantics

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 23:15:46 -05:00
Alex Newman c27314f896 fix: address PR review comments for chroma server lifecycle 2026-02-13 23:39:30 -05:00
Alex Newman 7566b8c650 fix: add idle timeout to prevent zombie observer processes (#856)
* fix: add idle timeout to prevent zombie observer processes

Root cause fix for zombie observer accumulation. The SessionQueueProcessor
iterator now exits gracefully after 3 minutes of inactivity instead of
waiting forever for messages.

Changes:
- Add IDLE_TIMEOUT_MS constant (3 minutes)
- waitForMessage() now returns boolean and accepts timeout parameter
- createIterator() tracks lastActivityTime and exits on idle timeout
- Graceful exit via return (not throw) allows SDK to complete cleanly

This addresses the root cause that PR #848 worked around with pattern
matching. Observer processes now self-terminate, preventing accumulation
when session-complete hooks don't fire.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: trigger abort on idle timeout to actually kill subprocess

The previous implementation only returned from the iterator on idle timeout,
but this doesn't terminate the Claude subprocess - it just stops yielding
messages. The subprocess stays alive as a zombie because:

1. Returning from createIterator() ends the generator
2. The SDK closes stdin via transport.endInput()
3. But the subprocess may not exit on stdin EOF
4. No abort signal is sent to kill it

Fix: Add onIdleTimeout callback that SessionManager uses to call
session.abortController.abort(). This sends SIGTERM to the subprocess
via the SDK's ProcessTransport abort handler.

Verified by Codex analysis of the SDK internals:
- abort() triggers ProcessTransport abort handler → SIGTERM
- transport.close() sends SIGTERM → escalates to SIGKILL after 5s
- Just closing stdin is NOT sufficient to guarantee subprocess exit

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: add idle timeout to prevent zombie observer processes

Also cleaned up hooks.json to remove redundant start commands.
The hook command handler now auto-starts the worker if not running,
which is how it should have been since we changed to auto-start.

This maintenance change was done manually.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: resolve race condition in session queue idle timeout detection

- Reset timer on spurious wakeup when queue is empty but duration check fails
- Use optional chaining for onIdleTimeout callback
- Include threshold value in idle timeout log message for better diagnostics
- Add comprehensive unit tests for SessionQueueProcessor

Fixes PR #856 review feedback.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: migrate installer to Setup hook

- Add plugin/scripts/setup.sh for one-time dependency setup
- Add Setup hook to hooks.json (triggers via claude --init)
- Remove smart-install.js from SessionStart hook
- Keep smart-install.js as manual fallback for Windows/auto-install

Setup hook handles:
- Bun detection with fallback locations
- uv detection (optional, for Chroma)
- Version marker to skip redundant installs
- Clear error messages with install instructions

* feat: add np for one-command npm releases

- Add np as dev dependency
- Add release, release:patch, release:minor, release:major scripts
- Add prepublishOnly hook to run build before publish
- Configure np (no yarn, include all contents, run tests)

* fix: reduce PostToolUse hook timeout to 30s

PostToolUse runs on every tool call, 120s was excessive and could cause
hangs. Reduced to 30s for responsive behavior.

* docs: add PR shipping report

Analyzed 6 PRs for shipping readiness:
- #856: Ready to merge (idle timeout fix)
- #700, #722, #657: Have conflicts, need rebase
- #464: Contributor PR, too large (15K+ lines)
- #863: Needs manual review

Includes shipping strategy and conflict resolution order.

* MAESTRO: Verify PR #856 test suite passes

All 797 tests pass (3 skipped, 0 failures). The 11 SessionQueueProcessor
idle timeout tests all pass with 20 expect() assertions verified.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* MAESTRO: Verify PR #856 build passes

- Ran npm run build successfully with no TypeScript errors
- All artifacts generated (worker-service, mcp-server, context-generator, viewer UI)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* MAESTRO: Code review PR #856 implementation verified

Verified all requirements in SessionQueueProcessor.ts:
- IDLE_TIMEOUT_MS = 180000ms (3 minutes)
- waitForMessage() accepts timeout parameter
- lastActivityTime reset on spurious wakeup (race condition fix)
- Graceful exit logs include thresholdMs parameter
- 11 comprehensive test cases in SessionQueueProcessor.test.ts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: bigph00t <166455923+bigph00t@users.noreply.github.com>
Co-authored-by: root <root@srv1317155.hstgr.cloud>
2026-02-04 19:31:24 -05:00
Alexander Knigge 182097ef1c fix: resolve path format mismatch in folder CLAUDE.md generation (#794) (#813)
The isDirectChild() function failed to match files when the API used
absolute paths (/Users/x/project/app/api) but the database stored
relative paths (app/api/router.py). This caused all folder CLAUDE.md
files to incorrectly show "No recent activity".

Changes:
- Create shared path-utils module with proper path normalization
- Implement suffix matching strategy for mixed path formats
- Update SessionSearch.ts to use shared utilities
- Update regenerate-claude-md.ts to use shared utilities (was using
  outdated broken logic)
- Prevent spurious directory creation from malformed paths
- Add comprehensive test coverage for path matching edge cases

This is the proper fix for #794, replacing PR #809 which only masked
the bug by skipping file creation when "no activity" was shown.

Co-authored-by: bigphoot <bigphoot@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 15:48:31 -05:00
Alex Newman 3ea0b60b9f feat: Mode system with inheritance and multilingual support (#412)
* feat: add domain management system with support for multiple domain profiles

- Introduced DomainManager class for loading and managing domain profiles.
- Added support for a default domain ('code') and fallback mechanisms.
- Implemented domain configuration validation and error handling.
- Created types for domain configuration, observation types, and concepts.
- Added new directory for domain profiles and ensured its existence.
- Updated SettingsDefaultsManager to include CLAUDE_MEM_DOMAIN setting.

* Refactor domain management to mode management

- Removed DomainManager class and replaced it with ModeManager for better clarity and functionality.
- Updated types from DomainConfig to ModeConfig and DomainPrompts to ModePrompts.
- Changed references from domains to modes in the settings and paths.
- Ensured backward compatibility by maintaining the fallback mechanism to the 'code' mode.

* feat: add migration 008 to support mode-agnostic observations and refactor service layer references in documentation

* feat: add new modes for code development and email investigation with detailed observation types and concepts

* Refactor observation parsing and prompt generation to incorporate mode-specific configurations

- Updated `parseObservations` function to use 'observation' as a universal fallback type instead of 'change', utilizing active mode's valid observation types.
- Modified `buildInitPrompt` and `buildContinuationPrompt` functions to accept a `ModeConfig` parameter, allowing for dynamic prompt content based on the active mode.
- Enhanced `ModePrompts` interface to include additional guidance for observers, such as recording focus and skip guidance.
- Adjusted the SDKAgent to load the active mode and pass it to prompt generation functions, ensuring prompts are tailored to the current mode's context.

* fix: correct mode prompt injection to preserve exact wording and type list visibility

- Add script to extract prompts from main branch prompts.ts into code.yaml
- Fix prompts.ts to show type list in XML template (e.g., "[ bugfix | feature | ... ]")
- Keep 'change' as fallback type in parser.ts (maintain backwards compatibility)
- Regenerate code.yaml with exact wording from original hardcoded prompts
- Build succeeds with no TypeScript errors

* fix: update ModeManager to load JSON mode files and improve validation

- Changed ModeManager to load mode configurations from JSON files instead of YAML.
- Removed the requirement for an "observation" type and updated validation to require at least one observation type.
- Updated fallback behavior in the parser to use the first type from the active mode's type list.
- Added comprehensive tests for mode loading, prompt injection, and parser integration, ensuring correct behavior across different modes.
- Introduced new mode JSON files for "Code Development" and "Email Investigation" with detailed observation types and prompts.

* Add mode configuration loading and update licensing information for Ragtime

- Implemented loading of mode configuration in WorkerService before database initialization.
- Added PolyForm Noncommercial License 1.0.0 to Ragtime directory.
- Created README.md for Ragtime with licensing details and usage guidelines.

* fix: add datasets directory to .gitignore to prevent accidental commits

* refactor: remove unused plugin package.json file

* chore: add package.json for claude-mem plugin with version 7.4.5

* refactor: remove outdated tests and improve error handling

- Deleted tests for ChromaSync error handling, smart install, strip memory tags, and user prompt tag stripping due to redundancy or outdated logic.
- Removed vitest configuration as it is no longer needed.
- Added a comprehensive implementation plan for fixing the modes system, addressing critical issues and improving functionality.
- Created a detailed test analysis report highlighting the quality and effectiveness of the current test suite, identifying areas for improvement.
- Introduced a new plugin package.json for runtime dependencies related to claude-mem hooks.

* refactor: remove parser regression tests to streamline codebase

* docs: update CLAUDE.md to clarify test management and changelog generation

* refactor: remove migration008 for mode-agnostic observations

* Refactor observation type handling to use ModeManager for icons and emojis

- Removed direct mappings of observation types to icons and work emojis in context-generator, FormattingService, SearchManager, and TimelineService.
- Integrated ModeManager to dynamically retrieve icons and emojis based on the active mode.
- Improved maintainability by centralizing the logic for observation type representation.

* Refactor observation metadata constants and update context generator

- Removed the explicit declaration of OBSERVATION_TYPES and OBSERVATION_CONCEPTS from observation-metadata.ts.
- Introduced fallback default strings for DEFAULT_OBSERVATION_TYPES_STRING and DEFAULT_OBSERVATION_CONCEPTS_STRING.
- Updated context-generator.ts to utilize observation types and concepts from ModeManager instead of constants.

* refactor: remove intermediate error handling from hooks (Phase 1)

Apply "fail fast" error handling strategy - errors propagate and crash loud
instead of being caught, wrapped, and re-thrown at intermediate layers.

Changes:
- Remove try/catch around fetch calls in all hooks - let errors throw
- Add try/catch ONLY around JSON.parse at entry points
- Delete error-handler.ts and hook-error-handler.ts (no longer needed)
- Update worker-utils.ts: functions now throw instead of returning null
- Update transcript-parser.ts: throw on missing path, empty file, malformed JSON
- Remove all handleWorkerError, handleFetchError imports

Philosophy: If something breaks, we KNOW it broke. No silent failures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: remove intermediate error handling from worker service (Phase 2)

Apply "fail fast" error handling strategy to worker service layer.

Changes:
- worker-service.ts: Remove try/catch from version endpoint, cleanup,
  MCP close, process enumeration, force kill, and isAlive check
- SessionRoutes.ts: Remove try/catch from JSON.stringify calls, remove
  .catch() from Chroma sync and SDK agent calls
- SettingsRoutes.ts: Remove try/catch from toggleMcp()
- DatabaseManager.ts: Remove .catch() from backfill and close operations
- SDKAgent.ts: Keep outer try/catch (top-level), remove .catch() from
  Chroma sync operations
- SSEBroadcaster.ts: Remove try/catch from broadcast and sendToClient

Philosophy: Errors propagate and crash loud. BaseRouteHandler.wrapHandler
provides top-level catching for HTTP routes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: remove error swallowing from SQLite services (Phase 3)

Apply "fail fast" error handling strategy to database layer.

SessionStore.ts migrations:
- ensureWorkerPortColumn(): Remove outer try/catch, let it throw
- ensurePromptTrackingColumns(): Remove outer try/catch, let it throw
- removeSessionSummariesUniqueConstraint(): Keep inner transaction
  rollback, remove outer catch
- addObservationHierarchicalFields(): Remove outer try/catch
- makeObservationsTextNullable(): Keep inner transaction rollback,
  remove outer catch
- createUserPromptsTable(): Keep inner transaction rollback, remove
  outer catch
- getFilesForSession(): Remove try/catch around JSON.parse

SessionSearch.ts:
- ensureFTSTables(): Remove try/catch, let it throw

Philosophy: Migration errors that are swallowed mean we think the
database is fine when it's not. Keep only inner transaction rollback
try/catch blocks.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: remove error hiding from utilities (Phase 4)

Apply "fail fast" error handling strategy to utility layer.

logger.ts:
- formatTool(): Remove try/catch, let JSON.parse throw on malformed input

context-generator.ts:
- loadContextConfig(): Remove try/catch, let parseInt throw on invalid settings
- Transcript extraction: Remove try/catch, let file read errors propagate

ChromaSync.ts:
- close(): Remove nested try/catch blocks, let close errors propagate

Philosophy: No silent fallbacks or hidden defaults. If something breaks,
we know it broke immediately.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: serve static UI assets and update package root path

- Added middleware to serve static UI assets (JS, CSS, fonts, etc.) in ViewerRoutes.
- Updated getPackageRoot function to correctly return the package root directory as one level up from the current directory.

* feat: Enhance mode loading with inheritance support

- Introduced parseInheritance method to handle parent--override mode IDs.
- Added deepMerge method for recursively merging mode configurations.
- Updated loadMode method to support inheritance, loading parent modes and applying overrides.
- Improved error handling for missing mode files and logging for better traceability.

* fix(modes): correct inheritance file resolution and path handling

* Refactor code structure for improved readability and maintainability

* feat: Add mode configuration documentation and examples

* fix: Improve concurrency handling in translateReadme function

* Refactor SDK prompts to enhance clarity and structure

- Updated the `buildInitPrompt` and `buildContinuationPrompt` functions in `prompts.ts` to improve the organization of prompt components, including the addition of language instructions and footer messages.
- Removed redundant instructions and emphasized the importance of recording observations.
- Modified the `ModePrompts` interface in `types.ts` to include new properties for system identity, language instructions, and output format header, ensuring better flexibility and clarity in prompt generation.

* Enhance prompts with language instructions and XML formatting

- Updated `buildInitPrompt`, `buildSummaryPrompt`, and `buildContinuationPrompt` functions to include detailed language instructions in XML comments.
- Ensured that language instructions guide users to keep XML tags in English while writing content in the specified language.
- Modified the `buildSummaryPrompt` function to accept `mode` as a parameter for consistency.
- Adjusted the call to `buildSummaryPrompt` in `SDKAgent` to pass the `mode` argument.

* Refactor XML prompt generation in SDK

- Updated the buildInitPrompt, buildSummaryPrompt, and buildContinuationPrompt functions to use new placeholders for XML elements, improving maintainability and readability.
- Removed redundant language instructions in comments for clarity.
- Added new properties to ModePrompts interface for better structure and organization of XML placeholders and section headers.

* feat: Update observation prompts and structure across multiple languages

* chore: Remove planning docs and update Ragtime README

Remove ephemeral development artifacts:
- .claude/plans/modes-system-fixes.md
- .claude/test-analysis-report.md
- PROMPT_INJECTION_ANALYSIS.md

Update ragtime/README.md to explain:
- Feature is not yet implemented
- Dependency on modes system (now complete in PR #412)
- Ready to be scripted out in future release

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix: Move summary prompts to mode files for multilingual support

Summary prompts were hardcoded in English in prompts.ts, breaking
multilingual support. Now properly mode-based:

- Added summary_instruction, summary_context_label,
  summary_format_instruction, summary_footer to code.json
- Updated buildSummaryPrompt() to use mode fields instead of hardcoded text
- Added summary_footer with language instructions to all 10 language modes
- Language modes keep English prompts + language requirement footer

This fixes the gaslighting where we claimed full multilingual support
but summaries were still generated in English.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* chore: Clean up README by removing local preview instructions and streamlining beta features section

* Add translated README files for Ukrainian, Vietnamese, and Chinese languages

* Add new language modes for code development in multiple languages

- Introduced JSON configurations for Code Development in Greek, Finnish, Hebrew, Hindi, Hungarian, Indonesian, Italian, Dutch, Norwegian, Polish, Brazilian Portuguese, Romanian, Swedish, Turkish, and Ukrainian.
- Each configuration includes prompts for observations, summaries, and instructions tailored to the respective language.
- Ensured that all prompts emphasize the importance of generating observations without referencing the agent's actions.

* Add multilingual support links to README files in various languages

- Updated README.id.md, README.it.md, README.ja.md, README.ko.md, README.nl.md, README.no.md, README.pl.md, README.pt-br.md, README.ro.md, README.ru.md, README.sv.md, README.th.md, README.tr.md, README.uk.md, README.vi.md, and README.zh.md to include links to other language versions.
- Each README now features a centered paragraph with flags and links for easy navigation between different language documents.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-22 20:14:18 -05:00
Alex Newman 52d2f72a82 Standardize and enhance error handling across hooks and worker service (#295)
* Enhance error logging in hooks

- Added detailed error logging in context-hook, new-hook, save-hook, and summary-hook to capture status, project, port, and relevant session information on failures.
- Improved error messages thrown in save-hook and summary-hook to include specific context about the failure.

* Refactor migration logging to use console.log instead of console.error

- Updated SessionSearch and SessionStore classes to replace console.error with console.log for migration-related messages.
- Added notes in the documentation to clarify the use of console.log for migration messages due to the unavailability of the structured logger during constructor execution.

* Refactor SDKAgent and silent-debug utility to simplify error handling

- Updated SDKAgent to use direct defaults instead of happy_path_error__with_fallback for non-critical fields such as last_user_message, last_assistant_message, title, filesRead, filesModified, concepts, and summary.request.
- Enhanced silent-debug documentation to clarify appropriate use cases for happy_path_error__with_fallback, emphasizing its role in handling unexpected null/undefined values while discouraging its use for nullable fields with valid defaults.

* fix: correct happy_path_error__with_fallback usage to prevent false errors

Fixes false "Missing cwd" and "Missing transcript_path" errors that were
flooding silent.log even when values were present.

Root cause: happy_path_error__with_fallback was being called unconditionally
instead of only when the value was actually missing.

Pattern changed from:
  value: happy_path_error__with_fallback('Missing', {}, value || '')

To correct usage:
  value: value || happy_path_error__with_fallback('Missing', {}, '')

Fixed in:
- src/hooks/save-hook.ts (PostToolUse hook)
- src/hooks/summary-hook.ts (Stop hook)
- src/services/worker/http/routes/SessionRoutes.ts (2 instances)

Impact: Eliminates false error noise, making actual errors visible.

Addresses issue #260 - users were seeing "Missing cwd" errors despite
Claude Code correctly passing all required fields.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Enhance error logging and handling across services

- Improved error messages in SessionStore to include project context when fetching boundary observations and timestamps.
- Updated ChromaSync error handling to provide more informative messages regarding client initialization failures, including the project context.
- Enhanced error logging in WorkerService to include the package path when reading version fails.
- Added detailed error logging in worker-utils to capture expected and running versions during health checks.
- Extended WorkerErrorMessageOptions to include actualError for more informative restart instructions.

* Refactor error handling in hooks to use standardized fetch error handler

- Introduced a new error handler `handleFetchError` in `shared/error-handler.ts` to standardize logging and user-facing error messages for fetch failures across hooks.
- Updated `context-hook.ts`, `new-hook.ts`, `save-hook.ts`, and `summary-hook.ts` to utilize the new error handler, improving consistency and maintainability.
- Removed redundant imports and error handling logic related to worker restart instructions from the hooks.

* feat: add comprehensive error handling tests for hooks and ChromaSync client

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-13 23:25:43 -05:00