Commit Graph

1875 Commits

Author SHA1 Message Date
Alex Newman 2e3532127a Merge remote-tracking branch 'origin/codex-version-mismatch-investigation-plan' into fix-and-ship-codex-mem-search-access 2026-05-06 19:11:10 -07:00
Alex Newman 3a000b6bc6 Merge remote-tracking branch 'origin/main' into fix-and-ship-codex-mem-search-access
# Conflicts:
#	.mcp.json
#	plugin/.mcp.json
#	plugin/scripts/mcp-server.cjs
#	plugin/scripts/worker-service.cjs
#	tests/infrastructure/plugin-distribution.test.ts
2026-05-06 19:10:49 -07:00
Alex Newman a2872dabfa docs: plan Codex plugin version mismatch fix 2026-05-06 19:08:12 -07:00
Alex Newman 7af43146e4 docs: update changelog for v12.7.3 2026-05-06 18:32:44 -07:00
Alex Newman 02c7a3dfa9 chore: bump version to 12.7.3 2026-05-06 18:30:53 -07:00
Alex Newman 65f2fd8cdd fix: harden startup and schema repair contracts
Reliability patch covering startup path resolution, install marker compatibility, export CLI request contracts, schema repair safety, hard-stop retry-loop handling, and the PR babysit status helper.
2026-05-06 18:29:26 -07:00
Alex Newman 37d186e767 test: guard MCP launcher fallback distribution 2026-05-06 14:29:49 -07:00
Alex Newman c80225147f chore: bump version to 12.7.3 2026-05-06 14:23:37 -07:00
Alex Newman 938c608507 fix(codex): make mem-search MCP startup self-locating 2026-05-06 14:20:55 -07:00
Alex Newman bb3dbfdb5a docs(skill): document discord notify working directory 2026-05-06 12:11:27 -07:00
Alex Newman 4db99da432 docs: update changelog for 12.7.2 2026-05-06 03:35:09 -07:00
Alex Newman 20e98d8361 chore: bump version to 12.7.2 2026-05-06 03:34:22 -07:00
Alex Newman 65607897a8 fix(install): disable Claude Code auto-memory on every claude-code install
Disable Claude Code auto-memory during claude-code installs and harden atomic settings writes, including symlink and dangling-symlink destinations.
2026-05-06 03:32:40 -07:00
Alex Newman d31c4d2a57 docs: fix outdated references in CLAUDE.md (#2304)
- Correct hook lifecycle list: 6 hooks (Setup, SessionStart,
  UserPromptSubmit, PreToolUse, PostToolUse, Stop), not the
  fictional 'Summary' / 'SessionEnd' pair.
- Replace misleading 'src/hooks/*.ts' description with the actual
  build path from src/services/worker-service.ts via
  scripts/build-hooks.js, and list the real subcommands.
- Drop the broken link to private/context/claude-code/exit-codes.md
  (path no longer exists in the repo).
2026-05-06 03:20:15 -07:00
Alex Newman 1eaed3141f docs: update changelog for 12.7.1 2026-05-06 03:07:39 -07:00
Alex Newman 6198762f1e chore: bump version to 12.7.1 2026-05-06 03:06:20 -07:00
Alex Newman 9f2ce1754c Add babysit PR monitoring skill
Add a babysit skill for monitoring PR checks, review comments, and review threads until a PR is merge-ready.
2026-05-06 03:04:40 -07:00
Alex Newman 9e4e30a01d docs: update changelog for 12.7.0 2026-05-06 01:59:32 -07:00
Alex Newman 1667eac0be chore: bump version to 12.7.0 2026-05-06 01:57:45 -07:00
Alex Newman 56db06811e Add native Codex hooks integration (#2319)
* Add native Codex hooks integration

* Address Codex review feedback

* Use durable Codex marketplace root

* Address Codex file context review feedback

* Harden Codex installer review paths

* Report Codex legacy cleanup failures

* fix: keep MCP manifests in marketplace sync

* fix: bundle zod in MCP server

* fix: warn on Codex legacy cleanup failure

* Fix hook observation readiness timeouts

* Address Codex hook review notes

* Tighten Codex MCP file context matching

* Resolve final Codex review nits

* Add Codex marketplace version guidance

* Reset worker failure counter on API fallback

* Fix Codex cat flag file extraction
2026-05-06 01:55:27 -07:00
Alex Newman a5bb6b346a docs: document LiteLLM gateway routing 2026-05-05 15:08:09 -07:00
Alex Newman 09dcecafd0 chore: bump version to 12.6.5 2026-05-05 14:48:24 -07:00
Alex Newman b414f57edb Merge pull request #2302 from thedotmack/codex/remove-agent-pool-timeout
[codex] Remove agent pool timeout data loss
2026-05-05 14:46:28 -07:00
Alex Newman 46b59573b5 Merge remote-tracking branch 'origin/main' into codex/remove-agent-pool-timeout
# Conflicts:
#	plugin/scripts/mcp-server.cjs
#	plugin/scripts/worker-service.cjs
2026-05-05 14:45:57 -07:00
Alex Newman 89718f79b0 chore: bump version to 12.6.4 2026-05-05 14:34:55 -07:00
Alex Newman ce35bb520d chore: sync plugin artifacts for 12.6.3 2026-05-05 13:07:36 -07:00
Alex Newman 519fbe5daa 12.6.3 2026-05-05 13:05:38 -07:00
Alex Newman 92f800d49c fix: drain invalid observer responses 2026-05-05 13:00:42 -07:00
Alex Newman 9a2818fc2e docs: regenerate CHANGELOG for v12.6.2 2026-05-04 22:36:10 -07:00
Alex Newman ec97813582 chore: bump version to 12.6.2 2026-05-04 21:48:43 -07:00
Alex Newman 1981f9b2fe fix(install): revert tree-sitter grammars to devDependencies (#2300 regression) (#2305)
PR #2300 moved 21 tree-sitter grammar packages from devDependencies into
root dependencies, claiming "their .wasm files are loaded at runtime by
parser.ts." That justification is wrong for the root claude-mem npm
package: parser.ts compiles into plugin/scripts/worker-service.cjs, which
runs from the marketplace folder where plugin/package.json already lists
every grammar as a runtime dep. Nothing in dist/npx-cli/ ever loads a
grammar, and resolveGrammarPath() handles missing packages gracefully.

The regression: `npx claude-mem@12.6.1 install` now fetches all 21
grammars at npx time. tree-sitter-swift's postinstall pulls a nested
tree-sitter-cli that downloads a Rust binary from GitHub and hangs the
install. npm ignores the trustedDependencies bun-allowlist, so there's
no way to skip the postinstall scripts on a bare `npx` fetch.

Fix: move grammar packages back to root devDependencies. The marketplace
plugin install (installPluginDependencies → bun install in plugin/) still
works because plugin/package.json keeps them as deps and Bun honors
trustedDependencies: ["tree-sitter-cli"] to skip the harmful postinstalls
on every other grammar.

Keep PR #2300's --legacy-peer-deps + --omit=dev install.ts changes —
those address a separate, valid marketplace ERESOLVE.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:47:27 -07:00
Alex Newman a122d34ebf fix: address Greptile P1 + CodeRabbit follow-ups (cycle 9)
- waitForSlot now accepts an optional AbortSignal. When the signal
  fires (e.g. session.abortController.abort() during shutdown or
  cancel), the queued waiter is removed from slotWaiters and the
  promise rejects immediately, instead of hanging until a slot
  naturally opens. Restores the cancellation guarantee that the
  removed 60s timeout used to provide. ClaudeProvider.startSession
  now passes session.abortController.signal at the call site.
- EnvManager: a bare ANTHROPIC_BASE_URL now also short-circuits the
  OAuth lookup. Tokenless gateways (allowed by the new install flow)
  were otherwise being authenticated against api.anthropic.com via the
  injected OS-keychain OAuth token.
- install.ts: resolveClaudeAuthMethod now reads the raw stored
  CLAUDE_MEM_CLAUDE_AUTH_METHOD value via a direct settings.json read
  (readRawStoredAuthMethod), bypassing SettingsDefaultsManager's
  default backfill. Without this, getSetting() always returned
  'subscription' for unmigrated installs and the env-based fallback
  never ran — so the previous fix only addressed the optics, not
  the actual misclassification.
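
The abortable waiter described in the first bullet can be sketched roughly as follows. This is a minimal illustration under assumptions, not the project's process-registry code: the `SlotPool` class, its `queued` getter, and the `release()` method are hypothetical names introduced only for this example.

```typescript
// Hypothetical illustration: a slot pool whose queued waiters can be cancelled.
type Waiter = { resolve: () => void; reject: (err: Error) => void };

class SlotPool {
  private active = 0;
  private readonly slotWaiters: Waiter[] = [];

  constructor(private readonly maxConcurrent: number) {}

  // Number of callers currently waiting for a slot.
  get queued(): number {
    return this.slotWaiters.length;
  }

  // Resolves once a slot is free; rejects as soon as `signal` aborts.
  waitForSlot(signal?: AbortSignal): Promise<void> {
    if (signal?.aborted) return Promise.reject(new Error("aborted"));
    if (this.active < this.maxConcurrent) {
      this.active++;
      return Promise.resolve();
    }
    return new Promise<void>((resolve, reject) => {
      const waiter: Waiter = {
        resolve: () => {
          signal?.removeEventListener("abort", onAbort);
          resolve();
        },
        reject,
      };
      const onAbort = (): void => {
        // Remove the queued waiter so a freed slot is not handed to it later.
        const i = this.slotWaiters.indexOf(waiter);
        if (i !== -1) this.slotWaiters.splice(i, 1);
        reject(new Error("aborted"));
      };
      signal?.addEventListener("abort", onAbort, { once: true });
      this.slotWaiters.push(waiter);
    });
  }

  // Frees a slot and hands it to the next waiter, if any.
  release(): void {
    const next = this.slotWaiters.shift();
    if (next) next.resolve(); // the slot transfers directly, so `active` is unchanged
    else this.active--;
  }
}
```

With a shape like this, aborting the controller removes the waiter immediately instead of leaving it queued until a slot opens.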

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:07:42 -07:00
Alex Newman 5edf1557c4 fix: address Greptile P1 security + CodeRabbit follow-ups on PR #2302
- EnvManager: add ANTHROPIC_AUTH_TOKEN to BLOCKED_ENV_VARS so a token
  inherited from the parent shell can no longer short-circuit the OAuth
  lookup at SDK spawn time. Mirrors the ANTHROPIC_API_KEY treatment
  added in issue #733. Explicit gateway tokens in
  ~/.claude-mem/.env are still re-injected by buildIsolatedEnv().
- install.ts: extract resolveClaudeAuthMethod() that returns a stored
  CLAUDE_MEM_CLAUDE_AUTH_METHOD when present and otherwise infers
  the mode from ~/.claude-mem/.env (ANTHROPIC_BASE_URL → gateway,
  ANTHROPIC_API_KEY → api-key, else subscription). persistClaudeProvider,
  the interactive Claude auth flow, and promptClaudeModel now use it,
  so older installs that pre-date the setting are no longer
  misclassified as 'subscription' (which would clear working
  credentials and disable custom gateway models).
- configureDirectApiKey: when an Anthropic API key already exists,
  prompt to keep or rotate it instead of silently re-saving — restores
  the ability to update a revoked or rotated key from the installer
  without losing the cancel-safe behaviour added in 7f3686fd.
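
The inference order in the second bullet (stored value first, then ANTHROPIC_BASE_URL, then ANTHROPIC_API_KEY, else subscription) can be sketched as below. The function name `inferAuthMethod` and its parameters are assumptions made for illustration; the real resolveClaudeAuthMethod() may read its inputs differently.

```typescript
type AuthMethod = "subscription" | "api-key" | "gateway";

// Hypothetical helper: `stored` is the raw persisted setting (if any) and
// `envFile` is the already-parsed key/value content of the .env file.
function inferAuthMethod(
  stored: string | undefined,
  envFile: Record<string, string | undefined>,
): AuthMethod {
  if (stored === "subscription" || stored === "api-key" || stored === "gateway") {
    return stored; // an explicit stored value always wins
  }
  if (envFile.ANTHROPIC_BASE_URL) return "gateway";
  if (envFile.ANTHROPIC_API_KEY) return "api-key";
  return "subscription";
}
```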

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 20:59:11 -07:00
Alex Newman 7f3686fd2c fix: address CodeRabbit Major findings on install.ts
- Cancel of API-key / Gateway-URL prompts no longer wipes existing
  credentials by switching to subscription auth and emptying
  ANTHROPIC_API_KEY / ANTHROPIC_BASE_URL / ANTHROPIC_AUTH_TOKEN. Cancel
  now leaves the prior config untouched.
- Empty gateway-token input preserves the existing token instead of
  clearing it. The new prompt copy explains that blank keeps the
  current token.
- Interactive install no longer hard-locks to Claude when
  --provider is unset. Prompt now asks for provider
  (claude/gemini/openrouter) up front, then runs the Claude auth flow
  only when the user picks Claude.
- Claude auth-mode prompt now seeds initialValue from the stored
  CLAUDE_MEM_CLAUDE_AUTH_METHOD setting, so reruns honor existing
  configuration instead of always defaulting to subscription.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 20:51:00 -07:00
Alex Newman 0cc45c6e7f fix: drop duplicate notifySlotAvailable() in SDK child exit handler
CodeRabbit flagged a duplicate slot wakeup: spawnSdkProcess's child
'exit' handler called registry.unregister(recordId) and then
notifySlotAvailable() unconditionally. Registry.unregister() already
fires notifySlotAvailable() internally when removing an SDK entry, so
the trailing call woke a second waiter for the same freed slot — both
could see count < maxConcurrent in the same synchronous tick before
either replacement registered, transiently exceeding maxConcurrent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 20:47:08 -07:00
Alex Newman 4e49dcf445 fix: address Greptile P1 findings on PR #2302
- process-registry.ts: skip the trailing notifySlotAvailable() when
  pruneDeadEntries() removed entries — prune already wakes one waiter
  per removed SDK process, so the unconditional call double-woke and
  could let two waiters spawn in the same synchronous tick, briefly
  exceeding maxConcurrent. Only fire the safety-net notify when nothing
  was pruned.
- install.ts: persistClaudeProvider() no longer silently rewrites
  CLAUDE_MEM_CLAUDE_AUTH_METHOD to 'subscription'. When called without
  an explicit auth method, preserve the existing setting; only fall
  back to 'subscription' when none is configured. Prevents re-running
  'claude-mem install --provider claude' from wiping a user's
  configured 'api-key' or 'gateway' auth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 20:43:00 -07:00
Alex Newman 53cdccdf7a docs: regenerate CHANGELOG for v12.6.1
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 20:32:20 -07:00
Alex Newman 1a8432fe97 chore: bump version to 12.6.1
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 20:29:31 -07:00
Alex Newman 9902a15b21 Remove stale internal docs and reports (#2290) 2026-05-04 20:22:29 -07:00
Alex Newman c4097b4ebb add Claude SDK gateway installer setup 2026-05-04 20:21:36 -07:00
Alex Newman 3ff6fcbe1e fix: install no longer fails on tree-sitter peer-dep ERESOLVE (#2300)
* fix: install no longer fails on tree-sitter peer-dep ERESOLVE

The marketplace npm install was failing on a peer-dep conflict between
@derekstride/tree-sitter-sql (peers tree-sitter@^0.21) and
@tree-sitter-grammars/tree-sitter-lua (peers tree-sitter@^0.22.4),
breaking install across all 12 supported IDEs (#2261-#2272).

The conflict is irrelevant: smart_outline/smart_search/smart_unfold use
the tree-sitter CLI + .wasm files shipped inside each grammar package,
never the JS native bindings, so the peer warning is harmless.

- package.json: move grammar packages to dependencies (their .wasm files
  are loaded at runtime by parser.ts, so they were never devDeps).
- src/npx-cli/commands/install.ts: pass --legacy-peer-deps to silence
  the resolver and replace deprecated --production with --omit=dev.

Verified across all 12 IDEs in the install harness: zero npm errors,
21 grammar packages installed, smart_outline parses TypeScript and
smart_search matches across TypeScript+Python.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: clarify --legacy-peer-deps rationale in marketplace install

Addresses Greptile review comment on PR #2300.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 20:21:23 -07:00
Alex Newman 8bef7c6a34 fix: remove agent pool timeout data loss 2026-05-04 19:43:45 -07:00
Alex Newman 39f1102600 fix: remove ONNX/OpenBLAS thread cap from chroma-mcp spawn env
The 2-thread cap was a bandaid for #2220 (Windows) and #2253 (macOS Intel)
CPU runaway reports on v12.4.9. The actual root causes (watermark stuck
at 0 → continuous re-embed, orphan process trees, fire-and-forget backfill
across 80+ projects) were fixed structurally in #2282: per-batch watermark
persistence, killProcessTree() + pgid registration, max-3 concurrent
backfills with re-entrancy guard, kernel-enforced child cleanup (#2216).

With the structural fixes in place, capping ONNX/OpenBLAS/MKL at 2 threads
slows initial backfill 3–6× on multi-core machines and provides no
steady-state benefit. Defer to the OS scheduler and the user's environment.

ANONYMIZED_TELEMETRY=false stays — unrelated to the storm, blocks
background HTTP from the embedding subprocess.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:08:53 -07:00
Alex Newman 43037782a8 docs: regenerate CHANGELOG for v12.6.0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 22:53:10 -07:00
Alex Newman a3b161f8c8 chore: bump version to 12.6.0
Minor release: 17 issues fixed and 4 foundations introduced via PR #2282.

Foundations (new public modules):
- F1 spawnHidden wrapper (src/shared/spawn.ts) — windowsHide default
- F2 paths namespace (src/shared/paths.ts) — 24 hardcoded sites collapsed,
  CLAUDE_MEM_DATA_DIR flows 100% of runtime
- F3 getUptimeSeconds (src/shared/uptime.ts) — fixes ms-bug at Server.ts:165
- F4 ClassifiedProviderError (src/services/worker/provider-errors.ts) — kind
  union (transient/unrecoverable/rate_limit/quota_exhausted/auth_invalid),
  per-provider classifiers, unrecoverablePatterns allowlist deleted

User-facing capabilities:
- OAuth keychain reader (#2215): readClaudeOAuthToken() reads from
  macOS keychain, Windows DPAPI, Linux libsecret at worker spawn-time
- Quota-aware wall-clock guard (#2234): RateLimitStore with auth-type gate
  (api_key never aborts; cli/oauth aborts at per-window thresholds);
  rateLimits surfaced on /api/health
- withRetry helper (#2254): honors ClassifiedProviderError.kind, exponential
  backoff with jitter, request-id capture for dedup logging

Bug fixes also landing in 12.6.0:
#2188 stdin fallback, #2196 ANTHROPIC_BASE_URL docs, #2220 chroma CPU storm,
#2225 opencode _zod.def crash, #2231 SECURITY.md, #2233 parser fence (Part A),
#2236 observer visible windows, #2237/#2238 hardcoded paths, #2240 Chroma
dedupe, #2242 check-pending-queue endpoints, #2243 scripts/package.json,
#2244 summaryStored=null, #2247 Codex task_complete, #2248 Cursor sessions,
#2250 health uptime ms, #2253 chroma macOS CPU.

Tests: 1454 pass / 77 fail (matches main baseline, zero net regressions).
All CI green: build, CodeRabbit (17 rounds resolved), Greptile (clean).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 22:48:11 -07:00
Alex Newman d384d3c595 fix: bug-batch — 17 issues + 4 foundations (chroma, opencode, parser, OAuth, paths, uptime, classification) (#2282)
* feat: foundations F1-F4 + simple bug fixes

Foundations (no consumer adoption yet):
- F1 spawnHidden wrapper at src/shared/spawn.ts
- F2 paths namespace with 18 accessors + invariant test (tests/shared/paths.test.ts)
- F3 getUptimeSeconds at src/shared/uptime.ts
- F4 ClassifiedProviderError at src/services/worker/provider-errors.ts + 6 tests

Issue fixes (file-isolated, parallel-safe):
- #2231: SECURITY.md at repo root for GitHub Security tab
- #2240: dedupe observationIds before Chroma sync (ResponseProcessor.ts)
- #2247: add task_complete to Codex session-end events
- #2243: rsync excludes scripts/package.json + scripts/node_modules

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: validate Claude executable with --version and detect desktop app

Extract findClaudeExecutable() into shared utility used by both
SDKAgent and KnowledgeAgent (deduplication). Every candidate is now
validated with --version (3s timeout). Desktop app executables in
AppData/Program Files get an actionable error message directing
users to install the CLI via npm.

Closes #2222

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use Zod schemas in OpenCode plugin to fix _zod.def crash

OpenCode 1.14.x walks arg._zod.def at plugin registration, which
crashes on plain JSON Schema objects like {type: "string"}. Replace
with z.string().describe() so the Zod internals are present.

Closes #2226, #2225, #2154

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: neutralize chroma-mcp CPU storm at the root

Two surgical fixes to the chroma backfill path that together cause the
sustained 60–80% CPU + orphan accumulation pattern reported across

1. ChromaMcpManager.getSpawnEnv: cap embedding-thread fanout
   ONNX Runtime / OpenBLAS / MKL all default to cpu_count(), so a 12-core
   machine spins 12 threads burning embeddings concurrently. The user's
   getSpawnEnv only handled SSL certs — no thread limits at all. Inject
   OMP_NUM_THREADS / ONNX_NUM_THREADS / OPENBLAS_NUM_THREADS / MKL_NUM_THREADS
   defaults of 2 (only if user hasn't pinned them), and
   ANONYMIZED_TELEMETRY=false to stop background HTTP from the embedding
   subprocess. Closes the storm at the source.

2. ChromaSync.backfill{Observations,Summaries,Prompts}: per-batch watermark
   The bump was in a trailing finally block. SIGKILL / OOM / power loss
   mid-flight skips finally entirely, so the watermark stayed at 0 and the
   next worker boot re-embedded the entire history (16K obs in #2220's
   case), which then pegged CPU forever in combination with (1). Move the
   bump inside the loop so progress is durable per batch.
   Closes #2214.
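
   A minimal sketch of the per-batch watermark idea, assuming ids are sorted in ascending order. The `WatermarkStore` interface and the `embedBatch` callback are hypothetical stand-ins, not the ChromaSync API. A `finally` block is skipped when the process is killed outright, so progress is written inside the loop instead.

```typescript
// Hypothetical persistence interface for the "last synced id" watermark.
interface WatermarkStore {
  get(): number;
  set(value: number): void;
}

// Embeds ids above the watermark in batches, persisting progress after each batch.
function backfill(
  ids: number[],
  batchSize: number,
  store: WatermarkStore,
  embedBatch: (batch: number[]) => void, // stand-in for the real embedding call
): void {
  const watermark = store.get();
  const pending = ids.filter((id) => id > watermark);
  for (let i = 0; i < pending.length; i += batchSize) {
    const batch = pending.slice(i, i + batchSize);
    embedBatch(batch);
    // Durable per-batch progress: a crash in a later batch cannot lose this.
    store.set(Math.max(...batch));
  }
}
```

   If the process dies during batch N, the next run resumes after batch N-1 rather than from zero.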

Verification:
- 26/26 chroma tests pass (tests/services/sync, tests/integration/chroma-vector-sync)
- Bundle confirms thread caps and per-batch bumps are present
- Full suite: 1429 pass / 20 fail — pre-existing failures only, no
  regression vs v12.4.9 baseline (1429 pass / 27 fail)

Closes #2214.
Substantially de-amplifies #2220 (the structural Job-Object cleanup is
still tracked separately at #2216).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: kill chroma-mcp process tree and limit backfill concurrency

Three fixes for orphan chroma-mcp processes and resource exhaustion:

1. killProcessTree() in ChromaMcpManager.stop() tears down the full
   uvx->uv->python->chroma-mcp spawn chain (pkill -P on POSIX,
   taskkill /T on Windows) before MCP client.close().

2. Register chroma process with pgid for supervisor shutdown cascade.

3. backfillAllProjects() now processes max 3 projects concurrently
   with a re-entrancy guard to prevent overlapping fire-and-forget runs.
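
   The concurrency cap and re-entrancy guard in item 3 might look roughly like this. The `BackfillRunner` class and the `backfillOne` callback are hypothetical; the sketch only shows the pattern, not the actual backfillAllProjects() implementation.

```typescript
class BackfillRunner {
  private running = false;

  constructor(private readonly maxConcurrent = 3) {}

  // Returns null when a run is already in flight (re-entrancy guard).
  backfillAll(
    projects: string[],
    backfillOne: (project: string) => Promise<void>,
  ): Promise<void> | null {
    if (this.running) return null;
    this.running = true;

    const queue = [...projects];
    // Each worker pulls the next project until the shared queue is empty.
    const worker = async (): Promise<void> => {
      for (let p = queue.shift(); p !== undefined; p = queue.shift()) {
        try {
          await backfillOne(p);
        } catch (err) {
          // A real implementation would log; one failure should not stop the run.
        }
      }
    };

    const run = async (): Promise<void> => {
      try {
        const count = Math.min(this.maxConcurrent, queue.length);
        await Promise.all(Array.from({ length: count }, () => worker()));
      } finally {
        this.running = false;
      }
    };
    return run();
  }
}
```

   At most `maxConcurrent` projects are in flight at once, and a second call made while a run is active returns immediately instead of starting an overlapping run.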

Closes #2216, advances #2220, #2213

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* build: regenerate plugin artifacts after cherry-picks

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: foundation consumers + Cursor/stdin/queue/docs fixes

F1 spawnHidden adoption (#2236):
- 8 spawn → spawnHidden conversions across worker-utils, ProcessManager,
  npx-cli (install/runtime), supervisor/process-registry

F3 getUptimeSeconds adoption (#2250):
- Server.ts:165 (THE BUG: returned ms)
- Server.ts:270, SessionRoutes.ts:326 (4th ms-bug consumer found),
  DataRoutes.ts:225 (refactor for consistency)
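
The class of bug fixed here is a millisecond difference reported as seconds. A minimal sketch of a helper that avoids it, with an assumed signature (the real getUptimeSeconds in src/shared/uptime.ts may take different arguments):

```typescript
// Hypothetical signature: both arguments are epoch milliseconds.
function getUptimeSeconds(startedAtMs: number, nowMs: number = Date.now()): number {
  // Divide by 1000 once, in one place, so callers never see milliseconds.
  return Math.floor((nowMs - startedAtMs) / 1000);
}
```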

#2188 stdin '{}' fallback removal:
- Diagnostic logging to <DATA_DIR>/logs/runner-errors.log + CAPTURE_BROKEN
  marker; exit 0 to preserve Windows Terminal exit-code strategy

#2196 ANTHROPIC_BASE_URL docs:
- New docs/public/configuration/custom-anthropic-backends.mdx
- Note: issue may need separate auto-detect feature; docs document
  existing plumbing only

#2242 check-pending-queue endpoints:
- Point at /api/processing-status + /api/processing per DataRoutes.ts;
  honor CLAUDE_MEM_WORKER_PORT env

#2248 Cursor sessions never summarized:
- Pulled reporter wbingli's tested fix (commit 46eaba44)
- Bug A: cursor adapter now derives transcriptPath from cwd+sessionId
- Bug B: parser accepts both line.type and line.role
- Bug C: walk backward, prefer non-empty text, fallback to empty
- Tests: 10-case regression suite + tests/fixtures/cursor-session.jsonl

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: F2 paths namespace adoption (#2237 + #2238)

Replaced 24 hardcoded homedir() + '.claude-mem' sites across 18 source
files with paths.<accessor>() calls from src/shared/paths.ts.

Accessors used: dataDir, workerPid, settings, database, chroma,
combinedCerts, transcriptsConfig, transcriptsState, corpora,
supervisorRegistry, envFile, logsDir.

Sites converted (file:area):
- src/cli/claude-md-commands.ts (database)
- src/services/context/ContextConfigLoader.ts (settings)
- src/services/infrastructure/ProcessManager.ts (workerPid)
- src/services/infrastructure/WorktreeAdoption.ts (settings)
- src/services/integrations/CodexCliInstaller.ts (settings)
- src/services/sync/ChromaMcpManager.ts (chroma + combinedCerts)
- src/services/transcripts/config.ts (transcriptsConfig + transcriptsState)
- src/services/worker/ClaudeProvider.ts (envFile)
- src/services/worker/GeminiProvider.ts (envFile + 2 more)
- src/services/worker/http/routes/DataRoutes.ts (dataDir)
- src/services/worker/http/routes/SettingsRoutes.ts (settings + envFile)
- src/services/worker/knowledge/CorpusStore.ts (corpora)
- src/shared/EnvManager.ts (envFile)
- src/supervisor/index.ts (supervisorRegistry)
- src/supervisor/process-registry.ts (supervisorRegistry)
- src/supervisor/shutdown.ts (supervisorRegistry)
- src/utils/claude-md-utils.ts (database)
- src/utils/logger.ts (logsDir + settings, lazy to avoid cycle)

CLAUDE_MEM_DATA_DIR override now flows through 100% of the worker
runtime; no per-file env reads needed.

Verification:
- Grep guard: zero homedir+'.claude-mem' sites remain in src/
  (excluding paths.ts itself and SettingsDefaultsManager.ts)
- F2 invariant test: 3/3 pass (60 expects)
- Foundation tests: 19/19 pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: F4 provider classification + parser fence + OAuth keychain

F4 adoption (#2244 + #2254):
- Per-provider classifiers: classifyClaudeError, classifyGeminiError,
  classifyOpenRouterError. Each lives in the provider file.
- New retry helper at src/services/worker/retry.ts: withRetry() honors
  ClassifiedProviderError.kind; retriable=transient/rate_limit (with
  retryAfterMs); not retriable=unrecoverable/auth_invalid/quota_exhausted.
  maxRetries=2, perAttemptTimeout=30s, exponential backoff with jitter.
- GeminiProvider + OpenRouterProvider fetch calls wrapped with retry.
  Best-effort request-id capture (x-goog-request-id, x-request-id,
  x-openrouter-request-id) for dedup logging.
- Deleted unrecoverablePatterns allowlist at worker-service.ts:540 area;
  worker dispatches on err.kind instead.
- 28 new classifier tests at tests/worker/provider-classifiers.test.ts:
  429-no-Retry-After, 500-with-quota-exceeded, OverloadedError,
  per-provider auth_invalid signals.
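
A rough sketch of a retry helper that dispatches on the error kind, as the bullets above describe. `ClassifiedError`, `isRetriable`, `backoffDelayMs`, the 500 ms base delay, and the "full jitter" formula are assumptions for illustration; the per-attempt timeout is omitted for brevity.

```typescript
type ErrorKind =
  | "transient"
  | "unrecoverable"
  | "rate_limit"
  | "quota_exhausted"
  | "auth_invalid";

// Hypothetical stand-in for ClassifiedProviderError.
class ClassifiedError extends Error {
  constructor(
    message: string,
    readonly kind: ErrorKind,
    readonly retryAfterMs?: number,
  ) {
    super(message);
  }
}

// Only transient and rate-limit failures are worth retrying.
function isRetriable(kind: ErrorKind): boolean {
  return kind === "transient" || kind === "rate_limit";
}

// Exponential backoff with full jitter: uniform in [0, baseMs * 2^attempt).
function backoffDelayMs(
  attempt: number,
  baseMs = 500,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * baseMs * 2 ** attempt);
}

async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 2,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (
        !(err instanceof ClassifiedError) ||
        !isRetriable(err.kind) ||
        attempt >= maxRetries
      ) {
        throw err; // not retriable, or retries exhausted
      }
      // Prefer the server-provided delay when one was classified.
      await sleep(err.retryAfterMs ?? backoffDelayMs(attempt));
    }
  }
}
```

With maxRetries=2 the function is called at most three times; unrecoverable, auth_invalid, and quota_exhausted errors surface on the first failure.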

#2233 Part A — parser fence handling:
- src/sdk/prompts.ts: removed 4 fence markers from XML example blocks.
  Model now sees plain XML, eliminating the failure-mode that drained
  quota via repeated retries.
- src/sdk/parser.ts: stripCodeFences() at top, called before
  parseAgentXml. Fence-tolerant regardless of model behavior.
- TODO comment references #2233 Part B (tool-use migration as separate
  scope).
- 4 fence-tolerance tests added to tests/sdk/parser.test.ts.
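
One way to make a parser fence-tolerant is to unwrap the payload only when the whole response is a single fenced block. The sketch below is an assumption about how such a helper could be written, not the code in src/sdk/parser.ts; the fence string is built with `repeat` purely to keep the example readable.

```typescript
// Three backticks, built programmatically.
const FENCE = "`".repeat(3);

// Matches only when the entire input is one fenced block, with an optional
// language tag after the opening fence and optional surrounding whitespace.
const WRAPPED = new RegExp(
  `^\\s*${FENCE}[\\w-]*[ \\t]*\\r?\\n([\\s\\S]*?)\\r?\\n?${FENCE}\\s*$`,
);

function stripCodeFences(text: string): string {
  const match = WRAPPED.exec(text);
  // Leave the input untouched unless it is fully wrapped in a fence.
  return match ? match[1] : text;
}
```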

#2215 OAuth token keychain:
- New src/shared/oauth-token.ts (~360 LOC): readClaudeOAuthToken()
  reads from platform-native credential stores at worker spawn-time.
  - macOS: security find-generic-password -s "Claude Code-credentials"
  - Windows: PowerShell wrapper around CredRead (Win32 Advapi32.dll)
  - Linux: secret-tool lookup
  - Fallback: env CLAUDE_CODE_OAUTH_TOKEN with JWT exp claim or sidecar
    expiresAt validation; refuses stale-token injection.
- EnvManager.buildIsolatedEnvWithFreshOAuth() (async) replaces silent
  process.env copy. Empty injection on absent; marker write on expired.
- <DATA_DIR>/oauth-stale.marker surfaces "re-login via Claude Desktop"
  via existing SessionStart additionalContext mechanism (context.ts).
- ClaudeProvider.startSession + KnowledgeAgent.prime/executeQuery now
  await the async env builder.
- 17 oauth-token tests covering decodeJwtExpMs, marker round-trip,
  env-fallback expiry detection.
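
The env-fallback expiry check relies on the JWT `exp` claim, which is expressed in seconds since the epoch. A minimal sketch of decoding it follows; the function names and signatures here are assumptions and may not match the project's decodeJwtExpMs. No signature verification is performed, since the goal is only to detect a stale token.

```typescript
// Returns the token's expiry in epoch milliseconds, or null if it cannot be read.
function readJwtExpMs(token: string): number | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  try {
    // JWT segments are base64url; convert to plain base64 and restore padding.
    const b64 = parts[1].replace(/-/g, "+").replace(/_/g, "/");
    const padded = b64 + "=".repeat((4 - (b64.length % 4)) % 4);
    const payload: unknown = JSON.parse(atob(padded));
    if (typeof payload === "object" && payload !== null && "exp" in payload) {
      const exp = (payload as { exp: unknown }).exp;
      return typeof exp === "number" ? exp * 1000 : null;
    }
    return null;
  } catch (err) {
    return null; // malformed base64 or JSON
  }
}

// A token with no readable expiry is treated as not provably fresh.
function isTokenStale(token: string, nowMs: number = Date.now()): boolean {
  const expMs = readJwtExpMs(token);
  return expMs === null || expMs <= nowMs;
}
```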

Verification:
- npx tsc --noEmit: only pre-existing bun-types error
- bun test (foundations + new): 70 pass, 0 new fails (8 fails are
  pre-existing parser.test.ts cases unrelated to fence work)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: #2234 quota-aware wall-clock guard

New src/services/worker/RateLimitStore.ts (207 LOC) — vendor pattern from
meridian/rateLimitStore.ts (MIT, copied not depended).

API:
- class RateLimitStore: set/get/getAll/getMostRecentByWindow/size/clear,
  in-memory last-write-wins keyed by rateLimitType.
- globalRateLimitStore singleton.
- shouldAbortForQuota(authMethod, store, now?) → {abort, reason?, window?}
- isApiKeyAuth(authMethod): matches both verbose getAuthMethodDescription
  strings and concise "api_key".

Thresholds (auth-type gated):
- api_key: never aborts (user authorized per-call spend).
- cli/oauth/subscription:
  - five_hour utilization >= 0.95 OR resetsAt within 15min (with 0.85
    utilization floor to avoid false trip on freshly-reset windows)
  - seven_day_opus >= 0.93
  - seven_day_sonnet >= 0.92
  - seven_day >= 0.93
  - overage >= 0.95
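
The threshold table above could be expressed along these lines. This is a sketch under stated assumptions: `resetsAt` is taken to be epoch milliseconds, the reset-grace rule is read as "reset is within 15 minutes and utilization is at least 0.85", and the store is simplified to a plain object rather than the RateLimitStore class.

```typescript
type QuotaWindow =
  | "five_hour"
  | "seven_day"
  | "seven_day_opus"
  | "seven_day_sonnet"
  | "overage";

interface RateLimitInfo {
  utilization: number; // 0..1
  resetsAt?: number; // epoch ms (assumed unit)
}

const THRESHOLDS: Record<QuotaWindow, number> = {
  five_hour: 0.95,
  seven_day_opus: 0.93,
  seven_day_sonnet: 0.92,
  seven_day: 0.93,
  overage: 0.95,
};
const RESET_GRACE_MS = 15 * 60 * 1000;
const RESET_GRACE_FLOOR = 0.85;

function shouldAbortForQuota(
  authMethod: string,
  limits: Partial<Record<QuotaWindow, RateLimitInfo>>,
  now: number = Date.now(),
): { abort: boolean; window?: QuotaWindow } {
  // API-key users have authorized per-call spend, so never abort for them.
  if (authMethod === "api_key") return { abort: false };

  for (const window of Object.keys(THRESHOLDS) as QuotaWindow[]) {
    const info = limits[window];
    if (!info) continue;
    if (info.utilization >= THRESHOLDS[window]) return { abort: true, window };
    if (window === "five_hour" && info.resetsAt !== undefined) {
      const msLeft = info.resetsAt - now;
      const nearReset = msLeft >= 0 && msLeft <= RESET_GRACE_MS;
      // The floor keeps a freshly reset, low-utilization window from tripping.
      if (nearReset && info.utilization >= RESET_GRACE_FLOOR) {
        return { abort: true, window };
      }
    }
  }
  return { abort: false };
}
```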

ClaudeProvider integration (line 198, for-await loop):
- Detects message.type === 'system' && subtype === 'rate_limit'
- Records rate_limit_info via globalRateLimitStore.set
- Calls shouldAbortForQuota(authMethod, globalRateLimitStore)
- On abort: session.abortReason = 'quota:<window>', abortController.abort,
  break out of loop. Worker continues other sessions.

Health endpoint (Server.ts:174):
- New rateLimits field on /api/health from getMostRecentByWindow().
- Field shape: {five_hour?, seven_day?, seven_day_opus?, seven_day_sonnet?,
  overage?} each carrying utilization, status, resetsAt, observedAt.

Tests (tests/worker/rate-limit-store.test.ts):
- 22 cases covering store CRUD, isApiKeyAuth, abort decision matrix.
- api_key never aborts at any utilization.
- cli aborts at threshold breaches per window.
- Reset-grace buffer with utilization floor.

Verification:
- npx tsc --noEmit: only pre-existing bun error
- bun test tests/worker/rate-limit-store.test.ts: 22/22 pass
- bun test tests/claude-provider-resume.test.ts: 9/9 pass
- bun test tests/server/: 44/44 pass

Plugin artifacts regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* build: regenerate worker-service.cjs after final build-and-sync

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: align test assertions with F4 classification + timeout

Two test fixes for branch-introduced regressions vs main:

1. tests/gemini_provider.test.ts "should throw on other errors":
   F4's classifyGeminiError replaced upstream Error message with
   ClassifiedProviderError. Test was pinned to pre-F4 string.
   Updated assertion to match new "Gemini bad request (status 400)".

2. tests/infrastructure/graceful-shutdown.test.ts:
   Test pokes real ~/.claude-mem/supervisor.json registry which on a
   developer machine contains live worker + chroma-mcp PIDs. SIGTERM →
   wait → SIGKILL cascade takes ~6s end-to-end. Bumped per-test timeout
   to 15000ms. Underlying shutdown code unchanged. Future cleanup
   should mock getSupervisor() here.

Result: branch failure count == main (77 pre-existing failures).
No new regressions from this branch's work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: address 4 Greptile P1/P2 findings on PR #2282

P1 (real bug): clearStaleMarker silently broken in ESM
- src/shared/oauth-token.ts:14: add unlinkSync to top-level fs import
- src/shared/oauth-token.ts:342: drop inline require('fs'), call
  unlinkSync directly. ESM has no require, so the previous code threw
  ReferenceError swallowed by try/catch — making clearStaleMarker a
  permanent no-op. Stale oauth marker would persist indefinitely after
  Claude Desktop refreshed the token.

P2 (security): execSync shell-string interpolation
- src/shared/find-claude-executable.ts:39: execSync(`"${candidate}"
  --version`) → execFileSync(candidate, ['--version']). A path containing
  ", ; or & (reachable on Windows via a crafted CLAUDE_CODE_PATH in
  settings.json) would otherwise produce a malformed or exploitable
  shell command.
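
  The difference is that execFileSync passes arguments as an array and does not go through a shell, so metacharacters in the path are never interpreted. A minimal sketch, where `getCliVersion` is a hypothetical helper name and the 3 s timeout mirrors the validation timeout mentioned earlier in this PR:

```typescript
import { execFileSync } from "node:child_process";

// Hypothetical helper: returns the trimmed `--version` output, or null on any failure.
function getCliVersion(candidate: string): string | null {
  try {
    // No shell is involved: `candidate` is used as the executable path verbatim.
    return execFileSync(candidate, ["--version"], {
      encoding: "utf8",
      timeout: 3000,
      stdio: ["ignore", "pipe", "ignore"],
    }).trim();
  } catch (err) {
    return null; // missing binary, non-zero exit, or timeout
  }
}
```

  A candidate such as `claude"; echo pwned` simply fails to resolve as a file instead of running a second command.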

P2 (security): PowerShell username injection
- src/shared/oauth-token.ts:119: userInfo().username escaped with PS
  single-quote convention (' → '') before interpolation into
  `'Claude Code-credentials:${user}'`. Defensive against future Windows
  versions or domain-joined machines that may permit ' in usernames.
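
  In PowerShell single-quoted strings, the only character that needs escaping is the single quote itself, which is doubled. A small sketch of that convention, with `psSingleQuote` as a hypothetical helper name:

```typescript
// Wraps a value in a PowerShell single-quoted string literal, doubling any ' inside it.
function psSingleQuote(value: string): string {
  return `'${value.replace(/'/g, "''")}'`;
}

// Example of building the credential target name mentioned above.
function credentialTarget(user: string): string {
  return psSingleQuote(`Claude Code-credentials:${user}`);
}
```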

P2 (style): Unreachable throw lastError post-loop
- src/services/worker/retry.ts:109: explained as the safety net for
  opts.maxRetries < 0 (pathological input where the loop never executes
  and lastError is undefined). Annotated with comment + descriptive
  fallback Error so the dead-looking code is now self-documenting.

Verification:
- npx tsc --noEmit: clean (only pre-existing bun-types error)
- bun test tests/shared/oauth-token.test.ts tests/worker/provider-classifiers.test.ts
  tests/worker/provider-errors.test.ts: 50 pass / 0 fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: tighten SECURITY.md data-flow and audit dates

Fixes CodeRabbit comments #3178957249 (Data Storage section overstated
"no external transmission" — softened to call out Claude Agent SDK,
alternate provider, Chroma MCP, OAuth keychain, and registry fetches)
and #3178957250 (Next Scheduled Audit was earlier than Last Updated;
bumped Last Updated to 2026-05-03 and audit to 2026-09-16) on PR #2282.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: drop inline require('fs') in paths.ts

Fixes CodeRabbit outside-diff comment on src/shared/paths.ts:25-29 from
PR #2282 review. resolveDataDir() ran require('fs') inside an ESM module
(this file uses import.meta.url and .js imports), which can break in
strict ESM environments. readFileSync now imports at the top alongside
existsSync/mkdirSync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: block CLAUDE_CODE_OAUTH_TOKEN from parent env (issue #2215)

Fixes CodeRabbit outside-diff comment on src/shared/EnvManager.ts:14-17
from PR #2282 review. The OAuth-token leak fix was bypassed because
buildIsolatedEnv() copied every parent env var that wasn't in
BLOCKED_ENV_VARS, but CLAUDE_CODE_OAUTH_TOKEN was not blocked. A stale
parent token therefore still reached isolatedEnv even when the fresh
keychain read returned expired/absent — defeating the fix documented
inline at lines 178-183.

Adds CLAUDE_CODE_OAUTH_TOKEN to BLOCKED_ENV_VARS and defensively deletes
it again at the top of buildIsolatedEnvWithFreshOAuth() so the
fresh-spawn-time read is the only path that can populate it.
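
Roughly, the blocking behaves like the sketch below. The function name
buildIsolatedEnvSketch and its freshToken parameter are assumptions made
for illustration; the real BLOCKED_ENV_VARS list presumably contains more
entries than the one shown.

```typescript
// Assumption: the real list is longer; only the entry named above is shown.
const BLOCKED_ENV_VARS = new Set<string>(["CLAUDE_CODE_OAUTH_TOKEN"]);

// Hypothetical stand-in for the two-step flow: copy the parent env minus
// blocked keys, then let only the fresh keychain read populate the token.
function buildIsolatedEnvSketch(
  parent: Record<string, string | undefined>,
  freshToken?: string,
): Record<string, string> {
  const env: Record<string, string> = {};
  for (const [key, value] of Object.entries(parent)) {
    if (value === undefined || BLOCKED_ENV_VARS.has(key)) continue;
    env[key] = value;
  }
  // Defensive delete, mirroring the second safeguard described above.
  delete env["CLAUDE_CODE_OAUTH_TOKEN"];
  if (freshToken) env["CLAUDE_CODE_OAUTH_TOKEN"] = freshToken;
  return env;
}
```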

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: validate cursor sessionId against path traversal

Fixes CodeRabbit comment #3178957252 on PR #2282. The Cursor adapter
took sessionId straight from stdin and concatenated it into a
join(homedir(), '.cursor', 'projects', ..., sessionId, ...) path. A
crafted value containing path separators or '..' segments could escape
~/.cursor/projects, and the later transcript read would then probe
arbitrary local files.

deriveCursorTranscriptPath() now rejects any sessionId that doesn't
match /^[A-Za-z0-9_-]+$/ — Cursor's real session ids are UUID-style
identifiers, so the safe whitelist is non-disruptive.
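
For illustration, the allowlist check amounts to something like the
following. The regex is the one quoted above; the helper name
isSafeSessionId is hypothetical.

```typescript
// Regex quoted in this commit message: letters, digits, underscore, hyphen.
const SAFE_SESSION_ID = /^[A-Za-z0-9_-]+$/;

// Hypothetical helper: true only for non-empty strings with no path
// separators, dots, or other characters outside the allowlist.
function isSafeSessionId(sessionId: unknown): sessionId is string {
  return typeof sessionId === "string" && SAFE_SESSION_ID.test(sessionId);
}
```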

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: scope stripCodeFences() to full-wrapper payloads only

Fixes CodeRabbit comment #3178957253 on PR #2282. The previous regex
greedily removed the first opening and last closing triple-backticks
anywhere in the input, which could mangle valid content with internal
fenced examples or surrounding prose — and ran before XML parsing so
it created false negatives.

stripCodeFences() now only strips when the entire payload is a single
fenced block (start-to-end, with optional language tag and surrounding
whitespace), capturing the inner content. Adds a regression test that
feeds prose with internal triple-backtick markers around a real
<observation> block and asserts the inner ``` are preserved.
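
One way to express the full-wrapper rule is sketched below. The pattern
is an assumption, not the regex actually committed, and it does not try
to tell one wrapper with nested fences apart from two adjacent fenced
blocks. The fence marker is built with repeat() only to keep literal
triple backticks out of the example.

```typescript
const FENCE = "`".repeat(3);

// Anchored at both ends: optional whitespace, opening fence with optional
// language tag, captured body, closing fence, optional whitespace.
const FULL_WRAPPER = new RegExp(
  "^\\s*" + FENCE + "[\\w-]*[ \\t]*\\r?\\n([\\s\\S]*?)\\r?\\n?" + FENCE + "\\s*$",
);

// Hypothetical stand-in for stripCodeFences(): unwrap only when the whole
// payload is one fenced block, otherwise return the input untouched.
function stripCodeFencesSketch(payload: string): string {
  const match = FULL_WRAPPER.exec(payload);
  return match?.[1] ?? payload;
}
```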

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: honor abortSignal during retry backoff sleep

Fixes CodeRabbit comment #3178957263 on PR #2282. The retry helper used
an unconditional `setTimeout` Promise for backoff between attempts, so
an external abort that fired during the wait was delayed until the
timer completed.

The backoff now races setTimeout against opts.abortSignal: if the signal
flips, the timer is cleared and the Promise rejects with 'Aborted'
immediately. The abort listener is registered with { once: true } and
removed when the timer fires to avoid leaks.
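
The race can be sketched as follows. The function name abortableSleep is
hypothetical, and the up-front check for an already-aborted signal is an
assumption that this message does not mention.

```typescript
// Hypothetical backoff sleep that settles early when the signal aborts.
function abortableSleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    // Assumption: an already-aborted signal rejects without starting a timer.
    if (signal?.aborted) {
      reject(new Error("Aborted"));
      return;
    }
    const onAbort = (): void => {
      clearTimeout(timer); // stop the pending backoff
      reject(new Error("Aborted"));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort); // avoid listener leaks
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}
```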

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: abort immediately on provider-side rejected status

Fixes CodeRabbit comment #3178957261 on PR #2282. shouldAbortForQuota()
only checked utilization thresholds and reset-grace heuristics; a
snapshot with status='rejected' (or overageStatus='rejected' on the
overage window) but no utilization number could still return
{ abort: false }, letting the worker keep consuming after the provider
had already declared the bucket exhausted.

Provider-side rejection is now checked before utilization. When either
rejection signal is present the guard returns abort=true with reason
"quota:<window> rejected by provider".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: only bump Chroma watermark on confirmed batch writes

Fixes CodeRabbit comments #3178957259 (watermark advances on swallowed
batch failures) and #3178957260 (backfillInProgress can stick true if
init throws) on PR #2282.

addDocuments() previously logged and swallowed per-batch failures with a
void return type, so all three backfill loops (observations, summaries,
prompts) bumped the watermark unconditionally after the call —
turning a transient Chroma failure into permanently-skipped records.
addDocuments() now returns the count of documents that actually landed
(including delete+add reconcile retries), and each loop only advances
the watermark when the batch wrote successfully. Failed batches log a
debug message and continue so the loop still gets through the rest.

backfillAllProjects() now constructs SessionStore and ChromaSync inside
a try block so a constructor throw can't leave the static
backfillInProgress guard stuck true and silently skip every later
backfill. The finally always clears the guard and best-effort closes
each resource.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: fall back to pid kill when process group is gone

Fixes CodeRabbit outside-diff comment on src/supervisor/shutdown.ts:118-134
from PR #2282 review. signalProcess() returned silently when a pgid was
present and process.kill(-pgid, signal) threw ESRCH, never attempting
the per-pid signal. With the new chroma registration path that records a
pgid alongside the pid, an already-collapsed group could turn shutdown
into a no-op even though the root pid was still alive.

The POSIX branch now tries -pgid first when present, and on ESRCH falls
through to process.kill(pid, signal). Non-ESRCH errors still propagate.
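
A sketch of the fallback order, with the kill function injected so the
logic can be exercised without real processes. The names
signalProcessSketch, KillFn, and isEsrch are hypothetical. How the real
code treats ESRCH on the final per-pid call is not stated here, so the
sketch simply lets it propagate.

```typescript
type KillFn = (pid: number, signal: string) => void;

function isEsrch(err: unknown): boolean {
  return typeof err === "object" && err !== null &&
    (err as { code?: unknown }).code === "ESRCH";
}

// Try the process group first; if the group is already gone (ESRCH),
// fall through to signalling the root pid directly.
function signalProcessSketch(
  kill: KillFn,
  pid: number,
  pgid: number | undefined,
  signal: string,
): void {
  if (pgid !== undefined) {
    try {
      kill(-pgid, signal); // a negative pid targets the whole group
      return;
    } catch (err) {
      if (!isEsrch(err)) throw err; // non-ESRCH errors still propagate
    }
  }
  kill(pid, signal);
}
```

In Node the injected function would typically be
process.kill.bind(process).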

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: settings path, uptime clamp, fetch timeouts

Fixes three smaller CodeRabbit issues on PR #2282:

- SettingsRoutes (outside-diff #2282 review on lines 65-79): the
  parse-error response told users to delete ~/.claude-mem/settings.json
  even when paths.settings() resolved elsewhere. Now uses the resolved
  settingsPath variable in the message.

- uptime.ts (#3178957264 / lines 2-3): getUptimeSeconds() could return
  a negative value if startedAtMs was in the future or the system clock
  moved backward. Clamps with Math.max(0, ...) so health endpoints
  never see negative seconds.

- check-pending-queue.ts (#3178957248 / lines 27-45): checkWorkerHealth,
  getProcessingStatus and triggerProcessing all called fetch with no
  timeout, so the script could block forever if the worker accepted the
  TCP connection but never responded. Wraps each fetch with an
  AbortController + 10s timeout that throws a clear timeout message.
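
The fetch wrapper in the last item can be sketched generically. The
helper name withTimeout and its label parameter are assumptions; the real
script may inline this per call.

```typescript
// Hypothetical helper: run an abortable operation and turn an abort caused
// by the deadline into a clear timeout error.
async function withTimeout<T>(
  label: string,
  ms: number,
  run: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await run(controller.signal);
  } catch (err) {
    if (controller.signal.aborted) {
      throw new Error(`${label} timed out after ${ms}ms`);
    }
    throw err;
  } finally {
    clearTimeout(timer); // never leave the timer pending
  }
}
```

A health check would then look like
withTimeout("health check", 10000, (signal) => fetch(url, { signal })).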

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: walk descendants recursively when killing chroma-mcp tree

Fixes CodeRabbit comment #3178957258 on PR #2282. The POSIX teardown in
ChromaMcpManager.killProcessTree() relied on `pkill -P <pid>`, which
only signals direct children. Under uv, chroma-mcp spawns python as a
grandchild — when uv exits and python re-parents to init, pkill -P
never reaches it and the descendant survives the "tree kill".

killProcessTree() now collects the full descendant set via a recursive
`pgrep -P` walk before each signal phase. The walk returns leaves first
so signals propagate bottom-up (SIGTERM children before their parents,
then again for SIGKILL after the 500ms grace window so any layer that
re-parented during teardown still gets cleaned up). pgrep failures
(no children, missing binary) return [] so this stays best-effort and
falls back to the existing per-pid signal.
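
The leaves-first walk is essentially a post-order traversal. In the
sketch below, childrenOf is a hypothetical lookup that in practice would
wrap `pgrep -P <pid>` and return [] on any failure; collectDescendants is
likewise an illustrative name.

```typescript
// Post-order walk: every descendant appears before its own parent, so
// signalling in array order goes bottom-up. The root itself is excluded.
function collectDescendants(
  rootPid: number,
  childrenOf: (pid: number) => number[],
): number[] {
  const ordered: number[] = [];
  const seen = new Set<number>([rootPid]); // guards against pid cycles
  const visit = (pid: number): void => {
    for (const child of childrenOf(pid)) {
      if (seen.has(child)) continue;
      seen.add(child);
      visit(child);        // descend first
      ordered.push(child); // then record the child after its subtree
    }
  };
  visit(rootPid);
  return ordered;
}
```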

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: tolerate malformed JSONL lines in transcript-parser

Fixes Greptile P1 comment 3178964456 on PR #2282.

extractLastMessageFromJsonl previously called JSON.parse(rawLine) with no
guard. A truncated or malformed JSONL line (common when the writer crashed
mid-write or the transcript was only partially flushed) would throw a
SyntaxError, crash the summarization pipeline for that session, and
silently lose all prior valid messages.

Fix: wrap JSON.parse in try/catch and skip bad lines. The empty-line
guard only catches truly empty strings, not malformed fragments.

Regression tests added for two cases:
- Mixed valid + truncated lines: returns last valid match.
- All lines malformed: returns empty string (no throw).
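
The skip-on-parse-failure loop looks roughly like this. The function name
parseJsonlTolerant is hypothetical; the real function goes on to pick the
last matching message rather than returning every parsed entry.

```typescript
// Parse each JSONL line independently; a bad line is skipped rather than
// aborting the whole read.
function parseJsonlTolerant(jsonl: string): unknown[] {
  const entries: unknown[] = [];
  for (const rawLine of jsonl.split("\n")) {
    const line = rawLine.trim();
    if (line === "") continue; // empty-line guard
    try {
      entries.push(JSON.parse(line));
    } catch {
      continue; // truncated or malformed fragment
    }
  }
  return entries;
}
```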

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: classify FK constraint failures BEFORE provider classifier

Fixes Greptile P1 comment 3178979583 on PR #2282.

The F4 #2244 work introduced a regression: reclassifyAtDispatch always
returns a non-null ClassifiedProviderError for known agent types
(Claude/Gemini/OpenRouter), so the isFkConstraintFailure branch was dead
code. Per-provider classifiers don't recognize "FOREIGN KEY constraint
failed", so SQLite FK failures fell through to the default 'transient'
kind and would retry indefinitely — restart loop on corrupted session
DB state.

The old unrecoverablePatterns list explicitly treated FK constraint
failures as unrecoverable. This change restores that behavior by checking
for an FK failure first and deferring to the classifier only when the
error is not an FK error.
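
In sketch form, the ordering is as follows. The ErrorKind union and the
signature of classifyAtDispatchSketch are assumptions; only the
FK-before-classifier order and the matched message text come from this
commit.

```typescript
// Assumption: a simplified set of kinds for illustration.
type ErrorKind = "unrecoverable" | "transient";

function isFkConstraintFailure(message: string): boolean {
  return message.includes("FOREIGN KEY constraint failed");
}

// FK failures are decided before the provider classifier runs, because
// the classifier always returns a result and would otherwise mask them.
function classifyAtDispatchSketch(
  message: string,
  providerClassifier: (message: string) => ErrorKind,
): ErrorKind {
  if (isFkConstraintFailure(message)) return "unrecoverable";
  return providerClassifier(message);
}
```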

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: validate CLAUDE_MEM_WORKER_PORT in check-pending-queue

Parse the env var, range-check (1-65535), and fall back to 37777 with a
console.warn on invalid input instead of letting a malformed value flow
into the URL builder unchecked (CodeRabbit Minor on PR #2282).
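
A sketch of the parse-and-range-check, assuming an unset variable falls
back silently and only invalid values warn. The helper name
resolveWorkerPort and the injectable warn parameter are hypothetical.

```typescript
const DEFAULT_WORKER_PORT = 37777;

// Accept only plain decimal integers in 1-65535; anything else falls back
// to the default port with a warning.
function resolveWorkerPort(
  raw: string | undefined,
  warn: (message: string) => void = console.warn,
): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_WORKER_PORT;
  const trimmed = raw.trim();
  const port = /^\d+$/.test(trimmed) ? Number(trimmed) : NaN;
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    warn(`Invalid CLAUDE_MEM_WORKER_PORT "${raw}", using ${DEFAULT_WORKER_PORT}`);
    return DEFAULT_WORKER_PORT;
  }
  return port;
}
```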

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: SIGKILL union of pre-TERM and post-wait descendant sets

When the chroma-mcp root exits during the SIGTERM grace window, its
descendants get re-parented to init and drop out of the post-wait
pgrep -P scan. Without including the pre-TERM snapshot, those
re-parented PIDs would never receive SIGKILL even though they were
definitely children before SIGTERM and may still be alive (CodeRabbit
Major on PR #2282).

Compute Array.from(new Set([...descendantsBeforeTerm, ...descendantsBeforeKill]))
and SIGKILL the union. The two sets typically overlap, so dedupe is
required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: enforce addDocuments return-count in direct sync paths

syncObservation/syncSummary/syncUserPrompt now capture the written count
from addDocuments() and only bump the watermark when every requested
document landed in Chroma. addDocuments() tolerates per-batch failures
(returns the actual written count), so the previous unconditional bump
was silently marking unsynced rows as synced on transient errors —
preventing the next backfill from retrying them (CodeRabbit Major on PR
#2282).

A partial write now logs a warn with the (requested, written) pair and
preserves retryability on the next pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: guard backfill watermark against non-contiguous failures

The backfill watermark is a single monotonic id, so it cannot represent
sparse success: "synced through 200, gap at 201–250, then 251 onward"
would, on restart, skip 201–250 forever because the watermark sat at
either 200 or 251 — both lose data (CodeRabbit Major on PR #2282).

Add a per-loop hadGap flag to backfillObservations / backfillSummaries /
backfillPrompts. Once any batch under-writes, every subsequent batch
must also skip the bump, regardless of whether it itself succeeded.
Also tighten the failure check from `writtenInBatch <= 0` to
`writtenInBatch < batch.length` so partial-batch writes are caught.

The watermark stays at the last contiguously-synced position; the next
backfill pass retries from there, eventually closing the gap.
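
The hadGap rule can be modeled without Chroma at all. In this sketch the
Batch shape and the advanceWatermark function are hypothetical; written[i]
stands in for the count addDocuments() returned for batch i.

```typescript
interface Batch {
  lastId: number; // highest row id contained in the batch
  size: number;   // number of documents requested
}

// Advance the watermark only while every batch so far has fully landed.
function advanceWatermark(start: number, batches: Batch[], written: number[]): number {
  let watermark = start;
  let hadGap = false;
  batches.forEach((batch, i) => {
    const writtenInBatch = written[i] ?? 0;
    if (writtenInBatch < batch.size) hadGap = true; // partial writes count as gaps
    if (!hadGap) watermark = batch.lastId;          // once set, no later batch bumps
  });
  return watermark;
}
```

A failure in the middle batch therefore pins the watermark at the end of
the first batch even when later batches succeed.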

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: clear oauth-stale marker when token is absent

When an OAuth token disappears entirely (user logs out, keychain
cleared), buildIsolatedEnvWithFreshOAuth's absent branch was leaving any
prior stale-marker file in place. The session-start hook would then keep
surfacing an "expired token, re-login" warning even though the token is
no longer expired — it's gone, and re-login was already done elsewhere
or not applicable (CodeRabbit Minor on PR #2282).

Call clearStaleMarker() in the absent branch the same way the present
branch already does. Add a regression test exercising the full
buildIsolatedEnvWithFreshOAuth path: pre-write a marker, force absent
via spoofed unsupported platform, assert the marker is gone after.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: skip unknown message.content shapes instead of throwing

extractLastMessageFromJsonl already tolerates malformed JSONL lines
(JSON.parse failure -> continue), but a valid JSON line whose
message.content is an unexpected type (null, number, plain object) was
still throwing — contradicting the new tolerance and crashing the entire
summary pipeline on a single weird line (CodeRabbit Major + Greptile P1
on PR #2282).

Replace the `throw new Error(...)` with `continue` so a single bad
content shape skips that line instead of failing the whole transcript
read. Forward compat: future content schemas land harmlessly.

Add regression tests covering null, number, and plain-object content;
each must not throw and must fall back to the most recent valid line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: guard null/primitive entries in message.content array

Fixes CodeRabbit comment 3179004190 on PR #2282.

The Array.isArray branch previously did `c.type === 'text'` directly,
which throws if `c` is null or a primitive — possible in malformed logs.
Tightened the filter with a type guard: requires c to be a non-null
object with type === 'text' and a string text field. Same defensive
class as the malformed-line and unknown-content-shape tolerances.
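
The tightened filter is roughly the type guard below. The TextBlock
interface, the extractText wrapper, and the newline join are assumptions
for illustration; the guard conditions follow the description above.

```typescript
interface TextBlock {
  type: "text";
  text: string;
}

// Guard: a non-null object with type === 'text' and a string text field.
function isTextBlock(c: unknown): c is TextBlock {
  return typeof c === "object" && c !== null &&
    (c as { type?: unknown }).type === "text" &&
    typeof (c as { text?: unknown }).text === "string";
}

// Hypothetical wrapper: returns null for unknown content shapes so the
// caller can skip the line instead of throwing.
function extractText(content: unknown): string | null {
  if (typeof content === "string") return content;
  if (Array.isArray(content)) {
    return content.filter(isTextBlock).map((block) => block.text).join("\n");
  }
  return null;
}
```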

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 22:27:07 -07:00
Alex Newman 7fb0745ddb docs: update CHANGELOG.md for v12.5.1 2026-05-03 15:07:35 -07:00
Alex Newman 9f9fbd76fb chore: bump version to 12.5.1 2026-05-03 15:05:47 -07:00
Alex Newman 99ff296f6f fix: skip tree-sitter native build via trustedDependencies allowlist (#2278)
bun install fails on Node 25+ because the upstream tree-sitter@0.25.0
package's binding.gyp specifies C++17, but Node 25's V8 headers require
C++20 and #error on older standards. The package ships no prebuilds for
this platform/Node combo, so node-gyp-build falls back to source compile
and dies with hundreds of errors.

claude-mem doesn't use the tree-sitter runtime — it only shells out to
the prebuilt tree-sitter-cli Rust binary (see Hu/CS in the bundled
mcp-server). Add trustedDependencies: ["tree-sitter-cli"] so bun runs
the CLI's install.js (downloads the Rust binary) but blocks every other
postinstall, including the failing native compile and the unused .node
bindings of all 24+ grammar packages.

Verified end-to-end on darwin-arm64 / Node 25.9.0: 37 packages install
in ~30s, 28 postinstalls correctly blocked, CLI binary works,
grammars still JIT-compile via tree-sitter query -p <grammar-dir>.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:58:54 -07:00
Alex Newman 794745b3a3 docs: update CHANGELOG.md for v12.5.0 2026-05-02 16:09:31 -07:00