Commit Graph

550 Commits

Author SHA1 Message Date
Alex Newman 4de417663c fix: catch corrupt JSON in Gemini CLI status command
readGeminiSettings() throws on corrupt JSON since ae6915b, but
checkGeminiCliHooksStatus() called it without catching — violating
its "returns 0 always" contract.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 14:29:08 -07:00
Alex Newman 190c74492f fix: address second PR review — clean replace, IDE failure bubbling, bun validation
- cpSync now does rmSync before copy to avoid stale file merges
- setupIDEs() returns failed IDE list; install reports partial success
- runSmartInstall() returns boolean status instead of void
- Worker port in next-steps URL reads CLAUDE_MEM_WORKER_PORT env var
- Goose YAML regex stops at column-0 keys (prevents eating sibling sections)
- AGENTS.md uninstall removes header-only stub files
- findBunPath() validated before use in WindsurfHooksInstaller
- Cursor marked unsupported in ide-detection until installer is wired

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 14:17:18 -07:00
Alex Newman ae6915b88e fix: address PR review — shebang, double-escaping, data loss, uninstall scope
- Add shebang banner to NPX CLI esbuild config so npx claude-mem works
- Remove manual backslash pre-escaping in WindsurfHooksInstaller (JSON.stringify handles it)
- Scope cache deletion to claude-mem only, not entire vendor namespace
- Use getWorkerPort() in OpenCodeInstaller instead of hard-coded 37777
- Throw on corrupt JSON in readJsonSafe/readGeminiSettings/Windsurf to prevent data loss
- Fix Cursor install stub to warn instead of silently succeeding
- Fix Gemini uninstall to remove individual hooks within groups, not whole groups
- Update tests for new corrupt-file-throws behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 13:49:14 -07:00
Alex Newman 2495f98496 refactor: consolidate MCP factory, add non-TTY support, auto-detect transcript watchers
- Phase 1: Replace 5 duplicate MCP installers with config-driven factory, extract
  shared context-injection and json-utils utilities, fix process.execPath usage
- Phase 2: Add non-TTY fallback for @clack/prompts to prevent ENOENT in CI/Docker
- Phase 3: Wire GeminiCliHooksInstaller through hook command framework with adapter
- Phase 4: Auto-start transcript watchers on worker boot when config exists

Net -107 lines via DRY consolidation of duplicated installer logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 00:35:55 -07:00
Alex Newman a2ac116aac fix: move summary wait + session-complete into Stop hook to prevent lost summaries
SessionEnd has a 1.5s hardcoded cap from Claude Code (CLAUDE_CODE_SESSIONEND_HOOKS_TIMEOUT_MS),
making it unsuitable for waiting on async work. Previously, the Stop hook would fire-and-forget
the summarize request, then SessionEnd would immediately call deleteSession — aborting the SDK
agent mid-summary.

Now the Stop hook (120s timeout, no cap) owns the full lifecycle:
1. Queue summarize request
2. Poll new GET /api/sessions/status endpoint until queue drains
3. Call /api/sessions/complete after summary finishes

SessionEnd is now a true fire-and-forget fallback (process.exit(0) immediately).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 14:05:53 -07:00
Alex Newman 8265fc7aa1 Merge remote-tracking branch 'origin/thedotmack/npx-gemini-cli' into thedotmack/npx-gemini-cli
Resolve merge conflicts in adapter index, gemini-cli adapter, and rebuilt CJS artifacts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 13:47:49 -07:00
Alex Newman 76a880a3d6 feat: update install CLI, ESM compat, and Gemini CLI docs
Fixes CursorHooksInstaller ESM compatibility, updates install command
with improved path resolution, and refreshes built plugin artifacts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 12:38:45 -07:00
Alex Newman 80d1deedbe fix: address PR review feedback from CodeRabbit
- Add sessionId to summarize.ts warning log for easier triage
- Add APPROVED OVERRIDE annotation to Windows spawn catch block

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 15:34:42 -07:00
Alex Newman 07ab7000a8 fix: patch 7 critical bugs affecting all non-dev-machine users and Windows
1. Fix esbuild inlining build-machine __dirname as string literal — use
   CJS-compatible runtime banner with require("node:url").fileURLToPath
   across worker-service, mcp-server, and context-generator builds.

2. Fix isMainModule check missing .cjs extension and Windows backslash
   path normalization.

3. Wrap extractLastMessage in try-catch to prevent infinite Stop hook
   feedback loop on malformed transcripts (exit 0 instead of exit 2).

4. Replace heavy SessionEnd hook (Node→Bun→1.7MB CJS→HTTP) with
   lightweight inline node -e one-liner (~200ms vs >1s).

5. Add 7 Gemini/OpenRouter error patterns to unrecoverablePatterns
   circuit breaker to prevent 77K+ retry loops on expired API keys.

6. Preserve CLAUDE_CODE_OAUTH_TOKEN and CLAUDE_CODE_GIT_BASH_PATH in
   sanitizeEnv instead of stripping them with the CLAUDE_CODE_ prefix.

7. Use PowerShell -EncodedCommand for spawnDaemon to fix path quoting
   when Windows usernames contain spaces.

Closes #1515, #1495, #1475, #1465, #1500, #1513, #1512, #1450, #1460,
#1486, #1449, #1481, #1451, #1480, #1453, #1445

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 15:20:29 -07:00
Conductor 5621b67ccd Saving uncommitted changes before archiving 2026-03-26 19:35:27 -07:00
Alex Newman a656af2bff feat: improve Gemini CLI timeline display by stripping ANSI colors and providing markdown fallback 2026-03-25 23:51:56 -07:00
Alex Newman 88636ec012 feat: remove old installer, update docs to npx claude-mem
Removes installer/ directory (16 files) — fully replaced by src/npx-cli/.
Updates install.sh and installer.js to redirect to npx claude-mem.
Adds npx claude-mem as primary install method in docs and README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-23 23:02:18 -07:00
Alex Newman 031513d723 feat: add Codex CLI, OpenClaw, and MCP-based IDE integrations
Codex CLI: transcript-based integration watching ~/.codex/sessions/,
schema bumped to v0.3 with exec_command support, AGENTS.md context.

OpenClaw: installer wires pre-built plugin to ~/.openclaw/extensions/,
registers in openclaw.json with memory slot and sync config.

MCP integrations (6 IDEs): Copilot CLI, Antigravity, Goose, Crush,
Roo Code, and Warp — config writing + context injection. Goose uses
string-based YAML manipulation (no parser dependency).

All 13 IDE targets now supported in npx claude-mem install.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-23 23:02:18 -07:00
Alex Newman f2cc33b494 feat: add Gemini CLI, OpenCode, and Windsurf IDE integrations
Gemini CLI: platform adapter mapping 6 of 11 hooks, settings.json
deep-merge installer, GEMINI.md context injection.

OpenCode: plugin with tool.execute.after interceptor, bus events for
session lifecycle, claude_mem_search custom tool, AGENTS.md context.

Windsurf: platform adapter for tool_info envelope format, hooks.json
installer for 5 post-action hooks, .windsurf/rules context injection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-23 23:02:18 -07:00
Alex Newman 3a09c1bb1a feat: add NPX CLI and OpenClaw build pipeline, optimize npm package size
Adds esbuild steps for npx-cli (57KB, Node.js ESM) and openclaw plugin
(12KB). Creates .npmignore to exclude node_modules and Bun binary from
npm package, reducing pack size from 146MB to 2MB.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-23 23:02:18 -07:00
Alex Newman 85eb796b18 feat: add npx CLI entry point with install, runtime, and IDE detection commands
Replaces the old git-clone installer with a direct npm package copy workflow.
Supports 13 IDE auto-detection targets, runtime delegation to Bun worker,
and pure Node.js install path (no Bun required for installation).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-23 23:02:18 -07:00
Alex Newman 4d7bec4d05 fix: stop spinner from spinning forever (#1440)
* fix: stop spinner from spinning forever due to orphaned DB messages

The activity spinner never stopped because isAnySessionProcessing() queried
ALL pending/processing messages in the database, including orphaned messages
from dead sessions that no generator would ever process.

Root cause: isAnySessionProcessing() used hasAnyPendingWork() which is a
global DB scan. Changed it to use getTotalQueueDepth() which only checks
sessions in the active in-memory Map.

Additional fixes:
- Add terminateSession() to enforce restart-or-terminate invariant
- Fix 3 zombie paths in .finally() handler that left sessions alive
- Clean up idle sessions from memory on successful completion
- Remove redundant bare isProcessing:true broadcast
- Replace inline require() with proper accessor
- Add 8 regression tests for session termination invariant

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review findings — idle-timeout race, double broadcast, query amplification

- Move pendingCount check before idle-timeout termination to prevent
  abandoning fresh messages that arrive between idle abort and .finally()
- Move broadcastProcessingStatus() inside restart branch only — the else
  branch already broadcasts via removeSessionImmediate callback
- Compute queueDepth once in broadcastProcessingStatus() and derive
  isProcessing from it, eliminating redundant double iteration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 14:13:10 -07:00
Alex Newman 9f529a30f5 feat: strip <system_instruction> tags before DB storage (#1398)
* feat: strip <system_instruction> tags before database storage

Extends the existing tag-stripping mechanism (used for <private> and
<claude-mem-context>) to also filter Conductor-injected system instructions,
preventing them from being persisted in the observation database.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: also strip <system-instruction> (hyphen variant) before DB storage

Conductor uses both <system_instruction> and <system-instruction> tag
formats. This adds the hyphen variant to the same stripping mechanism.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 12:08:25 -07:00
Alex Newman 7e07210635 feat: add timeline-report skill with token economics, compress context output 53%
## Summary
- New timeline-report skill for generating narrative project history reports
- Compressed markdown context output ~53% (tables → flat compact lines, verbose labels → terse format)
- Added `full=true` param to /api/context/inject for fetching all observations
- Split TimelineRenderer into separate markdown/color rendering paths
- Removed arbitrary file write vulnerability (dump_to_file param)
- Fixed timestamp ditto marker leaking across session summary boundaries

## Review
- Rebased on main (v10.6.0) to preserve OpenClaw system prompt injection
- Reviewed by /review (gstack) + /octo:review (Codex, Gemini, Claude fleet)
- Security fix (dump_to_file removal) confirmed by all 3 reviewers
- Timestamp bug caught by Codex, fixed

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-03-18 13:57:20 -07:00
Glucksberg 9361e33b6d fix(openclaw): inject context via system prompt instead of overwriting MEMORY.md (#1386)
* fix(openclaw): inject context via system prompt instead of overwriting MEMORY.md

The OpenClaw plugin was overwriting each agent's MEMORY.md with a large
auto-generated observation dump (~12-15KB) on every before_agent_start
and tool_result_persist event. This conflicts with OpenClaw's design
where MEMORY.md is agent-curated long-term memory.

Migrate context injection from file-based (writeFile MEMORY.md) to
OpenClaw's native before_prompt_build hook, which returns context via
appendSystemContext. This keeps MEMORY.md under agent control while
still providing cross-session observation context to the LLM.

Changes:
- Add before_prompt_build hook that returns { appendSystemContext }
- Remove writeFile/MEMORY.md sync from before_agent_start
- Remove MEMORY.md sync from tool_result_persist (observations still recorded)
- Add 60s TTL cache to avoid re-fetching context on every LLM turn
- Add syncMemoryFileExclude config for per-agent opt-out
- Remove dead workspaceDirsBySessionKey tracking map
- Rewrite test suite to verify prompt injection instead of file writes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(ui): align settings defaults with backend and use nullish coalescing

The web UI had two issues causing settings inflation:

1. DEFAULT_SETTINGS in the UI used FULL_COUNT='5' and all token columns
   'true', while SettingsDefaultsManager (backend) uses FULL_COUNT='0'
   and token columns 'false'. Opening the settings modal and saving
   without changes would silently inflate the context.

2. useSettings used || for fallback, which treats '0' and 'false' as
   falsy — even when the backend correctly returns these values, the UI
   would replace them with inflated defaults. Changed to ?? (nullish
   coalescing) so only null/undefined trigger the fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(openclaw): update integration docs for system prompt injection

Reflect the migration from MEMORY.md file writes to before_prompt_build
hook-based context injection:

- Update architecture diagram and overview to show new hook flow
- Replace "MEMORY.md Live Sync" section with "System Prompt Context Injection"
- Update event lifecycle steps (before_agent_start, tool_result_persist)
- Add before_prompt_build step with TTL cache description
- Document new syncMemoryFileExclude config parameter
- Update session tracking to reflect removed workspaceDirsBySessionKey

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: fix terminology and update SKILL.md for system prompt injection

Replace "prompt injection" with "context injection" in docs to avoid
confusion with the OWASP security term. Update openclaw/SKILL.md to
reflect the new before_prompt_build hook and remove stale MEMORY.md
references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Alex Newman <thedotmack@gmail.com>
2026-03-17 17:14:30 -07:00
Alex Newman 80a8c90a1a feat: add embedded Process Supervisor for unified process lifecycle (#1370)
* feat: add embedded Process Supervisor for unified process lifecycle management

Consolidates scattered process management (ProcessManager, GracefulShutdown,
HealthMonitor, ProcessRegistry) into a unified src/supervisor/ module.

New: ProcessRegistry with JSON persistence, env sanitizer (strips CLAUDECODE_*
vars), graceful shutdown cascade (SIGTERM → 5s wait → SIGKILL with tree-kill
on Windows), PID file liveness validation, and singleton Supervisor API.

Fixes #1352 (worker inherits CLAUDECODE env causing nested sessions)
Fixes #1356 (zombie TCP socket after Windows reboot)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add session-scoped process reaping to supervisor

Adds reapSession(sessionId) to ProcessRegistry for killing session-tagged
processes on session end. SessionManager.deleteSession() now triggers reaping.
Tightens orphan reaper interval from 60s to 30s.

Fixes #1351 (MCP server processes leak on session end)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add Unix domain socket support for worker communication

Introduces socket-manager.ts for UDS-based worker communication, eliminating
port 37777 collisions between concurrent sessions. Worker listens on
~/.claude-mem/sockets/worker.sock by default with TCP fallback.

All hook handlers, MCP server, health checks, and admin commands updated to
use socket-aware workerHttpRequest(). Backwards compatible — settings can
force TCP mode via CLAUDE_MEM_WORKER_TRANSPORT=tcp.

Fixes #1346 (port 37777 collision across concurrent sessions)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove in-process worker fallback from hook command

Removes the fallback path where hook scripts started WorkerService in-process,
making the worker a grandchild of Claude Code (killed by sandbox). Hooks now
always delegate to ensureWorkerStarted() which spawns a fully detached daemon.

Fixes #1249 (grandchild process killed by sandbox)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add health checker and /api/admin/doctor endpoint

Adds 30-second periodic health sweep that prunes dead processes from the
supervisor registry and cleans stale socket files. Adds /api/admin/doctor
endpoint exposing supervisor state, process liveness, and environment health.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add comprehensive supervisor test suite

64 tests covering all supervisor modules: process registry (18 tests),
env sanitizer (8), shutdown cascade (10), socket manager (15), health
checker (5), and supervisor API (6). Includes persistence, isolation,
edge cases, and cross-module integration scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: revert Unix domain socket transport, restore TCP on port 37777

The socket-manager introduced UDS as default transport, but this broke
the HTTP server's TCP accessibility (viewer UI, curl, external monitoring).
Since there's only ever one worker process handling all sessions, the
port collision rationale for UDS doesn't apply. Reverts to TCP-only,
removing ~900 lines of unnecessary complexity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove dead code found in pre-landing review

Remove unused `acceptingSpawns` field from Supervisor class (written but
never read — assertCanSpawn uses stopPromise instead) and unused
`buildWorkerUrl` import from context handler.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* updated gitignore

* fix: address PR review feedback - downgrade HTTP logging, clean up gitignore, harden supervisor

- Downgrade request/response HTTP logging from info to debug to reduce noise
- Remove unused getWorkerPort imports, use buildWorkerUrl helper
- Export ENV_PREFIXES/ENV_EXACT_MATCHES from env-sanitizer, reuse in Server.ts
- Fix isPidAlive(0) returning true (should be false)
- Add shutdownInitiated flag to prevent signal handler race condition
- Make validateWorkerPidFile testable with pidFilePath option
- Remove unused dataDir from ShutdownCascadeOptions
- Upgrade reapSession log from debug to warn
- Rename zombiePidFiles to deadProcessPids (returns actual PIDs)
- Clean up gitignore: remove duplicate datasets/, stale ~*/ and http*/ patterns
- Fix tests to use temp directories instead of relying on real PID file

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 14:49:23 -07:00
Vincent Leraitre 237a4c37f8 fix: always pass --ssl flag to chroma-mcp in remote mode (#1286)
* fix: always pass --ssl flag to chroma-mcp in remote mode

The chroma-mcp CLI defaults to SSL when using --client-type http.
When CLAUDE_MEM_CHROMA_SSL is false (the common case for local
ChromaDB servers), buildCommandArgs() omitted --ssl entirely,
causing chroma-mcp to attempt an SSL connection to a plain HTTP
server and fail with "Could not connect to a Chroma server".

Always pass --ssl with an explicit true/false value so the user's
CLAUDE_MEM_CHROMA_SSL setting is faithfully forwarded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add regression tests for ChromaMcpManager SSL flag fix

Adds 4 focused test cases verifying buildCommandArgs() produces correct
--ssl args, covering SSL=false, SSL=true, unset (defaults to false), and
local mode (no --ssl flag). Requested by @xkonjin in PR #1286 review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: rebuild checked-in bundles to include SSL flag fix

Rebuild all bundles against upstream/main so the --ssl <true|false>
fix is present in the runtime artifacts that hooks and the marketplace
plugin actually execute.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:03:58 -07:00
laihenyi 626654f816 fix: prevent infinite restart loop on FOREIGN KEY constraint errors (#1334)
The pending-work-restart logic had no retry limit, causing infinite loops
when sessions encountered FOREIGN KEY constraint failures. This led to
2000+ error log entries per minute and eventual worker crash via SIGTERM.

Two fixes:
1. Add 'FOREIGN KEY constraint failed' to unrecoverable error patterns
   so it short-circuits immediately instead of falling through to restart
2. Add MAX_PENDING_RESTARTS (3) limit to pending-work-restart path as a
   safety net for any future unhandled persistent errors

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:03:48 -07:00
enzoricciulli e7ba9acaa7 fix: add content-hash dedup to batch observation store methods (#1302)
storeObservations() and storeObservationsAndMarkComplete() were missing
the content-hash deduplication that storeObservation() (singular) already
had via computeObservationContentHash() and findDuplicateObservation().

This caused the Gemini provider (and potentially others that return
multiple observations per response) to insert 2-10x duplicate rows per
tool use, since the batch methods inserted unconditionally without
checking content_hash.

The fix adds the same dedup pattern from storeObservation() to both
batch methods:
1. Compute content hash via computeObservationContentHash()
2. Check for existing observation within 30s window via findDuplicateObservation()
3. Skip insert and reuse existing ID if duplicate found
4. Include content_hash column in INSERT statement

Fixes #1158 (duplicate observations with Gemini provider)

Co-authored-by: Enzo Ricciulli <e.ricciulli@systhema.ai>
2026-03-12 20:01:53 -07:00
antmid ad902bedd9 fix: auto-repair malformed database schema from cross-version sync (#1308)
When a claude-mem DB is synced between machines running different versions,
orphaned indexes can reference non-existent columns (e.g. idx_observations_content_hash
referencing content_hash). This causes SQLite to throw "malformed database schema"
on ALL queries, including PRAGMAs, creating a silent 503 failure loop.

The fix detects this on startup, uses Python's sqlite3 module to drop the
orphaned schema objects (bun:sqlite doesn't support writable_schema modifications),
resets migration versions, and lets the idempotent migration system recreate
everything properly.

Fixes #1307

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:01:51 -07:00
GigiTiti-Kai b88566dcdd fix(ui): include SSE live data when project filter is active (#1315)
When a project filter was selected in the Web UI, all SSE live data
(observations, summaries, prompts) was completely discarded. Only
paginated API data was shown, meaning new real-time events were
invisible until the user refreshed the page.

Fix: filter SSE data by project before merging with paginated data,
instead of discarding it entirely.

Fixes #1313

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:01:48 -07:00
Rajiv Sinclair 1fac57535e fix: gracefully handle missing transcript files in worktree sessions (#1326)
When Claude Code runs in a worktree (via Agent tool with isolation: "worktree"),
the transcript path points to the worktree's project directory. After the
worktree is cleaned up, the Stop hook fires but the transcript file no longer
exists, causing extractLastMessage() to throw. This error triggers Claude to
respond, which fires another Stop hook, creating an infinite error loop.

Changed throws to warn-and-return-empty so the summarize hook exits cleanly
with exit 0 instead of cascading errors.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 19:59:47 -07:00
AlexWorland 10e980cd69 fix: remove unrecognized fields from Claude Code Stop hook output (#1291)
* fix: remove unrecognized fields from Claude Code Stop hook output

Claude Code validates Stop hook JSON output against its hook contract
schema which only accepts {decision?, reason?, systemMessage?}. The
formatOutput() function was returning {continue, suppressOutput} which
are not part of the Claude Code hook API, causing "JSON validation
failed" errors on every session stop.

Return an empty object {} for the default case (no hookSpecificOutput),
preserving only systemMessage when present. This is valid for all hook
event types and eliminates the schema validation error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add unhappy-path tests for formatOutput per PR review

Add edge case coverage for malformed input (undefined/null), falsy
systemMessage values, non-contract field stripping, and contract key
allowlist. Also add defensive null guard to formatOutput matching
normalizeInput pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Alex Worland <alexworland@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 19:59:45 -07:00
Nir Alfasi 38d9ac7adb fix: prevent zombie subprocess accumulation by only trusting exitCode (#1226) (#1325)
proc.killed only means Node sent a signal — the process can still be alive.
This caused premature pool slot release, allowing unbounded process spawning.

- ensureProcessExit: remove proc.killed from early-exit checks, only trust exitCode
- Fix 3 call-site guards that skipped cleanup for signaled-but-alive processes
- Add TOTAL_PROCESS_HARD_CAP=10 safety net in waitForSlot()
- After SIGKILL, wait up to 1s via exit event instead of blind 200ms sleep
- Reduce reaper interval from 5min to 1min, idle threshold from 2min to 1min

Closes #1226

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 19:59:42 -07:00
Ben Younes 503bda4868 fix: add null guards for getChromaSync() when Chroma is disabled (#1336)
When CLAUDE_MEM_CHROMA_ENABLED=false, getChromaSync() returns null.
Two call sites were missing null guards, causing "null is not an object"
errors on every UserPromptSubmit / session init.

Fixes #1294

Vibe-coded by Ousama Ben Younes

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 19:58:03 -07:00
Umut Polat 73113321a1 fix: respect dateStart/dateEnd filters in Chroma search path (#1343)
When a search query includes dateStart/dateEnd parameters, the Chroma
semantic search path (PATH 2) ignored them entirely and only applied a
hardcoded 90-day recency window. This meant date-filtered searches
returned results from outside the requested range.

Now the Chroma path checks for a user-provided dateRange first. If
present, it filters results by the requested start/end bounds. The
90-day window is only used as the default when no date filter is
specified, preserving backward compatibility.

Fixes #1324

Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com>
2026-03-12 19:57:58 -07:00
Umut Polat 88be01910b fix: respect env vars and settings.json for DATA_DIR resolution (#1344)
SettingsDefaultsManager.get() returned only the hardcoded default,
ignoring both environment variables and settings.json overrides. This
meant CLAUDE_MEM_DATA_DIR set via env or settings file had no effect
on paths.ts (and other callers using get()).

Two changes:
- get() now checks process.env before falling back to the default
- paths.ts resolves DATA_DIR with full priority: env var > settings.json
  at the default location > hardcoded default

The settings file read uses a synchronous bootstrap pattern to avoid
circular dependencies (DATA_DIR is needed to locate the settings file,
so we check the default path only).

Fixes #1303

Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com>
2026-03-12 19:57:55 -07:00
Umut Polat 9dbf63f5d4 fix: prevent LLM from using <observation> tags in summary responses (#1345)
The SDK agent's conversation history is heavily biased toward
<observation> output. By the time a summarize prompt arrives, the
in-context conditioning can cause the LLM to respond with <observation>
tags instead of the expected <summary> tags. parseSummary() then returns
null and the summary is silently lost.

Two changes:
- Add explicit mode-switch instructions at the top of the summary prompt
  telling the LLM not to use <observation> tags and that only <summary>
  output will be accepted
- Add a warning log in parseSummary() when <observation> tags are found
  in a response that has no <summary> block, making the issue visible
  in logs instead of silently discarding

Fixes #1312

Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com>
2026-03-12 19:57:52 -07:00
Alex Newman 6581d2ef45 fix: unify mode type/concept loading to always use mode definition (#1316)
* fix: unify mode type/concept loading to always use mode definition

Code mode previously read observation types/concepts from settings.json
while non-code modes read from their mode JSON definition. This caused
stale filters to persist when switching between modes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: remove dead observation type/concept settings constants

CLAUDE_MEM_CONTEXT_OBSERVATION_TYPES and OBSERVATION_CONCEPTS are no
longer read by ContextConfigLoader since all modes now use their mode
definition. Removes the constants, defaults, UI controls, and the
now-empty observation-metadata.ts file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 03:00:20 -07:00
Alex Newman 0e502dbd21 feat: add smart-explore AST-based code navigation (#1244)
* feat: add smart-file-read module for token-optimized semantic code search

- Created package.json for the smart-file-read module with dependencies and scripts.
- Implemented parser.ts for code structure parsing using tree-sitter, supporting multiple languages.
- Developed search.ts for searching code files and symbols with grep-style and structural matching.
- Added test-run.mjs for testing search and outline functionalities.
- Configured TypeScript with tsconfig.json for strict type checking and module resolution.

* fix: update .gitignore to include _tree-sitter and remove unused subproject

* feat: add preliminary results and skill recommendation for smart-explore module

* chore: remove outdated plan.md file detailing session start hook issues

* feat: update Smart File Read integration plan and skill documentation for smart-explore

* feat: migrate Smart File Read to web-tree-sitter WASM for cross-platform compatibility

* refactor: switch to tree-sitter CLI for parsing and enhance search functionality

- Updated `parser.ts` to utilize the tree-sitter CLI for AST extraction instead of native bindings, improving compatibility and performance.
- Removed grammar loading logic and replaced it with a path resolution for grammar packages.
- Implemented batch parsing in `parseFilesBatch` to handle multiple files in a single CLI call, enhancing search speed.
- Refactored `searchCodebase` to collect files and parse them in batches, streamlining the search process.
- Adjusted symbol extraction logic to accommodate the new parsing method and ensure accurate symbol matching.

* feat: update Smart File Read integration plan to utilize tree-sitter CLI for improved performance and cross-platform compatibility

* feat: add smart-file-read parser and search to src/services

Copy validated tree-sitter CLI-based parser and search modules from
smart-file-read prototype into the claude-mem source tree for MCP
tool integration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: register smart_search, smart_unfold, smart_outline MCP tools

Add 3 tree-sitter AST-based code exploration tools to the MCP server.
Direct execution (no HTTP delegation) — they call parser/search
functions directly for sub-second response times.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add tree-sitter CLI deps to build system and plugin runtime

Externalize tree-sitter packages in esbuild MCP server build. Add
10 grammar packages + CLI to plugin package.json for runtime install.
Remove unused @chroma-core/default-embed from plugin deps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: create smart-explore skill with 3-layer workflow docs

Progressive disclosure workflow: search -> outline -> unfold.
Documents all 3 MCP tools with parameters and token economics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add comprehensive documentation for the smart-explore feature

- Introduced a detailed technical reference covering the architecture, parser, search engine, and tool registration for the smart-explore feature in claude-mem.
- Documented the three-layer workflow: search, outline, and unfold, along with their respective MCP tools.
- Explained the parsing process using tree-sitter, including language support, query patterns, and symbol extraction.
- Outlined the search module's functionality, including file discovery, batch parsing, and relevance scoring.
- Provided insights into build system integration and token economics for efficient code exploration.

* chore: remove experiment artifacts, prototypes, and plan files

Remove A/B test docs, prototype smart-file-read directory, and
implementation plans. Keep only production code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: simplify hooks configuration and remove setup script

* fix: use execFileSync to prevent command injection in tree-sitter parser

Replaces execSync shell string with execFileSync + argument array,
eliminating shell interpretation of file paths. Also corrects
file_pattern description from "Glob pattern" to "Substring filter".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 21:00:26 -05:00
Alex Newman 0b034af98b fix: remove save_observation from MCP tool surface
save_observation is an internal API-only feature and should not be
exposed as an MCP tool to Claude.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 19:55:12 -05:00
Alex Newman ad3d236cec fix: resolve hook crashes and CLAUDE_PLUGIN_ROOT fallback (#1215, #1220) (#1229)
* fix: resolve PostToolUse hook crashes and 5s latency (#1220)

Three compounding bugs caused hook failures:

1. Missing break statements in worker-service.ts switch — if async
   code threw before process.exit(), execution fell through to
   subsequent cases. Added break to all 7 cases missing them.

2. Unhandled promise rejection on main() — added .catch() that logs
   the error and exits 0 (per project exit code strategy: don't block
   Claude Code or leave Windows Terminal tabs open).

3. Redundant start commands in hooks.json — PostToolUse,
   UserPromptSubmit, and Stop groups each had a standalone start
   command that was redundant (the hook case already calls
   ensureWorkerStarted internally). The redundant start also caused
   5s latency via bun-runner.js collectStdin() timeout since Claude
   Code never closes stdin.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add CLAUDE_PLUGIN_ROOT fallback for Stop hooks (#1215)

Upstream Claude Code bug (anthropics/claude-code#24529) leaves
CLAUDE_PLUGIN_ROOT unset for Stop hooks on macOS and ALL hooks
on Linux. Two-layer defense:

1. Shell-level: hooks.json commands now use inline fallback
   _R="${CLAUDE_PLUGIN_ROOT}"; [ -z "$_R" ] && _R="$HOME/...";
   falling back to the known marketplace install path.

2. Script-level: bun-runner.js self-resolves plugin root from
   its own filesystem location via import.meta.url, and fixes
   broken /scripts/... paths that result from empty expansion.

Added test to verify all hook commands include the fallback path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 19:31:26 -05:00
alan e0fec4bad7 feat: add terminal output control for SessionStart context (#1143)
* feat: add terminal output control for SessionStart context

Add CLAUDE_MEM_CONTEXT_SHOW_TERMINAL_OUTPUT setting to control whether
context is displayed in the terminal at SessionStart.

When set to "false", the terminal remains clean at startup while
context is still injected into Claude's system prompt. This allows
users who find the context output verbose to disable it without
losing the automatic context injection.

Defaults to "true" for backward compatibility.

Changes:
- Add CLAUDE_MEM_CONTEXT_SHOW_TERMINAL_OUTPUT to SettingsDefaultsManager
- Check setting in context handler before setting systemMessage
- Update settings file format to include new option

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix: use USER_SETTINGS_PATH and skip color fetch when disabled

Address PR feedback from automated review:

1. Use shared USER_SETTINGS_PATH constant instead of hardcoded path
   - Respects custom CLAUDE_MEM_DATA_DIR override
   - Consistent with other handlers (session-init, observation)

2. Skip color fetch when terminal output disabled
   - Check setting before making HTTP requests
   - Saves network round-trip on every session start

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Alan Dong <adong@Alans-MacBook-Pro.local>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-23 21:05:05 -05:00
Alex Newman c6f932988a Fix 30+ root-cause bugs across 10 triage phases (#1214)
* MAESTRO: fix ChromaDB core issues — Python pinning, Windows paths, disable toggle, metadata sanitization, transport errors

- Add --python version pinning to uvx args in both local and remote mode (fixes #1196, #1206, #1208)
- Convert backslash paths to forward slashes for --data-dir on Windows (fixes #1199)
- Add CLAUDE_MEM_CHROMA_ENABLED setting for SQLite-only fallback mode (fixes #707)
- Sanitize metadata in addDocuments() to filter null/undefined/empty values (fixes #1183, #1188)
- Wrap callTool() in try/catch for transport errors with auto-reconnect (fixes #1162)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix data integrity — content-hash deduplication, project name collision, empty project guard, stuck isProcessing

- Add SHA-256 content-hash deduplication to observations INSERT (store.ts, transactions.ts, SessionStore.ts)
- Add content_hash column via migration 22 with backfill and index
- Fix project name collision: getCurrentProjectName() now returns parent/basename
- Guard against empty project string with cwd-derived fallback
- Fix stuck isProcessing: hasAnyPendingWork() resets processing messages older than 5 minutes
- Add 12 new tests covering all four fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix hook lifecycle — stderr suppression, output isolation, conversation pollution prevention

- Suppress process.stderr.write in hookCommand() to prevent Claude Code showing diagnostic
  output as error UI (#1181). Restores stderr in finally block for worker-continues case.
- Convert console.error() to logger.warn()/error() in hook-command.ts and handlers/index.ts
  so all diagnostics route to log file instead of stderr.
- Verified all 7 handlers return suppressOutput: true (prevents conversation pollution #598, #784).
- Verified session-complete is a recognized event type (fixes #984).
- Verified unknown event types return no-op handler with exit 0 (graceful degradation).
- Added 10 new tests in tests/hook-lifecycle.test.ts covering event dispatch, adapter defaults,
  stderr suppression, and standard response constants.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix worker lifecycle — restart loop coordination, stale transport retry, ENOENT shutdown race

- Add PID file mtime guard to prevent concurrent restart storms (#1145):
  isPidFileRecent() + touchPidFile() coordinate across sessions
- Add transparent retry in ChromaMcpManager.callTool() on transport
  error — reconnects and retries once instead of failing (#1131)
- Wrap getInstalledPluginVersion() with ENOENT/EBUSY handling (#1042)
- Verified ChromaMcpManager.stop() already called on all shutdown paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix Windows platform support — uvx.cmd spawn, PowerShell $_ elimination, windowsHide, FTS5 fallback

- Route uvx spawn through cmd.exe /c on Windows since MCP SDK lacks shell:true (#1190, #1192, #1199)
- Replace all PowerShell Where-Object {$_} pipelines with WQL -Filter server-side filtering (#1024, #1062)
- Add windowsHide: true to all exec/spawn calls missing it to prevent console popups (#1048)
- Add FTS5 runtime probe with graceful fallback when unavailable on Windows (#791)
- Guard FTS5 table creation in migrations, SessionSearch, and SessionStore with try/catch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix skills/ distribution — build-time verification and regression tests (#1187)

Add post-build verification in build-hooks.js that fails if critical
distribution files (skills, hooks, plugin manifest) are missing. Add
10 regression tests covering skill file presence, YAML frontmatter,
hooks.json integrity, and package.json files field.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix MigrationRunner schema initialization (#979) — version conflict between parallel migration systems

Root cause: old DatabaseManager migrations 1-7 shared schema_versions table with
MigrationRunner's 4-22, causing version number collisions (5=drop tables vs add column,
6=FTS5 vs prompt tracking, 7=discovery_tokens vs remove UNIQUE).  initializeSchema()
was gated behind maxApplied===0, so core tables were never created when old versions
were present.

Fixes:
- initializeSchema() always creates core tables via CREATE TABLE IF NOT EXISTS
- Migrations 5-7 check actual DB state (columns/constraints) not just version tracking
- Crash-safe temp table rebuilds (DROP IF EXISTS _new before CREATE)
- Added missing migration 21 (ON UPDATE CASCADE) to MigrationRunner
- Added ON UPDATE CASCADE to FK definitions in initializeSchema()
- All changes applied to both runner.ts and SessionStore.ts

Tests: 13 new tests in migration-runner.test.ts covering fresh DB, idempotency,
version conflicts, crash recovery, FK constraints, and data integrity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix 21 test failures — stale mocks, outdated assertions, missing OpenClaw guards

Server tests (12): Added missing workerPath and getAiStatus to ServerOptions
mocks after interface expansion. ChromaSync tests (3): Updated to verify
transport cleanup in ChromaMcpManager after architecture refactor. OpenClaw (2):
Added memory_ tool skipping and response truncation to prevent recursive loops
and oversized payloads. MarkdownFormatter (2): Updated assertions to match
current output. SettingsDefaultsManager (1): Used correct default key for
getBool test. Logger standards (1): Excluded CLI transcript command from
background service check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix Codex CLI compatibility (#744) — session_id fallbacks, unknown platform tolerance, undefined guard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix Cursor IDE integration (#838, #1049) — adapter field fallbacks, tolerant session-init validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix /api/logs OOM (#1203) — tail-read replaces full-file readFileSync

Replace readFileSync (loads entire file into memory) with readLastLines()
that reads only from the end of the file in expanding chunks (64KB → 10MB cap).
Prevents OOM on large log files while preserving the same API response shape.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix Settings CORS error (#1029) — explicit methods and allowedHeaders in CORS config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: add session custom_title for agent attribution (#1213) — migration 23, endpoint + store support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: prevent CLAUDE.md/AGENTS.md writes inside .git/ directories (#1165)

Add .git path guard to all 4 write sites to prevent ref corruption when
paths resolve inside .git internals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix plugin disabled state not respected (#781) — early exit check in all hook entry points

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix UserPromptSubmit context re-injection on every turn (#1079) — contextInjected session flag

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix stale AbortController queue stall (#1099) — lastGeneratorActivity tracking + 30s timeout

Three-layer fix:
1. Added lastGeneratorActivity timestamp to ActiveSession, updated by
   processAgentResponse (all agents), getMessageIterator (queue yields),
   and startGeneratorWithProvider (generator launch)
2. Added stale generator detection in ensureGeneratorRunning — if no
   activity for >30s, aborts stale controller, resets state, restarts
3. Added AbortSignal.timeout(30000) in deleteSession to prevent
   indefinite hang when awaiting a stuck generator promise

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 19:34:35 -05:00
Alex Newman 5f28550551 MAESTRO: fix MCP type coercion for batch endpoints, add defensive observation error handling
Add string-to-array coercion for ids and memorySessionIds in DataRoutes.ts
batch endpoints so MCP clients sending "[1,2,3]" or "1,2,3" instead of
native arrays no longer get 400 errors. Wrap observation storage path in
SessionRoutes.ts with try/catch returning 200 on recoverable errors instead
of 500, preventing hook breakage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 16:41:28 -05:00
Alex Newman 0a26bb18bf fix: update footer text to reference claude-mem skill instead of MCP search tools
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 03:47:02 -05:00
Alex Newman 7966c6cba9 fix: rename save_memory and fix MCP search instructions + startup hook (#1210)
* fix: rename save_memory to save_observation and fix MCP search instructions

Stop the primary agent from proactively saving memories by renaming
save_memory to save_observation with a neutral description. Remove
"Saving Memories" section from SKILL.md. Update context formatters
and output styles to reference the mem-search skill instead of raw
MCP tool names.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: split SessionStart hooks so smart-install failure doesn't block worker start

smart-install.js and worker-start were in the same hook group, so if
smart-install exited non-zero the worker never started. Split into
separate hook groups so they run independently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: worker startup waits for readiness before hooks fire

Move initializationCompleteFlag to set after DB/search init (not MCP),
add waitForReadiness() polling /api/readiness, and extract shared
pollEndpointUntilOk helper to DRY up health/readiness checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 03:30:31 -05:00
Alex Newman e788fd3676 fix: prevent duplicate worker daemons and zombie processes (#1178)
* fix: prevent duplicate worker daemons and zombie processes

Three root causes of chroma-mcp timeouts:

1. HTTP shutdown (POST /api/admin/shutdown) closed resources but never
   called process.exit(). Zombie workers stayed alive, background tasks
   reconnected to chroma-mcp, spawning duplicate subprocesses that all
   contended for the same persistent data directory.

2. No guard against concurrent daemon startup. When hooks fired
   simultaneously, multiple daemons started before either wrote a PID
   file. The loser got EADDRINUSE but stayed alive because signal
   handlers registered in the constructor prevented exit.

3. Corrupt 147GB HNSW index file caused all chroma queries to timeout
   (MCP error -32001). Data fix: deleted corrupt collection, backfill
   rebuilds from SQLite.

Code fixes:
- Add PID-based guard in daemon startup: exit if PID file process alive
- Add port-based guard in daemon startup: exit if port already bound
  (runs before WorkerService constructor registers keepalive handlers)
- Add process.exit(0) after HTTP shutdown/restart completes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: aggressive startup cleanup and one-time chroma wipe for upgrade

Kill orphaned worker-service.cjs and chroma-mcp processes immediately
at startup (no age gate) while keeping 30-min threshold for mcp-server.
Wipe corrupt chroma data once on upgrade from pre-v10.3 versions —
backfill rebuilds from SQLite automatically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: wrap shutdown handlers in try/finally to guarantee process.exit

If onShutdown() or onRestart() threw, process.exit(0) was never reached,
leaving the daemon alive as a zombie. Also removed redundant require('fs')
calls in process-manager tests where ESM imports already existed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 20:10:28 -05:00
Alex Newman 40daf8f3fa feat: replace WASM embeddings with persistent chroma-mcp MCP connection (#1176)
* feat: replace WASM embeddings with persistent chroma-mcp MCP connection

Replace ChromaServerManager (npx chroma run + chromadb npm + ONNX/WASM)
with ChromaMcpManager, a singleton stdio MCP client that communicates with
chroma-mcp via uvx. This eliminates native binary issues, segfaults, and
WASM embedding failures that plagued cross-platform installs.

Key changes:
- Add ChromaMcpManager: singleton MCP client with lazy connect, auto-reconnect,
  connection lock, and Zscaler SSL cert support
- Rewrite ChromaSync to use MCP tool calls instead of chromadb npm client
- Handle chroma-mcp's non-JSON responses (plain text success/error messages)
- Treat "collection already exists" as idempotent success
- Wire ChromaMcpManager into GracefulShutdown for clean subprocess teardown
- Delete ChromaServerManager (no longer needed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review — connection guard leak, timer leak, async reset

- Clear connecting guard in finally block to prevent permanent reconnection block
- Clear timeout after successful connection to prevent timer leak
- Make reset() async to await stop() before nullifying instance
- Delete obsolete chroma-server-manager test (imports deleted class)
- Update graceful-shutdown test to use chromaMcpManager property name

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: prevent chroma-mcp spawn storm — zombie cleanup, stale onclose guard, reconnect backoff

Three bugs caused chroma-mcp processes to accumulate (92+ observed):

1. Zombie on timeout: failed connections left subprocess alive because
   only the timer was cleared, not the transport. Now catch block
   explicitly closes transport+client before rethrowing.

2. Stale onclose race: old transport's onclose handler captured `this`
   and overwrote the current connection reference after reconnect,
   orphaning the new subprocess. Now guarded with reference check.

3. No backoff: every failure triggered immediate reconnect. With
   backfill doing hundreds of MCP calls, this created rapid-fire
   spawning. Added 10s backoff on both connection failure and
   unexpected process death.

Also includes ChromaSync fixes from PR review:
- queryChroma deduplication now preserves index-aligned arrays
- SQL injection guard on backfill ID exclusion lists

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 18:32:38 -05:00
Alex Newman 5d79bb7a7a fix: prevent zombie process accumulation by verifying subprocess exit (#1168) (#1175)
Two changes fix the observer process resource leak:

1. Add ensureProcessExit to generator finally blocks in SessionRoutes and
   worker-service, matching the pattern already working in SDKAgent.

2. Add stale session reaper (every 2m) that removes sessions with no active
   generator and no pending work after 15m idle. This unblocks the orphan
   reaper which previously skipped processes for "active" sessions.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 16:33:23 -05:00
Alex Newman b88251bc8b fix: self-healing claimNextMessage prevents stuck processing messages (#1159)
* fix: self-healing claimNextMessage prevents stuck processing messages

claimAndDelete → claimNextMessage with atomic self-healing: resets stale
processing messages (>60s) back to pending before claiming. Eliminates
stuck messages from generator crashes without external timers. Removes
redundant idle-timeout reset in worker-service.ts. Adds QUEUE to logger
Component type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update stale comments in SessionQueueProcessor to reflect claim-confirm pattern

Comments still referenced the old claim-and-delete pattern after the
claimNextMessage rename. Updated to accurately describe the current
lifecycle where messages are marked as processing and stay in DB until
confirmProcessed() is called.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: move Date.now() inside transaction and extract stale threshold constant

- Move Date.now() inside claimNextMessage transaction closure so timestamp
  is fresh if WAL contention causes retry
- Extract STALE_PROCESSING_THRESHOLD_MS to module-level constant
- Add comment clarifying strict < boundary semantics

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 23:15:46 -05:00
Alex Newman ca8421611c fix: backfill Chroma vector DB for all projects on startup (#1154)
* fix: backfill all Chroma projects on worker startup

ChromaSync.ensureBackfilled() existed but was never called. After
v10.2.2's bun cache clear destroyed the ONNX model cache, Chroma only
had ~2 days of embeddings while SQLite had 49k+ observations.

- Add static backfillAllProjects() to ChromaSync — iterates all projects
  in SQLite, creates temporary ChromaSync per project, runs smart diff
- Call backfillAllProjects() fire-and-forget on worker startup
- Add 'CHROMA_SYNC' to logger Component type (pre-existing gap)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: sanitize project names for Chroma collection naming

Replace characters outside [a-zA-Z0-9._-] with underscores so projects
like "YC Stuff" map to collection "cm__YC_Stuff" instead of failing
Chroma's collection name validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: route backfill to shared cm__claude-mem collection, harden sanitization

- Use single ChromaSync('claude-mem') in backfillAllProjects() instead of
  per-project instances, matching how DatabaseManager and SearchManager
  operate — fixes critical bug where backfilled data landed in orphaned
  collections that no search path reads from
- Strip trailing non-alphanumeric chars from sanitized collection names
  to satisfy Chroma's end-character constraint
- Guard backfill behind Chroma server readiness to avoid N spurious error
  logs when Chroma failed to start
- Use CHROMA_SYNC log component consistently for backfill messages

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: pass project as parameter to ensureBackfilled instead of mutating instance state

Eliminates shared mutable state in backfillAllProjects() loop. Project
scoping is now passed explicitly via parameter to both ensureBackfilled()
and getExistingChromaIds(), keeping a single Chroma connection while
avoiding fragile instance property mutation across iterations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 22:47:46 -05:00
Alex Newman 224567f980 fix: prevent ONNX model cache corruption from bun cache clears
Remove nuclear `bun pm cache rm` from smart-install.js and
sync-marketplace.cjs (only needed for removed sharp dependency).
Add `bun install` in cache version directory after sync so worker
can resolve dependencies. Move HuggingFace model cache to
~/.claude-mem/models/ so reinstalls don't corrupt it. Add self-healing
retry for Protobuf parsing failures.

Fixes recurring issues #1104, #1105, #1110.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 19:54:17 -05:00
Alex Newman f24251118e fix: bun install, node-addon-api for sharp, consolidate PendingMessageStore (#1140)
* fix: use bun install in sync, add node-addon-api for sharp, consolidate PendingMessageStore

- Switch sync-marketplace from npm to bun install
- Add node-addon-api as dev dep so sharp builds under bun
- Consolidate duplicate PendingMessageStore instantiation in worker-service finally block

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* build assets

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 18:05:42 -05:00
Alex Newman d2e926fbf7 fix: post-merge breakage (Gemini, idle timeout, sharp cache) (#1138)
* fix: add gemini-3-flash to validModels array

The model was defined in the type union and RPM limits but missing from
the runtime validModels array, causing silent fallback to gemini-2.5-flash.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: skip processing when Gemini returns empty observation response

Empty responses were silently consuming messages from the queue via
processAgentResponse. Now skips processing on empty content, leaving
the message in processing status for stale recovery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: prevent idle timeout from triggering infinite restart loop

When a session hits the 3-minute idle timeout, the finally block was
seeing stale processing messages and restarting the generator endlessly.
Now tracks idle timeout as a distinct exit reason via session flag,
resets stale messages, and skips restart.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: clear stale Bun native module cache on update

Bun's global cache retains sharp/libvips native binaries with broken
dylib references after version upgrades. Clear ~/.bun/install/cache/@img/
before install in both the end-user (smart-install) and dev (sync-marketplace)
paths to prevent ERR_DLOPEN_FAILED errors in Chroma sync.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review feedback (empty summary response, session-scoped reset, shell injection)

- Apply same empty-response guard to summary path as observation path in GeminiAgent
- Add optional sessionDbId param to resetStaleProcessingMessages for session-scoped resets
- Use JSON.stringify for gitignore pattern escaping, filter negation patterns

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 17:46:30 -05:00