Commit Graph

220 Commits

Author SHA1 Message Date
Alessandro Costa 876cc4d837 feat: semantic context injection via Chroma on UserPromptSubmit (#1568)
* feat: semantic context injection via Chroma on every UserPromptSubmit

On each prompt, queries ChromaDB for the top-N most relevant past
observations and injects them as additionalContext. Replaces the
recency-based "last N observations" approach with relevance-based
semantic search.

Changes:
- session-init.ts: After session init, query /api/context/semantic
  with user's prompt text. If results found, return as
  hookSpecificOutput with hookEventName 'UserPromptSubmit'.
- SearchRoutes.ts: New GET /api/context/semantic endpoint that queries
  SearchManager with format='json' and formats results as markdown.
- SettingsDefaultsManager.ts: New settings CLAUDE_MEM_SEMANTIC_INJECT
  (default: true) and CLAUDE_MEM_SEMANTIC_INJECT_LIMIT (default: 5).

Key behaviors:
- Fires on every UserPromptSubmit (not just SessionStart)
- Minimum prompt length: 20 chars (skips "ok", "yes", etc.)
- Skips media-only prompts
- Graceful degradation: if worker/Chroma unavailable, no injection
- Survives /clear: re-injects on next prompt (not session-bound)
- Uses workerHttpRequest (v10.6.3 API, not raw fetch)

Production data (23 days, 3,400+ observations):
- Before: 8 most recent observations (often irrelevant to current topic)
- After: 5 most relevant observations (semantic match)
- Token cost: ~1800 → ~800-1200 per injection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address CodeRabbit review on PR #1568

- session-init: don't skip semantic injection when contextInjected=true
  (only skip agent re-init, semantic lookup must run every prompt)
- session-init: normalize SEMANTIC_INJECT toggle via String().toLowerCase()
- semantic endpoint: change from GET to POST to avoid URL-length limits
  and prompt exposure in access logs. Handler accepts both body and query
  for backwards compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Alessandro Costa <alessandro@claudio.dev>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 15:16:46 -07:00
Alessandro Costa 8958c3335d feat: drain orphaned pending messages on SIGTERM session completion (#1567)
* feat: drain orphaned pending messages on session completion (SIGTERM)

When deleteSession() aborts the SDK agent via SIGTERM, pending messages
in the queue are never processed. Without drain, they remain in
'pending' status forever — no future generator picks them up because
the session is already completed.

Adds markAllSessionMessagesAbandoned() call after deleteSession() in
completeByDbId(). This reuses the existing PendingMessageStore method
already used by worker-service.ts terminateSession().

Production evidence: 15 orphaned summarize messages found across
completed sessions (ages 3h to 3 days) before this fix. After fix:
0 orphaned messages over 23 days of operation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: document best-effort drain limitation per CodeRabbit review #1567

Add comment noting the rare race condition when generators outlive the
30s SIGTERM timeout. Practical risk is negligible (0 orphans over 23 days).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Alessandro Costa <alessandro@claudio.dev>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 15:14:25 -07:00
Alex Newman a2ac116aac fix: move summary wait + session-complete into Stop hook to prevent lost summaries
SessionEnd has a 1.5s hardcoded cap from Claude Code (CLAUDE_CODE_SESSIONEND_HOOKS_TIMEOUT_MS),
making it unsuitable for waiting on async work. Previously, the Stop hook would fire-and-forget
the summarize request, then SessionEnd would immediately call deleteSession — aborting the SDK
agent mid-summary.

Now the Stop hook (120s timeout, no cap) owns the full lifecycle:
1. Queue summarize request
2. Poll new GET /api/sessions/status endpoint until queue drains
3. Call /api/sessions/complete after summary finishes

SessionEnd is now a true fire-and-forget fallback (process.exit(0) immediately).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 14:05:53 -07:00
Alex Newman 67645041fa Merge main into thedotmack/file-read-timeline-inject
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 16:11:41 -07:00
Ousama Ben Younes 12501412b9 fix: persist session completion to database in completeByDbId (#1532)
completeByDbId only cleaned up in-memory state, leaving sdk_sessions rows
with status='active' and completed_at=NULL indefinitely. Ghost sessions
accumulated and exhausted the agent pool, causing 60s timeout errors.

- Add SessionStore.markSessionCompleted() to set status/completed_at/completed_at_epoch
- Call it at the start of completeByDbId before in-memory cleanup
- Inject SessionStore into SessionCompletionHandler via constructor
- Add 4 tests covering status, timestamps, isolation, and non-existent IDs

Closes #1532

Co-Authored-By: Claude <noreply@anthropic.com>
2026-04-01 06:02:14 +00:00
huakson 4f6fb9e614 fix: address platform source review feedback
Tighten platform source persistence so legacy callers cannot silently relabel existing sessions, repair migration 24 when schema_versions drifts from the real schema, and polish the follow-up UI/error-handler review nits.

- only backfill platform_source when it is blank and raise on explicit source conflicts for an existing session
- make migration 24 verify both the sdk_sessions column and its index before treating it as applied
- expose platform_source from the functional session getters and add regression tests for source preservation and schema drift recovery
- add the required APPROVED OVERRIDE annotation for centralized HTTP error translation
- keep mobile source pills on a single horizontal row
2026-03-24 10:46:48 -03:00
huakson 2b60dd2932 feat: isolate Claude and Codex session sources
Persist platform_source across session creation, transcript ingestion, API query paths, and viewer state so Claude and Codex data can coexist without bleeding into each other.

- add platform-source normalization helpers and persist platform_source in sdk_sessions via migration 24 with backfill and indexing
- thread platformSource through CLI hooks, transcript processing, context generation, pagination, search routes, SSE payloads, and session management
- expose source-aware project catalogs, viewer tabs, context preview selectors, and source badges for observations, prompts, and summaries
- start the transcript watcher from the worker for transcript-based clients and preserve platform source during Codex ingestion
- auto-start the worker from the MCP server for MCP-only clients and tighten stdio-driven cleanup during shutdown
- keep createSDKSession backward compatible with existing custom-title callers while allowing explicit platform source forwarding
2026-03-24 08:46:18 -03:00
vnz df1fb8bb89 fix(gemini): add conversation history truncation to prevent O(N²) token cost growth
GeminiAgent sends the full conversation history with every API call,
causing quadratic token growth per session. A 100-observation session
sends ~30M cumulative input tokens. This ports the proven truncateHistory()
sliding window from OpenRouterAgent to GeminiAgent.

- Add CLAUDE_MEM_GEMINI_MAX_CONTEXT_MESSAGES (default: 20) and
  CLAUDE_MEM_GEMINI_MAX_TOKENS (default: 100000) settings
- Add truncateHistory() to GeminiAgent using shared estimateTokens()
- Always preserve at least the newest message to avoid empty API requests
- Add settings validation in SettingsRoutes (1-100 messages, 1K-1M tokens)
- Add regression tests for truncation and oversized single-prompt edge case

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 07:37:58 +01:00
Alex Newman 4d7bec4d05 fix: stop spinner from spinning forever (#1440)
* fix: stop spinner from spinning forever due to orphaned DB messages

The activity spinner never stopped because isAnySessionProcessing() queried
ALL pending/processing messages in the database, including orphaned messages
from dead sessions that no generator would ever process.

Root cause: isAnySessionProcessing() used hasAnyPendingWork() which is a
global DB scan. Changed it to use getTotalQueueDepth() which only checks
sessions in the active in-memory Map.

Additional fixes:
- Add terminateSession() to enforce restart-or-terminate invariant
- Fix 3 zombie paths in .finally() handler that left sessions alive
- Clean up idle sessions from memory on successful completion
- Remove redundant bare isProcessing:true broadcast
- Replace inline require() with proper accessor
- Add 8 regression tests for session termination invariant

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review findings — idle-timeout race, double broadcast, query amplification

- Move pendingCount check before idle-timeout termination to prevent
  abandoning fresh messages that arrive between idle abort and .finally()
- Move broadcastProcessingStatus() inside restart branch only — the else
  branch already broadcasts via removeSessionImmediate callback
- Compute queueDepth once in broadcastProcessingStatus() and derive
  isProcessing from it, eliminating redundant double iteration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 14:13:10 -07:00
Alex Newman 5b041d6b49 refactor: rename formatters to AgentFormatter/HumanFormatter for semantic clarity
ColorFormatter and MarkdownFormatter names obscured their actual purpose.
The formatters serve two distinct audiences: the AI agent (compressed,
token-efficient context) and the human (rich ANSI-colored terminal output).

- MarkdownFormatter → AgentFormatter (renderMarkdown* → renderAgent*)
- ColorFormatter → HumanFormatter (renderColor* → renderHuman*)
- useColors parameter → forHuman across the pipeline
- Import aliases Color/Markdown → Human/Agent
- API query param `colors=true` unchanged (backward compatible)

Pure rename refactor — no logic or behavior changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 11:50:41 -07:00
Alex Newman c80763390b feat: file-read decision gate — block reads when observation history exists
Add a PreToolUse gate that blocks file reads on first attempt when rich
observation history exists, presenting the timeline as feedback. Claude
then decides: use get_observations() (skip read, save tokens) or re-read
(allowed on second attempt).

- FileReadGate: in-memory session-scoped gate with 4h TTL
- POST /api/file-context/gate endpoint in worker
- stderrMessage plumbing in hook-command for exit code 2
- file-context handler uses gate to block/allow reads

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 12:11:02 -07:00
Alex Newman e07b13f7de fix: proper project isolation and relative path matching for file-context hook
- Use getProjectContext(cwd).allProjects for project scoping (same as SessionStart)
- Convert absolute file_path to relative using cwd (observations store relative paths)
- API accepts comma-separated projects param with IN() SQL filter
- Remove basename matching — use full relative path to avoid cross-file collisions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 15:38:53 -07:00
Alex Newman fb9d917f8a feat: inject file observation timeline on PreToolUse Read hook
When Claude reads a file, the PreToolUse hook queries for existing
observations about that file and injects the timeline into context
via additionalContext + permissionDecision: allow. This prevents
duplicate observations and saves tokens through active rediscovery.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 15:18:54 -07:00
Alex Newman 7e07210635 feat: add timeline-report skill with token economics, compress context output 53%
## Summary
- New timeline-report skill for generating narrative project history reports
- Compressed markdown context output ~53% (tables → flat compact lines, verbose labels → terse format)
- Added `full=true` param to /api/context/inject for fetching all observations
- Split TimelineRenderer into separate markdown/color rendering paths
- Removed arbitrary file write vulnerability (dump_to_file param)
- Fixed timestamp ditto marker leaking across session summary boundaries

## Review
- Rebased on main (v10.6.0) to preserve OpenClaw system prompt injection
- Reviewed by /review (gstack) + /octo:review (Codex, Gemini, Claude fleet)
- Security fix (dump_to_file removal) confirmed by all 3 reviewers
- Timestamp bug caught by Codex, fixed

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-03-18 13:57:20 -07:00
Alex Newman 80a8c90a1a feat: add embedded Process Supervisor for unified process lifecycle (#1370)
* feat: add embedded Process Supervisor for unified process lifecycle management

Consolidates scattered process management (ProcessManager, GracefulShutdown,
HealthMonitor, ProcessRegistry) into a unified src/supervisor/ module.

New: ProcessRegistry with JSON persistence, env sanitizer (strips CLAUDECODE_*
vars), graceful shutdown cascade (SIGTERM → 5s wait → SIGKILL with tree-kill
on Windows), PID file liveness validation, and singleton Supervisor API.

Fixes #1352 (worker inherits CLAUDECODE env causing nested sessions)
Fixes #1356 (zombie TCP socket after Windows reboot)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add session-scoped process reaping to supervisor

Adds reapSession(sessionId) to ProcessRegistry for killing session-tagged
processes on session end. SessionManager.deleteSession() now triggers reaping.
Tightens orphan reaper interval from 60s to 30s.

Fixes #1351 (MCP server processes leak on session end)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add Unix domain socket support for worker communication

Introduces socket-manager.ts for UDS-based worker communication, eliminating
port 37777 collisions between concurrent sessions. Worker listens on
~/.claude-mem/sockets/worker.sock by default with TCP fallback.

All hook handlers, MCP server, health checks, and admin commands updated to
use socket-aware workerHttpRequest(). Backwards compatible — settings can
force TCP mode via CLAUDE_MEM_WORKER_TRANSPORT=tcp.

Fixes #1346 (port 37777 collision across concurrent sessions)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove in-process worker fallback from hook command

Removes the fallback path where hook scripts started WorkerService in-process,
making the worker a grandchild of Claude Code (killed by sandbox). Hooks now
always delegate to ensureWorkerStarted() which spawns a fully detached daemon.

Fixes #1249 (grandchild process killed by sandbox)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add health checker and /api/admin/doctor endpoint

Adds 30-second periodic health sweep that prunes dead processes from the
supervisor registry and cleans stale socket files. Adds /api/admin/doctor
endpoint exposing supervisor state, process liveness, and environment health.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add comprehensive supervisor test suite

64 tests covering all supervisor modules: process registry (18 tests),
env sanitizer (8), shutdown cascade (10), socket manager (15), health
checker (5), and supervisor API (6). Includes persistence, isolation,
edge cases, and cross-module integration scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: revert Unix domain socket transport, restore TCP on port 37777

The socket-manager introduced UDS as default transport, but this broke
the HTTP server's TCP accessibility (viewer UI, curl, external monitoring).
Since there's only ever one worker process handling all sessions, the
port collision rationale for UDS doesn't apply. Reverts to TCP-only,
removing ~900 lines of unnecessary complexity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove dead code found in pre-landing review

Remove unused `acceptingSpawns` field from Supervisor class (written but
never read — assertCanSpawn uses stopPromise instead) and unused
`buildWorkerUrl` import from context handler.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* updated gitignore

* fix: address PR review feedback - downgrade HTTP logging, clean up gitignore, harden supervisor

- Downgrade request/response HTTP logging from info to debug to reduce noise
- Remove unused getWorkerPort imports, use buildWorkerUrl helper
- Export ENV_PREFIXES/ENV_EXACT_MATCHES from env-sanitizer, reuse in Server.ts
- Fix isPidAlive(0) returning true (should be false)
- Add shutdownInitiated flag to prevent signal handler race condition
- Make validateWorkerPidFile testable with pidFilePath option
- Remove unused dataDir from ShutdownCascadeOptions
- Upgrade reapSession log from debug to warn
- Rename zombiePidFiles to deadProcessPids (returns actual PIDs)
- Clean up gitignore: remove duplicate datasets/, stale ~*/ and http*/ patterns
- Fix tests to use temp directories instead of relying on real PID file

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 14:49:23 -07:00
Nir Alfasi 38d9ac7adb fix: prevent zombie subprocess accumulation by only trusting exitCode (#1226) (#1325)
proc.killed only means Node sent a signal — the process can still be alive.
This caused premature pool slot release, allowing unbounded process spawning.

- ensureProcessExit: remove proc.killed from early-exit checks, only trust exitCode
- Fix 3 call-site guards that skipped cleanup for signaled-but-alive processes
- Add TOTAL_PROCESS_HARD_CAP=10 safety net in waitForSlot()
- After SIGKILL, wait up to 1s via exit event instead of blind 200ms sleep
- Reduce reaper interval from 5min to 1min, idle threshold from 2min to 1min

Closes #1226

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 19:59:42 -07:00
Ben Younes 503bda4868 fix: add null guards for getChromaSync() when Chroma is disabled (#1336)
When CLAUDE_MEM_CHROMA_ENABLED=false, getChromaSync() returns null.
Two call sites were missing null guards, causing "null is not an object"
errors on every UserPromptSubmit / session init.

Fixes #1294

Vibe-coded by Ousama Ben Younes

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 19:58:03 -07:00
Umut Polat 73113321a1 fix: respect dateStart/dateEnd filters in Chroma search path (#1343)
When a search query includes dateStart/dateEnd parameters, the Chroma
semantic search path (PATH 2) ignored them entirely and only applied a
hardcoded 90-day recency window. This meant date-filtered searches
returned results from outside the requested range.

Now the Chroma path checks for a user-provided dateRange first. If
present, it filters results by the requested start/end bounds. The
90-day window is only used as the default when no date filter is
specified, preserving backward compatibility.

Fixes #1324

Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com>
2026-03-12 19:57:58 -07:00
Alex Newman c6f932988a Fix 30+ root-cause bugs across 10 triage phases (#1214)
* MAESTRO: fix ChromaDB core issues — Python pinning, Windows paths, disable toggle, metadata sanitization, transport errors

- Add --python version pinning to uvx args in both local and remote mode (fixes #1196, #1206, #1208)
- Convert backslash paths to forward slashes for --data-dir on Windows (fixes #1199)
- Add CLAUDE_MEM_CHROMA_ENABLED setting for SQLite-only fallback mode (fixes #707)
- Sanitize metadata in addDocuments() to filter null/undefined/empty values (fixes #1183, #1188)
- Wrap callTool() in try/catch for transport errors with auto-reconnect (fixes #1162)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix data integrity — content-hash deduplication, project name collision, empty project guard, stuck isProcessing

- Add SHA-256 content-hash deduplication to observations INSERT (store.ts, transactions.ts, SessionStore.ts)
- Add content_hash column via migration 22 with backfill and index
- Fix project name collision: getCurrentProjectName() now returns parent/basename
- Guard against empty project string with cwd-derived fallback
- Fix stuck isProcessing: hasAnyPendingWork() resets processing messages older than 5 minutes
- Add 12 new tests covering all four fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix hook lifecycle — stderr suppression, output isolation, conversation pollution prevention

- Suppress process.stderr.write in hookCommand() to prevent Claude Code showing diagnostic
  output as error UI (#1181). Restores stderr in finally block for worker-continues case.
- Convert console.error() to logger.warn()/error() in hook-command.ts and handlers/index.ts
  so all diagnostics route to log file instead of stderr.
- Verified all 7 handlers return suppressOutput: true (prevents conversation pollution #598, #784).
- Verified session-complete is a recognized event type (fixes #984).
- Verified unknown event types return no-op handler with exit 0 (graceful degradation).
- Added 10 new tests in tests/hook-lifecycle.test.ts covering event dispatch, adapter defaults,
  stderr suppression, and standard response constants.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix worker lifecycle — restart loop coordination, stale transport retry, ENOENT shutdown race

- Add PID file mtime guard to prevent concurrent restart storms (#1145):
  isPidFileRecent() + touchPidFile() coordinate across sessions
- Add transparent retry in ChromaMcpManager.callTool() on transport
  error — reconnects and retries once instead of failing (#1131)
- Wrap getInstalledPluginVersion() with ENOENT/EBUSY handling (#1042)
- Verified ChromaMcpManager.stop() already called on all shutdown paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix Windows platform support — uvx.cmd spawn, PowerShell $_ elimination, windowsHide, FTS5 fallback

- Route uvx spawn through cmd.exe /c on Windows since MCP SDK lacks shell:true (#1190, #1192, #1199)
- Replace all PowerShell Where-Object {$_} pipelines with WQL -Filter server-side filtering (#1024, #1062)
- Add windowsHide: true to all exec/spawn calls missing it to prevent console popups (#1048)
- Add FTS5 runtime probe with graceful fallback when unavailable on Windows (#791)
- Guard FTS5 table creation in migrations, SessionSearch, and SessionStore with try/catch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix skills/ distribution — build-time verification and regression tests (#1187)

Add post-build verification in build-hooks.js that fails if critical
distribution files (skills, hooks, plugin manifest) are missing. Add
10 regression tests covering skill file presence, YAML frontmatter,
hooks.json integrity, and package.json files field.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix MigrationRunner schema initialization (#979) — version conflict between parallel migration systems

Root cause: old DatabaseManager migrations 1-7 shared schema_versions table with
MigrationRunner's 4-22, causing version number collisions (5=drop tables vs add column,
6=FTS5 vs prompt tracking, 7=discovery_tokens vs remove UNIQUE).  initializeSchema()
was gated behind maxApplied===0, so core tables were never created when old versions
were present.

Fixes:
- initializeSchema() always creates core tables via CREATE TABLE IF NOT EXISTS
- Migrations 5-7 check actual DB state (columns/constraints) not just version tracking
- Crash-safe temp table rebuilds (DROP IF EXISTS _new before CREATE)
- Added missing migration 21 (ON UPDATE CASCADE) to MigrationRunner
- Added ON UPDATE CASCADE to FK definitions in initializeSchema()
- All changes applied to both runner.ts and SessionStore.ts

Tests: 13 new tests in migration-runner.test.ts covering fresh DB, idempotency,
version conflicts, crash recovery, FK constraints, and data integrity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix 21 test failures — stale mocks, outdated assertions, missing OpenClaw guards

Server tests (12): Added missing workerPath and getAiStatus to ServerOptions
mocks after interface expansion. ChromaSync tests (3): Updated to verify
transport cleanup in ChromaMcpManager after architecture refactor. OpenClaw (2):
Added memory_ tool skipping and response truncation to prevent recursive loops
and oversized payloads. MarkdownFormatter (2): Updated assertions to match
current output. SettingsDefaultsManager (1): Used correct default key for
getBool test. Logger standards (1): Excluded CLI transcript command from
background service check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix Codex CLI compatibility (#744) — session_id fallbacks, unknown platform tolerance, undefined guard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix Cursor IDE integration (#838, #1049) — adapter field fallbacks, tolerant session-init validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix /api/logs OOM (#1203) — tail-read replaces full-file readFileSync

Replace readFileSync (loads entire file into memory) with readLastLines()
that reads only from the end of the file in expanding chunks (64KB → 10MB cap).
Prevents OOM on large log files while preserving the same API response shape.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix Settings CORS error (#1029) — explicit methods and allowedHeaders in CORS config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: add session custom_title for agent attribution (#1213) — migration 23, endpoint + store support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: prevent CLAUDE.md/AGENTS.md writes inside .git/ directories (#1165)

Add .git path guard to all 4 write sites to prevent ref corruption when
paths resolve inside .git internals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix plugin disabled state not respected (#781) — early exit check in all hook entry points

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix UserPromptSubmit context re-injection on every turn (#1079) — contextInjected session flag

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* MAESTRO: fix stale AbortController queue stall (#1099) — lastGeneratorActivity tracking + 30s timeout

Three-layer fix:
1. Added lastGeneratorActivity timestamp to ActiveSession, updated by
   processAgentResponse (all agents), getMessageIterator (queue yields),
   and startGeneratorWithProvider (generator launch)
2. Added stale generator detection in ensureGeneratorRunning — if no
   activity for >30s, aborts stale controller, resets state, restarts
3. Added AbortSignal.timeout(30000) in deleteSession to prevent
   indefinite hang when awaiting a stuck generator promise

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 19:34:35 -05:00
Alex Newman 5f28550551 MAESTRO: fix MCP type coercion for batch endpoints, add defensive observation error handling
Add string-to-array coercion for ids and memorySessionIds in DataRoutes.ts
batch endpoints so MCP clients sending "[1,2,3]" or "1,2,3" instead of
native arrays no longer get 400 errors. Wrap observation storage path in
SessionRoutes.ts with try/catch returning 200 on recoverable errors instead
of 500, preventing hook breakage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 16:41:28 -05:00
Alex Newman 40daf8f3fa feat: replace WASM embeddings with persistent chroma-mcp MCP connection (#1176)
* feat: replace WASM embeddings with persistent chroma-mcp MCP connection

Replace ChromaServerManager (npx chroma run + chromadb npm + ONNX/WASM)
with ChromaMcpManager, a singleton stdio MCP client that communicates with
chroma-mcp via uvx. This eliminates native binary issues, segfaults, and
WASM embedding failures that plagued cross-platform installs.

Key changes:
- Add ChromaMcpManager: singleton MCP client with lazy connect, auto-reconnect,
  connection lock, and Zscaler SSL cert support
- Rewrite ChromaSync to use MCP tool calls instead of chromadb npm client
- Handle chroma-mcp's non-JSON responses (plain text success/error messages)
- Treat "collection already exists" as idempotent success
- Wire ChromaMcpManager into GracefulShutdown for clean subprocess teardown
- Delete ChromaServerManager (no longer needed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review — connection guard leak, timer leak, async reset

- Clear connecting guard in finally block to prevent permanent reconnection block
- Clear timeout after successful connection to prevent timer leak
- Make reset() async to await stop() before nullifying instance
- Delete obsolete chroma-server-manager test (imports deleted class)
- Update graceful-shutdown test to use chromaMcpManager property name

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: prevent chroma-mcp spawn storm — zombie cleanup, stale onclose guard, reconnect backoff

Three bugs caused chroma-mcp processes to accumulate (92+ observed):

1. Zombie on timeout: failed connections left subprocess alive because
   only the timer was cleared, not the transport. Now catch block
   explicitly closes transport+client before rethrowing.

2. Stale onclose race: old transport's onclose handler captured `this`
   and overwrote the current connection reference after reconnect,
   orphaning the new subprocess. Now guarded with reference check.

3. No backoff: every failure triggered immediate reconnect. With
   backfill doing hundreds of MCP calls, this created rapid-fire
   spawning. Added 10s backoff on both connection failure and
   unexpected process death.

Also includes ChromaSync fixes from PR review:
- queryChroma deduplication now preserves index-aligned arrays
- SQL injection guard on backfill ID exclusion lists

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 18:32:38 -05:00
Alex Newman 5d79bb7a7a fix: prevent zombie process accumulation by verifying subprocess exit (#1168) (#1175)
Two changes fix the observer process resource leak:

1. Add ensureProcessExit to generator finally blocks in SessionRoutes and
   worker-service, matching the pattern already working in SDKAgent.

2. Add stale session reaper (every 2m) that removes sessions with no active
   generator and no pending work after 15m idle. This unblocks the orphan
   reaper which previously skipped processes for "active" sessions.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 16:33:23 -05:00
Alex Newman d2e926fbf7 fix: post-merge breakage (Gemini, idle timeout, sharp cache) (#1138)
* fix: add gemini-3-flash to validModels array

The model was defined in the type union and RPM limits but missing from
the runtime validModels array, causing silent fallback to gemini-2.5-flash.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: skip processing when Gemini returns empty observation response

Empty responses were silently consuming messages from the queue via
processAgentResponse. Now skips processing on empty content, leaving
the message in processing status for stale recovery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: prevent idle timeout from triggering infinite restart loop

When a session hits the 3-minute idle timeout, the finally block was
seeing stale processing messages and restarting the generator endlessly.
Now tracks idle timeout as a distinct exit reason via session flag,
resets stale messages, and skips restart.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: clear stale Bun native module cache on update

Bun's global cache retains sharp/libvips native binaries with broken
dylib references after version upgrades. Clear ~/.bun/install/cache/@img/
before install in both the end-user (smart-install) and dev (sync-marketplace)
paths to prevent ERR_DLOPEN_FAILED errors in Chroma sync.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review feedback (empty summary response, session-scoped reset, shell injection)

- Apply same empty-response guard to summary path as observation path in GeminiAgent
- Add optional sessionDbId param to resetStaleProcessingMessages for session-scoped resets
- Use JSON.stringify for gitignore pattern escaping, filter negation patterns

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 17:46:30 -05:00
michelhelsdingen 51719d23a4 feat: configurable subprocess pool limit for SDK agents (#995)
* feat: configurable subprocess pool limit for SDK agents

Prevents runaway accumulation of Claude SDK agent subprocesses by
enforcing a configurable concurrency limit.

- New CLAUDE_MEM_MAX_CONCURRENT_AGENTS setting (default: 2)
- Promise-based waitForSlot() in ProcessRegistry (not polling per
  review feedback on #830)
- Waiters are notified via unregisterProcess when a slot frees up
- SDKAgent.startSession() waits for a slot before spawning
- 60s timeout prevents indefinite waits

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>

* fix: remove unused originalUnregister const and getActiveCount import

Cleanup from Greptile review:
- Remove dead `originalUnregister` variable in ProcessRegistry
- Remove unused `getActiveCount` import in SDKAgent

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Happy <yesreply@happy.engineering>
2026-02-16 00:31:17 -05:00
ixnaswang 81013e1310 fix: SDK Agent fails on Windows when username contains spaces (#1022)
* fix: SDK Agent fails on Windows when username contains spaces

Fixes spawn failure on Windows when the user's path contains spaces
(e.g., C:\Users\Anderson Wang\).

Root cause:
- SDKAgent.ts returns full auto-detected path with spaces
- ProcessRegistry.ts cannot execute .cmd files when path contains spaces

Solution:
- SDKAgent: On Windows, prefer "claude.cmd" via PATH instead of full path
- ProcessRegistry: Use cmd.exe /d /c wrapper for .cmd files on Windows

This preserves argument boundaries (e.g., empty string values) while
properly handling paths with spaces.

Fixes #1014

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: add Windows spawn path with spaces fix documentation

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-16 00:31:11 -05:00
TerrifiedBug 0a40c4c596 fix: include project in ChromaDB where clause for vector search (#1112)
When searching with a project parameter, the ChromaDB vector query was
not filtering by project. It only filtered by doc_type. This caused
larger projects to dominate the top-N results returned by ChromaDB,
effectively crowding out results from smaller projects before the
post-hoc SQLite project filter could take effect.

For example, with project A having 19,000 embeddings and project B
having 700, a search scoped to project B would return mostly project A
results from ChromaDB. After SQLite filtered by project, only 1-3
results from B would survive instead of the expected 20+.

The fix adds the project to the ChromaDB where clause using $and when
both doc_type and project filters are needed. This is applied in both
ChromaSearchStrategy.buildWhereFilter() and SearchManager.search().

Co-authored-by: TARS <tars@openclaw.local>
2026-02-16 00:30:29 -05:00
zhaixingzi 454e9c5870 fix: resolve duplicate assistant messages in OpenRouter agent (#1074)
This commit addresses the issue of duplicate assistant messages appearing
in the conversation history by commenting out the lines that were
unnecessarily pushing assistant responses to the conversationHistory array.

The processAgentResponse function already handles adding assistant messages
to the conversation history, so these additional pushes were causing
duplicate entries.

Changes made:
- Commented out session.conversationHistory.push calls for assistant responses
  in three locations within OpenRouterAgent.ts:
  1. In the init response handling (around line 117)
  2. In the observation response handling (around line 188)
  3. In the summary response handling (around line 230)

This ensures that assistant messages are only added once to the conversation
history, preventing duplication while maintaining the intended functionality.

Co-authored-by: 张坤 <zhangkun@example.com>
2026-02-16 00:26:07 -05:00
SaneApps 2f337dab13 fix: use Gemini v1 API endpoint instead of v1beta (#1082)
v1beta does not support newer models like gemini-3-flash, causing
silent 404 errors that back up the observation queue indefinitely.
Users with CLAUDE_MEM_GEMINI_MODEL=gemini-3-flash get zero observations
stored, with no visible error — the queue just grows silently.

Changes:
- Switch API URL from v1beta/models to v1/models (generateContent
  works identically on both endpoints)
- Add gemini-3-flash to GeminiModel type and RPM limits
- Update test to match new endpoint

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 00:26:01 -05:00
Alex Newman 055888e181 fix: address PR review feedback for subprocess cleanup and binary resolution
Wrap SDK query loop in try/finally so subprocess cleanup runs on error paths.
Swap Chroma binary check order to try project-level .bin first (common case).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 23:24:00 -05:00
Alex Newman e1ef14dbcc fix: resolve orphaned subprocesses and Chroma HTTP regressions
- Add subprocess cleanup after SDK query loop completes, using existing
  ProcessRegistry infrastructure (getProcessBySession + ensureProcessExit)
- Replace npx-based Chroma binary spawning with absolute path resolution
  via require.resolve, falling back to npx with explicit cwd (#1120)
- Remove @chroma-core/default-embed client-side dependency; let Chroma
  HTTP server handle embeddings server-side (#1104, #1105, #1110)

Closes #1010, #1089, #1090, #1068, #1120, #1104, #1105, #1110

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 22:04:52 -05:00
Alex Newman 1b68c55763 fix: resolve SDK spawn failures and sharp native binary crashes
- Strip CLAUDECODE env var from SDK subprocesses to prevent "cannot be
  launched inside another Claude Code session" error (Claude Code 2.1.42+)
- Lazy-load @chroma-core/default-embed to avoid eagerly pulling in
  sharp native binaries at bundle startup (fixes ERR_DLOPEN_FAILED)
- Add stderr capture to SDK spawn for diagnosing future process failures
- Exclude lockfiles from marketplace rsync and delete stale lockfiles
  before npm install to prevent native dep version mismatches

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 22:47:27 -05:00
Alex Newman 5de728612e chore: bump version to 10.0.6
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 00:02:37 -05:00
Alex Newman 98d87d7573 chore: bump version to 10.0.4
Reverts v10.0.3 chroma-mcp spawn storm fix (broken release).
Restores codebase to v10.0.2 state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 21:36:34 -05:00
Rod Boev a3f9e7f638 fix: prevent chroma-mcp spawn storm with 5-layer defense (641 processes → max 2)
During SIGHUP testing with 6+ active sessions, ChromaSync.ensureConnection()
had no mutex — concurrent fire-and-forget syncObservation() calls each spawned
a chroma-mcp subprocess via StdioClientTransport, creating 641 orphans in ~5min.
Error-driven reconnection formed a positive feedback loop amplifying the storm.

Defense layers:
- Layer 0: Connection mutex via promise memoization (prevents concurrent spawns)
- Layer 1: Pre-spawn process count guard using execFileSync('ps') (kills excess)
- Layer 2: Hardened close() with try-finally + Unix pkill in GracefulShutdown
- Layer 3: Count-based orphan reaper in ProcessManager (not age-based)
- Layer 4: Circuit breaker stops retries after 3 consecutive failures for 60s

Closes #1063, closes #695
Relates to #1010, #707
2026-02-11 07:19:28 -05:00
Alex Newman 8dfcb5e612 chore: bump version to 9.1.0
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 01:05:38 -05:00
Alex Newman 98920bd860 MAESTRO: Merge PR #662 - Add save_memory MCP tool for manual memory storage
Adds save_memory MCP tool allowing users to manually save observations
for semantic search. Source changes cherry-picked from PR #662 by
@darconada (build artifact conflicts resolved by direct application).

Closes #645.

Co-Authored-By: darconadalabarga <darconada@arsys.es>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 04:13:44 -05:00
Alex Newman 5dffb1ebb0 MAESTRO: fix(hooks): add session-complete handler to enable orphan reaper cleanup
Cherry-picked from PR #844 by @thusdigital. Sessions stayed in active
sessions map forever after summarize, causing the orphan reaper to think
all processes were still active. Adds session-complete as Stop phase 2
hook that calls POST /api/sessions/complete to remove sessions from the
active map, allowing the reaper to correctly identify and clean up
orphaned worker processes. Fixes #842.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 03:23:13 -05:00
Alex Newman da1d2cd36a MAESTRO: fix(db): prevent FK constraint failures on worker restart
Cherry-picked source changes from PR #889 by @Et9797. Fixes #846.

Key changes:
- Add ensureMemorySessionIdRegistered() guard in SessionStore.ts
- Add ON UPDATE CASCADE migration (schema v21) for observations and session_summaries FK constraints
- Change message queue from claim-and-delete to claim-confirm pattern (PendingMessageStore.ts)
- Add spawn deduplication and unrecoverable error detection in SessionRoutes.ts and worker-service.ts
- Add forceInit flag to SDKAgent for stale session recovery

Build artifacts skipped (pre-existing dompurify dep issue). Path fixes (HealthMonitor.ts, worker-utils.ts)
already merged via PR #634.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 03:16:17 -05:00
Alex Newman 75a0f2e981 fix: respect CLAUDE_CONFIG_DIR for plugin paths (#626)
Add MARKETPLACE_ROOT constant to paths.ts and update 5 source files to use
centralized path constants instead of hardcoded ~/.claude paths. Preserves
backwards compatibility when CLAUDE_CONFIG_DIR is not set.

Based on PR #634 by @Kuroakira, cherry-picked onto main due to build artifact
merge conflicts (source changes applied cleanly).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 03:00:08 -05:00
Alex Newman 91e1d5baad fix: correct Gemini model name from gemini-3-flash to gemini-3-flash-preview
The Gemini API requires the -preview suffix for the Gemini 3 Flash model.
gemini-3-flash does not exist - only gemini-3-flash-preview is available.
This was causing 404 errors when users selected this model option.

Closes #831

Co-Authored-By: Glucksberg <markuscontasul@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 02:55:30 -05:00
Abdelkarim Mateos Sanchez 9bd56c993c fix: align IDs with metadatas in ChromaSearchStrategy
ChromaSync.queryChroma() returns deduplicated sqlite_ids but the
metadatas array contains multiple entries per observation (narrative +
facts). The filterByRecency() method was iterating over metadatas and
using the index to access ids, causing array out-of-bounds access.

The fix builds a Map from sqlite_id to metadata, then iterates over
the deduplicated ids array to ensure proper alignment.

Symptoms before fix:
- Semantic search returning incorrect/empty results
- Search only working with near-exact queries
- Recent items (same day) not being found

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 02:07:03 -05:00
Alex Newman 711f5455df fix: generate synthetic memorySessionId for stateless providers (PR #615)
Gemini and OpenRouter are stateless APIs that never return session IDs.
Without synthetic IDs, PR #693's defensive memorySessionId checks throw
errors on every observation processing call for these providers.

Generates provider-prefixed IDs (gemini-/openrouter-{contentSessionId}-
{timestamp}) before the first API call, persisted to the database via
updateMemorySessionId(). Applied from PR #615 (closed due to staleness).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 02:04:49 -05:00
TranslateMe ea38601564 fix: Reset AbortController before starting generator to prevent infinite abort loop
When a generator exits with wasAborted=true, the AbortController remains in
aborted state but generatorPromise is set to null. When a new observation
arrives, ensureGeneratorRunning() sees generatorPromise=null and tries to
start a new generator, but the new generator immediately sees
signal.aborted=true and exits, causing an infinite "Generator aborted" loop.

This fix resets the AbortController if it's already aborted before starting
a new generator, allowing the session to recover from the stuck state.

Bug reproduction:
1. Session receives observations
2. Something causes the generator to be aborted
3. generatorPromise = null, but abortController.signal.aborted = true
4. New observation arrives → starts generator → immediately aborted → loop

Fix: Check if abortController.signal.aborted before starting generator,
and create a new AbortController if needed.
2026-02-06 01:53:17 -05:00
jayvenn21 f24bba21e9 fix(worker): gracefully process orphaned pending messages after session termination 2026-02-06 01:47:43 -05:00
Alex Newman 6382d6f9c7 MAESTRO: Merge PR #693 - prevent infinite restart loop that causes runaway API costs
Add restart limit (max 3 consecutive restarts) with exponential backoff
to prevent infinite generator restart loops. Also add defensive
memorySessionId checks in GeminiAgent and OpenRouterAgent before
expensive LLM calls to fail fast when session ID hasn't been captured.

Based on PR #693 by @ajbmachon (applied to current main).

Co-Authored-By: Andre Machon <ajbmachon2@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 01:43:12 -05:00
jayvenn21 93bad99d79 fix: terminate session on prompt-too-long instead of retrying indefinitely 2026-02-06 01:39:55 -05:00
Michel Tomas b07130acc6 fix: handle both boolean and string types for settings
JSON.parse preserves native types, so boolean true/false stay as
booleans rather than strings. The previous check only handled string
'true', meaning users who set `"ENABLED": true` (boolean) wouldn't
have the feature work.

Now both `getBool()` helper and ResponseProcessor check handle:
- String 'true' → enabled
- Boolean true → enabled
- Any other value → disabled

Tested: Confirmed feature stays disabled with string "false" and
no new CLAUDE.md files are created when accessing directories.
2026-02-06 01:36:45 -05:00
Michel Tomas bb96092d74 fix: add FOLDER_CLAUDEMD_ENABLED to settingKeys for API/UI access 2026-02-06 01:36:45 -05:00
Michel Tomas 9907df1db8 fix: respect CLAUDE_MEM_FOLDER_CLAUDEMD_ENABLED setting
The CLAUDE_MEM_FOLDER_CLAUDEMD_ENABLED setting was documented but never
actually checked in code. The folder CLAUDE.md generation ran
unconditionally whenever files were touched.

Changes:
- Add CLAUDE_MEM_FOLDER_CLAUDEMD_ENABLED to SettingsDefaults interface
- Add default value 'false' to DEFAULTS object
- Check setting in ResponseProcessor before calling updateFolderClaudeMdFiles

Fixes the issue identified in issue-600-documentation-audit-features-not-implemented.md:
"The setting CLAUDE_MEM_FOLDER_CLAUDEMD_ENABLED is never read."
2026-02-06 01:36:45 -05:00
jayvenn21 5d1ee20076 fix: prevent duplicate generator spawns in handleSessionInit
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-05 23:48:56 -05:00