claude-mem/ISSUE-BLOWOUT-TODO.md
Alex Newman ba1ef6c42c fix: Issue Blowout 2026 — 25 bugs across worker, hooks, security, and search (#2080)
* fix: resolve search, database, and docker bugs (#1913, #1916, #1956, #1957, #2048)

- Fix concept/concepts param mismatch in SearchManager.normalizeParams (#1916)
- Add FTS5 keyword fallback when ChromaDB is unavailable (#1913, #2048)
- Add periodic WAL checkpoint and journal_size_limit to prevent unbounded WAL growth (#1956)
- Add periodic clearFailed() to purge stale pending_messages (#1957)
- Fix nounset-safe TTY_ARGS expansion in docker/claude-mem/run.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: prevent silent data loss on non-XML responses, add queue info to /health (#1867, #1874)

- ResponseProcessor: mark messages as failed (with retry) instead of confirming
  when the LLM returns non-XML garbage (auth errors, rate limits) (#1874)
- Health endpoint: include activeSessions count for queue liveness monitoring (#1867)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: cache isFts5Available() at construction time

Addresses Greptile review: avoid DDL probe (CREATE + DROP) on every text
query. Result is now cached in _fts5Available at construction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: resolve worker stability bugs — pool deadlock, MCP loopback, restart guard (#1868, #1876, #2053)

- Replace flat consecutiveRestarts counter with time-windowed RestartGuard:
  only counts restarts within 60s window (cap=10), decays after 5min of
  success. Prevents stranding pending messages on long-running sessions. (#2053)

- Add idle session eviction to pool slot allocation: when all slots are full,
  evict the idlest session (no pending work, oldest activity) to free a slot
  for new requests, preventing 60s timeout deadlock. (#1868)

- Fix MCP loopback self-check: use process.execPath instead of bare 'node'
  which fails on non-interactive PATH. Fix crash misclassification by removing
  false "Generator exited unexpectedly" error log on normal completion. (#1876)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: resolve hooks reliability bugs — summarize exit code, session-init health wait (#1896, #1901, #1903, #1907)

- Wrap summarize hook's workerHttpRequest in try/catch to prevent exit
  code 2 (blocking error) on network failures or malformed responses.
  Session exit no longer blocks on worker errors. (#1901)

- Add health-check wait loop to UserPromptSubmit session-init command in
  hooks.json. On Linux/WSL where hook ordering fires UserPromptSubmit
  before SessionStart, session-init now waits up to 10s for worker health
  before proceeding. Also wrap session-init HTTP call in try/catch. (#1907)

- Close #1896 as already-fixed: mtime comparison at file-context.ts:255-267
  bypasses truncation when file is newer than latest observation.

- Close #1903 as no-repro: hooks.json correctly declares all hook events.
  Issue was Claude Code 12.0.1/macOS platform event-dispatch bug.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: security hardening — bearer auth, path validation, rate limits, per-user port (#1932, #1933, #1934, #1935, #1936)

- Add bearer token auth to all API endpoints: auto-generated 32-byte
  token stored at ~/.claude-mem/worker-auth-token (mode 0600). All hook,
  MCP, viewer, and OpenCode requests include Authorization header.
  Health/readiness endpoints exempt for polling. (#1932, #1933)

- Add path traversal protection: watch.context.path validated against
  project root and ~/.claude-mem/ before write. Rejects ../../../etc
  style attacks. (#1934)

- Reduce JSON body limit from 50MB to 5MB. Add in-memory rate limiter
  (300 req/min/IP) to prevent abuse. (#1935)

- Derive default worker port from UID (37700 + uid%100) to prevent
  cross-user data leakage on multi-user macOS. Windows falls back to
  37777. Shell hooks use same formula via id -u. (#1936)
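
The UID-based default port in the last item can be sketched as below. Only the formula 37700 + uid%100 and the 37777 Windows fallback come from this commit; the function name `defaultWorkerPort` and the convention of passing `null` when no POSIX UID is available are assumptions for illustration.

```typescript
// Hypothetical helper, not the project's actual code.
// Grounded in the notes above: port = 37700 + uid % 100, Windows falls back to 37777.
function defaultWorkerPort(uid: number | null): number {
  if (uid === null) {
    // Assumption: callers pass null when there is no POSIX uid (e.g. on Windows).
    return 37777;
  }
  return 37700 + (uid % 100);
}

console.log(defaultWorkerPort(501));
console.log(defaultWorkerPort(null));
```

Under this formula, UIDs that differ by a multiple of 100 map to the same port, so the scheme separates typical user accounts on one machine but does not guarantee uniqueness.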

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: resolve search project filtering and import Chroma sync (#1911, #1912, #1914, #1918)

- Fix per-type search endpoints to pass project filter to Chroma queries
  and SQLite hydration. searchObservations/Sessions/UserPrompts now use
  $or clause matching project + merged_into_project. (#1912)

- Fix timeline/search methods to pass project to Chroma anchor queries.
  Prevents cross-project result leakage when project param omitted. (#1911)

- Sync imported observations to ChromaDB after FTS rebuild. Import
  endpoint now calls chromaSync.syncObservation() for each imported
  row, making them visible to MCP search(). (#1914)

- Fix session-init cwd fallback to match context.ts (process.cwd()).
  Prevents project key mismatch that caused "no previous sessions"
  on fresh sessions. (#1918)

- Fix sync-marketplace restart to include auth token and per-user port.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: resolve all CodeRabbit and Greptile review comments on PR #2080

- Fix run.sh comment mismatch (no-op flag vs empty array)
- Gate session-init on health check success (prevent running when worker unreachable)
- Fix date_desc ordering ignored in FTS session search
- Age-scope failed message purge (1h retention) instead of clearing all
- Anchor RestartGuard decay to real successes (null init, not Date.now())
- Add recordSuccess() calls in ResponseProcessor and completion path
- Prevent caller headers from overriding bearer auth token
- Add lazy cleanup for rate limiter map to prevent unbounded growth
- Bound post-import Chroma sync with concurrency limit of 8
- Add doc_type:'observation' filter to Chroma queries feeding observation hydration
- Add FTS fallback to all specialized search handlers (observations, sessions, prompts, timeline)
- Add response.ok check and error handling in viewer saveSettings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: resolve CodeRabbit round-2 review comments

- Use failure timestamp (COALESCE) instead of created_at_epoch for stale purge
- Downgrade _fts5Available flag when FTS table creation fails
- Escape FTS5 MATCH input by quoting user queries as literal phrases
- Escape LIKE metacharacters (%, _, \) in prompt text search
- Add response.ok check in initial settings load (matches save flow)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: resolve CodeRabbit round-3 review comments

- Include failed_at_epoch in COALESCE for age-scoped purge
- Re-throw FTS5 errors so callers can distinguish failure from no-results
- Wrap all FTS fallback calls in SearchManager with try/catch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 11:42:09 -07:00


Issue Blowout 2026 - Running TODO

Branch: issue-blowout-2026 (merged as PR #2079)
Strategy: Cynical dev. Every bug report is suspect — look for overengineered band-aids as root cause.
Test gate: After every build-and-sync, verify observations are flowing.
Released: v12.3.2 on 2026-04-19

Instructions for Continuation

Workflow per issue

  1. Use /make-plan and /do to attack each issue's root cause
  2. Be cynical — most bug reports are surface-level; the real issue is usually overengineered band-aids
  3. After every npm run build-and-sync, verify observations flow:
    sleep 5 && sqlite3 ~/.claude-mem/claude-mem.db "SELECT COUNT(*) FROM observations WHERE created_at_epoch > (strftime('%s','now') - 120) * 1000"
    
  4. If observations stop flowing, that's a regression — fix it before continuing

Docker isolation

  • Port 37777: Host's live bun worker (YOUR claude-mem instance — don't touch)
  • Port 37778: Another agent's docker container (claude-mem-dev) — hands off
  • Your docker: Use tag claude-mem:blowout, data dir .docker-blowout-data/
    TAG=claude-mem:blowout docker/claude-mem/build.sh
    HOST_MEM_DIR=$(pwd)/.docker-blowout-data TAG=claude-mem:blowout docker/claude-mem/run.sh
    
  • Check observations in docker DB:
    sqlite3 .docker-blowout-data/claude-mem.db 'select count(*) from observations'
    

PR → Review → Merge → Release cycle

  1. Create PR from feature branch to main
  2. Start review loop: /loop 2m to check and resolve review comments
    • CodeRabbit and Greptile post inline comments — read, fix, commit, push, reply
    • claude-review is a CI check — just needs to pass
    • CodeRabbit can take 5-10 min to process after each push
  3. When all reviews pass: gh pr merge <PR#> --repo thedotmack/claude-mem --squash --delete-branch --admin
  4. Close resolved issues: for issue in <numbers>; do gh issue close $issue --repo thedotmack/claude-mem --comment "Fixed in PR #XXXX"; done
  5. Version bump:
    cd ~/Scripts/claude-mem
    git pull origin main
    # Run /version-bump patch (or use the skill: claude-mem:version-bump)
    # It handles: version files → build → commit → tag → push → gh release → changelog
    

Key files in the codebase

  • Parser: src/sdk/parser.ts — observation and summary XML parsing
  • Prompts: src/sdk/prompts.ts — LLM prompt templates (observation, summary, continuation)
  • ResponseProcessor: src/services/worker/agents/ResponseProcessor.ts — unified response handler
  • SessionManager: src/services/worker/SessionManager.ts — queue, sessions, circuit breaker
  • SessionSearch: src/services/sqlite/SessionSearch.ts — FTS5 and filter queries
  • SearchManager: src/services/worker/SearchManager.ts — hybrid Chroma+SQLite orchestration
  • Worker service: src/services/worker-service.ts — periodic reapers, startup
  • Summarize hook: src/cli/handlers/summarize.ts — Stop hook entry point
  • SessionRoutes: src/services/worker/http/routes/SessionRoutes.ts — HTTP API
  • ViewerRoutes: src/services/worker/http/routes/ViewerRoutes.ts — /health endpoint
  • Agents: src/services/worker/SDKAgent.ts, GeminiAgent.ts, OpenRouterAgent.ts
  • Modes: plugin/modes/code.json — prompt field values for the default mode
  • Migrations: src/services/sqlite/migrations/runner.ts
  • PendingMessageStore: src/services/sqlite/PendingMessageStore.ts — queue persistence

Completed Phase 2-5 (16 more issues — this session)

| # | Component | Issue | Resolution |
| --- | --- | --- | --- |
| 2053 | worker | Generator restart guard strands pending messages | FIXED — Time-windowed RestartGuard replaces flat counter (10 restarts/60s window, 5min decay) |
| 1868 | worker | SDK pool deadlock: idle sessions monopolize slots | FIXED — evictIdlestSession() callback in waitForSlot() preempts idle sessions |
| 1876 | worker | MCP loopback self-check fails; crash misclassification | FIXED — process.execPath replaces bare 'node'; removed false "exited unexpectedly" log |
| 1901 | hooks | Summarize stop hook exits code 2 on errors | FIXED — workerHttpRequest wrapped in try/catch, exits gracefully |
| 1907 | hooks | Linux/WSL session-init before worker healthy | FIXED — health-check curl loop added to UserPromptSubmit hook; HTTP call wrapped |
| 1896 | hooks | PreToolUse file-context caps Read to limit:1 | CLOSED — already fixed (mtime comparison at file-context.ts:255-267) |
| 1903 | hooks | PostToolUse/Stop/SessionEnd never fire | CLOSED — no-repro (hooks.json correct; Claude Code 12.0.1 platform bug) |
| 1932 | security | Admin endpoints spoofable requireLocalhost | FIXED — bearer token auth on all API endpoints |
| 1933 | security | Unauthenticated HTTP API exposes 30+ endpoints | FIXED — auto-generated token at ~/.claude-mem/worker-auth-token (mode 0600) |
| 1934 | security | watch.context.path written without validation | FIXED — path traversal protection validates against project root / data dir |
| 1935 | security | Unbounded input, no rate limits | FIXED — 5MB body limit (was 50MB), 300 req/min/IP rate limiter |
| 1936 | security | Multi-user macOS shared port cross-user MCP | FIXED — per-user port derivation from UID (37700 + uid%100) |
| 1911 | search | search()/timeline() cross-project results | FIXED — project filter passed to Chroma queries and timeline anchor searches |
| 1912 | search | /api/search per-type endpoints ignore project | FIXED — project $or clause added to searchObservations/Sessions/UserPrompts |
| 1914 | search | Imported observations invisible to MCP search | FIXED — ChromaSync.syncObservation() called after import |
| 1918 | search | SessionStart "no previous sessions" on fresh sessions | FIXED — session-init cwd fallback matches context.ts (process.cwd()) |
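
The time-windowed counting behind the #2053 fix can be sketched roughly as below. The 60s window and the cap of 10 come from the resolution note; the class name `WindowedRestartGuard`, its method signature, and the explicit `nowMs` argument are assumptions for illustration. The 5-minute decay after success is deliberately left out of this sketch.

```typescript
// Hypothetical sketch of time-windowed restart counting, not the project's RestartGuard.
// Grounded in the notes above: only restarts inside a 60s window count, with a cap of 10.
class WindowedRestartGuard {
  private restartTimes: number[] = [];
  private readonly windowMs: number;
  private readonly maxRestarts: number;

  constructor(windowMs: number = 60000, maxRestarts: number = 10) {
    this.windowMs = windowMs;
    this.maxRestarts = maxRestarts;
  }

  // Records a restart at nowMs and returns true while the number of
  // restarts inside the window (including this one) stays within the cap.
  recordRestart(nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    // Drop restarts that have aged out of the window.
    this.restartTimes = this.restartTimes.filter((t) => t > cutoff);
    this.restartTimes.push(nowMs);
    return this.restartTimes.length <= this.maxRestarts;
  }
}

const guard = new WindowedRestartGuard();
console.log(guard.recordRestart(Date.now()));
```

Unlike a flat consecutive counter, old restarts stop counting once they leave the window. A long-running session that restarts occasionally therefore never reaches the cap, which is the stranding problem the issue describes.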

Completed (9 issues — PR #2079, v12.3.2)

| # | Component | Issue | Resolution |
| --- | --- | --- | --- |
| 1908 | summarizer | parseSummary discards output when LLM emits observation tags | CLOSED — already fixed by Gen 3 coercion (coerceObservationToSummary in parser.ts) |
| 1953 | db | Migration 7 rebuilds table every startup | CLOSED — already fixed by commit 59ce0fc5 (origin !== 'pk' filter) |
| 1916 | search | /api/search/by-concept emits malformed SQL | FIXED — concept→concepts remap in SearchManager.normalizeParams() |
| 1913 | search | Text search returns empty when ChromaDB disabled | FIXED — FTS5 keyword fallback in SessionSearch + SearchManager |
| 2048 | search | Text queries should fall back to FTS5 when Chroma disabled | FIXED — same as #1913 |
| 1957 | db | pending_messages: failed rows never purged | FIXED — periodic clearFailed() in stale session reaper (every 2 min) |
| 1956 | db | WAL grows unbounded, no checkpoint schedule | FIXED — journal_size_limit=4MB + periodic wal_checkpoint(PASSIVE) |
| 1874 | worker | processAgentResponse deletes queued messages on non-XML output | FIXED — mark messages failed (with retry) instead of confirming |
| 1867 | worker | Queue processor dies while /health stays green | FIXED — activeSessions count added to /health endpoint |

Also fixed (not an issue): docker/claude-mem/run.sh nounset-safe TTY_ARGS expansion.
Also fixed (Greptile review): cached isFts5Available() at construction time.
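
The FTS5 keyword fallback passes user text into SQL, and the later review rounds added quoting for MATCH input and escaping for LIKE metacharacters (%, _, \). A minimal sketch of both follows. The function names `toFts5Phrase` and `escapeLike` are assumptions, not the project's actual helpers.

```typescript
// Hypothetical helpers illustrating the two escaping steps, not the project's code.

// Quote user text as a single FTS5 phrase: wrap it in double quotes and
// double any embedded double quotes, so operators such as OR or * are
// matched as literal text instead of being parsed as query syntax.
function toFts5Phrase(userQuery: string): string {
  return '"' + userQuery.replace(/"/g, '""') + '"';
}

// Escape LIKE metacharacters (backslash, %, _) with a backslash.
// The SQL statement must declare the escape character with ESCAPE '\'.
function escapeLike(text: string): string {
  return text.replace(/[\\%_]/g, (ch) => "\\" + ch);
}

console.log(toFts5Phrase('fix "auth" OR token'));
console.log(escapeLike("100%_done"));
```

Both values should still be bound as query parameters. For example, `'%' + escapeLike(q) + '%'` would be bound to a `LIKE ? ESCAPE '\'` clause on whichever column holds the prompt text.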

Remaining — CRITICAL (6)

| # | Component | Issue |
| --- | --- | --- |
| 1925 | mcp | chroma-mcp subprocess leak via null-before-close |
| 1926 | mcp | chroma-mcp stdio handshake broken across all versions |
| 1942 | auth | Default model not resolved on Bedrock/Vertex/Azure |
| 1943 | auth | SDK pipeline rejects Bedrock auth |
| 1880 | windows | Ghost LISTEN socket on port 37777 after crash |
| 1887 | windows | Failing worker blocks Claude Code MCP 10+ min in hook-restart loop |

Remaining — HIGH (32)

| # | Component | Issue |
| --- | --- | --- |
| 1869 | worker | No mid-session auto-restart after inner crash |
| 1870 | worker | Stop hook blocks ~110s when SDK pool saturated |
| 1871 | worker | generateContext opens fresh SessionStore per call |
| 1875 | worker | Spawns uvx/node/claude by bare name; silent fail in non-interactive |
| 1877 | worker | Cross-session context bleed in same project dir |
| 1879 | worker | Session completion races in-flight summarize |
| 1890 | sdk-pool | SDK session resume during summarize causes context-overflow |
| 1892 | sdk-pool | Memory agent prompt defeats cache (dynamic before static) |
| 1895 | hooks | Stop hook spins 110s when worker older than v12.1.0 |
| 1897 | hooks | PreToolUse:Read lacks PATH export and cache-path lookup |
| 1899 | hooks | SessionStart additionalContext >10KB truncated to 2KB |
| 1902 | hooks | Stop and PostToolUse hooks synchronously block up to 120s |
| 1904 | hooks | UserPromptSubmit hooks skipped in git worktree sessions |
| 1905 | hooks | Saved_hook_context entries pegs CPU 100% on session load |
| 1906 | hooks | PR #1229 fallback path points to source, not cache |
| 1909 | summarizer | Summarize hook doesn't recognize Gemini transcripts |
| 1921 | mcp | Root .mcp.json is empty, mcp-search never registers |
| 1922 | mcp | MCP server uses 3s timeout for corpus prime/query |
| 1929 | installer | "Update now" fails for cache-only installs |
| 1930 | installer | Windows 11 ships smart-explore without tree-sitter |
| 1937 | observer | JSONL files accumulate indefinitely, tens of GB |
| 1938 | observer | Observer background sessions burn tokens with no budget |
| 1939 | cross-platform | Project key uses basename(cwd), fragmenting worktrees |
| 1941 | cross-platform | Linux worker with live-but-unhealthy PID blocks restart |
| 1944 | auth | ANTHROPIC_AUTH_TOKEN not forwarded to SDK subprocess |
| 1945 | auth | Vertex AI CLI auth fails silently on expired OAuth |
| 1947 | plugin-lifecycle | OpenCode tool args as plain objects not Zod schemas |
| 1948 | plugin-lifecycle | OpenClaw installer "plugin not found" |
| 1949 | plugin-lifecycle | OpenClaw per-agent memory isolation broken |
| 1950 | plugin-lifecycle | OpenClaw missing skills, session drift, workspaceDir loss |
| 1952 | db | ON UPDATE CASCADE rewrites historical session attribution |
| 1954 | db | observation_feedback schema mismatch source vs compiled |
| 1958 | viewer | Settings model dropdown destroys precise model IDs |
| 1881-1888 | windows | 8 Windows-specific bugs (paths, spawning, timeouts) |

Remaining — MEDIUM (22)

| # | Component | Issue |
| --- | --- | --- |
| 1872 | worker | Gemini 400/401 triggers 2-min crash-recovery loop |
| 1873 | worker | worker-service.cjs killed by SIGKILL (unbounded heap) |
| 1878 | worker | Logger caches log file path, never rotates |
| 1891 | sdk-pool | Mode prompts in user messages, not system prompt |
| 1893 | sdk-pool | SDK sub-agents hardcoded permissionMode:"default" |
| 1894 | hooks | SessionStart can't find claude at ~/.local/bin |
| 1898 | hooks | SessionStart health-check uses hardcoded port 37777 |
| 1900 | hooks | Setup hook references non-existent scripts/setup.sh |
| 1910 | summarizer | Summary prompt leaks observation tags, ignores user_prompt |
| 1915 | search | Search results not deduplicated |
| 1917 | search | $CMEM context preview shows oldest instead of newest |
| 1920 | search | Context footer "ID" ambiguous across 3 ID spaces |
| 1923 | mcp | smart_outline empty for .txt files |
| 1924 | mcp | chroma-mcp child not terminated on exit |
| 1927 | mcp | chroma-mcp fails on WSL with ALL_PROXY=socks5 |
| 1928 | installer | BranchManager.pullUpdates() fails on cache-layout |
| 1931 | installer | npm run worker:status ENOENT .claude/package.json |
| 1940 | cross-platform | cmux.app wrapper "Claude executable not found" |
| 1946 | auth | OpenRouter 401 Missing Authentication header |
| 1955 | db | Duplicate observations bypass content-hash dedup |
| 1959 | viewer | SSE new_prompt broadcast dies after /reload-plugins |
| 1961 | misc | Traditional Chinese falls back to Simplified |

Remaining — LOW (3)

| # | Component | Issue |
| --- | --- | --- |
| 1919 | search | Shared jsts tree-sitter query applies TS-only to JS |
| 1951 | plugin-lifecycle | OpenClaw lifecycle events stored as observations |
| 1960 | misc | OpenRouter URL hardcoded |

Remaining — NON-LABELED (1)

| # | Component | Issue |
| --- | --- | --- |
| 2054 | installer | installCLI version-pinned alias can't self-update |

Suggested Next Attack Order

Phase 2: Worker stability — DONE

Phase 3: Hooks reliability — DONE

Phase 4: Security hardening — DONE

Phase 5: Search remaining — DONE

Phase 6: MCP + Auth

  • #1925, #1926, #1942, #1943

Phase 7: Windows

  • #1880, #1887, #1881-1888

Phase 8: MCP / Chroma

  • #1925, #1926, #2046, #1921

Phase 9: Everything else

  • Remaining hooks, installer, windows, observer, viewer, auth, plugin-lifecycle

Progress Log

| Time | Action | Result |
| --- | --- | --- |
| 9:40p | #1908 analyzed | Already fixed by Gen 3 coercion. Closed. |
| 9:51p | #1916 fixed | concept→concepts remap in normalizeParams |
| 9:53p | #1913/#2048 fixed | FTS5 fallback in SessionSearch + SearchManager |
| 9:57p | #1953 closed | Already fixed by commit 59ce0fc5 |
| 9:57p | #1957 fixed | Periodic clearFailed() in stale session reaper |
| 9:58p | #1956 fixed | journal_size_limit + periodic WAL checkpoint |
| 10:01p | #1874 fixed | Non-XML responses mark messages failed instead of confirming |
| 10:01p | #1867 fixed | Health endpoint includes activeSessions count |
| 10:02p | build-and-sync | Observations flowing. No regression. |
| 10:03p | PR #2079 created | 2 commits pushed |
| 10:06p | Greptile review | 2 comments — cached isFts5Available(). Fixed + pushed. |
| 10:20p | PR #2079 merged | All reviews passed (CodeRabbit, Greptile, claude-review) |
| 10:25p | v12.3.2 released | Tag pushed, GitHub release created, CHANGELOG updated |