From bf8b7dbd9fc45b097ccfc807d81b3029f1f4210d Mon Sep 17 00:00:00 2001
From: Alessandro Costa
Date: Wed, 1 Apr 2026 22:05:06 -0300
Subject: [PATCH] docs: add architecture overview and production guide

Architecture overview covers the 4-layer system design, hook lifecycle,
data flow, and key patterns (CLAIM-CONFIRM, circuit-breaker, graceful
degradation, deduplication, dual session IDs).

Production guide provides recommended settings, health monitoring
metrics and thresholds, quick health check commands, multi-machine sync
setup, growth expectations, common issues with solutions, and log
analysis tips.

Based on 23 days of production usage with 3,400+ observations across
two physical servers and 8 projects.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 docs/architecture-overview.md | 139 ++++++++++++++++++++++++++++++++++
 docs/production-guide.md      | 111 +++++++++++++++++++++++++++
 2 files changed, 250 insertions(+)
 create mode 100644 docs/architecture-overview.md
 create mode 100644 docs/production-guide.md

diff --git a/docs/architecture-overview.md b/docs/architecture-overview.md
new file mode 100644
index 00000000..ae66db29
--- /dev/null
+++ b/docs/architecture-overview.md
@@ -0,0 +1,139 @@
+# claude-mem Architecture Overview
+
+## System Layers
+
+```
++-----------------------------------------------------------+
+| Claude Code (host)                                        |
+|   +-- Hook System (5 events)                              |
+|   +-- MCP Client (search tools)                           |
++-----------------------------------------------------------+
+| CLI Layer (Bun)                                           |
+|   +-- bun-runner.js (Node->Bun bridge)                    |
+|   +-- hook-command.ts (orchestrator)                      |
+|   +-- handlers/ (context, session-init, observation,      |
+|                  summarize, session-complete)             |
++-----------------------------------------------------------+
+| Worker Daemon (Express, port 37777)                       |
+|   +-- SessionManager (session lifecycle)                  |
+|   +-- SDKAgent (Claude Agent SDK)                         |
+|   +-- SearchManager (search orchestration)                |
+|   +-- ProcessRegistry (subprocess management)             |
+|   +-- ChromaSync (embedding synchronization)              |
++-----------------------------------------------------------+
+| Storage Layer                                             |
+|   +-- SQLite (claude-mem.db) -- structured data           |
+|   +-- ChromaDB (chroma.sqlite3) -- vector embeddings      |
+|   +-- MCP Server (interface for Claude Code)              |
++-----------------------------------------------------------+
+```
+
+## Hook Lifecycle
+
+| Event | Handler | What it does | Timeout |
+|-------|---------|-------------|---------|
+| Setup | setup.sh | Install system dependencies | 300s |
+| SessionStart | smart-install.js + context | Install deps + start worker + inject context | 60s |
+| UserPromptSubmit | session-init | Register session + start SDK agent + semantic injection | 60s |
+| PostToolUse | observation | Capture tool usage -> enqueue in worker | 120s |
+| Stop | summarize + session-complete | Request summary + end session | 120s+30s |
+
+## Data Flow
+
+```
+User prompt -> session-init -> /api/sessions/init + /api/context/semantic
+    |
+Tool use -> observation -> /api/sessions/observations
+    |                          |
+    |                      PendingMessageStore.enqueue()
+    |                          |
+    |                      SDKAgent.startSession()
+    |                          |
+    |                      Claude Agent SDK -> ResponseProcessor
+    |                          |
+    |                      +-- storeObservations() -> SQLite
+    |                      +-- chromaSync.sync() -> ChromaDB
+    |                      +-- broadcastObservation() -> SSE/UI
+    |
+Stop -> summarize -> /api/sessions/summarize
+     -> session-complete -> /api/sessions/complete + drain
+```
+
+## Key Patterns
+
+### CLAIM-CONFIRM (PendingMessageStore)
+
+```
+enqueue()          -> INSERT status='pending'
+claimNextMessage() -> UPDATE status='processing' (atomic)
+confirmProcessed() -> DELETE (success)
+markFailed()       -> UPDATE status='failed' (retry < 3)
+
+Self-healing: messages in 'processing' for >60s reset to 'pending'
+```
+
+### Circuit-Breaker (SessionRoutes)
+
+```
+Generator crash -> retry 1 (1s) -> retry 2 (2s) -> retry 3 (4s)
+  -> consecutiveRestarts > 3 -> CIRCUIT-BREAKER
+  -> markAllSessionMessagesAbandoned(sessionDbId)
+  -> Stop. No infinite loop.
+```
+
+Counter resets to 0 when generator completes work naturally.
+
+### Graceful Degradation (hook-command.ts)
+
+```
+Transport errors (ECONNREFUSED, timeout, 5xx) -> exit 0 (never block Claude Code)
+Client bugs (4xx, TypeError, ReferenceError)  -> exit 2 (blocking, needs fix)
+```
+
+The worker being unavailable NEVER blocks the user's Claude Code session.
+
+### Deduplication (observations)
+
+```
+SHA256(memory_session_id + title + narrative) -> content_hash
+If hash exists within 30s window -> return existing ID (no insert)
+```
+
+### Two Types of Session ID
+
+- `contentSessionId`: from Claude Code, invariant during the session
+- `memorySessionId`: from SDK Agent, changes on each worker restart
+
+The conversion between them is handled by SessionStore and is critical for FK constraints.
+
+## Storage
+
+### SQLite (claude-mem.db)
+
+| Table | Key fields | Purpose |
+|-------|-----------|---------|
+| sdk_sessions | content_session_id, memory_session_id, status | Session lifecycle |
+| observations | memory_session_id, type, title, narrative, content_hash | Tool usage observations |
+| session_summaries | memory_session_id, request, learned, completed | Session summaries |
+| user_prompts | content_session_id, prompt_text | User prompt history |
+| pending_messages | session_db_id, status, message_type | CLAIM-CONFIRM queue |
+| observation_feedback | observation_id, signal_type | Usage tracking |
+
+### ChromaDB (chroma.sqlite3)
+
+Vector embeddings for semantic search. Each observation generates multiple documents:
+
+```
+obs_{id}_narrative -> main text
+obs_{id}_fact_0    -> first fact
+obs_{id}_fact_1    -> second fact
+...
+```
+
+Accessed via chroma-mcp (MCP process), communication over stdio.
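Editor's sketch (not part of the patch): the CLAIM-CONFIRM queue behind the `pending_messages` table can be illustrated with Python's standard `sqlite3` module. The real PendingMessageStore is part of the TypeScript worker and its code is not shown here, so the `payload`, `retry_count`, and `claimed_at` columns, the function names, and the reading of "(retry < 3)" as "requeue until the third failure" are all assumptions made for illustration.

```python
import sqlite3
import time

# Hypothetical schema: only session_db_id, status and message_type come from
# the overview; payload, retry_count and claimed_at are illustrative guesses.
SCHEMA = """
CREATE TABLE pending_messages (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    session_db_id INTEGER NOT NULL,
    message_type  TEXT    NOT NULL,
    payload       TEXT    NOT NULL,
    status        TEXT    NOT NULL DEFAULT 'pending',
    retry_count   INTEGER NOT NULL DEFAULT 0,
    claimed_at    REAL
);
"""

MAX_RETRIES = 3     # assumed meaning of "(retry < 3)"
STALE_AFTER_S = 60  # 'processing' rows older than this are reset


def enqueue(db, session_db_id, message_type, payload):
    """Insert a new message in the 'pending' state and return its id."""
    with db:
        cur = db.execute(
            "INSERT INTO pending_messages (session_db_id, message_type, payload)"
            " VALUES (?, ?, ?)",
            (session_db_id, message_type, payload),
        )
    return cur.lastrowid


def claim_next_message(db, now=None):
    """Reset stale claims, then claim the oldest pending row in one transaction."""
    now = time.time() if now is None else now
    with db:
        # Self-healing: a worker that died mid-message leaves a stale claim.
        db.execute(
            "UPDATE pending_messages SET status = 'pending', claimed_at = NULL"
            " WHERE status = 'processing' AND claimed_at < ?",
            (now - STALE_AFTER_S,),
        )
        row = db.execute(
            "SELECT id, payload FROM pending_messages"
            " WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        db.execute(
            "UPDATE pending_messages SET status = 'processing', claimed_at = ?"
            " WHERE id = ?",
            (now, row[0]),
        )
        return row


def confirm_processed(db, message_id):
    """Delete the row only after the work has succeeded."""
    with db:
        db.execute("DELETE FROM pending_messages WHERE id = ?", (message_id,))


def mark_failed(db, message_id):
    """Requeue until MAX_RETRIES failures, then park the row as 'failed'."""
    with db:
        db.execute(
            "UPDATE pending_messages SET"
            " status = CASE WHEN retry_count + 1 < ? THEN 'pending' ELSE 'failed' END,"
            " retry_count = retry_count + 1, claimed_at = NULL"
            " WHERE id = ?",
            (MAX_RETRIES, message_id),
        )
```

Because a message is deleted only in `confirm_processed`, a crash between claim and confirm loses nothing: the row stays in `processing` and becomes claimable again once the 60-second window passes.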
+
+## Process Management
+
+- **ProcessRegistry:** Tracks all Claude SDK subprocesses, manages PID lifecycle
+- **Orphan Reaper (5min):** Kills processes with no active session
+- **GracefulShutdown:** 7-step shutdown (PID file, children, HTTP server, sessions, MCP, DB, force-kill)
diff --git a/docs/production-guide.md b/docs/production-guide.md
new file mode 100644
index 00000000..b02500da
--- /dev/null
+++ b/docs/production-guide.md
@@ -0,0 +1,111 @@
+# claude-mem Production Guide
+
+Practical guide based on 23 days of production usage with 3,400+ observations across two physical servers and 8 projects.
+
+## Recommended Settings
+
+| Setting | Default | Recommended | Why |
+|---------|---------|-------------|-----|
+| CLAUDE_MEM_MAX_CONCURRENT_AGENTS | 2 | 3 | Better throughput without overload |
+| CLAUDE_MEM_SEMANTIC_INJECT | (new) | true | Relevant context >> recent context |
+| CLAUDE_MEM_SEMANTIC_INJECT_LIMIT | (new) | 5 | Sweet spot for token cost vs coverage |
+| CLAUDE_MEM_TIER_ROUTING_ENABLED | (new) | true | ~52% cost savings, no quality loss |
+
+## Health Monitoring
+
+### Key metrics to watch
+
+| Metric | Healthy | Warning | Action |
+|--------|---------|---------|--------|
+| pending_messages (pending) | 0-5 | >10 | Check worker logs; the worker may need a restart |
+| pending_messages (failed) | 0 | >0 and growing | Circuit-breaker may be tripping |
+| sdk_sessions (active) | 0-3 | >5 stuck | Orphaned sessions; restart the worker |
+| WAL size | <10 MB | >20 MB | Run `PRAGMA wal_checkpoint(TRUNCATE)` |
+| Chroma size | Growing slowly | Sudden jump | Check for sync loops |
+| Errors/day in logs | 0-2 | >10 | Investigate log patterns |
+
+### Quick health check
+
+```bash
+# Check worker status
+curl -s http://127.0.0.1:37777/api/health | python3 -m json.tool
+
+# Check database stats
+sqlite3 ~/.claude-mem/claude-mem.db "
+  SELECT 'observations' as metric, COUNT(*) as value FROM observations
+  UNION ALL SELECT 'summaries', COUNT(*) FROM session_summaries
+  UNION ALL SELECT 'pending', COUNT(*) FROM pending_messages WHERE status='pending'
+  UNION ALL SELECT 'active_sessions', COUNT(*) FROM sdk_sessions WHERE status='active';
+"
+```
+
+## Multi-Machine Setup
+
+If running claude-mem on multiple machines, use `claude-mem-sync` to keep observations in sync:
+
+```bash
+claude-mem-sync push    # local -> remote
+claude-mem-sync pull    # remote -> local
+claude-mem-sync sync    # bidirectional
+claude-mem-sync status  # compare counts
+```
+
+Deduplication is by `(created_at, title)`, so these commands are safe to run repeatedly.
+
+## Growth Expectations
+
+Based on active daily development usage:
+
+| Metric | Per day | Per month | Notes |
+|--------|---------|-----------|-------|
+| Observations | ~120 | ~3,600 | Varies with coding activity |
+| Summaries | ~40 | ~1,200 | One per session |
+| SQLite | ~0.8 MB | ~24 MB | ~5 KB per observation |
+| Chroma | ~4 MB | ~120 MB | ~50 KB per observation (embeddings) |
+
+## Common Issues and Solutions
+
+### Summarize error loop
+
+**Symptom:** Repeated `[ERROR] Missing last_assistant_message` in logs.
+**Cause:** A transcript with no assistant messages triggers a summary attempt that fails repeatedly.
+**Fix:** PR #1566, which skips the summary when the transcript is empty.
+
+### Chroma sync failures
+
+**Symptom:** `[ERROR] Batch add failed... IDs already exist`
+**Cause:** An MCP timeout during add leaves partial writes; the retry then fails on the existing IDs.
+**Fix:** PR #1566, which falls back to update (upsert pattern).
+
+### Port conflict on startup
+
+**Symptom:** `Worker failed to start... Is port 37777 in use?`
+**Cause:** Two sessions start simultaneously and the HTTP check is non-atomic (TOCTOU race).
+**Fix:** PR #1566, which uses an atomic socket bind on Unix.
+
+### Orphaned pending messages
+
+**Symptom:** `pending_messages` table growing with old entries for completed sessions.
+**Cause:** SIGTERM kills the generator before the queue is drained.
+**Fix:** PR #1567, which drains the queue after deleteSession().
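Editor's sketch (not part of the patch): until the fix for orphaned pending messages is deployed, leftover queue rows can be listed with a read-only query. The example below uses Python's standard `sqlite3` module and assumes that `sdk_sessions` has an integer `id` primary key referenced by `pending_messages.session_db_id`; the guide does not show the schema, so treat the join column as hypothetical and check it against your database first.

```python
import sqlite3

# Assumption: sdk_sessions.id is the key that pending_messages.session_db_id
# points at. Only the 'active' status value is taken from the guide.
ORPHAN_QUERY = """
SELECT pm.id, pm.session_db_id, pm.status
FROM pending_messages AS pm
LEFT JOIN sdk_sessions AS s ON s.id = pm.session_db_id
WHERE s.id IS NULL OR s.status != 'active'
ORDER BY pm.id
"""


def open_readonly(path):
    """Open the database read-only so a health check can never modify it."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def find_orphaned_messages(db):
    """Return queue rows whose session is missing or no longer active."""
    return db.execute(ORPHAN_QUERY).fetchall()
```

A non-empty result that keeps growing between checks matches the symptom described above; the sketch only reports rows and leaves cleanup to the worker.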
+
+### Context not relevant to current topic
+
+**Symptom:** Claude receives observations about CSS when you're asking about authentication.
+**Cause:** Default recency-based injection selects the most recent observations, not the most relevant ones.
+**Fix:** PR #1568, which adds semantic injection via Chroma on every prompt.
+
+## Log Analysis Tips
+
+```bash
+# Count errors by day (requires GNU sed and GNU grep for \n and -P)
+grep '\[ERROR\]' ~/.claude-mem/logs/claude-mem-*.log | \
+  sed 's/\[20[0-9][0-9]-[0-9][0-9]-/\n&/g' | \
+  grep -oP '^\[20\d{2}-\d{2}-\d{2}' | sort | uniq -c
+
+# Find circuit-breaker trips
+grep 'circuit\|Circuit\|ABANDONED\|abandoned' ~/.claude-mem/logs/claude-mem-*.log
+
+# Check pending message health
+grep 'CLAIMED\|CONFIRMED\|FAILED\|ABANDONED' ~/.claude-mem/logs/claude-mem-$(date +%Y-%m-%d).log | tail -20
+```
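Editor's sketch (not part of the patch): the "count errors by day" pipeline relies on GNU-only features, so a portable equivalent in Python may be useful on macOS or BSD. It assumes, as the grep pattern implies, that each log line carries a bracketed timestamp starting with `[20YY-MM-DD` and that error lines contain the literal `[ERROR]`; the function names are illustrative.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: timestamps look like "[2026-04-01 ...", matching the grep above.
DAY_RE = re.compile(r"\[(20\d{2}-\d{2}-\d{2})")


def count_errors_by_day(lines):
    """Count lines containing [ERROR], grouped by the first date on the line."""
    counts = Counter()
    for line in lines:
        if "[ERROR]" not in line:
            continue
        match = DAY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return dict(sorted(counts.items()))


def count_errors_in_logs(log_dir):
    """Aggregate error counts over every claude-mem-*.log file in log_dir."""
    totals = Counter()
    for path in sorted(Path(log_dir).glob("claude-mem-*.log")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            totals.update(count_errors_by_day(handle))
    return dict(sorted(totals.items()))
```

Calling `count_errors_in_logs(Path.home() / ".claude-mem" / "logs")` should give per-day totals comparable to the shell pipeline, which can then be checked against the "Errors/day in logs" threshold in the health table.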