# Claude-Mem Worker Server Architecture **Document Version:** 1.0 **Last Updated:** 2025-01-24 **Author:** Analysis by Claude Code **Purpose:** Comprehensive technical analysis of the worker server architecture, logic flow, blocking behavior, and component value assessment --- ## Executive Summary The claude-mem worker server is a long-running HTTP service managed by PM2 that processes tool execution observations and generates session summaries using the Claude Agent SDK. It implements a **defensive, layered architecture** designed to maximize data persistence while maintaining flexibility. ### Key Design Principles 1. **Maximally Permissive Storage** - System defaults to saving data even if incomplete 2. **Auto-Recovery** - Worker restarts don't prevent processing (session state reconstructed from database) 3. **Queue-Based Processing** - HTTP API decoupled from AI processing for reliability 4. **Defensive Programming** - Auto-creates missing database records, accepts null fields 5. **Session Isolation** - Each session has independent state and SDK agent ### Architecture at a Glance ``` ┌─────────────────────────────────────────────────────────────┐ │ Layer 1: HTTP API (Express.js) │ │ - 6 REST endpoints │ │ - Always queues messages (maximally permissive) │ └──────────────────┬──────────────────────────────────────────┘ │ ┌──────────────────▼──────────────────────────────────────────┐ │ Layer 2: In-Memory Queue │ │ - pendingMessages array per session │ │ - VULNERABILITY: Lost on worker restart │ └──────────────────┬──────────────────────────────────────────┘ │ ┌──────────────────▼──────────────────────────────────────────┐ │ Layer 3: SDK Agent (Claude Agent SDK) │ │ - Processes queued messages via async generator │ │ - Can fail due to config or AI errors │ └──────────────────┬──────────────────────────────────────────┘ │ ┌──────────────────▼──────────────────────────────────────────┐ │ Layer 4: Parser (XML Extraction) │ │ - Extracts observations and summaries from AI responses │ │ - Permissive (v4.2.5/v4.2.6 fixes ensure partial data saved)│ └──────────────────┬──────────────────────────────────────────┘ │ ┌──────────────────▼──────────────────────────────────────────┐ │ Layer 5: Database (SQLite with better-sqlite3) │ │ - Permanent storage (once here, data persists) │ │ - Auto-creates missing sessions, accepts nulls │ └─────────────────────────────────────────────────────────────┘ ``` **Critical Insight:** Data can only be lost between layers 2-4. Once it reaches the database (layer 5), it's permanent. --- ## Component Inventory ### HTTP REST API Endpoints | Endpoint | Purpose | Blocks Data? | |----------|---------|--------------| | `GET /health` | Worker health check | N/A | | `POST /sessions/:id/init` | Initialize session and start SDK agent | Only if session not in DB (expected) | | `POST /sessions/:id/observations` | Queue tool observation | ❌ Never (auto-recovery) | | `POST /sessions/:id/summarize` | Queue summary request | ❌ Never (auto-recovery) | | `GET /sessions/:id/status` | Get session status | N/A | | `DELETE /sessions/:id` | Abort session | ⚠️ Queued messages lost | ### Core Processing Components | Component | File | Lines | Purpose | |-----------|------|-------|---------| | WorkerService | worker-service.ts | 52-590 | Main service class, manages sessions | | runSDKAgent | worker-service.ts | 345-404 | Runs SDK agent for a session | | createMessageGenerator | worker-service.ts | 410-502 | Async generator feeding SDK | | handleAgentMessage | worker-service.ts | 508-563 | Parses and stores SDK responses | | parseObservations | parser.ts | 32-96 | Extracts observations from XML | | parseSummary | parser.ts | 102-157 | Extracts summary from XML | | SessionStore | SessionStore.ts | 9-1086 | Database operations | --- ## Deep Dive: HTTP Endpoints ### GET /health (lines 100-109) **Purpose:** Health check for monitoring and debugging **Logic Flow:** 1. Returns JSON with status, port, PID, active sessions, uptime, memory **Blocking Analysis:** ❌ N/A (read-only endpoint) **Value Assessment:** ✅ HIGH VALUE - Essential for monitoring worker health - Helps debug port conflicts and process state - Keep as-is --- ### POST /sessions/:sessionDbId/init (lines 115-169) **Purpose:** Initialize a new session and start the SDK agent **Logic Flow:** 1. Parse `sessionDbId` from URL 2. Extract `project` and `userPrompt` from request body 3. Fetch session from database using `SessionStore.getSessionById()` 4. **CRITICAL CHECK:** Return 404 if session not found in DB 5. Retrieve `claudeSessionId` from database record 6. Create `ActiveSession` object with initial state: ```typescript { sessionDbId, claudeSessionId, sdkSessionId: null, project, userPrompt, pendingMessages: [], abortController: new AbortController(), generatorPromise: null, lastPromptNumber: 0, observationCounter: 0, startTime: Date.now() } ``` 7. Store session in memory map (`this.sessions`) 8. Update `worker_port` in database 9. Start `runSDKAgent()` in background (fire-and-forget promise) 10. Return success response immediately **Blocking Analysis:** ⚠️ CONDITIONAL - Returns 404 if session doesn't exist in database - This is expected behavior - session must be created before init - Doesn't prevent future initialization attempts - Error logged and hook can retry **Value Assessment:** ✅ HIGH VALUE - Critical initialization step - Background SDK agent startup prevents timeout - Keep as-is **Edge Cases:** - Session exists but SDK agent fails to start → Session marked as failed, but new init can retry - Multiple init calls for same session → First one wins (subsequent calls find session in memory) --- ### POST /sessions/:sessionDbId/observations (lines 175-230) **Purpose:** Queue a tool execution observation for processing **Logic Flow:** 1. Parse `sessionDbId` from URL 2. Extract `tool_name`, `tool_input`, `tool_output`, `prompt_number` from body 3. Check if session exists in memory map (`this.sessions.get(sessionDbId)`) 4. **AUTO-RECOVERY** (lines 181-209): If session NOT in memory: - Fetch session from database - Recreate `ActiveSession` object - Start new SDK agent in background - This enables recovery from worker restarts! 5. Increment `observationCounter` for correlation ID tracking 6. Push observation message to `pendingMessages` queue: ```typescript { type: 'observation', tool_name, tool_input, tool_output, prompt_number } ``` 7. Return success with queue length **Blocking Analysis:** ❌ NEVER BLOCKS - Auto-creates session state from database if missing - Always queues the observation - HTTP response confirms receipt immediately - Processing happens asynchronously **Value Assessment:** ✅ HIGH VALUE - Auto-recovery is brilliant design - Worker restart doesn't lose ability to process observations - Keep as-is **Edge Cases:** - Worker restart while observation in queue → Lost (queue is in-memory) - But NEW observations after restart are queued successfully (auto-recovery) - Database not found → Would throw error, but SessionStore auto-creates sessions --- ### POST /sessions/:sessionDbId/summarize (lines 236-284) **Purpose:** Queue a summary generation request **Logic Flow:** 1. Parse `sessionDbId` and `prompt_number` from request 2. Check if session exists in memory 3. **AUTO-RECOVERY** (lines 241-270): Same pattern as observations endpoint - Fetches session from database - Recreates `ActiveSession` object - Starts new SDK agent 4. Push summarize message to `pendingMessages` queue: ```typescript { type: 'summarize', prompt_number } ``` 5. Return success with queue length **Blocking Analysis:** ❌ NEVER BLOCKS - Same auto-recovery mechanism as observations - Always queues the summary request - Processing happens asynchronously **Value Assessment:** ✅ HIGH VALUE - Auto-recovery pattern prevents data loss - Keep as-is **Code Quality Note:** ⚠️ MEDIUM - Duplicated auto-recovery code (lines 181-209 and 241-270 are nearly identical) - Could extract to helper function: `getOrCreateSession(sessionDbId)` - Would reduce duplication and improve maintainability --- ### GET /sessions/:sessionDbId/status (lines 289-304) **Purpose:** Get current session status and queue length **Logic Flow:** 1. Parse `sessionDbId` from URL 2. Get session from memory map 3. Return 404 if not found 4. Return session info: `sessionDbId`, `sdkSessionId`, `project`, `pendingMessages.length` **Blocking Analysis:** ❌ N/A (read-only endpoint) **Value Assessment:** ✅ MEDIUM VALUE - Useful for debugging - Not critical for core functionality - Keep as-is --- ### DELETE /sessions/:sessionDbId (lines 309-340) **Purpose:** Abort a running session and clean up **Logic Flow:** 1. Parse `sessionDbId` from URL 2. Get session from memory map 3. Return 404 if not found 4. Call `abortController.abort()` to signal SDK agent to stop 5. Wait for `generatorPromise` to finish (max 5 seconds timeout) 6. Mark session as 'failed' in database 7. Delete session from memory map 8. Return success **Blocking Analysis:** ⚠️ BLOCKS QUEUED MESSAGES - Aborts SDK agent processing - Any messages in `pendingMessages` queue are lost - Already-stored observations/summaries remain in database **Value Assessment:** ✅ MEDIUM VALUE - Provides clean shutdown mechanism - Used for manual cleanup - As of v4.1.0, SessionEnd hook doesn't call DELETE (graceful cleanup) - Keep for manual intervention, but not used automatically **Historical Note:** - v4.0.x: SessionEnd hook called DELETE → interrupted summary generation - v4.1.0+: Graceful cleanup → workers finish naturally --- ## Deep Dive: SDK Agent Processing ### runSDKAgent (lines 345-404) **Purpose:** Core processing engine that runs continuously for each session **Logic Flow:** 1. Call `query()` from Claude Agent SDK with: ```typescript { prompt: this.createMessageGenerator(session), options: { model: MODEL, // from CLAUDE_MEM_MODEL env var disallowedTools: DISALLOWED_TOOLS, abortController: session.abortController, pathToClaudeCodeExecutable: claudePath } } ``` 2. Iterate over SDK responses using `for await` 3. For each assistant message: - Extract text content from response - Log response size - Call `handleAgentMessage()` to parse and store 4. On completion: - Log session duration - Mark session as 'completed' in database - Delete session from memory map 5. On error: - Log error (or warning for AbortError) - Mark session as 'failed' in database - Throw error (caught by `generatorPromise.catch()`) **Blocking Analysis:** ⚠️ CAN BLOCK IF: - Invalid `CLAUDE_MEM_MODEL` → SDK initialization fails - Invalid `CLAUDE_CODE_PATH` → SDK initialization fails - SDK crashes → Session marked as failed - BUT: Doesn't prevent NEW sessions from being created **Value Assessment:** ✅ HIGH VALUE - Core processing engine - Proper error handling with session status tracking - Keep as-is **Configuration Dependencies:** - `CLAUDE_MEM_MODEL` (default: 'claude-sonnet-4-5') - `CLAUDE_CODE_PATH` (default: specific path to claude executable) - `DISALLOWED_TOOLS` (default: ['Glob', 'Grep', 'ListMcpResourcesTool', 'WebSearch']) **Edge Cases:** - SDK hangs → AbortController provides cancellation mechanism - Network issues → SDK handles retries internally - Multiple concurrent sessions → Each has isolated SDK agent --- ### createMessageGenerator (lines 410-502) **Purpose:** Async generator that feeds messages to the SDK agent (bridge between HTTP API and SDK) **Logic Flow:** 1. Build init prompt using `buildInitPrompt(project, claudeSessionId, userPrompt)` 2. Yield initial user message: ```typescript { type: 'user', session_id: claudeSessionId, // Real Claude Code session ID parent_tool_use_id: null, message: { role: 'user', content: initPrompt } } ``` 3. Enter infinite loop (`while (true)`): - Check `abortController.signal.aborted` → break if aborted - If no pending messages → sleep 100ms and continue - While `pendingMessages.length > 0`: - Shift message from queue (FIFO) - If type === 'summarize': - Update `lastPromptNumber` - Fetch session from database - Build summary prompt using `buildSummaryPrompt()` - Yield summary prompt as user message - If type === 'observation': - Update `lastPromptNumber` - Build observation prompt using `buildObservationPrompt()` - Yield observation prompt as user message **Blocking Analysis:** ❌ DOESN'T BLOCK - Continuously processes queue until aborted - 100ms polling means small delay but no data loss - Messages shifted from queue and sent to SDK - If SDK fails, messages lost from queue (but already confirmed via HTTP) **Value Assessment:** ✅ HIGH VALUE - Elegant async generator pattern - Keep as-is **Performance Note:** ⚠️ 100ms polling interval - Could be improved with event-driven queue (e.g., `AsyncQueue` with notifications) - Current implementation is simple and works well - Low priority optimization **Data Flow:** ``` HTTP /observations → pendingMessages.push() → [sleep 100ms] → pendingMessages.shift() → buildObservationPrompt() → yield to SDK → SDK processes → handleAgentMessage() ``` --- ### handleAgentMessage (lines 508-563) **Purpose:** Parse SDK response and store observations/summaries in database **Logic Flow:** 1. Call `parseObservations(content, correlationId)` 2. If observations found: - For each observation: - Call `db.storeObservation(claudeSessionId, project, observation, promptNumber)` - Log success with correlation ID 3. Call `parseSummary(content, sessionId)` 4. If summary found: - Call `db.storeSummary(claudeSessionId, project, summary, promptNumber)` - Log success 5. If NO summary found: - Log warning with content sample **Blocking Analysis:** ⚠️ CAN BLOCK IF: - Parser returns empty array/null → Nothing stored (but this is expected for routine operations) - Database error → Would throw and crash handler (rare with permissive schema) **Value Assessment:** ✅ HIGH VALUE - Core storage logic - Proper logging for debugging - Keep as-is **Critical Dependencies:** - `parseObservations()` must return valid observations - `parseSummary()` must return valid summary - Database must accept the data (schema constraints) **Logging:** - Extensive logging at INFO, SUCCESS, and WARN levels - Correlation IDs for tracking individual observations - Debug mode logs full SDK responses --- ## Deep Dive: Parser System ### parseObservations (parser.ts lines 32-96) **Purpose:** Extract observation XML blocks from SDK response and parse into structured data **Logic Flow:** 1. Use regex to find all `...` blocks (non-greedy): ```typescript /([\s\S]*?)<\/observation>/g ``` 2. For each block: - Extract all fields: `type`, `title`, `subtitle`, `narrative`, `facts`, `concepts`, `files_read`, `files_modified` - **VALIDATION** (lines 52-67): - If `type` is missing or invalid → default to "change" - Valid types: `['bugfix', 'feature', 'refactor', 'change', 'discovery', 'decision']` - All other fields can be null - Filter out `type` from `concepts` array (types and concepts are separate dimensions) - Push observation to results array 3. Return all observations **Blocking Analysis:** ❌ NEVER BLOCKS (as of v4.2.6) - **CRITICAL FIX** (v4.2.6): Removed validation that required title, subtitle, and narrative - Comment on line 52: "NOTE FROM THEDOTMACK: ALWAYS save observations - never skip. 10/24/2025" - Always returns observations with whatever fields exist - Only transformation: type defaults to "change" if invalid **Value Assessment:** ✅ HIGH VALUE - Permissive parsing ensures data is never lost - v4.2.6 fix was critical for reliability - Keep as-is **Historical Context:** - **Before v4.2.6:** Would skip observations missing required fields → data loss - **After v4.2.6:** Always saves with defaults → maximally permissive **Edge Cases:** 1. No `` tags → Returns empty array (normal for routine operations) 2. All fields empty → Returns observation with null fields and type="change" 3. Malformed XML → Regex won't match → Returns empty array (data loss) 4. Type in concepts → Filtered out (types and concepts are orthogonal) **Example:** ```xml feature Authentication added Implemented OAuth2 flow Added OAuth2 provider configuration Created callback endpoint Full OAuth2 authentication... how-it-works what-changed src/auth/oauth.ts src/auth/oauth.ts ``` --- ### parseSummary (parser.ts lines 102-157) **Purpose:** Extract summary XML block from SDK response **Logic Flow:** 1. Check for `` tag (lines 104-113) - If found → log reason and return null (intentional skip) 2. Match `...` block (non-greedy): ```typescript /([\s\S]*?)<\/summary>/ ``` - If not found → return null (SDK didn't provide summary) 3. Extract all fields: `request`, `investigated`, `learned`, `completed`, `next_steps`, `notes` (optional) 4. **VALIDATION REMOVED** (lines 133-147): - Comment: "NOTE FROM THEDOTMACK: 100% of the time we must SAVE the summary, even if fields are missing. 10/24/2025" - Comment: "NEVER DO THIS NONSENSE AGAIN." - Old code checked if all required fields present → would return null - New code returns summary with whatever fields exist 5. Return `ParsedSummary` object **Blocking Analysis:** ⚠️ MINIMAL BLOCKING (as of v4.2.5) - `` tag → Returns null (intentional, not a bug) - Missing `` tags → Returns null (SDK didn't provide) - Missing fields within `` → Does NOT block anymore (v4.2.5 fix) **Value Assessment:** ✅ HIGH VALUE - v4.2.5 fix ensures partial summaries are saved - Keep as-is **Historical Context:** - **Before v4.2.5:** Would return null if any required field missing → data loss - **After v4.2.5:** Returns summary with whatever fields exist → maximally permissive **Edge Cases:** 1. `` → Returns null, logs reason 2. No `` tags → Returns null (SDK didn't generate summary) 3. `` with all empty fields → Returns summary with empty/null strings 4. Malformed XML → Regex won't match → Returns null (data loss) **Example:** ```xml Add OAuth2 authentication Reviewed existing auth system System uses JWT tokens for sessions Implemented OAuth2 provider integration Test with production credentials Need to configure callback URLs in provider dashboard ``` --- ## Deep Dive: Database Layer ### SessionStore.storeObservation (SessionStore.ts lines 901-964) **Purpose:** Store a parsed observation in the database **Logic Flow:** 1. **AUTO-CREATE SESSION** (lines 920-940): - Check if `sdk_session_id` exists in `sdk_sessions` table - If NOT found: - Auto-create session record - Log: "Auto-created session record for session_id: {id}" - This prevents foreign key constraint errors 2. Prepare INSERT statement: ```sql INSERT INTO observations (sdk_session_id, project, type, title, subtitle, facts, narrative, concepts, files_read, files_modified, prompt_number, created_at, created_at_epoch) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) ``` 3. Insert observation with: - `facts`, `concepts`, `files_read`, `files_modified` → JSON.stringify() - Timestamps auto-generated - All fields as-is (nulls allowed) **Blocking Analysis:** ❌ NEVER BLOCKS - Auto-creates missing sessions (defensive programming) - All fields nullable (except required ones) - No validation checks that could fail - Schema is permissive **Value Assessment:** ✅ HIGH VALUE - Auto-creation pattern is brilliant - Prevents foreign key errors - Keep as-is **Schema Constraints:** - `type` must be one of 6 valid types (CHECK constraint) - BUT: Parser ensures type is always valid (defaults to "change") - `sdk_session_id` has foreign key to `sdk_sessions` - BUT: Auto-creation ensures session exists - Arrays stored as JSON strings **Edge Cases:** - Session doesn't exist → Auto-created - Invalid type → Parser prevents this (defaults to "change") - Null fields → Allowed by schema --- ### SessionStore.storeSummary (SessionStore.ts lines 970-1029) **Purpose:** Store a parsed summary in the database **Logic Flow:** 1. **AUTO-CREATE SESSION** (lines 987-1007): - Same defensive pattern as `storeObservation()` - Ensures session exists before INSERT 2. Prepare INSERT statement: ```sql INSERT INTO session_summaries (sdk_session_id, project, request, investigated, learned, completed, next_steps, notes, prompt_number, created_at, created_at_epoch) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) ``` 3. Insert summary with: - All content fields as-is (nulls allowed) - Timestamps auto-generated **Blocking Analysis:** ❌ NEVER BLOCKS - Auto-creates missing sessions - All content fields nullable - No validation checks - Multiple summaries per session allowed (migration 7 removed UNIQUE constraint) **Value Assessment:** ✅ HIGH VALUE - Auto-creation ensures reliability - Nullable fields allow partial data - Keep as-is **Schema Evolution:** - **Before migration 7:** `sdk_session_id` had UNIQUE constraint → Only one summary per session - **After migration 7:** UNIQUE removed → Multiple summaries per session (one per prompt) **Edge Cases:** - Session doesn't exist → Auto-created - All fields null/empty → Allowed - Multiple summaries for same session → Allowed (migration 7) --- ### Database Schema Constraints #### observations table ```sql CREATE TABLE observations ( id INTEGER PRIMARY KEY AUTOINCREMENT, sdk_session_id TEXT NOT NULL, -- Foreign key project TEXT NOT NULL, text TEXT, -- Nullable (deprecated, migration 9) type TEXT NOT NULL CHECK(type IN ('decision', 'bugfix', 'feature', 'refactor', 'discovery', 'change')), title TEXT, -- Nullable subtitle TEXT, -- Nullable facts TEXT, -- Nullable (JSON array) narrative TEXT, -- Nullable concepts TEXT, -- Nullable (JSON array) files_read TEXT, -- Nullable (JSON array) files_modified TEXT, -- Nullable (JSON array) prompt_number INTEGER, -- Nullable created_at TEXT NOT NULL, created_at_epoch INTEGER NOT NULL, FOREIGN KEY(sdk_session_id) REFERENCES sdk_sessions(sdk_session_id) ON DELETE CASCADE ); ``` **Blocking Potential:** - Invalid `type` → CHECK constraint violation - Mitigated by: Parser defaults to "change" - Missing `sdk_session_id` → Foreign key violation - Mitigated by: Auto-creation in storeObservation() #### session_summaries table ```sql CREATE TABLE session_summaries ( id INTEGER PRIMARY KEY AUTOINCREMENT, sdk_session_id TEXT NOT NULL, -- No longer UNIQUE (migration 7) project TEXT NOT NULL, request TEXT, -- Nullable investigated TEXT, -- Nullable learned TEXT, -- Nullable completed TEXT, -- Nullable next_steps TEXT, -- Nullable notes TEXT, -- Nullable prompt_number INTEGER, -- Nullable created_at TEXT NOT NULL, created_at_epoch INTEGER NOT NULL, FOREIGN KEY(sdk_session_id) REFERENCES sdk_sessions(sdk_session_id) ON DELETE CASCADE ); ``` **Blocking Potential:** - Missing `sdk_session_id` → Foreign key violation - Mitigated by: Auto-creation in storeSummary() **Key Design Decisions:** 1. **Nullable fields** - Allows partial data to be saved 2. **Auto-creation** - Prevents foreign key errors 3. **No UNIQUE constraints** (migration 7) - Multiple summaries per session 4. **WAL mode** - Better concurrency for multiple sessions 5. **JSON arrays** - Flexible storage for lists (facts, concepts, files) --- ## Deep Dive: Prompt System ### buildInitPrompt (prompts.ts lines 24-125) **Purpose:** Generate initial prompt that instructs the SDK agent what to observe and how to record **Content:** 1. **Role Definition:** "You are observing a development session to create searchable memory FOR FUTURE SESSIONS" 2. **Critical Instruction:** "Record what was BUILT/FIXED/DEPLOYED/CONFIGURED, not what you (the observer) are doing" 3. **What to Record:** Focus on deliverables, capabilities, technical changes 4. **When to Skip:** Routine operations (empty status checks, package installations, file listings) 5. **Output Format:** XML structure with `` tags and required fields **Blocking Analysis:** ⚠️ CAN CAUSE SKIPPING - "WHEN TO SKIP" section instructs SDK to not output for routine operations - "No output necessary if skipping" means no observations stored - **This is intentional filtering**, not a bug **Value Assessment:** ✅ HIGH VALUE - Prevents noise from routine operations - Focuses on meaningful changes - Keep as-is, but consider making "WHEN TO SKIP" configurable **Key Instructions:** ``` WHEN TO SKIP ------------ Skip routine operations: - Empty status checks - Package installations with no errors - Simple file listings - Repetitive operations you've already documented - **No output necessary if skipping.** ``` **Impact:** - Reduces database size by filtering noise - But could cause "missing" observations for operations user cares about - Trade-off between signal and completeness --- ### buildObservationPrompt (prompts.ts lines 130-153) **Purpose:** Wrap tool execution data in XML for SDK processing **Content:** ```xml {name} {ISO timestamp} {JSON} {JSON} ``` **Blocking Analysis:** ❌ NO BLOCKING - Just data presentation, no instructions to skip - SDK processes based on init prompt rules **Value Assessment:** ✅ HIGH VALUE - Simple, clean data wrapper - Keep as-is --- ### buildSummaryPrompt (prompts.ts lines 158-178) **Purpose:** Request summary of the session so far **Content:** 1. **Instruction:** "Think about the last request, and write a summary of what was done, what was learned, and what's next" 2. **Important Note:** "DO NOT summarize the observation process itself - you are summarizing a DIFFERENT claude code session, not this one" 3. **Output Format:** XML `` with required fields 4. **Encouragement:** "Always write at least a minimal summary explaining where we are at currently, even if you didn't learn anything new or complete any work" **Blocking Analysis:** ❌ NO BLOCKING - Encourages always writing summary - SDK may still skip if truly nothing to summarize **Value Assessment:** ✅ HIGH VALUE - Ensures summaries are generated - "Always write at least a minimal summary" reduces skip rate - Keep as-is --- ## Data Flow Analysis ### End-to-End Flow: Tool Execution → Database ``` 1. User executes tool in Claude Code ↓ 2. PostToolUse hook captures execution ↓ 3. Hook sends HTTP POST to worker /observations endpoint ↓ 4. Worker queues message in pendingMessages array └─→ HTTP 200 response (confirmed receipt) ↓ 5. createMessageGenerator polls queue (100ms interval) ↓ 6. Message shifted from queue ↓ 7. buildObservationPrompt wraps tool data in XML ↓ 8. Generator yields message to SDK agent ↓ 9. SDK sends message to Claude API ↓ 10. Claude processes tool data based on init prompt ↓ 11. Claude responds with XML (or skips if routine operation) ↓ 12. SDK returns response to runSDKAgent ↓ 13. handleAgentMessage receives response ↓ 14. parseObservations extracts blocks ↓ 15. For each observation: - db.storeObservation called - Auto-creates session if missing - Inserts into observations table ↓ 16. Data persisted in SQLite database ``` **Failure Points:** - **Point 3:** Worker not running → HTTP request fails → Hook logs error - **Point 4:** Worker crashes before processing → Queue lost - **Point 9:** Invalid model config → SDK fails → Session marked failed - **Point 11:** Malformed XML response → Parser returns empty array - **Point 15:** Database error (rare) → Throws exception **Recovery Mechanisms:** - **Auto-recovery:** New requests after worker restart auto-create session - **Graceful degradation:** Partial data saved (v4.2.5/v4.2.6 fixes) - **Database persistence:** Once stored, data survives all restarts --- ## Blocking Assessment Matrix ### Components That CAN Block Data Storage | Component | Blocking Scenario | Impact | Mitigation | |-----------|------------------|---------|------------| | Worker not running | HTTP requests fail | Observations not queued | PM2 auto-restart, health monitoring | | Invalid CLAUDE_MEM_MODEL | SDK agent fails to start | Queued messages never processed | Validation in settings script | | Invalid CLAUDE_CODE_PATH | SDK agent fails to start | Queued messages never processed | Default path, env var fallback | | Malformed XML in SDK response | Parser can't extract | Data lost for that response | Better error handling, partial parsing | | Worker restart | In-memory queue lost | Queued messages lost | Could persist queue to DB | | Session abort (DELETE) | Queue processing stopped | Remaining queue lost | Graceful cleanup (v4.1.0) | | Init prompt "WHEN TO SKIP" | SDK intentionally skips | No observation stored | Intentional filtering, configurable? | ### Components That CANNOT Block Data Storage | Component | Reason | Design Pattern | |-----------|--------|----------------| | /observations endpoint | Auto-recovery, always queues | Maximally permissive | | /summarize endpoint | Auto-recovery, always queues | Maximally permissive | | parseObservations() | Defaults to "change" type, accepts nulls | Permissive (v4.2.6 fix) | | parseSummary() | Returns partial summaries | Permissive (v4.2.5 fix) | | storeObservation() | Auto-creates sessions, accepts nulls | Defensive programming | | storeSummary() | Auto-creates sessions, accepts nulls | Defensive programming | | Database schema | Nullable fields, no UNIQUE constraints | Flexible storage | --- ## Critical Findings ### 1. Auto-Recovery Pattern Prevents Worker Restart Data Loss **Location:** `/observations` and `/summarize` endpoints (lines 181-209, 241-270) **How it works:** ```typescript if (!session) { // Fetch session from database const dbSession = db.getSessionById(sessionDbId); // Recreate in-memory state session = { sessionDbId, claudeSessionId: dbSession!.claude_session_id, sdkSessionId: null, project: dbSession!.project, userPrompt: dbSession!.user_prompt, pendingMessages: [], abortController: new AbortController(), generatorPromise: null, lastPromptNumber: 0, observationCounter: 0, startTime: Date.now() }; // Start new SDK agent session.generatorPromise = this.runSDKAgent(session); } ``` **Value:** ✅ HIGH - Worker restart doesn't prevent new observations from being processed - Database is source of truth - Stateless design enables resilience **Recommendation:** Extract to helper function to reduce duplication --- ### 2. Parser Fixes (v4.2.5/v4.2.6) Ensure Partial Data Saved **parseObservations (v4.2.6):** ```typescript // NOTE FROM THEDOTMACK: ALWAYS save observations - never skip. 10/24/2025 // All fields except type are nullable in schema // If type is missing or invalid, use "change" as catch-all fallback let finalType = 'change'; // Default catch-all if (type && validTypes.includes(type.trim())) { finalType = type.trim(); } // All other fields are optional - save whatever we have observations.push({ type: finalType, title, // Can be null subtitle, // Can be null facts, narrative, // Can be null concepts, files_read, files_modified }); ``` **parseSummary (v4.2.5):** ```typescript // NOTE FROM THEDOTMACK: 100% of the time we must SAVE the summary, // even if fields are missing. 10/24/2025 // NEVER DO THIS NONSENSE AGAIN. return { request, // Can be null investigated, // Can be null learned, // Can be null completed, // Can be null next_steps, // Can be null notes // Can be null }; ``` **Value:** ✅ CRITICAL - Prevents data loss from incomplete AI responses - LLMs make mistakes - system must be resilient - Partial data is better than no data **Recommendation:** Keep as-is, this is the right design --- ### 3. In-Memory Queue is Main Vulnerability **Issue:** `pendingMessages` array is in-memory only - Worker restart → All queued messages lost - But HTTP response already confirmed receipt **Current behavior:** 1. Hook sends observation → Worker responds "queued" → Hook thinks it's saved 2. Worker crashes before processing → Observation lost 3. BUT: New observations after restart are still processed (auto-recovery) **Impact:** ⚠️ MEDIUM - Data loss window between queue and processing - But observations are idempotent (can be resent) - Hooks don't retry on success response **Recommendation:** ⚠️ CONSIDER - Persist queue to database (e.g., `pending_observations` table) - Mark as processed when SDK handles - Increases reliability but adds complexity --- ### 4. Init Prompt "WHEN TO SKIP" Intentionally Filters **Instruction:** ``` WHEN TO SKIP ------------ Skip routine operations: - Empty status checks - Package installations with no errors - Simple file listings - Repetitive operations you've already documented - **No output necessary if skipping.** ``` **Impact:** - Reduces noise in database - Focuses on meaningful changes - BUT: User might wonder why some tool executions aren't recorded **Value:** ✅ MEDIUM - Intentional filtering - Prevents database bloat - Trade-off between signal and completeness **Recommendation:** ⚠️ CONSIDER - Make "WHEN TO SKIP" configurable (env var or settings) - Or add verbosity levels (minimal/normal/verbose) --- ## Value Assessment by Component ### HIGH VALUE - Keep As-Is | Component | Reason | |-----------|--------| | Auto-recovery pattern | Prevents worker restart data loss | | Permissive parser (v4.2.5/v4.2.6) | Ensures partial data saved, critical for reliability | | Nullable database schema | Flexible storage, allows incomplete data | | WAL mode SQLite | Good concurrency, reliable writes | | Isolated session state | No cross-contamination between sessions | | Queue-based architecture | Decouples HTTP from SDK processing | | storeObservation/storeSummary auto-creation | Defensive programming, prevents foreign key errors | ### MEDIUM VALUE - Consider Improvements | Component | Current State | Potential Improvement | |-----------|--------------|----------------------| | In-memory queue | Lost on restart | Persist to DB for durability | | 100ms polling | Works but inefficient | Event-driven async queue | | Duplicated auto-recovery code | Lines 181-209 and 241-270 identical | Extract to `getOrCreateSession()` helper | | No try-catch around DB ops | Errors crash handler | Add error handling with logging | | Model/port defaults | Hard-coded | Already configurable via env vars ✓ | | Init prompt filtering | Fixed "WHEN TO SKIP" rules | Make configurable (verbosity levels) | ### LOW VALUE - Questionable Design | Component | Issue | Recommendation | |-----------|-------|----------------| | cleanupOrphanedSessions() | Marks ALL active sessions failed on startup | Aggressive, but necessary with fixed port | | 5-second DELETE timeout | Arbitrary | Make configurable via env var | | "NO SUMMARY TAGS FOUND" warning | Log level too high | Change to INFO level | --- ## Recommendations ### Priority 1: Critical Reliability Improvements 1. **Persist Message Queue to Database** - Create `pending_messages` table - Store queued observations/summaries - Mark as processed when handled by SDK - Prevents data loss on worker restart - **Effort:** Medium, **Impact:** High 2. **Add Error Handling Around Database Operations** - Wrap `db.storeObservation()` and `db.storeSummary()` in try-catch - Log errors with full context - Continue processing other messages on error - **Effort:** Low, **Impact:** Medium ### Priority 2: Code Quality Improvements 3. **Extract Auto-Recovery to Helper Function** ```typescript private async getOrCreateSession(sessionDbId: number): Promise { // Consolidate lines 181-209 and 241-270 } ``` - **Effort:** Low, **Impact:** Low (code quality) 4. **Make Configuration More Flexible** - Add `CLAUDE_MEM_VERBOSITY` env var (minimal/normal/verbose) - Adjust init prompt "WHEN TO SKIP" based on verbosity - Add `CLAUDE_MEM_DELETE_TIMEOUT` env var - **Effort:** Low, **Impact:** Medium ### Priority 3: Performance Optimizations 5. **Replace Polling with Event-Driven Queue** - Use `AsyncQueue` with notifications instead of 100ms polling - Reduces latency from queue to processing - **Effort:** Medium, **Impact:** Low (performance) 6. **Add Queue Metrics** - Track queue length over time - Alert if queue grows unbounded - Add to `/health` endpoint - **Effort:** Low, **Impact:** Low (observability) --- ## Appendix: Configuration Reference ### Environment Variables | Variable | Default | Purpose | Blocking Impact | |----------|---------|---------|----------------| | `CLAUDE_MEM_MODEL` | `claude-sonnet-4-5` | AI model for processing | Invalid = SDK fails | | `CLAUDE_MEM_WORKER_PORT` | `37777` | HTTP server port | Invalid = Worker won't start | | `CLAUDE_CODE_PATH` | `/Users/alexnewman/.nvm/versions/node/v24.5.0/bin/claude` | Path to Claude Code | Invalid = SDK fails | ### Constants | Constant | Value | Purpose | |----------|-------|---------| | `DISALLOWED_TOOLS` | `['Glob', 'Grep', 'ListMcpResourcesTool', 'WebSearch']` | Tools SDK agent can't use | | Polling interval | `100ms` | Queue polling frequency | | DELETE timeout | `5000ms` | Max wait for agent shutdown | --- ## Conclusion The claude-mem worker server is a well-designed system with a clear **defensive, layered architecture** that prioritizes **data persistence**. The key strengths are: 1. **Auto-recovery** from worker restarts 2. **Permissive parsing** that saves partial data 3. **Nullable schema** that accepts incomplete information 4. **Session isolation** preventing cross-contamination The main vulnerability is the **in-memory queue**, which could be mitigated by persisting to the database. Overall, the system achieves its goal of creating a persistent memory system that survives failures and continues operating even with incomplete data. **Design Philosophy:** "Better to save partial data than lose everything." This philosophy is evident throughout the codebase, from the v4.2.5/v4.2.6 parser fixes to the auto-creation patterns in the database layer. The system is built to be resilient to AI errors, configuration issues, and process failures. --- **End of Document**