claude-mem/docs/worker-server-architecture.md

# Claude-Mem Worker Server Architecture

**Document Version:** 1.0
**Last Updated:** 2025-01-24
**Author:** Analysis by Claude Code
**Purpose:** Comprehensive technical analysis of the worker server architecture, logic flow, blocking behavior, and component value assessment

---

## Executive Summary

The claude-mem worker server is a long-running HTTP service managed by PM2 that processes tool execution observations and generates session summaries using the Claude Agent SDK. It implements a **defensive, layered architecture** designed to maximize data persistence while maintaining flexibility.

### Key Design Principles

1. **Maximally Permissive Storage** - System defaults to saving data even if incomplete
2. **Auto-Recovery** - Worker restarts don't prevent processing (session state reconstructed from database)
3. **Queue-Based Processing** - HTTP API decoupled from AI processing for reliability
4. **Defensive Programming** - Auto-creates missing database records, accepts null fields
5. **Session Isolation** - Each session has independent state and SDK agent

### Architecture at a Glance

```
┌─────────────────────────────────────────────────────────────┐
│ Layer 1: HTTP API (Express.js)                              │
│ - 6 REST endpoints                                          │
│ - Always queues messages (maximally permissive)             │
└──────────────────┬──────────────────────────────────────────┘
                   │
┌──────────────────▼──────────────────────────────────────────┐
│ Layer 2: In-Memory Queue                                    │
│ - pendingMessages array per session                         │
│ - VULNERABILITY: Lost on worker restart                     │
└──────────────────┬──────────────────────────────────────────┘
                   │
┌──────────────────▼──────────────────────────────────────────┐
│ Layer 3: SDK Agent (Claude Agent SDK)                       │
│ - Processes queued messages via async generator             │
│ - Can fail due to config or AI errors                       │
└──────────────────┬──────────────────────────────────────────┘
                   │
┌──────────────────▼──────────────────────────────────────────┐
│ Layer 4: Parser (XML Extraction)                            │
│ - Extracts observations and summaries from AI responses     │
│ - Permissive (v4.2.5/v4.2.6 fixes ensure partial data saved)│
└──────────────────┬──────────────────────────────────────────┘
                   │
┌──────────────────▼──────────────────────────────────────────┐
│ Layer 5: Database (SQLite with better-sqlite3)              │
│ - Permanent storage (once here, data persists)              │
│ - Auto-creates missing sessions, accepts nulls              │
└─────────────────────────────────────────────────────────────┘
```

**Critical Insight:** Once the HTTP API has accepted a message, data can only be lost in layers 2-4 (queue, SDK agent, parser). Once it reaches the database (layer 5), it's permanent.

---

## Component Inventory

### HTTP REST API Endpoints

| Endpoint | Purpose | Blocks Data? |
|----------|---------|--------------|
| `GET /health` | Worker health check | N/A |
| `POST /sessions/:id/init` | Initialize session and start SDK agent | Only if session not in DB (expected) |
| `POST /sessions/:id/observations` | Queue tool observation | ❌ Never (auto-recovery) |
| `POST /sessions/:id/summarize` | Queue summary request | ❌ Never (auto-recovery) |
| `GET /sessions/:id/status` | Get session status | N/A |
| `DELETE /sessions/:id` | Abort session | ⚠️ Queued messages lost |

### Core Processing Components

| Component | File | Lines | Purpose |
|-----------|------|-------|---------|
| WorkerService | worker-service.ts | 52-590 | Main service class, manages sessions |
| runSDKAgent | worker-service.ts | 345-404 | Runs SDK agent for a session |
| createMessageGenerator | worker-service.ts | 410-502 | Async generator feeding SDK |
| handleAgentMessage | worker-service.ts | 508-563 | Parses and stores SDK responses |
| parseObservations | parser.ts | 32-96 | Extracts observations from XML |
| parseSummary | parser.ts | 102-157 | Extracts summary from XML |
| SessionStore | SessionStore.ts | 9-1086 | Database operations |

---

## Deep Dive: HTTP Endpoints

### GET /health (lines 100-109)

**Purpose:** Health check for monitoring and debugging

**Logic Flow:**
1. Returns JSON with status, port, PID, active sessions, uptime, memory

**Blocking Analysis:** ❌ N/A (read-only endpoint)

**Value Assessment:** ✅ HIGH VALUE
- Essential for monitoring worker health
- Helps debug port conflicts and process state
- Keep as-is

---

### POST /sessions/:sessionDbId/init (lines 115-169)

**Purpose:** Initialize a new session and start the SDK agent

**Logic Flow:**
1. Parse `sessionDbId` from URL
2. Extract `project` and `userPrompt` from request body
3. Fetch session from database using `SessionStore.getSessionById()`
4. **CRITICAL CHECK:** Return 404 if session not found in DB
5. Retrieve `claudeSessionId` from database record
6. Create `ActiveSession` object with initial state:
   ```typescript
   {
     sessionDbId, claudeSessionId, sdkSessionId: null,
     project, userPrompt, pendingMessages: [],
     abortController: new AbortController(),
     generatorPromise: null, lastPromptNumber: 0,
     observationCounter: 0, startTime: Date.now()
   }
   ```
7. Store session in memory map (`this.sessions`)
8. Update `worker_port` in database
9. Start `runSDKAgent()` in background (fire-and-forget promise)
10. Return success response immediately

**Blocking Analysis:** ⚠️ CONDITIONAL
- Returns 404 if session doesn't exist in database
- This is expected behavior - session must be created before init
- Doesn't prevent future initialization attempts
- Error logged and hook can retry

**Value Assessment:** ✅ HIGH VALUE
- Critical initialization step
- Background SDK agent startup prevents timeout
- Keep as-is

**Edge Cases:**
- Session exists but SDK agent fails to start → Session marked as failed, but new init can retry
- Multiple init calls for same session → First one wins (subsequent calls find session in memory)

---

### POST /sessions/:sessionDbId/observations (lines 175-230)

**Purpose:** Queue a tool execution observation for processing

**Logic Flow:**
1. Parse `sessionDbId` from URL
2. Extract `tool_name`, `tool_input`, `tool_output`, `prompt_number` from body
3. Check if session exists in memory map (`this.sessions.get(sessionDbId)`)
4. **AUTO-RECOVERY** (lines 181-209): If session NOT in memory:
   - Fetch session from database
   - Recreate `ActiveSession` object
   - Start new SDK agent in background
   - This enables recovery from worker restarts!
5. Increment `observationCounter` for correlation ID tracking
6. Push observation message to `pendingMessages` queue:
   ```typescript
   {
     type: 'observation',
     tool_name, tool_input, tool_output, prompt_number
   }
   ```
7. Return success with queue length

**Blocking Analysis:** ❌ NEVER BLOCKS
- Auto-creates session state from database if missing
- Always queues the observation
- HTTP response confirms receipt immediately
- Processing happens asynchronously

**Value Assessment:** ✅ HIGH VALUE
- Auto-recovery is brilliant design
- Worker restart doesn't lose ability to process observations
- Keep as-is

**Edge Cases:**
- Worker restart while observation in queue → Lost (queue is in-memory)
- But NEW observations after restart are queued successfully (auto-recovery)
- Database not found → Would throw error, but SessionStore auto-creates sessions

---

### POST /sessions/:sessionDbId/summarize (lines 236-284)

**Purpose:** Queue a summary generation request

**Logic Flow:**
1. Parse `sessionDbId` and `prompt_number` from request
2. Check if session exists in memory
3. **AUTO-RECOVERY** (lines 241-270): Same pattern as observations endpoint
   - Fetches session from database
   - Recreates `ActiveSession` object
   - Starts new SDK agent
4. Push summarize message to `pendingMessages` queue:
   ```typescript
   {
     type: 'summarize',
     prompt_number
   }
   ```
5. Return success with queue length

**Blocking Analysis:** ❌ NEVER BLOCKS
- Same auto-recovery mechanism as observations
- Always queues the summary request
- Processing happens asynchronously

**Value Assessment:** ✅ HIGH VALUE
- Auto-recovery pattern prevents data loss
- Keep as-is

**Code Quality Note:** ⚠️ MEDIUM - Duplicated auto-recovery code (lines 181-209 and 241-270 are nearly identical)
- Could extract to helper function: `getOrCreateSession(sessionDbId)`
- Would reduce duplication and improve maintainability
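
The suggested helper could look roughly like the sketch below. It is not the project's code: `createSessionRegistry`, the `lookup` and `startAgent` callbacks, the `DbSessionRow` shape, and the numeric `sessionDbId` are hypothetical stand-ins for `SessionStore.getSessionById()` and `runSDKAgent()`, chosen so the example runs on its own. Throwing a descriptive error when the row is missing is a design choice of the sketch.

```typescript
// Hypothetical sketch of a shared getOrCreateSession helper (not the project's code).
interface ActiveSession {
  sessionDbId: number;
  claudeSessionId: string;
  sdkSessionId: string | null;
  project: string;
  userPrompt: string;
  pendingMessages: unknown[];
  abortController: AbortController;
  generatorPromise: Promise<void> | null;
  lastPromptNumber: number;
  observationCounter: number;
  startTime: number;
}

// Assumed shape of the row returned by the database lookup.
interface DbSessionRow {
  claude_session_id: string;
  project: string;
  user_prompt: string;
}

function createSessionRegistry(
  lookup: (sessionDbId: number) => DbSessionRow | null,
  startAgent: (session: ActiveSession) => Promise<void>,
) {
  const sessions = new Map<number, ActiveSession>();

  function getOrCreateSession(sessionDbId: number): ActiveSession {
    // Fast path: session state is already in memory.
    const existing = sessions.get(sessionDbId);
    if (existing) return existing;

    // Recovery path: rebuild in-memory state from the database row.
    const row = lookup(sessionDbId);
    if (!row) {
      throw new Error(`Session ${sessionDbId} not found in database`);
    }

    const session: ActiveSession = {
      sessionDbId,
      claudeSessionId: row.claude_session_id,
      sdkSessionId: null,
      project: row.project,
      userPrompt: row.user_prompt,
      pendingMessages: [],
      abortController: new AbortController(),
      generatorPromise: null,
      lastPromptNumber: 0,
      observationCounter: 0,
      startTime: Date.now(),
    };

    // Register before starting the agent so repeated calls reuse this session.
    sessions.set(sessionDbId, session);
    session.generatorPromise = startAgent(session);
    return session;
  }

  return { getOrCreateSession };
}
```

Both the `/observations` and `/summarize` handlers could then call `getOrCreateSession(sessionDbId)` and push onto `pendingMessages`, leaving one copy of the recovery logic to maintain.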

---

### GET /sessions/:sessionDbId/status (lines 289-304)

**Purpose:** Get current session status and queue length

**Logic Flow:**
1. Parse `sessionDbId` from URL
2. Get session from memory map
3. Return 404 if not found
4. Return session info: `sessionDbId`, `sdkSessionId`, `project`, `pendingMessages.length`

**Blocking Analysis:** ❌ N/A (read-only endpoint)

**Value Assessment:** ✅ MEDIUM VALUE
- Useful for debugging
- Not critical for core functionality
- Keep as-is

---

### DELETE /sessions/:sessionDbId (lines 309-340)

**Purpose:** Abort a running session and clean up

**Logic Flow:**
1. Parse `sessionDbId` from URL
2. Get session from memory map
3. Return 404 if not found
4. Call `abortController.abort()` to signal SDK agent to stop
5. Wait for `generatorPromise` to finish (max 5 seconds timeout)
6. Mark session as 'failed' in database
7. Delete session from memory map
8. Return success
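
The bounded wait in step 5 is commonly written as a race between the work and a timer. The sketch below illustrates that pattern under assumptions: `waitWithTimeout` is a hypothetical name, and the actual handler may implement the 5-second limit differently.

```typescript
// Hypothetical helper: wait for `work` to settle, but give up after `ms` milliseconds.
async function waitWithTimeout(
  work: Promise<unknown>,
  ms: number,
): Promise<"finished" | "timed_out"> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<"timed_out">((resolve) => {
    timer = setTimeout(() => resolve("timed_out"), ms);
  });

  // Treat rejection (for example an AbortError) as "finished": the agent has stopped.
  const settled = work.then(
    () => "finished" as const,
    () => "finished" as const,
  );

  try {
    return await Promise.race([settled, timeout]);
  } finally {
    // Clear the timer so it does not keep the process alive after an early finish.
    clearTimeout(timer);
  }
}
```

With a helper like this, the handler would call `abortController.abort()` and then wait on `waitWithTimeout(session.generatorPromise, 5000)` before marking the session as failed.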

**Blocking Analysis:** ⚠️ BLOCKS QUEUED MESSAGES
- Aborts SDK agent processing
- Any messages in `pendingMessages` queue are lost
- Already-stored observations/summaries remain in database

**Value Assessment:** ✅ MEDIUM VALUE
- Provides clean shutdown mechanism
- Used for manual cleanup
- As of v4.1.0, SessionEnd hook doesn't call DELETE (graceful cleanup)
- Keep for manual intervention, but not used automatically

**Historical Note:**
- v4.0.x: SessionEnd hook called DELETE → interrupted summary generation
- v4.1.0+: Graceful cleanup → workers finish naturally

---

## Deep Dive: SDK Agent Processing

### runSDKAgent (lines 345-404)

**Purpose:** Core processing engine that runs continuously for each session

**Logic Flow:**
1. Call `query()` from Claude Agent SDK with:
   ```typescript
   {
     prompt: this.createMessageGenerator(session),
     options: {
       model: MODEL,  // from CLAUDE_MEM_MODEL env var
       disallowedTools: DISALLOWED_TOOLS,
       abortController: session.abortController,
       pathToClaudeCodeExecutable: claudePath
     }
   }
   ```
2. Iterate over SDK responses using `for await`
3. For each assistant message:
   - Extract text content from response
   - Log response size
   - Call `handleAgentMessage()` to parse and store
4. On completion:
   - Log session duration
   - Mark session as 'completed' in database
   - Delete session from memory map
5. On error:
   - Log error (or warning for AbortError)
   - Mark session as 'failed' in database
   - Throw error (caught by `generatorPromise.catch()`)

**Blocking Analysis:** ⚠️ CAN BLOCK IF:
- Invalid `CLAUDE_MEM_MODEL` → SDK initialization fails
- Invalid `CLAUDE_CODE_PATH` → SDK initialization fails
- SDK crashes → Session marked as failed
- BUT: Doesn't prevent NEW sessions from being created

**Value Assessment:** ✅ HIGH VALUE
- Core processing engine
- Proper error handling with session status tracking
- Keep as-is

**Configuration Dependencies:**
- `CLAUDE_MEM_MODEL` (default: 'claude-sonnet-4-5')
- `CLAUDE_CODE_PATH` (default: specific path to claude executable)
- `DISALLOWED_TOOLS` (default: ['Glob', 'Grep', 'ListMcpResourcesTool', 'WebSearch'])

**Edge Cases:**
- SDK hangs → AbortController provides cancellation mechanism
- Network issues → SDK handles retries internally
- Multiple concurrent sessions → Each has isolated SDK agent

---

### createMessageGenerator (lines 410-502)

**Purpose:** Async generator that feeds messages to the SDK agent (bridge between HTTP API and SDK)

**Logic Flow:**
1. Build init prompt using `buildInitPrompt(project, claudeSessionId, userPrompt)`
2. Yield initial user message:
   ```typescript
   {
     type: 'user',
     session_id: claudeSessionId,  // Real Claude Code session ID
     parent_tool_use_id: null,
     message: { role: 'user', content: initPrompt }
   }
   ```
3. Enter infinite loop (`while (true)`):
   - Check `abortController.signal.aborted` → break if aborted
   - If no pending messages → sleep 100ms and continue
   - While `pendingMessages.length > 0`:
     - Shift message from queue (FIFO)
     - If type === 'summarize':
       - Update `lastPromptNumber`
       - Fetch session from database
       - Build summary prompt using `buildSummaryPrompt()`
       - Yield summary prompt as user message
     - If type === 'observation':
       - Update `lastPromptNumber`
       - Build observation prompt using `buildObservationPrompt()`
       - Yield observation prompt as user message
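
The loop above can be sketched as a small async generator. This is a simplified illustration, not the real `createMessageGenerator`: `drainQueue`, `QueuedMessage`, and the placeholder prompt strings are hypothetical, and the real code yields SDK user-message objects built by the prompt functions.

```typescript
// Simplified message shape for the sketch; the real queue items carry more fields.
type QueuedMessage =
  | { type: "observation"; tool_name: string; prompt_number: number }
  | { type: "summarize"; prompt_number: number };

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function* drainQueue(
  queue: QueuedMessage[],
  signal: AbortSignal,
  pollMs = 100,
): AsyncGenerator<string> {
  // Run until the session is aborted.
  while (!signal.aborted) {
    if (queue.length === 0) {
      // Nothing to do: poll again after a short sleep.
      await sleep(pollMs);
      continue;
    }
    // Drain everything currently queued, oldest first (FIFO).
    while (queue.length > 0) {
      const msg = queue.shift()!;
      yield msg.type === "summarize"
        ? `summary request for prompt ${msg.prompt_number}`
        : `observation of ${msg.tool_name} (prompt ${msg.prompt_number})`;
    }
  }
}
```

Because a message is removed with `shift()` before it is yielded, anything that fails downstream of the yield has no copy left in the queue.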

**Blocking Analysis:** ❌ DOESN'T BLOCK
- Continuously processes queue until aborted
- 100ms polling means small delay but no data loss
- Messages shifted from queue and sent to SDK
- If the SDK fails after a message has been shifted, that message is lost even though the HTTP response already confirmed receipt

**Value Assessment:** ✅ HIGH VALUE
- Elegant async generator pattern
- Keep as-is

**Performance Note:** ⚠️ 100ms polling interval
- Could be improved with event-driven queue (e.g., `AsyncQueue` with notifications)
- Current implementation is simple and works well
- Low priority optimization
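
One way an event-driven queue could replace the polling loop is sketched below. `AsyncQueue` here is a hypothetical minimal class written for this example, not an existing library or part of the worker.

```typescript
// Hypothetical minimal async queue: consumers await next() instead of polling.
class AsyncQueue<T> {
  private items: T[] = [];
  private waiters: Array<(item: T) => void> = [];

  // Hand the item to a waiting consumer if there is one, otherwise buffer it.
  push(item: T): void {
    const waiter = this.waiters.shift();
    if (waiter) {
      waiter(item);
    } else {
      this.items.push(item);
    }
  }

  // Resolve immediately when an item is buffered, otherwise wait for the next push().
  next(): Promise<T> {
    if (this.items.length > 0) {
      return Promise.resolve(this.items.shift() as T);
    }
    return new Promise<T>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  // Number of buffered items not yet handed to a consumer.
  get length(): number {
    return this.items.length;
  }
}
```

A generator built on this would `await queue.next()` and wake as soon as an endpoint calls `push()`, removing the polling delay. Abort handling would still need to be layered on top, for example by racing `next()` against the abort signal.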

**Data Flow:**
```
HTTP /observations → pendingMessages.push() → [sleep 100ms] →
pendingMessages.shift() → buildObservationPrompt() → yield to SDK →
SDK processes → handleAgentMessage()
```

---

### handleAgentMessage (lines 508-563)

**Purpose:** Parse SDK response and store observations/summaries in database

**Logic Flow:**
1. Call `parseObservations(content, correlationId)`
2. If observations found:
   - For each observation:
     - Call `db.storeObservation(claudeSessionId, project, observation, promptNumber)`
     - Log success with correlation ID
3. Call `parseSummary(content, sessionId)`
4. If summary found:
   - Call `db.storeSummary(claudeSessionId, project, summary, promptNumber)`
   - Log success
5. If NO summary found:
   - Log warning with content sample

**Blocking Analysis:** ⚠️ CAN BLOCK IF:
- Parser returns empty array/null → Nothing stored (but this is expected for routine operations)
- Database error → Would throw and crash handler (rare with permissive schema)

**Value Assessment:** ✅ HIGH VALUE
- Core storage logic
- Proper logging for debugging
- Keep as-is

**Critical Dependencies:**
- `parseObservations()` must return valid observations
- `parseSummary()` must return valid summary
- Database must accept the data (schema constraints)

**Logging:**
- Extensive logging at INFO, SUCCESS, and WARN levels
- Correlation IDs for tracking individual observations
- Debug mode logs full SDK responses

---

## Deep Dive: Parser System

### parseObservations (parser.ts lines 32-96)

**Purpose:** Extract observation XML blocks from SDK response and parse into structured data

**Logic Flow:**
1. Use regex to find all `<observation>...</observation>` blocks (non-greedy):
   ```typescript
   /<observation>([\s\S]*?)<\/observation>/g
   ```
2. For each block:
   - Extract all fields: `type`, `title`, `subtitle`, `narrative`, `facts`, `concepts`, `files_read`, `files_modified`
   - **VALIDATION** (lines 52-67):
     - If `type` is missing or invalid → default to "change"
     - Valid types: `['bugfix', 'feature', 'refactor', 'change', 'discovery', 'decision']`
     - All other fields can be null
   - Filter out `type` from `concepts` array (types and concepts are separate dimensions)
   - Push observation to results array
3. Return all observations
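
The steps above can be sketched as a small self-contained parser. This is an illustration of the described behavior, not a copy of `parser.ts`: `parseObservationsSketch`, `extractField`, and `extractList` are hypothetical names, and the child tag names (`fact`, `concept`, `file`) are taken from the example XML in this section.

```typescript
const VALID_TYPES = ["bugfix", "feature", "refactor", "change", "discovery", "decision"];

interface ParsedObservation {
  type: string;
  title: string | null;
  subtitle: string | null;
  narrative: string | null;
  facts: string[];
  concepts: string[];
  files_read: string[];
  files_modified: string[];
}

// Return the trimmed text of the first <tag>...</tag>, or null if missing or empty.
function extractField(xml: string, tag: string): string | null {
  const match = new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`).exec(xml);
  const text = match ? match[1].trim() : "";
  return text === "" ? null : text;
}

// Return the trimmed text of every <item> inside <container>, or [] if absent.
function extractList(xml: string, container: string, item: string): string[] {
  const block = extractField(xml, container);
  if (block === null) return [];
  const re = new RegExp(`<${item}>([\\s\\S]*?)</${item}>`, "g");
  const out: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = re.exec(block)) !== null) out.push(match[1].trim());
  return out;
}

function parseObservationsSketch(text: string): ParsedObservation[] {
  const results: ParsedObservation[] = [];
  const re = /<observation>([\s\S]*?)<\/observation>/g;
  let match: RegExpExecArray | null;
  while ((match = re.exec(text)) !== null) {
    const body = match[1];
    // Missing or unknown types fall back to "change"; nothing is skipped.
    const rawType = extractField(body, "type");
    const type = rawType !== null && VALID_TYPES.includes(rawType) ? rawType : "change";
    results.push({
      type,
      title: extractField(body, "title"),
      subtitle: extractField(body, "subtitle"),
      narrative: extractField(body, "narrative"),
      facts: extractList(body, "facts", "fact"),
      // Types and concepts are separate dimensions, so drop the type from concepts.
      concepts: extractList(body, "concepts", "concept").filter((c) => c !== type),
      files_read: extractList(body, "files_read", "file"),
      files_modified: extractList(body, "files_modified", "file"),
    });
  }
  return results;
}
```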

**Blocking Analysis:** ❌ NEVER BLOCKS ON MISSING FIELDS (as of v4.2.6)
- **CRITICAL FIX** (v4.2.6): Removed validation that required title, subtitle, and narrative
- Comment on line 52: "NOTE FROM THEDOTMACK: ALWAYS save observations - never skip. 10/24/2025"
- Always returns observations with whatever fields exist
- Only transformation: type defaults to "change" if invalid

**Value Assessment:** ✅ HIGH VALUE
- Permissive parsing saves every well-formed `<observation>` block, even when fields are missing
- v4.2.6 fix was critical for reliability
- Keep as-is

**Historical Context:**
- **Before v4.2.6:** Would skip observations missing required fields → data loss
- **After v4.2.6:** Always saves with defaults → maximally permissive

**Edge Cases:**
1. No `<observation>` tags → Returns empty array (normal for routine operations)
2. All fields empty → Returns observation with null fields and type="change"
3. Malformed XML → Regex won't match → Returns empty array (data loss)
4. Type in concepts → Filtered out (types and concepts are orthogonal)

**Example:**
```xml
<observation>
  <type>feature</type>
  <title>Authentication added</title>
  <subtitle>Implemented OAuth2 flow</subtitle>
  <facts>
    <fact>Added OAuth2 provider configuration</fact>
    <fact>Created callback endpoint</fact>
  </facts>
  <narrative>Full OAuth2 authentication...</narrative>
  <concepts>
    <concept>how-it-works</concept>
    <concept>what-changed</concept>
  </concepts>
  <files_read>
    <file>src/auth/oauth.ts</file>
  </files_read>
  <files_modified>
    <file>src/auth/oauth.ts</file>
  </files_modified>
</observation>
```

---

### parseSummary (parser.ts lines 102-157)

**Purpose:** Extract summary XML block from SDK response

**Logic Flow:**
1. Check for `<skip_summary reason="..."/>` tag (lines 104-113)
   - If found → log reason and return null (intentional skip)
2. Match `<summary>...</summary>` block (non-greedy):
   ```typescript
   /<summary>([\s\S]*?)<\/summary>/
   ```
   - If not found → return null (SDK didn't provide summary)
3. Extract all fields: `request`, `investigated`, `learned`, `completed`, `next_steps`, `notes` (optional)
4. **VALIDATION REMOVED** (lines 133-147):
   - Comment: "NOTE FROM THEDOTMACK: 100% of the time we must SAVE the summary, even if fields are missing. 10/24/2025"
   - Comment: "NEVER DO THIS NONSENSE AGAIN."
   - Old code checked if all required fields present → would return null
   - New code returns summary with whatever fields exist
5. Return `ParsedSummary` object
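
The same flow can be sketched in a few lines. This is an illustration of the described behavior, not the contents of `parser.ts`: `parseSummarySketch` is a hypothetical name, the `ParsedSummary` shape is inferred from the field list above, and the skip-tag regex assumes the self-closing form shown in step 1.

```typescript
// Shape inferred from the field list above; every field may be null.
interface ParsedSummary {
  request: string | null;
  investigated: string | null;
  learned: string | null;
  completed: string | null;
  next_steps: string | null;
  notes: string | null;
}

function parseSummarySketch(text: string): ParsedSummary | null {
  // Intentional skip: the agent declared there is nothing to summarize.
  const skip = /<skip_summary\s+reason="([^"]*)"\s*\/>/.exec(text);
  if (skip) {
    console.log(`Summary skipped: ${skip[1]}`);
    return null;
  }

  // No <summary> block at all: nothing to save.
  const match = /<summary>([\s\S]*?)<\/summary>/.exec(text);
  if (!match) return null;
  const body = match[1];

  // Missing or empty fields become null instead of invalidating the summary.
  const field = (tag: string): string | null => {
    const m = new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`).exec(body);
    const value = m ? m[1].trim() : "";
    return value === "" ? null : value;
  };

  return {
    request: field("request"),
    investigated: field("investigated"),
    learned: field("learned"),
    completed: field("completed"),
    next_steps: field("next_steps"),
    notes: field("notes"),
  };
}
```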

**Blocking Analysis:** ⚠️ MINIMAL BLOCKING (as of v4.2.5)
- `<skip_summary>` tag → Returns null (intentional, not a bug)
- Missing `<summary>` tags → Returns null (SDK didn't provide)
- Missing fields within `<summary>` → Does NOT block anymore (v4.2.5 fix)

**Value Assessment:** ✅ HIGH VALUE
- v4.2.5 fix ensures partial summaries are saved
- Keep as-is

**Historical Context:**
- **Before v4.2.5:** Would return null if any required field missing → data loss
- **After v4.2.5:** Returns summary with whatever fields exist → maximally permissive

**Edge Cases:**
1. `<skip_summary reason="not enough data"/>` → Returns null, logs reason
2. No `<summary>` tags → Returns null (SDK didn't generate summary)
3. `<summary>` with all empty fields → Returns summary with empty/null strings
4. Malformed XML → Regex won't match → Returns null (data loss)

**Example:**
```xml
<summary>
  <request>Add OAuth2 authentication</request>
  <investigated>Reviewed existing auth system</investigated>
  <learned>System uses JWT tokens for sessions</learned>
  <completed>Implemented OAuth2 provider integration</completed>
  <next_steps>Test with production credentials</next_steps>
  <notes>Need to configure callback URLs in provider dashboard</notes>
</summary>
```

---

## Deep Dive: Database Layer

### SessionStore.storeObservation (SessionStore.ts lines 901-964)

**Purpose:** Store a parsed observation in the database

**Logic Flow:**
1. **AUTO-CREATE SESSION** (lines 920-940):
   - Check if `sdk_session_id` exists in `sdk_sessions` table
   - If NOT found:
     - Auto-create session record
     - Log: "Auto-created session record for session_id: {id}"
   - This prevents foreign key constraint errors
2. Prepare INSERT statement:
   ```sql
   INSERT INTO observations
   (sdk_session_id, project, type, title, subtitle, facts, narrative,
    concepts, files_read, files_modified, prompt_number, created_at, created_at_epoch)
   VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
   ```
3. Insert observation with:
   - `facts`, `concepts`, `files_read`, `files_modified` → JSON.stringify()
   - Timestamps auto-generated
   - All fields as-is (nulls allowed)

**Blocking Analysis:** ❌ NEVER BLOCKS
- Auto-creates missing sessions (defensive programming)
- All content fields are nullable; only `sdk_session_id`, `project`, `type`, `created_at`, and `created_at_epoch` are `NOT NULL`
- No validation checks that could fail
- Schema is permissive

**Value Assessment:** ✅ HIGH VALUE
- Auto-creation pattern is brilliant
- Prevents foreign key errors
- Keep as-is

**Schema Constraints:**
- `type` must be one of 6 valid types (CHECK constraint)
  - BUT: Parser ensures type is always valid (defaults to "change")
- `sdk_session_id` has foreign key to `sdk_sessions`
  - BUT: Auto-creation ensures session exists
- Arrays stored as JSON strings

**Edge Cases:**
- Session doesn't exist → Auto-created
- Invalid type → Parser prevents this (defaults to "change")
- Null fields → Allowed by schema

---

### SessionStore.storeSummary (SessionStore.ts lines 970-1029)

**Purpose:** Store a parsed summary in the database

**Logic Flow:**
1. **AUTO-CREATE SESSION** (lines 987-1007):
   - Same defensive pattern as `storeObservation()`
   - Ensures session exists before INSERT
2. Prepare INSERT statement:
   ```sql
   INSERT INTO session_summaries
   (sdk_session_id, project, request, investigated, learned, completed,
    next_steps, notes, prompt_number, created_at, created_at_epoch)
   VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
   ```
3. Insert summary with:
   - All content fields as-is (nulls allowed)
   - Timestamps auto-generated

**Blocking Analysis:** ❌ NEVER BLOCKS
- Auto-creates missing sessions
- All content fields nullable
- No validation checks
- Multiple summaries per session allowed (migration 7 removed UNIQUE constraint)

**Value Assessment:** ✅ HIGH VALUE
- Auto-creation ensures reliability
- Nullable fields allow partial data
- Keep as-is

**Schema Evolution:**
- **Before migration 7:** `sdk_session_id` had UNIQUE constraint → Only one summary per session
- **After migration 7:** UNIQUE removed → Multiple summaries per session (one per prompt)

**Edge Cases:**
- Session doesn't exist → Auto-created
- All fields null/empty → Allowed
- Multiple summaries for same session → Allowed (migration 7)

---

### Database Schema Constraints

#### observations table
```sql
CREATE TABLE observations (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  sdk_session_id TEXT NOT NULL,  -- Foreign key
  project TEXT NOT NULL,
  text TEXT,  -- Nullable (deprecated, migration 9)
  type TEXT NOT NULL CHECK(type IN ('decision', 'bugfix', 'feature', 'refactor', 'discovery', 'change')),
  title TEXT,  -- Nullable
  subtitle TEXT,  -- Nullable
  facts TEXT,  -- Nullable (JSON array)
  narrative TEXT,  -- Nullable
  concepts TEXT,  -- Nullable (JSON array)
  files_read TEXT,  -- Nullable (JSON array)
  files_modified TEXT,  -- Nullable (JSON array)
  prompt_number INTEGER,  -- Nullable
  created_at TEXT NOT NULL,
  created_at_epoch INTEGER NOT NULL,
  FOREIGN KEY(sdk_session_id) REFERENCES sdk_sessions(sdk_session_id) ON DELETE CASCADE
);
```

**Blocking Potential:**
- Invalid `type` → CHECK constraint violation
  - Mitigated by: Parser defaults to "change"
- Missing `sdk_session_id` → Foreign key violation
  - Mitigated by: Auto-creation in storeObservation()

#### session_summaries table
```sql
CREATE TABLE session_summaries (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  sdk_session_id TEXT NOT NULL,  -- No longer UNIQUE (migration 7)
  project TEXT NOT NULL,
  request TEXT,  -- Nullable
  investigated TEXT,  -- Nullable
  learned TEXT,  -- Nullable
  completed TEXT,  -- Nullable
  next_steps TEXT,  -- Nullable
  notes TEXT,  -- Nullable
  prompt_number INTEGER,  -- Nullable
  created_at TEXT NOT NULL,
  created_at_epoch INTEGER NOT NULL,
  FOREIGN KEY(sdk_session_id) REFERENCES sdk_sessions(sdk_session_id) ON DELETE CASCADE
);
```

**Blocking Potential:**
- Missing `sdk_session_id` → Foreign key violation
  - Mitigated by: Auto-creation in storeSummary()

**Key Design Decisions:**
1. **Nullable fields** - Allows partial data to be saved
2. **Auto-creation** - Prevents foreign key errors
3. **No UNIQUE constraints** (migration 7) - Multiple summaries per session
4. **WAL mode** - Better concurrency for multiple sessions
5. **JSON arrays** - Flexible storage for lists (facts, concepts, files)

---

## Deep Dive: Prompt System

### buildInitPrompt (prompts.ts lines 24-125)

**Purpose:** Generate initial prompt that instructs the SDK agent what to observe and how to record

**Content:**
1. **Role Definition:** "You are observing a development session to create searchable memory FOR FUTURE SESSIONS"
2. **Critical Instruction:** "Record what was BUILT/FIXED/DEPLOYED/CONFIGURED, not what you (the observer) are doing"
3. **What to Record:** Focus on deliverables, capabilities, technical changes
4. **When to Skip:** Routine operations (empty status checks, package installations, file listings)
5. **Output Format:** XML structure with `<observation>` tags and required fields

**Blocking Analysis:** ⚠️ CAN CAUSE SKIPPING
- "WHEN TO SKIP" section instructs SDK to not output for routine operations
- "No output necessary if skipping" means no observations stored
- **This is intentional filtering**, not a bug

**Value Assessment:** ✅ HIGH VALUE
- Prevents noise from routine operations
- Focuses on meaningful changes
- Keep as-is, but consider making "WHEN TO SKIP" configurable

**Key Instructions:**
```
WHEN TO SKIP
------------
Skip routine operations:
- Empty status checks
- Package installations with no errors
- Simple file listings
- Repetitive operations you've already documented
- **No output necessary if skipping.**
```

**Impact:**
- Reduces database size by filtering noise
- But could cause "missing" observations for operations user cares about
- Trade-off between signal and completeness

---

### buildObservationPrompt (prompts.ts lines 130-153)

**Purpose:** Wrap tool execution data in XML for SDK processing

**Content:**
```xml
<tool_used>
  <tool_name>{name}</tool_name>
  <tool_time>{ISO timestamp}</tool_time>
  <tool_input>{JSON}</tool_input>
  <tool_output>{JSON}</tool_output>
</tool_used>
```
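
A wrapper that produces this structure could be sketched as follows. The function name, the `ToolObservation` shape, and the injectable `now` parameter are assumptions for illustration; the real `buildObservationPrompt` may format or escape values differently.

```typescript
// Assumed input shape, based on the fields the /observations endpoint receives.
interface ToolObservation {
  tool_name: string;
  tool_input: unknown;
  tool_output: unknown;
}

// Hypothetical sketch: wrap one tool execution in the XML envelope shown above.
function buildObservationPromptSketch(
  obs: ToolObservation,
  now: Date = new Date(),
): string {
  return [
    "<tool_used>",
    `  <tool_name>${obs.tool_name}</tool_name>`,
    `  <tool_time>${now.toISOString()}</tool_time>`,
    `  <tool_input>${JSON.stringify(obs.tool_input)}</tool_input>`,
    `  <tool_output>${JSON.stringify(obs.tool_output)}</tool_output>`,
    "</tool_used>",
  ].join("\n");
}
```

Passing `now` as a parameter keeps the output deterministic in tests. The sketch does no XML escaping, so tool output that itself contains `</tool_output>` would produce an ambiguous envelope.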

**Blocking Analysis:** ❌ NO BLOCKING
- Just data presentation, no instructions to skip
- SDK processes based on init prompt rules

**Value Assessment:** ✅ HIGH VALUE
- Simple, clean data wrapper
- Keep as-is

---

### buildSummaryPrompt (prompts.ts lines 158-178)

**Purpose:** Request summary of the session so far

**Content:**
1. **Instruction:** "Think about the last request, and write a summary of what was done, what was learned, and what's next"
2. **Important Note:** "DO NOT summarize the observation process itself - you are summarizing a DIFFERENT claude code session, not this one"
3. **Output Format:** XML `<summary>` with required fields
4. **Encouragement:** "Always write at least a minimal summary explaining where we are at currently, even if you didn't learn anything new or complete any work"

**Blocking Analysis:** ❌ NO BLOCKING
- Encourages always writing summary
- SDK may still skip if truly nothing to summarize

**Value Assessment:** ✅ HIGH VALUE
- Ensures summaries are generated
- "Always write at least a minimal summary" reduces skip rate
- Keep as-is

---

## Data Flow Analysis

### End-to-End Flow: Tool Execution → Database

```
1. User executes tool in Claude Code
   ↓
2. PostToolUse hook captures execution
   ↓
3. Hook sends HTTP POST to worker /observations endpoint
   ↓
4. Worker queues message in pendingMessages array
   └─→ HTTP 200 response (confirmed receipt)
   ↓
5. createMessageGenerator polls queue (100ms interval)
   ↓
6. Message shifted from queue
   ↓
7. buildObservationPrompt wraps tool data in XML
   ↓
8. Generator yields message to SDK agent
   ↓
9. SDK sends message to Claude API
   ↓
10. Claude processes tool data based on init prompt
    ↓
11. Claude responds with XML (or skips if routine operation)
    ↓
12. SDK returns response to runSDKAgent
    ↓
13. handleAgentMessage receives response
    ↓
14. parseObservations extracts <observation> blocks
    ↓
15. For each observation:
    - db.storeObservation called
    - Auto-creates session if missing
    - Inserts into observations table
    ↓
16. Data persisted in SQLite database
```

**Failure Points:**
- **Point 3:** Worker not running → HTTP request fails → Hook logs error
- **Point 4:** Worker crashes before processing → Queue lost
- **Point 9:** Invalid model config → SDK fails → Session marked failed
- **Point 11:** Malformed XML response → Parser returns empty array
- **Point 15:** Database error (rare) → Throws exception

**Recovery Mechanisms:**
- **Auto-recovery:** New requests after worker restart auto-create session
- **Graceful degradation:** Partial data saved (v4.2.5/v4.2.6 fixes)
- **Database persistence:** Once stored, data survives all restarts

---

## Blocking Assessment Matrix

### Components That CAN Block Data Storage

| Component | Blocking Scenario | Impact | Mitigation |
|-----------|------------------|---------|------------|
| Worker not running | HTTP requests fail | Observations not queued | PM2 auto-restart, health monitoring |
| Invalid CLAUDE_MEM_MODEL | SDK agent fails to start | Queued messages never processed | Validation in settings script |
| Invalid CLAUDE_CODE_PATH | SDK agent fails to start | Queued messages never processed | Default path, env var fallback |
| Malformed XML in SDK response | Parser can't extract | Data lost for that response | Better error handling, partial parsing |
| Worker restart | In-memory queue lost | Queued messages lost | Could persist queue to DB |
| Session abort (DELETE) | Queue processing stopped | Remaining queue lost | Graceful cleanup (v4.1.0) |
| Init prompt "WHEN TO SKIP" | SDK intentionally skips | No observation stored | Intentional filtering, configurable? |

### Components That CANNOT Block Data Storage

| Component | Reason | Design Pattern |
|-----------|--------|----------------|
| /observations endpoint | Auto-recovery, always queues | Maximally permissive |
| /summarize endpoint | Auto-recovery, always queues | Maximally permissive |
| parseObservations() | Defaults to "change" type, accepts nulls | Permissive (v4.2.6 fix) |
| parseSummary() | Returns partial summaries | Permissive (v4.2.5 fix) |
| storeObservation() | Auto-creates sessions, accepts nulls | Defensive programming |
| storeSummary() | Auto-creates sessions, accepts nulls | Defensive programming |
| Database schema | Nullable fields, no UNIQUE constraints | Flexible storage |

---

## Critical Findings

### 1. Auto-Recovery Pattern Restores Processing After Worker Restart

**Location:** `/observations` and `/summarize` endpoints (lines 181-209, 241-270)

**How it works:**
```typescript
if (!session) {
  // Fetch session from database
  const dbSession = db.getSessionById(sessionDbId);

  // Recreate in-memory state
  session = {
    sessionDbId,
    claudeSessionId: dbSession!.claude_session_id,
    sdkSessionId: null,
    project: dbSession!.project,
    userPrompt: dbSession!.user_prompt,
    pendingMessages: [],
    abortController: new AbortController(),
    generatorPromise: null,
    lastPromptNumber: 0,
    observationCounter: 0,
    startTime: Date.now()
  };

  // Start new SDK agent
  session.generatorPromise = this.runSDKAgent(session);
}
```

**Value:** ✅ HIGH
- Worker restart doesn't prevent new observations from being processed
- Database is source of truth
- Session state can be rebuilt from the database, which enables resilience

**Recommendation:** Extract to helper function to reduce duplication

---

### 2. Parser Fixes (v4.2.5/v4.2.6) Ensure Partial Data Saved

**parseObservations (v4.2.6):**
```typescript
// NOTE FROM THEDOTMACK: ALWAYS save observations - never skip. 10/24/2025
// All fields except type are nullable in schema
// If type is missing or invalid, use "change" as catch-all fallback

let finalType = 'change'; // Default catch-all
if (type && validTypes.includes(type.trim())) {
  finalType = type.trim();
}

// All other fields are optional - save whatever we have
observations.push({
  type: finalType,
  title,        // Can be null
  subtitle,     // Can be null
  facts,
  narrative,    // Can be null
  concepts,
  files_read,
  files_modified
});
```

**parseSummary (v4.2.5):**
```typescript
// NOTE FROM THEDOTMACK: 100% of the time we must SAVE the summary,
// even if fields are missing. 10/24/2025
// NEVER DO THIS NONSENSE AGAIN.

return {
  request,       // Can be null
  investigated,  // Can be null
  learned,       // Can be null
  completed,     // Can be null
  next_steps,    // Can be null
  notes          // Can be null
};
```

**Value:** ✅ CRITICAL
- Prevents data loss from incomplete AI responses
- LLMs make mistakes - system must be resilient
- Partial data is better than no data

**Recommendation:** Keep as-is, this is the right design

---

### 3. In-Memory Queue is Main Vulnerability

**Issue:** `pendingMessages` array is in-memory only
- Worker restart → All queued messages lost
- But HTTP response already confirmed receipt

**Current behavior:**
1. Hook sends observation → Worker responds "queued" → Hook thinks it's saved
2. Worker crashes before processing → Observation lost
3. BUT: New observations after restart are still processed (auto-recovery)

**Impact:** ⚠️ MEDIUM
- Data can be lost in the window between queueing and processing
- Observations are idempotent, so they could safely be resent
- But hooks don't retry after a success response, so a message lost in that window is never recovered

**Recommendation:** ⚠️ CONSIDER
- Persist queue to database (e.g., `pending_messages` table)
- Mark as processed when SDK handles
- Increases reliability but adds complexity
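
A persisted queue along these lines might be sketched as follows. Everything here is hypothetical: the table layout in the comment, the `PendingMessageStore` interface, and the `restoreQueue` helper are illustrative names, and the in-memory class only stands in for a real SQLite-backed implementation so the example runs without a database driver.

```typescript
// Assumed DDL for the proposed table (not part of the current schema):
//   CREATE TABLE pending_messages (
//     id            INTEGER PRIMARY KEY AUTOINCREMENT,
//     session_db_id INTEGER NOT NULL,
//     payload       TEXT NOT NULL,   -- JSON-encoded message
//     processed_at  INTEGER          -- NULL until the SDK agent handles it
//   );

interface PendingMessage {
  id: number;
  sessionDbId: number;
  payload: string;
  processedAt: number | null;
}

interface PendingMessageStore {
  enqueue(sessionDbId: number, payload: string): number;
  markProcessed(id: number): void;
  unprocessed(sessionDbId: number): PendingMessage[];
}

// In-memory stand-in for a database-backed store; it mirrors the table above.
class InMemoryPendingStore implements PendingMessageStore {
  private rows: PendingMessage[] = [];
  private nextId = 1;

  enqueue(sessionDbId: number, payload: string): number {
    const id = this.nextId++;
    this.rows.push({ id, sessionDbId, payload, processedAt: null });
    return id;
  }

  markProcessed(id: number): void {
    const row = this.rows.find((r) => r.id === id);
    if (row) row.processedAt = Date.now();
  }

  unprocessed(sessionDbId: number): PendingMessage[] {
    return this.rows.filter(
      (r) => r.sessionDbId === sessionDbId && r.processedAt === null
    );
  }
}

// During auto-recovery, refill the session queue from rows never marked processed.
function restoreQueue(store: PendingMessageStore, sessionDbId: number): unknown[] {
  return store.unprocessed(sessionDbId).map((row) => JSON.parse(row.payload));
}
```

With this shape, the HTTP handler would call `enqueue` before responding "queued", and the agent loop would call `markProcessed` once a message has been handled, which closes the window described above at the cost of one extra write per message.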

---

### 4. Init Prompt "WHEN TO SKIP" Intentionally Filters

**Instruction:**
```
WHEN TO SKIP
------------
Skip routine operations:
- Empty status checks
- Package installations with no errors
- Simple file listings
- Repetitive operations you've already documented
- **No output necessary if skipping.**
```

**Impact:**
- Reduces noise in database
- Focuses on meaningful changes
- BUT: User might wonder why some tool executions aren't recorded

**Value:** ✅ MEDIUM - Intentional filtering
- Prevents database bloat
- Trade-off between signal and completeness

**Recommendation:** ⚠️ CONSIDER
- Make "WHEN TO SKIP" configurable (env var or settings)
- Or add verbosity levels (minimal/normal/verbose)
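
One way the verbosity idea could be wired up is sketched below. The `Verbosity` type, `parseVerbosity`, and `buildSkipSection` are hypothetical names, and the extra rule added at the `minimal` level is invented purely for illustration; only the four base rules come from the prompt quoted above.

```typescript
type Verbosity = "minimal" | "normal" | "verbose";

// Parse a raw setting (e.g. from an env var); unknown values fall back to "normal".
function parseVerbosity(raw: string | undefined): Verbosity {
  const v = (raw ?? "").trim().toLowerCase();
  return v === "minimal" || v === "verbose" ? v : "normal";
}

// Rules copied from the current init prompt.
const BASE_SKIP_RULES: string[] = [
  "Empty status checks",
  "Package installations with no errors",
  "Simple file listings",
  "Repetitive operations you've already documented",
];

// Hypothetical stricter rule, used only at the "minimal" level.
const MINIMAL_EXTRA_RULES: string[] = ["Read-only inspections with no findings"];

// Build the "WHEN TO SKIP" section of the init prompt for a given level.
function buildSkipSection(verbosity: Verbosity): string {
  // "verbose" drops the section entirely so every tool execution is recorded.
  if (verbosity === "verbose") return "";

  const rules =
    verbosity === "minimal"
      ? [...BASE_SKIP_RULES, ...MINIMAL_EXTRA_RULES]
      : BASE_SKIP_RULES;

  return [
    "WHEN TO SKIP",
    "------------",
    "Skip routine operations:",
    ...rules.map((rule) => `- ${rule}`),
    "- **No output necessary if skipping.**",
  ].join("\n");
}
```

The `normal` level reproduces the current section, so existing behavior would be unchanged unless a user opts in to a different level.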

---

## Value Assessment by Component

### HIGH VALUE - Keep As-Is

| Component | Reason |
|-----------|--------|
| Auto-recovery pattern | Prevents worker restart data loss |
| Permissive parser (v4.2.5/v4.2.6) | Ensures partial data saved, critical for reliability |
| Nullable database schema | Flexible storage, allows incomplete data |
| WAL mode SQLite | Good concurrency, reliable writes |
| Isolated session state | No cross-contamination between sessions |
| Queue-based architecture | Decouples HTTP from SDK processing |
| storeObservation/storeSummary auto-creation | Defensive programming, prevents foreign key errors |

### MEDIUM VALUE - Consider Improvements

| Component | Current State | Potential Improvement |
|-----------|--------------|----------------------|
| In-memory queue | Lost on restart | Persist to DB for durability |
| 100ms polling | Works but inefficient | Event-driven async queue |
| Duplicated auto-recovery code | Lines 181-209 and 241-270 identical | Extract to `getOrCreateSession()` helper |
| No try-catch around DB ops | Errors crash handler | Add error handling with logging |
| Model/port defaults | Defaults hard-coded, overridable via env vars | None needed, already configurable ✓ |
| Init prompt filtering | Fixed "WHEN TO SKIP" rules | Make configurable (verbosity levels) |

### LOW VALUE - Questionable Design

| Component | Issue | Recommendation |
|-----------|-------|----------------|
| cleanupOrphanedSessions() | Marks ALL active sessions failed on startup | Aggressive, but necessary with fixed port |
| 5-second DELETE timeout | Arbitrary | Make configurable via env var |
| "NO SUMMARY TAGS FOUND" warning | Log level too high | Change to INFO level |

---

## Recommendations

### Priority 1: Critical Reliability Improvements

1. **Persist Message Queue to Database**
   - Create `pending_messages` table
   - Store queued observations/summaries
   - Mark as processed when handled by SDK
   - Prevents data loss on worker restart
   - **Effort:** Medium, **Impact:** High

2. **Add Error Handling Around Database Operations**
   - Wrap `db.storeObservation()` and `db.storeSummary()` in try-catch
   - Log errors with full context
   - Continue processing other messages on error
   - **Effort:** Low, **Impact:** Medium
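
The error-handling recommendation might be sketched as below. The `ObservationStore` interface, the reduced `ParsedObservation` shape, and the `storeAll` helper are assumptions for illustration; the real `db.storeObservation()` signature may differ.

```typescript
// Reduced, hypothetical observation shape; the real one has more fields.
interface ParsedObservation {
  type: string;
  title: string | null;
}

interface ObservationStore {
  storeObservation(obs: ParsedObservation): void;
}

type LogFn = (message: string, context: Record<string, unknown>) => void;

// Store each observation independently so one failure doesn't drop the rest.
function storeAll(
  db: ObservationStore,
  observations: ParsedObservation[],
  log: LogFn = (message, context) => console.error(message, context)
): { stored: number; failed: number } {
  let stored = 0;
  let failed = 0;

  for (const obs of observations) {
    try {
      db.storeObservation(obs);
      stored++;
    } catch (err) {
      failed++;
      // Log with enough context to diagnose, then continue with the next item.
      log("Failed to store observation", {
        type: obs.type,
        title: obs.title,
        error: err instanceof Error ? err.message : String(err),
      });
    }
  }

  return { stored, failed };
}
```

Returning the counts lets the caller decide whether a partial failure should be surfaced, for example through a health check.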

### Priority 2: Code Quality Improvements

3. **Extract Auto-Recovery to Helper Function**
   ```typescript
   private async getOrCreateSession(sessionDbId: number): Promise<ActiveSession> {
     // Consolidate lines 181-209 and 241-270
   }
   ```
   - **Effort:** Low, **Impact:** Low (code quality)

4. **Make Configuration More Flexible**
   - Add `CLAUDE_MEM_VERBOSITY` env var (minimal/normal/verbose)
   - Adjust init prompt "WHEN TO SKIP" based on verbosity
   - Add `CLAUDE_MEM_DELETE_TIMEOUT` env var
   - **Effort:** Low, **Impact:** Medium

### Priority 3: Performance Optimizations

5. **Replace Polling with Event-Driven Queue**
   - Use `AsyncQueue` with notifications instead of 100ms polling
   - Reduces latency from queue to processing
   - **Effort:** Medium, **Impact:** Low (performance)

6. **Add Queue Metrics**
   - Track queue length over time
   - Alert if queue grows unbounded
   - Add to `/health` endpoint
   - **Effort:** Low, **Impact:** Low (observability)
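
An event-driven queue of the kind suggested in item 5 could be sketched as follows. This `AsyncQueue` is a hypothetical minimal implementation written for this document, not an existing class in the codebase; it omits abort handling, which a real version would need in order to cooperate with the session's `AbortController`. The `length` getter is the hook that the metrics in item 6 could read.

```typescript
// Minimal event-driven queue: consumers await next() instead of polling.
class AsyncQueue<T> {
  private items: T[] = [];
  private waiters: Array<(value: T) => void> = [];

  // Number of buffered items not yet handed to a consumer.
  get length(): number {
    return this.items.length;
  }

  // Hand the item straight to a waiting consumer, or buffer it.
  push(item: T): void {
    const waiter = this.waiters.shift();
    if (waiter) {
      waiter(item);
    } else {
      this.items.push(item);
    }
  }

  // Resolve immediately if an item is buffered; otherwise wait for the next push.
  next(): Promise<T> {
    if (this.items.length > 0) {
      return Promise.resolve(this.items.shift() as T);
    }
    return new Promise<T>((resolve) => {
      this.waiters.push(resolve);
    });
  }
}
```

A consumer loop would then `await queue.next()` in place of sleeping for 100ms between checks, so a message is picked up as soon as it is pushed.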

---

## Appendix: Configuration Reference

### Environment Variables

| Variable | Default | Purpose | Blocking Impact |
|----------|---------|---------|----------------|
| `CLAUDE_MEM_MODEL` | `claude-sonnet-4-5` | AI model for processing | Invalid = SDK fails |
| `CLAUDE_MEM_WORKER_PORT` | `37777` | HTTP server port | Invalid = Worker won't start |
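
Reading these two variables with their documented defaults could look like the sketch below. `WorkerConfig` and `loadConfig` are hypothetical names, and the port validation is an assumption about how "Invalid = Worker won't start" might be enforced, not a description of the worker's actual startup code.

```typescript
interface WorkerConfig {
  model: string;
  port: number;
}

// Takes the environment as a plain object so it can be tested without globals.
function loadConfig(env: Record<string, string | undefined>): WorkerConfig {
  // Fall back to the documented default when unset or blank.
  const model = (env["CLAUDE_MEM_MODEL"] ?? "").trim() || "claude-sonnet-4-5";

  const rawPort = env["CLAUDE_MEM_WORKER_PORT"] ?? "37777";
  const port = Number(rawPort);
  // Fail fast on anything that is not a valid TCP port.
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid CLAUDE_MEM_WORKER_PORT: ${rawPort}`);
  }

  return { model, port };
}
```

In the worker this would presumably be called once at startup with `process.env`. The model name is passed through unchecked here, since per the table an invalid model only surfaces when the SDK call fails.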

### Constants

| Constant | Value | Purpose |
|----------|-------|---------|
| `DISALLOWED_TOOLS` | `['Glob', 'Grep', 'ListMcpResourcesTool', 'WebSearch']` | Tools SDK agent can't use |
| Polling interval | `100ms` | Queue polling frequency |
| DELETE timeout | `5000ms` | Max wait for agent shutdown |

---

## Conclusion

The claude-mem worker server is a well-designed system with a clear **defensive, layered architecture** that prioritizes **data persistence**. The key strengths are:

1. **Auto-recovery** from worker restarts
2. **Permissive parsing** that saves partial data
3. **Nullable schema** that accepts incomplete information
4. **Session isolation** preventing cross-contamination

The main vulnerability is the **in-memory queue**, which could be mitigated by persisting to the database. Overall, the system achieves its goal of creating a persistent memory system that survives failures and continues operating even with incomplete data.

**Design Philosophy:** "Better to save partial data than lose everything."

This philosophy is evident throughout the codebase, from the v4.2.5/v4.2.6 parser fixes to the auto-creation patterns in the database layer. The system is built to be resilient to AI errors, configuration issues, and process failures.

---

**End of Document**