# Claude-Mem Worker Server Architecture
**Document Version:** 1.0
**Last Updated:** 2025-01-24
**Author:** Analysis by Claude Code
**Purpose:** Comprehensive technical analysis of the worker server architecture, logic flow, blocking behavior, and component value assessment
---
## Executive Summary
The claude-mem worker server is a long-running HTTP service managed by PM2 that processes tool execution observations and generates session summaries using the Claude Agent SDK. It implements a **defensive, layered architecture** designed to maximize data persistence while maintaining flexibility.
### Key Design Principles
1. **Maximally Permissive Storage** - System defaults to saving data even if incomplete
2. **Auto-Recovery** - Worker restarts don't prevent processing (session state reconstructed from database)
3. **Queue-Based Processing** - HTTP API decoupled from AI processing for reliability
4. **Defensive Programming** - Auto-creates missing database records, accepts null fields
5. **Session Isolation** - Each session has independent state and SDK agent
### Architecture at a Glance
```
┌─────────────────────────────────────────────────────────────┐
│ Layer 1: HTTP API (Express.js) │
│ - 6 REST endpoints │
│ - Always queues messages (maximally permissive) │
└──────────────────┬──────────────────────────────────────────┘
┌──────────────────▼──────────────────────────────────────────┐
│ Layer 2: In-Memory Queue │
│ - pendingMessages array per session │
│ - VULNERABILITY: Lost on worker restart │
└──────────────────┬──────────────────────────────────────────┘
┌──────────────────▼──────────────────────────────────────────┐
│ Layer 3: SDK Agent (Claude Agent SDK) │
│ - Processes queued messages via async generator │
│ - Can fail due to config or AI errors │
└──────────────────┬──────────────────────────────────────────┘
┌──────────────────▼──────────────────────────────────────────┐
│ Layer 4: Parser (XML Extraction) │
│ - Extracts observations and summaries from AI responses │
│ - Permissive (v4.2.5/v4.2.6 fixes ensure partial data saved)│
└──────────────────┬──────────────────────────────────────────┘
┌──────────────────▼──────────────────────────────────────────┐
│ Layer 5: Database (SQLite with better-sqlite3) │
│ - Permanent storage (once here, data persists) │
│ - Auto-creates missing sessions, accepts nulls │
└─────────────────────────────────────────────────────────────┘
```
**Critical Insight:** Once the worker has accepted a request, data can only be lost in layers 2-4 (queue, SDK agent, parser). Once it reaches the database (layer 5), it is permanent.
---
## Component Inventory
### HTTP REST API Endpoints
| Endpoint | Purpose | Blocks Data? |
|----------|---------|--------------|
| `GET /health` | Worker health check | N/A |
| `POST /sessions/:id/init` | Initialize session and start SDK agent | Only if session not in DB (expected) |
| `POST /sessions/:id/observations` | Queue tool observation | ❌ Never (auto-recovery) |
| `POST /sessions/:id/summarize` | Queue summary request | ❌ Never (auto-recovery) |
| `GET /sessions/:id/status` | Get session status | N/A |
| `DELETE /sessions/:id` | Abort session | ⚠️ Queued messages lost |
### Core Processing Components
| Component | File | Lines | Purpose |
|-----------|------|-------|---------|
| WorkerService | worker-service.ts | 52-590 | Main service class, manages sessions |
| runSDKAgent | worker-service.ts | 345-404 | Runs SDK agent for a session |
| createMessageGenerator | worker-service.ts | 410-502 | Async generator feeding SDK |
| handleAgentMessage | worker-service.ts | 508-563 | Parses and stores SDK responses |
| parseObservations | parser.ts | 32-96 | Extracts observations from XML |
| parseSummary | parser.ts | 102-157 | Extracts summary from XML |
| SessionStore | SessionStore.ts | 9-1086 | Database operations |
---
## Deep Dive: HTTP Endpoints
### GET /health (lines 100-109)
**Purpose:** Health check for monitoring and debugging
**Logic Flow:**
1. Returns JSON with status, port, PID, active sessions, uptime, memory
**Blocking Analysis:** ❌ N/A (read-only endpoint)
**Value Assessment:** ✅ HIGH VALUE
- Essential for monitoring worker health
- Helps debug port conflicts and process state
- Keep as-is
---
### POST /sessions/:sessionDbId/init (lines 115-169)
**Purpose:** Initialize a new session and start the SDK agent
**Logic Flow:**
1. Parse `sessionDbId` from URL
2. Extract `project` and `userPrompt` from request body
3. Fetch session from database using `SessionStore.getSessionById()`
4. **CRITICAL CHECK:** Return 404 if session not found in DB
5. Retrieve `claudeSessionId` from database record
6. Create `ActiveSession` object with initial state:
```typescript
{
  sessionDbId, claudeSessionId, sdkSessionId: null,
  project, userPrompt, pendingMessages: [],
  abortController: new AbortController(),
  generatorPromise: null, lastPromptNumber: 0,
  observationCounter: 0, startTime: Date.now()
}
```
7. Store session in memory map (`this.sessions`)
8. Update `worker_port` in database
9. Start `runSDKAgent()` in background (fire-and-forget promise)
10. Return success response immediately
**Blocking Analysis:** ⚠️ CONDITIONAL
- Returns 404 if session doesn't exist in database
- This is expected behavior - session must be created before init
- Doesn't prevent future initialization attempts
- Error logged and hook can retry
**Value Assessment:** ✅ HIGH VALUE
- Critical initialization step
- Background SDK agent startup prevents timeout
- Keep as-is
**Edge Cases:**
- Session exists but SDK agent fails to start → Session marked as failed, but new init can retry
- Multiple init calls for same session → First one wins (subsequent calls find session in memory)
---
### POST /sessions/:sessionDbId/observations (lines 175-230)
**Purpose:** Queue a tool execution observation for processing
**Logic Flow:**
1. Parse `sessionDbId` from URL
2. Extract `tool_name`, `tool_input`, `tool_output`, `prompt_number` from body
3. Check if session exists in memory map (`this.sessions.get(sessionDbId)`)
4. **AUTO-RECOVERY** (lines 181-209): If session NOT in memory:
- Fetch session from database
- Recreate `ActiveSession` object
- Start new SDK agent in background
- This enables recovery from worker restarts!
5. Increment `observationCounter` for correlation ID tracking
6. Push observation message to `pendingMessages` queue:
```typescript
{
type: 'observation',
tool_name, tool_input, tool_output, prompt_number
}
```
7. Return success with queue length
**Blocking Analysis:** ❌ NEVER BLOCKS
- Auto-creates session state from database if missing
- Always queues the observation
- HTTP response confirms receipt immediately
- Processing happens asynchronously
**Value Assessment:** ✅ HIGH VALUE
- Auto-recovery is brilliant design
- Worker restart doesn't lose ability to process observations
- Keep as-is
**Edge Cases:**
- Worker restart while observation in queue → Lost (queue is in-memory)
- But NEW observations after restart are queued successfully (auto-recovery)
- Database not found → Would throw error, but SessionStore auto-creates sessions
---
### POST /sessions/:sessionDbId/summarize (lines 236-284)
**Purpose:** Queue a summary generation request
**Logic Flow:**
1. Parse `sessionDbId` and `prompt_number` from request
2. Check if session exists in memory
3. **AUTO-RECOVERY** (lines 241-270): Same pattern as observations endpoint
- Fetches session from database
- Recreates `ActiveSession` object
- Starts new SDK agent
4. Push summarize message to `pendingMessages` queue:
```typescript
{
type: 'summarize',
prompt_number
}
```
5. Return success with queue length
**Blocking Analysis:** ❌ NEVER BLOCKS
- Same auto-recovery mechanism as observations
- Always queues the summary request
- Processing happens asynchronously
**Value Assessment:** ✅ HIGH VALUE
- Auto-recovery pattern prevents data loss
- Keep as-is
**Code Quality Note:** ⚠️ MEDIUM - Duplicated auto-recovery code (lines 181-209 and 241-270 are nearly identical)
- Could extract to helper function: `getOrCreateSession(sessionDbId)`
- Would reduce duplication and improve maintainability
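
A minimal sketch of what such a helper could look like. The `SessionLookup` and `AgentStarter` types are hypothetical stand-ins for `SessionStore.getSessionById()` and `runSDKAgent()`, the numeric `sessionDbId` is an assumption, and the explicit not-found error replaces the non-null assertions purely for illustration:

```typescript
// Hypothetical shape; field names follow the ActiveSession object described above.
interface ActiveSession {
  sessionDbId: number;
  claudeSessionId: string;
  sdkSessionId: string | null;
  project: string;
  userPrompt: string;
  pendingMessages: unknown[];
  abortController: AbortController;
  generatorPromise: Promise<void> | null;
  lastPromptNumber: number;
  observationCounter: number;
  startTime: number;
}

interface DbSessionRow {
  claude_session_id: string;
  project: string;
  user_prompt: string;
}

// Stand-ins (assumptions) for the database lookup and the SDK agent launcher.
type SessionLookup = (sessionDbId: number) => DbSessionRow | null;
type AgentStarter = (session: ActiveSession) => Promise<void>;

function getOrCreateSession(
  sessions: Map<number, ActiveSession>,
  sessionDbId: number,
  lookup: SessionLookup,
  startAgent: AgentStarter,
): ActiveSession {
  const existing = sessions.get(sessionDbId);
  if (existing) return existing;

  const row = lookup(sessionDbId);
  if (!row) {
    // Assumption: fail loudly when the database has no such session.
    throw new Error(`Session ${sessionDbId} not found in database`);
  }

  const session: ActiveSession = {
    sessionDbId,
    claudeSessionId: row.claude_session_id,
    sdkSessionId: null,
    project: row.project,
    userPrompt: row.user_prompt,
    pendingMessages: [],
    abortController: new AbortController(),
    generatorPromise: null,
    lastPromptNumber: 0,
    observationCounter: 0,
    startTime: Date.now(),
  };
  sessions.set(sessionDbId, session);
  // Fire-and-forget; failures are assumed to be logged inside the agent itself.
  session.generatorPromise = startAgent(session).catch(() => {});
  return session;
}
```

Both the `/observations` and `/summarize` handlers could then call this one function instead of repeating the recovery block.
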
---
### GET /sessions/:sessionDbId/status (lines 289-304)
**Purpose:** Get current session status and queue length
**Logic Flow:**
1. Parse `sessionDbId` from URL
2. Get session from memory map
3. Return 404 if not found
4. Return session info: `sessionDbId`, `sdkSessionId`, `project`, `pendingMessages.length`
**Blocking Analysis:** ❌ N/A (read-only endpoint)
**Value Assessment:** ✅ MEDIUM VALUE
- Useful for debugging
- Not critical for core functionality
- Keep as-is
---
### DELETE /sessions/:sessionDbId (lines 309-340)
**Purpose:** Abort a running session and clean up
**Logic Flow:**
1. Parse `sessionDbId` from URL
2. Get session from memory map
3. Return 404 if not found
4. Call `abortController.abort()` to signal SDK agent to stop
5. Wait for `generatorPromise` to finish (max 5 seconds timeout)
6. Mark session as 'failed' in database
7. Delete session from memory map
8. Return success
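
The abort-then-wait steps (4 and 5) can be sketched with `Promise.race`. This is an illustration under stated assumptions, not the worker's code: `settleWithin` and `abortSession` are hypothetical names, and only the 5-second default comes from the description above.

```typescript
// Resolves to true if `work` settles (fulfilled or rejected) within `timeoutMs`,
// false if the timeout fires first.
async function settleWithin(work: Promise<unknown>, timeoutMs: number): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<false>((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutMs);
  });
  try {
    return await Promise.race([
      work.then(() => true as const, () => true as const),
      timeout,
    ]);
  } finally {
    // Clear the timer so it does not keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Signals the agent to stop, then waits a bounded time for it to finish.
async function abortSession(
  session: { abortController: AbortController; generatorPromise: Promise<void> | null },
  timeoutMs = 5000,
): Promise<boolean> {
  session.abortController.abort();
  if (!session.generatorPromise) return true;
  return settleWithin(session.generatorPromise, timeoutMs);
}
```

A `false` result means the agent did not stop in time; the caller would still mark the session as failed and remove it from the map.
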
**Blocking Analysis:** ⚠️ BLOCKS QUEUED MESSAGES
- Aborts SDK agent processing
- Any messages in `pendingMessages` queue are lost
- Already-stored observations/summaries remain in database
**Value Assessment:** ✅ MEDIUM VALUE
- Provides clean shutdown mechanism
- Used for manual cleanup
- As of v4.1.0, SessionEnd hook doesn't call DELETE (graceful cleanup)
- Keep for manual intervention, but not used automatically
**Historical Note:**
- v4.0.x: SessionEnd hook called DELETE → interrupted summary generation
- v4.1.0+: Graceful cleanup → workers finish naturally
---
## Deep Dive: SDK Agent Processing
### runSDKAgent (lines 345-404)
**Purpose:** Core processing engine that runs continuously for each session
**Logic Flow:**
1. Call `query()` from Claude Agent SDK with:
```typescript
{
  prompt: this.createMessageGenerator(session),
  options: {
    model: MODEL, // from CLAUDE_MEM_MODEL env var
    disallowedTools: DISALLOWED_TOOLS,
    abortController: session.abortController,
    pathToClaudeCodeExecutable: claudePath
  }
}
```
2. Iterate over SDK responses using `for await`
3. For each assistant message:
- Extract text content from response
- Log response size
- Call `handleAgentMessage()` to parse and store
4. On completion:
- Log session duration
- Mark session as 'completed' in database
- Delete session from memory map
5. On error:
- Log error (or warning for AbortError)
- Mark session as 'failed' in database
- Throw error (caught by `generatorPromise.catch()`)
**Blocking Analysis:** ⚠️ CAN BLOCK IF:
- Invalid `CLAUDE_MEM_MODEL` → SDK initialization fails
- Invalid `CLAUDE_CODE_PATH` → SDK initialization fails
- SDK crashes → Session marked as failed
- BUT: Doesn't prevent NEW sessions from being created
**Value Assessment:** ✅ HIGH VALUE
- Core processing engine
- Proper error handling with session status tracking
- Keep as-is
**Configuration Dependencies:**
- `CLAUDE_MEM_MODEL` (default: 'claude-sonnet-4-5')
- `CLAUDE_CODE_PATH` (default: specific path to claude executable)
- `DISALLOWED_TOOLS` (default: ['Glob', 'Grep', 'ListMcpResourcesTool', 'WebSearch'])
**Edge Cases:**
- SDK hangs → AbortController provides cancellation mechanism
- Network issues → SDK handles retries internally
- Multiple concurrent sessions → Each has isolated SDK agent
---
### createMessageGenerator (lines 410-502)
**Purpose:** Async generator that feeds messages to the SDK agent (bridge between HTTP API and SDK)
**Logic Flow:**
1. Build init prompt using `buildInitPrompt(project, claudeSessionId, userPrompt)`
2. Yield initial user message:
```typescript
{
  type: 'user',
  session_id: claudeSessionId, // Real Claude Code session ID
  parent_tool_use_id: null,
  message: { role: 'user', content: initPrompt }
}
```
3. Enter infinite loop (`while (true)`):
   - Check `abortController.signal.aborted` → break if aborted
   - If no pending messages → sleep 100ms and continue
   - While `pendingMessages.length > 0`:
     - Shift message from queue (FIFO)
     - If type === 'summarize':
       - Update `lastPromptNumber`
       - Fetch session from database
       - Build summary prompt using `buildSummaryPrompt()`
       - Yield summary prompt as user message
     - If type === 'observation':
       - Update `lastPromptNumber`
       - Build observation prompt using `buildObservationPrompt()`
       - Yield observation prompt as user message
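
The polling loop in step 3 can be sketched as a small async generator. This is a simplified illustration: `drainQueue` and `QueuedMessage` are hypothetical names, and plain strings stand in for the output of `buildSummaryPrompt()` and `buildObservationPrompt()`.

```typescript
// Simplified message shapes (assumption: real messages carry more fields).
type QueuedMessage =
  | { type: "observation"; tool_name: string; prompt_number: number }
  | { type: "summarize"; prompt_number: number };

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function* drainQueue(
  pending: QueuedMessage[],
  signal: AbortSignal,
  pollMs = 100,
): AsyncGenerator<string> {
  while (!signal.aborted) {
    if (pending.length === 0) {
      // Nothing queued: wait one polling interval, then check again.
      await sleep(pollMs);
      continue;
    }
    while (pending.length > 0) {
      const message = pending.shift()!; // FIFO order
      // Placeholder strings stand in for the real prompt builders.
      yield message.type === "summarize"
        ? `summarize:${message.prompt_number}`
        : `observation:${message.tool_name}`;
    }
  }
}
```

Because the consumer pulls from the generator, a message leaves the queue only when the SDK is ready for the next prompt.
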
**Blocking Analysis:** ❌ DOESN'T BLOCK
- Continuously processes queue until aborted
- 100ms polling means small delay but no data loss
- Messages shifted from queue and sent to SDK
- If the SDK fails, messages already shifted from the queue are lost, even though the HTTP response confirmed receipt
**Value Assessment:** ✅ HIGH VALUE
- Elegant async generator pattern
- Keep as-is
**Performance Note:** ⚠️ 100ms polling interval
- Could be improved with event-driven queue (e.g., `AsyncQueue` with notifications)
- Current implementation is simple and works well
- Low priority optimization
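
One way an event-driven queue could look, as a rough sketch: `NotifyingQueue` is a hypothetical class, not an existing library, and abort handling is left out to keep the idea visible.

```typescript
// A queue whose consumer awaits the next item instead of polling on a timer.
class NotifyingQueue<T> {
  private items: T[] = [];
  private waiters: Array<(item: T) => void> = [];

  // Hands the item to a waiting consumer if there is one, otherwise buffers it.
  push(item: T): void {
    const waiter = this.waiters.shift();
    if (waiter) waiter(item);
    else this.items.push(item);
  }

  // Resolves immediately when an item is buffered, otherwise when one is pushed.
  next(): Promise<T> {
    if (this.items.length > 0) return Promise.resolve(this.items.shift() as T);
    return new Promise<T>((resolve) => this.waiters.push(resolve));
  }

  get size(): number {
    return this.items.length;
  }
}
```

With a queue like this, the generator would `await queue.next()` rather than sleeping for 100ms, so a message is picked up as soon as it arrives.
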
**Data Flow:**
```
HTTP /observations → pendingMessages.push() → [sleep 100ms] →
pendingMessages.shift() → buildObservationPrompt() → yield to SDK →
SDK processes → handleAgentMessage()
```
---
### handleAgentMessage (lines 508-563)
**Purpose:** Parse SDK response and store observations/summaries in database
**Logic Flow:**
1. Call `parseObservations(content, correlationId)`
2. If observations found:
   - For each observation:
     - Call `db.storeObservation(claudeSessionId, project, observation, promptNumber)`
     - Log success with correlation ID
3. Call `parseSummary(content, sessionId)`
4. If summary found:
   - Call `db.storeSummary(claudeSessionId, project, summary, promptNumber)`
   - Log success
5. If NO summary found:
   - Log warning with content sample
**Blocking Analysis:** ⚠️ CAN BLOCK IF:
- Parser returns empty array/null → Nothing stored (but this is expected for routine operations)
- Database error → Would throw and crash handler (rare with permissive schema)
**Value Assessment:** ✅ HIGH VALUE
- Core storage logic
- Proper logging for debugging
- Keep as-is
**Critical Dependencies:**
- `parseObservations()` must return valid observations
- `parseSummary()` must return valid summary
- Database must accept the data (schema constraints)
**Logging:**
- Extensive logging at INFO, SUCCESS, and WARN levels
- Correlation IDs for tracking individual observations
- Debug mode logs full SDK responses
---
## Deep Dive: Parser System
### parseObservations (parser.ts lines 32-96)
**Purpose:** Extract observation XML blocks from SDK response and parse into structured data
**Logic Flow:**
1. Use regex to find all `<observation>...</observation>` blocks (non-greedy):
```typescript
/<observation>([\s\S]*?)<\/observation>/g
```
2. For each block:
   - Extract all fields: `type`, `title`, `subtitle`, `narrative`, `facts`, `concepts`, `files_read`, `files_modified`
   - **VALIDATION** (lines 52-67):
     - If `type` is missing or invalid → default to "change"
     - Valid types: `['bugfix', 'feature', 'refactor', 'change', 'discovery', 'decision']`
     - All other fields can be null
   - Filter out `type` from `concepts` array (types and concepts are separate dimensions)
   - Push observation to results array
3. Return all observations
**Blocking Analysis:** ❌ NEVER BLOCKS (as of v4.2.6)
- **CRITICAL FIX** (v4.2.6): Removed validation that required title, subtitle, and narrative
- Comment on line 52: "NOTE FROM THEDOTMACK: ALWAYS save observations - never skip. 10/24/2025"
- Always returns observations with whatever fields exist
- Only transformation: type defaults to "change" if invalid
**Value Assessment:** ✅ HIGH VALUE
- Permissive parsing ensures data is never lost
- v4.2.6 fix was critical for reliability
- Keep as-is
**Historical Context:**
- **Before v4.2.6:** Would skip observations missing required fields → data loss
- **After v4.2.6:** Always saves with defaults → maximally permissive
**Edge Cases:**
1. No `<observation>` tags → Returns empty array (normal for routine operations)
2. All fields empty → Returns observation with null fields and type="change"
3. Malformed XML → Regex won't match → Returns empty array (data loss)
4. Type in concepts → Filtered out (types and concepts are orthogonal)
**Example:**
```xml
<observation>
<type>feature</type>
<title>Authentication added</title>
<subtitle>Implemented OAuth2 flow</subtitle>
<facts>
<fact>Added OAuth2 provider configuration</fact>
<fact>Created callback endpoint</fact>
</facts>
<narrative>Full OAuth2 authentication...</narrative>
<concepts>
<concept>how-it-works</concept>
<concept>what-changed</concept>
</concepts>
<files_read>
<file>src/auth/oauth.ts</file>
</files_read>
<files_modified>
<file>src/auth/oauth.ts</file>
</files_modified>
</observation>
```
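
The permissive parsing described above can be approximated as follows. This is a sketch, not the code in `parser.ts`: `field`, `list`, and `parseObservationsSketch` are hypothetical names, and dropping only the observation's own type from `concepts` is one reading of the filtering rule.

```typescript
const VALID_TYPES = ["bugfix", "feature", "refactor", "change", "discovery", "decision"];

interface ParsedObservation {
  type: string;
  title: string | null;
  subtitle: string | null;
  narrative: string | null;
  facts: string[];
  concepts: string[];
  files_read: string[];
  files_modified: string[];
}

// Trimmed text of the first <tag>...</tag>, or null when absent or empty.
function field(xml: string, tag: string): string | null {
  const match = new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`).exec(xml);
  const text = match ? (match[1] ?? "").trim() : "";
  return text === "" ? null : text;
}

// Every non-empty <item>...</item> nested inside <container>...</container>.
function list(xml: string, container: string, item: string): string[] {
  const block = field(xml, container);
  if (block === null) return [];
  const pattern = new RegExp(`<${item}>([\\s\\S]*?)</${item}>`, "g");
  return Array.from(block.matchAll(pattern), (m) => (m[1] ?? "").trim()).filter((s) => s !== "");
}

function parseObservationsSketch(text: string): ParsedObservation[] {
  const results: ParsedObservation[] = [];
  for (const match of text.matchAll(/<observation>([\s\S]*?)<\/observation>/g)) {
    const block = match[1] ?? "";
    const rawType = field(block, "type");
    // Missing or invalid type falls back to "change"; nothing is skipped.
    const type = rawType !== null && VALID_TYPES.includes(rawType) ? rawType : "change";
    results.push({
      type,
      title: field(block, "title"),
      subtitle: field(block, "subtitle"),
      narrative: field(block, "narrative"),
      facts: list(block, "facts", "fact"),
      // Assumption: only the observation's own type is removed from concepts.
      concepts: list(block, "concepts", "concept").filter((c) => c !== type),
      files_read: list(block, "files_read", "file"),
      files_modified: list(block, "files_modified", "file"),
    });
  }
  return results;
}
```

An observation with no recognizable fields still comes back as an entry with `type: "change"` and null or empty values, which matches the "always save" rule.
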
---
### parseSummary (parser.ts lines 102-157)
**Purpose:** Extract summary XML block from SDK response
**Logic Flow:**
1. Check for `<skip_summary reason="..."/>` tag (lines 104-113)
- If found → log reason and return null (intentional skip)
2. Match `<summary>...</summary>` block (non-greedy):
```typescript
/<summary>([\s\S]*?)<\/summary>/
```
- If not found → return null (SDK didn't provide summary)
3. Extract all fields: `request`, `investigated`, `learned`, `completed`, `next_steps`, `notes` (optional)
4. **VALIDATION REMOVED** (lines 133-147):
- Comment: "NOTE FROM THEDOTMACK: 100% of the time we must SAVE the summary, even if fields are missing. 10/24/2025"
- Comment: "NEVER DO THIS NONSENSE AGAIN."
- Old code checked if all required fields present → would return null
- New code returns summary with whatever fields exist
5. Return `ParsedSummary` object
**Blocking Analysis:** ⚠️ MINIMAL BLOCKING (as of v4.2.5)
- `<skip_summary>` tag → Returns null (intentional, not a bug)
- Missing `<summary>` tags → Returns null (SDK didn't provide)
- Missing fields within `<summary>` → Does NOT block anymore (v4.2.5 fix)
**Value Assessment:** ✅ HIGH VALUE
- v4.2.5 fix ensures partial summaries are saved
- Keep as-is
**Historical Context:**
- **Before v4.2.5:** Would return null if any required field missing → data loss
- **After v4.2.5:** Returns summary with whatever fields exist → maximally permissive
**Edge Cases:**
1. `<skip_summary reason="not enough data"/>` → Returns null, logs reason
2. No `<summary>` tags → Returns null (SDK didn't generate summary)
3. `<summary>` with all empty fields → Returns summary with empty/null strings
4. Malformed XML → Regex won't match → Returns null (data loss)
**Example:**
```xml
<summary>
<request>Add OAuth2 authentication</request>
<investigated>Reviewed existing auth system</investigated>
<learned>System uses JWT tokens for sessions</learned>
<completed>Implemented OAuth2 provider integration</completed>
<next_steps>Test with production credentials</next_steps>
<notes>Need to configure callback URLs in provider dashboard</notes>
</summary>
```
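
The summary path can be sketched the same way. `parseSummarySketch` is a hypothetical name, and the field extraction is a simplification of whatever `parser.ts` actually does:

```typescript
interface ParsedSummary {
  request: string | null;
  investigated: string | null;
  learned: string | null;
  completed: string | null;
  next_steps: string | null;
  notes: string | null;
}

function parseSummarySketch(text: string): ParsedSummary | null {
  // Intentional skip requested by the agent: nothing to store.
  if (/<skip_summary\b[^>]*\/>/.test(text)) return null;

  const match = /<summary>([\s\S]*?)<\/summary>/.exec(text);
  if (!match) return null; // no summary block in this response
  const block = match[1] ?? "";

  // Trimmed text of <tag>...</tag> inside the summary, or null when absent or empty.
  const field = (tag: string): string | null => {
    const m = new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`).exec(block);
    const value = m ? (m[1] ?? "").trim() : "";
    return value === "" ? null : value;
  };

  // No required-field check: a partial summary is still returned.
  return {
    request: field("request"),
    investigated: field("investigated"),
    learned: field("learned"),
    completed: field("completed"),
    next_steps: field("next_steps"),
    notes: field("notes"),
  };
}
```

The only paths that return `null` are the explicit skip tag and a response with no `<summary>` block at all.
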
---
## Deep Dive: Database Layer
### SessionStore.storeObservation (SessionStore.ts lines 901-964)
**Purpose:** Store a parsed observation in the database
**Logic Flow:**
1. **AUTO-CREATE SESSION** (lines 920-940):
- Check if `sdk_session_id` exists in `sdk_sessions` table
- If NOT found:
- Auto-create session record
- Log: "Auto-created session record for session_id: {id}"
- This prevents foreign key constraint errors
2. Prepare INSERT statement:
```sql
INSERT INTO observations
(sdk_session_id, project, type, title, subtitle, facts, narrative,
concepts, files_read, files_modified, prompt_number, created_at, created_at_epoch)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
```
3. Insert observation with:
- `facts`, `concepts`, `files_read`, `files_modified` → JSON.stringify()
- Timestamps auto-generated
- All fields as-is (nulls allowed)
**Blocking Analysis:** ❌ NEVER BLOCKS
- Auto-creates missing sessions (defensive programming)
- All content fields are nullable; only `sdk_session_id`, `project`, `type`, and the timestamps are NOT NULL
- No validation checks that could fail
- Schema is permissive
**Value Assessment:** ✅ HIGH VALUE
- Auto-creation pattern is brilliant
- Prevents foreign key errors
- Keep as-is
**Schema Constraints:**
- `type` must be one of 6 valid types (CHECK constraint)
- BUT: Parser ensures type is always valid (defaults to "change")
- `sdk_session_id` has foreign key to `sdk_sessions`
- BUT: Auto-creation ensures session exists
- Arrays stored as JSON strings
**Edge Cases:**
- Session doesn't exist → Auto-created
- Invalid type → Parser prevents this (defaults to "change")
- Null fields → Allowed by schema
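
To make the column mapping concrete, here is a sketch of how the 13 bind parameters for the INSERT above could be assembled. `toObservationParams` is a hypothetical helper, and storing `created_at_epoch` in milliseconds is an assumption:

```typescript
interface ObservationInput {
  type: string;
  title: string | null;
  subtitle: string | null;
  facts: string[];
  narrative: string | null;
  concepts: string[];
  files_read: string[];
  files_modified: string[];
}

type SqlValue = string | number | null;

// Returns bind values in the same order as the INSERT column list.
function toObservationParams(
  sdkSessionId: string,
  project: string,
  obs: ObservationInput,
  promptNumber: number | null,
  now: Date = new Date(),
): SqlValue[] {
  return [
    sdkSessionId,
    project,
    obs.type,
    obs.title, // nulls pass through unchanged
    obs.subtitle,
    JSON.stringify(obs.facts), // arrays are stored as JSON strings
    obs.narrative,
    JSON.stringify(obs.concepts),
    JSON.stringify(obs.files_read),
    JSON.stringify(obs.files_modified),
    promptNumber,
    now.toISOString(), // created_at
    now.getTime(), // created_at_epoch (assumed to be milliseconds)
  ];
}
```

Passing the timestamp in as a parameter keeps the function deterministic, which makes the mapping easy to test.
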
---
### SessionStore.storeSummary (SessionStore.ts lines 970-1029)
**Purpose:** Store a parsed summary in the database
**Logic Flow:**
1. **AUTO-CREATE SESSION** (lines 987-1007):
- Same defensive pattern as `storeObservation()`
- Ensures session exists before INSERT
2. Prepare INSERT statement:
```sql
INSERT INTO session_summaries
(sdk_session_id, project, request, investigated, learned, completed,
next_steps, notes, prompt_number, created_at, created_at_epoch)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
```
3. Insert summary with:
- All content fields as-is (nulls allowed)
- Timestamps auto-generated
**Blocking Analysis:** ❌ NEVER BLOCKS
- Auto-creates missing sessions
- All content fields nullable
- No validation checks
- Multiple summaries per session allowed (migration 7 removed UNIQUE constraint)
**Value Assessment:** ✅ HIGH VALUE
- Auto-creation ensures reliability
- Nullable fields allow partial data
- Keep as-is
**Schema Evolution:**
- **Before migration 7:** `sdk_session_id` had UNIQUE constraint → Only one summary per session
- **After migration 7:** UNIQUE removed → Multiple summaries per session (one per prompt)
**Edge Cases:**
- Session doesn't exist → Auto-created
- All fields null/empty → Allowed
- Multiple summaries for same session → Allowed (migration 7)
---
### Database Schema Constraints
#### observations table
```sql
CREATE TABLE observations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
sdk_session_id TEXT NOT NULL, -- Foreign key
project TEXT NOT NULL,
text TEXT, -- Nullable (deprecated, migration 9)
type TEXT NOT NULL CHECK(type IN ('decision', 'bugfix', 'feature', 'refactor', 'discovery', 'change')),
title TEXT, -- Nullable
subtitle TEXT, -- Nullable
facts TEXT, -- Nullable (JSON array)
narrative TEXT, -- Nullable
concepts TEXT, -- Nullable (JSON array)
files_read TEXT, -- Nullable (JSON array)
files_modified TEXT, -- Nullable (JSON array)
prompt_number INTEGER, -- Nullable
created_at TEXT NOT NULL,
created_at_epoch INTEGER NOT NULL,
FOREIGN KEY(sdk_session_id) REFERENCES sdk_sessions(sdk_session_id) ON DELETE CASCADE
);
```
**Blocking Potential:**
- Invalid `type` → CHECK constraint violation
- Mitigated by: Parser defaults to "change"
- Missing `sdk_session_id` → Foreign key violation
- Mitigated by: Auto-creation in storeObservation()
#### session_summaries table
```sql
CREATE TABLE session_summaries (
id INTEGER PRIMARY KEY AUTOINCREMENT,
sdk_session_id TEXT NOT NULL, -- No longer UNIQUE (migration 7)
project TEXT NOT NULL,
request TEXT, -- Nullable
investigated TEXT, -- Nullable
learned TEXT, -- Nullable
completed TEXT, -- Nullable
next_steps TEXT, -- Nullable
notes TEXT, -- Nullable
prompt_number INTEGER, -- Nullable
created_at TEXT NOT NULL,
created_at_epoch INTEGER NOT NULL,
FOREIGN KEY(sdk_session_id) REFERENCES sdk_sessions(sdk_session_id) ON DELETE CASCADE
);
```
**Blocking Potential:**
- Missing `sdk_session_id` → Foreign key violation
- Mitigated by: Auto-creation in storeSummary()
**Key Design Decisions:**
1. **Nullable fields** - Allows partial data to be saved
2. **Auto-creation** - Prevents foreign key errors
3. **No UNIQUE constraints** (migration 7) - Multiple summaries per session
4. **WAL mode** - Better concurrency for multiple sessions
5. **JSON arrays** - Flexible storage for lists (facts, concepts, files)
---
## Deep Dive: Prompt System
### buildInitPrompt (prompts.ts lines 24-125)
**Purpose:** Generate initial prompt that instructs the SDK agent what to observe and how to record
**Content:**
1. **Role Definition:** "You are observing a development session to create searchable memory FOR FUTURE SESSIONS"
2. **Critical Instruction:** "Record what was BUILT/FIXED/DEPLOYED/CONFIGURED, not what you (the observer) are doing"
3. **What to Record:** Focus on deliverables, capabilities, technical changes
4. **When to Skip:** Routine operations (empty status checks, package installations, file listings)
5. **Output Format:** XML structure with `<observation>` tags and required fields
**Blocking Analysis:** ⚠️ CAN CAUSE SKIPPING
- "WHEN TO SKIP" section instructs SDK to not output for routine operations
- "No output necessary if skipping" means no observations stored
- **This is intentional filtering**, not a bug
**Value Assessment:** ✅ HIGH VALUE
- Prevents noise from routine operations
- Focuses on meaningful changes
- Keep as-is, but consider making "WHEN TO SKIP" configurable
**Key Instructions:**
```
WHEN TO SKIP
------------
Skip routine operations:
- Empty status checks
- Package installations with no errors
- Simple file listings
- Repetitive operations you've already documented
- **No output necessary if skipping.**
```
**Impact:**
- Reduces database size by filtering noise
- But it could cause "missing" observations for operations the user cares about
- Trade-off between signal and completeness
---
### buildObservationPrompt (prompts.ts lines 130-153)
**Purpose:** Wrap tool execution data in XML for SDK processing
**Content:**
```xml
<tool_used>
<tool_name>{name}</tool_name>
<tool_time>{ISO timestamp}</tool_time>
<tool_input>{JSON}</tool_input>
<tool_output>{JSON}</tool_output>
</tool_used>
```
**Blocking Analysis:** ❌ NO BLOCKING
- Just data presentation, no instructions to skip
- SDK processes based on init prompt rules
**Value Assessment:** ✅ HIGH VALUE
- Simple, clean data wrapper
- Keep as-is
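
A minimal sketch of such a wrapper, assuming the tool input and output are serialized with `JSON.stringify`. `buildObservationPromptSketch` is a hypothetical name, and the real function in `prompts.ts` may differ in detail:

```typescript
interface ToolObservation {
  tool_name: string;
  tool_input: unknown;
  tool_output: unknown;
}

// Wraps one tool execution in the <tool_used> envelope shown above.
function buildObservationPromptSketch(obs: ToolObservation, at: Date = new Date()): string {
  return [
    "<tool_used>",
    `  <tool_name>${obs.tool_name}</tool_name>`,
    `  <tool_time>${at.toISOString()}</tool_time>`,
    `  <tool_input>${JSON.stringify(obs.tool_input)}</tool_input>`,
    `  <tool_output>${JSON.stringify(obs.tool_output)}</tool_output>`,
    "</tool_used>",
  ].join("\n");
}
```

This sketch does no XML escaping, so tool output that itself contains tags is passed through verbatim.
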
---
### buildSummaryPrompt (prompts.ts lines 158-178)
**Purpose:** Request summary of the session so far
**Content:**
1. **Instruction:** "Think about the last request, and write a summary of what was done, what was learned, and what's next"
2. **Important Note:** "DO NOT summarize the observation process itself - you are summarizing a DIFFERENT claude code session, not this one"
3. **Output Format:** XML `<summary>` with required fields
4. **Encouragement:** "Always write at least a minimal summary explaining where we are at currently, even if you didn't learn anything new or complete any work"
**Blocking Analysis:** ❌ NO BLOCKING
- Encourages always writing summary
- SDK may still skip if truly nothing to summarize
**Value Assessment:** ✅ HIGH VALUE
- Ensures summaries are generated
- "Always write at least a minimal summary" reduces skip rate
- Keep as-is
---
## Data Flow Analysis
### End-to-End Flow: Tool Execution → Database
```
1. User executes tool in Claude Code
2. PostToolUse hook captures execution
3. Hook sends HTTP POST to worker /observations endpoint
4. Worker queues message in pendingMessages array
└─→ HTTP 200 response (confirmed receipt)
5. createMessageGenerator polls queue (100ms interval)
6. Message shifted from queue
7. buildObservationPrompt wraps tool data in XML
8. Generator yields message to SDK agent
9. SDK sends message to Claude API
10. Claude processes tool data based on init prompt
11. Claude responds with XML (or skips if routine operation)
12. SDK returns response to runSDKAgent
13. handleAgentMessage receives response
14. parseObservations extracts <observation> blocks
15. For each observation:
- db.storeObservation called
- Auto-creates session if missing
- Inserts into observations table
16. Data persisted in SQLite database
```
**Failure Points:**
- **Point 3:** Worker not running → HTTP request fails → Hook logs error
- **Point 4:** Worker crashes before processing → Queue lost
- **Point 9:** Invalid model config → SDK fails → Session marked failed
- **Point 11:** Malformed XML response → Parser returns empty array
- **Point 15:** Database error (rare) → Throws exception
**Recovery Mechanisms:**
- **Auto-recovery:** New requests after worker restart auto-create session
- **Graceful degradation:** Partial data saved (v4.2.5/v4.2.6 fixes)
- **Database persistence:** Once stored, data survives all restarts
---
## Blocking Assessment Matrix
### Components That CAN Block Data Storage
| Component | Blocking Scenario | Impact | Mitigation |
|-----------|------------------|---------|------------|
| Worker not running | HTTP requests fail | Observations not queued | PM2 auto-restart, health monitoring |
| Invalid CLAUDE_MEM_MODEL | SDK agent fails to start | Queued messages never processed | Validation in settings script |
| Invalid CLAUDE_CODE_PATH | SDK agent fails to start | Queued messages never processed | Default path, env var fallback |
| Malformed XML in SDK response | Parser can't extract | Data lost for that response | Better error handling, partial parsing |
| Worker restart | In-memory queue lost | Queued messages lost | Could persist queue to DB |
| Session abort (DELETE) | Queue processing stopped | Remaining queue lost | Graceful cleanup (v4.1.0) |
| Init prompt "WHEN TO SKIP" | SDK intentionally skips | No observation stored | Intentional filtering, configurable? |
### Components That CANNOT Block Data Storage
| Component | Reason | Design Pattern |
|-----------|--------|----------------|
| /observations endpoint | Auto-recovery, always queues | Maximally permissive |
| /summarize endpoint | Auto-recovery, always queues | Maximally permissive |
| parseObservations() | Defaults to "change" type, accepts nulls | Permissive (v4.2.6 fix) |
| parseSummary() | Returns partial summaries | Permissive (v4.2.5 fix) |
| storeObservation() | Auto-creates sessions, accepts nulls | Defensive programming |
| storeSummary() | Auto-creates sessions, accepts nulls | Defensive programming |
| Database schema | Nullable fields, no UNIQUE constraints | Flexible storage |
---
## Critical Findings
### 1. Auto-Recovery Pattern Limits Data Loss After Worker Restarts
**Location:** `/observations` and `/summarize` endpoints (lines 181-209, 241-270)
**How it works:**
```typescript
if (!session) {
  // Fetch session from database
  const dbSession = db.getSessionById(sessionDbId);
  // Recreate in-memory state
  session = {
    sessionDbId,
    claudeSessionId: dbSession!.claude_session_id,
    sdkSessionId: null,
    project: dbSession!.project,
    userPrompt: dbSession!.user_prompt,
    pendingMessages: [],
    abortController: new AbortController(),
    generatorPromise: null,
    lastPromptNumber: 0,
    observationCounter: 0,
    startTime: Date.now()
  };
  // Start new SDK agent
  session.generatorPromise = this.runSDKAgent(session);
}
```
**Value:** ✅ HIGH
- Worker restart doesn't prevent new observations from being processed
- Database is source of truth
- Session state can be rebuilt from the database, which makes the worker resilient to restarts
**Recommendation:** Extract to helper function to reduce duplication
---
### 2. Parser Fixes (v4.2.5/v4.2.6) Ensure Partial Data Saved
**parseObservations (v4.2.6):**
```typescript
// NOTE FROM THEDOTMACK: ALWAYS save observations - never skip. 10/24/2025
// All fields except type are nullable in schema
// If type is missing or invalid, use "change" as catch-all fallback
let finalType = 'change'; // Default catch-all
if (type && validTypes.includes(type.trim())) {
  finalType = type.trim();
}
// All other fields are optional - save whatever we have
observations.push({
  type: finalType,
  title, // Can be null
  subtitle, // Can be null
  facts,
  narrative, // Can be null
  concepts,
  files_read,
  files_modified
});
```
**parseSummary (v4.2.5):**
```typescript
// NOTE FROM THEDOTMACK: 100% of the time we must SAVE the summary,
// even if fields are missing. 10/24/2025
// NEVER DO THIS NONSENSE AGAIN.
return {
  request, // Can be null
  investigated, // Can be null
  learned, // Can be null
  completed, // Can be null
  next_steps, // Can be null
  notes // Can be null
};
```
**Value:** ✅ CRITICAL
- Prevents data loss from incomplete AI responses
- LLMs make mistakes - system must be resilient
- Partial data is better than no data
**Recommendation:** Keep as-is, this is the right design
---
### 3. In-Memory Queue is Main Vulnerability
**Issue:** `pendingMessages` array is in-memory only
- Worker restart → All queued messages lost
- But HTTP response already confirmed receipt
**Current behavior:**
1. Hook sends observation → Worker responds "queued" → Hook thinks it's saved
2. Worker crashes before processing → Observation lost
3. BUT: New observations after restart are still processed (auto-recovery)
**Impact:** ⚠️ MEDIUM
- Data loss window between queue and processing
- Observations are idempotent, so resending them would be safe
- But hooks don't retry after a success response, so a lost observation is never resent
**Recommendation:** ⚠️ CONSIDER
- Persist queue to database (e.g., `pending_messages` table)
- Mark as processed when SDK handles
- Increases reliability but adds complexity
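
One possible shape for a persisted queue is sketched below. Everything here is hypothetical: `QueueStore`, `InMemoryStore`, and `DurableQueue` are illustrative names, and the in-memory store only stands in for a SQLite-backed table so the sketch runs on its own. The key idea is ordering: write the message to durable storage before the HTTP handler responds "queued", and mark it processed only after the SDK agent has handled it.

```typescript
interface PendingMessage<T> {
  id: number;
  sessionDbId: number;
  payload: T;
  processed: boolean;
}

// Hypothetical persistence interface; a real one would wrap a database table.
interface QueueStore<T> {
  insert(sessionDbId: number, payload: T): PendingMessage<T>;
  unprocessed(sessionDbId: number): PendingMessage<T>[];
  markProcessed(id: number): void;
}

// In-memory stand-in, used only to keep the sketch self-contained.
class InMemoryStore<T> implements QueueStore<T> {
  private rows: PendingMessage<T>[] = [];
  private nextId = 1;

  insert(sessionDbId: number, payload: T): PendingMessage<T> {
    const row: PendingMessage<T> = { id: this.nextId++, sessionDbId, payload, processed: false };
    this.rows.push(row);
    return row;
  }

  unprocessed(sessionDbId: number): PendingMessage<T>[] {
    return this.rows.filter((r) => r.sessionDbId === sessionDbId && !r.processed);
  }

  markProcessed(id: number): void {
    const row = this.rows.find((r) => r.id === id);
    if (row) row.processed = true;
  }
}

class DurableQueue<T> {
  private store: QueueStore<T>;
  private sessionDbId: number;

  constructor(store: QueueStore<T>, sessionDbId: number) {
    this.store = store;
    this.sessionDbId = sessionDbId;
  }

  // Persist first; the caller acknowledges the hook only after this returns.
  enqueue(payload: T): number {
    return this.store.insert(this.sessionDbId, payload).id;
  }

  // After a restart, this returns every message that was never acknowledged.
  pending(): PendingMessage<T>[] {
    return this.store.unprocessed(this.sessionDbId);
  }

  // Called once the SDK agent has handled the message.
  ack(id: number): void {
    this.store.markProcessed(id);
  }
}
```
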
---
### 4. Init Prompt "WHEN TO SKIP" Intentionally Filters
**Instruction:**
```
WHEN TO SKIP
------------
Skip routine operations:
- Empty status checks
- Package installations with no errors
- Simple file listings
- Repetitive operations you've already documented
- **No output necessary if skipping.**
```
**Impact:**
- Reduces noise in database
- Focuses on meaningful changes
- BUT: User might wonder why some tool executions aren't recorded
**Value:** ✅ MEDIUM - Intentional filtering
- Prevents database bloat
- Trade-off between signal and completeness
**Recommendation:** ⚠️ CONSIDER
- Make "WHEN TO SKIP" configurable (env var or settings)
- Or add verbosity levels (minimal/normal/verbose)
---
## Value Assessment by Component
### HIGH VALUE - Keep As-Is
| Component | Reason |
|-----------|--------|
| Auto-recovery pattern | Worker restart doesn't block processing of new observations |
| Permissive parser (v4.2.5/v4.2.6) | Ensures partial data saved, critical for reliability |
| Nullable database schema | Flexible storage, allows incomplete data |
| WAL mode SQLite | Good concurrency, reliable writes |
| Isolated session state | No cross-contamination between sessions |
| Queue-based architecture | Decouples HTTP from SDK processing |
| storeObservation/storeSummary auto-creation | Defensive programming, prevents foreign key errors |
### MEDIUM VALUE - Consider Improvements
| Component | Current State | Potential Improvement |
|-----------|--------------|----------------------|
| In-memory queue | Lost on restart | Persist to DB for durability |
| 100ms polling | Works but inefficient | Event-driven async queue |
| Duplicated auto-recovery code | Lines 181-209 and 241-270 identical | Extract to `getOrCreateSession()` helper |
| No try-catch around DB ops | Errors crash handler | Add error handling with logging |
| Model/port defaults | Defaults hard-coded, overridable via env vars | None needed: already configurable ✓ |
| Init prompt filtering | Fixed "WHEN TO SKIP" rules | Make configurable (verbosity levels) |
### LOW VALUE - Questionable Design
| Component | Issue | Recommendation |
|-----------|-------|----------------|
| cleanupOrphanedSessions() | Marks ALL active sessions failed on startup | Aggressive, but necessary with fixed port |
| 5-second DELETE timeout | Arbitrary | Make configurable via env var |
| "NO SUMMARY TAGS FOUND" warning | Log level too high | Change to INFO level |
---
## Recommendations
### Priority 1: Critical Reliability Improvements
1. **Persist Message Queue to Database**
- Create `pending_messages` table
- Store queued observations/summaries
- Mark as processed when handled by SDK
- Prevents data loss on worker restart
- **Effort:** Medium, **Impact:** High
2. **Add Error Handling Around Database Operations**
- Wrap `db.storeObservation()` and `db.storeSummary()` in try-catch
- Log errors with full context
- Continue processing other messages on error
- **Effort:** Low, **Impact:** Medium
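
A minimal sketch of the error-handling recommendation follows. The real signature of `db.storeObservation()` is not shown in this document, so the single-argument `ObservationStore` interface, the `storeAllObservations` helper, and the `logError` callback are all assumptions made for illustration. The point is that one failed write is logged with context and does not stop the remaining messages from being stored.

```typescript
// Simplified, assumed shape of the storage call.
interface ObservationStore<T> {
  storeObservation(obs: T): void;
}

interface StoreFailure<T> {
  item: T;
  error: string;
}

function storeAllObservations<T>(
  db: ObservationStore<T>,
  observations: T[],
  logError: (message: string, context: unknown) => void
): StoreFailure<T>[] {
  const failures: StoreFailure<T>[] = [];
  for (const obs of observations) {
    try {
      db.storeObservation(obs);
    } catch (err) {
      // Record the failure with context, then keep going.
      const message = err instanceof Error ? err.message : String(err);
      logError("storeObservation failed", { observation: obs, error: message });
      failures.push({ item: obs, error: message });
    }
  }
  return failures;
}
```

Returning the failures, rather than only logging them, leaves room for the caller to retry or to surface a count through a health endpoint.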
### Priority 2: Code Quality Improvements
3. **Extract Auto-Recovery to Helper Function**
```typescript
private async getOrCreateSession(sessionDbId: number): Promise<ActiveSession> {
// Consolidate lines 181-209 and 241-270
}
```
- **Effort:** Low, **Impact:** Low (code quality)
4. **Make Configuration More Flexible**
- Add `CLAUDE_MEM_VERBOSITY` env var (minimal/normal/verbose)
- Adjust init prompt "WHEN TO SKIP" based on verbosity
- Add `CLAUDE_MEM_DELETE_TIMEOUT` env var
- **Effort:** Low, **Impact:** Medium
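
The verbosity idea might be sketched as follows. `CLAUDE_MEM_VERBOSITY` is the variable proposed above and does not exist yet; `parseVerbosity`, `buildSkipSection`, the fallback to `normal`, and the mapping from level to rules are all assumptions. The extra rule for `minimal` is invented purely so that the level differs from `normal`.

```typescript
type Verbosity = "minimal" | "normal" | "verbose";

// Unknown or missing values fall back to "normal" (an assumed default).
function parseVerbosity(raw: string | undefined): Verbosity {
  const value = (raw ?? "").trim().toLowerCase();
  if (value === "minimal" || value === "verbose") return value;
  return "normal";
}

// The four rules quoted from the current init prompt.
const CURRENT_SKIP_RULES: string[] = [
  "Empty status checks",
  "Package installations with no errors",
  "Simple file listings",
  "Repetitive operations you've already documented",
];

// Hypothetical extra rule, invented here only to make "minimal" stricter.
const MINIMAL_EXTRA_RULES: string[] = ["Read-only commands that changed nothing"];

function buildSkipSection(verbosity: Verbosity): string {
  // "verbose" omits the section entirely, so nothing is filtered.
  if (verbosity === "verbose") return "";
  const rules =
    verbosity === "minimal"
      ? [...CURRENT_SKIP_RULES, ...MINIMAL_EXTRA_RULES]
      : CURRENT_SKIP_RULES;
  return [
    "WHEN TO SKIP",
    "------------",
    "Skip routine operations:",
    ...rules.map((rule) => `- ${rule}`),
    "- **No output necessary if skipping.**",
  ].join("\n");
}
```

In the worker this would be called as `buildSkipSection(parseVerbosity(process.env.CLAUDE_MEM_VERBOSITY))` when the init prompt is assembled.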
### Priority 3: Performance Optimizations
5. **Replace Polling with Event-Driven Queue**
- Use `AsyncQueue` with notifications instead of 100ms polling
- Reduces latency from queue to processing
- **Effort:** Medium, **Impact:** Low (performance)
6. **Add Queue Metrics**
- Track queue length over time
- Alert if queue grows unbounded
- Add to `/health` endpoint
- **Effort:** Low, **Impact:** Low (observability)
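
An event-driven queue along these lines could replace the 100ms polling loop. This is a sketch of the general promise-based pattern, not the project's code: the `AsyncQueue` class below is hypothetical, and shutdown handling (for example, rejecting waiters when the session's `AbortController` fires) is left out. The `length` getter is the kind of value a `/health` response could report.

```typescript
class AsyncQueue<T> {
  private items: T[] = [];
  private waiters: Array<(item: T) => void> = [];

  // Hand the item straight to a waiting consumer, or buffer it.
  push(item: T): void {
    const waiter = this.waiters.shift();
    if (waiter) {
      waiter(item);
    } else {
      this.items.push(item);
    }
  }

  // Resolves immediately if an item is buffered, otherwise when one arrives.
  next(): Promise<T> {
    if (this.items.length > 0) {
      return Promise.resolve(this.items.shift() as T);
    }
    return new Promise<T>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  // Number of buffered items that no consumer has taken yet.
  get length(): number {
    return this.items.length;
  }
}
```
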
---
## Appendix: Configuration Reference
### Environment Variables
| Variable | Default | Purpose | Blocking Impact |
|----------|---------|---------|----------------|
| `CLAUDE_MEM_MODEL` | `claude-sonnet-4-5` | AI model for processing | Invalid = SDK fails |
| `CLAUDE_MEM_WORKER_PORT` | `37777` | HTTP server port | Invalid = Worker won't start |
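
As an illustration of the table above, the two variables could be read and validated at startup as sketched here. `loadWorkerConfig` and `WorkerConfig` are hypothetical names, and failing fast with an explicit error on a bad port is an assumed behavior, not a description of the current worker. The environment is passed in as a plain object so the function can be exercised without a real process environment.

```typescript
interface WorkerConfig {
  model: string;
  port: number;
}

function loadWorkerConfig(env: Record<string, string | undefined>): WorkerConfig {
  // Empty or missing values fall back to the documented defaults.
  const model = env.CLAUDE_MEM_MODEL?.trim() || "claude-sonnet-4-5";
  const rawPort = env.CLAUDE_MEM_WORKER_PORT?.trim() || "37777";

  const port = Number(rawPort);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    // Surface the problem at startup with a clear message.
    throw new Error(`Invalid CLAUDE_MEM_WORKER_PORT: "${rawPort}"`);
  }
  return { model, port };
}
```

In the worker this would be called once as `loadWorkerConfig(process.env)`.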
### Constants
| Constant | Value | Purpose |
|----------|-------|---------|
| `DISALLOWED_TOOLS` | `['Glob', 'Grep', 'ListMcpResourcesTool', 'WebSearch']` | Tools SDK agent can't use |
| Polling interval | `100ms` | Queue polling frequency |
| DELETE timeout | `5000ms` | Max wait for agent shutdown |
---
## Conclusion
The claude-mem worker server is a well-designed system with a clear **defensive, layered architecture** that prioritizes **data persistence**. The key strengths are:
1. **Auto-recovery** from worker restarts
2. **Permissive parsing** that saves partial data
3. **Nullable schema** that accepts incomplete information
4. **Session isolation** preventing cross-contamination
The main vulnerability is the **in-memory queue**, which could be mitigated by persisting to the database. Overall, the system achieves its goal of creating a persistent memory system that survives failures and continues operating even with incomplete data.
**Design Philosophy:** "Better to save partial data than lose everything."
This philosophy is evident throughout the codebase, from the v4.2.5/v4.2.6 parser fixes to the auto-creation patterns in the database layer. The system is built to be resilient to AI errors, configuration issues, and process failures.
---
**End of Document**