Claude-Mem Worker Server Architecture
Document Version: 1.0
Last Updated: 2025-01-24
Author: Analysis by Claude Code
Purpose: Comprehensive technical analysis of the worker server architecture, logic flow, blocking behavior, and component value assessment
Executive Summary
The claude-mem worker server is a long-running HTTP service managed by PM2 that processes tool execution observations and generates session summaries using the Claude Agent SDK. It implements a defensive, layered architecture designed to maximize data persistence while maintaining flexibility.
Key Design Principles
- Maximally Permissive Storage - System defaults to saving data even if incomplete
- Auto-Recovery - Worker restarts don't prevent processing (session state reconstructed from database)
- Queue-Based Processing - HTTP API decoupled from AI processing for reliability
- Defensive Programming - Auto-creates missing database records, accepts null fields
- Session Isolation - Each session has independent state and SDK agent
Architecture at a Glance
┌─────────────────────────────────────────────────────────────┐
│ Layer 1: HTTP API (Express.js) │
│ - 6 REST endpoints │
│ - Always queues messages (maximally permissive) │
└──────────────────┬──────────────────────────────────────────┘
│
┌──────────────────▼──────────────────────────────────────────┐
│ Layer 2: In-Memory Queue │
│ - pendingMessages array per session │
│ - VULNERABILITY: Lost on worker restart │
└──────────────────┬──────────────────────────────────────────┘
│
┌──────────────────▼──────────────────────────────────────────┐
│ Layer 3: SDK Agent (Claude Agent SDK) │
│ - Processes queued messages via async generator │
│ - Can fail due to config or AI errors │
└──────────────────┬──────────────────────────────────────────┘
│
┌──────────────────▼──────────────────────────────────────────┐
│ Layer 4: Parser (XML Extraction) │
│ - Extracts observations and summaries from AI responses │
│ - Permissive (v4.2.5/v4.2.6 fixes ensure partial data saved)│
└──────────────────┬──────────────────────────────────────────┘
│
┌──────────────────▼──────────────────────────────────────────┐
│ Layer 5: Database (SQLite with better-sqlite3) │
│ - Permanent storage (once here, data persists) │
│ - Auto-creates missing sessions, accepts nulls │
└─────────────────────────────────────────────────────────────┘
Critical Insight: Data can only be lost between layers 2-4. Once it reaches the database (layer 5), it's permanent.
Component Inventory
HTTP REST API Endpoints
| Endpoint | Purpose | Blocks Data? |
|---|---|---|
| GET /health | Worker health check | N/A |
| POST /sessions/:id/init | Initialize session and start SDK agent | Only if session not in DB (expected) |
| POST /sessions/:id/observations | Queue tool observation | ❌ Never (auto-recovery) |
| POST /sessions/:id/summarize | Queue summary request | ❌ Never (auto-recovery) |
| GET /sessions/:id/status | Get session status | N/A |
| DELETE /sessions/:id | Abort session | ⚠️ Queued messages lost |
Core Processing Components
| Component | File | Lines | Purpose |
|---|---|---|---|
| WorkerService | worker-service.ts | 52-590 | Main service class, manages sessions |
| runSDKAgent | worker-service.ts | 345-404 | Runs SDK agent for a session |
| createMessageGenerator | worker-service.ts | 410-502 | Async generator feeding SDK |
| handleAgentMessage | worker-service.ts | 508-563 | Parses and stores SDK responses |
| parseObservations | parser.ts | 32-96 | Extracts observations from XML |
| parseSummary | parser.ts | 102-157 | Extracts summary from XML |
| SessionStore | SessionStore.ts | 9-1086 | Database operations |
Deep Dive: HTTP Endpoints
GET /health (lines 100-109)
Purpose: Health check for monitoring and debugging
Logic Flow:
- Returns JSON with status, port, PID, active sessions, uptime, memory
Blocking Analysis: ❌ N/A (read-only endpoint)
Value Assessment: ✅ HIGH VALUE
- Essential for monitoring worker health
- Helps debug port conflicts and process state
- Keep as-is
POST /sessions/:sessionDbId/init (lines 115-169)
Purpose: Initialize a new session and start the SDK agent
Logic Flow:
- Parse `sessionDbId` from URL
- Extract `project` and `userPrompt` from request body
- Fetch session from database using `SessionStore.getSessionById()`
- CRITICAL CHECK: Return 404 if session not found in DB
- Retrieve `claudeSessionId` from database record
- Create `ActiveSession` object with initial state: `{ sessionDbId, claudeSessionId, sdkSessionId: null, project, userPrompt, pendingMessages: [], abortController: new AbortController(), generatorPromise: null, lastPromptNumber: 0, observationCounter: 0, startTime: Date.now() }`
- Store session in memory map (`this.sessions`)
- Update `worker_port` in database
- Start `runSDKAgent()` in background (fire-and-forget promise)
- Return success response immediately
Blocking Analysis: ⚠️ CONDITIONAL
- Returns 404 if session doesn't exist in database
- This is expected behavior - session must be created before init
- Doesn't prevent future initialization attempts
- Error logged and hook can retry
Value Assessment: ✅ HIGH VALUE
- Critical initialization step
- Background SDK agent startup prevents timeout
- Keep as-is
Edge Cases:
- Session exists but SDK agent fails to start → Session marked as failed, but new init can retry
- Multiple init calls for same session → First one wins (subsequent calls find session in memory)
POST /sessions/:sessionDbId/observations (lines 175-230)
Purpose: Queue a tool execution observation for processing
Logic Flow:
- Parse `sessionDbId` from URL
- Extract `tool_name`, `tool_input`, `tool_output`, `prompt_number` from body
- Check if session exists in memory map (`this.sessions.get(sessionDbId)`)
- AUTO-RECOVERY (lines 181-209): If session NOT in memory:
  - Fetch session from database
  - Recreate `ActiveSession` object
  - Start new SDK agent in background
  - This enables recovery from worker restarts!
- Increment `observationCounter` for correlation ID tracking
- Push observation message to `pendingMessages` queue: `{ type: 'observation', tool_name, tool_input, tool_output, prompt_number }`
- Return success with queue length
Blocking Analysis: ❌ NEVER BLOCKS
- Auto-creates session state from database if missing
- Always queues the observation
- HTTP response confirms receipt immediately
- Processing happens asynchronously
Value Assessment: ✅ HIGH VALUE
- Auto-recovery is brilliant design
- Worker restart doesn't lose ability to process observations
- Keep as-is
Edge Cases:
- Worker restart while observation in queue → Lost (queue is in-memory)
- But NEW observations after restart are queued successfully (auto-recovery)
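- Session not found in database → Recovery would throw (the database record is dereferenced without a null check); SessionStore auto-creates missing session records only when storing observations or summaries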
POST /sessions/:sessionDbId/summarize (lines 236-284)
Purpose: Queue a summary generation request
Logic Flow:
- Parse `sessionDbId` and `prompt_number` from request
- Check if session exists in memory
- AUTO-RECOVERY (lines 241-270): Same pattern as observations endpoint
  - Fetches session from database
  - Recreates `ActiveSession` object
  - Starts new SDK agent
- Push summarize message to `pendingMessages` queue: `{ type: 'summarize', prompt_number }`
- Return success with queue length
Blocking Analysis: ❌ NEVER BLOCKS
- Same auto-recovery mechanism as observations
- Always queues the summary request
- Processing happens asynchronously
Value Assessment: ✅ HIGH VALUE
- Auto-recovery pattern prevents data loss
- Keep as-is
Code Quality Note: ⚠️ MEDIUM - Duplicated auto-recovery code (lines 181-209 and 241-270 are nearly identical)
- Could extract to helper function: `getOrCreateSession(sessionDbId)`
- Would reduce duplication and improve maintainability
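One possible shape for the suggested helper is sketched below. It is not the code in worker-service.ts: the `DbSession`, `PendingMessage`, and `ActiveSession` types are assumptions based on the fields listed in this document, `sessionDbId` is assumed to be numeric, and the database lookup and agent startup are injected as plain functions (`lookup`, `startAgent`) so the example stays self-contained.

```typescript
// Hypothetical types: field names follow the ActiveSession object described
// in this document, but the exact types in worker-service.ts are assumptions.
interface DbSession {
  claude_session_id: string;
  project: string;
  user_prompt: string;
}

interface PendingMessage {
  type: "observation" | "summarize";
  prompt_number: number;
}

interface ActiveSession {
  sessionDbId: number;
  claudeSessionId: string;
  sdkSessionId: string | null;
  project: string;
  userPrompt: string;
  pendingMessages: PendingMessage[];
  abortController: AbortController;
  generatorPromise: Promise<void> | null;
  lastPromptNumber: number;
  observationCounter: number;
  startTime: number;
}

// Returns the in-memory session, rebuilding it from the database if missing.
// `lookup` stands in for SessionStore.getSessionById and `startAgent` for
// runSDKAgent; both are injected here purely for illustration.
function getOrCreateSession(
  sessions: Map<number, ActiveSession>,
  sessionDbId: number,
  lookup: (id: number) => DbSession | null,
  startAgent: (session: ActiveSession) => Promise<void>,
): ActiveSession {
  const existing = sessions.get(sessionDbId);
  if (existing) return existing;

  const dbSession = lookup(sessionDbId);
  if (!dbSession) {
    // Fail explicitly instead of relying on a non-null assertion.
    throw new Error(`Session ${sessionDbId} not found in database`);
  }

  const session: ActiveSession = {
    sessionDbId,
    claudeSessionId: dbSession.claude_session_id,
    sdkSessionId: null,
    project: dbSession.project,
    userPrompt: dbSession.user_prompt,
    pendingMessages: [],
    abortController: new AbortController(),
    generatorPromise: null,
    lastPromptNumber: 0,
    observationCounter: 0,
    startTime: Date.now(),
  };
  sessions.set(sessionDbId, session);
  session.generatorPromise = startAgent(session);
  return session;
}
```

Both the observations and summarize handlers could then call this one function before pushing to `pendingMessages`, which keeps the recovery logic in a single place.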
GET /sessions/:sessionDbId/status (lines 289-304)
Purpose: Get current session status and queue length
Logic Flow:
- Parse `sessionDbId` from URL
- Get session from memory map
- Return 404 if not found
- Return session info: `sessionDbId`, `sdkSessionId`, `project`, `pendingMessages.length`
Blocking Analysis: ❌ N/A (read-only endpoint)
Value Assessment: ✅ MEDIUM VALUE
- Useful for debugging
- Not critical for core functionality
- Keep as-is
DELETE /sessions/:sessionDbId (lines 309-340)
Purpose: Abort a running session and clean up
Logic Flow:
- Parse `sessionDbId` from URL
- Get session from memory map
- Return 404 if not found
- Call `abortController.abort()` to signal SDK agent to stop
- Wait for `generatorPromise` to finish (max 5 seconds timeout)
- Mark session as 'failed' in database
- Delete session from memory map
- Return success
Blocking Analysis: ⚠️ BLOCKS QUEUED MESSAGES
- Aborts SDK agent processing
- Any messages in `pendingMessages` queue are lost
- Already-stored observations/summaries remain in database
Value Assessment: ✅ MEDIUM VALUE
- Provides clean shutdown mechanism
- Used for manual cleanup
- As of v4.1.0, SessionEnd hook doesn't call DELETE (graceful cleanup)
- Keep for manual intervention, but not used automatically
Historical Note:
- v4.0.x: SessionEnd hook called DELETE → interrupted summary generation
- v4.1.0+: Graceful cleanup → workers finish naturally
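The bounded wait in the DELETE flow (abort, then wait at most 5 seconds for `generatorPromise`) can be expressed with `Promise.race`. The sketch below is an illustration of that pattern, not the worker's actual code; `waitWithTimeout` and `WaitResult` are hypothetical names.

```typescript
type WaitResult = "finished" | "timed_out";

// Resolves with "finished" once `work` settles, or "timed_out" after
// `timeoutMs`, whichever happens first. Hypothetical helper for illustration.
function waitWithTimeout(work: Promise<unknown>, timeoutMs: number): Promise<WaitResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<WaitResult>((resolve) => {
    timer = setTimeout(() => resolve("timed_out"), timeoutMs);
  });

  // An aborted agent typically rejects; a rejection still means it stopped.
  const finished = work.then(
    (): WaitResult => "finished",
    (): WaitResult => "finished",
  );

  return Promise.race([finished, timeout]).finally(() => {
    // Clear the timer so it does not keep the process alive needlessly.
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

A handler following the flow above would call `session.abortController.abort()` and then `await waitWithTimeout(session.generatorPromise, 5000)` before marking the session as failed.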
Deep Dive: SDK Agent Processing
runSDKAgent (lines 345-404)
Purpose: Core processing engine that runs continuously for each session
Logic Flow:
- Call `query()` from Claude Agent SDK with: `{ prompt: this.createMessageGenerator(session), options: { model: MODEL, disallowedTools: DISALLOWED_TOOLS, abortController: session.abortController, pathToClaudeCodeExecutable: claudePath } }` (`MODEL` comes from the `CLAUDE_MEM_MODEL` env var)
- Iterate over SDK responses using `for await`
- For each assistant message:
  - Extract text content from response
  - Log response size
  - Call `handleAgentMessage()` to parse and store
- On completion:
  - Log session duration
  - Mark session as 'completed' in database
  - Delete session from memory map
- On error:
  - Log error (or warning for AbortError)
  - Mark session as 'failed' in database
  - Throw error (caught by `generatorPromise.catch()`)
Blocking Analysis: ⚠️ CAN BLOCK IF:
- Invalid `CLAUDE_MEM_MODEL` → SDK initialization fails
- Invalid `CLAUDE_CODE_PATH` → SDK initialization fails
- SDK crashes → Session marked as failed
- BUT: Doesn't prevent NEW sessions from being created
Value Assessment: ✅ HIGH VALUE
- Core processing engine
- Proper error handling with session status tracking
- Keep as-is
Configuration Dependencies:
- `CLAUDE_MEM_MODEL` (default: 'claude-sonnet-4-5')
- `CLAUDE_CODE_PATH` (default: specific path to claude executable)
- `DISALLOWED_TOOLS` (default: ['Glob', 'Grep', 'ListMcpResourcesTool', 'WebSearch'])
Edge Cases:
- SDK hangs → AbortController provides cancellation mechanism
- Network issues → SDK handles retries internally
- Multiple concurrent sessions → Each has isolated SDK agent
createMessageGenerator (lines 410-502)
Purpose: Async generator that feeds messages to the SDK agent (bridge between HTTP API and SDK)
Logic Flow:
- Build init prompt using `buildInitPrompt(project, claudeSessionId, userPrompt)`
- Yield initial user message: `{ type: 'user', session_id: claudeSessionId, parent_tool_use_id: null, message: { role: 'user', content: initPrompt } }` (`session_id` is the real Claude Code session ID)
- Enter infinite loop (`while (true)`):
  - Check `abortController.signal.aborted` → break if aborted
  - If no pending messages → sleep 100ms and continue
  - While `pendingMessages.length > 0`:
    - Shift message from queue (FIFO)
    - If type === 'summarize':
      - Update `lastPromptNumber`
      - Fetch session from database
      - Build summary prompt using `buildSummaryPrompt()`
      - Yield summary prompt as user message
    - If type === 'observation':
      - Update `lastPromptNumber`
      - Build observation prompt using `buildObservationPrompt()`
      - Yield observation prompt as user message
Blocking Analysis: ❌ DOESN'T BLOCK
- Continuously processes queue until aborted
- 100ms polling means small delay but no data loss
- Messages shifted from queue and sent to SDK
- If SDK fails, messages lost from queue (but already confirmed via HTTP)
Value Assessment: ✅ HIGH VALUE
- Elegant async generator pattern
- Keep as-is
Performance Note: ⚠️ 100ms polling interval
- Could be improved with event-driven queue (e.g., `AsyncQueue` with notifications)
- Current implementation is simple and works well
- Low priority optimization
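To make the event-driven alternative concrete, here is a minimal sketch of an `AsyncQueue`. The class name comes from the note above, but its methods (`push`, `take`, `length`) are assumptions for illustration, and abort handling is deliberately left out. A consumer awaits `take()` and is woken as soon as a producer calls `push()`, so no polling interval is needed.

```typescript
// Minimal event-driven FIFO queue (illustrative sketch, not the worker's code).
class AsyncQueue<T> {
  private items: T[] = [];
  private waiters: Array<(item: T) => void> = [];

  // Hands the item to a waiting consumer if there is one, else buffers it.
  push(item: T): void {
    const waiter = this.waiters.shift();
    if (waiter) {
      waiter(item);
    } else {
      this.items.push(item);
    }
  }

  // Resolves immediately if an item is buffered, otherwise when one arrives.
  take(): Promise<T> {
    if (this.items.length > 0) {
      return Promise.resolve(this.items.shift() as T);
    }
    return new Promise<T>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  // Number of buffered items not yet taken.
  get length(): number {
    return this.items.length;
  }
}
```

Inside a generator, the `sleep 100ms` branch would be replaced by `const message = await queue.take()`. A real version would also need to race `take()` against the abort signal so the loop can still exit.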
Data Flow:
HTTP /observations → pendingMessages.push() → [sleep 100ms] →
pendingMessages.shift() → buildObservationPrompt() → yield to SDK →
SDK processes → handleAgentMessage()
handleAgentMessage (lines 508-563)
Purpose: Parse SDK response and store observations/summaries in database
Logic Flow:
- Call `parseObservations(content, correlationId)`
- If observations found:
  - For each observation:
    - Call `db.storeObservation(claudeSessionId, project, observation, promptNumber)`
    - Log success with correlation ID
- Call `parseSummary(content, sessionId)`
- If summary found:
  - Call `db.storeSummary(claudeSessionId, project, summary, promptNumber)`
  - Log success
- If NO summary found:
  - Log warning with content sample
Blocking Analysis: ⚠️ CAN BLOCK IF:
- Parser returns empty array/null → Nothing stored (but this is expected for routine operations)
- Database error → Would throw and crash handler (rare with permissive schema)
Value Assessment: ✅ HIGH VALUE
- Core storage logic
- Proper logging for debugging
- Keep as-is
Critical Dependencies:
- `parseObservations()` must return valid observations
- `parseSummary()` must return valid summary
- Database must accept the data (schema constraints)
Logging:
- Extensive logging at INFO, SUCCESS, and WARN levels
- Correlation IDs for tracking individual observations
- Debug mode logs full SDK responses
Deep Dive: Parser System
parseObservations (parser.ts lines 32-96)
Purpose: Extract observation XML blocks from SDK response and parse into structured data
Logic Flow:
- Use regex to find all `<observation>...</observation>` blocks (non-greedy): `/<observation>([\s\S]*?)<\/observation>/g`
- For each block:
  - Extract all fields: `type`, `title`, `subtitle`, `narrative`, `facts`, `concepts`, `files_read`, `files_modified`
  - VALIDATION (lines 52-67):
    - If `type` is missing or invalid → default to "change"
    - Valid types: `['bugfix', 'feature', 'refactor', 'change', 'discovery', 'decision']`
    - All other fields can be null
  - Filter out `type` from `concepts` array (types and concepts are separate dimensions)
  - Push observation to results array
- Return all observations
Blocking Analysis: ❌ NEVER BLOCKS (as of v4.2.6)
- CRITICAL FIX (v4.2.6): Removed validation that required title, subtitle, and narrative
- Comment on line 52: "NOTE FROM THEDOTMACK: ALWAYS save observations - never skip. 10/24/2025"
- Always returns observations with whatever fields exist
- Only transformation: type defaults to "change" if invalid
Value Assessment: ✅ HIGH VALUE
- Permissive parsing ensures data is never lost
- v4.2.6 fix was critical for reliability
- Keep as-is
Historical Context:
- Before v4.2.6: Would skip observations missing required fields → data loss
- After v4.2.6: Always saves with defaults → maximally permissive
Edge Cases:
- No `<observation>` tags → Returns empty array (normal for routine operations)
- All fields empty → Returns observation with null fields and type="change"
- Malformed XML → Regex won't match → Returns empty array (data loss)
- Type in concepts → Filtered out (types and concepts are orthogonal)
Example:
<observation>
<type>feature</type>
<title>Authentication added</title>
<subtitle>Implemented OAuth2 flow</subtitle>
<facts>
<fact>Added OAuth2 provider configuration</fact>
<fact>Created callback endpoint</fact>
</facts>
<narrative>Full OAuth2 authentication...</narrative>
<concepts>
<concept>how-it-works</concept>
<concept>what-changed</concept>
</concepts>
<files_read>
<file>src/auth/oauth.ts</file>
</files_read>
<files_modified>
<file>src/auth/oauth.ts</file>
</files_modified>
</observation>
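The permissive parsing described in this section can be sketched as follows. This is a simplified reconstruction from the logic flow above, not the contents of parser.ts: the function is named `parseObservationsSketch` to make that clear, the helpers `extractField` and `extractList` are hypothetical, and the concept filter removes only the observation's own final type, which is an assumption about how the real filter behaves.

```typescript
const VALID_TYPES = ["bugfix", "feature", "refactor", "change", "discovery", "decision"];

interface ParsedObservation {
  type: string;
  title: string | null;
  subtitle: string | null;
  narrative: string | null;
  facts: string[];
  concepts: string[];
  files_read: string[];
  files_modified: string[];
}

// Trimmed text of the first <tag>...</tag>, or null if absent or empty.
function extractField(block: string, tag: string): string | null {
  const match = new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`).exec(block);
  const text = match ? match[1].trim() : "";
  return text.length > 0 ? text : null;
}

// Every <item> inside <container>, e.g. each <fact> inside <facts>.
function extractList(block: string, container: string, item: string): string[] {
  const inner = extractField(block, container);
  if (inner === null) return [];
  const results: string[] = [];
  const itemRegex = new RegExp(`<${item}>([\\s\\S]*?)</${item}>`, "g");
  let m: RegExpExecArray | null;
  while ((m = itemRegex.exec(inner)) !== null) {
    results.push(m[1].trim());
  }
  return results;
}

// Never skips a matched block: missing fields become null or empty arrays,
// and a missing or invalid type falls back to "change".
function parseObservationsSketch(text: string): ParsedObservation[] {
  const observations: ParsedObservation[] = [];
  const blockRegex = /<observation>([\s\S]*?)<\/observation>/g;
  let m: RegExpExecArray | null;
  while ((m = blockRegex.exec(text)) !== null) {
    const block = m[1];
    const rawType = extractField(block, "type");
    const type = rawType !== null && VALID_TYPES.includes(rawType) ? rawType : "change";
    observations.push({
      type,
      title: extractField(block, "title"),
      subtitle: extractField(block, "subtitle"),
      narrative: extractField(block, "narrative"),
      facts: extractList(block, "facts", "fact"),
      // Assumption: only the observation's own type is removed from concepts.
      concepts: extractList(block, "concepts", "concept").filter((c) => c !== type),
      files_read: extractList(block, "files_read", "file"),
      files_modified: extractList(block, "files_modified", "file"),
    });
  }
  return observations;
}
```

Because the outer regex needs both the opening and closing `<observation>` tags, a response with an unclosed block yields an empty array, which is the malformed-XML data-loss case noted in the edge cases.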
parseSummary (parser.ts lines 102-157)
Purpose: Extract summary XML block from SDK response
Logic Flow:
- Check for `<skip_summary reason="..."/>` tag (lines 104-113)
  - If found → log reason and return null (intentional skip)
- Match `<summary>...</summary>` block (non-greedy): `/<summary>([\s\S]*?)<\/summary>/`
  - If not found → return null (SDK didn't provide summary)
- Extract all fields: `request`, `investigated`, `learned`, `completed`, `next_steps`, `notes` (optional)
- VALIDATION REMOVED (lines 133-147):
  - Comment: "NOTE FROM THEDOTMACK: 100% of the time we must SAVE the summary, even if fields are missing. 10/24/2025"
  - Comment: "NEVER DO THIS NONSENSE AGAIN."
  - Old code checked if all required fields present → would return null
  - New code returns summary with whatever fields exist
- Return `ParsedSummary` object
Blocking Analysis: ⚠️ MINIMAL BLOCKING (as of v4.2.5)
- `<skip_summary>` tag → Returns null (intentional, not a bug)
- Missing `<summary>` tags → Returns null (SDK didn't provide)
- Missing fields within `<summary>` → Does NOT block anymore (v4.2.5 fix)
Value Assessment: ✅ HIGH VALUE
- v4.2.5 fix ensures partial summaries are saved
- Keep as-is
Historical Context:
- Before v4.2.5: Would return null if any required field missing → data loss
- After v4.2.5: Returns summary with whatever fields exist → maximally permissive
Edge Cases:
- `<skip_summary reason="not enough data"/>` → Returns null, logs reason
- No `<summary>` tags → Returns null (SDK didn't generate summary)
- `<summary>` with all empty fields → Returns summary with empty/null strings
- Malformed XML → Regex won't match → Returns null (data loss)
Example:
<summary>
<request>Add OAuth2 authentication</request>
<investigated>Reviewed existing auth system</investigated>
<learned>System uses JWT tokens for sessions</learned>
<completed>Implemented OAuth2 provider integration</completed>
<next_steps>Test with production credentials</next_steps>
<notes>Need to configure callback URLs in provider dashboard</notes>
</summary>
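A compact sketch of the summary-parsing rules above is shown here. It is a reconstruction from this section, not the actual parser.ts code; `parseSummarySketch` and `summaryField` are hypothetical names, and reason logging for `<skip_summary>` is omitted.

```typescript
interface ParsedSummary {
  request: string | null;
  investigated: string | null;
  learned: string | null;
  completed: string | null;
  next_steps: string | null;
  notes: string | null;
}

// Trimmed text of the first <tag>...</tag>, or null if absent or empty.
function summaryField(block: string, tag: string): string | null {
  const match = new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`).exec(block);
  const text = match ? match[1].trim() : "";
  return text.length > 0 ? text : null;
}

// Returns null only for an explicit skip or a missing <summary> block.
// Missing fields inside the block never cause the summary to be dropped.
function parseSummarySketch(text: string): ParsedSummary | null {
  // Explicit skip tag: the agent chose not to summarize.
  if (/<skip_summary\b[^>]*\/>/.test(text)) return null;

  const match = /<summary>([\s\S]*?)<\/summary>/.exec(text);
  if (!match) return null;

  const block = match[1];
  return {
    request: summaryField(block, "request"),
    investigated: summaryField(block, "investigated"),
    learned: summaryField(block, "learned"),
    completed: summaryField(block, "completed"),
    next_steps: summaryField(block, "next_steps"),
    notes: summaryField(block, "notes"),
  };
}
```

With this shape, a summary containing only `<request>` still produces an object whose remaining fields are null, which matches the post-v4.2.5 behavior described above.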
Deep Dive: Database Layer
SessionStore.storeObservation (SessionStore.ts lines 901-964)
Purpose: Store a parsed observation in the database
Logic Flow:
- AUTO-CREATE SESSION (lines 920-940):
  - Check if `sdk_session_id` exists in `sdk_sessions` table
  - If NOT found:
    - Auto-create session record
    - Log: "Auto-created session record for session_id: {id}"
    - This prevents foreign key constraint errors
- Prepare INSERT statement: `INSERT INTO observations (sdk_session_id, project, type, title, subtitle, facts, narrative, concepts, files_read, files_modified, prompt_number, created_at, created_at_epoch) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`
- Insert observation with:
  - `facts`, `concepts`, `files_read`, `files_modified` → JSON.stringify()
  - Timestamps auto-generated
  - All fields as-is (nulls allowed)
Blocking Analysis: ❌ NEVER BLOCKS
- Auto-creates missing sessions (defensive programming)
- All content fields nullable (only `sdk_session_id`, `project`, `type`, and the timestamps are NOT NULL)
- No validation checks that could fail
- Schema is permissive
Value Assessment: ✅ HIGH VALUE
- Auto-creation pattern is brilliant
- Prevents foreign key errors
- Keep as-is
Schema Constraints:
- `type` must be one of 6 valid types (CHECK constraint)
  - BUT: Parser ensures type is always valid (defaults to "change")
- `sdk_session_id` has foreign key to `sdk_sessions`
  - BUT: Auto-creation ensures session exists
- Arrays stored as JSON strings
Edge Cases:
- Session doesn't exist → Auto-created
- Invalid type → Parser prevents this (defaults to "change")
- Null fields → Allowed by schema
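The parameter preparation for the INSERT can be illustrated with a small pure function. The parameter order follows the column list in the statement above; everything else is an assumption for illustration, including the function name `toObservationParams`, the `ObservationInput` type, and storing `created_at_epoch` in milliseconds.

```typescript
interface ObservationInput {
  type: string;
  title: string | null;
  subtitle: string | null;
  narrative: string | null;
  facts: string[];
  concepts: string[];
  files_read: string[];
  files_modified: string[];
}

// Builds the 13 positional parameters for the observations INSERT.
// Array fields are serialized to JSON strings; nulls pass through untouched.
function toObservationParams(
  sdkSessionId: string,
  project: string,
  obs: ObservationInput,
  promptNumber: number | null,
  now: Date = new Date(),
): Array<string | number | null> {
  return [
    sdkSessionId,
    project,
    obs.type,
    obs.title,
    obs.subtitle,
    JSON.stringify(obs.facts),
    obs.narrative,
    JSON.stringify(obs.concepts),
    JSON.stringify(obs.files_read),
    JSON.stringify(obs.files_modified),
    promptNumber,
    now.toISOString(), // created_at
    now.getTime(),     // created_at_epoch (assumed to be milliseconds)
  ];
}
```

Keeping this step separate from the database call makes the serialization easy to test without a SQLite file.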
SessionStore.storeSummary (SessionStore.ts lines 970-1029)
Purpose: Store a parsed summary in the database
Logic Flow:
- AUTO-CREATE SESSION (lines 987-1007):
  - Same defensive pattern as `storeObservation()`
  - Ensures session exists before INSERT
- Prepare INSERT statement: `INSERT INTO session_summaries (sdk_session_id, project, request, investigated, learned, completed, next_steps, notes, prompt_number, created_at, created_at_epoch) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`
- Insert summary with:
  - All content fields as-is (nulls allowed)
  - Timestamps auto-generated
Blocking Analysis: ❌ NEVER BLOCKS
- Auto-creates missing sessions
- All content fields nullable
- No validation checks
- Multiple summaries per session allowed (migration 7 removed UNIQUE constraint)
Value Assessment: ✅ HIGH VALUE
- Auto-creation ensures reliability
- Nullable fields allow partial data
- Keep as-is
Schema Evolution:
- Before migration 7: `sdk_session_id` had UNIQUE constraint → Only one summary per session
- After migration 7: UNIQUE removed → Multiple summaries per session (one per prompt)
Edge Cases:
- Session doesn't exist → Auto-created
- All fields null/empty → Allowed
- Multiple summaries for same session → Allowed (migration 7)
Database Schema Constraints
observations table
CREATE TABLE observations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
sdk_session_id TEXT NOT NULL, -- Foreign key
project TEXT NOT NULL,
text TEXT, -- Nullable (deprecated, migration 9)
type TEXT NOT NULL CHECK(type IN ('decision', 'bugfix', 'feature', 'refactor', 'discovery', 'change')),
title TEXT, -- Nullable
subtitle TEXT, -- Nullable
facts TEXT, -- Nullable (JSON array)
narrative TEXT, -- Nullable
concepts TEXT, -- Nullable (JSON array)
files_read TEXT, -- Nullable (JSON array)
files_modified TEXT, -- Nullable (JSON array)
prompt_number INTEGER, -- Nullable
created_at TEXT NOT NULL,
created_at_epoch INTEGER NOT NULL,
FOREIGN KEY(sdk_session_id) REFERENCES sdk_sessions(sdk_session_id) ON DELETE CASCADE
);
Blocking Potential:
- Invalid `type` → CHECK constraint violation
  - Mitigated by: Parser defaults to "change"
- Missing `sdk_session_id` → Foreign key violation
  - Mitigated by: Auto-creation in storeObservation()
session_summaries table
CREATE TABLE session_summaries (
id INTEGER PRIMARY KEY AUTOINCREMENT,
sdk_session_id TEXT NOT NULL, -- No longer UNIQUE (migration 7)
project TEXT NOT NULL,
request TEXT, -- Nullable
investigated TEXT, -- Nullable
learned TEXT, -- Nullable
completed TEXT, -- Nullable
next_steps TEXT, -- Nullable
notes TEXT, -- Nullable
prompt_number INTEGER, -- Nullable
created_at TEXT NOT NULL,
created_at_epoch INTEGER NOT NULL,
FOREIGN KEY(sdk_session_id) REFERENCES sdk_sessions(sdk_session_id) ON DELETE CASCADE
);
Blocking Potential:
- Missing `sdk_session_id` → Foreign key violation
  - Mitigated by: Auto-creation in storeSummary()
Key Design Decisions:
- Nullable fields - Allows partial data to be saved
- Auto-creation - Prevents foreign key errors
- No UNIQUE constraints (migration 7) - Multiple summaries per session
- WAL mode - Better concurrency for multiple sessions
- JSON arrays - Flexible storage for lists (facts, concepts, files)
Deep Dive: Prompt System
buildInitPrompt (prompts.ts lines 24-125)
Purpose: Generate initial prompt that instructs the SDK agent what to observe and how to record
Content:
- Role Definition: "You are observing a development session to create searchable memory FOR FUTURE SESSIONS"
- Critical Instruction: "Record what was BUILT/FIXED/DEPLOYED/CONFIGURED, not what you (the observer) are doing"
- What to Record: Focus on deliverables, capabilities, technical changes
- When to Skip: Routine operations (empty status checks, package installations, file listings)
- Output Format: XML structure with `<observation>` tags and required fields
Blocking Analysis: ⚠️ CAN CAUSE SKIPPING
- "WHEN TO SKIP" section instructs SDK to not output for routine operations
- "No output necessary if skipping" means no observations stored
- This is intentional filtering, not a bug
Value Assessment: ✅ HIGH VALUE
- Prevents noise from routine operations
- Focuses on meaningful changes
- Keep as-is, but consider making "WHEN TO SKIP" configurable
Key Instructions:
WHEN TO SKIP
------------
Skip routine operations:
- Empty status checks
- Package installations with no errors
- Simple file listings
- Repetitive operations you've already documented
- **No output necessary if skipping.**
Impact:
- Reduces database size by filtering noise
- But could cause "missing" observations for operations user cares about
- Trade-off between signal and completeness
buildObservationPrompt (prompts.ts lines 130-153)
Purpose: Wrap tool execution data in XML for SDK processing
Content:
<tool_used>
<tool_name>{name}</tool_name>
<tool_time>{ISO timestamp}</tool_time>
<tool_input>{JSON}</tool_input>
<tool_output>{JSON}</tool_output>
</tool_used>
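A wrapper producing the structure above might look like the following sketch. It is not the code in prompts.ts: `buildObservationPromptSketch` and `ToolObservation` are hypothetical names, and serializing `tool_input` and `tool_output` with `JSON.stringify` is an assumption (the real hook may already deliver them as strings).

```typescript
interface ToolObservation {
  tool_name: string;
  tool_input: unknown;
  tool_output: unknown;
}

// Wraps one tool execution in the <tool_used> XML structure.
// Pure data presentation: no instructions are added here.
function buildObservationPromptSketch(obs: ToolObservation, now: Date = new Date()): string {
  return [
    "<tool_used>",
    `  <tool_name>${obs.tool_name}</tool_name>`,
    `  <tool_time>${now.toISOString()}</tool_time>`,
    `  <tool_input>${JSON.stringify(obs.tool_input)}</tool_input>`,
    `  <tool_output>${JSON.stringify(obs.tool_output)}</tool_output>`,
    "</tool_used>",
  ].join("\n");
}
```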
Blocking Analysis: ❌ NO BLOCKING
- Just data presentation, no instructions to skip
- SDK processes based on init prompt rules
Value Assessment: ✅ HIGH VALUE
- Simple, clean data wrapper
- Keep as-is
buildSummaryPrompt (prompts.ts lines 158-178)
Purpose: Request summary of the session so far
Content:
- Instruction: "Think about the last request, and write a summary of what was done, what was learned, and what's next"
- Important Note: "DO NOT summarize the observation process itself - you are summarizing a DIFFERENT claude code session, not this one"
- Output Format: XML `<summary>` with required fields
- Encouragement: "Always write at least a minimal summary explaining where we are at currently, even if you didn't learn anything new or complete any work"
Blocking Analysis: ❌ NO BLOCKING
- Encourages always writing summary
- SDK may still skip if truly nothing to summarize
Value Assessment: ✅ HIGH VALUE
- Ensures summaries are generated
- "Always write at least a minimal summary" reduces skip rate
- Keep as-is
Data Flow Analysis
End-to-End Flow: Tool Execution → Database
1. User executes tool in Claude Code
↓
2. PostToolUse hook captures execution
↓
3. Hook sends HTTP POST to worker /observations endpoint
↓
4. Worker queues message in pendingMessages array
└─→ HTTP 200 response (confirmed receipt)
↓
5. createMessageGenerator polls queue (100ms interval)
↓
6. Message shifted from queue
↓
7. buildObservationPrompt wraps tool data in XML
↓
8. Generator yields message to SDK agent
↓
9. SDK sends message to Claude API
↓
10. Claude processes tool data based on init prompt
↓
11. Claude responds with XML (or skips if routine operation)
↓
12. SDK returns response to runSDKAgent
↓
13. handleAgentMessage receives response
↓
14. parseObservations extracts <observation> blocks
↓
15. For each observation:
- db.storeObservation called
- Auto-creates session if missing
- Inserts into observations table
↓
16. Data persisted in SQLite database
Failure Points:
- Point 3: Worker not running → HTTP request fails → Hook logs error
- Point 4: Worker crashes before processing → Queue lost
- Point 9: Invalid model config → SDK fails → Session marked failed
- Point 11: Malformed XML response → Parser returns empty array
- Point 15: Database error (rare) → Throws exception
Recovery Mechanisms:
- Auto-recovery: New requests after worker restart auto-create session
- Graceful degradation: Partial data saved (v4.2.5/v4.2.6 fixes)
- Database persistence: Once stored, data survives all restarts
Blocking Assessment Matrix
Components That CAN Block Data Storage
| Component | Blocking Scenario | Impact | Mitigation |
|---|---|---|---|
| Worker not running | HTTP requests fail | Observations not queued | PM2 auto-restart, health monitoring |
| Invalid CLAUDE_MEM_MODEL | SDK agent fails to start | Queued messages never processed | Validation in settings script |
| Invalid CLAUDE_CODE_PATH | SDK agent fails to start | Queued messages never processed | Default path, env var fallback |
| Malformed XML in SDK response | Parser can't extract | Data lost for that response | Better error handling, partial parsing |
| Worker restart | In-memory queue lost | Queued messages lost | Could persist queue to DB |
| Session abort (DELETE) | Queue processing stopped | Remaining queue lost | Graceful cleanup (v4.1.0) |
| Init prompt "WHEN TO SKIP" | SDK intentionally skips | No observation stored | Intentional filtering, configurable? |
Components That CANNOT Block Data Storage
| Component | Reason | Design Pattern |
|---|---|---|
| /observations endpoint | Auto-recovery, always queues | Maximally permissive |
| /summarize endpoint | Auto-recovery, always queues | Maximally permissive |
| parseObservations() | Defaults to "change" type, accepts nulls | Permissive (v4.2.6 fix) |
| parseSummary() | Returns partial summaries | Permissive (v4.2.5 fix) |
| storeObservation() | Auto-creates sessions, accepts nulls | Defensive programming |
| storeSummary() | Auto-creates sessions, accepts nulls | Defensive programming |
| Database schema | Nullable fields, no UNIQUE constraints | Flexible storage |
Critical Findings
1. Auto-Recovery Pattern Prevents Worker Restart Data Loss
Location: /observations and /summarize endpoints (lines 181-209, 241-270)
How it works:
if (!session) {
// Fetch session from database
const dbSession = db.getSessionById(sessionDbId);
// Recreate in-memory state
session = {
sessionDbId,
claudeSessionId: dbSession!.claude_session_id,
sdkSessionId: null,
project: dbSession!.project,
userPrompt: dbSession!.user_prompt,
pendingMessages: [],
abortController: new AbortController(),
generatorPromise: null,
lastPromptNumber: 0,
observationCounter: 0,
startTime: Date.now()
};
// Start new SDK agent
session.generatorPromise = this.runSDKAgent(session);
}
Value: ✅ HIGH
- Worker restart doesn't prevent new observations from being processed
- Database is source of truth
- In-memory session state can be rebuilt from the database, which enables resilience
Recommendation: Extract to helper function to reduce duplication
2. Parser Fixes (v4.2.5/v4.2.6) Ensure Partial Data Saved
parseObservations (v4.2.6):
// NOTE FROM THEDOTMACK: ALWAYS save observations - never skip. 10/24/2025
// All fields except type are nullable in schema
// If type is missing or invalid, use "change" as catch-all fallback
let finalType = 'change'; // Default catch-all
if (type && validTypes.includes(type.trim())) {
finalType = type.trim();
}
// All other fields are optional - save whatever we have
observations.push({
type: finalType,
title, // Can be null
subtitle, // Can be null
facts,
narrative, // Can be null
concepts,
files_read,
files_modified
});
parseSummary (v4.2.5):
// NOTE FROM THEDOTMACK: 100% of the time we must SAVE the summary,
// even if fields are missing. 10/24/2025
// NEVER DO THIS NONSENSE AGAIN.
return {
request, // Can be null
investigated, // Can be null
learned, // Can be null
completed, // Can be null
next_steps, // Can be null
notes // Can be null
};
Value: ✅ CRITICAL
- Prevents data loss from incomplete AI responses
- LLMs make mistakes - system must be resilient
- Partial data is better than no data
Recommendation: Keep as-is, this is the right design
3. In-Memory Queue is Main Vulnerability
Issue: pendingMessages array is in-memory only
- Worker restart → All queued messages lost
- But HTTP response already confirmed receipt
Current behavior:
- Hook sends observation → Worker responds "queued" → Hook thinks it's saved
- Worker crashes before processing → Observation lost
- BUT: New observations after restart are still processed (auto-recovery)
Impact: ⚠️ MEDIUM
- Data loss window between queue and processing
- But observations are idempotent (can be resent)
- Hooks don't retry on success response
Recommendation: ⚠️ CONSIDER
- Persist queue to database (e.g., pending_observations table)
- Mark as processed when SDK handles
- Increases reliability but adds complexity
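One way to picture the persisted queue is sketched below, under stated assumptions. `PendingStore`, `InMemoryPendingStore`, and `DurableQueue` are hypothetical names; the in-memory store only stands in for a SQLite table such as the suggested pending_observations, so that the example runs without a database. The idea is that a message is written to the store before the HTTP handler answers "queued", and is marked processed only after the handler succeeds.

```typescript
interface PendingMessage {
  id: number;
  sessionDbId: number;
  payload: string;
  processed: boolean;
}

// Assumed persistence interface; a real version would wrap SQL statements.
interface PendingStore {
  insert(sessionDbId: number, payload: string): number;
  markProcessed(id: number): void;
  loadUnprocessed(): PendingMessage[];
}

// Stand-in for a database table, used here only to keep the sketch runnable.
class InMemoryPendingStore implements PendingStore {
  private rows: PendingMessage[] = [];
  private nextId = 1;

  insert(sessionDbId: number, payload: string): number {
    const id = this.nextId++;
    this.rows.push({ id, sessionDbId, payload, processed: false });
    return id;
  }

  markProcessed(id: number): void {
    const row = this.rows.find((r) => r.id === id);
    if (row) row.processed = true;
  }

  loadUnprocessed(): PendingMessage[] {
    // Return copies so callers cannot mutate stored rows.
    return this.rows.filter((r) => !r.processed).map((r) => ({ ...r }));
  }
}

class DurableQueue {
  private store: PendingStore;
  private pending: PendingMessage[];

  constructor(store: PendingStore) {
    this.store = store;
    // On startup, pick up whatever a previous worker left unprocessed.
    this.pending = store.loadUnprocessed();
  }

  // Persist first, then queue in memory; only then should the caller reply "queued".
  enqueue(sessionDbId: number, payload: string): number {
    const id = this.store.insert(sessionDbId, payload);
    this.pending.push({ id, sessionDbId, payload, processed: false });
    return id;
  }

  // Process the oldest message; it is marked processed only if the handler returns.
  processNext(handler: (message: PendingMessage) => void): boolean {
    const message = this.pending[0];
    if (!message) return false;
    handler(message);
    this.store.markProcessed(message.id);
    this.pending.shift();
    return true;
  }

  get length(): number {
    return this.pending.length;
  }
}
```

If the handler throws, the message stays at the head of the queue and remains unprocessed in the store, so a restarted worker would see it again. That is the added complexity mentioned above: handlers need to tolerate a message being delivered more than once.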
4. Init Prompt "WHEN TO SKIP" Intentionally Filters
Instruction:
WHEN TO SKIP
------------
Skip routine operations:
- Empty status checks
- Package installations with no errors
- Simple file listings
- Repetitive operations you've already documented
- **No output necessary if skipping.**
Impact:
- Reduces noise in database
- Focuses on meaningful changes
- BUT: User might wonder why some tool executions aren't recorded
Value: ✅ MEDIUM - Intentional filtering
- Prevents database bloat
- Trade-off between signal and completeness
Recommendation: ⚠️ CONSIDER
- Make "WHEN TO SKIP" configurable (env var or settings)
- Or add verbosity levels (minimal/normal/verbose)
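A verbosity setting could be wired in along these lines. This is a sketch only: `parseVerbosity`, `buildSkipSection`, and the rule lists are hypothetical, the `normal` rules mirror the current "WHEN TO SKIP" text, and the extra `minimal` rule is an invented placeholder. The environment is passed in as a plain record so the example does not depend on Node typings.

```typescript
type Verbosity = "minimal" | "normal" | "verbose";

// "normal" mirrors the current prompt rules.
const NORMAL_SKIP_RULES: string[] = [
  "Empty status checks",
  "Package installations with no errors",
  "Simple file listings",
  "Repetitive operations you've already documented",
];

const SKIP_RULES: Record<Verbosity, string[]> = {
  // Placeholder extra rule, purely illustrative.
  minimal: [...NORMAL_SKIP_RULES, "Read-only inspections with no findings"],
  normal: NORMAL_SKIP_RULES,
  // Verbose records everything, so no skip rules at all.
  verbose: [],
};

// Read the level from an env-like record, falling back to "normal".
function parseVerbosity(env: Record<string, string | undefined>): Verbosity {
  const raw = (env["CLAUDE_MEM_VERBOSITY"] ?? "").trim().toLowerCase();
  if (raw === "minimal" || raw === "normal" || raw === "verbose") return raw;
  return "normal";
}

// Build the prompt section, or an empty string when nothing should be skipped.
function buildSkipSection(verbosity: Verbosity): string {
  const rules = SKIP_RULES[verbosity];
  if (rules.length === 0) return "";
  return [
    "WHEN TO SKIP",
    "------------",
    "Skip routine operations:",
    ...rules.map((rule) => `- ${rule}`),
    "- **No output necessary if skipping.**",
  ].join("\n");
}
```

Defaulting to `normal` on a missing or unrecognized value keeps today's behavior for anyone who never sets the variable.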
Value Assessment by Component
HIGH VALUE - Keep As-Is
| Component | Reason |
|---|---|
| Auto-recovery pattern | Worker restarts don't block processing of new observations |
| Permissive parser (v4.2.5/v4.2.6) | Ensures partial data saved, critical for reliability |
| Nullable database schema | Flexible storage, allows incomplete data |
| WAL mode SQLite | Good concurrency, reliable writes |
| Isolated session state | No cross-contamination between sessions |
| Queue-based architecture | Decouples HTTP from SDK processing |
| storeObservation/storeSummary auto-creation | Defensive programming, prevents foreign key errors |
MEDIUM VALUE - Consider Improvements
| Component | Current State | Potential Improvement |
|---|---|---|
| In-memory queue | Lost on restart | Persist to DB for durability |
| 100ms polling | Works but inefficient | Event-driven async queue |
| Duplicated auto-recovery code | Lines 181-209 and 241-270 identical | Extract to getOrCreateSession() helper |
| No try-catch around DB ops | Errors crash handler | Add error handling with logging |
| Model/port defaults | Hard-coded | Already configurable via env vars ✓ |
| Init prompt filtering | Fixed "WHEN TO SKIP" rules | Make configurable (verbosity levels) |
LOW VALUE - Questionable Design
| Component | Issue | Recommendation |
|---|---|---|
| cleanupOrphanedSessions() | Marks ALL active sessions failed on startup | Aggressive, but necessary with fixed port |
| 5-second DELETE timeout | Arbitrary | Make configurable via env var |
| "NO SUMMARY TAGS FOUND" warning | Log level too high | Change to INFO level |
Recommendations
Priority 1: Critical Reliability Improvements
- Persist Message Queue to Database
  - Create pending_messages table
  - Store queued observations/summaries
  - Mark as processed when handled by SDK
  - Prevents data loss on worker restart
  - Effort: Medium, Impact: High
- Add Error Handling Around Database Operations
  - Wrap db.storeObservation() and db.storeSummary() in try-catch
  - Log errors with full context
  - Continue processing other messages on error
  - Effort: Low, Impact: Medium
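The error-handling recommendation might be sketched as follows. The `ObservationStore` and `Logger` interfaces and the `storeAll` helper are hypothetical, and the real `storeObservation` signature may differ. The point illustrated is only that each write is wrapped individually, so one failing row is logged with context and does not stop the rest.

```typescript
interface ParsedObservation {
  type: string;
  title: string | null;
}

// Assumed minimal interfaces; the real store and logger may look different.
interface ObservationStore {
  storeObservation(sessionId: string, observation: ParsedObservation): void;
}

interface Logger {
  error(message: string, context: Record<string, unknown>): void;
}

// Store each observation independently; log and continue on failure.
function storeAll(
  db: ObservationStore,
  logger: Logger,
  sessionId: string,
  observations: ParsedObservation[],
): { stored: number; failed: number } {
  let stored = 0;
  let failed = 0;
  for (const observation of observations) {
    try {
      db.storeObservation(sessionId, observation);
      stored++;
    } catch (err) {
      failed++;
      logger.error("Failed to store observation", {
        sessionId,
        type: observation.type,
        title: observation.title,
        error: err instanceof Error ? err.message : String(err),
      });
    }
  }
  return { stored, failed };
}
```

Returning the counts lets the caller decide whether a partially failed batch deserves a warning or a retry.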
Priority 2: Code Quality Improvements
- Extract Auto-Recovery to Helper Function

      private async getOrCreateSession(sessionDbId: number): Promise<ActiveSession> {
        // Consolidate lines 181-209 and 241-270
      }

  - Effort: Low, Impact: Low (code quality)
- Make Configuration More Flexible
  - Add CLAUDE_MEM_VERBOSITY env var (minimal/normal/verbose)
  - Adjust init prompt "WHEN TO SKIP" based on verbosity
  - Add CLAUDE_MEM_DELETE_TIMEOUT env var
  - Effort: Low, Impact: Medium
Priority 3: Performance Optimizations
- Replace Polling with Event-Driven Queue
  - Use AsyncQueue with notifications instead of 100ms polling
  - Reduces latency from queue to processing
  - Effort: Medium, Impact: Low (performance)
- Add Queue Metrics
  - Track queue length over time
  - Alert if queue grows unbounded
  - Add to /health endpoint
  - Effort: Low, Impact: Low (observability)
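An event-driven queue of the kind suggested above can be built from plain promises. The `AsyncQueue` class below is a hypothetical minimal sketch, not an existing library API: a consumer that calls `next()` on an empty queue is parked until the next `push()`, so no timer is needed, and `length` exposes the backlog for a metrics or health endpoint.

```typescript
class AsyncQueue<T> {
  private items: T[] = [];
  private waiters: Array<(item: T) => void> = [];

  // Hand the item straight to a parked consumer if there is one; otherwise buffer it.
  push(item: T): void {
    const waiter = this.waiters.shift();
    if (waiter) {
      waiter(item);
    } else {
      this.items.push(item);
    }
  }

  // Resolve immediately when an item is buffered; otherwise wait for the next push.
  next(): Promise<T> {
    if (this.items.length > 0) {
      return Promise.resolve(this.items.shift() as T);
    }
    return new Promise<T>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  // Number of buffered items not yet claimed by a consumer.
  get length(): number {
    return this.items.length;
  }
}
```

A consumer loop would then be `while (true) { const message = await queue.next(); ... }`, replacing the 100ms poll. Shutdown handling (for example a sentinel message or an abort signal) is left out of the sketch.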
Appendix: Configuration Reference
Environment Variables
| Variable | Default | Purpose | Blocking Impact |
|---|---|---|---|
| CLAUDE_MEM_MODEL | claude-sonnet-4-5 | AI model for processing | Invalid = SDK fails |
| CLAUDE_MEM_WORKER_PORT | 37777 | HTTP server port | Invalid = Worker won't start |
| CLAUDE_CODE_PATH | /Users/alexnewman/.nvm/versions/node/v24.5.0/bin/claude | Path to Claude Code | Invalid = SDK fails |
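As an illustration of how these defaults and failure modes could be enforced at startup, here is a hedged sketch. `WorkerConfig` and `loadConfig` are hypothetical names, and the worker's actual parsing may differ. The environment is passed in as a plain record (in the worker it would presumably be `process.env`), and an invalid port fails fast instead of surfacing later as a bind error.

```typescript
interface WorkerConfig {
  model: string;
  port: number;
}

const DEFAULT_MODEL = "claude-sonnet-4-5";
const DEFAULT_PORT = 37777;

// Resolve configuration from an env-like record, applying the documented defaults.
function loadConfig(env: Record<string, string | undefined>): WorkerConfig {
  const model = env["CLAUDE_MEM_MODEL"]?.trim() || DEFAULT_MODEL;

  const rawPort = env["CLAUDE_MEM_WORKER_PORT"];
  let port = DEFAULT_PORT;
  if (rawPort !== undefined) {
    port = Number(rawPort);
    // Reject non-numeric, fractional, and out-of-range values.
    if (!Number.isInteger(port) || port < 1 || port > 65535) {
      throw new Error(`Invalid CLAUDE_MEM_WORKER_PORT: "${rawPort}"`);
    }
  }

  return { model, port };
}
```

The model name is not validated here, since only the SDK can tell whether a given model is usable.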
Constants
| Constant | Value | Purpose |
|---|---|---|
| DISALLOWED_TOOLS | ['Glob', 'Grep', 'ListMcpResourcesTool', 'WebSearch'] | Tools SDK agent can't use |
| Polling interval | 100ms | Queue polling frequency |
| DELETE timeout | 5000ms | Max wait for agent shutdown |
Conclusion
The claude-mem worker server is a well-designed system with a clear defensive, layered architecture that prioritizes data persistence. The key strengths are:
- Auto-recovery from worker restarts
- Permissive parsing that saves partial data
- Nullable schema that accepts incomplete information
- Session isolation preventing cross-contamination
The main vulnerability is the in-memory queue, which could be mitigated by persisting to the database. Overall, the system achieves its goal of creating a persistent memory system that survives failures and continues operating even with incomplete data.
Design Philosophy: "Better to save partial data than lose everything."
This philosophy is evident throughout the codebase, from the v4.2.5/v4.2.6 parser fixes to the auto-creation patterns in the database layer. The system is built to be resilient to AI errors, configuration issues, and process failures.
End of Document