claude-mem/docs/reports/2026-01-04--logging-analysis-and-recommendations.md

# Logging Analysis and Recommendations

**Date**: 2026-01-04
**Status**: CRITICAL - Current logging does not prove system correctness
**Goal**: Enable operators to visually verify the system is working and quickly discover when it isn't

---

## Executive Summary

The current logging is **noisy bullshit that doesn't cover the important parts of the system**. The logging should:

1. **PROVE** the system is working correctly (not just record activity)
2. **MAKE OBVIOUS** when things break (clear error paths)
3. **TRACE** data end-to-end through the pipeline

### Critical Finding: Session ID Alignment is BROKEN and UNVERIFIABLE

The system has **three session ID types** that must stay aligned:
- `contentSessionId` - from Claude Code (user's session)
- `sessionDbId` - our internal database ID (integer)
- `memorySessionId` - from Claude SDK (enables resume)

**The [ALIGNMENT] logs exist because this mapping is STILL a regression bug.** The current logs show intermediate values but **don't prove correctness**.

---

## Critical System Operations

### 1. Session ID Mapping Chain (MOST CRITICAL)

```
contentSessionId (from hook)
    → sessionDbId (our DB lookup)
    → memorySessionId (captured from SDK)
```

**If this breaks, observations go to wrong sessions = DATA CORRUPTION**

**Current State:**
| Operation | Has Logging? | Proves Correctness? |
|-----------|-------------|---------------------|
| Hook receives contentSessionId | YES | NO - just logs receipt |
| DB creates/looks up sessionDbId | PARTIAL | NO - no verification |
| SDK response gives memorySessionId | YES | NO - no DB update verification |
| Observations stored with memorySessionId | PARTIAL | NO - doesn't show which IDs used |

**What's MISSING:**

```
[INFO] [SESSION] SESSION_CREATED | contentSessionId=abc123 → sessionDbId=42 | isNew=true
[INFO] [SESSION] MEMORY_ID_CAPTURED | sessionDbId=42 | memorySessionId=xyz789 | dbUpdateSuccess=true
[INFO] [SESSION] E2E_VERIFIED | contentSessionId=abc123 → sessionDbId=42 → memorySessionId=xyz789
```

### 2. Observation Storage Pipeline (CRITICAL)

**The pipeline:**
```
Hook captures tool use
    → Worker receives observation
    → Queued to pending_messages
    → SDK agent claims message
    → SDK processes → generates XML
    → Observations parsed
    → Stored to DB with memorySessionId
    → Synced to Chroma
```

**Current State:**
| Operation | Has Logging? | Proves Correctness? |
|-----------|-------------|---------------------|
| Hook captures tool | YES | Noise - "Received hook input" |
| Observation queued | YES | Noise - just says "queued" |
| Message claimed from queue | NO | MISSING |
| Observation parsed | NO | MISSING |
| Observation stored to DB | PARTIAL | NO - doesn't show IDs used |
| DB transaction committed | NO | MISSING |
| Chroma sync complete | DEBUG only | Should be INFO for failures |

**What's MISSING:**

```
[INFO] [QUEUE] CLAIMED | sessionDbId=42 | messageId=5 | type=observation | tool=Bash(npm test)
[INFO] [DB   ] STORED | sessionDbId=42 | memorySessionId=xyz789 | observations=2 | ids=[101,102]
[INFO] [QUEUE] COMPLETED | sessionDbId=42 | messageId=5 | processingTime=1.2s
```

### 3. Queue Processing (CRITICAL)

Messages can fail, get stuck, or be lost. Current logging doesn't show:
- When a message is claimed
- When a message is completed
- When a message fails and WHY
- Queue depth and processing latency

**Current State:**
- Queue enqueue: `logger.debug` (not visible at INFO)
- Queue claim: NO LOGGING
- Queue completion: NO LOGGING
- Queue failure: `logger.error` (exists but rare)
- Recovery of stuck messages: `logger.info` (good)

**What's MISSING:**

```
[INFO] [QUEUE] ENQUEUE | sessionDbId=42 | type=observation | queueDepth=3
[INFO] [QUEUE] CLAIM | sessionDbId=42 | messageId=5 | waitTime=0.1s
[INFO] [QUEUE] COMPLETE | sessionDbId=42 | messageId=5 | success=true
[ERROR][QUEUE] FAILED | sessionDbId=42 | messageId=5 | error="SDK timeout" | willRetry=true
```

### 4. Context Injection (IMPORTANT)

When a session starts, relevant past observations should be injected. Current logging doesn't show:
- What context was searched for
- What was found
- What was injected

**Current State:** Effectively no logging for context injection success path.

---

## What's Currently NOISE (Should Be DEBUG or Removed)

### Chatty Session Init Logs (new-hook.ts)
```typescript
// 7 INFO logs for a single session init
logger.info('HOOK', 'new-hook: Received hook input');      // WHO CARES
logger.info('HOOK', 'new-hook: Calling /api/sessions/init'); // WHO CARES
logger.info('HOOK', 'new-hook: Received from /api/sessions/init'); // WHO CARES
logger.info('HOOK', 'new-hook: Session N, prompt #M');     // CONSOLIDATE INTO ONE
logger.info('HOOK', 'new-hook: Calling /sessions/{id}/init'); // WHO CARES
```

**Should be ONE log:** `SESSION_INIT | sessionDbId=42 | promptNumber=1 | project=foo`

### Chatty SessionManager Logs
```typescript
logger.info('SESSION', 'initializeSession called');  // WHO CARES
logger.info('SESSION', 'Returning cached session');  // DEBUG
logger.info('SESSION', 'Fetched session from database'); // DEBUG
logger.info('SESSION', 'Creating new session object'); // DEBUG
logger.info('SESSION', 'Session initialized');       // GOOD - KEEP
logger.info('SESSION', 'Observation queued');        // DEBUG - happens constantly
logger.info('SESSION', 'Summarize queued');          // DEBUG - happens constantly
```

### Chatty Chroma Backfill Logs
```typescript
// Logs EVERY batch at INFO - should be DEBUG for progress
logger.info('CHROMA_SYNC', 'Backfill progress', { processed, remaining }); // DEBUG
```

**Should be START and END only at INFO level.**

### Duplicate Migration Logs
Both `SessionStore.ts` and `migrations/runner.ts` have ~25 identical log statements. **DEDUPLICATE.**

---

## [ALIGNMENT] Logs: The Problem

The [ALIGNMENT] logs were added to debug session ID issues. They're in the RIGHT places but they **don't prove anything**:

```typescript
// Current - shows values but doesn't verify
logger.info('SDK', `[ALIGNMENT] Resume Decision | contentSessionId=${...} | memorySessionId=${...}`);

// What's needed - proves correctness
logger.info('SDK', `[ALIGNMENT] VERIFIED | contentSessionId=${...} → sessionDbId=${...} → memorySessionId=${...} | dbMatch=true | resumeValid=true`);
```

**Current problems:**
1. Log values without validation
2. Don't show if DB operations succeeded
3. Don't trace end-to-end
4. Mixed in with noise - hard to see

**What they should do:**
1. Log the mapping chain ONCE with verification
2. Show DB operation success/failure
3. Provide clear end-to-end trace
4. Stand out from noise with consistent prefix

---

## Proposed Logging Architecture

### Log Levels by Purpose

| Level | Purpose | Examples |
|-------|---------|----------|
| ERROR | Something FAILED | DB write failed, SDK crashed, queue overflow |
| WARN | Something UNEXPECTED but handled | Fallback used, retry needed, timeout |
| INFO | KEY OPERATIONS completed | Session created, observation stored, queue processed |
| DEBUG | Detailed tracing | Cache hits, intermediate states, parsing details |

### Critical Path Logging (Must be INFO)

#### Session Lifecycle
```
[INFO] [SESSION] CREATED | contentSessionId=abc → sessionDbId=42 | project=foo
[INFO] [SESSION] MEMORY_ID_CAPTURED | sessionDbId=42 → memorySessionId=xyz | dbUpdated=true
[INFO] [SESSION] VERIFIED | chain: abc→42→xyz | valid=true
[INFO] [SESSION] COMPLETED | sessionDbId=42 | duration=45s | observations=12 | summaries=1
```

#### Observation Pipeline
```
[INFO] [QUEUE] ENQUEUED | sessionDbId=42 | type=observation | tool=Bash(npm test) | depth=1
[INFO] [QUEUE] CLAIMED | sessionDbId=42 | messageId=5 | waitTime=0.1s
[INFO] [DB   ] STORED | sessionDbId=42 | memorySessionId=xyz | obsIds=[101,102] | txnCommit=true
[INFO] [QUEUE] COMPLETED | sessionDbId=42 | messageId=5 | duration=1.2s
```

#### Error Conditions
```
[ERROR] [SESSION] MEMORY_ID_MISMATCH | expected=xyz | got=abc | sessionDbId=42
[ERROR] [DB     ] STORE_FAILED | sessionDbId=42 | error="FK constraint" | observations=2
[ERROR] [QUEUE  ] STUCK | sessionDbId=42 | stuckFor=5min | action=marking_failed
[ERROR] [SDK    ] CRASHED | sessionDbId=42 | error="Claude process died" | pendingWork=3
```

### Health Dashboard Output

After fixes, a healthy session should produce:
```
[INFO] [SESSION] CREATED | contentSessionId=abc → sessionDbId=42
[INFO] [SESSION] GENERATOR_STARTED | sessionDbId=42 | provider=claude-sdk
[INFO] [QUEUE  ] CLAIMED | sessionDbId=42 | messageId=1 | type=observation
[INFO] [SESSION] MEMORY_ID_CAPTURED | sessionDbId=42 → memorySessionId=xyz
[INFO] [DB     ] STORED | sessionDbId=42 | memorySessionId=xyz | obsIds=[1]
[INFO] [QUEUE  ] COMPLETED | sessionDbId=42 | messageId=1
... (more observations)
[INFO] [QUEUE  ] CLAIMED | sessionDbId=42 | messageId=5 | type=summarize
[INFO] [DB     ] STORED | sessionDbId=42 | summaryId=1
[INFO] [QUEUE  ] COMPLETED | sessionDbId=42 | messageId=5
[INFO] [SESSION] COMPLETED | sessionDbId=42 | duration=45s | observations=12
```

An UNHEALTHY session should make problems OBVIOUS:
```
[INFO] [SESSION] CREATED | contentSessionId=abc → sessionDbId=42
[INFO] [SESSION] GENERATOR_STARTED | sessionDbId=42 | provider=claude-sdk
[ERROR] [SESSION] MEMORY_ID_NOT_CAPTURED | sessionDbId=42 | waited=30s
[ERROR] [DB     ] STORE_FAILED | sessionDbId=42 | error="memorySessionId is null"
[WARN ] [QUEUE  ] STUCK | sessionDbId=42 | messageId=1 | age=60s | action=retry
[ERROR] [SESSION] GENERATOR_CRASHED | sessionDbId=42 | error="SDK timeout"
```

---

## Implementation Priorities

### P0: Fix Critical Missing Logs (Session Alignment)

1. **ResponseProcessor.ts** - Add logging BEFORE storeObservations:
   ```typescript
   logger.info('DB', 'STORING | sessionDbId=... | memorySessionId=... | count=...');
   ```

2. **SDKAgent.ts** - Verify DB update after memorySessionId capture:
   ```typescript
   const updated = store.updateMemorySessionId(sessionDbId, memorySessionId);
   logger.info('SESSION', `MEMORY_ID_CAPTURED | sessionDbId=${...} | memorySessionId=${...} | dbUpdated=${updated}`);
   ```

3. **SessionRoutes.ts** - Log session creation with verification:
   ```typescript
   logger.info('SESSION', `CREATED | contentSessionId=${...} → sessionDbId=${...} | verified=true`);
   ```

### P1: Fix Queue Processing Logs

1. **SessionQueueProcessor.ts** - Add CLAIM/COMPLETE logs
2. **PendingMessageStore.ts** - Add enqueue/dequeue logs

### P2: Reduce Noise

1. Move chatty logs to DEBUG level
2. Deduplicate migration logs
3. Consolidate hook init logs

### P3: Add Health Validation

1. Periodic verification log: `[INFO] [HEALTH] OK | sessions=3 | pending=0 | chroma=connected`
2. On-demand chain verification: `[INFO] [VERIFY] contentSessionId=abc chain is VALID`

---

## Files Requiring Changes

| File | Priority | Changes |
|------|----------|---------|
| `src/services/worker/agents/ResponseProcessor.ts` | P0 | Add pre-store logging with IDs |
| `src/services/worker/SDKAgent.ts` | P0 | Verify DB update, consolidate ALIGNMENT logs |
| `src/services/worker/http/routes/SessionRoutes.ts` | P0 | Add session creation verification log |
| `src/services/queue/SessionQueueProcessor.ts` | P1 | Add CLAIM/COMPLETE logs |
| `src/services/sqlite/PendingMessageStore.ts` | P1 | Add enqueue/dequeue logs |
| `src/services/worker/SessionManager.ts` | P2 | Move chatty logs to DEBUG |
| `src/hooks/new-hook.ts` | P2 | Consolidate to single INFO log |
| `src/services/sync/ChromaSync.ts` | P2 | Move progress to DEBUG, keep start/end INFO |
| `src/services/sqlite/SessionStore.ts` | P2 | Remove duplicate migration logs |

---

## Verification Checklist

After implementing changes, verify:

- [ ] Can trace contentSessionId → sessionDbId → memorySessionId in logs
- [ ] Can see when observation storage succeeds/fails
- [ ] Can see queue claim/complete for each message
- [ ] Errors are OBVIOUS and include context for debugging
- [ ] Noise is reduced to the point where INFO level is useful
- [ ] A "normal" session produces ~10-15 INFO logs, not 50+