perf: streamline worker startup and consolidate database connections (#2122)

* docs: pathfinder refactor corpus + Node 20 preflight

Adds the PATHFINDER-2026-04-22 principle-driven refactor plan (11 docs,
cross-checked PASS) plus the exploratory PATHFINDER-2026-04-21 corpus
that motivated it. Bumps engines.node to >=20.0.0 per the ingestion-path
plan preflight (recursive fs.watch). Adds the pathfinder skill.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: land PATHFINDER Plan 01 — data integrity

Schema, UNIQUE constraints, self-healing claim, Chroma upsert fallback.

- Phase 1: fresh schema.sql regenerated at post-refactor shape.
- Phase 2: migrations 23+24 — rebuild pending_messages without
  started_processing_at_epoch; UNIQUE(session_id, tool_use_id);
  UNIQUE(memory_session_id, content_hash) on observations; dedup
  duplicate rows before adding indexes.
- Phase 3: claimNextMessage rewritten to self-healing query using
  worker_pid NOT IN live_worker_pids; STALE_PROCESSING_THRESHOLD_MS
  and the 60-s stale-reset block deleted.
- Phase 4: DEDUP_WINDOW_MS and findDuplicateObservation deleted;
  observations.insert now uses ON CONFLICT DO NOTHING.
- Phase 5: failed-message purge block deleted from worker-service
  2-min interval; clearFailedOlderThan method deleted.
- Phase 6: repairMalformedSchema and its Python subprocess repair
  path deleted from Database.ts; SQLite errors now propagate.
- Phase 7: Chroma delete-then-add fallback gated behind
  CHROMA_SYNC_FALLBACK_ON_CONFLICT env flag as bridge until
  Chroma MCP ships native upsert.
- Phase 8: migration 19 no-op block absorbed into fresh schema.sql.
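
A minimal in-memory model of the Phase 3 claim predicate (row shape
and function name are illustrative, not the real schema or query):

```typescript
// Hypothetical row shape for pending_messages after migrations 23+24.
interface PendingRow {
  id: number;
  status: "pending" | "processing" | "failed";
  worker_pid: number | null;
}

// Self-healing claim: a row is claimable when it is pending, or when a
// now-dead worker left it in processing. No staleness timer involved.
function claimNextMessage(
  rows: PendingRow[],
  livePids: Set<number>,
): PendingRow | undefined {
  return rows.find(
    (r) =>
      r.status === "pending" ||
      (r.status === "processing" &&
        r.worker_pid !== null &&
        !livePids.has(r.worker_pid)),
  );
}
```

The real implementation expresses the same predicate in SQL
(worker_pid NOT IN live_worker_pids); the model only shows why no
STALE_PROCESSING_THRESHOLD_MS is needed once liveness is checked
directly.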

Verification greps all return 0 matches. bun test tests/sqlite/
passes 63/63. bun run build succeeds.

Plan: PATHFINDER-2026-04-22/01-data-integrity.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: land PATHFINDER Plan 02 — process lifecycle

OS process groups replace hand-rolled reapers. Worker runs until
killed; orphans are prevented by detached spawn + kill(-pgid).

- Phase 1: src/services/worker/ProcessRegistry.ts DELETED. The
  canonical registry at src/supervisor/process-registry.ts is the
  sole survivor; SDK spawn site consolidated into it via new
  createSdkSpawnFactory/spawnSdkProcess/getSdkProcessForSession/
  ensureSdkProcessExit/waitForSlot helpers.
- Phase 2: SDK children spawn with detached:true + stdio:
  ['ignore','pipe','pipe']; pgid recorded on ManagedProcessInfo.
- Phase 3: shutdown.ts signalProcess teardown uses
  process.kill(-pgid, signal) on Unix when pgid is recorded;
  Windows path unchanged (tree-kill/taskkill).
- Phase 4: all reaper intervals deleted — startOrphanReaper call,
  staleSessionReaperInterval setInterval (including the co-located
  WAL checkpoint — SQLite's built-in wal_autocheckpoint handles
  WAL growth without an app-level timer), killIdleDaemonChildren,
  killSystemOrphans, reapOrphanedProcesses, reapStaleSessions, and
  detectStaleGenerator. MAX_GENERATOR_IDLE_MS and MAX_SESSION_IDLE_MS
  constants deleted.
- Phase 5: abandonedTimer — already 0 matches; primary-path cleanup
  via generatorPromise.finally() already lives in worker-service
  startSessionProcessor and SessionRoutes ensureGeneratorRunning.
- Phase 6: evictIdlestSession and its evict callback deleted from
  SessionManager. Pool admission gates backpressure upstream.
- Phase 7: SDK-failure fallback — SessionManager has zero matches
  for fallbackAgent/Gemini/OpenRouter. Failures surface to hooks
  via exit code 2 through SessionRoutes error mapping.
- Phase 8: ensureWorkerRunning in worker-utils.ts rewritten to
  lazy-spawn — consults isWorkerPortAlive (which gates
  captureProcessStartToken for PID-reuse safety via commit
  99060bac), then spawns detached with unref(), then
  waitForWorkerPort({ attempts: 3, backoffMs: 250 }) hand-rolled
  exponential backoff 250→500→1000ms. No respawn npm dep.
- Phase 9: idle self-shutdown — zero matches for
  idleCheck/idleTimeout/IDLE_MAX_MS/idleShutdown. Worker exits
  only on external SIGTERM via supervisor signal handlers.
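
The Phase 2/3 teardown decision can be sketched as a pure helper
(names are illustrative; the real logic lives in shutdown.ts):

```typescript
// Hypothetical shape mirroring the pgid recorded on ManagedProcessInfo.
interface ManagedProcessInfo {
  pid: number;
  pgid?: number;
}

// On Unix, children spawned with detached: true get their own process
// group, so a single kill(-pgid) reaps the whole tree; Windows keeps
// the per-process tree-kill path.
function killTarget(info: ManagedProcessInfo, platform: string): number {
  if (platform !== "win32" && info.pgid !== undefined) {
    return -info.pgid; // negative pid targets the process group
  }
  return info.pid;
}
```

In the real teardown the result feeds process.kill(target, signal);
here only the target selection is modeled.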

Three test files that exercised deleted code removed:
tests/worker/process-registry.test.ts,
tests/worker/session-lifecycle-guard.test.ts,
tests/services/worker/reap-stale-sessions.test.ts.
Pass count: 1451 → 1407 (-44), all attributable to deleted test
files. Zero new failures. 31 pre-existing failures remain
(schema-repair suite, logger-usage-standards, environmental
openclaw / plugin-distribution) — none introduced by Plan 02.

All 10 verification greps return 0. bun run build succeeds.

Plan: PATHFINDER-2026-04-22/02-process-lifecycle.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: land PATHFINDER Plan 04 (narrowed) — search fail-fast

Phases 3, 5, 6 only. Phases 1/2/4/7/8/9 carry plan-doc
inaccuracies and are deferred pending plan reconciliation:
  - Phase 1/2: ObservationRow type doesn't exist; the four
    "formatters" operate on three incompatible types.
  - Phase 4: RECENCY_WINDOW_MS already imported from
    SEARCH_CONSTANTS at every call site.
  - Phase 7: getExistingChromaIds is NOT @deprecated and has an
    active caller in ChromaSync.backfillMissingSyncs.
  - Phase 8: estimateTokens already consolidated.
  - Phase 9: knowledge-corpus rewrite blocked on PG-3
    prompt-caching cost smoke test.

Phase 3 — Delete SearchManager.findByConcept/findByFile/findByType.
SearchRoutes handlers (handleSearchByConcept/File/Type) now call
searchManager.getOrchestrator().findByXxx() directly via new
getter accessors on SearchManager. ~250 LoC deleted.

Phase 5 — Fail-fast Chroma. Created
src/services/worker/search/errors.ts with ChromaUnavailableError
extends AppError(503, 'CHROMA_UNAVAILABLE'). Deleted
SearchOrchestrator.executeWithFallback's Chroma-failed
SQLite-fallback branch; runtime Chroma errors now throw 503.
"Path 3" (chromaSync was null at construction — explicit-
uninitialized config) preserved as legitimate empty-result state
per plan text. ChromaSearchStrategy.search no longer wraps in
try/catch — errors propagate.
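
A sketch of the new error type; the AppError base here is a stand-in
inferred from the commit text, not the real class:

```typescript
// Stand-in base class; the real AppError lives elsewhere in the tree.
class AppError extends Error {
  constructor(
    public readonly statusCode: number,
    public readonly code: string,
    message?: string,
  ) {
    super(message ?? code);
    this.name = new.target.name;
  }
}

// Fail-fast: a runtime Chroma failure surfaces as HTTP 503 instead of
// silently falling back to SQLite.
class ChromaUnavailableError extends AppError {
  constructor() {
    super(503, "CHROMA_UNAVAILABLE", "Chroma vector store unreachable");
  }
}
```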

Phase 6 — Delete HybridSearchStrategy three try/catch silent
fallback blocks (findByConcept, findByType, findByFile) at lines
~82-95, ~120-132, ~161-172. Removed `fellBack` field from
StrategySearchResult type and every return site
(SQLiteSearchStrategy, BaseSearchStrategy.emptyResult,
SearchOrchestrator).

Tests updated (Principle 7 — delete in same PR):
  - search-orchestrator.test.ts: "fall back to SQLite" rewritten
    as "throw ChromaUnavailableError (HTTP 503)".
  - chroma/hybrid/sqlite-search-strategy tests: rewritten to
    rejects.toThrow; removed fellBack assertions.

Verification: SearchManager.findBy → 0; fellBack → 0 in src/.
bun test tests/worker/search/ → 122 pass, 0 fail.
bun test (suite-wide) → 1407 pass, baseline maintained, 0 new
failures. bun run build succeeds.

Plan: PATHFINDER-2026-04-22/04-read-path.md (Phases 3, 5, 6)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: land PATHFINDER Plan 03 — ingestion path

Fail-fast parser, direct in-process ingest, recursive fs.watch,
DB-backed tool pairing. Worker-internal HTTP loopback eliminated.

- Phase 0: Created src/services/worker/http/shared.ts exporting
  ingestObservation/ingestPrompt/ingestSummary as direct
  in-process functions plus ingestEventBus (Node EventEmitter,
  reusing existing pattern — no third event bus introduced).
  setIngestContext wires the SessionManager dependency from
  worker-service constructor.
- Phase 1: src/sdk/parser.ts collapsed to one parseAgentXml
  returning { valid:true; kind: 'observation'|'summary'; data }
  | { valid:false; reason: string }. Inspects root element;
  <skip_summary reason="…"/> is a first-class summary case
  with skipped:true. NEVER returns undefined. NEVER coerces.
- Phase 2: ResponseProcessor calls parseAgentXml exactly once,
  branches on the discriminated union. On invalid → markFailed
  + logger.warn(reason). On observation → ingestObservation.
  On summary → ingestSummary then emit summaryStoredEvent
  { sessionId, messageId } (consumed by Plan 05's blocking
  /api/session/end).
- Phase 3: Deleted consecutiveSummaryFailures field
  (ResponseProcessor + SessionManager + worker-types) and
  MAX_CONSECUTIVE_SUMMARY_FAILURES constant. Circuit-breaker
  guards and "tripped" log lines removed.
- Phase 4: coerceObservationToSummary deleted from sdk/parser.ts.
- Phase 5: src/services/transcripts/watcher.ts rescan setInterval
  replaced with fs.watch(transcriptsRoot, { recursive: true,
  persistent: true }) — Node 20+ recursive mode.
- Phase 6: src/services/transcripts/processor.ts pendingTools
  Map deleted. tool_use rows insert with INSERT OR IGNORE on
  UNIQUE(session_id, tool_use_id) (added by Plan 01). New
  pairToolUsesByJoin query in PendingMessageStore for read-time
  pairing (UNIQUE INDEX provides idempotency; explicit consumer
  not yet wired).
- Phase 7: HTTP loopback at processor.ts:252 replaced with
  direct ingestObservation call. maybeParseJson silent-passthrough
  rewritten to fail-fast (throws on malformed JSON).
- Phase 8: src/utils/tag-stripping.ts countTags + stripTagsInternal
  collapsed into one alternation regex, single-pass over input.
- Phase 9: src/utils/transcript-parser.ts (dead TranscriptParser
  class) deleted. The active extractLastMessage at
  src/shared/transcript-parser.ts:41-144 is the sole survivor.
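
The Phase 1 contract in sketch form (a toy parser, only to show the
discriminated-union shape; the real parseAgentXml inspects actual XML):

```typescript
// Result shape per the commit text: never undefined, never coerced.
type ParseResult =
  | { valid: true; kind: "observation" | "summary"; data: { skipped?: boolean } }
  | { valid: false; reason: string };

// Toy stand-in that branches on the root element only.
function parseAgentXmlSketch(xml: string): ParseResult {
  const root = /^<\s*([a-z_]+)/.exec(xml.trim())?.[1];
  if (root === "observation") return { valid: true, kind: "observation", data: {} };
  if (root === "summary") return { valid: true, kind: "summary", data: {} };
  // <skip_summary/> is a first-class summary case, not a coercion.
  if (root === "skip_summary") return { valid: true, kind: "summary", data: { skipped: true } };
  return { valid: false, reason: `unrecognized root element: ${root ?? "none"}` };
}
```

Callers branch on the union exactly once, which is what lets
ResponseProcessor (Phase 2) stay a single switch.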

Tests updated (Principle 7 — same-PR delete):
  - tests/sdk/parser.test.ts + parse-summary.test.ts: rewritten
    to assert discriminated-union shape; coercion-specific
    scenarios collapse into { valid:false } assertions.
  - tests/worker/agents/response-processor.test.ts: circuit-breaker
    describe block skipped; non-XML/empty-response tests assert
    fail-fast markFailed behavior.

Verification: every grep returns 0. transcript-parser.ts deleted.
bun run build succeeds. bun test → 1399 pass / 28 fail / 7 skip
(net -8 pass = the 4 retired circuit-breaker tests + 4 collapsed
parser cases). Zero new failures vs baseline.

Deferred (out of Plan 03 scope, will land in Plan 06): SessionRoutes
HTTP route handlers still call sessionManager.queueObservation
inline rather than the new shared helpers — the helpers are ready,
the route swap is mechanical and belongs with the Zod refactor.

Plan: PATHFINDER-2026-04-22/03-ingestion-path.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: land PATHFINDER Plan 05 — hook surface

Worker-call plumbing collapsed to one helper. Polling replaced by
server-side blocking endpoint. Fail-loud counter surfaces persistent
worker outages via exit code 2.

- Phase 1: plugin/hooks/hooks.json — three 20-iteration `for i in
  1..20; do curl -sf .../health && break; sleep 0.1; done` shell
  retry wrappers deleted. Hook commands invoke their bun entry
  point directly.
- Phase 2: src/shared/worker-utils.ts — added
  executeWithWorkerFallback<T>(url, method, body) returning
  T | { continue: true; reason?: string }. All 8 hook handlers
  (observation, session-init, context, file-context, file-edit,
  summarize, session-complete, user-message) rewritten to use
  it instead of duplicating the ensureWorkerRunning →
  workerHttpRequest → fallback sequence.
- Phase 3: blocking POST /api/session/end in SessionRoutes.ts
  using validateBody + sessionEndSchema (z.object({sessionId})).
  One-shot ingestEventBus.on('summaryStoredEvent') listener,
  30 s timer, req.aborted handler — all share one cleanup so
  the listener cannot leak. summarize.ts polling loop, plus
  MAX_WAIT_FOR_SUMMARY_MS / POLL_INTERVAL_MS constants, deleted.
- Phase 4: src/shared/hook-settings.ts — loadFromFileOnce()
  memoizes SettingsDefaultsManager.loadFromFile per process.
  Per-handler settings reads collapsed.
- Phase 5: src/shared/should-track-project.ts — single exclusion
  check entry; isProjectExcluded no longer referenced from
  src/cli/handlers/.
- Phase 6: cwd validation pushed into adapter normalizeInput
  (6 adapters, among them claude-code, cursor, raw, gemini-cli,
  and windsurf). New AdapterRejectedInput error in
  src/cli/adapters/errors.ts. Handler-level isValidCwd checks
  deleted from file-edit.ts and observation.ts. hook-command.ts
  catches AdapterRejectedInput → graceful fallback.
- Phase 7: session-init.ts conditional initAgent guard deleted;
  initAgent is idempotent. tests/hooks/context-reinjection-guard
  test (validated the deleted conditional) deleted in same PR
  per Principle 7.
- Phase 8: fail-loud counter at
  ~/.claude-mem/state/hook-failures.json. Atomic write via
  .tmp + rename. CLAUDE_MEM_HOOK_FAIL_LOUD_THRESHOLD setting
  (default 3). On consecutive worker-unreachable ≥ N:
  process.exit(2). On success: reset to 0. NOT a retry.
- Phase 9: ensureWorkerAliveOnce() module-scope memoization
  wrapping ensureWorkerRunning. executeWithWorkerFallback calls
  the memoized version.
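
The Phase 3 one-cleanup pattern, sketched against a bare EventEmitter
(names are illustrative; the real endpoint also routes req.aborted
through the same cleanup):

```typescript
import { EventEmitter } from "node:events";

// One listener, one timer, one shared cleanup: nothing can leak.
function waitForSummaryStored(
  bus: EventEmitter,
  sessionId: string,
  timeoutMs: number,
): Promise<{ ok: boolean }> {
  return new Promise((resolve) => {
    const onStored = (evt: { sessionId: string }) => {
      if (evt.sessionId !== sessionId) return;
      cleanup();
      resolve({ ok: true });
    };
    const timer = setTimeout(() => {
      cleanup();
      resolve({ ok: false }); // surfaces as a timeout to the hook
    }, timeoutMs);
    function cleanup() {
      clearTimeout(timer);
      bus.off("summaryStoredEvent", onStored);
    }
    bus.on("summaryStoredEvent", onStored);
  });
}
```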

Minimal validateBody middleware stub at
src/services/worker/http/middleware/validateBody.ts. Plan 06 will
expand with typed inference + error envelope conventions.

Verification: 4/4 grep targets pass. bun run build succeeds.
bun test → 1393 pass / 28 fail / 7 skip; -6 pass attributable
solely to deleted context-reinjection-guard test file. Zero new
failures vs baseline.

Plan: PATHFINDER-2026-04-22/05-hook-surface.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: land PATHFINDER Plan 06 — API surface

One Zod-based validator wrapping every POST/PUT. Rate limiter,
diagnostic endpoints, and shutdown wrappers deleted.
Failure-marking consolidated to one helper.

- Phase 1 (preflight): zod@^3 already installed.
- Phase 2: validateBody middleware confirmed at canonical shape
  in src/services/worker/http/middleware/validateBody.ts —
  safeParse → 400 { error: 'ValidationError', issues: [...] }
  on failure, replaces req.body with parsed value on success.
- Phase 3: Per-route Zod schemas declared at the top of each
  route file. 24 POST endpoints across SessionRoutes,
  CorpusRoutes, DataRoutes, MemoryRoutes, SearchRoutes,
  LogsRoutes, SettingsRoutes now wrap with validateBody().
  /api/session/end (Plan 05) confirmed using same middleware.
- Phase 4: validateRequired() deleted from BaseRouteHandler
  along with every call site. Inline coercion helpers
  (coerceStringArray, coercePositiveInteger) and inline
  if (!req.body...) guards deleted across all route files.
- Phase 5: Rate limiter middleware and its registration deleted
  from src/services/worker/http/middleware.ts. Worker binds
  127.0.0.1:37777 — no untrusted caller.
- Phase 6: viewer.html cached at module init in ViewerRoutes.ts
  via fs.readFileSync; served as Buffer with text/html content
  type. SKILL.md + per-operation .md files cached in
  Server.ts as Map<string, string>; loadInstructionContent
  helper deleted. NO fs.watch, NO TTL — process restart is the
  cache-invalidation event.
- Phase 7: Four diagnostic endpoints deleted from DataRoutes.ts
  — /api/pending-queue (GET), /api/pending-queue/process (POST),
  /api/pending-queue/failed (DELETE), /api/pending-queue/all
  (DELETE). Helper methods that ONLY served them
  (getQueueMessages, getStuckCount, getRecentlyProcessed,
  clearFailed, clearAll) deleted from PendingMessageStore.
  KEPT: /api/processing-status (observability), /health
  (used by ensureWorkerRunning).
- Phase 8: stopSupervisor wrapper deleted from supervisor/index.ts.
  GracefulShutdown now calls getSupervisor().stop() directly.
  Two functions retained with clear roles:
    - performGracefulShutdown — worker-side 6-step shutdown
    - runShutdownCascade — supervisor-side child teardown
      (process.kill(-pgid), Windows tree-kill, PID-file cleanup)
  Each has unique non-trivial logic and a single canonical caller.
- Phase 9: transitionMessagesTo(status, filter) is the sole
  failure-marking path on PendingMessageStore. Old methods
  markSessionMessagesFailed and markAllSessionMessagesAbandoned
  deleted along with all callers (worker-service,
  SessionCompletionHandler, tests/zombie-prevention).

Tests updated (Principle 7 same-PR delete): coercion test files
refactored to chain validateBody → handler. Zombie-prevention
tests rewritten to call transitionMessagesTo.

Verification: all 4 grep targets → 0. bun run build succeeds.
bun test → 1393 pass / 28 fail / 7 skip — exact match to
baseline. Zero new failures.

Plan: PATHFINDER-2026-04-22/06-api-surface.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: land PATHFINDER Plan 07 — dead code sweep

ts-prune-driven sweep across the tree after Plans 01-06 landed.
Deleted unused exports, orphan helpers, and one fully orphaned
file. Earlier-plan deletions verified.

Deleted:
- src/utils/bun-path.ts (entire file — getBunPath, getBunPathOrThrow,
  isBunAvailable: zero importers)
- bun-resolver.getBunVersionString: zero callers
- PendingMessageStore.retryMessage / resetProcessingToPending /
  abortMessage: superseded by transitionMessagesTo (Plan 06 Phase 9)
- EnvManager.MANAGED_CREDENTIAL_KEYS, EnvManager.setCredential:
  zero callers
- CodexCliInstaller.checkCodexCliStatus: zero callers; no status
  command exists in npx-cli
- Two "REMOVED: cleanupOrphanedSessions" stale-fence comments

Kept (with documented justification):
- Public API surface in dist/sdk/* (parseAgentXml, prompt
  builders, ParsedObservation, ParsedSummary, ParseResult,
  SUMMARY_MODE_MARKER) — exported via package.json sdk path.
- generateContext / loadContextConfig / token utilities — used
  via dynamic await import('../../../context-generator.js') in
  worker SearchRoutes.
- MCP_IDE_INSTALLERS, install/uninstall functions for codex/goose
  — used via dynamic await import in npx-cli/install.ts +
  uninstall.ts (ts-prune cannot trace dynamic imports).
- getExistingChromaIds — active caller in
  ChromaSync.backfillMissingSyncs (Plan 04 narrowed scope).
- processPendingQueues / getSessionsWithPendingMessages — active
  orphan-recovery caller in worker-service.ts plus
  zombie-prevention test coverage.
- StoreAndMarkCompleteResult legacy alias — return-type annotation
  in same file.
- All Database.ts barrel re-exports — used downstream.

Earlier-plan verification:
- Plan 03 Phase 9: VERIFIED — src/utils/transcript-parser.ts
  is gone; TranscriptParser has 0 references in src/.
- Plan 01 Phase 8: VERIFIED — migration 19 no-op absorbed.
- SessionStore.ts:52-70 consolidation NOT executed (deferred):
  the methods are not thin wrappers but ~900 LoC of bodies, and
  two methods are documented as intentional mirrors so the
  context-generator.cjs bundle stays schema-consistent without
  pulling MigrationRunner. Deserves its own plan, not a sweep.

Verification: TranscriptParser → 0; transcript-parser.ts → gone;
no commented-out code markers remain. bun run build succeeds.
bun test → 1393 pass / 28 fail / 7 skip — EXACT match to
baseline. Zero regressions.

Plan: PATHFINDER-2026-04-22/07-dead-code.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: remove residual ProcessRegistry comment reference

Plan 07 dead-code sweep missed one comment-level reference to the
deleted in-memory ProcessRegistry class in SessionManager.ts:347.
Rewritten to describe the supervisor.json scope without naming the
deleted class, completing the verification grep target.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Greptile review (P1 + 2× P2)

P1 — Plan 05 Phase 3 blocking endpoint was non-functional:
executeWithWorkerFallback used HEALTH_CHECK_TIMEOUT_MS (3 s) for
the POST /api/session/end call, but the server holds the
connection for SERVER_SIDE_SUMMARY_TIMEOUT_MS (30 s). Client
always raced to a "timed out" rejection that isWorkerUnavailable
classified as worker-unreachable, so the hook silently degraded
instead of waiting for summaryStoredEvent.
  - Added optional timeoutMs to executeWithWorkerFallback,
    forwarded to workerHttpRequest.
  - summarize.ts call site now passes 35_000 (5 s above server
    hold window).

P2 — ingestSummary({ kind: 'parsed' }) branch was dead code:
ResponseProcessor emitted summaryStoredEvent directly via the
event bus, bypassing the centralized helper that the comment
claimed was the single source.
  - ResponseProcessor now calls ingestSummary({ kind: 'parsed',
    sessionDbId, messageId, contentSessionId, parsed }) so the
    event-emission path is single-sourced.
  - ingestSummary's requireContext() resolution moved inside the
    'queue' branch (the only branch that needs sessionManager /
    dbManager). 'parsed' is a pure event-bus emission and
    doesn't need worker-internal context — fixes mocked
    ResponseProcessor unit tests that don't call
    setIngestContext.

P2 — isWorkerFallback could false-positive on legitimate API
responses whose schema includes { continue: true, ... }:
  - Added a Symbol.for('claude-mem/worker-fallback') brand to
    WorkerFallback. isWorkerFallback now checks the brand, not
    a duck-typed property name.
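
The brand fix in sketch form (helper names are illustrative):

```typescript
const FALLBACK_BRAND = Symbol.for("claude-mem/worker-fallback");

type WorkerFallback = { continue: true; reason?: string };

function makeWorkerFallback(reason?: string): WorkerFallback {
  // Brand with a symbol key: JSON.parse output can never carry it, so
  // a legitimate API response shaped { continue: true } cannot match.
  return Object.assign({ continue: true as const, reason }, { [FALLBACK_BRAND]: true });
}

function isWorkerFallback(value: unknown): boolean {
  return (
    typeof value === "object" &&
    value !== null &&
    (value as Record<symbol, unknown>)[FALLBACK_BRAND] === true
  );
}
```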

Verification: bun run build succeeds. bun test → 1393 pass /
28 fail / 7 skip — exact baseline match. Zero new failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Greptile iteration 2 (P1 + P2)

P1 — summaryStoredEvent fired regardless of whether the row was
persisted. ResponseProcessor's call to ingestSummary({ kind:
'parsed' }) ran for every parsed.kind === 'summary' even when
result.summaryId came back null (e.g. FK violation, null
memory_session_id at commit). The blocking /api/session/end
endpoint then returned { ok: true } and the Stop hook logged
'Summary stored' for a non-existent row.

  - Gate ingestSummary call on (parsed.data.skipped ||
    session.lastSummaryStored). Skipped summaries are an explicit
    no-op bypass and still confirm; real summaries only confirm
    when storage actually wrote a row.
  - Non-skipped + summaryId === null path logs a warn and lets
    the server-side timeout (504) surface to the hook instead of
    a false ok:true.

P2 — PendingMessageStore.enqueue() returns 0 when INSERT OR
IGNORE suppresses a duplicate (the UNIQUE(session_id, tool_use_id)
constraint added by Plan 01 Phase 1). The two callers
(SessionManager.queueObservation and queueSummarize) previously
logged 'ENQUEUED messageId=0' which read like a row was inserted.

  - Branch on messageId === 0 and emit a 'DUP_SUPPRESSED' debug
    log instead of the misleading ENQUEUED line. No behavior
    change — the duplicate is still correctly suppressed by the
    DB (Principle 3); only the log surface is corrected.
  - confirmProcessed is never called with the enqueue() return
    value (it operates on session.processingMessageIds[] from
    claimNextMessage), so no caller is broken; the visibility
    fix prevents future misuse.
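
The corrected log surface is just a branch on the sentinel (function
and log names are illustrative):

```typescript
// Per the description above, enqueue() yields 0 when INSERT OR IGNORE
// suppressed a duplicate, i.e. no row was written.
function enqueueLogLine(messageId: number, sessionId: string): string {
  return messageId === 0
    ? `DUP_SUPPRESSED session=${sessionId}` // collapsed by the UNIQUE constraint
    : `ENQUEUED messageId=${messageId} session=${sessionId}`;
}
```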

Verification: bun run build succeeds. bun test → 1393 pass /
28 fail / 7 skip — exact baseline match. Zero new failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Greptile iteration 3 (P1 + 2× P2)

- P1 worker-service.ts: wire ensureGeneratorRunning into the ingest
  context after SessionRoutes is constructed. setIngestContext runs
  before routes exist, so transcript-watcher observations queued via
  ingestObservation() had no way to auto-start the SDK generator.
  Added attachIngestGeneratorStarter() to patch the callback in.
- P2 shared.ts: IngestEventBus now sets maxListeners to 0. Concurrent
  /api/session/end calls register one listener each and clean up on
  completion, so the default limit of 10 would otherwise emit spurious
  MaxListenersExceededWarning noise under normal load.
- P2 SessionRoutes.ts: handleObservationsByClaudeId now delegates to
  ingestObservation() instead of duplicating skip-tool / meta /
  privacy / queue logic. Single helper, matching the Plan 03 goal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Greptile iteration 4 (P1 tool-pair + P2 parse/path/doc)

- processor.handleToolResult: restore in-memory tool-use→tool-result
  pairing via session.pendingTools for schemas (e.g. Codex) whose
  tool_result events carry only tool_use_id + output. Without this,
  neither handler fired — all tool observations silently dropped.
- processor.maybeParseJson: return raw string on parse failure instead
  of throwing. Previously a single malformed JSON-shaped field caused
  handleLine's outer catch to discard the entire transcript line.
- watcher.deepestNonGlobAncestor: split on / and \\, emit empty string
  for purely-glob inputs so the caller skips the watch instead of
  anchoring fs.watch at the filesystem root. Windows-compatible.
- PendingMessageStore.enqueue: tighten docstring — callers today only
  log on the returned id; the SessionManager branches on id === 0.

* fix: forward tool_use_id through ingestObservation (Greptile iter 5)

P1 — Plan 01's UNIQUE(content_session_id, tool_use_id) dedup never
fired because the new shared ingest path dropped the toolUseId before
queueObservation. SQLite treats NULL values as distinct for UNIQUE,
so every replayed transcript line landed a duplicate row.

- shared.ingestObservation: forward payload.toolUseId to
  queueObservation so INSERT OR IGNORE can actually collapse.
- SessionRoutes.handleObservationsByClaudeId: destructure both
  tool_use_id (HTTP convention) and toolUseId (JS convention) from
  req.body and pass into ingestObservation.
- observationsByClaudeIdSchema: declare both keys explicitly so the
  validator doesn't rely on .passthrough() alone.

* fix: drop dead pairToolUsesByJoin, close session-end listener race

- PendingMessageStore: delete pairToolUsesByJoin. The method was never
  called and its self-join semantics are structurally incompatible
  with UNIQUE(content_session_id, tool_use_id): INSERT OR IGNORE
  collapses any second row with the same pair, so a self-join can
  only ever match a row to itself. In-memory pendingTools in
  processor.ts remains the pairing path for split-event schemas.

- IngestEventBus: retain a short-lived (60s) recentStored map keyed
  by sessionId. Populated on summaryStoredEvent emit, evicted on
  consume or TTL.

- handleSessionEnd: drain the recent-events buffer before attaching
  the listener. Closes the register-after-emit race where the summary
  can persist between the hook's summarize POST and its session/end
  POST — previously that window returned 504 after the 30s timeout.
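
A minimal model of the recent-events buffer (class name illustrative;
the real bus also evicts on a timer):

```typescript
// Emits are remembered for a short TTL so a listener that attaches
// after the event fired can still consume it.
class RecentEventBuffer {
  private recent = new Map<string, number>(); // sessionId -> emit time
  constructor(private readonly ttlMs = 60_000) {}

  record(sessionId: string, now: number): void {
    this.recent.set(sessionId, now);
  }

  // Drain before attaching the session/end listener: true means a
  // summaryStoredEvent already fired within the TTL.
  take(sessionId: string, now: number): boolean {
    const at = this.recent.get(sessionId);
    if (at === undefined || now - at > this.ttlMs) return false;
    this.recent.delete(sessionId); // evicted on consume
    return true;
  }
}
```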

* chore: merge origin/main into vivacious-teeth

Resolves conflicts with 15 commits on main (v12.3.9, security
observation types, Telegram notifier, PID-reuse worker start-guard).

Conflict resolution strategy:
- plugin/hooks/hooks.json, plugin/scripts/*.cjs, plugin/ui/viewer-bundle.js:
  kept ours — PATHFINDER Plan 05 deletes the for-i-in-1-to-20 curl retry
  loops and the built artifacts regenerate on build.
- src/cli/handlers/summarize.ts: kept ours — Plan 05 blocking
  POST /api/session/end supersedes main's fire-and-forget path.
- src/services/worker-service.ts: kept ours — Plan 05 ingest bus +
  summaryStoredEvent supersedes main's SessionCompletionHandler DI
  refactor + orphan-reaper fallback.
- src/services/worker/http/routes/SessionRoutes.ts: kept ours — same
  reason; generator .finally() Stop-hook self-clean is a guard for a
  path our blocking endpoint removes.
- src/services/worker/http/routes/CorpusRoutes.ts: merged — added
  security_alert / security_note to ALLOWED_CORPUS_TYPES (feature from
  #2084) while preserving our Zod validateBody schema.

Typecheck: 294 errors (vs 298 pre-merge). No new errors introduced; all
remaining are pre-existing (Component-enum gaps, DOM lib for viewer,
bun:sqlite types).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Greptile P2 findings

1) SessionRoutes.handleSessionEnd was the only route handler not wrapped
   in wrapHandler — synchronous exceptions would hang the client rather
   than surfacing as 500s. Wrap it like every other handler.

2) processor.handleToolResult only consumed the session.pendingTools
   entry when the tool_result arrived without a toolName. In the
   split-schema path where tool_result carries both toolName and toolId,
   the entry was never deleted and the map grew for the life of the
   session. Consume the entry whenever toolId is present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: typing cleanup and viewer tsconfig split for PR feedback

- Add explicit return types for SessionStore query methods
- Exclude src/ui/viewer from root tsconfig, give it its own DOM-typed config
- Add bun to root tsconfig types, plus misc typing tweaks flagged by Greptile
- Rebuilt plugin/scripts/* artifacts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Greptile P2 findings (iter 2)

- PendingMessageStore.transitionMessagesTo: require sessionDbId (drop
  the unscoped-drain branch that would nuke every pending/processing
  row across all sessions if a future caller omitted the filter).
- IngestEventBus.takeRecentSummaryStored: make idempotent — keep the
  cached event until TTL eviction so a retried Stop hook's second
  /api/session/end returns immediately instead of hanging 30 s.
- TranscriptWatcher fs.watch callback: skip full glob scan for paths
  already tailed (JSONL appends fire on every line; only unknown
  paths warrant a rescan).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: call finalizeSession in terminal session paths (Greptile iter 3)

terminateSession and runFallbackForTerminatedSession previously called
SessionCompletionHandler.finalizeSession before removeSessionImmediate;
the refactor dropped those calls, leaving sdk_sessions.status='active'
for every session killed by wall-clock limit, unrecoverable error, or
exhausted fallback chain. The deleted reapStaleSessions interval was
the only prior backstop.

Re-wires finalizeSession (idempotent: marks completed, drains pending,
broadcasts) into both paths; no reaper reintroduced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: GC failed pending_messages rows at startup (Greptile iter 4)

Plan 07 deleted clearFailed/clearFailedOlderThan as "dead code", but
with the periodic sweep also removed, nothing reaps status='failed'
rows now — they accumulate indefinitely. Since claimNextMessage's
self-healing subquery scans this table, unbounded growth degrades
claim latency over time.

Re-introduces clearFailedOlderThan and calls it once at worker startup
(not a reaper — one-shot, idempotent). 7-day retention keeps enough
history for operator inspection while bounding the table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: finalize sessions on normal exit; cleanup hoist; share handler (iter 5)

1. startSessionProcessor success branch now calls completionHandler.
   finalizeSession before removeSessionImmediate. Hooks-disabled installs
   (and any Stop hook that fails before POST /api/sessions/complete) no
   longer leave sdk_sessions rows as status='active' forever. Idempotent
   — a subsequent /api/sessions/complete is a no-op.

2. Hoist SessionRoutes.handleSessionEnd cleanup declaration above the
   closures that reference it (TDZ safety; safe at runtime today but
   fragile if timeout ever shrinks).

3. SessionRoutes now receives WorkerService's shared SessionCompletionHandler
   instead of constructing its own — prevents silent divergence if the
   handler ever becomes stateful.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: stop runaway crash-recovery loop on dead sessions

Two distinct bugs combined to keep a dead session restarting forever:

Bug 1 (uncaught "The operation was aborted."):
  child_process.spawn emits 'error' asynchronously for ENOENT/EACCES/abort
  signal aborts. spawnSdkProcess() never attached an 'error' listener, so
  any async spawn failure became uncaughtException and escaped to the
  daemon-level handler. Attach an 'error' listener immediately after spawn,
  before the !child.pid early-return, so async spawn errors are logged
  (with errno code) and swallowed locally.
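The listener placement matters more than the listener itself; a minimal sketch (function name from the commit, logging shape assumed):

```typescript
import { spawn, type ChildProcess } from 'node:child_process';

// Sketch of the fix: the 'error' listener is attached immediately, before the
// !child.pid early return, so async spawn failures (ENOENT/EACCES/abort)
// are handled here instead of escaping as uncaughtException.
function spawnSdkProcess(cmd: string, args: string[]): ChildProcess | null {
  const child = spawn(cmd, args);
  child.on('error', (err: NodeJS.ErrnoException) => {
    // Logged with errno code and swallowed locally.
    console.warn(`sdk spawn failed: ${err.code ?? err.message}`);
  });
  if (!child.pid) return null;
  return child;
}
```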

Bug 2 (sliding-window limiter never trips on slow restart cadence):
  RestartGuard tripped only when restartTimestamps.length exceeded
  MAX_WINDOWED_RESTARTS (10) within RESTART_WINDOW_MS (60s). With the 8s
  exponential-backoff cap, only ~7-8 restarts fit in the window, so a dead
  session failing and restarting on 8s cycles would loop forever
  (consecutiveRestarts climbing past 30+ in observed logs). Add a
  consecutiveFailures counter that increments on every restart and resets
  only on recordSuccess(). Trip when consecutive failures exceed
  MAX_CONSECUTIVE_FAILURES (5) — meaning 5 restarts with zero successful
  processing in between proves the session is dead. Both guards now run in
  parallel: tight loops still trip the windowed cap; slow loops trip the
  consecutive-failure cap.
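The dual-guard limiter can be sketched as below — constants and method names follow the commit text, but the class internals are assumed, not the repo's actual implementation:

```typescript
// Sketch of the dual-guard restart limiter described above (details assumed).
const RESTART_WINDOW_MS = 60_000;
const MAX_WINDOWED_RESTARTS = 10;
const MAX_CONSECUTIVE_FAILURES = 5;

class RestartGuard {
  private restartTimestamps: number[] = [];
  private consecutiveFailures = 0;

  recordRestart(now: number): void {
    this.restartTimestamps.push(now);
    this.consecutiveFailures++;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0; // only real processing resets the counter
  }

  tripped(now: number): boolean {
    this.restartTimestamps = this.restartTimestamps.filter(
      (t) => now - t < RESTART_WINDOW_MS,
    );
    // Tight loops trip the windowed cap; slow 8s-backoff loops trip the
    // consecutive-failure cap. Either one marks the session dead.
    return (
      this.restartTimestamps.length > MAX_WINDOWED_RESTARTS ||
      this.consecutiveFailures > MAX_CONSECUTIVE_FAILURES
    );
  }
}
```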

Also: when the SessionRoutes path trips the guard, drain pending messages
to 'abandoned' so the session does not reappear in
getSessionsWithPendingMessages and trigger another auto-start cycle. The
worker-service.ts path already does this via terminateSession.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf: streamline worker startup and consolidate database connections

1. Database Pooling: Modified DatabaseManager, SessionStore, and SessionSearch to share a single bun:sqlite connection, eliminating redundant file descriptors.
2. Non-blocking Startup: Refactored WorktreeAdoption and Chroma backfill to run in the background (fire-and-forget), preventing them from stalling core initialization.
3. Diagnostic Routes: Added /api/chroma/status and bypassed the initialization guard for health/readiness endpoints to allow diagnostics during startup.
4. Robust Search: Implemented reliable SQLite FTS5 fallback in SearchManager for when Chroma (uvx) fails or is unavailable.
5. Code Cleanup: Removed redundant loopback MCP checks and cleaned up mangled initialization logic in WorkerService.
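Point 1 amounts to constructor injection of a single handle. A minimal sketch under assumed shapes — `SqliteHandle` stands in for the bun:sqlite `Database` type, and the class internals are illustrative:

```typescript
// Sketch of connection consolidation: the stores take an injected handle
// instead of each opening the database file, so the process holds one fd.
interface SqliteHandle {
  query(sql: string): unknown;
}

class DatabaseManager {
  constructor(public readonly db: SqliteHandle) {}
}

class SessionStore {
  constructor(public readonly db: SqliteHandle) {}
}

class SessionSearch {
  constructor(public readonly db: SqliteHandle) {}
}

function buildStores(db: SqliteHandle) {
  const manager = new DatabaseManager(db);
  // Both consumers reuse the manager's handle; neither reopens the file.
  return {
    manager,
    store: new SessionStore(manager.db),
    search: new SessionSearch(manager.db),
  };
}
```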

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: hard-exclude observer-sessions from hooks; bundle migration 29 (#2124)

* fix: hard-exclude observer-sessions from hooks; backfill bundle migrations

Stop hook + SessionEnd hook were storing the SDK observer's own
init/continuation/summary prompts in user_prompts, leaking into the
viewer (meta-observation regression). 25 such rows accumulated.

- shouldTrackProject: hard-reject OBSERVER_SESSIONS_DIR (and its subtree)
  before consulting user-configured exclusion globs.
- summarize.ts (Stop) and session-complete.ts (SessionEnd): early-return
  when shouldTrackProject(cwd) is false, so the observer's own hooks
  cannot bootstrap the worker or queue a summary against the meta-session.
- SessionRoutes: cap user-prompt body at 256 KiB at the session-init
  boundary so a runaway observer prompt cannot blow up storage.
- SessionStore: add migration 29 (UNIQUE(memory_session_id, content_hash)
  on observations) inline so bundled artifacts (worker-service.cjs,
  context-generator.cjs) stay schema-consistent — without it, the
  ON CONFLICT clause in observation inserts throws.
- spawnSdkProcess: stdio[stdin] from 'ignore' to 'pipe' so the
  supervisor can actually feed the observer's stdin.

Also rebuilds plugin/scripts/{worker-service,context-generator}.cjs.
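The migration-29 dependency can be sketched as paired SQL statements. The uniqueness target (memory_session_id, content_hash) comes from the commit; the index name and other column names are illustrative assumptions:

```typescript
// Assumed shape of migration 29 and the insert it unblocks.
const MIGRATION_29 = `
  CREATE UNIQUE INDEX IF NOT EXISTS idx_observations_session_hash
    ON observations (memory_session_id, content_hash)
`;

// Without the index above, SQLite rejects this conflict target when the
// statement is prepared ("ON CONFLICT clause does not match any PRIMARY KEY
// or unique constraint") — the throw the bundled artifacts were hitting.
const INSERT_OBSERVATION = `
  INSERT INTO observations (memory_session_id, content_hash, content)
  VALUES (?, ?, ?)
  ON CONFLICT (memory_session_id, content_hash) DO NOTHING
`;
```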

* fix: walk back to UTF-8 boundary on prompt truncation (Greptile P2)

Plain Buffer.subarray at MAX_USER_PROMPT_BYTES can land mid-codepoint,
which the utf8 decoder silently rewrites to U+FFFD. Walk back over any
continuation bytes (0b10xxxxxx) before decoding so the truncated prompt
ends on a valid sequence boundary instead of a replacement character.
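The walk-back is a few lines; a sketch of the technique (the constant name is from PR #2124, the function name is hypothetical):

```typescript
const MAX_USER_PROMPT_BYTES = 256 * 1024;

// Truncate to at most maxBytes without splitting a UTF-8 codepoint: walk back
// over continuation bytes (0b10xxxxxx) so the cut lands on a lead byte, then
// decode. A mid-codepoint cut would decode to U+FFFD.
function truncateUtf8Safe(prompt: string, maxBytes = MAX_USER_PROMPT_BYTES): string {
  const buf = Buffer.from(prompt, 'utf8');
  if (buf.length <= maxBytes) return prompt;
  let end = maxBytes;
  while (end > 0 && (buf[end] & 0b1100_0000) === 0b1000_0000) end--;
  return buf.subarray(0, end).toString('utf8');
}
```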

* fix: cross-platform observer-dir containment; clarify SDK stdin pipe

claude-review feedback on PR #2124.

- shouldTrackProject: literal `cwd.startsWith(OBSERVER_SESSIONS_DIR + '/')`
  hard-coded a POSIX separator and missed Windows backslash paths plus any
  trailing-slash variance. Switched to a path.relative-based isWithin()
  helper so Windows hook input under observer-sessions\\... is also excluded.
- spawnSdkProcess: added a comment explaining why stdin must be 'pipe' —
  SpawnedSdkProcess.stdin is typed NonNullable and the Claude Agent SDK
  consumes that pipe; 'ignore' would null it and the null-check below
  would tear the child down on every spawn.
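The containment helper can be sketched as below (the `isWithin` name is from the commit; the exact semantics are assumed). `path.relative` normalizes separator style, trailing slashes, and cross-drive Windows paths uniformly:

```typescript
import path from 'node:path';

// A child is inside parent iff the relative path neither climbs out with '..'
// nor lands on another drive (which yields an absolute relative path on Windows).
function isWithin(parent: string, child: string): boolean {
  const rel = path.relative(parent, child);
  if (rel === '') return true; // the directory itself counts as inside
  return (
    rel !== '..' &&
    !rel.startsWith('..' + path.sep) &&
    !path.isAbsolute(rel)
  );
}
```

Note that a naive `startsWith(dir + '/')` also misfires on sibling directories sharing the prefix (`observer-sessions-other`); the relative-path form handles that case too.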

* fix: make Stop hook fire-and-forget; remove dead /api/session/end

The Stop hook was awaiting a 35-second long-poll on /api/session/end,
which the worker held open until the summary-stored event fired (or its
30s server-side timeout elapsed), then awaiting another request to
/api/sessions/complete. Three sequential awaits, the middle one a 30s
hold — not fire-and-forget despite repeated requests.

The Stop hook now does ONE thing: POST /api/sessions/summarize to
queue the summary work and return. The worker drives the rest async.
Session-map cleanup is performed by the SessionEnd handler
(session-complete.ts), not duplicated here.

- summarize.ts: drop the /api/session/end long-poll and the trailing
  /api/sessions/complete await; ~40 lines removed; unused
  SessionEndResponse interface gone; header comment rewritten.
- SessionRoutes: delete handleSessionEnd, sessionEndSchema, the
  SERVER_SIDE_SUMMARY_TIMEOUT_MS constant, and the /api/session/end
  route registration. Drop the now-unused ingestEventBus and
  SummaryStoredEvent imports.
- ResponseProcessor + shared.ts + worker-utils.ts: update stale
  comments that referenced the dead endpoint. The IngestEventBus is
  left in place dormant (no listeners) for follow-up cleanup so this
  PR stays focused on the blocker.

Bundle artifact (worker-service.cjs) rebuilt via build-and-sync.

Verification:
- grep '/api/session/end' plugin/scripts/worker-service.cjs → 0
- grep 'timeoutMs:35' plugin/scripts/worker-service.cjs → 0
- Worker restarted clean, /api/health ok at pid 92368

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* deps: bump all dependencies to latest including majors

Upgrades: React 18→19, Express 4→5, Zod 3→4, TypeScript 5→6,
@types/node 20→25, @anthropic-ai/claude-agent-sdk 0.1→0.2,
@clack/prompts 0.9→1.2, plus minors. Adds Daily Maintenance section
to CLAUDE.md mandating latest-version policy across manifests.

Express 5 surfaced a race in Server.listen() where the 'error' handler
was attached after listen() was invoked; refactored to use
http.createServer with both 'error' and 'listening' handlers attached
before listen(), restoring port-conflict rejection semantics.
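A minimal sketch of that refactor (the wrapper name and exact shape are assumptions; an Express 5 app is still a plain request listener, so it drops in for `handler`):

```typescript
import http from 'node:http';

// Both handlers are attached before listen() is invoked, so an immediate
// EADDRINUSE rejects the promise instead of racing a handler registered
// after the fact.
function listenSafely(handler: http.RequestListener, port: number): Promise<http.Server> {
  return new Promise((resolve, reject) => {
    const server = http.createServer(handler);
    server.once('error', reject);
    server.once('listening', () => {
      server.removeListener('error', reject);
      resolve(server);
    });
    server.listen(port);
  });
}
```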

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: surface real chroma errors and add deep status probe

Replace the misleading "Vector search failed - semantic search unavailable.
Install uv... restart the worker." string in SearchManager with the actual
exception text from chroma_query_documents. The lying message blamed `uv`
for any failure — even when the real cause was a chroma-mcp transport
timeout, an empty collection, or a dead subprocess.

Also add /api/chroma/status?deep=1 backed by a new
ChromaMcpManager.probeSemanticSearch() that round-trips a real query
(chroma_list_collections + chroma_query_documents) instead of just
checking the stdio handshake. The cheap default path is unchanged.

Includes the diagnostic plan (PLAN-fix-mcp-search.md) and updated test
fixtures for the new structured failure message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: rebuild worker-service bundle to match merged src

Bundle was stale after the squash merge of #2124 — it still contained
the old "Install uv... semantic search unavailable" string and lacked
probeSemanticSearch. Rebuilt via bun run build-and-sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: address coderabbit feedback on PLAN-fix-mcp-search.md

- replace machine-specific /Users/alexnewman absolute paths with portable
  <repo-root> placeholder (MD-style portability)
- add blank lines around the TypeScript fenced block (MD031)
- tag the bare fenced block with `text` (MD040)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Alex Newman
2026-04-25 13:37:40 -07:00
committed by GitHub
parent 8ace1d9c84
commit 94d592f212
159 changed files with 18091 additions and 5843 deletions
+8 -1
@@ -1,4 +1,5 @@
import type { PlatformAdapter, NormalizedHookInput, HookResult } from '../types.js';
import { AdapterRejectedInput, isValidCwd } from './errors.js';
// Maps Claude Code stdin format (session_id, cwd, tool_name, etc.)
// SessionStart hooks receive no stdin, so we must handle undefined input gracefully
@@ -12,9 +13,15 @@ const pickAgentField = (v: unknown): string | undefined =>
export const claudeCodeAdapter: PlatformAdapter = {
normalizeInput(raw) {
const r = (raw ?? {}) as any;
// Plan 05 Phase 6 — cwd validation at the adapter boundary (single check,
// not duplicated in handlers). Falls back to process.cwd() when unset.
const cwd = r.cwd ?? process.cwd();
if (!isValidCwd(cwd)) {
throw new AdapterRejectedInput('invalid_cwd');
}
return {
sessionId: r.session_id ?? r.id ?? r.sessionId,
cwd: r.cwd ?? process.cwd(),
cwd,
prompt: r.prompt,
toolName: r.tool_name,
toolInput: r.tool_input,
+7 -1
@@ -1,4 +1,5 @@
import type { PlatformAdapter, NormalizedHookInput, HookResult } from '../types.js';
import { AdapterRejectedInput, isValidCwd } from './errors.js';
// Maps Cursor stdin format - field names differ from Claude Code
// Cursor uses: conversation_id, workspace_roots[], result_json, command/output
@@ -13,9 +14,14 @@ export const cursorAdapter: PlatformAdapter = {
const r = (raw ?? {}) as any;
// Cursor-specific: shell commands come as command/output instead of tool_name/input/response
const isShellCommand = !!r.command && !r.tool_name;
// Plan 05 Phase 6 — cwd validation at the adapter boundary.
const cwd = r.workspace_roots?.[0] ?? r.cwd ?? process.cwd();
if (!isValidCwd(cwd)) {
throw new AdapterRejectedInput('invalid_cwd');
}
return {
sessionId: r.conversation_id || r.generation_id || r.id,
cwd: r.workspace_roots?.[0] ?? r.cwd ?? process.cwd(),
cwd,
prompt: r.prompt ?? r.query ?? r.input ?? r.message,
toolName: isShellCommand ? 'Bash' : r.tool_name,
toolInput: isShellCommand ? { command: r.command } : r.tool_input,
+24
@@ -0,0 +1,24 @@
/**
* Adapter-layer rejection. Plan 05 Phase 6 (PATHFINDER-2026-04-22): cwd
* validation moves from per-handler `if (!cwd) throw …` to the adapter
* boundary. When normalization detects an invalid input, the adapter throws
* `AdapterRejectedInput`; the hook runner translates it into a graceful
* `{ continue: true }` so the user's session is never blocked by a malformed
* hook payload.
*/
export class AdapterRejectedInput extends Error {
constructor(public readonly reason: string) {
super(`adapter rejected input: ${reason}`);
this.name = 'AdapterRejectedInput';
}
}
/**
* A cwd is valid when it is a non-empty string. The adapter normalizers fall
* back to `process.cwd()` when the inbound payload omits cwd, so the only way
* this returns false is when the payload supplies `null`/`''`/non-string.
*/
export function isValidCwd(cwd: unknown): cwd is string {
return typeof cwd === 'string' && cwd.length > 0;
}
+5
@@ -1,4 +1,5 @@
import type { PlatformAdapter } from '../types.js';
import { AdapterRejectedInput, isValidCwd } from './errors.js';
/**
* Gemini CLI Platform Adapter
@@ -39,6 +40,10 @@ export const geminiCliAdapter: PlatformAdapter = {
?? process.env.GEMINI_PROJECT_DIR
?? process.env.CLAUDE_PROJECT_DIR
?? process.cwd();
// Plan 05 Phase 6 — cwd validation at the adapter boundary.
if (!isValidCwd(cwd)) {
throw new AdapterRejectedInput('invalid_cwd');
}
const sessionId = r.session_id
?? process.env.GEMINI_SESSION_ID
+8 -2
@@ -1,12 +1,18 @@
import type { PlatformAdapter, NormalizedHookInput, HookResult } from '../types.js';
import { AdapterRejectedInput, isValidCwd } from './errors.js';
// Raw adapter passes through with minimal transformation - useful for testing
export const rawAdapter: PlatformAdapter = {
normalizeInput(raw) {
const r = raw as any;
const r = (raw ?? {}) as any;
// Plan 05 Phase 6 — cwd validation at the adapter boundary.
const cwd = r.cwd ?? process.cwd();
if (!isValidCwd(cwd)) {
throw new AdapterRejectedInput('invalid_cwd');
}
return {
sessionId: r.sessionId ?? r.session_id ?? 'unknown',
cwd: r.cwd ?? process.cwd(),
cwd,
prompt: r.prompt,
toolName: r.toolName ?? r.tool_name,
toolInput: r.toolInput ?? r.tool_input,
+8 -1
@@ -1,4 +1,5 @@
import type { PlatformAdapter, NormalizedHookInput, HookResult } from '../types.js';
import { AdapterRejectedInput, isValidCwd } from './errors.js';
// Maps Windsurf stdin format — JSON envelope with agent_action_name + tool_info payload
//
@@ -17,9 +18,15 @@ export const windsurfAdapter: PlatformAdapter = {
const toolInfo = r.tool_info ?? {};
const actionName: string = r.agent_action_name ?? '';
// Plan 05 Phase 6 — cwd validation at the adapter boundary.
const cwd = toolInfo.cwd ?? process.cwd();
if (!isValidCwd(cwd)) {
throw new AdapterRejectedInput('invalid_cwd');
}
const base: NormalizedHookInput = {
sessionId: r.trajectory_id ?? r.execution_id,
cwd: toolInfo.cwd ?? process.cwd(),
cwd,
platform: 'windsurf',
};
+28 -40
@@ -6,34 +6,24 @@
*/
import type { EventHandler, NormalizedHookInput, HookResult } from '../types.js';
import { ensureWorkerRunning, getWorkerPort, workerHttpRequest } from '../../shared/worker-utils.js';
import {
executeWithWorkerFallback,
isWorkerFallback,
getWorkerPort,
} from '../../shared/worker-utils.js';
import { getProjectContext } from '../../utils/project-name.js';
import { HOOK_EXIT_CODES } from '../../shared/hook-constants.js';
import { logger } from '../../utils/logger.js';
import { SettingsDefaultsManager } from '../../shared/SettingsDefaultsManager.js';
import { USER_SETTINGS_PATH } from '../../shared/paths.js';
import { loadFromFileOnce } from '../../shared/hook-settings.js';
export const contextHandler: EventHandler = {
async execute(input: NormalizedHookInput): Promise<HookResult> {
// Ensure worker is running before any other logic
const workerReady = await ensureWorkerRunning();
if (!workerReady) {
// Worker not available - return empty context gracefully
return {
hookSpecificOutput: {
hookEventName: 'SessionStart',
additionalContext: ''
},
exitCode: HOOK_EXIT_CODES.SUCCESS
};
}
const cwd = input.cwd ?? process.cwd();
const context = getProjectContext(cwd);
const port = getWorkerPort();
// Check if terminal output should be shown (load settings early)
const settings = SettingsDefaultsManager.loadFromFile(USER_SETTINGS_PATH);
// Plan 05 Phase 4: settings via process-scope cache.
const settings = loadFromFileOnce();
const showTerminalOutput = settings.CLAUDE_MEM_CONTEXT_SHOW_TERMINAL_OUTPUT === 'true';
// Pass all projects (parent + worktree if applicable) for unified timeline
@@ -41,38 +31,36 @@ export const contextHandler: EventHandler = {
const apiPath = `/api/context/inject?projects=${encodeURIComponent(projectsParam)}`;
const colorApiPath = input.platform === 'claude-code' ? `${apiPath}&colors=true` : apiPath;
const emptyResult = {
const emptyResult: HookResult = {
hookSpecificOutput: { hookEventName: 'SessionStart', additionalContext: '' },
exitCode: HOOK_EXIT_CODES.SUCCESS
exitCode: HOOK_EXIT_CODES.SUCCESS,
};
// Note: Removed AbortSignal.timeout due to Windows Bun cleanup issue (libuv assertion)
// Worker service has its own timeouts, so client-side timeout is redundant
let response: Response;
let colorResponse: Response | null;
try {
[response, colorResponse] = await Promise.all([
workerHttpRequest(apiPath),
showTerminalOutput ? workerHttpRequest(colorApiPath).catch(() => null) : Promise.resolve(null)
]);
} catch (error) {
// Worker unreachable — return empty context gracefully
logger.warn('HOOK', 'Context fetch error, returning empty', { error: error instanceof Error ? error.message : String(error) });
// Plan 05 Phase 2: single helper for ensure-worker-alive → request → fallback.
const contextResult = await executeWithWorkerFallback<string>(apiPath, 'GET');
if (isWorkerFallback(contextResult)) {
return emptyResult;
}
if (!response.ok) {
logger.warn('HOOK', 'Context generation failed, returning empty', { status: response.status });
let additionalContext: string;
if (typeof contextResult === 'string') {
additionalContext = contextResult.trim();
} else if (contextResult === undefined) {
additionalContext = '';
} else {
// Unexpected non-string body — log and fall back to empty.
logger.warn('HOOK', 'Context response was not a string', { type: typeof contextResult });
return emptyResult;
}
const [contextResult, colorResult] = await Promise.all([
response.text(),
colorResponse?.ok ? colorResponse.text() : Promise.resolve('')
]);
let coloredTimeline = '';
if (showTerminalOutput) {
const colorResult = await executeWithWorkerFallback<string>(colorApiPath, 'GET');
if (!isWorkerFallback(colorResult) && typeof colorResult === 'string') {
coloredTimeline = colorResult.trim();
}
}
const additionalContext = contextResult.trim();
const coloredTimeline = colorResult.trim();
const platform = input.platform;
// Use colored timeline for display if available, otherwise fall back to
+15 -27
@@ -6,14 +6,12 @@
*/
import type { EventHandler, NormalizedHookInput, HookResult } from '../types.js';
import { ensureWorkerRunning, workerHttpRequest } from '../../shared/worker-utils.js';
import { executeWithWorkerFallback, isWorkerFallback } from '../../shared/worker-utils.js';
import { logger } from '../../utils/logger.js';
import { parseJsonArray } from '../../shared/timeline-formatting.js';
import { statSync } from 'fs';
import path from 'path';
import { isProjectExcluded } from '../../utils/project-filter.js';
import { SettingsDefaultsManager } from '../../shared/SettingsDefaultsManager.js';
import { USER_SETTINGS_PATH } from '../../shared/paths.js';
import { shouldTrackProject } from '../../shared/should-track-project.js';
import { getProjectContext } from '../../utils/project-name.js';
/** Skip the gate for files smaller than this — timeline overhead exceeds file read cost. */
@@ -207,19 +205,12 @@ export const fileContextHandler: EventHandler = {
logger.debug('HOOK', 'File stat failed, proceeding with gate', { error: err instanceof Error ? err.message : String(err) });
}
// Check if project is excluded from tracking
const settings = SettingsDefaultsManager.loadFromFile(USER_SETTINGS_PATH);
if (input.cwd && isProjectExcluded(input.cwd, settings.CLAUDE_MEM_EXCLUDED_PROJECTS)) {
// Plan 05 Phase 5: project exclusion via single helper.
if (input.cwd && !shouldTrackProject(input.cwd)) {
logger.debug('HOOK', 'Project excluded from tracking, skipping file context', { cwd: input.cwd });
return { continue: true, suppressOutput: true };
}
// Ensure worker is running
const workerReady = await ensureWorkerRunning();
if (!workerReady) {
return { continue: true, suppressOutput: true };
}
// Query worker for observations related to this file
const context = getProjectContext(input.cwd);
const cwd = input.cwd || process.cwd();
@@ -232,22 +223,19 @@ export const fileContextHandler: EventHandler = {
}
queryParams.set('limit', String(FETCH_LOOKAHEAD_LIMIT));
let data: { observations: ObservationRow[]; count: number };
try {
const response = await workerHttpRequest(`/api/observations/by-file?${queryParams.toString()}`, { method: 'GET' });
if (!response.ok) {
logger.warn('HOOK', 'File context query failed, skipping', { status: response.status, filePath });
return { continue: true, suppressOutput: true };
}
data = await response.json() as { observations: ObservationRow[]; count: number };
} catch (error) {
logger.warn('HOOK', 'File context fetch error, skipping', {
error: error instanceof Error ? error.message : String(error),
});
// Plan 05 Phase 2: single helper for ensure-worker-alive → request → fallback.
const result = await executeWithWorkerFallback<{ observations: ObservationRow[]; count: number }>(
`/api/observations/by-file?${queryParams.toString()}`,
'GET',
);
if (isWorkerFallback(result)) {
return { continue: true, suppressOutput: true };
}
if (!result || !Array.isArray((result as any).observations)) {
logger.warn('HOOK', 'File context query returned malformed body, skipping', { filePath });
return { continue: true, suppressOutput: true };
}
const data = result;
if (!data.observations || data.observations.length === 0) {
return { continue: true, suppressOutput: true };
+19 -40
@@ -6,35 +6,13 @@
*/
import type { EventHandler, NormalizedHookInput, HookResult } from '../types.js';
import { ensureWorkerRunning, workerHttpRequest } from '../../shared/worker-utils.js';
import { executeWithWorkerFallback, isWorkerFallback } from '../../shared/worker-utils.js';
import { logger } from '../../utils/logger.js';
import { HOOK_EXIT_CODES } from '../../shared/hook-constants.js';
import { normalizePlatformSource } from '../../shared/platform-source.js';
async function sendFileEditObservation(requestBody: string, filePath: string): Promise<void> {
const response = await workerHttpRequest('/api/sessions/observations', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: requestBody
});
if (!response.ok) {
logger.warn('HOOK', 'File edit observation storage failed, skipping', { status: response.status, filePath });
return;
}
logger.debug('HOOK', 'File edit observation sent successfully', { filePath });
}
export const fileEditHandler: EventHandler = {
async execute(input: NormalizedHookInput): Promise<HookResult> {
// Ensure worker is running before any other logic
const workerReady = await ensureWorkerRunning();
if (!workerReady) {
// Worker not available - skip file edit observation gracefully
return { continue: true, suppressOutput: true, exitCode: HOOK_EXIT_CODES.SUCCESS };
}
const { sessionId, cwd, filePath, edits } = input;
const platformSource = normalizePlatformSource(input.platform);
@@ -46,30 +24,31 @@ export const fileEditHandler: EventHandler = {
editCount: edits?.length ?? 0
});
// Validate required fields before sending to worker
// Plan 05 Phase 6: cwd is validated at the adapter boundary; this is a
// belt-and-suspenders type guard so TypeScript narrows.
if (!cwd) {
throw new Error(`Missing cwd in FileEdit hook input for session ${sessionId}, file ${filePath}`);
}
// Send to worker as an observation with file edit metadata
// The observation handler on the worker will process this appropriately
const requestBody = JSON.stringify({
contentSessionId: sessionId,
platformSource,
tool_name: 'write_file',
tool_input: { filePath, edits },
tool_response: { success: true },
cwd
});
// Plan 05 Phase 2: single helper for ensure-worker-alive → request → fallback.
const result = await executeWithWorkerFallback<{ status?: string }>(
'/api/sessions/observations',
'POST',
{
contentSessionId: sessionId,
platformSource,
tool_name: 'write_file',
tool_input: { filePath, edits },
tool_response: { success: true },
cwd,
},
);
try {
await sendFileEditObservation(requestBody, filePath);
} catch (error) {
// Worker unreachable — skip file edit observation gracefully
logger.warn('HOOK', 'File edit observation fetch error, skipping', { error: error instanceof Error ? error.message : String(error) });
if (isWorkerFallback(result)) {
return { continue: true, suppressOutput: true, exitCode: HOOK_EXIT_CODES.SUCCESS };
}
logger.debug('HOOK', 'File edit observation sent successfully', { filePath });
return { continue: true, suppressOutput: true };
}
},
};
+28 -47
@@ -5,38 +5,14 @@
*/
import type { EventHandler, NormalizedHookInput, HookResult } from '../types.js';
import { ensureWorkerRunning, workerHttpRequest } from '../../shared/worker-utils.js';
import { executeWithWorkerFallback, isWorkerFallback } from '../../shared/worker-utils.js';
import { logger } from '../../utils/logger.js';
import { HOOK_EXIT_CODES } from '../../shared/hook-constants.js';
import { isProjectExcluded } from '../../utils/project-filter.js';
import { SettingsDefaultsManager } from '../../shared/SettingsDefaultsManager.js';
import { USER_SETTINGS_PATH } from '../../shared/paths.js';
import { shouldTrackProject } from '../../shared/should-track-project.js';
import { normalizePlatformSource } from '../../shared/platform-source.js';
async function sendObservationToWorker(requestBody: string, toolName: string): Promise<void> {
const response = await workerHttpRequest('/api/sessions/observations', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: requestBody
});
if (!response.ok) {
logger.warn('HOOK', 'Observation storage failed, skipping', { status: response.status, toolName });
return;
}
logger.debug('HOOK', 'Observation sent successfully', { toolName });
}
export const observationHandler: EventHandler = {
async execute(input: NormalizedHookInput): Promise<HookResult> {
// Ensure worker is running before any other logic
const workerReady = await ensureWorkerRunning();
if (!workerReady) {
// Worker not available - skip observation gracefully
return { continue: true, suppressOutput: true, exitCode: HOOK_EXIT_CODES.SUCCESS };
}
const { sessionId, cwd, toolName, toolInput, toolResponse } = input;
const platformSource = normalizePlatformSource(input.platform);
@@ -49,38 +25,43 @@ export const observationHandler: EventHandler = {
logger.dataIn('HOOK', `PostToolUse: ${toolStr}`, {});
// Validate required fields before sending to worker
// Plan 05 Phase 6: cwd is validated at the adapter boundary; the adapter
// rejects empty cwd before reaching the handler. We still type-narrow for
// TypeScript and as a belt-and-suspenders guard.
if (!cwd) {
throw new Error(`Missing cwd in PostToolUse hook input for session ${sessionId}, tool ${toolName}`);
}
// Check if project is excluded from tracking
const settings = SettingsDefaultsManager.loadFromFile(USER_SETTINGS_PATH);
if (isProjectExcluded(cwd, settings.CLAUDE_MEM_EXCLUDED_PROJECTS)) {
// Plan 05 Phase 5: project exclusion via single helper.
if (!shouldTrackProject(cwd)) {
logger.debug('HOOK', 'Project excluded from tracking, skipping observation', { cwd, toolName });
return { continue: true, suppressOutput: true };
}
// Send to worker - worker handles privacy check and database operations
const requestBody = JSON.stringify({
contentSessionId: sessionId,
platformSource,
tool_name: toolName,
tool_input: toolInput,
tool_response: toolResponse,
cwd,
agentId: input.agentId,
agentType: input.agentType
});
// Plan 05 Phase 2: single helper for ensure-worker-alive → request → fallback.
const result = await executeWithWorkerFallback<{ status?: string }>(
'/api/sessions/observations',
'POST',
{
contentSessionId: sessionId,
platformSource,
tool_name: toolName,
tool_input: toolInput,
tool_response: toolResponse,
cwd,
agentId: input.agentId,
agentType: input.agentType,
},
);
try {
await sendObservationToWorker(requestBody, toolName);
} catch (error) {
// Worker unreachable — skip observation gracefully
logger.warn('HOOK', 'Observation fetch error, skipping', { error: error instanceof Error ? error.message : String(error) });
if (isWorkerFallback(result)) {
// Worker unreachable — fail-loud counter has already been incremented
// and may have escalated to exit 2. If we got here, threshold not yet
// reached, so degrade gracefully.
return { continue: true, suppressOutput: true, exitCode: HOOK_EXIT_CODES.SUCCESS };
}
logger.debug('HOOK', 'Observation sent successfully', { toolName });
return { continue: true, suppressOutput: true };
}
},
};
+20 -33
@@ -10,56 +10,43 @@
*/
import type { EventHandler, NormalizedHookInput, HookResult } from '../types.js';
import { ensureWorkerRunning, workerHttpRequest } from '../../shared/worker-utils.js';
import { executeWithWorkerFallback, isWorkerFallback } from '../../shared/worker-utils.js';
import { logger } from '../../utils/logger.js';
import { normalizePlatformSource } from '../../shared/platform-source.js';
async function sendSessionCompleteRequest(sessionId: string, platformSource: string): Promise<void> {
const response = await workerHttpRequest('/api/sessions/complete', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ contentSessionId: sessionId, platformSource })
});
if (!response.ok) {
const text = await response.text();
logger.warn('HOOK', 'session-complete: Failed to complete session', { status: response.status, body: text });
} else {
logger.info('HOOK', 'Session completed successfully', { contentSessionId: sessionId });
}
}
import { shouldTrackProject } from '../../shared/should-track-project.js';
export const sessionCompleteHandler: EventHandler = {
async execute(input: NormalizedHookInput): Promise<HookResult> {
// Ensure worker is running
const workerReady = await ensureWorkerRunning();
if (!workerReady) {
// Worker not available — skip session completion gracefully
return { continue: true, suppressOutput: true };
}
const { sessionId } = input;
const platformSource = normalizePlatformSource(input.platform);
// Same OBSERVER_SESSIONS_DIR exclusion as the rest of the hook surface —
// the observer's child Claude Code must never call /api/sessions/complete.
if (input.cwd && !shouldTrackProject(input.cwd)) {
return { continue: true, suppressOutput: true };
}
if (!sessionId) {
logger.warn('HOOK', 'session-complete: Missing sessionId, skipping');
return { continue: true, suppressOutput: true };
}
logger.info('HOOK', '→ session-complete: Removing session from active map', {
contentSessionId: sessionId
contentSessionId: sessionId,
});
try {
await sendSessionCompleteRequest(sessionId, platformSource);
} catch (error) {
// Log but don't fail - session may already be gone
const errorMessage = error instanceof Error ? error.message : String(error);
logger.warn('HOOK', 'session-complete: Error completing session', {
error: errorMessage
});
// Plan 05 Phase 2: single helper for ensure-worker-alive → request → fallback.
const result = await executeWithWorkerFallback<{ status?: string }>(
'/api/sessions/complete',
'POST',
{ contentSessionId: sessionId, platformSource },
);
if (isWorkerFallback(result)) {
return { continue: true, suppressOutput: true };
}
logger.info('HOOK', 'Session completed successfully', { contentSessionId: sessionId });
return { continue: true, suppressOutput: true };
}
},
};
+54 -91
@@ -5,45 +5,29 @@
*/
import type { EventHandler, NormalizedHookInput, HookResult } from '../types.js';
import { ensureWorkerRunning, workerHttpRequest } from '../../shared/worker-utils.js';
import { executeWithWorkerFallback, isWorkerFallback } from '../../shared/worker-utils.js';
import { getProjectContext } from '../../utils/project-name.js';
import { logger } from '../../utils/logger.js';
import { HOOK_EXIT_CODES } from '../../shared/hook-constants.js';
import { isProjectExcluded } from '../../utils/project-filter.js';
import { SettingsDefaultsManager } from '../../shared/SettingsDefaultsManager.js';
import { USER_SETTINGS_PATH } from '../../shared/paths.js';
import { shouldTrackProject } from '../../shared/should-track-project.js';
import { loadFromFileOnce } from '../../shared/hook-settings.js';
import { normalizePlatformSource } from '../../shared/platform-source.js';
async function fetchSemanticContext(
prompt: string,
project: string,
limit: string,
sessionDbId: number
): Promise<string> {
const semanticRes = await workerHttpRequest('/api/context/semantic', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ q: prompt, project, limit })
});
if (semanticRes.ok) {
const data = await semanticRes.json() as { context: string; count: number };
if (data.context) {
logger.debug('HOOK', `Semantic injection: ${data.count} observations for prompt`, { sessionId: sessionDbId, count: data.count });
return data.context;
}
}
return '';
interface SessionInitResponse {
sessionDbId: number;
promptNumber: number;
skipped?: boolean;
reason?: string;
contextInjected?: boolean;
}
interface SemanticContextResponse {
context: string;
count: number;
}
export const sessionInitHandler: EventHandler = {
async execute(input: NormalizedHookInput): Promise<HookResult> {
// Ensure worker is running before any other logic
const workerReady = await ensureWorkerRunning();
if (!workerReady) {
// Worker not available - skip session init gracefully
return { continue: true, suppressOutput: true, exitCode: HOOK_EXIT_CODES.SUCCESS };
}
const { sessionId, prompt: rawPrompt } = input;
const cwd = input.cwd ?? process.cwd(); // Match context.ts fallback (#1918)
@@ -53,9 +37,8 @@ export const sessionInitHandler: EventHandler = {
return { continue: true, suppressOutput: true, exitCode: HOOK_EXIT_CODES.SUCCESS };
}
// Check if project is excluded from tracking
const settings = SettingsDefaultsManager.loadFromFile(USER_SETTINGS_PATH);
if (cwd && isProjectExcluded(cwd, settings.CLAUDE_MEM_EXCLUDED_PROJECTS)) {
// Plan 05 Phase 5: project exclusion via single helper.
if (!shouldTrackProject(cwd)) {
logger.info('HOOK', 'Project excluded from tracking', { cwd });
return { continue: true, suppressOutput: true };
}
@@ -69,38 +52,28 @@ export const sessionInitHandler: EventHandler = {
logger.debug('HOOK', 'session-init: Calling /api/sessions/init', { contentSessionId: sessionId, project });
// Initialize session via HTTP - handles DB operations and privacy checks
let initResponse: Response;
try {
initResponse = await workerHttpRequest('/api/sessions/init', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
contentSessionId: sessionId,
project,
prompt,
platformSource
})
});
} catch (err) {
// Worker unreachable — on Linux/WSL, hook may fire before worker is healthy (#1907)
logger.warn('HOOK', `session-init: worker request failed: ${err instanceof Error ? err.message : err}`);
// Plan 05 Phase 2: single helper for ensure-worker-alive → request → fallback.
const initResult = await executeWithWorkerFallback<SessionInitResponse>(
'/api/sessions/init',
'POST',
{
contentSessionId: sessionId,
project,
prompt,
platformSource,
},
);
if (isWorkerFallback(initResult)) {
return { continue: true, suppressOutput: true, exitCode: HOOK_EXIT_CODES.SUCCESS };
}
if (!initResponse.ok) {
// Log but don't throw - a worker 500 should not block the user's prompt
logger.failure('HOOK', `Session initialization failed: ${initResponse.status}`, { contentSessionId: sessionId, project });
// Worker may have returned a non-2xx body (parsed but missing fields). Fail-soft.
if (typeof initResult?.sessionDbId !== 'number') {
logger.failure('HOOK', 'Session initialization returned malformed response', { contentSessionId: sessionId, project });
return { continue: true, suppressOutput: true, exitCode: HOOK_EXIT_CODES.SUCCESS };
}
const initResult = await initResponse.json() as {
sessionDbId: number;
promptNumber: number;
skipped?: boolean;
reason?: string;
contextInjected?: boolean;
};
const sessionDbId = initResult.sessionDbId;
const promptNumber = initResult.promptNumber;
@@ -117,57 +90,47 @@ export const sessionInitHandler: EventHandler = {
return { continue: true, suppressOutput: true };
}
// Skip SDK agent re-initialization if context was already injected for this session (#1079)
// The prompt was already saved to the database by /api/sessions/init above —
// no need to re-start the SDK agent on every turn.
// Note: we do NOT return here — semantic injection below must run on every prompt.
const skipAgentInit = Boolean(initResult.contextInjected);
if (skipAgentInit) {
logger.info('HOOK', `INIT_COMPLETE | sessionDbId=${sessionDbId} | promptNumber=${promptNumber} | skipped_agent_init=true | reason=context_already_injected`, {
sessionId: sessionDbId
});
}
// Only initialize SDK agent for Claude Code (not Cursor)
// Cursor doesn't use the SDK agent - it only needs session/observation storage
if (!skipAgentInit && input.platform !== 'cursor' && sessionDbId) {
// Plan 05 Phase 7: agent init is idempotent — call unconditionally for
// every Claude Code session. Cursor still skipped (no SDK agent).
if (input.platform !== 'cursor' && sessionDbId) {
// Strip leading slash from commands for memory agent
// /review 101 -> review 101 (more semantic for observations)
const cleanedPrompt = prompt.startsWith('/') ? prompt.substring(1) : prompt;
logger.debug('HOOK', 'session-init: Calling /sessions/{sessionDbId}/init', { sessionDbId, promptNumber });
// Initialize SDK agent session via HTTP (starts the agent!)
const response = await workerHttpRequest(`/sessions/${sessionDbId}/init`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ userPrompt: cleanedPrompt, promptNumber })
});
if (!response.ok) {
// Log but don't throw - SDK agent failure should not block the user's prompt
logger.failure('HOOK', `SDK agent start failed: ${response.status}`, { sessionDbId, promptNumber });
const agentInitResult = await executeWithWorkerFallback<{ status?: string }>(
`/sessions/${sessionDbId}/init`,
'POST',
{ userPrompt: cleanedPrompt, promptNumber },
);
if (isWorkerFallback(agentInitResult)) {
// Worker became unreachable mid-invocation; fail-loud counter handled it.
return { continue: true, suppressOutput: true, exitCode: HOOK_EXIT_CODES.SUCCESS };
}
} else if (!skipAgentInit && input.platform === 'cursor') {
} else if (input.platform === 'cursor') {
logger.debug('HOOK', 'session-init: Skipping SDK agent init for Cursor platform', { sessionDbId, promptNumber });
}
// Semantic context injection: query Chroma for relevant past observations
// and inject as additionalContext so Claude receives relevant memory each prompt.
// Controlled by CLAUDE_MEM_SEMANTIC_INJECT setting (default: true).
// Plan 05 Phase 4: settings via process-scope cache.
const settings = loadFromFileOnce();
const semanticInject =
String(settings.CLAUDE_MEM_SEMANTIC_INJECT).toLowerCase() === 'true';
let additionalContext = '';
if (semanticInject && prompt && prompt.length >= 20 && prompt !== '[media prompt]') {
const limit = settings.CLAUDE_MEM_SEMANTIC_INJECT_LIMIT || '5';
try {
additionalContext = await fetchSemanticContext(prompt, project, limit, sessionDbId);
} catch (e) {
// Graceful degradation — semantic injection is optional
logger.debug('HOOK', 'Semantic injection unavailable', {
error: e instanceof Error ? e.message : String(e)
});
const semanticResult = await executeWithWorkerFallback<SemanticContextResponse>(
'/api/context/semantic',
'POST',
{ q: prompt, project, limit },
);
if (!isWorkerFallback(semanticResult) && semanticResult?.context) {
logger.debug('HOOK', `Semantic injection: ${semanticResult.count} observations for prompt`, { sessionId: sessionDbId, count: semanticResult.count });
additionalContext = semanticResult.context;
}
}
+30 -37
@@ -1,26 +1,33 @@
/**
* Summarize Handler - Stop
*
* Fire-and-forget: enqueue the summarize request with the worker and return
* immediately so the Stop hook does not block the user's terminal. The worker
* owns completion and session cleanup.
* Fire-and-forget: queue the summarize request and exit. The worker handles
* summary generation, storage, and session cleanup asynchronously. The Stop
* hook does not wait for any of it — Claude Code must exit immediately.
* Session-complete cleanup is performed by the SessionEnd handler.
*/
import type { EventHandler, NormalizedHookInput, HookResult } from '../types.js';
import { ensureWorkerRunning, workerHttpRequest } from '../../shared/worker-utils.js';
import { executeWithWorkerFallback, isWorkerFallback } from '../../shared/worker-utils.js';
import { logger } from '../../utils/logger.js';
import { extractLastMessage } from '../../shared/transcript-parser.js';
import { HOOK_EXIT_CODES } from '../../shared/hook-constants.js';
import { normalizePlatformSource } from '../../shared/platform-source.js';
const SUMMARIZE_TIMEOUT_MS = 5000;
import { shouldTrackProject } from '../../shared/should-track-project.js';
export const summarizeHandler: EventHandler = {
async execute(input: NormalizedHookInput): Promise<HookResult> {
// Skip Stop hook entirely when firing from an excluded project (notably
// OBSERVER_SESSIONS_DIR). Without this, the SDK observer's own Stop hook
// queues summaries against its meta-session and triggers a recovery loop.
if (input.cwd && !shouldTrackProject(input.cwd)) {
return { continue: true, suppressOutput: true, exitCode: HOOK_EXIT_CODES.SUCCESS };
}
// Skip summaries in subagent context — subagents do not own the session summary.
// Gate on agentId only: that field is present exclusively for Task-spawned subagents.
// agentType alone (no agentId) indicates `--agent`-started main sessions, which still
// own their summary. Do this BEFORE ensureWorkerRunning() so a subagent Stop hook
// own their summary. Do this BEFORE the worker call so a subagent Stop hook
// does not bootstrap the worker.
if (input.agentId) {
logger.debug('HOOK', 'Skipping summary: subagent context detected', {
@@ -31,16 +38,13 @@ export const summarizeHandler: EventHandler = {
return { continue: true, suppressOutput: true, exitCode: HOOK_EXIT_CODES.SUCCESS };
}
// Ensure worker is running before any other logic
const workerReady = await ensureWorkerRunning();
if (!workerReady) {
// Worker not available - skip summary gracefully
return { continue: true, suppressOutput: true, exitCode: HOOK_EXIT_CODES.SUCCESS };
}
const { sessionId, transcriptPath } = input;
// Validate required fields before processing
if (!sessionId) {
logger.warn('HOOK', 'summarize: No sessionId provided, skipping');
return { continue: true, suppressOutput: true, exitCode: HOOK_EXIT_CODES.SUCCESS };
}
if (!transcriptPath) {
// No transcript available - skip summary gracefully (not an error)
logger.debug('HOOK', `No transcriptPath in Stop hook input for session ${sessionId} - skipping summary`);
@@ -75,31 +79,20 @@ export const summarizeHandler: EventHandler = {
const platformSource = normalizePlatformSource(input.platform);
// 1. Queue summarize request — worker returns immediately with { status: 'queued' }
let response: Response;
try {
response = await workerHttpRequest('/api/sessions/summarize', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
contentSessionId: sessionId,
last_assistant_message: lastAssistantMessage,
platformSource
}),
timeoutMs: SUMMARIZE_TIMEOUT_MS
});
} catch (err) {
// Network error, worker crash, or timeout — exit gracefully instead of
// bubbling to hook runner which exits code 2 and blocks session exit (#1901)
logger.warn('HOOK', `Stop hook: summarize request failed: ${err instanceof Error ? err.message : err}`);
const queueResult = await executeWithWorkerFallback<{ status?: string }>(
'/api/sessions/summarize',
'POST',
{
contentSessionId: sessionId,
last_assistant_message: lastAssistantMessage,
platformSource,
},
);
if (isWorkerFallback(queueResult)) {
return { continue: true, suppressOutput: true, exitCode: HOOK_EXIT_CODES.SUCCESS };
}
if (!response.ok) {
return { continue: true, suppressOutput: true };
}
logger.debug('HOOK', 'Summary request queued');
logger.debug('HOOK', 'Summary request queued, exiting hook');
return { continue: true, suppressOutput: true };
}
},
};
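The `shouldTrackProject` predicate imported above is also not shown in this diff. A self-contained sketch of its assumed behavior — the real helper in `shared/should-track-project.ts` reads the excluded-projects setting itself; the injectable `excluded` list here is an assumption made to keep the sketch standalone:

```typescript
import { basename } from 'path';

// Hedged sketch of the Plan 05 Phase 5 predicate: one yes/no answer that
// folds the exclusion check into a single helper. The injectable list is
// an assumption; the real helper loads settings internally.
export function shouldTrackProject(cwd: string, excluded: string[] = []): boolean {
  // A project is tracked unless its directory name is excluded
  // (e.g. the SDK observer's own sessions directory).
  return !excluded.includes(basename(cwd));
}
```

Collapsing the check to one call site is what lets the Stop hook gate on exclusion before it would otherwise bootstrap the worker.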
+23 -32
@@ -7,47 +7,38 @@
import { basename } from 'path';
import type { EventHandler, NormalizedHookInput, HookResult } from '../types.js';
import { ensureWorkerRunning, getWorkerPort, workerHttpRequest } from '../../shared/worker-utils.js';
import {
executeWithWorkerFallback,
isWorkerFallback,
getWorkerPort,
} from '../../shared/worker-utils.js';
import { HOOK_EXIT_CODES } from '../../shared/hook-constants.js';
async function fetchAndDisplayContext(project: string, colorsParam: string, port: number): Promise<void> {
const response = await workerHttpRequest(
`/api/context/inject?project=${encodeURIComponent(project)}${colorsParam}`
);
if (!response.ok) {
return;
}
const output = await response.text();
process.stderr.write(
"\n\n" + String.fromCodePoint(0x1F4DD) + " Claude-Mem Context Loaded\n\n" +
output +
"\n\n" + String.fromCodePoint(0x1F4A1) + " Wrap any message with <private> ... </private> to prevent storing sensitive information.\n" +
"\n" + String.fromCodePoint(0x1F4AC) + " Community https://discord.gg/J4wttp9vDu" +
`\n` + String.fromCodePoint(0x1F4FA) + ` Watch live in browser http://localhost:${port}/\n`
);
}
export const userMessageHandler: EventHandler = {
async execute(input: NormalizedHookInput): Promise<HookResult> {
// Ensure worker is running
const workerReady = await ensureWorkerRunning();
if (!workerReady) {
// Worker not available — skip user message gracefully
return { exitCode: HOOK_EXIT_CODES.SUCCESS };
}
const port = getWorkerPort();
const project = basename(input.cwd ?? process.cwd());
const colorsParam = input.platform === 'claude-code' ? '&colors=true' : '';
try {
await fetchAndDisplayContext(project, colorsParam, port);
} catch {
// Worker unreachable — skip user message gracefully
// Plan 05 Phase 2: single helper for ensure-worker-alive → request → fallback.
const result = await executeWithWorkerFallback<string>(
`/api/context/inject?project=${encodeURIComponent(project)}${colorsParam}`,
'GET',
);
if (isWorkerFallback(result)) {
return { exitCode: HOOK_EXIT_CODES.SUCCESS };
}
const output = typeof result === 'string' ? result : '';
process.stderr.write(
"\n\n" + String.fromCodePoint(0x1F4DD) + " Claude-Mem Context Loaded\n\n" +
output +
"\n\n" + String.fromCodePoint(0x1F4A1) + " Wrap any message with <private> ... </private> to prevent storing sensitive information.\n" +
"\n" + String.fromCodePoint(0x1F4AC) + " Community https://discord.gg/J4wttp9vDu" +
`\n` + String.fromCodePoint(0x1F4FA) + ` Watch live in browser http://localhost:${port}/\n`
);
return { exitCode: HOOK_EXIT_CODES.SUCCESS };
}
},
};
+13
@@ -1,5 +1,6 @@
import { readJsonFromStdin } from './stdin-reader.js';
import { getPlatformAdapter } from './adapters/index.js';
import { AdapterRejectedInput } from './adapters/errors.js';
import { getEventHandler } from './handlers/index.js';
import { HOOK_EXIT_CODES } from '../shared/hook-constants.js';
import { logger } from '../utils/logger.js';
@@ -98,6 +99,18 @@ export async function hookCommand(platform: string, event: string, options: Hook
try {
return await executeHookPipeline(adapter, handler, platform, options);
} catch (error) {
// Plan 05 Phase 6 — adapter rejected the input (invalid cwd or other
// boundary-detected payload defect). Treat as graceful: emit a continue
// envelope and exit 0 so the user's session is not blocked by a malformed
// hook payload from the platform.
if (error instanceof AdapterRejectedInput) {
logger.warn('HOOK', `Adapter rejected input (${error.reason}), skipping hook`);
console.log(JSON.stringify({ continue: true, suppressOutput: true }));
if (!options.skipExit) {
process.exit(HOOK_EXIT_CODES.SUCCESS);
}
return HOOK_EXIT_CODES.SUCCESS;
}
if (isWorkerUnavailableError(error)) {
// Worker unavailable — degrade gracefully, don't block the user
// Log to file instead of stderr (#1181)
+4 -2
@@ -351,7 +351,8 @@ function runNpmInstallInMarketplace(): void {
execSync('npm install --production', {
cwd: marketplaceDir,
stdio: 'pipe',
...(IS_WINDOWS ? { shell: true as const } : {}),
encoding: 'utf8',
...(IS_WINDOWS ? { shell: process.env.ComSpec ?? 'cmd.exe' } : {}),
});
}
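Both hunks in this file make the same pair of changes: set `encoding` so `execSync` returns a string, and on Windows pass an explicit shell path (ComSpec, falling back to `cmd.exe`) instead of `shell: true`. A minimal standalone sketch of the pattern — the `run` wrapper is hypothetical, not part of the codebase:

```typescript
import { execSync } from 'child_process';

const IS_WINDOWS = process.platform === 'win32';

// Hypothetical wrapper showing the pattern from both hunks: `encoding`
// makes execSync return utf8 text, and the spread only adds the explicit
// shell path on Windows, leaving the POSIX default (/bin/sh) untouched.
function run(cmd: string): string {
  return execSync(cmd, {
    encoding: 'utf8',
    ...(IS_WINDOWS ? { shell: process.env.ComSpec ?? 'cmd.exe' } : {}),
  });
}
```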
@@ -370,7 +371,8 @@ function runSmartInstall(): boolean {
try {
execSync(`node "${smartInstallPath}"`, {
stdio: 'inherit',
...(IS_WINDOWS ? { shell: true as const } : {}),
encoding: 'utf8',
...(IS_WINDOWS ? { shell: process.env.ComSpec ?? 'cmd.exe' } : {}),
});
return true;
} catch (error: unknown) {
-20
@@ -64,23 +64,3 @@ export function resolveBunBinaryPath(): string | null {
return null;
}
/**
* Get the installed Bun version string (e.g. `"1.2.3"`), or `null`
* if Bun is not available.
*/
export function getBunVersionString(): string | null {
const bunPath = resolveBunBinaryPath();
if (!bunPath) return null;
try {
const result = spawnSync(bunPath, ['--version'], {
encoding: 'utf-8',
stdio: ['pipe', 'pipe', 'pipe'],
shell: IS_WINDOWS,
});
return result.status === 0 ? result.stdout.trim() : null;
} catch (error: unknown) {
console.error('[bun-resolver] Failed to get Bun version:', error instanceof Error ? error.message : String(error));
return null;
}
}
+111 -143
@@ -1,6 +1,13 @@
/**
* XML Parser Module
* Parses observation and summary XML blocks from SDK responses
*
* Single fail-fast entry point for SDK agent XML responses.
*
* Per PATHFINDER-2026-04-22 plan 03 phase 1:
* - One function (`parseAgentXml`) for all agent responses.
* - Discriminated-union return: `{ valid: true, kind, data }` or `{ valid: false, reason }`.
* - No coercion. No silent passthrough. No "lenient mode".
* - `<skip_summary reason="…"/>` is a first-class summary case (skipped: true).
*/
import { logger } from '../utils/logger.js';
@@ -24,23 +31,103 @@ export interface ParsedSummary {
completed: string | null;
next_steps: string | null;
notes: string | null;
/** True when the response was an explicit `<skip_summary reason="…"/>` bypass. */
skipped?: boolean;
/** Non-null when `skipped: true`. */
skip_reason?: string | null;
}
export type ParseResult =
| { valid: true; kind: 'observation'; data: ParsedObservation[] }
| { valid: true; kind: 'summary'; data: ParsedSummary }
| { valid: false; reason: string };
/**
* Parse an SDK agent response. Inspects the first significant XML root element
* and returns a discriminated union. Never coerces. Never returns null/undefined.
*
* Recognised roots:
* <observation> … </observation> → { kind: 'observation', data: ParsedObservation[] }
* <summary> … </summary> → { kind: 'summary', data: ParsedSummary }
* <skip_summary reason="…" /> → { kind: 'summary', data: { skipped: true, … } }
*
* Anything else → { valid: false, reason }. The caller is responsible for
* surfacing the reason (markFailed, log, etc.). No retry coercion.
*/
export function parseAgentXml(raw: string, correlationId?: string | number): ParseResult {
if (typeof raw !== 'string' || !raw.trim()) {
return { valid: false, reason: 'empty: response had no content' };
}
// Skip-summary is recognised even when wrapped in other text, but only as the
// sole structural signal. It outranks <observation> / <summary> matches because
// it is an explicit protocol bypass. `reason` is optional.
const skipMatch = /<skip_summary(?:\s+reason="([^"]*)")?\s*\/>/.exec(raw);
if (skipMatch) {
return {
valid: true,
kind: 'summary',
data: {
request: null,
investigated: null,
learned: null,
completed: null,
next_steps: null,
notes: null,
skipped: true,
skip_reason: skipMatch[1] ?? null,
},
};
}
// Find the first significant element by scanning for the first `<…>` opener
// that is one of the recognised roots. This tolerates leading prose / debug
// output from the model while still failing fast on entirely-non-XML payloads.
const firstRoot = /<(observation|summary)\b/i.exec(raw);
if (!firstRoot) {
const preview = raw.length > 120 ? `${raw.slice(0, 120)}…` : raw;
return {
valid: false,
reason: `unknown root: response contained no <observation>, <summary>, or <skip_summary/> element (preview: ${preview.replace(/\s+/g, ' ')})`,
};
}
const rootName = firstRoot[1].toLowerCase();
if (rootName === 'observation') {
const observations = parseObservationBlocks(raw, correlationId);
if (observations.length === 0) {
return {
valid: false,
reason: '<observation>: no parseable observation block (every block was empty or ghost)',
};
}
return { valid: true, kind: 'observation', data: observations };
}
// rootName === 'summary'
const summary = parseSummaryBlock(raw, correlationId);
if (!summary) {
return {
valid: false,
reason: '<summary>: empty or missing every required sub-tag (request/investigated/learned/completed/next_steps)',
};
}
return { valid: true, kind: 'summary', data: summary };
}
/**
* Parse observation XML blocks from SDK response
* Returns all observations found in the response
* Parse all <observation>…</observation> blocks. Filters out ghost
* observations (every content field empty). Returns the surviving list.
*/
export function parseObservations(text: string, correlationId?: string): ParsedObservation[] {
function parseObservationBlocks(text: string, correlationId?: string | number): ParsedObservation[] {
const observations: ParsedObservation[] = [];
// Match <observation>...</observation> blocks (non-greedy)
const observationRegex = /<observation>([\s\S]*?)<\/observation>/g;
let match;
while ((match = observationRegex.exec(text)) !== null) {
const obsContent = match[1];
// Extract all fields
const type = extractField(obsContent, 'type');
const title = extractField(obsContent, 'title');
const subtitle = extractField(obsContent, 'subtitle');
@@ -50,13 +137,13 @@ export function parseObservations(text: string, correlationId?: string): ParsedO
const files_read = extractArrayElements(obsContent, 'files_read', 'file');
const files_modified = extractArrayElements(obsContent, 'files_modified', 'file');
// All fields except type are nullable in schema.
// If type is missing or invalid, use first type from mode as fallback.
// Determine final type using active mode's valid types
// Type fallback: per existing semantics, missing/invalid type degrades to the
// first type in the active mode. This is parser-internal validation, not
// recovery from a contract violation: every mode's first type is intentionally
// the catch-all bucket.
const mode = ModeManager.getInstance().getActiveMode();
const validTypes = mode.observation_types.map(t => t.id);
const fallbackType = validTypes[0]; // First type in mode's list is the fallback
const fallbackType = validTypes[0];
let finalType = fallbackType;
if (type) {
if (validTypes.includes(type.trim())) {
@@ -68,8 +155,6 @@ export function parseObservations(text: string, correlationId?: string): ParsedO
logger.error('PARSER', `Observation missing type field, using "${fallbackType}"`, { correlationId });
}
// All other fields are optional - save whatever we have
// Filter out type from concepts array (types and concepts are separate dimensions)
const cleanedConcepts = concepts.filter(c => c !== finalType);
@@ -83,10 +168,8 @@ export function parseObservations(text: string, correlationId?: string): ParsedO
}
// Skip ghost observations — records where every content field is null/empty.
// These accumulate when the LLM emits a bare <observation/> (or one with only <type>)
// due to context overflow. They carry no information and pollute the context window.
// (subtitle and file lists are intentionally excluded from this guard: an observation
// with only a subtitle is still too thin to be useful on its own.)
// (subtitle and file lists are intentionally excluded from this guard:
// an observation with only a subtitle is still too thin to be useful.)
if (!title && !narrative && facts.length === 0 && cleanedConcepts.length === 0) {
logger.warn('PARSER', 'Skipping empty observation (all content fields null)', {
correlationId,
@@ -111,96 +194,29 @@ export function parseObservations(text: string, correlationId?: string): ParsedO
}
/**
* Parse summary XML block from SDK response
* Returns null if no valid summary found or if summary was skipped
*
* @param coerceFromObservation - When true, attempts to convert <observation> tags
* into summary fields if no <summary> tags are found. Only set this when the
* response was expected to be a summary (i.e., a summarize message was sent).
* Prevents the infinite retry loop described in #1633.
* Parse a single <summary>…</summary> block. Returns null when the block has
* no usable sub-tags (every required field empty) — the caller maps this to
* a fail-fast `{ valid: false, reason }` result.
*/
export function parseSummary(text: string, sessionId?: number, coerceFromObservation: boolean = false): ParsedSummary | null {
// Check for skip_summary first
const skipRegex = /<skip_summary\s+reason="([^"]+)"\s*\/>/;
const skipMatch = skipRegex.exec(text);
if (skipMatch) {
logger.info('PARSER', 'Summary skipped', {
sessionId,
reason: skipMatch[1]
});
return null;
}
// Match <summary>...</summary> block (non-greedy)
function parseSummaryBlock(text: string, correlationId?: string | number): ParsedSummary | null {
const summaryRegex = /<summary>([\s\S]*?)<\/summary>/;
const summaryMatch = summaryRegex.exec(text);
if (!summaryMatch) {
// When the LLM returns <observation> tags instead of <summary> tags on a
// summary turn, coerce the observation content into summary fields rather
// than discarding it. This breaks the infinite retry loop described in
// #1633: without coercion, the summary is silently dropped, the session
// completes without a summary, a new session is spawned with an ever-growing
// prompt, and the cycle repeats.
//
// parseSummary is called on every response (see ResponseProcessor), not just
// summary turns — so the absence of <summary> in an observation response is
// expected, not a prompt-conditioning failure. Only act when the caller
// actually expected a summary (coerceFromObservation=true).
if (coerceFromObservation && /<observation>/.test(text)) {
const coerced = coerceObservationToSummary(text, sessionId);
if (coerced) {
return coerced;
}
logger.warn('PARSER', 'Summary response contained <observation> tags instead of <summary> — coercion failed, no usable content', { sessionId });
}
return null;
}
if (!summaryMatch) return null;
const summaryContent = summaryMatch[1];
// Extract fields
const request = extractField(summaryContent, 'request');
const investigated = extractField(summaryContent, 'investigated');
const learned = extractField(summaryContent, 'learned');
const completed = extractField(summaryContent, 'completed');
const next_steps = extractField(summaryContent, 'next_steps');
const notes = extractField(summaryContent, 'notes'); // Optional
const notes = extractField(summaryContent, 'notes'); // optional
// NOTE FROM THEDOTMACK: 100% of the time we must SAVE the summary, even if fields are missing. 10/24/2025
// NEVER DO THIS NONSENSE AGAIN.
// Validate required fields are present (notes is optional)
// if (!request || !investigated || !learned || !completed || !next_steps) {
// logger.warn('PARSER', 'Summary missing required fields', {
// sessionId,
// hasRequest: !!request,
// hasInvestigated: !!investigated,
// hasLearned: !!learned,
// hasCompleted: !!completed,
// hasNextSteps: !!next_steps
// });
// return null;
// }
// Guard: if NO sub-tags matched at all, this is a false positive —
// <summary> accidentally appeared inside an <observation> response with no structured content.
// This is NOT the same as missing some fields (which we intentionally allow above).
// Fix for #1360.
// Per maintainer note: a summary with at least one populated sub-tag must be
// saved. Missing sub-tags are tolerated; an entirely empty <summary> block is
// a false-positive (covered the #1360 regression) and is rejected.
if (!request && !investigated && !learned && !completed && !next_steps) {
// If the response also contains <observation> tags with real content, fall
// back to coercion rather than discarding the response entirely — this covers
// the case where the LLM wraps empty <summary></summary> around observation
// content, which would otherwise resurrect the #1633 retry loop.
if (coerceFromObservation && /<observation>/.test(text)) {
const coerced = coerceObservationToSummary(text, sessionId);
if (coerced) {
logger.warn('PARSER', 'Empty <summary> match rejected — coerced from <observation> fallback (#1633)', { sessionId });
return coerced;
}
}
logger.warn('PARSER', 'Summary match has no sub-tags — skipping false positive', { sessionId });
logger.warn('PARSER', 'Summary block has no sub-tags — rejecting false positive', { correlationId });
return null;
}
@@ -210,54 +226,10 @@ export function parseSummary(text: string, sessionId?: number, coerceFromObserva
learned,
completed,
next_steps,
notes
notes,
};
}
/**
* Coerce <observation> response into a ParsedSummary when <summary> tags are missing.
* Maps observation fields to the closest summary equivalents so that a usable
* summary is stored instead of nothing — breaking the retry loop (#1633).
*/
function coerceObservationToSummary(text: string, sessionId?: number): ParsedSummary | null {
// Iterate all <observation> blocks — if the LLM emits multiple and the first is
// empty, we still want to salvage the first one that has usable content.
const obsRegex = /<observation>([\s\S]*?)<\/observation>/g;
let obsMatch: RegExpExecArray | null;
let blockIndex = 0;
while ((obsMatch = obsRegex.exec(text)) !== null) {
const obsContent = obsMatch[1];
const title = extractField(obsContent, 'title');
const subtitle = extractField(obsContent, 'subtitle');
const narrative = extractField(obsContent, 'narrative');
const facts = extractArrayElements(obsContent, 'facts', 'fact');
if (title || narrative || facts.length > 0) {
// Map observation fields → summary fields (best-effort)
const request = title || subtitle || null;
const investigated = narrative || null;
const learned = facts.length > 0 ? facts.join('; ') : null;
const completed = title ? `${title}${subtitle ? ' — ' + subtitle : ''}` : null;
const next_steps = null; // No direct observation equivalent
logger.warn('PARSER', 'Coerced <observation> response into <summary> to prevent retry loop (#1633)', {
sessionId,
blockIndex,
hasTitle: !!title,
hasNarrative: !!narrative,
factCount: facts.length,
});
return { request, investigated, learned, completed, next_steps, notes: null };
}
blockIndex++;
}
return null;
}
/**
* Extract a simple field value from XML content
* Returns null for missing or empty/whitespace-only fields
@@ -265,8 +237,6 @@ function coerceObservationToSummary(text: string, sessionId?: number): ParsedSum
* Uses non-greedy match to handle nested tags and code snippets (Issue #798)
*/
function extractField(content: string, fieldName: string): string | null {
// Use [\s\S]*? to match any character including newlines, non-greedily
// This handles nested XML tags like <item>...</item> inside the field
const regex = new RegExp(`<${fieldName}>([\\s\\S]*?)</${fieldName}>`);
const match = regex.exec(content);
if (!match) return null;
@@ -282,7 +252,6 @@ function extractField(content: string, fieldName: string): string | null {
function extractArrayElements(content: string, arrayName: string, elementName: string): string[] {
const elements: string[] = [];
// Match the array block using [\s\S]*? for nested content
const arrayRegex = new RegExp(`<${arrayName}>([\\s\\S]*?)</${arrayName}>`);
const arrayMatch = arrayRegex.exec(content);
@@ -292,7 +261,6 @@ function extractArrayElements(content: string, arrayName: string, elementName: s
const arrayContent = arrayMatch[1];
// Extract individual elements using [\s\S]*? for nested content
const elementRegex = new RegExp(`<${elementName}>([\\s\\S]*?)</${elementName}>`, 'g');
let elementMatch;
while ((elementMatch = elementRegex.exec(arrayContent)) !== null) {
+5 -10
@@ -7,19 +7,14 @@ import { logger } from '../utils/logger.js';
import type { ModeConfig } from '../services/domain/types.js';
/**
* Marker string embedded in summary prompts — used by ResponseProcessor to detect
* whether the most recent user message was a summary request (enables observation→summary
* coercion for #1633). Keep in sync with buildSummaryPrompt below.
* Marker string embedded in summary prompts — historically used by
* ResponseProcessor to detect summary turns for the (now-deleted) coercion
* fallback. Kept here because `buildSummaryPrompt` still embeds it as the
* mode-switch banner; deleting the constant would require rewriting the
* prompt builder, which is out of scope for plan 03.
*/
export const SUMMARY_MODE_MARKER = 'MODE SWITCH: PROGRESS SUMMARY';
/**
* Maximum consecutive summary failures before the circuit breaker opens.
* After this many failures, SessionManager.queueSummarize will skip further
* summarize requests to prevent the infinite retry loop (#1633).
*/
export const MAX_CONSECUTIVE_SUMMARY_FAILURES = 3;
export interface Observation {
id: number;
tool_name: string;
@@ -10,7 +10,7 @@
import http from 'http';
import { logger } from '../../utils/logger.js';
import { stopSupervisor } from '../../supervisor/index.js';
import { getSupervisor } from '../../supervisor/index.js';
export interface ShutdownableService {
shutdownAll(): Promise<void>;
@@ -80,7 +80,10 @@ export async function performGracefulShutdown(config: GracefulShutdownConfig): P
}
// STEP 6: Supervisor handles tracked child termination, PID cleanup, and stale sockets.
await stopSupervisor();
// Plan 06 Phase 8 — call the supervisor singleton directly; the wrapper
// re-export from supervisor/index.ts was deleted (one wrapper, one caller,
// no value).
await getSupervisor().stop();
logger.info('SYSTEM', 'Worker shutdown complete');
}
@@ -48,7 +48,7 @@ interface WorktreeEntry {
branch: string | null;
}
const GIT_TIMEOUT_MS = 5000;
const GIT_TIMEOUT_MS = 15000;
class DryRunRollback extends Error {
constructor() {
@@ -58,11 +58,31 @@ class DryRunRollback extends Error {
}
function gitCapture(cwd: string, args: string[]): string | null {
const startTime = Date.now();
const r = spawnSync('git', ['-C', cwd, ...args], {
encoding: 'utf8',
timeout: GIT_TIMEOUT_MS
});
if (r.status !== 0) return null;
const duration = Date.now() - startTime;
if (duration > 1000) {
logger.debug('GIT', `Slow git operation: git -C ${cwd} ${args.join(' ')} took ${duration}ms`);
}
if (r.error) {
logger.warn('GIT', `Git operation failed: git -C ${cwd} ${args.join(' ')}`, {
error: r.error.message,
timedOut: (r.error as NodeJS.ErrnoException).code === 'ETIMEDOUT' || (r.status === null && r.signal === 'SIGTERM')
});
return null;
}
if (r.status !== 0) {
logger.debug('GIT', `Git returned non-zero exit code ${r.status}: git -C ${cwd} ${args.join(' ')}`, {
stderr: r.stderr?.toString().trim()
});
return null;
}
return (r.stdout ?? '').trim();
}
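The check order matters: on timeout, `spawnSync` reports the failure through `r.error` (code `ETIMEDOUT`) while `r.status` stays null, so `r.error` must be inspected first. A minimal sketch of that ordering, using `node` itself as the child so it runs anywhere (the helper name `capture` is made up for the example):

```typescript
import { spawnSync } from 'node:child_process';

// Sketch of the capture helper's check order: r.error (spawn failure or
// timeout) before r.status (non-zero exit). A timed-out child has
// r.status === null, so checking status alone would miss it.
function capture(cmd: string, args: string[], timeoutMs: number): string | null {
  const r = spawnSync(cmd, args, { encoding: 'utf8', timeout: timeoutMs });
  if (r.error) return null;        // spawn failed or timed out
  if (r.status !== 0) return null; // child ran but exited non-zero
  return (r.stdout ?? '').trim();
}

console.log(capture(process.execPath, ['-e', 'console.log("ok")'], 5000));
```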
@@ -281,83 +281,3 @@ export function uninstallCodexCli(): number {
return 0;
}
// ---------------------------------------------------------------------------
// Public API: Status Check
// ---------------------------------------------------------------------------
/**
* Check Codex CLI integration status.
*
* @returns 0 always (informational)
*/
export function checkCodexCliStatus(): number {
console.log('\nClaude-Mem Codex CLI Integration Status\n');
// Check transcript-watch.json
if (!existsSync(DEFAULT_CONFIG_PATH)) {
console.log('Status: Not installed');
console.log(` No transcript watch config at ${DEFAULT_CONFIG_PATH}`);
console.log('\nRun: npx claude-mem install --ide codex-cli\n');
return 0;
}
let config: TranscriptWatchConfig;
try {
config = loadExistingTranscriptWatchConfig();
} catch (error) {
if (error instanceof Error) {
logger.error('WORKER', 'Could not parse transcript-watch.json', { path: DEFAULT_CONFIG_PATH }, error);
} else {
logger.error('WORKER', 'Could not parse transcript-watch.json', { path: DEFAULT_CONFIG_PATH }, new Error(String(error)));
}
console.log('Status: Unknown');
console.log(' Could not parse transcript-watch.json.');
console.log('');
return 0;
}
const codexWatch = config.watches.find(
(w: WatchTarget) => w.name === CODEX_WATCH_NAME,
);
const codexSchema = config.schemas?.[CODEX_WATCH_NAME];
if (!codexWatch) {
console.log('Status: Not installed');
console.log(' transcript-watch.json exists but no codex watch configured.');
console.log('\nRun: npx claude-mem install --ide codex-cli\n');
return 0;
}
console.log('Status: Installed');
console.log(` Config: ${DEFAULT_CONFIG_PATH}`);
console.log(` Watch path: ${codexWatch.path}`);
console.log(` Schema: ${codexSchema ? `codex (v${codexSchema.version ?? '?'})` : 'missing'}`);
console.log(` Start at end: ${codexWatch.startAtEnd ?? false}`);
if (codexWatch.context) {
console.log(` Context mode: ${codexWatch.context.mode}`);
console.log(` Context path: ${codexWatch.context.path ?? '<workspace>/AGENTS.md (default)'}`);
console.log(` Context updates on: ${codexWatch.context.updateOn?.join(', ') ?? 'none'}`);
}
if (existsSync(CODEX_AGENTS_MD_PATH)) {
const mdContent = readFileSync(CODEX_AGENTS_MD_PATH, 'utf-8');
if (mdContent.includes('<claude-mem-context>')) {
console.log(` Legacy global context: Present (${CODEX_AGENTS_MD_PATH})`);
} else {
console.log(` Legacy global context: Not active`);
}
} else {
console.log(` Legacy global context: None`);
}
const sessionsDir = path.join(CODEX_DIR, 'sessions');
if (existsSync(sessionsDir)) {
console.log(` Sessions directory: exists`);
} else {
console.log(` Sessions directory: not yet created (use Codex CLI to generate sessions)`);
}
console.log('');
return 0;
}
+80 -33
@@ -21,6 +21,61 @@ import { getSupervisor } from '../../supervisor/index.js';
import { isPidAlive } from '../../supervisor/process-registry.js';
import { ENV_PREFIXES, ENV_EXACT_MATCHES } from '../../supervisor/env-sanitizer.js';
/**
* Plan 06 Phase 6 — instruction content (SKILL.md + ALLOWED_OPERATIONS .md
* files) is read once at module init and held in memory for the lifetime of
* the worker process. Process restart is the cache-invalidation event.
*
* `SKILL.md` is held as the full UTF-8 string so `extractInstructionSection`
* can slice topic windows on every request without re-reading the file.
* Per-operation files are cached as a `Map<operation, content>`. Files that
are missing on disk are simply omitted from the map; the request handler returns
* 404 in that case (preserving legacy behaviour).
*/
const INSTRUCTIONS_BASE_DIR: string = path.resolve(__dirname, '../skills/mem-search');
const INSTRUCTIONS_OPERATIONS_DIR: string = path.join(INSTRUCTIONS_BASE_DIR, 'operations');
const INSTRUCTIONS_SKILL_PATH: string = path.join(INSTRUCTIONS_BASE_DIR, 'SKILL.md');
const cachedSkillMd: string | null = (() => {
try {
const text = fs.readFileSync(INSTRUCTIONS_SKILL_PATH, 'utf-8');
logger.info('SYSTEM', 'Cached SKILL.md at boot', {
path: INSTRUCTIONS_SKILL_PATH,
bytes: Buffer.byteLength(text, 'utf-8'),
});
return text;
} catch (error: unknown) {
logger.debug('SYSTEM', 'SKILL.md not present at boot, /api/instructions will 404 for topic queries', {
path: INSTRUCTIONS_SKILL_PATH,
message: error instanceof Error ? error.message : String(error),
});
return null;
}
})();
const cachedOperationContent: ReadonlyMap<string, string> = (() => {
const map = new Map<string, string>();
for (const operation of ALLOWED_OPERATIONS) {
const operationPath = path.join(INSTRUCTIONS_OPERATIONS_DIR, `${operation}.md`);
try {
map.set(operation, fs.readFileSync(operationPath, 'utf-8'));
} catch (error: unknown) {
// Missing operation files are non-fatal — 404 is returned per request.
logger.debug('SYSTEM', 'Operation instruction file not present at boot', {
path: operationPath,
message: error instanceof Error ? error.message : String(error),
});
}
}
if (map.size > 0) {
logger.info('SYSTEM', 'Cached operation instruction files at boot', {
count: map.size,
operations: Array.from(map.keys()),
});
}
return map;
})();
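The read-once pattern can be sketched outside the worker context; the directory layout and file names below are throwaway examples, not the real skill paths:

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Sketch: read a set of instruction files once at init, tolerate missing
// files, and expose the result as a ReadonlyMap. Process restart is the
// only cache-invalidation event, matching the worker's behaviour.
function loadInstructionCache(dir: string, names: readonly string[]): ReadonlyMap<string, string> {
  const map = new Map<string, string>();
  for (const name of names) {
    try {
      map.set(name, fs.readFileSync(path.join(dir, `${name}.md`), 'utf-8'));
    } catch {
      // Missing files are non-fatal; the request handler 404s instead.
    }
  }
  return map;
}

// Demo against a temporary directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'instr-'));
fs.writeFileSync(path.join(dir, 'search.md'), '# search');
const cache = loadInstructionCache(dir, ['search', 'missing-op']);
console.log(cache.has('search'), cache.has('missing-op'));
```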
// Build-time injected version constant (set by esbuild define)
declare const __DEFAULT_PACKAGE_VERSION__: string;
const BUILT_IN_VERSION = typeof __DEFAULT_PACKAGE_VERSION__ !== 'undefined'
@@ -94,11 +149,20 @@ export class Server {
*/
async listen(port: number, host: string): Promise<void> {
return new Promise<void>((resolve, reject) => {
this.server = this.app.listen(port, host, () => {
const server = http.createServer(this.app);
this.server = server;
const onError = (err: Error) => {
server.off('listening', onListening);
reject(err);
};
const onListening = () => {
server.off('error', onError);
logger.info('SYSTEM', 'HTTP server started', { host, port, pid: process.pid });
resolve();
});
this.server.on('error', reject);
};
server.once('error', onError);
server.once('listening', onListening);
server.listen(port, host);
});
}
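The once/off wiring above is what makes the promisification race-free: whichever of `'error'` / `'listening'` fires first settles the promise and detaches the loser, so a late event cannot invoke a stale callback. A sketch with a bare `EventEmitter` standing in for `http.Server` (the helper name `waitForListen` is invented for the example):

```typescript
import { EventEmitter } from 'node:events';

// Whichever event fires first settles the promise; the losing handler is
// removed so it can never fire against an already-settled promise.
function waitForListen(server: EventEmitter): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const onError = (err: Error) => {
      server.off('listening', onListening);
      reject(err);
    };
    const onListening = () => {
      server.off('error', onError);
      resolve();
    };
    server.once('error', onError);
    server.once('listening', onListening);
  });
}

// Demo: after 'listening' fires, the 'error' handler is already detached.
const fake = new EventEmitter();
const ready = waitForListen(fake);
fake.emit('listening');
ready.then(() => console.log('error handler detached:', fake.listenerCount('error') === 0));
```

Compare the old shape (`this.app.listen(port, host, cb)` plus a bare `.on('error', reject)`): the error listener was never removed, so an error after a successful listen could reject an already-resolved promise and leak the handler.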
@@ -198,8 +262,9 @@ export class Server {
res.status(200).json({ version: BUILT_IN_VERSION });
});
// Instructions endpoint - loads SKILL.md sections on-demand
this.app.get('/api/instructions', async (req: Request, res: Response) => {
// Instructions endpoint — Plan 06 Phase 6 — serves the cached SKILL.md /
// operations content loaded once at module init.
this.app.get('/api/instructions', (req: Request, res: Response) => {
const topic = (req.query.topic as string) || 'all';
const operation = req.query.operation as string | undefined;
@@ -213,24 +278,20 @@ export class Server {
}
if (operation) {
const OPERATIONS_BASE_DIR = path.resolve(__dirname, '../skills/mem-search/operations');
const operationPath = path.resolve(OPERATIONS_BASE_DIR, `${operation}.md`);
if (!operationPath.startsWith(OPERATIONS_BASE_DIR + path.sep)) {
return res.status(400).json({ error: 'Invalid request' });
const cached = cachedOperationContent.get(operation);
if (cached === undefined) {
logger.debug('HTTP', 'Instruction file not cached at boot', { operation });
return res.status(404).json({ error: 'Instruction not found' });
}
return res.json({ content: [{ type: 'text', text: cached }] });
}
try {
const content = await this.loadInstructionContent(operation, topic);
res.json({ content: [{ type: 'text', text: content }] });
} catch (error) {
if (error instanceof Error) {
logger.debug('HTTP', 'Instruction file not found', { topic, operation, message: error.message });
} else {
logger.debug('HTTP', 'Instruction file not found', { topic, operation, error: String(error) });
}
res.status(404).json({ error: 'Instruction not found' });
if (cachedSkillMd === null) {
logger.debug('HTTP', 'SKILL.md not cached at boot', { topic });
return res.status(404).json({ error: 'Instruction not found' });
}
const sectionText = this.extractInstructionSection(cachedSkillMd, topic);
res.json({ content: [{ type: 'text', text: sectionText }] });
});
// Admin endpoints for process management (localhost-only)
@@ -330,20 +391,6 @@ export class Server {
});
}
/**
* Load instruction content from disk for the /api/instructions endpoint.
* Caller must validate operation/topic before calling.
*/
private async loadInstructionContent(operation: string | undefined, topic: string): Promise<string> {
if (operation) {
const operationPath = path.resolve(__dirname, '../skills/mem-search/operations', `${operation}.md`);
return fs.promises.readFile(operationPath, 'utf-8');
}
const skillPath = path.join(__dirname, '../skills/mem-search/SKILL.md');
const fullContent = await fs.promises.readFile(skillPath, 'utf-8');
return this.extractInstructionSection(fullContent, topic);
}
/**
* Extract a specific section from instruction content
*/
-9
@@ -480,15 +480,6 @@ const QUERIES: Record<string, string> = {
(class_definition name: (identifier) @name) @cls
(import_statement) @imp
(import_declaration) @imp
`,
php: `
(function_definition name: (name) @name) @func
(method_declaration name: (name) @name) @method
(class_declaration name: (name) @name) @cls
(interface_declaration name: (name) @name) @iface
(trait_declaration name: (name) @name) @trait_def
(namespace_use_declaration) @imp
`,
};
-125
@@ -1,8 +1,4 @@
import { Database } from 'bun:sqlite';
import { execFileSync } from 'child_process';
import { existsSync, unlinkSync, writeFileSync } from 'fs';
import { tmpdir } from 'os';
import { join } from 'path';
import { DATA_DIR, DB_PATH, ensureDir } from '../../shared/paths.js';
import { logger } from '../../utils/logger.js';
import { MigrationRunner } from './migrations/runner.js';
@@ -19,118 +15,6 @@ export interface Migration {
let dbInstance: Database | null = null;
/**
* Repair malformed database schema before migrations run.
*
* This handles the case where a database is synced between machines running
* different claude-mem versions. A newer version may have added columns and
* indexes that an older version (or even the same version on a fresh install)
* cannot process. SQLite throws "malformed database schema" when it encounters
* an index referencing a non-existent column, which prevents ALL queries —
* including the migrations that would fix the schema.
*
* The fix: use Python's sqlite3 module (which supports writable_schema) to
* drop the orphaned schema objects, then let the migration system recreate
* them properly. bun:sqlite doesn't allow DELETE FROM sqlite_master even
* with writable_schema = ON.
*/
function repairMalformedSchema(db: Database): void {
try {
// Quick test: if we can query sqlite_master, the schema is fine
db.query('SELECT name FROM sqlite_master WHERE type = "table" LIMIT 1').all();
return;
} catch (error: unknown) {
const message = error instanceof Error ? error.message : String(error);
if (!message.includes('malformed database schema')) {
throw error;
}
logger.warn('DB', 'Detected malformed database schema, attempting repair', { error: message });
// Extract the problematic object name from the error message
// Format: "malformed database schema (object_name) - details"
const match = message.match(/malformed database schema \(([^)]+)\)/);
if (!match) {
logger.error('DB', 'Could not parse malformed schema error, cannot auto-repair', { error: message });
throw error;
}
const objectName = match[1];
logger.info('DB', `Dropping malformed schema object: ${objectName}`);
// Get the DB file path. For file-based DBs, we can use Python to repair.
// For in-memory DBs, we can't shell out — just re-throw.
const dbPath = db.filename;
if (!dbPath || dbPath === ':memory:' || dbPath === '') {
logger.error('DB', 'Cannot auto-repair in-memory database');
throw error;
}
// Close the connection so Python can safely modify the file
db.close();
// Use Python's sqlite3 module to drop the orphaned object and reset
// related migration versions so they re-run and recreate things properly.
// bun:sqlite doesn't support DELETE FROM sqlite_master even with writable_schema.
//
// We write a temp script rather than using -c to avoid shell escaping issues
// with paths containing spaces or special characters. execFileSync passes
// args directly without a shell, so dbPath and objectName are safe.
const scriptPath = join(tmpdir(), `claude-mem-repair-${Date.now()}.py`);
try {
writeFileSync(scriptPath, `
import sqlite3, sys
db_path = sys.argv[1]
obj_name = sys.argv[2]
c = sqlite3.connect(db_path)
c.execute('PRAGMA writable_schema = ON')
c.execute('DELETE FROM sqlite_master WHERE name = ?', (obj_name,))
c.execute('PRAGMA writable_schema = OFF')
# Reset migration versions so affected migrations re-run.
# Guard with existence check: schema_versions may not exist on a very fresh DB.
has_sv = c.execute(
"SELECT count(*) FROM sqlite_master WHERE type='table' AND name='schema_versions'"
).fetchone()[0]
if has_sv:
c.execute('DELETE FROM schema_versions')
c.commit()
c.close()
`);
execFileSync('python3', [scriptPath, dbPath, objectName], { timeout: 10000 });
logger.info('DB', `Dropped orphaned schema object "${objectName}" and reset migration versions via Python sqlite3. All migrations will re-run (they are idempotent).`);
} catch (pyError: unknown) {
const pyMessage = pyError instanceof Error ? pyError.message : String(pyError);
logger.error('DB', 'Python sqlite3 repair failed', { error: pyMessage });
throw new Error(`Schema repair failed: ${message}. Python repair error: ${pyMessage}`);
} finally {
if (existsSync(scriptPath)) unlinkSync(scriptPath);
}
}
}
/**
* Wrapper that handles the close/reopen cycle needed for schema repair.
* Returns a (possibly new) Database connection.
*/
function repairMalformedSchemaWithReopen(dbPath: string, db: Database): Database {
try {
db.query('SELECT name FROM sqlite_master WHERE type = "table" LIMIT 1').all();
return db;
} catch (error: unknown) {
const message = error instanceof Error ? error.message : String(error);
if (!message.includes('malformed database schema')) {
throw error;
}
// repairMalformedSchema closes the DB internally for Python access
repairMalformedSchema(db);
// Reopen and check for additional malformed objects
const newDb = new Database(dbPath, { create: true, readwrite: true });
return repairMalformedSchemaWithReopen(dbPath, newDb);
}
}
/**
* ClaudeMemDatabase - New entry point for the sqlite module
*
@@ -154,11 +38,6 @@ export class ClaudeMemDatabase {
// Create database connection
this.db = new Database(dbPath, { create: true, readwrite: true });
// Repair any malformed schema before applying settings or running migrations.
// Must happen first — even PRAGMA calls can fail on a corrupted schema.
// This may close and reopen the connection if repair is needed.
this.db = repairMalformedSchemaWithReopen(dbPath, this.db);
// Apply optimized SQLite settings
this.db.run('PRAGMA journal_mode = WAL');
this.db.run('PRAGMA synchronous = NORMAL');
@@ -218,10 +97,6 @@ export class DatabaseManager {
this.db = new Database(DB_PATH, { create: true, readwrite: true });
// Repair any malformed schema before applying settings or running migrations.
// Must happen first — even PRAGMA calls can fail on a corrupted schema.
this.db = repairMalformedSchemaWithReopen(DB_PATH, this.db);
// Apply optimized SQLite settings
this.db.run('PRAGMA journal_mode = WAL');
this.db.run('PRAGMA synchronous = NORMAL');
+157 -301
@@ -1,9 +1,18 @@
import { Database } from './sqlite-compat.js';
import { Database } from 'bun:sqlite';
import type { PendingMessage } from '../worker-types.js';
import { logger } from '../../utils/logger.js';
/** Messages processing longer than this are considered stale and reset to pending by self-healing */
const STALE_PROCESSING_THRESHOLD_MS = 60_000;
/**
* Provider for the set of currently-live worker PIDs.
*
* The self-healing claim query reclaims any 'processing' row whose
* worker_pid is NOT a live worker (crash recovery without a timer).
*
* Default: a single-worker process supplies just its own PID. Multi-worker
* deployments inject a callback backed by `supervisor/process-registry.ts`
* (`getSupervisor().getRegistry().getAll().filter(r => r.type === 'worker').map(r => r.pid)`).
*/
export type LiveWorkerPidsProvider = () => readonly number[];
/**
* Persistent pending message record from database
@@ -22,8 +31,8 @@ export interface PersistentPendingMessage {
status: 'pending' | 'processing' | 'processed' | 'failed';
retry_count: number;
created_at_epoch: number;
started_processing_at_epoch: number | null;
completed_at_epoch: number | null;
worker_pid: number | null;
// Claude Code subagent identity — NULL for main-session messages.
agent_type: string | null;
agent_id: string | null;
@@ -37,44 +46,76 @@ export interface PersistentPendingMessage {
*
* Lifecycle:
* 1. enqueue() - Message persisted with status 'pending'
* 2. claimNextMessage() - Atomically claims next pending message (marks as 'processing')
* 2. claimNextMessage() - Atomically claims next pending message (marks as 'processing'
* and stamps the live worker's PID). Self-healing: reclaims any 'processing' row
* whose worker_pid is no longer alive (worker crash) in the same UPDATE.
* 3. confirmProcessed() - Deletes message after successful processing
*
* Self-healing:
* - claimNextMessage() resets stale 'processing' messages (>60s) back to 'pending' before claiming
* - This eliminates stuck messages from generator crashes without external timers
*
* Recovery:
* - getSessionsWithPendingMessages() - Find sessions that need recovery on startup
* Self-healing semantics:
* A 'processing' row is reclaimable iff worker_pid IS NULL or worker_pid is
* not present in the live-pids list at claim time. No timer, no
* stale-cutoff timestamp — liveness is the truth.
*/
export class PendingMessageStore {
private db: Database;
private maxRetries: number;
private workerPid: number;
private getLiveWorkerPids: LiveWorkerPidsProvider;
constructor(db: Database, maxRetries: number = 3) {
/**
* @param db SQLite database
* @param maxRetries Per-message retry ceiling for transient SDK failures (default 3)
* @param workerPid PID of the worker that owns this store; stamped into worker_pid on claim.
* Defaults to process.pid so single-process deployments need no extra wiring.
* @param getLiveWorkerPids Provider for the set of all currently-live worker PIDs.
* Defaults to `[workerPid]` — only this worker is alive.
* Multi-worker deployments inject a supervisor-backed provider.
*/
constructor(
db: Database,
maxRetries: number = 3,
workerPid: number = process.pid,
getLiveWorkerPids?: LiveWorkerPidsProvider
) {
this.db = db;
this.maxRetries = maxRetries;
this.workerPid = workerPid;
this.getLiveWorkerPids = getLiveWorkerPids ?? (() => [this.workerPid]);
}
/**
* Enqueue a new message (persist before processing)
* @returns The database ID of the persisted message
* Enqueue a new message (persist before processing).
*
* Uses `INSERT OR IGNORE` so duplicate (content_session_id, tool_use_id)
* pairs collapse to a single row — the UNIQUE INDEX added in plan 01 phase 1
* is the authority on tool-use idempotency. Per principle 3 (UNIQUE
* constraint over dedup window), we don't time-gate duplicates.
*
* @returns The database ID of the persisted message, or 0 when the insert
* was suppressed by ON CONFLICT. Callers MUST guard with `id > 0`
* before threading the value into any subsequent SQL (e.g.
* `confirmProcessed`, `markFailed`, `processingMessageIds`) —
* a zero id would silently target zero rows. The only two call
* sites today (`SessionManager.queueObservation` and
* `queueSummarize`) use the id purely for logging and both
* branch on `messageId === 0`.
*/
enqueue(sessionDbId: number, contentSessionId: string, message: PendingMessage): number {
const now = Date.now();
const stmt = this.db.prepare(`
INSERT INTO pending_messages (
session_db_id, content_session_id, message_type,
INSERT OR IGNORE INTO pending_messages (
session_db_id, content_session_id, tool_use_id, message_type,
tool_name, tool_input, tool_response, cwd,
last_assistant_message,
prompt_number, status, retry_count, created_at_epoch,
agent_type, agent_id
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, 'pending', 0, ?, ?, ?)
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, 'pending', 0, ?, ?, ?)
`);
const result = stmt.run(
sessionDbId,
contentSessionId,
message.toolUseId ?? null,
message.type,
message.tool_name || null,
message.tool_input ? JSON.stringify(message.tool_input) : null,
@@ -90,58 +131,58 @@ export class PendingMessageStore {
return result.lastInsertRowid as number;
}
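The "id 0 means suppressed duplicate" contract can be sketched with an in-memory analogue of the UNIQUE index, no SQLite required; the class and key format are invented for the example:

```typescript
// In-memory analogue of UNIQUE(content_session_id, tool_use_id) plus
// INSERT OR IGNORE: a duplicate pair returns id 0, mirroring the
// "callers MUST guard with id > 0" contract described above. As in SQL,
// NULL tool_use_id rows never conflict (NULLs are distinct in a UNIQUE index).
class EnqueueSketch {
  private seen = new Set<string>();
  private nextId = 1;

  enqueue(contentSessionId: string, toolUseId: string | null): number {
    const key = `${contentSessionId}\u0000${toolUseId ?? ''}`;
    if (toolUseId !== null && this.seen.has(key)) return 0; // conflict suppressed
    this.seen.add(key);
    return this.nextId++;
  }
}

const q = new EnqueueSketch();
console.log(q.enqueue('sess-1', 'toolu_a')); // fresh pair: real row id
console.log(q.enqueue('sess-1', 'toolu_a')); // duplicate pair: collapses to 0
```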
/**
* Atomically claim the next pending message by marking it as 'processing'.
* Self-healing: resets any stale 'processing' messages (>60s) back to 'pending' first.
* Message stays in DB until confirmProcessed() is called.
* Uses a transaction to prevent race conditions.
/**
* Atomically claim the next message for `sessionDbId`.
*
* A row is claimable iff:
* - status = 'pending', OR
* - status = 'processing' AND worker_pid is not in the live-pids set
* (i.e. the previous owner crashed). This is the self-healing branch:
* liveness is checked at claim time, not by a background reaper.
*
* The claim stamps the live worker's PID and flips status to 'processing'
* in a single UPDATE … WHERE id = (subquery).
*/
claimNextMessage(sessionDbId: number): PersistentPendingMessage | null {
const claimTx = this.db.transaction((sessionId: number) => {
// Capture time inside transaction so it's fresh if WAL contention causes retry
const now = Date.now();
// Self-healing: reset stale 'processing' messages back to 'pending'
// This recovers from generator crashes without external timers
// Note: strict < means messages must be OLDER than threshold to be reset
const staleCutoff = now - STALE_PROCESSING_THRESHOLD_MS;
const resetStmt = this.db.prepare(`
UPDATE pending_messages
SET status = 'pending', started_processing_at_epoch = NULL
WHERE session_db_id = ? AND status = 'processing'
AND started_processing_at_epoch < ?
`);
const resetResult = resetStmt.run(sessionId, staleCutoff);
if (resetResult.changes > 0) {
logger.info('QUEUE', `SELF_HEAL | sessionDbId=${sessionId} | recovered ${resetResult.changes} stale processing message(s)`);
}
// Build a parameterized IN-list of live worker PIDs. We always include
// this worker's PID so that an in-flight claim doesn't accidentally
// self-reclaim a row we just stamped (the predicate is "NOT IN live").
const livePids = this.getLivePidsIncludingSelf();
const placeholders = livePids.map(() => '?').join(',');
const peekStmt = this.db.prepare(`
SELECT * FROM pending_messages
WHERE session_db_id = ? AND status = 'pending'
ORDER BY id ASC
LIMIT 1
`);
const msg = peekStmt.get(sessionId) as PersistentPendingMessage | null;
const sql = `
UPDATE pending_messages
SET status = 'processing',
worker_pid = ?
WHERE id = (
SELECT id FROM pending_messages
WHERE session_db_id = ?
AND (
status = 'pending'
OR (status = 'processing' AND (worker_pid IS NULL OR worker_pid NOT IN (${placeholders})))
)
ORDER BY id ASC
LIMIT 1
)
RETURNING *
`;
if (msg) {
// CRITICAL FIX: Mark as 'processing' instead of deleting
// Message will be deleted by confirmProcessed() after successful store
const updateStmt = this.db.prepare(`
UPDATE pending_messages
SET status = 'processing', started_processing_at_epoch = ?
WHERE id = ?
`);
updateStmt.run(now, msg.id);
const stmt = this.db.prepare(sql);
const params: (number | string)[] = [this.workerPid, sessionDbId, ...livePids];
const claimed = stmt.get(...params) as PersistentPendingMessage | null;
// Log claim with minimal info (avoid logging full payload)
logger.info('QUEUE', `CLAIMED | sessionDbId=${sessionId} | messageId=${msg.id} | type=${msg.message_type}`, {
sessionId: sessionId
});
}
return msg;
});
if (claimed) {
logger.info('QUEUE', `CLAIMED | sessionDbId=${sessionDbId} | messageId=${claimed.id} | type=${claimed.message_type} | workerPid=${this.workerPid}`, {
sessionId: sessionDbId
});
}
return claimed;
}
return claimTx(sessionDbId) as PersistentPendingMessage | null;
private getLivePidsIncludingSelf(): number[] {
const pids = this.getLiveWorkerPids();
if (pids.includes(this.workerPid)) return [...pids];
return [...pids, this.workerPid];
}
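The claimability predicate inside that UPDATE subquery is easy to state in plain TypeScript; the row shape below is simplified for illustration:

```typescript
// Simplified pending_messages row for illustrating the claim predicate.
interface Row {
  id: number;
  status: 'pending' | 'processing' | 'processed' | 'failed';
  worker_pid: number | null;
}

// A row is claimable iff it is pending, OR it is marked processing by a
// worker that is no longer alive (crash recovery at claim time, no timers).
function claimable(row: Row, livePids: readonly number[]): boolean {
  if (row.status === 'pending') return true;
  return row.status === 'processing' &&
    (row.worker_pid === null || !livePids.includes(row.worker_pid));
}

const live = [4242];
const rows: Row[] = [
  { id: 1, status: 'processing', worker_pid: 4242 }, // owned by a live worker: skip
  { id: 2, status: 'processing', worker_pid: 1111 }, // owner crashed: reclaim
  { id: 3, status: 'pending', worker_pid: null },    // normal pending work
];
// The SQL claims the lowest claimable id (ORDER BY id ASC LIMIT 1).
console.log(rows.filter(r => claimable(r, live)).map(r => r.id));
```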
/**
@@ -158,34 +199,19 @@ export class PendingMessageStore {
}
/**
* Reset stale 'processing' messages back to 'pending' for retry.
* Called on worker startup and periodically to recover from crashes.
* @param thresholdMs Messages processing longer than this are considered stale (default: 5 minutes)
* @returns Number of messages reset
* Delete `status='failed'` rows older than `thresholdMs`. Called once at
* worker startup so `pending_messages` does not grow unbounded on long-
* running or high-failure-rate installations; `claimNextMessage`'s
* self-healing subquery scans this table, so bounded rows keep claim
* latency predictable. Not a reaper — one-shot, idempotent.
*/
resetStaleProcessingMessages(thresholdMs: number = 5 * 60 * 1000, sessionDbId?: number): number {
clearFailedOlderThan(thresholdMs: number): number {
const cutoff = Date.now() - thresholdMs;
let stmt;
let result;
if (sessionDbId !== undefined) {
stmt = this.db.prepare(`
UPDATE pending_messages
SET status = 'pending', started_processing_at_epoch = NULL
WHERE status = 'processing' AND started_processing_at_epoch < ? AND session_db_id = ?
`);
result = stmt.run(cutoff, sessionDbId);
} else {
stmt = this.db.prepare(`
UPDATE pending_messages
SET status = 'pending', started_processing_at_epoch = NULL
WHERE status = 'processing' AND started_processing_at_epoch < ?
`);
result = stmt.run(cutoff);
}
if (result.changes > 0) {
logger.info('QUEUE', `RESET_STALE | count=${result.changes} | thresholdMs=${thresholdMs}${sessionDbId !== undefined ? ` | sessionDbId=${sessionDbId}` : ''}`);
}
return result.changes;
const stmt = this.db.prepare(`
DELETE FROM pending_messages
WHERE status = 'failed' AND COALESCE(failed_at_epoch, completed_at_epoch, 0) < ?
`);
return stmt.run(cutoff).changes;
}
/**
@@ -201,144 +227,44 @@ export class PendingMessageStore {
}
/**
* Get all queue messages (for UI display)
* Returns pending, processing, and failed messages (not processed - they're deleted)
* Joins with sdk_sessions to get project name
* Transition pending_messages rows to a terminal status — PATHFINDER-2026-04-22
* Plan 06 Phase 9. One SQL UPDATE path, one place to add a new terminal status
* later, zero divergence between call sites.
*
* - `failed` — narrow form: only rows currently `status='processing'`.
* Used during error recovery when a session generator crashes and we want
* to mark its in-flight messages failed without touching rows that never
* left `pending`.
*
* - `abandoned` — wide form: rows in `('pending', 'processing')`.
* Used during session termination or completion drain so the session
* doesn't appear in `getSessionsWithPendingMessages` forever. Both forms
* write the row's `status` column to `'failed'`; `abandoned` is just the
* broader WHERE clause.
*
* Per Principle 6 (one helper, N callers) and Principle 7 (the
* old per-status wrapper methods were deleted in the same PR).
*
* @param status `'failed'` (processing-only) or `'abandoned'` (pending+processing)
* @param filter `{ sessionDbId: number }` — scope to one session's rows.
* Required: no unscoped path exists, to prevent accidental global drain.
* @returns Number of rows updated
*/
getQueueMessages(): (PersistentPendingMessage & { project: string | null })[] {
const stmt = this.db.prepare(`
SELECT pm.*, ss.project
FROM pending_messages pm
LEFT JOIN sdk_sessions ss ON pm.content_session_id = ss.content_session_id
WHERE pm.status IN ('pending', 'processing', 'failed')
ORDER BY
CASE pm.status
WHEN 'failed' THEN 0
WHEN 'processing' THEN 1
WHEN 'pending' THEN 2
END,
pm.created_at_epoch ASC
`);
return stmt.all() as (PersistentPendingMessage & { project: string | null })[];
}
/**
* Get count of stuck messages (processing longer than threshold)
*/
getStuckCount(thresholdMs: number): number {
const cutoff = Date.now() - thresholdMs;
const stmt = this.db.prepare(`
SELECT COUNT(*) as count FROM pending_messages
WHERE status = 'processing' AND started_processing_at_epoch < ?
`);
const result = stmt.get(cutoff) as { count: number };
return result.count;
}
/**
* Retry a specific message (reset to pending)
* Works for pending (re-queue), processing (reset stuck), and failed messages
*/
retryMessage(messageId: number): boolean {
const stmt = this.db.prepare(`
UPDATE pending_messages
SET status = 'pending', started_processing_at_epoch = NULL
WHERE id = ? AND status IN ('pending', 'processing', 'failed')
`);
const result = stmt.run(messageId);
return result.changes > 0;
}
/**
* Reset all processing messages for a session to pending
* Used when force-restarting a stuck session
*/
resetProcessingToPending(sessionDbId: number): number {
const stmt = this.db.prepare(`
UPDATE pending_messages
SET status = 'pending', started_processing_at_epoch = NULL
WHERE session_db_id = ? AND status = 'processing'
`);
const result = stmt.run(sessionDbId);
return result.changes;
}
/**
* Mark all processing messages for a session as failed
* Used in error recovery when session generator crashes
* @returns Number of messages marked failed
*/
markSessionMessagesFailed(sessionDbId: number): number {
transitionMessagesTo(
status: 'failed' | 'abandoned',
filter: { sessionDbId: number }
): number {
const now = Date.now();
const statusClause = status === 'failed'
? `status = 'processing'`
: `status IN ('pending', 'processing')`;
// Atomic update - all processing messages for session → failed
// Note: This bypasses retry logic since generator failures are session-level,
// not message-level. Individual message failures use markFailed() instead.
const stmt = this.db.prepare(`
UPDATE pending_messages
SET status = 'failed', failed_at_epoch = ?
WHERE session_db_id = ? AND status = 'processing'
WHERE session_db_id = ? AND ${statusClause}
`);
const result = stmt.run(now, sessionDbId);
return result.changes;
}
/**
* Abort a specific message (delete from queue)
*/
abortMessage(messageId: number): boolean {
const stmt = this.db.prepare('DELETE FROM pending_messages WHERE id = ?');
const result = stmt.run(messageId);
return result.changes > 0;
}
/**
* Retry all stuck messages at once
*/
retryAllStuck(thresholdMs: number): number {
const cutoff = Date.now() - thresholdMs;
const stmt = this.db.prepare(`
UPDATE pending_messages
SET status = 'pending', started_processing_at_epoch = NULL
WHERE status = 'processing' AND started_processing_at_epoch < ?
`);
const result = stmt.run(cutoff);
return result.changes;
}
/**
* Get recently processed messages (for UI feedback)
* Shows messages completed in the last N minutes so users can see their stuck items were processed
*/
getRecentlyProcessed(limit: number = 10, withinMinutes: number = 30): (PersistentPendingMessage & { project: string | null })[] {
const cutoff = Date.now() - (withinMinutes * 60 * 1000);
const stmt = this.db.prepare(`
SELECT pm.*, ss.project
FROM pending_messages pm
LEFT JOIN sdk_sessions ss ON pm.content_session_id = ss.content_session_id
WHERE pm.status = 'processed' AND pm.completed_at_epoch > ?
ORDER BY pm.completed_at_epoch DESC
LIMIT ?
`);
return stmt.all(cutoff, limit) as (PersistentPendingMessage & { project: string | null })[];
}
/**
@@ -358,7 +284,7 @@ export class PendingMessageStore {
// Move back to pending for retry
const stmt = this.db.prepare(`
UPDATE pending_messages
SET status = 'pending', retry_count = retry_count + 1, worker_pid = NULL
WHERE id = ?
`);
stmt.run(messageId);
@@ -373,24 +299,6 @@ export class PendingMessageStore {
}
}
/**
* Get count of pending messages for a session
*/
@@ -417,27 +325,21 @@ export class PendingMessageStore {
}
/**
* Check if any session has work that could be claimed right now.
*
* Counts a row as work iff it is 'pending' or it is 'processing' under a
* worker_pid that is not currently alive (the same predicate the
* self-healing claim uses). No side effects — no UPDATE, no timer.
*/
hasAnyPendingWork(): boolean {
const livePids = this.getLivePidsIncludingSelf();
const placeholders = livePids.map(() => '?').join(',');
const stmt = this.db.prepare(`
SELECT COUNT(*) as count FROM pending_messages
WHERE status = 'pending'
OR (status = 'processing' AND (worker_pid IS NULL OR worker_pid NOT IN (${placeholders})))
`);
const result = stmt.get(...livePids) as { count: number };
return result.count > 0;
}
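The liveness predicate above is built from a dynamic placeholder list. A sketch of just that construction (function name hypothetical), assuming the caller's PID list always includes the current process and is therefore never empty, since `NOT IN ()` is a syntax error in SQLite:

```typescript
// Build the live-PID predicate used by the side-effect-free pending-work check.
function pendingWorkPredicate(livePids: number[]): string {
  if (livePids.length === 0) {
    throw new Error('livePids must include at least the current process PID');
  }
  // One '?' placeholder per live PID, bound at query time.
  const placeholders = livePids.map(() => '?').join(',');
  return `status = 'pending'
OR (status = 'processing' AND (worker_pid IS NULL OR worker_pid NOT IN (${placeholders})))`;
}
```

The `worker_pid IS NULL` arm matters: `NULL NOT IN (...)` evaluates to NULL (not true) in SQL, so freshly migrated rows would otherwise never count as claimable.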
@@ -464,52 +366,6 @@ export class PendingMessageStore {
return result ? { sessionDbId: result.session_db_id, contentSessionId: result.content_session_id } : null;
}
/**
* Convert a PersistentPendingMessage back to PendingMessage format
*/
@@ -25,13 +25,14 @@ export class SessionSearch {
private static readonly MISSING_SEARCH_INPUT_MESSAGE = 'Either query or filters required for search';
constructor(dbPathOrDb: string | Database = DB_PATH) {
if (dbPathOrDb instanceof Database) {
this.db = dbPathOrDb;
} else {
ensureDir(DATA_DIR);
this.db = new Database(dbPathOrDb);
this.db.run('PRAGMA journal_mode = WAL');
}
// Cache FTS5 availability once at construction (avoids DDL probe on every query)
this._fts5Available = this.isFts5Available();
@@ -1,4 +1,4 @@
import { Database, type SQLQueryBindings } from 'bun:sqlite';
import { DATA_DIR, DB_PATH, ensureDir, OBSERVER_SESSIONS_PROJECT } from '../../shared/paths.js';
import { logger } from '../../utils/logger.js';
import {
@@ -13,7 +13,8 @@ import {
LatestPromptResult
} from '../../types/database.js';
import type { PendingMessageStore } from './PendingMessageStore.js';
import type { ObservationSearchResult, SessionSummarySearchResult } from './types.js';
import { computeObservationContentHash } from './observations/store.js';
import { parseFileList } from './observations/files.js';
import { DEFAULT_PLATFORM_SOURCE, normalizePlatformSource, sortPlatformSources } from '../../shared/platform-source.js';
@@ -34,17 +35,21 @@ function resolveCreateSessionArgs(
export class SessionStore {
public db: Database;
constructor(dbPathOrDb: string | Database = DB_PATH) {
if (dbPathOrDb instanceof Database) {
this.db = dbPathOrDb;
} else {
if (dbPathOrDb !== ':memory:') {
ensureDir(DATA_DIR);
}
this.db = new Database(dbPathOrDb);
// Ensure optimized settings only for new connections
this.db.run('PRAGMA journal_mode = WAL');
this.db.run('PRAGMA synchronous = NORMAL');
this.db.run('PRAGMA foreign_keys = ON');
this.db.run('PRAGMA journal_size_limit = 4194304'); // 4MB WAL cap (#1956)
}
// Initialize schema if needed (fresh database)
this.initializeSchema();
@@ -68,6 +73,7 @@ export class SessionStore {
this.addObservationModelColumns();
this.ensureMergedIntoProjectColumns();
this.addObservationSubagentColumns();
this.addObservationsUniqueContentHashIndex();
}
/**
@@ -565,7 +571,6 @@ export class SessionStore {
status TEXT NOT NULL DEFAULT 'pending' CHECK(status IN ('pending', 'processing', 'processed', 'failed')),
retry_count INTEGER NOT NULL DEFAULT 0,
created_at_epoch INTEGER NOT NULL,
completed_at_epoch INTEGER,
FOREIGN KEY (session_db_id) REFERENCES sdk_sessions(id) ON DELETE CASCADE
)
@@ -661,7 +666,7 @@ export class SessionStore {
/**
* Add failed_at_epoch column to pending_messages (migration 20)
* Used by transitionMessagesTo() for error recovery tracking
*/
private addFailedAtEpochColumn(): void {
const applied = this.db.prepare('SELECT version FROM schema_versions WHERE version = ?').get(20) as SchemaVersion | undefined;
@@ -1033,6 +1038,47 @@ export class SessionStore {
}
}
/**
* Add UNIQUE(memory_session_id, content_hash) on observations (migration 29).
* Mirrors MigrationRunner.addObservationsUniqueContentHashIndex so bundled
* artifacts that embed SessionStore (e.g. worker-service.cjs, context-generator.cjs)
* stay schema-consistent. Without this, INSERT … ON CONFLICT(memory_session_id,
* content_hash) DO NOTHING throws "ON CONFLICT clause does not match any
* PRIMARY KEY or UNIQUE constraint" and every observation insert fails.
*/
private addObservationsUniqueContentHashIndex(): void {
const applied = this.db.prepare('SELECT version FROM schema_versions WHERE version = ?').get(29) as SchemaVersion | undefined;
if (applied) return;
const obsCols = this.db.query('PRAGMA table_info(observations)').all() as TableColumnInfo[];
const hasMem = obsCols.some(c => c.name === 'memory_session_id');
const hasHash = obsCols.some(c => c.name === 'content_hash');
if (!hasMem || !hasHash) {
this.db.prepare('INSERT OR IGNORE INTO schema_versions (version, applied_at) VALUES (?, ?)').run(29, new Date().toISOString());
return;
}
this.db.run('BEGIN TRANSACTION');
try {
this.db.run(`
DELETE FROM observations
WHERE id NOT IN (
SELECT MIN(id) FROM observations
GROUP BY memory_session_id, content_hash
)
`);
this.db.run(`
CREATE UNIQUE INDEX IF NOT EXISTS ux_observations_session_hash
ON observations(memory_session_id, content_hash)
`);
this.db.prepare('INSERT OR IGNORE INTO schema_versions (version, applied_at) VALUES (?, ?)').run(29, new Date().toISOString());
this.db.run('COMMIT');
} catch (error) {
this.db.run('ROLLBACK');
throw error;
}
}
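The DELETE in the migration above keeps only MIN(id) per (memory_session_id, content_hash) pair before the UNIQUE index lands. The same keep-oldest rule, sketched in plain TypeScript (names hypothetical) so the survivor set is easy to reason about:

```typescript
interface ObsRow {
  id: number;
  memorySessionId: string;
  contentHash: string;
}

// Mirror of the migration's DELETE ... WHERE id NOT IN (SELECT MIN(id) ... GROUP BY ...):
// for rows sharing (memory_session_id, content_hash), only the lowest id survives.
function survivors(rows: ObsRow[]): ObsRow[] {
  const keep = new Map<string, ObsRow>();
  for (const row of rows) {
    const key = `${row.memorySessionId}\u0000${row.contentHash}`;
    const current = keep.get(key);
    if (!current || row.id < current.id) keep.set(key, row);
  }
  return [...keep.values()].sort((a, b) => a.id - b.id);
}
```

Keeping the lowest id preserves the oldest copy, which is the row other tables are most likely to reference.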
/**
* Update the memory session ID for a session
* Called by SDKAgent when it captures the session ID from the first SDK message
@@ -1112,7 +1158,18 @@ export class SessionStore {
LIMIT ?
`);
return stmt.all(project, limit) as Array<{
request: string | null;
investigated: string | null;
learned: string | null;
completed: string | null;
next_steps: string | null;
files_read: string | null;
files_edited: string | null;
notes: string | null;
prompt_number: number | null;
created_at: string;
}>;
}
/**
@@ -1137,7 +1194,15 @@ export class SessionStore {
LIMIT ?
`);
return stmt.all(project, limit) as Array<{
memory_session_id: string;
request: string | null;
learned: string | null;
completed: string | null;
next_steps: string | null;
prompt_number: number | null;
created_at: string;
}>;
}
/**
@@ -1157,7 +1222,12 @@ export class SessionStore {
LIMIT ?
`);
return stmt.all(project, limit) as Array<{
type: string;
text: string;
prompt_number: number | null;
created_at: string;
}>;
}
/**
@@ -1193,7 +1263,18 @@ export class SessionStore {
LIMIT ?
`);
return stmt.all(limit) as Array<{
id: number;
type: string;
title: string | null;
subtitle: string | null;
text: string;
project: string;
platform_source: string;
prompt_number: number | null;
created_at: string;
created_at_epoch: number;
}>;
}
/**
@@ -1237,7 +1318,22 @@ export class SessionStore {
LIMIT ?
`);
return stmt.all(limit) as Array<{
id: number;
request: string | null;
investigated: string | null;
learned: string | null;
completed: string | null;
next_steps: string | null;
files_read: string | null;
files_edited: string | null;
notes: string | null;
project: string;
platform_source: string;
prompt_number: number | null;
created_at: string;
created_at_epoch: number;
}>;
}
/**
@@ -1269,7 +1365,16 @@ export class SessionStore {
LIMIT ?
`);
return stmt.all(limit) as Array<{
id: number;
content_session_id: string;
project: string;
platform_source: string;
prompt_number: number;
prompt_text: string;
created_at: string;
created_at_epoch: number;
}>;
}
/**
@@ -1283,7 +1388,7 @@ export class SessionStore {
WHERE project IS NOT NULL AND project != ''
AND project != ?
`;
const params: SQLQueryBindings[] = [OBSERVER_SESSIONS_PROJECT];
if (normalizedPlatformSource) {
query += ' AND COALESCE(platform_source, ?) = ?';
@@ -1404,7 +1509,13 @@ export class SessionStore {
ORDER BY started_at_epoch ASC
`);
return stmt.all(project, limit) as Array<{
memory_session_id: string | null;
status: string;
started_at: string;
user_prompt: string | null;
has_summary: boolean;
}>;
}
/**
@@ -1423,7 +1534,12 @@ export class SessionStore {
ORDER BY created_at_epoch ASC
`);
return stmt.all(memorySessionId) as Array<{
title: string;
subtitle: string;
type: string;
prompt_number: number | null;
}>;
}
/**
@@ -1445,7 +1561,7 @@ export class SessionStore {
getObservationsByIds(
ids: number[],
options: { orderBy?: 'date_desc' | 'date_asc'; limit?: number; project?: string; type?: string | string[]; concepts?: string | string[]; files?: string | string[] } = {}
): ObservationSearchResult[] {
if (ids.length === 0) return [];
const { orderBy = 'date_desc', limit, project, type, concepts, files } = options;
@@ -1509,7 +1625,7 @@ export class SessionStore {
${limitClause}
`);
return stmt.all(...params) as ObservationSearchResult[];
}
/**
@@ -1539,7 +1655,19 @@ export class SessionStore {
LIMIT 1
`);
return (stmt.get(memorySessionId) as {
request: string | null;
investigated: string | null;
learned: string | null;
completed: string | null;
next_steps: string | null;
files_read: string | null;
files_edited: string | null;
notes: string | null;
prompt_number: number | null;
created_at: string;
created_at_epoch: number;
} | null) || null;
}
/**
@@ -1599,7 +1727,16 @@ export class SessionStore {
LIMIT 1
`);
return (stmt.get(id) as {
id: number;
content_session_id: string;
memory_session_id: string | null;
project: string;
platform_source: string;
user_prompt: string;
custom_title: string | null;
status: string;
} | null) || null;
}
/**
@@ -1805,12 +1942,9 @@ export class SessionStore {
const timestampEpoch = overrideTimestampEpoch ?? Date.now();
const timestampIso = new Date(timestampEpoch).toISOString();
// DB-enforced dedup: UNIQUE(memory_session_id, content_hash) +
// ON CONFLICT DO NOTHING (Plan 01 Phase 4).
const contentHash = computeObservationContentHash(memorySessionId, observation.title, observation.narrative);
const stmt = this.db.prepare(`
INSERT INTO observations
@@ -1818,9 +1952,11 @@ export class SessionStore {
files_read, files_modified, prompt_number, discovery_tokens, agent_type, agent_id, content_hash, created_at, created_at_epoch,
generated_by_model)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(memory_session_id, content_hash) DO NOTHING
RETURNING id, created_at_epoch
`);
const inserted = stmt.get(
memorySessionId,
project,
observation.type,
@@ -1839,12 +1975,22 @@ export class SessionStore {
timestampIso,
timestampEpoch,
generatedByModel || null
) as { id: number; created_at_epoch: number } | null;
if (inserted) {
return { id: inserted.id, createdAtEpoch: inserted.created_at_epoch };
}
const existing = this.db.prepare(
'SELECT id, created_at_epoch FROM observations WHERE memory_session_id = ? AND content_hash = ?'
).get(memorySessionId, contentHash) as { id: number; created_at_epoch: number } | null;
if (!existing) {
throw new Error(
`storeObservation: ON CONFLICT without existing row for content_hash=${contentHash}`
);
}
return { id: existing.id, createdAtEpoch: existing.created_at_epoch };
}
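`computeObservationContentHash` lives in `observations/store.ts` and its body is not shown in this hunk. A hypothetical stand-in using `node:crypto` illustrates the identity it hashes, assuming a SHA-256 digest truncated to a short token (the real function may differ in algorithm and length):

```typescript
import { createHash } from 'node:crypto';

// Hypothetical stand-in for computeObservationContentHash: hash the semantic
// identity of an observation, (memory_session_id, title, narrative), into a
// short stable token suitable for the UNIQUE(memory_session_id, content_hash) index.
function contentHashSketch(memorySessionId: string, title: string, narrative: string): string {
  return createHash('sha256')
    .update(`${memorySessionId}\u0000${title}\u0000${narrative}`)
    .digest('hex')
    .slice(0, 16);
}
```

Any stable digest over the same three fields works here; what matters for the ON CONFLICT path is determinism, so identical observations from a retried message collapse onto the same row.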
/**
@@ -1950,25 +2096,25 @@ export class SessionStore {
const storeTx = this.db.transaction(() => {
const observationIds: number[] = [];
// 1. Store all observations.
// DB-enforced dedup via UNIQUE(memory_session_id, content_hash) +
// ON CONFLICT DO NOTHING (Plan 01 Phase 4).
const obsStmt = this.db.prepare(`
INSERT INTO observations
(memory_session_id, project, type, title, subtitle, facts, narrative, concepts,
files_read, files_modified, prompt_number, discovery_tokens, agent_type, agent_id, content_hash, created_at, created_at_epoch,
generated_by_model)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(memory_session_id, content_hash) DO NOTHING
RETURNING id
`);
const lookupExistingStmt = this.db.prepare(
'SELECT id FROM observations WHERE memory_session_id = ? AND content_hash = ?'
);
for (const observation of observations) {
const contentHash = computeObservationContentHash(memorySessionId, observation.title, observation.narrative);
const inserted = obsStmt.get(
memorySessionId,
project,
observation.type,
@@ -1987,8 +2133,20 @@ export class SessionStore {
timestampIso,
timestampEpoch,
generatedByModel || null
) as { id: number } | null;
if (inserted) {
observationIds.push(inserted.id);
continue;
}
const existing = lookupExistingStmt.get(memorySessionId, contentHash) as { id: number } | null;
if (!existing) {
throw new Error(
`storeObservations: ON CONFLICT without existing row for content_hash=${contentHash}`
);
}
observationIds.push(existing.id);
}
// 2. Store summary if provided
@@ -2086,25 +2244,25 @@ export class SessionStore {
const storeAndMarkTx = this.db.transaction(() => {
const observationIds: number[] = [];
// 1. Store all observations.
// DB-enforced dedup via UNIQUE(memory_session_id, content_hash) +
// ON CONFLICT DO NOTHING (Plan 01 Phase 4).
const obsStmt = this.db.prepare(`
INSERT INTO observations
(memory_session_id, project, type, title, subtitle, facts, narrative, concepts,
files_read, files_modified, prompt_number, discovery_tokens, agent_type, agent_id, content_hash, created_at, created_at_epoch,
generated_by_model)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(memory_session_id, content_hash) DO NOTHING
RETURNING id
`);
const lookupExistingStmt = this.db.prepare(
'SELECT id FROM observations WHERE memory_session_id = ? AND content_hash = ?'
);
for (const observation of observations) {
const contentHash = computeObservationContentHash(memorySessionId, observation.title, observation.narrative);
const inserted = obsStmt.get(
memorySessionId,
project,
observation.type,
@@ -2123,8 +2281,20 @@ export class SessionStore {
timestampIso,
timestampEpoch,
generatedByModel || null
) as { id: number } | null;
if (inserted) {
observationIds.push(inserted.id);
continue;
}
const existing = lookupExistingStmt.get(memorySessionId, contentHash) as { id: number } | null;
if (!existing) {
throw new Error(
`storeObservationsAndMarkComplete: ON CONFLICT without existing row for content_hash=${contentHash}`
);
}
observationIds.push(existing.id);
}
// 2. Store summary if provided
@@ -2177,11 +2347,6 @@ export class SessionStore {
// REMOVED: cleanupOrphanedSessions - violates "EVERYTHING SHOULD SAVE ALWAYS"
// There's no such thing as an "orphaned" session. Sessions are created by hooks
// and managed by Claude Code's lifecycle. Worker restarts don't invalidate them.
// Marking all active sessions as 'failed' on startup destroys the user's current work.
/**
* Get session summaries by IDs (for hybrid Chroma search)
* Returns summaries in specified temporal order
@@ -2189,7 +2354,7 @@ export class SessionStore {
getSessionSummariesByIds(
ids: number[],
options: { orderBy?: 'date_desc' | 'date_asc'; limit?: number; project?: string } = {}
): SessionSummarySearchResult[] {
if (ids.length === 0) return [];
const { orderBy = 'date_desc', limit, project } = options;
@@ -2211,7 +2376,7 @@ export class SessionStore {
${limitClause}
`);
return stmt.all(...params) as SessionSummarySearchResult[];
}
/**
@@ -2443,7 +2608,15 @@ export class SessionStore {
LIMIT 1
`);
return (stmt.get(id) as {
id: number;
content_session_id: string;
prompt_number: number;
prompt_text: string;
project: string;
created_at: string;
created_at_epoch: number;
} | null) || null;
}
/**
@@ -2519,7 +2692,18 @@ export class SessionStore {
LIMIT 1
`);
return (stmt.get(id) as {
id: number;
memory_session_id: string | null;
content_session_id: string;
project: string;
user_prompt: string;
request_summary: string | null;
learned_summary: string | null;
status: string;
created_at: string;
created_at_epoch: number;
} | null) || null;
}
/**
@@ -30,7 +30,6 @@ export class MigrationRunner {
this.ensureDiscoveryTokensColumn();
this.createPendingMessagesTable();
this.renameSessionIdColumns();
this.addFailedAtEpochColumn();
this.addOnUpdateCascadeToForeignKeys();
this.addObservationContentHashColumn();
@@ -39,6 +38,8 @@ export class MigrationRunner {
this.addSessionPlatformSourceColumn();
this.ensureMergedIntoProjectColumns();
this.addObservationSubagentColumns();
this.rebuildPendingMessagesForSelfHealingClaim();
this.addObservationsUniqueContentHashIndex();
}
/**
@@ -533,7 +534,6 @@ export class MigrationRunner {
status TEXT NOT NULL DEFAULT 'pending' CHECK(status IN ('pending', 'processing', 'processed', 'failed')),
retry_count INTEGER NOT NULL DEFAULT 0,
created_at_epoch INTEGER NOT NULL,
completed_at_epoch INTEGER,
FOREIGN KEY (session_db_id) REFERENCES sdk_sessions(id) ON DELETE CASCADE
)
@@ -613,23 +613,9 @@ export class MigrationRunner {
}
}
/**
* Add failed_at_epoch column to pending_messages (migration 20)
* Used by transitionMessagesTo() for error recovery tracking
*/
private addFailedAtEpochColumn(): void {
const applied = this.db.prepare('SELECT version FROM schema_versions WHERE version = ?').get(20) as SchemaVersion | undefined;
@@ -1015,4 +1001,207 @@ export class MigrationRunner {
this.db.prepare('INSERT OR IGNORE INTO schema_versions (version, applied_at) VALUES (?, ?)').run(27, new Date().toISOString());
}
}
/**
* Rebuild pending_messages for self-healing claim (migration 28).
*
* PATHFINDER-2026-04-22 Plan 01 Phase 2.
*
* - Drops the legacy stale-reset epoch column (was the input to the
* 60-s stale-reset; replaced by worker-PID liveness at claim time).
* - Adds `worker_pid INTEGER` (set by claimNextMessage to the live
* worker's PID; rows whose worker_pid is no longer alive are
* immediately reclaimable).
* - Adds `tool_use_id TEXT` so ingestion-time pairing of tool_use
* tool_result can be DB-backed instead of an in-memory Map
* (Plan 03 dependency).
* - Dedupes any existing rows that share (content_session_id,
* tool_use_id), then creates a partial UNIQUE index.
*
* Follows the table-rebuild precedent at runner.ts:691 (migration 21):
* disable FKs, BEGIN, recreate, INSERT-SELECT, RENAME, COMMIT, re-enable.
*/
private rebuildPendingMessagesForSelfHealingClaim(): void {
const applied = this.db.prepare('SELECT version FROM schema_versions WHERE version = ?').get(28) as SchemaVersion | undefined;
if (applied) return;
const pendingExists = (this.db.query("SELECT name FROM sqlite_master WHERE type='table' AND name='pending_messages'").all() as TableNameRow[]).length > 0;
if (!pendingExists) {
// pending_messages table never created on this DB — nothing to rebuild.
this.db.prepare('INSERT OR IGNORE INTO schema_versions (version, applied_at) VALUES (?, ?)').run(28, new Date().toISOString());
return;
}
logger.debug('DB', 'Rebuilding pending_messages for self-healing claim (migration 28)');
// PRAGMA foreign_keys must be set outside a transaction.
this.db.run('PRAGMA foreign_keys = OFF');
this.db.run('BEGIN TRANSACTION');
try {
// Source columns may include legacy fields. We build the SELECT explicitly
// using only columns we know are present in the source after migration 27.
const sourceCols = this.db.query('PRAGMA table_info(pending_messages)').all() as TableColumnInfo[];
const colNames = new Set(sourceCols.map(c => c.name));
const has = (name: string) => colNames.has(name);
// Clean up leftover temp from a previously-crashed run.
this.db.run('DROP TABLE IF EXISTS pending_messages_new');
this.db.run(`
CREATE TABLE pending_messages_new (
id INTEGER PRIMARY KEY AUTOINCREMENT,
session_db_id INTEGER NOT NULL,
content_session_id TEXT NOT NULL,
tool_use_id TEXT,
message_type TEXT NOT NULL CHECK(message_type IN ('observation', 'summarize')),
tool_name TEXT,
tool_input TEXT,
tool_response TEXT,
cwd TEXT,
last_user_message TEXT,
last_assistant_message TEXT,
prompt_number INTEGER,
status TEXT NOT NULL DEFAULT 'pending'
CHECK(status IN ('pending', 'processing', 'processed', 'failed')),
retry_count INTEGER NOT NULL DEFAULT 0,
created_at_epoch INTEGER NOT NULL,
failed_at_epoch INTEGER,
completed_at_epoch INTEGER,
worker_pid INTEGER,
agent_type TEXT,
agent_id TEXT,
FOREIGN KEY (session_db_id) REFERENCES sdk_sessions(id) ON DELETE CASCADE
)
`);
// INSERT-SELECT — note that the legacy stale-reset epoch column is
// intentionally omitted. Any 'processing' row is left with worker_pid =
// NULL so that a self-healing claim picks it up immediately on next
// worker boot.
this.db.run(`
INSERT INTO pending_messages_new (
id, session_db_id, content_session_id, tool_use_id, message_type,
tool_name, tool_input, tool_response, cwd, last_user_message,
last_assistant_message, prompt_number, status, retry_count,
created_at_epoch, failed_at_epoch, completed_at_epoch, worker_pid,
agent_type, agent_id
)
SELECT
id,
session_db_id,
content_session_id,
${has('tool_use_id') ? 'tool_use_id' : 'NULL'},
message_type,
tool_name,
tool_input,
tool_response,
cwd,
${has('last_user_message') ? 'last_user_message' : 'NULL'},
${has('last_assistant_message') ? 'last_assistant_message' : 'NULL'},
${has('prompt_number') ? 'prompt_number' : 'NULL'},
status,
retry_count,
created_at_epoch,
${has('failed_at_epoch') ? 'failed_at_epoch' : 'NULL'},
${has('completed_at_epoch') ? 'completed_at_epoch' : 'NULL'},
NULL,
${has('agent_type') ? 'agent_type' : 'NULL'},
${has('agent_id') ? 'agent_id' : 'NULL'}
FROM pending_messages
`);
this.db.run('DROP TABLE pending_messages');
this.db.run('ALTER TABLE pending_messages_new RENAME TO pending_messages');
this.db.run('CREATE INDEX IF NOT EXISTS idx_pending_messages_session ON pending_messages(session_db_id)');
this.db.run('CREATE INDEX IF NOT EXISTS idx_pending_messages_status ON pending_messages(status)');
this.db.run('CREATE INDEX IF NOT EXISTS idx_pending_messages_claude_session ON pending_messages(content_session_id)');
this.db.run('CREATE INDEX IF NOT EXISTS idx_pending_messages_worker_pid ON pending_messages(worker_pid)');
// Dedup any pre-existing duplicate (content_session_id, tool_use_id) pairs
// before adding the UNIQUE index. Keep the lowest id (oldest) per pair.
this.db.run(`
DELETE FROM pending_messages
WHERE tool_use_id IS NOT NULL
AND id NOT IN (
SELECT MIN(id) FROM pending_messages
WHERE tool_use_id IS NOT NULL
GROUP BY content_session_id, tool_use_id
)
`);
this.db.run(`
CREATE UNIQUE INDEX IF NOT EXISTS ux_pending_session_tool
ON pending_messages(content_session_id, tool_use_id)
WHERE tool_use_id IS NOT NULL
`);
this.db.prepare('INSERT OR IGNORE INTO schema_versions (version, applied_at) VALUES (?, ?)').run(28, new Date().toISOString());
this.db.run('COMMIT');
this.db.run('PRAGMA foreign_keys = ON');
logger.debug('DB', 'Rebuilt pending_messages for self-healing claim');
} catch (error) {
this.db.run('ROLLBACK');
this.db.run('PRAGMA foreign_keys = ON');
if (error instanceof Error) {
throw error;
}
throw new Error(`Migration 28 failed: ${String(error)}`);
}
}
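Migration 28's INSERT-SELECT substitutes literal `NULL` for any column the source table lacks, using the `has(name)` check over `PRAGMA table_info`. That fallback rule, extracted as a small helper (a sketch under assumed names, not repo code):

```typescript
// Sketch of the migration's column-fallback rule: select a source column when
// PRAGMA table_info reported it, otherwise select literal NULL so the
// INSERT-SELECT never references a missing column on older databases.
function selectExprs(sourceCols: Set<string>, wanted: string[]): string[] {
  return wanted.map(col => (sourceCols.has(col) ? col : 'NULL'));
}
```

This keeps one migration body valid across every historical schema shape: the target table always receives the full column list, and rows from pre-migration databases simply carry NULL in the columns they never had.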
/**
* Add UNIQUE(memory_session_id, content_hash) on observations (migration 29).
*
* PATHFINDER-2026-04-22 Plan 01 Phase 2 + Phase 4.
*
* - Dedupes existing rows that share (memory_session_id, content_hash),
* keeping the lowest id (oldest) per pair.
* - Creates a UNIQUE index that lets writers use
* INSERT ON CONFLICT(memory_session_id, content_hash) DO NOTHING
* in place of the legacy dedup window scan.
*/
private addObservationsUniqueContentHashIndex(): void {
const applied = this.db.prepare('SELECT version FROM schema_versions WHERE version = ?').get(29) as SchemaVersion | undefined;
if (applied) return;
// Need both columns to exist.
const obsCols = this.db.query('PRAGMA table_info(observations)').all() as TableColumnInfo[];
const hasMem = obsCols.some(c => c.name === 'memory_session_id');
const hasHash = obsCols.some(c => c.name === 'content_hash');
if (!hasMem || !hasHash) {
// Nothing to do; record so we don't keep retrying.
this.db.prepare('INSERT OR IGNORE INTO schema_versions (version, applied_at) VALUES (?, ?)').run(29, new Date().toISOString());
return;
}
this.db.run('BEGIN TRANSACTION');
try {
// Dedup before adding the UNIQUE index — keep the lowest id per pair.
this.db.run(`
DELETE FROM observations
WHERE id NOT IN (
SELECT MIN(id) FROM observations
GROUP BY memory_session_id, content_hash
)
`);
this.db.run(`
CREATE UNIQUE INDEX IF NOT EXISTS ux_observations_session_hash
ON observations(memory_session_id, content_hash)
`);
this.db.prepare('INSERT OR IGNORE INTO schema_versions (version, applied_at) VALUES (?, ?)').run(29, new Date().toISOString());
this.db.run('COMMIT');
logger.debug('DB', 'Added UNIQUE(memory_session_id, content_hash) on observations');
} catch (error) {
this.db.run('ROLLBACK');
if (error instanceof Error) {
throw error;
}
throw new Error(`Migration 29 failed: ${String(error)}`);
}
}
}
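Migration 29's `DELETE … WHERE id NOT IN (SELECT MIN(id) … GROUP BY …)` keeps the oldest row per (memory_session_id, content_hash) pair before the UNIQUE index lands. A minimal in-memory sketch of that keep-lowest-id rule (hypothetical row shape, not the real schema):

```typescript
interface Row { id: number; sessionId: string; hash: string }

// Mirror of: DELETE FROM observations WHERE id NOT IN
//   (SELECT MIN(id) FROM observations GROUP BY memory_session_id, content_hash)
function dedupeKeepOldest(rows: Row[]): Row[] {
  const lowest = new Map<string, number>(); // pair key -> lowest id seen
  for (const r of rows) {
    const key = `${r.sessionId}\u0000${r.hash}`;
    const cur = lowest.get(key);
    if (cur === undefined || r.id < cur) lowest.set(key, r.id);
  }
  // A row survives only if it carries the lowest id for its pair.
  return rows.filter(r => lowest.get(`${r.sessionId}\u0000${r.hash}`) === r.id);
}

const survivors = dedupeKeepOldest([
  { id: 3, sessionId: 's1', hash: 'a' },
  { id: 1, sessionId: 's1', hash: 'a' },
  { id: 2, sessionId: 's1', hash: 'b' },
]);
// ids 1 and 2 survive; the newer duplicate (id 3) is dropped
```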
+29 -34
@@ -9,9 +9,6 @@ import { logger } from '../../../utils/logger.js';
import { getProjectContext } from '../../../utils/project-name.js';
import type { ObservationInput, StoreObservationResult } from './types.js';
/** Deduplication window: observations with the same content hash within this window are skipped */
const DEDUP_WINDOW_MS = 30_000;
/**
* Compute a short content hash for deduplication.
* Uses (memory_session_id, title, narrative) as the semantic identity of an observation.
@@ -30,25 +27,13 @@ export function computeObservationContentHash(
}
/**
* Check if a duplicate observation exists within the dedup window.
* Returns the existing observation's id and timestamp if found, null otherwise.
*/
export function findDuplicateObservation(
db: Database,
contentHash: string,
timestampEpoch: number
): { id: number; created_at_epoch: number } | null {
const windowStart = timestampEpoch - DEDUP_WINDOW_MS;
const stmt = db.prepare(
'SELECT id, created_at_epoch FROM observations WHERE content_hash = ? AND created_at_epoch > ?'
);
return (stmt.get(contentHash, windowStart) as { id: number; created_at_epoch: number } | null);
}
/**
* Store an observation (from SDK parsing)
* Assumes session already exists (created by hook)
* Performs content-hash deduplication: skips INSERT if an identical observation exists within 30s
* Store an observation (from SDK parsing).
*
* Assumes session already exists (created by hook). Deduplication is enforced
* by the database via UNIQUE(memory_session_id, content_hash) (Plan 01 Phase 4):
* INSERT ON CONFLICT DO NOTHING absorbs duplicates silently. The returned id
is the existing row's id when a conflict occurred, otherwise that of the
freshly inserted row.
*/
export function storeObservation(
db: Database,
@@ -66,22 +51,18 @@ export function storeObservation(
// Guard against empty project string (race condition where project isn't set yet)
const resolvedProject = project || getProjectContext(process.cwd()).primary;
// Content-hash deduplication
const contentHash = computeObservationContentHash(memorySessionId, observation.title, observation.narrative);
const existing = findDuplicateObservation(db, contentHash, timestampEpoch);
if (existing) {
logger.debug('DEDUP', `Skipped duplicate observation | contentHash=${contentHash} | existingId=${existing.id}`);
return { id: existing.id, createdAtEpoch: existing.created_at_epoch };
}
const stmt = db.prepare(`
INSERT INTO observations
(memory_session_id, project, type, title, subtitle, facts, narrative, concepts,
files_read, files_modified, prompt_number, discovery_tokens, agent_type, agent_id, content_hash, created_at, created_at_epoch)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(memory_session_id, content_hash) DO NOTHING
RETURNING id, created_at_epoch
`);
const result = stmt.run(
const inserted = stmt.get(
memorySessionId,
resolvedProject,
observation.type,
@@ -99,10 +80,24 @@ export function storeObservation(
contentHash,
timestampIso,
timestampEpoch
);
) as { id: number; created_at_epoch: number } | null;
return {
id: Number(result.lastInsertRowid),
createdAtEpoch: timestampEpoch
};
if (inserted) {
return { id: inserted.id, createdAtEpoch: inserted.created_at_epoch };
}
// Conflict — fetch the existing row's id for the (memory_session_id, content_hash) pair.
const existing = db.prepare(
'SELECT id, created_at_epoch FROM observations WHERE memory_session_id = ? AND content_hash = ?'
).get(memorySessionId, contentHash) as { id: number; created_at_epoch: number } | null;
if (!existing) {
// Unreachable in practice (UNIQUE conflict implies existing row), but be explicit.
throw new Error(
`storeObservation: ON CONFLICT fired but no row exists for (memory_session_id=${memorySessionId}, content_hash=${contentHash})`
);
}
logger.debug('DEDUP', `Skipped duplicate observation | contentHash=${contentHash} | existingId=${existing.id}`);
return { id: existing.id, createdAtEpoch: existing.created_at_epoch };
}
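The ON CONFLICT DO NOTHING + RETURNING shape above reduces to a two-step contract: attempt the insert, and when RETURNING yields no row, look up the existing one. A toy sketch of that contract, with a Map standing in for the UNIQUE(memory_session_id, content_hash) index (hypothetical names, no real database):

```typescript
type Stored = { id: number; createdAtEpoch: number };

class ToyObservationTable {
  private nextId = 1;
  private byKey = new Map<string, Stored>(); // plays the UNIQUE index

  // INSERT ... ON CONFLICT DO NOTHING RETURNING: null when the key exists.
  insertReturning(sessionId: string, hash: string, epoch: number): Stored | null {
    const key = `${sessionId}:${hash}`;
    if (this.byKey.has(key)) return null;
    const row = { id: this.nextId++, createdAtEpoch: epoch };
    this.byKey.set(key, row);
    return row;
  }

  lookup(sessionId: string, hash: string): Stored | null {
    return this.byKey.get(`${sessionId}:${hash}`) ?? null;
  }
}

// storeObservation contract: always returns an id, duplicate or not.
function store(t: ToyObservationTable, sessionId: string, hash: string, epoch: number): Stored {
  const inserted = t.insertReturning(sessionId, hash, epoch);
  if (inserted) return inserted;
  const existing = t.lookup(sessionId, hash);
  if (!existing) throw new Error('conflict without existing row');
  return existing;
}
```

The second `store` call for the same pair returns the first row's id and epoch, matching the "returned id is the existing row's" behavior in `storeObservation`.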
+188
@@ -0,0 +1,188 @@
-- claude-mem SQLite schema
--
-- Authoritative shape of the database after all migrations through
-- runner.ts have been applied (current tip = migration 29). Fresh
-- databases boot directly into this shape; existing databases reach
-- it via the migration runner.
--
-- Source of truth: src/services/sqlite/migrations/runner.ts
-- Regenerated by: PATHFINDER-2026-04-22 Plan 01 (Data Integrity).
--
-- Invariants enforced here (Plan 01):
-- * pending_messages.UNIQUE(content_session_id, tool_use_id) — replaces
-- in-memory pendingTools Map for ingestion pairing (Plan 03 also depends on it).
-- * pending_messages.worker_pid INTEGER — populated by self-healing
-- claim query; replaces the legacy stale-reset epoch column.
-- * observations.UNIQUE(memory_session_id, content_hash) — replaces the
-- legacy dedup window; ON CONFLICT DO NOTHING absorbs duplicates.
CREATE TABLE IF NOT EXISTS schema_versions (
id INTEGER PRIMARY KEY,
version INTEGER UNIQUE NOT NULL,
applied_at TEXT NOT NULL
);
-- ─────────────────────────────────────────────────────────────────────
-- sdk_sessions: one row per Claude/Codex session observed by claude-mem.
-- ─────────────────────────────────────────────────────────────────────
CREATE TABLE IF NOT EXISTS sdk_sessions (
id INTEGER PRIMARY KEY AUTOINCREMENT,
content_session_id TEXT UNIQUE NOT NULL,
memory_session_id TEXT UNIQUE,
project TEXT NOT NULL,
platform_source TEXT NOT NULL DEFAULT 'claude',
user_prompt TEXT,
started_at TEXT NOT NULL,
started_at_epoch INTEGER NOT NULL,
completed_at TEXT,
completed_at_epoch INTEGER,
status TEXT NOT NULL DEFAULT 'active'
CHECK(status IN ('active', 'completed', 'failed')),
worker_port INTEGER,
prompt_counter INTEGER DEFAULT 0,
custom_title TEXT
);
CREATE INDEX IF NOT EXISTS idx_sdk_sessions_claude_id ON sdk_sessions(content_session_id);
CREATE INDEX IF NOT EXISTS idx_sdk_sessions_sdk_id ON sdk_sessions(memory_session_id);
CREATE INDEX IF NOT EXISTS idx_sdk_sessions_project ON sdk_sessions(project);
CREATE INDEX IF NOT EXISTS idx_sdk_sessions_status ON sdk_sessions(status);
CREATE INDEX IF NOT EXISTS idx_sdk_sessions_started ON sdk_sessions(started_at_epoch DESC);
CREATE INDEX IF NOT EXISTS idx_sdk_sessions_platform_source ON sdk_sessions(platform_source);
-- ─────────────────────────────────────────────────────────────────────
-- observations: structured memory rows extracted from SDK output.
-- UNIQUE(memory_session_id, content_hash) replaces the legacy dedup window;
-- writes use INSERT … ON CONFLICT DO NOTHING.
-- ─────────────────────────────────────────────────────────────────────
CREATE TABLE IF NOT EXISTS observations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
memory_session_id TEXT NOT NULL,
project TEXT NOT NULL,
text TEXT,
type TEXT NOT NULL,
title TEXT,
subtitle TEXT,
facts TEXT,
narrative TEXT,
concepts TEXT,
files_read TEXT,
files_modified TEXT,
prompt_number INTEGER,
discovery_tokens INTEGER DEFAULT 0,
content_hash TEXT,
agent_type TEXT,
agent_id TEXT,
merged_into_project TEXT,
generated_by_model TEXT,
created_at TEXT NOT NULL,
created_at_epoch INTEGER NOT NULL,
FOREIGN KEY(memory_session_id) REFERENCES sdk_sessions(memory_session_id)
ON DELETE CASCADE ON UPDATE CASCADE,
UNIQUE(memory_session_id, content_hash)
);
CREATE INDEX IF NOT EXISTS idx_observations_sdk_session ON observations(memory_session_id);
CREATE INDEX IF NOT EXISTS idx_observations_project ON observations(project);
CREATE INDEX IF NOT EXISTS idx_observations_type ON observations(type);
CREATE INDEX IF NOT EXISTS idx_observations_created ON observations(created_at_epoch DESC);
CREATE INDEX IF NOT EXISTS idx_observations_content_hash ON observations(content_hash, created_at_epoch);
CREATE INDEX IF NOT EXISTS idx_observations_agent_type ON observations(agent_type);
CREATE INDEX IF NOT EXISTS idx_observations_agent_id ON observations(agent_id);
CREATE INDEX IF NOT EXISTS idx_observations_merged_into ON observations(merged_into_project);
-- ─────────────────────────────────────────────────────────────────────
-- session_summaries: one summary row per memory session.
-- ─────────────────────────────────────────────────────────────────────
CREATE TABLE IF NOT EXISTS session_summaries (
id INTEGER PRIMARY KEY AUTOINCREMENT,
memory_session_id TEXT NOT NULL,
project TEXT NOT NULL,
request TEXT,
investigated TEXT,
learned TEXT,
completed TEXT,
next_steps TEXT,
files_read TEXT,
files_edited TEXT,
notes TEXT,
prompt_number INTEGER,
discovery_tokens INTEGER DEFAULT 0,
merged_into_project TEXT,
created_at TEXT NOT NULL,
created_at_epoch INTEGER NOT NULL,
FOREIGN KEY(memory_session_id) REFERENCES sdk_sessions(memory_session_id)
ON DELETE CASCADE ON UPDATE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_session_summaries_sdk_session ON session_summaries(memory_session_id);
CREATE INDEX IF NOT EXISTS idx_session_summaries_project ON session_summaries(project);
CREATE INDEX IF NOT EXISTS idx_session_summaries_created ON session_summaries(created_at_epoch DESC);
CREATE INDEX IF NOT EXISTS idx_summaries_merged_into ON session_summaries(merged_into_project);
-- ─────────────────────────────────────────────────────────────────────
-- pending_messages: persistent work queue for SDK messages.
-- worker_pid + UNIQUE(content_session_id, tool_use_id) make the claim
-- query self-healing without any legacy stale-reset epoch column.
-- ─────────────────────────────────────────────────────────────────────
CREATE TABLE IF NOT EXISTS pending_messages (
id INTEGER PRIMARY KEY AUTOINCREMENT,
session_db_id INTEGER NOT NULL,
content_session_id TEXT NOT NULL,
tool_use_id TEXT,
message_type TEXT NOT NULL
CHECK(message_type IN ('observation', 'summarize')),
tool_name TEXT,
tool_input TEXT,
tool_response TEXT,
cwd TEXT,
last_user_message TEXT,
last_assistant_message TEXT,
prompt_number INTEGER,
status TEXT NOT NULL DEFAULT 'pending'
CHECK(status IN ('pending', 'processing', 'processed', 'failed')),
retry_count INTEGER NOT NULL DEFAULT 0,
created_at_epoch INTEGER NOT NULL,
failed_at_epoch INTEGER,
completed_at_epoch INTEGER,
worker_pid INTEGER,
agent_type TEXT,
agent_id TEXT,
FOREIGN KEY (session_db_id) REFERENCES sdk_sessions(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_pending_messages_session ON pending_messages(session_db_id);
CREATE INDEX IF NOT EXISTS idx_pending_messages_status ON pending_messages(status);
CREATE INDEX IF NOT EXISTS idx_pending_messages_claude_session ON pending_messages(content_session_id);
CREATE INDEX IF NOT EXISTS idx_pending_messages_worker_pid ON pending_messages(worker_pid);
CREATE UNIQUE INDEX IF NOT EXISTS ux_pending_session_tool
ON pending_messages(content_session_id, tool_use_id)
WHERE tool_use_id IS NOT NULL;
-- ─────────────────────────────────────────────────────────────────────
-- user_prompts: per-prompt history (UI + FTS search).
-- ─────────────────────────────────────────────────────────────────────
CREATE TABLE IF NOT EXISTS user_prompts (
id INTEGER PRIMARY KEY AUTOINCREMENT,
content_session_id TEXT NOT NULL,
prompt_number INTEGER NOT NULL,
prompt_text TEXT NOT NULL,
created_at TEXT NOT NULL,
created_at_epoch INTEGER NOT NULL,
FOREIGN KEY(content_session_id) REFERENCES sdk_sessions(content_session_id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_user_prompts_claude_session ON user_prompts(content_session_id);
CREATE INDEX IF NOT EXISTS idx_user_prompts_created ON user_prompts(created_at_epoch DESC);
CREATE INDEX IF NOT EXISTS idx_user_prompts_prompt_number ON user_prompts(prompt_number);
CREATE INDEX IF NOT EXISTS idx_user_prompts_lookup ON user_prompts(content_session_id, prompt_number);
-- ─────────────────────────────────────────────────────────────────────
-- observation_feedback: usage-signal tracking for tier routing.
-- ─────────────────────────────────────────────────────────────────────
CREATE TABLE IF NOT EXISTS observation_feedback (
id INTEGER PRIMARY KEY AUTOINCREMENT,
observation_id INTEGER NOT NULL,
signal_type TEXT NOT NULL,
session_db_id INTEGER,
created_at_epoch INTEGER NOT NULL,
metadata TEXT,
FOREIGN KEY (observation_id) REFERENCES observations(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_feedback_observation ON observation_feedback(observation_id);
CREATE INDEX IF NOT EXISTS idx_feedback_signal ON observation_feedback(signal_type);
+48 -21
@@ -10,7 +10,7 @@ import { Database } from 'bun:sqlite';
import { logger } from '../../utils/logger.js';
import type { ObservationInput } from './observations/types.js';
import type { SummaryInput } from './summaries/types.js';
import { computeObservationContentHash, findDuplicateObservation } from './observations/store.js';
import { computeObservationContentHash } from './observations/store.js';
/**
* Result from storeObservations / storeObservationsAndMarkComplete transaction
@@ -64,23 +64,25 @@ export function storeObservationsAndMarkComplete(
const storeAndMarkTx = db.transaction(() => {
const observationIds: number[] = [];
// 1. Store all observations (with content-hash deduplication)
// 1. Store all observations.
// UNIQUE(memory_session_id, content_hash) + ON CONFLICT DO NOTHING enforces
// dedup at the DB layer (Plan 01 Phase 4). RETURNING gives us the row id
// when the insert went through; on conflict we look up the existing id.
const obsStmt = db.prepare(`
INSERT INTO observations
(memory_session_id, project, type, title, subtitle, facts, narrative, concepts,
files_read, files_modified, prompt_number, discovery_tokens, agent_type, agent_id, content_hash, created_at, created_at_epoch)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(memory_session_id, content_hash) DO NOTHING
RETURNING id
`);
const lookupExistingStmt = db.prepare(
'SELECT id FROM observations WHERE memory_session_id = ? AND content_hash = ?'
);
for (const observation of observations) {
const contentHash = computeObservationContentHash(memorySessionId, observation.title, observation.narrative);
const existing = findDuplicateObservation(db, contentHash, timestampEpoch);
if (existing) {
observationIds.push(existing.id);
continue;
}
const result = obsStmt.run(
const inserted = obsStmt.get(
memorySessionId,
project,
observation.type,
@@ -98,8 +100,20 @@ export function storeObservationsAndMarkComplete(
contentHash,
timestampIso,
timestampEpoch
);
observationIds.push(Number(result.lastInsertRowid));
) as { id: number } | null;
if (inserted) {
observationIds.push(inserted.id);
continue;
}
const existing = lookupExistingStmt.get(memorySessionId, contentHash) as { id: number } | null;
if (!existing) {
throw new Error(
`storeObservationsAndMarkComplete: ON CONFLICT without existing row for content_hash=${contentHash}`
);
}
observationIds.push(existing.id);
}
// 2. Store summary if provided
@@ -185,23 +199,24 @@ export function storeObservations(
const storeTx = db.transaction(() => {
const observationIds: number[] = [];
// 1. Store all observations (with content-hash deduplication)
// 1. Store all observations.
// UNIQUE(memory_session_id, content_hash) + ON CONFLICT DO NOTHING enforces
// dedup at the DB layer (Plan 01 Phase 4).
const obsStmt = db.prepare(`
INSERT INTO observations
(memory_session_id, project, type, title, subtitle, facts, narrative, concepts,
files_read, files_modified, prompt_number, discovery_tokens, agent_type, agent_id, content_hash, created_at, created_at_epoch)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(memory_session_id, content_hash) DO NOTHING
RETURNING id
`);
const lookupExistingStmt = db.prepare(
'SELECT id FROM observations WHERE memory_session_id = ? AND content_hash = ?'
);
for (const observation of observations) {
const contentHash = computeObservationContentHash(memorySessionId, observation.title, observation.narrative);
const existing = findDuplicateObservation(db, contentHash, timestampEpoch);
if (existing) {
observationIds.push(existing.id);
continue;
}
const result = obsStmt.run(
const inserted = obsStmt.get(
memorySessionId,
project,
observation.type,
@@ -219,8 +234,20 @@ export function storeObservations(
contentHash,
timestampIso,
timestampEpoch
);
observationIds.push(Number(result.lastInsertRowid));
) as { id: number } | null;
if (inserted) {
observationIds.push(inserted.id);
continue;
}
const existing = lookupExistingStmt.get(memorySessionId, contentHash) as { id: number } | null;
if (!existing) {
throw new Error(
`storeObservations: ON CONFLICT without existing row for content_hash=${contentHash}`
);
}
observationIds.push(existing.id);
}
// 2. Store summary if provided
+67
@@ -341,6 +341,73 @@ export class ChromaMcpManager {
}
}
/**
* Deep semantic-search probe verifies the actual query path works,
* not just that the subprocess responds to one tool. Each stage is wrapped
* in its own try/catch so the returned `stage` reflects where it failed.
*
* Stages:
* - 'list' chroma_list_collections (also counts collections)
* - 'query' chroma_query_documents against cm__claude-mem with a trivial
* query and n_results: 1 (measures latency)
* - 'done' both stages succeeded
*/
async probeSemanticSearch(): Promise<{
ok: boolean;
stage: 'connect' | 'list' | 'query' | 'done';
error?: string;
collections?: number;
queryLatencyMs?: number;
}> {
let collections: number | undefined;
// Stage: list — also lazy-connects via callTool
try {
const listResult: any = await this.callTool('chroma_list_collections', { limit: 100 });
if (Array.isArray(listResult)) {
collections = listResult.length;
} else if (listResult && Array.isArray(listResult.collections)) {
collections = listResult.collections.length;
} else if (listResult && typeof listResult === 'object' && 'length' in listResult) {
collections = (listResult as { length: number }).length;
}
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
logger.warn('CHROMA_MCP', 'Deep probe failed at list stage', { error: message });
return { ok: false, stage: 'list', error: message };
}
// Stage: query — round-trip through the embedding/vector path
const queryStartedAt = Date.now();
try {
await this.callTool('chroma_query_documents', {
collection_name: 'cm__claude-mem',
query_texts: ['ping'],
n_results: 1
});
const queryLatencyMs = Date.now() - queryStartedAt;
return { ok: true, stage: 'done', collections, queryLatencyMs };
} catch (error) {
const queryLatencyMs = Date.now() - queryStartedAt;
const rawMessage = error instanceof Error ? error.message : String(error);
const isMissingOrEmpty = /not exist|missing|empty|no such/i.test(rawMessage);
const errorMessage = isMissingOrEmpty
? `collection cm__claude-mem missing or empty (${rawMessage})`
: rawMessage;
logger.warn('CHROMA_MCP', 'Deep probe failed at query stage', {
error: rawMessage,
queryLatencyMs
});
return {
ok: false,
stage: 'query',
error: errorMessage,
collections,
queryLatencyMs
};
}
}
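The query stage classifies errors with a regex over the raw message text before returning. That classification can be restated standalone (a sketch of the same rule, not the class method):

```typescript
// Same heuristic the query stage uses: missing/empty collections get a
// friendlier prefix, everything else passes through untouched.
const MISSING_OR_EMPTY = /not exist|missing|empty|no such/i;

function classifyQueryError(rawMessage: string): string {
  return MISSING_OR_EMPTY.test(rawMessage)
    ? `collection cm__claude-mem missing or empty (${rawMessage})`
    : rawMessage;
}
```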
/**
* Gracefully stop the MCP connection and kill the chroma-mcp subprocess.
* client.close() sends stdin close -> SIGTERM -> SIGKILL to the subprocess.
+14 -7
@@ -549,9 +549,10 @@ export class ChromaSync {
* Reads from SQLite and syncs in batches
* @param projectOverride - If provided, backfill this project instead of this.project.
* Used by backfillAllProjects() to iterate projects without mutating instance state.
* @param storeOverride - If provided, use this SessionStore instead of creating a new one.
* Throws error if backfill fails
*/
async ensureBackfilled(projectOverride?: string): Promise<void> {
async ensureBackfilled(projectOverride?: string, storeOverride?: SessionStore): Promise<void> {
const backfillProject = projectOverride ?? this.project;
logger.info('CHROMA_SYNC', 'Starting smart backfill', { project: backfillProject });
@@ -560,7 +561,7 @@ export class ChromaSync {
// Fetch existing IDs from Chroma (fast, metadata only)
const existing = await this.getExistingChromaIds(backfillProject);
const db = new SessionStore();
const db = storeOverride ?? new SessionStore();
try {
await this.runBackfillPipeline(db, backfillProject, existing);
@@ -568,7 +569,10 @@ export class ChromaSync {
logger.error('CHROMA_SYNC', 'Backfill failed', { project: backfillProject }, error instanceof Error ? error : new Error(String(error)));
throw new Error(`Backfill failed: ${error instanceof Error ? error.message : String(error)}`);
} finally {
db.close();
// Only close if we created it
if (!storeOverride) {
db.close();
}
}
}
@@ -861,8 +865,8 @@ export class ChromaSync {
* with project scoped via metadata, matching how DatabaseManager and SearchManager operate.
* Designed to be called fire-and-forget on worker startup.
*/
static async backfillAllProjects(): Promise<void> {
const db = new SessionStore();
static async backfillAllProjects(storeOverride?: SessionStore): Promise<void> {
const db = storeOverride ?? new SessionStore();
const sync = new ChromaSync('claude-mem');
try {
const projects = db.db.prepare(
@@ -873,7 +877,7 @@ export class ChromaSync {
for (const { project } of projects) {
try {
await sync.ensureBackfilled(project);
await sync.ensureBackfilled(project, db);
} catch (error) {
if (error instanceof Error) {
logger.error('CHROMA_SYNC', `Backfill failed for project: ${project}`, {}, error);
@@ -885,7 +889,10 @@ export class ChromaSync {
}
} finally {
await sync.close();
db.close();
// Only close if we created it
if (!storeOverride) {
db.close();
}
}
}
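The `storeOverride` threading implements the usual ownership rule: close only what you opened. A minimal sketch of the pattern, independent of SessionStore (names hypothetical):

```typescript
class Resource {
  closed = false;
  close(): void { this.closed = true; }
}

// Mirrors ensureBackfilled(projectOverride?, storeOverride?): when the caller
// passes a resource, the callee borrows it and must not close it; when the
// callee creates the resource itself, it owns cleanup.
function withResource<T>(work: (r: Resource) => T, override?: Resource): T {
  const r = override ?? new Resource();
  try {
    return work(r);
  } finally {
    if (!override) r.close(); // only close if we created it
  }
}
```

This lets `backfillAllProjects` open one SessionStore and loop `ensureBackfilled(project, db)` over every project without each iteration opening and closing its own connection.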
+68 -37
@@ -1,6 +1,5 @@
import path from 'path';
import { sessionInitHandler } from '../../cli/handlers/session-init.js';
import { observationHandler } from '../../cli/handlers/observation.js';
import { fileEditHandler } from '../../cli/handlers/file-edit.js';
import { sessionCompleteHandler } from '../../cli/handlers/session-complete.js';
import { ensureWorkerRunning, workerHttpRequest } from '../../shared/worker-utils.js';
@@ -12,6 +11,7 @@ import { resolveFieldSpec, resolveFields, matchesRule } from './field-utils.js';
import { expandHomePath } from './config.js';
import type { TranscriptSchema, WatchTarget, SchemaEvent } from './types.js';
import { normalizePlatformSource } from '../../shared/platform-source.js';
import { ingestObservation } from '../worker/http/shared.js';
interface SessionState {
sessionId: string;
@@ -20,14 +20,10 @@ interface SessionState {
project?: string;
lastUserMessage?: string;
lastAssistantMessage?: string;
pendingTools: Map<string, { name?: string; input?: unknown }>;
}
interface PendingTool {
id?: string;
name?: string;
input?: unknown;
response?: unknown;
// In-memory pairing for transcript schemas (e.g. Codex) where tool_use
// carries toolName + toolInput and tool_result only carries tool_use_id +
// output. Keyed by toolId; consumed and deleted on the matching tool_result.
pendingTools?: Map<string, { toolName: string; toolInput: unknown }>;
}
export class TranscriptEventProcessor {
@@ -56,7 +52,6 @@ export class TranscriptEventProcessor {
session = {
sessionId,
platformSource: normalizePlatformSource(watch.name),
pendingTools: new Map()
};
this.sessions.set(key, session);
}
@@ -129,7 +124,7 @@ export class TranscriptEventProcessor {
const project = this.resolveProject(entry, watch, schema, event, session);
if (project) session.project = project;
const fields = resolveFields(event.fields, entry, { watch, schema, session });
const fields = resolveFields(event.fields, entry, { watch, schema, session: session as unknown as Record<string, unknown> });
switch (event.action) {
case 'session_context':
@@ -196,12 +191,6 @@ export class TranscriptEventProcessor {
const toolInput = this.maybeParseJson(fields.toolInput);
const toolResponse = this.maybeParseJson(fields.toolResponse);
const pending: PendingTool = { id: toolId, name: toolName, input: toolInput, response: toolResponse };
if (toolId) {
session.pendingTools.set(toolId, { name: pending.name, input: pending.input });
}
if (toolName === 'apply_patch' && typeof toolInput === 'string') {
const files = this.parseApplyPatchFiles(toolInput);
for (const filePath of files) {
@@ -212,35 +201,61 @@ export class TranscriptEventProcessor {
}
}
if (toolResponse !== undefined && toolName) {
// Two schema shapes to support:
// 1. Self-contained events (e.g. Claude JSONL): tool_use and tool_result
// both carry toolName; tool_use may already include toolResponse.
// 2. Split events (e.g. Codex): tool_use carries toolName + toolInput,
// tool_result carries only toolUseId + output. Neither side alone
// has both toolName and toolResponse.
//
// For (1) we emit eagerly when toolResponse is present. For (2) we stash
// toolName/toolInput on the session keyed by toolId so handleToolResult
// can join them at tool_result time. The DB's
// UNIQUE(content_session_id, tool_use_id) index collapses any duplicate
// emissions that arise when both events carry a complete record.
if (toolName && toolResponse !== undefined) {
await this.sendObservation(session, {
toolName,
toolInput,
toolResponse
toolResponse,
toolUseId: toolId,
});
} else if (toolName && toolId) {
if (!session.pendingTools) session.pendingTools = new Map();
session.pendingTools.set(toolId, { toolName, toolInput });
}
}
private async handleToolResult(session: SessionState, fields: Record<string, unknown>): Promise<void> {
const toolId = typeof fields.toolId === 'string' ? fields.toolId : undefined;
const toolName = typeof fields.toolName === 'string' ? fields.toolName : undefined;
let toolName = typeof fields.toolName === 'string' ? fields.toolName : undefined;
const toolResponse = this.maybeParseJson(fields.toolResponse);
let toolInput = this.maybeParseJson(fields.toolInput);
let toolInput: unknown = this.maybeParseJson(fields.toolInput);
let name = toolName;
if (toolId && session.pendingTools.has(toolId)) {
const pending = session.pendingTools.get(toolId)!;
toolInput = pending.input ?? toolInput;
name = name ?? pending.name;
session.pendingTools.delete(toolId);
// Consume any pending-tool entry for this toolId regardless of whether the
// tool_result already carries toolName: in the split-schema path the
// result always resolves the pending entry, so leaving it behind would
// grow the map until session end.
if (toolId && session.pendingTools) {
const pending = session.pendingTools.get(toolId);
if (pending) {
if (!toolName) toolName = pending.toolName;
if (toolInput === undefined) toolInput = pending.toolInput;
session.pendingTools.delete(toolId);
}
}
if (name) {
if (toolName) {
await this.sendObservation(session, {
toolName: name,
toolName,
toolInput,
toolResponse
toolResponse,
toolUseId: toolId,
});
} else {
logger.debug('TRANSCRIPT', 'Dropping tool_result with no resolvable toolName', {
sessionId: session.sessionId,
toolId,
});
}
}
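The split-schema pairing above is a join keyed by toolId: tool_use stashes name and input, tool_result consumes them and deletes the entry so the map cannot grow until session end. A standalone sketch of that consume-and-delete join (hypothetical event shapes, not the processor itself):

```typescript
interface PendingTool { toolName: string; toolInput: unknown }
type Emitted = PendingTool & { toolResponse: unknown };

class ToolPairer {
  private pending = new Map<string, PendingTool>();

  // tool_use side: self-contained events (shape 1) emit immediately;
  // split events (shape 2) stash the name+input half keyed by toolId.
  onToolUse(toolId: string | undefined, toolName: string, toolInput: unknown, toolResponse: unknown): Emitted | null {
    if (toolResponse !== undefined) return { toolName, toolInput, toolResponse };
    if (toolId) this.pending.set(toolId, { toolName, toolInput });
    return null;
  }

  // tool_result side: consume the stashed half so the map stays bounded.
  onToolResult(toolId: string | undefined, toolResponse: unknown): Emitted | null {
    if (!toolId) return null;
    const stashed = this.pending.get(toolId);
    if (!stashed) return null;
    this.pending.delete(toolId);
    return { ...stashed, toolResponse };
  }

  size(): number { return this.pending.size; }
}
```

In the real processor the DB's UNIQUE(content_session_id, tool_use_id) index absorbs the case where both shapes emit a complete record for the same toolId.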
@@ -249,14 +264,23 @@ export class TranscriptEventProcessor {
const toolName = typeof fields.toolName === 'string' ? fields.toolName : undefined;
if (!toolName) return;
await observationHandler.execute({
sessionId: session.sessionId,
// PATHFINDER plan 03 phase 7: replace HTTP loopback (worker → its own
// /api/sessions/observations endpoint) with a direct in-process call to
// ingestObservation. Same implementation backs the cross-process HTTP
// route handler (one helper, N callers).
const result = ingestObservation({
contentSessionId: session.sessionId,
cwd: session.cwd ?? process.cwd(),
toolName,
toolInput: this.maybeParseJson(fields.toolInput),
toolResponse: this.maybeParseJson(fields.toolResponse),
platform: session.platformSource
platformSource: session.platformSource,
toolUseId: typeof fields.toolUseId === 'string' ? fields.toolUseId : undefined,
});
if (!result.ok) {
throw new Error(`ingestObservation failed: ${result.reason}`);
}
}
private async sendFileEdit(session: SessionState, fields: Record<string, unknown>): Promise<void> {
@@ -277,10 +301,17 @@ export class TranscriptEventProcessor {
const trimmed = value.trim();
if (!trimmed) return value;
if (!(trimmed.startsWith('{') || trimmed.startsWith('['))) return value;
// Pass through the raw string on parse failure rather than throwing.
// Throwing from this helper propagates to `handleLine`'s outer catch,
// which then silently drops the entire transcript line — including any
// valid sibling fields. A single malformed JSON-shaped field should
// degrade to opaque-string handling, not lose the whole observation.
try {
return JSON.parse(trimmed);
} catch (error: unknown) {
logger.debug('WORKER', 'Failed to parse JSON string', { length: trimmed.length }, error instanceof Error ? error : undefined);
} catch (error) {
logger.debug('TRANSCRIPT', 'Field looked like JSON but did not parse; using raw string', {
preview: trimmed.slice(0, 120),
}, error instanceof Error ? error : undefined);
return value;
}
}
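That lenient-parse rule can be sketched standalone: only strings that look like JSON are attempted, and a failed parse degrades to the raw string instead of throwing and losing the whole line (a sketch of the contract, not the class method):

```typescript
function maybeParseJson(value: unknown): unknown {
  if (typeof value !== 'string') return value;
  const trimmed = value.trim();
  if (!trimmed) return value;
  // Only attempt strings that look like a JSON object or array.
  if (!(trimmed.startsWith('{') || trimmed.startsWith('['))) return value;
  try {
    return JSON.parse(trimmed);
  } catch {
    return value; // malformed JSON-shaped field degrades to an opaque string
  }
}
```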
@@ -314,7 +345,7 @@ export class TranscriptEventProcessor {
platform: session.platformSource
});
await this.updateContext(session, watch);
session.pendingTools.clear();
session.pendingTools?.clear();
const key = this.getSessionKey(watch, session.sessionId);
this.sessions.delete(key);
}
+77 -13
@@ -1,5 +1,5 @@
import { existsSync, statSync, watch as fsWatch, createReadStream } from 'fs';
import { basename, join } from 'path';
import { basename, join, resolve as resolvePath, sep as pathSep } from 'path';
import { globSync } from 'glob';
import { logger } from '../../utils/logger.js';
import { expandHomePath } from './config.js';
@@ -84,7 +84,7 @@ export class TranscriptWatcher {
private processor = new TranscriptEventProcessor();
private tailers = new Map<string, FileTailer>();
private state: TranscriptWatchState;
private rescanTimers: Array<NodeJS.Timeout> = [];
private rootWatchers: Array<ReturnType<typeof fsWatch>> = [];
constructor(private config: TranscriptWatchConfig, private statePath: string) {
this.state = loadWatchState(statePath);
@@ -101,10 +101,10 @@ export class TranscriptWatcher {
tailer.close();
}
this.tailers.clear();
for (const timer of this.rescanTimers) {
clearInterval(timer);
for (const watcher of this.rootWatchers) {
watcher.close();
}
this.rescanTimers = [];
this.rootWatchers = [];
}
private async setupWatch(watch: WatchTarget): Promise<void> {
@@ -121,16 +121,80 @@ export class TranscriptWatcher {
await this.addTailer(filePath, watch, schema, true);
}
const rescanIntervalMs = watch.rescanIntervalMs ?? 5000;
const timer = setInterval(async () => {
const newFiles = this.resolveWatchFiles(resolvedPath);
for (const filePath of newFiles) {
if (!this.tailers.has(filePath)) {
await this.addTailer(filePath, watch, schema, false);
// PATHFINDER plan 03 phase 5: 5-second rescan timer replaced by a
// recursive fs.watch on the configured root. Requires Node 20+ on Linux
// for recursive mode (engines.node >= 20.0.0 — already enforced in
// package.json).
const watchRoot = this.deepestNonGlobAncestor(resolvedPath);
if (!watchRoot || !existsSync(watchRoot)) {
logger.debug('TRANSCRIPT', 'Watch root does not exist, skipping fs.watch', { watch: watch.name, watchRoot });
return;
}
try {
const watcher = fsWatch(watchRoot, { recursive: true, persistent: true }, (event, name) => {
if (!name) return; // some events omit filename
// Skip the glob scan for paths we already tail — JSONL appends fire
// here on every line and a full resolveWatchFiles() per append is
// more expensive than the prior 5-s interval. Only unknown paths
// warrant a rescan (new transcript files surface here first).
const changed = resolvePath(watchRoot, name);
if (this.tailers.has(changed)) return;
const matches = this.resolveWatchFiles(resolvedPath);
for (const filePath of matches) {
if (!this.tailers.has(filePath)) {
void this.addTailer(filePath, watch, schema, false);
}
}
});
this.rootWatchers.push(watcher);
logger.info('TRANSCRIPT', 'Watching transcript root recursively', { watch: watch.name, watchRoot });
} catch (error) {
logger.warn('TRANSCRIPT', 'Failed to start recursive fs.watch on transcript root', {
watch: watch.name,
watchRoot,
}, error instanceof Error ? error : undefined);
}
}
/**
* Return the deepest path component that contains no glob meta-characters.
* Used to anchor `fs.watch(recursive: true)` for both literal directories
* and patterns like `~/.codex/sessions/**\/*.jsonl`.
*
* Handles both `/` and `\` as separators so Windows-native paths
* (e.g. `C:\Users\x\codex\sessions\**\*.jsonl`) resolve correctly. When
* the input is purely glob meta (no literal prefix) we return an empty
* string so the caller skips the watch instead of anchoring at the
* filesystem root.
*/
private deepestNonGlobAncestor(inputPath: string): string {
if (!this.hasGlob(inputPath)) {
// Literal path: if it's a file, return its directory; otherwise return as-is.
if (existsSync(inputPath)) {
try {
const stat = statSync(inputPath);
return stat.isDirectory() ? inputPath : resolvePath(inputPath, '..');
} catch {
return resolvePath(inputPath, '..');
}
}
return inputPath;
}
const segments = inputPath.split(/[/\\]/);
const literalSegments: string[] = [];
for (const segment of segments) {
if (/[*?[\]{}()]/.test(segment)) break;
literalSegments.push(segment);
}
if (literalSegments.length === 0) return '';
if (literalSegments.length === 1 && literalSegments[0] === '') {
// Input started with a separator but the first real segment was a
// glob (e.g. `/**/foo`). Don't silently broaden the watch to `/`.
return '';
}
return literalSegments.join(pathSep);
}
private resolveSchema(watch: WatchTarget): TranscriptSchema | null {
@@ -79,6 +79,7 @@ import { DatabaseManager } from './worker/DatabaseManager.js';
import { SessionManager } from './worker/SessionManager.js';
import { SSEBroadcaster } from './worker/SSEBroadcaster.js';
import { SDKAgent } from './worker/SDKAgent.js';
import type { WorkerRef } from './worker/agents/types.js';
import { GeminiAgent, isGeminiSelected, isGeminiAvailable } from './worker/GeminiAgent.js';
import { OpenRouterAgent, isOpenRouterSelected, isOpenRouterAvailable } from './worker/OpenRouterAgent.js';
import { PaginationHelper } from './worker/PaginationHelper.js';
@@ -88,6 +89,7 @@ import { FormattingService } from './worker/FormattingService.js';
import { TimelineService } from './worker/TimelineService.js';
import { SessionEventBroadcaster } from './worker/events/SessionEventBroadcaster.js';
import { SessionCompletionHandler } from './worker/session/SessionCompletionHandler.js';
import { setIngestContext, attachIngestGeneratorStarter } from './worker/http/shared.js';
import { DEFAULT_CONFIG_PATH, DEFAULT_STATE_PATH, expandHomePath, loadTranscriptWatchConfig, writeSampleConfig } from './transcripts/config.js';
import { TranscriptWatcher } from './transcripts/watcher.js';
@@ -100,14 +102,18 @@ import { SettingsRoutes } from './worker/http/routes/SettingsRoutes.js';
import { LogsRoutes } from './worker/http/routes/LogsRoutes.js';
import { MemoryRoutes } from './worker/http/routes/MemoryRoutes.js';
import { CorpusRoutes } from './worker/http/routes/CorpusRoutes.js';
import { ChromaRoutes } from './worker/http/routes/ChromaRoutes.js';
// Knowledge agent services
import { CorpusStore } from './worker/knowledge/CorpusStore.js';
import { CorpusBuilder } from './worker/knowledge/CorpusBuilder.js';
import { KnowledgeAgent } from './worker/knowledge/KnowledgeAgent.js';
// Primary-path session lifecycle helpers — no reapers, no orphan sweeps.
// The SDK subprocess is spawned in its own POSIX process group via
// createSdkSpawnFactory; teardown via ensureSdkProcessExit kills the whole
// group so no descendants leak (Principle 5).
import { getSdkProcessForSession, ensureSdkProcessExit } from '../supervisor/process-registry.js';
/**
* Build JSON status output for hook framework communication.
@@ -133,7 +139,7 @@ export function buildStatusOutput(status: 'ready' | 'error', message?: string):
};
}
export class WorkerService implements WorkerRef {
private server: Server;
private startTime: number = Date.now();
private mcpClient: Client;
@@ -146,14 +152,14 @@ export class WorkerService {
// Service layer
private dbManager: DatabaseManager;
private sessionManager: SessionManager;
public sseBroadcaster: SSEBroadcaster;
private sdkAgent: SDKAgent;
private geminiAgent: GeminiAgent;
private openRouterAgent: OpenRouterAgent;
private paginationHelper: PaginationHelper;
private settingsManager: SettingsManager;
private sessionEventBroadcaster: SessionEventBroadcaster;
private completionHandler: SessionCompletionHandler;
private corpusStore: CorpusStore;
// Route handlers
@@ -169,12 +175,6 @@ export class WorkerService {
private initializationComplete: Promise<void>;
private resolveInitialization!: () => void;
// AI interaction tracking for health endpoint
private lastAiInteraction: {
timestamp: number;
@@ -200,13 +200,21 @@ export class WorkerService {
this.paginationHelper = new PaginationHelper(this.dbManager);
this.settingsManager = new SettingsManager(this.dbManager);
this.sessionEventBroadcaster = new SessionEventBroadcaster(this.sseBroadcaster, this);
this.completionHandler = new SessionCompletionHandler(
this.sessionManager,
this.sessionEventBroadcaster,
this.dbManager,
);
this.corpusStore = new CorpusStore();
// Wire ingest helpers (plan 03 phase 0). Worker-internal callers use these
// directly instead of HTTP-loopback into our own routes.
setIngestContext({
sessionManager: this.sessionManager,
dbManager: this.dbManager,
eventBroadcaster: this.sessionEventBroadcaster,
});
// Set callback for when sessions are deleted
this.sessionManager.setOnSessionDeleted(() => {
this.broadcastProcessingStatus();
@@ -268,6 +276,9 @@ export class WorkerService {
private registerRoutes(): void {
// IMPORTANT: Middleware must be registered BEFORE routes (Express processes in order)
// Register Chroma routes immediately so they bypass the initialization guard
this.server.registerRoutes(new ChromaRoutes());
// Early handler for /api/context/inject — fail open if not yet initialized
this.server.app.get('/api/context/inject', async (req, res, next) => {
if (!this.initializationCompleteFlag || !this.searchRoutes) {
@@ -281,14 +292,20 @@ export class WorkerService {
// Guard ALL /api/* routes during initialization — wait for DB with timeout
// Exceptions: /api/health, /api/readiness, /api/version (handled by Server.ts core routes)
// and /api/context/inject (handled above with fail-open)
// and /api/chroma/status (diagnostic endpoint)
this.server.app.use('/api', async (req, res, next) => {
// Bypass guard for diagnostic endpoints
if (req.path === '/chroma/status' || req.path === '/health' || req.path === '/readiness' || req.path === '/version') {
next();
return;
}
if (this.initializationCompleteFlag) {
next();
return;
}
const timeoutMs = 120000; // 2 minutes
const timeoutPromise = new Promise<void>((_, reject) =>
setTimeout(() => reject(new Error('Database initialization timeout')), timeoutMs)
);
@@ -312,7 +329,15 @@ export class WorkerService {
// Standard routes (registered AFTER guard middleware)
this.server.registerRoutes(new ViewerRoutes(this.sseBroadcaster, this.dbManager, this.sessionManager));
const sessionRoutes = new SessionRoutes(this.sessionManager, this.dbManager, this.sdkAgent, this.geminiAgent, this.openRouterAgent, this.sessionEventBroadcaster, this, this.completionHandler);
this.server.registerRoutes(sessionRoutes);
// Wire the generator-starter callback now that SessionRoutes exists.
// `setIngestContext` ran in the constructor before routes were
// constructed; transcript-watcher observations depend on this side-effect
// to auto-start the SDK generator after enqueue.
attachIngestGeneratorStarter((sessionDbId, source) =>
sessionRoutes.ensureGeneratorRunning(sessionDbId, source),
);
this.server.registerRoutes(new DataRoutes(this.paginationHelper, this.dbManager, this.sessionManager, this.sseBroadcaster, this, this.startTime));
this.server.registerRoutes(new SettingsRoutes(this.settingsManager));
this.server.registerRoutes(new LogsRoutes());
@@ -359,6 +384,7 @@ export class WorkerService {
*/
private async initializeBackground(): Promise<void> {
try {
logger.info('WORKER', 'Background initialization starting...');
await aggressiveStartupCleanup();
// Load mode configuration
@@ -368,47 +394,39 @@ export class WorkerService {
const settings = SettingsDefaultsManager.loadFromFile(USER_SETTINGS_PATH);
const modeId = settings.CLAUDE_MEM_MODE;
ModeManager.getInstance().loadMode(modeId);
logger.info('SYSTEM', `Mode loaded: ${modeId}`);
// One-time chroma wipe for users upgrading from versions with duplicate worker bugs.
// Only runs in local mode (chroma is local-only). Backfill at line ~414 rebuilds from SQLite.
if (settings.CLAUDE_MEM_MODE === 'local' || !settings.CLAUDE_MEM_MODE) {
logger.info('WORKER', 'Checking for one-time Chroma migration...');
runOneTimeChromaMigration();
}
// One-time remap of pre-worktree project names using pending_messages.cwd.
// Must run before dbManager.initialize() so we don't hold the DB open.
logger.info('WORKER', 'Checking for one-time CWD remap...');
runOneTimeCwdRemap();
// Stamp merged worktrees (non-blocking, fire-and-forget). Runs every startup
// because git state evolves and the adoption engine is fully idempotent. The
// worker daemon is spawned with cwd=marketplace-plugin-dir (not a git repo),
// so parent repos are discovered from recorded pending_messages.cwd values.
logger.info('WORKER', 'Adopting merged worktrees (background)...');
adoptMergedWorktreesForAllKnownRepos({}).then(adoptions => {
if (adoptions) {
for (const adoption of adoptions) {
if (adoption.adoptedObservations > 0 || adoption.adoptedSummaries > 0 || adoption.chromaUpdates > 0) {
logger.info('SYSTEM', 'Merged worktrees adopted in background', adoption);
}
if (adoption.errors.length > 0) {
logger.warn('SYSTEM', 'Worktree adoption had per-branch errors', {
repoPath: adoption.repoPath,
errors: adoption.errors
});
}
}
}
}
}).catch(err => {
logger.error('WORKER', 'Worktree adoption failed (background)', {}, err instanceof Error ? err : new Error(String(err)));
});
// Initialize ChromaMcpManager only if Chroma is enabled
const chromaEnabled = settings.CLAUDE_MEM_CHROMA_ENABLED !== 'false';
@@ -419,21 +437,24 @@ export class WorkerService {
logger.info('SYSTEM', 'Chroma disabled via CLAUDE_MEM_CHROMA_ENABLED=false, skipping ChromaMcpManager');
}
logger.info('WORKER', 'Initializing database manager...');
await this.dbManager.initialize();
// One-shot GC for terminally-failed rows
try {
logger.info('WORKER', 'Running startup GC for pending messages...');
const { PendingMessageStore } = await import('./sqlite/PendingMessageStore.js');
const pendingStore = new PendingMessageStore(this.dbManager.getSessionStore().db, 3);
const cleared = pendingStore.clearFailedOlderThan(7 * 24 * 60 * 60 * 1000);
if (cleared > 0) {
logger.info('QUEUE', 'Startup GC cleared old failed pending_messages rows', { cleared });
}
} catch (err) {
logger.warn('QUEUE', 'Startup GC for failed pending_messages rows failed', {}, err instanceof Error ? err : undefined);
}
// Initialize search services
logger.info('WORKER', 'Initializing search services...');
const formattingService = new FormattingService();
const timelineService = new TimelineService();
const searchManager = new SearchManager(
@@ -464,8 +485,6 @@ export class WorkerService {
logger.info('WORKER', 'CorpusRoutes registered');
// DB and search are ready — mark initialization complete so hooks can proceed.
// MCP connection is tracked separately via mcpReady and is NOT required for
// the worker to serve context/search requests.
this.initializationCompleteFlag = true;
this.resolveInitialization();
logger.info('SYSTEM', 'Core initialization complete (DB + search ready)');
@@ -474,7 +493,7 @@ export class WorkerService {
// Auto-backfill Chroma for all projects if out of sync with SQLite (fire-and-forget)
if (this.chromaMcpManager) {
ChromaSync.backfillAllProjects(this.dbManager.getSessionStore()).then(() => {
logger.info('CHROMA_SYNC', 'Backfill check complete for all projects');
}).catch(error => {
logger.error('CHROMA_SYNC', 'Backfill failed (non-blocking)', {}, error as Error);
@@ -482,134 +501,55 @@ export class WorkerService {
}
// Mark MCP as externally ready once the bundled stdio server binary exists.
// Codex/Claude Desktop connect to this binary directly; the loopback client
// below is only a best-effort self-check and should not mark health false.
const mcpServerPath = path.join(__dirname, 'mcp-server.cjs');
this.mcpReady = existsSync(mcpServerPath);
// Best-effort loopback MCP self-check (non-blocking, fire-and-forget)
this.runMcpSelfCheck(mcpServerPath).catch(err => {
logger.debug('WORKER', 'MCP self-check failed (non-fatal)', { error: err.message });
});
return;
} catch (error) {
// Background initialization failed - log and let worker fail health checks
logger.error('SYSTEM', 'Background initialization failed', {}, error instanceof Error ? error : undefined);
}
}
/**
* Run a best-effort loopback MCP self-check to verify the bundled server can start.
* This is entirely diagnostic and does not block worker availability.
*/
private async runMcpSelfCheck(mcpServerPath: string): Promise<void> {
try {
getSupervisor().assertCanSpawn('mcp server');
const transport = new StdioClientTransport({
command: process.execPath,
args: [mcpServerPath],
env: Object.fromEntries(
Object.entries(sanitizeEnv(process.env)).filter(([, value]) => value !== undefined)
) as Record<string, string>
});
const MCP_INIT_TIMEOUT_MS = 60000; // 1 minute is plenty for a local check
const mcpConnectionPromise = this.mcpClient.connect(transport);
const timeoutPromise = new Promise<never>((_, reject) => {
setTimeout(
() => reject(new Error('MCP connection timeout')),
MCP_INIT_TIMEOUT_MS
);
});
await Promise.race([mcpConnectionPromise, timeoutPromise]);
const mcpProcess = (transport as unknown as { _process?: import('child_process').ChildProcess })._process;
if (mcpProcess?.pid) {
getSupervisor().registerProcess('mcp-server', {
pid: mcpProcess.pid,
type: 'mcp',
startedAt: new Date().toISOString()
}, mcpProcess);
mcpProcess.once('exit', () => {
getSupervisor().unregisterProcess('mcp-server');
});
}
logger.success('WORKER', 'MCP loopback self-check connected');
// Cleanup
await transport.close();
} catch (error) {
logger.warn('WORKER', 'MCP loopback self-check failed', {
error: error instanceof Error ? error.message : String(error)
});
}
}
@@ -787,10 +727,11 @@ export class WorkerService {
throw error;
})
.finally(async () => {
// Primary-path subprocess teardown — process-group kill ensures any
// SDK descendants are reaped too (Principle 5).
const trackedProcess = getSdkProcessForSession(session.sessionDbId);
if (trackedProcess && trackedProcess.process.exitCode === null) {
await ensureSdkProcessExit(trackedProcess, 5000);
}
session.generatorPromise = null;
@@ -833,12 +774,14 @@ export class WorkerService {
session.consecutiveRestarts = (session.consecutiveRestarts || 0) + 1; // Keep for logging
if (!restartAllowed) {
logger.error('SYSTEM', 'Restart guard tripped: session is dead, terminating', {
sessionId: session.sessionDbId,
pendingCount,
restartsInWindow: session.restartGuard.restartsInWindow,
windowMs: session.restartGuard.windowMs,
maxRestarts: session.restartGuard.maxRestarts,
consecutiveFailures: session.restartGuard.consecutiveFailuresSinceSuccess,
maxConsecutiveFailures: session.restartGuard.maxConsecutiveFailures
});
session.consecutiveRestarts = 0;
this.terminateSession(session.sessionDbId, 'max_restarts_exceeded');
@@ -856,26 +799,17 @@ export class WorkerService {
this.startSessionProcessor(session, 'pending-work-restart');
this.broadcastProcessingStatus();
} else {
// Successful completion with no pending work — finalize then drop
// in-memory state. finalizeSession flips sdk_sessions.status to
// 'completed', drains orphaned pendings, broadcasts; idempotent so
// the later POST /api/sessions/complete from the Stop hook is a
// no-op. Without this, hooks-disabled installs (and any session
// whose Stop hook fails before /api/sessions/complete) leave the
// DB row permanently 'active'.
session.restartGuard?.recordSuccess();
session.consecutiveRestarts = 0;
this.completionHandler.finalizeSession(session.sessionDbId);
this.sessionManager.removeSessionImmediate(session.sessionDbId);
}
});
}
@@ -960,34 +894,12 @@ export class WorkerService {
}
}
// No fallback or both failed: mark session completed in DB (drain pending
// + broadcast via finalizeSession, idempotent) then drop in-memory state.
// Without this, sdk_sessions.status stays 'active' forever — the deleted
// reapStaleSessions interval was the only prior backstop.
this.completionHandler.finalizeSession(sessionDbId);
this.sessionManager.removeSessionImmediate(sessionDbId);
}
/**
@@ -1001,34 +913,15 @@ export class WorkerService {
* no? terminateSession()
*/
private terminateSession(sessionDbId: number, reason: string): void {
logger.info('SYSTEM', 'Session terminated', { sessionId: sessionDbId, reason });
// finalizeSession marks sdk_sessions.status='completed', drains pending
// messages, and broadcasts. Idempotent. Without this, wall-clock-limited
// and unrecoverable-error paths leave DB rows as 'active' forever.
this.completionHandler.finalizeSession(sessionDbId);
// removeSessionImmediate fires onSessionDeletedCallback → broadcastProcessingStatus()
this.sessionManager.removeSessionImmediate(sessionDbId);
}
/**
@@ -1154,18 +1047,6 @@ export class WorkerService {
logger.info('TRANSCRIPT', 'Transcript watcher stopped');
}
await performGracefulShutdown({
server: this.server.getHttpServer(),
sessionManager: this.sessionManager,
@@ -48,9 +48,6 @@ export interface ActiveSession {
// Track whether the most recent storage operation persisted a summary record.
// Used by the status endpoint so the Stop hook can detect silent summary loss (#1633).
lastSummaryStored?: boolean;
// Subagent identity carried forward from the most recent claimed pending message.
// When observations are parsed and stored, these fields label the resulting rows
// so subagent work is attributable. NULL / undefined means the batch came from the main session.
@@ -69,6 +66,9 @@ export interface PendingMessage {
// Claude Code subagent identity — present only when the hook fired inside a subagent.
agentId?: string;
agentType?: string;
/** Provider-assigned tool-use id; underpins the
* UNIQUE(content_session_id, tool_use_id) idempotency index added in plan 01. */
toolUseId?: string;
}
/**
@@ -90,6 +90,8 @@ export interface ObservationData {
// Claude Code subagent identity — present only when the hook fired inside a subagent.
agentId?: string;
agentType?: string;
/** Provider-assigned tool-use id (plan 03 phase 6 idempotency key). */
toolUseId?: string;
}
// ============================================================================
@@ -8,15 +8,17 @@
* - ChromaSync integration
*/
import { Database } from 'bun:sqlite';
import { SessionStore } from '../sqlite/SessionStore.js';
import { SessionSearch } from '../sqlite/SessionSearch.js';
import { ChromaSync } from '../sync/ChromaSync.js';
import { SettingsDefaultsManager } from '../../shared/SettingsDefaultsManager.js';
import { USER_SETTINGS_PATH, DB_PATH } from '../../shared/paths.js';
import { logger } from '../../utils/logger.js';
import type { DBSession } from '../worker-types.js';
export class DatabaseManager {
private db: Database | null = null;
private sessionStore: SessionStore | null = null;
private sessionSearch: SessionSearch | null = null;
private chromaSync: ChromaSync | null = null;
@@ -26,8 +28,11 @@ export class DatabaseManager {
*/
async initialize(): Promise<void> {
// Open database connection (ONCE)
this.db = new Database(DB_PATH);
// Shared connection between store and search
this.sessionStore = new SessionStore(this.db);
this.sessionSearch = new SessionSearch(this.db);
// Initialize ChromaSync only if Chroma is enabled (SQLite-only fallback when disabled)
const settings = SettingsDefaultsManager.loadFromFile(USER_SETTINGS_PATH);
@@ -38,7 +43,7 @@ export class DatabaseManager {
logger.info('DB', 'Chroma disabled via CLAUDE_MEM_CHROMA_ENABLED=false, using SQLite-only search');
}
logger.info('DB', 'Database initialized (shared connection)');
}
/**
@@ -51,13 +56,14 @@ export class DatabaseManager {
this.chromaSync = null;
}
// We don't call sessionStore.close() or sessionSearch.close()
// because they share this.db, which we close below.
this.sessionStore = null;
this.sessionSearch = null;
if (this.db) {
this.db.close();
this.db = null;
}
logger.info('DB', 'Database closed');
}
@@ -89,10 +95,6 @@ export class DatabaseManager {
return this.chromaSync;
}
/**
* Get session by ID (throws if not found)
*/
@@ -7,6 +7,7 @@
* - Efficient LIMIT+1 trick to avoid COUNT(*) query
*/
import type { SQLQueryBindings } from 'bun:sqlite';
import { DatabaseManager } from './DatabaseManager.js';
import { logger } from '../../utils/logger.js';
import { OBSERVER_SESSIONS_PROJECT } from '../../shared/paths.js';
@@ -102,7 +103,7 @@ export class PaginationHelper {
FROM observations o
LEFT JOIN sdk_sessions s ON o.memory_session_id = s.memory_session_id
`;
const params: unknown[] = [];
const params: SQLQueryBindings[] = [];
const conditions: string[] = [];
if (project) {
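The LIMIT+1 trick noted in the file's header comment can be sketched as a pure helper (names are illustrative, not the project's API): fetch `limit + 1` rows and let the presence of the extra row signal that more pages exist, avoiding a second COUNT(*) query.

```typescript
// Hypothetical sketch of the LIMIT+1 pagination trick: `rowsPlusOne` is
// assumed to be the result of a query ending in `LIMIT ${limit + 1}`.
// The extra row, if present, is the has-more signal; it is trimmed off
// before the page is returned.
function paginate<T>(rowsPlusOne: T[], limit: number): { page: T[]; hasMore: boolean } {
  const hasMore = rowsPlusOne.length > limit;
  return { page: rowsPlusOne.slice(0, limit), hasMore };
}
```

This trades one extra fetched row per page for skipping a full COUNT(*) table scan.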
-527
@@ -1,527 +0,0 @@
/**
* ProcessRegistry: Track spawned Claude subprocesses
*
* Fixes Issue #737: Claude haiku subprocesses don't terminate properly,
* causing zombie process accumulation (user reported 155 processes / 51GB RAM).
*
* Root causes:
* 1. SDK's SpawnedProcess interface hides subprocess PIDs
* 2. deleteSession() doesn't verify subprocess exit before cleanup
* 3. abort() is fire-and-forget with no confirmation
*
* Solution:
* - Use SDK's spawnClaudeCodeProcess option to capture PIDs
* - Track all spawned processes with session association
* - Verify exit on session deletion with timeout + SIGKILL escalation
* - Safety net orphan reaper runs every 5 minutes
*/
import { spawn, exec, ChildProcess } from 'child_process';
import { promisify } from 'util';
import { logger } from '../../utils/logger.js';
import { sanitizeEnv } from '../../supervisor/env-sanitizer.js';
import { getSupervisor } from '../../supervisor/index.js';
const execAsync = promisify(exec);
interface TrackedProcess {
pid: number;
sessionDbId: number;
spawnedAt: number;
process: ChildProcess;
}
function getTrackedProcesses(): TrackedProcess[] {
return getSupervisor().getRegistry()
.getAll()
.filter(record => record.type === 'sdk')
.map((record) => {
const processRef = getSupervisor().getRegistry().getRuntimeProcess(record.id);
if (!processRef) {
return null;
}
return {
pid: record.pid,
sessionDbId: Number(record.sessionId),
spawnedAt: Date.parse(record.startedAt),
process: processRef
};
})
.filter((value): value is TrackedProcess => value !== null);
}
/**
* Register a spawned process in the registry
*/
export function registerProcess(pid: number, sessionDbId: number, process: ChildProcess): void {
getSupervisor().registerProcess(`sdk:${sessionDbId}:${pid}`, {
pid,
type: 'sdk',
sessionId: sessionDbId,
startedAt: new Date().toISOString()
}, process);
logger.info('PROCESS', `Registered PID ${pid} for session ${sessionDbId}`, { pid, sessionDbId });
}
/**
* Unregister a process from the registry and notify pool waiters
*/
export function unregisterProcess(pid: number): void {
for (const record of getSupervisor().getRegistry().getByPid(pid)) {
if (record.type === 'sdk') {
getSupervisor().unregisterProcess(record.id);
}
}
logger.debug('PROCESS', `Unregistered PID ${pid}`, { pid });
// Notify waiters that a pool slot may be available
notifySlotAvailable();
}
/**
* Get process info by session ID
* Warns if multiple processes found (indicates race condition)
*/
export function getProcessBySession(sessionDbId: number): TrackedProcess | undefined {
const matches = getTrackedProcesses().filter(info => info.sessionDbId === sessionDbId);
if (matches.length > 1) {
logger.warn('PROCESS', `Multiple processes found for session ${sessionDbId}`, {
count: matches.length,
pids: matches.map(m => m.pid)
});
}
return matches[0];
}
/**
* Get count of active processes in the registry
*/
export function getActiveCount(): number {
return getSupervisor().getRegistry().getAll().filter(record => record.type === 'sdk').length;
}
// Waiters for pool slots - resolved when a process exits and frees a slot
const slotWaiters: Array<() => void> = [];
/**
* Notify waiters that a slot has freed up
*/
function notifySlotAvailable(): void {
const waiter = slotWaiters.shift();
if (waiter) waiter();
}
const TOTAL_PROCESS_HARD_CAP = 10;
/**
* Wait for a pool slot to become available (promise-based, not polling)
* @param maxConcurrent Max number of concurrent agents
* @param timeoutMs Max time to wait before giving up
* @param evictIdleSession Optional callback to evict an idle session when all slots are full (#1868)
*/
export async function waitForSlot(
maxConcurrent: number,
timeoutMs: number = 60_000,
evictIdleSession?: () => boolean
): Promise<void> {
// Hard cap: refuse to spawn if too many processes exist regardless of pool accounting
const activeCount = getActiveCount();
if (activeCount >= TOTAL_PROCESS_HARD_CAP) {
throw new Error(`Hard cap exceeded: ${activeCount} processes in registry (cap=${TOTAL_PROCESS_HARD_CAP}). Refusing to spawn more.`);
}
if (activeCount < maxConcurrent) return;
// Try to evict an idle session before waiting (#1868)
// Idle sessions hold pool slots during their 3-min idle timeout, blocking new sessions
// that would time out after 60s. Eviction aborts the idle session asynchronously;
// the freed slot is picked up by the waiter mechanism below.
if (evictIdleSession) {
const evicted = evictIdleSession();
if (evicted) {
logger.info('PROCESS', 'Evicted idle session to free pool slot for waiting request');
}
}
logger.info('PROCESS', `Pool limit reached (${activeCount}/${maxConcurrent}), waiting for slot...`);
return new Promise<void>((resolve, reject) => {
const timeout = setTimeout(() => {
const idx = slotWaiters.indexOf(onSlot);
if (idx >= 0) slotWaiters.splice(idx, 1);
reject(new Error(`Timed out waiting for agent pool slot after ${timeoutMs}ms`));
}, timeoutMs);
const onSlot = () => {
clearTimeout(timeout);
if (getActiveCount() < maxConcurrent) {
resolve();
} else {
// Still full, re-queue
slotWaiters.push(onSlot);
}
};
slotWaiters.push(onSlot);
});
}
/**
* Get all active PIDs (for debugging)
*/
export function getActiveProcesses(): Array<{ pid: number; sessionDbId: number; ageMs: number }> {
const now = Date.now();
return getTrackedProcesses().map(info => ({
pid: info.pid,
sessionDbId: info.sessionDbId,
ageMs: now - info.spawnedAt
}));
}
/**
* Wait for a process to exit with timeout, escalating to SIGKILL if needed
* Uses event-based waiting instead of polling to avoid CPU overhead
*/
export async function ensureProcessExit(tracked: TrackedProcess, timeoutMs: number = 5000): Promise<void> {
const { pid, process: proc } = tracked;
// Already exited? Only trust exitCode, NOT proc.killed
// proc.killed only means Node sent a signal — the process can still be alive
if (proc.exitCode !== null) {
unregisterProcess(pid);
return;
}
// Wait for graceful exit with timeout using event-based approach
const exitPromise = new Promise<void>((resolve) => {
proc.once('exit', () => resolve());
});
const timeoutPromise = new Promise<void>((resolve) => {
setTimeout(resolve, timeoutMs);
});
await Promise.race([exitPromise, timeoutPromise]);
// Check if exited gracefully — only trust exitCode
if (proc.exitCode !== null) {
unregisterProcess(pid);
return;
}
// Timeout: escalate to SIGKILL
logger.warn('PROCESS', `PID ${pid} did not exit after ${timeoutMs}ms, sending SIGKILL`, { pid, timeoutMs });
try {
proc.kill('SIGKILL');
} catch {
// Already dead
}
// Wait for SIGKILL to take effect — use exit event with 1s timeout instead of blind sleep
const sigkillExitPromise = new Promise<void>((resolve) => {
proc.once('exit', () => resolve());
});
const sigkillTimeout = new Promise<void>((resolve) => {
setTimeout(resolve, 1000);
});
await Promise.race([sigkillExitPromise, sigkillTimeout]);
unregisterProcess(pid);
}
/**
* Kill idle daemon children (claude processes spawned by worker-service)
*
* These are SDK-spawned claude processes that completed their work but
* didn't terminate properly. They remain as children of the worker-service
* daemon, consuming memory without doing useful work.
*
* Criteria for cleanup:
* - Process name is "claude"
* - Parent PID is the worker-service daemon (this process)
* - Process has 0% CPU (idle)
* - Process has been running for at least 1 minute
*/
async function killIdleDaemonChildren(): Promise<number> {
if (process.platform === 'win32') {
// Windows: Different process model, skip for now
return 0;
}
const daemonPid = process.pid;
let killed = 0;
try {
const { stdout } = await execAsync(
'ps -eo pid,ppid,%cpu,etime,comm 2>/dev/null | grep "claude$" || true'
);
for (const line of stdout.trim().split('\n')) {
if (!line) continue;
const parts = line.trim().split(/\s+/);
if (parts.length < 5) continue;
const [pidStr, ppidStr, cpuStr, etime] = parts;
const pid = parseInt(pidStr, 10);
const ppid = parseInt(ppidStr, 10);
const cpu = parseFloat(cpuStr);
// Skip if not a child of this daemon
if (ppid !== daemonPid) continue;
// Skip if actively using CPU
if (cpu > 0) continue;
// Parse elapsed time to minutes
// Formats: MM:SS, HH:MM:SS, D-HH:MM:SS
let minutes = 0;
const dayMatch = etime.match(/^(\d+)-(\d+):(\d+):(\d+)$/);
const hourMatch = etime.match(/^(\d+):(\d+):(\d+)$/);
const minMatch = etime.match(/^(\d+):(\d+)$/);
if (dayMatch) {
minutes = parseInt(dayMatch[1], 10) * 24 * 60 +
parseInt(dayMatch[2], 10) * 60 +
parseInt(dayMatch[3], 10);
} else if (hourMatch) {
minutes = parseInt(hourMatch[1], 10) * 60 +
parseInt(hourMatch[2], 10);
} else if (minMatch) {
minutes = parseInt(minMatch[1], 10);
}
// Kill if idle for at least 1 minute
if (minutes >= 1) {
logger.info('PROCESS', `Killing idle daemon child PID ${pid} (idle ${minutes}m)`, { pid, minutes });
try {
process.kill(pid, 'SIGKILL');
killed++;
} catch {
// Already dead or permission denied
}
}
}
} catch {
// No matches or command error
}
return killed;
}
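The elapsed-time parsing above could be factored into a standalone, unit-testable helper; a sketch (the function name is hypothetical):

```typescript
// Parse a `ps` etime value into whole minutes. Supported formats:
// MM:SS, HH:MM:SS, and D-HH:MM:SS. Seconds are deliberately dropped,
// and unrecognized input parses as 0 minutes.
function parseEtimeMinutes(etime: string): number {
  const dayMatch = etime.match(/^(\d+)-(\d+):(\d+):(\d+)$/);
  if (dayMatch) {
    return parseInt(dayMatch[1], 10) * 24 * 60 +
           parseInt(dayMatch[2], 10) * 60 +
           parseInt(dayMatch[3], 10);
  }
  const hourMatch = etime.match(/^(\d+):(\d+):(\d+)$/);
  if (hourMatch) {
    return parseInt(hourMatch[1], 10) * 60 + parseInt(hourMatch[2], 10);
  }
  const minMatch = etime.match(/^(\d+):(\d+)$/);
  return minMatch ? parseInt(minMatch[1], 10) : 0;
}
```

Isolating the parsing also makes the format table in the comment directly checkable against real `ps` output.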
/**
* Kill system-level orphans (ppid=1 on Unix)
* These are Claude processes whose parent died unexpectedly
*/
async function killSystemOrphans(): Promise<number> {
if (process.platform === 'win32') {
return 0; // Windows doesn't have ppid=1 orphan concept
}
try {
const { stdout } = await execAsync(
'ps -eo pid,ppid,args 2>/dev/null | grep -E "claude.*haiku|claude.*output-format" | grep -v grep'
);
let killed = 0;
for (const line of stdout.trim().split('\n')) {
if (!line) continue;
const match = line.trim().match(/^(\d+)\s+(\d+)/);
if (match && parseInt(match[2], 10) === 1) { // ppid=1 = orphan
const orphanPid = parseInt(match[1], 10);
logger.warn('PROCESS', `Killing system orphan PID ${orphanPid}`, { pid: orphanPid });
try {
process.kill(orphanPid, 'SIGKILL');
killed++;
} catch {
// Already dead or permission denied
}
}
}
return killed;
} catch {
return 0; // No matches or error
}
}
/**
* Reap orphaned processes - both registry-tracked and system-level
*/
export async function reapOrphanedProcesses(activeSessionIds: Set<number>): Promise<number> {
let killed = 0;
// Registry-based: kill processes for dead sessions
for (const record of getSupervisor().getRegistry().getAll().filter(entry => entry.type === 'sdk')) {
const pid = record.pid;
const sessionDbId = Number(record.sessionId);
const processRef = getSupervisor().getRegistry().getRuntimeProcess(record.id);
if (activeSessionIds.has(sessionDbId)) continue; // Active = safe
logger.warn('PROCESS', `Killing orphan PID ${pid} (session ${sessionDbId} gone)`, { pid, sessionDbId });
try {
if (processRef) {
processRef.kill('SIGKILL');
} else {
process.kill(pid, 'SIGKILL');
}
killed++;
} catch {
// Already dead
}
getSupervisor().unregisterProcess(record.id);
notifySlotAvailable();
}
// System-level: find ppid=1 orphans
killed += await killSystemOrphans();
// Daemon children: find idle SDK processes that didn't terminate
killed += await killIdleDaemonChildren();
return killed;
}
/**
* Create a custom spawn function for SDK that captures PIDs
*
* The SDK's spawnClaudeCodeProcess option allows us to intercept subprocess
* creation and capture the PID before the SDK hides it.
*
* NOTE: Session isolation is handled via the `cwd` option in SDKAgent.ts,
* NOT via CLAUDE_CONFIG_DIR (which breaks authentication).
*/
export function createPidCapturingSpawn(sessionDbId: number) {
return (spawnOptions: {
command: string;
args: string[];
cwd?: string;
env?: NodeJS.ProcessEnv;
signal?: AbortSignal;
}) => {
// Kill any existing process for this session before spawning a new one.
// Multiple processes sharing the same --resume UUID waste API credits and
// can conflict with each other (Issue #1590).
const existing = getProcessBySession(sessionDbId);
if (existing && existing.process.exitCode === null) {
logger.warn('PROCESS', `Killing duplicate process PID ${existing.pid} before spawning new one for session ${sessionDbId}`, {
existingPid: existing.pid,
sessionDbId
});
let exited = false;
try {
existing.process.kill('SIGTERM');
exited = existing.process.exitCode !== null;
} catch (error: unknown) {
// Already dead — safe to unregister immediately
if (error instanceof Error) {
logger.warn('WORKER', `Failed to kill duplicate process PID ${existing.pid}, likely already dead`, { existingPid: existing.pid, sessionDbId }, error);
}
exited = true;
}
if (exited) {
unregisterProcess(existing.pid);
}
// If still alive, the 'exit' handler (line ~440) will unregister it.
}
getSupervisor().assertCanSpawn('claude sdk');
// On Windows, use cmd.exe wrapper for .cmd files to properly handle paths with spaces
const useCmdWrapper = process.platform === 'win32' && spawnOptions.command.endsWith('.cmd');
const env = sanitizeEnv(spawnOptions.env ?? process.env);
// Filter empty string args AND their preceding flag (Issue #2049).
// The Agent SDK emits ["--setting-sources", ""] when settingSources defaults to [].
// Simply dropping "" leaves an orphan --setting-sources that consumes the next
// flag (e.g. --permission-mode) as its value, crashing Claude Code 2.1.109+ with
// "Invalid setting source: --permission-mode". Drop the flag too so the SDK
// default (no setting sources) is preserved by omission.
const args: string[] = [];
for (const arg of spawnOptions.args) {
if (arg === '') {
if (args.length > 0 && args[args.length - 1].startsWith('--')) {
args.pop();
}
continue;
}
args.push(arg);
}
const child = useCmdWrapper
? spawn('cmd.exe', ['/d', '/c', spawnOptions.command, ...args], {
cwd: spawnOptions.cwd,
env,
stdio: ['pipe', 'pipe', 'pipe'],
signal: spawnOptions.signal,
windowsHide: true
})
: spawn(spawnOptions.command, args, {
cwd: spawnOptions.cwd,
env,
stdio: ['pipe', 'pipe', 'pipe'],
signal: spawnOptions.signal, // CRITICAL: Pass signal for AbortController integration
windowsHide: true
});
// Capture stderr for debugging spawn failures
if (child.stderr) {
child.stderr.on('data', (data: Buffer) => {
logger.debug('SDK_SPAWN', `[session-${sessionDbId}] stderr: ${data.toString().trim()}`);
});
}
// Register PID
if (child.pid) {
registerProcess(child.pid, sessionDbId, child);
// Auto-unregister on exit
child.on('exit', (code: number | null, signal: string | null) => {
if (code !== 0) {
logger.warn('SDK_SPAWN', `[session-${sessionDbId}] Claude process exited`, { code, signal, pid: child.pid });
}
if (child.pid) {
unregisterProcess(child.pid);
}
});
}
// Return SDK-compatible interface
return {
stdin: child.stdin,
stdout: child.stdout,
stderr: child.stderr,
get killed() { return child.killed; },
get exitCode() { return child.exitCode; },
kill: child.kill.bind(child),
on: child.on.bind(child),
once: child.once.bind(child),
off: child.off.bind(child)
};
};
}
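The flag-dropping rule explained in the comment above (Issue #2049) can be expressed as a pure helper, which makes the behavior easy to unit test in isolation; the function name here is hypothetical:

```typescript
// Drop each empty-string argument AND the `--flag` immediately before it.
// Dropping only the "" would leave an orphaned flag that consumes the
// next flag as its value (e.g. --setting-sources swallowing
// --permission-mode), so the flag is removed too.
function filterEmptyFlagArgs(input: string[]): string[] {
  const out: string[] = [];
  for (const arg of input) {
    if (arg === '') {
      if (out.length > 0 && out[out.length - 1].startsWith('--')) {
        out.pop();
      }
      continue;
    }
    out.push(arg);
  }
  return out;
}
```

An empty string after a positional argument is simply skipped, so only flag/value pairs are affected.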
/**
* Start the orphan reaper interval
* Returns cleanup function to stop the interval
*/
export function startOrphanReaper(getActiveSessionIds: () => Set<number>, intervalMs: number = 30 * 1000): () => void {
const interval = setInterval(async () => {
try {
const activeIds = getActiveSessionIds();
const killed = await reapOrphanedProcesses(activeIds);
if (killed > 0) {
logger.info('PROCESS', `Reaper cleaned up ${killed} orphaned processes`, { killed });
}
} catch (error) {
if (error instanceof Error) {
logger.error('WORKER', 'Reaper error', {}, error);
} else {
logger.error('WORKER', 'Reaper error', { rawError: String(error) });
}
}
}, intervalMs);
// Return cleanup function
return () => clearInterval(interval);
}
+34 -2
@@ -3,15 +3,26 @@
* Prevents tight-loop restarts (bug) while allowing legitimate occasional restarts
* over long sessions. Replaces the flat consecutiveRestarts counter that stranded
* pending messages after just 3 restarts over any timeframe (#2053).
*
* TWO INDEPENDENT TRIPS:
* 1. Sliding window: more than MAX_WINDOWED_RESTARTS within RESTART_WINDOW_MS.
* Catches genuinely tight loops (e.g. crash every <6s).
* 2. Consecutive failures: more than MAX_CONSECUTIVE_FAILURES restarts with
* NO successful processing in between. Catches dead sessions that
* fail-restart-fail-restart on a slow exponential backoff cadence
* (e.g. 8s backoff cap + spawn failures = restartsInWindow stays under
* the windowed cap forever, but the session is clearly dead).
*/
const RESTART_WINDOW_MS = 60_000; // Only count restarts within last 60 seconds
const MAX_WINDOWED_RESTARTS = 10; // 10 restarts in 60s = runaway loop
const MAX_CONSECUTIVE_FAILURES = 5; // 5 restarts with no success in between = session is dead
const DECAY_AFTER_SUCCESS_MS = 5 * 60_000; // Clear history after 5min of uninterrupted success
export class RestartGuard {
private restartTimestamps: number[] = [];
private lastSuccessfulProcessing: number | null = null;
private consecutiveFailures: number = 0;
/**
* Record a restart and check if the guard should trip.
@@ -34,16 +45,23 @@ export class RestartGuard {
// Record this restart
this.restartTimestamps.push(now);
this.consecutiveFailures += 1;
// Check if we've exceeded the cap within the window
return this.restartTimestamps.length <= MAX_WINDOWED_RESTARTS;
// Trip if EITHER guard exceeds its limit:
// - Sliding window cap (tight loops)
// - Consecutive failures with no successful work (dead session, e.g. spawn always fails)
const withinWindowedCap = this.restartTimestamps.length <= MAX_WINDOWED_RESTARTS;
const withinConsecutiveCap = this.consecutiveFailures <= MAX_CONSECUTIVE_FAILURES;
return withinWindowedCap && withinConsecutiveCap;
}
/**
* Call when a message is successfully processed to update the success timestamp.
* Resets the consecutive-failure counter (real progress was made).
*/
recordSuccess(): void {
this.lastSuccessfulProcessing = Date.now();
this.consecutiveFailures = 0;
}
/**
@@ -67,4 +85,18 @@ export class RestartGuard {
get maxRestarts(): number {
return MAX_WINDOWED_RESTARTS;
}
/**
* Get consecutive failures since last successful processing (for logging).
*/
get consecutiveFailuresSinceSuccess(): number {
return this.consecutiveFailures;
}
/**
* Get the max allowed consecutive failures (for logging).
*/
get maxConsecutiveFailures(): number {
return MAX_CONSECUTIVE_FAILURES;
}
}
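The two independent trips described in the header comment can be condensed into a minimal sketch (constants inlined, class and method names hypothetical): a restart is allowed only while BOTH caps hold, and only real progress resets the consecutive counter.

```typescript
// Simplified sketch of the two-trip restart guard.
class TwoTripGuard {
  private restartTimestamps: number[] = [];
  private consecutiveFailures = 0;

  // Returns true if the restart is allowed, false if the guard trips.
  recordRestart(now: number): boolean {
    // Trip 1: sliding window. Drop timestamps older than 60s, then cap at 10.
    this.restartTimestamps = this.restartTimestamps.filter(t => now - t < 60_000);
    this.restartTimestamps.push(now);
    // Trip 2: consecutive failures with no successful processing in between.
    this.consecutiveFailures += 1;
    return this.restartTimestamps.length <= 10 && this.consecutiveFailures <= 5;
  }

  recordSuccess(): void {
    // Real progress was made; only the consecutive counter resets.
    this.consecutiveFailures = 0;
  }
}
```

A dead session restarting on an 8s backoff cadence never exceeds the windowed cap (at most ~8 restarts per 60s window), but trips the consecutive cap on its sixth failed restart.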
+19 -11
@@ -21,7 +21,12 @@ import { buildIsolatedEnv, getAuthMethodDescription } from '../../shared/EnvMana
import type { ActiveSession, SDKUserMessage } from '../worker-types.js';
import { ModeManager } from '../domain/ModeManager.js';
import { processAgentResponse, type WorkerRef } from './agents/index.js';
import { createPidCapturingSpawn, getProcessBySession, ensureProcessExit, waitForSlot } from './ProcessRegistry.js';
import {
createSdkSpawnFactory,
getSdkProcessForSession,
ensureSdkProcessExit,
waitForSlot,
} from '../../supervisor/process-registry.js';
import { sanitizeEnv } from '../../supervisor/env-sanitizer.js';
// Import Agent SDK (assumes it's installed)
@@ -90,11 +95,11 @@ export class SDKAgent {
}
// Wait for agent pool slot (configurable via CLAUDE_MEM_MAX_CONCURRENT_AGENTS)
// Pass idle session eviction callback to prevent pool deadlock (#1868):
// idle sessions hold slots during 3-min idle wait, blocking new sessions
// Backpressure only — a full pool waits, never evicts a live session
// (Principle 1: do not kick live work to make room).
const settings = SettingsDefaultsManager.loadFromFile(USER_SETTINGS_PATH);
const maxConcurrent = parseInt(settings.CLAUDE_MEM_MAX_CONCURRENT_AGENTS, 10) || 2;
await waitForSlot(maxConcurrent, 60_000, () => this.sessionManager.evictIdlestSession());
await waitForSlot(maxConcurrent, 60_000);
// Build isolated environment from ~/.claude-mem/.env
// This prevents Issue #733: random ANTHROPIC_API_KEY from project .env files
@@ -105,7 +110,7 @@ export class SDKAgent {
logger.info('SDK', 'Starting SDK query', {
sessionDbId: session.sessionDbId,
contentSessionId: session.contentSessionId,
memorySessionId: session.memorySessionId,
memorySessionId: session.memorySessionId ?? undefined,
hasRealMemorySessionId,
shouldResume,
resume_parameter: shouldResume ? session.memorySessionId : '(none - fresh start)',
@@ -139,12 +144,13 @@ export class SDKAgent {
// instead of polluting user's actual project resume lists
cwd: OBSERVER_SESSIONS_DIR,
// Only resume if shouldResume is true (memorySessionId exists, not first prompt, not forceInit)
...(shouldResume && { resume: session.memorySessionId }),
...(shouldResume && session.memorySessionId ? { resume: session.memorySessionId } : {}),
disallowedTools,
abortController: session.abortController,
pathToClaudeCodeExecutable: claudePath,
// Custom spawn function captures PIDs to fix zombie process accumulation
spawnClaudeCodeProcess: createPidCapturingSpawn(session.sessionDbId),
// Custom spawn factory: spawns the SDK child in its own POSIX process
// group so the worker can tear down the whole subtree on shutdown.
spawnClaudeCodeProcess: createSdkSpawnFactory(session.sessionDbId),
env: isolatedEnv // Use isolated credentials from ~/.claude-mem/.env, not process.env
}
});
@@ -283,10 +289,12 @@ export class SDKAgent {
}
}
} finally {
// Ensure subprocess is terminated after query completes (or on error)
const tracked = getProcessBySession(session.sessionDbId);
// Ensure subprocess is terminated after query completes (or on error).
// Process-group teardown via ensureSdkProcessExit kills any descendants
// the SDK spawned, so no orphan reaper is needed (Principle 5).
const tracked = getSdkProcessForSession(session.sessionDbId);
if (tracked && tracked.process.exitCode === null) {
await ensureProcessExit(tracked, 5000);
await ensureSdkProcessExit(tracked, 5000);
}
}
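The process-group teardown referenced in the comment above can be sketched as follows. This is an illustrative sketch, not the project's actual supervisor code, and it assumes a POSIX platform; the function names are hypothetical.

```typescript
import { spawn, ChildProcess } from 'child_process';

// `detached: true` makes the child the leader of a new POSIX process
// group. Signalling the NEGATIVE pid delivers the signal to every
// process in that group, including grandchildren the child spawned.
function spawnInOwnGroup(command: string, args: string[]): ChildProcess {
  return spawn(command, args, { detached: true, stdio: 'ignore' });
}

function killProcessGroup(leaderPid: number, signal: NodeJS.Signals = 'SIGKILL'): void {
  process.kill(-leaderPid, signal); // negative PID addresses the whole group
}
```

Because the group is the unit of teardown, descendants cannot outlive the leader unnoticed, which is what makes a periodic orphan reaper unnecessary.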
+99 -331
@@ -31,6 +31,8 @@ import {
SEARCH_CONSTANTS
} from './search/index.js';
import type { TimelineData } from './search/index.js';
import { ResultFormatter } from './search/ResultFormatter.js';
import { ChromaUnavailableError } from './search/errors.js';
export class SearchManager {
private orchestrator: SearchOrchestrator;
@@ -52,6 +54,22 @@ export class SearchManager {
this.timelineBuilder = new TimelineBuilder();
}
/**
* Accessor for the underlying orchestrator. Used by HTTP routes that need
* raw StrategySearchResult instead of formatted MCP text output.
*/
getOrchestrator(): SearchOrchestrator {
return this.orchestrator;
}
/**
* Accessor for the formatter. Used by HTTP routes that construct
* text output from raw orchestrator results.
*/
getFormatter(): FormattingService {
return this.formatter;
}
/**
* Query Chroma vector database via ChromaSync
* @deprecated Use orchestrator.search() instead
@@ -166,6 +184,7 @@ export class SearchManager {
let sessions: SessionSummarySearchResult[] = [];
let prompts: UserPromptSearchResult[] = [];
let chromaFailed = false;
let chromaFailureReason: { message: string; isConnectionError: boolean } | null = null;
// Determine which types to query based on type filter
const searchObservations = !type || type === 'observations';
@@ -202,12 +221,6 @@ export class SearchManager {
whereFilter = { doc_type: 'user_prompt' };
}
// Include project in the Chroma where clause to scope vector search.
// Without this, larger projects dominate the top-N results and smaller
// projects get crowded out before the post-hoc SQLite filter.
// Match both native-provenance rows (project) and adopted merged-worktree
// rows (merged_into_project) so a parent-project query surfaces its
// merged children's observations too.
if (options.project) {
const projectFilter = {
$or: [
@@ -220,82 +233,96 @@ export class SearchManager {
: projectFilter;
}
// Step 1: Chroma semantic search with optional type + project filter
const chromaResults = await this.queryChroma(query, 100, whereFilter);
chromaSucceeded = true; // Chroma didn't throw error
logger.debug('SEARCH', 'ChromaDB returned semantic matches', { matchCount: chromaResults.ids.length });
try {
// Step 1: Chroma semantic search with optional type + project filter
const chromaResults = await this.queryChroma(query, 100, whereFilter);
chromaSucceeded = true; // Chroma didn't throw error
logger.debug('SEARCH', 'ChromaDB returned semantic matches', { matchCount: chromaResults.ids.length });
if (chromaResults.ids.length > 0) {
// Step 2: Filter by date range
// Use user-provided dateRange if available, otherwise fall back to 90-day recency window
const { dateRange } = options;
let startEpoch: number | undefined;
let endEpoch: number | undefined;
if (chromaResults.ids.length > 0) {
// Step 2: Filter by date range
const { dateRange } = options;
let startEpoch: number | undefined;
let endEpoch: number | undefined;
if (dateRange) {
if (dateRange.start) {
startEpoch = typeof dateRange.start === 'number'
? dateRange.start
: new Date(dateRange.start).getTime();
if (dateRange) {
if (dateRange.start) {
startEpoch = typeof dateRange.start === 'number'
? dateRange.start
: new Date(dateRange.start).getTime();
}
if (dateRange.end) {
endEpoch = typeof dateRange.end === 'number'
? dateRange.end
: new Date(dateRange.end).getTime();
}
} else {
// Default: 90-day recency window
startEpoch = Date.now() - SEARCH_CONSTANTS.RECENCY_WINDOW_MS;
}
if (dateRange.end) {
endEpoch = typeof dateRange.end === 'number'
? dateRange.end
: new Date(dateRange.end).getTime();
const recentMetadata = chromaResults.metadatas.map((meta, idx) => ({
id: chromaResults.ids[idx],
meta,
isRecent: meta && meta.created_at_epoch != null
&& (!startEpoch || meta.created_at_epoch >= startEpoch)
&& (!endEpoch || meta.created_at_epoch <= endEpoch)
})).filter(item => item.isRecent);
logger.debug('SEARCH', dateRange ? 'Results within user date range' : 'Results within 90-day window', { count: recentMetadata.length });
// Step 3: Categorize IDs by document type
const obsIds: number[] = [];
const sessionIds: number[] = [];
const promptIds: number[] = [];
for (const item of recentMetadata) {
const docType = item.meta?.doc_type;
if (docType === 'observation' && searchObservations) {
obsIds.push(item.id);
} else if (docType === 'session_summary' && searchSessions) {
sessionIds.push(item.id);
} else if (docType === 'user_prompt' && searchPrompts) {
promptIds.push(item.id);
}
}
// Step 4: Hydrate from SQLite with additional filters
if (obsIds.length > 0) {
const obsOptions = { ...options, type: obs_type, concepts, files };
observations = this.sessionStore.getObservationsByIds(obsIds, obsOptions);
}
if (sessionIds.length > 0) {
sessions = this.sessionStore.getSessionSummariesByIds(sessionIds, { orderBy: 'date_desc', limit: options.limit, project: options.project });
}
if (promptIds.length > 0) {
prompts = this.sessionStore.getUserPromptsByIds(promptIds, { orderBy: 'date_desc', limit: options.limit, project: options.project });
}
} else {
// Default: 90-day recency window
startEpoch = Date.now() - SEARCH_CONSTANTS.RECENCY_WINDOW_MS;
logger.debug('SEARCH', 'ChromaDB found no matches (final result, no FTS5 fallback)', {});
}
} catch (chromaError) {
const errorObject = chromaError instanceof Error ? chromaError : new Error(String(chromaError));
chromaFailureReason = {
message: errorObject.message,
isConnectionError: chromaError instanceof ChromaUnavailableError,
};
logger.warn('SEARCH', 'ChromaDB semantic search failed, falling back to FTS5 keyword search', {}, errorObject);
chromaFailed = true;
const recentMetadata = chromaResults.metadatas.map((meta, idx) => ({
id: chromaResults.ids[idx],
meta,
isRecent: meta && meta.created_at_epoch != null
&& (!startEpoch || meta.created_at_epoch >= startEpoch)
&& (!endEpoch || meta.created_at_epoch <= endEpoch)
})).filter(item => item.isRecent);
logger.debug('SEARCH', dateRange ? 'Results within user date range' : 'Results within 90-day window', { count: recentMetadata.length });
// Step 3: Categorize IDs by document type
const obsIds: number[] = [];
const sessionIds: number[] = [];
const promptIds: number[] = [];
for (const item of recentMetadata) {
const docType = item.meta?.doc_type;
if (docType === 'observation' && searchObservations) {
obsIds.push(item.id);
} else if (docType === 'session_summary' && searchSessions) {
sessionIds.push(item.id);
} else if (docType === 'user_prompt' && searchPrompts) {
promptIds.push(item.id);
}
// Fallback to FTS5 path since Chroma failed
if (searchObservations) {
observations = this.sessionSearch.searchObservations(query, { ...options, type: obs_type, concepts, files });
}
logger.debug('SEARCH', 'Categorized results by type', { observations: obsIds.length, sessions: sessionIds.length, prompts: promptIds.length });
// Step 4: Hydrate from SQLite with additional filters
if (obsIds.length > 0) {
// Apply obs_type, concepts, files filters if provided
const obsOptions = { ...options, type: obs_type, concepts, files };
observations = this.sessionStore.getObservationsByIds(obsIds, obsOptions);
if (searchSessions) {
sessions = this.sessionSearch.searchSessions(query, options);
}
if (sessionIds.length > 0) {
sessions = this.sessionStore.getSessionSummariesByIds(sessionIds, { orderBy: 'date_desc', limit: options.limit, project: options.project });
if (searchPrompts) {
prompts = this.sessionSearch.searchUserPrompts(query, options);
}
if (promptIds.length > 0) {
prompts = this.sessionStore.getUserPromptsByIds(promptIds, { orderBy: 'date_desc', limit: options.limit, project: options.project });
}
logger.debug('SEARCH', 'Hydrated results from SQLite', { observations: observations.length, sessions: sessions.length, prompts: prompts.length });
} else {
// Chroma returned 0 results - this is the correct answer, don't fall back to FTS5
logger.debug('SEARCH', 'ChromaDB found no matches (final result, no FTS5 fallback)', {});
}
}
// ChromaDB not initialized - fall back to FTS5 keyword search (#1913, #2048)
// PATH 3: FTS5 KEYWORD SEARCH (Chroma not initialized)
else if (query) {
logger.debug('SEARCH', 'ChromaDB not initialized — falling back to FTS5 keyword search', {});
try {
@@ -329,11 +356,11 @@ export class SearchManager {
}
if (totalResults === 0) {
if (chromaFailed) {
if (chromaFailureReason !== null) {
return {
content: [{
type: 'text' as const,
text: `Vector search failed - semantic search unavailable.\n\nTo enable semantic search:\n1. Install uv: https://docs.astral.sh/uv/getting-started/installation/\n2. Restart the worker: npm run worker:restart\n\nNote: You can still use filter-only searches (date ranges, types, files) without a query term.`
text: ResultFormatter.formatChromaFailureMessage(chromaFailureReason)
}]
};
}
@@ -1203,265 +1230,6 @@ export class SearchManager {
}
/**
* Tool handler: find_by_concept
*/
async findByConcept(args: any): Promise<any> {
const normalized = this.normalizeParams(args);
const { concepts: concept, ...filters } = normalized;
let results: ObservationSearchResult[] = [];
// Metadata-first, semantic-enhanced search
if (this.chromaSync) {
logger.debug('SEARCH', 'Using metadata-first + semantic ranking for concept search', {});
// Step 1: SQLite metadata filter (get all IDs with this concept)
const metadataResults = this.sessionSearch.findByConcept(concept, filters);
logger.debug('SEARCH', 'Found observations with concept', { concept, count: metadataResults.length });
if (metadataResults.length > 0) {
// Step 2: Chroma semantic ranking (rank by relevance to concept)
const ids = metadataResults.map(obs => obs.id);
const chromaResults = await this.queryChroma(concept, Math.min(ids.length, 100));
// Intersect: Keep only IDs that passed metadata filter, in semantic rank order
const rankedIds: number[] = [];
for (const chromaId of chromaResults.ids) {
if (ids.includes(chromaId) && !rankedIds.includes(chromaId)) {
rankedIds.push(chromaId);
}
}
logger.debug('SEARCH', 'Chroma ranked results by semantic relevance', { count: rankedIds.length });
// Step 3: Hydrate in semantic rank order
if (rankedIds.length > 0) {
results = this.sessionStore.getObservationsByIds(rankedIds, { limit: filters.limit || 20 });
// Restore semantic ranking order
results.sort((a, b) => rankedIds.indexOf(a.id) - rankedIds.indexOf(b.id));
}
}
}
// Fall back to SQLite-only if Chroma unavailable or failed
if (results.length === 0) {
logger.debug('SEARCH', 'Using SQLite-only concept search', {});
results = this.sessionSearch.findByConcept(concept, filters);
}
if (results.length === 0) {
return {
content: [{
type: 'text' as const,
text: `No observations found with concept "${concept}"`
}]
};
}
// Format as table
const header = `Found ${results.length} observation(s) with concept "${concept}"\n\n${this.formatter.formatTableHeader()}`;
const formattedResults = results.map((obs, i) => this.formatter.formatObservationIndex(obs, i));
return {
content: [{
type: 'text' as const,
text: header + '\n' + formattedResults.join('\n')
}]
};
}
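The intersect-and-rank step that findByConcept, findByFile, and findByType each repeat can be sketched as one standalone helper. This is an illustrative sketch, not code from the PR: the name `rankBySemanticOrder` is ours, and a `Set` lookup stands in for the `ids.includes` scan in the handlers.

```typescript
// Keep only IDs that passed the SQLite metadata filter, preserving Chroma's
// semantic rank order and dropping duplicates. The Set gives O(1) membership
// checks where the inlined loops above call ids.includes per candidate.
function rankBySemanticOrder(metadataIds: number[], chromaIds: number[]): number[] {
  const allowed = new Set(metadataIds);
  const rankedIds: number[] = [];
  for (const chromaId of chromaIds) {
    if (allowed.has(chromaId) && !rankedIds.includes(chromaId)) {
      rankedIds.push(chromaId);
    }
  }
  return rankedIds;
}
```

The handlers then hydrate rows for `rankedIds` and re-sort by `rankedIds.indexOf(...)` to restore the semantic order, as shown above.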
/**
* Tool handler: find_by_file
*/
async findByFile(args: any): Promise<any> {
const normalized = this.normalizeParams(args);
const { files: rawFilePath, ...filters } = normalized;
// Handle both string and array (normalizeParams may split on comma)
const filePath = Array.isArray(rawFilePath) ? rawFilePath[0] : rawFilePath;
let observations: ObservationSearchResult[] = [];
let sessions: SessionSummarySearchResult[] = [];
// Metadata-first, semantic-enhanced search for observations
if (this.chromaSync) {
logger.debug('SEARCH', 'Using metadata-first + semantic ranking for file search', {});
// Step 1: SQLite metadata filter (get all results with this file)
const metadataResults = this.sessionSearch.findByFile(filePath, filters);
logger.debug('SEARCH', 'Found results for file', { file: filePath, observations: metadataResults.observations.length, sessions: metadataResults.sessions.length });
// Sessions: Keep as-is (already summarized, no semantic ranking needed)
sessions = metadataResults.sessions;
// Observations: Apply semantic ranking
if (metadataResults.observations.length > 0) {
// Step 2: Chroma semantic ranking (rank by relevance to file path)
const ids = metadataResults.observations.map(obs => obs.id);
const chromaResults = await this.queryChroma(filePath, Math.min(ids.length, 100));
// Intersect: Keep only IDs that passed metadata filter, in semantic rank order
const rankedIds: number[] = [];
for (const chromaId of chromaResults.ids) {
if (ids.includes(chromaId) && !rankedIds.includes(chromaId)) {
rankedIds.push(chromaId);
}
}
logger.debug('SEARCH', 'Chroma ranked observations by semantic relevance', { count: rankedIds.length });
// Step 3: Hydrate in semantic rank order
if (rankedIds.length > 0) {
observations = this.sessionStore.getObservationsByIds(rankedIds, { limit: filters.limit || 20 });
// Restore semantic ranking order
observations.sort((a, b) => rankedIds.indexOf(a.id) - rankedIds.indexOf(b.id));
}
}
}
// Fall back to SQLite-only if Chroma unavailable or failed
if (observations.length === 0 && sessions.length === 0) {
logger.debug('SEARCH', 'Using SQLite-only file search', {});
const results = this.sessionSearch.findByFile(filePath, filters);
observations = results.observations;
sessions = results.sessions;
}
const totalResults = observations.length + sessions.length;
if (totalResults === 0) {
return {
content: [{
type: 'text' as const,
text: `No results found for file "${filePath}"`
}]
};
}
// Combine observations and sessions with timestamps for date grouping
const combined: Array<{
type: 'observation' | 'session';
data: ObservationSearchResult | SessionSummarySearchResult;
epoch: number;
created_at: string;
}> = [
...observations.map(obs => ({
type: 'observation' as const,
data: obs,
epoch: obs.created_at_epoch,
created_at: obs.created_at
})),
...sessions.map(sess => ({
type: 'session' as const,
data: sess,
epoch: sess.created_at_epoch,
created_at: sess.created_at
}))
];
// Sort by date (most recent first)
combined.sort((a, b) => b.epoch - a.epoch);
// Group by date for proper timeline rendering
const resultsByDate = groupByDate(combined, item => item.created_at);
// Format with date headers for proper date parsing by folder CLAUDE.md generator
const lines: string[] = [];
lines.push(`Found ${totalResults} result(s) for file "${filePath}"`);
lines.push('');
for (const [day, dayResults] of resultsByDate) {
lines.push(`### ${day}`);
lines.push('');
lines.push(this.formatter.formatTableHeader());
for (const result of dayResults) {
if (result.type === 'observation') {
lines.push(this.formatter.formatObservationIndex(result.data as ObservationSearchResult, 0));
} else {
lines.push(this.formatter.formatSessionIndex(result.data as SessionSummarySearchResult, 0));
}
}
lines.push('');
}
return {
content: [{
type: 'text' as const,
text: lines.join('\n')
}]
};
}
/**
* Tool handler: find_by_type
*/
async findByType(args: any): Promise<any> {
const normalized = this.normalizeParams(args);
const { type, ...filters } = normalized;
const typeStr = Array.isArray(type) ? type.join(', ') : type;
let results: ObservationSearchResult[] = [];
// Metadata-first, semantic-enhanced search
if (this.chromaSync) {
logger.debug('SEARCH', 'Using metadata-first + semantic ranking for type search', {});
// Step 1: SQLite metadata filter (get all IDs with this type)
const metadataResults = this.sessionSearch.findByType(type, filters);
logger.debug('SEARCH', 'Found observations with type', { type: typeStr, count: metadataResults.length });
if (metadataResults.length > 0) {
// Step 2: Chroma semantic ranking (rank by relevance to type)
const ids = metadataResults.map(obs => obs.id);
const chromaResults = await this.queryChroma(typeStr, Math.min(ids.length, 100));
// Intersect: Keep only IDs that passed metadata filter, in semantic rank order
const rankedIds: number[] = [];
for (const chromaId of chromaResults.ids) {
if (ids.includes(chromaId) && !rankedIds.includes(chromaId)) {
rankedIds.push(chromaId);
}
}
logger.debug('SEARCH', 'Chroma ranked results by semantic relevance', { count: rankedIds.length });
// Step 3: Hydrate in semantic rank order
if (rankedIds.length > 0) {
results = this.sessionStore.getObservationsByIds(rankedIds, { limit: filters.limit || 20 });
// Restore semantic ranking order
results.sort((a, b) => rankedIds.indexOf(a.id) - rankedIds.indexOf(b.id));
}
}
}
// Fall back to SQLite-only if Chroma unavailable or failed
if (results.length === 0) {
logger.debug('SEARCH', 'Using SQLite-only type search', {});
results = this.sessionSearch.findByType(type, filters);
}
if (results.length === 0) {
return {
content: [{
type: 'text' as const,
text: `No observations found with type "${typeStr}"`
}]
};
}
// Format as table
const header = `Found ${results.length} observation(s) with type "${typeStr}"\n\n${this.formatter.formatTableHeader()}`;
const formattedResults = results.map((obs, i) => this.formatter.formatObservationIndex(obs, i));
return {
content: [{
type: 'text' as const,
text: header + '\n' + formattedResults.join('\n')
}]
};
}
/**
* Tool handler: get_recent_context
*/
+40 -192
@@ -14,75 +14,10 @@ import { logger } from '../../utils/logger.js';
import type { ActiveSession, PendingMessage, PendingMessageWithId, ObservationData } from '../worker-types.js';
import { PendingMessageStore } from '../sqlite/PendingMessageStore.js';
import { SessionQueueProcessor } from '../queue/SessionQueueProcessor.js';
-import { getProcessBySession, ensureProcessExit } from './ProcessRegistry.js';
+import { getSdkProcessForSession, ensureSdkProcessExit } from '../../supervisor/process-registry.js';
import { getSupervisor } from '../../supervisor/index.js';
-import { MAX_CONSECUTIVE_SUMMARY_FAILURES } from '../../sdk/prompts.js';
import { RestartGuard } from './RestartGuard.js';
/** Idle threshold before a stuck generator (zombie subprocess) is force-killed. */
export const MAX_GENERATOR_IDLE_MS = 5 * 60 * 1000; // 5 minutes
/** Idle threshold before a no-generator session with no pending work is reaped. */
export const MAX_SESSION_IDLE_MS = 15 * 60 * 1000; // 15 minutes
/**
* Minimal process interface used by detectStaleGenerator, compatible with
* both the real Bun.Subprocess / ChildProcess shapes and test mocks.
*/
export interface StaleGeneratorProcess {
exitCode: number | null;
kill(signal?: string): boolean | void;
}
/**
* Minimal session fields required to evaluate stale-generator status.
* This is a subset of ActiveSession, allowing unit tests to pass plain objects.
*/
export interface StaleGeneratorCandidate {
generatorPromise: Promise<void> | null;
lastGeneratorActivity: number;
abortController: AbortController;
}
/**
* Detect whether a session's generator is stuck (zombie subprocess) and, if so,
* SIGKILL the subprocess and abort the controller.
*
* Extracted from reapStaleSessions() so tests can import and exercise the exact
* same logic rather than duplicating it locally. (Issue #1652)
*
* @param session - session to inspect
* @param proc - tracked subprocess (may be undefined if not in ProcessRegistry)
* @param now - current timestamp (defaults to Date.now(); pass explicit value in tests)
* @returns true if the session was marked stale, false otherwise
*/
export function detectStaleGenerator(
session: StaleGeneratorCandidate,
proc: StaleGeneratorProcess | undefined,
now = Date.now()
): boolean {
if (!session.generatorPromise) return false;
const generatorIdleMs = now - session.lastGeneratorActivity;
if (generatorIdleMs <= MAX_GENERATOR_IDLE_MS) return false;
// Kill subprocess to unblock stuck for-await
if (proc && proc.exitCode === null) {
try {
proc.kill('SIGKILL');
} catch (error) {
if (error instanceof Error) {
logger.warn('SESSION', 'Failed to SIGKILL stale generator subprocess', {}, error);
} else {
logger.warn('SESSION', 'Failed to SIGKILL stale generator subprocess with non-Error', {}, new Error(String(error)));
}
}
}
// Signal the SDK agent loop to exit
session.abortController.abort();
return true;
}
export class SessionManager {
private dbManager: DatabaseManager;
private sessions: Map<number, ActiveSession> = new Map();
@@ -229,7 +164,6 @@ export class SessionManager {
restartGuard: new RestartGuard(),
processingMessageIds: [], // CLAIM-CONFIRM: Track message IDs for confirmProcessed()
lastGeneratorActivity: Date.now(), // Initialize for stale detection (Issue #1099)
-consecutiveSummaryFailures: 0, // Circuit breaker for summary retry loop (#1633)
pendingAgentId: null, // Subagent identity carried from the most recent claimed message
pendingAgentType: null // (null for main-session messages)
};
@@ -289,16 +223,28 @@ export class SessionManager {
prompt_number: data.prompt_number,
cwd: data.cwd,
agentId: data.agentId,
-agentType: data.agentType
+agentType: data.agentType,
+toolUseId: data.toolUseId,
};
try {
const messageId = this.getPendingStore().enqueue(sessionDbId, session.contentSessionId, message);
const queueDepth = this.getPendingStore().getPendingCount(sessionDbId);
const toolSummary = logger.formatTool(data.tool_name, data.tool_input);
-logger.info('QUEUE', `ENQUEUED | sessionDbId=${sessionDbId} | messageId=${messageId} | type=observation | tool=${toolSummary} | depth=${queueDepth}`, {
-sessionId: sessionDbId
-});
+// enqueue returns 0 on INSERT OR IGNORE conflict (UNIQUE(session_id, tool_use_id)
+// — Plan 01 Phase 1). The duplicate is correctly suppressed by the DB; surface
+// it visibly so it isn't misread as "messageId=0 was inserted." Per
+// Principle 3 (UNIQUE constraint over dedup window) this is the success path
+// for replayed transcript lines, not an error.
+if (messageId === 0) {
+logger.debug('QUEUE', `DUP_SUPPRESSED | sessionDbId=${sessionDbId} | type=observation | tool=${toolSummary} | toolUseId=${data.toolUseId ?? 'null'} | depth=${queueDepth}`, {
+sessionId: sessionDbId
+});
+} else {
+logger.info('QUEUE', `ENQUEUED | sessionDbId=${sessionDbId} | messageId=${messageId} | type=observation | tool=${toolSummary} | depth=${queueDepth}`, {
+sessionId: sessionDbId
+});
+}
} catch (error) {
if (error instanceof Error) {
logger.error('SESSION', 'Failed to persist observation to DB', {
@@ -333,17 +279,10 @@ export class SessionManager {
session = this.initializeSession(sessionDbId);
}
-// Circuit breaker: skip summarize if too many consecutive failures (#1633).
-// This prevents the infinite loop where each failed summary spawns a new session
-// with an ever-growing prompt. Counter is in-memory per ActiveSession — it resets
-// on worker restart, which is acceptable because session state is already ephemeral.
-if (session.consecutiveSummaryFailures >= MAX_CONSECUTIVE_SUMMARY_FAILURES) {
-logger.warn('SESSION', `Circuit breaker OPEN: skipping summarize after ${session.consecutiveSummaryFailures} consecutive failures (#1633)`, {
-sessionId: sessionDbId,
-contentSessionId: session.contentSessionId
-});
-return;
-}
+// PATHFINDER plan 03 phase 3: summary-failure circuit breaker deleted.
+// Each failed parse is independently marked failed via the retry ladder
+// in PendingMessageStore.markFailed; a storm of bad parses surfaces as
+// retry exhaustion, not as silent suppression of further requests.
// CRITICAL: Persist to database FIRST
const message: PendingMessage = {
@@ -354,9 +293,16 @@ export class SessionManager {
try {
const messageId = this.getPendingStore().enqueue(sessionDbId, session.contentSessionId, message);
const queueDepth = this.getPendingStore().getPendingCount(sessionDbId);
-logger.info('QUEUE', `ENQUEUED | sessionDbId=${sessionDbId} | messageId=${messageId} | type=summarize | depth=${queueDepth}`, {
-sessionId: sessionDbId
-});
+// See queueObservation note: messageId=0 means UNIQUE-suppressed duplicate.
+if (messageId === 0) {
+logger.debug('QUEUE', `DUP_SUPPRESSED | sessionDbId=${sessionDbId} | type=summarize | depth=${queueDepth}`, {
+sessionId: sessionDbId
+});
+} else {
+logger.info('QUEUE', `ENQUEUED | sessionDbId=${sessionDbId} | messageId=${messageId} | type=summarize | depth=${queueDepth}`, {
+sessionId: sessionDbId
+});
+}
} catch (error) {
if (error instanceof Error) {
logger.error('SESSION', 'Failed to persist summarize to DB', {
@@ -402,19 +348,21 @@ export class SessionManager {
});
}
-// 3. Verify subprocess exit with 5s timeout (Issue #737 fix)
-const tracked = getProcessBySession(sessionDbId);
+// 3. Verify subprocess exit with 5s timeout. Process-group teardown is
+// used internally so any SDK descendants are killed too (Principle 5).
+const tracked = getSdkProcessForSession(sessionDbId);
if (tracked && tracked.process.exitCode === null) {
-logger.debug('SESSION', `Waiting for subprocess PID ${tracked.pid} to exit`, {
+logger.debug('SESSION', `Waiting for subprocess PID ${tracked.pid} (pgid ${tracked.pgid}) to exit`, {
sessionId: sessionDbId,
-pid: tracked.pid
+pid: tracked.pid,
+pgid: tracked.pgid
});
-await ensureProcessExit(tracked, 5000);
+await ensureSdkProcessExit(tracked, 5000);
}
// 3b. Reap all supervisor-tracked processes for this session (#1351)
-// This catches MCP servers and other child processes not tracked by the
-// in-memory ProcessRegistry (e.g. processes registered only in supervisor.json).
+// Catches MCP servers and other child processes registered only in
+// supervisor.json that the in-process tracking would not see.
try {
await getSupervisor().getRegistry().reapSession(sessionDbId);
} catch (error) {
@@ -467,106 +415,6 @@ export class SessionManager {
}
}
/**
* Evict the idlest session to free a pool slot (#1868).
* An "idle" session has an active generator but no pending work; it's sitting
* in the 3-min idle wait before subprocess cleanup. Evicting it triggers an abort,
* which kills the subprocess and frees the pool slot for a waiting new session.
* @returns true if a session was evicted, false if no idle sessions found
*/
evictIdlestSession(): boolean {
let idlestSessionId: number | null = null;
let oldestActivity = Infinity;
for (const [sessionDbId, session] of this.sessions) {
if (!session.generatorPromise) continue; // No generator = no slot held
const pendingCount = this.getPendingStore().getPendingCount(sessionDbId);
if (pendingCount > 0) continue; // Has work to do, don't evict
// Pick the session with the oldest lastGeneratorActivity (idlest)
if (session.lastGeneratorActivity < oldestActivity) {
oldestActivity = session.lastGeneratorActivity;
idlestSessionId = sessionDbId;
}
}
if (idlestSessionId === null) return false;
const session = this.sessions.get(idlestSessionId);
if (!session) return false;
logger.info('SESSION', 'Evicting idle session to free pool slot for new request (#1868)', {
sessionDbId: idlestSessionId,
idleDurationMs: Date.now() - oldestActivity
});
session.idleTimedOut = true;
session.abortController.abort();
return true;
}
/**
* Reap sessions with no active generator and no pending work that have been idle too long.
* Also reaps sessions whose generator has been stuck (no lastGeneratorActivity update) for
* longer than MAX_GENERATOR_IDLE_MS; these are zombie subprocesses that will never exit
* on their own because the orphan reaper skips sessions in the active sessions map. (Issue #1652)
*
* This unblocks the orphan reaper which skips processes for "active" sessions. (Issue #1168)
*/
async reapStaleSessions(): Promise<number> {
const now = Date.now();
const staleSessionIds: number[] = [];
for (const [sessionDbId, session] of this.sessions) {
// Sessions with active generators — check for stuck/zombie generators (Issue #1652)
if (session.generatorPromise) {
const generatorIdleMs = now - session.lastGeneratorActivity;
if (generatorIdleMs > MAX_GENERATOR_IDLE_MS) {
logger.warn('SESSION', `Stale generator detected for session ${sessionDbId} (no activity for ${Math.round(generatorIdleMs / 60000)}m) — force-killing subprocess`, {
sessionDbId,
generatorIdleMs
});
// Force-kill the subprocess to unblock the stuck for-await in SDKAgent.
// Without this the generator is blocked on `for await (const msg of queryResult)`
// and will never exit even after abort() is called.
const trackedProcess = getProcessBySession(sessionDbId);
if (trackedProcess && trackedProcess.process.exitCode === null) {
try {
trackedProcess.process.kill('SIGKILL');
} catch (err) {
if (err instanceof Error) {
logger.warn('SESSION', 'Failed to SIGKILL subprocess for stale generator', { sessionDbId }, err);
} else {
logger.warn('SESSION', 'Failed to SIGKILL subprocess for stale generator with non-Error', { sessionDbId }, new Error(String(err)));
}
}
}
// Signal the SDK agent loop to exit after the subprocess dies
session.abortController.abort();
staleSessionIds.push(sessionDbId);
}
continue;
}
// Skip sessions with pending work
const pendingCount = this.getPendingStore().getPendingCount(sessionDbId);
if (pendingCount > 0) continue;
// No generator + no pending work + old enough = stale
const sessionAge = now - session.startTime;
if (sessionAge > MAX_SESSION_IDLE_MS) {
logger.warn('SESSION', `Reaping idle session ${sessionDbId} (no activity for >${Math.round(MAX_SESSION_IDLE_MS / 60000)}m)`, { sessionDbId });
staleSessionIds.push(sessionDbId);
}
}
for (const sessionDbId of staleSessionIds) {
await this.deleteSession(sessionDbId);
}
return staleSessionIds.length;
}
/**
* Shutdown all active sessions
*/
+3 -1
@@ -37,7 +37,9 @@ export class SettingsManager {
for (const row of rows) {
const key = row.key as keyof ViewerSettings;
if (key in settings) {
-settings[key] = JSON.parse(row.value) as ViewerSettings[typeof key];
+// Object.assign narrows correctly across the discriminated union
+// where `settings[key] = value` would collapse to `never`.
+Object.assign(settings, { [key]: JSON.parse(row.value) });
}
}
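The narrowing problem that comment describes is reproducible in isolation. A minimal sketch, assuming a stand-in `ViewerSettings` with the same union-typed-key shape (the real interface has different fields):

```typescript
// With a union-typed key, TypeScript requires a write to settings[key] to be
// valid for every possible key, so the accepted value type is the intersection
// string & number, i.e. `never`, and a direct indexed assignment is rejected.
// Object.assign sidesteps the write-site check with the same runtime effect.
interface ViewerSettings { theme: string; pageSize: number; }

function applySetting(settings: ViewerSettings, key: keyof ViewerSettings, rawJson: string): void {
  // settings[key] = JSON.parse(rawJson) as ViewerSettings[typeof key];
  //   ^ error: Type 'string | number' is not assignable to type 'never'
  Object.assign(settings, { [key]: JSON.parse(rawJson) });
}

const settings: ViewerSettings = { theme: 'light', pageSize: 20 };
applySetting(settings, 'theme', '"dark"');
```

The trade-off is that `Object.assign` also skips the per-key value check, which is acceptable here because the loop already guards with `key in settings`.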
+38 -57
@@ -12,8 +12,8 @@
*/
import { logger } from '../../../utils/logger.js';
-import { parseObservations, parseSummary, type ParsedObservation, type ParsedSummary } from '../../../sdk/parser.js';
-import { SUMMARY_MODE_MARKER, MAX_CONSECUTIVE_SUMMARY_FAILURES } from '../../../sdk/prompts.js';
+import { parseAgentXml, type ParsedObservation, type ParsedSummary } from '../../../sdk/parser.js';
+import { ingestSummary } from '../http/shared.js';
import { updateCursorContextForProject } from '../../integrations/CursorHooksInstaller.js';
import { notifyTelegram } from '../../integrations/TelegramNotifier.js';
import { updateFolderClaudeMdFiles } from '../../../utils/claude-md-utils.js';
@@ -67,39 +67,16 @@ export async function processAgentResponse(
session.conversationHistory.push({ role: 'assistant', content: text });
}
-// Parse observations and summary
-const observations = parseObservations(text, session.contentSessionId);
+// Single fail-fast parse (PATHFINDER plan 03 phase 1+2). On invalid XML,
+// mark each in-flight pending message failed and stop. The PendingMessageStore
+// retry ladder is the legitimate primary-path surface for transient failures;
+// there is no circuit breaker, no coercion.
+const parsed = parseAgentXml(text, session.contentSessionId);
-// Detect whether the most recent prompt was a summary request.
-// If so, enable observation-to-summary coercion to prevent the infinite
-// retry loop described in #1633.
-const lastMessage = session.conversationHistory.at(-1);
-const lastUserMessage = lastMessage?.role === 'user'
-? lastMessage
-: session.conversationHistory.findLast(m => m.role === 'user') ?? null;
-const summaryExpected = lastUserMessage?.content?.includes(SUMMARY_MODE_MARKER) ?? false;
-const summary = parseSummary(text, session.sessionDbId, summaryExpected);
-// Detect non-XML responses (auth errors, rate limits, garbled output).
-// When the response contains no parseable XML and produced no observations,
-// mark the pending messages as failed instead of confirming them — this prevents
-// silent data loss when the LLM returns garbage (#1874).
-const isNonXmlResponse = (
-text.trim() &&
-observations.length === 0 &&
-!summary &&
-!/<observation>|<summary>|<skip_summary\b/.test(text)
-);
-if (isNonXmlResponse) {
-const preview = text.length > 200 ? `${text.slice(0, 200)}...` : text;
-logger.warn('PARSER', `${agentName} returned non-XML response; marking messages as failed for retry (#1874)`, {
+if (!parsed.valid) {
+logger.warn('PARSER', `${agentName} returned unparseable response: ${parsed.reason}`, {
sessionId: session.sessionDbId,
-preview
});
// Mark messages as failed (retry logic in PendingMessageStore handles retries)
const pendingStore = sessionManager.getPendingMessageStore();
for (const messageId of session.processingMessageIds) {
pendingStore.markFailed(messageId);
@@ -108,6 +85,17 @@ export async function processAgentResponse(
return;
}
let observations: ParsedObservation[] = [];
let summary: ParsedSummary | null = null;
if (parsed.kind === 'observation') {
observations = parsed.data;
} else if (!parsed.data.skipped) {
// `<skip_summary/>` is a first-class parser result but carries nothing to
// persist; the summary storage path is skipped entirely so storeObservations
// does not see an empty record.
summary = parsed.data;
}
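The branching above implies a discriminated result shape for `parseAgentXml`. A hypothetical sketch, with field names taken from the handler code and everything else assumed (the real `ParsedObservation`/`ParsedSummary` carry more fields):

```typescript
// Assumed minimal shapes for illustration only.
interface ParsedObservation { type: string; title: string; }
interface ParsedSummary { skipped: boolean; }

type AgentXmlResult =
  | { valid: false; reason: string }
  | { valid: true; kind: 'observation'; data: ParsedObservation[] }
  | { valid: true; kind: 'summary'; data: ParsedSummary };

// Mirrors the handler's branching: invalid yields nothing, observation yields
// the list, and a <skip_summary/> summary yields null so the storage path is
// skipped entirely.
function splitResult(parsed: AgentXmlResult): { observations: ParsedObservation[]; summary: ParsedSummary | null } {
  if (!parsed.valid) return { observations: [], summary: null };
  if (parsed.kind === 'observation') return { observations: parsed.data, summary: null };
  return { observations: [], summary: parsed.data.skipped ? null : parsed.data };
}
```

The discriminant on `valid` and `kind` is what lets TypeScript narrow `parsed.data` without casts in each branch.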
// Convert nullable fields to empty strings for storeSummary (if summary exists)
const summaryForStore = normalizeSummaryForStorage(summary);
@@ -174,30 +162,23 @@ export async function processAgentResponse(
// to the Stop hook for silent-summary-loss detection (#1633)
session.lastSummaryStored = result.summaryId !== null;
-// Circuit breaker: track consecutive summary failures (#1633).
-// Only evaluate when a summary was actually expected (summarize message was sent).
-// Without this guard, the counter would increment on every normal observation
-// response, tripping the breaker after 3 observations and permanently blocking
-// summarization — reproducing the data-loss scenario this fix is meant to prevent.
-if (summaryExpected) {
-const skippedIntentionally = /<skip_summary\b/.test(text);
-if (summaryForStore !== null) {
-// Summary was present in the response — reset the failure counter
-session.consecutiveSummaryFailures = 0;
-} else if (skippedIntentionally) {
-// Explicit <skip_summary/> is a valid protocol response — neither success
-// nor failure. Leave the counter unchanged so we don't mask a bad run that
-// happens to end on a skip, but also don't punish intentional skips.
-} else {
-// Summary was expected but none was stored — count as failure
-session.consecutiveSummaryFailures += 1;
-if (session.consecutiveSummaryFailures >= MAX_CONSECUTIVE_SUMMARY_FAILURES) {
-logger.error('SESSION', `Circuit breaker: ${session.consecutiveSummaryFailures} consecutive summary failures — further summarize requests will be skipped (#1633)`, {
-sessionId: session.sessionDbId,
-contentSessionId: session.contentSessionId
-});
-}
-}
+// Gate ingestSummary({kind:'parsed'}) on real persistence so the event bus
+// only fires for summaries that actually landed in the DB. Skipped summaries
+// (<skip_summary/>) are an explicit bypass and still notify.
+if (parsed.kind === 'summary' && (parsed.data.skipped || session.lastSummaryStored)) {
+const messageId = session.processingMessageIds[0] ?? -1;
+ingestSummary({
+kind: 'parsed',
+sessionDbId: session.sessionDbId,
+messageId,
+contentSessionId: session.contentSessionId,
+parsed: parsed.data,
+});
+} else if (parsed.kind === 'summary') {
+logger.warn('DB', 'summary parsed but no row persisted; suppressing summaryStoredEvent', {
+sessionId: session.sessionDbId,
+memorySessionId: session.memorySessionId,
+});
+}
// CLAIM-CONFIRM: Now that storage succeeded, confirm all processing messages (delete from queue)
@@ -342,7 +323,7 @@ async function syncAndBroadcastObservations(
// Only runs if CLAUDE_MEM_FOLDER_CLAUDEMD_ENABLED is true (default: false)
const settings = SettingsDefaultsManager.loadFromFile(USER_SETTINGS_PATH);
// Handle both string 'true' and boolean true from JSON settings
-const settingValue = settings.CLAUDE_MEM_FOLDER_CLAUDEMD_ENABLED;
+const settingValue: unknown = settings.CLAUDE_MEM_FOLDER_CLAUDEMD_ENABLED;
const folderClaudeMdEnabled = settingValue === 'true' || settingValue === true;
if (folderClaudeMdEnabled) {
@@ -47,20 +47,6 @@ export abstract class BaseRouteHandler {
return value;
}
/**
* Validate required body parameters
* Returns true if all required params present, sends 400 error otherwise
*/
protected validateRequired(req: Request, res: Response, params: string[]): boolean {
for (const param of params) {
if (req.body[param] === undefined || req.body[param] === null) {
this.badRequest(res, `Missing ${param}`);
return false;
}
}
return true;
}
/**
* Send 400 Bad Request response
*/
-36
@@ -42,42 +42,6 @@ export function createMiddleware(
credentials: false
}));
// Simple in-memory rate limiter (#1935).
// Worker binds localhost-only, so in practice this is a global 300 req/min
// cap — every caller shares the 127.0.0.1/::1 bucket.
const requestCounts = new Map<string, { count: number; resetAt: number }>();
const RATE_LIMIT_WINDOW_MS = 60_000;
const RATE_LIMIT_MAX_REQUESTS = 300;
const rateLimiter: RequestHandler = (req, res, next) => {
// Normalise IPv4-mapped IPv6 so 127.0.0.1 and ::ffff:127.0.0.1 share a bucket.
const clientIp = (req.socket.remoteAddress ?? req.ip ?? 'unknown').replace(/^::ffff:/, '');
const now = Date.now();
let entry = requestCounts.get(clientIp);
if (!entry || now >= entry.resetAt) {
// Safety valve in case the worker is ever bound non-localhost.
if (requestCounts.size > 1000) {
for (const [ip, e] of requestCounts) {
if (now >= e.resetAt) requestCounts.delete(ip);
}
}
entry = { count: 0, resetAt: now + RATE_LIMIT_WINDOW_MS };
requestCounts.set(clientIp, entry);
}
if (entry.count >= RATE_LIMIT_MAX_REQUESTS) {
res.set('Retry-After', String(Math.ceil((entry.resetAt - now) / 1000)));
res.status(429).json({ error: 'Rate limit exceeded' });
return;
}
entry.count++;
next();
};
middlewares.push(rateLimiter);
// HTTP request/response logging
middlewares.push((req: Request, res: Response, next: NextFunction) => {
// Skip logging for static assets, health checks, and polling endpoints
@@ -0,0 +1,37 @@
/**
* Zod body-validation middleware: PATHFINDER-2026-04-22 Plan 06 Phase 2.
*
* Canonical signature: given a Zod schema, parse `req.body` with `safeParse`.
* On failure, respond 400 with `{ error: 'ValidationError', issues: [...] }`
* and stop. On success, replace `req.body` with the parsed (typed) value and
* call `next()`.
*
* Principles:
* - Principle 2: Fail-fast over graceful degradation. No try/catch swallow,
*   no coercion, no "best-effort" defaults.
* - Principle 6: One helper, N callers. Every validated POST/PUT
* across `src/services/worker/http/routes/` uses this one middleware
* wrapped around a per-route Zod schema declared at the top of its
* owning route file.
*/
import type { RequestHandler } from 'express';
import type { ZodTypeAny } from 'zod';
export const validateBody = <S extends ZodTypeAny>(schema: S): RequestHandler =>
(req, res, next) => {
const result = schema.safeParse(req.body);
if (!result.success) {
res.status(400).json({
error: 'ValidationError',
issues: result.error.issues.map(i => ({
path: i.path,
message: i.message,
code: i.code,
})),
});
return;
}
req.body = result.data;
next();
};
@@ -0,0 +1,78 @@
/**
* Chroma Routes
*
* Provides diagnostic endpoints for ChromaDB integration.
*/
import express, { Request, Response } from 'express';
import { BaseRouteHandler } from '../BaseRouteHandler.js';
import { ChromaMcpManager } from '../../../sync/ChromaMcpManager.js';
import { logger } from '../../../../utils/logger.js';
import { SettingsDefaultsManager } from '../../../../shared/SettingsDefaultsManager.js';
import { USER_SETTINGS_PATH } from '../../../../shared/paths.js';
export class ChromaRoutes extends BaseRouteHandler {
setupRoutes(app: express.Application): void {
app.get('/api/chroma/status', this.handleGetStatus.bind(this));
}
/**
* GET /api/chroma/status
* Returns current health and connection status of chroma-mcp.
*
* Pass `?deep=1` (or `?deep=true`) to additionally run a real
* semantic-search round-trip via ChromaMcpManager.probeSemanticSearch().
* The cheap path (no `deep`) stays cheap: it only calls isHealthy().
*/
private handleGetStatus = this.wrapHandler(async (req: Request, res: Response): Promise<void> => {
const settings = SettingsDefaultsManager.loadFromFile(USER_SETTINGS_PATH);
const chromaEnabled = settings.CLAUDE_MEM_CHROMA_ENABLED !== 'false';
// Truthy check: any value other than 'false'/'0' enables the deep probe.
// Bare `?deep` (no value) shows up as '' in Express, which we treat as enabled.
const deepRaw = req.query.deep;
const deepEnabled =
deepRaw !== undefined &&
deepRaw !== 'false' &&
deepRaw !== '0';
if (!chromaEnabled) {
res.json({
status: 'disabled',
connected: false,
timestamp: new Date().toISOString(),
details: 'Chroma is disabled via CLAUDE_MEM_CHROMA_ENABLED=false',
deep: deepEnabled
});
return;
}
const chromaMcp = ChromaMcpManager.getInstance();
const isHealthy = await chromaMcp.isHealthy();
if (!deepEnabled) {
res.json({
status: isHealthy ? 'healthy' : 'unhealthy',
connected: isHealthy,
timestamp: new Date().toISOString(),
details: isHealthy ? 'chroma-mcp is responding to tool calls' : 'chroma-mcp health check failed',
deep: false
});
return;
}
const probe = await chromaMcp.probeSemanticSearch();
const status = probe.ok ? 'healthy' : 'unhealthy';
res.json({
status,
connected: isHealthy,
timestamp: new Date().toISOString(),
details: probe.ok
? 'chroma-mcp semantic search round-trip succeeded'
: `chroma-mcp deep probe failed at stage '${probe.stage}'`,
deep: true,
probe
});
});
}
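The `?deep` truthiness rule can be restated as a standalone predicate. Illustrative only: the name `isDeepEnabled` and the array handling for repeated query params are ours; the three comparisons are the handler's own.

```typescript
// Enabled for any value except 'false'/'0'; absent means disabled; a bare
// `?deep` reaches Express as '' and counts as enabled.
function isDeepEnabled(deepRaw: string | string[] | undefined): boolean {
  const value = Array.isArray(deepRaw) ? deepRaw[0] : deepRaw;
  return value !== undefined && value !== 'false' && value !== '0';
}
```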
+64 -86
@@ -6,14 +6,65 @@
*/
import express, { Request, Response } from 'express';
import { z } from 'zod';
import { BaseRouteHandler } from '../BaseRouteHandler.js';
import { logger } from '../../../../utils/logger.js';
import { validateBody } from '../middleware/validateBody.js';
import { CorpusStore } from '../../knowledge/CorpusStore.js';
import { CorpusBuilder } from '../../knowledge/CorpusBuilder.js';
import { KnowledgeAgent } from '../../knowledge/KnowledgeAgent.js';
import type { CorpusFilter } from '../../knowledge/types.js';
-const ALLOWED_CORPUS_TYPES = new Set(['decision', 'bugfix', 'feature', 'refactor', 'discovery', 'change', 'security_alert', 'security_note']);
+const ALLOWED_CORPUS_TYPES = ['decision', 'bugfix', 'feature', 'refactor', 'discovery', 'change', 'security_alert', 'security_note'] as const;
+const ALLOWED_CORPUS_TYPE_SET = new Set<string>(ALLOWED_CORPUS_TYPES);
// Plan 06 Phase 3 — per-route Zod schemas. Coercions match the legacy
// `coerceStringArray` / `coercePositiveInteger` semantics: accept JSON
// strings, comma-separated strings, or native arrays; reject empty fields.
const stringArrayLike = z.preprocess((value) => {
if (value === undefined || value === null || value === '') return undefined;
if (Array.isArray(value)) return value;
if (typeof value === 'string') {
try {
const parsed = JSON.parse(value);
if (Array.isArray(parsed)) return parsed;
} catch {
// not JSON, fall through to comma split
}
return value.split(',').map((part) => part.trim()).filter(Boolean);
}
return value;
}, z.array(z.string().min(1)).optional());
const positiveIntegerLike = z.preprocess((value) => {
if (value === undefined || value === null || value === '') return undefined;
if (typeof value === 'string') {
const parsed = Number(value);
return Number.isNaN(parsed) ? value : parsed;
}
return value;
}, z.number().int().positive().optional());
const buildCorpusSchema = z.object({
name: z.string().min(1),
description: z.string().optional(),
project: z.string().optional(),
types: stringArrayLike.refine(
(arr) => arr === undefined || arr.every((t) => ALLOWED_CORPUS_TYPE_SET.has(t)),
{ message: `types must contain only ${ALLOWED_CORPUS_TYPES.join(', ')}` }
),
concepts: stringArrayLike,
files: stringArrayLike,
query: z.string().optional(),
date_start: z.string().optional(),
date_end: z.string().optional(),
limit: positiveIntegerLike,
}).passthrough();
const queryCorpusSchema = z.object({
question: z.string().trim().min(1),
}).passthrough();
const emptyBodySchema = z.object({}).passthrough();
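The `stringArrayLike` preprocessor above tries JSON first and falls back to a comma split. Its coercion step can be sketched as a standalone function (without the zod wrapper) to show the decision order; this is an illustrative sketch, not the shipped code:

```typescript
// Sketch of the coercion order used by `stringArrayLike`: native arrays
// pass through, JSON-encoded arrays are decoded, anything else falls back
// to a comma split. Empty-ish inputs map to undefined so an .optional()
// schema can treat the field as absent.
function coerceStringArray(value: unknown): unknown {
  if (value === undefined || value === null || value === '') return undefined;
  if (Array.isArray(value)) return value;
  if (typeof value === 'string') {
    try {
      const parsed = JSON.parse(value);
      if (Array.isArray(parsed)) return parsed;
    } catch {
      // not JSON, fall through to comma split
    }
    return value.split(',').map((part) => part.trim()).filter(Boolean);
  }
  return value;
}
```

Note that a JSON-parseable scalar like `"42"` still falls through to the comma split, matching the legacy `coerceStringArray` behaviour.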
export class CorpusRoutes extends BaseRouteHandler {
constructor(
@@ -25,14 +76,14 @@ export class CorpusRoutes extends BaseRouteHandler {
}
setupRoutes(app: express.Application): void {
-app.post('/api/corpus', this.handleBuildCorpus.bind(this));
+app.post('/api/corpus', validateBody(buildCorpusSchema), this.handleBuildCorpus.bind(this));
app.get('/api/corpus', this.handleListCorpora.bind(this));
app.get('/api/corpus/:name', this.handleGetCorpus.bind(this));
app.delete('/api/corpus/:name', this.handleDeleteCorpus.bind(this));
-app.post('/api/corpus/:name/rebuild', this.handleRebuildCorpus.bind(this));
-app.post('/api/corpus/:name/prime', this.handlePrimeCorpus.bind(this));
-app.post('/api/corpus/:name/query', this.handleQueryCorpus.bind(this));
-app.post('/api/corpus/:name/reprime', this.handleReprimeCorpus.bind(this));
+app.post('/api/corpus/:name/rebuild', validateBody(emptyBodySchema), this.handleRebuildCorpus.bind(this));
+app.post('/api/corpus/:name/prime', validateBody(emptyBodySchema), this.handlePrimeCorpus.bind(this));
+app.post('/api/corpus/:name/query', validateBody(queryCorpusSchema), this.handleQueryCorpus.bind(this));
+app.post('/api/corpus/:name/reprime', validateBody(emptyBodySchema), this.handleReprimeCorpus.bind(this));
}
/**
@@ -41,42 +92,18 @@ export class CorpusRoutes extends BaseRouteHandler {
* Body: { name, description?, project?, types?, concepts?, files?, query?, date_start?, date_end?, limit? }
*/
private handleBuildCorpus = this.wrapHandler(async (req: Request, res: Response): Promise<void> => {
-if (!req.body.name) {
-res.status(400).json({
-error: 'Missing required field: name',
-fix: 'Add a "name" field to your request body',
-example: { name: 'my-corpus', query: 'hooks', limit: 100 }
-});
-return;
-}
-const { name, description, project, types, concepts, files, query, date_start, date_end, limit } = req.body;
-const coercedTypes = this.coerceStringArray(types, 'types', res);
-if (coercedTypes === null) return;
-if (coercedTypes && !coercedTypes.every(type => ALLOWED_CORPUS_TYPES.has(type))) {
-this.badRequest(res, 'types must contain valid observation types');
-return;
-}
-const coercedConcepts = this.coerceStringArray(concepts, 'concepts', res);
-if (coercedConcepts === null) return;
-const coercedFiles = this.coerceStringArray(files, 'files', res);
-if (coercedFiles === null) return;
-const coercedLimit = this.coercePositiveInteger(limit, 'limit', res);
-if (coercedLimit === null) return;
+const { name, description, project, types, concepts, files, query, date_start, date_end, limit } =
+req.body as z.infer<typeof buildCorpusSchema>;
const filter: CorpusFilter = {};
if (project) filter.project = project;
-if (coercedTypes && coercedTypes.length > 0) filter.types = coercedTypes as CorpusFilter['types'];
-if (coercedConcepts && coercedConcepts.length > 0) filter.concepts = coercedConcepts;
-if (coercedFiles && coercedFiles.length > 0) filter.files = coercedFiles;
+if (types && types.length > 0) filter.types = types as CorpusFilter['types'];
+if (concepts && concepts.length > 0) filter.concepts = concepts;
+if (files && files.length > 0) filter.files = files;
if (query) filter.query = query;
if (date_start) filter.date_start = date_start;
if (date_end) filter.date_end = date_end;
-if (coercedLimit !== undefined) filter.limit = coercedLimit;
+if (limit !== undefined) filter.limit = limit;
const corpus = await this.corpusBuilder.build(name, description || '', filter);
@@ -85,45 +112,6 @@ export class CorpusRoutes extends BaseRouteHandler {
res.json(metadata);
});
-private coerceStringArray(value: unknown, fieldName: string, res: Response): string[] | null | undefined {
-if (value === undefined || value === null || value === '') {
-return undefined;
-}
-let parsed = value;
-if (typeof value === 'string') {
-try {
-parsed = JSON.parse(value);
-} catch (parseError: unknown) {
-if (parseError instanceof Error) {
-logger.debug('HTTP', `${fieldName} is not valid JSON, treating as comma-separated string`, { value });
-}
-parsed = value.split(',').map(part => part.trim()).filter(Boolean);
-}
-}
-if (!Array.isArray(parsed) || !parsed.every(item => typeof item === 'string')) {
-this.badRequest(res, `${fieldName} must be an array of strings`);
-return null;
-}
-return parsed.map(item => item.trim()).filter(Boolean);
-}
-private coercePositiveInteger(value: unknown, fieldName: string, res: Response): number | null | undefined {
-if (value === undefined || value === null || value === '') {
-return undefined;
-}
-const parsed = typeof value === 'string' ? Number(value) : value;
-if (typeof parsed !== 'number' || !Number.isInteger(parsed) || parsed <= 0) {
-this.badRequest(res, `${fieldName} must be a positive integer`);
-return null;
-}
-return parsed;
-}
/**
* List all corpora with stats
* GET /api/corpus
@@ -234,16 +222,6 @@ export class CorpusRoutes extends BaseRouteHandler {
*/
private handleQueryCorpus = this.wrapHandler(async (req: Request, res: Response): Promise<void> => {
const { name } = req.params;
-if (!req.body.question || typeof req.body.question !== 'string' || req.body.question.trim().length === 0) {
-res.status(400).json({
-error: 'Missing required field: question',
-fix: 'Add a non-empty "question" string to your request body',
-example: { question: 'What architectural decisions were made about hooks?' }
-});
-return;
-}
const corpus = this.corpusStore.read(name);
if (!corpus) {
+61 -130
@@ -6,6 +6,7 @@
*/
import express, { Request, Response } from 'express';
import { z } from 'zod';
import path from 'path';
import { readFileSync, statSync, existsSync } from 'fs';
import { logger } from '../../../../utils/logger.js';
@@ -18,9 +19,63 @@ import { SessionManager } from '../../SessionManager.js';
import { SSEBroadcaster } from '../../SSEBroadcaster.js';
import type { WorkerService } from '../../../worker-service.js';
import { BaseRouteHandler } from '../BaseRouteHandler.js';
import { validateBody } from '../middleware/validateBody.js';
import { normalizePlatformSource } from '../../../../shared/platform-source.js';
import { getObservationsByFilePath } from '../../../sqlite/observations/get.js';
// Plan 06 Phase 3 — per-route Zod schemas. Coercions match the legacy
// behaviour where MCP clients sometimes send arrays as JSON-encoded strings
// or comma-separated strings.
const integerArrayLike = z.preprocess((value) => {
if (Array.isArray(value)) return value;
if (typeof value === 'string') {
try {
const parsed = JSON.parse(value);
if (Array.isArray(parsed)) return parsed;
} catch {
// not JSON, fall through to comma split
}
// Keep NaN values so the inner z.number().int() schema rejects them
// — coercion does not silently drop garbage input.
return value.split(',').map((part) => Number(part.trim()));
}
return value;
}, z.array(z.number().int()));
const stringArrayLike = z.preprocess((value) => {
if (Array.isArray(value)) return value;
if (typeof value === 'string') {
try {
const parsed = JSON.parse(value);
if (Array.isArray(parsed)) return parsed;
} catch {
// not JSON, fall through to comma split
}
return value.split(',').map((part) => part.trim()).filter(Boolean);
}
return value;
}, z.array(z.string()));
const observationsBatchSchema = z.object({
ids: integerArrayLike,
orderBy: z.enum(['date_desc', 'date_asc']).optional(),
limit: z.number().int().positive().optional(),
project: z.string().optional(),
}).passthrough();
const sdkSessionsBatchSchema = z.object({
memorySessionIds: stringArrayLike,
}).passthrough();
const setProcessingSchema = z.object({}).passthrough();
const importSchema = z.object({
sessions: z.array(z.unknown()).optional(),
summaries: z.array(z.unknown()).optional(),
observations: z.array(z.unknown()).optional(),
prompts: z.array(z.unknown()).optional(),
}).passthrough();
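The comment on `integerArrayLike` above is the important design choice: comma-split values go through `Number()` and NaN is kept so the inner `z.array(z.number().int())` rejects garbage rather than silently dropping it. A standalone sketch of that behaviour (plain functions standing in for the zod pipeline):

```typescript
// Sketch of the `integerArrayLike` coercion: NaN values are deliberately
// preserved so the downstream integer check fails loudly on bad input.
function coerceIntegerArray(value: unknown): unknown {
  if (Array.isArray(value)) return value;
  if (typeof value === 'string') {
    try {
      const parsed = JSON.parse(value);
      if (Array.isArray(parsed)) return parsed;
    } catch {
      // not JSON, fall through to comma split
    }
    return value.split(',').map((part) => Number(part.trim()));
  }
  return value;
}

// Stand-in for the z.array(z.number().int()) validation step.
function isIntegerArray(value: unknown): value is number[] {
  return Array.isArray(value) && value.every((n) => typeof n === 'number' && Number.isInteger(n));
}
```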
export class DataRoutes extends BaseRouteHandler {
constructor(
private paginationHelper: PaginationHelper,
@@ -42,9 +97,9 @@ export class DataRoutes extends BaseRouteHandler {
// Fetch by ID endpoints
app.get('/api/observation/:id', this.handleGetObservationById.bind(this));
app.get('/api/observations/by-file', this.handleGetObservationsByFile.bind(this));
-app.post('/api/observations/batch', this.handleGetObservationsByIds.bind(this));
+app.post('/api/observations/batch', validateBody(observationsBatchSchema), this.handleGetObservationsByIds.bind(this));
app.get('/api/session/:id', this.handleGetSessionById.bind(this));
-app.post('/api/sdk-sessions/batch', this.handleGetSdkSessionsByIds.bind(this));
+app.post('/api/sdk-sessions/batch', validateBody(sdkSessionsBatchSchema), this.handleGetSdkSessionsByIds.bind(this));
app.get('/api/prompt/:id', this.handleGetPromptById.bind(this));
// Metadata endpoints
@@ -53,16 +108,10 @@ export class DataRoutes extends BaseRouteHandler {
// Processing status endpoints
app.get('/api/processing-status', this.handleGetProcessingStatus.bind(this));
-app.post('/api/processing', this.handleSetProcessing.bind(this));
-// Pending queue management endpoints
-app.get('/api/pending-queue', this.handleGetPendingQueue.bind(this));
-app.post('/api/pending-queue/process', this.handleProcessPendingQueue.bind(this));
-app.delete('/api/pending-queue/failed', this.handleClearFailedQueue.bind(this));
-app.delete('/api/pending-queue/all', this.handleClearAllQueue.bind(this));
+app.post('/api/processing', validateBody(setProcessingSchema), this.handleSetProcessing.bind(this));
// Import endpoint
-app.post('/api/import', this.handleImport.bind(this));
+app.post('/api/import', validateBody(importSchema), this.handleImport.bind(this));
}
/**
@@ -139,29 +188,13 @@ export class DataRoutes extends BaseRouteHandler {
* Body: { ids: number[], orderBy?: 'date_desc' | 'date_asc', limit?: number, project?: string }
*/
private handleGetObservationsByIds = this.wrapHandler((req: Request, res: Response): void => {
-let { ids, orderBy, limit, project } = req.body;
-// Coerce string-encoded arrays from MCP clients (e.g. "[1,2,3]" or "1,2,3")
-if (typeof ids === 'string') {
-try { ids = JSON.parse(ids); } catch { ids = ids.split(',').map(Number); }
-}
-if (!ids || !Array.isArray(ids)) {
-this.badRequest(res, 'ids must be an array of numbers');
-return;
-}
+const { ids, orderBy, limit, project } = req.body as z.infer<typeof observationsBatchSchema>;
if (ids.length === 0) {
res.json([]);
return;
}
-// Validate all IDs are numbers
-if (!ids.every(id => typeof id === 'number' && Number.isInteger(id))) {
-this.badRequest(res, 'All ids must be integers');
-return;
-}
const store = this.dbManager.getSessionStore();
const observations = store.getObservationsByIds(ids, { orderBy, limit, project });
@@ -193,17 +226,7 @@ export class DataRoutes extends BaseRouteHandler {
* Body: { memorySessionIds: string[] }
*/
private handleGetSdkSessionsByIds = this.wrapHandler((req: Request, res: Response): void => {
-let { memorySessionIds } = req.body;
-// Coerce string-encoded arrays from MCP clients (e.g. '["a","b"]' or "a,b")
-if (typeof memorySessionIds === 'string') {
-try { memorySessionIds = JSON.parse(memorySessionIds); } catch { memorySessionIds = memorySessionIds.split(',').map((s: string) => s.trim()); }
-}
-if (!Array.isArray(memorySessionIds)) {
-this.badRequest(res, 'memorySessionIds must be an array');
-return;
-}
+const { memorySessionIds } = req.body as z.infer<typeof sdkSessionsBatchSchema>;
const store = this.dbManager.getSessionStore();
const sessions = store.getSdkSessionsBySessionIds(memorySessionIds);
@@ -467,96 +490,4 @@ export class DataRoutes extends BaseRouteHandler {
});
});
-/**
-* Get pending queue contents
-* GET /api/pending-queue
-* Returns all pending, processing, and failed messages with optional recently processed
-*/
-private handleGetPendingQueue = this.wrapHandler((req: Request, res: Response): void => {
-const { PendingMessageStore } = require('../../../sqlite/PendingMessageStore.js');
-const pendingStore = new PendingMessageStore(this.dbManager.getSessionStore().db, 3);
-// Get queue contents (pending, processing, failed)
-const queueMessages = pendingStore.getQueueMessages();
-// Get recently processed (last 30 min, up to 20)
-const recentlyProcessed = pendingStore.getRecentlyProcessed(20, 30);
-// Get stuck message count (processing > 5 min)
-const stuckCount = pendingStore.getStuckCount(5 * 60 * 1000);
-// Get sessions with pending work
-const sessionsWithPending = pendingStore.getSessionsWithPendingMessages();
-res.json({
-queue: {
-messages: queueMessages,
-totalPending: queueMessages.filter((m: { status: string }) => m.status === 'pending').length,
-totalProcessing: queueMessages.filter((m: { status: string }) => m.status === 'processing').length,
-totalFailed: queueMessages.filter((m: { status: string }) => m.status === 'failed').length,
-stuckCount
-},
-recentlyProcessed,
-sessionsWithPendingWork: sessionsWithPending
-});
-});
-/**
-* Process pending queue
-* POST /api/pending-queue/process
-* Body: { sessionLimit?: number } - defaults to 10
-* Starts SDK agents for sessions with pending messages
-*/
-private handleProcessPendingQueue = this.wrapHandler(async (req: Request, res: Response): Promise<void> => {
-const sessionLimit = Math.min(
-Math.max(parseInt(req.body.sessionLimit, 10) || 10, 1),
-100 // Max 100 sessions at once
-);
-const result = await this.workerService.processPendingQueues(sessionLimit);
-res.json({
-success: true,
-...result
-});
-});
-/**
-* Clear all failed messages from the queue
-* DELETE /api/pending-queue/failed
-* Returns the number of messages cleared
-*/
-private handleClearFailedQueue = this.wrapHandler((req: Request, res: Response): void => {
-const { PendingMessageStore } = require('../../../sqlite/PendingMessageStore.js');
-const pendingStore = new PendingMessageStore(this.dbManager.getSessionStore().db, 3);
-const clearedCount = pendingStore.clearFailed();
-logger.info('QUEUE', 'Cleared failed queue messages', { clearedCount });
-res.json({
-success: true,
-clearedCount
-});
-});
-/**
-* Clear all messages from the queue (pending, processing, and failed)
-* DELETE /api/pending-queue/all
-* Returns the number of messages cleared
-*/
-private handleClearAllQueue = this.wrapHandler((req: Request, res: Response): void => {
-const { PendingMessageStore } = require('../../../sqlite/PendingMessageStore.js');
-const pendingStore = new PendingMessageStore(this.dbManager.getSessionStore().db, 3);
-const clearedCount = pendingStore.clearAll();
-logger.warn('QUEUE', 'Cleared ALL queue messages (pending, processing, failed)', { clearedCount });
-res.json({
-success: true,
-clearedCount
-});
-});
}
@@ -5,11 +5,16 @@
*/
import express, { Request, Response } from 'express';
import { z } from 'zod';
import { openSync, fstatSync, readSync, closeSync, existsSync, writeFileSync } from 'fs';
import { join } from 'path';
import { logger } from '../../../../utils/logger.js';
import { SettingsDefaultsManager } from '../../../../shared/SettingsDefaultsManager.js';
import { BaseRouteHandler } from '../BaseRouteHandler.js';
import { validateBody } from '../middleware/validateBody.js';
// Plan 06 Phase 3 — per-route Zod schema. The clear-logs endpoint takes no body.
const clearLogsSchema = z.object({}).passthrough();
/**
* Read the last N lines from a file without loading the entire file into memory.
@@ -99,7 +104,7 @@ export class LogsRoutes extends BaseRouteHandler {
setupRoutes(app: express.Application): void {
app.get('/api/logs', this.handleGetLogs.bind(this));
-app.post('/api/logs/clear', this.handleClearLogs.bind(this));
+app.post('/api/logs/clear', validateBody(clearLogsSchema), this.handleClearLogs.bind(this));
}
/**
@@ -6,10 +6,19 @@
*/
import express, { Request, Response } from 'express';
import { z } from 'zod';
import { BaseRouteHandler } from '../BaseRouteHandler.js';
import { validateBody } from '../middleware/validateBody.js';
import { logger } from '../../../../utils/logger.js';
import type { DatabaseManager } from '../../DatabaseManager.js';
// Plan 06 Phase 3 — per-route Zod schema.
const saveMemorySchema = z.object({
text: z.string().trim().min(1),
title: z.string().optional(),
project: z.string().optional(),
}).passthrough();
export class MemoryRoutes extends BaseRouteHandler {
constructor(
private dbManager: DatabaseManager,
@@ -19,7 +28,7 @@ export class MemoryRoutes extends BaseRouteHandler {
}
setupRoutes(app: express.Application): void {
-app.post('/api/memory/save', this.handleSaveMemory.bind(this));
+app.post('/api/memory/save', validateBody(saveMemorySchema), this.handleSaveMemory.bind(this));
}
/**
@@ -27,14 +36,9 @@ export class MemoryRoutes extends BaseRouteHandler {
* Body: { text: string, title?: string, project?: string }
*/
private handleSaveMemory = this.wrapHandler(async (req: Request, res: Response): Promise<void> => {
-const { text, title, project } = req.body;
+const { text, title, project } = req.body as z.infer<typeof saveMemorySchema>;
const targetProject = project || this.defaultProject;
-if (!text || typeof text !== 'string' || text.trim().length === 0) {
-this.badRequest(res, 'text is required and must be non-empty');
-return;
-}
const sessionStore = this.dbManager.getSessionStore();
const chromaSync = this.dbManager.getChromaSync();
@@ -69,6 +73,17 @@ export class MemoryRoutes extends BaseRouteHandler {
});
// 4. Sync to ChromaDB (async, fire-and-forget)
if (!chromaSync) {
logger.debug('CHROMA', 'ChromaDB sync skipped (chromaSync not available)', { id: result.id });
res.json({
success: true,
id: result.id,
title: observation.title,
project: targetProject,
message: `Memory saved as observation #${result.id}`
});
return;
}
chromaSync.syncObservation(
result.id,
memorySessionId,
+147 -7
@@ -6,9 +6,21 @@
*/
import express, { Request, Response } from 'express';
import { z } from 'zod';
import { SearchManager } from '../../SearchManager.js';
import { BaseRouteHandler } from '../BaseRouteHandler.js';
import { validateBody } from '../middleware/validateBody.js';
import { logger } from '../../../../utils/logger.js';
import { groupByDate } from '../../../../shared/timeline-formatting.js';
import type { ObservationSearchResult, SessionSummarySearchResult } from '../../../sqlite/types.js';
// Plan 06 Phase 3 — per-route Zod schema. The semantic-context endpoint
// also accepts query-string fallbacks, so the body itself is fully optional.
const semanticContextSchema = z.object({
q: z.string().optional(),
project: z.string().optional(),
limit: z.union([z.string(), z.number()]).optional(),
}).passthrough();
export class SearchRoutes extends BaseRouteHandler {
constructor(
@@ -38,7 +50,7 @@ export class SearchRoutes extends BaseRouteHandler {
app.get('/api/context/timeline', this.handleGetContextTimeline.bind(this));
app.get('/api/context/preview', this.handleContextPreview.bind(this));
app.get('/api/context/inject', this.handleContextInject.bind(this));
-app.post('/api/context/semantic', this.handleSemanticContext.bind(this));
+app.post('/api/context/semantic', validateBody(semanticContextSchema), this.handleSemanticContext.bind(this));
// Timeline and help endpoints
app.get('/api/timeline/by-query', this.handleGetTimelineByQuery.bind(this));
@@ -120,28 +132,156 @@ export class SearchRoutes extends BaseRouteHandler {
/**
* Search observations by concept
* GET /api/search/by-concept?concept=discovery&limit=5
*
* Chroma errors surface as 503 via ChromaUnavailableError (thrown by orchestrator).
*/
private handleSearchByConcept = this.wrapHandler(async (req: Request, res: Response): Promise<void> => {
-const result = await this.searchManager.findByConcept(req.query);
-res.json(result);
const orchestrator = this.searchManager.getOrchestrator();
const formatter = this.searchManager.getFormatter();
const query = req.query as Record<string, any>;
const rawConcept = query.concepts ?? query.concept;
const concept = Array.isArray(rawConcept) ? rawConcept[0] : rawConcept;
const strategyResult = await orchestrator.findByConcept(concept, query);
const observations = strategyResult.results.observations;
if (observations.length === 0) {
res.json({
content: [{
type: 'text' as const,
text: `No observations found with concept "${concept}"`
}]
});
return;
}
const header = `Found ${observations.length} observation(s) with concept "${concept}"\n\n${formatter.formatTableHeader()}`;
const rows = observations.map((obs: ObservationSearchResult, i: number) => formatter.formatObservationIndex(obs, i));
res.json({
content: [{
type: 'text' as const,
text: header + '\n' + rows.join('\n')
}]
});
});
/**
* Search by file path
* GET /api/search/by-file?filePath=...&limit=10
*
* Chroma errors surface as 503 via ChromaUnavailableError (thrown by orchestrator).
*/
private handleSearchByFile = this.wrapHandler(async (req: Request, res: Response): Promise<void> => {
-const result = await this.searchManager.findByFile(req.query);
-res.json(result);
const orchestrator = this.searchManager.getOrchestrator();
const formatter = this.searchManager.getFormatter();
const query = req.query as Record<string, any>;
// Accept both filePath and files for API compatibility
const rawFilePath = query.filePath ?? query.files;
const filePath = Array.isArray(rawFilePath)
? rawFilePath[0]
: (typeof rawFilePath === 'string' && rawFilePath.includes(','))
? rawFilePath.split(',')[0].trim()
: rawFilePath;
const { observations, sessions } = await orchestrator.findByFile(filePath, query);
const totalResults = observations.length + sessions.length;
if (totalResults === 0) {
res.json({
content: [{
type: 'text' as const,
text: `No results found for file "${filePath}"`
}]
});
return;
}
// Combine observations and sessions with timestamps for date grouping
const combined: Array<{
type: 'observation' | 'session';
data: ObservationSearchResult | SessionSummarySearchResult;
epoch: number;
created_at: string;
}> = [
...observations.map((obs: ObservationSearchResult) => ({
type: 'observation' as const,
data: obs,
epoch: obs.created_at_epoch,
created_at: obs.created_at
})),
...sessions.map((sess: SessionSummarySearchResult) => ({
type: 'session' as const,
data: sess,
epoch: sess.created_at_epoch,
created_at: sess.created_at
}))
];
combined.sort((a, b) => b.epoch - a.epoch);
const resultsByDate = groupByDate(combined, item => item.created_at);
const lines: string[] = [];
lines.push(`Found ${totalResults} result(s) for file "${filePath}"`);
lines.push('');
for (const [day, dayResults] of resultsByDate) {
lines.push(`### ${day}`);
lines.push('');
lines.push(formatter.formatTableHeader());
for (const result of dayResults) {
if (result.type === 'observation') {
lines.push(formatter.formatObservationIndex(result.data as ObservationSearchResult, 0));
} else {
lines.push(formatter.formatSessionIndex(result.data as SessionSummarySearchResult, 0));
}
}
lines.push('');
}
res.json({
content: [{
type: 'text' as const,
text: lines.join('\n')
}]
});
});
/**
* Search observations by type
* GET /api/search/by-type?type=bugfix&limit=10
*
* Chroma errors surface as 503 via ChromaUnavailableError (thrown by orchestrator).
*/
private handleSearchByType = this.wrapHandler(async (req: Request, res: Response): Promise<void> => {
-const result = await this.searchManager.findByType(req.query);
-res.json(result);
const orchestrator = this.searchManager.getOrchestrator();
const formatter = this.searchManager.getFormatter();
const query = req.query as Record<string, any>;
const rawType = query.type;
const type = (typeof rawType === 'string' && rawType.includes(','))
? rawType.split(',').map((s: string) => s.trim()).filter(Boolean)
: rawType;
const typeStr = Array.isArray(type) ? type.join(', ') : type;
const strategyResult = await orchestrator.findByType(type, query);
const observations = strategyResult.results.observations;
if (observations.length === 0) {
res.json({
content: [{
type: 'text' as const,
text: `No observations found with type "${typeStr}"`
}]
});
return;
}
const header = `Found ${observations.length} observation(s) with type "${typeStr}"\n\n${formatter.formatTableHeader()}`;
const rows = observations.map((obs: ObservationSearchResult, i: number) => formatter.formatObservationIndex(obs, i));
res.json({
content: [{
type: 'text' as const,
text: header + '\n' + rows.join('\n')
}]
});
});
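The by-type handler above accepts either a scalar type or a comma-separated list and joins arrays back for the display string. That normalization can be sketched in isolation (an illustrative sketch, not the shipped handler code):

```typescript
// Sketch of the type-param normalization used above: a comma-separated
// string becomes string[], a plain string stays scalar, and the display
// helper joins arrays back for the "Found N observation(s)" message.
function normalizeTypeParam(raw: unknown): string | string[] | undefined {
  if (typeof raw === 'string' && raw.includes(',')) {
    return raw.split(',').map((s) => s.trim()).filter(Boolean);
  }
  return typeof raw === 'string' ? raw : undefined;
}

function displayType(type: string | string[] | undefined): string {
  return Array.isArray(type) ? type.join(', ') : type ?? '';
}
```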
/**
+175 -162
@@ -6,6 +6,9 @@
*/
import express, { Request, Response } from 'express';
import { z } from 'zod';
import { ingestObservation } from '../shared.js';
import { validateBody } from '../middleware/validateBody.js';
import { getWorkerPort } from '../../../../shared/worker-utils.js';
import { logger } from '../../../../utils/logger.js';
import { stripMemoryTagsFromJson, stripMemoryTagsFromPrompt } from '../../../../utils/tag-stripping.js';
@@ -21,13 +24,14 @@ import { SessionCompletionHandler } from '../../session/SessionCompletionHandler
import { PrivacyCheckValidator } from '../../validation/PrivacyCheckValidator.js';
import { SettingsDefaultsManager } from '../../../../shared/SettingsDefaultsManager.js';
import { USER_SETTINGS_PATH } from '../../../../shared/paths.js';
-import { getProcessBySession, ensureProcessExit } from '../../ProcessRegistry.js';
+import { getSdkProcessForSession, ensureSdkProcessExit } from '../../../../supervisor/process-registry.js';
import { getProjectContext } from '../../../../utils/project-name.js';
import { normalizePlatformSource } from '../../../../shared/platform-source.js';
import { RestartGuard } from '../../RestartGuard.js';
const MAX_USER_PROMPT_BYTES = 256 * 1024;
export class SessionRoutes extends BaseRouteHandler {
private completionHandler: SessionCompletionHandler;
private spawnInProgress = new Map<number, boolean>();
private crashRecoveryScheduled = new Set<number>();
@@ -39,13 +43,9 @@ export class SessionRoutes extends BaseRouteHandler {
private openRouterAgent: OpenRouterAgent,
private eventBroadcaster: SessionEventBroadcaster,
private workerService: WorkerService,
-completionHandler: SessionCompletionHandler
+private completionHandler: SessionCompletionHandler,
) {
super();
-// Use the shared completion handler from WorkerService so the SDK-agent
-// completion path and the HTTP fallback route operate on the same instance
-// (avoids duplicate construction; keeps finalize semantics consistent).
-this.completionHandler = completionHandler;
}
/**
@@ -97,7 +97,7 @@ export class SessionRoutes extends BaseRouteHandler {
private static readonly STALE_GENERATOR_THRESHOLD_MS = 30_000; // 30 seconds (#1099)
private static readonly MAX_SESSION_WALL_CLOCK_MS = 4 * 60 * 60 * 1000; // 4 hours (#1590)
-private ensureGeneratorRunning(sessionDbId: number, source: string): void {
+public ensureGeneratorRunning(sessionDbId: number, source: string): void {
const session = this.sessionManager.getSession(sessionDbId);
if (!session) return;
@@ -121,7 +121,7 @@ export class SessionRoutes extends BaseRouteHandler {
session.abortController.abort();
}
const pendingStore = this.sessionManager.getPendingMessageStore();
-pendingStore.markAllSessionMessagesAbandoned(sessionDbId);
+pendingStore.transitionMessagesTo('abandoned', { sessionDbId });
this.sessionManager.removeSessionImmediate(sessionDbId);
return;
}
@@ -253,7 +253,7 @@ export class SessionRoutes extends BaseRouteHandler {
// Mark all processing messages as failed so they can be retried or abandoned
const pendingStore = this.sessionManager.getPendingMessageStore();
try {
-const failedCount = pendingStore.markSessionMessagesFailed(session.sessionDbId);
+const failedCount = pendingStore.transitionMessagesTo('failed', { sessionDbId: session.sessionDbId });
if (failedCount > 0) {
logger.error('SESSION', `Marked messages as failed after generator error`, {
sessionId: session.sessionDbId,
@@ -268,10 +268,11 @@ export class SessionRoutes extends BaseRouteHandler {
}
})
.finally(async () => {
-// CRITICAL: Verify subprocess exit to prevent zombie accumulation (Issue #1168)
-const tracked = getProcessBySession(session.sessionDbId);
+// Primary-path subprocess teardown — process-group kill ensures any
+// SDK descendants are reaped too (Principle 5).
+const tracked = getSdkProcessForSession(session.sessionDbId);
if (tracked && !tracked.process.killed && tracked.process.exitCode === null) {
-await ensureProcessExit(tracked, 5000);
+await ensureSdkProcessExit(tracked, 5000);
}
}
const sessionDbId = session.sessionDbId;
@@ -289,43 +290,6 @@ export class SessionRoutes extends BaseRouteHandler {
session.currentProvider = null;
this.workerService.broadcastProcessingStatus();
-// Stop-hook fire-and-forget (Phase 2): if the generator just processed
-// a summary and no work remains, the Stop hook is done and we should
-// self-clean the session. The summary write is already committed to
-// SQLite synchronously inside processAgentResponse() BEFORE startSession()
-// returns (see ResponseProcessor.ts: storeObservations() is sync, and
-// confirmProcessed() runs right after), so by the time this .finally()
-// runs the summary is durably persisted.
-//
-// We gate on lastSummaryStored so we don't finalize after every idle
-// timeout between tool calls — only when a real Stop event produced
-// a summary record.
-try {
-const pendingStore = this.sessionManager.getPendingMessageStore();
-const pendingNow = pendingStore.getPendingCount(sessionDbId);
-if (session.lastSummaryStored === true && pendingNow === 0) {
-logger.info('SESSION', 'Stop-hook self-clean: summary persisted + queue drained → finalizing', {
-sessionId: sessionDbId
-});
-// finalizeSession is idempotent and does NOT touch the in-memory map —
-// it only marks DB completed, drains any orphaned pending messages,
-// and broadcasts the completion event. sessionManager cleanup is
-// handled below by the existing abort/removeSessionImmediate flow.
-this.completionHandler.finalizeSession(sessionDbId);
-// Clear the flag so a subsequent re-activation of the same session
-// does not fire finalize again without a fresh summary.
-session.lastSummaryStored = false;
-// Ensure the session is removed from the active-sessions map so the
-// Stop-hook path doesn't depend on a later idle-timeout tick.
-this.sessionManager.removeSessionImmediate(sessionDbId);
-return;
-}
-} catch (err) {
-logger.warn('SESSION', 'finalizeSession failed in SessionRoutes generator .finally()', {
-sessionId: sessionDbId
-}, err as Error);
-}
// Crash recovery: If not aborted and still has work, restart (with limit)
if (!wasAborted) {
const pendingStore = this.sessionManager.getPendingMessageStore();
@@ -353,16 +317,34 @@ export class SessionRoutes extends BaseRouteHandler {
session.consecutiveRestarts = (session.consecutiveRestarts || 0) + 1; // Keep for logging
if (!restartAllowed) {
-logger.error('SESSION', `CRITICAL: Restart guard tripped — too many restarts in window, stopping to prevent runaway costs`, {
+logger.error('SESSION', `CRITICAL: Restart guard tripped — session is dead, draining pending messages and terminating`, {
sessionId: sessionDbId,
pendingCount,
restartsInWindow: session.restartGuard.restartsInWindow,
-windowMs: session.restartGuard.windowMs,
-maxRestarts: session.restartGuard.maxRestarts,
-action: 'Generator will NOT restart. Check logs for root cause. Messages remain in pending state.'
+consecutiveFailures: session.restartGuard.consecutiveFailuresSinceSuccess,
+maxConsecutiveFailures: session.restartGuard.maxConsecutiveFailures,
+action: 'Generator will NOT restart. Pending messages drained to abandoned. Check logs for root cause.'
});
// Don't restart - abort to prevent further API calls
// Don't restart - abort to prevent further API calls AND drain pending
// messages so the session doesn't reappear in getSessionsWithPendingMessages
// and trigger another auto-start cycle.
session.abortController.abort();
try {
const drained = pendingStore.transitionMessagesTo('abandoned', { sessionDbId });
if (drained > 0) {
logger.error('SESSION', 'Drained pending messages to abandoned after restart guard trip', {
sessionId: sessionDbId,
drained,
});
}
} catch (drainErr) {
const normalized = drainErr instanceof Error ? drainErr : new Error(String(drainErr));
logger.error('SESSION', 'Failed to drain pending messages after restart guard trip', {
sessionId: sessionDbId,
}, normalized);
}
return;
}
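The guard-trip logging above references a windowed restart budget (`restartsInWindow`, `windowMs`, `maxRestarts`). A minimal sketch of such a sliding-window check, with field names taken from the log payloads but the counting logic an illustrative assumption rather than the project's implementation:

```typescript
// Hypothetical sliding-window restart guard. Field names mirror the log
// payloads above; the eviction/budget logic here is an assumption.
interface RestartGuard {
  windowMs: number;
  maxRestarts: number;
  restartTimestamps: number[];
}

function restartAllowed(guard: RestartGuard, now: number = Date.now()): boolean {
  // Evict restarts older than the window, then compare against the budget.
  guard.restartTimestamps = guard.restartTimestamps.filter(t => now - t <= guard.windowMs);
  return guard.restartTimestamps.length < guard.maxRestarts;
}
```

Because the window slides, a session that stops failing naturally regains its restart budget as old timestamps age out.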
@@ -371,7 +353,9 @@ export class SessionRoutes extends BaseRouteHandler {
pendingCount,
consecutiveRestarts: session.consecutiveRestarts,
restartsInWindow: session.restartGuard!.restartsInWindow,
maxRestarts: session.restartGuard!.maxRestarts
maxRestarts: session.restartGuard!.maxRestarts,
consecutiveFailures: session.restartGuard!.consecutiveFailuresSinceSuccess,
maxConsecutiveFailures: session.restartGuard!.maxConsecutiveFailures
});
// Abort OLD controller before replacing to prevent child process leaks
@@ -411,21 +395,106 @@ export class SessionRoutes extends BaseRouteHandler {
setupRoutes(app: express.Application): void {
// Legacy session endpoints (use sessionDbId)
app.post('/sessions/:sessionDbId/init', this.handleSessionInit.bind(this));
app.post('/sessions/:sessionDbId/observations', this.handleObservations.bind(this));
app.post('/sessions/:sessionDbId/summarize', this.handleSummarize.bind(this));
app.post(
'/sessions/:sessionDbId/init',
validateBody(SessionRoutes.legacySessionInitSchema),
this.handleSessionInit.bind(this)
);
app.post(
'/sessions/:sessionDbId/observations',
validateBody(SessionRoutes.legacyObservationsSchema),
this.handleObservations.bind(this)
);
app.post(
'/sessions/:sessionDbId/summarize',
validateBody(SessionRoutes.legacySummarizeSchema),
this.handleSummarize.bind(this)
);
app.get('/sessions/:sessionDbId/status', this.handleSessionStatus.bind(this));
app.delete('/sessions/:sessionDbId', this.handleSessionDelete.bind(this));
app.post('/sessions/:sessionDbId/complete', this.handleSessionComplete.bind(this));
// New session endpoints (use contentSessionId)
app.post('/api/sessions/init', this.handleSessionInitByClaudeId.bind(this));
app.post('/api/sessions/observations', this.handleObservationsByClaudeId.bind(this));
app.post('/api/sessions/summarize', this.handleSummarizeByClaudeId.bind(this));
app.post('/api/sessions/complete', this.handleCompleteByClaudeId.bind(this));
app.post(
'/api/sessions/init',
validateBody(SessionRoutes.sessionInitByClaudeIdSchema),
this.handleSessionInitByClaudeId.bind(this)
);
app.post(
'/api/sessions/observations',
validateBody(SessionRoutes.observationsByClaudeIdSchema),
this.handleObservationsByClaudeId.bind(this)
);
app.post(
'/api/sessions/summarize',
validateBody(SessionRoutes.summarizeByClaudeIdSchema),
this.handleSummarizeByClaudeId.bind(this)
);
app.post(
'/api/sessions/complete',
validateBody(SessionRoutes.completeByClaudeIdSchema),
this.handleCompleteByClaudeId.bind(this)
);
app.get('/api/sessions/status', this.handleStatusByClaudeId.bind(this));
}
// Plan 06 Phase 3 — per-route Zod schemas. Schemas live at the top of the
// owning route file and gate body validation via `validateBody`.
// `passthrough()` preserves optional/forwarded fields the handlers
// already accept (e.g. cwd, agentId, agentType, platformSource).
private static readonly legacySessionInitSchema = z.object({
userPrompt: z.string().optional(),
promptNumber: z.number().int().optional(),
}).passthrough();
private static readonly legacyObservationsSchema = z.object({
tool_name: z.string().min(1),
tool_input: z.unknown().optional(),
tool_response: z.unknown().optional(),
prompt_number: z.number().int().optional(),
cwd: z.string().optional(),
}).passthrough();
private static readonly legacySummarizeSchema = z.object({
last_assistant_message: z.string().optional(),
}).passthrough();
private static readonly sessionInitByClaudeIdSchema = z.object({
contentSessionId: z.string().min(1),
project: z.string().optional(),
prompt: z.string().optional(),
platformSource: z.string().optional(),
customTitle: z.string().optional(),
}).passthrough();
private static readonly observationsByClaudeIdSchema = z.object({
contentSessionId: z.string().min(1),
tool_name: z.string().min(1),
tool_input: z.unknown().optional(),
tool_response: z.unknown().optional(),
cwd: z.string().optional(),
agentId: z.string().optional(),
agentType: z.string().optional(),
platformSource: z.string().optional(),
// Idempotency key for the UNIQUE(content_session_id, tool_use_id) index
// added in Plan 01 Phase 1. Accept both snake and camel shapes so
// cross-process callers using either convention still deduplicate.
tool_use_id: z.string().optional(),
toolUseId: z.string().optional(),
}).passthrough();
private static readonly summarizeByClaudeIdSchema = z.object({
contentSessionId: z.string().min(1),
last_assistant_message: z.string().optional(),
agentId: z.string().optional(),
platformSource: z.string().optional(),
}).passthrough();
private static readonly completeByClaudeIdSchema = z.object({
contentSessionId: z.string().min(1),
platformSource: z.string().optional(),
}).passthrough();
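The observations schema accepts both `tool_use_id` and `toolUseId` so callers using either convention still deduplicate. The handler's coalescing of the two shapes can be sketched as a small helper (the function name is illustrative, not from the codebase):

```typescript
// Coalesce the two accepted id shapes into one canonical value,
// mirroring the handler's `typeof` checks: snake_case wins, then
// camelCase, else undefined (which SQLite treats as a distinct NULL).
function coalesceToolUseId(body: Record<string, unknown>): string | undefined {
  const snake = body['tool_use_id'];
  if (typeof snake === 'string') return snake;
  const camel = body['toolUseId'];
  return typeof camel === 'string' ? camel : undefined;
}
```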
/**
* Initialize a new session
*/
@@ -600,98 +669,40 @@ export class SessionRoutes extends BaseRouteHandler {
* Body: { contentSessionId, tool_name, tool_input, tool_response, cwd }
*/
private handleObservationsByClaudeId = this.wrapHandler((req: Request, res: Response): void => {
const { contentSessionId, tool_name, tool_input, tool_response, cwd, agentId, agentType } = req.body;
const platformSource = normalizePlatformSource(req.body.platformSource);
const project = typeof cwd === 'string' && cwd.trim() ? getProjectContext(cwd).primary : '';
if (!contentSessionId) {
return this.badRequest(res, 'Missing contentSessionId');
}
// Load skip tools from settings
const settings = SettingsDefaultsManager.loadFromFile(USER_SETTINGS_PATH);
const skipTools = new Set(settings.CLAUDE_MEM_SKIP_TOOLS.split(',').map(t => t.trim()).filter(Boolean));
// Skip low-value or meta tools
if (skipTools.has(tool_name)) {
logger.debug('SESSION', 'Skipping observation for tool', { tool_name });
res.json({ status: 'skipped', reason: 'tool_excluded' });
return;
}
// Skip meta-observations: file operations on session-memory files
const fileOperationTools = new Set(['Edit', 'Write', 'Read', 'NotebookEdit']);
if (fileOperationTools.has(tool_name) && tool_input) {
const filePath = tool_input.file_path || tool_input.notebook_path;
if (filePath && filePath.includes('session-memory')) {
logger.debug('SESSION', 'Skipping meta-observation for session-memory file', {
tool_name,
file_path: filePath
});
res.json({ status: 'skipped', reason: 'session_memory_meta' });
return;
}
}
const store = this.dbManager.getSessionStore();
let sessionDbId: number;
let promptNumber: number;
try {
sessionDbId = store.createSDKSession(contentSessionId, project, '', undefined, platformSource);
promptNumber = store.getPromptNumberFromUserPrompts(contentSessionId);
} catch (error) {
const normalizedError = error instanceof Error ? error : new Error(String(error));
logger.error('HTTP', 'Observation storage failed', { contentSessionId, tool_name }, normalizedError);
res.json({ stored: false, reason: normalizedError.message });
return;
}
// Privacy check: skip if user prompt was entirely private
const userPrompt = PrivacyCheckValidator.checkUserPromptPrivacy(
store,
const {
contentSessionId,
promptNumber,
'observation',
sessionDbId,
{ tool_name }
);
if (!userPrompt) {
res.json({ status: 'skipped', reason: 'private' });
return;
}
// Strip memory tags from tool_input and tool_response
const cleanedToolInput = tool_input !== undefined
? stripMemoryTagsFromJson(JSON.stringify(tool_input))
: '{}';
const cleanedToolResponse = tool_response !== undefined
? stripMemoryTagsFromJson(JSON.stringify(tool_response))
: '{}';
// Queue observation
this.sessionManager.queueObservation(sessionDbId, {
tool_name,
tool_input: cleanedToolInput,
tool_response: cleanedToolResponse,
prompt_number: promptNumber,
cwd: cwd || (() => {
logger.error('SESSION', 'Missing cwd when queueing observation in SessionRoutes', {
sessionId: sessionDbId,
tool_name
});
return '';
})(),
agentId: typeof agentId === 'string' ? agentId : undefined,
agentType: typeof agentType === 'string' ? agentType : undefined,
tool_input,
tool_response,
cwd,
platformSource,
agentId,
agentType,
tool_use_id,
toolUseId,
} = req.body;
const result = ingestObservation({
contentSessionId,
toolName: tool_name,
toolInput: tool_input,
toolResponse: tool_response,
cwd,
platformSource,
agentId,
agentType,
toolUseId: typeof tool_use_id === 'string' ? tool_use_id : (typeof toolUseId === 'string' ? toolUseId : undefined),
});
// Ensure SDK agent is running
this.ensureGeneratorRunning(sessionDbId, 'observation');
if (!result.ok) {
res.status(result.status ?? 500).json({ stored: false, reason: result.reason });
return;
}
// Broadcast observation queued event
this.eventBroadcaster.broadcastObservationQueued(sessionDbId);
if ('status' in result && result.status === 'skipped') {
res.json({ status: 'skipped', reason: result.reason });
return;
}
res.json({ status: 'queued' });
});
@@ -707,10 +718,6 @@ export class SessionRoutes extends BaseRouteHandler {
const { contentSessionId, last_assistant_message, agentId } = req.body;
const platformSource = normalizePlatformSource(req.body.platformSource);
if (!contentSessionId) {
return this.badRequest(res, 'Missing contentSessionId');
}
// Belt-and-suspenders: reject summarize requests from subagent context.
// Gate on agentId only — agentType alone indicates a main session started with
// --agent, which still owns its summary. Mirrors the hook-side guard in summarize.ts.
@@ -802,10 +809,6 @@ export class SessionRoutes extends BaseRouteHandler {
logger.info('HTTP', '→ POST /api/sessions/complete', { contentSessionId });
if (!contentSessionId) {
return this.badRequest(res, 'Missing contentSessionId');
}
const store = this.dbManager.getSessionStore();
// Look up sessionDbId from contentSessionId (createSDKSession is idempotent)
@@ -854,10 +857,25 @@ export class SessionRoutes extends BaseRouteHandler {
// Only contentSessionId is truly required — Cursor and other platforms
// may omit prompt/project in their payload (#838, #1049)
const project = req.body.project || 'unknown';
const prompt = req.body.prompt || '[media prompt]';
let prompt = req.body.prompt || '[media prompt]';
const platformSource = normalizePlatformSource(req.body.platformSource);
const customTitle = req.body.customTitle || undefined;
const promptByteLength = Buffer.byteLength(prompt, 'utf8');
if (promptByteLength > MAX_USER_PROMPT_BYTES) {
logger.warn('HTTP', 'SessionRoutes: oversized prompt truncated at session-init boundary', {
project,
contentSessionId,
promptByteLength,
maxBytes: MAX_USER_PROMPT_BYTES,
preview: prompt.slice(0, 200)
});
const buf = Buffer.from(prompt, 'utf8');
let end = MAX_USER_PROMPT_BYTES;
while (end > 0 && (buf[end] & 0xc0) === 0x80) end--;
prompt = buf.subarray(0, end).toString('utf8');
}
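The byte-boundary walk above can be exercised as a standalone helper. UTF-8 continuation bytes all match the bit pattern `0b10xxxxxx`, so backing up while `(byte & 0xc0) === 0x80` lands the cut on a code-point boundary (helper name is illustrative):

```typescript
// Truncate a string to at most maxBytes of UTF-8 without splitting a
// multi-byte character, matching the boundary walk in the handler above.
function truncateUtf8Bytes(input: string, maxBytes: number): string {
  const buf = Buffer.from(input, 'utf8');
  if (buf.byteLength <= maxBytes) return input;
  let end = maxBytes;
  // Continuation bytes are 0b10xxxxxx; back up until a code-point start.
  while (end > 0 && (buf[end] & 0xc0) === 0x80) end--;
  return buf.subarray(0, end).toString('utf8');
}
```

A naive `prompt.slice(0, MAX_USER_PROMPT_BYTES)` would count UTF-16 code units, not bytes, and could still emit an invalid trailing sequence; the buffer walk avoids both problems.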
logger.info('HTTP', 'SessionRoutes: handleSessionInitByClaudeId called', {
contentSessionId,
project,
@@ -866,11 +884,6 @@ export class SessionRoutes extends BaseRouteHandler {
customTitle
});
// Validate required parameters
if (!this.validateRequired(req, res, ['contentSessionId'])) {
return;
}
const store = this.dbManager.getSessionStore();
// Step 1: Create/get SDK session (idempotent INSERT OR IGNORE)
@@ -6,6 +6,7 @@
*/
import express, { Request, Response } from 'express';
import { z } from 'zod';
import path from 'path';
import { readFileSync, writeFileSync, existsSync, renameSync, mkdirSync } from 'fs';
import { homedir } from 'os';
@@ -13,11 +14,27 @@ import { getPackageRoot } from '../../../../shared/paths.js';
import { logger } from '../../../../utils/logger.js';
import { SettingsManager } from '../../SettingsManager.js';
import { getBranchInfo, switchBranch, pullUpdates } from '../../BranchManager.js';
import { ModeManager } from '../../domain/ModeManager.js';
import { ModeManager } from '../../../domain/ModeManager.js';
import { BaseRouteHandler } from '../BaseRouteHandler.js';
import { validateBody } from '../middleware/validateBody.js';
import { SettingsDefaultsManager } from '../../../../shared/SettingsDefaultsManager.js';
import { clearPortCache } from '../../../../shared/worker-utils.js';
// Plan 06 Phase 3 — per-route Zod schemas. Semantic validation of individual
// CLAUDE_MEM_* keys still happens inside `validateSettings()` because the
// allowed-value rules are richer than what Zod expresses here.
const updateSettingsSchema = z.object({}).passthrough();
const toggleMcpSchema = z.object({
enabled: z.boolean(),
}).passthrough();
const switchBranchSchema = z.object({
branch: z.string().min(1),
}).passthrough();
const updateBranchSchema = z.object({}).passthrough();
export class SettingsRoutes extends BaseRouteHandler {
constructor(
private settingsManager: SettingsManager
@@ -28,16 +45,16 @@ export class SettingsRoutes extends BaseRouteHandler {
setupRoutes(app: express.Application): void {
// Settings endpoints
app.get('/api/settings', this.handleGetSettings.bind(this));
app.post('/api/settings', this.handleUpdateSettings.bind(this));
app.post('/api/settings', validateBody(updateSettingsSchema), this.handleUpdateSettings.bind(this));
// MCP toggle endpoints
app.get('/api/mcp/status', this.handleGetMcpStatus.bind(this));
app.post('/api/mcp/toggle', this.handleToggleMcp.bind(this));
app.post('/api/mcp/toggle', validateBody(toggleMcpSchema), this.handleToggleMcp.bind(this));
// Branch switching endpoints
app.get('/api/branch/status', this.handleGetBranchStatus.bind(this));
app.post('/api/branch/switch', this.handleSwitchBranch.bind(this));
app.post('/api/branch/update', this.handleUpdateBranch.bind(this));
app.post('/api/branch/switch', validateBody(switchBranchSchema), this.handleSwitchBranch.bind(this));
app.post('/api/branch/update', validateBody(updateBranchSchema), this.handleUpdateBranch.bind(this));
}
/**
@@ -156,12 +173,7 @@ export class SettingsRoutes extends BaseRouteHandler {
* Body: { enabled: boolean }
*/
private handleToggleMcp = this.wrapHandler((req: Request, res: Response): void => {
const { enabled } = req.body;
if (typeof enabled !== 'boolean') {
this.badRequest(res, 'enabled must be a boolean');
return;
}
const { enabled } = req.body as z.infer<typeof toggleMcpSchema>;
this.toggleMcp(enabled);
res.json({ success: true, enabled: this.isMcpEnabled() });
@@ -180,12 +192,7 @@ export class SettingsRoutes extends BaseRouteHandler {
* Body: { branch: "main" | "beta/7.0" }
*/
private handleSwitchBranch = this.wrapHandler(async (req: Request, res: Response): Promise<void> => {
const { branch } = req.body;
if (!branch) {
res.status(400).json({ success: false, error: 'Missing branch parameter' });
return;
}
const { branch } = req.body as z.infer<typeof switchBranchSchema>;
// Validate branch name
const allowedBranches = ['main', 'beta/7.0', 'feature/bun-executable'];
@@ -15,6 +15,40 @@ import { DatabaseManager } from '../../DatabaseManager.js';
import { SessionManager } from '../../SessionManager.js';
import { BaseRouteHandler } from '../BaseRouteHandler.js';
/**
 * Plan 06 Phase 6: viewer.html is loaded once at module init and held in
* memory for the lifetime of the worker process. Process restart is the
* cache-invalidation event; no fs.watch, no TTL, no refresh.
*
* We probe the same two on-disk locations the legacy handler did so the
* dev (cache) and installed (marketplace) layouts both keep working.
*/
const VIEWER_HTML_CANDIDATE_PATHS: readonly string[] = (() => {
const packageRoot = getPackageRoot();
return [
path.join(packageRoot, 'ui', 'viewer.html'),
path.join(packageRoot, 'plugin', 'ui', 'viewer.html'),
];
})();
const resolvedViewerHtmlPath: string | null =
VIEWER_HTML_CANDIDATE_PATHS.find((candidate) => existsSync(candidate)) ?? null;
const viewerHtmlBytes: Buffer | null = resolvedViewerHtmlPath
? readFileSync(resolvedViewerHtmlPath)
: null;
if (resolvedViewerHtmlPath) {
logger.info('SYSTEM', 'Cached viewer.html at boot', {
path: resolvedViewerHtmlPath,
bytes: viewerHtmlBytes!.byteLength,
});
} else {
logger.warn('SYSTEM', 'viewer.html not found at any expected location at boot', {
candidates: VIEWER_HTML_CANDIDATE_PATHS,
});
}
export class ViewerRoutes extends BaseRouteHandler {
constructor(
private sseBroadcaster: SSEBroadcaster,
@@ -49,26 +83,15 @@ export class ViewerRoutes extends BaseRouteHandler {
});
/**
* Serve viewer UI
* Serve viewer UI from the in-memory cache populated at module init.
 * Plan 06 Phase 6: single read at boot, no per-request fs hit.
*/
private handleViewerUI = this.wrapHandler((req: Request, res: Response): void => {
const packageRoot = getPackageRoot();
// Try cache structure first (ui/viewer.html), then marketplace structure (plugin/ui/viewer.html)
const viewerPaths = [
path.join(packageRoot, 'ui', 'viewer.html'),
path.join(packageRoot, 'plugin', 'ui', 'viewer.html')
];
const viewerPath = viewerPaths.find(p => existsSync(p));
if (!viewerPath) {
if (!viewerHtmlBytes) {
throw new Error('Viewer UI not found at any expected location');
}
const html = readFileSync(viewerPath, 'utf-8');
res.setHeader('Content-Type', 'text/html');
res.send(html);
res.setHeader('Content-Type', 'text/html; charset=utf-8');
res.send(viewerHtmlBytes);
});
/**
@@ -0,0 +1,406 @@
/**
* Worker HTTP shared ingest helpers.
*
* Per PATHFINDER-2026-04-22 plan 03 phase 0:
* `ingestObservation`, `ingestPrompt`, `ingestSummary` are the single
* in-process implementation of the worker's three ingest paths. The HTTP
* route handlers (cross-process callers) and worker-internal producers
* (transcript processor, ResponseProcessor) BOTH delegate here.
*
* No HTTP loopback. No duplicated insert logic. One helper, N callers.
*
* Wiring: `WorkerService` registers its `sessionManager`, `dbManager`, and
* `sessionEventBroadcaster` once at startup via `setIngestContext`. The
* helpers fail fast if called before registration.
*/
import { logger } from '../../../utils/logger.js';
import type { SessionManager } from '../SessionManager.js';
import type { DatabaseManager } from '../DatabaseManager.js';
import type { SessionEventBroadcaster } from '../events/SessionEventBroadcaster.js';
import type { ParsedSummary } from '../../../sdk/parser.js';
import { stripMemoryTagsFromJson } from '../../../utils/tag-stripping.js';
import { isProjectExcluded } from '../../../utils/project-filter.js';
import { SettingsDefaultsManager } from '../../../shared/SettingsDefaultsManager.js';
import { USER_SETTINGS_PATH } from '../../../shared/paths.js';
import { getProjectContext } from '../../../utils/project-name.js';
import { normalizePlatformSource } from '../../../shared/platform-source.js';
import { PrivacyCheckValidator } from '../validation/PrivacyCheckValidator.js';
import { EventEmitter } from 'events';
// ============================================================================
// Event bus — Phase 2 (`summaryStoredEvent`) consumers attach here.
// ============================================================================
/**
* Event payload emitted exactly once per successful `ingestSummary` call that
* actually stored a summary row. `messageId` is the pending_messages row id
* that produced the summary; `sessionId` is the contentSessionId.
*
 * Currently dormant: the only consumer (the blocking `/api/session/end`
* endpoint) was removed when the Stop hook went fire-and-forget. Kept for
* future internal subscribers; emissions are cheap no-ops with no listeners.
*/
export interface SummaryStoredEvent {
sessionId: string;
messageId: number;
}
class IngestEventBus extends EventEmitter {
/**
* Recent summaryStoredEvent buffer keyed by sessionId. Originally protected
* the register-after-emit race for the blocking `/api/session/end` handler.
* Currently unused (handler removed when Stop hook went fire-and-forget);
* preserved so any future subscriber gets the same race-free contract.
*/
private readonly recentStored = new Map<string, { event: SummaryStoredEvent; at: number }>();
private static readonly RECENT_EVENT_TTL_MS = 60_000;
constructor() {
super();
// Disable the default 10-listener warning. With no current consumers
// this is moot, but kept for parity if future subscribers attach.
this.setMaxListeners(0);
this.on('summaryStoredEvent', (evt: SummaryStoredEvent) => {
this.recentStored.set(evt.sessionId, { event: evt, at: Date.now() });
this.evictExpiredStored();
});
}
/** Read a recently-emitted summaryStoredEvent (idempotent; TTL-evicted). */
takeRecentSummaryStored(sessionId: string): SummaryStoredEvent | undefined {
const entry = this.recentStored.get(sessionId);
if (!entry) return undefined;
if (Date.now() - entry.at > IngestEventBus.RECENT_EVENT_TTL_MS) {
this.recentStored.delete(sessionId);
return undefined;
}
return entry.event;
}
private evictExpiredStored(): void {
const cutoff = Date.now() - IngestEventBus.RECENT_EVENT_TTL_MS;
for (const [key, entry] of this.recentStored) {
if (entry.at < cutoff) this.recentStored.delete(key);
}
}
}
/**
* Process-local event bus for ingestion lifecycle events.
*
 * Single Node EventEmitter: there is no third event bus in the worker.
* `SessionManager` already uses Node EventEmitter for queue notifications
* (`src/services/worker/SessionManager.ts:25`), and
* `SessionQueueProcessor` consumes EventEmitter events
* (`src/services/queue/SessionQueueProcessor.ts:18`); this module follows
* the same pattern at the ingestion layer.
*/
export const ingestEventBus = new IngestEventBus();
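The register-after-emit protection described above can be re-created in miniature: a bus that remembers the last emission per key so a subscriber attaching after the emit still observes it. This is a trimmed sketch under assumed names, not the module's actual class:

```typescript
import { EventEmitter } from 'node:events';

interface StoredEvent { sessionId: string; messageId: number }

// Minimal replay buffer: the bus records the most recent 'stored' event
// per session so late readers can poll it instead of missing the emit.
class ReplayBus extends EventEmitter {
  private readonly recent = new Map<string, { event: StoredEvent; at: number }>();

  constructor(private readonly ttlMs: number = 60_000) {
    super();
    this.on('stored', (evt: StoredEvent) => {
      this.recent.set(evt.sessionId, { event: evt, at: Date.now() });
    });
  }

  // Returns the buffered event if it is still within the TTL.
  takeRecent(sessionId: string): StoredEvent | undefined {
    const entry = this.recent.get(sessionId);
    if (!entry || Date.now() - entry.at > this.ttlMs) return undefined;
    return entry.event;
  }
}
```

Node's `emit` runs listeners synchronously, so the buffer is populated before `emit` returns; that is what makes the poll-after-emit contract race-free.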
// ============================================================================
// Context registration
// ============================================================================
interface IngestContext {
sessionManager: SessionManager;
dbManager: DatabaseManager;
eventBroadcaster: SessionEventBroadcaster;
/** Optional callback to (re)start the SDK generator after enqueue. */
ensureGeneratorRunning?: (sessionDbId: number, source: string) => void;
}
let ctx: IngestContext | null = null;
/**
* Register the worker-scoped services the ingest helpers depend on.
* Called once from `WorkerService` constructor.
*/
export function setIngestContext(next: IngestContext): void {
ctx = next;
}
/**
* Attach the generator-running callback after `SessionRoutes` has been
* constructed. `setIngestContext` is called early in `WorkerService` startup
* (before routes exist), so the callback is wired in as a second step once
* `SessionRoutes.ensureGeneratorRunning` is available.
*
* Without this, transcript-watcher observations queue via
* `ingestObservation()` but the SDK generator never auto-starts to drain
* them.
*/
export function attachIngestGeneratorStarter(
ensureGeneratorRunning: (sessionDbId: number, source: string) => void,
): void {
requireContext().ensureGeneratorRunning = ensureGeneratorRunning;
}
function requireContext(): IngestContext {
if (!ctx) {
throw new Error('ingest helpers used before setIngestContext() — wiring bug');
}
return ctx;
}
// ============================================================================
// Result type
// ============================================================================
export type IngestResult =
| { ok: true; sessionDbId: number; messageId?: number }
| { ok: true; status: 'skipped'; reason: string }
| { ok: false; reason: string; status?: number };
// ============================================================================
// Observation
// ============================================================================
export interface ObservationPayload {
contentSessionId: string;
toolName: string;
toolInput: unknown;
toolResponse: unknown;
cwd?: string;
platformSource?: string;
agentId?: string;
agentType?: string;
toolUseId?: string;
}
/**
* Ingest an observation: resolve session, apply project / skip-tool filters,
* strip privacy tags, persist to pending_messages, ensure the SDK generator
* is running.
*
* Same implementation for cross-process HTTP callers and worker-internal
* callers (transcript processor, ResponseProcessor side-effects).
*/
export function ingestObservation(payload: ObservationPayload): IngestResult {
const { sessionManager, dbManager, eventBroadcaster, ensureGeneratorRunning } = requireContext();
if (!payload.contentSessionId) {
return { ok: false, reason: 'missing contentSessionId', status: 400 };
}
if (!payload.toolName) {
return { ok: false, reason: 'missing toolName', status: 400 };
}
const platformSource = normalizePlatformSource(payload.platformSource);
const cwd = typeof payload.cwd === 'string' ? payload.cwd : '';
const project = cwd.trim() ? getProjectContext(cwd).primary : '';
const settings = SettingsDefaultsManager.loadFromFile(USER_SETTINGS_PATH);
// Project exclusion (the same gate the hook handler applies).
if (cwd && isProjectExcluded(cwd, settings.CLAUDE_MEM_EXCLUDED_PROJECTS)) {
return { ok: true, status: 'skipped', reason: 'project_excluded' };
}
// Skip low-value or meta tools per user settings.
const skipTools = new Set(
settings.CLAUDE_MEM_SKIP_TOOLS.split(',').map(t => t.trim()).filter(Boolean)
);
if (skipTools.has(payload.toolName)) {
return { ok: true, status: 'skipped', reason: 'tool_excluded' };
}
// Skip meta-observations: file operations on session-memory files.
const fileOperationTools = new Set(['Edit', 'Write', 'Read', 'NotebookEdit']);
if (fileOperationTools.has(payload.toolName) && payload.toolInput && typeof payload.toolInput === 'object') {
const input = payload.toolInput as { file_path?: string; notebook_path?: string };
const filePath = input.file_path || input.notebook_path;
if (filePath && filePath.includes('session-memory')) {
return { ok: true, status: 'skipped', reason: 'session_memory_meta' };
}
}
const store = dbManager.getSessionStore();
let sessionDbId: number;
let promptNumber: number;
try {
sessionDbId = store.createSDKSession(payload.contentSessionId, project, '', undefined, platformSource);
promptNumber = store.getPromptNumberFromUserPrompts(payload.contentSessionId);
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
logger.error('INGEST', 'Observation session resolution failed', {
contentSessionId: payload.contentSessionId,
toolName: payload.toolName,
}, error instanceof Error ? error : new Error(message));
return { ok: false, reason: message, status: 500 };
}
// Privacy: skip if user prompt was entirely private.
const userPrompt = PrivacyCheckValidator.checkUserPromptPrivacy(
store,
payload.contentSessionId,
promptNumber,
'observation',
sessionDbId,
{ tool_name: payload.toolName }
);
if (!userPrompt) {
return { ok: true, status: 'skipped', reason: 'private' };
}
const cleanedToolInput = payload.toolInput !== undefined
? stripMemoryTagsFromJson(JSON.stringify(payload.toolInput))
: '{}';
const cleanedToolResponse = payload.toolResponse !== undefined
? stripMemoryTagsFromJson(JSON.stringify(payload.toolResponse))
: '{}';
sessionManager.queueObservation(sessionDbId, {
tool_name: payload.toolName,
tool_input: cleanedToolInput,
tool_response: cleanedToolResponse,
prompt_number: promptNumber,
cwd: cwd || (() => {
logger.error('INGEST', 'Missing cwd when ingesting observation', {
sessionId: sessionDbId,
toolName: payload.toolName,
});
return '';
})(),
agentId: typeof payload.agentId === 'string' ? payload.agentId : undefined,
agentType: typeof payload.agentType === 'string' ? payload.agentType : undefined,
// Forward the provider-assigned tool-use id so the
// UNIQUE(content_session_id, tool_use_id) idempotency index from Plan 01
// can actually collapse replays. SQLite treats NULL tool_use_id values as
// distinct, so dropping it here silently defeats the INSERT OR IGNORE.
toolUseId: typeof payload.toolUseId === 'string' ? payload.toolUseId : undefined,
});
ensureGeneratorRunning?.(sessionDbId, 'observation');
eventBroadcaster.broadcastObservationQueued(sessionDbId);
return { ok: true, sessionDbId };
}
// ============================================================================
// Prompt
// ============================================================================
export interface PromptPayload {
contentSessionId: string;
/** The user prompt text (must not contain stripped tags). */
prompt: string;
cwd?: string;
platformSource?: string;
promptNumber?: number;
}
/**
* Ingest a user prompt. Used by the SessionStart / UserPromptSubmit hooks and
* by transcript-driven session inits. Wraps `SessionStore.appendUserPrompt`
* so cross-process and in-process callers share the same path.
*/
export function ingestPrompt(payload: PromptPayload): IngestResult {
const { dbManager } = requireContext();
if (!payload.contentSessionId) {
return { ok: false, reason: 'missing contentSessionId', status: 400 };
}
if (typeof payload.prompt !== 'string') {
return { ok: false, reason: 'missing prompt text', status: 400 };
}
const platformSource = normalizePlatformSource(payload.platformSource);
const cwd = typeof payload.cwd === 'string' ? payload.cwd : '';
const project = cwd.trim() ? getProjectContext(cwd).primary : '';
try {
const store = dbManager.getSessionStore();
const sessionDbId = store.createSDKSession(payload.contentSessionId, project, payload.prompt, undefined, platformSource);
return { ok: true, sessionDbId };
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
return { ok: false, reason: message, status: 500 };
}
}
// ============================================================================
// Summary
// ============================================================================
/**
* Two shapes of ingest:
* - "queue a summarize request" (cross-process hook trigger): goes via
* `SessionManager.queueSummarize` so the SDK agent will produce the XML
* payload on its next iteration.
* - "the SDK agent already produced the parsed summary": goes via
* `ingestSummary({ parsed, sessionDbId, messageId })`. Stored synchronously,
* emits `summaryStoredEvent` for the blocking endpoint in plan 05.
*/
export type SummaryPayload =
| {
kind: 'queue';
contentSessionId: string;
lastAssistantMessage?: string;
platformSource?: string;
cwd?: string;
}
| {
kind: 'parsed';
sessionDbId: number;
messageId: number;
contentSessionId: string;
parsed: ParsedSummary;
};
export function ingestSummary(payload: SummaryPayload): IngestResult {
// The 'parsed' branch is a pure post-store notification — it only touches
// the module-scope event bus, not the database/session manager. Resolving
// requireContext() before the branch split breaks unit tests that drive
// ResponseProcessor with a mocked sessionManager but no setIngestContext.
// Only the 'queue' branch needs the worker-internal context.
if (payload.kind === 'queue') {
const { sessionManager, dbManager, ensureGeneratorRunning } = requireContext();
if (!payload.contentSessionId) {
return { ok: false, reason: 'missing contentSessionId', status: 400 };
}
const platformSource = normalizePlatformSource(payload.platformSource);
const cwd = typeof payload.cwd === 'string' ? payload.cwd : '';
const project = cwd.trim() ? getProjectContext(cwd).primary : '';
let sessionDbId: number;
try {
sessionDbId = dbManager.getSessionStore().createSDKSession(payload.contentSessionId, project, '', undefined, platformSource);
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
return { ok: false, reason: message, status: 500 };
}
sessionManager.queueSummarize(sessionDbId, payload.lastAssistantMessage);
ensureGeneratorRunning?.(sessionDbId, 'summarize');
return { ok: true, sessionDbId };
}
// kind === 'parsed' — the SDK agent has produced a summary; store via
// session store and emit the summaryStoredEvent for blocking consumers.
// Skipped summaries (`<skip_summary/>`) are recorded as a successful no-op:
// they have no content to persist, but consumers should still be unblocked.
if (payload.parsed.skipped) {
ingestEventBus.emit('summaryStoredEvent', {
sessionId: payload.contentSessionId,
messageId: payload.messageId,
} satisfies SummaryStoredEvent);
return { ok: true, sessionDbId: payload.sessionDbId, messageId: payload.messageId };
}
// The actual storage of the parsed summary remains co-transactional with
// the observation batch in `processAgentResponse`. By the time this branch
// is reached the row is already persisted; this call is the canonical
// post-store notification path so every producer fires the event the same
// way (Plan 03 Phase 2 + greploop fix — sole emitter of summaryStoredEvent).
ingestEventBus.emit('summaryStoredEvent', {
sessionId: payload.contentSessionId,
messageId: payload.messageId,
} satisfies SummaryStoredEvent);
return { ok: true, sessionDbId: payload.sessionDbId, messageId: payload.messageId };
}
+10 -10
@@ -6,7 +6,7 @@
*/
import { logger } from '../../../utils/logger.js';
import type { ObservationRecord } from '../../../types/database.js';
import type { ObservationSearchResult } from '../../sqlite/types.js';
import type { SessionStore } from '../../sqlite/SessionStore.js';
import type { SearchOrchestrator } from '../search/SearchOrchestrator.js';
import { CorpusRenderer } from './CorpusRenderer.js';
@@ -121,19 +121,19 @@ export class CorpusBuilder {
}
/**
* Map a raw ObservationRecord (with JSON string fields) to a CorpusObservation
* Map a raw ObservationSearchResult (with JSON string fields) to a CorpusObservation
*/
private mapObservationToCorpus(row: ObservationRecord): CorpusObservation {
private mapObservationToCorpus(row: ObservationSearchResult): CorpusObservation {
return {
id: row.id,
type: row.type,
title: (row as any).title || '',
subtitle: (row as any).subtitle || null,
narrative: (row as any).narrative || null,
facts: safeParseJsonArray((row as any).facts),
concepts: safeParseJsonArray((row as any).concepts),
files_read: safeParseJsonArray((row as any).files_read),
files_modified: safeParseJsonArray((row as any).files_modified),
title: row.title || '',
subtitle: row.subtitle || null,
narrative: row.narrative || null,
facts: safeParseJsonArray(row.facts),
concepts: safeParseJsonArray(row.concepts),
files_read: safeParseJsonArray(row.files_read),
files_modified: safeParseJsonArray(row.files_modified),
project: row.project,
created_at: row.created_at,
created_at_epoch: row.created_at_epoch,
+18 -10
@@ -33,7 +33,13 @@ export class ResultFormatter {
if (totalResults === 0) {
if (chromaFailed) {
return this.formatChromaFailureMessage();
// Legacy callers route through here without a specific reason; surface a
// generic non-connection failure so users still get the diagnostic pointer
// instead of the old "install uv" lie.
return ResultFormatter.formatChromaFailureMessage({
message: 'unknown error (no reason captured by caller)',
isConnectionError: false,
});
}
return `No results found matching "${query}"`;
}
@@ -270,16 +276,18 @@ export class ResultFormatter {
}
/**
* Format Chroma failure message
* Format Chroma failure message with the real underlying error.
*
* Static so callers (e.g. SearchManager) can format without needing
* an instance. The message intentionally surfaces the raw error text
* and points users at /api/chroma/status?deep=1 for diagnostics,
* never a static "install uv" instruction (which lies about the cause).
*/
private formatChromaFailureMessage(): string {
return `Vector search failed - semantic search unavailable.
To enable semantic search:
1. Install uv: https://docs.astral.sh/uv/getting-started/installation/
2. Restart the worker: npm run worker:restart
Note: You can still use filter-only searches (date ranges, types, files) without a query term.`;
static formatChromaFailureMessage(reason: { message: string; isConnectionError: boolean }): string {
if (reason.isConnectionError) {
return `Semantic search is offline (Chroma MCP unreachable: ${reason.message}). Falling back to keyword search; results may be incomplete. Run \`/api/chroma/status?deep=1\` to diagnose.`;
}
return `Semantic search failed: ${reason.message}. Falling back to keyword search; results may be incomplete. Check \`~/.claude-mem/logs/\` for the CHROMA_SYNC entry. Run \`/api/chroma/status?deep=1\` for a deeper probe.`;
}
/**
@@ -30,6 +30,7 @@ import type {
SearchResults,
ObservationSearchResult
} from './types.js';
import { ChromaUnavailableError } from './errors.js';
import { logger } from '../../../utils/logger.js';
/**
@@ -88,34 +89,27 @@ export class SearchOrchestrator {
}
// PATH 2: CHROMA SEMANTIC SEARCH (query text + Chroma available)
// Fail-fast: if Chroma errors, ChromaSearchStrategy now lets the error
// propagate. We catch it here only to translate into a typed 503.
if (this.chromaStrategy) {
logger.debug('SEARCH', 'Orchestrator: Using Chroma semantic search', {});
const result = await this.chromaStrategy.search(options);
// If Chroma succeeded (even with 0 results), return
if (result.usedChroma) {
return result;
try {
return await this.chromaStrategy.search(options);
} catch (error) {
const errorObj = error instanceof Error ? error : new Error(String(error));
throw new ChromaUnavailableError(
`Chroma query failed: ${errorObj.message}`,
errorObj
);
}
// Chroma failed - fall back to SQLite for filter-only
logger.debug('SEARCH', 'Orchestrator: Chroma failed, falling back to SQLite', {});
const fallbackResult = await this.sqliteStrategy.search({
...options,
query: undefined // Remove query for SQLite fallback
});
return {
...fallbackResult,
fellBack: true
};
}
// PATH 3: No Chroma available
logger.debug('SEARCH', 'Orchestrator: Chroma not available', {});
// PATH 3: Chroma not configured (explicitly uninitialized at construction).
// This is a legitimate config state — return empty results, not an error.
logger.debug('SEARCH', 'Orchestrator: Chroma not configured', {});
return {
results: { observations: [], sessions: [], prompts: [] },
usedChroma: false,
fellBack: false,
strategy: 'sqlite'
};
}
@@ -130,12 +124,11 @@ export class SearchOrchestrator {
return await this.hybridStrategy.findByConcept(concept, options);
}
// Fallback to SQLite
// Chroma not configured: SQLite metadata-only result.
const results = this.sqliteStrategy.findByConcept(concept, options);
return {
results: { observations: results, sessions: [], prompts: [] },
usedChroma: false,
fellBack: false,
strategy: 'sqlite'
};
}
@@ -150,12 +143,11 @@ export class SearchOrchestrator {
return await this.hybridStrategy.findByType(type, options);
}
// Fallback to SQLite
// Chroma not configured: SQLite metadata-only result.
const results = this.sqliteStrategy.findByType(type, options);
return {
results: { observations: results, sessions: [], prompts: [] },
usedChroma: false,
fellBack: false,
strategy: 'sqlite'
};
}
+16
@@ -0,0 +1,16 @@
/**
* Search-related error classes
*/
import { AppError } from '../../server/ErrorHandler.js';
/**
* Thrown when Chroma is expected to be available but failed at query time.
* Maps to HTTP 503 Service Unavailable.
*/
export class ChromaUnavailableError extends AppError {
constructor(message: string, cause?: Error) {
super(message, 503, 'CHROMA_UNAVAILABLE', cause ? { cause: cause.message } : undefined);
this.name = 'ChromaUnavailableError';
}
}
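Since only the `super(message, 503, 'CHROMA_UNAVAILABLE', …)` call is visible here, the sketch below restates the subclass against a stand-in base class to show the 503 mapping the orchestrator relies on. The stand-in `AppError` shape is an assumption inferred from that call, not the actual `server/ErrorHandler` code:

```typescript
// Stand-in AppError modeling the (message, status, code, details) shape
// implied by the super() call above; the real one lives in server/ErrorHandler.
class AppError extends Error {
  constructor(
    message: string,
    public readonly status: number,
    public readonly code: string,
    public readonly details?: Record<string, string>,
  ) {
    super(message);
    this.name = 'AppError';
  }
}

class ChromaUnavailableError extends AppError {
  constructor(message: string, cause?: Error) {
    super(message, 503, 'CHROMA_UNAVAILABLE', cause ? { cause: cause.message } : undefined);
    this.name = 'ChromaUnavailableError';
  }
}

const err = new ChromaUnavailableError(
  'Chroma query failed: ECONNREFUSED',
  new Error('ECONNREFUSED'),
);
```

An HTTP layer that switches on `status`/`code` can then distinguish "Chroma down" (503, retryable) from an ordinary 500 without string-matching error messages.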
@@ -59,31 +59,16 @@ export class ChromaSearchStrategy extends BaseSearchStrategy implements SearchSt
const searchSessions = searchType === 'all' || searchType === 'sessions';
const searchPrompts = searchType === 'all' || searchType === 'prompts';
let observations: ObservationSearchResult[] = [];
let sessions: SessionSummarySearchResult[] = [];
let prompts: UserPromptSearchResult[] = [];
// Build Chroma where filter for doc_type and project
const whereFilter = this.buildWhereFilter(searchType, project);
logger.debug('SEARCH', 'ChromaSearchStrategy: Querying Chroma', { query, searchType });
try {
return await this.executeChromaSearch(query, whereFilter, {
searchObservations, searchSessions, searchPrompts,
obsType, concepts, files, orderBy, limit, project
});
} catch (error) {
const errorObj = error instanceof Error ? error : new Error(String(error));
logger.error('WORKER', 'ChromaSearchStrategy: Search failed', {}, errorObj);
// Return empty result - caller may try fallback strategy
return {
results: { observations: [], sessions: [], prompts: [] },
usedChroma: false,
fellBack: false,
strategy: 'chroma'
};
}
// Fail-fast: errors propagate to orchestrator, which translates to HTTP 503.
return await this.executeChromaSearch(query, whereFilter, {
searchObservations, searchSessions, searchPrompts,
obsType, concepts, files, orderBy, limit, project
});
}
private async executeChromaSearch(
@@ -111,7 +96,6 @@ export class ChromaSearchStrategy extends BaseSearchStrategy implements SearchSt
return {
results: { observations: [], sessions: [], prompts: [] },
usedChroma: true,
fellBack: false,
strategy: 'chroma'
};
}
@@ -123,27 +107,31 @@ export class ChromaSearchStrategy extends BaseSearchStrategy implements SearchSt
let sessions: SessionSummarySearchResult[] = [];
let prompts: UserPromptSearchResult[] = [];
// Chroma already ranks by vector similarity; 'relevance' has no SQL
// equivalent, so drop it before hydrating rows from SessionStore.
const sqlOrderBy: 'date_desc' | 'date_asc' | undefined =
options.orderBy === 'relevance' ? undefined : options.orderBy;
if (categorized.obsIds.length > 0) {
const obsOptions = { type: options.obsType, concepts: options.concepts, files: options.files, orderBy: options.orderBy, limit: options.limit, project: options.project };
const obsOptions = { type: options.obsType, concepts: options.concepts, files: options.files, orderBy: sqlOrderBy, limit: options.limit, project: options.project };
observations = this.sessionStore.getObservationsByIds(categorized.obsIds, obsOptions);
}
if (categorized.sessionIds.length > 0) {
sessions = this.sessionStore.getSessionSummariesByIds(categorized.sessionIds, {
orderBy: options.orderBy, limit: options.limit, project: options.project
orderBy: sqlOrderBy, limit: options.limit, project: options.project
});
}
if (categorized.promptIds.length > 0) {
prompts = this.sessionStore.getUserPromptsByIds(categorized.promptIds, {
orderBy: options.orderBy, limit: options.limit, project: options.project
orderBy: sqlOrderBy, limit: options.limit, project: options.project
});
}
return {
results: { observations, sessions, prompts },
usedChroma: true,
fellBack: false,
strategy: 'chroma'
};
}
@@ -79,20 +79,8 @@ export class HybridSearchStrategy extends BaseSearchStrategy implements SearchSt
const ids = metadataResults.map(obs => obs.id);
try {
return await this.rankAndHydrate(concept, ids, limit);
} catch (error) {
const errorObj = error instanceof Error ? error : new Error(String(error));
logger.error('WORKER', 'HybridSearchStrategy: findByConcept failed', {}, errorObj);
// Fall back to metadata-only results
const results = this.sessionSearch.findByConcept(concept, filterOptions);
return {
results: { observations: results, sessions: [], prompts: [] },
usedChroma: false,
fellBack: true,
strategy: 'hybrid'
};
}
// Fail-fast: Chroma errors propagate to orchestrator (HTTP 503).
return await this.rankAndHydrate(concept, ids, limit);
}
/**
@@ -117,19 +105,8 @@ export class HybridSearchStrategy extends BaseSearchStrategy implements SearchSt
const ids = metadataResults.map(obs => obs.id);
try {
return await this.rankAndHydrate(typeStr, ids, limit);
} catch (error) {
const errorObj = error instanceof Error ? error : new Error(String(error));
logger.error('WORKER', 'HybridSearchStrategy: findByType failed', {}, errorObj);
const results = this.sessionSearch.findByType(type as any, filterOptions);
return {
results: { observations: results, sessions: [], prompts: [] },
usedChroma: false,
fellBack: true,
strategy: 'hybrid'
};
}
// Fail-fast: Chroma errors propagate to orchestrator (HTTP 503).
return await this.rankAndHydrate(typeStr, ids, limit);
}
/**
@@ -158,18 +135,8 @@ export class HybridSearchStrategy extends BaseSearchStrategy implements SearchSt
const ids = metadataResults.observations.map(obs => obs.id);
try {
return await this.rankAndHydrateForFile(filePath, ids, limit, sessions);
} catch (error) {
const errorObj = error instanceof Error ? error : new Error(String(error));
logger.error('WORKER', 'HybridSearchStrategy: findByFile failed', {}, errorObj);
const results = this.sessionSearch.findByFile(filePath, filterOptions);
return {
observations: results.observations,
sessions: results.sessions,
usedChroma: false
};
}
// Fail-fast: Chroma errors propagate to orchestrator (HTTP 503).
return await this.rankAndHydrateForFile(filePath, ids, limit, sessions);
}
private async rankAndHydrate(
@@ -191,7 +158,6 @@ export class HybridSearchStrategy extends BaseSearchStrategy implements SearchSt
return {
results: { observations, sessions: [], prompts: [] },
usedChroma: true,
fellBack: false,
strategy: 'hybrid'
};
}
@@ -98,7 +98,6 @@ export class SQLiteSearchStrategy extends BaseSearchStrategy implements SearchSt
return {
results: { observations, sessions, prompts },
usedChroma: false,
fellBack: false,
strategy: 'sqlite'
};
}
@@ -54,7 +54,6 @@ export abstract class BaseSearchStrategy implements SearchStrategy {
prompts: []
},
usedChroma: strategy === 'chroma' || strategy === 'hybrid',
fellBack: false,
strategy
};
}
-2
@@ -103,8 +103,6 @@ export interface StrategySearchResult {
results: SearchResults;
/** Whether Chroma was used successfully */
usedChroma: boolean;
/** Whether fallback was triggered */
fellBack: boolean;
/** Strategy that produced the results */
strategy: SearchStrategyHint;
}
@@ -57,7 +57,7 @@ export class SessionCompletionHandler {
// completed session would never be picked up again.
try {
const pendingStore = this.sessionManager.getPendingMessageStore();
const drainedCount = pendingStore.markAllSessionMessagesAbandoned(sessionDbId);
const drainedCount = pendingStore.transitionMessagesTo('abandoned', { sessionDbId });
if (drainedCount > 0) {
logger.warn('SESSION', `Drained ${drainedCount} orphaned pending messages on session finalize`, {
sessionId: sessionDbId, drainedCount
-17
@@ -30,13 +30,6 @@ const BLOCKED_ENV_VARS = [
'CLAUDECODE', // Prevent "cannot be launched inside another Claude Code session" error
];
// Credential keys that claude-mem manages
export const MANAGED_CREDENTIAL_KEYS = [
'ANTHROPIC_API_KEY',
'GEMINI_API_KEY',
'OPENROUTER_API_KEY',
];
export interface ClaudeMemEnv {
// Credentials (optional - empty means use CLI billing for Claude)
ANTHROPIC_API_KEY?: string;
@@ -269,16 +262,6 @@ export function getCredential(key: keyof ClaudeMemEnv): string | undefined {
return env[key];
}
/**
* Set a specific credential in claude-mem's .env
* Pass empty string to remove the credential
*/
export function setCredential(key: keyof ClaudeMemEnv, value: string): void {
const env = loadClaudeMemEnv();
env[key] = value || undefined;
saveClaudeMemEnv(env);
}
/**
* Check if claude-mem has an Anthropic API key configured
* If false, it means CLI billing should be used
+3 -1
@@ -56,6 +56,7 @@ export interface SettingsDefaults {
CLAUDE_MEM_TRANSCRIPTS_CONFIG_PATH: string; // Path to transcript watcher config JSON
// Process Management
CLAUDE_MEM_MAX_CONCURRENT_AGENTS: string; // Max concurrent Claude SDK agent subprocesses (default: 2)
CLAUDE_MEM_HOOK_FAIL_LOUD_THRESHOLD: string; // Plan 05 Phase 8 — consecutive hook→worker unreachable failures before exit code 2 (default: 3)
// Exclusion Settings
CLAUDE_MEM_EXCLUDED_PROJECTS: string; // Comma-separated glob patterns for excluded project paths
CLAUDE_MEM_FOLDER_MD_EXCLUDE: string; // JSON array of folder paths to exclude from CLAUDE.md generation
@@ -133,6 +134,7 @@ export class SettingsDefaultsManager {
CLAUDE_MEM_TRANSCRIPTS_CONFIG_PATH: join(homedir(), '.claude-mem', 'transcript-watch.json'),
// Process Management
CLAUDE_MEM_MAX_CONCURRENT_AGENTS: '2', // Max concurrent Claude SDK agent subprocesses
CLAUDE_MEM_HOOK_FAIL_LOUD_THRESHOLD: '3', // Plan 05 Phase 8 — escalate to exit code 2 after N consecutive worker-unreachable hook invocations
// Exclusion Settings
CLAUDE_MEM_EXCLUDED_PROJECTS: '', // Comma-separated glob patterns for excluded project paths
CLAUDE_MEM_FOLDER_MD_EXCLUDE: '[]', // JSON array of folder paths to exclude from CLAUDE.md generation
@@ -193,7 +195,7 @@ export class SettingsDefaultsManager {
* Handles both string 'true' and boolean true from JSON
*/
static getBool(key: keyof SettingsDefaults): boolean {
const value = this.get(key);
const value: unknown = this.get(key);
return value === 'true' || value === true;
}
+35
@@ -0,0 +1,35 @@
/**
* Per-process settings cache for hook handlers.
*
* Plan 05 Phase 4 (PATHFINDER-2026-04-22): each hook process is short-lived,
* but multiple handlers within a single hook invocation independently call
* `SettingsDefaultsManager.loadFromFile(USER_SETTINGS_PATH)` and re-read the
* settings file from disk. Settings cannot mutate during a single hook
* invocation, so we memoize the first read for the lifetime of the process.
*
* One helper, N callers (Principle 6). Every hook handler that needs settings
* imports `loadFromFileOnce()` from here instead of calling
* `SettingsDefaultsManager.loadFromFile` directly.
*/
import {
SettingsDefaultsManager,
type SettingsDefaults,
} from './SettingsDefaultsManager.js';
import { USER_SETTINGS_PATH } from './paths.js';
let cachedSettings: SettingsDefaults | null = null;
/**
* Load settings from disk on first call, return the memoized value thereafter.
*
* Cache lifetime is the process; hooks are short-lived (typically <1s), so a
* settings change made by the user is picked up the next time Claude Code
* spawns a hook process. There is no in-process invalidation API because there
* is no in-process mutation path.
*/
export function loadFromFileOnce(): SettingsDefaults {
if (cachedSettings !== null) return cachedSettings;
cachedSettings = SettingsDefaultsManager.loadFromFile(USER_SETTINGS_PATH);
return cachedSettings;
}
+44
@@ -0,0 +1,44 @@
/**
* Single answer to "should this hook run for this cwd?"
*
* Plan 05 Phase 5 (PATHFINDER-2026-04-22): three handlers (observation,
* session-init, file-context) each duplicated the
* `loadFromFileOnce() → isProjectExcluded(cwd, settings.CLAUDE_MEM_EXCLUDED_PROJECTS)`
* pair. This module is the only entry point for that question; handlers call
* `shouldTrackProject(cwd)` and route through here.
*
* One helper, N callers (Principle 6). After this module lands, no handler
* references `isProjectExcluded` directly; the import lives only here.
*/
import { relative, isAbsolute } from 'path';
import { isProjectExcluded } from '../utils/project-filter.js';
import { loadFromFileOnce } from './hook-settings.js';
import { OBSERVER_SESSIONS_DIR } from './paths.js';
function isWithin(child: string, parent: string): boolean {
if (child === parent) return true;
const rel = relative(parent, child);
return rel.length > 0 && !rel.startsWith('..') && !isAbsolute(rel);
}
/**
* @returns true when the project at `cwd` is NOT excluded from claude-mem
* tracking, i.e., the hook should proceed; false when the project
* matches one of the exclusion globs.
*
* Hard-excludes OBSERVER_SESSIONS_DIR: the SDK agent spawns Claude Code with
* that cwd, and its hooks must never feed the worker; otherwise the observer's
* own init/continuation/summary prompts end up stored as `user_prompts` and
* leak into the viewer (meta-observation).
*/
export function shouldTrackProject(cwd: string): boolean {
if (!cwd) return true;
// path.relative handles separator differences (Windows '\\' vs POSIX '/')
// and trailing-slash variance, which a literal startsWith would miss.
if (isWithin(cwd, OBSERVER_SESSIONS_DIR)) {
return false;
}
const settings = loadFromFileOnce();
return !isProjectExcluded(cwd, settings.CLAUDE_MEM_EXCLUDED_PROJECTS);
}
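The containment helper above is the part a naive `startsWith` gets wrong. A standalone copy of the same logic (the example paths are illustrative, not taken from the codebase) shows the sibling-directory case:

```typescript
import { relative, isAbsolute } from 'path';

// Same containment logic as isWithin above: child is inside parent when
// path.relative(parent, child) neither escapes ('..') nor is absolute.
function isWithinCheck(child: string, parent: string): boolean {
  if (child === parent) return true;
  const rel = relative(parent, child);
  return rel.length > 0 && !rel.startsWith('..') && !isAbsolute(rel);
}

const observerDir = '/home/u/.claude-mem/observer-sessions';

// Nested path: correctly reported as inside.
const nested = isWithinCheck('/home/u/.claude-mem/observer-sessions/s1/x', observerDir);

// Sibling directory sharing the string prefix: a literal startsWith would
// say "inside"; relative() returns '../observer-sessions-archive' → outside.
const sibling = isWithinCheck('/home/u/.claude-mem/observer-sessions-archive', observerDir);

// Identical path: handled by the explicit equality check.
const equal = isWithinCheck(observerDir, observerDir);
```

The prefix-collision sibling is exactly the false positive the comment inside `shouldTrackProject` warns about.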
+395 -21
@@ -1,9 +1,17 @@
import path from "path";
import { readFileSync } from "fs";
import { readFileSync, existsSync, writeFileSync, renameSync, mkdirSync } from "fs";
import { spawn, execSync } from "child_process";
import { logger } from "../utils/logger.js";
import { HOOK_TIMEOUTS, getTimeout } from "./hook-constants.js";
import { HOOK_TIMEOUTS, HOOK_EXIT_CODES, getTimeout } from "./hook-constants.js";
import { SettingsDefaultsManager } from "./SettingsDefaultsManager.js";
import { MARKETPLACE_ROOT } from "./paths.js";
import { MARKETPLACE_ROOT, DATA_DIR } from "./paths.js";
import { loadFromFileOnce } from "./hook-settings.js";
// `validateWorkerPidFile` consults `captureProcessStartToken` at
// `src/supervisor/process-registry.ts` for PID-reuse detection (commit
// 99060bac). The lazy-spawn fast path below uses it to confirm a live port
// is owned by OUR worker incarnation rather than a stale PID squatting on
// the port after container restart.
import { validateWorkerPidFile } from "../supervisor/index.js";
// Named constants for health checks
// Allow env var override for users on slow systems (e.g., CLAUDE_MEM_HEALTH_TIMEOUT_MS=10000)
@@ -214,26 +222,392 @@ async function checkWorkerVersion(): Promise<void> {
/**
* Ensure worker service is running
* Quick health check - returns false if worker not healthy (doesn't block)
* Port might be in use by another process, or worker might not be started yet
* Resolve the absolute path to the worker-service script the hook should
* relaunch as a detached daemon. Hooks live in the plugin's `scripts/`
* directory next to `worker-service.cjs`; production and dev checkouts both
* ship the bundled CJS there. Returns null when no candidate exists on disk
* (partial install, build artifact missing).
*/
export async function ensureWorkerRunning(): Promise<boolean> {
// Quick health check (single attempt, no polling)
try {
if (await isWorkerHealthy()) {
await checkWorkerVersion(); // logs warning on mismatch, doesn't restart
return true; // Worker healthy
}
} catch (e) {
// Not healthy - log for debugging
logger.debug('SYSTEM', 'Worker health check failed', {
error: e instanceof Error ? e.message : String(e)
});
function resolveWorkerScriptPath(): string | null {
const candidates = [
path.join(MARKETPLACE_ROOT, 'plugin', 'scripts', 'worker-service.cjs'),
path.join(process.cwd(), 'plugin', 'scripts', 'worker-service.cjs'),
];
for (const candidate of candidates) {
if (existsSync(candidate)) return candidate;
}
return null;
}
// Port might be in use by something else, or worker not started
// Return false but don't throw - let caller decide how to handle
logger.warn('SYSTEM', 'Worker not healthy, hook will proceed gracefully');
/**
* Resolve the absolute path to the Bun runtime.
*
* Local to worker-utils.ts so the lazy-spawn path does not transitively
* import `services/infrastructure/ProcessManager.ts` that module pulls
* in `bun:sqlite` via `cwd-remap`, and pulling it in would break the NPX
* CLI bundle which must run under plain Node (no Bun). The worker daemon
* itself requires Bun (it uses bun:sqlite directly); this lookup finds
* the Bun binary that the daemon will execute under.
*/
function resolveBunRuntime(): string | null {
if (process.env.BUN && existsSync(process.env.BUN)) return process.env.BUN;
try {
const cmd = process.platform === 'win32' ? 'where bun' : 'which bun';
const output = execSync(cmd, {
stdio: ['ignore', 'pipe', 'ignore'],
encoding: 'utf-8',
windowsHide: true,
});
const firstMatch = output
.split(/\r?\n/)
.map(line => line.trim())
.find(line => line.length > 0);
return firstMatch || null;
} catch {
return null;
}
}
/**
* Wait for the worker port to open, using exponential backoff.
*
* Deliberately hand-rolled: `respawn` or similar npm helpers add a
* supervisor semantic layer we do not want here (Principle 6). The retry
* policy is three attempts with 250ms, 500ms, 1000ms backoff, which is
* enough to cover the worker's start-up (~1-2s on a warm cache, slower on
* Windows) without blocking a hook for long when the spawn outright failed.
*/
async function waitForWorkerPort(options: { attempts: number; backoffMs: number }): Promise<boolean> {
let delayMs = options.backoffMs;
for (let attempt = 1; attempt <= options.attempts; attempt++) {
if (await isWorkerPortAlive()) return true;
if (attempt < options.attempts) {
await new Promise<void>(resolve => setTimeout(resolve, delayMs));
delayMs *= 2;
}
}
return false;
}
/**
* Is the worker port owned by a live worker we recognize?
*
* Two gates:
* 1. HTTP /api/health returns 200, AND
* 2. PID-file start-token check (via `validateWorkerPidFile` and
* `captureProcessStartToken`) confirms the recorded PID has not been
* reused by a different process since the file was written.
*
* When the PID file is missing we accept a healthy HTTP response on its own:
* the file is written by the worker itself after `listen()` succeeds, so
* a brief window exists during which a freshly-spawned worker is reachable
* via HTTP but has not yet persisted its PID record. Treating this as
* "not ours" would cause the hook to double-spawn in a race with the
* worker's own PID-file write.
*
* An 'alive' status that fails identity verification is treated as dead so
* the caller falls through to the spawn path (Phase 8 contract).
*/
async function isWorkerPortAlive(): Promise<boolean> {
let healthy: boolean;
try {
healthy = await isWorkerHealthy();
} catch (error: unknown) {
logger.debug('SYSTEM', 'Worker health check threw', {
error: error instanceof Error ? error.message : String(error),
});
return false;
}
if (!healthy) return false;
const pidStatus = validateWorkerPidFile({ logAlive: false });
if (pidStatus === 'missing') return true; // race: listening before PID file written
if (pidStatus === 'alive') return true; // identity verified via start-token
return false; // 'stale' | 'invalid' — PID reused
}
/**
* Lazy-spawn the worker if it is not already running, then wait for its port.
*
* Flow:
* 1. If the port is alive AND verified as ours, return true (fast path).
* 2. Otherwise, resolve the bun runtime + worker script path.
* 3. Spawn detached, `unref()` so the hook's exit does not take the worker
* down with it (the worker lives as its own independent daemon).
* 4. Wait for the port to come up, up to 3 attempts with exponential
* backoff (250ms, 500ms, 1000ms; ~1.75s total).
*
* PID-reuse safety is inherited from `validateWorkerPidFile` (commit
* 99060bac); see the `isWorkerPortAlive` comment above. There is no
* auto-restart loop; failure is reported via the return value so the hook
* can surface it through exit code 2 (Principle 2 fail-fast).
*/
export async function ensureWorkerRunning(): Promise<boolean> {
if (await isWorkerPortAlive()) {
await checkWorkerVersion();
return true;
}
const runtimePath = resolveBunRuntime();
const scriptPath = resolveWorkerScriptPath();
if (!runtimePath) {
logger.warn('SYSTEM', 'Cannot lazy-spawn worker: Bun runtime not found on PATH');
return false;
}
if (!scriptPath) {
logger.warn('SYSTEM', 'Cannot lazy-spawn worker: worker-service.cjs not found in plugin/scripts');
return false;
}
logger.info('SYSTEM', 'Worker not running — lazy-spawning', { runtimePath, scriptPath });
try {
const proc = spawn(runtimePath, [scriptPath, '--daemon'], {
detached: true,
stdio: ['ignore', 'ignore', 'ignore'],
});
proc.unref();
} catch (error: unknown) {
if (error instanceof Error) {
logger.error('SYSTEM', 'Lazy-spawn of worker failed', { runtimePath, scriptPath }, error);
} else {
logger.error('SYSTEM', 'Lazy-spawn of worker failed (non-Error)', {
runtimePath, scriptPath, error: String(error),
});
}
return false;
}
const alive = await waitForWorkerPort({ attempts: 3, backoffMs: 250 });
if (!alive) {
logger.warn('SYSTEM', 'Worker port did not open after lazy-spawn within 3 attempts');
return false;
}
return true;
}
// ============================================================================
// Plan 05 Phase 9 — single per-process alive cache.
//
// One hook invocation may issue multiple worker requests (session-init issues
// several). The alive-state cannot change mid-invocation without the hook
// process exiting, so memoize the first result. By Principle 6 (one helper,
// N callers), this is the ONLY alive-state cache; all hook→worker call sites
// route through `executeWithWorkerFallback` (Phase 2) which calls this.
// ============================================================================
let aliveCache: boolean | null = null;
export async function ensureWorkerAliveOnce(): Promise<boolean> {
if (aliveCache !== null) return aliveCache;
aliveCache = await ensureWorkerRunning();
return aliveCache;
}
// ============================================================================
// Plan 05 Phase 8 — fail-loud counter.
//
// The counter records how many consecutive hook invocations have seen the
// worker unreachable. After N (default 3) consecutive failures, the next
// hook exits code 2 so Claude Code's hook contract surfaces the outage to
// Claude. Below N, hooks exit 0 to avoid breaking the user's session.
//
// This is NOT a retry. We do not reinvoke `ensureWorkerAliveOnce` or
// reattempt the HTTP request. We record the result of the one primary-path
// attempt and either return (graceful) or escalate (fail-loud).
//
// File: ~/.claude-mem/state/hook-failures.json
// Atomic write: tmp + rename (POSIX atomic within a filesystem).
// ============================================================================
interface HookFailureState {
consecutiveFailures: number;
lastFailureAt: number;
}
const FAIL_LOUD_DEFAULT_THRESHOLD = 3;
function getStateDir(): string {
return path.join(DATA_DIR, 'state');
}
function getHookFailuresPath(): string {
return path.join(getStateDir(), 'hook-failures.json');
}
function readHookFailureState(): HookFailureState {
try {
const raw = readFileSync(getHookFailuresPath(), 'utf-8');
const parsed = JSON.parse(raw) as Partial<HookFailureState>;
return {
consecutiveFailures: typeof parsed.consecutiveFailures === 'number' && Number.isFinite(parsed.consecutiveFailures)
? Math.max(0, Math.floor(parsed.consecutiveFailures))
: 0,
lastFailureAt: typeof parsed.lastFailureAt === 'number' && Number.isFinite(parsed.lastFailureAt)
? parsed.lastFailureAt
: 0,
};
} catch {
// Missing file or corrupt JSON → fresh state.
return { consecutiveFailures: 0, lastFailureAt: 0 };
}
}
function writeHookFailureStateAtomic(state: HookFailureState): void {
const stateDir = getStateDir();
const dest = getHookFailuresPath();
const tmp = `${dest}.tmp`;
try {
if (!existsSync(stateDir)) {
mkdirSync(stateDir, { recursive: true });
}
writeFileSync(tmp, JSON.stringify(state), 'utf-8');
renameSync(tmp, dest);
} catch (error: unknown) {
logger.debug('SYSTEM', 'Failed to persist hook-failure counter', {
error: error instanceof Error ? error.message : String(error),
});
}
}
function getFailLoudThreshold(): number {
try {
const settings = loadFromFileOnce();
const raw = settings.CLAUDE_MEM_HOOK_FAIL_LOUD_THRESHOLD;
const parsed = parseInt(raw, 10);
if (Number.isFinite(parsed) && parsed >= 1) return parsed;
} catch {
// settings unreadable — fall through to default
}
return FAIL_LOUD_DEFAULT_THRESHOLD;
}
/**
* Record a worker-unreachable hook invocation. Returns the new counter value.
* If the counter reaches the threshold, this function writes to stderr and
* exits the process with code 2 (blocking error per Claude Code hook contract).
*
* Not a retry: this function does not reattempt the operation. The caller already ran the
* single primary-path attempt and got `false` from `ensureWorkerAliveOnce`.
*/
function recordWorkerUnreachable(): number {
const state = readHookFailureState();
const next: HookFailureState = {
consecutiveFailures: state.consecutiveFailures + 1,
lastFailureAt: Date.now(),
};
writeHookFailureStateAtomic(next);
const threshold = getFailLoudThreshold();
if (next.consecutiveFailures >= threshold) {
process.stderr.write(
`claude-mem worker unreachable for ${next.consecutiveFailures} consecutive hooks.\n`
);
process.exit(HOOK_EXIT_CODES.BLOCKING_ERROR);
}
return next.consecutiveFailures;
}
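The counter logic is easy to isolate. A hedged sketch of the same increment-and-compare behavior, decoupled from the filesystem and from `process.exit` (names here are illustrative, not the module's exports):

```typescript
interface CounterState { consecutiveFailures: number; lastFailureAt: number; }

// Pure version of the record/reset pair: returns the next state plus a
// flag for "threshold crossed" instead of exiting the process.
function recordFailure(state: CounterState, threshold: number, now: number):
    { next: CounterState; failLoud: boolean } {
  const next = { consecutiveFailures: state.consecutiveFailures + 1, lastFailureAt: now };
  return { next, failLoud: next.consecutiveFailures >= threshold };
}

function resetCounter(): CounterState {
  return { consecutiveFailures: 0, lastFailureAt: 0 };
}
```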
/**
* Reset the consecutive-failure counter. Called when the worker is alive,
* acknowledging that any prior outage has ended. Not a retry: it is a
* success-path acknowledgement.
*/
function resetWorkerFailureCounter(): void {
const state = readHookFailureState();
if (state.consecutiveFailures === 0) return; // skip a no-op write
writeHookFailureStateAtomic({ consecutiveFailures: 0, lastFailureAt: 0 });
}
// ============================================================================
// Plan 05 Phase 2 — `executeWithWorkerFallback(url, method, body)`.
//
// Eight handlers used to duplicate the
// `ensureWorkerRunning() → workerHttpRequest() → if (!ok) return { continue: true }`
// sequence. This helper is the ONE implementation; eight handlers import it.
//
// Behavior:
// 1. ensureWorkerAliveOnce() (Phase 9). If false → fail-loud counter
// (Phase 8). May process.exit(2). Otherwise return graceful fallback.
// 2. workerHttpRequest(url, method, body). Parse JSON.
// 3. On success, reset the fail-loud counter.
//
// No retry inside this helper. No timeout-and-exit-0 swallow. The fail-loud
// counter records consecutive invocation outcomes; it does not reinvoke work.
// ============================================================================
// Branded sentinel so isWorkerFallback cannot false-positive on legitimate
// API responses that happen to carry `continue: true` in their own schema.
const WORKER_FALLBACK_BRAND: unique symbol = Symbol.for('claude-mem/worker-fallback');
export type WorkerFallback =
| { continue: true; [WORKER_FALLBACK_BRAND]: true }
| { continue: true; reason: string; [WORKER_FALLBACK_BRAND]: true };
export type WorkerCallResult<T> = T | WorkerFallback;
export function isWorkerFallback<T>(result: WorkerCallResult<T>): result is WorkerFallback {
return typeof result === 'object'
&& result !== null
&& (result as { [WORKER_FALLBACK_BRAND]?: unknown })[WORKER_FALLBACK_BRAND] === true;
}
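Why a branded symbol rather than a plain `{ continue: true }` check: an API payload may legally carry `continue: true` in its own schema, and duck-typing would misclassify it as a fallback. A small self-contained demonstration of the same technique under the same `Symbol.for` key (types and names here are illustrative):

```typescript
const BRAND: unique symbol = Symbol.for('claude-mem/worker-fallback');

type Fallback = { continue: true; [BRAND]: true };

// The guard checks the symbol-keyed brand, not the `continue` field, so a
// structurally similar API response cannot false-positive.
function isFallback(value: unknown): value is Fallback {
  return typeof value === 'object'
    && value !== null
    && (value as { [BRAND]?: unknown })[BRAND] === true;
}

// A legitimate API response that merely *looks* like a fallback.
const apiResponse = { continue: true, items: [1, 2, 3] };
const fallback: Fallback = { continue: true, [BRAND]: true };
```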
export interface WorkerFallbackOptions {
/**
* Per-call HTTP timeout in ms. Forwarded to workerHttpRequest. Omit to use
* HEALTH_CHECK_TIMEOUT_MS (the default ~3 s suitable for short pings).
* All hook endpoints are fire-and-forget queueing endpoints that return
* `{status: 'queued'}` immediately, so the default suffices.
*/
timeoutMs?: number;
}
export async function executeWithWorkerFallback<T = unknown>(
url: string,
method: 'GET' | 'POST' | 'PUT' | 'DELETE',
body?: unknown,
options: WorkerFallbackOptions = {},
): Promise<WorkerCallResult<T>> {
const alive = await ensureWorkerAliveOnce();
if (!alive) {
// Records and possibly process.exit(2). If we return below, the counter
// is below threshold and the user's session continues uninterrupted.
recordWorkerUnreachable();
return { continue: true, reason: 'worker_unreachable', [WORKER_FALLBACK_BRAND]: true };
}
const init: { method: string; headers?: Record<string, string>; body?: string; timeoutMs?: number } = { method };
if (body !== undefined) {
init.headers = { 'Content-Type': 'application/json' };
init.body = JSON.stringify(body);
}
if (options.timeoutMs !== undefined) {
init.timeoutMs = options.timeoutMs;
}
const response = await workerHttpRequest(url, init);
if (!response.ok) {
// Non-2xx is a real worker response (so the worker IS reachable). Reset
// the consecutive-failures counter; surface the response body to the
// caller as a typed value via T's caller-controlled shape. Callers that
// care about non-2xx must inspect the value (or wrap with their own
// status check); the helper does not silently coerce non-2xx into a
// graceful fallback.
resetWorkerFailureCounter();
const text = await response.text().catch(() => '');
let parsed: unknown = text;
try { parsed = JSON.parse(text); } catch { /* keep raw text */ }
return parsed as T;
}
resetWorkerFailureCounter();
const text = await response.text();
if (text.length === 0) return undefined as unknown as T;
try {
return JSON.parse(text) as T;
} catch {
return text as unknown as T;
}
}
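The tail of the function applies a lenient body-parse rule: empty body yields `undefined`, valid JSON yields the parsed value, and anything else falls back to raw text. Extracted as a standalone helper purely for illustration (the real module inlines this logic):

```typescript
// Lenient parse for worker responses: JSON when possible, raw text
// otherwise, undefined for an empty body. Never throws.
function parseWorkerBody<T = unknown>(text: string): T | undefined {
  if (text.length === 0) return undefined;
  try {
    return JSON.parse(text) as T;
  } catch {
    return text as unknown as T;
  }
}
```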
@@ -146,10 +146,6 @@ export async function startSupervisor(): Promise<void> {
await supervisorSingleton.start();
}
export async function stopSupervisor(): Promise<void> {
await supervisorSingleton.stop();
}
export function getSupervisor(): Supervisor {
return supervisorSingleton;
}
@@ -168,7 +164,7 @@ export function validateWorkerPidFile(options: ValidateWorkerPidOptions = {}): V
let pidInfo: PidInfo | null = null;
try {
pidInfo = JSON.parse(readFileSync(pidFilePath, 'utf-8')) as PidInfo;
pidInfo = JSON.parse(readFileSync(pidFilePath, 'utf-8')) as PidInfo | null;
} catch (error: unknown) {
if (error instanceof Error) {
logger.warn('SYSTEM', 'Failed to parse worker PID file, removing it', { path: pidFilePath }, error);
@@ -182,7 +178,8 @@ export function validateWorkerPidFile(options: ValidateWorkerPidOptions = {}): V
return 'invalid';
}
if (verifyPidFileOwnership(pidInfo)) {
const isAlive = verifyPidFileOwnership(pidInfo);
if (isAlive && pidInfo) {
if (options.logAlive ?? true) {
logger.info('SYSTEM', 'Worker already running (PID alive)', {
existingPid: pidInfo.pid,
@@ -194,9 +191,9 @@ export function validateWorkerPidFile(options: ValidateWorkerPidOptions = {}): V
}
logger.info('SYSTEM', 'Removing stale PID file (worker process is dead or PID has been reused)', {
pid: pidInfo.pid,
port: pidInfo.port,
startedAt: pidInfo.startedAt
pid: pidInfo?.pid,
port: pidInfo?.port,
startedAt: pidInfo?.startedAt
});
rmSync(pidFilePath, { force: true });
return 'stale';
@@ -1,8 +1,9 @@
import { ChildProcess, spawnSync } from 'child_process';
import { ChildProcess, spawn, spawnSync } from 'child_process';
import { existsSync, mkdirSync, readFileSync, writeFileSync } from 'fs';
import { homedir } from 'os';
import path from 'path';
import { logger } from '../utils/logger.js';
import { sanitizeEnv } from './env-sanitizer.js';
const REAP_SESSION_SIGTERM_TIMEOUT_MS = 5_000;
const REAP_SESSION_SIGKILL_TIMEOUT_MS = 1_000;
@@ -15,6 +16,14 @@ export interface ManagedProcessInfo {
type: string;
sessionId?: string | number;
startedAt: string;
// POSIX process group leader PID for group-scoped teardown.
// On Unix, when a child is spawned with `detached: true`, the kernel calls
// setpgid() and the child becomes the leader of its own group — its pgid
// equals its pid. Stored so `process.kill(-pgid, signal)` can tear down
// the child AND every descendant it spawned in one syscall (Principle 5).
// Undefined on Windows (no POSIX groups) and for processes that were not
// spawned with detached: true (e.g. the worker itself, MCP stdio clients).
pgid?: number;
}
export interface ManagedProcessRecord extends ManagedProcessInfo {
@@ -303,22 +312,30 @@ export class ProcessRegistry {
pids: sessionRecords.map(r => r.pid)
});
// Phase 1: SIGTERM all alive processes
// Phase 1: SIGTERM all alive processes — use process-group teardown for
// records that carry pgid so any descendants the SDK spawned are killed
// too (Principle 5).
const aliveRecords = sessionRecords.filter(r => isPidAlive(r.pid));
for (const record of aliveRecords) {
try {
process.kill(record.pid, 'SIGTERM');
if (typeof record.pgid === 'number' && process.platform !== 'win32') {
process.kill(-record.pgid, 'SIGTERM');
} else {
process.kill(record.pid, 'SIGTERM');
}
} catch (error: unknown) {
if (error instanceof Error) {
const code = (error as NodeJS.ErrnoException).code;
if (code !== 'ESRCH') {
logger.debug('SYSTEM', `Failed to SIGTERM session process PID ${record.pid}`, {
pid: record.pid
pid: record.pid,
pgid: record.pgid
}, error);
}
} else {
logger.warn('SYSTEM', `Failed to SIGTERM session process PID ${record.pid} (non-Error)`, {
pid: record.pid,
pgid: record.pgid,
error: String(error)
});
}
@@ -333,26 +350,34 @@ export class ProcessRegistry {
await new Promise(resolve => setTimeout(resolve, 100));
}
// Phase 3: SIGKILL any survivors
// Phase 3: SIGKILL any survivors — process-group teardown when pgid is
// recorded so descendants are killed too.
const survivors = aliveRecords.filter(r => isPidAlive(r.pid));
for (const record of survivors) {
logger.warn('SYSTEM', `Session process PID ${record.pid} did not exit after SIGTERM, sending SIGKILL`, {
pid: record.pid,
pgid: record.pgid,
sessionId: sessionIdNum
});
try {
process.kill(record.pid, 'SIGKILL');
if (typeof record.pgid === 'number' && process.platform !== 'win32') {
process.kill(-record.pgid, 'SIGKILL');
} else {
process.kill(record.pid, 'SIGKILL');
}
} catch (error: unknown) {
if (error instanceof Error) {
const code = (error as NodeJS.ErrnoException).code;
if (code !== 'ESRCH') {
logger.debug('SYSTEM', `Failed to SIGKILL session process PID ${record.pid}`, {
pid: record.pid
pid: record.pid,
pgid: record.pgid
}, error);
}
} else {
logger.warn('SYSTEM', `Failed to SIGKILL session process PID ${record.pid} (non-Error)`, {
pid: record.pid,
pgid: record.pgid,
error: String(error)
});
}
@@ -406,3 +431,401 @@ export function getProcessRegistry(): ProcessRegistry {
export function createProcessRegistry(registryPath: string): ProcessRegistry {
return new ProcessRegistry(registryPath);
}
// ---------------------------------------------------------------------------
// SDK session lookup + exit verification
// ---------------------------------------------------------------------------
export interface TrackedSdkProcess {
pid: number;
pgid: number | undefined;
sessionDbId: number;
process: ChildProcess;
}
/**
* Look up the live SDK subprocess for a given session, if any.
*
* Returns undefined when no SDK record is registered for the session, or
* when the ChildProcess reference has been dropped (process exited and was
* unregistered). Warns on duplicates multiple SDK records per session
* indicate a race in createSdkSpawnFactory's pre-spawn cleanup.
*/
export function getSdkProcessForSession(sessionDbId: number): TrackedSdkProcess | undefined {
const registry = getProcessRegistry();
const matches = registry.getBySession(sessionDbId).filter(r => r.type === 'sdk');
if (matches.length > 1) {
logger.warn('PROCESS', `Multiple SDK processes found for session ${sessionDbId}`, {
count: matches.length,
pids: matches.map(m => m.pid),
});
}
const record = matches[0];
if (!record) return undefined;
const processRef = registry.getRuntimeProcess(record.id);
if (!processRef) return undefined;
return {
pid: record.pid,
pgid: record.pgid,
sessionDbId,
process: processRef,
};
}
/**
* Wait for an SDK subprocess to exit, escalating to SIGKILL on the process
* group if it overstays `timeoutMs`. Fully event-driven: no polling.
*
* This is primary-path cleanup invoked from session-level finally() blocks
* when a session ends; it is NOT a reaper. It runs at most once per session
* deletion. Process-group teardown (`kill(-pgid, SIGKILL)`) ensures any
* descendants the SDK spawned are also killed.
*/
export async function ensureSdkProcessExit(
tracked: TrackedSdkProcess,
timeoutMs: number = 5000
): Promise<void> {
const { pid, pgid, process: proc } = tracked;
// Already exited? Trust exitCode, not proc.killed — proc.killed only means
// Node sent a signal; the process may still be running.
if (proc.exitCode !== null) return;
const exitPromise = new Promise<void>((resolve) => {
proc.once('exit', () => resolve());
});
const timeoutPromise = new Promise<void>((resolve) => {
setTimeout(resolve, timeoutMs);
});
await Promise.race([exitPromise, timeoutPromise]);
if (proc.exitCode !== null) return;
// Timeout: escalate to SIGKILL on the whole process group so any
// descendants the SDK spawned are killed too (Principle 5).
logger.warn('PROCESS', `PID ${pid} did not exit after ${timeoutMs}ms, sending SIGKILL to process group`, {
pid, pgid, timeoutMs,
});
try {
if (typeof pgid === 'number' && process.platform !== 'win32') {
process.kill(-pgid, 'SIGKILL');
} else {
proc.kill('SIGKILL');
}
} catch {
// Already dead — fine.
}
// Wait up to 1s for SIGKILL to take effect (event-driven, not blind sleep).
const sigkillExit = new Promise<void>((resolve) => {
proc.once('exit', () => resolve());
});
const sigkillTimeout = new Promise<void>((resolve) => {
setTimeout(resolve, 1000);
});
await Promise.race([sigkillExit, sigkillTimeout]);
}
// ---------------------------------------------------------------------------
// Pool slot waiters — backpressure without eviction
// ---------------------------------------------------------------------------
//
// waitForSlot is used by SDKAgent to avoid starting more concurrent SDK
// subprocesses than configured. It is event-driven: when a process exits and
// is unregistered, notifySlotAvailable() wakes exactly one waiter. There is
// no polling. There is no idle-session eviction (Principle 1 — do not kick
// live sessions to make room; a full pool must apply backpressure upstream).
const TOTAL_PROCESS_HARD_CAP = 10;
const slotWaiters: Array<() => void> = [];
function getActiveSdkCount(): number {
return getProcessRegistry().getAll().filter(record => record.type === 'sdk').length;
}
function notifySlotAvailable(): void {
const waiter = slotWaiters.shift();
if (waiter) waiter();
}
/**
* Wait until a pool slot is available to spawn another SDK subprocess.
*
* Resolves immediately when active SDK process count is below `maxConcurrent`.
* Otherwise enqueues a waiter that is woken by a subsequent exit handler.
* Rejects with a timeout error if no slot opens within `timeoutMs`.
* Rejects immediately if the registry is already at the hard cap.
*/
export async function waitForSlot(maxConcurrent: number, timeoutMs: number = 60_000): Promise<void> {
const activeCount = getActiveSdkCount();
if (activeCount >= TOTAL_PROCESS_HARD_CAP) {
throw new Error(`Hard cap exceeded: ${activeCount} processes in registry (cap=${TOTAL_PROCESS_HARD_CAP}). Refusing to spawn more.`);
}
if (activeCount < maxConcurrent) return;
logger.info('PROCESS', `Pool limit reached (${activeCount}/${maxConcurrent}), waiting for slot...`);
return new Promise<void>((resolve, reject) => {
const timeout = setTimeout(() => {
const idx = slotWaiters.indexOf(onSlot);
if (idx >= 0) slotWaiters.splice(idx, 1);
reject(new Error(`Timed out waiting for agent pool slot after ${timeoutMs}ms`));
}, timeoutMs);
const onSlot = () => {
clearTimeout(timeout);
if (getActiveSdkCount() < maxConcurrent) {
resolve();
} else {
slotWaiters.push(onSlot);
}
};
slotWaiters.push(onSlot);
});
}
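The wake-one discipline is the core of the waiter queue: each freed slot resolves exactly one pending waiter, in arrival order, never all of them. A minimal synchronous model of the same queue (names are illustrative; the real waiters resolve promises rather than append to a log):

```typescript
// FIFO waiter queue: notifyOne() wakes exactly one waiter, oldest first.
const waiters: Array<() => void> = [];
const log: string[] = [];

function enqueue(name: string): void {
  waiters.push(() => log.push(name));
}

function notifyOne(): void {
  const waiter = waiters.shift(); // FIFO: the oldest waiter wins the slot
  if (waiter) waiter();
}

enqueue('a');
enqueue('b');
notifyOne(); // wakes only 'a'; 'b' stays queued for the next freed slot
```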
// ---------------------------------------------------------------------------
// SDK subprocess spawn
// ---------------------------------------------------------------------------
export interface SpawnedSdkProcess {
stdin: NonNullable<ChildProcess['stdin']>;
stdout: NonNullable<ChildProcess['stdout']>;
stderr: NonNullable<ChildProcess['stderr']>;
readonly killed: boolean;
readonly exitCode: number | null;
kill: ChildProcess['kill'];
on: ChildProcess['on'];
once: ChildProcess['once'];
off: ChildProcess['off'];
}
export interface SpawnSdkOptions {
command: string;
args: string[];
cwd?: string;
env?: NodeJS.ProcessEnv;
signal?: AbortSignal;
}
/**
* Spawn a Claude SDK subprocess in its own POSIX process group.
*
* The spawn uses `detached: true` so the child becomes the leader of a new
* process group (setpgid). The leader's PID equals its pgid on Unix, so we
* store `child.pid` as both pid and pgid on the managed process record.
* Shutdown then signals the group via `process.kill(-pgid, signal)`, tearing
* down the SDK child AND every descendant in one syscall (Principle 5).
*
* Windows caveat: `detached: true` does not create a POSIX group. The
* recorded pgid is still the child PID so Windows teardown at least kills
* the direct child; full subtree teardown on Windows requires Job Objects
* or `taskkill /T /F` (see shutdown.ts).
*
* Node's child_process.spawn is used intentionally: Bun.spawn does NOT
* support `detached: true` (see PATHFINDER-2026-04-22/_reference.md Part 2
* row 3), and this module must work under Bun as well as Node.
*/
export function spawnSdkProcess(
sessionDbId: number,
options: SpawnSdkOptions
): { process: SpawnedSdkProcess; pid: number; pgid: number } | null {
const registry = getProcessRegistry();
// On Windows, use cmd.exe wrapper for .cmd files to properly handle paths with spaces.
const useCmdWrapper = process.platform === 'win32' && options.command.endsWith('.cmd');
const env = sanitizeEnv(options.env ?? process.env);
// Filter empty string args AND their preceding flag (Issue #2049).
// The Agent SDK emits ["--setting-sources", ""] when settingSources defaults to [].
// Simply dropping "" leaves an orphan --setting-sources that consumes the next
// flag as its value, crashing Claude Code 2.1.109+ with
// "Invalid setting source: --permission-mode". Drop the flag too so the SDK
// default (no setting sources) is preserved by omission.
const filteredArgs: string[] = [];
for (const arg of options.args) {
if (arg === '') {
if (filteredArgs.length > 0 && filteredArgs[filteredArgs.length - 1].startsWith('--')) {
filteredArgs.pop();
}
continue;
}
filteredArgs.push(arg);
}
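The flag-aware filter above is worth pinning down with concrete inputs, since dropping only the empty string would leave an orphan flag behind. A standalone version of the same loop (the function name is illustrative):

```typescript
// Drop empty-string args AND the flag immediately before them, so
// ["--setting-sources", ""] disappears entirely instead of leaving an
// orphan --setting-sources that swallows the next flag as its value.
function filterEmptyFlagArgs(args: string[]): string[] {
  const out: string[] = [];
  for (const arg of args) {
    if (arg === '') {
      if (out.length > 0 && out[out.length - 1].startsWith('--')) {
        out.pop(); // remove the flag the empty value belonged to
      }
      continue;
    }
    out.push(arg);
  }
  return out;
}
```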
// Unix: detached:true causes the kernel to setpgid() on the child so the
// child becomes leader of a new process group whose pgid equals its pid.
// Windows: detached:true decouples the child from the parent console; there
// is no POSIX group, but the flag is still safe to pass.
//
// stdin must be 'pipe' (not 'ignore') because SpawnedSdkProcess.stdin is
// typed NonNullable<...> and the Claude Agent SDK consumes that pipe to
// stream prompts in. With 'ignore', child.stdin would be null and the
// null-check below (line ~737) would tear the child down immediately.
const child = useCmdWrapper
? spawn('cmd.exe', ['/d', '/c', options.command, ...filteredArgs], {
cwd: options.cwd,
env,
detached: true,
stdio: ['pipe', 'pipe', 'pipe'],
signal: options.signal,
windowsHide: true,
})
: spawn(options.command, filteredArgs, {
cwd: options.cwd,
env,
detached: true,
stdio: ['pipe', 'pipe', 'pipe'],
signal: options.signal,
windowsHide: true,
});
// ALWAYS attach an 'error' listener BEFORE any other code runs, regardless of
// whether the child has a PID. child_process.spawn emits 'error' asynchronously
// for ENOENT, EACCES, AbortSignal-driven aborts, etc. Without a listener these
// become uncaughtException — the cause of "The operation was aborted." escaping
// to the daemon during crash-recovery loops.
child.on('error', (err: Error) => {
logger.warn('SDK_SPAWN', `[session-${sessionDbId}] child emitted error event`, {
sessionDbId,
pid: child.pid,
errorName: err.name,
errorCode: (err as NodeJS.ErrnoException).code,
}, err);
});
if (!child.pid) {
logger.error('PROCESS', 'Spawn succeeded but produced no PID', { sessionDbId });
return null;
}
const pid = child.pid;
const pgid = pid; // On Unix with detached:true, pgid === pid. On Windows, this is an alias.
// Capture stderr for debugging spawn failures.
if (child.stderr) {
child.stderr.on('data', (data: Buffer) => {
logger.debug('SDK_SPAWN', `[session-${sessionDbId}] stderr: ${data.toString().trim()}`);
});
}
// Register the process in the supervisor registry with pgid recorded so
// the shutdown cascade can signal the whole group.
const recordId = `sdk:${sessionDbId}:${pid}`;
registry.register(recordId, {
pid,
type: 'sdk',
sessionId: sessionDbId,
startedAt: new Date().toISOString(),
pgid,
}, child);
// Auto-unregister on exit. child.on('exit') is the authoritative event-driven
// signal that a process has left — no polling, no sweeper needed (Principle 4).
child.on('exit', (code: number | null, signal: string | null) => {
if (code !== 0) {
logger.warn('SDK_SPAWN', `[session-${sessionDbId}] Claude process exited`, { code, signal, pid });
}
registry.unregister(recordId);
// Wake one pool-slot waiter since a slot just freed up.
notifySlotAvailable();
});
if (!child.stdin || !child.stdout || !child.stderr) {
logger.error('PROCESS', 'Spawned SDK child missing required stdio streams', {
sessionDbId,
pid,
hasStdin: Boolean(child.stdin),
hasStdout: Boolean(child.stdout),
hasStderr: Boolean(child.stderr),
});
try { child.kill('SIGKILL'); } catch { /* already dead */ }
return null;
}
const spawned: SpawnedSdkProcess = {
stdin: child.stdin,
stdout: child.stdout,
stderr: child.stderr,
get killed() { return child.killed; },
get exitCode() { return child.exitCode; },
kill: child.kill.bind(child),
on: child.on.bind(child),
once: child.once.bind(child),
off: child.off.bind(child),
};
return { process: spawned, pid, pgid };
}
/**
* SDK-compatible spawn factory.
*
* The Claude Agent SDK's `spawnClaudeCodeProcess` option calls our factory
* with its own spawn arguments; we forward them into `spawnSdkProcess` which
* creates the child in its own process group and records it in the supervisor
* registry. The returned shape is the minimal subset of ChildProcess that the
* SDK consumes: stdin/stdout/stderr pipes, killed/exitCode getters, and
* kill/on/once/off.
*
* Pre-spawn cleanup: if a previous process for this session is still alive
* (e.g. a crash-recovery attempt that collided with a still-running SDK),
* SIGTERM it. Multiple processes sharing the same --resume UUID waste API
* credits and can conflict with each other (Issue #1590).
*/
export function createSdkSpawnFactory(sessionDbId: number) {
return (spawnOptions: SpawnSdkOptions): SpawnedSdkProcess => {
const registry = getProcessRegistry();
// Kill any existing process for this session before spawning a new one.
const existing = registry.getBySession(sessionDbId).filter(r => r.type === 'sdk');
for (const record of existing) {
if (!isPidAlive(record.pid)) continue;
try {
if (typeof record.pgid === 'number') {
// Signal the whole group — kill the SDK child and any descendants.
if (process.platform !== 'win32') {
process.kill(-record.pgid, 'SIGTERM');
} else {
process.kill(record.pid, 'SIGTERM');
}
} else {
process.kill(record.pid, 'SIGTERM');
}
logger.warn('PROCESS', `Killing duplicate SDK process PID ${record.pid} before spawning new one for session ${sessionDbId}`, {
existingPid: record.pid,
sessionDbId,
});
} catch (error: unknown) {
const code = error instanceof Error ? (error as NodeJS.ErrnoException).code : undefined;
if (code !== 'ESRCH') {
if (error instanceof Error) {
logger.warn('PROCESS', `Failed to SIGTERM duplicate SDK process PID ${record.pid}`, { sessionDbId }, error);
} else {
logger.warn('PROCESS', `Failed to SIGTERM duplicate SDK process PID ${record.pid} (non-Error)`, {
sessionDbId, error: String(error),
});
}
}
}
}
const result = spawnSdkProcess(sessionDbId, spawnOptions);
if (!result) {
// Match the legacy failure mode: the SDK needs a process-like object
// even on spawn failure; throwing here surfaces via exit code 2 to the
// hook layer (Principle 2 — fail-fast).
throw new Error(`Failed to spawn SDK subprocess for session ${sessionDbId}`);
}
return result.process;
};
}
@@ -34,16 +34,18 @@ export async function runShutdownCascade(options: ShutdownCascadeOptions): Promi
}
try {
await signalProcess(record.pid, 'SIGTERM');
await signalProcess(record, 'SIGTERM');
} catch (error: unknown) {
if (error instanceof Error) {
logger.debug('SYSTEM', 'Failed to send SIGTERM to child process', {
pid: record.pid,
pgid: record.pgid,
type: record.type
}, error);
} else {
logger.warn('SYSTEM', 'Failed to send SIGTERM to child process (non-Error)', {
pid: record.pid,
pgid: record.pgid,
type: record.type,
error: String(error)
});
@@ -56,16 +58,18 @@ export async function runShutdownCascade(options: ShutdownCascadeOptions): Promi
const survivors = childRecords.filter(record => isPidAlive(record.pid));
for (const record of survivors) {
try {
await signalProcess(record.pid, 'SIGKILL');
await signalProcess(record, 'SIGKILL');
} catch (error: unknown) {
if (error instanceof Error) {
logger.debug('SYSTEM', 'Failed to force kill child process', {
pid: record.pid,
pgid: record.pgid,
type: record.type
}, error);
} else {
logger.warn('SYSTEM', 'Failed to force kill child process (non-Error)', {
pid: record.pid,
pgid: record.pgid,
type: record.type,
error: String(error)
});
@@ -110,7 +114,38 @@ async function waitForExit(records: ManagedProcessRecord[], timeoutMs: number):
}
}
async function signalProcess(pid: number, signal: 'SIGTERM' | 'SIGKILL'): Promise<void> {
async function signalProcess(record: ManagedProcessRecord, signal: 'SIGTERM' | 'SIGKILL'): Promise<void> {
const { pid, pgid } = record;
// Unix path: when the record carries a pgid (set when the child was spawned
// with detached:true so it became its own group leader), signal the negative
// PID to tear down the whole process group in one syscall — the SDK child
// AND every descendant it spawned. This replaces hand-rolled orphan sweeps
// (Principle 5: OS-supervised process groups over hand-rolled reapers).
//
// Falls back to single-PID kill when pgid is absent (the worker itself,
// MCP stdio clients, anything not spawned with detached:true).
if (process.platform !== 'win32') {
try {
if (typeof pgid === 'number') {
process.kill(-pgid, signal);
} else {
process.kill(pid, signal);
}
} catch (error: unknown) {
if (error instanceof Error) {
const errno = (error as NodeJS.ErrnoException).code;
if (errno === 'ESRCH') {
return;
}
}
throw error;
}
return;
}
// Windows: no POSIX process groups. SIGTERM uses single-PID kill; SIGKILL
// uses tree-kill or taskkill /T to walk the descendant tree.
if (signal === 'SIGTERM') {
try {
process.kill(pid, signal);
@@ -126,50 +161,35 @@ async function signalProcess(pid: number, signal: 'SIGTERM' | 'SIGKILL'): Promis
return;
}
if (process.platform === 'win32') {
const treeKill = await loadTreeKill();
if (treeKill) {
await new Promise<void>((resolve, reject) => {
treeKill(pid, signal, (error) => {
if (!error) {
resolve();
return;
}
const treeKill = await loadTreeKill();
if (treeKill) {
await new Promise<void>((resolve, reject) => {
treeKill(pid, signal, (error) => {
if (!error) {
resolve();
return;
}
const errno = (error as NodeJS.ErrnoException).code;
if (errno === 'ESRCH') {
resolve();
return;
}
reject(error);
});
const errno = (error as NodeJS.ErrnoException).code;
if (errno === 'ESRCH') {
resolve();
return;
}
reject(error);
});
return;
}
const args = ['/PID', String(pid), '/T'];
if (signal === 'SIGKILL') {
args.push('/F');
}
await execFileAsync('taskkill', args, {
timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND,
windowsHide: true
});
return;
}
try {
process.kill(pid, signal);
} catch (error: unknown) {
if (error instanceof Error) {
const errno = (error as NodeJS.ErrnoException).code;
if (errno === 'ESRCH') {
return;
}
}
throw error;
const args = ['/PID', String(pid), '/T'];
if (signal === 'SIGKILL') {
args.push('/F');
}
await execFileAsync('taskkill', args, {
timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND,
windowsHide: true
});
}
async function loadTreeKill(): Promise<TreeKillFn | null> {
@@ -15,7 +15,7 @@ type DataItem = Observation | Summary | UserPrompt;
/**
* Generic pagination hook for observations, summaries, and prompts
*/
function usePaginationFor(endpoint: string, dataType: DataType, currentFilter: string, currentSource: string) {
function usePaginationFor<TItem extends DataItem>(endpoint: string, dataType: DataType, currentFilter: string, currentSource: string) {
const [state, setState] = useState<PaginationState>({
isLoading: false,
hasMore: true
@@ -30,7 +30,7 @@ function usePaginationFor(endpoint: string, dataType: DataType, currentFilter: s
* Load more items from the API
* Automatically resets offset to 0 if filter has changed
*/
const loadMore = useCallback(async (): Promise<DataItem[]> => {
const loadMore = useCallback(async (): Promise<TItem[]> => {
// Check if filter changed - if so, reset pagination synchronously
const selectionKey = `${currentSource}::${currentFilter}`;
const filterChanged = lastSelectionRef.current !== selectionKey;
@@ -75,7 +75,7 @@ function usePaginationFor(endpoint: string, dataType: DataType, currentFilter: s
throw new Error(`Failed to load ${dataType}: ${response.statusText}`);
}
const data = await response.json() as { items: DataItem[], hasMore: boolean };
const data = await response.json() as { items: TItem[], hasMore: boolean };
const nextState = {
...stateRef.current,
@@ -106,9 +106,9 @@ function usePaginationFor(endpoint: string, dataType: DataType, currentFilter: s
* Hook for paginating observations
*/
export function usePagination(currentFilter: string, currentSource: string) {
const observations = usePaginationFor(API_ENDPOINTS.OBSERVATIONS, 'observations', currentFilter, currentSource);
const summaries = usePaginationFor(API_ENDPOINTS.SUMMARIES, 'summaries', currentFilter, currentSource);
const prompts = usePaginationFor(API_ENDPOINTS.PROMPTS, 'prompts', currentFilter, currentSource);
const observations = usePaginationFor<Observation>(API_ENDPOINTS.OBSERVATIONS, 'observations', currentFilter, currentSource);
const summaries = usePaginationFor<Summary>(API_ENDPOINTS.SUMMARIES, 'summaries', currentFilter, currentSource);
const prompts = usePaginationFor<UserPrompt>(API_ENDPOINTS.PROMPTS, 'prompts', currentFilter, currentSource);
return {
observations,
@@ -0,0 +1,9 @@
{
"extends": "../../../tsconfig.json",
"compilerOptions": {
"lib": ["ES2022", "DOM", "DOM.Iterable"],
"rootDir": "."
},
"include": ["./**/*"],
"exclude": []
}
@@ -1,80 +0,0 @@
/**
/**
 * Bun Path Utility
 *
 * Resolves the Bun executable path for environments where Bun is not in PATH
 * (e.g., fish shell users where ~/.config/fish/config.fish isn't read by /bin/sh)
 */
import { spawnSync } from 'child_process';
import { existsSync } from 'fs';
import { join } from 'path';
import { homedir } from 'os';
import { logger } from './logger.js';

/**
 * Get the Bun executable path
 * Tries PATH first, then checks common installation locations
 * Returns absolute path if found, null otherwise
 */
export function getBunPath(): string | null {
  const isWindows = process.platform === 'win32';

  // Try PATH first
  try {
    const result = spawnSync('bun', ['--version'], {
      encoding: 'utf-8',
      stdio: ['pipe', 'pipe', 'pipe'],
      shell: false // SECURITY: No need for shell, bun is the executable
    });
    if (result.status === 0) {
      return 'bun'; // Available in PATH
    }
  } catch (e) {
    logger.debug('SYSTEM', 'Bun not found in PATH, checking common installation locations', {
      error: e instanceof Error ? e.message : String(e)
    });
  }

  // Check common installation paths
  const bunPaths = isWindows
    ? [join(homedir(), '.bun', 'bin', 'bun.exe')]
    : [
        join(homedir(), '.bun', 'bin', 'bun'),
        '/usr/local/bin/bun',
        '/opt/homebrew/bin/bun', // Apple Silicon Homebrew
        '/home/linuxbrew/.linuxbrew/bin/bun' // Linux Homebrew
      ];

  for (const bunPath of bunPaths) {
    if (existsSync(bunPath)) {
      return bunPath;
    }
  }

  return null;
}

/**
 * Get the Bun executable path or throw an error
 * Use this when Bun is required for operation
 */
export function getBunPathOrThrow(): string {
  const bunPath = getBunPath();
  if (!bunPath) {
    const isWindows = process.platform === 'win32';
    const installCmd = isWindows
      ? 'powershell -c "irm bun.sh/install.ps1 | iex"'
      : 'curl -fsSL https://bun.sh/install | bash';
    throw new Error(
      `Bun is required but not found. Install it with:\n  ${installCmd}\nThen restart your terminal.`
    );
  }
  return bunPath;
}

/**
 * Check if Bun is available (in PATH or common locations)
 */
export function isBunAvailable(): boolean {
  return getBunPath() !== null;
}
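The fallback lookup order above can be made explicit as a pure function. A minimal, hypothetical sketch: `candidateBunPaths` is not part of the module, and paths are built with template strings rather than `path.join` purely for brevity.

```typescript
// Hypothetical helper mirroring getBunPath's fallback order:
// ~/.bun first, then system-wide and Homebrew locations on POSIX.
function candidateBunPaths(platform: string, home: string): string[] {
  if (platform === 'win32') {
    return [`${home}/.bun/bin/bun.exe`];
  }
  return [
    `${home}/.bun/bin/bun`,
    '/usr/local/bin/bun',
    '/opt/homebrew/bin/bun',              // Apple Silicon Homebrew
    '/home/linuxbrew/.linuxbrew/bin/bun'  // Linux Homebrew
  ];
}

console.log(candidateBunPaths('linux', '/home/u')[0]); // "/home/u/.bun/bin/bun"
```

Keeping the candidate list as data (rather than branching inline) is what lets the real module try PATH first and only then walk the list.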
+38 -3
@@ -15,12 +15,47 @@ export enum LogLevel {
   SILENT = 4
 }

-export type Component = 'HOOK' | 'WORKER' | 'SDK' | 'PARSER' | 'DB' | 'SYSTEM' | 'HTTP' | 'SESSION' | 'CHROMA' | 'CHROMA_MCP' | 'CHROMA_SYNC' | 'FOLDER_INDEX' | 'CLAUDE_MD' | 'QUEUE' | 'TELEGRAM';
+export type Component =
+  | 'AGENTS_MD'
+  | 'BRANCH'
+  | 'CHROMA'
+  | 'CHROMA_MCP'
+  | 'CHROMA_SYNC'
+  | 'CLAUDE_MD'
+  | 'CONFIG'
+  | 'CONSOLE'
+  | 'CURSOR'
+  | 'DB'
+  | 'DEDUP'
+  | 'ENV'
+  | 'FOLDER_INDEX'
+  | 'HOOK'
+  | 'HTTP'
+  | 'IMPORT'
+  | 'INGEST'
+  | 'OPENCLAW'
+  | 'OPENCODE'
+  | 'PARSER'
+  | 'PROCESS'
+  | 'PROJECT_NAME'
+  | 'QUEUE'
+  | 'SDK'
+  | 'SDK_SPAWN'
+  | 'SEARCH'
+  | 'SECURITY'
+  | 'SESSION'
+  | 'SETTINGS'
+  | 'SHUTDOWN'
+  | 'SYSTEM'
+  | 'TELEGRAM'
+  | 'TRANSCRIPT'
+  | 'WINDSURF'
+  | 'WORKER';

 interface LogContext {
-  sessionId?: number;
+  sessionId?: string | number;
   memorySessionId?: string;
-  correlationId?: string;
+  correlationId?: string | number;
   [key: string]: any;
 }
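The widened id fields accept both numeric rowids and UUID-style strings. A small illustrative sketch; the `describeSession` helper below is hypothetical and exists only to show why the union matters for log formatting.

```typescript
interface LogContext {
  sessionId?: string | number;
  memorySessionId?: string;
  correlationId?: string | number;
  [key: string]: any;
}

// Hypothetical: normalize either id shape to a display string for log lines.
function describeSession(ctx: LogContext): string {
  return ctx.sessionId === undefined ? 'unknown' : String(ctx.sessionId);
}

console.log(describeSession({ sessionId: 42 }));        // "42"
console.log(describeSession({ sessionId: 'a1b2-c3' })); // "a1b2-c3"
```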
+60 -45
@@ -10,82 +10,97 @@
  * (should not be persisted to memory)
  * 4. <system-reminder> - Claude Code-injected system reminders
  * (CLAUDE.md contents, deferred tool lists, etc. should not be persisted)
  * 5. <persisted-output> - Persisted-output payload tag
  *
  * EDGE PROCESSING PATTERN: Filter at hook layer before sending to worker/storage.
  * This keeps the worker service simple and follows one-way data stream.
+ *
+ * PATHFINDER plan 03 phase 8: collapsed countTags + stripTagsInternal into a
+ * single alternation regex. One pass over the input. One helper, N callers
+ * (`stripMemoryTagsFromJson` / `stripMemoryTagsFromPrompt` are thin adapters).
  */
 import { logger } from './logger.js';

+/** All tag names this module strips. Single source of truth for the regex. */
+const TAG_NAMES = [
+  'private',
+  'claude-mem-context',
+  'system_instruction',
+  'system-instruction',
+  'persisted-output',
+  'system-reminder',
+] as const;
+
+type TagName = (typeof TAG_NAMES)[number];
+
+/**
+ * Single-pass alternation regex covering every privacy / context tag.
+ * Backreference `\1` ensures a closing tag matches the opening name; tag
+ * attributes (e.g. `<system-reminder data-foo="…">`) are tolerated via
+ * `[^>]*`.
+ */
+const STRIP_REGEX = new RegExp(
+  `<(${TAG_NAMES.join('|')})\\b[^>]*>[\\s\\S]*?</\\1>`,
+  'g'
+);

 /**
  * Regex to match <system-reminder> tags and their content.
  * Exported for use by transcript parsers that strip system-reminder at read-time.
+ *
+ * Kept as a separate single-tag regex because the active transcript parser
+ * (`src/shared/transcript-parser.ts`) consumes only this one tag and would
+ * otherwise need to re-import the multi-tag list.
  */
 export const SYSTEM_REMINDER_REGEX = /<system-reminder>[\s\S]*?<\/system-reminder>/g;

-/**
- * Maximum number of tags allowed in a single content block
- * This protects against ReDoS (Regular Expression Denial of Service) attacks
- * where malicious input with many nested/unclosed tags could cause catastrophic backtracking
- */
+/** Maximum total stripped-tag count before we log a ReDoS-class anomaly. */
 const MAX_TAG_COUNT = 100;

 /**
- * Count total number of opening tags in content
- * Used for ReDoS protection before regex processing
+ * Strip every recognised tag from `input` in a single pass.
+ *
+ * @returns the stripped string (trimmed) and per-tag counts. Counts are
+ *          surfaced to logs for observability but are not used as a control
+ *          signal.
  */
-function countTags(content: string): number {
-  const privateCount = (content.match(/<private>/g) || []).length;
-  const contextCount = (content.match(/<claude-mem-context>/g) || []).length;
-  const systemInstructionCount = (content.match(/<system_instruction>/g) || []).length;
-  const systemInstructionHyphenCount = (content.match(/<system-instruction>/g) || []).length;
-  const persistedOutputCount = (content.match(/<persisted-output>/g) || []).length;
-  const systemReminderCount = (content.match(/<system-reminder>/g) || []).length;
-  return privateCount + contextCount + systemInstructionCount + systemInstructionHyphenCount + persistedOutputCount + systemReminderCount;
-}
+export function stripTags(input: string): { stripped: string; counts: Record<TagName, number> } {
+  const counts: Record<TagName, number> = Object.fromEntries(
+    TAG_NAMES.map(name => [name, 0])
+  ) as Record<TagName, number>;

-/**
- * Internal function to strip memory tags from content
- * Shared logic extracted from both JSON and prompt stripping functions
- */
-function stripTagsInternal(content: string): string {
-  // ReDoS protection: limit tag count before regex processing
-  const tagCount = countTags(content);
-  if (tagCount > MAX_TAG_COUNT) {
+  STRIP_REGEX.lastIndex = 0; // /g state is per-instance — reset before each call.
+  let total = 0;
+  const stripped = input.replace(STRIP_REGEX, (_, name: TagName) => {
+    counts[name] = (counts[name] ?? 0) + 1;
+    total += 1;
+    return '';
+  });
+
+  if (total > MAX_TAG_COUNT) {
     logger.warn('SYSTEM', 'tag count exceeds limit', undefined, {
-      tagCount,
+      tagCount: total,
       maxAllowed: MAX_TAG_COUNT,
-      contentLength: content.length
+      contentLength: input.length,
     });
     // Still process but log the anomaly
   }
-  return content
-    .replace(/<claude-mem-context>[\s\S]*?<\/claude-mem-context>/g, '')
-    .replace(/<private>[\s\S]*?<\/private>/g, '')
-    .replace(/<system_instruction>[\s\S]*?<\/system_instruction>/g, '')
-    .replace(/<system-instruction>[\s\S]*?<\/system-instruction>/g, '')
-    .replace(/<persisted-output>[\s\S]*?<\/persisted-output>/g, '')
-    .replace(SYSTEM_REMINDER_REGEX, '')
-    .trim();
+
+  return { stripped: stripped.trim(), counts };
 }

 /**
- * Strip memory tags from JSON-serialized content (tool inputs/responses)
- *
- * @param content - Stringified JSON content from tool_input or tool_response
- * @returns Cleaned content with tags removed, or '{}' if invalid
+ * Strip memory tags from JSON-serialized content (tool inputs/responses).
+ * Thin adapter around `stripTags`: same regex, same single pass.
  */
 export function stripMemoryTagsFromJson(content: string): string {
-  return stripTagsInternal(content);
+  return stripTags(content).stripped;
 }

 /**
- * Strip memory tags from user prompt content
- *
- * @param content - Raw user prompt text
- * @returns Cleaned content with tags removed
+ * Strip memory tags from user prompt content.
+ * Thin adapter around `stripTags`: same regex, same single pass.
  */
 export function stripMemoryTagsFromPrompt(content: string): string {
-  return stripTagsInternal(content);
+  return stripTags(content).stripped;
 }
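The backreference trick in the alternation regex can be exercised in isolation. A minimal sketch with the same pattern shape, reduced to two tag names; `strip` here is a hypothetical stand-in for the real helper, not the module's export.

```typescript
const TAGS = ['private', 'system-reminder'] as const;

// Same shape as STRIP_REGEX: alternation of tag names, attribute-tolerant
// opener ([^>]*), lazy body, and a \1 backreference so the closing tag's
// name must match the opening tag's name.
const STRIP = new RegExp(`<(${TAGS.join('|')})\\b[^>]*>[\\s\\S]*?</\\1>`, 'g');

function strip(input: string): string {
  STRIP.lastIndex = 0; // /g regexes carry per-instance state
  return input.replace(STRIP, '').trim();
}

console.log(strip('a <private>x</private>b'));                           // "a b"
console.log(strip('<system-reminder data-x="1">y</system-reminder>ok')); // "ok"
// Mismatched open/close names fail the backreference, so nothing is stripped:
console.log(strip('<private>x</system-reminder>'));
```

The third case is the point of `\1`: a naive `</(?:private|system-reminder)>` closer would happily pair `<private>` with `</system-reminder>` and over-strip.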
-266
@@ -1,266 +0,0 @@
/**
* TranscriptParser - Properly parse Claude Code transcript JSONL files
* Handles all transcript entry types based on validated model
*/
import { readFileSync } from 'fs';
import { logger } from './logger.js';
import { SYSTEM_REMINDER_REGEX } from './tag-stripping.js';
import type {
TranscriptEntry,
UserTranscriptEntry,
AssistantTranscriptEntry,
SummaryTranscriptEntry,
SystemTranscriptEntry,
QueueOperationTranscriptEntry,
ContentItem,
TextContent,
} from '../types/transcript.js';
export interface ParseStats {
totalLines: number;
parsedEntries: number;
failedLines: number;
entriesByType: Record<string, number>;
failureRate: number;
}
export class TranscriptParser {
private entries: TranscriptEntry[] = [];
private parseErrors: Array<{ lineNumber: number; error: string }> = [];
constructor(transcriptPath: string) {
this.parseTranscript(transcriptPath);
}
private parseTranscript(transcriptPath: string): void {
const content = readFileSync(transcriptPath, 'utf-8').trim();
if (!content) return;
const lines = content.split('\n');
lines.forEach((line, index) => {
try {
const entry = JSON.parse(line) as TranscriptEntry;
this.entries.push(entry);
} catch (error) {
logger.debug('PARSER', 'Failed to parse transcript line', { lineNumber: index + 1 }, error as Error);
this.parseErrors.push({
lineNumber: index + 1,
error: error instanceof Error ? error.message : String(error),
});
}
});
// Log summary if there were parse errors
if (this.parseErrors.length > 0) {
logger.error('PARSER', `Failed to parse ${this.parseErrors.length} lines`, {
path: transcriptPath,
totalLines: lines.length,
errorCount: this.parseErrors.length
});
}
}
/**
* Get all entries of a specific type
*/
getEntriesByType<T extends TranscriptEntry>(type: T['type']): T[] {
return this.entries.filter((e) => e.type === type) as T[];
}
/**
* Get all user entries
*/
getUserEntries(): UserTranscriptEntry[] {
return this.getEntriesByType<UserTranscriptEntry>('user');
}
/**
* Get all assistant entries
*/
getAssistantEntries(): AssistantTranscriptEntry[] {
return this.getEntriesByType<AssistantTranscriptEntry>('assistant');
}
/**
* Get all summary entries
*/
getSummaryEntries(): SummaryTranscriptEntry[] {
return this.getEntriesByType<SummaryTranscriptEntry>('summary');
}
/**
* Get all system entries
*/
getSystemEntries(): SystemTranscriptEntry[] {
return this.getEntriesByType<SystemTranscriptEntry>('system');
}
/**
* Get all queue operation entries
*/
getQueueOperationEntries(): QueueOperationTranscriptEntry[] {
return this.getEntriesByType<QueueOperationTranscriptEntry>('queue-operation');
}
/**
* Get last entry of a specific type
*/
getLastEntryByType<T extends TranscriptEntry>(type: T['type']): T | null {
const entries = this.getEntriesByType<T>(type);
return entries.length > 0 ? entries[entries.length - 1] : null;
}
/**
* Extract text content from content items
*/
private extractTextFromContent(content: string | ContentItem[]): string {
if (typeof content === 'string') {
return content;
}
if (Array.isArray(content)) {
return content
.filter((item): item is TextContent => item.type === 'text')
.map((item) => item.text)
.join('\n');
}
return '';
}
/**
* Get last user message text (finds last entry with actual text content)
*/
getLastUserMessage(): string {
const userEntries = this.getUserEntries();
// Iterate backward to find the last user message with text content
for (let i = userEntries.length - 1; i >= 0; i--) {
const entry = userEntries[i];
if (!entry?.message?.content) continue;
const text = this.extractTextFromContent(entry.message.content);
if (text) return text;
}
return '';
}
/**
* Get last assistant message text (finds last entry with text content, with optional system-reminder filtering)
*/
getLastAssistantMessage(filterSystemReminders = true): string {
const assistantEntries = this.getAssistantEntries();
// Iterate backward to find the last assistant message with text content
for (let i = assistantEntries.length - 1; i >= 0; i--) {
const entry = assistantEntries[i];
if (!entry?.message?.content) continue;
let text = this.extractTextFromContent(entry.message.content);
if (!text) continue;
if (filterSystemReminders) {
// Filter out system-reminder tags and their content
text = text.replace(SYSTEM_REMINDER_REGEX, '');
// Clean up excessive whitespace
text = text.replace(/\n{3,}/g, '\n\n').trim();
}
if (text) return text;
}
return '';
}
/**
* Get all tool use operations from assistant entries
*/
getToolUseHistory(): Array<{ name: string; timestamp: string; input: any }> {
const toolUses: Array<{ name: string; timestamp: string; input: any }> = [];
for (const entry of this.getAssistantEntries()) {
if (Array.isArray(entry.message.content)) {
for (const item of entry.message.content) {
if (item.type === 'tool_use') {
toolUses.push({
name: item.name,
timestamp: entry.timestamp,
input: item.input,
});
}
}
}
}
return toolUses;
}
/**
* Get total token usage across all assistant messages
*/
getTotalTokenUsage(): {
inputTokens: number;
outputTokens: number;
cacheCreationTokens: number;
cacheReadTokens: number;
} {
const assistantEntries = this.getAssistantEntries();
return assistantEntries.reduce(
(acc, entry) => {
const usage = entry.message.usage;
if (usage) {
acc.inputTokens += usage.input_tokens || 0;
acc.outputTokens += usage.output_tokens || 0;
acc.cacheCreationTokens += usage.cache_creation_input_tokens || 0;
acc.cacheReadTokens += usage.cache_read_input_tokens || 0;
}
return acc;
},
{
inputTokens: 0,
outputTokens: 0,
cacheCreationTokens: 0,
cacheReadTokens: 0,
}
);
}
/**
* Get parse statistics
*/
getParseStats(): ParseStats {
const entriesByType: Record<string, number> = {};
for (const entry of this.entries) {
entriesByType[entry.type] = (entriesByType[entry.type] || 0) + 1;
}
const totalLines = this.entries.length + this.parseErrors.length;
return {
totalLines,
parsedEntries: this.entries.length,
failedLines: this.parseErrors.length,
entriesByType,
failureRate: totalLines > 0 ? this.parseErrors.length / totalLines : 0,
};
}
/**
* Get parse errors
*/
getParseErrors(): Array<{ lineNumber: number; error: string }> {
return this.parseErrors;
}
/**
* Get all entries (raw)
*/
getAllEntries(): TranscriptEntry[] {
return this.entries;
}
}
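The core of the deleted parser's `parseTranscript` loop is easy to sketch standalone: parse each JSONL line, count failures instead of aborting, and derive the same `failureRate` that `getParseStats` reported. A simplified, hypothetical reconstruction (entries are kept untyped; logging is omitted):

```typescript
// Minimal JSONL parse with failure accounting, in the spirit of the
// removed TranscriptParser. Malformed lines are tolerated, not fatal.
function parseJsonl(content: string): {
  entries: unknown[];
  failedLines: number;
  failureRate: number;
} {
  const trimmed = content.trim();
  const lines = trimmed ? trimmed.split('\n') : [];
  const entries: unknown[] = [];
  let failedLines = 0;
  for (const line of lines) {
    try {
      entries.push(JSON.parse(line)); // one JSON object per line
    } catch {
      failedLines += 1; // counted for stats, processing continues
    }
  }
  return {
    entries,
    failedLines,
    failureRate: lines.length > 0 ? failedLines / lines.length : 0,
  };
}

const stats = parseJsonl('{"type":"user"}\nnot json\n{"type":"assistant"}');
console.log(stats.failedLines, stats.entries.length); // 1 2
```

Tolerating bad lines matters for live transcripts: the final line of a JSONL file being appended to is often truncated mid-write, and one torn line should not discard the rest of the session.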