perf: streamline worker startup and consolidate database connections (#2122)

* docs: pathfinder refactor corpus + Node 20 preflight

Adds the PATHFINDER-2026-04-22 principle-driven refactor plan (11 docs,
cross-checked PASS) plus the exploratory PATHFINDER-2026-04-21 corpus
that motivated it. Bumps engines.node to >=20.0.0 per the ingestion-path
plan preflight (recursive fs.watch). Adds the pathfinder skill.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: land PATHFINDER Plan 01 — data integrity

Schema, UNIQUE constraints, self-healing claim, Chroma upsert fallback.

- Phase 1: fresh schema.sql regenerated at post-refactor shape.
- Phase 2: migrations 23+24 — rebuild pending_messages without
  started_processing_at_epoch; UNIQUE(session_id, tool_use_id);
  UNIQUE(memory_session_id, content_hash) on observations; dedup
  duplicate rows before adding indexes.
- Phase 3: claimNextMessage rewritten to self-healing query using
  worker_pid NOT IN live_worker_pids; STALE_PROCESSING_THRESHOLD_MS
  and the 60-s stale-reset block deleted.
- Phase 4: DEDUP_WINDOW_MS and findDuplicateObservation deleted;
  observations.insert now uses ON CONFLICT DO NOTHING.
- Phase 5: failed-message purge block deleted from worker-service
  2-min interval; clearFailedOlderThan method deleted.
- Phase 6: repairMalformedSchema and its Python subprocess repair
  path deleted from Database.ts; SQLite errors now propagate.
- Phase 7: Chroma delete-then-add fallback gated behind
  CHROMA_SYNC_FALLBACK_ON_CONFLICT env flag as bridge until
  Chroma MCP ships native upsert.
- Phase 8: migration 19 no-op block absorbed into fresh schema.sql.
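
The self-healing claim from Phase 3 can be pictured with an in-memory
stand-in for the pending_messages table. This is a minimal sketch, not
the actual SQL: the PendingRow shape, the claimNext name, and the
array-backed table are assumptions for illustration. Only worker_pid,
the row statuses, and the live-worker-PID check come from the notes
above.

```typescript
type Status = "pending" | "processing" | "failed";

interface PendingRow {
  id: number;
  status: Status;
  workerPid: number | null; // stands in for the worker_pid column
}

// Claim the oldest row that is either pending, or marked processing by a
// worker whose PID is no longer alive. No timestamp threshold is involved.
function claimNext(
  rows: PendingRow[],
  myPid: number,
  liveWorkerPids: ReadonlySet<number>,
): PendingRow | undefined {
  const claimable = rows
    .filter(
      (r) =>
        r.status === "pending" ||
        (r.status === "processing" &&
          (r.workerPid === null || !liveWorkerPids.has(r.workerPid))),
    )
    .sort((a, b) => a.id - b.id)[0];
  if (claimable) {
    claimable.status = "processing";
    claimable.workerPid = myPid;
  }
  return claimable;
}
```

Because a row held by a dead worker is claimable on the very next call,
no periodic stale-reset pass is needed in this model.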

Verification greps all return 0 matches. bun test tests/sqlite/
passes 63/63. bun run build succeeds.

Plan: PATHFINDER-2026-04-22/01-data-integrity.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: land PATHFINDER Plan 02 — process lifecycle

OS process groups replace hand-rolled reapers. Worker runs until
killed; orphans are prevented by detached spawn + kill(-pgid).

- Phase 1: src/services/worker/ProcessRegistry.ts DELETED. The
  canonical registry at src/supervisor/process-registry.ts is the
  sole survivor; SDK spawn site consolidated into it via new
  createSdkSpawnFactory/spawnSdkProcess/getSdkProcessForSession/
  ensureSdkProcessExit/waitForSlot helpers.
- Phase 2: SDK children spawn with detached:true + stdio:
  ['ignore','pipe','pipe']; pgid recorded on ManagedProcessInfo.
- Phase 3: shutdown.ts signalProcess teardown uses
  process.kill(-pgid, signal) on Unix when pgid is recorded;
  Windows path unchanged (tree-kill/taskkill).
- Phase 4: all reaper intervals deleted — startOrphanReaper call,
  staleSessionReaperInterval setInterval (including the co-located
  WAL checkpoint — SQLite's built-in wal_autocheckpoint handles
  WAL growth without an app-level timer), killIdleDaemonChildren,
  killSystemOrphans, reapOrphanedProcesses, reapStaleSessions, and
  detectStaleGenerator. MAX_GENERATOR_IDLE_MS and MAX_SESSION_IDLE_MS
  constants deleted.
- Phase 5: abandonedTimer — already 0 matches; primary-path cleanup
  via generatorPromise.finally() already lives in worker-service
  startSessionProcessor and SessionRoutes ensureGeneratorRunning.
- Phase 6: evictIdlestSession and its evict callback deleted from
  SessionManager. Pool admission gates backpressure upstream.
- Phase 7: SDK-failure fallback — SessionManager has zero matches
  for fallbackAgent/Gemini/OpenRouter. Failures surface to hooks
  via exit code 2 through SessionRoutes error mapping.
- Phase 8: ensureWorkerRunning in worker-utils.ts rewritten to
  lazy-spawn — consults isWorkerPortAlive (which gates
  captureProcessStartToken for PID-reuse safety via commit
  99060bac), then spawns detached with unref(), then
  waitForWorkerPort({ attempts: 3, backoffMs: 250 }) with hand-rolled
  exponential backoff (250→500→1000 ms). No respawn npm dep.
- Phase 9: idle self-shutdown — zero matches for
  idleCheck/idleTimeout/IDLE_MAX_MS/idleShutdown. Worker exits
  only on external SIGTERM via supervisor signal handlers.
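
The Phase 8 backoff schedule can be sketched as follows. The option
names attempts and backoffMs mirror the call quoted above; the
backoffDelays and waitForPort names, the injected probe and sleep
functions, and the final extra probe are assumptions for illustration,
not the actual worker-utils.ts code.

```typescript
interface WaitOptions {
  attempts: number;
  backoffMs: number;
}

// Delay after each failed attempt: backoffMs doubled per attempt.
function backoffDelays(options: WaitOptions): number[] {
  const delays: number[] = [];
  for (let i = 0; i < options.attempts; i++) {
    delays.push(options.backoffMs * Math.pow(2, i));
  }
  return delays;
}

// Poll `probe` until it reports true or the attempts are used up.
// `probe` is a hypothetical stand-in for a port liveness check.
async function waitForPort(
  probe: () => Promise<boolean>,
  options: WaitOptions,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  for (const delay of backoffDelays(options)) {
    if (await probe()) return true;
    await sleep(delay);
  }
  return probe(); // assumption: one last check after the final delay
}
```

With { attempts: 3, backoffMs: 250 } the worst case waits 1750 ms in
total before giving up.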

Removed three test files that exercised deleted code:
tests/worker/process-registry.test.ts,
tests/worker/session-lifecycle-guard.test.ts,
tests/services/worker/reap-stale-sessions.test.ts.
Pass count: 1451 → 1407 (-44), all attributable to deleted test
files. Zero new failures. 31 pre-existing failures remain
(schema-repair suite, logger-usage-standards, environmental
openclaw / plugin-distribution) — none introduced by Plan 02.

All 10 verification greps return 0. bun run build succeeds.

Plan: PATHFINDER-2026-04-22/02-process-lifecycle.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: land PATHFINDER Plan 04 (narrowed) — search fail-fast

Phases 3, 5, 6 only. Plan-doc inaccuracies for phases 1/2/4/7/8/9
deferred for plan reconciliation:
  - Phase 1/2: ObservationRow type doesn't exist; the four
    "formatters" operate on three incompatible types.
  - Phase 4: RECENCY_WINDOW_MS already imported from
    SEARCH_CONSTANTS at every call site.
  - Phase 7: getExistingChromaIds is NOT @deprecated and has an
    active caller in ChromaSync.backfillMissingSyncs.
  - Phase 8: estimateTokens already consolidated.
  - Phase 9: knowledge-corpus rewrite blocked on PG-3
    prompt-caching cost smoke test.

Phase 3 — Delete SearchManager.findByConcept/findByFile/findByType.
SearchRoutes handlers (handleSearchByConcept/File/Type) now call
searchManager.getOrchestrator().findByXxx() directly via new
getter accessors on SearchManager. ~250 LoC deleted.

Phase 5 — Fail-fast Chroma. Created
src/services/worker/search/errors.ts with ChromaUnavailableError
extends AppError(503, 'CHROMA_UNAVAILABLE'). Deleted
SearchOrchestrator.executeWithFallback's Chroma-failed
SQLite-fallback branch; runtime Chroma errors now throw 503.
"Path 3" (chromaSync was null at construction — explicit-
uninitialized config) preserved as legitimate empty-result state
per plan text. ChromaSearchStrategy.search no longer wraps in
try/catch — errors propagate.
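
A minimal sketch of the Phase 5 error shape, assuming AppError takes a
status code and an error code in that order (as the shorthand above
suggests). The statusCode and code field names, the default message,
and the toHttpResponse mapper are hypothetical.

```typescript
class AppError extends Error {
  readonly statusCode: number;
  readonly code: string;

  constructor(statusCode: number, code: string, message: string) {
    super(message);
    this.statusCode = statusCode;
    this.code = code;
    this.name = new.target.name;
    // Keep instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class ChromaUnavailableError extends AppError {
  constructor(message: string = "Chroma is unavailable") {
    super(503, "CHROMA_UNAVAILABLE", message);
  }
}

// Hypothetical mapper: known AppErrors keep their status, anything else is 500.
function toHttpResponse(err: unknown): { status: number; code: string } {
  if (err instanceof AppError) {
    return { status: err.statusCode, code: err.code };
  }
  return { status: 500, code: "INTERNAL_ERROR" };
}
```

With no try/catch in the search strategy, a thrown
ChromaUnavailableError reaches a mapper like this one unchanged and
surfaces as a 503 rather than a silent SQLite result.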

Phase 6 — Delete HybridSearchStrategy three try/catch silent
fallback blocks (findByConcept, findByType, findByFile) at lines
~82-95, ~120-132, ~161-172. Removed `fellBack` field from
StrategySearchResult type and every return site
(SQLiteSearchStrategy, BaseSearchStrategy.emptyResult,
SearchOrchestrator).

Tests updated (Principle 7 — delete in same PR):
  - search-orchestrator.test.ts: "fall back to SQLite" rewritten
    as "throw ChromaUnavailableError (HTTP 503)".
  - chroma/hybrid/sqlite-search-strategy tests: rewritten to
    rejects.toThrow; removed fellBack assertions.

Verification: SearchManager.findBy → 0; fellBack → 0 in src/.
bun test tests/worker/search/ → 122 pass, 0 fail.
bun test (suite-wide) → 1407 pass, baseline maintained, 0 new
failures. bun run build succeeds.

Plan: PATHFINDER-2026-04-22/04-read-path.md (Phases 3, 5, 6)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: land PATHFINDER Plan 03 — ingestion path

Fail-fast parser, direct in-process ingest, recursive fs.watch,
DB-backed tool pairing. Worker-internal HTTP loopback eliminated.

- Phase 0: Created src/services/worker/http/shared.ts exporting
  ingestObservation/ingestPrompt/ingestSummary as direct
  in-process functions plus ingestEventBus (Node EventEmitter,
  reusing existing pattern — no third event bus introduced).
  setIngestContext wires the SessionManager dependency from
  worker-service constructor.
- Phase 1: src/sdk/parser.ts collapsed to one parseAgentXml
  returning { valid:true; kind: 'observation'|'summary'; data }
  | { valid:false; reason: string }. Inspects root element;
  <skip_summary reason="…"/> is a first-class summary case
  with skipped:true. NEVER returns undefined. NEVER coerces.
- Phase 2: ResponseProcessor calls parseAgentXml exactly once,
  branches on the discriminated union. On invalid → markFailed
  + logger.warn(reason). On observation → ingestObservation.
  On summary → ingestSummary then emit summaryStoredEvent
  { sessionId, messageId } (consumed by Plan 05's blocking
  /api/session/end).
- Phase 3: Deleted consecutiveSummaryFailures field
  (ResponseProcessor + SessionManager + worker-types) and
  MAX_CONSECUTIVE_SUMMARY_FAILURES constant. Circuit-breaker
  guards and "tripped" log lines removed.
- Phase 4: coerceObservationToSummary deleted from sdk/parser.ts.
- Phase 5: src/services/transcripts/watcher.ts rescan setInterval
  replaced with fs.watch(transcriptsRoot, { recursive: true,
  persistent: true }) — Node 20+ recursive mode.
- Phase 6: src/services/transcripts/processor.ts pendingTools
  Map deleted. tool_use rows insert with INSERT OR IGNORE on
  UNIQUE(session_id, tool_use_id) (added by Plan 01). New
  pairToolUsesByJoin query in PendingMessageStore for read-time
  pairing (UNIQUE INDEX provides idempotency; explicit consumer
  not yet wired).
- Phase 7: HTTP loopback at processor.ts:252 replaced with
  direct ingestObservation call. maybeParseJson silent-passthrough
  rewritten to fail-fast (throws on malformed JSON).
- Phase 8: src/utils/tag-stripping.ts countTags + stripTagsInternal
  collapsed into one alternation regex, single-pass over input.
- Phase 9: src/utils/transcript-parser.ts (dead TranscriptParser
  class) deleted. The active extractLastMessage at
  src/shared/transcript-parser.ts:41-144 is the sole survivor.
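
The Phase 1 discriminated union can be sketched like this. The result
type follows the shape quoted above; the observation and summary root
element names, the body field, and the regex-based matching are
assumptions for illustration and are much cruder than a real XML parse.

```typescript
type ParseResult =
  | { valid: true; kind: "observation"; data: { body: string } }
  | { valid: true; kind: "summary"; data: { skipped: boolean; body?: string; reason?: string } }
  | { valid: false; reason: string };

// Inspects only the root element. Never returns undefined, never coerces
// one kind into the other.
function parseAgentXmlSketch(input: string): ParseResult {
  const text = input.trim();

  const skip = /^<skip_summary\s+reason="([^"]*)"\s*\/>$/.exec(text);
  if (skip) {
    return { valid: true, kind: "summary", data: { skipped: true, reason: skip[1] } };
  }

  const root = /^<(observation|summary)>([\s\S]*)<\/\1>$/.exec(text);
  if (!root) {
    return { valid: false, reason: "unrecognized or missing root element" };
  }
  if (root[1] === "observation") {
    return { valid: true, kind: "observation", data: { body: root[2].trim() } };
  }
  return { valid: true, kind: "summary", data: { skipped: false, body: root[2].trim() } };
}
```

A caller can then branch once on valid and kind, which is the single
call site Phase 2 describes.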

Tests updated (Principle 7 — same-PR delete):
  - tests/sdk/parser.test.ts + parse-summary.test.ts: rewritten
    to assert discriminated-union shape; coercion-specific
    scenarios collapse into { valid:false } assertions.
  - tests/worker/agents/response-processor.test.ts: circuit-breaker
    describe block skipped; non-XML/empty-response tests assert
    fail-fast markFailed behavior.

Verification: every grep returns 0. transcript-parser.ts deleted.
bun run build succeeds. bun test → 1399 pass / 28 fail / 7 skip
(net -8 pass = the 4 retired circuit-breaker tests + 4 collapsed
parser cases). Zero new failures vs baseline.

Deferred (out of Plan 03 scope, will land in Plan 06): SessionRoutes
HTTP route handlers still call sessionManager.queueObservation
inline rather than the new shared helpers — the helpers are ready,
the route swap is mechanical and belongs with the Zod refactor.

Plan: PATHFINDER-2026-04-22/03-ingestion-path.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: land PATHFINDER Plan 05 — hook surface

Worker-call plumbing collapsed to one helper. Polling replaced by
server-side blocking endpoint. Fail-loud counter surfaces persistent
worker outages via exit code 2.

- Phase 1: plugin/hooks/hooks.json — three 20-iteration `for i in
  1..20; do curl -sf .../health && break; sleep 0.1; done` shell
  retry wrappers deleted. Hook commands invoke their bun entry
  point directly.
- Phase 2: src/shared/worker-utils.ts — added
  executeWithWorkerFallback<T>(url, method, body) returning
  T | { continue: true; reason?: string }. All 8 hook handlers
  (observation, session-init, context, file-context, file-edit,
  summarize, session-complete, user-message) rewritten to use
  it instead of duplicating the ensureWorkerRunning →
  workerHttpRequest → fallback sequence.
- Phase 3: blocking POST /api/session/end in SessionRoutes.ts
  using validateBody + sessionEndSchema (z.object({sessionId})).
  One-shot ingestEventBus.on('summaryStoredEvent') listener,
  30 s timer, req.aborted handler — all share one cleanup so
  the listener cannot leak. summarize.ts polling loop, plus
  MAX_WAIT_FOR_SUMMARY_MS / POLL_INTERVAL_MS constants, deleted.
- Phase 4: src/shared/hook-settings.ts — loadFromFileOnce()
  memoizes SettingsDefaultsManager.loadFromFile per process.
  Per-handler settings reads collapsed.
- Phase 5: src/shared/should-track-project.ts — single exclusion
  check entry; isProjectExcluded no longer referenced from
  src/cli/handlers/.
- Phase 6: cwd validation pushed into adapter normalizeInput
  (all 6 adapters: claude-code, cursor, raw, gemini-cli,
  windsurf). New AdapterRejectedInput error in
  src/cli/adapters/errors.ts. Handler-level isValidCwd checks
  deleted from file-edit.ts and observation.ts. hook-command.ts
  catches AdapterRejectedInput → graceful fallback.
- Phase 7: session-init.ts conditional initAgent guard deleted;
  initAgent is idempotent. tests/hooks/context-reinjection-guard
  test (validated the deleted conditional) deleted in same PR
  per Principle 7.
- Phase 8: fail-loud counter at
  ~/.claude-mem/state/hook-failures.json. Atomic write via .tmp +
  rename. CLAUDE_MEM_HOOK_FAIL_LOUD_THRESHOLD setting (default 3).
  On consecutive worker-unreachable ≥ N: process.exit(2). On
  success: reset to 0. NOT a retry.
- Phase 9: ensureWorkerAliveOnce() module-scope memoization
  wrapping ensureWorkerRunning. executeWithWorkerFallback calls
  the memoized version.
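
The shared-cleanup pattern from Phase 3 can be sketched with a plain
Node EventEmitter. The event name and payload fields come from the
notes above; waitForSummaryStored, the WaitOutcome type, and the
onAbort registration callback are hypothetical names, and the real
handler works on an HTTP request rather than a bare promise.

```typescript
import { EventEmitter } from "node:events";

interface SummaryStored {
  sessionId: string;
  messageId: number;
}

type WaitOutcome =
  | { ok: true; event: SummaryStored }
  | { ok: false; reason: "timeout" | "aborted" };

function waitForSummaryStored(
  bus: EventEmitter,
  sessionId: string,
  timeoutMs: number,
  onAbort: (handler: () => void) => void,
): Promise<WaitOutcome> {
  return new Promise<WaitOutcome>((resolve) => {
    let settled = false;

    // One cleanup shared by all three exits, so the listener cannot leak.
    const finish = (outcome: WaitOutcome): void => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      bus.off("summaryStoredEvent", onStored);
      resolve(outcome);
    };

    const onStored = (event: SummaryStored): void => {
      if (event.sessionId === sessionId) finish({ ok: true, event });
    };

    const timer = setTimeout(() => finish({ ok: false, reason: "timeout" }), timeoutMs);
    bus.on("summaryStoredEvent", onStored);
    onAbort(() => finish({ ok: false, reason: "aborted" }));
  });
}
```

Whichever of the three exits fires first clears the timer and removes
the listener; later calls to finish are no-ops.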

Minimal validateBody middleware stub at
src/services/worker/http/middleware/validateBody.ts. Plan 06 will
expand with typed inference + error envelope conventions.

Verification: 4/4 grep targets pass. bun run build succeeds.
bun test → 1393 pass / 28 fail / 7 skip; -6 pass attributable
solely to deleted context-reinjection-guard test file. Zero new
failures vs baseline.

Plan: PATHFINDER-2026-04-22/05-hook-surface.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: land PATHFINDER Plan 06 — API surface

One Zod-based validator wrapping every POST/PUT. Rate limiter,
diagnostic endpoints, and shutdown wrappers deleted. Failure-
marking consolidated to one helper.

- Phase 1 (preflight): zod@^3 already installed.
- Phase 2: validateBody middleware confirmed at canonical shape
  in src/services/worker/http/middleware/validateBody.ts —
  safeParse → 400 { error: 'ValidationError', issues: [...] }
  on failure, replaces req.body with parsed value on success.
- Phase 3: Per-route Zod schemas declared at the top of each
  route file. 24 POST endpoints across SessionRoutes,
  CorpusRoutes, DataRoutes, MemoryRoutes, SearchRoutes,
  LogsRoutes, SettingsRoutes now wrap with validateBody().
  /api/session/end (Plan 05) confirmed using same middleware.
- Phase 4: validateRequired() deleted from BaseRouteHandler
  along with every call site. Inline coercion helpers
  (coerceStringArray, coercePositiveInteger) and inline
  if (!req.body...) guards deleted across all route files.
- Phase 5: Rate limiter middleware and its registration deleted
  from src/services/worker/http/middleware.ts. Worker binds
  127.0.0.1:37777 — no untrusted caller.
- Phase 6: viewer.html cached at module init in ViewerRoutes.ts
  via fs.readFileSync; served as Buffer with text/html content
  type. SKILL.md + per-operation .md files cached in
  Server.ts as Map<string, string>; loadInstructionContent
  helper deleted. NO fs.watch, NO TTL — process restart is the
  cache-invalidation event.
- Phase 7: Four diagnostic endpoints deleted from DataRoutes.ts
  — /api/pending-queue (GET), /api/pending-queue/process (POST),
  /api/pending-queue/failed (DELETE), /api/pending-queue/all
  (DELETE). Helper methods that ONLY served them
  (getQueueMessages, getStuckCount, getRecentlyProcessed,
  clearFailed, clearAll) deleted from PendingMessageStore.
  KEPT: /api/processing-status (observability), /health
  (used by ensureWorkerRunning).
- Phase 8: stopSupervisor wrapper deleted from supervisor/index.ts.
  GracefulShutdown now calls getSupervisor().stop() directly.
  Two functions retained with clear roles:
    - performGracefulShutdown — worker-side 6-step shutdown
    - runShutdownCascade — supervisor-side child teardown
      (process.kill(-pgid), Windows tree-kill, PID-file cleanup)
  Each has unique non-trivial logic and a single canonical caller.
- Phase 9: transitionMessagesTo(status, filter) is the sole
  failure-marking path on PendingMessageStore. Old methods
  markSessionMessagesFailed and markAllSessionMessagesAbandoned
  deleted along with all callers (worker-service,
  SessionCompletionHandler, tests/zombie-prevention).
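
The validateBody contract from Phase 2 can be sketched without pulling
in express or zod. The Schema, Req, Res, and Next interfaces below are
hand-written stand-ins (Zod's real safeParse reports failures through
an error object rather than a bare issues array), and the string type
for sessionId is an assumption. The 400 payload shape and the
body-replacement step follow the description above.

```typescript
type SafeParseResult<T> =
  | { success: true; data: T }
  | { success: false; issues: string[] };

interface Schema<T> {
  safeParse(input: unknown): SafeParseResult<T>;
}

interface Req { body: unknown }
interface Res {
  status(code: number): Res;
  json(payload: unknown): void;
}
type Next = () => void;

function validateBody<T>(schema: Schema<T>) {
  return (req: Req, res: Res, next: Next): void => {
    const result = schema.safeParse(req.body);
    if (result.success === false) {
      res.status(400).json({ error: "ValidationError", issues: result.issues });
      return;
    }
    req.body = result.data; // handlers only ever see the parsed value
    next();
  };
}

// Hand-written equivalent of a one-field object schema.
const sessionEndSchemaSketch: Schema<{ sessionId: string }> = {
  safeParse(input: unknown): SafeParseResult<{ sessionId: string }> {
    const candidate = input as { sessionId?: unknown } | null;
    if (typeof candidate === "object" && candidate !== null && typeof candidate.sessionId === "string") {
      return { success: true, data: { sessionId: candidate.sessionId } };
    }
    return { success: false, issues: ["sessionId: expected string"] };
  },
};
```

Because the middleware replaces req.body with the parsed value, unknown
keys are dropped before the handler runs in this sketch.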

Tests updated (Principle 7 same-PR delete): coercion test files
refactored to chain validateBody → handler. Zombie-prevention
tests rewritten to call transitionMessagesTo.

Verification: all 4 grep targets → 0. bun run build succeeds.
bun test → 1393 pass / 28 fail / 7 skip — exact match to
baseline. Zero new failures.

Plan: PATHFINDER-2026-04-22/06-api-surface.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: land PATHFINDER Plan 07 — dead code sweep

ts-prune-driven sweep across the tree after Plans 01-06 landed.
Deleted unused exports, orphan helpers, and one fully orphaned
file. Earlier-plan deletions verified.

Deleted:
- src/utils/bun-path.ts (entire file — getBunPath, getBunPathOrThrow,
  isBunAvailable: zero importers)
- bun-resolver.getBunVersionString: zero callers
- PendingMessageStore.retryMessage / resetProcessingToPending /
  abortMessage: superseded by transitionMessagesTo (Plan 06 Phase 9)
- EnvManager.MANAGED_CREDENTIAL_KEYS, EnvManager.setCredential:
  zero callers
- CodexCliInstaller.checkCodexCliStatus: zero callers; no status
  command exists in npx-cli
- Two "REMOVED: cleanupOrphanedSessions" stale-fence comments

Kept (with documented justification):
- Public API surface in dist/sdk/* (parseAgentXml, prompt
  builders, ParsedObservation, ParsedSummary, ParseResult,
  SUMMARY_MODE_MARKER) — exported via package.json sdk path.
- generateContext / loadContextConfig / token utilities — used
  via dynamic await import('../../../context-generator.js') in
  worker SearchRoutes.
- MCP_IDE_INSTALLERS, install/uninstall functions for codex/goose
  — used via dynamic await import in npx-cli/install.ts +
  uninstall.ts (ts-prune cannot trace dynamic imports).
- getExistingChromaIds — active caller in
  ChromaSync.backfillMissingSyncs (Plan 04 narrowed scope).
- processPendingQueues / getSessionsWithPendingMessages — active
  orphan-recovery caller in worker-service.ts plus
  zombie-prevention test coverage.
- StoreAndMarkCompleteResult legacy alias — return-type annotation
  in same file.
- All Database.ts barrel re-exports — used downstream.

Earlier-plan verification:
- Plan 03 Phase 9: VERIFIED — src/utils/transcript-parser.ts
  is gone; TranscriptParser has 0 references in src/.
- Plan 01 Phase 8: VERIFIED — migration 19 no-op absorbed.
- SessionStore.ts:52-70 consolidation NOT executed (deferred):
  the methods are not thin wrappers but ~900 LoC of bodies, and
  two methods are documented as intentional mirrors so the
  context-generator.cjs bundle stays schema-consistent without
  pulling MigrationRunner. Deserves its own plan, not a sweep.

Verification: TranscriptParser → 0; transcript-parser.ts → gone;
no commented-out code markers remain. bun run build succeeds.
bun test → 1393 pass / 28 fail / 7 skip — EXACT match to
baseline. Zero regressions.

Plan: PATHFINDER-2026-04-22/07-dead-code.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: remove residual ProcessRegistry comment reference

Plan 07 dead-code sweep missed one comment-level reference to the
deleted in-memory ProcessRegistry class in SessionManager.ts:347.
Rewritten to describe the supervisor.json scope without naming the
deleted class, completing the verification grep target.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Greptile review (P1 + 2× P2)

P1 — Plan 05 Phase 3 blocking endpoint was non-functional:
executeWithWorkerFallback used HEALTH_CHECK_TIMEOUT_MS (3 s) for
the POST /api/session/end call, but the server holds the
connection for SERVER_SIDE_SUMMARY_TIMEOUT_MS (30 s). Client
always raced to a "timed out" rejection that isWorkerUnavailable
classified as worker-unreachable, so the hook silently degraded
instead of waiting for summaryStoredEvent.
  - Added optional timeoutMs to executeWithWorkerFallback,
    forwarded to workerHttpRequest.
  - summarize.ts call site now passes 35_000 (5 s above server
    hold window).

P2 — ingestSummary({ kind: 'parsed' }) branch was dead code:
ResponseProcessor emitted summaryStoredEvent directly via the
event bus, bypassing the centralized helper that the comment
claimed was the single source.
  - ResponseProcessor now calls ingestSummary({ kind: 'parsed',
    sessionDbId, messageId, contentSessionId, parsed }) so the
    event-emission path is single-sourced.
  - ingestSummary's requireContext() resolution moved inside the
    'queue' branch (the only branch that needs sessionManager /
    dbManager). 'parsed' is a pure event-bus emission and
    doesn't need worker-internal context — fixes mocked
    ResponseProcessor unit tests that don't call
    setIngestContext.

P2 — isWorkerFallback could false-positive on legitimate API
responses whose schema includes { continue: true, ... }:
  - Added a Symbol.for('claude-mem/worker-fallback') brand to
    WorkerFallback. isWorkerFallback now checks the brand, not
    a duck-typed property name.
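
A minimal sketch of the symbol brand, assuming the brand is stored as a
non-enumerable property on the fallback object. The Symbol.for key is
the one quoted above; makeWorkerFallback and the property layout are
hypothetical.

```typescript
const WORKER_FALLBACK_BRAND = Symbol.for("claude-mem/worker-fallback");

interface WorkerFallback {
  continue: true;
  reason?: string;
}

function makeWorkerFallback(reason?: string): WorkerFallback {
  const fallback: WorkerFallback =
    reason === undefined ? { continue: true } : { continue: true, reason };
  // Non-enumerable so the brand never leaks into JSON output.
  Object.defineProperty(fallback, WORKER_FALLBACK_BRAND, {
    value: true,
    enumerable: false,
  });
  return fallback;
}

function isWorkerFallback(value: unknown): value is WorkerFallback {
  return (
    typeof value === "object" &&
    value !== null &&
    (value as Record<symbol, unknown>)[WORKER_FALLBACK_BRAND] === true
  );
}
```

An API response that merely happens to contain { continue: true } no
longer passes the check, because it lacks the brand.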

Verification: bun run build succeeds. bun test → 1393 pass /
28 fail / 7 skip — exact baseline match. Zero new failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Greptile iteration 2 (P1 + P2)

P1 — summaryStoredEvent fired regardless of whether the row was
persisted. ResponseProcessor's call to ingestSummary({ kind:
'parsed' }) ran for every parsed.kind === 'summary' even when
result.summaryId came back null (e.g. FK violation, null
memory_session_id at commit). The blocking /api/session/end
endpoint then returned { ok: true } and the Stop hook logged
'Summary stored' for a non-existent row.

  - Gate ingestSummary call on (parsed.data.skipped ||
    session.lastSummaryStored). Skipped summaries are an explicit
    no-op bypass and still confirm; real summaries only confirm
    when storage actually wrote a row.
  - Non-skipped + summaryId === null path logs a warn and lets
    the server-side timeout (504) surface to the hook instead of
    a false ok:true.

P2 — PendingMessageStore.enqueue() returns 0 when INSERT OR
IGNORE suppresses a duplicate (the UNIQUE(session_id, tool_use_id)
constraint added by Plan 01 Phase 2). The two callers
(SessionManager.queueObservation and queueSummarize) previously
logged 'ENQUEUED messageId=0' which read like a row was inserted.

  - Branch on messageId === 0 and emit a 'DUP_SUPPRESSED' debug
    log instead of the misleading ENQUEUED line. No behavior
    change — the duplicate is still correctly suppressed by the
    DB (Principle 3); only the log surface is corrected.
  - confirmProcessed is never called with the enqueue() return
    value (it operates on session.processingMessageIds[] from
    claimNextMessage), so no caller is broken; the visibility
    fix prevents future misuse.

Verification: bun run build succeeds. bun test → 1393 pass /
28 fail / 7 skip — exact baseline match. Zero new failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Greptile iteration 3 (P1 + 2× P2)

- P1 worker-service.ts: wire ensureGeneratorRunning into the ingest
  context after SessionRoutes is constructed. setIngestContext runs
  before routes exist, so transcript-watcher observations queued via
  ingestObservation() had no way to auto-start the SDK generator.
  Added attachIngestGeneratorStarter() to patch the callback in.
- P2 shared.ts: IngestEventBus now sets maxListeners to 0. Concurrent
  /api/session/end calls register one listener each and clean up on
  completion, so the default-10 warning fires spuriously under normal
  load.
- P2 SessionRoutes.ts: handleObservationsByClaudeId now delegates to
  ingestObservation() instead of duplicating skip-tool / meta /
  privacy / queue logic. Single helper, matching the Plan 03 goal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Greptile iteration 4 (P1 tool-pair + P2 parse/path/doc)

- processor.handleToolResult: restore in-memory tool-use→tool-result
  pairing via session.pendingTools for schemas (e.g. Codex) whose
  tool_result events carry only tool_use_id + output. Without this,
  neither handler fired — all tool observations silently dropped.
- processor.maybeParseJson: return raw string on parse failure instead
  of throwing. Previously a single malformed JSON-shaped field caused
  handleLine's outer catch to discard the entire transcript line.
- watcher.deepestNonGlobAncestor: split on / and \\, emit empty string
  for purely-glob inputs so the caller skips the watch instead of
  anchoring fs.watch at the filesystem root. Windows-compatible.
- PendingMessageStore.enqueue: tighten docstring — callers today only
  log on the returned id; the SessionManager branches on id === 0.
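
The watcher.deepestNonGlobAncestor behavior described above can be
sketched as follows. The set of characters treated as glob syntax and
the choice to rejoin segments with "/" are assumptions; the real
function may normalize paths differently.

```typescript
// Characters that mark a path segment as a glob pattern (assumed set).
const GLOB_CHARS = /[*?[\]{}]/;

// Returns the longest leading run of literal segments, or "" when the
// pattern starts with a glob, so the caller can skip the watch entirely.
function deepestNonGlobAncestor(pattern: string): string {
  const segments = pattern.split(/[\\/]/); // split on both / and \
  const literal: string[] = [];
  for (const segment of segments) {
    if (GLOB_CHARS.test(segment)) break;
    literal.push(segment);
  }
  return literal.join("/");
}
```

In this sketch a pattern such as "/*.jsonl" also yields an empty
string, so fs.watch is never anchored at the filesystem root.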

* fix: forward tool_use_id through ingestObservation (Greptile iter 5)

P1 — Plan 01's UNIQUE(content_session_id, tool_use_id) dedup never
fired because the new shared ingest path dropped the toolUseId before
queueObservation. SQLite treats NULL values as distinct for UNIQUE,
so every replayed transcript line landed a duplicate row.

- shared.ingestObservation: forward payload.toolUseId to
  queueObservation so INSERT OR IGNORE can actually collapse.
- SessionRoutes.handleObservationsByClaudeId: destructure both
  tool_use_id (HTTP convention) and toolUseId (JS convention) from
  req.body and pass into ingestObservation.
- observationsByClaudeIdSchema: declare both keys explicitly so the
  validator doesn't rely on .passthrough() alone.

* fix: drop dead pairToolUsesByJoin, close session-end listener race

- PendingMessageStore: delete pairToolUsesByJoin. The method was never
  called and its self-join semantics are structurally incompatible
  with UNIQUE(content_session_id, tool_use_id): INSERT OR IGNORE
  collapses any second row with the same pair, so a self-join can
  only ever match a row to itself. In-memory pendingTools in
  processor.ts remains the pairing path for split-event schemas.

- IngestEventBus: retain a short-lived (60s) recentStored map keyed
  by sessionId. Populated on summaryStoredEvent emit, evicted on
  consume or TTL.

- handleSessionEnd: drain the recent-events buffer before attaching
  the listener. Closes the register-after-emit race where the summary
  can persist between the hook's summarize POST and its session/end
  POST — previously that window returned 504 after the 30s timeout.
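
The recent-events buffer can be sketched as a small TTL map. The 60s
TTL, the sessionId key, and evict-on-consume follow the notes above;
the RecentStoredBuffer class, its method names, and the injectable
clock are hypothetical.

```typescript
interface SummaryStoredEvent {
  sessionId: string;
  messageId: number;
}

class RecentStoredBuffer {
  private readonly entries = new Map<string, { event: SummaryStoredEvent; storedAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number = 60000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Called when summaryStoredEvent is emitted.
  record(event: SummaryStoredEvent): void {
    this.evictExpired();
    this.entries.set(event.sessionId, { event, storedAt: this.now() });
  }

  // Called by the session-end handler before it attaches its listener.
  take(sessionId: string): SummaryStoredEvent | undefined {
    this.evictExpired();
    const entry = this.entries.get(sessionId);
    if (entry === undefined) return undefined;
    this.entries.delete(sessionId); // evicted on consume
    return entry.event;
  }

  private evictExpired(): void {
    const cutoff = this.now() - this.ttlMs;
    this.entries.forEach((entry, key) => {
      if (entry.storedAt <= cutoff) this.entries.delete(key);
    });
  }
}
```

A handler that calls take() first and only then subscribes sees an
event that was emitted moments earlier, instead of waiting out the
full timeout.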

* chore: merge origin/main into vivacious-teeth

Resolves conflicts with 15 commits on main (v12.3.9, security
observation types, Telegram notifier, PID-reuse worker start-guard).

Conflict resolution strategy:
- plugin/hooks/hooks.json, plugin/scripts/*.cjs, plugin/ui/viewer-bundle.js:
  kept ours — PATHFINDER Plan 05 deletes the for-i-in-1-to-20 curl retry
  loops and the built artifacts regenerate on build.
- src/cli/handlers/summarize.ts: kept ours — Plan 05 blocking
  POST /api/session/end supersedes main's fire-and-forget path.
- src/services/worker-service.ts: kept ours — Plan 05 ingest bus +
  summaryStoredEvent supersedes main's SessionCompletionHandler DI
  refactor + orphan-reaper fallback.
- src/services/worker/http/routes/SessionRoutes.ts: kept ours — same
  reason; generator .finally() Stop-hook self-clean is a guard for a
  path our blocking endpoint removes.
- src/services/worker/http/routes/CorpusRoutes.ts: merged — added
  security_alert / security_note to ALLOWED_CORPUS_TYPES (feature from
  #2084) while preserving our Zod validateBody schema.

Typecheck: 294 errors (vs 298 pre-merge). No new errors introduced; all
remaining are pre-existing (Component-enum gaps, DOM lib for viewer,
bun:sqlite types).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Greptile P2 findings

1) SessionRoutes.handleSessionEnd was the only route handler not wrapped
   in wrapHandler — synchronous exceptions would hang the client rather
   than surfacing as 500s. Wrap it like every other handler.

2) processor.handleToolResult only consumed the session.pendingTools
   entry when the tool_result arrived without a toolName. In the
   split-schema path where tool_result carries both toolName and toolId,
   the entry was never deleted and the map grew for the life of the
   session. Consume the entry whenever toolId is present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: typing cleanup and viewer tsconfig split for PR feedback

- Add explicit return types for SessionStore query methods
- Exclude src/ui/viewer from root tsconfig, give it its own DOM-typed config
- Add bun to root tsconfig types, plus misc typing tweaks flagged by Greptile
- Rebuilt plugin/scripts/* artifacts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Greptile P2 findings (iter 2)

- PendingMessageStore.transitionMessagesTo: require sessionDbId (drop
  the unscoped-drain branch that would nuke every pending/processing
  row across all sessions if a future caller omitted the filter).
- IngestEventBus.takeRecentSummaryStored: make idempotent — keep the
  cached event until TTL eviction so a retried Stop hook's second
  /api/session/end returns immediately instead of hanging 30 s.
- TranscriptWatcher fs.watch callback: skip full glob scan for paths
  already tailed (JSONL appends fire on every line; only unknown
  paths warrant a rescan).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: call finalizeSession in terminal session paths (Greptile iter 3)

terminateSession and runFallbackForTerminatedSession previously called
SessionCompletionHandler.finalizeSession before removeSessionImmediate;
the refactor dropped those calls, leaving sdk_sessions.status='active'
for every session killed by wall-clock limit, unrecoverable error, or
exhausted fallback chain. The deleted reapStaleSessions interval was
the only prior backstop.

Re-wires finalizeSession (idempotent: marks completed, drains pending,
broadcasts) into both paths; no reaper reintroduced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: GC failed pending_messages rows at startup (Greptile iter 4)

Earlier plans deleted clearFailedOlderThan (Plan 01 Phase 5) and
clearFailed (Plan 06 Phase 7), and with the periodic sweep also
removed, nothing reaps status='failed' rows now, so they accumulate
indefinitely. Since claimNextMessage's
self-healing subquery scans this table, unbounded growth degrades
claim latency over time.

Re-introduces clearFailedOlderThan and calls it once at worker startup
(not a reaper — one-shot, idempotent). 7-day retention keeps enough
history for operator inspection while bounding the table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: finalize sessions on normal exit; cleanup hoist; share handler (iter 5)

1. startSessionProcessor success branch now calls completionHandler.
   finalizeSession before removeSessionImmediate. Hooks-disabled installs
   (and any Stop hook that fails before POST /api/sessions/complete) no
   longer leave sdk_sessions rows as status='active' forever. Idempotent
   — a subsequent /api/sessions/complete is a no-op.

2. Hoist SessionRoutes.handleSessionEnd cleanup declaration above the
   closures that reference it (TDZ safety; safe at runtime today but
   fragile if timeout ever shrinks).

3. SessionRoutes now receives WorkerService's shared SessionCompletionHandler
   instead of constructing its own — prevents silent divergence if the
   handler ever becomes stateful.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: stop runaway crash-recovery loop on dead sessions

Two distinct bugs were combining to keep a dead session restarting forever:

Bug 1 (uncaught "The operation was aborted."):
  child_process.spawn emits 'error' asynchronously for ENOENT, EACCES, and
  AbortSignal aborts. spawnSdkProcess() never attached an 'error' listener, so
  any async spawn failure became uncaughtException and escaped to the
  daemon-level handler. Attach an 'error' listener immediately after spawn,
  before the !child.pid early-return, so async spawn errors are logged
  (with errno code) and swallowed locally.

Bug 2 (sliding-window limiter never trips on slow restart cadence):
  RestartGuard tripped only when restartTimestamps.length exceeded
  MAX_WINDOWED_RESTARTS (10) within RESTART_WINDOW_MS (60s). With the 8s
  exponential-backoff cap, only ~7-8 restarts fit in the window, so a dead
  session that fails and restarts on 8s cycles would loop forever
  (consecutiveRestarts climbing past 30+ in observed logs). Add a
  consecutiveFailures counter that increments on every restart and resets
  only on recordSuccess(). Trip when consecutive failures exceed
  MAX_CONSECUTIVE_FAILURES (5) — meaning 5 restarts with zero successful
  processing in between proves the session is dead. Both guards now run in
  parallel: tight loops still trip the windowed cap; slow loops trip the
  consecutive-failure cap.
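
The two guards can be sketched as follows. The constants, `restartTimestamps`, `consecutiveFailures`, and `recordSuccess()` are named in this commit message; the `recordRestart` method, its boolean return value, and the explicit clock parameter are assumptions made for illustration, and "exceed" is read literally as `>`.

```typescript
const RESTART_WINDOW_MS = 60_000;
const MAX_WINDOWED_RESTARTS = 10;
const MAX_CONSECUTIVE_FAILURES = 5;

class RestartGuard {
  private restartTimestamps: number[] = [];
  private consecutiveFailures = 0;

  // Records a restart attempt. Returns false once either guard has tripped.
  // (Hypothetical signature: the real class may read the clock itself.)
  recordRestart(nowMs: number): boolean {
    // Sliding window: keep only restarts from the last RESTART_WINDOW_MS.
    this.restartTimestamps = this.restartTimestamps.filter(
      (t) => nowMs - t < RESTART_WINDOW_MS,
    );
    this.restartTimestamps.push(nowMs);

    // Consecutive counter: grows on every restart, reset only by recordSuccess().
    this.consecutiveFailures += 1;

    const windowTripped = this.restartTimestamps.length > MAX_WINDOWED_RESTARTS;
    const consecutiveTripped = this.consecutiveFailures > MAX_CONSECUTIVE_FAILURES;
    return !(windowTripped || consecutiveTripped);
  }

  // Called after a message is processed successfully.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }
}
```

With restarts spaced 8s apart, at most 8 timestamps ever sit inside the 60s window, so the windowed cap alone never trips; under these assumptions the consecutive counter stops the loop on the sixth restart with no success in between.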

Also: when the SessionRoutes path trips the guard, drain pending messages
to 'abandoned' so the session does not reappear in
getSessionsWithPendingMessages and trigger another auto-start cycle. The
worker-service.ts path already does this via terminateSession.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf: streamline worker startup and consolidate database connections

1. Database Pooling: Modified DatabaseManager, SessionStore, and SessionSearch to share a single bun:sqlite connection, eliminating redundant file descriptors.
2. Non-blocking Startup: Refactored WorktreeAdoption and Chroma backfill to run in the background (fire-and-forget), preventing them from stalling core initialization.
3. Diagnostic Routes: Added /api/chroma/status and bypassed the initialization guard for health/readiness endpoints to allow diagnostics during startup.
4. Robust Search: Implemented reliable SQLite FTS5 fallback in SearchManager for when Chroma (uvx) fails or is unavailable.
5. Code Cleanup: Removed redundant loopback MCP checks and mangled initialization logic from WorkerService.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: hard-exclude observer-sessions from hooks; bundle migration 29 (#2124)

* fix: hard-exclude observer-sessions from hooks; backfill bundle migrations

Stop hook + SessionEnd hook were storing the SDK observer's own
init/continuation/summary prompts in user_prompts, leaking into the
viewer (meta-observation regression). 25 such rows accumulated.

- shouldTrackProject: hard-reject OBSERVER_SESSIONS_DIR (and its subtree)
  before consulting user-configured exclusion globs.
- summarize.ts (Stop) and session-complete.ts (SessionEnd): early-return
  when shouldTrackProject(cwd) is false, so the observer's own hooks
  cannot bootstrap the worker or queue a summary against the meta-session.
- SessionRoutes: cap user-prompt body at 256 KiB at the session-init
  boundary so a runaway observer prompt cannot blow up storage.
- SessionStore: add migration 29 (UNIQUE(memory_session_id, content_hash)
  on observations) inline so bundled artifacts (worker-service.cjs,
  context-generator.cjs) stay schema-consistent — without it, the
  ON CONFLICT clause in observation inserts throws.
- spawnSdkProcess: stdio[stdin] from 'ignore' to 'pipe' so the
  supervisor can actually feed the observer's stdin.

Also rebuilds plugin/scripts/{worker-service,context-generator}.cjs.

* fix: walk back to UTF-8 boundary on prompt truncation (Greptile P2)

Plain Buffer.subarray at MAX_USER_PROMPT_BYTES can land mid-codepoint,
which the utf8 decoder silently rewrites to U+FFFD. Walk back over any
continuation bytes (0b10xxxxxx) before decoding so the truncated prompt
ends on a valid sequence boundary instead of a replacement character.
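
A minimal sketch of the walk-back, assuming a Node `Buffer` and a 256 KiB value for `MAX_USER_PROMPT_BYTES` (the constant name and the 256 KiB cap both appear in this PR, but pairing them is an assumption). The function name `truncateUtf8` is hypothetical.

```typescript
const MAX_USER_PROMPT_BYTES = 256 * 1024; // assumed value: the 256 KiB cap

function truncateUtf8(text: string, maxBytes: number = MAX_USER_PROMPT_BYTES): string {
  const bytes = Buffer.from(text, 'utf8');
  if (bytes.length <= maxBytes) return text;

  let end = maxBytes;
  // bytes[end] is the first byte that would be cut off. If it is a
  // continuation byte (0b10xxxxxx), the cut falls inside a multi-byte
  // sequence, so step back until the cut sits on the start of a sequence.
  while (end > 0 && (bytes.readUInt8(end) & 0b1100_0000) === 0b1000_0000) {
    end -= 1;
  }
  return bytes.subarray(0, end).toString('utf8');
}
```

For example, `truncateUtf8('aé', 2)` returns `'a'` rather than `'a'` followed by U+FFFD, because the two-byte `é` does not fit in the remaining byte.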

* fix: cross-platform observer-dir containment; clarify SDK stdin pipe

claude-review feedback on PR #2124.

- shouldTrackProject: literal `cwd.startsWith(OBSERVER_SESSIONS_DIR + '/')`
  hard-coded a POSIX separator and missed Windows backslash paths plus any
  trailing-slash variance. Switched to a path.relative-based isWithin()
  helper so Windows hook input under observer-sessions\\... is also excluded.
- spawnSdkProcess: added a comment explaining why stdin must be 'pipe' —
  SpawnedSdkProcess.stdin is typed NonNullable and the Claude Agent SDK
  consumes that pipe; 'ignore' would null it and the null-check below
  would tear the child down on every spawn.
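
A sketch of what a `path.relative`-based `isWithin()` helper might look like. The commit only names the helper; the signature, the optional `p` parameter (added so both path flavours can be exercised on one machine), and the example paths in the usage note are assumptions.

```typescript
import * as path from 'node:path';

// True when `candidate` is `parent` itself or lives somewhere beneath it.
function isWithin(
  parent: string,
  candidate: string,
  p: path.PlatformPath = path,
): boolean {
  const rel = p.relative(parent, candidate);
  if (rel === '') return true; // same directory (also covers a trailing slash)
  if (rel === '..' || rel.startsWith('..' + p.sep)) return false; // climbs out of parent
  if (p.isAbsolute(rel)) return false; // e.g. a different drive on Windows
  return true;
}
```

Unlike a literal `startsWith(dir + '/')`, this treats `observer-sessions\s1` on Windows and `observer-sessions/` with a trailing slash as contained, while a sibling such as `observer-sessions-backup` is still rejected.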

* fix: make Stop hook fire-and-forget; remove dead /api/session/end

The Stop hook was awaiting a 35-second long-poll on /api/session/end,
which the worker held open until the summary-stored event fired (or its
30s server-side timeout elapsed). Followed by another await on
/api/sessions/complete. Three sequential awaits, the middle one a 30s
hold — not fire-and-forget despite repeated requests.

The Stop hook now does ONE thing: POST /api/sessions/summarize to
queue the summary work and return. The worker drives the rest async.
Session-map cleanup is performed by the SessionEnd handler
(session-complete.ts), not duplicated here.

- summarize.ts: drop the /api/session/end long-poll and the trailing
  /api/sessions/complete await; ~40 lines removed; unused
  SessionEndResponse interface gone; header comment rewritten.
- SessionRoutes: delete handleSessionEnd, sessionEndSchema, the
  SERVER_SIDE_SUMMARY_TIMEOUT_MS constant, and the /api/session/end
  route registration. Drop the now-unused ingestEventBus and
  SummaryStoredEvent imports.
- ResponseProcessor + shared.ts + worker-utils.ts: update stale
  comments that referenced the dead endpoint. The IngestEventBus is
  left in place dormant (no listeners) for follow-up cleanup so this
  PR stays focused on the blocker.

Bundle artifact (worker-service.cjs) rebuilt via build-and-sync.

Verification:
- grep '/api/session/end' plugin/scripts/worker-service.cjs → 0
- grep 'timeoutMs:35' plugin/scripts/worker-service.cjs → 0
- Worker restarted clean, /api/health ok at pid 92368

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* deps: bump all dependencies to latest including majors

Upgrades: React 18→19, Express 4→5, Zod 3→4, TypeScript 5→6,
@types/node 20→25, @anthropic-ai/claude-agent-sdk 0.1→0.2,
@clack/prompts 0.9→1.2, plus minors. Adds Daily Maintenance section
to CLAUDE.md mandating latest-version policy across manifests.

Express 5 surfaced a race in Server.listen() where the 'error' handler
was attached after listen() was invoked; refactored to use
http.createServer with both 'error' and 'listening' handlers attached
before listen(), restoring port-conflict rejection semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: surface real chroma errors and add deep status probe

Replace the misleading "Vector search failed - semantic search unavailable.
Install uv... restart the worker." string in SearchManager with the actual
exception text from chroma_query_documents. The lying message blamed `uv`
for any failure — even when the real cause was a chroma-mcp transport
timeout, an empty collection, or a dead subprocess.

Also add /api/chroma/status?deep=1 backed by a new
ChromaMcpManager.probeSemanticSearch() that round-trips a real query
(chroma_list_collections + chroma_query_documents) instead of just
checking the stdio handshake. The cheap default path is unchanged.

Includes the diagnostic plan (PLAN-fix-mcp-search.md) and updated test
fixtures for the new structured failure message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: rebuild worker-service bundle to match merged src

Bundle was stale after the squash merge of #2124 — it still contained
the old "Install uv... semantic search unavailable" string and lacked
probeSemanticSearch. Rebuilt via bun run build-and-sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: address coderabbit feedback on PLAN-fix-mcp-search.md

- replace machine-specific /Users/alexnewman absolute paths with portable
  <repo-root> placeholder (MD-style portability)
- add blank lines around the TypeScript fenced block (MD031)
- tag the bare fenced block with `text` (MD040)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Alex Newman
2026-04-25 13:37:40 -07:00
committed by GitHub
parent 8ace1d9c84
commit 94d592f212
159 changed files with 18091 additions and 5843 deletions
@@ -0,0 +1,463 @@
# Plan 11: http-server-routes (clean)
Implements flowchart §3.9 of `PATHFINDER-2026-04-21/05-clean-flowcharts.md`.
Introduces Zod + `validateBody(schema)` middleware, deletes the rate limiter, caches the two served static files at boot, and strips per-route hand-rolled shape-validation. Bullshit-inventory items **#37 (per-route validation boilerplate)**, **#39 (rate limit)**, **#40 (oversized-body special handling)** are eliminated. **#38 (admin endpoints)** is explicitly preserved per the inventory note.
## Header
- **Target flowchart**: `PATHFINDER-2026-04-21/05-clean-flowcharts.md` §3.9 "http-server-routes (clean)" (lines 382-420).
- **Before state**: `PATHFINDER-2026-04-21/01-flowcharts/http-server-routes.md`.
- **Upstream dependencies**: *none*. Zod adoption is orthogonal to every other plan; this plan OWNS the Zod introduction.
- **Downstream dependencies**: *none*. Other plans land unaffected; they gain `validateBody(schema)` validation by attaching a schema to their routes at landing time, not by rewriting this plan.
- **Coordination note**: Plan 09 (lifecycle-hooks) collapses `SessionRoutes` from 10 → 4 endpoints (V9 finding). This plan MUST land **after** Plan 09 so the Zod schemas here target the final 4-endpoint surface, not the legacy 10. If landing order flips, re-attach schemas to whichever route names survive.
- **Verified findings cited**: V2 (legacy `/sessions/*` vs `/api/sessions/*`, SessionRoutes.ts:378-389); V9 (SessionRoutes has 10 endpoints, not 8); V20 (rate limiter at `src/services/worker/http/middleware.ts:45-79`, 300 req/min IP map, keyed by `::ffff:127.0.0.1`-normalized IP).
## Anti-patterns prohibited in every phase
- **A**: No invented Zod methods. Every API used must be verified against the installed zod version (Phase 1). In particular, use `schema.safeParse(body)` + `result.success ? result.data : result.error.flatten()` — no `ZodUtil.assertBody`, no `schema.validateOrThrow`.
- **D**: No per-route validation blocks of 5+ if statements. Any block that currently does `if (typeof x !== 'string') ... if (!body.foo) ... if (!body.bar) ...` collapses to a single `validateBody(schema)` middleware call.
- **E**: No two validation paths. If a route gets a Zod schema, the hand-rolled checks in the handler body get deleted in the same commit. "Defense in depth" via duplicate validation is forbidden.
---
## Phase 1 — Confirm Zod availability; add if absent
**Outcome**: `zod` is a first-class dependency in `package.json`, installed in `node_modules`, with a known version so every schema in Phase 3 uses a stable API.
### (a) What to implement
- Run `npm ls zod` in the repo root.
- If present (transitive or direct): pin the resolved major version in `package.json` dependencies (move from transitive to explicit so future `npm ci` can't drop it).
- If absent (confirmed state as of 2026-04-22 — see findings below): `npm install zod@^3.23.8` (current stable 3.x line). Commit `package.json` + `package-lock.json`.
- Record the resolved version in the PR description. All subsequent phases use this version's API surface.
Copy from: nothing — this is a dependency add. Reference the `package.json` structure at `/Users/alexnewman/.superset/worktrees/claude-mem/vivacious-teeth/package.json:111-125` (current `dependencies` block).
### (b) Docs
- §3.9 "Deleted" bullet 2 ("Per-route hand-rolled validation (Zod middleware replaces)").
- `06-implementation-plan.md` line 55: "Zod — `z.object({...})`, `schema.safeParse(body)`, `result.success ? result.data : result.error.flatten()`. (Not yet a dep; Phase 12 adds `zod` via npm; already shipped transitively via `@anthropic-ai/sdk` — confirm before landing.)"
- V9 (06-implementation-plan.md:36) confirms the SessionRoutes endpoint count that Phase 3 must schema.
- Live file:line: `package.json:111-125` (dependencies block); `package.json:124` (`zod-to-json-schema` — sibling package, *not* zod itself).
### (c) Verification
- `npm ls zod` prints a single resolved path, not "(empty)".
- `node -e "require('zod')"` exits 0.
- Grep: `grep -n '"zod"' package.json` → **≥1** match in dependencies (not just `zod-to-json-schema`).
- `git diff package.json` shows `zod` added; `package-lock.json` shows resolved version.
### (d) Anti-pattern guards
- **A**: Don't pin to `@latest`; pin to the major line installed now (3.x). Record the exact minor in the plan PR.
- **E**: Don't add `zod` to both `dependencies` and `devDependencies` — runtime code imports it, so `dependencies` only.
---
## Phase 2 — Write `validateBody(schema)` middleware
**Outcome**: One Express middleware file, ~40 lines, that accepts any Zod schema and rejects non-conforming bodies with a uniform 400 shape. Zero per-route boilerplate.
### (a) What to implement
Create `src/services/worker/http/middleware/validateBody.ts`:
```ts
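import type { RequestHandler } from 'express';
import type { ZodType } from 'zod';

/** Builds Express middleware that validates `req.body` against a Zod schema. */
export function validateBody<T>(schema: ZodType<T>): RequestHandler {
  return (req, res, next) => {
    const result = schema.safeParse(req.body);
    if (!result.success) {
      // Uniform 400 shape: error/message/code plus per-field details.
      res.status(400).json({
        error: 'validation_failed',
        message: 'Request body failed schema validation',
        code: 'VALIDATION_FAILED',
        fields: result.error.flatten(),
      });
      return; // never call next() after sending a response
    }
    // Hand the parsed value to the route handler.
    req.body = result.data;
    next();
  };
}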
```
Copy error-shape keys (`error`, `message`, `code`) from the existing `BaseRouteHandler.handleError` response shape at `/Users/alexnewman/.superset/worktrees/claude-mem/vivacious-teeth/src/services/worker/http/BaseRouteHandler.ts:82-99`, extended with `fields` (per 06-implementation-plan.md:546, 553, 563).
Create the directory: `src/services/worker/http/middleware/` (new; sibling to `middleware.ts`). One file, one export.
### (b) Docs
- §3.9 flowchart node D: `validateBody(schema) middleware (Zod per route)` → node E `Valid? → 400 with field errors` (05-clean-flowcharts.md:388-391).
- 06-implementation-plan.md Phase 12, lines 542-548 (middleware signature + `safeParse` + 400 with `fields`).
- Live file:line: existing error shape at `src/services/worker/http/BaseRouteHandler.ts:82-99` (fields: `error`, `code`, `details`).
### (c) Verification
- `grep -n "export function validateBody" src/services/worker/http/middleware/validateBody.ts` → 1 match.
- `grep -rn "res.status(400)" src/services/worker/http/middleware/validateBody.ts` → exactly 1 (the single 400 response).
- Unit test: schema `z.object({ foo: z.string() })` accepts `{foo:"bar"}`, rejects `{foo:42}` with 400 and `fields.fieldErrors.foo` populated.
- TypeScript: `tsc --noEmit` succeeds — the generic `<T>` signature must compile.
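
The unit test in the list above could be sketched as follows. To keep the sketch runnable without `express` or `zod` installed, it uses hand-written structural stand-ins (`SchemaLike`, `Req`, `Res`, `fooSchema`, `run`), all of which are hypothetical; in the repository the test would import the real `validateBody` and use `z.object({ foo: z.string() })`.

```typescript
// Structural stand-ins mirroring only the members validateBody touches.
type Flattened = { formErrors: string[]; fieldErrors: Record<string, string[]> };
type ParseResult<T> =
  | { success: true; data: T }
  | { success: false; error: { flatten(): Flattened } };
interface SchemaLike<T> { safeParse(input: unknown): ParseResult<T>; }
interface Req { body: unknown; }
interface Res { status(code: number): Res; json(payload: unknown): void; }

// Same control flow as the middleware above, typed against the stand-ins.
function validateBody<T>(schema: SchemaLike<T>) {
  return (req: Req, res: Res, next: () => void): void => {
    const result = schema.safeParse(req.body);
    if (!result.success) {
      res.status(400).json({
        error: 'validation_failed',
        message: 'Request body failed schema validation',
        code: 'VALIDATION_FAILED',
        fields: result.error.flatten(),
      });
      return;
    }
    req.body = result.data;
    next();
  };
}

// Hand-written equivalent of z.object({ foo: z.string() }) for this test only.
const fooSchema: SchemaLike<{ foo: string }> = {
  safeParse(input) {
    const foo = (input as { foo?: unknown } | null | undefined)?.foo;
    if (typeof foo === 'string') return { success: true, data: { foo } };
    return {
      success: false,
      error: { flatten: () => ({ formErrors: [], fieldErrors: { foo: ['Expected string'] } }) },
    };
  },
};

// Runs the middleware against a fake response and reports what happened.
function run(body: unknown) {
  const outcome = { status: 0, payload: undefined as unknown, nextCalled: false };
  const res: Res = {
    status(code) { outcome.status = code; return res; },
    json(payload) { outcome.payload = payload; },
  };
  validateBody(fooSchema)({ body }, res, () => { outcome.nextCalled = true; });
  return outcome;
}
```

`run({ foo: 'bar' })` should reach `next()` without touching the response, while `run({ foo: 42 })` should produce a 400 whose `fields.fieldErrors.foo` is populated and never call `next()`.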
### (d) Anti-pattern guards
- **A**: `safeParse` only — no `.parse()` with try/catch wrapper, no `assertSafe`, no `ZodUtil` helper class. The Express middleware contract already provides error isolation.
- **D**: This file is the *only* place a Zod parse happens in the HTTP layer. If a future PR adds a second `safeParse` call inside a handler, it is a duplicate validation path — delete it.
- **E**: `next()` only on success. On failure, `res.status(400).json(...)` **and return**. Never both call `next()` and send a response.
---
## Phase 3 — Per-route Zod schemas; attach via middleware
**Outcome**: Every POST / PUT / DELETE-with-body endpoint has a Zod schema sitting next to its route registration. `validateBody(schema)` is inserted into the middleware chain for that route.
### (a) What to implement
For each route file, add a top-of-file `schemas` block (plain `const X = z.object({...})` — do NOT build a `schemas/` parallel directory; inline at top of file keeps the schema co-located with its handler). Attach via the route registration:
Before (`CorpusRoutes.ts:28`):
```ts
app.post('/api/corpus', this.handleBuildCorpus.bind(this));
```
After:
```ts
app.post('/api/corpus', validateBody(BuildCorpusSchema), this.handleBuildCorpus.bind(this));
```
**Schemas required (one per endpoint with a body). Target list assumes Plan 09 has already collapsed SessionRoutes to the 4-endpoint surface per §3.1.** If Plan 09 has not landed, also schema the legacy `/sessions/:sessionDbId/*` endpoints at `src/services/worker/http/routes/SessionRoutes.ts:377-382` — they're deleted by Plan 09 but must not be left unvalidated in the interim.
| Route file | Endpoint | Schema name | Core fields |
|---|---|---|---|
| `SessionRoutes.ts` | `POST /api/session/start` (post-Plan 09) | `SessionStartSchema` | `{ project: string, contentSessionId: string, platformSource?: string, customTitle?: string }` |
| `SessionRoutes.ts` | `POST /api/session/prompt` | `SessionPromptSchema` | `{ sessionDbId: number, prompt: string }` |
| `SessionRoutes.ts` | `POST /api/session/observation` | `SessionObservationSchema` | `{ sessionDbId: number, tool_use_id: string, name: string, input: unknown, output: unknown, cwd?: string }` |
| `SessionRoutes.ts` | `POST /api/session/end` | `SessionEndSchema` | `{ sessionDbId: number, last_assistant_message: string }` |
| `DataRoutes.ts` | `POST /api/observations/batch` | `ObservationsBatchSchema` | `{ ids: z.array(z.number().int()), orderBy?: z.enum(['date_desc','date_asc']), limit?: number, project?: string }` |
| `DataRoutes.ts` | `POST /api/sdk-sessions/batch` | `SdkSessionsBatchSchema` | `{ memorySessionIds: z.array(z.string()) }` |
| `DataRoutes.ts` | `POST /api/processing` | `SetProcessingSchema` | `{ isProcessing: z.boolean() }` (verify field name in handler) |
| `DataRoutes.ts` | `POST /api/pending-queue/process` | `ProcessQueueSchema` | (likely empty — `z.object({}).strict()`) |
| `DataRoutes.ts` | `POST /api/import` | `ImportSchema` | per handler's body shape |
| `MemoryRoutes.ts` | `POST /api/memory/save` | `MemorySaveSchema` | `{ text: z.string().min(1), title?: string, project?: string }` |
| `CorpusRoutes.ts` | `POST /api/corpus` | `BuildCorpusSchema` | `{ name: z.string().min(1), description?: string, project?: string, types?: z.array(z.string()), concepts?: z.array(z.string()), files?: z.array(z.string()), query?: string, date_start?: string, date_end?: string, limit?: z.number().int().positive() }` |
| `CorpusRoutes.ts` | `POST /api/corpus/:name/query` | `QueryCorpusSchema` | `{ question: z.string().min(1) }` |
| `CorpusRoutes.ts` | `POST /api/corpus/:name/rebuild` | `RebuildCorpusSchema` | `z.object({}).strict()` or per handler |
| `SettingsRoutes.ts` | `POST /api/settings` | `UpdateSettingsSchema` | **see note below** |
| `SettingsRoutes.ts` | `POST /api/mcp/toggle` | `ToggleMcpSchema` | `{ enabled: z.boolean() }` |
| `SettingsRoutes.ts` | `POST /api/branch/switch` | `SwitchBranchSchema` | `{ branch: z.enum(['main', 'beta/7.0', 'feature/bun-executable']) }` |
| `SettingsRoutes.ts` | `POST /api/branch/update` | `UpdateBranchSchema` | `z.object({}).strict()` |
| `LogsRoutes.ts` | `POST /api/logs/clear` | `ClearLogsSchema` | `z.object({}).strict()` or per handler |
| `ViewerRoutes.ts` | (GET-only) | — | no body schemas needed |
| `SearchRoutes.ts` | `POST /api/context/semantic` | `SemanticContextSchema` | per handler at `src/services/worker/http/routes/SearchRoutes.ts:41` |
**Special case — `POST /api/settings`**: the existing `validateSettings(settings)` function at `src/services/worker/http/routes/SettingsRoutes.ts:237-385` is ~148 lines of domain validation (valid providers, port ranges, Python version regex, URL parse). That is **domain validation, not shape validation.** Keep it. The Zod schema here validates only that each field, if present, is of the right primitive type (`z.string().optional()`, `z.number().optional()`, `z.boolean().optional()` as appropriate per the `settingKeys` array at `SettingsRoutes.ts:88-128`). The domain validation stays in the handler. This is the correct application of rule D: delete only shape checks, not domain checks.
Copy-ready pattern to replicate: `CorpusRoutes.ts:238-244` — the `QueryCorpusSchema` replaces exactly this block. Cleanest single-field existing check in the codebase.
### (b) Docs
- §3.9 flowchart node D (`validateBody(schema) middleware (Zod per route)`, 05-clean-flowcharts.md:388).
- Bullshit-inventory item #37: "Per-route validation boilerplate × 8 files" → "`validateBody(schema)` middleware; per-route Zod schema" (05-clean-flowcharts.md:55).
- 06-implementation-plan.md Phase 12 task 3 (line 547): "Per-route schemas in a parallel `schemas/` directory (or inline at top of each route file). One `z.object({…})` per endpoint." **This plan chooses inline** (co-location wins over directory partition at this scale — 8 files × ~3 schemas each = ~24 schemas; a separate directory adds import overhead with no clarity gain).
- V9 (06-implementation-plan.md:36): confirms SessionRoutes endpoint count pre/post Plan 09.
- Live file:line per row in the schema table above.
### (c) Verification
- `grep -rn "^import.*from 'zod'" src/services/worker/http/routes/` → **≥1 per route file with a POST endpoint** (7 of 8 files; ViewerRoutes is GET-only).
- `grep -rn "validateBody(" src/services/worker/http/routes/` → count matches the POST/PUT endpoint total in the table above (~18 endpoints).
- For each schema: a successful request round-trips unchanged; an invalid-shape request returns 400 with `{error:'validation_failed', fields:...}`.
### (d) Anti-pattern guards
- **A**: Every schema uses published zod 3.x methods (`z.object`, `z.string`, `z.number`, `z.array`, `z.enum`, `z.boolean`, `.optional`, `.min`, `.int`, `.positive`). Anything else — verify against the resolved zod version from Phase 1. **Do not invent** `.isPositiveInt()` or `.nonEmptyString()` helper methods; use the built-in chain.
- **E**: No schema duplicated. If two endpoints share a shape (e.g. `contentSessionId` appears in multiple SessionRoutes handlers), extract to a shared `const SessionIdField = z.string()` at the top of the file and reuse. Duplicated literal `z.object({...})` with identical fields across files = delete one.
- **D**: Inline schemas only. Do not build `schemas/SessionSchemas.ts` / `schemas/DataSchemas.ts` — that re-introduces the parallel-directory anti-pattern the plan text at 06-implementation-plan.md:547 warns about.
---
## Phase 4 — Delete hand-rolled validation blocks
**Outcome**: Every shape-validation block (type check, presence check, array check) inside a route handler is deleted. Only domain validation remains.
### (a) What to implement
Delete (exact line ranges, to be deleted alongside the Phase 3 schema attachment for each route):
| File | Line range to delete | What | Replaced by |
|---|---|---|---|
| `src/services/worker/http/routes/CorpusRoutes.ts` | `44-51` | `if (!req.body.name) { res.status(400).json({error:'Missing required field: name', fix:..., example:...}); return; }` | `BuildCorpusSchema` in Phase 3 |
| `src/services/worker/http/routes/CorpusRoutes.ts` | `55-69` | Coercion calls for `types`, `concepts`, `files`, `limit` (`coerceStringArray`, `coercePositiveInteger`) | Zod coerces via `z.coerce.number()`, `z.string().transform(s => s.split(','))` as needed |
| `src/services/worker/http/routes/CorpusRoutes.ts` | `88-125` | `coerceStringArray` + `coercePositiveInteger` helper methods | Zod schema coercion replaces both helpers entirely |
| `src/services/worker/http/routes/CorpusRoutes.ts` | `238-245` | `QueryCorpus` question presence + type check | `QueryCorpusSchema` in Phase 3 |
| `src/services/worker/http/routes/DataRoutes.ts` | `118-123` | `path` query-param check (note: query-param, not body — keep as-is OR migrate to `validateQuery(schema)` if the middleware is extended; for this plan, leave) | — |
| `src/services/worker/http/routes/DataRoutes.ts` | `144-163` | `ids` coerce + array-check + integer-check for `POST /api/observations/batch` | `ObservationsBatchSchema` |
| `src/services/worker/http/routes/DataRoutes.ts` | `196-206` | `memorySessionIds` coerce + array-check for `POST /api/sdk-sessions/batch` | `SdkSessionsBatchSchema` |
| `src/services/worker/http/routes/SessionRoutes.ts` | `570-572` | `if (!contentSessionId) return this.badRequest(...)` in `handleObservationsByClaudeId` | Pre-Plan 09: keep as-is until routes collapse; post-Plan 09: replaced by `SessionObservationSchema` |
| `src/services/worker/http/routes/SessionRoutes.ts` | `672-676` | `contentSessionId` check in `handleSummarizeByClaudeId` | Same |
| `src/services/worker/http/routes/SessionRoutes.ts` | `724-728` | `contentSessionId` query-param check in `handleStatusByClaudeId` (GET — query not body; leave) | — |
| `src/services/worker/http/routes/SessionRoutes.ts` | `767-771` | `contentSessionId` check in `handleCompleteByClaudeId` | `SessionEndSchema` post-Plan 09 |
| `src/services/worker/http/routes/SessionRoutes.ts` | `831-835` | `this.validateRequired(req, res, ['contentSessionId'])` in `handleSessionInitByClaudeId` | `SessionStartSchema` post-Plan 09 |
| `src/services/worker/http/routes/SettingsRoutes.ts` | `159-164` | `enabled` boolean type check in `handleToggleMcp` | `ToggleMcpSchema` |
| `src/services/worker/http/routes/SettingsRoutes.ts` | `184-198` | `branch` presence + allowlist check in `handleSwitchBranch` | `SwitchBranchSchema` (`z.enum([...])` handles both presence and allowlist) |
| `src/services/worker/http/routes/MemoryRoutes.ts` | `33-36` | `text` presence + type + non-empty check | `MemorySaveSchema` |
| `src/services/worker/http/routes/BaseRouteHandler.ts` | `54-62` | `validateRequired(req, res, params)` helper method | **Delete entire method.** No caller remains after this phase. Keep `parseIntParam`, `badRequest`, `notFound`, `handleError`, `wrapHandler`. |
Total hand-rolled-validation lines deleted: approximately **125 LOC** across 5 files.
**`SettingsRoutes.validateSettings` at lines 237-385 is NOT deleted** — that is domain validation (provider allowlists, port ranges, URL parse) and stays in the handler as-is. Zod handles only shape. Cite rule D: "per-route validation blocks of 5+ if statements — collapsed to validateBody(schema)" applies to shape blocks; domain blocks are orthogonal and survive.
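
A sketch of how schema-level coercion might replace the `coerceStringArray` and `coercePositiveInteger` helpers. It assumes the helpers accepted either an array or a comma-separated string, and a numeric string for `limit`; that legacy behaviour is not confirmed here and should be checked against the deleted code. `StringArrayField` is a hypothetical name, only a subset of the `BuildCorpusSchema` fields is shown, and `z.coerce` / `.transform` should be verified against the zod version resolved in Phase 1.

```typescript
import { z } from 'zod';

// Accepts a real array, or a comma-separated string that is split and trimmed.
const StringArrayField = z.union([
  z.array(z.string()),
  z.string().transform((s) =>
    s.split(',').map((part) => part.trim()).filter((part) => part.length > 0),
  ),
]);

const BuildCorpusSchema = z.object({
  name: z.string().min(1),
  types: StringArrayField.optional(),
  concepts: StringArrayField.optional(),
  files: StringArrayField.optional(),
  // "5" and 5 both parse to 5; zero, negatives, and non-integers are rejected.
  limit: z.coerce.number().int().positive().optional(),
});
```

With this shape, `BuildCorpusSchema.safeParse({ name: 'c', types: 'a, b', limit: '5' })` succeeds and the handler receives `types` as an array and `limit` as a number, so no coercion code remains in the handler body.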
### (b) Docs
- §3.9 "Deleted" bullet 2: "Per-route hand-rolled validation (Zod middleware replaces)" (05-clean-flowcharts.md:414).
- Bullshit-inventory #37 (05-clean-flowcharts.md:55).
- 06-implementation-plan.md Phase 12 task 4 (line 548): "Delete per-route boilerplate: manual `typeof x !== 'string'` checks, `if (!body.foo) return res.status(400)…`."
- Live line ranges per row in the table above.
### (c) Verification
- `grep -rn "validateRequired" src/services/worker/http/` → **0**.
- `grep -rn "typeof .* !== 'string'" src/services/worker/http/routes/` → **0** for body validation; any surviving matches must be for non-body purposes (e.g., narrowing a union type inside business logic).
- `grep -rn "res.status(400)" src/services/worker/http/routes/` drops significantly (from ~12 to ≤ 2 domain-specific 400s in `SettingsRoutes.validateSettings` path and corpus `404 → 400` edge).
- `grep -rn "coerceStringArray\|coercePositiveInteger" src/` → **0**.
- Happy-path tests for each endpoint: response shape unchanged.
### (d) Anti-pattern guards
- **D**: If a handler still has a `typeof` check on a body field after this phase, the schema is missing a constraint. Fix the schema, not the handler.
- **E**: No fall-through: after `validateBody` accepts, the handler does NOT re-validate the same field. Example: `SwitchBranchSchema` uses `z.enum(['main','beta/7.0','feature/bun-executable'])` — the handler must not re-check `if (!allowedBranches.includes(branch))`.
- **A**: Don't replace `validateRequired` with a similarly-named Zod wrapper. Delete the method outright.
---
## Phase 5 — Delete rate-limit middleware
**Outcome**: The rate limiter at `src/services/worker/http/middleware.ts:45-79` (300 req/min IP map, keyed by `::ffff:127.0.0.1`-normalized IP) is deleted. Bullshit item #39 removed.
### (a) What to implement
Delete the following from `src/services/worker/http/middleware.ts`:
- **Lines 45-50**: comment block + `requestCounts` map + `RATE_LIMIT_WINDOW_MS` + `RATE_LIMIT_MAX_REQUESTS` constants.
- **Lines 52-77**: the `rateLimiter` RequestHandler.
- **Line 79**: `middlewares.push(rateLimiter);`.
Total: **35 LOC deleted from middleware.ts**.
No change needed in `Server.ts`: it registers middleware via `createMiddleware(summarizeRequestBody)` at `src/services/server/Server.ts:156`, which returns the array. Removing the `.push(rateLimiter)` call is sufficient; the caller loops over whatever `createMiddleware` returns.
### (b) Docs
- §3.9 "Deleted" bullet 1: "In-memory rate limiter (300/min IP map) — localhost trust model everywhere else makes this theater" (05-clean-flowcharts.md:413).
- Bullshit-inventory #39 (05-clean-flowcharts.md:57).
- V20 (06-implementation-plan.md:47): "Rate limiter 300/min — Confirmed at `src/services/worker/http/middleware.ts:45-79`. Constants at `:49-50`. Keyed by IP, normalizes `::ffff:127.0.0.1`. Phase 14 deletes."
- 06-implementation-plan.md Phase 14 task 1 (line 612).
- Live file:line: `src/services/worker/http/middleware.ts:45-79`.
### (c) Verification
- `grep -rn "RATE_LIMIT_WINDOW_MS\|RATE_LIMIT_MAX_REQUESTS\|requestCounts\|rateLimiter" src/` → **0 matches**.
- `grep -rn "429" src/services/worker/http/` → **0** (the only 429 in the codebase is the rate limiter; survey the repo with `grep -rn "429" src/` to confirm).
- `curl -s -w "%{http_code}" -o /dev/null http://localhost:37777/api/health` repeated 1000× returns 200 every time — no 429 after request #300.
- Build green: `tsc --noEmit`.
### (d) Anti-pattern guards
- **B** (from 06-implementation-plan.md:623): "Don't re-introduce the rate limiter as a 'config flag'. Localhost trust model is explicit." No `if (settings.rateLimitEnabled)` conditional reintroduction.
- **D**: Do not leave the function in place "commented out" — delete the lines.
- **A**: Do not repurpose the `requestCounts` Map for a "request-counting telemetry" feature. Delete the Map.
---
## Phase 6 — Cache viewer.html and /api/instructions at boot
**Outcome**: The synchronous `readFileSync` call on every `GET /` and `GET /api/instructions` request is replaced by an in-memory `Buffer` loaded once at worker boot.
> **Cache lifecycle contract (Preflight edit 2026-04-22 — reconciliation C10)**: The cached `Buffer` lives for the **lifetime of the worker process** — re-read on every worker boot, never refreshed mid-process. This is the contract plan 12's T1 regression test (SHA-256 of `GET /`) assumes when it mandates re-baselining after every worker restart. If the viewer.html content includes a per-boot bearer-token injection (observation 71147), the Buffer captures that token at constructor time and serves it consistently until the next boot. **Do not** add any hot-reload / file-watcher / TTL cache invalidation. If an operator edits `viewer.html` in place, they must restart the worker to see the change — documented tradeoff, not a regression.
### (a) What to implement
**`/` (viewer.html)** — currently at `src/services/worker/http/routes/ViewerRoutes.ts:54-72`:
Refactor `ViewerRoutes` constructor (currently `src/services/worker/http/routes/ViewerRoutes.ts:19-25`) to resolve + read `viewer.html` once and store as a module-level or instance-level `Buffer`:
```ts
private viewerHtml: Buffer;

constructor(...) {
  super();
  // Resolve and read viewer.html once at boot; the Buffer lives for the whole process.
  const packageRoot = getPackageRoot();
  const candidates = [
    path.join(packageRoot, 'ui', 'viewer.html'),
    path.join(packageRoot, 'plugin', 'ui', 'viewer.html'),
  ];
  const found = candidates.find((candidate) => existsSync(candidate));
  if (!found) throw new Error('Viewer UI not found at boot');
  this.viewerHtml = readFileSync(found); // Buffer
}

// No filesystem access on the request path: serve the cached Buffer.
private handleViewerUI = this.wrapHandler((req, res) => {
  res.setHeader('Content-Type', 'text/html');
  res.send(this.viewerHtml);
});
```
Delete `readFileSync` + `existsSync` calls from inside the request handler (lines 63-71 of current file).
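The boot-time read and its lifecycle contract can be sketched outside the route class. This is a minimal, dependency-free illustration using only Node's `fs`, `os`, and `path` modules; `loadFirstExisting` is a hypothetical helper name, not something in the repo, and the temp-file demo stands in for the real install layout.

```typescript
import { existsSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: probe the install-layout candidates once, at boot.
// It throws instead of falling back, so a missing file fails the boot.
function loadFirstExisting(candidates: readonly string[]): Buffer {
  const found = candidates.find((p) => existsSync(p));
  if (found === undefined) {
    throw new Error(`None of the candidate paths exist: ${candidates.join(", ")}`);
  }
  return readFileSync(found); // no encoding argument, so this returns a Buffer
}

// Demo: the cached Buffer does not change when the file is edited afterwards.
const dir = mkdtempSync(join(tmpdir(), "viewer-cache-"));
const file = join(dir, "viewer.html");
writeFileSync(file, "<html>v1</html>");
const cached = loadFirstExisting([join(dir, "missing.html"), file]);
writeFileSync(file, "<html>v2</html>"); // in-place edit after "boot"
console.log(cached.toString("utf8")); // still the v1 content
```

The in-place edit is invisible to the cached copy, which is the documented tradeoff: an operator has to restart the worker to pick up a changed `viewer.html`.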
**`/api/instructions`** — currently at `src/services/server/Server.ts:202-234`:
The endpoint supports 4 `topic` values × N `operation` values. Option (a): pre-compute the 4 section strings at boot. Option (b): pre-read `SKILL.md` once and read `operations/*.md` lazily (these are rarer).
Recommended: Option (a). At `Server` constructor time, call `loadInstructionContent(undefined, 'all')` once, extract all 4 sections, store as `Record<string, Buffer>`. Store a `Map<string, Buffer>` for `operations/*.md` populated lazily on first hit (or eagerly if the operations directory is small — enumerate at boot).
Preserve path-traversal security: the `operationPath.startsWith(OPERATIONS_BASE_DIR + path.sep)` check at `Server.ts:218` stays. Caching does not bypass validation — the cache key is the already-validated operation name.
Preserve the `ALLOWED_TOPICS` + `ALLOWED_OPERATIONS` allowlist at `Server.ts:207-213`.
Copy-ready pattern: the current `extractInstructionSection` function at `Server.ts:350-359` already partitions content into a `sections` record — that IS the cache structure; just hoist it from per-request to boot.
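For the lazily populated `operations/*.md` map, one possible shape is sketched below. Everything here is an assumption for illustration: the `OperationCache` class, the injected `FileReader`, the `/pkg/operations` base directory, the `.md` suffix handling, and the operation names `search` and `timeline` do not come from the repo. The point is the ordering: allowlist and path-traversal checks run before the cache is consulted, so the cache key is always an already-validated name.

```typescript
import * as path from "node:path";

type FileReader = (absolutePath: string) => Buffer;

// Hypothetical cache for operations/*.md: validate first, then look up or fill.
class OperationCache {
  private readonly cache = new Map<string, Buffer>();
  private readonly baseDir: string;
  private readonly allowed: ReadonlySet<string>;
  private readonly readFile: FileReader;

  constructor(baseDir: string, allowed: ReadonlySet<string>, readFile: FileReader) {
    this.baseDir = path.resolve(baseDir);
    this.allowed = allowed;
    this.readFile = readFile;
  }

  get(operation: string): Buffer {
    // Allowlist check comes before any cache lookup.
    if (!this.allowed.has(operation)) {
      throw new Error(`Operation not allowed: ${operation}`);
    }
    const resolved = path.resolve(this.baseDir, `${operation}.md`);
    // Path-traversal guard in the same spirit as the startsWith check above.
    if (!resolved.startsWith(this.baseDir + path.sep)) {
      throw new Error(`Path escapes operations directory: ${operation}`);
    }
    const hit = this.cache.get(operation);
    if (hit !== undefined) return hit;
    const content = this.readFile(resolved); // runs on the first hit only
    this.cache.set(operation, content);
    return content;
  }
}

// Demo with an in-memory reader that counts how often it is called.
let reads = 0;
const fakeReader: FileReader = (p) => {
  reads += 1;
  return Buffer.from(`content of ${path.basename(p)}`);
};
const ops = new OperationCache("/pkg/operations", new Set(["search", "timeline"]), fakeReader);
ops.get("search");
ops.get("search");
console.log(reads); // 1: the reader ran once for two requests
```

Injecting the reader keeps the sketch testable without touching disk; in the worker it would presumably be `readFileSync`.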
### (b) Docs
- §3.9 "Deleted" bullet 3: "Synchronous file read for `/` and `/api/instructions` (replace with cached `Buffer` loaded at boot)" (05-clean-flowcharts.md:415).
- §3.10 flowchart node HTML: "viewer.html (cached at boot)" (05-clean-flowcharts.md:426).
- 06-implementation-plan.md Phase 14 task 2 (line 613): "Cache `viewer.html` and `/api/instructions` content in memory at boot; serve from `Buffer` instead of `fs.readFile`."
- Live file:line: `src/services/worker/http/routes/ViewerRoutes.ts:54-72` (viewer.html); `src/services/server/Server.ts:202-234` (instructions endpoint); `src/services/server/Server.ts:337-345` (loader); `src/services/server/Server.ts:350-359` (section extractor).
### (c) Verification
- Static file reads happen once at boot: add a `logger.info('WORKER', 'viewer.html cached', { bytes: this.viewerHtml.length })` at constructor time; grep logs after 100 `GET /` requests to confirm the message fires exactly once.
- `lsof -p $(pidof node) | grep viewer.html` at steady-state: either zero (Buffer held in memory, no open FD) or exactly one (memory-mapped).
- `grep -n "readFileSync.*viewer.html\|readFileSync.*SKILL.md\|readFileSync.*operations" src/services/worker/ src/services/server/` → **0** matches inside request handlers (module-scope or constructor-scope matches are fine; per-request matches fail).
- Response body unchanged (byte-for-byte) across a request pair before and after the change.
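The byte-for-byte check in the last bullet can be done by comparing SHA-256 digests of the two response bodies, which is also the form plan 12's T1 regression test is described as using. A minimal sketch with Node's `crypto` module follows; `sha256Hex` and `bodiesMatch` are hypothetical helper names, and the sample bodies are placeholders rather than real viewer output.

```typescript
import { createHash } from "node:crypto";

// Hex SHA-256 of a response body; equal digests mean byte-identical bodies.
function sha256Hex(body: Buffer | string): string {
  return createHash("sha256").update(body).digest("hex");
}

// Compare the body captured before the change with the one captured after.
function bodiesMatch(before: Buffer, after: Buffer): boolean {
  return sha256Hex(before) === sha256Hex(after);
}

const before = Buffer.from("<html>viewer</html>");
const after = Buffer.from("<html>viewer</html>");
console.log(bodiesMatch(before, after)); // true
```

If the served HTML embeds a per-boot value, both captures need to come from the same worker boot, otherwise the digests will differ for reasons unrelated to the refactor.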
### (d) Anti-pattern guards
- **E**: Do not keep the `readFileSync` path "as a fallback" for when the Buffer is undefined. If the file isn't found at boot, throw — fail-fast aligns with global standard #3. No silent fallback.
- **D**: The viewer-path-candidate array at `ViewerRoutes.ts:58-61` is not a duplicate validation — it's install-layout probing. Keep both candidates for boot-time resolution. After the first successful read, the candidate list is discarded.
- **A**: Do not wrap the Buffer in a `StaticFileCache` class. Hold it as a private field on the route class. One field, one assignment.
---
## Phase 7 — Delete oversized-body special handling
**Outcome**: The 5MB JSON parse limit stays (cheap; bullshit item #40 keep-clause). Any `if (body.size > …) specialHandler()` or hand-rolled 413 logic is deleted — Express's built-in 413 from the `express.json({ limit: '5mb' })` middleware is sufficient.
### (a) What to implement
Survey the route files and `middleware.ts` for body-size special handling:
- `src/services/worker/http/middleware.ts:25`: `express.json({ limit: '5mb' })` → **KEEP**. This is the one-line limit per item #40.
- Any handler that inspects `req.body.length`, `req.headers['content-length']`, or returns a custom 413: **DELETE**.
Based on the grep survey in Phase 0, **no custom oversized-body handling currently exists in `src/services/worker/http/`**. This phase is a verification pass confirming absence. If any is discovered during implementation, delete it without replacement — the `express.json()` middleware already emits 413 with `entity.too.large` on oversized bodies.
If any handler catches the Express 413 and remaps it to a different shape, delete the catch — uniform error handling via `BaseRouteHandler.handleError` (`src/services/worker/http/BaseRouteHandler.ts:82-99`) is already in place.
### (b) Docs
- Bullshit-inventory #40 (05-clean-flowcharts.md:58): "JSON parse 5MB limit on every request — Keep (cheap), but delete any special handling for oversized — 413 is fine."
- Live file:line: `src/services/worker/http/middleware.ts:25` (the `express.json` call to preserve).
### (c) Verification
- `grep -rn "413\|'entity.too.large'\|PayloadTooLarge" src/services/worker/http/` → **0 matches in handler code** (framework-internal uses do not appear in our source).
- `grep -rn "content-length\|contentLength\|Content-Length" src/services/worker/http/routes/` → **0** matches in route handlers (header-inspection by handlers is the anti-pattern to find).
- Sending a 6MB body returns Express default 413. Sending a 4MB body round-trips.
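For the 6MB and 4MB probes, it helps to generate bodies of an exact byte size. The sketch below is one way to do that, under two assumptions that are not stated in the plan: the `'5mb'` limit is taken to mean 5 × 1024 × 1024 bytes, and the `text` field is just a hypothetical padding carrier. `makeJsonBody` is an illustrative helper, not repo code.

```typescript
// Assumption: '5mb' is interpreted as binary megabytes (5 * 1024 * 1024 bytes).
const LIMIT_BYTES = 5 * 1024 * 1024;

// Build a JSON body whose UTF-8 size is exactly `targetBytes`.
function makeJsonBody(targetBytes: number): string {
  const envelope = JSON.stringify({ text: "" }); // the empty wrapper
  const padding = targetBytes - Buffer.byteLength(envelope);
  if (padding < 0) throw new Error("targetBytes is smaller than the empty envelope");
  return JSON.stringify({ text: "a".repeat(padding) }); // ASCII, one byte per char
}

const oversized = makeJsonBody(6 * 1024 * 1024); // expected to be rejected with 413
const accepted = makeJsonBody(4 * 1024 * 1024); // expected to round-trip
console.log(Buffer.byteLength(oversized) > LIMIT_BYTES); // true
console.log(Buffer.byteLength(accepted) <= LIMIT_BYTES); // true
```

The two strings can then be sent as `application/json` request bodies with curl or an integration test against any POST route.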
### (d) Anti-pattern guards
- **D**: If a grep hit appears, delete it. Do not "improve" it.
- **A**: Don't add a `RequestSizeGuard` middleware. `express.json({ limit })` already guards.
- **E**: Don't let a handler's try/catch swallow a 413 and remap to 400. The Express error shape for 413 is Express's; uniformity below that boundary is enforced by `handleError`.
---
## Phase 8 — Verification
**Outcome**: Whole §3.9 diagram is reality. All greps clean, route smoke tests pass, deleted-line count matches estimate.
### (a) What to implement
Execute the verification checklist below. This phase does not modify production code; it runs scripts/tests and fixes regressions uncovered.
### (b) Docs
- §3.9 full diagram (05-clean-flowcharts.md:384-410).
- §3.9 "Deleted" block (lines 412-416).
- §3.9 "Kept" block (line 418): "All user-facing routes, SSE, middleware chain, admin endpoints (used by tooling)." — the admin endpoints (`/api/admin/restart`, `/api/admin/shutdown`, `/api/admin/doctor` at `src/services/server/Server.ts:237-330`) are explicitly preserved; item #38 (05-clean-flowcharts.md:56).
- 06-implementation-plan.md Phase 15 (lines 631-656): timer census + grep pass + full test suite.
### (c) Verification checklist
- [ ] **Rate limiter gone**: `grep -rn "RATE_LIMIT_WINDOW_MS\|RATE_LIMIT_MAX_REQUESTS\|requestCounts\|rateLimiter" src/` → **0**.
- [ ] **Zod present**: `grep -rn "^import .* from 'zod'" src/services/worker/http/` → **≥8** matches (middleware + 7 route files with POSTs).
- [ ] **validateBody attached**: `grep -rn "validateBody(" src/services/worker/http/routes/` → **~18** matches (one per schemaed POST/PUT).
- [ ] **validateRequired deleted**: `grep -rn "validateRequired" src/` → **0**.
- [ ] **Static-file reads hoisted**: `grep -rn "readFileSync.*viewer.html" src/services/worker/` → 0 matches inside request handlers; OK in constructor/module-scope.
- [ ] **SSE preserved**: `GET /stream` returns `text/event-stream` with initial `initial_load` event (manual smoke test).
- [ ] **Admin preserved**: `POST /api/admin/doctor` from localhost returns JSON; from non-localhost returns 403 (per `requireLocalhost` at `src/services/worker/http/middleware.ts:121-143`). Used by version-bump per item #38.
- [ ] **Route smoke tests per endpoint (curl or integration suite)**:
  - `GET /` → 200 HTML (from cached Buffer).
  - `GET /health` → 200 JSON `{status:'ok', activeSessions:N}`.
  - `GET /stream` → 200 SSE stream.
  - `POST /api/memory/save` with `{text:""}` → 400 `{error:'validation_failed', fields:...}`.
  - `POST /api/memory/save` with `{text:"hi"}` → 200 `{success:true, id:...}`.
  - `POST /api/corpus` with `{name:"t", query:"hooks"}` → 200 metadata.
  - `POST /api/corpus` with `{}` → 400 validation_failed with `fields.fieldErrors.name`.
  - `POST /api/mcp/toggle` with `{enabled:"yes"}` → 400; `{enabled:true}` → 200.
  - `POST /api/branch/switch` with `{branch:"nonexistent"}` → 400; `{branch:"main"}` → 200.
  - `GET /api/instructions?topic=workflow` → 200 JSON content (served from cache).
  - `POST /api/admin/restart` from localhost → 200 `{status:'restarting'}`.
- [ ] **Build green**: `npm run build` succeeds.
- [ ] **Worker boots**: `npm run build-and-sync` and verify `GET /health` answers within 2s.
- [ ] **Deleted-lines tally**: approximately **35 LOC** (rate limiter, Phase 5) + **~125 LOC** (hand-rolled validation + helpers, Phase 4) + **~9 LOC** (`BaseRouteHandler.validateRequired` method, Phase 4) + **~10 LOC** (per-request `readFileSync`/`existsSync` probes moved to constructor, Phase 6) ≈ **180 LOC gross deleted**, offset by **~60 LOC added** (new `validateBody` + ~24 schemas averaging 2-3 lines each) = **~120 LOC net deletion**.
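The 400 cases in the smoke list expect a body of the form `{error:'validation_failed', fields:...}` with per-field messages under `fields.fieldErrors`. A minimal sketch of a `validateBody` factory that would produce that shape is given below. It is an assumption-laden illustration, not the plan's implementation: the `SchemaLike`, `Req`, and `Res` types are structural stand-ins so the block runs without dependencies, and the stub schema exists only for the demo (it is not a suggestion to fake Zod in the real test suite). A real Zod schema exposes `safeParse` and `error.flatten()` and should fit the same structure.

```typescript
// Structural stand-ins so the sketch runs standalone (assumptions, not repo types).
type FlatErrors = { formErrors: string[]; fieldErrors: Record<string, string[] | undefined> };
type ParseResult<T> =
  | { success: true; data: T }
  | { success: false; error: { flatten(): FlatErrors } };
interface SchemaLike<T> { safeParse(input: unknown): ParseResult<T>; }
interface Req { body: unknown; }
interface Res { status(code: number): Res; json(payload: unknown): void; }

// Hypothetical middleware factory: reject with 400, or replace body with parsed data.
function validateBody<T>(schema: SchemaLike<T>) {
  return (req: Req, res: Res, next: () => void): void => {
    const result = schema.safeParse(req.body);
    if (result.success === false) {
      res.status(400).json({ error: "validation_failed", fields: result.error.flatten() });
      return;
    }
    req.body = result.data;
    next();
  };
}

// Demo-only stub schema requiring a non-empty `name` string.
const nameRequired: SchemaLike<{ name: string }> = {
  safeParse(input: unknown): ParseResult<{ name: string }> {
    const name = typeof input === "object" && input !== null
      ? (input as { name?: unknown }).name
      : undefined;
    if (typeof name === "string" && name.length > 0) return { success: true, data: { name } };
    return {
      success: false,
      error: { flatten: () => ({ formErrors: [], fieldErrors: { name: ["Required"] } }) },
    };
  },
};

// Demo-only response recorder.
function makeRecorder() {
  const record = { code: 0, payload: undefined as unknown };
  const res: Res = {
    status(code) { record.code = code; return res; },
    json(payload) { record.payload = payload; },
  };
  return { res, record };
}

const bad = makeRecorder();
const good = makeRecorder();
const nextCalls = { count: 0 };
const middleware = validateBody(nameRequired);
middleware({ body: {} }, bad.res, () => { nextCalls.count += 1; });
middleware({ body: { name: "t" } }, good.res, () => { nextCalls.count += 1; });
console.log(bad.record.code, nextCalls.count); // 400 1
```

Keeping the rejection path in one factory means route handlers never see an unvalidated body, which is what lets the inline presence checks be deleted rather than rewritten.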
### (d) Anti-pattern guards
- **D** (whole plan): if any verification grep finds unexpected matches, do not "fix forward" — delete the offending code.
- **E**: If a route smoke test fails due to schema over-constraint (e.g., an optional field rejected), **relax the schema, do not re-add a hand-rolled fallback.**
- **A**: Do not add integration tests that fake the Zod surface. Use the installed zod.
---
## Reporting summary
**Phase count**: 8.
**Estimated deletion**: ~180 LOC gross, ~60 LOC added, **~120 LOC net**. Primary deletes: rate limiter (35), hand-rolled validation blocks (125), `validateRequired` helper (9), per-request file-read probing (10). Primary additions: `validateBody.ts` (~40), Zod schemas inline (~60 across 7 files).
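The gross and net figures follow from the per-item estimates; the arithmetic is spelled out below as a quick check. The numbers are the plan's own estimates (with `~60` taken as the added-lines figure), and the variable names are illustrative only.

```typescript
// Per-item deletion estimates from the tally (LOC).
const deletions = [
  35,  // rate limiter
  125, // hand-rolled validation blocks
  9,   // validateRequired helper
  10,  // per-request file-read probing
];
const added = 60; // estimated added lines

const gross = deletions.reduce((sum, n) => sum + n, 0);
const net = gross - added;
console.log(gross, net); // 179 119
```

Rounded, that is the ~180 gross and ~120 net quoted above.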
**Sources consulted**:
- `PATHFINDER-2026-04-21/05-clean-flowcharts.md` (full); §3.9 (lines 382-420) canonical; Part 1 items #37-40 (lines 55-58); Part 2 decisions (lines 65-79).
- `PATHFINDER-2026-04-21/06-implementation-plan.md`: V2 (line 29), V9 (line 36), V20 (line 47); allowed-APIs block (lines 49-55); anti-patterns (line 59); Phase 12 (lines 530-565); Phase 14 (lines 600-627); Phase 15 (lines 631-656).
- `PATHFINDER-2026-04-21/01-flowcharts/http-server-routes.md` (before state).
- Live codebase (11 files): `src/services/worker/http/middleware.ts`, `src/services/worker/http/BaseRouteHandler.ts`, `src/services/worker/http/routes/{ViewerRoutes,SearchRoutes,SessionRoutes,DataRoutes,SettingsRoutes,MemoryRoutes,CorpusRoutes,LogsRoutes}.ts`, `src/services/server/Server.ts`.
- `package.json` (dependencies block lines 111-125) + `npm ls zod` + filesystem probe of `node_modules/zod`.
**Concrete findings**:
- **Zod presence check** (2026-04-22 10:18 PDT): `npm ls zod` returns `(empty)`. `node_modules/zod/package.json` does not exist. Transitively it is NOT shipped — the only zod-adjacent package is `zod-to-json-schema@^3.24.6` at `package.json:124`, which does not pull in `zod` itself. **Phase 1 MUST add `zod` via `npm install zod@^3.x`.** Verified findings block at `06-implementation-plan.md:55` should be updated: "already shipped transitively via `@anthropic-ai/sdk`" is false for this repo (the SDK is `@anthropic-ai/claude-agent-sdk`, not `@anthropic-ai/sdk`).
- **Route-file inventory with validation styles** (8 files, `src/services/worker/http/routes/`):
  - `ViewerRoutes.ts` (116 LOC): GET-only, no body schemas needed.
  - `SearchRoutes.ts` (421 LOC): 1 POST (`/api/context/semantic` at line 41), mostly query-param validation.
  - `SessionRoutes.ts` (958 LOC): 10 POST endpoints per V9 (6 legacy `/sessions/:id/*` at lines 377-382 + 4 under `/api/sessions/*` at lines 385-389, plus `/api/sessions/status` GET). Uses `this.validateRequired` (line 833) and inline `if (!contentSessionId)` checks (lines 570, 674, 726, 769). Post-Plan 09 collapses to 4.
  - `DataRoutes.ts` (562 LOC): 5 POST endpoints. Uses `this.badRequest` + inline `typeof` checks (lines 120-123, 149-163, 203-206). Contains ad-hoc coerce logic (JSON.parse-or-split-by-comma) at lines 145-147, 199-201; Zod `z.preprocess` subsumes this.
  - `SettingsRoutes.ts` (434 LOC): 5 POST endpoints. Has a 148-line **domain-validation** function `validateSettings` (lines 237-385): **preserve**. The shape-validation is inline at lines 161-164, 185-197: **delete**.
  - `MemoryRoutes.ts` (93 LOC): 1 POST. Validation block at lines 33-36. Cleanest single-endpoint pattern in the codebase: **copy-ready template for Phase 3**.
  - `CorpusRoutes.ts` (283 LOC): 5 POST endpoints. Validation at lines 44-51, 238-245 plus two coerce helpers at lines 88-125 (~38 LOC of helper boilerplate deletable).
  - `LogsRoutes.ts` (165 LOC): 1 POST (`/api/logs/clear` at line 102). Minimal body.
- **Static file endpoints**:
  - `GET /` serves `viewer.html`: `ViewerRoutes.ts:54-72` does per-request `readFileSync` over 2 candidate paths. Move to constructor.
  - `GET /api/instructions`: `Server.ts:202-234` does per-request `fs.promises.readFile` via `loadInstructionContent` (line 337). 4 topic sections (extractable at boot) + operation files (lazy-cache OK). Allowlist at `Server.ts:207-213` (`ALLOWED_TOPICS`, `ALLOWED_OPERATIONS`) stays; path-traversal check at line 218 stays.
  - Static assets (`js`, `css`, fonts) served via `express.static(uiDir)` at `middleware.ts:110-112`: **already cached by Express; no change**.
- **Copy-ready snippet locations**:
  - Cleanest single-field validation example to replicate: `CorpusRoutes.ts:238-244` (the `question` check for `QueryCorpus`); this exact shape maps one-to-one onto `QueryCorpusSchema = z.object({ question: z.string().min(1) })`.
  - Cleanest presence check to Zod-ify: `MemoryRoutes.ts:33-36` (the `text` check), which maps to `MemorySaveSchema = z.object({ text: z.string().min(1), title: z.string().optional(), project: z.string().optional() })`.
  - Error-shape template to mirror in `validateBody`: `BaseRouteHandler.ts:82-99` (existing `{error, code, details}` shape); extend it with `fields`.
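The JSON.parse-or-split-by-comma coercion noted for `DataRoutes.ts` in the route-file inventory can be isolated into one pure function. The sketch below is a guess at the behaviour from that description alone, not a copy of the repo code, and `coerceList` is a hypothetical name. With Zod installed, a function like this would be passed as the first argument to `z.preprocess`, with the array schema as the second.

```typescript
// Hypothetical coercion: accept an array, a JSON-array string, or a comma-separated string.
function coerceList(raw: unknown): unknown {
  if (typeof raw !== "string") return raw; // arrays and other values pass through untouched
  try {
    const parsed: unknown = JSON.parse(raw);
    if (Array.isArray(parsed)) return parsed;
  } catch {
    // Not valid JSON: fall through to comma splitting.
  }
  return raw
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

console.log(JSON.stringify(coerceList("[1,2,3]"))); // [1,2,3]
console.log(JSON.stringify(coerceList("a, b,c"))); // ["a","b","c"]
```

Because the function returns `unknown`, the schema that follows it still decides whether the coerced value is acceptable; the coercion itself never rejects input.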
**Confidence + gaps**:
- **High confidence**: rate-limiter deletion (V20 verified exact lines), static-file caching (exact file:line confirmed), validation-block locations (grep returned matching line numbers), BaseRouteHandler method cleanup.
- **Gap 1 — Plan 09 landing order**: This plan assumes the §3.1 4-endpoint SessionRoutes surface is the target. If Plan 09 has not landed when this plan begins Phase 3, the plan must attach schemas to the 10 legacy endpoints (`src/services/worker/http/routes/SessionRoutes.ts:377-389`) and then refactor in lockstep when Plan 09 merges. Coordination required — add `[blocked-on: plan-09]` gate on the Phase 3 PR, or land Plan 09 first.
- **Gap 2 — Zod version lock-in for the whole refactor**: Phase 1 picks the zod 3.x version; if a future phase in another plan wants a zod 4.x-only API, this plan's schemas become incompatible. Mitigation: schemas use only the stable `z.object/string/number/array/enum/boolean/optional/min/int/positive` surface, which is unchanged across 3.x releases and in 4.x. Still, a breaking upgrade must be coordinated here.