server-beta: Phases 4–13 — event pipeline, generation, MCP, compat, Docker, team audit, observability (#2383)
* feat(server-beta): Phase 4 — Postgres event-to-generation-job pipeline

Adds POST /v1/events, /v1/events/batch, GET /v1/jobs/:id, GET /v1/events/:id, and POST /v1/memories on the server-beta runtime, backed by Postgres.

- Event row + outbox generation-job row insert in one withPostgresTransaction.
- BullMQ enqueue happens after commit; enqueue failure leaves the row queued for Phase 3 startup reconciliation.
- ?generate=false skips the outbox; ?wait=true returns queue status only, never observation IDs (provider generation is Phase 5).
- Batch pre-validates all event projectIds against api-key scope before any write; mixed-project batches reject 403 with zero side effects.
- /v1/memories is a direct insert alias — no generator, no outbox.
- Cross-tenant /v1/jobs/:id returns 404 to avoid leaking row existence.
- New PostgresAuthMiddleware reads api_keys by SHA-256 hash; populates req.authContext.teamId/projectId; legacy ServerV1Routes (SQLite, used by worker runtime) is left untouched.
- Tests: unit suite hardened with stubbed pool.query so route registration is safe; integration tests skip cleanly without CLAUDE_MEM_TEST_POSTGRES_URL.

Verification: 87 pass / 1 skip / 0 fail. No new typecheck errors. Required greps for WorkerService and MemoryItemsRepository in src/server/routes/v1 and src/server/runtime return no hits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server-beta): Phase 5 — provider observation generator

Adds independent provider generation under src/server/generation/ with no worker coupling. Server beta can now generate observations end-to-end: event -> outbox -> BullMQ -> provider -> parser -> persisted observation.

- ProviderObservationGenerator orchestrates: lock outbox (queued -> processing), reload agent_event from Postgres (BullMQ payload is advisory only), call provider, hand raw text to processGeneratedResponse, route errors via markGenerationFailed with retryable flag from ServerClassifiedProviderError.
- processGeneratedResponse parses with parseAgentXml, persists via PostgresObservationRepository with deterministic generation_key = generation:v1:{job_id}:{index}:{fingerprint}, links via PostgresObservationSourcesRepository, advances outbox status, appends observation_generation_job_events, audits — all in one withPostgresTransaction. Idempotent on retry via UNIQUE constraints.
- Three provider adapters under src/server/generation/providers/: Claude, Gemini, OpenRouter. Self-contained — no imports from src/services/worker/*. Worker providers unchanged.
- Shared error classification + prompt builder under providers/shared/. Prompt builder strips <private> at the edge; fully-private batches emit <skip_summary /> without billing the provider.
- ActiveServerBetaGenerationWorkerManager wires BullMQ Worker via ServerJobQueue.start(...) with concurrency 1 + autorun:false + worker.on('error') per BullMQ docs.
- New GET /v1/events/:id/observations on ServerV1PostgresRoutes returns observations linked via observation_sources, team/project scoped.

Verification: 104 pass / 4 skip / 0 fail. No typecheck regressions. Anti-pattern greps clean for services/worker imports under src/server, WorkerRef/ActiveSession/SessionStore in src/server/generation.

Deferred: ModeManager loading uses a stable fallback observation type list; summary and reindex queue lanes are not yet wired.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server-beta): Phase 6 — independent server session semantics

server_sessions is now the canonical Server beta session model. Sessions are independent of legacy worker ActiveSession state.

- PostgresServerSessionRepository extended: findByExternalIdForScope, endSession (idempotent via COALESCE(ended_at, now())), markGenerationStarted/Completed/Failed, listUnprocessedEvents (filters agent_events with completed agent_event jobs).
- ServerSessionRuntimeRepository wraps the repo; every method requires explicit team_id + project_id and validates scope via assertProjectOwnership.
- SessionGenerationPolicy supports per-event (default), debounce (BullMQ delayed-job replace via getJob+remove+add), and end-of-session. Configured via CLAUDE_MEM_SERVER_SESSION_POLICY and CLAUDE_MEM_SERVER_SESSION_DEBOUNCE_MS env vars; per-team override hooks are exposed on ServerV1PostgresRoutesOptions for future settings layer.
- POST /v1/sessions/start (find-or-create on (project_id, external_session_id)), GET /v1/sessions/:id (scoped 404), POST /v1/sessions/:id/end (transactional: end + create summary outbox via UNIQUE collapse + enqueue post-commit). Re-ending is fully idempotent.
- processSessionSummaryResponse persists summary as kind='summary' observation with the same idempotency model (generation_key + observation_sources UNIQUE).
- ProviderObservationGenerator dispatches on source_type: agent_event -> processGeneratedResponse, session_summary -> processSessionSummaryResponse; loadEvents handles session-summary by loading unprocessed events.
- ActiveServerBetaGenerationWorkerManager wires summary BullMQ lane alongside event lane (concurrency=1, autorun=false, error listener attached per BullMQ docs).

Verification: 110 pass / 6 skip / 0 fail. Net typecheck error count unchanged at 24 (pre-existing, none in Phase 6 files). Anti-pattern greps clean for ActiveSession/SessionStore in src/server/runtime, no worker imports anywhere in src/server.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server-beta): Phase 7 — hook routing without worker dependency

Hooks can now talk directly to server-beta when CLAUDE_MEM_RUNTIME=server-beta is selected, with a clean worker fallback when server-beta is unhealthy.

- src/services/hooks/server-beta-client.ts — typed HTTP client for /v1/sessions/start, /v1/events, /v1/sessions/:id/end.
  Throws ServerBetaClientError with kind classification (missing_api_key, transport, timeout, http_error, invalid_response) and isFallbackEligible helper. Zero imports from services/worker/.
- src/services/hooks/runtime-selector.ts — reads CLAUDE_MEM_RUNTIME from settings, returns worker or server-beta context, logs [server-beta-fallback] reason=<code> on every config-time fallback.
- src/services/hooks/server-beta-bootstrap.ts — Postgres-backed API key bootstrap. Find-or-creates local-hook-team + local-hook-project, generates cmem_<random> key (SHA-256 hashed), inserts into api_keys with scopes events:write/sessions:write/observations:read/jobs:read. Settings file written with chmod 0600. rotateServerBetaApiKey() wired to a new `claude-mem server keys rotate` command.
- src/cli/handlers/{observation,session-init,summarize}.ts — every hook handler tries server-beta first when configured, falls through to the existing worker path on transport/5xx/429/missing-key. One WARN line per fallback. Hook JSON output shape unchanged.
- src/shared/SettingsDefaultsManager.ts — three new keys with defaults: CLAUDE_MEM_SERVER_BETA_URL, CLAUDE_MEM_SERVER_BETA_API_KEY, CLAUDE_MEM_SERVER_BETA_PROJECT_ID.
- src/npx-cli/commands/install.ts — when installer selects server-beta runtime and CLAUDE_MEM_SERVER_DATABASE_URL is set, bootstraps a local API key automatically. Warns and continues if the DB URL is missing.

plugin/scripts/*.cjs bundles rebuilt via npm run build to pick up the new hook handler code path. No plaintext keys in the bundle (verified).

Verification: 16 hook unit tests pass; 275 server/storage/services tests pass with 7 pre-existing failures (verified independent of this change via git stash --include-untracked). Build clean. No new typecheck errors in Phase 7 files.
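The Phase 7 key bootstrap (a `cmem_<random>` key that is stored only as a SHA-256 hash) can be sketched roughly as follows. This is an illustrative sketch, not the code in server-beta-bootstrap.ts: `generateApiKey`, `hashApiKey`, and the 32-byte key length are hypothetical choices made for the example; only the `node:crypto` calls are real.

```typescript
import { createHash, randomBytes } from 'node:crypto';

interface GeneratedApiKey {
  plaintext: string; // shown to the user once, written to the settings file
  hash: string;      // the only form that would be stored in api_keys
}

// Hex-encoded SHA-256 of the plaintext key.
function hashApiKey(plaintext: string): string {
  return createHash('sha256').update(plaintext, 'utf8').digest('hex');
}

// Assumption: 32 random bytes, base64url-encoded, behind the cmem_ prefix.
function generateApiKey(): GeneratedApiKey {
  const plaintext = `cmem_${randomBytes(32).toString('base64url')}`;
  return { plaintext, hash: hashApiKey(plaintext) };
}
```

On lookup, the middleware side would hash the presented key the same way and compare against the stored hash, so the plaintext never needs to be persisted server-side.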

Anti-pattern guards verified:
- /api/sessions/observations only reached via explicit fallback path
- server-beta runtime never starts the worker process
- API keys live only in ~/.claude-mem/settings.json (chmod 0600), never in the bundle (grep confirmed)
- Worker fallback preserved, observable via single WARN line per call

Deferred: semantic context injection (UserPromptSubmit hook) stays worker-only; server-beta does not yet expose /v1/context/semantic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server-beta): Phase 8 — MCP backed by server-beta core

MCP tools now route through server-beta in server-beta mode while keeping worker-mode search/timeline/get_observations tools fully working.

- src/servers/mcp-server.ts — five new observation_* tools registered: observation_add, observation_record_event, observation_search, observation_context, observation_generation_status. Three memory_* compatibility aliases delegate to the canonical handlers. Worker auto-start is gated when selectRuntime() === 'server-beta' so MCP in server-beta mode never spawns the worker.
- src/services/hooks/server-beta-client.ts — addObservation, searchObservations, contextObservations, getJobStatus added so MCP shares one transport with hooks (Phase 7).
- src/server/routes/v1/ServerV1PostgresRoutes.ts — POST /v1/search and POST /v1/context REST cores backed by PostgresObservationRepository full-text search (GIN tsvector from Phase 1).
- Existing memory_search/timeline/get_observations tools call callWorkerAPI unchanged in worker mode; worker tests unaffected.

Verification: 39 pass / 4 skip / 0 fail on targeted suite. Pre-existing 7 baseline failures verified independent (git stash). No new typecheck errors. WorkerService grep clean across src/servers/mcp-server.ts and src/server/.

Anti-pattern guards verified:
- No duplicate generation logic in MCP — observation_record_event hits /v1/events which owns event+outbox+enqueue inside one tx
- WorkerService not imported anywhere under MCP server-beta path
- No hardcoded worker URLs — all transport via Phase 7 ServerBetaClient
- memory_* aliases retained, single handler per pair

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server-beta): Phase 9 — compatibility adapters without coupling

Legacy /api/sessions/observations and /api/sessions/summarize endpoints keep working on server-beta runtime by translating to AgentEvent and session-end calls — no worker code, no route duplication.

- src/server/services/IngestEventsService.ts — shared event-ingest path used by both /v1/events and the compat adapter. Owns transactional event row + outbox row + lifecycle log + post-commit BullMQ enqueue, honors Phase 6 SessionGenerationPolicy.
- src/server/services/EndSessionService.ts — shared session-end path used by both /v1/sessions/:id/end and the compat adapter. Idempotent ended_at + summary outbox + deterministic summary job id.
- src/server/compat/SessionsObservationsAdapter.ts — translates legacy POST /api/sessions/observations payload (Claude Code transcript shape) -> AgentEvent (source_adapter='claude-code-compat', event_type='tool_use') -> IngestEventsService.ingestOne. Resolves contentSessionId to server_sessions via find-or-create.
- src/server/compat/SessionsSummarizeAdapter.ts — translates legacy POST /api/sessions/summarize -> EndSessionService.end. Preserves the legacy agentId -> {status:'skipped', reason:'subagent_context'} behavior so existing clients see the same response shape.
- src/server/routes/v1/ServerV1PostgresRoutes.ts — refactored to delegate to the new shared services (-203 LoC net) so /v1 and /api compat both call the SAME canonical code path.
- src/server/runtime/ServerBetaService.ts — registers both compat adapters alongside ServerV1PostgresRoutes, sharing service instances.
- docs/server-beta-parity-map.md — full enumeration of legacy /api/* routes labeled native, adapter, or unsupported (with reasons). Viewer read-path adapters explicitly listed as unsupported pending a future viewer-rewrite phase.

Verification: 7 compat tests pass, 6 v1-routes tests still pass (refactor preserved behavior), 4 session-routes tests pass. Pre-existing 16 baseline failures verified independent via git stash. Zero new typecheck errors.

Anti-pattern guards verified:
- No services/worker/http/routes or WorkerService imports under src/server/compat or src/server/runtime
- Compat adapters are thin translators with names ending in *Adapter and a top-of-file comment noting they are legacy compatibility
- /v1/* remains the canonical Server beta API; compat adapters call shared services rather than acting as a parallel API

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server-beta): Phase 10 — Docker stack and deployable runtime

Server beta now ships as a Docker stack with no worker process anywhere and a separate horizontal generation worker for scaling.

- src/server/runtime/create-server-beta-service.ts — validateServerBetaEnv() fails fast on missing CLAUDE_MEM_SERVER_DATABASE_URL, requires CLAUDE_MEM_QUEUE_ENGINE=bullmq in Docker, rejects CLAUDE_MEM_AUTH_MODE=local-dev and CLAUDE_MEM_ALLOW_LOCAL_DEV_BYPASS inside containers (detected via /.dockerenv or CLAUDE_MEM_DOCKER=1). Adds CLAUDE_MEM_GENERATION_DISABLED so the HTTP service can run generator-free.
- src/server/runtime/ServerBetaService.ts — runServerBetaGenerationWorker for the dedicated consumer process; runServerBetaApiKeyCli is a new Postgres-backed `server api-key` command (the legacy worker CLI wrote to SQLite and was invisible to the Postgres runtime); getQueueHealth shim feeds /api/health a consistent ObservationQueueHealth shape.
- src/npx-cli/commands/{runtime,server}.ts — `claude-mem server worker start` subcommand that boots only the BullMQ consumer.
- docker/claude-mem/{Dockerfile,entrypoint.sh} — entrypoint forces CLAUDE_MEM_DOCKER=1 + CLAUDE_MEM_RUNTIME=server-beta and exposes three modes: server (HTTP only, generation disabled), worker (BullMQ consumer), shell. Worker bundle is no longer the default CMD.
- docker-compose.yml — full stack: postgres + valkey + claude-mem-server (HTTP-only) + claude-mem-worker (generation consumer). Wires service-to-service env vars.
- scripts/e2e-server-beta-docker.sh + docker/e2e/server-beta-e2e.mjs — E2E now hits /v1/sessions/start, /v1/events?wait=true, /v1/jobs/:id; asserts no worker-service.cjs process anywhere in the stack; one-shot docker compose run --rm verifies local-dev auth is rejected with the expected stderr; restart-and-verify confirms Postgres durability and BullMQ retry idempotency.
- docs/server.md — full Phase 10 doc: stack diagram, env table, worker mode, auth-in-Docker policy.
- docs/api.md — event generation semantics (wait=true, generationJob).

Verification: full Docker E2E PASSED on live daemon (phase1 + phase2 + restart-and-verify + revoked-key + no-worker-process + local-dev-rejected). Unit tests 292 pass / 9 skip / 7 fail (7 fails pre-existing baseline). Zero new typecheck errors.
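The fail-fast environment checks described for Phase 10 can be illustrated with a small sketch. It is not the body of validateServerBetaEnv(): the helper names (`collectEnvErrors`, `validateEnvSketch`, `isEnabled`), the error wording, and the rule for what counts as an "enabled" bypass flag are assumptions made for the example. Only the environment variable names and the container-detection signals come from the commit message.

```typescript
import { existsSync } from 'node:fs';

type Env = Record<string, string | undefined>;

// Container detection as described above: explicit flag or the Docker marker file.
function isInsideDocker(env: Env): boolean {
  return env.CLAUDE_MEM_DOCKER === '1' || existsSync('/.dockerenv');
}

// Assumption: any value other than unset, '', '0', or 'false' counts as enabled.
function isEnabled(value: string | undefined): boolean {
  return value !== undefined && !['', '0', 'false'].includes(value.toLowerCase());
}

// Collects every violation so the operator sees all problems in one failure.
function collectEnvErrors(env: Env, inDocker: boolean): string[] {
  const errors: string[] = [];
  if (!env.CLAUDE_MEM_SERVER_DATABASE_URL) {
    errors.push('CLAUDE_MEM_SERVER_DATABASE_URL is required');
  }
  if (inDocker) {
    if (env.CLAUDE_MEM_QUEUE_ENGINE !== 'bullmq') {
      errors.push('CLAUDE_MEM_QUEUE_ENGINE must be "bullmq" inside Docker');
    }
    if (env.CLAUDE_MEM_AUTH_MODE === 'local-dev') {
      errors.push('CLAUDE_MEM_AUTH_MODE=local-dev is not allowed inside Docker');
    }
    if (isEnabled(env.CLAUDE_MEM_ALLOW_LOCAL_DEV_BYPASS)) {
      errors.push('CLAUDE_MEM_ALLOW_LOCAL_DEV_BYPASS is not allowed inside Docker');
    }
  }
  return errors;
}

// Throws before any listener or queue consumer is started.
function validateEnvSketch(env: Env, inDocker: boolean = isInsideDocker(env)): void {
  const errors = collectEnvErrors(env, inDocker);
  if (errors.length > 0) {
    throw new Error(`Invalid server-beta environment:\n- ${errors.join('\n- ')}`);
  }
}
```

Passing `inDocker` explicitly keeps the checks testable without a container; a real entrypoint would call something like `validateEnvSketch(process.env)` at startup.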

Anti-pattern guards verified:
- entrypoint never execs worker-service.cjs; E2E greps prove no worker process anywhere in the stack
- validateServerBetaEnv refuses local-dev auth in Docker with explicit remediation message; ALLOW_LOCAL_DEV_BYPASS rejected the same way
- Docker requires CLAUDE_MEM_QUEUE_ENGINE=bullmq; in-process queue rejected at startup
- claude-mem worker / worker-service / WorkerService greps clean in docker/

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server-beta): Phase 11 — team-aware generation with audit chain

Generation jobs now carry team_id/project_id/api_key_id/actor_id/source_adapter from enqueue through execution; the outbox is reloaded from Postgres before any side effect so BullMQ payload can never act as auth authority.

- src/server/jobs/types.ts — ServerGenerationJobPayloadSchema (Zod discriminated union) requires team_id, project_id, generation_job_id, source_adapter, api_key_id, actor_id (nullable), source_type, source_id, plus event_id / server_session_id per kind. assertServerGenerationJobPayload is called at enqueue (outbox.ts) and again at execution boundary.
- src/server/services/{IngestEventsService,EndSessionService}.ts + SessionGenerationPolicy.ts — thread identity context (apiKeyId, actorId, sourceAdapter) into both event and summary BullMQ payloads.
- src/server/generation/ProviderObservationGenerator.ts — loadCanonicalOutbox loads the outbox row WITHOUT scope filter, then compares candidate.team_id/project_id to payload.team_id/project_id; mismatch -> ServerGenerationScopeViolationError (non-retryable), failed status, generation_job.scope_violation audit. isApiKeyRevoked checks api_keys (revoked_at, expires_at, row missing) before any provider call; revoked -> generation_job.revoked_key audit + non-retryable failure. generation_job.processing audit emitted on lock.
- src/server/generation/processGeneratedResponse.ts — generated observations carry team_id/project_id/server_session_id from the reloaded source row (not job payload). observation_sources.metadata records source_adapter, actor_id, api_key_id for traceability. observation.created audit per observation; generation_job.completed audit per terminal transition. All audit rows reference the same generation_job_id in details.
- src/server/routes/v1/ServerV1PostgresRoutes.ts — GET /v1/teams/:id/jobs and GET /v1/projects/:id/jobs with SQL-layer scoping (WHERE team_id=$1 [AND project_id=$2] [AND status=$3]); cross-tenant returns 404 to avoid leaking row existence. Pagination via status/limit/offset. audit_log rows for event.received, event.batch_received, observation.read.
- src/server/compat/{SessionsObservationsAdapter,SessionsSummarizeAdapter}.ts — propagate apiKeyId and sourceAdapter='claude-code-compat'.

Verification: 162 pass / 10 skip / 0 fail. Pre-existing failures in tests/services/queue and tests/services/worker confirmed independent via git stash. Zero new typecheck errors in server-beta files.

Required greps:
  rg "team_id.*req\.body|project_id.*req\.body" src/server -> 0 matches

Audit chain integration test passes — generation_job.processing, observation.created, and generation_job.completed audit rows all share the same generation_job_id reference.
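The Phase 11 rule that the queue payload is never the authority on scope can be sketched as a comparison between the reloaded outbox row and the payload. The types and names below (`ScopedRef`, `ScopeViolationSketchError`, `assertPayloadMatchesOutbox`) are hypothetical stand-ins for illustration, not the classes in ProviderObservationGenerator.ts.

```typescript
interface ScopedRef {
  team_id: string;
  project_id: string;
}

// Stand-in for a non-retryable scope violation error.
class ScopeViolationSketchError extends Error {
  readonly retryable = false;

  constructor(message: string) {
    super(message);
    this.name = 'ScopeViolationSketchError';
  }
}

// The row is loaded from Postgres first (the source of truth); the payload is only
// checked against it. A mismatch aborts before any provider call or write happens.
function assertPayloadMatchesOutbox(outboxRow: ScopedRef, payload: ScopedRef): void {
  const sameTeam = outboxRow.team_id === payload.team_id;
  const sameProject = outboxRow.project_id === payload.project_id;
  if (!sameTeam || !sameProject) {
    throw new ScopeViolationSketchError(
      'Queue payload scope does not match the persisted outbox row',
    );
  }
}
```

Everything downstream of this check would then read team and project identifiers from the reloaded row, so a tampered or stale payload cannot redirect observations into another tenant.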

Anti-pattern guards verified:
- BullMQ payload never acts as auth authority — Postgres outbox reload with mismatch check happens before every side effect
- team_id / project_id never derived from request body for scope decisions; always req.authContext.teamId / projectId
- Application-layer team/project filtering forbidden — listJobsForScope pushes scope into the SQL WHERE clause
- Project-scoped key on cross-project /v1/teams/:id/jobs returns 404
- Revoked api keys cause non-retryable failure with audit before any provider call

Deferred: a redundant generation_job.queued audit_log row (already covered by observation_generation_job_events lifecycle log per Phase 1 schema split). Compat adapters set actor_id=null but propagate api_key_id which is the canonical reference downstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server-beta): Phase 12 — observability and operations

Operators can now inspect, retry, and cancel generation jobs from the CLI; queue lane metrics flow into /api/health and /v1/info; every request gets a stable request_id that flows through HTTP -> audit -> outbox -> generator -> completion log.

- src/server/middleware/request-id.ts — honors safe inbound X-Request-Id, mints uuid v4 otherwise. Set on req.requestId and echoed via response header so external traces can correlate.
- src/server/jobs/ServerJobQueue.ts — QueueEvents wired with completed, failed, progress, stalled, error listeners; lifecycle counters exposed via observe() API. Logs emitted as [generation] job=<id> source_type=<...> duration=<ms> attempts=<N> reason=<message>. Stalled and error counters survive worker restart.
- src/server/jobs/types.ts — ServerGenerationJob payload schema extended with optional request_id; flows through from HTTP into every BullMQ job.
- src/server/queue/ObservationQueueEngine.ts — health snapshot now carries per-lane (event, summary) counts via ObservationQueueHealthLaneSnapshot.
- src/server/runtime/{ActiveServerBetaQueueManager,ActiveServerBetaGenerationWorkerManager,ServerBetaService}.ts — per-lane getJobCounts feed /api/health and /v1/info; stalled events audit through audit_log with action generation_job.stalled.
- src/server/routes/v1/ServerV1PostgresRoutes.ts — GET /v1/jobs (status/source_type/since/limit/offset, scope from api-key, payload stripped unless ?include=payload AND admin scope), POST /v1/jobs/:id/retry (idempotent; queued -> no-op; audit generation_job.retried_by_operator), POST /v1/jobs/:id/cancel (terminal -> no-op; audit generation_job.cancelled_by_operator; generator reload-before-side-effects already prevents double work).
- src/server/services/IngestEventsService.ts + SessionGenerationPolicy.ts + ProviderObservationGenerator.ts — request_id propagated end to end. Generator extracts request_id from BullMQ payload and includes it in lock/processing/completion logs and audit details.
- src/npx-cli/commands/server-jobs.ts + src/npx-cli/commands/server.ts — `claude-mem server jobs status|failed|retry|cancel`. status compares Postgres outbox counts to BullMQ queue counts and surfaces divergence. failed prints attempts + last_error message. --team and --project filters.

Verification: 350 pass / 12 skip / 7 fail (pre-existing baseline, verified independent via git stash). 18 new tests added (request-id middleware, server-jobs CLI seams, jobs list/retry/cancel routes Postgres-gated). Zero new typecheck errors.
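The request-id behavior described for Phase 12 (honor a safe inbound X-Request-Id, otherwise mint a UUID v4, then echo it on the response) can be sketched without any framework dependency. The definition of "safe" used here (a short token of URL-safe characters) is an assumption, as are the `MinimalRequest` / `MinimalResponse` shapes and the function names; the actual middleware in request-id.ts may differ.

```typescript
import { randomUUID } from 'node:crypto';

// Assumption: "safe" means 1-128 characters from a conservative URL-safe set.
const SAFE_REQUEST_ID = /^[A-Za-z0-9._:-]{1,128}$/;

// Returns the inbound id when it is safe to log and echo, otherwise a fresh UUID v4.
function resolveRequestId(inbound: string | string[] | undefined): string {
  const candidate = Array.isArray(inbound) ? inbound[0] : inbound;
  if (typeof candidate === 'string' && SAFE_REQUEST_ID.test(candidate)) {
    return candidate;
  }
  return randomUUID();
}

// Minimal structural types so the sketch does not depend on Express typings.
interface MinimalRequest {
  headers: Record<string, string | string[] | undefined>;
  requestId?: string;
}

interface MinimalResponse {
  setHeader(name: string, value: string): void;
}

// Express-style middleware: attach the id to the request and echo it back.
function requestIdMiddleware(req: MinimalRequest, res: MinimalResponse, next: () => void): void {
  const id = resolveRequestId(req.headers['x-request-id']);
  req.requestId = id;
  res.setHeader('X-Request-Id', id);
  next();
}
```

Rejecting unsafe inbound values matters because the id is later written into logs and audit details; replacing them with a minted id keeps correlation working without trusting arbitrary client input.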

Anti-pattern guards verified:
- agent_events.payload only emitted in /v1/jobs response inside the admin-gated branch (?include=payload + admin scope) — returns 403 otherwise
- jobs retry on a queued row is a no-op (no double BullMQ enqueue, no double UPDATE)
- Every operator action writes to audit_log with the *_by_operator action and request_id correlation in details
- Stalled events audit through generation_job.stalled

Sample correlated trace (one request_id end to end):
  HTTP middleware:                  req.requestId = 'req-abc'
  audit event.received:             details.requestId = 'req-abc'
  BullMQ payload:                   { request_id: 'req-abc', generation_job_id: 'gj_x' }
  generator lock log:               [generation] job locked { jobId, requestId }
  audit generation_job.processing:  details.requestId = 'req-abc'
  completion log:                   [generation] job=evt_... duration=1230ms

Deferred: live /api/health round-trip integration test (needs Redis); stalled event live integration test (needs Redis); storing request_id on the observations row itself (spec did not require).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(server-beta): add Phase 13 release readiness report

Captures the final verification gate: tests (1749 pass, 45 fail all pre-existing baseline, zero regressions), required greps clean, Docker E2E green end-to-end, all 7 exit criteria met, build clean, typecheck unchanged from main. Documents deferred items.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* build(server-beta): rebuild server-beta-service bundle

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server-beta): address Greptile review on PR #2383

- ProviderObservationGenerator.lockOutbox: skip duplicate worker run when another lock is active instead of returning the row, which previously let two BullMQ workers issue the (paid, rate-limited) external provider call before the persistence-layer terminal-status guard collapsed the duplicate.
  Reconciliation still recovers from a stale lock on startup or next retry.
- docker-compose.yml: require POSTGRES_USER/PASSWORD/DB env vars (no defaults). Stack refuses to start without explicit secrets. Added a header warning that the file must not be deployed unmodified.
- e2e-server-beta-docker.sh: export ephemeral test creds for the new required env vars so the Docker E2E driver still runs unattended.
- ServerBetaService api-key list: bound query with LIMIT/OFFSET (default 100, max 500) and add optional --team filter to prevent unintentional cross-tenant key metadata disclosure on shared admin hosts.
- SessionGenerationPolicy: fix dead `??` fallback for NaN parseInt result; use `||` so DEFAULT_DEBOUNCE_MS actually applies.
- ServerV1PostgresRoutes: `?wait=true` now actually waits — polls the outbox row until terminal status (timeout 30s, 100ms interval) on both /v1/events and /v1/events/batch. Returns `waitTimedOut: true` if the cap is hit so callers can re-poll the status endpoints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server-beta): address CodeRabbit + Greptile second review on PR #2383

P1 fixes
- Operator retry endpoint was re-publishing the Postgres outbox metadata column as the BullMQ payload; the worker's assertServerGenerationJobPayload always rejected it, leaving the row stuck in queued until startup reconciliation. Persist the BullMQ payload on the outbox row at create-time inside IngestEventsService and EndSessionService, then re-enqueue that canonical payload on retry.

Major fixes
- prompt-builder: escape server_session_id when interpolating into the XML prompt; previously a session id containing `<`, `&`, or quotes could inject XML into the provider input.
- ServerJobQueue: route both worker.on('stalled') and the QueueEvents 'stalled' subscriber through a single notifyStalled helper that dedupes by jobId for 30s, so counters.stalled increments once per stall.
  QueueEvents 'error' now routes through notifyQueueError so it increments counters.errored and runs onError listeners — keeping observability symmetric across both sources.
- ServerV1PostgresRoutes: convert PostgresObservationRepository from three dynamic imports to a single static import for consistency.
- mcp-server / ServerBetaClient: actually forward the observation_record_event tool's `generate` flag through to the /v1/events endpoint as `?generate=false` instead of voiding it.
- server-sessions.markGenerationFailed: guard jsonb_set against a null error payload so the failure path can't null out metadata before the generation_status='failed' write commits.

Minor fixes
- server-sessions.endSession: keep updated_at stable on repeated calls so the documented idempotency contract holds.
- SettingsDefaultsManager + ServerBetaService.getServerBetaPort: derive the server-beta default port from UID (37877 + uid%100), matching the worker port pattern, so two users on the same host don't collide. Docker stacks always pass CLAUDE_MEM_SERVER_PORT explicitly so the containerized deployment is unaffected.
- server-session-runtime test: close the pg.Pool in afterAll.
- server-beta-release-readiness.md: escape pipes inside table inline code, add `text` language tag to the fenced log block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server-beta): address Greptile + CodeRabbit third review on PR #2383

P1 fixes
- SessionsObservationsAdapter.resolveServerSession: catch unique-violation (23505) on concurrent compat inserts and re-fetch instead of returning 500. Two compat callers carrying the same contentSessionId can both observe `existing===null` and race on the (project_id, external_session_id) unique constraint; the second now resolves to the raced row instead of dropping the event.
- /v1/events/batch: pass `sourceAdapter: null` to ingestBatch so each event's BullMQ payload (and persisted outbox payload column) reflects its own event.sourceAdapter via buildEventBullmqPayload's fallback, rather than stamping the whole batch with the first event's adapter.

Minor
- server-session-runtime test afterEach: wrap DROP SCHEMA in try/finally so client.release() always runs even if the drop throws.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): drop `pool as never` cast — pg.Pool already matches PostgresPool

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server-beta): retry of completed job now 409s instead of duplicating

retryGenerationJob previously fell through to the reset+re-enqueue path when called on a job in `completed` status. The observations index dedupes on (generation_job_id, parsed_observation_index, content) but LLM output is non-deterministic, so a second provider run almost always produced a different content string and bypassed the index, persisting a parallel set of observation rows attributed to the same generation job.

Match cancelGenerationJob's 409 guard for completed jobs. failed and cancelled remain valid retry targets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* build(server-beta): rebuild bundles after rebase onto main

Regenerates the three plugin bundles so they reflect the rebased source state. Mechanical rebuild output only — no source changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server-beta): wrap resolveServerSession in try/catch for structured error response

Greptile P1 on PR #2383: resolveServerSession was called before the try/catch in both compat adapters, so Postgres errors during session lookup (timeout, pool exhaustion, etc.) escaped to Express's default error handler and returned HTML/text 500s.
Legacy clients calling response.json() would get a parse failure instead of the documented { stored: false, reason: 'internal_error' } (or { status: 'error', reason: 'internal_error' } for the summarize adapter) shape.

Move the resolveServerSession call inside the existing try block in both adapters so any failure flows through the structured catch handler.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server-beta): catch 23505 unique violation in POST /v1/sessions/start

Greptile P1 on PR #2383: concurrent requests with the same externalSessionId can both pass the findByExternalIdForScope check, both call repo.create, and the loser hits the (project_id, external_session_id) unique constraint. The handler treated that as an unknown error and returned a 500.

Apply the same pattern resolveServerSession already uses: catch error.code '23505' when externalSessionId is set, refetch the row inserted by the winning request, and return 200 with that session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
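The find-or-create race fix described in the last two commits (catch Postgres error code 23505, then refetch the winner's row) follows a common pattern that can be sketched as below. The `Session` and `SessionRepo` shapes and the function names are hypothetical simplifications, not the repository's actual interfaces; the only external facts relied on are that 23505 is PostgreSQL's unique_violation code and that node-postgres exposes it as `error.code`.

```typescript
interface Session {
  id: string;
  projectId: string;
  externalSessionId: string;
}

// Hypothetical, simplified repository surface for the sketch.
interface SessionRepo {
  find(projectId: string, externalSessionId: string): Promise<Session | null>;
  create(projectId: string, externalSessionId: string): Promise<Session>;
}

const PG_UNIQUE_VIOLATION = '23505';

function isUniqueViolation(error: unknown): boolean {
  return (
    typeof error === 'object' &&
    error !== null &&
    (error as { code?: unknown }).code === PG_UNIQUE_VIOLATION
  );
}

// Two concurrent callers can both see "no row" and both try to insert. The loser
// of the race gets a unique violation and resolves to the row the winner inserted.
async function findOrCreateSession(
  repo: SessionRepo,
  projectId: string,
  externalSessionId: string,
): Promise<Session> {
  const existing = await repo.find(projectId, externalSessionId);
  if (existing) return existing;
  try {
    return await repo.create(projectId, externalSessionId);
  } catch (error) {
    if (!isUniqueViolation(error)) throw error;
    const raced = await repo.find(projectId, externalSessionId);
    if (!raced) throw error; // constraint fired but row is gone: surface the original error
    return raced;
  }
}
```

Any error other than a unique violation is rethrown unchanged, so genuine failures such as timeouts still reach the caller's structured error handling.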
@@ -0,0 +1,212 @@
// SPDX-License-Identifier: Apache-2.0

// Legacy compatibility — new clients should use POST /v1/events directly.
//
// Legacy worker payloads to `/api/sessions/observations` are translated into
// the Server beta event/job model and delegated to IngestEventsService. The
// adapter never touches worker code, never queues observations directly, and
// never uses `src/services/worker/*` types.
//
// Translation rules:
// - `contentSessionId` (Claude Code session UUID) becomes the
//   `external_session_id` of a Server beta `server_sessions` row, scoped to
//   the API key's team and project. The session is create-or-found.
// - The tool-use shape (tool_name, tool_input, tool_response, tool_use_id)
//   is mapped to an `agent_event` with sourceAdapter='claude-code-compat',
//   eventType='tool_use', payload preserves the legacy fields verbatim.
// - The API key MUST be project-scoped. Cross-project compat calls return
//   400; we never let compat traffic bypass project scope.

import type { Application, Request, Response } from 'express';
import { z } from 'zod';
import type { RouteHandler } from '../../services/server/Server.js';
import type { PostgresPool } from '../../storage/postgres/pool.js';
import { PostgresServerSessionsRepository } from '../../storage/postgres/server-sessions.js';
import { logger } from '../../utils/logger.js';
import { requirePostgresServerAuth } from '../middleware/postgres-auth.js';
import { IngestEventsService } from '../services/IngestEventsService.js';
import type { CreatePostgresAgentEventInput } from '../../storage/postgres/agent-events.js';

const COMPAT_SOURCE_ADAPTER = 'claude-code-compat';
const COMPAT_EVENT_TYPE = 'tool_use';

|
||||
const observationsSchema = z.object({
|
||||
contentSessionId: z.string().min(1),
|
||||
tool_name: z.string().min(1),
|
||||
tool_input: z.unknown().optional(),
|
||||
tool_response: z.unknown().optional(),
|
||||
cwd: z.string().optional(),
|
||||
agentId: z.string().optional(),
|
||||
agentType: z.string().optional(),
|
||||
platformSource: z.string().optional(),
|
||||
tool_use_id: z.string().optional(),
|
||||
toolUseId: z.string().optional(),
|
||||
}).passthrough();
|
||||
|
||||
export interface SessionsObservationsAdapterOptions {
  pool: PostgresPool;
  ingestEvents: IngestEventsService;
  authMode?: string;
  allowLocalDevBypass?: boolean;
}

export class SessionsObservationsAdapter implements RouteHandler {
  constructor(private readonly options: SessionsObservationsAdapterOptions) {}

  setupRoutes(app: Application): void {
    const writeAuth = requirePostgresServerAuth(this.options.pool, {
      authMode: this.options.authMode,
      allowLocalDevBypass: this.options.allowLocalDevBypass,
      requiredScopes: ['memories:write'],
    });

    app.post('/api/sessions/observations', writeAuth, this.asyncHandler(async (req, res) => {
      const parsed = observationsSchema.safeParse(req.body);
      if (!parsed.success) {
        res.status(400).json({ error: 'ValidationError', issues: parsed.error.issues });
        return;
      }
      const teamId = req.authContext?.teamId ?? null;
      const projectId = req.authContext?.projectId ?? null;
      if (!teamId) {
        res.status(403).json({ error: 'Forbidden', message: 'API key is not bound to a team' });
        return;
      }
      if (!projectId) {
        // Compat mode requires a project-scoped key — the legacy payload does
        // not carry a Server beta projectId, so without scope we cannot place
        // the row in a tenant-scoped table.
        res.status(400).json({
          error: 'BadRequest',
          message: 'Legacy /api/sessions/observations requires a project-scoped API key',
        });
        return;
      }

      try {
        const session = await resolveServerSession({
          pool: this.options.pool,
          teamId,
          projectId,
          contentSessionId: parsed.data.contentSessionId,
          platformSource: typeof parsed.data.platformSource === 'string' ? parsed.data.platformSource : null,
          agentId: typeof parsed.data.agentId === 'string' ? parsed.data.agentId : null,
          agentType: typeof parsed.data.agentType === 'string' ? parsed.data.agentType : null,
        });

        const toolUseId = typeof parsed.data.tool_use_id === 'string'
          ? parsed.data.tool_use_id
          : (typeof parsed.data.toolUseId === 'string' ? parsed.data.toolUseId : null);

        const input: CreatePostgresAgentEventInput = {
          projectId,
          teamId,
          serverSessionId: session.id,
          sourceAdapter: COMPAT_SOURCE_ADAPTER,
          sourceEventId: toolUseId,
          eventType: COMPAT_EVENT_TYPE,
          payload: {
            contentSessionId: parsed.data.contentSessionId,
            tool_name: parsed.data.tool_name,
            tool_input: parsed.data.tool_input ?? null,
            tool_response: parsed.data.tool_response ?? null,
            cwd: parsed.data.cwd ?? null,
            platformSource: parsed.data.platformSource ?? null,
            agentId: parsed.data.agentId ?? null,
            agentType: parsed.data.agentType ?? null,
            toolUseId,
          },
          metadata: { compat: 'sessions/observations' },
          occurredAt: new Date(),
        };

        const result = await this.options.ingestEvents.ingestOne(input, {
          source: 'http_post_api_sessions_observations',
          apiKeyId: req.authContext?.apiKeyId ?? null,
          actorId: null,
          sourceAdapter: COMPAT_SOURCE_ADAPTER,
        });
        // Legacy response shape — older clients only check `status`.
        res.json({
          status: 'queued',
          observationCount: 1,
          sessionId: session.id,
          serverSessionId: session.id,
          eventId: result.event.id,
          generationJobId: result.outbox?.id ?? null,
          transport: result.enqueueState,
        });
      } catch (error) {
        logger.error('SYSTEM', 'compat observations adapter failed', {
          error: error instanceof Error ? error.message : String(error),
          contentSessionId: parsed.data.contentSessionId,
        });
        res.status(500).json({ stored: false, reason: 'internal_error' });
      }
    }));
  }

  private asyncHandler(fn: (req: Request, res: Response) => Promise<void> | void) {
    return (req: Request, res: Response, next: (err?: unknown) => void): void => {
      Promise.resolve(fn(req, res)).catch(next);
    };
  }
}

/**
 * Look up an existing server_session by (project, team, externalSessionId)
 * or create one if missing. Idempotent: re-issuing for the same content
 * session returns the existing row.
 *
 * Concurrent compat callers can race here — both observe `existing===null`
 * and both call `repo.create`, where the second will hit one of two unique
 * constraints (`(project_id, idempotency_key)` covered by ON CONFLICT, or
 * `(project_id, external_session_id)` which is NOT covered). Catch the
 * unique-violation and re-fetch so the caller never sees a 500.
 */
export async function resolveServerSession(input: {
  pool: PostgresPool;
  teamId: string;
  projectId: string;
  contentSessionId: string;
  platformSource: string | null;
  agentId: string | null;
  agentType: string | null;
}): Promise<{ id: string; projectId: string; teamId: string }> {
  const repo = new PostgresServerSessionsRepository(input.pool);
  const existing = await repo.findByExternalIdForScope({
    externalSessionId: input.contentSessionId,
    projectId: input.projectId,
    teamId: input.teamId,
  });
  if (existing) {
    return { id: existing.id, projectId: existing.projectId, teamId: existing.teamId };
  }
  try {
    const created = await repo.create({
      projectId: input.projectId,
      teamId: input.teamId,
      externalSessionId: input.contentSessionId,
      contentSessionId: input.contentSessionId,
      agentId: input.agentId,
      agentType: input.agentType,
      platformSource: input.platformSource,
    });
    return { id: created.id, projectId: created.projectId, teamId: created.teamId };
  } catch (error) {
    // Postgres unique_violation. A concurrent compat call inserted the row
    // for this (project, external_session_id) before we could; re-fetch
    // and return that row instead of bubbling a 500 to the legacy client.
    if ((error as { code?: string } | null)?.code === '23505') {
      const racedRow = await repo.findByExternalIdForScope({
        externalSessionId: input.contentSessionId,
        projectId: input.projectId,
        teamId: input.teamId,
      });
      if (racedRow) {
        return { id: racedRow.id, projectId: racedRow.projectId, teamId: racedRow.teamId };
      }
    }
    throw error;
  }
}
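The translation rules in this adapter's header comment can be illustrated as a pure function, separate from Express and Postgres. This is a reduced sketch under stated assumptions: `LegacyToolUseBody`, `CompatScope`, `CompatAgentEvent`, and `toCompatEvent` are hypothetical names, the payload carries only a subset of the legacy fields, and `??` replaces the adapter's `typeof` checks because the inputs here are already typed.

```typescript
// Hypothetical, trimmed-down shapes for illustration only.
interface LegacyToolUseBody {
  contentSessionId: string;
  tool_name: string;
  tool_input?: unknown;
  tool_response?: unknown;
  tool_use_id?: string; // snake_case spelling sent by some legacy clients
  toolUseId?: string;   // camelCase spelling sent by others
}

interface CompatScope {
  teamId: string;
  projectId: string;
  serverSessionId: string;
}

interface CompatAgentEvent {
  projectId: string;
  teamId: string;
  serverSessionId: string;
  sourceAdapter: 'claude-code-compat';
  sourceEventId: string | null;
  eventType: 'tool_use';
  payload: Record<string, unknown>;
}

// Map a legacy tool-use body onto an agent_event-like object.
// tool_use_id takes precedence over toolUseId; absent ids become null.
function toCompatEvent(body: LegacyToolUseBody, scope: CompatScope): CompatAgentEvent {
  const toolUseId = body.tool_use_id ?? body.toolUseId ?? null;
  return {
    projectId: scope.projectId,
    teamId: scope.teamId,
    serverSessionId: scope.serverSessionId,
    sourceAdapter: 'claude-code-compat',
    sourceEventId: toolUseId,
    eventType: 'tool_use',
    payload: {
      contentSessionId: body.contentSessionId,
      tool_name: body.tool_name,
      tool_input: body.tool_input ?? null,
      tool_response: body.tool_response ?? null,
      toolUseId,
    },
  };
}
```

Scope (team, project, session) is passed in rather than read from the body, mirroring the rule that compat traffic takes its tenant scope from the API key and never from the legacy payload.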
@@ -0,0 +1,127 @@
// SPDX-License-Identifier: Apache-2.0

// Legacy compatibility — new clients should use POST /v1/sessions/:id/end directly.
//
// Translates the legacy `/api/sessions/summarize` request into a call to
// EndSessionService. The legacy shape carries `contentSessionId` and an
// optional `last_assistant_message`; we resolve the server_session by
// (team, project, external_session_id=contentSessionId), then end it.
//
// Re-summarizing the same session collapses to the same outbox row because
// the (team_id, project_id, source_type='session_summary', source_id)
// UNIQUE constraint stays in force — exactly the same idempotency guarantee
// as `/v1/sessions/:id/end`.

import type { Application, Request, Response } from 'express';
import { z } from 'zod';
import type { RouteHandler } from '../../services/server/Server.js';
import type { PostgresPool } from '../../storage/postgres/pool.js';
import { PostgresServerSessionsRepository } from '../../storage/postgres/server-sessions.js';
import { logger } from '../../utils/logger.js';
import { requirePostgresServerAuth } from '../middleware/postgres-auth.js';
import { EndSessionService } from '../services/EndSessionService.js';
import { resolveServerSession } from './SessionsObservationsAdapter.js';

const summarizeSchema = z.object({
  contentSessionId: z.string().min(1),
  last_assistant_message: z.string().optional(),
  agentId: z.string().optional(),
  platformSource: z.string().optional(),
}).passthrough();

export interface SessionsSummarizeAdapterOptions {
  pool: PostgresPool;
  endSession: EndSessionService;
  authMode?: string;
  allowLocalDevBypass?: boolean;
}

export class SessionsSummarizeAdapter implements RouteHandler {
  constructor(private readonly options: SessionsSummarizeAdapterOptions) {}

  setupRoutes(app: Application): void {
    const writeAuth = requirePostgresServerAuth(this.options.pool, {
      authMode: this.options.authMode,
      allowLocalDevBypass: this.options.allowLocalDevBypass,
      requiredScopes: ['memories:write'],
    });

    app.post('/api/sessions/summarize', writeAuth, this.asyncHandler(async (req, res) => {
      const parsed = summarizeSchema.safeParse(req.body);
      if (!parsed.success) {
        res.status(400).json({ error: 'ValidationError', issues: parsed.error.issues });
        return;
      }
      const teamId = req.authContext?.teamId ?? null;
      const projectId = req.authContext?.projectId ?? null;
      if (!teamId) {
        res.status(403).json({ error: 'Forbidden', message: 'API key is not bound to a team' });
        return;
      }
      if (!projectId) {
        res.status(400).json({
          error: 'BadRequest',
          message: 'Legacy /api/sessions/summarize requires a project-scoped API key',
        });
        return;
      }

      // Subagent contexts in legacy code emit summarize calls but the worker
      // skipped them. We preserve the legacy semantics so existing clients
      // see the same response shape.
      if (parsed.data.agentId) {
        res.json({ status: 'skipped', reason: 'subagent_context' });
        return;
      }

      try {
        const session = await resolveServerSession({
          pool: this.options.pool,
          teamId,
          projectId,
          contentSessionId: parsed.data.contentSessionId,
          platformSource: typeof parsed.data.platformSource === 'string' ? parsed.data.platformSource : null,
          agentId: null,
          agentType: null,
        });

        const result = await this.options.endSession.end({
          sessionId: session.id,
          projectId,
          teamId,
          source: 'http_post_api_sessions_summarize',
          apiKeyId: req.authContext?.apiKeyId ?? null,
          actorId: null,
          sourceAdapter: 'claude-code-compat',
        });
        if (!result.session) {
          res.status(404).json({ status: 'not_found', reason: 'session_not_found' });
          return;
        }
        res.json({
          status: 'queued',
          sessionId: session.id,
          serverSessionId: session.id,
          generationJobId: result.outbox?.id ?? null,
          transport: result.enqueueState,
        });
      } catch (error) {
        logger.error('SYSTEM', 'compat summarize adapter failed', {
          error: error instanceof Error ? error.message : String(error),
          contentSessionId: parsed.data.contentSessionId,
        });
        res.status(500).json({ status: 'error', reason: 'internal_error' });
      }
    }));
  }

  private asyncHandler(fn: (req: Request, res: Response) => Promise<void> | void) {
    return (req: Request, res: Response, next: (err?: unknown) => void): void => {
      Promise.resolve(fn(req, res)).catch(next);
    };
  }
}

// Side-effect import so PostgresServerSessionsRepository symbol is reachable
// even when tree-shaking is aggressive in the main bundle.
void PostgresServerSessionsRepository;
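The idempotency guarantee stated in the summarize adapter's header comment (re-summarizing the same session collapses to the same outbox row) can be modelled with a keyed map. This is a sketch only: `InMemoryOutbox`, `OutboxRow`, and `upsertSummaryJob` are hypothetical names, and the map key stands in for the `(team_id, project_id, source_type, source_id)` UNIQUE constraint that the real table enforces in Postgres.

```typescript
// Hypothetical outbox row, reduced to the columns named in the constraint.
interface OutboxRow {
  id: string;
  teamId: string;
  projectId: string;
  sourceType: 'session_summary';
  sourceId: string;
}

// In-memory stand-in for the outbox table (assumption, not the real storage).
class InMemoryOutbox {
  private rows = new Map<string, OutboxRow>();

  // Returns the existing row when the composite key is already present,
  // otherwise inserts a new one; repeat calls never create duplicates.
  upsertSummaryJob(teamId: string, projectId: string, sessionId: string): OutboxRow {
    const key = [teamId, projectId, 'session_summary', sessionId].join('|');
    const existing = this.rows.get(key);
    if (existing) return existing;
    const row: OutboxRow = {
      id: `job-${this.rows.size + 1}`,
      teamId,
      projectId,
      sourceType: 'session_summary',
      sourceId: sessionId,
    };
    this.rows.set(key, row);
    return row;
  }

  get size(): number {
    return this.rows.size;
  }
}
```

Because the key includes team and project, the same session id under a different tenant scope would produce a separate row, which matches the tenant-scoped wording of the constraint.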
@@ -0,0 +1,538 @@
// SPDX-License-Identifier: Apache-2.0

import type { Job } from 'bullmq';
import { logger } from '../../utils/logger.js';
import { PostgresAgentEventsRepository } from '../../storage/postgres/agent-events.js';
import { PostgresObservationGenerationJobRepository } from '../../storage/postgres/generation-jobs.js';
import { PostgresProjectsRepository } from '../../storage/postgres/projects.js';
import { PostgresAuthRepository } from '../../storage/postgres/auth.js';
import type { PostgresPool } from '../../storage/postgres/pool.js';
import type { PostgresObservationGenerationJob } from '../../storage/postgres/generation-jobs.js';
import {
  assertServerGenerationJobPayload,
  ServerGenerationJobPayloadValidationError,
  type ServerGenerationJobPayload,
} from '../jobs/types.js';
import { ServerClassifiedProviderError } from './providers/shared/error-classification.js';
import type { ServerGenerationProvider } from './providers/shared/types.js';
import {
  markGenerationFailed,
  processGeneratedResponse,
  processSessionSummaryResponse,
  type ProcessGeneratedResponseOutcome,
} from './processGeneratedResponse.js';
import { PostgresServerSessionsRepository } from '../../storage/postgres/server-sessions.js';

// Phase 11 — sentinel exception class so the worker can distinguish
// scope-violation/revoked-key failures from generic processor errors and
// audit them under the right action. Marked non-retryable: an attacker who
// tampered with a payload should never be retried into the queue.
export class ServerGenerationScopeViolationError extends Error {
  readonly reason: 'scope_mismatch' | 'revoked_key';
  constructor(reason: 'scope_mismatch' | 'revoked_key', message: string) {
    super(message);
    this.reason = reason;
  }
}

// ProviderObservationGenerator is the BullMQ Worker processor for server-beta
// observation generation. It does the following on every job invocation:
//
// 1. Reload the Postgres outbox row and the source agent_events row.
// 2. Lock the outbox by transitioning queued -> processing.
// 3. Call the provider with a fully-reloaded ServerGenerationContext.
//    BullMQ payload data is advisory only.
// 4. Hand the raw response to processGeneratedResponse, which persists +
//    links + advances outbox in one Postgres transaction.
// 5. On provider/parse error, route through markGenerationFailed which
//    decides retry vs final failure based on attempt count + error class.
//
// Anti-pattern guards verified at the boundary:
// - no imports from src/services/worker/*
// - no use of WorkerRef / ActiveSession / SessionStore
// - no assumption of Claude Code transcript shape

export interface ProviderObservationGeneratorOptions {
  pool: PostgresPool;
  provider: ServerGenerationProvider;
  workerId?: string;
}

export class ProviderObservationGenerator {
  constructor(private readonly options: ProviderObservationGeneratorOptions) {}

  /**
   * Worker entrypoint. Returns a small JSON summary on success so BullMQ's
   * completed-state telemetry has something to inspect, but Postgres remains
   * canonical authority.
   */
  async process(
    job: Job<ServerGenerationJobPayload>,
  ): Promise<{ jobId: string; status: 'completed'; observationCount: number }> {
    const correlationId = `bullmq:${job.id ?? '?'}`;
    // Phase 12 — pivot id captured up front so every log line in this
    // dispatch carries the same identifier whether or not we manage to
    // load the canonical row. requestId comes from payload (HTTP middleware).
    const payloadRequestId = (job.data as { request_id?: string | null } | undefined)?.request_id ?? null;

    // Phase 11 — validate the BullMQ payload against the discriminated-union
    // schema BEFORE doing anything else. A malformed payload (missing
    // team_id, project_id, generation_job_id, etc.) means the enqueue path
    // bypassed the boundary contract; we refuse to run it. Throwing surfaces
    // it on BullMQ's failed list with a clear message.
    let payload: ServerGenerationJobPayload;
    try {
      payload = assertServerGenerationJobPayload(job.data);
    } catch (error) {
      if (error instanceof ServerGenerationJobPayloadValidationError) {
        logger.error('SYSTEM', 'rejecting malformed job payload at execution', {
          correlationId,
          issues: error.issues,
        });
      }
      throw error;
    }

    if (payload.kind !== 'event' && payload.kind !== 'event-batch' && payload.kind !== 'summary') {
      logger.warn('SYSTEM', 'unsupported job kind for ProviderObservationGenerator', {
        correlationId,
        kind: payload.kind,
      });
      throw new Error(`unsupported job kind: ${payload.kind}`);
    }

    // Phase 11 — anti-bypass guard. We MUST NOT trust BullMQ payload data
    // for tenant scope. Reload the canonical outbox row keyed by id only
    // (no scope filter), then compare its team_id/project_id to the
    // payload's. A mismatch indicates payload tampering or a programmer
    // bug; either way we audit and refuse.
    const candidate = await this.loadCanonicalOutbox(payload.generation_job_id);
    if (!candidate) {
      logger.info('SYSTEM', 'job row not found by id; nothing to do', {
        correlationId,
        generationJobId: payload.generation_job_id,
      });
      return { jobId: payload.generation_job_id, status: 'completed', observationCount: 0 };
    }
    if (candidate.teamId !== payload.team_id || candidate.projectId !== payload.project_id) {
      const violation = new ServerGenerationScopeViolationError(
        'scope_mismatch',
        `BullMQ payload team/project does not match outbox row (jobId=${payload.generation_job_id})`,
      );
      await this.auditScopeViolation(payload, candidate, violation, correlationId);
      // Tag the row as failed so subsequent retries do not pick it up.
      await markGenerationFailed({
        pool: this.options.pool,
        job: candidate,
        reason: violation.message,
        classification: 'scope_mismatch',
        retryable: false,
        ...(this.options.workerId !== undefined ? { workerId: this.options.workerId } : {}),
      });
      throw violation;
    }

    // Phase 11 — revocation check. If the api_key that initiated this job
    // was revoked between enqueue and execution, do not generate. Audit
    // and fail without retry.
    if (payload.api_key_id) {
      const revoked = await this.isApiKeyRevoked(payload.api_key_id);
      if (revoked) {
        const violation = new ServerGenerationScopeViolationError(
          'revoked_key',
          `api key ${payload.api_key_id} is revoked; refusing to generate for outbox ${candidate.id}`,
        );
        await this.auditRevokedKey(payload, candidate, violation, correlationId);
        await markGenerationFailed({
          pool: this.options.pool,
          job: candidate,
          reason: violation.message,
          classification: 'revoked_key',
          retryable: false,
          ...(this.options.workerId !== undefined ? { workerId: this.options.workerId } : {}),
        });
        throw violation;
      }
    }

    const fresh = await this.lockOutbox(payload.generation_job_id, payload.team_id, payload.project_id);
    if (!fresh) {
      logger.info('SYSTEM', 'job no longer exists or is in terminal status; nothing to do', {
        correlationId,
        generationJobId: payload.generation_job_id,
      });
      return { jobId: payload.generation_job_id, status: 'completed', observationCount: 0 };
    }

    // Phase 11 — emit "processing started" audit so we have a row even if
    // the provider crashes before completion.
    // Phase 12 — log+audit carry the same job_id / request_id so support
    // can pivot from BullMQ id -> outbox id -> originating HTTP request.
    logger.info('SYSTEM', `[generation] job locked for processing`, {
      correlationId,
      jobId: fresh.id,
      bullmqJobId: job.id ?? null,
      requestId: payloadRequestId,
      sourceType: fresh.sourceType,
      attempt: fresh.attempts,
    });
    await this.auditEvent({
      teamId: fresh.teamId,
      projectId: fresh.projectId,
      apiKeyId: payload.api_key_id,
      actorId: payload.actor_id,
      action: 'generation_job.processing',
      resourceId: fresh.id,
      details: {
        sourceType: fresh.sourceType,
        sourceId: fresh.sourceId,
        sourceAdapter: payload.source_adapter,
        attempt: fresh.attempts,
        correlationId,
        requestId: payloadRequestId,
      },
    });

    try {
      const events = await this.loadEvents(fresh, payload);
      const project = await this.loadProject(fresh);

      const result = await this.options.provider.generate({
        job: fresh,
        events,
        project: {
          projectId: fresh.projectId,
          teamId: fresh.teamId,
          serverSessionId: fresh.serverSessionId,
          projectName: project?.name ?? null,
        },
      });

      const persistInput = {
        pool: this.options.pool,
        job: fresh,
        rawText: result.rawText,
        modelId: result.modelId,
        providerLabel: result.providerLabel,
        // Phase 11 — flow identity context from BullMQ payload into the
        // persistence layer so observations and audit rows carry the same
        // generation_job_id reference back through to the original API key.
        apiKeyId: payload.api_key_id,
        actorId: payload.actor_id,
        sourceAdapter: payload.source_adapter,
        ...(this.options.workerId !== undefined ? { workerId: this.options.workerId } : {}),
      };
      const outcome: ProcessGeneratedResponseOutcome = fresh.sourceType === 'session_summary'
        ? await processSessionSummaryResponse(persistInput)
        : await processGeneratedResponse(persistInput);

      if (outcome.kind === 'parse_error') {
        await markGenerationFailed({
          pool: this.options.pool,
          job: fresh,
          reason: outcome.reason,
          classification: 'parse_error',
          retryable: false,
          ...(this.options.workerId !== undefined ? { workerId: this.options.workerId } : {}),
        });
        throw new Error(`generation parse error: ${outcome.reason}`);
      }

      logger.info('SYSTEM', 'generation completed', {
        correlationId,
        jobId: outcome.jobId,
        bullmqJobId: job.id ?? null,
        requestId: payloadRequestId,
        observationCount: outcome.observations.length,
        privateContentDetected: outcome.privateContentDetected,
      });

      return {
        jobId: outcome.jobId,
        status: 'completed',
        observationCount: outcome.observations.length,
      };
    } catch (error) {
      const classified = error instanceof ServerClassifiedProviderError ? error : null;
      const retryable = classified
        ? classified.kind === 'transient' || classified.kind === 'rate_limit'
        : false;
      await markGenerationFailed({
        pool: this.options.pool,
        job: fresh,
        reason: error instanceof Error ? error.message : String(error),
        classification: classified?.kind ?? 'unknown',
        retryable,
        ...(this.options.workerId !== undefined ? { workerId: this.options.workerId } : {}),
      });
      throw error;
    }
  }

  // Phase 11 — load the outbox row by id WITHOUT a scope filter so we can
  // compare its team_id/project_id to the BullMQ payload as a tampering
  // detector. Authoritative scope decisions still come from this row, NEVER
  // from the BullMQ payload.
  private async loadCanonicalOutbox(jobId: string): Promise<PostgresObservationGenerationJob | null> {
    const result = await this.options.pool.query<{
      id: string;
      project_id: string;
      team_id: string;
      agent_event_id: string | null;
      source_type: 'agent_event' | 'session_summary' | 'observation_reindex';
      source_id: string;
      server_session_id: string | null;
      job_type: string;
      status: 'queued' | 'processing' | 'completed' | 'failed' | 'cancelled';
      idempotency_key: string;
      bullmq_job_id: string | null;
      attempts: number;
      max_attempts: number;
      next_attempt_at: Date | null;
      locked_at: Date | null;
      locked_by: string | null;
      completed_at: Date | null;
      failed_at: Date | null;
      cancelled_at: Date | null;
      last_error: unknown;
      payload: unknown;
      created_at: Date;
      updated_at: Date;
    }>(
      'SELECT * FROM observation_generation_jobs WHERE id = $1',
      [jobId],
    );
    const row = result.rows[0];
    if (!row) return null;
    return {
      id: row.id,
      projectId: row.project_id,
      teamId: row.team_id,
      agentEventId: row.agent_event_id,
      sourceType: row.source_type,
      sourceId: row.source_id,
      serverSessionId: row.server_session_id,
      jobType: row.job_type,
      status: row.status,
      idempotencyKey: row.idempotency_key,
      bullmqJobId: row.bullmq_job_id,
      attempts: row.attempts,
      maxAttempts: row.max_attempts,
      nextAttemptAtEpoch: row.next_attempt_at?.getTime() ?? null,
      lockedAtEpoch: row.locked_at?.getTime() ?? null,
      lockedBy: row.locked_by,
      completedAtEpoch: row.completed_at?.getTime() ?? null,
      failedAtEpoch: row.failed_at?.getTime() ?? null,
      cancelledAtEpoch: row.cancelled_at?.getTime() ?? null,
      lastError: row.last_error && typeof row.last_error === 'object'
        ? (row.last_error as Record<string, unknown>)
        : null,
      payload: row.payload && typeof row.payload === 'object' && !Array.isArray(row.payload)
        ? (row.payload as Record<string, unknown>)
        : {},
      createdAtEpoch: row.created_at.getTime(),
      updatedAtEpoch: row.updated_at.getTime(),
    };
  }

  private async isApiKeyRevoked(apiKeyId: string): Promise<boolean> {
    const result = await this.options.pool.query<{ revoked_at: Date | null; expires_at: Date | null }>(
      'SELECT revoked_at, expires_at FROM api_keys WHERE id = $1',
      [apiKeyId],
    );
    const row = result.rows[0];
    if (!row) {
      // The key was deleted entirely. Treat as revoked.
      return true;
    }
    if (row.revoked_at) return true;
    if (row.expires_at && row.expires_at.getTime() <= Date.now()) return true;
    return false;
  }

  private async auditScopeViolation(
    payload: ServerGenerationJobPayload,
    canonical: PostgresObservationGenerationJob,
    error: ServerGenerationScopeViolationError,
    correlationId: string,
  ): Promise<void> {
    logger.error('SYSTEM', 'BullMQ payload scope mismatch — refusing to generate', {
      correlationId,
      generationJobId: payload.generation_job_id,
      payloadTeamId: payload.team_id,
      payloadProjectId: payload.project_id,
      canonicalTeamId: canonical.teamId,
      canonicalProjectId: canonical.projectId,
    });
    await this.auditEvent({
      teamId: canonical.teamId,
      projectId: canonical.projectId,
      apiKeyId: payload.api_key_id,
      actorId: payload.actor_id,
      action: 'generation_job.scope_violation',
      resourceId: canonical.id,
      details: {
        reason: 'scope_mismatch',
        message: error.message,
        payloadTeamId: payload.team_id,
        payloadProjectId: payload.project_id,
        canonicalTeamId: canonical.teamId,
        canonicalProjectId: canonical.projectId,
        sourceAdapter: payload.source_adapter,
        correlationId,
      },
    });
  }

  private async auditRevokedKey(
    payload: ServerGenerationJobPayload,
    canonical: PostgresObservationGenerationJob,
    error: ServerGenerationScopeViolationError,
    correlationId: string,
  ): Promise<void> {
    logger.warn('SYSTEM', 'api key revoked between enqueue and execute — refusing to generate', {
      correlationId,
      generationJobId: payload.generation_job_id,
      apiKeyId: payload.api_key_id,
    });
    await this.auditEvent({
      teamId: canonical.teamId,
      projectId: canonical.projectId,
      apiKeyId: payload.api_key_id,
      actorId: payload.actor_id,
      action: 'generation_job.revoked_key',
      resourceId: canonical.id,
      details: {
        reason: 'revoked_key',
        message: error.message,
        sourceAdapter: payload.source_adapter,
        correlationId,
      },
    });
  }

  private async auditEvent(input: {
    teamId: string | null;
    projectId: string | null;
    apiKeyId: string | null;
    actorId: string | null;
    action: string;
    resourceId: string | null;
    details?: Record<string, unknown>;
  }): Promise<void> {
    try {
      const repo = new PostgresAuthRepository(this.options.pool);
      await repo.createAuditLog({
        teamId: input.teamId,
        projectId: input.projectId,
        actorId: input.actorId,
        apiKeyId: input.apiKeyId,
        action: input.action,
        resourceType: 'observation_generation_job',
        resourceId: input.resourceId,
        details: input.details ?? {},
      });
    } catch (auditError) {
      logger.warn('SYSTEM', 'audit_log insert failed in ProviderObservationGenerator', {
        action: input.action,
        error: auditError instanceof Error ? auditError.message : String(auditError),
      });
    }
  }

  private async lockOutbox(
    jobId: string,
    teamId: string,
    projectId: string,
  ): Promise<PostgresObservationGenerationJob | null> {
    const repo = new PostgresObservationGenerationJobRepository(this.options.pool);
    const current = await repo.getByIdForScope({ id: jobId, projectId, teamId });
    if (!current) {
      return null;
    }
    if (current.status === 'completed' || current.status === 'cancelled' || current.status === 'failed') {
      return null;
    }
    if (current.status === 'processing') {
      // Another worker holds the lock — most commonly this fires when BullMQ
      // redelivers a stalled job to a second worker while the first is still
      // mid-`provider.generate()`. Returning the row here would cause both
      // workers to issue the (paid, rate-limited) external provider call,
      // and the persistence-level terminal-status guard only collapses the
      // duplicate after the call has already happened. Skip instead. If the
      // first worker truly died, `reconcileOnStartup` (and the next BullMQ
      // retry) will resurrect the row.
      logger.info('SYSTEM', 'generation job already in processing; skipping duplicate worker run', {
        jobId: current.id,
        lockedBy: current.lockedBy,
        lockedAtEpoch: current.lockedAtEpoch,
        attempts: current.attempts,
      });
      return null;
    }
    const transitioned = await repo.transitionStatus({
      id: current.id,
      projectId: current.projectId,
      teamId: current.teamId,
      status: 'processing',
      lockedBy: this.options.workerId ?? 'server-beta-worker',
    });
    return transitioned;
  }

  private async loadEvents(
    job: PostgresObservationGenerationJob,
    payload: ServerGenerationJobPayload,
  ): Promise<NonNullable<Awaited<ReturnType<PostgresAgentEventsRepository['getByIdForScope']>>>[]> {
    const repo = new PostgresAgentEventsRepository(this.options.pool);

    type Event = NonNullable<Awaited<ReturnType<PostgresAgentEventsRepository['getByIdForScope']>>>;

    if (job.sourceType === 'session_summary') {
      // Summary jobs feed the provider every event tied to the server_session
      // that hasn't already been collapsed into a completed event-generation
      // job. The session repo enforces tenant scope inside its WHERE clause.
      if (!job.serverSessionId) return [];
      const sessions = new PostgresServerSessionsRepository(this.options.pool);
      const events = await sessions.listUnprocessedEvents({
        serverSessionId: job.serverSessionId,
        projectId: job.projectId,
        teamId: job.teamId,
      });
      return events;
    }

    if (job.sourceType !== 'agent_event') {
      return [];
    }

    if (payload.kind === 'event') {
      const event = await repo.getByIdForScope({
        id: payload.agent_event_id,
        projectId: job.projectId,
        teamId: job.teamId,
      });
      return event ? [event] : [];
    }

    if (payload.kind === 'event-batch') {
      const out: Event[] = [];
      for (const id of payload.agent_event_ids) {
        const event = await repo.getByIdForScope({
|
||||
id,
|
||||
projectId: job.projectId,
|
||||
teamId: job.teamId,
|
||||
});
|
||||
if (event) out.push(event);
|
||||
}
|
||||
return out;
|
||||
}
|
||||
|
||||
return [];
|
||||
}
|
||||
|
||||
private async loadProject(job: PostgresObservationGenerationJob) {
|
||||
const repo = new PostgresProjectsRepository(this.options.pool);
|
||||
return await repo.getByIdForTeam(job.projectId, job.teamId);
|
||||
}
|
||||
}
|
||||
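The order of checks in `lockOutbox` can be read as a small decision table. The sketch below restates it as a pure function so the skip reasons are easy to see and test. `JobStatus`, `ClaimDecision`, and `decideClaim` are hypothetical names introduced only for illustration; they are not part of the repository. The sketch also assumes, as the comments in the response processor suggest, that `transitionStatus` guards the `queued -> processing` move in its WHERE clause, since a read followed by a write is not atomic on its own.

```typescript
// Hypothetical types for illustration; not the repository's API.
type JobStatus = 'queued' | 'processing' | 'completed' | 'cancelled' | 'failed';

type ClaimDecision =
  | { action: 'skip'; reason: 'not_found' | 'terminal' | 'already_processing' }
  | { action: 'claim' };

// Mirrors the order of checks in lockOutbox: missing row, terminal status,
// an existing lock, and only then the transition to 'processing'.
function decideClaim(status: JobStatus | null): ClaimDecision {
  if (status === null) {
    return { action: 'skip', reason: 'not_found' };
  }
  if (status === 'completed' || status === 'cancelled' || status === 'failed') {
    return { action: 'skip', reason: 'terminal' };
  }
  if (status === 'processing') {
    // A second worker must not repeat the paid provider call.
    return { action: 'skip', reason: 'already_processing' };
  }
  return { action: 'claim' };
}
```

Only a `queued` row yields `claim`; every other case returns before any provider call would be made.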
@@ -0,0 +1,539 @@
// SPDX-License-Identifier: Apache-2.0

import { parseAgentXml, type ParsedObservation, type ParsedSummary } from '../../sdk/parser.js';
import { logger } from '../../utils/logger.js';
import {
  PostgresObservationRepository,
  PostgresObservationSourcesRepository,
  buildObservationGenerationKey,
  type PostgresObservation,
} from '../../storage/postgres/observations.js';
import {
  PostgresObservationGenerationJobEventsRepository,
  PostgresObservationGenerationJobRepository,
  type PostgresObservationGenerationJob,
} from '../../storage/postgres/generation-jobs.js';
import { PostgresAuthRepository } from '../../storage/postgres/auth.js';
import {
  withPostgresTransaction,
  type PostgresPool,
} from '../../storage/postgres/pool.js';
import { stripTags } from '../../utils/tag-stripping.js';

// processGeneratedResponse owns the full "we got XML from a provider →
// persist + link + advance outbox" pipeline. Every side-effect runs inside
// a single Postgres transaction so retries are idempotent:
//
//   - observations.generation_key (UNIQUE per team/project) collapses retry
//     duplicates to a single row.
//   - observation_sources (UNIQUE on observation_id, source_type, source_id)
//     collapses duplicate source links.
//   - observation_generation_jobs.transitionStatus is the lifecycle gate.
//
// The function NEVER touches worker SessionStore tables, NEVER assumes a
// Claude Code transcript shape, and ALWAYS reloads the job before mutating.
// BullMQ payload data is advisory; the outbox row is canonical.

export type ProcessGeneratedResponseOutcome =
  | {
      kind: 'completed';
      jobId: string;
      observations: PostgresObservation[];
      privateContentDetected: boolean;
    }
  | { kind: 'parse_error'; jobId: string; reason: string };

export interface ProcessGeneratedResponseInput {
  pool: PostgresPool;
  job: PostgresObservationGenerationJob;
  rawText: string;
  modelId?: string;
  providerLabel: string;
  workerId?: string;
  // Phase 11: identity context propagated from the BullMQ payload (and
  // ultimately the API-key that ingested the source row). Persisted on
  // observation_sources.metadata for traceability and re-emitted in the
  // observation.created audit row.
  apiKeyId?: string | null;
  actorId?: string | null;
  sourceAdapter?: string | null;
}

export async function processGeneratedResponse(
  input: ProcessGeneratedResponseInput,
): Promise<ProcessGeneratedResponseOutcome> {
  const { job, rawText } = input;

  const parsed = parseAgentXml(rawText, job.id);
  if (!parsed.valid) {
    return { kind: 'parse_error', jobId: job.id, reason: 'parser rejected response' };
  }

  // Skip-summary or zero-observation responses are still a success: the
  // provider explicitly decided there's nothing worth recording (e.g.
  // privacy-stripped batch). Mark the job completed with no observations.
  const observationsToWrite = parsed.observations ?? [];
  const skipped = parsed.summary?.skipped === true;
  const privateContentDetected = skipped || observationsToWrite.length === 0;

  return await withPostgresTransaction(input.pool, async (client) => {
    const obsRepo = new PostgresObservationRepository(client);
    const sourcesRepo = new PostgresObservationSourcesRepository(client);
    const jobsRepo = new PostgresObservationGenerationJobRepository(client);
    const eventsLogRepo = new PostgresObservationGenerationJobEventsRepository(client);
    const auditRepo = new PostgresAuthRepository(client);

    // Reload the job inside the transaction. If another worker already moved
    // it to a terminal status, skip persistence and return a completed
    // outcome with an empty observation list.
    const fresh = await jobsRepo.getByIdForScope({
      id: job.id,
      projectId: job.projectId,
      teamId: job.teamId,
    });
    if (!fresh) {
      throw new Error(`generation job ${job.id} not found in scope`);
    }
    if (fresh.status === 'completed' || fresh.status === 'cancelled' || fresh.status === 'failed') {
      logger.info('SYSTEM', 'generation job already in terminal status; skipping persistence', {
        jobId: fresh.id,
        status: fresh.status,
      });
      return {
        kind: 'completed' as const,
        jobId: fresh.id,
        observations: [],
        privateContentDetected,
      };
    }

    const persisted: PostgresObservation[] = [];
    for (let index = 0; index < observationsToWrite.length; index++) {
      const parsedObservation = observationsToWrite[index]!;
      const content = renderObservationContent(parsedObservation);
      if (!content || content.trim().length === 0) {
        continue;
      }

      // Defense-in-depth: even if the parser slipped a private-tagged
      // string through, scrub before persisting.
      const scrubbed = stripTags(content);
      if (!scrubbed.stripped || scrubbed.stripped.trim().length === 0) {
        continue;
      }

      const generationKey = buildObservationGenerationKey({
        generationJobId: fresh.id,
        parsedObservationIndex: index,
        content: scrubbed.stripped,
      });

      const observation = await obsRepo.create({
        projectId: fresh.projectId,
        teamId: fresh.teamId,
        serverSessionId: fresh.serverSessionId,
        kind: parsedObservation.type ?? 'observation',
        content: scrubbed.stripped,
        generationKey,
        metadata: {
          title: parsedObservation.title,
          subtitle: parsedObservation.subtitle,
          facts: parsedObservation.facts,
          narrative: parsedObservation.narrative,
          concepts: parsedObservation.concepts,
          files_read: parsedObservation.files_read,
          files_modified: parsedObservation.files_modified,
          provider: input.providerLabel,
          model: input.modelId ?? null,
        },
        createdByJobId: fresh.id,
      });
      persisted.push(observation);

      await sourcesRepo.addSource({
        observationId: observation.id,
        projectId: fresh.projectId,
        teamId: fresh.teamId,
        sourceType: fresh.sourceType,
        sourceId: fresh.sourceId,
        agentEventId: fresh.agentEventId ?? null,
        generationJobId: fresh.id,
        metadata: {
          provider: input.providerLabel,
          parsedObservationIndex: index,
          // Phase 11: denormalize identity context for traceability so an
          // operator can answer "which api key produced this observation?"
          // without joining back through generation_job → outbox → key.
          source_adapter: input.sourceAdapter ?? null,
          actor_id: input.actorId ?? null,
          api_key_id: input.apiKeyId ?? null,
        },
      });

      // Phase 11: audit each generated observation. Using the SAME
      // generation_job_id reference so the audit chain (event_received →
      // generation_job.queued → generation_job.processing →
      // observation.created → observation.read) can be reconstructed.
      try {
        await auditRepo.createAuditLog({
          teamId: fresh.teamId,
          projectId: fresh.projectId,
          actorId: input.actorId ?? null,
          apiKeyId: input.apiKeyId ?? null,
          action: 'observation.created',
          resourceType: 'observation',
          resourceId: observation.id,
          details: {
            generationJobId: fresh.id,
            sourceType: fresh.sourceType,
            sourceId: fresh.sourceId,
            provider: input.providerLabel,
            model: input.modelId ?? null,
            sourceAdapter: input.sourceAdapter ?? null,
            parsedObservationIndex: index,
          },
        });
      } catch (auditError) {
        logger.warn('SYSTEM', 'audit_log observation.created insert failed', {
          observationId: observation.id,
          error: auditError instanceof Error ? auditError.message : String(auditError),
        });
      }
    }

    // Advance outbox status. Phase 1 transitionStatus enforces legal
    // transitions and tenant scope inside its WHERE clause.
    await jobsRepo.transitionStatus({
      id: fresh.id,
      projectId: fresh.projectId,
      teamId: fresh.teamId,
      status: 'completed',
    });
    await eventsLogRepo.append({
      generationJobId: fresh.id,
      projectId: fresh.projectId,
      teamId: fresh.teamId,
      eventType: 'completed',
      statusAfter: 'completed',
      attempt: fresh.attempts,
      details: {
        provider: input.providerLabel,
        model: input.modelId ?? null,
        observationCount: persisted.length,
        privateContentDetected,
        workerId: input.workerId ?? null,
      },
    });

    // Audit log: intended as best-effort, but it runs on the transaction's
    // client. Unless withPostgresTransaction isolates statements with
    // savepoints, a failed insert aborts the Postgres transaction, so the
    // catch below only logs the failure; it cannot keep the earlier writes.
    // We accept that to keep the pipeline observable end-to-end.
    try {
      await auditRepo.createAuditLog({
        teamId: fresh.teamId,
        projectId: fresh.projectId,
        actorId: input.actorId ?? null,
        apiKeyId: input.apiKeyId ?? null,
        action: 'generation_job.completed',
        resourceType: 'observation_generation_job',
        resourceId: fresh.id,
        details: {
          generationJobId: fresh.id,
          provider: input.providerLabel,
          model: input.modelId ?? null,
          observationCount: persisted.length,
          observationIds: persisted.map(o => o.id),
          sourceAdapter: input.sourceAdapter ?? null,
        },
      });
    } catch (auditError) {
      // The audit log table may not have a metadata column on older
      // schemas; log the failure rather than rethrowing.
      logger.warn('SYSTEM', 'audit log insert failed during generation', {
        jobId: fresh.id,
        error: auditError instanceof Error ? auditError.message : String(auditError),
      });
    }

    return {
      kind: 'completed' as const,
      jobId: fresh.id,
      observations: persisted,
      privateContentDetected,
    };
  });
}
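The idempotency claim above rests on a deterministic `generation_key` plus a UNIQUE index. The real `buildObservationGenerationKey` lives in `storage/postgres/observations.js` and is not shown in this diff, so the following is only a sketch of the idea under stated assumptions: `sketchGenerationKey`, `fnv1a`, and `UniqueKeyStore` are hypothetical names, the key layout is invented for illustration, and FNV-1a stands in for whatever hash the real helper uses (a production key would more likely use a cryptographic hash).

```typescript
interface GenerationKeyInput {
  generationJobId: string;
  parsedObservationIndex: number;
  content: string;
}

// Small non-cryptographic hash (FNV-1a, 32-bit) used here only to keep the
// sketch dependency-free.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, '0');
}

// Hypothetical stand-in for buildObservationGenerationKey: the same job,
// index, and content always produce the same key.
function sketchGenerationKey(input: GenerationKeyInput): string {
  return `${input.generationJobId}:${input.parsedObservationIndex}:${fnv1a(input.content)}`;
}

// In-memory stand-in for a UNIQUE index on generation_key.
class UniqueKeyStore {
  private readonly rows = new Map<string, string>();

  // Returns true when a new row was stored, false when the key already existed.
  insertIfAbsent(key: string, content: string): boolean {
    if (this.rows.has(key)) return false;
    this.rows.set(key, content);
    return true;
  }

  get size(): number {
    return this.rows.size;
  }
}
```

A retried job that re-parses the same provider output computes the same keys, so the second insert is a no-op and the store still holds one row per observation.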
export interface MarkGenerationFailedInput {
  pool: PostgresPool;
  job: PostgresObservationGenerationJob;
  reason: string;
  classification?: string;
  retryable: boolean;
  workerId?: string;
}

/**
 * Record a failed generation attempt. Used when the provider returned an
 * error or invalid XML. Retryable failures with attempts remaining move the
 * job back to `queued` (with a backoff `nextAttemptAt`) so reconciliation
 * can re-enqueue; non-retryable or exhausted failures move to `failed`.
 */
export async function markGenerationFailed(input: MarkGenerationFailedInput): Promise<void> {
  await withPostgresTransaction(input.pool, async (client) => {
    const jobsRepo = new PostgresObservationGenerationJobRepository(client);
    const eventsLogRepo = new PostgresObservationGenerationJobEventsRepository(client);

    const fresh = await jobsRepo.getByIdForScope({
      id: input.job.id,
      projectId: input.job.projectId,
      teamId: input.job.teamId,
    });
    if (!fresh || fresh.status === 'completed' || fresh.status === 'cancelled') {
      return;
    }

    const canRetry = input.retryable && fresh.attempts < fresh.maxAttempts;
    const target = canRetry ? 'queued' : 'failed';

    await jobsRepo.transitionStatus({
      id: fresh.id,
      projectId: fresh.projectId,
      teamId: fresh.teamId,
      status: target,
      lastError: { reason: input.reason, classification: input.classification ?? null },
      ...(canRetry ? { nextAttemptAt: new Date(Date.now() + retryDelayMs(fresh.attempts)) } : {}),
    });

    await eventsLogRepo.append({
      generationJobId: fresh.id,
      projectId: fresh.projectId,
      teamId: fresh.teamId,
      eventType: canRetry ? 'retry_scheduled' : 'failed',
      statusAfter: target,
      attempt: fresh.attempts,
      details: {
        reason: input.reason,
        classification: input.classification ?? null,
        workerId: input.workerId ?? null,
      },
    });
  });
}
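The retry-or-fail branch in `markGenerationFailed` reduces to one boolean. A minimal sketch of that decision as a pure function follows; `FailureInput`, `FailureTarget`, and `resolveFailureTarget` are hypothetical names used only for illustration, while the status and event-type strings are the ones the function above writes.

```typescript
interface FailureInput {
  retryable: boolean;
  attempts: number;
  maxAttempts: number;
}

type FailureTarget =
  | { status: 'queued'; eventType: 'retry_scheduled' }
  | { status: 'failed'; eventType: 'failed' };

// A failure is retried only when the error is classified as retryable AND
// the job still has attempts left; everything else is a hard failure.
function resolveFailureTarget(input: FailureInput): FailureTarget {
  const canRetry = input.retryable && input.attempts < input.maxAttempts;
  return canRetry
    ? { status: 'queued', eventType: 'retry_scheduled' }
    : { status: 'failed', eventType: 'failed' };
}
```

Keeping the rule in one place makes the two edge cases explicit: a retryable error on the last attempt and a non-retryable error on the first attempt both end in `failed`.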
/**
 * Persist a parsed session summary as an observations row with kind='summary'.
 *
 * Wraps the same outbox transition / source-link / audit pipeline as
 * processGeneratedResponse but emits a single 'summary'-kind observation
 * derived from the summary fields. Idempotency is enforced through the same
 * `observations.generation_key` UNIQUE index: re-running the summary job
 * after a restart will collapse to one row.
 */
export async function processSessionSummaryResponse(
  input: ProcessGeneratedResponseInput,
): Promise<ProcessGeneratedResponseOutcome> {
  const { job, rawText } = input;

  if (job.sourceType !== 'session_summary') {
    return { kind: 'parse_error', jobId: job.id, reason: 'session summary processor invoked on non-summary job' };
  }

  const parsed = parseAgentXml(rawText, job.id);
  if (!parsed.valid) {
    return { kind: 'parse_error', jobId: job.id, reason: 'parser rejected summary response' };
  }

  const summary = parsed.summary ?? null;
  const skipped = summary?.skipped === true;
  const summaryContent = summary ? renderSummaryContent(summary) : '';
  const privateContentDetected = skipped || summaryContent.trim().length === 0;

  return await withPostgresTransaction(input.pool, async (client) => {
    const obsRepo = new PostgresObservationRepository(client);
    const sourcesRepo = new PostgresObservationSourcesRepository(client);
    const jobsRepo = new PostgresObservationGenerationJobRepository(client);
    const eventsLogRepo = new PostgresObservationGenerationJobEventsRepository(client);
    const auditRepo = new PostgresAuthRepository(client);

    const fresh = await jobsRepo.getByIdForScope({
      id: job.id,
      projectId: job.projectId,
      teamId: job.teamId,
    });
    if (!fresh) {
      throw new Error(`session summary generation job ${job.id} not found in scope`);
    }
    if (fresh.status === 'completed' || fresh.status === 'cancelled' || fresh.status === 'failed') {
      logger.info('SYSTEM', 'session summary job already in terminal status; skipping persistence', {
        jobId: fresh.id,
        status: fresh.status,
      });
      return {
        kind: 'completed' as const,
        jobId: fresh.id,
        observations: [],
        privateContentDetected,
      };
    }

    const persisted: PostgresObservation[] = [];
    if (!privateContentDetected) {
      const scrubbed = stripTags(summaryContent);
      const scrubbedContent = scrubbed.stripped ?? '';
      if (scrubbedContent.trim().length > 0) {
        const generationKey = buildObservationGenerationKey({
          generationJobId: fresh.id,
          parsedObservationIndex: 0,
          content: scrubbedContent,
        });
        const observation = await obsRepo.create({
          projectId: fresh.projectId,
          teamId: fresh.teamId,
          serverSessionId: fresh.serverSessionId,
          kind: 'summary',
          content: scrubbedContent,
          generationKey,
          metadata: {
            request: summary?.request ?? null,
            investigated: summary?.investigated ?? null,
            learned: summary?.learned ?? null,
            completed: summary?.completed ?? null,
            next_steps: summary?.next_steps ?? null,
            notes: summary?.notes ?? null,
            provider: input.providerLabel,
            model: input.modelId ?? null,
          },
          createdByJobId: fresh.id,
        });
        persisted.push(observation);

        await sourcesRepo.addSource({
          observationId: observation.id,
          projectId: fresh.projectId,
          teamId: fresh.teamId,
          sourceType: 'session_summary',
          sourceId: fresh.sourceId,
          generationJobId: fresh.id,
          metadata: {
            provider: input.providerLabel,
            parsedObservationIndex: 0,
            source_adapter: input.sourceAdapter ?? null,
            actor_id: input.actorId ?? null,
            api_key_id: input.apiKeyId ?? null,
          },
        });

        // Phase 11: observation.created audit for the summary observation.
        try {
          await auditRepo.createAuditLog({
            teamId: fresh.teamId,
            projectId: fresh.projectId,
            actorId: input.actorId ?? null,
            apiKeyId: input.apiKeyId ?? null,
            action: 'observation.created',
            resourceType: 'observation',
            resourceId: observation.id,
            details: {
              generationJobId: fresh.id,
              sourceType: 'session_summary',
              sourceId: fresh.sourceId,
              provider: input.providerLabel,
              model: input.modelId ?? null,
              sourceAdapter: input.sourceAdapter ?? null,
              kind: 'summary',
            },
          });
        } catch (auditError) {
          logger.warn('SYSTEM', 'audit_log observation.created (summary) insert failed', {
            observationId: observation.id,
            error: auditError instanceof Error ? auditError.message : String(auditError),
          });
        }
      }
    }

    await jobsRepo.transitionStatus({
      id: fresh.id,
      projectId: fresh.projectId,
      teamId: fresh.teamId,
      status: 'completed',
    });
    await eventsLogRepo.append({
      generationJobId: fresh.id,
      projectId: fresh.projectId,
      teamId: fresh.teamId,
      eventType: 'completed',
      statusAfter: 'completed',
      attempt: fresh.attempts,
      details: {
        provider: input.providerLabel,
        model: input.modelId ?? null,
        observationCount: persisted.length,
        privateContentDetected,
        workerId: input.workerId ?? null,
        sourceType: 'session_summary',
      },
    });

    try {
      await auditRepo.createAuditLog({
        teamId: fresh.teamId,
        projectId: fresh.projectId,
        actorId: input.actorId ?? null,
        apiKeyId: input.apiKeyId ?? null,
        action: 'generation_job.completed',
        resourceType: 'observation_generation_job',
        resourceId: fresh.id,
        details: {
          generationJobId: fresh.id,
          provider: input.providerLabel,
          model: input.modelId ?? null,
          observationCount: persisted.length,
          observationIds: persisted.map(o => o.id),
          sourceAdapter: input.sourceAdapter ?? null,
          sourceType: 'session_summary',
        },
      });
    } catch (auditError) {
      logger.warn('SYSTEM', 'audit log insert failed during summary generation', {
        jobId: fresh.id,
        error: auditError instanceof Error ? auditError.message : String(auditError),
      });
    }

    return {
      kind: 'completed' as const,
      jobId: fresh.id,
      observations: persisted,
      privateContentDetected,
    };
  });
}

function renderSummaryContent(summary: ParsedSummary): string {
  const parts: string[] = [];
  if (summary.request) parts.push(`Request: ${summary.request}`);
  if (summary.investigated) parts.push(`Investigated: ${summary.investigated}`);
  if (summary.learned) parts.push(`Learned: ${summary.learned}`);
  if (summary.completed) parts.push(`Completed: ${summary.completed}`);
  if (summary.next_steps) parts.push(`Next steps: ${summary.next_steps}`);
  if (summary.notes) parts.push(`Notes: ${summary.notes}`);
  return parts.join('\n\n').trim();
}

function renderObservationContent(observation: ParsedObservation): string {
  const parts: string[] = [];
  if (observation.title) parts.push(observation.title);
  if (observation.subtitle) parts.push(observation.subtitle);
  if (observation.narrative) parts.push(observation.narrative);
  if (observation.facts && observation.facts.length > 0) {
    parts.push(observation.facts.map(f => `- ${f}`).join('\n'));
  }
  return parts.join('\n\n').trim();
}

function retryDelayMs(attempts: number): number {
  // Exponential backoff: 5s, 25s, 125s, capped at 10 minutes.
  const base = 5000 * Math.pow(5, Math.max(0, attempts));
  return Math.min(base, 10 * 60 * 1000);
}
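As a worked example of the backoff arithmetic, the block below restates `retryDelayMs` from the file above and evaluates it for the first few attempt counts. It assumes `attempts` starts at 0 for the first failure, which is what makes the "5s, 25s, 125s" comment line up; the `schedule` constant is introduced here only for illustration.

```typescript
// Restated from the diff above so the example is self-contained.
function retryDelayMs(attempts: number): number {
  // 5000 ms multiplied by 5^attempts, capped at 10 minutes (600000 ms).
  const base = 5000 * Math.pow(5, Math.max(0, attempts));
  return Math.min(base, 10 * 60 * 1000);
}

// attempts 0..4 -> 5s, 25s, 125s, then the cap: 5000 * 5^3 = 625000 ms
// already exceeds 600000 ms, so the fourth and later delays are 10 minutes.
const schedule = [0, 1, 2, 3, 4].map(retryDelayMs);
// schedule is [5000, 25000, 125000, 600000, 600000]
```

Negative attempt counts are clamped to 0 by `Math.max`, so they also yield the 5 second base delay.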
@@ -0,0 +1,247 @@
// SPDX-License-Identifier: Apache-2.0

import { logger } from '../../../utils/logger.js';
import {
  ServerClassifiedProviderError,
  parseRetryAfterMs,
} from './shared/error-classification.js';
import { buildServerGenerationPrompt } from './shared/prompt-builder.js';
import type {
  ServerGenerationContext,
  ServerGenerationProvider,
  ServerGenerationResult,
} from './shared/types.js';

const ANTHROPIC_API_URL = 'https://api.anthropic.com/v1/messages';
const ANTHROPIC_VERSION = '2023-06-01';
const DEFAULT_MODEL = 'claude-3-5-sonnet-latest';

export interface ClaudeObservationProviderOptions {
  apiKey: string;
  model?: string;
  maxOutputTokens?: number;
  fetchImpl?: typeof fetch;
}

interface AnthropicMessagesResponse {
  content?: Array<{ type?: string; text?: string }>;
  usage?: { input_tokens?: number; output_tokens?: number };
  error?: { type?: string; message?: string };
}

export class ClaudeObservationProvider implements ServerGenerationProvider {
  readonly providerLabel = 'claude' as const;
  private readonly apiKey: string;
  private readonly model: string;
  private readonly maxOutputTokens: number;
  private readonly fetchImpl: typeof fetch;

  constructor(options: ClaudeObservationProviderOptions) {
    if (!options.apiKey) {
      throw new ServerClassifiedProviderError('Anthropic API key not configured', {
        kind: 'auth_invalid',
        cause: new Error('apiKey is required'),
      });
    }
    this.apiKey = options.apiKey;
    this.model = options.model ?? DEFAULT_MODEL;
    this.maxOutputTokens = options.maxOutputTokens ?? 4096;
    this.fetchImpl = options.fetchImpl ?? fetch;
  }

  async generate(
    context: ServerGenerationContext,
    signal?: AbortSignal,
  ): Promise<ServerGenerationResult> {
    const { prompt, skippedAll } = buildServerGenerationPrompt(context);
    if (skippedAll) {
      // All events were scrubbed by privacy stripping. Don't bill the
      // provider; return a synthetic skip response that the parser accepts.
      return {
        rawText: '<skip_summary reason="all_events_private" />',
        providerLabel: this.providerLabel,
        modelId: this.model,
      };
    }

    let response: Response;
    try {
      response = await this.fetchImpl(ANTHROPIC_API_URL, {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'x-api-key': this.apiKey,
          'anthropic-version': ANTHROPIC_VERSION,
        },
        body: JSON.stringify({
          model: this.model,
          max_tokens: this.maxOutputTokens,
          temperature: 0.3,
          messages: [{ role: 'user', content: prompt }],
        }),
        signal,
      });
    } catch (networkError) {
      throw classifyClaudeServerError({
        cause: networkError,
      });
    }

    if (!response.ok) {
      const bodyText = await safeReadBody(response);
      throw classifyClaudeServerError({
        status: response.status,
        bodyText,
        headers: response.headers,
        cause: new Error(`Anthropic API error: ${response.status} - ${bodyText}`),
      });
    }

    let data: AnthropicMessagesResponse;
    try {
      data = (await response.json()) as AnthropicMessagesResponse;
    } catch (parseError) {
      throw new ServerClassifiedProviderError('Anthropic returned invalid JSON', {
        kind: 'parse_error',
        cause: parseError,
      });
    }

    if (data.error) {
      throw classifyClaudeServerError({
        status: response.status,
        bodyText: `${data.error.type ?? ''} ${data.error.message ?? ''}`,
        headers: response.headers,
        cause: new Error(`Anthropic API error: ${data.error.type} - ${data.error.message}`),
      });
    }

    const blocks = Array.isArray(data.content) ? data.content : [];
    const rawText = blocks
      .filter(block => block?.type === 'text' && typeof block.text === 'string')
      .map(block => block.text!)
      .join('\n')
      .trim();

    if (!rawText) {
      logger.warn('SDK', 'Anthropic returned empty content array', {
        provider: 'claude',
        model: this.model,
      });
    }

    const usage = data.usage ?? {};
    const tokensUsed =
      typeof usage.input_tokens === 'number' || typeof usage.output_tokens === 'number'
        ? (usage.input_tokens ?? 0) + (usage.output_tokens ?? 0)
        : undefined;

    return {
      rawText,
      ...(tokensUsed !== undefined ? { tokensUsed } : {}),
      providerLabel: this.providerLabel,
      modelId: this.model,
    };
  }
}

interface ClassifyInput {
  status?: number;
  bodyText?: string;
  headers?: Headers | { get(name: string): string | null };
  cause: unknown;
}

/**
 * Anthropic-specific HTTP error classification. Mirrors worker
 * `classifyClaudeError`, but extracted for server-beta and rebound to
 * Anthropic Messages REST semantics rather than SDK error classes.
 */
export function classifyClaudeServerError(input: ClassifyInput): ServerClassifiedProviderError {
  const status = input.status;
  const body = input.bodyText ?? '';
  const lower = body.toLowerCase();
  const retryAfterMs = input.headers ? parseRetryAfterMs(input.headers.get('retry-after')) : undefined;

  if (lower.includes('overloaded')) {
    return new ServerClassifiedProviderError(
      `Anthropic overloaded${status !== undefined ? ` (status ${status})` : ''}`,
      { kind: 'transient', cause: input.cause },
    );
  }

  if (status === 401 || status === 403 || lower.includes('invalid api key')) {
    return new ServerClassifiedProviderError(
      `Anthropic auth invalid${status !== undefined ? ` (status ${status})` : ''}`,
      { kind: 'auth_invalid', cause: input.cause },
    );
  }

  if (status === 429) {
    return new ServerClassifiedProviderError('Anthropic rate limit (429)', {
      kind: 'rate_limit',
      cause: input.cause,
      ...(retryAfterMs !== undefined ? { retryAfterMs } : {}),
    });
  }

  if (lower.includes('quota exceeded')) {
    return new ServerClassifiedProviderError('Anthropic quota exhausted', {
      kind: 'quota_exhausted',
      cause: input.cause,
    });
  }

  if (
    lower.includes('prompt is too long') ||
    lower.includes('context window') ||
    lower.includes('max_tokens')
  ) {
    return new ServerClassifiedProviderError('Anthropic context overflow', {
      kind: 'unrecoverable',
      cause: input.cause,
    });
  }

  if (status === 529) {
    return new ServerClassifiedProviderError('Anthropic overloaded (529)', {
      kind: 'transient',
      cause: input.cause,
    });
  }

  if (status !== undefined && status >= 500 && status < 600) {
    return new ServerClassifiedProviderError(`Anthropic upstream error (status ${status})`, {
      kind: 'transient',
      cause: input.cause,
    });
  }

  if (status === 400) {
    return new ServerClassifiedProviderError('Anthropic bad request (400)', {
      kind: 'unrecoverable',
      cause: input.cause,
    });
  }

  if (status === undefined) {
    const message = input.cause instanceof Error ? input.cause.message : String(input.cause);
    return new ServerClassifiedProviderError(`Anthropic network error: ${message}`, {
      kind: 'transient',
      cause: input.cause,
    });
  }

  return new ServerClassifiedProviderError(
    `Anthropic API error: ${status}${body ? ` - ${body.substring(0, 200)}` : ''}`,
    { kind: 'unrecoverable', cause: input.cause },
  );
}

async function safeReadBody(response: Response): Promise<string> {
  try {
    return await response.text();
  } catch {
    return '';
  }
}
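`classifyClaudeServerError` attaches `retryAfterMs` to rate-limit errors by passing the `retry-after` header to `parseRetryAfterMs`, whose implementation is in `shared/error-classification.js` and is not part of this hunk. The sketch below shows one way such a parser could work, under the assumption that it accepts both forms HTTP allows for `Retry-After` (a number of seconds or an HTTP date). `sketchParseRetryAfterMs` and its `nowMs` parameter are hypothetical and may not match the real helper.

```typescript
// Hypothetical stand-in for parseRetryAfterMs; the real helper may differ.
function sketchParseRetryAfterMs(
  value: string | null,
  nowMs: number = Date.now(),
): number | undefined {
  if (value === null) return undefined;
  const trimmed = value.trim();
  if (trimmed === '') return undefined;

  // Form 1: delay in whole seconds, e.g. "30".
  if (/^\d+$/.test(trimmed)) {
    return Number(trimmed) * 1000;
  }

  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return undefined;

  // A date in the past means "retry now", never a negative delay.
  return Math.max(0, dateMs - nowMs);
}
```

Taking `nowMs` as a parameter keeps the date branch deterministic in tests; a caller in the worker would normally rely on the default.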
@@ -0,0 +1,148 @@
// SPDX-License-Identifier: Apache-2.0

import { logger } from '../../../utils/logger.js';
import {
  ServerClassifiedProviderError,
  classifyHttpProviderError,
  parseRetryAfterMs,
} from './shared/error-classification.js';
import { buildServerGenerationPrompt } from './shared/prompt-builder.js';
import type {
  ServerGenerationContext,
  ServerGenerationProvider,
  ServerGenerationResult,
} from './shared/types.js';

const GEMINI_API_URL = 'https://generativelanguage.googleapis.com/v1/models';
const DEFAULT_MODEL = 'gemini-2.5-flash';

export interface GeminiObservationProviderOptions {
  apiKey: string;
  model?: string;
  maxOutputTokens?: number;
  fetchImpl?: typeof fetch;
}

interface GeminiResponse {
  candidates?: Array<{
    content?: { parts?: Array<{ text?: string }> };
  }>;
  usageMetadata?: { totalTokenCount?: number };
  error?: { code?: number; status?: string; message?: string };
}

export class GeminiObservationProvider implements ServerGenerationProvider {
  readonly providerLabel = 'gemini' as const;
  private readonly apiKey: string;
  private readonly model: string;
  private readonly maxOutputTokens: number;
  private readonly fetchImpl: typeof fetch;

  constructor(options: GeminiObservationProviderOptions) {
    if (!options.apiKey) {
      throw new ServerClassifiedProviderError('Gemini API key not configured', {
        kind: 'auth_invalid',
        cause: new Error('apiKey is required'),
      });
    }
    this.apiKey = options.apiKey;
    this.model = options.model ?? DEFAULT_MODEL;
    this.maxOutputTokens = options.maxOutputTokens ?? 4096;
    this.fetchImpl = options.fetchImpl ?? fetch;
  }

  async generate(
    context: ServerGenerationContext,
    signal?: AbortSignal,
  ): Promise<ServerGenerationResult> {
    const { prompt, skippedAll } = buildServerGenerationPrompt(context);
    if (skippedAll) {
      return {
        rawText: '<skip_summary reason="all_events_private" />',
        providerLabel: this.providerLabel,
        modelId: this.model,
      };
    }

    const url = `${GEMINI_API_URL}/${encodeURIComponent(this.model)}:generateContent?key=${encodeURIComponent(this.apiKey)}`;
|
||||
|
||||
let response: Response;
|
||||
try {
|
||||
response = await this.fetchImpl(url, {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
body: JSON.stringify({
|
||||
contents: [{ role: 'user', parts: [{ text: prompt }] }],
|
||||
generationConfig: {
|
||||
temperature: 0.3,
|
||||
maxOutputTokens: this.maxOutputTokens,
|
||||
},
|
||||
}),
|
||||
signal,
|
||||
});
|
||||
} catch (networkError) {
|
||||
throw classifyHttpProviderError({
|
||||
cause: networkError,
|
||||
providerLabel: 'Gemini',
|
||||
});
|
||||
}
|
||||
|
||||
if (!response.ok) {
|
||||
const bodyText = await safeReadBody(response);
|
||||
throw classifyHttpProviderError({
|
||||
status: response.status,
|
||||
bodyText,
|
||||
headers: response.headers,
|
||||
cause: new Error(`Gemini API error: ${response.status} - ${bodyText}`),
|
||||
providerLabel: 'Gemini',
|
||||
});
|
||||
}
|
||||
|
||||
let data: GeminiResponse;
|
||||
try {
|
||||
data = (await response.json()) as GeminiResponse;
|
||||
} catch (parseError) {
|
||||
throw new ServerClassifiedProviderError('Gemini returned invalid JSON', {
|
||||
kind: 'parse_error',
|
||||
cause: parseError,
|
||||
});
|
||||
}
|
||||
|
||||
if (data.error) {
|
||||
throw classifyHttpProviderError({
|
||||
status: response.status,
|
||||
bodyText: `${data.error.status ?? ''} ${data.error.message ?? ''}`,
|
||||
headers: response.headers,
|
||||
cause: new Error(`Gemini API error: ${data.error.status} - ${data.error.message}`),
|
||||
providerLabel: 'Gemini',
|
||||
});
|
||||
}
|
||||
|
||||
const rawText = data.candidates?.[0]?.content?.parts?.[0]?.text?.trim() ?? '';
|
||||
if (!rawText) {
|
||||
logger.warn('SDK', 'Gemini returned empty content', { provider: 'gemini', model: this.model });
|
||||
}
|
||||
|
||||
const tokensUsed = typeof data.usageMetadata?.totalTokenCount === 'number'
|
||||
? data.usageMetadata.totalTokenCount
|
||||
: undefined;
|
||||
|
||||
return {
|
||||
rawText,
|
||||
...(tokensUsed !== undefined ? { tokensUsed } : {}),
|
||||
providerLabel: this.providerLabel,
|
||||
modelId: this.model,
|
||||
};
|
||||
}
|
||||
}
|
||||
|
||||
// Re-export for tests/auditing parity with worker classifier surface.
|
||||
export { parseRetryAfterMs };
|
||||
|
||||
async function safeReadBody(response: Response): Promise<string> {
|
||||
try {
|
||||
return await response.text();
|
||||
} catch {
|
||||
return '';
|
||||
}
|
||||
}
|
||||
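The response handling at the end of `generate` reads only the first part of the first candidate, and it attaches `tokensUsed` only when the API reports a numeric total. The sketch below isolates that extraction so it can be checked without a network call. `GeminiLikeResponse`, `ExtractedResult`, and `extractGeminiResult` are hypothetical names for illustration and are not part of the PR.

```typescript
// Hypothetical helper mirroring the extraction logic in
// GeminiObservationProvider.generate; not code from the PR.
interface GeminiLikeResponse {
  candidates?: Array<{ content?: { parts?: Array<{ text?: string }> } }>;
  usageMetadata?: { totalTokenCount?: number };
}

interface ExtractedResult {
  rawText: string;
  tokensUsed?: number;
}

function extractGeminiResult(data: GeminiLikeResponse): ExtractedResult {
  // Only the first part of the first candidate is read; any missing level
  // of the structure collapses to the empty string.
  const rawText = data.candidates?.[0]?.content?.parts?.[0]?.text?.trim() ?? '';
  const total = data.usageMetadata?.totalTokenCount;
  // The conditional spread leaves the key absent (not set to undefined)
  // when no numeric count is reported.
  return { rawText, ...(typeof total === 'number' ? { tokensUsed: total } : {}) };
}

const full = extractGeminiResult({
  candidates: [{ content: { parts: [{ text: '  <skip_summary />  ' }] } }],
  usageMetadata: { totalTokenCount: 42 },
});
const empty = extractGeminiResult({});
```

An empty `rawText` is not an error here: in the provider it only triggers a warning log, and the empty string is passed on to the caller.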
@@ -0,0 +1,151 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
|
||||
import { logger } from '../../../utils/logger.js';
|
||||
import {
|
||||
ServerClassifiedProviderError,
|
||||
classifyHttpProviderError,
|
||||
} from './shared/error-classification.js';
|
||||
import { buildServerGenerationPrompt } from './shared/prompt-builder.js';
|
||||
import type {
|
||||
ServerGenerationContext,
|
||||
ServerGenerationProvider,
|
||||
ServerGenerationResult,
|
||||
} from './shared/types.js';
|
||||
|
||||
const OPENROUTER_API_URL = 'https://openrouter.ai/api/v1/chat/completions';
|
||||
const DEFAULT_MODEL = 'anthropic/claude-3.5-sonnet';
|
||||
|
||||
export interface OpenRouterObservationProviderOptions {
|
||||
apiKey: string;
|
||||
model?: string;
|
||||
maxOutputTokens?: number;
|
||||
siteUrl?: string;
|
||||
appName?: string;
|
||||
fetchImpl?: typeof fetch;
|
||||
}
|
||||
|
||||
interface OpenRouterResponse {
|
||||
choices?: Array<{ message?: { content?: string } }>;
|
||||
usage?: { total_tokens?: number };
|
||||
error?: { code?: string | number; message?: string };
|
||||
}
|
||||
|
||||
export class OpenRouterObservationProvider implements ServerGenerationProvider {
|
||||
readonly providerLabel = 'openrouter' as const;
|
||||
private readonly apiKey: string;
|
||||
private readonly model: string;
|
||||
private readonly maxOutputTokens: number;
|
||||
private readonly siteUrl: string;
|
||||
private readonly appName: string;
|
||||
private readonly fetchImpl: typeof fetch;
|
||||
|
||||
constructor(options: OpenRouterObservationProviderOptions) {
|
||||
if (!options.apiKey) {
|
||||
throw new ServerClassifiedProviderError('OpenRouter API key not configured', {
|
||||
kind: 'auth_invalid',
|
||||
cause: new Error('apiKey is required'),
|
||||
});
|
||||
}
|
||||
this.apiKey = options.apiKey;
|
||||
this.model = options.model ?? DEFAULT_MODEL;
|
||||
this.maxOutputTokens = options.maxOutputTokens ?? 4096;
|
||||
this.siteUrl = options.siteUrl ?? 'https://github.com/thedotmack/claude-mem';
|
||||
this.appName = options.appName ?? 'claude-mem';
|
||||
this.fetchImpl = options.fetchImpl ?? fetch;
|
||||
}
|
||||
|
||||
async generate(
|
||||
context: ServerGenerationContext,
|
||||
signal?: AbortSignal,
|
||||
): Promise<ServerGenerationResult> {
|
||||
const { prompt, skippedAll } = buildServerGenerationPrompt(context);
|
||||
if (skippedAll) {
|
||||
return {
|
||||
rawText: '<skip_summary reason="all_events_private" />',
|
||||
providerLabel: this.providerLabel,
|
||||
modelId: this.model,
|
||||
};
|
||||
}
|
||||
|
||||
let response: Response;
|
||||
try {
|
||||
response = await this.fetchImpl(OPENROUTER_API_URL, {
|
||||
method: 'POST',
|
||||
headers: {
|
||||
Authorization: `Bearer ${this.apiKey}`,
|
||||
'HTTP-Referer': this.siteUrl,
|
||||
'X-Title': this.appName,
|
||||
'Content-Type': 'application/json',
|
||||
},
|
||||
body: JSON.stringify({
|
||||
model: this.model,
|
||||
messages: [{ role: 'user', content: prompt }],
|
||||
temperature: 0.3,
|
||||
max_tokens: this.maxOutputTokens,
|
||||
}),
|
||||
signal,
|
||||
});
|
||||
} catch (networkError) {
|
||||
throw classifyHttpProviderError({
|
||||
cause: networkError,
|
||||
providerLabel: 'OpenRouter',
|
||||
});
|
||||
}
|
||||
|
||||
if (!response.ok) {
|
||||
const bodyText = await safeReadBody(response);
|
||||
throw classifyHttpProviderError({
|
||||
status: response.status,
|
||||
bodyText,
|
||||
headers: response.headers,
|
||||
cause: new Error(`OpenRouter API error: ${response.status} - ${bodyText}`),
|
||||
providerLabel: 'OpenRouter',
|
||||
});
|
||||
}
|
||||
|
||||
let data: OpenRouterResponse;
|
||||
try {
|
||||
data = (await response.json()) as OpenRouterResponse;
|
||||
} catch (parseError) {
|
||||
throw new ServerClassifiedProviderError('OpenRouter returned invalid JSON', {
|
||||
kind: 'parse_error',
|
||||
cause: parseError,
|
||||
});
|
||||
}
|
||||
|
||||
if (data.error) {
|
||||
throw classifyHttpProviderError({
|
||||
status: response.status,
|
||||
bodyText: `${data.error.code ?? ''} ${data.error.message ?? ''}`,
|
||||
headers: response.headers,
|
||||
cause: new Error(`OpenRouter API error: ${data.error.code} - ${data.error.message}`),
|
||||
providerLabel: 'OpenRouter',
|
||||
});
|
||||
}
|
||||
|
||||
const rawText = data.choices?.[0]?.message?.content?.trim() ?? '';
|
||||
if (!rawText) {
|
||||
logger.warn('SDK', 'OpenRouter returned empty content', {
|
||||
provider: 'openrouter',
|
||||
model: this.model,
|
||||
});
|
||||
}
|
||||
|
||||
const tokensUsed = typeof data.usage?.total_tokens === 'number' ? data.usage.total_tokens : undefined;
|
||||
|
||||
return {
|
||||
rawText,
|
||||
...(tokensUsed !== undefined ? { tokensUsed } : {}),
|
||||
providerLabel: this.providerLabel,
|
||||
modelId: this.model,
|
||||
};
|
||||
}
|
||||
}
|
||||
|
||||
async function safeReadBody(response: Response): Promise<string> {
|
||||
try {
|
||||
return await response.text();
|
||||
} catch {
|
||||
return '';
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,136 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
|
||||
// Server-beta-local copy of the worker provider error classification model.
// Phase 5 anti-pattern guard: src/server/* must not import from
// src/services/worker/*, so we duplicate the small, stable error model here.
// Worker code keeps src/services/worker/provider-errors.ts unchanged.
|
||||
|
||||
export type ServerProviderErrorClass =
|
||||
| 'transient'
|
||||
| 'unrecoverable'
|
||||
| 'rate_limit'
|
||||
| 'quota_exhausted'
|
||||
| 'auth_invalid'
|
||||
| 'parse_error'
|
||||
| (string & {});
|
||||
|
||||
export class ServerClassifiedProviderError extends Error {
|
||||
readonly kind: ServerProviderErrorClass;
|
||||
readonly retryAfterMs?: number;
|
||||
readonly cause: unknown;
|
||||
|
||||
constructor(
|
||||
message: string,
|
||||
opts: {
|
||||
kind: ServerProviderErrorClass;
|
||||
cause: unknown;
|
||||
retryAfterMs?: number;
|
||||
},
|
||||
) {
|
||||
super(message);
|
||||
this.name = 'ServerClassifiedProviderError';
|
||||
this.kind = opts.kind;
|
||||
this.cause = opts.cause;
|
||||
if (opts.retryAfterMs !== undefined) {
|
||||
this.retryAfterMs = opts.retryAfterMs;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
export function isServerClassified(err: unknown): err is ServerClassifiedProviderError {
|
||||
return err instanceof ServerClassifiedProviderError;
|
||||
}
|
||||
|
||||
/**
 * Parse a Retry-After header value (delta-seconds or HTTP-date) into
 * milliseconds. Returns undefined when the header is missing or unparseable.
 * Behavior intentionally mirrors the worker providers' helper so server
 * retries match worker retry policy.
 */
export function parseRetryAfterMs(value: string | null): number | undefined {
  if (!value) return undefined;
  // Numeric form: a non-negative number of seconds.
  const seconds = Number(value);
  if (!Number.isNaN(seconds) && seconds >= 0) {
    return Math.floor(seconds * 1000);
  }
  // Date form: delay until the given instant, clamped at zero for past dates.
  const dateMs = Date.parse(value);
  if (!Number.isNaN(dateMs)) {
    const delta = dateMs - Date.now();
    return delta > 0 ? delta : 0;
  }
  return undefined;
}
|
||||
|
||||
interface ClassifyHttpInput {
|
||||
status?: number;
|
||||
bodyText?: string;
|
||||
headers?: Headers | { get(name: string): string | null };
|
||||
cause: unknown;
|
||||
providerLabel: string;
|
||||
}
|
||||
|
||||
/**
 * Generic HTTP-error → ServerClassifiedProviderError mapping shared by
 * Gemini and OpenRouter server adapters. Provider-specific overrides (e.g.
 * Anthropic OverloadedError, Gemini quota body markers) are layered on top
 * by the per-provider classifier wrappers in this module.
 */
|
||||
export function classifyHttpProviderError(input: ClassifyHttpInput): ServerClassifiedProviderError {
|
||||
const { status, providerLabel } = input;
|
||||
const body = input.bodyText ?? '';
|
||||
const lower = body.toLowerCase();
|
||||
const retryAfterMs = input.headers ? parseRetryAfterMs(input.headers.get('retry-after')) : undefined;
|
||||
|
||||
if (
|
||||
lower.includes('quota exceeded') ||
|
||||
lower.includes('insufficient credits') ||
|
||||
lower.includes('insufficient_quota') ||
|
||||
lower.includes('resource_exhausted')
|
||||
) {
|
||||
return new ServerClassifiedProviderError(
|
||||
`${providerLabel} quota exhausted${status !== undefined ? ` (status ${status})` : ''}`,
|
||||
{ kind: 'quota_exhausted', cause: input.cause },
|
||||
);
|
||||
}
|
||||
|
||||
if (status === 429) {
|
||||
return new ServerClassifiedProviderError(`${providerLabel} rate limit (429)`, {
|
||||
kind: 'rate_limit',
|
||||
cause: input.cause,
|
||||
...(retryAfterMs !== undefined ? { retryAfterMs } : {}),
|
||||
});
|
||||
}
|
||||
|
||||
if (status === 401 || status === 403) {
|
||||
return new ServerClassifiedProviderError(`${providerLabel} auth error (status ${status})`, {
|
||||
kind: 'auth_invalid',
|
||||
cause: input.cause,
|
||||
});
|
||||
}
|
||||
|
||||
if (status === 400 || status === 404) {
|
||||
return new ServerClassifiedProviderError(`${providerLabel} bad request (status ${status})`, {
|
||||
kind: 'unrecoverable',
|
||||
cause: input.cause,
|
||||
});
|
||||
}
|
||||
|
||||
if (status !== undefined && status >= 500 && status < 600) {
|
||||
return new ServerClassifiedProviderError(`${providerLabel} upstream error (status ${status})`, {
|
||||
kind: 'transient',
|
||||
cause: input.cause,
|
||||
});
|
||||
}
|
||||
|
||||
if (status === undefined) {
|
||||
const message = input.cause instanceof Error ? input.cause.message : String(input.cause);
|
||||
return new ServerClassifiedProviderError(`${providerLabel} network error: ${message}`, {
|
||||
kind: 'transient',
|
||||
cause: input.cause,
|
||||
});
|
||||
}
|
||||
|
||||
return new ServerClassifiedProviderError(
|
||||
`${providerLabel} API error: ${status}${body ? ` - ${body.substring(0, 200)}` : ''}`,
|
||||
{ kind: 'unrecoverable', cause: input.cause },
|
||||
);
|
||||
}
|
||||
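The classifier above only labels an error; what the caller does with each `kind` is not shown in this section. The sketch below is one way a consumer might turn a `kind` and an optional `retryAfterMs` into a retry decision. The names `planRetry`, `RetryDecision`, `BASE_DELAY_MS`, and `MAX_DELAY_MS` are hypothetical. Treating only `transient` and `rate_limit` as retryable is an assumption, not something the PR states.

```typescript
// Hypothetical retry planner; not part of the PR. The kind names mirror
// ServerProviderErrorClass from the file above.
type ErrorKind =
  | 'transient'
  | 'unrecoverable'
  | 'rate_limit'
  | 'quota_exhausted'
  | 'auth_invalid'
  | 'parse_error';

interface RetryDecision {
  retry: boolean;
  delayMs: number;
}

const BASE_DELAY_MS = 1_000; // assumption: first retry waits one second
const MAX_DELAY_MS = 60_000; // assumption: backoff is capped at one minute

function planRetry(kind: ErrorKind, attempt: number, retryAfterMs?: number): RetryDecision {
  // Assumption: only transient failures and rate limits are worth retrying.
  if (kind !== 'transient' && kind !== 'rate_limit') {
    return { retry: false, delayMs: 0 };
  }
  // Exponential backoff on the zero-based attempt index, capped.
  const backoff = Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
  // For rate limits, never retry sooner than the server asked for.
  const delayMs =
    kind === 'rate_limit' && retryAfterMs !== undefined
      ? Math.max(backoff, retryAfterMs)
      : backoff;
  return { retry: true, delayMs };
}

const first = planRetry('transient', 0);
const limited = planRetry('rate_limit', 1, 5_000);
const denied = planRetry('auth_invalid', 3);
```

Taking the larger of the backoff and the server hint keeps the planner from retrying earlier than a `Retry-After` header allows, while still backing off when the hint is short.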
@@ -0,0 +1,164 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
|
||||
import { ModeManager } from '../../../../services/domain/ModeManager.js';
|
||||
import type { ModeConfig, ObservationType } from '../../../../services/domain/types.js';
|
||||
import { stripTags } from '../../../../utils/tag-stripping.js';
|
||||
import type { PostgresAgentEvent } from '../../../../storage/postgres/agent-events.js';
|
||||
import type { ServerGenerationContext } from './types.js';
|
||||
|
||||
// Fallback list mirrors the default observation types used by claude-mem
// modes. The server-beta prompt does not strictly need a loaded mode file
// (the parser accepts any of these as the <type> value), so when no mode is
// loaded (tests, fresh installs) we synthesize a minimal type list rather
// than throwing.
|
||||
const FALLBACK_OBSERVATION_TYPES: ReadonlyArray<Pick<ObservationType, 'id'>> = [
|
||||
{ id: 'discovery' },
|
||||
{ id: 'progress' },
|
||||
{ id: 'blocker' },
|
||||
{ id: 'decision' },
|
||||
];
|
||||
|
||||
// Build a single-shot generation prompt from a list of AgentEvent records
// plus project/session metadata. Output: a user prompt asking the provider
// to return one or more <observation> XML blocks, or a single self-closing
// <skip_summary /> tag if the batch should be skipped. This is intentionally
// a single-turn request: server-beta does NOT use the worker's multi-turn
// SDK conversation model. parseAgentXml(...) accepts the response unchanged.
//
// Privacy: every event payload passes through `stripTags` (which removes
// <private>, <claude-mem-context>, <system-reminder>, etc.) before being
// included in the prompt. Enforcement here is belt-and-suspenders:
// `processGeneratedResponse` also discards observations that are entirely
// derived from privately-tagged inputs.
|
||||
|
||||
export interface BuildServerPromptResult {
|
||||
readonly prompt: string;
|
||||
readonly hadPrivateContent: boolean;
|
||||
readonly skippedAll: boolean;
|
||||
}
|
||||
|
||||
const MAX_PAYLOAD_CHARS = 16 * 1024;
|
||||
|
||||
export function buildServerGenerationPrompt(
|
||||
context: ServerGenerationContext,
|
||||
options: { mode?: ModeConfig } = {},
|
||||
): BuildServerPromptResult {
|
||||
const mode = options.mode ?? loadActiveModeOrFallback();
|
||||
|
||||
let hadPrivateContent = false;
|
||||
let allEventsScrubbedToEmpty = true;
|
||||
const eventBlocks: string[] = [];
|
||||
|
||||
for (const event of context.events) {
|
||||
const block = buildEventBlock(event);
|
||||
if (block.hadPrivate) {
|
||||
hadPrivateContent = true;
|
||||
}
|
||||
if (block.body.length > 0) {
|
||||
allEventsScrubbedToEmpty = false;
|
||||
eventBlocks.push(block.body);
|
||||
}
|
||||
}
|
||||
|
||||
const skippedAll = context.events.length > 0 && allEventsScrubbedToEmpty;
|
||||
|
||||
const sessionTag = context.project.serverSessionId
|
||||
? `\n <server_session_id>${escapeXml(context.project.serverSessionId)}</server_session_id>`
|
||||
: '';
|
||||
const projectTag = context.project.projectName
|
||||
? `\n <project_name>${escapeXml(context.project.projectName)}</project_name>`
|
||||
: '';
|
||||
|
||||
const observationOutputSchema = buildObservationOutputSchema(mode);
|
||||
|
||||
const prompt = [
|
||||
'<server_beta_observation_request>',
|
||||
` <project_id>${escapeXml(context.project.projectId)}</project_id>`,
|
||||
` <team_id>${escapeXml(context.project.teamId)}</team_id>` + sessionTag + projectTag,
|
||||
` <generation_job_id>${escapeXml(context.job.id)}</generation_job_id>`,
|
||||
' <agent_events>',
|
||||
eventBlocks.length > 0 ? eventBlocks.join('\n') : ' <!-- empty after privacy stripping -->',
|
||||
' </agent_events>',
|
||||
'</server_beta_observation_request>',
|
||||
'',
|
||||
'You are observing an agent at work. Return one or more',
|
||||
'<observation>...</observation> XML blocks summarizing durable, useful',
|
||||
'discoveries from the events above. If the events contain nothing worth',
|
||||
'recording (e.g., everything was scrubbed by privacy filters or the',
|
||||
'activity was trivial), return a single self-closing <skip_summary />',
|
||||
'tag and nothing else. Do not include any prose outside the XML.',
|
||||
'',
|
||||
'Schema for each <observation> block:',
|
||||
observationOutputSchema,
|
||||
].join('\n');
|
||||
|
||||
return { prompt, hadPrivateContent, skippedAll };
|
||||
}
|
||||
|
||||
interface EventBlockResult {
|
||||
body: string;
|
||||
hadPrivate: boolean;
|
||||
}
|
||||
|
||||
function buildEventBlock(event: PostgresAgentEvent): EventBlockResult {
|
||||
const rawPayload =
|
||||
typeof event.payload === 'string' ? event.payload : JSON.stringify(event.payload ?? {}, null, 2);
|
||||
|
||||
const stripResult = stripTags(rawPayload);
|
||||
const hadPrivate = (stripResult.counts.private ?? 0) > 0;
|
||||
const truncatedPayload = stripResult.stripped.length > MAX_PAYLOAD_CHARS
|
||||
? stripResult.stripped.slice(0, MAX_PAYLOAD_CHARS) + '\n[...truncated]'
|
||||
: stripResult.stripped;
|
||||
|
||||
if (truncatedPayload.trim().length === 0) {
|
||||
return { body: '', hadPrivate };
|
||||
}
|
||||
|
||||
return {
|
||||
body: [
|
||||
' <agent_event>',
|
||||
` <id>${escapeXml(event.id)}</id>`,
|
||||
` <event_type>${escapeXml(event.eventType)}</event_type>`,
|
||||
` <source_adapter>${escapeXml(event.sourceAdapter)}</source_adapter>`,
|
||||
` <occurred_at>${new Date(event.occurredAtEpoch).toISOString()}</occurred_at>`,
|
||||
' <payload>',
|
||||
escapeXml(truncatedPayload),
|
||||
' </payload>',
|
||||
' </agent_event>',
|
||||
].join('\n'),
|
||||
hadPrivate,
|
||||
};
|
||||
}
|
||||
|
||||
function loadActiveModeOrFallback(): ModeConfig | { observation_types: ReadonlyArray<Pick<ObservationType, 'id'>> } {
|
||||
try {
|
||||
return ModeManager.getInstance().getActiveMode();
|
||||
} catch {
|
||||
return { observation_types: FALLBACK_OBSERVATION_TYPES } as unknown as ModeConfig;
|
||||
}
|
||||
}
|
||||
|
||||
function buildObservationOutputSchema(mode: ModeConfig | { observation_types: ReadonlyArray<Pick<ObservationType, 'id'>> }): string {
|
||||
const types = mode.observation_types.map(t => t.id).join(' | ');
|
||||
return [
|
||||
'<observation>',
|
||||
` <type>[ ${types} ]</type>`,
|
||||
' <title>...</title>',
|
||||
' <subtitle>...</subtitle>',
|
||||
' <facts><fact>...</fact></facts>',
|
||||
' <narrative>...</narrative>',
|
||||
' <concepts><concept>...</concept></concepts>',
|
||||
' <files_read><file>...</file></files_read>',
|
||||
' <files_modified><file>...</file></files_modified>',
|
||||
'</observation>',
|
||||
].join('\n');
|
||||
}
|
||||
|
||||
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}
|
||||
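The prompt builder escapes every interpolated value before placing it inside the request XML, so that payload text cannot open or close tags in the prompt. A minimal sketch of that escaping step, assuming the five standard XML entities; the sample `payload` string is invented for illustration:

```typescript
// Standalone copy of an XML-escaping helper, assuming the five predefined
// XML entities. The ampersand must be replaced first so that the entities
// produced by later replacements are not escaped a second time.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Hypothetical event payload containing markup-like text.
const payload = '{"note":"<private>x</private> & more"}';
const wrapped = `<payload>${escapeXml(payload)}</payload>`;
```

After escaping, the only literal angle brackets in `wrapped` belong to the enclosing `<payload>` element.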
@@ -0,0 +1,33 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
|
||||
import type { PostgresAgentEvent } from '../../../../storage/postgres/agent-events.js';
|
||||
import type { PostgresObservationGenerationJob } from '../../../../storage/postgres/generation-jobs.js';
|
||||
|
||||
// ServerGenerationContext is the input handed to a server provider adapter.
// It is reloaded from Postgres on every retry; BullMQ payload is advisory.
// Anti-pattern guard: this MUST NOT carry worker session state.
|
||||
export interface ServerGenerationContext {
|
||||
readonly job: PostgresObservationGenerationJob;
|
||||
readonly events: readonly PostgresAgentEvent[];
|
||||
readonly project: {
|
||||
readonly projectId: string;
|
||||
readonly teamId: string;
|
||||
readonly serverSessionId: string | null;
|
||||
readonly projectName?: string | null;
|
||||
};
|
||||
}
|
||||
|
||||
// ServerGenerationResult is the raw provider response (XML accepted by
// parseAgentXml). Empty string means the provider returned nothing; this is
// handled upstream as a "skip with no observation" outcome by
// processGeneratedResponse.
|
||||
export interface ServerGenerationResult {
|
||||
readonly rawText: string;
|
||||
readonly tokensUsed?: number;
|
||||
readonly providerLabel: string;
|
||||
readonly modelId?: string;
|
||||
}
|
||||
|
||||
export interface ServerGenerationProvider {
|
||||
readonly providerLabel: 'claude' | 'gemini' | 'openrouter';
|
||||
generate(context: ServerGenerationContext, signal?: AbortSignal): Promise<ServerGenerationResult>;
|
||||
}
|
||||
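Because providers are consumed through the small `ServerGenerationProvider` interface, a test double is easy to write. The sketch below uses simplified stand-in types (`SketchContext`, `SketchResult`, `SketchProvider`) instead of the PR's Postgres-backed ones, and `CannedProvider` is a hypothetical class. It shows the shape of the contract, not the project's actual test code.

```typescript
// Simplified stand-ins for the PR's types; the real ServerGenerationContext
// carries Postgres job and event rows.
interface SketchContext {
  readonly events: readonly { id: string }[];
}

interface SketchResult {
  readonly rawText: string;
  readonly tokensUsed?: number;
  readonly providerLabel: string;
  readonly modelId?: string;
}

interface SketchProvider {
  readonly providerLabel: 'claude' | 'gemini' | 'openrouter';
  generate(context: SketchContext, signal?: AbortSignal): Promise<SketchResult>;
}

// Hypothetical test double: returns canned text and records each call.
class CannedProvider implements SketchProvider {
  readonly providerLabel = 'gemini' as const;
  readonly calls: SketchContext[] = [];
  private readonly cannedText: string;

  constructor(cannedText: string) {
    this.cannedText = cannedText;
  }

  async generate(context: SketchContext, signal?: AbortSignal): Promise<SketchResult> {
    // Respect cancellation the way a fetch-based provider would.
    if (signal?.aborted) {
      throw new Error('aborted before generation started');
    }
    this.calls.push(context);
    return { rawText: this.cannedText, providerLabel: this.providerLabel, modelId: 'canned' };
  }
}
```

A caller would construct it as `new CannedProvider('<skip_summary />')` and pass it wherever a provider is expected, then inspect `calls` to see which contexts were requested.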
@@ -2,10 +2,12 @@
|
||||
|
||||
import {
|
||||
Queue,
|
||||
QueueEvents,
|
||||
Worker,
|
||||
type Job,
|
||||
type JobsOptions,
|
||||
type Processor,
|
||||
type QueueEventsOptions,
|
||||
type QueueOptions,
|
||||
type WorkerOptions
|
||||
} from 'bullmq';
|
||||
@@ -33,6 +35,22 @@ export interface ServerJobCounts {
|
||||
completed: number;
|
||||
}
|
||||
|
||||
// Phase 12: runtime stalled counter. BullMQ doesn't expose a stalled counter
// from getJobCounts (the underlying list is rotated on consumption). We keep
// a per-process counter that tracks how many distinct stalled events we've
// observed since startup. /api/health and /v1/info surface this.
|
||||
export interface ServerJobLifecycleCounters {
|
||||
stalled: number;
|
||||
errored: number;
|
||||
}
|
||||
|
||||
export interface ServerJobObservedListener {
|
||||
onCompleted?: (jobId: string, durationMs: number, returnvalue: unknown) => void;
|
||||
onFailed?: (jobId: string | undefined, attemptsMade: number, reason: string) => void;
|
||||
onStalled?: (jobId: string) => void;
|
||||
onError?: (error: unknown) => void;
|
||||
}
|
||||
|
||||
export interface ServerJobQueueOptions<TPayload> {
|
||||
name: string;
|
||||
config: RedisQueueConfig;
|
||||
@@ -63,7 +81,18 @@ export class ServerJobQueue<TPayload extends object = object> {
|
||||
private readonly workerFactory?: ServerJobQueueOptions<TPayload>['workerFactory'];
|
||||
private queue: ReturnType<NonNullable<ServerJobQueueOptions<TPayload>['queueFactory']>> | Queue<TPayload> | null = null;
|
||||
private worker: ReturnType<NonNullable<ServerJobQueueOptions<TPayload>['workerFactory']>> | Worker<TPayload> | null = null;
|
||||
private queueEvents: QueueEvents | null = null;
|
||||
private started = false;
|
||||
private readonly counters: ServerJobLifecycleCounters = { stalled: 0, errored: 0 };
|
||||
private readonly listeners: ServerJobObservedListener[] = [];
|
||||
private readonly jobStartTimes = new Map<string, number>();
|
||||
// worker.on('stalled') and the QueueEvents 'stalled' subscriber both fire
// for the same job; BullMQ's docs explicitly recommend listening on both
// for production reliability. To avoid double-counting and double-callback
// we record each stalled jobId here for a short TTL and treat the second
// signal as an idempotent no-op.
|
||||
private readonly recentlyStalled = new Map<string, NodeJS.Timeout>();
|
||||
private static readonly STALLED_DEDUPE_WINDOW_MS = 30_000;
|
||||
|
||||
constructor(options: ServerJobQueueOptions<TPayload>) {
|
||||
this.name = options.name;
|
||||
@@ -154,6 +183,53 @@ export class ServerJobQueue<TPayload extends object = object> {
|
||||
// BullMQ docs require `worker.on('error', ...)` to avoid unhandled rejections
// when a job throws. We construct the Worker with autorun: false so the
// caller controls startup explicitly via run().
//
// Phase 12: wire `completed`, `failed`, `progress`, `error`, and the
// QueueEvents `stalled` listener. Stalled events go through QueueEvents
// because BullMQ's docs note rare stalls don't always reach the local
// worker.on('stalled') listener; QueueEvents publishes from Redis.
// Deduped stalled handler. Counts the stall once even though BullMQ may
// surface it via both worker.on('stalled') and QueueEvents 'stalled'.
|
||||
private notifyStalled(jobId: string, source: 'worker' | 'queue-events'): void {
|
||||
if (this.recentlyStalled.has(jobId)) {
|
||||
logger.debug?.('QUEUE', `[generation] job=${jobId} stalled (suppressed duplicate from ${source})`, {
|
||||
queue: this.name,
|
||||
jobId,
|
||||
source,
|
||||
});
|
||||
return;
|
||||
}
|
||||
const timer = setTimeout(() => {
|
||||
this.recentlyStalled.delete(jobId);
|
||||
}, ServerJobQueue.STALLED_DEDUPE_WINDOW_MS);
|
||||
if (typeof (timer as { unref?: () => void }).unref === 'function') {
|
||||
(timer as { unref: () => void }).unref();
|
||||
}
|
||||
this.recentlyStalled.set(jobId, timer);
|
||||
this.counters.stalled += 1;
|
||||
logger.warn('QUEUE', `[generation] job=${jobId} stalled${source === 'queue-events' ? ' (queue-events)' : ''}`, {
|
||||
queue: this.name,
|
||||
jobId,
|
||||
source,
|
||||
});
|
||||
for (const l of this.listeners) {
|
||||
try { l.onStalled?.(jobId); } catch { /* listener errors must not propagate */ }
|
||||
}
|
||||
}
|
||||
|
||||
// Single source of truth for queue-side error accounting. worker errors and
|
||||
// QueueEvents errors both increment counters.errored and notify listeners,
|
||||
// so per-process metrics aren't asymmetric across the two sources.
|
||||
private notifyQueueError(error: unknown, source: 'worker' | 'queue-events'): void {
|
||||
this.counters.errored += 1;
|
||||
logger.warn('QUEUE', `${this.name} ${source} error`, {
|
||||
error: error instanceof Error ? error.message : String(error),
|
||||
});
|
||||
for (const l of this.listeners) {
|
||||
try { l.onError?.(error); } catch { /* listener errors must not propagate */ }
|
||||
}
|
||||
}
|
||||
|
||||
start(processor: Processor<TPayload>): void {
|
||||
if (this.started) {
|
||||
throw new Error(`ServerJobQueue ${this.name} is already started`);
|
||||
@@ -168,22 +244,115 @@ export class ServerJobQueue<TPayload extends object = object> {
|
||||
const worker = this.workerFactory
|
||||
? this.workerFactory(this.name, processor, workerOptions)
|
||||
: new Worker<TPayload>(this.name, processor, workerOptions);
|
||||
worker.on('error', (error: unknown) => {
|
||||
logger.warn('QUEUE', `${this.name} worker error`, {
|
||||
error: error instanceof Error ? error.message : String(error)
|
||||
worker.on('error', (error: unknown) => this.notifyQueueError(error, 'worker'));
|
||||
// BullMQ Worker exposes `active`, `completed`, `failed`, `progress`, and
|
||||
// `stalled` events. We attach to all five because the runtime relies on
|
||||
// them for observability (Phase 12).
|
||||
if (typeof (worker as { on?: unknown }).on === 'function') {
|
||||
const w = worker as Worker<TPayload>;
|
||||
w.on('active', (job: Job<TPayload>) => {
|
||||
if (job.id) this.jobStartTimes.set(job.id, Date.now());
|
||||
});
|
||||
});
|
||||
w.on('completed', (job: Job<TPayload>, returnvalue: unknown) => {
|
||||
const startedAt = job.id ? this.jobStartTimes.get(job.id) : undefined;
|
||||
const durationMs = startedAt ? Date.now() - startedAt : 0;
|
||||
if (job.id) this.jobStartTimes.delete(job.id);
|
||||
const sourceType = (job.data as { source_type?: string } | undefined)?.source_type ?? '?';
|
||||
logger.info('QUEUE', `[generation] job=${job.id ?? '?'} source_type=${sourceType} duration=${durationMs}ms`, {
|
||||
queue: this.name,
|
||||
jobId: job.id ?? null,
|
||||
sourceType,
|
||||
durationMs,
|
||||
});
|
||||
for (const l of this.listeners) {
|
||||
try { l.onCompleted?.(job.id ?? '?', durationMs, returnvalue); } catch { /* swallow listener errors only */ }
|
||||
}
|
||||
});
|
||||
w.on('failed', (job: Job<TPayload> | undefined, error: Error) => {
|
||||
if (job?.id) this.jobStartTimes.delete(job.id);
|
||||
const sourceType = (job?.data as { source_type?: string } | undefined)?.source_type ?? '?';
|
||||
const attemptsMade = job?.attemptsMade ?? 0;
|
||||
logger.warn('QUEUE', `[generation] job=${job?.id ?? '?'} source_type=${sourceType} attempts=${attemptsMade} reason=${error.message}`, {
|
||||
queue: this.name,
|
||||
jobId: job?.id ?? null,
|
||||
sourceType,
|
||||
attemptsMade,
|
||||
reason: error.message,
|
||||
});
|
||||
for (const l of this.listeners) {
|
||||
try { l.onFailed?.(job?.id, attemptsMade, error.message); } catch { /* swallow */ }
|
||||
}
|
||||
});
|
||||
w.on('progress', (job: Job<TPayload>, progress: unknown) => {
|
||||
logger.debug?.('QUEUE', `[generation] job=${job.id ?? '?'} progress`, {
|
||||
queue: this.name,
|
||||
jobId: job.id ?? null,
|
||||
progress,
|
||||
});
|
||||
});
|
||||
w.on('stalled', (jobId: string) => this.notifyStalled(jobId, 'worker'));
|
||||
}
|
||||
worker.run();
|
||||
this.worker = worker;
|
||||
|
||||
// QueueEvents subscribes to Redis pub/sub for cross-process events
// (BullMQ "Stalled Jobs" docs recommend this for production reliability).
// Skip in test/factory mode since the test factory does not provide a
// real Redis connection.
|
||||
if (!this.workerFactory) {
|
||||
try {
|
||||
const events = new QueueEvents(this.name, {
|
||||
connection: this.config.connection,
|
||||
prefix: this.config.prefix,
|
||||
} as QueueEventsOptions);
|
||||
events.on('stalled', ({ jobId }: { jobId: string }) => this.notifyStalled(jobId, 'queue-events'));
|
||||
// QueueEvents emits its own 'error' too — surface through the same
|
||||
// counter+listener path as worker errors so observability stays symmetric.
|
||||
events.on('error', (error: Error) => this.notifyQueueError(error, 'queue-events'));
|
||||
this.queueEvents = events;
|
||||
} catch (error) {
|
||||
logger.warn('QUEUE', `${this.name} failed to start QueueEvents listener`, {
|
||||
error: error instanceof Error ? error.message : String(error),
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
this.started = true;
|
||||
}
|
||||
|
||||
/**
|
||||
* Phase 12 — register an observer for completed/failed/stalled/error
|
||||
* events. Used by the runtime to surface lifecycle hooks (audit, metrics)
|
||||
* without subclassing. Listeners that throw are isolated.
|
||||
*/
|
||||
observe(listener: ServerJobObservedListener): void {
|
||||
this.listeners.push(listener);
|
||||
}
|
||||
|
||||
/**
|
||||
* Phase 12 — runtime counters for stalled/errored events. waiting/active/
|
||||
* completed/failed/delayed live in `getCounts()` (BullMQ getJobCounts).
|
||||
* Stalled is a per-process counter because BullMQ rotates the underlying
|
||||
* list and there's no reliable count from getJobCounts.
|
||||
*/
|
||||
getLifecycleCounters(): ServerJobLifecycleCounters {
|
||||
return { ...this.counters };
|
||||
}
|
||||
|
||||
isStarted(): boolean {
|
||||
return this.started;
|
||||
}
|
||||
|
||||
async close(): Promise<void> {
|
||||
const errors: Error[] = [];
|
||||
if (this.queueEvents) {
|
||||
try {
|
||||
await this.queueEvents.close();
|
||||
} catch (error) {
|
||||
errors.push(error instanceof Error ? error : new Error(String(error)));
|
||||
}
|
||||
this.queueEvents = null;
|
||||
}
|
||||
if (this.worker) {
|
||||
try {
|
||||
await this.worker.close();
|
||||
@@ -201,6 +370,10 @@ export class ServerJobQueue<TPayload extends object = object> {
|
||||
}
|
||||
this.queue = null;
|
||||
}
|
||||
for (const timer of this.recentlyStalled.values()) {
|
||||
clearTimeout(timer);
|
||||
}
|
||||
this.recentlyStalled.clear();
|
||||
if (errors.length > 0) {
|
||||
throw errors[0];
|
||||
}
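The `observe()` doc comment in this file says that listeners which throw are isolated, and that `getLifecycleCounters()` hands back a copy of the per-process counters. The following is a minimal sketch of that pattern only, assuming simplified stand-in types: `JobListener` and `ListenerRegistry` are hypothetical names, not the PR's `ServerJobObservedListener` or `ServerJobQueue`, and the stalled de-duplication that the real class does via `recentlyStalled` is not modeled.

```typescript
// Hypothetical, simplified stand-in for the listener type used in the diff.
interface JobListener {
  onFailed?: (jobId: string | undefined, attemptsMade: number, reason: string) => void;
  onStalled?: (jobId: string) => void;
}

// Hypothetical registry illustrating "listeners that throw are isolated".
class ListenerRegistry {
  private readonly listeners: JobListener[] = [];
  private readonly counters = { stalled: 0 };

  observe(listener: JobListener): void {
    this.listeners.push(listener);
  }

  notifyFailed(jobId: string | undefined, attemptsMade: number, reason: string): void {
    for (const l of this.listeners) {
      // One broken listener must not prevent the others from running.
      try { l.onFailed?.(jobId, attemptsMade, reason); } catch { /* swallow */ }
    }
  }

  notifyStalled(jobId: string): void {
    this.counters.stalled += 1;
    for (const l of this.listeners) {
      try { l.onStalled?.(jobId); } catch { /* swallow */ }
    }
  }

  // Returns a shallow copy so callers cannot mutate the live counters.
  getLifecycleCounters(): { stalled: number } {
    return { ...this.counters };
  }
}

const registry = new ListenerRegistry();
const seen: string[] = [];
registry.observe({ onFailed: () => { throw new Error('listener bug'); } });
registry.observe({
  onFailed: (jobId, attemptsMade, reason) => { seen.push(`${jobId}:${attemptsMade}:${reason}`); },
});
registry.notifyFailed('evt-1', 2, 'provider timeout');
registry.notifyStalled('evt-1');

const snapshot = registry.getLifecycleCounters();
snapshot.stalled = 99; // mutates the copy only
```

The second listener still records the failure even though the first one throws, and writing to the returned snapshot leaves the registry's own counter untouched.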
|
||||
|
||||
@@ -9,11 +9,12 @@ import type { JsonObject } from '../../storage/postgres/utils.js';
|
||||
import { logger } from '../../utils/logger.js';
|
||||
import { buildServerJobId } from './job-id.js';
|
||||
import type { ServerJobQueue } from './ServerJobQueue.js';
|
||||
-import type {
|
||||
-GenerateObservationsForEventJob,
|
||||
-GenerateSessionSummaryJob,
|
||||
-ReindexObservationJob,
|
||||
-ServerGenerationJobKind
|
||||
+import {
|
||||
+assertServerGenerationJobPayload,
|
||||
+type GenerateObservationsForEventJob,
|
||||
+type GenerateSessionSummaryJob,
|
||||
+type ReindexObservationJob,
|
||||
+type ServerGenerationJobKind,
|
||||
} from './types.js';
|
||||
|
||||
// Postgres outbox is canonical history; BullMQ is the execution transport.
|
||||
@@ -86,6 +87,10 @@ export async function enqueueOutbox(
|
||||
});
|
||||
|
||||
try {
|
||||
// Phase 11 — defense in depth. Validate the payload shape at the queue
|
||||
// boundary so a malformed enqueue is rejected synchronously and never
|
||||
// produces a job whose audit trail is missing fields.
|
||||
assertServerGenerationJobPayload(payload);
|
||||
await queue.add(bullmqJobId, payload);
|
||||
await eventsRepo.append({
|
||||
generationJobId: row.id,
|
||||
|
||||
@@ -1,5 +1,6 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
|
||||
import { z } from 'zod';
|
||||
import type {
|
||||
ObservationGenerationJobSourceType,
|
||||
ObservationGenerationJobStatus
|
||||
@@ -9,6 +10,12 @@ export type ServerGenerationJobKind = 'event' | 'event-batch' | 'summary' | 'rei
|
||||
|
||||
export type ServerGenerationJobStatus = ObservationGenerationJobStatus;
|
||||
|
||||
// Phase 11 — every BullMQ job carries the full team-aware tracing surface so
|
||||
// the worker can audit and scope-check on every retry. team_id and project_id
|
||||
// are advisory: the worker MUST reload the canonical outbox row from Postgres
|
||||
// and compare these fields before any side effect. Treating these as auth
|
||||
// authority would be a bypass — the comparison is a tampering detector, not
|
||||
// the auth gate.
|
||||
export interface ServerGenerationJob {
|
||||
kind: ServerGenerationJobKind;
|
||||
team_id: string;
|
||||
@@ -16,6 +23,18 @@ export interface ServerGenerationJob {
|
||||
source_type: ObservationGenerationJobSourceType;
|
||||
source_id: string;
|
||||
generation_job_id: string;
|
||||
// Identity of the API key that initiated this job at the HTTP boundary.
|
||||
// Reused at execution time to detect revocation between enqueue and run.
|
||||
api_key_id: string | null;
|
||||
// The actor associated with the api key at enqueue time. Audit-only;
|
||||
// never trust this for authz decisions.
|
||||
actor_id: string | null;
|
||||
// Legacy adapter or surface that produced the source row, for routing
|
||||
// and audit (e.g. 'api', 'hooks', 'mcp', 'compat:sessions-observations').
|
||||
source_adapter: string;
|
||||
// Phase 12 — request correlation id, optional but always serialized as a
|
||||
// nullable field so downstream consumers can rely on shape stability.
|
||||
request_id?: string | null;
|
||||
}
|
||||
|
||||
export interface GenerateObservationsForEventJob extends ServerGenerationJob {
|
||||
@@ -57,3 +76,80 @@ export const SERVER_JOB_KIND_PREFIX: Record<ServerGenerationJobKind, string> = {
|
||||
summary: 'sum',
|
||||
reindex: 'rdx'
|
||||
};
|
||||
|
||||
// Phase 11 — Zod schema validates payloads at the queue boundary so a
|
||||
// malformed enqueue is rejected synchronously rather than silently producing
|
||||
// a job the worker can't audit. Required fields here mirror the
|
||||
// ServerGenerationJob interface; a missing team_id, project_id, or
|
||||
// generation_job_id should always be a programmer error caught at enqueue.
|
||||
|
||||
const baseFieldsSchema = z.object({
|
||||
team_id: z.string().min(1, 'team_id is required'),
|
||||
project_id: z.string().min(1, 'project_id is required'),
|
||||
source_type: z.enum(['agent_event', 'session_summary', 'observation_reindex']),
|
||||
source_id: z.string().min(1, 'source_id is required'),
|
||||
generation_job_id: z.string().min(1, 'generation_job_id is required'),
|
||||
// api_key_id and actor_id are nullable to accommodate local-dev/system
|
||||
// enqueues, but the *field* must be present in the payload so audit
|
||||
// records always render the same shape.
|
||||
api_key_id: z.string().min(1).nullable(),
|
||||
actor_id: z.string().min(1).nullable(),
|
||||
source_adapter: z.string().min(1, 'source_adapter is required'),
|
||||
// Phase 12 — request_id is optional in the schema (older jobs predating
|
||||
// this phase have nullable/missing values) but always passes through to
|
||||
// logs and audit when present.
|
||||
request_id: z.string().min(1).nullable().optional(),
|
||||
});
|
||||
|
||||
export const GenerateObservationsForEventJobSchema = baseFieldsSchema.extend({
|
||||
kind: z.literal('event'),
|
||||
agent_event_id: z.string().min(1),
|
||||
});
|
||||
|
||||
export const GenerateObservationsForEventBatchJobSchema = baseFieldsSchema.extend({
|
||||
kind: z.literal('event-batch'),
|
||||
agent_event_ids: z.array(z.string().min(1)).min(1),
|
||||
});
|
||||
|
||||
export const GenerateSessionSummaryJobSchema = baseFieldsSchema.extend({
|
||||
kind: z.literal('summary'),
|
||||
server_session_id: z.string().min(1),
|
||||
});
|
||||
|
||||
export const ReindexObservationJobSchema = baseFieldsSchema.extend({
|
||||
kind: z.literal('reindex'),
|
||||
observation_id: z.string().min(1),
|
||||
});
|
||||
|
||||
export const ServerGenerationJobPayloadSchema = z.discriminatedUnion('kind', [
|
||||
GenerateObservationsForEventJobSchema,
|
||||
GenerateObservationsForEventBatchJobSchema,
|
||||
GenerateSessionSummaryJobSchema,
|
||||
ReindexObservationJobSchema,
|
||||
]);
|
||||
|
||||
export class ServerGenerationJobPayloadValidationError extends Error {
|
||||
readonly issues: z.ZodIssue[];
|
||||
|
||||
constructor(issues: z.ZodIssue[]) {
|
||||
super(`invalid server generation job payload: ${issues.map(i => i.message).join('; ')}`);
|
||||
this.issues = issues;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Validate a candidate BullMQ payload against the discriminated union and
|
||||
* return a typed payload, or throw `ServerGenerationJobPayloadValidationError`.
|
||||
* Use this at every enqueue site so a malformed payload never enters the
|
||||
* transport — the worker MUST also re-validate from Postgres but defense in
|
||||
* depth is cheap.
|
||||
*/
|
||||
export function assertServerGenerationJobPayload(
|
||||
candidate: unknown,
|
||||
): ServerGenerationJobPayload {
|
||||
const result = ServerGenerationJobPayloadSchema.safeParse(candidate);
|
||||
if (!result.success) {
|
||||
throw new ServerGenerationJobPayloadValidationError(result.error.issues);
|
||||
}
|
||||
return result.data as ServerGenerationJobPayload;
|
||||
}
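The Phase 11 comment on `ServerGenerationJob` says `team_id` and `project_id` on the BullMQ payload are advisory, and that the worker must reload the canonical outbox row and compare before any side effect. A minimal sketch of that comparison follows, assuming hypothetical names (`AdvisoryJobScope`, `CanonicalOutboxRow`, `compareJobScope`) and a row shape limited to the columns the stalled-job audit query in this PR selects; it is a tampering detector only and is not the auth gate.

```typescript
// Subset of the advisory fields carried on the BullMQ payload.
interface AdvisoryJobScope {
  team_id: string;
  project_id: string;
  generation_job_id: string;
}

// Hypothetical shape of the canonical outbox row reloaded from Postgres.
interface CanonicalOutboxRow {
  id: string;
  team_id: string | null;
  project_id: string | null;
}

type ScopeCheck = { ok: true } | { ok: false; mismatched: string[] };

// Compares the advisory payload against the canonical row. A mismatch means
// the payload was altered or is stale; the caller should refuse to proceed.
function compareJobScope(payload: AdvisoryJobScope, row: CanonicalOutboxRow): ScopeCheck {
  const mismatched: string[] = [];
  if (payload.generation_job_id !== row.id) mismatched.push('generation_job_id');
  if (payload.team_id !== row.team_id) mismatched.push('team_id');
  if (payload.project_id !== row.project_id) mismatched.push('project_id');
  return mismatched.length === 0 ? { ok: true } : { ok: false, mismatched };
}

const row: CanonicalOutboxRow = { id: 'job-1', team_id: 'team-a', project_id: 'proj-a' };

const honest = compareJobScope(
  { generation_job_id: 'job-1', team_id: 'team-a', project_id: 'proj-a' },
  row,
);
const tampered = compareJobScope(
  { generation_job_id: 'job-1', team_id: 'team-b', project_id: 'proj-a' },
  row,
);
```

Here `honest.ok` is `true`, while `tampered` reports `team_id` as the mismatched field. Scoping decisions would still come from the reloaded row, never from the payload.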
|
||||
|
||||
@@ -0,0 +1,199 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
|
||||
import { createHash } from 'crypto';
|
||||
import type { NextFunction, Request, RequestHandler, Response } from 'express';
|
||||
import type { PostgresPool } from '../../storage/postgres/pool.js';
|
||||
import type { PostgresApiKey } from '../../storage/postgres/auth.js';
|
||||
import type { AuthContext } from './auth.js';
|
||||
|
||||
// Postgres-backed auth middleware for the server-beta runtime.
|
||||
//
|
||||
// Mirrors src/server/middleware/auth.ts but reads API keys from the Postgres
|
||||
// `api_keys` table instead of bun:sqlite. Phase 4 routes use this so the
|
||||
// runtime depends only on the Postgres pool and Postgres-backed repositories.
|
||||
//
|
||||
// teamId / projectId on req.authContext come straight from the Postgres
|
||||
// api_keys row. Routes use those to scope every read and write.
|
||||
|
||||
export interface PostgresRequireAuthOptions {
|
||||
requiredScopes?: string[];
|
||||
authMode?: string;
|
||||
allowLocalDevBypass?: boolean;
|
||||
// Local-dev fallback team for unauthenticated loopback requests. This is
|
||||
// only used when authMode === 'local-dev' AND allowLocalDevBypass is true
|
||||
// AND the request is on loopback. It must NEVER be used to scope a real
|
||||
// production request.
|
||||
localDevTeamId?: string | null;
|
||||
}
|
||||
|
||||
export function requirePostgresServerAuth(
|
||||
pool: PostgresPool,
|
||||
options: PostgresRequireAuthOptions = {},
|
||||
): RequestHandler {
|
||||
return async (req: Request, res: Response, next: NextFunction) => {
|
||||
try {
|
||||
const authMode = options.authMode ?? process.env.CLAUDE_MEM_AUTH_MODE ?? 'api-key';
|
||||
const authorization = req.header('authorization') ?? '';
|
||||
const rawKey = parseBearerToken(authorization);
|
||||
|
||||
const allowLocalDevBypass = options.allowLocalDevBypass
|
||||
?? process.env.CLAUDE_MEM_ALLOW_LOCAL_DEV_BYPASS === '1';
|
||||
if (
|
||||
!rawKey
|
||||
&& authMode === 'local-dev'
|
||||
&& allowLocalDevBypass
|
||||
&& isLocalhost(req)
|
||||
&& hasLoopbackHostHeader(req)
|
||||
&& !hasForwardedClientHeaders(req)
|
||||
) {
|
||||
const ctx: AuthContext = {
|
||||
userId: null,
|
||||
organizationId: null,
|
||||
teamId: options.localDevTeamId ?? null,
|
||||
projectId: null,
|
||||
scopes: ['local-dev'],
|
||||
apiKeyId: null,
|
||||
mode: 'local-dev',
|
||||
};
|
||||
req.authContext = ctx;
|
||||
next();
|
||||
return;
|
||||
}
|
||||
|
||||
if (!rawKey) {
|
||||
res.status(401).json({ error: 'Unauthorized', message: 'Missing bearer API key' });
|
||||
return;
|
||||
}
|
||||
|
||||
const verified = await verifyPostgresApiKey(pool, rawKey, options.requiredScopes ?? []);
|
||||
if (!verified) {
|
||||
res.status(403).json({ error: 'Forbidden', message: 'Invalid API key or insufficient scope' });
|
||||
return;
|
||||
}
|
||||
|
||||
const ctx: AuthContext = {
|
||||
userId: null,
|
||||
organizationId: null,
|
||||
teamId: verified.teamId,
|
||||
projectId: verified.projectId,
|
||||
scopes: verified.scopes,
|
||||
apiKeyId: verified.apiKeyId,
|
||||
mode: 'api-key',
|
||||
};
|
||||
req.authContext = ctx;
|
||||
next();
|
||||
} catch (error) {
|
||||
next(error);
|
||||
}
|
||||
};
|
||||
}
|
||||
|
||||
interface VerifiedPostgresApiKey {
|
||||
apiKeyId: string;
|
||||
teamId: string | null;
|
||||
projectId: string | null;
|
||||
scopes: string[];
|
||||
}
|
||||
|
||||
export async function verifyPostgresApiKey(
|
||||
pool: PostgresPool,
|
||||
rawKey: string,
|
||||
requiredScopes: string[],
|
||||
): Promise<VerifiedPostgresApiKey | null> {
|
||||
const keyHash = createHash('sha256').update(rawKey).digest('hex');
|
||||
const result = await pool.query(
|
||||
`
|
||||
SELECT id, team_id, project_id, scopes, revoked_at, expires_at
|
||||
FROM api_keys
|
||||
WHERE key_hash = $1
|
||||
`,
|
||||
[keyHash],
|
||||
);
|
||||
const row = result.rows[0] as Pick<
|
||||
PostgresApiKey,
|
||||
'id' | 'teamId' | 'projectId'
|
||||
> & {
|
||||
id: string;
|
||||
team_id: string | null;
|
||||
project_id: string | null;
|
||||
scopes: unknown;
|
||||
revoked_at: Date | null;
|
||||
expires_at: Date | null;
|
||||
} | undefined;
|
||||
if (!row) {
|
||||
return null;
|
||||
}
|
||||
if (row.revoked_at) {
|
||||
return null;
|
||||
}
|
||||
if (row.expires_at && row.expires_at.getTime() <= Date.now()) {
|
||||
return null;
|
||||
}
|
||||
const scopes = normalizeScopes(row.scopes);
|
||||
if (!hasRequiredScopes(scopes, requiredScopes)) {
|
||||
return null;
|
||||
}
|
||||
return {
|
||||
apiKeyId: row.id,
|
||||
teamId: row.team_id,
|
||||
projectId: row.project_id,
|
||||
scopes,
|
||||
};
|
||||
}
|
||||
|
||||
function normalizeScopes(value: unknown): string[] {
|
||||
if (!Array.isArray(value)) {
|
||||
return [];
|
||||
}
|
||||
return value.filter((item): item is string => typeof item === 'string');
|
||||
}
|
||||
|
||||
function hasRequiredScopes(grantedScopes: string[], requiredScopes: string[]): boolean {
|
||||
if (requiredScopes.length === 0 || grantedScopes.includes('*')) {
|
||||
return true;
|
||||
}
|
||||
return requiredScopes.every(scope => grantedScopes.includes(scope));
|
||||
}
|
||||
|
||||
function parseBearerToken(header: string): string | null {
|
||||
const match = /^Bearer\s+(.+)$/i.exec(header.trim());
|
||||
return match?.[1]?.trim() || null;
|
||||
}
|
||||
|
||||
function isLocalhost(req: Request): boolean {
|
||||
const clientIp = req.ip || req.socket.remoteAddress || '';
|
||||
return clientIp === '127.0.0.1'
|
||||
|| clientIp === '::1'
|
||||
|| clientIp === '::ffff:127.0.0.1'
|
||||
|| clientIp === 'localhost';
|
||||
}
|
||||
|
||||
function hasLoopbackHostHeader(req: Request): boolean {
|
||||
const host = parseHostWithoutPort(req.header('host') ?? '');
|
||||
return host === '127.0.0.1'
|
||||
|| host === 'localhost'
|
||||
|| host === '::1';
|
||||
}
|
||||
|
||||
function parseHostWithoutPort(rawHost: string): string {
|
||||
const host = rawHost.trim().toLowerCase();
|
||||
if (host.startsWith('[')) {
|
||||
const closeBracketIndex = host.indexOf(']');
|
||||
return closeBracketIndex === -1 ? host : host.slice(1, closeBracketIndex);
|
||||
}
|
||||
|
||||
const lastColonIndex = host.lastIndexOf(':');
|
||||
if (lastColonIndex > -1 && /^\d+$/.test(host.slice(lastColonIndex + 1))) {
|
||||
return host.slice(0, lastColonIndex);
|
||||
}
|
||||
return host;
|
||||
}
|
||||
|
||||
function hasForwardedClientHeaders(req: Request): boolean {
|
||||
return Boolean(
|
||||
req.header('forwarded')
|
||||
|| req.header('x-forwarded-for')
|
||||
|| req.header('x-forwarded-host')
|
||||
|| req.header('x-real-ip'),
|
||||
);
|
||||
}
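The local-dev bypass above depends on `hasLoopbackHostHeader`, which in turn depends on how `parseHostWithoutPort` strips ports and IPv6 brackets. The sketch below reproduces `parseHostWithoutPort` from the diff so it runs on its own and adds a hypothetical `isLoopbackHost` helper that applies the same comparison without Express. The port numbers are arbitrary sample values.

```typescript
// Reproduced from the diff above so the sketch is self-contained.
function parseHostWithoutPort(rawHost: string): string {
  const host = rawHost.trim().toLowerCase();
  if (host.startsWith('[')) {
    // Bracketed IPv6 literal: keep what is inside the brackets.
    const closeBracketIndex = host.indexOf(']');
    return closeBracketIndex === -1 ? host : host.slice(1, closeBracketIndex);
  }
  const lastColonIndex = host.lastIndexOf(':');
  if (lastColonIndex > -1 && /^\d+$/.test(host.slice(lastColonIndex + 1))) {
    // Trailing ":<digits>" is treated as a port.
    return host.slice(0, lastColonIndex);
  }
  return host;
}

// Hypothetical helper: same comparison as hasLoopbackHostHeader, minus Express.
function isLoopbackHost(rawHost: string): boolean {
  const host = parseHostWithoutPort(rawHost);
  return host === '127.0.0.1' || host === 'localhost' || host === '::1';
}

const samples = ['localhost:8080', '127.0.0.1', '[::1]:8080', 'example.com:80', '::1'];
const verdicts = samples.map(s => ({ host: s, loopback: isLoopbackHost(s) }));
```

One detail worth noticing: an unbracketed `::1` is parsed as `:` because the final `1` looks like a port, so it is not treated as loopback. That fails closed, and a well-formed Host header carries IPv6 literals in brackets anyway.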
|
||||
@@ -0,0 +1,40 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
|
||||
import { randomUUID } from 'crypto';
|
||||
import type { NextFunction, Request, RequestHandler, Response } from 'express';
|
||||
|
||||
// Phase 12 — request_id middleware. Mints a UUID per inbound request and
|
||||
// attaches it to req.requestId so route handlers, ingest services, and
|
||||
// generation jobs can correlate logs back to the original HTTP call. Honors
|
||||
// an inbound `X-Request-Id` header so an upstream load balancer / gateway
|
||||
// can supply the id, but rejects non-conformant values to keep audit rows
|
||||
// clean (UUID v4 OR a small whitelist of [a-zA-Z0-9-_] up to 64 chars).
|
||||
//
|
||||
// Anti-pattern guard: never trust the inbound id for auth — this is purely
|
||||
// an audit/log correlator. Auth still flows through requirePostgresServerAuth.
|
||||
|
||||
const REQUEST_ID_HEADER = 'x-request-id';
|
||||
const REQUEST_ID_MAX_LENGTH = 64;
|
||||
const REQUEST_ID_SAFE_PATTERN = /^[A-Za-z0-9][A-Za-z0-9\-_]{0,63}$/;
|
||||
|
||||
declare module 'express-serve-static-core' {
|
||||
interface Request {
|
||||
requestId?: string;
|
||||
}
|
||||
}
|
||||
|
||||
export function requestIdMiddleware(): RequestHandler {
|
||||
return (req: Request, res: Response, next: NextFunction) => {
|
||||
const inbound = req.header(REQUEST_ID_HEADER);
|
||||
const accepted = inbound && isAcceptableRequestId(inbound) ? inbound : randomUUID();
|
||||
req.requestId = accepted;
|
||||
res.setHeader('X-Request-Id', accepted);
|
||||
next();
|
||||
};
|
||||
}
|
||||
|
||||
export function isAcceptableRequestId(value: string): boolean {
|
||||
if (typeof value !== 'string') return false;
|
||||
if (value.length === 0 || value.length > REQUEST_ID_MAX_LENGTH) return false;
|
||||
return REQUEST_ID_SAFE_PATTERN.test(value);
|
||||
}
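To make the acceptance rule concrete, here is a small sketch of which inbound `X-Request-Id` values would be kept and which would be replaced. It assumes the same pattern and length limit as the diff; `resolveRequestId` and the injected `mint` callback are hypothetical, introduced so the sketch needs no `crypto` import.

```typescript
const REQUEST_ID_MAX_LENGTH = 64;
const REQUEST_ID_SAFE_PATTERN = /^[A-Za-z0-9][A-Za-z0-9\-_]{0,63}$/;

function isAcceptableRequestId(value: string): boolean {
  if (value.length === 0 || value.length > REQUEST_ID_MAX_LENGTH) return false;
  return REQUEST_ID_SAFE_PATTERN.test(value);
}

// Hypothetical helper: keep a conformant inbound id, otherwise mint a new one.
// The middleware in the diff mints with randomUUID(); `mint` is injected here.
function resolveRequestId(inbound: string | undefined, mint: () => string): string {
  return inbound !== undefined && isAcceptableRequestId(inbound) ? inbound : mint();
}

const mint = (): string => 'minted-id';

const kept = resolveRequestId('gw-7f3a_01', mint);                 // conformant, kept
const replacedInjection = resolveRequestId('abc\r\nX-Evil: 1', mint); // CR/LF and spaces rejected
const replacedLong = resolveRequestId('a'.repeat(65), mint);       // over 64 chars rejected
const replacedMissing = resolveRequestId(undefined, mint);         // no header supplied
```

Standard hyphenated UUIDs already satisfy the pattern, so gateway-supplied UUIDs pass through unchanged. As the comment in the diff stresses, the id is a log and audit correlator only and carries no auth meaning.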
|
||||
@@ -17,6 +17,23 @@ export interface ObservationQueueEngine {
|
||||
close(): Promise<void>;
|
||||
}
|
||||
|
||||
// Phase 12 — `lanes` exposes per-queue counts (waiting/active/completed/
|
||||
// failed/delayed/stalled) so deploy probes can monitor saturation per lane.
|
||||
// `unavailable: true` means the sample failed; the health endpoint MUST NOT
|
||||
// 503 just because counts are stale.
|
||||
export interface ObservationQueueHealthLaneSnapshot {
|
||||
kind: string;
|
||||
name: string;
|
||||
waiting: number;
|
||||
active: number;
|
||||
completed: number;
|
||||
failed: number;
|
||||
delayed: number;
|
||||
stalled: number;
|
||||
unavailable: boolean;
|
||||
unavailableReason?: string;
|
||||
}
|
||||
|
||||
export interface ObservationQueueHealth {
|
||||
engine: 'bullmq';
|
||||
redis: {
|
||||
@@ -27,6 +44,7 @@ export interface ObservationQueueHealth {
|
||||
prefix: string;
|
||||
error?: string;
|
||||
};
|
||||
lanes?: ObservationQueueHealthLaneSnapshot[];
|
||||
}
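The comment on `ObservationQueueHealthLaneSnapshot` says an `unavailable: true` lane means the sample failed and must not turn the health endpoint into a 503. A deploy probe consuming `lanes` might apply the same idea on its side. The sketch below is one possible reading, assuming a hypothetical `classifyLane` helper and a hypothetical `maxWaiting` threshold, neither of which appears in the PR.

```typescript
// Subset of the lane snapshot fields from the interface above.
interface LaneSnapshot {
  kind: string;
  waiting: number;
  active: number;
  unavailable: boolean;
}

type LaneVerdict = 'ok' | 'saturated' | 'unknown';

// Hypothetical probe-side rule: a failed sample is "unknown", not "unhealthy",
// because the zeroed counts on an unavailable lane carry no information.
function classifyLane(lane: LaneSnapshot, maxWaiting: number): LaneVerdict {
  if (lane.unavailable) return 'unknown';
  return lane.waiting > maxWaiting ? 'saturated' : 'ok';
}

const lanes: LaneSnapshot[] = [
  { kind: 'event', waiting: 3, active: 1, unavailable: false },
  { kind: 'summary', waiting: 250, active: 1, unavailable: false },
  { kind: 'reindex', waiting: 0, active: 0, unavailable: true },
];

const verdicts = lanes.map(lane => `${lane.kind}=${classifyLane(lane, 100)}`);
```

With a threshold of 100 waiting jobs, the three sample lanes classify as `event=ok`, `summary=saturated`, and `reindex=unknown`.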
|
||||
|
||||
export interface ObservationQueueInspection {
|
||||
|
||||
File diff suppressed because it is too large
@@ -0,0 +1,164 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
|
||||
import type { Job } from 'bullmq';
|
||||
import { logger } from '../../utils/logger.js';
|
||||
import { PostgresAuthRepository } from '../../storage/postgres/auth.js';
|
||||
import type { PostgresPool } from '../../storage/postgres/pool.js';
|
||||
import { ProviderObservationGenerator } from '../generation/ProviderObservationGenerator.js';
|
||||
import type { ServerGenerationProvider } from '../generation/providers/shared/types.js';
|
||||
import type { ServerGenerationJobPayload } from '../jobs/types.js';
|
||||
import type { ActiveServerBetaQueueManager } from './ActiveServerBetaQueueManager.js';
|
||||
import type {
|
||||
ServerBetaBoundaryHealth,
|
||||
ServerBetaGenerationWorkerManager,
|
||||
} from './types.js';
|
||||
|
||||
// ActiveServerBetaGenerationWorkerManager wires a BullMQ Worker (per the
|
||||
// 'event' queue) to a ProviderObservationGenerator. Concurrency defaults to
|
||||
// 1 per the plan (line 80–86) so retries observe a single inflight provider
|
||||
// call per server. autorun:false / explicit run() is enforced by
|
||||
// ServerJobQueue.start.
|
||||
//
|
||||
// This class is wired in only when both a queue manager AND a configured
|
||||
// provider are present. create-server-beta-service keeps the disabled
|
||||
// adapter otherwise so server beta can boot without provider credentials.
|
||||
|
||||
export interface ActiveServerBetaGenerationWorkerManagerOptions {
|
||||
pool: PostgresPool;
|
||||
queueManager: ActiveServerBetaQueueManager;
|
||||
provider: ServerGenerationProvider;
|
||||
workerId?: string;
|
||||
// Test seam: replace the generator with a stub.
|
||||
generatorFactory?: (
|
||||
pool: PostgresPool,
|
||||
provider: ServerGenerationProvider,
|
||||
workerId: string,
|
||||
) => ProviderObservationGenerator;
|
||||
}
|
||||
|
||||
export class ActiveServerBetaGenerationWorkerManager implements ServerBetaGenerationWorkerManager {
|
||||
readonly kind = 'generation-worker-manager' as const;
|
||||
private started = false;
|
||||
private closed = false;
|
||||
private readonly generator: ProviderObservationGenerator;
|
||||
private readonly workerId: string;
|
||||
|
||||
constructor(private readonly options: ActiveServerBetaGenerationWorkerManagerOptions) {
|
||||
this.workerId = options.workerId ?? `server-beta-${process.pid}`;
|
||||
this.generator = options.generatorFactory
|
||||
? options.generatorFactory(options.pool, options.provider, this.workerId)
|
||||
: new ProviderObservationGenerator({
|
||||
pool: options.pool,
|
||||
provider: options.provider,
|
||||
workerId: this.workerId,
|
||||
});
|
||||
}
|
||||
|
||||
/**
|
||||
* Attach BullMQ Worker to the 'event' queue. Per BullMQ docs we use
|
||||
* new Worker(queueName, processor, { concurrency, autorun })
|
||||
* via ServerJobQueue.start(...). Errors are surfaced through the queue
|
||||
* wrapper's worker.on('error', ...) listener.
|
||||
*/
|
||||
start(): void {
|
||||
if (this.started) return;
|
||||
const dispatcher = async (job: Job<ServerGenerationJobPayload>) => {
|
||||
try {
|
||||
return await this.generator.process(job);
|
||||
} catch (error) {
|
||||
logger.warn('SYSTEM', 'observation generator failed', {
|
||||
jobId: job.id,
|
||||
kind: job.data.kind,
|
||||
error: error instanceof Error ? error.message : String(error),
|
||||
});
|
||||
throw error;
|
||||
}
|
||||
};
|
||||
this.options.queueManager.start('event', dispatcher);
|
||||
// Phase 6: wire the summary lane alongside the event lane. Concurrency
|
||||
// defaults to 1 per ServerJobQueue config (per the plan), and the same
|
||||
// ProviderObservationGenerator dispatches on job.data.source_type via the
|
||||
// outbox row reload inside lockOutbox+process.
|
||||
this.options.queueManager.start('summary', dispatcher);
|
||||
|
||||
// Phase 12 — audit stalled events directly. Phase 11's audit chain now
|
||||
// covers the operator and provider lifecycle; stalled jobs come from
|
||||
// BullMQ runtime not the HTTP boundary, so we wire them in here. Best-
|
||||
// effort: a missing/unscoped audit MUST NOT crash the worker.
|
||||
for (const lane of ['event', 'summary'] as const) {
|
||||
try {
|
||||
const queue = this.options.queueManager.getQueue(lane);
|
||||
queue.observe({
|
||||
onStalled: (jobId) => {
|
||||
void this.auditStalledJob(jobId, lane);
|
||||
},
|
||||
});
|
||||
} catch (error) {
|
||||
logger.warn('SYSTEM', `failed to wire stalled observer for ${lane} lane`, {
|
||||
error: error instanceof Error ? error.message : String(error),
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
this.started = true;
|
||||
}
|
||||
|
||||
// Phase 12 — write a `generation_job.stalled` audit row. We look up the
|
||||
// outbox row by BullMQ jobId (== bullmq_job_id column) so team/project
|
||||
// scope is correct on the audit row even when the original API key
|
||||
// metadata is unavailable (BullMQ retries can outlive a session).
|
||||
private async auditStalledJob(bullmqJobId: string, lane: 'event' | 'summary'): Promise<void> {
|
||||
try {
|
||||
const result = await this.options.pool.query<{
|
||||
id: string;
|
||||
team_id: string | null;
|
||||
project_id: string | null;
|
||||
}>(
|
||||
'SELECT id, team_id, project_id FROM observation_generation_jobs WHERE bullmq_job_id = $1 LIMIT 1',
|
||||
[bullmqJobId],
|
||||
);
|
||||
const row = result.rows[0];
|
||||
if (!row) return;
|
||||
const repo = new PostgresAuthRepository(this.options.pool);
|
||||
await repo.createAuditLog({
|
||||
teamId: row.team_id,
|
||||
projectId: row.project_id,
|
||||
actorId: null,
|
||||
apiKeyId: null,
|
||||
action: 'generation_job.stalled',
|
||||
resourceType: 'observation_generation_job',
|
||||
resourceId: row.id,
|
||||
details: { lane, bullmqJobId },
|
||||
});
|
||||
} catch (error) {
|
||||
logger.warn('SYSTEM', 'failed to audit stalled generation_job', {
|
||||
bullmqJobId,
|
||||
error: error instanceof Error ? error.message : String(error),
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
getHealth(): ServerBetaBoundaryHealth {
|
||||
if (this.closed) {
|
||||
return { status: 'errored', reason: 'generation-worker-manager closed' };
|
||||
}
|
||||
return {
|
||||
status: this.started ? 'active' : 'disabled',
|
||||
reason: this.started
|
||||
? 'BullMQ Worker attached to event queue with ProviderObservationGenerator'
|
||||
: 'wired but not started',
|
||||
details: {
|
||||
provider: this.options.provider.providerLabel,
|
||||
workerId: this.workerId,
|
||||
},
|
||||
};
|
||||
}
|
||||
|
||||
async close(): Promise<void> {
|
||||
if (this.closed) return;
|
||||
this.closed = true;
|
||||
// The underlying Worker is owned by ServerJobQueue.close() (driven by
|
||||
// the queue manager). We do not double-close here; the queue manager's
|
||||
// close cascade handles it.
|
||||
}
|
||||
}
|
||||
@@ -11,6 +11,7 @@ import type { RedisQueueConfig } from '../queue/redis-config.js';
|
||||
import { logger } from '../../utils/logger.js';
|
||||
import type {
|
||||
ServerBetaBoundaryHealth,
|
||||
ServerBetaQueueLaneMetric,
|
||||
ServerBetaQueueManager,
|
||||
} from './types.js';
|
||||
|
||||
@@ -75,6 +76,49 @@ export class ActiveServerBetaQueueManager implements ServerBetaQueueManager {
|
||||
};
|
||||
}
|
||||
|
||||
/**
|
||||
* Phase 12 — per-lane counts. Returns BullMQ getJobCounts plus the
|
||||
* per-process stalled counter. If Redis is unreachable, the lane is
|
||||
* reported with an `unavailable` flag rather than throwing so /api/health
|
||||
* remains responsive even in partial-failure modes.
|
||||
*/
|
||||
async getLaneMetrics(): Promise<ServerBetaQueueLaneMetric[]> {
|
||||
const out: ServerBetaQueueLaneMetric[] = [];
|
||||
for (const kind of QUEUE_KINDS) {
|
||||
const queue = this.queues.get(kind);
|
||||
if (!queue) continue;
|
||||
const lifecycle = queue.getLifecycleCounters();
|
||||
try {
|
||||
const counts = await queue.getCounts();
|
||||
out.push({
|
||||
kind,
|
||||
name: SERVER_JOB_QUEUE_NAMES[kind],
|
||||
waiting: counts.waiting,
|
||||
active: counts.active,
|
||||
completed: counts.completed,
|
||||
failed: counts.failed,
|
||||
delayed: counts.delayed,
|
||||
stalled: lifecycle.stalled,
|
||||
unavailable: false,
|
||||
});
|
||||
} catch (error) {
|
||||
out.push({
|
||||
kind,
|
||||
name: SERVER_JOB_QUEUE_NAMES[kind],
|
||||
waiting: 0,
|
||||
active: 0,
|
||||
completed: 0,
|
||||
failed: 0,
|
||||
delayed: 0,
|
||||
stalled: lifecycle.stalled,
|
||||
unavailable: true,
|
||||
unavailableReason: error instanceof Error ? error.message : String(error),
|
||||
});
|
||||
}
|
||||
}
|
||||
return out;
|
||||
}
|
||||
|
||||
async close(): Promise<void> {
|
||||
if (this.closed) {
|
||||
return;
|
||||
|
||||
@@ -14,7 +14,11 @@ import {
|
||||
verifyPidFileOwnership,
|
||||
type PidInfo,
|
||||
} from '../../supervisor/process-registry.js';
|
||||
-import type { ServerBetaServiceGraph } from './types.js';
|
||||
+import { ServerV1PostgresRoutes } from '../routes/v1/ServerV1PostgresRoutes.js';
|
||||
+import { SessionsObservationsAdapter } from '../compat/SessionsObservationsAdapter.js';
|
||||
+import { SessionsSummarizeAdapter } from '../compat/SessionsSummarizeAdapter.js';
|
||||
+import { ActiveServerBetaQueueManager } from './ActiveServerBetaQueueManager.js';
|
||||
+import type { ServerBetaServiceGraph, ServerBetaQueueLaneMetric } from './types.js';
|
||||
|
||||
const SERVER_BETA_RUNTIME = 'server-beta';
|
||||
const DEFAULT_SERVER_BETA_HOST = '127.0.0.1';
|
||||
@@ -50,7 +54,12 @@ class ServerBetaRuntimeInfoRoutes implements RouteHandler {
|
||||
res.json({ status: 'ok', runtime: SERVER_BETA_RUNTIME });
|
||||
});
|
||||
|
||||
-app.get('/v1/info', (_req, res) => {
|
||||
+// Phase 12 — `/v1/info` includes per-lane queue metrics so deploy probes
|
||||
+// can read waiting/active/completed/failed/delayed/stalled without
|
||||
+// hitting `/api/health`. Sampling is best-effort: a Redis blip surfaces
|
||||
+// the lane with `unavailable: true` rather than crashing the route.
|
||||
+app.get('/v1/info', async (_req, res) => {
|
||||
+const queueLanes = await collectQueueLaneMetrics(this.graph);
|
||||
res.json({
|
||||
name: 'claude-mem-server',
|
||||
runtime: SERVER_BETA_RUNTIME,
|
||||
@@ -65,11 +74,28 @@ class ServerBetaRuntimeInfoRoutes implements RouteHandler {
|
||||
providerRegistry: this.graph.providerRegistry.getHealth(),
|
||||
eventBroadcaster: this.graph.eventBroadcaster.getHealth(),
|
||||
},
|
||||
queueLanes,
|
||||
});
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
async function collectQueueLaneMetrics(
|
||||
graph: ServerBetaServiceGraph,
|
||||
): Promise<ServerBetaQueueLaneMetric[]> {
|
||||
const manager = graph.queueManager;
|
||||
if (!(manager instanceof ActiveServerBetaQueueManager)) {
|
||||
return [];
|
||||
}
|
||||
try {
|
||||
return await manager.getLaneMetrics();
|
||||
} catch {
|
||||
// /api/health and /v1/info MUST never throw on a queue blip — surface
|
||||
// empty lanes so the rest of the payload still renders.
|
||||
return [];
|
||||
}
|
||||
}
|
||||
|
||||
export class ServerBetaService {
|
||||
private readonly graph: ServerBetaServiceGraph;
|
||||
private readonly host: string;
|
||||
@@ -106,8 +132,73 @@ export class ServerBetaService {
|
||||
authMethod: this.graph.authMode,
|
||||
lastInteraction: null,
|
||||
}),
|
||||
// Phase 10 — surface BullMQ/Valkey health on /api/health so deploy
|
||||
// probes (and the Docker E2E) can confirm the queue engine without
|
||||
// peeking at /v1/info. The queue manager's getHealth() returns its
|
||||
// boundary descriptor; we shape it into the worker-compatible
|
||||
// ObservationQueueHealth schema the Server class expects.
|
||||
// Phase 12 — also include per-lane counts (waiting/active/completed/
|
||||
// failed/delayed/stalled) so deploy probes can monitor saturation.
|
||||
getQueueHealth: async () => {
|
||||
const health = this.graph.queueManager.getHealth();
|
||||
const details = (health.details ?? {}) as Record<string, unknown>;
|
||||
if (health.status !== 'active' || details.engine !== 'bullmq') {
|
||||
return null;
|
||||
}
|
||||
const lanes = await collectQueueLaneMetrics(this.graph);
|
||||
return {
|
||||
engine: 'bullmq' as const,
|
||||
redis: {
|
||||
status: 'ok' as const,
|
||||
mode: String(details.mode ?? 'unknown'),
|
||||
host: String(details.host ?? '127.0.0.1'),
|
||||
port: typeof details.port === 'number' ? details.port : 6379,
|
||||
prefix: String(details.prefix ?? 'claude_mem'),
|
||||
},
|
||||
lanes: lanes.map(lane => ({
|
||||
kind: lane.kind,
|
||||
name: lane.name,
|
||||
waiting: lane.waiting,
|
||||
active: lane.active,
|
||||
completed: lane.completed,
|
||||
failed: lane.failed,
|
||||
delayed: lane.delayed,
|
||||
stalled: lane.stalled,
|
||||
unavailable: lane.unavailable,
|
||||
...(lane.unavailableReason ? { unavailableReason: lane.unavailableReason } : {}),
|
||||
})),
|
||||
};
|
||||
},
|
||||
});
|
||||
server.registerRoutes(new ServerBetaRuntimeInfoRoutes(this.graph));
|
||||
const v1Routes = new ServerV1PostgresRoutes({
|
||||
pool: this.graph.postgres.pool,
|
||||
queueManager: this.graph.queueManager,
|
||||
authMode: this.graph.authMode === 'disabled' ? 'api-key' : this.graph.authMode,
|
||||
runtime: SERVER_BETA_RUNTIME,
|
||||
// Session policy is read inside the routes (default 'per-event' from
|
||||
// resolveSessionGenerationPolicy(), env-overridable via
|
||||
// CLAUDE_MEM_SERVER_SESSION_POLICY). We do not duplicate it here.
|
||||
});
|
||||
server.registerRoutes(v1Routes);
|
||||
|
||||
// Phase 9: legacy compatibility adapters. These translate the old
// `/api/sessions/observations` and `/api/sessions/summarize` worker
// routes to the canonical Server beta event/job model. They reuse the
// SAME service instances as the /v1/* routes, so ingest and session-end
// logic is never duplicated. New clients should hit /v1/* directly.
|
||||
const compatAuthMode = this.graph.authMode === 'disabled' ? 'api-key' : this.graph.authMode;
|
||||
server.registerRoutes(new SessionsObservationsAdapter({
|
||||
pool: this.graph.postgres.pool,
|
||||
ingestEvents: v1Routes.getIngestEventsService(),
|
||||
authMode: compatAuthMode,
|
||||
}));
|
||||
server.registerRoutes(new SessionsSummarizeAdapter({
|
||||
pool: this.graph.postgres.pool,
|
||||
endSession: v1Routes.getEndSessionService(),
|
||||
authMode: compatAuthMode,
|
||||
}));
|
||||
|
||||
server.finalizeRoutes();
|
||||
|
||||
await server.listen(this.requestedPort, this.host);
|
||||
@@ -184,6 +275,28 @@ export async function runServerBetaCli(argv: string[] = process.argv.slice(2)):
|
||||
const port = getServerBetaPort();
|
||||
const host = process.env.CLAUDE_MEM_SERVER_HOST ?? DEFAULT_SERVER_BETA_HOST;
|
||||
|
||||
// Phase 10: `claude-mem server worker [start|--daemon]` runs the BullMQ
|
||||
// generation worker as a foregrounded process — no HTTP server, no route
|
||||
// registration. In Compose this becomes a separately scaled service.
|
||||
if (command === 'worker') {
|
||||
const sub = (argv[1] ?? '--daemon').toLowerCase();
|
||||
if (sub === 'start' || sub === '--daemon' || sub === 'run') {
|
||||
await runServerBetaGenerationWorker();
|
||||
return;
|
||||
}
|
||||
console.error('Usage: server-beta-service worker start');
|
||||
process.exit(1);
|
||||
}
|
||||
|
||||
// `server api-key create|list|revoke` mirrors the worker-service tooling
|
||||
// but writes to the Postgres `api_keys` table the server-beta runtime
|
||||
// actually reads from. The legacy worker-service CLI talks to SQLite and
|
||||
// would be invisible to this stack.
|
||||
if (command === 'server' && argv[1]?.toLowerCase() === 'api-key') {
|
||||
await runServerBetaApiKeyCli(argv.slice(2));
|
||||
return;
|
||||
}
|
||||
|
||||
switch (command) {
|
||||
case 'start': {
|
||||
const existing = readServerBetaPidFile();
|
||||
@@ -258,9 +371,212 @@ export async function runServerBetaCli(argv: string[] = process.argv.slice(2)):
|
||||
}
|
||||
}
|
||||
|
||||
// Phase 10 — Postgres-backed `server api-key create|list|revoke` CLI. The
|
||||
// legacy `worker-service.cjs server api-key` command talks to SQLite and
|
||||
// is invisible to the server-beta runtime, which reads keys from
|
||||
// Postgres. Use this entrypoint inside Docker / Compose.
|
||||
export async function runServerBetaApiKeyCli(argv: string[]): Promise<void> {
|
||||
const sub = argv[0]?.toLowerCase();
|
||||
const options = parseFlagArgs(argv.slice(1));
|
||||
|
||||
if (!process.env.CLAUDE_MEM_SERVER_DATABASE_URL) {
|
||||
console.error('CLAUDE_MEM_SERVER_DATABASE_URL is required for `server api-key` commands.');
|
||||
process.exit(1);
|
||||
}
|
||||
|
||||
const { getSharedPostgresPool } = await import('../../storage/postgres/index.js');
|
||||
const { PostgresAuthRepository } = await import('../../storage/postgres/auth.js');
|
||||
const { createHash, randomBytes } = await import('crypto');
|
||||
const pool = getSharedPostgresPool({ requireDatabaseUrl: true });
|
||||
const repo = new PostgresAuthRepository(pool);
|
||||
|
||||
try {
|
||||
if (sub === 'create') {
|
||||
const scopes = (options.scope ?? options.scopes ?? 'memories:read')
|
||||
.split(',')
|
||||
.map((scope: string) => scope.trim())
|
||||
.filter(Boolean);
|
||||
// Resolve team/project. If the caller passed --team/--project, honor
|
||||
// them. Otherwise, run the server-beta bootstrap to get-or-create the
|
||||
// local team+project, then create a NEW key against those IDs with
|
||||
// the caller's requested scopes (the bootstrap key uses hook scopes,
|
||||
// which is the wrong default for an arbitrary CLI-issued key).
|
||||
let teamId = options.team ?? null;
|
||||
let projectId = options.project ?? null;
|
||||
if (!teamId || !projectId) {
|
||||
const { bootstrapServerBetaApiKey } = await import('../../services/hooks/server-beta-bootstrap.js');
|
||||
const result = await bootstrapServerBetaApiKey({ pool, closePool: false });
|
||||
teamId = result.teamId;
|
||||
projectId = result.projectId;
|
||||
}
|
||||
const rawKey = `cmem_${randomBytes(24).toString('hex')}`;
|
||||
const keyHash = createHash('sha256').update(rawKey).digest('hex');
|
||||
const created = await repo.createApiKey({
|
||||
keyHash,
|
||||
teamId,
|
||||
projectId,
|
||||
scopes,
|
||||
actorId: 'system:server-beta-cli',
|
||||
});
|
||||
console.log(JSON.stringify({
|
||||
id: created.id,
|
||||
key: rawKey,
|
||||
name: options.name ?? 'server-api-key',
|
||||
teamId,
|
||||
projectId,
|
||||
scopes,
|
||||
}, null, 2));
|
||||
return;
|
||||
}
|
||||
|
||||
if (sub === 'list') {
|
||||
// Bound the result set to prevent unintentional cross-tenant key
// metadata disclosure when an admin runs `api-key list` on a shared
// host. Default page size is 100 (max 500; out-of-range or non-numeric
// --limit values fall back to 100). --team filters to a single tenant.
|
||||
const teamFilter = options.team ?? null;
|
||||
const limitArg = Number.parseInt(options.limit ?? '100', 10);
|
||||
const offsetArg = Number.parseInt(options.offset ?? '0', 10);
|
||||
const limit = Number.isFinite(limitArg) && limitArg > 0 && limitArg <= 500
|
||||
? limitArg
|
||||
: 100;
|
||||
const offset = Number.isFinite(offsetArg) && offsetArg >= 0 ? offsetArg : 0;
|
||||
const where = teamFilter ? 'WHERE team_id = $1' : '';
|
||||
const params: unknown[] = teamFilter ? [teamFilter, limit, offset] : [limit, offset];
|
||||
const limitIdx = teamFilter ? 2 : 1;
|
||||
const offsetIdx = teamFilter ? 3 : 2;
|
||||
const result = await pool.query<{
|
||||
id: string;
|
||||
team_id: string | null;
|
||||
project_id: string | null;
|
||||
scopes: unknown;
|
||||
revoked_at: Date | null;
|
||||
expires_at: Date | null;
|
||||
last_used_at: Date | null;
|
||||
created_at: Date;
|
||||
}>(
|
||||
`SELECT id, team_id, project_id, scopes, revoked_at, expires_at, last_used_at, created_at
|
||||
FROM api_keys
|
||||
${where}
|
||||
ORDER BY created_at DESC
|
||||
LIMIT $${limitIdx} OFFSET $${offsetIdx}`,
|
||||
params,
|
||||
);
|
||||
console.log(JSON.stringify({
|
||||
teamId: teamFilter,
|
||||
limit,
|
||||
offset,
|
||||
count: result.rows.length,
|
||||
keys: result.rows.map(row => ({
|
||||
id: row.id,
|
||||
teamId: row.team_id,
|
||||
projectId: row.project_id,
|
||||
scopes: row.scopes,
|
||||
status: row.revoked_at ? 'revoked' : 'active',
|
||||
lastUsedAt: row.last_used_at?.toISOString() ?? null,
|
||||
expiresAt: row.expires_at?.toISOString() ?? null,
|
||||
createdAt: row.created_at.toISOString(),
|
||||
})),
|
||||
}, null, 2));
|
||||
return;
|
||||
}
|
||||
|
||||
if (sub === 'revoke') {
|
||||
const id = argv[1];
|
||||
if (!id) {
|
||||
console.error('Usage: server-beta-service server api-key revoke <id>');
|
||||
process.exit(1);
|
||||
}
|
||||
const result = await pool.query(
|
||||
`UPDATE api_keys SET revoked_at = now()
|
||||
WHERE id = $1 AND revoked_at IS NULL
|
||||
RETURNING id`,
|
||||
[id],
|
||||
);
|
||||
if (result.rowCount === 0) {
|
||||
console.error(`API key not found or already revoked: ${id}`);
|
||||
process.exit(1);
|
||||
}
|
||||
console.log(JSON.stringify({ id, status: 'revoked' }, null, 2));
|
||||
return;
|
||||
}
|
||||
|
||||
console.error(`Unknown server api-key subcommand: ${sub ?? '(none)'}`);
|
||||
console.error('Usage: server-beta-service server api-key create|list|revoke');
|
||||
process.exit(1);
|
||||
} finally {
|
||||
// Pool is shared; do not close here. The process will exit and the
|
||||
// pool tears down via the shared module's process exit hook.
|
||||
}
|
||||
}
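
The `create` branch above stores only a SHA-256 hash of the key and prints the raw key once. A minimal sketch of that mint-and-hash step follows, assuming Node's built-in `crypto` module. `mintApiKey`, `hashApiKey`, and `matchesStoredHash` are hypothetical helper names used for illustration; they are not part of this diff, and the real auth middleware may compare hashes differently.

```typescript
import { createHash, randomBytes, timingSafeEqual } from 'node:crypto';

// Hash a raw key the same way the CLI above does: hex-encoded SHA-256.
function hashApiKey(rawKey: string): string {
  return createHash('sha256').update(rawKey).digest('hex');
}

// Mint a raw key in the same shape as the CLI: `cmem_` + 48 hex characters.
// Only `keyHash` would be persisted; `rawKey` is shown to the operator once.
function mintApiKey(): { rawKey: string; keyHash: string } {
  const rawKey = `cmem_${randomBytes(24).toString('hex')}`;
  return { rawKey, keyHash: hashApiKey(rawKey) };
}

// Hypothetical verifier: hash the presented key and compare it to the stored
// hash in constant time. A length mismatch is rejected before comparing.
function matchesStoredHash(presentedKey: string, storedHash: string): boolean {
  const presented = Buffer.from(hashApiKey(presentedKey), 'hex');
  const stored = Buffer.from(storedHash, 'hex');
  return presented.length === stored.length && timingSafeEqual(presented, stored);
}
```

Because only the hash is stored, a lost raw key cannot be recovered; the operator would revoke the old key and create a new one.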
|
||||
|
||||
function parseFlagArgs(argv: string[]): Record<string, string> {
|
||||
const out: Record<string, string> = {};
|
||||
for (let i = 0; i < argv.length; i++) {
|
||||
const arg = argv[i];
|
||||
if (!arg) continue;
|
||||
if (arg.startsWith('--')) {
|
||||
const equalsIdx = arg.indexOf('=');
|
||||
if (equalsIdx > -1) {
|
||||
out[arg.slice(2, equalsIdx)] = arg.slice(equalsIdx + 1);
|
||||
} else {
|
||||
out[arg.slice(2)] = argv[i + 1] ?? '';
|
||||
i += 1;
|
||||
}
|
||||
}
|
||||
}
|
||||
return out;
|
||||
}
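
To make the parsing rules of `parseFlagArgs` concrete, the sketch below repeats the function body from above so it runs on its own and exercises three inputs. The sample flags (`--dry-run`, `--verbose`) are illustrative only; the CLI in this diff does not define them.

```typescript
// Same logic as parseFlagArgs above, copied so the example is self-contained.
function parseFlagArgs(argv: string[]): Record<string, string> {
  const out: Record<string, string> = {};
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (!arg) continue;
    if (arg.startsWith('--')) {
      const equalsIdx = arg.indexOf('=');
      if (equalsIdx > -1) {
        out[arg.slice(2, equalsIdx)] = arg.slice(equalsIdx + 1);
      } else {
        // The next token is always consumed as the value, even if it is a flag.
        out[arg.slice(2)] = argv[i + 1] ?? '';
        i += 1;
      }
    }
  }
  return out;
}

// Both `--flag=value` and `--flag value` forms are accepted.
const mixed = parseFlagArgs(['--team=t1', '--limit', '10']);
// → { team: 't1', limit: '10' }

// A trailing flag with no value maps to the empty string.
const trailing = parseFlagArgs(['--dry-run']);
// → { 'dry-run': '' }

// Caveat: a valueless flag swallows the following flag as its value.
const swallowed = parseFlagArgs(['--verbose', '--team', 'x']);
// → { verbose: '--team' }
```

The last case suggests that callers should pass values with `=` when mixing boolean-style flags with valued ones.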
|
||||
|
||||
// Phase 10 — generation-worker-only entrypoint. Starts BullMQ workers against
|
||||
// the same Postgres + Valkey/Redis the HTTP server-beta service uses, but
|
||||
// never opens an HTTP listener. In Compose this is a separate, horizontally
|
||||
// scalable service. The HTTP server-beta service should run with
|
||||
// CLAUDE_MEM_GENERATION_DISABLED=true so generation only happens in this
|
||||
// process.
|
||||
export async function runServerBetaGenerationWorker(): Promise<void> {
|
||||
const { validateServerBetaEnv, createServerBetaService } = await import('./create-server-beta-service.js');
|
||||
validateServerBetaEnv();
|
||||
// Build the service WITHOUT starting HTTP. We reuse createServerBetaService
|
||||
// for pool + bootstrap + queue + generation worker wiring, but never call
|
||||
// service.start(). Generation is enabled here even if env says
|
||||
// CLAUDE_MEM_GENERATION_DISABLED, because this IS the generation worker.
|
||||
delete process.env.CLAUDE_MEM_GENERATION_DISABLED;
|
||||
const service = await createServerBetaService();
|
||||
const state = service.getRuntimeState();
|
||||
logger.info('SYSTEM', 'Server beta generation worker started (no HTTP)', {
|
||||
pid: process.pid,
|
||||
queue: state.boundaries.queueManager,
|
||||
generation: state.boundaries.generationWorkerManager,
|
||||
});
|
||||
console.log(JSON.stringify({ status: 'worker-running', runtime: SERVER_BETA_RUNTIME, pid: process.pid }));
|
||||
|
||||
let stopping = false;
|
||||
const shutdown = async () => {
|
||||
if (stopping) return;
|
||||
stopping = true;
|
||||
try {
|
||||
await service.stop();
|
||||
} finally {
|
||||
process.exit(0);
|
||||
}
|
||||
};
|
||||
process.once('SIGTERM', shutdown);
|
||||
process.once('SIGINT', shutdown);
|
||||
|
||||
// Block forever — Workers run in background via BullMQ. Without this the
|
||||
// process would exit and BullMQ jobs would never be consumed.
|
||||
await new Promise<void>(() => {});
|
||||
}
|
||||
|
||||
function getServerBetaPort(): number {
  const parsed = Number.parseInt(process.env.CLAUDE_MEM_SERVER_PORT ?? '', 10);
  if (Number.isInteger(parsed) && parsed > 0) {
    return parsed;
  }
  // UID-derived default for multi-account isolation: two users on the same
  // host get distinct ports without explicit configuration. Containerized
  // deployments always pass CLAUDE_MEM_SERVER_PORT so this branch is local-only.
  return DEFAULT_SERVER_BETA_PORT + ((process.getuid?.() ?? 77) % 100);
}
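
The UID-derived fallback can be worked through with concrete numbers. The sketch below takes the env value and UID as parameters so it runs without touching `process`. `HYPOTHETICAL_BASE_PORT` is an assumed stand-in for `DEFAULT_SERVER_BETA_PORT`, whose actual value is not shown in this diff.

```typescript
// Assumed base port for illustration only; the real constant is defined elsewhere.
const HYPOTHETICAL_BASE_PORT = 37000;

// Explicit port wins when it parses to a positive integer; otherwise the
// port is the base plus (uid mod 100), with 77 used when no UID is available.
function resolvePort(
  envValue: string | undefined,
  uid: number | undefined,
  basePort: number = HYPOTHETICAL_BASE_PORT,
): number {
  const parsed = Number.parseInt(envValue ?? '', 10);
  if (Number.isInteger(parsed) && parsed > 0) {
    return parsed;
  }
  return basePort + ((uid ?? 77) % 100);
}

const explicit = resolvePort('8080', 501);        // → 8080
const derived = resolvePort(undefined, 501);      // → 37001
const noUid = resolvePort('abc', undefined);      // → 37077
const zeroIsInvalid = resolvePort('0', 1000);     // → 37000
```

Note that UIDs differing by a multiple of 100 (for example 501 and 601) map to the same port, so the isolation is best-effort rather than guaranteed.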
|
||||
|
||||
function spawnServerBetaDaemon(port: number): number | undefined {
|
||||
|
||||
@@ -0,0 +1,163 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
|
||||
import {
|
||||
PostgresServerSessionsRepository,
|
||||
type PostgresServerSession,
|
||||
} from '../../storage/postgres/server-sessions.js';
|
||||
import type { PostgresAgentEvent } from '../../storage/postgres/agent-events.js';
|
||||
import type { JsonObject } from '../../storage/postgres/utils.js';
|
||||
import type { PostgresPool } from '../../storage/postgres/pool.js';
|
||||
import type { PostgresQueryable } from '../../storage/postgres/utils.js';
|
||||
|
||||
// ServerSessionRuntimeRepository is the runtime helper layer used by Server
|
||||
// beta routes and generation policies. It is intentionally thin: every method
|
||||
// requires explicit `team_id` + `project_id` and validates scope through the
|
||||
// underlying PostgresServerSessionsRepository (which calls
|
||||
// assertProjectOwnership before any write). It does NOT cache state — every
|
||||
// call hits Postgres so the runtime never trusts in-memory ActiveSession-style
|
||||
// objects, per the Phase 6 anti-pattern guard.
|
||||
|
||||
export interface ServerSessionScope {
|
||||
teamId: string;
|
||||
projectId: string;
|
||||
}
|
||||
|
||||
export interface GetActiveSessionInput extends ServerSessionScope {
|
||||
externalSessionId: string;
|
||||
contentSessionId?: string | null;
|
||||
agentId?: string | null;
|
||||
agentType?: string | null;
|
||||
platformSource?: string | null;
|
||||
metadata?: JsonObject;
|
||||
}
|
||||
|
||||
export interface ServerSessionRuntimeRepositoryOptions {
|
||||
client: PostgresQueryable;
|
||||
}
|
||||
|
||||
export class ServerSessionRuntimeRepository {
|
||||
private readonly repo: PostgresServerSessionsRepository;
|
||||
|
||||
constructor(private readonly options: ServerSessionRuntimeRepositoryOptions) {
|
||||
this.repo = new PostgresServerSessionsRepository(options.client);
|
||||
}
|
||||
|
||||
/**
|
||||
* Find or create the canonical Server beta session row for an external
|
||||
* session id. Idempotent on (project_id, external_session_id).
|
||||
*
|
||||
* Anti-pattern guard: this MUST NOT consult worker `ActiveSession` or any
|
||||
* legacy SessionStore. server_sessions is the canonical model.
|
||||
*/
|
||||
async getActiveSession(input: GetActiveSessionInput): Promise<PostgresServerSession> {
|
||||
const existing = await this.repo.findByExternalIdForScope({
|
||||
externalSessionId: input.externalSessionId,
|
||||
projectId: input.projectId,
|
||||
teamId: input.teamId,
|
||||
});
|
||||
if (existing) {
|
||||
return existing;
|
||||
}
|
||||
return this.repo.create({
|
||||
projectId: input.projectId,
|
||||
teamId: input.teamId,
|
||||
externalSessionId: input.externalSessionId,
|
||||
contentSessionId: input.contentSessionId ?? null,
|
||||
agentId: input.agentId ?? null,
|
||||
agentType: input.agentType ?? null,
|
||||
platformSource: input.platformSource ?? null,
|
||||
metadata: input.metadata ?? {},
|
||||
});
|
||||
}
|
||||
|
||||
async getById(input: { id: string } & ServerSessionScope): Promise<PostgresServerSession | null> {
|
||||
return this.repo.getByIdForScope({
|
||||
id: input.id,
|
||||
projectId: input.projectId,
|
||||
teamId: input.teamId,
|
||||
});
|
||||
}
|
||||
|
||||
async findByExternalId(input: {
|
||||
externalSessionId: string;
|
||||
} & ServerSessionScope): Promise<PostgresServerSession | null> {
|
||||
return this.repo.findByExternalIdForScope({
|
||||
externalSessionId: input.externalSessionId,
|
||||
projectId: input.projectId,
|
||||
teamId: input.teamId,
|
||||
});
|
||||
}
|
||||
|
||||
async listUnprocessedEvents(
|
||||
input: { serverSessionId: string; limit?: number } & ServerSessionScope,
|
||||
): Promise<PostgresAgentEvent[]> {
|
||||
const params: {
|
||||
serverSessionId: string;
|
||||
projectId: string;
|
||||
teamId: string;
|
||||
limit?: number;
|
||||
} = {
|
||||
serverSessionId: input.serverSessionId,
|
||||
projectId: input.projectId,
|
||||
teamId: input.teamId,
|
||||
};
|
||||
if (input.limit !== undefined) {
|
||||
params.limit = input.limit;
|
||||
}
|
||||
return this.repo.listUnprocessedEvents(params);
|
||||
}
|
||||
|
||||
/**
|
||||
* End the session if not already ended. Idempotent — re-ending a session
|
||||
* returns the unchanged row and never creates a duplicate summary job
|
||||
* because the (team_id, project_id, source_type='session_summary',
|
||||
* source_id) UNIQUE constraint on observation_generation_jobs collapses
|
||||
* duplicate enqueue attempts.
|
||||
*/
|
||||
async endSession(
|
||||
input: { id: string } & ServerSessionScope,
|
||||
): Promise<PostgresServerSession | null> {
|
||||
return this.repo.endSession({
|
||||
id: input.id,
|
||||
projectId: input.projectId,
|
||||
teamId: input.teamId,
|
||||
});
|
||||
}
|
||||
|
||||
async markGenerationStarted(
|
||||
input: { id: string } & ServerSessionScope,
|
||||
): Promise<PostgresServerSession | null> {
|
||||
return this.repo.markGenerationStarted({
|
||||
id: input.id,
|
||||
projectId: input.projectId,
|
||||
teamId: input.teamId,
|
||||
});
|
||||
}
|
||||
|
||||
async markGenerationCompleted(
|
||||
input: { id: string } & ServerSessionScope,
|
||||
): Promise<PostgresServerSession | null> {
|
||||
return this.repo.markGenerationCompleted({
|
||||
id: input.id,
|
||||
projectId: input.projectId,
|
||||
teamId: input.teamId,
|
||||
});
|
||||
}
|
||||
|
||||
async markGenerationFailed(
|
||||
input: { id: string; error?: string | null } & ServerSessionScope,
|
||||
): Promise<PostgresServerSession | null> {
|
||||
return this.repo.markGenerationFailed({
|
||||
id: input.id,
|
||||
projectId: input.projectId,
|
||||
teamId: input.teamId,
|
||||
error: input.error ?? null,
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
export function createServerSessionRuntimeRepository(
|
||||
pool: PostgresPool,
|
||||
): ServerSessionRuntimeRepository {
|
||||
return new ServerSessionRuntimeRepository({ client: pool });
|
||||
}
|
||||
@@ -0,0 +1,206 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
|
||||
import type { JobsOptions } from 'bullmq';
|
||||
import type {
|
||||
GenerateObservationsForEventJob,
|
||||
GenerateSessionSummaryJob,
|
||||
} from '../jobs/types.js';
|
||||
import { buildServerJobId } from '../jobs/job-id.js';
|
||||
import type { PostgresAgentEvent } from '../../storage/postgres/agent-events.js';
|
||||
import type { PostgresObservationGenerationJob } from '../../storage/postgres/generation-jobs.js';
|
||||
|
||||
// SessionGenerationPolicy decides WHEN to enqueue work for the BullMQ event
|
||||
// and summary lanes. It is configurable via:
|
||||
// - CLAUDE_MEM_SERVER_SESSION_POLICY env var (per-process default)
|
||||
// - per-call override (per-team settings can plug in here later)
|
||||
//
|
||||
// Three policies are supported:
|
||||
// - 'per-event' (default): enqueue immediately on every event POST.
|
||||
// Matches Phase 4/5 behavior.
|
||||
// - 'debounce': enqueue with `delay`; when a new event arrives within
// the window, replace the delayed job. BullMQ ignores add()
// for a jobId that already exists, so
// scheduleDebouncedEventJob removes the delayed entry
// before re-adding it; removeOnComplete/Fail keep things
// tidy. Outbox row is canonical so durability is safe.
|
||||
// - 'end-of-session': only enqueue summary jobs at /v1/sessions/:id/end.
|
||||
// Per-event posts skip BullMQ entirely; the outbox row
|
||||
// remains in `queued` state and startup reconciliation
|
||||
// will publish it later (or it can be cancelled).
|
||||
//
|
||||
// Anti-pattern guard: the policy MUST NOT use ActiveSession-style cached
|
||||
// state. Inputs are always reloaded by the caller from Postgres before this
|
||||
// fires.
|
||||
|
||||
export type ServerSessionGenerationPolicy = 'per-event' | 'debounce' | 'end-of-session';
|
||||
|
||||
const DEFAULT_DEBOUNCE_MS = 5000;
|
||||
|
||||
export interface SessionGenerationPolicyOptions {
|
||||
policy?: ServerSessionGenerationPolicy;
|
||||
debounceWindowMs?: number;
|
||||
}
|
||||
|
||||
export function resolveSessionGenerationPolicy(
|
||||
options: SessionGenerationPolicyOptions = {},
|
||||
): { policy: ServerSessionGenerationPolicy; debounceWindowMs: number } {
|
||||
const envPolicy = (process.env.CLAUDE_MEM_SERVER_SESSION_POLICY ?? '').trim().toLowerCase();
|
||||
const policy: ServerSessionGenerationPolicy = options.policy
|
||||
?? (envPolicy === 'debounce' || envPolicy === 'end-of-session' || envPolicy === 'per-event'
|
||||
? envPolicy
|
||||
: 'per-event');
|
||||
const debounceWindowMs = options.debounceWindowMs
|
||||
?? (Number.parseInt(process.env.CLAUDE_MEM_SERVER_SESSION_DEBOUNCE_MS ?? '', 10)
|
||||
|| DEFAULT_DEBOUNCE_MS);
|
||||
return {
|
||||
policy,
|
||||
debounceWindowMs: Number.isFinite(debounceWindowMs) && debounceWindowMs > 0
|
||||
? debounceWindowMs
|
||||
: DEFAULT_DEBOUNCE_MS,
|
||||
};
|
||||
}
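
The precedence rules in `resolveSessionGenerationPolicy` (per-call override, then env var, then default) can be checked with a small standalone sketch. It takes the environment as a plain object instead of reading `process.env`; `resolvePolicy`, `isPolicy`, and the `Env` type are names introduced here for illustration.

```typescript
type Policy = 'per-event' | 'debounce' | 'end-of-session';
type Env = Record<string, string | undefined>;

const DEFAULT_DEBOUNCE_MS = 5000;

function isPolicy(value: string): value is Policy {
  return value === 'per-event' || value === 'debounce' || value === 'end-of-session';
}

// Mirrors the resolution order above: explicit option, then env, then default.
function resolvePolicy(
  env: Env,
  override: { policy?: Policy; debounceWindowMs?: number } = {},
): { policy: Policy; debounceWindowMs: number } {
  const envPolicy = (env['CLAUDE_MEM_SERVER_SESSION_POLICY'] ?? '').trim().toLowerCase();
  const policy: Policy = override.policy ?? (isPolicy(envPolicy) ? envPolicy : 'per-event');
  const parsed = Number.parseInt(env['CLAUDE_MEM_SERVER_SESSION_DEBOUNCE_MS'] ?? '', 10);
  // NaN and 0 are falsy, so both fall through to the default here.
  const candidate = override.debounceWindowMs ?? (parsed || DEFAULT_DEBOUNCE_MS);
  const debounceWindowMs = Number.isFinite(candidate) && candidate > 0
    ? candidate
    : DEFAULT_DEBOUNCE_MS;
  return { policy, debounceWindowMs };
}

const defaults = resolvePolicy({});
// → { policy: 'per-event', debounceWindowMs: 5000 }

const fromEnv = resolvePolicy({
  CLAUDE_MEM_SERVER_SESSION_POLICY: ' Debounce ',
  CLAUDE_MEM_SERVER_SESSION_DEBOUNCE_MS: '250',
});
// → { policy: 'debounce', debounceWindowMs: 250 }

const invalid = resolvePolicy({
  CLAUDE_MEM_SERVER_SESSION_POLICY: 'nonsense',
  CLAUDE_MEM_SERVER_SESSION_DEBOUNCE_MS: '-5',
});
// → { policy: 'per-event', debounceWindowMs: 5000 }
```

Unknown policy names and non-positive windows fall back silently to the defaults, so a typo in the env var does not fail startup.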
|
||||
|
||||
export interface EnqueueEventDecisionInput {
|
||||
event: PostgresAgentEvent;
|
||||
outbox: PostgresObservationGenerationJob;
|
||||
// Phase 11 — identity context captured at HTTP ingest time so the BullMQ
|
||||
// payload carries every audit field. apiKeyId may be null for local-dev
|
||||
// enqueues and `actorId` follows the api key's `actor_id` column.
|
||||
apiKeyId?: string | null;
|
||||
actorId?: string | null;
|
||||
sourceAdapter?: string | null;
|
||||
// Phase 12 — request correlation id minted at the HTTP boundary.
|
||||
requestId?: string | null;
|
||||
}
|
||||
|
||||
export interface EnqueueEventDecision {
|
||||
shouldEnqueue: boolean;
|
||||
jobId: string;
|
||||
payload: GenerateObservationsForEventJob;
|
||||
jobsOptions?: JobsOptions;
|
||||
reason: 'per-event' | 'debounce' | 'end-of-session-skip';
|
||||
}
|
||||
|
||||
export function buildEnqueueEventDecision(
|
||||
input: EnqueueEventDecisionInput,
|
||||
options: SessionGenerationPolicyOptions = {},
|
||||
): EnqueueEventDecision {
|
||||
const resolved = resolveSessionGenerationPolicy(options);
|
||||
const jobId = input.outbox.bullmqJobId ?? buildServerJobId({
|
||||
kind: 'event',
|
||||
team_id: input.event.teamId,
|
||||
project_id: input.event.projectId,
|
||||
source_type: 'agent_event',
|
||||
source_id: input.event.id,
|
||||
});
|
||||
const payload: GenerateObservationsForEventJob = {
|
||||
kind: 'event',
|
||||
team_id: input.outbox.teamId,
|
||||
project_id: input.outbox.projectId,
|
||||
source_type: 'agent_event',
|
||||
source_id: input.event.id,
|
||||
generation_job_id: input.outbox.id,
|
||||
agent_event_id: input.event.id,
|
||||
api_key_id: input.apiKeyId ?? null,
|
||||
actor_id: input.actorId ?? null,
|
||||
source_adapter: input.sourceAdapter ?? input.event.sourceAdapter ?? 'api',
|
||||
request_id: input.requestId ?? null,
|
||||
};
|
||||
|
||||
if (resolved.policy === 'end-of-session') {
|
||||
return { shouldEnqueue: false, jobId, payload, reason: 'end-of-session-skip' };
|
||||
}
|
||||
|
||||
if (resolved.policy === 'debounce') {
|
||||
return {
|
||||
shouldEnqueue: true,
|
||||
jobId,
|
||||
payload,
|
||||
jobsOptions: { delay: resolved.debounceWindowMs },
|
||||
reason: 'debounce',
|
||||
};
|
||||
}
|
||||
|
||||
return { shouldEnqueue: true, jobId, payload, reason: 'per-event' };
|
||||
}
|
||||
|
||||
// Minimal queue surface used by scheduleDebouncedEventJob. Declared as an
|
||||
// interface (instead of `Pick<ServerJobQueue<...>, ...>`) so the parameter
|
||||
// accepts ServerJobQueue<ServerGenerationJobPayload> at the call site without
|
||||
// triggering invariant TPayload type errors. The ServerJobQueue.add signature
|
||||
// is structurally compatible — it requires `payload: TPayload`, and we only
|
||||
// hand in narrowed payloads.
|
||||
export interface DebounceableEventQueue {
|
||||
add(jobId: string, payload: GenerateObservationsForEventJob, options?: JobsOptions): Promise<void>;
|
||||
remove(jobId: string): Promise<void>;
|
||||
getJob(jobId: string): Promise<unknown>;
|
||||
}
|
||||
|
||||
/**
|
||||
* Apply a debounce decision to a BullMQ queue. If a delayed job already exists
|
||||
* for this deterministic id, BullMQ's `add(jobId, ...)` will be a no-op, so we
|
||||
* proactively remove it first so the new event's delay window starts fresh.
|
||||
*
|
||||
* This implements the "if a new event arrives within window, replace the
|
||||
* delayed job" requirement.
|
||||
*/
|
||||
export async function scheduleDebouncedEventJob(
|
||||
queue: DebounceableEventQueue,
|
||||
decision: EnqueueEventDecision,
|
||||
): Promise<void> {
|
||||
if (!decision.shouldEnqueue) return;
|
||||
if (decision.reason === 'debounce') {
|
||||
try {
|
||||
const existing = await queue.getJob(decision.jobId);
|
||||
if (existing) {
|
||||
await queue.remove(decision.jobId);
|
||||
}
|
||||
} catch {
|
||||
// best-effort; if remove fails because the job already moved to active
|
||||
// we just let `add` no-op or fail through to the caller's error handler
|
||||
}
|
||||
}
|
||||
await queue.add(decision.jobId, decision.payload, decision.jobsOptions);
|
||||
}
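
The remove-then-add step in `scheduleDebouncedEventJob` is easier to see against a toy queue. The sketch below is a synchronous, in-memory stand-in and not BullMQ: `InMemoryDelayedQueue` and `scheduleDebounced` are hypothetical names, and the only BullMQ behavior it models is the one the doc comment above relies on, namely that adding a job whose id already exists is a no-op.

```typescript
interface DelayedJob {
  payload: string;
  runAt: number; // timestamp (ms) at which the job becomes runnable
}

class InMemoryDelayedQueue {
  private readonly jobs = new Map<string, DelayedJob>();

  // Returns false and changes nothing when the id is already present.
  add(jobId: string, payload: string, delayMs: number, now: number): boolean {
    if (this.jobs.has(jobId)) return false;
    this.jobs.set(jobId, { payload, runAt: now + delayMs });
    return true;
  }

  remove(jobId: string): void {
    this.jobs.delete(jobId);
  }

  get(jobId: string): DelayedJob | undefined {
    return this.jobs.get(jobId);
  }
}

// Debounce: drop any pending job with this id so the delay window restarts.
function scheduleDebounced(
  queue: InMemoryDelayedQueue,
  jobId: string,
  payload: string,
  delayMs: number,
  now: number,
): void {
  if (queue.get(jobId)) {
    queue.remove(jobId);
  }
  queue.add(jobId, payload, delayMs, now);
}

const queue = new InMemoryDelayedQueue();
queue.add('job-1', 'first', 5000, 0);

// A plain re-add one second later is ignored: the job still runs at t=5000.
const reAdded = queue.add('job-1', 'second', 5000, 1000);
const afterPlainAdd = queue.get('job-1');

// The debounced path replaces it: the job now runs at t=6000.
scheduleDebounced(queue, 'job-1', 'second', 5000, 1000);
const afterDebounce = queue.get('job-1');
```

In the real queue the remove can race with the job becoming active, which is why the function above treats removal as best-effort and lets `add` proceed regardless.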
|
||||
|
||||
export interface BuildSummaryJobInput {
|
||||
serverSessionId: string;
|
||||
teamId: string;
|
||||
projectId: string;
|
||||
generationJobId: string;
|
||||
// Phase 11 — same identity context the event-payload builder receives.
|
||||
apiKeyId?: string | null;
|
||||
actorId?: string | null;
|
||||
sourceAdapter?: string | null;
|
||||
// Phase 12 — request correlation id flows into the summary lane too.
|
||||
requestId?: string | null;
|
||||
}
|
||||
|
||||
export function buildSummaryJobId(input: {
|
||||
serverSessionId: string;
|
||||
teamId: string;
|
||||
projectId: string;
|
||||
}): string {
|
||||
return buildServerJobId({
|
||||
kind: 'summary',
|
||||
team_id: input.teamId,
|
||||
project_id: input.projectId,
|
||||
source_type: 'session_summary',
|
||||
source_id: input.serverSessionId,
|
||||
});
|
||||
}
|
||||
|
||||
export function buildSummaryJobPayload(input: BuildSummaryJobInput): GenerateSessionSummaryJob {
|
||||
return {
|
||||
kind: 'summary',
|
||||
team_id: input.teamId,
|
||||
project_id: input.projectId,
|
||||
source_type: 'session_summary',
|
||||
source_id: input.serverSessionId,
|
||||
generation_job_id: input.generationJobId,
|
||||
server_session_id: input.serverSessionId,
|
||||
api_key_id: input.apiKeyId ?? null,
|
||||
actor_id: input.actorId ?? null,
|
||||
source_adapter: input.sourceAdapter ?? 'api',
|
||||
request_id: input.requestId ?? null,
|
||||
};
|
||||
}
|
||||
@@ -1,10 +1,17 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
|
||||
import { existsSync } from 'fs';
|
||||
import { logger } from '../../utils/logger.js';
|
||||
import { createPostgresStorageRepositories, getSharedPostgresPool, SERVER_BETA_POSTGRES_SCHEMA_VERSION } from '../../storage/postgres/index.js';
|
||||
import { bootstrapServerBetaPostgresSchema } from '../../storage/postgres/schema.js';
|
||||
import type { PostgresPool } from '../../storage/postgres/pool.js';
|
||||
import { getRedisQueueConfig } from '../queue/redis-config.js';
|
||||
import { ActiveServerBetaQueueManager } from './ActiveServerBetaQueueManager.js';
|
||||
import { ActiveServerBetaGenerationWorkerManager } from './ActiveServerBetaGenerationWorkerManager.js';
|
||||
import { ClaudeObservationProvider } from '../generation/providers/ClaudeObservationProvider.js';
|
||||
import { GeminiObservationProvider } from '../generation/providers/GeminiObservationProvider.js';
|
||||
import { OpenRouterObservationProvider } from '../generation/providers/OpenRouterObservationProvider.js';
|
||||
import type { ServerGenerationProvider } from '../generation/providers/shared/types.js';
|
||||
import { ServerBetaService } from './ServerBetaService.js';
|
||||
import {
|
||||
DisabledServerBetaEventBroadcaster,
|
||||
@@ -13,6 +20,7 @@ import {
|
||||
DisabledServerBetaQueueManager,
|
||||
type ServerBetaAuthMode,
|
||||
type ServerBetaBootstrapStatus,
|
||||
type ServerBetaGenerationWorkerManager,
|
||||
type ServerBetaQueueManager,
|
||||
type ServerBetaServiceGraph,
|
||||
} from './types.js';
|
||||
@@ -22,13 +30,147 @@ export interface CreateServerBetaServiceOptions {
|
||||
authMode?: ServerBetaAuthMode;
|
||||
bootstrapSchema?: boolean;
|
||||
queueManager?: ServerBetaQueueManager;
|
||||
// Phase 5 seam: tests can inject a fake provider without env config.
|
||||
generationProvider?: ServerGenerationProvider;
|
||||
generationWorkerManager?: ServerBetaGenerationWorkerManager;
|
||||
// Phase 10: when true, skip building the generation worker. Used when the
|
||||
// service is just an HTTP front-end and a separate `server worker` process
|
||||
// consumes the BullMQ queues.
|
||||
generationDisabled?: boolean;
|
||||
// Phase 10: skip env validation (tests). Production code paths always run
|
||||
// validation so misconfiguration fails fast at startup.
|
||||
skipEnvValidation?: boolean;
|
||||
}
|
||||
|
||||
// Phase 10 — env validation. Server beta in Docker requires explicit, complete
|
||||
// configuration. Missing pieces fail fast at startup rather than silently
|
||||
// degrading. Required env when running in Docker:
|
||||
// - CLAUDE_MEM_SERVER_DATABASE_URL (Postgres)
|
||||
// - CLAUDE_MEM_QUEUE_ENGINE=bullmq (no in-memory queue in Docker)
|
||||
// - CLAUDE_MEM_REDIS_URL (BullMQ requires Redis/Valkey)
|
||||
// - CLAUDE_MEM_AUTH_MODE != local-dev (auth must be real in Docker)
|
||||
// `local-dev` bypass is only valid on a developer's loopback; in Docker the
|
||||
// container is reachable via service-to-service networking and exposed ports,
|
||||
// so the loopback assumption is invalid.
|
||||
export interface ServerBetaEnvValidationOptions {
|
||||
env?: NodeJS.ProcessEnv;
|
||||
isDocker?: boolean;
|
||||
}
|
||||
|
||||
export interface ServerBetaEnvValidationResult {
|
||||
isDocker: boolean;
|
||||
runtime: string;
|
||||
authMode: string;
|
||||
queueEngine: string;
|
||||
hasDatabaseUrl: boolean;
|
||||
hasRedisUrl: boolean;
|
||||
}
|
||||
|
||||
export function detectDockerEnvironment(env: NodeJS.ProcessEnv = process.env): boolean {
|
||||
if (env.CLAUDE_MEM_DOCKER === '1' || env.CLAUDE_MEM_DOCKER === 'true') return true;
|
||||
// /.dockerenv is the canonical Docker marker; existsSync is cheap.
|
||||
try {
|
||||
if (existsSync('/.dockerenv')) return true;
|
||||
} catch {
|
||||
// ignore
|
||||
}
|
||||
return false;
|
||||
}
|
||||
|
||||
export function validateServerBetaEnv(
|
||||
options: ServerBetaEnvValidationOptions = {},
|
||||
): ServerBetaEnvValidationResult {
|
||||
const env = options.env ?? process.env;
|
||||
const isDocker = options.isDocker ?? detectDockerEnvironment(env);
|
||||
const errors: string[] = [];
|
||||
|
||||
const runtime = (env.CLAUDE_MEM_RUNTIME ?? '').trim();
|
||||
if (!runtime) {
|
||||
// Allowed: an unset runtime is defaulted to 'worker' upstream. In Docker
// we log a warning so operators know server-beta is the active runtime here.
|
||||
if (isDocker) {
|
||||
logger.warn('SYSTEM', 'CLAUDE_MEM_RUNTIME unset; server-beta container assumes runtime=server-beta');
|
||||
}
|
||||
} else if (runtime !== 'server-beta' && isDocker) {
|
||||
errors.push(
|
||||
`CLAUDE_MEM_RUNTIME=${runtime} is invalid in Docker; the server-beta image only runs CLAUDE_MEM_RUNTIME=server-beta.`,
|
||||
);
|
||||
}
|
||||
|
||||
const authMode = (env.CLAUDE_MEM_AUTH_MODE ?? 'api-key').trim();
|
||||
if (isDocker) {
|
||||
if (authMode === 'local-dev') {
|
||||
errors.push(
|
||||
'CLAUDE_MEM_AUTH_MODE=local-dev is not allowed in Docker. Set CLAUDE_MEM_AUTH_MODE=api-key and create a key with `claude-mem server api-key create`.',
|
||||
);
|
||||
}
|
||||
if (
|
||||
env.CLAUDE_MEM_ALLOW_LOCAL_DEV_BYPASS === '1'
|
||||
|| env.CLAUDE_MEM_ALLOW_LOCAL_DEV_BYPASS === 'true'
|
||||
) {
|
||||
errors.push(
|
||||
'CLAUDE_MEM_ALLOW_LOCAL_DEV_BYPASS is not allowed in Docker. Loopback bypass cannot be enforced inside a container; remove the variable.',
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
const queueEngine = (env.CLAUDE_MEM_QUEUE_ENGINE ?? '').trim().toLowerCase();
|
||||
if (isDocker) {
|
||||
if (!queueEngine) {
|
||||
errors.push('CLAUDE_MEM_QUEUE_ENGINE is required in Docker; set it to "bullmq".');
|
||||
} else if (queueEngine !== 'bullmq') {
|
||||
errors.push(
|
||||
`CLAUDE_MEM_QUEUE_ENGINE=${queueEngine} is not allowed in Docker. Only "bullmq" is supported (no in-process queues across container boundaries).`,
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
const hasDatabaseUrl = Boolean((env.CLAUDE_MEM_SERVER_DATABASE_URL ?? '').trim());
|
||||
if (!hasDatabaseUrl) {
|
||||
errors.push('CLAUDE_MEM_SERVER_DATABASE_URL is required to start server-beta (Postgres connection string).');
|
||||
}
|
||||
|
||||
const hasRedisUrl = Boolean((env.CLAUDE_MEM_REDIS_URL ?? '').trim());
|
||||
if (queueEngine === 'bullmq' && !hasRedisUrl) {
|
||||
errors.push('CLAUDE_MEM_REDIS_URL is required when CLAUDE_MEM_QUEUE_ENGINE=bullmq.');
|
||||
}
|
||||
|
||||
if (errors.length > 0) {
|
||||
const message = [
|
||||
'server-beta startup configuration is invalid:',
|
||||
...errors.map(line => ` - ${line}`),
|
||||
].join('\n');
|
||||
throw new Error(message);
|
||||
}
|
||||
|
||||
return {
|
||||
isDocker,
|
||||
runtime: runtime || 'server-beta',
|
||||
authMode,
|
||||
queueEngine: queueEngine || 'disabled',
|
||||
hasDatabaseUrl,
|
||||
hasRedisUrl,
|
||||
};
|
||||
}
|
||||
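The validation above collects every problem into one `errors` array and throws a single time, so an operator sees all misconfigurations in one startup failure instead of fixing them one by one. The following is a minimal sketch of that accumulate-then-throw pattern, not the PR's function: `validateEnvSketch` and the `Env` type are hypothetical names, and only the database and Redis rules are reproduced.

```typescript
// Hypothetical, reduced version of the accumulate-then-throw validation.
// Only two of the rules from the diff are reproduced here.
type Env = Record<string, string | undefined>;

function validateEnvSketch(env: Env): { queueEngine: string } {
  const errors: string[] = [];

  const queueEngine = (env.CLAUDE_MEM_QUEUE_ENGINE ?? '').trim().toLowerCase();

  // Rule 1: a Postgres connection string is always required.
  const hasDatabaseUrl = Boolean((env.CLAUDE_MEM_SERVER_DATABASE_URL ?? '').trim());
  if (!hasDatabaseUrl) {
    errors.push('CLAUDE_MEM_SERVER_DATABASE_URL is required.');
  }

  // Rule 2: Redis is required only when BullMQ is selected.
  const hasRedisUrl = Boolean((env.CLAUDE_MEM_REDIS_URL ?? '').trim());
  if (queueEngine === 'bullmq' && !hasRedisUrl) {
    errors.push('CLAUDE_MEM_REDIS_URL is required when CLAUDE_MEM_QUEUE_ENGINE=bullmq.');
  }

  // Report every collected problem in one error.
  if (errors.length > 0) {
    const lines = ['configuration is invalid:', ...errors.map(e => `  - ${e}`)];
    throw new Error(lines.join('\n'));
  }
  return { queueEngine: queueEngine || 'disabled' };
}
```

Calling `validateEnvSketch({ CLAUDE_MEM_QUEUE_ENGINE: 'bullmq' })` would throw one error whose message lists both missing variables.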

export async function createServerBetaService(
  options: CreateServerBetaServiceOptions = {},
): Promise<ServerBetaService> {
  if (!options.skipEnvValidation) {
    validateServerBetaEnv();
  }
  const pool = options.pool ?? getSharedPostgresPool({ requireDatabaseUrl: true });
  const bootstrap = await initializePostgres(pool, options.bootstrapSchema ?? true);
  const queueManager = options.queueManager ?? buildQueueManager();
  const generationDisabled = options.generationDisabled
    ?? (process.env.CLAUDE_MEM_GENERATION_DISABLED === '1'
      || process.env.CLAUDE_MEM_GENERATION_DISABLED === 'true');
  const generationWorkerManager = options.generationWorkerManager
    ?? (generationDisabled
      ? new DisabledServerBetaGenerationWorkerManager(
        'CLAUDE_MEM_GENERATION_DISABLED is set; this server runs HTTP only. A separate `claude-mem server worker start` process consumes the BullMQ queues.',
      )
      : buildGenerationWorkerManager(pool, queueManager, options.generationProvider));
  const graph: ServerBetaServiceGraph = {
    runtime: 'server-beta',
    postgres: {
@@ -36,16 +178,74 @@ export async function createServerBetaService(
      bootstrap,
    },
    authMode: options.authMode ?? parseAuthMode(process.env.CLAUDE_MEM_AUTH_MODE),
    queueManager: options.queueManager ?? buildQueueManager(),
    generationWorkerManager: new DisabledServerBetaGenerationWorkerManager('Phase 2 boundary only; generation workers are not wired.'),
    providerRegistry: new DisabledServerBetaProviderRegistry('Phase 2 boundary only; provider-backed generation is not wired.'),
    queueManager,
    generationWorkerManager,
    providerRegistry: new DisabledServerBetaProviderRegistry('Phase 5 keeps the provider registry boundary as inert; per-call providers are owned by the generation worker manager.'),
    eventBroadcaster: new DisabledServerBetaEventBroadcaster('Phase 2 boundary only; SSE/event broadcasting is not wired.'),
    storage: createPostgresStorageRepositories(pool),
  };

  if (generationWorkerManager instanceof ActiveServerBetaGenerationWorkerManager) {
    generationWorkerManager.start();
  }

  return new ServerBetaService({ graph });
}

function buildGenerationWorkerManager(
  pool: PostgresPool,
  queueManager: ServerBetaQueueManager,
  injectedProvider?: ServerGenerationProvider,
): ServerBetaGenerationWorkerManager {
  if (!(queueManager instanceof ActiveServerBetaQueueManager)) {
    return new DisabledServerBetaGenerationWorkerManager(
      'queue manager is disabled; set CLAUDE_MEM_QUEUE_ENGINE=bullmq to enable provider generation.',
    );
  }
  const provider = injectedProvider ?? buildServerGenerationProviderFromEnv();
  if (!provider) {
    return new DisabledServerBetaGenerationWorkerManager(
      'no server generation provider configured; set CLAUDE_MEM_SERVER_PROVIDER and the matching API key to enable.',
    );
  }
  return new ActiveServerBetaGenerationWorkerManager({
    pool,
    queueManager,
    provider,
  });
}

function buildServerGenerationProviderFromEnv(): ServerGenerationProvider | null {
  const provider = (process.env.CLAUDE_MEM_SERVER_PROVIDER ?? '').trim().toLowerCase();
  if (!provider) return null;
  try {
    if (provider === 'claude' || provider === 'anthropic') {
      const apiKey = process.env.ANTHROPIC_API_KEY ?? process.env.CLAUDE_MEM_ANTHROPIC_API_KEY ?? '';
      if (!apiKey) return null;
      const opts: { apiKey: string; model?: string } = { apiKey };
      if (process.env.CLAUDE_MEM_SERVER_MODEL) opts.model = process.env.CLAUDE_MEM_SERVER_MODEL;
      return new ClaudeObservationProvider(opts);
    }
    if (provider === 'gemini') {
      const apiKey = process.env.GEMINI_API_KEY ?? process.env.CLAUDE_MEM_GEMINI_API_KEY ?? '';
      if (!apiKey) return null;
      const opts: { apiKey: string; model?: string } = { apiKey };
      if (process.env.CLAUDE_MEM_SERVER_MODEL) opts.model = process.env.CLAUDE_MEM_SERVER_MODEL;
      return new GeminiObservationProvider(opts);
    }
    if (provider === 'openrouter') {
      const apiKey = process.env.OPENROUTER_API_KEY ?? process.env.CLAUDE_MEM_OPENROUTER_API_KEY ?? '';
      if (!apiKey) return null;
      const opts: { apiKey: string; model?: string } = { apiKey };
      if (process.env.CLAUDE_MEM_SERVER_MODEL) opts.model = process.env.CLAUDE_MEM_SERVER_MODEL;
      return new OpenRouterObservationProvider(opts);
    }
  } catch {
    return null;
  }
  return null;
}
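The three provider branches above differ only in the pair of environment variables they read and the class they construct. One possible way to express the same lookup is a small table, sketched below under stated assumptions: `resolveProviderConfig`, `ProviderConfigSketch`, and `KEY_VARS` are hypothetical names, and the sketch returns a plain config object instead of constructing the provider classes, which are not shown in this diff. The environment variable names are the ones the diff reads.

```typescript
// Hypothetical table-driven variant of the provider lookup.
type Env = Record<string, string | undefined>;
type ProviderName = 'anthropic' | 'gemini' | 'openrouter';

interface ProviderConfigSketch {
  provider: ProviderName;
  apiKey: string;
  model?: string;
}

// [primary, fallback] env var names, as read in the diff.
const KEY_VARS: Record<ProviderName, [string, string]> = {
  anthropic: ['ANTHROPIC_API_KEY', 'CLAUDE_MEM_ANTHROPIC_API_KEY'],
  gemini: ['GEMINI_API_KEY', 'CLAUDE_MEM_GEMINI_API_KEY'],
  openrouter: ['OPENROUTER_API_KEY', 'CLAUDE_MEM_OPENROUTER_API_KEY'],
};

// 'claude' is accepted as an alias for 'anthropic', matching the diff.
function normalizeProvider(raw: string): ProviderName | null {
  switch (raw) {
    case 'claude':
    case 'anthropic':
      return 'anthropic';
    case 'gemini':
      return 'gemini';
    case 'openrouter':
      return 'openrouter';
    default:
      return null;
  }
}

function resolveProviderConfig(env: Env): ProviderConfigSketch | null {
  const raw = (env.CLAUDE_MEM_SERVER_PROVIDER ?? '').trim().toLowerCase();
  const provider = normalizeProvider(raw);
  if (!provider) return null;

  const [primary, fallback] = KEY_VARS[provider];
  // `??` only falls back on undefined, so a primary variable set to an
  // empty string does not fall through to the fallback (same as the diff).
  const apiKey = env[primary] ?? env[fallback] ?? '';
  if (!apiKey) return null;

  const config: ProviderConfigSketch = { provider, apiKey };
  if (env.CLAUDE_MEM_SERVER_MODEL) config.model = env.CLAUDE_MEM_SERVER_MODEL;
  return config;
}
```

Returning `null` for every unusable configuration keeps the caller's behavior the same as in the diff: a missing provider disables generation instead of failing startup.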

// Queue manager selection is fail-fast on misconfiguration. If the user
// explicitly opts into BullMQ via CLAUDE_MEM_QUEUE_ENGINE=bullmq we build
// the active manager; any error there throws so the runtime does not

@@ -20,6 +20,24 @@ export interface ServerBetaBoundaryHealth {
  details?: Record<string, unknown>;
}

// Phase 12 — per-lane queue metric snapshot. Returned by
// ActiveServerBetaQueueManager.getLaneMetrics so /api/health and /v1/info
// can publish current waiting/active/completed/failed/delayed/stalled counts
// for each generation lane. `unavailable` is set when Redis was unreachable
// at sample time so /api/health still responds rather than 500'ing.
export interface ServerBetaQueueLaneMetric {
  kind: string;
  name: string;
  waiting: number;
  active: number;
  completed: number;
  failed: number;
  delayed: number;
  stalled: number;
  unavailable: boolean;
  unavailableReason?: string;
}

export interface ServerBetaQueueManager {
  readonly kind: 'queue-manager';
  getHealth(): ServerBetaBoundaryHealth;

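The comment on `ServerBetaQueueLaneMetric` says a lane is marked `unavailable` when Redis cannot be reached, so the health endpoint still responds. A minimal sketch of how such a snapshot might be built is shown below; `toLaneMetric`, `sampleLane`, `LaneSample`, and the `fetchCounts` callback are hypothetical and stand in for whatever `getLaneMetrics` does internally, which this diff does not show.

```typescript
// Hypothetical sketch: turn a count sample (or a failure) into a lane metric.
interface LaneCounts {
  waiting: number;
  active: number;
  completed: number;
  failed: number;
  delayed: number;
  stalled: number;
}

interface LaneMetricSketch extends LaneCounts {
  kind: string;
  name: string;
  unavailable: boolean;
  unavailableReason?: string;
}

type LaneSample =
  | { ok: true; counts: LaneCounts }
  | { ok: false; reason: unknown };

const ZERO_COUNTS: LaneCounts = {
  waiting: 0, active: 0, completed: 0, failed: 0, delayed: 0, stalled: 0,
};

function toLaneMetric(kind: string, name: string, sample: LaneSample): LaneMetricSketch {
  if (sample.ok) {
    return { kind, name, ...sample.counts, unavailable: false };
  }
  // Report zeros plus a reason instead of propagating the error.
  const reason = sample.reason instanceof Error ? sample.reason.message : String(sample.reason);
  return { kind, name, ...ZERO_COUNTS, unavailable: true, unavailableReason: reason };
}

// Async wrapper: a rejected fetch becomes an `unavailable` metric.
async function sampleLane(
  kind: string,
  name: string,
  fetchCounts: () => Promise<LaneCounts>,
): Promise<LaneMetricSketch> {
  try {
    return toLaneMetric(kind, name, { ok: true, counts: await fetchCounts() });
  } catch (reason) {
    return toLaneMetric(kind, name, { ok: false, reason });
  }
}
```

Because `sampleLane` never rejects, a caller can sample every lane with `Promise.all` and still return a complete health payload when Redis is down.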
@@ -0,0 +1,155 @@
// SPDX-License-Identifier: Apache-2.0

// Shared session-end + summary-job path used by both `/v1/sessions/:id/end`
// (canonical) and `src/server/compat/SessionsSummarizeAdapter.ts` (legacy
// translator). Both call sites must produce identical Postgres state and
// queue effects: ended_at idempotency, exactly one outbox row per session
// summary, deterministic BullMQ job id.
//
// This module MUST NOT import from src/services/worker/* — Phase 9 keeps
// the compat shim coupled to Server beta core only.

import {
  PostgresObservationGenerationJobEventsRepository,
  PostgresObservationGenerationJobRepository,
  type PostgresObservationGenerationJob,
} from '../../storage/postgres/generation-jobs.js';
import type { PostgresPool } from '../../storage/postgres/pool.js';
import { withPostgresTransaction } from '../../storage/postgres/pool.js';
import {
  PostgresServerSessionsRepository,
  type PostgresServerSession,
} from '../../storage/postgres/server-sessions.js';
import { logger } from '../../utils/logger.js';
import { buildSummaryJobId, buildSummaryJobPayload } from '../runtime/SessionGenerationPolicy.js';
import type { GenerateSessionSummaryJob } from '../jobs/types.js';
import type { EnqueueOutcome, EventQueueLike } from './IngestEventsService.js';
import { newId } from '../../storage/postgres/utils.js';

const SUMMARY_JOB_TYPE = 'observation_generate_session_summary';

export interface EndSessionServiceOptions {
  pool: PostgresPool;
  resolveSummaryQueue: () => EventQueueLike | null;
}

export interface EndSessionResult {
  session: PostgresServerSession | null;
  outbox: PostgresObservationGenerationJob | null;
  enqueueState: EnqueueOutcome;
}

export interface EndSessionInput {
  sessionId: string;
  projectId: string;
  teamId: string;
  source?: string;
  // Phase 11 — identity context propagated into the BullMQ summary payload.
  apiKeyId?: string | null;
  actorId?: string | null;
  sourceAdapter?: string | null;
}

export class EndSessionService {
  constructor(private readonly options: EndSessionServiceOptions) {}

  async end(input: EndSessionInput): Promise<EndSessionResult> {
    const source = input.source ?? 'http_post_v1_sessions_end';

    const txResult = await withPostgresTransaction(this.options.pool, async (client) => {
      const sessionsRepo = new PostgresServerSessionsRepository(client);
      const ended = await sessionsRepo.endSession({
        id: input.sessionId,
        projectId: input.projectId,
        teamId: input.teamId,
      });
      if (!ended) {
        return {
          session: null as PostgresServerSession | null,
          outbox: null as PostgresObservationGenerationJob | null,
        };
      }
      const jobsRepo = new PostgresObservationGenerationJobRepository(client);
      const eventsLogRepo = new PostgresObservationGenerationJobEventsRepository(client);
      // Persist the BullMQ payload at create-time so reconciliation and
      // operator retry can re-enqueue a payload that passes the worker's
      // assertServerGenerationJobPayload validation.
      const outboxId = newId();
      const summaryPayload = buildSummaryJobPayload({
        serverSessionId: ended.id,
        teamId: ended.teamId,
        projectId: ended.projectId,
        generationJobId: outboxId,
        apiKeyId: input.apiKeyId ?? null,
        actorId: input.actorId ?? null,
        sourceAdapter: input.sourceAdapter ?? null,
      });
      const outbox = await jobsRepo.create({
        id: outboxId,
        projectId: ended.projectId,
        teamId: ended.teamId,
        sourceType: 'session_summary',
        sourceId: ended.id,
        serverSessionId: ended.id,
        jobType: SUMMARY_JOB_TYPE,
        bullmqJobId: buildSummaryJobId({
          serverSessionId: ended.id,
          teamId: ended.teamId,
          projectId: ended.projectId,
        }),
        payload: summaryPayload as unknown as Record<string, unknown>,
      });
      await eventsLogRepo.append({
        generationJobId: outbox.id,
        projectId: outbox.projectId,
        teamId: outbox.teamId,
        eventType: 'queued',
        statusAfter: outbox.status,
        attempt: outbox.attempts,
        details: { source },
      });
      return { session: ended, outbox };
    });

    if (!txResult.session || !txResult.outbox) {
      return { session: txResult.session, outbox: null, enqueueState: 'skipped' };
    }
    const enqueueState = await this.publishSummaryJob(txResult.session.id, txResult.outbox, input);
    return { session: txResult.session, outbox: txResult.outbox, enqueueState };
  }

  private async publishSummaryJob(
    serverSessionId: string,
    outbox: PostgresObservationGenerationJob,
    input: EndSessionInput,
  ): Promise<'enqueued' | 'queued_only'> {
    const queue = this.options.resolveSummaryQueue();
    if (!queue) {
      return 'queued_only';
    }
    const jobId = outbox.bullmqJobId ?? buildSummaryJobId({
      serverSessionId,
      teamId: outbox.teamId,
      projectId: outbox.projectId,
    });
    const payload: GenerateSessionSummaryJob = buildSummaryJobPayload({
      serverSessionId,
      teamId: outbox.teamId,
      projectId: outbox.projectId,
      generationJobId: outbox.id,
      apiKeyId: input.apiKeyId ?? null,
      actorId: input.actorId ?? null,
      sourceAdapter: input.sourceAdapter ?? null,
    });
    try {
      await queue.add(jobId, payload);
      return 'enqueued';
    } catch (error) {
      logger.warn('SYSTEM', 'failed to publish summary generation job to BullMQ', {
        outboxId: outbox.id,
        error: error instanceof Error ? error.message : String(error),
      });
      return 'queued_only';
    }
  }
}
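`EndSessionService.end` writes the outbox row inside the transaction and publishes to the queue only after commit, mapping the result to `enqueued`, `queued_only`, or `skipped`. The in-memory sketch below illustrates that outbox-then-publish flow under stated assumptions: `InMemoryOutbox`, `endSessionSketch`, and the `summary:...` id format are hypothetical (the real `buildSummaryJobId` is not shown in this diff), and "already ended" is assumed to be the case in which the real `endSession` returns nothing.

```typescript
// Hypothetical in-memory model of the outbox-then-publish flow.
type EnqueueOutcome = 'enqueued' | 'queued_only' | 'skipped';

interface QueueLike {
  add(jobId: string, payload: unknown): Promise<unknown>;
}

interface OutboxRow {
  id: string;
  bullmqJobId: string;
  payload: Record<string, unknown>;
}

interface SessionIds {
  teamId: string;
  projectId: string;
  sessionId: string;
}

// Deterministic id: the same session always maps to the same job id.
function summaryJobIdSketch(ids: SessionIds): string {
  return `summary:${ids.teamId}:${ids.projectId}:${ids.sessionId}`;
}

class InMemoryOutbox {
  private rows = new Map<string, OutboxRow>();

  // Stands in for the transactional write: at most one row per job id.
  createOnce(bullmqJobId: string, payload: Record<string, unknown>): OutboxRow | null {
    if (this.rows.has(bullmqJobId)) return null;
    const row: OutboxRow = { id: `job-${this.rows.size + 1}`, bullmqJobId, payload };
    this.rows.set(bullmqJobId, row);
    return row;
  }

  size(): number {
    return this.rows.size;
  }
}

async function endSessionSketch(
  outbox: InMemoryOutbox,
  resolveQueue: () => QueueLike | null,
  ids: SessionIds,
): Promise<EnqueueOutcome> {
  // Step 1 ("commit"): persist the row together with its payload.
  const row = outbox.createOnce(summaryJobIdSketch(ids), { kind: 'summary', ...ids });
  if (!row) return 'skipped';

  // Step 2 (post-commit): publish. Any failure leaves the row for later pickup.
  const queue = resolveQueue();
  if (!queue) return 'queued_only';
  try {
    await queue.add(row.bullmqJobId, row.payload);
    return 'enqueued';
  } catch {
    return 'queued_only';
  }
}
```

The point of the ordering is that a queue outage never loses work: the row is already durable, so `queued_only` only means the publish has to be retried later.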
@@ -0,0 +1,273 @@
// SPDX-License-Identifier: Apache-2.0

// Shared event-ingest path used by both `/v1/events` (canonical) and
// `src/server/compat/SessionsObservationsAdapter.ts` (legacy translator).
// Centralizes the transactional write (event row + outbox row + lifecycle
// log) and the post-commit BullMQ enqueue so both call sites apply the
// exact same SessionGenerationPolicy and outbox-then-publish guarantees.
//
// This module MUST NOT import from src/services/worker/* — the whole point
// of Phase 9 is to give the compat adapters a translation surface that
// reaches Server beta core directly, with no worker-layer detours.

import type { CreatePostgresAgentEventInput, PostgresAgentEvent } from '../../storage/postgres/agent-events.js';
import { PostgresAgentEventsRepository } from '../../storage/postgres/agent-events.js';
import {
  PostgresObservationGenerationJobEventsRepository,
  PostgresObservationGenerationJobRepository,
  type PostgresObservationGenerationJob,
} from '../../storage/postgres/generation-jobs.js';
import type { PostgresPool } from '../../storage/postgres/pool.js';
import { withPostgresTransaction } from '../../storage/postgres/pool.js';
import { logger } from '../../utils/logger.js';
import { buildServerJobId } from '../jobs/job-id.js';
import type { GenerateObservationsForEventJob } from '../jobs/types.js';
import {
  buildEnqueueEventDecision,
  scheduleDebouncedEventJob,
  type ServerSessionGenerationPolicy,
} from '../runtime/SessionGenerationPolicy.js';
import { newId } from '../../storage/postgres/utils.js';

function buildEventBullmqPayload(input: {
  outboxId: string;
  event: PostgresAgentEvent;
  apiKeyId: string | null;
  actorId: string | null;
  sourceAdapter: string | null;
  requestId: string | null;
}): GenerateObservationsForEventJob {
  return {
    kind: 'event',
    team_id: input.event.teamId,
    project_id: input.event.projectId,
    source_type: 'agent_event',
    source_id: input.event.id,
    generation_job_id: input.outboxId,
    agent_event_id: input.event.id,
    api_key_id: input.apiKeyId,
    actor_id: input.actorId,
    source_adapter: input.sourceAdapter ?? input.event.sourceAdapter ?? 'api',
    request_id: input.requestId,
  };
}

const EVENT_JOB_TYPE = 'observation_generate_for_event';

export type EnqueueOutcome = 'enqueued' | 'queued_only' | 'skipped';

export interface IngestEventsServiceOptions {
  pool: PostgresPool;
  // Lazy queue resolver so the service does not depend on the queue manager
  // type and tests can swap in a fake. When this returns null, the outbox
  // row stays `queued` and Phase 3 startup reconciliation will publish it.
  resolveEventQueue: () => EventQueueLike | null;
  sessionPolicy?: ServerSessionGenerationPolicy;
  sessionDebounceWindowMs?: number;
}

export interface EventQueueLike {
  add(jobId: string, payload: unknown, options?: unknown): Promise<unknown>;
}

export interface IngestEventResult {
  event: PostgresAgentEvent;
  outbox: PostgresObservationGenerationJob | null;
  enqueueState: EnqueueOutcome;
}

export interface IngestEventOptions {
  generate?: boolean;
  source?: string;
  // Phase 11 — identity context that flows from the HTTP auth boundary into
  // the BullMQ payload and audit log. None of these are auth gates: the
  // worker reloads and re-validates from Postgres before any side effect.
  apiKeyId?: string | null;
  actorId?: string | null;
  sourceAdapter?: string | null;
  // Phase 12 — opaque correlation id minted at the HTTP middleware so
  // generator logs and audit rows can pivot back to the originating request.
  requestId?: string | null;
}

export class IngestEventsService {
  constructor(private readonly options: IngestEventsServiceOptions) {}

  async ingestOne(
    input: CreatePostgresAgentEventInput,
    opts: IngestEventOptions = {},
  ): Promise<IngestEventResult> {
    const generate = opts.generate ?? true;
    const source = opts.source ?? 'http_post_v1_events';

    const txResult = await withPostgresTransaction(this.options.pool, async (client) => {
      const eventsRepo = new PostgresAgentEventsRepository(client);
      const inserted = await eventsRepo.create(input);

      if (!generate) {
        return { event: inserted, outbox: null as PostgresObservationGenerationJob | null };
      }

      const jobsRepo = new PostgresObservationGenerationJobRepository(client);
      const eventsLogRepo = new PostgresObservationGenerationJobEventsRepository(client);
      // Pre-generate the outbox id so we can build the BullMQ payload (which
      // references generation_job_id) and persist it on the row. Reconciliation
      // and operator retry rely on this persisted payload to re-enqueue a
      // payload that passes assertServerGenerationJobPayload at the worker.
      const outboxId = newId();
      const bullmqPayload = buildEventBullmqPayload({
        outboxId,
        event: inserted,
        apiKeyId: opts.apiKeyId ?? null,
        actorId: opts.actorId ?? null,
        sourceAdapter: opts.sourceAdapter ?? null,
        requestId: opts.requestId ?? null,
      });
      const outbox = await jobsRepo.create({
        id: outboxId,
        projectId: inserted.projectId,
        teamId: inserted.teamId,
        sourceType: 'agent_event',
        sourceId: inserted.id,
        agentEventId: inserted.id,
        serverSessionId: inserted.serverSessionId,
        jobType: EVENT_JOB_TYPE,
        bullmqJobId: buildServerJobId({
          kind: 'event',
          team_id: inserted.teamId,
          project_id: inserted.projectId,
          source_type: 'agent_event',
          source_id: inserted.id,
        }),
        payload: bullmqPayload as unknown as Record<string, unknown>,
      });
      await eventsLogRepo.append({
        generationJobId: outbox.id,
        projectId: outbox.projectId,
        teamId: outbox.teamId,
        eventType: 'queued',
        statusAfter: outbox.status,
        attempt: outbox.attempts,
        details: { source },
      });
      return { event: inserted, outbox };
    });

    let enqueueState: EnqueueOutcome = 'skipped';
    if (txResult.outbox) {
      enqueueState = await this.publishEventJob(txResult.event, txResult.outbox, opts);
    }
    return { event: txResult.event, outbox: txResult.outbox, enqueueState };
  }

  async ingestBatch(
    inputs: CreatePostgresAgentEventInput[],
    opts: IngestEventOptions = {},
  ): Promise<IngestEventResult[]> {
    const generate = opts.generate ?? true;
    const source = opts.source ?? 'http_post_v1_events_batch';

    const txResults = await withPostgresTransaction(this.options.pool, async (client) => {
      const eventsRepo = new PostgresAgentEventsRepository(client);
      const jobsRepo = new PostgresObservationGenerationJobRepository(client);
      const eventsLogRepo = new PostgresObservationGenerationJobEventsRepository(client);
      const acc: { event: PostgresAgentEvent; outbox: PostgresObservationGenerationJob | null }[] = [];
      for (const input of inputs) {
        const event = await eventsRepo.create(input);
        if (!generate) {
          acc.push({ event, outbox: null });
          continue;
        }
        const outboxId = newId();
        const bullmqPayload = buildEventBullmqPayload({
          outboxId,
          event,
          apiKeyId: opts.apiKeyId ?? null,
          actorId: opts.actorId ?? null,
          sourceAdapter: opts.sourceAdapter ?? null,
          requestId: opts.requestId ?? null,
        });
        const outbox = await jobsRepo.create({
          id: outboxId,
          projectId: event.projectId,
          teamId: event.teamId,
          sourceType: 'agent_event',
          sourceId: event.id,
          agentEventId: event.id,
          serverSessionId: event.serverSessionId,
          jobType: EVENT_JOB_TYPE,
          bullmqJobId: buildServerJobId({
            kind: 'event',
            team_id: event.teamId,
            project_id: event.projectId,
            source_type: 'agent_event',
            source_id: event.id,
          }),
          payload: bullmqPayload as unknown as Record<string, unknown>,
        });
        await eventsLogRepo.append({
          generationJobId: outbox.id,
          projectId: outbox.projectId,
          teamId: outbox.teamId,
          eventType: 'queued',
          statusAfter: outbox.status,
          attempt: outbox.attempts,
          details: { source },
        });
        acc.push({ event, outbox });
      }
      return acc;
    });

    return Promise.all(txResults.map(async ({ event, outbox }) => {
      const enqueueState: EnqueueOutcome = outbox
        ? await this.publishEventJob(event, outbox, opts)
        : 'skipped';
      return { event, outbox, enqueueState };
    }));
  }

  private async publishEventJob(
    event: PostgresAgentEvent,
    outbox: PostgresObservationGenerationJob,
    opts: IngestEventOptions = {},
  ): Promise<'enqueued' | 'queued_only'> {
    const queue = this.options.resolveEventQueue();
    if (!queue) {
      return 'queued_only';
    }
    const policyOptions: { policy?: ServerSessionGenerationPolicy; debounceWindowMs?: number } = {};
    if (this.options.sessionPolicy !== undefined) {
      policyOptions.policy = this.options.sessionPolicy;
    }
    if (this.options.sessionDebounceWindowMs !== undefined) {
      policyOptions.debounceWindowMs = this.options.sessionDebounceWindowMs;
    }
    const decision = buildEnqueueEventDecision(
      {
        event,
        outbox,
        apiKeyId: opts.apiKeyId ?? null,
        actorId: opts.actorId ?? null,
        sourceAdapter: opts.sourceAdapter ?? event.sourceAdapter ?? null,
        // Phase 12 — flow request_id into the BullMQ payload so the worker
        // can emit it in [generation] logs and the audit row.
        requestId: opts.requestId ?? null,
      },
      policyOptions,
    );
    if (!decision.shouldEnqueue) {
      return 'queued_only';
    }
    try {
      await scheduleDebouncedEventJob(queue as never, decision);
      return 'enqueued';
    } catch (error) {
      logger.warn('SYSTEM', 'failed to publish event generation job to BullMQ', {
        outboxId: outbox.id,
        error: error instanceof Error ? error.message : String(error),
      });
      return 'queued_only';
    }
  }
}
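Both services persist the BullMQ payload on the outbox row so that, in the words of the comments above, reconciliation and operator retry can re-enqueue it later. The reconciliation code itself is not part of this diff, so the following is only a sketch of what such a pass might look like: `reconcileQueuedSketch`, `PersistedJobSketch`, and every status value other than `queued` are hypothetical.

```typescript
// Hypothetical reconciliation pass over outbox rows that were never published.
interface QueueLike {
  add(jobId: string, payload: unknown): Promise<unknown>;
}

interface PersistedJobSketch {
  id: string;
  bullmqJobId: string;
  // Only 'queued' is taken from the diff; the other values are assumptions.
  status: 'queued' | 'processing' | 'completed' | 'failed';
  payload: Record<string, unknown>;
}

interface ReconcileResult {
  republished: string[];
  stillQueued: string[];
}

async function reconcileQueuedSketch(
  rows: PersistedJobSketch[],
  queue: QueueLike,
): Promise<ReconcileResult> {
  const republished: string[] = [];
  const stillQueued: string[] = [];
  for (const row of rows) {
    // Rows past the queued state are owned by a worker or already finished.
    if (row.status !== 'queued') continue;
    try {
      // Reuse the persisted job id and payload instead of rebuilding them,
      // so the republished job matches what was validated at write time.
      await queue.add(row.bullmqJobId, row.payload);
      republished.push(row.id);
    } catch {
      // Leave the row queued; a later pass can try again.
      stillQueued.push(row.id);
    }
  }
  return { republished, stillQueued };
}
```

Reusing the stored job id matters if the queue deduplicates on job id, which the deterministic ids in this PR appear designed for: a row that was in fact published before a crash would then not be processed twice.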