Server-beta: Postgres storage + independent runtime + BullMQ queue (Phases 1–3) (#2351)

* Add server beta runtime foundation

* Address server beta review findings

* Resolve server beta review comments

* Tighten server beta review follow-ups

* Harden server beta auth and search

* Avoid unnecessary FTS rebuilds

* Block scoped keys from creating projects

* Release BullMQ claims best effort on close

* Address server beta review blockers

* Reset BullMQ claims best effort

* Add Postgres observation storage foundation

* feat(server-beta): add independent runtime service

Introduce src/server/runtime/ as a self-contained server-beta runtime
that owns its lifecycle, Postgres bootstrap, and HTTP boundary without
depending on WorkerService.

ServerBetaService wraps the existing Server class, exposes
/healthz and /v1/info with runtime="server-beta", and persists state
to dedicated paths (.server-beta.pid|.port|.runtime.json). The four
boundary managers (queue, generation worker, provider registry, event
broadcaster) are intentionally disabled in this phase and report their
status through /v1/info; later phases activate them.

Adds plans/2026-05-07-finish-bullmq-branch-ship-plan.md to track the
remaining work for this branch.

Phase 2 of plans/2026-05-07-server-beta-independent-bullmq-observation-runtime.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server-beta): route CLI lifecycle and bundle separate runtime

scripts/build-hooks.js now produces plugin/scripts/server-beta-service.cjs
as a separate Node CJS bundle, alongside the existing worker-service
bundle. The server-beta runtime is now installable independently.

src/npx-cli/commands/server.ts routes start|stop|restart|status to the
server-beta lifecycle instead of the legacy worker. The worker keeps its
own start|stop|restart|status under the worker namespace; the two
runtimes can be operated independently.

src/services/worker-service.ts adds a server-* command parser branch
that delegates to the sibling server-beta-service.cjs bundle so
direct worker-service invocations still route to the right runtime.

tests/npx-cli-server-namespace.test.ts updated to expect server-beta
lifecycle routing.

Includes rebuilt plugin/scripts/*.cjs bundles produced by
build-and-sync.

Phase 2 of plans/2026-05-07-server-beta-independent-bullmq-observation-runtime.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server-beta): add BullMQ job queue primitives

Introduce src/server/jobs/ as the queue-side primitives that Phase 3 of
the server-beta runtime needs to operate.

types.ts defines a discriminated union over the four job kinds (event,
event-batch, summary, reindex) and maps each to a per-kind BullMQ queue
name and deterministic-ID prefix.
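
A minimal sketch of what such a discriminated union and per-kind mapping might look like. The base fields follow the plan file in this commit; `agent_event_ids`, `observation_id`, the queue names, and the ID prefixes are hypothetical placeholders, not the values in types.ts.

```typescript
// Hypothetical sketch: queue names, prefixes, and some field names are placeholders.
type JobKind = "event" | "event-batch" | "summary" | "reindex";

interface BaseJob {
  team_id: string;
  project_id: string;
  source_type: string;
  source_id: string;
  generation_job_id: string;
}

// The `kind` literal is the discriminant that lets TypeScript narrow each variant.
type ServerJob =
  | (BaseJob & { kind: "event"; agent_event_id: string })
  | (BaseJob & { kind: "event-batch"; agent_event_ids: string[] })
  | (BaseJob & { kind: "summary"; server_session_id: string })
  | (BaseJob & { kind: "reindex"; observation_id: string });

// One lane per kind: a queue name plus the prefix used for deterministic job IDs.
const LANES: Record<JobKind, { queueName: string; idPrefix: string }> = {
  "event": { queueName: "server-beta-event", idPrefix: "evt" },
  "event-batch": { queueName: "server-beta-event-batch", idPrefix: "evb" },
  "summary": { queueName: "server-beta-summary", idPrefix: "sum" },
  "reindex": { queueName: "server-beta-reindex", idPrefix: "rdx" },
};

// Narrowing on `job.kind` gives typed access to the kind-specific field.
function describeJob(job: ServerJob): string {
  const queue = LANES[job.kind].queueName;
  switch (job.kind) {
    case "event":
      return `${queue} <- event ${job.agent_event_id}`;
    case "event-batch":
      return `${queue} <- ${job.agent_event_ids.length} events`;
    case "summary":
      return `${queue} <- session ${job.server_session_id}`;
    case "reindex":
      return `${queue} <- observation ${job.observation_id}`;
  }
}
```

Typing the map as `Record<JobKind, ...>` means a newly added kind fails to compile until it is given a lane.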

job-id.ts builds deterministic, colon-free BullMQ jobIds from
(kind, team, project, source). The colon ban exists because BullMQ uses
':' as a Redis key separator internally; embedding ':' in jobIds
breaks scan and state lookups.
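
A minimal sketch of one way to get deterministic, colon-free IDs, assuming a SHA-256 digest over the (kind, team, project, source) tuple. `buildJobId`, the `JobIdParts` shape, and the prefixes are hypothetical; the real job-id.ts may encode its inputs differently.

```typescript
import { createHash } from "node:crypto";

type JobKind = "event" | "event-batch" | "summary" | "reindex";

// Hypothetical prefixes; none of them contains ':'.
const ID_PREFIX: Record<JobKind, string> = {
  "event": "evt",
  "event-batch": "evb",
  "summary": "sum",
  "reindex": "rdx",
};

interface JobIdParts {
  kind: JobKind;
  teamId: string;
  projectId: string;
  source: string;
}

function buildJobId(parts: JobIdParts): string {
  // JSON-encode the tuple so that ("a", "bc") and ("ab", "c") produce different input.
  const payload = JSON.stringify([parts.kind, parts.teamId, parts.projectId, parts.source]);
  // A hex digest only contains [0-9a-f], so colons in the inputs never reach the ID.
  const digest = createHash("sha256").update(payload).digest("hex");
  return `${ID_PREFIX[parts.kind]}_${digest}`;
}
```

Hashing rather than escaping keeps the ID at a fixed length no matter how long the tenant or source identifiers are.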

ServerJobQueue.ts is a thin wrapper over BullMQ Queue + Worker that
enforces autorun:false, default concurrency 1, and an attached error
listener — all per BullMQ docs requirements. Test seams accept queue
and worker factories so unit tests do not need Redis.
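
A minimal sketch of the factory seam, assuming a wrapper that receives a worker factory instead of constructing a BullMQ Worker itself. `JobLane`, `WorkerLike`, and `WorkerFactory` are hypothetical names; only the `autorun`, `concurrency`, `run()`, `close()`, and `error` event concepts come from BullMQ.

```typescript
// Hypothetical minimal surface of a BullMQ Worker that the wrapper relies on.
interface WorkerLike {
  on(event: "error", listener: (err: Error) => void): void;
  run(): Promise<void>;
  close(): Promise<void>;
}

interface LaneWorkerOptions {
  autorun: boolean;
  concurrency: number;
}

type Processor = (data: unknown) => Promise<void>;
type WorkerFactory = (queueName: string, processor: Processor, opts: LaneWorkerOptions) => WorkerLike;

class JobLane {
  readonly errors: Error[] = [];
  private worker: WorkerLike | null = null;
  private readonly queueName: string;
  private readonly makeWorker: WorkerFactory;
  private readonly concurrency: number;

  constructor(queueName: string, makeWorker: WorkerFactory, concurrency = 1) {
    this.queueName = queueName;
    this.makeWorker = makeWorker;
    this.concurrency = concurrency;
  }

  start(processor: Processor): void {
    // Refuse a second start so one lane never owns two workers.
    if (this.worker) throw new Error(`lane ${this.queueName} is already started`);
    // autorun:false means the worker does not process anything until run() is called.
    const worker = this.makeWorker(this.queueName, processor, {
      autorun: false,
      concurrency: this.concurrency,
    });
    // Attach the error listener before running, so errors are recorded rather than unhandled.
    worker.on("error", (err) => {
      this.errors.push(err);
    });
    this.worker = worker;
    worker.run().catch((err: unknown) => {
      this.errors.push(err instanceof Error ? err : new Error(String(err)));
    });
  }

  async close(): Promise<void> {
    const worker = this.worker;
    this.worker = null;
    if (worker) await worker.close();
  }
}
```

In production the factory would construct a real BullMQ Worker; a unit test can pass a fake that records the options it was given.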

outbox.ts publishes through the Postgres ObservationGenerationJob
repository as canonical history. enqueueOutbox writes the row first,
then publishes to BullMQ; if BullMQ throws, the row is transitioned to
failed and a failed event is appended. reconcileOnStartup re-enqueues
queued + processing rows after a restart, replacing terminal BullMQ
jobs that may still be holding the deterministic ID slot. markCompleted
and markFailed wrap transitionStatus and append the matching event row.
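
A minimal sketch of the row-first, publish-second ordering with rollback on publish failure. The `OutboxRepo` interface, the `InMemoryOutbox` class, and the `Publish` callback are hypothetical stand-ins for the Postgres repository and the BullMQ publish call.

```typescript
type OutboxStatus = "queued" | "processing" | "completed" | "failed" | "cancelled";

interface OutboxRow {
  id: string;
  status: OutboxStatus;
  lastError: string | null;
}

interface OutboxEvent {
  jobId: string;
  type: OutboxStatus;
}

// Hypothetical repository surface; the real repository is Postgres-backed.
interface OutboxRepo {
  insertQueued(id: string): Promise<void>;
  transitionStatus(id: string, status: OutboxStatus, error?: string): Promise<void>;
  appendEvent(event: OutboxEvent): Promise<void>;
}

type Publish = (jobId: string) => Promise<void>;

async function enqueueOutbox(repo: OutboxRepo, publish: Publish, jobId: string): Promise<OutboxStatus> {
  // 1. Write the durable row first, so the job exists even if the queue is unreachable.
  await repo.insertQueued(jobId);
  try {
    // 2. Only then hand the job to the queue.
    await publish(jobId);
    return "queued";
  } catch (err) {
    // 3. On publish failure, record the failure on the row and in the event history.
    const message = err instanceof Error ? err.message : String(err);
    await repo.transitionStatus(jobId, "failed", message);
    await repo.appendEvent({ jobId, type: "failed" });
    return "failed";
  }
}

// In-memory stand-in used only to exercise the sketch.
class InMemoryOutbox implements OutboxRepo {
  readonly rows = new Map<string, OutboxRow>();
  readonly events: OutboxEvent[] = [];

  async insertQueued(id: string): Promise<void> {
    // Inserting an existing ID is a no-op, which keeps enqueue idempotent.
    if (!this.rows.has(id)) this.rows.set(id, { id, status: "queued", lastError: null });
  }

  async transitionStatus(id: string, status: OutboxStatus, error?: string): Promise<void> {
    const row = this.rows.get(id);
    if (!row) throw new Error(`unknown outbox row ${id}`);
    row.status = status;
    row.lastError = error ?? null;
  }

  async appendEvent(event: OutboxEvent): Promise<void> {
    this.events.push(event);
  }
}
```

Because the row is written before the publish, a crash between the two steps leaves a queued row that startup reconciliation can pick up.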

Includes 20 unit tests covering deterministic ID stability, colon-free
output, queue lifecycle, error-listener attachment, double-start
refusal, idempotent enqueue, BullMQ failure rollback, startup
reconciliation, max-attempts skipping, and completion / failure /
retry transitions.

Phase 3 commit 1 of plans/2026-05-07-server-beta-independent-bullmq-observation-runtime.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server-beta): activate queue boundary in runtime service

Wire ActiveServerBetaQueueManager into the server-beta runtime graph.
The active manager owns one ServerJobQueue per generation kind (event,
event-batch, summary, reindex) and surfaces lane metadata through
boundary health.

Selection is opt-in and fail-fast: if CLAUDE_MEM_QUEUE_ENGINE is set to
bullmq the active manager is constructed (and any Redis/config error
throws — no silent fallback to SQLite, per Phase 3 anti-pattern guard).
For any other engine the disabled boundary remains so worker-era and
test setups stay compatible.
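
A minimal sketch of opt-in, fail-fast selection. `selectQueueManager`, the `Connect` callback, and the missing-`REDIS_URL` check are hypothetical; the sketch only shows that bullmq mode propagates errors while every other engine gets the disabled boundary.

```typescript
interface BoundaryHealth {
  status: "disabled" | "active" | "errored";
  details?: Record<string, unknown>;
}

interface QueueManager {
  getHealth(): BoundaryHealth;
}

type Env = Record<string, string | undefined>;

// Hypothetical: builds the active manager and throws on any Redis or config problem.
type Connect = (redisUrl: string) => QueueManager;

const disabledQueueManager: QueueManager = {
  getHealth: () => ({ status: "disabled" }),
};

function selectQueueManager(env: Env, connect: Connect): QueueManager {
  // Any engine other than bullmq keeps the disabled boundary.
  if (env["CLAUDE_MEM_QUEUE_ENGINE"] !== "bullmq") return disabledQueueManager;

  const redisUrl = env["REDIS_URL"];
  if (!redisUrl) {
    throw new Error("CLAUDE_MEM_QUEUE_ENGINE=bullmq requires REDIS_URL");
  }
  // No try/catch here on purpose: a connection or config error must stop startup
  // instead of quietly falling back to another engine.
  return connect(redisUrl);
}
```

Passing the environment as an argument rather than reading a global keeps the selection logic testable without touching real configuration.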

Widens ServerBetaBoundaryHealth.status to a discriminated union
('disabled' | 'active' | 'errored') with optional details. The disabled
adapter still emits status='disabled', which keeps the existing
server-beta-service test green.

ServerBetaService receives the manager through a new optional
queueManager field on CreateServerBetaServiceOptions so test graphs
and Phase 4 wiring can inject custom managers.

Adds tests/server/runtime/active-queue-manager.test.ts covering bullmq
guard, active health shape, per-kind queue access, close behavior, and
post-close errored health.

Phase 3 commit 2 of plans/2026-05-07-server-beta-independent-bullmq-observation-runtime.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server-beta): cap /v1/events/batch at 500 events

Prevents unbounded array DoS surface flagged in PR review.
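
A minimal sketch of such a cap, assuming the request body has the shape `{ events: [...] }`. `checkEventBatch`, the body shape, and the 400/413 status codes are hypothetical; only the limit of 500 comes from this commit.

```typescript
const MAX_BATCH_EVENTS = 500;

type BatchCheck =
  | { ok: true; events: unknown[] }
  | { ok: false; status: 400 | 413; error: string };

function checkEventBatch(body: unknown): BatchCheck {
  // Optional chaining keeps null, undefined, and primitive bodies from throwing.
  const events = (body as { events?: unknown } | null | undefined)?.events;
  if (!Array.isArray(events)) {
    return { ok: false, status: 400, error: "events must be an array" };
  }
  // Reject oversized batches before any per-event work is done.
  if (events.length > MAX_BATCH_EVENTS) {
    return {
      ok: false,
      status: 413,
      error: `batch has ${events.length} events; the limit is ${MAX_BATCH_EVENTS}`,
    };
  }
  return { ok: true, events };
}
```

Checking the length before validating individual events keeps the cost of rejecting an oversized request constant.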

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Alex Newman
2026-05-08 01:20:07 -07:00
committed by GitHub
parent 0a43ab7632
commit 36b0929fae
183 changed files with 35709 additions and 2033 deletions
@@ -0,0 +1,331 @@
# Finish BullMQ Observation Queue Branch — Ship Plan
Date: 2026-05-07
Branch: `bullmq-vs-bee-queue-for-claude-mem-observation-que`
Base: `origin/main` @ `0a43ab76`
Parent plan: `plans/2026-05-07-server-beta-independent-bullmq-observation-runtime.md`
## Reframe
The prior session believed Phase 1 was ungated because two reviewer agents failed (one returned not_found, "Carver" was user-aborted at 111.9s). That belief was based on a stale snapshot that predated commit `4e0fc77a Add Postgres observation storage foundation`. **Phase 1 is committed.** `git status` shows zero uncommitted changes under `src/storage/postgres/`.
What is actually dirty in the worktree is **Phase 2: Define Server Runtime Boundary**. The dirty files map 1:1 to that phase's "What To Implement" section. The remaining work to "finish this branch" is: confirm Phase 1 with concrete checks (not another reviewer agent), land Phase 2, push.
Phases 3–13 (BullMQ queue, event-to-job pipeline, provider extraction, hook routing, MCP, compat, Docker, team auth, observability, final verification) are explicitly **out of scope** for this branch. The PR is already 167 files / 23.5K insertions. Continuing past Phase 2 here would make review impossible.
## Phase 0: Documentation Discovery
### Sources Read
- `plans/2026-05-07-server-beta-independent-bullmq-observation-runtime.md` (parent plan, 987 lines, all 14 sections from Phase 0 through Phase 13)
- `PR_REORIENTATION_REPORT.md` (660 lines) — independent inventory of committed + dirty surfaces
- `git status`, `git log --oneline -15`, `git diff --stat HEAD`
- Worktree: `src/server/runtime/{ServerBetaService.ts,create-server-beta-service.ts,types.ts}`
- Worktree: `src/storage/postgres/` — already in commit `4e0fc77a`
### Concrete Findings
- Phase 1 (Postgres storage foundation) is committed in `4e0fc77a`. Includes scoped `addSource`, `transitionStatus`, generation-job event `append`, FTS via generated `content_search` tsvector + GIN index, tenant-scoped uniqueness constraints, and 20 integration tests including the negative-scope mutation test.
- Phase 2 (server runtime boundary) is implemented but uncommitted. Files match the parent plan's Phase 2 deliverables exactly: independent `ServerBetaService`, `create-server-beta-service`, disabled boundary types, `.server-beta.{pid,port,runtime.json}` paths, runtime labels in `/api/health` and `/v1/info`, server-beta CLI lifecycle, build-hooks split into a separate `server-beta-service.cjs` bundle, ephemeral-port test for `/api/health` and `/v1/info`.
- Two doc artifacts (`AGENTS.md`, `PR_REORIENTATION_REPORT.md`) are also untracked. Decide before push.
### Anti-Pattern Guards (carried from parent plan)
- Do not spawn a third reviewer agent to "gate" Phase 1. The integration test suite plus the plan's grep checklist is the gate. Reviewer agents are a second opinion, not the primary gate.
- Do not pull Phase 3+ work into this branch.
- Do not amend `4e0fc77a` to "tidy" Phase 1; create new commits.
- Do not couple Phase 2 to `WorkerService` (the entire point of Phase 2 is independence).
## Phase A: Re-Confirm Phase 1 Gate (Deterministic, No Reviewer Agent)
### What To Run
1. `tsc --noEmit` scoped to Postgres storage:
```bash
bunx tsc --noEmit src/storage/postgres/*.ts
```
2. Postgres integration suite (requires `DATABASE_URL` or local Postgres on default port):
```bash
bun test tests/storage/postgres
```
3. Anti-pattern greps (must all return zero matches in `src/storage/postgres/`):
```bash
rg -n "UNIQUE\s*\(\s*source_type\s*,\s*source_id\s*,\s*job_type\s*\)" src/storage/postgres
rg -n "UNIQUE\s*\(\s*observation_id\s*,\s*source_type\s*,\s*source_id\s*\)" src/storage/postgres
```
4. Scoped-mutation grep (must show `projectId`/`teamId` parameters):
```bash
rg -n "addSource|transitionStatus|append" src/storage/postgres
```
### Verification Checklist
- TypeScript clean.
- All 20 Postgres integration tests pass, including the negative-scope mutation test.
- Both anti-pattern greps return empty.
- Scoped-mutation grep shows `projectId`/`teamId` in every signature.
### Anti-Pattern Guards
- Do not edit `src/storage/postgres/*.ts` in this phase. If Phase A fails, open a separate fix-up commit; do not amend `4e0fc77a`.
## Phase B: Land Phase 2 (Server Runtime Boundary)
### What To Run
1. Phase 2 independence grep — Server beta runtime must not import worker:
```bash
rg -n "WorkerService|services/worker-service|worker/http" \
src/server/runtime src/npx-cli/commands/server.ts
```
Allowed: matches inside `src/services/worker-service.ts` itself (delegation back to server-beta is fine). Forbidden: any import inside `src/server/runtime/`.
2. Server-beta service test:
```bash
bun test tests/server/server-beta-service.test.ts
```
3. CLI namespace test:
```bash
bun test tests/npx-cli-server-namespace.test.ts
```
4. Build verifies `server-beta-service.cjs` bundle is produced:
```bash
npm run build-and-sync
ls -la plugin/scripts/server-beta-service.cjs
```
5. Smoke test independence:
```bash
npx claude-mem server status # before start
npx claude-mem server start
npx claude-mem server status # running, runtime=server-beta
curl -s http://127.0.0.1:$(cat ~/.claude-mem/.server-beta.port)/healthz
curl -s http://127.0.0.1:$(cat ~/.claude-mem/.server-beta.port)/v1/info
npx claude-mem server stop
```
Worker `start|stop|status` must remain functional throughout.
### Commit Layout
Two commits, in order:
1. **`feat(server-beta): add independent runtime service`**
- `src/server/runtime/ServerBetaService.ts`
- `src/server/runtime/create-server-beta-service.ts`
- `src/server/runtime/types.ts`
- `src/server/routes/v1/ServerV1Routes.ts` (runtime label)
- `src/services/server/Server.ts` (runtime option)
- `src/shared/paths.ts` (`.server-beta.{pid,port,runtime.json}`)
- `tests/server/server-beta-service.test.ts`
2. **`feat(server-beta): route CLI lifecycle and build a separate bundle`**
- `scripts/build-hooks.js` (server-beta bundle output)
- `src/npx-cli/commands/runtime.ts` (server-beta lifecycle commands)
- `src/npx-cli/commands/server.ts` (CLI routing)
- `src/services/worker-service.ts` (delegate `server-start|stop|restart|status` to sibling bundle)
- `tests/npx-cli-server-namespace.test.ts`
### Documentation References
- Parent plan, lines 469–514: Phase 2 deliverables and verification checklist.
- `src/services/server/Server.ts`: existing route-composition style to copy.
- `src/services/infrastructure/ProcessManager.ts`: PID-file safety patterns.
### Verification Checklist
- All five Phase B steps pass.
- Worker lifecycle still works while server-beta is running, and vice versa.
- Two commits land cleanly with no `--amend` or force operations.
### Anti-Pattern Guards
- Do not import `WorkerService` from `src/server/runtime/`.
- Do not overload worker PID/port files.
- Do not boot worker as a background dependency of server-beta.
- Do not silently fall back from server-beta to worker.
## Phase C: Decide Doc Artifacts
### What To Decide
| File | Recommendation | Rationale |
|------|---------------|-----------|
| `PR_REORIENTATION_REPORT.md` | Use as PR body, then delete (or move to `docs/internal/`). | It's a snapshot, not durable docs. Useful for the PR reviewer; rots in-tree. |
| `AGENTS.md` | Read first, then either commit (if generally useful guidance) or move under `.scratch/`. | Decision depends on content. |
### Verification
- Final `git status` shows only intended doc artifacts (or none).
- `.scratch/` is gitignored if used.
### Anti-Pattern Guard
- Do not push `PR_REORIENTATION_REPORT.md` to main as a doc; it carries a date and a HEAD SHA, so it ages immediately.
## Phase D: Push and Open/Update PR
### What To Run
1. `git push -u origin bullmq-vs-bee-queue-for-claude-mem-observation-que`
2. `gh pr view --web` (if PR exists) or `gh pr create` with body sourced from `PR_REORIENTATION_REPORT.md`.
3. PR body must explicitly carve scope: "Includes Phase 1 + Phase 2 from `plans/2026-05-07-server-beta-independent-bullmq-observation-runtime.md`. Phases 3–13 are follow-ups on separate branches."
### Verification Checklist
- PR title is short (under 70 chars) and reflects scope: e.g., "Add Postgres storage + independent server-beta runtime (Phases 1–2)".
- PR body lists out-of-scope phases.
- CI is green.
### Anti-Pattern Guards
- Do not force-push to main.
- Do not merge without CI green.
## Phase E: Branch Closeout
Once the PR merges, this branch is done. Phase 3 (BullMQ-First Server Queue) starts on a fresh branch off main. Do not reuse this branch for Phase 3 work — keep the queue/runtime split visible in history.
## Final Verification (cross-phase)
Run after Phases A–D:
```bash
git status # clean or only intended doc artifacts
git log --oneline origin/main..HEAD # 4e0fc77a + Phase 2 commits, no force-push markers
bun test tests/storage/postgres tests/server tests/npx-cli-server-namespace.test.ts
rg -n "WorkerService|services/worker-service|worker/http" src/server/runtime
rg -n "PendingMessageStore|SessionQueueProcessor" src/server/runtime
```
Expected:
- All three test paths green.
- Both greps return zero matches.
- Branch ready to merge.
## Decisions Locked
1. Phase 1 gate: orchestrator-managed deterministic checks (no reviewer agent).
2. `AGENTS.md` + `PR_REORIENTATION_REPORT.md`: **discard** before commit.
3. Scope: this branch ships Phases 1 + 2 + **3** (BullMQ-First Server Queue). Phase E now covers the Phase 3 work, and the push moves to Phase F.
## Phase D (revised): Discard Untracked Doc Artifacts
```bash
rm AGENTS.md PR_REORIENTATION_REPORT.md
```
Verification: `git status` shows neither file.
## Phase E: Implement Phase 3 — BullMQ-First Server Queue
Source: parent plan lines 515–570.
### What To Implement
- `src/server/jobs/types.ts` — job-shape types:
- `ServerGenerationJob` (base)
- `GenerateObservationsForEventJob`
- `GenerateObservationsForEventBatchJob`
- `GenerateSessionSummaryJob`
- `ReindexObservationJob`
- Every job carries `team_id`, `project_id`, `source_type`, `source_id`, `generation_job_id`. Event jobs add `agent_event_id`. Summary jobs add `server_session_id`. Reindex jobs add target observation ID or deterministic reindex scope ID.
- `src/server/jobs/job-id.ts` — deterministic, colon-free job IDs (port the SHA-256-safe pattern from `src/server/queue/BullMqObservationQueueEngine.ts`).
- `src/server/jobs/ServerJobQueue.ts` — thin wrapper around BullMQ `Queue`, `Worker`, `QueueEvents`. Use `autorun: false`, explicit `concurrency: 1` default per lane, and an `error` listener on every `Worker`.
- `src/server/jobs/outbox.ts` — durable outbox over `ObservationGenerationJobRepository`. Statuses: `queued`, `processing`, `completed`, `failed`, `cancelled`. Tracks attempts, last error, timestamps, and tenant/project/session IDs.
- Startup reconciliation:
- Re-enqueue rows in `queued` or stale `processing`.
- Skip rows already `completed`.
- Replace terminal BullMQ jobs before reusing deterministic IDs.
- Wire queue health into `/v1/info`, `/api/health`, and `claude-mem server status` via the existing runtime label hook.
- Activate the queue boundary in `ServerBetaService` (Phase 2 left it disabled). Provide a real adapter when `CLAUDE_MEM_QUEUE_ENGINE=bullmq` and `REDIS_URL` are present; keep the disabled adapter as the fallback.
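
The startup reconciliation rules above can be sketched as follows. This is a minimal sketch under stated assumptions: `QueuePort`, `BullState`, and the `maxAttempts` parameter are hypothetical, every `processing` row found at startup is treated as stale, and jobs still waiting or active in the queue are left alone. Terminal jobs are removed first because BullMQ ignores an add whose job ID already exists.

```typescript
type RowStatus = "queued" | "processing" | "completed" | "failed" | "cancelled";

interface OutboxRow {
  jobId: string;
  status: RowStatus;
  attempts: number;
}

// Hypothetical view of a job's state in the queue; "missing" means no job holds the ID.
type BullState = "waiting" | "active" | "completed" | "failed" | "missing";

// Hypothetical port over the queue so the logic can run without Redis.
interface QueuePort {
  getState(jobId: string): Promise<BullState>;
  remove(jobId: string): Promise<void>;
  add(jobId: string): Promise<void>;
}

async function reconcileOnStartup(
  rows: OutboxRow[],
  queue: QueuePort,
  maxAttempts: number,
): Promise<string[]> {
  const requeued: string[] = [];
  for (const row of rows) {
    // Only unfinished rows are candidates; completed, failed, and cancelled rows are skipped.
    if (row.status !== "queued" && row.status !== "processing") continue;
    // Rows that have used up their attempts are not retried.
    if (row.attempts >= maxAttempts) continue;

    const state = await queue.getState(row.jobId);
    // Assumption: a job still waiting or active will be processed, so adding it again is unnecessary.
    if (state === "waiting" || state === "active") continue;
    // A terminal job still occupies the deterministic ID, so remove it before re-adding.
    if (state === "completed" || state === "failed") await queue.remove(row.jobId);

    await queue.add(row.jobId);
    requeued.push(row.jobId);
  }
  return requeued;
}
```

The function takes its rows from the Postgres outbox, not from Redis, which matches the guard that pending work must not live only in Redis.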
### Documentation References
- BullMQ Workers: https://docs.bullmq.io/guide/workers
- BullMQ Concurrency: https://docs.bullmq.io/guide/workers/concurrency
- BullMQ Stalled Jobs: https://docs.bullmq.io/guide/jobs/stalled
- `src/server/queue/BullMqObservationQueueEngine.ts` — copy deterministic job-ID + Redis health patterns; do **not** copy the worker-iterator compatibility shape.
- `src/server/queue/redis-config.ts` — Valkey/Redis health checks.
- `src/storage/postgres/generation-jobs.ts` — outbox repository (already committed in 4e0fc77a).
### Verification Checklist
Unit tests under `tests/server/jobs/`:
- `job-id.test.ts` — deterministic IDs, no colons, stable across runs, content-derived.
- `server-job-queue.test.ts` — Queue/Worker lifecycle, `error` listener attached, concurrency honored, autorun false.
- `outbox.test.ts` — duplicate enqueue suppression, terminal job replacement, status transitions, attempt counting.
Integration tests under `tests/server/queue-bootstrap/`:
- Start `ServerBetaService` with Postgres + Valkey + queue boundary enabled.
- Insert outbox rows directly through `ObservationGenerationJobRepository`.
- Enqueue fake jobs; restart before fake processing completes.
- Assert reconciliation re-enqueues exactly once and outbox status reaches `completed` exactly once.
- Assert Redis-down fails Server beta startup when `CLAUDE_MEM_QUEUE_ENGINE=bullmq`; no silent fallback to SQLite.
Greps:
```bash
rg -n "Bull(MQ|Mq).*\.add\(" src/server/jobs # uses BullMQ Queue.add
rg -n "autorun" src/server/jobs # workers explicitly set autorun
rg -n "on\(['\"]error" src/server/jobs # error listener attached
rg -n ":job:|:obs:" src/server/jobs # NO colons in deterministic IDs
```
The colon-grep must return zero matches.
### Anti-Pattern Guards
- Do not treat BullMQ completed/failed state as canonical history — Postgres outbox is canonical.
- Do not require event-route wiring or provider generation here (Phase 4 territory).
- Do not allow duplicate processor side effects on retry — keep observation writes idempotent by deterministic key.
- Do not use BullMQ Pro-only features (groups).
- Do not leave pending work only in Redis.
- Do not silently fall back from BullMQ to SQLite when `CLAUDE_MEM_QUEUE_ENGINE=bullmq` is set.
### Commit Layout
Two commits:
1. **`feat(server-beta): add BullMQ job queue primitives`**
- `src/server/jobs/types.ts`
- `src/server/jobs/job-id.ts`
- `src/server/jobs/ServerJobQueue.ts`
- `src/server/jobs/outbox.ts`
- `tests/server/jobs/*.test.ts`
2. **`feat(server-beta): activate queue boundary in runtime service`**
- `src/server/runtime/ServerBetaService.ts` (queue boundary wiring)
- `src/server/runtime/create-server-beta-service.ts` (boundary selection from env)
- `src/server/runtime/types.ts` (active queue manager interface)
- Health surface updates in `/v1/info` and `/api/health` if not already covered by Phase 2 runtime label.
- `tests/server/queue-bootstrap/*.test.ts`
## Phase F: Push and Open/Update PR
```bash
git push -u origin bullmq-vs-bee-queue-for-claude-mem-observation-que
gh pr view --web # if PR exists
# else:
gh pr create --title "Server-beta: Postgres storage + independent runtime + BullMQ queue (Phases 1–3)"
```
PR body must list:
- Scope: Phases 1, 2, 3 of `plans/2026-05-07-server-beta-independent-bullmq-observation-runtime.md`.
- Out of scope: Phases 4–13 (event-to-job pipeline, provider extraction, hook routing, MCP, compat, Docker, team auth, observability, final verification).
### Verification Checklist
- `git status` clean.
- `git log --oneline origin/main..HEAD` shows all expected commits, no force-push markers.
- CI green.
## Final Cross-Phase Verification
```bash
git status # clean
bun test tests/storage/postgres tests/server tests/npx-cli-server-namespace.test.ts
rg -n "WorkerService|services/worker-service|worker/http" src/server/runtime # zero
rg -n "PendingMessageStore|SessionQueueProcessor" src/server/runtime src/server/jobs # zero
```