Cynical deletion: close 27 issues by removing defenders + tolerators (#2141)
* fix: mirror migration 28 in SessionStore so pending_messages.tool_use_id and worker_pid columns are created (#2139)
SessionStore's inline migration list jumped from v27 to v29, skipping
rebuildPendingMessagesForSelfHealingClaim. The worker uses SessionStore
directly via worker/DatabaseManager.ts and bypasses the canonical
MigrationRunner, so fresh installs ended up at "max v29" with neither
column present — every queue claim and observation insert failed.
Adds addPendingMessagesToolUseIdAndWorkerPidColumns following the existing
mirror precedent (addObservationSubagentColumns / addObservationsUniqueContentHashIndex).
Uses ALTER TABLE + column-existence guards so already-broken DBs at v29
self-heal on next worker boot.
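The column-existence guard can be sketched as follows — a minimal sketch, not the actual SessionStore code; the helper name and the exact column types are assumptions, and the real migration would feed this from `PRAGMA table_info(pending_messages)`:

```typescript
// Idempotent mirror-migration guard: only emit ALTER TABLE for columns
// that are actually missing, so a broken v29 DB that skipped v28
// self-heals on next boot and an already-patched DB is a no-op.
interface ColumnInfo {
  name: string;
}

function missingColumnAlters(existingColumns: ColumnInfo[]): string[] {
  const have = new Set(existingColumns.map((c) => c.name));
  // Column types below are illustrative assumptions.
  const wanted: Array<[string, string]> = [
    ['tool_use_id', 'ALTER TABLE pending_messages ADD COLUMN tool_use_id TEXT'],
    ['worker_pid', 'ALTER TABLE pending_messages ADD COLUMN worker_pid INTEGER'],
  ];
  return wanted
    .filter(([name]) => !have.has(name))
    .map(([, sql]) => sql);
}
```

Running the returned statements is then safe on fresh DBs, patched DBs, and half-broken DBs alike.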
Verified on fresh DB and on a synthetic v29-without-v28 broken DB:
both columns and indexes (idx_pending_messages_worker_pid,
ux_pending_session_tool) appear after one boot.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: wrap v28 mirror dedup+index creation in transaction
Addresses Greptile P2 review on PR #2140: matches the existing pattern in
addObservationsUniqueContentHashIndex (v29 mirror at SessionStore.ts:1127)
and runner.ts rebuildPendingMessagesForSelfHealingClaim. A crash between
the dedup DELETE and the schema_versions INSERT no longer leaves the DB
in a half-applied state.
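The atomicity fix amounts to emitting the dedup, the index creation, and the version record as one transaction. A sketch of that shape — the statement strings and the `schema_versions` column name are illustrative assumptions, not the real migration SQL:

```typescript
// Wrap a multi-statement migration step in a single transaction so a
// crash between the dedup DELETE and the schema_versions INSERT cannot
// leave the DB half-applied.
function asTransaction(statements: string[]): string {
  const body = statements.map((s) => s.trim().replace(/;*$/, ';'));
  return ['BEGIN IMMEDIATE;', ...body, 'COMMIT;'].join('\n');
}

const migrationV28 = asTransaction([
  `DELETE FROM pending_messages WHERE rowid NOT IN
     (SELECT MIN(rowid) FROM pending_messages GROUP BY session_db_id, tool_use_id)`,
  `CREATE UNIQUE INDEX IF NOT EXISTS ux_pending_session_tool
     ON pending_messages(session_db_id, tool_use_id)`,
  `INSERT INTO schema_versions (version) VALUES (28)`,
]);
```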
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(plan): cynical-deletion plan for 29 open issues
9-phase plan applying delete-first lens to triaged issue corpus.
Headlines: kill defenders (orphan cleanup, EncodedCommand spawn,
restart-port-steal) and tolerators (silent JSON drops, drifted SSE
filters). Each phase closes a named subset of issues.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: delete process-management theater (Phase 1: DEL-1 + DEL-2)
Delete aggressiveStartupCleanup, the PowerShell -EncodedCommand
spawn branch, and the restart-with-port-steal sequence. Replace
daemon spawning with a single uniform child_process.spawn path
using arg-array form, keeping setsid on Unix when available.
The defenders (orphan cleanup, duplicate-worker probes, port
stealing) bred more bugs than they fixed. PID file with start-time
token already provides correct OS-trust ownership; restart now
requests httpShutdown, waits 5s for the port to free, then exits 1
if it didn't (user resolves). Net -247 lines.
Closes #2090, #2095 (already fixed at session-init.ts:78), #2107,
#2111, #2114, #2117, #2123, #2097, #2135.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: observer-sessions trust boundary via CLAUDE_MEM_INTERNAL env (Phase 2: DEL-9)
Replace the cwd === OBSERVER_SESSIONS_DIR discriminator (which every
consumer must repeat and inevitably drifts) with a single env-var
trust boundary set once at spawn time in buildIsolatedEnv.
- buildIsolatedEnv now sets CLAUDE_MEM_INTERNAL=1, covering all three
spawn sites (SDKAgent, KnowledgeAgent.prime, KnowledgeAgent.executeQuery)
- shouldTrackProject checks the env var first (cwd check stays as
belt-and-braces fallback)
- New shared shouldEmitProjectRow predicate — SSE broadcaster and
pagination filter share the same predicate so they can never drift
apart (#2118)
- ObservationBroadcaster filters observer rows from SSE stream
- PaginationHelper hardcoded 'observer-sessions' replaced with
OBSERVER_SESSIONS_PROJECT const
- project-filter basename match pass — *observer-sessions* now matches
basename, not just full path (globToRegex's [^/]* can't cross /)
(#2126 item 1)
- New `claude-mem cleanup [--dry-run]` subcommand wires CleanupV12_4_3
through to the worker for #2126 item 5
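The trust boundary plus shared predicate can be sketched like this — names mirror the commit, but the exact signatures are assumptions:

```typescript
const OBSERVER_SESSIONS_PROJECT = 'observer-sessions';

// Trust boundary: spawned observer agents carry CLAUDE_MEM_INTERNAL=1
// (set once in buildIsolatedEnv); the cwd comparison survives only as
// the belt-and-braces fallback.
function isInternalObserver(
  env: Record<string, string | undefined>,
  cwd: string,
  observerSessionsDir: string,
): boolean {
  if (env.CLAUDE_MEM_INTERNAL === '1') return true;
  return cwd === observerSessionsDir;
}

// Single predicate shared by the SSE broadcaster and the pagination
// filter, so the two can never drift apart (#2118).
function shouldEmitProjectRow(project: string): boolean {
  return project !== OBSERVER_SESSIONS_PROJECT;
}
```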
Closes #2118, #2126.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: strip proxy env vars before spawning worker (Phase 4: CON-1)
User's HTTP_PROXY/HTTPS_PROXY config was bleeding into internal AI
calls when claude-mem spawns the claude subprocess, causing
connection failures. Strip unconditionally — no passthrough knob,
which rejects #2099's whitelist proposal.
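The unconditional strip can be sketched as follows; the exact variable list is an assumption, since the commit doesn't enumerate which proxy vars are removed:

```typescript
// Remove proxy configuration before spawning the claude subprocess so a
// user's HTTP_PROXY/HTTPS_PROXY never reaches internal AI calls.
// Unconditional by design: no passthrough knob, no whitelist.
const PROXY_ENV_VARS = [
  'HTTP_PROXY', 'HTTPS_PROXY', 'NO_PROXY', 'ALL_PROXY',
  'http_proxy', 'https_proxy', 'no_proxy', 'all_proxy',
];

function stripProxyEnv(
  env: Record<string, string | undefined>,
): Record<string, string | undefined> {
  const cleaned = { ...env };
  for (const name of PROXY_ENV_VARS) delete cleaned[name];
  return cleaned;
}
```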
Closes #2115, #2099.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: fail-fast on silent drops in stdin/file-context/memory-save (Phase 5: FF-1)
Three independent fail-fast fixes:
#2089 — stdin-reader silent drop. Non-empty stdin that fails JSON.parse
now rejects with a clear error instead of resolving undefined. Empty
stdin still resolves undefined.
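The contract amounts to something like the following — a sketch, not the stdin-reader's actual code:

```typescript
// Non-empty stdin that fails JSON.parse rejects loudly; empty stdin
// still resolves undefined (no payload is a legitimate state).
function parseStdinPayload(raw: string): unknown {
  if (raw.trim() === '') return undefined;
  try {
    return JSON.parse(raw);
  } catch {
    throw new Error(
      `stdin was non-empty but not valid JSON (first 80 chars): ${raw.slice(0, 80)}`,
    );
  }
}
```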
#2094 — PreToolUse:Read truncation Edit deadlock. file-context handler
no longer returns a fake truncated Read result via updatedInput.
Removes userOffset/userLimit/truncated machinery; injects the timeline
via additionalContext only and lets the real Read pass through. Read
state and Claude's expectation now stay consistent, eliminating the
infinite Edit retry loop.
#2116 — /api/memory/save metadata drop + project bug. Schema accepts
metadata as a documented JSON column (migration 30 adds observations.
metadata TEXT, mirrored in SessionStore). Schema also tightened to
.strict() so unknown top-level fields fail fast instead of being
silently dropped. Project resolution now consults metadata.project as
a fallback before defaultProject.
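The precedence described — explicit project, then metadata.project, then defaultProject — can be sketched as (signatures are assumptions):

```typescript
// Project resolution for /api/memory/save: body.project wins, then the
// new metadata.project fallback, then the configured default.
// Whitespace-only values are treated as absent.
function resolveProject(
  bodyProject: string | undefined,
  metadata: { project?: string } | undefined,
  defaultProject: string,
): string {
  const explicit = bodyProject?.trim();
  if (explicit) return explicit;
  const fromMetadata = metadata?.project?.trim();
  if (fromMetadata) return fromMetadata;
  return defaultProject;
}
```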
Closes #2089, #2094, #2116.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: small deletions — Zod externalize / Gemini fallback / session timeout / installCLI alias (Phase 6)
DEL-4 (#2113): Externalize zod from mcp-server.cjs and context-generator.cjs
hook bundles so OpenCode's runtime resolves a single Zod copy. Worker
keeps Zod bundled (it's a daemon subprocess, not in OpenCode's hook
bundle). Added zod to plugin/package.json so externalized requires
resolve at runtime.
DEL-5 (#2087): Delete the never-wired GeminiAgent → Claude fallback.
fallbackAgent was always null in production. On 429 the agent now
throws cleanly (message stays pending for retry). Removed
setFallbackAgent, FallbackAgent interface, and the 429 fallback
branch from both GeminiAgent and OpenRouterAgent. Updated docs
that claimed automatic Claude fallback.
DEL-6 (#2127, #2098): Raise MAX_SESSION_WALL_CLOCK_MS from 4h to
24h. The timeout is a real guard against runaway-cost loops (per
issue #1590), but 4h kills legitimate long Claude Code days. 24h
preserves the guard while never hitting in normal use. No knob —
a session approaching this age is a bug worth investigating, not
a value worth tuning.
DEL-8 (#2054): Delete installCLI() alias function. Saves 4 keystrokes
at the cost of cross-platform shell-config mutation surface — not
worth it. Canonical entry is npx claude-mem (and bunx). Uninstall
now strips legacy alias/function lines from ~/.bashrc, ~/.zshrc,
and the PowerShell profile.
Closes #2087, #2098, #2113, #2127, #2054.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: de-hardcode worker port + multi-account commit (Phase 3: CON-2 + DEL-7)
Replace hardcoded 37777 fallbacks with SettingsDefaultsManager.get(
'CLAUDE_MEM_WORKER_PORT') in npx-cli (runtime/install/uninstall),
opencode-plugin, OpenClaw installer, SearchRoutes example URLs.
Timeline-report SKILL.md now resolves WORKER_PORT from settings.json
at the top and uses ${WORKER_PORT} in all curl invocations.
Remaining 37777 literals are doc comments + viewer build-time form-
field placeholder (which is replaced by /api/settings on mount).
hooks.json: add cygpath POSIX→Windows path translation between _R
resolution and node invocation. No-op on macOS/Linux. Closes the
Windows + Git Bash MODULE_NOT_FOUND in #2109.
CLAUDE.md gains a Multi-account section documenting CLAUDE_MEM_DATA_DIR
+ optional CLAUDE_MEM_WORKER_PORT — every existing path/port code
path now honors them.
Closes #2103, #2109, #2101.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: install/uninstall improvements (Phase 7: #2106)
5 fixes for the install/uninstall flow:
Item 1 — multiselect default. install.ts no longer pre-selects every
detected IDE; user explicitly opts in.
Item 3 — shutdown-before-overwrite. New
src/services/install/shutdown-helper.ts shared by install and
uninstall: POSTs /api/admin/shutdown then polls /api/health until
the worker stops responding. install calls it before
copyPluginToMarketplace so reinstall over a running worker doesn't
conflict; uninstall calls it before deletion.
Item 4 — uninstall path coverage. Removes ~/.npm/_npx/*/node_modules/
claude-mem, ~/.cache/claude-cli-nodejs/*/mcp-logs-plugin-claude-mem-*,
~/.claude/plugins/data/claude-mem-thedotmack/. Best-effort: per-path
try/catch so a single permission failure doesn't abort uninstall.
chroma-mcp shutdown is implicit via the worker's GracefulShutdown
cascade in item 3's helper.
Item 5 — install summary documents "Close all Claude Code sessions
before uninstalling, or ~/.claude-mem will be recreated by active
hooks."
Item 6 — real-port query. After install, fetches /api/health on the
configured port with 3s timeout. Reports actually-bound port if the
response carries it; falls back to requested port. No retry loop.
Closes #2106 (items 1, 3, 4, 5, 6). Items 2, 7 closed separately
as already-fixed and insufficient-detail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: pin chroma-mcp to 0.2.6 (Phase 8: DEL-3 lite)
Replace unpinned 'chroma-mcp' arg with chroma-mcp==0.2.6 in both
local and remote modes. Pinning makes installs deterministic across
machines and across time, eliminating the dependency-drift class
of bugs.
Verified 0.2.6 in a clean uv cache: starts cleanly, no httpcore/
httpx ImportError, no --with flags needed. The --with flags removed
in a0dd516c are not required at this pin (transitive deps resolve
correctly when the top-level version is fixed).
#2102's three protections (transport cleanup on failure, stale onclose
handler guard, 10s reconnect backoff) confirmed intact.
Closes #2046, #2085, #2102.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: update stale assertions for per-UID port + migration 30 (Phase 9)
SettingsDefaultsManager.CLAUDE_MEM_WORKER_PORT default is per-UID
(37700 + uid%100), not literal '37777'. Three assertions in
settings-defaults-manager.test.ts now compute the expected value
the same way the source does.
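The per-UID default the tests now mirror looks roughly like this — the Windows fallback value of 37777 is an assumption; the source only says the tests compute the default "the same way the source does":

```typescript
// Per-UID default worker port: 37700 + (uid % 100), so up to 100 local
// accounts get distinct defaults with no configuration.
function defaultWorkerPort(getuid?: () => number): number {
  // process.getuid is unavailable on Windows; assumed legacy fallback.
  if (typeof getuid !== 'function') return 37777;
  return 37700 + (getuid() % 100);
}
```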
migration-runner.test.ts: drop expect(versions).toContain(19)
(version 19 was a noop never recorded — pre-existing bug at parent),
add expect(versions).toContain(30) for the new observations.metadata
column added in Phase 5.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: address Greptile P1/P2 review comments on PR #2141
P1: spawnDaemon return value was unchecked in worker-service.ts restart
case, so a failed spawn silently exited 0 with a misleading "Worker
restart spawned" log. Now error and exit 1 when restartPid is undefined.
P2: shutdown-helper.ts health-poll catch treated AbortError (timeout)
the same as connection-refused, so a slow worker could be reported
confirmedStopped while still holding file locks. Now distinguish:
AbortError continues polling; other errors return confirmedStopped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* build: rebuild plugin artifacts after merging main
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: address CodeRabbit review comments on PR #2141
- hooks.json: quote $HOME in cache lookup so paths with spaces work
- timeline-report SKILL.md: fall back when process.getuid is unavailable (Windows)
- opencode-plugin: validate CLAUDE_MEM_WORKER_PORT before using
- uninstall.ts: only strip alias lines, not function declarations (multi-line bodies left intact)
- MemoryRoutes: trim whitespace-only project before precedence resolution
- SessionStore migration 21: preserve metadata column if observations already has it
- stdin-reader test: restore full property descriptor to avoid cross-test pollution
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -50,29 +50,52 @@ interface MarkerPayload {
  * the marker file ensures the work runs at most once per data directory.
  *
  * @param dataDirectory - Override for DATA_DIR (used in tests)
+ * @param options.dryRun - When true, scans + reports counts but performs NO
+ *   DB writes, NO backup, NO chroma wipe, and does NOT write the marker.
+ *   Used by `claude-mem cleanup --dry-run` to preview what would happen
+ *   without mutating user state. (#2126 item 5)
  */
-export function runOneTimeV12_4_3Cleanup(dataDirectory?: string): void {
+export function runOneTimeV12_4_3Cleanup(
+  dataDirectory?: string,
+  options: { dryRun?: boolean } = {},
+): CleanupCounts | undefined {
+  const dryRun = options.dryRun === true;
   const effectiveDataDir = dataDirectory ?? DATA_DIR;
   const markerPath = path.join(effectiveDataDir, MARKER_FILENAME);
 
-  if (existsSync(markerPath)) {
+  if (existsSync(markerPath) && !dryRun) {
     logger.debug('SYSTEM', 'v12.4.3 cleanup marker exists, skipping');
     return;
   }
 
-  if (process.env.CLAUDE_MEM_SKIP_CLEANUP_V12_4_3 === '1') {
+  if (process.env.CLAUDE_MEM_SKIP_CLEANUP_V12_4_3 === '1' && !dryRun) {
     logger.warn('SYSTEM', 'v12.4.3 cleanup skipped via CLAUDE_MEM_SKIP_CLEANUP_V12_4_3=1; marker not written');
     return;
   }
 
   const dbPath = path.join(effectiveDataDir, 'claude-mem.db');
   if (!existsSync(dbPath)) {
+    if (dryRun) {
+      logger.info('SYSTEM', 'v12.4.3 cleanup --dry-run: no DB present, nothing to scan', { dbPath });
+      return emptyCounts();
+    }
     mkdirSync(effectiveDataDir, { recursive: true });
     writeMarker(markerPath, { appliedAt: new Date().toISOString(), backupPath: null, chromaWiped: false, counts: emptyCounts(), skipped: 'no-db' });
     logger.debug('SYSTEM', 'No DB present, v12.4.3 cleanup marker written without work', { dbPath });
     return;
   }
 
+  if (dryRun) {
+    logger.info('SYSTEM', 'Running v12.4.3 cleanup --dry-run (read-only scan, no writes)', { dbPath });
+    try {
+      return scanCleanupCounts(dbPath);
+    } catch (err: unknown) {
+      const error = err instanceof Error ? err : new Error(String(err));
+      logger.error('SYSTEM', 'v12.4.3 cleanup --dry-run scan failed', {}, error);
+      return undefined;
+    }
+  }
+
   logger.warn('SYSTEM', 'Running one-time v12.4.3 pollution cleanup', { dbPath });
 
   try {
@@ -83,6 +106,43 @@ export function runOneTimeV12_4_3Cleanup(dataDirectory?: string): void {
   }
 }
 
+/**
+ * Read-only scan: count what runOneTimeV12_4_3Cleanup *would* delete.
+ * Mirrors the COUNT(*) queries from runObserverSessionsPurge and
+ * runStuckPendingPurge. Opens the DB read-only — never mutates.
+ */
+function scanCleanupCounts(dbPath: string): CleanupCounts {
+  const counts = emptyCounts();
+  const db = new Database(dbPath, { readonly: true });
+  try {
+    counts.observerSessions = (
+      db.prepare(`SELECT COUNT(*) AS n FROM sdk_sessions WHERE project = ?`).get(OBSERVER_SESSIONS_PROJECT) as { n: number }
+    ).n;
+    counts.observerCascadeRows =
+      (db.prepare(`SELECT COUNT(*) AS n FROM user_prompts WHERE content_session_id IN (SELECT content_session_id FROM sdk_sessions WHERE project = ?)`).get(OBSERVER_SESSIONS_PROJECT) as { n: number }).n
+      + (db.prepare(`SELECT COUNT(*) AS n FROM observations WHERE memory_session_id IN (SELECT memory_session_id FROM sdk_sessions WHERE project = ? AND memory_session_id IS NOT NULL)`).get(OBSERVER_SESSIONS_PROJECT) as { n: number }).n
+      + (db.prepare(`SELECT COUNT(*) AS n FROM session_summaries WHERE memory_session_id IN (SELECT memory_session_id FROM sdk_sessions WHERE project = ? AND memory_session_id IS NOT NULL)`).get(OBSERVER_SESSIONS_PROJECT) as { n: number }).n;
+    counts.stuckPendingMessages = (db.prepare(
+      `SELECT COUNT(*) AS n FROM pending_messages
+       WHERE status IN ('failed', 'processing')
+         AND session_db_id IN (
+           SELECT session_db_id FROM pending_messages
+           WHERE status IN ('failed', 'processing')
+           GROUP BY session_db_id
+           HAVING COUNT(*) >= ?
+         )`
+    ).get(STUCK_PENDING_THRESHOLD) as { n: number }).n;
+  } finally {
+    db.close();
+  }
+  logger.info('SYSTEM', 'v12.4.3 cleanup --dry-run scan complete', {
+    observerSessions: counts.observerSessions,
+    observerCascadeRows: counts.observerCascadeRows,
+    stuckPendingMessages: counts.stuckPendingMessages,
+  });
+  return counts;
+}
+
+function executeCleanup(dbPath: string, effectiveDataDir: string, markerPath: string): void {
   const dbSize = statSync(dbPath).size;
   const required = Math.ceil(dbSize * 1.2) + 100 * 1024 * 1024;
 
@@ -541,191 +541,6 @@ export async function cleanupOrphanedProcesses(): Promise<void> {
   logger.info('SYSTEM', 'Orphaned processes cleaned up', { count: pidsToKill.length });
 }
 
-// Patterns that should be killed immediately at startup (no age gate)
-// These are child processes that should not outlive their parent worker
-const AGGRESSIVE_CLEANUP_PATTERNS = ['worker-service.cjs', 'chroma-mcp'];
-
-// Patterns that keep the age-gated threshold (may be legitimately running)
-const AGE_GATED_CLEANUP_PATTERNS = ['mcp-server.cjs'];
-
-/**
- * Enumerate processes for aggressive startup cleanup. Aggressive patterns are
- * killed immediately; age-gated patterns only if older than ORPHAN_MAX_AGE_MINUTES.
- */
-async function enumerateAggressiveCleanupProcesses(
-  isWindows: boolean,
-  currentPid: number,
-  protectedPids: Set<number>,
-  allPatterns: string[]
-): Promise<number[]> {
-  const pidsToKill: number[] = [];
-
-  if (isWindows) {
-    // Use WQL -Filter for server-side filtering (no $_ pipeline syntax).
-    // Avoids Git Bash $_ interpretation (#1062) and PowerShell syntax errors (#1024).
-    const wqlPatternConditions = allPatterns
-      .map(p => `CommandLine LIKE '%${p}%'`)
-      .join(' OR ');
-
-    const cmd = `powershell -NoProfile -NonInteractive -Command "Get-CimInstance Win32_Process -Filter '(${wqlPatternConditions}) AND ProcessId != ${currentPid}' | Select-Object ProcessId, CommandLine, CreationDate | ConvertTo-Json"`;
-    const { stdout } = await execAsync(cmd, { timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND, windowsHide: true });
-
-    if (!stdout.trim() || stdout.trim() === 'null') {
-      logger.debug('SYSTEM', 'No orphaned claude-mem processes found (Windows)');
-      return [];
-    }
-
-    const processes = JSON.parse(stdout);
-    const processList = Array.isArray(processes) ? processes : [processes];
-    const now = Date.now();
-
-    for (const proc of processList) {
-      const pid = proc.ProcessId;
-      if (!Number.isInteger(pid) || pid <= 0 || protectedPids.has(pid)) continue;
-
-      const commandLine = proc.CommandLine || '';
-      const isAggressive = AGGRESSIVE_CLEANUP_PATTERNS.some(p => commandLine.includes(p));
-
-      if (isAggressive) {
-        // Kill immediately — no age check
-        pidsToKill.push(pid);
-        logger.debug('SYSTEM', 'Found orphaned process (aggressive)', { pid, commandLine: commandLine.substring(0, 80) });
-      } else {
-        // Age-gated: only kill if older than threshold
-        const creationMatch = proc.CreationDate?.match(/\/Date\((\d+)\)\//);
-        if (creationMatch) {
-          const creationTime = parseInt(creationMatch[1], 10);
-          const ageMinutes = (now - creationTime) / (1000 * 60);
-          if (ageMinutes >= ORPHAN_MAX_AGE_MINUTES) {
-            pidsToKill.push(pid);
-            logger.debug('SYSTEM', 'Found orphaned process (age-gated)', { pid, ageMinutes: Math.round(ageMinutes) });
-          }
-        }
-      }
-    }
-  } else {
-    // Unix: Use ps with elapsed time
-    const patternRegex = allPatterns.join('|');
-    const { stdout } = await execAsync(
-      `ps -eo pid,etime,command | grep -E "${patternRegex}" | grep -v grep || true`
-    );
-
-    if (!stdout.trim()) {
-      logger.debug('SYSTEM', 'No orphaned claude-mem processes found (Unix)');
-      return [];
-    }
-
-    const lines = stdout.trim().split('\n');
-    for (const line of lines) {
-      const match = line.trim().match(/^(\d+)\s+(\S+)\s+(.*)$/);
-      if (!match) continue;
-
-      const pid = parseInt(match[1], 10);
-      const etime = match[2];
-      const command = match[3];
-
-      if (!Number.isInteger(pid) || pid <= 0 || protectedPids.has(pid)) continue;
-
-      const isAggressive = AGGRESSIVE_CLEANUP_PATTERNS.some(p => command.includes(p));
-
-      if (isAggressive) {
-        // Kill immediately — no age check
-        pidsToKill.push(pid);
-        logger.debug('SYSTEM', 'Found orphaned process (aggressive)', { pid, command: command.substring(0, 80) });
-      } else {
-        // Age-gated: only kill if older than threshold
-        const ageMinutes = parseElapsedTime(etime);
-        if (ageMinutes >= ORPHAN_MAX_AGE_MINUTES) {
-          pidsToKill.push(pid);
-          logger.debug('SYSTEM', 'Found orphaned process (age-gated)', { pid, ageMinutes, command: command.substring(0, 80) });
-        }
-      }
-    }
-  }
-
-  return pidsToKill;
-}
-
-/**
- * Aggressive startup cleanup for orphaned claude-mem processes.
- *
- * Unlike cleanupOrphanedProcesses() which age-gates everything at 30 minutes,
- * this function kills worker-service.cjs and chroma-mcp processes immediately
- * (they should not outlive their parent worker). Only mcp-server.cjs keeps
- * the age threshold since it may be legitimately running.
- *
- * Called once at daemon startup.
- */
-export async function aggressiveStartupCleanup(): Promise<void> {
-  const isWindows = process.platform === 'win32';
-  const currentPid = process.pid;
-  const allPatterns = [...AGGRESSIVE_CLEANUP_PATTERNS, ...AGE_GATED_CLEANUP_PATTERNS];
-
-  // Protect parent process (the hook that spawned us) from being killed.
-  // Without this, a new daemon kills its own parent hook process (#1426).
-  //
-  // Note: readPidFile() is not used here because start() writes the new PID
-  // before initializeBackground() calls this function, so readPidFile() would
-  // just return process.pid (already protected). If a pre-existing worker needs
-  // protection, ensureWorkerStarted() handles that by returning early when a
-  // healthy worker is detected — we never reach this code in that case.
-  const protectedPids = new Set<number>([currentPid]);
-  if (process.ppid && process.ppid > 0) {
-    protectedPids.add(process.ppid);
-  }
-
-  let pidsToKill: number[];
-  try {
-    pidsToKill = await enumerateAggressiveCleanupProcesses(isWindows, currentPid, protectedPids, allPatterns);
-  } catch (error: unknown) {
-    if (error instanceof Error) {
-      logger.error('SYSTEM', 'Failed to enumerate orphaned processes during aggressive cleanup', {}, error);
-    } else {
-      logger.error('SYSTEM', 'Failed to enumerate orphaned processes during aggressive cleanup', {}, new Error(String(error)));
-    }
-    return;
-  }
-
-  if (pidsToKill.length === 0) {
-    return;
-  }
-
-  logger.info('SYSTEM', 'Aggressive startup cleanup: killing orphaned processes', {
-    platform: isWindows ? 'Windows' : 'Unix',
-    count: pidsToKill.length,
-    pids: pidsToKill
-  });
-
-  if (isWindows) {
-    for (const pid of pidsToKill) {
-      if (!Number.isInteger(pid) || pid <= 0) continue;
-      try {
-        execSync(`taskkill /PID ${pid} /T /F`, { timeout: HOOK_TIMEOUTS.POWERSHELL_COMMAND, stdio: 'ignore', windowsHide: true });
-      } catch (error: unknown) {
-        if (error instanceof Error) {
-          logger.debug('SYSTEM', 'Failed to kill process, may have already exited', { pid }, error);
-        } else {
-          logger.debug('SYSTEM', 'Failed to kill process, may have already exited', { pid }, new Error(String(error)));
-        }
-      }
-    }
-  } else {
-    for (const pid of pidsToKill) {
-      try {
-        process.kill(pid, 'SIGKILL');
-      } catch (error: unknown) {
-        if (error instanceof Error) {
-          logger.debug('SYSTEM', 'Process already exited', { pid }, error);
-        } else {
-          logger.debug('SYSTEM', 'Process already exited', { pid }, new Error(String(error)));
-        }
-      }
-    }
-  }
-
-  logger.info('SYSTEM', 'Aggressive startup cleanup complete', { count: pidsToKill.length });
-}
-
 const CHROMA_MIGRATION_MARKER_FILENAME = '.chroma-cleaned-v10.3';
 
 /**
@@ -929,14 +744,20 @@ function executeCwdRemap(dbPath: string, effectiveDataDir: string, markerPath: s
 }
 
 /**
- * Spawn a detached daemon process
- * Returns the child PID or undefined if spawn failed
+ * Spawn a detached daemon process.
  *
- * On Windows, uses PowerShell Start-Process with -WindowStyle Hidden to spawn
- * a truly independent process without console popups. Unlike WMIC, PowerShell
- * inherits environment variables from the parent process.
+ * Uses Node's child_process.spawn with the arg-array form on every platform.
+ * The arg-array form bypasses the shell entirely on Windows, so no quoting
+ * heuristics or PowerShell wrappers are needed (handles paths with spaces
+ * like `C:\Users\Alex Newman\...` natively).
  *
- * On Unix, uses standard detached spawn.
+ * On Unix, prefer setsid to detach from the controlling terminal so SIGHUP
+ * can't reach the daemon even if the in-process handler fails. The
+ * `detached: true` option already creates a new process group on POSIX;
+ * setsid is the belt-and-suspenders extra.
+ *
+ * Bun.spawn is intentionally NOT used here: it does not support detached
+ * spawning (see comment in process-registry.ts:633-639).
  *
  * PID file is written by the worker itself after listen() succeeds,
  * not by the spawner (race-free, works on all platforms).
@@ -946,7 +767,6 @@ export function spawnDaemon(
   port: number,
   extraEnv: Record<string, string> = {}
 ): number | undefined {
-  const isWindows = process.platform === 'win32';
   getSupervisor().assertCanSpawn('worker daemon');
 
   const env = sanitizeEnv({
@@ -957,9 +777,7 @@ export function spawnDaemon(
 
   // worker-service.cjs imports `bun:sqlite`, so the spawned runtime MUST be
   // Bun on every platform — never the current process.execPath, which may be
-  // Node when the caller is the MCP server. Resolve once before the OS branch
-  // split so we don't pay for a duplicate PATH lookup if Bun isn't found at a
-  // well-known path. See resolveWorkerRuntimePath() for the candidate list.
+  // Node when the caller is the MCP server.
   const runtimePath = resolveWorkerRuntimePath();
   if (!runtimePath) {
     logger.error(
@@ -969,65 +787,20 @@ export function spawnDaemon(
     return undefined;
   }
 
-  if (isWindows) {
-    // Use PowerShell Start-Process to spawn a hidden, independent process
-    // Unlike WMIC, PowerShell inherits environment variables from parent
-    // -WindowStyle Hidden prevents console popup
-
-    // Use -EncodedCommand to avoid all shell quoting issues with spaces in paths
-    const psScript = `Start-Process -FilePath '${runtimePath.replace(/'/g, "''")}' -ArgumentList @('${scriptPath.replace(/'/g, "''")}','--daemon') -WindowStyle Hidden`;
-    const encodedCommand = Buffer.from(psScript, 'utf16le').toString('base64');
-
-    try {
-      execSync(`powershell -NoProfile -EncodedCommand ${encodedCommand}`, {
-        stdio: 'ignore',
-        windowsHide: true,
-        env
-      });
-      // Windows success sentinel: PowerShell `Start-Process` does not return
-      // the spawned PID, and we don't want to pay for an extra `Get-Process`
-      // round-trip just to discover it. Return 0 (a conventionally invalid
-      // Unix PID) so callers can distinguish "spawn dispatched" from "spawn
-      // failed". Callers MUST use `pid === undefined` to detect failure —
-      // never falsy checks like `if (!pid)`, which would silently treat
-      // success as failure here.
-      return 0;
-    } catch (error: unknown) {
-      // APPROVED OVERRIDE: Windows daemon spawn is best-effort; log and let callers fall back to health checks/retry flow.
-      if (error instanceof Error) {
-        logger.error('SYSTEM', 'Failed to spawn worker daemon on Windows', { runtimePath }, error);
-      } else {
-        logger.error('SYSTEM', 'Failed to spawn worker daemon on Windows', { runtimePath }, new Error(String(error)));
-      }
-      return undefined;
-    }
-  }
-
-  // Unix: Use setsid to create a new session, fully detaching from the
-  // controlling terminal. This prevents SIGHUP from reaching the daemon
-  // even if the in-process SIGHUP handler somehow fails (belt-and-suspenders).
-  // Fall back to standard detached spawn if setsid is not available.
-  // `runtimePath` was resolved at the top of this function (see comment there).
+  // On Unix, prefer setsid to fully detach from the controlling terminal.
+  // On Windows or systems without setsid, spawn the runtime directly.
   const setsidPath = '/usr/bin/setsid';
-  if (existsSync(setsidPath)) {
-    const child = spawn(setsidPath, [runtimePath, scriptPath, '--daemon'], {
-      detached: true,
-      stdio: 'ignore',
-      env
-    });
+  const useSetsid = process.platform !== 'win32' && existsSync(setsidPath);
 
-    if (child.pid === undefined) {
-      return undefined;
-    }
+  const execPath = useSetsid ? setsidPath : runtimePath;
+  const args = useSetsid
+    ? [runtimePath, scriptPath, '--daemon']
+    : [scriptPath, '--daemon'];
 
-    child.unref();
-    return child.pid;
-  }
-
-  // Fallback: standard detached spawn (macOS, systems without setsid)
-  const child = spawn(runtimePath, [scriptPath, '--daemon'], {
+  const child = spawn(execPath, args, {
     detached: true,
     stdio: 'ignore',
+    windowsHide: true,
     env
   });
 
@@ -1036,7 +809,6 @@ export function spawnDaemon(
   }
 
   child.unref();
-
   return child.pid;
 }
 
@@ -0,0 +1,58 @@
/**
 * Shared worker-shutdown helper used by both `install` (to clear out a
 * running worker before overwriting plugin files) and `uninstall` (to
 * release file locks before deletion).
 *
 * Posts to `/api/admin/shutdown`, then polls `/api/health` until the
 * connection is refused (= worker is gone) or the timeout elapses.
 *
 * Best-effort: if the worker is not running, the POST throws and we
 * return immediately. Callers should never depend on this throwing.
 */

export interface ShutdownResult {
  /** True if we actively shut down a worker; false if none was running. */
  workerWasRunning: boolean;
  /** True if we observed the worker stop responding before the timeout. */
  confirmedStopped: boolean;
}

export async function shutdownWorkerAndWait(
  port: number | string,
  timeoutMs: number = 10000,
): Promise<ShutdownResult> {
  const baseUrl = `http://127.0.0.1:${port}`;
  let workerWasRunning = false;

  try {
    await fetch(`${baseUrl}/api/admin/shutdown`, {
      method: 'POST',
      signal: AbortSignal.timeout(5000),
    });
    workerWasRunning = true;
  } catch {
    // Worker not running (connection refused) or shutdown POST timed out.
    // Either way, nothing more to do.
    return { workerWasRunning: false, confirmedStopped: true };
  }

  const pollIntervalMs = 500;
  const maxAttempts = Math.ceil(timeoutMs / pollIntervalMs);
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    await new Promise((resolve) => setTimeout(resolve, pollIntervalMs));
    try {
      await fetch(`${baseUrl}/api/health`, {
        signal: AbortSignal.timeout(1000),
      });
      // Health endpoint still responding — worker is still alive, keep waiting.
    } catch (err) {
      // AbortError = health endpoint timed out (worker still accepting
      // connections but slow). Keep polling. Any other error
      // (ECONNREFUSED, ECONNRESET) means the worker is gone.
      if (err instanceof Error && err.name === 'AbortError') continue;
      return { workerWasRunning, confirmedStopped: true };
    }
  }

  return { workerWasRunning, confirmedStopped: false };
}
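The catch block's error triage (an AbortError means the endpoint is slow but still accepting connections, so keep polling; any other failure means the worker is gone) can be reduced to a pure predicate. A minimal sketch; `workerStillAlive` is an illustrative name, not a helper the patch defines:

```typescript
// Triage a fetch failure from the health poll: only an AbortError keeps
// the poll loop going; ECONNREFUSED/ECONNRESET (or any other error)
// signals that the worker process has exited.
function workerStillAlive(err: unknown): boolean {
  return err instanceof Error && err.name === 'AbortError';
}

const abortErr = new Error('timed out');
abortErr.name = 'AbortError';
console.log(workerStillAlive(abortErr));                  // true — keep polling
console.log(workerStillAlive(new Error('ECONNREFUSED'))); // false — worker gone
```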
@@ -26,6 +26,7 @@ import {
unlinkSync,
} from 'fs';
import { logger } from '../../utils/logger.js';
import { SettingsDefaultsManager } from '../../shared/SettingsDefaultsManager.js';

// ============================================================================
// Path Resolution
@@ -168,7 +169,7 @@ function writeOpenClawConfig(config: Record<string, any>): void {
* and the memory slot.
*/
function registerPluginInOpenClawConfig(
workerPort: number = 37777,
workerPort: number,
project: string = 'openclaw',
syncMemoryFile: boolean = true,
): void {
@@ -305,7 +306,11 @@ function copyPluginFilesAndRegister(
'utf-8',
);

registerPluginInOpenClawConfig();
// Resolve port via SettingsDefaultsManager so CLAUDE_MEM_WORKER_PORT env
// takes priority and the per-UID default (37700 + uid % 100) is used
// otherwise. Required for multi-account isolation (#2101).
const workerPort = SettingsDefaultsManager.getInt('CLAUDE_MEM_WORKER_PORT');
registerPluginInOpenClawConfig(workerPort);
console.log(`  Registered in openclaw.json`);

logger.info('OPENCLAW', 'Plugin installed', { destination: extensionDirectory });

@@ -75,6 +75,7 @@ export class SessionStore {
this.addObservationSubagentColumns();
this.addPendingMessagesToolUseIdAndWorkerPidColumns();
this.addObservationsUniqueContentHashIndex();
this.addObservationsMetadataColumn();
}

/**
@@ -715,6 +716,14 @@ export class SessionStore {
// Clean up leftover temp table from a previously-crashed run
this.db.run('DROP TABLE IF EXISTS observations_new');

// If the live observations table already has metadata (added in v30 or
// by an older bundled artifact that ran v30 before v21 was recorded),
// preserve it so this rebuild doesn't silently drop the column's data.
const observationsCols = this.db.query('PRAGMA table_info(observations)').all() as TableColumnInfo[];
const observationsHasMetadata = observationsCols.some(c => c.name === 'metadata');
const metadataColumnSQL = observationsHasMetadata ? ',\n        metadata TEXT' : '';
const metadataSelectSQL = observationsHasMetadata ? ', metadata' : '';

const observationsNewSQL = `
CREATE TABLE observations_new (
id INTEGER PRIMARY KEY AUTOINCREMENT,
@@ -732,7 +741,7 @@ export class SessionStore {
prompt_number INTEGER,
discovery_tokens INTEGER DEFAULT 0,
created_at TEXT NOT NULL,
created_at_epoch INTEGER NOT NULL,
created_at_epoch INTEGER NOT NULL${metadataColumnSQL},
FOREIGN KEY(memory_session_id) REFERENCES sdk_sessions(memory_session_id) ON DELETE CASCADE ON UPDATE CASCADE
)
`;
@@ -740,7 +749,7 @@ export class SessionStore {
INSERT INTO observations_new
SELECT id, memory_session_id, project, text, type, title, subtitle, facts,
narrative, concepts, files_read, files_modified, prompt_number,
discovery_tokens, created_at, created_at_epoch
discovery_tokens, created_at, created_at_epoch${metadataSelectSQL}
FROM observations
`;
const observationsIndexesSQL = `
@@ -1156,6 +1165,29 @@ export class SessionStore {
}
}

/**
* Add metadata TEXT column to observations (migration 30).
*
* Mirrors MigrationRunner.addObservationsMetadataColumn so bundled artifacts
* that embed SessionStore (e.g. worker-service.cjs, context-generator.cjs)
* stay schema-consistent. Without this, INSERT … (..., metadata, ...) raises
* "table observations has no column named metadata" and POST /api/memory/save
* starts failing on every call once it begins persisting metadata (#2116).
*
* Idempotent via PRAGMA table_info guard.
*/
private addObservationsMetadataColumn(): void {
const cols = this.db.query('PRAGMA table_info(observations)').all() as TableColumnInfo[];
const hasColumn = cols.some(c => c.name === 'metadata');

if (!hasColumn) {
this.db.run('ALTER TABLE observations ADD COLUMN metadata TEXT');
logger.debug('DB', 'Added metadata column to observations table (#2116)');
}

this.db.prepare('INSERT OR IGNORE INTO schema_versions (version, applied_at) VALUES (?, ?)').run(30, new Date().toISOString());
}

/**
* Update the memory session ID for a session
* Called by SDKAgent when it captures the session ID from the first SDK message
@@ -2009,6 +2041,9 @@ export class SessionStore {
files_modified: string[];
agent_type?: string | null;
agent_id?: string | null;
// Caller-supplied JSON metadata, stored verbatim in the metadata column (#2116).
// Pre-stringified by the caller so we don't double-encode an already-JSON value.
metadata?: string | null;
},
promptNumber?: number,
discoveryTokens: number = 0,
@@ -2027,8 +2062,8 @@ export class SessionStore {
INSERT INTO observations
(memory_session_id, project, type, title, subtitle, facts, narrative, concepts,
files_read, files_modified, prompt_number, discovery_tokens, agent_type, agent_id, content_hash, created_at, created_at_epoch,
generated_by_model)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
generated_by_model, metadata)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(memory_session_id, content_hash) DO NOTHING
RETURNING id, created_at_epoch
`);
@@ -2051,7 +2086,8 @@ export class SessionStore {
contentHash,
timestampIso,
timestampEpoch,
generatedByModel || null
generatedByModel || null,
observation.metadata ?? null
) as { id: number; created_at_epoch: number } | null;

if (inserted) {

@@ -40,6 +40,7 @@ export class MigrationRunner {
this.addObservationSubagentColumns();
this.rebuildPendingMessagesForSelfHealingClaim();
this.addObservationsUniqueContentHashIndex();
this.addObservationsMetadataColumn();
}

/**
@@ -1204,4 +1205,27 @@ export class MigrationRunner {
throw new Error(`Migration 29 failed: ${String(error)}`);
}
}

/**
* Add metadata TEXT column to observations (migration 30).
*
* Backward-compatible: nullable, no default. Holds JSON-encoded arbitrary
* metadata supplied by callers of POST /api/memory/save (#2116). Without
* this column, the route's Zod `.passthrough()` accepted unknown fields
* but the INSERT silently dropped them — a quiet contract violation.
*
* Idempotent via PRAGMA table_info guard so cross-machine DB sync that
* leaves schema_versions ahead of actual schema still self-heals.
*/
private addObservationsMetadataColumn(): void {
const cols = this.db.query('PRAGMA table_info(observations)').all() as TableColumnInfo[];
const hasColumn = cols.some(c => c.name === 'metadata');

if (!hasColumn) {
this.db.run('ALTER TABLE observations ADD COLUMN metadata TEXT');
logger.debug('DB', 'Added metadata column to observations table (#2116)');
}

this.db.prepare('INSERT OR IGNORE INTO schema_versions (version, applied_at) VALUES (?, ?)').run(30, new Date().toISOString());
}
}

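Both mirrors rely on the same idempotency guard: read `PRAGMA table_info`, run the `ALTER TABLE` only when the column is absent. The decision itself can be sketched as a pure function (illustrative names; `TableColumnInfo` here is a stand-in for the real row type):

```typescript
// Stand-in for the PRAGMA table_info row shape used in the patch.
interface TableColumnInfo { name: string }

// True when ALTER TABLE ... ADD COLUMN still needs to run.
// Running the guard twice is harmless: the second pass sees the
// column and skips the ALTER, which is what makes the migration
// safe to mirror in both SessionStore and MigrationRunner.
function columnMissing(cols: TableColumnInfo[], column: string): boolean {
  return !cols.some(c => c.name === column);
}

const before = [{ name: 'id' }, { name: 'text' }];
const after = [...before, { name: 'metadata' }];
console.log(columnMissing(before, 'metadata')); // true — run the ALTER
console.log(columnMissing(after, 'metadata'));  // false — already applied
```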
@@ -74,6 +74,7 @@ CREATE TABLE IF NOT EXISTS observations (
agent_id TEXT,
merged_into_project TEXT,
generated_by_model TEXT,
metadata TEXT,
created_at TEXT NOT NULL,
created_at_epoch INTEGER NOT NULL,
FOREIGN KEY(memory_session_id) REFERENCES sdk_sessions(memory_session_id)

@@ -31,6 +31,24 @@ const RECONNECT_BACKOFF_MS = 10_000; // Don't retry connections faster than this
const DEFAULT_CHROMA_DATA_DIR = path.join(os.homedir(), '.claude-mem', 'chroma');
const CHROMA_SUPERVISOR_ID = 'chroma-mcp';

/**
* Pinned chroma-mcp version for deterministic installs.
*
* Why pin: `uvx chroma-mcp` (unpinned) resolves whatever version PyPI happens
* to serve at install time. That has bitten us multiple ways:
* - #2046: transient missing httpcore/httpx after dependency resolver shifts
* - #2085: surprise breaking changes between point releases
* - #2102: subprocess spawn storms triggered by version drift in chromadb deps
*
* Pinning to a specific known-good version makes installs reproducible across
* machines and across time. Bump deliberately, not accidentally.
*
* Verified 2026-04-25 with `uvx --python 3.13 chroma-mcp==0.2.6 --help` in a
* clean uv cache: starts cleanly, no httpcore/httpx ImportError, no `--with`
* flags required. If that changes on a future bump, re-add the flags here.
*/
const CHROMA_MCP_PINNED_VERSION = '0.2.6';

export class ChromaMcpManager {
private static instance: ChromaMcpManager | null = null;
private client: Client | null = null;
@@ -212,7 +230,7 @@ export class ChromaMcpManager {

const args = [
'--python', pythonVersion,
'chroma-mcp',
`chroma-mcp==${CHROMA_MCP_PINNED_VERSION}`,
'--client-type', 'http',
'--host', chromaHost,
'--port', chromaPort
@@ -238,7 +256,7 @@ export class ChromaMcpManager {
// Local mode: persistent client with data directory
return [
'--python', pythonVersion,
'chroma-mcp',
`chroma-mcp==${CHROMA_MCP_PINNED_VERSION}`,
'--client-type', 'persistent',
'--data-dir', DEFAULT_CHROMA_DATA_DIR.replace(/\\/g, '/')
];

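The change is mechanical but worth isolating: the uvx argument list interpolates a pinned version spec (`chroma-mcp==0.2.6`) in place of the floating package name. A hedged sketch, with `buildArgs` as an illustrative reduction of the real method:

```typescript
// Mirrors the pin constant from the patch; buildArgs is an assumed
// simplification of the HTTP-mode argument builder, not the actual code.
const CHROMA_MCP_PINNED_VERSION = '0.2.6';

function buildArgs(pythonVersion: string, host: string, port: string): string[] {
  return [
    '--python', pythonVersion,
    `chroma-mcp==${CHROMA_MCP_PINNED_VERSION}`, // pinned spec, not a floating name
    '--client-type', 'http',
    '--host', host,
    '--port', port,
  ];
}

console.log(buildArgs('3.13', '127.0.0.1', '8000')[2]); // "chroma-mcp==0.2.6"
```

Bumping the pin is then a one-line, reviewable diff instead of an invisible resolver change.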
@@ -44,7 +44,6 @@ import {
readPidFile,
removePidFile,
getPlatformTimeout,
aggressiveStartupCleanup,
runOneTimeChromaMigration,
runOneTimeCwdRemap,
cleanStalePidFile,
@@ -386,7 +385,6 @@ export class WorkerService implements WorkerRef {
private async initializeBackground(): Promise<void> {
try {
logger.info('WORKER', 'Background initialization starting...');
await aggressiveStartupCleanup();

// Load mode configuration
const { ModeManager } = await import('./domain/ModeManager.js');
@@ -1154,34 +1152,21 @@ async function main() {
case 'restart': {
logger.info('SYSTEM', 'Restarting worker');
await httpShutdown(port);
const restartFreed = await waitForPortFree(port, getPlatformTimeout(15000));
const restartFreed = await waitForPortFree(port, 5000);
if (!restartFreed) {
logger.error('SYSTEM', 'Port did not free up after shutdown, aborting restart', { port });
process.exit(0);
// Don't loop, don't force-kill, don't steal the port. The PID file
// owns the lock; if the previous worker won't release the port the
// user resolves it manually.
console.error('Port still bound after shutdown. Resolve manually.');
process.exit(1);
}
removePidFile();

const pid = spawnDaemon(__filename, port);
if (pid === undefined) {
logger.error('SYSTEM', 'Failed to spawn worker daemon during restart');
// Exit gracefully: Windows Terminal won't keep tab open on exit 0
// The wrapper/plugin will handle restart logic if needed
process.exit(0);
const restartPid = spawnDaemon(__filename, port);
if (restartPid === undefined) {
console.error('Failed to spawn worker daemon during restart.');
process.exit(1);
}

// PID file is written by the worker itself after listen() succeeds
// This is race-free and works correctly on Windows where cmd.exe PID is useless

const healthy = await waitForHealth(port, getPlatformTimeout(HOOK_TIMEOUTS.POST_SPAWN_WAIT));
if (!healthy) {
removePidFile();
logger.error('SYSTEM', 'Worker failed to restart');
// Exit gracefully: Windows Terminal won't keep tab open on exit 0
// The wrapper/plugin will handle restart logic if needed
process.exit(0);
}

logger.info('SYSTEM', 'Worker restarted successfully');
logger.info('SYSTEM', 'Worker restart spawned', { pid: restartPid });
process.exit(0);
break;
}
@@ -1298,6 +1283,26 @@ async function main() {
process.exit(0);
}

case 'cleanup': {
// CLI surface for the v12.4.3 pollution cleanup. Shares its scan logic
// with the auto-run-on-startup path so --dry-run reports counts that
// exactly match what the next startup would delete. (#2126 item 5)
const dryRun = process.argv.includes('--dry-run');
const counts = runOneTimeV12_4_3Cleanup(undefined, { dryRun });
const tag = dryRun ? '(dry-run, no changes made)' : '(applied)';
console.log(`\nv12.4.3 cleanup ${tag}`);
if (counts) {
console.log(`  Observer sessions: ${counts.observerSessions}`);
console.log(`  Observer cascade rows: ${counts.observerCascadeRows}`);
console.log(`  Stuck pending_messages: ${counts.stuckPendingMessages}`);
} else if (dryRun) {
console.log('  Scan failed — see worker log for details.');
} else {
console.log('  Already applied (marker present) or skipped.');
}
process.exit(0);
}

case '--daemon':
default: {
// GUARD 1: Refuse to start if another worker is already alive.

@@ -25,10 +25,8 @@ import { ModeManager } from '../domain/ModeManager.js';
import type { ModeConfig } from '../domain/types.js';
import {
processAgentResponse,
shouldFallbackToClaude,
isAbortError,
type WorkerRef,
type FallbackAgent
type WorkerRef
} from './agents/index.js';

// Gemini API endpoint — use v1 (stable), not v1beta.
@@ -116,21 +114,12 @@ interface GeminiContent {
export class GeminiAgent {
private dbManager: DatabaseManager;
private sessionManager: SessionManager;
private fallbackAgent: FallbackAgent | null = null;

constructor(dbManager: DatabaseManager, sessionManager: SessionManager) {
this.dbManager = dbManager;
this.sessionManager = sessionManager;
}

/**
* Set the fallback agent (Claude SDK) for when Gemini API fails
* Must be set after construction to avoid circular dependency
*/
setFallbackAgent(agent: FallbackAgent): void {
this.fallbackAgent = agent;
}

/**
* Start Gemini agent for a session
* Uses multi-turn conversation to maintain context across messages
@@ -352,28 +341,19 @@ export class GeminiAgent {
}

/**
* Handle errors from Gemini API calls with abort detection and Claude fallback.
* Handle errors from Gemini API calls with abort detection.
* Shared by init query and message processing try blocks.
*
* Note: The previous Claude-SDK fallback path was removed in #2087 — it was
* never wired in production (`fallbackAgent` was always null), so 429s
* already threw in practice. The throw is now explicit.
*/
private handleGeminiError(error: unknown, session: ActiveSession, worker?: WorkerRef): Promise<void> | never {
private handleGeminiError(error: unknown, session: ActiveSession, _worker?: WorkerRef): never {
if (isAbortError(error)) {
logger.warn('SDK', 'Gemini agent aborted', { sessionId: session.sessionDbId });
throw error;
}

// Check if we should fall back to Claude
if (shouldFallbackToClaude(error) && this.fallbackAgent) {
logger.warn('SDK', 'Gemini API failed, falling back to Claude SDK', {
sessionDbId: session.sessionDbId,
error: error instanceof Error ? error.message : String(error),
historyLength: session.conversationHistory.length
});

// Fall back to Claude - it will use the same session with shared conversationHistory
// Note: With claim-and-delete queue pattern, messages are already deleted on claim
return this.fallbackAgent.startSession(session, worker);
}

logger.failure('SDK', 'Gemini agent error', { sessionDbId: session.sessionDbId }, error instanceof Error ? error : new Error(String(error)));
throw error;
}

@@ -24,8 +24,6 @@ import { SessionManager } from './SessionManager.js';
import {
isAbortError,
processAgentResponse,
shouldFallbackToClaude,
type FallbackAgent,
type WorkerRef
} from './agents/index.js';

@@ -65,21 +63,12 @@ interface OpenRouterResponse {
export class OpenRouterAgent {
private dbManager: DatabaseManager;
private sessionManager: SessionManager;
private fallbackAgent: FallbackAgent | null = null;

constructor(dbManager: DatabaseManager, sessionManager: SessionManager) {
this.dbManager = dbManager;
this.sessionManager = sessionManager;
}

/**
* Set the fallback agent (Claude SDK) for when OpenRouter API fails
* Must be set after construction to avoid circular dependency
*/
setFallbackAgent(agent: FallbackAgent): void {
this.fallbackAgent = agent;
}

/**
* Start OpenRouter agent for a session
* Uses multi-turn conversation to maintain context across messages
@@ -327,27 +316,18 @@ export class OpenRouterAgent {
}

/**
* Handle errors from session processing: abort re-throw, fallback to Claude, or log and re-throw.
* Handle errors from session processing: abort re-throw or log and re-throw.
*
* Note: The previous Claude-SDK fallback path was removed in #2087 — it was
* never wired in production (`fallbackAgent` was always null), so 429s
* already threw in practice. The throw is now explicit.
*/
private async handleSessionError(error: unknown, session: ActiveSession, worker?: WorkerRef): Promise<never | void> {
private async handleSessionError(error: unknown, session: ActiveSession, _worker?: WorkerRef): Promise<never> {
if (isAbortError(error)) {
logger.warn('SDK', 'OpenRouter agent aborted', { sessionId: session.sessionDbId });
throw error;
}

if (shouldFallbackToClaude(error) && this.fallbackAgent) {
logger.warn('SDK', 'OpenRouter API failed, falling back to Claude SDK', {
sessionDbId: session.sessionDbId,
error: error instanceof Error ? error.message : String(error),
historyLength: session.conversationHistory.length
});

// Fall back to Claude - it will use the same session with shared conversationHistory
// Note: With claim-and-delete queue pattern, messages are already deleted on claim
await this.fallbackAgent.startSession(session, worker);
return;
}

logger.failure('SDK', 'OpenRouter agent error', { sessionDbId: session.sessionDbId }, error instanceof Error ? error : new Error(String(error)));
throw error;
}

@@ -175,7 +175,8 @@ export class PaginationHelper {
params.push(project, project);
} else {
// Hide internal observer-session rows from the unfiltered UI list.
conditions.push("ss.project != 'observer-sessions'");
conditions.push('ss.project != ?');
params.push(OBSERVER_SESSIONS_PROJECT);
}

if (platformSource) {
@@ -229,7 +230,8 @@ export class PaginationHelper {
params.push(project);
} else {
// Hide internal observer-session rows from the unfiltered UI list.
conditions.push("s.project != 'observer-sessions'");
conditions.push('s.project != ?');
params.push(OBSERVER_SESSIONS_PROJECT);
}

if (platformSource) {

@@ -13,6 +13,7 @@

import type { WorkerRef, ObservationSSEPayload, SummarySSEPayload } from './types.js';
import { logger } from '../../../utils/logger.js';
import { shouldEmitProjectRow } from '../../../shared/should-track-project.js';

/**
* Broadcast a new observation to SSE clients
@@ -28,6 +29,18 @@ export function broadcastObservation(
return;
}

// Parity with PaginationHelper's unfiltered-list SQL filter (#2118):
// observer-session rows are internal and must not stream to viewer clients.
// Same predicate used by both filters via shouldEmitProjectRow so they
// can never drift apart.
if (!shouldEmitProjectRow(payload.project)) {
logger.debug('WORKER', 'SSE observation broadcast skipped (internal project)', {
project: payload.project,
id: payload.id,
});
return;
}

worker.sseBroadcaster.broadcast({
type: 'new_observation',
observation: payload
@@ -48,6 +61,15 @@ export function broadcastSummary(
return;
}

// Parity with PaginationHelper's unfiltered-list SQL filter (#2118).
if (!shouldEmitProjectRow(payload.project)) {
logger.debug('WORKER', 'SSE summary broadcast skipped (internal project)', {
project: payload.project,
id: payload.id,
});
return;
}

worker.sseBroadcaster.broadcast({
type: 'new_summary',
summary: payload

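The parity guarantee rests on a single shared predicate: the SQL filter and the SSE guard both derive from the same function, so the two surfaces cannot drift. A minimal sketch of the pattern; the names mirror the patch but the body of `shouldEmitProjectRow` is assumed here:

```typescript
// Single source of truth for "is this project internal?".
// Both the pagination SQL and the SSE broadcast guard consume it.
const OBSERVER_SESSIONS_PROJECT = 'observer-sessions';

function shouldEmitProjectRow(project: string): boolean {
  return project !== OBSERVER_SESSIONS_PROJECT;
}

// SQL side binds the same constant as a parameter; SSE side calls the
// predicate directly — changing the rule changes both at once.
const sqlCondition = 'project != ?'; // bound with OBSERVER_SESSIONS_PROJECT
console.log(sqlCondition);
console.log(shouldEmitProjectRow('claude-mem'));        // true
console.log(shouldEmitProjectRow('observer-sessions')); // false
```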
@@ -6,7 +6,7 @@
*
* Usage:
* ```typescript
* import { processAgentResponse, shouldFallbackToClaude } from './agents/index.js';
* import { processAgentResponse, isAbortError } from './agents/index.js';
* ```
*/

@@ -19,7 +19,6 @@ export type {
StorageResult,
ResponseProcessingContext,
ParsedResponse,
FallbackAgent,
BaseAgentConfig,
} from './types.js';

@@ -98,17 +98,6 @@
summary: ParsedSummary | null;
}

// ============================================================================
// Fallback Agent Interface
// ============================================================================

/**
* Interface for fallback agent (used by Gemini/OpenRouter to fall back to Claude)
*/
export interface FallbackAgent {
startSession(session: ActiveSession, worker?: WorkerRef): Promise<void>;
}

// ============================================================================
// Agent Configuration Types
// ============================================================================

@@ -13,11 +13,22 @@ import { logger } from '../../../../utils/logger.js';
import type { DatabaseManager } from '../../DatabaseManager.js';

// Plan 06 Phase 3 — per-route Zod schema.
//
// `metadata` is an arbitrary JSON object the caller can use to attach
// integration-specific provenance (e.g. obsidian_note, claude_mem_version,
// custom_key). It is stored verbatim in the observations.metadata column
// (migration 30) — no schema enforcement on its keys (#2116).
//
// `metadata.project`, when present and the top-level `project` is omitted,
// is honored as the project assignment. This lets integrating plugins file
// observations under a project other than their own without having to know
// the top-level field name.
const saveMemorySchema = z.object({
text: z.string().trim().min(1),
title: z.string().optional(),
project: z.string().optional(),
}).passthrough();
metadata: z.record(z.string(), z.unknown()).optional(),
}).strict();

export class MemoryRoutes extends BaseRouteHandler {
constructor(
@@ -33,11 +44,26 @@ export class MemoryRoutes extends BaseRouteHandler {

/**
* POST /api/memory/save - Save a manual memory/observation
* Body: { text: string, title?: string, project?: string }
* Body: {
*   text: string,
*   title?: string,
*   project?: string,
*   metadata?: Record<string, unknown> // arbitrary JSON, persisted verbatim (#2116)
* }
*
* Project resolution order: top-level `project` → `metadata.project` (string)
* → this.defaultProject. Unknown top-level fields are now rejected (400) —
* `.strict()` replaced `.passthrough()` so silent drops can't recur.
*/
private handleSaveMemory = this.wrapHandler(async (req: Request, res: Response): Promise<void> => {
const { text, title, project } = req.body as z.infer<typeof saveMemorySchema>;
const targetProject = project || this.defaultProject;
const { text, title, project, metadata } = req.body as z.infer<typeof saveMemorySchema>;
const explicitProject = typeof project === 'string' && project.trim()
? project.trim()
: undefined;
const metadataProject = typeof metadata?.project === 'string' && metadata.project.trim()
? metadata.project.trim()
: undefined;
const targetProject = explicitProject || metadataProject || this.defaultProject;

const sessionStore = this.dbManager.getSessionStore();
const chromaSync = this.dbManager.getChromaSync();
@@ -54,7 +80,10 @@ export class MemoryRoutes extends BaseRouteHandler {
narrative: text,
concepts: [] as string[],
files_read: [] as string[],
files_modified: [] as string[]
files_modified: [] as string[],
// Stringify here so the storage layer doesn't need to know about JSON shape.
// Preserved verbatim, including nested objects.
metadata: metadata ? JSON.stringify(metadata) : null,
};

// 3. Store to SQLite

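The resolution chain in the handler can be isolated as a pure function, which makes the precedence and the whitespace handling easy to verify. `resolveProject` is illustrative only, not a helper the route defines:

```typescript
// Project resolution order from the handler:
// top-level `project` → `metadata.project` (string) → defaultProject.
// Whitespace-only values are treated as absent, matching the trim() guards.
function resolveProject(
  project: string | undefined,
  metadata: Record<string, unknown> | undefined,
  defaultProject: string,
): string {
  const explicit = typeof project === 'string' && project.trim() ? project.trim() : undefined;
  const mp = metadata?.project;
  const fromMeta = typeof mp === 'string' && mp.trim() ? mp.trim() : undefined;
  return explicit || fromMeta || defaultProject;
}

console.log(resolveProject('alpha', { project: 'beta' }, 'fallback')); // "alpha"
console.log(resolveProject(undefined, { project: 'beta' }, 'fallback')); // "beta"
console.log(resolveProject('   ', {}, 'fallback')); // "fallback"
```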
@@ -449,6 +449,10 @@ export class SearchRoutes extends BaseRouteHandler {
* GET /api/search/help
*/
private handleSearchHelp = this.wrapHandler((req: Request, res: Response): void => {
// Use the actual host:port the request came in on so example URLs always
// round-trip back to this same worker — matters for multi-account / non-
// default-port setups (#2101, #2103).
const baseUrl = `http://${req.headers.host ?? 'localhost'}`;
res.json({
title: 'Claude-Mem Search API',
description: 'HTTP API for searching persistent memory',
@@ -551,10 +555,10 @@ export class SearchRoutes extends BaseRouteHandler {
}
],
examples: [
'curl "http://localhost:37777/api/search/observations?query=authentication&limit=5"',
'curl "http://localhost:37777/api/search/by-type?type=bugfix&limit=10"',
'curl "http://localhost:37777/api/context/recent?project=claude-mem&limit=3"',
'curl "http://localhost:37777/api/context/timeline?anchor=123&depth_before=5&depth_after=5"'
`curl "${baseUrl}/api/search/observations?query=authentication&limit=5"`,
`curl "${baseUrl}/api/search/by-type?type=bugfix&limit=10"`,
`curl "${baseUrl}/api/context/recent?project=claude-mem&limit=3"`,
`curl "${baseUrl}/api/context/timeline?anchor=123&depth_before=5&depth_after=5"`
]
});
});

@@ -95,7 +95,18 @@ export class SessionRoutes extends BaseRouteHandler {
* The next generator will use the new provider with shared conversationHistory.
*/
private static readonly STALE_GENERATOR_THRESHOLD_MS = 30_000; // 30 seconds (#1099)
private static readonly MAX_SESSION_WALL_CLOCK_MS = 4 * 60 * 60 * 1000; // 4 hours (#1590)

// Wall-clock cap on a single in-memory session — exists to prevent runaway
// API costs from a session that is somehow stuck in a re-activation loop
// (#1590, #2127, #2098). 4h was the original value, picked when bugs in the
// re-activation path made cost runaways more plausible; users in practice
// have legitimate long-running sessions (24h+ Claude Code days) that this
// killed without warning. 24h is the new ceiling — long enough that
// a real human workday never hits it, short enough that a runaway loop is
// still bounded. We deliberately do NOT expose this as a config knob: a
// session approaching this age is almost certainly a bug worth investigating,
// not a knob worth tuning.
private static readonly MAX_SESSION_WALL_CLOCK_MS = 24 * 60 * 60 * 1000; // 24 hours (#1590, #2127)

public ensureGeneratorRunning(sessionDbId: number, source: string): void {
const session = this.sessionManager.getSession(sessionDbId);
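The cap check itself is a single comparison against session age. A sketch with an assumed helper name (`sessionExpired` is not in the patch), showing why 24h bounds a runaway loop while leaving a full workday untouched:

```typescript
// 24-hour wall-clock ceiling from the patch; the expiry test is just
// "has more wall-clock time elapsed than the cap allows".
const MAX_SESSION_WALL_CLOCK_MS = 24 * 60 * 60 * 1000; // 24 hours

function sessionExpired(startedAtEpochMs: number, nowMs: number): boolean {
  return nowMs - startedAtEpochMs > MAX_SESSION_WALL_CLOCK_MS;
}

console.log(sessionExpired(0, 23 * 60 * 60 * 1000)); // false — long workday survives
console.log(sessionExpired(0, 25 * 60 * 60 * 1000)); // true — runaway loop is bounded
```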