# Plan 06 — Worker Env Isolation > **Goal:** Stop host-side environment variables from contaminating the worker's Anthropic SDK subprocess. Two confirmed bugs anchor this plan: `ANTHROPIC_BASE_URL` leaks from the parent shell while `ANTHROPIC_AUTH_TOKEN` is blocked, breaking proxy/gateway auth (#2375); and `CLAUDE_CODE_EFFORT_LEVEL` propagates from host CLI settings into the SDK subprocess where it triggers a permanent HTTP 400 that the retry classifier mistakes for transient (#2357). Adjacent feature #2289 (`$TIER` alias syntax) is in scope where it shares the same env/model-resolution surface. > > **Net effect:** > - The OAuth-skip predicate requires a real credential (`ANTHROPIC_API_KEY` or `ANTHROPIC_AUTH_TOKEN`), not a bare `ANTHROPIC_BASE_URL`. Proxy/gateway users put credentials in `~/.claude-mem/.env`; nothing relies on parent-shell leaks. > - `BLOCKED_ENV_VARS` adds `ANTHROPIC_BASE_URL` and the `CLAUDE_CODE_EFFORT_LEVEL` / `CLAUDE_CODE_ALWAYS_ENABLE_EFFORT` pair (defense in depth alongside the existing `env-sanitizer.ts` `CLAUDE_CODE_*` prefix filter). > - The Claude provider's error classifier explicitly handles HTTP 400 as `unrecoverable`, matching `GeminiProvider`/`OpenRouterProvider`. No more unbounded retry loop on permanent-error responses. > - Every spawn boundary that hands env to a child process applies BOTH `buildIsolatedEnv` and `sanitizeEnv`. A grep-based CI check forbids spawning subprocesses with raw `process.env`. > - `~/.claude-mem/.env` becomes the single source of truth for non-OAuth Anthropic credentials. The loader's whitelist documents this contract. > > **Out of scope:** > - Hook-side env handling (Plan 01 / 02 territory). > - Worker daemon lifecycle, DB bloat, and chroma-mcp leaks (Plan 03). > - Observer/Knowledge SDK tool enforcement (Plan 05). > - Re-auth UX flow (different concern; out of scope for this plan). > - General provider-router refactor — `$TIER` alias is scoped to model resolution only (Phase 4). --- ## Problem Statement (line citations) ### Bug A — `ANTHROPIC_BASE_URL` leaks, OAuth gets skipped, `ANTHROPIC_AUTH_TOKEN` is missing (#2375) `src/shared/EnvManager.ts` lines 14–24 (`BLOCKED_ENV_VARS`): ```ts const BLOCKED_ENV_VARS = [ 'ANTHROPIC_API_KEY', // #733 'ANTHROPIC_AUTH_TOKEN', // added 5edf1557 (2026-05-04) — leak prevention 'CLAUDECODE', 'CLAUDE_CODE_OAUTH_TOKEN', // #2215 ]; ``` `ANTHROPIC_BASE_URL` is **not** in the list, so it survives `buildIsolatedEnv()` (lines 166–205) and reaches `isolatedEnv` from `process.env`. `buildIsolatedEnvWithFreshOAuth()` lines 222–288 then runs the OAuth-skip predicate at lines 237–244: ```ts if ( isolatedEnv.ANTHROPIC_API_KEY || isolatedEnv.ANTHROPIC_BASE_URL || isolatedEnv.ANTHROPIC_AUTH_TOKEN ) { clearStaleMarker(); return isolatedEnv; } ``` The bare `BASE_URL` branch was added in commit `a122d34e` (2026-05-04) under the rationale "tokenless gateways may exist." Combined with the `AUTH_TOKEN` block from `5edf1557` the same day, the subprocess ends up with: - `ANTHROPIC_BASE_URL` ✅ (leaked from parent) - `ANTHROPIC_AUTH_TOKEN` ❌ (blocked, never re-injected because `~/.claude-mem/.env` is empty for first-time proxy users) - `CLAUDE_CODE_OAUTH_TOKEN` ❌ (skip path bypassed the keychain read) Result: `Not logged in · Please run /login` from every SDK subprocess. ### Bug B — `CLAUDE_CODE_EFFORT_LEVEL` triggers permanent 400 + unbounded retry (#2357) The Anthropic SDK subprocess reads `CLAUDE_CODE_EFFORT_LEVEL` from its env and forwards it as the `effort` parameter on Messages API calls. claude-mem's source contains **zero** references to `effort` — the leak path is environmental, not code. Models without effort support (Haiku 4.5, Sonnet 4.5, older) reject with HTTP 400. `src/supervisor/env-sanitizer.ts` lines 1–51 already filters `CLAUDE_CODE_*` via `ENV_PREFIXES` (with explicit allowances in `ENV_PRESERVE`). But: 1. `buildIsolatedEnv` does NOT call `sanitizeEnv` internally; callers are expected to chain them. 2. `BLOCKED_ENV_VARS` is the canonical leak deny-list and does not name `CLAUDE_CODE_EFFORT_LEVEL`. Defense-in-depth is currently single-layer. 3. The retry classifier in `src/services/worker/ClaudeProvider.ts` has no HTTP 400 case; the default branch at line 98 returns `kind: 'transient'`, so a permanent 400 loops forever. `src/services/worker/GeminiProvider.ts` lines 89–94 and `src/services/worker/OpenRouterProvider.ts` lines 82–87 already classify 400 as `unrecoverable`; that pattern is the copy-target for ClaudeProvider. ### Adjacent — `$TIER` alias syntax (#2289) `src/shared/SettingsDefaultsManager.ts` line 116 already implements a *portable* `'haiku'` alias for `CLAUDE_MEM_TIER_SIMPLE_MODEL` (per #1463). What's missing is the user-facing `$TIER` *syntax* in the `CLAUDE_MEM_MODEL` field that resolves to a provider-appropriate model at request time. Same code surface (model resolution in `ClaudeProvider.getModelId` at lines 442–446); minimal extension. --- ## Phase 0 — Documentation Discovery (already completed) Findings below are direct file reads dated 2026-05-08. Each implementation phase cites by line number; do not re-derive. **Confidence: HIGH on file/API inventory.** Local-only files were read end-to-end. ### Allowed APIs / patterns to copy | Item | Location | What to copy | |---|---|---| | `BLOCKED_ENV_VARS` array | `src/shared/EnvManager.ts:14–24` | Add new entries; keep the comment-per-entry convention | | `buildIsolatedEnv` filter pattern | `src/shared/EnvManager.ts:166–205` | Filter on `BLOCKED_ENV_VARS.includes(key)`; defensive `delete isolatedEnv.X` post-filter | | `buildIsolatedEnvWithFreshOAuth` skip-check | `src/shared/EnvManager.ts:237–244` | Restrict predicate to real credentials only | | `loadClaudeMemEnv` whitelist + `ClaudeMemEnv` interface | `src/shared/EnvManager.ts:26–32, 79–100` | Single source of truth for what `~/.claude-mem/.env` accepts | | `ENV_PRESERVE` / `ENV_EXACT_MATCHES` / `ENV_PREFIXES` | `src/supervisor/env-sanitizer.ts:1–51` | Whitelist-based env stripping; do NOT add `CLAUDE_CODE_EFFORT_LEVEL` to `ENV_PRESERVE` | | Provider error classifier (HTTP 400 → unrecoverable) | `src/services/worker/GeminiProvider.ts:89–94`, `src/services/worker/OpenRouterProvider.ts:82–87` | Identical pattern to apply in `ClaudeProvider` | | `ClassifiedProviderError` constructor + `kind: 'unrecoverable' \| 'auth_invalid' \| 'transient' \| 'rate_limit' \| 'quota_exhausted'` | `src/services/worker/retry.ts` | Use existing `kind` enum; do not invent `permanent` | | `isRetryableKind` predicate | `src/services/worker/retry.ts:37–44` | Used by all retry sites; no edit needed once classifier is correct | | Tier model resolution + `'haiku'` alias | `src/services/worker/http/routes/SessionRoutes.ts:503–521`, `src/shared/SettingsDefaultsManager.ts:51–53, 115–117` | Pattern for extending `$TIER` syntax | | Settings flat-key + `loadFromFile` | `src/shared/SettingsDefaultsManager.ts:6–67, 70–131, 137–139, 161–206` | New keys MUST be added to interface AND `DEFAULTS` block | | Plan format (phase numbering, line-cited edits, anti-patterns block) | `plans/01-hook-io-discipline.md`, `plans/05-observer-tool-enforcement.md` | Reuse layout | ### Anti-patterns / methods that DO NOT exist (avoid inventing) - claude-mem source has **zero references** to `effort`, `CLAUDE_CODE_EFFORT_LEVEL`, `CLAUDE_CODE_ALWAYS_ENABLE_EFFORT`, or `reasoning_effort`. Do not "remove the effort parameter we forward" — there is none. The leak is the SDK subprocess reading the env var directly. - `BLOCKED_ENV_VARS` is an `Array` with `.includes` lookup. Do NOT convert to `Set` in the same change — that touches every caller and is an unrelated refactor. - `ClassifiedProviderError.kind` does NOT support the value `'permanent'`. The existing enum is `'transient' | 'rate_limit' | 'unrecoverable' | 'auth_invalid' | 'quota_exhausted'`. Use `unrecoverable` for permanent 400s. - `pending_messages` has **no `retry_count` column** (dropped — see `src/services/sqlite/SessionStore.ts:104`'s `deadColumns` array). Issue #2357's "retry counter climbed past #1874" refers to log-line numbering, not a DB counter. Do not add a counter as part of this plan; that's Plan 03 territory. - `sanitizeEnv` is whitelist-based (preserves a fixed set; strips everything matching `CLAUDE_CODE_*` etc). It is NOT idempotent if you re-add a name to `ENV_PRESERVE`. Do not add `CLAUDE_CODE_EFFORT_LEVEL` to `ENV_PRESERVE` — that's the opposite of what we want. - `buildIsolatedEnv` and `sanitizeEnv` are **independent layers**. Some callers chain (`sanitizeEnv(buildIsolatedEnv(...))`); some only use one. Do not assume chaining is universal — Phase 5 audits every spawn boundary. - The `~/.claude-mem/.env` loader at `src/shared/EnvManager.ts:79–100` uses property-by-property assignment as an implicit whitelist. Do NOT replace with `Object.assign(result, parsed)` — that breaks the whitelist guarantee. ### File inventory used by this plan | File | Lines | Disposition | |---|---|---| | `src/shared/EnvManager.ts` | 319 | Edited heavily (Phase 2, Phase 5) | | `src/supervisor/env-sanitizer.ts` | 51 | Light edit (Phase 3 — comment change only; `CLAUDE_CODE_*` prefix already filters EFFORT_LEVEL) | | `src/services/worker/ClaudeProvider.ts` | 448 | Edited (Phase 3 — error classifier on `query()` rejection path) | | `src/services/worker/retry.ts` | small | Confirm-only (Phase 3 — `isRetryableKind` already correct) | | `src/services/worker/GeminiProvider.ts` | reference only | Read for pattern (Phase 3) | | `src/services/worker/OpenRouterProvider.ts` | reference only | Read for pattern (Phase 3) | | `src/shared/SettingsDefaultsManager.ts` | 209 | Edited (Phase 4 — `$TIER` alias resolution) | | `src/services/worker/http/routes/SessionRoutes.ts` | reference | Read tier-routing pattern (Phase 4) | | `src/services/infrastructure/ProcessManager.ts` | line 415 | Audit (Phase 5) — confirm `sanitizeEnv` chain is sufficient | | `src/services/sync/ChromaMcpManager.ts` | line 585 | Audit (Phase 5) | | `src/supervisor/process-registry.ts` | line 539 | Audit (Phase 5) | | `src/services/worker-service.ts` | line 412 | Audit (Phase 5) | | `src/services/worker/knowledge/KnowledgeAgent.ts` | lines 54, 149 | Confirm-only (Phase 5) | | `tests/env-isolation.test.ts` | NEW | CREATED (Phase 6) | | `scripts/check-spawn-env-discipline.cjs` | NEW | CREATED (Phase 7) | | `CLAUDE.md` | small | Edited (Phase 7 — document `~/.claude-mem/.env` contract) | --- ## Phase 1 — Audit & write the failing tests first **Goal:** Pin down current behavior with red tests so the fix can prove itself green. No production-code changes in this phase. ### 1.1 Tests to add (`tests/env-isolation.test.ts`) Use `bun:test` per `package.json` `"test": "bun test"`. Pattern from `tests/claude-provider-resume.test.ts:1`. 1. **`buildIsolatedEnvWithFreshOAuth strips ANTHROPIC_BASE_URL when no .env credentials are configured`** - Stub `process.env.ANTHROPIC_BASE_URL = 'https://proxy.example'`, no `~/.claude-mem/.env`, no API_KEY/AUTH_TOKEN in env. - Call `buildIsolatedEnvWithFreshOAuth()`. - Assert: result does NOT have `ANTHROPIC_BASE_URL` (post-fix). Currently RED. 2. **`OAuth-skip does not fire on bare ANTHROPIC_BASE_URL`** - Same setup. Spy on `readClaudeOAuthToken`. - Assert: `readClaudeOAuthToken` was called (because BASE_URL alone is not enough to skip). Currently RED — `readClaudeOAuthToken` is NOT called today. 3. **`ANTHROPIC_AUTH_TOKEN from ~/.claude-mem/.env reaches the isolated env`** - Write a temp `.env` with `ANTHROPIC_AUTH_TOKEN=test-token` and `ANTHROPIC_BASE_URL=https://proxy.example`. - Assert: `isolatedEnv.ANTHROPIC_AUTH_TOKEN === 'test-token'` AND `isolatedEnv.ANTHROPIC_BASE_URL === 'https://proxy.example'`. Currently GREEN (already works); test guards against regression. 4. **`CLAUDE_CODE_EFFORT_LEVEL is stripped from the isolated env`** - Stub `process.env.CLAUDE_CODE_EFFORT_LEVEL = 'MAX'`. - Assert: `sanitizeEnv(buildIsolatedEnv())` does NOT contain `CLAUDE_CODE_EFFORT_LEVEL`. Currently GREEN via `env-sanitizer.ENV_PREFIXES`; test guards. 5. **`CLAUDE_CODE_EFFORT_LEVEL is in BLOCKED_ENV_VARS for defense-in-depth`** - Assert: `BLOCKED_ENV_VARS.includes('CLAUDE_CODE_EFFORT_LEVEL')`. Currently RED. 6. **`HTTP 400 from Claude SDK is classified unrecoverable`** - Construct an error matching the SDK's 400 shape (`error.status === 400`, body contains `does not support the effort parameter`). - Assert: `classifyClaudeProviderError(err).kind === 'unrecoverable'`. Currently RED — falls through to `transient`. 7. **`HTTP 400 with effort-parameter body emits a once-only warn log`** - Same setup as 6, plus capture `logger.warn` calls. - Assert: warn fires once with category `SDK` and a hint pointing at #2357 / `~/.claude-mem/.env`. Currently RED. ### 1.2 Verification checklist (Phase 1) - [ ] All 7 tests added; tests 1, 2, 5, 6, 7 are RED; tests 3, 4 are GREEN. - [ ] `bun test tests/env-isolation.test.ts` runs cleanly (RED tests fail with the expected assertion, no other errors). - [ ] No production-code changes in this phase (`git diff src/` empty). ### 1.3 Anti-pattern guards - Do NOT mock `EnvManager.buildIsolatedEnv` — it's the unit under test. - Do NOT use `vi.*` (project uses `bun:test`, not vitest). - Do NOT skip cleanup of temp `.env` files. Use a per-test `beforeEach`/`afterEach` with `mkdtempSync`. --- ## Phase 2 — Fix #2375 (BASE_URL leak + OAuth-skip predicate) **Goal:** Make the OAuth-skip require a real credential, and add `ANTHROPIC_BASE_URL` to the deny-list so it can only be configured via `~/.claude-mem/.env`. ### 2.1 Edit `src/shared/EnvManager.ts:14–24` — extend `BLOCKED_ENV_VARS` **Before:** ```ts const BLOCKED_ENV_VARS = [ 'ANTHROPIC_API_KEY', 'ANTHROPIC_AUTH_TOKEN', 'CLAUDECODE', 'CLAUDE_CODE_OAUTH_TOKEN', ]; ``` **After (add `ANTHROPIC_BASE_URL`):** ```ts const BLOCKED_ENV_VARS = [ 'ANTHROPIC_API_KEY', // #733 'ANTHROPIC_AUTH_TOKEN', // 5edf1557 — leak prevention; re-injected from ~/.claude-mem/.env when configured 'ANTHROPIC_BASE_URL', // #2375 — same leak class as AUTH_TOKEN; re-injected from ~/.claude-mem/.env. Without this entry, a leaked BASE_URL alone triggered the OAuth-skip while no auth credential reached the subprocess. 'CLAUDECODE', 'CLAUDE_CODE_OAUTH_TOKEN', // #2215 ]; ``` ### 2.2 Edit `src/shared/EnvManager.ts:237–244` — restrict OAuth-skip to real credentials **Before:** ```ts if ( isolatedEnv.ANTHROPIC_API_KEY || isolatedEnv.ANTHROPIC_BASE_URL || isolatedEnv.ANTHROPIC_AUTH_TOKEN ) { clearStaleMarker(); return isolatedEnv; } ``` **After:** ```ts // Skip OAuth lookup ONLY when a real credential is configured. A bare // ANTHROPIC_BASE_URL is not a credential — every documented gateway needs // either an AUTH_TOKEN or an API_KEY. This guards #2375 against a class of // leaks where a parent shell exports BASE_URL (e.g. for the Claude Code CLI // itself) while no token is present. if (isolatedEnv.ANTHROPIC_API_KEY || isolatedEnv.ANTHROPIC_AUTH_TOKEN) { clearStaleMarker(); return isolatedEnv; } ``` ### 2.3 Verify the `~/.claude-mem/.env` re-injection at `src/shared/EnvManager.ts:178–195` Currently the loader path covers BASE_URL re-injection from `.env`. Confirm by reading the function. No code change required here, but add a TS comment block above lines 178–195 documenting the new contract: ```ts // Contract (post-#2375): ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, and // ANTHROPIC_API_KEY are *only* populated from ~/.claude-mem/.env. They are // in BLOCKED_ENV_VARS so parent-shell values never leak through. ``` ### 2.4 Verification checklist (Phase 2) - [ ] Tests 1, 2 from Phase 1 now GREEN. - [ ] Existing test suite still passes (`bun test`). - [ ] `grep -n "ANTHROPIC_BASE_URL" src/shared/EnvManager.ts` shows entries at: `BLOCKED_ENV_VARS`, `ClaudeMemEnv` interface, loader, re-injection, OAuth-skip predicate (NOT in skip predicate). - [ ] Smoke: with a `~/.claude-mem/.env` containing `ANTHROPIC_BASE_URL=...` and `ANTHROPIC_AUTH_TOKEN=...`, the worker actually authenticates against the proxy. Test with BigModel or any sandboxed proxy. ### 2.5 Anti-pattern guards - Do NOT add `ANTHROPIC_BASE_URL` to `ENV_PRESERVE` in `env-sanitizer.ts` — `BLOCKED_ENV_VARS` is the right layer; `env-sanitizer` is a downstream filter. - Do NOT keep the BASE_URL branch in the OAuth-skip predicate "for tokenless gateways may exist" — every documented gateway requires a token. The skip path was a misdesign. - Do NOT delete the existing `delete isolatedEnv.CLAUDE_CODE_OAUTH_TOKEN` defensive line at line 229. That guard is intact; it's belt-and-suspenders for #2215 and orthogonal to this plan. --- ## Phase 3 — Fix #2357 (CLAUDE_CODE_EFFORT_LEVEL leak + 400 retry classification) **Goal:** Two-layer defense for the env leak (existing `CLAUDE_CODE_*` prefix filter + new `BLOCKED_ENV_VARS` entries), plus a permanent classification for the resulting HTTP 400 so the retry loop terminates if the leak ever sneaks past either layer. ### 3.1 Edit `src/shared/EnvManager.ts:14–24` — add EFFORT entries to `BLOCKED_ENV_VARS` After the Phase 2 edit, the list is: ```ts const BLOCKED_ENV_VARS = [ 'ANTHROPIC_API_KEY', 'ANTHROPIC_AUTH_TOKEN', 'ANTHROPIC_BASE_URL', 'CLAUDECODE', 'CLAUDE_CODE_OAUTH_TOKEN', // #2357 — host CLI config, not part of the plugin's contract. The // env-sanitizer's CLAUDE_CODE_* prefix filter strips these for spawn paths // that go through it, but BLOCKED_ENV_VARS is the canonical deny-list and // belongs in defense-in-depth. 'CLAUDE_CODE_EFFORT_LEVEL', 'CLAUDE_CODE_ALWAYS_ENABLE_EFFORT', ]; ``` ### 3.2 Edit `src/services/worker/ClaudeProvider.ts` — classify HTTP 400 as unrecoverable Locate the existing error-classification path. The Anthropic SDK raises errors with `error.status` and a body containing the failure description. Pattern from `src/services/worker/GeminiProvider.ts:89–94` (the canonical copy-target): ```ts if (status === 400) { return new ClassifiedProviderError( `Gemini bad request (status 400)`, { kind: 'unrecoverable', cause: input.cause }, ); } ``` Add the equivalent in `ClaudeProvider`'s error classifier (new function or existing — read the file; create if absent, mirroring `GeminiProvider` shape): ```ts function classifyClaudeProviderError(input: { cause: unknown }): ClassifiedProviderError { const err = input.cause; const status = (err as { status?: number })?.status; const bodyText = String((err as { message?: string })?.message ?? ''); // Permanent: SDK rejected the request itself. Most common cause in the wild // is a leaked CLAUDE_CODE_EFFORT_LEVEL the SDK subprocess forwarded as // `effort` against a model that doesn't support it (#2357). The leak is // also blocked at BLOCKED_ENV_VARS + env-sanitizer; this classifier ends // the retry loop if either layer is bypassed. if (status === 400) { if (/effort parameter/i.test(bodyText)) { logger.warn( 'SDK', 'Claude API rejected effort parameter — likely CLAUDE_CODE_EFFORT_LEVEL leaked into SDK env (issue #2357). Configure CLAUDE_MEM_MODEL or set credentials in ~/.claude-mem/.env.', { status, bodyText }, ); } return new ClassifiedProviderError( `Claude bad request (status 400): ${bodyText}`, { kind: 'unrecoverable', cause: input.cause }, ); } // 401 / 403 → auth_invalid (existing pattern from GeminiProvider:96-103) if (status === 401 || status === 403) { return new ClassifiedProviderError( `Claude auth rejected (status ${status})`, { kind: 'auth_invalid', cause: input.cause }, ); } // 429 → rate_limit if (status === 429) { return new ClassifiedProviderError( `Claude rate limited (status 429)`, { kind: 'rate_limit', cause: input.cause }, ); } // Default: transient (preserves the existing fall-through behavior). return new ClassifiedProviderError( `Claude SDK error: ${bodyText}`, { kind: 'transient', cause: input.cause }, ); } ``` Wire this classifier into the existing `try { ... } catch` around `query(...)` in `ClaudeProvider.ts`. **Read the actual catch shape before editing** — the function lives near line 180–195 and the existing `for await` over `queryResult` is where rejections surface. ### 3.3 Confirm `src/supervisor/env-sanitizer.ts` already strips `CLAUDE_CODE_EFFORT_LEVEL` Read lines 1–51. Verify: - `ENV_PREFIXES` includes `'CLAUDE_CODE_'`. - `ENV_PRESERVE` does NOT include `CLAUDE_CODE_EFFORT_LEVEL`, `CLAUDE_CODE_ALWAYS_ENABLE_EFFORT`. Add an inline comment at the `ENV_PREFIXES` declaration: ```ts // Filters CLAUDE_CODE_* unless explicitly preserved in ENV_PRESERVE. // This is layer 2 of defense for #2357 — layer 1 is BLOCKED_ENV_VARS in EnvManager. ``` No code change to behavior here. ### 3.4 Verification checklist (Phase 3) - [ ] Tests 5, 6, 7 from Phase 1 now GREEN. - [ ] `grep -n "CLAUDE_CODE_EFFORT_LEVEL" src/` returns hits in `EnvManager.ts` (BLOCKED_ENV_VARS) and the test file. Nothing else. - [ ] Reproduce #2357 scenario locally: ```bash CLAUDE_CODE_EFFORT_LEVEL=MAX bun run src/services/worker-service.ts --daemon # Observe: no `effort` parameter on outgoing requests. ``` - [ ] If a 400 is forced (e.g., via a mocked SDK reject), the retry loop terminates after the first attempt; `logger.warn` fires once. ### 3.5 Anti-pattern guards - Do NOT add a separate "permanent error" enum value — `kind: 'unrecoverable'` already exists and is the right slot. - Do NOT regex on the entire error stack — `error.status === 400` is the deterministic signal; the body text check is purely for the user-facing log hint. - Do NOT log inside `classifyClaudeProviderError` for every 400 — only the effort-parameter sub-case warrants a hint. Generic 400s are noisy enough at the call site. - Do NOT mark all 400s with body matching `/effort/i` as `auth_invalid` — that would trigger the "re-login" flow incorrectly. Use `unrecoverable`. - Do NOT rely on the SDK supporting an `effort` SDK-option that we strip. The SDK type does not expose `effort`; the leak is the SDK's own subprocess (`pathToClaudeCodeExecutable`) reading the env var. Stripping at our env layer is the only fix we control. --- ## Phase 4 — `$TIER` alias syntax (#2289) **Goal:** Allow `CLAUDE_MEM_MODEL=$TIER:summary` (and similar) to resolve at request time to a provider-appropriate model, reusing the existing `'haiku'` portable alias machinery (line 116, #1463). Optional phase; can be deferred without blocking Phase 2/3. ### 4.1 Edit `src/shared/SettingsDefaultsManager.ts` — extend tier interface Add to the `SettingsDefaults` interface near lines 51–53: ```ts CLAUDE_MEM_TIER_FAST_MODEL: string; // for $TIER:fast — defaults to 'haiku' CLAUDE_MEM_TIER_SMART_MODEL: string; // for $TIER:smart — defaults to 'sonnet' (or provider-equivalent) ``` Add to the `DEFAULTS` block near lines 115–117: ```ts CLAUDE_MEM_TIER_FAST_MODEL: 'haiku', CLAUDE_MEM_TIER_SMART_MODEL: 'sonnet', ``` ### 4.2 Edit `src/services/worker/ClaudeProvider.ts:442–446` — add `$TIER` resolution Replace `getModelId()`: ```ts private getModelId(): string { const settingsPath = paths.settings(); const settings = SettingsDefaultsManager.loadFromFile(settingsPath); return resolveTierAlias(settings.CLAUDE_MEM_MODEL, settings); } ``` Add `resolveTierAlias` to a shared util (`src/services/worker/model-aliases.ts`, NEW): ```ts import type { SettingsDefaults } from '../../shared/SettingsDefaultsManager'; const TIER_PATTERN = /^\$TIER:(fast|smart|simple|summary)$/; export function resolveTierAlias(model: string, settings: SettingsDefaults): string { const match = TIER_PATTERN.exec(model); if (!match) return model; switch (match[1]) { case 'fast': return settings.CLAUDE_MEM_TIER_FAST_MODEL || 'haiku'; case 'smart': return settings.CLAUDE_MEM_TIER_SMART_MODEL || 'sonnet'; case 'simple': return settings.CLAUDE_MEM_TIER_SIMPLE_MODEL || 'haiku'; case 'summary': return settings.CLAUDE_MEM_TIER_SUMMARY_MODEL || settings.CLAUDE_MEM_MODEL; default: return model; } } ``` ### 4.3 Same call site in `KnowledgeAgent.ts:149` (`getModelId`) Apply the same `resolveTierAlias` wrap. Knowledge agent uses the same settings path. ### 4.4 Verification checklist (Phase 4) - [ ] New test: `resolveTierAlias('$TIER:fast', settings)` returns `settings.CLAUDE_MEM_TIER_FAST_MODEL`. - [ ] New test: `resolveTierAlias('claude-haiku-4-5-20251001', settings)` returns input unchanged (non-tier passthrough). - [ ] Setting `CLAUDE_MEM_MODEL=$TIER:fast` and starting the worker actually queries against the fast-tier model. - [ ] Documentation updated in `docs/public/configuration.mdx` with the four tier aliases. ### 4.5 Anti-pattern guards - Do NOT match `$TIER:*` greedily — the regex is anchored. - Do NOT add `$PROVIDER:` or `$MODEL:` aliases in this phase — out of scope; one syntax at a time. - Do NOT mutate `settings` inside `resolveTierAlias`; pure function only. - Do NOT resolve the alias at settings-load time — resolve at *request* time so users can edit settings without restarting the worker. --- ## Phase 5 — Cross-spawn-boundary audit **Goal:** Every place claude-mem spawns a subprocess must apply both `buildIsolatedEnv` (or the async variant) AND `sanitizeEnv`. A grep-based check codifies the rule. ### 5.1 Audit table — current state per call site | File | Line | Spawn target | Env construction | Sufficient? | |---|---|---|---|---| | `src/services/worker/ClaudeProvider.ts` | 155 | Anthropic SDK subprocess | `sanitizeEnv(await buildIsolatedEnvWithFreshOAuth())` | ✅ | | `src/services/worker/knowledge/KnowledgeAgent.ts` | 54, 149 | Knowledge SDK subprocess | `sanitizeEnv(await buildIsolatedEnvWithFreshOAuth())` | ✅ | | `src/services/infrastructure/ProcessManager.ts` | 415 | Worker daemon | `sanitizeEnv({...process.env, CLAUDE_MEM_WORKER_PORT, ...extraEnv})` | ⚠️ daemon inherits parent env then sanitizes — does not pass through `buildIsolatedEnv`. **Document why this is OK**: daemon is the trust boundary; parent env IS the truth. But it should still strip `CLAUDE_CODE_EFFORT_LEVEL` via the prefix filter. Confirm. | | `src/services/sync/ChromaMcpManager.ts` | 585 | chroma-mcp subprocess | `sanitizeEnv(process.env)` | ⚠️ same as above. | | `src/supervisor/process-registry.ts` | 539 | Generic spawn factory | `sanitizeEnv(options.env ?? process.env)` | ⚠️ same. | | `src/services/worker-service.ts` | 412 | MCP server subprocess | `sanitizeEnv(process.env)` | ⚠️ same. | For the worker-daemon and downstream MCP/chroma spawns, parent-process env IS the source of truth — they are pre-credential paths. As long as `CLAUDE_CODE_EFFORT_LEVEL` and the Anthropic credentials are stripped (which `sanitizeEnv` does via `CLAUDE_CODE_*` prefix and the existing `ANTHROPIC_AUTH_TOKEN` block), behavior is correct. The plan does not change these paths — it adds tests that prove they stay correct. ### 5.2 Add audit test — `tests/env-isolation.test.ts` 8. **`every documented spawn site applies sanitizeEnv`** - Read each file from the audit table. - Assert: each line cited contains `sanitizeEnv(`. Currently GREEN; test prevents regression. 9. **`worker-daemon spawn env does not contain CLAUDE_CODE_EFFORT_LEVEL`** - Stub `process.env.CLAUDE_CODE_EFFORT_LEVEL = 'MAX'`. - Construct the env block as ProcessManager.ts:415 does. - Assert: result does not contain `CLAUDE_CODE_EFFORT_LEVEL`. Currently GREEN. ### 5.3 Verification checklist (Phase 5) - [ ] Tests 8, 9 GREEN. - [ ] No new spawn sites introduced; if any are added by accident, the CI check (Phase 7) flags them. ### 5.4 Anti-pattern guards - Do NOT add `buildIsolatedEnv` calls to ProcessManager / ChromaMcpManager / MCP server spawn paths. They legitimately need parent-shell `PATH`, `HOME`, etc. — those would be wiped by the credential-isolated builder. - Do NOT consolidate the two layers into one helper "for clarity" — they have distinct contracts and are layered intentionally. --- ## Phase 6 — Test the full integration end-to-end **Goal:** Smoke test the proxy/gateway path so we know the fix works in the real world. ### 6.1 Manual smoke (BigModel proxy or any equivalent) ```bash # Setup: cat > ~/.claude-mem/.env <<'EOF' ANTHROPIC_BASE_URL=https://open.bigmodel.cn/api/anthropic ANTHROPIC_AUTH_TOKEN= EOF chmod 600 ~/.claude-mem/.env # Reset worker: npm run build-and-sync pkill -f worker-service.cjs # Trigger: # In any Claude Code session, use any tool — PostToolUse hook should land an observation. # Verify: tail -f ~/.claude-mem/logs/claude-mem-$(date +%Y-%m-%d).log # Expect: no "Not logged in" errors; observations land via the proxy. ``` ### 6.2 Manual smoke (CLAUDE_CODE_EFFORT_LEVEL leak) ```bash # Setup: export CLAUDE_CODE_EFFORT_LEVEL=MAX export CLAUDE_CODE_ALWAYS_ENABLE_EFFORT=true # Restart Claude Code so the env propagates to the hook subprocess. # Verify: tail -f ~/.claude-mem/logs/claude-mem-$(date +%Y-%m-%d).log # Expect: NO repeated "API Error: 400 This model does not support the effort parameter." # Expect: NO "PARSER returned non-XML response; marking messages as failed for retry". ``` ### 6.3 Verification checklist (Phase 6) - [ ] Both smoke scenarios pass. - [ ] `bun test` is green. - [ ] One iteration on a fresh machine confirms `~/.claude-mem/.env` is the only knob users need for proxy auth. --- ## Phase 7 — CI guard + documentation **Goal:** A grep-based CI check rejects PRs that introduce a subprocess spawn without `sanitizeEnv`. Documentation aligns with the new contract. ### 7.1 Add `scripts/check-spawn-env-discipline.cjs` Pattern from `plans/01-hook-io-discipline.md` Phase 6 (`scripts/check-hook-io-discipline.cjs`): ```js #!/usr/bin/env node // Forbid raw process.env in subprocess spawn calls. Every spawn must use // sanitizeEnv(...) and (where credentials are involved) buildIsolatedEnv*. const { execSync } = require('node:child_process'); const VIOLATIONS = []; // Find every `spawn(` / `spawnSync(` / `child_process.spawn(` call in src/ const grep = execSync( `grep -rEn "spawn(Sync)?\\(" src/ | grep -v "node_modules" | grep -v "\\.test\\."`, { encoding: 'utf8' }, ); for (const line of grep.split('\n').filter(Boolean)) { // Allow if the same logical block contains sanitizeEnv // (heuristic: read 5 lines after the match in the source file) const [filePath, lineNumStr] = line.split(':', 2); const lineNum = Number.parseInt(lineNumStr, 10); const src = require('node:fs').readFileSync(filePath, 'utf8').split('\n'); const window = src.slice(lineNum - 1, lineNum + 8).join('\n'); if (!/sanitizeEnv\s*\(/.test(window)) { VIOLATIONS.push(`${filePath}:${lineNum} — spawn without sanitizeEnv`); } } if (VIOLATIONS.length > 0) { console.error('Spawn-env discipline check FAILED:'); VIOLATIONS.forEach(v => console.error(' ' + v)); process.exit(1); } console.log('Spawn-env discipline check passed.'); ``` Wire to `package.json` `scripts.test:env-discipline`. Add to CI alongside existing hook checks. ### 7.2 Edit `CLAUDE.md` — document the `~/.claude-mem/.env` contract Add a section under "Configuration": ```markdown ### Anthropic Credentials (proxies, gateways, BigModel, etc.) For non-OAuth Anthropic credentials (proxies / gateways / `ANTHROPIC_AUTH_TOKEN` / `ANTHROPIC_API_KEY`), put them in `~/.claude-mem/.env`: \``` ANTHROPIC_BASE_URL=https://your-proxy.example ANTHROPIC_AUTH_TOKEN=your-token \``` The file is read at worker spawn time and re-injected into the SDK subprocess. **Parent-shell exports of these variables are intentionally ignored** — they are in `BLOCKED_ENV_VARS` to prevent host-config bleed-through (#2375). If you only have an OAuth subscription, no `.env` is needed; the worker reads the token from your keychain at spawn time. ``` ### 7.3 Verification checklist (Phase 7) - [ ] `npm run test:env-discipline` passes on the post-fix tree. - [ ] CI pipeline runs the new check. - [ ] CLAUDE.md section exists and accurately reflects the new contract. ### 7.4 Anti-pattern guards - Do NOT extend the CI check to flag every `process.env` read — only `spawn*()` call sites need `sanitizeEnv`. Reads are fine. - Do NOT add the `.env` file path to `.gitignore` — it lives in `~/.claude-mem/`, not in the repo, so it's already outside. --- ## Cross-plan dependencies - **Plan 01 (Hook IO Discipline):** Independent. Both can be implemented in parallel. - **Plan 02 (Spawn-Contract Templating):** Independent. Both touch templating but at different layers. - **Plan 03 (Worker Lifecycle):** Phase 3.2's HTTP 400 classification removes a class of unbounded retries. Plan 03's "circuit breaker" + "stale-session sweep" handles other retry classes. Merge order: this plan first (small, surgical), then Plan 03. - **Plan 04 (Installer Transparency):** Independent. - **Plan 05 (Observer Tool Enforcement):** Adjacent — `KnowledgeAgent` is touched in both plans (this one for `getModelId`, Plan 05 for tool enforcement). Sequence Plan 05 first (security urgency), then Plan 06. ## Pre-/do checklist - [ ] Verify `BLOCKED_ENV_VARS` is still an `Array` and not converted to a `Set` (Phase 2 refactor risk). - [ ] Verify the existing test suite passes against current `main` before starting (`bun test`). - [ ] Re-confirm `effort` is still absent from `src/` (`grep -rn "effort" src/`) — if a future change adds the parameter, Phase 3.2's regex needs revisiting. - [ ] Read `node_modules/@anthropic-ai/claude-agent-sdk/sdk.d.ts` to confirm `query()` options does NOT support `effort` natively. If the SDK adds it, Phase 3.2's body-text regex still works as a fallback, but a code-level strip becomes the right fix. - [ ] Verify `~/.claude-mem/.env` permissions are `0o600` post-fix (the saver enforces this; readers should not weaken it).