claude-mem/plans/06-worker-env-isolation.md
Alex Newman a10d1b342f docs(plans): add architectural plan files for issues #2376-#2381
Six numbered plan documents covering:
- 01 Hook IO Discipline (#2376)
- 02 Spawn-Contract Templating (#2377)
- 03 Worker / Daemon Lifecycle Hardening (#2378)
- 04 Installer Failure Transparency (#2379)
- 05 Observer SDK Tool Enforcement (#2380)
- 06 Worker Env Isolation (#2381)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:31:02 -07:00

# Plan 06 — Worker Env Isolation
> **Goal:** Stop host-side environment variables from contaminating the worker's Anthropic SDK subprocess. Two confirmed bugs anchor this plan: `ANTHROPIC_BASE_URL` leaks from the parent shell while `ANTHROPIC_AUTH_TOKEN` is blocked, breaking proxy/gateway auth (#2375); and `CLAUDE_CODE_EFFORT_LEVEL` propagates from host CLI settings into the SDK subprocess where it triggers a permanent HTTP 400 that the retry classifier mistakes for transient (#2357). Adjacent feature #2289 (`$TIER` alias syntax) is in scope where it shares the same env/model-resolution surface.
>
> **Net effect:**
> - The OAuth-skip predicate requires a real credential (`ANTHROPIC_API_KEY` or `ANTHROPIC_AUTH_TOKEN`), not a bare `ANTHROPIC_BASE_URL`. Proxy/gateway users put credentials in `~/.claude-mem/.env`; nothing relies on parent-shell leaks.
> - `BLOCKED_ENV_VARS` adds `ANTHROPIC_BASE_URL` and the `CLAUDE_CODE_EFFORT_LEVEL` / `CLAUDE_CODE_ALWAYS_ENABLE_EFFORT` pair (defense in depth alongside the existing `env-sanitizer.ts` `CLAUDE_CODE_*` prefix filter).
> - The Claude provider's error classifier explicitly handles HTTP 400 as `unrecoverable`, matching `GeminiProvider`/`OpenRouterProvider`. No more unbounded retry loop on permanent-error responses.
> - Every spawn boundary that hands env to a child process applies BOTH `buildIsolatedEnv` and `sanitizeEnv`. A grep-based CI check forbids spawning subprocesses with raw `process.env`.
> - `~/.claude-mem/.env` becomes the single source of truth for non-OAuth Anthropic credentials. The loader's whitelist documents this contract.
>
> **Out of scope:**
> - Hook-side env handling (Plan 01 / 02 territory).
> - Worker daemon lifecycle, DB bloat, and chroma-mcp leaks (Plan 03).
> - Observer/Knowledge SDK tool enforcement (Plan 05).
> - Re-auth UX flow (different concern; out of scope for this plan).
> - General provider-router refactor — `$TIER` alias is scoped to model resolution only (Phase 4).
---
## Problem Statement (line citations)
### Bug A — `ANTHROPIC_BASE_URL` leaks, OAuth gets skipped, `ANTHROPIC_AUTH_TOKEN` is missing (#2375)
`src/shared/EnvManager.ts` lines 14-24 (`BLOCKED_ENV_VARS`):
```ts
const BLOCKED_ENV_VARS = [
  'ANTHROPIC_API_KEY', // #733
  'ANTHROPIC_AUTH_TOKEN', // added 5edf1557 (2026-05-04) — leak prevention
  'CLAUDECODE',
  'CLAUDE_CODE_OAUTH_TOKEN', // #2215
];
```
`ANTHROPIC_BASE_URL` is **not** in the list, so it survives `buildIsolatedEnv()` (lines 166-205) and reaches `isolatedEnv` from `process.env`.
`buildIsolatedEnvWithFreshOAuth()` (lines 222-288) then runs the OAuth-skip predicate at lines 237-244:
```ts
if (
  isolatedEnv.ANTHROPIC_API_KEY ||
  isolatedEnv.ANTHROPIC_BASE_URL ||
  isolatedEnv.ANTHROPIC_AUTH_TOKEN
) {
  clearStaleMarker();
  return isolatedEnv;
}
```
The bare `BASE_URL` branch was added in commit `a122d34e` (2026-05-04) under the rationale "tokenless gateways may exist." Combined with the `AUTH_TOKEN` block from `5edf1557` the same day, the subprocess ends up with:
- `ANTHROPIC_BASE_URL` ✅ (leaked from parent)
- `ANTHROPIC_AUTH_TOKEN` ❌ (blocked, never re-injected because `~/.claude-mem/.env` is empty for first-time proxy users)
- `CLAUDE_CODE_OAUTH_TOKEN` ❌ (skip path bypassed the keychain read)
Result: `Not logged in · Please run /login` from every SDK subprocess.
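The failure can be reproduced in miniature. The sketch below is a simplified model, not the real `EnvManager` code: `isolate` and `skipsOAuth` are hypothetical stand-ins for `buildIsolatedEnv` and the pre-fix predicate quoted above, and the proxy URL and token values are placeholders.

```typescript
// Simplified model of the pre-fix behavior; names other than
// BLOCKED_ENV_VARS are hypothetical stand-ins for the real functions.
type Env = Record<string, string | undefined>;

const BLOCKED_ENV_VARS = [
  'ANTHROPIC_API_KEY',
  'ANTHROPIC_AUTH_TOKEN',
  'CLAUDECODE',
  'CLAUDE_CODE_OAUTH_TOKEN',
];

// Drop blocked names from the parent env, then re-inject whatever
// ~/.claude-mem/.env provides (empty for a first-time proxy user).
function isolate(parentEnv: Env, dotEnv: Env): Env {
  const isolated: Env = {};
  for (const [key, value] of Object.entries(parentEnv)) {
    if (!BLOCKED_ENV_VARS.includes(key)) isolated[key] = value;
  }
  return { ...isolated, ...dotEnv };
}

// Pre-fix OAuth-skip predicate: a bare BASE_URL is enough to skip.
function skipsOAuth(env: Env): boolean {
  return Boolean(
    env.ANTHROPIC_API_KEY || env.ANTHROPIC_BASE_URL || env.ANTHROPIC_AUTH_TOKEN,
  );
}

const parentShell: Env = {
  ANTHROPIC_BASE_URL: 'https://proxy.example',
  ANTHROPIC_AUTH_TOKEN: 'parent-shell-token',
};
const isolated = isolate(parentShell, {});
const oauthSkipped = skipsOAuth(isolated);
const hasCredential = Boolean(
  isolated.ANTHROPIC_API_KEY ||
    isolated.ANTHROPIC_AUTH_TOKEN ||
    isolated.CLAUDE_CODE_OAUTH_TOKEN,
);
console.log({ oauthSkipped, hasCredential });
```

In this model the keychain lookup is skipped while no credential of any kind is present, which is the combination that produces the login error.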
### Bug B — `CLAUDE_CODE_EFFORT_LEVEL` triggers permanent 400 + unbounded retry (#2357)
The Anthropic SDK subprocess reads `CLAUDE_CODE_EFFORT_LEVEL` from its env and forwards it as the `effort` parameter on Messages API calls. claude-mem's source contains **zero** references to `effort` — the leak path is environmental, not code. Models without effort support (Haiku 4.5, Sonnet 4.5, older) reject with HTTP 400.
`src/supervisor/env-sanitizer.ts` lines 1-51 already filters `CLAUDE_CODE_*` via `ENV_PREFIXES` (with explicit allowances in `ENV_PRESERVE`). But:
1. `buildIsolatedEnv` does NOT call `sanitizeEnv` internally; callers are expected to chain them.
2. `BLOCKED_ENV_VARS` is the canonical leak deny-list and does not name `CLAUDE_CODE_EFFORT_LEVEL`. Defense-in-depth is currently single-layer.
3. The retry classifier in `src/services/worker/ClaudeProvider.ts` has no HTTP 400 case; the default branch at line 98 returns `kind: 'transient'`, so a permanent 400 loops forever.
`src/services/worker/GeminiProvider.ts` lines 89-94 and `src/services/worker/OpenRouterProvider.ts` lines 82-87 already classify 400 as `unrecoverable`; that pattern is the copy-target for ClaudeProvider.
### Adjacent — `$TIER` alias syntax (#2289)
`src/shared/SettingsDefaultsManager.ts` line 116 already implements a *portable* `'haiku'` alias for `CLAUDE_MEM_TIER_SIMPLE_MODEL` (per #1463). What's missing is the user-facing `$TIER` *syntax* in the `CLAUDE_MEM_MODEL` field that resolves to a provider-appropriate model at request time. Same code surface (model resolution in `ClaudeProvider.getModelId` at lines 442-446); minimal extension.
---
## Phase 0 — Documentation Discovery (already completed)
Findings below are direct file reads dated 2026-05-08. Each implementation phase cites by line number; do not re-derive. **Confidence: HIGH on file/API inventory.** Local-only files were read end-to-end.
### Allowed APIs / patterns to copy
| Item | Location | What to copy |
|---|---|---|
| `BLOCKED_ENV_VARS` array | `src/shared/EnvManager.ts:14-24` | Add new entries; keep the comment-per-entry convention |
| `buildIsolatedEnv` filter pattern | `src/shared/EnvManager.ts:166-205` | Filter on `BLOCKED_ENV_VARS.includes(key)`; defensive `delete isolatedEnv.X` post-filter |
| `buildIsolatedEnvWithFreshOAuth` skip-check | `src/shared/EnvManager.ts:237-244` | Restrict predicate to real credentials only |
| `loadClaudeMemEnv` whitelist + `ClaudeMemEnv` interface | `src/shared/EnvManager.ts:26-32, 79-100` | Single source of truth for what `~/.claude-mem/.env` accepts |
| `ENV_PRESERVE` / `ENV_EXACT_MATCHES` / `ENV_PREFIXES` | `src/supervisor/env-sanitizer.ts:1-51` | Whitelist-based env stripping; do NOT add `CLAUDE_CODE_EFFORT_LEVEL` to `ENV_PRESERVE` |
| Provider error classifier (HTTP 400 → unrecoverable) | `src/services/worker/GeminiProvider.ts:89-94`, `src/services/worker/OpenRouterProvider.ts:82-87` | Identical pattern to apply in `ClaudeProvider` |
| `ClassifiedProviderError` constructor + `kind: 'unrecoverable' \| 'auth_invalid' \| 'transient' \| 'rate_limit' \| 'quota_exhausted'` | `src/services/worker/retry.ts` | Use existing `kind` enum; do not invent `permanent` |
| `isRetryableKind` predicate | `src/services/worker/retry.ts:37-44` | Used by all retry sites; no edit needed once classifier is correct |
| Tier model resolution + `'haiku'` alias | `src/services/worker/http/routes/SessionRoutes.ts:503-521`, `src/shared/SettingsDefaultsManager.ts:51-53, 115-117` | Pattern for extending `$TIER` syntax |
| Settings flat-key + `loadFromFile` | `src/shared/SettingsDefaultsManager.ts:6-67, 70-131, 137-139, 161-206` | New keys MUST be added to interface AND `DEFAULTS` block |
| Plan format (phase numbering, line-cited edits, anti-patterns block) | `plans/01-hook-io-discipline.md`, `plans/05-observer-tool-enforcement.md` | Reuse layout |
### Anti-patterns / methods that DO NOT exist (avoid inventing)
- claude-mem source has **zero references** to `effort`, `CLAUDE_CODE_EFFORT_LEVEL`, `CLAUDE_CODE_ALWAYS_ENABLE_EFFORT`, or `reasoning_effort`. Do not "remove the effort parameter we forward" — there is none. The leak is the SDK subprocess reading the env var directly.
- `BLOCKED_ENV_VARS` is an `Array<string>` with `.includes` lookup. Do NOT convert to `Set` in the same change — that touches every caller and is an unrelated refactor.
- `ClassifiedProviderError.kind` does NOT support the value `'permanent'`. The existing enum is `'transient' | 'rate_limit' | 'unrecoverable' | 'auth_invalid' | 'quota_exhausted'`. Use `unrecoverable` for permanent 400s.
- `pending_messages` has **no `retry_count` column** (dropped — see `src/services/sqlite/SessionStore.ts:104`'s `deadColumns` array). Issue #2357's "retry counter climbed past #1874" refers to log-line numbering, not a DB counter. Do not add a counter as part of this plan; that's Plan 03 territory.
- `sanitizeEnv` is whitelist-based (preserves a fixed set; strips everything matching `CLAUDE_CODE_*` etc). It is NOT idempotent if you re-add a name to `ENV_PRESERVE`. Do not add `CLAUDE_CODE_EFFORT_LEVEL` to `ENV_PRESERVE` — that's the opposite of what we want.
- `buildIsolatedEnv` and `sanitizeEnv` are **independent layers**. Some callers chain (`sanitizeEnv(buildIsolatedEnv(...))`); some only use one. Do not assume chaining is universal — Phase 5 audits every spawn boundary.
- The `~/.claude-mem/.env` loader at `src/shared/EnvManager.ts:79-100` uses property-by-property assignment as an implicit whitelist. Do NOT replace with `Object.assign(result, parsed)` — that breaks the whitelist guarantee.
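The last guard is easiest to see side by side. This is a minimal sketch under stated assumptions: `ClaudeMemEnvSubset`, `pickWhitelisted`, and `pickEverything` are hypothetical names, and the real `ClaudeMemEnv` interface likely carries more keys than the three shown.

```typescript
// Hypothetical subset of the ClaudeMemEnv interface (assumption: the real
// interface has more keys than these three).
interface ClaudeMemEnvSubset {
  ANTHROPIC_API_KEY?: string;
  ANTHROPIC_AUTH_TOKEN?: string;
  ANTHROPIC_BASE_URL?: string;
}

// Property-by-property copy: only names written out here can be returned,
// so the function body doubles as the whitelist.
function pickWhitelisted(parsed: Record<string, string>): ClaudeMemEnvSubset {
  const result: ClaudeMemEnvSubset = {};
  if (parsed.ANTHROPIC_API_KEY) result.ANTHROPIC_API_KEY = parsed.ANTHROPIC_API_KEY;
  if (parsed.ANTHROPIC_AUTH_TOKEN) result.ANTHROPIC_AUTH_TOKEN = parsed.ANTHROPIC_AUTH_TOKEN;
  if (parsed.ANTHROPIC_BASE_URL) result.ANTHROPIC_BASE_URL = parsed.ANTHROPIC_BASE_URL;
  return result;
}

// The anti-pattern: every parsed key is copied, including unexpected ones.
function pickEverything(parsed: Record<string, string>): Record<string, string> {
  const result: Record<string, string> = {};
  Object.assign(result, parsed);
  return result;
}

const parsedDotEnv: Record<string, string> = {
  ANTHROPIC_AUTH_TOKEN: 'test-token',
  CLAUDE_CODE_EFFORT_LEVEL: 'MAX', // should never pass through the loader
};
const safeKeys = Object.keys(pickWhitelisted(parsedDotEnv));
const leakyKeys = Object.keys(pickEverything(parsedDotEnv));
console.log({ safeKeys, leakyKeys });
```

With the whitelist form, a stray `CLAUDE_CODE_EFFORT_LEVEL` line in the file is dropped silently; with `Object.assign` it would be re-injected into the subprocess env.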
### File inventory used by this plan
| File | Lines | Disposition |
|---|---|---|
| `src/shared/EnvManager.ts` | 319 | Edited heavily (Phase 2, Phase 5) |
| `src/supervisor/env-sanitizer.ts` | 51 | Light edit (Phase 3 — comment change only; `CLAUDE_CODE_*` prefix already filters EFFORT_LEVEL) |
| `src/services/worker/ClaudeProvider.ts` | 448 | Edited (Phase 3 — error classifier on `query()` rejection path) |
| `src/services/worker/retry.ts` | small | Confirm-only (Phase 3 — `isRetryableKind` already correct) |
| `src/services/worker/GeminiProvider.ts` | reference only | Read for pattern (Phase 3) |
| `src/services/worker/OpenRouterProvider.ts` | reference only | Read for pattern (Phase 3) |
| `src/shared/SettingsDefaultsManager.ts` | 209 | Edited (Phase 4 — `$TIER` alias resolution) |
| `src/services/worker/http/routes/SessionRoutes.ts` | reference | Read tier-routing pattern (Phase 4) |
| `src/services/infrastructure/ProcessManager.ts` | line 415 | Audit (Phase 5) — confirm `sanitizeEnv` chain is sufficient |
| `src/services/sync/ChromaMcpManager.ts` | line 585 | Audit (Phase 5) |
| `src/supervisor/process-registry.ts` | line 539 | Audit (Phase 5) |
| `src/services/worker-service.ts` | line 412 | Audit (Phase 5) |
| `src/services/worker/knowledge/KnowledgeAgent.ts` | lines 54, 149 | Confirm-only (Phase 5) |
| `tests/env-isolation.test.ts` | NEW | CREATED (Phase 6) |
| `scripts/check-spawn-env-discipline.cjs` | NEW | CREATED (Phase 7) |
| `CLAUDE.md` | small | Edited (Phase 7 — document `~/.claude-mem/.env` contract) |
---
## Phase 1 — Audit & write the failing tests first
**Goal:** Pin down current behavior with red tests so the fix can prove itself green. No production-code changes in this phase.
### 1.1 Tests to add (`tests/env-isolation.test.ts`)
Use `bun:test` per `package.json` `"test": "bun test"`. Pattern from `tests/claude-provider-resume.test.ts:1`.
1. **`buildIsolatedEnvWithFreshOAuth strips ANTHROPIC_BASE_URL when no .env credentials are configured`**
- Stub `process.env.ANTHROPIC_BASE_URL = 'https://proxy.example'`, no `~/.claude-mem/.env`, no API_KEY/AUTH_TOKEN in env.
- Call `buildIsolatedEnvWithFreshOAuth()`.
- Assert: result does NOT have `ANTHROPIC_BASE_URL` (post-fix). Currently RED.
2. **`OAuth-skip does not fire on bare ANTHROPIC_BASE_URL`**
- Same setup. Spy on `readClaudeOAuthToken`.
- Assert: `readClaudeOAuthToken` was called (because BASE_URL alone is not enough to skip). Currently RED — `readClaudeOAuthToken` is NOT called today.
3. **`ANTHROPIC_AUTH_TOKEN from ~/.claude-mem/.env reaches the isolated env`**
- Write a temp `.env` with `ANTHROPIC_AUTH_TOKEN=test-token` and `ANTHROPIC_BASE_URL=https://proxy.example`.
- Assert: `isolatedEnv.ANTHROPIC_AUTH_TOKEN === 'test-token'` AND `isolatedEnv.ANTHROPIC_BASE_URL === 'https://proxy.example'`. Currently GREEN (already works); test guards against regression.
4. **`CLAUDE_CODE_EFFORT_LEVEL is stripped from the isolated env`**
- Stub `process.env.CLAUDE_CODE_EFFORT_LEVEL = 'MAX'`.
- Assert: `sanitizeEnv(buildIsolatedEnv())` does NOT contain `CLAUDE_CODE_EFFORT_LEVEL`. Currently GREEN via `env-sanitizer.ENV_PREFIXES`; test guards.
5. **`CLAUDE_CODE_EFFORT_LEVEL is in BLOCKED_ENV_VARS for defense-in-depth`**
- Assert: `BLOCKED_ENV_VARS.includes('CLAUDE_CODE_EFFORT_LEVEL')`. Currently RED.
6. **`HTTP 400 from Claude SDK is classified unrecoverable`**
- Construct an error matching the SDK's 400 shape (`error.status === 400`, body contains `does not support the effort parameter`).
- Assert: `classifyClaudeProviderError(err).kind === 'unrecoverable'`. Currently RED — falls through to `transient`.
7. **`HTTP 400 with effort-parameter body emits a once-only warn log`**
- Same setup as 6, plus capture `logger.warn` calls.
- Assert: warn fires once with category `SDK` and a hint pointing at #2357 / `~/.claude-mem/.env`. Currently RED.
### 1.2 Verification checklist (Phase 1)
- [ ] All 7 tests added; tests 1, 2, 5, 6, 7 are RED; tests 3, 4 are GREEN.
- [ ] `bun test tests/env-isolation.test.ts` runs cleanly (RED tests fail with the expected assertion, no other errors).
- [ ] No production-code changes in this phase (`git diff src/` empty).
### 1.3 Anti-pattern guards
- Do NOT mock `EnvManager.buildIsolatedEnv` — it's the unit under test.
- Do NOT use `vi.*` (project uses `bun:test`, not vitest).
- Do NOT skip cleanup of temp `.env` files. Use a per-test `beforeEach`/`afterEach` with `mkdtempSync`.
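The cleanup guard can be sketched with plain Node APIs. Everything here is illustrative: `withTempEnvFile` and `parseEnvFile` are hypothetical helpers, and how the real loader is pointed at a temporary directory is not specified by this plan. In the actual test file the same setup and teardown would sit in `beforeEach`/`afterEach`.

```typescript
import { existsSync, mkdtempSync, readFileSync, rmSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

let lastTempDir = '';

// Hypothetical helper: run `body` against a throwaway .env file and always
// remove the directory afterwards, even when `body` throws.
function withTempEnvFile<T>(contents: string, body: (envPath: string) => T): T {
  const dir = mkdtempSync(join(tmpdir(), 'claude-mem-env-'));
  lastTempDir = dir;
  const envPath = join(dir, '.env');
  try {
    writeFileSync(envPath, contents, { mode: 0o600 });
    return body(envPath);
  } finally {
    rmSync(dir, { recursive: true, force: true });
  }
}

// Minimal KEY=VALUE parser standing in for the real loader (assumption).
function parseEnvFile(envPath: string): Record<string, string> {
  const parsed: Record<string, string> = {};
  for (const line of readFileSync(envPath, 'utf8').split('\n')) {
    const trimmed = line.trim();
    if (!trimmed || trimmed.startsWith('#')) continue;
    const eq = trimmed.indexOf('=');
    if (eq > 0) parsed[trimmed.slice(0, eq)] = trimmed.slice(eq + 1);
  }
  return parsed;
}

const loaded = withTempEnvFile(
  'ANTHROPIC_AUTH_TOKEN=test-token\nANTHROPIC_BASE_URL=https://proxy.example\n',
  parseEnvFile,
);
console.log(loaded.ANTHROPIC_AUTH_TOKEN, loaded.ANTHROPIC_BASE_URL);
```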
---
## Phase 2 — Fix #2375 (BASE_URL leak + OAuth-skip predicate)
**Goal:** Make the OAuth-skip require a real credential, and add `ANTHROPIC_BASE_URL` to the deny-list so it can only be configured via `~/.claude-mem/.env`.
### 2.1 Edit `src/shared/EnvManager.ts:14-24` — extend `BLOCKED_ENV_VARS`
**Before:**
```ts
const BLOCKED_ENV_VARS = [
  'ANTHROPIC_API_KEY',
  'ANTHROPIC_AUTH_TOKEN',
  'CLAUDECODE',
  'CLAUDE_CODE_OAUTH_TOKEN',
];
```
**After (add `ANTHROPIC_BASE_URL`):**
```ts
const BLOCKED_ENV_VARS = [
  'ANTHROPIC_API_KEY', // #733
  'ANTHROPIC_AUTH_TOKEN', // 5edf1557 — leak prevention; re-injected from ~/.claude-mem/.env when configured
  'ANTHROPIC_BASE_URL', // #2375 — same leak class as AUTH_TOKEN; re-injected from ~/.claude-mem/.env. Without this entry, a leaked BASE_URL alone triggered the OAuth-skip while no auth credential reached the subprocess.
  'CLAUDECODE',
  'CLAUDE_CODE_OAUTH_TOKEN', // #2215
];
```
### 2.2 Edit `src/shared/EnvManager.ts:237-244` — restrict OAuth-skip to real credentials
**Before:**
```ts
if (
  isolatedEnv.ANTHROPIC_API_KEY ||
  isolatedEnv.ANTHROPIC_BASE_URL ||
  isolatedEnv.ANTHROPIC_AUTH_TOKEN
) {
  clearStaleMarker();
  return isolatedEnv;
}
```
**After:**
```ts
// Skip OAuth lookup ONLY when a real credential is configured. A bare
// ANTHROPIC_BASE_URL is not a credential — every documented gateway needs
// either an AUTH_TOKEN or an API_KEY. This guards #2375 against a class of
// leaks where a parent shell exports BASE_URL (e.g. for the Claude Code CLI
// itself) while no token is present.
if (isolatedEnv.ANTHROPIC_API_KEY || isolatedEnv.ANTHROPIC_AUTH_TOKEN) {
  clearStaleMarker();
  return isolatedEnv;
}
```
### 2.3 Verify the `~/.claude-mem/.env` re-injection at `src/shared/EnvManager.ts:178-195`
Currently the loader path covers BASE_URL re-injection from `.env`. Confirm by reading the function. No code change required here, but add a TS comment block above lines 178-195 documenting the new contract:
```ts
// Contract (post-#2375): ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, and
// ANTHROPIC_API_KEY are *only* populated from ~/.claude-mem/.env. They are
// in BLOCKED_ENV_VARS so parent-shell values never leak through.
```
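A compact way to check the contract is to model both scenarios after the Phase 2 edits. The sketch below is not the production code: `isolatePostFix` and `shouldSkipOAuth` are hypothetical stand-ins, and the `PATH`, URL, and token values are placeholders.

```typescript
type Env = Record<string, string | undefined>;

// Deny-list as it stands after the 2.1 edit (per-entry comments omitted).
const BLOCKED_ENV_VARS = [
  'ANTHROPIC_API_KEY',
  'ANTHROPIC_AUTH_TOKEN',
  'ANTHROPIC_BASE_URL',
  'CLAUDECODE',
  'CLAUDE_CODE_OAUTH_TOKEN',
];

// Hypothetical stand-in for buildIsolatedEnv(): block first, then re-inject
// only what ~/.claude-mem/.env supplies.
function isolatePostFix(parentEnv: Env, dotEnv: Env): Env {
  const isolated: Env = {};
  for (const [key, value] of Object.entries(parentEnv)) {
    if (!BLOCKED_ENV_VARS.includes(key)) isolated[key] = value;
  }
  return { ...isolated, ...dotEnv };
}

// Post-fix predicate from 2.2: only real credentials skip the OAuth lookup.
function shouldSkipOAuth(env: Env): boolean {
  return Boolean(env.ANTHROPIC_API_KEY || env.ANTHROPIC_AUTH_TOKEN);
}

const parentEnv: Env = { PATH: '/usr/bin', ANTHROPIC_BASE_URL: 'https://leaked.example' };

// Scenario 1: parent shell exports BASE_URL, .env is empty.
const leakedOnly = isolatePostFix(parentEnv, {});

// Scenario 2: proxy user configured both values in ~/.claude-mem/.env.
const configured = isolatePostFix(parentEnv, {
  ANTHROPIC_BASE_URL: 'https://proxy.example',
  ANTHROPIC_AUTH_TOKEN: 'test-token',
});

console.log(shouldSkipOAuth(leakedOnly), shouldSkipOAuth(configured));
```

In scenario 1 the leaked URL is gone and the keychain lookup still runs; in scenario 2 the `.env` values win over the parent shell and the lookup is skipped.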
### 2.4 Verification checklist (Phase 2)
- [ ] Tests 1, 2 from Phase 1 now GREEN.
- [ ] Existing test suite still passes (`bun test`).
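- [ ] `grep -n "ANTHROPIC_BASE_URL" src/shared/EnvManager.ts` shows entries in `BLOCKED_ENV_VARS`, the `ClaudeMemEnv` interface, the loader, and the re-injection block. It no longer appears in the OAuth-skip condition itself; the explanatory comment added in 2.2 still mentions the name, so expect a comment-only hit there.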
- [ ] Smoke: with a `~/.claude-mem/.env` containing `ANTHROPIC_BASE_URL=...` and `ANTHROPIC_AUTH_TOKEN=...`, the worker actually authenticates against the proxy. Test with BigModel or any sandboxed proxy.
### 2.5 Anti-pattern guards
- Do NOT add `ANTHROPIC_BASE_URL` to `ENV_PRESERVE` in `env-sanitizer.ts`. `BLOCKED_ENV_VARS` is the right layer; `env-sanitizer` is a downstream filter.
- Do NOT keep the BASE_URL branch in the OAuth-skip predicate on the grounds that "tokenless gateways may exist": every documented gateway requires a token. The skip path was a misdesign.
- Do NOT delete the existing `delete isolatedEnv.CLAUDE_CODE_OAUTH_TOKEN` defensive line at line 229. That guard is intact; it's belt-and-suspenders for #2215 and orthogonal to this plan.
---
## Phase 3 — Fix #2357 (CLAUDE_CODE_EFFORT_LEVEL leak + 400 retry classification)
**Goal:** Two-layer defense for the env leak (existing `CLAUDE_CODE_*` prefix filter + new `BLOCKED_ENV_VARS` entries), plus a permanent classification for the resulting HTTP 400 so the retry loop terminates if the leak ever sneaks past either layer.
### 3.1 Edit `src/shared/EnvManager.ts:14-24` — add EFFORT entries to `BLOCKED_ENV_VARS`
After the Phase 2 edit, the list is:
```ts
const BLOCKED_ENV_VARS = [
  'ANTHROPIC_API_KEY',
  'ANTHROPIC_AUTH_TOKEN',
  'ANTHROPIC_BASE_URL',
  'CLAUDECODE',
  'CLAUDE_CODE_OAUTH_TOKEN',
  // #2357 — host CLI config, not part of the plugin's contract. The
  // env-sanitizer's CLAUDE_CODE_* prefix filter strips these for spawn paths
  // that go through it, but BLOCKED_ENV_VARS is the canonical deny-list and
  // belongs in defense-in-depth.
  'CLAUDE_CODE_EFFORT_LEVEL',
  'CLAUDE_CODE_ALWAYS_ENABLE_EFFORT',
];
```
### 3.2 Edit `src/services/worker/ClaudeProvider.ts` — classify HTTP 400 as unrecoverable
Locate the existing error-classification path. The Anthropic SDK raises errors with `error.status` and a body containing the failure description. Pattern from `src/services/worker/GeminiProvider.ts:89-94` (the canonical copy-target):
```ts
if (status === 400) {
  return new ClassifiedProviderError(
    `Gemini bad request (status 400)`,
    { kind: 'unrecoverable', cause: input.cause },
  );
}
```
Add the equivalent in `ClaudeProvider`'s error classifier (new function or existing — read the file; create if absent, mirroring `GeminiProvider` shape):
```ts
function classifyClaudeProviderError(input: { cause: unknown }): ClassifiedProviderError {
  const err = input.cause;
  const status = (err as { status?: number })?.status;
  const bodyText = String((err as { message?: string })?.message ?? '');

  // Permanent: SDK rejected the request itself. Most common cause in the wild
  // is a leaked CLAUDE_CODE_EFFORT_LEVEL the SDK subprocess forwarded as
  // `effort` against a model that doesn't support it (#2357). The leak is
  // also blocked at BLOCKED_ENV_VARS + env-sanitizer; this classifier ends
  // the retry loop if either layer is bypassed.
  if (status === 400) {
    if (/effort parameter/i.test(bodyText)) {
      logger.warn(
        'SDK',
        'Claude API rejected effort parameter — likely CLAUDE_CODE_EFFORT_LEVEL leaked into SDK env (issue #2357). Configure CLAUDE_MEM_MODEL or set credentials in ~/.claude-mem/.env.',
        { status, bodyText },
      );
    }
    return new ClassifiedProviderError(
      `Claude bad request (status 400): ${bodyText}`,
      { kind: 'unrecoverable', cause: input.cause },
    );
  }

  // 401 / 403 → auth_invalid (existing pattern from GeminiProvider:96-103)
  if (status === 401 || status === 403) {
    return new ClassifiedProviderError(
      `Claude auth rejected (status ${status})`,
      { kind: 'auth_invalid', cause: input.cause },
    );
  }

  // 429 → rate_limit
  if (status === 429) {
    return new ClassifiedProviderError(
      `Claude rate limited (status 429)`,
      { kind: 'rate_limit', cause: input.cause },
    );
  }

  // Default: transient (preserves the existing fall-through behavior).
  return new ClassifiedProviderError(
    `Claude SDK error: ${bodyText}`,
    { kind: 'transient', cause: input.cause },
  );
}
```
Wire this classifier into the existing `try { ... } catch` around `query(...)` in `ClaudeProvider.ts`. **Read the actual catch shape before editing** — the function lives near lines 180-195 and the existing `for await` over `queryResult` is where rejections surface.
### 3.3 Confirm `src/supervisor/env-sanitizer.ts` already strips `CLAUDE_CODE_EFFORT_LEVEL`
Read lines 1-51. Verify:
- `ENV_PREFIXES` includes `'CLAUDE_CODE_'`.
- `ENV_PRESERVE` does NOT include `CLAUDE_CODE_EFFORT_LEVEL`, `CLAUDE_CODE_ALWAYS_ENABLE_EFFORT`.
Add an inline comment at the `ENV_PREFIXES` declaration:
```ts
// Filters CLAUDE_CODE_* unless explicitly preserved in ENV_PRESERVE.
// This is layer 2 of defense for #2357 — layer 1 is BLOCKED_ENV_VARS in EnvManager.
```
No code change to behavior here.
### 3.4 Verification checklist (Phase 3)
- [ ] Tests 5, 6, 7 from Phase 1 now GREEN.
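- [ ] `grep -rn "CLAUDE_CODE_EFFORT_LEVEL" src/` returns hits only in `EnvManager.ts` (`BLOCKED_ENV_VARS`) and in the `ClaudeProvider.ts` classifier from 3.2 (its comment and warn hint). The test file lives under `tests/`, so it is outside this search.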
- [ ] Reproduce #2357 scenario locally:
```bash
CLAUDE_CODE_EFFORT_LEVEL=MAX bun run src/services/worker-service.ts --daemon
# Observe: no `effort` parameter on outgoing requests.
```
- [ ] If a 400 is forced (e.g., via a mocked SDK reject), the retry loop terminates after the first attempt; `logger.warn` fires once.
### 3.5 Anti-pattern guards
- Do NOT add a separate "permanent error" enum value — `kind: 'unrecoverable'` already exists and is the right slot.
- Do NOT regex on the entire error stack — `error.status === 400` is the deterministic signal; the body text check is purely for the user-facing log hint.
- Do NOT log inside `classifyClaudeProviderError` for every 400 — only the effort-parameter sub-case warrants a hint. Generic 400s are noisy enough at the call site.
- Do NOT mark all 400s with body matching `/effort/i` as `auth_invalid` — that would trigger the "re-login" flow incorrectly. Use `unrecoverable`.
- Do NOT rely on the SDK supporting an `effort` SDK-option that we strip. The SDK type does not expose `effort`; the leak is the SDK's own subprocess (`pathToClaudeCodeExecutable`) reading the env var. Stripping at our env layer is the only fix we control.
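Why the classification matters for the retry loop can be shown with a reduced model. The sketch below makes several assumptions: the body of `isRetryableKind` is a guess at the real predicate in `retry.ts`, `classifyStatus` keeps only the status-code branches of the classifier, and `countAttempts` is a hypothetical loop with an artificial cap so the "pre-fix" case terminates.

```typescript
type ErrorKind =
  | 'transient'
  | 'rate_limit'
  | 'unrecoverable'
  | 'auth_invalid'
  | 'quota_exhausted';

// Assumption: the real isRetryableKind may differ; here only transient and
// rate_limit errors are treated as retryable.
function isRetryableKind(kind: ErrorKind): boolean {
  return kind === 'transient' || kind === 'rate_limit';
}

// Status-code branches of the Phase 3.2 classifier, without logging.
function classifyStatus(status: number | undefined): ErrorKind {
  if (status === 400) return 'unrecoverable';
  if (status === 401 || status === 403) return 'auth_invalid';
  if (status === 429) return 'rate_limit';
  return 'transient';
}

// Hypothetical retry loop: every simulated request fails with `status`.
// The cap exists only so the sketch terminates in the pre-fix case.
function countAttempts(
  status: number,
  classify: (status: number) => ErrorKind,
  maxAttempts = 5,
): number {
  let attempts = 0;
  while (attempts < maxAttempts) {
    attempts += 1;
    if (!isRetryableKind(classify(status))) break;
  }
  return attempts;
}

// Pre-fix: the default branch labels every error, including 400, transient.
const preFixAttempts = countAttempts(400, () => 'transient');
// Post-fix: 400 is unrecoverable, so the loop stops after one attempt.
const postFixAttempts = countAttempts(400, classifyStatus);
console.log({ preFixAttempts, postFixAttempts });
```

Without the artificial cap, the pre-fix variant would never exit on a permanent 400, which matches the unbounded retry described in #2357.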
---
## Phase 4 — `$TIER` alias syntax (#2289)
**Goal:** Allow `CLAUDE_MEM_MODEL=$TIER:summary` (and similar) to resolve at request time to a provider-appropriate model, reusing the existing `'haiku'` portable alias machinery (line 116, #1463). Optional phase; can be deferred without blocking Phase 2/3.
### 4.1 Edit `src/shared/SettingsDefaultsManager.ts` — extend tier interface
Add to the `SettingsDefaults` interface near lines 51-53:
```ts
CLAUDE_MEM_TIER_FAST_MODEL: string; // for $TIER:fast — defaults to 'haiku'
CLAUDE_MEM_TIER_SMART_MODEL: string; // for $TIER:smart — defaults to 'sonnet' (or provider-equivalent)
```
Add to the `DEFAULTS` block near lines 115-117:
```ts
CLAUDE_MEM_TIER_FAST_MODEL: 'haiku',
CLAUDE_MEM_TIER_SMART_MODEL: 'sonnet',
```
### 4.2 Edit `src/services/worker/ClaudeProvider.ts:442-446` — add `$TIER` resolution
Replace `getModelId()`:
```ts
private getModelId(): string {
  const settingsPath = paths.settings();
  const settings = SettingsDefaultsManager.loadFromFile(settingsPath);
  return resolveTierAlias(settings.CLAUDE_MEM_MODEL, settings);
}
```
Add `resolveTierAlias` to a shared util (`src/services/worker/model-aliases.ts`, NEW):
```ts
import type { SettingsDefaults } from '../../shared/SettingsDefaultsManager';

// Anchored on both ends: only an exact `$TIER:<name>` value is an alias.
const TIER_PATTERN = /^\$TIER:(fast|smart|simple|summary)$/;

export function resolveTierAlias(model: string, settings: SettingsDefaults): string {
  const match = TIER_PATTERN.exec(model);
  if (!match) return model;
  switch (match[1]) {
    case 'fast': return settings.CLAUDE_MEM_TIER_FAST_MODEL || 'haiku';
    case 'smart': return settings.CLAUDE_MEM_TIER_SMART_MODEL || 'sonnet';
    case 'simple': return settings.CLAUDE_MEM_TIER_SIMPLE_MODEL || 'haiku';
    // NOTE: when `model` is itself settings.CLAUDE_MEM_MODEL (the getModelId
    // call path) and the summary tier is unset, this fallback returns the
    // unresolved `$TIER:summary` string. Pick a concrete default before shipping.
    case 'summary': return settings.CLAUDE_MEM_TIER_SUMMARY_MODEL || settings.CLAUDE_MEM_MODEL;
    default: return model;
  }
}
```
### 4.3 Same call site in `KnowledgeAgent.ts:149` (`getModelId`)
Apply the same `resolveTierAlias` wrap. Knowledge agent uses the same settings path.
### 4.4 Verification checklist (Phase 4)
- [ ] New test: `resolveTierAlias('$TIER:fast', settings)` returns `settings.CLAUDE_MEM_TIER_FAST_MODEL`.
- [ ] New test: `resolveTierAlias('claude-haiku-4-5-20251001', settings)` returns input unchanged (non-tier passthrough).
- [ ] Setting `CLAUDE_MEM_MODEL=$TIER:fast` and starting the worker actually queries against the fast-tier model.
- [ ] Documentation updated in `docs/public/configuration.mdx` with the four tier aliases.
### 4.5 Anti-pattern guards
- Do NOT match `$TIER:*` greedily — the regex is anchored.
- Do NOT add `$PROVIDER:` or `$MODEL:` aliases in this phase — out of scope; one syntax at a time.
- Do NOT mutate `settings` inside `resolveTierAlias`; pure function only.
- Do NOT resolve the alias at settings-load time — resolve at *request* time so users can edit settings without restarting the worker.
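The guards above can be exercised with a reduced resolver. This sketch covers only the `fast` and `smart` tiers; `TierSettings` and `resolveTier` are hypothetical, trimmed-down versions of `SettingsDefaults` and `resolveTierAlias`, and the model names in the settings objects are illustrative values.

```typescript
// Hypothetical subset of SettingsDefaults, enough for two tiers.
interface TierSettings {
  CLAUDE_MEM_MODEL: string;
  CLAUDE_MEM_TIER_FAST_MODEL: string;
  CLAUDE_MEM_TIER_SMART_MODEL: string;
}

// Anchored pattern: a value must be exactly `$TIER:fast` or `$TIER:smart`.
const TIER_PATTERN = /^\$TIER:(fast|smart)$/;

// Pure function: reads settings, never mutates them.
function resolveTier(model: string, settings: TierSettings): string {
  const match = TIER_PATTERN.exec(model);
  if (!match) return model;
  return match[1] === 'fast'
    ? settings.CLAUDE_MEM_TIER_FAST_MODEL || 'haiku'
    : settings.CLAUDE_MEM_TIER_SMART_MODEL || 'sonnet';
}

// Frozen so any accidental mutation inside the resolver would throw in
// strict mode.
const initial: TierSettings = Object.freeze({
  CLAUDE_MEM_MODEL: '$TIER:fast',
  CLAUDE_MEM_TIER_FAST_MODEL: 'haiku',
  CLAUDE_MEM_TIER_SMART_MODEL: 'sonnet',
});
// Simulates the user editing settings while the worker keeps running.
const edited: TierSettings = { ...initial, CLAUDE_MEM_TIER_FAST_MODEL: 'sonnet' };

const aliasResolved = resolveTier(initial.CLAUDE_MEM_MODEL, initial);
const afterEdit = resolveTier(edited.CLAUDE_MEM_MODEL, edited);
const passthrough = resolveTier('claude-haiku-4-5-20251001', initial);
const notAnAlias = resolveTier('$TIER:fast-preview', initial);
console.log({ aliasResolved, afterEdit, passthrough, notAnAlias });
```

Because resolution happens on each call, the second lookup reflects the edited settings without any reload step, and the anchored pattern leaves `$TIER:fast-preview` untouched.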
---
## Phase 5 — Cross-spawn-boundary audit
**Goal:** Every place claude-mem spawns a subprocess must apply both `buildIsolatedEnv` (or the async variant) AND `sanitizeEnv`. A grep-based check codifies the rule.
### 5.1 Audit table — current state per call site
| File | Line | Spawn target | Env construction | Sufficient? |
|---|---|---|---|---|
| `src/services/worker/ClaudeProvider.ts` | 155 | Anthropic SDK subprocess | `sanitizeEnv(await buildIsolatedEnvWithFreshOAuth())` | ✅ |
| `src/services/worker/knowledge/KnowledgeAgent.ts` | 54, 149 | Knowledge SDK subprocess | `sanitizeEnv(await buildIsolatedEnvWithFreshOAuth())` | ✅ |
| `src/services/infrastructure/ProcessManager.ts` | 415 | Worker daemon | `sanitizeEnv({...process.env, CLAUDE_MEM_WORKER_PORT, ...extraEnv})` | ⚠️ daemon inherits parent env then sanitizes — does not pass through `buildIsolatedEnv`. **Document why this is OK**: daemon is the trust boundary; parent env IS the truth. But it should still strip `CLAUDE_CODE_EFFORT_LEVEL` via the prefix filter. Confirm. |
| `src/services/sync/ChromaMcpManager.ts` | 585 | chroma-mcp subprocess | `sanitizeEnv(process.env)` | ⚠️ same as above. |
| `src/supervisor/process-registry.ts` | 539 | Generic spawn factory | `sanitizeEnv(options.env ?? process.env)` | ⚠️ same. |
| `src/services/worker-service.ts` | 412 | MCP server subprocess | `sanitizeEnv(process.env)` | ⚠️ same. |
For the worker-daemon and downstream MCP/chroma spawns, parent-process env IS the source of truth — they are pre-credential paths. As long as `CLAUDE_CODE_EFFORT_LEVEL` and the Anthropic credentials are stripped (which `sanitizeEnv` does via `CLAUDE_CODE_*` prefix and the existing `ANTHROPIC_AUTH_TOKEN` block), behavior is correct. The plan does not change these paths — it adds tests that prove they stay correct.
### 5.2 Add audit test — `tests/env-isolation.test.ts`
8. **`every documented spawn site applies sanitizeEnv`**
- Read each file from the audit table.
- Assert: each line cited contains `sanitizeEnv(`. Currently GREEN; test prevents regression.
9. **`worker-daemon spawn env does not contain CLAUDE_CODE_EFFORT_LEVEL`**
- Stub `process.env.CLAUDE_CODE_EFFORT_LEVEL = 'MAX'`.
- Construct the env block as ProcessManager.ts:415 does.
- Assert: result does not contain `CLAUDE_CODE_EFFORT_LEVEL`. Currently GREEN.
### 5.3 Verification checklist (Phase 5)
- [ ] Tests 8, 9 GREEN.
- [ ] No new spawn sites introduced; if any are added by accident, the CI check (Phase 7) flags them.
### 5.4 Anti-pattern guards
- Do NOT add `buildIsolatedEnv` calls to ProcessManager / ChromaMcpManager / MCP server spawn paths. They legitimately need parent-shell `PATH`, `HOME`, etc. — those would be wiped by the credential-isolated builder.
- Do NOT consolidate the two layers into one helper "for clarity" — they have distinct contracts and are layered intentionally.
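For the daemon-style spawn paths, the prefix filter alone is what keeps `CLAUDE_CODE_EFFORT_LEVEL` out. The sketch below models that layer only. The list contents are assumptions (the real `ENV_PREFIXES` and `ENV_PRESERVE` in `env-sanitizer.ts` are longer, and whether `CLAUDE_CODE_OAUTH_TOKEN` is preserved there is a guess), `sanitizeSketch` is a hypothetical stand-in for `sanitizeEnv`, and the port number is a placeholder.

```typescript
type Env = Record<string, string | undefined>;

// Assumed contents, for illustration only.
const ENV_PREFIXES = ['CLAUDE_CODE_'];
const ENV_PRESERVE = new Set(['CLAUDE_CODE_OAUTH_TOKEN']);

// Hypothetical stand-in for sanitizeEnv(): strip prefixed names unless they
// are explicitly preserved; everything else passes through untouched.
function sanitizeSketch(env: Env): Env {
  const out: Env = {};
  for (const [key, value] of Object.entries(env)) {
    const stripped =
      !ENV_PRESERVE.has(key) && ENV_PREFIXES.some((prefix) => key.startsWith(prefix));
    if (!stripped) out[key] = value;
  }
  return out;
}

// Daemon-style construction: parent env plus worker-specific additions,
// mirroring the shape cited for ProcessManager in the audit table.
const parentEnv: Env = {
  PATH: '/usr/bin',
  HOME: '/home/dev',
  CLAUDE_CODE_EFFORT_LEVEL: 'MAX',
};
const daemonEnv = sanitizeSketch({ ...parentEnv, CLAUDE_MEM_WORKER_PORT: '37777' });
console.log(Object.keys(daemonEnv));
```

`PATH` and `HOME` survive, which is why these paths must not be routed through the credential-isolating builder, while the effort variable is removed by the prefix match.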
---
## Phase 6 — Test the full integration end-to-end
**Goal:** Smoke test the proxy/gateway path so we know the fix works in the real world.
### 6.1 Manual smoke (BigModel proxy or any equivalent)
```bash
# Setup:
cat > ~/.claude-mem/.env <<'EOF'
ANTHROPIC_BASE_URL=https://open.bigmodel.cn/api/anthropic
ANTHROPIC_AUTH_TOKEN=<your-bigmodel-token>
EOF
chmod 600 ~/.claude-mem/.env
# Reset worker:
npm run build-and-sync
pkill -f worker-service.cjs
# Trigger:
# In any Claude Code session, use any tool — PostToolUse hook should land an observation.
# Verify:
tail -f ~/.claude-mem/logs/claude-mem-$(date +%Y-%m-%d).log
# Expect: no "Not logged in" errors; observations land via the proxy.
```
### 6.2 Manual smoke (CLAUDE_CODE_EFFORT_LEVEL leak)
```bash
# Setup:
export CLAUDE_CODE_EFFORT_LEVEL=MAX
export CLAUDE_CODE_ALWAYS_ENABLE_EFFORT=true
# Restart Claude Code so the env propagates to the hook subprocess.
# Verify:
tail -f ~/.claude-mem/logs/claude-mem-$(date +%Y-%m-%d).log
# Expect: NO repeated "API Error: 400 This model does not support the effort parameter."
# Expect: NO "PARSER returned non-XML response; marking messages as failed for retry".
```
### 6.3 Verification checklist (Phase 6)
- [ ] Both smoke scenarios pass.
- [ ] `bun test` is green.
- [ ] One iteration on a fresh machine confirms `~/.claude-mem/.env` is the only knob users need for proxy auth.
---
## Phase 7 — CI guard + documentation
**Goal:** A grep-based CI check rejects PRs that introduce a subprocess spawn without `sanitizeEnv`. Documentation aligns with the new contract.
### 7.1 Add `scripts/check-spawn-env-discipline.cjs`
Pattern from `plans/01-hook-io-discipline.md` Phase 6 (`scripts/check-hook-io-discipline.cjs`):
```js
#!/usr/bin/env node
// Forbid raw process.env in subprocess spawn calls. Every spawn must use
// sanitizeEnv(...) and (where credentials are involved) buildIsolatedEnv*.
const { execSync } = require('node:child_process');
const { readFileSync } = require('node:fs');

const VIOLATIONS = [];

// Find every `spawn(` / `spawnSync(` / `child_process.spawn(` call in src/.
// grep exits with status 1 when nothing matches, which makes execSync throw;
// treat that case as "no spawn calls found" rather than a crash.
let grep = '';
try {
  grep = execSync(
    `grep -rEn "spawn(Sync)?\\(" src/ | grep -v "node_modules" | grep -v "\\.test\\."`,
    { encoding: 'utf8' },
  );
} catch (err) {
  if (err.status !== 1) throw err;
}

for (const line of grep.split('\n').filter(Boolean)) {
  // Allow if the same logical block contains sanitizeEnv
  // (heuristic: scan the matching line plus the 8 lines after it).
  const [filePath, lineNumStr] = line.split(':', 2);
  const lineNum = Number.parseInt(lineNumStr, 10);
  const src = readFileSync(filePath, 'utf8').split('\n');
  const window = src.slice(lineNum - 1, lineNum + 8).join('\n');
  if (!/sanitizeEnv\s*\(/.test(window)) {
    VIOLATIONS.push(`${filePath}:${lineNum} — spawn without sanitizeEnv`);
  }
}

if (VIOLATIONS.length > 0) {
  console.error('Spawn-env discipline check FAILED:');
  VIOLATIONS.forEach(v => console.error(' ' + v));
  process.exit(1);
}
console.log('Spawn-env discipline check passed.');
```
Wire to `package.json` `scripts.test:env-discipline`. Add to CI alongside existing hook checks.
### 7.2 Edit `CLAUDE.md` — document the `~/.claude-mem/.env` contract
Add a section under "Configuration":
```markdown
### Anthropic Credentials (proxies, gateways, BigModel, etc.)
For non-OAuth Anthropic credentials (proxies / gateways / `ANTHROPIC_AUTH_TOKEN` / `ANTHROPIC_API_KEY`), put them in `~/.claude-mem/.env`:
\```
ANTHROPIC_BASE_URL=https://your-proxy.example
ANTHROPIC_AUTH_TOKEN=your-token
\```
The file is read at worker spawn time and re-injected into the SDK subprocess. **Parent-shell exports of these variables are intentionally ignored** — they are in `BLOCKED_ENV_VARS` to prevent host-config bleed-through (#2375).
If you only have an OAuth subscription, no `.env` is needed; the worker reads the token from your keychain at spawn time.
```
### 7.3 Verification checklist (Phase 7)
- [ ] `npm run test:env-discipline` passes on the post-fix tree.
- [ ] CI pipeline runs the new check.
- [ ] CLAUDE.md section exists and accurately reflects the new contract.
### 7.4 Anti-pattern guards
- Do NOT extend the CI check to flag every `process.env` read — only `spawn*()` call sites need `sanitizeEnv`. Reads are fine.
- Do NOT add the `.env` file path to `.gitignore` — it lives in `~/.claude-mem/`, not in the repo, so it's already outside.
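The window heuristic is easier to unit-test when pulled out as a pure function over source text. This is a sketch rather than a replacement for the script: `findSpawnViolations` is a hypothetical name, and the nine-line window (the matching line plus the eight after it) mirrors the slice in 7.1.

```typescript
// Returns the 1-based line numbers of spawn calls that have no
// `sanitizeEnv(` within the scan window starting at the call.
function findSpawnViolations(source: string, windowSize = 9): number[] {
  const lines = source.split('\n');
  const violations: number[] = [];
  lines.forEach((line, index) => {
    if (!/spawn(Sync)?\(/.test(line)) return;
    const snippet = lines.slice(index, index + windowSize).join('\n');
    if (!/sanitizeEnv\s*\(/.test(snippet)) violations.push(index + 1);
  });
  return violations;
}

const compliant = [
  "const child = spawn('node', args, {",
  '  env: sanitizeEnv(process.env),',
  '});',
].join('\n');

const rawEnv = [
  "const child = spawn('node', args, {",
  '  env: process.env,',
  '});',
].join('\n');

const compliantViolations = findSpawnViolations(compliant);
const rawEnvViolations = findSpawnViolations(rawEnv);
console.log({ compliantViolations, rawEnvViolations });
```

Like the script, this is a textual heuristic: it also matches `spawn(` inside comments or strings, and it misses a `sanitizeEnv` call that sits more than eight lines below the spawn, so a flagged line is a prompt to look rather than proof of a leak.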
---
## Cross-plan dependencies
- **Plan 01 (Hook IO Discipline):** Independent. Both can be implemented in parallel.
- **Plan 02 (Spawn-Contract Templating):** Independent. Both touch templating but at different layers.
- **Plan 03 (Worker Lifecycle):** Phase 3.2's HTTP 400 classification removes a class of unbounded retries. Plan 03's "circuit breaker" + "stale-session sweep" handles other retry classes. Merge order: this plan first (small, surgical), then Plan 03.
- **Plan 04 (Installer Transparency):** Independent.
- **Plan 05 (Observer Tool Enforcement):** Adjacent — `KnowledgeAgent` is touched in both plans (this one for `getModelId`, Plan 05 for tool enforcement). Sequence Plan 05 first (security urgency), then Plan 06.
## Pre-/do checklist
- [ ] Verify `BLOCKED_ENV_VARS` is still an `Array<string>` and not converted to a `Set` (Phase 2 refactor risk).
- [ ] Verify the existing test suite passes against current `main` before starting (`bun test`).
- [ ] Re-confirm `effort` is still absent from `src/` (`grep -rn "effort" src/`) — if a future change adds the parameter, Phase 3.2's regex needs revisiting.
- [ ] Read `node_modules/@anthropic-ai/claude-agent-sdk/sdk.d.ts` to confirm `query()` options does NOT support `effort` natively. If the SDK adds it, Phase 3.2's body-text regex still works as a fallback, but a code-level strip becomes the right fix.
- [ ] Verify `~/.claude-mem/.env` permissions are `0o600` post-fix (the saver enforces this; readers should not weaken it).
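The permissions item can be checked mechanically on POSIX systems. The sketch below runs against a temporary file rather than the real `~/.claude-mem/.env`; `hasOwnerOnlyMode` is a hypothetical helper, and mode bits are not meaningful in the same way on Windows.

```typescript
import { chmodSync, mkdtempSync, rmSync, statSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// True when the permission bits are exactly rw------- (0o600).
function hasOwnerOnlyMode(filePath: string): boolean {
  return (statSync(filePath).mode & 0o777) === 0o600;
}

const dir = mkdtempSync(join(tmpdir(), 'claude-mem-perm-'));
const envPath = join(dir, '.env');
let strictOk = false;
let looseOk = true;
try {
  writeFileSync(envPath, 'ANTHROPIC_AUTH_TOKEN=test-token\n');
  chmodSync(envPath, 0o600); // explicit chmod, independent of the umask
  strictOk = hasOwnerOnlyMode(envPath);
  chmodSync(envPath, 0o644); // simulate a reader or tool weakening the mode
  looseOk = hasOwnerOnlyMode(envPath);
} finally {
  rmSync(dir, { recursive: true, force: true });
}
console.log({ strictOk, looseOk });
```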