Files

T

Alex Newman a10d1b342f docs(plans): add architectural plan files for issues #2376-#2381

Six numbered plan documents covering:
- 01 Hook IO Discipline (#2376)
- 02 Spawn-Contract Templating (#2377)
- 03 Worker / Daemon Lifecycle Hardening (#2378)
- 04 Installer Failure Transparency (#2379)
- 05 Observer SDK Tool Enforcement (#2380)
- 06 Worker Env Isolation (#2381)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-10 00:31:02 -07:00

33 KiB

Raw Blame History

Plan 06 — Worker Env Isolation

Goal: Stop host-side environment variables from contaminating the worker's Anthropic SDK subprocess. Two confirmed bugs anchor this plan: ANTHROPIC_BASE_URL leaks from the parent shell while ANTHROPIC_AUTH_TOKEN is blocked, breaking proxy/gateway auth (#2375); and CLAUDE_CODE_EFFORT_LEVEL propagates from host CLI settings into the SDK subprocess where it triggers a permanent HTTP 400 that the retry classifier mistakes for transient (#2357). Adjacent feature #2289 ($TIER alias syntax) is in scope where it shares the same env/model-resolution surface.

Net effect:

The OAuth-skip predicate requires a real credential (ANTHROPIC_API_KEY or ANTHROPIC_AUTH_TOKEN), not a bare ANTHROPIC_BASE_URL. Proxy/gateway users put credentials in ~/.claude-mem/.env; nothing relies on parent-shell leaks.

BLOCKED_ENV_VARS adds ANTHROPIC_BASE_URL and the CLAUDE_CODE_EFFORT_LEVEL / CLAUDE_CODE_ALWAYS_ENABLE_EFFORT pair (defense in depth alongside the existing env-sanitizer.ts CLAUDE_CODE_* prefix filter).

The Claude provider's error classifier explicitly handles HTTP 400 as unrecoverable, matching GeminiProvider/OpenRouterProvider. No more unbounded retry loop on permanent-error responses.

Every spawn boundary that hands env to a child process applies BOTH buildIsolatedEnv and sanitizeEnv. A grep-based CI check forbids spawning subprocesses with raw process.env.

~/.claude-mem/.env becomes the single source of truth for non-OAuth Anthropic credentials. The loader's whitelist documents this contract.

Out of scope:

Hook-side env handling (Plan 01 / 02 territory).

Worker daemon lifecycle, DB bloat, and chroma-mcp leaks (Plan 03).

Observer/Knowledge SDK tool enforcement (Plan 05).

Re-auth UX flow (different concern; out of scope for this plan).

General provider-router refactor — $TIER alias is scoped to model resolution only (Phase 4).

Problem Statement (line citations)

Bug A — `ANTHROPIC_BASE_URL` leaks, OAuth gets skipped, `ANTHROPIC_AUTH_TOKEN` is missing (#2375)

src/shared/EnvManager.ts lines 14–24 (BLOCKED_ENV_VARS):

const BLOCKED_ENV_VARS = [
  'ANTHROPIC_API_KEY',       // #733
  'ANTHROPIC_AUTH_TOKEN',    // added 5edf1557 (2026-05-04) — leak prevention
  'CLAUDECODE',
  'CLAUDE_CODE_OAUTH_TOKEN', // #2215
];

ANTHROPIC_BASE_URL is not in the list, so it survives buildIsolatedEnv() (lines 166–205) and reaches isolatedEnv from process.env.

buildIsolatedEnvWithFreshOAuth() lines 222–288 then runs the OAuth-skip predicate at lines 237–244:

if (
  isolatedEnv.ANTHROPIC_API_KEY ||
  isolatedEnv.ANTHROPIC_BASE_URL ||
  isolatedEnv.ANTHROPIC_AUTH_TOKEN
) {
  clearStaleMarker();
  return isolatedEnv;
}

The bare BASE_URL branch was added in commit a122d34e (2026-05-04) under the rationale "tokenless gateways may exist." Combined with the AUTH_TOKEN block from 5edf1557 the same day, the subprocess ends up with:

ANTHROPIC_BASE_URL ✅ (leaked from parent)
ANTHROPIC_AUTH_TOKEN ❌ (blocked, never re-injected because ~/.claude-mem/.env is empty for first-time proxy users)
CLAUDE_CODE_OAUTH_TOKEN ❌ (skip path bypassed the keychain read)

Result: Not logged in · Please run /login from every SDK subprocess.

Bug B — `CLAUDE_CODE_EFFORT_LEVEL` triggers permanent 400 + unbounded retry (#2357)

The Anthropic SDK subprocess reads CLAUDE_CODE_EFFORT_LEVEL from its env and forwards it as the effort parameter on Messages API calls. claude-mem's source contains zero references to effort — the leak path is environmental, not code. Models without effort support (Haiku 4.5, Sonnet 4.5, older) reject with HTTP 400.

src/supervisor/env-sanitizer.ts lines 1–51 already filters CLAUDE_CODE_* via ENV_PREFIXES (with explicit allowances in ENV_PRESERVE). But:

buildIsolatedEnv does NOT call sanitizeEnv internally; callers are expected to chain them.
BLOCKED_ENV_VARS is the canonical leak deny-list and does not name CLAUDE_CODE_EFFORT_LEVEL. Defense-in-depth is currently single-layer.
The retry classifier in src/services/worker/ClaudeProvider.ts has no HTTP 400 case; the default branch at line 98 returns kind: 'transient', so a permanent 400 loops forever.

src/services/worker/GeminiProvider.ts lines 89–94 and src/services/worker/OpenRouterProvider.ts lines 82–87 already classify 400 as unrecoverable; that pattern is the copy-target for ClaudeProvider.

Adjacent — `$TIER` alias syntax (#2289)

src/shared/SettingsDefaultsManager.ts line 116 already implements a portable 'haiku' alias for CLAUDE_MEM_TIER_SIMPLE_MODEL (per #1463). What's missing is the user-facing $TIER syntax in the CLAUDE_MEM_MODEL field that resolves to a provider-appropriate model at request time. Same code surface (model resolution in ClaudeProvider.getModelId at lines 442–446); minimal extension.

Phase 0 — Documentation Discovery (already completed)

Findings below are direct file reads dated 2026-05-08. Each implementation phase cites by line number; do not re-derive. Confidence: HIGH on file/API inventory. Local-only files were read end-to-end.

Allowed APIs / patterns to copy

Item	Location	What to copy
`BLOCKED_ENV_VARS` array	`src/shared/EnvManager.ts:14–24`	Add new entries; keep the comment-per-entry convention
`buildIsolatedEnv` filter pattern	`src/shared/EnvManager.ts:166–205`	Filter on `BLOCKED_ENV_VARS.includes(key)`; defensive `delete isolatedEnv.X` post-filter
`buildIsolatedEnvWithFreshOAuth` skip-check	`src/shared/EnvManager.ts:237–244`	Restrict predicate to real credentials only
`loadClaudeMemEnv` whitelist + `ClaudeMemEnv` interface	`src/shared/EnvManager.ts:26–32, 79–100`	Single source of truth for what `~/.claude-mem/.env` accepts
`ENV_PRESERVE` / `ENV_EXACT_MATCHES` / `ENV_PREFIXES`	`src/supervisor/env-sanitizer.ts:1–51`	Whitelist-based env stripping; do NOT add `CLAUDE_CODE_EFFORT_LEVEL` to `ENV_PRESERVE`
Provider error classifier (HTTP 400 → unrecoverable)	`src/services/worker/GeminiProvider.ts:89–94`, `src/services/worker/OpenRouterProvider.ts:82–87`	Identical pattern to apply in `ClaudeProvider`
`ClassifiedProviderError` constructor + `kind: 'unrecoverable' \| 'auth_invalid' \| 'transient' \| 'rate_limit' \| 'quota_exhausted'`	`src/services/worker/retry.ts`	Use existing `kind` enum; do not invent `permanent`
`isRetryableKind` predicate	`src/services/worker/retry.ts:37–44`	Used by all retry sites; no edit needed once classifier is correct
Tier model resolution + `'haiku'` alias	`src/services/worker/http/routes/SessionRoutes.ts:503–521`, `src/shared/SettingsDefaultsManager.ts:51–53, 115–117`	Pattern for extending `$TIER` syntax
Settings flat-key + `loadFromFile`	`src/shared/SettingsDefaultsManager.ts:6–67, 70–131, 137–139, 161–206`	New keys MUST be added to interface AND `DEFAULTS` block
Plan format (phase numbering, line-cited edits, anti-patterns block)	`plans/01-hook-io-discipline.md`, `plans/05-observer-tool-enforcement.md`	Reuse layout

Anti-patterns / methods that DO NOT exist (avoid inventing)

claude-mem source has zero references to effort, CLAUDE_CODE_EFFORT_LEVEL, CLAUDE_CODE_ALWAYS_ENABLE_EFFORT, or reasoning_effort. Do not "remove the effort parameter we forward" — there is none. The leak is the SDK subprocess reading the env var directly.
BLOCKED_ENV_VARS is an Array<string> with .includes lookup. Do NOT convert to Set in the same change — that touches every caller and is an unrelated refactor.
ClassifiedProviderError.kind does NOT support the value 'permanent'. The existing enum is 'transient' | 'rate_limit' | 'unrecoverable' | 'auth_invalid' | 'quota_exhausted'. Use unrecoverable for permanent 400s.
pending_messages has no retry_count column (dropped — see src/services/sqlite/SessionStore.ts:104's deadColumns array). Issue #2357's "retry counter climbed past #1874" refers to log-line numbering, not a DB counter. Do not add a counter as part of this plan; that's Plan 03 territory.
sanitizeEnv is whitelist-based (preserves a fixed set; strips everything matching CLAUDE_CODE_* etc). It is NOT idempotent if you re-add a name to ENV_PRESERVE. Do not add CLAUDE_CODE_EFFORT_LEVEL to ENV_PRESERVE — that's the opposite of what we want.
buildIsolatedEnv and sanitizeEnv are independent layers. Some callers chain (sanitizeEnv(buildIsolatedEnv(...))); some only use one. Do not assume chaining is universal — Phase 5 audits every spawn boundary.
The ~/.claude-mem/.env loader at src/shared/EnvManager.ts:79–100 uses property-by-property assignment as an implicit whitelist. Do NOT replace with Object.assign(result, parsed) — that breaks the whitelist guarantee.

File inventory used by this plan

File	Lines	Disposition
`src/shared/EnvManager.ts`	319	Edited heavily (Phase 2, Phase 5)
`src/supervisor/env-sanitizer.ts`	51	Light edit (Phase 3 — comment change only; `CLAUDE_CODE_*` prefix already filters EFFORT_LEVEL)
`src/services/worker/ClaudeProvider.ts`	448	Edited (Phase 3 — error classifier on `query()` rejection path)
`src/services/worker/retry.ts`	small	Confirm-only (Phase 3 — `isRetryableKind` already correct)
`src/services/worker/GeminiProvider.ts`	reference only	Read for pattern (Phase 3)
`src/services/worker/OpenRouterProvider.ts`	reference only	Read for pattern (Phase 3)
`src/shared/SettingsDefaultsManager.ts`	209	Edited (Phase 4 — `$TIER` alias resolution)
`src/services/worker/http/routes/SessionRoutes.ts`	reference	Read tier-routing pattern (Phase 4)
`src/services/infrastructure/ProcessManager.ts`	line 415	Audit (Phase 5) — confirm `sanitizeEnv` chain is sufficient
`src/services/sync/ChromaMcpManager.ts`	line 585	Audit (Phase 5)
`src/supervisor/process-registry.ts`	line 539	Audit (Phase 5)
`src/services/worker-service.ts`	line 412	Audit (Phase 5)
`src/services/worker/knowledge/KnowledgeAgent.ts`	lines 54, 149	Confirm-only (Phase 5)
`tests/env-isolation.test.ts`	NEW	CREATED (Phase 6)
`scripts/check-spawn-env-discipline.cjs`	NEW	CREATED (Phase 7)
`CLAUDE.md`	small	Edited (Phase 7 — document `~/.claude-mem/.env` contract)

Phase 1 — Audit & write the failing tests first

Goal: Pin down current behavior with red tests so the fix can prove itself green. No production-code changes in this phase.

1.1 Tests to add (`tests/env-isolation.test.ts`)

Use bun:test per package.json "test": "bun test". Pattern from tests/claude-provider-resume.test.ts:1.

buildIsolatedEnvWithFreshOAuth strips ANTHROPIC_BASE_URL when no .env credentials are configured
- Stub process.env.ANTHROPIC_BASE_URL = 'https://proxy.example', no ~/.claude-mem/.env, no API_KEY/AUTH_TOKEN in env.
- Call buildIsolatedEnvWithFreshOAuth().
- Assert: result does NOT have ANTHROPIC_BASE_URL (post-fix). Currently RED.
OAuth-skip does not fire on bare ANTHROPIC_BASE_URL
- Same setup. Spy on readClaudeOAuthToken.
- Assert: readClaudeOAuthToken was called (because BASE_URL alone is not enough to skip). Currently RED — readClaudeOAuthToken is NOT called today.
ANTHROPIC_AUTH_TOKEN from ~/.claude-mem/.env reaches the isolated env
- Write a temp .env with ANTHROPIC_AUTH_TOKEN=test-token and ANTHROPIC_BASE_URL=https://proxy.example.
- Assert: isolatedEnv.ANTHROPIC_AUTH_TOKEN === 'test-token' AND isolatedEnv.ANTHROPIC_BASE_URL === 'https://proxy.example'. Currently GREEN (already works); test guards against regression.
CLAUDE_CODE_EFFORT_LEVEL is stripped from the isolated env
- Stub process.env.CLAUDE_CODE_EFFORT_LEVEL = 'MAX'.
- Assert: sanitizeEnv(buildIsolatedEnv()) does NOT contain CLAUDE_CODE_EFFORT_LEVEL. Currently GREEN via env-sanitizer.ENV_PREFIXES; test guards.
CLAUDE_CODE_EFFORT_LEVEL is in BLOCKED_ENV_VARS for defense-in-depth
- Assert: BLOCKED_ENV_VARS.includes('CLAUDE_CODE_EFFORT_LEVEL'). Currently RED.
HTTP 400 from Claude SDK is classified unrecoverable
- Construct an error matching the SDK's 400 shape (error.status === 400, body contains does not support the effort parameter).
- Assert: classifyClaudeProviderError(err).kind === 'unrecoverable'. Currently RED — falls through to transient.
HTTP 400 with effort-parameter body emits a once-only warn log
- Same setup as 6, plus capture logger.warn calls.
- Assert: warn fires once with category SDK and a hint pointing at #2357 / ~/.claude-mem/.env. Currently RED.

1.2 Verification checklist (Phase 1)

All 7 tests added; tests 1, 2, 5, 6, 7 are RED; tests 3, 4 are GREEN.
bun test tests/env-isolation.test.ts runs cleanly (RED tests fail with the expected assertion, no other errors).
No production-code changes in this phase (git diff src/ empty).

1.3 Anti-pattern guards

Do NOT mock EnvManager.buildIsolatedEnv — it's the unit under test.
Do NOT use vi.* (project uses bun:test, not vitest).
Do NOT skip cleanup of temp .env files. Use a per-test beforeEach/afterEach with mkdtempSync.

Phase 2 — Fix #2375 (BASE_URL leak + OAuth-skip predicate)

Goal: Make the OAuth-skip require a real credential, and add ANTHROPIC_BASE_URL to the deny-list so it can only be configured via ~/.claude-mem/.env.

2.1 Edit `src/shared/EnvManager.ts:14–24` — extend `BLOCKED_ENV_VARS`

Before:

const BLOCKED_ENV_VARS = [
  'ANTHROPIC_API_KEY',
  'ANTHROPIC_AUTH_TOKEN',
  'CLAUDECODE',
  'CLAUDE_CODE_OAUTH_TOKEN',
];

After (add ANTHROPIC_BASE_URL):

const BLOCKED_ENV_VARS = [
  'ANTHROPIC_API_KEY',       // #733
  'ANTHROPIC_AUTH_TOKEN',    // 5edf1557 — leak prevention; re-injected from ~/.claude-mem/.env when configured
  'ANTHROPIC_BASE_URL',      // #2375 — same leak class as AUTH_TOKEN; re-injected from ~/.claude-mem/.env. Without this entry, a leaked BASE_URL alone triggered the OAuth-skip while no auth credential reached the subprocess.
  'CLAUDECODE',
  'CLAUDE_CODE_OAUTH_TOKEN', // #2215
];

2.2 Edit `src/shared/EnvManager.ts:237–244` — restrict OAuth-skip to real credentials

Before:

if (
  isolatedEnv.ANTHROPIC_API_KEY ||
  isolatedEnv.ANTHROPIC_BASE_URL ||
  isolatedEnv.ANTHROPIC_AUTH_TOKEN
) {
  clearStaleMarker();
  return isolatedEnv;
}

After:

// Skip OAuth lookup ONLY when a real credential is configured. A bare
// ANTHROPIC_BASE_URL is not a credential — every documented gateway needs
// either an AUTH_TOKEN or an API_KEY. This guards #2375 against a class of
// leaks where a parent shell exports BASE_URL (e.g. for the Claude Code CLI
// itself) while no token is present.
if (isolatedEnv.ANTHROPIC_API_KEY || isolatedEnv.ANTHROPIC_AUTH_TOKEN) {
  clearStaleMarker();
  return isolatedEnv;
}

2.3 Verify the `~/.claude-mem/.env` re-injection at `src/shared/EnvManager.ts:178–195`

Currently the loader path covers BASE_URL re-injection from .env. Confirm by reading the function. No code change required here, but add a TS comment block above lines 178–195 documenting the new contract:

// Contract (post-#2375): ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, and
// ANTHROPIC_API_KEY are *only* populated from ~/.claude-mem/.env. They are
// in BLOCKED_ENV_VARS so parent-shell values never leak through.

2.4 Verification checklist (Phase 2)

Tests 1, 2 from Phase 1 now GREEN.
Existing test suite still passes (bun test).
grep -n "ANTHROPIC_BASE_URL" src/shared/EnvManager.ts shows entries at: BLOCKED_ENV_VARS, ClaudeMemEnv interface, loader, re-injection, OAuth-skip predicate (NOT in skip predicate).
Smoke: with a ~/.claude-mem/.env containing ANTHROPIC_BASE_URL=... and ANTHROPIC_AUTH_TOKEN=..., the worker actually authenticates against the proxy. Test with BigModel or any sandboxed proxy.

2.5 Anti-pattern guards

Do NOT add ANTHROPIC_BASE_URL to ENV_PRESERVE in env-sanitizer.ts — BLOCKED_ENV_VARS is the right layer; env-sanitizer is a downstream filter.
Do NOT keep the BASE_URL branch in the OAuth-skip predicate "for tokenless gateways may exist" — every documented gateway requires a token. The skip path was a misdesign.
Do NOT delete the existing delete isolatedEnv.CLAUDE_CODE_OAUTH_TOKEN defensive line at line 229. That guard is intact; it's belt-and-suspenders for #2215 and orthogonal to this plan.

Phase 3 — Fix #2357 (CLAUDE_CODE_EFFORT_LEVEL leak + 400 retry classification)

Goal: Two-layer defense for the env leak (existing CLAUDE_CODE_* prefix filter + new BLOCKED_ENV_VARS entries), plus a permanent classification for the resulting HTTP 400 so the retry loop terminates if the leak ever sneaks past either layer.

3.1 Edit `src/shared/EnvManager.ts:14–24` — add EFFORT entries to `BLOCKED_ENV_VARS`

After the Phase 2 edit, the list is:

const BLOCKED_ENV_VARS = [
  'ANTHROPIC_API_KEY',
  'ANTHROPIC_AUTH_TOKEN',
  'ANTHROPIC_BASE_URL',
  'CLAUDECODE',
  'CLAUDE_CODE_OAUTH_TOKEN',
  // #2357 — host CLI config, not part of the plugin's contract. The
  // env-sanitizer's CLAUDE_CODE_* prefix filter strips these for spawn paths
  // that go through it, but BLOCKED_ENV_VARS is the canonical deny-list and
  // belongs in defense-in-depth.
  'CLAUDE_CODE_EFFORT_LEVEL',
  'CLAUDE_CODE_ALWAYS_ENABLE_EFFORT',
];

3.2 Edit `src/services/worker/ClaudeProvider.ts` — classify HTTP 400 as unrecoverable

Locate the existing error-classification path. The Anthropic SDK raises errors with error.status and a body containing the failure description. Pattern from src/services/worker/GeminiProvider.ts:89–94 (the canonical copy-target):

if (status === 400) {
  return new ClassifiedProviderError(
    `Gemini bad request (status 400)`,
    { kind: 'unrecoverable', cause: input.cause },
  );
}

Add the equivalent in ClaudeProvider's error classifier (new function or existing — read the file; create if absent, mirroring GeminiProvider shape):

function classifyClaudeProviderError(input: { cause: unknown }): ClassifiedProviderError {
  const err = input.cause;
  const status = (err as { status?: number })?.status;
  const bodyText = String((err as { message?: string })?.message ?? '');

  // Permanent: SDK rejected the request itself. Most common cause in the wild
  // is a leaked CLAUDE_CODE_EFFORT_LEVEL the SDK subprocess forwarded as
  // `effort` against a model that doesn't support it (#2357). The leak is
  // also blocked at BLOCKED_ENV_VARS + env-sanitizer; this classifier ends
  // the retry loop if either layer is bypassed.
  if (status === 400) {
    if (/effort parameter/i.test(bodyText)) {
      logger.warn(
        'SDK',
        'Claude API rejected effort parameter — likely CLAUDE_CODE_EFFORT_LEVEL leaked into SDK env (issue #2357). Configure CLAUDE_MEM_MODEL or set credentials in ~/.claude-mem/.env.',
        { status, bodyText },
      );
    }
    return new ClassifiedProviderError(
      `Claude bad request (status 400): ${bodyText}`,
      { kind: 'unrecoverable', cause: input.cause },
    );
  }

  // 401 / 403 → auth_invalid (existing pattern from GeminiProvider:96-103)
  if (status === 401 || status === 403) {
    return new ClassifiedProviderError(
      `Claude auth rejected (status ${status})`,
      { kind: 'auth_invalid', cause: input.cause },
    );
  }

  // 429 → rate_limit
  if (status === 429) {
    return new ClassifiedProviderError(
      `Claude rate limited (status 429)`,
      { kind: 'rate_limit', cause: input.cause },
    );
  }

  // Default: transient (preserves the existing fall-through behavior).
  return new ClassifiedProviderError(
    `Claude SDK error: ${bodyText}`,
    { kind: 'transient', cause: input.cause },
  );
}

Wire this classifier into the existing try { ... } catch around query(...) in ClaudeProvider.ts. Read the actual catch shape before editing — the function lives near line 180–195 and the existing for await over queryResult is where rejections surface.

3.3 Confirm `src/supervisor/env-sanitizer.ts` already strips `CLAUDE_CODE_EFFORT_LEVEL`

Read lines 1–51. Verify:

ENV_PREFIXES includes 'CLAUDE_CODE_'.
ENV_PRESERVE does NOT include CLAUDE_CODE_EFFORT_LEVEL, CLAUDE_CODE_ALWAYS_ENABLE_EFFORT.

Add an inline comment at the ENV_PREFIXES declaration:

// Filters CLAUDE_CODE_* unless explicitly preserved in ENV_PRESERVE.
// This is layer 2 of defense for #2357 — layer 1 is BLOCKED_ENV_VARS in EnvManager.

No code change to behavior here.

3.4 Verification checklist (Phase 3)

Tests 5, 6, 7 from Phase 1 now GREEN.
grep -n "CLAUDE_CODE_EFFORT_LEVEL" src/ returns hits in EnvManager.ts (BLOCKED_ENV_VARS) and the test file. Nothing else.

Reproduce #2357 scenario locally:

CLAUDE_CODE_EFFORT_LEVEL=MAX bun run src/services/worker-service.ts --daemon
# Observe: no `effort` parameter on outgoing requests.

If a 400 is forced (e.g., via a mocked SDK reject), the retry loop terminates after the first attempt; logger.warn fires once.

3.5 Anti-pattern guards

Do NOT add a separate "permanent error" enum value — kind: 'unrecoverable' already exists and is the right slot.
Do NOT regex on the entire error stack — error.status === 400 is the deterministic signal; the body text check is purely for the user-facing log hint.
Do NOT log inside classifyClaudeProviderError for every 400 — only the effort-parameter sub-case warrants a hint. Generic 400s are noisy enough at the call site.
Do NOT mark all 400s with body matching /effort/i as auth_invalid — that would trigger the "re-login" flow incorrectly. Use unrecoverable.
Do NOT rely on the SDK supporting an effort SDK-option that we strip. The SDK type does not expose effort; the leak is the SDK's own subprocess (pathToClaudeCodeExecutable) reading the env var. Stripping at our env layer is the only fix we control.

Phase 4 — `$TIER` alias syntax (#2289)

Goal: Allow CLAUDE_MEM_MODEL=$TIER:summary (and similar) to resolve at request time to a provider-appropriate model, reusing the existing 'haiku' portable alias machinery (line 116, #1463). Optional phase; can be deferred without blocking Phase 2/3.

4.1 Edit `src/shared/SettingsDefaultsManager.ts` — extend tier interface

Add to the SettingsDefaults interface near lines 51–53:

CLAUDE_MEM_TIER_FAST_MODEL: string;     // for $TIER:fast — defaults to 'haiku'
CLAUDE_MEM_TIER_SMART_MODEL: string;    // for $TIER:smart — defaults to 'sonnet' (or provider-equivalent)

Add to the DEFAULTS block near lines 115–117:

CLAUDE_MEM_TIER_FAST_MODEL: 'haiku',
CLAUDE_MEM_TIER_SMART_MODEL: 'sonnet',

4.2 Edit `src/services/worker/ClaudeProvider.ts:442–446` — add `$TIER` resolution

Replace getModelId():

private getModelId(): string {
  const settingsPath = paths.settings();
  const settings = SettingsDefaultsManager.loadFromFile(settingsPath);
  return resolveTierAlias(settings.CLAUDE_MEM_MODEL, settings);
}

Add resolveTierAlias to a shared util (src/services/worker/model-aliases.ts, NEW):

import type { SettingsDefaults } from '../../shared/SettingsDefaultsManager';

const TIER_PATTERN = /^\$TIER:(fast|smart|simple|summary)$/;

export function resolveTierAlias(model: string, settings: SettingsDefaults): string {
  const match = TIER_PATTERN.exec(model);
  if (!match) return model;

  switch (match[1]) {
    case 'fast':    return settings.CLAUDE_MEM_TIER_FAST_MODEL || 'haiku';
    case 'smart':   return settings.CLAUDE_MEM_TIER_SMART_MODEL || 'sonnet';
    case 'simple':  return settings.CLAUDE_MEM_TIER_SIMPLE_MODEL || 'haiku';
    case 'summary': return settings.CLAUDE_MEM_TIER_SUMMARY_MODEL || settings.CLAUDE_MEM_MODEL;
    default:        return model;
  }
}

4.3 Same call site in `KnowledgeAgent.ts:149` (`getModelId`)

Apply the same resolveTierAlias wrap. Knowledge agent uses the same settings path.

4.4 Verification checklist (Phase 4)

New test: resolveTierAlias('$TIER:fast', settings) returns settings.CLAUDE_MEM_TIER_FAST_MODEL.
New test: resolveTierAlias('claude-haiku-4-5-20251001', settings) returns input unchanged (non-tier passthrough).
Setting CLAUDE_MEM_MODEL=$TIER:fast and starting the worker actually queries against the fast-tier model.
Documentation updated in docs/public/configuration.mdx with the four tier aliases.

4.5 Anti-pattern guards

Do NOT match $TIER:* greedily — the regex is anchored.
Do NOT add $PROVIDER: or $MODEL: aliases in this phase — out of scope; one syntax at a time.
Do NOT mutate settings inside resolveTierAlias; pure function only.
Do NOT resolve the alias at settings-load time — resolve at request time so users can edit settings without restarting the worker.

Phase 5 — Cross-spawn-boundary audit

Goal: Every place claude-mem spawns a subprocess must apply both buildIsolatedEnv (or the async variant) AND sanitizeEnv. A grep-based check codifies the rule.

5.1 Audit table — current state per call site

File	Line	Spawn target	Env construction	Sufficient?
`src/services/worker/ClaudeProvider.ts`	155	Anthropic SDK subprocess	`sanitizeEnv(await buildIsolatedEnvWithFreshOAuth())`	✅
`src/services/worker/knowledge/KnowledgeAgent.ts`	54, 149	Knowledge SDK subprocess	`sanitizeEnv(await buildIsolatedEnvWithFreshOAuth())`	✅
`src/services/infrastructure/ProcessManager.ts`	415	Worker daemon	`sanitizeEnv({...process.env, CLAUDE_MEM_WORKER_PORT, ...extraEnv})`	⚠️ daemon inherits parent env then sanitizes — does not pass through `buildIsolatedEnv`. Document why this is OK: daemon is the trust boundary; parent env IS the truth. But it should still strip `CLAUDE_CODE_EFFORT_LEVEL` via the prefix filter. Confirm.
`src/services/sync/ChromaMcpManager.ts`	585	chroma-mcp subprocess	`sanitizeEnv(process.env)`	⚠️ same as above.
`src/supervisor/process-registry.ts`	539	Generic spawn factory	`sanitizeEnv(options.env ?? process.env)`	⚠️ same.
`src/services/worker-service.ts`	412	MCP server subprocess	`sanitizeEnv(process.env)`	⚠️ same.

For the worker-daemon and downstream MCP/chroma spawns, parent-process env IS the source of truth — they are pre-credential paths. As long as CLAUDE_CODE_EFFORT_LEVEL and the Anthropic credentials are stripped (which sanitizeEnv does via CLAUDE_CODE_* prefix and the existing ANTHROPIC_AUTH_TOKEN block), behavior is correct. The plan does not change these paths — it adds tests that prove they stay correct.

5.2 Add audit test — `tests/env-isolation.test.ts`

every documented spawn site applies sanitizeEnv
- Read each file from the audit table.
- Assert: each line cited contains sanitizeEnv(. Currently GREEN; test prevents regression.
worker-daemon spawn env does not contain CLAUDE_CODE_EFFORT_LEVEL
- Stub process.env.CLAUDE_CODE_EFFORT_LEVEL = 'MAX'.
- Construct the env block as ProcessManager.ts:415 does.
- Assert: result does not contain CLAUDE_CODE_EFFORT_LEVEL. Currently GREEN.

5.3 Verification checklist (Phase 5)

Tests 8, 9 GREEN.
No new spawn sites introduced; if any are added by accident, the CI check (Phase 7) flags them.

5.4 Anti-pattern guards

Do NOT add buildIsolatedEnv calls to ProcessManager / ChromaMcpManager / MCP server spawn paths. They legitimately need parent-shell PATH, HOME, etc. — those would be wiped by the credential-isolated builder.
Do NOT consolidate the two layers into one helper "for clarity" — they have distinct contracts and are layered intentionally.

Phase 6 — Test the full integration end-to-end

Goal: Smoke test the proxy/gateway path so we know the fix works in the real world.

6.1 Manual smoke (BigModel proxy or any equivalent)

# Setup:
cat > ~/.claude-mem/.env <<'EOF'
ANTHROPIC_BASE_URL=https://open.bigmodel.cn/api/anthropic
ANTHROPIC_AUTH_TOKEN=<your-bigmodel-token>
EOF
chmod 600 ~/.claude-mem/.env

# Reset worker:
npm run build-and-sync
pkill -f worker-service.cjs

# Trigger:
# In any Claude Code session, use any tool — PostToolUse hook should land an observation.

# Verify:
tail -f ~/.claude-mem/logs/claude-mem-$(date +%Y-%m-%d).log
# Expect: no "Not logged in" errors; observations land via the proxy.

6.2 Manual smoke (CLAUDE_CODE_EFFORT_LEVEL leak)

# Setup:
export CLAUDE_CODE_EFFORT_LEVEL=MAX
export CLAUDE_CODE_ALWAYS_ENABLE_EFFORT=true

# Restart Claude Code so the env propagates to the hook subprocess.

# Verify:
tail -f ~/.claude-mem/logs/claude-mem-$(date +%Y-%m-%d).log
# Expect: NO repeated "API Error: 400 This model does not support the effort parameter."
# Expect: NO "PARSER returned non-XML response; marking messages as failed for retry".

6.3 Verification checklist (Phase 6)

Both smoke scenarios pass.
bun test is green.
One iteration on a fresh machine confirms ~/.claude-mem/.env is the only knob users need for proxy auth.

Phase 7 — CI guard + documentation

Goal: A grep-based CI check rejects PRs that introduce a subprocess spawn without sanitizeEnv. Documentation aligns with the new contract.

7.1 Add `scripts/check-spawn-env-discipline.cjs`

Pattern from plans/01-hook-io-discipline.md Phase 6 (scripts/check-hook-io-discipline.cjs):

#!/usr/bin/env node
// Forbid raw process.env in subprocess spawn calls. Every spawn must use
// sanitizeEnv(...) and (where credentials are involved) buildIsolatedEnv*.

const { execSync } = require('node:child_process');

const VIOLATIONS = [];

// Find every `spawn(` / `spawnSync(` / `child_process.spawn(` call in src/
const grep = execSync(
  `grep -rEn "spawn(Sync)?\\(" src/ | grep -v "node_modules" | grep -v "\\.test\\."`,
  { encoding: 'utf8' },
);

for (const line of grep.split('\n').filter(Boolean)) {
  // Allow if the same logical block contains sanitizeEnv
  // (heuristic: read 5 lines after the match in the source file)
  const [filePath, lineNumStr] = line.split(':', 2);
  const lineNum = Number.parseInt(lineNumStr, 10);
  const src = require('node:fs').readFileSync(filePath, 'utf8').split('\n');
  const window = src.slice(lineNum - 1, lineNum + 8).join('\n');
  if (!/sanitizeEnv\s*\(/.test(window)) {
    VIOLATIONS.push(`${filePath}:${lineNum} — spawn without sanitizeEnv`);
  }
}

if (VIOLATIONS.length > 0) {
  console.error('Spawn-env discipline check FAILED:');
  VIOLATIONS.forEach(v => console.error('  ' + v));
  process.exit(1);
}
console.log('Spawn-env discipline check passed.');

Wire to package.json scripts.test:env-discipline. Add to CI alongside existing hook checks.

7.2 Edit `CLAUDE.md` — document the `~/.claude-mem/.env` contract

Add a section under "Configuration":

### Anthropic Credentials (proxies, gateways, BigModel, etc.)

For non-OAuth Anthropic credentials (proxies / gateways / `ANTHROPIC_AUTH_TOKEN` / `ANTHROPIC_API_KEY`), put them in `~/.claude-mem/.env`:

\```
ANTHROPIC_BASE_URL=https://your-proxy.example
ANTHROPIC_AUTH_TOKEN=your-token
\```

The file is read at worker spawn time and re-injected into the SDK subprocess. **Parent-shell exports of these variables are intentionally ignored** — they are in `BLOCKED_ENV_VARS` to prevent host-config bleed-through (#2375).

If you only have an OAuth subscription, no `.env` is needed; the worker reads the token from your keychain at spawn time.

7.3 Verification checklist (Phase 7)

npm run test:env-discipline passes on the post-fix tree.
CI pipeline runs the new check.
CLAUDE.md section exists and accurately reflects the new contract.

7.4 Anti-pattern guards

Do NOT extend the CI check to flag every process.env read — only spawn*() call sites need sanitizeEnv. Reads are fine.
Do NOT add the .env file path to .gitignore — it lives in ~/.claude-mem/, not in the repo, so it's already outside.

Cross-plan dependencies

Plan 01 (Hook IO Discipline): Independent. Both can be implemented in parallel.
Plan 02 (Spawn-Contract Templating): Independent. Both touch templating but at different layers.
Plan 03 (Worker Lifecycle): Phase 3.2's HTTP 400 classification removes a class of unbounded retries. Plan 03's "circuit breaker" + "stale-session sweep" handles other retry classes. Merge order: this plan first (small, surgical), then Plan 03.
Plan 04 (Installer Transparency): Independent.
Plan 05 (Observer Tool Enforcement): Adjacent — KnowledgeAgent is touched in both plans (this one for getModelId, Plan 05 for tool enforcement). Sequence Plan 05 first (security urgency), then Plan 06.

Pre-/do checklist

Verify BLOCKED_ENV_VARS is still an Array<string> and not converted to a Set (Phase 2 refactor risk).
Verify the existing test suite passes against current main before starting (bun test).
Re-confirm effort is still absent from src/ (grep -rn "effort" src/) — if a future change adds the parameter, Phase 3.2's regex needs revisiting.
Read node_modules/@anthropic-ai/claude-agent-sdk/sdk.d.ts to confirm query() options does NOT support effort natively. If the SDK adds it, Phase 3.2's body-text regex still works as a fallback, but a code-level strip becomes the right fix.
Verify ~/.claude-mem/.env permissions are 0o600 post-fix (the saver enforces this; readers should not weaken it).

33 KiB Raw Blame History Unescape Escape