feat: basic claude-mem Docker container for easy spin-up (#2076)

* feat(evals): SWE-bench Docker scaffolding for claude-mem resolve-rate measurement

  Adds evals/swebench/ scaffolding per .claude/plans/swebench-claude-mem-docker.md. Agent image builds Claude Code 2.1.114 + locally-built claude-mem plugin; run-instance.sh executes the two-turn ingest/fix protocol per instance; run-batch.py orchestrates parallel Docker runs with per-instance isolation; eval.sh wraps the upstream SWE-bench harness; summarize.py aggregates reports. Orchestrator owns JSONL writes under a lock to avoid racy concurrent appends; agent writes its authoritative diff to CLAUDE_MEM_OUTPUT_DIR (/scratch in container mode) and the orchestrator reads it back. Scaffolding only — no Docker build or smoke test run yet.

* feat(evals): OAuth credential mounting for Claude Max/Pro subscriptions

  Skips per-call API billing by extracting OAuth creds from host Keychain (macOS) or ~/.claude/.credentials.json (Linux) and bind-mounting them read-only into each agent container. Creds are copied into HOME=$SCRATCH/.claude at container start so the per-instance isolation model still holds. Adds run-batch.py --auth {oauth,api-key,auto} (auto prefers OAuth, falls back to API key). run-instance.sh accepts either ANTHROPIC_API_KEY or CLAUDE_MEM_CREDENTIALS_FILE. smoke-test.sh runs one instance end-to-end using OAuth for quick verification before batch runs. Caveat surfaced in docstrings: Max/Pro has per-window usage limits and is framed for individual developer use — batch evaluation may exhaust the quota or raise compliance questions.

* feat(docker): basic claude-mem container for ad-hoc testing

  Adds docker/claude-mem/ with a fresh spin-up image:
  - Dockerfile: FROM node:20 (reproduces anthropics/claude-code .devcontainer pattern — Anthropic ships the Dockerfile, not a pullable image); layers Bun + uv + locally-built plugin/; runs as non-root node user
  - entrypoint.sh: seeds OAuth creds from CLAUDE_MEM_CREDENTIALS_FILE into $HOME/.claude/.credentials.json, then exec's the command (default: bash)
  - build.sh: npm run build + docker build
  - run.sh: interactive launcher; auto-extracts OAuth from macOS Keychain (security find-generic-password) or ~/.claude/.credentials.json on Linux, mounts host .docker-claude-mem-data/ at /home/node/.claude-mem so the observations DB survives container exit

  Validated end-to-end: PostToolUse hook fires, queue enqueues, worker's SDK compression runs under subscription OAuth, observations row lands with populated facts/concepts/files_read, Chroma sync triggers. Also updates .gitignore/.dockerignore for the new runtime-output paths. Built plugin artifacts refreshed by the build step.

* fix(evals/swebench): non-root user, OAuth mount, Lite dataset default

  - Dockerfile.agent: switch to non-root `node` user (uid 1000); Claude Code refuses --permission-mode bypassPermissions when euid==0, which made every agent run exit 1 before producing a diff. Also move Bun + uv installs to system paths so the non-root user can exec them.
  - run-batch.py: add extract_oauth_credentials() that pulls from macOS Keychain / Linux ~/.claude/.credentials.json into a temp file and bind-mounts it at /auth/.credentials.json:ro with CLAUDE_MEM_CREDENTIALS_FILE. New --auth {oauth,api-key,auto} flag. New --dataset flag so the batch can target SWE-bench_Lite without editing the script.
  - smoke-test.sh: default DATASET to princeton-nlp/SWE-bench_Lite (Lite contains sympy__sympy-24152, Verified does not); accept DATASET env override.

  Caveat surfaced during testing: Max/Pro subscriptions have per-window usage limits; running 5 instances in parallel with the "read every source file" ingest prompt exhausted the 5h window within ~25 minutes (3/5 hit HTTP 429).

* fix: address PR #2076 review comments

  - docker/claude-mem/run.sh: chmod 600 (not 644) on extracted OAuth creds to match what `claude login` writes; avoids exposing tokens to other host users. Verified readable inside the container under Docker Desktop's UID translation.
  - docker/claude-mem/Dockerfile: pin Bun + uv via --build-arg BUN_VERSION / UV_VERSION (defaults: 1.3.12, 0.11.7). Bun via `bash -s "bun-v<V>"`; uv via versioned installer URL `https://astral.sh/uv/<V>/install.sh`.
  - evals/swebench/smoke-test.sh: pipe JSON through stdin to `python3 -c` so paths with spaces/special chars can't break shell interpolation.
  - evals/swebench/run-batch.py: add --overwrite flag; abort by default when predictions.jsonl for the run-id already exists, preventing accidental silent discard of partial results.

* fix: address coderabbit review on PR #2076

  Actionable (4):
  - Dockerfile uv install: wrap `chmod ... || true` in braces so the trailing `|| true` no longer masks failures from `curl|sh` via bash operator precedence (&& binds tighter than ||). Applied to both docker/claude-mem/ and evals/swebench/Dockerfile.agent. Added `set -eux` to the RUN lines.
  - docker/claude-mem/Dockerfile: drop unused `sudo` apt package (~2 MB).
  - run-batch.py: name each agent container (`swebench-agent-<id>-<pid>-<tid>`) and force-remove via `docker rm -f <name>` in the TimeoutExpired handler so timed-out runs don't leave orphan containers.

  Nitpicks (2):
  - smoke-test.sh: collapse 3 python3 invocations into 1 — parse the instance JSON once, print `repo base_commit`, and write problem.txt in the same call.
  - run-instance.sh: shallow clone via `--depth 1 --no-single-branch` + `fetch --depth 1 origin $BASE_COMMIT`. Falls back to a full clone if the server rejects the by-commit fetch.

* fix: address second coderabbit review on PR #2076

  Actionable (3):
  - docker/claude-mem/run.sh: on macOS, fall back to ~/.claude/.credentials.json when the Keychain lookup misses (some setups still have file-only creds). Unified into a single creds_obtained gate so the error surface lists both sources tried.
  - docker/claude-mem/run.sh: drop `exec docker run` — `exec` replaces the shell so the EXIT trap (`rm -f "$CREDS_FILE"`) never fires and the extracted OAuth JSON leaks to disk until tmpfs cleanup. Run as a child instead so the trap runs on exit.
  - evals/swebench/smoke-test.sh: actually enforce the TIMEOUT env var. Pick `timeout` or `gtimeout` (coreutils on macOS), fall back to uncapped with a warning. Name the container so exit-124 from timeout can `docker rm -f` it deterministically.

  Nitpick from the same review (consolidated python3 calls in smoke-test.sh) was already addressed in the prior commit ef621e00.

* fix: address third coderabbit review on PR #2076

  Actionable (1):
  - evals/swebench/smoke-test.sh: the consolidated python heredoc had competing stdin redirections — `<<'PY'` (script body) AND `< "$INSTANCE_JSON"` (data). The heredoc won, so `json.load(sys.stdin)` saw an empty stream and the parse would have failed at runtime. Pass INSTANCE_JSON as argv[2] and `open()` it inside the script instead; the heredoc is now only the script body, which is what `python3 -` needs.

  Nitpicks (2):
  - evals/swebench/smoke-test.sh: macOS Keychain lookup now falls through to ~/.claude/.credentials.json on miss (matches docker/claude-mem/run.sh).
  - evals/swebench/run-batch.py: extract_oauth_credentials() no longer early-returns on Darwin keychain miss; falls through to the on-disk creds file so macOS setups with file-only credentials work in batch mode too.

  Functional spot-check of the parse fix confirmed: REPO/BASE_COMMIT populated and problem.txt written from a synthetic INSTANCE_JSON.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
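For orientation, the shortest path from this PR's scripts to a working container — a sketch, assuming Docker is running and `claude login` has been done on the host:

docker/claude-mem/build.sh    # npm run build + docker build -t claude-mem:basic
docker/claude-mem/run.sh      # extracts OAuth creds, mounts .docker-claude-mem-data/, drops into bash
# After the session, inspect the persisted observations DB on the host:
sqlite3 .docker-claude-mem-data/claude-mem.db 'select count(*) from observations'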
.dockerignore
@@ -0,0 +1,9 @@
# Keep the build context small for evals/swebench/Dockerfile.agent.
# The Dockerfile needs `plugin/` and `evals/swebench/` — do NOT exclude them.
node_modules/
.git/
logs/
evals/swebench/runs/
.docker-claude-mem-data/
.venv
.venv-*
.gitignore
@@ -42,3 +42,12 @@ plugin/.cli-installed

# Local contribution analysis (not part of upstream)
CONTRIB_NOTES.md

# Docker container runtime data (basic claude-mem container)
.docker-claude-mem-data/

# SWE-bench eval outputs
evals/swebench/runs/
claude-opus-4-7+claude-mem.*.json
logs/run_evaluation/
.venv-swebench/
docker/claude-mem/Dockerfile (+93)
@@ -0,0 +1,93 @@
# Basic claude-mem container for ad-hoc testing.
#
# Base layout mirrors anthropics/claude-code .devcontainer
# (https://github.com/anthropics/claude-code/blob/main/.devcontainer/Dockerfile):
# FROM node:20, non-root `node` user, global npm install of @anthropic-ai/claude-code.
# We skip the firewall/zsh/fzf/delta/git-hist noise since this image is for
# exercising claude-mem, not as a full dev environment.
#
# On top of that base we install:
#   - Bun (claude-mem worker service runtime)
#   - uv (provides Python for Chroma per CLAUDE.md)
#   - The locally-built plugin/ tree at /opt/claude-mem
#
# Usage:
#   docker build -f docker/claude-mem/Dockerfile -t claude-mem:basic .
#   docker run --rm -it \
#     -v $(mktemp -d):/home/node/.claude-mem \
#     -e CLAUDE_MEM_CREDENTIALS_FILE=/auth/.credentials.json \
#     -v /path/to/extracted/creds.json:/auth/.credentials.json:ro \
#     claude-mem:basic

FROM node:20

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        git \
        curl \
        ca-certificates \
        unzip \
        jq \
        less \
        procps \
        uuid-runtime \
        sqlite3 \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# Bun — system-wide so the unprivileged `node` user can execute it.
# Pin via --build-arg BUN_VERSION=X.Y.Z; default is the version verified at PR time.
ARG BUN_VERSION=1.3.12
ENV BUN_INSTALL="/usr/local/bun"
RUN curl -fsSL https://bun.sh/install | bash -s "bun-v${BUN_VERSION}" \
    && chmod -R a+rX /usr/local/bun
ENV PATH="/usr/local/bun/bin:${PATH}"

# uv — system-wide, for Chroma's Python runtime. Pin via --build-arg UV_VERSION=X.Y.Z.
# Versioned installer URL per https://docs.astral.sh/uv/getting-started/installation/.
ARG UV_VERSION=0.11.7
ENV UV_INSTALL_DIR="/usr/local/bin"
# `&&` binds tighter than `||` in bash, so the previous form let `curl|sh` fail
# silently via the trailing `|| true`. Group the chmod so tolerated failure is
# scoped to perms-fixing only.
RUN set -eux \
    && curl -LsSf "https://astral.sh/uv/${UV_VERSION}/install.sh" | sh \
    && { chmod a+rX /usr/local/bin/uv /usr/local/bin/uvx 2>/dev/null || true; }

# Match the upstream devcontainer's npm-global prefix so `npm install -g`
# targets a dir the `node` user owns.
RUN mkdir -p /usr/local/share/npm-global \
    && chown -R node:node /usr/local/share/npm-global
ENV NPM_CONFIG_PREFIX=/usr/local/share/npm-global
ENV PATH="/usr/local/share/npm-global/bin:${PATH}"

# Claude Code CLI. Override at build-time with --build-arg CLAUDE_CODE_VERSION=X.Y.Z
# to pin; default tracks latest.
ARG CLAUDE_CODE_VERSION=latest
USER node
RUN npm install -g @anthropic-ai/claude-code@${CLAUDE_CODE_VERSION}

# Locally-built claude-mem plugin. COPY runs as root by default and layers are
# cached, so put this after the npm install so iterating on the plugin doesn't
# invalidate the CLI install layer.
USER root
COPY plugin/ /opt/claude-mem/
RUN chown -R node:node /opt/claude-mem

# Persistent mount points for ad-hoc testing — mount a host dir at either of
# these to inspect the claude-mem DB after a session.
RUN mkdir -p /home/node/.claude /home/node/.claude-mem \
    && chown -R node:node /home/node/.claude /home/node/.claude-mem

USER node
WORKDIR /home/node

# Helper: copies OAuth creds out of the read-only mount into $HOME/.claude/
# before exec'ing whatever you asked for. Saves the "cp + chmod" dance every
# time you drop in.
COPY --chown=node:node docker/claude-mem/entrypoint.sh /usr/local/bin/claude-mem-entrypoint
RUN chmod +x /usr/local/bin/claude-mem-entrypoint

ENTRYPOINT ["/usr/local/bin/claude-mem-entrypoint"]
CMD ["bash"]
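The operator-precedence pitfall the uv RUN comment describes is easy to demonstrate outside Docker; a minimal sketch with shell builtins standing in for the real install and chmod steps:

# Without braces, `a && b || true` parses as `(a && b) || true`,
# so a failing `a` (the curl|sh install) is silently tolerated:
false && echo "chmod step" || true          # exit 0 — install failure masked
# With the chmod grouped, `|| true` only absorbs the chmod's failure:
false && { echo "chmod step" || true; }     # exit 1 — install failure surfaces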
docker/claude-mem/build.sh (Executable, +24)
@@ -0,0 +1,24 @@
#!/usr/bin/env bash
# Build the basic claude-mem Docker image from the current worktree.
#
# Usage:
#   docker/claude-mem/build.sh              # builds claude-mem:basic
#   TAG=my-tag docker/claude-mem/build.sh   # override the tag
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
TAG="${TAG:-claude-mem:basic}"

cd "$REPO_ROOT"

echo "[build] npm run build"
npm run build

echo "[build] docker build -t $TAG"
docker build \
  -f docker/claude-mem/Dockerfile \
  -t "$TAG" \
  "$REPO_ROOT"

echo "[build] done: $TAG"
docker/claude-mem/entrypoint.sh (Executable, +28)
@@ -0,0 +1,28 @@
#!/usr/bin/env bash
# Entrypoint for the basic claude-mem container. Seeds OAuth creds if a
# credentials file is mounted, then exec's whatever was passed (default: bash).
#
# Env vars:
#   CLAUDE_MEM_CREDENTIALS_FILE  Path to a mounted OAuth credentials JSON file
#                                (e.g. /auth/.credentials.json). Copied into
#                                $HOME/.claude/.credentials.json at startup.
#   ANTHROPIC_API_KEY            Standard API-key auth; set when OAuth isn't used.

set -euo pipefail

mkdir -p "$HOME/.claude" "$HOME/.claude-mem"

if [[ -n "${CLAUDE_MEM_CREDENTIALS_FILE:-}" ]]; then
  if [[ ! -f "$CLAUDE_MEM_CREDENTIALS_FILE" ]]; then
    echo "ERROR: CLAUDE_MEM_CREDENTIALS_FILE set but file missing: $CLAUDE_MEM_CREDENTIALS_FILE" >&2
    exit 1
  fi
  cp "$CLAUDE_MEM_CREDENTIALS_FILE" "$HOME/.claude/.credentials.json"
  chmod 600 "$HOME/.claude/.credentials.json"
fi

# Helpful one-liner for interactive users: run `claude` with the plugin dir
# preconfigured. Don't force it — `exec "$@"` lets you override freely.
export PATH="/usr/local/bun/bin:/usr/local/share/npm-global/bin:$PATH"

exec "$@"
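A quick check of the seeding path — a sketch, where creds.json is a placeholder for a credentials file already extracted on the host (see run.sh below):

docker run --rm \
  -e CLAUDE_MEM_CREDENTIALS_FILE=/auth/.credentials.json \
  -v "$PWD/creds.json:/auth/.credentials.json:ro" \
  claude-mem:basic \
  ls -l /home/node/.claude/.credentials.json   # entrypoint copied it and chmod'ed 600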
docker/claude-mem/run.sh (Executable, +69)
@@ -0,0 +1,69 @@
#!/usr/bin/env bash
# Drop into an interactive claude-mem container with OAuth creds + persistent
# memory volume. For ad-hoc testing / poking around.
#
# Usage:
#   docker/claude-mem/run.sh
#   docker/claude-mem/run.sh claude --plugin-dir /opt/claude-mem --print "hi"
#
# On exit, the mounted .claude-mem/ dir on the host survives so you can inspect
# the DB: `sqlite3 <HOST_MEM_DIR>/claude-mem.db 'select count(*) from observations'`.
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
TAG="${TAG:-claude-mem:basic}"

HOST_MEM_DIR="${HOST_MEM_DIR:-$REPO_ROOT/.docker-claude-mem-data}"
mkdir -p "$HOST_MEM_DIR"
echo "[run] host .claude-mem dir: $HOST_MEM_DIR" >&2

# Auth. Prefer OAuth (extracted from macOS Keychain / Linux creds file);
# fall back to ANTHROPIC_API_KEY env.
CREDS_FILE=""
CREDS_MOUNT_ARGS=()
if [[ -z "${ANTHROPIC_API_KEY:-}" ]]; then
  CREDS_FILE="$(mktemp -t claude-mem-creds.XXXXXX.json)"
  trap 'rm -f "$CREDS_FILE"' EXIT

  # Try macOS Keychain first (primary storage on Darwin), then fall back to
  # the on-disk credentials file — some macOS setups (older CLI versions,
  # users who migrated machines) still have the file-only form.
  creds_obtained=0
  if [[ "$(uname)" == "Darwin" ]]; then
    if security find-generic-password -s 'Claude Code-credentials' -w > "$CREDS_FILE" 2>/dev/null \
        && [[ -s "$CREDS_FILE" ]]; then
      creds_obtained=1
    fi
  fi
  if [[ "$creds_obtained" -eq 0 && -f "$HOME/.claude/.credentials.json" ]]; then
    cp "$HOME/.claude/.credentials.json" "$CREDS_FILE"
    creds_obtained=1
  fi
  if [[ "$creds_obtained" -eq 0 ]]; then
    echo "ERROR: no ANTHROPIC_API_KEY set and no Claude OAuth credentials found." >&2
    echo "       Tried: macOS Keychain ('Claude Code-credentials') and ~/.claude/.credentials.json." >&2
    echo "       Run \`claude login\` on the host first, or set ANTHROPIC_API_KEY." >&2
    exit 1
  fi
  chmod 600 "$CREDS_FILE"
  CREDS_MOUNT_ARGS=(
    -e CLAUDE_MEM_CREDENTIALS_FILE=/auth/.credentials.json
    -v "$CREDS_FILE:/auth/.credentials.json:ro"
  )
else
  CREDS_MOUNT_ARGS=(-e ANTHROPIC_API_KEY)
fi

# Pick -it only when a TTY is attached (keeps non-interactive callers working).
TTY_ARGS=()
[[ -t 0 && -t 1 ]] && TTY_ARGS=(-it)

# NOT `exec` — we want the EXIT trap above to run and remove $CREDS_FILE
# after the container exits. Running docker as a child keeps the shell
# alive long enough for the trap to fire.
docker run --rm "${TTY_ARGS[@]}" \
  "${CREDS_MOUNT_ARGS[@]}" \
  -v "$HOST_MEM_DIR:/home/node/.claude-mem" \
  "$TAG" \
  "$@"
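Why the script avoids `exec docker run` is easy to see in isolation; a minimal sketch of the trap behavior the review fix relies on:

#!/usr/bin/env bash
trap 'echo "EXIT trap: cleanup ran"' EXIT
exec true   # replaces the shell process — the EXIT trap above never fires
# Running `true` as a child command instead lets the shell exit normally,
# so the trap fires and temp-file cleanup actually happens.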
evals/swebench/Dockerfile.agent (+74)
@@ -0,0 +1,74 @@
# claude-mem SWE-bench agent image
# Plan: .claude/plans/swebench-claude-mem-docker.md (Phase 1)
#
# Produces `claude-mem/swebench-agent:latest`: Claude Code CLI 2.1.114 +
# locally-built claude-mem plugin, ready to run headlessly per SWE-bench
# instance. Auth (ANTHROPIC_API_KEY) is passed at runtime, never baked in.

FROM node:20-bookworm-slim

ENV DEBIAN_FRONTEND=noninteractive

# System dependencies:
#   git, curl, ca-certificates, unzip — base tooling (Bun installer needs unzip)
#   jq           — JSONL assembly in run-instance.sh
#   uuid-runtime — uuidgen for per-instance session IDs (Phase 2)
#   sqlite3      — verifies the claude-mem observations DB
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        git \
        curl \
        ca-certificates \
        unzip \
        jq \
        uuid-runtime \
        sqlite3 \
    && rm -rf /var/lib/apt/lists/*

# Bun (claude-mem worker service runs under Bun). Installed to a system
# location so the non-root runtime user can execute it.
ENV BUN_INSTALL="/usr/local/bun"
RUN curl -fsSL https://bun.sh/install | bash \
    && chmod -R a+rX /usr/local/bun
ENV PATH="/usr/local/bun/bin:${PATH}"

# uv (provides Python for Chroma per CLAUDE.md). Installed to a system
# location, same reason.
ENV UV_INSTALL_DIR="/usr/local/bin"
# Group the chmod so the trailing `|| true` only absorbs chmod failures; without
# this grouping, bash precedence (`&&` binds tighter than `||`) would silently
# mask a failed `curl|sh` install step.
RUN set -eux \
    && curl -LsSf https://astral.sh/uv/install.sh | sh \
    && { chmod a+rX /usr/local/bin/uv /usr/local/bin/uvx 2>/dev/null || true; }

# Claude Code CLI — PINNED to the version whose flag surface was verified in
# the plan (Phase 0). Do NOT bump without re-verifying flags.
RUN npm install -g @anthropic-ai/claude-code@2.1.114

# Locally-built claude-mem plugin. The build-agent-image.sh wrapper runs
# `npm run build` before `docker build`, so plugin/ is populated in the build
# context. We do NOT install claude-mem from npm — we want the current
# worktree under test.
COPY plugin/ /opt/claude-mem/

# Runner script — entrypoint for per-instance invocation (Phase 2 deliverable).
COPY evals/swebench/run-instance.sh /evals/swebench/run-instance.sh
RUN chmod +x /evals/swebench/run-instance.sh

# Pre-create per-instance config dirs. run-instance.sh overrides HOME to a
# scratch dir for isolation, but having these present keeps tools from
# bailing if they probe the default locations before HOME is set.
RUN mkdir -p /root/.claude /root/.claude-mem

# Non-root user. Claude Code refuses `--dangerously-skip-permissions` /
# `--permission-mode bypassPermissions` when euid==0 as a safety rail, so we
# need an unprivileged user for headless batch runs. node:20 already ships a
# `node` user at uid 1000 — reuse it.
RUN mkdir -p /home/node/.claude /home/node/.claude-mem \
    && chown -R node:node /home/node /opt/claude-mem

USER node
WORKDIR /home/node

ENTRYPOINT ["/evals/swebench/run-instance.sh"]
evals/swebench/build-agent-image.sh (Executable, +20)
@@ -0,0 +1,20 @@
#!/usr/bin/env bash
# Build the claude-mem SWE-bench agent image.
# Plan: .claude/plans/swebench-claude-mem-docker.md (Phase 1, step 2)
set -euo pipefail

# Resolve repo root (two levels up from this script: evals/swebench -> repo).
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"

cd "$REPO_ROOT"

# 1. Build the plugin so plugin/ is populated for the COPY step in the Dockerfile.
npm run build

# 2. Build the agent image. Context is the repo root so both plugin/ and
#    evals/swebench/run-instance.sh are reachable.
docker build \
  -f evals/swebench/Dockerfile.agent \
  -t claude-mem/swebench-agent:latest \
  .
evals/swebench/eval.sh (Executable, +72)
@@ -0,0 +1,72 @@
#!/usr/bin/env bash
set -euo pipefail

# eval.sh — Thin wrapper around `python -m swebench.harness.run_evaluation`.
#
# Required env:
#   RUN_ID       Identifier for this evaluation run (matches predictions dir).
# Optional env:
#   MAX_WORKERS  Parallel worker count for the harness (default: 4).
#   DATASET      HF dataset name (default: princeton-nlp/SWE-bench_Verified).
#   TIMEOUT      Per-instance timeout in seconds (default: 1800).
#
# Reports land at:
#   logs/run_evaluation/$RUN_ID/claude-opus-4-7+claude-mem/<instance_id>/report.json

: "${RUN_ID:?RUN_ID is required (e.g. RUN_ID=smoke-001)}"
MAX_WORKERS="${MAX_WORKERS:-4}"
DATASET="${DATASET:-princeton-nlp/SWE-bench_Verified}"
TIMEOUT="${TIMEOUT:-1800}"

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
cd "$REPO_ROOT"

PREDICTIONS="evals/swebench/runs/$RUN_ID/predictions.jsonl"

if [[ ! -f "$PREDICTIONS" ]]; then
  echo "ERROR: predictions file not found: $PREDICTIONS" >&2
  echo "Hint: run Phase 3 agent loop first to produce predictions.jsonl for RUN_ID=$RUN_ID." >&2
  exit 1
fi

# Harness REQUIRES Docker — fail fast with a clean message if it's not running.
if ! command -v docker >/dev/null 2>&1; then
  echo "ERROR: docker CLI not found on PATH. The SWE-bench harness requires Docker." >&2
  exit 1
fi
if ! docker info >/dev/null 2>&1; then
  echo "ERROR: Docker daemon is not running. Start Docker Desktop (or the docker service) and retry." >&2
  exit 1
fi

# Create/reuse a dedicated venv so we don't pollute the system Python.
VENV_DIR=".venv-swebench"
if [[ ! -d "$VENV_DIR" ]]; then
  echo "[eval.sh] Creating Python venv at $VENV_DIR ..."
  python3 -m venv "$VENV_DIR"
fi
# shellcheck disable=SC1091
source "$VENV_DIR/bin/activate"

echo "[eval.sh] Installing/updating swebench in $VENV_DIR ..."
pip install -q swebench

echo "[eval.sh] Running harness:"
echo "  dataset:      $DATASET"
echo "  predictions:  $PREDICTIONS"
echo "  max_workers:  $MAX_WORKERS"
echo "  run_id:       $RUN_ID"
echo "  timeout:      $TIMEOUT"

python -m swebench.harness.run_evaluation \
  --dataset_name "$DATASET" \
  --predictions_path "$PREDICTIONS" \
  --max_workers "$MAX_WORKERS" \
  --run_id "$RUN_ID" \
  --timeout "$TIMEOUT"

REPORTS_DIR="logs/run_evaluation/$RUN_ID/claude-opus-4-7+claude-mem"
echo ""
echo "[eval.sh] Done. Per-instance reports at:"
echo "  $REPORTS_DIR/<instance_id>/report.json"
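End-to-end, a typical pairing of the batch run and the harness looks like this — a sketch using the defaults documented above:

# 1. Produce predictions with the agent batch (run-batch.py, next file):
python evals/swebench/run-batch.py --run-id smoke-001 --limit 3 --max-concurrent 2
# 2. Score them with the upstream harness via this wrapper:
RUN_ID=smoke-001 MAX_WORKERS=2 evals/swebench/eval.sh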
evals/swebench/run-batch.py (Executable, +561)
@@ -0,0 +1,561 @@
#!/usr/bin/env python3
"""
Batch orchestrator for SWE-bench evaluation of Claude Code + claude-mem.

Iterates a list of SWE-bench Verified instances, launches a per-instance Docker
container (`claude-mem/swebench-agent:latest`) that runs the two-turn
ingest/fix protocol, and collects all resulting diffs into a single
`predictions.jsonl` compatible with the upstream SWE-bench harness.

Usage:
    python evals/swebench/run-batch.py \
        --run-id claude-mem-baseline-001 \
        --limit 3 \
        --max-concurrent 2

Rate-limit note: Anthropic API rate limits can bite quickly. The default
`--max-concurrent` is 4, but it is safer to START WITH 2 and raise the cap
only after observing no 429s in the logs.
"""

from __future__ import annotations

import argparse
import atexit
import json
import os
import platform
import shutil
import stat
import subprocess
import sys
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Any, Iterable

from datasets import load_dataset


# Hidden-from-agent fields per the plan. We MUST NOT pass these to the agent
# container — they are evaluator-only ground truth.
HIDDEN_AGENT_FIELDS = (
    "patch",
    "test_patch",
    "FAIL_TO_PASS",
    "PASS_TO_PASS",
    "environment_setup_commit",
    "version",
)


def extract_oauth_credentials() -> Path | None:
    """
    Extract Claude Code OAuth credentials (from a Max/Pro subscription) to a
    temp file the container can bind-mount. Returns the temp file path, or
    None if extraction failed / no creds present.

    macOS: creds live in the Keychain under service "Claude Code-credentials".
    Linux: creds live at ~/.claude/.credentials.json.

    CAVEAT: Anthropic Max/Pro subscriptions have usage limits (per ~5h window)
    and their ToS is framed around individual developer use. Running batch
    evaluation across parallel containers may exhaust the quota quickly or
    raise compliance concerns. This helper exists because the user explicitly
    requested it; the caller is responsible for the policy call.

    The token may age out mid-run; we mount read-only so refresh writes fail
    silently inside the container (the underlying token in the host
    Keychain/file is untouched).
    """
    temp = tempfile.NamedTemporaryFile(
        prefix="claude-mem-creds-",
        suffix=".json",
        delete=False,
    )
    temp_path = Path(temp.name)
    temp.close()
    # Clean up on process exit, even on crash.
    atexit.register(lambda: temp_path.unlink(missing_ok=True))

    # macOS: try Keychain first (primary storage on Darwin). On miss, fall
    # through to the on-disk credentials file — some macOS setups (older CLI,
    # migrated machines) only have the file form.
    if platform.system() == "Darwin":
        try:
            completed = subprocess.run(
                [
                    "security",
                    "find-generic-password",
                    "-s",
                    "Claude Code-credentials",
                    "-w",
                ],
                capture_output=True,
                text=True,
                check=False,
            )
            if completed.returncode == 0 and completed.stdout.strip():
                temp_path.write_text(completed.stdout.strip(), encoding="utf-8")
                temp_path.chmod(stat.S_IRUSR | stat.S_IWUSR)
                return temp_path
            # else fall through to the on-disk credentials check below
        except FileNotFoundError:
            print(
                "WARN: `security` command not available; trying on-disk creds.",
                file=sys.stderr,
            )
            # fall through to the on-disk credentials check below

    # Both platforms (and macOS fallback): read the on-disk credentials file.
    creds_file = Path.home() / ".claude" / ".credentials.json"
    if creds_file.exists():
        temp_path.write_text(creds_file.read_text(encoding="utf-8"), encoding="utf-8")
        temp_path.chmod(stat.S_IRUSR | stat.S_IWUSR)
        return temp_path

    if platform.system() == "Darwin":
        print(
            "WARN: Claude Code-credentials not found in macOS Keychain and "
            "~/.claude/.credentials.json missing. Run `claude login` on the "
            "host first, or fall back to ANTHROPIC_API_KEY.",
            file=sys.stderr,
        )
    return None


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Run the claude-mem SWE-bench agent on a batch of instances.",
    )
    parser.add_argument(
        "--instance-ids",
        nargs="+",
        default=None,
        help="Optional explicit list of instance_ids to run.",
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="If set, process only the first N instances after filtering.",
    )
    parser.add_argument(
        "--max-concurrent",
        type=int,
        default=4,
        help="Max concurrent agent containers (default 4; start with 2 and raise after observing no 429s).",
    )
    parser.add_argument(
        "--run-id",
        type=str,
        required=True,
        help="Run identifier; used for output paths.",
    )
    parser.add_argument(
        "--out",
        type=str,
        default=None,
        help="Path to predictions.jsonl (default: evals/swebench/runs/<run_id>/predictions.jsonl).",
    )
    parser.add_argument(
        "--timeout",
        type=int,
        default=1800,
        help="Per-instance timeout in seconds (default 1800, matches upstream harness).",
    )
    parser.add_argument(
        "--image",
        type=str,
        default="claude-mem/swebench-agent:latest",
        help="Agent Docker image tag.",
    )
    parser.add_argument(
        "--dataset",
        type=str,
        default="princeton-nlp/SWE-bench_Verified",
        help="HuggingFace dataset name (e.g. princeton-nlp/SWE-bench_Lite, default Verified).",
    )
    parser.add_argument(
        "--auth",
        choices=["oauth", "api-key", "auto"],
        default="auto",
        help=(
            "Auth mode. 'oauth' extracts Claude Max/Pro creds from host "
            "Keychain (macOS) or ~/.claude/.credentials.json (Linux). "
            "'api-key' uses ANTHROPIC_API_KEY env. 'auto' prefers oauth, "
            "falls back to api-key."
        ),
    )
    parser.add_argument(
        "--overwrite",
        action="store_true",
        help=(
            "Truncate existing predictions.jsonl for this --run-id. "
            "Without this flag, the run aborts if predictions already exist "
            "(protects partial results from accidental re-runs)."
        ),
    )
    return parser.parse_args()


def select_instances(
    dataset: Iterable[dict[str, Any]],
    instance_ids: list[str] | None,
    limit: int | None,
) -> list[dict[str, Any]]:
    """Filter dataset rows by instance_ids (if given) and apply limit."""
    rows: list[dict[str, Any]] = list(dataset)
    if instance_ids:
        wanted = set(instance_ids)
        rows = [r for r in rows if r["instance_id"] in wanted]
        missing = wanted - {r["instance_id"] for r in rows}
        if missing:
            print(
                f"WARN: {len(missing)} requested instance_ids not found in dataset: "
                f"{sorted(missing)[:5]}{'...' if len(missing) > 5 else ''}",
                file=sys.stderr,
            )
    if limit is not None:
        rows = rows[:limit]
    return rows


def append_prediction_row(
    predictions_path: Path,
    instance_id: str,
    model_patch: str,
    model_name_or_path: str,
    lock: threading.Lock,
) -> None:
    """Append one JSONL prediction row under a lock (appends are NOT atomic across threads)."""
    row = {
        "instance_id": instance_id,
        "model_patch": model_patch,
        "model_name_or_path": model_name_or_path,
    }
    line = json.dumps(row, ensure_ascii=False) + "\n"
    with lock:
        with predictions_path.open("a", encoding="utf-8") as fp:
            fp.write(line)


def copy_log_if_exists(src: Path, dst: Path) -> None:
    """Copy a log file from the shared scratch volume into the run-log directory, if present."""
    if src.exists() and src.is_file():
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)


def run_one_instance(
    instance: dict[str, Any],
    image: str,
    predictions_path: Path,
    predictions_dir: Path,
    run_dir: Path,
    timeout: int,
    predictions_lock: threading.Lock,
    model_name_or_path: str,
    oauth_creds_path: Path | None,
) -> tuple[str, str]:
    """
    Run the agent container for a single instance.

    Returns a (status, instance_id) tuple where status is one of:
    "succeeded", "failed", "timed_out".

    On ANY non-success (timeout, non-zero exit, missing diff), a prediction
    row with model_patch="" is still appended — the plan requires we never
    silently drop an instance.
    """
    instance_id: str = instance["instance_id"]
    repo: str = instance["repo"]
    base_commit: str = instance["base_commit"]
    problem_statement: str = instance["problem_statement"]

    instance_log_dir = run_dir / instance_id
    instance_log_dir.mkdir(parents=True, exist_ok=True)
    stderr_log_path = instance_log_dir / "stderr.log"

    # Per-instance scratch dir — MUST NOT be shared across containers.
    scratch_dir = Path(tempfile.mkdtemp(prefix=f"swebench-{instance_id}-"))
    problem_file = scratch_dir / "problem.txt"
    problem_file.write_text(problem_statement, encoding="utf-8")

    status: str = "failed"
    model_patch: str = ""

    # Uniquely named so the TimeoutExpired handler can kill it without racing
    # other instances on the host.
    container_name = f"swebench-agent-{instance_id}-{os.getpid()}-{threading.get_ident()}"

    try:
        # The orchestrator owns JSONL writes under `predictions_lock` to avoid
        # racy concurrent appends across containers — so we DO NOT mount the
        # predictions directory into the container. Instead, the agent writes
        # its authoritative diff to /scratch/model_patch.diff (via
        # CLAUDE_MEM_OUTPUT_DIR), plus ingest/fix logs to the same dir. The
        # 5th CLI arg to run-instance.sh is only used in standalone smoke-test
        # mode; here we point it at a throwaway path inside the container.
        cmd: list[str] = [
            "docker",
            "run",
            "--rm",
            "--name",
            container_name,
            "-e",
            "CLAUDE_MEM_OUTPUT_DIR=/scratch",
            "-v",
            f"{scratch_dir}:/scratch",
        ]
        if oauth_creds_path is not None:
            cmd += [
                "-e",
                "CLAUDE_MEM_CREDENTIALS_FILE=/auth/.credentials.json",
                "-v",
                f"{oauth_creds_path}:/auth/.credentials.json:ro",
            ]
        else:
            # Pay-per-call path.
            cmd += ["-e", "ANTHROPIC_API_KEY"]
        cmd += [
            image,
            instance_id,
            repo,
            base_commit,
            "/scratch/problem.txt",
            "/scratch/ignored-predictions.jsonl",
        ]

        try:
            completed = subprocess.run(
                cmd,
                timeout=timeout,
                capture_output=True,
                text=True,
                check=False,
            )
            # Persist stderr so post-mortem is possible even on success.
            stderr_log_path.write_text(
                f"=== STDOUT ===\n{completed.stdout}\n=== STDERR ===\n{completed.stderr}\n",
                encoding="utf-8",
            )

            if completed.returncode == 0:
                # Read the diff the agent wrote to the shared predictions volume.
                # The container writes its own prediction line; we prefer to
                # write our own authoritative row here from the diff file the
                # agent left in /scratch. If the agent wrote a diff file, use
                # it; otherwise fall back to empty patch.
                diff_file = scratch_dir / "model_patch.diff"
                if diff_file.exists():
                    diff_text = diff_file.read_text(encoding="utf-8")
                    if diff_text.strip():
                        model_patch = diff_text
                        status = "succeeded"
                    else:
                        status = "failed"  # empty diff
                else:
                    # Container did not leave a diff file — treat as failure
                    # but still emit an empty-patch row below.
                    status = "failed"
            else:
                status = "failed"

        except subprocess.TimeoutExpired as exc:
            status = "timed_out"
            # subprocess.run killed the docker CLI, but the container may
            # still be running. Force-remove it by name so we don't leak
            # containers across the batch.
            subprocess.run(
                ["docker", "rm", "-f", container_name],
                capture_output=True,
                check=False,
                timeout=30,
            )
            stderr_log_path.write_text(
                f"TIMEOUT after {timeout}s (forced docker rm -f {container_name})\n"
                f"=== STDOUT (partial) ===\n{exc.stdout or ''}\n"
                f"=== STDERR (partial) ===\n{exc.stderr or ''}\n",
                encoding="utf-8",
            )

        # Copy per-turn logs left by the agent in the shared scratch volume.
        copy_log_if_exists(scratch_dir / "ingest.jsonl", instance_log_dir / "ingest.jsonl")
        copy_log_if_exists(scratch_dir / "fix.jsonl", instance_log_dir / "fix.jsonl")

        # Always write a row — never silently drop an instance.
        append_prediction_row(
            predictions_path=predictions_path,
            instance_id=instance_id,
            model_patch=model_patch,
            model_name_or_path=model_name_or_path,
            lock=predictions_lock,
        )

    except Exception as exc:  # pragma: no cover — defensive
        status = "failed"
        try:
            stderr_log_path.write_text(
                f"ORCHESTRATOR EXCEPTION: {exc!r}\n",
                encoding="utf-8",
            )
        except OSError:
            pass
        append_prediction_row(
            predictions_path=predictions_path,
            instance_id=instance_id,
            model_patch="",
            model_name_or_path=model_name_or_path,
            lock=predictions_lock,
        )
    finally:
        # Per-instance scratch must not leak across containers.
        shutil.rmtree(scratch_dir, ignore_errors=True)

    return status, instance_id


def main() -> int:
    args = parse_args()

    repo_root = Path(__file__).resolve().parents[2]
    if args.out:
        predictions_path = Path(args.out).resolve()
    else:
        predictions_path = (
            repo_root
            / "evals"
            / "swebench"
            / "runs"
            / args.run_id
            / "predictions.jsonl"
        )

    predictions_dir = predictions_path.parent
    run_dir = predictions_dir  # logs land in evals/swebench/runs/<run_id>/<instance_id>/
    predictions_dir.mkdir(parents=True, exist_ok=True)
    # Don't silently discard partial results from a prior run.
    if predictions_path.exists() and predictions_path.stat().st_size > 0:
        if not args.overwrite:
            print(
                f"ERROR: {predictions_path} already exists and is non-empty. "
                "Pass --overwrite to truncate, or pick a different --run-id.",
                file=sys.stderr,
            )
            return 1
        print(
            f"WARN: --overwrite set; truncating existing {predictions_path}",
            file=sys.stderr,
        )
        predictions_path.write_text("", encoding="utf-8")

    # Resolve auth: OAuth (Max/Pro subscription) or API key.
    oauth_creds_path: Path | None = None
    if args.auth in ("oauth", "auto"):
        oauth_creds_path = extract_oauth_credentials()
        if oauth_creds_path is not None:
            print(
                f"Auth: OAuth credentials extracted to {oauth_creds_path} "
                "(mounted read-only into each container). "
                "NOTE: Max/Pro has per-window usage limits; batch runs may exhaust them.",
                file=sys.stderr,
            )
        elif args.auth == "oauth":
            print(
                "ERROR: --auth=oauth requested but credentials extraction failed.",
                file=sys.stderr,
            )
            return 1

    if oauth_creds_path is None:
        if not os.environ.get("ANTHROPIC_API_KEY"):
            print(
                "ERROR: no auth available. Either run `claude login` on host "
                "(for OAuth) or set ANTHROPIC_API_KEY.",
                file=sys.stderr,
            )
            return 1
        print("Auth: ANTHROPIC_API_KEY (pay-per-call).", file=sys.stderr)

    print(f"Loading dataset {args.dataset} (split=test)...", file=sys.stderr)
    dataset = load_dataset(args.dataset, split="test")

    instances = select_instances(dataset, args.instance_ids, args.limit)
    total = len(instances)
    if total == 0:
        print("No instances selected; nothing to do.", file=sys.stderr)
        return 0

    # Scrub hidden-from-agent fields defensively. The agent container only
    # receives instance_id/repo/base_commit/problem_statement via CLI args +
    # the per-instance problem file — the hidden fields never leave this
    # process. This loop makes that invariant explicit.
    for row in instances:
        for key in HIDDEN_AGENT_FIELDS:
            row.pop(key, None)

    model_name_or_path = "claude-opus-4-7+claude-mem"

    print(
        f"Launching {total} instance(s) with max_concurrent={args.max_concurrent}, "
        f"timeout={args.timeout}s, image={args.image}",
        file=sys.stderr,
    )

    predictions_lock = threading.Lock()
    succeeded = 0
    failed = 0
    timed_out = 0

    with ThreadPoolExecutor(max_workers=args.max_concurrent) as executor:
        future_to_id = {
            executor.submit(
                run_one_instance,
                instance=instance,
                image=args.image,
                predictions_path=predictions_path,
                predictions_dir=predictions_dir,
                run_dir=run_dir,
                timeout=args.timeout,
                predictions_lock=predictions_lock,
                model_name_or_path=model_name_or_path,
                oauth_creds_path=oauth_creds_path,
            ): instance["instance_id"]
            for instance in instances
        }

        for future in as_completed(future_to_id):
            instance_id = future_to_id[future]
            try:
                status, _ = future.result()
            except Exception as exc:  # pragma: no cover — defensive
                status = "failed"
                print(
                    f"[{instance_id}] orchestrator future raised: {exc!r}",
                    file=sys.stderr,
                )

            if status == "succeeded":
                succeeded += 1
            elif status == "timed_out":
                timed_out += 1
            else:
                failed += 1

            print(
                f"[{instance_id}] {status} "
                f"({succeeded + failed + timed_out}/{total} done)",
                file=sys.stderr,
            )

    print(
        f"{total} total, {succeeded} succeeded, {failed} failed, {timed_out} timed out",
    )
    # Per plan: exit 0 even if some instances failed.
    return 0


if __name__ == "__main__":
    sys.exit(main())
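After a batch, the container-naming scheme enables a quick hygiene check — a sketch; the `swebench-agent-` prefix comes from run_one_instance above:

# Any leftover agent containers indicate a cleanup bug in the batch:
docker ps -a --filter "name=swebench-agent-" --format '{{.Names}}\t{{.Status}}'
# Stragglers can be force-removed by name, mirroring the TimeoutExpired handler:
#   docker rm -f <name>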
evals/swebench/run-instance.sh (Executable, +177)
@@ -0,0 +1,177 @@
#!/usr/bin/env bash
set -euo pipefail

# run-instance.sh — runs Claude Code + claude-mem against a single SWE-bench
# instance using the two-turn protocol (ingest, then fix), and appends a
# prediction JSONL row to OUT_PREDICTIONS_PATH.
#
# Usage:
#   run-instance.sh INSTANCE_ID REPO_SLUG BASE_COMMIT PROBLEM_STATEMENT_FILE OUT_PREDICTIONS_PATH
#
# Required env:
#   ANTHROPIC_API_KEY

if [[ $# -ne 5 ]]; then
  echo "Usage: $0 INSTANCE_ID REPO_SLUG BASE_COMMIT PROBLEM_STATEMENT_FILE OUT_PREDICTIONS_PATH" >&2
  exit 2
fi

INSTANCE_ID="$1"
REPO_SLUG="$2"
BASE_COMMIT="$3"
PROBLEM_STATEMENT_FILE="$4"
OUT_PREDICTIONS_PATH="$5"

# Auth: either ANTHROPIC_API_KEY (pay-per-call) OR a pre-extracted OAuth
# credentials file from a Claude Max/Pro subscription (flat-fee, but subject
# to Anthropic's usage limits — batch-scale runs may exhaust the 5h window).
# run-batch.py extracts OAuth creds from host Keychain/file and mounts them
# at CLAUDE_MEM_CREDENTIALS_FILE; standalone smoke-test can do the same, or
# set ANTHROPIC_API_KEY directly.
if [[ -z "${ANTHROPIC_API_KEY:-}" && -z "${CLAUDE_MEM_CREDENTIALS_FILE:-}" ]]; then
  echo "ERROR: one of ANTHROPIC_API_KEY or CLAUDE_MEM_CREDENTIALS_FILE is required" >&2
  exit 1
fi

if [[ -n "${CLAUDE_MEM_CREDENTIALS_FILE:-}" && ! -f "$CLAUDE_MEM_CREDENTIALS_FILE" ]]; then
  echo "ERROR: CLAUDE_MEM_CREDENTIALS_FILE set but file missing: $CLAUDE_MEM_CREDENTIALS_FILE" >&2
  exit 1
fi

if [[ ! -f "$PROBLEM_STATEMENT_FILE" ]]; then
  echo "ERROR: PROBLEM_STATEMENT_FILE not found: $PROBLEM_STATEMENT_FILE" >&2
  exit 1
fi

MODEL_NAME="claude-opus-4-7+claude-mem"

# Per-instance ephemeral scratch dir — isolates ~/.claude/ and ~/.claude-mem/.
SCRATCH=$(mktemp -d)
REPO_DIR="$SCRATCH/repo"
MEM_DIR="$SCRATCH/.claude-mem"
CLAUDE_DIR="$SCRATCH/.claude"
mkdir -p "$MEM_DIR" "$CLAUDE_DIR"

# If using OAuth, seed the isolated CLAUDE_DIR with the mounted credentials
# file so Claude Code finds them at HOME=$SCRATCH → ~/.claude/.credentials.json.
# chmod 600 to match what `claude login` writes (it checks permissions).
if [[ -n "${CLAUDE_MEM_CREDENTIALS_FILE:-}" ]]; then
  cp "$CLAUDE_MEM_CREDENTIALS_FILE" "$CLAUDE_DIR/.credentials.json"
  chmod 600 "$CLAUDE_DIR/.credentials.json"
fi

# Directory where artifacts the batch orchestrator reads (model_patch.diff,
# ingest.jsonl, fix.jsonl) are written. When run via `docker run -v
# <host-scratch>:/scratch` from run-batch.py, the orchestrator sets
# CLAUDE_MEM_OUTPUT_DIR=/scratch so these files are visible on the host. In
# standalone/smoke-test mode the default keeps artifacts in the ephemeral
# scratch dir alongside the repo.
OUTPUT_DIR="${CLAUDE_MEM_OUTPUT_DIR:-$SCRATCH}"
mkdir -p "$OUTPUT_DIR"

# Always write a prediction row (even on failure) so batch mode stays aligned.
# The trap emits an empty-patch row if we exit before the success path sets
# PREDICTION_EMITTED=1, then cleans up SCRATCH.
DIFF_OUT="$OUTPUT_DIR/model_patch.diff"
INGEST_LOG="$OUTPUT_DIR/ingest.jsonl"
FIX_LOG="$OUTPUT_DIR/fix.jsonl"

PREDICTION_EMITTED=0
cleanup() {
  local exit_code=$?
  if [[ "$PREDICTION_EMITTED" -ne 1 ]]; then
    # Ensure the orchestrator sees an (empty) diff file even on early exit.
    : > "$DIFF_OUT" 2>/dev/null || true
    jq -nc \
      --arg id "$INSTANCE_ID" \
      --arg patch "" \
      --arg model "$MODEL_NAME" \
      '{instance_id:$id, model_patch:$patch, model_name_or_path:$model}' \
      >> "$OUT_PREDICTIONS_PATH" || true
  fi
  rm -rf "$SCRATCH"
  exit "$exit_code"
}
trap cleanup EXIT

# Shallow clone + fetch the exact commit. Saves minutes on large repos
# (sympy/django/scikit-learn) vs. a full-history clone. Fallback to a full
# clone if the server rejects the by-commit fetch (GitHub supports
# uploadpack.allowReachableSHA1InWant by default on public repos, but mirrors
# may not).
if ! { git clone --depth 1 --no-single-branch "https://github.com/${REPO_SLUG}.git" "$REPO_DIR" \
    && git -C "$REPO_DIR" fetch --depth 1 origin "$BASE_COMMIT"; }; then
  echo "WARN: shallow fetch failed; falling back to full clone" >&2
  rm -rf "$REPO_DIR"
  git clone "https://github.com/${REPO_SLUG}.git" "$REPO_DIR"
fi
git -C "$REPO_DIR" reset --hard "$BASE_COMMIT"

# ---------- Turn 1: Ingest (populate memory via PostToolUse hook) ----------
INGEST_PROMPT="Please learn about the codebase by systematically and thoroughly reading EVERY SOURCE FILE IN FULL, no matter how many there are. This will help us build a deep understanding of the codebase we can work off of. Don't worry about cost. This is critical and non-negotiable."

SESSION_ID=$(uuidgen | tr '[:upper:]' '[:lower:]')

set +e
(
  cd "$REPO_DIR" && HOME="$SCRATCH" claude \
    --print \
    --session-id "$SESSION_ID" \
    --plugin-dir /opt/claude-mem \
    --permission-mode bypassPermissions \
    --allowedTools "Read,Glob,Grep,Bash(ls *),Bash(wc *)" \
    --max-budget-usd 5.00 \
    --output-format json \
    "$INGEST_PROMPT"
) > "$INGEST_LOG" 2>&1
INGEST_EXIT=$?
set -e

if [[ "$INGEST_EXIT" -ne 0 ]]; then
  echo "WARN: ingest turn exited with $INGEST_EXIT; continuing to fix turn" >&2
fi

# ---------- Turn 2: Fix (consume memory via mem-search slash command) ----------
PROBLEM=$(cat "$PROBLEM_STATEMENT_FILE")
QUERY=$(printf '%s' "$PROBLEM" | tr -s '[:space:]' ' ' | cut -c1-200)

FIX_PROMPT="/claude-mem:mem-search ${QUERY}

Problem statement:
${PROBLEM}

Using what you've learned from the codebase (see memory above), produce a minimal unified diff that fixes this bug. Edit files in place. Do NOT commit."

set +e
(
  cd "$REPO_DIR" && HOME="$SCRATCH" claude \
    --print \
    --resume "$SESSION_ID" \
    --plugin-dir /opt/claude-mem \
    --permission-mode bypassPermissions \
    --allowedTools "Read,Glob,Grep,Edit,Write,Bash(git *),Bash(ls *)" \
    --max-budget-usd 5.00 \
    --output-format json \
    "$FIX_PROMPT"
) > "$FIX_LOG" 2>&1
FIX_EXIT=$?
set -e

if [[ "$FIX_EXIT" -ne 0 ]]; then
  echo "WARN: fix turn exited with $FIX_EXIT; will still emit prediction row" >&2
fi

# ---------- Capture diff and emit prediction row ----------
# Write the diff to DIFF_OUT first (authoritative for the batch orchestrator),
# then read it back for the JSONL row (kept for standalone/smoke-test use).
git -C "$REPO_DIR" diff > "$DIFF_OUT" || : > "$DIFF_OUT"
DIFF=$(cat "$DIFF_OUT")

jq -nc \
  --arg id "$INSTANCE_ID" \
  --arg patch "$DIFF" \
  --arg model "$MODEL_NAME" \
  '{instance_id:$id, model_patch:$patch, model_name_or_path:$model}' \
  >> "$OUT_PREDICTIONS_PATH"

PREDICTION_EMITTED=1
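A quick way to sanity-check the rows this script emits — a sketch, assuming jq on the host:

# One line per instance; a non-zero length means the fix turn produced a diff.
jq -r '[.instance_id, (.model_patch | length)] | @tsv' predictions.jsonl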
Executable
+152
@@ -0,0 +1,152 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
# smoke-test.sh — runs ONE SWE-bench instance end-to-end against the agent
|
||||
# container using OAuth credentials extracted from the host. Use this to
|
||||
# verify the two-turn protocol + /claude-mem:mem-search slash resolution
|
||||
# before kicking off a batch run.
|
||||
#
|
||||
# Usage:
|
||||
# evals/swebench/smoke-test.sh [INSTANCE_ID]
|
||||
#
|
||||
# Defaults to sympy__sympy-24152 (an easy Verified instance) if no arg given.
|
||||
#
|
||||
# Outputs:
|
||||
# evals/swebench/runs/smoke/<INSTANCE_ID>/{ingest.jsonl,fix.jsonl,model_patch.diff}
|
||||
# evals/swebench/runs/smoke/predictions.jsonl
|
||||
|
||||
INSTANCE_ID="${1:-sympy__sympy-24152}"
|
||||
DATASET="${DATASET:-princeton-nlp/SWE-bench_Lite}"
|
||||
IMAGE="${IMAGE:-claude-mem/swebench-agent:latest}"
|
||||
TIMEOUT="${TIMEOUT:-1800}"
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
|
||||
RUN_DIR="$REPO_ROOT/evals/swebench/runs/smoke/$INSTANCE_ID"
|
||||
PREDICTIONS="$REPO_ROOT/evals/swebench/runs/smoke/predictions.jsonl"
|
||||
mkdir -p "$RUN_DIR" "$(dirname "$PREDICTIONS")"
|
||||
|
||||
# --- Extract OAuth credentials ---
|
||||
CREDS_FILE="$(mktemp -t claude-mem-creds.XXXXXX.json)"
|
||||
trap 'rm -f "$CREDS_FILE"' EXIT
|
||||
|
||||
# Try macOS Keychain first (primary on Darwin), then fall through to the
|
||||
# on-disk credentials file — matches docker/claude-mem/run.sh behavior.
|
||||
creds_obtained=0
|
||||
if [[ "$(uname)" == "Darwin" ]]; then
|
||||
if security find-generic-password -s 'Claude Code-credentials' -w > "$CREDS_FILE" 2>/dev/null \
|
||||
&& [[ -s "$CREDS_FILE" ]]; then
|
||||
creds_obtained=1
|
||||
fi
|
||||
fi
|
||||
if [[ "$creds_obtained" -eq 0 && -f "$HOME/.claude/.credentials.json" ]]; then
|
||||
cp "$HOME/.claude/.credentials.json" "$CREDS_FILE"
|
||||
creds_obtained=1
|
||||
fi
|
||||
if [[ "$creds_obtained" -eq 0 ]]; then
|
||||
echo "ERROR: no Claude OAuth creds found (macOS Keychain or ~/.claude/.credentials.json)" >&2
|
||||
exit 1
|
||||
fi
|
||||
chmod 600 "$CREDS_FILE"
|
||||
|
||||
# --- Fetch instance data from HuggingFace via a small Python helper ---
|
||||
INSTANCE_JSON="$(mktemp)"
|
||||
trap 'rm -f "$CREDS_FILE" "$INSTANCE_JSON"' EXIT
|
||||
python3 - "$INSTANCE_ID" "$DATASET" > "$INSTANCE_JSON" <<'PY'
|
||||
import json, sys
|
||||
from datasets import load_dataset
|
||||
target = sys.argv[1]
|
||||
dataset = sys.argv[2]
|
||||
ds = load_dataset(dataset, split="test")
|
||||
for row in ds:
|
||||
if row["instance_id"] == target:
|
||||
print(json.dumps({
|
||||
"instance_id": row["instance_id"],
|
||||
"repo": row["repo"],
|
||||
"base_commit": row["base_commit"],
|
||||
"problem_statement": row["problem_statement"],
|
||||
}))
|
||||
break
|
||||
else:
|
||||
print(f"ERROR: instance {target} not found", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
PY
|
||||
|
||||
SCRATCH="$(mktemp -d -t claude-mem-smoke.XXXXXX)"
|
||||
trap 'rm -f "$CREDS_FILE" "$INSTANCE_JSON"; rm -rf "$SCRATCH"' EXIT
|
||||
|
||||
# Parse the instance JSON once: print repo + base_commit to stdout, write the
|
||||
# problem statement directly to $SCRATCH/problem.txt. INSTANCE_JSON is passed
|
||||
# as argv so stdin is free for the `python3 -` heredoc script body (previously
|
||||
# both were competing for stdin, which made json.load see the heredoc's EOF).
|
||||
read -r REPO BASE_COMMIT < <(
|
||||
python3 - "$SCRATCH" "$INSTANCE_JSON" <<'PY'
|
||||
import json, os, sys
|
||||
scratch, instance_json = sys.argv[1], sys.argv[2]
|
||||
with open(instance_json) as f:
|
||||
d = json.load(f)
|
||||
open(os.path.join(scratch, "problem.txt"), "w").write(d["problem_statement"])
|
||||
print(d["repo"], d["base_commit"])
|
||||
PY
|
||||
)

echo "=== Running $INSTANCE_ID ($REPO @ $BASE_COMMIT) ===" >&2
echo "Scratch: $SCRATCH" >&2
echo "Logs will land in: $RUN_DIR" >&2

# Pick a wall-clock timeout binary. Linux ships `timeout`; macOS needs
# `gtimeout` from coreutils (brew install coreutils). If neither is available,
# warn and run without a cap — the smoke test is manual anyway.
TIMEOUT_CMD=()
if command -v timeout >/dev/null 2>&1; then
  TIMEOUT_CMD=(timeout "$TIMEOUT")
elif command -v gtimeout >/dev/null 2>&1; then
  TIMEOUT_CMD=(gtimeout "$TIMEOUT")
else
  echo "WARN: no \`timeout\`/\`gtimeout\` on PATH; container runs uncapped" >&2
fi

# Name the container so we can force-remove it if the wall-clock timeout
# fires (SIGTERM from timeout leaves the container state open briefly).
CONTAINER_NAME="claude-mem-smoke-$INSTANCE_ID-$$"

set +e
"${TIMEOUT_CMD[@]}" docker run --rm \
  --name "$CONTAINER_NAME" \
  -e CLAUDE_MEM_OUTPUT_DIR=/scratch \
  -e CLAUDE_MEM_CREDENTIALS_FILE=/auth/.credentials.json \
  -v "$SCRATCH:/scratch" \
  -v "$CREDS_FILE:/auth/.credentials.json:ro" \
  "$IMAGE" \
  "$INSTANCE_ID" "$REPO" "$BASE_COMMIT" /scratch/problem.txt /scratch/ignored-predictions.jsonl
DOCKER_EXIT=$?
set -e

if [[ "$DOCKER_EXIT" -eq 124 ]]; then
  # `timeout` signals TERM and returns 124 on timeout. Force-remove the
  # container in case docker hasn't reaped it yet.
  echo "ERROR: docker run exceeded ${TIMEOUT}s wall-clock; removing container" >&2
  docker rm -f "$CONTAINER_NAME" >/dev/null 2>&1 || true
fi

# Copy artifacts from scratch → RUN_DIR
for f in ingest.jsonl fix.jsonl model_patch.diff; do
  [[ -f "$SCRATCH/$f" ]] && cp "$SCRATCH/$f" "$RUN_DIR/$f"
done

# Emit authoritative prediction row
DIFF_FILE="$SCRATCH/model_patch.diff"
DIFF=""
[[ -f "$DIFF_FILE" ]] && DIFF="$(cat "$DIFF_FILE")"
jq -nc \
  --arg id "$INSTANCE_ID" \
  --arg patch "$DIFF" \
  --arg model "claude-opus-4-7+claude-mem" \
  '{instance_id:$id, model_patch:$patch, model_name_or_path:$model}' \
  >> "$PREDICTIONS"

echo "=== Done ===" >&2
echo "Diff size: $(wc -c < "$DIFF_FILE" 2>/dev/null || echo 0) bytes" >&2
echo "Predictions: $PREDICTIONS" >&2
echo "Verify mem-search invocation:" >&2
echo "  grep -o '\"name\":\"[^\"]*mem-search[^\"]*\"' $RUN_DIR/fix.jsonl || echo 'NOT INVOKED'" >&2

evals/swebench/summarize.py (new executable file, +308 lines)

#!/usr/bin/env python3
"""Summarize SWE-bench evaluation run results.

Walks the SWE-bench harness output directory, tallies resolved/unresolved/error
counts, and emits a markdown summary. Optionally diffs against another run.
"""

import argparse
import json
import sys
from pathlib import Path


def load_expected_instance_ids(predictions_path: Path) -> list[str]:
    """Read instance_ids from a predictions.jsonl file (one JSON object per line)."""
    instance_ids: list[str] = []
    if not predictions_path.exists():
        print(
            f"warning: predictions file not found: {predictions_path}",
            file=sys.stderr,
        )
        return instance_ids
    with predictions_path.open("r", encoding="utf-8") as handle:
        for line_number, raw_line in enumerate(handle, start=1):
            stripped = raw_line.strip()
            if not stripped:
                continue
            try:
                record = json.loads(stripped)
            except json.JSONDecodeError as exc:
                print(
                    f"warning: could not parse predictions line {line_number}: {exc}",
                    file=sys.stderr,
                )
                continue
            instance_id = record.get("instance_id")
            if instance_id:
                instance_ids.append(instance_id)
    return instance_ids
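

# Each line of predictions.jsonl is one JSON object, matching what the smoke
# test emits via jq (instance id below is hypothetical):
#   {"instance_id": "astropy__astropy-12907",
#    "model_patch": "diff --git a/...",
#    "model_name_or_path": "claude-opus-4-7+claude-mem"}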


def load_run_results(
    run_id: str,
    model_name: str,
    expected_instance_ids: list[str],
    repo_root: Path,
) -> dict:
    """Walk logs/run_evaluation/<run_id>/<model_name>/*/report.json and tally results.

    Returns a dict:
        {
            "per_instance": {instance_id: {"resolved": bool|None, "notes": str}},
            "resolved_count": int,
            "unresolved_count": int,
            "error_count": int,
        }
    """
    run_logs_root = repo_root / "logs" / "run_evaluation" / run_id / model_name
    per_instance: dict[str, dict] = {}
    resolved_count = 0
    unresolved_count = 0
    error_count = 0

    for instance_id in expected_instance_ids:
        report_path = run_logs_root / instance_id / "report.json"
        if not report_path.exists():
            per_instance[instance_id] = {
                "resolved": None,
                "notes": "missing report.json",
            }
            error_count += 1
            continue
        try:
            with report_path.open("r", encoding="utf-8") as handle:
                report_data = json.load(handle)
        except (json.JSONDecodeError, OSError) as exc:
            per_instance[instance_id] = {
                "resolved": None,
                "notes": f"failed to parse report.json: {exc}",
            }
            error_count += 1
            continue

        # SWE-bench harness typically nests per-instance data under the
        # instance_id key; fall back to the top-level dict for flexibility.
        inner = report_data.get(instance_id, report_data)
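
        # Sketch of a report.json this tolerates (keys taken from the checks
        # below; real harness output may nest or name things differently):
        #   {"<instance_id>": {"resolved": false,
        #       "tests_status": {"FAIL_TO_PASS": {"failure": ["test_a"]}}}}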
        resolved_value = inner.get("resolved")
        if resolved_value is True:
            per_instance[instance_id] = {"resolved": True, "notes": ""}
            resolved_count += 1
        elif resolved_value is False:
            notes_parts: list[str] = []
            tests_status = inner.get("tests_status")
            if isinstance(tests_status, dict):
                fail_to_pass = tests_status.get("FAIL_TO_PASS", {})
                if isinstance(fail_to_pass, dict):
                    failed = fail_to_pass.get("failure", []) or []
                    if failed:
                        notes_parts.append(f"FAIL_TO_PASS failures: {len(failed)}")
            per_instance[instance_id] = {
                "resolved": False,
                "notes": "; ".join(notes_parts),
            }
            unresolved_count += 1
        else:
            per_instance[instance_id] = {
                "resolved": None,
                "notes": "report.json missing 'resolved' field",
            }
            error_count += 1

    return {
        "per_instance": per_instance,
        "resolved_count": resolved_count,
        "unresolved_count": unresolved_count,
        "error_count": error_count,
    }


def format_resolved_cell(resolved: bool | None) -> str:
    if resolved is True:
        return "yes"
    if resolved is False:
        return "no"
    return "error"


def render_summary_markdown(run_id: str, results: dict) -> str:
    total = (
        results["resolved_count"]
        + results["unresolved_count"]
        + results["error_count"]
    )
    resolved = results["resolved_count"]
    resolve_rate = (resolved / total * 100.0) if total > 0 else 0.0

    lines: list[str] = []
    lines.append(f"# Run {run_id}")
    lines.append(f"- Total: {total}")
    lines.append(f"- Resolved: {resolved} ({resolve_rate:.2f}%)")
    lines.append(f"- Unresolved: {results['unresolved_count']}")
    lines.append(f"- Errors: {results['error_count']}")
    lines.append("")
    lines.append("## Per-instance")
    lines.append("| instance_id | resolved | notes |")
    lines.append("|---|---|---|")
    for instance_id, record in results["per_instance"].items():
        resolved_cell = format_resolved_cell(record["resolved"])
        notes_cell = record.get("notes", "") or ""
        # Escape pipe chars in notes to avoid breaking markdown tables.
        notes_cell = notes_cell.replace("|", "\\|")
        lines.append(f"| {instance_id} | {resolved_cell} | {notes_cell} |")
    lines.append("")
    return "\n".join(lines)


def render_diff_markdown(
    current_run_id: str,
    other_run_id: str,
    current_results: dict,
    other_results: dict,
) -> str:
    def resolve_rate(results: dict) -> tuple[int, float]:
        total = (
            results["resolved_count"]
            + results["unresolved_count"]
            + results["error_count"]
        )
        rate = (results["resolved_count"] / total * 100.0) if total > 0 else 0.0
        return total, rate

    current_total, current_rate = resolve_rate(current_results)
    other_total, other_rate = resolve_rate(other_results)
    rate_delta = current_rate - other_rate

    lines: list[str] = []
    lines.append(f"# Diff vs {other_run_id}")
    lines.append(
        f"- {current_run_id}: {current_results['resolved_count']}/{current_total} "
        f"({current_rate:.2f}%)"
    )
    lines.append(
        f"- {other_run_id}: {other_results['resolved_count']}/{other_total} "
        f"({other_rate:.2f}%)"
    )
    lines.append(f"- Delta: {rate_delta:+.2f} percentage points")
    lines.append("")
    lines.append("## Per-instance status changes")
    lines.append(f"| instance_id | {other_run_id} | {current_run_id} |")
    lines.append("|---|---|---|")

    all_instance_ids = set(current_results["per_instance"].keys()) | set(
        other_results["per_instance"].keys()
    )
    changes_found = False
    for instance_id in sorted(all_instance_ids):
        current_record = current_results["per_instance"].get(instance_id)
        other_record = other_results["per_instance"].get(instance_id)
        current_status = (
            format_resolved_cell(current_record["resolved"])
            if current_record
            else "absent"
        )
        other_status = (
            format_resolved_cell(other_record["resolved"])
            if other_record
            else "absent"
        )
        if current_status != other_status:
            lines.append(f"| {instance_id} | {other_status} | {current_status} |")
            changes_found = True
    if not changes_found:
        lines.append("| (no status changes) | | |")
    lines.append("")
    return "\n".join(lines)


def main() -> int:
    parser = argparse.ArgumentParser(
        description="Summarize SWE-bench evaluation run results."
    )
    parser.add_argument(
        "--run-id",
        required=True,
        help="Run identifier used in logs/run_evaluation/<run_id>/ and evals/swebench/runs/<run_id>/.",
    )
    parser.add_argument(
        "--compare",
        metavar="OTHER_RUN_ID",
        default=None,
        help="Optional other run_id to diff resolve rates and per-instance status changes against.",
    )
    parser.add_argument(
        "--model-name",
        default="claude-opus-4-7+claude-mem",
        help="Model name directory inside logs/run_evaluation/<run_id>/.",
    )
    parser.add_argument(
        "--out",
        default=None,
        help="Output path for the markdown summary (default: evals/swebench/runs/<run_id>/summary.md).",
    )
    args = parser.parse_args()

    # Resolve repo root from this script's location: evals/swebench/summarize.py
    script_path = Path(__file__).resolve()
    repo_root = script_path.parent.parent.parent

    current_predictions_path = (
        repo_root / "evals" / "swebench" / "runs" / args.run_id / "predictions.jsonl"
    )
    current_instance_ids = load_expected_instance_ids(current_predictions_path)
    current_results = load_run_results(
        run_id=args.run_id,
        model_name=args.model_name,
        expected_instance_ids=current_instance_ids,
        repo_root=repo_root,
    )

    summary_markdown = render_summary_markdown(args.run_id, current_results)

    if args.compare:
        other_predictions_path = (
            repo_root
            / "evals"
            / "swebench"
            / "runs"
            / args.compare
            / "predictions.jsonl"
        )
        other_instance_ids = load_expected_instance_ids(other_predictions_path)
        other_results = load_run_results(
            run_id=args.compare,
            model_name=args.model_name,
            expected_instance_ids=other_instance_ids,
            repo_root=repo_root,
        )
        diff_markdown = render_diff_markdown(
            current_run_id=args.run_id,
            other_run_id=args.compare,
            current_results=current_results,
            other_results=other_results,
        )
        summary_markdown = summary_markdown + "\n" + diff_markdown

    if args.out:
        output_path = Path(args.out)
        if not output_path.is_absolute():
            output_path = (Path.cwd() / output_path).resolve()
    else:
        output_path = (
            repo_root
            / "evals"
            / "swebench"
            / "runs"
            / args.run_id
            / "summary.md"
        )

    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(summary_markdown, encoding="utf-8")

    print(str(output_path))
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
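
# Example usage (run ids hypothetical):
#   python3 evals/swebench/summarize.py --run-id smoke
#   python3 evals/swebench/summarize.py --run-id claude-mem-lite --compare baseline-lite
# On success the script prints the path of the summary.md it wrote.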

(Diffs for three further files were suppressed by the viewer because one or
more lines are too long; the visible counts were +191/−191 and +11/−11.)