feat: basic claude-mem Docker container for easy spin-up (#2076)

* feat(evals): SWE-bench Docker scaffolding for claude-mem resolve-rate measurement

  Adds evals/swebench/ scaffolding per .claude/plans/swebench-claude-mem-docker.md. Agent image builds Claude Code 2.1.114 + locally-built claude-mem plugin; run-instance.sh executes the two-turn ingest/fix protocol per instance; run-batch.py orchestrates parallel Docker runs with per-instance isolation; eval.sh wraps the upstream SWE-bench harness; summarize.py aggregates reports. Orchestrator owns JSONL writes under a lock to avoid racy concurrent appends; agent writes its authoritative diff to CLAUDE_MEM_OUTPUT_DIR (/scratch in container mode) and the orchestrator reads it back. Scaffolding only — no Docker build or smoke test run yet.

* feat(evals): OAuth credential mounting for Claude Max/Pro subscriptions

  Skips per-call API billing by extracting OAuth creds from host Keychain (macOS) or ~/.claude/.credentials.json (Linux) and bind-mounting them read-only into each agent container. Creds are copied into HOME=$SCRATCH/.claude at container start so the per-instance isolation model still holds. Adds run-batch.py --auth {oauth,api-key,auto} (auto prefers OAuth, falls back to API key). run-instance.sh accepts either ANTHROPIC_API_KEY or CLAUDE_MEM_CREDENTIALS_FILE. smoke-test.sh runs one instance end-to-end using OAuth for quick verification before batch runs. Caveat surfaced in docstrings: Max/Pro has per-window usage limits and is framed for individual developer use — batch evaluation may exhaust the quota or raise compliance questions.

* feat(docker): basic claude-mem container for ad-hoc testing

  Adds docker/claude-mem/ with a fresh spin-up image:
  - Dockerfile: FROM node:20 (reproduces anthropics/claude-code .devcontainer pattern — Anthropic ships the Dockerfile, not a pullable image); layers Bun + uv + locally-built plugin/; runs as non-root node user
  - entrypoint.sh: seeds OAuth creds from CLAUDE_MEM_CREDENTIALS_FILE into $HOME/.claude/.credentials.json, then exec's the command (default: bash)
  - build.sh: npm run build + docker build
  - run.sh: interactive launcher; auto-extracts OAuth from macOS Keychain (security find-generic-password) or ~/.claude/.credentials.json on Linux, mounts host .docker-claude-mem-data/ at /home/node/.claude-mem so the observations DB survives container exit

  Validated end-to-end: PostToolUse hook fires, queue enqueues, worker's SDK compression runs under subscription OAuth, observations row lands with populated facts/concepts/files_read, Chroma sync triggers. Also updates .gitignore/.dockerignore for the new runtime-output paths. Built plugin artifacts refreshed by the build step.

* fix(evals/swebench): non-root user, OAuth mount, Lite dataset default

  - Dockerfile.agent: switch to non-root `node` user (uid 1000); Claude Code refuses --permission-mode bypassPermissions when euid==0, which made every agent run exit 1 before producing a diff. Also move Bun + uv installs to system paths so the non-root user can exec them.
  - run-batch.py: add extract_oauth_credentials() that pulls from macOS Keychain / Linux ~/.claude/.credentials.json into a temp file and bind-mounts it at /auth/.credentials.json:ro with CLAUDE_MEM_CREDENTIALS_FILE. New --auth {oauth,api-key,auto} flag. New --dataset flag so the batch can target SWE-bench_Lite without editing the script.
  - smoke-test.sh: default DATASET to princeton-nlp/SWE-bench_Lite (Lite contains sympy__sympy-24152, Verified does not); accept DATASET env override.

  Caveat surfaced during testing: Max/Pro subscriptions have per-window usage limits; running 5 instances in parallel with the "read every source file" ingest prompt exhausted the 5h window within ~25 minutes (3/5 hit HTTP 429).

* fix: address PR #2076 review comments

  - docker/claude-mem/run.sh: chmod 600 (not 644) on extracted OAuth creds to match what `claude login` writes; avoids exposing tokens to other host users. Verified readable inside the container under Docker Desktop's UID translation.
  - docker/claude-mem/Dockerfile: pin Bun + uv via --build-arg BUN_VERSION / UV_VERSION (defaults: 1.3.12, 0.11.7). Bun via `bash -s "bun-v<V>"`; uv via versioned installer URL `https://astral.sh/uv/<V>/install.sh`.
  - evals/swebench/smoke-test.sh: pipe JSON through stdin to `python3 -c` so paths with spaces/special chars can't break shell interpolation.
  - evals/swebench/run-batch.py: add --overwrite flag; abort by default when predictions.jsonl for the run-id already exists, preventing accidental silent discard of partial results.

* fix: address coderabbit review on PR #2076

  Actionable (4):
  - Dockerfile uv install: wrap `chmod ... || true` in braces so the trailing `|| true` no longer masks failures from `curl|sh` via bash operator precedence (&& binds tighter than ||). Applied to both docker/claude-mem/ and evals/swebench/Dockerfile.agent. Added `set -eux` to the RUN lines.
  - docker/claude-mem/Dockerfile: drop unused `sudo` apt package (~2 MB).
  - run-batch.py: name each agent container (`swebench-agent-<id>-<pid>-<tid>`) and force-remove via `docker rm -f <name>` in the TimeoutExpired handler so timed-out runs don't leave orphan containers.

  Nitpicks (2):
  - smoke-test.sh: collapse 3 python3 invocations into 1 — parse the instance JSON once, print `repo base_commit`, and write problem.txt in the same call.
  - run-instance.sh: shallow clone via `--depth 1 --no-single-branch` + `fetch --depth 1 origin $BASE_COMMIT`. Falls back to a full clone if the server rejects the by-commit fetch.

* fix: address second coderabbit review on PR #2076

  Actionable (3):
  - docker/claude-mem/run.sh: on macOS, fall back to ~/.claude/.credentials.json when the Keychain lookup misses (some setups still have file-only creds). Unified into a single creds_obtained gate so the error surface lists both sources tried.
  - docker/claude-mem/run.sh: drop `exec docker run` — `exec` replaces the shell so the EXIT trap (`rm -f "$CREDS_FILE"`) never fires and the extracted OAuth JSON leaks to disk until tmpfs cleanup. Run as a child instead so the trap runs on exit.
  - evals/swebench/smoke-test.sh: actually enforce the TIMEOUT env var. Pick `timeout` or `gtimeout` (coreutils on macOS), fall back to uncapped with a warning. Name the container so exit-124 from timeout can `docker rm -f` it deterministically.

  Nitpick from the same review (consolidated python3 calls in smoke-test.sh) was already addressed in the prior commit ef621e00.

* fix: address third coderabbit review on PR #2076

  Actionable (1):
  - evals/swebench/smoke-test.sh: the consolidated python heredoc had competing stdin redirections — `<<'PY'` (script body) AND `< "$INSTANCE_JSON"` (data). The heredoc won, so `json.load(sys.stdin)` saw an empty stream and the parse would have failed at runtime. Pass INSTANCE_JSON as argv[2] and `open()` it inside the script instead; the heredoc is now only the script body, which is what `python3 -` needs.

  Nitpicks (2):
  - evals/swebench/smoke-test.sh: macOS Keychain lookup now falls through to ~/.claude/.credentials.json on miss (matches docker/claude-mem/run.sh).
  - evals/swebench/run-batch.py: extract_oauth_credentials() no longer early-returns on Darwin keychain miss; falls through to the on-disk creds file so macOS setups with file-only credentials work in batch mode too.

  Functional spot-check of the parse fix confirmed: REPO/BASE_COMMIT populated and problem.txt written from a synthetic INSTANCE_JSON.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
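For orientation, the shortest path from this PR's scripts to a working container — a sketch, assuming Docker is running and `claude login` has been done on the host:

docker/claude-mem/build.sh    # npm run build + docker build -t claude-mem:basic
docker/claude-mem/run.sh      # extracts OAuth creds, mounts .docker-claude-mem-data/, drops into bash
# After the session, inspect the persisted observations DB on the host:
sqlite3 .docker-claude-mem-data/claude-mem.db 'select count(*) from observations'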
.dockerignore
@@ -0,0 +1,9 @@
# Keep the build context small for evals/swebench/Dockerfile.agent.
# The Dockerfile needs `plugin/` and `evals/swebench/` — do NOT exclude them.
node_modules/
.git/
logs/
evals/swebench/runs/
.docker-claude-mem-data/
.venv
.venv-*
.gitignore
@@ -42,3 +42,12 @@ plugin/.cli-installed

# Local contribution analysis (not part of upstream)
CONTRIB_NOTES.md

# Docker container runtime data (basic claude-mem container)
.docker-claude-mem-data/

# SWE-bench eval outputs
evals/swebench/runs/
claude-opus-4-7+claude-mem.*.json
logs/run_evaluation/
.venv-swebench/
docker/claude-mem/Dockerfile (+93)
@@ -0,0 +1,93 @@
# Basic claude-mem container for ad-hoc testing.
#
# Base layout mirrors anthropics/claude-code .devcontainer
# (https://github.com/anthropics/claude-code/blob/main/.devcontainer/Dockerfile):
# FROM node:20, non-root `node` user, global npm install of @anthropic-ai/claude-code.
# We skip the firewall/zsh/fzf/delta/git-hist noise since this image is for
# exercising claude-mem, not as a full dev environment.
#
# On top of that base we install:
#   - Bun (claude-mem worker service runtime)
#   - uv (provides Python for Chroma per CLAUDE.md)
#   - The locally-built plugin/ tree at /opt/claude-mem
#
# Usage:
#   docker build -f docker/claude-mem/Dockerfile -t claude-mem:basic .
#   docker run --rm -it \
#     -v $(mktemp -d):/home/node/.claude-mem \
#     -e CLAUDE_MEM_CREDENTIALS_FILE=/auth/.credentials.json \
#     -v /path/to/extracted/creds.json:/auth/.credentials.json:ro \
#     claude-mem:basic

FROM node:20

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        git \
        curl \
        ca-certificates \
        unzip \
        jq \
        less \
        procps \
        uuid-runtime \
        sqlite3 \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# Bun — system-wide so the unprivileged `node` user can execute it.
# Pin via --build-arg BUN_VERSION=X.Y.Z; default is the version verified at PR time.
ARG BUN_VERSION=1.3.12
ENV BUN_INSTALL="/usr/local/bun"
RUN curl -fsSL https://bun.sh/install | bash -s "bun-v${BUN_VERSION}" \
    && chmod -R a+rX /usr/local/bun
ENV PATH="/usr/local/bun/bin:${PATH}"

# uv — system-wide, for Chroma's Python runtime. Pin via --build-arg UV_VERSION=X.Y.Z.
# Versioned installer URL per https://docs.astral.sh/uv/getting-started/installation/.
ARG UV_VERSION=0.11.7
ENV UV_INSTALL_DIR="/usr/local/bin"
# `&&` binds tighter than `||` in bash, so the previous form let `curl|sh` fail
# silently via the trailing `|| true`. Group the chmod so tolerated failure is
# scoped to perms-fixing only.
RUN set -eux \
    && curl -LsSf "https://astral.sh/uv/${UV_VERSION}/install.sh" | sh \
    && { chmod a+rX /usr/local/bin/uv /usr/local/bin/uvx 2>/dev/null || true; }

# Match the upstream devcontainer's npm-global prefix so `npm install -g`
# targets a dir the `node` user owns.
RUN mkdir -p /usr/local/share/npm-global \
    && chown -R node:node /usr/local/share/npm-global
ENV NPM_CONFIG_PREFIX=/usr/local/share/npm-global
ENV PATH="/usr/local/share/npm-global/bin:${PATH}"

# Claude Code CLI. Override at build-time with --build-arg CLAUDE_CODE_VERSION=X.Y.Z
# to pin; default tracks latest.
ARG CLAUDE_CODE_VERSION=latest
USER node
RUN npm install -g @anthropic-ai/claude-code@${CLAUDE_CODE_VERSION}

# Locally-built claude-mem plugin. COPY runs as root by default and layers are
# cached, so put this after the npm install so iterating on the plugin doesn't
# invalidate the CLI install layer.
USER root
COPY plugin/ /opt/claude-mem/
RUN chown -R node:node /opt/claude-mem

# Persistent mount points for ad-hoc testing — mount a host dir at either of
# these to inspect the claude-mem DB after a session.
RUN mkdir -p /home/node/.claude /home/node/.claude-mem \
    && chown -R node:node /home/node/.claude /home/node/.claude-mem

USER node
WORKDIR /home/node

# Helper: copies OAuth creds out of the read-only mount into $HOME/.claude/
# before exec'ing whatever you asked for. Saves the "cp + chmod" dance every
# time you drop in.
COPY --chown=node:node docker/claude-mem/entrypoint.sh /usr/local/bin/claude-mem-entrypoint
RUN chmod +x /usr/local/bin/claude-mem-entrypoint

ENTRYPOINT ["/usr/local/bin/claude-mem-entrypoint"]
CMD ["bash"]
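The operator-precedence pitfall the uv RUN comment describes is easy to demonstrate outside Docker; a minimal sketch with shell builtins standing in for the real install and chmod steps:

# Without braces, `a && b || true` parses as `(a && b) || true`,
# so a failing `a` (the curl|sh install) is silently tolerated:
false && echo "chmod step" || true          # exit 0 — install failure masked
# With the chmod grouped, `|| true` only absorbs the chmod's failure:
false && { echo "chmod step" || true; }     # exit 1 — install failure surfaces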
docker/claude-mem/build.sh (Executable, +24)
@@ -0,0 +1,24 @@
#!/usr/bin/env bash
# Build the basic claude-mem Docker image from the current worktree.
#
# Usage:
#   docker/claude-mem/build.sh              # builds claude-mem:basic
#   TAG=my-tag docker/claude-mem/build.sh   # override the tag
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
TAG="${TAG:-claude-mem:basic}"

cd "$REPO_ROOT"

echo "[build] npm run build"
npm run build

echo "[build] docker build -t $TAG"
docker build \
  -f docker/claude-mem/Dockerfile \
  -t "$TAG" \
  "$REPO_ROOT"

echo "[build] done: $TAG"
docker/claude-mem/entrypoint.sh (Executable, +28)
@@ -0,0 +1,28 @@
#!/usr/bin/env bash
# Entrypoint for the basic claude-mem container. Seeds OAuth creds if a
# credentials file is mounted, then exec's whatever was passed (default: bash).
#
# Env vars:
#   CLAUDE_MEM_CREDENTIALS_FILE  Path to a mounted OAuth credentials JSON file
#                                (e.g. /auth/.credentials.json). Copied into
#                                $HOME/.claude/.credentials.json at startup.
#   ANTHROPIC_API_KEY            Standard API-key auth; set when OAuth isn't used.

set -euo pipefail

mkdir -p "$HOME/.claude" "$HOME/.claude-mem"

if [[ -n "${CLAUDE_MEM_CREDENTIALS_FILE:-}" ]]; then
  if [[ ! -f "$CLAUDE_MEM_CREDENTIALS_FILE" ]]; then
    echo "ERROR: CLAUDE_MEM_CREDENTIALS_FILE set but file missing: $CLAUDE_MEM_CREDENTIALS_FILE" >&2
    exit 1
  fi
  cp "$CLAUDE_MEM_CREDENTIALS_FILE" "$HOME/.claude/.credentials.json"
  chmod 600 "$HOME/.claude/.credentials.json"
fi

# Helpful one-liner for interactive users: run `claude` with the plugin dir
# preconfigured. Don't force it — `exec "$@"` lets you override freely.
export PATH="/usr/local/bun/bin:/usr/local/share/npm-global/bin:$PATH"

exec "$@"
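A quick check of the seeding path — a sketch, where creds.json is a placeholder for a credentials file already extracted on the host (see run.sh below):

docker run --rm \
  -e CLAUDE_MEM_CREDENTIALS_FILE=/auth/.credentials.json \
  -v "$PWD/creds.json:/auth/.credentials.json:ro" \
  claude-mem:basic \
  ls -l /home/node/.claude/.credentials.json   # entrypoint copied it and chmod'ed 600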
docker/claude-mem/run.sh (Executable, +69)
@@ -0,0 +1,69 @@
#!/usr/bin/env bash
# Drop into an interactive claude-mem container with OAuth creds + persistent
# memory volume. For ad-hoc testing / poking around.
#
# Usage:
#   docker/claude-mem/run.sh
#   docker/claude-mem/run.sh claude --plugin-dir /opt/claude-mem --print "hi"
#
# On exit, the mounted .claude-mem/ dir on the host survives so you can inspect
# the DB: `sqlite3 <HOST_MEM_DIR>/claude-mem.db 'select count(*) from observations'`.
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
TAG="${TAG:-claude-mem:basic}"

HOST_MEM_DIR="${HOST_MEM_DIR:-$REPO_ROOT/.docker-claude-mem-data}"
mkdir -p "$HOST_MEM_DIR"
echo "[run] host .claude-mem dir: $HOST_MEM_DIR" >&2

# Auth. Prefer OAuth (extracted from macOS Keychain / Linux creds file);
# fall back to ANTHROPIC_API_KEY env.
CREDS_FILE=""
CREDS_MOUNT_ARGS=()
if [[ -z "${ANTHROPIC_API_KEY:-}" ]]; then
  CREDS_FILE="$(mktemp -t claude-mem-creds.XXXXXX.json)"
  trap 'rm -f "$CREDS_FILE"' EXIT

  # Try macOS Keychain first (primary storage on Darwin), then fall back to
  # the on-disk credentials file — some macOS setups (older CLI versions,
  # users who migrated machines) still have the file-only form.
  creds_obtained=0
  if [[ "$(uname)" == "Darwin" ]]; then
    if security find-generic-password -s 'Claude Code-credentials' -w > "$CREDS_FILE" 2>/dev/null \
        && [[ -s "$CREDS_FILE" ]]; then
      creds_obtained=1
    fi
  fi
  if [[ "$creds_obtained" -eq 0 && -f "$HOME/.claude/.credentials.json" ]]; then
    cp "$HOME/.claude/.credentials.json" "$CREDS_FILE"
    creds_obtained=1
  fi
  if [[ "$creds_obtained" -eq 0 ]]; then
    echo "ERROR: no ANTHROPIC_API_KEY set and no Claude OAuth credentials found." >&2
    echo "       Tried: macOS Keychain ('Claude Code-credentials') and ~/.claude/.credentials.json." >&2
    echo "       Run \`claude login\` on the host first, or set ANTHROPIC_API_KEY." >&2
    exit 1
  fi
  chmod 600 "$CREDS_FILE"
  CREDS_MOUNT_ARGS=(
    -e CLAUDE_MEM_CREDENTIALS_FILE=/auth/.credentials.json
    -v "$CREDS_FILE:/auth/.credentials.json:ro"
  )
else
  CREDS_MOUNT_ARGS=(-e ANTHROPIC_API_KEY)
fi

# Pick -it only when a TTY is attached (keeps non-interactive callers working).
TTY_ARGS=()
[[ -t 0 && -t 1 ]] && TTY_ARGS=(-it)

# NOT `exec` — we want the EXIT trap above to run and remove $CREDS_FILE
# after the container exits. Running docker as a child keeps the shell
# alive long enough for the trap to fire.
docker run --rm "${TTY_ARGS[@]}" \
  "${CREDS_MOUNT_ARGS[@]}" \
  -v "$HOST_MEM_DIR:/home/node/.claude-mem" \
  "$TAG" \
  "$@"
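Why the script avoids `exec docker run` is easy to see in isolation; a minimal sketch of the trap behavior the review fix relies on:

#!/usr/bin/env bash
trap 'echo "EXIT trap: cleanup ran"' EXIT
exec true   # replaces the shell process — the EXIT trap above never fires
# Running `true` as a child command instead lets the shell exit normally,
# so the trap fires and temp-file cleanup actually happens.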
evals/swebench/Dockerfile.agent (+74)
@@ -0,0 +1,74 @@
# claude-mem SWE-bench agent image
# Plan: .claude/plans/swebench-claude-mem-docker.md (Phase 1)
#
# Produces `claude-mem/swebench-agent:latest`: Claude Code CLI 2.1.114 +
# locally-built claude-mem plugin, ready to run headlessly per SWE-bench
# instance. Auth (ANTHROPIC_API_KEY) is passed at runtime, never baked in.

FROM node:20-bookworm-slim

ENV DEBIAN_FRONTEND=noninteractive

# System dependencies:
#   git, curl, ca-certificates, unzip — base tooling (Bun installer needs unzip)
#   jq           — JSONL assembly in run-instance.sh
#   uuid-runtime — uuidgen for per-instance session IDs (Phase 2)
#   sqlite3      — verifies the claude-mem observations DB
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        git \
        curl \
        ca-certificates \
        unzip \
        jq \
        uuid-runtime \
        sqlite3 \
    && rm -rf /var/lib/apt/lists/*

# Bun (claude-mem worker service runs under Bun). Installed to a system
# location so the non-root runtime user can execute it.
ENV BUN_INSTALL="/usr/local/bun"
RUN curl -fsSL https://bun.sh/install | bash \
    && chmod -R a+rX /usr/local/bun
ENV PATH="/usr/local/bun/bin:${PATH}"

# uv (provides Python for Chroma per CLAUDE.md). Installed to a system
# location, same reason.
ENV UV_INSTALL_DIR="/usr/local/bin"
# Group the chmod so the trailing `|| true` only absorbs chmod failures; without
# this grouping, bash precedence (`&&` binds tighter than `||`) would silently
# mask a failed `curl|sh` install step.
RUN set -eux \
    && curl -LsSf https://astral.sh/uv/install.sh | sh \
    && { chmod a+rX /usr/local/bin/uv /usr/local/bin/uvx 2>/dev/null || true; }

# Claude Code CLI — PINNED to the version whose flag surface was verified in
# the plan (Phase 0). Do NOT bump without re-verifying flags.
RUN npm install -g @anthropic-ai/claude-code@2.1.114

# Locally-built claude-mem plugin. The build-agent-image.sh wrapper runs
# `npm run build` before `docker build`, so plugin/ is populated in the build
# context. We do NOT install claude-mem from npm — we want the current
# worktree under test.
COPY plugin/ /opt/claude-mem/

# Runner script — entrypoint for per-instance invocation (Phase 2 deliverable).
COPY evals/swebench/run-instance.sh /evals/swebench/run-instance.sh
RUN chmod +x /evals/swebench/run-instance.sh

# Pre-create per-instance config dirs. run-instance.sh overrides HOME to a
# scratch dir for isolation, but having these present keeps tools from
# bailing if they probe the default locations before HOME is set.
RUN mkdir -p /root/.claude /root/.claude-mem

# Non-root user. Claude Code refuses `--dangerously-skip-permissions` /
# `--permission-mode bypassPermissions` when euid==0 as a safety rail, so we
# need an unprivileged user for headless batch runs. node:20 already ships a
# `node` user at uid 1000 — reuse it.
RUN mkdir -p /home/node/.claude /home/node/.claude-mem \
    && chown -R node:node /home/node /opt/claude-mem

USER node
WORKDIR /home/node

ENTRYPOINT ["/evals/swebench/run-instance.sh"]
evals/swebench/build-agent-image.sh (Executable, +20)
@@ -0,0 +1,20 @@
#!/usr/bin/env bash
# Build the claude-mem SWE-bench agent image.
# Plan: .claude/plans/swebench-claude-mem-docker.md (Phase 1, step 2)
set -euo pipefail

# Resolve repo root (two levels up from this script: evals/swebench -> repo).
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"

cd "$REPO_ROOT"

# 1. Build the plugin so plugin/ is populated for the COPY step in the Dockerfile.
npm run build

# 2. Build the agent image. Context is the repo root so both plugin/ and
#    evals/swebench/run-instance.sh are reachable.
docker build \
  -f evals/swebench/Dockerfile.agent \
  -t claude-mem/swebench-agent:latest \
  .
evals/swebench/eval.sh (Executable, +72)
@@ -0,0 +1,72 @@
#!/usr/bin/env bash
set -euo pipefail

# eval.sh — Thin wrapper around `python -m swebench.harness.run_evaluation`.
#
# Required env:
#   RUN_ID       Identifier for this evaluation run (matches predictions dir).
# Optional env:
#   MAX_WORKERS  Parallel worker count for the harness (default: 4).
#   DATASET      HF dataset name (default: princeton-nlp/SWE-bench_Verified).
#   TIMEOUT      Per-instance timeout in seconds (default: 1800).
#
# Reports land at:
#   logs/run_evaluation/$RUN_ID/claude-opus-4-7+claude-mem/<instance_id>/report.json

: "${RUN_ID:?RUN_ID is required (e.g. RUN_ID=smoke-001)}"
MAX_WORKERS="${MAX_WORKERS:-4}"
DATASET="${DATASET:-princeton-nlp/SWE-bench_Verified}"
TIMEOUT="${TIMEOUT:-1800}"

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
cd "$REPO_ROOT"

PREDICTIONS="evals/swebench/runs/$RUN_ID/predictions.jsonl"

if [[ ! -f "$PREDICTIONS" ]]; then
  echo "ERROR: predictions file not found: $PREDICTIONS" >&2
  echo "Hint: run Phase 3 agent loop first to produce predictions.jsonl for RUN_ID=$RUN_ID." >&2
  exit 1
fi

# Harness REQUIRES Docker — fail fast with a clean message if it's not running.
if ! command -v docker >/dev/null 2>&1; then
  echo "ERROR: docker CLI not found on PATH. The SWE-bench harness requires Docker." >&2
  exit 1
fi
if ! docker info >/dev/null 2>&1; then
  echo "ERROR: Docker daemon is not running. Start Docker Desktop (or the docker service) and retry." >&2
  exit 1
fi

# Create/reuse a dedicated venv so we don't pollute the system Python.
VENV_DIR=".venv-swebench"
if [[ ! -d "$VENV_DIR" ]]; then
  echo "[eval.sh] Creating Python venv at $VENV_DIR ..."
  python3 -m venv "$VENV_DIR"
fi
# shellcheck disable=SC1091
source "$VENV_DIR/bin/activate"

echo "[eval.sh] Installing/updating swebench in $VENV_DIR ..."
pip install -q swebench

echo "[eval.sh] Running harness:"
echo "  dataset:      $DATASET"
echo "  predictions:  $PREDICTIONS"
echo "  max_workers:  $MAX_WORKERS"
echo "  run_id:       $RUN_ID"
echo "  timeout:      $TIMEOUT"

python -m swebench.harness.run_evaluation \
  --dataset_name "$DATASET" \
  --predictions_path "$PREDICTIONS" \
  --max_workers "$MAX_WORKERS" \
  --run_id "$RUN_ID" \
  --timeout "$TIMEOUT"

REPORTS_DIR="logs/run_evaluation/$RUN_ID/claude-opus-4-7+claude-mem"
echo ""
echo "[eval.sh] Done. Per-instance reports at:"
echo "  $REPORTS_DIR/<instance_id>/report.json"
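End-to-end, a typical pairing of the batch run and the harness looks like this — a sketch using the defaults documented above:

# 1. Produce predictions with the agent batch (run-batch.py, next file):
python evals/swebench/run-batch.py --run-id smoke-001 --limit 3 --max-concurrent 2
# 2. Score them with the upstream harness via this wrapper:
RUN_ID=smoke-001 MAX_WORKERS=2 evals/swebench/eval.sh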
evals/swebench/run-batch.py (Executable, +561)
@@ -0,0 +1,561 @@
#!/usr/bin/env python3
"""
Batch orchestrator for SWE-bench evaluation of Claude Code + claude-mem.

Iterates a list of SWE-bench Verified instances, launches a per-instance Docker
container (`claude-mem/swebench-agent:latest`) that runs the two-turn
ingest/fix protocol, and collects all resulting diffs into a single
`predictions.jsonl` compatible with the upstream SWE-bench harness.

Usage:
    python evals/swebench/run-batch.py \
        --run-id claude-mem-baseline-001 \
        --limit 3 \
        --max-concurrent 2

Rate-limit note: Anthropic API rate limits can bite quickly. The default
`--max-concurrent` is 4, but it is safer to START WITH 2 and raise the cap
only after observing no 429s in the logs.
"""

from __future__ import annotations

import argparse
import atexit
import json
import os
import platform
import shutil
import stat
import subprocess
import sys
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Any, Iterable

from datasets import load_dataset


# Hidden-from-agent fields per the plan. We MUST NOT pass these to the agent
# container — they are evaluator-only ground truth.
HIDDEN_AGENT_FIELDS = (
    "patch",
    "test_patch",
    "FAIL_TO_PASS",
    "PASS_TO_PASS",
    "environment_setup_commit",
    "version",
)


def extract_oauth_credentials() -> Path | None:
    """
    Extract Claude Code OAuth credentials (from a Max/Pro subscription) to a
    temp file the container can bind-mount. Returns the temp file path, or
    None if extraction failed / no creds present.

    macOS: creds live in the Keychain under service "Claude Code-credentials".
    Linux: creds live at ~/.claude/.credentials.json.

    CAVEAT: Anthropic Max/Pro subscriptions have usage limits (per ~5h window)
    and their ToS is framed around individual developer use. Running batch
    evaluation across parallel containers may exhaust the quota quickly or
    raise compliance concerns. This helper exists because the user explicitly
    requested it; the caller is responsible for the policy call.

    The token may age out mid-run; we mount read-only so refresh writes fail
    silently inside the container (the underlying token in the host
    Keychain/file is untouched).
    """
    temp = tempfile.NamedTemporaryFile(
        prefix="claude-mem-creds-",
        suffix=".json",
        delete=False,
    )
    temp_path = Path(temp.name)
    temp.close()
    # Clean up on process exit, even on crash.
    atexit.register(lambda: temp_path.unlink(missing_ok=True))

    # macOS: try Keychain first (primary storage on Darwin). On miss, fall
    # through to the on-disk credentials file — some macOS setups (older CLI,
    # migrated machines) only have the file form.
    if platform.system() == "Darwin":
        try:
            completed = subprocess.run(
                [
                    "security",
                    "find-generic-password",
                    "-s",
                    "Claude Code-credentials",
                    "-w",
                ],
                capture_output=True,
                text=True,
                check=False,
            )
            if completed.returncode == 0 and completed.stdout.strip():
                temp_path.write_text(completed.stdout.strip(), encoding="utf-8")
                temp_path.chmod(stat.S_IRUSR | stat.S_IWUSR)
                return temp_path
            # else fall through to the on-disk credentials check below
        except FileNotFoundError:
            print(
                "WARN: `security` command not available; trying on-disk creds.",
                file=sys.stderr,
            )
            # fall through to the on-disk credentials check below

    # Both platforms (and macOS fallback): read the on-disk credentials file.
    creds_file = Path.home() / ".claude" / ".credentials.json"
    if creds_file.exists():
        temp_path.write_text(creds_file.read_text(encoding="utf-8"), encoding="utf-8")
        temp_path.chmod(stat.S_IRUSR | stat.S_IWUSR)
        return temp_path

    if platform.system() == "Darwin":
        print(
            "WARN: Claude Code-credentials not found in macOS Keychain and "
            "~/.claude/.credentials.json missing. Run `claude login` on the "
            "host first, or fall back to ANTHROPIC_API_KEY.",
            file=sys.stderr,
        )
    return None


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Run the claude-mem SWE-bench agent on a batch of instances.",
    )
    parser.add_argument(
        "--instance-ids",
        nargs="+",
        default=None,
        help="Optional explicit list of instance_ids to run.",
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="If set, process only the first N instances after filtering.",
    )
    parser.add_argument(
        "--max-concurrent",
        type=int,
        default=4,
        help="Max concurrent agent containers (default 4; start with 2 and raise after observing no 429s).",
    )
    parser.add_argument(
        "--run-id",
        type=str,
        required=True,
        help="Run identifier; used for output paths.",
    )
    parser.add_argument(
        "--out",
        type=str,
        default=None,
        help="Path to predictions.jsonl (default: evals/swebench/runs/<run_id>/predictions.jsonl).",
    )
    parser.add_argument(
        "--timeout",
        type=int,
        default=1800,
        help="Per-instance timeout in seconds (default 1800, matches upstream harness).",
    )
    parser.add_argument(
        "--image",
        type=str,
        default="claude-mem/swebench-agent:latest",
        help="Agent Docker image tag.",
    )
    parser.add_argument(
        "--dataset",
        type=str,
        default="princeton-nlp/SWE-bench_Verified",
        help="HuggingFace dataset name (e.g. princeton-nlp/SWE-bench_Lite, default Verified).",
    )
    parser.add_argument(
        "--auth",
        choices=["oauth", "api-key", "auto"],
        default="auto",
        help=(
            "Auth mode. 'oauth' extracts Claude Max/Pro creds from host "
            "Keychain (macOS) or ~/.claude/.credentials.json (Linux). "
            "'api-key' uses ANTHROPIC_API_KEY env. 'auto' prefers oauth, "
            "falls back to api-key."
        ),
    )
    parser.add_argument(
        "--overwrite",
        action="store_true",
        help=(
            "Truncate existing predictions.jsonl for this --run-id. "
            "Without this flag, the run aborts if predictions already exist "
            "(protects partial results from accidental re-runs)."
        ),
    )
    return parser.parse_args()


def select_instances(
    dataset: Iterable[dict[str, Any]],
    instance_ids: list[str] | None,
    limit: int | None,
) -> list[dict[str, Any]]:
    """Filter dataset rows by instance_ids (if given) and apply limit."""
    rows: list[dict[str, Any]] = list(dataset)
    if instance_ids:
        wanted = set(instance_ids)
        rows = [r for r in rows if r["instance_id"] in wanted]
        missing = wanted - {r["instance_id"] for r in rows}
        if missing:
            print(
                f"WARN: {len(missing)} requested instance_ids not found in dataset: "
                f"{sorted(missing)[:5]}{'...' if len(missing) > 5 else ''}",
                file=sys.stderr,
            )
    if limit is not None:
        rows = rows[:limit]
    return rows


def append_prediction_row(
    predictions_path: Path,
    instance_id: str,
    model_patch: str,
    model_name_or_path: str,
    lock: threading.Lock,
) -> None:
    """Append one JSONL prediction row under a lock (appends are NOT atomic across threads)."""
    row = {
        "instance_id": instance_id,
        "model_patch": model_patch,
        "model_name_or_path": model_name_or_path,
    }
    line = json.dumps(row, ensure_ascii=False) + "\n"
    with lock:
        with predictions_path.open("a", encoding="utf-8") as fp:
            fp.write(line)


def copy_log_if_exists(src: Path, dst: Path) -> None:
    """Copy a log file from the shared scratch volume into the run-log directory, if present."""
    if src.exists() and src.is_file():
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)


def run_one_instance(
    instance: dict[str, Any],
    image: str,
    predictions_path: Path,
    predictions_dir: Path,
    run_dir: Path,
    timeout: int,
    predictions_lock: threading.Lock,
    model_name_or_path: str,
    oauth_creds_path: Path | None,
) -> tuple[str, str]:
    """
    Run the agent container for a single instance.

    Returns a (status, instance_id) tuple where status is one of:
    "succeeded", "failed", "timed_out".

    On ANY non-success (timeout, non-zero exit, missing diff), a prediction
    row with model_patch="" is still appended — the plan requires we never
    silently drop an instance.
    """
    instance_id: str = instance["instance_id"]
    repo: str = instance["repo"]
    base_commit: str = instance["base_commit"]
    problem_statement: str = instance["problem_statement"]

    instance_log_dir = run_dir / instance_id
    instance_log_dir.mkdir(parents=True, exist_ok=True)
    stderr_log_path = instance_log_dir / "stderr.log"

    # Per-instance scratch dir — MUST NOT be shared across containers.
    scratch_dir = Path(tempfile.mkdtemp(prefix=f"swebench-{instance_id}-"))
    problem_file = scratch_dir / "problem.txt"
    problem_file.write_text(problem_statement, encoding="utf-8")

    status: str = "failed"
    model_patch: str = ""

    # Uniquely named so the TimeoutExpired handler can kill it without racing
    # other instances on the host.
    container_name = f"swebench-agent-{instance_id}-{os.getpid()}-{threading.get_ident()}"

    try:
        # The orchestrator owns JSONL writes under `predictions_lock` to avoid
        # racy concurrent appends across containers — so we DO NOT mount the
        # predictions directory into the container. Instead, the agent writes
        # its authoritative diff to /scratch/model_patch.diff (via
        # CLAUDE_MEM_OUTPUT_DIR), plus ingest/fix logs to the same dir. The
        # 5th CLI arg to run-instance.sh is only used in standalone smoke-test
        # mode; here we point it at a throwaway path inside the container.
        cmd: list[str] = [
            "docker",
            "run",
            "--rm",
            "--name",
            container_name,
            "-e",
            "CLAUDE_MEM_OUTPUT_DIR=/scratch",
            "-v",
            f"{scratch_dir}:/scratch",
        ]
        if oauth_creds_path is not None:
            cmd += [
                "-e",
                "CLAUDE_MEM_CREDENTIALS_FILE=/auth/.credentials.json",
                "-v",
                f"{oauth_creds_path}:/auth/.credentials.json:ro",
            ]
        else:
            # Pay-per-call path.
            cmd += ["-e", "ANTHROPIC_API_KEY"]
        cmd += [
            image,
            instance_id,
            repo,
            base_commit,
            "/scratch/problem.txt",
            "/scratch/ignored-predictions.jsonl",
        ]

        try:
            completed = subprocess.run(
                cmd,
                timeout=timeout,
                capture_output=True,
                text=True,
                check=False,
            )
            # Persist stderr so post-mortem is possible even on success.
            stderr_log_path.write_text(
                f"=== STDOUT ===\n{completed.stdout}\n=== STDERR ===\n{completed.stderr}\n",
                encoding="utf-8",
            )

            if completed.returncode == 0:
                # Read the diff the agent wrote to the shared predictions volume.
                # The container writes its own prediction line; we prefer to
                # write our own authoritative row here from the diff file the
                # agent left in /scratch. If the agent wrote a diff file, use
                # it; otherwise fall back to empty patch.
                diff_file = scratch_dir / "model_patch.diff"
                if diff_file.exists():
                    diff_text = diff_file.read_text(encoding="utf-8")
                    if diff_text.strip():
                        model_patch = diff_text
                        status = "succeeded"
                    else:
                        status = "failed"  # empty diff
                else:
                    # Container did not leave a diff file — treat as failure
                    # but still emit an empty-patch row below.
                    status = "failed"
            else:
                status = "failed"

        except subprocess.TimeoutExpired as exc:
            status = "timed_out"
            # subprocess.run killed the docker CLI, but the container may
            # still be running. Force-remove it by name so we don't leak
            # containers across the batch.
            subprocess.run(
                ["docker", "rm", "-f", container_name],
                capture_output=True,
                check=False,
                timeout=30,
            )
            stderr_log_path.write_text(
                f"TIMEOUT after {timeout}s (forced docker rm -f {container_name})\n"
                f"=== STDOUT (partial) ===\n{exc.stdout or ''}\n"
                f"=== STDERR (partial) ===\n{exc.stderr or ''}\n",
                encoding="utf-8",
            )

        # Copy per-turn logs left by the agent in the shared scratch volume.
        copy_log_if_exists(scratch_dir / "ingest.jsonl", instance_log_dir / "ingest.jsonl")
        copy_log_if_exists(scratch_dir / "fix.jsonl", instance_log_dir / "fix.jsonl")

        # Always write a row — never silently drop an instance.
        append_prediction_row(
            predictions_path=predictions_path,
            instance_id=instance_id,
            model_patch=model_patch,
            model_name_or_path=model_name_or_path,
            lock=predictions_lock,
        )

    except Exception as exc:  # pragma: no cover — defensive
        status = "failed"
        try:
            stderr_log_path.write_text(
                f"ORCHESTRATOR EXCEPTION: {exc!r}\n",
                encoding="utf-8",
            )
        except OSError:
            pass
        append_prediction_row(
            predictions_path=predictions_path,
            instance_id=instance_id,
            model_patch="",
            model_name_or_path=model_name_or_path,
            lock=predictions_lock,
        )
    finally:
        # Per-instance scratch must not leak across containers.
        shutil.rmtree(scratch_dir, ignore_errors=True)

    return status, instance_id


def main() -> int:
    args = parse_args()

    repo_root = Path(__file__).resolve().parents[2]
    if args.out:
        predictions_path = Path(args.out).resolve()
    else:
        predictions_path = (
            repo_root
            / "evals"
            / "swebench"
            / "runs"
            / args.run_id
            / "predictions.jsonl"
        )

    predictions_dir = predictions_path.parent
    run_dir = predictions_dir  # logs land in evals/swebench/runs/<run_id>/<instance_id>/
    predictions_dir.mkdir(parents=True, exist_ok=True)
    # Don't silently discard partial results from a prior run.
    if predictions_path.exists() and predictions_path.stat().st_size > 0:
        if not args.overwrite:
            print(
                f"ERROR: {predictions_path} already exists and is non-empty. "
                "Pass --overwrite to truncate, or pick a different --run-id.",
                file=sys.stderr,
            )
            return 1
        print(
            f"WARN: --overwrite set; truncating existing {predictions_path}",
            file=sys.stderr,
        )
        predictions_path.write_text("", encoding="utf-8")

    # Resolve auth: OAuth (Max/Pro subscription) or API key.
    oauth_creds_path: Path | None = None
    if args.auth in ("oauth", "auto"):
        oauth_creds_path = extract_oauth_credentials()
        if oauth_creds_path is not None:
            print(
                f"Auth: OAuth credentials extracted to {oauth_creds_path} "
                "(mounted read-only into each container). "
                "NOTE: Max/Pro has per-window usage limits; batch runs may exhaust them.",
                file=sys.stderr,
            )
        elif args.auth == "oauth":
            print(
                "ERROR: --auth=oauth requested but credentials extraction failed.",
                file=sys.stderr,
            )
            return 1

    if oauth_creds_path is None:
        if not os.environ.get("ANTHROPIC_API_KEY"):
            print(
                "ERROR: no auth available. Either run `claude login` on host "
                "(for OAuth) or set ANTHROPIC_API_KEY.",
                file=sys.stderr,
            )
            return 1
        print("Auth: ANTHROPIC_API_KEY (pay-per-call).", file=sys.stderr)

    print(f"Loading dataset {args.dataset} (split=test)...", file=sys.stderr)
    dataset = load_dataset(args.dataset, split="test")

    instances = select_instances(dataset, args.instance_ids, args.limit)
    total = len(instances)
    if total == 0:
        print("No instances selected; nothing to do.", file=sys.stderr)
        return 0

    # Scrub hidden-from-agent fields defensively. The agent container only
    # receives instance_id/repo/base_commit/problem_statement via CLI args +
    # the per-instance problem file — the hidden fields never leave this
    # process. This loop makes that invariant explicit.
    for row in instances:
        for key in HIDDEN_AGENT_FIELDS:
            row.pop(key, None)

    model_name_or_path = "claude-opus-4-7+claude-mem"

    print(
        f"Launching {total} instance(s) with max_concurrent={args.max_concurrent}, "
        f"timeout={args.timeout}s, image={args.image}",
        file=sys.stderr,
    )

    predictions_lock = threading.Lock()
    succeeded = 0
    failed = 0
    timed_out = 0

    with ThreadPoolExecutor(max_workers=args.max_concurrent) as executor:
        future_to_id = {
            executor.submit(
                run_one_instance,
                instance=instance,
                image=args.image,
                predictions_path=predictions_path,
                predictions_dir=predictions_dir,
                run_dir=run_dir,
                timeout=args.timeout,
                predictions_lock=predictions_lock,
                model_name_or_path=model_name_or_path,
                oauth_creds_path=oauth_creds_path,
            ): instance["instance_id"]
            for instance in instances
        }

        for future in as_completed(future_to_id):
            instance_id = future_to_id[future]
            try:
                status, _ = future.result()
            except Exception as exc:  # pragma: no cover — defensive
                status = "failed"
                print(
                    f"[{instance_id}] orchestrator future raised: {exc!r}",
                    file=sys.stderr,
                )

            if status == "succeeded":
                succeeded += 1
            elif status == "timed_out":
                timed_out += 1
            else:
                failed += 1

            print(
                f"[{instance_id}] {status} "
                f"({succeeded + failed + timed_out}/{total} done)",
                file=sys.stderr,
            )

    print(
        f"{total} total, {succeeded} succeeded, {failed} failed, {timed_out} timed out",
    )
    # Per plan: exit 0 even if some instances failed.
    return 0


if __name__ == "__main__":
    sys.exit(main())
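After a batch, the container-naming scheme enables a quick hygiene check — a sketch; the `swebench-agent-` prefix comes from run_one_instance above:

# Any leftover agent containers indicate a cleanup bug in the batch:
docker ps -a --filter "name=swebench-agent-" --format '{{.Names}}\t{{.Status}}'
# Stragglers can be force-removed by name, mirroring the TimeoutExpired handler:
#   docker rm -f <name>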
evals/swebench/run-instance.sh (Executable, +177)
@@ -0,0 +1,177 @@
#!/usr/bin/env bash
set -euo pipefail

# run-instance.sh — runs Claude Code + claude-mem against a single SWE-bench
# instance using the two-turn protocol (ingest, then fix), and appends a
# prediction JSONL row to OUT_PREDICTIONS_PATH.
#
# Usage:
#   run-instance.sh INSTANCE_ID REPO_SLUG BASE_COMMIT PROBLEM_STATEMENT_FILE OUT_PREDICTIONS_PATH
#
# Required env:
#   ANTHROPIC_API_KEY

if [[ $# -ne 5 ]]; then
  echo "Usage: $0 INSTANCE_ID REPO_SLUG BASE_COMMIT PROBLEM_STATEMENT_FILE OUT_PREDICTIONS_PATH" >&2
  exit 2
fi

INSTANCE_ID="$1"
REPO_SLUG="$2"
BASE_COMMIT="$3"
PROBLEM_STATEMENT_FILE="$4"
OUT_PREDICTIONS_PATH="$5"

# Auth: either ANTHROPIC_API_KEY (pay-per-call) OR a pre-extracted OAuth
# credentials file from a Claude Max/Pro subscription (flat-fee, but subject
# to Anthropic's usage limits — batch-scale runs may exhaust the 5h window).
# run-batch.py extracts OAuth creds from host Keychain/file and mounts them
# at CLAUDE_MEM_CREDENTIALS_FILE; standalone smoke-test can do the same, or
# set ANTHROPIC_API_KEY directly.
if [[ -z "${ANTHROPIC_API_KEY:-}" && -z "${CLAUDE_MEM_CREDENTIALS_FILE:-}" ]]; then
  echo "ERROR: one of ANTHROPIC_API_KEY or CLAUDE_MEM_CREDENTIALS_FILE is required" >&2
  exit 1
fi

if [[ -n "${CLAUDE_MEM_CREDENTIALS_FILE:-}" && ! -f "$CLAUDE_MEM_CREDENTIALS_FILE" ]]; then
  echo "ERROR: CLAUDE_MEM_CREDENTIALS_FILE set but file missing: $CLAUDE_MEM_CREDENTIALS_FILE" >&2
  exit 1
fi

if [[ ! -f "$PROBLEM_STATEMENT_FILE" ]]; then
  echo "ERROR: PROBLEM_STATEMENT_FILE not found: $PROBLEM_STATEMENT_FILE" >&2
  exit 1
fi

MODEL_NAME="claude-opus-4-7+claude-mem"

# Per-instance ephemeral scratch dir — isolates ~/.claude/ and ~/.claude-mem/.
SCRATCH=$(mktemp -d)
REPO_DIR="$SCRATCH/repo"
MEM_DIR="$SCRATCH/.claude-mem"
CLAUDE_DIR="$SCRATCH/.claude"
mkdir -p "$MEM_DIR" "$CLAUDE_DIR"

# If using OAuth, seed the isolated CLAUDE_DIR with the mounted credentials
# file so Claude Code finds them at HOME=$SCRATCH → ~/.claude/.credentials.json.
# chmod 600 to match what `claude login` writes (it checks permissions).
if [[ -n "${CLAUDE_MEM_CREDENTIALS_FILE:-}" ]]; then
  cp "$CLAUDE_MEM_CREDENTIALS_FILE" "$CLAUDE_DIR/.credentials.json"
  chmod 600 "$CLAUDE_DIR/.credentials.json"
fi

# Directory where artifacts the batch orchestrator reads (model_patch.diff,
# ingest.jsonl, fix.jsonl) are written. When run via `docker run -v
# <host-scratch>:/scratch` from run-batch.py, the orchestrator sets
# CLAUDE_MEM_OUTPUT_DIR=/scratch so these files are visible on the host. In
# standalone/smoke-test mode the default keeps artifacts in the ephemeral
# scratch dir alongside the repo.
OUTPUT_DIR="${CLAUDE_MEM_OUTPUT_DIR:-$SCRATCH}"
mkdir -p "$OUTPUT_DIR"

# Always write a prediction row (even on failure) so batch mode stays aligned.
# The trap emits an empty-patch row if we exit before the success path sets
# PREDICTION_EMITTED=1, then cleans up SCRATCH.
DIFF_OUT="$OUTPUT_DIR/model_patch.diff"
INGEST_LOG="$OUTPUT_DIR/ingest.jsonl"
FIX_LOG="$OUTPUT_DIR/fix.jsonl"

PREDICTION_EMITTED=0
cleanup() {
  local exit_code=$?
  if [[ "$PREDICTION_EMITTED" -ne 1 ]]; then
    # Ensure the orchestrator sees an (empty) diff file even on early exit.
    : > "$DIFF_OUT" 2>/dev/null || true
    jq -nc \
      --arg id "$INSTANCE_ID" \
      --arg patch "" \
      --arg model "$MODEL_NAME" \
      '{instance_id:$id, model_patch:$patch, model_name_or_path:$model}' \
      >> "$OUT_PREDICTIONS_PATH" || true
  fi
  rm -rf "$SCRATCH"
  exit "$exit_code"
}
trap cleanup EXIT

# Shallow clone + fetch the exact commit. Saves minutes on large repos
# (sympy/django/scikit-learn) vs. a full-history clone. Fallback to a full
# clone if the server rejects the by-commit fetch (GitHub supports
# uploadpack.allowReachableSHA1InWant by default on public repos, but mirrors
# may not).
if ! { git clone --depth 1 --no-single-branch "https://github.com/${REPO_SLUG}.git" "$REPO_DIR" \
    && git -C "$REPO_DIR" fetch --depth 1 origin "$BASE_COMMIT"; }; then
  echo "WARN: shallow fetch failed; falling back to full clone" >&2
  rm -rf "$REPO_DIR"
  git clone "https://github.com/${REPO_SLUG}.git" "$REPO_DIR"
fi
git -C "$REPO_DIR" reset --hard "$BASE_COMMIT"

# ---------- Turn 1: Ingest (populate memory via PostToolUse hook) ----------
INGEST_PROMPT="Please learn about the codebase by systematically and thoroughly reading EVERY SOURCE FILE IN FULL, no matter how many there are. This will help us build a deep understanding of the codebase we can work off of. Don't worry about cost. This is critical and non-negotiable."

SESSION_ID=$(uuidgen | tr '[:upper:]' '[:lower:]')

set +e
(
  cd "$REPO_DIR" && HOME="$SCRATCH" claude \
    --print \
    --session-id "$SESSION_ID" \
    --plugin-dir /opt/claude-mem \
    --permission-mode bypassPermissions \
    --allowedTools "Read,Glob,Grep,Bash(ls *),Bash(wc *)" \
    --max-budget-usd 5.00 \
    --output-format json \
    "$INGEST_PROMPT"
) > "$INGEST_LOG" 2>&1
INGEST_EXIT=$?
set -e

if [[ "$INGEST_EXIT" -ne 0 ]]; then
  echo "WARN: ingest turn exited with $INGEST_EXIT; continuing to fix turn" >&2
fi

# ---------- Turn 2: Fix (consume memory via mem-search slash command) ----------
PROBLEM=$(cat "$PROBLEM_STATEMENT_FILE")
QUERY=$(printf '%s' "$PROBLEM" | tr -s '[:space:]' ' ' | cut -c1-200)

FIX_PROMPT="/claude-mem:mem-search ${QUERY}

Problem statement:
${PROBLEM}

Using what you've learned from the codebase (see memory above), produce a minimal unified diff that fixes this bug. Edit files in place. Do NOT commit."

set +e
(
  cd "$REPO_DIR" && HOME="$SCRATCH" claude \
    --print \
    --resume "$SESSION_ID" \
    --plugin-dir /opt/claude-mem \
    --permission-mode bypassPermissions \
    --allowedTools "Read,Glob,Grep,Edit,Write,Bash(git *),Bash(ls *)" \
    --max-budget-usd 5.00 \
    --output-format json \
    "$FIX_PROMPT"
) > "$FIX_LOG" 2>&1
FIX_EXIT=$?
set -e

if [[ "$FIX_EXIT" -ne 0 ]]; then
  echo "WARN: fix turn exited with $FIX_EXIT; will still emit prediction row" >&2
fi

# ---------- Capture diff and emit prediction row ----------
# Write the diff to DIFF_OUT first (authoritative for the batch orchestrator),
# then read it back for the JSONL row (kept for standalone/smoke-test use).
git -C "$REPO_DIR" diff > "$DIFF_OUT" || : > "$DIFF_OUT"
DIFF=$(cat "$DIFF_OUT")

jq -nc \
  --arg id "$INSTANCE_ID" \
  --arg patch "$DIFF" \
  --arg model "$MODEL_NAME" \
  '{instance_id:$id, model_patch:$patch, model_name_or_path:$model}' \
  >> "$OUT_PREDICTIONS_PATH"

PREDICTION_EMITTED=1
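A quick way to sanity-check the rows this script emits — a sketch, assuming jq on the host:

# One line per instance; a non-zero length means the fix turn produced a diff.
jq -r '[.instance_id, (.model_patch | length)] | @tsv' predictions.jsonl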
Executable
+152
@@ -0,0 +1,152 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
# smoke-test.sh — runs ONE SWE-bench instance end-to-end against the agent
|
||||
# container using OAuth credentials extracted from the host. Use this to
|
||||
# verify the two-turn protocol + /claude-mem:mem-search slash resolution
|
||||
# before kicking off a batch run.
|
||||
#
|
||||
# Usage:
|
||||
# evals/swebench/smoke-test.sh [INSTANCE_ID]
|
||||
#
|
||||
# Defaults to sympy__sympy-24152 (an easy Verified instance) if no arg given.
|
||||
#
|
||||
# Outputs:
|
||||
# evals/swebench/runs/smoke/<INSTANCE_ID>/{ingest.jsonl,fix.jsonl,model_patch.diff}
|
||||
# evals/swebench/runs/smoke/predictions.jsonl
|
||||
|
||||
INSTANCE_ID="${1:-sympy__sympy-24152}"
|
||||
DATASET="${DATASET:-princeton-nlp/SWE-bench_Lite}"
|
||||
IMAGE="${IMAGE:-claude-mem/swebench-agent:latest}"
|
||||
TIMEOUT="${TIMEOUT:-1800}"
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
|
||||
RUN_DIR="$REPO_ROOT/evals/swebench/runs/smoke/$INSTANCE_ID"
|
||||
PREDICTIONS="$REPO_ROOT/evals/swebench/runs/smoke/predictions.jsonl"
|
||||
mkdir -p "$RUN_DIR" "$(dirname "$PREDICTIONS")"
|
||||
|
||||
# --- Extract OAuth credentials ---
|
||||
CREDS_FILE="$(mktemp -t claude-mem-creds.XXXXXX.json)"
|
||||
trap 'rm -f "$CREDS_FILE"' EXIT
|
||||
|
||||
# Try macOS Keychain first (primary on Darwin), then fall through to the
|
||||
# on-disk credentials file — matches docker/claude-mem/run.sh behavior.
|
||||
creds_obtained=0
|
||||
if [[ "$(uname)" == "Darwin" ]]; then
|
||||
if security find-generic-password -s 'Claude Code-credentials' -w > "$CREDS_FILE" 2>/dev/null \
|
||||
&& [[ -s "$CREDS_FILE" ]]; then
|
||||
creds_obtained=1
|
||||
fi
|
||||
fi
|
||||
if [[ "$creds_obtained" -eq 0 && -f "$HOME/.claude/.credentials.json" ]]; then
|
||||
cp "$HOME/.claude/.credentials.json" "$CREDS_FILE"
|
||||
creds_obtained=1
|
||||
fi
|
||||
if [[ "$creds_obtained" -eq 0 ]]; then
|
||||
echo "ERROR: no Claude OAuth creds found (macOS Keychain or ~/.claude/.credentials.json)" >&2
|
||||
exit 1
|
||||
fi
|
||||
chmod 600 "$CREDS_FILE"
|
||||
|
||||
# --- Fetch instance data from HuggingFace via a small Python helper ---
|
||||
INSTANCE_JSON="$(mktemp)"
|
||||
trap 'rm -f "$CREDS_FILE" "$INSTANCE_JSON"' EXIT
|
||||
python3 - "$INSTANCE_ID" "$DATASET" > "$INSTANCE_JSON" <<'PY'
|
||||
import json, sys
|
||||
from datasets import load_dataset
|
||||
target = sys.argv[1]
|
||||
dataset = sys.argv[2]
|
||||
ds = load_dataset(dataset, split="test")
|
||||
for row in ds:
|
||||
if row["instance_id"] == target:
|
||||
print(json.dumps({
|
||||
"instance_id": row["instance_id"],
|
||||
"repo": row["repo"],
|
||||
"base_commit": row["base_commit"],
|
||||
"problem_statement": row["problem_statement"],
|
||||
}))
|
||||
break
|
||||
else:
|
||||
print(f"ERROR: instance {target} not found", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
PY
|
||||
|
||||
SCRATCH="$(mktemp -d -t claude-mem-smoke.XXXXXX)"
|
||||
trap 'rm -f "$CREDS_FILE" "$INSTANCE_JSON"; rm -rf "$SCRATCH"' EXIT
|
||||
|
||||
# Parse the instance JSON once: print repo + base_commit to stdout, write the
|
||||
# problem statement directly to $SCRATCH/problem.txt. INSTANCE_JSON is passed
|
||||
# as argv so stdin is free for the `python3 -` heredoc script body (previously
|
||||
# both were competing for stdin, which made json.load see the heredoc's EOF).
|
||||
read -r REPO BASE_COMMIT < <(
|
||||
python3 - "$SCRATCH" "$INSTANCE_JSON" <<'PY'
|
||||
import json, os, sys
|
||||
scratch, instance_json = sys.argv[1], sys.argv[2]
|
||||
with open(instance_json) as f:
|
||||
d = json.load(f)
|
||||
open(os.path.join(scratch, "problem.txt"), "w").write(d["problem_statement"])
|
||||
print(d["repo"], d["base_commit"])
|
||||
PY
|
||||
)

echo "=== Running $INSTANCE_ID ($REPO @ $BASE_COMMIT) ===" >&2
echo "Scratch: $SCRATCH" >&2
echo "Logs will land in: $RUN_DIR" >&2

# Pick a wall-clock timeout binary. Linux ships `timeout`; macOS needs
# `gtimeout` from coreutils (brew install coreutils). If neither is available,
# warn and run without a cap — the smoke test is manual anyway.
TIMEOUT_CMD=()
if command -v timeout >/dev/null 2>&1; then
  TIMEOUT_CMD=(timeout "$TIMEOUT")
elif command -v gtimeout >/dev/null 2>&1; then
  TIMEOUT_CMD=(gtimeout "$TIMEOUT")
else
  echo "WARN: no \`timeout\`/\`gtimeout\` on PATH; container runs uncapped" >&2
fi

# Name the container so we can force-remove it if the wall-clock timeout
# fires (SIGTERM from timeout leaves the container state open briefly).
CONTAINER_NAME="claude-mem-smoke-$INSTANCE_ID-$$"

set +e
"${TIMEOUT_CMD[@]}" docker run --rm \
  --name "$CONTAINER_NAME" \
  -e CLAUDE_MEM_OUTPUT_DIR=/scratch \
  -e CLAUDE_MEM_CREDENTIALS_FILE=/auth/.credentials.json \
  -v "$SCRATCH:/scratch" \
  -v "$CREDS_FILE:/auth/.credentials.json:ro" \
  "$IMAGE" \
  "$INSTANCE_ID" "$REPO" "$BASE_COMMIT" /scratch/problem.txt /scratch/ignored-predictions.jsonl
DOCKER_EXIT=$?
set -e

if [[ "$DOCKER_EXIT" -eq 124 ]]; then
  # `timeout` signals TERM and returns 124 on timeout. Force-remove the
  # container in case docker hasn't reaped it yet.
  echo "ERROR: docker run exceeded ${TIMEOUT}s wall-clock; removing container" >&2
  docker rm -f "$CONTAINER_NAME" >/dev/null 2>&1 || true
fi

# Copy artifacts from scratch → RUN_DIR
for f in ingest.jsonl fix.jsonl model_patch.diff; do
  [[ -f "$SCRATCH/$f" ]] && cp "$SCRATCH/$f" "$RUN_DIR/$f"
done

# Emit authoritative prediction row
DIFF_FILE="$SCRATCH/model_patch.diff"
DIFF=""
[[ -f "$DIFF_FILE" ]] && DIFF="$(cat "$DIFF_FILE")"
jq -nc \
  --arg id "$INSTANCE_ID" \
  --arg patch "$DIFF" \
  --arg model "claude-opus-4-7+claude-mem" \
  '{instance_id:$id, model_patch:$patch, model_name_or_path:$model}' \
  >> "$PREDICTIONS"

echo "=== Done ===" >&2
echo "Diff size: $(wc -c < "$DIFF_FILE" 2>/dev/null || echo 0) bytes" >&2
echo "Predictions: $PREDICTIONS" >&2
echo "Verify mem-search invocation:" >&2
echo "  grep -o '\"name\":\"[^\"]*mem-search[^\"]*\"' $RUN_DIR/fix.jsonl || echo 'NOT INVOKED'" >&2

evals/swebench/summarize.py (new executable file, +308 lines)

#!/usr/bin/env python3
"""Summarize SWE-bench evaluation run results.

Walks the SWE-bench harness output directory, tallies resolved/unresolved/error
counts, and emits a markdown summary. Optionally diffs against another run.
"""

import argparse
import json
import sys
from pathlib import Path


def load_expected_instance_ids(predictions_path: Path) -> list[str]:
    """Read instance_ids from a predictions.jsonl file (one JSON object per line)."""
    instance_ids: list[str] = []
    if not predictions_path.exists():
        print(
            f"warning: predictions file not found: {predictions_path}",
            file=sys.stderr,
        )
        return instance_ids
    with predictions_path.open("r", encoding="utf-8") as handle:
        for line_number, raw_line in enumerate(handle, start=1):
            stripped = raw_line.strip()
            if not stripped:
                continue
            try:
                record = json.loads(stripped)
            except json.JSONDecodeError as exc:
                print(
                    f"warning: could not parse predictions line {line_number}: {exc}",
                    file=sys.stderr,
                )
                continue
            instance_id = record.get("instance_id")
            if instance_id:
                instance_ids.append(instance_id)
    return instance_ids
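

# Each line of predictions.jsonl is one JSON object, matching what the smoke
# test emits via jq (instance id below is hypothetical):
#   {"instance_id": "astropy__astropy-12907",
#    "model_patch": "diff --git a/...",
#    "model_name_or_path": "claude-opus-4-7+claude-mem"}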


def load_run_results(
    run_id: str,
    model_name: str,
    expected_instance_ids: list[str],
    repo_root: Path,
) -> dict:
    """Walk logs/run_evaluation/<run_id>/<model_name>/*/report.json and tally results.

    Returns a dict:
        {
            "per_instance": {instance_id: {"resolved": bool|None, "notes": str}},
            "resolved_count": int,
            "unresolved_count": int,
            "error_count": int,
        }
    """
    run_logs_root = repo_root / "logs" / "run_evaluation" / run_id / model_name
    per_instance: dict[str, dict] = {}
    resolved_count = 0
    unresolved_count = 0
    error_count = 0

    for instance_id in expected_instance_ids:
        report_path = run_logs_root / instance_id / "report.json"
        if not report_path.exists():
            per_instance[instance_id] = {
                "resolved": None,
                "notes": "missing report.json",
            }
            error_count += 1
            continue
        try:
            with report_path.open("r", encoding="utf-8") as handle:
                report_data = json.load(handle)
        except (json.JSONDecodeError, OSError) as exc:
            per_instance[instance_id] = {
                "resolved": None,
                "notes": f"failed to parse report.json: {exc}",
            }
            error_count += 1
            continue

        # SWE-bench harness typically nests per-instance data under the
        # instance_id key; fall back to the top-level dict for flexibility.
        inner = report_data.get(instance_id, report_data)
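
        # Sketch of a report.json this tolerates (keys taken from the checks
        # below; real harness output may nest or name things differently):
        #   {"<instance_id>": {"resolved": false,
        #       "tests_status": {"FAIL_TO_PASS": {"failure": ["test_a"]}}}}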
        resolved_value = inner.get("resolved")
        if resolved_value is True:
            per_instance[instance_id] = {"resolved": True, "notes": ""}
            resolved_count += 1
        elif resolved_value is False:
            notes_parts: list[str] = []
            tests_status = inner.get("tests_status")
            if isinstance(tests_status, dict):
                fail_to_pass = tests_status.get("FAIL_TO_PASS", {})
                if isinstance(fail_to_pass, dict):
                    failed = fail_to_pass.get("failure", []) or []
                    if failed:
                        notes_parts.append(f"FAIL_TO_PASS failures: {len(failed)}")
            per_instance[instance_id] = {
                "resolved": False,
                "notes": "; ".join(notes_parts),
            }
            unresolved_count += 1
        else:
            per_instance[instance_id] = {
                "resolved": None,
                "notes": "report.json missing 'resolved' field",
            }
            error_count += 1

    return {
        "per_instance": per_instance,
        "resolved_count": resolved_count,
        "unresolved_count": unresolved_count,
        "error_count": error_count,
    }


def format_resolved_cell(resolved: bool | None) -> str:
    if resolved is True:
        return "yes"
    if resolved is False:
        return "no"
    return "error"


def render_summary_markdown(run_id: str, results: dict) -> str:
    total = (
        results["resolved_count"]
        + results["unresolved_count"]
        + results["error_count"]
    )
    resolved = results["resolved_count"]
    resolve_rate = (resolved / total * 100.0) if total > 0 else 0.0

    lines: list[str] = []
    lines.append(f"# Run {run_id}")
    lines.append(f"- Total: {total}")
    lines.append(f"- Resolved: {resolved} ({resolve_rate:.2f}%)")
    lines.append(f"- Unresolved: {results['unresolved_count']}")
    lines.append(f"- Errors: {results['error_count']}")
    lines.append("")
    lines.append("## Per-instance")
    lines.append("| instance_id | resolved | notes |")
    lines.append("|---|---|---|")
    for instance_id, record in results["per_instance"].items():
        resolved_cell = format_resolved_cell(record["resolved"])
        notes_cell = record.get("notes", "") or ""
        # Escape pipe chars in notes to avoid breaking markdown tables.
        notes_cell = notes_cell.replace("|", "\\|")
        lines.append(f"| {instance_id} | {resolved_cell} | {notes_cell} |")
    lines.append("")
    return "\n".join(lines)


def render_diff_markdown(
    current_run_id: str,
    other_run_id: str,
    current_results: dict,
    other_results: dict,
) -> str:
    def resolve_rate(results: dict) -> tuple[int, float]:
        total = (
            results["resolved_count"]
            + results["unresolved_count"]
            + results["error_count"]
        )
        rate = (results["resolved_count"] / total * 100.0) if total > 0 else 0.0
        return total, rate

    current_total, current_rate = resolve_rate(current_results)
    other_total, other_rate = resolve_rate(other_results)
    rate_delta = current_rate - other_rate

    lines: list[str] = []
    lines.append(f"# Diff vs {other_run_id}")
    lines.append(
        f"- {current_run_id}: {current_results['resolved_count']}/{current_total} "
        f"({current_rate:.2f}%)"
    )
    lines.append(
        f"- {other_run_id}: {other_results['resolved_count']}/{other_total} "
        f"({other_rate:.2f}%)"
    )
    lines.append(f"- Delta: {rate_delta:+.2f} percentage points")
    lines.append("")
    lines.append("## Per-instance status changes")
    lines.append(f"| instance_id | {other_run_id} | {current_run_id} |")
    lines.append("|---|---|---|")

    all_instance_ids = set(current_results["per_instance"].keys()) | set(
        other_results["per_instance"].keys()
    )
    changes_found = False
    for instance_id in sorted(all_instance_ids):
        current_record = current_results["per_instance"].get(instance_id)
        other_record = other_results["per_instance"].get(instance_id)
        current_status = (
            format_resolved_cell(current_record["resolved"])
            if current_record
            else "absent"
        )
        other_status = (
            format_resolved_cell(other_record["resolved"])
            if other_record
            else "absent"
        )
        if current_status != other_status:
            lines.append(f"| {instance_id} | {other_status} | {current_status} |")
            changes_found = True
    if not changes_found:
        lines.append("| (no status changes) | | |")
    lines.append("")
    return "\n".join(lines)


def main() -> int:
    parser = argparse.ArgumentParser(
        description="Summarize SWE-bench evaluation run results."
    )
    parser.add_argument(
        "--run-id",
        required=True,
        help="Run identifier used in logs/run_evaluation/<run_id>/ and evals/swebench/runs/<run_id>/.",
    )
    parser.add_argument(
        "--compare",
        metavar="OTHER_RUN_ID",
        default=None,
        help="Optional other run_id to diff resolve rates and per-instance status changes against.",
    )
    parser.add_argument(
        "--model-name",
        default="claude-opus-4-7+claude-mem",
        help="Model name directory inside logs/run_evaluation/<run_id>/.",
    )
    parser.add_argument(
        "--out",
        default=None,
        help="Output path for the markdown summary (default: evals/swebench/runs/<run_id>/summary.md).",
    )
    args = parser.parse_args()

    # Resolve repo root from this script's location: evals/swebench/summarize.py
    script_path = Path(__file__).resolve()
    repo_root = script_path.parent.parent.parent

    current_predictions_path = (
        repo_root / "evals" / "swebench" / "runs" / args.run_id / "predictions.jsonl"
    )
    current_instance_ids = load_expected_instance_ids(current_predictions_path)
    current_results = load_run_results(
        run_id=args.run_id,
        model_name=args.model_name,
        expected_instance_ids=current_instance_ids,
        repo_root=repo_root,
    )

    summary_markdown = render_summary_markdown(args.run_id, current_results)

    if args.compare:
        other_predictions_path = (
            repo_root
            / "evals"
            / "swebench"
            / "runs"
            / args.compare
            / "predictions.jsonl"
        )
        other_instance_ids = load_expected_instance_ids(other_predictions_path)
        other_results = load_run_results(
            run_id=args.compare,
            model_name=args.model_name,
            expected_instance_ids=other_instance_ids,
            repo_root=repo_root,
        )
        diff_markdown = render_diff_markdown(
            current_run_id=args.run_id,
            other_run_id=args.compare,
            current_results=current_results,
            other_results=other_results,
        )
        summary_markdown = summary_markdown + "\n" + diff_markdown

    if args.out:
        output_path = Path(args.out)
        if not output_path.is_absolute():
            output_path = (Path.cwd() / output_path).resolve()
    else:
        output_path = (
            repo_root
            / "evals"
            / "swebench"
            / "runs"
            / args.run_id
            / "summary.md"
        )

    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(summary_markdown, encoding="utf-8")

    print(str(output_path))
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
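
# Example usage (run ids hypothetical):
#   python3 evals/swebench/summarize.py --run-id smoke
#   python3 evals/swebench/summarize.py --run-id claude-mem-lite --compare baseline-lite
# On success the script prints the path of the summary.md it wrote.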

(Diffs for three further files were suppressed by the viewer because one or
more lines are too long; the visible counts were +191/−191 and +11/−11.)