97c7c999b1
* feat(evals): SWE-bench Docker scaffolding for claude-mem resolve-rate measurement

  Adds evals/swebench/ scaffolding per .claude/plans/swebench-claude-mem-docker.md. Agent image builds Claude Code 2.1.114 + locally-built claude-mem plugin; run-instance.sh executes the two-turn ingest/fix protocol per instance; run-batch.py orchestrates parallel Docker runs with per-instance isolation; eval.sh wraps the upstream SWE-bench harness; summarize.py aggregates reports.

  Orchestrator owns JSONL writes under a lock to avoid racy concurrent appends; the agent writes its authoritative diff to CLAUDE_MEM_OUTPUT_DIR (/scratch in container mode) and the orchestrator reads it back.

  Scaffolding only — no Docker build or smoke test run yet.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(evals): OAuth credential mounting for Claude Max/Pro subscriptions

  Skips per-call API billing by extracting OAuth creds from the host Keychain (macOS) or ~/.claude/.credentials.json (Linux) and bind-mounting them read-only into each agent container. Creds are copied into HOME=$SCRATCH/.claude at container start so the per-instance isolation model still holds.

  Adds run-batch.py --auth {oauth,api-key,auto} (auto prefers OAuth, falls back to API key). run-instance.sh accepts either ANTHROPIC_API_KEY or CLAUDE_MEM_CREDENTIALS_FILE. smoke-test.sh runs one instance end-to-end using OAuth for quick verification before batch runs.

  Caveat surfaced in docstrings: Max/Pro has per-window usage limits and is framed for individual developer use — batch evaluation may exhaust the quota or raise compliance questions.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(docker): basic claude-mem container for ad-hoc testing

  Adds docker/claude-mem/ with a fresh spin-up image:
  - Dockerfile: FROM node:20 (reproduces the anthropics/claude-code .devcontainer pattern — Anthropic ships the Dockerfile, not a pullable image); layers Bun + uv + locally-built plugin/; runs as the non-root node user
  - entrypoint.sh: seeds OAuth creds from CLAUDE_MEM_CREDENTIALS_FILE into $HOME/.claude/.credentials.json, then exec's the command (default: bash)
  - build.sh: npm run build + docker build
  - run.sh: interactive launcher; auto-extracts OAuth from the macOS Keychain (security find-generic-password) or ~/.claude/.credentials.json on Linux; mounts host .docker-claude-mem-data/ at /home/node/.claude-mem so the observations DB survives container exit

  Validated end-to-end: PostToolUse hook fires, queue enqueues, the worker's SDK compression runs under subscription OAuth, an observations row lands with populated facts/concepts/files_read, and Chroma sync triggers.

  Also updates .gitignore/.dockerignore for the new runtime-output paths. Built plugin artifacts refreshed by the build step.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(evals/swebench): non-root user, OAuth mount, Lite dataset default

  - Dockerfile.agent: switch to the non-root `node` user (uid 1000); Claude Code refuses --permission-mode bypassPermissions when euid==0, which made every agent run exit 1 before producing a diff. Also move Bun + uv installs to system paths so the non-root user can exec them.
  - run-batch.py: add extract_oauth_credentials(), which pulls from the macOS Keychain / Linux ~/.claude/.credentials.json into a temp file and bind-mounts it at /auth/.credentials.json:ro with CLAUDE_MEM_CREDENTIALS_FILE. New --auth {oauth,api-key,auto} flag. New --dataset flag so the batch can target SWE-bench_Lite without editing the script.
  - smoke-test.sh: default DATASET to princeton-nlp/SWE-bench_Lite (Lite contains sympy__sympy-24152, Verified does not); accept a DATASET env override.

  Caveat surfaced during testing: Max/Pro subscriptions have per-window usage limits; running 5 instances in parallel with the "read every source file" ingest prompt exhausted the 5h window within ~25 minutes (3/5 hit HTTP 429).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address PR #2076 review comments

  - docker/claude-mem/run.sh: chmod 600 (not 644) on extracted OAuth creds to match what `claude login` writes; avoids exposing tokens to other host users. Verified readable inside the container under Docker Desktop's UID translation.
  - docker/claude-mem/Dockerfile: pin Bun + uv via --build-arg BUN_VERSION / UV_VERSION (defaults: 1.3.12, 0.11.7). Bun via `bash -s "bun-v<V>"`; uv via the versioned installer URL `https://astral.sh/uv/<V>/install.sh`.
  - evals/swebench/smoke-test.sh: pipe JSON through stdin to `python3 -c` so paths with spaces/special chars can't break shell interpolation.
  - evals/swebench/run-batch.py: add an --overwrite flag; abort by default when predictions.jsonl for the run-id already exists, preventing accidental silent discard of partial results.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address coderabbit review on PR #2076

  Actionable (4):
  - Dockerfile uv install: wrap `chmod ... || true` in braces so the trailing `|| true` no longer masks failures from `curl|sh` via bash operator precedence (&& binds tighter than ||). Applied to both docker/claude-mem/ and evals/swebench/Dockerfile.agent. Added `set -eux` to the RUN lines.
  - docker/claude-mem/Dockerfile: drop the unused `sudo` apt package (~2 MB).
  - run-batch.py: name each agent container (`swebench-agent-<id>-<pid>-<tid>`) and force-remove via `docker rm -f <name>` in the TimeoutExpired handler so timed-out runs don't leave orphan containers.

  Nitpicks (2):
  - smoke-test.sh: collapse 3 python3 invocations into 1 — parse the instance JSON once, print `repo base_commit`, and write problem.txt in the same call.
  - run-instance.sh: shallow clone via `--depth 1 --no-single-branch` + `fetch --depth 1 origin $BASE_COMMIT`. Falls back to a full clone if the server rejects the by-commit fetch.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address second coderabbit review on PR #2076

  Actionable (3):
  - docker/claude-mem/run.sh: on macOS, fall back to ~/.claude/.credentials.json when the Keychain lookup misses (some setups still have file-only creds). Unified into a single creds_obtained gate so the error surface lists both sources tried.
  - docker/claude-mem/run.sh: drop `exec docker run` — `exec` replaces the shell, so the EXIT trap (`rm -f "$CREDS_FILE"`) never fires and the extracted OAuth JSON leaks to disk until tmpfs cleanup. Run docker as a child process instead so the trap runs on exit.
  - evals/swebench/smoke-test.sh: actually enforce the TIMEOUT env var. Pick `timeout` or `gtimeout` (coreutils on macOS), falling back to uncapped with a warning. Name the container so exit-124 from timeout can `docker rm -f` it deterministically.

  The nitpick from the same review (consolidated python3 calls in smoke-test.sh) was already addressed in the prior commit ef621e00.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address third coderabbit review on PR #2076

  Actionable (1):
  - evals/swebench/smoke-test.sh: the consolidated python heredoc had competing stdin redirections — `<<'PY'` (script body) AND `< "$INSTANCE_JSON"` (data). The heredoc won, so `json.load(sys.stdin)` saw an empty stream and the parse would have failed at runtime. Pass INSTANCE_JSON as argv[2] and `open()` it inside the script instead; the heredoc is now only the script body, which is what `python3 -` needs.

  Nitpicks (2):
  - evals/swebench/smoke-test.sh: the macOS Keychain lookup now falls through to ~/.claude/.credentials.json on a miss (matches docker/claude-mem/run.sh).
  - evals/swebench/run-batch.py: extract_oauth_credentials() no longer early-returns on a Darwin keychain miss; it falls through to the on-disk creds file so macOS setups with file-only credentials work in batch mode too.

  Functional spot-check of the parse fix confirmed: REPO/BASE_COMMIT populated and problem.txt written from a synthetic INSTANCE_JSON.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
178 lines
6.4 KiB
Bash
Executable File
#!/usr/bin/env bash
set -euo pipefail

# run-instance.sh — runs Claude Code + claude-mem against a single SWE-bench
# instance using the two-turn protocol (ingest, then fix), and appends a
# prediction JSONL row to OUT_PREDICTIONS_PATH.
#
# Usage:
#   run-instance.sh INSTANCE_ID REPO_SLUG BASE_COMMIT PROBLEM_STATEMENT_FILE OUT_PREDICTIONS_PATH
#
# Required env (one of):
#   ANTHROPIC_API_KEY             (pay-per-call API billing)
#   CLAUDE_MEM_CREDENTIALS_FILE   (pre-extracted OAuth credentials)
if [[ $# -ne 5 ]]; then
  echo "Usage: $0 INSTANCE_ID REPO_SLUG BASE_COMMIT PROBLEM_STATEMENT_FILE OUT_PREDICTIONS_PATH" >&2
  exit 2
fi

INSTANCE_ID="$1"
REPO_SLUG="$2"
BASE_COMMIT="$3"
PROBLEM_STATEMENT_FILE="$4"
OUT_PREDICTIONS_PATH="$5"

# Auth: either ANTHROPIC_API_KEY (pay-per-call) OR a pre-extracted OAuth
# credentials file from a Claude Max/Pro subscription (flat-fee, but subject
# to Anthropic's usage limits — batch-scale runs may exhaust the 5h window).
# run-batch.py extracts OAuth creds from host Keychain/file and mounts them
# at CLAUDE_MEM_CREDENTIALS_FILE; standalone smoke-test can do the same, or
# set ANTHROPIC_API_KEY directly.
if [[ -z "${ANTHROPIC_API_KEY:-}" && -z "${CLAUDE_MEM_CREDENTIALS_FILE:-}" ]]; then
  echo "ERROR: one of ANTHROPIC_API_KEY or CLAUDE_MEM_CREDENTIALS_FILE is required" >&2
  exit 1
fi

if [[ -n "${CLAUDE_MEM_CREDENTIALS_FILE:-}" && ! -f "$CLAUDE_MEM_CREDENTIALS_FILE" ]]; then
  echo "ERROR: CLAUDE_MEM_CREDENTIALS_FILE set but file missing: $CLAUDE_MEM_CREDENTIALS_FILE" >&2
  exit 1
fi

if [[ ! -f "$PROBLEM_STATEMENT_FILE" ]]; then
  echo "ERROR: PROBLEM_STATEMENT_FILE not found: $PROBLEM_STATEMENT_FILE" >&2
  exit 1
fi
MODEL_NAME="claude-opus-4-7+claude-mem"

# Per-instance ephemeral scratch dir — isolates ~/.claude/ and ~/.claude-mem/.
SCRATCH=$(mktemp -d)
REPO_DIR="$SCRATCH/repo"
MEM_DIR="$SCRATCH/.claude-mem"
CLAUDE_DIR="$SCRATCH/.claude"
mkdir -p "$MEM_DIR" "$CLAUDE_DIR"

# If using OAuth, seed the isolated CLAUDE_DIR with the mounted credentials
# file so Claude Code finds them at HOME=$SCRATCH → ~/.claude/.credentials.json.
# chmod 600 to match what `claude login` writes (it checks permissions).
if [[ -n "${CLAUDE_MEM_CREDENTIALS_FILE:-}" ]]; then
  cp "$CLAUDE_MEM_CREDENTIALS_FILE" "$CLAUDE_DIR/.credentials.json"
  chmod 600 "$CLAUDE_DIR/.credentials.json"
fi

# Directory where artifacts the batch orchestrator reads (model_patch.diff,
# ingest.jsonl, fix.jsonl) are written. When run via `docker run -v
# <host-scratch>:/scratch` from run-batch.py, the orchestrator sets
# CLAUDE_MEM_OUTPUT_DIR=/scratch so these files are visible on the host. In
# standalone/smoke-test mode the default keeps artifacts in the ephemeral
# scratch dir alongside the repo.
OUTPUT_DIR="${CLAUDE_MEM_OUTPUT_DIR:-$SCRATCH}"
mkdir -p "$OUTPUT_DIR"

# Always write a prediction row (even on failure) so batch mode stays aligned.
# The trap emits an empty-patch row if we exit before the success path sets
# PREDICTION_EMITTED=1, then cleans up SCRATCH.
DIFF_OUT="$OUTPUT_DIR/model_patch.diff"
INGEST_LOG="$OUTPUT_DIR/ingest.jsonl"
FIX_LOG="$OUTPUT_DIR/fix.jsonl"

PREDICTION_EMITTED=0
cleanup() {
  local exit_code=$?
  if [[ "$PREDICTION_EMITTED" -ne 1 ]]; then
    # Ensure the orchestrator sees an (empty) diff file even on early exit.
    : > "$DIFF_OUT" 2>/dev/null || true
    jq -nc \
      --arg id "$INSTANCE_ID" \
      --arg patch "" \
      --arg model "$MODEL_NAME" \
      '{instance_id:$id, model_patch:$patch, model_name_or_path:$model}' \
      >> "$OUT_PREDICTIONS_PATH" || true
  fi
  rm -rf "$SCRATCH"
  exit "$exit_code"
}
trap cleanup EXIT

# Shallow clone + fetch the exact commit. Saves minutes on large repos
# (sympy/django/scikit-learn) vs. a full-history clone. Fall back to a full
# clone if the server rejects the by-commit fetch (GitHub supports
# uploadpack.allowReachableSHA1InWant by default on public repos, but mirrors
# may not).
if ! { git clone --depth 1 --no-single-branch "https://github.com/${REPO_SLUG}.git" "$REPO_DIR" \
    && git -C "$REPO_DIR" fetch --depth 1 origin "$BASE_COMMIT"; }; then
  echo "WARN: shallow fetch failed; falling back to full clone" >&2
  rm -rf "$REPO_DIR"
  git clone "https://github.com/${REPO_SLUG}.git" "$REPO_DIR"
fi
git -C "$REPO_DIR" reset --hard "$BASE_COMMIT"
# ---------- Turn 1: Ingest (populate memory via PostToolUse hook) ----------
INGEST_PROMPT="Please learn about the codebase by systematically and thoroughly reading EVERY SOURCE FILE IN FULL, no matter how many there are. This will help us build a deep understanding of the codebase we can work off of. Don't worry about cost. This is critical and non-negotiable."

SESSION_ID=$(uuidgen | tr '[:upper:]' '[:lower:]')

set +e
(
  cd "$REPO_DIR" && HOME="$SCRATCH" claude \
    --print \
    --session-id "$SESSION_ID" \
    --plugin-dir /opt/claude-mem \
    --permission-mode bypassPermissions \
    --allowedTools "Read,Glob,Grep,Bash(ls *),Bash(wc *)" \
    --max-budget-usd 5.00 \
    --output-format json \
    "$INGEST_PROMPT"
) > "$INGEST_LOG" 2>&1
INGEST_EXIT=$?
set -e

if [[ "$INGEST_EXIT" -ne 0 ]]; then
  echo "WARN: ingest turn exited with $INGEST_EXIT; continuing to fix turn" >&2
fi

# ---------- Turn 2: Fix (consume memory via mem-search slash command) ----------
PROBLEM=$(cat "$PROBLEM_STATEMENT_FILE")
QUERY=$(printf '%s' "$PROBLEM" | tr -s '[:space:]' ' ' | cut -c1-200)

FIX_PROMPT="/claude-mem:mem-search ${QUERY}

Problem statement:
${PROBLEM}

Using what you've learned from the codebase (see memory above), produce a minimal unified diff that fixes this bug. Edit files in place. Do NOT commit."

set +e
(
  cd "$REPO_DIR" && HOME="$SCRATCH" claude \
    --print \
    --resume "$SESSION_ID" \
    --plugin-dir /opt/claude-mem \
    --permission-mode bypassPermissions \
    --allowedTools "Read,Glob,Grep,Edit,Write,Bash(git *),Bash(ls *)" \
    --max-budget-usd 5.00 \
    --output-format json \
    "$FIX_PROMPT"
) > "$FIX_LOG" 2>&1
FIX_EXIT=$?
set -e

if [[ "$FIX_EXIT" -ne 0 ]]; then
  echo "WARN: fix turn exited with $FIX_EXIT; will still emit prediction row" >&2
fi

# ---------- Capture diff and emit prediction row ----------
# Write the diff to DIFF_OUT first (authoritative for the batch orchestrator),
# then read it back for the JSONL row (kept for standalone/smoke-test use).
git -C "$REPO_DIR" diff > "$DIFF_OUT" || : > "$DIFF_OUT"
DIFF=$(cat "$DIFF_OUT")

jq -nc \
  --arg id "$INSTANCE_ID" \
  --arg patch "$DIFF" \
  --arg model "$MODEL_NAME" \
  '{instance_id:$id, model_patch:$patch, model_name_or_path:$model}' \
  >> "$OUT_PREDICTIONS_PATH"

PREDICTION_EMITTED=1
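On the orchestrator side, the script's comments describe run-batch.py reading the authoritative model_patch.diff back from the mounted scratch dir and owning the predictions.jsonl append under a lock. A minimal sketch of that read-back, with hypothetical names (the actual run-batch.py internals may differ):

```python
# Sketch: orchestrator-side collection of one agent run's result.
# collect_prediction() and _jsonl_lock are illustrative names, not
# the real run-batch.py API.
import json
import threading
from pathlib import Path

_jsonl_lock = threading.Lock()  # one lock per predictions file, process-wide

def collect_prediction(scratch_dir: Path, predictions_path: Path,
                       instance_id: str, model_name: str) -> dict:
    """Read the agent's authoritative diff and append one JSONL row."""
    diff_file = scratch_dir / "model_patch.diff"
    # A missing or empty diff still produces a row, mirroring the
    # script's EXIT trap, so the batch stays aligned with the dataset.
    patch = diff_file.read_text() if diff_file.is_file() else ""
    row = {"instance_id": instance_id,
           "model_patch": patch,
           "model_name_or_path": model_name}
    # Serialize appends so parallel workers never interleave partial lines.
    with _jsonl_lock:
        with predictions_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(row) + "\n")
    return row
```

A threading lock covers run-batch.py's thread-per-instance model; a multi-process orchestrator would need a file lock (e.g. fcntl.flock) instead.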