feat: basic claude-mem Docker container for easy spin-up (#2076)

* feat(evals): SWE-bench Docker scaffolding for claude-mem resolve-rate measurement

Adds evals/swebench/ scaffolding per .claude/plans/swebench-claude-mem-docker.md.
Agent image builds Claude Code 2.1.114 + locally-built claude-mem plugin;
run-instance.sh executes the two-turn ingest/fix protocol per instance;
run-batch.py orchestrates parallel Docker runs with per-instance isolation;
eval.sh wraps the upstream SWE-bench harness; summarize.py aggregates reports.

Orchestrator owns JSONL writes under a lock to avoid racy concurrent appends;
agent writes its authoritative diff to CLAUDE_MEM_OUTPUT_DIR (/scratch in
container mode) and the orchestrator reads it back. Scaffolding only — no
Docker build or smoke test run yet.
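
The lock-owned append can be sketched in shell terms. The orchestrator itself holds an in-process Python `threading.Lock` around the append; this `flock`-based variant (helper name `append_row` is invented) is only an illustrative cross-process equivalent of the same idea:

```shell
# Illustrative only: serialize JSONL appends through an exclusive flock so
# concurrent writers cannot interleave partial rows. The real orchestrator
# uses an in-process lock instead.
append_row() {
  local file=$1 row=$2
  {
    flock -x 9
    printf '%s\n' "$row" >>"$file"
  } 9>"$file.lock"
}
```

Each row lands as one whole line, so the harness-facing predictions.jsonl never contains torn records.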

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(evals): OAuth credential mounting for Claude Max/Pro subscriptions

Skips per-call API billing by extracting OAuth creds from host Keychain
(macOS) or ~/.claude/.credentials.json (Linux) and bind-mounting them
read-only into each agent container. Creds are copied into HOME=$SCRATCH/.claude
at container start so the per-instance isolation model still holds.

Adds run-batch.py --auth {oauth,api-key,auto} (auto prefers OAuth, falls
back to API key). run-instance.sh accepts either ANTHROPIC_API_KEY or
CLAUDE_MEM_CREDENTIALS_FILE. smoke-test.sh runs one instance end-to-end
using OAuth for quick verification before batch runs.

Caveat surfaced in docstrings: Max/Pro has per-window usage limits and is
framed for individual developer use — batch evaluation may exhaust the
quota or raise compliance questions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(docker): basic claude-mem container for ad-hoc testing

Adds docker/claude-mem/ with a fresh spin-up image:
- Dockerfile: FROM node:20 (reproduces anthropics/claude-code .devcontainer
  pattern — Anthropic ships the Dockerfile, not a pullable image); layers
  Bun + uv + locally-built plugin/; runs as non-root node user
- entrypoint.sh: seeds OAuth creds from CLAUDE_MEM_CREDENTIALS_FILE into
  $HOME/.claude/.credentials.json, then exec's the command (default: bash)
- build.sh: npm run build + docker build
- run.sh: interactive launcher; auto-extracts OAuth from macOS Keychain
  (security find-generic-password) or ~/.claude/.credentials.json on Linux,
  mounts host .docker-claude-mem-data/ at /home/node/.claude-mem so the
  observations DB survives container exit

Validated end-to-end: PostToolUse hook fires, queue enqueues, worker's SDK
compression runs under subscription OAuth, observations row lands with
populated facts/concepts/files_read, Chroma sync triggers.

Also updates .gitignore/.dockerignore for the new runtime-output paths.
Built plugin artifacts refreshed by the build step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(evals/swebench): non-root user, OAuth mount, Lite dataset default

- Dockerfile.agent: switch to non-root `node` user (uid 1000); Claude Code
  refuses --permission-mode bypassPermissions when euid==0, which made every
  agent run exit 1 before producing a diff. Also move Bun + uv installs to
  system paths so the non-root user can exec them.
- run-batch.py: add extract_oauth_credentials() that pulls from macOS
  Keychain / Linux ~/.claude/.credentials.json into a temp file and bind-
  mounts it at /auth/.credentials.json:ro with CLAUDE_MEM_CREDENTIALS_FILE.
  New --auth {oauth,api-key,auto} flag. New --dataset flag so the batch can
  target SWE-bench_Lite without editing the script.
- smoke-test.sh: default DATASET to princeton-nlp/SWE-bench_Lite (Lite
  contains sympy__sympy-24152, Verified does not); accept DATASET env
  override.

Caveat surfaced during testing: Max/Pro subscriptions have per-window usage
limits; running 5 instances in parallel with the "read every source file"
ingest prompt exhausted the 5h window within ~25 minutes (3/5 hit HTTP 429).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address PR #2076 review comments

- docker/claude-mem/run.sh: chmod 600 (not 644) on extracted OAuth creds
  to match what `claude login` writes; avoids exposing tokens to other
  host users. Verified readable inside the container under Docker
  Desktop's UID translation.
- docker/claude-mem/Dockerfile: pin Bun + uv via --build-arg BUN_VERSION
  / UV_VERSION (defaults: 1.3.12, 0.11.7). Bun via `bash -s "bun-v<V>"`;
  uv via versioned installer URL `https://astral.sh/uv/<V>/install.sh`.
- evals/swebench/smoke-test.sh: pipe JSON through stdin to `python3 -c`
  so paths with spaces/special chars can't break shell interpolation.
- evals/swebench/run-batch.py: add --overwrite flag; abort by default
  when predictions.jsonl for the run-id already exists, preventing
  accidental silent discard of partial results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address coderabbit review on PR #2076

Actionable (4):
- Dockerfile uv install: wrap `chmod ... || true` in braces so the trailing
  `|| true` no longer masks failures from `curl|sh`: shell `&&` and `||` have
  equal precedence and group left-to-right, so `a && b || true` parses as
  `(a && b) || true`. Applied to both docker/claude-mem/ and
  evals/swebench/Dockerfile.agent. Added `set -eux` to the RUN lines.
- docker/claude-mem/Dockerfile: drop unused `sudo` apt package (~2 MB).
- run-batch.py: name each agent container (`swebench-agent-<id>-<pid>-<tid>`)
  and force-remove via `docker rm -f <name>` in the TimeoutExpired handler
  so timed-out runs don't leave orphan containers.
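
The masking behavior can be reproduced in two lines; the commands are stand-ins (`false` plays the failing `curl | sh`):

```shell
# Shell `&&` and `||` have equal precedence and associate left to right, so
# `install && chmod || true` parses as `(install && chmod) || true` and a
# failed install is swallowed by the trailing `|| true`.
masked() { false && chmod a+rX /tmp/does-not-matter 2>/dev/null || true; }
# Grouping scopes the tolerated failure to the chmod alone; the install
# failure now propagates (and would fail a `set -e` RUN line).
scoped() { false && { chmod a+rX /tmp/does-not-matter 2>/dev/null || true; }; }
```

`masked` exits 0 even though the "install" failed; `scoped` exits non-zero.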

Nitpicks (2):
- smoke-test.sh: collapse 3 python3 invocations into 1 — parse the instance
  JSON once, print `repo base_commit`, and write problem.txt in the same
  call.
- run-instance.sh: shallow clone via `--depth 1 --no-single-branch` +
  `fetch --depth 1 origin $BASE_COMMIT`. Falls back to a full clone if the
  server rejects the by-commit fetch.
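
A sketch of that clone strategy (URL/commit/dest are placeholders; the fallback shown deepens via `--unshallow`, one way to realize the "full clone" fallback):

```shell
# Shallow-clone a repo, then fetch exactly the target commit. Servers with
# uploadpack.allowReachableSHA1InWant disabled reject fetch-by-SHA; in that
# case deepen to full history so the checkout still succeeds.
clone_at_commit() {
  local url=$1 commit=$2 dest=$3
  git clone --quiet --depth 1 --no-single-branch "$url" "$dest"
  if ! git -C "$dest" fetch --quiet --depth 1 origin "$commit" 2>/dev/null; then
    git -C "$dest" fetch --quiet --unshallow origin
  fi
  git -C "$dest" checkout --quiet "$commit"
}
```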

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address second coderabbit review on PR #2076

Actionable (3):
- docker/claude-mem/run.sh: on macOS, fall back to ~/.claude/.credentials.json
  when the Keychain lookup misses (some setups still have file-only creds).
  Unified into a single creds_obtained gate so the error surface lists both
  sources tried.
- docker/claude-mem/run.sh: drop `exec docker run` — `exec` replaces the shell
  so the EXIT trap (`rm -f "$CREDS_FILE"`) never fires and the extracted
  OAuth JSON leaks to disk until tmpfs cleanup. Run as a child instead so
  the trap runs on exit.
- evals/swebench/smoke-test.sh: actually enforce the TIMEOUT env var. Pick
  `timeout` or `gtimeout` (coreutils on macOS), fall back to uncapped with
  a warning. Name the container so exit-124 from timeout can `docker rm -f`
  it deterministically.
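
The trap-vs-exec difference is easy to demonstrate with a stand-in for `docker run` (`true` below; the files are throwaway temp paths):

```shell
# A child process lets the shell survive to run its EXIT trap, so the temp
# creds file is removed; `exec` replaces the shell process, the trap never
# fires, and the file leaks.
child_run() {
  bash -c 'f=$(mktemp); trap "rm -f \"$f\"" EXIT; true; echo "$f"'
}
exec_run() {
  bash -c 'f=$(mktemp); trap "rm -f \"$f\"" EXIT; echo "$f"; exec true'
}
```

The path printed by `child_run` no longer exists afterward (the trap cleaned it up), while the path printed by `exec_run` is left behind.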

Nitpick from the same review (consolidated python3 calls in smoke-test.sh)
was already addressed in the prior commit ef621e00.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address third coderabbit review on PR #2076

Actionable (1):
- evals/swebench/smoke-test.sh: the consolidated python heredoc had competing
  stdin redirections — `<<'PY'` (script body) AND `< "$INSTANCE_JSON"` (data).
  The heredoc won, so `json.load(sys.stdin)` saw an empty stream and the parse
  would have failed at runtime. Pass INSTANCE_JSON as argv[2] and `open()` it
  inside the script instead; the heredoc is now only the script body, which
  is what `python3 -` needs.
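
The fixed pattern, reduced to a synthetic repro (the file path and JSON key are made up):

```shell
# The heredoc is python's script on stdin; the data file path travels as
# argv[1], so there is no second stdin redirection to fight over.
json_file=$(mktemp)
printf '{"repo":"demo/repo","base_commit":"abc123"}' >"$json_file"
repo=$(python3 - "$json_file" <<'PY'
import json, sys

with open(sys.argv[1], encoding="utf-8") as fp:
    print(json.load(fp)["repo"])
PY
)
rm -f "$json_file"
```

`$repo` is now `demo/repo`; with the old `< "$INSTANCE_JSON"` form alongside the heredoc, `json.load(sys.stdin)` would have seen an empty stream instead.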

Nitpicks (2):
- evals/swebench/smoke-test.sh: macOS Keychain lookup now falls through to
  ~/.claude/.credentials.json on miss (matches docker/claude-mem/run.sh).
- evals/swebench/run-batch.py: extract_oauth_credentials() no longer
  early-returns on Darwin keychain miss; falls through to the on-disk creds
  file so macOS setups with file-only credentials work in batch mode too.

Functional spot-check of the parse fix confirmed: REPO/BASE_COMMIT populated
and problem.txt written from a synthetic INSTANCE_JSON.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author: Alex Newman
Date: 2026-04-19 17:34:30 -07:00 (committed by GitHub)
Parent: de6139660b
Commit: 97c7c999b1
16 changed files with 1834 additions and 238 deletions

@@ -0,0 +1,9 @@
# Keep the build context small for evals/swebench/Dockerfile.agent.
# The Dockerfile needs `plugin/` and `evals/swebench/` — do NOT exclude them.
node_modules/
.git/
logs/
evals/swebench/runs/
.docker-claude-mem-data/
.venv
.venv-*
@@ -42,3 +42,12 @@ plugin/.cli-installed
# Local contribution analysis (not part of upstream)
CONTRIB_NOTES.md
# Docker container runtime data (basic claude-mem container)
.docker-claude-mem-data/
# SWE-bench eval outputs
evals/swebench/runs/
claude-opus-4-7+claude-mem.*.json
logs/run_evaluation/
.venv-swebench/
@@ -0,0 +1,93 @@
# Basic claude-mem container for ad-hoc testing.
#
# Base layout mirrors anthropics/claude-code .devcontainer
# (https://github.com/anthropics/claude-code/blob/main/.devcontainer/Dockerfile):
# FROM node:20, non-root `node` user, global npm install of @anthropic-ai/claude-code.
# We skip the firewall/zsh/fzf/delta/git-hist noise since this image is for
# exercising claude-mem, not as a full dev environment.
#
# On top of that base we install:
# - Bun (claude-mem worker service runtime)
# - uv (provides Python for Chroma per CLAUDE.md)
# - The locally-built plugin/ tree at /opt/claude-mem
#
# Usage:
#   docker build -f docker/claude-mem/Dockerfile -t claude-mem:basic .
#   docker run --rm -it \
#     -v $(mktemp -d):/home/node/.claude-mem \
#     -e CLAUDE_MEM_CREDENTIALS_FILE=/auth/.credentials.json \
#     -v /path/to/extracted/creds.json:/auth/.credentials.json:ro \
#     claude-mem:basic
FROM node:20
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update \
  && apt-get install -y --no-install-recommends \
    git \
    curl \
    ca-certificates \
    unzip \
    jq \
    less \
    procps \
    uuid-runtime \
    sqlite3 \
  && apt-get clean && rm -rf /var/lib/apt/lists/*
# Bun — system-wide so the unprivileged `node` user can execute it.
# Pin via --build-arg BUN_VERSION=X.Y.Z; default is the version verified at PR time.
ARG BUN_VERSION=1.3.12
ENV BUN_INSTALL="/usr/local/bun"
RUN curl -fsSL https://bun.sh/install | bash -s "bun-v${BUN_VERSION}" \
&& chmod -R a+rX /usr/local/bun
ENV PATH="/usr/local/bun/bin:${PATH}"
# uv — system-wide, for Chroma's Python runtime. Pin via --build-arg UV_VERSION=X.Y.Z.
# Versioned installer URL per https://docs.astral.sh/uv/getting-started/installation/.
ARG UV_VERSION=0.11.7
ENV UV_INSTALL_DIR="/usr/local/bin"
# Shell `&&` and `||` have equal precedence and group left-to-right, so the
# previous `curl | sh && chmod ... || true` form let `curl|sh` fail silently
# via the trailing `|| true`. Group the chmod so tolerated failure is scoped
# to perms-fixing only.
RUN set -eux \
  && curl -LsSf "https://astral.sh/uv/${UV_VERSION}/install.sh" | sh \
  && { chmod a+rX /usr/local/bin/uv /usr/local/bin/uvx 2>/dev/null || true; }
# Match the upstream devcontainer's npm-global prefix so `npm install -g`
# targets a dir the `node` user owns.
RUN mkdir -p /usr/local/share/npm-global \
&& chown -R node:node /usr/local/share/npm-global
ENV NPM_CONFIG_PREFIX=/usr/local/share/npm-global
ENV PATH="/usr/local/share/npm-global/bin:${PATH}"
# Claude Code CLI. Override at build-time with --build-arg CLAUDE_CODE_VERSION=X.Y.Z
# to pin; default tracks latest.
ARG CLAUDE_CODE_VERSION=latest
USER node
RUN npm install -g @anthropic-ai/claude-code@${CLAUDE_CODE_VERSION}
# Locally-built claude-mem plugin. COPY runs as root by default and layers are
# cached, so put this after the npm install so iterating on the plugin doesn't
# invalidate the CLI install layer.
USER root
COPY plugin/ /opt/claude-mem/
RUN chown -R node:node /opt/claude-mem
# Persistent mount points for ad-hoc testing — mount a host dir at either of
# these to inspect the claude-mem DB after a session.
RUN mkdir -p /home/node/.claude /home/node/.claude-mem \
&& chown -R node:node /home/node/.claude /home/node/.claude-mem
USER node
WORKDIR /home/node
# Helper: copies OAuth creds out of the read-only mount into $HOME/.claude/
# before exec'ing whatever you asked for. Saves the "cp + chmod" dance every
# time you drop in.
COPY --chown=node:node docker/claude-mem/entrypoint.sh /usr/local/bin/claude-mem-entrypoint
RUN chmod +x /usr/local/bin/claude-mem-entrypoint
ENTRYPOINT ["/usr/local/bin/claude-mem-entrypoint"]
CMD ["bash"]
@@ -0,0 +1,24 @@
#!/usr/bin/env bash
# Build the basic claude-mem Docker image from the current worktree.
#
# Usage:
# docker/claude-mem/build.sh # builds claude-mem:basic
# TAG=my-tag docker/claude-mem/build.sh # override the tag
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
TAG="${TAG:-claude-mem:basic}"
cd "$REPO_ROOT"
echo "[build] npm run build"
npm run build
echo "[build] docker build -t $TAG"
docker build \
  -f docker/claude-mem/Dockerfile \
  -t "$TAG" \
  "$REPO_ROOT"
echo "[build] done: $TAG"
@@ -0,0 +1,28 @@
#!/usr/bin/env bash
# Entrypoint for the basic claude-mem container. Seeds OAuth creds if a
# credentials file is mounted, then exec's whatever was passed (default: bash).
#
# Env vars:
#   CLAUDE_MEM_CREDENTIALS_FILE  Path to a mounted OAuth credentials JSON file
#                                (e.g. /auth/.credentials.json). Copied into
#                                $HOME/.claude/.credentials.json at startup.
#   ANTHROPIC_API_KEY            Standard API-key auth; set when OAuth isn't used.
set -euo pipefail
mkdir -p "$HOME/.claude" "$HOME/.claude-mem"
if [[ -n "${CLAUDE_MEM_CREDENTIALS_FILE:-}" ]]; then
  if [[ ! -f "$CLAUDE_MEM_CREDENTIALS_FILE" ]]; then
    echo "ERROR: CLAUDE_MEM_CREDENTIALS_FILE set but file missing: $CLAUDE_MEM_CREDENTIALS_FILE" >&2
    exit 1
  fi
  cp "$CLAUDE_MEM_CREDENTIALS_FILE" "$HOME/.claude/.credentials.json"
  chmod 600 "$HOME/.claude/.credentials.json"
fi
# Helpful one-liner for interactive users: run `claude` with the plugin dir
# preconfigured. Don't force it — `exec "$@"` lets you override freely.
export PATH="/usr/local/bun/bin:/usr/local/share/npm-global/bin:$PATH"
exec "$@"
@@ -0,0 +1,69 @@
#!/usr/bin/env bash
# Drop into an interactive claude-mem container with OAuth creds + persistent
# memory volume. For ad-hoc testing / poking around.
#
# Usage:
# docker/claude-mem/run.sh
# docker/claude-mem/run.sh claude --plugin-dir /opt/claude-mem --print "hi"
#
# On exit, the mounted .claude-mem/ dir on the host survives so you can inspect
# the DB: `sqlite3 <HOST_MEM_DIR>/claude-mem.db 'select count(*) from observations'`.
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
TAG="${TAG:-claude-mem:basic}"
HOST_MEM_DIR="${HOST_MEM_DIR:-$REPO_ROOT/.docker-claude-mem-data}"
mkdir -p "$HOST_MEM_DIR"
echo "[run] host .claude-mem dir: $HOST_MEM_DIR" >&2
# Auth. Prefer OAuth (extracted from macOS Keychain / Linux creds file);
# fall back to ANTHROPIC_API_KEY env.
CREDS_FILE=""
CREDS_MOUNT_ARGS=()
if [[ -z "${ANTHROPIC_API_KEY:-}" ]]; then
  CREDS_FILE="$(mktemp -t claude-mem-creds.XXXXXX.json)"
  trap 'rm -f "$CREDS_FILE"' EXIT
  # Try macOS Keychain first (primary storage on Darwin), then fall back to
  # the on-disk credentials file — some macOS setups (older CLI versions,
  # users who migrated machines) still have the file-only form.
  creds_obtained=0
  if [[ "$(uname)" == "Darwin" ]]; then
    if security find-generic-password -s 'Claude Code-credentials' -w > "$CREDS_FILE" 2>/dev/null \
      && [[ -s "$CREDS_FILE" ]]; then
      creds_obtained=1
    fi
  fi
  if [[ "$creds_obtained" -eq 0 && -f "$HOME/.claude/.credentials.json" ]]; then
    cp "$HOME/.claude/.credentials.json" "$CREDS_FILE"
    creds_obtained=1
  fi
  if [[ "$creds_obtained" -eq 0 ]]; then
    echo "ERROR: no ANTHROPIC_API_KEY set and no Claude OAuth credentials found." >&2
    echo "       Tried: macOS Keychain ('Claude Code-credentials') and ~/.claude/.credentials.json." >&2
    echo "       Run \`claude login\` on the host first, or set ANTHROPIC_API_KEY." >&2
    exit 1
  fi
  chmod 600 "$CREDS_FILE"
  CREDS_MOUNT_ARGS=(
    -e CLAUDE_MEM_CREDENTIALS_FILE=/auth/.credentials.json
    -v "$CREDS_FILE:/auth/.credentials.json:ro"
  )
else
  CREDS_MOUNT_ARGS=(-e ANTHROPIC_API_KEY)
fi
# Pick -it only when a TTY is attached (keeps non-interactive callers working).
TTY_ARGS=()
[[ -t 0 && -t 1 ]] && TTY_ARGS=(-it)
# NOT `exec` — we want the EXIT trap above to run and remove $CREDS_FILE
# after the container exits. Running docker as a child keeps the shell
# alive long enough for the trap to fire.
docker run --rm "${TTY_ARGS[@]}" \
  "${CREDS_MOUNT_ARGS[@]}" \
  -v "$HOST_MEM_DIR:/home/node/.claude-mem" \
  "$TAG" \
  "$@"
@@ -0,0 +1,74 @@
# claude-mem SWE-bench agent image
# Plan: .claude/plans/swebench-claude-mem-docker.md (Phase 1)
#
# Produces `claude-mem/swebench-agent:latest`: Claude Code CLI 2.1.114 +
# locally-built claude-mem plugin, ready to run headlessly per SWE-bench
# instance. Auth (ANTHROPIC_API_KEY) is passed at runtime, never baked in.
FROM node:20-bookworm-slim
ENV DEBIAN_FRONTEND=noninteractive
# System dependencies:
# git, curl, ca-certificates, unzip — base tooling (Bun installer needs unzip)
# jq — JSONL assembly in run-instance.sh
# uuid-runtime — uuidgen for per-instance session IDs (Phase 2)
# sqlite3 — verifies the claude-mem observations DB
RUN apt-get update \
  && apt-get install -y --no-install-recommends \
    git \
    curl \
    ca-certificates \
    unzip \
    jq \
    uuid-runtime \
    sqlite3 \
  && rm -rf /var/lib/apt/lists/*
# Bun (claude-mem worker service runs under Bun). Installed to a system
# location so the non-root runtime user can execute it.
ENV BUN_INSTALL="/usr/local/bun"
RUN curl -fsSL https://bun.sh/install | bash \
&& chmod -R a+rX /usr/local/bun
ENV PATH="/usr/local/bun/bin:${PATH}"
# uv (provides Python for Chroma per CLAUDE.md). Installed to a system
# location, same reason.
ENV UV_INSTALL_DIR="/usr/local/bin"
# Group the chmod so the trailing `|| true` only absorbs chmod failures; without
# the grouping, left-to-right evaluation of the equal-precedence `&&`/`||`
# operators would silently mask a failed `curl|sh` install step.
RUN set -eux \
  && curl -LsSf https://astral.sh/uv/install.sh | sh \
  && { chmod a+rX /usr/local/bin/uv /usr/local/bin/uvx 2>/dev/null || true; }
# Claude Code CLI — PINNED to the version whose flag surface was verified in
# the plan (Phase 0). Do NOT bump without re-verifying flags.
RUN npm install -g @anthropic-ai/claude-code@2.1.114
# Locally-built claude-mem plugin. The build-agent-image.sh wrapper runs
# `npm run build` before `docker build`, so plugin/ is populated in the build
# context. We do NOT install claude-mem from npm — we want the current
# worktree under test.
COPY plugin/ /opt/claude-mem/
# Runner script — entrypoint for per-instance invocation (Phase 2 deliverable).
COPY evals/swebench/run-instance.sh /evals/swebench/run-instance.sh
RUN chmod +x /evals/swebench/run-instance.sh
# Pre-create per-instance config dirs. run-instance.sh overrides HOME to a
# scratch dir for isolation, but having these present keeps tools from
# bailing if they probe the default locations before HOME is set.
RUN mkdir -p /root/.claude /root/.claude-mem
# Non-root user. Claude Code refuses `--dangerously-skip-permissions` /
# `--permission-mode bypassPermissions` when euid==0 as a safety rail, so we
# need an unprivileged user for headless batch runs. node:20 already ships a
# `node` user at uid 1000 — reuse it.
RUN mkdir -p /home/node/.claude /home/node/.claude-mem \
&& chown -R node:node /home/node /opt/claude-mem
USER node
WORKDIR /home/node
ENTRYPOINT ["/evals/swebench/run-instance.sh"]
@@ -0,0 +1,20 @@
#!/usr/bin/env bash
# Build the claude-mem SWE-bench agent image.
# Plan: .claude/plans/swebench-claude-mem-docker.md (Phase 1, step 2)
set -euo pipefail
# Resolve repo root (two levels up from this script: evals/swebench -> repo).
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
cd "$REPO_ROOT"
# 1. Build the plugin so plugin/ is populated for the COPY step in the Dockerfile.
npm run build
# 2. Build the agent image. Context is the repo root so both plugin/ and
# evals/swebench/run-instance.sh are reachable.
docker build \
  -f evals/swebench/Dockerfile.agent \
  -t claude-mem/swebench-agent:latest \
  .
@@ -0,0 +1,72 @@
#!/usr/bin/env bash
set -euo pipefail
# eval.sh — Thin wrapper around `python -m swebench.harness.run_evaluation`.
#
# Required env:
#   RUN_ID       Identifier for this evaluation run (matches predictions dir).
# Optional env:
#   MAX_WORKERS  Parallel worker count for the harness (default: 4).
#   DATASET      HF dataset name (default: princeton-nlp/SWE-bench_Verified).
#   TIMEOUT      Per-instance timeout in seconds (default: 1800).
#
# Reports land at:
#   logs/run_evaluation/$RUN_ID/claude-opus-4-7+claude-mem/<instance_id>/report.json
: "${RUN_ID:?RUN_ID is required (e.g. RUN_ID=smoke-001)}"
MAX_WORKERS="${MAX_WORKERS:-4}"
DATASET="${DATASET:-princeton-nlp/SWE-bench_Verified}"
TIMEOUT="${TIMEOUT:-1800}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
cd "$REPO_ROOT"
PREDICTIONS="evals/swebench/runs/$RUN_ID/predictions.jsonl"
if [[ ! -f "$PREDICTIONS" ]]; then
  echo "ERROR: predictions file not found: $PREDICTIONS" >&2
  echo "Hint: run Phase 3 agent loop first to produce predictions.jsonl for RUN_ID=$RUN_ID." >&2
  exit 1
fi

# Harness REQUIRES Docker — fail fast with a clean message if it's not running.
if ! command -v docker >/dev/null 2>&1; then
  echo "ERROR: docker CLI not found on PATH. The SWE-bench harness requires Docker." >&2
  exit 1
fi
if ! docker info >/dev/null 2>&1; then
  echo "ERROR: Docker daemon is not running. Start Docker Desktop (or the docker service) and retry." >&2
  exit 1
fi

# Create/reuse a dedicated venv so we don't pollute the system Python.
VENV_DIR=".venv-swebench"
if [[ ! -d "$VENV_DIR" ]]; then
  echo "[eval.sh] Creating Python venv at $VENV_DIR ..."
  python3 -m venv "$VENV_DIR"
fi
# shellcheck disable=SC1091
source "$VENV_DIR/bin/activate"
echo "[eval.sh] Installing/updating swebench in $VENV_DIR ..."
pip install -q swebench
echo "[eval.sh] Running harness:"
echo " dataset: $DATASET"
echo " predictions: $PREDICTIONS"
echo " max_workers: $MAX_WORKERS"
echo " run_id: $RUN_ID"
echo " timeout: $TIMEOUT"
python -m swebench.harness.run_evaluation \
  --dataset_name "$DATASET" \
  --predictions_path "$PREDICTIONS" \
  --max_workers "$MAX_WORKERS" \
  --run_id "$RUN_ID" \
  --timeout "$TIMEOUT"
REPORTS_DIR="logs/run_evaluation/$RUN_ID/claude-opus-4-7+claude-mem"
echo ""
echo "[eval.sh] Done. Per-instance reports at:"
echo " $REPORTS_DIR/<instance_id>/report.json"
@@ -0,0 +1,561 @@
#!/usr/bin/env python3
"""
Batch orchestrator for SWE-bench evaluation of Claude Code + claude-mem.

Iterates a list of SWE-bench Verified instances, launches a per-instance Docker
container (`claude-mem/swebench-agent:latest`) that runs the two-turn
ingest/fix protocol, and collects all resulting diffs into a single
`predictions.jsonl` compatible with the upstream SWE-bench harness.

Usage:
    python evals/swebench/run-batch.py \
        --run-id claude-mem-baseline-001 \
        --limit 3 \
        --max-concurrent 2

Rate-limit note: Anthropic API rate limits can bite quickly. The default
`--max-concurrent` is 4, but it is safer to START WITH 2 and raise the cap
only after observing no 429s in the logs.
"""
from __future__ import annotations
import argparse
import atexit
import json
import os
import platform
import shutil
import stat
import subprocess
import sys
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Any, Iterable
from datasets import load_dataset
# Hidden-from-agent fields per the plan. We MUST NOT pass these to the agent
# container — they are evaluator-only ground truth.
HIDDEN_AGENT_FIELDS = (
    "patch",
    "test_patch",
    "FAIL_TO_PASS",
    "PASS_TO_PASS",
    "environment_setup_commit",
    "version",
)
def extract_oauth_credentials() -> Path | None:
    """
    Extract Claude Code OAuth credentials (from a Max/Pro subscription) to a
    temp file the container can bind-mount. Returns the temp file path, or
    None if extraction failed / no creds present.

    macOS: creds live in the Keychain under service "Claude Code-credentials".
    Linux: creds live at ~/.claude/.credentials.json.

    CAVEAT: Anthropic Max/Pro subscriptions have usage limits (per ~5h window)
    and their ToS is framed around individual developer use. Running batch
    evaluation across parallel containers may exhaust the quota quickly or
    raise compliance concerns. This helper exists because the user explicitly
    requested it; the caller is responsible for the policy call.

    The token may age out mid-run; we mount read-only so refresh writes fail
    silently inside the container (the underlying token in the host
    Keychain/file is untouched).
    """
    temp = tempfile.NamedTemporaryFile(
        prefix="claude-mem-creds-",
        suffix=".json",
        delete=False,
    )
    temp_path = Path(temp.name)
    temp.close()
    # Clean up on process exit, even on crash.
    atexit.register(lambda: temp_path.unlink(missing_ok=True))

    # macOS: try Keychain first (primary storage on Darwin). On miss, fall
    # through to the on-disk credentials file — some macOS setups (older CLI,
    # migrated machines) only have the file form.
    if platform.system() == "Darwin":
        try:
            completed = subprocess.run(
                [
                    "security",
                    "find-generic-password",
                    "-s",
                    "Claude Code-credentials",
                    "-w",
                ],
                capture_output=True,
                text=True,
                check=False,
            )
            if completed.returncode == 0 and completed.stdout.strip():
                temp_path.write_text(completed.stdout.strip(), encoding="utf-8")
                temp_path.chmod(stat.S_IRUSR | stat.S_IWUSR)
                return temp_path
            # else fall through to the on-disk credentials check below
        except FileNotFoundError:
            print(
                "WARN: `security` command not available; trying on-disk creds.",
                file=sys.stderr,
            )
            # fall through to the on-disk credentials check below

    # Both platforms (and macOS fallback): read the on-disk credentials file.
    creds_file = Path.home() / ".claude" / ".credentials.json"
    if creds_file.exists():
        temp_path.write_text(creds_file.read_text(encoding="utf-8"), encoding="utf-8")
        temp_path.chmod(stat.S_IRUSR | stat.S_IWUSR)
        return temp_path

    if platform.system() == "Darwin":
        print(
            "WARN: Claude Code-credentials not found in macOS Keychain and "
            "~/.claude/.credentials.json missing. Run `claude login` on the "
            "host first, or fall back to ANTHROPIC_API_KEY.",
            file=sys.stderr,
        )
    return None
def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Run the claude-mem SWE-bench agent on a batch of instances.",
    )
    parser.add_argument(
        "--instance-ids",
        nargs="+",
        default=None,
        help="Optional explicit list of instance_ids to run.",
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="If set, process only the first N instances after filtering.",
    )
    parser.add_argument(
        "--max-concurrent",
        type=int,
        default=4,
        help="Max concurrent agent containers (default 4; start with 2 and raise after observing no 429s).",
    )
    parser.add_argument(
        "--run-id",
        type=str,
        required=True,
        help="Run identifier; used for output paths.",
    )
    parser.add_argument(
        "--out",
        type=str,
        default=None,
        help="Path to predictions.jsonl (default: evals/swebench/runs/<run_id>/predictions.jsonl).",
    )
    parser.add_argument(
        "--timeout",
        type=int,
        default=1800,
        help="Per-instance timeout in seconds (default 1800, matches upstream harness).",
    )
    parser.add_argument(
        "--image",
        type=str,
        default="claude-mem/swebench-agent:latest",
        help="Agent Docker image tag.",
    )
    parser.add_argument(
        "--dataset",
        type=str,
        default="princeton-nlp/SWE-bench_Verified",
        help="HuggingFace dataset name (e.g. princeton-nlp/SWE-bench_Lite, default Verified).",
    )
    parser.add_argument(
        "--auth",
        choices=["oauth", "api-key", "auto"],
        default="auto",
        help=(
            "Auth mode. 'oauth' extracts Claude Max/Pro creds from host "
            "Keychain (macOS) or ~/.claude/.credentials.json (Linux). "
            "'api-key' uses ANTHROPIC_API_KEY env. 'auto' prefers oauth, "
            "falls back to api-key."
        ),
    )
    parser.add_argument(
        "--overwrite",
        action="store_true",
        help=(
            "Truncate existing predictions.jsonl for this --run-id. "
            "Without this flag, the run aborts if predictions already exist "
            "(protects partial results from accidental re-runs)."
        ),
    )
    return parser.parse_args()
def select_instances(
    dataset: Iterable[dict[str, Any]],
    instance_ids: list[str] | None,
    limit: int | None,
) -> list[dict[str, Any]]:
    """Filter dataset rows by instance_ids (if given) and apply limit."""
    rows: list[dict[str, Any]] = list(dataset)
    if instance_ids:
        wanted = set(instance_ids)
        rows = [r for r in rows if r["instance_id"] in wanted]
        missing = wanted - {r["instance_id"] for r in rows}
        if missing:
            print(
                f"WARN: {len(missing)} requested instance_ids not found in dataset: "
                f"{sorted(missing)[:5]}{'...' if len(missing) > 5 else ''}",
                file=sys.stderr,
            )
    if limit is not None:
        rows = rows[:limit]
    return rows
def append_prediction_row(
    predictions_path: Path,
    instance_id: str,
    model_patch: str,
    model_name_or_path: str,
    lock: threading.Lock,
) -> None:
    """Append one JSONL prediction row under a lock (appends are NOT atomic across threads)."""
    row = {
        "instance_id": instance_id,
        "model_patch": model_patch,
        "model_name_or_path": model_name_or_path,
    }
    line = json.dumps(row, ensure_ascii=False) + "\n"
    with lock:
        with predictions_path.open("a", encoding="utf-8") as fp:
            fp.write(line)


def copy_log_if_exists(src: Path, dst: Path) -> None:
    """Copy a log file from the shared scratch volume into the run-log directory, if present."""
    if src.exists() and src.is_file():
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
def run_one_instance(
instance: dict[str, Any],
image: str,
predictions_path: Path,
predictions_dir: Path,
run_dir: Path,
timeout: int,
predictions_lock: threading.Lock,
model_name_or_path: str,
oauth_creds_path: Path | None,
) -> tuple[str, str]:
"""
Run the agent container for a single instance.
Returns a (status, instance_id) tuple where status is one of:
"succeeded", "failed", "timed_out".
On ANY non-success (timeout, non-zero exit, missing diff), a prediction
row with model_patch="" is still appended — the plan requires we never
silently drop an instance.
"""
instance_id: str = instance["instance_id"]
repo: str = instance["repo"]
base_commit: str = instance["base_commit"]
problem_statement: str = instance["problem_statement"]
instance_log_dir = run_dir / instance_id
instance_log_dir.mkdir(parents=True, exist_ok=True)
stderr_log_path = instance_log_dir / "stderr.log"
# Per-instance scratch dir — MUST NOT be shared across containers.
scratch_dir = Path(tempfile.mkdtemp(prefix=f"swebench-{instance_id}-"))
problem_file = scratch_dir / "problem.txt"
problem_file.write_text(problem_statement, encoding="utf-8")
status: str = "failed"
model_patch: str = ""
# Uniquely named so the TimeoutExpired handler can kill it without racing
# other instances on the host.
container_name = f"swebench-agent-{instance_id}-{os.getpid()}-{threading.get_ident()}"
try:
# The orchestrator owns JSONL writes under `predictions_lock` to avoid
# racy concurrent appends across containers — so we DO NOT mount the
# predictions directory into the container. Instead, the agent writes
# its authoritative diff to /scratch/model_patch.diff (via
# CLAUDE_MEM_OUTPUT_DIR), plus ingest/fix logs to the same dir. The
# 5th CLI arg to run-instance.sh is only used in standalone smoke-test
# mode; here we point it at a throwaway path inside the container.
cmd: list[str] = [
"docker",
"run",
"--rm",
"--name",
container_name,
"-e",
"CLAUDE_MEM_OUTPUT_DIR=/scratch",
"-v",
f"{scratch_dir}:/scratch",
]
if oauth_creds_path is not None:
cmd += [
"-e",
"CLAUDE_MEM_CREDENTIALS_FILE=/auth/.credentials.json",
"-v",
f"{oauth_creds_path}:/auth/.credentials.json:ro",
]
else:
# Pay-per-call path.
cmd += ["-e", "ANTHROPIC_API_KEY"]
cmd += [
image,
instance_id,
repo,
base_commit,
"/scratch/problem.txt",
"/scratch/ignored-predictions.jsonl",
]
try:
completed = subprocess.run(
cmd,
timeout=timeout,
capture_output=True,
text=True,
check=False,
)
# Persist stdout and stderr so a post-mortem is possible even on success.
stderr_log_path.write_text(
f"=== STDOUT ===\n{completed.stdout}\n=== STDERR ===\n{completed.stderr}\n",
encoding="utf-8",
)
if completed.returncode == 0:
# Read the diff the agent wrote to the shared scratch volume. The
# container also appends its own prediction line (to the throwaway
# path above), so the orchestrator writes the authoritative row
# from the diff file the agent left in /scratch. If the agent wrote
# a non-empty diff, use it; otherwise fall back to an empty patch.
diff_file = scratch_dir / "model_patch.diff"
if diff_file.exists():
diff_text = diff_file.read_text(encoding="utf-8")
if diff_text.strip():
model_patch = diff_text
status = "succeeded"
else:
status = "failed" # empty diff
else:
# Container did not leave a diff file — treat as failure
# but still emit an empty-patch row below.
status = "failed"
else:
status = "failed"
except subprocess.TimeoutExpired as exc:
status = "timed_out"
# subprocess.run killed the docker CLI, but the container may
# still be running. Force-remove it by name so we don't leak
# containers across the batch.
subprocess.run(
["docker", "rm", "-f", container_name],
capture_output=True,
check=False,
timeout=30,
)
stderr_log_path.write_text(
f"TIMEOUT after {timeout}s (forced docker rm -f {container_name})\n"
f"=== STDOUT (partial) ===\n{exc.stdout or ''}\n"
f"=== STDERR (partial) ===\n{exc.stderr or ''}\n",
encoding="utf-8",
)
# Copy per-turn logs left by the agent in the shared scratch volume.
copy_log_if_exists(scratch_dir / "ingest.jsonl", instance_log_dir / "ingest.jsonl")
copy_log_if_exists(scratch_dir / "fix.jsonl", instance_log_dir / "fix.jsonl")
# Always write a row — never silently drop an instance.
append_prediction_row(
predictions_path=predictions_path,
instance_id=instance_id,
model_patch=model_patch,
model_name_or_path=model_name_or_path,
lock=predictions_lock,
)
except Exception as exc: # pragma: no cover — defensive
status = "failed"
try:
stderr_log_path.write_text(
f"ORCHESTRATOR EXCEPTION: {exc!r}\n",
encoding="utf-8",
)
except OSError:
pass
append_prediction_row(
predictions_path=predictions_path,
instance_id=instance_id,
model_patch="",
model_name_or_path=model_name_or_path,
lock=predictions_lock,
)
finally:
# Per-instance scratch must not leak across containers.
shutil.rmtree(scratch_dir, ignore_errors=True)
return status, instance_id
def main() -> int:
args = parse_args()
repo_root = Path(__file__).resolve().parents[2]
if args.out:
predictions_path = Path(args.out).resolve()
else:
predictions_path = (
repo_root
/ "evals"
/ "swebench"
/ "runs"
/ args.run_id
/ "predictions.jsonl"
)
predictions_dir = predictions_path.parent
run_dir = predictions_dir # logs land in evals/swebench/runs/<run_id>/<instance_id>/
predictions_dir.mkdir(parents=True, exist_ok=True)
# Don't silently discard partial results from a prior run.
if predictions_path.exists() and predictions_path.stat().st_size > 0:
if not args.overwrite:
print(
f"ERROR: {predictions_path} already exists and is non-empty. "
"Pass --overwrite to truncate, or pick a different --run-id.",
file=sys.stderr,
)
return 1
print(
f"WARN: --overwrite set; truncating existing {predictions_path}",
file=sys.stderr,
)
predictions_path.write_text("", encoding="utf-8")
# Resolve auth: OAuth (Max/Pro subscription) or API key.
oauth_creds_path: Path | None = None
if args.auth in ("oauth", "auto"):
oauth_creds_path = extract_oauth_credentials()
if oauth_creds_path is not None:
print(
f"Auth: OAuth credentials extracted to {oauth_creds_path} "
"(mounted read-only into each container). "
"NOTE: Max/Pro has per-window usage limits; batch runs may exhaust them.",
file=sys.stderr,
)
elif args.auth == "oauth":
print(
"ERROR: --auth=oauth requested but credentials extraction failed.",
file=sys.stderr,
)
return 1
if oauth_creds_path is None:
if not os.environ.get("ANTHROPIC_API_KEY"):
print(
"ERROR: no auth available. Either run `claude login` on host "
"(for OAuth) or set ANTHROPIC_API_KEY.",
file=sys.stderr,
)
return 1
print("Auth: ANTHROPIC_API_KEY (pay-per-call).", file=sys.stderr)
print(f"Loading dataset {args.dataset} (split=test)...", file=sys.stderr)
dataset = load_dataset(args.dataset, split="test")
instances = select_instances(dataset, args.instance_ids, args.limit)
total = len(instances)
if total == 0:
print("No instances selected; nothing to do.", file=sys.stderr)
return 0
# Scrub hidden-from-agent fields defensively. The agent container only
# receives instance_id/repo/base_commit/problem_statement via CLI args +
# the per-instance problem file — the hidden fields never leave this
# process. This loop makes that invariant explicit.
for row in instances:
for key in HIDDEN_AGENT_FIELDS:
row.pop(key, None)
model_name_or_path = "claude-opus-4-7+claude-mem"
print(
f"Launching {total} instance(s) with max_concurrent={args.max_concurrent}, "
f"timeout={args.timeout}s, image={args.image}",
file=sys.stderr,
)
predictions_lock = threading.Lock()
succeeded = 0
failed = 0
timed_out = 0
with ThreadPoolExecutor(max_workers=args.max_concurrent) as executor:
future_to_id = {
executor.submit(
run_one_instance,
instance=instance,
image=args.image,
predictions_path=predictions_path,
predictions_dir=predictions_dir,
run_dir=run_dir,
timeout=args.timeout,
predictions_lock=predictions_lock,
model_name_or_path=model_name_or_path,
oauth_creds_path=oauth_creds_path,
): instance["instance_id"]
for instance in instances
}
for future in as_completed(future_to_id):
instance_id = future_to_id[future]
try:
status, _ = future.result()
except Exception as exc: # pragma: no cover — defensive
status = "failed"
print(
f"[{instance_id}] orchestrator future raised: {exc!r}",
file=sys.stderr,
)
if status == "succeeded":
succeeded += 1
elif status == "timed_out":
timed_out += 1
else:
failed += 1
print(
f"[{instance_id}] {status} "
f"({succeeded + failed + timed_out}/{total} done)",
file=sys.stderr,
)
print(
f"{total} total, {succeeded} succeeded, {failed} failed, {timed_out} timed out",
)
# Per plan: exit 0 even if some instances failed.
return 0
if __name__ == "__main__":
sys.exit(main())
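Both the success and exception paths above funnel through `append_prediction_row`, which is defined earlier in the file. Its essential shape is roughly the following — a minimal sketch whose signature is inferred from the call sites, not the actual implementation:

```python
import json
import threading
from pathlib import Path

def append_prediction_row(
    *,
    predictions_path: Path,
    instance_id: str,
    model_patch: str,
    model_name_or_path: str,
    lock: threading.Lock,
) -> None:
    # Serialize the full line first, then append it under the lock so
    # concurrent worker threads never interleave partial JSONL rows.
    line = json.dumps({
        "instance_id": instance_id,
        "model_patch": model_patch,
        "model_name_or_path": model_name_or_path,
    }) + "\n"
    with lock:
        with predictions_path.open("a", encoding="utf-8") as handle:
            handle.write(line)
```

Holding the lock around a single `write()` of a pre-serialized line is what makes the "orchestrator owns JSONL writes" invariant hold under `ThreadPoolExecutor` concurrency.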
@@ -0,0 +1,177 @@
#!/usr/bin/env bash
set -euo pipefail
# run-instance.sh — runs Claude Code + claude-mem against a single SWE-bench
# instance using the two-turn protocol (ingest, then fix), and appends a
# prediction JSONL row to OUT_PREDICTIONS_PATH.
#
# Usage:
# run-instance.sh INSTANCE_ID REPO_SLUG BASE_COMMIT PROBLEM_STATEMENT_FILE OUT_PREDICTIONS_PATH
#
# Required env (one of):
# ANTHROPIC_API_KEY (pay-per-call) or CLAUDE_MEM_CREDENTIALS_FILE (OAuth)
if [[ $# -ne 5 ]]; then
echo "Usage: $0 INSTANCE_ID REPO_SLUG BASE_COMMIT PROBLEM_STATEMENT_FILE OUT_PREDICTIONS_PATH" >&2
exit 2
fi
INSTANCE_ID="$1"
REPO_SLUG="$2"
BASE_COMMIT="$3"
PROBLEM_STATEMENT_FILE="$4"
OUT_PREDICTIONS_PATH="$5"
# Auth: either ANTHROPIC_API_KEY (pay-per-call) OR a pre-extracted OAuth
# credentials file from a Claude Max/Pro subscription (flat-fee, but subject
# to Anthropic's usage limits — batch-scale runs may exhaust the 5h window).
# run-batch.py extracts OAuth creds from host Keychain/file and mounts them
# at CLAUDE_MEM_CREDENTIALS_FILE; standalone smoke-test can do the same, or
# set ANTHROPIC_API_KEY directly.
if [[ -z "${ANTHROPIC_API_KEY:-}" && -z "${CLAUDE_MEM_CREDENTIALS_FILE:-}" ]]; then
echo "ERROR: one of ANTHROPIC_API_KEY or CLAUDE_MEM_CREDENTIALS_FILE is required" >&2
exit 1
fi
if [[ -n "${CLAUDE_MEM_CREDENTIALS_FILE:-}" && ! -f "$CLAUDE_MEM_CREDENTIALS_FILE" ]]; then
echo "ERROR: CLAUDE_MEM_CREDENTIALS_FILE set but file missing: $CLAUDE_MEM_CREDENTIALS_FILE" >&2
exit 1
fi
if [[ ! -f "$PROBLEM_STATEMENT_FILE" ]]; then
echo "ERROR: PROBLEM_STATEMENT_FILE not found: $PROBLEM_STATEMENT_FILE" >&2
exit 1
fi
MODEL_NAME="claude-opus-4-7+claude-mem"
# Per-instance ephemeral scratch dir — isolates ~/.claude/ and ~/.claude-mem/.
SCRATCH=$(mktemp -d)
REPO_DIR="$SCRATCH/repo"
MEM_DIR="$SCRATCH/.claude-mem"
CLAUDE_DIR="$SCRATCH/.claude"
mkdir -p "$MEM_DIR" "$CLAUDE_DIR"
# If using OAuth, seed the isolated CLAUDE_DIR with the mounted credentials
# file so Claude Code finds them at HOME=$SCRATCH → ~/.claude/.credentials.json.
# chmod 600 to match what `claude login` writes (it checks permissions).
if [[ -n "${CLAUDE_MEM_CREDENTIALS_FILE:-}" ]]; then
cp "$CLAUDE_MEM_CREDENTIALS_FILE" "$CLAUDE_DIR/.credentials.json"
chmod 600 "$CLAUDE_DIR/.credentials.json"
fi
# Directory where artifacts the batch orchestrator reads (model_patch.diff,
# ingest.jsonl, fix.jsonl) are written. When run via `docker run -v
# <host-scratch>:/scratch` from run-batch.py, the orchestrator sets
# CLAUDE_MEM_OUTPUT_DIR=/scratch so these files are visible on the host. In
# standalone/smoke-test mode the default keeps artifacts in the ephemeral
# scratch dir alongside the repo.
OUTPUT_DIR="${CLAUDE_MEM_OUTPUT_DIR:-$SCRATCH}"
mkdir -p "$OUTPUT_DIR"
# Always write a prediction row (even on failure) so batch mode stays aligned.
# The trap emits an empty-patch row if we exit before the success path sets
# PREDICTION_EMITTED=1, then cleans up SCRATCH.
DIFF_OUT="$OUTPUT_DIR/model_patch.diff"
INGEST_LOG="$OUTPUT_DIR/ingest.jsonl"
FIX_LOG="$OUTPUT_DIR/fix.jsonl"
PREDICTION_EMITTED=0
cleanup() {
local exit_code=$?
if [[ "$PREDICTION_EMITTED" -ne 1 ]]; then
# Ensure the orchestrator sees an (empty) diff file even on early exit.
: > "$DIFF_OUT" 2>/dev/null || true
jq -nc \
--arg id "$INSTANCE_ID" \
--arg patch "" \
--arg model "$MODEL_NAME" \
'{instance_id:$id, model_patch:$patch, model_name_or_path:$model}' \
>> "$OUT_PREDICTIONS_PATH" || true
fi
rm -rf "$SCRATCH"
exit "$exit_code"
}
trap cleanup EXIT
# Shallow clone + fetch the exact commit. Saves minutes on large repos
# (sympy/django/scikit-learn) vs. a full-history clone. Fallback to a full
# clone if the server rejects the by-commit fetch (GitHub supports
# uploadpack.allowReachableSHA1InWant by default on public repos, but mirrors
# may not).
if ! { git clone --depth 1 --no-single-branch "https://github.com/${REPO_SLUG}.git" "$REPO_DIR" \
&& git -C "$REPO_DIR" fetch --depth 1 origin "$BASE_COMMIT"; }; then
echo "WARN: shallow fetch failed; falling back to full clone" >&2
rm -rf "$REPO_DIR"
git clone "https://github.com/${REPO_SLUG}.git" "$REPO_DIR"
fi
git -C "$REPO_DIR" reset --hard "$BASE_COMMIT"
# ---------- Turn 1: Ingest (populate memory via PostToolUse hook) ----------
INGEST_PROMPT="Please learn about the codebase by systematically and thoroughly reading EVERY SOURCE FILE IN FULL, no matter how many there are. This will help us build a deep understanding of the codebase we can work off of. Don't worry about cost. This is critical and non-negotiable."
SESSION_ID=$(uuidgen | tr '[:upper:]' '[:lower:]')
set +e
(
cd "$REPO_DIR" && HOME="$SCRATCH" claude \
--print \
--session-id "$SESSION_ID" \
--plugin-dir /opt/claude-mem \
--permission-mode bypassPermissions \
--allowedTools "Read,Glob,Grep,Bash(ls *),Bash(wc *)" \
--max-budget-usd 5.00 \
--output-format json \
"$INGEST_PROMPT"
) > "$INGEST_LOG" 2>&1
INGEST_EXIT=$?
set -e
if [[ "$INGEST_EXIT" -ne 0 ]]; then
echo "WARN: ingest turn exited with $INGEST_EXIT; continuing to fix turn" >&2
fi
# ---------- Turn 2: Fix (consume memory via mem-search slash command) ----------
PROBLEM=$(cat "$PROBLEM_STATEMENT_FILE")
QUERY=$(printf '%s' "$PROBLEM" | tr -s '[:space:]' ' ' | cut -c1-200)
FIX_PROMPT="/claude-mem:mem-search ${QUERY}
Problem statement:
${PROBLEM}
Using what you've learned from the codebase (see memory above), produce a minimal unified diff that fixes this bug. Edit files in place. Do NOT commit."
set +e
(
cd "$REPO_DIR" && HOME="$SCRATCH" claude \
--print \
--resume "$SESSION_ID" \
--plugin-dir /opt/claude-mem \
--permission-mode bypassPermissions \
--allowedTools "Read,Glob,Grep,Edit,Write,Bash(git *),Bash(ls *)" \
--max-budget-usd 5.00 \
--output-format json \
"$FIX_PROMPT"
) > "$FIX_LOG" 2>&1
FIX_EXIT=$?
set -e
if [[ "$FIX_EXIT" -ne 0 ]]; then
echo "WARN: fix turn exited with $FIX_EXIT; will still emit prediction row" >&2
fi
# ---------- Capture diff and emit prediction row ----------
# Write the diff to DIFF_OUT first (authoritative for the batch orchestrator),
# then read it back for the JSONL row (kept for standalone/smoke-test use).
git -C "$REPO_DIR" diff > "$DIFF_OUT" || : > "$DIFF_OUT"
DIFF=$(cat "$DIFF_OUT")
jq -nc \
--arg id "$INSTANCE_ID" \
--arg patch "$DIFF" \
--arg model "$MODEL_NAME" \
'{instance_id:$id, model_patch:$patch, model_name_or_path:$model}' \
>> "$OUT_PREDICTIONS_PATH"
PREDICTION_EMITTED=1
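The `tr -s '[:space:]' ' ' | cut -c1-200` pipeline that builds the mem-search query can be mirrored in Python if the same normalization is needed elsewhere — a sketch, not part of the script; the function name is illustrative:

```python
import re

def mem_search_query(problem_statement: str, max_chars: int = 200) -> str:
    # Squeeze every run of whitespace (including newlines and tabs) to a
    # single space, then hard-truncate — the same behavior as
    # `tr -s '[:space:]' ' ' | cut -c1-200` above.
    return re.sub(r"\s+", " ", problem_statement)[:max_chars]
```

Truncating to ~200 characters keeps the slash-command line short while the full problem statement still follows in the prompt body.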
@@ -0,0 +1,152 @@
#!/usr/bin/env bash
set -euo pipefail
# smoke-test.sh — runs ONE SWE-bench instance end-to-end against the agent
# container using OAuth credentials extracted from the host. Use this to
# verify the two-turn protocol + /claude-mem:mem-search slash resolution
# before kicking off a batch run.
#
# Usage:
# evals/swebench/smoke-test.sh [INSTANCE_ID]
#
# Defaults to sympy__sympy-24152 (an easy Verified instance) if no arg given.
#
# Outputs:
# evals/swebench/runs/smoke/<INSTANCE_ID>/{ingest.jsonl,fix.jsonl,model_patch.diff}
# evals/swebench/runs/smoke/predictions.jsonl
INSTANCE_ID="${1:-sympy__sympy-24152}"
DATASET="${DATASET:-princeton-nlp/SWE-bench_Lite}"
IMAGE="${IMAGE:-claude-mem/swebench-agent:latest}"
TIMEOUT="${TIMEOUT:-1800}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
RUN_DIR="$REPO_ROOT/evals/swebench/runs/smoke/$INSTANCE_ID"
PREDICTIONS="$REPO_ROOT/evals/swebench/runs/smoke/predictions.jsonl"
mkdir -p "$RUN_DIR" "$(dirname "$PREDICTIONS")"
# --- Extract OAuth credentials ---
CREDS_FILE="$(mktemp -t claude-mem-creds.XXXXXX.json)"
trap 'rm -f "$CREDS_FILE"' EXIT
# Try macOS Keychain first (primary on Darwin), then fall through to the
# on-disk credentials file — matches docker/claude-mem/run.sh behavior.
creds_obtained=0
if [[ "$(uname)" == "Darwin" ]]; then
if security find-generic-password -s 'Claude Code-credentials' -w > "$CREDS_FILE" 2>/dev/null \
&& [[ -s "$CREDS_FILE" ]]; then
creds_obtained=1
fi
fi
if [[ "$creds_obtained" -eq 0 && -f "$HOME/.claude/.credentials.json" ]]; then
cp "$HOME/.claude/.credentials.json" "$CREDS_FILE"
creds_obtained=1
fi
if [[ "$creds_obtained" -eq 0 ]]; then
echo "ERROR: no Claude OAuth creds found (macOS Keychain or ~/.claude/.credentials.json)" >&2
exit 1
fi
chmod 600 "$CREDS_FILE"
# --- Fetch instance data from HuggingFace via a small Python helper ---
INSTANCE_JSON="$(mktemp)"
trap 'rm -f "$CREDS_FILE" "$INSTANCE_JSON"' EXIT
python3 - "$INSTANCE_ID" "$DATASET" > "$INSTANCE_JSON" <<'PY'
import json, sys
from datasets import load_dataset
target = sys.argv[1]
dataset = sys.argv[2]
ds = load_dataset(dataset, split="test")
for row in ds:
if row["instance_id"] == target:
print(json.dumps({
"instance_id": row["instance_id"],
"repo": row["repo"],
"base_commit": row["base_commit"],
"problem_statement": row["problem_statement"],
}))
break
else:
print(f"ERROR: instance {target} not found", file=sys.stderr)
sys.exit(1)
PY
SCRATCH="$(mktemp -d -t claude-mem-smoke.XXXXXX)"
trap 'rm -f "$CREDS_FILE" "$INSTANCE_JSON"; rm -rf "$SCRATCH"' EXIT
# Parse the instance JSON once: print repo + base_commit to stdout, write the
# problem statement directly to $SCRATCH/problem.txt. INSTANCE_JSON is passed
# as argv so stdin is free for the `python3 -` heredoc script body (previously
# both were competing for stdin, which made json.load see the heredoc's EOF).
read -r REPO BASE_COMMIT < <(
python3 - "$SCRATCH" "$INSTANCE_JSON" <<'PY'
import json, os, sys
scratch, instance_json = sys.argv[1], sys.argv[2]
with open(instance_json) as f:
d = json.load(f)
open(os.path.join(scratch, "problem.txt"), "w").write(d["problem_statement"])
print(d["repo"], d["base_commit"])
PY
)
echo "=== Running $INSTANCE_ID ($REPO @ $BASE_COMMIT) ===" >&2
echo "Scratch: $SCRATCH" >&2
echo "Logs will land in: $RUN_DIR" >&2
# Pick a wall-clock timeout binary. Linux ships `timeout`; macOS needs
# `gtimeout` from coreutils (brew install coreutils). If neither is available,
# warn and run without a cap — the smoke test is manual anyway.
TIMEOUT_CMD=()
if command -v timeout >/dev/null 2>&1; then
TIMEOUT_CMD=(timeout "$TIMEOUT")
elif command -v gtimeout >/dev/null 2>&1; then
TIMEOUT_CMD=(gtimeout "$TIMEOUT")
else
echo "WARN: no \`timeout\`/\`gtimeout\` on PATH; container runs uncapped" >&2
fi
# Name the container so we can force-remove it if the wall-clock timeout
# fires (`timeout` SIGTERMs the docker CLI, but the container itself may
# keep running briefly).
CONTAINER_NAME="claude-mem-smoke-$INSTANCE_ID-$$"
set +e
"${TIMEOUT_CMD[@]}" docker run --rm \
--name "$CONTAINER_NAME" \
-e CLAUDE_MEM_OUTPUT_DIR=/scratch \
-e CLAUDE_MEM_CREDENTIALS_FILE=/auth/.credentials.json \
-v "$SCRATCH:/scratch" \
-v "$CREDS_FILE:/auth/.credentials.json:ro" \
"$IMAGE" \
"$INSTANCE_ID" "$REPO" "$BASE_COMMIT" /scratch/problem.txt /scratch/ignored-predictions.jsonl
DOCKER_EXIT=$?
set -e
if [[ "$DOCKER_EXIT" -eq 124 ]]; then
# `timeout` signals TERM and returns 124 on timeout. Force-remove the
# container in case docker hasn't reaped it yet.
echo "ERROR: docker run exceeded ${TIMEOUT}s wall-clock; removing container" >&2
docker rm -f "$CONTAINER_NAME" >/dev/null 2>&1 || true
fi
# Copy artifacts from scratch → RUN_DIR
for f in ingest.jsonl fix.jsonl model_patch.diff; do
[[ -f "$SCRATCH/$f" ]] && cp "$SCRATCH/$f" "$RUN_DIR/$f"
done
# Emit authoritative prediction row
DIFF_FILE="$SCRATCH/model_patch.diff"
DIFF=""
[[ -f "$DIFF_FILE" ]] && DIFF="$(cat "$DIFF_FILE")"
jq -nc \
--arg id "$INSTANCE_ID" \
--arg patch "$DIFF" \
--arg model "claude-opus-4-7+claude-mem" \
'{instance_id:$id, model_patch:$patch, model_name_or_path:$model}' \
>> "$PREDICTIONS"
echo "=== Done ===" >&2
echo "Diff size: $(wc -c < "$DIFF_FILE" 2>/dev/null || echo 0) bytes" >&2
echo "Predictions: $PREDICTIONS" >&2
echo "Verify mem-search invocation:" >&2
echo " grep -o '\"name\":\"[^\"]*mem-search[^\"]*\"' $RUN_DIR/fix.jsonl || echo 'NOT INVOKED'" >&2
@@ -0,0 +1,308 @@
#!/usr/bin/env python3
"""Summarize SWE-bench evaluation run results.
Walks the SWE-bench harness output directory, tallies resolved/unresolved/error
counts, and emits a markdown summary. Optionally diffs against another run.
"""
import argparse
import json
import sys
from pathlib import Path
def load_expected_instance_ids(predictions_path: Path) -> list[str]:
"""Read instance_ids from a predictions.jsonl file (one JSON object per line)."""
instance_ids: list[str] = []
if not predictions_path.exists():
print(
f"warning: predictions file not found: {predictions_path}",
file=sys.stderr,
)
return instance_ids
with predictions_path.open("r", encoding="utf-8") as handle:
for line_number, raw_line in enumerate(handle, start=1):
stripped = raw_line.strip()
if not stripped:
continue
try:
record = json.loads(stripped)
except json.JSONDecodeError as exc:
print(
f"warning: could not parse predictions line {line_number}: {exc}",
file=sys.stderr,
)
continue
instance_id = record.get("instance_id")
if instance_id:
instance_ids.append(instance_id)
return instance_ids
def load_run_results(
run_id: str,
model_name: str,
expected_instance_ids: list[str],
repo_root: Path,
) -> dict:
"""Walk logs/run_evaluation/<run_id>/<model_name>/*/report.json and tally results.
Returns a dict:
{
"per_instance": {instance_id: {"resolved": bool|None, "notes": str}},
"resolved_count": int,
"unresolved_count": int,
"error_count": int,
}
"""
run_logs_root = repo_root / "logs" / "run_evaluation" / run_id / model_name
per_instance: dict[str, dict] = {}
resolved_count = 0
unresolved_count = 0
error_count = 0
for instance_id in expected_instance_ids:
report_path = run_logs_root / instance_id / "report.json"
if not report_path.exists():
per_instance[instance_id] = {
"resolved": None,
"notes": "missing report.json",
}
error_count += 1
continue
try:
with report_path.open("r", encoding="utf-8") as handle:
report_data = json.load(handle)
except (json.JSONDecodeError, OSError) as exc:
per_instance[instance_id] = {
"resolved": None,
"notes": f"failed to parse report.json: {exc}",
}
error_count += 1
continue
# SWE-bench harness typically nests per-instance data under the
# instance_id key; fall back to the top-level dict for flexibility.
inner = report_data.get(instance_id, report_data)
resolved_value = inner.get("resolved")
if resolved_value is True:
per_instance[instance_id] = {"resolved": True, "notes": ""}
resolved_count += 1
elif resolved_value is False:
notes_parts: list[str] = []
tests_status = inner.get("tests_status")
if isinstance(tests_status, dict):
fail_to_pass = tests_status.get("FAIL_TO_PASS", {})
if isinstance(fail_to_pass, dict):
failed = fail_to_pass.get("failure", []) or []
if failed:
notes_parts.append(f"FAIL_TO_PASS failures: {len(failed)}")
per_instance[instance_id] = {
"resolved": False,
"notes": "; ".join(notes_parts),
}
unresolved_count += 1
else:
per_instance[instance_id] = {
"resolved": None,
"notes": "report.json missing 'resolved' field",
}
error_count += 1
return {
"per_instance": per_instance,
"resolved_count": resolved_count,
"unresolved_count": unresolved_count,
"error_count": error_count,
}
def format_resolved_cell(resolved: bool | None) -> str:
if resolved is True:
return "yes"
if resolved is False:
return "no"
return "error"
def render_summary_markdown(run_id: str, results: dict) -> str:
total = (
results["resolved_count"]
+ results["unresolved_count"]
+ results["error_count"]
)
resolved = results["resolved_count"]
resolve_rate = (resolved / total * 100.0) if total > 0 else 0.0
lines: list[str] = []
lines.append(f"# Run {run_id}")
lines.append(f"- Total: {total}")
lines.append(f"- Resolved: {resolved} ({resolve_rate:.2f}%)")
lines.append(f"- Unresolved: {results['unresolved_count']}")
lines.append(f"- Errors: {results['error_count']}")
lines.append("")
lines.append("## Per-instance")
lines.append("| instance_id | resolved | notes |")
lines.append("|---|---|---|")
for instance_id, record in results["per_instance"].items():
resolved_cell = format_resolved_cell(record["resolved"])
notes_cell = record.get("notes", "") or ""
# Escape pipe chars in notes to avoid breaking markdown tables.
notes_cell = notes_cell.replace("|", "\\|")
lines.append(f"| {instance_id} | {resolved_cell} | {notes_cell} |")
lines.append("")
return "\n".join(lines)
def render_diff_markdown(
current_run_id: str,
other_run_id: str,
current_results: dict,
other_results: dict,
) -> str:
def resolve_rate(results: dict) -> tuple[int, float]:
total = (
results["resolved_count"]
+ results["unresolved_count"]
+ results["error_count"]
)
rate = (results["resolved_count"] / total * 100.0) if total > 0 else 0.0
return total, rate
current_total, current_rate = resolve_rate(current_results)
other_total, other_rate = resolve_rate(other_results)
rate_delta = current_rate - other_rate
lines: list[str] = []
lines.append(f"# Diff vs {other_run_id}")
lines.append(
f"- {current_run_id}: {current_results['resolved_count']}/{current_total} "
f"({current_rate:.2f}%)"
)
lines.append(
f"- {other_run_id}: {other_results['resolved_count']}/{other_total} "
f"({other_rate:.2f}%)"
)
lines.append(f"- Delta: {rate_delta:+.2f} percentage points")
lines.append("")
lines.append("## Per-instance status changes")
lines.append(f"| instance_id | {other_run_id} | {current_run_id} |")
lines.append("|---|---|---|")
all_instance_ids = set(current_results["per_instance"].keys()) | set(
other_results["per_instance"].keys()
)
changes_found = False
for instance_id in sorted(all_instance_ids):
current_record = current_results["per_instance"].get(instance_id)
other_record = other_results["per_instance"].get(instance_id)
current_status = (
format_resolved_cell(current_record["resolved"])
if current_record
else "absent"
)
other_status = (
format_resolved_cell(other_record["resolved"])
if other_record
else "absent"
)
if current_status != other_status:
lines.append(f"| {instance_id} | {other_status} | {current_status} |")
changes_found = True
if not changes_found:
lines.append("| (no status changes) | | |")
lines.append("")
return "\n".join(lines)
def main() -> int:
parser = argparse.ArgumentParser(
description="Summarize SWE-bench evaluation run results."
)
parser.add_argument(
"--run-id",
required=True,
help="Run identifier used in logs/run_evaluation/<run_id>/ and evals/swebench/runs/<run_id>/.",
)
parser.add_argument(
"--compare",
metavar="OTHER_RUN_ID",
default=None,
help="Optional other run_id to diff resolve rates and per-instance status changes against.",
)
parser.add_argument(
"--model-name",
default="claude-opus-4-7+claude-mem",
help="Model name directory inside logs/run_evaluation/<run_id>/.",
)
parser.add_argument(
"--out",
default=None,
help="Output path for the markdown summary (default: evals/swebench/runs/<run_id>/summary.md).",
)
args = parser.parse_args()
# Resolve repo root from this script's location: evals/swebench/summarize.py
script_path = Path(__file__).resolve()
repo_root = script_path.parent.parent.parent
current_predictions_path = (
repo_root / "evals" / "swebench" / "runs" / args.run_id / "predictions.jsonl"
)
current_instance_ids = load_expected_instance_ids(current_predictions_path)
current_results = load_run_results(
run_id=args.run_id,
model_name=args.model_name,
expected_instance_ids=current_instance_ids,
repo_root=repo_root,
)
summary_markdown = render_summary_markdown(args.run_id, current_results)
if args.compare:
other_predictions_path = (
repo_root
/ "evals"
/ "swebench"
/ "runs"
/ args.compare
/ "predictions.jsonl"
)
other_instance_ids = load_expected_instance_ids(other_predictions_path)
other_results = load_run_results(
run_id=args.compare,
model_name=args.model_name,
expected_instance_ids=other_instance_ids,
repo_root=repo_root,
)
diff_markdown = render_diff_markdown(
current_run_id=args.run_id,
other_run_id=args.compare,
current_results=current_results,
other_results=other_results,
)
summary_markdown = summary_markdown + "\n" + diff_markdown
if args.out:
output_path = Path(args.out)
if not output_path.is_absolute():
output_path = (Path.cwd() / output_path).resolve()
else:
output_path = (
repo_root
/ "evals"
/ "swebench"
/ "runs"
/ args.run_id
/ "summary.md"
)
output_path.parent.mkdir(parents=True, exist_ok=True)
output_path.write_text(summary_markdown, encoding="utf-8")
print(str(output_path))
return 0
if __name__ == "__main__":
raise SystemExit(main())
File diff suppressed because one or more lines are too long