gad:eval-run

A GAD eval is a controlled experiment. Every run MUST preserve its outputs so that experiments are reproducible and comparable. This skill describes the full procedure.

The Preservation Contract (mandatory)

Every eval run on an implementation project MUST end with these artifacts preserved:

TRACE.json at evals/<project>/v<N>/TRACE.json — measurement results
Code + planning docs at evals/<project>/v<N>/run/ — what the agent built
Build output at apps/portfolio/public/evals/<project>/v<N>/ — playable demo
CLI logs at evals/<project>/v<N>/.gad-log/ — what commands the agent ran
Runtime attribution in TRACE.json.runtime_identity — which coding agent/runtime actually performed the run

If any of these are missing, the run is invalid and must be re-executed. Verify with gad eval verify. The test suite enforces this via tests/eval-preservation.test.cjs.
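The contract above can be sketched as a local pre-check. This is a minimal illustration, not how gad eval verify is implemented: the function name missing_artifacts and the repo_root parameter are hypothetical, and "present" is assumed to mean a parseable TRACE.json with a non-empty runtime_identity plus non-empty directories at the paths listed above.

```python
import json
from pathlib import Path


def missing_artifacts(repo_root: Path, project: str, version: str) -> list:
    """Return the contract artifacts that are absent for one eval run.

    Hypothetical helper for illustration; `gad eval verify` stays authoritative.
    """
    run_dir = repo_root / "evals" / project / version
    build_dir = repo_root / "apps" / "portfolio" / "public" / "evals" / project / version

    def non_empty_dir(path: Path) -> bool:
        # Assumption: an empty directory does not count as preserved output.
        return path.is_dir() and any(path.iterdir())

    # Load TRACE.json if it exists and parses; otherwise treat it as missing.
    trace = None
    trace_path = run_dir / "TRACE.json"
    if trace_path.is_file():
        try:
            trace = json.loads(trace_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            trace = None

    missing = []
    if trace is None:
        missing.append("TRACE.json")
    if not non_empty_dir(run_dir / "run"):
        missing.append("run/")
    if not non_empty_dir(build_dir):
        missing.append("build output")
    if not non_empty_dir(run_dir / ".gad-log"):
        missing.append(".gad-log/")
    # Runtime attribution lives inside TRACE.json under runtime_identity.
    if not (isinstance(trace, dict) and trace.get("runtime_identity")):
        missing.append("runtime_identity")
    return missing
```

An empty list means the run satisfies the sketch's reading of the contract. Still run gad eval verify before treating the run as valid.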

The Project Layout Contract (mandatory)

Every eval — regardless of framework (GAD, bare, emergent) — MUST place ALL workflow artifacts under game/.planning/:

game/
├── .planning/            ← ALL workflow artifacts (WORKFLOW.md, DECISIONS, skills/, etc.)
├── src/                  ← source code only
├── public/               ← assets only
└── package.json, etc.    ← build config only

This separates process artifacts from source code so experiments can be compared cleanly. The format inside .planning/ is entirely up to the agent — XML, Markdown, whatever — but workflow files MUST NOT be mixed into source directories or placed at the project root.

gad eval preserve detects violations (missing .planning/, workflow files outside .planning/) and logs warnings. Record violations in human review notes.
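As a rough illustration of the two violation types named above, the sketch below scans a game/ directory. It is not the detection logic of gad eval preserve: the function name layout_violations is hypothetical, and the set of names treated as workflow artifacts is an assumption taken from the examples in the layout diagram.

```python
from pathlib import Path

# Assumption: these names stand in for "workflow artifacts"; the real CLI
# may recognise a different or larger set.
WORKFLOW_NAMES = {"WORKFLOW.md", "DECISIONS", "skills"}


def layout_violations(game_dir: Path) -> list:
    """List layout-contract violations found under game_dir (illustrative only)."""
    violations = []
    if not (game_dir / ".planning").is_dir():
        violations.append("missing .planning/")

    for path in sorted(game_dir.rglob("*")):
        rel = path.relative_to(game_dir)
        # Anything inside .planning/ is allowed, whatever its format.
        if rel.parts[0] == ".planning":
            continue
        if path.name in WORKFLOW_NAMES:
            violations.append(
                "workflow artifact outside .planning/: " + rel.as_posix()
            )
    return violations
```

Anything this reports belongs in the human review notes, alongside whatever gad eval preserve itself logged.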

When to use this skill

User says "run the eval", "start the eval", "run round N"
You need to run a specific impl eval project against its requirements
You need to A/B compare conditions (GAD vs bare vs emergent)

Procedure

Step 1 — Generate the bootstrap prompt

gad eval run --project <project-name> --prompt-only --runtime <claude-code|codex|cursor|...>

This creates evals/<project>/v<N>/PROMPT.md by inlining:

The current template/AGENTS.md
The current template/REQUIREMENTS.xml (and any versioned variant)
Source docs from template/source-*
Any planning files in template/.planning/
Any inherited skills in template/skills/ (for emergent evals)

The generated v<N> directory becomes the canonical home for this run's artifacts. The --runtime argument is mandatory for trustworthy telemetry: it stamps TRACE.json, sets the expected hook target, and tells the operator which runtime must have GAD hooks installed.

Step 2 — Spawn an isolated agent

Use the Agent tool with isolation: "worktree" to run the prompt in an isolated git worktree. The agent should build the project under game/ in the worktree root.

Pass the prompt file path in the agent's instructions so it can re-read the prompt if needed. gad eval run now attempts to ensure that the target runtime is installed globally before the run starts. You can still verify or repair the install manually:

gad install all --claude --global   # Claude Code
gad install all --codex --global    # Codex
gad install all --cursor --global   # Cursor

And ensure the run executes with:

GAD_RUNTIME=<runtime-id>
GAD_LOG_DIR=<eval-run-dir>/.gad-log
GAD_EVAL_TRACE_DIR=<eval-run-dir>

New gad eval run prompts include both POSIX and PowerShell snippets for this.
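If you launch the run from a script rather than pasting the generated snippets, the three variables can be assembled as below. This is a sketch only: the helper name eval_env and its parameters are hypothetical. The variable names and the .gad-log location come from the list above.

```python
import os
from pathlib import Path
from typing import Optional


def eval_env(runtime_id: str, run_dir: Path, base: Optional[dict] = None) -> dict:
    """Return a copy of the environment with the three GAD eval variables set.

    Hypothetical helper; `run_dir` is the evals/<project>/v<N> directory.
    """
    env = dict(os.environ if base is None else base)
    env["GAD_RUNTIME"] = runtime_id
    env["GAD_LOG_DIR"] = str(run_dir / ".gad-log")
    env["GAD_EVAL_TRACE_DIR"] = str(run_dir)
    return env
```

The resulting dict can be passed as the env argument of subprocess.run when starting the agent process, which leaves the parent shell's environment unchanged.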

Step 3 — Preserve outputs (MANDATORY)

After the agent completes:

gad eval preserve <project-name> v<N> --from <worktree-path>

This copies:

<worktree>/game/src/, public/, .planning/, skills/, config files → evals/<project>/v<N>/run/
<worktree>/game/dist/ → apps/portfolio/public/evals/<project>/v<N>/
<worktree>/.planning/.gad-log/ → evals/<project>/v<N>/.gad-log/ (if present)
runtime-attributed eval logs routed via GAD_LOG_DIR → evals/<project>/v<N>/.gad-log/

Do this for every run, without exception. If you skip it, the outputs are lost when the worktree is cleaned up. The test suite will fail if preservation is missing for any new run.
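To make the copy mapping concrete, the sketch below computes which source directories exist in a worktree and where each would land. It copies nothing and is not the implementation of gad eval preserve. The function name preserve_plan is hypothetical, config files are left out because they are not enumerated here, and it assumes each directory keeps its own name under run/.

```python
from pathlib import Path


def preserve_plan(repo_root: Path, worktree: Path, project: str, version: str) -> list:
    """Return (source, destination) pairs for directories present in the worktree.

    Illustrative dry run only; use `gad eval preserve` to do the real copy.
    """
    game = worktree / "game"
    run_dir = repo_root / "evals" / project / version

    # Assumption: each directory is copied under run/ with its name unchanged.
    plan = [
        (game / name, run_dir / "run" / name)
        for name in ("src", "public", ".planning", "skills")
    ]
    # The build output goes to the portfolio app so the demo is playable.
    plan.append(
        (game / "dist",
         repo_root / "apps" / "portfolio" / "public" / "evals" / project / version)
    )
    # CLI logs written inside the worktree, if any.
    plan.append((worktree / ".planning" / ".gad-log", run_dir / ".gad-log"))

    # Keep only sources that actually exist.
    return [(src, dst) for src, dst in plan if src.exists()]
```

A plan with no game/dist entry is an early sign that the build step never ran, which is worth knowing before the worktree is cleaned up.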

Step 4 — Verify preservation

gad eval verify

This shows a table of all runs and flags any missing artifacts. Every recent run should show OK. Runtime identity is now part of that contract. Legacy runs may still fail verification because they predate runtime attribution; treat that as historical debt, not as a reason to weaken the contract for new runs.

Step 5 — Write or reconstruct TRACE.json

If you have CLI logs:

gad eval trace from-log --project <project-name> --version v<N>

Or reconstruct from git history:

gad eval trace reconstruct --project <project-name> --version v<N>

Or write TRACE.json manually with the measured dimensions (see DEFINITIONS.md).

Step 6 — Human review

Open the build:

gad eval open <project-name> v<N>

Score it from 0.0 to 1.0 and record the review:

gad eval review <project-name> v<N> --score 0.65 --notes "notes here"

The composite is then recomputed automatically, with caps driven by the human review score: a score below 0.20 caps the composite at 0.40, and a score below 0.10 caps it at 0.25.
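The cap rule can be sketched as a small function. This rests on an assumed reading: the thresholds apply to the human review score, and the caps are upper bounds on the composite. The name capped_composite is hypothetical, and how the uncapped composite is computed is out of scope here (see DEFINITIONS.md).

```python
def capped_composite(composite: float, human_review: float) -> float:
    """Apply the human-review caps to an already computed composite score.

    Assumed reading of the rule: review < 0.10 caps the composite at 0.25,
    review < 0.20 caps it at 0.40, otherwise the composite is unchanged.
    """
    if human_review < 0.10:
        cap = 0.25
    elif human_review < 0.20:
        cap = 0.40
    else:
        return composite
    # A cap only ever lowers the composite; it never raises a low one.
    return min(composite, cap)
```

Under this reading, a blank build scored 0.0 by the reviewer cannot rank above 0.25, whatever its automated dimensions say.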

Step 7 — Check the report

gad eval report

This shows a cross-project comparison with human review scores and composite rankings.

A/B experiments

When running an A/B comparison (e.g., GAD vs bare vs emergent):

Generate prompts for all conditions with the SAME base requirements version
Launch all agents in parallel (each gets its own worktree)
Preserve each run immediately on completion
Score all conditions with the same human reviewer and rubric
Document findings in evals/FINDINGS-<date>-<label>.md
Reference the requirements version used in the findings doc

What gets tested by the preservation test suite

tests/eval-preservation.test.cjs enforces:

gad eval preserve actually copies files correctly
Every impl eval run at or after the contract cutoff has TRACE.json
Every impl eval run at or after the cutoff has run/ with code
Every impl eval run at or after the cutoff has a preserved build

If you add a new impl eval project, update IMPL_EVAL_PROJECTS and PRESERVATION_CONTRACT_CUTOFF in the test file.

Common failure modes

Agent forgets to copy build — gad eval preserve handles this for you, don't rely on the agent
Worktree cleaned up before preserve — preserve BEFORE the worktree is removed
TRACE.json missing — write it manually or via gad eval review which updates the composite
Build renders blank — still preserve it, but score human_review = 0.0 (caps will apply)