gad:eval-suite

Runs multiple eval projects in parallel using the Agent tool, then reconstructs traces and produces a cross-project comparison report.

When to use

After making changes to GAD skills, workflows, or the CLI
When iterating on the framework and you need to validate changes across multiple projects
As a regular eval cadence: run the suite, review, improve, repeat

Step 1 — Generate bootstrap prompts

gad eval suite

This finds all eval projects with a template/ directory and generates a PROMPT.md for each. Output goes to a timestamped directory under evals/.suite-runs/.
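The discovery rule described above can be sketched in Python. This is only an illustration, not the CLI's actual implementation: the helper names `find_runnable_evals` and `suite_run_dir` are hypothetical, the assumption that each project is a direct child of the evals directory is mine, and the timestamp format is a guess.

```python
from datetime import datetime
from pathlib import Path


def find_runnable_evals(evals_root: Path) -> list[str]:
    """Return names of eval projects that contain a template/ directory.

    Assumes (hypothetically) that each project is a direct child of evals_root.
    """
    return sorted(
        child.name
        for child in evals_root.iterdir()
        # Skip hidden directories such as .suite-runs itself.
        if child.is_dir()
        and not child.name.startswith(".")
        and (child / "template").is_dir()
    )


def suite_run_dir(evals_root: Path, now: datetime) -> Path:
    """Build a timestamped output path under .suite-runs/ (format is assumed)."""
    return evals_root / ".suite-runs" / now.strftime("%Y-%m-%dT%H-%M-%S")
```

A project directory without template/ is simply left out of the result, which matches the "runnable evals" wording in the CLI quick reference.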

To run specific projects only:

gad eval suite --projects escape-the-dungeon,portfolio-bare
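As a rough sketch of how a comma-separated `--projects` value might be resolved against the discovered projects: `select_projects` is a hypothetical helper, and rejecting unknown names with an error is an assumption rather than documented CLI behavior.

```python
from typing import Optional


def select_projects(available: list[str], projects_arg: Optional[str]) -> list[str]:
    """Resolve a comma-separated --projects value against the available evals."""
    if not projects_arg:
        # No filter given: run everything that was discovered.
        return list(available)
    requested = [name.strip() for name in projects_arg.split(",") if name.strip()]
    unknown = [name for name in requested if name not in available]
    if unknown:
        # Assumed behavior: fail loudly instead of silently skipping typos.
        raise ValueError(f"unknown eval project(s): {', '.join(unknown)}")
    return requested
```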

Step 2 — Launch agents in parallel

For EACH prompt file generated, launch a background agent:

Agent(
  prompt=<contents of the PROMPT.md file>,
  isolation="worktree",
  run_in_background=true
)

Launch all agents in a single message so they run in parallel, not sequentially.
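To prepare the batch, the prompt files can be gathered first and turned into one set of launch parameters per agent. The sketch below assumes only that each generated file is named PROMPT.md somewhere under the suite-run directory; `build_agent_requests` is a hypothetical helper, and the dictionaries merely mirror the Agent call shown above rather than invoking it.

```python
from pathlib import Path


def build_agent_requests(run_dir: Path) -> list[dict]:
    """Collect every PROMPT.md under run_dir into Agent-style parameter sets."""
    requests = []
    for prompt_path in sorted(run_dir.rglob("PROMPT.md")):
        requests.append(
            {
                "prompt": prompt_path.read_text(encoding="utf-8"),
                "isolation": "worktree",
                "run_in_background": True,
            }
        )
    return requests
```

Building the full list before launching anything makes it easier to issue all Agent calls together in one message.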

Step 3 — Wait for completion

All agents run simultaneously. You're notified as each completes.

Step 4 — Reconstruct traces

For each completed eval:

gad eval trace reconstruct --project <name>

This parses git history from the eval's worktree to build TRACE.json — no agent cooperation needed (gad-22).
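The idea of rebuilding a trace purely from commit history can be illustrated with a small parser. This is a sketch under stated assumptions: the real TRACE.json schema is not described here, so the `project` and `events` fields are hypothetical, as are the helper names. The input is the text produced by `git log --reverse --pretty=format:%H%x09%aI%x09%s` run inside the eval's worktree.

```python
import json

# Tab-separated: commit hash, author date (strict ISO 8601), subject line.
GIT_LOG_FORMAT = "%H%x09%aI%x09%s"


def parse_git_log(log_text: str) -> list[dict]:
    """Turn tab-separated git log output into a list of event records."""
    events = []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        # maxsplit=2 keeps any tabs inside the subject intact.
        commit, timestamp, subject = line.split("\t", 2)
        events.append({"commit": commit, "timestamp": timestamp, "subject": subject})
    return events


def build_trace(project: str, log_text: str) -> str:
    """Serialize a hypothetical trace document; the schema is illustrative only."""
    return json.dumps({"project": project, "events": parse_git_log(log_text)}, indent=2)
```

Because everything comes from commits the agent already made, nothing extra has to be logged during the run.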

Step 5 — Cross-project report

gad eval report

Produces a comparison table across all projects:

| Project | Version | Phases | Tasks | Discipline | Planning | Skill Acc | Composite |
| --- | --- | --- | --- | --- | --- | --- | --- |

To compare specific projects:

gad eval report --projects escape-the-dungeon,portfolio-bare
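For a sense of how per-project results line up under those columns, here is a minimal rendering sketch. It makes no claim about how `gad eval report` computes or formats its scores: `render_report` is a hypothetical helper, the row dictionaries are an assumed input shape, and missing values are shown as `-`.

```python
COLUMNS = [
    "Project", "Version", "Phases", "Tasks",
    "Discipline", "Planning", "Skill Acc", "Composite",
]


def render_report(rows: list[dict]) -> str:
    """Render one aligned text line per project, with a header line on top."""
    table = [COLUMNS] + [[str(row.get(col, "-")) for col in COLUMNS] for row in rows]
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(line[i]) for line in table) for i in range(len(COLUMNS))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(line, widths)).rstrip()
        for line in table
    )
```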

Step 6 — Report findings

For each eval:

What improved since the last run?
What regressed?
Which skills were triggered, and which were missing?
Which conventions were generated?
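The first two questions amount to a metric-by-metric comparison of two runs. The sketch below assumes each run is summarized as a mapping from metric name to a numeric score where higher is better. That input shape, the `tolerance` parameter, and the name `compare_runs` are all hypothetical.

```python
def compare_runs(
    previous: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.0,
) -> tuple[dict[str, float], dict[str, float]]:
    """Split shared metrics into improved and regressed, keyed to their delta."""
    improved: dict[str, float] = {}
    regressed: dict[str, float] = {}
    # Only metrics present in both runs can be compared.
    for metric in sorted(previous.keys() & current.keys()):
        delta = current[metric] - previous[metric]
        if delta > tolerance:
            improved[metric] = delta
        elif delta < -tolerance:
            regressed[metric] = delta
    return improved, regressed
```

A small nonzero tolerance keeps run-to-run noise from being reported as a change.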

Cross-eval patterns:

Which skills work across all project types?
Which need per-project adaptation?
Where does the loop break consistently?
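One way to approach the first two cross-eval questions is with set operations over the skills each project triggered. This is an illustrative sketch: `classify_skills` is a hypothetical helper, and it assumes the triggered skills per project have already been extracted from the traces.

```python
def classify_skills(triggered: dict[str, set[str]]) -> tuple[set[str], set[str]]:
    """Return (skills seen in every project, skills seen in only some projects)."""
    if not triggered:
        return set(), set()
    all_seen = set().union(*triggered.values())
    universal = set.intersection(*triggered.values())
    # Skills outside the intersection are candidates for per-project adaptation.
    return universal, all_seen - universal
```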

Iteration cycle

make changes → gad eval suite → launch agents → gad eval trace reconstruct → gad eval report → review → repeat

Each iteration should:

Fix at least one finding from the prior run
Add at least one new eval criterion
Tighten at least one skill based on findings

CLI quick reference

| Command | Purpose |
| --- | --- |
| `gad eval suite` | Generate prompts for all runnable evals |
| `gad eval run --project <name> --prompt-only` | Generate prompt for one eval |
| `gad eval run --project <name>` | Generate prompt + create worktree |
| `gad eval trace reconstruct --project <name>` | Build TRACE.json from git history |
| `gad eval report` | Cross-project comparison table |
| `gad eval scores --project <name>` | Compare runs within one project |
| `gad eval diff v1 v2 --project <name>` | Diff two specific runs |