Evaluate and evolve agents under measurable pressure.
Get Anything Done is a planning + evaluation framework for AI coding agents. We give agents real implementation tasks, measure the pressure the requirements apply, and score the outcome across rounds. The goal isn't to ship a framework for faster software — the goal is to find out what works, why, and under what conditions. Every decision lives in the repo.
- Playable runs: 19
- Runs scored: 18
- Decisions logged: 171
- Requirements: v5
New this round: CSH testing via the Emergent workflow. Round 4's Emergent v4 scored 0.885 after authoring two new skills and deprecating one — the first observed full skill-ratcheting cycle. See the evidence.
Honest disclosure: N=2-5 runs per condition. One human reviewer. One task domain. The "hypotheses" on this site are exploratory observations, not tested claims. We hold every claim to its strongest critique on /skeptic — read it before trusting any number on this site.
Hypothesis tracks
Every hypothesis, one line per round.
Each line is a research track we are testing. Freedom = bare workflow. CSH = emergent workflow. GAD framework = full framework. Planned tracks (content-driven, codex runtime) show as dashed ghost lines so you can see the research plan even where no data exists yet. Click a round to filter the Playable Archive below. Read /skeptic before trusting any individual point — sample sizes are small.
Legend: solid lines show rounds with real data. Dashed lines are planned tracks where no runs have been scored yet — they exist to make the research plan visible. Data provenance: values come from EVAL_RUNS[n].humanReviewNormalized.aggregate_score, grouped by round + workflow at prebuild.
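That grouping can be sketched as follows. This is an illustrative sketch, not the site's actual build code: the `EVAL_RUNS` entry shape (`round`, `workflow`, `humanReviewNormalized`) is assumed from the provenance note above.

```javascript
// Group runs by "round + workflow" and average each group's
// humanReviewNormalized.aggregate_score. Runs without a score are skipped.
function groupAggregateScores(runs) {
  const groups = new Map();
  for (const run of runs) {
    const score = run.humanReviewNormalized?.aggregate_score;
    if (score == null) continue; // unscored runs don't plot
    const key = `${run.round}:${run.workflow}`;
    const entry = groups.get(key) ?? { sum: 0, count: 0 };
    entry.sum += score;
    entry.count += 1;
    groups.set(key, entry);
  }
  // Collapse each group to its mean score.
  return new Map(
    [...groups].map(([key, { sum, count }]) => [key, sum / count])
  );
}

const byTrack = groupAggregateScores([
  { round: 4, workflow: "emergent", humanReviewNormalized: { aggregate_score: 0.805 } },
  { round: 4, workflow: "bare" }, // not yet scored → excluded
]);
console.log(byTrack.get("4:emergent")); // 0.805
```

Unscored runs dropping out of the map is what makes planned tracks render as ghost lines: no group, no data point.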
Playable preview
Try the latest builds. Right here.
A quick taste of the most recent scored builds. For the full catalog of 34+ runs across all domains, visit the project market.
escape-the-dungeon-bare
requirements v5 · 2026-04-10
- Composite: 0.000
- Human: 0.00
- Tokens: 111,001
- Build time: 24 min
- Runtime: —
- Started: —
- Commits: —
Experiment log
Round by round. What we asked. What the agents actually shipped.
The experiment log is append-only. Each entry captures the requirements version, the workflow conditions that ran, the scores, and the key finding that drove the next round's changes.
Showing 4 of 4 rounds
Greenfield, three-condition, requirements v4 (pressure-oriented)
Date: 2026-04-09
Requirements version: v4 (pressure over features, authored-only, 4 gates including forge-with-ingenuity-payoff and pressure-mechanics)
Conditions: GAD v10, Bare v5, Emergent v4 — run serially after round 3's parallel attempt hit the shared account rate limit (gad-62)
Framework versioning: first round under trace schema v4 with hook-captured events (phase 25). Framework version stamped on every TRACE.json.
Results:
| Condition | Version | Human (rubric) | Composite | Notes |
|---|---|---|---|---|
| **Bare v5** | v5 | TBD | TBD | Complete playable game against v4 pressure requirements. DOM + iconify-icon + @iconify-json/game-icons. 2 floors × 8 rooms. |
| **Emergent v4** | v4 | **0.805** (rubric aggregate) | TBD | Complete playable, "incredible" book-like UI, DoT/resistance/stacking mechanics, first observed full skill ratcheting cycle — authored dom-over-kaplay + pressure-forge-coupling + CHANGELOG. 6-dimension rubric including skill_inheritance_effectiveness 0.95. |
| **GAD v10** | v4 | **0.02** (rubric aggregate) | — | **API-interrupted** (HTTP 529 overloaded_error, gad-64). Title screen rendered with a novel visual treatment (ui_polish 0.10) but planning phase crashed before scene implementation. Excluded from cross-round quality comparisons per gad-63 + gad-64. |
| GAD v9 | v4 | 0.05 (legacy score) | — | Rate-limited during round 4 attempt #1 (parallel). Start screen only. Excluded from cross-round quality. |
Key findings — freedom hypothesis holds under v4:
- Under pressure-oriented v4 requirements, Bare + Emergent both shipped complete playable games; GAD was API-interrupted before implementation. The freedom hypothesis (gad-36) still holds, now with v4 as the stricter test.
- First observed full skill ratcheting cycle. Emergent v4 inherited from emergent v3, authored 2 new project-tailored skills (dom-over-kaplay, pressure-forge-coupling), documented the disposition of each inherited skill in CHANGELOG.md, and deprecated kaplay-scene-pattern as unusable under DOM architecture. This is the first round where the compound-skills hypothesis (gad-65) has evidence to evaluate.
- Convergent design evolution. All three conditions independently chose DOM + iconify-icon + @iconify-json/game-icons + per-floor forced-craft encounters, suggesting v4's pressure requirements are narrow enough to collapse the solution space regardless of framework.
- Rubric replaces single-score human review (phase 27 track 1, gad-61). Emergent gets a 6th dimension `skill_inheritance_effectiveness` as the CSH test signal.
User playtest captured 12 v5 requirements (`evals/_v5-requirements-addendum.md`): training-via-encounter, rune discovery loop, merchants, NPC dialogue, inventory/equipment + skill tree, spell/skill loadout slots, end-boss reachability, save checkpoints, notification lifecycle, rest rooms actually rest, 2D map navigation.
Documented in:
- `evals/FINDINGS-2026-04-09-round-4-complete.md`
- `evals/FINDINGS-2026-04-09-round-4-partial.md`
- `evals/_v5-requirements-addendum.md`
Decisions landed this round: gad-61 (programmatic eval priority), gad-62 (serial default), gad-63 (rate-limited preserve-but-exclude), gad-64 (api-interrupted as distinct failure category), gad-65 (compound-skills hypothesis), gad-66 (authored-content injection experiment queued), gad-67 (serial as permanent default).
Led to:
- v5 requirements addendum (12 new/changed requirements from playtest)
- Phase 27 rubric shipping (per-dimension scoring, RubricRadar SVG, /rubric page)
- gad-66 content-pack extraction experiment
- HTTP 529 investigation queued before GAD v11 retry (task 21-23b)
- Serial-only execution as permanent default (gad-67)
---
Graphs
All hypotheses, plotted.
Interactive charts covering every scored run across all three workflow conditions (GAD, bare, emergent). Hover for details. The hypothesis tracks chart above shows the cross-round trajectory; these show the per-run data points.
Composite vs human review
Points above the diagonal: human rates higher than automated composite. The freedom hypothesis shows bare consistently above the line while GAD clusters below.
All scored runs — human review
Every run with a human review score, ranked highest to lowest. Color = workflow condition.
Data provenance: scatter reads scores.composite and humanReview.score from TRACE.json per run. Bar chart reads humanReviewNormalized.aggregate_score. Rate-limited and API-interrupted runs excluded per gad-63 + gad-64. See /data for the full provenance index.
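The exclusion rule per gad-63 and gad-64 can be sketched like this. The status strings and field shapes are assumptions for illustration, not the repo's actual TRACE.json schema:

```javascript
// Per gad-63 (rate-limited) and gad-64 (API-interrupted), interrupted runs
// are preserved in the archive but excluded from cross-round quality charts.
const EXCLUDED_STATUSES = new Set(["rate-limited", "api-interrupted"]);

// Keep only runs that are uninterrupted AND carry both scores the
// scatter plot reads: scores.composite and humanReview.score.
function chartableRuns(runs) {
  return runs.filter(
    (run) =>
      !EXCLUDED_STATUSES.has(run.status) &&
      run.scores?.composite != null &&
      run.humanReview?.score != null
  );
}

const runs = [
  { id: "emergent-v4", status: "complete", scores: { composite: 0.7 }, humanReview: { score: 0.805 } },
  { id: "gad-v10", status: "api-interrupted", scores: { composite: 0.02 }, humanReview: { score: 0.02 } },
];
const kept = chartableRuns(runs);
console.log(kept.map((r) => r.id)); // [ 'emergent-v4' ]
```

Preserve-but-exclude means the filter runs at chart time; the underlying TRACE.json files are never deleted.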
Lineage
Built on the GSD principles, built to be measured.
GAD is downstream of Get Shit Done — small loops, visible state, executable specs. The talk is the creator's perspective on why tight planning loops beat ad-hoc prompting alone. We took those principles, wrote the CLI to make them cheap, and then bolted on an eval harness so drift and regressions show up in benchmarks instead of vibes.
The video starts at the segment where the structured-planning argument lands. The full talk is worth watching if you've ever wondered why your agent is confidently producing the wrong code.
The loop
Five steps. Every session. No variation.
snapshot → pick one task → implement → update planning docs → commit. The CLI gives the agent a single command to re-hydrate context; skills tell the agent what methodology to apply; subagents do the expensive work off the main thread. That's the whole framework.
The canonical GAD loop: snapshot for context, pick a task, implement, update planning docs, commit. Subagents handle research, planning, and verification.
What it looks like in the terminal
Abbreviated example of a real session — planning phase 19 with the CLI and subagents driving the work while the user stays out of the micromanagement.
- user
let's plan the next phase — we need to ship the eval preservation contract
- agent
Running `gad snapshot --projectid get-anything-done` to hydrate context before planning.
- tool
Snapshot (sprint 4): phases 18→22 · current: 18 · next-action: plan phase 19 preservation contract · 38 decisions loaded
- agent
I'll use the `/gad:plan-phase 19` skill. Spawning gad-phase-researcher first to survey prior eval runs that lost their artifacts.
- tool
gad-phase-researcher → 3 runs lost artifacts in worktree cleanup · root cause: no preserve step · proposed contract in RESEARCH.md
- agent
Good. Spawning gad-planner with the research attached. Goal-backward verification on: evals preserved at canonical paths, tests enforce, verify command audits.
- tool
gad-planner → PLAN.md written · 6 tasks · goal-backward verification passes · committed as 19-01 through 19-06
- user
yes go
- agent
Switching to `/gad:execute-phase 19`. Task 19-01: implement `gad eval preserve` command. Wave 1.
- tool
git commit 19-01: feat(eval): add preserve command · TASK-REGISTRY.xml updated · STATE.xml next-action → 19-02
The eval framework
Three workflows. One scoring formula. No hiding behind process metrics.
Each eval project ships a REQUIREMENTS.xml with versioned gate criteria. Every run produces a TRACE.json with a composite score. Process metrics matter, but they cannot rescue a run that ships a broken game — human review carries a 30% weight precisely so "the process was followed" isn't a free pass.
Composite score weights
Defined in evals/<project>/gad.json. Same formula across every implementation eval, so you can compare a GAD run to a Bare run apples-to-apples.
| Dimension | Weight | What it measures |
|---|---|---|
| Human review | 0.30 | Subjective quality vote — gates everything. |
| Requirement coverage | 0.15 | How many gate criteria the artifact passes. |
| Planning quality | 0.15 | Phases, tasks, and decisions actually captured. |
| Per-task discipline | 0.15 | Atomic commits with task IDs vs batch dumps. |
| Skill accuracy | 0.10 | Did the agent invoke skills when their triggers fired? |
| Time efficiency | 0.05 | Wall-clock vs the project's expected envelope. |
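As a rough sketch, the composite is a weighted sum over the dimensions above. The field names below are illustrative, not the actual `gad.json` schema, and each dimension score is assumed to be pre-normalized to the 0–1 range:

```javascript
// Weights copied from the table above; the canonical definition
// lives in evals/<project>/gad.json.
const WEIGHTS = {
  human_review: 0.30,
  requirement_coverage: 0.15,
  planning_quality: 0.15,
  per_task_discipline: 0.15,
  skill_accuracy: 0.10,
  time_efficiency: 0.05,
};

// Weighted sum; a missing dimension contributes 0 rather than throwing.
function compositeScore(dimensions) {
  return Object.entries(WEIGHTS).reduce(
    (sum, [key, weight]) => sum + weight * (dimensions[key] ?? 0),
    0
  );
}

// Example: strong human review, weak per-task discipline.
const score = compositeScore({
  human_review: 0.9,
  requirement_coverage: 0.8,
  planning_quality: 0.7,
  per_task_discipline: 0.4,
  skill_accuracy: 1.0,
  time_efficiency: 0.5,
});
console.log(score.toFixed(3)); // prints 0.680
```

Because human review carries the largest single weight, a run that ships a broken game cannot be rescued by perfect process scores alone.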
Run it locally
One repo. One CLI. Five commands to your first eval run.
The CLI lives at bin/gad.cjs. The eval projects live under evals/. Everything else is committed planning state. No services, no auth, no telemetry.
# 1. Clone the repo
git clone https://github.com/MagicbornStudios/get-anything-done
cd get-anything-done

# 2. See available eval projects
node bin/gad.cjs eval list

# 3. Bootstrap an agent prompt for one project
node bin/gad.cjs eval bootstrap escape-the-dungeon-bare

# 4. Run an eval (creates an isolated git worktree)
node bin/gad.cjs eval run escape-the-dungeon-bare

# 5. After the agent finishes, preserve and verify
node bin/gad.cjs eval preserve escape-the-dungeon-bare v4 --from <worktree>
node bin/gad.cjs eval verify