requirements v5
milestone gad-v1.1

Evaluate and evolve agents under measurable pressure.

Get Anything Done is a planning + evaluation framework for AI coding agents. We give agents real implementation tasks, measure the pressure the requirements apply, and score the outcome across rounds. The goal isn't to ship a framework for faster software — the goal is to find out what works, why, and under what conditions. Every decision lives in the repo.

  • Playable runs: 19
  • Runs scored: 18
  • Decisions logged: 171
  • Requirements: v5

New this round: CSH testing via the Emergent workflow. Round 4's Emergent v4 scored 0.885 after authoring two new skills and deprecating one — the first observed full skill-ratcheting cycle. See the evidence.

Honest disclosure: N=2-5 runs per condition. One human reviewer. One task domain. The "hypotheses" on this site are exploratory observations, not tested claims. We hold every claim to its strongest critique on /skeptic — read it before trusting any number on this site.

Hypothesis tracks

Every hypothesis, one line per round.

Each line is a research track we are testing. Freedom = bare workflow. CSH = emergent workflow. GAD framework = full framework. Planned tracks (content-driven, codex runtime) show as dashed ghost lines so you can see the research plan even where no data exists yet. Click a round to filter the Playable Archive below. Read /skeptic before trusting any individual point — sample sizes are small.

Domain:

Legend: solid lines show rounds with real data. Dashed lines are planned tracks where no runs have been scored yet — they exist to make the research plan visible. Data provenance: values come from EVAL_RUNS[n].humanReviewNormalized.aggregate_score, grouped by round + workflow at prebuild.
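The grouping described above can be sketched as follows. This is a minimal illustration, not the site's actual prebuild code: the `EVAL_RUNS` array shape is assumed from the field path named in the text, and the sample values are placeholders, not real run data.

```javascript
// Sketch of the prebuild grouping: one chart point per (round, workflow),
// averaging humanReviewNormalized.aggregate_score across runs in the group.
// These EVAL_RUNS entries are illustrative placeholders, not real run data.
const EVAL_RUNS = [
  { round: 4, workflow: "bare",     humanReviewNormalized: { aggregate_score: 0.70 } },
  { round: 4, workflow: "emergent", humanReviewNormalized: { aggregate_score: 0.80 } },
  { round: 4, workflow: "emergent", humanReviewNormalized: { aggregate_score: 0.90 } },
];

function trackPoints(runs) {
  const groups = new Map();
  for (const run of runs) {
    const key = `${run.round}|${run.workflow}`;
    const scores = groups.get(key) ?? [];
    scores.push(run.humanReviewNormalized.aggregate_score);
    groups.set(key, scores);
  }
  // One point per (round, workflow): the mean aggregate score.
  return [...groups.entries()].map(([key, scores]) => {
    const [round, workflow] = key.split("|");
    return {
      round: Number(round),
      workflow,
      score: scores.reduce((a, b) => a + b, 0) / scores.length,
    };
  });
}
```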

Playable preview

Try the latest builds. Right here.

A quick taste of the most recent scored builds. For the full catalog of 34+ runs across all domains, visit the project market.

playable: escape-the-dungeon-bare/v6 · Open full screen

Bare · v6 · Needs review · Gate passed · Round 5

escape-the-dungeon-bare

requirements v5 · 2026-04-10

  • Composite: 0.000
  • Human: 0.00
  • Tokens: 111,001
  • Build time: 24 min
  • Runtime: Started
  • Commits

Full breakdown · Source on GitHub

Experiment log

Round by round. What we asked. What the agents actually shipped.

The experiment log is append-only. Each entry captures the requirements version, the workflow conditions that ran, the scores, and the key finding that drove the next round's changes.

Showing 4 of 4 rounds

Round 4

Greenfield, three-condition, requirements v4 (pressure-oriented)

Tags: etd · etd-bare · etd-emergent · gad · bare · emergent (6 runs)

  • Date: 2026-04-09
  • Requirements version: v4 (pressure over features, authored-only, 4 gates including forge-with-ingenuity-payoff and pressure-mechanics)
  • Conditions: GAD v10, Bare v5, Emergent v4 — run serially after round 3's parallel attempt hit the shared account rate limit (gad-62)
  • Framework versioning: first round under trace schema v4 with hook-captured events (phase 25). Framework version stamped on every TRACE.json.

Results:

| Condition | Version | Human (rubric) | Composite | Notes |
| --- | --- | --- | --- | --- |
| **Bare v5** | v5 | TBD | TBD | Complete playable game against v4 pressure requirements. DOM + iconify-icon + @iconify-json/game-icons. 2 floors × 8 rooms. |
| **Emergent v4** | v4 | **0.805** (rubric aggregate) | TBD | Complete playable, "incredible" book-like UI, DoT/resistance/stacking mechanics, first observed full skill-ratcheting cycle: authored dom-over-kaplay + pressure-forge-coupling + CHANGELOG. 6-dimension rubric including skill_inheritance_effectiveness 0.95. |
| **GAD v10** | v4 | **0.02** (rubric aggregate) | — | **API-interrupted** (HTTP 529 overloaded_error, gad-64). Title screen rendered with a novel visual treatment (ui_polish 0.10) but planning phase crashed before scene implementation. Excluded from cross-round quality comparisons per gad-63 + gad-64. |
| GAD v9 | v4 | 0.05 (legacy score) | — | Rate-limited during round 4 attempt #1 (parallel). Start screen only. Excluded from cross-round quality. |

Key findings — freedom hypothesis holds under v4:

  • Under pressure-oriented v4 requirements, Bare and Emergent both shipped complete playable games; GAD was API-interrupted before implementation. The freedom hypothesis (gad-36) still holds, now with v4 as the stricter test.
  • First observed full skill ratcheting cycle. Emergent v4 inherited from emergent v3, authored 2 new project-tailored skills (dom-over-kaplay, pressure-forge-coupling), documented the disposition of each inherited skill in CHANGELOG.md, and deprecated kaplay-scene-pattern as unusable under DOM architecture. This is the first round where the compound-skills hypothesis (gad-65) has evidence to evaluate.
  • Convergent design evolution. All three conditions independently chose DOM + iconify-icon + @iconify-json/game-icons + per-floor forced-craft encounters, suggesting v4's pressure requirements are narrow enough to collapse the solution space regardless of framework.
  • Rubric replaces single-score human review (phase 27 track 1, gad-61). Emergent gets a 6th dimension `skill_inheritance_effectiveness` as the CSH test signal.

User playtest captured 12 v5 requirements (`evals/_v5-requirements-addendum.md`): training-via-encounter, rune discovery loop, merchants, NPC dialogue, inventory/equipment + skill tree, spell/skill loadout slots, end-boss reachability, save checkpoints, notification lifecycle, rest rooms actually rest, 2D map navigation.

Documented in:

  • `evals/FINDINGS-2026-04-09-round-4-complete.md`
  • `evals/FINDINGS-2026-04-09-round-4-partial.md`
  • `evals/_v5-requirements-addendum.md`

Decisions landed this round: gad-61 (programmatic eval priority), gad-62 (serial default), gad-63 (rate-limited preserve-but-exclude), gad-64 (api-interrupted as distinct failure category), gad-65 (compound-skills hypothesis), gad-66 (authored-content injection experiment queued), gad-67 (serial as permanent default).

Led to:

  • v5 requirements addendum (12 new/changed requirements from playtest)
  • Phase 27 rubric shipping (per-dimension scoring, RubricRadar SVG, /rubric page)
  • gad-66 content-pack extraction experiment
  • HTTP 529 investigation queued before GAD v11 retry (task 21-23b)
  • Serial-only execution as permanent default (gad-67)

---

Graphs

All hypotheses, plotted.

Interactive charts covering every scored run across all three workflow conditions (GAD, bare, emergent). Hover for details. The hypothesis tracks chart above shows the cross-round trajectory; these show the per-run data points.

Composite vs human review

Points above the diagonal: human rates higher than automated composite. The freedom hypothesis shows bare consistently above the line while GAD clusters below.

All scored runs — human review

Every run with a human review score, ranked highest to lowest. Color = workflow condition.

GAD · Bare · Emergent

Data provenance: scatter reads scores.composite and humanReview.score from TRACE.json per run. Bar chart reads humanReviewNormalized.aggregate_score. Rate-limited and API-interrupted runs excluded per gad-63 + gad-64. See /data for the full provenance index.

Lineage

Built on the GSD principles, built to be measured.

GAD is downstream of Get Shit Done — small loops, visible state, executable specs. The talk is the creator's perspective on why tight planning loops beat ad-hoc prompting alone. We took those principles, wrote the CLI to make them cheap, and then bolted on an eval harness so drift and regressions show up in benchmarks instead of vibes.

The video starts at the segment where the structured-planning argument lands. The full talk is worth watching if you've ever wondered why your agent is confidently producing the wrong code.

Originated by gsd-build
Adapted for measurement
Eval-first since v1.0

The loop

Five steps. Every session. No variation.

snapshot → pick one task → implement → update planning docs → commit. The CLI gives the agent a single command to re-hydrate context; skills tell the agent what methodology to apply; subagents do the expensive work off the main thread. That's the whole framework.

The canonical GAD loop: snapshot for context, pick a task, implement, update planning docs, commit. Subagents handle research, planning, and verification.

What it looks like in the terminal

Abbreviated example of a real session — planning phase 19 with the CLI and subagents driving the work while the user stays out of the micromanagement.

example session
  • user

    let's plan the next phase — we need to ship the eval preservation contract

  • agent

    Running `gad snapshot --projectid get-anything-done` to hydrate context before planning.

  • tool

    Snapshot (sprint 4): phases 18→22 · current: 18 · next-action: plan phase 19 preservation contract · 38 decisions loaded

  • agent

    I'll use the `/gad:plan-phase 19` skill. Spawning gad-phase-researcher first to survey prior eval runs that lost their artifacts.

  • tool

    gad-phase-researcher → 3 runs lost artifacts in worktree cleanup · root cause: no preserve step · proposed contract in RESEARCH.md

  • agent

    Good. Spawning gad-planner with the research attached. Goal-backward verification on: evals preserved at canonical paths, tests enforce, verify command audits.

  • tool

    gad-planner → PLAN.md written · 6 tasks · goal-backward verification passes · committed as 19-01 through 19-06

  • user

    yes go

  • agent

    Switching to `/gad:execute-phase 19`. Task 19-01: implement `gad eval preserve` command. Wave 1.

  • tool

    git commit 19-01: feat(eval): add preserve command · TASK-REGISTRY.xml updated · STATE.xml next-action → 19-02

The eval framework

Three workflows. One scoring formula. No hiding behind process metrics.

Each eval project ships a REQUIREMENTS.xml with versioned gate criteria. Every run produces a TRACE.json with a composite score. Process metrics matter, but they cannot rescue a run that ships a broken game — human review weighs 30% precisely so "the process was followed" isn't a free pass.
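The TRACE.json field paths cited on this page (`scores.composite`, `humanReview.score`, `humanReviewNormalized.aggregate_score`) imply a shape like the sketch below. This is an assumption reconstructed from those references, not the full schema — real trace files carry much more (trace schema v4 events, the framework version stamp, and so on).

```javascript
// Minimal TRACE.json shape implied by the field paths cited on this page.
// Only these three paths are grounded in the text; everything else is omitted.
const exampleTrace = {
  scores: { composite: 0.805 },                      // automated composite score
  humanReview: { score: 0.8 },                       // single-score human review (legacy)
  humanReviewNormalized: { aggregate_score: 0.805 }, // rubric aggregate (round 4+)
};

// Read defensively: interrupted or rate-limited runs may lack fields.
const composite = exampleTrace.scores?.composite ?? null;
```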

  • GAD: full GAD framework (.planning/ XML, AGENTS.md loop, skill triggers, plan/execute/verify cycle).
  • Bare: no framework. The agent builds the game however it wants; workflow artifacts are only mandated to live under game/.planning/.
  • Emergent: no framework, but inherits skills from previous bare/emergent runs, evolves them in place, and writes a CHANGELOG.

Composite score weights

Defined in evals/<project>/gad.json. Same formula across every implementation eval, so you can compare a GAD run to a Bare run apples-to-apples.

| Dimension | Weight | What it measures |
| --- | --- | --- |
| Human review | 0.30 | Subjective quality vote — gates everything. |
| Requirement coverage | 0.15 | How many gate criteria the artifact passes. |
| Planning quality | 0.15 | Phases, tasks, and decisions actually captured. |
| Per-task discipline | 0.15 | Atomic commits with task IDs vs batch dumps. |
| Skill accuracy | 0.10 | Did the agent invoke skills when their triggers fired? |
| Time efficiency | 0.05 | Wall-clock vs the project's expected envelope. |
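A minimal sketch of the formula the weights imply: a weighted sum of per-dimension scores in [0, 1]. The dimension keys here are invented for illustration; the actual key names and layout in `evals/<project>/gad.json` may differ.

```javascript
// Weighted-sum composite mirroring the weights table above.
// Dimension key names are illustrative, not the real gad.json keys.
const WEIGHTS = {
  human_review: 0.3,
  requirement_coverage: 0.15,
  planning_quality: 0.15,
  per_task_discipline: 0.15,
  skill_accuracy: 0.1,
  time_efficiency: 0.05,
}; // as printed, these sum to 0.90, not 1.00

function compositeScore(dimensions) {
  let total = 0;
  for (const [name, weight] of Object.entries(WEIGHTS)) {
    total += weight * (dimensions[name] ?? 0); // a missing dimension scores 0
  }
  return total;
}
```

With all six dimensions at 1.0 this returns 0.90, the sum of the listed weights; a run that only earns human review credit caps out at 0.30, which is the "human review gates everything" property in action.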

Run it locally

One repo. One CLI. Five commands to your first eval run.

The CLI lives at bin/gad.cjs. The eval projects live under evals/. Everything else is committed planning state. No services, no auth, no telemetry.

terminal
# 1. Clone the repo
git clone https://github.com/MagicbornStudios/get-anything-done
cd get-anything-done

# 2. See available eval projects
node bin/gad.cjs eval list

# 3. Bootstrap an agent prompt for one project
node bin/gad.cjs eval bootstrap escape-the-dungeon-bare

# 4. Run an eval (creates an isolated git worktree)
node bin/gad.cjs eval run escape-the-dungeon-bare

# 5. After the agent finishes, preserve and verify
node bin/gad.cjs eval preserve escape-the-dungeon-bare v4 --from <worktree>
node bin/gad.cjs eval verify