Evaluate and evolve agents under measurable pressure.
Get Anything Done is a planning + evaluation framework for AI coding agents. We give agents real implementation tasks, measure the pressure the requirements apply, and score the outcome across rounds. The goal isn't to ship a framework for faster software — the goal is to find out what works, why, and under what conditions. Every decision lives in the repo.
- Playable runs: 19
- Runs scored: 18
- Decisions logged: 171
- Requirements: v5
New this round: CSH testing via the Emergent workflow. Round 4's Emergent v4 scored 0.885 after authoring two new skills and deprecating one — the first observed full skill-ratcheting cycle. See the evidence.
Honest disclosure: N=2-5 runs per condition. One human reviewer. One task domain. The "hypotheses" on this site are exploratory observations, not tested claims. We hold every claim to its strongest critique on /skeptic — read it before trusting any number on this site.
Hypothesis tracks
Every hypothesis, one line per round.
Each line is a research track we are testing. Freedom = bare workflow. CSH = emergent workflow. GAD framework = full framework. Planned tracks (content-driven, codex runtime) show as dashed ghost lines so you can see the research plan even where no data exists yet. Click a round to filter the Playable Archive below. Read /skeptic before trusting any individual point — sample sizes are small.
Legend: solid lines show rounds with real data. Dashed lines are planned tracks where no runs have been scored yet — they exist to make the research plan visible. Data provenance: values come from EVAL_RUNS[n].humanReviewNormalized.aggregate_score, grouped by round + workflow at prebuild.
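That grouping can be sketched as follows. This is an illustrative sketch, not the site's actual build code: the `EVAL_RUNS` entry shape (`round`, `workflow`, `humanReviewNormalized`) is assumed from the provenance note above.

```javascript
// Group runs by "round + workflow" and average each group's
// humanReviewNormalized.aggregate_score. Runs without a score are skipped.
function groupAggregateScores(runs) {
  const groups = new Map();
  for (const run of runs) {
    const score = run.humanReviewNormalized?.aggregate_score;
    if (score == null) continue; // unscored runs don't plot
    const key = `${run.round}:${run.workflow}`;
    const entry = groups.get(key) ?? { sum: 0, count: 0 };
    entry.sum += score;
    entry.count += 1;
    groups.set(key, entry);
  }
  // Collapse each group to its mean score.
  return new Map(
    [...groups].map(([key, { sum, count }]) => [key, sum / count])
  );
}

const byTrack = groupAggregateScores([
  { round: 4, workflow: "emergent", humanReviewNormalized: { aggregate_score: 0.805 } },
  { round: 4, workflow: "bare" }, // not yet scored → excluded
]);
console.log(byTrack.get("4:emergent")); // 0.805
```

Unscored runs dropping out of the map is what makes planned tracks render as ghost lines: no group, no data point.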
Playable preview
Try the latest builds. Right here.
A quick taste of the most recent scored builds. For the full catalog of 34+ runs across all domains, visit the project market.
escape-the-dungeon-bare
requirements v5 · 2026-04-10
- Composite: 0.000
- Human: 0.00
- Tokens: 111,001
- Build time: 24 min
- Runtime: —
- Started: —
- Commits: —
Experiment log
Round by round. What we asked. What the agents actually shipped.
The experiment log is append-only. Each entry captures the requirements version, the workflow conditions that ran, the scores, and the key finding that drove the next round's changes.
Showing 4 of 4 rounds
Greenfield, three-condition, requirements v4 (pressure-oriented)
Date: 2026-04-09
Requirements version: v4 (pressure over features, authored-only, 4 gates including forge-with-ingenuity-payoff and pressure-mechanics)
Conditions: GAD v10, Bare v5, Emergent v4 — run serially after round 3's parallel attempt hit the shared account rate limit (gad-62)
Framework versioning: first round under trace schema v4 with hook-captured events (phase 25). Framework version stamped on every TRACE.json.
Results:
| Condition | Version | Human (rubric) | Composite | Notes |
|---|---|---|---|---|
| **Bare v5** | v5 | TBD | TBD | Complete playable game against v4 pressure requirements. DOM + iconify-icon + @iconify-json/game-icons. 2 floors × 8 rooms. |
| **Emergent v4** | v4 | **0.805** (rubric aggregate) | TBD | Complete playable, "incredible" book-like UI, DoT/resistance/stacking mechanics, first observed full skill ratcheting cycle — authored dom-over-kaplay + pressure-forge-coupling + CHANGELOG. 6-dimension rubric including skill_inheritance_effectiveness 0.95. |
| **GAD v10** | v4 | **0.02** (rubric aggregate) | — | **API-interrupted** (HTTP 529 overloaded_error, gad-64). Title screen rendered with a novel visual treatment (ui_polish 0.10) but planning phase crashed before scene implementation. Excluded from cross-round quality comparisons per gad-63 + gad-64. |
| GAD v9 | v4 | 0.05 (legacy score) | — | Rate-limited during round 4 attempt #1 (parallel). Start screen only. Excluded from cross-round quality. |
Key findings — freedom hypothesis holds under v4:
- Under pressure-oriented v4 requirements, Bare + Emergent both shipped complete playable games; GAD was API-interrupted before implementation. The freedom hypothesis (gad-36) still holds, now with v4 as the stricter test.
- First observed full skill ratcheting cycle. Emergent v4 inherited from emergent v3, authored 2 new project-tailored skills (dom-over-kaplay, pressure-forge-coupling), documented the disposition of each inherited skill in CHANGELOG.md, and deprecated kaplay-scene-pattern as unusable under DOM architecture. This is the first round where the compound-skills hypothesis (gad-65) has evidence to evaluate.
- Convergent design evolution. All three conditions independently chose DOM + iconify-icon + @iconify-json/game-icons + per-floor forced-craft encounters, suggesting v4's pressure requirements are narrow enough to collapse the solution space regardless of framework.
- Rubric replaces single-score human review (phase 27 track 1, gad-61). Emergent gets a 6th dimension `skill_inheritance_effectiveness` as the CSH test signal.
User playtest captured 12 v5 requirements (`evals/_v5-requirements-addendum.md`): training-via-encounter, rune discovery loop, merchants, NPC dialogue, inventory/equipment + skill tree, spell/skill loadout slots, end-boss reachability, save checkpoints, notification lifecycle, rest rooms actually rest, 2D map navigation.
Documented in:
- `evals/FINDINGS-2026-04-09-round-4-complete.md`
- `evals/FINDINGS-2026-04-09-round-4-partial.md`
- `evals/_v5-requirements-addendum.md`
Decisions landed this round: gad-61 (programmatic eval priority), gad-62 (serial default), gad-63 (rate-limited preserve-but-exclude), gad-64 (api-interrupted as distinct failure category), gad-65 (compound-skills hypothesis), gad-66 (authored-content injection experiment queued), gad-67 (serial as permanent default).
Led to:
- v5 requirements addendum (12 new/changed requirements from playtest)
- Phase 27 rubric shipping (per-dimension scoring, RubricRadar SVG, /rubric page)
- gad-66 content-pack extraction experiment
- HTTP 529 investigation queued before GAD v11 retry (task 21-23b)
- Serial-only execution as permanent default (gad-67)
---
Graphs
All hypotheses, plotted.
Interactive charts covering every scored run across all three workflow conditions (GAD, bare, emergent). Hover for details. The hypothesis tracks chart above shows the cross-round trajectory; these show the per-run data points.
Composite vs human review
Points above the diagonal: human rates higher than automated composite. The freedom hypothesis shows bare consistently above the line while GAD clusters below.
All scored runs — human review
Every run with a human review score, ranked highest to lowest. Color = workflow condition.
Data provenance: scatter reads scores.composite and humanReview.score from TRACE.json per run. Bar chart reads humanReviewNormalized.aggregate_score. Rate-limited and API-interrupted runs excluded per gad-63 + gad-64. See /data for the full provenance index.
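The exclusion rule per gad-63 and gad-64 can be sketched like this. The status strings and field shapes are assumptions for illustration, not the repo's actual TRACE.json schema:

```javascript
// Per gad-63 (rate-limited) and gad-64 (API-interrupted), interrupted runs
// are preserved in the archive but excluded from cross-round quality charts.
const EXCLUDED_STATUSES = new Set(["rate-limited", "api-interrupted"]);

// Keep only runs that are uninterrupted AND carry both scores the
// scatter plot reads: scores.composite and humanReview.score.
function chartableRuns(runs) {
  return runs.filter(
    (run) =>
      !EXCLUDED_STATUSES.has(run.status) &&
      run.scores?.composite != null &&
      run.humanReview?.score != null
  );
}

const runs = [
  { id: "emergent-v4", status: "complete", scores: { composite: 0.7 }, humanReview: { score: 0.805 } },
  { id: "gad-v10", status: "api-interrupted", scores: { composite: 0.02 }, humanReview: { score: 0.02 } },
];
const kept = chartableRuns(runs);
console.log(kept.map((r) => r.id)); // [ 'emergent-v4' ]
```

Preserve-but-exclude means the filter runs at chart time; the underlying TRACE.json files are never deleted.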
Lineage
Built on the GSD principles, built to be measured.
GAD is downstream of Get Shit Done — small loops, visible state, executable specs. The talk is the creator's perspective on why tight planning loops beat ad-hoc prompting alone. We took those principles, wrote the CLI to make them cheap, and then bolted on an eval harness so drift and regressions show up in benchmarks instead of vibes.
The video starts at the segment where the structured-planning argument lands. The full talk is worth watching if you've ever wondered why your agent is confidently producing the wrong code.
The loop
Five steps. Every session. No variation.
snapshot → pick one task → implement → update planning docs → commit. The CLI gives the agent a single command to re-hydrate context; skills tell the agent what methodology to apply; subagents do the expensive work off the main thread. That's the whole framework.
The canonical GAD loop: snapshot for context, pick a task, implement, update planning docs, commit. Subagents handle research, planning, and verification.
What it looks like in the terminal
Abbreviated example of a real session — planning phase 19 with the CLI and subagents driving the work while the user stays out of the micromanagement.
- user
let's plan the next phase — we need to ship the eval preservation contract
- agent
Running `gad snapshot --projectid get-anything-done` to hydrate context before planning.
- tool
Snapshot (sprint 4): phases 18→22 · current: 18 · next-action: plan phase 19 preservation contract · 38 decisions loaded
- agent
I'll use the `/gad:plan-phase 19` skill. Spawning gad-phase-researcher first to survey prior eval runs that lost their artifacts.
- tool
gad-phase-researcher → 3 runs lost artifacts in worktree cleanup · root cause: no preserve step · proposed contract in RESEARCH.md
- agent
Good. Spawning gad-planner with the research attached. Goal-backward verification on: evals preserved at canonical paths, tests enforce, verify command audits.
- tool
gad-planner → PLAN.md written · 6 tasks · goal-backward verification passes · committed as 19-01 through 19-06
- user
yes go
- agent
Switching to `/gad:execute-phase 19`. Task 19-01: implement `gad eval preserve` command. Wave 1.
- tool
git commit 19-01: feat(eval): add preserve command · TASK-REGISTRY.xml updated · STATE.xml next-action → 19-02
The eval framework
Three workflows. One scoring formula. No hiding behind process metrics.
Each eval project ships a REQUIREMENTS.xml with versioned gate criteria. Every run produces a TRACE.json with a composite score. Process metrics matter, but they cannot rescue a run that ships a broken game — human review carries a 30% weight precisely so "the process was followed" isn't a free pass.
Composite score weights
Defined in evals/<project>/gad.json. Same formula across every implementation eval, so you can compare a GAD run to a Bare run apples-to-apples.
| Dimension | Weight | What it measures |
|---|---|---|
| Human review | 0.30 | Subjective quality vote — gates everything. |
| Requirement coverage | 0.15 | How many gate criteria the artifact passes. |
| Planning quality | 0.15 | Phases, tasks, and decisions actually captured. |
| Per-task discipline | 0.15 | Atomic commits with task IDs vs batch dumps. |
| Skill accuracy | 0.10 | Did the agent invoke skills when their triggers fired? |
| Time efficiency | 0.05 | Wall-clock vs the project's expected envelope. |
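As a rough sketch, the composite is a weighted sum over the dimensions above. The field names below are illustrative, not the actual `gad.json` schema, and each dimension score is assumed to be pre-normalized to the 0–1 range:

```javascript
// Weights copied from the table above; the canonical definition
// lives in evals/<project>/gad.json.
const WEIGHTS = {
  human_review: 0.30,
  requirement_coverage: 0.15,
  planning_quality: 0.15,
  per_task_discipline: 0.15,
  skill_accuracy: 0.10,
  time_efficiency: 0.05,
};

// Weighted sum; a missing dimension contributes 0 rather than throwing.
function compositeScore(dimensions) {
  return Object.entries(WEIGHTS).reduce(
    (sum, [key, weight]) => sum + weight * (dimensions[key] ?? 0),
    0
  );
}

// Example: strong human review, weak per-task discipline.
const score = compositeScore({
  human_review: 0.9,
  requirement_coverage: 0.8,
  planning_quality: 0.7,
  per_task_discipline: 0.4,
  skill_accuracy: 1.0,
  time_efficiency: 0.5,
});
console.log(score.toFixed(3)); // prints 0.680
```

Because human review carries the largest single weight, a run that ships a broken game cannot be rescued by perfect process scores alone.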
Run it locally
One repo. One CLI. Five commands to your first eval run.
The CLI lives at bin/gad.cjs. The eval projects live under evals/. Everything else is committed planning state. No services, no auth, no telemetry.
# 1. Clone the repo
git clone https://github.com/MagicbornStudios/get-anything-done
cd get-anything-done

# 2. See available eval projects
node bin/gad.cjs eval list

# 3. Bootstrap an agent prompt for one project
node bin/gad.cjs eval bootstrap escape-the-dungeon-bare

# 4. Run an eval (creates an isolated git worktree)
node bin/gad.cjs eval run escape-the-dungeon-bare

# 5. After the agent finishes, preserve and verify
node bin/gad.cjs eval preserve escape-the-dungeon-bare v4 --from <worktree>
node bin/gad.cjs eval verify