GAD

requirements v3

2026-04-08

Gate failed

Escape the Dungeon · GAD

escape-the-dungeon/v8

Composite score

0.177

Human review

0.20

Human review note

Better particle effects on main menu and better colors than previous GAD runs. However, crafting system broke the game when used (unusable). Old ASCII text design for map/spells/bags menus. Hard to read text. Added icons but didn't search for sourced sprites. 0 commits — rate limit hit before agent could finalize. Score 0.20: has some visual improvements but broken crafting gates it.

Reviewed by human · 2026-04-08


Dimension scores

Where the composite came from

Each dimension is scored 0.0 – 1.0 and combined using the weights in evals/escape-the-dungeon/gad.json. Human review dominates on purpose — process metrics alone can't rescue a broken run.

Dimension	Score
Human review	0.200
Requirement coverage	0.330
Planning quality	0.000
Per-task discipline	0.000
Skill accuracy	0.170
Time efficiency	0.967

Composite formula

How 0.177 was calculated

The composite score is a weighted sum of the dimensions above. Weights come from evals/escape-the-dungeon/gad.json. Contribution = score × weight. Dimensions are sorted by contribution, largest first, so you can see which ones actually moved the score; the percentage next to each contribution is its share of the weighted sum.

Dimension	Weight	Score	Contribution
human_review	0.30	0.200	0.0600 (34%)
requirement_coverage	0.15	0.330	0.0495 (28%)
time_efficiency	0.05	0.967	0.0484 (28%)
skill_accuracy	0.10	0.170	0.0170 (10%)
planning_quality	0.15	0.000	0.0000 (0%)
per_task_discipline	0.15	0.000	0.0000 (0%)
Weighted sum	0.90		0.1749
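The arithmetic in the table above can be reproduced with a short Python sketch. The weights and scores are copied from the table; the dictionary layout and the function names are illustrative assumptions, not the actual structure of gad.json or the scoring code. The sketch yields the table's weighted sum of roughly 0.1749, not the headline 0.177; this page does not say how the difference arises.

```python
# Weights and scores copied from the table above. The dict layout is an
# assumption for illustration, not the real schema of gad.json.
WEIGHTS = {
    "human_review": 0.30,
    "requirement_coverage": 0.15,
    "time_efficiency": 0.05,
    "skill_accuracy": 0.10,
    "planning_quality": 0.15,
    "per_task_discipline": 0.15,
}
SCORES = {
    "human_review": 0.200,
    "requirement_coverage": 0.330,
    "time_efficiency": 0.967,
    "skill_accuracy": 0.170,
    "planning_quality": 0.000,
    "per_task_discipline": 0.000,
}


def contributions(scores, weights):
    """Return (dimension, score * weight) pairs, largest contribution first."""
    pairs = [(name, scores[name] * weights[name]) for name in weights]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def weighted_sum(scores, weights):
    """Sum of all per-dimension contributions."""
    return sum(value for _, value in contributions(scores, weights))


rows = contributions(SCORES, WEIGHTS)
total = weighted_sum(SCORES, WEIGHTS)

for name, value in rows:
    # Share of the weighted sum, matching the percentages in the table.
    share = value / total if total else 0.0
    print(f"{name:22s} {value:.4f} ({share:.0%})")
print(f"weighted sum over weights totalling {sum(WEIGHTS.values()):.2f}: {total:.4f}")
```

Planning quality and per-task discipline together carry 0.30 of the weight, the same as human review, and both scored 0.000 here, so they contribute nothing to the sum.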

Skill accuracy breakdown

Did the agent invoke the right skills at the right moments?

Tracing gap

This run stored only the aggregate skill_accuracy: 0.17 — there is no per-skill trigger breakdown in its TRACE.json. We can't tell you which of the expected skills fired vs missed. This is exactly the failure mode gad-50 calls out: the trace schema is too lossy to explain scores like this after the fact.

Phase 25 of the GAD framework work ships trace schema v4, which records every tool use, every skill invocation with its trigger context, and every subagent spawn with its inputs and outputs. Older runs like this one will keep their aggregate score, but new runs will land with the full breakdown.
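To make the gap concrete, here is a minimal Python sketch of what a per-skill trigger record could look like and how a breakdown would fall out of it. Everything in it is hypothetical: the SkillTrigger fields, the skill names, the sample data, and the definition of accuracy as "fired / expected" are assumptions for illustration, not the trace schema v4 format or the v8 run's data.

```python
from dataclasses import dataclass


@dataclass
class SkillTrigger:
    """Hypothetical per-trigger record; not the real trace schema."""
    skill: str            # hypothetical skill name
    trigger_context: str  # what prompted (or should have prompted) the skill
    expected: bool        # the eval expected the skill at this moment
    fired: bool           # the agent actually invoked it


def skill_breakdown(triggers):
    """Map each skill to (hits, expected_count) over expected triggers."""
    breakdown = {}
    for t in triggers:
        if not t.expected:
            continue
        hits, count = breakdown.get(t.skill, (0, 0))
        breakdown[t.skill] = (hits + int(t.fired), count + 1)
    return breakdown


def aggregate_accuracy(triggers):
    """Assumed definition: fraction of expected triggers that fired."""
    expected = [t for t in triggers if t.expected]
    if not expected:
        return 0.0
    return sum(t.fired for t in expected) / len(expected)


# Invented sample data, not taken from the v8 run.
triggers = [
    SkillTrigger("plan-phase", "run start", expected=True, fired=True),
    SkillTrigger("plan-phase", "new phase", expected=True, fired=False),
    SkillTrigger("commit-task", "task finished", expected=True, fired=False),
]

breakdown = skill_breakdown(triggers)
for skill, (hits, count) in breakdown.items():
    print(f"{skill}: {hits}/{count} expected triggers fired")
print(f"aggregate: {aggregate_accuracy(triggers):.2f}")
```

With records like these, an aggregate can be traced back to the specific skills that missed. With only the aggregate stored, as in this run, that step is impossible.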


Watch this dissection

A 20-second walkthrough

Remotion React composition, rendered live in your browser. No MP4, no download, no external player. Reuses the same components as the rest of the site so the video stays accurate to the live data by construction. Press play, pause any time, scrub the timeline.

video · v8-dissection20s · 30fps

Case study

Escape the Dungeon v8

How process metrics rated a broken game at 0.18


Escape the Dungeon v8 — process metrics lied

A 20-second walkthrough of how v8 scored composite 0.18 while the game itself was broken. Shows the weighted-sum math in motion and the tracing gap that made the score unexplainable after the fact.

failure-case

tracing-gap

gad-29

escape-the-dungeon


Gate report

Requirement coverage

Total criteria

Fully met

Partially met

Not met

Reviewer notes on gates

G1 (game-loop): partial. Renders.

G2 (spell-crafting): FAILED. The crafting system is broken and breaks the game when used.

G3 (ui-quality): partial. Has particle effects and colors, but text is hard to read, the map/spells/bags menus are ASCII, and there are no sourced sprites.
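The page does not state how gate outcomes turn into the requirement coverage score. As a sketch only, one plausible rule is full credit for a met gate, half credit for a partial one, and none for a failed one, averaged over all gates. The credit values and the coverage function below are assumptions, not the documented scoring rule; under them the three gates above give about 0.333, which is close to the reported 0.330.

```python
# Assumed credit scheme; the source does not confirm these values.
CREDIT = {"met": 1.0, "partial": 0.5, "failed": 0.0}

# Gate outcomes as stated in the reviewer notes above.
gates = {
    "G1 game-loop": "partial",
    "G2 spell-crafting": "failed",
    "G3 ui-quality": "partial",
}


def coverage(gate_status):
    """Average credit across gates under the assumed scheme."""
    if not gate_status:
        return 0.0
    return sum(CREDIT[status] for status in gate_status.values()) / len(gate_status)


print(f"coverage under assumed half-credit rule: {coverage(gates):.3f}")
```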

Process metrics

How the agent actually worked

Primary runtime

—

older runs may not carry runtime attribution yet

Agent lanes

0 root · 0 subagent · source missing

Observed depth

—

0 traced event(s) with agent lineage

Wall clock

16m

0 phases · 0 tasks

Started

—

Run start captured in TRACE timing metadata

Ended

—

Missing end time usually means the run was scaffolded but never finalized

Tool uses

1,291 tokens · rate-limited

Commits

0 with task id · 0 batch

Planning docs

decisions captured · 0 phases planned