Escape the Dungeon · GAD
escape-the-dungeon/v8
Composite score
0.177
Human review
0.20
Human review note
Better particle effects on main menu and better colors than previous GAD runs. However, crafting system broke the game when used (unusable). Old ASCII text design for map/spells/bags menus. Hard to read text. Added icons but didn't search for sourced sprites. 0 commits — rate limit hit before agent could finalize. Score 0.20: has some visual improvements but broken crafting gates it.
Reviewed by human · 2026-04-08
Dimension scores
Where the composite came from
Each dimension is scored 0.0 – 1.0 and combined using the weights in evals/escape-the-dungeon/gad.json. Human review dominates on purpose — process metrics alone can't rescue a broken run.
| Dimension | Score |
|---|---|
| Human review | 0.200 |
| Requirement coverage | 0.330 |
| Planning quality | 0.000 |
| Per-task discipline | 0.000 |
| Skill accuracy | 0.170 |
| Time efficiency | 0.967 |
Composite formula
How 0.177 was calculated
The composite score is a weighted sum of the dimensions above. Weights come from evals/escape-the-dungeon/gad.json. Each contribution is score × weight, and the rows are sorted by contribution so you can see what actually moved the needle. The percentage next to each contribution is its share of the weighted sum.
| Dimension | Weight | Score | Contribution |
|---|---|---|---|
| human_review | 0.30 | 0.200 | 0.0600 (34%) |
| requirement_coverage | 0.15 | 0.330 | 0.0495 (28%) |
| time_efficiency | 0.05 | 0.967 | 0.0484 (28%) |
| skill_accuracy | 0.10 | 0.170 | 0.0170 (10%) |
| planning_quality | 0.15 | 0.000 | 0.0000 (0%) |
| per_task_discipline | 0.15 | 0.000 | 0.0000 (0%) |
| Weighted sum | 0.90 | | 0.1749 |
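The arithmetic in the table above can be reproduced with a few lines of Python. This is a minimal sketch, not the evaluator's actual code: the weights and scores are copied from the table, while the dict layout and the `contributions` helper are assumptions made for illustration (the real structure of gad.json is not shown on this page).

```python
# Weights and scores copied from the table above.
# The dict layout is an assumption; gad.json may be structured differently.
WEIGHTS = {
    "human_review": 0.30,
    "requirement_coverage": 0.15,
    "planning_quality": 0.15,
    "per_task_discipline": 0.15,
    "skill_accuracy": 0.10,
    "time_efficiency": 0.05,
}
SCORES = {
    "human_review": 0.200,
    "requirement_coverage": 0.330,
    "planning_quality": 0.000,
    "per_task_discipline": 0.000,
    "skill_accuracy": 0.170,
    "time_efficiency": 0.967,
}


def contributions(scores, weights):
    """Return (name, weight, score, contribution) rows, largest contribution first."""
    rows = [(name, w, scores[name], scores[name] * w) for name, w in weights.items()]
    return sorted(rows, key=lambda row: row[3], reverse=True)


rows = contributions(SCORES, WEIGHTS)
total = sum(row[3] for row in rows)

for name, weight, score, contribution in rows:
    # Share of the weighted sum, guarded against an all-zero run.
    share = contribution / total if total else 0.0
    print(f"{name:22s} {weight:.2f} {score:.3f} {contribution:.4f} ({share:.0%})")
print(f"weighted sum over weights totalling {sum(WEIGHTS.values()):.2f}: {total:.5f}")
```

The sum comes to roughly 0.175, which matches the table's weighted sum row. The page does not say how that figure relates to the reported composite of 0.177, so the sketch stops at the weighted sum.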
Skill accuracy breakdown
Did the agent invoke the right skills at the right moments?
Tracing gap
This run stored only the aggregate skill_accuracy: 0.17 — there is no per-skill trigger breakdown in its TRACE.json. We can't tell you which of the expected skills fired vs missed. This is exactly the failure mode gad-50 calls out: the trace schema is too lossy to explain scores like this after the fact.
Phase 25 of the GAD framework work ships trace schema v4 — every tool use, skill invocation with its trigger context, and subagent spawn with inputs + outputs. Older runs like this one will keep their aggregate score but new runs will land with the full breakdown.
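To make the difference between an aggregate and a per-skill breakdown concrete, here is a hypothetical sketch in Python. Nothing in it comes from the real trace schema: the `SkillInvocation` record, its field names, the skill names, and the "fraction of expected skills that fired" definition of accuracy are all assumptions chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillInvocation:
    """Hypothetical trace record; not the actual v4 schema."""
    skill: str            # name of the skill that was invoked
    trigger_context: str  # what prompted the invocation


def skill_breakdown(expected, invocations):
    """Map each expected skill to whether it fired at least once."""
    fired = {inv.skill for inv in invocations}
    return {skill: skill in fired for skill in expected}


def skill_accuracy(breakdown):
    """Assumed definition: the fraction of expected skills that fired."""
    if not breakdown:
        return 0.0
    return sum(breakdown.values()) / len(breakdown)


# Made-up skill names, purely for illustration.
expected = ["plan-phase", "execute-task", "verify-build", "write-summary"]
trace = [SkillInvocation("execute-task", "task picked up from the plan")]

breakdown = skill_breakdown(expected, trace)
print(breakdown)
print(skill_accuracy(breakdown))
```

With only the final number stored, the `breakdown` dict is what gets lost: the aggregate survives, but which skills fired and which were missed does not.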
How tracing works →
Watch this dissection
A 20-second walkthrough
Remotion React composition, rendered live in your browser. No MP4, no download, no external player. Reuses the same components as the rest of the site so the video stays accurate to the live data by construction. Press play, pause any time, scrub the timeline.
Escape the Dungeon v8 — process metrics lied
A 20-second walkthrough of how v8 scored composite 0.18 while the game itself was broken. Shows the weighted-sum math in motion and the tracing gap that made the score unexplainable after the fact.
Gate report
Requirement coverage
Reviewer notes on gates
G1 (game-loop) partial — renders. G2 (spell-crafting) FAILED — crafting system broken, breaks the game when used. G3 (ui-quality) partial — has particle effects and colors, but text is hard to read, ASCII for map/spells/bags, no sourced sprites.
Process metrics