GAD · v8 · requirements v3 · 2026-04-08 · Gate failed

Escape the Dungeon · GAD

escape-the-dungeon/v8

Composite score

0.177

Human review

0.20

Human review note

Better particle effects on main menu and better colors than previous GAD runs. However, crafting system broke the game when used (unusable). Old ASCII text design for map/spells/bags menus. Hard to read text. Added icons but didn't search for sourced sprites. 0 commits — rate limit hit before agent could finalize. Score 0.20: has some visual improvements but broken crafting gates it.

Reviewed by human · 2026-04-08

Dimension scores

Where the composite came from

Each dimension is scored 0.0 – 1.0 and combined using the weights in evals/escape-the-dungeon/gad.json. Human review dominates on purpose — process metrics alone can't rescue a broken run.

Dimension               Score
Human review            0.200
Requirement coverage    0.330
Planning quality        0.000
Per-task discipline     0.000
Skill accuracy          0.170
Time efficiency         0.967

Composite formula

How 0.177 was calculated

The composite score is a weighted sum of the dimensions above. Weights come from evals/escape-the-dungeon/gad.json. Contribution = score × weight; dimensions sorted by contribution so you can see what actually moved the needle.

Dimension               Weight   Score   Contribution   Share
human_review            0.30     0.200   0.0600         34%
requirement_coverage    0.15     0.330   0.0495         28%
time_efficiency         0.05     0.967   0.0484         28%
skill_accuracy          0.10     0.170   0.0170         10%
planning_quality        0.15     0.000   0.0000         0%
per_task_discipline     0.15     0.000   0.0000         0%
Weighted sum            0.90             0.1749
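
The weighted sum described in this section can be reproduced with a few lines of Python. This is a minimal sketch, not the scoring code itself: the weights and scores are copied by hand from the table above rather than loaded from evals/escape-the-dungeon/gad.json, and the names WEIGHTS, SCORES, contributions and composite are illustrative assumptions.

```python
# Weights and scores copied by hand from the table above.
# The real weights live in evals/escape-the-dungeon/gad.json, whose layout is not shown here.
WEIGHTS = {
    "human_review": 0.30,
    "requirement_coverage": 0.15,
    "time_efficiency": 0.05,
    "skill_accuracy": 0.10,
    "planning_quality": 0.15,
    "per_task_discipline": 0.15,
}
SCORES = {
    "human_review": 0.200,
    "requirement_coverage": 0.330,
    "time_efficiency": 0.967,
    "skill_accuracy": 0.170,
    "planning_quality": 0.000,
    "per_task_discipline": 0.000,
}


def contributions(scores, weights):
    """Return score x weight for every weighted dimension."""
    return {name: scores[name] * weights[name] for name in weights}


def composite(scores, weights):
    """Return the plain weighted sum of all dimension scores."""
    return sum(contributions(scores, weights).values())


contrib = contributions(SCORES, WEIGHTS)
total = composite(SCORES, WEIGHTS)
weight_total = sum(WEIGHTS.values())

# Largest contribution first, mirroring the table's sort order.
for name, value in sorted(contrib.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22}{value:.4f}  ({value / total:.0%})")
print(f"weighted sum {total:.4f} over total weight {weight_total:.2f}")
```

With these inputs the listed weights add up to 0.90 and the sum comes to roughly 0.1749, which matches the table's last row. The page does not show how the headline composite of 0.177 is derived from that figure.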

Skill accuracy breakdown

Did the agent invoke the right skills at the right moments?

Tracing gap

This run stored only the aggregate skill_accuracy: 0.17 — there is no per-skill trigger breakdown in its TRACE.json. We can't tell you which of the expected skills fired vs missed. This is exactly the failure mode gad-50 calls out: the trace schema is too lossy to explain scores like this after the fact.

Phase 25 of the GAD framework work ships trace schema v4 — every tool use, skill invocation with its trigger context, and subagent spawn with inputs + outputs. Older runs like this one will keep their aggregate score but new runs will land with the full breakdown.
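
To make the gap concrete, here is a rough sketch of what a per-skill trigger record could look like and how it would explain an aggregate score. Everything in it is hypothetical: the SkillTrigger fields, the skill names, and the accuracy rule (share of expected skills that actually fired) are assumptions for illustration, not the trace schema v4 format or the real skill_accuracy formula.

```python
from dataclasses import dataclass


@dataclass
class SkillTrigger:
    """Hypothetical per-skill record; not the actual trace schema v4 layout."""
    skill: str                 # placeholder skill name
    expected: bool             # the eval expected this skill to be used
    fired: bool                # the agent actually invoked it
    trigger_context: str = ""  # what prompted the invocation, if anything


def skill_accuracy(records):
    """Assumed rule: fraction of expected skills that fired, or None if none were expected."""
    expected = [r for r in records if r.expected]
    if not expected:
        return None
    return sum(r.fired for r in expected) / len(expected)


def missed_skills(records):
    """Names of skills that were expected but never invoked."""
    return [r.skill for r in records if r.expected and not r.fired]


# Made-up records, only to show the shape of the breakdown.
records = [
    SkillTrigger("skill-a", expected=True, fired=True, trigger_context="start of task"),
    SkillTrigger("skill-b", expected=True, fired=False),
    SkillTrigger("skill-c", expected=False, fired=True, trigger_context="ad hoc"),
]

accuracy = skill_accuracy(records)
missed = missed_skills(records)
print(accuracy, missed)
```

With records like these, a low aggregate can be traced back to the specific skills that never fired. That per-skill detail is what this run's TRACE.json lacks.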


Watch this dissection

A 20-second walkthrough

Remotion React composition, rendered live in your browser. No MP4, no download, no external player. Reuses the same components as the rest of the site so the video stays accurate to the live data by construction. Press play, pause any time, scrub the timeline.

video · v8-dissection · 20s · 30fps

Escape the Dungeon v8 — process metrics lied

A 20-second walkthrough of how v8 scored composite 0.18 while the game itself was broken. Shows the weighted-sum math in motion and the tracing gap that made the score unexplainable after the fact.

failure-case · tracing-gap · gad-29 · escape-the-dungeon

Gate report

Requirement coverage

Total criteria: 12
Fully met: 2
Partially met: 4
Not met: 6
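
One scoring rule that is consistent with these counts and with the requirement coverage of 0.330 reported earlier is to give each partially met criterion half credit. This is only an assumption: the page does not state the actual rule, and requirement_coverage and partial_credit are hypothetical names.

```python
def requirement_coverage(fully_met, partially_met, not_met, partial_credit=0.5):
    """Assumed rule: full credit for met criteria, partial_credit for partial ones."""
    total = fully_met + partially_met + not_met
    if total == 0:
        raise ValueError("no criteria to score")
    return (fully_met + partial_credit * partially_met) / total


# Counts taken from the gate report above.
coverage = requirement_coverage(fully_met=2, partially_met=4, not_met=6)
print(f"{coverage:.2f}")
```

Under this assumption the result is 4 / 12, about 0.33. A different partial-credit value would give a different figure, so treat the match as suggestive rather than confirmed.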

Reviewer notes on gates

G1 (game-loop) partial: renders.
G2 (spell-crafting) FAILED: crafting system broken, breaks the game when used.
G3 (ui-quality) partial: has particle effects and colors, but text is hard to read, ASCII for map/spells/bags, no sourced sprites.

Process metrics

How the agent actually worked

Primary runtime: older runs may not carry runtime attribution yet
Agent lanes: 0 (0 root · 0 subagent · source missing)
Observed depth: 0 traced event(s) with agent lineage
Wall clock: 16m (0 phases · 0 tasks)
Started: Run start captured in TRACE timing metadata
Ended: Missing end time usually means the run was scaffolded but never finalized
Tool uses: 62 (1,291 tokens · rate-limited)
Commits: 0 (0 with task id · 0 batch)
Planning docs: 0 (decisions captured · 0 phases planned)