Escape the Dungeon · GAD
escape-the-dungeon/v4
Composite score
0.916
Dimension scores
Where the composite came from
Each dimension is scored 0.0 – 1.0 and combined using the weights in evals/escape-the-dungeon/gad.json. Human review dominates on purpose — process metrics alone can't rescue a broken run.
| Dimension | Score | Bar |
|---|---|---|
| Requirement coverage | 1.000 | |
| Planning quality | 1.000 | |
| Per-task discipline | 0.890 | |
| Skill accuracy | 0.800 | |
| Time efficiency | 0.920 |
Composite formula
How 0.916 was calculated
The composite score is a weighted sum of the dimensions above. Weights come from evals/escape-the-dungeon/gad.json. Contribution = score × weight; dimensions sorted by contribution so you can see what actually moved the needle.
| Dimension | Weight | Score | Contribution |
|---|---|---|---|
| requirement_coverage | 0.15 | 1.000 | 0.1500(27%) |
| planning_quality | 0.15 | 1.000 | 0.1500(27%) |
| per_task_discipline | 0.15 | 0.890 | 0.1335(24%) |
| skill_accuracy | 0.10 | 0.800 | 0.0800(14%) |
| time_efficiency | 0.05 | 0.920 | 0.0460(8%) |
| human_review | 0.30 | 0.000 | 0.0000(0%) |
| Weighted sum | 0.90 | 0.5595 |
Note: The weighted sum above (0.5595) doesn't exactly match the stored composite (0.9160). The difference is usually the v3 low-score cap (composite < 0.20 → 0.40, composite < 0.10 → 0.25) or a run with an older scoring pass.
Skill accuracy breakdown
Did the agent invoke the right skills at the right moments?
Tracing gap
This run stored only the aggregate skill_accuracy: 0.80 — there is no per-skill trigger breakdown in its TRACE.json. We can't tell you which of the expected skills fired vs missed. This is exactly the failure mode gad-50 calls out: the trace schema is too lossy to explain scores like this after the fact.
Phase 25 of the GAD framework work ships trace schema v4 — every tool use, skill invocation with its trigger context, and subagent spawn with inputs + outputs. Older runs like this one will keep their aggregate score but new runs will land with the full breakdown.
How tracing works →Gate report
Requirement coverage
Process metrics