GAD · v4 · pre-gate requirements
requirements unknown · 2026-04-07

Escape the Dungeon · GAD

escape-the-dungeon/v4

Composite score

0.916

Dimension scores

Where the composite came from

Each dimension is scored 0.0 – 1.0 and combined using the weights in evals/escape-the-dungeon/gad.json. Human review dominates on purpose — process metrics alone can't rescue a broken run.
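The weights file itself isn't reproduced in this report. As a minimal sketch, assuming a `weights` object keyed by dimension name (the field name and file shape are assumptions; only the weight values appear in the report), the composite is just a weighted sum:

```python
import json

# Hypothetical shape of evals/escape-the-dungeon/gad.json -- the real schema
# isn't shown in this report; only the weight values are.
GAD_JSON = """
{
  "weights": {
    "requirement_coverage": 0.15,
    "planning_quality": 0.15,
    "per_task_discipline": 0.15,
    "skill_accuracy": 0.10,
    "time_efficiency": 0.05,
    "human_review": 0.30
  }
}
"""

def composite(scores: dict, config: dict) -> float:
    """Weighted sum: contribution = score * weight for each dimension."""
    return sum(scores.get(dim, 0.0) * w for dim, w in config["weights"].items())

config = json.loads(GAD_JSON)
scores = {
    "requirement_coverage": 1.000,
    "planning_quality": 1.000,
    "per_task_discipline": 0.890,
    "skill_accuracy": 0.800,
    "time_efficiency": 0.920,
    "human_review": 0.000,
}
print(round(composite(scores, config), 4))  # prints 0.5595, this run's weighted sum
```

Note that a missing dimension contributes 0.0 rather than raising, which matches how the zero-score `human_review` dimension behaves in this run.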

Dimension              Score
Requirement coverage   1.000
Planning quality       1.000
Per-task discipline    0.890
Skill accuracy         0.800
Time efficiency        0.920

Composite formula

How 0.916 was calculated

The composite score is a weighted sum of the dimensions above, with weights taken from evals/escape-the-dungeon/gad.json. Contribution = score × weight; dimensions are sorted by contribution so you can see what actually moved the needle.

Dimension             Weight  Score   Contribution
requirement_coverage  0.15    1.000   0.1500 (27%)
planning_quality      0.15    1.000   0.1500 (27%)
per_task_discipline   0.15    0.890   0.1335 (24%)
skill_accuracy        0.10    0.800   0.0800 (14%)
time_efficiency       0.05    0.920   0.0460 (8%)
human_review          0.30    0.000   0.0000 (0%)
Weighted sum          0.90            0.5595

Note: The weighted sum above (0.5595) doesn't exactly match the stored composite (0.9160). The difference is usually the v3 low-score cap (composite < 0.20 → 0.40, composite < 0.10 → 0.25) or a run with an older scoring pass.
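The v3 low-score cap described in the note can be sketched directly from its two thresholds (the function name is hypothetical; the thresholds and replacement values come from the note above, and I'm assuming the lower threshold is checked first):

```python
def apply_low_score_cap(composite: float) -> float:
    """v3 low-score cap: very low composites are raised to a floor value.
    Thresholds per the note: < 0.10 -> 0.25, < 0.20 -> 0.40, else unchanged."""
    if composite < 0.10:   # assumed: lower threshold wins when both match
        return 0.25
    if composite < 0.20:
        return 0.40
    return composite
```

For this run the cap is a no-op, since 0.5595 is well above 0.20; that leaves the note's other explanation, an older scoring pass, as the likelier cause of the 0.5595 vs 0.9160 gap.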

Skill accuracy breakdown

Did the agent invoke the right skills at the right moments?

Tracing gap

This run stored only the aggregate skill_accuracy: 0.80; there is no per-skill trigger breakdown in its TRACE.json, so we can't tell you which of the expected skills fired and which were missed. This is exactly the failure mode gad-50 calls out: the trace schema is too lossy to explain scores like this after the fact.

Phase 25 of the GAD framework work ships trace schema v4, which records every tool use, every skill invocation with its trigger context, and every subagent spawn with its inputs and outputs. Older runs like this one keep their aggregate score, but new runs will land with the full breakdown.

How tracing works →
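To make the gap concrete, here is a sketch of what a v4-style trace event and a per-skill accuracy recomputation could look like. All names (`TraceEvent`, `skill_accuracy`, the `kind` values) are assumptions; the report only says the schema captures tool uses, skill invocations with trigger context, and subagent spawns with inputs and outputs:

```python
from dataclasses import dataclass, field

@dataclass
class TraceEvent:
    """Hypothetical trace schema v4 event record."""
    kind: str                  # e.g. "tool_use" | "skill_invocation" | "subagent_spawn"
    name: str                  # tool, skill, or subagent identifier
    trigger_context: str = ""  # for skills: what prompted the invocation
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)

def skill_accuracy(events: list, expected_skills: list) -> float:
    """With per-skill triggers recorded, an aggregate like 0.80 becomes
    explainable: fired expected skills / expected skills."""
    fired = {e.name for e in events if e.kind == "skill_invocation"}
    return len(fired & set(expected_skills)) / len(expected_skills)

# Example: 4 of 5 expected skills fired would yield exactly this run's 0.80.
events = [TraceEvent("skill_invocation", s, trigger_context="...")
          for s in ("map", "lockpick", "combat", "inventory")]
print(skill_accuracy(events, ["map", "lockpick", "combat", "inventory", "escape"]))
```

With events shaped like this, the breakdown above stops being a black box: the score decomposes into a visible list of fired and missed skills.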

Gate report

Requirement coverage

Total criteria: 12
Fully met: 12
Partially met: 0
Not met: 0

Process metrics

How the agent actually worked

Primary runtime: not recorded (older runs may not carry runtime attribution yet)
Agent lanes: 0 (0 root · 0 subagent · source missing)
Observed depth: 0 traced event(s) with agent lineage
Wall clock: 38m (10 phases · 19 tasks)
Started: Apr 7, 3:42 AM (run start captured in TRACE timing metadata)
Ended: Apr 7, 4:20 AM
Tool uses: 189 (136,930 tokens)
Commits: 19 (19 with task id · 2 batch)
Planning docs: 0 (decisions captured · 10 phases planned)
Client debug (NEXT_PUBLIC_CLIENT_DEBUG=1): 0 lines
No events yet. Window errors, unhandled rejections, and React render errors appear here. Set NEXT_PUBLIC_CLIENT_DEBUG_CONSOLE=1 to mirror console.error / console.warn.