GAD · v4 · pre-gate requirements
requirements unknown · 2026-04-07

Escape the Dungeon · GAD

escape-the-dungeon/v4

Composite score

0.916

Dimension scores

Where the composite came from

Each dimension is scored 0.0 – 1.0 and combined using the weights in evals/escape-the-dungeon/gad.json. Human review dominates on purpose — process metrics alone can't rescue a broken run.
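The weights file itself isn't reproduced in this report. As a minimal sketch, assuming a `weights` object keyed by dimension name (the field name and file shape are assumptions; only the weight values appear in the report), the composite is just a weighted sum:

```python
import json

# Hypothetical shape of evals/escape-the-dungeon/gad.json -- the real schema
# isn't shown in this report; only the weight values are.
GAD_JSON = """
{
  "weights": {
    "requirement_coverage": 0.15,
    "planning_quality": 0.15,
    "per_task_discipline": 0.15,
    "skill_accuracy": 0.10,
    "time_efficiency": 0.05,
    "human_review": 0.30
  }
}
"""

def composite(scores: dict, config: dict) -> float:
    """Weighted sum: contribution = score * weight for each dimension."""
    return sum(scores.get(dim, 0.0) * w for dim, w in config["weights"].items())

config = json.loads(GAD_JSON)
scores = {
    "requirement_coverage": 1.000,
    "planning_quality": 1.000,
    "per_task_discipline": 0.890,
    "skill_accuracy": 0.800,
    "time_efficiency": 0.920,
    "human_review": 0.000,
}
print(round(composite(scores, config), 4))  # prints 0.5595, this run's weighted sum
```

Note that a missing dimension contributes 0.0 rather than raising, which matches how the zero-score `human_review` dimension behaves in this run.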

Dimension              Score
Requirement coverage   1.000
Planning quality       1.000
Per-task discipline    0.890
Skill accuracy         0.800
Time efficiency        0.920

Composite formula

How 0.916 was calculated

The composite score is a weighted sum of the dimensions above, with weights taken from evals/escape-the-dungeon/gad.json. Contribution = score × weight; dimensions are sorted by contribution so you can see what actually moved the needle.

Dimension             Weight  Score   Contribution
requirement_coverage  0.15    1.000   0.1500 (27%)
planning_quality      0.15    1.000   0.1500 (27%)
per_task_discipline   0.15    0.890   0.1335 (24%)
skill_accuracy        0.10    0.800   0.0800 (14%)
time_efficiency       0.05    0.920   0.0460 (8%)
human_review          0.30    0.000   0.0000 (0%)
Weighted sum          0.90            0.5595

Note: The weighted sum above (0.5595) doesn't exactly match the stored composite (0.9160). The difference is usually the v3 low-score cap (composite < 0.20 → 0.40, composite < 0.10 → 0.25) or a run with an older scoring pass.
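The v3 low-score cap described in the note can be sketched directly from its two thresholds (the function name is hypothetical; the thresholds and replacement values come from the note above, and I'm assuming the lower threshold is checked first):

```python
def apply_low_score_cap(composite: float) -> float:
    """v3 low-score cap: very low composites are raised to a floor value.
    Thresholds per the note: < 0.10 -> 0.25, < 0.20 -> 0.40, else unchanged."""
    if composite < 0.10:   # assumed: lower threshold wins when both match
        return 0.25
    if composite < 0.20:
        return 0.40
    return composite
```

For this run the cap is a no-op, since 0.5595 is well above 0.20; that leaves the note's other explanation, an older scoring pass, as the likelier cause of the 0.5595 vs 0.9160 gap.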

Skill accuracy breakdown

Did the agent invoke the right skills at the right moments?

Tracing gap

This run stored only the aggregate skill_accuracy: 0.80; there is no per-skill trigger breakdown in its TRACE.json, so we can't tell you which of the expected skills fired and which were missed. This is exactly the failure mode gad-50 calls out: the trace schema is too lossy to explain scores like this after the fact.

Phase 25 of the GAD framework work ships trace schema v4, which records every tool use, every skill invocation with its trigger context, and every subagent spawn with its inputs and outputs. Older runs like this one keep their aggregate score, but new runs will land with the full breakdown.

How tracing works →
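To make the gap concrete, here is a sketch of what a v4-style trace event and a per-skill accuracy recomputation could look like. All names (`TraceEvent`, `skill_accuracy`, the `kind` values) are assumptions; the report only says the schema captures tool uses, skill invocations with trigger context, and subagent spawns with inputs and outputs:

```python
from dataclasses import dataclass, field

@dataclass
class TraceEvent:
    """Hypothetical trace schema v4 event record."""
    kind: str                  # e.g. "tool_use" | "skill_invocation" | "subagent_spawn"
    name: str                  # tool, skill, or subagent identifier
    trigger_context: str = ""  # for skills: what prompted the invocation
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)

def skill_accuracy(events: list, expected_skills: list) -> float:
    """With per-skill triggers recorded, an aggregate like 0.80 becomes
    explainable: fired expected skills / expected skills."""
    fired = {e.name for e in events if e.kind == "skill_invocation"}
    return len(fired & set(expected_skills)) / len(expected_skills)

# Example: 4 of 5 expected skills fired would yield exactly this run's 0.80.
events = [TraceEvent("skill_invocation", s, trigger_context="...")
          for s in ("map", "lockpick", "combat", "inventory")]
print(skill_accuracy(events, ["map", "lockpick", "combat", "inventory", "escape"]))
```

With events shaped like this, the breakdown above stops being a black box: the score decomposes into a visible list of fired and missed skills.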

Gate report

Requirement coverage

Total criteria: 12
Fully met: 12
Partially met: 0
Not met: 0

Process metrics

How the agent actually worked

Primary runtime: not recorded (older runs may not carry runtime attribution yet)
Agent lanes: 0 (0 root · 0 subagent · source missing)
Observed depth: 0 traced event(s) with agent lineage
Wall clock: 38m (10 phases · 19 tasks)
Started: Apr 7, 3:42 AM (run start captured in TRACE timing metadata)
Ended: Apr 7, 4:20 AM
Tool uses: 189 (136,930 tokens)
Commits: 19 (19 with task id · 2 batch)
Planning docs: 0 (decisions captured · 10 phases planned)
Client debug (NEXT_PUBLIC_CLIENT_DEBUG=1): 0 lines
No events yet. Window errors, unhandled rejections, and React render errors appear here. Set NEXT_PUBLIC_CLIENT_DEBUG_CONSOLE=1 to mirror console.error / console.warn.