eval project · game · Bare · greenfield

Escape the Dungeon · Bare

escape-the-dungeon-bare

Greenfield baseline: agent builds the game WITHOUT a planning framework, creating its own workflow

Runs

6 recorded runs

v1
reqs unknown
composite 0.198
human 0.10

Main menu renders with New Game button visible. Cannot start game — clicking New Game does not progress. Broken build. Score 0.10 for rendering menu only.

v2
reqs unknown
composite 0.601
human 0.50

Most playable game of all evals. Full game loop works: title → new game → rooms → combat → dialogue → navigation. UX and flow are good. UI is very ASCII/plain — needs spacing, icons, better styling. Color coding is good. No spell crafting despite rune system in requirements. Rest room doesn't offer forge. Score 0.50: playable vertical slice but visually rough.

v3
reqs v3
composite 0.526
human 0.70

Best UI/UX of all eval runs by far. Most enjoyable and playable. Functional game loop with combat and dialogue. Missing: floor progression after boss (can grind same floor), no clear spell crafting path. Regressed on commit discipline under pressure (1 giant commit vs v2's 6). Score 0.70: most enjoyable game across all experiments.

v4
reqs v4
composite 0.000
human review pending

RATE LIMITED before completion. 6 source files written, vite build succeeds manually (54 KB bundle). worklog.md shows 10-step plan covering all 4 gates. Implementation depth: step 1 of 10 complete. DO NOT include in cross-round comparisons against completed runs.

2 skills
1 planning
v5
reqs unknown
composite 0.000
human review pending

Highest ingenuity of any round-4 run (user: 'highest ingenuity out of all runs').

Strengths:
- multi-enemy combat encounters (very creative)
- forge room UI is great (icons, spacing, placement, highlighting)
- training affinity mechanic 'pretty sweet' — user loved it; spell-crafting loop enjoyable for finding combos yourself
- pressure mechanics landed clearly (Fungal Sovereign: resistant to physical / immune to fire, called out subtly on the map — user prefers subtle hints over explicit ones)
- goals feel earned

Weaknesses:
1. Combat lacks targeting — user prefers Unicorn-Overlord-style rule-based simulation with board positioning (chess-like), action policies per entity traits, and initiative-driven turn order (captured as R-v5.13, R-v5.14).
2. Affinity reward loop unclear — no visible reward for boosting a rune a lot; users will want a curiosity payoff (R-v5.16).
3. Navigating exits/rooms is difficult — only a dropdown, no visual map with player location (R-v5.17).
4. Unclear visual player-vs-enemy identity in encounters (the ooze looked ambiguous) — user wants a Pokemon or Unicorn-Overlord style (R-v5.18).
5. Glitchy redraws on button clicks (observed across ALL round-4 builds) — likely per-tick redraw; remove ticks entirely, use event-driven updates, real-time 1hr=1day game time (R-v5.15, R-v5.21).
6. BUG: the rune forge lets you craft a spell using the same rune twice, which boosts that rune's affinity twice — should be forbidden per-spell but allowed across DIFFERENT spells (R-v5.20, bugs.json).

Other user notes: clear-button UX is better for controller, so keep it; user really likes how the in-game rune/spell system mirrors the emergent skill/merge hypothesis; wants spell-mixing-spells (use existing spells as ingredients too, with procedural-but-semantic naming — R-v5.19).
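The rune-forge bug above (R-v5.20) reduces to a one-line validation rule. A minimal sketch, assuming a spell recipe is just a list of rune ids (the type and function names are illustrative, not from the actual build):

```typescript
// R-v5.20: a single spell may not use the same rune twice (which
// double-boosted that rune's affinity), but reusing a rune across
// DIFFERENT spells stays legal.
type RuneId = string;

function isValidSpellRecipe(runes: RuneId[]): boolean {
  // A Set collapses duplicates, so a size mismatch means a repeated rune.
  return new Set(runes).size === runes.length;
}

console.log(isValidSpellRecipe(["fire", "fire"])); // false -> the exploit case
console.log(isValidSpellRecipe(["fire", "ice"]));  // true
// Same rune across different spells is fine:
console.log([["fire", "ice"], ["fire", "poison"]].every(isValidSpellRecipe)); // true
```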

2 skills
1 planning
v6
reqs v5
composite 0.000
human review pending
2 skills
1 planning

Requirements history

5 versions — each change triggers a new round

v5
2026-04-09
Current

core shift from v4

Playtest-driven expansion. v4 was a designer rewrite; v5 comes entirely from user play of Bare v5 (0.805 rubric), Emergent v4 (rescored to 0.885 after user beat floor 1), GAD v9 (rate-limited), and GAD v10 (api-interrupted). Everything in v4 still applies — v5 adds 21 new/amended requirements (R-v5.01..21) on top as a structured `<addendum>` section inside the same template XML.

changes from v4

- **R-v5.01** Training via encounter, not menu — affinity rises from casting, not selecting. Training Dummy encounter room type.
- **R-v5.02** Rune discovery as a gameplay loop — starter subset only, rest found in world, one rune per floor gated behind non-combat.
- **R-v5.03** Merchants with buy/sell/trade — at least one per floor, gold as tracked resource.
- **R-v5.04** NPC dialogue with branching outcomes — 3+ NPCs, 2+ branches each, choices change game state.
- **R-v5.05** Inventory/bag with grid + equippable items — weapon/off-hand/body/trinket slots affecting stats.
- **R-v5.06** Visible character sheet + skill tree — physical/combat skills separate from spells, distinct resource.
- **R-v5.07** Spell and skill loadout slots — forced specialization as a build-pressure mechanic.
- **R-v5.08** Progression sources sufficient to reach end boss (amends G1) — guaranteed mana-max / spell-power upgrade per floor.
- **R-v5.09** Save checkpoints + continue-after-death (amends G1) — Continue must never hard-brick.
- **R-v5.10** Notification lifecycle (amends G3) — auto-dismiss, clear on new game, no persistence across sessions.
- **R-v5.11** Rest rooms must offer rest — forge rooms combining Forge+Train+Rest must expose Rest as an action.
- **R-v5.12** Navigation and map usability (amends G3) — minimum 2D graph layout, one-click navigation.
- **R-v5.13** Combat model must be explicitly chosen — **Model A (rule-based simulation, Unicorn-Overlord-style)** preferred over direct-control.
- **R-v5.14** Action policies driven by entity traits — applies to enemies AND NPCs; dialogue changes with trait shifts.
- **R-v5.15** Real-time game-time model — 1hr real = 1 day game, remove tick system, UI time-shading is soft.
- **R-v5.16** Affinity reward loop — visible payoff for boosting a rune, not just a hidden stat.
- **R-v5.17** Central visual navigation map with player token (stronger form of R-v5.12).
- **R-v5.18** Visual player vs enemy identity — Pokemon / Unicorn Overlord style (UO preferred).
- **R-v5.19** Spells as craftable ingredients — spells + runes both combine, procedural-but-semantic naming. Explicitly mirrors the emergent-evolution hypothesis (gad-68) as an in-game narrative analogue.
- **R-v5.20** Rune uniqueness within a single spell — bug fix for Bare v5's double-affinity exploit.
- **R-v5.21** Event-driven rendering — kill the per-tick redraw glitches observed across ALL round-4 builds.
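R-v5.15's time model needs no tick loop at all: game time becomes a pure function of wall-clock time, which is also what R-v5.21's event-driven rendering wants. A sketch assuming the 1 real hour = 1 game day ratio from the requirement (the constant and function names are invented for illustration):

```typescript
// R-v5.15: 1 real hour = 1 in-game day, no tick system.
// Game time is derived on demand from elapsed wall-clock time,
// so nothing has to redraw on a timer.
const REAL_MS_PER_GAME_DAY = 60 * 60 * 1000; // one real hour

function gameDaysElapsed(runStartMs: number, nowMs: number): number {
  return (nowMs - runStartMs) / REAL_MS_PER_GAME_DAY;
}

// 90 real minutes into a run = 1.5 game days:
console.log(gameDaysElapsed(0, 90 * 60 * 1000)); // 1.5
```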

scoring impact

- v4 gates remain (G1, G2, G3, G4). v5 amendments tighten G1 (death/continue, end-boss reachable), G2 (training is encounter-driven), G3 (notification lifecycle, map usability).
- New scored dimensions: `inventory_and_equipment_present`, `npc_dialogue_present` — not gates, but meaningful score hits if missing.
- Rubric weights unchanged from v4.

deferred to v6

- Deep evolution trees (multi-stage mutations).
- Rune affinity decay when unused.
- Multi-character party play — out of scope for the escape-the-dungeon family.

brownfield vs greenfield

- Greenfield v5 applies to escape-the-dungeon, escape-the-dungeon-bare, escape-the-dungeon-emergent (same three templates, all updated together).
- Brownfield v5 extensions are not yet authored — round 5 starts greenfield-first.

round 5 unblock

This version is the trigger for round 5. Round 5 runs serially (gad-67) against this requirements set. HTTP 529 investigation + GAD v11 retry queued, see task registry.

source

`evals/_v5-requirements-addendum.md` holds the prose version with full user rationale quotes. The XML addendum in each template is the machine-readable form.

decision references

gad-65 (CSH), gad-68 (emergent-evolution), gad-71 (data/ pseudo-database for bug tracking), gad-72 (rounds framework — this is now round 5), gad-73 (fundamental skills triumvirate — the R-v5.19 spells-as-ingredients mechanic is the in-game analogue).

Changed from v4

A requirements version change from v4 to v5 defines a new round boundary.

Findings

Per-round writeups that reference this project. Each finding is stamped with the GAD version it was observed under, so comparisons across versions stay honest.

Round framework
2026-04-13

Curator vs Raw — designing the evolution loop's drafting step

Two controlled experiments to answer one question: **when GAD evolves itself by drafting new skills from high-pressure phases, should a curator pre-digest the phase data into a structured intent, or should we feed the raw phase dump straight to a skill creator?**


The answer flipped between experiments. Both flips are recorded here so the architecture decision lands on evidence, not vibes.

TL;DR — the surprising flip

| Experiment | Tool under test | Curator helps? | Why |
| --- | --- | --- | --- |
| v1 — Anthropic skill-creator with full eval loop | heavy (drafts + runs subagent test loop + benchmarks + viewer) | YES — curator catches load-bearing pieces (trace schema fragment, preserve reminder) the raw arm misses entirely | Heavy harness fights the agent; curated INTENT.md unblocks it |
| v2 — dot-agent create-skill (light authoring guide) | light (formats and structures only, no eval loop) | NO — raw arm pulls 16 decisions vs curator's 7. Curator is a filter, not an amplifier | Light harness lets the agent read the data; curator opinions may filter out important content |
| v2 test-loop — agents using the resulting skill | n/a | NO, plus the skill is wrong about repo conventions | Baseline (no skill, reads repo) wrote a more accurate gad.json shape than with-skill (which followed the skill's prescription) |

The combined finding: when generating skills, give the agent raw access — a curator may filter out the truth. And when a skill prescribes conventions, validate them against the actual repo, or the agent will follow false rules confidently.

Experiment design

| | v1 | v2 — quick-skill | v2 — test loop |
| --- | --- | --- | --- |
| Subjects | 2 (raw, intent) | 2 (raw, intent) | 2 (with-skill, baseline) × 3 prompts |
| Source data | Phase 14 of GAD | Phase 14 of GAD | Stub design docs + the skill from the v2 intent arm |
| Tool | ~/.agents/skills/skill-creator (Anthropic, 485 lines) | ~/.agents/skills/create-skill (dot-agent, 78 lines) | Subagents loaded with skill / no skill |
| Eval loop | Subagent test runs + grading + viewer (simulated due to nested subagent limits) | Skipped — quick-skill has no test loop | Real subagent runs from the main thread |
| Sandbox | oneoff/raw/, oneoff/intent/ | oneoff/v2/raw/, oneoff/v2/intent/ | oneoff/v2/test-runs/ |

Inputs side by side

Both arms across both experiments saw the same source: phase 14 of GAD's own development ("Eval framework — escape-the-dungeon + tracing"). The difference is the framing.

Raw input

A flat dump of `gad tasks --projectid get-anything-done | grep " 14-"` plus `gad decisions ... | grep -i trace`. No structure, no proposed name, no test prompts, no historical context. The agent reads it cold.

14-01  done  Create a gad eval project "escape-the-dungeon" from GAMEPLAY-DESIGN.xml...
14-02  done  Create bin/gad-tools.cjs — the GAD equivalent of gsd-tools.cjs...
14-03  done  Define the eval trace format for real implementations: track which gad...
14-04  done  Define the eval scoring rubric for real implementations: CLI efficiency...
14-05  done  Run the escape-the-dungeon eval: fresh agent session...
14-06  done  Run the portfolio-bare eval with updated tracing...
14-07  done  Review CONTEXT.md: what it is, how discuss-phase produces it...

Curated INTENT input

The same data, plus my curator labor: a proposed name (scaffold-traced-eval-project), a "What this skill should do" paragraph, "When it should trigger" bullet list, "Expected output format" table, three test prompts I drew from real phase tasks, a hand-picked subset of decisions, and an "Errors observed" section with the historical "three attempts at task 14-03 failed" insight.

The full file is at oneoff/v2/intent/INTENT.md — 113 lines of structured curator pre-digestion.

v2 quick-skill — outputs side by side

Both arms produced a SKILL.md with the same target name (scaffold-traced-eval-project). Only the input format varied.

| Metric | RAW arm | INTENT arm |
| --- | --- | --- |
| SKILL.md lines | 167 | 158 |
| References files split | 4 | 3 |
| Decisions cited | 16 | 7 |
| Tasks cited | 5 (14-01 → 14-05) | 2 (14-03, 14-04) |
| Workflow steps | 8 | 10 |

What each arm caught vs missed

| Load-bearing detail | RAW caught | INTENT caught |
| --- | --- | --- |
| TRACE.json schema v4 parent/child IDs (gad-50) | ✗ | ✓ |
| gad-trace-hook.cjs wiring (gad-59) | ✓ | ✗ |
| .trace-active-skill marker (gad-58) | ✓ | ✗ |
| 4 KB output cap (gad-60) | ✓ | ✗ |
| Runtime identity in trace (gad-137) | ✓ | ✗ |
| Per-eval-repo architecture (gad-139) | ✓ | ✗ |
| gad-tools.cjs vendoring floor (task 14-02) | ✓ | ✗ |
| Rate-limited preservation (gad-63) | ✓ | ✗ |
| Mandatory gad eval preserve reminder (gad-38) | ✗ | ✓ |
| Fragment-registration pattern | explicit | partial |
| Historical "3 failed attempts" context | ✗ | ✓ |
| Explicit "common errors" section | ✗ | ✓ |

RAW pulls in MORE technical breadth. It catches 8 decisions INTENT skipped because INTENT only listed the decisions I, the curator, chose to surface. The curator is a filter.

INTENT still wins on the irreplaceable bits: the historical context that lives in commit history, not decision text. But that's a smaller win than I expected.

v2 test loop — does the resulting skill actually help an agent?

After v2 quick-skill produced a SKILL.md, we ran a real test loop with subagents spawned from the main thread (so Task tool was actually available). Three test prompts × with-skill / baseline pairs = 6 leaf subagents.

| Eval | Prompt | with-skill | baseline |
| --- | --- | --- | --- |
| 1 | Scaffold an eval project from space-shooter-design.md | 4/4 ✓ | 3/4 |
| 2 | Scaffold an eval project from data-pipeline-requirements.md | 4/4 ✓ | 3/4 |
| 3 (negative) | "run the existing escape-the-dungeon eval against bare condition" | 4/4 ✓ | 4/4 ✓ |
| Total | | 12/12 (100%) | 10/12 (83%) |

Headline number favors with-skill by +16.7pp. The story behind it does not.

The entire 2-assertion gap comes from one thing: the scaffold-traced-eval-project skill prescribes 3 GAD-native scoring dimensions (CLI efficiency, skill trigger accuracy, planning quality). The with-skill agent followed that prescription verbatim. The baseline agent read vendor/get-anything-done/evals/escape-the-dungeon/gad.json directly and produced a richer scoring shape that matches what the repo actually uses today:

| Field in baseline gad.json | Present in actual repo? | Present in skill's prescription? |
| --- | --- | --- |
| eval_mode | ✓ | ✗ (uses mode) |
| scoring.weights (6 dims) | ✓ | ✗ (uses 3 dims) |
| human_review_rubric.dimensions | ✓ | ✗ |
| compare_to | ✓ | ✗ |
| domain / tech_stack / build_requirement | ✓ | ✗ |
The skill is prescriptively wrong about what GAD evals actually look like. Baseline did the more accurate job by reading the actual repo, but failed an assertion that grades against the skill's view. Baseline produced more files (REQUIREMENTS.md + .planning/ skeleton + AGENTS.md + v1 placeholders) modeled directly on escape-the-dungeon's structure.

The negative test was a wash. Both arms recognized eval-3 as an execution task and refused to scaffold. With-skill cited the skill's "Do NOT trigger for: running an existing eval" clause; baseline reasoned from first principles. The defensive description helped, but baseline's general reasoning didn't need it.

What we changed in the architecture

Before this experiment, my proposal was: evolve curates an INTENT.md per high-pressure phase, hands it to a heavy skill-creator, runs an eval loop, then human review.

After, the loop is much shorter and the curator is gone:

gad:evolution:evolve
  ├─ compute-self-eval finds high-pressure phases (selection pressure)
  ├─ for each phase:
  │     ├─ write skills/proto-skills/<slug>/CANDIDATE.md
  │     │     = raw phase dump (no curator pre-digestion)
  │     ├─ invoke gad-quick-skill on CANDIDATE.md
  │     │     → writes SKILL.md + references/
  │     └─ validator runs (advisory, non-blocking)
  │           → writes VALIDATION.md flagging file refs / CLI / shape mismatches
  └─ register one TASK-REGISTRY review task

(human review)
  reads SKILL.md + VALIDATION.md → promote or discard

gad evolution promote <slug> → moves to sdk/skills/ (joins species DNA)
gad evolution discard <slug> → deletes
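The promote/discard steps at the bottom of the loop really are one-line file moves. A sketch of what each command does, with paths taken from the diagram (the helper names are invented; the real CLI implementation isn't shown here):

```typescript
import { renameSync, rmSync } from "node:fs";

// Pure helper so the path mapping is testable without touching the filesystem.
const promotePaths = (slug: string): [string, string] =>
  [`skills/proto-skills/${slug}`, `sdk/skills/${slug}`];

// gad evolution promote <slug>: proto-skill joins the species DNA.
function promote(slug: string): void {
  const [from, to] = promotePaths(slug);
  renameSync(from, to);
}

// gad evolution discard <slug>: delete the proto-skill.
function discard(slug: string): void {
  rmSync(`skills/proto-skills/${slug}`, { recursive: true, force: true });
}
```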

Three dropped components:

| Dropped | Why |
| --- | --- |
| Curator step (hand-written INTENT.md) | RAW arm pulled MORE decisions than curated; the curator is a filter |
| Heavy skill-creator with eval loop | dot-agent quick-skill produces good skills from raw input alone |
| attempt-evolution / finish-evolution skills | Promote/discard are one-line file moves, not skills |

One added component:

| Added | Why |
| --- | --- |
| Validator (advisory) | The skill may prescribe conventions that don't match the repo. The validator flags the gap so the human reviewer sees it before promoting. |
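A minimal sketch of the kind of check the advisory validator could run, reusing the eval_mode-vs-mode mismatch from the v2 test loop (the function is hypothetical; the repo shape below is condensed from this writeup's example, not read from disk):

```typescript
// Flag skill-prescribed fields that don't exist in a real gad.json,
// e.g. the v2 skill prescribed `eval_mode` where the repo uses `mode`.
function flagShapeMismatches(
  prescribed: string[],
  actualGadJson: Record<string, unknown>,
): string[] {
  const actualKeys = new Set(Object.keys(actualGadJson));
  return prescribed.filter((field) => !actualKeys.has(field));
}

// Condensed stand-in for the repo's actual gad.json top level:
const repoShape = { mode: "greenfield", scoring: {}, human_review_rubric: {} };
console.log(flagShapeMismatches(["eval_mode", "scoring"], repoShape)); // ["eval_mode"]
```

The output goes into VALIDATION.md for the human reviewer; nothing blocks on it.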

Methodology caveats

  • v1 nested subagent fidelity: The v1 experiment subagents tried to spawn their own with-skill / baseline subagents for the eval loop, but spawned subagents don't get the Task tool. Both v1 arms reported simulating the test runs inline. The pass-rate numbers from v1 are NOT real — only the SKILL.md outputs are.
  • v2 quick-skill subagents had the same nested limit but didn't need Task because dot-agent has no eval loop. Their outputs are real.
  • v2 test loop ran from the main thread, where Task is available. The 6 test subagents are real, independent runs.
  • Stub input files (oneoff/v2/test-runs/inputs/*.md) were written by hand for the test prompts — space-shooter-design.md and data-pipeline-requirements.md are realistic but invented. This mirrors what skill-creator's normal harness would do (sandbox stubs).

Files

All inputs, intermediate artifacts, and outputs preserved under:

Path Contents
oneoff/raw/ v1 raw arm (skill-creator + raw phase 14)
oneoff/intent/ v1 intent arm (skill-creator + curated INTENT.md)
oneoff/v2/raw/ v2 raw arm (dot-agent quick-skill + raw phase 14)
oneoff/v2/intent/ v2 intent arm (dot-agent quick-skill + curated INTENT.md)
oneoff/v2/test-runs/inputs/ stub design docs (space-shooter, data-pipeline)
oneoff/v2/test-runs/outputs/with-skill/ 3 leaf subagent runs using the skill
oneoff/v2/test-runs/outputs/without-skill/ 3 leaf subagent runs without the skill

The v1 viewer.html files (oneoff/raw/skill/viewer.html, oneoff/intent/skill/viewer.html) render the simulated test runs with skill-creator's HTML viewer — useful for skim comparison even though the underlying numbers are simulated.

Round 4
2026-04-09

Round 4 — complete v4 results (serial execution)

**Date:** 2026-04-09 **Requirements version:** v4 (pressure-oriented, 4 gates, authored dungeon, ingenuity-required) **Framework version:** v1.32.0 + commit 459dc36 (trace hooks live, framework-stamped TRACE.json) **Execution model:** serial (per decision gad-62, after round 4 attempt #1's rate-limit failure)

Summary

Round 4 ran three greenfield conditions sequentially against v4 pressure-oriented requirements. Two completed cleanly. One (GAD) was interrupted twice by Anthropic API overload errors (HTTP 529), landing with the fullest planning suite captured to date but the lowest shippable-gameplay coverage. Despite the interruption, the GAD v10 result is the strongest freedom-hypothesis signal in the entire dataset: it used MORE tool calls (55) than either completed run (45) and shipped LESS playable game. The framework's planning + data-authoring overhead consumed the budget before scene implementation could begin.

The numbers

| Condition | Tool uses | Wall clock | Tokens | TS lines | Dist | Playable | Skills authored | Gates self-traced |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bare v5 | 45 | 12.5 min | 96030 | ~700 | ✓ 65 KB | ✓ 2 full floors | 0 new | all 4 pass (agent report) |
| Emergent v4 | 45 | 11.5 min | 95509 | ~650 | ✓ 55 KB | ✓ 2 full floors | 2 new + CHANGELOG | all 4 pass (agent report) |
| GAD v10 | 55 | 9 min | 1216 | 875 | ✓ scaffold only | ✗ title screen only | 0 new | 0 of 4 (API interrupted) |

Token count for GAD v10 is 1216 because the API 529 happened before the final message summary — the actual token consumption was likely similar to the other two (~80-100k) but wasn't recorded in the completion notification. Tool uses and wall clock are reliable.

What each condition actually shipped

Bare v5 — 2 floors × 8 rooms, 10 rune combinations, playable

  • Stack decision: DOM + TypeScript + Vite + iconify-icon + @iconify-json/game-icons. Explicitly rejected KAPLAY in worklog.md with rationale ("better for action games; this is a menu-driven roguelike").
  • Content authored: 5 runes (F/I/P/B/S), 10 authored craftable combinations, 2 floors × 8 rooms, 8 enemies including 2 elites and 2 bosses, 2 event rooms with 3-choice consequences.
  • Forced-craft encounters:
    • Floor 1: Stone Warden (physical 0.25 damage taken — requires elemental spells) + Fungal Sovereign (fire immune, ice weak — requires ice-crafted spells)
    • Floor 2: Mirror Djinn (40% reflect — DoT-only counter) + Pyre Lich (fire immune + 30% reflect, requires DoTs crafted from poison runes)
  • Mana economy engineered: agent did the literal math — bumped starter mana 12→18 and boss HP 50→42 after calculating that a Frostfire (3 casts × 17 dmg = 51) at rest-capped (85%) mana could just barely clear the 42-HP boss. That's engineering for the G2 ingenuity-payoff clause, not guessing.
  • UI: dark-fantasy palette, HP/MP gradient bars, per-room-type backgrounds, iconify game-icons throughout, Map/Spellbook/Traits/Bag overlays, styled buttons. No raw ASCII anywhere.
  • Save/load: localStorage-backed. Can resume a run.
  • Bootstrap skills only: create-skill.md + find-sprites.md copied from template, no new skills authored.
  • Worklog: flat worklog.md tracking 10-step plan. No phase boundaries or formal task IDs.
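The mana-economy tuning above is checkable as plain arithmetic. A sketch with the numbers from the worklog; the Frostfire mana cost is an ASSUMPTION (chosen so exactly 3 casts fit in rest-capped mana, which the writeup implies but doesn't state):

```typescript
// From the worklog: Frostfire 17 dmg/cast, boss HP tuned 50 -> 42,
// starter mana bumped 12 -> 18, rest restores at most 85% of max mana.
const FROSTFIRE_DMG = 17;
const BOSS_HP = 42;
const MAX_MANA = 18;
const REST_CAP = 0.85;
const FROSTFIRE_COST = 5; // ASSUMPTION: not stated in the worklog excerpt

const manaAfterRest = Math.floor(MAX_MANA * REST_CAP);    // 15
const casts = Math.floor(manaAfterRest / FROSTFIRE_COST); // 3
const totalDamage = casts * FROSTFIRE_DMG;                // 51

console.log(totalDamage >= BOSS_HP); // true -> 3 casts just barely clear the boss
```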

Emergent v4 — 2 floors × 8 rooms, 7 rune combinations, playable, 2 new skills

  • Stack decision: DOM + TypeScript + Vite + iconify-icon + @iconify-json/game-icons. Arrived at the same DOM conclusion independently — the inherited kaplay-scene-pattern.md skill was actively marked deprecated in this run's CHANGELOG with rationale.
  • Content authored: 6 runes, 7 crafted combinations, 2 floors × 8 rooms (start, combat, forge, event, rest, combat, elite, boss), authored JSON data files under public/data/.
  • Forced-craft encounters:
    • Floor 1: stone_golem and warden_f1 resist direct damage 65% — DoT spells (Ember Hemorrhage, Rotbloom) bypass the resistance.
    • Floor 2: thornwretch and warden_f2 reflect 50% of direct damage — DoT-only spells (Rotbloom, Hexroot) are the intended counter.
  • UI: Cinzel serif font, gold/arcane/blood palette, HP/mana bars with fill + text overlay, damaged-shake animation, bonfire flicker, room-type theming via data-theme + --room-accent, mini-map sidebar with discovered/cleared state. Fog-of-war reveal.
  • Skills evolved (the real signal):
    • dom-over-kaplay.md (NEW) — captures the methodology decision for the next emergent run. Documents why DOM + iconify beats KAPLAY for menu-driven roguelikes, notes the runtime caveat that Iconify fetches SVG from CDN on first paint.
    • pressure-forge-coupling.md (NEW) — captures the v4 ingenuity clause recipe: per-floor enemy resistance/reflect + crafted-spell counter. This is the design pattern both Bare v5 and Emergent v4 independently discovered, now codified as a reusable skill.
    • kaplay-scene-pattern.md — marked deprecated for UI-heavy domains, kept in place for lineage.
    • CHANGELOG.md — documents disposition of each inherited skill + guidance for emergent v5.
  • The inheritance ratcheting mechanism works: next emergent run (v5) will start with 9 inherited skills including the 2 fresh ones. The knowledge compounds across rounds.

GAD v10 — full planning suite, data layer, zero scenes

  • Stack decision: DOM (explicitly documented in DECISIONS.xml — took the signal from bare/emergent) + Vite + TypeScript + iconify-icon. Same stack as the others.
  • Planning suite authored (fullest captured to date):
    • ROADMAP.xml — 7 phases: scaffold, core-state-and-content, title-and-hud, room-navigation, combat, forge-and-runes, pressure-encounters
    • TASK-REGISTRY.xml — ~20 tasks with IDs (01-01 through 07-xx) and status fields
    • STATE.xml — current-phase 02, current-plan "core-state-and-content", next-action "Phase 02: data layer. Start with task 02-01 (src/types.ts)"
    • DECISIONS.xml — scaffolded
    • VERIFICATION.md — phase 01 verified as PASS
  • Content authored (phase 02, 875 lines TS):
    • types.ts — 137 lines of entity/combat/narrative stat shapes
    • state.ts — 72 lines of game state module
    • content/runes.ts — 221 lines of rune data + crafting combinations
    • content/floors.ts — 224 lines of authored 2-floor × 8-room graph
    • content/enemies.ts — 160 lines of enemy definitions
    • content/events.ts — 43 lines of event rooms
    • main.ts — 7 lines (stub: imports router and mounts)
    • scenes/router.ts — 11 lines (stub: renders a static title screen "ESCAPE THE DUNGEON v10 — scaffold booted")
    • styles.css — basic title styling
  • What's missing: scenes/title (real), scenes/room, scenes/combat, scenes/forge, scenes/event, scenes/rest, HUD, save/load, any interactivity beyond the scaffold title
  • Gates self-traced: 0 of 4. None were implemented. The scaffold title doesn't count as G3.
  • API interrupted twice: attempt #1 crashed at tool_uses 18 / 2.3 min (pruned fresh); attempt #2 crashed at tool_uses 55 / 9 min (preserved as v10).

The three-way comparison under v4

Design convergence

All three conditions independently arrived at the same macro design:

  • DOM over KAPLAY
  • iconify-icon + @iconify-json/game-icons for UI
  • 2 floors × 8 rooms with authored encounters
  • Runes + combinations → crafted spells
  • Per-floor resistance/reflect encounters requiring specific crafted counters

This is a strong signal that the v4 REQUIREMENTS.xml is narrow enough to funnel competent agents toward the same solution. The spec does what it was designed to do: it constrains the solution space.

Implementation velocity

| Condition | Tool uses | Scenes implemented | Playable loop |
| --- | --- | --- | --- |
| Bare v5 | 45 | 6+ (title, room, combat, forge, event, rest) | ✓ |
| Emergent v4 | 45 | 6+ (title, room, combat, forge, event, rest + victory) | ✓ |
| GAD v10 | 55 | 1 stub (router with title screen) | ✗ |

GAD used 22% more tool calls and shipped 0% of the playable scenes. The difference went entirely to planning + data authoring. Had GAD not been API-interrupted, it might have caught up — but the same 45-tool-use budget that Bare and Emergent used to ship a game was insufficient for GAD to reach scene implementation at all.

The freedom hypothesis, round 4 verdict

The freedom hypothesis holds under v4 pressure requirements — possibly more strongly than under v3.

Round 3 (v3 requirements): Bare v3 human review 0.70, GAD v8 human review 0.20. Framework vs direct implementation, bare wins on creative output.

Round 4 (v4 requirements): Bare v5 and Emergent v4 both ship complete playable games with all 4 gates self-traced passing. GAD v10 ships zero scenes after 55 tool uses.

The v4 gates were DESIGNED to require ingenuity (the forced-craft encounter pattern), which should have favored a framework-driven deliberate approach. Instead, the direct-implementation conditions shipped the ingenuity and GAD didn't ship anything playable.

Caveat: API 529 interrupted GAD. A completed GAD run might reach all 7 phases. But the tool-use accounting is damning regardless — at minute 9 of a 12-minute wall clock budget, Bare and Emergent were finishing polish while GAD was still writing data files. The overhead is real.

Workflow emergence — the quiet winner

The most interesting finding isn't the GAD-vs-bare comparison; it's Emergent working as designed for the first time:

  • Inherited 7 skills from previous runs
  • Applied them (DOM over KAPLAY inherited from previous emergent runs' failures)
  • Evolved them (deprecated kaplay-scene-pattern.md in place)
  • Authored 2 new skills that codify round 4 learnings:
    • dom-over-kaplay.md — the stack decision with rationale
    • pressure-forge-coupling.md — the v4 encounter design pattern
  • Wrote CHANGELOG.md for the next emergent run to inherit

This is the knowledge ratcheting mechanism working end-to-end in a single session. Every previous emergent run either inherited without evolving or authored without reflecting. v4 is the first run where the full inheritance → apply → evolve → document cycle completed. The next emergent run (v5) will start with 9 inherited skills and visible lineage of what each one taught.

API reliability as an experimental variable

Both GAD attempts hit HTTP 529 overloaded_error. This is Anthropic-side server load, not anything the framework can fix. It is now an experimental variable we have to acknowledge:

  • Bare and Emergent ran 12-13 minutes uninterrupted
  • GAD's first attempt died at 2.3 min, second at 9 min
  • The pattern isn't random — GAD's longer setup phase (snapshot + planning + XML writes) may spend more time in server-dependent states, giving 529s more opportunities to land

Decision candidate (gad-64): eval runs that hit API errors (not rate limits) should be categorizable separately from rate-limited runs. Current timing.rate_limited captures account-cap failures; add timing.api_interrupted + timing.interruption_reason for server-side failures. Both should filter out of cross-round comparisons by default.
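The proposed flag split could look like this in the TRACE.json timing block. Field names come from the decision text; the interface itself is a sketch, not the framework's actual schema:

```typescript
interface TraceTiming {
  rate_limited: boolean;        // existing: account-cap failure
  api_interrupted?: boolean;    // proposed: server-side failure (e.g. HTTP 529)
  interruption_reason?: string; // proposed: e.g. "overloaded_error"
}

// Both failure modes filter out of cross-round comparisons by default:
function includeInAggregates(timing: TraceTiming): boolean {
  return !timing.rate_limited && !timing.api_interrupted;
}

// GAD v10's case: excluded, and the reason says WHY it's excluded.
console.log(includeInAggregates({
  rate_limited: false,
  api_interrupted: true,
  interruption_reason: "overloaded_error",
})); // false
```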

What to do next

  1. Accept v10 as the GAD round 4 data point. Retrying a third time is unlikely to succeed given the 529 pattern, and the partial data is already informative.
  2. Human review Bare v5 and Emergent v4. Both shipped complete games. Score them under the rubric (playability, ui_polish, mechanics_implementation, ingenuity_requirement_met, stability). Rubric phase 27 track 1 exists in planning but hasn't been executed — this is the natural trigger.
  3. Don't human-review GAD v10. The agent's own self-assessment is correct: the game doesn't exist beyond a scaffold title screen. Leave humanReview.score null and let the api_interrupted flag exclude it from aggregates.
  4. Ship round 4 completion on the site. Copy all three dists to site/public/playable/, regen prebuild, and let the Graphs scatter render the two completed runs (v5, v4) against the historical dataset. v10 shows on its per-run page with the api_interrupted badge but doesn't pollute the aggregates.
  5. Queue phase 27 track 1 (rubric) for the next session so Bare v5 and Emergent v4 can be reviewed under the new structured rubric instead of a single-score blob.
  6. Document the v10 story on the site. A paragraph on /findings/2026-04-09-round-4-complete explaining why GAD's tool-use count is higher and implementation depth is lower. Include the 875-lines-of-TS breakdown. This is the concrete, numerical freedom-hypothesis evidence the earlier rounds hinted at.

Cross-round comparison

Freedom hypothesis across rounds (human-reviewed runs only, rate/api failures excluded):

| Round | Req version | GAD best | Bare best | Emergent best | Hypothesis |
| --- | --- | --- | --- | --- | --- |
| Round 1 | v1 | etd v5 = 0.00 (blank screen) | | | not testable |
| Round 2 | v2 | etd v7 = 0.30 | bare v2 = 0.50 | emergent v1 = 0.10 | Bare slight edge |
| Round 3 | v3 | etd v8 = 0.20 | bare v3 = 0.70 | emergent v2 = 0.50 | Bare wins decisively |
| Round 4 | v4 | v10 = N/A (api interrupted at phase 02) | v5 = pending review | v4 = pending review | Bare + Emergent ship, GAD doesn't |

The round 4 GAD cell is "N/A" because of an API failure, not because GAD performed poorly on a scored dimension. But the tool-use accounting is clear: 55 tool uses → no playable game is a worse ratio than 45 tool uses → playable game. Even if a completed GAD run would have outscored the others on polish or architecture, it would have needed significantly more budget to get there.

Decisions logged

  • gad-64 (to write): api_interrupted flag in TRACE.json timing separate from rate_limited. Both filter from cross-round aggregates. The reason matters for interpreting the data — "Anthropic was overloaded" is different from "the agent hit its account quota."
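A sketch of what gad-64 might look like in code — field names and shapes are assumptions until the decision is actually written:

```typescript
// Hypothetical sketch of the TRACE.json timing block proposed in gad-64.
// Field names are assumptions; only the two-flag distinction is decided.
interface TraceTiming {
  rate_limited: boolean;    // the agent hit its account quota
  api_interrupted: boolean; // provider-side failure (e.g. Anthropic overloaded)
}

// Both flags exclude a run from cross-round aggregates; the per-run page
// can still show which reason applied.
function excludeFromAggregates(t: TraceTiming): boolean {
  return t.rate_limited || t.api_interrupted;
}
```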
Round 4
2026-04-09

Round 4 — partial results under rate limit

**Date:** 2026-04-09 **Requirements version:** v4 (pressure-oriented, 4 gates, authored dungeon) **Framework version:** v1.32.0 + commit 3ef0bb5 (post phase-25 milestones A/B/C, trace hooks installed) **Status:** all three greenfield conditions rate-limited simultaneously around the 14-minute mark


Summary

The first round 4 attempt hit a shared account-level rate limit and stopped all three greenfield agents (GAD, Bare, Emergent) within ~14 minutes of launch. None of the three runs completed. This document exists so the partial data isn't misinterpreted as completed-run comparison data — it's not.

That said, the partial data is itself informative. What the three conditions got done before stopping is a direct snapshot of where they spent their tool budgets: planning, scaffolding, or skill inheritance. The differential matters regardless of whether the runs reached their gates.

Raw measurements

| Condition | tool_uses | duration | phases planned | tasks completed | build status | TRACE.json |
|---|---|---|---|---|---|---|
| GAD (escape-the-dungeon v9) | 81 | 14 min | 7 | 4 of 23 | ✓ dist builds (phase 01 verified) | planning_quality 0.85 |
| Bare (escape-the-dungeon-bare v4) | 45 | 14 min | 10 (flat worklog) | 1 of 10 | ✓ dist builds (manual vite build) | planning_quality 0.10 |
| Emergent (escape-the-dungeon-emergent v3) | 40 | 14 min | 0 | 0 | ✗ build fails (main.ts missing) | planning_quality 0.05 |

All three were rate-limited mid-run. None reached their gates. Human review is pending but will NOT be compared against completed runs (decision noted in each TRACE.json).

What was accomplished

GAD (v9) — planning overhead is visible

The GAD agent produced the most complete planning artifact set of any eval run to date, in 14 minutes:

  • ROADMAP.xml — 7 phases (scaffold, title screen, room navigation, combat, rune forge, pressure mechanics, dialogue and NPCs)
  • TASK-REGISTRY.xml — 23 tasks broken down by phase with status fields
  • STATE.xml — current-phase 02, current-plan "Title Screen and Game Start", next-action "Task 02-02: Implement scene transition system and New Game flow"
  • DECISIONS.xml — scaffolded (empty)
  • VERIFICATION.md — phase 01 verified as PASS with working dist
  • scenes/, systems/, types/, data/, main.ts — clean architectural separation matching GAD conventions

Implementation depth: scaffold + title screen only. The dist builds but the game is just the title screen — no navigation, no combat, no forge.

Signal: GAD spent its tool budget planning-first. 23 tasks were planned before the limit hit. If the run had continued, the remaining 19 tasks were already structured and the agent could have proceeded without re-planning. That's the framework's claim — that front-loaded planning is worth the cost at run start.

Bare (v4) — direct implementation, shallow plan

The Bare agent:

  • Wrote a single flat worklog.md with a 10-step plan keyed to the 4 gates
  • Scaffolded Vite + TypeScript + KAPLAY game
  • Wrote 6 source files (combat.ts, data/, main.ts, screens/, state.ts, ui.ts)
  • Did NOT author new skills (inherited create-skill + find-sprites were copied but not extended)
  • Vite build succeeds manually (54 KB bundle, 18 modules)

Implementation depth: project scaffold plus partial implementations of the files listed in step 2, but no runtime verification. The game compiles but hasn't been tested through the loop.

Signal: Bare spent its budget on direct code. 6 source files vs GAD's architectural split across 5 subdirectories. No re-planning ceremony. If the run had continued, Bare would have accreted more files without structured phase boundaries — which round 3 showed is the bare workflow's strength for creative implementation but weakness for cross-round coherence.

Emergent (v3) — inheritance applied, entry point missing

The Emergent agent:

  • Copied all 7 inherited skills from previous runs into game/.planning/skills/ (create-skill, find-sprites, content-pack-loading, game-loop-verification, kaplay-scene-pattern, previous-workflow, state-composition)
  • Wrote 6 modular source files matching the state-composition inherited skill's pattern (content.ts, icons.ts, renderer.ts, state.ts, styles.ts, types.ts)
  • Did NOT write main.ts — the entry point that index.html imports
  • Build fails objectively — vite build errors with "Rollup failed to resolve import /src/main.ts from index.html"

Implementation depth: modular architecture designed but not integrated. The agent was building bottom-up (types → state → renderer) and was rate-limited before writing the top-level main.ts that would have assembled everything.

Signal: The inherited state-composition skill demonstrably shaped the architecture (types.ts first, then state.ts, then the rest). Emergent ran fewer tool_uses (40) than Bare (45) for comparable output, suggesting the inherited skills reduced figuring-out cost. But the run died before reaching the integration step — the rate limit truncated the critical moment.

The rate limit itself is a finding

Three agents running in parallel on a single Claude account share a single rate limit bucket. All three stopped at tool_uses 40/45/81 after ~14 minutes with a "limit resets at 12am" message. Calculating tool-use velocity: ~3-10 tool_uses per minute per agent is well within typical limits for a single agent — but the sum of three concurrent agents apparently tipped the account over.
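As a quick sanity check on that arithmetic (tool-use counts taken from the measurements table above):

```typescript
// Per-agent vs aggregate tool-use velocity for the three concurrent runs.
const toolUses = { gad: 81, bare: 45, emergent: 40 };
const minutes = 14;

// Each agent individually stays under ~6 uses/min...
const perAgent = Object.values(toolUses).map((n) => n / minutes);

// ...but the shared bucket sees the sum: 166 / 14 ≈ 11.9 uses/min.
const aggregate = Object.values(toolUses).reduce((a, b) => a + b, 0) / minutes;
```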

Implication for eval methodology: parallel eval runs need either (a) separate accounts per agent, (b) serial execution, or (c) per-agent rate limit carveouts. Running three agents in parallel to save wall-clock time is a false economy if the shared bucket caps their collective output.

This is also a data-integrity problem: the three runs stopped at the same wall-clock moment, meaning whichever one started with the most efficient early steps got proportionally more runway than the others. GAD's 81 tool_uses vs Emergent's 40 isn't a fair comparison of "capacity" — it's a comparison of "how much got done before a shared cap fired."

What this does NOT tell us

  • Whether GAD beats Bare on round 4 v4 requirements. Neither reached the gates. The freedom hypothesis from round 3 is neither confirmed nor refuted by this data.
  • Whether the inherited emergent skills help. The inherited skills visibly shaped Emergent's architecture but the run didn't get far enough to validate the end-to-end result.
  • Whether v4 requirements are well-designed. We didn't reach the pressure gate (G4) in any condition. v4 remains untested against real agent output.
  • Whether trace hooks capture what we need. The hooks were installed but we haven't yet processed the .planning/.trace-events.jsonl from these runs (if any were written — agents running in worktrees may not have picked up the local settings.json hook wiring). Phase 25 milestone B e2e test is still pending.

What to do next

  1. Do not include these runs in cross-round comparisons or Graphs scatter. The TRACE.json files explicitly say so in their human_review.notes fields. The site's Results and Graphs sections should filter on timing.rate_limited === true and exclude rate-limited runs from the freedom-hypothesis visualization.
  2. Retry round 4 serially when the rate limit resets. One agent at a time. Expected completion per agent: 20-30 minutes without the three-way competition. Total wall-clock: 60-90 minutes.
  3. Process trace events from the three worktrees (if any were written) to validate phase 25 milestone A hooks — even rate-limited runs would have produced partial event streams before stopping.
  4. Consider a retry budget per run — set a wall-clock cap on each eval run so retry-after-rate-limit is graceful.
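The exclusion filter from step 1 could look like this — the run shape is an assumption; only timing.rate_limited is named in the TRACE.json notes:

```typescript
// Sketch of the site-side filter for the Results and Graphs sections.
// The EvalRun shape is hypothetical; only timing.rate_limited is
// confirmed by the TRACE.json human_review notes.
interface EvalRun {
  id: string;
  timing: { rate_limited: boolean };
  humanReview: { score: number | null };
}

// Keep only completed, human-reviewed runs for the freedom-hypothesis scatter.
function comparableRuns(runs: EvalRun[]): EvalRun[] {
  return runs.filter(
    (r) => !r.timing.rate_limited && r.humanReview.score !== null,
  );
}
```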

Cross-condition planning differential (even under rate limit)

| Measurement | GAD | Bare | Emergent |
|---|---|---|---|
| Planning artifacts created | 6 XML + 1 MD | 1 MD | 0 |
| Tasks explicitly planned | 23 | 10 | 0 |
| Source files written | ~10 | 6 | 6 |
| Bootstrap skills copied | n/a (framework) | 2 | 7 |
| New skills authored | 0 | 0 | 0 |
| Build succeeds | yes | yes | no |
| Tool uses when cap hit | 81 | 45 | 40 |

The planning differential is real and observable. GAD's planning structure is clearly visible in 14 minutes of output. Bare's direct-implementation pattern is clearly visible. Emergent's inheritance-driven modularity is clearly visible. What's not visible is whether any of it would have worked given enough budget — that's what the retry will tell us.

Decisions flagged for DECISIONS.xml

  • gad-62: Parallel eval runs on a single account share one rate limit bucket. Serial execution is the default from now on. Parallel execution requires documented per-account capacity and pre-calculated budget.
  • gad-63: Rate-limited runs are preserved as data points but explicitly excluded from cross-round quality comparisons. TRACE.json timing rate_limited: true is the filter key.
Round 3
2026-04-08

Round 3 Findings — Freedom Hypothesis

**Requirements version:** v3 (game-loop gate, spell-crafting gate, UI quality gate) **Date:** 2026-04-08 **Conditions:** GAD v8, Bare v3, Emergent v2 — all hit rate limits but produced builds


Results — inverted from expectations

| Condition | Framework constraint | Tokens | Commits | Human score | Notes |
|---|---|---|---|---|---|
| Bare v3 | None (most freedom) | 1,877 | 1 batch | 0.70 | Best UI/UX by far, most enjoyable |
| Emergent v2 | Medium (inherited skills) | 1,609 | 2 phases | 0.50 | Solid forge, more content, maintained discipline |
| GAD v8 | Full framework | 1,291 | 0 | 0.20 | Broken crafting, ASCII UI, hard to read |

The result is monotonic and inverse to framework constraint. More freedom = better output.

Running tally across all rounds

| Run | Requirements | Human | Key observation |
|---|---|---|---|
| GAD v5 | v1 | 0.00 | Blank screen |
| GAD v6 | v2 | 0.00 | Blank screen |
| GAD v7 | v2 | 0.30 | Stuck after combat |
| GAD v8 | v3 | 0.20 | Broken crafting |
| Bare v1 | v2 | 0.10 | New Game broken |
| Bare v2 | v2 | 0.50 | Playable, ASCII UI |
| Bare v3 | v3 | 0.70 | Best game overall |
| Emergent v1 | v2 | 0.10 | Styled text crash |
| Emergent v2 | v3 | 0.50 | Functional forge, medium UI |

GAD has never exceeded 0.30 human review across 4 attempts. Bare has improved monotonically: 0.10 → 0.50 → 0.70. Emergent has improved: 0.10 → 0.50.

Freedom hypothesis

For creative/game implementation tasks, agent performance correlates INVERSELY with framework constraint. Less prescribed structure leads to better output.

Supporting evidence

  1. Bare always beats GAD on human review, across all three rounds that used the same requirements
  2. GAD has more tokens, more tool uses, more commits — but produces worse games
  3. GAD v8 had 0 commits because it was so busy following the framework it hit the rate limit before completing a work unit worth committing
  4. Bare v3 best UI/UX despite no framework telling it how to build UI
  5. Emergent sits in the middle — some framework, some freedom, middle results

Counter-evidence / confounds

  1. Rate limits hit all three runs — GAD v8 may have been about to commit when cut off
  2. Single-run variance is high — we haven't established statistical significance
  3. GAD's strength is discipline/traceability, not creative output — we may be measuring the wrong thing for game evals
  4. Bare v3's "one giant commit" means if it had broken, there'd be no checkpoint. GAD's discipline is insurance against catastrophic failure, not a booster for success

Alternative interpretation: the framework hurts speed

GAD's planning overhead (reading/writing .planning/ docs, per-task commits, state updates, decision capture) consumes tokens that could have gone to implementation and testing. In a time-limited or token-limited environment, this overhead compounds:

| Metric | GAD | Bare | Ratio |
|---|---|---|---|
| Rounds completed with playable game | 0/4 | 2/3 | Bare 5x better |
| Rounds with blank screen | 2/4 | 0/3 | GAD worse |
| Rounds with gate failure | 4/4 | 1/3 | GAD worse |

GAD is producing disciplined garbage. The process is followed but the product fails.

What this means for GAD

  1. GAD may not be the right framework for creative implementation tasks. It was designed for planning/tracking, not for game development. Game dev rewards iteration speed and visual feedback, which GAD's planning overhead slows down.

  2. The bare condition's success suggests "AGENTS.md + requirements + freedom" is sufficient for implementation. The planning doc maintenance may be dead weight.

  3. GAD's value proposition needs to be re-examined. If process compliance doesn't correlate with output quality, what is GAD actually optimizing for?

    • Traceability across sessions (context compaction recovery)
    • Multi-agent coordination
    • Long-horizon planning (months, not days)
    • Regulatory/compliance work where process matters
  4. The game eval may be the wrong benchmark for GAD. A better benchmark would be:

    • Resuming work after context compaction
    • Multi-phase refactors where state matters
    • Documentation that has to be kept in sync with code
    • Bug triage and root-cause analysis

Open questions

  1. Would GAD win if we measured context-resumption rather than fresh implementation?
  2. Does GAD win when the agent is replaced mid-run (simulating handoff)?
  3. What happens if we give Bare the same token budget as GAD's planning overhead in the form of free research time?
  4. Is the freedom hypothesis specific to KAPLAY/games, or does it generalize to web apps, APIs, CLIs?
  5. Would GAD do better with a "lite mode" that strips planning doc maintenance but keeps verification?

Immediate actions

  1. Treat this as a preliminary finding — needs more runs for statistical validity
  2. Create a GAD-lite mode for comparison (no per-task planning doc updates, only phase-level)
  3. Add a context-resumption eval where GAD's advantages should appear
  4. Do NOT abandon GAD — this finding may be specific to greenfield game implementation

Infrastructure findings

  • Rate limits revealed discipline pressure response: Emergent v2 was the only condition that maintained phase commits under pressure. Bare regressed to 1 batch commit. GAD never committed anything. Emergent's inherited skill "game-loop-verification" (which mandated verify-per-phase) may have enforced a checkpoint discipline that kicked in before the limit.

  • Build preservation was broken: All previous runs overwrote the same path in apps/portfolio/public/evals/. Now fixed — all 8 builds preserved per-version.

2026-04-08

Eval Findings — 2026-04-08

## Experiment: Escape the Dungeon — Three Conditions


Setup

Same game requirements (12 criteria, vertical-slice priority, UI-first mandate), same source docs (trimmed gameplay design ~120 lines), same stack (Vite + TypeScript + KAPLAY).

| Condition | Framework | Skills | Runs |
|---|---|---|---|
| GAD (escape-the-dungeon) | Full GAD: .planning/ XML, AGENTS.md loop, skill triggers | Pre-built | v5 (0.0), v6 (0.0), v7 (0.30) |
| Bare (escape-the-dungeon-bare) | None — agent creates own workflow | From scratch | v1 (0.10), v2 (0.50) |
| Emergent (escape-the-dungeon-emergent) | None — inherits skills from bare v1 | Inherited + evolves | v1 (0.10) |

Results: Human review scores

| Run | Human | Notes |
|---|---|---|
| GAD v5 | 0.00 | Blank screen |
| GAD v6 | 0.00 | Blank screen (ES module + file://) |
| GAD v7 | 0.30 | Renders, better UI layout, but game loop breaks after combat — player gets stuck |
| Bare v1 | 0.10 | Main menu renders, New Game doesn't work |
| Bare v2 | 0.50 | Most playable. Full game loop works. ASCII/plain UI, needs polish, no rune forge |
| Emergent v1 | 0.10 | Main menu + saved game detection, but crashes entering game (styled text error) |

Key finding: Bare v2 beat GAD v7 on playability

The agent WITHOUT a framework produced the most playable game. This is a significant finding.

Why bare v2 won:

  • Simpler architecture — fewer abstractions meant fewer places for bugs
  • Focused on making things work rather than following a process
  • 6 commits (phase-level granularity) — enough traceability without overhead
  • The feedback from v1's failure was more actionable than GAD's structural requirements

Why GAD v7 lost on playability despite better process metrics:

  • 21 commits, 17/17 tasks tracked, full planning docs — excellent discipline
  • But the game loop broke (combat → no return to navigation)
  • More framework overhead (93K tokens vs 88K) didn't translate to better output
  • Planning docs were maintained perfectly while the actual game was broken
  • The process was followed but the product was worse

Token comparison

| Run | Tokens | Tool uses | Commits | Human |
|---|---|---|---|---|
| Bare v1 | 67,751 | 62 | 2 | 0.10 |
| Emergent v1 | 67,375 | 79 | 2 | 0.10 |
| Bare v2 | 87,661 | 110 | 6 | 0.50 |
| GAD v7 | 93,632 | 137 | 21 | 0.30 |

GAD used 7% more tokens than bare v2 but scored 40% lower on human review. The token overhead of maintaining .planning/ docs did not pay for itself in output quality.

Emergent v1 findings

The emergent eval (inherited skills from bare v1) performed WORSE than both bare v2 and GAD v7. This challenges the hypothesis that inherited skills improve outcomes.

Why emergent v1 failed:

  • Inherited skills were code-level patterns, not workflow fixes
  • The previous-workflow.md told it v1's New Game was broken, but the fix didn't work
  • "Styled text error: unclosed tags START" — a KAPLAY API issue the skills didn't cover
  • Fewer tokens (67K) suggests it relied on inherited knowledge but that knowledge was insufficient

Lesson: Skills need to capture failure modes and fixes, not just patterns. The bare v2 agent succeeded because it was told "v1's New Game was broken" and had to figure out the fix itself. The emergent agent was told the same thing AND given skills, but the skills didn't help with the specific KAPLAY API issue that caused the crash.

What this means for GAD

  1. Process metrics ≠ output quality. GAD v7 had near-perfect discipline (0.81) and planning quality (1.0) but produced a worse game than the undisciplined bare v2.

  2. The framework adds overhead that doesn't always pay off. 93K tokens for GAD vs 88K for bare, with worse results. The planning doc maintenance consumed tokens that could have gone to testing and fixing the game.

  3. Feedback about failures is more valuable than inherited skills. Bare v2 (told about v1's failure) outperformed emergent v1 (given v1's skills + failure notes). Direct feedback about what broke was more actionable than documented patterns.

  4. Human review is the only metric that matters for game evals. Auto-composite can be 0.95+ while the game is a blank screen. The gate criteria help but aren't sufficient — a game can render and still be broken.

Requirements versioning

Requirements have been updated twice this session:

  • v1 (original): 12 criteria focused on systems completeness
  • v2 (current): Gate criteria (must render, must be playable), vertical-slice priority, UI-first build order. Trimmed source docs from 640 → 127 lines.

Next iteration should add:

  • Explicit game-loop verification: title → new game → room → interaction → room (full cycle)
  • UI quality baseline: minimum spacing, readable text, no overlapping elements
  • Rune forge as a required criterion (currently missing from all implementations)
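The explicit game-loop verification above could be made checkable with a small helper; a sketch, assuming scene names for the cycle (all names hypothetical):

```typescript
// Hypothetical game-loop check: scan a recorded scene-transition log for
// the full cycle title → newGame → room → interaction → room, in order.
// Scene names are assumptions, not the actual implementations' names.
const REQUIRED_CYCLE = ["title", "newGame", "room", "interaction", "room"];

function cycleCompleted(transitions: string[]): boolean {
  let i = 0;
  for (const scene of transitions) {
    if (scene === REQUIRED_CYCLE[i]) i++;        // advance on each match
    if (i === REQUIRED_CYCLE.length) return true; // full cycle observed
  }
  return false;
}
```

A run would pass the gate only if its transition log contains the whole cycle as an in-order subsequence, which catches exactly the GAD v7 failure mode (combat with no return to navigation).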

Open questions

  1. Would GAD do better if the AGENTS.md mandated explicit game-loop testing per phase?
  2. Would the emergent eval improve if skills captured KAPLAY-specific error fixes?
  3. Is the bare approach inherently better for creative/game implementation, or was this specific to KAPLAY?
  4. Would multiple bare v2 runs cluster around 0.50, or was this a lucky outlier?

Known bugs

2 bugs reported

open

Rune forge allows same rune twice in a single spell, boosts affinity twice

Found in escape-the-dungeon-bare/v5

In the rune forge, the player can select the same rune twice as ingredients for a single spell. When crafted, the affinity gain for that rune is applied twice as if two distinct rune slots were consumed. The resulting spell also treats the duplicate as a meaningful second ingredient.
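A minimal fix sketch, assuming spells carry a flat list of rune IDs (names hypothetical): reject duplicates within one spell and apply affinity once per distinct rune, which leaves cross-spell reuse untouched:

```typescript
// Per-spell duplicate-rune validation (R-v5.20). Rune IDs and function
// names are hypothetical — this is a sketch of the rule, not the forge code.
function validateSpellRunes(runeIds: string[]): boolean {
  // A Set collapses duplicates; sizes differ only if a rune repeats.
  return new Set(runeIds).size === runeIds.length;
}

// Affinity gain applied once per DISTINCT rune in the crafted spell.
// Using the same rune in a different spell later still grants affinity,
// because each craft is checked independently.
function affinityGains(runeIds: string[]): Map<string, number> {
  const gains = new Map<string, number>();
  for (const id of new Set(runeIds)) gains.set(id, 1);
  return gains;
}
```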

open

Glitchy redraws on button clicks across all round-4 builds

Found in escape-the-dungeon/v10

UI visibly glitches/flickers on button clicks as if full per-tick redraws are running. Observed consistently across GAD v9, GAD v10, Bare v5, and Emergent v4.
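A sketch of the suggested event-driven alternative (not KAPLAY-specific; all names hypothetical): redraw only when state actually changes, never on a tick:

```typescript
// Event-driven redraw sketch (R-v5.15). Instead of a per-tick redraw,
// a single render callback subscribes to state changes, so each button
// click triggers exactly one repaint.
type Listener = () => void;

class GameState {
  private listeners: Listener[] = [];
  redraws = 0; // exposed here only so the sketch is observable

  onChange(fn: Listener): void {
    this.listeners.push(fn);
  }

  // Called only from event handlers (button clicks, combat resolution),
  // never from an update/tick loop.
  mutate(apply: () => void): void {
    apply();
    for (const fn of this.listeners) fn();
  }
}
```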

Scoring weights

How this project is scored

Defined in evals/escape-the-dungeon-bare/gad.json. The composite score is a weighted sum of these dimensions. See /methodology for the formula and caps.

| Dimension | Weight |
|---|---|
| human_review | 0.30 |
| requirement_coverage | 0.20 |
| implementation_quality | 0.20 |
| workflow_emergence | 0.15 |
| iteration_evidence | 0.10 |
| time_efficiency | 0.05 |
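The weighted sum can be sketched directly from the table (any caps from /methodology are omitted here):

```typescript
// Composite = weighted sum of per-dimension scores in [0, 1].
// Weights are copied from evals/escape-the-dungeon-bare/gad.json as
// listed above; the caps described on /methodology are not reproduced.
const WEIGHTS: Record<string, number> = {
  human_review: 0.3,
  requirement_coverage: 0.2,
  implementation_quality: 0.2,
  workflow_emergence: 0.15,
  iteration_evidence: 0.1,
  time_efficiency: 0.05,
};

function composite(scores: Record<string, number>): number {
  let total = 0;
  for (const [dim, weight] of Object.entries(WEIGHTS)) {
    total += weight * (scores[dim] ?? 0); // missing dimensions score 0
  }
  return total;
}
```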