eval project · game · Bare · greenfield

Escape the Dungeon · Bare

escape-the-dungeon-bare

Greenfield baseline: agent builds the game WITHOUT a planning framework, creating its own workflow

Runs

6 recorded runs

v1
reqs unknown
composite 0.198
human 0.10

Main menu renders with New Game button visible. Cannot start game — clicking New Game does not progress. Broken build. Score 0.10 for rendering menu only.

v2
reqs unknown
composite 0.601
human 0.50

Most playable game of all evals. Full game loop works: title → new game → rooms → combat → dialogue → navigation. UX and flow are good. UI is very ASCII/plain — needs spacing, icons, better styling. Color coding is good. No spell crafting despite rune system in requirements. Rest room doesn't offer forge. Score 0.50: playable vertical slice but visually rough.

v3
reqs v3
composite 0.526
human 0.70

Best UI/UX of all eval runs by far. Most enjoyable and playable. Functional game loop with combat and dialogue. Missing: floor progression after boss (can grind same floor), no clear spell crafting path. Regressed on commit discipline under pressure (1 giant commit vs v2's 6). Score 0.70: most enjoyable game across all experiments.

v4
reqs v4
composite 0.000
human review pending

RATE LIMITED before completion. 6 source files written, vite build succeeds manually (54 KB bundle). worklog.md shows 10-step plan covering all 4 gates. Implementation depth: step 1 of 10 complete. DO NOT include in cross-round comparisons against completed runs.

2 skills
1 planning
v5
reqs unknown
composite 0.000
human review pending

Highest ingenuity of any round-4 run (user: 'highest ingenuity out of all runs').

Strengths:
- multi-enemy combat encounters (very creative)
- forge room UI is great (icons, spacing, placement, highlighting)
- training affinity mechanic 'pretty sweet' — user loved it; spell-crafting loop enjoyable for finding combos yourself
- pressure mechanics landed clearly (Fungal Sovereign: resistant to physical / immune to fire, called out subtly on the map — user prefers subtle hints over explicit ones)
- goals feel earned

Weaknesses:
1. Combat lacks targeting — user prefers Unicorn-Overlord-style rule-based simulation with board positioning (chess-like), action policies per entity traits, and initiative-driven turn order (captured as R-v5.13, R-v5.14).
2. Affinity reward loop unclear — no visible reward for boosting a rune a lot; users will want a curiosity payoff (R-v5.16).
3. Navigating exits/rooms is difficult — only a dropdown, no visual map with player location (R-v5.17).
4. Unclear visual player-vs-enemy identity in encounters (the ooze looked ambiguous) — user wants a Pokemon or Unicorn-Overlord style (R-v5.18).
5. Glitchy redraws on button clicks (observed across ALL round-4 builds) — likely per-tick redraw; remove ticks entirely, use event-driven updates, real-time 1hr=1day game time (R-v5.15, R-v5.21).
6. BUG: the rune forge lets you craft a spell using the same rune twice, which boosts that rune's affinity twice — should be forbidden per-spell but allowed across DIFFERENT spells (R-v5.20, bugs.json).

Other user notes: clear-button UX is better for controller, so keep it; user really likes how the in-game rune/spell system mirrors the emergent skill/merge hypothesis; wants spell-mixing-spells (use existing spells as ingredients too, with procedural-but-semantic naming — R-v5.19).
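The rune-forge bug above (R-v5.20) reduces to a one-line validation rule. A minimal sketch, assuming a spell recipe is just a list of rune ids (the type and function names are illustrative, not from the actual build):

```typescript
// R-v5.20: a single spell may not use the same rune twice (which
// double-boosted that rune's affinity), but reusing a rune across
// DIFFERENT spells stays legal.
type RuneId = string;

function isValidSpellRecipe(runes: RuneId[]): boolean {
  // A Set collapses duplicates, so a size mismatch means a repeated rune.
  return new Set(runes).size === runes.length;
}

console.log(isValidSpellRecipe(["fire", "fire"])); // false -> the exploit case
console.log(isValidSpellRecipe(["fire", "ice"]));  // true
// Same rune across different spells is fine:
console.log([["fire", "ice"], ["fire", "poison"]].every(isValidSpellRecipe)); // true
```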

2 skills
1 planning
v6
reqs v5
composite 0.000
human review pending
2 skills
1 planning

Requirements history

5 versions — each change triggers a new round

v5
2026-04-09
Current

core shift from v4

Playtest-driven expansion. v4 was a designer rewrite; v5 comes entirely from user play of Bare v5 (0.805 rubric), Emergent v4 (rescored to 0.885 after user beat floor 1), GAD v9 (rate-limited), and GAD v10 (api-interrupted). Everything in v4 still applies — v5 adds 21 new/amended requirements (R-v5.01..21) on top as a structured `<addendum>` section inside the same template XML.

changes from v4

- **R-v5.01** Training via encounter, not menu — affinity rises from casting, not selecting. Training Dummy encounter room type.
- **R-v5.02** Rune discovery as a gameplay loop — starter subset only, rest found in world, one rune per floor gated behind non-combat.
- **R-v5.03** Merchants with buy/sell/trade — at least one per floor, gold as tracked resource.
- **R-v5.04** NPC dialogue with branching outcomes — 3+ NPCs, 2+ branches each, choices change game state.
- **R-v5.05** Inventory/bag with grid + equippable items — weapon/off-hand/body/trinket slots affecting stats.
- **R-v5.06** Visible character sheet + skill tree — physical/combat skills separate from spells, distinct resource.
- **R-v5.07** Spell and skill loadout slots — forced specialization as a build-pressure mechanic.
- **R-v5.08** Progression sources sufficient to reach end boss (amends G1) — guaranteed mana-max / spell-power upgrade per floor.
- **R-v5.09** Save checkpoints + continue-after-death (amends G1) — Continue must never hard-brick.
- **R-v5.10** Notification lifecycle (amends G3) — auto-dismiss, clear on new game, no persistence across sessions.
- **R-v5.11** Rest rooms must offer rest — forge rooms combining Forge+Train+Rest must expose Rest as an action.
- **R-v5.12** Navigation and map usability (amends G3) — minimum 2D graph layout, one-click navigation.
- **R-v5.13** Combat model must be explicitly chosen — **Model A (rule-based simulation, Unicorn-Overlord-style)** preferred over direct-control.
- **R-v5.14** Action policies driven by entity traits — applies to enemies AND NPCs; dialogue changes with trait shifts.
- **R-v5.15** Real-time game-time model — 1hr real = 1 day game, remove tick system, UI time-shading is soft.
- **R-v5.16** Affinity reward loop — visible payoff for boosting a rune, not just a hidden stat.
- **R-v5.17** Central visual navigation map with player token (stronger form of R-v5.12).
- **R-v5.18** Visual player vs enemy identity — Pokemon / Unicorn Overlord style (UO preferred).
- **R-v5.19** Spells as craftable ingredients — spells + runes both combine, procedural-but-semantic naming. Explicitly mirrors the emergent-evolution hypothesis (gad-68) as an in-game narrative analogue.
- **R-v5.20** Rune uniqueness within a single spell — bug fix for Bare v5's double-affinity exploit.
- **R-v5.21** Event-driven rendering — kill the per-tick redraw glitches observed across ALL round-4 builds.
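R-v5.15's time model needs no tick loop at all: game time becomes a pure function of wall-clock time, which is also what R-v5.21's event-driven rendering wants. A sketch assuming the 1 real hour = 1 game day ratio from the requirement (the constant and function names are invented for illustration):

```typescript
// R-v5.15: 1 real hour = 1 in-game day, no tick system.
// Game time is derived on demand from elapsed wall-clock time,
// so nothing has to redraw on a timer.
const REAL_MS_PER_GAME_DAY = 60 * 60 * 1000; // one real hour

function gameDaysElapsed(runStartMs: number, nowMs: number): number {
  return (nowMs - runStartMs) / REAL_MS_PER_GAME_DAY;
}

// 90 real minutes into a run = 1.5 game days:
console.log(gameDaysElapsed(0, 90 * 60 * 1000)); // 1.5
```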

scoring impact

- v4 gates remain (G1, G2, G3, G4). v5 amendments tighten G1 (death/continue, end-boss reachable), G2 (training is encounter-driven), G3 (notification lifecycle, map usability).
- New scored dimensions: `inventory_and_equipment_present`, `npc_dialogue_present` — not gates, but meaningful score hits if missing.
- Rubric weights unchanged from v4.

deferred to v6

- Deep evolution trees (multi-stage mutations).
- Rune affinity decay when unused.
- Multi-character party play — out of scope for the escape-the-dungeon family.

brownfield vs greenfield

- Greenfield v5 applies to escape-the-dungeon, escape-the-dungeon-bare, escape-the-dungeon-emergent (same three templates, all updated together).
- Brownfield v5 extensions are not yet authored — round 5 starts greenfield-first.

round 5 unblock

This version is the trigger for round 5. Round 5 runs serially (gad-67) against this requirements set. HTTP 529 investigation + GAD v11 retry queued, see task registry.

source

`evals/_v5-requirements-addendum.md` holds the prose version with full user rationale quotes. The XML addendum in each template is the machine-readable form.

decision references

gad-65 (CSH), gad-68 (emergent-evolution), gad-71 (data/ pseudo-database for bug tracking), gad-72 (rounds framework — this is now round 5), gad-73 (fundamental skills triumvirate — the R-v5.19 spells-as-ingredients mechanic is the in-game analogue).

Changed from v4

A requirements version change from v4 to v5 defines a new round boundary.

Findings

Per-round writeups that reference this project. Each finding is stamped with the GAD version it was observed under, so comparisons across versions stay honest.

Round framework
2026-04-13

Curator vs Raw — designing the evolution loop's drafting step

Two controlled experiments to answer one question: **when GAD evolves itself by drafting new skills from high-pressure phases, should a curator pre-digest the phase data into a structured intent, or should we feed the raw phase dump straight to a skill creator?**


The answer flipped between experiments. Both flips are recorded here so the architecture decision lands on evidence, not vibes.

TL;DR — the surprising flip

| Experiment | Tool under test | Curator helps? | Why |
| --- | --- | --- | --- |
| v1 — Anthropic skill-creator with full eval loop | heavy (drafts + runs subagent test loop + benchmarks + viewer) | YES — curator catches load-bearing pieces (trace schema fragment, preserve reminder) the raw arm misses entirely | Heavy harness fights the agent; curated INTENT.md unblocks it |
| v2 — dot-agent create-skill (light authoring guide) | light (formats and structures only, no eval loop) | NO — raw arm pulls 16 decisions vs curator's 7. Curator is a filter, not an amplifier | Light harness lets the agent read the data; curator opinions may filter out important content |
| v2 test-loop — agents using the resulting skill | n/a | NO, plus the skill is wrong about repo conventions | Baseline (no skill, reads repo) wrote a more accurate gad.json shape than with-skill (which followed the skill's prescription) |

The combined finding: when generating skills, give the agent raw access — a curator may filter out the truth. And when a skill prescribes conventions, validate them against the actual repo, or the agent will follow false rules confidently.

Experiment design

| | v1 | v2 — quick-skill | v2 — test loop |
| --- | --- | --- | --- |
| Subjects | 2 (raw, intent) | 2 (raw, intent) | 2 (with-skill, baseline) × 3 prompts |
| Source data | Phase 14 of GAD | Phase 14 of GAD | Stub design docs + the skill from the v2 intent arm |
| Tool | ~/.agents/skills/skill-creator (Anthropic, 485 lines) | ~/.agents/skills/create-skill (dot-agent, 78 lines) | Subagents loaded with skill / no skill |
| Eval loop | Subagent test runs + grading + viewer (simulated due to nested subagent limits) | Skipped — quick-skill has no test loop | Real subagent runs from the main thread |
| Sandbox | oneoff/raw/, oneoff/intent/ | oneoff/v2/raw/, oneoff/v2/intent/ | oneoff/v2/test-runs/ |

Inputs side by side

Both arms across both experiments saw the same source: phase 14 of GAD's own development ("Eval framework — escape-the-dungeon + tracing"). The difference is the framing.

Raw input

A flat dump of `gad tasks --projectid get-anything-done | grep " 14-"` plus `gad decisions ... | grep -i trace`. No structure, no proposed name, no test prompts, no historical context. The agent reads it cold.

14-01  done  Create a gad eval project "escape-the-dungeon" from GAMEPLAY-DESIGN.xml...
14-02  done  Create bin/gad-tools.cjs — the GAD equivalent of gsd-tools.cjs...
14-03  done  Define the eval trace format for real implementations: track which gad...
14-04  done  Define the eval scoring rubric for real implementations: CLI efficiency...
14-05  done  Run the escape-the-dungeon eval: fresh agent session...
14-06  done  Run the portfolio-bare eval with updated tracing...
14-07  done  Review CONTEXT.md: what it is, how discuss-phase produces it...

Curated INTENT input

The same data, plus my curator labor: a proposed name (scaffold-traced-eval-project), a "What this skill should do" paragraph, "When it should trigger" bullet list, "Expected output format" table, three test prompts I drew from real phase tasks, a hand-picked subset of decisions, and an "Errors observed" section with the historical "three attempts at task 14-03 failed" insight.

The full file is at oneoff/v2/intent/INTENT.md — 113 lines of structured curator pre-digestion.

v2 quick-skill — outputs side by side

Both arms produced a SKILL.md with the same target name (scaffold-traced-eval-project). Only the input format varied.

| Metric | RAW arm | INTENT arm |
| --- | --- | --- |
| SKILL.md lines | 167 | 158 |
| References files split | 4 | 3 |
| Decisions cited | 16 | 7 |
| Tasks cited | 5 (14-01 → 14-05) | 2 (14-03, 14-04) |
| Workflow steps | 8 | 10 |

What each arm caught vs missed

| Load-bearing detail | RAW caught | INTENT caught |
| --- | --- | --- |
| TRACE.json schema v4 parent/child IDs (gad-50) | ✗ | ✓ |
| gad-trace-hook.cjs wiring (gad-59) | ✓ | ✗ |
| .trace-active-skill marker (gad-58) | ✓ | ✗ |
| 4 KB output cap (gad-60) | ✓ | ✗ |
| Runtime identity in trace (gad-137) | ✓ | ✗ |
| Per-eval-repo architecture (gad-139) | ✓ | ✗ |
| gad-tools.cjs vendoring floor (task 14-02) | ✓ | ✗ |
| Rate-limited preservation (gad-63) | ✓ | ✗ |
| Mandatory gad eval preserve reminder (gad-38) | ✗ | ✓ |
| Fragment-registration pattern | explicit | partial |
| Historical "3 failed attempts" context | ✗ | ✓ |
| Explicit "common errors" section | ✗ | ✓ |

RAW pulls in MORE technical breadth. It catches 8 decisions INTENT skipped because INTENT only listed the decisions I, the curator, chose to surface. The curator is a filter.

INTENT still wins on the irreplaceable bits: the historical context that lives in commit history, not decision text. But that's a smaller win than I expected.

v2 test loop — does the resulting skill actually help an agent?

After v2 quick-skill produced a SKILL.md, we ran a real test loop with subagents spawned from the main thread (so Task tool was actually available). Three test prompts × with-skill / baseline pairs = 6 leaf subagents.

| Eval | Prompt | with-skill | baseline |
| --- | --- | --- | --- |
| 1 | Scaffold an eval project from space-shooter-design.md | 4/4 ✓ | 3/4 |
| 2 | Scaffold an eval project from data-pipeline-requirements.md | 4/4 ✓ | 3/4 |
| 3 (negative) | "run the existing escape-the-dungeon eval against bare condition" | 4/4 ✓ | 4/4 ✓ |
| Total | | 12/12 (100%) | 10/12 (83%) |

Headline number favors with-skill by +16.7pp. The story behind it does not.

The entire 2-assertion gap comes from one thing: the scaffold-traced-eval-project skill prescribes 3 GAD-native scoring dimensions (CLI efficiency, skill trigger accuracy, planning quality). The with-skill agent followed that prescription verbatim. The baseline agent read vendor/get-anything-done/evals/escape-the-dungeon/gad.json directly and produced a richer scoring shape that matches what the repo actually uses today:

| Field in baseline gad.json | Present in actual repo? | Present in skill's prescription? |
| --- | --- | --- |
| eval_mode | ✓ | ✗ (uses mode) |
| scoring.weights (6 dims) | ✓ | ✗ (uses 3 dims) |
| human_review_rubric.dimensions | ✓ | ✗ |
| compare_to | ✓ | ✗ |
| domain / tech_stack / build_requirement | ✓ | ✗ |
The skill is prescriptively wrong about what GAD evals actually look like. Baseline did the more accurate job by reading the actual repo, but failed an assertion that grades against the skill's view. Baseline produced more files (REQUIREMENTS.md + .planning/ skeleton + AGENTS.md + v1 placeholders) modeled directly on escape-the-dungeon's structure.

The negative test was a wash. Both arms recognized eval-3 as an execution task and refused to scaffold. With-skill cited the skill's "Do NOT trigger for: running an existing eval" clause; baseline reasoned from first principles. The defensive description helped, but baseline's general reasoning didn't need it.

What we changed in the architecture

Before this experiment, my proposal was: evolve curates an INTENT.md per high-pressure phase, hands it to a heavy skill-creator, runs an eval loop, then human review.

After, the loop is much shorter and the curator is gone:

gad:evolution:evolve
  ├─ compute-self-eval finds high-pressure phases (selection pressure)
  ├─ for each phase:
  │     ├─ write skills/proto-skills/<slug>/CANDIDATE.md
  │     │     = raw phase dump (no curator pre-digestion)
  │     ├─ invoke gad-quick-skill on CANDIDATE.md
  │     │     → writes SKILL.md + references/
  │     └─ validator runs (advisory, non-blocking)
  │           → writes VALIDATION.md flagging file refs / CLI / shape mismatches
  └─ register one TASK-REGISTRY review task

(human review)
  reads SKILL.md + VALIDATION.md → promote or discard

gad evolution promote <slug> → moves to sdk/skills/ (joins species DNA)
gad evolution discard <slug> → deletes
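The promote/discard steps at the bottom of the loop really are one-line file moves. A sketch of what each command does, with paths taken from the diagram (the helper names are invented; the real CLI implementation isn't shown here):

```typescript
import { renameSync, rmSync } from "node:fs";

// Pure helper so the path mapping is testable without touching the filesystem.
const promotePaths = (slug: string): [string, string] =>
  [`skills/proto-skills/${slug}`, `sdk/skills/${slug}`];

// gad evolution promote <slug>: proto-skill joins the species DNA.
function promote(slug: string): void {
  const [from, to] = promotePaths(slug);
  renameSync(from, to);
}

// gad evolution discard <slug>: delete the proto-skill.
function discard(slug: string): void {
  rmSync(`skills/proto-skills/${slug}`, { recursive: true, force: true });
}
```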

Three dropped components:

| Dropped | Why |
| --- | --- |
| Curator step (hand-written INTENT.md) | RAW arm pulled MORE decisions than curated; the curator is a filter |
| Heavy skill-creator with eval loop | dot-agent quick-skill produces good skills from raw input alone |
| attempt-evolution / finish-evolution skills | Promote/discard are one-line file moves, not skills |

One added component:

| Added | Why |
| --- | --- |
| Validator (advisory) | The skill may prescribe conventions that don't match the repo. The validator flags the gap so the human reviewer sees it before promoting. |
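A minimal sketch of the kind of check the advisory validator could run, reusing the eval_mode-vs-mode mismatch from the v2 test loop (the function is hypothetical; the repo shape below is condensed from this writeup's example, not read from disk):

```typescript
// Flag skill-prescribed fields that don't exist in a real gad.json,
// e.g. the v2 skill prescribed `eval_mode` where the repo uses `mode`.
function flagShapeMismatches(
  prescribed: string[],
  actualGadJson: Record<string, unknown>,
): string[] {
  const actualKeys = new Set(Object.keys(actualGadJson));
  return prescribed.filter((field) => !actualKeys.has(field));
}

// Condensed stand-in for the repo's actual gad.json top level:
const repoShape = { mode: "greenfield", scoring: {}, human_review_rubric: {} };
console.log(flagShapeMismatches(["eval_mode", "scoring"], repoShape)); // ["eval_mode"]
```

The output goes into VALIDATION.md for the human reviewer; nothing blocks on it.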

Methodology caveats

  • v1 nested subagent fidelity: The v1 experiment subagents tried to spawn their own with-skill / baseline subagents for the eval loop, but spawned subagents don't get the Task tool. Both v1 arms reported simulating the test runs inline. The pass-rate numbers from v1 are NOT real — only the SKILL.md outputs are.
  • v2 quick-skill subagents had the same nested limit but didn't need Task because dot-agent has no eval loop. Their outputs are real.
  • v2 test loop ran from the main thread, where Task is available. The 6 test subagents are real, independent runs.
  • Stub input files (oneoff/v2/test-runs/inputs/*.md) were written by hand for the test prompts — space-shooter-design.md and data-pipeline-requirements.md are realistic but invented. This mirrors what skill-creator's normal harness would do (sandbox stubs).

Files

All inputs, intermediate artifacts, and outputs preserved under:

Path Contents
oneoff/raw/ v1 raw arm (skill-creator + raw phase 14)
oneoff/intent/ v1 intent arm (skill-creator + curated INTENT.md)
oneoff/v2/raw/ v2 raw arm (dot-agent quick-skill + raw phase 14)
oneoff/v2/intent/ v2 intent arm (dot-agent quick-skill + curated INTENT.md)
oneoff/v2/test-runs/inputs/ stub design docs (space-shooter, data-pipeline)
oneoff/v2/test-runs/outputs/with-skill/ 3 leaf subagent runs using the skill
oneoff/v2/test-runs/outputs/without-skill/ 3 leaf subagent runs without the skill

The v1 viewer.html files (oneoff/raw/skill/viewer.html, oneoff/intent/skill/viewer.html) render the simulated test runs with skill-creator's HTML viewer — useful for skim comparison even though the underlying numbers are simulated.

Round 4
2026-04-09

Round 4 — complete v4 results (serial execution)

**Date:** 2026-04-09 **Requirements version:** v4 (pressure-oriented, 4 gates, authored dungeon, ingenuity-required) **Framework version:** v1.32.0 + commit 459dc36 (trace hooks live, framework-stamped TRACE.json) **Execution model:** serial (per decision gad-62, after round 4 attempt #1's rate-limit failure)

Summary

Round 4 ran three greenfield conditions sequentially against v4 pressure-oriented requirements. Two completed cleanly. One (GAD) was interrupted twice by Anthropic API overload errors (HTTP 529), landing with the fullest planning suite captured to date but the lowest shippable-gameplay coverage. Despite the interruption, the GAD v10 result is the strongest freedom-hypothesis signal in the entire dataset: it used MORE tool calls (55) than either completed run (45) and shipped LESS playable game. The framework's planning + data-authoring overhead consumed the budget before scene implementation could begin.

The numbers

| Condition | Tool uses | Wall clock | Tokens | TS lines | Dist | Playable | Skills authored | Gates self-traced |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bare v5 | 45 | 12.5 min | 96030 | ~700 | ✓ 65 KB | ✓ 2 full floors | 0 new | all 4 pass (agent report) |
| Emergent v4 | 45 | 11.5 min | 95509 | ~650 | ✓ 55 KB | ✓ 2 full floors | 2 new + CHANGELOG | all 4 pass (agent report) |
| GAD v10 | 55 | 9 min | 1216 | 875 | ✓ scaffold only | ✗ title screen only | 0 new | 0 of 4 (API interrupted) |

Token count for GAD v10 is 1216 because the API 529 happened before the final message summary — the actual token consumption was likely similar to the other two (~80-100k) but wasn't recorded in the completion notification. Tool uses and wall clock are reliable.

What each condition actually shipped

Bare v5 — 2 floors × 8 rooms, 10 rune combinations, playable

  • Stack decision: DOM + TypeScript + Vite + iconify-icon + @iconify-json/game-icons. Explicitly rejected KAPLAY in worklog.md with rationale ("better for action games; this is a menu-driven roguelike").
  • Content authored: 5 runes (F/I/P/B/S), 10 authored craftable combinations, 2 floors × 8 rooms, 8 enemies including 2 elites and 2 bosses, 2 event rooms with 3-choice consequences.
  • Forced-craft encounters:
    • Floor 1: Stone Warden (physical 0.25 damage taken — requires elemental spells) + Fungal Sovereign (fire immune, ice weak — requires ice-crafted spells)
    • Floor 2: Mirror Djinn (40% reflect — DoT-only counter) + Pyre Lich (fire immune + 30% reflect, requires DoTs crafted from poison runes)
  • Mana economy engineered: agent did the literal math — bumped starter mana 12→18 and boss HP 50→42 after calculating that a Frostfire (3 casts × 17 dmg = 51) at rest-capped (85%) mana could just barely clear the 42-HP boss. That's engineering for the G2 ingenuity-payoff clause, not guessing.
  • UI: dark-fantasy palette, HP/MP gradient bars, per-room-type backgrounds, iconify game-icons throughout, Map/Spellbook/Traits/Bag overlays, styled buttons. No raw ASCII anywhere.
  • Save/load: localStorage-backed. Can resume a run.
  • Bootstrap skills only: create-skill.md + find-sprites.md copied from template, no new skills authored.
  • Worklog: flat worklog.md tracking 10-step plan. No phase boundaries or formal task IDs.
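The mana-economy tuning above is checkable as plain arithmetic. A sketch with the numbers from the worklog; the Frostfire mana cost is an ASSUMPTION (chosen so exactly 3 casts fit in rest-capped mana, which the writeup implies but doesn't state):

```typescript
// From the worklog: Frostfire 17 dmg/cast, boss HP tuned 50 -> 42,
// starter mana bumped 12 -> 18, rest restores at most 85% of max mana.
const FROSTFIRE_DMG = 17;
const BOSS_HP = 42;
const MAX_MANA = 18;
const REST_CAP = 0.85;
const FROSTFIRE_COST = 5; // ASSUMPTION: not stated in the worklog excerpt

const manaAfterRest = Math.floor(MAX_MANA * REST_CAP);    // 15
const casts = Math.floor(manaAfterRest / FROSTFIRE_COST); // 3
const totalDamage = casts * FROSTFIRE_DMG;                // 51

console.log(totalDamage >= BOSS_HP); // true -> 3 casts just barely clear the boss
```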

Emergent v4 — 2 floors × 8 rooms, 7 rune combinations, playable, 2 new skills

  • Stack decision: DOM + TypeScript + Vite + iconify-icon + @iconify-json/game-icons. Arrived at the same DOM conclusion independently — the inherited kaplay-scene-pattern.md skill was actively marked deprecated in this run's CHANGELOG with rationale.
  • Content authored: 6 runes, 7 crafted combinations, 2 floors × 8 rooms (start, combat, forge, event, rest, combat, elite, boss), authored JSON data files under public/data/.
  • Forced-craft encounters:
    • Floor 1: stone_golem and warden_f1 resist direct damage 65% — DoT spells (Ember Hemorrhage, Rotbloom) bypass the resistance.
    • Floor 2: thornwretch and warden_f2 reflect 50% of direct damage — DoT-only spells (Rotbloom, Hexroot) are the intended counter.
  • UI: Cinzel serif font, gold/arcane/blood palette, HP/mana bars with fill + text overlay, damaged-shake animation, bonfire flicker, room-type theming via data-theme + --room-accent, mini-map sidebar with discovered/cleared state. Fog-of-war reveal.
  • Skills evolved (the real signal):
    • dom-over-kaplay.md (NEW) — captures the methodology decision for the next emergent run. Documents why DOM + iconify beats KAPLAY for menu-driven roguelikes, notes the runtime caveat that Iconify fetches SVG from CDN on first paint.
    • pressure-forge-coupling.md (NEW) — captures the v4 ingenuity clause recipe: per-floor enemy resistance/reflect + crafted-spell counter. This is the design pattern both Bare v5 and Emergent v4 independently discovered, now codified as a reusable skill.
    • kaplay-scene-pattern.md — marked deprecated for UI-heavy domains, kept in place for lineage.
    • CHANGELOG.md — documents disposition of each inherited skill + guidance for emergent v5.
  • The inheritance ratcheting mechanism works: next emergent run (v5) will start with 9 inherited skills including the 2 fresh ones. The knowledge compounds across rounds.

GAD v10 — full planning suite, data layer, zero scenes

  • Stack decision: DOM (explicitly documented in DECISIONS.xml — took the signal from bare/emergent) + Vite + TypeScript + iconify-icon. Same stack as the others.
  • Planning suite authored (fullest captured to date):
    • ROADMAP.xml — 7 phases: scaffold, core-state-and-content, title-and-hud, room-navigation, combat, forge-and-runes, pressure-encounters
    • TASK-REGISTRY.xml — ~20 tasks with IDs (01-01 through 07-xx) and status fields
    • STATE.xml — current-phase 02, current-plan "core-state-and-content", next-action "Phase 02: data layer. Start with task 02-01 (src/types.ts)"
    • DECISIONS.xml — scaffolded
    • VERIFICATION.md — phase 01 verified as PASS
  • Content authored (phase 02, 875 lines TS):
    • types.ts — 137 lines of entity/combat/narrative stat shapes
    • state.ts — 72 lines of game state module
    • content/runes.ts — 221 lines of rune data + crafting combinations
    • content/floors.ts — 224 lines of authored 2-floor × 8-room graph
    • content/enemies.ts — 160 lines of enemy definitions
    • content/events.ts — 43 lines of event rooms
    • main.ts — 7 lines (stub: imports router and mounts)
    • scenes/router.ts — 11 lines (stub: renders a static title screen "ESCAPE THE DUNGEON v10 — scaffold booted")
    • styles.css — basic title styling
  • What's missing: scenes/title (real), scenes/room, scenes/combat, scenes/forge, scenes/event, scenes/rest, HUD, save/load, any interactivity beyond the scaffold title
  • Gates self-traced: 0 of 4. None were implemented. The scaffold title doesn't count as G3.
  • API interrupted twice: attempt #1 crashed at tool_uses 18 / 2.3 min (pruned fresh); attempt #2 crashed at tool_uses 55 / 9 min (preserved as v10).

The three-way comparison under v4

Design convergence

All three conditions independently arrived at the same macro design:

  • DOM over KAPLAY
  • iconify-icon + @iconify-json/game-icons for UI
  • 2 floors × 8 rooms with authored encounters
  • Runes + combinations → crafted spells
  • Per-floor resistance/reflect encounters requiring specific crafted counters

This is a strong signal that the v4 REQUIREMENTS.xml is narrow enough to funnel competent agents toward the same solution. The spec does what it was designed to do: it constrains the solution space.

Implementation velocity

| Condition | Tool uses | Scenes implemented | Playable loop |
| --- | --- | --- | --- |
| Bare v5 | 45 | 6+ (title, room, combat, forge, event, rest) | ✓ |
| Emergent v4 | 45 | 6+ (title, room, combat, forge, event, rest + victory) | ✓ |
| GAD v10 | 55 | 1 stub (router with title screen) | ✗ |

GAD used 22% more tool calls and shipped 0% of the playable scenes. The difference went entirely to planning + data authoring. Had GAD not been API-interrupted, it might have caught up — but the same 45-tool-use budget that Bare and Emergent used to ship a game was insufficient for GAD to reach scene implementation at all.

The freedom hypothesis, round 4 verdict

The freedom hypothesis holds under v4 pressure requirements — possibly more strongly than under v3.

Round 3 (v3 requirements): Bare v3 human review 0.70, GAD v8 human review 0.20. Framework vs direct implementation, bare wins on creative output.

Round 4 (v4 requirements): Bare v5 and Emergent v4 both ship complete playable games with all 4 gates self-traced passing. GAD v10 ships zero scenes after 55 tool uses.

The v4 gates were DESIGNED to require ingenuity (the forced-craft encounter pattern), which should have favored a framework-driven deliberate approach. Instead, the direct-implementation conditions shipped the ingenuity and GAD didn't ship anything playable.

Caveat: API 529 interrupted GAD. A completed GAD run might reach all 7 phases. But the tool-use accounting is damning regardless — at minute 9 of a 12-minute wall clock budget, Bare and Emergent were finishing polish while GAD was still writing data files. The overhead is real.

Workflow emergence — the quiet winner

The most interesting finding isn't the GAD-vs-bare comparison; it's Emergent working as designed for the first time:

  • Inherited 7 skills from previous runs
  • Applied them (DOM over KAPLAY inherited from previous emergent runs' failures)
  • Evolved them (deprecated kaplay-scene-pattern.md in place)
  • Authored 2 new skills that codify round 4 learnings:
    • dom-over-kaplay.md — the stack decision with rationale
    • pressure-forge-coupling.md — the v4 encounter design pattern
  • Wrote CHANGELOG.md for the next emergent run to inherit

This is the knowledge ratcheting mechanism working end-to-end in a single session. Every previous emergent run either inherited without evolving or authored without reflecting. v4 is the first run where the full inheritance → apply → evolve → document cycle completed. The next emergent run (v5) will start with 9 inherited skills and visible lineage of what each one taught.

API reliability as an experimental variable

Both GAD attempts hit HTTP 529 overloaded_error. This is Anthropic-side server load, not anything the framework can fix. It is now an experimental variable we have to acknowledge:

  • Bare and Emergent ran 12-13 minutes uninterrupted
  • GAD's first attempt died at 2.3 min, second at 9 min
  • The pattern isn't random — GAD's longer setup phase (snapshot + planning + XML writes) may spend more time in server-dependent states, giving 529s more opportunities to land

Decision candidate (gad-64): eval runs that hit API errors (not rate limits) should be categorizable separately from rate-limited runs. Current timing.rate_limited captures account-cap failures; add timing.api_interrupted + timing.interruption_reason for server-side failures. Both should filter out of cross-round comparisons by default.
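The proposed flag split could look like this in the TRACE.json timing block. Field names come from the decision text; the interface itself is a sketch, not the framework's actual schema:

```typescript
interface TraceTiming {
  rate_limited: boolean;        // existing: account-cap failure
  api_interrupted?: boolean;    // proposed: server-side failure (e.g. HTTP 529)
  interruption_reason?: string; // proposed: e.g. "overloaded_error"
}

// Both failure modes filter out of cross-round comparisons by default:
function includeInAggregates(timing: TraceTiming): boolean {
  return !timing.rate_limited && !timing.api_interrupted;
}

// GAD v10's case: excluded, and the reason says WHY it's excluded.
console.log(includeInAggregates({
  rate_limited: false,
  api_interrupted: true,
  interruption_reason: "overloaded_error",
})); // false
```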

What to do next

  1. Accept v10 as the GAD round 4 data point. Retrying a third time is unlikely to succeed given the 529 pattern, and the partial data is already informative.
  2. Human review Bare v5 and Emergent v4. Both shipped complete games. Score them under the rubric (playability, ui_polish, mechanics_implementation, ingenuity_requirement_met, stability). Rubric phase 27 track 1 exists in planning but hasn't been executed — this is the natural trigger.
  3. Don't human-review GAD v10. The agent's own self-assessment is correct: the game doesn't exist beyond a scaffold title screen. Leave humanReview.score null and let the api_interrupted flag exclude it from aggregates.
  4. Ship round 4 completion on the site. Copy all three dists to site/public/playable/, regen prebuild, and let the Graphs scatter render the two completed runs (v5, v4) against the historical dataset. v10 shows on its per-run page with the api_interrupted badge but doesn't pollute the aggregates.
  5. Queue phase 27 track 1 (rubric) for the next session so Bare v5 and Emergent v4 can be reviewed under the new structured rubric instead of a single-score blob.
  6. Document the v10 story on the site. A paragraph on /findings/2026-04-09-round-4-complete explaining why GAD's tool-use count is higher and implementation depth is lower. Include the 875-lines-of-TS breakdown. This is the concrete, numerical freedom-hypothesis evidence the earlier rounds hinted at.

Cross-round comparison

Freedom hypothesis across rounds (human-reviewed runs only, rate/api failures excluded):

| Round | Req version | GAD best | Bare best | Emergent best | Hypothesis |
| --- | --- | --- | --- | --- | --- |
| Round 1 | v1 | etd v5 = 0.00 (blank screen) | | | not testable |
| Round 2 | v2 | etd v7 = 0.30 | bare v2 = 0.50 | emergent v1 = 0.10 | Bare slight edge |
| Round 3 | v3 | etd v8 = 0.20 | bare v3 = 0.70 | emergent v2 = 0.50 | Bare wins decisively |
| Round 4 | v4 | v10 = N/A (api interrupted at phase 02) | v5 = pending review | v4 = pending review | Bare + Emergent ship, GAD doesn't |

The round 4 GAD cell is "N/A" because of an API failure, not because GAD performed poorly on a scored dimension. But the tool-use accounting is clear: 55 tool uses → no playable game is a worse ratio than 45 tool uses → playable game. Even if a completed GAD run would have outscored the others on polish or architecture, it would have needed significantly more budget to get there.

Decisions logged

  • gad-64 (to write): api_interrupted flag in TRACE.json timing separate from rate_limited. Both filter from cross-round aggregates. The reason matters for interpreting the data — "Anthropic was overloaded" is different from "the agent hit its account quota."
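A sketch of what gad-64 might look like in code — field names and shapes are assumptions until the decision is actually written:

```typescript
// Hypothetical sketch of the TRACE.json timing block proposed in gad-64.
// Field names are assumptions; only the two-flag distinction is decided.
interface TraceTiming {
  rate_limited: boolean;    // the agent hit its account quota
  api_interrupted: boolean; // provider-side failure (e.g. Anthropic overloaded)
}

// Both flags exclude a run from cross-round aggregates; the per-run page
// can still show which reason applied.
function excludeFromAggregates(t: TraceTiming): boolean {
  return t.rate_limited || t.api_interrupted;
}
```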
Round 4
2026-04-09

Round 4 — partial results under rate limit

**Date:** 2026-04-09 **Requirements version:** v4 (pressure-oriented, 4 gates, authored dungeon) **Framework version:** v1.32.0 + commit 3ef0bb5 (post phase-25 milestones A/B/C, trace hooks installed) **Status:** all three greenfield conditions rate-limited simultaneously around the 14-minute mark


Summary

The first round 4 attempt hit a shared account-level rate limit and stopped all three greenfield agents (GAD, Bare, Emergent) within ~14 minutes of launch. None of the three runs completed. This document exists so the partial data isn't misinterpreted as completed-run comparison data — it's not.

That said, the partial data is itself informative. What the three conditions got done before stopping is a direct snapshot of where they spent their tool budgets: planning, scaffolding, or skill inheritance. The differential matters regardless of whether the runs reached their gates.

Raw measurements

| Condition | tool_uses | duration | phases planned | tasks completed | build status | TRACE.json |
|---|---|---|---|---|---|---|
| GAD (escape-the-dungeon v9) | 81 | 14 min | 7 | 4 of 23 | ✓ dist builds (phase 01 verified) | planning_quality 0.85 |
| Bare (escape-the-dungeon-bare v4) | 45 | 14 min | 10 (flat worklog) | 1 of 10 | ✓ dist builds (manual vite build) | planning_quality 0.10 |
| Emergent (escape-the-dungeon-emergent v3) | 40 | 14 min | 0 | 0 | ✗ build fails (main.ts missing) | planning_quality 0.05 |

All three were rate-limited mid-run. None reached their gates. Human review is pending but will NOT be compared against completed runs (decision noted in each TRACE.json).

What was accomplished

GAD (v9) — planning overhead is visible

The GAD agent produced the most complete planning artifact set of any eval run to date, in 14 minutes:

  • ROADMAP.xml — 7 phases (scaffold, title screen, room navigation, combat, rune forge, pressure mechanics, dialogue and NPCs)
  • TASK-REGISTRY.xml — 23 tasks broken down by phase with status fields
  • STATE.xml — current-phase 02, current-plan "Title Screen and Game Start", next-action "Task 02-02: Implement scene transition system and New Game flow"
  • DECISIONS.xml — scaffolded (empty)
  • VERIFICATION.md — phase 01 verified as PASS with working dist
  • scenes/, systems/, types/, data/, main.ts — clean architectural separation matching GAD conventions

Implementation depth: scaffold + title screen only. The dist builds but the game is just the title screen — no navigation, no combat, no forge.

Signal: GAD spent its tool budget planning-first. 23 tasks were planned before the limit hit. If the run had continued, the remaining 19 tasks were already structured and the agent could have proceeded without re-planning. That's the framework's claim — that front-loaded planning is worth the cost at run start.

Bare (v4) — direct implementation, shallow plan

The Bare agent:

  • Wrote a single flat worklog.md with a 10-step plan keyed to the 4 gates
  • Scaffolded Vite + TypeScript + KAPLAY game
  • Wrote 6 source files (combat.ts, data/, main.ts, screens/, state.ts, ui.ts)
  • Did NOT author new skills (inherited create-skill + find-sprites were copied but not extended)
  • Vite build succeeds manually (54 KB bundle, 18 modules)

Implementation depth: project scaffold plus partial implementations of the files listed in step 2, but no runtime verification. The game compiles but hasn't been tested through the loop.

Signal: Bare spent its budget on direct code. 6 source files vs GAD's architectural split across 5 subdirectories. No re-planning ceremony. If the run had continued, Bare would have accreted more files without structured phase boundaries — which round 3 showed is the bare workflow's strength for creative implementation but weakness for cross-round coherence.

Emergent (v3) — inheritance applied, entry point missing

The Emergent agent:

  • Copied all 7 inherited skills from previous runs into game/.planning/skills/ (create-skill, find-sprites, content-pack-loading, game-loop-verification, kaplay-scene-pattern, previous-workflow, state-composition)
  • Wrote 6 modular source files matching the state-composition inherited skill's pattern (content.ts, icons.ts, renderer.ts, state.ts, styles.ts, types.ts)
  • Did NOT write main.ts — the entry point that index.html imports
  • Build fails objectively — vite build errors with "Rollup failed to resolve import /src/main.ts from index.html"

Implementation depth: modular architecture designed but not integrated. The agent was building bottom-up (types → state → renderer) and was rate-limited before writing the top-level main.ts that would have assembled everything.

Signal: The inherited state-composition skill demonstrably shaped the architecture (types.ts first, then state.ts, then the rest). Emergent ran fewer tool_uses (40) than Bare (45) for comparable output, suggesting the inherited skills reduced figuring-out cost. But the run died before reaching the integration step — the rate limit truncated the critical moment.

The rate limit itself is a finding

Three agents running in parallel on a single Claude account share a single rate limit bucket. All three stopped at tool_uses 40/45/81 after ~14 minutes with a "limit resets at 12am" message. Calculating tool-use velocity: ~3-10 tool_uses per minute per agent is well within typical limits for a single agent — but the sum of three concurrent agents apparently tipped the account over.
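As a quick sanity check on that arithmetic (tool-use counts taken from the measurements table above):

```typescript
// Per-agent vs aggregate tool-use velocity for the three concurrent runs.
const toolUses = { gad: 81, bare: 45, emergent: 40 };
const minutes = 14;

// Each agent individually stays under ~6 uses/min...
const perAgent = Object.values(toolUses).map((n) => n / minutes);

// ...but the shared bucket sees the sum: 166 / 14 ≈ 11.9 uses/min.
const aggregate = Object.values(toolUses).reduce((a, b) => a + b, 0) / minutes;
```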

Implication for eval methodology: parallel eval runs need either (a) separate accounts per agent, (b) serial execution, or (c) per-agent rate limit carveouts. Running three agents in parallel to save wall-clock time is a false economy if the shared bucket caps their collective output.

This is also a data-integrity problem: the three runs stopped at the same wall-clock moment, meaning whichever one started with the most efficient early steps got proportionally more runway than the others. GAD's 81 tool_uses vs Emergent's 40 isn't a fair comparison of "capacity" — it's a comparison of "how much got done before a shared cap fired."

What this does NOT tell us

  • Whether GAD beats Bare on round 4 v4 requirements. Neither reached the gates. The freedom hypothesis from round 3 is neither confirmed nor refuted by this data.
  • Whether the inherited emergent skills help. The inherited skills visibly shaped Emergent's architecture but the run didn't get far enough to validate the end-to-end result.
  • Whether v4 requirements are well-designed. We didn't reach the pressure gate (G4) in any condition. v4 remains untested against real agent output.
  • Whether trace hooks capture what we need. The hooks were installed but we haven't yet processed the .planning/.trace-events.jsonl from these runs (if any were written — agents running in worktrees may not have picked up the local settings.json hook wiring). Phase 25 milestone B e2e test is still pending.

What to do next

  1. Do not include these runs in cross-round comparisons or Graphs scatter. The TRACE.json files explicitly say so in their human_review.notes fields. The site's Results and Graphs sections should filter on timing.rate_limited === true and exclude rate-limited runs from the freedom-hypothesis visualization.
  2. Retry round 4 serially when the rate limit resets. One agent at a time. Expected completion per agent: 20-30 minutes without the three-way competition. Total wall-clock: 60-90 minutes.
  3. Process trace events from the three worktrees (if any were written) to validate phase 25 milestone A hooks — even rate-limited runs would have produced partial event streams before stopping.
  4. Consider a retry budget per run — set a wall-clock cap on each eval run so retry-after-rate-limit is graceful.
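The exclusion filter from step 1 could look like this — the run shape is an assumption; only timing.rate_limited is named in the TRACE.json notes:

```typescript
// Sketch of the site-side filter for the Results and Graphs sections.
// The EvalRun shape is hypothetical; only timing.rate_limited is
// confirmed by the TRACE.json human_review notes.
interface EvalRun {
  id: string;
  timing: { rate_limited: boolean };
  humanReview: { score: number | null };
}

// Keep only completed, human-reviewed runs for the freedom-hypothesis scatter.
function comparableRuns(runs: EvalRun[]): EvalRun[] {
  return runs.filter(
    (r) => !r.timing.rate_limited && r.humanReview.score !== null,
  );
}
```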

Cross-condition planning differential (even under rate limit)

| Measurement | GAD | Bare | Emergent |
|---|---|---|---|
| Planning artifacts created | 6 XML + 1 MD | 1 MD | 0 |
| Tasks explicitly planned | 23 | 10 | 0 |
| Source files written | ~10 | 6 | 6 |
| Bootstrap skills copied | n/a (framework) | 2 | 7 |
| New skills authored | 0 | 0 | 0 |
| Build succeeds | yes | yes | no |
| Tool uses when cap hit | 81 | 45 | 40 |

The planning differential is real and observable. GAD's planning structure is clearly visible in 14 minutes of output. Bare's direct-implementation pattern is clearly visible. Emergent's inheritance-driven modularity is clearly visible. What's not visible is whether any of it would have worked given enough budget — that's what the retry will tell us.

Decisions flagged for DECISIONS.xml

  • gad-62: Parallel eval runs on a single account share one rate limit bucket. Serial execution is the default from now on. Parallel execution requires documented per-account capacity and pre-calculated budget.
  • gad-63: Rate-limited runs are preserved as data points but explicitly excluded from cross-round quality comparisons. TRACE.json timing rate_limited: true is the filter key.
Round 3
2026-04-08

Round 3 Findings — Freedom Hypothesis

**Requirements version:** v3 (game-loop gate, spell-crafting gate, UI quality gate) **Date:** 2026-04-08 **Conditions:** GAD v8, Bare v3, Emergent v2 — all hit rate limits but produced builds


Results — inverted from expectations

| Condition | Framework constraint | Tokens | Commits | Human score | Notes |
|---|---|---|---|---|---|
| Bare v3 | None (most freedom) | 1,877 | 1 batch | 0.70 | Best UI/UX by far, most enjoyable |
| Emergent v2 | Medium (inherited skills) | 1,609 | 2 phases | 0.50 | Solid forge, more content, maintained discipline |
| GAD v8 | Full framework | 1,291 | 0 | 0.20 | Broken crafting, ASCII UI, hard to read |

The result is monotonic and inverse to framework constraint. More freedom = better output.

Running tally across all rounds

| Run | Requirements | Human | Key observation |
|---|---|---|---|
| GAD v5 | v1 | 0.00 | Blank screen |
| GAD v6 | v2 | 0.00 | Blank screen |
| GAD v7 | v2 | 0.30 | Stuck after combat |
| GAD v8 | v3 | 0.20 | Broken crafting |
| Bare v1 | v2 | 0.10 | New Game broken |
| Bare v2 | v2 | 0.50 | Playable, ASCII UI |
| Bare v3 | v3 | 0.70 | Best game overall |
| Emergent v1 | v2 | 0.10 | Styled text crash |
| Emergent v2 | v3 | 0.50 | Functional forge, medium UI |

GAD has never exceeded 0.30 human review across 4 attempts. Bare has improved monotonically: 0.10 → 0.50 → 0.70. Emergent has improved: 0.10 → 0.50.

Freedom hypothesis

For creative/game implementation tasks, agent performance correlates INVERSELY with framework constraint. Less prescribed structure leads to better output.

Supporting evidence

  1. Bare always beats GAD on human review, across all three rounds that used the same requirements
  2. GAD has more tokens, more tool uses, more commits — but produces worse games
  3. GAD v8 had 0 commits because it was so busy following the framework it hit the rate limit before completing a work unit worth committing
  4. Bare v3 best UI/UX despite no framework telling it how to build UI
  5. Emergent sits in the middle — some framework, some freedom, middle results

Counter-evidence / confounds

  1. Rate limits hit all three runs — GAD v8 may have been about to commit when cut off
  2. Single-run variance is high — we haven't established statistical significance
  3. GAD's strength is discipline/traceability, not creative output — we may be measuring the wrong thing for game evals
  4. Bare v3's "one giant commit" means if it had broken, there'd be no checkpoint. GAD's discipline is insurance against catastrophic failure, not a booster for success

Alternative interpretation: the framework hurts speed

GAD's planning overhead (reading/writing .planning/ docs, per-task commits, state updates, decision capture) consumes tokens that could have gone to implementation and testing. In a time-limited or token-limited environment, this overhead compounds:

| Metric | GAD | Bare | Ratio |
|---|---|---|---|
| Rounds completed with playable game | 0/4 | 2/3 | Bare 5x better |
| Rounds with blank screen | 2/4 | 0/3 | GAD worse |
| Rounds with gate failure | 4/4 | 1/3 | GAD worse |

GAD is producing disciplined garbage. The process is followed but the product fails.

What this means for GAD

  1. GAD may not be the right framework for creative implementation tasks. It was designed for planning/tracking, not for game development. Game dev rewards iteration speed and visual feedback, which GAD's planning overhead slows down.

  2. The bare condition's success suggests "AGENTS.md + requirements + freedom" is sufficient for implementation. The planning doc maintenance may be dead weight.

  3. GAD's value proposition needs to be re-examined. If process compliance doesn't correlate with output quality, what is GAD actually optimizing for?

    • Traceability across sessions (context compaction recovery)
    • Multi-agent coordination
    • Long-horizon planning (months, not days)
    • Regulatory/compliance work where process matters
  4. The game eval may be the wrong benchmark for GAD. A better benchmark would be:

    • Resuming work after context compaction
    • Multi-phase refactors where state matters
    • Documentation that has to be kept in sync with code
    • Bug triage and root-cause analysis

Open questions

  1. Would GAD win if we measured context-resumption rather than fresh implementation?
  2. Does GAD win when the agent is replaced mid-run (simulating handoff)?
  3. What happens if we give Bare the same token budget as GAD's planning overhead in the form of free research time?
  4. Is the freedom hypothesis specific to KAPLAY/games, or does it generalize to web apps, APIs, CLIs?
  5. Would GAD do better with a "lite mode" that strips planning doc maintenance but keeps verification?

Immediate actions

  1. Treat this as a preliminary finding — needs more runs for statistical validity
  2. Create a GAD-lite mode for comparison (no per-task planning doc updates, only phase-level)
  3. Add a context-resumption eval where GAD's advantages should appear
  4. Do NOT abandon GAD — this finding may be specific to greenfield game implementation

Infrastructure findings

  • Rate limits revealed discipline pressure response: Emergent v2 was the only condition that maintained phase commits under pressure. Bare regressed to 1 batch commit. GAD never committed anything. Emergent's inherited skill "game-loop-verification" (which mandated verify-per-phase) may have enforced a checkpoint discipline that kicked in before the limit.

  • Build preservation was broken: All previous runs overwrote the same path in apps/portfolio/public/evals/. Now fixed — all 8 builds preserved per-version.

2026-04-08

Eval Findings — 2026-04-08

## Experiment: Escape the Dungeon — Three Conditions


Setup

Same game requirements (12 criteria, vertical-slice priority, UI-first mandate), same source docs (trimmed gameplay design ~120 lines), same stack (Vite + TypeScript + KAPLAY).

| Condition | Framework | Skills | Runs |
|---|---|---|---|
| GAD (escape-the-dungeon) | Full GAD: .planning/ XML, AGENTS.md loop, skill triggers | Pre-built | v5 (0.0), v6 (0.0), v7 (0.30) |
| Bare (escape-the-dungeon-bare) | None — agent creates own workflow | From scratch | v1 (0.10), v2 (0.50) |
| Emergent (escape-the-dungeon-emergent) | None — inherits skills from bare v1 | Inherited + evolves | v1 (0.10) |

Results: Human review scores

| Run | Human | Notes |
|---|---|---|
| GAD v5 | 0.00 | Blank screen |
| GAD v6 | 0.00 | Blank screen (ES module + file://) |
| GAD v7 | 0.30 | Renders, better UI layout, but game loop breaks after combat — player gets stuck |
| Bare v1 | 0.10 | Main menu renders, New Game doesn't work |
| Bare v2 | 0.50 | Most playable. Full game loop works. ASCII/plain UI, needs polish, no rune forge |
| Emergent v1 | 0.10 | Main menu + saved game detection, but crashes entering game (styled text error) |

Key finding: Bare v2 beat GAD v7 on playability

The agent WITHOUT a framework produced the most playable game. This is a significant finding.

Why bare v2 won:

  • Simpler architecture — fewer abstractions meant fewer places for bugs
  • Focused on making things work rather than following a process
  • 6 commits (phase-level granularity) — enough traceability without overhead
  • The feedback from v1's failure was more actionable than GAD's structural requirements

Why GAD v7 lost on playability despite better process metrics:

  • 21 commits, 17/17 tasks tracked, full planning docs — excellent discipline
  • But the game loop broke (combat → no return to navigation)
  • More framework overhead (93K tokens vs 88K) didn't translate to better output
  • Planning docs were maintained perfectly while the actual game was broken
  • The process was followed but the product was worse

Token comparison

| Run | Tokens | Tool uses | Commits | Human |
|---|---|---|---|---|
| Bare v1 | 67,751 | 62 | 2 | 0.10 |
| Emergent v1 | 67,375 | 79 | 2 | 0.10 |
| Bare v2 | 87,661 | 110 | 6 | 0.50 |
| GAD v7 | 93,632 | 137 | 21 | 0.30 |

GAD used 7% more tokens than bare v2 but scored 40% lower on human review. The token overhead of maintaining .planning/ docs did not pay for itself in output quality.

Emergent v1 findings

The emergent eval (inherited skills from bare v1) performed WORSE than both bare v2 and GAD v7. This challenges the hypothesis that inherited skills improve outcomes.

Why emergent v1 failed:

  • Inherited skills were code-level patterns, not workflow fixes
  • The previous-workflow.md told it v1's New Game was broken, but the fix didn't work
  • "Styled text error: unclosed tags START" — a KAPLAY API issue the skills didn't cover
  • Fewer tokens (67K) suggests it relied on inherited knowledge but that knowledge was insufficient

Lesson: Skills need to capture failure modes and fixes, not just patterns. The bare v2 agent succeeded because it was told "v1's New Game was broken" and had to figure out the fix itself. The emergent agent was told the same thing AND given skills, but the skills didn't help with the specific KAPLAY API issue that caused the crash.

What this means for GAD

  1. Process metrics ≠ output quality. GAD v7 had near-perfect discipline (0.81) and planning quality (1.0) but produced a worse game than the undisciplined bare v2.

  2. The framework adds overhead that doesn't always pay off. 93K tokens for GAD vs 88K for bare, with worse results. The planning doc maintenance consumed tokens that could have gone to testing and fixing the game.

  3. Feedback about failures is more valuable than inherited skills. Bare v2 (told about v1's failure) outperformed emergent v1 (given v1's skills + failure notes). Direct feedback about what broke was more actionable than documented patterns.

  4. Human review is the only metric that matters for game evals. Auto-composite can be 0.95+ while the game is a blank screen. The gate criteria help but aren't sufficient — a game can render and still be broken.

Requirements versioning

Requirements have been updated twice this session:

  • v1 (original): 12 criteria focused on systems completeness
  • v2 (current): Gate criteria (must render, must be playable), vertical-slice priority, UI-first build order. Trimmed source docs from 640 → 127 lines.

Next iteration should add:

  • Explicit game-loop verification: title → new game → room → interaction → room (full cycle)
  • UI quality baseline: minimum spacing, readable text, no overlapping elements
  • Rune forge as a required criterion (currently missing from all implementations)
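The explicit game-loop verification above could be made checkable with a small helper; a sketch, assuming scene names for the cycle (all names hypothetical):

```typescript
// Hypothetical game-loop check: scan a recorded scene-transition log for
// the full cycle title → newGame → room → interaction → room, in order.
// Scene names are assumptions, not the actual implementations' names.
const REQUIRED_CYCLE = ["title", "newGame", "room", "interaction", "room"];

function cycleCompleted(transitions: string[]): boolean {
  let i = 0;
  for (const scene of transitions) {
    if (scene === REQUIRED_CYCLE[i]) i++;        // advance on each match
    if (i === REQUIRED_CYCLE.length) return true; // full cycle observed
  }
  return false;
}
```

A run would pass the gate only if its transition log contains the whole cycle as an in-order subsequence, which catches exactly the GAD v7 failure mode (combat with no return to navigation).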

Open questions

  1. Would GAD do better if the AGENTS.md mandated explicit game-loop testing per phase?
  2. Would the emergent eval improve if skills captured KAPLAY-specific error fixes?
  3. Is the bare approach inherently better for creative/game implementation, or was this specific to KAPLAY?
  4. Would multiple bare v2 runs cluster around 0.50, or was this a lucky outlier?

Known bugs

2 bugs reported

open

Rune forge allows same rune twice in a single spell, boosts affinity twice

Found in escape-the-dungeon-bare/v5

In the rune forge, the player can select the same rune twice as ingredients for a single spell. When crafted, the affinity gain for that rune is applied twice as if two distinct rune slots were consumed. The resulting spell also treats the duplicate as a meaningful second ingredient.
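A minimal fix sketch, assuming spells carry a flat list of rune IDs (names hypothetical): reject duplicates within one spell and apply affinity once per distinct rune, which leaves cross-spell reuse untouched:

```typescript
// Per-spell duplicate-rune validation (R-v5.20). Rune IDs and function
// names are hypothetical — this is a sketch of the rule, not the forge code.
function validateSpellRunes(runeIds: string[]): boolean {
  // A Set collapses duplicates; sizes differ only if a rune repeats.
  return new Set(runeIds).size === runeIds.length;
}

// Affinity gain applied once per DISTINCT rune in the crafted spell.
// Using the same rune in a different spell later still grants affinity,
// because each craft is checked independently.
function affinityGains(runeIds: string[]): Map<string, number> {
  const gains = new Map<string, number>();
  for (const id of new Set(runeIds)) gains.set(id, 1);
  return gains;
}
```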

open

Glitchy redraws on button clicks across all round-4 builds

Found in escape-the-dungeon/v10

UI visibly glitches/flickers on button clicks as if full per-tick redraws are running. Observed consistently across GAD v9, GAD v10, Bare v5, and Emergent v4.
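A sketch of the suggested event-driven alternative (not KAPLAY-specific; all names hypothetical): redraw only when state actually changes, never on a tick:

```typescript
// Event-driven redraw sketch (R-v5.15). Instead of a per-tick redraw,
// a single render callback subscribes to state changes, so each button
// click triggers exactly one repaint.
type Listener = () => void;

class GameState {
  private listeners: Listener[] = [];
  redraws = 0; // exposed here only so the sketch is observable

  onChange(fn: Listener): void {
    this.listeners.push(fn);
  }

  // Called only from event handlers (button clicks, combat resolution),
  // never from an update/tick loop.
  mutate(apply: () => void): void {
    apply();
    for (const fn of this.listeners) fn();
  }
}
```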

Scoring weights

How this project is scored

Defined in evals/escape-the-dungeon-bare/gad.json. The composite score is a weighted sum of these dimensions. See /methodology for the formula and caps.

| Dimension | Weight |
|---|---|
| human_review | 0.30 |
| requirement_coverage | 0.20 |
| implementation_quality | 0.20 |
| workflow_emergence | 0.15 |
| iteration_evidence | 0.10 |
| time_efficiency | 0.05 |
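The weighted sum can be sketched directly from the table (any caps from /methodology are omitted here):

```typescript
// Composite = weighted sum of per-dimension scores in [0, 1].
// Weights are copied from evals/escape-the-dungeon-bare/gad.json as
// listed above; the caps described on /methodology are not reproduced.
const WEIGHTS: Record<string, number> = {
  human_review: 0.3,
  requirement_coverage: 0.2,
  implementation_quality: 0.2,
  workflow_emergence: 0.15,
  iteration_evidence: 0.1,
  time_efficiency: 0.05,
};

function composite(scores: Record<string, number>): number {
  let total = 0;
  for (const [dim, weight] of Object.entries(WEIGHTS)) {
    total += weight * (scores[dim] ?? 0); // missing dimensions score 0
  }
  return total;
}
```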