Project market

Every eval project. Playable in your browser.

Browse the full catalog of agent evaluation projects across games, video, software, and tooling. Each project tests a hypothesis about how coding agents build under different conditions.

Showing 19 of 19 playable builds

last 5 rounds per project

Featured projects

Featured
Game
gad
greenfield

escape-the-dungeon

Greenfield: agent builds the game from scratch using the full GAD framework

11 runs · 7 playable · Round 5

Featured
Game
bare
greenfield

escape-the-dungeon-bare

Greenfield baseline: agent builds the game WITHOUT a planning framework, creating its own workflow

6 runs · 6 playable · Round 5

Featured
Game
emergent
greenfield

escape-the-dungeon-emergent

Greenfield emergent: agent builds the game with skills inherited from previous bare/emergent runs. Tests whether self-created systems improve over iterations.

6 runs · 5 playable · Round 5

Featured
Video
gad
greenfield

gad-explainer-video

NEW EVAL DOMAIN (task 22-31, scaffolded 2026-04-09). The task is to produce a Remotion video that explains the GAD framework, its hypotheses, and its current state. The video requirements evolve across rounds the same way the escape-the-dungeon game requirements did: each round adds clarity, constraints, and complexity to the explainer script. Tests the same hypotheses (freedom / CSH / emergent-evolution) on a completely different task domain: video composition instead of game implementation. If bare still outperforms GAD here, the freedom hypothesis generalizes beyond game dev. If not, the freedom effect may be specific to creative implementation.

1 run · 1 playable · Round 5

All projects (23 projects)

Tooling

cli-efficiency

Measures token efficiency of gad CLI vs raw file reads for coding agent context. Compares two workflows: (1) CLI-first using gad context/session/state/phases/tasks, (2) baseline grep+read pattern used by GSD/RP agents.
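As a rough illustration of the comparison described above, the sketch below totals the estimated tokens each workflow reads and reports the fraction saved. Everything in it is an assumption made for illustration: the 4-characters-per-token estimate stands in for a real tokenizer, and the names `approx_tokens`, `measure`, and `token_reduction` and the sample payloads are hypothetical, not part of the gad CLI or the eval's actual scoring.

```python
from dataclasses import dataclass


def approx_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (assumption, not a real tokenizer)."""
    return max(1, len(text) // 4) if text else 0


@dataclass
class WorkflowCost:
    name: str
    tokens: int


def measure(name: str, payloads: list[str]) -> WorkflowCost:
    """Sum the estimated tokens of everything the agent read in this workflow."""
    return WorkflowCost(name, sum(approx_tokens(p) for p in payloads))


def token_reduction(cli: WorkflowCost, baseline: WorkflowCost) -> float:
    """Fraction of baseline tokens saved by the CLI-first workflow."""
    if baseline.tokens == 0:
        raise ValueError("baseline read no context")
    return 1 - cli.tokens / baseline.tokens


# Hypothetical payloads: one compact CLI summary vs. two raw file reads.
cli = measure("cli-first", ["phase 2: 3 tasks open, 1 blocked"])
baseline = measure("grep+read", ["x" * 400, "y" * 800])
print(f"{cli.name}: {cli.tokens} tokens, {baseline.name}: {baseline.tokens} tokens")
print(f"reduction: {token_reduction(cli, baseline):.1%}")
```

A real measurement would replace the sample strings with the captured output of each workflow and use the model's own tokenizer.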

Game
gad
greenfield

escape-the-dungeon-gad-emergent

GAD+Emergent combined condition: agent gets the full .planning/ XML scaffold AND GAD framework skills AND inherited skills from prior emergent runs. The maximally-scaffolded condition. Tests whether combining framework planning + methodology skills + inherited project-specific skills produces the best outcomes — or whether the overhead of ALL that context drowns the agent. If this beats both GAD-alone and Emergent-alone, it validates the emergent-evolution hypothesis (gad-68) in its strongest form: fundamentals + inheritance + planning = mastery. If it underperforms, the overhead hypothesis gets another data point: more scaffolding ≠ better output.

Game
bare
greenfield

escape-the-dungeon-planning-only

Planning-only condition: agent gets the full .planning/ XML scaffold (ROADMAP, TASK-REGISTRY, DECISIONS, STATE) but only bootstrap skills (create-skill, find-sprites) — NO GAD framework skills. Tests whether the planning STRUCTURE alone (without skill methodology) improves outcomes vs bare. Isolates the variable: does having a roadmap + task registry + decision log help, even without the skills to execute against them?

Game
bare
greenfield

etd-babylonjs

Tech-stack comparison: Escape the Dungeon with Babylon.js — same v5 requirements, different framework

Game
bare
brownfield

etd-brownfield-bare

Brownfield: extend bare v3 codebase with v4 features WITHOUT a planning framework

Game
emergent
brownfield

etd-brownfield-emergent

Brownfield emergent: extend bare v3 codebase with v4 features, inheriting skills from previous emergent runs

Game
gad
brownfield

etd-brownfield-gad

Brownfield: extend bare v3 codebase with v4 features using the full GAD framework

Game
bare
greenfield

etd-phaser

Tech-stack comparison: Escape the Dungeon with Phaser.js — same v5 requirements, different framework

Game
bare
greenfield

etd-pixijs

Tech-stack comparison: Escape the Dungeon with PixiJS — same v5 requirements, different framework

Game
bare
greenfield

etd-threejs

Tech-stack comparison: Escape the Dungeon with Three.js — same v5 requirements, different framework

Tooling
gad
greenfield

eval-skill-install-eval

Evaluates the eval-skill-install skill: measures whether skills install correctly into eval projects and whether with/without comparison runs work

Video
bare
greenfield

gad-explainer-video-bare

Tests the same hypotheses as gad-explainer-video under the bare workflow condition. See gad-explainer-video/gad.json for the full rubric and gad-explainer-video/REQUIREMENTS.md for the task spec.

Video
emergent
greenfield

gad-explainer-video-emergent

Tests the same hypotheses as gad-explainer-video under the emergent workflow condition. See gad-explainer-video/gad.json for the full rubric and gad-explainer-video/REQUIREMENTS.md for the task spec.

Tooling

gad-planning-loop

GAD self-evaluation: measures planning loop fidelity across a phase

Tooling
gad
greenfield

gad-skill-creator-eval

Evaluates the gad-skill-creator skill — measures whether skills created using this methodology produce better agent outcomes than ad-hoc skill creation

1 run

Planning

planning-migration

Lossless migration of all vendor project .planning/ dirs and portfolio planning sink to unified GAD format. Evaluates format compliance, sink sync, trace coverage, and data preservation across the compile/decompile round-trip.
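One way to picture the data-preservation check on a compile/decompile round-trip is sketched below. It is a minimal stand-in under stated assumptions: JSON serialization substitutes for the real GAD compile and decompile steps, and `compile_planning`, `decompile_planning`, `lost_keys`, and the sample record are hypothetical names invented for this example.

```python
import json


def compile_planning(doc: dict) -> str:
    """Stand-in 'compile' step: serialize a planning record (JSON is an assumption)."""
    return json.dumps(doc, sort_keys=True)


def decompile_planning(blob: str) -> dict:
    """Stand-in 'decompile' step: parse the serialized form back into a record."""
    return json.loads(blob)


def lost_keys(original: dict, restored: dict, prefix: str = "") -> list[str]:
    """List dotted paths whose values changed or disappeared in the round-trip."""
    lost = []
    for key, value in original.items():
        path = f"{prefix}{key}"
        if key not in restored:
            lost.append(path)
        elif isinstance(value, dict) and isinstance(restored[key], dict):
            lost.extend(lost_keys(value, restored[key], path + "."))
        elif restored[key] != value:
            lost.append(path)
    return lost


# Hypothetical planning record; a lossless round-trip reports no lost paths.
doc = {"roadmap": {"phase": 1}, "tasks": ["t1", "t2"], "state": "active"}
restored = decompile_planning(compile_planning(doc))
print(lost_keys(doc, restored))

# A deliberately damaged copy shows what a lossy migration would report.
damaged = {"roadmap": {}, "tasks": ["t1"]}
print(lost_keys(doc, damaged))
```

The check only looks for data missing from the restored record; a stricter variant would also flag keys that appear only after the round-trip.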

1 run

Planning

portfolio-bare

Portfolio monorepo planning eval. The agent opens a fresh template directory, plans and executes phases from requirements, and declares its context mode (fresh/loaded) at session start. Runs accumulate over time; context_delta emerges from the comparison.

3 runs

Planning

project-migration

Measures quality and completeness of migrating a project from a legacy planning framework (RP) to GAD. Scores planning continuity, skill coverage, and context efficiency before and after.

1 run

Tooling
gad
greenfield

reverse-engineer-eval

Evaluates the reverse-engineer skill by measuring requirements quality: reverse-engineer a target codebase, build from the generated requirements, then score the result against the original.

1 run

Software
gad
greenfield

skill-evaluation-app

NEW EVAL DOMAIN (task 22-51, scaffolded 2026-04-09). The task is to build a browser-based GUI for authoring eval requirements and viewing per-skill evaluation harness results. Tests the same hypotheses (freedom / CSH / emergent-evolution) on a THIRD task domain — front-end application development with a defined functional spec — different from both escape-the-dungeon (game dev) and gad-explainer-video (video composition). If the freedom hypothesis still holds here, the pattern generalizes beyond creative implementation. If compound-skills still holds, the skill evolution effect is task-domain-independent. Per SKEPTIC cross-cutting critique #5 (single-task domain), this is the single most important generalization test the project can run.

Tooling
bare
greenfield

skill-evaluation-app-bare

Tests the same hypotheses as skill-evaluation-app under the bare workflow condition. See skill-evaluation-app/gad.json for the full rubric and skill-evaluation-app/REQUIREMENTS.md for the task spec.

Tooling
emergent
greenfield

skill-evaluation-app-emergent

Tests the same hypotheses as skill-evaluation-app under the emergent workflow condition. See skill-evaluation-app/gad.json for the full rubric and skill-evaluation-app/REQUIREMENTS.md for the task spec.

Tooling

subagent-utility

Formal evaluation of subagent utility vs single-session — informs gad-16 revision

Playable builds

Click any build badge to play it in-browser. Hover for details.

Legend: reviewed · needs review · excluded (rate-limited / api-interrupted)

Escape the Dungeon

Roguelike dungeon crawler: primary eval vehicle across all rounds (18 playable builds)

GAD Explainer Video

Remotion composition: planned eval for video generation workflows (1 playable build)

Bare · v3 · Reviewed · Gate passed · Round 3

escape-the-dungeon-bare

requirements v3 · 2026-04-08

Composite: 0.526
Human: 0.70
Tokens: 1,877
Build time: 15 min
Runtime: Started
Commits: 1

Best UI/UX of all eval runs by far. Most enjoyable and playable. Functional game loop with combat and dialogue. Missing: floor progression after boss (can grind same floor), no clear spell crafting path. Regressed on commit discipline under pressure (1 giant commit vs v2's 6). Score 0.70: most enjoyable game across all experiments.
