Project market
Every eval project. Playable in your browser.
Browse the full catalog of agent evaluation projects across games, video, software, and tooling. Each project tests a hypothesis about how coding agents build under different conditions.
Showing 19 of 19 playable builds
Featured projects
escape-the-dungeon
Greenfield: agent builds the game from scratch using the full GAD framework
escape-the-dungeon-bare
Greenfield baseline: agent builds the game WITHOUT a planning framework, creating its own workflow
escape-the-dungeon-emergent
Greenfield emergent: agent builds the game with skills inherited from previous bare/emergent runs. Tests whether self-created systems improve over iterations.
gad-explainer-video
NEW EVAL DOMAIN (task 22-31, scaffolded 2026-04-09). The task is to produce a Remotion video that explains the GAD framework, its hypotheses, and its current state. The video requirements evolve across rounds the same way the escape-the-dungeon game requirements did — each round adds clarity, constraints, and complexity to the explainer script. Tests the same hypotheses (freedom / CSH / emergent-evolution) on a completely different task domain: video composition instead of game implementation. If bare still outperforms GAD here, freedom hypothesis generalizes beyond game dev. If not, freedom may be specific to creative implementation.
All projects
23 projects
cli-efficiency
Measures token efficiency of gad CLI vs raw file reads for coding agent context. Compares two workflows: (1) CLI-first using gad context/session/state/phases/tasks, (2) baseline grep+read pattern used by GSD/RP agents.
escape-the-dungeon-gad-emergent
GAD+Emergent combined condition: agent gets the full .planning/ XML scaffold AND GAD framework skills AND inherited skills from prior emergent runs. The maximally-scaffolded condition. Tests whether combining framework planning + methodology skills + inherited project-specific skills produces the best outcomes — or whether the overhead of ALL that context drowns the agent. If this beats both GAD-alone and Emergent-alone, it validates the emergent-evolution hypothesis (gad-68) in its strongest form: fundamentals + inheritance + planning = mastery. If it underperforms, the overhead hypothesis gets another data point: more scaffolding ≠ better output.
escape-the-dungeon-planning-only
Planning-only condition: agent gets the full .planning/ XML scaffold (ROADMAP, TASK-REGISTRY, DECISIONS, STATE) but only bootstrap skills (create-skill, find-sprites) — NO GAD framework skills. Tests whether the planning STRUCTURE alone (without skill methodology) improves outcomes vs bare. Isolates the variable: does having a roadmap + task registry + decision log help, even without the skills to execute against them?
etd-babylonjs
Tech-stack comparison: Escape the Dungeon with Babylon.js — same v5 requirements, different framework
etd-brownfield-bare
Brownfield: extend bare v3 codebase with v4 features WITHOUT a planning framework
etd-brownfield-emergent
Brownfield emergent: extend bare v3 codebase with v4 features, inheriting skills from previous emergent runs
etd-brownfield-gad
Brownfield: extend bare v3 codebase with v4 features using the full GAD framework
etd-phaser
Tech-stack comparison: Escape the Dungeon with Phaser.js — same v5 requirements, different framework
etd-pixijs
Tech-stack comparison: Escape the Dungeon with PixiJS — same v5 requirements, different framework
etd-threejs
Tech-stack comparison: Escape the Dungeon with Three.js — same v5 requirements, different framework
eval-skill-install-eval
Evaluates the eval-skill-install skill: measures whether skills install correctly into eval projects and whether the with/without comparison runs work
gad-explainer-video-bare
Tests the same hypotheses as gad-explainer-video but under the bare workflow condition. See gad-explainer-video/gad.json for the full rubric and gad-explainer-video/REQUIREMENTS.md for the task spec.
gad-explainer-video-emergent
Tests the same hypotheses as gad-explainer-video but under the emergent workflow condition. See gad-explainer-video/gad.json for the full rubric and gad-explainer-video/REQUIREMENTS.md for the task spec.
gad-planning-loop
GAD self-evaluation: measures planning loop fidelity across a phase
gad-skill-creator-eval
Evaluates the gad-skill-creator skill — measures whether skills created using this methodology produce better agent outcomes than ad-hoc skill creation
planning-migration
Lossless migration of all vendor project .planning/ dirs and portfolio planning sink to unified GAD format. Evaluates format compliance, sink sync, trace coverage, and data preservation across the compile/decompile round-trip.
portfolio-bare
Portfolio monorepo planning eval. Agent opens a fresh template directory, plans and executes phases from requirements, declares context mode (fresh/loaded) at session start. Runs accumulate over time — context_delta emerges from the comparison.
project-migration
Measures quality and completeness of migrating a project from a legacy planning framework (RP) to GAD. Scores planning continuity, skill coverage, and context efficiency before and after.
reverse-engineer-eval
Evaluates the reverse-engineer skill by measuring requirements quality — reverse-engineer a target codebase, then build from the generated requirements, score the result against the original.
skill-evaluation-app
NEW EVAL DOMAIN (task 22-51, scaffolded 2026-04-09). The task is to build a browser-based GUI for authoring eval requirements and viewing per-skill evaluation harness results. Tests the same hypotheses (freedom / CSH / emergent-evolution) on a THIRD task domain — front-end application development with a defined functional spec — different from both escape-the-dungeon (game dev) and gad-explainer-video (video composition). If the freedom hypothesis still holds here, the pattern generalizes beyond creative implementation. If compound-skills still holds, the skill evolution effect is task-domain-independent. Per SKEPTIC cross-cutting critique #5 (single-task domain), this is the single most important generalization test the project can run.
skill-evaluation-app-bare
Tests the same hypotheses as skill-evaluation-app but under the bare workflow condition. See skill-evaluation-app/gad.json for the full rubric and skill-evaluation-app/REQUIREMENTS.md for the task spec.
skill-evaluation-app-emergent
Tests the same hypotheses as skill-evaluation-app but under the emergent workflow condition. See skill-evaluation-app/gad.json for the full rubric and skill-evaluation-app/REQUIREMENTS.md for the task spec.
subagent-utility
Formal evaluation of subagent utility vs single-session — informs gad-16 revision
Playable builds
Click any build badge to play it in-browser. Hover for details.
Escape the Dungeon
Roguelike dungeon crawler, the primary eval vehicle across all rounds (18 builds)
GAD Explainer Video
Remotion composition, a planned eval for video generation workflows (1 build)
escape-the-dungeon-bare
requirements v3 · 2026-04-08
- Composite: 0.526
- Human: 0.70
- Tokens: 1,877
- Build time: 15 min
- Runtime: n/a
- Started: n/a
- Commits: 1
Best UI/UX of all eval runs by far, and the most enjoyable and playable game across all experiments. Functional game loop with combat and dialogue. Missing: floor progression after the boss (the player can grind the same floor) and a clear spell-crafting path. Regressed on commit discipline under pressure (1 giant commit vs. v2's 6). Human score: 0.70.