Hypotheses
Every research hypothesis, wired to its eval track.
Each card below shows one hypothesis the project is tracking, the eval track that tests it, the latest evidence, and a link to a dedicated page for deeper reading. Every hypothesis also has an entry on /skeptic that holds it to its strongest critique; read that before trusting any claim here.
Labels:
- preliminary observation: we have seen a pattern and named it, but the sample size is too small to call it a finding.
- discussing: we are still working out what the hypothesis even claims.
- operationalized: the hypothesis has a concrete, computable definition.
- not yet tested: no runs have produced data against it.
All tracks, one chart
Highest human-review score per round, per hypothesis track. Solid lines show rounds with real data. Dashed lines are planned tracks with no scored runs yet (content-driven, gad-66; codex runtime, task 89); they exist to make the research plan visible. Read /skeptic before trusting any individual point: N=2-5 runs per condition, one human reviewer, and one task domain so far. Data provenance: values come from EVAL_RUNS[n].humanReviewNormalized.aggregate_score, grouped by round + workflow at prebuild.
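The aggregation behind the chart can be sketched in a few lines. This is a hypothetical illustration, not the project's actual prebuild code: the field names mirror the stated provenance (`humanReviewNormalized.aggregate_score`, grouped by round + workflow), but the `EvalRun` shape and function name are assumptions.

```typescript
// Hypothetical run record; field names follow the provenance note above,
// but this interface is an assumption, not the project's real type.
interface EvalRun {
  round: number;
  workflow: string; // e.g. "bare", "gad", "emergent"
  humanReviewNormalized: { aggregate_score: number };
}

// Highest human-review score per round, per workflow track:
// the per-point value the chart plots for each solid line.
function bestScorePerRound(runs: EvalRun[]): Map<string, Map<number, number>> {
  const byTrack = new Map<string, Map<number, number>>();
  for (const run of runs) {
    const rounds = byTrack.get(run.workflow) ?? new Map<number, number>();
    const score = run.humanReviewNormalized.aggregate_score;
    // Keep only the best score seen for this (workflow, round) pair.
    rounds.set(run.round, Math.max(rounds.get(run.round) ?? -Infinity, score));
    byTrack.set(run.workflow, rounds);
  }
  return byTrack;
}
```

Planned (dashed) tracks simply have no entries in the resulting map, which is why they carry no data points.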
Current hypotheses
For creative implementation tasks, agent performance correlates inversely with framework constraint. Less prescribed structure leads to better output.
Latest evidence
Bare improved monotonically 0.10 → 0.50 → 0.70 → 0.805 across rounds 2-4, while GAD never exceeded 0.30 on human review. N=4 vs N=5, so the difference is not statistically significant. Bare v5 scored the highest ingenuity of round 4.
A coding agent's skill library compounds in value over many rounds as skills are merged and tailored to the project. The emergent workflow should produce monotonically improving results.
Latest evidence
Emergent v4 scored 0.885 (highest round-4 result) after authoring 2 new skills, deprecating 1, and documenting disposition of every inherited skill in CHANGELOG.md. First observed full ratcheting cycle. N=2-3 runs — not enough to claim compounding.
Synthesis of freedom + CSH. A coding agent given blank artifacts, requirements, and the ability to create/merge/find skills against a GAD-provided foundational pool will produce better work over time. Projects are themselves emergent.
Latest evidence
Working synthesis only. It depends on the gad-73 fundamental skills triumvirate (find-skills / merge-skill / create-skill) existing, and that audit is unfinished. The craftsman metaphor is currently doing more of the heavy lifting than the evidence.
Given requirements AND a pre-authored content pack (spells, runes, items, NPCs, dialogue) extracted from prior runs, an agent will produce a more fleshed-out game on top — analogous to making a movie based on a book. Derivative work as a distinct research direction.
Latest evidence
No runs yet. Resolved open question: this becomes its own eval track (escape-the-dungeon-inherited-content) distinct from greenfield emergent so CSH measurements stay clean. Comparison rules are intentionally different — content-pack and greenfield runs do NOT share a rubric.
Constraint intensity on the agent during an eval is a first-class variable. It is now operationalized programmatically (gad-79) as task_pressure = log2(R + 2G + C + 1) / log2(65), computed from the REQUIREMENTS.xml structure.
Latest evidence
Q1 + Q2 resolved 2026-04-09. Formula implemented in prebuild. v5 computed score: 0.884 (R=21, G=4, C=10, raw=39). Stored as TRACE metadata per gad-79, not a rubric dimension. Distinct from game_pressure (in-game player experience).
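The formula and the v5 numbers above can be checked directly. Only the formula itself comes from the source; the function name, signature, and the meaning attached to the parameter names are illustrative.

```typescript
// task_pressure = log2(R + 2G + C + 1) / log2(65), per gad-79.
// r, g, c are the three counts taken from REQUIREMENTS.xml structure;
// this helper is a sketch, not the project's actual prebuild code.
function taskPressure(r: number, g: number, c: number): number {
  return Math.log2(r + 2 * g + c + 1) / Math.log2(65);
}

// v5 inputs from the evidence note: R=21, G=4, C=10.
// raw = R + 2G + C = 21 + 8 + 10 = 39, so the argument is 40.
const v5 = taskPressure(21, 4, 10); // ≈ 0.884
```

Note the +1 inside the log means the stored raw count (39) is one less than the value actually fed to log2 (40); log2(40) / log2(65) ≈ 0.884, matching the computed score.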