Hypotheses
Every research hypothesis, wired to its eval track.
Each card below shows one hypothesis the project is tracking, the eval track that tests it, the latest evidence, and a link to a dedicated page for deeper reading. Every hypothesis also has an entry on /skeptic that holds it to its strongest critique; read that before trusting any claim here.
Labels:
- preliminary observation: we have seen a pattern and named it, but the sample size is too small to call it a finding.
- discussing: we are still working out what the hypothesis even claims.
- operationalized: the hypothesis has a concrete, computable definition.
- not yet tested: no runs have produced data against it.
All tracks, one chart
Highest human-review score per round, per hypothesis track. Solid lines show rounds with real data. Dashed lines are planned tracks with no scored runs yet (content-driven, gad-66; codex runtime, task 89); they exist to make the research plan visible. Read /skeptic before trusting any individual point: N=2-5 runs per condition, one human reviewer, and one task domain so far. Data provenance: values come from EVAL_RUNS[n].humanReviewNormalized.aggregate_score, grouped by round + workflow at prebuild.
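The aggregation behind the chart can be sketched in a few lines. This is a hypothetical illustration, not the project's actual prebuild code: the field names mirror the stated provenance (`humanReviewNormalized.aggregate_score`, grouped by round + workflow), but the `EvalRun` shape and function name are assumptions.

```typescript
// Hypothetical run record; field names follow the provenance note above,
// but this interface is an assumption, not the project's real type.
interface EvalRun {
  round: number;
  workflow: string; // e.g. "bare", "gad", "emergent"
  humanReviewNormalized: { aggregate_score: number };
}

// Highest human-review score per round, per workflow track:
// the per-point value the chart plots for each solid line.
function bestScorePerRound(runs: EvalRun[]): Map<string, Map<number, number>> {
  const byTrack = new Map<string, Map<number, number>>();
  for (const run of runs) {
    const rounds = byTrack.get(run.workflow) ?? new Map<number, number>();
    const score = run.humanReviewNormalized.aggregate_score;
    // Keep only the best score seen for this (workflow, round) pair.
    rounds.set(run.round, Math.max(rounds.get(run.round) ?? -Infinity, score));
    byTrack.set(run.workflow, rounds);
  }
  return byTrack;
}
```

Planned (dashed) tracks simply have no entries in the resulting map, which is why they carry no data points.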
Current hypotheses
For creative implementation tasks, agent performance correlates inversely with framework constraint. Less prescribed structure leads to better output.
Latest evidence
Bare improved monotonically 0.10 → 0.50 → 0.70 → 0.805 across rounds 2-4, while GAD never exceeded 0.30 on human review. N=4 vs N=5, so the difference is not statistically significant. Bare v5 scored the highest ingenuity of round 4.
A coding agent's skill library compounds in value over many rounds as skills are merged and tailored to the project. The emergent workflow should produce monotonically improving results.
Latest evidence
Emergent v4 scored 0.885 (highest round-4 result) after authoring 2 new skills, deprecating 1, and documenting disposition of every inherited skill in CHANGELOG.md. First observed full ratcheting cycle. N=2-3 runs — not enough to claim compounding.
Synthesis of freedom + CSH. A coding agent given blank artifacts, requirements, and the ability to create/merge/find skills against a GAD-provided foundational pool will produce better work over time. Projects are themselves emergent.
Latest evidence
Working synthesis only. It depends on the gad-73 fundamental skills triumvirate (find-skills / merge-skill / create-skill) existing, and that audit is unfinished. The craftsman metaphor is currently doing more of the heavy lifting than the evidence.
Given requirements AND a pre-authored content pack (spells, runes, items, NPCs, dialogue) extracted from prior runs, an agent will produce a more fleshed-out game on top — analogous to making a movie based on a book. Derivative work as a distinct research direction.
Latest evidence
No runs yet. Resolved open question: this becomes its own eval track (escape-the-dungeon-inherited-content) distinct from greenfield emergent so CSH measurements stay clean. Comparison rules are intentionally different — content-pack and greenfield runs do NOT share a rubric.
Constraint intensity on the agent during an eval is a first-class variable. It is now operationalized programmatically (gad-79) as task_pressure = log2(R + 2G + C + 1) / log2(65), computed from the REQUIREMENTS.xml structure.
Latest evidence
Q1 + Q2 resolved 2026-04-09. Formula implemented in prebuild. v5 computed score: 0.884 (R=21, G=4, C=10, raw=39). Stored as TRACE metadata per gad-79, not a rubric dimension. Distinct from game_pressure (in-game player experience).
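The formula and the v5 numbers above can be checked directly. Only the formula itself comes from the source; the function name, signature, and the meaning attached to the parameter names are illustrative.

```typescript
// task_pressure = log2(R + 2G + C + 1) / log2(65), per gad-79.
// r, g, c are the three counts taken from REQUIREMENTS.xml structure;
// this helper is a sketch, not the project's actual prebuild code.
function taskPressure(r: number, g: number, c: number): number {
  return Math.log2(r + 2 * g + c + 1) / Math.log2(65);
}

// v5 inputs from the evidence note: R=21, G=4, C=10.
// raw = R + 2G + C = 21 + 8 + 10 = 39, so the argument is 40.
const v5 = taskPressure(21, 4, 10); // ≈ 0.884
```

Note the +1 inside the log means the stored raw count (39) is one less than the value actually fed to log2 (40); log2(40) / log2(65) ≈ 0.884, matching the computed score.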