Freedom hypothesis

Less framework, better output?

The freedom hypothesis (GAD-D-36) emerged from round 3: the bare condition (no framework, no inherited skills, just AGENTS.md + requirements) kept outscoring the full GAD condition on human review. Scored bare runs have improved monotonically across rounds, while GAD has never exceeded 0.30. This page rolls up the evidence and the caveats, and links to the skeptic critique; read that too before trusting the pattern.

Bare runs: 6 · Playable: 6 · Scored: 4 · Latest score: 0.000

Human review across rounds

Each row is one bare run (rows showing 0.000 are playable runs not yet scored). If the freedom hypothesis is real, the line goes up and to the right. So far the scored runs do, but n=5 is not a curve, and each run targeted a harder requirements version, so the improvement may reflect the requirements getting clearer rather than freedom itself paying off.

Version  Date        Score
v1       2026-04-08  0.100
v2       2026-04-08  0.500
v3       2026-04-08  0.700
v4       2026-04-09  0.000
v5       2026-04-09  0.805
v6       2026-04-10  0.000

Data provenance: scores from EVAL_RUNS[].humanReviewNormalized.aggregate_score or legacy humanReview.score for runs predating the rubric.
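The fallback logic in the provenance note can be sketched in Python. This is illustrative only; the field names come from the note above, and the surrounding data model (a run as a plain dict) is an assumption:

```python
def run_score(run: dict):
    """Return a run's human-review score, preferring the normalized rubric field.

    Falls back to the legacy humanReview.score for runs predating the rubric;
    returns None when the run has not been scored at all.
    """
    normalized = run.get("humanReviewNormalized")
    if normalized and "aggregate_score" in normalized:
        return normalized["aggregate_score"]
    legacy = run.get("humanReview")
    if legacy and "score" in legacy:
        return legacy["score"]
    return None
```

Distinguishing "not yet scored" (None) from a genuine 0.0 score matters when aggregating; a dashboard that renders None as 0.000 will understate the latest run.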

What "bare" means

No framework
Bare runs get AGENTS.md + REQUIREMENTS.xml. No .planning/ XML, no plan-execute-verify loop, no skill library, no subagents. The agent creates its own structure.
Own workflow
The agent authors whatever planning artifacts it finds useful under game/.planning/. Per decision gad-39, all workflow artifacts live there regardless of framework choice.
Contrast with emergent
Bare starts cold every time; emergent starts warm with skills inherited from prior runs. Neither uses a framework, but emergent tests compounding (CSH) while bare tests freedom. See /standards for the Anthropic skills guide and the agentskills.io convention.

Why this is a preliminary observation, not a finding

Skeptic note (read /skeptic for the full critique):

  • N=5 is not a curve. Pure noise produces a strictly increasing 5-point sequence about 1 in 120 times (1/5!); the "monotonic improvement" is exactly the kind of tidy pattern small samples produce by chance.
  • Each bare version targets a harder requirements set. The score improvement may be "requirements got clearer" rather than "bare is better than GAD."
  • Bare and GAD use different AGENTS.md prompts. The "framework" variable is conflated with the "system prompt" variable.
  • GAD's design assumes multi-session work; greenfield single-shot game implementation is not its strength case. We may be testing GAD against the wrong benchmark.
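The base rate in the first bullet can be sanity-checked with a short simulation, independent of any project code: for five i.i.d. scores, a strictly increasing ordering occurs with probability 1/5! ≈ 0.83%.

```python
import random

def monotonic_fraction(n_points: int = 5, trials: int = 200_000, seed: int = 0) -> float:
    """Estimate how often n_points i.i.d. uniform scores come out strictly increasing."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n_points)]
        hits += all(a < b for a, b in zip(xs, xs[1:]))
    return hits / trials
```

The estimate should land near 1/120 ≈ 0.0083; the point is that a clean upward line over five points is weak evidence on its own.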

What would falsify the freedom hypothesis: a round where bare produces a worse game than GAD on the same requirements with N ≥ 3 replicates per condition, OR a different task domain where GAD beats bare. Neither has been run.
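As a sketch of why N ≥ 3 replicates per condition is the floor: with three scored runs per condition, an exact one-sided permutation test over the C(6,3) = 20 label assignments bottoms out at p = 0.05, the weakest design that can reach conventional significance. Illustrative code, not part of the project:

```python
from itertools import combinations

def perm_test_greater(a: list, b: list) -> float:
    """Exact one-sided permutation test: P(mean(a) - mean(b) >= observed) under label shuffling."""
    n, m = len(a), len(b)
    observed = sum(a) / n - sum(b) / m
    pooled = a + b
    count = total = 0
    for idx in combinations(range(n + m), n):
        grp_a = [pooled[i] for i in idx]
        grp_b = [pooled[i] for i in range(n + m) if i not in idx]
        diff = sum(grp_a) / n - sum(grp_b) / m
        count += diff >= observed - 1e-12  # tolerance for float comparison
        total += 1
    return count / total
```

With complete separation between hypothetical score lists, e.g. bare [0.9, 0.8, 0.85] vs GAD [0.1, 0.2, 0.15], this returns 1/20 = 0.05; any overlap pushes p higher, which is why more replicates would be better.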
