Freedom
Less framework, better output?
The freedom hypothesis (GAD-D-36) emerged from round 3: the bare condition — no framework, no inherited skills, just AGENTS.md + requirements — kept outscoring the full GAD condition on human review. Bare has improved monotonically across rounds. GAD never exceeded 0.30. This page rolls up the evidence, the caveats, and a hard link to the skeptic critique — read that too before trusting the pattern.
Human review across rounds
Each row is one bare run. If the freedom hypothesis is real, the line goes up and to the right. So far it does, but N=5 is not a curve, and each run targeted a harder requirements version, so the improvement may reflect the requirements getting clearer rather than freedom itself paying off.
Data provenance: scores from EVAL_RUNS[].humanReviewNormalized.aggregate_score or legacy humanReview.score for runs predating the rubric.
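The fallback logic in the provenance note can be sketched as follows. The record shape is inferred from the field names above, and the `EVAL_RUNS` data here is a hypothetical stand-in, not real run output:

```python
# Pull a human-review score from an eval-run record, preferring the
# normalized rubric field and falling back to the legacy field for
# runs that predate the rubric. Record shape inferred from the
# provenance note; the sample data is hypothetical.
def human_review_score(run: dict):
    normalized = run.get("humanReviewNormalized")
    if normalized is not None:
        return normalized["aggregate_score"]
    legacy = run.get("humanReview")
    return legacy["score"] if legacy is not None else None

EVAL_RUNS = [
    {"humanReviewNormalized": {"aggregate_score": 0.44}},
    {"humanReview": {"score": 0.30}},  # legacy run, pre-rubric
]
print([human_review_score(r) for r in EVAL_RUNS])  # -> [0.44, 0.3]
```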
What "bare" means
Bare runs get no framework and no inherited skills, only AGENTS.md plus the requirements. Per decision gad-39, all workflow artifacts still live in game/.planning/ regardless of framework choice.

Why this is a preliminary observation, not a finding
Skeptic note (read /skeptic for the full critique):
- N=5 is not a curve. A five-run "monotonic improvement" is the kind of pattern random noise produces roughly 1 time in 16.
- Each bare version targets a harder requirements set. The score improvement may be "requirements got clearer" rather than "bare is better than GAD."
- Bare and GAD use different AGENTS.md prompts. The "framework" variable is conflated with the "system prompt" variable.
- GAD's design assumes multi-session work; greenfield single-shot game implementation is not its strength case. We may be testing GAD against the wrong benchmark.
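The 1-in-16 figure in the first bullet follows from one specific null model: treat each of the four round-over-round comparisons as an independent fair coin flip, so all four go up with probability (1/2)^4 = 1/16. A stricter permutation null, where five distinct scores land in uniformly random order, gives 1/120. A stdlib-only sketch of both:

```python
import itertools
from fractions import Fraction

# Null model A: each of the 4 successive comparisons is a fair,
# independent coin flip -> P(all four go up) = (1/2)^4.
p_coinflip = Fraction(1, 2) ** 4
print(p_coinflip)  # -> 1/16

# Null model B (stricter): 5 distinct scores in uniformly random order;
# exactly 1 of the 5! = 120 orderings is fully increasing.
orderings = list(itertools.permutations(range(5)))
increasing = [o for o in orderings if list(o) == sorted(o)]
print(Fraction(len(increasing), len(orderings)))  # -> 1/120
```

Either way the bullet's point stands: with five runs, an all-up pattern is weak evidence on its own.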
What would falsify the freedom hypothesis: a round where bare produces a worse game than GAD on the same requirements with N ≥ 3 replicates per condition, OR a different task domain where GAD beats bare. Neither has been run.
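If that falsifying round is ever run, the head-to-head with N ≥ 3 replicates per condition could be scored with a plain exact permutation test on the difference in mean human-review score. A stdlib-only sketch; the replicate scores below are hypothetical, not real data:

```python
import itertools
from statistics import mean

def perm_test_pvalue(bare, gad):
    """Exact two-sample permutation test on the difference in means.
    Returns the fraction of label reassignments whose mean difference
    is at least as extreme as the observed one (two-sided)."""
    observed = abs(mean(bare) - mean(gad))
    pooled = bare + gad
    n = len(bare)
    count = total = 0
    for idx in itertools.combinations(range(len(pooled)), n):
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            count += 1
        total += 1
    return count / total

# Hypothetical replicate scores, 3 per condition on identical requirements.
bare_scores = [0.48, 0.52, 0.55]
gad_scores = [0.25, 0.28, 0.30]
print(perm_test_pvalue(bare_scores, gad_scores))  # -> 0.1
```

Note that with three replicates per condition the smallest achievable two-sided p-value from this test is 2/20 = 0.1, which is itself a reason to run more than the minimum N.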