Freedom hypothesis

Less framework, better output?

The freedom hypothesis (GAD-D-36) emerged from round 3: the bare condition (no framework, no inherited skills, just AGENTS.md + requirements) kept outscoring the full GAD condition on human review. Scored bare runs have improved monotonically across rounds, while GAD has never exceeded 0.30. This page rolls up the evidence and the caveats, and links to the skeptic critique; read that too before trusting the pattern.

Bare runs: 6 · Playable: 6 · Scored: 4 · Latest score: 0.000

Human review across rounds

Each row is one bare run (rows showing 0.000 are playable runs not yet scored). If the freedom hypothesis is real, the line goes up and to the right. So far the scored runs do, but n=5 is not a curve, and each run targeted a harder requirements version, so the improvement may reflect the requirements getting clearer rather than freedom itself paying off.

Version  Date        Score
v1       2026-04-08  0.100
v2       2026-04-08  0.500
v3       2026-04-08  0.700
v4       2026-04-09  0.000
v5       2026-04-09  0.805
v6       2026-04-10  0.000

Data provenance: scores from EVAL_RUNS[].humanReviewNormalized.aggregate_score or legacy humanReview.score for runs predating the rubric.
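The fallback logic in the provenance note can be sketched in Python. This is illustrative only; the field names come from the note above, and the surrounding data model (a run as a plain dict) is an assumption:

```python
def run_score(run: dict):
    """Return a run's human-review score, preferring the normalized rubric field.

    Falls back to the legacy humanReview.score for runs predating the rubric;
    returns None when the run has not been scored at all.
    """
    normalized = run.get("humanReviewNormalized")
    if normalized and "aggregate_score" in normalized:
        return normalized["aggregate_score"]
    legacy = run.get("humanReview")
    if legacy and "score" in legacy:
        return legacy["score"]
    return None
```

Distinguishing "not yet scored" (None) from a genuine 0.0 score matters when aggregating; a dashboard that renders None as 0.000 will understate the latest run.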

What "bare" means

No framework
Bare runs get AGENTS.md + REQUIREMENTS.xml. No .planning/ XML, no plan-execute-verify loop, no skill library, no subagents. The agent creates its own structure.
Own workflow
The agent authors whatever planning artifacts it finds useful under game/.planning/. Per decision gad-39, all workflow artifacts live there regardless of framework choice.
Contrast with emergent
Bare starts cold every time; emergent starts warm with skills inherited from prior runs. Neither uses a framework, but emergent tests compounding (CSH) while bare tests freedom. See /standards for the Anthropic skills guide and the agentskills.io convention.

Why this is a preliminary observation, not a finding

Skeptic note (read /skeptic for the full critique):

  • N=5 is not a curve. Pure noise produces a strictly increasing 5-point sequence about 1 in 120 times (1/5!); the "monotonic improvement" is exactly the kind of tidy pattern small samples produce by chance.
  • Each bare version targets a harder requirements set. The score improvement may be "requirements got clearer" rather than "bare is better than GAD."
  • Bare and GAD use different AGENTS.md prompts. The "framework" variable is conflated with the "system prompt" variable.
  • GAD's design assumes multi-session work; greenfield single-shot game implementation is not its strength case. We may be testing GAD against the wrong benchmark.
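The base rate in the first bullet can be sanity-checked with a short simulation, independent of any project code: for five i.i.d. scores, a strictly increasing ordering occurs with probability 1/5! ≈ 0.83%.

```python
import random

def monotonic_fraction(n_points: int = 5, trials: int = 200_000, seed: int = 0) -> float:
    """Estimate how often n_points i.i.d. uniform scores come out strictly increasing."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n_points)]
        hits += all(a < b for a, b in zip(xs, xs[1:]))
    return hits / trials
```

The estimate should land near 1/120 ≈ 0.0083; the point is that a clean upward line over five points is weak evidence on its own.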

What would falsify the freedom hypothesis: a round where bare produces a worse game than GAD on the same requirements with N ≥ 3 replicates per condition, OR a different task domain where GAD beats bare. Neither has been run.
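As a sketch of why N ≥ 3 replicates per condition is the floor: with three scored runs per condition, an exact one-sided permutation test over the C(6,3) = 20 label assignments bottoms out at p = 0.05, the weakest design that can reach conventional significance. Illustrative code, not part of the project:

```python
from itertools import combinations

def perm_test_greater(a: list, b: list) -> float:
    """Exact one-sided permutation test: P(mean(a) - mean(b) >= observed) under label shuffling."""
    n, m = len(a), len(b)
    observed = sum(a) / n - sum(b) / m
    pooled = a + b
    count = total = 0
    for idx in combinations(range(n + m), n):
        grp_a = [pooled[i] for i in idx]
        grp_b = [pooled[i] for i in range(n + m) if i not in idx]
        diff = sum(grp_a) / n - sum(grp_b) / m
        count += diff >= observed - 1e-12  # tolerance for float comparison
        total += 1
    return count / total
```

With complete separation between hypothetical score lists, e.g. bare [0.9, 0.8, 0.85] vs GAD [0.1, 0.2, 0.15], this returns 1/20 = 0.05; any overlap pushes p higher, which is why more replicates would be better.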
