Skeptic
Every claim we've made, held to its strongest critique.
Research that doesn't critique itself isn't research. This page is the public commitment to taking our own claims apart. For every hypothesis the project has named — freedom, compound-skills, emergent-evolution, pressure, GAD's value prop — we state the steelman, then the strongest available critique, then the alternatives, then what would falsify it. Then we list concrete moves that would make us more credible.
Source: .planning/docs/SKEPTIC.md. This document gets updated as the critique deepens, not as the hypotheses get more confident.
Critiques that hit every claim
These problems apply to every hypothesis below. They're the structural weaknesses of a one-person research project at low N.
Freedom Hypothesis
Steelman
Across rounds 2-4, bare improved monotonically (0.10 → 0.50 → 0.70 → 0.805) while GAD never exceeded 0.30 across four attempts. GAD spends 7-15% more tokens on the same task. The pattern is consistent.
Problems with the claim
- N=4 vs N=5 is not a curve. The 'monotonic improvement' is exactly the kind of pattern noise produces about 1 time in 16 by chance — coin-flip territory.
- The bare prompt has changed across rounds. Bare v1's AGENTS.md is not bare v5's. Some of bare's improvement is the requirements doc getting clearer, not the framework being absent.
- GAD never finished a round — gate failures or partial completions every time. We compare 'broken games' against 'working games' and call it a framework comparison. It might be a budget or runtime issue.
- GAD and bare may not be the same agent at all. Same model family, but different system prompts produce different behavior. The 'framework' variable is conflated with the 'system prompt' variable.
- GAD's design assumes multi-session work, planning loop survives compaction, decision tracking. Greenfield single-shot game implementation is exactly the workload GAD's design says is NOT the primary use case (gad-74). We chose a benchmark that disadvantages the framework we're testing.
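The 'coin-flip territory' arithmetic can be made explicit. Under a null model where each round-over-round change is equally likely to go up or down, a streak of k consecutive improvements occurs with probability 1/2^k. This is a sketch of the null model only, not a claim about the real score distribution:

```typescript
// Null model: each round-over-round change is an independent coin flip
// (up or down with equal probability). A streak of k consecutive
// improvements then has probability 1/2^k.
function monotoneStreakChance(k: number): number {
  return Math.pow(0.5, k);
}

// Four consecutive improvements under this null model: 1 in 16.
// Rare, but well within "noise produced this" territory at low N.
console.log(monotoneStreakChance(4)); // 0.0625
```

The real distribution of round-over-round changes is unknown, which is the point: we cannot yet distinguish a trend from a lucky streak.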
Alternative explanations for the same data
- Bare's improvement is the requirements getting clearer, not the framework being absent.
- GAD's stagnation is single-condition variance — GAD might score 0.70 on its 5th attempt with no framework changes.
- The bare AGENTS.md happens to be a better prompt than the GAD AGENTS.md, independent of the planning loop.
What would falsify this
- A round where bare produces a worse game than GAD on the same requirements with N≥3 replicates per condition
- A different task domain (web app, CLI) where GAD beats bare with the same setup
- A pre-registered prediction that doesn't pan out
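The first condition presupposes a way to compare two conditions at N≥3. A minimal version is a permutation test on mean rubric scores: how often does a random relabeling of the pooled scores produce a gap at least as large as the observed one? A sketch with placeholder inputs, not our data:

```typescript
// Permutation test for a two-condition comparison at small N.
// The caller supplies per-condition rubric scores; the arrays used
// anywhere with this function would be real replicate scores.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function permutationPValue(a: number[], b: number[], iters = 10_000): number {
  const observed = Math.abs(mean(a) - mean(b));
  const pooled = [...a, ...b];
  let extreme = 0;
  for (let i = 0; i < iters; i++) {
    // Fisher-Yates shuffle of the pooled scores
    const s = [...pooled];
    for (let j = s.length - 1; j > 0; j--) {
      const k = Math.floor(Math.random() * (j + 1));
      [s[j], s[k]] = [s[k], s[j]];
    }
    const gap = Math.abs(mean(s.slice(0, a.length)) - mean(s.slice(a.length)));
    if (gap >= observed) extreme++;
  }
  return extreme / iters;
}
```

At N=3 per condition the smallest achievable p-value is limited by the number of distinct relabelings, so this separates conditions only when the gap is large — which is exactly the bar the falsification condition should set.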
Honest current status
Preliminary observation, single-domain, low-N, post-hoc named. Calling it a 'hypothesis' is generous; calling it a 'finding' would be irresponsible.
Compound-Skills Hypothesis
Steelman
Round 4's emergent v4 authored two new skills (dom-over-kaplay, pressure-forge-coupling), deprecated one (kaplay-scene-pattern), and documented the disposition of every inherited skill in CHANGELOG.md. Its rubric aggregate (0.885 after rescoring) was the highest of any round-4 run. First observed inherit → evaluate → evolve → document cycle.
Problems with the claim
- N=2-3 emergent runs is not a curve. v1, v2, v4 — three points cannot establish a trend.
- Each emergent run targeted a harder requirements version. We do not know whether emergent improved or whether the maintainer got better at writing prompts.
- The 6th rubric dimension (skill_inheritance_effectiveness) is human-rated by the same person who authored the rubric. We use human judgment to measure whether human-authored skills are useful. Circular.
- Inherited 'skills' are project-specific. dom-over-kaplay would be useless for any other game. Is that 'compounding' or just 'specialization that won't generalize'?
- No ablation. We have never run an emergent project that DOESN'T inherit. Until we do, we cannot tell whether emergent v4's score comes from inherited skills or simply from the agent being more capable now.
- CHANGELOG self-report is exactly the trust model gad-69 says we should not rely on. The agent decides what its CHANGELOG entries say.
Alternative explanations for the same data
- Emergent v4's score is an outlier produced by the maintainer playing more carefully on a more polished build.
- Skills don't compound — later runs simply benefit from the underlying agent getting better at games.
- The 6th dimension is rewarding the maintainer's preference for emergent, not measuring inheritance effectiveness.
What would falsify this
- An emergent-no-inherit run that scores comparably to emergent-with-inherit at the same round
- Round 5 emergent scoring lower than round 4 emergent against the harder v5 requirements
- A second human reviewer giving meaningfully different scores
Honest current status
First observation of a ratcheting cycle. Not enough data to claim compounding. The 6th rubric dimension is a measurement of intent to test the hypothesis, not a measurement of the hypothesis being true.
Emergent-Evolution Hypothesis
Steelman
Synthesis hypothesis explaining both the freedom hypothesis and CSH with a single mechanism. The craftsman/lifter metaphor is intuitive. RepoMirror and Ralph Wiggum loop creators independently observed similar dynamics. The in-game rune/spell merge mechanic is a real-world analogue.
Problems with the claim
- It is a synthesis of two not-yet-proven hypotheses, framed as a new claim. The conjunction of two unproven things is LESS credible than either alone.
- The metaphor is doing the heavy lifting. Craftsman, lifter, blacksmith — good stories, not evidence. We have not shown that human craftsmanship dynamics transfer to agent skill libraries.
- The merge-skill primitive does not yet exist. gad-73 names create-skill / merge-skill / find-skills as the foundational triumvirate, but the audit task is unfinished. We claim the framework provides a 'substrate' while at least one of the substrate's three primitives is unbuilt.
- 'Projects are emergent' is unfalsifiable as currently stated. What evidence could convince us a project is NOT emergent? None. That's a vibe, not a hypothesis.
- RepoMirror and Ralph Wiggum loop observations are anecdotes. Suggestive, not evidence. We don't have their data and they don't have ours.
Alternative explanations for the same data
- The emergent workflow is just bare with a few extra files, and its improvement is the same bare-improvement attributed to a different cause.
- The 'evolution substrate' is a metaphor we like, not an observed mechanism.
- The synthesis is post-hoc justification for keeping the GAD framework around after the freedom hypothesis cast doubt on it.
What would falsify this
- A round 5 emergent run that performs WORSE than round 4 emergent against harder requirements
- An emergent project against a different task domain failing to compound
- The triumvirate audit revealing that merge-skill / find-skills don't exist and the framework is not actually providing the substrate we claim
Honest current status
A working synthesis. Useful as a research direction. Not yet a hypothesis with stakes — we have not stated what would make us drop it.
Pressure as a measurable dimension
Steelman
Naming pressure explains a lot of confusing observations: why bare improved monotonically (the maintainer was implicitly raising pressure), why GAD's gate failures clustered in early rounds, why emergent v4 felt qualitatively different from v2. It also gives us a normalization variable for cross-round comparisons.
Problems with the claim
- The formula does not exist. We call pressure 'measurable' while the measurement is currently a hand-typed constant in app/roadmap/roadmap-shared.ts. That is exactly self-report — the same problem gad-69 says we should fight.
- The five sub-dimensions overlap. 'Requirement complexity' and 'constraint density' are nearly synonymous. We have five labels because five sounds satisfying, not because there are five orthogonal axes.
- All five sub-dimensions are author-rated. The agent doesn't know what pressure level it's under. The reviewer doesn't either. Only the requirements author does — the same person rating the dimensions.
- Pressure-tier ratings are post-hoc and predictive simultaneously. Round 5's pressure rating (0.92) is in the file before round 5 has run. We will then evaluate round 5 against that prediction, creating circular validation.
- There is no validation step. Even if we compute pressure programmatically, we have no way to check whether it matches agent-experienced pressure.
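For concreteness, here is the shape a programmatic pressure score would likely take: five sub-dimension ratings in [0, 1] combined into one number. Only requirementComplexity and constraintDensity are named in the critique above; the other three field names are assumptions, and the real formula does not exist yet — which is the critique:

```typescript
// Hypothetical pressure score: a plain mean of five sub-dimension
// ratings in [0, 1]. Field names beyond the first two are invented
// placeholders; the project has not defined the actual formula.
interface PressureInputs {
  requirementComplexity: number; // named in the critique above
  constraintDensity: number;     // named in the critique above
  noveltyVsPriorRounds: number;  // hypothetical
  ambiguityTolerance: number;    // hypothetical
  evaluationStrictness: number;  // hypothetical
}

function pressureScore(p: PressureInputs): number {
  const vals = Object.values(p);
  return vals.reduce((a, b) => a + b, 0) / vals.length;
}
```

Even this trivial version makes the overlap problem concrete: if requirementComplexity and constraintDensity are near-synonyms, the mean double-counts one axis, so choosing weights (or dropping a dimension) would itself need justification.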
Alternative explanations for the same data
- Pressure is just requirement word count.
- Pressure is just the maintainer's intuition about how hard a round felt.
- Pressure is a retrofit narrative justifying why early rounds scored lower.
What would falsify this
- Round 5 produces results inconsistent with the predicted pressure tier
- A second researcher rating the same requirements gives meaningfully different pressure scores
- A programmatic pressure score correlates poorly with the hand-rated one
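The third condition is checkable with nothing more than Pearson correlation between the programmatic scores and the hand-rated ones across rounds — a sketch of the check, assuming we eventually have both series:

```typescript
// Pearson correlation between two equal-length score series, e.g.
// programmatic pressure scores vs hand-rated ones across rounds.
// A low r would falsify the claim that the hand ratings measure
// the same thing the formula measures.
function pearson(xs: number[], ys: number[]): number {
  const n = xs.length;
  const mx = xs.reduce((a, b) => a + b, 0) / n;
  const my = ys.reduce((a, b) => a + b, 0) / n;
  let cov = 0, vx = 0, vy = 0;
  for (let i = 0; i < n; i++) {
    cov += (xs[i] - mx) * (ys[i] - my);
    vx += (xs[i] - mx) ** 2;
    vy += (ys[i] - my) ** 2;
  }
  return cov / Math.sqrt(vx * vy);
}
```

With only a handful of rounds the correlation itself is noisy, so the threshold for 'correlates poorly' should be stated before the programmatic score exists, not after.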
Honest current status
A useful conceptual lens. Not yet a measurement. The current presentation in /roadmap overstates how operational it is.
GAD's value proposition
Steelman
The freedom hypothesis suggests GAD doesn't beat bare on creative implementation. So we positioned GAD's value elsewhere: durable in-repo state, decisions auditable, fork-and-go, the eval framework as the load-bearing feature. This is more honest than the original framing.
Problems with the claim
- We have no evidence GAD does task management at scale BETTER than alternatives. Linear, Notion, GitHub Issues, plain markdown files — any of them can hold tasks in-repo or near-repo. We claim 'in-repo' is a differentiator without showing it improves outcomes.
- 'Forkable + no SaaS' is true of any text-files-in-git system. RepoPlanner, GSD, Aider all qualify. What does GAD add over `cat .planning/state.md`? A CLI, an XML schema, a snapshot command. That's not nothing, but it's not a moat.
- The eval framework IS load-bearing — but we measure ourselves with our own framework. Circular validation. There is no external benchmark.
- 'Skill security' is a future commitment, not a current feature. /security correctly says we don't host third-party skills and the certification model is a research direction. The third leg of the value prop is aspirational.
- The pivot from 'ship software' to 'evaluate agents' is post-hoc. It happened after round 3 made the original framing untenable. We renamed instead of shipping the original. Valid response, but should be acknowledged.
Alternative explanations for the same data
- 'GAD is a research notebook for one person' — true and honest, smaller audience claim.
- 'GAD is an opinionated convention for keeping decisions in repo' — true, but doesn't justify the framework overhead.
- 'GAD is an experiment in whether task management improves agent reliability' — accurate, lower stakes.
What would falsify this
- A team using a non-in-repo task system (Linear) demonstrably runs better evals
- The eval framework reaches 10+ external contributors and we still have to defend its construction
Honest current status
The new value prop is more defensible than the old one. Still partially aspirational (skill security) and partially circular (the eval framework validates itself).
What would make us more credible
Concrete moves, ranked by how much they'd actually move the needle. The top three are doable in the next session if we choose to prioritize credibility over feature velocity.
How this page is used
- Publishing a finding: read the relevant hypothesis section first. If the finding doesn't survive the critique, soften the claim.
- Designing a new round: check the falsification conditions. If the round can't produce data that would falsify a claim, the round is testing something else.
- When confident: point yourself here. Confidence in early-stage research is the most dangerous failure mode.
- When a reader asks "how do I know this is real?": point them here. The credibility move is admitting what we don't know.