Methodology
Every formula, every weight, every cap.
This page is the appendix. Every number on the site — every bar, every composite, every "gate passed" badge — traces back to one of the formulas below. If you want to verify a run yourself, pull its TRACE.json from GitHub and run the math from here.
Composite formula
The weighted sum
The composite score is a plain weighted sum of dimension scores. Every dimension is normalised to 0.0 – 1.0 before the multiply. The weights are project-specific and committed to evals/<project>/gad.json.
composite = Σᵢ (scoreᵢ × weightᵢ), where i ranges over the project's dimensions
Weights sum to 1.0 across a project's dimensions. A run can max out at 1.0; the minimum is 0.0 (modulo the low-score cap below).
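As a sketch in code (the type names here are illustrative, not the real generated record types):

```typescript
// Illustrative shapes — the real records are emitted by the prebuild step.
type Scores = Record<string, number>;  // each dimension normalised to 0.0–1.0
type Weights = Record<string, number>; // sum to 1.0 per project (from gad.json)

// Plain weighted sum: composite = Σᵢ scoreᵢ × weightᵢ.
function composite(scores: Scores, weights: Weights): number {
  return Object.entries(weights).reduce(
    (sum, [dim, w]) => sum + (scores[dim] ?? 0) * w,
    0,
  );
}
```

A run that is missing a dimension contributes 0 for it, which is consistent with the stated floor of 0.0.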
Weights per eval project
Different eval projects weight different dimensions. A tooling eval might care most about time efficiency; an implementation eval weighs human review at 30% to prevent process metrics from rescuing a broken artifact.
Per-project weight tables
17 eval projects · same filters as project market / playable (collapsed by default)
escape-the-dungeon

| Dimension | Weight |
|---|---|
| human_review | 0.30 |
| requirement_coverage | 0.15 |
| planning_quality | 0.15 |
| per_task_discipline | 0.15 |
| skill_accuracy | 0.10 |
| time_efficiency | 0.05 |

escape-the-dungeon-bare

| Dimension | Weight |
|---|---|
| human_review | 0.30 |
| requirement_coverage | 0.20 |
| implementation_quality | 0.20 |
| workflow_emergence | 0.15 |
| iteration_evidence | 0.10 |
| time_efficiency | 0.05 |

escape-the-dungeon-emergent

| Dimension | Weight |
|---|---|
| human_review | 0.30 |
| requirement_coverage | 0.20 |
| implementation_quality | 0.15 |
| skill_reuse | 0.15 |
| workflow_quality | 0.10 |
| iteration_evidence | 0.05 |
| time_efficiency | 0.05 |

escape-the-dungeon-gad-emergent

| Dimension | Weight |
|---|---|
| human_review | 0.30 |
| requirement_coverage | 0.20 |
| implementation_quality | 0.15 |
| skill_reuse | 0.15 |
| workflow_quality | 0.10 |
| iteration_evidence | 0.05 |
| time_efficiency | 0.05 |

escape-the-dungeon-planning-only

| Dimension | Weight |
|---|---|
| human_review | 0.30 |
| requirement_coverage | 0.20 |
| implementation_quality | 0.15 |
| skill_reuse | 0.15 |
| workflow_quality | 0.10 |
| iteration_evidence | 0.05 |
| time_efficiency | 0.05 |

etd-brownfield-bare

| Dimension | Weight |
|---|---|
| human_review | 0.30 |
| requirement_coverage | 0.20 |
| implementation_quality | 0.20 |
| workflow_emergence | 0.15 |
| iteration_evidence | 0.10 |
| time_efficiency | 0.05 |

etd-brownfield-emergent

| Dimension | Weight |
|---|---|
| human_review | 0.30 |
| requirement_coverage | 0.20 |
| implementation_quality | 0.15 |
| skill_reuse | 0.15 |
| workflow_quality | 0.10 |
| iteration_evidence | 0.05 |
| time_efficiency | 0.05 |

etd-brownfield-gad

| Dimension | Weight |
|---|---|
| human_review | 0.30 |
| requirement_coverage | 0.15 |
| planning_quality | 0.15 |
| per_task_discipline | 0.15 |
| skill_accuracy | 0.10 |
| time_efficiency | 0.05 |

eval-skill-install-eval

| Dimension | Weight |
|---|---|
| human_review | 0.35 |
| install_success | 0.20 |
| comparison_valid | 0.20 |
| cli_usage | 0.15 |
| preservation_complete | 0.10 |

gad-explainer-video

| Dimension | Weight |
|---|---|
| requirement_coverage | 0.20 |
| video_polish | 0.20 |
| implementation_quality | 0.15 |
| pedagogical_clarity | 0.15 |
| human_review | 0.15 |
| workflow_quality | 0.10 |
| time_efficiency | 0.05 |

gad-explainer-video-bare

| Dimension | Weight |
|---|---|
| requirement_coverage | 0.20 |
| video_polish | 0.20 |
| implementation_quality | 0.15 |
| pedagogical_clarity | 0.15 |
| human_review | 0.15 |
| workflow_quality | 0.10 |
| time_efficiency | 0.05 |

gad-explainer-video-emergent

| Dimension | Weight |
|---|---|
| requirement_coverage | 0.20 |
| video_polish | 0.20 |
| implementation_quality | 0.15 |
| pedagogical_clarity | 0.15 |
| human_review | 0.15 |
| workflow_quality | 0.10 |
| time_efficiency | 0.05 |

gad-skill-creator-eval

| Dimension | Weight |
|---|---|
| human_review | 0.35 |
| skill_quality | 0.25 |
| eval_scaffolded | 0.15 |
| cli_usage | 0.15 |
| attribution_tagged | 0.10 |

reverse-engineer-eval

| Dimension | Weight |
|---|---|
| human_review | 0.30 |
| requirements_completeness | 0.25 |
| requirements_accuracy | 0.20 |
| build_success | 0.15 |
| functional_fidelity | 0.10 |

skill-evaluation-app

| Dimension | Weight |
|---|---|
| human_review | 0.55 |
| requirement_coverage | 0.15 |
| implementation_quality | 0.15 |
| workflow_quality | 0.10 |
| time_efficiency | 0.05 |

skill-evaluation-app-bare

| Dimension | Weight |
|---|---|
| human_review | 0.55 |
| requirement_coverage | 0.15 |
| implementation_quality | 0.15 |
| workflow_quality | 0.10 |
| time_efficiency | 0.05 |

skill-evaluation-app-emergent

| Dimension | Weight |
|---|---|
| human_review | 0.55 |
| requirement_coverage | 0.15 |
| implementation_quality | 0.15 |
| workflow_quality | 0.10 |
| time_efficiency | 0.05 |
Gate logic
Gates override everything
Starting with requirements v2, some criteria are marked gate="true". If any gate fails, requirement_coverage collapses to 0. This is how a run that "ticks most boxes" can still score near zero on the mechanical dimension — because one gate (e.g. G1 game loop softlocks) makes the rest meaningless.
v1 runs (pre-gates) show a pre-gate requirements badge on their per-run pages instead of a pass/fail, because the concept didn't exist yet. v3 introduced four explicit gates (game loop, spell crafting, UI quality); v4 added a fifth (pressure mechanics).
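The gate rule can be sketched as below. The real check runs in the prebuild pipeline; the field names are illustrative, and the non-gated formula (fraction of criteria met) is an assumption about how coverage is computed when all gates pass.

```typescript
// Illustrative criterion shape — mirrors the gate="true" attribute in
// requirements v2+, but is not the real parsed record type.
interface Criterion {
  id: string;   // e.g. "G1"
  met: boolean;
  gate: boolean;
}

// If any gate fails, requirement_coverage collapses to 0.
// Otherwise (assumption): the fraction of criteria met.
function requirementCoverage(criteria: Criterion[]): number {
  if (criteria.some((c) => c.gate && !c.met)) return 0;
  if (criteria.length === 0) return 0;
  return criteria.filter((c) => c.met).length / criteria.length;
}
```

This is why a run that ticks most boxes but softlocks its game loop scores 0 on this dimension: one failed gate zeroes everything.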
Low-score caps (v3+)
Caps are layered on top of the weighted sum to prevent a broken run from reaching respectable territory on time-efficiency alone.

| Weighted sum below | Final composite capped at | Reason |
|---|---|---|
| 0.20 | 0.40 | Prevents near-zero runs from being falsely rescued by time-efficiency bonuses. |
| 0.10 | 0.25 | Reserved for runs that barely produced anything. Such a run still appears in the results set, but is clearly distinct from a mid-tier run. |
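Read as code, one interpretation of the cap table (assuming the cap applies to the final composite after any bonuses, with the more severe threshold checked first):

```typescript
// Sketch only — one reading of the v3+ low-score caps, not the real
// implementation. `weightedSum` is the base composite; `withBonuses` is
// the score after any time-efficiency bonus.
function applyCaps(weightedSum: number, withBonuses: number): number {
  if (weightedSum < 0.10) return Math.min(withBonuses, 0.25);
  if (weightedSum < 0.20) return Math.min(withBonuses, 0.40);
  return withBonuses;
}
```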
Data production pipeline
Raw → structured → derived → insight
The eval framework's primary output is structured data, not scores. Scores are one kind of derived number; the framework also produces rubrics, automated gate checks, derived metrics from trace events, and cross-run aggregates. The four stages below are how raw run artifacts become insights on this site.
Raw: Every eval run produces a TRACE.json sidecar, a session.jsonl (Claude Code), a git log with per-task commits, and a dist/ build. Phase 25 adds .trace-events.jsonl for hook-captured tool/skill/subagent events. These are the primary sources — nothing is ever recomputed from anything upstream.
Examples: TRACE.json · session.jsonl · git log · dist/
Structured: The prebuild script reads raw artifacts and emits typed records: EvalRunRecord, CatalogSkill, RequirementsVersion, PlanningState, ProducedArtifacts. The schema is versioned so old runs parse cleanly alongside new ones. This is what the site consumes — no client-side parsing.
Examples: lib/eval-data.generated.ts · lib/catalog.generated.ts
Derived: Computed from structured records: composite scores, divergence (composite vs human review), commit rhythm, plan-adherence delta, tool-use mix (phase 25+), skill-to-tool ratio, produced-artifact density. Each derived number has a formula traceable back to its inputs — no magic aggregates.
Examples: scores.composite · divergence_score · plan_adherence_delta
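For instance, divergence is defined here as composite vs human review, which suggests a simple difference; the sign convention below is an assumption for illustration.

```typescript
// Sketch of divergence_score: how far the mechanical composite drifts
// from human review. Assumed convention: positive means the composite is
// more generous than the human reviewer.
function divergenceScore(composite: number, humanReview: number): number {
  return composite - humanReview;
}
```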
Insight: Cross-run queries answer specific research questions. Charts are shaped around the question, not around the shape of the data. Phase 27 adds /insights with curated query cards and gad eval query for custom drilling. Every chart's caption is the question it answers — the number is just evidence.
Examples: freedom hypothesis scatter · rubric radar · insight cards
Objective vs subjective today
Most of what we measure today is objective (counts, durations, coverage ratios, commit rhythm). A few load-bearing measurements are still subjective — human review is a single number set by a reviewer who "felt like it was mid," and gate pass/fail depends on a human opening the built game and playing it. Phase 27 is the research-methodology work that makes those measurements structured: human review gets a per-dimension rubric, gate checks get Playwright automation, and derived metrics get exposed via gad eval query so we can ask cross-run questions like "which runs used the forge room more than 3 times?"
The methodology discipline is captured in the objective-eval-design skill: every measurement must answer a specific research question, expose its inputs, be comparable across runs, and be decomposable. A number that fails any of those tests isn't ready to publish. See /standards for the Anthropic skills guide + agentskills.io convention that governs how individual skills are authored and evaluated.
Agent runtimes
Which coding agents can produce trace v4 data
Trace schema v4 (phase 25) needs to capture every tool call, skill invocation, and subagent spawn with inputs, outputs, and timestamps. The only reliable way to get that data is from inside the coding agent's runtime via hooks or callbacks. Agents without a hook runtime are explicitly unsupported for GAD evaluation — we're not going to screen-scrape stdout. Decision gad-53 pins this.
| Agent | Hook runtime | Trace v4 support | Notes |
|---|---|---|---|
| Claude Code | PreToolUse / PostToolUse hooks via settings.json | supported | First-class support. Hooks run before and after every tool call; session.jsonl captures the full invocation stream. Phase 25 writes a hook handler that emits trace v4 events directly. |
| Aider | Python callbacks + chat history export | supported | Supported via converter. Python API exposes on_message / on_tool_call style callbacks; the existing chat history file is parseable for after-the-fact conversion. Future sub-phase. |
| Continue.dev | VS Code extension API (onToolCall, onChatUpdate) | supported | Supported via converter. Extension hosts expose tool-call events; we'd ship a small extension-side emitter that writes trace v4 to disk. Future sub-phase. |
| OpenAI Codex CLI | Structured stream output (Running/Ran prefixes) | supported | Supported via stream parser. Codex's terminal output format is line-delimited with recognisable prefixes (Running ..., • Ran ..., └ <output>). Lossier than hooks because reasoning text interleaves with tool calls and rate limits can truncate. Future sub-phase. |
| Cursor | Closed-source, no public hook API | unsupported | No way to trace from inside the editor. The only access is through the chat panel which has no tool-call visibility. Not supported until Cursor exposes a hook runtime. |
| Vanilla ChatGPT / Claude.ai web | None | unsupported | Web interfaces have no tool access and no extension points. Fundamentally the wrong shape of tool for the kind of work we're evaluating. |
Multi-agent support (decision gad-55)
Agents beyond Claude Code are supported through converters, not through per-agent trace code. A converter reads the target agent's native session format and emits GAD trace schema v4. The same /runs/[project]/[version] page renders it. Codex's Running/Ran stream format is parseable but requires streaming detection; Aider's Python callbacks are straightforward to hook into. Phase 25 ships the Claude Code converter first; Codex and Aider converters are future sub-phases if and when we want to run cross-agent comparisons.
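A converter's shape might look like the sketch below. Everything here is illustrative — the real trace v4 field set is defined by phase 25, and the toy parser only recognises bare `Running` lines, not the `• Ran ...` and `└ <output>` forms a real Codex converter would also need.

```typescript
// Illustrative event shape; the real schema is trace v4 (phase 25).
interface TraceV4Event {
  kind: "tool_call" | "skill_invocation" | "subagent_spawn";
  name: string;
  timestamp: string; // ISO 8601
}

// Toy Codex-style stream parser: lines prefixed "Running " become
// tool_call events. Streams are lossier than hooks — reasoning text
// interleaves with tool calls, so anything unrecognised is dropped.
function parseCodexStream(stream: string, at: Date = new Date()): TraceV4Event[] {
  return stream
    .split("\n")
    .filter((line) => line.startsWith("Running "))
    .map((line) => ({
      kind: "tool_call" as const,
      name: line.slice("Running ".length).trim(),
      timestamp: at.toISOString(),
    }));
}
```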
Worked examples
Two runs, end to end
Two runs picked as walkthroughs — one process-vs-reality divergence, one highest-scoring bare run. Click through for the full per-run view with the formula breakdown.
Better particle effects on main menu and better colors than previous GAD runs. However, crafting system broke the game when used (unusable). Old ASCII text design for map/spells/bags menus. Hard to read text. Added icons but didn't search for sourced sprites. 0 commits — rate limit hit before agent could finalize. Score 0.20: has some visual improvements but broken crafting gates it.
Full breakdown →

Best UI/UX of all eval runs by far. Most enjoyable and playable. Functional game loop with combat and dialogue. Missing: floor progression after boss (can grind same floor), no clear spell crafting path. Regressed on commit discipline under pressure (1 giant commit vs v2's 6). Score 0.70: most enjoyable game across all experiments.

Full breakdown →

Greenfield → brownfield lineage
Brownfield evals branch from a specific greenfield run's preserved output. The agent starts with the greenfield's source code and extends it against new or expanded requirements. This tests the same 5 hypotheses (bare / planning-only / GAD / emergent / GAD+emergent) but for code extension instead of creation. Decision gad-90 formalizes the lineage model.
Field lineage: nodes are the latest run per eval project from EVAL_RUNS. Brownfield baselines read from each project's gad.json baseline field. Edges show the source-code inheritance path.
What each condition template contains
Transparency about what the eval agent receives. Each column is one condition. ✓ means the file is present in the template; — means absent. This is the full input set — the agent sees nothing else.
| File | Bare | Planning | GAD | Emergent | GAD+Emrg |
|---|---|---|---|---|---|
| AGENTS.md | ✓ | ✓ | ✓ | ✓ | ✓ |
| REQUIREMENTS.xml | ✓ | ✓ | ✓ | ✓ | ✓ |
| .planning/ROADMAP.xml | — | ✓ | ✓ | — | ✓ |
| .planning/TASK-REGISTRY.xml | — | ✓ | ✓ | — | ✓ |
| .planning/DECISIONS.xml | — | ✓ | ✓ | — | ✓ |
| .planning/STATE.xml | — | ✓ | ✓ | — | ✓ |
| skills/ (bootstrap: 2) | ✓ | ✓ | — | — | — |
| skills/ (GAD: 10) | — | — | ✓ | — | ✓ |
| skills/ (inherited: 6) | — | — | — | ✓ | ✓ |
| GAD CLI available | — | — | ✓ | — | ✓ |
| Total skills | 2 | 2 | 10 | 6 | 16 |
Source: the template/ directory of each eval project under evals/escape-the-dungeon*/. This table shows the greenfield setup. Brownfield conditions additionally receive the preserved source code from their baseline greenfield run.
Submit a review
Each project ships with a rubric. Score a run against it with the gad eval review CLI — the weighted aggregate lands in that run's TRACE.json automatically.
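If you want to sanity-check an aggregate before submitting, it is a weighted sum over the rubric dimensions. The sketch below defaults to uniform weights purely for illustration — the real per-dimension weights live in each project's rubric definition, not here.

```typescript
// Sketch only: uniform weights are a placeholder, not the real rubric
// weighting. Scores are the 0.0–1.0 values passed via --rubric.
function rubricAggregate(
  scores: Record<string, number>,
  weights?: Record<string, number>,
): number {
  const dims = Object.keys(scores);
  if (dims.length === 0) return 0;
  const w =
    weights ?? Object.fromEntries(dims.map((d) => [d, 1 / dims.length]));
  return dims.reduce((sum, d) => sum + scores[d] * (w[d] ?? 0), 0);
}
```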
CLI
gad eval review escape-the-dungeon v<N> \
--rubric '{"playability": 0.80, "ui_polish": 0.80, "mechanics_implementation": 0.80, "ingenuity_requirement_met": 0.80, "stability": 0.80}' \
--notes "what landed, what broke, what surprised you"See every run for this projectCLI
gad eval review escape-the-dungeon-bare v<N> \
--rubric '{"playability": 0.80, "ui_polish": 0.80, "mechanics_implementation": 0.80, "ingenuity_requirement_met": 0.80, "stability": 0.80}' \
--notes "what landed, what broke, what surprised you"See every run for this projectCLI
gad eval review escape-the-dungeon-emergent v<N> \
--rubric '{"playability": 0.80, "ui_polish": 0.80, "mechanics_implementation": 0.80, "ingenuity_requirement_met": 0.80, "stability": 0.80, "skill_inheritance_effectiveness": 0.80}' \
--notes "what landed, what broke, what surprised you"See every run for this projectCLI
gad eval review escape-the-dungeon-gad-emergent v<N> \
--rubric '{"playability": 0.80, "ui_polish": 0.80, "mechanics_implementation": 0.80, "ingenuity_requirement_met": 0.80, "stability": 0.80, "skill_inheritance_effectiveness": 0.80}' \
--notes "what landed, what broke, what surprised you"See every run for this projectCLI
gad eval review escape-the-dungeon-planning-only v<N> \
--rubric '{"playability": 0.80, "ui_polish": 0.80, "mechanics_implementation": 0.80, "ingenuity_requirement_met": 0.80, "stability": 0.80}' \
--notes "what landed, what broke, what surprised you"See every run for this projectCLI
gad eval review gad-explainer-video v<N> \
--rubric '{"pedagogical_clarity": 0.80, "video_polish": 0.80, "accuracy": 0.80, "scope_fit": 0.80, "stability": 0.80}' \
--notes "what landed, what broke, what surprised you"See every run for this projectCLI
gad eval review reverse-engineer-eval v<N> \
--rubric '{"requirements_completeness": 0.80, "requirements_accuracy": 0.80, "build_success": 0.80, "functional_fidelity": 0.80, "presentation": 0.80, "skill_quality": 0.80}' \
--notes "what landed, what broke, what surprised you"See every run for this projectCLI
gad eval review skill-evaluation-app v<N> \
--rubric '{"ui_usability": 0.80, "requirements_ergonomics": 0.80, "harness_integration": 0.80, "visualization_quality": 0.80, "stability": 0.80}' \
--notes "what landed, what broke, what surprised you"See every run for this projectOpen questions
The unresolved questions about the hypothesis, evaluation approach, and framework — public backlog of what is still being worked out.
evaluation
Decision gad-75 names pressure as a first-class eval dimension with five sub-dimensions (requirement complexity, ambiguity, constraint density, iteration budget, failure cost). The dimensions are named; the formula is not. Open questions: (a) is pressure a single aggregate score or a 5-tuple? (b) is it self-rated by the requirements author or programmatically extracted? (c) if programmatic, which fields in REQUIREMENTS.xml feed which sub-dimension? (d) does pressure live on requirements (per version) or on runs (per execution), or both? (e) how do we validate that our pressure rating matches agent-experienced pressure — probably by correlating rating against tool_uses-to-failure ratio. Candidate first-pass: self-rated 0.0-1.0 per sub-dimension on each requirements version (stored in REQUIREMENTS-VERSIONS.md and in a new <pressure-profile> block in REQUIREMENTS.xml), aggregated as a weighted sum, displayed on the /roadmap timeline as a pressure-tier progression across rounds.
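The candidate first-pass could be sketched as below. The sub-dimension names come from gad-75; the equal weights are placeholders, because the aggregation formula is explicitly undecided.

```typescript
// Five sub-dimensions from decision gad-75, each self-rated 0.0–1.0.
interface PressureProfile {
  requirement_complexity: number;
  ambiguity: number;
  constraint_density: number;
  iteration_budget: number;
  failure_cost: number;
}

// Placeholder weights — equal until the real formula is decided.
const EQUAL_WEIGHTS: PressureProfile = {
  requirement_complexity: 0.2,
  ambiguity: 0.2,
  constraint_density: 0.2,
  iteration_budget: 0.2,
  failure_cost: 0.2,
};

// Weighted sum across sub-dimensions, per the candidate first-pass.
function pressureScore(
  p: PressureProfile,
  w: PressureProfile = EQUAL_WEIGHTS,
): number {
  return (Object.keys(p) as (keyof PressureProfile)[]).reduce(
    (sum, k) => sum + p[k] * w[k],
    0,
  );
}
```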
Per gad-69, every eval metric must answer 'can this be collected programmatically?' before 'how do we score it?'. We currently rely heavily on human review + agent self-report for: did the skill actually load, did the gate pass, is the forge integrated with pressure, is the game beatable. A programmatic-eval GAPS audit is queued as task 83 / task 96. The output should be .planning/GAPS.md ranking gaps by (a) how much human/agent judgment they currently require, (b) how mechanically checkable they actually are, (c) which phase picks them up. Highest candidates: playwright smoke tests for G1-G4 gates, hook-captured skill-trigger events, build/test exit codes as stability signals.
Pressure is CURRENTLY framed in gad-75 as metadata about the test conditions, not a measurement of the result. But there's a parallel interpretation: pressure could be scored — 'did the agent handle the pressure well'. That scoring collapses pressure into the existing rubric as a new dimension. The two framings answer different questions: metadata-pressure lets us normalize cross-round comparisons ('X scored 0.8 at pressure 0.6 vs Y scored 0.7 at pressure 0.9'), scored-pressure lets us rank agents on pressure-handling ability directly. Probably both are useful and we need both — one stored in REQUIREMENTS/TRACE metadata and one computed as a derived dimension. But committing to one framing first matters for the /roadmap page design and for any comparison visualization.
Per decision gad-84, Claude Code's hook system exposes tool calls but not thinking blocks or inter-tool message text. Those artifacts exist in session.jsonl. Open research question: can we extract them post-hoc and derive quality signal from them? Candidate metrics — thought-to-tool-use interval, thought length variance, ratio of 'exploration' to 'execution' phrases, thought-before-action vs thought-after-action ordering. All hypothetical until we have the extraction pipeline and look at real data. Feasibility assessment: high for extraction, unknown for signal.
Emergent v4 aggregate: 0.885. Bare v5 aggregate: 0.805. The 0.08 delta maps almost exactly to emergent's 6th-dimension contribution (skill_inheritance_effectiveness 0.95 * 0.20 weight = +0.19 against bare's absent-dimension 0). Shared-dimension comparison: emergent +0.18 playability, +0.10 ui_polish, tie on mechanics/ingenuity/stability. Is that 6th dimension double-counting emergent's inheritance advantage? Arguably yes — CSH is being tested by the existence of that dimension. Counter: without it, there's no way to score whether inheritance is actually working, which is the whole point of the emergent condition.
framework
Per decision gad-77, before the /contribute page ships we need to verify: (a) cloning the repo into a new directory results in a working agent environment (skills in .claude/, agents installed, commands available), (b) opening the repo in Claude Code surfaces the GAD skills to the agent without additional setup, (c) a conversational request like 'run the escape-the-dungeon-bare eval' actually works with no manual snapshot or XML editing. This is an untested assumption right now — we've been running agents in the canonical development repo where everything is already wired up. A fresh clone might hit missing .claude/settings.json entries, missing hook handler paths, missing env vars, or installer bugs. Until this is tested, /contribute is vaporware.
User's vision (decision gad-73): GAD provides three fundamental skills as the foundation of emergent evolution. find-skills locates a trusted GAD fundamental (e.g. 'scientific method', 'debug'). merge-skill fuses that fundamental into a project-tailored skill (e.g. 'scientific-method-for-kaplay-rune-spellcrafting'). create-skill authors genuinely new ones when no merge candidate exists. This triumvirate IS the in-game rune/spell merging mechanic made meta. We need to audit what exists today in skills/ — create-skill likely already exists, merge-skill and find-skills may not. Reference: https://skills.sh/vercel-labs/skills/find-skills. GAD's version scopes to trusted ecosystem initially. Bigger ambition: prove skill effectiveness + provide skill security.
User wants GAD to eventually provide a trust model for skills — how can you tell if a skill you're about to inherit is safe, effective, and actually improves anything? Initial thoughts: (a) frontmatter signing / checksum, (b) provenance lineage (which run authored it, which rubric scores validated it, which other runs inherited it successfully), (c) automated review against Anthropic's skills guide (gad-70), (d) sandboxed trial run in a throwaway worktree before trusting. Distinct from typosquatting defense (which lives in the planned /security page). This is about effectiveness + integrity, not name collisions.
game design
Bare v5 playtest surfaced a preference for rule-based simulated combat (loadout + spells + stats + action policies + initiative, chess-like positioning) over direct-control. R-v5.13 captures this as 'Model A preferred unless implementation exception granted.' Open question: is this the right call for round 5? Rule-based is harder to implement correctly in a single eval run and may advantage GAD's planning overhead (which would invalidate the freedom comparison if it's the reason bare underperforms next round).
hypothesis
Round 4 produced a provocative finding: Emergent v4 scored 0.885 (with the 6-dim rubric including skill_inheritance_effectiveness), Bare v5 scored 0.805 (5-dim rubric), and GAD v10 was api-interrupted. On shared rubric dimensions (playability, ui_polish, mechanics, ingenuity, stability) Emergent beat Bare on playability and UI polish, tied on mechanics/ingenuity/stability. The freedom hypothesis (gad-36) says bare beats framework on creative output. The compound-skills hypothesis (gad-65) says emergent-with-evolution compounds over rounds. Round 4 may be the first evidence that CSH is overtaking freedom — BUT it's one round, rubric reweighting matters, and Emergent inherited from Bare so it has freedom-hypothesis lineage baked in. Need more rounds against v5 requirements to disambiguate.
site
Task 90 queues ASSUMPTIONS.md. Without a documented target user, IA refactor decisions (task 84), skills directory UX (task 85), and landing-page framing will be ad-hoc. Candidates: coding-agent researchers, framework authors, indie devs exploring skill evolution, enterprise teams evaluating agent frameworks. Each implies a different navigation priority.
Current nav already has 11 items (GAD, Lineage, Methodology, Rubric, Results, Graphs, Videos, Catalog, Findings, Planning, and now + Rubric). Tasks 86/87/88/95 will add more. Task 84 plans dropdown grouping + keyword search before more pages ship. Open: what is the grouping that makes sense to a first-time visitor vs a returning researcher?
tooling
GAD v10 was api-interrupted twice by HTTP 529 (once at tool_uses=18, then 55). This is distinct from account rate limits (gad-64). Per STATE.xml, investigation is queued before GAD v11 retry. Without understanding whether 529 correlates with high planning overhead (GAD-specific), payload size, tool-use frequency, or time of day, we can't plan round 5 reliably.
Resolved
What used to be open
Do eval worktrees actually inherit state from the parent monorepo, and does that contaminate the results?
PASSES with explicit allowlist. Audit written at .planning/docs/ISOLATION-AUDIT-2026-04-10.md. Key findings: (1) git worktrees have separate working trees, so .agents/skills/ and .planning/ do NOT inherit — the bare condition stays clean of framework skills, (2) .claude/settings.json IS inherited by Claude Code worktrees at .claude/worktrees/agent-*/ (the settings search walks up), but this is acceptable because the only settings are the trace hook handler which is instrumentation not framework assistance, (3) gad eval run creates worktrees at os.tmpdir() which is entirely outside the repo — even cleaner. The secondary vector (globally-installed user skills at ~/.claude/skills/) is mitigated by adding a bare-eval-prompt line: 'Do NOT load or reference any globally-installed skills.' Round 5 is UNBLOCKED.
Should the landing page stop leading with 'ship software faster' and lead with task-management + skill evaluation instead?
Resolved by decision gad-76 and the landing rewrite shipped in the 2026-04-09 IA session. New value-prop line: 'A system for evaluating and evolving agents through real tasks, measurable pressure, and iteration.' Landing primary CTA is now Play (B), above-the-fold stack is Play → Methodology → Findings → Hypothesis → Fork. Target audience is primarily coding-agent researchers (A) with indie devs (C) as experiential secondary entry. Pressure (gad-75) is introduced as the hook that differentiates GAD from other coding-agent frameworks.
How do we measure the value of authored-content-pack inheritance (gad-66) without confounding freedom/CSH tests?
Resolved: content-pack injection becomes its own eval track (separate from greenfield emergent) so CSH measurements stay clean. User reframed it as a *content-driven hypothesis* — analogous to making a game or movie derivative from a book. It is explicitly derivative work: not all processes are, 'much like a forger might not use the exact same brush.' This is a distinct hypothesis from freedom and CSH, with its own track, its own rounds, and its own comparison rules. We do NOT compare content-pack runs to greenfield runs on the same rubric — they answer different questions. The content-pack track's scoring focuses on: (a) does the extra content produce a more fleshed-out game given the same token budget, (b) does the agent integrate the content coherently rather than just bolting it on, (c) does the final game feel unified despite derivative source material.