Methodology

Every formula, every weight, every cap.

This page is the appendix. Every number on the site — every bar, every composite, every "gate passed" badge — traces back to one of the formulas below. If you want to verify a run yourself, pull its TRACE.json from GitHub and run the math from here.

Composite formula

The weighted sum

The composite score is a plain weighted sum of dimension scores. Every dimension is normalised to 0.0 – 1.0 before the multiply. The weights are project-specific and committed to evals/<project>/gad.json.

composite = Σ_i (score_i × weight_i), summed over the project's dimensions

Weights sum to 1.0 across a project's dimensions (two gad game projects currently sum to 0.90). A run can max out at 1.0; the minimum is 0.0 (modulo the low-score cap below).
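As a sketch, the composite is just a dictionary dot product. The weights below are copied from the escape-the-dungeon-bare table; the function itself is ours, not the framework's code:

```python
# Weights copied from the escape-the-dungeon-bare table on this page.
weights = {
    "human_review": 0.30,
    "requirement_coverage": 0.20,
    "implementation_quality": 0.20,
    "workflow_emergence": 0.15,
    "iteration_evidence": 0.10,
    "time_efficiency": 0.05,
}

def composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Plain weighted sum; every score is assumed pre-normalised to 0.0-1.0."""
    return sum(scores[dim] * w for dim, w in weights.items())
```

Because the weights sum to 1.0 here, scoring 0.5 on every dimension yields a composite of exactly 0.5.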

Weights per eval project

Different eval projects weight different dimensions. A tooling eval might care most about time efficiency; an implementation eval weighs human review at 30% to prevent process metrics from rescuing a broken artifact.

Per-project weight tables

17 eval projects with composite weights (last 5 rounds per project)

escape-the-dungeon (Game · gad · greenfield), Σ weights = 0.90
  human_review 0.30 · requirement_coverage 0.15 · planning_quality 0.15 · per_task_discipline 0.15 · skill_accuracy 0.10 · time_efficiency 0.05

escape-the-dungeon-bare (Game · bare · greenfield), Σ weights = 1.00
  human_review 0.30 · requirement_coverage 0.20 · implementation_quality 0.20 · workflow_emergence 0.15 · iteration_evidence 0.10 · time_efficiency 0.05

escape-the-dungeon-emergent (Game · emergent · greenfield), Σ weights = 1.00
  human_review 0.30 · requirement_coverage 0.20 · implementation_quality 0.15 · skill_reuse 0.15 · workflow_quality 0.10 · iteration_evidence 0.05 · time_efficiency 0.05

escape-the-dungeon-gad-emergent (Game · gad · greenfield), Σ weights = 1.00
  human_review 0.30 · requirement_coverage 0.20 · implementation_quality 0.15 · skill_reuse 0.15 · workflow_quality 0.10 · iteration_evidence 0.05 · time_efficiency 0.05

escape-the-dungeon-planning-only (Game · bare · greenfield), Σ weights = 1.00
  human_review 0.30 · requirement_coverage 0.20 · implementation_quality 0.15 · skill_reuse 0.15 · workflow_quality 0.10 · iteration_evidence 0.05 · time_efficiency 0.05

etd-brownfield-bare (Game · bare · brownfield), Σ weights = 1.00
  human_review 0.30 · requirement_coverage 0.20 · implementation_quality 0.20 · workflow_emergence 0.15 · iteration_evidence 0.10 · time_efficiency 0.05

etd-brownfield-emergent (Game · emergent · brownfield), Σ weights = 1.00
  human_review 0.30 · requirement_coverage 0.20 · implementation_quality 0.15 · skill_reuse 0.15 · workflow_quality 0.10 · iteration_evidence 0.05 · time_efficiency 0.05

etd-brownfield-gad (Game · gad · brownfield), Σ weights = 0.90
  human_review 0.30 · requirement_coverage 0.15 · planning_quality 0.15 · per_task_discipline 0.15 · skill_accuracy 0.10 · time_efficiency 0.05

eval-skill-install-eval (Tooling · gad · greenfield), Σ weights = 1.00
  human_review 0.35 · install_success 0.20 · comparison_valid 0.20 · cli_usage 0.15 · preservation_complete 0.10

gad-explainer-video (Video · gad · greenfield), Σ weights = 1.00
  requirement_coverage 0.20 · video_polish 0.20 · implementation_quality 0.15 · pedagogical_clarity 0.15 · human_review 0.15 · workflow_quality 0.10 · time_efficiency 0.05

gad-explainer-video-bare (Video · bare · greenfield), Σ weights = 1.00
  requirement_coverage 0.20 · video_polish 0.20 · implementation_quality 0.15 · pedagogical_clarity 0.15 · human_review 0.15 · workflow_quality 0.10 · time_efficiency 0.05

gad-explainer-video-emergent (Video · emergent · greenfield), Σ weights = 1.00
  requirement_coverage 0.20 · video_polish 0.20 · implementation_quality 0.15 · pedagogical_clarity 0.15 · human_review 0.15 · workflow_quality 0.10 · time_efficiency 0.05

gad-skill-creator-eval (Tooling · gad · greenfield), Σ weights = 1.00
  human_review 0.35 · skill_quality 0.25 · eval_scaffolded 0.15 · cli_usage 0.15 · attribution_tagged 0.10

reverse-engineer-eval (Tooling · gad · greenfield), Σ weights = 1.00
  human_review 0.30 · requirements_completeness 0.25 · requirements_accuracy 0.20 · build_success 0.15 · functional_fidelity 0.10

skill-evaluation-app (Software · gad · greenfield), Σ weights = 1.00
  human_review 0.55 · requirement_coverage 0.15 · implementation_quality 0.15 · workflow_quality 0.10 · time_efficiency 0.05

skill-evaluation-app-bare (Tooling · bare · greenfield), Σ weights = 1.00
  human_review 0.55 · requirement_coverage 0.15 · implementation_quality 0.15 · workflow_quality 0.10 · time_efficiency 0.05

skill-evaluation-app-emergent (Tooling · emergent · greenfield), Σ weights = 1.00
  human_review 0.55 · requirement_coverage 0.15 · implementation_quality 0.15 · workflow_quality 0.10 · time_efficiency 0.05

Gate logic

Gates override everything

Starting with requirements v2, some criteria are marked gate="true". If any gate fails, requirement_coverage collapses to 0. This is how a run that "ticks most boxes" can still score near zero on the mechanical dimension — because one gate (e.g. G1 game loop softlocks) makes the rest meaningless.
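A minimal sketch of the override (the real check lives in the scoring pipeline, not this exact function):

```python
def requirement_coverage(criteria: list[tuple[bool, bool]]) -> float:
    """criteria: list of (met, gate) pairs. Any failed gate collapses coverage to 0."""
    if any(gate and not met for met, gate in criteria):
        return 0.0
    return sum(1 for met, _ in criteria if met) / len(criteria)

# Nine of ten criteria met, but the one miss is a gate: coverage collapses to 0.
criteria = [(True, False)] * 9 + [(False, True)]
```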

v1 runs (pre-gates) show a "pre-gate requirements" badge on their per-run pages instead of a pass/fail because the concept didn't exist yet. v3 introduced four explicit gates (among them the game loop, spell crafting, and UI quality); v4 added a fifth (pressure mechanics).

Low-score caps (v3+)

Layered on top of the weighted sum to prevent a broken run from reaching respectable territory on time-efficiency alone.

If weighted sum < 0.20: capped to 0.40. Prevents near-zero runs from being falsely rescued by time-efficiency bonuses.
If weighted sum < 0.10: capped to 0.25. Reserved for runs that barely produced anything; the run still appears in the results set but is clearly distinct from a mid-tier run.
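One way to read the cap rules as code. ASSUMPTION: the threshold is checked against the weighted sum of the non-time dimensions, and the cap is applied as an upper bound on the final composite; the production rule may be wired differently:

```python
def apply_low_score_cap(core_sum: float, composite: float) -> float:
    # ASSUMPTION: core_sum is the weighted sum of the non-time-efficiency
    # dimensions; the cap bounds the final composite. Tighter threshold first.
    if core_sum < 0.10:
        return min(composite, 0.25)
    if core_sum < 0.20:
        return min(composite, 0.40)
    return composite
```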

Data production pipeline

Raw → structured → derived → insight

The eval framework's primary output is structured data, not scores. Scores are one kind of derived number; the framework also produces rubrics, automated gate checks, derived metrics from trace events, and cross-run aggregates. The four stages below are how raw run artifacts become insights on this site.

1
Raw artifacts

Every eval run produces a TRACE.json sidecar, a session.jsonl (Claude Code), a git log with per-task commits, and a dist/ build. Phase 25 adds .trace-events.jsonl for hook-captured tool/skill/subagent events. These are the primary sources — nothing is ever recomputed from anything upstream.

Examples: TRACE.json · session.jsonl · git log · dist/
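Since the trace sidecars are plain JSONL, consuming them downstream is a few lines (the event fields in the example are illustrative, not the real schema):

```python
import json

def parse_trace_events(lines) -> list[dict]:
    """Parse hook-captured trace events: one JSON object per non-empty JSONL line."""
    return [json.loads(ln) for ln in lines if ln.strip()]
```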

2
Structured records

The prebuild script reads raw artifacts and emits typed records: EvalRunRecord, CatalogSkill, RequirementsVersion, PlanningState, ProducedArtifacts. Schema versioned so old runs parse cleanly alongside new ones. This is what the site consumes — no client-side parsing.

Examples: lib/eval-data.generated.ts · lib/catalog.generated.ts

3
Derived metrics

Computed from structured records: composite scores, divergence (composite vs human review), commit rhythm, plan-adherence delta, tool-use mix (phase 25+), skill-to-tool ratio, produced artifact density. Each derived number has a formula that's traceable back to its inputs — no magic aggregates.

Examples: scores.composite · divergence_score · plan_adherence_delta
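As one example, divergence can be sketched as the signed gap between the two scores (an assumption about the exact formula; the inputs come from this page's worked examples):

```python
def divergence(composite: float, human_review: float) -> float:
    # ASSUMPTION: divergence_score is the signed gap between the mechanical
    # composite and the human review score; the site's formula may differ.
    return round(composite - human_review, 3)
```

With the worked-example numbers, escape-the-dungeon-bare v3 (composite 0.526, human 0.70) diverges by -0.174: the human liked it more than the process metrics did.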

4
Insights + visualizations

Cross-run queries answer specific research questions. Charts shape data around the question, not the data shape. Phase 27 adds /insights with curated query cards and gad eval query for custom drilling. Every chart's caption is the question it answers — the number is just evidence.

Examples: freedom hypothesis scatter · rubric radar · insight cards

Objective vs subjective today

Most of what we measure today is objective (counts, durations, coverage ratios, commit rhythm). A few load-bearing measurements are still subjective — human review is a single number set by a reviewer who "felt like it was mid," and gate pass/fail depends on a human opening the built game and playing it. Phase 27 is the research methodology work that makes those measurements structured: human review gets a per-dimension rubric, gate checks get Playwright automation, and derived metrics get exposed via gad eval query so we can ask cross-run questions like "which runs used the forge room more than 3 times?"

The methodology discipline is captured in the objective-eval-design skill: every measurement must answer a specific research question, expose its inputs, be comparable across runs, and be decomposable. A number that fails any of those tests isn't ready to publish. See /standards for the Anthropic skills guide + agentskills.io convention that governs how individual skills are authored and evaluated.
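The four tests can be read as a literal checklist; a sketch with field names of our own choosing:

```python
def ready_to_publish(measurement: dict) -> bool:
    """The four objective-eval-design tests as a checklist (field names are ours)."""
    required = ("research_question", "inputs_exposed",
                "comparable_across_runs", "decomposable")
    return all(measurement.get(key) for key in required)
```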

Agent runtimes

Which coding agents can produce trace v4 data

Trace schema v4 (phase 25) needs to capture every tool call, skill invocation, and subagent spawn with inputs, outputs, and timestamps. The only reliable way to get that data is from inside the coding agent's runtime via hooks or callbacks. Agents without a hook runtime are explicitly unsupported for GAD evaluation: we're not going to screen-scrape stdout. Decision gad-53 pins this.

Claude Code · hook runtime: PreToolUse / PostToolUse hooks via settings.json · trace v4: supported
  First-class support. Hooks run before and after every tool call; session.jsonl captures the full invocation stream. Phase 25 writes a hook handler that emits trace v4 events directly.

Aider · hook runtime: Python callbacks + chat history export · trace v4: supported
  Supported via converter. The Python API exposes on_message / on_tool_call-style callbacks; the existing chat history file is parseable for after-the-fact conversion. Future sub-phase.

Continue.dev · hook runtime: VS Code extension API (onToolCall, onChatUpdate) · trace v4: supported
  Supported via converter. Extension hosts expose tool-call events; we'd ship a small extension-side emitter that writes trace v4 to disk. Future sub-phase.

OpenAI Codex CLI · hook runtime: structured stream output (Running/Ran prefixes) · trace v4: supported
  Supported via stream parser. Codex's terminal output is line-delimited with recognisable prefixes (Running ..., • Ran ..., └ <output>). Lossier than hooks because reasoning text interleaves with tool calls and rate limits can truncate. Future sub-phase.

Cursor · hook runtime: none (closed-source, no public hook API) · trace v4: unsupported
  No way to trace from inside the editor. The only access is the chat panel, which has no tool-call visibility. Not supported until Cursor exposes a hook runtime.

Vanilla ChatGPT / Claude.ai web · hook runtime: none · trace v4: unsupported
  Web interfaces have no tool access and no extension points. Fundamentally the wrong shape of tool for the kind of work we're evaluating.
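A minimal sketch of a hook handler in the Claude Code style, assuming the hook payload carries tool_name / tool_input fields on stdin; the trace-v4 event shape emitted here is illustrative, not the real schema:

```python
import json
import sys
import time

def hook_to_trace_event(raw: str) -> dict:
    """Map a PostToolUse hook payload to a trace-v4-style event.
    Field names on both sides are illustrative, not the real schemas."""
    payload = json.loads(raw)
    return {
        "type": "tool_call",
        "tool": payload.get("tool_name"),
        "input": payload.get("tool_input"),
        "ts": time.time(),
    }

if __name__ == "__main__":
    # The runtime delivers the hook payload on stdin; append one JSONL event.
    event = hook_to_trace_event(sys.stdin.read())
    with open(".trace-events.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")
```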

Multi-agent support (decision gad-55)

Agents beyond Claude Code are supported through converters, not through per-agent trace code. A converter reads the target agent's native session format and emits GAD trace schema v4; the same /runs/[project]/[version] page renders it. Codex's Running/Ran stream format is parseable but requires streaming detection; Aider's Python callbacks are straightforward to hook into. Phase 25 ships the Claude Code converter first; Codex and Aider converters are future sub-phases if and when we want to run cross-agent comparisons.
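A stream-parser converter for Codex could start as small as this (the regex and line shapes are assumptions based on the prefixes described above; real Codex CLI output may differ):

```python
import re

# ASSUMPTION: line shapes modelled on the "Running ..." / "• Ran ..." prefixes
# described above.
RUN_LINE = re.compile(r"^(?:•\s*)?(?:Running|Ran)\s+(.+)$")

def parse_codex_stream(lines) -> list[str]:
    """Pull tool invocations out of a Codex CLI transcript, skipping prose lines."""
    return [m.group(1) for ln in lines if (m := RUN_LINE.match(ln.strip()))]
```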

Worked examples

Two runs, end to end

Two runs picked as walkthroughs — one process-vs-reality divergence, one highest-scoring bare run. Click through for the full per-run view with the formula breakdown.

escape-the-dungeon · v8
composite 0.177 · human 0.20

Better particle effects on main menu and better colors than previous GAD runs. However, crafting system broke the game when used (unusable). Old ASCII text design for map/spells/bags menus. Hard to read text. Added icons but didn't search for sourced sprites. 0 commits — rate limit hit before agent could finalize. Score 0.20: has some visual improvements but broken crafting gates it.

Full breakdown →
escape-the-dungeon-bare · v3
composite 0.526 · human 0.70

Best UI/UX of all eval runs by far. Most enjoyable and playable. Functional game loop with combat and dialogue. Missing: floor progression after boss (can grind same floor), no clear spell crafting path. Regressed on commit discipline under pressure (1 giant commit vs v2's 6). Score 0.70: most enjoyable game across all experiments.

Full breakdown →

Greenfield → brownfield lineage

Brownfield evals branch from a specific greenfield run's preserved output. The agent starts with the greenfield's source code and extends it against new or expanded requirements. This tests the same 5 hypotheses (bare / planning-only / GAD / emergent / GAD+emergent) but for code extension instead of creation. Decision gad-90 formalizes the lineage model.

Latest greenfield run per eval project (lineage diagram data): etd v12 (gad) · etd-bare v6 (bare) · etd-emergent v6 (emergent) · gad-explainer-video v1 (gad) · gad-skill-creator-eval v1 (gad) · planning-migration v1 (gad) · portfolio-bare v4 (bare) · project-migration v1 (gad) · reader-workspace v3 (gad) · reverse-engineer-eval v1 (gad)
Legend: GF = greenfield · BF = brownfield (branches from a greenfield run) · conditions: GAD, Bare, Emergent

Field lineage: nodes are the latest run per eval project from EVAL_RUNS. Brownfield baselines read from each project's gad.json baseline field. Edges show the source-code inheritance path.

What each condition template contains

Transparency about what the eval agent receives. Each condition's template is the full input set — the agent sees nothing else.

Template files (presence varies by condition): AGENTS.md · REQUIREMENTS.xml · .planning/ROADMAP.xml · .planning/TASK-REGISTRY.xml · .planning/DECISIONS.xml · .planning/STATE.xml · skills/ (bootstrap: 2) · skills/ (GAD: 10) · skills/ (inherited: 6) · GAD CLI

Total skills per condition: Bare 2 · Planning-only 2 · GAD 10 · Emergent 6 · GAD+Emergent 16

Source: the template/ directory of each eval project under evals/escape-the-dungeon*/. This table shows the greenfield setup. Brownfield conditions additionally receive the preserved source code from their baseline greenfield run.

Submit a review

Each project ships with a rubric. Score a run against it with the gad eval review CLI — the weighted aggregate lands in that run's TRACE.json automatically.

GAD · escape-the-dungeon
rubric v1
5 dimensions

CLI

gad eval review escape-the-dungeon v<N> \
  --rubric '{"playability": 0.80, "ui_polish": 0.80, "mechanics_implementation": 0.80, "ingenuity_requirement_met": 0.80, "stability": 0.80}' \
  --notes "what landed, what broke, what surprised you"
See every run for this project
Bare · escape-the-dungeon-bare
rubric v1
5 dimensions

CLI

gad eval review escape-the-dungeon-bare v<N> \
  --rubric '{"playability": 0.80, "ui_polish": 0.80, "mechanics_implementation": 0.80, "ingenuity_requirement_met": 0.80, "stability": 0.80}' \
  --notes "what landed, what broke, what surprised you"
See every run for this project
Emergent · escape-the-dungeon-emergent
rubric v1
6 dimensions

CLI

gad eval review escape-the-dungeon-emergent v<N> \
  --rubric '{"playability": 0.80, "ui_polish": 0.80, "mechanics_implementation": 0.80, "ingenuity_requirement_met": 0.80, "stability": 0.80, "skill_inheritance_effectiveness": 0.80}' \
  --notes "what landed, what broke, what surprised you"
See every run for this project
GAD · escape-the-dungeon-gad-emergent
rubric v1
6 dimensions

CLI

gad eval review escape-the-dungeon-gad-emergent v<N> \
  --rubric '{"playability": 0.80, "ui_polish": 0.80, "mechanics_implementation": 0.80, "ingenuity_requirement_met": 0.80, "stability": 0.80, "skill_inheritance_effectiveness": 0.80}' \
  --notes "what landed, what broke, what surprised you"
See every run for this project
Bare · escape-the-dungeon-planning-only
rubric v1
5 dimensions

CLI

gad eval review escape-the-dungeon-planning-only v<N> \
  --rubric '{"playability": 0.80, "ui_polish": 0.80, "mechanics_implementation": 0.80, "ingenuity_requirement_met": 0.80, "stability": 0.80}' \
  --notes "what landed, what broke, what surprised you"
See every run for this project
GAD · gad-explainer-video
rubric v1
5 dimensions

CLI

gad eval review gad-explainer-video v<N> \
  --rubric '{"pedagogical_clarity": 0.80, "video_polish": 0.80, "accuracy": 0.80, "scope_fit": 0.80, "stability": 0.80}' \
  --notes "what landed, what broke, what surprised you"
See every run for this project
GAD · reverse-engineer-eval
rubric v1
6 dimensions

CLI

gad eval review reverse-engineer-eval v<N> \
  --rubric '{"requirements_completeness": 0.80, "requirements_accuracy": 0.80, "build_success": 0.80, "functional_fidelity": 0.80, "presentation": 0.80, "skill_quality": 0.80}' \
  --notes "what landed, what broke, what surprised you"
See every run for this project
GAD · skill-evaluation-app
rubric v1
5 dimensions

CLI

gad eval review skill-evaluation-app v<N> \
  --rubric '{"ui_usability": 0.80, "requirements_ergonomics": 0.80, "harness_integration": 0.80, "visualization_quality": 0.80, "stability": 0.80}' \
  --notes "what landed, what broke, what surprised you"
See every run for this project

Open questions

The unresolved questions about the hypothesis, evaluation approach, and framework — public backlog of what is still being worked out.

13 open · 3 resolved · 6 categories

evaluation

5
critical
open
2026-04-09
How do we compute a pressure score per eval operationally?

Decision gad-75 names pressure as a first-class eval dimension with five sub-dimensions (requirement complexity, ambiguity, constraint density, iteration budget, failure cost). The dimensions are named; the formula is not. Open questions: (a) is pressure a single aggregate score or a 5-tuple? (b) is it self-rated by the requirements author or programmatically extracted? (c) if programmatic, which fields in REQUIREMENTS.xml feed which sub-dimension? (d) does pressure live on requirements (per version) or on runs (per execution), or both? (e) how do we validate that our pressure rating matches agent-experienced pressure — probably by correlating rating against tool_uses-to-failure ratio. Candidate first-pass: self-rated 0.0-1.0 per sub-dimension on each requirements version (stored in REQUIREMENTS-VERSIONS.md and in a new <pressure-profile> block in REQUIREMENTS.xml), aggregated as a weighted sum, displayed on the /roadmap timeline as a pressure-tier progression across rounds.
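The candidate first-pass could be sketched like this (the sub-dimension weights are placeholders, not a decision):

```python
# Five self-rated sub-dimensions from gad-75, aggregated as a weighted sum.
# The weights here are hypothetical placeholders.
PRESSURE_WEIGHTS = {
    "requirement_complexity": 0.25,
    "ambiguity": 0.20,
    "constraint_density": 0.20,
    "iteration_budget": 0.20,
    "failure_cost": 0.15,
}

def pressure_score(ratings: dict[str, float]) -> float:
    """ratings: self-rated 0.0-1.0 per sub-dimension, per requirements version."""
    return sum(ratings[k] * w for k, w in PRESSURE_WEIGHTS.items())
```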

critical
open
2026-04-09
Which of our current eval metrics should become programmatic first?

Per gad-69, every eval metric must answer 'can this be collected programmatically?' before 'how do we score it?'. We currently rely heavily on human review + agent self-report for: did the skill actually load, did the gate pass, is the forge integrated with pressure, is the game beatable. A programmatic-eval GAPS audit is queued as task 83 / task 96. The output should be .planning/GAPS.md ranking gaps by (a) how much human/agent judgment they currently require, (b) how mechanically checkable they actually are, (c) which phase picks them up. Highest candidates: playwright smoke tests for G1-G4 gates, hook-captured skill-trigger events, build/test exit codes as stability signals.

high
discussing
2026-04-09
Should pressure become a scored rubric dimension or stay as test-condition metadata?

Pressure is CURRENTLY framed in gad-75 as metadata about the test conditions, not a measurement of the result. But there's a parallel interpretation: pressure could be scored — 'did the agent handle the pressure well'. That scoring collapses pressure into the existing rubric as a new dimension. The two framings answer different questions: metadata-pressure lets us normalize cross-round comparisons ('X scored 0.8 at pressure 0.6 vs Y scored 0.7 at pressure 0.9'), scored-pressure lets us rank agents on pressure-handling ability directly. Probably both are useful and we need both — one stored in REQUIREMENTS/TRACE metadata and one computed as a derived dimension. But committing to one framing first matters for the /roadmap page design and for any comparison visualization.

medium
open
2026-04-09
Can we extract meaningful signal from Claude Code thinking blocks post-hoc?

Per decision gad-84, Claude Code's hook system exposes tool calls but not thinking blocks or inter-tool message text. Those artifacts exist in session.jsonl. Open research question: can we extract them post-hoc and derive quality signal from them? Candidate metrics — thought-to-tool-use interval, thought length variance, ratio of 'exploration' to 'execution' phrases, thought-before-action vs thought-after-action ordering. All hypothetical until we have the extraction pipeline and look at real data. Feasibility assessment: high for extraction, unknown for signal.

medium
discussing
2026-04-09
Is the 6th skill_inheritance_effectiveness dimension on emergent an unfair boost vs the 5-dim bare/gad rubric?

Emergent v4 aggregate: 0.885. Bare v5 aggregate: 0.805. The 0.08 delta maps almost exactly to emergent's 6th-dimension contribution (skill_inheritance_effectiveness 0.95 * 0.20 weight = +0.19 against bare's absent-dimension 0). Shared-dimension comparison: emergent +0.18 playability, +0.10 ui_polish, tie on mechanics/ingenuity/stability. Is that 6th dimension double-counting emergent's inheritance advantage? Arguably yes — CSH is being tested by the existence of that dimension. Counter: without it, there's no way to score whether inheritance is actually working, which is the whole point of the emergent condition.

framework

3
high
open
2026-04-09
Does a fresh clone of the repo actually work end-to-end for a human contributor?

Per decision gad-77, before the /contribute page ships we need to verify: (a) cloning the repo into a new directory results in a working agent environment (skills in .claude/, agents installed, commands available), (b) opening the repo in Claude Code surfaces the GAD skills to the agent without additional setup, (c) a conversational request like 'run the escape-the-dungeon-bare eval' actually works with no manual snapshot or XML editing. This is an untested assumption right now — we've been running agents in the canonical development repo where everything is already wired up. A fresh clone might hit missing .claude/settings.json entries, missing hook handler paths, missing env vars, or installer bugs. Until this is tested, /contribute is vaporware.

high
open
2026-04-09
Do we already have create-skill / merge-skill / find-skills as fundamentals, and if not, should we build them?

User's vision (decision gad-73): GAD provides three fundamental skills as the foundation of emergent evolution. find-skills locates a trusted GAD fundamental (e.g. 'scientific method', 'debug'). merge-skill fuses that fundamental into a project-tailored skill (e.g. 'scientific-method-for-kaplay-rune-spellcrafting'). create-skill authors genuinely new ones when no merge candidate exists. This triumvirate IS the in-game rune/spell merging mechanic made meta. We need to audit what exists today in skills/ — create-skill likely already exists, merge-skill and find-skills may not. Reference: https://skills.sh/vercel-labs/skills/find-skills. GAD's version scopes to trusted ecosystem initially. Bigger ambition: prove skill effectiveness + provide skill security.

medium
open
2026-04-09
What does 'skill security' actually look like in practice?

User wants GAD to eventually provide a trust model for skills — how can you tell if a skill you're about to inherit is safe, effective, and actually improves anything? Initial thoughts: (a) frontmatter signing / checksum, (b) provenance lineage (which run authored it, which rubric scores validated it, which other runs inherited it successfully), (c) automated review against Anthropic's skills guide (gad-70), (d) sandboxed trial run in a throwaway worktree before trusting. Distinct from typosquatting defense (which lives in the planned /security page). This is about effectiveness + integrity, not name collisions.

game design

1
high
discussing
2026-04-09
Should v5 mandate Unicorn-Overlord-style rule-based combat (Model A) or allow direct-control (Model B)?

Bare v5 playtest surfaced a preference for rule-based simulated combat (loadout + spells + stats + action policies + initiative, chess-like positioning) over direct-control. R-v5.13 captures this as 'Model A preferred unless implementation exception granted.' Open question: is this the right call for round 5? Rule-based is harder to implement correctly in a single eval run and may advantage GAD's planning overhead (which would invalidate the freedom comparison if it's the reason bare underperforms next round).

hypothesis

1
high
open
2026-04-09
Does compounding skill inheritance (CSH) eventually beat raw freedom (freedom hypothesis) across many rounds?

Round 4 produced a provocative finding: Emergent v4 scored 0.885 (with the 6-dim rubric including skill_inheritance_effectiveness), Bare v5 scored 0.805 (5-dim rubric), and GAD v10 was api-interrupted. On shared rubric dimensions (playability, ui_polish, mechanics, ingenuity, stability) Emergent beat Bare on playability and UI polish, tied on mechanics/ingenuity/stability. The freedom hypothesis (gad-36) says bare beats framework on creative output. The compound-skills hypothesis (gad-65) says emergent-with-evolution compounds over rounds. Round 4 may be the first evidence that CSH is overtaking freedom — BUT it's one round, rubric reweighting matters, and Emergent inherited from Bare so it has freedom-hypothesis lineage baked in. Need more rounds against v5 requirements to disambiguate.

site

2
high
open
2026-04-09
Who exactly is the site + framework for?

Task 90 queues ASSUMPTIONS.md. Without a documented target user, IA refactor decisions (task 84), skills directory UX (task 85), and landing-page framing will be ad-hoc. Candidates: coding-agent researchers, framework authors, indie devs exploring skill evolution, enterprise teams evaluating agent frameworks. Each implies a different navigation priority.

high
open
2026-04-09
How do we keep the site navigable as we add /security, /glossary, /roadmap, /skills-guide, /questions, /compare?

Current nav already has 11 items (GAD, Lineage, Methodology, Rubric, Results, Graphs, Videos, Catalog, Findings, Planning, and now + Rubric). Tasks 86/87/88/95 will add more. Task 84 plans dropdown grouping + keyword search before more pages ship. Open: what is the grouping that makes sense to a first-time visitor vs a returning researcher?

tooling

1
high
open
2026-04-09
What is the actual root cause of the HTTP 529 overloaded_error crashing GAD runs?

GAD v10 was api-interrupted twice by HTTP 529 (once at tool_uses=18, then 55). This is distinct from account rate limits (gad-64). Per STATE.xml, investigation is queued before GAD v11 retry. Without understanding whether 529 correlates with high planning overhead (GAD-specific), payload size, tool-use frequency, or time of day, we can't plan round 5 reliably.

Resolved

What used to be open

resolved
2026-04-10

Do eval worktrees actually inherit state from the parent monorepo, and does that contaminate the results?

PASSES with explicit allowlist. Audit written at .planning/docs/ISOLATION-AUDIT-2026-04-10.md. Key findings: (1) git worktrees have separate working trees, so .agents/skills/ and .planning/ do NOT inherit — the bare condition stays clean of framework skills, (2) .claude/settings.json IS inherited by Claude Code worktrees at .claude/worktrees/agent-*/ (the settings search walks up), but this is acceptable because the only settings are the trace hook handler which is instrumentation not framework assistance, (3) gad eval run creates worktrees at os.tmpdir() which is entirely outside the repo — even cleaner. The secondary vector (globally-installed user skills at ~/.claude/skills/) is mitigated by adding a bare-eval-prompt line: 'Do NOT load or reference any globally-installed skills.' Round 5 is UNBLOCKED.

resolved
2026-04-09

Should the landing page stop leading with 'ship software faster' and lead with task-management + skill evaluation instead?

Resolved by decision gad-76 and the landing rewrite shipped in the 2026-04-09 IA session. New value-prop line: 'A system for evaluating and evolving agents through real tasks, measurable pressure, and iteration.' Landing primary CTA is now Play (B), above-the-fold stack is Play → Methodology → Findings → Hypothesis → Fork. Target audience is primarily coding-agent researchers (A) with indie devs (C) as experiential secondary entry. Pressure (gad-75) is introduced as the hook that differentiates GAD from other coding-agent frameworks.

resolved
2026-04-09

How do we measure the value of authored-content-pack inheritance (gad-66) without confounding freedom/CSH tests?

Resolved: content-pack injection becomes its own eval track (separate from greenfield emergent) so CSH measurements stay clean. User reframed it as a *content-driven hypothesis* — analogous to making a game or movie derivative from a book. It is explicitly derivative work: not all processes are, 'much like a forger might not use the exact same brush.' This is a distinct hypothesis from freedom and CSH, with its own track, its own rounds, and its own comparison rules. We do NOT compare content-pack runs to greenfield runs on the same rubric — they answer different questions. The content-pack track's scoring focuses on: (a) does the extra content produce a more fleshed-out game given the same token budget, (b) does the agent integrate the content coherently rather than just bolting it on, (c) does the final game feel unified despite derivative source material.
