Standards

Two canonical references. Cited on every skill page.

GAD is a research framework, not a new skill format. The way we author skills, evaluate them, and load them into agents follows two open references that already exist. This page is the single canonical citation point — every other skill-related page on the site links here rather than repeating the standards inline.

Anchor decisions: GAD-D-70 (Anthropic guide as canonical reference), GAD-D-80 (agentskills.io adoption + skills/ convention), GAD-D-81 (skill-count policy derived from both sources).

The two references

Anthropic
PDF
The Complete Guide to Building Skills for Claude

Anthropic's canonical document on authoring skills for Claude. Covers SKILL.md format, frontmatter rules, three testing layers (triggering / functional / performance comparison), three skill categories (doc-creation / workflow-automation / mcp-enhancement), and iteration signals (under-triggering, over-triggering, execution issues). Source of truth for how Claude Code loads and activates skills.

Download the PDF
Open standard
agentskills.io
Agent Skills — open format + interoperability standard

The cross-client open format for skills. Specifies the SKILL.md file structure, the skills/ discovery convention for cross-client interoperability, progressive-disclosure three-tier loading, name collision handling, trust gating, and a full per-skill evaluation methodology. Agents built against this standard should find each other's skills regardless of runtime.

Read the standard
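Both references center on the SKILL.md file with a frontmatter block carrying at least a name and a description. As a concrete illustration, here is a minimal tier-1 catalog extraction in Python: the field names follow the SKILL.md format both references describe, but the parsing itself is an illustrative sketch, not a spec-compliant YAML parser, and the example skill is hypothetical.

```python
# Minimal tier-1 catalog extraction: pull only `name` and `description`
# from a SKILL.md frontmatter block. Sketch only -- a real client would
# use a proper YAML parser and validate the frontmatter rules.

def catalog_entry(skill_md: str) -> dict:
    """Return the name/description pair an agent loads at session start."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a frontmatter block")
    entry = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter; the body below is tier 2
        key, _, value = line.partition(":")
        if key.strip() in ("name", "description"):
            entry[key.strip()] = value.strip()
    return entry

example = """---
name: summarize-pdf
description: Extract and summarize text from PDF files.
---
# Instructions (tier 2 -- not loaded at session start)
"""
print(catalog_entry(example))
```

The point of stopping at the closing `---` is the whole trick: the instructions body never enters context at session start.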

Progressive disclosure — three tiers

Per the agentskills.io client-implementation guidance, every compliant agent follows the same three-tier load strategy. This is what keeps skill libraries scalable: you don't pay the token cost of every installed skill upfront, only of the ones actually used in a conversation.

tier 1
Catalog

Name + description only. Loaded at session start.

Token cost: ~50-100 tokens per skill. 27 skills in GAD ≈ 2,000 tokens for the full catalog.

tier 2
Instructions

Full SKILL.md body. Loaded when the agent decides a skill is relevant to the current task.

Token cost: <5000 tokens recommended per skill.

tier 3
Resources

Scripts, references, assets bundled in the skill directory. Loaded on-demand when instructions reference them.

Token cost: varies — pay only for the files actually read.
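The three tiers above can be sketched as a lazy-loading structure. The tier boundaries and token budgets come from the standard; the class shape, field names, and the example skill are illustrative assumptions.

```python
# Sketch of the three-tier load strategy. Only the tier boundaries come
# from the standard; names and structure here are illustrative.
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str          # tier 1: always in context
    description: str   # tier 1: always in context
    body: str          # tier 2: SKILL.md instructions
    resources: dict    # tier 3: filename -> contents
    loaded: set = field(default_factory=set)

    def tier1(self) -> str:
        return f"{self.name}: {self.description}"  # ~50-100 tokens

    def tier2(self) -> str:
        self.loaded.add("instructions")            # <5000 tokens recommended
        return self.body

    def tier3(self, filename: str) -> str:
        self.loaded.add(filename)                  # pay only per file read
        return self.resources[filename]

skill = Skill("summarize-pdf", "Summarize PDF files.",
              "Run scripts/extract.py, then condense.",
              {"scripts/extract.py": "print('extract')"})
catalog = skill.tier1()       # session start: catalog line only
# ... agent decides the skill is relevant to the task ...
instructions = skill.tier2()  # now pay the body's token cost
```

Tier-3 resources stay untouched until the instructions actually reference them, which is where the "pay only for the files actually read" property comes from.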

Discovery convention: skills/

The agentskills.io cross-client interoperability convention is the canonical location GAD is migrating toward (GAD-D-80). Skills installed at these paths become visible to any compliant client (Claude Code, Codex, Cursor, Windsurf, Augment, and others) without per-runtime copies.

Scope     Path                          Purpose
Project   <project>/.<client>/skills/   Client-native location
Project   <project>/skills/             Cross-client interoperability
User      ~/.<client>/skills/           Client-native location
User      ~/skills/                     Cross-client interoperability
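A discovery walk over those four locations can be sketched as below. The four paths come from the table; the precedence order (project before user, client-native before cross-client) is an assumption for illustration, since each client documents its own search order.

```python
# Sketch of skills/ discovery across the four convention paths. The
# locations come from the standard; the precedence order shown here is
# an illustrative assumption, not a normative rule.
from pathlib import Path

def discovery_paths(project: Path, home: Path, client: str) -> list[Path]:
    return [
        project / f".{client}" / "skills",  # project, client-native
        project / "skills",                 # project, cross-client
        home / f".{client}" / "skills",     # user, client-native
        home / "skills",                    # user, cross-client
    ]

def discover(project: Path, home: Path, client: str) -> dict[str, Path]:
    found: dict[str, Path] = {}
    for root in discovery_paths(project, home, client):
        if not root.is_dir():
            continue
        for skill_dir in sorted(root.iterdir()):
            if (skill_dir / "SKILL.md").is_file():
                # earlier (higher-precedence) roots win on duplicate names
                found.setdefault(skill_dir.name, skill_dir)
    return found
```

A directory only counts as a skill if it contains a SKILL.md, which is what makes the convention cheap to scan.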
Current GAD status

GAD now treats vendor/get-anything-done/skills/ as the authored source of truth. Runtime-native layouts such as .claude/, .codex/, and generated commands/ are transpiled outputs, not canonical repo content. gad install now reads canonical skills first and emits the runtime-specific shape each client needs. The full findability writeup is at .planning/docs/SKILL-FINDABILITY-2026-04-09.md. Standardizing authored skills under the repo-root skills/ convention started in task 22-46 and is now the active framework contract.

Name collision handling

Per the standard, when two skills share the same name field: project-level skills override user-level skills. Within the same scope, first-found or last-found is acceptable, but the choice must be consistent, and collisions must be logged so the user knows a skill was shadowed. GAD adds a stronger requirement per GAD-D-81: skills must answer "what can this do that no other skill can?" — if the answer is unclear, they are merge candidates rather than collisions. A skill-collision detection scan is queued as task 22-49 to catch overlapping trigger descriptions before they manifest as ambiguous routing at runtime.
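The collision rule can be sketched directly: project scope shadows user scope, same-scope duplicates resolve first-found, and every shadowing event is logged. The function name and the list-of-names input are illustrative, not part of the standard.

```python
# Sketch of name-collision handling: project shadows user, first-found
# wins within a scope, and shadowed skills are logged so the user can
# see what was hidden. Structure here is illustrative.
import logging

logger = logging.getLogger("skills")

def resolve(project_skills: list[str], user_skills: list[str]) -> dict[str, str]:
    """Map skill name -> winning scope, logging anything shadowed."""
    winners: dict[str, str] = {}
    for scope, names in (("project", project_skills), ("user", user_skills)):
        for name in names:
            if name in winners:
                logger.warning("skill %r (%s) shadowed by %s copy",
                               name, scope, winners[name])
            else:
                winners[name] = scope  # first-found wins, consistently
    return winners

print(resolve(["deploy", "lint"], ["lint", "format"]))
# → {'deploy': 'project', 'lint': 'project', 'format': 'user'}
```

The user-level "lint" loses to the project copy and a warning records the shadowing, which is exactly the observability the standard asks for.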

Per-skill evaluation methodology

Directly from agentskills.io/skill-creation/evaluating-skills. Distinct from GAD's eval-project rubric (which scores a whole build like escape-the-dungeon). This methodology scores individual skills.

1. Test cases in evals/evals.json
Every skill stores its test cases alongside it. Each test has a prompt, an expected-output description, optional input files, and assertions. Start with 2-3 cases, expand after the first iteration.
2. Run each case twice
with_skill vs without_skill. Same prompt, same inputs, clean context. The baseline (no skill) is what the agent does on its own. Improving over baseline is what you're measuring.
3. Capture timing
Tokens and duration per run, stored in timing.json. Lets you see the cost of the skill, not just the benefit.
4. Grade assertions + iterate
Each assertion gets PASS or FAIL with concrete evidence. All iterations aggregate into benchmark.json with delta (with_skill − without_skill) per metric. The loop is: grade → review → propose improvements → rerun → grade.

GAD adoption status: not yet. The per-skill methodology is queued under task 22-50 + the triumvirate audit. Once built, per-skill evaluation is the direct answer to the programmatic-eval gap G2 in .planning/docs/GAPS.md (skill-trigger coverage) and the per-skill effectiveness half of G11 (skill-inheritance hygiene).

Three testing layers (Anthropic guide)

From the Anthropic skills guide. Complementary to the agentskills.io with_skill vs without_skill methodology — this taxonomy distinguishes what you're testing.

layer 1
Triggering tests
Does the skill load when it should? Test with obvious prompts, paraphrased prompts, and negative (unrelated) prompts. The skill should trigger on the first two and NOT trigger on the third.
layer 2
Functional tests
Does the skill produce correct outputs? Valid outputs, API calls succeed, error handling works, edge cases covered. This is where the agentskills.io assertion-based grading fits.
layer 3
Performance comparison
Does the skill actually improve results vs baseline? The with_skill vs without_skill pattern. Improvements in task completion rate, reduction in back-and-forth messages, fewer failed API calls, lower token usage.
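The layer-1 pattern can be made concrete with a toy harness: one obvious prompt, one paraphrase, one negative prompt. The keyword matcher below is a stand-in assumption for the real routing an agent client does; the three prompt categories are the point, not the matcher.

```python
# Sketch of layer-1 triggering tests. The keyword matcher is a toy
# stand-in for real skill routing; the obvious/paraphrased/negative
# case split is what the Anthropic guide prescribes.

def would_trigger(description: str, prompt: str) -> bool:
    """Toy routing: trigger if any long description word appears in the prompt."""
    keywords = {w.lower().strip(".,") for w in description.split() if len(w) > 3}
    return any(w in prompt.lower() for w in keywords)

description = "Extract and summarize text from PDF files."

cases = {
    "obvious":     ("Summarize this PDF for me", True),
    "paraphrased": ("Can you extract the main ideas from report.pdf?", True),
    "negative":    ("What's the weather in Berlin?", False),  # must NOT trigger
}
for name, (prompt, expected) in cases.items():
    assert would_trigger(description, prompt) == expected, name
```

The negative case is the one teams most often skip, and it is the one that catches over-triggering before users do.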