Evals
Leadtype treats agent-facing docs as behavior, not just files on disk. The repo includes eval harnesses that run real models against generated artifacts and check both the final answer and which context files the agent actually read.
What we benchmark
| Benchmark | Files under test | What it checks |
|---|---|---|
| Package docs | AGENTS.md + docs/*.md | Installs a packed leadtype tarball into a sandbox project and checks whether coding agents discover node_modules/leadtype/AGENTS.md, read the right markdown topic, and complete the task. |
| Hosted docs | llms.txt + markdown mirrors + llms-full.txt variants | Simulates a hosted docs web root as local files. Agents start at /llms.txt, then choose page-level markdown, root llms-full.txt, or experimental grouped/router formats depending on the variant under test. |
The hosted-docs benchmark uses the same nine-page corpus for every variant: quickstart, how-it-works, frontmatter, components, connect-docs-site, package-docs bundle, CLI, LLM bundles, and Search. Those pages are split across five groups: Get Started, Authoring, Build, Ship Package Docs, and Reference.
| Variant | Root /llms.txt pattern | Full-context content pattern |
|---|---|---|
| Page-level .md links | Lists every page-level /docs/*.md link grouped by section. | No full-context file is part of the intended path. |
| Explicit group bundle links | Links directly to /docs/llms-full/<group>.txt files. | Each group bundle contains only the pages in that group. |
| Root llms-full.txt monolith | Links to one root /llms-full.txt. | Root llms-full.txt contains every generated markdown page flattened into one file. |
| Root llms-full.txt router | Links to root /llms-full.txt. | Root llms-full.txt is only a router: it links to /docs/llms-full/<group>.txt; each group bundle contains that group's pages. |
| Section llms.txt indexes | Links to /docs/<group>/llms.txt section indexes. | Each section index links page-level markdown first, plus an optional /docs/llms-full/<group>.txt group bundle. |
What we learned
In the hosted-docs benchmark, monolithic /llms-full.txt was the only tested format that passed all six fixtures on both Claude Opus 4.7 and GPT-5.5.
| Variant | Claude Opus 4.7 | GPT-5.5 | Readout |
|---|---|---|---|
| Root llms-full.txt monolith | 6/6 | 6/6 | Most reliable tested fallback. |
| Page-level .md links | 4/6 | 5/6 | Cheap and natural, but not always enough for synthesis tasks. |
| Root llms-full.txt router | 5/6 | 4/6 | Promising, but model-dependent. |
| Section llms.txt indexes | 4/6 | 5/6 | Promising, but adds more public artifacts. |
| Explicit group bundle links | 2/6 | 4/6 | Agents often answered correctly without following the intended bundle links. |
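As a quick way to read the table, the sketch below tallies the full passes per variant. The numbers are copied from the table above; the dictionary layout and helper name are illustrative assumptions, not part of the eval harness.

```python
# Full passes out of six fixtures per model, copied from the results table.
FIXTURES = 6
results = {
    "Root llms-full.txt monolith": {"Claude Opus 4.7": 6, "GPT-5.5": 6},
    "Page-level .md links": {"Claude Opus 4.7": 4, "GPT-5.5": 5},
    "Root llms-full.txt router": {"Claude Opus 4.7": 5, "GPT-5.5": 4},
    "Section llms.txt indexes": {"Claude Opus 4.7": 4, "GPT-5.5": 5},
    "Explicit group bundle links": {"Claude Opus 4.7": 2, "GPT-5.5": 4},
}


def passed_everywhere(scores: dict) -> bool:
    """True if every model passed all fixtures for this variant."""
    return all(passes == FIXTURES for passes in scores.values())


# Variants that passed all fixtures on both models.
perfect = [variant for variant, scores in results.items() if passed_everywhere(scores)]

# Combined passes across both models (out of 12).
combined = {variant: sum(scores.values()) for variant, scores in results.items()}

print(perfect)
for variant, total in sorted(combined.items(), key=lambda item: -item[1]):
    print(f"{total:>2}/12  {variant}")
```

The combined totals hide the per-model differences, so the per-model columns remain the more informative view.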
The stricter context-selection check matters: a model can answer correctly from /llms.txt summaries or prior knowledge, but that does not prove a proposed artifact shape made it choose the right context. A full pass means the model answered correctly and followed the intended context path for that variant.
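The two-part pass criterion can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the `FixtureRun` record, the prefix-matching rule, and the `build.txt` bundle name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FixtureRun:
    """One model run against one fixture (hypothetical record shape)."""
    answer_correct: bool
    files_read: list = field(default_factory=list)


def followed_intended_path(files_read, required_prefixes) -> bool:
    """True if, for every required prefix, the agent read at least one matching file."""
    return all(
        any(path.startswith(prefix) for path in files_read)
        for prefix in required_prefixes
    )


def is_full_pass(run: FixtureRun, required_prefixes) -> bool:
    # A full pass needs both a correct answer and the intended context path.
    return run.answer_correct and followed_intended_path(run.files_read, required_prefixes)


# Assumed expectation for the "explicit group bundle links" variant:
# start at /llms.txt, then open at least one group bundle.
group_bundle_path = ["/llms.txt", "/docs/llms-full/"]

# Correct answer, but the agent never opened a group bundle.
shortcut = FixtureRun(answer_correct=True, files_read=["/llms.txt"])

# Correct answer and a (hypothetically named) group bundle was read.
intended = FixtureRun(
    answer_correct=True,
    files_read=["/llms.txt", "/docs/llms-full/build.txt"],
)

print(is_full_pass(shortcut, group_bundle_path))  # False
print(is_full_pass(intended, group_bundle_path))  # True
```

Under this rule, the first run counts as a correct answer but not as evidence that the variant's artifact shape steered the agent.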
Current default
Leadtype keeps the public website artifact set small:
- /llms.txt routes agents to page-level markdown first.
- /llms-full.txt is the broad all-docs fallback when page links are not enough.
- Groups still organize navigation, llms.txt sections, search metadata, and AGENTS.md; they are not published as per-group full-context files by default.
Open question
The current benchmark uses a small docs corpus. Larger projects may suffer from a monolithic llms-full.txt because of token cost or truncation. Keep grouped and section-index variants in the eval harness so larger-corpus benchmarks can revisit that tradeoff before adding more default public artifacts.
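One rough way to see when a monolith might become a problem is to estimate its size against a context budget. The sketch below is an assumption-heavy illustration: the four-characters-per-token ratio is a common heuristic rather than a real tokenizer, and the page bodies and budgets are placeholder values.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the ratio is a heuristic, not a tokenizer."""
    return int(len(text) / chars_per_token)


def monolith_fits(pages: dict, token_budget: int) -> bool:
    """Check whether all pages, flattened into one file, stay within the budget."""
    flattened = "\n\n".join(pages.values())
    return estimate_tokens(flattened) <= token_budget


# Placeholder corpus: page path -> markdown body (sizes are made up).
pages = {
    "/docs/quickstart.md": "x" * 4_000,
    "/docs/cli.md": "x" * 8_000,
}

print(monolith_fits(pages, token_budget=5_000))  # True
print(monolith_fits(pages, token_budget=2_000))  # False
```

A check like this only flags size; it says nothing about whether a model uses a large file well, which is what the larger-corpus benchmarks would need to measure.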
Run the evals
The detailed harness docs live in the repository's evals/README.md.