Evals

Leadtype treats agent-facing docs as behavior, not just files on disk. The repo includes eval harnesses that run real models against generated artifacts and check both the final answer and which context files the agent actually read.

What we benchmark

| Benchmark | Files under test | What it checks |
| --- | --- | --- |
| Package docs | `AGENTS.md` + `docs/*.md` | Installs a packed leadtype tarball into a sandbox project and checks whether coding agents discover `node_modules/leadtype/AGENTS.md`, read the right markdown topic, and complete the task. |
| Hosted docs | `llms.txt` + markdown mirrors + `llms-full.txt` variants | Simulates a hosted docs web root as local files. Agents start at `/llms.txt`, then choose page-level markdown, root `llms-full.txt`, or experimental grouped/router formats depending on the variant under test. |

The hosted-docs benchmark uses the same nine-page corpus for every variant: quickstart, how-it-works, frontmatter, components, connect-docs-site, package-docs bundle, CLI, LLM bundles, and Search. Those pages are split across five groups: Get Started, Authoring, Build, Ship Package Docs, and Reference.

| Variant | Root `/llms.txt` pattern | Full-context content pattern |
| --- | --- | --- |
| Page-level `.md` links | Lists every page-level `/docs/*.md` link grouped by section. | No full-context file is part of the intended path. |
| Explicit group bundle links | Links directly to `/docs/llms-full/<group>.txt` files. | Each group bundle contains only the pages in that group. |
| Root `llms-full.txt` monolith | Links to one root `/llms-full.txt`. | Root `llms-full.txt` contains every generated markdown page flattened into one file. |
| Root `llms-full.txt` router | Links to root `/llms-full.txt`. | Root `llms-full.txt` is only a router: it links to `/docs/llms-full/<group>.txt`; each group bundle contains that group's pages. |
| Section `llms.txt` indexes | Links to `/docs/<group>/llms.txt` section indexes. | Each section index links page-level markdown first, plus an optional `/docs/llms-full/<group>.txt` group bundle. |

What we learned

In the hosted-docs benchmark, the monolithic root `llms-full.txt` was the only tested format that passed all six fixtures on both Claude Opus 4.7 and GPT-5.5.

| Variant | Claude Opus 4.7 | GPT-5.5 | Readout |
| --- | --- | --- | --- |
| Root `llms-full.txt` monolith | 6/6 | 6/6 | Most reliable tested fallback. |
| Page-level `.md` links | 4/6 | 5/6 | Cheap and natural, but not always enough for synthesis tasks. |
| Root `llms-full.txt` router | 5/6 | 4/6 | Promising, but model-dependent. |
| Section `llms.txt` indexes | 4/6 | 5/6 | Promising, but adds more public artifacts. |
| Explicit group bundle links | 2/6 | 4/6 | Agents often answered correctly without following the intended bundle links. |
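
The results table can also be read mechanically. The sketch below copies the pass counts from the table into a small TypeScript structure and derives which variants passed every fixture on both models, plus each variant's worst-case score. The `Score` type and the variable names are illustrative only and are not part of the eval harness.

```typescript
// Pass counts copied from the results table; the type and names are
// illustrative and not part of the leadtype eval harness.
type Score = { variant: string; claude: number; gpt: number };

const FIXTURES = 6;

const scores: Score[] = [
  { variant: "Root llms-full.txt monolith", claude: 6, gpt: 6 },
  { variant: "Page-level .md links", claude: 4, gpt: 5 },
  { variant: "Root llms-full.txt router", claude: 5, gpt: 4 },
  { variant: "Section llms.txt indexes", claude: 4, gpt: 5 },
  { variant: "Explicit group bundle links", claude: 2, gpt: 4 },
];

// Variants that passed every fixture on both models.
const passedOnBoth = scores
  .filter((s) => s.claude === FIXTURES && s.gpt === FIXTURES)
  .map((s) => s.variant);

// Worst-case score per variant: the lower of the two model results.
const worstCase = scores.map((s) => ({
  variant: s.variant,
  worst: Math.min(s.claude, s.gpt),
}));

console.log(passedOnBoth);
console.log(worstCase);
```

Looking at the worst case rather than the average makes the model dependence visible: three variants bottom out at 4/6 and the explicit group bundle links at 2/6, while only the monolith holds at 6/6.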

The stricter context-selection check matters: a model can answer correctly from `/llms.txt` summaries or prior knowledge, but that does not prove the artifact shape under test led it to the right context. A full pass means the model answered correctly and followed the intended context path for that variant.
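
As a minimal sketch of that two-part check, assuming a hypothetical `FixtureRun` record (the field names below are illustrative and not the harness's real schema), a full pass can be expressed as a conjunction of answer correctness and context-path adherence. The sketch also assumes that "followed the intended path" means every intended file was read; the actual harness may apply a different rule per variant.

```typescript
// Hypothetical shape of one eval run; not the harness's actual schema.
interface FixtureRun {
  answerCorrect: boolean;
  filesRead: string[]; // context files the agent actually opened
  intendedPaths: string[]; // files the variant is meant to steer the agent toward
}

// Assumption: the path counts as followed only if every intended file was read.
function followedIntendedPath(run: FixtureRun): boolean {
  return run.intendedPaths.every((path) => run.filesRead.includes(path));
}

// A full pass requires both a correct answer and the intended context path.
function isFullPass(run: FixtureRun): boolean {
  return run.answerCorrect && followedIntendedPath(run);
}

// A run that answered correctly from the /llms.txt summaries alone.
const answeredFromSummary: FixtureRun = {
  answerCorrect: true,
  filesRead: ["/llms.txt"],
  intendedPaths: ["/llms.txt", "/llms-full.txt"],
};

console.log(isFullPass(answeredFromSummary)); // false
```

Under this rule, a correct answer that never touched the intended file is recorded as a miss, which is how a variant can score low even when most answers were right.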

Current default

Leadtype keeps the public website artifact set small:

public/
├── llms.txt
├── llms-full.txt
└── docs/*.md

`/llms.txt` routes agents to page-level markdown first. `/llms-full.txt` is the broad all-docs fallback when page links are not enough. Groups still organize navigation, `llms.txt` sections, search metadata, and `AGENTS.md`; they are not published as per-group full-context files by default.

Open question

The current benchmark uses a small docs corpus. In a larger project, a monolithic `llms-full.txt` may cost too many tokens or get truncated before the agent reaches the relevant page. The grouped and section-index variants stay in the eval harness so that larger-corpus benchmarks can revisit that tradeoff before any more public artifacts become defaults.
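
To get a rough feel for when a monolith stops being practical, one option is to estimate its token footprint before publishing it. The sketch below uses the common rule of thumb of about four characters per token for English text, which is a heuristic and not a real tokenizer. `estimateTokens`, `fitsBudget`, and the budget value are hypothetical names and numbers, not part of leadtype.

```typescript
// Rough heuristic: about 4 characters per token for English prose.
// This is an approximation, not an exact tokenizer count.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Returns true when all pages, concatenated into one monolithic file,
// are estimated to stay within the given token budget.
function fitsBudget(pages: string[], tokenBudget: number): boolean {
  const total = pages.reduce((sum, page) => sum + estimateTokens(page), 0);
  return total <= tokenBudget;
}

// Made-up sizes for illustration: nine pages of 4,000 characters each.
const pages = Array.from({ length: 9 }, () => "x".repeat(4000));
console.log(fitsBudget(pages, 10_000)); // true: about 9,000 estimated tokens
```

A check like this could flag the point at which a docs corpus outgrows a single fallback file, which is when grouped or section-index variants become worth re-benchmarking.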

Run the evals

The detailed harness docs live in the repository's `evals/README.md`.

cd evals
bun run evals
bun run evals:llms -- --model gpt-5.5