Evals

Leadtype treats agent-facing docs as behavior, not just files on disk. The repo ships eval harnesses (under evals/) that run real coding agents against generated artifacts and grade them with an independent LLM judge.

Read this if you want the evidence behind three claims: that bundling docs into a package measurably helps coding agents, that the default llms.txt + llms-full.txt shape routes agents most reliably, and that agents rarely discover llms.txt on their own unless pointed at it.

How the harness works

Each eval is one cell — a fixture (task) × arm × model — run 10 times. A run:

Spins up an isolated temp project and npm installs a packed leadtype tarball, so node_modules/leadtype/ looks like a real install.
Lets the agent work with a small set of path-scoped tools (read, write, list, glob, grep, and a narrow npm). No shell, no network beyond the model call.
Sends the agent's answer (and any files it wrote) to an LLM judge, which grades it against a per-fixture rubric and returns pass/fail + a 0–100 score + a failure-mode classification. We judge with deepseek-v4-pro — a model from a family with no candidate under test, so it has no same-family preference. The judge is the headline metric; "did the agent read our files" is tracked separately as supporting evidence.

Pass rates are reported with a Wilson 95% confidence interval, and every run's transcript, judge verdict, and produced files are archived under evals/results/ for inspection. Except where noted (the cost table was re-measured 2026-06-01 after a token-counting fix), the numbers below are from the 2026-05-31 run: the package benchmark uses 5 models across 4 families (Anthropic Haiku 4.5 / Opus 4.8, OpenAI GPT‑5.5, Moonshot Kimi K2.6, Google Gemini 3.5 Flash) × 4 arms × 10 runs, judged by deepseek-v4-pro and cross-validated with grok-4.3. The hosted-docs benchmarks use a 3-model subset (Haiku, Opus, GPT‑5.5).

Does bundling docs help? (package benchmark)

The package benchmark answers the core question with an A/B test:

Treatment — the installed leadtype ships its bundled AGENTS.md + docs/*.md.
Control — those files are stripped (along with dist/*.map source maps, which would otherwise leak the original commented source). The compiled JS — including the CLI's --help text — and the .d.ts types stay. An honest "a package that simply didn't bundle agent docs," not "an agent with no information."

Three things change when the docs are present, and they differ in how universal they are. Two hold for every model and are the wins to lead with; the third is real but uneven:

Every run gets cheaper — universal, frontier models included.
Agents stop confidently guessing wrong about the API — the clearest, most judge-robust correctness win.
Raw accuracy rises — but mostly for the small, cheap models most agents run on; the frontier lift is small and judge-sensitive.

Every run gets cheaper

Pooled across fixtures, bundling docs cut the tokens every model spent — the agent reads one short doc instead of probing the package with repeated grep/read/list calls. Tokens (input + output, summed across every step of the tool loop) drop 32–54%, and tool calls drop for every model too:

Model	Tokens (docs → none)	Tool calls (docs → none)	Wall-clock (docs → none)
`claude-haiku-4.5`	115.8k → 227.0k (−49%)	15.1 → 18.2 (−17%)	54.2s → 41.4s (+31%)
`claude-opus-4.8`	98.8k → 215.4k (−54%)	9.1 → 15.5 (−41%)	38.5s → 71.4s (−46%)
`gemini-3.5-flash`	228.1k → 422.3k (−46%)	14.0 → 27.5 (−49%)	60.5s → 136.6s (−56%)
`kimi-k2.6`	153.5k → 309.3k (−50%)	15.6 → 20.1 (−22%)	52.9s → 79.9s (−34%)
`gpt-5.5`	160.3k → 234.5k (−32%)	14.4 → 15.5 (−7%)	51.4s → 61.2s (−16%)

Even GPT‑5.5, which barely needs docs for correctness, spent 32% fewer tokens with them — the token win holds regardless of model tier or whether docs ever move the pass rate. Wall-clock is not a claim we lean on: it tracks gateway latency more than work done, and is unstable run-to-run (Haiku came out slower with docs here, −37% in an earlier run). Tokens and tool calls are the defensible cost signal.

Cost re-measured after a token-counting fix

The cost figures above are from a 2026-06-01 treatment-vs-control re-run. The original harness recorded only the final step of each tool loop (result.usage) rather than the sum across all steps (result.totalUsage), which undercounted per-run tokens several-fold. The fix doesn't touch the pass-rate or confident-wrong results below — those come from the 2026-05-31 run and only depend on the judge's verdict.

Agents stop confidently guessing wrong

The judge classifies each failure, and the dangerous mode is confidently wrong — asserting the opposite of the truth with no hedging. Docs cut it across the board:

Model	Confident-wrong: control → treatment
`claude-haiku-4.5`	28% → 10%
`claude-opus-4.8`	7% → 0%
`gemini-3.5-flash`	8% → 0%
`gpt-5.5`	8% → 0%
`kimi-k2.6`	3% → 2%

The clearest case is nav-unknown-group — "what happens if a page declares a group the config doesn't know?" The build actually fails, but without docs agents don't say "I don't know"; they confidently assert the wrong answer. Bundling docs is cheap insurance against exactly that, and it's the most defensible reason to ship them — it doesn't depend on the model being small or the judge being generous.

Accuracy lift — concentrated in small models

Bundling docs also lifts raw task success, but the gain is largest for the smaller, cheaper models that most agents actually run on, and marginal for frontier models that recover more from the package alone:

Model	With bundled docs	Without (control)	Lift
`claude-haiku-4.5` (small)	87%	70%	+17 pts
`gemini-3.5-flash` (small)	100%	85%	+15 pts
`gpt-5.5` (frontier)	98%	87%	+12 pts
`claude-opus-4.8` (frontier)	98%	92%	+7 pts
`kimi-k2.6` (frontier)	98%	95%	+3 pts

(Pooled over all fixtures, n = 60 runs per cell, judged by neutral deepseek-v4-pro.)

Decomposing the value: code vs. docs vs. memory

A bare arm installs nothing (pure training-memory recall); control adds the compiled package; treatment adds the docs; pointer adds leadtype's recommended root AGENTS.md pointer. The ladder shows exactly what each layer buys:

Model	bare (memory)	→ control (+code)	→ treatment (+docs)	→ pointer
`claude-haiku-4.5`	22%	70%	87%	93%
`gemini-3.5-flash`	78%	85%	100%	98%
`gpt-5.5`	90%	87%	98%	98%
`claude-opus-4.8`	87%	92%	98%	100%
`kimi-k2.6`	88%	95%	98%	98%

Small models are nearly helpless from memory (Haiku 22%) and gain from both the installed code and the docs; frontier models already know ~90% and docs add the last few points.

Where the lift comes from

The per-fixture breakdown is more telling than the average: almost all of the accuracy lift comes from two behavioral gotchas the compiled package can't self-document.

nav-unknown-group (+28 pooled) — the confident-wrong case above; docs turn a confident wrong answer into a correct one.
search-when-embeddings (+28 pooled) — "when is the default static BM25 index enough vs. embeddings?" Another non-obvious behavioral rule recoverable only from prose.
explain-cli-flag, mounted-changelog-urls, bundle-rationale, custom-generate-script — roughly flat (−2 to +6). Recoverable from the CLI's --help, the .d.ts types, or the README. We keep them to show the benchmark isn't cherry-picked and to pin down where docs don't pay off.

So docs add little for facts your CLI, types, or README already expose — the accuracy payoff is concentrated where the library's behavior is non-obvious, and largest for smaller models. Write for agents turns this into an authoring rule.

Judge choice matters

LLM judges can favor their own family. Once Gemini was a candidate, the judge had to come from a family with no candidate, so we grade with deepseek-v4-pro and re-graded the headline arms with a second neutral judge, grok-4.3:

Lift (treatment − control)	Haiku	Gemini	GPT‑5.5	Opus	Kimi
`deepseek-v4-pro` (canonical)	+17	+15	+12	+7	+3
`grok-4.3` (cross-check)	+27	+8	+7	+0	+2

Both neutral judges agree on the direction and rank order — small models gain most, frontier least. Frontier magnitudes are judge-sensitive (Opus +7 vs +0) but small either way, so the qualitative claim holds. Reproduce with bun run rejudge <runDir> --judge xai/grok-4.3 --arms treatment,control.

Which llms.txt shape routes agents best? (hosted-docs benchmark)

The hosted-docs benchmark simulates a hosted docs web root as local files (3-model subset). Agents start at /llms.txt and follow links; a "pass" means the judge marked the answer correct, and context match means the agent actually read the context path the variant intends. A shape only earns trust when both are high.

What we found

Pass rate is similar across shapes, so the discriminating metric is context match — did the agent follow the path the shape intends?

Variant	Context match	Verdict
Page-level `.md` links	100%	Agents follow it reliably
Section `llms.txt` indexes	88%	Solid
Root `llms-full.txt` monolith	83%	Reliable broad fallback
Root `llms-full.txt` router	28%	Agents bypass the intended links
Explicit group bundles	26%	Agents bypass the intended links

page-links, monolith, and section-indexes are the shapes agents reliably use as intended. explicit-bundles and router look fine on pass rate but agents don't follow their intended path — they answer from the /llms.txt summary or grab whatever bundle looks closest. The extra structure doesn't earn its keep.

Agents rarely discover `llms.txt` on their own

The routing benchmark tells agents to start at /llms.txt. A separate discovery arm drops that hint and serves a realistic web root (docs pages + llms.txt + llms-full.txt + sitemap + robots). Agents then consult /llms.txt only ~29% of the time — they mostly grep the doc pages directly and still answer correctly. So llms.txt earns its keep most as a pointed-to entry (a root AGENTS.md, or a tool that fetches it), not as an organically-discovered one.

Current defaults

Leadtype keeps the public website artifact set small:

public/
├── llms.txt
├── llms-full.txt
└── docs/*.md

/llms.txt routes agents to page-level markdown first; /llms-full.txt is the broad all-docs fallback. Groups organize navigation, llms.txt sections, search metadata, and AGENTS.md; they aren't published as per-group full-context files by default.

For packages, leadtype generate --bundle ships AGENTS.md + docs/*.md inside the tarball — and the recommended consumer setup is a root AGENTS.md pointing at node_modules/leadtype/AGENTS.md, which the benchmark's pointer arm shows beats leaving the agent to find the bundle on its own.

Run the evals

cd evals
bun install && bun run pack-leadtype
bun run evals:full:arms       # package: 5 models × bare/control/treatment/pointer × 10
bun run evals:llms:full        # hosted-docs routing (3-model subset)
bun run evals:llms:discovery   # unhinted llms.txt discovery

The detailed harness docs live in evals/README.md.