---
title: Evals
description: >-
  How Leadtype benchmarks AGENTS.md, llms.txt, and llms-full.txt output before
  changing defaults.
group: reference
lastModified: '2026-05-11T20:02:32-07:00'
lastAuthor: 'github-actions[bot]'
---
# Evals

Leadtype treats agent-facing docs as behavior, not just files on disk. The repo includes eval harnesses that run real models against generated artifacts and check both the final answer and which context files the agent actually read.

## What we benchmark

|Benchmark|Files under test|What it checks|
|--|--|--|
|Package docs|`AGENTS.md` + `docs/*.md`|Installs a packed leadtype tarball into a sandbox project and checks whether coding agents discover `node_modules/leadtype/AGENTS.md`, read the right markdown topic, and complete the task.|
|Hosted docs|`llms.txt` + markdown mirrors + `llms-full.txt` variants|Simulates a hosted docs web root as local files. Agents start at `/llms.txt`, then choose page-level markdown, root `llms-full.txt`, or experimental grouped/router formats depending on the variant under test.|
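
For the package-docs benchmark, the sandbox project ends up looking roughly like this (a layout sketch only; the exact setup lives in the harness):

```txt
sandbox/
├── package.json
└── node_modules/
    └── leadtype/
        ├── AGENTS.md
        └── docs/*.md
```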

The hosted-docs benchmark uses the same nine-page corpus for every variant:
quickstart, how-it-works, frontmatter, components, connect-docs-site,
package-docs bundle, CLI, LLM bundles, and search. Those pages are split across
five groups: Get Started, Authoring, Build, Ship Package Docs, and Reference.

|Variant|Root `/llms.txt` pattern|Full-context content pattern|
|--|--|--|
|Page-level `.md` links|Lists every page-level `/docs/*.md` link grouped by section.|No full-context file is part of the intended path.|
|Explicit group bundle links|Links directly to `/docs/llms-full/<group>.txt` files.|Each group bundle contains only the pages in that group.|
|Root `llms-full.txt` monolith|Links to one root `/llms-full.txt`.|Root `llms-full.txt` contains every generated markdown page flattened into one file.|
|Root `llms-full.txt` router|Links to root `/llms-full.txt`.|Root `llms-full.txt` is only a router: it links to `/docs/llms-full/<group>.txt`; each group bundle contains that group's pages.|
|Section `llms.txt` indexes|Links to `/docs/<group>/llms.txt` section indexes.|Each section index links page-level markdown first, plus an optional `/docs/llms-full/<group>.txt` group bundle.|
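
Put together, a variant's simulated web root serves some subset of these paths (derived from the table above; an illustrative layout, not a literal listing):

```txt
webroot/
├── llms.txt                    # root index; link pattern varies by variant
├── llms-full.txt               # monolith and router variants only
└── docs/
    ├── *.md                    # page-level markdown mirrors
    ├── <group>/llms.txt        # section-index variant only
    └── llms-full/<group>.txt   # group bundles (grouped, router, and section variants)
```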

## What we learned

In the hosted-docs benchmark, monolithic `/llms-full.txt` was the only tested format that passed all six fixtures on both Claude Opus 4.7 and GPT-5.5.

|Variant|Claude Opus 4.7|GPT-5.5|Readout|
|--|--|--|--|
|Root `llms-full.txt` monolith|6/6|6/6|Only variant to pass every fixture on both models; kept as the broad fallback.|
|Page-level `.md` links|4/6|5/6|Cheap and natural, but not always enough for synthesis tasks.|
|Root `llms-full.txt` router|5/6|4/6|Promising, but model-dependent.|
|Section `llms.txt` indexes|4/6|5/6|Promising, but adds more public artifacts.|
|Explicit group bundle links|2/6|4/6|Agents often answered correctly without following the intended bundle links.|

The stricter context-selection check matters: a model can answer correctly from `/llms.txt` summaries or prior knowledge, but that does not prove a proposed artifact shape made it choose the right context. A full pass means the model answered correctly *and* followed the intended context path for that variant.
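
A minimal sketch of that two-part pass criterion, with illustrative types only (the real fixture and grading code live in `evals/`):

```ts
// Illustrative shapes; the actual harness types differ.
interface Fixture {
  prompt: string; // task given to the agent
  expectedPaths: string[]; // context files the variant intends the agent to read
  isCorrect: (answer: string) => boolean; // grades the final answer
}

function fullPass(fixture: Fixture, answer: string, pathsRead: string[]): boolean {
  const answeredCorrectly = fixture.isCorrect(answer);
  // A correct answer from /llms.txt summaries or prior knowledge alone is not enough:
  // the agent must also have opened the files the variant was designed to route it to.
  const followedIntendedPath = fixture.expectedPaths.every((p) => pathsRead.includes(p));
  return answeredCorrectly && followedIntendedPath;
}
```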

## Current default

Leadtype keeps the public website artifact set small:

```txt
public/
├── llms.txt
├── llms-full.txt
└── docs/*.md
```

`/llms.txt` routes agents to page-level markdown first. `/llms-full.txt` is the broad all-docs fallback when page links are not enough. Groups still organize navigation, `llms.txt` sections, search metadata, and `AGENTS.md`; they are not published as per-group full-context files by default.
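
Abbreviated, a default `/llms.txt` looks something like this (the section and page entries shown are illustrative; the generated file lists every page grouped by section):

```txt
# Leadtype

## Get Started

- [Quickstart](/docs/quickstart.md)

## Reference

- [CLI](/docs/cli.md)
```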

## Open question

The current benchmark uses a small docs corpus. Larger projects may suffer with a monolithic `llms-full.txt` because of token cost or truncation. The grouped and section-index variants stay in the eval harness so larger-corpus benchmarks can revisit that tradeoff before more default public artifacts are added.

## Run the evals

The detailed harness docs live in the repository's `evals/README.md`.

```bash
cd evals
bun run evals
bun run evals:llms -- --model gpt-5.5
```
