---
title: Stream AI answers
description: Source-grounded answer streaming over the static index — Vercel AI
  SDK, TanStack AI, or Cloudflare Workers AI — behind a hardened endpoint.
related:
  - title: Add search
    href: /docs/search/add-search
    description: Generate the index and wire the search UI first.
  - title: Search reference
    href: /docs/reference/search
    description: createAnswerContext, streamDocsAnswer options, and the guard helpers.
---
Add this only after [basic search](/docs/search/add-search) works. Leadtype
retrieves the most relevant chunks from the static index, builds a **constrained
prompt** that tells the model to answer *only* from those chunks and cite them,
then streams the response. When the chunks don't cover the question, the model
is instructed to refuse rather than answer from memory.

The shape is the same across runtimes:

```text
query → searchDocs (retrieve chunks) → createAnswerContext (system + prompt + sources)
      → streamDocsAnswer → { response: text stream, sources }
```

`response` is a plain-text streamed `Response`; `sources` is citation metadata
you render **next to** the answer, never inside the streamed text.

## Pick your runtime

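`streamDocsAnswer` ships per runtime. All take `{ index, content, query }` and
return `{ response, sources }`. They differ in how you specify the model (a
`model` id or an `adapter`) and in the name of the output cap
(`maxOutputTokens` for the Vercel AI SDK, `maxTokens` for the other two).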

**Vercel AI SDK**

```ts
import { streamDocsAnswer } from "leadtype/search/vercel";

const { response, sources } = streamDocsAnswer({
  index,
  content,
  query,
  model: "openai/gpt-5.5", // any AI SDK / AI Gateway model id
  productName: "My Library",
  maxOutputTokens: 2000,
});
```

This variant calls `streamText` from the `ai` package. Authentication comes from
your AI Gateway or provider environment variables.

**TanStack AI**

```ts
import { streamDocsAnswer } from "leadtype/search/tanstack";
import { openai } from "@tanstack/ai-openai";

const { response, sources } = streamDocsAnswer({
  index,
  content,
  query,
  adapter: openai({ apiKey: process.env.OPENAI_API_KEY }),
  productName: "My Library",
  maxTokens: 2000,
});
```

Pass an explicit `adapter` from any `@tanstack/ai-*` provider.

**Cloudflare Workers AI**

```ts
import {
  createCloudflareDocsAdapter,
  streamDocsAnswer,
} from "leadtype/search/cloudflare";

// `env` is the bindings object your Worker's handler receives.
const adapter = createCloudflareDocsAdapter({
  provider: "workers-ai", // or anthropic | openai | gemini | grok | openrouter
  model: "@cf/meta/llama-3.1-8b-instruct",
  options: { binding: env.AI.gateway("docs") },
});

const { response, sources } = streamDocsAnswer({
  index,
  content,
  query,
  adapter,
  maxTokens: 2000,
});
```

Build an adapter from a Workers AI binding (optionally through AI Gateway).

To bring your own model loop instead, call
`createAnswerContext(index, query, { content, productName })` to get
`{ system, prompt, sources }`, then pass them to any SDK yourself.
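
If you take the bring-your-own route, you also need to turn the SDK's output into the plain-text streamed `Response` described earlier. The sketch below shows one way to do that with Web Streams. `AnswerPrompt`, `TextDeltaSource`, `toPlainTextResponse`, and `streamOwnAnswer` are hypothetical names, not part of leadtype, and the sketch assumes your SDK can expose its output as an async iterable of text deltas.

```ts
// Hypothetical stand-in for the `system` and `prompt` fields that
// createAnswerContext returns; `sources` is handled separately.
type AnswerPrompt = { system: string; prompt: string };

// Assumption: "any SDK" is modeled as a function that yields text deltas.
// Adapt this to the streaming call your SDK actually exposes.
type TextDeltaSource = (input: AnswerPrompt) => AsyncIterable<string>;

// Wrap text deltas in a plain-text streamed Response.
function toPlainTextResponse(deltas: AsyncIterable<string>): Response {
  const encoder = new TextEncoder();
  const iterator = deltas[Symbol.asyncIterator]();
  const body = new ReadableStream<Uint8Array>({
    async pull(controller) {
      const { value, done } = await iterator.next();
      if (done) {
        controller.close();
        return;
      }
      controller.enqueue(encoder.encode(value));
    },
    async cancel() {
      // Stop pulling from the model if the client disconnects.
      await iterator.return?.();
    },
  });
  return new Response(body, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}

function streamOwnAnswer(
  context: AnswerPrompt,
  generate: TextDeltaSource
): Response {
  return toPlainTextResponse(generate(context));
}
```

Because the stream is pull-based, the model is only read as fast as the client consumes the response, and a client disconnect ends the iteration. Return `sources` from `createAnswerContext` through a separate channel so they stay out of the streamed text.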

## Build a hardened endpoint

Answer generation accepts user input and calls a paid model, so the route needs
guards. `leadtype/search` ships the building blocks; wire them into your
framework's request handler:

```ts
import {
  createMemoryRateLimiter,
  docsSearchDefaults,
  getClientIdentifier,
  readJsonWithLimit,
  validateDocsQuery,
} from "leadtype/search";
import { streamDocsAnswer } from "leadtype/search/vercel";
import { docsSearchContent, docsSearchIndex } from "./search-data";

// Swap for Redis / KV / Durable Objects in production — this is per-instance.
const limiter = createMemoryRateLimiter({ limit: 10, windowMs: 60_000 });

export async function POST(request: Request): Promise<Response> {
  const rate = await limiter.check(`ask:${getClientIdentifier(request)}`);
  if (!rate.allowed) {
    return Response.json(
      { error: "Too many requests." },
      {
        status: 429,
        headers: {
          "Retry-After": String(Math.ceil((rate.resetAt - Date.now()) / 1000)),
        },
      }
    );
  }

  const body = await readJsonWithLimit<{ query?: unknown }>(request, {
    maxBytes: docsSearchDefaults.maxBodyBytes, // 16 KB
  });
  const query = validateDocsQuery(body.query, {
    fieldName: "query",
    maxChars: docsSearchDefaults.askMaxQueryChars, // 600
  });

  const { response } = streamDocsAnswer({
    index: docsSearchIndex,
    content: docsSearchContent,
    query,
    model: "openai/gpt-5.5",
    productName: "My Library",
  });
  return response;
}
```

`validateDocsQuery` trims and caps the text, `readJsonWithLimit` rejects
oversized bodies before parsing, and `getClientIdentifier` reads common proxy IP
headers (`cf-connecting-ip`, `x-forwarded-for`, `x-real-ip`) for the limiter key.
The handler above omits one more guard: check that provider credentials are
present and return `503` when they are not, so the UI can disable the feature.
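
A minimal sketch of that credential gate follows. `requireProviderKey` is a hypothetical helper, not a leadtype export, and the default variable name `AI_GATEWAY_API_KEY` is an assumption; substitute whatever your provider actually reads.

```ts
// Hypothetical helper: returns a 503 Response when the provider key is
// missing, or null when the handler may proceed.
function requireProviderKey(
  env: Record<string, string | undefined>,
  keyName = "AI_GATEWAY_API_KEY" // assumed name; use your provider's variable
): Response | null {
  if (env[keyName]) {
    return null;
  }
  return new Response(
    JSON.stringify({ error: "Answer generation is not configured." }),
    { status: 503, headers: { "Content-Type": "application/json" } }
  );
}
```

Call it at the top of `POST`, for example `const unavailable = requireProviderKey(process.env); if (unavailable) return unavailable;`, so a misconfigured deployment fails fast without reaching the model.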

The TanStack Start example wires exactly this for all three runtimes —
`apps/tanstack/src/routes/api/docs/ask/{vercel,tanstack,cloudflare}.ts` plus the
shared handler in `lib/provider-answer.ts`.
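
The comment in the handler notes that the memory limiter is per-instance. As a rough sketch of what a shared replacement could look like, the code below builds a fixed-window limiter on top of a pluggable counter store. Everything here is hypothetical: `RateDecision` is inferred only from how `limiter.check` is used above, and `CounterStore`, `createSharedRateLimiter`, and `createMapStore` are illustrative names rather than leadtype APIs.

```ts
// Shape inferred from the handler's use of `rate.allowed` and `rate.resetAt`;
// the real leadtype type may carry more fields.
type RateDecision = { allowed: boolean; resetAt: number };

// Hypothetical shared store: increments a counter for the key and reports
// when the current window expires. Redis, KV, or a Durable Object could
// implement this.
interface CounterStore {
  increment(
    key: string,
    windowMs: number
  ): Promise<{ count: number; resetAt: number }>;
}

function createSharedRateLimiter(
  store: CounterStore,
  options: { limit: number; windowMs: number }
): { check(key: string): Promise<RateDecision> } {
  return {
    async check(key) {
      const { count, resetAt } = await store.increment(key, options.windowMs);
      return { allowed: count <= options.limit, resetAt };
    },
  };
}

// Map-backed store used only to exercise the sketch. It is still
// per-instance, so it does not solve the production problem by itself.
function createMapStore(now: () => number = Date.now): CounterStore {
  const windows = new Map<string, { count: number; resetAt: number }>();
  return {
    async increment(key, windowMs) {
      const existing = windows.get(key);
      const entry =
        existing && existing.resetAt > now()
          ? existing
          : { count: 0, resetAt: now() + windowMs };
      entry.count += 1;
      windows.set(key, entry);
      return { count: entry.count, resetAt: entry.resetAt };
    },
  };
}
```

Keeping the `check` shape identical means the handler body does not change when the store does. A real shared store must make the increment atomic across instances, which the `Map` version does not need to worry about.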

## Consume the stream on the client

The endpoint returns a plain-text stream — read it incrementally and append:

```ts
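const response = await fetch("/api/docs/ask", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ query }),
  signal,
});
// Error responses such as 429 carry a JSON body, not an answer stream.
if (!response.ok || !response.body) {
  throw new Error(`Ask request failed with status ${response.status}`);
}

const reader = response.body.getReader();
const decoder = new TextDecoder();
let answer = "";
while (true) {
  const { value, done } = await reader.read();
  if (done) {
    break;
  }
  answer += decoder.decode(value, { stream: true });
  // render `answer` as it grows
}
// Flush any bytes still buffered from a character split across chunks.
answer += decoder.decode();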
```

Run a normal search first to show results (and sources) immediately, then stream
the answer. Use an `AbortController` so a new query cancels the in-flight one.
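
The cancel-on-new-query pattern can be sketched as below. `createLatestOnly` and `isAbortError` are hypothetical helper names introduced for illustration; only `AbortController` and `AbortSignal` are standard.

```ts
// Hypothetical helper: hands out one AbortSignal per query and aborts the
// previous one, so only the latest request keeps streaming.
function createLatestOnly(): () => AbortSignal {
  let current: AbortController | undefined;
  return () => {
    current?.abort();
    current = new AbortController();
    return current.signal;
  };
}

// An aborted fetch or reader.read() rejects with an error named "AbortError".
// The caller can ignore it because a newer query has replaced the old one.
function isAbortError(error: unknown): boolean {
  return error instanceof Error && error.name === "AbortError";
}
```

Create the function once per search box (`const nextSignal = createLatestOnly()`), call `nextSignal()` each time the user submits, and pass the result as `signal` to `fetch`. Wrap the read loop in `try`/`catch` and rethrow anything for which `isAbortError` returns `false`.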

## Verify

* The endpoint returns `429` past the rate limit and `503` when credentials are missing.
* A query streams text incrementally, not in one blob.
* The answer cites sources that match the retrieved chunks — and refuses when the docs don't cover the question.
