Stream AI answers

Add this only after basic search works. Leadtype retrieves the most relevant chunks from the static index, builds a constrained prompt that tells the model to answer only from those chunks and cite them, then streams the response. The prompt steers the model away from answering from memory.

The shape is the same across runtimes:

query → searchDocs (retrieve chunks) → createAnswerContext (system + prompt + sources)
      → streamDocsAnswer → { response: text stream, sources }

response is a plain-text streamed Response; sources is citation metadata you render next to the answer, never inside the streamed text.
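
To make "constrained prompt" concrete, here is a minimal sketch of how retrieved chunks could be turned into { system, prompt, sources }. It is not Leadtype's implementation: the RetrievedChunk type, the buildAnswerContextSketch name, and the prompt wording are all assumptions for illustration.

```typescript
// Hypothetical chunk shape; the real index entries may carry different fields.
interface RetrievedChunk {
  title: string;
  url: string;
  text: string;
}

// Illustrative only: numbers each chunk so the model can cite it as [n],
// and keeps citation metadata separate from the text sent to the model.
function buildAnswerContextSketch(
  chunks: RetrievedChunk[],
  query: string,
  productName: string
) {
  const sources = chunks.map((chunk, i) => ({
    id: i + 1,
    title: chunk.title,
    url: chunk.url,
  }));
  const excerpts = chunks
    .map((chunk, i) => `[${i + 1}] ${chunk.title}\n${chunk.text}`)
    .join("\n\n");
  const system = [
    `You answer questions about ${productName}.`,
    "Use only the numbered excerpts in the prompt.",
    "Cite excerpts as [n].",
    "If the excerpts do not cover the question, say so instead of guessing.",
  ].join(" ");
  const prompt = `Excerpts:\n${excerpts}\n\nQuestion: ${query}`;
  return { system, prompt, sources };
}
```

The sources array here is what a UI would render next to the answer, which is why it travels alongside the stream rather than inside it.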

Pick your runtime

streamDocsAnswer ships per runtime. All take { index, content, query } and return { response, sources }; they differ only in how you name the model.

const { response, sources } = streamDocsAnswer({
  index,
  content,
  query,
  model: "openai/gpt-5.5", // any AI SDK / AI Gateway model id
  productName: "My Library",
  maxOutputTokens: 2000,
});

The leadtype/search/vercel variant shown here uses the ai package's streamText. It authenticates through your AI Gateway or provider environment variables.

To bring your own model loop instead, call createAnswerContext(index, query, { content, productName }) for the { system, prompt, sources } and pass them to any SDK yourself.
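
A rough sketch of that bring-your-own path, assuming your SDK can hand back an async iterable of text deltas. The AnswerContext field types, the ModelLoop type, streamAnswerWith, and fakeModel are hypothetical names for illustration. In real code the context would come from createAnswerContext and the loop from your SDK.

```typescript
// Shape described in the text above; the exact field types are assumed.
interface AnswerContext {
  system: string;
  prompt: string;
  sources: { title: string; url: string }[];
}

// Hypothetical stand-in for any SDK call that yields text deltas.
type ModelLoop = (input: { system: string; prompt: string }) => AsyncIterable<string>;

// Wraps the deltas in a plain-text streamed Response and passes sources through.
function streamAnswerWith(context: AnswerContext, model: ModelLoop) {
  const encoder = new TextEncoder();
  const iterator = model({
    system: context.system,
    prompt: context.prompt,
  })[Symbol.asyncIterator]();

  const body = new ReadableStream<Uint8Array>({
    async pull(controller) {
      const { value, done } = await iterator.next();
      if (done) {
        controller.close();
        return;
      }
      controller.enqueue(encoder.encode(value));
    },
    async cancel() {
      // Stop the model loop if the client disconnects.
      await iterator.return?.();
    },
  });

  const response = new Response(body, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
  return { response, sources: context.sources };
}

// Fake model used only to exercise the wrapper.
async function* fakeModel(_input: { system: string; prompt: string }): AsyncGenerator<string> {
  yield "Install it ";
  yield "with npm [1].";
}
```

Pulling from the iterator on demand, rather than pushing everything in start, lets backpressure from a slow client pause the model loop.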

Build a hardened endpoint

Answer generation accepts user input and calls a paid model, so the route needs guards. leadtype/search ships the building blocks; wire them into your framework's request handler:

import {
  createMemoryRateLimiter,
  docsSearchDefaults,
  getClientIdentifier,
  readJsonWithLimit,
  validateDocsQuery,
} from "leadtype/search";
import { streamDocsAnswer } from "leadtype/search/vercel";
import { docsSearchContent, docsSearchIndex } from "./search-data";

// Swap for Redis / KV / Durable Objects in production — this is per-instance.
const limiter = createMemoryRateLimiter({ limit: 10, windowMs: 60_000 });

export async function POST(request: Request): Promise<Response> {
  const rate = await limiter.check(`ask:${getClientIdentifier(request)}`);
  if (!rate.allowed) {
    return Response.json(
      { error: "Too many requests." },
      {
        status: 429,
        headers: {
          "Retry-After": String(Math.ceil((rate.resetAt - Date.now()) / 1000)),
        },
      }
    );
  }

  const body = await readJsonWithLimit<{ query?: unknown }>(request, {
    maxBytes: docsSearchDefaults.maxBodyBytes, // 16 KB
  });
  const query = validateDocsQuery(body.query, {
    fieldName: "query",
    maxChars: docsSearchDefaults.askMaxQueryChars, // 600
  });

  const { response } = streamDocsAnswer({
    index: docsSearchIndex,
    content: docsSearchContent,
    query,
    model: "openai/gpt-5.5",
    productName: "My Library",
  });
  return response;
}

validateDocsQuery trims and caps the text, readJsonWithLimit rejects oversized bodies before parsing, and getClientIdentifier reads common proxy IP headers (cf-connecting-ip, x-forwarded-for, x-real-ip) for the limiter key. The handler above does not check credentials: add a guard that returns 503 when the provider isn't configured, so the UI can disable the feature.
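
One way to sketch the credentials gate, assuming the provider key lives in a single environment variable. The providerUnavailable helper and the default variable name AI_GATEWAY_API_KEY are assumptions for illustration; use whichever variable your provider actually reads.

```typescript
// Returns a 503 Response when the key is missing, or null when the route may proceed.
// `keyName` is an assumed default; pass the variable your provider uses.
function providerUnavailable(
  env: Record<string, string | undefined>,
  keyName = "AI_GATEWAY_API_KEY"
): Response | null {
  if (env[keyName]) {
    return null;
  }
  return new Response(
    JSON.stringify({ error: "AI answers are not configured." }),
    {
      status: 503,
      headers: { "Content-Type": "application/json" },
    }
  );
}
```

In the handler this would run before the rate-limit check, for example `const unavailable = providerUnavailable(process.env); if (unavailable) return unavailable;`, so an unconfigured deployment never reaches the model call.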

The TanStack Start example wires exactly this for all three runtimes — apps/tanstack/src/routes/api/docs/ask/{vercel,tanstack,cloudflare}.ts plus the shared handler in lib/provider-answer.ts.

Consume the stream on the client

The endpoint returns a plain-text stream — read it incrementally and append:

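const response = await fetch("/api/docs/ask", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ query }),
  signal, // from an AbortController, so a newer query can cancel this one
});
// Error responses such as 429 carry an error payload, not answer text.
if (!response.ok || !response.body) {
  throw new Error(`Ask request failed with status ${response.status}`);
}

const reader = response.body.getReader();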
const decoder = new TextDecoder();
let answer = "";
while (true) {
  const { value, done } = await reader.read();
  if (done) {
    break;
  }
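  // `stream: true` keeps a multi-byte character split across chunks intact
  answer += decoder.decode(value, { stream: true });
  // render `answer` as it grows
}
answer += decoder.decode(); // flush any bytes still buffered in the decoder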

Run a normal search first to show results (and sources) immediately, then stream the answer. Use an AbortController so a new query cancels the in-flight one.
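
The cancel-on-new-query pattern can be sketched as a small helper that hands out one live signal at a time. The createLatestRequest name is hypothetical. AbortController and AbortSignal are the standard web APIs.

```typescript
// Each call aborts the previous request's signal and returns a fresh one.
function createLatestRequest(): () => AbortSignal {
  let current: AbortController | null = null;
  return function next(): AbortSignal {
    current?.abort();
    current = new AbortController();
    return current.signal;
  };
}
```

Call it once per query and pass the result as signal to fetch. The superseded fetch, and any pending reader.read() on its body, rejects with an AbortError, which the UI can ignore rather than surface as a failure.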

Verify

The endpoint returns 429 past the rate limit and 503 when credentials are missing.
A query streams text incrementally, not in one blob.
The answer cites sources that match the retrieved chunks — and refuses when the docs don't cover the question.
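
For the incremental-streaming check, a small helper can count how many chunks arrive. This is a sketch rather than part of Leadtype, and inspectStream is a hypothetical name.

```typescript
// Reads a streamed Response to the end and reports how many chunks arrived.
async function inspectStream(
  response: Response
): Promise<{ chunks: number; text: string }> {
  if (!response.body) {
    return { chunks: 0, text: "" };
  }
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let chunks = 0;
  let text = "";
  while (true) {
    const { value, done } = await reader.read();
    if (done) {
      break;
    }
    chunks += 1;
    text += decoder.decode(value, { stream: true });
  }
  text += decoder.decode(); // flush any buffered bytes
  return { chunks, text };
}
```

A long answer that arrives as a single chunk suggests something between the model and the client is buffering. Proxies can also merge chunks, so treat the count as a hint, not a proof.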