Streaming response performance planner

LLM Latency Estimator

Estimate time to first token, generation time, buffered end-to-end latency, and approximate throughput from measured or assumed LLM timing inputs.

First-token and generation breakdown
Tail-latency safety buffer
Concurrency-based throughput estimate

LLM latency estimate

Streaming response and capacity model

Slow modeled latency

Buffered end-to-end latency exceeds ten seconds.

Time to first token

570 ms

Generation time

7.98 s

Buffered end-to-end latency

10.36 s

Includes the selected tail-latency buffer

Estimated requests per minute

57.94

Estimated requests per hour

3,476.25

Calculation basis

Unbuffered end-to-end: 8.63 s
First-token share: 6.6%
Requests/minute per slot: 5.79
Concurrent slots: 10
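The capacity figures above are consistent with each slot serving one request at a time and staying occupied for the full buffered latency. That reading is inferred from the displayed numbers, not stated by the page. A minimal sketch under that assumption, where `estimateCapacity` and its field names are illustrative and not part of the calculator's code:

```typescript
// Sketch only: assumes each slot handles requests serially and is held
// for the full buffered end-to-end latency of every request.
function estimateCapacity(input: {
  timeToFirstTokenMs: number;
  unbufferedEndToEndMs: number;
  bufferedEndToEndMs: number;
  concurrentSlots: number;
}) {
  // Portion of the unbuffered latency spent waiting for the first token.
  const firstTokenSharePercent =
    (input.timeToFirstTokenMs / input.unbufferedEndToEndMs) * 100;
  // One slot completes one request per buffered latency interval.
  const requestsPerMinutePerSlot = 60000 / input.bufferedEndToEndMs;
  const requestsPerMinute =
    requestsPerMinutePerSlot * input.concurrentSlots;

  return {
    firstTokenSharePercent,
    requestsPerMinutePerSlot,
    requestsPerMinute,
    requestsPerHour: requestsPerMinute * 60,
  };
}

// Values taken from the results panel (8.63 s unbuffered, 10.356 s buffered).
const capacity = estimateCapacity({
  timeToFirstTokenMs: 570,
  unbufferedEndToEndMs: 8630,
  bufferedEndToEndMs: 10356,
  concurrentSlots: 10,
});
console.log(capacity.requestsPerMinute.toFixed(2)); // "57.94"
```

Because the buffered latency is used as the slot occupancy time, the result is a conservative figure; using the unbuffered latency would give a higher estimate.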

Formula

How LLM response latency is estimated

Network and prompt processing determine time to first token. Remaining output tokens are divided by generation throughput, then client overhead and a tail buffer are added.

Time to first token = network round trip + queue and prompt processing

Generation time = (output tokens − 1) ÷ tokens per second

End-to-end latency = time to first token + generation time + client overhead

Buffered end-to-end latency = end-to-end latency × (1 + tail buffer % ÷ 100)

llm-latency.ts

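export function llmLatency(input: {
  networkMs: number;
  queueAndPromptMs: number;
  outputTokens: number;
  tokensPerSecond: number;
  clientOverheadMs: number;
  tailBufferPercent: number;
}) {
  // Zero, negative, or NaN throughput would make generation time
  // infinite or NaN, so reject it up front.
  if (!(input.tokensPerSecond > 0)) {
    throw new RangeError("tokensPerSecond must be greater than 0");
  }

  // Delay before the first streamed token arrives.
  const timeToFirstTokenMs =
    input.networkMs + input.queueAndPromptMs;
  // The first token is covered by time to first token, so only the
  // remaining tokens are divided by throughput.
  const generationMs =
    (Math.max(0, input.outputTokens - 1) /
      input.tokensPerSecond) *
    1000;
  // Unbuffered total, including client-side overhead.
  const endToEndMs =
    timeToFirstTokenMs + generationMs + input.clientOverheadMs;

  return {
    timeToFirstTokenMs,
    generationMs,
    // Scale the unbuffered total by the tail-latency buffer.
    bufferedEndToEndMs:
      endToEndMs * (1 + input.tailBufferPercent / 100),
  };
}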

Example streaming latency estimate

A request with 570 ms to first token and 400 output tokens at 50 tokens per second spends (400 − 1) ÷ 50 = 7.98 s generating, so it takes roughly 8.6 seconds end to end, including client overhead, before the tail buffer is applied.
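The arithmetic behind this example can be checked step by step. The sketch below assumes 80 ms of client overhead and a 20% tail buffer; both values are inferred from the 8.63 s and 10.36 s figures in the results panel and are not stated in the text.

```typescript
// Inputs from the example; the last two are inferred assumptions.
const exampleTimeToFirstTokenMs = 570;
const exampleOutputTokens = 400;
const exampleTokensPerSecond = 50;
const exampleClientOverheadMs = 80; // assumption inferred from 8.63 s
const exampleTailBufferPercent = 20; // assumption inferred from 10.36 s

// Only the tokens after the first are charged to generation time.
const exampleGenerationMs =
  ((exampleOutputTokens - 1) * 1000) / exampleTokensPerSecond; // 7980

const exampleUnbufferedMs =
  exampleTimeToFirstTokenMs +
  exampleGenerationMs +
  exampleClientOverheadMs; // 8630

const exampleBufferedMs =
  exampleUnbufferedMs * (1 + exampleTailBufferPercent / 100); // ≈ 10356

console.log((exampleUnbufferedMs / 1000).toFixed(2)); // "8.63"
console.log((exampleBufferedMs / 1000).toFixed(2)); // "10.36"
```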

Measure each input from production traces when possible. Provider load, prompt length, tool calls, reasoning behavior, regions, and connection reuse can change latency substantially.
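One way to derive these inputs from a trace is to record when the request started and when each streamed token arrived on the client. The sketch below is illustrative: the `StreamTrace` shape and the `measureStream` helper are hypothetical, and the timestamps are assumed to be in ascending order and in milliseconds.

```typescript
// Hypothetical trace shape: request start plus the arrival time of each token.
interface StreamTrace {
  requestStartMs: number;
  tokenArrivalMs: number[]; // ascending, client-side timestamps
}

function measureStream(trace: StreamTrace) {
  const arrivals = trace.tokenArrivalMs;
  const first = arrivals[0];
  const last = arrivals[arrivals.length - 1];
  if (first === undefined || last === undefined) {
    throw new RangeError("trace contains no tokens");
  }

  const timeToFirstTokenMs = first - trace.requestStartMs;
  const generationMs = last - first;
  // Follows the (output tokens − 1) convention: the rate is measured over
  // the tokens that arrive after the first one. Null when undefined.
  const tokensPerSecond =
    generationMs > 0
      ? ((arrivals.length - 1) * 1000) / generationMs
      : null;

  return { timeToFirstTokenMs, generationMs, tokensPerSecond };
}

// Five tokens arriving 20 ms apart, first one 500 ms after the request.
const measured = measureStream({
  requestStartMs: 0,
  tokenArrivalMs: [500, 520, 540, 560, 580],
});
console.log(measured); // timeToFirstTokenMs: 500, generationMs: 80, tokensPerSecond: 50
```

Timestamps taken on the client include the network round trip, so the measured time to first token corresponds to the sum of the network and queue-and-prompt inputs, not to either one alone.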

What this estimate includes

Network, queue, prompt, generation, and client timing
Time to first token and full response latency
Configurable tail-latency reserve
Approximate serial capacity across concurrent slots

Frequently asked questions

What is time to first token?

It is the delay between starting a request and receiving the first streamed output token. This calculator models it as network time plus queue and prompt processing.

Why subtract one token from generation time?

The first token is already represented by time to first token. Generation throughput is applied to the remaining output tokens.

Is concurrency the same as requests per second?

No. Concurrency is the number of requests that can be processed simultaneously. Approximate capacity also depends on how long each request occupies a slot.
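The relationship can also be run in the other direction: given a target request rate and a per-request latency, estimate how many slots are needed. The sketch below applies Little's law (average requests in flight = arrival rate × time in system) as a steady-state approximation; `slotsNeeded` is a hypothetical helper, and bursty traffic would need headroom beyond this figure.

```typescript
// Steady-state estimate via Little's law; rounds up to whole slots.
function slotsNeeded(
  targetRequestsPerMinute: number,
  latencyMs: number,
): number {
  if (!(targetRequestsPerMinute >= 0) || !(latencyMs > 0)) {
    throw new RangeError("rate must be >= 0 and latency must be > 0");
  }
  const requestsPerSecond = targetRequestsPerMinute / 60;
  const averageInFlight = requestsPerSecond * (latencyMs / 1000);
  return Math.ceil(averageInFlight);
}

// 100 requests per minute at the 10.356 s buffered latency from the results.
console.log(slotsNeeded(100, 10356)); // 18
```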

Does this model prompt length or tool calls directly?

No. Include prompt ingestion, retrieval, tool execution, and reasoning delays in the queue and prompt processing input or model them separately.
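If those delays are tracked separately, they can be summed into the single queue and prompt processing input. The breakdown below is hypothetical: none of the field names come from the calculator, and the sketch assumes every delay occurs serially before the first token is streamed.

```typescript
// Hypothetical breakdown of everything that happens before the first token.
interface PreTokenDelays {
  queueMs: number;
  promptIngestionMs: number;
  retrievalMs: number;
  toolExecutionMs: number;
  reasoningMs: number;
}

// Collapse the breakdown into the calculator's single input value.
function toQueueAndPromptMs(delays: PreTokenDelays): number {
  return (
    delays.queueMs +
    delays.promptIngestionMs +
    delays.retrievalMs +
    delays.toolExecutionMs +
    delays.reasoningMs
  );
}

const queueAndPromptMs = toQueueAndPromptMs({
  queueMs: 50,
  promptIngestionMs: 200,
  retrievalMs: 120,
  toolExecutionMs: 0,
  reasoningMs: 0,
});
console.log(queueAndPromptMs); // 370
```

Delays that occur after streaming has started, such as a tool call made mid-response, do not fit this input and are better modeled separately.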

Related calculators

Text-to-Token & Cost Estimator

Estimate input tokens and project OpenAI, Gemini, or Claude API spend.

Daily API Budget Planner

Turn a fixed monthly AI budget into request and user limits.

AI Token Visualizer

Inspect approximate token-sized chunks and projected context-window usage.

Related glossary terms

Input tokens

Input tokens are the tokenized units sent to a model, including instructions, user content, conversation history, retrieved context, and tool definitions.

Requests per day

Requests per day is the number of billable API calls made during a day. TokenMath commonly derives it from requests per active user multiplied by active users.

Cost per request

Cost per request is the sum of all billable usage generated by one API call, commonly input token cost plus output token cost for a text model.
