Streaming response performance planner

LLM Latency Estimator

Estimate time to first token, generation time, buffered end-to-end latency, and approximate throughput from measured LLM timing assumptions.

  • First-token and generation breakdown
  • Tail-latency safety buffer
  • Concurrency-based throughput estimate

Latency assumptions

Use production percentiles or representative benchmark timings.

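  • Network round trip (ms)
  • Queue and prompt processing (ms)
  • Output tokens (tokens)
  • Generation throughput (tok/s)
  • Client overhead (ms)
  • Tail buffer (%)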
The throughput result assumes each request occupies one concurrency slot for the full buffered response time. It is a capacity estimate, not a provider quota.
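The slot-occupancy assumption can be sketched as follows. This is a minimal illustration rather than the calculator's actual source: `throughputEstimate` and its parameter names are hypothetical, and it assumes every slot is refilled the moment a request finishes, so the result is an upper bound.

```typescript
// Hypothetical helper; names are illustrative, not taken from the calculator.
// Assumes each request holds one slot for the full buffered latency and that
// slots are refilled immediately.
function throughputEstimate(bufferedEndToEndMs: number, concurrentSlots: number) {
  if (bufferedEndToEndMs <= 0 || concurrentSlots < 0) {
    throw new RangeError("latency must be positive and slots non-negative");
  }
  // One slot completes (60,000 ms / latency) requests per minute.
  const perSlotPerMinute = 60000 / bufferedEndToEndMs;
  const requestsPerMinute = perSlotPerMinute * concurrentSlots;
  return {
    perSlotPerMinute,
    requestsPerMinute,
    requestsPerHour: requestsPerMinute * 60,
  };
}

// Assumed inputs: 10,356 ms buffered latency (displayed rounded as 10.36 s)
// and 10 concurrent slots.
const capacity = throughputEstimate(10356, 10);
console.log(capacity.requestsPerMinute.toFixed(2)); // 57.94
console.log(capacity.requestsPerHour.toFixed(2)); // 3476.25
```

Because the model ignores idle gaps, retries, and provider rate limits, treat the output as a planning ceiling.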

LLM latency estimate

Streaming response and capacity model

Slow modeled latency

Buffered end-to-end latency exceeds ten seconds.

Time to first token

570 ms

Generation time

7.98 s

Buffered end-to-end latency

10.36 s

Includes the selected tail-latency buffer

Estimated requests per minute

57.94

Estimated requests per hour

3,476.25

Calculation basis

  • Unbuffered end-to-end: 8.63 s
  • First-token share: 6.6%
  • Requests/minute per slot: 5.79
  • Concurrent slots: 10
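The first-token share above appears to be time to first token divided by unbuffered end-to-end latency. That is an assumption reconstructed from the displayed values (570 ms out of 8.63 s), not something the page states; `firstTokenSharePercent` is a hypothetical name.

```typescript
// Hypothetical reconstruction of the "First-token share" figure.
// Assumption: the share is measured against unbuffered latency, which is the
// interpretation that reproduces the displayed 6.6%.
function firstTokenSharePercent(
  timeToFirstTokenMs: number,
  unbufferedEndToEndMs: number,
): number {
  if (unbufferedEndToEndMs <= 0) {
    throw new RangeError("end-to-end latency must be positive");
  }
  return (timeToFirstTokenMs / unbufferedEndToEndMs) * 100;
}

console.log(firstTokenSharePercent(570, 8630).toFixed(1)); // 6.6
```

A low share like this means generation dominates the response, so shortening outputs or raising tokens per second would help more than trimming network time.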

Formula

How LLM response latency is estimated

Network round trip plus queue and prompt processing determine time to first token. The remaining output tokens are divided by generation throughput to get generation time. Client overhead is then added, and the total is scaled up by the tail buffer.

Time to first token = network round trip + queue and prompt processing

Generation time = (output tokens − 1) ÷ tokens per second

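Unbuffered end-to-end latency = time to first token + generation time + client overhead

Buffered end-to-end latency = unbuffered end-to-end latency × (1 + tail buffer % ÷ 100)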

llm-latency.ts
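// Estimates streaming response latency. All times are in milliseconds.
export function llmLatency(input: {
  networkMs: number;
  queueAndPromptMs: number;
  outputTokens: number;
  tokensPerSecond: number;
  clientOverheadMs: number;
  tailBufferPercent: number;
}) {
  // Time to first token: network round trip plus queue and prompt processing.
  const timeToFirstTokenMs =
    input.networkMs + input.queueAndPromptMs;
  // The first token is already covered above, so only the remaining tokens
  // are divided by throughput. tokensPerSecond must be greater than zero.
  const generationMs =
    (Math.max(0, input.outputTokens - 1) /
      input.tokensPerSecond) *
    1000;
  // Unbuffered end-to-end latency, including client-side overhead.
  const endToEndMs =
    timeToFirstTokenMs + generationMs + input.clientOverheadMs;

  return {
    timeToFirstTokenMs,
    generationMs,
    // Scale by the tail buffer; a value of 20 adds 20%.
    bufferedEndToEndMs:
      endToEndMs * (1 + input.tailBufferPercent / 100),
  };
}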

Example streaming latency estimate

A request with 570 ms to first token and 400 output tokens at 50 tokens per second spends about 7.98 s generating, so it takes roughly 8.6 seconds end to end before the tail buffer is applied.
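The arithmetic behind this example can be checked step by step. The sketch below assumes 80 ms of client overhead and a 20% tail buffer. Neither value is stated in the example, but together they reproduce the 8.63 s and 10.36 s figures shown in the results.

```typescript
// Worked example. clientOverheadMs and tailBufferPercent are assumed values,
// chosen because they reproduce the results displayed on this page.
const timeToFirstTokenMs = 570;
const outputTokens = 400;
const tokensPerSecond = 50;
const clientOverheadMs = 80; // assumption
const tailBufferPercent = 20; // assumption

// The first token is covered by time to first token, so 399 tokens remain.
const generationMs = ((outputTokens - 1) / tokensPerSecond) * 1000;
const unbufferedMs = timeToFirstTokenMs + generationMs + clientOverheadMs;
const bufferedMs = unbufferedMs * (1 + tailBufferPercent / 100);

console.log((generationMs / 1000).toFixed(2)); // 7.98
console.log((unbufferedMs / 1000).toFixed(2)); // 8.63
console.log((bufferedMs / 1000).toFixed(2)); // 10.36
```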

Measure each input from production traces when possible. Provider load, prompt length, tool calls, reasoning behavior, regions, and connection reuse can change latency substantially.
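One way to turn production traces into an input value is to take a high percentile of the measured timings. The sketch below uses the nearest-rank method. `percentile` and the sample values are hypothetical, and a monitoring system may compute percentiles differently (for example, with interpolation).

```typescript
// Nearest-rank percentile over measured timings (hypothetical helper).
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new RangeError("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // Nearest rank: smallest value with at least p% of samples at or below it.
  // Multiplying before dividing avoids rounding drift such as 0.7 * 10.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Made-up network round-trip samples in milliseconds, for illustration only.
const networkSamplesMs = [62, 70, 68, 75, 71, 66, 190, 73, 69, 72];
console.log(percentile(networkSamplesMs, 50)); // 70
console.log(percentile(networkSamplesMs, 95)); // 190
```

The gap between the two results shows why a median alone can understate the latency users see on slow requests.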

What this estimate includes

  • Network, queue, prompt, generation, and client timing
  • Time to first token and full response latency
  • Configurable tail-latency reserve
  • Approximate serial capacity across concurrent slots

Frequently asked questions

What is time to first token?

It is the delay between starting a request and receiving the first streamed output token. This calculator models it as network time plus queue and prompt processing.

Why subtract one token from generation time?

The first token is already represented by time to first token. Generation throughput is applied to the remaining output tokens.
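A small sketch makes the off-by-one visible. `generationTimeMs` is a hypothetical helper that mirrors the generation-time formula; the clamp to zero keeps an empty response from producing a negative time.

```typescript
// Hypothetical helper mirroring: (output tokens - 1) / tokens per second.
function generationTimeMs(outputTokens: number, tokensPerSecond: number): number {
  // The first token arrives at time to first token, so only the rest count.
  const remainingTokens = Math.max(0, outputTokens - 1);
  return (remainingTokens / tokensPerSecond) * 1000;
}

console.log(generationTimeMs(1, 50)); // 0 (a one-token reply ends at the first token)
console.log(generationTimeMs(0, 50)); // 0 (clamped rather than negative)
console.log(generationTimeMs(101, 50)); // 2000 (100 remaining tokens at 50 tok/s)
```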

Is concurrency the same as requests per second?

No. Concurrency is the number of requests that can be processed simultaneously. Approximate capacity also depends on how long each request occupies a slot.
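The relationship can be inverted to size concurrency for a target request rate. This sketch assumes the same slot-occupancy model as the throughput estimate, where each request holds a slot for the full buffered latency; `requiredSlots` is a hypothetical name.

```typescript
// Hypothetical helper: slots needed to sustain a target rate, assuming each
// request occupies one slot for the full buffered end-to-end latency.
function requiredSlots(
  targetRequestsPerMinute: number,
  bufferedEndToEndMs: number,
): number {
  // Average requests in flight = arrival rate x time each request is held.
  const inFlight = (targetRequestsPerMinute * bufferedEndToEndMs) / 60000;
  return Math.ceil(inFlight);
}

// The same target rate needs very different concurrency at different latencies.
console.log(requiredSlots(60, 2000)); // 2
console.log(requiredSlots(60, 10356)); // 11
```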

Does this model prompt length or tool calls directly?

No. Include prompt ingestion, retrieval, tool execution, and reasoning delays in the queue and prompt processing input or model them separately.
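One way to fold these delays into the single input is to sum them before running the estimate. The field names below are hypothetical, the values are placeholders rather than measurements, and the sum assumes the delays happen one after another instead of overlapping.

```typescript
// Hypothetical breakdown of delays that occur before the first token.
interface PreTokenDelays {
  queueMs: number;
  promptIngestionMs: number;
  retrievalMs: number;
  toolExecutionMs: number;
  reasoningMs: number;
}

// Collapses the breakdown into the "queue and prompt processing" input.
// Assumption: the delays are sequential, so they add up.
function queueAndPromptMs(delays: PreTokenDelays): number {
  return (
    delays.queueMs +
    delays.promptIngestionMs +
    delays.retrievalMs +
    delays.toolExecutionMs +
    delays.reasoningMs
  );
}

// Placeholder values for illustration only.
const combinedMs = queueAndPromptMs({
  queueMs: 40,
  promptIngestionMs: 250,
  retrievalMs: 120,
  toolExecutionMs: 0,
  reasoningMs: 0,
});
console.log(combinedMs); // 410
```

Keeping the components separate in traces makes it easier to see which one to optimize, even though the calculator only takes their total.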