Streaming response performance planner
LLM Latency Estimator
Estimate time to first token, generation time, buffered end-to-end latency, and approximate throughput from measured LLM timing assumptions.
LLM latency estimate
Streaming response and capacity model
Slow modeled latency
Buffered end-to-end latency exceeds ten seconds.
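Time to first token: 570 ms
Generation time: 7.98 s
Buffered end-to-end latency: 10.36 s
Includes the selected tail-latency buffer
Approximate requests/minute: 57.94
Approximate requests/hour: 3,476.25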
Calculation basis
- Unbuffered end-to-end: 8.63 s
- First-token share: 6.6%
- Requests/minute per slot: 5.79
- Concurrent slots: 10
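The basis figures can be reproduced from the headline results. The sketch below assumes that first-token share is time to first token divided by unbuffered latency, and that the per-slot rate is one minute divided by buffered latency. Both relationships are inferred from the displayed numbers rather than stated on the page, and calculationBasis is a hypothetical helper name.

```typescript
// Hypothetical helper: derives the "calculation basis" figures.
// Both formulas are assumptions inferred from the displayed values.
function calculationBasis(
  timeToFirstTokenMs: number,
  unbufferedEndToEndMs: number,
  bufferedEndToEndMs: number,
) {
  return {
    // Fraction of the unbuffered latency spent waiting for the first token.
    firstTokenSharePercent:
      (timeToFirstTokenMs / unbufferedEndToEndMs) * 100,
    // One slot serving requests back to back, each taking the buffered latency.
    requestsPerMinutePerSlot: 60000 / bufferedEndToEndMs,
  };
}

// 570 ms to first token, 8.63 s unbuffered, 10.356 s buffered.
const basis = calculationBasis(570, 8630, 10356);
console.log(basis.firstTokenSharePercent.toFixed(1)); // 6.6
console.log(basis.requestsPerMinutePerSlot.toFixed(2)); // 5.79
```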
Formula
How LLM response latency is estimated
Network round trip plus queue and prompt processing determine time to first token. The remaining output tokens are divided by generation throughput to give generation time. Client overhead is then added, and the total is scaled up by a tail buffer.
Time to first token = network round trip + queue and prompt processing
Generation time = (output tokens − 1) ÷ tokens per second
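Unbuffered end-to-end latency = time to first token + generation time + client overhead
Buffered end-to-end latency = unbuffered end-to-end latency × (1 + tail buffer % ÷ 100)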
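export function llmLatency(input: {
  networkMs: number;
  queueAndPromptMs: number;
  outputTokens: number;
  tokensPerSecond: number;
  clientOverheadMs: number;
  tailBufferPercent: number;
}) {
  // Time to first token: network round trip plus queue and prompt processing.
  const timeToFirstTokenMs =
    input.networkMs + input.queueAndPromptMs;
  // The first token is already counted above, so only the remaining
  // tokens are divided by throughput. tokensPerSecond must be above zero.
  const generationMs =
    (Math.max(0, input.outputTokens - 1) /
      input.tokensPerSecond) *
    1000;
  // Unbuffered end-to-end latency, including client overhead.
  const endToEndMs =
    timeToFirstTokenMs + generationMs + input.clientOverheadMs;
  return {
    timeToFirstTokenMs,
    generationMs,
    // The tail buffer is a percentage, so 20 means 20%.
    bufferedEndToEndMs:
      endToEndMs * (1 + input.tailBufferPercent / 100),
  };
}
Example streaming latency estimate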
A request with 570 ms to first token and 400 output tokens at 50 tokens per second spends 7.98 s generating the remaining 399 tokens, so it takes roughly 8.6 seconds end to end before a tail buffer is applied.
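The same arithmetic, written out step by step as a minimal sketch. The 70 ms and 500 ms split of time to first token is hypothetical, and the 80 ms client overhead and 20% tail buffer are assumptions inferred from the 8.63 s and 10.36 s figures shown on this page; none of these inputs are stated in the text.

```typescript
// Inputs chosen to reproduce the figures quoted on this page.
// ASSUMPTIONS: the 70/500 ms split, the 80 ms client overhead, and the
// 20% tail buffer are inferred from the displayed totals.
const example = {
  networkMs: 70,
  queueAndPromptMs: 500,
  outputTokens: 400,
  tokensPerSecond: 50,
  clientOverheadMs: 80,
  tailBufferPercent: 20,
};

const exampleTtftMs = example.networkMs + example.queueAndPromptMs;
// The first token is covered by time to first token, so 399 tokens remain.
const exampleGenerationMs =
  ((example.outputTokens - 1) / example.tokensPerSecond) * 1000;
const exampleUnbufferedMs =
  exampleTtftMs + exampleGenerationMs + example.clientOverheadMs;
const exampleBufferedMs =
  exampleUnbufferedMs * (1 + example.tailBufferPercent / 100);

console.log(exampleTtftMs); // 570
console.log((exampleGenerationMs / 1000).toFixed(2)); // 7.98
console.log((exampleUnbufferedMs / 1000).toFixed(2)); // 8.63
console.log((exampleBufferedMs / 1000).toFixed(2)); // 10.36
```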
Measure each input from production traces when possible. Provider load, prompt length, tool calls, reasoning behavior, regions, and connection reuse can change latency substantially.
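One possible way to choose the tail buffer from traces is to compare a high percentile of measured end-to-end latency with the median. The sketch below is an assumption about how that could be done, not the method this calculator uses. The helper names, the nearest-rank percentile rule, the choice of p95, and the sample values are all hypothetical.

```typescript
// Nearest-rank percentile over a list of latency samples in milliseconds.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) {
    throw new Error("at least one sample is required");
  }
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// ASSUMPTION: the buffer is sized so that median × (1 + buffer) reaches p95.
function tailBufferPercentFromTraces(samplesMs: number[]): number {
  const median = percentile(samplesMs, 50);
  const p95 = percentile(samplesMs, 95);
  return (p95 / median - 1) * 100;
}

// Hypothetical end-to-end latencies taken from ten traced requests.
const traceMs = [
  8000, 8200, 8400, 8500, 8600, 8800, 9000, 9400, 9800, 10320,
];
const suggestedBuffer = tailBufferPercentFromTraces(traceMs);
console.log(suggestedBuffer.toFixed(1)); // 20.0
```

With only ten samples the 95th percentile is simply the slowest request, so a real estimate would need far more traces.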
What this estimate includes
- Network, queue, prompt, generation, and client timing
- Time to first token and full response latency
- Configurable tail-latency reserve
- Approximate serial capacity across concurrent slots
Frequently asked questions
What is time to first token?
It is the delay between starting a request and receiving the first streamed output token. This calculator models it as network time plus queue and prompt processing.
Why subtract one token from generation time?
The first token is already represented by time to first token. Generation throughput is applied to the remaining output tokens.
Is concurrency the same as requests per second?
No. Concurrency is the number of requests that can be processed simultaneously. Approximate capacity also depends on how long each request occupies a slot.
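A short sketch of that relationship, assuming each slot serves one request at a time, back to back, with no idle gaps. The function name approxRequestsPerMinute is hypothetical, and the formula is inferred from the figures shown on this page.

```typescript
// Hypothetical helper: approximate sustained request rate when every
// slot is always busy and each request occupies a slot for the same time.
function approxRequestsPerMinute(
  concurrentSlots: number,
  occupancySeconds: number,
): number {
  return concurrentSlots * (60 / occupancySeconds);
}

// Same concurrency, different occupancy time, different capacity.
const slowRate = approxRequestsPerMinute(10, 10.356);
const fastRate = approxRequestsPerMinute(10, 5);

console.log(slowRate.toFixed(2)); // 57.94
console.log((slowRate * 60).toFixed(2)); // 3476.25 requests per hour
console.log(fastRate); // 120
```

Ten slots stay ten slots in both cases; only the time each request holds a slot changes the request rate.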
Does this model prompt length or tool calls directly?
No. Include prompt ingestion, retrieval, tool execution, and reasoning delays in the queue and prompt processing input or model them separately.
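One way to fold those delays into the queue and prompt processing input is to sum them before running the estimate. This is a minimal sketch: the PreTokenDelays shape, its field names, and the sample values are hypothetical, and it assumes the delays happen one after another before the first token rather than in parallel.

```typescript
// Hypothetical breakdown of everything that happens before the first token.
interface PreTokenDelays {
  queueMs: number;
  promptIngestionMs: number;
  retrievalMs: number;
  toolExecutionMs: number;
  reasoningMs: number;
}

// ASSUMPTION: delays are sequential, so they simply add up.
function toQueueAndPromptMs(delays: PreTokenDelays): number {
  return (
    delays.queueMs +
    delays.promptIngestionMs +
    delays.retrievalMs +
    delays.toolExecutionMs +
    delays.reasoningMs
  );
}

// Illustrative values only; measure these from your own traces.
const combinedQueueAndPromptMs = toQueueAndPromptMs({
  queueMs: 40,
  promptIngestionMs: 260,
  retrievalMs: 120,
  toolExecutionMs: 0,
  reasoningMs: 80,
});
console.log(combinedQueueAndPromptMs); // 500
```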
Related glossary terms
Input tokens
Input tokens are the tokenized units sent to a model, including instructions, user content, conversation history, retrieved context, and tool definitions.
Requests per day
Requests per day is the number of billable API calls made during a day. TokenMath commonly derives it from requests per active user multiplied by active users.
Cost per request
Cost per request is the sum of all billable usage generated by one API call, commonly input token cost plus output token cost for a text model.