The bill for an AI feature does not come from the model thinking. It comes from the model being moved. Every token your product streams out is a full read of the model's weights from GPU memory, and you are paying, at the most basic level, for that haul of bytes. Once you see inference that way, the entire cost structure stops being mysterious and starts being arithmetic.
Here is the arithmetic that catches everyone off guard. To generate one token from a 70-billion-parameter model in FP16, the GPU reads all 70 billion weights out of high-bandwidth memory, 140 gigabytes of traffic, and does one small multiply-add against each. The math is trivial; the haul is enormous. The GPU's thousands of compute cores spend almost the whole step waiting on memory, the way a fully staffed kitchen still cannot plate faster than the one waiter carrying ingredients up from the basement.
That single fact, decode is bottlenecked on memory rather than compute, is the load-bearing wall under everything else. It explains why the headline TFLOPS on a spec sheet barely touch your bill, why batching is the only thing that makes a GPU pay for itself, why output tokens cost more than input, and why the gap between a profitable inference service and a money pit is one variable most people never look at. Every lever below turns out to be the same move in disguise: move fewer bytes, or share each byte-move across more tokens.
The memory wall sets the floor before user number one
Start with what has to be true before you serve a single request: the weights have to live in GPU memory. A 70B model at FP16 is 140 GB. An H100, the workhorse rental GPU, holds 80 GB. So a 70B model at FP16 does not fit on one of them. You need at least two, sharded, just to load it, before anyone has typed a word. That is a fixed cost, paid in hardware, independent of traffic.
This is the part that breaks naive cost models. People reason about inference as a per-request variable cost, like bandwidth. But a large fixed cost sits underneath, set by model size and precision, and it dictates a minimum GPU count below which you cannot serve the model at all. The question is never just "how much per token," it is "how many GPUs do I light up before the first token, and how many tokens justify that."
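The floor can be computed directly. A minimal sketch, assuming decimal gigabytes and counting only the weights (no KV cache, activations, or framework overhead); the function names are illustrative, not from any library:

```python
import math

def weight_bytes(params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in bytes."""
    return params * bytes_per_param

def min_gpus(params: float, bytes_per_param: float, gpu_gb: float) -> int:
    """Smallest GPU count whose combined memory holds the weights alone."""
    return math.ceil(weight_bytes(params, bytes_per_param) / (gpu_gb * 1e9))

print(weight_bytes(70e9, 2) / 1e9)  # 140.0 (GB at FP16)
print(min_gpus(70e9, 2, 80))        # 2 (80 GB GPUs, before any cache or overhead)
```

In practice the real minimum is higher, because the same memory also has to hold the KV cache and runtime overhead.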
The weights are the fixed part of the memory budget; the variable part is the KV cache, the term that surprises everyone who only thinks about model size. As the model generates, it stores the keys and values for every token it has already seen so it does not recompute them each step, and that cache grows linearly with batch size times sequence length. A single LLaMA-13B request at 8K context burns about 6.7 GB. Stack four concurrent requests and the cache alone exceeds the model's own weights. Long context costs more not because the prompt is longer to read once, but because you carry a large per-request memory tax for the whole generation, and it competes with the weights for the same fixed VRAM.
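The 6.7 GB figure can be reproduced from the model shape. This sketch assumes LLaMA-13B's published configuration (40 layers, hidden size 5120, standard multi-head attention) and FP16 cache entries; models that use GQA or MQA store fewer key and value heads and need a different formula.

```python
def kv_cache_bytes(layers: int, hidden: int, seq_len: int,
                   batch: int = 1, bytes_per_value: int = 2) -> int:
    """KV cache size: one key vector and one value vector per layer per token."""
    per_token = 2 * layers * hidden * bytes_per_value  # 2 = key + value
    return per_token * seq_len * batch

one = kv_cache_bytes(layers=40, hidden=5120, seq_len=8192)
four = kv_cache_bytes(layers=40, hidden=5120, seq_len=8192, batch=4)
print(f"{one / 1e9:.1f} GB for one request")     # 6.7 GB
print(f"{four / 1e9:.1f} GB for four requests")  # 26.8 GB, vs about 26 GB of FP16 weights
```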
So the VRAM budget is weights plus KV cache plus overhead, and it is a hard ceiling. When a deployment "runs out of memory," it is almost always the cache, not the weights, because the weights are static and the cache grows with load. The first place a senior engineer earns the title is knowing that the way to fit more concurrent users is usually to shrink the KV cache, not buy more compute. The LLM inference serving deep dive lays out that memory layout request by request.
Decode is bandwidth-bound, which is why batching is the whole game
There is a clean way to ask whether an operation is starved for memory or for math: arithmetic intensity, the ratio of floating-point operations performed to bytes moved out of memory. Plot throughput against it and you get the roofline model, a diagonal memory-bound region rising to a flat compute-bound ceiling. Where your workload sits on that line decides what you are paying for.
At batch one, decode sits deep in the memory-bound region. You move every weight, ~140 GB for our 70B model, to produce one token, so arithmetic intensity is one or two FLOPs per byte. The H100's compute ceiling needs hundreds of FLOPs per byte to be worth reaching. You are two or three orders of magnitude short of using the silicon you rented, and the GPU is idle most of each step waiting on HBM. Serving one request at a time on an expensive GPU is chartering a freight train to deliver one envelope: the envelope is cheap, the train is not.
Batching fixes the economics, and the mechanism is precise. Batch 64 requests and you still load each weight once, but now that load does useful work for 64 tokens instead of one. Bytes moved stay flat while math multiplies by 64, so arithmetic intensity climbs, the workload walks up the diagonal toward the compute ceiling, and you finally use the cores you are renting. There is a crossover, the critical batch size, above which decode becomes compute-bound: around 240 tokens on TPU v5e in bf16, the low hundreds on an H100. Below it you waste silicon, and the serving stack's whole job is keeping you above it.
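The crossover is a ratio you can compute. A rough sketch that counts only weight traffic (KV-cache reads are ignored) and assumes approximate spec-sheet numbers: about 197 TFLOPS and 820 GB/s for TPU v5e, about 989 dense BF16 TFLOPS and 3.35 TB/s for an H100. Treat those hardware figures as assumptions, not measurements.

```python
def decode_intensity(batch: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weight traffic in one decode step.

    Each weight is read once (bytes_per_param bytes) and used for one
    multiply-add (2 FLOPs) per sequence in the batch.
    """
    return 2.0 * batch / bytes_per_param

def critical_batch(peak_flops: float, bandwidth: float,
                   bytes_per_param: float = 2.0) -> float:
    """Batch size at which weight loading stops being the bottleneck."""
    return (peak_flops / bandwidth) * bytes_per_param / 2.0

print(decode_intensity(1))                       # 1.0 FLOP per byte
print(decode_intensity(64))                      # 64.0
print(round(critical_batch(1.97e14, 8.2e11)))    # 240 (TPU v5e, assumed specs)
print(round(critical_batch(9.89e14, 3.35e12)))   # 295 (H100, assumed specs)
```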
This is why the gap between a naive deployment and a tuned one is an order of magnitude, not 20 percent. The most-cited public benchmark put naive static batching at 81 tokens per second under realistic length variance and continuously-batched vLLM at roughly 1,900 on the identical GPU. Same silicon, same model, 23x the throughput, the only difference being whether the GPU was kept busy.
Why does static batching collapse? You gather eight requests, run them together, return all eight, then gather the next eight. But generations finish at wildly different lengths: one answers in 20 tokens, another in 600. The seven finished requests sit in their slots holding memory and doing nothing until the 600-token straggler completes, so the GPU is "busy" running a batch but mostly grinding on white space. Continuous batching, also called iteration-level scheduling and pioneered by the Orca serving system, fixes this at the granularity of a single decode step: the instant any sequence emits its end token, its slot is evicted and a waiting request takes it on the next step. The batch stays full, real-world utilization climbs from the 30-to-40 percent range into 70-to-80 percent, and that is why production runs vLLM, TensorRT-LLM, or SGLang instead of a hand-written loop.
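A toy simulation makes the straggler effect visible. It is a deliberately simplified model, not how vLLM or Orca are implemented: every decode step costs the same regardless of how many slots are occupied, prefill is ignored, and the request lengths are a hypothetical workload.

```python
from collections import deque

def static_steps(lengths, slots):
    """Decode steps when each batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_steps(lengths, slots):
    """Decode steps when a finished sequence's slot is refilled on the next step."""
    queue, active, steps = deque(lengths), [], 0
    while queue or active:
        while queue and len(active) < slots:        # fill any free slots
            active.append(queue.popleft())
        active = [r - 1 for r in active if r > 1]   # one token each; drop finished
        steps += 1
    return steps

lengths = [20, 600, 35, 50, 40, 25, 30, 45] * 4     # hypothetical output lengths
for name, fn in (("static", static_steps), ("continuous", continuous_steps)):
    steps = fn(lengths, slots=8)
    print(name, steps, f"slot utilization {sum(lengths) / (steps * 8):.0%}")
```

With one 600-token straggler in every group of eight, the static schedule spends most of its slot-steps on requests that have already finished.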
Paired with it is PagedAttention, the idea behind vLLM, a memory trick with outsized economic consequence. Older systems allocated KV cache in big worst-case contiguous chunks and wasted 60 to 80 percent of it to fragmentation. PagedAttention manages the cache in small non-contiguous blocks, the way an OS pages memory, cutting waste below 4 percent. Less wasted cache means more sequences fit the same VRAM, which means a higher achievable batch, which means lower cost per token. A memory-efficiency change shows up on the invoice as a throughput change.
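The waste comparison can be sketched in a few lines. This is an accounting model only, not vLLM's allocator: the sequence lengths are hypothetical, the 2048-token worst-case reservation is an assumption, and the block size of 16 tokens is an assumed setting.

```python
import math

def contiguous_waste(actual_lens, max_len):
    """Fraction of reserved KV slots unused when every request reserves max_len."""
    reserved = max_len * len(actual_lens)
    return 1 - sum(actual_lens) / reserved

def paged_waste(actual_lens, block_size=16):
    """Fraction unused when cache is handed out in fixed-size blocks on demand."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in actual_lens)
    return 1 - sum(actual_lens) / reserved

lens = [120, 600, 350, 900, 75, 410]             # hypothetical final lengths
print(f"{contiguous_waste(lens, 2048):.1%}")     # 80.0%
print(f"{paged_waste(lens):.1%}")                # 1.6%
```

With paging, the only waste is the partly filled last block of each sequence, which is why it stays small no matter how much lengths vary.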
Utilization is the variable the whole bill turns on
Here is the formula that everything has been building toward, and it is almost embarrassingly simple:
cost per 1M tokens = (GPU $/hour) / (tokens served per hour) * 1,000,000
tokens served per hour = tokens/sec * 3600 * utilization
The first term, dollars per GPU-hour, is what you fixate on when shopping. The last, utilization, is what actually decides the answer and the one people leave out. Walk a case. An H100 at $2.50 an hour serving a 7B-class model at a sustained 2,500 tokens per second and 70 percent utilization moves about 6.3 million tokens an hour, near $0.40 per million tokens. Drop the same GPU to 10 percent utilization, the kind of number spiky traffic on a dedicated box produces, and it moves under a million tokens an hour, jumping to roughly $2.78 per million. Same hardware, same model, same code, seven times the cost, because an idle GPU bills exactly the same as a busy one.
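The walk-through above, as a function. The inputs are the illustrative figures from the text, not benchmarks, and the function name is made up for this sketch:

```python
def cost_per_million(gpu_dollars_per_hour: float,
                     peak_tokens_per_sec: float,
                     utilization: float) -> float:
    """Dollars per one million tokens actually served."""
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilization
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000

print(round(cost_per_million(2.50, 2500, 0.70), 2))  # 0.4
print(round(cost_per_million(2.50, 2500, 0.10), 2))  # 2.78
```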
This reframes the entire build-versus-buy decision. Independent analysis is blunter than my illustration: a GPU at 10 percent load turns a $0.013 per-thousand-token cost into $0.13, a 10x penalty that makes self-hosting more expensive than the premium APIs you were trying to undercut. The APIs are cheap for a structural reason, not a charitable one. A provider serving thousands of customers packs a GPU near full utilization continuously, amortizing every weight load across an enormous, always-full batch. You, serving bursty traffic on a reserved box, cannot. You are paying for the idle time the provider engineered away by aggregating demand.
So self-host-versus-API is a utilization question, not a price-per-token one. Self-hosting wins above roughly 50 percent sustained utilization, several million tokens per GPU per day, where you genuinely keep the silicon saturated, and loses badly below that, because the per-token API never charges you for gaps in your traffic and a dedicated GPU charges you for nothing else. Forecast your utilization curve honestly before comparing list prices, because most teams overestimate it. The same discipline applies as in any capacity estimation exercise: the average is a trap, and the gap between peak and trough is where the money leaks.
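Turning the comparison around gives a break-even utilization for any API price. A rough sketch that counts only the GPU rental (no engineering time, networking, or redundancy); the $0.50 per million API price is a hypothetical value chosen for illustration:

```python
def break_even_utilization(gpu_dollars_per_hour: float,
                           peak_tokens_per_sec: float,
                           api_dollars_per_million: float) -> float:
    """Utilization at which self-hosted cost per token equals the API price.

    A result above 1.0 means the GPU loses even when fully saturated.
    """
    full_load_cost = gpu_dollars_per_hour / (peak_tokens_per_sec * 3600) * 1_000_000
    return full_load_cost / api_dollars_per_million

u = break_even_utilization(2.50, 2500, api_dollars_per_million=0.50)
print(f"break-even at {u:.0%} utilization")  # break-even at 56% utilization
```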
Prefill and decode are two different machines wearing one model
The asymmetry the whole pricing world quietly encodes is the split between prefill and decode, an architectural divide rather than a detail. When a request arrives, the model first processes the entire prompt at once: every prompt token goes through the network in parallel, a matrix-matrix multiply, dense and compute-bound, with high arithmetic intensity because each weight load is reused across all the prompt's tokens. This is prefill, and it sets time to first token, the latency before the answer starts streaming. Then the model switches modes, generating the response one token at a time, each depending on the last, a matrix-vector multiply, thin and memory-bound, re-fetching every weight per token. This is decode, and it sets inter-token latency, the pace at which words appear. Prefill is compute-bound freight you can parallelize; decode is a bandwidth-bound trickle you cannot.
This is exactly why providers charge three to five times more for output than input. Output tokens are decode, the expensive bandwidth-bound phase; input tokens are prefill, the cheap compute-bound phase that parallelizes across the whole prompt. The price column on a pricing page is a readout of the physics. Claude Sonnet at roughly $3 per million in and $15 out, or GPT-4.1-nano at $0.10 in and $0.40 out, is decode being structurally more expensive to produce than prefill. Nobody chose that ratio arbitrarily; it falls out of the roofline.
The split also creates a production problem worth naming. When prefill and decode share a batch, a long prompt's prefill is heavy enough to stall everyone else's decode for that step, and inter-token latency spikes for every user on the box. Two real fixes. Chunked prefill, the Sarathi-Serve approach, splits a long prefill into pieces and interleaves them with ongoing decode, using the compute that memory-bound decode leaves idle to slip the prefill in without stalling anyone. The paper reports serving-throughput gains within latency targets of up to 2.6x for Mistral-7B on a single A100 and up to 6.9x for Falcon-180B on eight A100s. The other is prefill-decode disaggregation: run prefill on one GPU pool and decode on a separate one so they never contend. That is not a toy idea; it is how the largest disclosed deployments run, which is where the real numbers come in.
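The chunking idea reduces to a per-step token budget. This is a simplified sketch of the scheduling arithmetic, not Sarathi-Serve's actual scheduler: it assumes each running decode consumes one token of the budget per step and that the remainder goes to a single waiting prompt. The budget of 512 and the other inputs are hypothetical.

```python
def chunked_prefill_schedule(prompt_len: int, running_decodes: int,
                             token_budget: int) -> list:
    """Split one prompt's prefill across steps, leaving room for ongoing decodes."""
    room = token_budget - running_decodes  # each decode takes one token slot per step
    if room <= 0:
        raise ValueError("token budget leaves no room for prefill")
    chunks = []
    while prompt_len > 0:
        chunk = min(room, prompt_len)
        chunks.append(chunk)
        prompt_len -= chunk
    return chunks

print(chunked_prefill_schedule(4000, running_decodes=48, token_budget=512))
# [464, 464, 464, 464, 464, 464, 464, 464, 288]
```

The 48 decoding users keep receiving a token every step while the 4,000-token prompt is absorbed over nine steps instead of monopolizing one.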
The real numbers: what a hyperscale deployment actually looks like
Almost nobody publishes their inference economics, which is what makes DeepSeek's 2025 disclosure the most useful artifact in this discussion. Over a 24-hour window on H800 GPUs they ran a peak of 278 nodes, averaging about 227 with eight GPUs each, at an assumed $2 per GPU-hour, for a daily cost around $87,000. Against R1 list prices, the theoretical revenue for the traffic they served came to roughly $562,000 a day, the source of the much-quoted 545 percent cost-profit margin. Hold the headline; the operational numbers underneath teach more.
They processed 608 billion input tokens and 168 billion output tokens that day. The ~3.6-to-1 input-to-output ratio is itself a lesson: prompts and context dwarf generated answers in real traffic, which is why prefill efficiency and caching matter so much. And caching is the quiet star here. Of those 608 billion input tokens, 342 billion, 56.3 percent, were served from KV-cache hits rather than recomputed. More than half their input compute simply did not happen, because repeated context, shared system prompts, common document chunks, was cached and reused. Caching is not a minor optimization in production; it bends more than half the input bill to near zero.
The per-node throughput makes the prefill-decode asymmetry concrete: 73,700 tokens per second on prefill nodes versus 14,800 on decode nodes, a 5x gap on the same hardware, because prefill is the parallelizable compute-bound phase and decode the serial bandwidth-bound one. And the architecture is disaggregation made real: four nodes on prefill, eighteen on decode, the decode pool far larger precisely because decode is the slow expensive phase that needs more silicon to keep up. The shape of the cluster is the shape of the economics.
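The disclosed figures are easy to sanity-check. The average node count (226.75), the eight GPUs per node, and the exact revenue figure ($562,027) are taken from DeepSeek's published summary as best recalled rather than from the text above, so treat them as assumptions:

```python
avg_nodes, gpus_per_node, dollars_per_gpu_hour = 226.75, 8, 2.0  # assumed from the disclosure
daily_cost = avg_nodes * gpus_per_node * dollars_per_gpu_hour * 24
theoretical_revenue = 562_027                                    # assumed from the disclosure
margin = (theoretical_revenue - daily_cost) / daily_cost

input_tokens, cached_tokens, output_tokens = 608e9, 342e9, 168e9

print(f"daily cost ${daily_cost:,.0f}")                            # daily cost $87,072
print(f"cost-profit margin {margin:.0%}")                          # cost-profit margin 545%
print(f"input:output {input_tokens / output_tokens:.1f}:1")        # input:output 3.6:1
print(f"cache hit share {cached_tokens / input_tokens:.0%}")       # cache hit share 56%
print(f"prefill vs decode node throughput {73_700 / 14_800:.1f}x") # 5.0x
```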
Now the caveat, because quoting the 545 percent without it is exactly the mistake a senior engineer does not make. DeepSeek flagged that actual revenue was far lower: web and app tiers are free, off-peak requests get steep discounts, and the margin excludes all training and R&D, which for a frontier model is the overwhelming majority of total spend. The 545 percent is a theoretical upper bound on the marginal serving margin under perfect monetization, not a claim that running an LLM company prints money. The takeaway is the opposite of the headline: the marginal inference margin is real, but the business math only closes when you account for everything that number conveniently excludes.
The lever stack: every optimization is the same move
Step back and the optimizations a senior engineer reaches for are not a grab-bag of tricks. Each moves a single term in the cost equation: bytes moved per token, tokens served per weight-load, how many tokens need the expensive model at all, or dollars per byte of bandwidth. Naming the term a lever pulls is the difference between cargo-culting configs and engineering a cost down.
Quantization moves bytes per token, and it is more a feasibility lever than a quality knob. FP16 to INT4 takes a 70B model from ~140 GB to ~35 GB, the difference between needing four 40 GB A100s and needing one. It does not just speed up decode by moving less memory per token; it collapses the minimum GPU count, the fixed cost the whole deployment hangs on. FP8 on Hopper roughly doubles decode throughput for the same reason, fewer bytes per weight. The tradeoff is accuracy, which varies by method and model, and the engineering question is how far you can go before quality slips past your bar. The LLM quantization breakdown covers where INT4, INT8, and FP8 each land on that curve.
Routing moves how many tokens need the expensive model. Most requests do not need your largest one, and a learned router sends the easy ones cheap and reserves the flagship for the hard ones. RouteLLM reports cutting cost more than 85 percent on one benchmark and 35 to 45 percent on others while holding around 95 percent of GPT-4-only quality, with routers that transfer across model pairs. You are not picking one model, you are building a cost-quality frontier and operating along it per request. Model routing is the dedicated treatment, and it pairs with cascade escalation: try the cheap model first, escalate only when confidence is low.
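Cascade escalation fits in a few lines. This is a generic sketch, not RouteLLM's API: the two model functions are toy stand-ins, the confidence score is assumed to come from somewhere (a verifier, log-probabilities, a learned router), and the 0.8 threshold is arbitrary.

```python
from typing import Callable, Tuple

# A model takes a prompt and returns (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]

def cascade(prompt: str, cheap: Model, flagship: Model,
            threshold: float = 0.8) -> Tuple[str, str]:
    """Try the cheap model first; escalate only when its confidence is low."""
    answer, confidence = cheap(prompt)
    if confidence >= threshold:
        return answer, "cheap"
    answer, _ = flagship(prompt)
    return answer, "flagship"

def cheap_model(prompt: str) -> Tuple[str, float]:
    # Toy stand-in: confident only on short prompts.
    return "cheap answer", (0.9 if len(prompt) < 40 else 0.3)

def flagship_model(prompt: str) -> Tuple[str, float]:
    return "flagship answer", 0.99

print(cascade("What is 2 + 2?", cheap_model, flagship_model))
# ('cheap answer', 'cheap')
print(cascade("Draft a migration plan for a sharded multi-region database.",
              cheap_model, flagship_model))
# ('flagship answer', 'flagship')
```

The cost of a cascade is the cheap call on every request plus the flagship call on the escalated share, so it only pays when most traffic clears the threshold.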
Caching makes repeated context nearly free, raising the tokens served per unit of work, and DeepSeek's 56.3 percent hit rate shows the scale. This is a system-design decision more than a serving-stack flag: shared system prompts, reused RAG chunks, and stable conversation prefixes all become cache hits, and many APIs bill cached input at up to a 90 percent discount. Ordering prompts so the stable part comes first and the variable part last is, quite literally, a cost optimization. Speculative decoding moves the same term differently: a small draft model proposes several tokens, the big model verifies them in one parallel pass, and you amortize one expensive weight-traversal across multiple accepted tokens for a 2-to-3x decode speedup at decent acceptance. It works because decode leaves so much compute idle; speculation spends that idle compute on verification. The catch is the draft model must fit in VRAM alongside the target, so it is a bandwidth win bought with a memory cost, and if the target already fills the GPU, speculation throws an out-of-memory error instead of a speedup.
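How much speculation buys depends on the acceptance rate. A sketch using the standard expectation from the speculative decoding literature, under the simplifying assumption that each drafted token is accepted independently with the same probability; it counts tokens per target-model pass and ignores the draft model's own cost, so real wall-clock speedups are lower.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass.

    The target accepts a prefix of the draft and always contributes one
    token itself, so the result lies between 1 and draft_len + 1.
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)

print(round(expected_tokens_per_pass(0.8, 4), 2))  # 3.36
print(round(expected_tokens_per_pass(0.5, 4), 2))  # 1.94
```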
Hardware itself is a lever, because decode time per token is bytes-moved over bandwidth. The bandwidth ladder runs A100 at ~2.0 TB/s, H100 at 3.35, H200 at 4.8, B200 around 8. Doubling memory bandwidth roughly doubles decode tokens per second with no change to the model or the code, which is why bandwidth, not peak FLOPS, is the number to read on an inference spec sheet. An H200 is not faster than an H100 because it computes more; it is faster on decode because it moves bytes faster.
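The bandwidth ladder translates directly into a ceiling on single-sequence decode speed. A sketch assuming one full weight read per token and nothing else (no KV-cache traffic, no kernel overhead), for a hypothetical 7B model at FP16:

```python
def max_decode_tokens_per_sec(weight_bytes: float,
                              bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on batch-one decode rate: one full weight read per token."""
    return bandwidth_bytes_per_sec / weight_bytes

BANDWIDTH_TBPS = {"A100": 2.0, "H100": 3.35, "H200": 4.8, "B200": 8.0}
weights = 14e9  # hypothetical 7B model at FP16

for gpu, tbps in BANDWIDTH_TBPS.items():
    print(gpu, round(max_decode_tokens_per_sec(weights, tbps * 1e12)))
# A100 143
# H100 239
# H200 343
# B200 571
```

Measured rates fall below these ceilings, but the ratios between GPUs track the bandwidth ratios closely.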
One model class breaks the simple intuition, and senior engineers get it wrong all the time: Mixture of Experts. The pitch is that only a few experts activate per token, so it must be cheaper. Compute, yes: FLOPs scale with active parameters, so Mixtral-8x7B does only 13B of work per token. Memory, no: VRAM scales with total parameters, because every expert has to be resident in case the router picks it, and Mixtral is 46B total, too big for one 80 GB GPU at FP16. MoE cuts the compute bill, not the memory floor. At batch one, where you cannot amortize the weight loads, a big MoE can be worse economics than a small dense model. It only pays off at the batch sizes where its compute savings actually show up.
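The two bills can be put side by side. A sketch using the round parameter counts from the text (46B total, 13B active) and the usual approximation of 2 FLOPs per active parameter per token:

```python
def moe_profile(total_params: float, active_params: float,
                bytes_per_param: float = 2.0) -> dict:
    """Memory follows total parameters; per-token compute follows active ones."""
    return {
        "weights_gb": total_params * bytes_per_param / 1e9,  # every expert must be resident
        "gflops_per_token": 2 * active_params / 1e9,         # only routed experts compute
    }

print(moe_profile(46e9, 13e9))  # {'weights_gb': 92.0, 'gflops_per_token': 26.0}
print(moe_profile(13e9, 13e9))  # {'weights_gb': 26.0, 'gflops_per_token': 26.0}
```

A dense 13B model does the same work per token while occupying less than a third of the memory.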
Where this gets hard: the second wall and the three-way frontier
Two nuances separate someone who has read about this from someone who has run it. The first: batching does not scale forever, because a second memory wall hides behind the first. Once batching has amortized the weight loads, the bottleneck shifts from loading weights to accessing the KV cache during attention. Attention's arithmetic intensity stays near one regardless of batch size, because each request attends over its own private cache, so at large batch the GPU goes back to being memory-starved, on the cache instead of the weights this time. Barcelona Supercomputing Center and IBM measured it: a 256-fold batch increase bought only a 33.8-fold throughput increase, with the attention kernel idle more than 80 percent of the time waiting on DRAM. Batching has sharply diminishing returns, which is why attention variants that shrink the cache, GQA, MQA, and the MLA approach DeepSeek uses, matter as much as any serving trick. They push back the second wall.
The second: there is no single dial called "speed." There is a three-way frontier between cost per token, time to first token, and inter-token latency, and you cannot win all three at once. Bigger batches lower cost per token but raise both latencies, because each request waits behind more company. Smaller batches give snappy latency but waste the GPU and spike cost. The production objective is never "make it fast" or "make it cheap"; it is minimize cost subject to a P99 time-to-first-token budget and a P99 inter-token-latency budget. And the metric that matters is goodput, the tokens per second that actually meet your latency targets, which raw throughput quietly lies about: tokens generated too slowly to satisfy the SLO are worse than useless, since you paid for them and the user left. Picking the operating point is the same tradeoff reasoning as the system design interview framework, where the move is always to make the constraint explicit before optimizing against it.
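Goodput is simple to measure once the budgets are explicit. A minimal sketch, assuming a per-request definition (a request counts only if it met both budgets) and hypothetical request records; real systems usually evaluate these at a percentile rather than per request.

```python
def goodput(requests, ttft_budget_s: float, itl_budget_s: float,
            window_s: float) -> float:
    """Tokens per second, counting only requests that met both latency budgets."""
    ok = sum(r["tokens"] for r in requests
             if r["ttft_s"] <= ttft_budget_s and r["itl_s"] <= itl_budget_s)
    return ok / window_s

requests = [  # hypothetical measurements over a 10-second window
    {"tokens": 400, "ttft_s": 0.3, "itl_s": 0.03},
    {"tokens": 600, "ttft_s": 0.4, "itl_s": 0.04},
    {"tokens": 500, "ttft_s": 2.5, "itl_s": 0.03},  # missed the TTFT budget
]
raw = sum(r["tokens"] for r in requests) / 10
print(raw)                                                                   # 150.0
print(goodput(requests, ttft_budget_s=1.0, itl_budget_s=0.05, window_s=10))  # 100.0
```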
A strategic overlay should shape the build-versus-buy call. Cost for fixed capability is falling on the order of 10x per year at the frontier, and by some task-level measures a median of 50x per year. A self-hosting decision that pencils out against today's API prices may be underwater in six months. Treat the API as the reversible default and an owned cluster as the irreversible commitment you make only once volume and utilization clearly justify locking in. To understand why the underlying capability keeps getting cheaper to produce, how LLMs work traces the mechanics all of this efficiency engineering wraps around.
The honest landing
LLM inference is expensive for one reason that cascades into all the others: generating a token means dragging the entire model out of memory, and memory bandwidth is the scarce resource, not compute. From that single fact the whole cost structure falls out. The memory wall sets a fixed minimum hardware bill before your first user. Decode being bandwidth-bound makes batching the only thing that makes a GPU pay, because it shares each expensive memory haul across more tokens. Utilization is the master variable, because an idle GPU bills the same as a saturated one, which makes self-hosting a utilization bet rather than a price comparison. Every lever, quantize, batch, route, cache, speculate, pick faster memory, reduces to the same two moves: move fewer bytes, or amortize each byte-move across more tokens.
The practical discipline is this. Never quote a cost per token without naming the five things that set it (model, precision, context length, batch and utilization, hardware), because the same model spans an order of magnitude on utilization alone. Read bandwidth, not TFLOPS. Forecast your utilization before you buy a GPU, and assume you are overestimating it. Order your prompts so the stable context caches. And treat the whole thing as one equation with a handful of terms rather than a bag of tricks, because that is what lets you find where the next dollar of cost is actually hiding. The teams that serve AI cheaply are not the ones with the most GPUs. They are the ones who keep the GPUs they have full.
FAQ
Why is LLM inference memory-bound instead of compute-bound?
Generating one token requires reading every weight in the model out of GPU memory once, then doing a tiny amount of math with it. For a 70B model at FP16 that is 140 GB of memory traffic to produce a single token. The arithmetic per byte is so low that the GPU's compute units sit nearly idle waiting on memory the whole time. The spec that bounds your token rate is therefore HBM bandwidth, not TFLOPS. This is why an H200 at 4.8 TB/s decodes faster than an H100 at 3.35 TB/s on the same model even though their math throughput is similar.
Why do output tokens cost more than input tokens?
Input tokens are processed in the prefill phase, where the whole prompt goes through the model at once as a matrix-matrix multiply. That is compute-bound and cheaply parallelized, so each weight load is reused across many tokens. Output tokens come from the decode phase, one token at a time, where every weight is re-fetched from memory per token. Decode is the bandwidth-bound, expensive phase, so providers price output at roughly three to five times input. The pricing is encoding the physics directly.
How is cost per million tokens actually calculated?
Cost per million tokens equals the GPU dollars per hour divided by the millions of tokens that GPU actually serves per hour, where tokens per hour is tokens per second times 3600 times utilization. The utilization term is what makes the number swing so widely. The same GPU and the same model cost seven times more per token at 10 percent utilization than at 70 percent, because an idle GPU bills the same as a busy one. Quoting cost per token without naming model, precision, context length, batch and utilization, and hardware is meaningless.
Is self-hosting an open model cheaper than using an API?
Only above a sustained utilization threshold, roughly 50 percent, which works out to several million tokens per GPU per day depending on model size. Below that, the idle-GPU bill dominates. A box at 10 percent load turns a 0.013 dollar per thousand tokens cost into 0.13, which loses to per-token APIs that never charge you for idle. Self-hosting wins when you have steady, high-volume, predictable traffic you can keep a GPU saturated with. It loses for spiky or low-volume workloads, where the API absorbs the idle time for you.
Does Mixture of Experts make inference cheaper?
It cuts compute, not the memory floor, and those are different bills. FLOPs per token scale with the active parameters, so an 8x7B model doing 13B of active work is cheap to compute. But VRAM scales with total parameters, because every expert has to be resident in memory in case the router picks it. Mixtral-8x7B is 13B active and 46B total, so it will not fit on one 80 GB GPU at FP16. At batch one, where you cannot amortize the weight loads, a big MoE can be worse economics than a small dense model.