Data Science Collective
Feroz Khan
May 2026
The Memory Wall Is Strangling Your LLM: Why GPUs Are Faster Than We Think and Slower Than We Need
A structural problem hiding behind impressive benchmark numbers.
Feroz Khan
ML Engineer, Seattle
Theoretical tok/s (H100): 62,000
Actual tok/s in production: ~200
There is a number that should bother anyone who has spent time thinking seriously about LLM inference: 62,000 tokens per second.
That is the theoretical throughput ceiling for an 8B-parameter model running on a single NVIDIA H100 GPU. You can derive it purely from the chip's peak compute capacity of roughly one quadrillion floating-point operations per second (1 petaFLOP/s): a forward pass costs about 2 FLOPs per parameter, or 16 GFLOPs per token for 8 billion parameters, and 1 PFLOP/s divided by 16 GFLOPs is about 62,000. It is the number you would include in a slide deck if you wanted to sound optimistic about AI infrastructure.
The actual number, across virtually every production inference engine in use today, sits somewhere between 100 and 300 tokens per second.
That is a gap of roughly 300x between theory and reality. And it is not a software bug, a framework inefficiency, or a failure of engineering ambition. It is a structural property of how modern hardware is built.
The Compute Illusion
Modern GPUs are genuinely extraordinary compute engines. The H100's tensor cores can perform matrix multiplications at a rate that would have seemed impossible a decade ago. If inference throughput were purely a function of arithmetic throughput, we would be living in a very different world where latency was effectively free and model serving was a solved problem.
But inference throughput, at least during token-by-token generation, is not a function of arithmetic throughput. It is a function of memory bandwidth: specifically, the rate at which a GPU can transfer data between its high-bandwidth memory (HBM) and its on-chip compute units. That number, while impressive in absolute terms (3.35 TB/s on the H100), is nowhere near sufficient to keep those tensor cores fed.
What Decoding Actually Costs
An LLM with 8 billion parameters, stored in 16-bit precision, occupies roughly 16 GB of memory. During inference, generating each new token requires a full forward pass through the model. That means every single set of weights (all 16 GB of them) must travel from HBM to the on-chip SRAM and into the processor registers, get used for a matrix multiplication, and then be discarded to make room for the next layer's weights.
This is not a one-time cost. It happens for every token generated.
tokens/sec = HBM_bandwidth / model_size = 3,350 GB/s / 16 GB = ~200 tok/s
If you want to generate a 1,000-token response, you need 1,000 complete weight transfers. With 3.35 TB/s of bandwidth and 16 GB of weights per transfer, the math is almost embarrassingly direct: you can afford roughly 200 transfers per second. The compute units, capable of 1,000 TFLOP/s, are sitting idle for most of this time, waiting for data to arrive.
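The two ceilings quoted so far can be reproduced in a few lines. This is a back-of-envelope sketch only: it assumes about 2 FLOPs per parameter per generated token, ignores attention and KV cache traffic, and uses the rounded H100 figures from this article. The function and constant names are illustrative and do not come from any library.

```python
# Back-of-envelope decode ceilings for a dense model on one GPU.
# Assumption: ~2 FLOPs per parameter per token; attention and KV cache ignored.

PEAK_FLOPS = 1.0e15        # 1 PFLOP/s, the rounded H100 figure used in the text
HBM_BANDWIDTH = 3.35e12    # 3.35 TB/s, in bytes per second
N_PARAMS = 8e9             # 8B parameters
BYTES_PER_PARAM = 2        # 16-bit weights


def compute_bound_tok_per_s(peak_flops: float, n_params: float) -> float:
    """Ceiling if arithmetic were the only cost."""
    flops_per_token = 2 * n_params
    return peak_flops / flops_per_token


def memory_bound_tok_per_s(bandwidth: float, n_params: float, bytes_per_param: int) -> float:
    """Ceiling if streaming the weights once per token were the only cost."""
    model_bytes = n_params * bytes_per_param
    return bandwidth / model_bytes


compute_ceiling = compute_bound_tok_per_s(PEAK_FLOPS, N_PARAMS)
memory_ceiling = memory_bound_tok_per_s(HBM_BANDWIDTH, N_PARAMS, BYTES_PER_PARAM)

print(f"compute-bound ceiling: {compute_ceiling:,.0f} tok/s")
print(f"memory-bound ceiling:  {memory_ceiling:,.0f} tok/s")
print(f"ratio between them:    {compute_ceiling / memory_ceiling:.0f}x")
```

Under these assumptions the compute ceiling comes out at 62,500 tok/s and the bandwidth ceiling at about 209 tok/s. Changing `BYTES_PER_PARAM` to 1 shows why quantization helps: the memory-bound ceiling doubles while the compute ceiling does not move.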
HBM (model weights): 80 GB, 3.35 TB/s bandwidth, the bottleneck
On-chip SRAM: 50 MB, fast but tiny, cannot hold any real model
Peak compute: 1 PFLOP/s, mostly idle during decode
This is the memory wall, not a new concept (Williams et al. described the roofline model formally in 2009), but newly relevant in a way that is defining the economics of AI infrastructure. Compute hardware has improved at roughly Moore's Law pace. Memory bandwidth has improved much more slowly. This divergence is what makes LLM inference structurally difficult.
Section 2: Autoregressive Decode
One Token. One Full Weight Transfer. Repeat.
To appreciate why the memory wall is so hard to escape during decode, you need to think about arithmetic intensity: the ratio of floating-point operations to bytes of memory transferred, measured in FLOPs per byte.
During autoregressive decoding, each new token requires its own forward pass. The ratio of output tokens to model streams is always 1:1. Arithmetic intensity stays constant regardless of response length, and it stays low. Decode is almost always memory-bound.
The prefill stage, where the model processes the input prompt, behaves very differently. Because all prompt tokens are available simultaneously, the transformer can process them in parallel through a single forward pass. For a prompt of N tokens, you get N token-equivalents of computation from a single model stream. Arithmetic intensity scales linearly with sequence length. Prefill is compute-bound. This asymmetry between prefill and decode is a structural property of autoregressive generation, not an engineering oversight.
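The asymmetry is easy to put numbers on. The sketch below counts only weight traffic and assumes about 2 FLOPs per parameter per token with 16-bit weights, so the parameter count cancels out; activations and the KV cache are ignored, and the function name is illustrative.

```python
# Arithmetic intensity (FLOPs per byte of weights streamed) for one forward pass.
# Assumptions: ~2 FLOPs per parameter per token, 16-bit weights,
# and weight traffic only (activations and KV cache are ignored).

BYTES_PER_PARAM = 2


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """FLOPs performed per byte of weights read in a single model stream."""
    flops_per_param = 2 * tokens_per_pass   # each weight is reused once per token
    return flops_per_param / bytes_per_param


decode = arithmetic_intensity(tokens_per_pass=1)       # one new token per stream
prefill = arithmetic_intensity(tokens_per_pass=2048)   # a 2,048-token prompt

print(f"decode:  {decode:.0f} FLOP/byte")
print(f"prefill: {prefill:.0f} FLOP/byte")
```

With the H100 figures used in this article, compute and bandwidth balance at roughly 1 PFLOP/s divided by 3.35 TB/s, or about 300 FLOPs per byte. Decode at 1 FLOP per byte sits far below that point; a 2,048-token prefill sits well above it.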
Section 3: KV Cache and Batching
Batching Helps, Until the KV Cache Kills It
The most obvious lever for improving arithmetic intensity during decode is batching. If you process 64 queries simultaneously, a single model stream extends 64 responses. The numerator of your arithmetic intensity ratio grows while the denominator stays fixed.
But there is a catch. The KV cache is essential to making autoregressive inference tractable. In transformer attention, computing the attention scores for a new token requires access to the keys and values produced by every previous token. Rather than recomputing these from scratch each step (which would make cost scale as O(N^2) with sequence length), inference engines cache them in HBM.
The problem is that KV cache memory scales with batch size times sequence length times model depth. As batches grow larger and sequences get longer, the KV cache consumes an increasing share of HBM. This limits how much you can amortize the weight-streaming cost across multiple queries.
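A rough sizing exercise shows how quickly this bites. The model shape below (32 layers, 8 KV heads of dimension 128, 16-bit cache) is an assumption chosen to resemble a typical 8B model with grouped-query attention; it is not taken from this article. The throughput estimate is memory-bound only and assumes each decode step reads the weights once plus the entire KV cache.

```python
# KV cache sizing and its effect on batched decode.
# Assumed (hypothetical) model shape: 32 layers, 8 KV heads of dim 128, 16-bit cache.

HBM_BYTES = 80e9
HBM_BANDWIDTH = 3.35e12
WEIGHT_BYTES = 16e9

N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2


def kv_bytes_per_token() -> int:
    """Bytes of keys plus values stored for one token across all layers."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_cache_bytes(batch_size: int, seq_len: int) -> int:
    """Total cache size: batch size x sequence length x per-token footprint."""
    return batch_size * seq_len * kv_bytes_per_token()


def max_batch_size(seq_len: int) -> int:
    """Largest batch whose KV cache fits next to the weights in HBM."""
    free_bytes = HBM_BYTES - WEIGHT_BYTES
    return int(free_bytes // (seq_len * kv_bytes_per_token()))


def decode_tok_per_s(batch_size: int, seq_len: int) -> float:
    """Memory-bound estimate: each step streams the weights plus the whole KV cache."""
    bytes_per_step = WEIGHT_BYTES + kv_cache_bytes(batch_size, seq_len)
    steps_per_s = HBM_BANDWIDTH / bytes_per_step
    return batch_size * steps_per_s


for batch in (1, 16, 64):
    share = kv_cache_bytes(batch, 4096) / HBM_BYTES
    rate = decode_tok_per_s(batch, 4096)
    print(f"batch={batch:3d}  KV share of HBM={share:6.1%}  ~{rate:,.0f} tok/s")
print("max batch at 4,096 tokens:", max_batch_size(4096))
```

Under these assumptions each token costs 128 KiB of cache, and a batch of 64 sequences at 4,096 tokens needs more than twice as many bytes of KV cache as the model has weights. Throughput still rises with batch size, but by far less than 64x, because every step now has to read the cache as well as the weights.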
Figure: arithmetic intensity relative to the single-query baseline, KV cache pressure as a share of the 80 GB of HBM, and throughput gain relative to batch size 1.
vLLM addressed part of this with PagedAttention, borrowing ideas from operating system virtual memory to manage KV cache blocks non-contiguously. This reduces fragmentation and allows HBM to be used more efficiently, enabling larger effective batch sizes. TensorRT-LLM adds fused kernels and multi-head attention optimizations that reduce per-token overhead. These are meaningful wins, but they all operate within the same fundamental constraint: a memory-bound regime where the limiting factor is how fast you can push bytes from HBM to compute.
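The virtual-memory analogy can be made concrete with a toy allocator. This is an illustrative sketch of the paging idea only, not vLLM's implementation or API: the class name, the method names, and the block size of 16 tokens are all assumptions. Each sequence gets a block table that maps its logical token positions to physical blocks, which need not be contiguous.

```python
from typing import Dict, List, Tuple

# Toy paged KV cache allocator (illustrative; not vLLM's code or API).
BLOCK_SIZE = 16  # tokens per block, an assumed value


class PagedKVAllocator:
    def __init__(self, num_blocks: int) -> None:
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[int, List[int]] = {}   # seq_id -> physical block ids
        self.seq_lens: Dict[int, int] = {}             # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> Tuple[int, int]:
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:            # last block is full, or none exists yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8)
for _ in range(20):
    alloc.append_token(seq_id=0)    # 20 tokens occupy 2 blocks
for _ in range(5):
    alloc.append_token(seq_id=1)    # 5 tokens occupy 1 block
print(alloc.block_tables)
alloc.free(0)
print(len(alloc.free_blocks), "blocks free")
```

Because a sequence only ever wastes the unused tail of its last block, memory no longer has to be reserved up front for the maximum possible length, which is where the larger effective batch sizes come from.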
Section 4: Speculative Decoding
Guess Cheaply. Verify in Parallel.
Speculative decoding is a cleverer approach. The core insight: if you can predict the next several tokens cheaply, you can verify them all at once with the large model, amortizing one expensive forward pass across multiple tokens.
In practice, this means running a small draft model (sometimes as small as a few hundred million parameters) for several steps to generate candidate token sequences. The large verifier model then processes the entire drafted sequence in a single forward pass (similar to prefill where multiple tokens are handled simultaneously), and either accepts or rejects each draft token using a carefully designed acceptance criterion that preserves the target distribution.
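The acceptance criterion can be sketched in plain Python. This follows the rule described by Leviathan et al. (2023): accept a drafted token with probability min(1, p/q), and on the first rejection resample from the renormalized residual max(p - q, 0). The function names and the list-based toy distributions are assumptions made for illustration; a real engine would operate on logits from the two models.

```python
import random
from typing import List, Sequence


def sample(probs: Sequence[float], rng: random.Random) -> int:
    """Draw one token id from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def verify_draft(
    draft_tokens: Sequence[int],
    draft_probs: Sequence[Sequence[float]],    # q_i(.) at each drafted position
    target_probs: Sequence[Sequence[float]],   # p_i(.) at each position, plus one extra
    rng: random.Random,
) -> List[int]:
    """Accept a prefix of the draft, then append one token from the target model."""
    output: List[int] = []
    for i, token in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        if rng.random() < min(1.0, p[token] / q[token]):
            output.append(token)               # draft token accepted
            continue
        # Rejected: resample from the residual max(p - q, 0); weights are
        # normalized implicitly by the sampler. Everything after is discarded.
        residual = [max(pt - qt, 0.0) for pt, qt in zip(p, q)]
        output.append(sample(residual, rng))
        return output
    # Every draft token was accepted: the same verifier pass yields a bonus token.
    output.append(sample(target_probs[len(draft_tokens)], rng))
    return output


rng = random.Random(0)
q = [[0.7, 0.2, 0.1]] * 3                      # toy draft distributions
p = [[0.6, 0.3, 0.1]] * 4                      # toy target distributions
draft = [sample(dist, rng) for dist in q]
print(verify_draft(draft, q, p, rng))
```

Each call emits between one token and `len(draft_tokens) + 1` tokens for a single verifier pass. When the two distributions agree exactly, every drafted token is accepted.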
When the draft model achieves high acceptance rates, speculative decoding effectively reduces the number of large model forward passes per output token, pushing the system toward higher arithmetic intensity. The verifier starts behaving more like a compute-bound workload.
Figure: expected tokens accepted per verifier pass given the acceptance rate, effective speedup over naive single-token decode, and draft calibration.
The challenge is calibrating the draft model. Too large, and it loses its speed advantage over the verifier. Too small, and acceptance rates collapse and you end up doing more total work than naive decoding. The acceptance rate is also sensitive to temperature, prompt distribution, and draft length. In practice, speculative decoding requires tuning and does not deliver consistent speedups across all task types.
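The trade-off can be explored with the closed-form estimate from Leviathan et al. (2023), which assumes each drafted token is accepted independently with probability alpha. Real acceptance rates vary by position and by prompt, so treat this as a sketch. The `cost_ratio` of 0.05 used below is an assumed value, not a measurement.

```python
# Expected gain from speculative decoding, assuming every draft token is
# accepted independently with probability alpha (a simplifying assumption).


def expected_tokens_per_pass(alpha: float, draft_len: int) -> float:
    """Expected tokens produced per verifier pass, including the bonus token."""
    if alpha >= 1.0:
        return float(draft_len + 1)
    return (1 - alpha ** (draft_len + 1)) / (1 - alpha)


def expected_speedup(alpha: float, draft_len: int, cost_ratio: float) -> float:
    """Wall-clock speedup; cost_ratio = draft step time / verifier pass time."""
    return expected_tokens_per_pass(alpha, draft_len) / (draft_len * cost_ratio + 1)


for alpha in (0.5, 0.8, 0.9):
    for draft_len in (2, 4, 8):
        s = expected_speedup(alpha, draft_len, cost_ratio=0.05)
        print(f"alpha={alpha:.1f} draft_len={draft_len} speedup={s:.2f}x")
```

Both failure modes from the paragraph above show up directly. A large draft model raises `cost_ratio` and inflates the denominator; a poorly matched one lowers `alpha`, and once the expected tokens per pass fall below `draft_len * cost_ratio + 1` the result drops under 1x, meaning slower than naive decoding.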
Variants worth noting: self-speculative decoding drafts with a cheaper path through the same model (skipping layers or exiting early), avoiding the need for a separate model entirely. Lookahead decoding also drops the draft model, collecting candidate n-grams from the model's own parallel decoding steps and verifying them in the same pass. These methods trade generality for deployment simplicity.
Section 5: Diffusion LLMs
A Structural Escape from Memory Bounds
Speculative decoding is fundamentally a patch on an autoregressive paradigm. The real question is whether the paradigm itself can be changed.
Diffusion language models represent the most structurally distinct alternative currently in serious development. Rather than generating tokens left-to-right, one at a time, diffusion models operate on the entire output sequence simultaneously. They start with a fully masked or noisy sequence and iteratively refine it across multiple denoising steps until a coherent response emerges.
From an arithmetic intensity perspective, this is a meaningful shift. During each denoising iteration, the model performs a forward pass that updates every token in the context window, not just one. With a context window of length L, a single model stream contributes L token-update operations instead of one. Arithmetic intensity scales with context length, which is the opposite of autoregressive decoding.
For typical response lengths of several hundred to a few thousand tokens, this pushes diffusion models well into compute-bound territory. The compute units are no longer waiting for weights to arrive; they stay busy, and HBM is often not the bottleneck at all.
But theory and practice diverge here in important ways. Early diffusion LLMs like LLaDA were often slower than their autoregressive counterparts in wall-clock time. The central waste in vanilla diffusion: most refinement steps do not actually update most tokens meaningfully. Empirically, at any given step, the model is highly confident about perhaps 10% of positions. The rest are ambiguous, yet the model dutifully computes updates for all positions, burning FLOPs on outputs it will likely revise in subsequent steps.
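A roofline-style estimate shows why higher arithmetic intensity does not automatically mean lower wall-clock time. The sketch below assumes about 2 FLOPs per parameter per token-update, weights streamed once per forward pass, no KV cache, and no attention cost. The step counts are illustrative and are not measurements of LLaDA or any other model.

```python
# Wall-clock estimate: autoregressive decode vs vanilla diffusion on one GPU.
# Assumptions: ~2 FLOPs per parameter per token-update, weights streamed once
# per forward pass, no KV cache, attention cost ignored.

PEAK_FLOPS = 1.0e15
HBM_BANDWIDTH = 3.35e12
N_PARAMS = 8e9
WEIGHT_BYTES = 16e9


def pass_time_s(tokens_per_pass: int) -> float:
    """One forward pass takes as long as the slower of memory and compute."""
    memory_time = WEIGHT_BYTES / HBM_BANDWIDTH
    compute_time = 2 * N_PARAMS * tokens_per_pass / PEAK_FLOPS
    return max(memory_time, compute_time)


def autoregressive_time_s(seq_len: int) -> float:
    """One pass per output token, each updating a single position."""
    return seq_len * pass_time_s(tokens_per_pass=1)


def diffusion_time_s(seq_len: int, num_steps: int) -> float:
    """Each denoising step is one pass that updates every position."""
    return num_steps * pass_time_s(tokens_per_pass=seq_len)


seq_len = 1024
print(f"autoregressive:         {autoregressive_time_s(seq_len):.2f} s")
for steps in (1024, 256, 64):
    print(f"diffusion, {steps:4d} steps:  {diffusion_time_s(seq_len, steps):.2f} s")
```

With as many denoising steps as output tokens, the diffusion estimate comes out several times slower than autoregressive decode, because every step pays for a full-sequence update. It only pulls ahead once the step count falls well below the sequence length, which is why the wasted updates described above matter so much.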
Figure: autoregressive forward passes (1 per output token, always) versus diffusion steps (each updates all positions at once), and the resulting compute regime (memory-bound vs compute-bound).
Block Diffusion: The Hybrid Architecture
The most promising current direction is block diffusion (ICLR 2025 Oral). It is a hybrid that combines the parallel, high-throughput decoding of diffusion with the sequential structure of autoregressive models. The idea: partition the output into fixed-size blocks. Within each block, tokens are decoded using diffusion (all positions refined simultaneously). Across blocks, generation proceeds autoregressively, where each block conditions on all previous blocks.
This structure recovers two important properties that vanilla diffusion sacrifices. First, early stopping becomes possible: once any block generates an EOS token, generation terminates without diffusing subsequent blocks. Second, KV caching is restored across blocks, reused exactly as in standard transformer inference.
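The control flow is simple enough to sketch. Everything here is hypothetical scaffolding: `denoise_block` stands in for the entire within-block diffusion process, `EOS` is a made-up token id, and the toy denoiser exists only so the loop can run. It is meant to show the outer autoregressive loop and the early stop, not the method from the block diffusion paper in detail.

```python
from typing import Callable, List, Sequence

EOS = 0  # hypothetical end-of-sequence token id

# denoise_block(context, block_size) -> tokens for the next block.
# It stands in for all diffusion refinement steps within one block.
DenoiseFn = Callable[[Sequence[int], int], List[int]]


def generate(denoise_block: DenoiseFn, prompt: Sequence[int],
             block_size: int, max_blocks: int) -> List[int]:
    """Autoregressive over blocks, diffusion within each block."""
    context = list(prompt)      # a real engine would keep this prefix in a KV cache
    output: List[int] = []
    for _ in range(max_blocks):
        block = denoise_block(context, block_size)
        if EOS in block:
            # Early stop: keep tokens up to EOS and skip all later blocks.
            output.extend(block[: block.index(EOS)])
            break
        output.extend(block)
        context.extend(block)   # the next block conditions on everything so far
    return output


def toy_denoiser(context: Sequence[int], block_size: int) -> List[int]:
    """Stand-in model: counts upward and emits EOS once it reaches 10."""
    start = len(context)
    return [t if t < 10 else EOS for t in range(start + 1, start + 1 + block_size)]


print(generate(toy_denoiser, prompt=[1, 2], block_size=4, max_blocks=8))
# prints [3, 4, 5, 6, 7, 8, 9]
```

Only two of the eight permitted blocks are ever processed in this run. Because `context` only grows by appending finished blocks, the keys and values for earlier blocks never change, which is what makes cross-block KV caching possible.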
Most optimizations developed for autoregressive inference (speculative decoding, paged attention, continuous batching) can be grafted onto block diffusion without significant rearchitecting. From a systems perspective, block diffusion is the first architecture that seriously addresses both the memory bandwidth problem (via high arithmetic intensity within blocks) and the wasted computation problem (via adaptive stopping and KV caching across blocks).
Section 6: The Roofline Model
A Framework for Thinking About Bottlenecks
The roofline model (Williams et al., 2009) gives us a clean way to reason about where inference algorithms fall in the compute-vs-memory spectrum. On a roofline plot, the x-axis represents arithmetic intensity (FLOPs per byte) and the y-axis represents achieved performance. The relationship is linear up to a ridge point: below the ridge, you are memory-bound, where performance scales with bandwidth. Above it, you are compute-bound, limited by peak FLOP/s.
Autoregressive decoding sits far to the left of the ridge point. Prefill is compute-bound because it processes all prompt tokens in parallel. Diffusion and block diffusion cross into compute-bound territory by updating many tokens per model stream.
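The roofline itself is a one-line function. The hardware numbers below are the rounded H100 figures used throughout this article. The workload intensities are illustrative, derived from the assumption of 2 FLOPs per 2-byte weight per token with weight traffic only, so batched decode looks better here than it would once KV cache reads are included.

```python
# Roofline: attainable performance as a function of arithmetic intensity.
# Workload intensities are illustrative and count weight traffic only.

PEAK_FLOPS = 1.0e15        # FLOP/s
HBM_BANDWIDTH = 3.35e12    # bytes/s
RIDGE_POINT = PEAK_FLOPS / HBM_BANDWIDTH   # FLOP/byte where the two limits meet


def attainable_flops(intensity: float) -> float:
    """Bandwidth-limited below the ridge point, compute-limited above it."""
    return min(PEAK_FLOPS, HBM_BANDWIDTH * intensity)


def regime(intensity: float) -> str:
    return "memory-bound" if intensity < RIDGE_POINT else "compute-bound"


workloads = {
    "autoregressive decode, batch 1": 1,
    "autoregressive decode, batch 64": 64,
    "prefill, 2,048-token prompt": 2048,
    "diffusion step, 1,024 positions": 1024,
}

print(f"ridge point: {RIDGE_POINT:.0f} FLOP/byte")
for name, intensity in workloads.items():
    utilization = attainable_flops(intensity) / PEAK_FLOPS
    print(f"{name:34s} {regime(intensity):14s} {utilization:6.1%} of peak")
```

On these numbers the ridge point sits just under 300 FLOPs per byte. Single-query decode can reach only about a third of a percent of peak compute, while anything that processes a few hundred or more tokens per model stream lands on the flat, compute-bound part of the roof.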
Closing
What This Means for Inference Economics
Today, inference costs are dominated by GPU time, and GPU time is dominated by the memory-bandwidth ceiling during decode. GPU utilization for inference is structurally low compared to training. You are paying for peak FLOP/s hardware but using a small fraction of it productively.
Architectures that shift workloads toward the compute-bound regime (whether through batching, speculative decoding, diffusion, or block diffusion) directly reduce cost per token by making better use of silicon that is already paid for. There is also a latency angle: for interactive applications, time-to-first-token matters enormously. Any architectural change that collapses the distinction between prefill and decode has the potential to flatten the latency curve.
The roughly 300x gap between theoretical and realized token throughput is not going away through incremental improvements to existing infrastructure. What is interesting about the current moment is that solutions are emerging simultaneously at every level of the stack: at the systems level (paged attention, continuous batching, prefix caching, quantization), at the algorithm level (speculative decoding, lookahead methods), and at the architecture level (diffusion LLMs, block diffusion, mixture-of-experts with sparse activation).
The memory wall was a known problem in high-performance computing long before LLMs existed. What is new is that language model inference has made it a first-order economic problem. That is a different kind of pressure than academic interest. And it tends to produce results.
References
- Williams, Waterman, Patterson (2009). Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM, 52(4).
- Kwon et al. (2023). Efficient memory management for large language model serving with PagedAttention. SOSP 2023.
- Zheng et al. (2024). SGLang: Efficient execution of structured language model programs. NeurIPS 2024.
- Leviathan, Kalman, Matias (2023). Fast inference from transformers via speculative decoding. ICML 2023.
- Zhang et al. (2024). Draft and Verify: Lossless large language model acceleration via self-speculative decoding. ACL 2024.
- Fu et al. (2024). Break the sequential dependency of LLM inference using Lookahead decoding. arXiv:2402.02057.
- Austin et al. (2021). Structured denoising diffusion models in discrete state-spaces. NeurIPS 2021.
- Nie et al. (2025). Large language diffusion models (LLaDA). arXiv:2502.09992.
- Arriola et al. (2025). Block diffusion: Interpolating between autoregressive and diffusion language models. ICLR 2025 (Oral). arXiv:2503.09573.
- Dettmers et al. (2022). LLM.int8(): 8-bit matrix multiplication for transformers at scale. NeurIPS 2022.