Two Equations for Forcing KV-Cache Offload
There's a specific failure mode I keep seeing in KV-cache offloading benchmarks and experiments: someone
stands up vLLM with the native offloading connector, sets
--gpu-memory-utilization=0.9 because that's the default everyone copies, fires a load
generator at it, and reports throughput numbers. The problem: their KV cache never left
the GPU. The working set fit comfortably in HBM the whole time, the offload tier sat
idle, and they benchmarked a system that was never under memory pressure. The inverse
failure is just as common: set utilization too low, and instead of offloading you
trigger queueing and preemption, with requests waiting for KV blocks or getting recomputed, and you
measure congestion instead.
Both mistakes come from the same gap: people size these experiments against parameter count and a gut feeling, when the thing that actually governs KV residency is a short chain of architecture-dependent arithmetic. This post is that arithmetic, end to end, worked through on Qwen/Qwen3.6-35B-A3B, a model that breaks every naive shortcut at once.
We'll build up to two short checks: the utilization window (does everything fit
where you think it fits?) and the retention clock (does anything survive long enough
to be reused?). The first one tells you the exact range of --gpu-memory-utilization in which
your experiment does what you think it does. The second is the one everybody skips, and
it's the difference between a cache and a very busy /dev/null.
There's a fair bit of notation ahead. It's all defined where it first appears, but if you ever lose track, everything lives here:
Nomenclature
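| Symbol | Meaning |
|---|---|
| $L_{\text{attn}}$ | Standard-attention layers, the only ones with a growing KV cache |
| $H_{\text{kv}}$ | KV heads per attention layer |
| $d_{\text{head}}$ | Head dimension |
| $b_{\text{kv}}$ | Bytes per KV-cache element (BF16 = 2, FP8 = 1), a deployment choice |
| $\kappa$ | KV-cache bytes per token: $\kappa = 2 \cdot L_{\text{attn}} \cdot H_{\text{kv}} \cdot d_{\text{head}} \cdot b_{\text{kv}}$ |
| $P$, $b_{\text{param}}$ | Total parameter count; bytes per weight parameter |
| $W$ | Total weight bytes, $W = P \cdot b_{\text{param}}$ (all experts, for MoE) |
| $V$ | Per-GPU VRAM in bytes |
| $U$ | --gpu-memory-utilization, fraction of $V$ the engine may touch |
| $O$ | Per-GPU non-weight, non-KV overhead (activations, CUDA graphs, vision encoder) |
| $T$ | Tensor-parallel degree |
| $R$ | Replica count (deliberately absent from the per-replica math) |
| $M_{\text{kv}}$ | KV memory per replica: $M_{\text{kv}} = T\,(U \cdot V - O) - W$ |
| $N_{\text{gpu}}$ | GPU token capacity per replica: $N_{\text{gpu}} = M_{\text{kv}} / \kappa$ |
| $C$ | Offered concurrency, requests you intend to have running at once |
| $\text{ISL}$, $\text{OSL}$ | Input / output sequence length - prompt and generated tokens per request |
| $N_{\text{live}}$ | Live-set tokens: $N_{\text{live}} = C \cdot (\text{ISL} + \text{OSL})$ |
| $S$ | Sessions whose prefixes you want kept warm |
| $N_{\text{corpus}}$ | Reuse-corpus (footprint) tokens: $N_{\text{corpus}} = S \cdot \bar{\ell}$ |
| $U(N)$ | Utilization needed to hold $N$ tokens on GPU |
| $U_{\min}$, $U_{\max}$ | The window bounds: $U(N_{\text{live}})$ and $U(N_{\text{corpus}})$ |
| $B_{\text{cpu}}$, $N_{\text{cpu}}$ | CPU offload buffer in bytes; its token capacity |
| $N_{\text{disk}}$ | Disk-tier token capacity |
| $K$, $\lambda$ | Any tier's capacity; its incoming write rate (churn) |
| $w_{\text{off}}$ | Measured GPU→CPU offload write rate |
| $\tau$ | Tier retention time (Che's characteristic time): $\tau = K / \lambda$ |
| $t_{\text{think}}$ | Client think time between conversation turns |
| $\text{TTFT}$ | Time to first token - includes any queue wait before prefill |
| $t_{\text{gap}}$ | Reuse gap (inter-reference time) a cached block must survive: $t_{\text{gap}} = t_{\text{think}} + \text{TTFT}$ |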
#Step 0: Parameter count is a red herring (and here it's doubly so)
The first instinct is to reason from model size: "it's a 35B model, KV cache scales with the model, so...". Stop right there. Parameter count tells you almost nothing about KV cache size, and Qwen3.6-35B-A3B makes the point twice over.
First trap: which parameter count? It's a Mixture-of-Experts model (35B total parameters, ~3B active per token). The active count sets FLOPs, not memory: every expert still has to be resident in VRAM even though only a handful fire per token. So the number that drives weight memory is the 35B total, not the 3B active.
Second trap: which layers even have a KV cache? Qwen3.6-35B-A3B is a hybrid. Its 40 layers are arranged as 10 blocks of:
3 × (Gated DeltaNet -> MoE) # linear attention: constant-size recurrent state
1 × (Gated Attention -> MoE) # standard attention: growing KV cache
Only the 10 Gated Attention layers grow a per-token KV cache. The 30 Gated DeltaNet
layers are linear-attention layers: they carry a fixed-size recurrent state that does
not grow with sequence length (vLLM manages it through its mamba/constant-state cache
path, not the paged KV pool). Plug the full L=40 into the standard formula and you
overestimate the growing cache by 4× before you've done anything else.
So the numbers that actually matter for KV sizing are not "35B." They are:
- $L_{\text{attn}} = 10$ - standard-attention layers (not 40)
- $H_{\text{kv}} = 2$ - KV heads in those layers (grouped-query, and very grouped)
- $d_{\text{head}} = 256$ - head dimension, the width of each key/value vector
- $b_{\text{kv}}$ - KV-cache dtype in bytes, a deployment choice independent of the weights
#Step 1: Bytes per token
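Each token passing through a standard-attention layer deposits one key vector and one value vector. Summing over only those layers:
$$\kappa = 2 \cdot L_{\text{attn}} \cdot H_{\text{kv}} \cdot d_{\text{head}} \cdot b_{\text{kv}} \quad \text{bytes per token}$$
with $L_{\text{attn}}$ the number of standard-attention layers, $H_{\text{kv}}$ the KV heads per layer, $d_{\text{head}}$ the head dimension and $b_{\text{kv}}$ the bytes per cached element.
(We write $\kappa$, the per-token cost of KV, rather than a bare $B$, because in this post $b$ (bytes per element) and $B$ (buffer bytes) are already spoken for.)
For Qwen3.6-35B-A3B with a BF16 KV cache:
$$\kappa = 2 \cdot 10 \cdot 2 \cdot 256 \cdot 2 = 20{,}480\ \text{bytes} = 20\ \text{KiB/token}$$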
That is tiny. For comparison, Llama-3.1-8B (32 dense-attention layers, 8 KV heads, head_dim 128) sits at 128 KiB/token, over 6× more per token despite having less than a quarter of the total parameters. Two architectural facts do the work for the Qwen: only a quarter of the layers cache anything, and there are just two KV heads. Switch the KV cache to FP8 ($b_{\text{kv}} = 1$) and you halve it again to 10 KiB/token.
The naive-formula sanity check: if you'd carelessly used all 40 layers, you'd have computed 80 KiB/token (four times reality!) and you'd have sized your entire experiment against a phantom.
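The per-token arithmetic is short enough to script. The sketch below is illustrative only: the function name is made up for this post, and the head dimension of 256 is the value implied by the 20 KiB/token figure, so check it against the model config before relying on it.

```python
KIB = 1024

def kv_bytes_per_token(attn_layers: int, kv_heads: int, head_dim: int, kv_dtype_bytes: int) -> int:
    """One key vector plus one value vector per KV head, per caching layer."""
    return 2 * attn_layers * kv_heads * head_dim * kv_dtype_bytes

# Hybrid layout from the text: 10 blocks of 3 DeltaNet layers + 1 attention layer.
layout = (["deltanet"] * 3 + ["attention"]) * 10
attn_layers = layout.count("attention")  # only these layers grow a KV cache

qwen_bf16 = kv_bytes_per_token(attn_layers, kv_heads=2, head_dim=256, kv_dtype_bytes=2)
qwen_fp8 = kv_bytes_per_token(attn_layers, kv_heads=2, head_dim=256, kv_dtype_bytes=1)
# The careless version: every one of the 40 layers counted as a caching layer.
qwen_naive = kv_bytes_per_token(len(layout), kv_heads=2, head_dim=256, kv_dtype_bytes=2)
llama_8b = kv_bytes_per_token(32, kv_heads=8, head_dim=128, kv_dtype_bytes=2)

for name, value in [("qwen bf16", qwen_bf16), ("qwen fp8", qwen_fp8),
                    ("qwen naive (40 layers)", qwen_naive), ("llama-3.1-8b", llama_8b)]:
    print(f"{name:>24}: {value / KIB:.0f} KiB/token")
```

Swapping in another model means changing only the four arguments; for a hybrid, the layer count is the one that is easiest to get wrong.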
#The recurrent-state asterisk*
The 30 DeltaNet layers aren't free; they just don't scale the way you expect. Each holds a fixed-size recurrent state per sequence, on the order of tens of MB per active request (verify against your build since the exact geometry depends on the DeltaNet head layout). A 4K-token request and a 256K-token request carry the same DeltaNet state, so on the live side this cost scales with concurrency, not context length. It's a different memory law living inside the same model, and it's why long-context serving on this architecture is so cheap on the growing-cache side.
Two small details before you put it away:
- vLLM's hybrid allocator carves these state pages out of the KV pool itself, not out of the overhead term. They trim a little off $M_{\text{kv}}$, which is the KV memory budget we'll compute in Step 2, before your attention tokens ever see it.
- The moment you enable prefix caching, the state stops being purely
concurrency-scaled. Resuming a cached prefix requires a state checkpoint at the
resume point, and vLLM retains those checkpoints per cached prefix (block-aligned, via
--mamba-cache-mode=align, see the gotchas). Retain a big corpus, and you're retaining its checkpoints too. Track the state as a separate term either way; just don't call it "fixed" once caching is on, and don't fold it into the overhead term $O$.
If you're instead sizing an MLA model like DeepSeek, that's a different asterisk. MLA caches a single compressed latent per token, so the factor of 2 for K/V disappears and the width is the latent dim, not $H_{\text{kv}} \cdot d_{\text{head}}$. Everything else remains unchanged.
#Step 2: How much GPU actually holds KV
vLLM's gpu_memory_utilization ($U$) is a fraction of each GPU's total memory that the
engine may touch. Out of that budget, weights and runtime overhead come first; KV gets the
remainder. Per GPU, with tensor-parallel degree $T$ (weights are sharded, so each GPU
holds $W/T$):
$$M_{\text{kv}}^{\text{per-GPU}} = U \cdot V - \frac{W}{T} - O$$
where $V$ is per-GPU VRAM, $W = P \cdot b_{\text{param}}$ is total weight bytes (parameter
count times bytes per parameter, and for MoE that's all 35B, every expert) and
$O$ is per-GPU non-weight, non-KV overhead (activations, CUDA graphs,
and for this model the vision encoder, which vLLM's multimodal profiling reserves
memory for unless you pass --language-model-only). The DeltaNet state pool is not in
$O$; the hybrid allocator takes it out of the KV budget, per the asterisk above. Aggregate
over the replica's $T$ GPUs and the terms recombine:
$$M_{\text{kv}} = T\,(U \cdot V - O) - W$$
Divide by $\kappa$, the KV-cache bytes per token, for GPU token capacity per replica:
$$N_{\text{gpu}} = \frac{M_{\text{kv}}}{\kappa} = \frac{T\,(U \cdot V - O) - W}{\kappa}$$
Note what's not here: the replica count $R$. Replicas are independent processes with their own weights and KV pools. To size a fleet, divide total load by $R$ and size one replica.
A worked budget: serve BF16 on 2×H100 (80 GiB each, $T = 2$). BF16 weights for 35B
total run about $W \approx 70$ GB (35 GB per GPU after sharding, which is exactly why this
model wants $T = 2$ in BF16, or an FP8 checkpoint to fit on one card). Assume $O \approx 3$ GiB per
GPU (bumped up from a dense model's ~2 GiB to cover the vision encoder; nvidia-smi also
reports slightly under a true 80 GiB, gotcha #1 absorbs both). At the default $U = 0.9$, with 70 GB ≈ 65.2 GiB of weights:
$$M_{\text{kv}} = 2\,(0.9 \cdot 80 - 3) - 65.2 \approx 72.8\ \text{GiB}, \qquad N_{\text{gpu}} = \frac{72.8\ \text{GiB}}{20\ \text{KiB/token}} \approx 3.8 \times 10^{6}\ \text{tokens}$$
3.8 million tokens of growing KV on-GPU at the default utilization. That enormous number is the whole story of this architecture: because $\kappa$ is so small, HBM holds a staggering amount of context, and forcing offload is going to take real effort.
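The budget above can be reproduced in a few lines. This is a sketch under stated assumptions, not vLLM's own accounting: the function name is invented here, and the 3 GiB overhead is a placeholder that happens to land on the 3.8M figure, so replace it with what your startup log reports.

```python
GIB = 2**30

def kv_tokens_per_replica(util, vram_bytes, weight_bytes, overhead_bytes, tp, kv_bytes_per_token):
    """Aggregate KV bytes across the replica's GPUs, divided by KV bytes per token."""
    kv_bytes = tp * (util * vram_bytes - overhead_bytes) - weight_bytes
    return max(0, int(kv_bytes // kv_bytes_per_token))

n_gpu = kv_tokens_per_replica(
    util=0.90,                    # --gpu-memory-utilization
    vram_bytes=80 * GIB,          # per-GPU VRAM
    weight_bytes=70e9,            # 35B total params x 2 bytes (BF16), in decimal GB
    overhead_bytes=3 * GIB,       # ASSUMPTION: placeholder per-GPU overhead, measure yours
    tp=2,                         # tensor-parallel degree
    kv_bytes_per_token=20 * 1024, # BF16 KV cache for this model
)
print(f"GPU token capacity per replica: {n_gpu / 1e6:.2f}M tokens")
```

Note the mixed units: VRAM and overhead are binary GiB while the weight size is decimal GB. Keeping everything in bytes, as the function does, avoids a silent few-percent error.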
#Step 3: The two working sets (this is the part you were missing)
Here's the conceptual move almost every naive sizing skips: there isn't one KV demand; there are two, and they play completely different roles.
The live set is the KV for the requests you intend to have running at once. It is a hard floor on GPU residency, and vLLM enforces it in two ways (neither of which is offload). Requests that don't fit simply wait: the scheduler admits fewer than you offered, your effective concurrency quietly sits below the $C$ you think you're testing, and TTFT balloons. And requests that were admitted but grow past capacity during decode get preempted (recomputed or swapped). If this regime sounds familiar, it should: "the memory the active work needs exceeds the fast memory available" is Denning's classic thrashing condition, vintage 1968 [1]. GPUs didn't invent it; they just made it cost more per hour. Peak occupancy is when every concurrent request has grown to full length:
$$N_{\text{live}} = C \cdot (\text{ISL} + \text{OSL})$$
where $C$ is the concurrency you're offering (what you want max_num_seqs and the
KV pool to actually sustain, and not the total number of requests you send), and
$\text{ISL}$, $\text{OSL}$ are the input and output sequence lengths (prompt and
generated tokens per request).
Tip: Watch the served concurrency in the engine logs; if it's a fraction of your load generator's setting, you're below the floor whether you meant to be or not.
The reuse corpus is the total set of unique prefix tokens you want kept warm across tiers: shared system prompts, multi-turn agent histories, RAG contexts. The cache literature calls this the footprint, which is the total unique data a workload touches. It's what prefix caching retains, and it's what you want to overflow onto CPU and disk:
$$N_{\text{corpus}} = S \cdot \bar{\ell}$$
where $S$ is the number of sessions whose prefixes you're keeping warm and $\bar{\ell}$ (the bar) is the average retained length per session, in tokens.
The active prefixes are a subset of the corpus; the corpus is the union. You never add these two. Size the GPU floor against $N_{\text{live}}$ and the spill against $N_{\text{corpus}}$. Conflating them is how people benchmark preemption while believing they're benchmarking offload.
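A minimal sketch of keeping the two demands apart. The corpus numbers (200 sessions at roughly 34.8K tokens) are the ones used later in this post; the live-set inputs are purely hypothetical placeholders, chosen only so the example runs.

```python
def live_set_tokens(concurrency: int, isl: int, osl: int) -> int:
    """Peak KV tokens when every concurrent request has grown to full length."""
    return concurrency * (isl + osl)

def corpus_tokens(sessions: int, avg_retained_len: int) -> int:
    """Unique prefix tokens you want kept warm across all tiers."""
    return sessions * avg_retained_len

# HYPOTHETICAL live-set inputs, not taken from the post.
n_live = live_set_tokens(concurrency=32, isl=32_768, osl=2_048)
# Corpus from the worked example: 200 sessions at ~34.8K retained tokens each.
n_corpus = corpus_tokens(sessions=200, avg_retained_len=34_800)

print(f"GPU floor (live set): {n_live:,} tokens")
print(f"Spill target (corpus): {n_corpus:,} tokens")
# Deliberately no n_live + n_corpus anywhere: the live set is a subset of the corpus.
```

The two values feed different checks: the first bounds the utilization from below, the second from above.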
#Step 4: Invert for U - the utilization window
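Set $N_{\text{gpu}}$ equal to a target token count $N$ and solve for the utilization that produces it (with $\kappa$ the KV bytes per token, $W$ total weight bytes, $O$ per-GPU overhead, $T$ the tensor-parallel degree and $V$ per-GPU VRAM):
$$U(N) = \frac{N \cdot \kappa + W + T \cdot O}{T \cdot V}$$
Evaluate at both working sets:
$$U_{\min} = U(N_{\text{live}}), \qquad U_{\max} = U(N_{\text{corpus}})$$
The experiment lives in the window between them, clamped to something practical like 0.95:
$$\boxed{\,U_{\min} \;\le\; U \;<\; U_{\max}, \qquad U \le 0.95\,}$$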
Read it out loud, because it's the whole post:
- Below $U_{\min}$ the GPU can't even hold the live set. You get queueing and preemption: effective concurrency sags, TTFT balloons, and the run measures congestion at least as much as caching. Not automatically fatal (Step 6 explains when offload still shows through) but know you're there.
- Inside the window the live set fits, the corpus doesn't, the surplus spills to CPU then disk. This is the regime you want.
- At or above $U_{\max}$ everything fits on GPU. Nothing offloads. Your tier is a no-op.
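The three regimes above can be checked mechanically. The sketch below inverts the capacity formula and classifies a chosen utilization; function names are illustrative, the 3 GiB overhead is an assumed placeholder, and the live-set token count is hypothetical.

```python
GIB = 2**30

def utilization_for(tokens, kv_bytes_per_token, weight_bytes, overhead_bytes, tp, vram_bytes):
    """Utilization at which the replica's KV pool holds exactly `tokens` tokens."""
    return (tokens * kv_bytes_per_token + weight_bytes + tp * overhead_bytes) / (tp * vram_bytes)

def regime(util, u_min, u_max):
    """Classify a utilization setting against the window bounds."""
    if util < u_min:
        return "congestion: live set does not fit"
    if util >= u_max:
        return "no-op: everything fits on GPU"
    return "offload: live set fits, corpus spills"

hw = dict(kv_bytes_per_token=20 * 1024, weight_bytes=70e9,
          overhead_bytes=3 * GIB,  # ASSUMPTION: placeholder, read yours from the startup log
          tp=2, vram_bytes=80 * GIB)

u_min = utilization_for(1_114_112, **hw)   # HYPOTHETICAL live set
u_max = utilization_for(6_960_000, **hw)   # corpus: 200 sessions x ~34.8K tokens

print(f"U_min = {u_min:.2f}, U_max = {u_max:.2f}")
for util in (0.40, 0.85, 0.95):
    print(f"U = {util:.2f} -> {regime(util, u_min, u_max)}")
```

A `u_max` above 1 simply means no setting can make the corpus fit, so the upper bound of the practical window becomes the clamp rather than `u_max`.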
Back to the example. This model does long context (262K native), so give it a long-context agent workload: concurrent, , :
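That's well under the 3.8M-token GPU capacity, so the live set is safe. Now a corpus of 200 retained agent sessions at ~34.8K tokens each, $N_{\text{corpus}} \approx 6.96$M tokens, about 143 GB of KV:
$$U_{\max} = U(N_{\text{corpus}}) > 1$$
$U_{\max} > 1$ is the tell: the corpus can't fully fit at any utilization, so it spills no matter what. (143 GB of KV plus 70 GB of weights already exceeds the roughly 172 GB that two 80 GiB cards hold in total.) Combined with a $U_{\min}$ below the 0.9 default, any $U \in [U_{\min}, 0.95]$ holds the live set (no preemption) while forcing the corpus surplus off-GPU. Pick a $U$ in that range that leaves $N_{\text{gpu}} \approx 3.5$M tokens and roughly 3.5M tokens are forced to CPU/disk by construction, not by luck.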
Notice how the tiny $\kappa$ reshapes the problem: with 20 KiB/token you cannot pressure this model with a handful of concurrent requests, no matter how long. Offload here is driven by the reuse corpus, essentially never by the live set. On a dense model with a fat $\kappa$, it's often the reverse. The math tells you which lever to pull before you allocate the node.
The no-window case. If $N_{\text{live}}$ alone blows past the practical GPU budget, i.e. $U_{\min} > 0.95$, there is no utilization at which the live set fully fits. Reduce $C$, shorten sequences, raise $T$, or proceed with eyes open. A queueing-dominated run isn't automatically worthless: offload can still deliver dramatic TTFT and throughput gains there, because cached prefixes drain the queue faster. But the queue itself changes the offload tier's job in a way the shelf math can't see (Step 6), and you must not present the results as a clean steady-state benchmark.
#Step 5: Forcing the disk tier specifically
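Getting something to offload is easy once you're in the window. Getting the disk (or remote) tier to activate takes one more inequality. With GPU capacity $N_{\text{gpu}}$ and a configured CPU buffer of token-capacity $N_{\text{cpu}}$, the cascade fills top-down, writing $(x)^{+} = \max(x, 0)$:
$$\text{GPU: } \min(N_{\text{corpus}},\ N_{\text{gpu}}) \;\rightarrow\; \text{CPU: } \min\big((N_{\text{corpus}} - N_{\text{gpu}})^{+},\ N_{\text{cpu}}\big) \;\rightarrow\; \text{disk: } (N_{\text{corpus}} - N_{\text{gpu}} - N_{\text{cpu}})^{+}$$
So to guarantee disk sees traffic:
$$N_{\text{corpus}} > N_{\text{gpu}} + N_{\text{cpu}}$$
At a utilization that leaves $N_{\text{gpu}} \approx 3.5$M tokens, against our 6.96M-token corpus, a CPU buffer covering up to ~2M tokens still leaves ~1.5M tokens with nowhere to go but disk. Because $\kappa$ is small, these token counts are huge, and a CPU buffer you'd size in GB for a dense model may need to be sized in hundreds of GB here to make a dent. Plan the buffer in tokens, not gigabytes, and convert.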
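The top-down fill can be sketched as a three-way split of the corpus. This is a simplified model (names are illustrative, and it assumes every tier stores KV at the same bytes per token), using the round numbers from the paragraph above.

```python
def cascade(corpus_tokens: int, gpu_tokens: int, cpu_tokens: int):
    """Split the corpus across GPU, CPU and disk, filling the faster tier first."""
    on_gpu = min(corpus_tokens, gpu_tokens)
    on_cpu = min(corpus_tokens - on_gpu, cpu_tokens)
    on_disk = corpus_tokens - on_gpu - on_cpu
    return on_gpu, on_cpu, on_disk

def tokens_to_bytes(tokens: int, tier_bytes_per_token: int) -> int:
    """Convert a token budget into the byte size a tier must be configured with."""
    return tokens * tier_bytes_per_token

on_gpu, on_cpu, on_disk = cascade(corpus_tokens=6_960_000,
                                  gpu_tokens=3_500_000,
                                  cpu_tokens=2_000_000)
print(f"GPU {on_gpu:,} | CPU {on_cpu:,} | disk {on_disk:,} tokens")

# CPU buffer size for 2M tokens at 20 KiB/token (ASSUMES the CPU tier is uncompressed).
cpu_bytes = tokens_to_bytes(2_000_000, 20 * 1024)
print(f"CPU buffer: {cpu_bytes / 1e9:.1f} GB")
```

If `on_disk` comes out as zero, the disk tier never sees traffic, whatever the configuration file says.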
One trap: lower tiers may not use the same $\kappa$. LMCache and similar engines can store CPU/disk copies compressed or requantized, so compute $N_{\text{cpu}}$ and $N_{\text{disk}}$ with that tier's bytes-per-token. Assuming a uniform $\kappa$ across tiers will skew your fill math.
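A second trap, hybrid-specific: $\kappa$ counts only the attention KV. Whether a spilled prefix is actually restorable is a separate question. The DeltaNet layers can only resume from a surviving state checkpoint, so the usable hit length is
$$\ell_{\text{hit}} = \min(\ell_{\text{kv}},\ \ell_{\text{ckpt}})$$
where $\ell_{\text{kv}}$ is the prefix length whose attention KV the tier can return and $\ell_{\text{ckpt}}$ is the position of the last surviving state checkpoint for that prefix.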
Tokens past the last checkpoint get their attention KV shipped back up and then get
recomputed anyway, through all 40 layers, because the recurrent state at that position
no longer exists. If your dashboards ever show external_kv_transfer and
local_compute both high at the same time, this is what you're looking at: the tier is
restoring data without proportionally skipping compute. On a pure-attention model the two
counters trade off; on a hybrid they can both lose.
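A small sketch of that effect, with made-up numbers: the usable hit is cut back to the last state checkpoint at or before the end of the restored attention KV, and everything after it is transferred and then recomputed. The checkpoint positions here are hypothetical; real ones depend on the block alignment of your build.

```python
def usable_hit_tokens(kv_hit_tokens: int, checkpoint_positions: list[int]) -> int:
    """Longest prefix that has both restored attention KV and a state checkpoint."""
    resumable = [p for p in checkpoint_positions if p <= kv_hit_tokens]
    return max(resumable, default=0)

kv_hit = 30_000                                   # tokens of attention KV the tier returned
checkpoints = [8_192, 16_384, 24_576, 32_768]     # HYPOTHETICAL surviving checkpoints

usable = usable_hit_tokens(kv_hit, checkpoints)
wasted = kv_hit - usable                          # transferred, then recomputed anyway
print(f"usable hit: {usable:,} tokens, transferred but recomputed: {wasted:,} tokens")
```

With no surviving checkpoint the usable hit is zero even though the tier reports a full attention-KV hit, which is the both-counters-high signature described above.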
#Step 6: Check the clock, not just the shelf
Everything so far is shelf math: does the corpus fit, and where does the overflow land. There is a second requirement, and it's the one that quietly zeroes out more offloading benchmarks than any sizing mistake: a cached block only pays off if it survives until it's reused. Every tier is an LRU cache under continuous write pressure, and an LRU's retention time is nothing more than capacity over churn:
$$\tau = \frac{K}{\lambda}$$
where $K$ is the tier's capacity and $\lambda$ is its incoming write rate.
The CDN world has a name for this quantity: the characteristic time of an LRU cache, the load-bearing idea in Che's approximation [2]. We're just computing it for KV blocks instead of web objects.
A block written during one turn of a conversation is needed again when the next turn's prefill is scheduled, not when the client sends it. So the reuse gap it must survive (the inter-reference time, in cache-speak) is:
$$t_{\text{gap}} = t_{\text{think}} + \text{TTFT}$$
client think time plus however long the next turn sits in the queue. Which gives us the second boxed equation of the post, the retention clock:
$$\boxed{\,\tau_{\text{cpu}} = \frac{B_{\text{cpu}}}{w_{\text{off}}} \;\ge\; t_{\text{think}} + \text{TTFT}\,}$$
with $B_{\text{cpu}}$ your offload buffer in bytes and $w_{\text{off}}$ the offload write rate you can read straight off the connector's transfer metrics.
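The clock check is one division and one comparison. In the sketch below the write rate, think time and TTFT are hypothetical inputs standing in for values you would read from your own metrics; only the structure of the check comes from the inequality above.

```python
GIB = 2**30

def retention_seconds(capacity_bytes: float, write_bytes_per_s: float) -> float:
    """Characteristic time of an LRU tier: capacity divided by incoming write rate."""
    return capacity_bytes / write_bytes_per_s

def clock_ok(capacity_bytes, write_bytes_per_s, think_s, ttft_s) -> bool:
    """True if a block written now survives until the next turn's prefill."""
    return retention_seconds(capacity_bytes, write_bytes_per_s) >= think_s + ttft_s

# HYPOTHETICAL measurements: 4 GB/s offload writes, 30 s think time, 2 s TTFT.
write_rate, think_s, ttft_s = 4e9, 30.0, 2.0

for buffer_gib in (24, 160):
    tau = retention_seconds(buffer_gib * GIB, write_rate)
    verdict = "survives" if clock_ok(buffer_gib * GIB, write_rate, think_s, ttft_s) else "evicted first"
    print(f"{buffer_gib:>4} GiB buffer: retention {tau:5.1f} s vs gap {think_s + ttft_s:.0f} s -> {verdict}")
```

Because TTFT sits on the right-hand side, the check should be rerun with the TTFT measured under load, not the idle-server figure.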
Three things this inequality knows that the shelf math doesn't:
Cache value is a step function, not a ramp. Intuition says a tier holding 15% of the
footprint captures ~15% of the benefit. It captures roughly zero. Under uniform churn,
every block ages out before any block is reused, for every conversation at once. This
is a known, named phenomenon: LRU miss-ratio curves have a cliff at the footprint
size, and looping access patterns (which multi-turn agents are, structurally) make the
cliff as sharp as it gets (there's a research lineage on smoothing these cliffs, Talus
and Cliffhanger among them; our fix is the blunt one: get past the cliff). We learned it
the empirical way, on a different model with a fatter $\kappa$: a 24 GiB CPU tier against
a ~150 GB agentic footprint produced a 2% hit rate and a write/read ratio of 92:1 (7.4 TB pushed down over a 30-minute run, 80 GB ever read back). That is not a small cache;
it's a very busy /dev/null. Growing the tier past the footprint (to 160 GiB) took the
hit rate to 82% and request throughput up 2.6x. Nothing between those two points behaves
linearly, so "let's start with a smaller buffer and see" is not the safe incremental
experiment it sounds like.
TTFT is inside the reuse gap, so the window and the clock are coupled. Run below $U_{\min}$ and the queue stretches TTFT, which stretches $t_{\text{gap}}$, which demands more retention from a tier that's already missing, which means more recompute, which lengthens the queue further. The loop also runs in the good direction: size the tier past the footprint and hits collapse prefill demand, the queue drains, $t_{\text{gap}}$ shrinks, and hits get easier. This feedback is why fixing an undersized tier looks discontinuous on every chart and why a queueing-dominated run (the "no-window case" above) makes the retention clock harder to satisfy, not irrelevant.
The counters that catch it in five minutes. First, GPU->CPU vs CPU->GPU bytes: a healthy tier reads back the same order of magnitude it writes. Note that the native connector offloads eagerly (every computed block goes down immediately) so a busy write side proves nothing; the read-back side is the whole diagnosis. Second, the external prefix-cache hit rate against the workload's achievable rate: a multi-turn agent trace re-sends ~99% of every prompt, so a single-digit hit rate is retention failure, not a cold cache. Third, block idle-before-evict vs your measured $t_{\text{gap}}$ - that's the boxed inequality, read directly off the dashboards.
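Plugging the numbers from the undersized-tier run above (24 GiB tier, 7.4 TB written and 80 GB read back over 30 minutes) into the first counter makes the diagnosis concrete. The helper is a sketch with invented names; it only does arithmetic on counters you would export yourself.

```python
GIB = 2**30

def tier_health(written_bytes, read_bytes, duration_s, capacity_bytes):
    """Summarize an offload tier from its byte counters over a measurement window."""
    write_rate = written_bytes / duration_s
    return {
        "write_read_ratio": written_bytes / read_bytes,
        "write_rate_gb_s": write_rate / 1e9,
        "retention_s": capacity_bytes / write_rate,
    }

report = tier_health(written_bytes=7.4e12, read_bytes=80e9,
                     duration_s=30 * 60, capacity_bytes=24 * GIB)

print(f"write/read ratio: {report['write_read_ratio']:.0f}:1")
print(f"average write rate: {report['write_rate_gb_s']:.1f} GB/s")
print(f"implied retention: {report['retention_s']:.1f} s")
```

An implied retention of a few seconds against any human-scale think time is enough to explain a near-zero hit rate without looking at anything else.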
None of this is novel, by the way, and that's a feature: working sets and thrashing are Denning [1], the retention clock is the characteristic time from Che's approximation [2], the footprint and the miss-ratio cliff are textbook cache analysis. KV offload sizing isn't a new discipline, it's fifty years of caching theory wearing a GPU costume, and every claim in this post has a literature behind it that you can check.
#The gotchas that will actually bite you
The formulas are the easy part. These are the things that quietly make your numbers wrong:
1. $O$ is empirical, so measure it, don't assume it. vLLM profiles peak activation memory at startup, then reserves the rest for KV. On a multimodal hybrid like this one, $O$ also swallows the vision encoder (the DeltaNet state pool, remember, comes out of the KV budget instead). Read the reported KV cache size from the startup log; that's ground truth and folds in everything.
2. KV blocks are quantized. vLLM allocates KV in fixed blocks (block_size,
historically 16 tokens, varies by version and attention backend). $N_{\text{gpu}}$ is an upper
bound that rounds down to whole blocks.
3. Tensor parallelism can replicate KV heads, and this model is a landmine. If $T > H_{\text{kv}}$, TP can't cleanly shard the KV heads and replicates them, so per-GPU KV doesn't fall by the full $1/T$ and aggregate KV can balloon by up to $T / H_{\text{kv}}$. Qwen3.6-35B-A3B has just 2 KV heads, so TP-2 is the ceiling before replication kicks in, and long-context serving recipes routinely reach for TP-4 or TP-8 for the FLOPs and the headroom. Copy a TP-8 recipe and each KV head is replicated across 4 ranks: a real, easy-to-miss 4x inflation of your KV footprint relative to the clean formula. Check what the recipe you're pasting actually uses, and account for it, or your $N_{\text{gpu}}$ will be badly optimistic.
4. Quantized weights aren't just $P \times b_{\text{param}}$. The FP8 checkpoint lands near 35 GB
(single-H100 territory), NVFP4 lower still, but scales/zero-points and higher-precision
layers add overhead. Pad your $W$.
5. "Offloading" without a connector is actually preemption. With no offloading backend configured, running out of KV makes vLLM recompute or swap; it does not tier your cache.
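Gotchas 2 and 3 are both simple corrections to the clean capacity formula. The sketch below applies them in sequence; it assumes the TP degree is a multiple of the KV-head count whenever it exceeds it, and the 8M-token starting figure is a hypothetical formula result, not a measurement.

```python
def kv_replication_factor(tp: int, kv_heads: int) -> int:
    """Copies of each KV head across ranks once TP exceeds the KV-head count."""
    return max(1, tp // kv_heads)

def whole_block_tokens(tokens: int, block_size: int = 16) -> int:
    """Round a token capacity down to whole KV blocks."""
    return (tokens // block_size) * block_size

def corrected_capacity(formula_tokens: int, tp: int, kv_heads: int, block_size: int = 16) -> int:
    """Clean-formula capacity, deflated by head replication, then block-aligned."""
    return whole_block_tokens(formula_tokens // kv_replication_factor(tp, kv_heads), block_size)

for tp in (1, 2, 4, 8):
    print(f"TP-{tp}: each KV head stored {kv_replication_factor(tp, kv_heads=2)}x")

# HYPOTHETICAL: the clean formula says 8M tokens for a TP-8 replica with 2 KV heads.
print(f"corrected capacity: {corrected_capacity(8_000_000, tp=8, kv_heads=2):,} tokens")
```

The number the server actually reports at startup remains the ground truth; this only tells you which way, and roughly how far, the formula is expected to miss.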
#Why forcing offload costs you (and why you're measuring it)
The reason this whole exercise matters is that the tiers are not free, and the gap is enormous. On an H100, HBM3 moves data at roughly 3.35 TB/s. The PCIe Gen5 link to host memory gives you about 64 GB/s, on the order of 2% of HBM bandwidth. Shuffling a 50 GB KV cache is milliseconds off HBM and the better part of a second across PCIe. Force enough of your working set onto CPU or disk and prefill flips from compute-bound to memory-bound, and tail latency follows.
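The bandwidth gap is easy to sanity-check. The sketch below uses the two peak figures quoted above and ignores latency, protocol overhead and contention, so treat the results as lower bounds on transfer time rather than predictions.

```python
HBM3_BYTES_PER_S = 3.35e12   # H100 HBM3, peak figure from the text
PCIE5_BYTES_PER_S = 64e9     # PCIe Gen5 to host memory, peak figure from the text

def transfer_seconds(n_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Ideal time to move n_bytes at a given sustained bandwidth."""
    return n_bytes / bandwidth_bytes_per_s

kv_bytes = 50e9  # the 50 GB KV cache from the example

hbm_s = transfer_seconds(kv_bytes, HBM3_BYTES_PER_S)
pcie_s = transfer_seconds(kv_bytes, PCIE5_BYTES_PER_S)

print(f"HBM:  {hbm_s * 1e3:.0f} ms")
print(f"PCIe: {pcie_s:.2f} s")
print(f"PCIe is {PCIE5_BYTES_PER_S / HBM3_BYTES_PER_S:.1%} of HBM bandwidth")
```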
That degradation is the measurement. A well-sized offloading experiment doesn't make offload look good, it quantifies what it costs and where the crossover sits, so you can decide whether the capacity win is worth the bandwidth tax for your traffic. (If you have NVLink peers, offloading KV into a neighbor's HBM at ~900 GB/s is a very different tradeoff than PCIe-to-host, but that's another post.)
#TL;DR checklist
- Find which layers actually cache: for a hybrid, use $L_{\text{attn}}$ (standard-attention layers only), not total layers. Get $H_{\text{kv}}$, $d_{\text{head}}$, $b_{\text{kv}}$ from the config. Compute $\kappa$.
- Compute $W$ (weights = total params for MoE) and $N_{\text{gpu}}$.
- Separate $N_{\text{live}}$ from $N_{\text{corpus}}$. Don't add them.
- Compute the window: $U_{\min} = U(N_{\text{live}})$, $U_{\max} = U(N_{\text{corpus}})$, with $U(N) = (N \kappa + W + T \cdot O) / (T \cdot V)$. Pick $U$ inside it.
- To hit disk, size the corpus so $N_{\text{corpus}} > N_{\text{gpu}} + N_{\text{cpu}}$ (in tokens, convert your byte budgets).
- Then check the clock: $B_{\text{cpu}} / w_{\text{off}} \ge t_{\text{think}} + \text{TTFT}$. An undersized tier doesn't underperform, it flatlines. Watch the CPU→GPU read-back counter, not the write side.
- Watch $T$ vs $H_{\text{kv}}$: if TP exceeds your KV-head count, your real footprint is up to $T / H_{\text{kv}}$ times larger than the formula says.
- Start the server, read the reported # GPU blocks, and reconcile against $N_{\text{gpu}}$.
If they disagree by more than a few percent, your $O$,
block_size, or KV-head replication assumption is off.
- Confirm an offload connector is actually enabled (and on a hybrid, that
--mamba-cache-mode=align is set); otherwise you're benchmarking preemption (or a crash loop).
Set utilization from this, not from the default someone pasted into a tutorial. The whole chain is a handful of multiplications, and it's the difference between an experiment that measures your offloading system and one that measures nothing at all.
#References
[1] P. J. Denning, "The working set model for program behavior," Communications of the ACM, vol. 11, no. 5, pp. 323–333, May 1968. [Online]. Available: https://denninginstitute.com/pjd/PUBS/WSModel_1968.pdf
[2] H. Che, Y. Tung, and Z. Wang, "Hierarchical Web caching systems: Modeling, design and experimental results," IEEE Journal on Selected Areas in Communications, vol. 20, no. 7, pp. 1305–1314, Sep. 2002, doi: 10.1109/JSAC.2002.801752.