July 4, 2026

Two Equations for Forcing KV-Cache Offload

There's a specific failure mode I keep seeing in KV-cache offloading benchmarks and experiments: Someone stands up vLLM with the native offloading connector, sets --gpu-memory-utilization=0.9 because that's the default everyone copies, fires a load generator at it, and reports throughput numbers. The problem: their KV cache never left the GPU. The working set fit comfortably in HBM the whole time, the offload tier sat idle, and they benchmarked a system that was never under memory pressure. The inverse failure is just as common: set utilization too low, and instead of offloading you trigger queueing and preemption, with requests waiting for KV blocks or getting recomputed, and you measure congestion instead.

Both mistakes come from the same gap: people size these experiments against parameter count and a gut feeling, when the thing that actually governs KV residency is a short chain of architecture-dependent arithmetic. This post is that arithmetic, end to end, worked on Qwen/Qwen3.6-35B-A3B, a model that breaks every naive shortcut at once.

We'll build up to two short checks: The utilization window (does everything fit where you think it fits?) and the retention clock (does anything survive long enough to be reused?). The first one tells you the exact range of --gpu-memory-utilization in which your experiment does what you think it does. The second is the one everybody skips, and it's the difference between a cache and a very busy /dev/null.

There's a fair bit of notation ahead. It's all defined where it first appears, but if you ever lose track, everything lives here:

Nomenclature
| Symbol | Meaning |
| --- | --- |
| $L_{attn}$ | Standard-attention layers, the only ones with a growing KV cache |
| $H_{kv}$ | KV heads per attention layer |
| $d_{head}$ | Head dimension |
| $b_{kv}$ | Bytes per KV-cache element (BF16 = 2, FP8 = 1), a deployment choice |
| $c_{KV}$ | KV-cache bytes per token: $2\,L_{attn}\,H_{kv}\,d_{head}\,b_{kv}$ |
| $P$, $b_{param}$ | Total parameter count; bytes per weight parameter |
| $W$ | Total weight bytes, $P \cdot b_{param}$ (all experts, for MoE) |
| $M_{GPU}$ | Per-GPU VRAM in bytes |
| $U$ | --gpu-memory-utilization, fraction of $M_{GPU}$ the engine may touch |
| $M_{ovh}$ | Per-GPU non-weight, non-KV overhead (activations, CUDA graphs, vision encoder) |
| $N_{TP}$ | Tensor-parallel degree |
| $R$ | Replica count (deliberately absent from the per-replica math) |
| $M_{KV}$ | KV memory per replica: $N_{TP}(U M_{GPU} - M_{ovh}) - W$ |
| $T_{GPU}$ | GPU token capacity per replica: $M_{KV}/c_{KV}$ |
| $C$ | Offered concurrency, requests you intend to have running at once |
| $\text{ISL}$, $\text{OSL}$ | Input / output sequence length - prompt and generated tokens per request |
| $T_{live}$ | Live-set tokens: $C\,(\text{ISL}+\text{OSL})$ |
| $N_{sessions}$ | Sessions whose prefixes you want kept warm |
| $T_{corpus}$ | Reuse-corpus (footprint) tokens: $N_{sessions} \cdot \overline{(\text{ISL}+\text{OSL})}_{\text{retained}}$ |
| $U(T)$ | Utilization needed to hold $T$ tokens on GPU |
| $U_{min}$, $U_{max}$ | The window bounds: $U(T_{live})$ and $U(T_{corpus})$ |
| $B_{CPU}$, $T_{CPU}$ | CPU offload buffer in bytes; its token capacity $B_{CPU}/c_{KV}$ |
| $T_{disk}$ | Disk-tier token capacity |
| $B_{tier}$, $\dot{B}_{in}$ | Any tier's capacity; its incoming write rate (churn) |
| $\dot{B}_{GPU \to CPU}$ | Measured GPU→CPU offload write rate |
| $t_{retain}$ | Tier retention time (Che's characteristic time): $B_{tier} / \dot{B}_{in}$ |
| $t_{think}$ | Client think time between conversation turns |
| $\text{TTFT}$ | Time to first token - includes any queue wait before prefill |
| $t_{gap}$ | Reuse gap (inter-reference time) a cached block must survive: $t_{think} + \text{TTFT}$ |

#Step 0: Parameter count is a red herring (and here it's doubly so)

The first instinct is to reason from model size: "it's a 35B model, KV cache scales with the model, so...". Stop right there. Parameter count tells you almost nothing about KV cache size, and Qwen3.6-35B-A3B makes the point twice over.

First trap: which parameter count? It's a Mixture-of-Experts model (35B total parameters, ~3B active per token). The active count sets FLOPs, not memory: every expert still has to be resident in VRAM even though only a handful fire per token. So the number that drives weight memory is the 35B total, not the 3B active.

Second trap: which layers even have a KV cache? Qwen3.6-35B-A3B is a hybrid. Its 40 layers are arranged as 10 blocks of:

3 × (Gated DeltaNet  -> MoE)      # linear attention: constant-size recurrent state
1 × (Gated Attention -> MoE)      # standard attention: growing KV cache

Only the 10 Gated Attention layers grow a per-token KV cache. The 30 Gated DeltaNet layers are linear-attention layers: they carry a fixed-size recurrent state that does not grow with sequence length (vLLM manages it through its mamba/constant-state cache path, not the paged KV pool). Plug the full $L = 40$ into the standard formula and you overestimate the growing cache by 4x before you've done anything else.

So the numbers that actually matter for KV sizing are not "35B." They are:

  • $L_{attn} = 10$ - standard-attention layers (not 40)
  • $H_{kv} = 2$ - KV heads in those layers (grouped-query, and very grouped)
  • $d_{head} = 256$ - head dimension, the width of each key/value vector
  • $b_{kv}$ - KV-cache dtype in bytes, a deployment choice independent of the weights
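The layer count is easy to sanity-check mechanically. A minimal sketch, assuming the 10-block layout described above; the string labels are illustrative names made up for this example, not keys from the model's config file:

```python
# Illustrative layer labels (assumed names, not actual config keys).
BLOCK_PATTERN = ["gated_deltanet"] * 3 + ["gated_attention"]
NUM_BLOCKS = 10

layers = BLOCK_PATTERN * NUM_BLOCKS

# Only standard-attention layers grow a per-token KV cache.
l_attn = layers.count("gated_attention")
l_state = layers.count("gated_deltanet")

print(f"total layers:               {len(layers)}")
print(f"L_attn (growing KV cache):  {l_attn}")
print(f"constant-state layers:      {l_state}")
print(f"overestimate with all layers: {len(layers) / l_attn:.0f}x")
```

For a real deployment you would derive the same count from whatever layer-type information your model config exposes, rather than hard-coding the pattern.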

#Step 1: Bytes per token

Each token passing through a standard-attention layer deposits one key vector and one value vector. Summing over only those layers:

$$c_{KV} \;=\; \underbrace{2}_{K \text{ and } V} \cdot\, L_{attn} \cdot H_{kv} \cdot d_{head} \cdot b_{kv} \quad \text{[bytes/token]}$$

(We write $c_{KV}$, the per-token cost of KV, rather than a bare $k$, because in this post $K$ and $V$ are already spoken for.)

For Qwen3.6-35B-A3B with a BF16 KV cache:

$$c_{KV} = 2 \cdot 10 \cdot 2 \cdot 256 \cdot 2 = 20{,}480 \text{ bytes} = 20\ \text{KiB/token}$$

That is tiny. For comparison, Llama-3.1-8B (32 dense-attention layers, 8 KV heads, head_dim 128) sits at 128 KiB/token, about 6.4x more per token despite having less than a quarter of the total parameters. Two architectural facts do the work for the Qwen: only a quarter of the layers cache anything, and there are just two KV heads. Switch the KV cache to FP8 ($b_{kv}=1$) and you halve it again to 10 KiB/token.

The naive-formula sanity check: if you'd carelessly used all 40 layers, you'd have computed 80 KiB/token (four times reality!) and you'd have sized your entire experiment against a phantom.
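The per-token arithmetic fits in one function. A minimal sketch using only the architecture numbers quoted in this post; the function and variable names are my own:

```python
def kv_bytes_per_token(n_attn_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """c_KV = 2 (K and V) * L_attn * H_kv * d_head * b_kv."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * bytes_per_elem


qwen_bf16 = kv_bytes_per_token(10, 2, 256, 2)    # attention layers only
qwen_fp8 = kv_bytes_per_token(10, 2, 256, 1)     # FP8 KV cache
qwen_naive = kv_bytes_per_token(40, 2, 256, 2)   # the mistake: all 40 layers
llama_8b = kv_bytes_per_token(32, 8, 128, 2)     # dense comparison point

for name, value in [("Qwen BF16", qwen_bf16), ("Qwen FP8", qwen_fp8),
                    ("Qwen naive (L=40)", qwen_naive), ("Llama-3.1-8B", llama_8b)]:
    print(f"{name:>18}: {value:>7} bytes = {value / 1024:.0f} KiB/token")
```

Keeping the layer count as an explicit argument makes the hybrid trap visible at the call site: the only difference between the right answer and the phantom is the first parameter.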

#The recurrent-state asterisk*

The 30 DeltaNet layers aren't free, they just don't scale the way you expect. Each holds a fixed-size recurrent state per sequence, on the order of tens of MB per active request (verify against your build since the exact geometry depends on the DeltaNet head layout). A 4K-token request and a 256K-token request carry the same DeltaNet state, so on the live side this cost scales with concurrency, not context length. It's a different memory law living inside the same model, and it's why long-context serving on this architecture is so cheap on the growing-cache side.

Two small details before you put it away:

  • vLLM's hybrid allocator carves these state pages out of the KV pool itself, not out of the overhead term. They trim a little off $M_{KV}$, which is the KV memory budget we'll compute in Step 2, before your attention tokens ever see it.
  • The moment you enable prefix caching, the state stops being purely concurrency-scaled. Resuming a cached prefix requires a state checkpoint at the resume point, and vLLM retains those checkpoints per cached prefix (block-aligned, via --mamba-cache-mode=align, see the gotchas). Retain a big corpus, and you're retaining its checkpoints too. Track the state as a separate term either way; just don't call it "fixed" once caching is on, and don't fold it into $c_{KV}$.

If you're instead sizing an MLA model like DeepSeek, that's a different asterisk. MLA caches a single compressed latent per token, so the factor of 2 for K/V disappears and the width is the latent dim, not $H_{kv}\,d_{head}$. Everything else remains unchanged.


#Step 2: How much GPU actually holds KV

vLLM's gpu_memory_utilization ($U$) is a fraction of each GPU's total memory that the engine may touch. Out of that budget, weights and runtime overhead come first; KV gets the remainder. Per GPU, with tensor-parallel degree $N_{TP}$ (weights are sharded, so each GPU holds $W/N_{TP}$):

$$M_{KV}^{\text{per-GPU}} \;=\; U M_{GPU} - \frac{W}{N_{TP}} - M_{ovh}$$

where $M_{GPU}$ is per-GPU VRAM, $W = P \cdot b_{param}$ is total weight bytes (parameter count times bytes per parameter, and for MoE that's all 35B, every expert) and $M_{ovh}$ is per-GPU non-weight, non-KV overhead (activations, CUDA graphs, and for this model the vision encoder, which vLLM's multimodal profiling reserves memory for unless you pass --language-model-only). The DeltaNet state pool is not in $M_{ovh}$; the hybrid allocator takes it out of the KV budget, per the asterisk above. Aggregate over the replica's $N_{TP}$ GPUs and the $W/N_{TP}$ terms recombine:

$$M_{KV} = N_{TP}\bigl(U M_{GPU} - M_{ovh}\bigr) - W$$

Divide by $c_{KV}$ for GPU token capacity per replica:

$$T_{GPU} = \frac{N_{TP}(U M_{GPU} - M_{ovh}) - W}{c_{KV}}$$

Note what's not here: the replica count $R$. Replicas are independent processes with their own weights and KV pools. To size a fleet, divide total load by $R$ and size one replica.

A worked budget: Serve BF16 on 2xH100 (80 GiB, $N_{TP}=2$). BF16 weights for 35B total run about $W \approx 70$ GB (35 GB per GPU after sharding, which is exactly why this model wants $N_{TP}=2$ in BF16, or an FP8 checkpoint to fit on one card). Assume $M_{ovh} = 3$ GiB per GPU (bumped up from a dense model's ~2 GiB to cover the vision encoder; nvidia-smi also reports slightly under a true 80 GiB, gotcha #1 absorbs both):

$$M_{KV} = 2\,(0.9 \cdot 85.9\text{e9} - 3.22\text{e9}) - 70\text{e9} \approx 78.2\text{ GB}$$

$$T_{GPU} = 78.2\text{e9} / 20{,}480 \approx 3.8\text{M tokens}$$

3.8 million tokens of growing KV on-GPU at the default utilization. That enormous number is the whole story of this architecture: because $c_{KV}$ is so small, HBM holds a staggering amount of context, and forcing offload is going to take real effort.
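The same budget as code, for anyone who wants to swap in their own hardware. A minimal sketch that reuses the worked example's inputs (2 GPUs at 80 GiB, 3 GiB overhead per GPU, 70 GB of weights, 20,480 bytes/token); the overhead figure is the post's assumption, not a measured value, and the function names are my own:

```python
GIB = 1024 ** 3


def kv_memory_bytes(u, m_gpu, m_ovh, weight_bytes, n_tp):
    """M_KV = N_TP * (U * M_GPU - M_ovh) - W, aggregated over one replica."""
    return n_tp * (u * m_gpu - m_ovh) - weight_bytes


def gpu_token_capacity(u, m_gpu, m_ovh, weight_bytes, n_tp, c_kv):
    """T_GPU = M_KV / c_KV, before block rounding and state-pool carve-outs."""
    return kv_memory_bytes(u, m_gpu, m_ovh, weight_bytes, n_tp) / c_kv


hw = dict(m_gpu=80 * GIB, m_ovh=3 * GIB, weight_bytes=70e9, n_tp=2)

m_kv = kv_memory_bytes(u=0.9, **hw)
t_gpu = gpu_token_capacity(u=0.9, c_kv=20480, **hw)

print(f"M_KV  ~ {m_kv / 1e9:.1f} GB")
print(f"T_GPU ~ {t_gpu / 1e6:.2f}M tokens")
```

A negative result from kv_memory_bytes means the weights plus overhead do not fit at that utilization at all, which is worth checking before anything else.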


#Step 3: The two working sets (this is the part you were missing)

Here's the conceptual move almost every naive sizing skips: There isn't one KV demand; there are two, and they play completely different roles.

The live set is the KV for the requests you intend to have running at once. It is a hard floor on GPU residency, and vLLM enforces it in two ways (neither of which is offload). Requests that don't fit simply wait: the scheduler admits fewer than you offered, your effective concurrency quietly sits below the $C$ you think you're testing, and TTFT balloons. And requests that were admitted but grow past capacity during decode get preempted (recomputed or swapped). If this regime sounds familiar, it should: "the memory the active work needs exceeds the fast memory available" is Denning's classic thrashing condition, vintage 1968 [1]. GPUs didn't invent it, they just made it cost more per hour. Peak occupancy is when every concurrent request has grown to full length:

$$T_{live} = C \,(\text{ISL} + \text{OSL})$$

where $C$ is the concurrency you're offering (what you want max_num_seqs and the KV pool to actually sustain, and not the total number of requests you send), and $\text{ISL}$, $\text{OSL}$ are the input and output sequence lengths (prompt and generated tokens per request).

Tip: Watch the served concurrency in the engine logs; if it's a fraction of your load generator's setting, you're below the floor whether you meant to be or not.

The reuse corpus is the total set of unique prefix tokens you want kept warm across tiers: shared system prompts, multi-turn agent histories, RAG contexts. The cache literature calls this the footprint, which is the total unique data a workload touches. It's what prefix caching retains, and it's what you want to overflow onto CPU and disk:

$$T_{corpus} \approx N_{sessions} \cdot \overline{(\text{ISL}+\text{OSL})}_{\text{retained}}$$

where $N_{sessions}$ is the number of sessions whose prefixes you're keeping warm and the bar is the average retained length per session.

The active prefixes are a subset of the corpus; the corpus is the union. You never add these two. Size the GPU floor against $T_{live}$ and the spill against $T_{corpus}$. Conflating them is how people benchmark preemption while believing they're benchmarking offload.
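Both working sets are one multiplication each. A minimal sketch, using the workload this post works through in Step 4 (32 concurrent requests, 32768 tokens in, 2048 out, 200 retained sessions); the function names are my own:

```python
def live_tokens(concurrency: int, isl: int, osl: int) -> int:
    """T_live = C * (ISL + OSL): peak KV for requests running at once."""
    return concurrency * (isl + osl)


def corpus_tokens(n_sessions: int, mean_retained_len: int) -> int:
    """T_corpus ~ N_sessions * mean retained length per session."""
    return n_sessions * mean_retained_len


t_live = live_tokens(concurrency=32, isl=32768, osl=2048)
t_corpus = corpus_tokens(n_sessions=200, mean_retained_len=32768 + 2048)

# The two are sized separately and never summed: the live set is already
# contained in the corpus.
print(f"T_live   = {t_live:,} tokens  (GPU floor)")
print(f"T_corpus = {t_corpus:,} tokens  (what should spill)")
```

Keeping them as two separate values, rather than one "total demand" number, is the point: each one feeds a different bound in the next step.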


#Step 4: Invert for U - the utilization window

Set $T_{GPU}$ equal to a target token count $T$ and solve for the utilization that produces it:

$$U(T) = \frac{W + N_{TP} M_{ovh} + T\,c_{KV}}{N_{TP} M_{GPU}}$$

Evaluate at both working sets:

$$U_{min} = U(T_{live}) \qquad U_{max} = U(T_{corpus})$$

The experiment lives in the window between them, clamped to something practical like 0.95:

$$\boxed{\,U_{min} \le U < \min(U_{max},\, 0.95)\,}$$

Read it out loud, because it's the whole post:

  • Below $U_{min}$ the GPU can't even hold the live set. You get queueing and preemption: effective concurrency sags, TTFT balloons, and the run measures congestion at least as much as caching. Not automatically fatal (Step 6 explains when offload still shows through) but know you're there.
  • Inside the window the live set fits, the corpus doesn't, the surplus spills to CPU then disk. This is the regime you want.
  • At or above $U_{max}$ everything fits on GPU. Nothing offloads. Your tier is a no-op.

Back to the example. This model does long context (262K native), so give it a long-context agent workload: $C = 32$ concurrent, $\text{ISL} = 32768$, $\text{OSL} = 2048$:

$$T_{live} = 32 \cdot 34816 \approx 1.11\text{M tokens} \;\Rightarrow\; U_{min} \approx 0.58$$

That's well under the 3.8M-token GPU capacity, so the live set is safe. Now a corpus of 200 retained agent sessions at ~34.8K tokens each, $\approx 6.96\text{M tokens}$, about 143 GB of KV:

$$U_{max} = \frac{70\text{e9} + 2\cdot3.22\text{e9} + 6.96\text{e6}\cdot20480}{2\cdot85.9\text{e9}} \approx 1.28$$

$U_{max} > 1$ is the tell: the corpus can't fully fit at any utilization, so it spills no matter what. Combined with $U_{min}\approx0.58$, any $U \in [0.58, 0.95)$ holds the live set (no preemption) while forcing the corpus surplus off-GPU. Pick $U = 0.85$ ($T_{GPU}\approx3.4\text{M}$) and roughly 3.5M tokens are forced to CPU/disk by construction, not by luck.

Notice how the tiny $c_{KV}$ reshapes the problem: with 20 KiB/token you cannot pressure this model with a handful of concurrent requests, no matter how long. Offload here is driven by the reuse corpus, essentially never by the live set. On a dense model with a fat $c_{KV}$, it's often the reverse. The math tells you which lever to pull before you allocate the node.

The no-window case. If $T_{live}\,c_{KV}$ alone blows past the practical GPU budget, i.e. $U_{min} > 0.95$, there is no utilization at which the live set fully fits. Reduce $C$, shorten sequences, raise $N_{TP}$, or proceed with eyes open. A queueing-dominated run isn't automatically worthless: offload can still deliver dramatic TTFT and throughput gains there, because cached prefixes drain the queue faster. But the queue itself changes the offload tier's job in a way the shelf math can't see (Step 6), and you must not present the results as a clean steady-state benchmark.
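The window, including the no-window case, can be computed directly. A minimal sketch using the example's hardware and workload; the 0.95 cap is the practical clamp suggested above, the 128-request case is a made-up illustration of a live set that does not fit, and all function names are my own:

```python
GIB = 1024 ** 3


def utilization_for_tokens(tokens, c_kv, weight_bytes, m_ovh, m_gpu, n_tp):
    """U(T) = (W + N_TP * M_ovh + T * c_KV) / (N_TP * M_GPU)."""
    return (weight_bytes + n_tp * m_ovh + tokens * c_kv) / (n_tp * m_gpu)


def utilization_window(t_live, t_corpus, u_cap=0.95, **hw):
    """Return (U_min, U_max, window); window is None when no valid U exists."""
    u_min = utilization_for_tokens(t_live, **hw)
    u_max = utilization_for_tokens(t_corpus, **hw)
    upper = min(u_max, u_cap)
    window = (u_min, upper) if u_min < upper else None
    return u_min, u_max, window


hw = dict(c_kv=20480, weight_bytes=70e9, m_ovh=3 * GIB, m_gpu=80 * GIB, n_tp=2)

u_min, u_max, window = utilization_window(32 * 34816, 200 * 34816, **hw)
print(f"U_min={u_min:.2f}  U_max={u_max:.2f}  window={window}")

# Hypothetical heavier load: 128 concurrent requests of the same shape.
_, _, no_window = utilization_window(128 * 34816, 200 * 34816, **hw)
print(f"window at C=128: {no_window}")
```

Returning None for the no-window case, instead of a clamped range, forces the caller to notice that the run will be queueing-dominated.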


#Step 5: Forcing the disk tier specifically

Getting something to offload is easy once you're in the window. Getting the disk (or remote) tier to activate takes one more inequality. With GPU capacity $T_{GPU}$ and a configured CPU buffer of token-capacity $T_{CPU} = B_{CPU}/c_{KV}$, the cascade fills top-down:

$$\text{disk used} = \max\!\bigl(0,\; T_{corpus} - T_{GPU} - T_{CPU}\bigr)$$

So to guarantee disk sees traffic:

$$T_{corpus} > T_{GPU} + T_{CPU}$$

At $U=0.85$ ($T_{GPU}\approx3.4\text{M}$ tokens) against our 6.96M-token corpus, a CPU buffer covering up to ~2M tokens still leaves ~1.5M tokens with nowhere to go but disk. Because $c_{KV}$ is small, these token counts are huge: a CPU buffer you'd size in GB for a dense model may need to be sized in tens to hundreds of GB here to make a dent (a ~2M-token buffer is already about 41 GB). Plan the buffer in tokens, not gigabytes, and convert.

One trap: lower tiers may not use the same $c_{KV}$. LMCache and similar engines can store CPU/disk copies compressed or requantized, so compute $T_{CPU}$ and $T_{disk}$ with that tier's bytes-per-token. Assuming a uniform $c_{KV}$ across tiers will skew your fill math.
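The cascade and the per-tier conversion can be sketched together. This assumes the simple top-down fill described above; the 40 GiB buffer and the 10 KiB/token requantized tier are hypothetical values chosen for illustration, and the function names are my own:

```python
GIB = 1024 ** 3


def tier_token_capacity(tier_bytes: int, bytes_per_token: int) -> int:
    """Convert a byte budget to tokens using that tier's own bytes/token."""
    return int(tier_bytes // bytes_per_token)


def cascade_fill(t_corpus: int, t_gpu: int, t_cpu: int):
    """Top-down fill: GPU first, then CPU, remainder lands on disk."""
    on_gpu = min(t_corpus, t_gpu)
    on_cpu = min(max(0, t_corpus - t_gpu), t_cpu)
    on_disk = max(0, t_corpus - t_gpu - t_cpu)
    return on_gpu, on_cpu, on_disk


T_CORPUS, T_GPU = 6_960_000, 3_400_000

# Hypothetical 40 GiB CPU buffer storing BF16 KV at 20480 bytes/token.
t_cpu_bf16 = tier_token_capacity(40 * GIB, 20480)
fill_bf16 = cascade_fill(T_CORPUS, T_GPU, t_cpu_bf16)

# Same buffer, but assuming the tier stores requantized KV at 10240 bytes/token.
t_cpu_fp8 = tier_token_capacity(40 * GIB, 10240)
fill_fp8 = cascade_fill(T_CORPUS, T_GPU, t_cpu_fp8)

print(f"BF16 tier: T_CPU={t_cpu_bf16:,}  (gpu, cpu, disk)={fill_bf16}")
print(f"FP8 tier:  T_CPU={t_cpu_fp8:,}  (gpu, cpu, disk)={fill_fp8}")
```

In the second case the CPU tier absorbs the entire surplus and the disk tier sees nothing, which is exactly the kind of surprise the uniform-bytes-per-token assumption hides.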

A second trap, hybrid-specific: $T_{corpus} \cdot c_{KV}$ counts only the attention KV. Whether a spilled prefix is actually restorable is a separate question. The DeltaNet layers can only resume from a surviving state checkpoint, so the usable hit length is

$$\min\bigl(\text{attention KV available},\ \text{newest surviving state checkpoint}\bigr)$$

Tokens past the last checkpoint get their attention KV shipped back up and then get recomputed anyway, through all 40 layers, because the recurrent state at that position no longer exists. If your dashboards ever show external_kv_transfer and local_compute both high at the same time, this is what you're looking at: the tier is restoring data without proportionally skipping compute. On a pure-attention model the two counters trade off; on a hybrid they can both lose.
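A toy model of that accounting, to make the "both counters high" symptom concrete. This is a sketch of the idea, not vLLM's actual restore logic: it assumes checkpoints are identified by token position and takes the newest one that the restored attention KV still covers, which is a slightly stricter reading of the min above for the case where a checkpoint lies beyond the restored KV. The positions are made up:

```python
def usable_hit_tokens(attn_kv_tokens: int, checkpoint_positions: list[int]) -> int:
    """Longest prefix covered by both restored attention KV and a state checkpoint."""
    resumable = [p for p in checkpoint_positions if p <= attn_kv_tokens]
    return max(resumable, default=0)


def wasted_transfer_tokens(attn_kv_tokens: int, checkpoint_positions: list[int]) -> int:
    """Tokens whose attention KV was shipped back but must be recomputed anyway."""
    return attn_kv_tokens - usable_hit_tokens(attn_kv_tokens, checkpoint_positions)


# Hypothetical: 30,000 tokens of attention KV restored, checkpoints every 8192.
checkpoints = [8192, 16384, 24576]
usable = usable_hit_tokens(30_000, checkpoints)
wasted = wasted_transfer_tokens(30_000, checkpoints)
print(f"usable hit: {usable} tokens, transferred-then-recomputed: {wasted} tokens")

# No surviving checkpoint: everything transferred is recomputed.
print(usable_hit_tokens(30_000, []))
```

The wasted figure is the quantity to watch: it is transfer volume that buys no compute savings.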


#Step 6: Check the clock, not just the shelf

Everything so far is shelf math: does the corpus fit, and where does the overflow land. There is a second requirement, and it's the one that quietly zeroes out more offloading benchmarks than any sizing mistake: a cached block only pays off if it survives until it's reused. Every tier is an LRU cache under continuous write pressure, and an LRU's retention time is nothing more than capacity over churn:

$$t_{retain} \;=\; \frac{B_{tier}}{\dot{B}_{in}}$$

The CDN world has a name for this quantity: the characteristic time of an LRU cache, the load-bearing idea in Che's approximation [2]. We're just computing it for KV blocks instead of web objects.

A block written during one turn of a conversation is needed again when the next turn's prefill is scheduled, not when the client sends it. So the reuse gap it must survive (the inter-reference time, in cache-speak) is:

$$t_{gap} \;\approx\; t_{think} + \text{TTFT}$$

client think time plus however long the next turn sits in the queue. Which gives us the second boxed equation of the post: The retention clock.

$$\boxed{\;\frac{B_{CPU}}{\dot{B}_{GPU \to CPU}} \;>\; t_{think} + \text{TTFT}\;}$$

with $B_{CPU}$ your offload buffer in bytes and $\dot{B}_{GPU \to CPU}$ the offload write rate you can read straight off the connector's transfer metrics.
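The clock check is a single division and a comparison. A minimal sketch with entirely hypothetical inputs (a 4 GB/s measured write rate, 30 s think time, 2 s TTFT, and two candidate buffer sizes); in practice the write rate and TTFT both move when you resize the tier, so treat this as a first-pass screen rather than a prediction. The function names are my own:

```python
GIB = 1024 ** 3


def retention_time_s(tier_bytes: float, write_rate_bytes_per_s: float) -> float:
    """t_retain = B_tier / B_dot_in (characteristic time of the LRU tier)."""
    return tier_bytes / write_rate_bytes_per_s


def clock_holds(tier_bytes, write_rate_bytes_per_s, think_s, ttft_s) -> bool:
    """True when a block survives long enough to be reused."""
    return retention_time_s(tier_bytes, write_rate_bytes_per_s) > think_s + ttft_s


def min_tier_bytes(write_rate_bytes_per_s, think_s, ttft_s) -> float:
    """Smallest buffer that satisfies the inequality at the given churn."""
    return write_rate_bytes_per_s * (think_s + ttft_s)


WRITE_RATE = 4e9   # hypothetical measured GPU->CPU bytes/s
THINK_S, TTFT_S = 30.0, 2.0

for size_gib in (32, 256):
    t = retention_time_s(size_gib * GIB, WRITE_RATE)
    ok = clock_holds(size_gib * GIB, WRITE_RATE, THINK_S, TTFT_S)
    print(f"{size_gib:>4} GiB tier: retains {t:5.1f} s, clock holds: {ok}")

print(f"minimum buffer: {min_tier_bytes(WRITE_RATE, THINK_S, TTFT_S) / 1e9:.0f} GB")
```

Inverting the inequality for the minimum buffer is usually the more useful direction: measure the churn once, then size the tier from it.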

Three things this inequality knows that the shelf math doesn't:

Cache value is a step function, not a ramp. Intuition says a tier holding 15% of the footprint captures ~15% of the benefit. It captures roughly zero. Under uniform churn, every block ages out before any block is reused, for every conversation at once. This is a known, named phenomenon: LRU miss-ratio curves have a cliff at the footprint size, and looping access patterns (which multi-turn agents are, structurally) make the cliff as sharp as it gets (there's a research lineage on smoothing these cliffs, Talus and Cliffhanger among them; our fix is the blunt one: get past the cliff). We learned it the empirical way, on a different model with a fatter $c_{KV}$: a 24 GiB CPU tier against a ~150 GB agentic footprint produced a 2% hit rate and a write/read ratio of 92:1 (7.4 TB pushed down over a 30-minute run, 80 GB ever read back). That is not a small cache; it's a very busy /dev/null. Growing the tier past the footprint (to 160 GiB) took the hit rate to 82% and request throughput up 2.6x. Nothing between those two points behaves linearly, so "let's start with a smaller buffer and see" is not the safe incremental experiment it sounds like.

TTFT is inside the reuse gap, so the window and the clock are coupled. Run below $U_{min}$ and the queue stretches TTFT, which stretches $t_{gap}$, which demands more retention from a tier that's already missing, which means more recompute, which lengthens the queue further. The loop also runs in the good direction: size the tier past the footprint and hits collapse prefill demand, the queue drains, $t_{gap}$ shrinks, and hits get easier. This feedback is why fixing an undersized tier looks discontinuous on every chart and why a queueing-dominated run (the "no-window case" above) makes the retention clock harder to satisfy, not irrelevant.

The counters that catch it in five minutes. First, GPU->CPU vs CPU->GPU bytes: a healthy tier reads back the same order of magnitude it writes. Note that the native connector offloads eagerly (every computed block goes down immediately) so a busy write side proves nothing; the read-back side is the whole diagnosis. Second, the external prefix-cache hit rate against the workload's achievable rate: a multi-turn agent trace re-sends ~99% of every prompt, so a single-digit hit rate is retention failure, not a cold cache. Third, block idle-before-evict vs your measured $t_{gap}$ - that's the boxed inequality, read directly off the dashboards.
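The first two counters reduce to a pair of ratio checks. A minimal sketch: the thresholds (a 10:1 write/read ratio for "same order of magnitude", half of the achievable hit rate) are my own assumptions, the "healthy" byte counts are invented, and the failing example reuses the 7.4 TB / 80 GB / 2% figures quoted above:

```python
def write_read_ratio(bytes_written: float, bytes_read: float) -> float:
    """GPU->CPU bytes divided by CPU->GPU bytes; inf if nothing was read back."""
    return float("inf") if bytes_read == 0 else bytes_written / bytes_read


def diagnose_tier(bytes_written, bytes_read, hit_rate, achievable_hit_rate,
                  max_ratio=10.0, min_hit_fraction=0.5):
    """Return a list of findings; an empty list means neither check fired."""
    findings = []
    if write_read_ratio(bytes_written, bytes_read) > max_ratio:
        findings.append("read-back is far below write volume")
    if hit_rate < min_hit_fraction * achievable_hit_rate:
        findings.append("hit rate is far below the workload's achievable rate")
    return findings


# The undersized-tier run described in the post.
bad = diagnose_tier(7.4e12, 80e9, hit_rate=0.02, achievable_hit_rate=0.99)
print(f"ratio={write_read_ratio(7.4e12, 80e9):.1f}:1 findings={bad}")

# Hypothetical healthy run: reads within the same order of magnitude as writes.
good = diagnose_tier(1.0e12, 0.5e12, hit_rate=0.82, achievable_hit_rate=0.99)
print(f"findings={good}")
```

The third counter, idle-before-evict against the measured reuse gap, needs per-block timing from your own metrics pipeline and is not modeled here.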

None of this is novel, by the way, and that's a feature: working sets and thrashing are Denning [1], the retention clock is the characteristic time from Che's approximation [2], the footprint and the miss-ratio cliff are textbook cache analysis. KV offload sizing isn't a new discipline, it's fifty years of caching theory wearing a GPU costume, and every claim in this post has a literature behind it that you can check.


#The gotchas that will actually bite you

The formulas are the easy part. These are the things that quietly make your numbers wrong:

1. $M_{ovh}$ is empirical, so measure it, don't assume it. vLLM profiles peak activation memory at startup, then reserves the rest for KV. On a multimodal hybrid like this one, $M_{ovh}$ also swallows the vision encoder (the DeltaNet state pool, remember, comes out of the KV budget instead). Read the reported KV cache size from the startup log; that's ground truth and folds in everything.

2. KV blocks are quantized. vLLM allocates KV in fixed blocks (block_size, historically 16 tokens, varies by version and attention backend). $T_{GPU}$ is an upper bound that rounds down to whole blocks.

3. Tensor parallelism can replicate KV heads, and this model is a landmine. If $N_{TP} > H_{kv}$, TP can't cleanly shard the KV heads and replicates them, so per-GPU KV doesn't fall by the full $1/N_{TP}$ and aggregate KV can balloon by up to $N_{TP}/H_{kv}$. Qwen3.6-35B-A3B has just 2 KV heads, so TP-2 is the ceiling before replication kicks in, and long-context serving recipes routinely reach for TP-4 or TP-8 for the FLOPs and the headroom. Copy a TP-8 recipe and each KV head is replicated across 4 ranks: a real, easy-to-miss 4x inflation of your KV footprint relative to the clean formula. Check what the recipe you're pasting actually uses, and account for it, or your $T_{GPU}$ will be badly optimistic.

4. Quantized weights aren't just $P \cdot b_{param}$. The FP8 checkpoint lands near 35 GB (single-H100 territory), NVFP4 lower still, but scales/zero-points and higher-precision layers add overhead. Pad your $W$.

5. "Offloading" without a connector is actually preemption. With no offloading backend configured, running out of KV makes vLLM recompute or swap, it does not tier your cache.
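Gotchas 2 and 3 are both small corrections you can apply to the clean formula. A minimal sketch, assuming a block size of 16 tokens and treating the replication factor as the upper bound $N_{TP}/H_{kv}$ stated above; the function names are my own:

```python
def whole_block_tokens(token_capacity: float, block_size: int = 16):
    """Round a token capacity down to whole KV blocks: (blocks, tokens)."""
    blocks = int(token_capacity // block_size)
    return blocks, blocks * block_size


def kv_replication_factor(n_tp: int, n_kv_heads: int) -> float:
    """Upper bound on aggregate KV inflation when TP exceeds the KV-head count."""
    return max(1.0, n_tp / n_kv_heads)


def effective_c_kv(c_kv: int, n_tp: int, n_kv_heads: int) -> float:
    """Aggregate bytes/token across the replica once replication is counted."""
    return c_kv * kv_replication_factor(n_tp, n_kv_heads)


blocks, tokens = whole_block_tokens(3_817_205, block_size=16)
print(f"{blocks:,} blocks -> {tokens:,} usable tokens")

for n_tp in (1, 2, 4, 8):
    factor = kv_replication_factor(n_tp, n_kv_heads=2)
    print(f"TP-{n_tp}: replication {factor:.0f}x, "
          f"effective c_KV {effective_c_kv(20480, n_tp, 2) / 1024:.0f} KiB/token")
```

Feeding the effective bytes-per-token into the capacity formula, instead of the clean value, is the simplest way to keep a pasted TP-8 recipe from quietly shrinking the token capacity you computed.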


#Why forcing offload costs you (and why you're measuring it)

The reason this whole exercise matters is that the tiers are not free, and the gap is enormous. On an H100, HBM3 moves data at roughly 3.35 TB/s. The PCIe Gen5 link to host memory gives you about 64 GB/s, on the order of 2% of HBM bandwidth. Shuffling a 50 GB KV cache is milliseconds off HBM and the better part of a second across PCIe. Force enough of your working set onto CPU or disk and prefill flips from compute-bound to memory-bound, and tail latency follows.

That degradation is the measurement. A well-sized offloading experiment doesn't make offload look good, it quantifies what it costs and where the crossover sits, so you can decide whether the capacity win is worth the bandwidth tax for your traffic. (If you have NVLink peers, offloading KV into a neighbor's HBM at ~900 GB/s is a very different tradeoff than PCIe-to-host, but that's another post.)
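The bandwidth gap is worth seeing as wall-clock time. A minimal sketch using the nominal figures quoted above (3.35 TB/s HBM3, 64 GB/s PCIe Gen5, ~900 GB/s NVLink); real transfers achieve less than nominal, so read these as lower bounds on transfer time:

```python
HBM3_BPS = 3.35e12      # nominal H100 HBM3 bandwidth, bytes/s
PCIE_GEN5_BPS = 64e9    # nominal PCIe Gen5 host link, bytes/s
NVLINK_BPS = 900e9      # nominal NVLink peer bandwidth, bytes/s


def transfer_time_s(n_bytes: float, bandwidth_bps: float) -> float:
    """Ideal time to move n_bytes at a given bandwidth."""
    return n_bytes / bandwidth_bps


KV_BYTES = 50e9  # the 50 GB KV cache from the text

t_hbm = transfer_time_s(KV_BYTES, HBM3_BPS)
t_pcie = transfer_time_s(KV_BYTES, PCIE_GEN5_BPS)
t_nvlink = transfer_time_s(KV_BYTES, NVLINK_BPS)

print(f"HBM3:   {t_hbm * 1e3:7.1f} ms")
print(f"NVLink: {t_nvlink * 1e3:7.1f} ms")
print(f"PCIe:   {t_pcie * 1e3:7.1f} ms")
print(f"PCIe is {PCIE_GEN5_BPS / HBM3_BPS:.1%} of HBM bandwidth")
```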


#TL;DR checklist

  1. Find which layers actually cache: for a hybrid, use $L_{attn}$ (standard-attention layers only), not total layers. Get $H_{kv}$, $d_{head}$, $b_{kv}$ from the config. Compute $c_{KV} = 2\,L_{attn}\,H_{kv}\,d_{head}\,b_{kv}$.
  2. Compute $M_{KV} = N_{TP}(U M_{GPU} - M_{ovh}) - W$ (weights = total params for MoE) and $T_{GPU} = M_{KV}/c_{KV}$.
  3. Separate $T_{live} = C(\text{ISL}+\text{OSL})$ from $T_{corpus}$. Don't add them.
  4. Compute the window: $U_{min}=U(T_{live})$, $U_{max}=U(T_{corpus})$, with $U(T) = \frac{W + N_{TP}\,M_{ovh} + T\,c_{KV}}{N_{TP}\,M_{GPU}}$. Pick $U$ inside it.
  5. To hit disk, size the corpus so $T_{corpus} > T_{GPU} + T_{CPU}$ (in tokens, convert your byte budgets).
  6. Then check the clock: $B_{CPU} / \dot{B}_{GPU \to CPU} > t_{think} + \text{TTFT}$. An undersized tier doesn't underperform, it flatlines. Watch the CPU→GPU read-back counter, not the write side.
  7. Watch $N_{TP}$ vs $H_{kv}$: if TP exceeds your KV-head count, your real footprint is up to $N_{TP}/H_{kv}$ larger than the formula says.
  8. Start the server, read the reported # GPU blocks, and reconcile against $T_{GPU}$. If they disagree by more than a few percent, your $M_{ovh}$, block_size, or KV-head replication assumption is off.
  9. Confirm an offload connector is actually enabled (and on a hybrid, that --mamba-cache-mode=align is set) otherwise you're benchmarking preemption (or a crash loop).
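The reconciliation in item 8 can be sketched as a tolerance check. The reported block counts below are invented for illustration, the 5% tolerance is my reading of "a few percent", and on a hybrid model the reported block size may not match what you assume, so confirm it against the startup log:

```python
def reconcile_blocks(reported_gpu_blocks: int, block_size: int,
                     predicted_tokens: float, tolerance: float = 0.05):
    """Compare the engine's reported capacity with the predicted T_GPU.

    Returns (reported_tokens, relative_error, within_tolerance).
    """
    reported_tokens = reported_gpu_blocks * block_size
    rel_error = (reported_tokens - predicted_tokens) / predicted_tokens
    return reported_tokens, rel_error, abs(rel_error) <= tolerance


PREDICTED_T_GPU = 3_817_205  # from the worked budget at U = 0.9

# Hypothetical log readings, not real vLLM output.
for reported in (230_000, 60_000):
    tokens, err, ok = reconcile_blocks(reported, 16, PREDICTED_T_GPU)
    verdict = "consistent" if ok else "revisit M_ovh, block_size, or TP replication"
    print(f"{reported:>7,} blocks -> {tokens:>9,} tokens ({err:+.1%}): {verdict}")
```

A large negative error is the common direction: the engine found less room for KV than the formula predicted, which usually points at an underestimated overhead or uncounted head replication.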

Set utilization from this, not from the default someone pasted into a tutorial. The whole chain is a handful of multiplications, and it's the difference between an experiment that measures your offloading system and one that measures nothing at all.


#References

[1] P. J. Denning, "The working set model for program behavior," Communications of the ACM, vol. 11, no. 5, pp. 323–333, May 1968. [Online]. Available: https://denninginstitute.com/pjd/PUBS/WSModel_1968.pdf

[2] H. Che, Y. Tung, and Z. Wang, "Hierarchical Web caching systems: Modeling, design and experimental results," IEEE Journal on Selected Areas in Communications, vol. 20, no. 7, pp. 1305–1314, Sep. 2002, doi: 10.1109/JSAC.2002.801752.