Honest numbers.
Honest boundaries.
All figures are analysis-derived, not silicon-measured. Where a claim is not yet validated on silicon, we say so. The decode bandwidth wall is the physics baseline — everything else builds from it.
The decode bandwidth wall
In autoregressive decode, every token must re-read all active weights. tok/s = memory bandwidth / bytes per token. This is the hard physics floor — no architecture can escape it. Where MPU finds advantage is in workloads that reduce the bytes per token.
Llama 405B
Same wall as GPU
DeepSeek-V3
11x fewer bytes/token
DeepSeek-V3
MPU advantage zone
Three throughput levers
Single-stream bandwidth is a hard floor. Three things move the needle.
Batch Size
Amortize weight reads across concurrent sequences. More sequences sharing the same bandwidth = higher effective throughput.
Ingress Efficiency
Gen1 CSL deterministic ingress at ~45%. Gen2 enhanced CSL with prefetch, reorder, and compression targets ~90%.
Domain Count
Pure replication — N independent domains, N× throughput. No new system complexity per domain added.
The agent-era KV cache pressure
Agent workloads push per-sequence KV cache from ~10 MB to tens of GB. This is a different memory from the bounded ~16 KB/token SIF state — it accumulates with context, resides in local DRAM, and is re-read every decode step. Llama 405B reaches 504 KB/token.
Llama 405B
128K context: ~63 GB
1M context: ~2 seq/domain
DeepSeek-V3
128K context: ~8.6 GB
1M context: ~16 seq/domain
DeepSeek-V3 + MLA
128K context: ~4.7 GB
1M context: ~30 seq/domain
Where MPU competes — and where it doesn't
Small Dense 7B-13B — SRAM Mode
When weights + KV fit in on-chip SRAM (~14–26 GB, 1–2 chips), each PE reads its slice at SRAM bandwidth (10–100 TB/s). GPU is HBM-bound at single-request — MPU is compute-bound. 100–500 tok/s vs GPU's 10–50.
Dense 70B+ FP16 — SRAM Tiling
Weights + KV tiled across N chips' SRAM (70B→9 chips, 405B→51 chips). Each chip computes at SRAM bandwidth (10–100 TB/s) vs GPU HBM (3 TB/s) — 10–30x faster. 8ch LPDDR5X loads weights once into SRAM (~0.1s/chip), then compute at SRAM speed. Trade-off: chip count, not throughput.
MoE DeepSeek-V3 — Equal Cost
Core inference target. Structural routing + 11x fewer bytes/token. All-to-All overhead excluded.
Training — Gen2 Target
AdamW wall (~10.7 TB) is the blocker. Compiler-driven gradient accumulation + FP8/BF16 optimizer state under active study.
SYMATICS