Performance

Honest numbers.
Honest boundaries.

All figures are analysis-derived, not silicon-measured. Where a claim is not yet validated on silicon, we say so. The decode bandwidth wall is the physics baseline — everything else builds from it.

Decode bandwidth wall diagram

The decode bandwidth wall

In autoregressive decode, every token must re-read all active weights. tok/s = memory bandwidth / bytes per token. This is the hard physics floor — no architecture can escape it. Where MPU finds advantage is in workloads that reduce the bytes per token.

Llama 405B

Dense FP16
810 GB
/ token

Same wall as GPU

DeepSeek-V3

MoE FP8
74 GB
/ token

11x fewer bytes/token

DeepSeek-V3

MoE + MLA
~37 GB
/ token

MPU advantage zone

Three throughput levers

Single-stream bandwidth is a hard floor. Three things move the needle.

Batch Size

B

Amortize weight reads across concurrent sequences. More sequences sharing the same bandwidth = higher effective throughput.

Ingress Efficiency

45% → 90%

Gen1 CSL deterministic ingress at ~45%. Gen2 enhanced CSL with prefetch, reorder, and compression targets ~90%.

Domain Count

×N

Pure replication — N independent domains, N× throughput. No new system complexity per domain added.

The agent-era KV cache pressure

Agent workloads push per-sequence KV cache from ~10 MB to tens of GB. This is a different memory from the bounded ~16 KB/token SIF state — it accumulates with context, resides in local DRAM, and is re-read every decode step. Llama 405B reaches 504 KB/token.

Llama 405B

504 KB
/ token

128K context: ~63 GB

1M context: ~2 seq/domain

DeepSeek-V3

69 KB
/ token

128K context: ~8.6 GB

1M context: ~16 seq/domain

DeepSeek-V3 + MLA

~37 KB
/ token (est.)

128K context: ~4.7 GB

1M context: ~30 seq/domain

Honest framing: MPU does not make the agent memory wall disappear. The defensible claim: "while KV fits, MPU re-reads from high-bandwidth local DRAM, not spilling to slow storage" — capacity-gated. The KV-friendly regime (MoE+MLA) is exactly MPU's per-watt sweet spot.

Where MPU competes — and where it doesn't

Small Dense 7B-13B — SRAM Mode

10x+

When weights + KV fit in on-chip SRAM (~14–26 GB, 1–2 chips), each PE reads its slice at SRAM bandwidth (10–100 TB/s). GPU is HBM-bound at single-request — MPU is compute-bound. 100–500 tok/s vs GPU's 10–50.

Dense 70B+ FP16 — SRAM Tiling

10x+

Weights + KV tiled across N chips' SRAM (70B→9 chips, 405B→51 chips). Each chip computes at SRAM bandwidth (10–100 TB/s) vs GPU HBM (3 TB/s) — 10–30x faster. 8ch LPDDR5X loads weights once into SRAM (~0.1s/chip), then compute at SRAM speed. Trade-off: chip count, not throughput.

MoE DeepSeek-V3 — Equal Cost

0.54–0.90x

Core inference target. Structural routing + 11x fewer bytes/token. All-to-All overhead excluded.

Training — Gen2 Target

Gen2

AdamW wall (~10.7 TB) is the blocker. Compiler-driven gradient accumulation + FP8/BF16 optimizer state under active study.

Core claim: For ALL dense models, the strategy is SRAM tiling — put weights + KV in on-chip SRAM across N chips (7B→1 chip, 70B→9 chips, 405B→51 chips). Each PE computes at SRAM bandwidth (10–100 TB/s) vs GPU HBM (3 TB/s) — 10–30x faster. Single-request throughput is 10x+ vs GPU because SRAM has no memory contention. 8ch LPDDR5X (136 GB/s) loads weights into SRAM once per inference (~0.1s/chip), then compute runs at SRAM speed. For MoE models (DeepSeek-V3, ~74 GB active), LPDDR5X alone is sufficient. Trade-off: chip count for large models, not feasibility.

Target workload fit

Dense 7B-70B (SRAM tiling) Core opportunity Weights + KV tiled across N chip SRAM (7B→1, 70B→9 chips). 10x+ GPU tok/s single-request. SRAM bandwidth 10–30x HBM, no contention.
Dense 405B (SRAM tiling) Candidate 51–102 chips for SRAM tiling. Technically 10x+, but chip count makes GPU HBM more practical per-dollar.
MoE / sparse (DeepSeek-V3) Core opportunity Active weights ~74 GB, fits in LPDDR5X (136 GB/s 8ch). Structural routing. 11x fewer bytes/token.
Long context (128K+) Candidate KV in local high-bandwidth DRAM when it fits. SRAM holds hot KV, LPDDR5X streams cold.
Edge / cost-sensitive inference Secondary Potential TOPS/W if silicon delivers.
Training (Gen2) Gen2 target AdamW wall (~10.7 TB) is the blocker — compiler-driven gradient accumulation + FP8/BF16 optimizer state under active study.

Dive into the full technical specs