Accelerator Landscape

Where MPU sits in the accelerator spectrum.

Two axes determine MPU's positioning: the memory layer (HBM / SRAM / DDR) and where scheduling happens (runtime vs. compile-time). These define who MPU competes with — and who it doesn't.

Accelerator spectrum diagram — GPU, LPU, HC1, MPU positioning
GPU (NVIDIA)
Status: Production · dominant
Memory: HBM (3–8+ TB/s)
Scheduling: Runtime dynamic
+ General purpose, mature CUDA, high batch throughput
- Cost/power, collective overhead at scale

LPU (Groq)
Status: Production (inference cloud)
Memory: Large on-chip SRAM
Scheduling: Compile-time fixed
+ Ultra-low latency, deterministic single-stream
- SRAM capacity limited, expensive for large models

HC1 (Taalas)
Status: Silicon (8B hard-wired)
Memory: Mask-ROM hard-wired
Scheduling: Fixed-function
+ Very high throughput for single model
- Single model only, no generality

MPU (Symatics)
Status: Gen1 / pre-silicon
Memory: DDR7* / structural ingress
Scheduling: Compile-time structural
+ Sparse/MoE, long-context, multi-agent; best when weight reuse is high, not first load
- Pre-silicon, near-zero ecosystem
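
As a rough illustration of the two-axis positioning, the four cards above can be encoded as points on a memory axis and a scheduling axis. This is only a sketch: the enum buckets are a simplification introduced here (in particular, "compile-time fixed" and "compile-time structural" are both filed under one `COMPILE_TIME` bucket), and the names `Memory`, `Scheduling`, `Accelerator` and `shared_axes` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Memory(Enum):
    HBM = "HBM"
    SRAM = "on-chip SRAM"
    MASK_ROM = "mask-ROM"
    DDR = "DDR"


class Scheduling(Enum):
    RUNTIME = "runtime dynamic"
    COMPILE_TIME = "compile-time"      # assumption: merges "fixed" and "structural"
    FIXED_FUNCTION = "fixed-function"


@dataclass(frozen=True)
class Accelerator:
    name: str
    memory: Memory
    scheduling: Scheduling


# Positions read off the four cards; the bucketing is a simplification.
SPECTRUM = [
    Accelerator("GPU", Memory.HBM, Scheduling.RUNTIME),
    Accelerator("LPU", Memory.SRAM, Scheduling.COMPILE_TIME),
    Accelerator("HC1", Memory.MASK_ROM, Scheduling.FIXED_FUNCTION),
    Accelerator("MPU", Memory.DDR, Scheduling.COMPILE_TIME),
]


def shared_axes(a, b):
    """Return the names of the axes on which two accelerators coincide."""
    axes = []
    if a.memory is b.memory:
        axes.append("memory")
    if a.scheduling is b.scheduling:
        axes.append("scheduling")
    return axes


mpu = next(x for x in SPECTRUM if x.name == "MPU")
for other in SPECTRUM:
    if other is not mpu:
        print(f"{other.name}: shares {shared_axes(mpu, other) or 'nothing'} with MPU")
```

Under this bucketing, the LPU is the only entry that shares an axis with MPU (scheduling), and no entry shares its memory layer.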

MPU's nearest "living relative" is Groq LPU, not Google TPU

Groq has already shipped in production what MPU pursues: deterministic execution, compile-time scheduling, SRAM-resident state, ultra-low-latency single-stream. The real difference is the memory layer — MPU uses DDR instead of SRAM (Groq needs hundreds of chips to fit large models in SRAM).

*TPU = compiler schedules data into a fixed, HBM-fed array. MPU = the array's wiring is the schedule, DDR-fed.
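
The remark that an SRAM-only design needs hundreds of chips for large models can be sanity-checked with back-of-envelope arithmetic. The sketch below counts only weight bytes and ignores activations, KV state and any replication. The per-chip SRAM figure of roughly 230 MB is an assumption brought in for illustration, not a number from this page, and `chips_to_fit` is a hypothetical helper.

```python
def chips_to_fit(params, bytes_per_param, sram_bytes_per_chip):
    """Minimum chip count whose combined SRAM holds the weights (ceiling division)."""
    weight_bytes = params * bytes_per_param
    return -(-weight_bytes // sram_bytes_per_chip)


# Assumption: ~230 MB of on-chip SRAM per chip (illustrative, not from this page).
SRAM_PER_CHIP = 230 * 10**6

for label, params in [("8B", 8 * 10**9), ("70B", 70 * 10**9)]:
    chips = chips_to_fit(params, 1, SRAM_PER_CHIP)
    print(f"{label} model at 1 byte/param: at least {chips} chips")
```

With these assumptions, a 70B-parameter model at one byte per parameter already lands in the hundreds of chips, which is the capacity pressure that a DDR-backed memory layer is meant to avoid.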

Architecture-Class Comparison

Full specification matrix

| Dimension | GPU | LPU | TPU | HC1 | MPU |
|---|---|---|---|---|---|
| Core idea | Massive parallel general-purpose | Deterministic single-stream | Systolic matrix engine | Hard-wired single model | Structure-as-Computation |
| Compute unit | 1000s CUDA/Tensor cores | Large systolic array | Systolic MXU array | Fixed-function MAC | Homogeneous PE array |
| Scheduling | Runtime dynamic | Compile-time fixed | Compile-plan / run-execute | Fixed at manufacture | Compile-time structural |
| Memory | HBM (3–8 TB/s) | Large on-chip SRAM | HBM | Mask-ROM | DDR7 + structural ingress |
| Interconnect | NVLink + collectives | Chip-to-chip mesh | ICI / optical | Single chip | Extend + domain |
| Strength | General, CUDA, batch | Ultra-low latency | Google throughput/$ | Very high (one model) | Sparse/MoE, long-context |
| Programmability | Highest (full CUDA) | Medium | Medium (XLA/JAX) | None | Lowest (compile to structure) |
| Maturity | Production | Production | Production (Google) | Silicon (8B) | Pre-silicon / Gen1 |

Not a general-purpose switch

The MPU Switch is the physical carrier of SIF inside one domain: a single hub performing per-row broadcast. Its low aggregate bandwidth (512 GB/s, about 4.1 Tbps, vs. 12.8 Tbps for NVSwitch 3.0) is a design conclusion, not under-provisioning: weights don't cross SIF, and the state that does is bounded.

CXL Switch (Multi-Tbps): Address/ID routing, credits, QoS, in-network compute

NVSwitch 3.0 (12.8 Tbps): NVLink routing, SHARP reduction, credit flow

MPU Switch (512 GB/s): No routing, no flow control, no in-network compute
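
The switch figures above mix bytes per second and bits per second, so a quick conversion helps when comparing them. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 Tb = 10^12 bits) and assumes both figures are measured the same way (for example, both aggregate and both per direction), which this page does not state. The helper names are hypothetical.

```python
def gbytes_per_s_to_tbps(gb_per_s):
    """Convert gigabytes per second to terabits per second (decimal units)."""
    return gb_per_s * 8 / 1000


def tbps_to_gbytes_per_s(tbps):
    """Convert terabits per second to gigabytes per second (decimal units)."""
    return tbps * 1000 / 8


mpu_switch_tbps = gbytes_per_s_to_tbps(512)   # MPU Switch figure, in Tbps
nvswitch_gbs = tbps_to_gbytes_per_s(12.8)     # NVSwitch 3.0 figure, in GB/s
ratio = 12.8 / mpu_switch_tbps                # NVSwitch figure relative to MPU Switch

print(f"MPU Switch:   {mpu_switch_tbps:.3f} Tbps")
print(f"NVSwitch 3.0: {nvswitch_gbs:.0f} GB/s")
print(f"Ratio:        {ratio:.2f}x")
```

In common units the gap is roughly a factor of three, so the stronger contrast is in function: the MPU Switch does no routing, flow control or in-network compute.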

The impossible triangle

Structure-as-computation × Generality × No-HBM = pick two. Product value lies in how we choose the middle ground.

KEEP: Structure-as-Computation
CONSTRAIN: Generality
KEEP: No HBM

MPU picks structural computing + no-HBM, accepts generality constraints (compile-time fixed structure), and wins on MoE/sparse workloads where the per-token byte advantage is largest.
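
One way to read the "per-token byte advantage" is as the weight bytes that must be touched to produce one token. That reading is an assumption made here, not a definition from this page. In a mixture-of-experts model only the routed experts are touched, so the figure scales with active rather than total parameters. The sketch below uses a purely hypothetical model shape (2B shared parameters, 64 experts of 1B each, top-2 routing) to show the size of the effect.

```python
def moe_active_params(shared_params, params_per_expert, experts_per_token):
    """Parameters touched per token: shared layers plus the routed experts."""
    return shared_params + params_per_expert * experts_per_token


def weight_bytes_per_token(active_params, bytes_per_param):
    """Weight bytes read per generated token under the stated assumption."""
    return active_params * bytes_per_param


# Hypothetical model shape, chosen only for illustration.
SHARED = 2 * 10**9
PER_EXPERT = 1 * 10**9
NUM_EXPERTS = 64
TOP_K = 2
BYTES_PER_PARAM = 1

total_params = SHARED + PER_EXPERT * NUM_EXPERTS
active_params = moe_active_params(SHARED, PER_EXPERT, TOP_K)

dense_bytes = weight_bytes_per_token(total_params, BYTES_PER_PARAM)
moe_bytes = weight_bytes_per_token(active_params, BYTES_PER_PARAM)

print(f"Dense model of equal size: {dense_bytes / 1e9:.1f} GB per token")
print(f"MoE, top-{TOP_K} routing:        {moe_bytes / 1e9:.1f} GB per token")
print(f"Reduction:                 {dense_bytes / moe_bytes:.1f}x")
```

The sparser the routing, the fewer bytes each token pulls from memory, which is why a lower-bandwidth memory layer hurts such workloads less than dense ones.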
