Partner Portal
NDA-protected technical specifications, detailed architecture parameters, and validation documentation for ecosystem partners.
Don't have access? Contact info@symaticslab.com
Technical Specifications
Physical Design Parameters
| Parameter | Value | Notes |
|---|---|---|
| Process Node | TSMC N7 | ~76 MTr/mm², logic + SRAM mix |
| Die Dimensions | ~20.5 × 21.0 mm | ~83 mm perimeter, ~430 mm² area, within reticle limit |
| Clock Frequency | 1 GHz | 0.5 mm wire length per cycle |
| CU Array | 256 cols × 128 rows | Central ~85% of die area |
| CU Microarchitecture | Cache(12KB I-SRAM)+R+PE+W | Full 16-bit FP16 data path; ~2–3K instructions/CU |
| On-Chip SRAM | 384 MB total | 12 KB I-SRAM × 32,768 CUs; ~0.89 MB/mm² |
| Operator Latency | MAC/×/+/= 1 cycle | Divide = 4 cycles; exp = 50 cycles |
| Pins | ~4,000 | ~1,000 signal + ~3,000 P/G (3:1 ratio) |
| External Memory | 4× GDDR7 | ~192 GB/s/chip raw bandwidth |
| TDP | ~250–350 W | ~60–80 W/cm², standard cold plate |
| SerDes Lanes | 256 per chip | 16 Gbps/lane NRZ (= 16-bit FP16 @ 1 GHz) |
| V-EXT Port | 16 lanes | 32 GB/s for inter-domain rail links |
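Several entries in this table are derived from others and can be cross-checked with simple arithmetic. The sketch below recomputes them from the listed base values. The variable names are illustrative only, and treating 1 MB as 1024 KB is an assumption about the table's units.

```python
# Cross-check of derived figures in the physical design table.
# Names are illustrative; base values are copied from the table.
COLS, ROWS = 256, 128
I_SRAM_KB_PER_CU = 12
DIE_W_MM, DIE_H_MM = 20.5, 21.0
LANE_GBPS = 16          # per SerDes lane, NRZ
VEXT_LANES = 16
FP16_BITS = 16

cu_count = COLS * ROWS                            # CUs per chip
sram_mb = cu_count * I_SRAM_KB_PER_CU / 1024      # assumes 1 MB = 1024 KB
die_area_mm2 = DIE_W_MM * DIE_H_MM
sram_density = sram_mb / die_area_mm2             # MB per mm^2
vext_gbytes_per_s = VEXT_LANES * LANE_GBPS / 8    # Gbit/s -> GB/s
fp16_per_lane_per_ns = LANE_GBPS / FP16_BITS      # one value per 1 GHz cycle

print(f"CUs/chip:        {cu_count}")
print(f"On-chip SRAM:    {sram_mb:.0f} MB")
print(f"Die area:        {die_area_mm2:.1f} mm^2")
print(f"SRAM density:    {sram_density:.2f} MB/mm^2")
print(f"V-EXT bandwidth: {vext_gbytes_per_s:.0f} GB/s")
print(f"FP16 per lane per cycle: {fp16_per_lane_per_ns:.0f}")
```

Under these assumptions the results agree with the table: 32,768 CUs, 384 MB, ~0.89 MB/mm², and 32 GB/s on the V-EXT port.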
Four-Phase Memory Model
Weights and state ingress from GDDR7 via the CSL into the column structure.
Computation on the PE array. Results travel by wire, not through memory.
Structured readout of computed results.
Phase-swap via dual-bank atomic switch for multi-layer models.
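The dual-bank phase swap can be pictured with a toy software model. This is a minimal sketch, not the hardware mechanism: the class `DualBank` and its method names are hypothetical, and the "atomic switch" is modeled as a single index flip so that a reader sees either the old bank or the new one, never a mix.

```python
class DualBank:
    """Toy model of a two-bank buffer with a phase swap (hypothetical names)."""

    def __init__(self):
        self.banks = [None, None]
        self.active = 0  # index of the bank the compute phase reads from

    def load_shadow(self, contents):
        # Ingress fills the inactive bank while the active bank stays in use.
        self.banks[1 - self.active] = contents

    def swap(self):
        # One index flip: the switch between banks has no intermediate state.
        self.active = 1 - self.active

    def read_active(self):
        return self.banks[self.active]


bank = DualBank()
seen = []
for layer in ["layer0", "layer1", "layer2"]:
    bank.load_shadow(layer)   # ingress into the inactive bank
    bank.swap()               # phase swap
    seen.append(bank.read_active())

print(seen)  # ['layer0', 'layer1', 'layer2']
```

In a multi-layer model, the next layer's weights would be loaded into the shadow bank while the current layer computes, so the swap itself costs no reload time in this picture.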
Domain Composition
| Component | Qty | Specification |
|---|---|---|
| MPU Chip (+ on-die SCC) | 64 | N7 ~430 mm²; ~250–350 W; 32,768 CUs each |
| CSL (Column State Loader) | 64 | One per chip; Gen1 Memory Ingress Block; eta ~45% |
| GDDR7 Devices | 256 (4/chip) | ~192 GB/s/chip; ~12.3 TB/s aggregate raw |
| SIF Switch Hub | 1 | 64-port; 256 lanes; ~390 mm²/N7; 1-hop < 200 ns |
| Host Node | 1 | Dispatch / serving / image load; no data-plane scheduling |
| BMC / Management | 1 set | I2C bus + telemetry aggregation + secure boot chain |
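The domain-level totals follow directly from the per-chip figures. The sketch below recomputes them. Applying eta as a plain multiplier on raw bandwidth is an assumption about how ingress efficiency is defined, and the variable names are illustrative.

```python
# Domain aggregates from per-chip values in the composition table.
CHIPS = 64
CUS_PER_CHIP = 32_768
SRAM_MB_PER_CHIP = 384
GDDR7_PER_CHIP = 4
RAW_GB_S_PER_CHIP = 192   # raw GDDR7 bandwidth per chip
ETA_GEN1 = 0.45           # assumed to scale raw bandwidth directly

total_cus = CHIPS * CUS_PER_CHIP
total_sram_gb = CHIPS * SRAM_MB_PER_CHIP / 1024
gddr7_devices = CHIPS * GDDR7_PER_CHIP
raw_tb_s = CHIPS * RAW_GB_S_PER_CHIP / 1000
effective_tb_s = raw_tb_s * ETA_GEN1

print(f"CUs per domain:      {total_cus:,}")
print(f"On-chip SRAM:        {total_sram_gb:.0f} GB")
print(f"GDDR7 devices:       {gddr7_devices}")
print(f"Raw bandwidth:       {raw_tb_s:.1f} TB/s")
print(f"Effective at eta:    {effective_tb_s:.1f} TB/s")
```

Under that reading, the ~12.3 TB/s raw aggregate corresponds to roughly 5.5 TB/s of usable ingress at the Gen1 efficiency.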
Single-Hub Dividend
The entire domain's switch silicon converges to one 64-port hub die (256-lane class, ~390 mm² on N7): one tape-out and no Spine/Root split. Within the domain, a single hop of under 200 ns replaces the prior 6-level tree.
SIF Hub: Latency Budget (Mode 2 Intra-Domain)
| Stage | Latency | Notes |
|---|---|---|
| Source MPU SerDes TX | ~30–50 ns | PCS/alignment; PHY-dependent [TBD] |
| Hub: 64:1 select + 1:64 broadcast | ~40–60 ns | 3-stage on-die pipeline, retiming REG |
| Dest MPU SerDes RX + enqueue | ~30–50 ns | To row receive bus |
| Total | < 200 ns | Fixed pipe depth, zero slot jitter |
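As a quick check on the budget, the three stage ranges can be summed and compared with the 200 ns target. The dictionary keys below are illustrative labels. What the remaining margin is meant to cover is not stated in the table.

```python
# Sum the per-stage latency ranges (ns) and compare with the 200 ns target.
TARGET_NS = 200
stages = {
    "source_tx": (30, 50),
    "hub_select_broadcast": (40, 60),
    "dest_rx_enqueue": (30, 50),
}

best_ns = sum(lo for lo, _ in stages.values())
worst_ns = sum(hi for _, hi in stages.values())
margin_ns = TARGET_NS - worst_ns

print(f"stage total: {best_ns}-{worst_ns} ns, margin at worst case: {margin_ns} ns")
```

The listed stages total 100–160 ns, leaving at least 40 ns below the target even at the upper end of each range.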
Hub Architecture
Communication Patterns
SIF Control & Schedule Model
The hub has no run-time routing. "Routing" is a compiled schedule — the compiler fixes (source, receiver, row, phase, target window) ahead of time. The device only selects the scheduled source and broadcasts.
Slot-Table Entry
All "routing" lives here as static configuration. The row/group is implicit by lane.
Determinism Rules
- Throughput: ≤ 1 FP16 value per stream per cycle; latency = pure pipe depth
- No backpressure: rate mismatch is a config error, excluded at compile time
- Single-source: per group, per slot, exactly one source; device checks, does not arbitrate
- Source order preserved: A/B dual-register relay, depth 2
- Destination-free: initiating port carries no routing type and no destination, only data + valid
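The single-source rule lends itself to a compile-time check over the slot table. The following is a minimal sketch under stated assumptions: the `SlotEntry` fields and the function name are hypothetical, since the actual slot-table layout is not given here. Whether an idle slot (no source) is legal is also not stated, so the sketch flags only conflicting sources.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotEntry:
    slot: int     # schedule slot index (hypothetical field)
    group: int    # row/group; implicit by lane in the real table
    source: int   # MPU port selected as broadcaster for this slot


def find_source_conflicts(entries):
    """Return sorted (group, slot) pairs that have more than one source."""
    sources = defaultdict(set)
    for e in entries:
        sources[(e.group, e.slot)].add(e.source)
    return sorted(key for key, srcs in sources.items() if len(srcs) > 1)


schedule = [
    SlotEntry(slot=0, group=0, source=3),
    SlotEntry(slot=1, group=0, source=5),
    SlotEntry(slot=0, group=1, source=3),
]
conflicts_ok = find_source_conflicts(schedule)

# Adding a second source for group 0, slot 0 violates the rule.
conflicts_bad = find_source_conflicts(schedule + [SlotEntry(slot=0, group=0, source=7)])

print(conflicts_ok)   # []
print(conflicts_bad)  # [(0, 0)]
```

A compiler pass of this kind would reject the schedule outright, which matches the stated division of labor: the device only checks and never arbitrates.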
Scale-Out & SuperPod Composition
Structure Extension (Mode 3)
Domain Replication
SuperPod Composition (Proposal)
An 8-domain row-of-racks chain: 8 × domain racks + 1 NET/MGMT rack + 1 CDU/power rack, ~10 racks in a row.
Per-Domain KV Budget (Agent Era)
| Model @ 2 TB Domain | KV / Token | Weights Resident | 128K Context | 1M Context |
|---|---|---|---|---|
| Llama 405B (dense GQA-8 FP16) | 504 KB | 810 GB | ~63 GB | ~2 seq/domain |
| DeepSeek-V3 (MoE FP8) | 74 KB | 671 GB | ~9.2 GB | ~22 seq/domain |
| DeepSeek-V3 (MoE+MLA FP8) | 69 KB | 671 GB | ~8.6 GB | ~24 seq/domain |
Resolution: Make KV Budget Explicit
MPU does not bypass the capacity problem — it relocates it. Write "per-domain KV budget" as an explicit spec parallel to "per-token weight bytes." Each +8 GB/chip = +512 GB domain KV budget.
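The Llama 405B row and the capacity increment can be reproduced with a short calculation. This is a sketch under assumptions: the layer count (126) and head dimension (128) are taken from the publicly released Llama 3.1 405B configuration, not from this document, and KB/GB are treated as binary units. The function name is illustrative.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """KV-cache bytes per token for grouped-query attention.

    The factor 2 covers one K and one V vector per layer.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# Assumed model shape: 126 layers, 8 KV heads (GQA-8), head_dim 128, FP16.
llama_bytes = kv_bytes_per_token(layers=126, kv_heads=8, head_dim=128, bytes_per_elem=2)
llama_kib = llama_bytes / 1024
ctx_128k_gib = llama_bytes * 128 * 1024 / 2**30

# Capacity increment from the text: +8 GB per chip across a 64-chip domain.
CHIPS = 64
added_domain_gb = 8 * CHIPS

print(f"KV per token:      {llama_kib:.0f} KB")
print(f"128K context:      {ctx_128k_gib:.0f} GB")
print(f"+8 GB/chip yields: +{added_domain_gb} GB per domain")
```

Under these assumptions the results match the table's 504 KB per token and ~63 GB at 128K context, and confirm the +512 GB increment.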
Defensible Claim
"While KV fits, MPU re-reads from high-bandwidth local DRAM, not spilling to slow storage" — capacity-gated. The KV-friendly regime (MoE+MLA) is exactly MPU's per-watt sweet spot.
Gen1 vs Gen2: Full Specification
| Capability | Gen1 | Gen2 |
|---|---|---|
| Ingress efficiency (eta) | ~45% | ~90% |
| CSL function | Deterministic Memory Ingress Block | Prefetch, reorder, compression, bandwidth shaping, multi-domain scheduling |
| SIF form | FPGA prototype (reduced-scale hub) | Dedicated fan-out ASIC (full 64-port hub, 256-lane) |
| SIF hub | Reduced-scale FPGA hub (≥10 Gbps) | 64-port ASIC hub, 16 Gbps P0 / 32 Gbps P2 |
| Target workload | 64-chip domain validation | Production MoE, long-context, agentic |
| Scale-out | Physical channels reserved | Enabled: structure extension + multi-domain scheduling |
| RAS | CRC + phase restart path working | Full RAS after link-retry vs. restart decision |
| Deliverable | FPGA platform + reduced RTL co-bring-up (SOW-1) | Tape-out with three-IC implementation (SOW-3) |
Open Decisions (All Sources)
Items from Switch PRD, System PRD, and Gen1/Gen2 PRD that require architectural sign-off.
Sources: Switch PRD D-1..D-10 | System PRD D-S1..D-S9 | Gen1/Gen2 PRD Section 12
Switch PRD (D-1 to D-10)
| # | Decision | Recommendation | When |
|---|---|---|---|
| D-1 | Per-MPU hub port width | 2 lanes/dir, single die; widen only on validated batch evidence | Before SOW-2 |
| D-2 | Board-level long-reach SerDes (16/32 Gbps) | Complete IP evaluation during SOW-1 | SOW-1 |
| D-3 | Per-lane rate path (16 -> 32 Gbps) | Lock 16 Gbps Gen1; 32 Gbps as Gen2 headroom | Gen1 freeze |
| D-4 | Source self-listen (loopback at hub) | Lean yes — simplifies compiler consistency checks | FPGA phase |
| D-5 | Link retry vs. phase restart | Baseline phase restart; decide retry from measured FPGA BER | After FPGA measure |
| D-6 | Flit length and field widths | Set with D-2 PHY flit granularity | After D-2 |
| D-7 | On-device management MCU | Lean no — pure register plane + BMC to shrink attack surface | Before SOW-2 |
| D-8 | Inter-domain rail topology | Line/ring baseline over copper; optical now moot (80 Mbps copper is trivial) | Gen2 planning |
| D-9 | Broadcast-only vs. subset-multicast (dst_mask) | Broadcast-only if every crossing is one row -> all chips | Gen1 freeze |
| D-10 | Data-dependent routing (in-band route-id) for MoE dispatch | Out of Gen1; if ever needed, precompiled route-id indexing a route table | Gen2 planning |
System PRD (D-S1 to D-S9)
| # | Decision | Recommendation | When |
|---|---|---|---|
| D-S1 | Root/Spine tiering | Converged to single hub — CLOSED | — |
| D-S2 | Inter-domain rail medium | 80 Mbps/line copper; confirm pinout/cable budget | SOW-1 |
| D-S3 | Host node spec | Commodity server + standard NIC; not custom | Before SOW-2 |
| D-S4 | Spare & repair strategy | Recompile re-route as baseline; hot spares per customer SLA | Gen1 system def |
| D-S5 | Rack / power / cold-plate vendor | Taiwan-local chassis + cold-plate partner | SOW-2 |
| D-S6 | SIF / CIF naming | Standardize on SIF across doc set | Immediate |
| D-S7 | Memory capacity solution | Capacity-bandwidth tension; decide with memory decision | Same window |
| D-S8 | Pod chain-length ceiling | 8-domain proposal baseline; frozen with D-S2 + SOW-1 | After SOW-1 |
| D-S9 | Per-domain KV budget spec | Write as explicit spec; size memory by weights + KV together | Same window |
SIF-Link: Flit Format & Protocol
Data-Plane Flit
Fixed-length flit; payload = FP16 value sequence of a row stream.
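To make the idea of a fixed-length FP16 payload concrete, here is a purely illustrative packing sketch. Flit length and field widths are still an open decision (D-6), so the payload size of 8 values, the little-endian layout, and the trailing CRC-32 field are all assumptions chosen for the example, not the SIF-Link format.

```python
import struct
import zlib

PAYLOAD_VALUES = 8                      # hypothetical payload length
_PAYLOAD_FMT = f"<{PAYLOAD_VALUES}e"    # 'e' = IEEE 754 half precision
_CRC_FMT = "<I"                         # hypothetical 32-bit CRC trailer
FLIT_BYTES = struct.calcsize(_PAYLOAD_FMT) + struct.calcsize(_CRC_FMT)


def pack_flit(values):
    """Pack exactly PAYLOAD_VALUES floats as FP16 plus a CRC-32 trailer."""
    if len(values) != PAYLOAD_VALUES:
        raise ValueError(f"expected {PAYLOAD_VALUES} values, got {len(values)}")
    payload = struct.pack(_PAYLOAD_FMT, *values)
    return payload + struct.pack(_CRC_FMT, zlib.crc32(payload))


def unpack_flit(flit):
    """Verify the CRC trailer and return the FP16 payload as Python floats."""
    if len(flit) != FLIT_BYTES:
        raise ValueError("wrong flit length")
    payload, trailer = flit[:-4], flit[-4:]
    (crc,) = struct.unpack(_CRC_FMT, trailer)
    if zlib.crc32(payload) != crc:
        raise ValueError("CRC mismatch")
    return list(struct.unpack(_PAYLOAD_FMT, payload))


row_stream = [1.0, 0.5, -2.0, 0.25, 4.0, -0.125, 8.0, 0.0]
flit = pack_flit(row_stream)
decoded = unpack_flit(flit)
print(len(flit), decoded == row_stream)
```

The values in the example are exactly representable in FP16, so the round trip is lossless. Arbitrary floats would be rounded to half precision on packing.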
Physical Layering
Validation Phases: Detailed Deliverables
SIF FPGA
Map reduced-scale single hub on FPGA. Measure source-to-hub-to-receiver broadcast latency against <200 ns target. Verify source order preservation across late-start/early-arrive cases.
- ✓ FPGA hub platform (8–16 MPU scale)
- ✓ Latency measurement vs. <200 ns target
- ✓ Source order preservation proven
- ✓ Inter-domain direct link proven
- ✓ SerDes IP evaluation + link sim
CSL Ingress
DDR/analog-DDR -> CSL prefetch/reorder/buffer -> MPU Column State Ingress, column-by-column. Verify ingress cadence matches compute cadence.
- ✓ Single cluster (4-chip) ingress closed
- ✓ Compute-ingress cadence match proven
- ✓ Cold plate / rack physical design
- ✓ Power delivery characterization
System Integration
Integrate on-chip compute, SIF traverse, and CSL ingress into a closed loop. 64-chip domain bring-up: ~25 kW rack, liquid cooling, complete management and telemetry stack.
- ✓ 64-chip domain bring-up
- ✓ ~25 kW rack operational
- ✓ ASIC tape-out basis
- ✓ RAS validation (CRC + restart)
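The ~25 kW rack figure can be related to the per-chip TDP range with a rough estimate. The sketch below only multiplies out the chip count. How the remainder splits across the hub, memory, host node, and conversion losses is not specified here, so it is reported as a single lump.

```python
# Rough rack power check: 64 MPU chips at the stated TDP range vs. ~25 kW.
CHIPS = 64
TDP_W = (250, 350)   # per-chip TDP range from the physical design table
RACK_KW = 25         # approximate rack figure from the integration phase

chip_kw = tuple(CHIPS * w / 1000 for w in TDP_W)
remaining_kw = tuple(RACK_KW - kw for kw in chip_kw)

print(f"MPU chips: {chip_kw[0]:.1f}-{chip_kw[1]:.1f} kW")
print(f"Left for everything else: {remaining_kw[1]:.1f}-{remaining_kw[0]:.1f} kW")
```

The MPU chips alone account for 16.0–22.4 kW, so the ~25 kW rack leaves roughly 2.6–9.0 kW for all other components.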
Confidential & NDA-Protected
This information is provided under NDA to ecosystem partners only. Do not distribute. For questions, contact info@symaticslab.com.