// blog · analysis · compute · 2026-06-10 · source: analysis / ai-blogs.org

Memory bandwidth is the new context window — why DiffusionGemma's parallel decoding and Gemini 3.5 Pro's 2M context are the same hardware story

Two June releases reframe what actually limits inference: DiffusionGemma's parallel block generation and Gemini 3.5 Pro's 2M-token context. Both push against the same wall: memory bandwidth, not raw FLOPS, is the binding constraint.

The two model releases that surfaced this week look unrelated. DiffusionGemma generates 256-token blocks in parallel at 1,000 tok/s on H100; Gemini 3.5 Pro promises a 2M-token context window. One is a throughput story; the other is a context-length story. Both are bound by the same physical constraint: how fast the model's KV cache can move through HBM.

The constraint is bandwidth, not compute

Autoregressive token generation at scale is limited by memory bandwidth, not FLOPS. Each generated token requires reading the model weights and the entire KV cache, and the cache grows linearly with context length, so the larger the context, the more bytes you have to move per token. DiffusionGemma's parallel-decoding architecture sidesteps the per-token KV cache read by generating whole blocks; the throughput win comes from doing more useful work per memory access. Gemini 3.5 Pro at 2M context will need new caching strategies just to keep per-token latency within a range that is acceptable in production.
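To put rough numbers on the bandwidth argument, here is a minimal back-of-envelope sketch. Everything in it is an assumption for illustration: the `ModelShape` values describe a hypothetical 8B-parameter decoder, not the real configuration of DiffusionGemma or Gemini 3.5 Pro; the 3.35e12 bytes/s figure is approximately the published HBM bandwidth of an H100 SXM part; and the block-parallel case assumes 16 refinement passes per 256-token block, each streaming weights and cache once. It ignores batching, cache compression, and compute time, so treat the result as a ceiling, not a prediction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical decoder shape, chosen for illustration only."""
    n_params: int
    n_layers: int
    n_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # assumes fp16/bf16 weights and cache


def kv_cache_bytes(m: ModelShape, context_tokens: int) -> int:
    # One K vector and one V vector per layer, per KV head, per cached token.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.bytes_per_value * context_tokens


def bytes_per_pass(m: ModelShape, context_tokens: int) -> int:
    # A forward pass streams all weights plus the whole KV cache from HBM once.
    return m.n_params * m.bytes_per_value + kv_cache_bytes(m, context_tokens)


def max_tokens_per_second(
    m: ModelShape,
    context_tokens: int,
    hbm_bytes_per_s: float,
    tokens_per_block: int = 1,   # 1 = classic autoregressive decoding
    passes_per_block: int = 1,   # refinement passes needed to finish a block
) -> float:
    """Bandwidth-only ceiling on single-stream throughput."""
    seconds_per_block = passes_per_block * bytes_per_pass(m, context_tokens) / hbm_bytes_per_s
    return tokens_per_block / seconds_per_block


if __name__ == "__main__":
    shape = ModelShape(n_params=8_000_000_000, n_layers=32, n_kv_heads=8, head_dim=128)
    hbm = 3.35e12  # bytes/s, approximate

    print(kv_cache_bytes(shape, 1))  # 131072 bytes (128 KiB) of cache per token
    for ctx in (8_000, 128_000, 2_000_000):
        ar = max_tokens_per_second(shape, ctx, hbm)
        block = max_tokens_per_second(shape, ctx, hbm, tokens_per_block=256, passes_per_block=16)
        print(f"context={ctx:>9,}  autoregressive<={ar:8.1f} tok/s  block-parallel<={block:9.1f} tok/s")
```

Under these assumptions the cache alone reaches hundreds of gigabytes at 2M tokens, which is why long context and per-token latency pull against each other. The block-parallel ceiling is higher by exactly `tokens_per_block / passes_per_block`, which is the "more useful work per memory access" effect in its simplest form.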

Why this favors NVIDIA's stack consolidation

NVIDIA's RTX Spark PC chip launch and the broader Vera Rubin / Blackwell roadmap are explicitly memory-bandwidth-focused. HBM3e and the upcoming HBM4 generations are the gating supply for NVIDIA; AMD's Helios MI455X rack hits competitive density but is shipping into a market where the buyer's binding constraint is HBM allocation. The architecture choices Google is making in DiffusionGemma and Gemini 3.5 Pro fit naturally with NVIDIA's bandwidth-led product roadmap.

What this changes for buyers

The sovereign-AI procurement decision has been framed as "NVIDIA vs AMD" on hardware specs. The bandwidth-bound architecture reality reframes it: "which vendor's HBM allocation, NVLink/Infinity Fabric bandwidth, and inference-optimization stack matches my workload's actual binding constraint?" Helios is the first credible second source, but the buyer's optimization target (long-context, high-throughput foundation-model inference) is exactly the workload NVIDIA's bandwidth-led stack was designed for.
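One way a buyer might check which constraint actually binds for a given workload is a roofline-style comparison: divide the FLOPs a step performs by the bytes it moves, and compare that with the accelerator's ratio of peak compute to memory bandwidth. The sketch below is hedged accordingly. The peak-compute and bandwidth numbers are illustrative round figures, not the specification of any product named here, and `dense_decode_step` is a hypothetical helper that uses the common approximation of 2 FLOPs per parameter per token while ignoring KV cache traffic.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte read from memory."""
    return flops / bytes_moved


def ridge_point(peak_flops_per_s: float, hbm_bytes_per_s: float) -> float:
    """Intensity above which the accelerator is limited by compute."""
    return peak_flops_per_s / hbm_bytes_per_s


def binding_constraint(
    flops: float,
    bytes_moved: float,
    peak_flops_per_s: float,
    hbm_bytes_per_s: float,
) -> str:
    if arithmetic_intensity(flops, bytes_moved) >= ridge_point(peak_flops_per_s, hbm_bytes_per_s):
        return "compute"
    return "memory bandwidth"


def dense_decode_step(n_params: int, batch_size: int, bytes_per_value: int = 2):
    """Hypothetical cost model for one decode step of a dense model.

    Assumes ~2 FLOPs per parameter per sequence in the batch, and that the
    weights are read from HBM once per step regardless of batch size.
    KV cache traffic is deliberately left out to keep the sketch small.
    """
    flops = 2.0 * n_params * batch_size
    bytes_moved = float(n_params * bytes_per_value)
    return flops, bytes_moved


if __name__ == "__main__":
    PEAK = 1.0e15   # FLOP/s, illustrative
    HBM = 3.35e12   # bytes/s, illustrative

    for batch in (1, 32, 512):
        flops, moved = dense_decode_step(8_000_000_000, batch)
        print(batch, binding_constraint(flops, moved, PEAK, HBM))
```

With 2-byte weights the intensity of this simplified step equals the batch size, so small-batch decoding sits far below the ridge point and is bandwidth-bound. Adding long-context KV cache reads to `bytes_moved` would push the workload further in the same direction.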

Sources: NVIDIA Blog, RTX AI Garage: DiffusionGemma local inference · ServeTheHome: AMD EPYC Venice, MI455X & Helios hardware