// news · frontier-models · multimodal · 2026-05-28 · source: google / blog.google / deepmind

Google Gemini 3.1 Ultra ships 2M-token context with native multimodal input: text, image, audio, and video processed in a single context window

Google's Gemini 3.1 Ultra launched with a 2-million-token context window and native multimodal capability: text, image, audio, and video are all processed in the same context window, without first being translated into intermediate text representations. The release extends Google's lead on the context-window and multimodal-orchestration axes. It also fits the public framing shared by Google, OpenAI, and Anthropic executives, who all describe the frontier race as effectively neck-and-neck, with each lab ahead on different specialized axes.

The 2M-token context window is the substantive capability. From 2023 through 2025 the context-window competition ran through several iterations: Anthropic's 100K and 200K Claude tiers, OpenAI's 128K GPT-4 Turbo, and Google's earlier 1M Gemini 1.5 Pro. Gemini 3.1 Ultra at 2M raises the ceiling while keeping native multimodal handling, meaning the 2M tokens need not be text alone but can be any mixture of text, image patches, audio chunks, and video frames in a single context. The technical achievement is the attention-mechanism work that lets the model recall relevant tokens across the full 2M-token range, without the recall degradation that affected earlier generations whose long context windows were not matched by equally long recall.
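To make the shared-budget idea concrete, here is a minimal Python sketch of how a mixed-modality workload might be checked against a 2M-token window. Only the 2,000,000 limit comes from the announcement; the per-unit token costs, the `Workload` type, and the function names are hypothetical placeholders chosen for illustration, not published Gemini figures or API calls.

```python
from dataclasses import dataclass

CONTEXT_LIMIT = 2_000_000  # the 2M-token window reported in the announcement

# HYPOTHETICAL token costs per unit of each modality (illustration only;
# these are assumptions, not published tokenization rates).
TOKENS_PER_IMAGE = 250
TOKENS_PER_AUDIO_SECOND = 30
TOKENS_PER_VIDEO_SECOND = 300


@dataclass
class Workload:
    """A mixed-modality request that shares one context window."""
    text_tokens: int = 0
    images: int = 0
    audio_seconds: int = 0
    video_seconds: int = 0


def estimate_tokens(w: Workload) -> int:
    """Sum the assumed token cost of every modality in the workload."""
    return (
        w.text_tokens
        + w.images * TOKENS_PER_IMAGE
        + w.audio_seconds * TOKENS_PER_AUDIO_SECOND
        + w.video_seconds * TOKENS_PER_VIDEO_SECOND
    )


def fits(w: Workload, limit: int = CONTEXT_LIMIT) -> bool:
    """True if the whole workload fits in a single context window."""
    return estimate_tokens(w) <= limit


# One hour of video, one hour of audio, 400 images, and 200K text tokens.
mixed = Workload(text_tokens=200_000, images=400,
                 audio_seconds=3_600, video_seconds=3_600)
print(estimate_tokens(mixed), fits(mixed))
```

Under these assumed rates, every modality draws on the same budget, so a long video can crowd out text even though the window is nominally 2M tokens. Real per-modality costs would need to come from the vendor's documentation.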

The competitive context is a frontier split across specialized axes. Google, OpenAI, and Anthropic executives continue to describe the race as effectively neck-and-neck, with each lab leading on different axes. Google's lead is concentrated in multimodal orchestration and context-window size, and Gemini 3.1 Ultra's 2M-token native-multimodal context is the most explicit example. Combined with Gemini Spark's 24/7 background agent running across the Google ecosystem, Google's consumer-facing AI surface is now structurally differentiated from frontier-lab competitors built around per-session APIs, on both the context-and-multimodal axis and the persistent-execution axis. For developers and procurement teams, the operative selection logic is to match each workload to the model that leads on the axis that workload depends on.


Google Blog — Gemini 3.1 Ultra 2M token context release → · DeepMind — Gemini 3.1 native multimodal architecture → · TechCrunch — Google Gemini 3.1 Ultra capability announcement →