Molmo 2 from the Allen Institute: state-of-the-art video understanding with pointing and tracking, an open multimodal alternative to closed-source vendor offerings
Allen Institute's Molmo 2 is an open multimodal model that ships state-of-the-art video understanding with pointing and tracking. It competes credibly with closed-source vendor offerings on video-understanding tasks, and its pointing and tracking address interactive video-annotation use cases that classification-only or description-only models can't handle.
The substantive piece is pointing and tracking. Open video-understanding models before Molmo 2 supported classification (what's in the video) and description (what's happening). Pointing and tracking add spatio-temporal precision: the model locates specific objects in each frame and follows their motion across frames. Previous open-source video models did not support the interactive-annotation use cases this enables, such as sports analytics, security review, and content moderation.
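To make "locations across frames" concrete, here is a minimal sketch of how an annotation tool might store and query one tracked object. The `TrackPoint` type, the normalized coordinates, and both helper functions are hypothetical illustrations, not Molmo 2's actual output format or API.

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional


@dataclass(frozen=True)
class TrackPoint:
    """One located point for one object (hypothetical schema, not Molmo 2's)."""
    frame: int  # frame index in the video
    x: float    # horizontal position, assumed normalized to [0, 1]
    y: float    # vertical position, assumed normalized to [0, 1]


def path_length(track: list[TrackPoint]) -> float:
    """Total distance the object moves between consecutive tracked frames."""
    ordered = sorted(track, key=lambda p: p.frame)
    return sum(hypot(b.x - a.x, b.y - a.y) for a, b in zip(ordered, ordered[1:]))


def position_at(track: list[TrackPoint], frame: int) -> Optional[TrackPoint]:
    """Return the point recorded for `frame`, or None if the object was not located."""
    for point in track:
        if point.frame == frame:
            return point
    return None


# Example: a ball tracked over three frames.
ball = [
    TrackPoint(frame=0, x=0.10, y=0.50),
    TrackPoint(frame=1, x=0.40, y=0.10),
    TrackPoint(frame=2, x=0.40, y=0.50),
]
print(round(path_length(ball), 3))
print(position_at(ball, 1))
```

A classification or description model has nothing to put in a structure like this. Per-frame coordinates are what let a reviewer jump to where an object is at a given moment or measure how far it moved.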
Against the closed-source video-understanding landscape (Gemini, GPT-4o vision, Claude vision), the competitive read is that Molmo 2 offers an open-source alternative at competitive capability levels for pointing-and-tracking use cases specifically. H2 2026 procurement for video-analytics workloads should weigh Molmo 2 against the closed-source alternatives, particularly for deployments where self-hosting matters for privacy, data sovereignty, or cost at scale.
Sources: Allen Institute, "Molmo 2: State-of-the-art video understanding, pointing, and tracking" · Pinggy, "Best Video Generation AI Models in 2026"