// news · multimodal · open-source · 2026-06-24 · source: allenai / pinggy

Molmo 2 from Allen Institute — state-of-the-art video understanding with pointing and tracking, open multimodal alternative to closed-source vendor offerings

Allen Institute's Molmo 2 is an open multimodal model that ships state-of-the-art video understanding with pointing and tracking. It competes credibly with closed-source vendor offerings on video-understanding tasks, and its pointing and tracking support addresses interactive video-annotation use cases that classification-only or description-only models can't handle.

The substantive piece is pointing and tracking for video. Earlier open video-understanding models supported classification (what's in the video) and description (what's happening). Pointing and tracking add spatio-temporal precision: the model locates specific objects in individual frames and follows their motion across frames. Previous open-source video models did not support the interactive-annotation use cases this enables, such as sports analytics, security review, and content moderation.
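To make "spatio-temporal precision" concrete, here is a minimal sketch of how a downstream annotation tool might consume a tracked object. The `TrackPoint` schema, the normalized coordinates, and the helper names are assumptions for illustration only. They do not reflect Molmo 2's actual output format, which the source does not describe.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class TrackPoint:
    """One pointed location for a tracked object (hypothetical schema)."""
    frame: int   # frame index in the video
    x: float     # horizontal position, assumed normalized to [0, 1]
    y: float     # vertical position, assumed normalized to [0, 1]


def path_length(track: list[TrackPoint]) -> float:
    """Total distance the object travels, in normalized image units."""
    ordered = sorted(track, key=lambda p: p.frame)
    return sum(
        hypot(b.x - a.x, b.y - a.y)
        for a, b in zip(ordered, ordered[1:])
    )


def mean_speed(track: list[TrackPoint], fps: float) -> float:
    """Average speed in normalized units per second (0.0 if undefined)."""
    ordered = sorted(track, key=lambda p: p.frame)
    if len(ordered) < 2:
        return 0.0
    elapsed = (ordered[-1].frame - ordered[0].frame) / fps
    return path_length(ordered) / elapsed if elapsed > 0 else 0.0


# Hypothetical track for one object, e.g. a ball in a sports clip.
ball = [
    TrackPoint(frame=0, x=0.10, y=0.50),
    TrackPoint(frame=15, x=0.40, y=0.50),
    TrackPoint(frame=30, x=0.40, y=0.90),
]
print(round(path_length(ball), 3))
print(round(mean_speed(ball, fps=30.0), 3))
```

A classification or description model yields only a label or a caption for the clip. Per-frame points like these are what allow measurements such as distance travelled or speed, which is the kind of signal sports-analytics and review tooling typically needs.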

The competitive read against the closed-source video-understanding landscape (Gemini, GPT-4o vision, Claude vision) is that Molmo 2 offers an open-source alternative at competitive capability levels for pointing-and-tracking use cases specifically. Teams procuring for video-analytics workloads in H2 2026 should weigh Molmo 2 against the closed-source alternatives, particularly for deployments where self-hosting matters (privacy, data sovereignty, cost at scale).

Allen Institute — Molmo 2: State-of-the-art video understanding, pointing, and tracking → · Pinggy — Best Video Generation AI Models in 2026 →