// news · multimodal · agentic-ai · 2026-06-03 · source: marktechpost.com

Alibaba's Qwen3.7-Plus brings vision and GUI control to a 60%-cheaper tier

Alibaba's Qwen team released Qwen3.7-Plus on June 2, adding image and video input, deep reasoning, and tool use to a model priced roughly 60% below the text-only Qwen3.7-Max it shipped weeks earlier. The headline numbers are ScreenSpot Pro 79.0 and Terminal-Bench 70.3, front-of-pack scores for an "open-API" GUI agent: anyone can call it through the API, but the weights stay closed. The release is the clearest signal yet that Chinese labs are skipping the "release a vision model" step and going straight to screen-reading agents.

The model accepts text, images, and video on the Bailian platform (Model Studio for international users) and adds five agentic features on top of multimodal input: deep reasoning, self-programming, tool invocation, verification, and autonomous iteration. In plain terms, it can take a screenshot, read what is on the screen, click the right pixel, run a command, check the result, and loop until done. Qwen3.7-Plus is reported to inherit the 1M-token context window of the Qwen3.7-Max backbone, though Alibaba has not published an official ceiling, and it is trained with an agentic reinforcement-learning loop on real-world execution feedback rather than synthetic traces.
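The screenshot, act, check, repeat cycle described above can be sketched as a small loop. This is a minimal illustration, not Alibaba's implementation: `ToyDesktop`, `Action`, `toy_policy`, and `run_agent` are hypothetical names, the "screenshot" is a text description rather than pixels, and the policy function stands in for the actual model call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """One step chosen by the model. All fields are hypothetical."""
    kind: str                                  # "click", "run", or "done"
    point: Optional[Tuple[int, int]] = None    # pixel target for "click"
    command: Optional[str] = None              # shell command for "run"


@dataclass
class ToyDesktop:
    """Stand-in environment: a single Save button and a fake shell."""
    saved: bool = False
    log: List[str] = field(default_factory=list)

    def screenshot(self) -> str:
        # A real agent would receive pixels; text keeps this sketch runnable.
        return "saved" if self.saved else "unsaved; Save button at (40, 20)"

    def click(self, point: Tuple[int, int]) -> None:
        self.log.append(f"click {point}")
        if point == (40, 20):
            self.saved = True

    def run(self, command: str) -> str:
        self.log.append(f"run {command}")
        return "ok"


def toy_policy(observation: str) -> Action:
    """Placeholder for the model call: maps an observation to the next action."""
    if observation.startswith("unsaved"):
        return Action(kind="click", point=(40, 20))
    return Action(kind="done")


def run_agent(env: ToyDesktop,
              policy: Callable[[str], Action],
              max_steps: int = 10) -> bool:
    """Observe, act, and re-observe until the policy reports completion."""
    for _ in range(max_steps):
        observation = env.screenshot()      # 1. look at the screen
        action = policy(observation)        # 2. decide the next step
        if action.kind == "done":           # 3. the check passed, so stop
            return True
        if action.kind == "click" and action.point is not None:
            env.click(action.point)
        elif action.kind == "run" and action.command is not None:
            env.run(action.command)
    return False                            # step budget exhausted


desktop = ToyDesktop()
finished = run_agent(desktop, toy_policy)
```

The `max_steps` budget is a design choice worth noting: an agent that iterates autonomously needs a hard stop so that a policy that never reports completion cannot loop forever.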

The ScreenSpot Pro score of 79.0 is the line worth staring at. That benchmark measures pixel-accurate UI grounding (can the model point to the right button?), and a 79 puts Qwen3.7-Plus level with Anthropic's Claude Computer Use and OpenAI's Operator on a task that has been the moat of closed Western labs for the past year. Terminal-Bench 70.3 says the same model is competent in a sandboxed shell. Historically, GUI grounding and shell work have been handled by two different models; Alibaba is shipping them as one. For the cost angle and the broader multimodal cheapening trend, see our companion piece on NVIDIA's Nemotron 3 Nano Omni open-weight push.
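To make "pixel-accurate UI grounding" concrete, here is a minimal scoring sketch. It assumes the common convention for grounding benchmarks, where a prediction counts as correct if the predicted click point lands inside the target element's bounding box; the official ScreenSpot Pro harness may differ in detail. The function names and the sample data are made up for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def is_hit(click: Point, target: Box) -> bool:
    """True if the predicted click lands inside the target element's box."""
    x, y = click
    left, top, right, bottom = target
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(clicks: List[Point], targets: List[Box]) -> float:
    """Percentage of predictions that land inside their target box."""
    if len(clicks) != len(targets):
        raise ValueError("clicks and targets must have the same length")
    if not clicks:
        return 0.0
    hits = sum(is_hit(c, t) for c, t in zip(clicks, targets))
    return 100.0 * hits / len(clicks)


# Made-up predictions and target boxes, for illustration only.
example_clicks = [(12, 12), (300, 40), (55, 90), (500, 500)]
example_targets = [
    (10, 10, 30, 20),
    (290, 30, 320, 50),
    (60, 80, 100, 100),     # the third click misses: x=55 is left of the box
    (480, 490, 520, 510),
]
score = grounding_accuracy(example_clicks, example_targets)
```

Under this convention a score of 79.0 would mean roughly four in five predicted clicks land on the intended element, which is why small targets on dense, high-resolution screens make the benchmark hard.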

The catch is that Qwen3.7-Plus is API-only. No open weights at launch, no published context-window ceiling, no committed price sheet outside temporary free access via Vercel's AI Gateway. That puts it in an awkward category: cheaper and more capable than Qwen3.7-Max from a few weeks ago, but proprietary in the same way GPT-5 and Gemini 3 are proprietary, which is exactly the model Alibaba's open-weight reputation was supposed to set it apart from. Reviewers note an open-weight variant "remains plausible in Q3 2026," which is another way of saying it is not confirmed.

The strategic read: vision is now table stakes, and the frontier has moved to agents that can drive a computer. Two of the three highest-profile multimodal releases this quarter, Nemotron 3 Nano Omni and Qwen3.7-Plus, led with GUI navigation as the headline use case, not image captioning or chart QA. Customer-service workflows and document intelligence are the marketing copy; the actual product is "an LLM that can use your software." Anyone still benchmarking multimodal models on VQAv2 is fighting last year's war.

Sources: MarkTechPost (Alibaba's Qwen3.7-Plus launch) · BuildFastWithAI (Qwen3.7-Plus review and benchmarks) · VentureBeat (Qwen3.7-Plus pricing and modality coverage)