// news · research-papers · multimodal · 2026-05-22 · source: arxiv 2504 / vlm research

Chain-of-Modality prompting — Vision-Language Models progressively integrate modalities to refine manipulation plans from human demonstration video

An arXiv paper, 'Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models' (arXiv 2504.13351), introduces a prompting strategy in which Vision-Language Models (VLMs) integrate information from one modality at a time, refining the task plan for robotic manipulation at each step. The key structural point is that the method works without retraining: it is a prompting protocol that elicits multimodal reasoning from existing VLMs.
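
As a rough illustration of the progressive-integration idea, the sketch below queries a model once per modality and feeds each answer into the next prompt. It is a minimal sketch, not the paper's implementation: the `VLM` callable, the `chain_of_modality` function, the prompt wording, and the modality names in the usage note are all hypothetical, and a real system would attach video frames rather than text summaries.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any callable that maps a text prompt to a text reply.
# A real system would wrap a VLM client here and pass images as well.
VLM = Callable[[str], str]


def chain_of_modality(vlm: VLM, modalities: Dict[str, str], task: str) -> List[str]:
    """Query the model once per modality, feeding each answer into the next prompt.

    `modalities` maps a modality name to a text summary of that signal, in the
    order the modalities should be integrated (dicts preserve insertion order).
    Returns the analysis from every stage; the last entry is the final plan.
    """
    stages: List[str] = []
    analysis = "(none yet)"
    for name, summary in modalities.items():
        prompt = (
            f"Task: {task}\n"
            f"Analysis so far: {analysis}\n"
            f"New modality ({name}): {summary}\n"
            "Refine the step-by-step manipulation plan using the new modality."
        )
        analysis = vlm(prompt)  # the previous stage's output is part of this prompt
        stages.append(analysis)
    return stages


def echo_vlm(prompt: str) -> str:
    """Stand-in model for a dry run: reports which modality it was given."""
    line = next(l for l in prompt.splitlines() if l.startswith("New modality"))
    return f"plan refined with {line}"
```

Calling `chain_of_modality(echo_vlm, {"audio": "click at 2.1s", "video": "hand twists cap"}, "open bottle")` returns one analysis per modality. Swapping `echo_vlm` for a wrapper around an off-the-shelf VLM is the only change needed, which is the sense in which the protocol requires no retraining.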

The pattern matters for agentic skill induction. The neuro-symbolic skill induction paper from the AM cycle lifted reasoning traces into a logical predicate library; Chain-of-Modality is the prompting protocol that produces those traces in the first place for multimodal robotic tasks. The two methodologies stack: Chain-of-Modality generates the trace, and neuro-symbolic lifting compiles it into reusable skills.

For the robotics-deployment timeline, the implication is that VLM-based robotic manipulation can now ship without per-deployment fine-tuning. Chain-of-Modality elicits competent multi-step plans from existing VLMs, the skill library accumulates from successful deployments, and the system improves through use rather than through retraining. That is closer to the 'robot learns on the job' paradigm than to the 'robot ships pre-trained for one task' paradigm.
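
The "improves through use" loop can be pictured as a store that keeps only plans that succeeded on hardware. The sketch below is a minimal illustration under that assumption; `SkillLibrary`, `record`, and `lookup` are hypothetical names, and neither paper's actual skill representation is shown here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SkillLibrary:
    """Hypothetical store: task name -> plan steps that succeeded on a robot."""

    skills: Dict[str, List[str]] = field(default_factory=dict)

    def record(self, task: str, plan: List[str], succeeded: bool) -> None:
        # Only successful executions are kept; failures leave the library unchanged.
        if succeeded:
            self.skills[task] = list(plan)

    def lookup(self, task: str) -> Optional[List[str]]:
        # Return a stored plan if one exists, otherwise None so the caller
        # can fall back to prompting the VLM for a fresh plan.
        return self.skills.get(task)
```

In this sketch the model weights never change: a deployment first calls `lookup`, prompts the VLM only on a miss, and calls `record` after execution, so the library grows with each successful run.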

arXiv — Chain-of-Modality manipulation → · arXiv — Robot Confirmation Q-Former Multimodal LLM → · arXiv — Bridging vision and touch →