Vision-language-action model

A vision-language-action model (VLA) is a neural architecture that takes visual observations and natural-language instructions as input and emits robot actions (joint torques or end-effector poses) in one forward pass. Examples include Google DeepMind's RT-2 lineage and Gemini Robotics, Physical Intelligence's π0 and π0.5, Skild AI's brain models, and Figure's Helix.

The strategic question is what the model emits. A VLA that emits text describing actions (an LLM reasoning over action words, with a separate decoder generating commands) is structurally different from one that emits continuous joint commands at policy frequency. The first is easier to scale on existing LLM infrastructure; the second is what actually controls the robot in closed loop. A "VLA" claim that does not specify which architecture is meant should be read as the LLM-on-action-words version by default. The difference between the two is a meaningful capability gap.
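
The structural difference between the two variants can be made concrete with a toy sketch. It is not any vendor's architecture or API: every name here (Observation, TextActionVLA, decode_text_action, ContinuousVLA) is hypothetical, the "models" are hand-written stand-ins for learned networks, and the 7-joint arm is an assumption. The only point is what each variant returns: a string that still needs a separate decoder, versus a joint-command vector that can be sent to the robot on every control tick.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Observation:
    image: Sequence[float]  # hypothetical stand-in for camera features
    instruction: str        # natural-language instruction


class TextActionVLA:
    """Variant 1 (toy): emits action words, not robot commands."""

    def plan(self, obs: Observation) -> str:
        # Stand-in for an LLM reasoning over action words.
        direction = "left" if "left" in obs.instruction else "right"
        return f"move_gripper {direction}"


def decode_text_action(action_text: str, n_joints: int = 7) -> List[float]:
    """Separate decoder (toy): maps an action phrase to a joint command."""
    sign = -1.0 if action_text.endswith("left") else 1.0
    return [0.1 * sign] + [0.0] * (n_joints - 1)


class ContinuousVLA:
    """Variant 2 (toy): emits continuous joint commands directly."""

    def __init__(self, n_joints: int = 7) -> None:
        self.n_joints = n_joints

    def act(self, obs: Observation) -> List[float]:
        # Stand-in for a learned policy: the command depends on the
        # current image, so it can change on every control tick.
        gain = sum(obs.image) / len(obs.image)
        sign = -1.0 if "left" in obs.instruction else 1.0
        return [sign * gain] + [0.0] * (self.n_joints - 1)


obs = Observation(image=[0.2, 0.4], instruction="push the block left")

# Variant 1: two stages, text first, commands from a separate decoder.
action_text = TextActionVLA().plan(obs)
decoded_command = decode_text_action(action_text)

# Variant 2: one stage, the policy output is already the command.
direct_command = ContinuousVLA().act(obs)
```

In this sketch the text variant produces the same decoded command regardless of the image, while the continuous variant's output tracks the observation. That is a property of the toy, chosen to mirror the closed-loop distinction drawn above, not a measured result about real systems.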

Canonical reference: registry.deploy.report/glossary#vision-language-action-model

