What is a foundation model for robotics?
A foundation model for robotics extends the large-language-model paradigm to physical-action prediction. Trained on robot demonstrations rather than internet text alone, these models output action sequences (motor commands, manipulator trajectories) rather than text tokens. The category is dominated by vision-language-action (VLA) architectures that take camera images plus optional language instructions as input and produce action tokens as output. Companies building these models constitute the brain-provider tier of the robotics value chain, distinct from the humanoid OEM tier that builds the physical platforms the models run on.
The category in one sentence
A foundation model for robotics is a machine learning model trained at scale on robot demonstration data, capable of producing action sequences (motor commands, manipulator trajectories, locomotion patterns) in response to camera + language + sensor input. The category extends the large-language-model paradigm from text generation to physical action prediction.
The defining architectural pattern is vision-language-action (VLA): input is multimodal (images, optional language instructions, proprioceptive sensor data); output is action tokens (motor commands the robot hardware can execute). VLA models are trained largely by behavior cloning, that is, supervised imitation of recorded demonstrations. The demonstrations come from combinations of teleoperation (human operators drive the robot through tasks, and the resulting trajectories become training data), prior robot demonstrations, and simulation rollouts. Some architectures incorporate reinforcement learning at later training stages; the dominant 2025-2026 paradigm trains primarily on demonstrations.
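To make the input/output contract concrete, the sketch below shows one way continuous motor commands could be mapped to and from discrete action tokens. It is a minimal illustration rather than any provider's implementation: the `Observation` fields, the 256-bin resolution, and the normalized [-1, 1] command range are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed for illustration: 256 bins per action dimension and commands
# normalized to [-1.0, 1.0]. Real systems choose their own values.
NUM_BINS = 256
LOW, HIGH = -1.0, 1.0


@dataclass
class Observation:
    """Hypothetical multimodal input to a VLA model."""
    image: bytes                       # encoded camera frame
    proprioception: List[float]        # e.g. joint angles
    instruction: Optional[str] = None  # optional language instruction


def action_to_tokens(action: List[float]) -> List[int]:
    """Quantize each continuous action dimension into an integer token."""
    tokens = []
    for value in action:
        clipped = min(max(value, LOW), HIGH)       # stay inside the envelope
        fraction = (clipped - LOW) / (HIGH - LOW)  # rescale to [0, 1]
        tokens.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return tokens


def tokens_to_action(tokens: List[int]) -> List[float]:
    """Map each token back to the center of its bin."""
    width = (HIGH - LOW) / NUM_BINS
    return [LOW + (t + 0.5) * width for t in tokens]


if __name__ == "__main__":
    obs = Observation(image=b"", proprioception=[0.0, 0.1], instruction="pick up the cup")
    tokens = action_to_tokens([-0.73, 0.2, 1.5])  # 1.5 is clipped to 1.0
    print(obs.instruction, tokens, tokens_to_action(tokens))
```

Under these assumptions the round trip loses at most half a bin width per dimension, which illustrates the trade-off between token vocabulary size and command precision.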
How VLA models differ from LLMs
For analysts evaluating where the brain-provider tier of robotics sits relative to the AI-research mainstream:
- Output space: LLMs produce text tokens; VLAs produce action tokens (motor commands, trajectory waypoints, manipulator targets). The action space is structured by the robot's hardware envelope.
- Training data: LLMs train on internet text; VLAs train on robot demonstrations. The data acquisition cost is substantially higher for VLAs because each demonstration requires a real robot operating in a physical environment (or a high-fidelity simulator that produces transferable data).
- Deployment: LLMs typically run as cloud inference; VLAs typically run embedded on the robot hardware because the latency budget for physical control is tight (single-digit milliseconds, versus LLM response latencies in the hundreds of milliseconds). Some hybrid architectures separate slow-loop planning (cloud) from fast-loop control (embedded).
- Verification posture: LLM capability claims are measured against text-generation benchmarks (MMLU, GSM8K, HumanEval); VLA capability claims are measured against task-completion benchmarks (success rate on manipulation tasks; long-horizon coordination; cross-environment generalization). Per DEPLOY's verified-vs-claimed framework on capability claims, VLA verification requires physical demonstration that benchmark scores alone do not provide.
The structural distinction matters because verifying a brain provider's claims takes more than the benchmark evaluation that suffices for an LLM. A VLA model that scores well on academic benchmarks may not transfer to commercial deployment without additional engineering work.
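The hybrid deployment pattern mentioned in the list above (slow-loop planning, fast-loop control) can be sketched as a single loop in which the planner is consulted only every few ticks while the controller runs on every tick. This is a simplified, single-threaded simulation under stated assumptions: `run_hybrid_loop`, `toy_planner`, and `toy_controller` are hypothetical names, and a real system would run the two loops asynchronously on separate hardware.

```python
from typing import Callable, List, Optional, Tuple


def run_hybrid_loop(
    plan_fn: Callable[[int], str],
    control_fn: Callable[[str, int], str],
    total_ticks: int,
    replan_every: int,
) -> List[Tuple[str, str]]:
    """Run the fast controller every tick; refresh the slow plan every N ticks."""
    if replan_every < 1:
        raise ValueError("replan_every must be at least 1")
    log: List[Tuple[str, str]] = []
    subgoal: Optional[str] = None
    for tick in range(total_ticks):
        if tick % replan_every == 0:
            # Slow loop: stands in for a higher-latency (e.g. cloud) planner.
            subgoal = plan_fn(tick)
        # Fast loop: stands in for the embedded controller, which keeps
        # acting on the most recent subgoal between planner updates.
        log.append((subgoal, control_fn(subgoal, tick)))
    return log


def toy_planner(tick: int) -> str:
    """Placeholder planner: names a subgoal after the tick that produced it."""
    return f"subgoal@{tick}"


def toy_controller(subgoal: str, tick: int) -> str:
    """Placeholder controller: emits a labeled command for the current tick."""
    return f"cmd[{tick}] toward {subgoal}"


if __name__ == "__main__":
    for subgoal, command in run_hybrid_loop(toy_planner, toy_controller, 6, 3):
        print(subgoal, "->", command)
```

The design point is that the controller never blocks on the planner: between planner updates it continues issuing commands against the last known subgoal, which is what keeps the fast loop inside its latency budget.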
Who is building foundation models for robotics
The brain-provider tier of robotics has multiple distinct strategic theses operating in parallel:
- Skild AI: Pittsburgh-based; CMU heritage; cross-platform general-purpose brain thesis (single VLA model targeted at multiple robot platforms rather than platform-specific).
- Physical Intelligence: UC Berkeley lineage through Sergey Levine's research; transformer-based VLA models (Pi-0; Pi-0.5); research-publication emphasis.
- Covariant: UC Berkeley + Pieter Abbeel research lineage; warehouse-automation specialization; AWS partnership context.
- Google DeepMind: large research organization with RT-2 (an earlier VLA from the Robotics Transformer research line) and Gemini Robotics (extending Gemini multimodal models to embodied action). Operates across AV and humanoid contexts.
- OpenAI Robotics: relaunched May 2026 after the Dactyl team dissolved around 2021; its current scope is still taking shape under Aditya Ramesh's leadership (Worldsim research line continuation toward embodiment).
- NVIDIA Project GR00T: VLA foundation model targeting humanoid platforms; cross-platform integration thesis aligned with NVIDIA's broader robotics-stack play (Isaac Sim, Jetson hardware, Omniverse).
- Meta AI (FAIR + Reality Labs): research arm with publications on embodied AI; less consumer-deployment focus than peer brain providers.
The cohort operates at different commercial-maturity tiers. DEPLOY's frontier-AI-labs-entering-robotics piece covers the OpenAI + DeepMind + Meta cluster specifically; this category piece covers the broader brain-provider tier including pure-play VLA companies (Skild, Physical Intelligence, Covariant).
Why teleop disclosure feeds training data
Foundation models for robotics connect directly to the broader humanoid-cohort teleop disclosure work: teleoperated demonstrations are the dominant training data source for VLA models. When 1X NEO operates with explicit Expert Mode teleop, the operator sessions produce trajectory data that feeds the next-generation autonomy training pipeline. Teleop is not a workaround for missing autonomy; it is the engineered data acquisition layer that produces the demonstrations VLA models need.
The framework applies this distinction at two levels: humanoid teleoperation disclosure across manufacturers is the operationally visible layer; foundation-model training is the upstream consumer of the demonstration data that layer produces. Manufacturers that disclose teleop transparently (1X) generate verification-grade training data that their own brain-development pipeline can use.
What the framework verifies and what it does not
Applying DEPLOY's verified-vs-claimed framework to foundation models for robotics:
- Research output verified: published papers + benchmark scores + demonstration videos document VLA capability at research-and-demonstration scale.
- Cross-platform deployment claimed at varying depth: some brain providers (Skild) frame cross-platform deployment as the central thesis; others (Physical Intelligence) frame research-publication progress as the primary verification. Cross-platform commercial-deployment depth varies.
- Commercial-scale verification developing: brain-provider tier commercial deployment lags behind humanoid OEM commercial deployment. The verification surfaces are different: OEM deployments anchor at customer facilities (BMW Spartanburg; GXO Flowery Branch); brain-provider deployments anchor at integration partnerships with OEMs, which the cohort has surfaced at varying disclosure depth.
- Cap-flag application: per-model commercial-deployment counts, per-partner integration depth, and verified cross-platform transfer are all framework cap-flag candidates for most brain providers. The cap-flag is the editorial truth, not a gap.
Where to go for context
For the structural framework distinguishing brain-provider tier from OEM-platform tier, see brain-provider tier vs OEM-platform tier distinction. For the brain-provider landscape comparison across companies, see brain-provider landscape comparison. For Skild AI specifically as a brain-provider exemplar, see what is Skild AI.
For broader frontier AI labs entering robotics, see why are frontier AI labs entering robotics and the OpenAI Robotics relaunch signal. For the framework DEPLOY applies to capability claims across humanoid makers, see how DEPLOY verifies capability claims.
More in humanoid capability: what they can really do
- Can humanoid robots cook?
- Can humanoid robots do laundry?
- How does teleoperation differ across humanoid robot manufacturers?
- Is 1X NEO autonomous, or is it controlled by humans?
- Is Boston Dynamics Atlas commercially available?
- What are the risks of humanoid robots?
- What can humanoid robots actually do today?
- What does Figure's BMW Spartanburg humanoid deployment actually look like?
- What is Agility Robotics and the Digit humanoid robot?
- What is Apptronik Apollo and how does it compare to other humanoids?
- What is Figure 03?
- What is Figure's Catalyst Brands Reno humanoid deployment?
- What is Fourier Intelligence and the GR-3 humanoid?
- What is Mentee Robotics and the MenteeBot humanoid?
- What is PAL Robotics and the TALOS humanoid?
- What is Sanctuary AI and the Phoenix humanoid robot?
- What is Skild AI and the Skild Brain foundation model?
- What is the Apptronik Apollo deployment at Mercedes-Benz?
- What is the Atlas deployment at Hyundai Metaplant America?
- What is the typical lifespan of a humanoid robot?
- What is UBTech Walker S2 and is UBTech a real humanoid company?
- What's the difference between robotics brain providers and robot makers?
- Which companies build foundation models for robotics, and how do they compare?
- Who are the leading humanoid robot makers?
- Why are frontier AI labs (OpenAI, Google DeepMind, Meta) entering robotics in 2025-2026?