What is a foundation model for robotics?

A foundation model for robotics extends the large-language-model paradigm to physical-action prediction. Trained on robot demonstrations rather than internet text alone, these models output action sequences (motor commands, manipulator trajectories) rather than text tokens. The category is dominated by vision-language-action (VLA) architectures that take camera images plus optional language instructions as input and produce action tokens as output. Companies building these models constitute the brain-provider tier of the robotics value chain, distinct from the humanoid OEM tier that builds the physical platforms the models run on.

The category in one sentence

A foundation model for robotics is a machine learning model trained at scale on robot demonstration data, capable of producing action sequences (motor commands, manipulator trajectories, locomotion patterns) in response to camera, language, and sensor input. The category extends the large-language-model paradigm from text generation to physical action prediction.

The defining architectural pattern is vision-language-action (VLA): input is multimodal (images, optional language instructions, proprioceptive sensor data); output is action tokens (motor commands the robot hardware can execute). VLA models are trained on combinations of teleoperated demonstrations (human operators drive the robot through tasks, and the resulting trajectories become training data), previously recorded robot demonstrations (used for behavior cloning, in which the model learns to imitate the recorded actions), and simulation rollouts. Some architectures incorporate reinforcement learning at later training stages; the dominant 2025-2026 paradigm trains primarily on demonstrations.
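The input and output shapes described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `Observation`, `Action`, and `toy_policy` are hypothetical, and the "policy" is a stub that holds the current pose rather than a trained network.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Observation:
    """One multimodal input frame (all field names are illustrative)."""
    image: Sequence[Sequence[float]]   # camera frame, here a tiny 2-D grid
    proprioception: Sequence[float]    # e.g. joint angles
    instruction: Optional[str] = None  # the language input is optional


@dataclass
class Action:
    """One action token decoded into joint-position targets."""
    joint_targets: List[float]


def toy_policy(obs: Observation, horizon: int = 3) -> List[Action]:
    """Stand-in for a trained VLA: repeats the current pose for `horizon` steps.

    A real model would run a learned network here; this stub only
    shows what goes in and what comes out.
    """
    return [Action(joint_targets=list(obs.proprioception)) for _ in range(horizon)]


obs = Observation(
    image=[[0.0, 0.1], [0.2, 0.3]],
    proprioception=[0.0, 0.5, -0.5],
    instruction="pick up the cup",
)
plan = toy_policy(obs)  # a short action sequence, one Action per step
```

The point of the sketch is the signature: images, proprioception, and an optional instruction go in; a sequence of hardware-executable targets comes out.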

How VLA models differ from LLMs

For analysts evaluating where the brain-provider tier of robotics sits relative to the AI-research mainstream:

Output space: LLMs produce text tokens; VLAs produce action tokens (motor commands, trajectory waypoints, manipulator targets). The action space is structured by the robot's hardware envelope.
Training data: LLMs train on internet text; VLAs train on robot demonstrations. The data acquisition cost is substantially higher for VLAs because each demonstration requires a real robot operating in a physical environment (or a high-fidelity simulator that produces transferable data).
Deployment: LLMs run as cloud inference; VLAs typically run embedded on the robot hardware because the latency budget for physical control is tight (single-digit milliseconds vs LLM response latency in hundreds of milliseconds). Some hybrid architectures separate slow-loop planning (cloud) from fast-loop control (embedded).
Verification posture: LLM capability claims operate against text-generation benchmarks (MMLU, GSM8K, HumanEval); VLA capability claims operate against task-completion benchmarks (success rate on manipulation tasks; long-horizon coordination; cross-environment generalization). Per DEPLOY's verified-vs-claimed framework on capability claims, VLA verification requires physical demonstration that benchmark scores alone do not provide.
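The hybrid slow-loop/fast-loop split mentioned under Deployment can be illustrated with a toy simulation. This is a sketch under stated assumptions, not any vendor's architecture: the waypoint list, the proportional controller, the gain, and the replanning interval are all invented for illustration, and ticks stand in for real time.

```python
from typing import List, Tuple


def fast_controller(position: float, target: float, gain: float = 0.5) -> float:
    """Fast loop: runs every tick and moves a fraction of the way to the target."""
    return position + gain * (target - position)


def run_hybrid(waypoints: List[float], ticks: int, plan_every: int) -> Tuple[float, int]:
    """Simulate a slow planner that hands a new waypoint to the fast loop.

    The planner fires once every `plan_every` ticks (the slow, possibly
    remote loop); the controller fires on every tick (the embedded loop).
    """
    position = 0.0
    target = 0.0
    replans = 0
    for tick in range(ticks):
        if tick % plan_every == 0 and replans < len(waypoints):
            target = waypoints[replans]  # slow loop: pick the next subgoal
            replans += 1
        position = fast_controller(position, target)  # fast loop: every tick
    return position, replans


final_position, replans = run_hybrid(waypoints=[1.0, 2.0], ticks=20, plan_every=10)
```

In this toy run the planner fires twice while the controller fires twenty times, which is the essential asymmetry: the control loop cannot wait on the planner.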

The structural distinction matters because verifying brain-provider claims follows a different evaluation rhythm than verifying LLM claims: physical task trials are slower and costlier to run than text benchmarks. A VLA model that scores well on academic benchmarks may not transfer to commercial deployment without additional engineering work.
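To make the benchmark-versus-deployment gap concrete, here is a minimal sketch of how per-environment task success rates might be tabulated. The trial data is made up for illustration, and `success_rates` is a hypothetical helper, not part of any published benchmark.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rates(trials: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of successful trials per environment."""
    counts = defaultdict(lambda: [0, 0])  # env -> [successes, total]
    for env, succeeded in trials:
        counts[env][1] += 1
        if succeeded:
            counts[env][0] += 1
    return {env: wins / total for env, (wins, total) in counts.items()}


# Invented trial log: (environment, did the task succeed?)
trials = [
    ("lab", True), ("lab", True), ("lab", True), ("lab", False),
    ("warehouse", True), ("warehouse", False),
    ("warehouse", False), ("warehouse", False),
]
rates = success_rates(trials)
transfer_gap = rates["lab"] - rates["warehouse"]
```

A large `transfer_gap` between the environment a model was tuned in and the one it ships into is the kind of signal a benchmark score alone would hide.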

Who is building foundation models for robotics

The brain-provider tier of robotics has multiple distinct strategic theses operating in parallel:

Skild AI: Pittsburgh-based; CMU heritage; cross-platform general-purpose brain thesis (single VLA model targeted at multiple robot platforms rather than platform-specific).
Physical Intelligence: UC Berkeley lineage through Sergey Levine's research; transformer-based VLA models (Pi-0; Pi-0.5); research-publication emphasis.
Covariant: UC Berkeley + Pieter Abbeel research lineage; warehouse-automation specialization; AWS partnership context.
Google DeepMind: large research organization with RT-2 (an earlier VLA from the Robotics Transformer research line) and Gemini Robotics (extending Gemini multimodal models to embodied action). Operates across AV and humanoid contexts.
OpenAI Robotics: relaunched May 2026 after the Dactyl team dissolved around 2021; its current scope is still developing under Aditya Ramesh's leadership, as a continuation of the Worldsim research line toward embodiment.
NVIDIA Project GR00T: VLA foundation model targeting humanoid platforms; cross-platform integration thesis aligned with NVIDIA's broader robotics-stack play (Isaac Sim, Jetson hardware, Omniverse).
Meta AI (FAIR + Reality Labs): research arm with publications on embodied AI; less consumer-deployment focus than peer brain providers.

The cohort operates at different commercial-maturity tiers. DEPLOY's frontier-AI-labs-entering-robotics piece covers the OpenAI + DeepMind + Meta cluster specifically; this category piece covers the broader brain-provider tier including pure-play VLA companies (Skild, Physical Intelligence, Covariant).

Why teleop disclosure feeds training data

A bridge between foundation models for robotics and the broader humanoid-cohort teleop disclosure work: teleoperated demonstrations are the dominant training data source for VLA models. When 1X NEO operates with explicit Expert Mode teleop, the operator sessions produce trajectory data that feeds the next-generation autonomy training pipeline. Teleop is not a workaround for missing autonomy; it is the engineered data acquisition layer that produces the demonstrations VLA models need.
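As a rough illustration of how teleop sessions could become training data, the sketch below flattens recorded sessions into (observation, action) pairs of the kind used for imitation learning. The `TeleopFrame` structure and the rule of dropping frames with no operator command are assumptions made for this example; the source does not describe any manufacturer's actual pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TeleopFrame:
    """One logged timestep of a teleop session (hypothetical schema)."""
    observation: List[float]                 # what the robot sensed
    operator_command: Optional[List[float]]  # None when the operator was idle


def sessions_to_pairs(
    sessions: List[List[TeleopFrame]],
) -> List[Tuple[List[float], List[float]]]:
    """Flatten sessions into (observation, action) training pairs.

    Frames with no operator command are skipped, since they carry
    no demonstrated action to imitate.
    """
    pairs = []
    for session in sessions:
        for frame in session:
            if frame.operator_command is None:
                continue
            pairs.append((frame.observation, frame.operator_command))
    return pairs


sessions = [
    [TeleopFrame([0.0, 0.0], [0.1, 0.0]), TeleopFrame([0.1, 0.0], None)],
    [TeleopFrame([0.1, 0.0], [0.1, 0.2])],
]
dataset = sessions_to_pairs(sessions)
```

Each pair records what the robot observed and what the human operator did in response, which is exactly the supervision a demonstration-trained model needs.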

The framework treats these as two linked layers: teleoperation disclosure across humanoid manufacturers is the visible operational layer; foundation-model training is the downstream consumer of the demonstration data that the operational layer produces. Manufacturers that disclose teleop transparently (1X) generate verification-grade training data their own brain-development pipeline can use.

What the framework verifies and what it does not

Applying DEPLOY's verified-vs-claimed framework to foundation models for robotics:

Research output verified: published papers + benchmark scores + demonstration videos document VLA capability at research-and-demonstration scale.
Cross-platform deployment claimed at varying depth: some brain providers (Skild) frame cross-platform deployment as the central thesis; others (Physical Intelligence) frame research-publication progress as the primary verification. Cross-platform commercial-deployment depth varies.
Commercial-scale verification developing: brain-provider tier commercial deployment lags behind humanoid OEM commercial deployment. The verification surfaces are different: OEM deployments anchor at customer facilities (BMW Spartanburg; GXO Flowery Branch); brain-provider deployments anchor at integration partnerships with OEMs, which the cohort has surfaced at varying disclosure depth.
Cap-flag application: per-model commercial-deployment counts, per-partner integration depth, and verified cross-platform transfer are all framework cap-flag candidates for most brain providers. The cap-flag is the editorial truth, not a gap.

Where to go for context

For the structural framework distinguishing brain-provider tier from OEM-platform tier, see brain-provider tier vs OEM-platform tier distinction. For the brain-provider landscape comparison across companies, see brain-provider landscape comparison. For Skild AI specifically as a brain-provider exemplar, see what is Skild AI.

For broader frontier AI labs entering robotics, see why are frontier AI labs entering robotics and the OpenAI Robotics relaunch signal. For the framework DEPLOY applies to capability claims across humanoid makers, see how DEPLOY verifies capability claims.
