Methodology

How DEPLOY verifies capability claims: Demo evidence versus deployed evidence

A capability claim is a different artifact than a deployment record. The framework partitions the two so operators evaluating a maker's "our robot can do X" statement can ask the specific verification question that matches the claim.

Every humanoid maker can show a robot doing something impressive on video. A demonstration video is, by definition, a controlled artifact: the maker selected the task, the environment, the take, and the framing. A deployed-capability record is a different artifact: the customer chose what work to put in front of the robot, the operating envelope decided what variation to throw at it, and the production-floor or warehouse-floor outcome decided what the robot actually delivered. Both are editorially meaningful; they are not the same verification surface. This page explains how DEPLOY tests which surface a given capability claim anchors against.

The methodology rests on the canonical constructs defined at /methodology/what-verified-means: the verified-vs-claimed boundary (counterparty anchoring as the verification posture), operating-envelope precision (verification is scoped to the envelope it was verified within), and source discipline (primary-source verification at three distinct editorial functions). Capability claims apply the same constructs at a different layer than deployment-status or pricing claims; the constructs are general, the application is capability-specific.

The demo-versus-deployment distinction

A demonstration shows what the robot did once, in controlled conditions, with the maker selecting both the task and the framing. A deployment shows what the robot does repeatedly, in customer conditions, with the customer accepting the output as production-grade work. The two are editorially distinct in three structural ways. First, the counterparty: a demo is maker-anchored, a deployment is counterparty-anchored (a customer of record taking the output as part of their own operations). Second, the operating envelope: a demo runs in the envelope the maker chose, a deployment runs in the envelope the customer's workflow decides. Third, the repeatability: a demo is a single artifact, a deployment is a sustained operating record.

Manufacturers frequently use demo evidence to support deployment-grade language ("our robot performs X in production environments"). The framework partitions the claim. If the evidence is a controlled demonstration, the claim is a capability demo; the maker can show what the robot is capable of doing under selected conditions. If the evidence is a customer-of-record deployment with counterparty confirmation, the claim is deployed capability; the maker has produced a verifiable record of the robot doing the work in customer conditions. Both can be true of the same robot at the same time; they are not editorially equivalent.
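The partition can be sketched as a small classifier. This is a minimal illustration and not DEPLOY's actual tooling: the names `Evidence`, `ClaimType`, `classify`, and `postures`, and the two boolean fields, are hypothetical and chosen only to mirror the wording above.

```python
from dataclasses import dataclass
from enum import Enum


class ClaimType(Enum):
    CAPABILITY_DEMO = "capability demo"
    DEPLOYED_CAPABILITY = "deployed capability"


@dataclass(frozen=True)
class Evidence:
    """One piece of evidence offered for a capability claim.

    Field names are hypothetical, chosen for this sketch only.
    """
    description: str
    customer_of_record: bool      # a named customer takes the output into its operations
    counterparty_confirmed: bool  # a source other than the maker confirms the work


def classify(evidence: Evidence) -> ClaimType:
    """Partition a single piece of evidence into demo or deployed capability."""
    if evidence.customer_of_record and evidence.counterparty_confirmed:
        return ClaimType.DEPLOYED_CAPABILITY
    # Evidence anchored to the maker alone supports only a capability demo.
    return ClaimType.CAPABILITY_DEMO


def postures(items: list[Evidence]) -> set[ClaimType]:
    """A robot can hold both postures at once; collect every one that is supported."""
    return {classify(item) for item in items}


demo_video = Evidence("maker-published demonstration video", False, False)
customer_release = Evidence("customer press release naming the work", True, True)
print(postures([demo_video, customer_release]))
```

Classifying each piece of evidence separately, rather than the robot as a whole, keeps the two postures from collapsing into one label when both are present.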

Single-task demonstration versus general-purpose claim

A capability demonstration almost always shows a single task: the robot manipulating an egg, sorting a tote, walking uphill, climbing stairs. The task is specific, the success state is observable, and the demo's editorial subject is the task. A general-purpose capability claim is broader: the robot can handle warehouse work, the robot can perform household chores, the robot can operate in any unstructured environment. The framework treats the two as separate verification surfaces even when they appear in the same maker communication.

A single-task demonstration verifies a single task. The generalization from the demonstrated task to the general-purpose claim is itself a separate assertion, and the assertion requires its own verification evidence. Did the robot perform the task across variations in the input conditions? Did the robot perform other tasks in the same envelope at the same success rate? Did a counterparty accept the robot's work across a multi-task envelope? Without generalization evidence, a single-task demonstration anchors a single-task capability claim, not a general-purpose capability claim. DEPLOY's registry surfaces flag capability scope per task and per envelope, not as the generalized claim.
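One way to picture per-task, per-envelope scoping is as a set difference between what a claim covers and what the evidence covers. The sketch below assumes a general-purpose claim can be expanded into explicit (task, envelope) pairs, which is a simplification; `Scope`, `unverified_scope`, and the example task and envelope names are hypothetical.

```python
from typing import NamedTuple


class Scope(NamedTuple):
    """A capability scoped to one task in one operating envelope (illustrative)."""
    task: str
    envelope: str


def unverified_scope(claimed: set[Scope], verified: set[Scope]) -> set[Scope]:
    """Return the part of a claim that no evidence has anchored yet."""
    return claimed - verified


# Hypothetical example: one demonstrated task in the maker's lab.
verified = {Scope("tote sorting", "maker lab")}

# Hypothetical expansion of a broader "handles warehouse work" claim.
claimed = {
    Scope("tote sorting", "maker lab"),
    Scope("tote sorting", "customer warehouse"),
    Scope("pallet wrapping", "customer warehouse"),
}

gaps = unverified_scope(claimed, verified)
for gap in sorted(gaps):
    print(f"needs its own evidence: {gap.task} @ {gap.envelope}")
```

The demonstrated pair stays verified; every other pair in the broader claim remains a separate assertion until evidence for it appears.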

Lab environment versus real-world deployment

Operating-envelope precision applies to capability evidence the same way it applies to deployment claims. A robot performing a task in a maker's lab environment has verified the task in the lab envelope; the same task in a customer's warehouse, factory, or operational site is a different verification artifact. The lab envelope controls lighting, floor surface, object positioning, ambient noise, schedule, and human-in-loop posture. The real-world envelope does not. A capability that clears in the lab envelope has not by that fact cleared in the deployment envelope; the framework's worked example at /methodology/operating-envelope-precision shows the same logic applied to autonomous-freight envelopes.

DEPLOY's source discipline asks where the capability evidence came from. A maker's published demonstration video anchors lab-envelope capability. A customer-of-record confirmation (a press release from the customer naming the work the robot performed, an earnings-call mention, an investor-disclosure line item, an independent journalist's on-site reporting at the customer facility) anchors deployment-envelope capability. The two kinds of evidence are not interchangeable; deployment-envelope verification requires the counterparty surface, not the maker surface alone.

Worked examples across DEPLOY's coverage

The Tesla We Robot event in October 2024 is the canonical reference for capability-demo framing without disclosure. On-site Optimus units mingled with attendees and served drinks; the framing was autonomous capability. Within four days, Bloomberg confirmed the demonstrations were teleoperated by Tesla employees stationed off-camera. The teleoperation itself is standard practice in humanoid development; the framework's editorial subject is the framing gap, not the teleoperation. The signal anchors the demo-versus-deployment distinction in DEPLOY's corpus.

Tesla Optimus Gen 2's December 2023 announcement anchors the single-task-versus-general-purpose distinction. The egg-handling demonstration was a single-task artifact; the framing positioned the demonstration as trajectory evidence toward general-purpose commercial deployment. The hardware advances were real (30 percent faster walking, 11 degrees of freedom in redesigned tactile hands). The trajectory claim created editorial accountability, and subsequent events have been measured against it; the Gen 2 framing's verification anchor was a single-task demonstration, and the general-purpose trajectory claim has not yet anchored to deployment-envelope evidence.

The deployed-capability counter-example is Agility's Digit at GXO Flowery Branch: 100,000 totes moved in live fulfillment under a multi-year Robots-as-a-Service contract, verified by the customer of record (GXO Logistics) and confirmed across independent trade press. The capability claim anchors not to a demonstration but to a deployment record at a single customer envelope. The framework's verification posture is deployed-capability-verified-at-a-single-customer-envelope; the claim does not generalize automatically to multi-customer deployment until additional counterparty evidence accumulates.

The end-product verification standard appears at Figure 02 at BMW Spartanburg: 30,000 BMW X3 vehicles built across an 11-month live production deployment, with chassis assembly work accepted into BMW's normal quality-control process. The verification bar is end-product automotive-OEM acceptance, structurally higher than logistics cycle-counting because the customer's downstream warranty chain depends on the work the robot performed. The signal anchors deployed-capability-verified-at-OEM-acceptance for chassis-assembly tasks in the automotive-production envelope.

The enterprise-breadth pattern appears at Apptronik Apollo across Mercedes-Benz, GXO, and Jabil: three Fortune-500 enterprise pilots locked in before any single deployment has produced scaled-throughput data. DEPLOY's registry surfaces the contracts and the capital as verified-by-counterparty; the per-deployment throughput data is cap-flagged pending evidence. The cap-flag is the published verification posture: the contracts are real, the throughput data is not yet present, and DEPLOY does not estimate the gap. Applying the framework recursively, the cap-flag on Apollo throughput is the same discipline DEPLOY asks operators to apply to any maker's enterprise-breadth claim: confirm the contracts, then ask separately whether per-deployment throughput evidence has surfaced.
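The cap-flag posture can be sketched as a record whose throughput field stays empty until evidence surfaces, with no estimate filled in. This is an illustration only and not DEPLOY's registry schema: `DeploymentEntry`, `throughput_posture`, and the field names are hypothetical, and the figures are the ones quoted in the worked examples above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeploymentEntry:
    """Illustrative registry row; not an actual DEPLOY schema."""
    maker: str
    customer: str
    contract_verified: bool
    throughput_units: Optional[int] = None  # None means no evidence has surfaced
    unit: str = ""


def throughput_posture(entry: DeploymentEntry) -> str:
    """Report evidenced throughput, or the cap-flag when there is none."""
    if entry.throughput_units is None:
        # Deliberately no fallback estimate: the gap is published as a gap.
        return "cap-flagged pending evidence"
    return f"{entry.throughput_units:,} {entry.unit}"


apollo = DeploymentEntry("Apptronik Apollo", "Mercedes-Benz", contract_verified=True)
digit = DeploymentEntry(
    "Agility Digit", "GXO Flowery Branch", contract_verified=True,
    throughput_units=100_000, unit="totes",
)

for entry in (apollo, digit):
    print(f"{entry.maker} at {entry.customer}: {throughput_posture(entry)}")
```

Keeping the contract flag and the throughput field separate mirrors the two questions operators are asked to pose in sequence: whether the contract is confirmed, and whether per-deployment throughput evidence exists.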

Where to go next

Operators evaluating a specific maker's capability claim can start with the registry's structured-data view at verified humanoid companies, which catalogs the verification posture per maker across the four framework anchors. The framework's full mechanics live at the verification-framework page. The capability-specific worked examples cross-link to the foundational signals that anchor each verification posture in DEPLOY's canon.

What 'verified' means at DEPLOY → the canonical reference for verified-vs-claimed boundary, operating-envelope precision, cap-flag-as-trust-signal, and source discipline.
DEPLOY's verification framework: the four anchors → counterparty, operating envelope, absence of human-in-loop, repeatability; the structural test operators apply when evaluating any maker's commercial-deployment claim.
Why operating envelope matters in autonomous freight → three commercial autonomous-freight deployments at three structurally distinct envelopes; the envelope-precision construct applied to a different layer.
The verified-vs-claimed framework canon → nine anchored angles plus two pending; each with diagnostic question, decision criteria, and worked-example signal.