Trio: A World Model for the Physical World
A status update from MachineFi Lab on Trio — what we’ve built, what’s running now, and how to get your cameras on it.
For all of history, the physical world has been run by people. A person watches what’s happening, judges what it means, and acts on it — drives the truck, works the line, walks the floor. Perceive, predict, act: that loop has always needed a human in it.
AI changed the digital world first — language, code, images. Now it’s starting on the physical one. An AI that drives a car through live traffic. An AI that learns a video game by imagining how it plays. A robot that folds a pile of laundry. The part underneath all of them — the thing that lets a machine watch a situation, imagine what happens next, and act on it — is a world model. Trio is another kind of world model.
Those are narrow on purpose: one car, one game, one robot, one task. But the largest physical surface of all is already wired and watching — the cameras over every warehouse, store, factory, and care floor, recording thousands of hours that today produce almost nothing but footage to pull up after something goes wrong. A world model that ran on those — on whole operations, live — is the opportunity. It’s what Trio is built for.
What Trio Is
Notice what those examples have in common: each runs one thing, whether one car, one game, or one robot. None of them runs an operation. And that’s where most of the physical economy actually lives: a restaurant at the lunch rush, a car wash cycling cars through its bays, a warehouse loading trucks, a store working its floor, a factory line. These are places with dozens of people, vehicles, and machines moving at once, around the clock, all on cameras nobody has time to watch.
That’s what Trio is for. Trio is our world-model platform for physical operations — not a single monolithic model, but a suite of three products that together perceive, predict, and act on a live operation. Where a language model learns how text works, Trio learns how a place works — what’s in it, how it moves, what happens next — for your operation, from the cameras you already have. We don’t replace language models; we give them the physical world.
Trio runs that loop in three stages — and it ships in that order. Perception is live today; foresight and action are what’s next.
Today, two of the three products are live and in your hands. Trio-Retina (See) turns any camera feed into one standard, live read of what’s happening: who’s where, what they’re doing, where they’re headed. Trio-Lumen (Understand) makes that read programmable in plain English (“flag anyone in the loading dock after hours”), watching every frame around the clock and turning it into events and alerts. Perception and understanding are both shipping today.
pip install trio-retina
Trio-Retina is open source: run it on your own machine, or try it live in the Playground →
Those two are the foundation the rest is built on. Foresight and action — anticipating trouble before it happens, then acting on the floor — are the next stages of the loop. The order is deliberate: you can’t foresee what you can’t yet see, so we built sight first.
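To make the plain-English rule quoted above concrete, here is a minimal sketch of what “flag anyone in the loading dock after hours” could reduce to once a perception layer has produced tracked objects. Everything in it is an assumption for illustration: `TrackedObject`, the zone name, the opening hours, and the alert format are hypothetical and are not the trio-retina or Trio-Lumen API.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class TrackedObject:
    """One entry in a hypothetical perception stream (not the real schema)."""
    track_id: int
    label: str         # e.g. "person", "forklift"
    zone: str          # named area the object is currently in
    seen_at: datetime  # when the observation was made


# Assumed working hours for the example site.
OPEN = time(6, 0)
CLOSE = time(20, 0)


def after_hours(ts: datetime) -> bool:
    """True when the timestamp falls outside the assumed working hours."""
    return not (OPEN <= ts.time() < CLOSE)


def loading_dock_after_hours(obj: TrackedObject) -> bool:
    """Hand-written equivalent of: 'flag anyone in the loading dock after hours'."""
    return (
        obj.label == "person"
        and obj.zone == "loading_dock"
        and after_hours(obj.seen_at)
    )


def events(stream):
    """Yield one alert dict for every observation that matches the rule."""
    for obj in stream:
        if loading_dock_after_hours(obj):
            yield {
                "rule": "loading_dock_after_hours",
                "track_id": obj.track_id,
                "at": obj.seen_at.isoformat(),
            }


stream = [
    TrackedObject(1, "person", "loading_dock", datetime(2025, 1, 6, 14, 30)),
    TrackedObject(2, "person", "loading_dock", datetime(2025, 1, 6, 23, 5)),
    TrackedObject(3, "forklift", "loading_dock", datetime(2025, 1, 6, 23, 6)),
]
alerts = list(events(stream))
```

Only the second observation matches: the first is inside working hours and the third is not a person. The point of the sketch is the shape of the output, a structured event rather than footage.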
A model trained on the open internet learns how the world looks. Trio learns how your operation runs.
What It Looks Like in One Warehouse
Strip away the abstraction. A loading dock, mid-shift. A forklift backs out of a bay; a worker steps out from between two racks on a path that crosses it. Neither can see the other yet.
See — Trio-Retina, running on a small box next to the camera, already has both as tracked objects: the forklift and the person, their positions, and where each is heading.
Foresee — Trio’s world model rolls the next two seconds forward. The two paths intersect. It has seen this exact geometry end badly before.
Act — a deterministic edge safety gate fires the intersection alarm in about 50 milliseconds — faster than either person could react — and the forklift is signaled to stop. A near-miss instead of an incident report.
That’s the whole thesis in a single frame: not footage you pull up after something happens, but a decision made the instant before it does.
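The Foresee and Act steps above can be approximated with very little code. The sketch below swaps Trio’s learned world model for a plain constant-velocity extrapolation, which is an assumption made here for illustration: it projects both tracks forward over a two-second horizon, finds their closest approach, and raises an alarm if they come within a safety radius. The `Track` type, the 1.5 m radius, and the coordinates are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    """Position in metres and velocity in metres per second (hypothetical type)."""
    label: str
    x: float
    y: float
    vx: float
    vy: float


def closest_approach(a: Track, b: Track, horizon_s: float = 2.0):
    """Return (time, distance) of closest approach within the horizon,
    assuming both tracks keep their current velocity."""
    px, py = a.x - b.x, a.y - b.y      # relative position
    vx, vy = a.vx - b.vx, a.vy - b.vy  # relative velocity
    speed_sq = vx * vx + vy * vy
    # Unconstrained minimiser of |p + v t|, then clamped to [0, horizon].
    t = 0.0 if speed_sq == 0 else -(px * vx + py * vy) / speed_sq
    t = max(0.0, min(horizon_s, t))
    return t, math.hypot(px + vx * t, py + vy * t)


def should_alarm(a: Track, b: Track, horizon_s: float = 2.0,
                 safety_radius_m: float = 1.5) -> bool:
    """Deterministic gate: alarm when the predicted gap drops below the radius."""
    _, dist = closest_approach(a, b, horizon_s)
    return dist < safety_radius_m


forklift = Track("forklift", x=0.0, y=0.0, vx=2.0, vy=0.0)   # backing out along x
worker = Track("person", x=3.0, y=-1.5, vx=0.0, vy=1.0)      # stepping out along y
bystander = Track("person", x=3.0, y=-10.0, vx=0.0, vy=0.0)  # standing well clear

t, d = closest_approach(forklift, worker)
```

With these numbers the two paths meet 1.5 seconds out, so `should_alarm(forklift, worker)` is true while the bystander never triggers. Because the check is a handful of arithmetic operations with no model call, a gate of this kind is easy to keep deterministic and fast, which is the property the Act step depends on.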
A Real World Model — and How Ours Is Different
Trio sits inside a fast-moving field. World models are where a lot of AI’s best minds are now pointing. The idea traces to Ha & Schmidhuber’s World Models (2018) — an agent learning a compact model of its environment and “dreaming” rollouts inside it. Yann LeCun argues a predictive world model in latent space (his JEPA) is the missing piece on the path to autonomous machine intelligence; Fei-Fei Li calls the frontier spatial intelligence, and her World Labs builds models that generate explorable 3D worlds. The field roughly splits into camps:
- Latent prediction — V-JEPA 2 (Meta) and the Dreamer line learn dynamics in latent space and plan inside them.
- Generative & interactive worlds — Genie 3 (DeepMind), NVIDIA Cosmos, and World Labs’ Marble imagine and generate environments.
- Driving — Tesla FSD and Wayve’s GAIA-2 run the most deployed world models on Earth — for one car.
- Robotics — Physical Intelligence, Skild AI, and Figure build foundation models for a single robot.
Almost all of them either imagine or simulate a world, or model a single agent’s egocentric domain — one car, one robot. Trio is the one that runs on live, real, third-person operations that already exist — a whole warehouse or store, many people and machines at once — and acts on them in real time.
| World model | Optimizes for | How Trio differs |
|---|---|---|
| JEPA · V-JEPA (LeCun) | learning general world models in latent space — research | a deployed product on live operations; specialized, not an architecture |
| World Labs (Fei-Fei Li) | generating & reconstructing explorable 3D worlds | reads the world your cameras already see; doesn’t generate one |
| Genie · Cosmos | imagining & simulating environments | decides in real time on spaces that already exist |
| Tesla FSD | driving one car — egocentric, single domain | third-person, multi-entity, a whole operation, many domains |
| Physical Intelligence · Figure · Skild | one robot, one task | reasons about what a whole operation should do next |
Two axes set Trio apart. The first is technical: Trio is small, fast, and specialized. It runs in real time at the edge, with a floor near $0.004 per query, billed per decision. It pairs a frozen foundation with small per-site adapters (LoRA, trained in GPU-hours) rather than re-running one giant general model on every frame. On the OVBench streaming benchmark, wrapping an open-weights model in Trio’s stack lifts accuracy by 2.3 points from architecture alone, and its perception streams without the fixed minute limits that frontier models cap out at. The second is scenario: Trio runs on operations that already exist and acts on them now, instead of imagining a world, driving one car, or moving one robot.
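A quick piece of arithmetic shows why a LoRA adapter is cheap to train next to the frozen foundation it sits on. LoRA leaves a weight matrix of shape d_out × d_in untouched and learns a low-rank update from two small matrices of shapes d_out × r and r × d_in. The layer size and rank below are assumed values chosen for illustration, not Trio’s actual configuration.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in one dense weight matrix (what a full fine-tune would update)."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in a LoRA update: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


# Assumed sizes for illustration only.
d_out = d_in = 4096
rank = 16

full = full_params(d_out, d_in)          # 16,777,216
adapter = lora_params(d_out, d_in, rank)  # 131,072

print(f"trainable share per layer: {adapter / full:.2%}")  # 0.78%
```

Under these assumptions the adapter trains under one percent of the parameters in each adapted layer, which is what makes one adapter per site practical.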
How Trio Is Built
For the technical teams: here’s how Trio stays fast and cheap enough to run on every camera, all day. If you’re here for the operations story, skim ahead — the payoff is the last line.
Five principles hold the system together: every interface between layers is a strongly-typed, inspectable scene graph (never an opaque vector); a router owns cost, running the cheap layers continuously and waking the expensive reasoning only when needed; tools are bidirectional, so the reasoning layer can command the lower layers to re-examine or re-simulate; every decision ships with its evidence, so an operator can inspect, contest, and override it; and the foundation models stay frozen while small per-deployment adapters — LoRA modules and a cross-tier fusion adapter, trained in GPU-hours rather than a full retrain — specialize each site.
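Three of these principles (typed scene graphs between layers, a router that owns cost, and decisions that ship with their evidence) can be sketched in a few lines. This is a minimal illustration under assumptions, not Trio’s implementation: `Entity`, `Relation`, `SceneGraph`, `Decision`, and `Router` are hypothetical names, and the “expensive reasoning” is stood in for by an ordinary function.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Entity:
    entity_id: int
    label: str
    zone: str


@dataclass(frozen=True)
class Relation:
    subject: int    # entity_id of the acting entity
    predicate: str  # e.g. "approaching"
    object: int     # entity_id of the other entity


@dataclass(frozen=True)
class SceneGraph:
    """Typed, inspectable interface between layers (no opaque vectors)."""
    entities: Tuple[Entity, ...]
    relations: Tuple[Relation, ...] = ()


@dataclass(frozen=True)
class Decision:
    action: str
    evidence: Tuple[Relation, ...]  # shipped with the decision for operator review


class Router:
    """Runs a cheap check on every scene and wakes the expensive step only on demand."""

    def __init__(self,
                 needs_reasoning: Callable[[SceneGraph], bool],
                 reason: Callable[[SceneGraph], Decision]) -> None:
        self.needs_reasoning = needs_reasoning
        self.reason = reason
        self.expensive_calls = 0  # the cost the router is responsible for

    def step(self, scene: SceneGraph) -> Optional[Decision]:
        if not self.needs_reasoning(scene):
            return None
        self.expensive_calls += 1
        return self.reason(scene)


def has_approach(scene: SceneGraph) -> bool:
    return any(r.predicate == "approaching" for r in scene.relations)


def stop_on_approach(scene: SceneGraph) -> Decision:
    hits = tuple(r for r in scene.relations if r.predicate == "approaching")
    return Decision(action="signal_stop", evidence=hits)


router = Router(has_approach, stop_on_approach)
quiet = SceneGraph(entities=(Entity(1, "forklift", "bay_3"),))
busy = SceneGraph(
    entities=(Entity(1, "forklift", "bay_3"), Entity(2, "person", "bay_3")),
    relations=(Relation(1, "approaching", 2),),
)
results = [router.step(s) for s in (quiet, quiet, busy)]
```

Across the three scenes the expensive step runs once, and the decision it returns carries the relation that triggered it, so an operator can see why the stop was signaled.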
Those principles are realized as seven planes: six in the path of a single decision, plus governance across all.
Because perception and prediction run locally and only compact symbols and latents travel to the cloud — never raw video — Trio is billed per decision, not per token per frame.
Where Trio Runs
The warehouse was one frame. The same model points at any operation that runs on cameras, including the restaurant, the car wash, the store, and the factory we opened with. Today it works alongside human operators, surfacing what their existing systems miss.
What We’ve Built — and What’s Next
Trio is no longer a thesis on a whiteboard. The v1.0 technical report formalizes the full system — the perception–prediction–action stack, five principles, seven planes — with two fully worked reference domains (a car wash and a warehouse), down to the forklift-and-pedestrian near-miss above, caught by a deterministic edge safety gate that fires in about 50 milliseconds, well inside the 100 ms ceiling. Trio-Retina is open source (pip install trio-retina), and the Playground is live — open platform.machinefi.com/playground and watch Trio read real footage in your browser.
Three forces make now the moment: edge silicon can finally run real-time operational reasoning without a cloud round-trip; multi-entity scene understanding has crossed a research threshold single-object detection never approached; and the operators of physical environments are ready for what may be the most underpriced capability in AI today — a world model on top of the cameras they already own, without new hardware. From here, Trio grows up the loop — from seeing and understanding today toward foreseeing and, in time, acting on the floor.
Start with Trio today
Two ways in — both live right now:
Trio-Retina on GitHub
The open-source perception layer: a model-agnostic state layer that turns any detector into one standard stream of events plus latent state. Install it with pip install trio-retina and run it on your own machine.
Trio-Lumen on the platform
See an operation come alive in the browser: Trio reads real footage as objects with state and crowds as flow. Then point it at your own cameras and ask questions in plain English.
Try Trio-Lumen live →
The MachineFi Lab team
Further reading on world models
- D. Ha, J. Schmidhuber. World Models. 2018.
- Y. LeCun. A Path Towards Autonomous Machine Intelligence. 2022. (introduces JEPA)
- F.-F. Li. From Words to Worlds: Spatial Intelligence is AI’s Next Frontier. 2025. (World Labs)
- D. Hafner, W. Yan, T. Lillicrap. Training Agents Inside of Scalable World Models (DreamerV4). 2025.
- Meta AI. V-JEPA 2. 2025.
- DeepMind. Genie 3. 2025.
- NVIDIA. Cosmos World Foundation Model Platform for Physical AI. 2025.
- Wayve. GAIA-2: a controllable multi-camera world model for driving. 2025.