MachineFi Lab · Status update

Trio: A World Model for the Physical World

A status update from MachineFi Lab on Trio — what we’ve built, what’s running now, and how to get your cameras on it.

A Tesla on the highway with the Autopilot road visualization on the dash. Drive a car, play a game, fold laundry, build a world: each is a world model. Trio is another kind.

DreamerV4: a grid of game and control environments it learns to play.

For all of history, the physical world has been run by people. A person watches what’s happening, judges what it means, and acts on it — drives the truck, works the line, walks the floor. Perceive, predict, act: that loop has always needed a human in it.

AI changed the digital world first — language, code, images. Now it’s starting on the physical one. An AI that drives a car through live traffic. An AI that learns a video game by imagining how it plays. A robot that folds a pile of laundry. The part underneath all of them — the thing that lets a machine watch a situation, imagine what happens next, and act on it — is a world model. Trio is another kind of world model.

Those are narrow on purpose: one car, one game, one robot, one task. But the largest physical surface of all is already wired and watching — the cameras over every warehouse, store, factory, and care floor, recording thousands of hours that today produce almost nothing but footage to pull up after something goes wrong. A world model that ran on those — on whole operations, live — is the opportunity. It’s what Trio is built for.

What Trio Is

Notice what those examples have in common: each runs one thing, whether a car, a game, or a robot. None of them runs an operation. And that’s where most of the physical economy actually lives: a restaurant at the lunch rush, a car wash cycling cars through its bays, a warehouse loading trucks, a store working its floor, a factory line. These are places with dozens of people, vehicles, and machines moving at once, around the clock, all on cameras nobody has time to watch.

That’s what Trio is for. Trio is our world-model platform for physical operations — not a single monolithic model, but a suite of three products that together perceive, predict, and act on a live operation. Where a language model learns how text works, Trio learns how a place works — what’s in it, how it moves, what happens next — for your operation, from the cameras you already have. We don’t replace language models; we give them the physical world.

Trio runs that loop in three stages — and it ships in that order. Perception is live today; foresight and action are what’s next.

Perception ships today; foresight and action are the roadmap

Today, two products are live and in your hands. Trio-Retina (See) turns any camera feed into one standard, live read of what’s happening: who’s where, what they’re doing, where they’re headed. Trio-Lumen (Understand) makes that programmable in plain English, for example “flag anyone in the loading dock after hours”, watching every frame around the clock and turning it into events and alerts. Perception and understanding, shipping today.
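To make the plain-English rule concrete, here is a minimal sketch of what such a rule could reduce to once it is applied to a stream of tracked observations. This is not the Trio-Lumen or Trio-Retina API: the `Observation` schema, the `AfterHoursRule` class, the zone names, and the 06:00 to 22:00 operating hours are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Observation:
    """One tracked entity at one moment (hypothetical schema, not Trio's)."""
    track_id: str
    label: str       # e.g. "person", "forklift"
    zone: str        # e.g. "loading_dock"
    seen_at: datetime


@dataclass(frozen=True)
class AfterHoursRule:
    """Assumed structured form of 'flag anyone in the loading dock after hours'."""
    label: str
    zone: str
    open_at: time
    close_at: time

    def matches(self, obs: Observation) -> bool:
        t = obs.seen_at.time()
        # "After hours" means before opening or at/after closing.
        after_hours = t < self.open_at or t >= self.close_at
        return obs.label == self.label and obs.zone == self.zone and after_hours


def alerts(rule: AfterHoursRule, stream: list) -> list:
    """Return the observations that should raise an alert."""
    return [obs for obs in stream if rule.matches(obs)]


# Operating hours below are an assumption for the example.
rule = AfterHoursRule(label="person", zone="loading_dock",
                      open_at=time(6, 0), close_at=time(22, 0))
stream = [
    Observation("t1", "person", "loading_dock", datetime(2025, 1, 6, 14, 30)),    # during hours
    Observation("t2", "person", "loading_dock", datetime(2025, 1, 6, 23, 15)),    # after hours
    Observation("t3", "forklift", "loading_dock", datetime(2025, 1, 6, 23, 20)),  # not a person
    Observation("t4", "person", "aisle_4", datetime(2025, 1, 6, 23, 40)),         # wrong zone
]
flagged = alerts(rule, stream)
print([obs.track_id for obs in flagged])
```

Only the second observation satisfies all three conditions (a person, in the loading dock, outside operating hours), so it is the only one flagged.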

pip install trio-retina

Trio-Retina is open source: run it on your own machine, or try it live in the Playground.

Those two are the foundation the rest is built on. Foresight and action — anticipating trouble before it happens, then acting on the floor — are the next stages of the loop. The order is deliberate: you can’t foresee what you can’t yet see, so we built sight first.

A model trained on the open internet learns how the world looks. Trio learns how your operation runs.

What It Looks Like in One Warehouse

Strip away the abstraction. A loading dock, mid-shift. A forklift backs out of a bay; a worker steps out from between two racks on a path that crosses it. Neither can see the other yet.

A warehouse loading dock — a forklift and a worker on crossing paths

See — Trio-Retina, running on a small box next to the camera, already has both as tracked objects: the forklift and the person, their positions, and where each is heading.

Foresee — Trio’s world model rolls the next two seconds forward. The two paths intersect. It has seen this exact geometry end badly before.

Act — a deterministic edge safety gate fires the intersection alarm in about 50 milliseconds — faster than either person could react — and the forklift is signaled to stop. A near-miss instead of an incident report.
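The See, Foresee, Act steps above can be sketched in a few lines. This is a simplified stand-in, not Trio's world model: it swaps the learned rollout for a plain constant-velocity extrapolation over the same two-second horizon, and the `Track` type, the 1.5 m separation threshold, and the 0.1 s step are assumptions made for the example.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Track:
    """Floor-plane position (metres) and velocity (metres/second). Hypothetical type."""
    name: str
    x: float
    y: float
    vx: float
    vy: float

    def at(self, t: float):
        """Constant-velocity extrapolation of the position t seconds ahead."""
        return (self.x + self.vx * t, self.y + self.vy * t)


def closest_approach(a: Track, b: Track, horizon_s: float = 2.0, step_s: float = 0.1):
    """Roll both tracks forward and return (min_distance_m, time_s)."""
    best_d, best_t = math.inf, 0.0
    steps = int(round(horizon_s / step_s))
    for i in range(steps + 1):
        t = i * step_s
        ax, ay = a.at(t)
        bx, by = b.at(t)
        d = math.hypot(ax - bx, ay - by)
        if d < best_d:
            best_d, best_t = d, t
    return best_d, best_t


def safety_gate(a: Track, b: Track, min_separation_m: float = 1.5) -> bool:
    """Deterministic check: fire if the predicted paths come too close."""
    distance, _ = closest_approach(a, b)
    return distance < min_separation_m


# A forklift backing out along +x, and a worker stepping out along +y.
forklift = Track("forklift", x=0.0, y=0.0, vx=2.0, vy=0.0)
worker = Track("worker", x=3.0, y=-1.5, vx=0.0, vy=1.0)
# A second worker walking away from the bay, for contrast.
bystander = Track("bystander", x=3.0, y=5.0, vx=0.0, vy=1.0)

print(safety_gate(forklift, worker))     # paths meet about 1.5 s ahead
print(safety_gate(forklift, bystander))  # paths never come within 1.5 m
```

The gate itself is a fixed threshold on a predicted distance, with no model call in the decision path, which is the property that lets a check like this run within a tight latency budget.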

That’s the whole thesis in a single frame: not footage you pull up after something happens, but a decision made the instant before it does.

A Real World Model — and How Ours Is Different

Trio sits inside a fast-moving field. World models are where a lot of AI’s best minds are now pointing. The idea traces to Ha & Schmidhuber’s World Models (2018) — an agent learning a compact model of its environment and “dreaming” rollouts inside it. Yann LeCun argues a predictive world model in latent space (his JEPA) is the missing piece on the path to autonomous machine intelligence; Fei-Fei Li calls the frontier spatial intelligence, and her World Labs builds models that generate explorable 3D worlds. The field roughly splits into camps:

Latent prediction — V-JEPA 2 (Meta) and the Dreamer line learn dynamics in latent space and plan inside them.
Generative & interactive worlds — Genie 3 (DeepMind), NVIDIA Cosmos, and World Labs’ Marble imagine and generate environments.
Driving — Tesla FSD and Wayve’s GAIA-2 run the most deployed world models on Earth — for one car.
Robotics — Physical Intelligence, Skild AI, and Figure build foundation models for a single robot.

Almost all of them either imagine or simulate a world, or model a single agent’s egocentric domain — one car, one robot. Trio is the one that runs on live, real, third-person operations that already exist — a whole warehouse or store, many people and machines at once — and acts on them in real time.

World model	Optimizes for	How Trio differs
JEPA · V-JEPA (LeCun)	learning general world models in latent space — research	a deployed product on live operations; specialized, not an architecture
World Labs (Fei-Fei Li)	generating & reconstructing explorable 3D worlds	reads the world your cameras already see; doesn’t generate one
Genie · Cosmos	imagining & simulating environments	decides in real time on spaces that already exist
Tesla FSD	driving one car — egocentric, single domain	third-person, multi-entity, a whole operation, many domains
Physical Intelligence · Figure · Skild	one robot, one task	reasons about what a whole operation should do next

Two axes set Trio apart. The first is technical: Trio is small, fast, and specialized. It runs in real time at the edge, with a floor near $0.004 per query and billing per decision, and it uses a frozen foundation plus small per-site adapters (LoRA, trained in GPU-hours) rather than one giant general model re-run on every frame. On the OVBench streaming benchmark, wrapping an open-weights model in Trio’s stack lifts accuracy by 2.3 points from architecture alone, and its perception streams without the fixed minute limits that frontier models cap out at. The second is the scenario: Trio runs on operations that already exist and acts on them now, instead of imagining a world, driving one car, or moving one robot.
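For readers unfamiliar with LoRA, the "frozen foundation plus small adapter" idea can be shown with a toy example. The sketch below is generic LoRA arithmetic in plain Python, not Trio's training code: the frozen weight `W` stays fixed and only the low-rank pair `A` and `B` would be trained per site. The matrix sizes, the rank, and the 4096-wide layer used for the parameter count are illustrative assumptions.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(w_frozen, a, b, x, scale=1.0):
    """Compute y = W x + scale * B (A x).

    W is frozen; only A (rank x d_in) and B (d_out x rank) are trained.
    `scale` stands in for the alpha / rank factor used in common LoRA setups.
    """
    base = matvec(w_frozen, x)
    delta = matvec(b, matvec(a, x))
    return [y + scale * d for y, d in zip(base, delta)]


def full_params(d_out, d_in):
    """Parameters in the frozen dense layer."""
    return d_out * d_in


def adapter_params(d_out, d_in, rank):
    """Parameters in the low-rank adapter (A plus B)."""
    return rank * (d_in + d_out)


# Toy 3x3 frozen weight (identity) with a rank-1 adapter.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
A = [[1.0, 0.0, 0.0]]        # 1 x 3
B = [[0.0], [2.0], [0.0]]    # 3 x 1
x = [1.0, 2.0, 3.0]

y = lora_forward(W, A, B, x)
print(y)

# Illustrative size comparison for one 4096 x 4096 layer at rank 8 (assumed sizes).
ratio = full_params(4096, 4096) // adapter_params(4096, 4096, 8)
print(ratio)
```

For the assumed layer, the adapter holds 65,536 parameters against 16,777,216 in the frozen weight, a factor of 256 fewer, which is why per-site specialization can be measured in GPU-hours rather than a full retrain.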

How Trio Is Built

For the technical teams: here’s how Trio stays fast and cheap enough to run on every camera, all day. If you’re here for the operations story, skim ahead — the payoff is the last line.

Five principles hold the system together:

Typed interfaces: every interface between layers is a strongly typed, inspectable scene graph, never an opaque vector.
Router-owned cost: a router owns cost, running the cheap layers continuously and waking the expensive reasoning only when needed.
Bidirectional tools: the reasoning layer can command the lower layers to re-examine or re-simulate.
Evidence with every decision: every decision ships with its evidence, so an operator can inspect, contest, and override it.
Frozen foundations: the foundation models stay frozen while small per-deployment adapters (LoRA modules and a cross-tier fusion adapter, trained in GPU-hours rather than a full retrain) specialize each site.
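Three of these principles (typed scene graphs, a cost-owning router, and evidence attached to every decision) can be illustrated together. The sketch below is a hypothetical shape, not Trio's actual schema or router: every class name, field, and the `"approaching"` predicate is an assumption. Bidirectional tools and adapters are left out to keep it short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    entity_id: str
    kind: str      # e.g. "person", "forklift"
    zone: str


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str  # e.g. "approaching"
    obj: str


@dataclass(frozen=True)
class SceneGraph:
    """Typed, inspectable interface between layers (hypothetical schema)."""
    frame_ts: float
    entities: tuple
    relations: tuple


@dataclass(frozen=True)
class Decision:
    action: str
    evidence: tuple  # the relations an operator could inspect or contest


class Router:
    """Cheap check on every graph; the expensive reasoner runs only on a trigger."""

    def __init__(self, trigger_predicates, reasoner):
        self.trigger_predicates = frozenset(trigger_predicates)
        self.reasoner = reasoner
        self.expensive_calls = 0

    def handle(self, graph: SceneGraph):
        hits = tuple(r for r in graph.relations
                     if r.predicate in self.trigger_predicates)
        if not hits:
            return None  # cheap path: nothing worth reasoning about
        self.expensive_calls += 1
        return self.reasoner(graph, hits)


def toy_reasoner(graph: SceneGraph, hits: tuple) -> Decision:
    """Stand-in for the expensive reasoning layer."""
    return Decision(action="signal_stop", evidence=hits)


quiet = SceneGraph(0.0, (Entity("person-12", "person", "aisle_4"),), ())
busy = SceneGraph(
    0.1,
    (Entity("forklift-7", "forklift", "dock"), Entity("person-12", "person", "dock")),
    (Relation("forklift-7", "approaching", "person-12"),),
)

router = Router({"approaching"}, toy_reasoner)
first = router.handle(quiet)
second = router.handle(busy)
print(first, second.action, router.expensive_calls)
```

Because the decision carries the exact relations that triggered it, an operator reviewing `second` can see why the stop was signaled rather than trusting an opaque score.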

Those principles are realized as seven planes — six in the path of a single decision, plus governance across all:

How a decision flows through Trio — seven planes across the edge–cloud continuum

Because perception and prediction run locally and only compact symbols and latents travel to the cloud — never raw video — Trio is billed per decision, not per token per frame.
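A rough size comparison shows why sending symbols instead of pixels matters. The message layout below is hypothetical (the field names, the four-value latent, and the site name are assumptions), and the comparison is against a single uncompressed 1080p RGB frame, so it overstates the gap relative to a compressed video stream. It is an order-of-magnitude illustration, not a measurement of Trio.

```python
import json


def scene_payload_bytes(message: dict) -> int:
    """Size of a compact JSON encoding of one scene message."""
    return len(json.dumps(message, separators=(",", ":")).encode("utf-8"))


def raw_frame_bytes(width: int, height: int, channels: int = 3) -> int:
    """Size of one uncompressed frame at 8 bits per channel."""
    return width * height * channels


# Hypothetical message: two tracked entities plus a tiny latent vector.
message = {
    "ts": 1736170215.25,
    "site": "dock-3",
    "entities": [
        {"id": "forklift-7", "kind": "forklift", "xy": [0.0, 0.0], "v": [2.0, 0.0]},
        {"id": "person-12", "kind": "person", "xy": [3.0, -1.5], "v": [0.0, 1.0]},
    ],
    "latent": [0.12, -0.48, 0.33, 0.07],
}

payload = scene_payload_bytes(message)
frame = raw_frame_bytes(1920, 1080)
print(payload, frame, frame // payload)
```

One uncompressed 1080p frame is 6,220,800 bytes, while a message of this shape is a few hundred bytes, so the cloud side handles a decision-sized record rather than a frame-sized one.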

Where Trio Runs

The warehouse was one frame. The same model points at any operation that runs on cameras, including the restaurant, the car wash, the store, and the factory we opened with. Today it works alongside human operators, surfacing what their existing systems miss:

Franchise Operations: Queue management, shrinkage reduction, employee compliance, customer-flow analytics.

Security & Access: Intrusion detection, loitering analysis, tailgating prevention, after-hours enforcement.

Logistics & Warehousing: Dock status, vehicle dwell, PPE compliance, safety-SOP enforcement across yards and floors.

Manufacturing & Industrial: Line monitoring, defect detection, hazard alerts across every line and machine zone.

Smart Cities: Parking, traffic flow, public safety, infrastructure monitoring across streets and transit.

Healthcare & Life Sciences: Fall detection, occupancy patterns, behavioral monitoring across resident rooms and campuses.

Hospitality & Venues: Crowd management, VIP-zone access control, real-time incident response at scale.

Critical Infrastructure: 24/7 perimeter intelligence, intrusion detection, autonomous response for sites that can’t miss an alert.

What We’ve Built — and What’s Next

Trio is no longer a thesis on a whiteboard. The v1.0 technical report formalizes the full system — the perception–prediction–action stack, five principles, seven planes — with two fully worked reference domains (a car wash and a warehouse), down to the forklift-and-pedestrian near-miss above, caught by a deterministic edge safety gate that fires in about 50 milliseconds, well inside the 100 ms ceiling. Trio-Retina is open source (pip install trio-retina), and the Playground is live — open platform.machinefi.com/playground and watch Trio read real footage in your browser.

Three forces make now the moment: edge silicon can finally run real-time operational reasoning without a cloud round-trip; multi-entity scene understanding has crossed a research threshold single-object detection never approached; and the operators of physical environments are ready for what may be the most underpriced capability in AI today — a world model on top of the cameras they already own, without new hardware. From here, Trio grows up the loop — from seeing and understanding today toward foreseeing and, in time, acting on the floor.

Start with Trio today

Two ways in — both live right now:

Build on it · developers

Trio-Retina on GitHub

The open-source perception layer: the model-agnostic state layer that turns any detector into one standard stream of events plus latent state. Install it with pip install trio-retina and run it on your own machine.

★ Star Trio-Retina on GitHub →

Play with it · operators

Trio-Lumen on the platform

See an operation come alive in the browser: Trio reads real footage as objects with state and crowds as flow. Then point it at your own cameras and ask questions in plain English.

Try Trio-Lumen live →

— The MachineFi Lab team