
Getting Started with the Trio Stream API: A Developer's Guide

Connect a camera feed, ask questions, and get AI-powered answers in minutes

MachineFi Labs · 8 min read

The hardest part of building with live video AI used to be everything that happened before you wrote a single line of business logic: standing up frame-extraction pipelines, stitching together vision models, managing GPU infrastructure, and somehow getting answers fast enough to act on them. The Trio Stream API collapses all of that into a few HTTP calls. This guide walks you through the entire journey — from creating an API key to parsing your first AI-generated answer from a real camera feed.

Trio Stream API

The Trio Stream API is a multimodal inference interface that accepts live video, audio, and sensor streams as input and returns structured, natural-language AI responses. It handles frame sampling, vision-language model routing, and output formatting as managed infrastructure, so developers can query a camera feed the same way they would query a REST endpoint — without building or maintaining any ML pipeline.

Prerequisites

Before you write any code, make sure you have the following in place:

  • A Trio account with an active API key. Sign up at machinefi.com — the Starter tier is free and includes 10,000 API calls per month.
  • Python 3.9 or later installed on your machine.
  • A camera feed URL in RTSP, WebRTC, HLS, or plain HTTP MJPEG format. If you don't have a physical camera handy, you can use a public test stream or a local webcam exposed via FFmpeg.
  • Basic familiarity with Python virtual environments and pip.

That's it. You don't need a GPU, a Kubernetes cluster, or any computer-vision background. The API handles all of that on Trio's infrastructure.
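If you don't have a camera on hand, FFmpeg can expose a local webcam as an HTTP MJPEG stream the API can ingest. A minimal sketch for Linux — the device path, port, and quality setting here are assumptions, and macOS users would swap -f v4l2 for -f avfoundation:

```shell
# Serve /dev/video0 as a multipart MJPEG stream at http://127.0.0.1:8090/cam.mjpg
# -listen 1 tells FFmpeg to act as the HTTP server; -q:v 6 is a mid-range JPEG quality.
ffmpeg -f v4l2 -i /dev/video0 -c:v mjpeg -q:v 6 -f mpjpeg -listen 1 http://127.0.0.1:8090/cam.mjpg
```

You can then point the connect() call shown below at that URL.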

Installation

Create a fresh virtual environment and install the Trio SDK:

setup.sh
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install trio-sdk python-dotenv

Next, create a .env file in your project root and add your API key:

.env
TRIO_API_KEY=sk_live_your_key_here
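python-dotenv only loads the variable into the environment; it's worth failing fast if the key is absent, rather than surfacing an opaque auth error on your first request. A small helper sketch — the function name is ours, not part of the SDK:

```python
import os

def require_api_key(var: str = "TRIO_API_KEY") -> str:
    """Return the API key, or fail fast with a clear message."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set; add it to your .env file")
    return key
```

Call it once at startup, right after load_dotenv(), and pass the result to the client constructor.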

Connecting Your First Stream

With the SDK installed and your key loaded, connecting a camera feed takes only a few lines of Python:

connect_stream.py
import os
import trio_sdk
from dotenv import load_dotenv
 
load_dotenv()
 
# Initialize the client
client = trio_sdk.Client(api_key=os.environ["TRIO_API_KEY"])
 
# Connect a camera stream — RTSP, HLS, WebRTC, or HTTP MJPEG
stream = client.streams.connect(
    url="rtsp://camera.local/live",
    label="front-entrance",      # optional human-readable name
    region="us-east-1",          # optional: route to nearest inference node
)
 
print(f"Stream connected: {stream.id}")
print(f"Status: {stream.status}")

The connect() call registers your stream with Trio's ingestion layer. The SDK validates the URL, negotiates the transport protocol, and returns a Stream object with a stable stream.id you can reference in all subsequent calls. The stream stays active until you explicitly disconnect it or it times out due to inactivity.
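The SDK validates the URL server-side, but a quick client-side scheme check can catch typos before you spend a round trip. A sketch — the helper and the scheme list are our own illustration (RTSP, HLS, and MJPEG URLs all reduce to a handful of schemes; WebRTC offer formats vary by setup):

```python
from urllib.parse import urlparse

# HLS playlists and MJPEG streams are both served over HTTP(S),
# so a small scheme allowlist covers the common cases.
SUPPORTED_SCHEMES = {"rtsp", "http", "https"}

def looks_like_supported_stream(url: str) -> bool:
    """Cheap sanity check before handing a URL to client.streams.connect()."""
    parsed = urlparse(url)
    return parsed.scheme in SUPPORTED_SCHEMES and bool(parsed.netloc)
```

This is no substitute for the SDK's own validation; it just fails faster on obvious mistakes.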

50ms

median end-to-end latency from frame capture to API response on Trio's managed inference infrastructure

Source: Trio internal benchmarks, Q1 2026

Asking Questions

Once you have a connected stream, you can ask it anything with the ask() method:

ask_question.py
import os
import trio_sdk
from dotenv import load_dotenv
 
load_dotenv()
 
client = trio_sdk.Client(api_key=os.environ["TRIO_API_KEY"])
stream = client.streams.connect(url="rtsp://camera.local/live")
 
# Ask a natural-language question about the current frame
response = stream.ask("How many people are currently in frame?")
 
print(response.answer)       # "3 people are visible in the frame."
print(response.confidence)   # 0.94
print(response.latency_ms)   # 48

The ask() method captures a frame (or a short clip if temporal context is needed), runs it through Trio's vision-language routing layer, and returns a structured Response object synchronously. For most simple questions, you'll get an answer in well under 100 milliseconds.

You can ask follow-up questions in the same session to maintain context:

ask_followup.py
# Follow-up questions maintain context within the session
response2 = stream.ask("Are any of them wearing high-visibility vests?")
print(response2.answer)  # "Yes, 2 of the 3 people are wearing hi-vis vests."
 
# Structured JSON output for downstream processing
response3 = stream.ask(
    "List each person's approximate location in the frame.",
    output_format="json",
)
print(response3.data)
# [{"person": 1, "location": "left foreground"},
#  {"person": 2, "location": "center background"},
#  {"person": 3, "location": "right midground"}]
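Once output_format="json" gives you response3.data, downstream processing is ordinary Python. For instance, grouping the sample payload above by coarse frame position — the payload mirrors the example, and the grouping logic is our own sketch:

```python
# Sample payload in the same shape as response3.data above.
detections = [
    {"person": 1, "location": "left foreground"},
    {"person": 2, "location": "center background"},
    {"person": 3, "location": "right midground"},
]

# Bucket people by the first word of their location ("left", "center", "right").
by_side = {}
for d in detections:
    side = d["location"].split()[0]
    by_side.setdefault(side, []).append(d["person"])

print(by_side)  # {'left': [1], 'center': [2], 'right': [3]}
```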

Handling Responses

Every ask() call returns a Response object with a consistent schema. Here is the full set of fields you can access:

response_fields.py
response = stream.ask("Describe the scene.")
 
# Core fields
print(response.answer)         # Natural-language answer string
print(response.confidence)     # Float 0.0–1.0 model confidence score
print(response.latency_ms)     # Round-trip time in milliseconds
print(response.frame_ts)       # UTC timestamp of the captured frame
print(response.stream_id)      # ID of the source stream
print(response.request_id)     # Unique ID for this inference request
 
# Optional fields (present when output_format="json")
print(response.data)           # Parsed dict or list from JSON output
print(response.raw_json)       # Raw JSON string before parsing
 
# Metadata
print(response.model)          # Which vision-language model was used
print(response.tokens_used)    # Token count for billing/monitoring

[Table: Trio API Response Modes Compared. Source: Trio API Reference, 2026]
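If you want type-checked handling in your own code, you can mirror a subset of these fields in a local dataclass. A sketch — TrioAnswer and to_answer are our names, not SDK types:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class TrioAnswer:
    """Local, typed mirror of the core Response fields listed above."""
    answer: str
    confidence: float
    latency_ms: int
    stream_id: str
    request_id: str
    data: Optional[Any] = None   # populated when output_format="json"

def to_answer(response) -> TrioAnswer:
    """Copy an SDK response into our own dataclass for downstream code."""
    return TrioAnswer(
        answer=response.answer,
        confidence=response.confidence,
        latency_ms=response.latency_ms,
        stream_id=response.stream_id,
        request_id=response.request_id,
        data=getattr(response, "data", None),
    )
```

Decoupling your pipeline from the SDK's response object this way keeps the rest of your code testable without a live stream.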

For high-throughput or event-driven architectures, use the streaming or webhook modes rather than polling ask() in a loop. The streaming response mode emits tokens progressively — useful when you're rendering answers to a dashboard in real time. The webhook mode pushes results to your endpoint as they arrive, with no open connection required on your side.

Advanced Features

Multi-Stream Sessions

Trio supports querying multiple camera feeds within a single session. This is useful when you need to correlate observations across cameras — for example, tracking a person moving between zones in a warehouse:

multi_stream.py
import os
import trio_sdk
from dotenv import load_dotenv
 
load_dotenv()
 
client = trio_sdk.Client(api_key=os.environ["TRIO_API_KEY"])
 
# Connect multiple streams
entrance = client.streams.connect(url="rtsp://cam1.local/live", label="entrance")
floor_a  = client.streams.connect(url="rtsp://cam2.local/live", label="floor-a")
floor_b  = client.streams.connect(url="rtsp://cam3.local/live", label="floor-b")
 
# Create a session to query them together
session = client.sessions.create(streams=[entrance, floor_a, floor_b])
 
# Ask a cross-camera question
result = session.ask(
    "Is anyone present in all three zones at the same time?"
)
print(result.answer)
# "No. 2 people are in Floor A, 1 person is at the Entrance. Floor B is empty."

Webhook Subscriptions

For production pipelines that need to react to events without polling, use stream.subscribe() to push answers to your endpoint:

webhook_subscribe.py
# Subscribe to continuous inference on a trigger condition
subscription = stream.subscribe(
    question="Alert me if any person enters the restricted zone.",
    webhook_url="https://your-app.com/api/trio-events",
    confidence_threshold=0.85,   # Only fire if model is >85% confident
    cooldown_seconds=30,         # Don't re-fire within 30s of last alert
)
 
print(f"Subscription active: {subscription.id}")

When the condition is met, Trio posts a signed JSON payload to your webhook URL. Verify the signature using the X-Trio-Signature header and your webhook secret from the dashboard.
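A common verification pattern, assuming the signature is an HMAC-SHA256 hex digest of the raw request body keyed by your webhook secret; confirm the exact scheme in the Trio docs before relying on this sketch:

```python
import hashlib
import hmac

def verify_signature(raw_body: bytes, signature: str, secret: str) -> bool:
    """Compare the X-Trio-Signature header value against our own HMAC."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information on mismatches.
    return hmac.compare_digest(expected, signature)

# Demo with a dummy secret and body:
demo_secret = "whsec_demo"
demo_body = b'{"event": "zone_alert"}'
demo_sig = hmac.new(demo_secret.encode(), demo_body, hashlib.sha256).hexdigest()
print(verify_signature(demo_body, demo_sig, demo_secret))   # True
```

Always verify against the raw bytes of the request body, not a re-serialized copy, since any re-encoding can change the digest.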

Next Steps

You now have everything you need to build a working Trio integration. Here is where to go next depending on what you're building:

  • If you're evaluating Trio against a self-hosted pipeline, read our Build vs. Buy: Video Analytics Pipeline breakdown. It compares total cost of ownership, time to first inference, and maintenance burden side by side.
  • If you want to understand the infrastructure behind the API, How to Analyze a Live Video Stream with AI walks through the full architecture from camera to answer.
  • If you're hitting the limits of frame-by-frame queries, The Video-to-LLM Gap explains how Trio handles temporal reasoning across clips — and why that matters for complex detection tasks.

The Trio SDK reference docs, rate limit tables, and error code glossary are available at docs.machinefi.com. For questions, the #developers channel in the Trio Discord community is the fastest path to an answer from the team.

Keep Reading

  • How to Analyze a Live Video Stream with AI — A deep dive into the full pipeline architecture behind real-time video AI, from frame extraction to model inference to structured output.
  • Build vs. Buy: Video Analytics Pipeline — An honest cost and complexity comparison between building your own video AI stack and using a managed API like Trio.
  • The Video-to-LLM Gap — Why standard LLMs can't process video directly, and how Trio bridges the gap between live camera feeds and language model intelligence.

MachineFi Labs

Engineering Team at MachineFi

The team behind Trio — the multimodal stream API that turns live video, audio, and sensor feeds into AI-ready intelligence.