Deploying AI Models to Edge Devices: Raspberry Pi, Jetson, and Beyond
A practical guide to running inference on constrained hardware without sacrificing accuracy
Edge AI has crossed a threshold. In 2024, running a useful computer-vision model on a Raspberry Pi meant squinting at a stuttering 4 FPS feed and praying the object detector finished before the object left the frame. In 2026, a Raspberry Pi 5 with a Hailo-8 accelerator HAT runs YOLOv8s at 30+ FPS. A Jetson Orin Nano runs a full INT8-quantized detection pipeline with three simultaneous camera feeds. A Google Coral USB Accelerator turns a $15 Pi Zero 2 W into a viable people-counter for a retail aisle.
None of that happens automatically. Getting from a trained model to reliable edge inference requires understanding the hardware constraints, choosing the right runtime, converting your model correctly, and building the operational scaffolding — containerization, OTA updates, monitoring — that keeps it running after you deploy it and walk away.
This guide covers the complete workflow end to end.
Why Deploy AI at the Edge at All?
Before getting into hardware and runtimes, it's worth anchoring on why edge inference matters — because the answer determines which hardware tier you need.
The core arguments for edge AI over cloud AI come down to four constraints: latency, bandwidth, privacy, and connectivity. When a robotic arm needs to stop within 8ms of detecting a human hand in its workspace, a round-trip to a cloud API cannot meet the deadline: typical WAN round-trip times alone exceed the entire budget several times over, before any inference happens. When 50 cameras each generate 3–8 Mbps of H.264 video, routing all of it through a WAN link is economically prohibitive. When video contains patients, employees, or proprietary process data, transmitting raw frames to a third-party cloud raises regulatory flags. When cameras are deployed in remote locations without reliable internet, cloud dependence means no analytics during outages.
Edge deployment solves all four. The trade-off is that you inherit the operational complexity of managing hardware you can't reach across a network: firmware, model versioning, failure recovery, and hardware refresh cycles.
55%
of AI inference workloads will run at the edge by 2027, up from 28% in 2023, driven by latency and bandwidth constraints
Understanding GPU vs CPU inference trade-offs helps frame the hardware decision: edge devices span a spectrum from CPU-only (Pi 5) to GPU-accelerated (Jetson) to dedicated NPU/TPU (Coral, Hailo), and each tier unlocks different model sizes and frame rates.
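As a rough illustration, that tier decision can be encoded as a rule-of-thumb lookup. The thresholds below are illustrative assumptions drawn from the figures in this guide, not vendor specifications:

```python
def pick_hardware_tier(streams: int, fps_target: int, power_budget_w: float) -> str:
    """Map deployment requirements to a hardware tier.

    Thresholds are rough framing assumptions, not benchmarks.
    """
    if power_budget_w <= 3:
        # Battery or fanless designs push you toward a dedicated NPU
        return "NPU module (Coral / Hailo on a low-power host)"
    if streams == 1 and fps_target <= 10:
        return "Raspberry Pi 5 (CPU-only)"
    if streams == 1 and fps_target <= 30:
        return "Raspberry Pi 5 + Hailo-8 HAT"
    if streams <= 3:
        return "Jetson Orin Nano"
    return "Jetson Orin NX or larger"

print(pick_hardware_tier(streams=1, fps_target=30, power_budget_w=15))
```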
Hardware Comparison: Choosing Your Platform
Edge AI hardware in 2026 falls into four practical tiers, each with a distinct performance-cost profile.
- Edge AI Accelerator
A purpose-built silicon die — separate from the main CPU and GPU — optimized specifically for neural network matrix operations. Examples include Google's Edge TPU (Coral), Hailo's NPU (Hailo-8/Hailo-15), and the dedicated deep-learning accelerators inside NVIDIA Jetson SoCs. Accelerators deliver 5–30x higher inference throughput per watt compared to running the same model on a general-purpose CPU.
Raspberry Pi 5: The Accessible Starting Point
The Pi 5 is where most edge AI projects start — and where many end up staying. With 4–8 GB of RAM and a quad-core Cortex-A76 CPU running at 2.4 GHz, it handles TFLite and ONNX Runtime inference for models up to about 10M parameters before frame rates become painful.
The Hailo-8 AI HAT+ changes the equation entirely. Installed on the GPIO header, it provides 26 TOPS of neural compute while the Pi's CPU handles pre- and post-processing. A YOLOv8s model that runs at 3 FPS on the CPU alone runs at 35 FPS with the Hailo HAT. For single-stream real-time detection workloads at under $200 total hardware cost, this is currently the best value proposition in edge AI.
The limitation is memory bandwidth and RAM ceiling. The Pi 5 cannot run large models — anything above ~50M parameters is impractical — and there is no upgrade path for the HAT beyond what Hailo-8 provides.
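A back-of-the-envelope check makes that ceiling concrete: weight memory scales linearly with parameter count and precision. This ignores activations and runtime overhead, which add a further multiple on top:

```python
def weight_memory_mb(num_params: int, bytes_per_param: int) -> float:
    """Approximate model weight footprint: parameters x bytes per parameter."""
    return num_params * bytes_per_param / (1024 ** 2)

# A 50M-parameter model: roughly 191 MB of weights in FP32, 48 MB in INT8,
# before activations, buffers, and the runtime itself claim their share.
print(round(weight_memory_mb(50_000_000, 4)))  # FP32
print(round(weight_memory_mb(50_000_000, 1)))  # INT8
```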
Jetson Orin: The Production Workhorse
For production deployments where multiple camera streams, larger models, or occasional LLM inference is required, the Jetson Orin line is the standard choice. The Orin Nano at $250 delivers 40 TOPS and can run three simultaneous YOLOv8m streams or a single 7B-parameter quantized VLM at ~5 tokens/sec.
The Jetson's key advantage is TensorRT: NVIDIA's inference optimizer applies layer fusion, kernel auto-tuning, and INT8 calibration to produce engines that extract every last FLOP from the GPU. A model optimized with TensorRT typically runs 3–5x faster than the same model in plain PyTorch on the same hardware.
The Orin NX at $500 doubles the TOPS and RAM, enabling more complex multimodal workloads — critical for model optimization for edge deployment at the upper end of what embedded hardware supports.
Coral and Hailo: Pure NPU Efficiency
Google Coral and Hailo target a different optimization point: maximum inference efficiency per watt rather than raw throughput. The Coral EdgeTPU delivers 4 TOPS at under 2 watts — roughly 10x the energy efficiency of a Jetson Nano for compatible workloads. The trade-off is that the EdgeTPU only runs fully quantized (INT8) models, and only a subset of TFLite operations are accelerated; everything else falls back to the host CPU.
Hailo-8 is more flexible and significantly more powerful at 26 TOPS, with a broader set of supported layer types and an SDK (HailoRT) that handles model compilation, driver management, and inference pipeline configuration.
For battery-powered devices, embedded systems without active cooling, or deployments where power budget is the binding constraint, Coral or Hailo modules are the right choice.
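When power is the binding constraint, the arithmetic is simple enough to sanity-check on paper: runtime is battery capacity divided by average draw, and energy per inference is draw divided by frame rate. The ~2 W Coral figure is from above; the battery capacity and frame rate here are illustrative assumptions:

```python
def battery_runtime_hours(capacity_wh: float, avg_draw_w: float) -> float:
    """Hours of operation from a battery at a given average draw."""
    return capacity_wh / avg_draw_w

def joules_per_inference(avg_draw_w: float, fps: float) -> float:
    """Energy consumed per processed frame."""
    return avg_draw_w / fps

# Hypothetical 37 Wh pack driving a ~2 W Coral-based pipeline at 20 FPS
print(battery_runtime_hours(37, 2.0))   # 18.5 hours
print(joules_per_inference(2.0, 20))    # 0.1 J per frame
```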
Runtime Framework Comparison
Choosing the right inference runtime is as important as choosing the right hardware — the same hardware running the same model can show 3x throughput differences between a naive implementation and an optimized runtime.
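Measuring that difference is cheap. A minimal timing harness, with the model call passed in as a plain callable so the same harness works against any runtime, is enough to compare options apples-to-apples on the same device:

```python
import time

def benchmark(infer_fn, warmup: int = 10, iters: int = 100) -> dict:
    """Time an inference callable; returns mean latency (ms) and FPS."""
    for _ in range(warmup):      # warm caches, JIT paths, GPU clocks
        infer_fn()
    start = time.perf_counter()
    for _ in range(iters):
        infer_fn()
    elapsed = time.perf_counter() - start
    return {"mean_ms": elapsed / iters * 1000, "fps": iters / elapsed}

# Usage against a real runtime, e.g.:
#   benchmark(lambda: session.run(None, {input_name: tensor}))
stats = benchmark(lambda: sum(range(1000)), iters=50)
print(f"{stats['mean_ms']:.3f} ms/call, {stats['fps']:.0f} calls/sec")
```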
TensorFlow Lite
TFLite is the runtime of choice when you need to run on the widest variety of hardware without recompiling: the same .tflite file runs on a Pi CPU, accelerates on a Coral EdgeTPU, and deploys on Android and iOS. The conversion pipeline is well-documented and the TFLiteConverter supports post-training INT8 quantization with a small representative dataset.
The limitation is that TFLite supports a fixed op set. Custom layers, certain attention mechanisms, and newer transformer architectures may have ops that TFLite doesn't support, forcing you to either run those layers on the CPU or restructure the model.
ONNX Runtime
ONNX Runtime has become the cross-platform standard for PyTorch-origin models. The open ONNX format acts as an intermediate representation: export from PyTorch with torch.onnx.export(), then run on any ONNX-compatible device. ONNX Runtime supports execution providers (EPs) including CUDA, TensorRT, CoreML, DirectML, and more — meaning the same model file can target different hardware by switching the EP.
For real-time object detection pipelines that need to run across mixed hardware fleets (some Jetson, some Pi, some x86), ONNX is often the best common denominator.
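One pattern that makes mixed fleets manageable: query the providers actually available on each device at startup and build the session's priority list from a single preference order. The helper below takes the availability list as an argument so the policy is testable; on-device you would pass it `ort.get_available_providers()`:

```python
# Preference order: fastest available EP first, CPU as the guaranteed fallback
EP_PREFERENCE = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CoreMLExecutionProvider",
    "CPUExecutionProvider",
]

def choose_providers(available: list[str]) -> list[str]:
    """Order the device's available EPs by our preference list."""
    chosen = [ep for ep in EP_PREFERENCE if ep in available]
    return chosen or ["CPUExecutionProvider"]

# Jetson reports TensorRT + CUDA + CPU; a Pi reports only CPU.
print(choose_providers(["CPUExecutionProvider", "CUDAExecutionProvider"]))
print(choose_providers(["CPUExecutionProvider"]))
```

The returned list plugs straight into `ort.InferenceSession(model_path, providers=...)`, so one deployment artifact covers Jetson, Pi, and x86 hosts.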
TensorRT
TensorRT is NVIDIA-only and it is the runtime you use when you need every drop of performance from a Jetson device. The TensorRT optimization process is offline: you build an optimized engine file (.engine) for a specific device, batch size, and input shape. The resulting engine cannot be transferred to a different GPU — you must rebuild it for each deployment target.
This is worth the friction. TensorRT INT8 engines typically run 4–8x faster than FP32 PyTorch on the same Jetson, and 2–3x faster than FP16 ONNX Runtime. For choosing the right AI model for video analytics, TensorRT is what makes sub-10ms inference achievable on Jetson hardware.
Model Conversion Workflow
The conversion pipeline from a trained model to an optimized edge runtime artifact has several steps, and each step is an opportunity to lose accuracy or gain speed. Here is the canonical workflow:
import torch
from ultralytics import YOLO
# Step 1: Load the trained model
model = YOLO("yolov8s.pt")
# Step 2: Export to ONNX with dynamic batching disabled
# Fixed input shape enables more aggressive graph optimization
model.export(
    format="onnx",
    imgsz=640,
    opset=17,
    simplify=True,  # runs onnx-simplifier to fuse ops
    dynamic=False,  # fix batch=1 for edge deployment
)
# Output: yolov8s.onnx

For TensorRT on Jetson, build the engine directly from the ONNX file using the trtexec utility or the Python API:
# Run this ON the target Jetson device — engines are device-specific
# Note: --calib expects an INT8 calibration cache file produced by a
# calibrator run, not a directory of images
trtexec \
    --onnx=yolov8s.onnx \
    --saveEngine=yolov8s_int8.engine \
    --int8 \
    --calib=calibration.cache \
    --workspace=2048 \
    --streams=1 \
    --verbose

For TFLite with INT8 quantization:
import tensorflow as tf
import numpy as np
def representative_dataset():
    """Feed ~100-500 representative images for INT8 calibration."""
    for image in calibration_images:  # your calibration set
        yield [image.astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
print(f"INT8 model size: {len(tflite_model) / 1024 / 1024:.1f} MB")

Running Inference on the Device
Once you have an optimized model artifact, running inference looks different depending on the runtime. Here is a minimal but production-ready inference loop for each major option:
import onnxruntime as ort
import numpy as np
import cv2
# Initialize session with execution provider priority list
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.intra_op_num_threads = 4 # tune for your CPU core count
# Try CUDA first, fall back to CPU
providers = [
    ("CUDAExecutionProvider", {"device_id": 0}),
    "CPUExecutionProvider",
]
session = ort.InferenceSession(
    "yolov8s.onnx",
    sess_options=sess_options,
    providers=providers,
)
input_name = session.get_inputs()[0].name

def preprocess(frame: np.ndarray) -> np.ndarray:
    img = cv2.resize(frame, (640, 640))
    img = img[:, :, ::-1]  # BGR to RGB
    img = img.astype(np.float32) / 255.0
    return np.expand_dims(img.transpose(2, 0, 1), axis=0)  # NCHW
# Inference loop
cap = cv2.VideoCapture("rtsp://camera.local/stream")
while True:
ret, frame = cap.read()
if not ret:
break
input_tensor = preprocess(frame)
outputs = session.run(None, {input_name: input_tensor})
# outputs[0] shape: [1, 84, 8400] for YOLOv8 (decode separately)For Jetson with TensorRT:
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit # noqa: F401
import numpy as np
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with open("yolov8s_int8.engine", "rb") as f:
engine_bytes = f.read()
runtime = trt.Runtime(TRT_LOGGER)
engine = runtime.deserialize_cuda_engine(engine_bytes)
context = engine.create_execution_context()
# Allocate device memory buffers
bindings = []
binding_shapes = []
for binding in engine:
    shape = engine.get_binding_shape(binding)
    size = trt.volume(shape)
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    mem = cuda.mem_alloc(size * np.dtype(dtype).itemsize)
    bindings.append(int(mem))
    binding_shapes.append((shape, dtype, mem))

stream = cuda.Stream()

def infer(input_data: np.ndarray) -> np.ndarray:
    # Copy input to device
    cuda.memcpy_htod_async(binding_shapes[0][2], input_data, stream)
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # Copy output back to host
    output = np.empty(binding_shapes[1][0], dtype=binding_shapes[1][1])
    cuda.memcpy_dtoh_async(output, binding_shapes[1][2], stream)
    stream.synchronize()
    return output

4–8x
typical throughput improvement from TensorRT INT8 versus FP32 PyTorch on the same Jetson Orin hardware
Containerizing with Docker
Docker is the correct answer to "how do I manage dependencies and runtime environments across a fleet of edge devices." The NVIDIA JetPack SDK ships container base images for each Jetson generation that include the correct CUDA version, TensorRT libraries, and cuDNN — pinned to the exact versions tested on that hardware.
# Use NVIDIA's official JetPack container base
# Pin the L4T version to match your device's JetPack release
FROM nvcr.io/nvidia/l4t-tensorrt:r8.5.2-runtime
WORKDIR /app
# Install Python inference dependencies
RUN apt-get update && apt-get install -y python3-pip libopencv-dev \
&& rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
# Copy model engine and inference code
COPY yolov8s_int8.engine ./models/
COPY infer_tensorrt.py .
# Expose metrics endpoint for monitoring
EXPOSE 9090
CMD ["python3", "infer_tensorrt.py", "--stream", "rtsp://camera.local/stream"]

For Raspberry Pi + Hailo:
FROM arm64v8/python:3.11-slim
WORKDIR /app
# Install Hailo runtime (hailort) and dependencies
# hailort must match the firmware version on the Hailo-8 device
RUN apt-get update && apt-get install -y \
libopencv-dev python3-opencv \
&& rm -rf /var/lib/apt/lists/*
# HailoRT Python package — install from Hailo's developer zone
COPY hailort-4.17.0-cp311-cp311-linux_aarch64.whl .
RUN pip install hailort-4.17.0-cp311-cp311-linux_aarch64.whl
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY *.hef ./models/
COPY infer_hailo.py .
CMD ["python3", "infer_hailo.py"]

The key operational advantage of containerization for edge devices is reproducibility: when a device in a remote location starts behaving unexpectedly, you can rebuild the exact same container image locally, reproduce the issue, and push a fixed image via OTA. Without containers, "works on my Pi" debugging is a nightmare.
OTA Model Updates and Rollback
Deploying models to hundreds of edge devices and keeping them up to date without physically visiting each one is a solved problem — but only if you design for it from the start. The minimal viable OTA pipeline has three components: a model registry, an update agent on each device, and a rollback mechanism.
import hashlib
import requests
import subprocess
import logging
from pathlib import Path
MODEL_REGISTRY = "https://models.example.com/api/v1"
MODEL_DIR = Path("/app/models")
CURRENT_MODEL_FILE = MODEL_DIR / "current_model.engine"
STAGED_MODEL_FILE = MODEL_DIR / "staged_model.engine"
ROLLBACK_MODEL_FILE = MODEL_DIR / "rollback_model.engine"
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ota-agent")
def get_deployed_version() -> str | None:
    version_file = MODEL_DIR / "version.txt"
    if version_file.exists():
        return version_file.read_text().strip()
    return None

def check_for_update(device_id: str) -> dict | None:
    resp = requests.get(
        f"{MODEL_REGISTRY}/latest",
        params={"device_id": device_id},
        timeout=30,
    )
    resp.raise_for_status()
    latest = resp.json()
    current = get_deployed_version()
    if latest["version"] != current:
        return latest
    return None

def download_and_verify(update: dict) -> bool:
    log.info(f"Downloading model version {update['version']}")
    resp = requests.get(update["download_url"], stream=True, timeout=120)
    resp.raise_for_status()
    hasher = hashlib.sha256()
    with open(STAGED_MODEL_FILE, "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
            hasher.update(chunk)
    digest = hasher.hexdigest()
    if digest != update["sha256"]:
        log.error(f"Checksum mismatch: expected {update['sha256']}, got {digest}")
        STAGED_MODEL_FILE.unlink(missing_ok=True)
        return False
    log.info("Download verified successfully")
    return True

def apply_update(version: str) -> None:
    # Preserve rollback copy
    if CURRENT_MODEL_FILE.exists():
        CURRENT_MODEL_FILE.rename(ROLLBACK_MODEL_FILE)
    STAGED_MODEL_FILE.rename(CURRENT_MODEL_FILE)
    (MODEL_DIR / "version.txt").write_text(version)
    # Restart the inference service
    subprocess.run(["systemctl", "restart", "edge-inference"], check=True)
    log.info(f"Update to version {version} applied")

def rollback() -> None:
    if not ROLLBACK_MODEL_FILE.exists():
        log.error("No rollback model available")
        return
    CURRENT_MODEL_FILE.rename(STAGED_MODEL_FILE)  # stash bad model
    ROLLBACK_MODEL_FILE.rename(CURRENT_MODEL_FILE)
    subprocess.run(["systemctl", "restart", "edge-inference"], check=True)
    log.warning("Rolled back to previous model version")

For larger fleets, tools like Balena Cloud, NVIDIA Fleet Command, or AWS IoT Greengrass provide managed OTA with staged rollouts (update 5% of devices first, monitor, then roll to 100%), automatic rollback on health-check failure, and centralized logging. The trade-off is vendor lock-in and per-device pricing — for fleets under 100 devices, the self-hosted agent above plus a simple model registry API is usually sufficient.
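The same agent can close the loop after apply_update: watch post-update health for a few minutes and trigger rollback automatically if it degrades. A minimal sketch of the decision logic follows; the metric names and thresholds are assumptions for illustration, not part of the agent above:

```python
def should_rollback(samples: list[dict],
                    min_confidence: float = 0.55,
                    max_error_rate: float = 0.05) -> bool:
    """Decide whether post-update health samples warrant a rollback.

    Each sample is a dict like {"avg_confidence": 0.8, "error_rate": 0.01},
    collected periodically after the update is applied.
    """
    if not samples:
        return True  # no telemetry after an update is itself a bad sign
    avg_conf = sum(s["avg_confidence"] for s in samples) / len(samples)
    avg_err = sum(s["error_rate"] for s in samples) / len(samples)
    return avg_conf < min_confidence or avg_err > max_error_rate

healthy = [{"avg_confidence": 0.82, "error_rate": 0.01}] * 5
degraded = [{"avg_confidence": 0.40, "error_rate": 0.02}] * 5
print(should_rollback(healthy))   # False: keep the new model
print(should_rollback(degraded))  # True: call rollback()
```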
Production Monitoring
An edge AI deployment without monitoring is a ticking clock. Models drift, cameras get dirty, lighting conditions change seasonally, and hardware fails silently. The minimum viable monitoring stack for edge devices exposes a local metrics endpoint that a central collector scrapes:
from prometheus_client import start_http_server, Gauge, Counter, Histogram
import time
# Define metrics
INFERENCE_LATENCY = Histogram(
    "inference_latency_seconds",
    "Per-frame inference time",
    buckets=[0.005, 0.010, 0.025, 0.050, 0.100, 0.250],
)
DETECTION_COUNT = Counter(
    "detections_total",
    "Total detections by class",
    ["class_name"],
)
FRAME_DROP_RATE = Gauge(
    "frame_drop_rate",
    "Fraction of frames dropped due to queue pressure",
)
MODEL_CONFIDENCE_AVG = Gauge(
    "model_confidence_avg",
    "Rolling average confidence score — proxy for model/scene health",
)

# Start the metrics HTTP server on port 9090
start_http_server(9090)

# In your inference loop:
def record_inference(latency_s: float, detections: list) -> None:
    INFERENCE_LATENCY.observe(latency_s)
    for det in detections:
        DETECTION_COUNT.labels(class_name=det.class_name).inc()
    if detections:
        avg_conf = sum(d.confidence for d in detections) / len(detections)
        MODEL_CONFIDENCE_AVG.set(avg_conf)

The most important metric to watch is model_confidence_avg. A sustained drop in average confidence — below 0.60 for a model that normally runs at 0.80+ — typically indicates one of: camera obstruction, lighting change, scene composition drift (new object types appearing), or model version mismatch. Catching this via an alert beats discovering it from a user complaint.
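The alert itself can live on the device as a rolling window over recent per-frame confidence averages. A sketch, using the 0.60 threshold from above; the window size is an illustrative assumption:

```python
from collections import deque

class ConfidenceDriftDetector:
    """Fires when the rolling mean confidence stays below a threshold."""

    def __init__(self, threshold: float = 0.60, window: int = 300):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def observe(self, avg_confidence: float) -> bool:
        """Record one frame's average confidence; return True if alerting."""
        self.samples.append(avg_confidence)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data yet to trust the rolling mean
        return sum(self.samples) / len(self.samples) < self.threshold

detector = ConfidenceDriftDetector(threshold=0.60, window=10)
for _ in range(10):
    detector.observe(0.85)       # healthy scene: never alerts
print(detector.observe(0.30))    # one bad frame does not trip the window
```

Windowing matters: a single occluded frame should not page anyone, but ten minutes of low confidence should.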
Latency vs throughput trade-offs become critical here: on constrained hardware, you often face a choice between lower per-frame latency (smaller batch sizes, less aggressive preprocessing) and higher total throughput (larger batches, optimized pipelines). Monitoring both separately helps you understand when you are hitting hardware limits versus software configuration limits.
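Separating the two in telemetry is straightforward: percentile latency answers "how slow is a bad frame", while throughput answers "how many frames per second overall". A minimal pure-Python computation over a recorded latency window (the sample values are illustrative):

```python
def latency_percentile(latencies_s: list[float], pct: float) -> float:
    """Nearest-rank percentile of per-frame latencies, in seconds."""
    ordered = sorted(latencies_s)
    rank = int(round(pct / 100 * len(ordered))) - 1
    return ordered[max(0, min(len(ordered) - 1, rank))]

def throughput_fps(latencies_s: list[float]) -> float:
    """Frames per second if frames are processed back to back."""
    return len(latencies_s) / sum(latencies_s)

lat = [0.010, 0.012, 0.011, 0.010, 0.050]  # one slow outlier frame
print(f"p99 = {latency_percentile(lat, 99) * 1000:.0f} ms")
print(f"throughput = {throughput_fps(lat):.1f} FPS")
```

Note how the outlier dominates p99 while barely moving throughput; alerting on both catches different failure modes.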
Connecting Edge Devices to MachineFi Trio
Edge devices excel at real-time, low-latency detection. Cloud APIs excel at deep semantic reasoning with large models. The hybrid pattern — use the edge for fast first-pass detection, forward exceptions to Trio for VLM-level analysis — combines both strengths without the bandwidth cost of streaming raw video to the cloud.
import trio_sdk as trio
import cv2
import numpy as np
client = trio.Client(api_key="YOUR_API_KEY")
# Your local edge model (ONNX, TFLite, or TensorRT)
local_detector = LocalDetector(model_path="yolov8s.onnx")
cap = cv2.VideoCapture("rtsp://camera.local/stream")
while True:
    ret, frame = cap.read()
    if not ret:
        continue
    # Fast local inference — runs at 30+ FPS on edge device
    detections = local_detector.detect(frame)
    # Forward only high-confidence or anomalous frames to Trio
    needs_deep_analysis = any(
        d.confidence > 0.85 and d.class_name in ["person", "vehicle"]
        for d in detections
    )
    if needs_deep_analysis:
        # Trio applies a large VLM for semantic understanding
        _, encoded = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 85])
        response = client.analyze_frame(
            image_bytes=encoded.tobytes(),
            prompt="Describe what is happening. Are there any safety concerns?",
        )
        print(f"[Trio] {response.answer}")

This pattern is explored in depth in the build vs. buy analysis for video analytics pipelines: edge handles the real-time workload, cloud handles the analytical depth. The two tiers complement rather than compete with each other.
Keep Reading
- Model Optimization for Edge Deployment: Quantization, Pruning, and Distillation — A deep dive into INT8 calibration, structured pruning, and knowledge distillation to hit your latency targets without sacrificing accuracy.
- Edge AI vs Cloud AI: Where Should You Process Your Video Streams? — The full trade-off framework for latency, bandwidth, privacy, and cost across edge and cloud architectures.
- GPU vs CPU AI Inference: Choosing the Right Compute for Your Pipeline — When a GPU accelerator is worth the cost, and when a well-optimized CPU pipeline is enough.