GPU vs CPU for AI Inference: Performance, Cost, and When to Use Each

Benchmark data, cost models, and a practical decision framework for video AI workloads

MachineFi Labs · 10 min read

Choosing between a GPU and a CPU for AI inference isn't a question of which is "better" — it's a question of which is right for your specific workload, latency budget, and cost model. The answer changes dramatically depending on whether you're running one frame per second or one hundred, whether you're on a cloud VM or an edge device, and whether you care more about throughput or tail latency.

This guide cuts through the marketing and gives you benchmark data, real cost models, and a decision framework you can actually use.

Why Hardware Choice Matters More Than Model Choice

Most discussions about AI inference focus on model architecture — YOLO vs RT-DETR, ResNet vs EfficientNet, quantized vs full precision. But for production video AI, the hardware decision has a larger impact on total cost and end-to-end latency than any model choice.

Consider a real scenario: you're running object detection on 16 concurrent RTSP camera streams, each producing frames at 5fps for analysis. That's 80 inference requests per second, each requiring a 640x640 image to pass through a model with ~6 billion multiply-accumulate operations. Do that math on CPU versus GPU and you get very different answers — in cost, latency, and operational complexity.
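
To make the scale concrete, here is that math as a quick sketch (the stream count, frame rate, and MAC figure come from the scenario above):

```python
# Back-of-envelope for the 16-camera scenario above.
streams = 16
fps_per_stream = 5
macs_per_frame = 6e9  # ~6 billion multiply-accumulates per 640x640 forward pass

requests_per_sec = streams * fps_per_stream       # 80 inferences/s
macs_per_sec = requests_per_sec * macs_per_frame  # 4.8e11 MACs/s

# 1 MAC = 2 FLOPs, so the sustained compute demand in TFLOPS is:
tflops_needed = macs_per_sec * 2 / 1e12
print(requests_per_sec, tflops_needed)  # 80 requests/s, 0.96 TFLOPS sustained
```

Roughly 1 TFLOPS of sustained demand sits right at the ceiling of a typical CPU's 1-4 TFLOPS, which is exactly why this workload is marginal on CPU and comfortable on GPU.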

Understanding edge AI vs cloud AI trade-offs is the prerequisite to this decision, because hardware availability differs radically between edge and cloud deployments. On edge devices, you might not have a GPU at all. In the cloud, you're paying per GPU-hour whether or not you're using all the cores.

GPU Architecture: Why Parallel Matters for AI

GPU Inference

GPU inference is the process of running an AI model's forward pass on a Graphics Processing Unit. GPUs accelerate inference by executing thousands of floating-point operations simultaneously across their parallel processing cores (CUDA cores, Tensor Cores, or equivalent), making them particularly efficient for the matrix multiplications that dominate neural network computation.

A modern data center GPU like the NVIDIA H100 contains 16,896 CUDA cores and 528 Tensor Cores, all operating in parallel. An NVIDIA A10G — the workhorse of many cloud inference deployments — has 6,144 CUDA cores. Compare that to a 32-core CPU where each core executes instructions sequentially.

Neural network inference is fundamentally a problem of matrix multiplication at scale. A single forward pass through a ResNet-50 involves roughly 4 billion multiply-accumulate operations. GPUs are purpose-built for exactly this kind of work.

The key architectural features that make GPUs fast for AI:

Tensor Cores — Introduced with NVIDIA's Volta architecture, Tensor Cores perform mixed-precision matrix operations (FP16/BF16 multiply, FP32 accumulate) in a single clock cycle. A100 Tensor Cores deliver 312 TFLOPS of FP16 throughput versus a typical CPU's 1-4 TFLOPS.

High memory bandwidth — The H100 delivers 3.35 TB/s of memory bandwidth versus ~200 GB/s for a high-end server CPU. During inference, the bottleneck is often moving weights from memory to compute units — bandwidth wins.

SIMT execution model — GPUs use Single Instruction, Multiple Thread (SIMT) execution, where thousands of threads execute the same instruction simultaneously across different data. This maps perfectly onto the batched tensor operations in neural networks.
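
As a loose analogy (assuming NumPy is available), the batched form below applies one operation across the whole batch at once — the shape of computation that maps onto SIMT hardware — while the loop mimics a core handling requests one at a time. This illustrates the programming model, not GPU execution itself:

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.standard_normal((32, 64, 64)).astype(np.float32)  # 32 activations
weights = rng.standard_normal((64, 64)).astype(np.float32)

# Sequential style: one matmul per item, like a core looping over requests.
looped = np.stack([x @ weights for x in batch])

# SIMT style: one batched op applies the same instruction to all 32 items.
batched = batch @ weights

assert np.allclose(looped, batched, atol=1e-4)
```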

312 TFLOPS

FP16 Tensor Core throughput on NVIDIA A100 — versus 1-4 TFLOPS for a high-end CPU — making GPU matrix math 100-300x faster for AI workloads

Source: NVIDIA A100 GPU Architecture Whitepaper, 2020

CPU Architecture: Where Sequential Wins

CPUs are not bad at inference — they're optimized for a fundamentally different execution model. A high-end server CPU like the Intel Xeon Platinum 8480+ has 56 cores running at up to 3.8GHz, with advanced branch prediction, large L3 caches, and AVX-512 SIMD instructions that can process 16 float32 values per instruction per core.

For AI inference specifically, modern CPUs benefit from:

Low latency per operation — CPU cores run at 3-5GHz versus a GPU's 1-2GHz. For a single inference request where you can't batch, a CPU's per-core speed often beats waiting for GPU parallelism to kick in.

ONNX Runtime and OpenVINO optimization — Intel's OpenVINO toolkit and ONNX Runtime's CPU execution provider both include graph-level optimizations, operator fusion, and INT8 quantization that can bring CPU inference within 3-5x of GPU performance for small batch sizes.

No data transfer overhead — GPU inference requires copying input data from CPU RAM to GPU VRAM before each batch. At batch size 1, this PCIe transfer (typically 16 GB/s) can add 1-3ms to every inference call. On CPU, there's no transfer step.

Availability and cost — Every cloud VM, every edge server, every developer's laptop has a CPU. An m6i.2xlarge EC2 instance (8 vCPUs) costs $0.384/hour. A g4dn.xlarge (1 T4 GPU) costs $0.526/hour. For workloads that don't need GPU parallelism, you're paying a 37% premium for hardware you're not utilizing.
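
As an example of what that CPU-side optimization looks like in practice, here is a configuration sketch for an ONNX Runtime CPU session. `model.onnx` is a placeholder path, and the thread counts are illustrative — tune them to your core count and batch pattern:

```python
import onnxruntime as ort

# Configuration sketch: a CPU-only session tuned for low-batch latency.
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.intra_op_num_threads = 8  # parallelism within a single operator
opts.inter_op_num_threads = 1  # batch-1 latency: avoid cross-op contention

session = ort.InferenceSession(
    "model.onnx",  # placeholder path
    sess_options=opts,
    providers=["CPUExecutionProvider"],
)
```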

Benchmark Data: GPU vs CPU Head-to-Head

Let's look at real numbers. These benchmarks reflect production-representative workloads: YOLOv8n (nano) and YOLOv8m (medium) object detection models, run via ONNX Runtime at batch sizes representative of real-time object detection pipelines.

GPU vs CPU Inference Benchmarks: YOLOv8 Object Detection (640x640, FP32)
Source: MachineFi engineering benchmarks; ONNX Runtime 1.17, Ubuntu 22.04, FP32 precision

The pattern is clear: at batch size 1, the GPU advantage is real but modest (4-8x). At batch size 32, the GPU pulls ahead dramatically (45-75x). This is the fundamental insight that drives the decision framework: batch size determines whether you need a GPU.

Cost Per Inference: The Number That Actually Matters

Raw throughput doesn't tell you what it costs to run inference at scale. For that you need cost-per-inference — the fully-loaded hourly cost divided by the number of inferences you run.

$0.0000003

approximate cost per inference on a GPU at full utilization (batch 32, YOLOv8n on g4dn.xlarge) — 8x cheaper than the same model on CPU-only at low batch sizes

Source: AWS on-demand pricing, March 2025; MachineFi cost model
Cost Per Inference: GPU vs CPU at AWS On-Demand Pricing (March 2025)
Source: AWS EC2 on-demand pricing, us-east-1, March 2025; throughput extrapolated from benchmarks

At sustained, high-throughput workloads, the GPU wins on cost-per-inference by a wide margin — roughly 30-40x cheaper than equivalent CPU-only instances for YOLOv8m at batch size 32. But this assumes you're keeping the GPU busy. If your inference load is intermittent or your average batch size is 1-2, that $0.526/hour GPU instance sits idle between requests, and the economics flip.
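
The cost-per-inference arithmetic works out as follows. This sketch pairs the article's YOLOv8n batch-32 throughput figures with the on-demand prices quoted above, purely as an illustration — it mixes instance types, so treat the ratio as indicative rather than exact:

```python
# Cost per inference = fully-loaded hourly cost / inferences per hour.
def cost_per_inference(hourly_usd: float, fps: float) -> float:
    return hourly_usd / (fps * 3600)

cpu = cost_per_inference(0.384, 62)    # m6i.2xlarge, YOLOv8n, batch 32
gpu = cost_per_inference(0.526, 2800)  # GPU instance, YOLOv8n, batch 32

print(f"CPU: ${cpu:.2e}  GPU: ${gpu:.2e}  ratio: {cpu / gpu:.0f}x")
```

Under these assumptions the gap is about 33x for YOLOv8n — and it only holds while the GPU stays busy.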

This is why model optimization for edge deployment matters — quantized INT8 models on CPU can sometimes reach cost parity with GPU for low-batch workloads, while dramatically reducing hardware requirements.

GPU Utilization: The Hidden Cost Driver

The number that most GPU cost analyses ignore is utilization. A GPU that's 30% utilized costs the same as one that's 100% utilized.

In real-world video AI deployments, GPU utilization varies significantly by workload pattern:

  • Continuous stream analysis (processing frames at a fixed interval from every camera): utilization is predictable and usually high. Good GPU fit.
  • Event-driven analysis (process frames only when motion is detected): utilization can be 5-20% if your cameras see intermittent activity. GPU economics worsen dramatically.
  • Batch overnight processing (analyzing recorded video): high utilization, clear GPU win.
  • API with variable load (serving ad-hoc inference requests): utilization tracks traffic patterns, often low during off-peak hours.
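
That effect can be quantified with a small utilization-adjusted cost model (a sketch using the illustrative price and throughput figures from earlier in this article):

```python
# Effective cost per inference rises as utilization falls: a GPU billed
# hourly costs the same whether it is 100% busy or mostly idle.
def effective_cost(hourly_usd: float, peak_fps: float, utilization: float) -> float:
    return hourly_usd / (peak_fps * utilization * 3600)

busy = effective_cost(0.526, 2800, 0.90)  # continuous stream analysis
idle = effective_cost(0.526, 2800, 0.10)  # event-driven, sparse motion

# At 10% utilization, each inference costs 9x what it does at 90%.
print(idle / busy)
```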

The Latency vs Throughput Trade-off

Understanding latency vs throughput in real-time AI is critical here because GPU and CPU have genuinely different latency profiles.

For a single inference request at batch size 1:

  • CPU latency: 26ms (YOLOv8n) to 83ms (YOLOv8m) — no transfer overhead, deterministic
  • GPU latency (A10G): 8ms (YOLOv8n) to 12ms (YOLOv8m) after the model is on-device — but add 1-5ms for PCIe transfer at batch 1

For sustained throughput at batch size 32:

  • CPU: 62 fps (YOLOv8n), 14 fps (YOLOv8m) — hits a wall fast
  • GPU: 2,800 fps (YOLOv8n), 1,050 fps (YOLOv8m) — essentially unconstrained for typical deployments

If your SLA requires sub-50ms end-to-end latency and you're processing one stream, a well-optimized CPU pipeline with ONNX Runtime can get you there for lightweight models (YOLOv8n, MobileNetV3). For anything larger, or for concurrent multi-stream workloads, GPU is the only path to low-latency, high-throughput inference.

NPUs and TPUs: The Third Option

Before landing on a decision framework, it's worth acknowledging that GPU and CPU are no longer the only options — especially for deploying AI models on edge devices.

Neural Processing Units (NPUs) are purpose-built inference accelerators increasingly found in edge hardware:

  • Apple Neural Engine (M-series chips): up to 38 TOPS, optimized for Core ML models
  • Qualcomm Hexagon DSP: 26-75 TOPS depending on generation, powers Snapdragon edge AI
  • Hailo-8: 26 TOPS at 2.5W — the current leader for ultra-low-power edge inference
  • NVIDIA Jetson's DLA (Deep Learning Accelerator): runs alongside the GPU, offloads INT8 inference

Google TPUs (Tensor Processing Units) are cloud-only accelerators designed exclusively for matrix operations. v4 TPUs deliver 275 TFLOPS of BF16 throughput — competitive with A100 for large transformer models, but less flexible for diverse inference workloads.

For most video AI pipelines at the edge, NPUs offer the best power-efficiency: 10-26 TOPS at 2-5W versus a Jetson Orin's GPU drawing 15-25W. The trade-off is reduced model flexibility — NPUs typically require model quantization to INT8 and may not support all operators.

Decision Framework: Which Hardware for Your Workload

Here's the practical decision logic. Work through these questions in order to land on the right hardware for your video AI workload.

Step 1: What's your sustained throughput requirement?

  • Under 30 fps total (all streams combined): CPU is viable for lightweight models
  • 30-200 fps: CPU with optimization (ONNX Runtime, INT8 quantization, multi-threading) or entry-level GPU
  • Over 200 fps: GPU required

Step 2: What's your latency SLA?

  • Over 500ms acceptable (near-real-time is fine): CPU can work at any scale
  • 50-500ms: CPU for light models, GPU for anything above YOLOv8n
  • Under 50ms: GPU for medium/large models; optimized CPU with INT8 for nano models only

Step 3: Where does this run?

  • Cloud: GPU instance economics usually win above 100 fps sustained throughput
  • Edge device: Check for NPU availability first; Jetson for GPU, Hailo-8 for ultra-low-power, CPU for everything else
  • Developer laptop/server: CPU unless you have NVIDIA hardware

Step 4: What's your utilization pattern?

  • Continuous, predictable load (more than 60% utilization): GPU
  • Intermittent or event-driven (under 40% utilization): CPU or serverless GPU (Lambda GPU, Banana.dev)
  • Batch/offline processing: GPU with spot instances

Step 5: What's your operational budget for ML infrastructure?

  • Team with dedicated ML infra: GPU clusters make sense, you can manage CUDA complexity
  • Small team or no ML infra background: start CPU-first, scale to GPU when you hit actual limits. See the build vs buy analysis for video analytics pipelines for context on total infrastructure costs.
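
Steps 1 through 4 can be condensed into a rough sketch. `pick_hardware` is a hypothetical helper, and the thresholds are this article's rules of thumb, not universal constants — real deployments need benchmarking on the actual model:

```python
def pick_hardware(total_fps: float, latency_ms: float,
                  target: str, utilization: float) -> str:
    """Condensed decision sketch: throughput, latency SLA, deployment
    target, and utilization pattern, in that order."""
    if target == "edge":
        return "NPU if available (Hailo-8 class), else Jetson GPU, else CPU"
    if total_fps > 200 or (latency_ms < 50 and total_fps > 30):
        # GPU territory; intermittent load favors serverless GPU billing.
        return "serverless GPU" if utilization < 0.40 else "GPU"
    if total_fps <= 30 and latency_ms >= 50:
        return "CPU (with ONNX Runtime / INT8 optimization)"
    return "CPU first; move to GPU when benchmarks show a real limit"

print(pick_hardware(80, 100, "cloud", 0.75))
```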

Practical Architecture: The Hybrid Approach

The most performant production video AI pipelines don't choose between GPU and CPU — they use both where each excels.

A typical scalable video AI architecture for multi-camera deployments looks like:

CPU handles:

  • RTSP stream ingestion and decoding (FFmpeg is CPU-optimized)
  • Frame preprocessing: resize, normalize, color space conversion
  • Motion detection and frame selection (avoid sending redundant frames to inference)
  • Post-processing: NMS (non-maximum suppression), result parsing, webhook delivery
  • Lightweight classification models (MobileNet, SqueezeNet)

GPU handles:

  • Object detection inference (YOLOv8m, RT-DETR, DINO)
  • Segmentation and pose estimation
  • Feature extraction for re-identification or embedding search
  • Vision LLM calls for complex scene understanding

This split typically reduces GPU memory pressure by 30-40% (because preprocessing is off-GPU) and can double effective GPU throughput by keeping the GPU fed continuously from a CPU-side frame buffer.
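
A minimal sketch of that CPU-feeds-GPU pattern uses a thread-safe frame buffer between a producer and a batching consumer. `preprocess` and `infer_batch` are stand-ins for the real pipeline stages, and the batching logic is illustrative:

```python
import queue
import threading

frame_buffer: "queue.Queue" = queue.Queue(maxsize=64)
results = []

def preprocess(frame_id: int) -> int:
    return frame_id  # resize / normalize would happen here, on CPU

def infer_batch(batch: list) -> list:
    return [f * 10 for f in batch]  # stand-in for the GPU forward pass

def producer(n_frames: int) -> None:
    for i in range(n_frames):
        frame_buffer.put(preprocess(i))
    frame_buffer.put(None)  # sentinel: stream ended

def consumer(max_batch: int = 8) -> None:
    done = False
    while not done:
        batch = [frame_buffer.get()]  # block until a frame arrives
        if batch[0] is None:
            break
        # Opportunistically fill the batch without waiting, so the
        # "GPU" stays fed but latency stays bounded.
        while len(batch) < max_batch:
            try:
                item = frame_buffer.get_nowait()
            except queue.Empty:
                break
            if item is None:
                done = True
                break
            batch.append(item)
        results.extend(infer_batch(batch))

t = threading.Thread(target=producer, args=(100,))
t.start()
consumer()
t.join()
assert len(results) == 100
```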

Understanding edge computing fundamentals is valuable here — the same hybrid principle applies at the edge, where a Raspberry Pi-class CPU handles stream ingestion while a Hailo-8 or Jetson GPU handles inference.

The Quantization Multiplier

No GPU vs CPU comparison is complete without discussing quantization, because it changes both sides of the equation substantially.

Quantizing a YOLOv8m model from FP32 to INT8 (using ONNX Runtime's quantization tools or TensorRT) typically:

  • Reduces model size by 4x
  • Increases CPU throughput by 2-4x (AVX-512 VNNI integer math)
  • Increases GPU throughput by 1.5-2x
  • Reduces accuracy by 0.5-2% mAP (usually acceptable)

INT8 quantization on CPU is particularly powerful because modern Intel Xeon and AMD EPYC processors have hardware-accelerated INT8 units (VNNI/VPDPBUSD) that approach GPU-class throughput for small models. An INT8 YOLOv8n on a 16-core Xeon can reach 90-120 fps — competitive with small GPU deployments.
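
For intuition, the arithmetic of symmetric INT8 quantization looks like this — a NumPy illustration of the size and error trade-off, not a substitute for ONNX Runtime's quantization tools or TensorRT:

```python
import numpy as np

rng = np.random.default_rng(1)
weights = rng.standard_normal((256, 256)).astype(np.float32)

# Symmetric quantization: map the weight range onto [-127, 127].
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequant = q.astype(np.float32) * scale

size_ratio = weights.nbytes / q.nbytes     # 4.0 — the "4x smaller"
max_err = np.abs(weights - dequant).max()  # bounded by scale / 2
print(size_ratio, max_err)
```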

If you haven't yet optimized your model for deployment, do that before choosing hardware. A well-quantized model on a CPU can outperform an unoptimized model on a GPU, while costing a fraction as much.

MachineFi Labs

Engineering Team at MachineFi

The team behind Trio — the multimodal stream API that turns live video, audio, and sensor feeds into AI-ready intelligence.