How to Fine-Tune YOLOv8 on a Custom Dataset in Under an Hour
From labeling to deployment — a hands-on tutorial for training your own object detector
If you've ever tried to use a pre-trained YOLOv8 model on a real-world task — detecting defects on a production line, counting inventory on a warehouse shelf, or spotting PPE violations on a construction site — you've likely hit the same wall: the model doesn't know what your objects are. COCO-trained weights recognize 80 generic categories. Your problem probably isn't one of them.
Fine-tuning YOLOv8 on a custom dataset is how you fix that. In this tutorial you'll go from raw images to a deployed, production-ready detector in under an hour. We'll cover every step: labeling tools, YOLO format, training scripts, hyperparameter choices, validation metrics, and exporting your model for edge devices.
If you're new to object detection, read What Is Computer Vision? and our real-time object detection with Python guide first — they'll give you the foundation to get the most out of this tutorial.
What Is Fine-Tuning (and Why Does It Beat Training from Scratch)?
- Transfer Learning for Object Detection
Transfer learning is the process of taking a neural network pretrained on a large dataset (such as COCO's 118,000 images) and continuing its training on a smaller, domain-specific dataset. The network's early layers — which have learned to detect edges, textures, and shapes — are reused. Only the later layers are updated to recognize your specific classes. This dramatically reduces the data and compute required to reach high accuracy.
Training YOLOv8 from random weights on a custom 500-image dataset would produce a poor model. The network has no prior knowledge of what visual features matter. Fine-tuning from COCO pretrained weights gives you a massive head start: the backbone already understands corners, gradients, object boundaries, and spatial relationships. You're teaching it the final mile — what your objects look like.
The difference in practice is stark. Expect mAP50 scores 15–25 points higher when fine-tuning versus training from scratch on small datasets.
10x — fewer labeled images needed when fine-tuning YOLOv8 versus training from scratch to reach the same mAP threshold on custom datasets
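With Ultralytics this head start comes almost for free: loading any `.pt` checkpoint reuses the COCO-trained backbone, and the optional `freeze` argument pins the first N layers so only the detection head updates. A minimal sketch (the dataset path is a placeholder, and the training call is left commented so the snippet has no side effects):

```python
# from ultralytics import YOLO  # uncomment once ultralytics is installed

# Settings for a backbone-frozen fine-tune. `freeze=10` keeps the first
# 10 layers (the YOLOv8 backbone) at their COCO weights, so only the
# detection head adapts to the new classes.
train_kwargs = dict(
    data="dataset.yaml",  # placeholder, point this at your own dataset YAML
    epochs=50,
    imgsz=640,
    freeze=10,
    pretrained=True,
)

# model = YOLO("yolov8n.pt")            # downloads COCO weights on first run
# results = model.train(**train_kwargs)
print(f"Fine-tune settings: {train_kwargs}")
```

Freezing is optional; for most datasets, letting all layers update with a low learning rate works just as well, but freezing speeds up training on very small datasets.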
Choosing Your YOLOv8 Model Size
YOLOv8 ships in five sizes. The right choice depends on your inference hardware and acceptable latency — not just accuracy.
For most custom dataset fine-tuning projects, start with yolov8m. It trains fast, generalizes well, and runs comfortably on a Jetson Orin or modern GPU. If you're targeting an edge deployment, start with yolov8n or yolov8s and verify accuracy meets your requirements before moving up. See our guide on model optimization for edge deployment for a deeper look at the tradeoffs.
Step 1 — Collect and Prepare Your Dataset
The single most important factor in fine-tuned model quality is data quality. A well-labeled dataset of 500 images will outperform a poorly-labeled dataset of 5,000.
How Many Images Do You Need?
For most industrial and commercial use cases:
- Minimum viable: 200–500 images per class for simple, high-contrast objects
- Solid baseline: 500–1,500 images per class for objects with significant variation
- Production-grade: 1,500–5,000 images per class for complex real-world conditions
These numbers assume fine-tuning from COCO weights. If your objects look nothing like COCO categories (e.g., microscopic cell structures, aerial satellite imagery), err toward the higher end.
Capture Diversity, Not Just Volume
Your images need to cover the real distribution your model will encounter in deployment:
- Lighting conditions (bright daylight, overcast, artificial light, night)
- Camera angles (top-down, oblique, eye-level)
- Object sizes (close-up and distant instances in the same frame)
- Backgrounds (the cluttered environments where inference actually runs)
- Occlusion cases (partially hidden objects)
A dataset with 300 truly diverse images will generalize better than 1,000 images captured under identical conditions.
import random
import shutil
from pathlib import Path


def split_dataset(source_dir: str, output_dir: str, train_ratio: float = 0.8, val_ratio: float = 0.1) -> None:
    """
    Split a flat directory of image/label pairs into train/val/test splits.
    Expects: source_dir/images/*.jpg and source_dir/labels/*.txt
    """
    images = sorted(Path(source_dir, "images").glob("*.jpg"))
    images += sorted(Path(source_dir, "images").glob("*.png"))
    random.seed(42)  # reproducible split
    random.shuffle(images)
    n = len(images)
    n_train = int(n * train_ratio)
    n_val = int(n * val_ratio)
    splits = {
        "train": images[:n_train],
        "val": images[n_train:n_train + n_val],
        "test": images[n_train + n_val:],
    }
    for split_name, split_images in splits.items():
        img_out = Path(output_dir, split_name, "images")
        lbl_out = Path(output_dir, split_name, "labels")
        img_out.mkdir(parents=True, exist_ok=True)
        lbl_out.mkdir(parents=True, exist_ok=True)
        for img_path in split_images:
            lbl_path = Path(source_dir, "labels", img_path.stem + ".txt")
            shutil.copy(img_path, img_out / img_path.name)
            if lbl_path.exists():
                shutil.copy(lbl_path, lbl_out / lbl_path.name)
    print(f"Split complete: {n_train} train / {n_val} val / {n - n_train - n_val} test")


if __name__ == "__main__":
    split_dataset(
        source_dir="data/raw",
        output_dir="data/dataset",
        train_ratio=0.8,
        val_ratio=0.1,
    )

Step 2 — Annotate Your Images
Annotation is the most time-intensive step. Two tools dominate the landscape for YOLO-format labeling: CVAT (open-source, self-hostable) and Roboflow (SaaS with a free tier). They make very different tradeoffs.
For most teams starting out, Roboflow is the path of least resistance. Its free tier allows up to 10,000 images, includes AI-assisted labeling (which can cut annotation time by 60–80% once you have a few hundred labels), and exports directly to YOLO format. Its augmentation pipeline also lets you expand your dataset synthetically.
For sensitive data that can't leave your infrastructure, CVAT is the professional choice. It requires Docker to self-host but gives you complete data control and robust team collaboration features.
Understanding YOLO Annotation Format
Regardless of which tool you use, YOLO expects a specific text format for bounding box annotations:
# Format: <class_id> <x_center> <y_center> <width> <height>
# All values normalized to [0, 1] relative to image dimensions
0 0.512 0.348 0.234 0.187
1 0.721 0.612 0.089 0.143
0 0.156 0.801 0.178 0.201

Each line is one bounding box. Coordinates are normalized — divide pixel coordinates by image width/height. Class IDs are zero-indexed integers corresponding to your class list in dataset.yaml.
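To make the normalization concrete, here is a small helper (the function name is mine) that converts a pixel-space corner box into a YOLO-format label line:

```python
def pixel_box_to_yolo(x1: float, y1: float, x2: float, y2: float,
                      img_w: int, img_h: int, class_id: int) -> str:
    """Convert a pixel-space (x1, y1, x2, y2) box to a normalized YOLO label line."""
    cx = (x1 + x2) / 2 / img_w   # normalized box center x
    cy = (y1 + y2) / 2 / img_h   # normalized box center y
    bw = (x2 - x1) / img_w       # normalized width
    bh = (y2 - y1) / img_h       # normalized height
    return f"{class_id} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}"

# A 100x100 box centered at (320, 240) in a 640x480 image:
line = pixel_box_to_yolo(270, 190, 370, 290, 640, 480, 0)
print(line)  # 0 0.500000 0.500000 0.156250 0.208333
```

Most labeling tools do this conversion for you on export; the helper is mainly useful when you generate labels programmatically or migrate from Pascal VOC / COCO formats.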
import cv2
import numpy as np


def visualize_annotations(image_path: str, label_path: str, class_names: list[str]) -> np.ndarray:
    """Draw YOLO bounding boxes on an image for annotation QA."""
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    h, w = img.shape[:2]
    with open(label_path) as f:
        lines = f.read().strip().split("\n")
    colors = [
        (0, 255, 0), (255, 0, 0), (0, 0, 255),
        (255, 255, 0), (0, 255, 255), (255, 0, 255),
    ]
    for line in lines:
        if not line.strip():
            continue
        parts = line.split()
        cls_id = int(parts[0])
        # Take only the 4 box fields (some exports append a confidence value)
        cx, cy, bw, bh = map(float, parts[1:5])
        # Convert from normalized to pixel coordinates
        x1 = int((cx - bw / 2) * w)
        y1 = int((cy - bh / 2) * h)
        x2 = int((cx + bw / 2) * w)
        y2 = int((cy + bh / 2) * h)
        color = colors[cls_id % len(colors)]
        cv2.rectangle(img, (x1, y1), (x2, y2), color, 2)
        label = class_names[cls_id] if cls_id < len(class_names) else str(cls_id)
        cv2.putText(img, label, (x1, y1 - 8), cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)
    return img


if __name__ == "__main__":
    classes = ["hardhat", "safety_vest", "no_hardhat", "no_vest"]
    img = visualize_annotations(
        "data/dataset/train/images/frame_0042.jpg",
        "data/dataset/train/labels/frame_0042.txt",
        classes,
    )
    cv2.imwrite("qa_preview.jpg", img)
    print("Saved qa_preview.jpg — check your annotation alignment.")

Step 3 — Create the Dataset YAML
The dataset.yaml file is how Ultralytics knows where to find your data and what classes to detect.
# Paths — absolute or relative to where you run training
path: /home/user/projects/ppe-detector/data/dataset
train: train/images
val: val/images
test: test/images  # optional

# Number of classes
nc: 4

# Class names (order must match your annotation class IDs)
names:
  0: hardhat
  1: safety_vest
  2: no_hardhat
  3: no_vest

Step 4 — Train the Model
With your dataset split and YAML configured, training is a single function call.
from ultralytics import YOLO

# Load a pretrained YOLOv8 checkpoint (downloads automatically on first run)
model = YOLO("yolov8m.pt")

# Fine-tune on your custom dataset
results = model.train(
    data="dataset.yaml",
    epochs=100,
    imgsz=640,
    batch=16,            # reduce to 8 if you hit OOM on smaller GPUs
    lr0=0.01,            # initial learning rate
    lrf=0.001,           # final learning rate factor (final lr = lr0 * lrf)
    momentum=0.937,
    weight_decay=0.0005,
    warmup_epochs=3,
    patience=25,         # early stopping: stop if no improvement for 25 epochs
    device=0,            # GPU index; use "cpu" for CPU-only
    workers=8,
    project="runs/train",
    name="ppe_detector_v1",
    exist_ok=False,
    pretrained=True,     # always True for fine-tuning
    optimizer="AdamW",
    verbose=True,
    seed=42,
    deterministic=True,
    single_cls=False,
    rect=False,
    cache=True,          # cache images in RAM for faster training (needs ~8GB RAM)
    amp=True,            # mixed precision — major speedup on modern GPUs
    plots=True,          # save training plots to runs/train/ppe_detector_v1/
)

print(f"Best checkpoint saved to: {results.save_dir}/weights/best.pt")

This is where transfer learning in computer vision pays off — because you're starting from COCO pretrained weights, the model already understands visual structure and only needs to learn your specific class appearances.
What Happens During Training
Ultralytics logs training progress to the console and saves plots to your run directory. Key metrics to watch:
- box_loss — bounding box regression loss. Should decrease steadily and plateau.
- cls_loss — classification loss. Higher values early on are normal; should converge.
- dfl_loss — distribution focal loss (bounding box precision). Should track alongside box_loss.
- metrics/mAP50 — mean average precision at IoU 0.5. This is your primary quality signal.
- metrics/mAP50-95 — averaged across IoU thresholds 0.5–0.95. Harder metric; expect it to be 15–25 points lower than mAP50.
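Ultralytics also writes these per-epoch metrics to a results.csv file in the run directory. A quick way to find the best epoch without opening the plots (the column name metrics/mAP50(B) matches recent Ultralytics versions, and the strip() handles older releases that pad headers with spaces):

```python
import csv

def best_epoch(results_csv: str) -> tuple[int, float]:
    """Return (epoch, mAP50) for the best epoch in an Ultralytics results.csv."""
    best = (-1, -1.0)
    with open(results_csv, newline="") as f:
        for row in csv.DictReader(f):
            # Some Ultralytics versions pad CSV column names with spaces
            row = {k.strip(): v for k, v in row.items()}
            map50 = float(row["metrics/mAP50(B)"])
            if map50 > best[1]:
                best = (int(float(row["epoch"])), map50)
    return best

# Example usage (path is the run directory from the training step):
# epoch, map50 = best_epoch("runs/train/ppe_detector_v1/results.csv")
```

This is the same value Ultralytics uses to select best.pt, so it's a handy sanity check that the saved checkpoint is the epoch you expect.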
For a 100-epoch training run on a standard GPU with 1,000 images, expect:
- T4 GPU (Google Colab): 45–75 minutes
- RTX 4090: 8–15 minutes
- CPU only: 4–8 hours (not recommended)
Running on Google Colab or RunPod
If you don't have a local GPU, Google Colab (free T4) or RunPod ($0.20–0.40/hr for an A40) are practical choices. For Colab:
# Run this cell first in Google Colab
!pip install ultralytics -q
# Mount Google Drive to access your dataset
from google.colab import drive
drive.mount("/content/drive")
# Copy dataset from Drive to local Colab disk (much faster than training from Drive)
import shutil
shutil.copytree("/content/drive/MyDrive/datasets/ppe", "/content/dataset")
# Verify GPU
import torch
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

Step 5 — Hyperparameter Tuning
The defaults in Ultralytics work well for most datasets. But if you're getting lower mAP than expected after 100 epochs, these are the levers to pull:
from ultralytics import YOLO

model = YOLO("yolov8m.pt")

# Use Ultralytics built-in hyperparameter evolution
# This runs 100 generations of evolutionary search
# Warning: plan for 10-50x your single-run training time
model.tune(
    data="dataset.yaml",
    epochs=50,        # shorter per-run for tuning speed
    iterations=100,   # number of hyperparameter candidates to try
    optimizer="AdamW",
    plots=True,
    save=True,
    val=True,
)

For quicker manual iteration, focus on these parameters first:
| Parameter | Default | Try if underfitting | Try if overfitting |
|---|---|---|---|
| epochs | 100 | 150–300 | N/A (use patience) |
| lr0 | 0.01 | 0.001 (slow, stable) | 0.1 (aggressive) |
| augment | True | Keep True | Reduce specific augmentations |
| mosaic | 1.0 | Keep at 1.0 | Lower to 0.5 |
| dropout | 0.0 | N/A | 0.1–0.3 |
Data Augmentation Settings
Ultralytics applies a rich augmentation pipeline by default. You can override individual augmentations:
from ultralytics import YOLO

model = YOLO("yolov8m.pt")

model.train(
    data="dataset.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    # Augmentation overrides
    hsv_h=0.015,      # hue shift (0.0–1.0)
    hsv_s=0.7,        # saturation shift
    hsv_v=0.4,        # value/brightness shift
    degrees=5.0,      # rotation range (degrees)
    translate=0.1,    # translation fraction
    scale=0.5,        # scale gain
    shear=0.0,        # shear degrees
    perspective=0.0,
    flipud=0.0,       # vertical flip probability (0 for upright objects)
    fliplr=0.5,       # horizontal flip probability
    mosaic=1.0,       # mosaic augmentation probability
    mixup=0.1,        # mixup augmentation probability
    copy_paste=0.1,   # copy-paste augmentation (great for instance segmentation)
)

+8.3 mAP — average improvement from proper data augmentation versus training with no augmentation on small custom datasets (fewer than 1,000 images)
Step 6 — Validate and Interpret Metrics
After training, Ultralytics saves a best.pt checkpoint based on the highest validation mAP50. Run a formal evaluation:
from ultralytics import YOLO

model = YOLO("runs/train/ppe_detector_v1/weights/best.pt")

# Evaluate on validation set
metrics = model.val(
    data="dataset.yaml",
    split="val",
    conf=0.25,       # confidence threshold for predictions
    iou=0.6,         # IoU threshold for NMS
    plots=True,
    save_json=True,  # save COCO-format JSON for further analysis
)

print(f"mAP50: {metrics.box.map50:.4f}")
print(f"mAP50-95: {metrics.box.map:.4f}")
print(f"Precision: {metrics.box.mp:.4f}")
print(f"Recall: {metrics.box.mr:.4f}")

# Per-class breakdown
for i, cls_name in enumerate(model.names.values()):
    print(f"  {cls_name:20s} AP50: {metrics.box.ap50[i]:.4f}")

Understanding mAP
mAP50 (mean Average Precision at IoU 0.50) measures detection accuracy with a lenient overlap threshold. A predicted box is counted as correct if it overlaps the ground truth box by at least 50%. This is the standard metric for comparing models and is the one most practitioners quote.
mAP50-95 averages AP across IoU thresholds from 0.50 to 0.95 in steps of 0.05. It's a much stricter metric that penalizes imprecise box placement. COCO benchmark uses this as the primary metric.
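The overlap criterion behind both metrics is intersection-over-union. A minimal reference implementation for corner-format boxes:

```python
def iou(box_a: tuple, box_b: tuple) -> float:
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 100x100 boxes offset by 50 px in x: intersection is 50x100 = 5000,
# union is 10000 + 10000 - 5000 = 15000, so IoU = 1/3
print(iou((0, 0, 100, 100), (50, 0, 150, 100)))
```

At mAP50, a prediction whose IoU with a ground-truth box is at least 0.5 counts as a true positive; mAP50-95 repeats the same matching at thresholds up to 0.95, which is why sloppy box placement hurts it so much more.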
Rough quality benchmarks for a fine-tuned custom model:
- mAP50 below 0.50: model is not ready — check your annotations, class balance, and consider more data
- mAP50 0.50–0.70: acceptable for internal use with human review
- mAP50 0.70–0.85: solid production quality for most industrial applications
- mAP50 above 0.85: excellent — validate on real-world test footage before shipping
Always evaluate on a held-out test set that was not used during training or hyperparameter selection. If your val set was used to pick best.pt, your validation metrics are optimistically biased.
For further context on how model performance trades off against inference cost in production, see our guide on choosing the right AI model for video analytics.
Step 7 — Run Inference
Before exporting, verify your model works correctly on new images:
import cv2
from ultralytics import YOLO

model = YOLO("runs/train/ppe_detector_v1/weights/best.pt")

# Single image inference
results = model.predict(
    source="test_images/site_photo_001.jpg",
    conf=0.4,
    iou=0.5,
    imgsz=640,
    show=False,
    save=True,
    save_txt=False,
    verbose=False,
)

for result in results:
    boxes = result.boxes
    print(f"Detected {len(boxes)} objects:")
    for box in boxes:
        cls_id = int(box.cls)
        conf = float(box.conf)
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"  {model.names[cls_id]}: {conf:.2f} @ [{x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f}]")

# Live webcam / RTSP stream inference
# results = model.predict(source="rtsp://admin:[email protected]/stream", stream=True)
# for result in results:
#     frame = result.plot()  # annotated frame
#     cv2.imshow("Detection", frame)
#     if cv2.waitKey(1) & 0xFF == ord("q"):
#         break

Step 8 — Export for Edge Deployment
Once you're satisfied with accuracy, export your model to an optimized format. This is where you move from research to production. Our detailed guide on deploying AI models to edge devices covers this step in depth — here's the essentials.
from ultralytics import YOLO

model = YOLO("runs/train/ppe_detector_v1/weights/best.pt")

# Export to ONNX (universal — runs on CPU, GPU, Jetson, most edge devices)
model.export(
    format="onnx",
    imgsz=640,
    half=False,      # FP16 — set True if your runtime supports it
    opset=17,        # ONNX opset version
    simplify=True,   # simplify ONNX graph
    dynamic=False,   # fixed batch size for edge deployment
)

# Export to TensorRT (NVIDIA Jetson, data center GPUs)
model.export(
    format="engine",
    imgsz=640,
    half=True,       # TensorRT FP16 for ~2x speedup vs FP32
    device=0,
    workspace=4,     # GB of GPU workspace for TensorRT optimization
    simplify=True,
)

# Export to TFLite (Android, Raspberry Pi, Google Coral)
model.export(
    format="tflite",
    imgsz=320,       # smaller image size for mobile/edge
    half=False,
    int8=False,      # set True for INT8 quantization (fastest on mobile, slight accuracy loss)
)

# Export to CoreML (Apple Silicon, iPhone, iPad)
model.export(
    format="coreml",
    imgsz=640,
    nms=True,        # include NMS in the CoreML model graph
)

print("Exports complete. Check the current directory for output files.")

Running ONNX Inference
import cv2
import numpy as np
import onnxruntime as ort


def load_onnx_model(model_path: str) -> ort.InferenceSession:
    providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    session = ort.InferenceSession(model_path, providers=providers)
    print(f"Using provider: {session.get_providers()[0]}")
    return session


def preprocess(image: np.ndarray, imgsz: int = 640) -> tuple[np.ndarray, float]:
    """Letterbox-style resize and pad to a square imgsz x imgsz input."""
    h, w = image.shape[:2]
    scale = min(imgsz / h, imgsz / w)
    new_h, new_w = int(h * scale), int(w * scale)
    resized = cv2.resize(image, (new_w, new_h))
    padded = np.zeros((imgsz, imgsz, 3), dtype=np.uint8)
    padded[:new_h, :new_w] = resized
    # BGR -> RGB, HWC -> CHW, add batch dim, normalize to [0, 1]
    blob = padded[:, :, ::-1].transpose(2, 0, 1)[None].astype(np.float32) / 255.0
    return blob, scale


def run_inference(session: ort.InferenceSession, blob: np.ndarray) -> np.ndarray:
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: blob})
    # Raw YOLOv8 ONNX output shape: (1, 4 + num_classes, num_anchors),
    # e.g. (1, 8, 8400) for 4 classes; decoding and NMS happen in post-processing
    return outputs[0]


if __name__ == "__main__":
    session = load_onnx_model("best.onnx")
    image = cv2.imread("test.jpg")
    blob, scale = preprocess(image)
    preds = run_inference(session, blob)
    print(f"Inference output shape: {preds.shape}")
    # Post-process: decode predictions, run NMS, scale boxes back to original image coordinates
    # See full post-processing in /blog/real-time-object-detection-python

Common Mistakes That Kill Fine-Tuning Performance
After helping dozens of teams through this process, these are the issues that show up most often:
1. Not verifying annotation alignment. The most common cause of mysteriously low mAP. Your class IDs in annotation files must match your YAML names order exactly. Run verify_annotations.py after every dataset change.
2. Severe class imbalance. If you have 1,000 images of class A and 50 of class B, your model will learn to ignore class B. Either oversample the minority class, undersample the majority, or use weighted loss. Aim for at most a 5:1 ratio between your most and least common classes.
3. Evaluating on the training set. If your val images were used to select hyperparameters, your metrics are optimistic. Always keep a held-out test set that you touch exactly once.
4. Too-small images for small objects. If your objects of interest are small relative to the full frame, consider training at imgsz=1280 rather than 640. The compute cost doubles, but detection of small objects improves significantly. Alternatively, tile your images during preprocessing.
5. Skipping the cache=True flag. On datasets under 10,000 images, caching to RAM cuts training time by 30–50% on most setups. If you don't have enough RAM, use cache="disk" instead.
6. Stopping too early. With patience=25, training stops if mAP doesn't improve for 25 consecutive epochs. For small datasets, mAP can plateau and then improve again late. If your loss curves still look noisy, increase patience to 50.
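To catch class imbalance (mistake #2) early, a short audit of your label files can report per-class counts and the max/min ratio. A sketch, assuming the directory layout produced by the split script in Step 1 (the helper name is mine):

```python
from collections import Counter
from pathlib import Path

def class_balance(labels_dir: str) -> Counter:
    """Count class IDs across all YOLO label files in a directory."""
    counts: Counter = Counter()
    labels_path = Path(labels_dir)
    if not labels_path.is_dir():
        return counts
    for label_file in labels_path.glob("*.txt"):
        for line in label_file.read_text().splitlines():
            if line.strip():
                counts[int(line.split()[0])] += 1  # first field is the class ID
    return counts

if __name__ == "__main__":
    counts = class_balance("data/dataset/train/labels")  # path from the split step
    if counts:
        ratio = max(counts.values()) / min(counts.values())
        print(f"Counts: {dict(counts)} | max/min ratio: {ratio:.1f}")
```

If the reported ratio is above roughly 5:1, rebalance before spending time on hyperparameters.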
For a broader view of how your fine-tuned model fits into an end-to-end pipeline, see GPU vs CPU for AI inference and our overview of neural networks explained.
Putting It All Together: A Complete Timeline
Here's a realistic hour-by-hour breakdown for a first fine-tuning project:
- 0:00–0:20 — Image collection and quality review (delete blurry, duplicate, or out-of-distribution images)
- 0:20–0:45 — Annotation in Roboflow or CVAT (AI-assist speeds this up dramatically after the first 50 manual labels)
- 0:45–1:00 — Dataset split, YAML creation, and annotation verification
- 1:00–2:00 — Training run (GPU dependent; run in background)
- 2:00–2:10 — Validation, per-class mAP review
- 2:10–2:15 — ONNX export
- 2:15+ — Deploy to your target device or connect to the Trio stream API for live video inference
If your mAP50 is below 0.65 after the first run, the issue is almost always in the data — not the training code. Review your annotations, check for class imbalance, and add more diverse examples before tuning hyperparameters.
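The verify_annotations.py mentioned earlier isn't shown in this post, but a minimal version only needs to check three things per line: five whitespace-separated fields, an integer class ID within range, and coordinates inside [0, 1]. A sketch (function name is mine):

```python
from pathlib import Path

def verify_annotations(labels_dir: str, num_classes: int) -> list[str]:
    """Return human-readable problems found in a directory of YOLO label files."""
    problems = []
    labels_path = Path(labels_dir)
    if not labels_path.is_dir():
        return [f"missing directory: {labels_dir}"]
    for label_file in sorted(labels_path.glob("*.txt")):
        for lineno, line in enumerate(label_file.read_text().splitlines(), start=1):
            parts = line.split()
            if not parts:
                continue  # blank lines are harmless
            if len(parts) != 5:
                problems.append(f"{label_file.name}:{lineno}: expected 5 fields, got {len(parts)}")
                continue
            try:
                cls_id = int(parts[0])
                coords = [float(v) for v in parts[1:]]
            except ValueError:
                problems.append(f"{label_file.name}:{lineno}: non-numeric field")
                continue
            if not 0 <= cls_id < num_classes:
                problems.append(f"{label_file.name}:{lineno}: class id {cls_id} out of range")
            if not all(0.0 <= v <= 1.0 for v in coords):
                problems.append(f"{label_file.name}:{lineno}: coordinates outside [0, 1]")
    return problems
```

Run it after every export from your labeling tool; an empty list means the files at least parse cleanly. It can't catch boxes drawn around the wrong object, so pair it with the visual QA script from Step 2.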
Keep Reading
- How to Build a Real-Time Object Detection Pipeline with Python — Once you have your fine-tuned model, this guide shows you how to integrate it into a live video pipeline.
- Transfer Learning in Computer Vision — A deeper look at why pretrained weights work so well and when to fine-tune versus train from scratch.
- Deploying AI Models to Edge Devices — ONNX, TensorRT, and TFLite deployment walkthrough for Jetson Orin, Raspberry Pi, and embedded hardware.