Master Object Detection: How to Perform Inference with YOLOv8 and Retrain with Custom Data
The full process of using a pretrained YOLOv8 model for inference and retraining it with a custom dataset
Introduction
In the realm of computer vision, object detection stands as one of the most critical and challenging tasks. One of the most prominent and revolutionary object detection algorithms is the You Only Look Once (YOLO) detection model. Its ability to quickly identify and locate objects has made it an invaluable tool for a wide range of applications, from self-driving cars to security systems.
YOLO is a real-time object detection algorithm that works by dividing an image into a grid of cells and predicting the object’s bounding box and class probability for each cell. The network then combines these predictions to output the final detection results. YOLO was first introduced in the paper “You Only Look Once: Unified, Real-Time Object Detection” by Joseph Redmon et al. in 2016.
YOLO Architecture
The YOLO architecture consists of two main components:
- The feature extraction network is a deep convolutional neural network (CNN) that takes an input image and outputs a feature map. The feature map preserves the spatial information of the image and contains high-level features such as edges, corners, and textures.
- The detection network takes the feature map from the feature extraction network as input and predicts the bounding boxes and class probabilities for each cell in the grid. The detection network consists of a series of convolutional layers followed by a fully connected layer. The output of the detection network is a tensor of size (S, S, (B * 5 + C)), where S is the size of the grid, B is the number of bounding boxes per grid cell, and C is the number of classes.
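As a quick sanity check of this output formula, here is a minimal sketch (assuming the classic YOLOv1 configuration of a 7x7 grid, 2 boxes per cell, and 20 classes) that computes the expected tensor shape:
# Compute the YOLO output tensor shape (S, S, B * 5 + C).
# Assumes the classic YOLOv1 setup: 7x7 grid, 2 boxes per cell, 20 classes.
S, B, C = 7, 2, 20

output_shape = (S, S, B * 5 + C)  # each box carries (x, y, w, h, confidence)
print(output_shape)               # (7, 7, 30)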
The YOLO detection process consists of the following steps:
- Input Image: The algorithm takes an input image and resizes it to a fixed size.
- Grid Generation: The algorithm divides the resized image into a grid of cells. Each cell is responsible for predicting the object(s) that falls within its boundaries.
- Bounding Box Prediction: For each cell, the algorithm predicts B bounding boxes, each containing 5 values: (x, y, w, h, confidence). The (x, y) values represent the center of the bounding box relative to the cell's boundaries, and (w, h) represent the width and height of the bounding box. The confidence value represents the algorithm's confidence in the prediction.
- Class Prediction: For each cell, the algorithm predicts the probability of each of the C classes.
- Non-Maximum Suppression: The algorithm applies non-maximum suppression to remove redundant bounding box predictions. Non-maximum suppression compares the predicted bounding boxes' overlap and keeps only the one with the highest confidence.
- Intersection Over Union (IoU): Evaluate the accuracy of the bounding box predictions. IoU is a measure of the overlap between the predicted and ground truth bounding boxes, with values ranging from 0 to 1. A high IoU value indicates a better overlap between the predicted and ground truth bounding boxes. (A minimal IoU/NMS sketch follows this list.)
- Output: The algorithm outputs the final detection results, which include the class label, bounding box coordinates, and confidence score.
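To make the IoU and non-maximum suppression steps concrete, here is a minimal NumPy sketch of both. This is a simplified, class-agnostic illustration, not the exact implementation used inside YOLO.
import numpy as np

def iou(box, boxes):
    # IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thres=0.5):
    # keep the highest-confidence box, drop boxes overlapping it by more than iou_thres, repeat
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_thres]
    return keep

# two heavily overlapping boxes -> only the higher-confidence one survives
boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.6, 0.8])
print(nms(boxes, scores))  # [0, 2]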
YOLOv8
YOLOv8, developed by the team at Ultralytics (the creators of YOLOv5), represents a cutting-edge object detection algorithm that outperforms its predecessors in the YOLO (You Only Look Once) series. The YOLOv8 family provides five model sizes for each of the detection, segmentation, and classification tasks. YOLOv8 Nano (YOLOv8n) is the smallest and fastest model, while YOLOv8 Extra Large (YOLOv8x) is the slowest but most accurate model among them.
The Detect, Segment, and Pose models in the YOLOv8 series have been pre-trained on the COCO dataset, while the Classify models have been pre-trained on the ImageNet dataset. The COCO (Common Objects in Context) dataset is a widely used large-scale dataset for object detection, segmentation, and captioning tasks in computer vision research. It contains over 330,000 images with more than 2.5 million labeled object instances across 80 different categories, making it a valuable resource for training and evaluating object detection and segmentation models. The full list of the 80 COCO object classes is available in the COCO dataset documentation.
Inference with a Pretrained YOLOv8 Model
Without further delay, let’s proceed to the implementation by installing the required libraries.
pip install ultralytics==8.0.75 opencv-python==4.7.0.72 gdown==4.7.1 filterpy==1.4.5 torch
If you are running the model for the first time and don’t have the pretrained model saved, you can download it through an API. All the available pretrained models can be downloaded conveniently using Ultralytics’ Python API.
from ultralytics import YOLO
model = YOLO("yolov8s.pt")
To begin, we import the necessary libraries and configure the model by setting its various parameters to ensure it runs as desired.
import cv2
import torch
from pathlib import Path
from ultralytics.nn.autobackend import AutoBackend
from ultralytics.yolo.utils.ops import non_max_suppression, scale_boxes
from ultralytics.yolo.data.dataloaders.stream_loaders import LoadImages, LoadStreams
from ultralytics.yolo.utils.plotting import Annotator, colors
# model config
half = False
img_sz = [640, 640]
device = torch.device('cpu')  # use torch.device('cuda') if a GPU is available
classes = list(range(80))
conf_thres = 0.5
iou_thres = 0.5
max_det = 1000
line_thickness = 2
agnostic_nms = False
- half: A boolean that determines whether to use half-precision floating-point arithmetic (float16) for inference. It is often used with CUDA to speed up deep learning computations on GPUs. This can be advantageous when working with large models or datasets, as it allows more data to be stored in memory at once, reducing the need to transfer data between the CPU and GPU. It is set to False when running on a machine without a GPU.
- img_sz: The size of the image the model will analyze. The default size is 640x640 pixels, which is a square image.
- device: The type of hardware the model runs on. By default, it runs on the computer's CPU.
- classes: The types of objects the model will try to detect. By default, the model is trained to recognize the 80 different objects defined in the COCO dataset.
- conf_thres: The level of certainty the model needs to detect an object. By default, it's set to 0.5, meaning the model will only report objects it is at least 50% sure about.
- iou_thres: The amount of overlap allowed between bounding boxes before the model considers them the same object. By default, it's set to 0.5, meaning boxes that overlap by more than 50% will be treated as duplicates of the same object.
- max_det: The maximum number of objects the model will detect per image. By default, it's set to 1000.
- line_thickness: The thickness of the lines drawn around the detected objects in the output image.
- agnostic_nms: A setting that determines whether the model performs class-specific or class-agnostic non-maximum suppression (NMS) during object detection. Class-agnostic NMS runs on all boxes together rather than per class. By default, it's set to False, meaning the model performs class-specific NMS.
Next, we can define the input source on which we intend to run the YOLO model. In our case, I have a video called test.mp4 for demo purposes. Use 0 if you intend to use your webcam for live detection. Following that, you can decide whether to display the video during inference with show_video, and set save_video if you want to save the output locally.
source = './test.mp4'
webcam = source.isnumeric() or source.endswith('.txt')  # a numeric source (e.g. '0') or a .txt list of streams is treated as a live stream
# output config
show_video = True
save_video = False
output_file_name = 'test_output_1.avi'
out_writter = cv2.VideoWriter(
    output_file_name,
    cv2.VideoWriter_fourcc('M', 'J', 'P', 'G'),  # MJPG codec
    30,      # output frame rate
    img_sz   # output frame size (width, height)
)
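One optional refinement, not in the original script: the writer above hardcodes 30 FPS and the model input size, so the saved video can play back at the wrong speed. If you prefer, the frame rate and size can be read from the source video with OpenCV before creating the writer (remember to resize the saved frames to this size instead of img_sz in the loop below).
# optional: match the output writer to the source video's properties
cap = cv2.VideoCapture(source)
src_fps = cap.get(cv2.CAP_PROP_FPS) or 30  # fall back to 30 FPS if the property is unavailable
src_w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
src_h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
cap.release()

out_writter = cv2.VideoWriter(
    output_file_name,
    cv2.VideoWriter_fourcc('M', 'J', 'P', 'G'),
    src_fps,
    (src_w, src_h)
)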
Subsequently, we load the YOLOv8 model and its associated configuration into the AutoBackend class from the Ultralytics library, which initializes the PyTorch model object with the given weights and sets the model up to run on a specified device (e.g., CPU or GPU). Additionally, the AutoBackend class can be configured to use the OpenCV DNN module for ONNX inference on CPUs that support it, and can also use half-precision (FP16) calculations to reduce memory usage and increase throughput.
If the webcam is used, the LoadStreams function is used to create a data loader for the webcam feed, whereas the LoadImages function is used for a static image or video file.
# load model
# fp16=half: use FP16 half-precision inference; dnn=False: skip OpenCV DNN for ONNX inference
model_name = "yolov8n.pt"
model = AutoBackend(model_name, device=device, dnn=False, fp16=half)
# stride: model stride, names: class-index-to-name mapping, pt: True for native PyTorch weights
stride, names, pt = model.stride, model.names, model.pt
# data loader
if webcam:
    dataset = LoadStreams(
        source,
        imgsz=img_sz,
        stride=stride,
        auto=pt,
        transforms=getattr(model.model, 'transforms', None),
        vid_stride=1
    )
    bs = len(dataset)
else:
    dataset = LoadImages(
        source,
        imgsz=img_sz,
        stride=stride,
        auto=pt,
        transforms=getattr(model.model, 'transforms', None),
        vid_stride=1
    )
We have reached the most exciting part of the process, where we feed each frame into the YOLO model. In the following code block, we perform several operations: image preprocessing, inference with the YOLO model, and non-maximum suppression on the raw predictions. Then, we iterate through the results and plot bounding boxes (using the Annotator class) on the objects detected in the frame.
for frame_idx, batch in enumerate(dataset):
    # pre-processing: normalize the frame and add a batch dimension
    path, transform_im, ori_im, vid_cap, s = batch
    transform_im = torch.from_numpy(transform_im).to(device)
    transform_im = transform_im.half() if half else transform_im.float()
    transform_im /= 255.0
    transform_im = torch.unsqueeze(transform_im, 0)

    # inference
    preds = model(transform_im, augment=False, visualize=False)
    results = non_max_suppression(preds, conf_thres, iou_thres, classes, agnostic_nms, max_det=max_det)

    # Process detections
    for i, det in enumerate(results):
        # annotator for plotting
        annotator = Annotator(ori_im, line_width=line_thickness, example=str(names))
        if det is not None and len(det):
            # rescale boxes from the model's input size back to the original image size
            det[:, :4] = scale_boxes(transform_im.shape[2:], det[:, :4], ori_im.shape).round()
            for j, output in enumerate(det):
                bbox = output[0:4]
                conf = output[4]
                cls = output[5]
                annotator.box_label(bbox, f'{names[int(cls)]} {conf:.2f}', color=colors(int(cls), True))

        final_img = annotator.result()

        if show_video:
            cv2.namedWindow("out", cv2.WINDOW_FREERATIO)
            cv2.resizeWindow("out", final_img.shape[1], final_img.shape[0])
            cv2.imshow("out", final_img)
            if cv2.waitKey(1) == ord('q'):
                exit()

        if save_video:
            frame = cv2.resize(final_img, img_sz, interpolation=cv2.INTER_AREA)
            out_writter.write(frame)
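One small but easy-to-forget addition once the loop finishes: releasing the video writer and closing the display windows, otherwise the saved .avi file may end up truncated or unplayable.
# clean up after the loop finishes
if save_video:
    out_writter.release()
cv2.destroyAllWindows()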
Here is an example of a single frame processed using the script above. A full running script with an example can be found in the repo.
Retraining with Custom Dataset
Retraining YOLO with a custom dataset becomes necessary when the classes of objects to be detected are not included in the default COCO dataset. In such cases, it is crucial to prepare a new dataset and retrain the model on it, ensuring that the model can accurately detect the objects we are interested in. To demonstrate YOLO training, we can download and extract a custom dataset for plant disease detection.
!wget https://moderncomputervision.s3.eu-west-2.amazonaws.com/PlantDoc.v1-resize-416x416.yolov5pytorch.zip
!unzip -q PlantDoc.v1-resize-416x416.yolov5pytorch.zip
!mkdir -p ./datasets
!mv ./train ./datasets/train
!mv ./test ./datasets/test
After the extraction, we should be able to find a very important file called data.yaml that stores the crucial information for training. This custom yaml file specifies the location of the images used for training and validation, the number of classes (30 in this case), and the class names.
# data.yaml
train: ./train/images
val: ./valid/images
nc: 30
names: ['Apple Scab Leaf', 'Apple leaf', 'Apple rust leaf', 'Bell_pepper leaf spot', 'Bell_pepper leaf', 'Blueberry leaf', 'Cherry leaf', 'Corn Gray leaf spot', 'Corn leaf blight', 'Corn rust leaf', 'Peach leaf', 'Potato leaf early blight', 'Potato leaf late blight', 'Potato leaf', 'Raspberry leaf', 'Soyabean leaf', 'Soybean leaf', 'Squash Powdery mildew leaf', 'Strawberry leaf', 'Tomato Early blight leaf', 'Tomato Septoria leaf spot', 'Tomato leaf bacterial spot', 'Tomato leaf late blight', 'Tomato leaf mosaic virus', 'Tomato leaf yellow virus', 'Tomato leaf', 'Tomato mold leaf', 'Tomato two spotted spider mites leaf', 'grape leaf black rot', 'grape leaf']
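Before launching a training run, it can be worth sanity-checking this file, for example confirming that the number of class names matches nc. A small optional check, assuming PyYAML is installed:
import yaml

# load data.yaml and confirm the class list is consistent with nc
with open('./data.yaml') as f:
    data_cfg = yaml.safe_load(f)

print(data_cfg['train'], data_cfg['val'])
assert len(data_cfg['names']) == data_cfg['nc'], "nc does not match the number of class names"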
After ensuring that all necessary components are in place, training a new model is straightforward with the Ultralytics library. With just a few lines of code, you can begin the training process.
from ultralytics import YOLO
# build a new model from YAML (alternatively, start from pretrained weights with YOLO('yolov8n.pt'))
model = YOLO('yolov8n.yaml')
# Train the model; the best and last checkpoints are saved automatically under runs/detect/train/weights/
model.train(data='./data.yaml', epochs=100, imgsz=640)
# optionally export the trained model to another format, e.g. ONNX
model.export(format='onnx')
Great! You should now be able to see the model starting the training process.
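Once training completes, the best-performing weights can be loaded back and used for inference in exactly the same way as the pretrained model. The path below assumes Ultralytics' default output directory; adjust it to match your run.
from ultralytics import YOLO

# load the best checkpoint produced by the training run (default Ultralytics output path)
trained_model = YOLO('runs/detect/train/weights/best.pt')

# run inference on a new leaf image (placeholder path)
results = trained_model.predict(source='leaf.jpg', conf=0.5)
for result in results:
    print(result.boxes)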
Conclusion
In conclusion, we have explored the power of YOLO for object detection tasks and how we can use pre-trained models to perform inference on images and videos. We have also learned how to retrain a YOLO model with a custom dataset using the Ultralytics library. With the ability to train and fine-tune YOLO models with our own data, the possibilities for object detection tasks are endless. Whether it is for detecting plant diseases, traffic monitoring, or even facial recognition, YOLO provides us with an accurate and efficient solution. With the implementation and the resources provided in this article, we hope that you will be able to apply these techniques to your own projects and take advantage of the capabilities of YOLO.