---
title: Custom Yolo V3
emoji: π
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 3.40.1
app_file: app.py
pinned: false
license: mit
---
Custom YOLOv3 trained on Pascal VOC using PyTorch Lightning
This is a live demonstration of the code available here. In this exercise, we adapt pre-existing code from this library and re-implement it using PyTorch Lightning. Other changes include the introduction of the One Cycle Policy and Mosaic transformations.
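As an aside, the One Cycle Policy ramps the learning rate up to a peak value and then anneals it back down over the course of training. Below is a minimal sketch of the schedule's shape, using a plain linear ramp and decay; the values of `max_lr`, `div_factor` and `pct_start` are illustrative defaults, not the ones used in this training run, and real implementations (e.g. PyTorch's `OneCycleLR`) default to cosine annealing and also cycle momentum:

```python
def one_cycle_lr(step, total_steps, max_lr=1e-3, div_factor=25.0, pct_start=0.3):
    """Learning rate at `step` under a simplified linear One Cycle schedule."""
    initial_lr = max_lr / div_factor
    warmup_steps = int(total_steps * pct_start)
    if step < warmup_steps:
        # linear ramp from initial_lr up to max_lr
        return initial_lr + (max_lr - initial_lr) * step / warmup_steps
    # linear anneal from max_lr back down towards zero
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * (1.0 - frac)

# the rate rises over the first 30% of training, peaks, then falls
print(one_cycle_lr(0, 100))    # → 4e-05
print(one_cycle_lr(30, 100))   # → 0.001
print(one_cycle_lr(100, 100))  # → 0.0
```

The brief warm-up lets training start gently, while the long anneal acts as a regularizer towards the end.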
Dataset
The model was trained on the Pascal VOC dataset. Pascal VOC (Visual Object Classes) is a collection of standardized image datasets for object class recognition. Run as an annual challenge from 2005 to 2012, it has become one of the most widely used benchmarks for evaluating object detection and image segmentation algorithms.
Pascal VOC has been instrumental in the development and evaluation of many state-of-the-art computer vision algorithms, especially before the dominance of the COCO dataset. Many deep learning models for object detection, like Faster R-CNN, SSD, and YOLO, have been trained and evaluated on Pascal VOC, providing a common ground for comparison.
As the field of computer vision evolved, datasets with larger numbers of images and more diverse annotations became necessary. This led to the development of more comprehensive datasets like MS COCO. Due to its limited size and diversity, Pascal VOC has become less dominant in recent years, but it still remains an important benchmark in the history of object detection.
The model currently supports the 20 Pascal VOC classes: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, and TV/monitor.
YOLOv3
YOLOv3 is the third version of the YOLO architecture, a state-of-the-art, real-time object detection system. YOLO, as the name suggests, processes images in one pass, making it incredibly fast while maintaining a balance with accuracy.
Key Features:
Single-Shot Detector: Unlike two-stage detectors, which first propose regions of interest and then classify those regions, YOLO performs both tasks in a single pass, making it faster.
Darknet-53: YOLOv3 introduces a new 53-layer backbone called Darknet-53, which combines the earlier Darknet architecture with residual connections similar to those in ResNet.
Three Scales: YOLOv3 makes detections at three different feature-map scales, each with its own set of anchor box sizes. This helps capture objects of different sizes more accurately.
Bounding Box Predictions: Instead of predicting the coordinates for the bounding boxes, YOLOv3 predicts the offsets from a set of anchor boxes. This helps in stabilizing the model's predictions.
Multi-label Classification: Unlike YOLOv2 which used Softmax, YOLOv3 uses independent logistic classifiers to determine the probability of the object's presence, allowing the detection of multiple object classes in one bounding box.
Loss Function: The YOLOv3 loss treats bounding-box localization as a regression problem, combined with binary cross-entropy terms for objectness and class predictions. This formulation is well suited to single-shot detection.
Use of Three Anchor Boxes: For each grid cell at each scale, the network uses three anchor boxes (pre-determined shapes). This helps it adjust its predictions to common object shapes.
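The anchor-offset prediction described above can be sketched numerically. YOLOv3 predicts raw offsets $(t_x, t_y, t_w, t_h)$ and decodes them against the responsible grid cell $(c_x, c_y)$ and the matched anchor's dimensions $(p_w, p_h)$. A minimal, framework-free sketch of that decoding (variable names follow the YOLOv3 paper, not the linked repository):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw YOLOv3 outputs into a box in grid-cell units.

    (cx, cy): top-left corner of the responsible grid cell.
    (pw, ph): width/height of the matched anchor box.
    """
    bx = sigmoid(tx) + cx   # center x; sigmoid keeps it inside the cell
    by = sigmoid(ty) + cy   # center y
    bw = pw * math.exp(tw)  # width scales the anchor exponentially
    bh = ph * math.exp(th)  # height likewise
    return bx, by, bw, bh

# zero offsets place the center in the middle of the cell
# and leave the anchor dimensions unchanged
print(decode_box(0.0, 0.0, 0.0, 0.0, cx=3, cy=5, pw=2.0, ph=4.0))
# → (3.5, 5.5, 2.0, 4.0)
```

Because the center offsets pass through a sigmoid, each cell can only ever predict a center inside itself, which is what stabilizes training compared to predicting raw coordinates.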
While YOLOv3 is not the most accurate object detection algorithm, its strength lies in its speed, making it suitable for real-time applications. When compared to its predecessors, YOLOv3 offers a good balance between speed and accuracy. It performs particularly well in detecting smaller objects due to its multi-scale predictions.
Model Metrics
We shall look at some of the metrics logged while training the model. Initially the model performs poorly, but the metrics start to improve rapidly from the $3^{rd}$ epoch for the next few epochs, after which they climb gradually.
Towards the end of training we can see that the model performs well, with decent accuracy for class, no-object and object predictions: 85% class accuracy, 98% no-object accuracy and 78% object accuracy. The loss has also fallen from 20 in the $1^{st}$ epoch to 3.7 in the last epoch.
We have achieved a Mean Average Precision (mAP) of 0.4, calculated at the last epoch.
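For context, mAP is built on top of intersection-over-union (IoU) between predicted and ground-truth boxes: a detection counts as a true positive when its IoU exceeds a threshold (commonly 0.5 for Pascal VOC), and the area under each class's precision-recall curve is then averaged. A minimal IoU sketch for corner-format boxes, not taken from the linked repository:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) corner format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # overlap rectangle (clamped to zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 7
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```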