Enhanced YOLOv12 through Sliced Contrastive Supervision and Full Scene Fine-Tuning
Author(s) - Javier E. Garza, Muhammad F. Islam
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3596039
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Real-time detection of objects in drone-based imagery has proven to be a challenging task for even the most state-of-the-art deep learning models. Due to computational limitations, images are often scaled down during training, reducing the feature space and leading to decreased overall accuracy during validation and inference. This work proposes a two-stage training strategy and several key improvements to one of the most recent You Only Look Once (YOLO) models, YOLOv12. An extra P2 branch and its corresponding scale were added to the head of the network to improve detection of small-scale objects. An additional CIoU-like penalty term was combined with the standard CIoU-based loss used in YOLO to improve detection accuracy. Finally, a contrastive loss function and an associated embedding branch were introduced to help discriminate between features in the embedding space, pulling instances of the same class closer together and pushing instances of different classes further apart. The first stage of training leverages these improvements on a sliced, full-resolution version of the VisDrone2019-DET dataset with 15% overlap between slices; the second stage fine-tunes on the full images in a scaled-down configuration to provide full scene context. Results demonstrate a 35.5% mean Average Precision (mAP50:95) and a 56.6% mAP50 on the validation split, and a 36.2% mAP50:95 and a 57.7% mAP50 on the test split.
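To make the two mechanisms named in the abstract more concrete, the Python/PyTorch sketches below illustrate how overlapping slicing and contrastive supervision over instance embeddings could look. Neither is taken from the paper's code: only the 15% overlap ratio comes from the abstract, while the slice size, temperature, and pairing rule are illustrative assumptions.

# Minimal sketch of overlapping slice generation for full-resolution frames.
# Only the 15% overlap ratio is stated in the abstract; the 640 px square
# slice size and the border-handling rule are illustrative assumptions.
def slice_windows(img_w: int, img_h: int, slice_size: int = 640,
                  overlap: float = 0.15):
    """Yield (x0, y0, x1, y1) crop boxes that tile the image with overlap."""
    stride = int(slice_size * (1.0 - overlap))
    xs = list(range(0, max(img_w - slice_size, 0) + 1, stride))
    ys = list(range(0, max(img_h - slice_size, 0) + 1, stride))
    # Ensure the right and bottom borders are covered by a final slice.
    if xs[-1] + slice_size < img_w:
        xs.append(img_w - slice_size)
    if ys[-1] + slice_size < img_h:
        ys.append(img_h - slice_size)
    for y0 in ys:
        for x0 in xs:
            yield x0, y0, min(x0 + slice_size, img_w), min(y0 + slice_size, img_h)

The contrastive term described in the abstract can be pictured as a supervised contrastive loss applied to the outputs of the auxiliary embedding branch, one vector per ground-truth-matched detection. The version below is a generic formulation, not necessarily the authors' exact loss.

# Minimal sketch of a supervised contrastive loss over per-instance embeddings.
# The temperature and the same-class pairing rule are illustrative assumptions.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """embeddings: (N, D) instance vectors; labels: (N,) class ids."""
    z = F.normalize(embeddings, dim=1)                 # compare by cosine similarity
    sim = (z @ z.t()) / temperature                    # (N, N) pairwise logits
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, -1e9)             # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Average log-probability of same-class pairs, per anchor that has at
    # least one positive in the batch; negate so that lower loss means
    # same-class embeddings are closer than different-class embeddings.
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0
    mean_log_prob_pos = (log_prob * pos_mask.float()).sum(dim=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()

In a two-stage pipeline such as the one described, the slicing routine would feed the first training stage, and a term like the contrastive loss above would be added, with some weighting factor, to the standard YOLO box, class, and distribution losses.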
