Enhanced YOLOv12 through Sliced Contrastive Supervision and Full Scene Fine-Tuning
Author(s) - Javier E. Garza, Muhammad F. Islam
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3596039
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Real-time detection of objects in drone-based imagery has proven to be a challenging task for even the most state-of-the-art deep learning models. Due to computational limitations, images are often scaled down during training, reducing the feature space and leading to decreased overall accuracy during validation and inference. This work proposes a two-stage training strategy and several key improvements to one of the most recent You Only Look Once (YOLO) models, YOLOv12. An extra P2 branch and its corresponding scale were added to the head of the network to improve detection of small-scale objects. An additional CIoU-like penalty term was combined with the standard CIoU-based loss used in YOLO to improve detection accuracy. Finally, a contrastive loss function and an associated embedding branch were introduced to help discriminate between features in the embedding space, pulling instances of the same class closer together and pushing instances of different classes further apart. The first stage of training leverages these improvements on a sliced, full-resolution version of the VisDrone2019-DET dataset with 15% overlap between slices; the second stage fine-tunes on the full images in a scaled-down configuration to provide full scene context. Results demonstrate a 35.5% mean Average Precision (mAP50:95) and a 56.6% mAP50 on the validation split, and a 36.2% mAP50:95 and a 57.7% mAP50 on the test split.
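To make the two mechanisms named in the abstract more concrete, the Python/PyTorch sketches below illustrate how overlapping slicing and contrastive supervision over instance embeddings could look. Neither is taken from the paper's code: only the 15% overlap ratio comes from the abstract, while the slice size, temperature, and pairing rule are illustrative assumptions.

# Minimal sketch of overlapping slice generation for full-resolution frames.
# Only the 15% overlap ratio is stated in the abstract; the 640 px square
# slice size and the border-handling rule are illustrative assumptions.
def slice_windows(img_w: int, img_h: int, slice_size: int = 640,
                  overlap: float = 0.15):
    """Yield (x0, y0, x1, y1) crop boxes that tile the image with overlap."""
    stride = int(slice_size * (1.0 - overlap))
    xs = list(range(0, max(img_w - slice_size, 0) + 1, stride))
    ys = list(range(0, max(img_h - slice_size, 0) + 1, stride))
    # Ensure the right and bottom borders are covered by a final slice.
    if xs[-1] + slice_size < img_w:
        xs.append(img_w - slice_size)
    if ys[-1] + slice_size < img_h:
        ys.append(img_h - slice_size)
    for y0 in ys:
        for x0 in xs:
            yield x0, y0, min(x0 + slice_size, img_w), min(y0 + slice_size, img_h)

The contrastive term described in the abstract can be pictured as a supervised contrastive loss applied to the outputs of the auxiliary embedding branch, one vector per ground-truth-matched detection. The version below is a generic formulation, not necessarily the authors' exact loss.

# Minimal sketch of a supervised contrastive loss over per-instance embeddings.
# The temperature and the same-class pairing rule are illustrative assumptions.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """embeddings: (N, D) instance vectors; labels: (N,) class ids."""
    z = F.normalize(embeddings, dim=1)                 # compare by cosine similarity
    sim = (z @ z.t()) / temperature                    # (N, N) pairwise logits
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, -1e9)             # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Average log-probability of same-class pairs, per anchor that has at
    # least one positive in the batch; negate so that lower loss means
    # same-class embeddings are closer than different-class embeddings.
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0
    mean_log_prob_pos = (log_prob * pos_mask.float()).sum(dim=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()

In a two-stage pipeline such as the one described, the slicing routine would feed the first training stage, and a term like the contrastive loss above would be added, with some weighting factor, to the standard YOLO box, class, and distribution losses.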
