CLIP-to-Seg Distillation for Zero-shot Semantic Segmentation
Author(s) -
Jialei Chen,
Zhenzhen Quan,
Chenkai Zhang,
Xu Zheng,
Daisuke Deguchi,
Hiroshi Murase
Publication year - 2025
Publication title -
IEEE Transactions on Circuits and Systems for Video Technology
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.873
H-Index - 168
eISSN - 1558-2205
pISSN - 1051-8215
DOI - 10.1109/tcsvt.2025.3616588
Subject(s) - Components, Circuits, Devices and Systems; Communication, Networking and Broadcast Technologies; Computing and Processing; Signal Processing and Analysis
CLIP has greatly advanced zero-shot segmentation by leveraging its strong visual-language association and generalization capability. However, directly adapting CLIP for segmentation often yields suboptimal results due to inconsistencies between image and pixel-level prediction objectives. Additionally, merely combining segmentation and CLIP models often leads to disjoint optimization, introducing significant computational overhead and additional parameters. To address these issues, we propose a novel CLIP-to-Seg Distillation approach, incorporating global and local distillation to flexibly transfer CLIP’s powerful zero-shot generalization capability to existing closed-set segmentation models. Global distillation leverages CLS tokens to condense segmentation features and distills high-level concepts to the segmentation model via image-level features. Local distillation adapts CLIP’s local semantic transferability to dense prediction tasks using object-level features, aided by pseudo-mask generation for latent class mining. To further generalize the CLIP-distilled segmentation model, we generate latent text embeddings for the mined latent classes by coordinating their text embeddings and dense features. Our method equips existing closed-set segmentation models with strong generalization capabilities for open concepts through effective and flexible CLIP-to-Seg distillation. Without relying on the CLIP model or introducing extra inference overhead, our method seamlessly integrates into existing closed-set segmentation models and enables zero-shot capability, achieving state-of-the-art performance on multiple benchmarks.
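The two distillation terms described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the abstract does not specify the pooling or the loss, so the CLS-token-as-query attention pooling, the masked average pooling, the cosine-distance loss, and every function name below (`global_distill_loss`, `local_distill_loss`, and the helpers) are assumptions chosen only to make the idea concrete. Segmentation features are assumed to be already projected into the same embedding dimension as the CLIP features.

```python
import math

EPS = 1e-12  # guards against division by zero for all-zero vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + EPS)


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pool(feats, weights):
    """Weighted sum of per-pixel feature vectors."""
    dim = len(feats[0])
    return [sum(w * f[d] for w, f in zip(weights, feats)) for d in range(dim)]


def global_distill_loss(seg_feats, clip_cls):
    """Assumed global term: the CLIP CLS token scores every pixel feature,
    the scores condense the dense features into one image-level vector,
    and that vector is pulled toward the CLS token by cosine distance."""
    scale = math.sqrt(len(clip_cls))
    scores = [sum(a * b for a, b in zip(f, clip_cls)) / scale for f in seg_feats]
    pooled = weighted_pool(seg_feats, softmax(scores))
    return 1.0 - cosine(pooled, clip_cls)


def local_distill_loss(seg_feats, masks, clip_obj_feats):
    """Assumed local term: each binary pseudo-mask average-pools the dense
    features into one object-level vector, which is matched to the CLIP
    feature of the same region. Empty masks are skipped."""
    losses = []
    for mask, target in zip(masks, clip_obj_feats):
        area = sum(mask)
        if area == 0:
            continue
        pooled = weighted_pool(seg_feats, [m / area for m in mask])
        losses.append(1.0 - cosine(pooled, target))
    return sum(losses) / len(losses) if losses else 0.0


# Toy example: three pixels with 2-D features (all values are made up).
seg_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
clip_cls = [1.0, 1.0]
masks = [[1, 0, 0], [0, 1, 1]]
clip_obj_feats = [[1.0, 0.0], [0.0, 1.0]]

g_loss = global_distill_loss(seg_feats, clip_cls)
l_loss = local_distill_loss(seg_feats, masks, clip_obj_feats)
print(g_loss, l_loss)
```

In this sketch both terms depend only on features that could be precomputed from CLIP during training, which is consistent with the abstract's claim that the CLIP model is not needed at inference time; how the paper actually realizes this is not stated in the abstract.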