CLIP-to-Seg Distillation for Zero-shot Semantic Segmentation



Open Access


Author(s) - Jialei Chen, Zhenzhen Quan, Chenkai Zhang, Xu Zheng, Daisuke Deguchi, Hiroshi Murase

Publication year - 2025

Publication title - IEEE Transactions on Circuits and Systems for Video Technology

Language(s) - English

Resource type - Magazines

SCImago Journal Rank - 0.873

H-Index - 168

eISSN - 1558-2205

pISSN - 1051-8215

DOI - 10.1109/tcsvt.2025.3616588

Subject(s) - Components, Circuits, Devices and Systems; Communication, Networking and Broadcast Technologies; Computing and Processing; Signal Processing and Analysis

CLIP has greatly advanced zero-shot segmentation by leveraging its strong visual-language association and generalization capability. However, directly adapting CLIP for segmentation often yields suboptimal results due to inconsistencies between image-level and pixel-level prediction objectives. Additionally, merely combining segmentation and CLIP models often leads to disjoint optimization, introducing significant computational overhead and additional parameters. To address these issues, we propose a novel CLIP-to-Seg Distillation approach, incorporating global and local distillation to flexibly transfer CLIP’s powerful zero-shot generalization capability to existing closed-set segmentation models. Global distillation leverages CLS tokens to condense segmentation features and distills high-level concepts to the segmentation model via image-level features. Local distillation adapts CLIP’s local semantic transferability to dense prediction tasks using object-level features, aided by pseudo-mask generation for latent class mining. To further generalize the CLIP-distilled segmentation model, we generate latent text embeddings for the mined latent classes by coordinating their text embeddings and dense features. Our method equips existing closed-set segmentation models with strong generalization capabilities for open concepts through effective and flexible CLIP-to-Seg distillation. Without relying on the CLIP model or introducing extra inference overhead, our method seamlessly integrates into existing closed-set segmentation models and enables zero-shot capability, achieving state-of-the-art performance on multiple benchmarks.
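The abstract does not give the loss formulations, so the following is only a rough sketch of what global and local feature distillation might look like, not the authors' implementation. It assumes cosine distance as the distillation loss, softmax attention pooling with the teacher CLS token as the query for the global branch, and masked average pooling for the local branch. All function names (`global_distill_loss`, `local_distill_loss`, and the helpers) are hypothetical, and the code is deliberately framework-free; a real pipeline would use a tensor library and features from actual CLIP and segmentation models.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity with a small epsilon for numerical safety."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + 1e-8)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_mean(vectors, weights):
    """Weighted average of vectors; weights need not sum to one."""
    total = sum(weights) + 1e-8
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total
            for d in range(dim)]


def global_distill_loss(seg_feats, clip_cls):
    """Image-level distillation (assumed form).

    Per-pixel student features are condensed into one vector by attention
    pooling, using the teacher CLS token as the query, and then compared
    with that CLS token by cosine distance.
    """
    attn = softmax([dot(f, clip_cls) for f in seg_feats])
    pooled = weighted_mean(seg_feats, attn)
    return 1.0 - cosine(pooled, clip_cls)


def local_distill_loss(seg_feats, masks, clip_obj_feats):
    """Object-level distillation (assumed form).

    For each (pseudo-)mask, student features are average-pooled inside the
    mask and compared with the teacher's feature for the same region.
    """
    losses = []
    for mask, target in zip(masks, clip_obj_feats):
        pooled = weighted_mean(seg_feats, mask)
        losses.append(1.0 - cosine(pooled, target))
    return sum(losses) / len(losses)


# Toy example: 4 pixels with 2-dimensional features and two region masks.
pixel_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
region_masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
teacher_regions = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for CLIP object features
teacher_cls = [1.0, 0.0]                    # stand-in for the CLIP CLS token

print("global loss:", global_distill_loss(pixel_feats, teacher_cls))
print("local loss:", local_distill_loss(pixel_feats, region_masks, teacher_regions))
```

In this sketch the teacher features are needed only to compute the training losses, which is consistent with the abstract's statement that the CLIP model is not required at inference.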
