SensiMix: Sensitivity-Aware 8-bit index & 1-bit value mixed precision quantization for BERT compression | Zendy

Tairen Piao | Zendy; Ikhyun Cho | Zendy; U Kang | Zendy

AI Assistant Blog Pricing

Home ZAIA Blog

Open Access

SensiMix: Sensitivity-Aware 8-bit index & 1-bit value mixed precision quantization for BERT compression

Author(s) -

Tairen Piao,

Ikhyun Cho,

U Kang

Publication year - 2022

Publication title -

plos one

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.99

H-Index - 332

ISSN - 1932-6203

DOI - 10.1371/journal.pone.0265621

Subject(s) - computer science , quantization (signal processing) , algorithm , inference , inverse , language model , compression ratio , artificial intelligence , mathematics , physics , geometry , thermodynamics , internal combustion engine

Given a pre-trained BERT, how can we compress it to a fast and lightweight one while maintaining its accuracy? Pre-training language model, such as BERT, is effective for improving the performance of natural language processing (NLP) tasks. However, heavy models like BERT have problems of large memory cost and long inference time. In this paper, we propose S ensi M ix (Sensitivity-Aware Mixed Precision Quantization), a novel quantization-based BERT compression method that considers the sensitivity of different modules of BERT. S ensi M ix effectively applies 8-bit index quantization and 1-bit value quantization to the sensitive and insensitive parts of BERT, maximizing the compression rate while minimizing the accuracy drop. We also propose three novel 1-bit training methods to minimize the accuracy drop: Absolute Binary Weight Regularization, Prioritized Training, and Inverse Layer-wise Fine-tuning. Moreover, for fast inference, we apply FP16 general matrix multiplication (GEMM) and XNOR-Count GEMM for 8-bit and 1-bit quantization parts of the model, respectively. Experiments on four GLUE downstream tasks show that S ensi M ix compresses the original BERT model to an equally effective but lightweight one, reducing the model size by a factor of 8× and shrinking the inference time by around 80% without noticeable accuracy drop.

The content you want is available to Zendy users.

Already have an account? Click here to sign in.

Having issues? You can contact us here

Empowering knowledge with every search

About

About Careers Publisher Partners Contact Us

Learn

FAQs Blog Terms of Use Privacy Policy

About

Learn

Discover

Explore