Large language models (LLMs) have demonstrated impressive abilities in various domains, but their inference cost is high. Many previous studies exploit quantization methods to reduce LLM inference cost by reducing latency and memory consumption. Applying 2-bit single-precision weight quantization brings >3% accuracy loss, so state-of-the-art methods use mixed-precision quantization for LLMs (e.g. Llama2-7b) to improve accuracy. However, challenges still exist: (1) Uneven distribution in the weight matrix. (2) Large speed degradation from adding sparse outliers. (3) Time-consuming dequantization operations on GPUs. To tackle these challenges and enable fast and efficient LLM inference on GPUs, we propose the following techniques in this paper: (1) Intra-weight mixed-precision quantization. (2) Exclusive 2-bit sparse outliers with minimum speed degradation. (3) Asynchronous dequantization. We conduct extensive experiments on different model families (e.g. Llama3) and model sizes. We achieve an average of 2.91 bits per weight, counting all scales/zeros, for different models with negligible loss. As a result, with our 2/4/16 mixed-precision quantization for each weight matrix and asynchronous dequantization during inference, our design achieves an end-to-end speedup of 1.74x for Llama2-7b over the original model, and we reduce runtime cost and total cost by up to 2.53x and 2.29x, respectively, with lower GPU requirements.
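To make the bit accounting in the abstract concrete, the following is a minimal sketch of group-wise uniform quantization with a per-group scale and zero point, plus a helper that averages the storage cost per weight when some groups use 4 bits and the rest use 2 bits. It is not the paper's implementation: the function names, the asymmetric min/max quantizer, the group size, and the metadata and outlier bit counts are all assumptions chosen for illustration, and the numbers are not meant to reproduce the 2.91-bit figure.

```python
def quantize_group(weights, bits):
    """Asymmetric uniform quantization of one group of floats (illustrative).

    Returns integer codes in [0, 2**bits - 1] plus the scale and zero point
    needed to reconstruct approximate values.
    """
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    # Guard against a constant group, where hi == lo would give scale 0.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + zero for c in codes]


def average_bits(frac_4bit, group_size, meta_bits_per_group,
                 outlier_frac=0.0, outlier_bits=16):
    """Average storage per weight under a hypothetical 2/4/16-bit mix.

    frac_4bit: fraction of weights stored in 4-bit groups (rest are 2-bit).
    meta_bits_per_group: bits spent on the scale and zero point of one group.
    outlier_frac, outlier_bits: fraction of weights kept as sparse outliers
    and the assumed cost of each (index storage is ignored here).
    """
    base = (1.0 - frac_4bit) * 2 + frac_4bit * 4
    meta = meta_bits_per_group / group_size
    outliers = outlier_frac * outlier_bits
    return base + meta + outliers


if __name__ == "__main__":
    group = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.25, -0.75, 0.9]
    for bits in (2, 4):
        codes, scale, zero = quantize_group(group, bits)
        restored = dequantize_group(codes, scale, zero)
        max_err = max(abs(a - b) for a, b in zip(group, restored))
        print(bits, "bits: max reconstruction error", max_err)
    # Hypothetical mix: 25% of weights in 4-bit groups, group size 128,
    # 32 bits of scale/zero metadata per group, no outliers.
    print("average bits per weight:", average_bits(0.25, 128, 32))
```

The point of the sketch is that the per-group scale and zero point are not free: their cost is amortized over the group size and added to the raw 2-bit or 4-bit code width, which is why an average such as 2.91 bits can sit above 2 even when most weights are stored in 2 bits.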

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization