The inference of large language models (LLMs) requires immense computation and memory resources. To curtail these costs, quantisation has emerged as a promising solution, but existing LLM quantisation mainly focuses on 8-bit. In this work, we explore the statistical and learning properties of the LLM layer and attribute the bottleneck of LLM quantisation to numerical scaling offsets. To address this, we adapt block quantisations for LLMs, a family of methods that share scaling factors across packed numbers. Block quantisations efficiently reduce the numerical scaling offsets solely from an arithmetic perspective, without additional treatments in the computational path. Our nearly-lossless quantised 6-bit LLMs achieve a $19\times$ higher arithmetic density and $5\times$ memory density than the float32 baseline, surpassing the prior art 8-bit quantisation by $2.5\times$ in arithmetic density and $1.2\times$ in memory density, without requiring any data calibration or re-training. We also share our insights into sub-8-bit LLM quantisation, including the mismatch between activation and weight distributions, optimal fine-tuning strategies, and a lower quantisation granularity inherent in the statistical properties of LLMs. The latter two tricks enable nearly-lossless 4-bit LLMs on downstream tasks. Our code is open-sourced.
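The idea of sharing one scaling factor across a block of packed numbers can be illustrated with a minimal sketch. This is not the authors' code: the function names `block_quantise` and `block_dequantise`, the block size of 16, the power-of-two scale, and the signed-integer element format are all assumptions chosen for illustration, and the abstract does not specify which block formats the paper actually uses.

```python
import numpy as np


def block_quantise(x, bits=6, block_size=16):
    """Quantise a 1-D array using one shared power-of-two scale per block.

    Hypothetical helper for illustration only; parameters are assumptions.
    Returns (integer codes, per-block scales, original length).
    """
    x = np.asarray(x, dtype=np.float32)
    # Zero-pad so the array splits evenly into blocks.
    pad = (-x.size) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    # Largest magnitude representable by a signed `bits`-bit integer.
    qmax = 2 ** (bits - 1) - 1

    # One scale per block: the smallest power of two such that the
    # block's largest magnitude still fits within [-qmax, qmax].
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    safe = np.where(max_abs > 0, max_abs, 1.0)  # avoid log2(0) for all-zero blocks
    scale = np.exp2(np.ceil(np.log2(safe / qmax))).astype(np.float32)

    # Round every element to the nearest integer multiple of its block scale.
    q = np.clip(np.rint(blocks / scale), -qmax, qmax).astype(np.int8)
    return q, scale, x.size


def block_dequantise(q, scale, n):
    """Reconstruct approximate float32 values from codes and block scales."""
    return (q.astype(np.float32) * scale).reshape(-1)[:n]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=100).astype(np.float32)
    q, scale, n = block_quantise(x)
    x_hat = block_dequantise(q, scale, n)
    print("max abs error:", float(np.abs(x - x_hat).max()))
```

Because each block carries its own scale, a block of small values is not forced onto the coarse grid needed by a block containing an outlier, which is one way to read the "numerical scaling offsets" the abstract refers to. Under the assumed parameters, and if each block scale were stored as an 8-bit exponent, the cost would be 6 + 8/16 = 6.5 bits per value, about 4.9 times denser than float32; whether this matches the configuration behind the reported $5\times$ figure is not stated in the abstract.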

Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?