Recent advances in self-supervised learning and the Transformer architecture have significantly improved natural language processing (NLP), achieving remarkably low perplexity. However, the growing size of NLP models introduces a memory wall problem during the generation phase. To mitigate this issue, recent efforts have focused on quantizing model weights to sub-4-bit precision while preserving full precision for activations, resulting in practical speed-ups during inference on a single GPU. However, these improvements primarily stem from reduced memory movement, which necessitates a resource-intensive dequantization process rather than actual computational reduction. In this paper, we introduce LUT-GEMM, an efficient kernel for quantized matrix multiplication, which not only eliminates the resource-intensive dequantization process but also reduces computational costs compared to previous kernels for weight-only quantization. Furthermore, we propose group-wise quantization to offer a flexible trade-off between compression ratio and accuracy. The impact of LUT-GEMM is facilitated by implementing high compression ratios through low-bit quantization and efficient LUT-based operations. We show experimentally that when applied to the OPT-175B model with 3-bit quantization, LUT-GEMM substantially reduces token generation latency, achieving a remarkable 2.1$\times$ improvement on a single GPU when compared to OPTQ, which relies on the costly dequantization process.
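
The idea of replacing dequantization with table lookups can be illustrated with a minimal pure-Python sketch. It is not the paper's GPU kernel. It assumes, as an illustration only, 1-bit weights stored as +1/-1 codes with one scale per row, and a hypothetical sub-vector length `MU`; the function names (`build_lut`, `pack_row`, `lut_matvec`) are invented for this example. For each chunk of the activation vector, all possible signed sums are precomputed once, and every weight row then reuses them through a lookup instead of a multiply-accumulate.

```python
MU = 4  # hypothetical sub-vector length; each table has 2**MU entries


def build_lut(x_chunk):
    """Precompute every signed sum of x_chunk.

    Entry k adds x_chunk[j] if bit j of k is 1 and subtracts it otherwise.
    """
    table = []
    for k in range(1 << len(x_chunk)):
        s = 0.0
        for j, xj in enumerate(x_chunk):
            s += xj if (k >> j) & 1 else -xj
        table.append(s)
    return table


def pack_row(signs, mu=MU):
    """Pack a row of +1/-1 codes into integer keys, one key per chunk of mu codes."""
    keys = []
    for start in range(0, len(signs), mu):
        key = 0
        for j, s in enumerate(signs[start:start + mu]):
            if s > 0:
                key |= 1 << j
        keys.append(key)
    return keys


def lut_matvec(packed_rows, scales, x, mu=MU):
    """Compute y[r] = scales[r] * sum_j signs[r][j] * x[j] via table lookups."""
    # One table per activation chunk, shared by all weight rows.
    luts = [build_lut(x[start:start + mu]) for start in range(0, len(x), mu)]
    y = []
    for keys, alpha in zip(packed_rows, scales):
        acc = 0.0
        for lut, key in zip(luts, keys):
            acc += lut[key]  # lookup replaces mu multiply-adds
        y.append(alpha * acc)
    return y


def reference_matvec(sign_rows, scales, x):
    """Direct computation on the unpacked +1/-1 codes, used only as a check."""
    return [alpha * sum(s * xj for s, xj in zip(row, x))
            for row, alpha in zip(sign_rows, scales)]
```

In this sketch the tables are built once per activation vector and amortized over all rows, so the benefit grows with the number of output rows. Under the same assumptions, group-wise quantization would amount to replacing the single per-row scale with one scale per group of columns, applied to the partial sum of each group.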

LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models