English (United Kingdom)

https://curated-unify.zendy.io/wp-json/zendy-region/v1/featured_content/oa?rat=en

https://curated-unify.zendy.io/wp-json/zendy-region/v1/highlighted_journal/

Zendy Open

Presents the pdf analysis as premium feature

PDF Analysis

Insights

Presents the summarisation as premium feature

Summarisation

Presents the keyphrase highlighting as premium feature

Keyphrase Highlighting

Presents the zaia usage as premium feature

ZAIA

Zendy Tools

Zendy Plus

Presents the access of premium content as premium feature

Premium Content

Attention-based models have demonstrated remarkable success in various natural language understanding tasks. However, efficient execution remains a challenge for these models which are memory-bound due to their massive number of parameters. We present GOBO, a model quantization technique that compresses the vast majority (typically 99.9%) of the 32-bit floating-point parameters of state-of-the-art BERT models and their variants to 3 bits while maintaining their accuracy. Unlike other quantization methods, GOBO does not require fine-tuning nor retraining to compensate for the quantization error. We present two practical hardware applications of GOBO. In the first GOBO reduces memory storage and traffic and as a result inference latency and energy consumption. This GOBO memory compression mechanism is plug-in compatible with many architectures; we demonstrate it with the TPU, Eyeriss, and an architecture using Tensor Cores-like units. Second, we present a co-designed hardware architecture that also reduces computation. Uniquely, the GOBO architecture maintains most of the weights in 3b even during computation, a property that: (i) makes the processing elements area efficient, allowing us to pack more compute power per unit area, (ii) replaces most multiply-accumulations with additions, and (iii) reduces the off-chip traffic by amplifying on-chip memory capacity.

GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference