Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, LLMs' efficiency suffers from both heavy computation and memory overheads. Compression techniques like sparsification and quantization are commonly used to mitigate the gap between LLMs' computation/memory overheads and hardware capacity. However, existing GPU and transformer-based accelerators cannot efficiently process compressed LLMs, due to the following unresolved challenges: low computational efficiency, underutilized memory bandwidth, and large compilation overheads.

This paper proposes FlightLLM, enabling efficient LLM inference with a complete mapping flow on FPGAs. In FlightLLM, we highlight an innovative solution: the computation and memory overheads of LLMs can be addressed by utilizing FPGA-specific resources (e.g., DSP48 and heterogeneous memory hierarchy). First, we propose a configurable sparse DSP chain to support different sparsity patterns with high computation efficiency. Second, we propose an always-on-chip decode scheme to boost memory bandwidth with mixed-precision support. Finally, to make FlightLLM available for real-world LLMs, we propose a length adaptive compilation method to reduce the compilation overhead. Implemented on the Xilinx Alveo U280 FPGA, FlightLLM achieves 6.0× higher energy efficiency and 1.8× better cost efficiency against commercial GPUs (e.g., NVIDIA V100S) on modern LLMs (e.g., LLaMA2-7B) using vLLM and SmoothQuant under the batch size of one. FlightLLM beats the NVIDIA A100 GPU with 1.2× higher throughput using the latest Versal VHK158 FPGA.

FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs
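
The abstract names sparsification and quantization as the compression techniques whose output the accelerator must process. As a rough illustration of what such compressed weights look like, the sketch below applies N:M structured pruning followed by symmetric 8-bit quantization to a small weight vector. This is not FlightLLM's implementation: the 2:4 pattern, the int8 width, and the function names `prune_n_of_m`, `quantize_int8`, and `dequantize` are all assumptions chosen for illustration, since the abstract does not state specific patterns or bit widths.

```python
from typing import List, Tuple


def prune_n_of_m(weights: List[float], n: int = 2, m: int = 4) -> List[float]:
    """Keep the n largest-magnitude values in each group of m; zero the rest.

    The 2:4 default is an assumed example pattern, not taken from the paper.
    """
    if len(weights) % m != 0:
        raise ValueError("weight count must be a multiple of m")
    pruned: List[float] = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest magnitudes within this group.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to integer codes in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.1, 0.9, 0.0, 0.2, -0.3, 0.05]
    sparse = prune_n_of_m(weights, n=2, m=4)
    codes, scale = quantize_int8(sparse)
    print("pruned:   ", sparse)
    print("int8 codes:", codes)
    print("scale:    ", scale)
```

After pruning, at most two values in every group of four are nonzero, so only those values and their positions need to be stored and multiplied. Hardware that cannot skip the zeros or handle the narrow integer format gains little from the compression, which is the inefficiency the abstract attributes to existing GPUs and accelerators.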