Large Language Model (LLM) inference consists of two distinct phases: a prefill phase, which processes the input prompt, and a decode phase, which generates output tokens autoregressively. While the prefill phase effectively saturates GPU compute at small batch sizes, the decode phase results in low compute utilization as it generates one token at a time per request. The varying prefill and decode times also lead to imbalance across micro-batches when using pipeline parallelism, resulting in further inefficiency due to bubbles.

We present SARATHI to address these challenges. SARATHI employs chunked-prefills, which splits a prefill request into equal-sized chunks, and decode-maximal batching, which constructs a batch using a single prefill chunk and populates the remaining slots with decodes. During inference, the prefill chunk saturates GPU compute, while the decode requests 'piggyback' and cost up to an order of magnitude less compared to a decode-only batch. Chunked-prefills allows constructing multiple decode-maximal batches from a single prefill request, maximizing coverage of decodes that can piggyback. Furthermore, the uniform compute design of these batches ameliorates the imbalance between micro-batches, significantly reducing pipeline bubbles.

Our techniques yield significant improvements in inference performance across models and hardware. For the LLaMA-13B model on A6000 GPU, SARATHI improves decode throughput by up to 10x, and accelerates end-to-end throughput by up to 1.33x. For LLaMA-33B on A100 GPU, we achieve 1.25x higher end-to-end throughput and up to 4.25x higher decode throughput. When used with pipeline parallelism on GPT-3, SARATHI reduces bubbles by 6.29x, resulting in an end-to-end throughput improvement of 1.91x.
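The batching scheme described in the abstract can be illustrated with a small scheduling sketch. This is not SARATHI's implementation: the names `chunk_prefill`, `Batch`, and `decode_maximal_batches`, and the parameters `chunk_size` and `batch_size`, are hypothetical and chosen only for illustration. The sketch assumes that the same decode requests stay active while the prefill is processed, and that the final chunk may be shorter when the prompt length is not a multiple of the chunk size.

```python
from dataclasses import dataclass
from typing import List, Tuple


def chunk_prefill(prompt_len: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split a prompt into [start, end) token ranges of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]


@dataclass
class Batch:
    prefill_chunk: Tuple[int, int]  # token range of the single prefill chunk
    decode_ids: List[int]           # ids of the decode requests that piggyback


def decode_maximal_batches(
    prompt_len: int,
    chunk_size: int,
    decode_ids: List[int],
    batch_size: int,
) -> List[Batch]:
    """Build one batch per prefill chunk and fill the remaining slots with decodes."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    decode_slots = batch_size - 1  # one slot is reserved for the prefill chunk
    return [
        # Assumption: every listed decode request is still running, so each
        # batch carries the same decodes, which each produce one token per batch.
        Batch(prefill_chunk=chunk, decode_ids=list(decode_ids[:decode_slots]))
        for chunk in chunk_prefill(prompt_len, chunk_size)
    ]


if __name__ == "__main__":
    for batch in decode_maximal_batches(1024, 256, [1, 2, 3, 4, 5], batch_size=4):
        print(batch)
```

In this sketch a 1024-token prompt with a chunk size of 256 yields four batches, so each piggybacking decode request advances by four tokens while the prefill completes. A real scheduler would also have to handle decode requests that finish mid-prefill and the KV-cache state shared between chunks, both of which are omitted here.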

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills