We provide an optimized implementation of the forward pass of FlashAttention-2, a popular memory-aware scaled dot-product attention algorithm, as a custom fused CUDA kernel targeting NVIDIA Hopper architecture and written using the open-source CUTLASS library. In doing so, we explain the challenges and techniques involved in fusing online-softmax with back-to-back GEMM kernels, utilizing the Hopper-specific Tensor Memory Accelerator (TMA) and Warpgroup Matrix-Multiply-Accumulate (WGMMA) instructions, defining and transforming CUTLASS Layouts and Tensors, overlapping copy and GEMM operations, and choosing optimal tile sizes for the Q, K and V attention matrices while balancing the register pressure and shared memory utilization. In head-to-head benchmarks on a single H100 PCIe GPU for some common choices of hyperparameters, we observe 20-50% higher FLOPs/s over a version of FlashAttention-2 optimized for last-generation NVIDIA Ampere architecture.

A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library
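
The abstract's central idea, fusing online-softmax with two back-to-back GEMMs, can be illustrated on the CPU. The following is a minimal pure-Python sketch, not the paper's CUDA/CUTLASS kernel: it uses no TMA, WGMMA, or shared memory, and the function names `naive_attention` and `tiled_attention` and the `tile` parameter are hypothetical, chosen only for this illustration. It shows that visiting K and V one tile at a time, while keeping a running row maximum, a running row sum, and an unnormalised accumulator, gives the same result as computing the full softmax first.

```python
import math


def naive_attention(Q, K, V):
    """Reference: softmax(Q K^T / sqrt(d)) V, building each full score row."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)  # subtract the row max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, tile=2):
    """Same computation, streaming K/V in tiles with an online softmax."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        row_max = float("-inf")  # running maximum of the scores seen so far
        row_sum = 0.0            # running softmax denominator
        acc = [0.0] * dv         # unnormalised output accumulator
        for start in range(0, len(K), tile):
            k_tile = K[start:start + tile]
            v_tile = V[start:start + tile]
            # First GEMM of the pair: scores for this tile, S = q K_tile^T.
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
            new_max = max(row_max, max(scores))
            # Rescale earlier partial results to the new maximum.
            # On the first tile this is exp(-inf) == 0.0.
            correction = math.exp(row_max - new_max)
            probs = [math.exp(s - new_max) for s in scores]
            row_sum = row_sum * correction + sum(probs)
            # Second GEMM of the pair: acc = acc * correction + P V_tile.
            acc = [a * correction + sum(p * v[j] for p, v in zip(probs, v_tile))
                   for j, a in enumerate(acc)]
            row_max = new_max
        # Normalise once at the end instead of once per tile.
        out.append([a / row_sum for a in acc])
    return out
```

Because each tile's scores are consumed by the second product immediately, the full score matrix never has to be stored. In a fused GPU kernel the per-row state in this sketch would correspond to values held in registers, which is one reason tile size interacts with register pressure as the abstract notes.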