FlashAttention-4: 1613 TFLOPs/s, 2.7x faster than Triton, written in Python. What it means for inference.

Via r/LocalLlama
Tuesday, Mar 24, 2026 · 12:31AM
Summary

Wrote a deep dive on FlashAttention-4 (03/05/2026) that's relevant for anyone thinking about inference performance. TL;DR for inference:
- BF16 forward: 1,613 TFLOPs/s on B200 (71% utilization). Attention is basically at matmul speed now.
- 2.1-2.7x faster than Triton, up to 1.3x faster than cuDNN 9.13.
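As a quick sanity check on the utilization figure: the standard FLOP count for an attention forward pass is 4·B·H·S²·D, since the two matmuls (QKᵀ and PV) each cost 2·B·H·S²·D. A minimal sketch below, assuming a B200 dense BF16 peak of roughly 2,250 TFLOPs/s (a commonly cited spec, not from the post) and purely illustrative workload shapes:

```python
# Back-of-the-envelope check on the 71% utilization claim.
# Assumptions (not from the post): B200 dense BF16 peak ~2,250 TFLOPs/s,
# and a hypothetical batch/heads/sequence/head_dim workload.

def attention_fwd_flops(batch: int, heads: int, seq_len: int, head_dim: int) -> int:
    # Two matmuls dominate: Q @ K^T and P @ V, each 2*B*H*S*S*D FLOPs.
    return 4 * batch * heads * seq_len**2 * head_dim

B200_BF16_PEAK_TFLOPS = 2250.0  # assumed spec, dense BF16

# Hypothetical workload: batch 4, 32 heads, 8K context, head_dim 128.
flops = attention_fwd_flops(batch=4, heads=32, seq_len=8192, head_dim=128)
print(f"FLOPs per forward pass: {flops / 1e12:.2f} TFLOPs")

# A kernel sustaining the reported 1,613 TFLOPs/s against that peak:
achieved_tflops = 1613.0
utilization = achieved_tflops / B200_BF16_PEAK_TFLOPS
print(f"Utilization: {utilization:.0%}")  # ~72%, consistent with the post's 71%
```

Under those assumptions the arithmetic lands within a point of the post's 71%, which is what "attention at matmul speed" means in practice: the kernel is bounded by tensor-core throughput, not by memory or scheduling overhead.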

Read the full post at r/LocalLlama (www.reddit.com).