Wrote a deep dive on FlashAttention-4 (03/05/2026) that's relevant for anyone thinking about inference performance. TL;DR for inference: the BF16 forward pass hits 1,613 TFLOP/s on a B200 (71% utilization), which means attention is now running at essentially matmul speed. That's 2.1-2.7x faster than Triton and up to 1.3x faster than cuDNN 9.13.
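
For context on where a figure like "1,613 TFLOP/s (71% utilization)" comes from, here's a minimal sketch of the standard FLOP accounting for an attention forward pass. This is not the post's benchmark harness: the shapes, the ~2,250 TFLOP/s B200 dense BF16 peak, and the use of PyTorch's `scaled_dot_product_attention` (which dispatches to whatever backend your build selects, not necessarily FlashAttention-4) are all assumptions for illustration.

```python
# Sketch: FLOP accounting for an attention forward pass, to show how a
# TFLOP/s and utilization figure is typically derived. Shapes, peak FLOPS,
# and the SDPA call are illustrative assumptions, not the post's setup.
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, sequence length, head dim) -- assumptions.
B, H, S, D = 4, 32, 8192, 128
PEAK_TFLOPS_B200_BF16 = 2250.0  # approximate dense BF16 peak; check NVIDIA specs

q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

# Forward attention FLOPs: 2*S^2*D for Q@K^T plus 2*S^2*D for P@V,
# per head and per batch element.
flops = 4 * B * H * S * S * D

# Warm up, then time with CUDA events. SDPA may use any available attention
# backend here, so the measured number reflects that backend, not FA-4.
for _ in range(3):
    F.scaled_dot_product_attention(q, k, v)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 20
start.record()
for _ in range(iters):
    F.scaled_dot_product_attention(q, k, v)
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1e3 / iters
achieved_tflops = flops / seconds / 1e12
print(f"{achieved_tflops:.0f} TFLOP/s, "
      f"{100 * achieved_tflops / PEAK_TFLOPS_B200_BF16:.0f}% of assumed peak")
```

Running the same accounting backwards on the post's numbers: 1,613 TFLOP/s divided by roughly 2,250 TFLOP/s of dense BF16 peak gives about 71%, which is why "attention at matmul speed" is the headline takeaway.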