Wrote a deep dive on FlashAttention-4 (03/05/2026) that's relevant for anyone thinking about inference performance. TL;DR for inference: the BF16 forward pass hits 1,613 TFLOP/s on a B200 (71% utilization), which means attention is now running at essentially matmul speed. That's 2.1-2.7x faster than Triton and up to 1.3x faster than cuDNN 9.13.
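
For context on where a figure like "1,613 TFLOP/s (71% utilization)" comes from, here's a minimal sketch of the standard FLOP accounting for an attention forward pass. This is not the post's benchmark harness: the shapes, the ~2,250 TFLOP/s B200 dense BF16 peak, and the use of PyTorch's `scaled_dot_product_attention` (which dispatches to whatever backend your build selects, not necessarily FlashAttention-4) are all assumptions for illustration.

```python
# Sketch: FLOP accounting for an attention forward pass, to show how a
# TFLOP/s and utilization figure is typically derived. Shapes, peak FLOPS,
# and the SDPA call are illustrative assumptions, not the post's setup.
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, sequence length, head dim) -- assumptions.
B, H, S, D = 4, 32, 8192, 128
PEAK_TFLOPS_B200_BF16 = 2250.0  # approximate dense BF16 peak; check NVIDIA specs

q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

# Forward attention FLOPs: 2*S^2*D for Q@K^T plus 2*S^2*D for P@V,
# per head and per batch element.
flops = 4 * B * H * S * S * D

# Warm up, then time with CUDA events. SDPA may use any available attention
# backend here, so the measured number reflects that backend, not FA-4.
for _ in range(3):
    F.scaled_dot_product_attention(q, k, v)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 20
start.record()
for _ in range(iters):
    F.scaled_dot_product_attention(q, k, v)
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1e3 / iters
achieved_tflops = flops / seconds / 1e12
print(f"{achieved_tflops:.0f} TFLOP/s, "
      f"{100 * achieved_tflops / PEAK_TFLOPS_B200_BF16:.0f}% of assumed peak")
```

Running the same accounting backwards on the post's numbers: 1,613 TFLOP/s divided by roughly 2,250 TFLOP/s of dense BF16 peak gives about 71%, which is why "attention at matmul speed" is the headline takeaway.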