Overcoming Compute and Memory Bottlenecks: FlashAttention-4 performance on NVIDIA Blackwell


By Agustin Giovagnoli / January 22, 2026

Modern GPUs are compute-rich but memory-starved, and attention makes that imbalance worse. FlashAttention-4 performance on NVIDIA Blackwell matters because it re-architects attention to reduce O(N²) memory traffic, translating into higher tokens-per-second and better power efficiency for LLM inference on B200-class systems [1][2].

Executive summary: what FA4 delivers on Blackwell

FlashAttention-4 (FA4) is a new generation of attention kernels co-designed for NVIDIA Blackwell that maximizes on-chip data reuse and overlaps compute with data movement. On Blackwell, FA4 reaches about 1,605 TFLOPs/s FP16—roughly 71% of peak—and delivers up to around 1.2× speedup over cuDNN attention and approximately 2.4× over Triton baselines at long sequence lengths [1][2][4]. In practice, this improves prefill throughput and power efficiency for LLM inference, boosting tokens-per-second and reducing per-token cost on B200-class systems [1][2]. Earlier FlashAttention work showed multi-fold speedups and large memory savings versus naïve attention by fusing matmul, softmax, and dropout into a single IO-aware kernel—an approach FA4 extends for Blackwell [1][2].

The problem: compute-rich but memory-starved GPUs

Blackwell’s compute scales faster than its HBM bandwidth, making attention increasingly memory-bound rather than compute-bound, especially at long contexts. Standard attention incurs O(N²) memory accesses as keys, queries, and values repeatedly travel to and from HBM, dominating runtime over flops. Memory-efficient attention on Blackwell reduces this traffic by keeping data on chip and minimizing redundant reads/writes [1][2][3].
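The scale of that traffic gap can be sketched with a back-of-envelope model. The sketch below is illustrative only—real kernels differ in caching, tiling factors, and layout—but it shows why materializing the N×N score matrix in HBM dominates at long contexts, which is exactly what IO-aware fusion avoids:

```python
# Back-of-envelope HBM traffic for standard vs. IO-aware fused attention.
# Illustrative model only; real kernels differ in caching and layout.

def attn_bytes_naive(n, d, dtype_bytes=2):
    """Standard attention materializes the n x n score matrix in HBM:
    write S = QK^T, read S for softmax, write P, read P for P @ V."""
    qkv = 3 * n * d * dtype_bytes        # read Q, K, V once
    scores = 4 * n * n * dtype_bytes     # write S, read S, write P, read P
    out = n * d * dtype_bytes            # write O
    return qkv + scores + out

def attn_bytes_fused(n, d, dtype_bytes=2):
    """IO-aware fused attention keeps scores on chip; only Q, K, V, O
    touch HBM (ignoring block re-reads, which add a small factor)."""
    return (3 * n * d + n * d) * dtype_bytes

# At a 32K context with head dimension 128, the naive path moves
# hundreds of times more bytes than the fused path:
n, d = 32_768, 128
print(attn_bytes_naive(n, d) / attn_bytes_fused(n, d))  # 257.0
```

The ratio grows linearly with sequence length, which is why the gains reported below are largest at long contexts.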

What FA4 changes: technical innovations

FA4 introduces a Blackwell-specific, warp-specialized five-stage pipeline that tightly fuses attention sub-operations to maximize on-chip reuse and reduce HBM traffic. By restructuring the kernel to overlap compute with data movement and keep intermediates on-chip, FA4 drives higher throughput under memory pressure [1][2].

Two additional innovations remove softmax bottlenecks:

  • Software exponentials on CUDA cores offload pressure from limited SFUs, improving softmax throughput.
  • Adaptive online rescaling maintains numerical stability without redundant passes over activations, avoiding extra memory movement [1][2][4].
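The rescaling idea builds on the classic FlashAttention online softmax: maintain a running maximum and normalizer so one pass over score blocks suffices, rescaling the accumulator whenever the maximum grows. The NumPy sketch below shows that baseline mechanism; FA4's adaptive variant refines when rescaling is triggered, which is not modeled here:

```python
import numpy as np

def online_softmax_weighted_sum(scores, values):
    """Single pass over score/value pairs with a running max and
    rescaling, so no second full pass over activations is needed.
    This is the classic FlashAttention-style online softmax; FA4's
    adaptive rescaling decides when to rescale, not shown here."""
    m = -np.inf                                   # running max
    l = 0.0                                       # running normalizer
    acc = np.zeros_like(values[0], dtype=np.float64)
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = np.exp(m - m_new)                 # rescale old accumulator
        p = np.exp(s - m_new)
        l = l * scale + p
        acc = acc * scale + p * v
        m = m_new
    return acc / l

# Matches a standard two-pass softmax-weighted sum:
scores = np.array([0.5, 2.0, -1.0])
values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w = np.exp(scores - scores.max())
ref = (w / w.sum()) @ values
print(np.allclose(online_softmax_weighted_sum(scores, values), ref))  # True
```

Because the running max only ever increases, `exp(s - m_new)` never overflows, which is what keeps the streaming computation numerically stable.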

These choices cut memory stalls and unlock more sustained utilization—building on earlier FlashAttention versions that already achieved high peak FLOPs utilization on prior GPU generations [2][4].

Performance deep dive: FlashAttention-4 performance on NVIDIA Blackwell

Benchmarks report FA4 at roughly 1,605 TFLOPs/s FP16 on Blackwell, or about 71% of the architecture’s peak. At long sequence lengths, FA4 delivers around 20–22% faster throughput than cuDNN attention (roughly 1.2×) and up to about 2.4× over Triton baselines. These gains translate directly into higher tokens-per-second, better prefill rates, and improved power efficiency during LLM inference. Earlier FlashAttention work also demonstrated that IO-aware fusion yields multi-fold speedups and order-of-magnitude memory savings versus naïve PyTorch attention, enabling longer contexts on fixed-memory GPUs [1][2][4].
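The headline numbers are internally consistent and worth sanity-checking when comparing against your own runs (the arithmetic below uses the figures cited above; the implied peak is derived, not a vendor specification):

```python
# Sanity-check the reported figures: sustained TFLOPs and utilization
# imply the FP16 peak, and a speedup factor maps to a percent gain.
fa4_sustained_tflops = 1605.0
utilization = 0.71

implied_peak = fa4_sustained_tflops / utilization
print(round(implied_peak))          # ~2261 TFLOPs/s FP16 (derived, not a spec)

speedup_vs_cudnn = 1.22             # midpoint of the reported 20-22% range
print(round((speedup_vs_cudnn - 1) * 100))  # 22 (% faster than cuDNN)
```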

For teams optimizing LLM inference on Blackwell, these results highlight where FA4 shines: long-context sequences where memory movement dominates and kernel fusion can keep more of the attention computation on chip [1][2].

Practical benefits for operators

  • Higher tokens/sec on B200-class systems by cutting attention’s memory stalls [1][2].
  • Lower per-token cost through better throughput and power efficiency [1][2].
  • Longer context windows on the same hardware due to reduced O(N²) memory traffic [1][2].

The biggest ROI often appears in prefill-heavy and long-context workloads where attention dominates runtime. For many stacks, a migration decision will center on FA4 vs cuDNN attention gains under target sequence lengths and batch shapes [1][2][4].

Ecosystem and integration status

Ecosystem integration is underway: FA4 techniques are being upstreamed into cuDNN, and frameworks like vLLM and SGLang already support FA4-style prefill. This paves a path to broader adoption as vendor libraries and serving frameworks roll in these kernels and patterns [1][2]. For background on the platform, see NVIDIA's Blackwell architecture documentation.

Limitations and migration considerations

The current FA4 reference implementation is Blackwell-only and forward-only; it lacks backward-pass (training), variable-length sequence, and grouped-query attention (GQA) support. Full training support and broader feature parity will depend on future releases. Teams should validate workload fit, framework readiness, and fallbacks before production rollout [1][2].
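In practice, those gaps suggest gating FA4 behind a dispatch check and falling back to a vendor path otherwise. The function below is a hypothetical sketch—the flag names and backend labels are illustrative, not any framework's actual API—but it captures the decision logic the limitations imply:

```python
def select_attention_backend(training, varlen, gqa, on_blackwell,
                             fa4_available=True):
    """Hypothetical dispatch sketch: route to the FA4 reference kernel
    only for the cases it currently covers (Blackwell, forward-only,
    fixed-length, non-GQA); otherwise fall back to a vendor path such
    as cuDNN attention. Flag names here are illustrative."""
    needs_unsupported_feature = training or varlen or gqa
    if fa4_available and on_blackwell and not needs_unsupported_feature:
        return "fa4"
    return "cudnn"

print(select_attention_backend(training=False, varlen=False,
                               gqa=False, on_blackwell=True))   # fa4
print(select_attention_backend(training=True, varlen=False,
                               gqa=False, on_blackwell=True))   # cudnn
```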

Deployment checklist and next steps

  • Confirm Blackwell hardware availability (e.g., B200 in HGX systems) and driver/runtime compatibility [1].
  • Benchmark FA4 vs cuDNN attention at target sequence lengths; track tokens/sec, prefill throughput, and energy per token [1][2][4].
  • Verify framework paths: vLLM/SGLang for FA4-style prefill; monitor cuDNN upstreaming status [1][2].
  • Plan fallbacks for unsupported features (backward, variable-length, GQA) [1][2].
  • Document gains and costs to inform a staged rollout.
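For the benchmarking step, a simple harness that sweeps sequence lengths and reports tokens/sec keeps backend comparisons consistent. The sketch below uses a plain Python callable as a stand-in for a real kernel invocation; when timing GPU backends, the callable must include device synchronization or the timings will be meaningless:

```python
import time

def benchmark_attention(fn, seq_lens, batch, reps=5, warmup=2):
    """Time an attention callable across sequence lengths and report
    tokens/sec. `fn(seq_len, batch)` is a stand-in for whichever
    backend is under test (FA4, cuDNN, Triton); for GPU kernels,
    synchronize the device inside `fn` before returning."""
    results = {}
    for n in seq_lens:
        for _ in range(warmup):                 # discard cold-start runs
            fn(n, batch)
        t0 = time.perf_counter()
        for _ in range(reps):
            fn(n, batch)
        dt = (time.perf_counter() - t0) / reps  # mean seconds per call
        results[n] = batch * n / dt             # tokens processed per second
    return results

# Dummy CPU workload standing in for a real attention kernel call:
print(benchmark_attention(lambda n, b: sum(range(n)), [1024, 4096], batch=8))
```

Comparing the same sweep across backends at your production batch shapes gives the per-sequence-length picture the checklist calls for.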


Conclusion: who benefits most and when to adopt

Organizations serving long-context LLMs or chasing lower per-token costs on B200-class systems stand to benefit first. FA4’s warp-specialized attention pipeline and softmax optimizations translate into measurable throughput gains under memory pressure, making it a strong candidate where attention dominates runtime. As cuDNN and major serving frameworks broaden support, evaluating FlashAttention-4 performance on NVIDIA Blackwell should be part of any near-term optimization roadmap [1][2][4].

Sources

[1] Overcoming Compute and Memory Bottlenecks with …
https://developer.nvidia.com/blog/overcoming-compute-and-memory-bottlenecks-with-flashattention-4-on-nvidia-blackwell/

[2] FlashAttention 4: Faster, Memory-Efficient Attention for LLMs
https://www.digitalocean.com/community/tutorials/flashattention-4-llm-inference-optimization

[3] Why FlashAttention solves the memory bottleneck in AI …
https://www.linkedin.com/posts/hoang-van-hao_ai-mlengineering-llm-activity-7392561476932313088-iWN7

[4] FlashAttention 4: Breaking the Petaflop Barrier in GPU Attention …
https://medium.com/@changtimwu/flashattention-4-breaking-the-petaflop-barrier-in-gpu-attention-kernels-be9444311af0
