
CUDA Tiled Matrix Multiply: High-Performance Guide & cuBLAS Tips
High-performance matrix multiplication on NVIDIA GPUs depends on how well you move data, not just how fast you multiply. A disciplined approach to CUDA tiled matrix multiply leverages the GPU memory hierarchy—global, shared, and registers—to increase arithmetic intensity and hide latency. Practitioners report that block-level shared-memory tiling on H100-class hardware can deliver more than a 4× throughput improvement over a naive, per-element kernel, underscoring why tiling matters for both training and inference workloads [5].
Quick overview: CUDA tiled matrix multiply
The core idea is hierarchical tiling. At the block level, each thread block computes a tile of the output matrix C while iterating over K-dimension tiles of A and B staged in shared memory; each staged tile is reused across many multiply-accumulate operations. Within the block, warps take ownership of sub-tiles, and at the finest granularity, each thread accumulates a small per-thread sub-tile of C in registers before writing back once to global memory. This structure improves locality, boosts effective bandwidth via coalesced accesses, and raises arithmetic intensity compared to naive kernels [1][4][5].
Block-level tiling: shared-memory loading and K-tiling
In a typical design, a block computes a C tile (e.g., M_tile × N_tile). The block advances along K in chunks: for each K-tile, threads cooperatively load corresponding tiles of A and B from global memory into shared memory, synchronize, perform a burst of MACs reusing those tiles, then proceed to the next K chunk. This reuse slashes global memory traffic and is a primary source of speedup relative to elementwise implementations [1][4][5].
Trade-offs revolve around tile size and shared-memory usage. Larger tiles can increase reuse but consume more shared memory, which may limit occupancy. The right balance depends on matrix shapes and the GPU’s shared-memory capacity [4][5].
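The K-tiling loop can be sketched as a CPU model. This is not device code: on a GPU the `As`/`Bs` buffers would live in `__shared__` memory, the loads would be cooperative across threads with a `__syncthreads()` barrier, and one iteration of the two outer loops would be one thread block. The tile size (`TILE = 4`) and the comparison against a naive product are illustrative assumptions; real kernels use much larger tiles.

```cpp
#include <cassert>
#include <vector>

// CPU model of block-level K-tiling. Each (bm, bn) pair plays the role of
// one thread block computing one TILE x TILE tile of C; the bk loop advances
// along K, staging tiles of A and B into small buffers (shared memory on a
// GPU) that are then reused TILE times each in the inner MAC loops.
constexpr int TILE = 4;  // illustrative; real kernels use larger tiles

std::vector<float> tiled_gemm(const std::vector<float>& A,
                              const std::vector<float>& B,
                              int M, int N, int K) {
    std::vector<float> C(M * N, 0.0f);
    for (int bm = 0; bm < M; bm += TILE)
        for (int bn = 0; bn < N; bn += TILE)
            for (int bk = 0; bk < K; bk += TILE) {       // advance along K
                float As[TILE][TILE], Bs[TILE][TILE];    // "shared" tiles
                for (int i = 0; i < TILE; ++i)
                    for (int j = 0; j < TILE; ++j) {
                        As[i][j] = A[(bm + i) * K + (bk + j)];
                        Bs[i][j] = B[(bk + i) * N + (bn + j)];
                    }
                // Burst of MACs reusing the staged tiles: each loaded
                // element participates in TILE multiply-accumulates.
                for (int i = 0; i < TILE; ++i)
                    for (int j = 0; j < TILE; ++j)
                        for (int k = 0; k < TILE; ++k)
                            C[(bm + i) * N + (bn + j)] += As[i][k] * Bs[k][j];
            }
    return C;
}
```

The reuse factor is visible in the structure: each staged element is touched TILE times per load, which is exactly the global-traffic reduction the section describes.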
Warp tiling and occupancy considerations
Warp tiling assigns sub-tiles of the block’s output to individual warps. This gives finer control over scheduling and can improve reuse within a warp, helping maintain high occupancy. Effective warp tiling patterns reduce the need for inter-warp communication while keeping math pipelines busy as K-tiles stream through shared memory [4][5].
Register tiling: per-thread sub-tiles and reduced write-backs
Register tiling pushes the idea down to threads. Each thread accumulates a small ROWS_PER_THREAD × COLS_PER_THREAD sub-tile of C in registers across many K iterations, then performs a single, coalesced write-back at the end. This approach increases arithmetic intensity and minimizes global memory stores [4][5].
A simplified sketch for the accumulation phase:
- Initialize a small per-thread accumulator array in registers (e.g., a few rows × cols).
- For each K-tile: load fragments from shared memory into registers, perform outer-product-style MACs into the accumulator.
- After processing all K-tiles for the block, write the accumulator back to C in global memory in a coalesced pattern [4][5].
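The three steps above can be modeled for a single thread. On a GPU, `acc`, `a_frag`, and `b_frag` would sit in registers and the fragments would come from the shared-memory tiles; here they come straight from the input arrays, and the 2×2 sub-tile size is an illustrative choice.

```cpp
#include <cassert>

// CPU model of register tiling for ONE thread: the thread owns a
// ROWS_PER_THREAD x COLS_PER_THREAD sub-tile of C, accumulates it in plain
// locals (registers on a GPU), and writes back to C exactly once at the end.
constexpr int RPT = 2, CPT = 2;  // illustrative per-thread sub-tile

void thread_register_tile(const float* A, const float* B, float* C,
                          int N, int K, int row0, int col0) {
    float acc[RPT][CPT] = {};                  // per-thread accumulator
    for (int k = 0; k < K; ++k) {              // stream the K dimension
        float a_frag[RPT], b_frag[CPT];        // fragments loaded once per k
        for (int i = 0; i < RPT; ++i) a_frag[i] = A[(row0 + i) * K + k];
        for (int j = 0; j < CPT; ++j) b_frag[j] = B[k * N + (col0 + j)];
        for (int i = 0; i < RPT; ++i)          // outer-product-style MACs
            for (int j = 0; j < CPT; ++j)
                acc[i][j] += a_frag[i] * b_frag[j];
    }
    for (int i = 0; i < RPT; ++i)              // single write-back to C
        for (int j = 0; j < CPT; ++j)
            C[(row0 + i) * N + (col0 + j)] = acc[i][j];
}
```

Note the ratio: each iteration loads RPT + CPT values but performs RPT × CPT multiply-accumulates, which is the arithmetic-intensity win register tiling buys.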
Memory efficiency: coalescing and shared-memory layout
Memory access patterns make or break performance:
- Coalesced global memory accesses: organize loads and stores so consecutive threads access consecutive addresses. This maximizes effective bandwidth and feeds the compute pipelines efficiently [1][4].
- Shared-memory bank conflicts: lay out shared tiles to minimize conflicts when threads read A and B fragments. More sophisticated layouts can reduce conflicts, though added index math and complexity may claw back some gains. Profiling helps determine if complexity is justified [4][5].
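The classic bank-conflict fix, padding each shared-memory row by one element, is pure modular arithmetic and can be checked on the CPU. The model below assumes 32 banks of 4-byte words (the common configuration on recent NVIDIA GPUs) and shows why a column read of a 32-wide tile serializes while a 33-wide (padded) tile does not.

```cpp
#include <cassert>
#include <set>

constexpr int BANKS = 32;  // assumed: 32 banks, one 4-byte word per bank

// Bank hit by element (row, col) of a row-major shared tile whose rows are
// `stride` words apart.
int bank_of(int row, int col, int stride) {
    return (row * stride + col) % BANKS;
}

// Set of distinct banks touched when the 32 threads of a warp each read
// element (t, col) of the tile -- i.e., a column read.
std::set<int> banks_hit(int col, int stride) {
    std::set<int> s;
    for (int t = 0; t < 32; ++t) s.insert(bank_of(t, col, stride));
    return s;  // size 1 = 32-way conflict; size 32 = conflict-free
}
```

With `stride = 32` every thread lands in the same bank (a 32-way conflict, serialized into 32 transactions); padding to `stride = 33` spreads the column across all 32 banks at the cost of one wasted word per row, which is the "added index math and complexity" trade-off the section mentions.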
Alternatives & shortcuts: warp shuffle vs independent threads
Warp-level shuffle instructions can move data between threads without shared memory, enabling intra-warp sharing patterns during GEMM. While powerful, many high-performance kernels instead emphasize designs where threads operate mostly independently to reduce communication overhead and simplify control flow. The decision depends on your tiling pattern, reuse needs, and profiling results [3][4].
Using vendor libraries: cuBLAS and cuSPARSE for production
For most production scenarios, vendor libraries remain the fastest path to robust performance. cuBLAS implements highly tuned dense GEMM, while cuSPARSE provides block-sparse GEMM that exploits structured sparsity and tensor cores where applicable. These libraries encapsulate advanced tiling, scheduling, and tensor-core paths that are difficult to reproduce by hand at scale [2][5]. Hand-written tiled kernels are best for learning, adapting to unusual data layouts, or niche workloads where library calls fall short [2][5].
For background on CUDA programming concepts and broader architectural context that complements GEMM-specific tuning, see the NVIDIA CUDA Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/.
Practical benchmark example and expected gains
A reasonable benchmarking workflow compares three kernels on the same shapes and precision: (1) a naive per-element kernel, (2) a block-level shared-memory tiled kernel with coalesced accesses, and (3) a kernel that adds warp and register tiling. On H100-class GPUs, moving from naive to block-level tiling alone has been shown to yield over a 4× throughput gain, with additional improvements from deeper tiling and memory-layout refinements, subject to occupancy and bank-conflict constraints [5]. Always validate with profiling to ensure that shared-memory usage, register pressure, and occupancy are aligned with your target GPU [4][5].
Best practices checklist for production deployments
- Start with libraries: prefer cuBLAS for dense GEMM and cuSPARSE for block-sparse GEMM that can leverage tensor cores [2][5].
- Validate data layout: ensure coalesced global accesses for A, B, and C [1][4].
- Tune tile sizes: balance shared-memory footprint against occupancy; iterate with a profiler [4][5].
- Minimize bank conflicts: choose shared-memory layouts that reduce conflicts without over-complicating index math [4][5].
- Keep communications lean: consider warp shuffles only when data sharing clearly outperforms independent-thread designs [3][4].
CUDA tiled matrix multiply: What to remember
The most reliable path to speed is disciplined tiling across block, warp, and register levels—paired with coalesced memory access and careful shared-memory layout. Use libraries first for production, then custom-tune when your data layout or sparsity pattern demands it. On modern NVIDIA GPUs, this approach can unlock the multi-x gains that separate merely functional kernels from truly high-performance ones [2][4][5].
Sources
[1] How to improve performance when multiply two matrices …
https://forums.developer.nvidia.com/t/how-to-improve-performance-when-multiply-two-matrices-with-large-data-in-cuda/32543
[2] Accelerating Matrix Multiplication: Block Sparse & Tensor Cores
https://developer.nvidia.com/blog/accelerating-matrix-multiplication-with-block-sparse-format-and-nvidia-tensor-cores/
[3] Matix Multiplication using __shfl? – CUDA
https://forums.developer.nvidia.com/t/matix-multiplication-using-shfl/40425
[4] [PDF] Performance Analysis of CUDA-based General Matrix Multiplication …
https://kth.diva-portal.org/smash/get/diva2:1985710/FULLTEXT01.pdf
[5] Worklog: Optimising GEMM on NVIDIA H100 for cuBLAS-like …
https://hamzaelshafie.bearblog.dev/worklog-optimising-gemm-on-nvidia-h100-for-cublas-like-performance-wip/