GPU Deep Dive — From First Principles

GPU Efficiency & SM Utilization

You have a matrix multiplication. You have a GPU. What happens next? A from-scratch walkthrough of grids, blocks, SMs, warps, and how GPU efficiency is actually calculated in TensorBoard.

NVIDIA A100 GPU (80 GB): 108 SMs

Key terms at a glance:
  • Grid: how blocks are organized
  • Block: group of threads on 1 SM
  • Warp: 32 threads in lockstep

You Have a Matrix Multiplication

Almost everything in deep learning — forward passes, backward passes, attention — boils down to matrix multiplications. Let's start with a concrete one.

The Operation: C = A × B

From our training code, the forward pass of a single linear layer does:

# Linear layer: output = input @ weight.T
# input shape: [64, 4096] (batch_size x hidden_dim)
# weight shape: [4096, 4096] (hidden_dim x hidden_dim)
# output shape: [64, 4096]

output = input @ weight.T   # This is a matrix multiplication!

Each element of the output matrix C[i][j] requires computing a dot product — multiply corresponding elements of row i from A and column j from B, then sum them all up.

Total multiply-add operations: C[i][j] = ∑(k=0..4095) A[i][k] × B[k][j], done for every (i, j)   →   64 × 4096 × 4096 = 1.07 billion operations
Key question: We have 1.07 billion independent multiply-add operations. A single CPU core does them one by one. A GPU has thousands of cores that can do many of them simultaneously. But how does the GPU organize this work?
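As a quick sanity check, the operation count follows directly from the shapes; a few lines of plain Python:

```python
# Multiply-add count for C[M,N] = A[M,K] @ B[K,N]
M, K, N = 64, 4096, 4096

macs = M * N * K                    # one multiply-add per (i, j, k) triple
print(f"{macs:,} multiply-adds")    # 1,073,741,824 ≈ 1.07 billion

# Profilers often count the multiply and the add separately (FLOPs):
flops = 2 * macs                    # ≈ 2.15 billion FLOPs
```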
Let's look at the GPU

Inside the GPU: Streaming Multiprocessors

A GPU is not just "lots of cores". It's organized into Streaming Multiprocessors (SMs): semi-independent processors, each with its own cores, registers, and shared memory.

NVIDIA A100 GPU: 108 Streaming Multiprocessors, 80 GB HBM2e, 2 TB/s memory bandwidth

[Diagram: grid of 108 SMs, each shown as fully active, partially active, or idle]

What's Inside Each SM?

  • 64 FP32 CUDA Cores
  • 32 FP64 Cores
  • 4 Tensor Cores (TF32/FP16/BF16)
  • 256 KB Register File
  • 192 KB Shared Memory / L1 Cache
  • 4 Warp Schedulers

Think of an SM Like a Factory Floor

Each SM is like a small factory with its own workers (cores), local storage (registers + shared memory), and a manager (warp scheduler).

The GPU is a collection of 108 factories. Work gets divided up and assigned to these factories. The question is: how many factories are actually busy? That's SM efficiency.

The goal: keep as many SMs busy as possible → high SM efficiency.

Threads, Warps, Blocks, and Grids

CUDA organizes parallel work into a hierarchy. Understanding this hierarchy is key to understanding GPU efficiency.

  • Grid: the entire kernel launch. Contains all blocks. One grid = one kernel call. Example: [16, 32, 1].
  • Block (thread block): a group of threads that executes on one SM. Threads in a block can share memory and synchronize. Example: [128, 1, 1].
  • Warp: 32 threads that execute the same instruction in lockstep (SIMT). The fundamental scheduling unit.
  • Thread: the smallest unit of execution. Each thread computes one (or a few) output elements.

Concrete Example from TensorBoard

The profiler shows the matmul kernel was launched with:

Grid dimensions: [16, 32, 1] → total blocks = 16 × 32 × 1 = 512
Block dimensions: [128, 1, 1] → threads per block = 128 × 1 × 1 = 128
Total threads launched: 512 blocks × 128 threads/block = 65,536
Warps per block: 128 threads ÷ 32 threads/warp = 4
Total warps: 512 blocks × 4 warps = 2,048
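All of these launch statistics are pure arithmetic on the grid and block shapes. A minimal sketch:

```python
# Launch statistics derived from grid/block dimensions (profiler values)
grid = (16, 32, 1)      # blocks along x, y, z
block = (128, 1, 1)     # threads along x, y, z
WARP_SIZE = 32

total_blocks = grid[0] * grid[1] * grid[2]            # 512
threads_per_block = block[0] * block[1] * block[2]    # 128
total_threads = total_blocks * threads_per_block      # 65,536
warps_per_block = threads_per_block // WARP_SIZE      # 4
total_warps = total_blocks * warps_per_block          # 2,048
```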

Who Decides Grid & Block Dimensions?

You see grid [16, 32, 1] and block [128, 1, 1] in the profiler. But who chose these numbers? It depends on who wrote the kernel.

Path A: Library Kernels (Your Case)

When you write output = input @ weight.T in PyTorch, you're not writing a CUDA kernel yourself. Under the hood:

1. PyTorch calls torch.mm() or torch.matmul()
2. which calls cuBLAS (NVIDIA's optimized BLAS library)
3. cuBLAS has hundreds of pre-written kernels for different matrix shapes
4. cuBLAS auto-selects the best kernel + grid/block config via heuristics & autotuning

You don't choose anything. cuBLAS picked grid [16,32,1] and block [128,1,1] because its internal heuristics determined this is optimal for a [64,4096] × [4096,4096] multiply on an A100.

Path B: Custom CUDA Kernels

If you write a CUDA kernel (e.g., in Triton or raw CUDA C++), then you decide the grid and block dimensions.

// You choose TILE_M, TILE_N, THREADS
#define TILE_M 4     // rows of C per block
#define TILE_N 128   // cols of C per block
#define THREADS 128  // threads per block

// Grid = how many tiles are needed to cover the output C.
// Integer division truncates in C++, so use the round-up idiom
// (x + y - 1) / y instead of calling ceil() on integers.
dim3 grid(
    (M + TILE_M - 1) / TILE_M,    // = 64 / 4     = 16
    (N + TILE_N - 1) / TILE_N,    // = 4096 / 128 = 32
    1
);
dim3 block(THREADS, 1, 1);

// Launch!
matmul_kernel<<<grid, block>>>(A, B, C);
You choose everything. The tile sizes and thread count determine both grid and block dimensions. Bad choices = low SM efficiency or low occupancy.

How Grid [16, 32, 1] Was Derived (The Logic)

Whether cuBLAS or you decide, the logic is the same. The output matrix C has shape [M, N]. Each block computes one tile of C:

[Diagram: output matrix C (64 rows × 4096 cols) divided into tiles B0, B1, ...: 512 tiles total, 16 tile-rows × 32 tile-columns]

Grid X (tiles along M): ceil(M / TILE_M) = ceil(64 / 4) = 16
Grid Y (tiles along N): ceil(N / TILE_N) = ceil(4096 / 128) = 32
Grid Z: 1 (no batch dimension in grid)
Result: grid = [16, 32, 1] → 512 blocks

Each tile is a 4 × 128 chunk of the output matrix. One block of 128 threads computes all 512 elements (4 × 128) in that tile. So each thread computes ~4 output elements.
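The same tile arithmetic, sketched in Python (tile sizes taken from the example above):

```python
import math

M, N = 64, 4096            # shape of the output matrix C
TILE_M, TILE_N = 4, 128    # tile of C computed by one block
THREADS_PER_BLOCK = 128

# One block per tile; ceil() handles shapes that don't divide evenly.
grid = (math.ceil(M / TILE_M), math.ceil(N / TILE_N), 1)       # (16, 32, 1)

elements_per_tile = TILE_M * TILE_N                            # 512 elements of C
elements_per_thread = elements_per_tile // THREADS_PER_BLOCK   # 4
```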

Why 128 Threads Per Block?

The block size (threads per block) is a critical choice. Here's the tradeoff:

  • Block size 32 (1 warp): Pros: more blocks fit per SM, many blocks launched. Cons: fewer threads to cooperate in shared memory.
  • Block size 128 (4 warps): Good balance: enough warps for latency hiding, enough blocks for SM coverage.
  • Block size 256 (8 warps): Pros: more warps per block for latency hiding. Cons: fewer blocks per SM, more registers used per block.
  • Block size 1024 (32 warps): Pros: maximum number of threads cooperating. Cons: only 1–2 blocks per SM, less flexibility.

In practice: for nn.Linear and matmul, cuBLAS decides for you; NVIDIA tested many configurations during development, and the profiler just shows you what it picked. For custom kernels (Triton, raw CUDA), you have to experiment or use occupancy calculators.
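To make the tradeoff concrete, here is a rough Python sketch of how many blocks the A100's 65,536-register file could host per SM at each block size, assuming this kernel's 122 registers/thread (real register allocation is per-warp and chunked, so treat these numbers as approximations):

```python
REG_FILE_PER_SM = 65536    # 32-bit registers per A100 SM
REGS_PER_THREAD = 122      # from the profiler example
WARP_SIZE = 32

for threads in (32, 128, 256, 1024):
    warps = threads // WARP_SIZE
    blocks_fit = REG_FILE_PER_SM // (REGS_PER_THREAD * threads)
    print(f"{threads:4d} threads/block: {warps:2d} warps, "
          f"~{blocks_fit} blocks/SM by registers")
```

Note the extreme case: at 1024 threads/block, 122 registers/thread would not fit even one block, so the compiler would have to cap register usage (spilling to memory) for the kernel to launch at all.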

Summary: The Decision Chain

You write: output = input @ weight.T (matrix shapes [64,4096] × [4096,4096])
↓
PyTorch calls: cuBLAS gemm(), NVIDIA's matmul library
↓
cuBLAS decides: TILE_M=4, TILE_N=128, 128 threads/block
↓
GPU gets: grid [16,32,1], block [128,1,1]

How Blocks Get Assigned to SMs

You've launched 512 blocks. You have 108 SMs. The GPU scheduler now decides which blocks go where. This is the critical step for efficiency.

1. You launch the kernel
matmul_kernel<<<grid(16,32,1), block(128,1,1)>>>(...)
This tells the GPU: "I have 512 blocks of work, each with 128 threads."

2. GPU checks resource limits per SM
Each SM has finite resources. The scheduler checks how many blocks can fit on one SM simultaneously, which depends on:
  • Registers: 122 regs/thread × 128 threads = 15,616 regs/block
  • Shared memory: 12,544 bytes per block
  • Max blocks/SM: hardware limit of 32 (A100)

3. Scheduler assigns blocks in waves
The profiler reports blocks per SM = 4.740741, i.e. about 4–5 blocks per SM. Since blocks can't be fractional, each SM works through roughly 4–5 blocks; blocks that don't fit wait for an SM to become free.

4. 512 blocks across 108 SMs
With ~4.74 blocks/SM, the GPU needs 512 ÷ 4.74 ≈ 108 SMs. All 108 SMs are busy!

Good news: in this case we have enough blocks (512) to keep all 108 SMs busy. But what if we had fewer blocks? That's where SM efficiency drops below 100%.
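The wave arithmetic can be sketched in a few lines; the resident-blocks figure here is a static estimate from the register limit, not a profiler output:

```python
import math

NUM_SMS = 108
MAX_BLOCKS_PER_SM = 32        # A100 hardware cap
REG_FILE_PER_SM = 65536       # 32-bit registers per SM

blocks_launched = 512
regs_per_block = 122 * 128    # 15,616 (from step 2 above)

# Static residency estimate: register-limited blocks per SM
resident = min(REG_FILE_PER_SM // regs_per_block, MAX_BLOCKS_PER_SM)   # 4

in_flight = NUM_SMS * resident                       # 432 blocks at once
waves = math.ceil(blocks_launched / in_flight)       # 2 waves of blocks
avg_blocks_per_sm = blocks_launched / NUM_SMS        # ≈ 4.74 (profiler value)
```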

SM Efficiency: Are All Factories Busy?

SM Efficiency tells you what fraction of the GPU's streaming multiprocessors are actually doing work. It's the first thing to check when a kernel seems slow.

The Formula

SM Efficiency has a beautifully simple definition:

Estimated SM Efficiency = min(blocks_launched ÷ num_SMs, 1)

In plain English: if you launch at least as many blocks as there are SMs, every SM gets work, and efficiency is 100%. If you launch fewer blocks than SMs, some SMs sit idle.

Our Matrix Multiply Example

From TensorBoard:
  • Grid: [16, 32, 1]
  • Total blocks: 512
  • Number of SMs: 108
  • Blocks per SM: 4.740741
  • Est. SM Efficiency: 1 (100%)

Calculation:
Step 1: total blocks = 16 × 32 × 1 = 512
Step 2: SM efficiency = min(512 ÷ 108, 1) = min(4.74, 1) = 1
All 108 SMs are active!
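As a one-liner, assuming the 108 SMs of an A100:

```python
def sm_efficiency(blocks_launched: int, num_sms: int = 108) -> float:
    """Fraction of SMs that receive at least one block."""
    return min(blocks_launched / num_sms, 1.0)

print(sm_efficiency(512))    # 1.0 -> 100%, matching the profiler
print(sm_efficiency(64))     # ~0.593 -> only 59% of SMs get work
```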

Interactive: What Happens When You Have Fewer Blocks?

[Interactive slider: vary the total number of blocks (default 512) against 108 SMs and watch the SM status map update. At 512 blocks: SM efficiency 100%, 108/108 SMs active, 4.74 blocks per SM, 0 idle SMs.]

Bad: SM Efficiency < 100%

If you launch only 64 blocks on 108 SMs:

min(64/108, 1) = 0.59 (59%)

44 SMs sit completely idle. You're paying for a full GPU but only using 59% of it.

Common causes: small batch size, small matrix dimensions, too-large block size.
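To see how batch size (M) and layer width (N) drive this, a small sketch reusing the tile sizes from earlier in the article:

```python
import math

def total_blocks(M, N, tile_m=4, tile_n=128):
    # One block per output tile, as in the matmul example above.
    return math.ceil(M / tile_m) * math.ceil(N / tile_n)

def sm_efficiency(blocks, num_sms=108):
    return min(blocks / num_sms, 1.0)

print(sm_efficiency(total_blocks(64, 4096)))   # 512 blocks -> 1.0 (100%)
print(sm_efficiency(total_blocks(8, 512)))     # 8 blocks -> ~0.074 (7%)
```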

Good: SM Efficiency = 100%

If you launch 512 blocks on 108 SMs:

min(512/108, 1) = 1.0 (100%)

Every SM has work. In fact, each SM has ~4.74 blocks queued, which also helps hide latency (while one block waits for memory, another can execute).

Occupancy: How Busy is Each SM?

SM Efficiency tells you how many SMs are active. Occupancy tells you how efficiently each active SM is being used. They're different metrics!

SM Efficiency vs Occupancy

SM Efficiency: What % of SMs have at least one block?

Occupancy: What % of the maximum possible warps are active on each SM?

You can have 100% SM Efficiency (all SMs busy) but low occupancy (each SM is underutilized).

The Occupancy Formula

Achieved Occupancy = active warps per SM ÷ max warps per SM

On the A100, each SM supports up to 64 warps (2,048 threads). Our profiler shows:

From TensorBoard: warps per SM = 18.96, est. achieved occupancy = 25%

A note on the arithmetic: 18.96 is simply total warps divided by SMs (2,048 ÷ 108), an average workload per SM rather than a count of simultaneously resident warps. The 25% the profiler reports matches what actually fits on an SM at once: register usage limits each SM to 4 resident blocks of 4 warps each, so 16 of the 64 warp slots are filled, and 16 ÷ 64 = 25%.

Visualizing Occupancy on One SM

Each SM can run up to 64 warps (2,048 threads). Our kernel keeps only ~16 warp slots filled; the rest sit empty.

[Diagram: 64 warp slots on one SM, active warps vs. empty slots → 25% achieved occupancy]

Why only 25% occupancy? Each block uses 128 threads (4 warps) and requires 122 registers per thread (15,616 registers per block) plus 12,544 bytes of shared memory. The A100's register file holds 65,536 registers per SM, which fits only ⌊65,536 ÷ 15,616⌋ = 4 such blocks, i.e. 16 warps out of a possible 64. The bottleneck is register usage.
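The register-limit reasoning can be written out in a few lines (A100 constants assumed: 65,536 registers and 64 warp slots per SM; real register allocation is per-warp and chunked, so this is a simplification):

```python
REG_FILE_PER_SM = 65536
MAX_WARPS_PER_SM = 64
WARP_SIZE = 32

regs_per_thread = 122        # from the profiler
threads_per_block = 128

regs_per_block = regs_per_thread * threads_per_block     # 15,616
blocks_per_sm = REG_FILE_PER_SM // regs_per_block        # 4 resident blocks
warps_per_block = threads_per_block // WARP_SIZE         # 4
resident_warps = blocks_per_sm * warps_per_block         # 16
occupancy = resident_warps / MAX_WARPS_PER_SM            # 0.25 -> 25%
```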

Does Low Occupancy = Bad Performance?

Not necessarily! Occupancy is a proxy metric, not a direct measure of performance.

When low occupancy is fine:

  • Kernel is compute-bound (not waiting for memory)
  • Using Tensor Cores efficiently
  • High arithmetic intensity (lots of FLOPs per byte)

When low occupancy hurts:

  • Kernel is memory-bound (waiting for data)
  • Not enough warps to hide memory latency
  • Many stalls visible in the profiler

GPU Efficiency Calculator

Change the matrix dimensions, block size, or GPU model and see how SM efficiency and occupancy change in real time.

[Interactive calculator. Inputs: GPU model; output shape M × N and inner dimension K for C[M×N] = A[M×K] × B[K×N] (K affects FLOPs, not grid size); tile sizes BLOCK_M and BLOCK_N (rows/cols of C per block); and threads per block. With this article's defaults (M=64, N=4096, K=4096, BLOCK_M=4, BLOCK_N=128, 128 threads) it computes: grid = [16, 32, 1], 512 total blocks, 4 warps per block, 2.15 B FLOPs, SM efficiency 100% (all 108 SMs active), and 4.74 blocks per SM (enough to hide some latency).]

TensorBoard Profiler: What All the Numbers Mean

Here's a field-by-field guide to what you see in TensorBoard's CUDA kernel details.

Kernel Details Panel

  • stream = 7: CUDA stream this kernel ran on. Different streams can overlap.
  • correlation = 40: links this GPU kernel to the CPU-side launch event.
  • registers per thread = 122: each thread uses 122 registers (the per-thread hardware limit is 255). High register usage limits occupancy.
  • shared memory = 12544: 12.5 KB of shared memory per block, used for fast data sharing between threads in a block.
  • blocks per SM = 4.740741: total blocks divided by SMs (512 ÷ 108). Higher = better latency hiding.
  • warps per SM = 18.962963: total warps divided by SMs (2,048 ÷ 108).
  • grid = [16, 32, 1]: grid dimensions, i.e. how blocks are organized. Total = 16 × 32 × 1 = 512 blocks.
  • block = [128, 1, 1]: block dimensions: 128 threads per block (4 warps).
  • est. achieved occupancy % = 25: register usage limits each SM to 4 resident blocks (16 of 64 warp slots), and 16 ÷ 64 = 25%.

Counter Sample Panel

Separately, TensorBoard shows the GPU performance counters:

Counter: GPU 0, series "Est. SM Efficiency", value = 1 (= 100%)

This confirms our calculation: min(512/108, 1) = 1. All SMs are active during this kernel.

The Complete Picture

  • SM Efficiency: 100% (all 108 SMs active)
  • Achieved Occupancy: 25% (register pressure limits resident warps per SM)
  • Total Blocks: 512 (~4.74 per SM)

What You Should Remember

01

SM Efficiency = Are All SMs Busy?

min(blocks_launched / num_SMs, 1). Launch at least as many blocks as SMs to hit 100%. With 512 blocks on 108 SMs, all SMs are active.

02

Occupancy = How Full is Each SM?

active_warps / max_warps per SM. Limited by registers, shared memory, and block count. 25% occupancy can still be performant for compute-bound kernels.

03

Grid & Block Dimensions Matter

Grid [16,32,1] = 512 blocks. Block [128,1,1] = 128 threads. These are chosen by CUDA libraries (cuBLAS) or set by you in custom kernels. They determine both SM efficiency and occupancy.

04

Small Matrices = Low Efficiency

If your matmul only generates 50 blocks on a 108-SM GPU, 58 SMs sit idle (min(50/108, 1) ≈ 46% efficiency). Batch size directly affects this: bigger batches = more blocks = more SMs used.

05

Warps are the Real Unit

The GPU doesn't schedule individual threads — it schedules warps of 32 threads in SIMT lockstep. The warp scheduler on each SM juggles active warps to hide memory latency.

06

Check TensorBoard First

When a kernel is slow, check: (1) SM Efficiency — are all SMs active? (2) Occupancy — is each SM being utilized? (3) Memory bandwidth — is the kernel memory-bound?