GPU Deep Dive — From First Principles

GPU Efficiency & SM Utilization

You have a matrix multiplication. You have a GPU. What happens next? A from-scratch walkthrough of grids, blocks, SMs, warps, and how GPU efficiency is actually calculated in TensorBoard.

NVIDIA A100 GPU (80 GB): 108 SMs

Key terms at a glance:
  • Grid: how blocks are organized
  • Block: group of threads on 1 SM
  • Warp: 32 threads in lockstep

You Have a Matrix Multiplication

Almost everything in deep learning — forward passes, backward passes, attention — boils down to matrix multiplications. Let's start with a concrete one.

The Operation: C = A × B

From our training code, the forward pass of a single linear layer does:

# Linear layer: output = input @ weight.T
# input shape: [64, 4096] (batch_size x hidden_dim)
# weight shape: [4096, 4096] (hidden_dim x hidden_dim)
# output shape: [64, 4096]

output = input @ weight.T   # This is a matrix multiplication!

Each element of the output matrix C[i][j] requires computing a dot product — multiply corresponding elements of row i from A and column j from B, then sum them all up.

Total multiply-add operations: C[i][j] = ∑(k=0..4095) A[i][k] × B[k][j], done for every (i, j)   →   64 × 4096 × 4096 = 1.07 billion operations
Key question: We have 1.07 billion independent multiply-add operations. A single CPU core does them one by one. A GPU has thousands of cores that can do many of them simultaneously. But how does the GPU organize this work?
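As a quick sanity check, the operation count follows directly from the shapes; a few lines of plain Python:

```python
# Multiply-add count for C[M,N] = A[M,K] @ B[K,N]
M, K, N = 64, 4096, 4096

macs = M * N * K                    # one multiply-add per (i, j, k) triple
print(f"{macs:,} multiply-adds")    # 1,073,741,824 ≈ 1.07 billion

# Profilers often count the multiply and the add separately (FLOPs):
flops = 2 * macs                    # ≈ 2.15 billion FLOPs
```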
Let's look at the GPU

Inside the GPU: Streaming Multiprocessors

A GPU is not just "lots of cores". It's organized into Streaming Multiprocessors (SMs): semi-independent processors, each with its own cores, registers, and shared memory.

NVIDIA A100 GPU: 108 Streaming Multiprocessors, 80 GB HBM2e, 2 TB/s memory bandwidth

[Diagram: grid of 108 SMs, each shown as fully active, partially active, or idle]

What's Inside Each SM?

  • 64 FP32 CUDA Cores
  • 32 FP64 Cores
  • 4 Tensor Cores (TF32/FP16/BF16)
  • 256 KB Register File
  • 192 KB Shared Memory / L1 Cache
  • 4 Warp Schedulers

Think of an SM Like a Factory Floor

Each SM is like a small factory with its own workers (cores), local storage (registers + shared memory), and a manager (warp scheduler).

The GPU is a collection of 108 factories. Work gets divided up and assigned to these factories. The question is: how many factories are actually busy? That's SM efficiency.

The goal: keep as many SMs busy as possible → high SM efficiency.

Threads, Warps, Blocks, and Grids

CUDA organizes parallel work into a hierarchy. Understanding this hierarchy is key to understanding GPU efficiency.

  • Grid: the entire kernel launch. Contains all blocks. One grid = one kernel call. Example: [16, 32, 1].
  • Block (thread block): a group of threads that executes on one SM. Threads in a block can share memory and synchronize. Example: [128, 1, 1].
  • Warp: 32 threads that execute the same instruction in lockstep (SIMT). The fundamental scheduling unit.
  • Thread: the smallest unit of execution. Each thread computes one (or a few) output elements.

Concrete Example from TensorBoard

The profiler shows the matmul kernel was launched with:

Grid dimensions: [16, 32, 1] → total blocks = 16 × 32 × 1 = 512
Block dimensions: [128, 1, 1] → threads per block = 128 × 1 × 1 = 128
Total threads launched: 512 blocks × 128 threads/block = 65,536
Warps per block: 128 threads ÷ 32 threads/warp = 4
Total warps: 512 blocks × 4 warps = 2,048
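All of these launch statistics are pure arithmetic on the grid and block shapes. A minimal sketch:

```python
# Launch statistics derived from grid/block dimensions (profiler values)
grid = (16, 32, 1)      # blocks along x, y, z
block = (128, 1, 1)     # threads along x, y, z
WARP_SIZE = 32

total_blocks = grid[0] * grid[1] * grid[2]            # 512
threads_per_block = block[0] * block[1] * block[2]    # 128
total_threads = total_blocks * threads_per_block      # 65,536
warps_per_block = threads_per_block // WARP_SIZE      # 4
total_warps = total_blocks * warps_per_block          # 2,048
```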

Who Decides Grid & Block Dimensions?

You see grid [16, 32, 1] and block [128, 1, 1] in the profiler. But who chose these numbers? It depends on who wrote the kernel.

Path A: Library Kernels (Your Case)

When you write output = input @ weight.T in PyTorch, you're not writing a CUDA kernel yourself. Under the hood:

1. PyTorch calls torch.mm() or torch.matmul()
2. which calls cuBLAS (NVIDIA's optimized BLAS library)
3. cuBLAS has hundreds of pre-written kernels for different matrix shapes
4. cuBLAS auto-selects the best kernel + grid/block config via heuristics & autotuning

You don't choose anything. cuBLAS picked grid [16,32,1] and block [128,1,1] because its internal heuristics determined this is optimal for a [64,4096] × [4096,4096] multiply on an A100.

Path B: Custom CUDA Kernels

If you write a CUDA kernel (e.g., in Triton or raw CUDA C++), then you decide the grid and block dimensions.

// You choose TILE_M, TILE_N, THREADS
#define TILE_M 4     // rows of C per block
#define TILE_N 128   // cols of C per block
#define THREADS 128  // threads per block

// Grid = how many tiles are needed to cover the output C.
// Integer division truncates in C++, so use the round-up idiom
// (x + y - 1) / y instead of calling ceil() on integers.
dim3 grid(
    (M + TILE_M - 1) / TILE_M,    // = 64 / 4     = 16
    (N + TILE_N - 1) / TILE_N,    // = 4096 / 128 = 32
    1
);
dim3 block(THREADS, 1, 1);

// Launch!
matmul_kernel<<<grid, block>>>(A, B, C);
You choose everything. The tile sizes and thread count determine both grid and block dimensions. Bad choices = low SM efficiency or low occupancy.

How Grid [16, 32, 1] Was Derived (The Logic)

Whether cuBLAS or you decide, the logic is the same. The output matrix C has shape [M, N]. Each block computes one tile of C:

[Diagram: output matrix C (64 rows × 4096 cols) divided into tiles B0, B1, ...: 512 tiles total, 16 tile-rows × 32 tile-columns]

Grid X (tiles along M): ceil(M / TILE_M) = ceil(64 / 4) = 16
Grid Y (tiles along N): ceil(N / TILE_N) = ceil(4096 / 128) = 32
Grid Z: 1 (no batch dimension in grid)
Result: grid = [16, 32, 1] → 512 blocks

Each tile is a 4 × 128 chunk of the output matrix. One block of 128 threads computes all 512 elements (4 × 128) in that tile. So each thread computes ~4 output elements.
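The same tile arithmetic, sketched in Python (tile sizes taken from the example above):

```python
import math

M, N = 64, 4096            # shape of the output matrix C
TILE_M, TILE_N = 4, 128    # tile of C computed by one block
THREADS_PER_BLOCK = 128

# One block per tile; ceil() handles shapes that don't divide evenly.
grid = (math.ceil(M / TILE_M), math.ceil(N / TILE_N), 1)       # (16, 32, 1)

elements_per_tile = TILE_M * TILE_N                            # 512 elements of C
elements_per_thread = elements_per_tile // THREADS_PER_BLOCK   # 4
```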

Why 128 Threads Per Block?

The block size (threads per block) is a critical choice. Here's the tradeoff:

  • Block size 32 (1 warp): Pros: more blocks fit per SM, many blocks launched. Cons: fewer threads to cooperate in shared memory.
  • Block size 128 (4 warps): Good balance: enough warps for latency hiding, enough blocks for SM coverage.
  • Block size 256 (8 warps): Pros: more warps per block for latency hiding. Cons: fewer blocks per SM, more registers used per block.
  • Block size 1024 (32 warps): Pros: maximum number of threads cooperating. Cons: only 1–2 blocks per SM, less flexibility.

In practice: for nn.Linear and matmul, cuBLAS decides for you; NVIDIA tested many configurations during development, and the profiler just shows you what it picked. For custom kernels (Triton, raw CUDA), you have to experiment or use occupancy calculators.
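To make the tradeoff concrete, here is a rough Python sketch of how many blocks the A100's 65,536-register file could host per SM at each block size, assuming this kernel's 122 registers/thread (real register allocation is per-warp and chunked, so treat these numbers as approximations):

```python
REG_FILE_PER_SM = 65536    # 32-bit registers per A100 SM
REGS_PER_THREAD = 122      # from the profiler example
WARP_SIZE = 32

for threads in (32, 128, 256, 1024):
    warps = threads // WARP_SIZE
    blocks_fit = REG_FILE_PER_SM // (REGS_PER_THREAD * threads)
    print(f"{threads:4d} threads/block: {warps:2d} warps, "
          f"~{blocks_fit} blocks/SM by registers")
```

Note the extreme case: at 1024 threads/block, 122 registers/thread would not fit even one block, so the compiler would have to cap register usage (spilling to memory) for the kernel to launch at all.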

Summary: The Decision Chain

You write: output = input @ weight.T (matrix shapes [64,4096] × [4096,4096])
↓
PyTorch calls: cuBLAS gemm(), NVIDIA's matmul library
↓
cuBLAS decides: TILE_M=4, TILE_N=128, 128 threads/block
↓
GPU gets: grid [16,32,1], block [128,1,1]

How Blocks Get Assigned to SMs

You've launched 512 blocks. You have 108 SMs. The GPU scheduler now decides which blocks go where. This is the critical step for efficiency.

1. You launch the kernel
matmul_kernel<<<grid(16,32,1), block(128,1,1)>>>(...)
This tells the GPU: "I have 512 blocks of work, each with 128 threads."

2. GPU checks resource limits per SM
Each SM has finite resources. The scheduler checks how many blocks can fit on one SM simultaneously, which depends on:
  • Registers: 122 regs/thread × 128 threads = 15,616 regs/block
  • Shared memory: 12,544 bytes per block
  • Max blocks/SM: hardware limit of 32 (A100)

3. Scheduler assigns blocks in waves
The profiler reports blocks per SM = 4.740741, i.e. about 4–5 blocks per SM. Since blocks can't be fractional, each SM works through roughly 4–5 blocks; blocks that don't fit wait for an SM to become free.

4. 512 blocks across 108 SMs
With ~4.74 blocks/SM, the GPU needs 512 ÷ 4.74 ≈ 108 SMs. All 108 SMs are busy!

Good news: in this case we have enough blocks (512) to keep all 108 SMs busy. But what if we had fewer blocks? That's where SM efficiency drops below 100%.
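The wave arithmetic can be sketched in a few lines; the resident-blocks figure here is a static estimate from the register limit, not a profiler output:

```python
import math

NUM_SMS = 108
MAX_BLOCKS_PER_SM = 32        # A100 hardware cap
REG_FILE_PER_SM = 65536       # 32-bit registers per SM

blocks_launched = 512
regs_per_block = 122 * 128    # 15,616 (from step 2 above)

# Static residency estimate: register-limited blocks per SM
resident = min(REG_FILE_PER_SM // regs_per_block, MAX_BLOCKS_PER_SM)   # 4

in_flight = NUM_SMS * resident                       # 432 blocks at once
waves = math.ceil(blocks_launched / in_flight)       # 2 waves of blocks
avg_blocks_per_sm = blocks_launched / NUM_SMS        # ≈ 4.74 (profiler value)
```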

SM Efficiency: Are All Factories Busy?

SM Efficiency tells you what fraction of the GPU's streaming multiprocessors are actually doing work. It's the first thing to check when a kernel seems slow.

The Formula

SM Efficiency has a beautifully simple definition:

Estimated SM Efficiency = min(blocks_launched ÷ num_SMs, 1)

In plain English: if you launch at least as many blocks as there are SMs, every SM gets work, and efficiency is 100%. If you launch fewer blocks than SMs, some SMs sit idle.

Our Matrix Multiply Example

From TensorBoard:
  • Grid: [16, 32, 1]
  • Total blocks: 512
  • Number of SMs: 108
  • Blocks per SM: 4.740741
  • Est. SM Efficiency: 1 (100%)

Calculation:
Step 1: total blocks = 16 × 32 × 1 = 512
Step 2: SM efficiency = min(512 ÷ 108, 1) = min(4.74, 1) = 1
All 108 SMs are active!
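As a one-liner, assuming the 108 SMs of an A100:

```python
def sm_efficiency(blocks_launched: int, num_sms: int = 108) -> float:
    """Fraction of SMs that receive at least one block."""
    return min(blocks_launched / num_sms, 1.0)

print(sm_efficiency(512))    # 1.0 -> 100%, matching the profiler
print(sm_efficiency(64))     # ~0.593 -> only 59% of SMs get work
```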

Interactive: What Happens When You Have Fewer Blocks?

[Interactive slider: vary the total number of blocks (default 512) against 108 SMs and watch the SM status map update. At 512 blocks: SM efficiency 100%, 108/108 SMs active, 4.74 blocks per SM, 0 idle SMs.]

Bad: SM Efficiency < 100%

If you launch only 64 blocks on 108 SMs:

min(64/108, 1) = 0.59 (59%)

44 SMs sit completely idle. You're paying for a full GPU but only using 59% of it.

Common causes: small batch size, small matrix dimensions, too-large block size.
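To see how batch size (M) and layer width (N) drive this, a small sketch reusing the tile sizes from earlier in the article:

```python
import math

def total_blocks(M, N, tile_m=4, tile_n=128):
    # One block per output tile, as in the matmul example above.
    return math.ceil(M / tile_m) * math.ceil(N / tile_n)

def sm_efficiency(blocks, num_sms=108):
    return min(blocks / num_sms, 1.0)

print(sm_efficiency(total_blocks(64, 4096)))   # 512 blocks -> 1.0 (100%)
print(sm_efficiency(total_blocks(8, 512)))     # 8 blocks -> ~0.074 (7%)
```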

Good: SM Efficiency = 100%

If you launch 512 blocks on 108 SMs:

min(512/108, 1) = 1.0 (100%)

Every SM has work. In fact, each SM has ~4.74 blocks queued, which also helps hide latency (while one block waits for memory, another can execute).

Occupancy: How Busy is Each SM?

SM Efficiency tells you how many SMs are active. Occupancy tells you how efficiently each active SM is being used. They're different metrics!

SM Efficiency vs Occupancy

SM Efficiency: What % of SMs have at least one block?

Occupancy: What % of the maximum possible warps are active on each SM?

You can have 100% SM Efficiency (all SMs busy) but low occupancy (each SM is underutilized).

The Occupancy Formula

Achieved Occupancy = active warps per SM ÷ max warps per SM

On the A100, each SM supports up to 64 warps (2,048 threads). Our profiler shows:

From TensorBoard: warps per SM = 18.96, est. achieved occupancy = 25%

A note on the arithmetic: 18.96 is simply total warps divided by SMs (2,048 ÷ 108), an average workload per SM rather than a count of simultaneously resident warps. The 25% the profiler reports matches what actually fits on an SM at once: register usage limits each SM to 4 resident blocks of 4 warps each, so 16 of the 64 warp slots are filled, and 16 ÷ 64 = 25%.

Visualizing Occupancy on One SM

Each SM can run up to 64 warps (2,048 threads). Our kernel keeps only ~16 warp slots filled; the rest sit empty.

[Diagram: 64 warp slots on one SM, active warps vs. empty slots → 25% achieved occupancy]

Why only 25% occupancy? Each block uses 128 threads (4 warps) and requires 122 registers per thread (15,616 registers per block) plus 12,544 bytes of shared memory. The A100's register file holds 65,536 registers per SM, which fits only ⌊65,536 ÷ 15,616⌋ = 4 such blocks, i.e. 16 warps out of a possible 64. The bottleneck is register usage.
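The register-limit reasoning can be written out in a few lines (A100 constants assumed: 65,536 registers and 64 warp slots per SM; real register allocation is per-warp and chunked, so this is a simplification):

```python
REG_FILE_PER_SM = 65536
MAX_WARPS_PER_SM = 64
WARP_SIZE = 32

regs_per_thread = 122        # from the profiler
threads_per_block = 128

regs_per_block = regs_per_thread * threads_per_block     # 15,616
blocks_per_sm = REG_FILE_PER_SM // regs_per_block        # 4 resident blocks
warps_per_block = threads_per_block // WARP_SIZE         # 4
resident_warps = blocks_per_sm * warps_per_block         # 16
occupancy = resident_warps / MAX_WARPS_PER_SM            # 0.25 -> 25%
```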

Does Low Occupancy = Bad Performance?

Not necessarily! Occupancy is a proxy metric, not a direct measure of performance.

When low occupancy is fine:

  • Kernel is compute-bound (not waiting for memory)
  • Using Tensor Cores efficiently
  • High arithmetic intensity (lots of FLOPs per byte)

When low occupancy hurts:

  • Kernel is memory-bound (waiting for data)
  • Not enough warps to hide memory latency
  • Many stalls visible in the profiler

GPU Efficiency Calculator

Change the matrix dimensions, block size, or GPU model and see how SM efficiency and occupancy change in real time.

[Interactive calculator. Inputs: GPU model; output shape M × N and inner dimension K for C[M×N] = A[M×K] × B[K×N] (K affects FLOPs, not grid size); tile sizes BLOCK_M and BLOCK_N (rows/cols of C per block); and threads per block. With this article's defaults (M=64, N=4096, K=4096, BLOCK_M=4, BLOCK_N=128, 128 threads) it computes: grid = [16, 32, 1], 512 total blocks, 4 warps per block, 2.15 B FLOPs, SM efficiency 100% (all 108 SMs active), and 4.74 blocks per SM (enough to hide some latency).]

TensorBoard Profiler: What All the Numbers Mean

Here's a field-by-field guide to what you see in TensorBoard's CUDA kernel details.

Kernel Details Panel

  • stream = 7: CUDA stream this kernel ran on. Different streams can overlap.
  • correlation = 40: links this GPU kernel to the CPU-side launch event.
  • registers per thread = 122: each thread uses 122 registers (the per-thread hardware limit is 255). High register usage limits occupancy.
  • shared memory = 12544: 12.5 KB of shared memory per block, used for fast data sharing between threads in a block.
  • blocks per SM = 4.740741: total blocks divided by SMs (512 ÷ 108). Higher = better latency hiding.
  • warps per SM = 18.962963: total warps divided by SMs (2,048 ÷ 108).
  • grid = [16, 32, 1]: grid dimensions, i.e. how blocks are organized. Total = 16 × 32 × 1 = 512 blocks.
  • block = [128, 1, 1]: block dimensions: 128 threads per block (4 warps).
  • est. achieved occupancy % = 25: register usage limits each SM to 4 resident blocks (16 of 64 warp slots), and 16 ÷ 64 = 25%.

Counter Sample Panel

Separately, TensorBoard shows the GPU performance counters:

Counter: GPU 0, series "Est. SM Efficiency", value = 1 (= 100%)

This confirms our calculation: min(512/108, 1) = 1. All SMs are active during this kernel.

The Complete Picture

  • SM Efficiency: 100% (all 108 SMs active)
  • Achieved Occupancy: 25% (register pressure limits resident warps per SM)
  • Total Blocks: 512 (~4.74 per SM)

What You Should Remember

01

SM Efficiency = Are All SMs Busy?

min(blocks_launched / num_SMs, 1). Launch at least as many blocks as SMs to hit 100%. With 512 blocks on 108 SMs, all SMs are active.

02

Occupancy = How Full is Each SM?

active_warps / max_warps per SM. Limited by registers, shared memory, and block count. 25% occupancy can still be performant for compute-bound kernels.

03

Grid & Block Dimensions Matter

Grid [16,32,1] = 512 blocks. Block [128,1,1] = 128 threads. These are chosen by CUDA libraries (cuBLAS) or set by you in custom kernels. They determine both SM efficiency and occupancy.

04

Small Matrices = Low Efficiency

If your matmul only generates 50 blocks on a 108-SM GPU, 58 SMs sit idle (min(50/108, 1) ≈ 46% efficiency). Batch size directly affects this: bigger batches = more blocks = more SMs used.

05

Warps are the Real Unit

The GPU doesn't schedule individual threads — it schedules warps of 32 threads in SIMT lockstep. The warp scheduler on each SM juggles active warps to hide memory latency.

06

Check TensorBoard First

When a kernel is slow, check: (1) SM Efficiency — are all SMs active? (2) Occupancy — is each SM being utilized? (3) Memory bandwidth — is the kernel memory-bound?