You have a matrix multiplication. You have a GPU. What happens next? A from-scratch walkthrough of grids, blocks, SMs, warps, and how GPU efficiency is actually calculated in TensorBoard.
Almost everything in deep learning — forward passes, backward passes, attention — boils down to matrix multiplications. Let's start with a concrete one.
From our training code, the forward pass of a single linear layer does `output = input @ weight.T` — a [64, 4096] × [4096, 4096] matrix multiply producing a [64, 4096] output.
Each element of the output matrix C[i][j] requires computing a dot product — multiply corresponding elements of row i from A and column j from B, then sum them all up.
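To make that dot-product structure concrete, here is a minimal pure-Python sketch (the function name and the tiny matrices are illustrative, not from the training code):

```python
def matmul(A, B):
    """Naive matrix multiply: C[i][j] is the dot product of
    row i of A and column j of B."""
    M, K = len(A), len(A[0])
    K2, N = len(B), len(B[0])
    assert K == K2, "inner dimensions must match"
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            # Multiply corresponding elements, then sum them all up.
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(K))
    return C

A = [[1.0, 2.0],
     [3.0, 4.0]]
B = [[5.0, 6.0],
     [7.0, 8.0]]
print(matmul(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

The GPU's job is to compute all M × N of these independent dot products in parallel rather than in two nested loops.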
A GPU is not just "lots of cores". It's organized into Streaming Multiprocessors (SMs) — semi-independent processors, each with their own cores, registers, and shared memory.
Each SM on an A100 contains:

- 64 FP32 CUDA Cores
- 32 FP64 Cores
- 4 Tensor Cores (TF32/FP16/BF16)
- 256 KB Register File
- 192 KB Shared Memory / L1 Cache
- 4 Warp Schedulers

Each SM is like a small factory with its own workers (cores), local storage (registers + shared memory), and a manager (warp scheduler).
The GPU is a collection of 108 factories. Work gets divided up and assigned to these factories. The question is: how many factories are actually busy? That's SM efficiency.
CUDA organizes parallel work into a hierarchy. Understanding this hierarchy is key to understanding GPU efficiency.
The profiler shows the matmul kernel was launched with grid [16, 32, 1] and block [128, 1, 1]. But who chose these numbers? It depends on who wrote the kernel.
When you write output = input @ weight.T in PyTorch, you're not writing a CUDA kernel yourself. Under the hood:
PyTorch dispatches `torch.mm()` or `torch.matmul()` to NVIDIA's cuBLAS library. cuBLAS chose grid [16, 32, 1] and block [128, 1, 1] because its internal heuristics determined this is optimal for a [64, 4096] × [4096, 4096] multiply on an A100.
If you write a CUDA kernel (e.g., in Triton or raw CUDA C++), then you decide the grid and block dimensions.
Whether cuBLAS or you decide, the logic is the same. The output matrix C has shape [M, N]. Each block computes one tile of C:
Each tile is a 4 × 128 chunk of the output matrix. One block of 128 threads computes all 512 elements (4 × 128) in that tile. So each thread computes ~4 output elements.
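The tiling arithmetic above can be sketched as follows (tile sizes come from the profiled kernel; the helper name is my own):

```python
import math

def launch_config(M, N, tile_m=4, tile_n=128, threads_per_block=128):
    """Compute the grid needed to cover an [M, N] output with
    tile_m x tile_n tiles, one block per tile."""
    grid = (math.ceil(M / tile_m), math.ceil(N / tile_n), 1)
    total_blocks = grid[0] * grid[1] * grid[2]
    elems_per_block = tile_m * tile_n
    elems_per_thread = elems_per_block / threads_per_block
    return grid, total_blocks, elems_per_thread

grid, blocks, per_thread = launch_config(M=64, N=4096)
print(grid)        # (16, 32, 1)
print(blocks)      # 512
print(per_thread)  # 4.0
```

The ceiling division matters in general: if M or N is not a multiple of the tile size, the edge blocks are partially full, and the kernel masks out the out-of-range threads.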
The block size (threads per block) is a critical choice. Here's the tradeoff:
| Block Size | Warps/Block | Pros | Cons |
|---|---|---|---|
| 32 | 1 | More blocks fit per SM, many blocks launched | Fewer threads to cooperate in shared memory |
| 128 | 4 | Good balance — enough warps for latency hiding, enough blocks for SM coverage | — |
| 256 | 8 | More warps per block for latency hiding | Fewer blocks per SM, uses more registers per block |
| 1024 | 32 | Maximum threads cooperating | Only 1–2 blocks per SM, less flexibility |
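One way to see the tradeoff in this table: each SM caps both resident threads (2,048 on the A100) and resident blocks (32 on the A100), so bigger blocks mean fewer blocks per SM. A rough sketch of those two limits alone, ignoring register and shared-memory pressure:

```python
def max_blocks_per_sm(block_size, max_threads=2048, max_blocks=32):
    """Upper bound on resident blocks per SM from the thread and
    block-count limits alone (registers/shared memory ignored)."""
    return min(max_threads // block_size, max_blocks)

for block_size in (32, 128, 256, 1024):
    warps = block_size // 32
    print(block_size, warps, max_blocks_per_sm(block_size))
# 32   -> 1 warp,  32 blocks/SM (block-count limited)
# 128  -> 4 warps, 16 blocks/SM
# 256  -> 8 warps,  8 blocks/SM
# 1024 -> 32 warps, 2 blocks/SM
```

In practice register and shared-memory usage usually bind before these hard limits, which is why the achieved numbers in the profiler are lower.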
For `nn.Linear` and `matmul`, cuBLAS decides for you: NVIDIA tuned its heuristics across many configurations during development, and the profiler just shows you what it picked. For custom kernels (Triton, raw CUDA), you have to experiment or use occupancy calculators.
You've launched 512 blocks. You have 108 SMs. The GPU scheduler now decides which blocks go where. This is the critical step for efficiency.
The launch `matmul_kernel<<<grid(16,32,1), block(128,1,1)>>>(...)` produces 512 blocks. Spread across 108 SMs, that is 512 / 108 ≈ 4.74 blocks per SM. Since blocks can't be fractional, each SM runs about 4–5 blocks concurrently, and blocks that don't fit wait for an SM to become free.
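The scheduling arithmetic works out like this (a sketch; the real scheduler also weighs register and shared-memory limits when deciding how many blocks fit at once):

```python
def schedule(total_blocks, num_sms):
    """Average blocks per SM, plus a simple 'wave' view of the launch."""
    blocks_per_sm = total_blocks / num_sms
    full_waves, remainder = divmod(total_blocks, num_sms)
    return blocks_per_sm, full_waves, remainder

bps, waves, rem = schedule(total_blocks=512, num_sms=108)
print(round(bps, 2))  # 4.74
print(waves, rem)     # 4 full waves of 108 blocks, plus 80 leftover blocks
```

That partial last wave (80 blocks on 108 SMs) is called the "tail effect": for a short stretch at the end of the kernel, some SMs have nothing left to do.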
SM Efficiency tells you what fraction of the GPU's streaming multiprocessors are actually doing work. It's the first thing to check when a kernel seems slow.
SM Efficiency has a beautifully simple definition:

Est. SM Efficiency = min(blocks launched / number of SMs, 1)
In plain English: if you launch at least as many blocks as there are SMs, every SM gets work, and efficiency is 100%. If you launch fewer blocks than SMs, some SMs sit idle.
| Quantity | Value |
|---|---|
| Grid | [16, 32, 1] |
| Total blocks | 512 |
| Number of SMs | 108 |
| Blocks per SM | 4.740741 |
| Est. SM Efficiency | 1 (100%) |
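The relationship between block count and SM efficiency can be tabulated directly, using the min(blocks launched / number of SMs, 1) estimate from the summary:

```python
def est_sm_efficiency(total_blocks, num_sms=108):
    """TensorBoard-style estimate: min(blocks_launched / num_SMs, 1)."""
    return min(total_blocks / num_sms, 1.0)

for blocks in (27, 64, 108, 216, 512):
    print(blocks, f"{est_sm_efficiency(blocks):.0%}")
# 27  -> 25%
# 64  -> 59%
# 108 -> 100%
# 216 -> 100%
# 512 -> 100%
```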
If you launch only 64 blocks on 108 SMs, efficiency is min(64 / 108, 1) ≈ 0.59. 44 SMs sit completely idle. You're paying for a full GPU but only using 59% of it.
Common causes: small batch size, small matrix dimensions, too-large block size.
If you launch 512 blocks on 108 SMs, efficiency is min(512 / 108, 1) = 1. Every SM has work. In fact, each SM has ~4.74 blocks queued, which also helps hide latency (while one block waits for memory, another can execute).
SM Efficiency tells you how many SMs are active. Occupancy tells you how efficiently each active SM is being used. They're different metrics!
SM Efficiency: What % of SMs have at least one block?
Occupancy: What % of the maximum possible warps are active on each SM?
On the A100, each SM supports up to 64 warps (2,048 threads). Our profiler shows only ~19 warps active per SM; the rest of the warp slots sit empty.
Is that low occupancy a problem? Not necessarily! Occupancy is a proxy metric, not a direct measure of performance.
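To see why this kernel lands near 25% occupancy, we can estimate the register limit. This is a sketch under stated assumptions: A100 limits of 65,536 registers and 64 warp slots per SM, and a per-warp register allocation granularity of 256 (the granularity varies by architecture):

```python
import math

def reg_limited_occupancy(regs_per_thread, threads_per_block,
                          regs_per_sm=65536, max_warps=64,
                          alloc_granularity=256):
    """Estimate occupancy when the register file is the limiter."""
    warps_per_block = threads_per_block // 32
    # Registers are allocated per warp, rounded up to the granularity.
    regs_per_warp = math.ceil(regs_per_thread * 32 / alloc_granularity) * alloc_granularity
    warps_by_regs = regs_per_sm // regs_per_warp
    # Only whole blocks are resident: round down to a block multiple.
    resident_warps = (warps_by_regs // warps_per_block) * warps_per_block
    return resident_warps / max_warps

print(reg_limited_occupancy(regs_per_thread=122, threads_per_block=128))
# 0.25 -> matches the profiler's 25% estimated achieved occupancy
```

At 122 registers per thread, only 4 blocks (16 warps) fit in the register file at once, so 48 of the 64 warp slots can never be filled no matter how many blocks are launched.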
Here's a field-by-field guide to what you see in TensorBoard's CUDA kernel details.
| Field | Value | What It Means |
|---|---|---|
| stream | 7 | CUDA stream this kernel ran on. Different streams can overlap. |
| correlation | 40 | Links this GPU kernel to the CPU-side launch event. |
| registers per thread | 122 | Each thread uses 122 registers (the per-thread maximum is 255). High register usage limits occupancy. |
| shared memory | 12544 | 12.5 KB of shared memory per block. Used for fast data sharing between threads in a block. |
| blocks per SM | 4.740741 | Average number of concurrent blocks per SM. Higher = better latency hiding. |
| warps per SM | 18.962963 | ~19 of 64 max warps active on average. This is what drives the low occupancy estimate. |
| grid | [16, 32, 1] | Grid dimensions — how blocks are organized. Total = 16×32×1 = 512 blocks. |
| block | [128, 1, 1] | Block dimensions — 128 threads per block (4 warps). |
| est. achieved occupancy % | 25 | Roughly a quarter of the SM's 64 warp slots are filled on average. Limited by register usage. |
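Several of these fields are internally consistent and can be cross-checked from the grid and block dimensions alone (a sketch using the values above):

```python
grid = (16, 32, 1)
block = (128, 1, 1)
num_sms = 108  # A100

total_blocks = grid[0] * grid[1] * grid[2]
threads_per_block = block[0] * block[1] * block[2]
warps_per_block = threads_per_block // 32
blocks_per_sm = total_blocks / num_sms
warps_per_sm = blocks_per_sm * warps_per_block

print(total_blocks)             # 512
print(warps_per_block)          # 4
print(round(blocks_per_sm, 6))  # 4.740741
print(round(warps_per_sm, 6))   # 18.962963
```

Both derived fields match the profiler table exactly, which is a good sanity check that you are reading the launch configuration correctly.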
Separately, TensorBoard shows the GPU performance counters:
| Counter | Series | Value |
|---|---|---|
| GPU 0 Est. SM Efficiency | Est. SM Efficiency | 1 (= 100%) |
This confirms our calculation: min(512/108, 1) = 1. All SMs are active during this kernel.
min(blocks_launched / num_SMs, 1). Launch at least as many blocks as SMs to hit 100%. With 512 blocks on 108 SMs, all SMs are active.
active_warps / max_warps per SM. Limited by registers, shared memory, and block count. 25% occupancy can still be performant for compute-bound kernels.
Grid [16,32,1] = 512 blocks. Block [128,1,1] = 128 threads. These are chosen by CUDA libraries (cuBLAS) or set by you in custom kernels. They determine both SM efficiency and occupancy.
If your matmul only generates 50 blocks on a 108-SM GPU, 58 SMs sit idle (46% efficiency). Batch size directly affects this — bigger batches = more blocks = more SMs used.
The GPU doesn't schedule individual threads — it schedules warps of 32 threads in SIMT lockstep. The warp scheduler on each SM juggles active warps to hide memory latency.
When a kernel is slow, check: (1) SM Efficiency — are all SMs active? (2) Occupancy — is each SM being utilized? (3) Memory bandwidth — is the kernel memory-bound?