How sparse models achieve massive capacity with fixed compute cost
A dense model activates every parameter for every token. An MoE model activates only a small fraction of its parameters per token, giving far more total capacity at the same per-token compute.
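The capacity-vs-compute tradeoff can be made concrete with a back-of-the-envelope count. All sizes below are illustrative assumptions, not taken from any specific model:

```python
d_model = 1024          # hidden size (assumed)
d_ff = 4096             # FFN inner size (assumed)
n_experts = 8           # number of expert FFNs (assumed)
top_k = 2               # experts activated per token

# One FFN = two linear layers (biases ignored for simplicity)
ffn_params = 2 * d_model * d_ff

dense_total = ffn_params              # dense layer: a single FFN
moe_total = n_experts * ffn_params    # MoE layer stores 8 FFNs...
moe_active = top_k * ffn_params       # ...but only top-2 run per token

print(moe_total // dense_total)   # 8x the parameter capacity
print(moe_active // dense_total)  # at only 2x the per-token FFN compute
```

The gap widens with more experts: total parameters scale with `n_experts`, while per-token compute scales only with `top_k`.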
Walk through each step of the Mixture of Experts forward pass, from token arrival to weighted output.
A token embedding vector enters the MoE layer, ready to be routed to the best experts.
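The whole forward pass can be sketched in a few lines of numpy. Shapes, names, and the use of plain linear maps as stand-ins for full expert FFNs are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 4, 2

# Router: a single linear layer that scores each expert per token.
W_router = rng.normal(size=(d_model, n_experts))
# Each "expert" is just a linear map here, standing in for a full FFN.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x):
    """Route one token embedding x (shape [d_model]) through the MoE layer."""
    logits = x @ W_router                     # 1) score every expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # 2) softmax over experts
    top = np.argsort(probs)[-top_k:]          # 3) keep the top-2 experts
    gates = probs[top] / probs[top].sum()     # 4) renormalize their weights
    # 5) output = gate-weighted sum of the selected experts' outputs
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

y = moe_forward(rng.normal(size=d_model))
print(y.shape)  # (16,)
```

Note that only the `top_k` selected experts are ever evaluated; the other experts' parameters contribute capacity but cost nothing for this token.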
Watch how a sequence of tokens is distributed across experts. Each token picks its top-2 experts via the router.
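The same routing step, batched over a sequence, shows how load spreads across experts. Sizes are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model, n_experts = 12, 16, 4

tokens = rng.normal(size=(n_tokens, d_model))
W_router = rng.normal(size=(d_model, n_experts))

logits = tokens @ W_router                   # [n_tokens, n_experts]
top2 = np.argsort(logits, axis=-1)[:, -2:]   # top-2 expert ids per token

# How many tokens each expert receives:
load = np.bincount(top2.ravel(), minlength=n_experts)
print(load.sum())  # 24: every token lands on exactly 2 experts
```

With a randomly initialized router the load is roughly even; the collapse described next emerges only once the router is trained without a balancing term.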
Without intervention, the router collapses to a few "favorite" experts. An auxiliary loss term encourages uniform routing.
Router collapses to 2-3 experts. Most experts are underutilized.
Aux loss encourages balanced routing across all experts.
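One common formulation of this auxiliary loss (used, e.g., in the Switch Transformer) multiplies, per expert, the fraction of tokens routed to it by the mean router probability it receives. A numpy sketch under assumed sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, n_experts = 64, 8

logits = rng.normal(size=(n_tokens, n_experts))
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)    # router probabilities

assigned = probs.argmax(axis=-1)              # top-1 assignment per token
# f_i: fraction of tokens routed to expert i
f = np.bincount(assigned, minlength=n_experts) / n_tokens
# P_i: mean router probability assigned to expert i
P = probs.mean(axis=0)

aux_loss = n_experts * np.sum(f * P)

# Perfectly uniform routing gives f_i = P_i = 1/n_experts, so aux_loss = 1.0;
# collapse onto a few experts pushes the loss above 1, which the optimizer
# then trades off against the main task loss.
uniform = np.full(n_experts, 1.0 / n_experts)
print(n_experts * np.sum(uniform * uniform))  # 1.0
```

Because both `f` and `P` shrink toward `1/n_experts` as routing evens out, the loss is minimized exactly at uniform utilization.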
MoE replaces the standard FFN in transformer blocks. Different architectures choose different placement strategies.
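One widely used placement strategy is to swap in an MoE layer only every other block, keeping dense FFNs elsewhere. A toy sketch (layer count and interval are illustrative assumptions):

```python
n_layers = 8
moe_every = 2  # place an MoE layer in every 2nd transformer block (assumed)

layers = [
    "attn + moe_ffn" if (i % moe_every == moe_every - 1) else "attn + dense_ffn"
    for i in range(n_layers)
]
print(layers.count("attn + moe_ffn"))  # 4 of 8 blocks carry experts
```

Denser placements (MoE in every block) buy more capacity at the cost of memory and communication; sparser ones keep most blocks cheap and dense.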