LLM Training — Fundamentals of Tensor Parallelism
Tensor model parallelism allows individual layers of a model to be partitioned over multiple devices, enabling efficient distribution of computational workload. This technique is particularly effective for the Multi-Layer Perceptron (MLP) and Self-Attention layers within a Transformer architecture.
Tensor Parallelism (TP)
TP can be applied to two main components of a Transformer layer:
- MLP Layer
- Self-Attention Layer
In the Megatron-LM formulation, each parallelized block is wrapped by two conjugate operators, f and g, which handle all required communication (a minimal sketch of both follows these lists).
During the forward pass:
- f is an identity operation (no communication).
- g performs one AllReduce to sum the partial outputs across GPUs.
During the backward pass:
- g is an identity operation (no communication).
- f performs one AllReduce to sum the partial input gradients across GPUs.
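To make f and g concrete, here is a minimal sketch in PyTorch, assuming torch.distributed is already initialized (e.g., via torchrun). The class names are illustrative, not Megatron-LM's actual API.

```python
import torch
import torch.distributed as dist


class _CopyToTensorParallelRegion(torch.autograd.Function):
    """f: identity in the forward pass, AllReduce in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        # Forward: pass the (replicated) input through unchanged.
        return x

    @staticmethod
    def backward(ctx, grad_output):
        # Backward: sum the partial input gradients coming from each GPU.
        grad = grad_output.clone()
        dist.all_reduce(grad)
        return grad


class _ReduceFromTensorParallelRegion(torch.autograd.Function):
    """g: AllReduce in the forward pass, identity in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        # Forward: sum the partial activations computed on each GPU.
        out = x.clone()
        dist.all_reduce(out)
        return out

    @staticmethod
    def backward(ctx, grad_output):
        # Backward: the gradient of a sum w.r.t. each addend is the gradient itself.
        return grad_output
```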

TP on MLP
In the MLP layer, tensor parallelism is achieved by splitting the first weight matrix A column-wise into [A1, A2] and the second weight matrix B row-wise into [B1, B2]^T. Splitting A column-wise means each GPU holds complete (not partial-sum) activation entries, so the GeLU nonlinearity can be applied locally; as a result, only one AllReduce is needed to sum the partial results of the second GEMM, highlighted in red in the figure below.
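The algebra behind this can be checked on a single process, with a plain sum standing in for the AllReduce. The shapes below are arbitrary, and the two halves of the computation represent what each of two GPUs would hold.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(4, 8)          # input  (batch, hidden)
A = torch.randn(8, 32) * 0.1   # first GEMM:  hidden -> 4*hidden
B = torch.randn(32, 8) * 0.1   # second GEMM: 4*hidden -> hidden

# Unpartitioned reference: Y = GeLU(X A) B
Y_ref = F.gelu(X @ A) @ B

# Split A column-wise and B row-wise; each half would live on its own GPU.
A1, A2 = A.chunk(2, dim=1)     # each (8, 16)
B1, B2 = B.chunk(2, dim=0)     # each (16, 8)

# Each GPU computes its partial result independently. The GeLU stays local
# because the column split of A yields complete activation entries (not
# partial sums), so the nonlinearity can be applied before any communication.
Y1 = F.gelu(X @ A1) @ B1
Y2 = F.gelu(X @ A2) @ B2

# g: one AllReduce (modeled here by a plain sum) recovers the full output.
Y_tp = Y1 + Y2
assert torch.allclose(Y_ref, Y_tp, atol=1e-6)
```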

TP on Self-Attention
Similar to the MLP layer, tensor parallelism can be applied to the Self-Attention (SA) layer by exploiting the inherent parallelism of multi-head attention: the heads are split into subsets (SA1 and SA2), and each GPU computes its own subset independently. The output projection matrix B is split row-wise as in the MLP case, [B1, B2]^T, so again only one AllReduce is needed to aggregate the partial results.
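As before, the equivalence can be verified on a single process. The shapes, the two-way head split, and the helper function below are illustrative assumptions, and the plain sum at the end stands in for the AllReduce performed by g.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq, n_heads, d_head = 2, 5, 4, 8
d_model = n_heads * d_head

x  = torch.randn(batch, seq, d_model)
Wq = torch.randn(d_model, d_model) * d_model ** -0.5   # query projection
Wk = torch.randn(d_model, d_model) * d_model ** -0.5   # key projection
Wv = torch.randn(d_model, d_model) * d_model ** -0.5   # value projection
B  = torch.randn(d_model, d_model) * d_model ** -0.5   # output projection

def attention(x, Wq, Wk, Wv, heads, d_head):
    """Multi-head attention over `heads` heads; returns the concatenated head outputs."""
    b, s, _ = x.shape
    q = (x @ Wq).view(b, s, heads, d_head).transpose(1, 2)
    k = (x @ Wk).view(b, s, heads, d_head).transpose(1, 2)
    v = (x @ Wv).view(b, s, heads, d_head).transpose(1, 2)
    scores = (q @ k.transpose(-2, -1)) / d_head ** 0.5
    out = F.softmax(scores, dim=-1) @ v
    return out.transpose(1, 2).reshape(b, s, heads * d_head)

# Unpartitioned reference: full multi-head attention followed by B.
Y_ref = attention(x, Wq, Wk, Wv, n_heads, d_head) @ B

# Split the Q/K/V projections column-wise (two heads per GPU) and B row-wise.
Wq1, Wq2 = Wq.chunk(2, dim=1)
Wk1, Wk2 = Wk.chunk(2, dim=1)
Wv1, Wv2 = Wv.chunk(2, dim=1)
B1,  B2  = B.chunk(2, dim=0)

# SA1 and SA2: each GPU runs attention only over its own subset of heads.
Y1 = attention(x, Wq1, Wk1, Wv1, n_heads // 2, d_head) @ B1
Y2 = attention(x, Wq2, Wk2, Wv2, n_heads // 2, d_head) @ B2

# g: one AllReduce (modeled here by a plain sum) recovers the full output.
Y_tp = Y1 + Y2
assert torch.allclose(Y_ref, Y_tp, atol=1e-5)
```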
Summary
TP allows all GEMMs in a single Transformer layer to be executed across multiple GPUs with only two AllReduce operations in the forward pass and two in the backward pass (one each for the MLP and Self-Attention blocks), keeping communication overhead low while splitting both compute and parameters.

References
- Shoeybi et al., Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, arXiv:1909.08053, Mar 2020