LLM Training — Fundamentals of Tensor Parallelism
Tensor model parallelism allows individual layers of a model to be partitioned over multiple devices, enabling efficient distribution of computational workload. This technique is particularly effective for the Multi-Layer Perceptron (MLP) and Self-Attention layers within a Transformer architecture.
Tensor Parallelism (TP)
TP can be applied to two main components of a Transformer layer:
- MLP Layer
- Self-Attention Layer
In the Megatron-LM formulation, each parallelized block is wrapped by two conjugate operators, f and g, which handle all required communication (a minimal sketch of both follows these lists).
During the forward pass:
- f is an identity operation (no communication).
- g performs one AllReduce to sum the partial outputs across GPUs.
During the backward pass:
- g is an identity operation (no communication).
- f performs one AllReduce to sum the partial input gradients across GPUs.
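To make f and g concrete, here is a minimal sketch in PyTorch, assuming torch.distributed is already initialized (e.g., via torchrun). The class names are illustrative, not Megatron-LM's actual API.

```python
import torch
import torch.distributed as dist


class _CopyToTensorParallelRegion(torch.autograd.Function):
    """f: identity in the forward pass, AllReduce in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        # Forward: pass the (replicated) input through unchanged.
        return x

    @staticmethod
    def backward(ctx, grad_output):
        # Backward: sum the partial input gradients coming from each GPU.
        grad = grad_output.clone()
        dist.all_reduce(grad)
        return grad


class _ReduceFromTensorParallelRegion(torch.autograd.Function):
    """g: AllReduce in the forward pass, identity in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        # Forward: sum the partial activations computed on each GPU.
        out = x.clone()
        dist.all_reduce(out)
        return out

    @staticmethod
    def backward(ctx, grad_output):
        # Backward: the gradient of a sum w.r.t. each addend is the gradient itself.
        return grad_output
```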

TP on MLP
In the MLP layer, tensor parallelism is achieved by splitting the first weight matrix A column-wise into [A1, A2] and the second weight matrix B row-wise into [B1, B2]^T. Splitting A column-wise means each GPU holds complete (not partial-sum) activation entries, so the GeLU nonlinearity can be applied locally; as a result, only one AllReduce is needed to sum the partial results of the second GEMM, highlighted in red in the figure below.
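The algebra behind this can be checked on a single process, with a plain sum standing in for the AllReduce. The shapes below are arbitrary, and the two halves of the computation represent what each of two GPUs would hold.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(4, 8)          # input  (batch, hidden)
A = torch.randn(8, 32) * 0.1   # first GEMM:  hidden -> 4*hidden
B = torch.randn(32, 8) * 0.1   # second GEMM: 4*hidden -> hidden

# Unpartitioned reference: Y = GeLU(X A) B
Y_ref = F.gelu(X @ A) @ B

# Split A column-wise and B row-wise; each half would live on its own GPU.
A1, A2 = A.chunk(2, dim=1)     # each (8, 16)
B1, B2 = B.chunk(2, dim=0)     # each (16, 8)

# Each GPU computes its partial result independently. The GeLU stays local
# because the column split of A yields complete activation entries (not
# partial sums), so the nonlinearity can be applied before any communication.
Y1 = F.gelu(X @ A1) @ B1
Y2 = F.gelu(X @ A2) @ B2

# g: one AllReduce (modeled here by a plain sum) recovers the full output.
Y_tp = Y1 + Y2
assert torch.allclose(Y_ref, Y_tp, atol=1e-6)
```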

TP on Self-Attention
Similar to the MLP layer, tensor parallelism can be applied to the Self-Attention (SA) layer by exploiting the inherent parallelism of multi-head attention: the heads are split into subsets (SA1 and SA2), and each GPU computes its own subset independently. The output projection matrix B is split row-wise as in the MLP case, [B1, B2]^T, so again only one AllReduce is needed to aggregate the partial results.
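As before, the equivalence can be verified on a single process. The shapes, the two-way head split, and the helper function below are illustrative assumptions, and the plain sum at the end stands in for the AllReduce performed by g.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq, n_heads, d_head = 2, 5, 4, 8
d_model = n_heads * d_head

x  = torch.randn(batch, seq, d_model)
Wq = torch.randn(d_model, d_model) * d_model ** -0.5   # query projection
Wk = torch.randn(d_model, d_model) * d_model ** -0.5   # key projection
Wv = torch.randn(d_model, d_model) * d_model ** -0.5   # value projection
B  = torch.randn(d_model, d_model) * d_model ** -0.5   # output projection

def attention(x, Wq, Wk, Wv, heads, d_head):
    """Multi-head attention over `heads` heads; returns the concatenated head outputs."""
    b, s, _ = x.shape
    q = (x @ Wq).view(b, s, heads, d_head).transpose(1, 2)
    k = (x @ Wk).view(b, s, heads, d_head).transpose(1, 2)
    v = (x @ Wv).view(b, s, heads, d_head).transpose(1, 2)
    scores = (q @ k.transpose(-2, -1)) / d_head ** 0.5
    out = F.softmax(scores, dim=-1) @ v
    return out.transpose(1, 2).reshape(b, s, heads * d_head)

# Unpartitioned reference: full multi-head attention followed by B.
Y_ref = attention(x, Wq, Wk, Wv, n_heads, d_head) @ B

# Split the Q/K/V projections column-wise (two heads per GPU) and B row-wise.
Wq1, Wq2 = Wq.chunk(2, dim=1)
Wk1, Wk2 = Wk.chunk(2, dim=1)
Wv1, Wv2 = Wv.chunk(2, dim=1)
B1,  B2  = B.chunk(2, dim=0)

# SA1 and SA2: each GPU runs attention only over its own subset of heads.
Y1 = attention(x, Wq1, Wk1, Wv1, n_heads // 2, d_head) @ B1
Y2 = attention(x, Wq2, Wk2, Wv2, n_heads // 2, d_head) @ B2

# g: one AllReduce (modeled here by a plain sum) recovers the full output.
Y_tp = Y1 + Y2
assert torch.allclose(Y_ref, Y_tp, atol=1e-5)
```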
Summary
TP allows all GEMMs in a single Transformer layer to be executed across multiple GPUs with only two AllReduce operations in the forward pass and two in the backward pass (one each for the MLP and Self-Attention blocks), keeping communication overhead low while splitting both compute and parameters.

References
- Shoeybi et al., Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, arXiv:1909.08053, Mar 2020