## Efficient LLMs

Yatin Nandwani Research Scientist, IBM Research



Semester 1,

2025-2026

Large Language Models: Introduction and Recent Advances

- How to train big models on big data?
- What's in the GPU memory during training?
  - i. Model weights; ii. Param gradients; iii. Optim states; iv. Activations
- What is the size of params / grads / optim states?
- What is the size of activations?
- How to reduce activation memory?
  - Activation re-computation aka Gradient checkpointing
- How to increase batch size?
  - Gradient accumulation (run fwd / bwd k times before optim.step())
- Can we parallelize grad. accumulation?
- Can we shard the optim. states, grads, and model params across GPUs?



#### Command: torchrun --nproc\_per\_node 2 train.py

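```
from torch.distributed.fsdp import fully_shard, FSDPModule

model = Transformer()
# Shard each transformer block first, then the root module.
for layer in model.layers:
    fully_shard(layer)
fully_shard(model)

# fully_shard modifies the module in place: it keeps its original class
# and additionally becomes an FSDPModule.
assert isinstance(model, Transformer)
assert isinstance(model, FSDPModule)
print(model)
# FSDPTransformer(
#   (tok_embeddings): Embedding(...)
#   (layers): 3 x FSDPTransformerBlock(...)
#   (output): Linear(...)
# )
```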

```
for _ in range(epochs):
    x = torch.randint(0, vocab_size, (batch_size, seq_len), device=device)
    loss = model(x).sum()
    loss.backward()
    optim.step()
    optim.zero_grad()
```

```
from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset,
    batch_size=32,
)
```

`for batch in train_loader:`

```
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group("nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()

# Each rank reads a disjoint slice of the dataset.
train_sampler = DistributedSampler(
    dataset, num_replicas=world_size, rank=rank,
    shuffle=True,  # shuffle at each epoch
)
train_loader = DataLoader(
    dataset,
    batch_size=local_batch_size,  # per-GPU batch size
    sampler=train_sampler,
)
```
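To make the role of the sampler concrete, here is a minimal sketch of how a dataset can be partitioned across ranks. It is plain Python and only imitates what I understand `DistributedSampler` to do when shuffling is off and `drop_last=False`; the helper name `shard_indices` is hypothetical and not part of PyTorch.

```python
import math

def shard_indices(dataset_len, num_replicas, rank):
    """Return the sample indices that one rank would read (no shuffling)."""
    # Pad so that every rank receives the same number of samples.
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    indices += indices[: total_size - dataset_len]  # wrap around to pad
    # Rank r takes every num_replicas-th index, starting at position r.
    return indices[rank:total_size:num_replicas]

# 10 samples over 4 simulated GPUs: each rank gets 3 samples,
# and two samples are repeated to even out the shards.
shards = [shard_indices(10, 4, r) for r in range(4)]
for r, shard in enumerate(shards):
    print(f"rank {r}: {shard}")
```

Each rank then batches only its own shard, which is why `batch_size` in the `DataLoader` is the per-GPU batch size.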





# SFTTrainer

**torchrun** is low-level PyTorch-native

**accelerate** is high-level and automates much of the distributed setup.

```
accelerate launch --config_file <path/to/acc/config> trl/scripts/sft.py \
    --model_name_or_path Qwen/Qwen2-0.5B \
    --dataset_name trl-lib/Capybara \
    --learning_rate 2.0e-5 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --eos_token '<|im_end|>' \
    --eval_strategy steps \
    --eval_steps 100 \
    --output_dir Qwen2-0.5B-SFT
```

https://huggingface.co/docs/trl/en/sft\_trainer
https://huggingface.co/docs/trl/main/en/distributing\_training
https://github.com/huggingface/trl/blob/main/trl/scripts/sft.py

Recap

- How to train big models on big data?
- What's in the GPU memory during training?
  - i. Model weights; ii. Param gradients iii. Optim states iv. Activations
- What is the size of params / grads / optim states?
- What is the size of activations?
- How to reduce activation memory?
  - Activation re-computation aka Gradient checkpointing
- How to increase batch size?
  - Gradient accumulation (run fwd / bwd k times before optim.step())
- Can we parallelize grad. Accumulation?
- Can we shard the optim. states, grads, and model params across GPUs?
- Still not sufficient for big models (e.g., 70B Llama). Can we split one input sequence across GPUs?
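The gradient accumulation item in the recap can be checked numerically. The sketch below uses a hypothetical one-parameter model `y = w * x` with a mean-squared-error loss (both are assumptions for illustration) and shows that accumulating scaled micro-batch gradients over k steps gives the same gradient as one large batch.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5
k = 4                     # number of accumulation steps
micro = len(xs) // k      # micro-batch size

# Gradient of one big batch (what we would like, but may not fit in memory).
full_grad = grad_mse(w, xs, ys)

# Gradient accumulation: k forward/backward passes, one optimizer step.
accum_grad = 0.0
for step in range(k):
    lo, hi = step * micro, (step + 1) * micro
    # Dividing by k keeps the sum equal to the full-batch mean gradient.
    accum_grad += grad_mse(w, xs[lo:hi], ys[lo:hi]) / k
# optim.step() and optim.zero_grad() would run here, once per k micro-batches.

print(full_grad, accum_grad)
```

The equality relies on equal-sized micro-batches; with uneven sizes the per-step scaling would have to be weighted by the number of samples.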





## Let us revisit activation memory







| Component        | Activation memory                                |
|------------------|--------------------------------------------------|
| MLP Block        | 19 * seq * bs * h                                |
| D/o mask         | 1 * seq * bs * h                                 |
| Linear (4h -> h) | 8 * seq * bs * h                                 |
| GELU             | 8 * seq * bs * h                                 |
| Linear (h -> 4h) | 2 * seq * bs * h                                 |
| Layer Norm       | 2 * seq * bs * h                                 |
| Attention Block  | $11 * seq * bs * h + 5 * n_{heads} * seq^2 * bs$ |
| Layer Norm       | 2 * seq * bs * h                                 |

## Memory for Activations

$$m_{act} = L * \left( 34 * seq * bs * h + 5 * n_{heads} * seq^2 * bs \right)$$

- Scales linearly with the batch size
- Scales quadratically with the sequence length

Korthikanti et al., 2022, Reducing Activation Recomputation in Large Transformer Models
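A small calculator makes the scaling behaviour tangible. This is a sketch that evaluates the formula above as written; the model configuration is hypothetical, and reading the result as bytes assumes 16-bit activations as in the cited paper.

```python
def activation_bytes(num_layers, seq, bs, h, n_heads):
    """Evaluate m_act = L * (34*seq*bs*h + 5*n_heads*seq^2*bs)."""
    linear_term = 34 * seq * bs * h               # linear in seq and bs
    quadratic_term = 5 * n_heads * seq ** 2 * bs  # attention scores, quadratic in seq
    return num_layers * (linear_term + quadratic_term)

# Hypothetical configuration, chosen only for illustration.
cfg = dict(num_layers=32, h=4096, n_heads=32, bs=1)
for seq in (2048, 4096, 8192, 16384):
    gib = activation_bytes(seq=seq, **cfg) / 2**30
    print(f"seq={seq:6d}: {gib:8.1f} GiB")
```

Doubling `bs` doubles the result, while doubling `seq` more than doubles it because of the quadratic attention term.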











### Column Linear





$$A \cdot B = A \cdot \begin{bmatrix} B_1 & B_2 & \cdots \end{bmatrix} = \begin{bmatrix} AB_1 & AB_2 & \cdots \end{bmatrix}$$
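The identity can be verified with a toy example. The sketch below uses plain Python lists as matrices; the helper names (`matmul`, `split_columns`, `concat_columns`) are hypothetical and stand in for what each GPU would do with its column shard.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def split_columns(B, parts):
    """Split matrix B into `parts` equal column blocks."""
    width = len(B[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in B] for p in range(parts)]

def concat_columns(blocks):
    """Concatenate matrices side by side."""
    return [sum((blk[i] for blk in blocks), []) for i in range(len(blocks[0]))]

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]             # (3, 2) input
B = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.0, 0.0, 3.0]]    # (2, 4) weight

B1, B2 = split_columns(B, 2)             # each simulated GPU holds half the columns
Y1, Y2 = matmul(A, B1), matmul(A, B2)    # computed independently, no communication
Y = concat_columns([Y1, Y2])             # gather along the column dimension

print(Y == matmul(A, B))
```

Every GPU needs the full input `A`, but only its own slice of the weight `B` and of the output.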










## MLP Block

| Component        | Activation memory |
|------------------|-------------------|
| Linear (4h -> h) | 8 * seq * bs * h  |
| GELU             | 8 * seq * bs * h  |
| Linear (h -> 4h) | 2 * seq * bs * h  |

- Focus on the 2<sup>nd</sup> Linear layer and its output neurons.
- Every input neuron contributes to all the output neurons.
- Can we split the computation across two devices?



$$A \cdot B = A \cdot \begin{bmatrix} B_1 & B_2 & \cdots \end{bmatrix} = \begin{bmatrix} AB_1 & AB_2 & \cdots \end{bmatrix}$$









[Figure: intermediate neurons $u_1, \dots, u_{4h}$ feeding output neuron $o_h$]

























## Tensor Parallelism in MLP Block

- Use column-linear to split the 1<sup>st</sup> layer:
  - Compute different neurons on different GPUs.
  - Each GPU computes $\frac{4h}{TP}$ activations.
- Use row-linear to split the 2<sup>nd</sup> layer:
  - Each GPU acts on different neurons and computes a partial output.
  - No need to communicate the intermediate activations across GPUs → reduction in activation memory!
- Use `all_reduce` to sum the partial outputs across GPUs.
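The recipe above can be simulated on one machine. This is a minimal sketch in plain Python, not the lecture's implementation: the sizes and weight values are toy assumptions, "GPUs" are loop iterations, and the `all_reduce` is modelled as an element-wise sum of the partial outputs.

```python
import math

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def gelu_matrix(M):
    """Element-wise exact GELU via the error function."""
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in M]

h, ffn, tp = 2, 8, 2      # hidden size, 4h, number of simulated GPUs
X = [[0.5, -1.0], [1.5, 2.0], [-0.25, 0.75]]                                   # (tokens, h)
W1 = [[((3 * i + 5 * j) % 7 - 3) / 4 for j in range(ffn)] for i in range(h)]   # (h, 4h)
W2 = [[((2 * i + 3 * j) % 5 - 2) / 4 for j in range(h)] for i in range(ffn)]   # (4h, h)

# Single-device reference: Linear (h -> 4h), GELU, Linear (4h -> h).
reference = matmul(gelu_matrix(matmul(X, W1)), W2)

shard = ffn // tp
partials = []
for rank in range(tp):
    cols = slice(rank * shard, (rank + 1) * shard)
    W1_r = [row[cols] for row in W1]         # column-linear shard, (h, 4h/TP)
    W2_r = W2[cols]                          # row-linear shard, (4h/TP, h)
    hidden = gelu_matrix(matmul(X, W1_r))    # local activations, never communicated
    partials.append(matmul(hidden, W2_r))    # partial output, full shape (tokens, h)

# all_reduce (sum): every rank ends up with the sum of all partial outputs.
output = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

max_err = max(abs(o - r) for orow, rrow in zip(output, reference)
              for o, r in zip(orow, rrow))
print(max_err)
```

GELU is applied element-wise, so it commutes with the column split; that is why no communication is needed between the two linear layers.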









### Tensor Parallelism in MHA Block

- Parallelize different heads on different GPUs, i.e. along the `num_attention_heads` dimension.
- Same as splitting the *K*, *Q*, *V* matrices in column-parallel.
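A toy simulation shows why heads parallelize cleanly. In this sketch each simulated GPU owns one head, that is, one column block of the *Q*, *K*, *V* projections. Splitting the output projection by rows and summing the partial results is an assumption borrowed from the MLP case, not something stated on the slide; the weights are arbitrary and no attention mask is used.

```python
import math

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def cols(M, lo, hi):
    """Column slice M[:, lo:hi]."""
    return [row[lo:hi] for row in M]

def attention(Q, K, V):
    """Single-head scaled dot-product attention without a mask."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def weight(seed, size):
    """Deterministic toy (size, size) weight matrix; values are arbitrary."""
    return [[math.cos(seed + 2 * i + 3 * j) for j in range(size)] for i in range(size)]

n_heads, d = 2, 2                  # two heads of size 2, so h = 4
h = n_heads * d
X = [[math.sin(t + j) for j in range(h)] for t in range(3)]     # (seq, h)
Wq, Wk, Wv, Wo = (weight(s, h) for s in (1, 2, 3, 4))

# Single-device reference: project, attend per head, concatenate, project out.
Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
heads = [attention(cols(Q, i * d, (i + 1) * d), cols(K, i * d, (i + 1) * d),
                   cols(V, i * d, (i + 1) * d)) for i in range(n_heads)]
concat = [sum((head[t] for head in heads), []) for t in range(len(X))]
reference = matmul(concat, Wo)

# Tensor parallel: rank r holds the column blocks of Wq/Wk/Wv for head r
# and the matching row block of Wo (assumed row-parallel output projection).
partials = []
for rank in range(n_heads):
    lo, hi = rank * d, (rank + 1) * d
    q, k, v = (matmul(X, cols(W, lo, hi)) for W in (Wq, Wk, Wv))
    partials.append(matmul(attention(q, k, v), Wo[lo:hi]))

# all-reduce (sum) of the partial outputs.
output = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
max_err = max(abs(o - r) for orow, rrow in zip(output, reference)
              for o, r in zip(orow, rrow))
print(max_err)
```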













 $f^*$ : all-reduce to synchronize activations

1. Synchronization does not overlap with computation.
2. This "exposed communication" overhead is necessary to combine partial results across tensor-parallel ranks before the final LayerNorm can be applied.





Dropout & LayerNorm: exactly the same operations are replicated on exactly the same data on every TP GPU.














#### Can we parallelize dropout and LayerNorm?


## Sequence Parallel – parallelizing dropout & LayerNorm

| Component        | Activation memory                                |
|------------------|--------------------------------------------------|
| Total            | $34 * seq * bs * h + 5 * n_{heads} * seq^2 * bs$ |
| MLP Block        | 19 * seq * bs * h                                |
| D/o mask         | 1 * seq * bs * h                                 |
| Linear (4h -> h) | 8 * seq * bs * h                                 |
| GELU             | 8 * seq * bs * h                                 |
| Linear (h -> 4h) | 2 * seq * bs * h                                 |
| Layer Norm       | 2 * seq * bs * h                                 |
| Attention Block  | $11 * seq * bs * h + 5 * n_{heads} * seq^2 * bs$ |
| Layer Norm       | 2 * seq * bs * h                                 |

- In DP, we parallelize along the "batch dim" (bs)
- In TP, we parallelize along the "hidden dim" (h)
- In SP, we parallelize along the input sequence dimension (seq)







#### **Initial LayerNorm layer (SP region)**

- Input tensors $X1^*$ and $X2^*$ of shape $(b, s/2, h)$ enter, already split across the sequence dimension.
- Each GPU computes LayerNorm independently on its sequence chunk, giving *Y1\** and *Y2\**.

#### First transition (SP → TP)

- g operation (all-gather) combines Y1\* and Y2\* back to full sequence length.
- Restores Y(b, s, h) since column-linear layers need the full hidden dimension h.

#### First linear layer (TP region)

- A1 and A2 are column-linear layers, so they split Y along the hidden dimension.
- GELU is applied independently on each GPU.
- Z1\* and Z2\* are (b, s, h/2).



#### **Second linear layer (TP region)**

- *B1* and *B2* are row-linear layers, so they restore the hidden dimension.
- *W1* and *W2* are partial outputs of shape (b, s, h) that need to be summed together.

#### Final transition (TP → SP)

- g\* operation (reduce-scatter) performs the reduction needed for row-linear correctness while scattering the result along the sequence dimension.
- W1\* and W2\* are (b, s/2, h).
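The SP → TP → SP walk-through can be replayed end to end in a few lines. The following is a sketch under simplifying assumptions: plain Python lists, the batch dimension omitted, LayerNorm without learnable parameters, toy sizes and arbitrary weights. The all-gather and reduce-scatter are simulated with list operations rather than real collectives.

```python
import math

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def layer_norm(M, eps=1e-5):
    """LayerNorm over the hidden dimension of each token (no scale/shift)."""
    out = []
    for row in M:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def gelu_matrix(M):
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in M]

s, h, n_gpus = 4, 4, 2
ffn = 4 * h
X = [[math.sin(1 + t + 3 * j) for j in range(h)] for t in range(s)]    # (s, h)
A = [[math.cos(i + 2 * j) for j in range(ffn)] for i in range(h)]      # (h, 4h)
B = [[math.sin(2 * i + j) for j in range(h)] for i in range(ffn)]      # (4h, h)

# Single-device reference.
reference = matmul(gelu_matrix(matmul(layer_norm(X), A)), B)

chunk, shard = s // n_gpus, ffn // n_gpus

# SP region: each GPU normalizes only its own sequence chunk X_r*.
Y_star = [layer_norm(X[r * chunk:(r + 1) * chunk]) for r in range(n_gpus)]

# g (all-gather): every GPU recovers the full-sequence Y.
Y = [row for part in Y_star for row in part]

# TP region: column-linear A_r, GELU, row-linear B_r give partial outputs W_r.
W = []
for r in range(n_gpus):
    A_r = [row[r * shard:(r + 1) * shard] for row in A]
    B_r = B[r * shard:(r + 1) * shard]
    W.append(matmul(gelu_matrix(matmul(Y, A_r)), B_r))

# g* (reduce-scatter): sum the partial outputs, then keep one sequence chunk per GPU.
summed = [[sum(vals) for vals in zip(*rows)] for rows in zip(*W)]
W_star = [summed[r * chunk:(r + 1) * chunk] for r in range(n_gpus)]

flat = [row for part in W_star for row in part]
max_err = max(abs(o - ref) for orow, rrow in zip(flat, reference)
              for o, ref in zip(orow, rrow))
print(max_err)
```

LayerNorm acts on each token separately, which is why splitting along the sequence dimension leaves its result unchanged.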

## Computation-communication timeline for MLP Layer



[Figure: forward-pass timeline showing the all-gather and reduce-scatter of activations]

1. Synchronization does not overlap with computation.
2. This "exposed communication" overhead is necessary to combine partial results across tensor-parallel ranks before the LayerNorm can be applied.



# Tensor+Sequence Parallelism - Tradeoffs

- Intermediate activations sharded across GPUs.
- TP: Reduces activation memory for matrix multiplication
- SP: Reduces activation memory for LayerNorm & dropout
- Need to gather full activations for LayerNorm
- Introduces significant communication overhead
- Introduces "exposed communication"
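To put rough numbers on the memory side of this tradeoff, the sketch below compares per-layer activation memory for TP alone and for TP combined with SP. The two sharded formulas are quoted from Korthikanti et al. (2022) as I recall them, not from these slides, so treat them as an assumption; the configuration is hypothetical.

```python
def per_layer_activation_bytes(seq, bs, h, n_heads, tp=1, sequence_parallel=False):
    """Per-layer activation memory under TP, optionally combined with SP."""
    sbh = seq * bs * h
    attn_scores = 5 * n_heads * seq ** 2 * bs
    if sequence_parallel:
        # TP + SP: everything, including LayerNorm and dropout, is sharded.
        return (34 * sbh + attn_scores) / tp
    # TP only: 10*sbh (LayerNorm inputs, dropout masks, block inputs) stays replicated.
    return 10 * sbh + (24 * sbh + attn_scores) / tp

# Hypothetical configuration, for illustration only.
cfg = dict(seq=4096, bs=1, h=4096, n_heads=32)
for tp in (1, 2, 4, 8):
    tp_only = per_layer_activation_bytes(tp=tp, **cfg) / 2**20
    tp_sp = per_layer_activation_bytes(tp=tp, sequence_parallel=True, **cfg) / 2**20
    print(f"TP={tp}: TP only {tp_only:8.1f} MiB, TP+SP {tp_sp:8.1f} MiB")
```

With `tp=1` both variants reduce to the unsharded total from the table, which is a quick sanity check on the formulas.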



For Sequence Length 4096





# Tensor+Sequence Parallelism - Tradeoffs

- TP leverages fast NVLink interconnects within a node.
- Slower network connections across nodes result in a huge throughput drop.





1. If we scale the sequence length, the activation memory will still blow up in the TP region.
   - In SP, we split one sequence into chunks and process each chunk in parallel.
   - Can we do the same for the MLP layer?
   - How about the Attention layer?
   - → Context Parallelism
2. If the model is too big to fit with TP=8, we will see a massive slowdown due to the inter-node connectivity.
   - In TP, we split a "Tensor" across GPUs.
   - How about splitting layers across GPUs?
   - → Pipeline Parallelism



# Context Parallelism – partition along sequence length

- For MLP layers, it's exactly the same as SP for LayerNorm & Dropout.
- In the Attention layer, each token has to attend to every other token.
- But tokens in a different chunk are on a different GPU!
- Full communication b/w GPUs?
- Can we somehow overlap computation with communication?

Ring Attention = online softmax computation + communication overlapped with computation
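The two ingredients can be illustrated with a single-process simulation. This is a sketch, not the paper's implementation: "GPUs" are list entries, the ring send/receive is a list rotation, attention is unmasked and single-head, and the overlap of communication with computation is not modelled since everything runs sequentially. The function names are hypothetical.

```python
import math

def full_attention(Q, K, V):
    """Reference: ordinary softmax attention over the whole sequence."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def ring_attention(Q_chunks, K_chunks, V_chunks):
    """Each simulated GPU keeps its Q chunk; K/V blocks travel around the ring."""
    n = len(Q_chunks)
    d = len(Q_chunks[0][0])
    dv = len(V_chunks[0][0])
    # Online-softmax state per GPU and per query: running max, running sum, accumulator.
    run_max = [[-math.inf] * len(Qc) for Qc in Q_chunks]
    run_sum = [[0.0] * len(Qc) for Qc in Q_chunks]
    acc = [[[0.0] * dv for _ in Qc] for Qc in Q_chunks]
    kv = list(zip(K_chunks, V_chunks))     # kv[g] = block currently held by GPU g
    for _ in range(n):
        for g in range(n):
            K_blk, V_blk = kv[g]
            for t, q in enumerate(Q_chunks[g]):
                scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K_blk]
                new_max = max(run_max[g][t], max(scores))
                scale = math.exp(run_max[g][t] - new_max)   # rescale old statistics
                weights = [math.exp(s - new_max) for s in scores]
                run_sum[g][t] = run_sum[g][t] * scale + sum(weights)
                acc[g][t] = [a * scale + sum(w * v[j] for w, v in zip(weights, V_blk))
                             for j, a in enumerate(acc[g][t])]
                run_max[g][t] = new_max
        # Ring step: GPU g receives the K/V block previously held by GPU g-1.
        kv = [kv[(g - 1) % n] for g in range(n)]
    return [[[a / run_sum[g][t] for a in acc[g][t]] for t in range(len(Q_chunks[g]))]
            for g in range(n)]

def split(M, n, size):
    """Split a (tokens, dim) matrix into n sequence chunks."""
    return [M[g * size:(g + 1) * size] for g in range(n)]

n_gpus, chunk, d = 2, 2, 2
tokens = n_gpus * chunk
Q = [[math.sin(t + 2 * j) for j in range(d)] for t in range(tokens)]
K = [[math.cos(2 * t + j) for j in range(d)] for t in range(tokens)]
V = [[math.sin(3 * t - j) for j in range(d)] for t in range(tokens)]

ring_out = ring_attention(split(Q, n_gpus, chunk), split(K, n_gpus, chunk),
                          split(V, n_gpus, chunk))
flat = [row for part in ring_out for row in part]
reference = full_attention(Q, K, V)
max_err = max(abs(o - r) for orow, rrow in zip(flat, reference)
              for o, r in zip(orow, rrow))
print(max_err)
```

At any time a GPU holds only its own Q chunk and one K/V block, and the online softmax lets it fold in each new block without ever materializing the full score matrix.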







H. Liu, M. Zaharia, P. Abbeel, 2023, Ring Attention with Blockwise Transformers for Near-Infinite Context

















# Computation-communication timeline



#### All-to-all (ring) implementation:

- GPUs exchange K/V pairs in a ring-like pattern, one chunk at a time.
- More memory-efficient, as each GPU only needs to store one additional chunk temporarily.
- Communication is spread out and overlapped with computation, though with some additional base latency overhead from multiple communication steps.





# Comparing with Naïve all-gather Implementation





[Figure: forward-pass timeline. GPU communication: fetch $K_{i+1}$, $V_{i+1}$]

1. If we scale the sequence length, the activation memory will still blow up in the TP region.
2. If the model is too big to fit with TP=8, we will see a massive slowdown due to the inter-node connectivity.

Pipeline Parallelism

- TP: Split a model across one node to tame large models.
- CP: Tame the activation explosion with long sequences.
- TP: Doesn't scale well across nodes.
- How about splitting layers across GPUs?

