The Implicit Bias of Gradient Accumulation

by Jiatong Yu, April 2025


Motivation

I was experimenting with disassembling the GRPO objective into a weighted sum of SFT losses while removing the KL regularization term:

\[ J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim P_{\mathrm{sft}}(Q),\, \{o_i\}_{i=1}^G \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=0}^{|o_i|-1} \frac{\pi_\theta(o_{i, t+1}\mid q, o_{i,\le t})}{\pi_{\theta_{\mathrm{old}}}(o_{i, t+1}\mid q, o_{i,\le t})} \,\hat A_{i, t} \right]. \]

Without the KL term, maximizing this objective amounts to gradient descent on the SFT loss of trajectories with better-than-average rewards and gradient ascent on the SFT loss of trajectories with worse-than-average rewards. I was exploring whether directly simplifying the GRPO objective this way can still achieve good performance.
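To make the decomposition explicit (a quick sketch, assuming the gradient is taken at \(\theta = \theta_{\mathrm{old}}\) and an outcome reward so that \(\hat A_{i,t} = \hat A_i\) for every token of trajectory \(o_i\)):

\[ \nabla_\theta J_{\mathrm{GRPO}}(\theta)\Big|_{\theta=\theta_{\mathrm{old}}} = \mathbb{E}\!\left[ \frac{1}{G}\sum_{i=1}^G \hat A_i\,\frac{1}{|o_i|}\sum_{t=0}^{|o_i|-1} \nabla_\theta \log \pi_\theta(o_{i,t+1}\mid q, o_{i,\le t}) \right] = -\,\mathbb{E}\!\left[ \frac{1}{G}\sum_{i=1}^G \hat A_i \,\nabla_\theta L^{\mathrm{SFT}}_{i}(\theta) \right], \]

where \(L^{\mathrm{SFT}}_{i}(\theta) = -\tfrac{1}{|o_i|}\sum_{t}\log \pi_\theta(o_{i,t+1}\mid q, o_{i,\le t})\) is the per-trajectory SFT loss, so the advantages \(\hat A_i\) act as (possibly negative) weights on ordinary SFT gradients.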

As things always go, I ran into another question: is it better to (1) accumulate gradients from both the GD losses and the GA losses and take a single optimizer step, or (2) update the parameters on the GD gradients and the GA gradients separately? RL-inspired algorithms are essentially always implemented the first way, but it was not intuitive to me why one choice should be better than the other, beyond the larger effective batch size.
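Here is a minimal sketch of the two options, assuming a standard PyTorch loop; `model`, `optimizer`, the batches, and the helper `loss_fn` (which computes a per-batch SFT loss) are placeholders rather than the exact training code:

```python
import torch

def accumulated_step(model, optimizer, loss_fn, pos_batch, neg_batch):
    """Option (1): accumulate the GD and GA gradients, then take one step."""
    optimizer.zero_grad()
    loss = loss_fn(model, pos_batch) - loss_fn(model, neg_batch)
    loss.backward()  # descent on the positive batch, ascent on the negative batch
    optimizer.step()

def separate_steps(model, optimizer, loss_fn, pos_batch, neg_batch):
    """Option (2): one parameter update per loss, applied sequentially."""
    optimizer.zero_grad()
    loss_fn(model, pos_batch).backward()
    optimizer.step()

    # Recompute the GA loss so its gradient is taken at the updated parameters.
    optimizer.zero_grad()
    (-loss_fn(model, neg_batch)).backward()
    optimizer.step()
```

The analysis below treats the second gradient in option (2) as being evaluated at the already-updated parameters, which is exactly why the two schemes differ at second order.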


Sequential Updates between Loss Functions

Write the full-batch update as \[ \theta' \,=\, \theta - \eta \nabla L(\theta),\qquad \text{with } L(\theta)=\frac{1}{B}\sum_{b=1}^{B} L_b(\theta). \] With mini-batch training that performs sequential updates \( \theta_b = \theta_{b-1} - \tfrac{\eta}{B}\,\nabla L_b(\theta_{b-1}) \), a second-order expansion gives

\[ \theta_m = \theta - \frac{\eta}{B}\sum_{b=1}^{m}\nabla L_b(\theta) \;+\; \frac{\eta^2}{B^2}\sum_{b_1=1}^{m}\sum_{b_2=1}^{b_1-1}\nabla^2 L_{b_1}(\theta)\,\nabla L_{b_2}(\theta) \;+\; O\!\big(m^3(\eta/B)^3\big). \]
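The expansion is easy to sanity-check numerically. A small sketch on toy quadratic losses (NumPy; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, B, eta = 5, 4, 1e-2

# Toy quadratic losses L_b(x) = 0.5 x^T A_b x + c_b^T x,
# so grad L_b(x) = A_b x + c_b and the Hessian of L_b is A_b.
A = [M @ M.T for M in rng.standard_normal((B, d, d))]
c = rng.standard_normal((B, d))
grad = lambda b, x: A[b] @ x + c[b]

theta0 = rng.standard_normal(d)

# Exact sequential updates: theta_b = theta_{b-1} - (eta/B) * grad L_b(theta_{b-1})
theta = theta0.copy()
for b in range(B):
    theta = theta - (eta / B) * grad(b, theta)

# Second-order prediction from the expansion above (m = B)
pred = theta0 - (eta / B) * sum(grad(b, theta0) for b in range(B))
for b1 in range(B):
    for b2 in range(b1):
        pred = pred + (eta / B) ** 2 * A[b1] @ grad(b2, theta0)

print(np.linalg.norm(theta - pred))  # residual should scale like eta**3
```

Shrinking \(\eta\) by a factor of 10 should shrink the printed residual by roughly a factor of 1000.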

Warm-up: Alternating two losses \(L_-\) and \(L_+\)

Suppose updates alternate \(L_-, L_+, L_-, L_+, \dots\) and take \(m=B\). Then

\[ \theta_B = \theta \;-\; \frac{\eta}{2}\big(\nabla L_{+}(\theta)+\nabla L_{-}(\theta)\big) \;+\; \eta^2\Bigg[ \frac{B-2}{4B}\,\nabla\|\nabla L(\theta)\|^2 \;+\; \frac{1}{2B}\,\nabla^2 L_{+}(\theta)\,\nabla L_{-}(\theta) \Bigg] \;+\; O(\eta^3). \]
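One way to recover these coefficients: with \(L_-\) at odd steps and \(L_+\) at even steps, among the pairs \(b_2 < b_1\) the combinations contributing \(\nabla^2 L_-\nabla L_-\), \(\nabla^2 L_+\nabla L_+\), and \(\nabla^2 L_-\nabla L_+\) each occur \(\tfrac12\tfrac{B}{2}\big(\tfrac{B}{2}-1\big)\) times, while \(\nabla^2 L_+\nabla L_-\) occurs \(\tfrac12\tfrac{B}{2}\big(\tfrac{B}{2}+1\big)\) times. Collecting terms,

\[ \frac{\eta^2}{B^2}\sum_{b_1=1}^{B}\sum_{b_2=1}^{b_1-1}\nabla^2 L_{b_1}\,\nabla L_{b_2} \;=\; \eta^2\,\frac{B-2}{8B}\sum_{s,\,s'\in\{+,-\}}\nabla^2 L_{s}\,\nabla L_{s'} \;+\; \frac{\eta^2}{2B}\,\nabla^2 L_{+}\,\nabla L_{-}, \]

and the first term equals \(\eta^2\,\tfrac{B-2}{4B}\,\nabla\|\nabla L(\theta)\|^2\) because \(L=\tfrac12(L_++L_-)\) and \(\nabla\|\nabla L\|^2 = 2\nabla^2 L\,\nabla L\).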

For large batches the correction term approaches the usual \(\tfrac{\eta^2}{2}\,\nabla^2 L(\theta)\nabla L(\theta)\) (multiple small steps on the same loss). But with small batches the extra cross-term \(\nabla^2 L_{+}\,\nabla L_{-}\) matters.

Sampling two sets of losses

Now imagine two families of losses \(\{L_k^{-}\}\) and \(\{L_k^{+}\}\). At odd steps update on \(L_k^{-}\), at even steps on \(L_k^{+}\), and randomly reshuffle the pairs. Let \(E_-=\mathbb{E}[L_1^{-}],\; E_+=\mathbb{E}[L_1^{+}],\; E=\tfrac12(E_-+E_+)\). Taking the expectation of the second-order term over the reshuffling yields

\[ \mathbb{E}\!\left[\frac{1}{B^2} \sum_{b_1=1}^{B}\sum_{b_2=1}^{b_1-1} \nabla^2 L_{b_1}\,\nabla L_{b_2}\right] \;=\; \frac{B-2}{4B}\,\nabla\|\nabla E\|^2 \;-\; \frac{1}{2B}\,\nabla\,\mathbb{E}\big[\big\|\nabla L_1-\nabla E\big\|^2\big] \;+\; \frac{1}{2B}\,\mathbb{E}\big[\nabla^2 L_{1}^{+}\,\nabla L_{1}^{-}\big]. \]

If we also randomize which loss starts first, the asymmetric final term symmetrizes to \(\tfrac{1}{4B}\,\nabla \mathbb{E}\big[\nabla L_1^{-}\!\cdot\!\nabla L_1^{+}\big]\).
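The symmetrization is just the product rule: averaging over which loss goes first,

\[ \frac12\left( \frac{1}{2B}\,\mathbb{E}\big[\nabla^2 L_{1}^{+}\,\nabla L_{1}^{-}\big] + \frac{1}{2B}\,\mathbb{E}\big[\nabla^2 L_{1}^{-}\,\nabla L_{1}^{+}\big] \right) \;=\; \frac{1}{4B}\,\nabla\,\mathbb{E}\big[\nabla L_1^{-}\!\cdot\!\nabla L_1^{+}\big], \]

since \(\nabla\big(\nabla L^{-}\!\cdot\!\nabla L^{+}\big) = \nabla^2 L^{-}\,\nabla L^{+} + \nabla^2 L^{+}\,\nabla L^{-}\).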

Interpretation

Performing finer-grained sequential updates introduces an implicit bias that encourages alignment between the gradients of the two losses: the cross term acts as a negative penalty on, i.e. a reward for, their inner product.
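Equivalently, keeping only this cross term (the remaining second-order terms parallel the usual mini-batch SGD correction), the expected sequential update behaves to second order like gradient descent on a modified loss

\[ \tilde{E}(\theta) \;=\; E(\theta) \;-\; \frac{\eta}{4B}\,\mathbb{E}\big[\nabla L_1^{-}(\theta)\cdot\nabla L_1^{+}(\theta)\big], \]

so smaller effective batches (or larger learning rates) exert stronger pressure toward aligned GD and GA gradients.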

Experiments on the Adam Optimizer

The calculations above only work for SGD; for Adam the analysis becomes monstrous, so we resort to experiments. We fine-tune Llama3-8B-Instruct with batch size 8 on (1) GD on correct GSM8K answers and (2) GA on synthetically generated incorrect GSM8K answers. We track the inner product between the gradients of the GA loss on incorrect answers and the GD loss on correct answers during training. The red line represents separate (iterative) gradient updates, and the green line represents the accumulated update that gathers the GD and GA gradients together.

[Figure: Adam Accumulation — inner product between the GA and GD gradients over training for the two update schemes]
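A rough sketch of how the tracked quantity can be computed (a hypothetical helper, assuming the two losses come from separate forward passes so that each owns its own autograd graph; `model`, `loss_correct`, and `loss_incorrect` are placeholders):

```python
import torch

def flat_grad(loss, params):
    """Flatten d(loss)/d(params) into a single vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def gd_ga_alignment(model, loss_correct, loss_incorrect):
    """Inner product between the GD gradient (correct answers) and the
    GA gradient (negated loss on incorrect answers)."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_gd = flat_grad(loss_correct, params)       # we descend this loss
    g_ga = flat_grad(-loss_incorrect, params)    # we ascend this loss
    return torch.dot(g_gd, g_ga).item()
```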

Implications

  • Accumulated step: behaves like a full-batch step on the average loss, missing the extra second-order term that promotes gradient alignment across components.
  • Separated gradient updates: add a curvature-weighted cross term that nudges gradients of different pieces (e.g., “reinforce good” vs “suppress bad”) to align. This can stabilize training when those pieces point in similar directions, and cause interference when they conflict.
  • Batch size matters: the alignment effect scales like \(1/B\). It is negligible for large \(B\), and most pronounced for tiny effective batches or long accumulation windows.

The extra term is a cousin of the well-known second-order correction that appears when expanding SGD as a stochastic differential equation; here, because updates alternate across different loss components, the correction contains cross-Hessian × gradient interactions that selectively favor gradient alignment (compare with analyses of SGD’s implicit regularization in over-parameterized models).


References

Smith, Samuel L., Benoit Dherin, David G. T. Barrett, and Soham De (2021). “On the Origin of Implicit Regularization in Stochastic Gradient Descent.” ICLR 2021.


If you came across this message, thanks for reading! I'm not finished writing this blog post yet; I'll add more experiments on the impact on RL-style algorithms :)