Implicit Bias of Gradient Accumulation in RLHF
In RL-inspired algorithms such as GRPO, we are effectively switching between a descent loss (when the reward, or group-normalized advantage, is positive) and an ascent loss (when it is negative). Does combining these two kinds of loss in the same accumulated gradient update inject an implicit bias?
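To make the sign flip concrete, here is a minimal sketch of a REINFORCE-style surrogate (not the full GRPO objective, and all tensor names are illustrative placeholders): samples with positive advantage contribute a descent direction on their log-probability, samples with negative advantage contribute an ascent direction, and both are folded into one accumulated gradient.

```python
import torch

torch.manual_seed(0)

# Toy "policy": logits over a small vocabulary for a batch of sampled tokens.
logits = torch.randn(4, 8, requires_grad=True)        # (batch, vocab)
actions = torch.randint(0, 8, (4,))                   # sampled token ids
advantages = torch.tensor([1.0, -0.5, 2.0, -1.5])     # mixed signs

logprobs = torch.log_softmax(logits, dim=-1)
logprob_taken = logprobs[torch.arange(4), actions]    # log pi(a | s)

# Minimizing -A * log pi pushes log-prob up when A > 0 (descent on the
# sample's loss) and down when A < 0 (ascent). Averaging over the batch
# mixes both directions in a single gradient update.
loss = -(advantages * logprob_taken).mean()
loss.backward()

print(logits.grad.norm())
```

The question above is whether averaging these opposing per-sample directions into one update step behaves the same as applying them separately, or whether the mixture itself carries a systematic bias.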