Is Anybody Doing Async RL Correctly?
Async RL can unlock much higher training throughput, but stale trajectories turn the update into a fragile off-policy estimator. This post explains why clipping and masking only partially help, why effective sample size is an early collapse signal, and how variance-controlled updates stabilize high-lag training.
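
To make the effective-sample-size signal concrete, here is a minimal sketch using the standard Kish estimator, (Σw)² / Σw², normalized by batch size. It assumes one importance weight per sample, w = exp(logp_new - logp_old); the function name `effective_sample_size` and the log-probability inputs are hypothetical and chosen only for illustration, not taken from any particular training stack.

```python
import math

def effective_sample_size(logp_new, logp_old):
    """Normalized Kish ESS in (0, 1] for weights w_i = exp(logp_new_i - logp_old_i).

    Assumption: one log-probability per sample under the current policy
    (logp_new) and under the stale behavior policy (logp_old).
    """
    log_w = [a - b for a, b in zip(logp_new, logp_old)]
    # ESS is invariant to rescaling the weights, so subtract the max
    # log-weight before exponentiating to avoid overflow.
    m = max(log_w)
    w = [math.exp(x - m) for x in log_w]
    s1 = sum(w)
    s2 = sum(x * x for x in w)
    return (s1 * s1) / (s2 * len(w))

# No lag: the policies agree, every weight is equal, ESS fraction is 1.
fresh = effective_sample_size([-1.0, -2.0, -3.0, -4.0],
                              [-1.0, -2.0, -3.0, -4.0])

# High lag: one sample's weight dominates, so the batch behaves
# like roughly a single sample out of four.
stale = effective_sample_size([-1.0, -2.0, -3.0, 0.0],
                              [-1.0, -2.0, -3.0, -6.0])

print(f"fresh ESS fraction: {fresh:.3f}")
print(f"stale ESS fraction: {stale:.3f}")
```

A fraction near 1 means the batch is effectively on-policy. A fraction near 1/N means a single trajectory carries almost all of the weight, which is the kind of early warning the summary above refers to.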