Beyond Async RL: Efficient and Stable Post-Training for Reasoning Models and Agents

Goaly Team
May 28, 2026
About Goaly

At Goaly, our mission is to make frontier AI affordable and frictionless for every business. Today, only a small number of tech giants can afford to build, adapt, and own the custom models their businesses actually need — especially when those models must stay inside their own environment. We believe that has to change.

The early results below are not the full picture of what we’re building, but they highlight an important prerequisite: making post-training dramatically more efficient without sacrificing model quality or training stability. If teams can iterate on their models faster, get more learning out of the same GPU budget, and better support domain-specific agentic workloads in their own business environments, proprietary model development becomes less constrained by the dollar cost and more driven by the actual product needs.

Motivation

Decoupling the trainer and sampler into separate GPU pools and letting them run asynchronously has become standard practice for large-scale RL post-training[1][2][3]. It overlaps work that used to be serialized, and on paper it should make RL much faster. But async execution alone does not make an RL system efficient, stable, or easy to scale. Getting load balancing right is non-trivial, and the off-policy effects introduced by async execution can destabilize training if left uncontrolled. Some open-source frameworks end up capping or defaulting off-policy steps at one to preserve convergence, which leaves significant throughput on the table. At Goaly, we’ve been attacking these problems through a combination of system optimizations and algorithmic controls. The goal is to make RL training faster and cheaper without sacrificing model quality or training stability. By striking this balance, our stack delivers 2.8–4.3× speedups on smaller models (8B) and 1.8–2.5× speedups on 30–32B models compared to mainstream OSS RL solutions — with no regression in model quality.

Beyond efficiency, we’ve invested heavily in supporting the latest agentic RL trends — workloads where training stability matters even more than in classic RLVR. This includes long-horizon tasks where small off-policy drifts compound across many turns, massively concurrent sessions with per-task crash recovery, custom stateful sandboxes for enterprise workflows, and schedulers that absorb heterogeneous environment latency instead of assuming every rollout looks like a text completion.

Headline Results

Step-time speedup. We benchmarked GRPO training across three model scales using Qwen3[4] models, comparing against an OSS baseline. At 8B, step time drops from 77s to 27s, which is 2.8× faster than the OSS async baseline (4.3× vs. OSS sync). At 30B-A3B, we see 2.5× faster steps. At 32B, the speedup is 1.8×. Across all three settings, model quality matches the baseline: reward curves show no regression, and AIME[5] evals show parity.

Cost reductions. At a fixed GPU count, step-time speedups translate directly into lower GPU time per training step. The 2.8× 8B speedup implies roughly 64% lower per-step compute cost, the 2.5× 30B speedup implies roughly 60% lower cost, and the 1.8× 32B speedup implies roughly 44% lower cost. For the 32B run on 32 H100s, that is over $4,000 saved per 1,000 training steps at typical cloud rates.
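The percentages above follow from simple arithmetic: at a fixed GPU count, per-step GPU time scales with step time, so a speedup of s cuts per-step cost by 1 − 1/s. A minimal check, where the function name is ours and purely illustrative:

```python
def per_step_cost_reduction(speedup: float) -> float:
    """Fractional GPU-time saving per step, assuming the GPU count is unchanged."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Speedups reported in this post, versus the OSS async baseline.
for label, speedup in [("8B", 2.8), ("30B-A3B", 2.5), ("32B", 1.8)]:
    print(f"{label}: {per_step_cost_reduction(speedup):.0%} lower per-step cost")
```

The dollar figure additionally depends on the absolute step time and the hourly GPU price, which this sketch does not model.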

RL step-time speedup

Warm-state mean step time · matched data and hardware
Figure 1. Step-time speedup across large-scale RL workloads. All rows use matched hardware and the same recipe, data, and model weights. Dashed line marks baseline parity (1.0×).

RL Cost Reductions

Per-step GPU-time reduction · assumes GPU count stays the same
Figure 2. Per-step compute cost reductions implied by the measured step-time speedups. This assumes the number of GPUs stays the same, so GPU time scales with wall-clock step time.

Why async RL alone isn't enough

Async RL improves overlap between rollout generation and policy training, but it introduces new system and algorithmic problems that aren't free to solve.

1. Off-policy drift. The trainer keeps optimizing the policy while the sampler may still be generating long rollouts with old weights. By the time such a rollout finishes, the policy has already moved several steps ahead, so the trajectory used to update the current policy was generated by an older, different policy. Push the gap too far and training becomes unstable or even crashes, which is why multiple RL frameworks cap or default the maximum off-policy step count at 1.

Illustration: Off-policy drift in async RL. The trainer completes four optimizer steps (v0 → v4) while the sampler is still generating a single long rollout under the original policy v0. When that rollout arrives, the trainer is already on v4, so the rollout is 4 steps stale.

2. Training distribution skew. Shorter rollouts finish faster than longer ones, so they're more likely to be ready when the trainer pulls a batch. Left uncorrected, the effective training distribution drifts toward short samples, which hurts model performance on longer sequences. As a result, the model effectively trains on a different data mix than the one you wanted.

Illustration: Distribution skew toward short rollouts. Five rollouts of varying lengths (three short, one medium, one long) start at the same time. The trainer pulls a batch before the longer rollouts finish, so the batch ends up mostly short samples.

3. Load balancing. Async execution doesn't automatically keep both sides busy. Generation latency varies with prompt length, response length, environment latency, verifier latency, and tool-call behavior. Without careful scheduling and tuning, one side of the system still waits on the other.

Illustration: Load imbalance between trainer and sampler. The trainer alternates between optimizer steps and idle periods because sampler rollouts take varying amounts of time. When a rollout runs long, the trainer has nothing to optimize on and sits idle waiting for the next batch.
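The distribution skew described in item 2 is easy to reproduce with a toy model. The sketch below assumes every rollout starts at the same time and finishes in time proportional to its length; the lengths are made-up values, not measurements from our runs.

```python
from statistics import mean

# Hypothetical rollout lengths in tokens. All rollouts start at t = 0 and,
# by assumption, finish in time proportional to their length.
rollout_lengths = [500, 600, 700, 4000, 16000, 800, 900, 12000]


def completion_order_batch(lengths, batch_size):
    """Batch formed from whichever rollouts finish first (shortest first)."""
    return sorted(lengths)[:batch_size]


batch = completion_order_batch(rollout_lengths, batch_size=4)
population_mean = mean(rollout_lengths)
batch_mean = mean(batch)

print(f"intended mean length:        {population_mean:.0f} tokens")
print(f"completion-order batch mean: {batch_mean:.0f} tokens")
```

The batch contains only the four shortest rollouts, so its mean length falls far below the mean of the data mix that was actually dispatched.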

How we address these

Mitigating off-policy effects

To mitigate off-policy drift, we implemented several techniques so the trainer's and sampler's logits are bit-wise matched: given the same weights, the two systems produce identical logits down to the last bit, removing any numerical drift between the inference and training stacks.

Illustration: Bit-wise logit matching between sampler and trainer. Although the sampler uses inference-optimized kernels and the trainer uses training-optimized kernels, both produce bit-wise identical logits given the same weights (policy version v_t). This removes numerical drift between the two stacks as a source of inconsistency.
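Bit-wise matching is harder than it sounds because floating-point addition is not associative: two kernels that reduce the same values in a different order can disagree in the last bits. The sketch below illustrates this with plain Python floats and a strict bit-level comparison. It is a generic illustration of the problem, not a description of our kernels, and the helper names are ours.

```python
import struct


def float_bits(x: float) -> int:
    """Raw IEEE-754 bit pattern of a float64, as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bitwise_equal(a, b) -> bool:
    """True only if both sequences match in every bit, not just approximately."""
    return len(a) == len(b) and all(
        float_bits(x) == float_bits(y) for x, y in zip(a, b)
    )


# Same three addends, two reduction orders (stand-ins for two different kernels).
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(abs(left_to_right - right_to_left) < 1e-12)       # close in value
print(bitwise_equal([left_to_right], [right_to_left]))  # but not bit-identical
```

A tolerance-based check would pass here, while the bit-level check fails. That strict check is the bar the matching described above has to clear.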

Preserving the training distribution

Since short rollouts finish first, they get over-represented in the trainer's batches, skewing the overall training distribution toward short samples. To avoid this, we apply corrections that preserve the effective distribution of short and long samples, even as the off-policy window grows.

Illustration: Effective batch composition with and without rollout accounting. Without rollout accounting, the effective batch is dominated by short samples. With rollout accounting, the batch composition (short, medium, long) matches the intended source distribution, even at larger off-policy windows.
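This post does not spell out the correction itself. As one possible reading, and explicitly an assumption rather than our implementation, the sketch below releases rollouts to the trainer in dispatch order instead of completion order, so early finishers wait in a buffer until everything dispatched before them has arrived. The function name and the toy data are hypothetical.

```python
import heapq


def dispatch_order_batches(completions, batch_size):
    """Yield batches in dispatch order, regardless of completion order.

    `completions` is an iterable of (dispatch_index, rollout) pairs in the
    order rollouts finish. Early finishers wait in a heap until every
    rollout dispatched before them has been released.
    """
    pending, next_index, batch = [], 0, []
    for index, rollout in completions:
        heapq.heappush(pending, (index, rollout))
        # Release the longest contiguous run starting at next_index.
        while pending and pending[0][0] == next_index:
            batch.append(heapq.heappop(pending)[1])
            next_index += 1
            if len(batch) == batch_size:
                yield batch
                batch = []


# Hypothetical rollouts as (dispatch_index, length label), in completion order.
finished = [(0, "short"), (2, "short"), (3, "short"),
            (1, "long"), (5, "short"), (4, "medium")]
batches = list(dispatch_order_batches(finished, batch_size=3))
print(batches)
```

In completion order the first batch would have been three short rollouts. In dispatch order the long rollout lands in the first batch, at the price of buffering the early finishers, which is the kind of delay an off-policy window can absorb.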

In our experiments, these two changes, combined with importance sampling, let us run RL stably with an off-policy window of up to 32 steps while keeping reward trajectories on par with synchronous RL. The benchmarks below use 8 steps, already enough to deliver up to 2.8× speedup over OSS async, with headroom to spare. (32 is just the highest we've tested; we haven't pushed to find the upper bound.)
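For readers unfamiliar with the two controls mentioned above, here is a minimal sketch of a per-token importance weight and an off-policy budget check. The truncation cap, the function names, and the exact form of the ratio are assumptions chosen for illustration; only the 8-step budget comes from the benchmark setup in this post.

```python
import math


def truncated_importance_weights(trainer_logprobs, sampler_logprobs, cap=2.0):
    """Per-token ratio pi_trainer / pi_sampler, truncated at `cap` (assumed value)."""
    return [
        min(math.exp(lp_trainer - lp_sampler), cap)
        for lp_trainer, lp_sampler in zip(trainer_logprobs, sampler_logprobs)
    ]


def within_off_policy_budget(trainer_version, rollout_version, max_off_policy_steps=8):
    """Accept a rollout only if it is at most `max_off_policy_steps` versions old."""
    return 0 <= trainer_version - rollout_version <= max_off_policy_steps


# Log-probabilities of the same three sampled tokens under each policy (made up).
weights = truncated_importance_weights([-1.0, -0.5, -2.0], [-1.0, -1.5, -1.0])
print(weights)
print(within_off_policy_budget(trainer_version=12, rollout_version=4))
print(within_off_policy_budget(trainer_version=13, rollout_version=4))
```

Tokens the current policy likes more than the sampling policy did are up-weighted, but only up to the cap, which bounds the variance that stale samples can inject into the gradient.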

Maximizing throughput

With off-policy effects under control, we're able to push hardware utilization as high as possible. Beyond standard async RL, we fully decouple trainer and sampler so each component can run at maximum throughput. There's no strict synchronization between trainer and sampler; they communicate through a message queue.

  • We enable continuous batching on the sampler to keep generation at maximum concurrency all the time.
  • We implemented eager weights transfer from trainer to sampler — the trainer pushes updated weights immediately after each optimizer step, and the sampler applies them as soon as they arrive, even mid-sequence, so generation always uses the most recent policy available.
Illustration: Eager weights transfer in async RL. The trainer pushes updated weights immediately after each optimizer step (v0 → v1 → v2 → v3 → v4). The sampler picks them up mid-rollout, so a single trajectory contains v0, v1, v2, and v3 tokens instead of getting stuck at the version that started the rollout.
  • We achieve a balanced load through two levers. First, continuous batching on the sampler absorbs rollout-length variance automatically, removing the largest source of idle time without any tuning. Second, we profile each side's throughput and set the GPU allocation accordingly — the right split is task-distribution-dependent, but once profiled for a given task family it stays stable across runs.
Illustration: Balanced load between trainer and sampler. With a decoupled architecture and an off-policy budget acting as a buffer for trajectories, both trainer and sampler stay continuously busy. Variable rollout lengths don't cause idle gaps on the trainer side.
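The eager weights transfer described in the list above can be sketched as a small deterministic simulation. Everything here (`WeightStore`, `Trajectory`, `push_schedule`) is a hypothetical stand-in: real weight transfer moves tensors between GPU pools, whereas this toy only tracks which policy version produced each token.

```python
from dataclasses import dataclass, field


@dataclass
class WeightStore:
    """Stand-in for the trainer-to-sampler weight channel (hypothetical)."""
    version: int = 0

    def push(self, version: int) -> None:
        self.version = version


@dataclass
class Trajectory:
    tokens: list = field(default_factory=list)
    versions: list = field(default_factory=list)  # policy version per token


def generate(store: WeightStore, num_tokens: int, push_schedule: dict) -> Trajectory:
    """Generate tokens, adopting whatever weights are newest before each token.

    `push_schedule` maps a token position to the version the trainer pushes
    just before that token, simulating optimizer steps that land mid-rollout.
    """
    traj = Trajectory()
    for position in range(num_tokens):
        if position in push_schedule:
            store.push(push_schedule[position])
        traj.tokens.append(f"tok{position}")  # placeholder for a sampled token
        traj.versions.append(store.version)   # kept for later off-policy correction
    return traj


traj = generate(WeightStore(), num_tokens=8, push_schedule={2: 1, 4: 2, 6: 3})
print(traj.versions)
```

Recording the version per token matters because a single trajectory now mixes several policies, so any off-policy correction has to be applied at token granularity rather than per rollout.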

With all these improvements, our infra achieves consistent step-time speedups across a wide range of models compared to OSS async/sync baselines, without regression in model performance.
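The second load-balancing lever, sizing each GPU pool from profiled throughput, reduces to a small search once the per-GPU rates are known. The sketch below assumes throughput scales linearly with GPU count on both sides, which is a simplification, and the rates in the example are invented.

```python
def best_gpu_split(total_gpus, sampler_rate_per_gpu, trainer_rate_per_gpu):
    """Pick (sampler_gpus, trainer_gpus) that maximizes end-to-end rollouts/sec.

    End-to-end throughput is limited by the slower side, so we maximize
    min(sampler throughput, trainer throughput) over all integer splits.
    """
    if total_gpus < 2:
        raise ValueError("need at least one GPU per side")
    best = None
    for sampler_gpus in range(1, total_gpus):
        trainer_gpus = total_gpus - sampler_gpus
        throughput = min(
            sampler_gpus * sampler_rate_per_gpu,
            trainer_gpus * trainer_rate_per_gpu,
        )
        if best is None or throughput > best[0]:
            best = (throughput, sampler_gpus, trainer_gpus)
    return best[1], best[2]


# Hypothetical profile: one sampler GPU produces 1.0 rollouts/sec,
# one trainer GPU consumes 3.0 rollouts/sec.
print(best_gpu_split(8, sampler_rate_per_gpu=1.0, trainer_rate_per_gpu=3.0))
```

With these made-up rates the search settles on six sampler GPUs and two trainer GPUs, the point where both sides process six rollouts per second.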

Detailed Benchmark Setup and Results

To measure the improvements precisely, we compared Goaly against a SOTA open-source RL framework under tightly matched conditions: same base model, same training data (all jobs use the DAPO math dataset[6]), same GRPO recipe, same hyperparameters, same batch size, same seed, and the same trainer/sampler GPU (NVIDIA H200) allocation.

Step time reduction

Per-step time traces across scales

Metric: perf/step_time (raw trace plus smoothed trend). Panels: 8B · 8 GPU (bsz 128 · 32k context); 30B-A3B · 32 GPU (bsz 256 · 32k context); 32B · 32 GPU (bsz 512 · 32k context). Series: OSS (sync), OSS (async), Goaly.
Figure 3. Per-step time traces at three scales. Faint lines are raw TensorBoard values; bold lines are smoothed trends.

Reward quality

Speed is only meaningful if the model learns the same thing. Across all three RL scales, the reward trajectories closely track the baseline, with the same convergence shape and same final reward band in these matched traces. We are not claiming statistical equivalence from one measured run per configuration; the claim is that these efficiency wins did not produce a visible learning regression.

Reward trajectories across scales

Metric: rollout/math_reward/mean (same GRPO recipe). Panels: 8B · 8 GPU; 30B-A3B · 32 GPU; 32B · 32 GPU. Series: OSS (sync), OSS (async), Goaly.
Figure 4. Reward trajectories at three scales on math reasoning tasks. Same GRPO recipe, same data, same hardware budget. OSS sync curves are absent: those runs were terminated early due to prohibitively slow step times.

For the 32B run, we also evaluated the trained model on AIME-2024. Pass@1 at iteration 180 is 0.4792 (Goaly) vs. 0.4625 (baseline). We treat this as quality parity rather than an eval lift; the important result is that the efficiency win did not show a quality regression in this run.

32B Qwen3 RL · Eval-AIME @ iter 180

AIME-2024 pass-rate · same iteration and eval harness. Series: OSS (async), Goaly.
Figure 5. AIME-2024 pass-rate at iteration 180 for the 32B run. Same iteration and eval harness.

From reasoning RL to agentic environments

Reasoning RL (math, STEM, and other RLVR tasks) is the cleanest domain in which to measure trainer-sampler efficiency, and the system designs for it are starting to converge. Agentic RL is the opposite. It's still a fast-evolving domain, and it demands efficient, reliable and scalable infrastructure, one that orchestrates far more moving parts than non-agentic RL: live environments, sandboxes, verifiers, and long-running agent sessions.

Concretely, this means handling:

  • Long-horizon tasks for terminal and coding agents
  • Massively concurrent sessions with task-crash recovery
  • Custom environments and sandboxes for enterprise workflows, with stateful execution
  • Training that interacts with live environments, not just static prompts and responses
  • Schedulers that absorb heterogeneous environment latency instead of assuming every rollout looks like a text completion
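To make the concurrency and crash-recovery requirements concrete, here is a minimal `asyncio` sketch in which each session retries independently, so one crashed sandbox never takes down the rest of the batch. The function names, the retry count, and the toy `flaky_session` are all hypothetical and not part of Goaly's API.

```python
import asyncio


async def run_with_recovery(task_id, run_session, max_attempts=3):
    """Run one agent session, retrying on crash without affecting other sessions."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = await run_session(task_id, attempt)
            return {"task": task_id, "attempts": attempt, "result": result}
        except Exception as exc:  # e.g. a crashed sandbox or a failed tool call
            last_error = exc
    return {"task": task_id, "attempts": max_attempts, "error": repr(last_error)}


async def run_all(task_ids, run_session, max_concurrency=64):
    """Run sessions concurrently, bounded by a semaphore."""
    gate = asyncio.Semaphore(max_concurrency)

    async def guarded(task_id):
        async with gate:
            return await run_with_recovery(task_id, run_session)

    return await asyncio.gather(*(guarded(t) for t in task_ids))


async def flaky_session(task_id, attempt):
    """Toy session: task 2 crashes on its first attempt, task 3 always crashes."""
    await asyncio.sleep(0)  # yield control, as a real environment call would
    if task_id == 3 or (task_id == 2 and attempt == 1):
        raise RuntimeError(f"sandbox for task {task_id} crashed")
    return "ok"


results = asyncio.run(run_all([1, 2, 3], flaky_session))
for row in results:
    print(row)
```

Failures are returned as data rather than raised, so the caller can decide whether to drop, requeue, or log a task that exhausted its retries.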

This is where Goaly is focused: building the next layer of agentic RL infrastructure on top of the optimized RL core. The agent system is designed to plug seamlessly into the RL loop or into eval pipelines through a simple API, and to support customized sandboxes for different environment types and workflows.

Initial coding-agent validation

As an initial validation, we ran a small-scale coding-agent RL experiment on 8 GPUs using SWE-Gym[7] data and a Qwen3.5-9B[8] base model. Each rollout is a live coding session inside a Docker sandbox. The agent can generate bash commands, inspect the repository, edit files, run tests, and submit a final patch. The environment returns a binary verifier-based reward (pass = 1.0, fail = 0.0), computed from the verification test cases and scripts, while Goaly records the full trajectory across model outputs, tool actions, test logs, rewards, artifacts, and policy versions. The trace below shows one real session from that run.
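The loop described above (act in a sandbox, record the trajectory, score it with a binary verifier) can be sketched as follows. The `Environment` protocol, `ToyRepoEnv`, and `rollout` are hypothetical illustrations and not Goaly's API; the toy environment keeps its state in memory instead of running Docker or real tests.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Environment(Protocol):
    """Hypothetical environment interface, for illustration only."""
    def reset(self) -> str: ...
    def step(self, action: str) -> tuple[str, bool]: ...  # (observation, done)
    def verify(self) -> bool: ...                          # do the hidden tests pass?


@dataclass
class ToyRepoEnv:
    """In-memory stand-in for a sandbox holding a repository and a test suite."""
    fixed: bool = False

    def reset(self) -> str:
        self.fixed = False
        return "issue: IPv4Address not parsed"

    def step(self, action: str) -> tuple[str, bool]:
        if action == "apply_patch":
            self.fixed = True
            return "patch applied", False
        if action == "submit":
            return "submitted", True
        return f"ran: {action}", False

    def verify(self) -> bool:
        return self.fixed


@dataclass
class TrajectoryLog:
    policy_version: int
    events: list = field(default_factory=list)
    reward: float = 0.0


def rollout(env: Environment, policy, policy_version: int, max_turns: int = 16):
    """Run one session, log every turn, then assign the binary verifier reward."""
    log = TrajectoryLog(policy_version)
    observation = env.reset()
    for _ in range(max_turns):
        action = policy(observation)
        next_observation, done = env.step(action)
        log.events.append({"observation": observation, "action": action})
        observation = next_observation
        if done:
            break
    log.reward = 1.0 if env.verify() else 0.0
    return log


# A scripted "policy" standing in for the model.
script = iter(["run_tests", "apply_patch", "run_tests", "submit"])
log = rollout(ToyRepoEnv(), lambda obs: next(script), policy_version=42)
print(log.reward, len(log.events))
```

Keeping `verify` separate from `step` mirrors the setup above, where the reward comes only from the final verification and not from anything the agent observes during the session.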

Agentic RL · coding-agent session

pydantic#9066 · IPv4Address not parsed

Session 7/32 · policy v42. Phases: explore → reproduce → diagnose → fix → verify → submit.
Figure 6. Real session from agentic RL training: a Qwen3.5-9B policy model solves pydantic#9066 by diagnosing an IPv4Address serialization bug and patching encode_default. The terminal event records final verifier reward=1.0 after 847 tests pass.

The environment server is designed to be generic and to support more than SWE agents: any team can bring the environment, tools, and reward logic that matter to them. We currently support:

  • Coding/terminal agents: repo checkout + terminal + test harness. Reward from hidden test suites. Initial validation on SWE-Gym tasks using live Docker sandboxes and real open-source repositories.
  • Tool-use agents: API endpoints, simulators, MCP tool servers. Reward from task-completion verifiers. Multi-step function-calling workflows.
  • Custom workflows: private tools, business rules, customer-defined sandboxes. Reward from outcome metrics. Custom environments can run inside the customer's own infrastructure boundary.

What's next

These experiments are just a preview of what we're building. At Goaly, our goal is to make frontier post-training — fast, stable, and agent-ready — accessible to every team serious about building AI, not just those who can afford to build the infrastructure from scratch.

We're still early. Agentic RL at scale is a challenging problem across the industry, and our results here are a first step. The work ahead is building infrastructure that handles the full complexity of real agent environments reliably and at scale.

We'll keep sharing high-level system results and lessons as the platform matures.