Beyond Async RL: Faster Post-Training for Reasoning Models and Agents

About Goaly

At Goaly, our mission is to make frontier AI affordable and frictionless for every business. Today, only a small number of tech giants can afford to build, adapt, and own the custom models their businesses actually need — especially when those models must stay inside their own environment. We believe that has to change.

The early results below are not the full picture of what we’re building, but they highlight an important prerequisite: making post-training dramatically more efficient without sacrificing model quality or training stability. If teams can iterate on their models faster, get more learning out of the same GPU budget, and better support domain-specific agentic workloads in their own business environments, proprietary model development becomes less constrained by the dollar cost and more driven by the actual product needs.

Motivation

Decoupling the trainer and sampler into separate GPU pools and letting them run asynchronously has become standard practice for large-scale RL post-training [1][2][3]. It overlaps work that used to be serialized, and on paper it should make RL much faster. But async execution alone does not make an RL system efficient, stable, or easy to scale. Getting load balancing right is non-trivial, and the off-policy effects introduced by async execution can destabilize training if left uncontrolled. Some open-source frameworks end up capping or defaulting off-policy steps at one to preserve convergence, which leaves significant throughput on the table.

At Goaly, we’ve been attacking these problems through a combination of system optimizations and algorithmic controls. The goal is to make RL training faster and cheaper without sacrificing model quality or training stability. By striking this balance, our stack delivers 2.8–4.3× speedups on smaller models (8B) and 1.8–2.5× speedups on 30–32B models compared to mainstream OSS RL solutions, with no regression in model quality.

[1] LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-Scale LLM Training (Wu et al., 2025).
[2] AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning (Fu et al., 2025).
[3] Forge: Scalable Agent RL Framework and Algorithm (MiniMax, 2026).

Beyond efficiency, we’ve invested heavily in supporting the latest agentic RL trends — workloads where training stability matters even more than in classic RLVR. This includes long-horizon tasks where small off-policy drifts compound across many turns, massively concurrent sessions with per-task crash recovery, custom stateful sandboxes for enterprise workflows, and schedulers that absorb heterogeneous environment latency instead of assuming every rollout looks like a text completion.

Headline Results

Step-time speedup

We benchmarked GRPO training across three model scales using Qwen3 [4] models, comparing against an OSS baseline. At 8B, step time drops from 77s to 27s, which is 2.8× faster than the OSS async baseline (4.3× vs. OSS sync). At 30B-A3B, we see 2.5× faster steps. At 32B, the speedup is 1.8×. Across all three settings, model quality matches the baseline: reward curves show no regression, and AIME [5] evals show parity.

[4] Qwen3 Technical Report (Yang et al., 2025).
[5] AIME 2024 (MAA, 2024).

Cost reductions

At a fixed GPU count, step-time speedups translate directly into lower GPU time per training step: a speedup of s cuts per-step GPU time by 1 − 1/s. The 2.8× 8B speedup implies roughly 64% lower per-step compute cost, the 2.5× 30B speedup implies roughly 60% lower cost, and the 1.8× 32B speedup implies roughly 44% lower cost. For the 32B run on 32 H100s, that is over $4,000 saved per 1,000 training steps at typical cloud rates.
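The percentages above follow from simple arithmetic. As a quick check, here is a minimal sketch that assumes only what the text states: the GPU count is fixed, so per-step cost is proportional to wall-clock step time. The function name `cost_reduction` is ours, chosen for illustration.

```python
def cost_reduction(speedup: float) -> float:
    """Fraction of per-step GPU time saved at a fixed GPU count."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    # New step time is old_time / speedup, so the saved fraction is 1 - 1/speedup.
    return 1.0 - 1.0 / speedup


for label, speedup in [("8B", 2.8), ("30B-A3B", 2.5), ("32B", 1.8)]:
    saved = cost_reduction(speedup)
    print(f"{label}: {speedup}x speedup -> {saved:.0%} lower per-step GPU time")
```

Note the diminishing returns: going from 1.8× to 2.8× adds about 20 percentage points of savings, not 100.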

Figure 1. RL step-time speedup across large-scale RL workloads (warm-state mean step time; matched data and hardware). All rows use matched hardware and the same recipe, data, and model weights. Dashed line marks baseline parity (1.0×).

Figure 2. RL cost reductions: per-step compute cost (GPU-time) reductions implied by the measured step-time speedups. This assumes the number of GPUs stays the same, so GPU time scales with wall-clock step time.

Why async RL alone isn't enough

Async RL improves overlap between rollout generation and policy training, but it introduces new system and algorithmic problems that aren't free to solve.

1. Off-policy drift. The trainer keeps optimizing the policy while the sampler may still be generating long rollouts with old weights. By the time a long rollout finishes, the policy has already moved several steps ahead, so the trajectory used to update the current policy was generated by an older one. Push the gap too far and training becomes unstable or even crashes, which is why multiple RL frameworks cap or default the maximum off-policy step count at 1.

2. Training distribution skew. Shorter rollouts finish faster than longer ones, so they're more likely to be ready when the trainer pulls a batch. Left uncorrected, the effective training distribution drifts toward short samples: the model trains on a different data mix than the one you intended, which hurts its performance on longer sequences.

3. Load balancing. Async execution doesn't automatically keep both sides busy. Generation latency varies with prompt length, response length, environment latency, verifier latency, and tool-call behavior. Without careful scheduling and tuning, one side of the system still waits on the other.
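To make the off-policy gap from item 1 concrete, here is a minimal sketch of how a staleness cap might be expressed. It assumes each rollout is tagged with the weight version that generated it. The `Rollout` type, its fields, and the `admit` helper are hypothetical names for illustration, not any framework's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    prompt_id: int
    policy_version: int  # hypothetical tag: weight version that generated the rollout


def staleness(rollout: Rollout, trainer_version: int) -> int:
    """How many optimizer steps the trainer is ahead of the rollout's policy."""
    return trainer_version - rollout.policy_version


def admit(rollouts, trainer_version, max_off_policy_steps):
    """Keep only rollouts whose staleness is within the configured cap."""
    return [
        r for r in rollouts
        if staleness(r, trainer_version) <= max_off_policy_steps
    ]


rollouts = [Rollout(0, 10), Rollout(1, 9), Rollout(2, 4)]
kept_strict = admit(rollouts, trainer_version=10, max_off_policy_steps=1)
kept_loose = admit(rollouts, trainer_version=10, max_off_policy_steps=8)
print(len(kept_strict), len(kept_loose))
```

With a cap of 1, the long rollout generated six versions ago is discarded and its generation work is wasted; with a cap of 8 it is kept, which is only safe if the off-policy effects are controlled.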

How we address these

Mitigating off-policy effects

To mitigate off-policy drift, we implemented several techniques so the trainer's and sampler's logits are bit-wise matched: given the same weights, the two systems produce identical logits down to the last bit, removing any numerical drift between the inference and training stacks.
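The text does not say how the bit-wise match is achieved, so the sketch below only illustrates why "numerically close" is a weaker property than "bit-wise identical", and how one might test for the latter. It uses Python's binary64 floats and the standard `struct` module for simplicity; real logits would typically be in a lower-precision format. The helper names are ours.

```python
import math
import struct


def float_bits(x: float) -> bytes:
    """Raw IEEE-754 binary64 bytes of x."""
    return struct.pack("<d", x)


def bitwise_equal(xs, ys) -> bool:
    """True only if every pair of values has an identical bit pattern."""
    return len(xs) == len(ys) and all(
        float_bits(a) == float_bits(b) for a, b in zip(xs, ys)
    )


# The same three terms summed in two different orders, as two different
# kernels might do when they reduce over the same values.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

close = math.isclose(left_to_right, right_to_left, rel_tol=1e-12)
exact = bitwise_equal([left_to_right], [right_to_left])
print(close, exact)
```

The two sums agree to within a tight tolerance but differ in the last bit, which is the kind of mismatch a tolerance-based check would miss.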

Preserving the training distribution

Since short rollouts finish first, they get over-represented in the trainer's batches, skewing the overall training distribution toward short samples. To avoid this, we apply corrections that preserve the effective distribution of short and long samples, even as the off-policy window grows.
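The text does not describe which corrections are applied. Purely as an illustration, one simple possibility is to reweight samples by length bucket so that the weighted mix of a batch matches a target mix. The bucket labels, the target mix, and the `bucket_weights` helper below are all assumptions made for this sketch.

```python
from collections import Counter


def bucket_weights(observed_buckets, target_mix):
    """Per-bucket weights that make the weighted batch mix equal target_mix."""
    counts = Counter(observed_buckets)
    total = len(observed_buckets)
    weights = {}
    for bucket, target_fraction in target_mix.items():
        observed_fraction = counts[bucket] / total
        if observed_fraction == 0:
            raise ValueError(f"bucket {bucket!r} is missing from the batch")
        # Over-represented buckets get weight < 1, under-represented ones > 1.
        weights[bucket] = target_fraction / observed_fraction
    return weights


# A batch skewed toward short rollouts because they finished first.
batch = ["short"] * 6 + ["long"] * 2
weights = bucket_weights(batch, {"short": 0.5, "long": 0.5})

short_mass = sum(weights[b] for b in batch if b == "short")
long_mass = sum(weights[b] for b in batch if b == "long")
print(weights, short_mass, long_mass)
```

After reweighting, short and long samples carry equal total weight even though short ones make up three quarters of the batch.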

In our experiments, combined with importance sampling, these two changes let us run RL stably at up to 32 steps of off-policy while keeping on-par reward trajectories with synchronous RL. The benchmarks below use 8 steps, already enough to deliver up to 2.8× speedup over OSS async, with headroom to spare. (32 is just the highest we've tested; we haven't pushed to find the upper bound.)
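The text mentions importance sampling without naming a variant. A common form is a per-token probability ratio between the current policy and the policy that generated the sample, truncated from above to bound variance. The sketch below assumes that form. The truncation itself and the cap of 2.0 are illustrative choices, not values from this post.

```python
import math


def truncated_is_weights(logp_current, logp_behavior, cap=2.0):
    """Per-token ratio pi_current / pi_behavior, clipped from above at cap."""
    if len(logp_current) != len(logp_behavior):
        raise ValueError("log-probability sequences must have equal length")
    return [
        min(math.exp(cur - beh), cap)
        for cur, beh in zip(logp_current, logp_behavior)
    ]


# Log-probabilities of the sampled tokens under the current policy (trainer)
# and under the older policy that generated them (sampler).
weights = truncated_is_weights(
    logp_current=[-1.0, -0.5, -2.0],
    logp_behavior=[-1.0, -1.5, -1.0],
)
print(weights)
```

The first token is unchanged between policies (ratio 1), the second has become more likely and is clipped at the cap, and the third has become less likely and is down-weighted.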

Maximizing throughput

With off-policy effects under control, we're able to push hardware utilization as much as possible. Beyond standard async RL, we fully decouple trainer and sampler so each component can run at maximum throughput. There's no strict synchronization between trainer and sampler; they communicate through a message queue.

We enable continuous batching on the sampler to keep generation at maximum concurrency at all times.
We implemented eager weight transfer from trainer to sampler: the trainer pushes updated weights immediately after each optimizer step, and the sampler applies them as soon as they arrive, even mid-sequence, so generation always uses the most recent policy available.
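The mid-sequence weight swap can be illustrated with a small deterministic simulation. This is a toy sketch, not the actual transport: a `deque` stands in for the trainer-to-sampler channel, weights are represented by a version number only, and the `Sampler` class and its `arrivals` argument are hypothetical.

```python
from collections import deque


class Sampler:
    """Toy sampler that checks for new weights before every decoded token."""

    def __init__(self):
        self.weight_inbox = deque()  # stands in for the trainer-to-sampler channel
        self.policy_version = 0

    def apply_pending_weights(self):
        # Eager application: move straight to the newest version that has arrived.
        while self.weight_inbox:
            self.policy_version = self.weight_inbox.popleft()

    def generate(self, num_tokens, arrivals):
        """arrivals maps a token index to the weight version landing just before it."""
        token_versions = []
        for t in range(num_tokens):
            if t in arrivals:
                self.weight_inbox.append(arrivals[t])
            self.apply_pending_weights()
            # Record which policy version produced this token.
            token_versions.append(self.policy_version)
        return token_versions


sampler = Sampler()
versions = sampler.generate(num_tokens=6, arrivals={2: 1, 4: 2})
print(versions)  # [0, 0, 1, 1, 2, 2]
```

A single sequence ends up with tokens from three policy versions, which is why per-token version tracking matters for any downstream off-policy correction.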

We achieve a balanced load through two levers. First, continuous batching on the sampler absorbs rollout-length variance automatically, removing the largest source of idle time without any tuning. Second, we profile each side's throughput and set the GPU allocation accordingly — the right split is task-distribution-dependent, but once profiled for a given task family it stays stable across runs.
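The second lever, choosing the GPU split from profiled throughput, can be sketched as a small search. This assumes throughput scales linearly with GPU count on each side, which is a simplification. Real allocations are also constrained by parallelism group sizes. The function name and the rates are illustrative.

```python
def split_gpus(total_gpus, sampler_rate_per_gpu, trainer_rate_per_gpu):
    """Return (sampler_gpus, trainer_gpus) that best balance the two sides.

    Rates are in samples per second per GPU: produced by the sampler,
    consumed by the trainer. Assumes linear scaling with GPU count.
    """
    if total_gpus < 2:
        raise ValueError("need at least one GPU for each side")
    best = None
    for sampler_gpus in range(1, total_gpus):
        trainer_gpus = total_gpus - sampler_gpus
        produced = sampler_gpus * sampler_rate_per_gpu
        consumed = trainer_gpus * trainer_rate_per_gpu
        gap = abs(produced - consumed)  # idle capacity on one side or the other
        if best is None or gap < best[0]:
            best = (gap, sampler_gpus, trainer_gpus)
    return best[1], best[2]


# Example profile: each trainer GPU consumes samples three times faster
# than each sampler GPU produces them.
sampler_gpus, trainer_gpus = split_gpus(
    8, sampler_rate_per_gpu=1.0, trainer_rate_per_gpu=3.0
)
print(sampler_gpus, trainer_gpus)
```

With these example rates the balanced split gives most of the GPUs to generation, and a different task family with shorter responses would shift it.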

With all these improvements, our infra achieves consistent step-time speedups across a wide range of models compared to OSS async/sync baselines, without regression in model performance.

Detailed Benchmark Setup and Results

To measure the improvements precisely, we compared Goaly against a SOTA open-source RL framework under tightly matched conditions: same base model, same training data (all jobs use the DAPO math dataset [6]), same GRPO recipe, same hyperparameters, same batch size, same seed, and the same trainer/sampler GPU allocation (NVIDIA H200).

[6] DAPO: An Open-Source LLM Reinforcement Learning System at Scale (Yu et al., 2025).

Step time reduction

Figure 3. Per-step time traces (perf/step_time) at three scales: 8B on 8 GPUs (batch size 128, 32k context), 30B-A3B on 32 GPUs (batch size 256, 32k context), and 32B on 32 GPUs (batch size 512, 32k context). Series: OSS (sync), OSS (async), Goaly. Faint lines are raw TensorBoard values; bold lines are smoothed trends.

Reward quality

Speed is only meaningful if the model learns the same thing. Across all three RL scales, the reward trajectories closely track the baseline, with the same convergence shape and same final reward band in these matched traces. We are not claiming statistical equivalence from one measured run per configuration; the claim is that these efficiency wins did not produce a visible learning regression.

Figure 4. Reward trajectories (rollout/math_reward/mean) at three scales on math reasoning tasks: 8B on 8 GPUs, 30B-A3B on 32 GPUs, and 32B on 32 GPUs. Same GRPO recipe, same data, same hardware budget. Legend: OSS (sync), OSS (async), Goaly. OSS sync curves are absent: those runs were terminated early due to prohibitively slow step times.

For the 32B run, we also evaluated the trained model on AIME-2024. Pass@1 at iteration 180 is 0.4792 (Goaly) vs. 0.4625 (baseline). We treat this as quality parity rather than an eval lift; the important result is that the efficiency win did not show a quality regression in this run.

Figure 5. AIME-2024 pass rate at iteration 180 for the 32B Qwen3 RL run, OSS (async) vs. Goaly. Same iteration and eval harness.

From reasoning RL to agentic environments

Reasoning RL (math, STEM, and other RLVR tasks) is the cleanest domain in which to measure trainer-sampler efficiency, and the system designs for it are starting to converge. Agentic RL is the opposite. It's still a fast-evolving domain, and it demands efficient, reliable and scalable infrastructure, one that orchestrates far more moving parts than non-agentic RL: live environments, sandboxes, verifiers, and long-running agent sessions.

Concretely, this means handling:

Long-horizon tasks for terminal and coding agents
Massively concurrent sessions with task-crash recovery
Custom environments and sandboxes for enterprise workflows, with stateful execution
Training that interacts with live environments, not just static prompts and responses
Schedulers that absorb heterogeneous environment latency instead of assuming every rollout looks like a text completion

This is where Goaly is focused: building the next layer of agentic RL infrastructure on top of the optimized RL core. The agent system is designed to plug seamlessly into the RL loop or into eval pipelines through a simple API, and to support customized sandboxes for different environment types and workflows.

Initial coding-agent validation

As an initial validation, we ran a small-scale coding-agent RL experiment on 8 GPUs using SWE-Gym [7] data and a Qwen3.5-9B [8] base model. Each rollout is a live coding session inside a Docker sandbox. The agent can generate bash commands, inspect the repository, edit files, run tests, and submit a final patch. The environment returns a binary verifier-based reward (pass = 1.0, fail = 0) computed from the verification test cases and scripts, while Goaly records the full trajectory across model outputs, tool actions, test logs, rewards, artifacts, and policy versions. The trace below shows one real session from that run.

[7] SWE-Gym: An Open Environment for Training Software Engineering Agents and Verifiers (Pan et al., 2024).
[8] Qwen3.5: Accelerating Productivity with Native Multimodal Agents (Qwen Team, 2026).
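The binary reward and the recorded trajectory described above can be pictured with a small data model. This is a sketch under assumptions: the `Step` and `Trajectory` types and their fields are hypothetical, and "pass" is taken to mean that every verification test passes, which the text does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str          # e.g. a bash command emitted by the agent
    observation: str     # what the sandbox returned
    policy_version: int  # weight version that produced the action


@dataclass
class Trajectory:
    task_id: str
    steps: list = field(default_factory=list)
    reward: float = 0.0


def verifier_reward(tests_passed: int, tests_total: int) -> float:
    """Binary reward: 1.0 only if every verification test passes (assumed rule)."""
    return 1.0 if tests_total > 0 and tests_passed == tests_total else 0.0


traj = Trajectory(task_id="example-task")
traj.steps.append(Step("pytest -q", "2 failed", policy_version=42))
traj.steps.append(Step("pytest -q", "all passed", policy_version=42))
traj.reward = verifier_reward(tests_passed=10, tests_total=10)
print(traj.reward, len(traj.steps))
```

Keeping the policy version on every step, rather than once per session, allows for sessions that span a weight update.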

Figure 6. Agentic RL coding-agent session on pydantic#9066 (IPv4Address not parsed); session 7/32, policy v42; stages: explore, reproduce, diagnose, fix, verify, submit. Real session from agentic RL training: a Qwen3.5-9B policy model solves pydantic#9066 by diagnosing an IPv4Address serialization bug and patching encode_default. The terminal event records final verifier reward=1.0 after 847 tests pass.

The environment server is designed to be generic, supporting more than SWE agents. Any team can bring the environment, tools, and reward logic that matter to them. We currently support:

Coding/terminal agents: repo checkout + terminal + test harness. Reward from hidden test suites. Initial validation on SWE-Gym tasks using live Docker sandboxes and real open-source repositories.
Tool-use agents: API endpoints, simulators, MCP tool servers. Reward from task-completion verifiers. Multi-step function-calling workflows.
Custom workflows: private tools, business rules, customer-defined sandboxes. Reward from outcome metrics. Custom environments can run inside the customer's own infrastructure boundary.
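To give a sense of what "bring your own environment" could look like, here is a minimal sketch of a generic environment interface and a toy implementation. The `Environment` protocol, its method names, and `run_episode` are hypothetical and do not describe Goaly's actual API.

```python
from typing import Protocol


class Environment(Protocol):
    """Hypothetical minimal contract an environment would satisfy."""

    def reset(self, task_id: str) -> str: ...
    def step(self, action: str) -> tuple[str, bool]: ...
    def reward(self) -> float: ...


class EchoEnvironment:
    """Toy environment: the task is solved when the agent sends 'submit'."""

    def reset(self, task_id: str) -> str:
        self.solved = False
        return f"task {task_id}: send 'submit' to finish"

    def step(self, action: str) -> tuple[str, bool]:
        if action == "submit":
            self.solved = True
            return "submitted", True
        return f"ran: {action}", False

    def reward(self) -> float:
        return 1.0 if self.solved else 0.0


def run_episode(env: Environment, task_id: str, actions):
    """Drive one session: reset, apply actions until done, then read the reward."""
    observations = [env.reset(task_id)]
    for action in actions:
        observation, done = env.step(action)
        observations.append(observation)
        if done:
            break
    return observations, env.reward()


observations, reward = run_episode(EchoEnvironment(), "demo", ["ls", "submit", "ls"])
print(observations, reward)
```

Because the protocol is structural, a coding sandbox, a tool-use simulator, or a customer-defined workflow could each implement the same three methods without sharing a base class.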

What's next

These experiments are just a preview of what we're building. At Goaly, our goal is to make frontier post-training — fast, stable, and agent-ready — accessible to every team serious about building AI, not just those who can afford to build the infrastructure from scratch.

We're still early. Agentic RL at scale is a challenging problem across the industry, and our results here are a first step. The work ahead is building infrastructure that handles the full complexity of real agent environments reliably and at scale.

We'll keep sharing high-level system results and lessons as the platform matures.

If you want to build with us, we're hiring engineers & AI researchers.
If you're interested in trying our solution or want to prototype with us, join the waitlist.