At Goaly, our mission is to make frontier AI affordable and frictionless for every business. Today, only a small number of tech giants can afford to build, adapt, and own the custom models their businesses actually need — especially when those models must stay inside their own environment. We believe that has to change.
The early results below are not the full picture of what we’re building, but they highlight an important prerequisite: making post-training dramatically more efficient without sacrificing model quality or training stability. If teams can iterate on their models faster, get more learning out of the same GPU budget, and better support domain-specific agentic workloads in their own business environments, proprietary model development becomes less constrained by dollar cost and more driven by actual product needs.
Motivation
Decoupling the trainer and sampler into separate GPU pools and letting them run asynchronously has become standard practice for large-scale RL post-training [1][2][3]. It overlaps work that used to be serialized, and on paper it should make RL much faster. But async execution alone does not make an RL system efficient, stable, or easy to scale. Getting load balancing right is non-trivial, and the off-policy effects introduced by async execution can destabilize training if left uncontrolled. Some open-source frameworks end up capping or defaulting off-policy steps at one to preserve convergence, which leaves significant throughput on the table. At Goaly, we’ve been attacking these problems through a combination of system optimizations and algorithmic controls. The goal is to make RL training faster and cheaper without sacrificing model quality or training stability. By striking this balance, our stack delivers 2.8–4.3× speedups on smaller models (8B) and 1.8–2.5× speedups on 30–32B models compared to mainstream OSS RL solutions, with no regression in model quality.
[1] LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-Scale LLM Training (Wu et al., 2025).
[2] AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning (Fu et al., 2025).
[3] Forge: Scalable Agent RL Framework and Algorithm (MiniMax, 2026).
Beyond efficiency, we’ve invested heavily in supporting the latest agentic RL trends — workloads where training stability matters even more than in classic RLVR. This includes long-horizon tasks where small off-policy drifts compound across many turns, massively concurrent sessions with per-task crash recovery, custom stateful sandboxes for enterprise workflows, and schedulers that absorb heterogeneous environment latency instead of assuming every rollout looks like a text completion.
Headline Results
Step-time speedup
We benchmarked GRPO training across three model scales using Qwen3 [4] models, comparing against an OSS baseline. At 8B, step time drops from 77s to 27s, which is 2.8× faster than the OSS async baseline (4.3× vs. OSS sync). At 30B-A3B, steps are 2.5× faster. At 32B, the speedup is 1.8×. Across all three settings, model quality matches the baseline: reward curves show no regression, and the AIME [5] eval we ran on the 32B model shows parity.
[4] Qwen3 Technical Report (Yang et al., 2025).
[5] AIME 2024 (MAA, 2024).
Cost reductions
At a fixed GPU count, step-time speedups translate directly into lower GPU time per training step. The 2.8× 8B speedup implies roughly 64% lower per-step compute cost, the 2.5× 30B speedup implies roughly 60% lower cost, and the 1.8× 32B speedup implies roughly 44% lower cost. For the 32B run on 32 H100s, that is over $4,000 saved per 1,000 training steps at typical cloud rates.
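The percentages above follow from one line of arithmetic: with the GPU count held fixed, per-step cost scales with step time, so an s× speedup removes 1 − 1/s of the cost. A minimal sketch of that calculation (the function name is ours, chosen for illustration):

```python
def cost_reduction(speedup: float) -> float:
    """Fraction of per-step GPU cost removed by a step-time speedup,
    assuming the GPU count and hourly rate stay fixed."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Speedups quoted in this post, keyed by model scale.
for scale, speedup in {"8B": 2.8, "30B-A3B": 2.5, "32B": 1.8}.items():
    print(f"{scale}: {speedup}x faster -> {cost_reduction(speedup):.0%} lower per-step cost")
```

The same relation explains why the savings flatten out: going from 1.8× to 2.8× adds about 20 points of savings, while each further doubling adds less.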
RL step-time speedup
RL Cost Reductions
Why async RL alone isn't enough
Async RL improves overlap between rollout generation and policy training, but it introduces new system and algorithmic problems that aren't free to solve.
1. Off-policy drift. The trainer keeps optimizing the policy while the sampler may still be generating long rollouts with old weights. By the time a long rollout finishes, the policy has already moved several steps ahead, so the trajectory used to update the current policy was generated by a different, older policy. Push the gap too far and training becomes unstable or even crashes, which is why multiple RL frameworks cap the maximum number of off-policy steps at 1, or default to 1.
2. Training distribution skew. Shorter rollouts finish faster than longer ones, so they're more likely to be ready when the trainer pulls a batch. Left uncorrected, the effective training distribution drifts toward short samples: the model trains on a different data mix than the one you intended, which hurts performance on longer sequences.
3. Load balancing. Async execution doesn't automatically keep both sides busy. Generation latency varies with prompt length, response length, environment latency, verifier latency, and tool-call behavior. Without careful scheduling and tuning, one side of the system still waits on the other.
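The distribution skew in item 2 is easy to reproduce with a toy simulation. The sketch below assumes all rollouts start together and finish in order of length (a simplification), and uses made-up response lengths; it is an illustration of the effect, not a model of any real scheduler.

```python
import random


def early_batch_mean_length(lengths, batch_size):
    """Mean length of the first `batch_size` rollouts to finish, assuming
    all rollouts start together and finish in order of length."""
    finished_first = sorted(lengths)[:batch_size]
    return sum(finished_first) / batch_size


rng = random.Random(0)
# Hypothetical response lengths (tokens) for 256 in-flight rollouts.
lengths = [rng.randint(200, 8000) for _ in range(256)]

population_mean = sum(lengths) / len(lengths)
batch_mean = early_batch_mean_length(lengths, batch_size=64)
print(f"mean length, all rollouts:  {population_mean:.0f}")
print(f"mean length, first 64 done: {batch_mean:.0f}")
```

A trainer that always takes whatever is ready first sees the second number, not the first, which is the skew that has to be corrected.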
How we address these
Mitigating off-policy effects
To mitigate off-policy drift, we implemented several techniques so the trainer's and sampler's logits are bit-wise matched: given the same weights, the two systems produce identical logits down to the last bit, removing any numerical drift between the inference and training stacks.
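Bit-wise equality is a stricter test than the tolerance checks usually applied to floating-point outputs. The sketch below shows one way such a check could be written with only the standard library; the function names are hypothetical and this is not a description of our internal tooling.

```python
import struct


def float32_bits(x: float) -> int:
    """Raw IEEE-754 bit pattern of x after rounding to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def first_bitwise_mismatch(trainer_logits, sampler_logits):
    """Return the index of the first logit whose bits differ, or None."""
    if len(trainer_logits) != len(sampler_logits):
        raise ValueError("logit vectors must have the same length")
    for i, (a, b) in enumerate(zip(trainer_logits, sampler_logits)):
        if float32_bits(a) != float32_bits(b):
            return i
    return None


trainer = [0.1, -2.5, 3.75]
sampler = [0.1, -2.5, 3.7500005]  # close enough to pass a tolerance check
print(first_bitwise_mismatch(trainer, trainer))  # identical inputs
print(first_bitwise_mismatch(trainer, sampler))  # differs in the last element
```

Comparing bit patterns rather than values also catches cases such as 0.0 versus -0.0, which compare equal numerically but are different outputs.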
Preserving the training distribution
Since short rollouts finish first, they get over-represented in the trainer's batches, skewing the overall training distribution toward short samples. To avoid this, we apply corrections that preserve the effective distribution of short and long samples, even as the off-policy window grows.
In our experiments, combined with importance sampling, these two changes let us run RL stably at up to 32 off-policy steps while keeping reward trajectories on par with synchronous RL. The benchmarks below use 8 steps, which is already enough to deliver up to 2.8× speedup over OSS async, with headroom to spare. (32 is just the highest we've tested; we haven't pushed to find the upper bound.)
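For readers less familiar with the mechanics, here is a generic sketch of the two ingredients mentioned above: an importance weight that corrects for the gap between the sampling policy and the current policy, and an off-policy window measured in policy versions. The truncation cap and the drop-if-too-stale rule are assumptions chosen for illustration, not a description of our exact recipe.

```python
import math


def truncated_importance_weight(logp_current, logp_behavior, cap=2.0):
    """Ratio pi_current / pi_behavior for a sampled sequence, truncated at
    `cap` (an assumed value) to bound the variance of the update."""
    return min(math.exp(logp_current - logp_behavior), cap)


def within_window(rollout_version, trainer_version, max_off_policy_steps=8):
    """True if the rollout was sampled at most `max_off_policy_steps`
    policy versions before the trainer's current version."""
    return trainer_version - rollout_version <= max_off_policy_steps


# A rollout sampled at version 95 is 5 steps stale when the trainer is at 100.
print(within_window(95, 100))
# The current policy assigns this sequence a higher log-probability than the
# sampling policy did, so the raw ratio exceeds the cap and is truncated.
print(truncated_importance_weight(logp_current=-9.0, logp_behavior=-10.0))
```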
Maximizing throughput
With off-policy effects under control, we're able to push hardware utilization as much as possible. Beyond standard async RL, we fully decouple trainer and sampler so each component can run at maximum throughput. There's no strict synchronization between trainer and sampler; they communicate through a message queue.
- We enable continuous batching on the sampler to keep generation at maximum concurrency all the time.
- We implemented eager weight transfer from trainer to sampler: the trainer pushes updated weights immediately after each optimizer step, and the sampler applies them as soon as they arrive, even mid-sequence, so generation always uses the most recent policy available.
- We achieve a balanced load through two levers. First, continuous batching on the sampler absorbs rollout-length variance automatically, removing the largest source of idle time without any tuning. Second, we profile each side's throughput and set the GPU allocation accordingly — the right split is task-distribution-dependent, but once profiled for a given task family it stays stable across runs.
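The second lever in the last bullet, choosing the GPU split from profiled throughput, can be sketched as a small search. The version below assumes throughput scales linearly with GPU count and ignores placement constraints such as tensor-parallel group sizes; the function name and the rates are hypothetical.

```python
def split_gpus(total_gpus, sampler_rate_per_gpu, trainer_rate_per_gpu):
    """Pick (sampler_gpus, trainer_gpus) that maximizes the slower side's
    throughput. Rates are profiled samples/sec per GPU, and scaling is
    assumed to be linear in GPU count."""
    if total_gpus < 2:
        raise ValueError("need at least one GPU for each side")
    best = None
    for sampler_gpus in range(1, total_gpus):
        trainer_gpus = total_gpus - sampler_gpus
        # End-to-end throughput is limited by whichever side is slower.
        throughput = min(sampler_gpus * sampler_rate_per_gpu,
                         trainer_gpus * trainer_rate_per_gpu)
        if best is None or throughput > best[0]:
            best = (throughput, sampler_gpus, trainer_gpus)
    return best[1], best[2]


# If one trainer GPU consumes samples 3x faster than one sampler GPU
# produces them, most of the pool should go to sampling.
print(split_gpus(8, sampler_rate_per_gpu=1.0, trainer_rate_per_gpu=3.0))
```

Because the profiled rates depend on response length and environment latency, the split needs to be re-derived when the task family changes.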
With all these improvements, our infra achieves consistent step-time speedups across a wide range of models compared to OSS async/sync baselines, without regression in model performance.
Detailed Benchmark Setup and Results
To measure the improvements precisely, we compared Goaly against a SOTA open-source RL framework under tightly matched conditions: same base model, same training data (all jobs use the DAPO math dataset [6]), same GRPO recipe, same hyperparameters, same batch size, same seed, and the same trainer/sampler GPU allocation (NVIDIA H200).
[6] DAPO: An Open-Source LLM Reinforcement Learning System at Scale (Yu et al., 2025).
Step time reduction
Per-step time traces across scales
Reward quality
Speed is only meaningful if the model learns the same thing. Across all three RL scales, the reward trajectories closely track the baseline, with the same convergence shape and same final reward band in these matched traces. We are not claiming statistical equivalence from one measured run per configuration; the claim is that these efficiency wins did not produce a visible learning regression.
Reward trajectories across scales
For the 32B run, we also evaluated the trained model on AIME-2024. Pass@1 at iteration 180 is 0.4792 (Goaly) vs. 0.4625 (baseline). We treat this as quality parity rather than an eval lift; the important result is that the efficiency win did not show a quality regression in this run.
32B Qwen3 RL · Eval-AIME @ iter 180
From reasoning RL to agentic environments
Reasoning RL (math, STEM, and other RLVR tasks) is the cleanest domain in which to measure trainer-sampler efficiency, and the system designs for it are starting to converge. Agentic RL is the opposite. It's still a fast-evolving domain, and it demands efficient, reliable and scalable infrastructure, one that orchestrates far more moving parts than non-agentic RL: live environments, sandboxes, verifiers, and long-running agent sessions.
Concretely, this means handling:
- Long-horizon tasks for terminal and coding agents
- Massively concurrent sessions with task-crash recovery
- Custom environments and sandboxes for enterprise workflows, with stateful execution
- Training that interacts with live environments, not just static prompts and responses
- Schedulers that absorb heterogeneous environment latency instead of assuming every rollout looks like a text completion
This is where Goaly is focused: building the next layer of agentic RL infrastructure on top of the optimized RL core. The agent system is designed to plug seamlessly into the RL loop or into eval pipelines through a simple API, and to support customized sandboxes for different environment types and workflows.
Initial coding-agent validation
As an initial validation, we ran a small-scale coding-agent RL experiment on 8 GPUs using SWE-Gym [7] data and a Qwen3.5-9B [8] base model. Each rollout is a live coding session inside a Docker sandbox. The agent can generate bash commands, inspect the repository, edit files, run tests, and submit a final patch. The environment returns a binary verifier-based reward (pass = 1.0, fail = 0) computed by the verification test cases and scripts, while Goaly records the full trajectory across model outputs, tool actions, test logs, rewards, artifacts, and policy versions. The trace below shows one real session from that run.
[7] SWE-Gym: An Open Environment for Training Software Engineering Agents and Verifiers (Pan et al., 2024).
[8] Qwen3.5: Accelerating Productivity with Native Multimodal Agents (Qwen Team, 2026).
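To make the recorded data concrete, here is a minimal sketch of what a trajectory record and a binary verifier reward could look like. The class and field names are hypothetical, and treating "pass" as "every verification test passed" is our assumption for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    policy_version: int  # weights version that produced this action
    action: str          # e.g. a bash command emitted by the agent
    observation: str     # tool output or test log returned by the sandbox


@dataclass
class Trajectory:
    task_id: str
    steps: list[Step] = field(default_factory=list)
    reward: float = 0.0


def verifier_reward(test_results: dict[str, bool]) -> float:
    """Binary reward: 1.0 only if at least one test ran and all passed."""
    return 1.0 if test_results and all(test_results.values()) else 0.0


# Hypothetical session: the agent runs the tests, then the verifier scores it.
traj = Trajectory(task_id="example-task")
traj.steps.append(Step(policy_version=12, action="pytest -q", observation="2 passed"))
traj.reward = verifier_reward({"test_parse": True, "test_roundtrip": True})
print(traj.reward)
```

Storing the policy version on every step, rather than once per trajectory, matters when weights can change mid-sequence.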
Agentic RL · coding-agent session
pydantic#9066 · IPv4Address not parsed
The environment server is designed to be generic and to support more than just SWE agents. Any team can bring the environment, tools, and reward logic that matter to them. We currently support:
- Coding/terminal agents: repo checkout + terminal + test harness. Reward from hidden test suites. Initial validation on SWE-Gym tasks using live Docker sandboxes and real open-source repositories.
- Tool-use agents: API endpoints, simulators, MCP tool servers. Reward from task-completion verifiers. Multi-step function-calling workflows.
- Custom workflows: private tools, business rules, customer-defined sandboxes. Reward from outcome metrics. Custom environments can run inside the customer's own infrastructure boundary.
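One way to picture "bring your own environment" is as a small contract between the environment and the RL loop. The sketch below is a hypothetical interface written for illustration only; it is not the Goaly API, and the toy environment stands in for a real sandbox or tool server.

```python
from typing import Callable, Protocol


class Environment(Protocol):
    """Hypothetical contract an environment would expose to the RL loop."""

    def reset(self, task: dict) -> str: ...               # initial observation
    def step(self, action: str) -> tuple[str, bool]: ...  # (observation, done)
    def reward(self) -> float: ...                        # final outcome score


class ToyToolEnv:
    """Toy tool-use task: succeed by calling the tool named in the task."""

    def reset(self, task: dict) -> str:
        self.expected = task["expected_tool"]
        self.solved = False
        return f"Available tools: {', '.join(task['tools'])}"

    def step(self, action: str) -> tuple[str, bool]:
        self.solved = action == self.expected
        return ("ok" if self.solved else "wrong tool"), self.solved

    def reward(self) -> float:
        return 1.0 if self.solved else 0.0


def run_episode(env: Environment, policy: Callable[[str], str],
                task: dict, max_turns: int = 4) -> float:
    """Drive one rollout: feed observations to the policy until done."""
    observation = env.reset(task)
    for _ in range(max_turns):
        observation, done = env.step(policy(observation))
        if done:
            break
    return env.reward()


task = {"tools": ["search", "calculator"], "expected_tool": "calculator"}
print(run_episode(ToyToolEnv(), lambda obs: "calculator", task))
```

Keeping the reward behind the same interface as the tools is what lets coding, tool-use, and custom workflows share one rollout loop.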
What's next
These experiments are just a preview of what we're building. At Goaly, our goal is to make frontier post-training — fast, stable, and agent-ready — accessible to every team serious about building AI, not just those who can afford to build the infrastructure from scratch.
We're still early. Agentic RL at scale is a challenging problem across the industry, and our results here are a first step. The work ahead is building infrastructure that handles the full complexity of real agent environments reliably and at scale.
We'll keep sharing high-level system results and lessons as the platform matures.
- If you want to build with us, we're hiring engineers & AI researchers.
- If you're interested in trying our solution or want to prototype with us, join the waitlist.