In this post we’ll discuss how we used GRPO to surpass R1, o1, and o3-mini, and come within a couple percentage points of Sonnet 3.7 on a reasoning-heavy game called “Temporal Clue”, while being over 100x cheaper to run at inference time. We’ll include specific lessons learned about task design and the hyperparameters we’ve found to work well. And finally, we share the training recipe we used to achieve these results, built on top of torchtune.
Background
Since OpenAI launched its powerful new o-series of reasoning models last year, we’ve seen rapid progress in Large Language Models (LLMs) trained with Reinforcement Learning (RL). Leading organizations like Google DeepMind, Alibaba, DeepSeek, and Anthropic quickly followed suit and have trained their own advanced models to reason with long “chains of thought” (CoT), taught with reinforcement learning on verifiable problems. Many previously challenging benchmarks—in areas like mathematics and coding—now approach saturation.
Yet despite these impressive strides, logical deduction remains stubbornly difficult for even today’s best models. Typically, LLMs struggle to consistently attend to all relevant details, maintain logically sound reasoning chains, or reliably link multiple deduction steps. Even state-of-the-art models generating outputs 10–100 times longer frequently make elementary mistakes that a human solver would easily catch.
Intrigued by this unsolved mystery, we donned our deerstalker caps and set out to investigate: could smaller, open-weight models achieve frontier-level deduction performance with the latest reinforcement learning techniques? We began with substantially weaker models and iteratively trained them on a novel deduction task. Over time, we observed clear improvements in their deductive prowess, eventually matching or even surpassing some of the strongest proprietary models.
Now we’re pleased to share our findings, including our experiments, training recipe, dataset, and model weights, all freely available under the MIT license, along with key practical insights (right here). Grab your magnifying glass, detective; the game is afoot!
Benchmarking
To kick off our experiments, we first had to identify a challenging reasoning task with clearly verifiable solutions and scalable complexity. As it happened (not coincidentally), one of the authors previously created a puzzle set named Temporal Clue that fit these requirements perfectly. Beyond meeting the criteria of ground-truth clarity, new puzzles can be generated as needed—a definite bonus.
Temporal Clue is inspired by the popular board game Clue (Cluedo), where players race to discover who killed Mr. Boddy in his palatial estate. Temporal Clue turns the game into a standalone logic puzzle that extends beyond the standard dimensions—who, (with) what, and where—and adds two additional dimensions: when (time) and why (motive). Puzzles are randomly generated, and minimal yet sufficient clues are selected with OR-Tools’ CP-SAT solver.
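To make the clue-selection idea concrete, here is a minimal sketch (not the authors’ actual generator) of how a CP-SAT model can verify that a candidate clue set pins down exactly one solution; the variable names, domain sizes, and the `clues` callable interface are illustrative assumptions.

```python
# Minimal sketch: check that a clue set admits exactly one (who, what, where, when, why)
# assignment. Domain sizes and the `clues` interface are hypothetical placeholders.
from ortools.sat.python import cp_model


class _SolutionCounter(cp_model.CpSolverSolutionCallback):
    """Counts solutions and stops the search once more than one is found."""

    def __init__(self, limit: int = 2):
        super().__init__()
        self.count = 0
        self._limit = limit

    def on_solution_callback(self):
        self.count += 1
        if self.count >= self._limit:
            self.StopSearch()


def has_unique_solution(clues, num_suspects=6, num_weapons=6, num_rooms=9,
                        num_times=6, num_motives=6) -> bool:
    """`clues` is a list of callables, each adding constraints,
    e.g. lambda m, v: m.Add(v["when"] != 0)."""
    model = cp_model.CpModel()
    variables = {
        "who":   model.NewIntVar(0, num_suspects - 1, "who"),
        "what":  model.NewIntVar(0, num_weapons - 1, "what"),
        "where": model.NewIntVar(0, num_rooms - 1, "where"),
        "when":  model.NewIntVar(0, num_times - 1, "when"),
        "why":   model.NewIntVar(0, num_motives - 1, "why"),
    }
    for clue in clues:
        clue(model, variables)

    solver = cp_model.CpSolver()
    solver.parameters.enumerate_all_solutions = True
    counter = _SolutionCounter()
    solver.Solve(model, counter)
    return counter.count == 1
```

A generator built on this check could drop clues one at a time and keep only those whose removal breaks uniqueness, yielding a minimal yet sufficient set.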
To establish the current state of the art for this deduction task, we benchmarked leading reasoning models, including DeepSeek R1, OpenAI’s o1 and o3-mini, and Anthropic’s Claude Sonnet 3.7. Additionally, we benchmarked the 14B and 32B Qwen models, which we later improved using reinforcement learning, and include a preview of our final results:
| Organization | Model | Reasoning Effort | Avg. Accuracy | Avg. Cost |
|---|---|---|---|---|
| DeepSeek | R1 | Default | 51.6% | $0.029 |
| OpenAI | o1 | Default | 54.9% | $0.901 |
| OpenAI | o3-mini | Medium | 55.9% | $0.068 |
| OpenAI | o3-mini | High | 56.7% | $0.170 |
| Anthropic | Sonnet 3.7 | None | 51.7% | $0.017 |
| Anthropic | Sonnet 3.7 | 16k Token Budget | 61.7% | $0.222 |
| Anthropic | Sonnet 3.7 | 64k Token Budget | 69.5% | $0.392 |
| Alibaba | Qwen 2.5 14B Instruct | None | 28.1% → 59.4% | $0.001 |
| Alibaba | Qwen 2.5 32B Instruct | None | 37.3% → 67.1% | $0.002 |
From these benchmarks, we saw that Claude Sonnet 3.7 with a 64k token thinking budget performed best on our task, but that all the leading models showed room for improvement. DeepSeek R1, a popular open-weight model, performed nearly as well as OpenAI’s o1 and o3-mini. However, the untuned Qwen 2.5 Instruct models’ performance is unimpressive in comparison. The big question is: can we train these smaller, open-weight models to frontier-level performance? Elementary, my dear reader—we just need the right approach.
Training
To train a frontier-level deduction model, we turned to reinforcement learning—an approach that allows agents to learn from their own experience within a controlled environment. Here, the LLMs were our agents, and the puzzles were our environment. We guided the LLMs’ learning by having them generate multiple responses for each puzzle, exploring the problem landscape. We reinforced deductions leading to correct solutions and penalized reasoning that led the models astray.
Among various RL methods, we chose the popular Group Relative Policy Optimization (GRPO) algorithm developed by DeepSeek. GRPO simplifies the training process compared to more traditional methods like Proximal Policy Optimization (PPO), while still providing strong performance. To speed up our experiments, we omitted the Kullback–Leibler (KL) divergence penalty, although our training recipe supports it for interested readers.
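For reference, a standard formulation of the clipped surrogate objective that GRPO optimizes (with the KL penalty term we omitted dropped) looks roughly like this, where $G$ responses $o_1,\dots,o_G$ are sampled per puzzle $q$ and $\hat{A}_i$ is the group-relative advantage:

$$
\mathcal{L}(\theta) = -\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\min\!\Big(r_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big),
\qquad
r_{i,t}(\theta)=\frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t}\mid q,\,o_{i,<t})},
$$

with $\hat{A}_i = \big(R_i - \operatorname{mean}(R_1,\dots,R_G)\big)\,/\,\operatorname{std}(R_1,\dots,R_G)$.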
At a high level, our training loop followed these simple steps:
- Generate model responses to puzzle tasks
- Grade the responses and estimate advantages for each group of chat completions (this is the “Group Relative” part of GRPO); see the sketch after this list
- Fine-tune the model using clipped policy gradients guided by these advantage estimates
- Repeat these steps with new puzzles and the latest version of the model until we reach peak performance
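As a concrete sketch of the grading step, the group-relative advantage can be computed by normalizing each response’s reward against the other responses sampled for the same puzzle—a minimal version; the recipe’s implementation may differ in details:

```python
# Group-relative advantages: normalize each response's reward against the other
# responses sampled for the same puzzle.
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards has shape (num_tasks, samples_per_task), e.g. (32, 50)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)  # all-identical groups get ~zero advantage
```

Groups where every sample earns the same reward produce near-zero advantages and contribute no learning signal, which is why such responses can be discarded before tuning.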
For generating responses, we used the popular vLLM inference engine. We tuned our parameter choices to maximize throughput and minimize startup time. Prefix caching was particularly important because we sampled many responses for each task, and caching prompts helps avoid redundant computation.
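A hedged sketch of what this sampling setup can look like with vLLM’s offline engine is below; the model name, context length, and sampling parameters are illustrative placeholders rather than our exact configuration.

```python
# Sketch: sample many completions per puzzle with vLLM. Prefix caching lets the
# 50 samples for a puzzle reuse the KV cache of the long shared prompt.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct",
    enable_prefix_caching=True,   # reuse KV cache across samples sharing a prompt
    max_model_len=16384,          # illustrative context length
)

puzzle_prompts = ["<rendered Temporal Clue prompt>"]          # hypothetical inputs
sampling = SamplingParams(n=50, temperature=1.0, max_tokens=2048)
outputs = llm.generate(puzzle_prompts, sampling)              # one RequestOutput per prompt
```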
We observed that overwhelming vLLM with too many requests forces preemption or swapping out of in-progress requests. To address this, we limited requests using a semaphore tuned to maintain high key-value (KV) cache utilization while minimizing swaps. More advanced scheduling mechanisms could yield even higher utilization while still supporting flexible generation lengths.
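Here is a minimal sketch of that request-limiting pattern, assuming an OpenAI-compatible vLLM server; the endpoint, model name, and concurrency cap are hypothetical.

```python
# Cap in-flight requests with a semaphore so the engine isn't forced to preempt
# or swap out in-progress requests.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
semaphore = asyncio.Semaphore(8)  # hypothetical cap, tuned for KV cache utilization


async def sample_completions(messages: list[dict], n: int = 50) -> list[str]:
    async with semaphore:         # wait here rather than overloading the engine
        response = await client.chat.completions.create(
            model="Qwen/Qwen2.5-32B-Instruct",
            messages=messages,
            n=n,
            temperature=1.0,
        )
    return [choice.message.content for choice in response.choices]


async def sample_all(tasks: list[list[dict]]) -> list[list[str]]:
    return await asyncio.gather(*(sample_completions(m) for m in tasks))
```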
After sampling, we processed completions using the standard HuggingFace Transformers AutoTokenizer. Its chat template feature, which renders message objects as a prompt string, includes an assistant mask for determining which tokens the LLM generated. We found the models lacked the necessary `{% generation %}` tags in their default templates, so we modified them during the tokenization step. The resulting assistant mask was included in the dictionary of tensors used for tuning, identifying which positions required loss calculations.
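A short sketch of that tokenization step follows, assuming the chat template has already been patched with `{% generation %}`/`{% endgeneration %}` tags (the stock templates lacked them); the key names follow current Transformers releases, so check your installed version.

```python
# Tokenize a prompt/response pair and recover which tokens the assistant produced.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
# tokenizer.chat_template = PATCHED_TEMPLATE  # modified to add {% generation %} tags

messages = [
    {"role": "user", "content": "<Temporal Clue puzzle>"},
    {"role": "assistant", "content": "<sampled reasoning and answer>"},
]

encoded = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,  # requires {% generation %} tags in the template
)
input_ids = encoded["input_ids"]
assistant_mask = encoded["assistant_masks"]  # 1 where the model generated the token
```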
After tokenizing responses and obtaining assistant masks, we packed the data for tuning. In addition to including multiple prompt/response pairs in each packed sequence, we identified shared prompt tokens and assigned each token a Parent ID alongside the standard Group ID. Particularly for tasks like Temporal Clue—averaging over 1,000 tokens per puzzle—generating many responses per task and efficiently packing the tensors significantly reduced redundancy. Once packed with all the necessary information, we could visualize our training dataset two-dimensionally, each row being a sequence of tokens potentially containing multiple prompts and completions:
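The sketch below shows one way such packing and masking can fit together—our simplified interpretation, not necessarily the recipe’s exact layout: the shared prompt is stored once, each completion gets its own Group ID, and the Parent ID lets every completion attend back to its prompt while ignoring sibling completions.

```python
# Pack one puzzle's shared prompt plus several completions into a single row, and
# build the corresponding attention rule from Group/Parent IDs (simplified sketch).
import torch


def pack_group(prompt_ids, completions, group_id):
    """group_id values for different puzzles should be spaced by more than
    len(completions) so the per-completion ids below never collide."""
    input_ids, parent_ids, group_ids = [], [], []

    # Shared prompt: stored once, tagged with its own id.
    input_ids += prompt_ids
    parent_ids += [group_id] * len(prompt_ids)
    group_ids += [group_id] * len(prompt_ids)

    # Completions: unique group id each, all pointing back to the shared prompt.
    for offset, completion_ids in enumerate(completions, start=1):
        input_ids += completion_ids
        parent_ids += [group_id] * len(completion_ids)
        group_ids += [group_id + offset] * len(completion_ids)

    return (torch.tensor(input_ids), torch.tensor(parent_ids), torch.tensor(group_ids))


def build_attention_mask(parent_ids, group_ids):
    """Boolean mask[q, k]: query token q may attend to key token k."""
    n = len(group_ids)
    causal = torch.tril(torch.ones(n, n)).bool()
    same_group = group_ids[:, None] == group_ids[None, :]      # own prompt/completion
    parent_prompt = parent_ids[:, None] == group_ids[None, :]  # the shared parent prompt
    return causal & (same_group | parent_prompt)
```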
With tightly packed data in hand, we could proceed to tuning. Our models were already pre-trained, instruction-tuned, reasonably intelligent, and adept at following instructions. However, they could not yet reliably solve Temporal Clue puzzles. Still, they occasionally succeeded, and that was enough. By increasing the probability of good reasoning and decreasing the probability of “not good” reasoning, we incrementally steered the models toward Master Detective status. We achieved this using standard machine learning techniques, employing policy gradient methods to compute the loss and shift the weights beneficially.
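A simplified sketch of that loss, mirroring the clipped GRPO surrogate (without the KL term) and applying it only at assistant-token positions; it is illustrative rather than the recipe’s exact implementation.

```python
# Clipped policy-gradient loss over assistant tokens only.
import torch


def grpo_loss(new_logprobs, old_logprobs, advantages, assistant_mask, clip_eps=0.2):
    """All tensors have shape (batch, seq_len); `advantages` repeats each
    response's advantage at every one of its token positions."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    per_token = -torch.min(ratio * advantages, clipped * advantages)
    mask = assistant_mask.float()
    return (per_token * mask).sum() / mask.sum()  # mean over assistant tokens
```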
For training, we used the torchtune library provided by the PyTorch team. Torchtune features efficient decoder-only transformer implementations for popular models, including Llama, Gemma, Phi, and more. Although we primarily used the Qwen models for this project, we also ran experiments with 8B and 70B Llama models. Torchtune also provides memory-saving and performance-enhancing utilities, including:
- Activation Checkpointing
- Activation Offloading
- Quantization
- Parameter-Efficient Fine-Tuning (PEFT), e.g., Low Rank Adaptation (LoRA)
See the README here for the full list of supported optimizations.
Additionally, Torchtune supports multi-device (and now multi-node) training, making it ideal for larger models. It supports both Fully Sharded Data Parallel (FSDP) and Tensor Parallel (TP) training, which can be combined. The team also provides over a dozen recipes, encouraging users to copy and customize them for their use cases. We created a modified version of their full fine-tune recipe supporting:
- Both multi-device and single-device training
- Reference model loading and weight swapping for calculating KL divergences
- Advanced causal mask calculations using group and parent IDs
- GRPO loss integration and component logging
The recipe can be seen here. In the future, we would like to add tensor parallelism support and investigate PEFT and quantization.
The RL training process involves selecting a myriad of hyperparameters. While training our models, we tested various configurations and largely settled on the following:
- Models: Qwen 2.5 Instruct 14B & 32B
- Tasks per Iteration: 32
- Samples per Task per Iteration: 50
- Total Samples per Iteration: 32 * 50 = 1600
- Learning Rate: 6e-6
- Micro-Batch Size: 4 sequences for the 14B model, 8 for the 32B model
- Batch Size: Variable, depending on the number of sequences
The batch size is variable because response lengths can vary during training, sequence packing efficiency changes each iteration, and responses with zero advantage are discarded. For one run, we tried dynamically adjusting the learning rate inversely proportional to batch size, but this resulted in excessively high learning rates for small batch sizes, requiring a cap. The capped version didn’t meaningfully differ from using a constant learning rate, but tuning batch size and learning rate remains an interesting area for future experimentation.
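For illustration, the capped schedule we tried amounts to something like the following; the reference batch size and cap here are hypothetical placeholders, not values from our runs.

```python
# Capped learning rate, scaled inversely with batch size (placeholder constants).
def scaled_learning_rate(batch_size: int, base_lr: float = 6e-6,
                         reference_batch: int = 32, lr_cap: float = 2e-5) -> float:
    return min(base_lr * reference_batch / batch_size, lr_cap)
```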
We also ran informal experiments increasing tasks per iteration while reducing samples per task—and vice versa—keeping total samples per iteration roughly identical. Over a short training horizon, these variations showed no meaningful differences, suggesting the recipe is robust to different balances between the number of tasks and the number of samples per task.
Results
After training our models for over 100 iterations, we achieved frontier-level deduction.
Our models improved rapidly before accuracy gains started to taper off and eventually degrade, sometimes drastically. At their best, the 14B model approached Claude Sonnet 3.7’s performance at 16k tokens, and the 32B model nearly matched Sonnet’s results with the larger 64k budget.
While training, performance gains followed a power law, forming a linear relationship on a log-log chart (before deteriorating).
We suspect the models may have converged too early on greedy strategies that worked out of the gate but potentially limited their long-term prospects. A logical next step would be to investigate approaches that encourage diverse responses, build capabilities incrementally (as with curriculum learning), or assign larger rewards to particularly exceptional solutions to incentivize thorough exploration.
Additionally, we noted interesting patterns in output length during training. Initially, responses grew longer, then stabilized, before diverging near the end of training, with the 14B model’s responses getting longer and the 32B model’s response lengths collapsing, especially after reaching peak performance.
To qualitatively assess improvements in logical reasoning, we asked the strongest frontier model, Claude Sonnet 3.7, to identify and evaluate the soundness of deductions made by the Qwen 32B model—before and after training for 100+ iterations—on similar puzzles. Sonnet identified 6 deductions from the base model, all but one of which it judged to be erroneous; conversely, it identified 7 deductions from the trained model, all but one of which it judged to be logically sound.
Finally, assuming sufficient throughput with on-demand deployments, we estimated Qwen model costs from Fireworks AI’s serverless pricing tiers. We plotted accuracy against the natural logarithm of the average inference cost per response and observed a clear linear Pareto frontier among the untuned models. By successfully training open-weight models to frontier-level accuracy, we dramatically improved the cost–accuracy trade-off.
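For readers who want to reproduce the cost axis, the per-response estimate reduces to simple token arithmetic; the token counts and price below are placeholders, not Fireworks AI’s actual rates.

```python
# Estimate serverless cost per response and the log-cost value used for plotting.
import math


def cost_per_response(prompt_tokens: int, completion_tokens: int,
                      price_per_million_tokens: float) -> float:
    return (prompt_tokens + completion_tokens) * price_per_million_tokens / 1e6

log_cost = math.log(cost_per_response(1_000, 2_000, 0.90))  # hypothetical numbers
```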
Here, after sharing satisfied nods for a job well done, we hail a hansom cab and return to Baker Street—the perfect place to contemplate our findings.
Conclusion
In our investigation, we set out to explore whether smaller, open-weight language models could achieve frontier-level deductive reasoning through reinforcement learning. After training Qwen 14B and 32B models on challenging Temporal Clue puzzles—using carefully chosen hyperparameters and the GRPO method—we achieved impressive performance gains. These improvements brought open-weight models to the cutting edge of reasoning performance at significantly reduced costs. Our findings highlight the promising potential of reinforcement learning to efficiently train open models on complex deduction tasks.
As mentioned previously, the dataset, experiments, training recipe, and model weights (14B, 32B) are freely available under the MIT license. We encourage you to try reproducing and improving on our results.
Additionally, we’ve held out one particularly exciting finding for the end. We discovered that meaningful performance improvements, as high as 10–15%, can be achieved with as few as 16 training examples. This means you don’t need a lot of data to get started—just some intuition about the problem you’d like to solve.
Are you interested in using reinforcement learning to train your own models, or would you like some help getting started? Feel free to reach out to us at OpenPipe—we’d love to chat!
Now, dear reader, please keep your deerstalker cap and magnifying glass handy; there’s much more to investigate. The game remains very much afoot.