Steiner is a reasoning model capable of exploring multiple reasoning paths in an autoregressive manner during inference, autonomously verifying or backtracking as needed. The training process consists of three steps: first, by randomly truncating synthesized reasoning path data and utilizing guided re-completion, each example is converted into a Directed Acyclic Graph (DAG) from question to answer. Then, 50K reasoning path examples with backtracking nodes are sampled from 10K DAGs. Finally, reinforcement learning is applied using heuristic rewards based on the in-degree and out-degree of each node in the original DAG and their distances, allowing the model to balance exploration depth and breadth. The model achieved a +5.56 improvement on the GPQA-Diamond dataset but failed to reproduce inference-time scaling.
🤗 Model Download: https://huggingface.co/collections/peakji/steiner-preview-6712c6987110ce932a44e9a6
Steiner is a series of reasoning models trained on synthetic data using reinforcement learning. These models can explore multiple reasoning paths in an autoregressive manner during inference and autonomously verify or backtrack when necessary, enabling a linear traversal of the implicit search tree.
Steiner is a personal interest project by Yichao "Peak" Ji, inspired by OpenAI o1. The ultimate goal is to reproduce o1 and validate the inference-time scaling curves. The Steiner-preview model is currently a work-in-progress. The reason for open-sourcing it is that I've found automated evaluation methods, primarily based on multiple-choice questions, struggle to fully reflect the progress of reasoning models. In fact, the assumption that "the correct answer is always among the options" doesn't align well with real-world reasoning scenarios, as it encourages models to perform substitution-based validation rather than open-ended exploration. For this reason, I've chosen to open-source these intermediate results and, when time permits, to make them more accessible. This approach allows me to share knowledge while also collecting more evaluations and feedback from real human users.
⚠️ Disclaimer: While Steiner has been able to achieve high-quality zero-shot results without relying on Chain of Thought (CoT) prompting or an agent framework, it has not yet replicated the inference-time scaling capabilities demonstrated by o1. In experiments using a special logits processor to intervene on reasoning tokens, increasing the number of reasoning steps did not improve performance; in fact, it led to a decline on benchmarks such as MMLU-Pro and GPQA. As a result, Steiner cannot currently be considered a successful reproduction of OpenAI o1. There may be deficiencies in both the training methods and data quality, so please interpret the results with caution.
Compared to traditional LLMs, the most notable change in OpenAI o1 is the introduction of reasoning tokens during inference, which enables inference-time scaling. This allows the model's performance to improve by increasing the compute budget during inference. When discussing inference-time scaling, the most intuitive approach might be to introduce tree search or an agentic framework. However, after reviewing the (limited) official information on o1, I noticed that most of the reported benchmarks are based on pass@1 and majority voting. Additionally, the OpenAI team has mentioned that o1 is a single model, not a system, which piqued my curiosity about how they accomplished this remarkable piece of work.
Although the specific content of the reasoning tokens has not yet been made available to developers through the OpenAI API, the token usage statistics do include the number of reasoning tokens (since it is used for billing developers). With this in mind, I designed a simple experiment using o1's API to verify the relationship between the number of completion (including reasoning) tokens and the total request time. We know that if tree search is used, the inference would be as parallel as possible to reuse cache and maximize GPU utilization, which would result in a sub-linear curve. However, the experiment yielded a series of clean, linear results, with o1-mini showing even less fluctuation than GPT-4o-mini:
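For reference, a minimal sketch of such a timing experiment might look like the following, assuming the official `openai` Python client; the prompts are illustrative, not the exact set used for the measurements above:

```python
import time
from openai import OpenAI

client = OpenAI()

def measure(model: str, prompt: str) -> tuple[int, float]:
    """Return (completion tokens, wall-clock seconds) for a single request."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # usage.completion_tokens includes the (hidden) reasoning tokens for o1 models
    return resp.usage.completion_tokens, time.perf_counter() - start

# Collect (tokens, latency) pairs over prompts of varying difficulty, then check
# whether latency grows linearly or sub-linearly with the token count.
prompts = ["7.7 or 7.13 - which is bigger?", "Prove that sqrt(2) is irrational."]
for model in ("o1-mini", "gpt-4o-mini"):
    for tokens, seconds in (measure(model, p) for p in prompts):
        print(f"{model}\t{tokens}\t{seconds:.2f}s")
```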
The above experiment led me to a hypothesis: OpenAI o1 may still be a model that performs linear autoregressive decoding, but this doesn't mean that it doesn't perform any "search" during the reasoning phase. Imagine a search tree: when traversing it, the path generated is actually linear. If we could train an autoregressive language model that not only generates reasoning paths but can also validate, backtrack, or switch its approach when necessary, then in the optimal case, it would essentially be performing a linear traversal of an implicit search tree within the same context (a toy illustration follows the list below). While this linear traversal might seem inefficient, it has three key advantages over parallel search:
- All prior attempts, whether correct or incorrect, are stored in the context memory, meaning every decision is made with complete historical information.
- Implicit backtracking does not require the target node to already exist within the search tree, allowing for more open-ended exploration.
- From an engineering standpoint, it allows for the reuse of all existing, highly optimized inference infrastructure.
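To make the idea concrete, here is a toy illustration (not Steiner's actual decoding loop): a depth-first traversal of an explicit tree emits a purely linear trace in which failed attempts and backtracking appear as ordinary steps, which is exactly the kind of sequence a linear autoregressive model could generate:

```python
# Hypothetical tree: two candidate approaches to a problem, one of which dead-ends.
tree = {
    "root": ["try-algebra", "try-estimation"],
    "try-algebra": ["dead-end"],
    "try-estimation": ["answer"],
    "dead-end": [],
    "answer": [],
}

def linear_trace(node: str, out: list[str]) -> bool:
    """Depth-first search that records every visit and backtrack as a linear sequence."""
    out.append(f"<step:{node}>")
    if node == "answer":
        return True
    for child in tree[node]:
        if linear_trace(child, out):
            return True
        out.append(f"<backtrack-from:{child}>")  # the failure stays in context
    return False

trace: list[str] = []
linear_trace("root", trace)
print(" ".join(trace))
# <step:root> <step:try-algebra> <step:dead-end> <backtrack-from:dead-end>
# <backtrack-from:try-algebra> <step:try-estimation> <step:answer>
```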
Training a model with linear search capabilities is not an easy task; both data synthesis and model training present numerous challenges.
Firstly, currently available reasoning datasets consist mainly of synthetic Chain of Thought (CoT) or reasoning steps, typically generated by feeding "question-answer" tuples into a strong LLM and asking the model to break down its thought process. This approach means that these datasets do not contain meaningful backtracking steps. As a result, models trained on this type of data essentially learn shortcuts, or in other words, internalize the CoT.
To address this issue, I designed two methods for data synthesis and augmentation:
- Randomly truncate the reasoning path and hide the correct answers, allowing a strong LLM to attempt forward reasoning based on the truncated prefix for a certain number of steps, and then provide the correct answer to obtain backtracking examples.
- After clustering the steps generated in the previous step, assign unique IDs to each step and construct a directed acyclic graph (DAG) of all steps for the same question. Random sampling is then performed on the DAG to obtain a polynomial number of reasoning path examples.
Through the aforementioned methods (along with considerable manual effort and clever tricks), I ultimately obtained 10,000 directed acyclic graphs (DAGs) and sampled 50,000 reasoning path examples with backtracking based on these graphs. Each sample contains an average of approximately 1,600 reasoning tokens, which is remarkably close to the statistics collected during previous tests of o1/o1-mini! Considering the training costs, I retained only those samples with reasoning token counts below 4,096 and a total of prompt + reasoning + completion tokens below 8,192.
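Since the actual synthesis pipeline is not released, the following is only a rough sketch of the sampling-and-filtering step under assumed data structures (a `networkx` DAG whose nodes are reasoning steps, and a placeholder token counter):

```python
import random
import networkx as nx  # assumed dependency for handling the step DAGs

def sample_paths(dag: nx.DiGraph, question: str, answer: str, n_samples: int,
                 count_tokens=len,  # placeholder; a real tokenizer would be used here
                 max_reasoning_tokens: int = 4096) -> list[list[str]]:
    """Random walks from question to answer; keep only paths within the token budget."""
    paths = []
    while len(paths) < n_samples:
        node, path = question, [question]
        while node != answer:
            successors = list(dag.successors(node))
            if not successors:          # dead end that is not the answer: discard this walk
                path = None
                break
            node = random.choice(successors)  # walks may wander through backtracking nodes
            path.append(node)
        if path and sum(count_tokens(step) for step in path) <= max_reasoning_tokens:
            paths.append(path)
    return paths
```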
As an example, for the question "7.7 or 7.13 - which is bigger?", the output format of Steiner is as follows (where the blank lines and indentation are manually added for readability):
<|reasoning_start|><|reasoning_step_start|>
<|reasoning_step_name_start|>Initial Comparison of Decimal Numbers<|reasoning_step_name_end|>
<|reasoning_step_thought_start|>I need to compare the numbers 7.7 and 7.13 to determine which is larger. A useful way to compare decimal numbers is to ensure both numbers have the same number of decimal places. This allows for a straightforward comparison of the digits in each place value.<|reasoning_step_thought_end|>
<|reasoning_step_summary_start|>To compare 7.7 and 7.13, I will convert them to have the same number of decimal places, making it easier to compare them directly.<|reasoning_step_summary_end|>
<|reasoning_step_reflection_start|>Converting both numbers to have the same number of decimal places will allow me to compare them accurately. This step is important because it allows me to see which number has a larger value at each place value.<|reasoning_step_reflection_end|>
<|reasoning_step_end|>
<|reasoning_continue|>
<|reasoning_step_start|>
<|reasoning_step_name_start|>Align Decimal Places<|reasoning_step_name_end|>
<|reasoning_step_thought_start|>First, I will rewrite 7.7 with two decimal places. Since 7.7 is equivalent to 7.70, I can now compare 7.70 and 7.13 directly.<|reasoning_step_thought_end|>
<|reasoning_step_summary_start|>I rewrite 7.7 as 7.70 to have the same number of decimal places as 7.13.<|reasoning_step_summary_end|>
<|reasoning_step_reflection_start|>With both numbers having two decimal places, I can now compare them directly. I need to look at the digits in the tenths and hundredths places.<|reasoning_step_reflection_end|>
<|reasoning_step_end|>
<|reasoning_continue|>
...
<|reasoning_continue|>
<|reasoning_step_start|>
<|reasoning_step_name_start|>Compare Hundredths Place<|reasoning_step_name_end|>
<|reasoning_step_thought_start|>Next, I compare the hundredths place. In 7.70, the digit in the hundredths place is 0. In 7.13, the digit in the hundredths place is 3. Since 0 is less than 3, this does not change the fact that 7.70 is greater than 7.13, because the difference was already established in the tenths place.<|reasoning_step_thought_end|>
<|reasoning_step_summary_start|>In the hundredths place, 0 in 7.70 is less than 3 in 7.13, but this does not affect the overall comparison, as the tenths place already determined that 7.70 is larger.<|reasoning_step_summary_end|>
<|reasoning_step_reflection_start|>The comparison confirms that 7.70 is indeed greater than 7.13, both in the tenths and hundredths places. I have verified the comparison thoroughly.<|reasoning_step_reflection_end|>
<|reasoning_step_end|>
<|reasoning_end|>
To determine which number is larger between **7.7** and **7.13**, follow these steps:

1. **Align the Decimal Points:**

   - **7.7** can be written as **7.70** (adding a zero in the hundredths place for comparison).
   - **7.13** remains **7.13**.

2. **Compare the Numbers:**

   - **7.70** vs. **7.13**

   - **Tenths Place:**
     - **7.70** has **7** in the tenths place.
     - **7.13** has **1** in the tenths place.
     - **7 is greater than 1**.

   - **Hundredths Place:**
     - **7.70** has **0**.
     - **7.13** has **3**.
     - However, since the tenths place already shows that **7.70** is greater, the hundredths place doesn't need to be compared further.

3. **Conclusion:**

   - **7.7** (or **7.70**) is **greater** than **7.13**.
Next, I divided the training of the Steiner series models into three stages:
- Continual Pre-Training (CPT): This stage involves training on a mixture of regular text corpora and reasoning paths, allowing the model to familiarize itself with long reasoning outputs and to preliminarily train the embeddings of the 14 newly introduced special tokens. It should be noted that tests on some small-parameter models suggest that this step might be redundant; directly training with a large amount of reasoning data during the Supervised Fine-Tuning (SFT) stage seems to yield good representations as well. However, the 32B CPT was completed early on, so I continued to use it.
- Supervised Fine-Tuning (SFT): In this stage, training is conducted using a chat template, with the goal of guiding the model to imitate the format of reasoning: first, assign a name to each step, then output a complete thought, summarize the thought, reflect on the reasoning so far, and finally decide whether to continue, backtrack, or end the reasoning and formally answer the question. You may wonder why an open-source model needs to generate a summary like o1, especially since it doesn't need to hide its thoughts. This is because I am preparing for a future Steiner model capable of multi-turn dialogue. Theoretically, after training, it would be possible to replace the complete thoughts from previous conversations with summaries to reduce the pre-fill overhead when the prefix cache cannot be hit. Currently, Steiner has not yet been optimized for multi-turn dialogue, and retaining only summaries may lead to adverse few-shot effects.
- Reinforcement Learning with Step-Level Reward (RL): After the first two stages, the model has learned to generate and complete reasoning paths, but it still does not know which choices are correct and efficient. If we blindly reward shorter reasoning paths, the model may degrade into shortcut learning, which is just internalizing the CoT. To address this, I designed a heuristic reward mechanism: weighting the reward for each step and the entire reasoning path based on the number of incoming edges e_i, outgoing edges e_o, the distance from the original question d_s, and the distance to the correct answer d_e of each node in the DAG (see the sketch after this list). This approach guides the model to learn how to balance the breadth and depth of exploration.
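The exact reward formula is not spelled out above, so the snippet below is only a hypothetical stand-in showing how the named quantities (in-degree e_i, out-degree e_o, distance from the question d_s, distance to the answer d_e) could be combined into step-level and path-level rewards; the weights and functional form are invented for illustration:

```python
def step_reward(e_i: int, e_o: int, d_s: int, d_e: int,
                w_in: float = 0.25, w_out: float = 0.25, w_depth: float = 0.5) -> float:
    """Hypothetical per-step reward built from the DAG statistics named in the text."""
    convergence = e_i / (e_i + 1)          # nodes many paths pass through are likely useful
    branching = e_o / (e_o + 1)            # nodes that open up options reward breadth
    progress = d_s / (d_s + d_e + 1e-6)    # how far along the question->answer axis we are
    return w_in * convergence + w_out * branching + w_depth * progress

def path_reward(step_rewards: list[float], length_penalty: float = 0.01) -> float:
    """Score the whole path as the mean step reward minus a mild length penalty,
    so the model is not pushed toward degenerate shortcut paths."""
    return sum(step_rewards) / len(step_rewards) - length_penalty * len(step_rewards)
```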
The above approach may seem straightforward, but over the past month, I have been struggling with out-of-memory (OOM) issues and reward hacking every weekend (and on evenings without overtime). Finally, on the 38th day after the release of OpenAI o1, I achieved a set of results that I don't consider too embarrassing.
The figure shows the performance of the Steiner models at different training stages on the GPQA-Diamond dataset. It can be seen that the introduction of the reinforcement learning phase resulted in an improvement of +3.53. When combined with a logits processor used to constrain the number of reasoning steps, an optimal configuration can yield an improvement of +5.56.
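For context, the kind of intervention described here can be approximated with a Hugging Face `LogitsProcessor` along the lines below; this is a sketch, not the processor used for the reported numbers, and the step thresholds are illustrative. The token IDs would be looked up from the tokenizer for the `<|reasoning_continue|>` and `<|reasoning_end|>` special tokens shown in the example output above:

```python
import torch
from transformers import LogitsProcessor

class ReasoningStepLimiter(LogitsProcessor):
    """Force the generation to stay within [min_steps, max_steps] reasoning steps."""

    def __init__(self, continue_id: int, end_id: int, min_steps: int, max_steps: int):
        self.continue_id, self.end_id = continue_id, end_id
        self.min_steps, self.max_steps = min_steps, max_steps

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        steps = (input_ids == self.continue_id).sum(dim=-1)   # steps taken so far, per sequence
        too_few, too_many = steps < self.min_steps, steps >= self.max_steps
        scores[too_few, self.end_id] = float("-inf")          # forbid ending the reasoning too early
        scores[too_many, self.continue_id] = float("-inf")    # forbid continuing past the cap
        return scores
```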
The decision to showcase this benchmark is twofold: firstly, because o1/o1-mini has demonstrated significant improvements on this dataset, and secondly, because the contamination situation of this dataset is relatively favorable. Additionally, I observed that Steiner shows no significant differences compared to the baseline on datasets like MMLU, which aligns with OpenAI's observations regarding o1-mini in their blog, potentially reflecting the limitations of a 32B model's world knowledge acquired during the pre-training phase.
While it is important to acknowledge that the current model's capabilities still fall significantly short of o1-mini and o1, part of the issue lies in the nature of automated evaluation benchmarks, which are primarily composed of multiple-choice questions and may not fully reflect the capabilities of reasoning models. During the training phase, reasoning models are encouraged to engage in open-ended exploration of problems, whereas multiple-choice questions operate under the premise that "the correct answer must be among the options." This makes it evident that verifying options one by one is a more efficient approach. In fact, existing large language models have, intentionally or unintentionally, mastered this technique, regardless of whether special prompts are used. Ultimately, it is this misalignment between automated evaluation and genuine reasoning requirements that makes me believe it is necessary to open-source the model for real human evaluation and feedback.
The bigger issue lies in the inference-time scaling experiments: unfortunately, after using the logits processor to increase the number of reasoning steps, Steiner has shown no improvements on almost all tested benchmarks, and in some cases, performance has even declined. There are many possible reasons for this, including but not limited to:
- CPT/SFT did not enable the model to truly understand extended output tokens.
- The heuristic reward design in the RL phase failed to help the model balance breadth and depth in reasoning.
- During backtracking, context influence can easily lead to confirming incorrect answers or rejecting originally correct ones.
- The method of forcibly altering logits to increase reasoning steps is simply too crude.
- ...
All in all, this will be the main focus of my future work.
- Steiner's current post-training data does not include examples for multi-turn dialogues. The best-performing version of the Steiner model (based on Qwen2.5-32B) lacks the ability to handle multi-turn conversations. The open-source Steiner-preview model (based on Qwen2.5-32B-Instruct) is compatible with chat formats but is still not recommended for multi-turn dialogues.
- Similar to OpenAI o1-2024-09-12, Steiner also does not recommend the use of custom system prompts or modifications to sampling parameters such as temperature. Steiner has not yet been trained on a diverse set of system prompts, and altering other parameters may lead to errors in the formatting of reasoning tokens.
- The language composition of Steiner's post-training data is approximately 90% English and 10% Chinese, but during the reasoning path data augmentation process, almost only English was used. Therefore, while the model's final responses show a certain degree of language-following ability, the reasoning tokens may predominantly be generated in English.
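Putting the above constraints together, a minimal usage sketch with `transformers` might look like this; the repository name is an assumption based on the collection linked at the top, so please check the Hugging Face page for the exact model ID:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "peakji/steiner-32b-preview"  # assumed repo name; see the HF collection for the real one
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# No custom system prompt, default chat template, default sampling parameters.
messages = [{"role": "user", "content": "7.7 or 7.13 - which is bigger?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=4096)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=False))
```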
As for the origin of the name Steiner, why not ask the model itself?
🧑🏻‍💻 What is the origin of your name?
💬 My name, "Steiner," is inspired by the protagonist's ability "Reading Steiner" from the visual novel and anime *Steins;Gate*. This ability allows the character to retain memories across multiple world lines, which aligns with my capabilities of exploring different reasoning paths and retaining information across them.
If you find my work helpful, please consider citing it in your research or projects. Your acknowledgment would be greatly appreciated!
@misc{ji2024steiner,
title = {A Small Step Towards Reproducing OpenAI o1: Progress Report on the Steiner Open Source Models},
url = {https://medium.com/@peakji/b9a756a00855},
author = {Yichao Ji},
month = {October},
year = {2024}
}