Understanding Reasoning LLMs – by Sebastian Raschka, PhD


This article describes the four main approaches to building reasoning models, or how we can enhance LLMs with reasoning capabilities. I hope this provides valuable insights and helps you navigate the rapidly evolving literature and hype surrounding this topic.

In 2024, the LLM field saw increasing specialization. Beyond pre-training and fine-tuning, we witnessed the rise of specialized applications, from RAGs to code assistants. I expect this trend to accelerate in 2025, with an even greater emphasis on domain- and application-specific optimizations (i.e., “specializations”).

Stages 1-3 are the common steps to developing LLMs. Stage 4 specializes LLMs for specific use cases.

The development of reasoning models is one of these specializations. This means we refine LLMs to excel at complex tasks that are best solved with intermediate steps, such as puzzles, advanced math, and coding challenges. However, this specialization does not replace other LLM applications, because transforming an LLM into a reasoning model also introduces certain drawbacks, which I will discuss later.

To give you a brief glimpse of what’s covered below, in this article, I will:

  1. Explain the meaning of “reasoning model”

  2. Discuss the advantages and disadvantages of reasoning models

  3. Outline the methodology behind DeepSeek R1

  4. Describe the four main approaches to building and improving reasoning models

  5. Share thoughts on the LLM landscape following the DeepSeek V3 and R1 releases

  6. Provide tips for developing reasoning models on a tight budget

I hope you find this article useful as AI continues its rapid development this year!

If you work in AI (or machine learning in general), you are probably familiar with vague and hotly debated definitions. The term “reasoning models” is no exception. Eventually, someone will define it formally in a paper, only for it to be redefined in the next, and so on.

In this article, I define “reasoning” as the process of answering questions that require complex, multi-step generation with intermediate steps. For example, factual question-answering like “What is the capital of France?” does not involve reasoning. In contrast, a question like “If a train is moving at 60 mph and travels for 3 hours, how far does it go?” requires some simple reasoning. For instance, it requires recognizing the relationship between distance, speed, and time before arriving at the answer.
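For reference, the intermediate step this question calls for is simply the distance formula:

distance = speed × time = 60 mph × 3 h = 180 miles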

A regular LLM may only provide a short answer (as shown on the left), whereas reasoning models typically include intermediate steps that reveal part of the thought process. (Note that many LLMs that have not been specifically developed for reasoning tasks can also provide intermediate reasoning steps in their answers.)

Most modern LLMs are capable of basic reasoning and can answer questions like, “If a train is moving at 60 mph and travels for 3 hours, how far does it go?” So, today, when we refer to reasoning models, we typically mean LLMs that excel at more complex reasoning tasks, such as solving puzzles, riddles, and mathematical proofs.

Additionally, most LLMs branded as reasoning models today include a “thought” or “thinking” process as part of their response. Whether and how an LLM actually “thinks” is a separate discussion.

Intermediate steps in reasoning models can appear in two ways. First, they may be explicitly included in the response, as shown in the previous figure. Second, some reasoning LLMs, such as OpenAI’s o1, run multiple iterations with intermediate steps that are not shown to the user.

“Reasoning” is used at two different levels: 1) processing the input and generating via multiple intermediate steps and 2) providing some sort of reasoning as part of the response to the user.

Now that we have defined reasoning models, we can move on to the more interesting part: how to build and improve LLMs for reasoning tasks. However, before diving into the technical details, it is important to consider when reasoning models are actually needed.

When do we need a reasoning model? Reasoning models are designed to be good at complex tasks such as solving puzzles, advanced math problems, and challenging coding tasks. However, they are not necessary for simpler tasks like summarization, translation, or knowledge-based question answering. In fact, using reasoning models for everything can be inefficient and expensive. For instance, reasoning models are typically more expensive to use, more verbose, and sometimes more prone to errors due to “overthinking.” Here, too, the simple rule applies: use the right tool (or type of LLM) for the task.

The key strengths and limitations of reasoning models are summarized in the figure below.

The key strengths and weaknesses of reasoning models.

Before discussing the four main approaches to building and improving reasoning models in the next section, I want to briefly outline the DeepSeek R1 pipeline, as described in the DeepSeek R1 technical report. This report serves as both an interesting case study and a blueprint for developing reasoning LLMs.

Note that DeepSeek did not release a single R1 reasoning model but instead introduced three distinct variants: DeepSeek-R1-Zero, DeepSeek-R1, and DeepSeek-R1-Distill.

Based on the descriptions in the technical report, I have summarized the development process of these models in the diagram below.

Development process of DeepSeek’s three different reasoning models that are discussed in the DeepSeek R1 technical report.

Next, let’s briefly go over the process shown in the diagram above. More details will be covered in the next section, where we discuss the four main approaches to building and improving reasoning models.

(1) DeepSeek-R1-Zero: This model is based on the 671B pre-trained DeepSeek-V3 base model released in December 2024. The research team trained it using reinforcement learning (RL) with two types of rewards. This approach is referred to as “cold start” training because it did not include a supervised fine-tuning (SFT) step, which is typically part of reinforcement learning with human feedback (RLHF).

(2) DeepSeek-R1: This is DeepSeek’s flagship reasoning model, built upon DeepSeek-R1-Zero. The team further refined it with additional SFT stages and further RL training, improving upon the “cold-started” R1-Zero model.

(3) DeepSeek-R1-Distill: Using the SFT data generated in the previous steps, the DeepSeek team fine-tuned Qwen and Llama models to improve their reasoning abilities. While not distillation in the traditional sense, this process involved training smaller models (Llama 8B and 70B, and Qwen 1.5B–30B) on outputs from the larger DeepSeek-R1 671B model.

In this section, I will outline the key techniques currently used to improve the reasoning capabilities of LLMs and to build specialized reasoning models such as DeepSeek-R1, OpenAI’s o1 & o3, and others.

Note: The exact workings of o1 and o3 remain unknown outside of OpenAI. However, they are rumored to leverage a combination of both inference and training techniques.

One way to improve an LLM’s reasoning capabilities (or any capability in general) is inference-time scaling. This term can have multiple meanings, but in this context, it refers to increasing computational resources during inference to improve output quality.

A rough analogy is how humans tend to generate better responses when given more time to think through complex problems. Similarly, we can apply techniques that encourage the LLM to “think” more while generating an answer. (Although, whether LLMs actually “think” is a different discussion.)

One straightforward approach to inference-time scaling is clever prompt engineering. A classic example is chain-of-thought (CoT) prompting, where phrases like “think step by step” are included in the input prompt. This encourages the model to generate intermediate reasoning steps rather than jumping directly to the final answer, which can often (but not always) lead to more accurate results on more complex problems. (Note that it doesn’t make sense to use this strategy for simpler knowledge-based questions, like “What is the capital of France”, which again is a good rule of thumb for deciding whether a reasoning model makes sense for your given input query.)
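To make this concrete, here is a minimal sketch of CoT prompting. The prompt construction below is generic: the resulting strings would be passed to whatever LLM interface you happen to use (an API client or a locally hosted model), and the trigger phrase is just one common choice.

```python
question = "If a train is moving at 60 mph and travels for 3 hours, how far does it go?"

# Direct prompt: the model may jump straight to a final answer.
direct_prompt = question

# CoT prompt: the appended trigger phrase encourages the model to spell out
# intermediate reasoning steps before the final answer.
cot_prompt = question + "\nLet's think step by step."

# Both strings would then be sent to the LLM of your choice; only the prompt
# differs, which is why this counts as an inference-time technique.
```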

An example of classic CoT prompting from the 2022 Large Language Models are Zero-Shot Reasoners paper (https://arxiv.org/abs/2205.11916).

The aforementioned CoT approach can be seen as inference-time scaling because it makes inference more expensive by generating more output tokens.

Another approach to inference-time scaling is the use of voting and search strategies. One simple example is majority voting, where we have the LLM generate multiple answers, and we select the correct answer by majority vote. Similarly, we can use beam search and other search algorithms to generate better responses.
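As a sketch of what majority voting might look like in practice (the `generate_fn` and `extract_answer_fn` callables below are hypothetical stand-ins for a sampled LLM call and an answer-extraction step, not part of any particular library):

```python
from collections import Counter
from typing import Callable

def majority_vote(
    prompt: str,
    generate_fn: Callable[[str], str],        # samples one full LLM response
    extract_answer_fn: Callable[[str], str],  # pulls the final answer out of a response
    n_samples: int = 8,
) -> str:
    # Sample several independent responses (at a nonzero temperature) ...
    answers = [extract_answer_fn(generate_fn(prompt)) for _ in range(n_samples)]
    # ... and return whichever final answer occurs most often.
    return Counter(answers).most_common(1)[0][0]
```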

I highly recommend the Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters paper that I described in my previous Noteworthy AI Research Papers of 2024 (Part Two) article (https://magazine.sebastianraschka.com/p/ai-research-papers-2024-part-2) for more details on these different strategies.

Different search-based methods rely on a process-reward-based model to select the best answer. Annotated figure from the LLM Test-Time Compute paper, https://arxiv.org/abs/2408.03314
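To make the selection idea from the figure above concrete, here is a best-of-N sketch in the same spirit (my own simplification, not the specific methods from the paper), where a hypothetical `reward_fn` scores each sampled candidate and the highest-scoring one is kept:

```python
from typing import Callable

def best_of_n(
    prompt: str,
    generate_fn: Callable[[str], str],       # samples one candidate response
    reward_fn: Callable[[str, str], float],  # scores a (prompt, response) pair
    n_samples: int = 8,
) -> str:
    candidates = [generate_fn(prompt) for _ in range(n_samples)]
    # Keep whichever candidate the (process- or outcome-based) reward model rates highest.
    return max(candidates, key=lambda response: reward_fn(prompt, response))
```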

The DeepSeek R1 technical report states that its models do not use inference-time scaling. However, this technique is often implemented at the application layer on top of the LLM, so it is possible that DeepSeek applies it within their app.

I suspect that OpenAI’s o1 and o3 models use inference-time scaling, which would explain why they are relatively expensive compared to models like GPT-4o. In addition to inference-time scaling, o1 and o3 were likely trained using RL pipelines similar to those used for DeepSeek R1. More on reinforcement learning in the next two sections below.

One of my personal highlights from the DeepSeek R1 paper is their discovery that reasoning emerges as a behavior from pure reinforcement learning (RL). Let’s explore what this means in more detail.

As outlined earlier, DeepSeek developed three types of R1 models. The first, DeepSeek-R1-Zero, was built on top of the DeepSeek-V3 base model, a standard pre-trained LLM they released in December 2024. Unlike typical RL pipelines, where supervised fine-tuning (SFT) is applied before RL, DeepSeek-R1-Zero was trained exclusively with reinforcement learning without an initial SFT stage, as highlighted in the diagram below.

The development process of the DeepSeek-R1-Zero model.

Still, this RL process is similar to the commonly used RLHF approach, which is typically applied to preference-tune LLMs. (I covered RLHF in more detail in my article, LLM Training: RLHF and Its Alternatives.) However, as mentioned above, the key difference in DeepSeek-R1-Zero is that they skipped the supervised fine-tuning (SFT) stage for instruction tuning. This is why they refer to it as “pure” RL. (Although, RL in the context of LLMs differs significantly from traditional RL, which is a topic for another time.)

For the rewards, instead of using a reward model trained on human preferences, they employed two types of rewards: an accuracy reward and a format reward.

  • The accuracy reward uses the LeetCode compiler to verify coding answers and a deterministic system to evaluate mathematical responses.

  • The format reward relies on an LLM judge to ensure responses follow the expected format, such as placing reasoning steps inside <think> tags. (A simplified sketch of both reward types follows below.)
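The sketch below is a deliberately simplified, rule-based illustration of what these two reward signals could look like. It is my own example under assumed tag names (`<think>`, `<answer>`), not DeepSeek’s implementation, which uses a compiler for coding questions, a deterministic checker for math, and an LLM judge for the format check.

```python
import re

def format_reward(response: str) -> float:
    # Toy format check: did the model wrap its reasoning in <think>...</think>
    # and its final answer in <answer>...</answer>? (Tag names are assumed;
    # a regex check stands in for the LLM judge described above.)
    has_think = re.search(r"<think>.+?</think>", response, re.DOTALL) is not None
    has_answer = re.search(r"<answer>.+?</answer>", response, re.DOTALL) is not None
    return 1.0 if (has_think and has_answer) else 0.0

def accuracy_reward(response: str, reference_answer: str) -> float:
    # Toy accuracy check for math-style questions: exact string match of the
    # extracted answer. The real setup uses a deterministic math checker and,
    # for coding questions, a compiler plus test cases.
    match = re.search(r"<answer>(.+?)</answer>", response, re.DOTALL)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == reference_answer.strip() else 0.0

def total_reward(response: str, reference_answer: str) -> float:
    # The unweighted sum is purely illustrative; how the two signals are
    # actually combined and weighted is not specified here.
    return accuracy_reward(response, reference_answer) + format_reward(response)
```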

Surprisingly, this approach was enough for the LLM to develop basic reasoning skills. The researchers observed an “Aha!” moment, where the model began generating reasoning traces as part of its responses despite not being explicitly trained to do so, as shown in the figure below.

A figure from the DeepSeek R1 technical report (https://arxiv.org/abs/2501.12948) showing the emergence of the “Aha” moment.

While R1-Zero is not a top-performing reasoning model, it does demonstrate reasoning capabilities by generating intermediate “thinking” steps, as shown in the figure above. This confirms that it is possible to develop a reasoning model using pure RL, and the DeepSeek team was the first to demonstrate (or at least publish) this approach.

Next, let’s look at the development of DeepSeek-R1, DeepSeek’s flagship reasoning model, which serves as a blueprint for building reasoning models. This model improves upon DeepSeek-R1-Zero by incorporating additional supervised fine-tuning (SFT) and reinforcement learning (RL) to boost its reasoning performance.

Note that it is actually common to include an SFT stage before RL, as seen in the standard RLHF pipeline. OpenAI’s o1 was likely developed using a similar approach.

The development process of the DeepSeek-R1 model.

As shown in the diagram above, the DeepSeek team used DeepSeek-R1-Zero to generate what they call “cold-start” SFT data. The term “cold start” refers to the fact that this data was produced by DeepSeek-R1-Zero, which itself had not been trained on any supervised fine-tuning (SFT) data.

Using this cold-start SFT data, DeepSeek then trained the model via instruction fine-tuning, followed by another reinforcement learning (RL) stage. This RL stage kept the same accuracy and format rewards used in DeepSeek-R1-Zero’s RL process. However, they added a consistency reward to prevent language mixing, which occurs when the model switches between multiple languages within a response.

The RL stage was followed by another round of SFT data collection. In this phase, the most recent model checkpoint was used to generate 600K Chain-of-Thought (CoT) SFT examples, while an additional 200K knowledge-based SFT examples were created using the DeepSeek-V3 base model.

These 600K + 200K SFT samples were then used for another round of RL. In this stage, they again used rule-based methods for accuracy rewards on math and coding questions, while human preference labels were used for other question types.

The final model, DeepSeek-R1, has a noticeable performance boost over DeepSeek-R1-Zero thanks to the additional SFT and RL stages, as shown in the table below.

Benchmark comparison of OpenAI o1 and DeepSeek R1 models. Annotated figure from the DeepSeek-R1 technical report (https://arxiv.org/abs/2501.12948).

So far, we have covered three key approaches to building and improving reasoning models:

1. Inference-time scaling, a technique that improves reasoning capabilities without training or otherwise modifying the underlying model.

2. Pure reinforcement learning (RL) as in DeepSeek-R1-Zero, which showed that reasoning can emerge as a learned behavior without supervised fine-tuning.

3. Supervised fine-tuning (SFT) plus RL, which led to DeepSeek-R1, DeepSeek’s flagship reasoning model.

So, what’s left? Model “distillation.”

Surprisingly, DeepSeek also released smaller models trained via a process they call distillation. However, in the context of LLMs, distillation does not necessarily follow the classical knowledge distillation approach used in deep learning. Traditionally, in knowledge distillation (as briefly described in Chapter 6 of my Machine Learning Q and AI book), a smaller student model is trained on both the logits of a larger teacher model and a target dataset.

Instead, here distillation refers to instruction fine-tuning smaller LLMs, such as Llama 8B and 70B and Qwen 2.5 models (0.5B to 32B), on an SFT dataset generated by larger LLMs. Specifically, these larger LLMs are DeepSeek-V3 and an intermediate checkpoint of DeepSeek-R1. In fact, the SFT data used for this distillation process is the same dataset that was used to train DeepSeek-R1, as described in the previous section.
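To make the distinction concrete, here is a minimal PyTorch-style sketch of the two loss formulations (my own illustration, not DeepSeek’s code): classical knowledge distillation matches the student’s predictions to the teacher’s soft targets (logits), whereas the “distillation” described here is ordinary next-token cross-entropy on text generated by the larger model.

```python
import torch.nn.functional as F
from torch import Tensor

def classical_kd_loss(
    student_logits: Tensor,  # (batch, vocab)
    teacher_logits: Tensor,  # (batch, vocab)
    labels: Tensor,          # (batch,) ground-truth token ids
    temperature: float = 2.0,
    alpha: float = 0.5,
) -> Tensor:
    # Soft-target term: KL divergence between the temperature-scaled
    # student and teacher distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard-target term: ordinary cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def sft_style_distillation_loss(
    student_logits: Tensor,         # (batch, seq_len, vocab)
    teacher_generated_ids: Tensor,  # (batch, seq_len) tokens of teacher-written responses
) -> Tensor:
    # "Distillation" in the DeepSeek-R1 sense: the student never sees the
    # teacher's logits, only its generated text, and is trained with plain
    # next-token cross-entropy on that text, i.e., standard SFT.
    # (Shifting the targets for next-token alignment is omitted for brevity.)
    return F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        teacher_generated_ids.reshape(-1),
    )
```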

To clarify this process, I have highlighted the distillation portion in the diagram below.

The development process of the DeepSeek-R1-Distill models.

Why did they develop these distilled models? In my opinion, there are two key reasons:

1. Smaller models are more efficient. This means they are cheaper to run, and they can also run on lower-end hardware, which makes them especially interesting for many researchers and tinkerers like me.

2. A case study in pure SFT. These distilled models serve as an interesting benchmark, showing how far pure supervised fine-tuning (SFT) can get a model without reinforcement learning.

The table below compares the performance of these distilled models against other well-known models, as well as DeepSeek-R1-Zero and DeepSeek-R1.

Benchmark comparison of distilled versus non-distilled models. Annotated figure from the DeepSeek-R1 technical report (https://arxiv.org/abs/2501.12948).

As we can see, the distilled models are noticeably weaker than DeepSeek-R1, but they are surprisingly strong relative to DeepSeek-R1-Zero, despite being orders of magnitude smaller. It’s also interesting to note how well these models perform compared to o1-mini (I suspect o1-mini itself might be a similarly distilled version of o1).

Before wrapping up this section with a conclusion, there’s one more interesting comparison worth mentioning. The DeepSeek team tested whether the emergent reasoning behavior seen in DeepSeek-R1-Zero could also appear in smaller models. To investigate this, they applied the same pure RL approach from DeepSeek-R1-Zero directly to Qwen-32B.

The results of this experiment are summarized in the table below, where QwQ-32B-Preview serves as a reference reasoning model based on Qwen 2.5 32B developed by the Qwen team (I think the training details were never disclosed). This comparison provides some additional insights into whether pure RL alone can induce reasoning capabilities in models much smaller than DeepSeek-R1-Zero.

Benchmark comparison of distillation and RL on a smaller 32B model. Annotated figure from the DeepSeek-R1 technical report (https://arxiv.org/abs/2501.12948).

Interestingly, the results suggest that distillation is far more effective than pure RL for smaller models. This aligns with the idea that RL alone may not be sufficient to induce strong reasoning abilities in models of this scale, whereas SFT on high-quality reasoning data can be a more effective strategy when working with small models.

For completeness, it would have been useful to see additional comparisons in the table:

1. Qwen-32B trained with SFT + RL, similar to how DeepSeek-R1 was developed. This would help determine how much improvement can be made, compared to pure RL and pure SFT, when RL is combined with SFT.

2. DeepSeek-V3 trained with pure SFT, similar to how the distilled models were created. This would allow for a direct comparison to see how effective RL + SFT is over pure SFT.

In this section, we explored four different strategies for building and improving reasoning models:

1. Inference-time scaling requires no additional training but increases inference costs, making large-scale deployment more expensive as the number of users or query volume grows. Still, it remains a no-brainer for improving the performance of already strong models. I strongly suspect that o1 leverages inference-time scaling, which helps explain why it is more expensive on a per-token basis compared to DeepSeek-R1.

2. Pure RL is interesting for research purposes because it provides insights into reasoning as an emergent behavior. However, in practical model development, RL + SFT is the preferred approach, as it leads to stronger reasoning models. I strongly suspect that o1 was trained using RL + SFT as well. More precisely, I believe o1 starts from a weaker, smaller base model than DeepSeek-R1 but compensates with RL + SFT and inference-time scaling.

3. As mentioned above, RL + SFT is the key approach for building high-performance reasoning models. DeepSeek-R1 is a nice blueprint showing how this can be done.

4. Distillation is an attractive approach, especially for creating smaller, more efficient models. However, the limitation is that distillation does not drive innovation or produce the next generation of reasoning models. For instance, distillation always depends on an existing, stronger model to generate the supervised fine-tuning (SFT) data.

One interesting direction I expect to see next is combining RL + SFT (approach 3) with inference-time scaling (approach 1). This is likely what OpenAI o1 is doing, except it’s probably based on a weaker base model than DeepSeek-R1, which explains why DeepSeek-R1 performs so well while remaining relatively cheap at inference time.

In recent weeks, many people have asked for my thoughts on the DeepSeek-R1 models. In short, I think they are an awesome achievement. As a research engineer, I particularly appreciate the detailed technical report, which provides insights into their methodology that I can learn from.

One of the most fascinating takeaways is how reasoning emerged as a behavior from pure RL. And it’s impressive that DeepSeek has open-sourced their models under a permissive MIT open-source license, which has even fewer restrictions than Meta’s Llama models.

How does it compare to o1?

Is DeepSeek-R1 better than o1? I’d say it’s roughly in the same ballpark. However, what stands out is that DeepSeek-R1 is more efficient at inference time. This suggests that DeepSeek likely invested more heavily in the training process, while OpenAI may have relied more on inference-time scaling for o1.

That said, it’s difficult to compare o1 and DeepSeek-R1 directly because OpenAI has not disclosed much about o1. For instance, we don’t know:

  • Is o1 also a Mixture of Experts (MoE)?

  • How big is o1?

  • Could o1 just be a slightly refined version of GPT-4o with minimal RL + SFT and only extensive inference-time scaling?

Without knowing these details, a direct comparison remains an apples-to-oranges comparison.

The cost of training DeepSeek-R1

Another point of discussion has been the cost of developing DeepSeek-R1. Some have mentioned a ~$6 million training cost, but they likely conflated DeepSeek-V3 (the base model released in December last year) and DeepSeek-R1.

The $6 million estimate is based on an assumed $2 per GPU hour and the number of GPU hours required for the final training run of DeepSeek-V3, which was originally discussed back in December 2024.
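For context, the arithmetic behind that figure is simple. The DeepSeek-V3 technical report cites roughly 2.8 million H800 GPU hours for the final training run (that number comes from the V3 report, not from this article), so:

~2.8M GPU hours × $2 per GPU hour ≈ $5.6 million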

However, the DeepSeek team has never disclosed the exact GPU hours or development cost for R1, so any cost estimates remain pure speculation.

Either way, ultimately, DeepSeek-R1 is a major milestone in open-weight reasoning models, and its efficiency at inference time makes it an interesting alternative to OpenAI’s o1.

Developing a DeepSeek-R1-level reasoning model likely requires hundreds of thousands to millions of dollars, even when starting with an open-weight base model like DeepSeek-V3. This can feel discouraging for researchers or engineers working with limited budgets.

The good news: Distillation can go a long way

Fortunately, model distillation offers a more cost-effective alternative. The DeepSeek team demonstrated this with their R1-distilled models, which achieve surprisingly strong reasoning performance despite being significantly smaller than DeepSeek-R1. However, even this approach isn’t entirely cheap. Their distillation process used 800K SFT samples, which requires substantial compute.

Interestingly, just a few days before DeepSeek-R1 was released, I came across an article about Sky-T1, a fascinating project where a small team trained an open-weight 32B model using only 17K SFT samples. The total cost? Just $450, which is less than the registration fee for most AI conferences.

This example highlights that while large-scale training remains expensive, smaller, focused fine-tuning efforts can still yield impressive results at a fraction of the cost.

Figure from the “Sky-T1: Train your own O1 preview model within $450” article, https://novasky-ai.github.io/posts/sky-t1/

According to their benchmarks, Sky-T1 performs roughly on par with o1, which is impressive given its low training cost.

Pure RL on a budget: TinyZero

While Sky-T1 focused on model distillation, I also came across some interesting work in the “pure RL” space. One notable example is TinyZero, a 3B-parameter model that replicates the DeepSeek-R1-Zero approach (side note: it costs less than $30 to train).

Surprisingly, even at just 3B parameters, TinyZero exhibits some emergent self-verification abilities, which supports the idea that reasoning can emerge through pure RL, even in small models.

The TinyZero repository mentions that a research report is still a work in progress, and I’ll definitely be keeping an eye out for further details.

A figure from the TinyZero repository (https://github.com/Jiayi-Pan/TinyZero) showing that the model is capable of self-verification. (It would have been interesting to see the response of the base model in comparison.)

The two projects mentioned above demonstrate that interesting work on reasoning models is possible even with limited budgets. While both approaches replicate methods from DeepSeek-R1, one focusing on pure RL (TinyZero) and the other on pure SFT (Sky-T1), it would be fascinating to explore how these ideas can be extended further.

Beyond Traditional SFT: Journey Learning

One particularly interesting approach I came across last year is described in the paper O1 Replication Journey: A Strategic Progress Report – Part 1. Despite its title, the paper does not actually replicate o1. Instead, it introduces a different way to improve the distillation (pure SFT) process.

The key idea in the paper is “journey learning” as an alternative to “shortcut learning.”

  • Shortcut learning refers to the traditional approach in instruction fine-tuning, where models are trained using only correct solution paths.

  • Journey learning, on the other hand, also includes incorrect solution paths, allowing the model to learn from mistakes.

This approach is somewhat related to the self-verification abilities observed in TinyZero’s pure RL training, but it focuses on improving the model entirely through SFT. By exposing the model to incorrect reasoning paths and their corrections, journey learning may also reinforce self-correction abilities, potentially making reasoning models more reliable. A rough sketch of the idea follows below.
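As a sketch of the data-construction difference (the field names and the worked example are my own illustration, not taken from the paper):

```python
# Hypothetical SFT examples contrasting shortcut learning and journey learning.

shortcut_example = {
    "prompt": "Solve: 12 * (3 + 4) = ?",
    # Only the clean, correct solution path is shown to the model.
    "target": "3 + 4 = 7, and 12 * 7 = 84. The answer is 84.",
}

journey_example = {
    "prompt": "Solve: 12 * (3 + 4) = ?",
    # The target also serializes a wrong turn and its correction, so the model
    # can learn to recognize and recover from mistakes.
    "target": (
        "12 * 3 + 4 = 40. Wait, that ignores the parentheses; let me redo this. "
        "3 + 4 = 7, and 12 * 7 = 84. The answer is 84."
    ),
}
```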

Journey learning, as opposed to traditional shortcut learning, includes wrong solution paths in the SFT data. Annotated figure from the O1 Replication Journey: A Strategic Progress Report – Part 1 paper (https://arxiv.org/abs/2410.18982).

This could be an exciting direction for future work, particularly for low-budget reasoning model development, where RL-based approaches may be computationally impractical.

Anyways, a lot of interesting work is currently happening on the reasoning model front, and I’m sure we will see a lot more exciting work in the upcoming months!

This magazine is a personal passion project. For those who wish to support me, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book. (I am confident that you’ll get a lot out of this book, as it explains how LLMs work in a level of detail not found anywhere else.)

If you read the book and have a few minutes to spare, I’d really appreciate a brief review. It helps us authors a lot!

Your support means a great deal! Thank you!
