Open-R1: Update #1

Published February 2, 2025


It’s been two weeks since the release of DeepSeek R1 and just a week since we started the open-r1 project to replicate the missing pieces, namely the training pipeline and the synthetic data. This post summarizes:

  • the progress of Open-R1 to replicate the DeepSeek-R1 pipeline and dataset
  • what we learned about DeepSeek-R1 and discussions around it
  • cool projects the community has built since the release of DeepSeek-R1

It should serve both as an update on the project and as a collection of interesting resources around DeepSeek-R1.

Progress after 1 Week

Let’s start by looking at the progress we made on Open-R1. We started Open-R1 just one week ago, and people across our teams as well as the wider community have come together to work on it, so we have some progress to report.

Evaluation

The first step in reproduction is to verify that we can match the evaluation scores. We are able to reproduce DeepSeek's reported results on the MATH-500 benchmark:

| Model | MATH-500 (HF lighteval) | MATH-500 (DeepSeek reported) |
| --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-1.5B | 81.6 | 83.9 |
| DeepSeek-R1-Distill-Qwen-7B | 91.8 | 92.8 |
| DeepSeek-R1-Distill-Qwen-14B | 94.2 | 93.9 |
| DeepSeek-R1-Distill-Qwen-32B | 95.0 | 94.3 |
| DeepSeek-R1-Distill-Llama-8B | 85.8 | 89.1 |
| DeepSeek-R1-Distill-Llama-70B | 93.4 | 94.5 |

You can find the instructions to run these evaluations in the open-r1 repository.
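For a quick sanity check of the gaps, here is a minimal snippet that computes the deltas from the table above (the numbers are simply copied from the table, nothing is re-run):

# MATH-500 scores from the table above: (HF lighteval, DeepSeek reported)
scores = {
    "DeepSeek-R1-Distill-Qwen-1.5B": (81.6, 83.9),
    "DeepSeek-R1-Distill-Qwen-7B": (91.8, 92.8),
    "DeepSeek-R1-Distill-Qwen-14B": (94.2, 93.9),
    "DeepSeek-R1-Distill-Qwen-32B": (95.0, 94.3),
    "DeepSeek-R1-Distill-Llama-8B": (85.8, 89.1),
    "DeepSeek-R1-Distill-Llama-70B": (93.4, 94.5),
}
for model, (ours, reported) in scores.items():
    print(f"{model}: ours={ours}, reported={reported}, delta={ours - reported:+.1f}")
# largest gap is -3.3 points (Llama-8B); all other models are within ~2.3 points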

One observation we have made is the enormous length of the generations from the DeepSeek models, which makes even evaluating them challenging. Here we show the distribution of DeepSeek-R1 response lengths in the OpenThoughts dataset:

The distribution of R1’s responses shows that they are on average very long, with the average response being 6,000 tokens long and some responses containing more than 20,000 tokens. It is worth noting that an average page contains ~500 words and one token is on average slightly less than a word, which means many responses are over 10 pages long. (Source: https://x.com/gui_penedo/status/1884953463051649052)

The length of responses will make GRPO training challenging, as we will have to generate long completions, which will require a significant proportion of GPU memory to store the activations / gradients for the optimization step.
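If you want to reproduce this kind of length analysis yourself, here is a minimal sketch. The "response" column name is a placeholder (adapt it to the schema of the dataset you are inspecting), and the distilled 7B tokenizer stands in as an approximation of R1's tokenizer:

from datasets import load_dataset
from transformers import AutoTokenizer
import numpy as np

# Any dataset with a column holding R1 completions works here.
dataset = load_dataset("open-thoughts/OpenThoughts-114k", split="train")
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")

def count_tokens(example):
    # "response" is a placeholder column name; adjust it to your dataset's schema.
    return {"num_tokens": len(tokenizer(example["response"]).input_ids)}

lengths = dataset.map(count_tokens, num_proc=8)["num_tokens"]
print(f"mean: {np.mean(lengths):.0f} tokens, p99: {np.percentile(lengths, 99):.0f} tokens")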

In order to share our progress publicly, we have created an open-r1 evaluation leaderboard Space so the community can follow our reproduction efforts.

Training Pipeline

Following the launch of Open R1, GRPO (Group Relative Policy Optimization) was integrated into the latest TRL release (version 0.14). This integration enables training any model with one or more reward functions or reward models. The GRPO implementation integrates with DeepSpeed ZeRO 1/2/3 for parallelized training that scales to many GPUs, and uses vLLM for fast generation, since generation is the primary bottleneck in online training methods. Training a model with GRPO and a simple length-based reward looks like this:

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

# Dummy reward: rewards completions that are close to 20 characters
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO", logging_steps=10)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
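Because GRPOTrainer accepts arbitrary Python callables as reward functions, it is straightforward to plug in R1-style rewards. As a sketch (the <think>/<answer> tag convention here is an illustration, not the exact reward used in the DeepSeek-R1 recipe), a format reward could look like this:

import re

def format_reward(completions, **kwargs):
    # Reward 1.0 if the completion wraps its reasoning in <think>...</think>
    # followed by an <answer>...</answer> block, otherwise 0.0.
    pattern = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)
    return [1.0 if pattern.search(completion) else 0.0 for completion in completions]

# Multiple reward functions can be combined by passing a list:
# trainer = GRPOTrainer(model=..., reward_funcs=[format_reward, reward_len], ...)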

There are still some limitations around high memory usage, and work is underway to profile and reduce it.

Synthetic Data Generation

One of the most exciting findings of the R1 report was that the main model can be used to generate synthetic reasoning traces, and that smaller models fine-tuned on this data see performance gains similar to those of the main model. So naturally we want to re-create the synthetic reasoning dataset as well, so that the community can fine-tune other models on it.

With a model as big as R1, the main challenge is scaling up generation so that it is both efficient and fast. We spent a week tinkering with various setups and configurations.

The model fits on two 8xH100 nodes, so naturally we started experimenting with that setup and used vLLM as the inference server. However, we quickly noticed that this configuration is not ideal: the throughput is sub-optimal and only allows for 8 parallel requests, because the GPU KV cache fills up too quickly. When the cache fills up, requests that use a lot of cache are preempted, and if the config uses PreemptionMode.RECOMPUTE those requests are scheduled again later, once more VRAM is available.

We then switched to a setup with 4x 8xH100 nodes, so 32 GPUs in total. This leaves enough spare VRAM for 32 requests running in parallel with barely any of them getting rescheduled due to 100% cache utilization.

Originally we queried the vLLM servers with batches of requests, but we quickly noticed that stragglers in the batches caused GPU utilization to vary, since a new batch would only start processing once the last sample of the previous batch was done. Switching from batched inference to streaming helped stabilize GPU utilization significantly:

(Figure: GPU utilization of the vLLM servers before and after switching from batched to streaming requests.)

It only required changing the code that sends requests to the vLLM servers. The code for batched inference:

# send requests in batches of 500
for batch in batch_generator(dataset, bs=500):
    active_tasks = set()
    for row in batch:
        task = asyncio.create_task(send_requests(row))
        active_tasks.add(task)
    # wait for the whole batch to finish before starting the next one
    if active_tasks:
        await asyncio.gather(*active_tasks)

The new code for streaming requests:

active_tasks = set()
for row in dataset:
    # keep the total number of active requests under 500
    while len(active_tasks) >= 500:
        done, active_tasks = await asyncio.wait(
            active_tasks,
            return_when=asyncio.FIRST_COMPLETED
        )

    task = asyncio.create_task(send_requests(row))
    active_tasks.add(task)

# wait for all remaining tasks to complete
if active_tasks:
    await asyncio.gather(*active_tasks)
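For completeness, the send_requests coroutine in the snippets above can be as simple as a call to the vLLM server's OpenAI-compatible endpoint. This is a hedged sketch rather than the actual Open-R1 inference code; the base_url, model name, the row["problem"] field, and the sampling parameters are assumptions:

from openai import AsyncOpenAI

# vLLM exposes an OpenAI-compatible API; point the client at the server.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def send_requests(row):
    # "problem" is a placeholder for whatever column holds the prompt in each dataset row.
    response = await client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1",
        messages=[{"role": "user", "content": row["problem"]}],
        max_tokens=16384,
        temperature=0.6,
    )
    return response.choices[0].message.content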

We are generating at a fairly constant rate, but we might still explore whether, for example, swapping to the CPU cache is a better strategy when long queries get preempted.

The current inference code can be found here.

Outreach

There has been wide interest in open-r1, including from the media, so various team members have been in the news in the past week:

Other mentions: Washington Post, Financial Times, Financial Times, Fortune, Fortune, The Verge, Financial Review, Tech Crunch, Die Zeit, Financial Times, New York Times, The Wall Street Journal, EuroNews, Barrons, New York Times, Vox, Nature, SwissInfo, Handelsblatt, Business Insider, IEEE Spectrum, MIT Tech Review, LeMonde.

What have we learned about DeepSeek-R1?

While the community is still digesting DeepSeek-R1’s results and report, DeepSeek has captured broader public attention just two weeks after its release.

Responses to R1

After a relatively calm first week post-release, the second week saw significant market reactions, prompting responses from multiple AI research labs:

In parallel, several companies worked on providing the DeepSeek models through various platforms (non-exhaustive list):

DeepSeek V3 Training Compute

There has been a lot of interest in the claimed cost of training V3/R1. While the exact number probably doesn’t matter all that much, people have worked on back-of-the-envelope calculations to verify the order of magnitude. TL;DR: the numbers seem to be in the right order of magnitude, as seen in these discussions:

As many groups are working on reproducing the training pipeline, we’ll get more evidence on what training efficiency is achievable for models like this.
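For a rough sense of the arithmetic involved, here is a hedged back-of-the-envelope sketch using the standard 6·N·D approximation. The activated-parameter count, token count, and reported GPU-hours come from the DeepSeek-V3 technical report; the per-GPU peak throughput and utilization are assumptions:

# Rough 6*N*D estimate of DeepSeek-V3 pre-training compute (a sketch, not an audit).
activated_params = 37e9          # V3 activates ~37B of its 671B parameters per token
tokens = 14.8e12                 # ~14.8T pre-training tokens
train_flops = 6 * activated_params * tokens   # ~3.3e24 FLOPs

peak_flops_per_gpu = 1e15        # assumed ~1 PFLOP/s per H800 (BF16-class peak)
utilization = 0.35               # assumed model FLOPs utilization

gpu_hours = train_flops / (peak_flops_per_gpu * utilization) / 3600
print(f"{gpu_hours / 1e6:.1f}M GPU-hours")
# ~2.6M under these assumptions, vs the ~2.8M H800 GPU-hours DeepSeek reports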

Training Dataset

Last week, speculation surfaced that DeepSeek might have used OpenAI outputs to train its models; see for example the Financial Times. However, it is unclear at this point what the consequences of these allegations will be.

Community

The open source community has been extremely active around DeepSeek-R1, and many people have started building interesting projects around the model.

Projects

There have been a number of projects that try to reproduce the core learning dynamics at a smaller scale, so you can test the basic learning principles at home.

(Figure: results of TinyZero showing the model expanding its reasoning.)
(Figure: a similar plot from researchers at HKUST showing how the model generates longer and longer reasoning traces as training continues.)

Datasets

The community has been busy on a number of dataset efforts related to R1, some highlights include:

  • bespokelabs/Bespoke-Stratos-17k: a replication of the Berkeley Sky-T1 data pipeline that uses DeepSeek-R1 to create a dataset of questions, reasoning traces, and answers. This data was subsequently used to fine-tune 7B and 32B Qwen models using a distillation approach similar to the R1 paper.
  • open-thoughts/OpenThoughts-114k: an "Open synthetic reasoning dataset with 114k high-quality examples covering math, science, code, and puzzles". Part of the Open Thoughts effort.
  • cognitivecomputations/dolphin-r1: an 800k-sample dataset with completions from DeepSeek-R1 and Gemini Flash, plus 200k samples from Dolphin chat, with the goal of helping train R1-style models.
  • ServiceNow-AI/R1-Distill-SFT: Currently at 17,000 samples, an effort by the ServiceNow Language Models lab to create data to support Open-R1 efforts.
  • NovaSky-AI/Sky-T1_data_17k: a dataset used to train Sky-T1-32B-Preview, part of a fairly early effort to replicate o1-style reasoning. The model was trained on this dataset for less than $450. This blog post goes into more detail.
  • Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B: This dataset extends Magpie, an approach to generating instruction data without starting prompts, to include reasoning in the responses. The instructions are generated by Llama 3.1 70B Instruct and Llama 3.3 70B Instruct, and the responses are generated by DeepSeek-R1-Distill-Llama-70B.

This list covers only a small number of the reasoning- and problem-solving-related datasets on the Hub. We’re excited to see what other datasets the community builds in the coming weeks.

What's next?

We are just getting started: we want to finish the training pipeline, try it on smaller models, and use the scaled-up inference pipeline to generate high-quality datasets. If you want to contribute, check out the open-r1 repository on GitHub or follow the Hugging Face open-r1 org.

Community

thanks for the update. FYI many of the Datasets links look broken.

Article author: Fixed, thanks!

Could you add the following models to your R1 distilled leaderboard please? :D Would be great to compare all of the distilled models in one place!
https://huggingface.co/open-thoughts/OpenThinker-7B
https://huggingface.co/bespokelabs/Bespoke-Stratos-7B
https://huggingface.co/bespokelabs/Bespoke-Stratos-32B

can we also work together in regards to human alignment?
https://huggingface.co/blog/etemiz/aha-indicator

This is an interesting post from HF - a blend of straight training goodness, some speculation, and some opinion.

I don't know whether I love this, or hate it - good work, team!
