arxiv:2404.07965

Rho-1: Not All Tokens Are What You Need

Published on Apr 11 · Submitted by akhaliq on Apr 12
#1 Paper of the day
Abstract

Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a corpus are equally important for language model training". Our initial analysis delves into the token-level training dynamics of language models, revealing distinct loss patterns for different tokens. Leveraging these insights, we introduce a new language model called Rho-1. Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that are aligned with the desired distribution. This approach involves scoring pretraining tokens using a reference model, and then training the language model with a focused loss on tokens with higher excess loss. When continually pretraining on the 15B-token OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% across 9 math tasks. After fine-tuning, Rho-1-1B and 7B achieved state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens. Furthermore, when pretraining on 80B general tokens, Rho-1 achieves a 6.8% average improvement across 15 diverse tasks, increasing both the efficiency and performance of language model pre-training.
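For readers who want the core mechanism at a glance, here is a minimal sketch of Selective Language Modeling as the abstract describes it: score each token by its excess loss (training-model loss minus reference-model loss) and backpropagate only through the highest-scoring fraction. This is an illustrative reading in PyTorch, not the authors' code; the function name and the `keep_ratio` hyperparameter are assumptions.

```python
import torch
import torch.nn.functional as F

def slm_loss(train_logits, ref_logits, targets, keep_ratio=0.6):
    """Sketch of Selective Language Modeling (SLM).

    Trains only on the top `keep_ratio` fraction of tokens, ranked by
    excess loss = training loss - reference loss. Illustrative only.
    """
    vocab = train_logits.size(-1)
    # Per-token cross-entropy under the model being trained.
    train_ce = F.cross_entropy(
        train_logits.view(-1, vocab), targets.view(-1), reduction="none")
    # Per-token cross-entropy under the frozen reference model.
    with torch.no_grad():
        ref_ce = F.cross_entropy(
            ref_logits.view(-1, vocab), targets.view(-1), reduction="none")
    # Excess loss is high for tokens the current model has not yet learned
    # but that the reference (trained on the desired distribution) finds easy.
    excess = (train_ce - ref_ce).detach()
    k = max(1, int(keep_ratio * excess.numel()))
    _, selected = torch.topk(excess, k)
    # Average the training loss over the selected tokens only.
    return train_ce[selected].mean()
```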

Community

Wow. Very promising.

Assuming that you discard 29/30 tokens prior to pretraining: how much compute is saved, if any?

Is it possible to create "concentrated" datasets, i.e., just the 1/30 of tokens that were selected, or is the process tightly coupled to training?

Not everything that comes out of the LLM hype is new ;)

The paper fails to cite, e.g., the "Token Dropping BERT" paper (https://huggingface.co/papers/2203.13240) and the paper revisiting the token-dropping method (https://huggingface.co/papers/2305.15273). Both are, by the way, ACL papers.


Thank you for highlighting these papers! We are happy to learn about the previous successful application of ideas similar to selective language modeling in MLM tasks (i.e., BERT). We will review them thoroughly and discuss their relevance in our revised version.

@librarian-bot recommend

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space.

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

@zhibin-msft @zubingou Couple of questions:

  1. Is there a similar graph to Figure 9 for the Mistral-7B model?
  2. In Fig 12, we see that some words are broken up. Is there a way to force the SLM to use full words instead of tokens?

Thank you for your questions.

  1. Sure, we plan to include the selected tokens for the 7B model in a revision.
  2. The words in Figure 12 are not "broken up"; they are different tokens. The uncertainty of the first word of each sentence is usually higher, which may explain why it was not selected. This could be due to the token's limited relative learnability at the current stage.

Great work! @zhibin-msft @zubingou I have one question. The method clearly favors tokens with lower loss under the reference model and higher loss under the training model, and thus may benefit from the following aspects:

  1. From the perspective of data distribution, the method favors the samples similar to corpus used to train reference model.
  2. From the perspective of whether the samples are learnable, this method prefers learnable samples (e.g., in "1+2=3", the token "3" is more learnable than "2"), that is, samples with lower loss for the reference model.
  3. From the perspective of whether the samples have learnable space for the training model, this method favors samples with greater learnable space, that is, samples with higher loss for the training model.
  4. From the perspective of overfitting, this method disfavors the samples with lower loss for the training model, thus alleviating the overfitting problem.

I wonder which one plays the leading role in your method? The 1st one actually performs data selection (though at the token level), which is similar to Importance Resampling [1]. The 2nd one has been widely researched and practiced in the math/reasoning field recently. The 3rd one improves training efficiency, which is similar to online data mixing works [2] (dataset-level rather than token-level). The 4th one reminds me of research on deduplication [3] and repeated data [4].
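To make the four perspectives concrete, here is a toy illustration (all loss values made up) of how the excess-loss score S = train_loss - ref_loss treats each case; the mapping of cases to perspectives is our interpretation, not the paper's:

```python
# Toy illustration (made-up loss values) of the excess-loss score
# S = train_loss - ref_loss for the four perspectives above.
cases = {
    "1. off-distribution (hard for reference)":       (3.0, 3.2),  # S ~ 0: skipped
    "2. learnable, in-distribution, not yet learned": (3.0, 0.5),  # S high: selected
    "3. high train loss but also high ref loss":      (2.0, 1.8),  # S small: mostly skipped
    "4. already learned (low train loss)":            (0.6, 0.5),  # S ~ 0: skipped
}
for name, (train_loss, ref_loss) in cases.items():
    print(f"{name:48s} S = {train_loss - ref_loss:+.1f}")
```

In this reading, only tokens that are both unlearned and close to the reference distribution score highly; high training loss alone (case 3) is not enough.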

Once again, great work that inspires me a lot!

[1] https://openreview.net/pdf?id=uPSQv0leAu
[2] https://openreview.net/pdf?id=9Tze4oy4lw
[3] https://openreview.net/pdf?id=CG0L2PFrb1
[4] https://arxiv.org/pdf/2205.10487


Good question!

We believe the primary benefit comes from dynamically selecting tokens that are closer to the reference distribution, which inherently includes choosing tokens that are more learnable relative to the target task.

Directly using the second approach (filtering data based on reference loss alone) also yields some improvement, but it may introduce a bias toward over-selecting already-learned, low-loss tokens.

We think that the token curriculum in points 3 and 4 may provide some benefits, but selecting data based solely on the training model's own loss might have a lower potential for future improvement.

Overall, our method is still simple and intuitive, and we believe there is significant potential for further development in this area!

Very interesting paper! I had a similar idea before, and it is very nice to read a related, comprehensive study.

I am wondering whether KL divergence makes more sense here than excess loss. This is similar to distillation in some sense. Excess loss suggests that, after enough training, the loss value should converge to the reference model's loss value. But what about the distribution itself? If the distribution should always converge to the reference model's distribution, then KL divergence might be a better metric.
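For reference, a minimal sketch of this suggestion: score each position by the per-token KL divergence between the reference and training distributions rather than by excess loss. The PyTorch framing, the function name, and the choice of direction KL(ref || train) are assumptions, not anything from the paper or the comment:

```python
import torch
import torch.nn.functional as F

def kl_token_scores(train_logits: torch.Tensor, ref_logits: torch.Tensor) -> torch.Tensor:
    """Per-token KL(ref || train) as an alternative token-selection score.

    Unlike excess loss, this compares whole next-token distributions, so it
    is high wherever the training model still disagrees with the reference
    model, regardless of which target token was actually observed.
    """
    train_logp = F.log_softmax(train_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # KL(p_ref || p_train) = sum_v p_ref(v) * (log p_ref(v) - log p_train(v))
    return (ref_logp.exp() * (ref_logp - train_logp)).sum(dim=-1)
```

Tokens could then be selected by top-k on these scores, exactly as in the excess-loss variant.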

Would you please cite the paper your paper's loss is based on? https://arxiv.org/abs/2206.07137 introduced RhoLoss, which is equivalent to your "excess loss".

You even call your model Rho-1, whereas the loss we introduced is called RhoLoss... which is simply too improbable a coincidence.

So please explain this and correctly attribute your methods.

Thanks!


Thank you very much for bringing this paper to our attention. We were amused and quite surprised by the coincidence of the name, given the closely related research motivation. We had actually just selected the name "rho" from the Greek alphabet. Someday we might run out of names to use!

Although there are many differences in the specific methods and experiments, we are pleased to see that the idea of a reference loss, as in SLM, has been effectively applied at the sample level in that paper. This may also inspire us to continue our token-level research in more applications, such as images, videos, etc.

We will thoroughly review this paper and discuss it in the related work section of our future revisions.

Rho-1: Transforming Language Models with Selective Token Training

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix


Models citing this paper: 17
Datasets citing this paper: 0
Spaces citing this paper: 1
Collections including this paper: 41