
Chandresh Mallick

Chandresh7777777

AI & ML interests

Reinforcement Learning, Generative AI, Natural Language Processing, Image Processing, Prompting, Prompt Engineering

Recent Activity


Organizations

None yet

Chandresh7777777's activity

replied to mkurman's post 9 days ago
replied to mkurman's post 10 days ago

Great idea, brother! I'd like to implement this too, but I'm not sure how to calculate the loss for the blurred (i.e., MASKED) tokens. Did you use a reward model, or KL divergence between the predicted token (whose ground truth has been MASKED) and the neighboring tokens?

reacted to mkurman's post with ❤️ 10 days ago
Blurred-Thoughts Supervised Fine-Tuning (BT-SFT) 🤖

Can we teach a model to think completely on its own without reinforcement learning? Actually, yes.

We can do straightforward supervised fine-tuning using a relatively simple trick: blurring a part of CoT thoughts. But why is this effective?

We observed that models differ in their thinking processes, and fine-tuning one model on another model's thoughts (CoT) can be inefficient: it often results in the model simply memorizing the reasoning rather than learning how to actually think.

I discovered that this process can still be efficient if we clearly indicate when the model should start and stop thinking, uncover only part of the CoT along with the expected answer, and blur the rest of the CoT. This approach lets the model learn only a portion of the thought process while still arriving at the expected answer.
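One plausible way to implement this blurring (a sketch of my interpretation, not necessarily the author's exact method): copy the input tokens into the label tensor, then set a random fraction of the CoT span to the ignore index so those positions contribute no loss. The function name `blur_cot_labels`, the fixed CoT span, and the `blur_ratio` parameter are all assumptions for illustration; the -100 ignore index follows the common HuggingFace/PyTorch causal-LM convention.

```python
import torch
import torch.nn.functional as F

# Standard ignore index used by PyTorch cross_entropy and HF trainers
IGNORE_INDEX = -100

def blur_cot_labels(input_ids, cot_start, cot_end, blur_ratio=0.5, seed=0):
    """Copy input_ids into labels, then 'blur' a random fraction of the
    CoT span [cot_start, cot_end) by setting those labels to IGNORE_INDEX,
    so blurred thought tokens produce no gradient."""
    g = torch.Generator().manual_seed(seed)
    labels = input_ids.clone()
    cot_len = cot_end - cot_start
    n_blur = int(cot_len * blur_ratio)
    # choose which CoT positions to blur
    idx = torch.randperm(cot_len, generator=g)[:n_blur] + cot_start
    labels[idx] = IGNORE_INDEX
    return labels

# Toy example: 10 tokens, with the CoT occupying positions 2..7
input_ids = torch.arange(10)
labels = blur_cot_labels(input_ids, cot_start=2, cot_end=8, blur_ratio=0.5)

# Cross-entropy with ignore_index simply skips the blurred positions,
# so no reward model or KL term is needed under this reading
logits = torch.randn(10, 32)  # fake logits over a vocab of 32
loss = F.cross_entropy(logits, labels, ignore_index=IGNORE_INDEX)
```

Under this reading, the answer to "how is the loss computed for blurred tokens" is that it isn't: they are masked out of the cross-entropy entirely, and the model only receives supervision on the uncovered part of the CoT plus the final answer.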

To see this in action, check out my experimental BT-SFT of the meditsolutions/Llama-3.2-SUN-2.5B-chat model, which was fine-tuned on 151 million tokens from the Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B dataset.

Enjoy! 🚀

PS. If you were curious enough to read this, leave me a comment. It's always nice to chat with open-minded and intelligent people.
  • 3 replies