AI & ML interests

diffusion models, dreambooth

Recent Activity

dreambooth-hackathon's activity

lewtun
posted an update 6 days ago
We outperform Llama 70B with Llama 3B on hard math by scaling test-time compute 🔄

How? By combining step-wise reward models with tree search algorithms :)

We show that smol models can match or exceed the performance of their much larger siblings when given enough "time to think"

We're open sourcing the full recipe and sharing a detailed blog post.

In our blog post we cover:

📈 Compute-optimal scaling: How we implemented DeepMind's recipe to boost the mathematical capabilities of open models at test-time.

🎄 Diverse Verifier Tree Search (DVTS): An unpublished extension we developed to the verifier-guided tree search technique. This simple yet effective method improves diversity and delivers better performance, particularly at large test-time compute budgets.

🧭 Search and Learn: A lightweight toolkit for implementing search strategies with LLMs, built for speed with vLLM.
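
For readers who want a feel for how verifier-guided selection works, here is a rough, self-contained sketch of the simplest strategy in this family (plain best-of-N with a verifier, not the tree search itself), assuming vLLM for generation. The model name is illustrative and the PRM scoring function is a placeholder, not part of search-and-learn's actual API:

```python
from vllm import LLM, SamplingParams

def score_with_prm(question: str, solution: str) -> float:
    # Placeholder for a process reward model: the real recipe scores every
    # reasoning step with a PRM and aggregates the step scores (e.g. min/prod).
    # A trivial heuristic stands in here so the sketch runs end to end.
    return float(r"\boxed{" in solution)

def best_of_n(question: str, llm: LLM, n: int = 16) -> str:
    # Sample n candidate solutions from the small policy model ...
    params = SamplingParams(n=n, temperature=0.8, max_tokens=1024)
    candidates = [o.text for o in llm.generate([question], params)[0].outputs]
    # ... and keep the candidate the verifier scores highest
    return max(candidates, key=lambda sol: score_with_prm(question, sol))

llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")  # illustrative 3B policy model
print(best_of_n("What is 17 * 24? Put your final answer in \\boxed{}.", llm))
```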

Here are the links:

- Blog post: HuggingFaceH4/blogpost-scaling-test-time-compute

- Code: https://github.com/huggingface/search-and-learn

Enjoy!
xianbao
posted an update 4 months ago
With the open-weight release of CogVideoX-5B from THUDM, i.e. the GLM team, the Video Generation Model (how about calling it VGM?) field has officially become the next booming "LLM".

What does the landscape look like? What are the other video generation models? The collection below is all you need.

xianbao/video-generation-models-66c350163c74f60f5c412af6

The video above was generated by @a-r-r-o-w with CogVideoX-5B and gives a nice outlook on where the field is heading!
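
As a rough sketch of how one might try the model through its diffusers integration (the checkpoint id, pipeline settings, and prompt below are illustrative assumptions, not a tuned recipe):

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the open-weight 5B checkpoint in bf16 and offload layers to fit on one GPU
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

prompt = "A panda playing guitar by a lake at sunset, cinematic lighting"
video = pipe(prompt=prompt, num_frames=49, num_inference_steps=50, guidance_scale=6.0)
export_to_video(video.frames[0], "cogvideox_sample.mp4", fps=8)
```
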
xianbao
posted an update 7 months ago
Why Apache 2.0 Matters for LLMs 🤔

@01AI_Yi recently switched from a permissive & commercially friendly custom license to Apache 2.0, and the community loved it! 🚀

@JustinLin610 also ran a poll on model licenses, and the majority voted for Apache 2.0.

Why is it a big deal? ⬇️

📚 Legal Simplicity: Custom licenses need costly & time-consuming legal review. Apache 2.0 is well-known & easier for legal teams to handle.

👩‍💻 Developer-Friendly: Legal docs are a pain for devs! Apache 2.0 is well-known and tech-friendly, making it easier for non-native developers to understand the implications too.

🔗 Easier Integration: Apache 2.0 is compatible with many other licenses, simplifying tasks like merging models with different licensing requirements.

🚫 No Permission Needed: Custom licenses often require explicit permission and extra paperwork like filling out forms, creating barriers. Apache 2.0 removes this hurdle, letting devs focus on innovation.

There are a lot of interesting discussions under @JustinLin610's poll: https://x.com/JustinLin610/status/1793559737482764375, which inspired this thread.

Any other thoughts? Let me know ^^
xianbao
posted an update 7 months ago
DeepSeek V2 is a big deal, and not only because of its significant improvements to both key components of the Transformer: the attention layer and the FFN layer.

It has also completely disrupted the Chinese LLM market, forcing competitors to drop their prices to 1% of the original.

---

There are two key components in Transformer architecture: the self-attention layer, which captures relationships between tokens in context, and the Feed-Forward Network (FFN) layer, which stores knowledge.

DeepSeek V2 introduces optimizations to both:

The attention layer normally uses a KV cache to avoid repeated computation, but the cache consumes significant GPU RAM, limiting concurrent requests. DeepSeek V2 introduces Multi-head Latent Attention (MLA), which stores only a small latent representation, resulting in substantial RAM savings.
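
To make the caching idea concrete, here is a toy PyTorch sketch (made-up dimensions; DeepSeek's actual MLA also handles RoPE and the projections differently) showing that only a small latent needs to be cached per token:

```python
import torch
import torch.nn as nn

d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128  # illustrative sizes

down_kv = nn.Linear(d_model, d_latent, bias=False)        # compress hidden state
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand latent to keys
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand latent to values

hidden = torch.randn(1, 16, d_model)   # (batch, seq_len, d_model)
latent_cache = down_kv(hidden)         # (1, 16, 512) <- this is all that gets cached

# At attention time, K and V are rebuilt on the fly from the cached latent
k = up_k(latent_cache).view(1, 16, n_heads, d_head)
v = up_v(latent_cache).view(1, 16, n_heads, d_head)

# 512 cached floats per token vs 2 * 32 * 128 = 8192 for a vanilla KV cache
print(latent_cache.shape, k.shape, v.shape)
```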

DeepSeek V2 utilizes 162 experts instead of the usual 8 found in models like Mixtral. This approach segments experts into finer granularity for higher specialization and more accurate knowledge acquisition. Activating only a small subset of experts for each token leads to efficient processing.
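
A toy sketch of that fine-grained routing (illustrative sizes and a deliberately naive per-token loop, not DeepSeek V2's actual implementation, which also uses shared experts and fused kernels):

```python
import torch
import torch.nn as nn

d_model, n_experts, top_k = 1024, 162, 6  # illustrative: many small experts, few active

gate = nn.Linear(d_model, n_experts, bias=False)
experts = nn.ModuleList(
    nn.Sequential(nn.Linear(d_model, 2048), nn.SiLU(), nn.Linear(2048, d_model))
    for _ in range(n_experts)
)

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    """x: (num_tokens, d_model). Each token only runs through its top-k experts."""
    scores = gate(x).softmax(dim=-1)            # (num_tokens, n_experts)
    weights, idx = scores.topk(top_k, dim=-1)   # (num_tokens, top_k)
    out = torch.zeros_like(x)
    for t in range(x.size(0)):
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[int(e)](x[t])
    return out

print(moe_forward(torch.randn(4, d_model)).shape)  # torch.Size([4, 1024])
```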

It disrupted the market by dropping API prices to $0.14 per 1M tokens. This dramatic reduction forced competitors like GLM, Ernie, and Qwen to follow suit, lowering their prices to 1% of their original offerings. Now, users can access these APIs at 1/35th the cost of GPT-4o.
xianbao
posted an update 8 months ago
lewtun
posted an update 9 months ago
Introducing Zephyr 141B-A35B 🪁:

HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1

Yesterday, Mistral released their latest base model (via magnet link, of course 😅) and the community quickly converted it to transformers format and pushed it to the Hub: mistral-community/Mixtral-8x22B-v0.1

Early evals of this model looked extremely strong, so we teamed up with Argilla and KAIST AI to cook up a Zephyr recipe with a few new alignment techniques that came out recently:

🧑‍🍳 Align the base model with Odds Ratio Preference Optimisation (ORPO). This novel algorithm, developed by @JW17, @nlee-208 and @j6mes, does not require an SFT step to achieve high performance and is thus much more computationally efficient than methods like DPO and PPO (a minimal sketch of this step follows below).

🦫 Use a brand new dataset of 7k high-quality, multi-turn preferences that has been developed by our friends at Argilla. To create this dataset, they took the excellent Capybara SFT dataset from @LDJnr LDJnr/Capybara and converted it into a preference dataset by augmenting the final turn with responses from new LLMs that were then ranked by GPT-4.
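
A minimal sketch of what the ORPO step looks like with trl (the full recipe lives in the alignment-handbook; the dataset name and hyperparameters below are placeholders, not the exact values used for Zephyr 141B-A35B, and newer trl versions name the tokenizer argument processing_class):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "mistral-community/Mixtral-8x22B-v0.1"  # in practice this needs multi-node training
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A preference dataset with prompt / chosen / rejected columns (placeholder name)
dataset = load_dataset("your-org/capybara-preferences", split="train")

args = ORPOConfig(
    output_dir="zephyr-orpo-sketch",
    beta=0.1,                        # ORPO's lambda weight; value here is illustrative
    per_device_train_batch_size=1,
    num_train_epochs=1,
)
trainer = ORPOTrainer(model=model, args=args, train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
```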

What we find especially neat about this approach is that training on 7k samples only takes ~1.3h on 4 H100 nodes, yet produces a model that is very strong on chat benchmarks like IFEval and BBH.

Kudos to @alvarobartt @JW17 and @nlee-208 for this very nice and fast-paced collab!

For more details on the paper and dataset, check out our collection: HuggingFaceH4/zephyr-orpo-6617eba2c5c0e2cc3c151524
lewtun
posted an update 10 months ago
Can we align code generation models to be good at chat without compromising their base capabilities 🤔?

This was the question the H4 team asked itself when BigCode released StarCoder2 a bit over a week ago. We knew that code models like deepseek-ai/deepseek-coder-6.7b-instruct and m-a-p/OpenCodeInterpreter-DS-33B get impressive scores on code benchmarks like HumanEval, but they tend to score poorly on chat benchmarks like MT Bench and IFEval. We also knew that the Zephyr recipe we applied to Mistral 7B produced a strong chat model, so we wondered: could it be tweaked to produce a strong coding assistant?

It turns out the answer is yes and I'm happy to share StarChat2, a DPO fine-tune of StarCoder2 15B that scores highly on both HumanEval and MT Bench / IFEval 🌟!

The most interesting lesson for me was that you get better models by blending in more code/math data than chat data during the SFT step - in terms of tokens, we found a ratio of 3:1 worked best.
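
As a rough illustration of that blending with the datasets library (the dataset names are placeholders, and the real recipe balances the mixture at the token level rather than the example level):

```python
from datasets import load_dataset, interleave_datasets

code_math = load_dataset("your-org/code-math-sft", split="train")  # placeholder dataset
chat = load_dataset("your-org/chat-sft", split="train")            # placeholder dataset

# Draw roughly 3 code/math examples for every chat example during SFT
mixed = interleave_datasets(
    [code_math, chat],
    probabilities=[0.75, 0.25],
    seed=42,
    stopping_strategy="all_exhausted",
)
print(mixed)
```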

Anyway, here's a demo of the model, along with all the code and datasets we used to train it:

* Demo: HuggingFaceH4/starchat2-playground
* Collection: HuggingFaceH4/starchat2-15b-65f068417b330fafad751fce
* Recipe: https://github.com/huggingface/alignment-handbook

Hope it's useful to others!
xianbao
posted an update 11 months ago
Welcome Bunny! A family of lightweight but powerful multimodal models from BAAI

With detailed work on dataset curation, the Bunny-3B model, built upon SigLIP and Phi-2, achieves performance on par with 13B models.

Model: BAAI/bunny-phi-2-siglip-lora

xianbao
posted an update 11 months ago
There appears to be a huge misunderstanding regarding the licensing requirements for open-sourced Chinese-speaking LLMs on @huggingface.

I initially shared this misconception too, but after conducting some research, I came up with the list below.

Very impressive!

xianbao
posted an update 11 months ago
Vision LLM for #edgecomputing?

@openbmb, who open-sourced the UltraFeedback dataset before, released a series of eco-friendly yet powerful LLMs:

- MiniCPM: 2B model that competes with Mistral-7B

- MiniCPM-V: 3B vision LLM on edge!
lewtun
updated a Space almost 2 years ago
ysharma
updated a Space almost 2 years ago
lewtun
updated a Space about 2 years ago