I'm now working on finetuning coding models. If you're GPU-hungry like me, you'll find quantized models very helpful. But the quantization formats for finetuning and for inference are different and incompatible, so I made two collections here.
Quantized inference models are far more popular on HF than quantized finetuning models. I use https://huggingface.co/QuantFactory to generate inference models (GGUF), and there are a few other choices.
But there hasn't been such a service for finetuning models. DIY isn't too hard, though: I made a few myself, and you can find the script in the model cards. If the original model is small enough, you can even do it on a free T4 (available via Google Colab).
If you know a (small) coding model worthy of quantization, please let me know and I'd love to add it to the collections.
Ablation experiments prove that an education-filtered dataset significantly enhances model capabilities (independent of model size or architecture) 🤩
Yesterday, FineWeb’s technical report was published. FYI, FineWeb (by 🤗) is currently the best open-source text dataset for scaling model performance up to GPT-3 level.
While the proprietary datasets used to train models like GPT-4/Claude/Llama are crawled internally and never released, FineWeb builds on CommonCrawl (an open repository of crawled web data). The team preprocessed the data with datatrove, their custom-built (and also open-sourced) data-processing library, then evaluated data quality on lighteval by training small “ablation models” with nanotron (a library for pretraining transformer models).
Of all the FineWeb versions, FineWeb-Edu outperforms every other subset. This is thanks to a new filtering technique: they used synthetic data to train classifiers that identify educational content.
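The filtering step itself is simple once you have the classifier. FineWeb-Edu scores pages on a 0–5 educational-quality scale and keeps the high scorers; the toy keyword scorer below is a hypothetical stand-in for the real classifier (which is a fine-tuned model trained on synthetic labels):

```python
# Sketch of educational-content filtering. The real pipeline uses a trained
# classifier; edu_score here is a fake placeholder for illustration only.
def edu_score(text: str) -> int:
    """Return a toy 0-5 educational-quality score (placeholder classifier)."""
    signals = ("theorem", "lesson", "exercise", "definition", "chapter")
    hits = sum(word in text.lower() for word in signals)
    return min(hits, 5)

def filter_edu(pages: list[str], threshold: int = 3) -> list[str]:
    """Keep only pages scoring at or above the threshold."""
    return [p for p in pages if edu_score(p) >= threshold]

pages = [
    "Chapter 1: a lesson on limits, with a definition and an exercise.",
    "Buy cheap watches online now!!!",
]
kept = filter_edu(pages)  # keeps only the educational page
```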
🔥 What's New:
- Polars integration 🐻‍❄️
- fsspec support for conversion to JSON, CSV, and Parquet
- Mode parameter for Image feature
- CLI function to convert script-datasets to Parquet
- Dataset.take and Dataset.skip
Plus, a bunch of general improvements & bug fixes!