I created a Capybara-inspired Italian dataset by translating the initial instructions and running them through a pipeline to generate conversations. I used Claude Sonnet for translation and instruction generation, and Claude Opus for generating the answers.
I hope this dataset proves useful for people working on 🇮🇹 language models.
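For anyone curious about the mechanics, here is a minimal sketch of such a translate-then-answer pipeline using the anthropic Python SDK; the model version strings and prompts below are illustrative assumptions, not the exact ones used to build the dataset.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def translate_to_italian(instruction: str) -> str:
    """Translate a source instruction into Italian with a Sonnet-class model."""
    response = client.messages.create(
        model="claude-3-sonnet-20240229",  # assumed version string
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Translate the following instruction into Italian:\n\n{instruction}",
        }],
    )
    return response.content[0].text

def answer_in_italian(instruction_it: str) -> str:
    """Generate the assistant turn with an Opus-class model."""
    response = client.messages.create(
        model="claude-3-opus-20240229",  # assumed version string
        max_tokens=2048,
        messages=[{"role": "user", "content": instruction_it}],
    )
    return response.content[0].text

# One synthetic conversation: translated instruction plus generated answer.
istruzione = translate_to_italian("Explain how photosynthesis works.")
conversation = {"instruction": istruzione, "answer": answer_in_italian(istruzione)}
```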
@mik3ml just released ReDiX/wikipediaQA-ita, an interesting synthetic dataset generated from Wikipedia using a fine-tuned version of Mistral-7B specialized for the Italian language 🇮🇹.
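As a rough illustration of how such Wikipedia-grounded QA pairs can be generated (the checkpoint id and prompt below are placeholders, not @mik3ml's actual setup):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint id: substitute the actual Italian Mistral-7B fine-tune.
model_id = "ReDiX/mistral-7b-ita-example"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

passage = "La Torre di Pisa è il campanile della cattedrale di Pisa, celebre per la sua pendenza."
prompt = f"Testo: {passage}\n\nScrivi una domanda e la relativa risposta basate sul testo:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)

# Keep only the newly generated tokens (the synthetic question-answer pair).
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```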
While evaluating fine-tuned 7B Italian open-source LLMs I have collected many data points and put together a very simple exploratory analysis. My hypotheses based on the data are:
- MMLU is hard to improve when fine-tuning a base model on a different language.
- Fine-tuning, even on a single GPU, can improve the base model by 5% to 10% on common tasks, and by much more on specific cases with the right training time and data.
- Fine-tuning can specialize a model well, but at the cost of losing some foundational knowledge.
The evaluation is based on lm-evaluation-harness and covers, at the moment, mainly 7-billion-parameter models. In the next weeks we will add more models. If you have suggestions or need explanations, join our community Discord: https://discord.gg/a26cRkBCNH
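For reference, a run of this kind can be reproduced with the lm-evaluation-harness Python API; a minimal sketch, assuming the v0.4 `simple_evaluate` entry point and a placeholder model id:

```python
import lm_eval

# Placeholder model id: substitute any 7B Italian fine-tune from the Hugging Face Hub.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=org/italian-7b-example,dtype=bfloat16",
    tasks=["mmlu", "hellaswag", "arc_challenge"],
    num_fewshot=0,
    batch_size=8,
)

# Per-task accuracy figures, comparable across base and fine-tuned checkpoints.
print(results["results"])
```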
Based on the work of @mrinaldi and @ruggsea, we just released the biggest training-ready conversational dataset based on Usenet data in the Italian language 🇮🇹🇮🇹🇮🇹🇮🇹🇮🇹🇮🇹🇮🇹. It contains about 9 million conversations between real humans.
The dataset contributes to the https://huggingface.co/mii-community project, aimed at advancing the creation of Italian open-source Large Language Models (LLMs). 🇮🇹 🤗 At about 10-20 billion tokens, it is probably the best open-source conversational dataset in the Italian language. 🇮🇹🇮🇹🇮🇹🇮🇹🇮🇹🇮🇹🇮🇹
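To inspect the data without pulling the whole corpus, streaming via the datasets library works; a minimal sketch, with a placeholder repository id (check the mii-community hub page for the actual name):

```python
from datasets import load_dataset

# Placeholder repo id -- look up the actual Usenet dataset under the mii-community organization.
ds = load_dataset("mii-community/usenet-conversations-ita", split="train", streaming=True)

# Stream a few conversations instead of downloading the full 10-20B-token corpus.
for conversation in ds.take(3):
    print(conversation)
```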
Introducing Universal NER for Italian Language, a Named Entity Recognition (NER) model built on the GLiNER architecture and tailored for the Italian language. Using a bidirectional transformer encoder, it is engineered to recognize any entity type within the rich nuances of Italian. It stands out as an efficient, versatile solution for those navigating resource-limited environments or seeking a lighter alternative to cumbersome Large Language Models (LLMs). Runs fast also on CPU!
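Usage follows the standard GLiNER API, where entity labels are free text, so any entity type can be requested at inference time; a minimal sketch, with an assumed checkpoint id (see the model card for the actual name):

```python
from gliner import GLiNER

# Assumed checkpoint id; substitute the actual Italian GLiNER model from the Hub.
model = GLiNER.from_pretrained("DeepMount00/universal_ner_ita")

text = "Giuseppe Verdi nacque a Busseto nel 1813 e lavorò a lungo con il Teatro alla Scala di Milano."
labels = ["persona", "luogo", "data", "organizzazione"]  # zero-shot: any label set works

for entity in model.predict_entities(text, labels, threshold=0.5):
    print(entity["text"], "->", entity["label"])
```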