How do your annotations for FineWeb2 compare to your teammates'?
I started contributing annotations to the FineWeb2 collaborative annotation sprint, and I wanted to know whether my labelling trends were similar to my teammates'.
I did some analysis, and I wasn't surprised to see that I'm a bit harsher in my evaluations than my teammates 😂
Do you want to see how your annotations compare to others?
👉 Go to this Gradio space: nataliaElv/fineweb2_compare_my_annotations
✍️ Enter the dataset that you've contributed to and your Hugging Face username.
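If you'd rather poke at the numbers yourself, here's a minimal sketch of that kind of comparison in Python. It assumes the annotations are available as a Hugging Face dataset with per-row annotator and label columns; the dataset id, split, and column names below are placeholders for illustration, not the space's actual internals.

```python
# Minimal sketch: compare your label distribution against your teammates'.
# The dataset id and the "annotator"/"label" column names are assumptions.
from collections import Counter

from datasets import load_dataset

DATASET = "your-org/fineweb2-annotations"  # placeholder dataset id
USERNAME = "your-hf-username"              # placeholder annotator id

ds = load_dataset(DATASET, split="train")

# Count how often each label was assigned, by you and by everyone else.
mine, others = Counter(), Counter()
for row in ds:
    target = mine if row["annotator"] == USERNAME else others
    target[row["label"]] += 1

def share(counter: Counter) -> dict:
    """Turn raw counts into label proportions so the two sides are comparable."""
    total = sum(counter.values()) or 1
    return {label: count / total for label, count in counter.items()}

print("My label distribution:        ", share(mine))
print("Teammates' label distribution:", share(others))
```

Comparing proportions rather than raw counts keeps the comparison fair even if you've annotated far fewer examples than everyone else.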
We're so close to reaching 100 languages! Can you help us cover the remaining 200? Check if we're still looking for language leads for your language: nataliaElv/language-leads-dashboard
Would you like to get a high-quality dataset to pre-train LLMs in your language? 🌏
At Hugging Face we're preparing a collaborative annotation effort to build an open-source multilingual dataset as part of the Data is Better Together initiative.
Follow the link below, check if your language is listed and sign up to be a Language Lead!