3481 29 12

Loïck BOURDOIS

lbourdois

https://lbourdois.github.io/blog/

AI & ML interests

👀

Recent Activity

updated a dataset 8 days ago

lbourdois/caption-maya-multimodal-pretrain-clean

updated a dataset 8 days ago

lbourdois/caption-maya-multimodal-pretrain-clean

updated a dataset 8 days ago

lbourdois/caption-maya-multimodal-pretrain-clean

View all activity

Articles

Organizations

lbourdois's activity

reacted to Wauplin's post with 🔥🤗 3 months ago

Post

2802

What a great milestone to celebrate! The huggingface_hub library is slowly becoming a cornerstone of the Python ML ecosystem when it comes to interacting with the @huggingface Hub. It wouldn't be there without the hundreds of community contributions and feedback! No matter if you are loading a model, sharing a dataset, running remote inference or starting jobs on our infra, you are for sure using it! And this is only the beginning so give a star if you wanna follow the project 👉 https://github.com/huggingface/huggingface_hub

1 reply

reacted to davanstrien's post with 🚀👀 3 months ago

Post

3168

ColPali is revolutionizing multimodal retrieval, but could it be even more effective with domain-specific fine-tuning?

Check out my latest blog post, where I guide you through creating a ColPali fine-tuning dataset using Qwen/Qwen2-VL-7B-Instruct to generate queries for a collection of UFO documents sourced from the Internet Archive.

The post covers:
- Introduction to data for ColPali models
- Using Qwen2-VL for retrieval query generation
- Tips for better query generation

Check out the post here:
https://danielvanstrien.xyz/posts/post-with-code/colpali/2024-09-23-generate_colpali_dataset.html

The resulting Hugging Face dataset: davanstrien/ufo-ColPali

1 reply

reacted to tomaarsen's post with ❤️👀🚀🔥 3 months ago

Post

2039

🎉SetFit v1.1.0 is out! Training efficient classifiers on CPU or GPU now uses the Sentence Transformers Trainer, and we resolved a lot of issues caused by updates of third-party libraries (like Transformers). Details:

Training a SetFit classifier model consists of 2 phases:
1. Finetuning a Sentence Transformer embedding model
2. Training a Classifier to map embeddings -> classes

🔌The first phase now uses the SentenceTransformerTrainer that was introduced in the Sentence Transformers v3 update. This brings some immediate upsides like MultiGPU support, without any (intended) breaking changes.

➡️ Beyond that, we softly deprecated the "evaluation_strategy" argument in favor of "eval_strategy" (following a Transformers deprecation), and deprecated Python 3.7. In return, we add official support for Python 3.11 and 3.12.

✨ There's some more minor changes too, like max_steps and eval_max_steps now being a hard limit instead of an approximate one, training/validation losses now logging nicely in Notebooks, and the "device" parameter no longer being ignored in some situations.

Check out the full release notes here: https://github.com/huggingface/setfit/releases/tag/v1.1.0
Or read the documentation: https://huggingface.co/docs/setfit
Or check out the public SetFit models for inspiration: https://huggingface.co/models?library=setfit&sort=created

P.s. the model in the code snippet trained in 1 minute and it can classify ~6000 sentences per second on my GPU.

reacted to merve's post with ❤️🤝🤗 5 months ago

Post

2551

🥹 @lbourdois has made an app to browse all of my vision paper summaries for everyone's convenience merve/vision_papers

reacted to severo's post with ❤️🚀 5 months ago

Post

3484

[New tool] Follow interesting ML persons 👩‍🎨 👨‍🎤 👩‍🏫 with Followgraph

severo/followgraph

Please try it and tell me if it helped you discover high-quality content 👍 👎

I repurposed "Followgraph for Mastodon" (https://followgraph.vercel.app/).

My new follows: @TheBloke @mlabonne @teknium @KnutJaegersberg @SkalskiP @AmelieSchreiber @lbourdois @ceyda @andrewyng @Pclanglais @karpathy

And you?

5 replies

replied to their post 9 months ago

Merci !

I'll try to avoid a 4-month gap with the next article 🙃
The year 2023 and in particular the second half of the year has been quite busy. So maybe I'll split the 2023 article in two.

posted an update 9 months ago

Post

3498

I stopped procrastinating and finally took the time to write the second article of my series of blog posts on SSM: https://huggingface.co/blog/lbourdois/ssm-2022.
In this blog post, I review the history of SSM models released in 2022, with over 14 models discussed in a synthetic format.
They are separated into two parts: "theoretical" (DSS, S4D, GSS, Mega, S5, etc.) and "applications" (Sashimi, ViS4mer, CCNN, etc.).

To understand everything, it's best to have read the introduction to S4 to SSM blog post first: https://huggingface.co/blog/lbourdois/get-on-the-ssm-train.
All the articles in the series are listed in this space: lbourdois/SSM_blog_posts

Wishing you a good reading :)

2 replies

reacted to kargaranamir's post with 👍 11 months ago

Post

A Text Language Identification Model with Support for +2000 Labels:

space: cis-lmu/glotlid-space
model: cis-lmu/glotlid
github: https://github.com/cisnlp/GlotLID
paper: GlotLID: Language Identification for Low-Resource Languages (2310.16248)

posted an update 11 months ago

Post

The most widely used French NER models on HF ( Jean-Baptiste/camembert-ner and cmarkea/distilcamembert-base-ner) are trained on a single dataset (WikiNER) which on the one hand contains leaks and therefore distorts the true results of these models, and on the other hand overspecializes them in a particular domain (= texts from Wikipedia). They are also only available in a base version (110M parameters).

That's why I've trained new NER models in French both on more data (x3), as well as in base and large versions (336M). They are available in 3 entities (PER, ORG, LOC) or 4 entities (PER, ORG, LOC, MISC):
- CATIE-AQ/NERmembert-base-4entities
- CATIE-AQ/NERmembert-large-4entities
- CATIE-AQ/NERmembert-base-3entities
- CATIE-AQ/NERmembert-large-3entities

Datasets without leaks are also available:
- CATIE-AQ/frenchNER_4entities
- CATIE-AQ/frenchNER_3entities

reacted to MoritzLaurer's post with 🤝 11 months ago

Post

Prompts are hyperparameters. Every time you test a different prompt on your data, you become less sure if the LLM actually generalizes to unseen data.

Issues of overfitting to a test set seem like concepts from boring times when people still fine-tuned models, but it's just as important for "zeroshot prompting". Using a separate validation split to tune the main hyperparameter of LLMs (the prompt) is just as important as train-val-test splitting for fine-tuning. The only difference is that you don't have a training dataset anymore and it somehow feels different because there is no training / no parameter updates.

Its easy to trick yourself into believing that an LLM performs well on your task, while you've actually overfit the prompt on your data. Every good "zeroshot" paper should clarify that they used a validation split for finding their prompt before final testing.

1 reply

replied to their post 11 months ago

For NER (Name Entity Recognition) you can consult https://huggingface.co/tasks/token-classification.
A leak is when data of the train split is found in the test split, biasing the results and benchmarks.

reacted to euclaise's post with ❤️ 11 months ago

Post

Memphis: Advancing language model reasoning without relying on proprietary model outputs

Memphis is a series of models which advance human-data models, offering good performance without relying on proprietary model outputs (e.g. GPT-generated datasets). I've developed a new iterative finetuning procedure to improve the reasoning ability of these models beyond what is possible using only SFT on the same data.

Currently, I've released two models: Memphis-CoT-3B, and Memphis-scribe-3B.

To create these models, I've created new datasets:
- euclaise/reddit-instruct : A dataset of instruction/QA-like data scraped from Reddit. A curated version, filtered using Lilac and neural embedding models, is available at euclaise/reddit-instruct-curated
- euclaise/TinyCoT : TinyCoT is a mtea-dataset that aggregates a variety of different human-sourced reasoning data. It is a curated version of my previous MegaCoT dataset euclaise/MegaCoT, which contains 629k responses which get cut down to 28k for TinyCoT. There's also an intermediate version euclaise/MiniCoT, which has 129k responses.

Memphis-CoT is trained on reddit-instruct, a filtered version of oasst2 sablo/oasst2_curated, and TinyCoT. Multiple iterations were performed on TinyCoT, while reddit-instruct and oasst2 were only used for the initial model.

Memphis-scribe further finetunes Memphis-CoT on more creative tasks. It was finetuned from Memphis-CoT on 18 different datasets, including datasets like euclaise/WritingPrompts_curated, lemonilia/LimaRP, and more.

To prevent catastrophic forgetting, I used weight averaging between iterations.

- euclaise/Memphis-CoT-3B
- euclaise/Memphis-scribe-3B

2 replies

Loïck BOURDOIS

AI & ML interests

Recent Activity

Articles

Introduction to State Space Models (SSM)

History of State Space Models (SSM) in 2022

Introduction to LLE

Organizations

lbourdois's activity