Bram Vanroy

BramVanroy

AI & ML interests

Artificial intelligence, natural language processing, computational linguistics

Recent Activity

upvoted a paper 2 days ago
Qwen2.5 Technical Report
new activity 4 days ago
GroNLP/dutch-cola:Citation
updated a dataset 6 days ago
HuggingFaceFW/fineweb

Organizations

- Language and Translation Technology Team
- BigScience Workshop
- How to teach Hugging Face?
- Hugging Face Fellows
- Blog-explorers
- HPLT
- ZeroGPU Explorers
- TENACITY
- signon-project
- Social Post Explorers
- Occiglot
- Hugging Face Discord Community
- Networks of Ideas and Knowledge in the Ancient World
- Instituut voor de Nederlandse Taal / Dutch Language Institute
- ml-fw-prerelease

Posts 12

In the spirit of "Better late than never", I've finally written a brief overview paper for GEITje 7B Ultra. The model was initially released 10 months ago (oops), but it still reaches around 1300 monthly downloads across the HF ecosystem (not including ollama).

GEITje 7B Ultra: A Conversational Model for Dutch (2412.04092)

While the paper discusses the model a little bit, I especially wanted to write about the datasets, which to this day seem an important asset for Dutch LLM training (SFT and preference tuning). We have a long way to go for Dutch, but publishing transparent and reproducible artefacts seems an important step to me, alongside having open discussions about data, bias, and architectures.

In that spirit, thanks are in order for the creation of GEITje 7B Ultra and all related datasets:

- Michiel Buisman and UWV for providing the means to create the datasets
- Flemish Supercomputer Center (VSC) for the compute
- The Hugging Face Fellows and the rest of the team for their discussions and insights
- The Dutch NLP community, notably @Rijgersberg for building the base GEITje model and the fruitful discussions we've had

More to come, step by step!

BramVanroy/geitje-7b-ultra-65c1ee010ad80fd1f6a8f208
The InstructGPT paper mentions that they insert 10% pretraining data during SFT, which they find improves the effect of PPO (IIUC). Has anyone else done later ablations on this? I've only seen the inverse suggested, mixing in SFT data during pretraining.
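
To make the setup in the question concrete, here is a minimal sketch of what "10% pretraining data during SFT" could look like at the data level. It assumes the 10% is measured against the final mixed set (not relative to the SFT set alone) and that examples are simply pooled and shuffled; the function name `mix_pretraining_into_sft` and its parameters are hypothetical and are not taken from the InstructGPT paper or any library.

```python
import random


def mix_pretraining_into_sft(sft_examples, pretraining_examples,
                             pretrain_fraction=0.1, seed=0):
    """Return a shuffled list in which about `pretrain_fraction` of the
    items are pretraining samples and the rest are SFT samples.

    Assumption: the fraction refers to the share of the final mix.
    """
    if not 0 <= pretrain_fraction < 1:
        raise ValueError("pretrain_fraction must be in [0, 1)")

    rng = random.Random(seed)

    # Solve n_pt / (n_sft + n_pt) = f for n_pt.
    n_pt = round(pretrain_fraction * len(sft_examples) / (1 - pretrain_fraction))
    # Never ask for more pretraining samples than are available.
    n_pt = min(n_pt, len(pretraining_examples))

    # Sample pretraining examples without replacement, then shuffle the pool.
    mixed = list(sft_examples) + rng.sample(list(pretraining_examples), n_pt)
    rng.shuffle(mixed)
    return mixed
```

An ablation along these lines would vary `pretrain_fraction` (for example 0.0 versus 0.1) while keeping the SFT set fixed, and then compare the downstream PPO runs.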