
joy larkin

joylarkin

AI & ML interests

AI Data, Datasets, Global AI, Euro AI, AGI, ASI, Data workflows, LLMs, Fine-tuning, Evaluations, etc. ••• Head of Marketing/GTM @ Airtrain AI.

Posts

💬 Chat as a way to write SQL queries! The Airtrain AI team is happy to share a new Hugging Face Space that lets you interact with Hugging Face Hub datasets using a natural language chatbot. 🤗

Start Exploring 👉 airtrain-ai/hf-dataset-chat-to-sql

This Space is forked from davidberenstein1957/text-to-sql-hub-datasets by @davidberenstein1957 and features chat capability with improved table naming. The tool works with Hugging Face’s recently released in-browser DuckDB-based SQL query engine for datasets.
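
For readers who want to skip the chat layer, the same kind of SQL query can be run against a Hub dataset's auto-converted Parquet files with DuckDB from Python. This is a minimal sketch, not the Space's implementation (which runs DuckDB in the browser); the `hf://` dataset path and column names are illustrative assumptions.

```python
# Minimal sketch: querying a Hub dataset's Parquet files with DuckDB's SQL
# engine from Python. The dataset path and column names are illustrative.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # extension for reading remote hf:// and https:// paths
con.execute("LOAD httpfs")

# Recent DuckDB versions can read public Hub datasets directly via hf:// paths.
result = con.sql("""
    SELECT text, score
    FROM 'hf://datasets/HuggingFaceFW/fineweb-edu/sample/10BT/*.parquet'
    LIMIT 5
""").df()
print(result)
```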



Introducing Fineweb-Edu-Fortified: An enhanced Fineweb-Edu dataset. 📚

Tailored for NLP tasks, this dataset streamlines model training by offering a more refined, deduplicated corpus. It is well suited to startups and researchers looking for high-quality educational content to train, evaluate, or fine-tune AI models. The dataset is based on the Fineweb-Edu subset of the larger Fineweb dataset and includes:

- Exact-match deduplication across all crawls
- Embeddings for each row, computed with the TaylorAI/bge-micro model
- A count column indicating each row's duplication frequency
- Data from 95 Common Crawl crawls (2013–2024)
- Row count reduced from 1.279B to 0.324B after deduplication
- ~375B tokens (down from 1,320B in Fineweb-Edu)

Access the entire Fineweb-Edu-Fortified dataset on Hugging Face → airtrain-ai/fineweb-edu-fortified
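
As a rough starting point, the dataset can be streamed with the `datasets` library so the full ~375B-token corpus doesn't have to be downloaded. The per-crawl config name below is a hypothetical example; check the dataset card for the actual configuration and column names.

```python
# Minimal sketch, assuming the standard `datasets` streaming API.
# "CC-MAIN-2024-10" is a hypothetical per-crawl config name used for
# illustration; consult the dataset card for the real configurations.
from datasets import load_dataset

ds = load_dataset(
    "airtrain-ai/fineweb-edu-fortified",
    name="CC-MAIN-2024-10",  # hypothetical per-crawl subset
    split="train",
    streaming=True,  # stream rows instead of downloading everything
)

# Peek at a few rows, including the duplication-frequency count column.
for row in ds.take(3):
    print(row["text"][:200].replace("\n", " "), "| count:", row.get("count"))
```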

Try a semantic search demo via this Hugging Face Space → airtrain-ai/fineweb-edu-fortified-search-demo
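
To give a sense of what the demo does, here is a small, self-contained sketch of embedding-based search with the same bge-micro model. The tiny in-memory corpus stands in for dataset rows (in practice you would search the precomputed embedding column), and loading the model through sentence-transformers is an assumption about its packaging.

```python
# Minimal sketch of semantic search with TaylorAI/bge-micro, the model used
# for the dataset's embedding column. The three-sentence corpus stands in
# for dataset rows.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("TaylorAI/bge-micro")

corpus = [
    "Photosynthesis converts light energy into chemical energy in plants.",
    "The French Revolution began in 1789.",
    "Newton's second law relates force, mass, and acceleration.",
]
corpus_emb = model.encode(corpus, normalize_embeddings=True)

query_emb = model.encode("How do plants make their own food?", normalize_embeddings=True)
scores = corpus_emb @ query_emb  # cosine similarity, since embeddings are normalized

for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```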

Many thanks to the amazing @josh-sematic for his work on this project, to the Fineweb/Fineweb-Edu team at Hugging Face for producing the original datasets and for their support during our work on Fineweb-Edu-Fortified, and to @underspirit for pointing out the reduction in dataset size that could be achieved via deduplication. 🤗

models

None public yet

datasets

None public yet