vikas (Vikas Kumar)

liked a model about 1 month ago

alibaba-damo/mgp-str-base

Image-to-Text • Updated Dec 11, 2023 • 9.38k • 63

liked a Space 3 months ago

Running on Zero

163

📲🫴🏻👁

Tonic's GOT OCR

GOT - OCR (from : UCAS, Beijing)

liked a Space 4 months ago

Running on Zero

232

🔥

The 5 Most Under-Rated Tools on Hugging Face

Aug 22

• 86

upvoted a paper 4 months ago

Transformer Explainer: Interactive Learning of Text-Generative Models

Paper • 2408.04619 • Published Aug 8 • 155

upvoted a paper 5 months ago

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Paper • 2406.08464 • Published Jun 12 • 65

upvoted an article 5 months ago

Article

Finetuning PaliGemma with AutoTrain

By

•

Jul 25

• 8

upvoted a collection 5 months ago

Gemma 2 2B Release

Collection

The 2.6B parameter version of Gemma 2. • 6 items • Updated 7 days ago • 77

upvoted 2 articles 5 months ago

Article

Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth

By

•

Jul 29

• 255

Article

Llama 3.1 - 405B, 70B & 8B with multilinguality and long context

Jul 23

• 224

liked a dataset 5 months ago

HuggingFaceM4/Docmatix

Viewer • Updated Aug 26 • 2.55M • 26.1k • 234

upvoted a collection 5 months ago

🪐 SmolLM

Collection

A series of smol LLMs: 135M, 360M and 1.7B. We release base and Instruct models as well as the training corpus and some WebGPU demos • 12 items • Updated Aug 18 • 204

upvoted 3 articles 5 months ago

Article

SmolLM - blazingly fast and remarkably powerful

Jul 16

• 290

Article

The Rise of Agentic Data Generation

By

•

Jul 15

• 78

Article

ColPali: Efficient Document Retrieval with Vision Language Models 👀

By

•

Jul 5

• 180

reacted to fdaudens's post with 👍 6 months ago

Post

3351

🧠 How to create more diverse, realistic synthetic AI training data?

@TencentAIGC-Lab AI Lab created @proj-persona , a vast collection of 1 billion diverse personas, to help create synthetic data with LLMs that encapsulate a wide array of perspectives, knowledge, experiences, interests, and professions.

These personas were created with automatically curated data, representing approximately 13% of the world’s total population.

💡 The authors argue that integrating a persona into data synthesis prompts effectively steers LLMs to adopt specific perspectives, creating unique and relevant synthetic data with minimal effort.

They showcased various practical applications of Persona Hub to demonstrate its effectiveness and versatility in various synthetic data creation scenarios: mathematical and logical reasoning problems, simulating diverse user requests and prompts for LLMs, generating informative and detailed text content across various topics, and more.

🚀 It's one of the trending datasets on Hugging Face. Digging into it is quite fun! I found one that reminds me of several people I know: "A journalist who covers technology and innovation in the print and digital media industries." It helped generate the prompt attached to this post (about which I'd be curious to know your answers 😉).

Synthetic data is a hot topic in AI. It will be interesting to see if this research could help make LLMs more robust, versatile, and capable of handling a wide array of real-world scenarios.

👉Explore the dataset: proj-persona/PersonaHub
👉 Read the paper: https://arxiv.org/pdf/2406.20094

reacted to mrm8488's post with ❤️ 6 months ago

Post

4645

🚨Exciting news for the Multilingual Synthetic Data Community!🚨

I’ve taken inspiration from the MAGPIE paper on Llama-3-8B-instruct and extended its capabilities. Here’s what’s new!

🗞 The MAGPIE paper showcased that if you use the instruction-tuned version (Llama-3-8B-instruct) to generate synthetic instructions and then fine-tune the base version (Llama-3-8B) on this dataset, you can improve even the it-tuned version

🤔 While reading a script by Sebastian Raschka, PhD, I wondered: Could these advancements be replicated in other languages? Specifically, could they benefit non-English datasets?

🎉 And the answer is YES! At least for Spanish. I've successfully adapted the techniques for Spanish, proving the model's flexibility and multilingual capabilities.

👩‍💻 To make this accessible, I created a basic script (heavily inspired by the Sebastian Raschka one) that allows you to generate similar datasets using ollama models (initially phi and llama3) automatically and upload it to the Hugging Face Hub!
[Script](https://gist.github.com/mrm8488/4650a5e3cc45523798a527a3446eb312)

🔍 Explore the datasets 📚 generated using our new script!

- [Llama-3-8B](https://huggingface.co/datasets/mrm8488/dataset_llama3_5000_samples_es_4231_filtered)
- [Phi-3-medium](https://huggingface.co/datasets/mrm8488/dataset_phi3-medium_5000_samples_es_3906_filtered)
- [Phi-3-mini](https://huggingface.co/datasets/mrm8488/dataset_phi3_5000_samples_es_3282_filtered)

Note: These datasets have basic filtering. Apply additional quality filters before using them to fine-tune large language models.

Inspiration and base script:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb
https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/

7 replies

·

upvoted a collection 6 months ago

Florence

Collection

9 items • Updated Jul 11 • 160

upvoted 2 articles 6 months ago

Article

Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models

Jun 24

• 179

Article

BM25 for Python: Achieving high performance while simplifying dependencies with BM25S⚡

By

•

Jul 9

• 41

Vikas Kumar

AI & ML interests

Recent Activity

Organizations

vikas's activity

alibaba-damo/mgp-str-base

Tonic's GOT OCR

Qwen2-VL-7B

The 5 Most Under-Rated Tools on Hugging Face

Transformer Explainer: Interactive Learning of Text-Generative Models

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Finetuning PaliGemma with AutoTrain

Gemma 2 2B Release

Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth

Llama 3.1 - 405B, 70B & 8B with multilinguality and long context

HuggingFaceM4/Docmatix

🪐 SmolLM

SmolLM - blazingly fast and remarkably powerful

The Rise of Agentic Data Generation

ColPali: Efficient Document Retrieval with Vision Language Models 👀

Florence

Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models

BM25 for Python: Achieving high performance while simplifying dependencies with BM25S⚡

Vikas Kumar

AI & ML interests

Recent Activity

Organizations

vikas's activity

Tonic's GOT OCR

Qwen2-VL-7B

The 5 Most Under-Rated Tools on Hugging Face

Finetuning PaliGemma with AutoTrain

Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth

Llama 3.1 - 405B, 70B & 8B with multilinguality and long context

SmolLM - blazingly fast and remarkably powerful

The Rise of Agentic Data Generation

ColPali: Efficient Document Retrieval with Vision Language Models 👀

Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models

BM25 for Python: Achieving high performance while simplifying dependencies with *BM25S*⚡

BM25 for Python: Achieving high performance while simplifying dependencies with BM25S⚡