All HF Hub posts

Severian
posted an update about 8 hours ago
Create and Train Your Own Expert LLM: Generating Synthetic, Fact-Based Datasets with LMStudio/Ollama and then fine-tuning with MLX and Unsloth

Hey everyone!

I know there are tons of videos and tutorials out there already, but I've noticed a lot of questions popping up in community posts about using synthetic datasets for creative projects and how to transform personal content into more factual material. In my own work doing enterprise-level SFT and building my open-source models, I've enhanced a Python framework originally shared by the creator of the Tess models. This improved stack uses local language models and also integrates the Wikipedia dataset so that the generated content is as accurate and reliable as possible.
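For a concrete picture of the approach (not the actual framework), here is a minimal sketch that pulls Wikipedia passages with the datasets library and asks a local model served by Ollama to turn each one into a question-answer pair; the model name, prompt wording, and output handling are assumptions for illustration. LM Studio can be swapped in via its OpenAI-compatible local server.

```python
# Minimal sketch: ground synthetic Q&A pairs in Wikipedia text using a local
# model served by Ollama (http://localhost:11434). Model name, prompt wording,
# and output handling are illustrative assumptions, not the original framework.
import json
import requests
from datasets import load_dataset

# Stream a small slice of English Wikipedia so nothing huge gets downloaded.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

def generate(prompt: str, model: str = "llama3") -> str:
    """Call Ollama's /api/generate endpoint and return the completion text."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

samples = []
for article in wiki.take(10):
    passage = article["text"][:2000]  # keep the context small for a local model
    prompt = (
        "Using only the passage below, write one factual question and its answer. "
        "Return JSON with keys 'question' and 'answer'.\n\n"
        f"Passage:\n{passage}"
    )
    samples.append({"source": article["title"], "generation": generate(prompt)})

with open("synthetic_qa.jsonl", "w") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")
```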

I've been thinking of putting together a comprehensive, step-by-step course/guide on creating your own Expert Language Model, from dataset preparation and training to deployment on Hugging Face and even using something like AnythingLLM for user interaction. I'll walk you through each phase, clarifying complex concepts and troubleshooting common pitfalls.

Let me know if this interests you!

Most of the datasets and models I've made were built with these scripts and this approach.
phenixrhyder
posted an update about 11 hours ago
Midjourney AI
ayush-thakur02
posted an update about 17 hours ago
Enhancing Distributed Systems with Self-Healing Nodes and Adaptive Data Sharding

Paper: Self-healing Nodes with Adaptive Data-Sharding (2405.00004)

The paper introduces an innovative approach to improve distributed systems by integrating self-healing nodes with adaptive data sharding. This method leverages advanced concepts like self-replication, fractal regeneration, and predictive sharding to enhance scalability, performance, fault tolerance, and adaptability.

Key Concepts:
- Self-Replication: Nodes can create copies of themselves or their data to aid in recovery and load balancing.
- Fractal Regeneration: Nodes can reconfigure and restore their functionality after partial damage, inspired by natural fractals.
- Predictive Sharding: Nodes can anticipate future data trends and proactively adjust data distribution to optimize performance.

Methodology:
The approach consists of four main steps:
- Temporal data sharding, which partitions data based on its temporal characteristics.
- Self-replicating nodes to enhance data availability and reliability.
- Fractal regeneration for robust recovery mechanisms.
- Predictive sharding using consistent hashing to anticipate and adapt to future data trends (a minimal sketch of the hashing building block follows below).
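
To make the consistent-hashing building block concrete, here is a minimal hash ring sketch (illustrative only, not the paper's implementation): keys map to the first node clockwise from their hash position, so adding or removing a shard moves only a small fraction of keys.

```python
# Minimal consistent-hashing ring, the building block behind the predictive
# sharding step. Illustrative sketch, not the authors' code.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes: int = 100):
        # vnodes: virtual nodes per physical node, to even out key distribution
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key: str) -> str:
        """Route a key to the first node clockwise from its hash position."""
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, chr(0x10FFFF)))
        return self._ring[idx % len(self._ring)][1]

ring = HashRing(["shard-a", "shard-b", "shard-c"])
print(ring.get_node("sensor-42:2024-05-01"))  # e.g. 'shard-b'
ring.add_node("shard-d")                      # only ~1/4 of keys need to move
```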

Results and Analysis:
Experimental evaluations show that this approach outperforms existing data sharding techniques in scalability, performance, fault tolerance, and adaptability. The use of synthetic data and workload generators created realistic scenarios for testing.

Applications:
The methodology can be applied to various domains such as distributed database systems, blockchain networks, IoT, and cloud computing, offering improvements in data distribution efficiency and system resilience.
Jaward
posted an update about 19 hours ago
# Thoughts on Neural Scaling Laws
When you take a zoomed-out view of what makes neural networks succeed, you see it all revolves around the Scaling Laws: empirical observations that performance improves with increased model size, dataset size, and compute.

The specifics of how these laws apply vary across modalities and architectures; this is notable in the empirical equations used to fit them.
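
For reference, here is a tiny sketch of the power-law form from the Kaplan et al. paper cited below, L(N) ≈ (N_c / N)^α_N; the constants are the approximate fitted values reported there and are only illustrative.

```python
# Sketch of the Kaplan et al. (2020) power-law fits for language-model loss.
# Constants are the approximate fitted values reported in that paper and are
# illustrative only; real fits depend on tokenizer, data, and architecture.

def loss_vs_params(n_params: float, n_c: float = 8.8e13, alpha_n: float = 0.076) -> float:
    """Test loss vs. non-embedding parameter count, when data is not the bottleneck."""
    return (n_c / n_params) ** alpha_n

def loss_vs_tokens(n_tokens: float, d_c: float = 5.4e13, alpha_d: float = 0.095) -> float:
    """Test loss vs. dataset size in tokens, when model size is not the bottleneck."""
    return (d_c / n_tokens) ** alpha_d

# Doubling parameters from 1B to 2B shaves only a few percent off the loss:
print(loss_vs_params(1e9), loss_vs_params(2e9))
```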

Yet they all rely heavily on three main factors: Data, Size, and Computation. These factors themselves have sub-dependencies: data size and quality, model size and architecture, and the number of GPUs plus the compute-kernel code, respectively.

As research on these laws progresses, we begin to see new scaling laws emerge that may apply in quite different ways than before. This is evident in recent local LLMs (Phi-3, Gemma 2B, LLMs in a flash), which show that small models trained on small but high-quality data can beat much larger ones.

I look forward to the singularity moment, when these laws come full circle and meet where it all began :)

References:
- Scaling Laws for Neural Language Models: https://arxiv.org/pdf/2001.08361
- Scaling Laws for Autoregressive Generative Modeling: https://arxiv.org/abs/2010.14701
- LLMs in a flash: https://arxiv.org/abs/2312.11514
- Phi-3 Technical Report: https://arxiv.org/abs/2404.14219
- Gemma 2B: https://arxiv.org/pdf/2403.08295
MonsterMMORPG
posted an update 1 day ago
IDM-VTON (Improving Diffusion Models for Authentic Virtual Try-on in the Wild) is so powerful that it can even transfer beards and hair.

I have prepared installer scripts and full tutorials for Windows (requires a GPU with at least 8 GB of VRAM), Massed Compute (which I suggest if you don't have a strong GPU), RunPod, and a free Kaggle account (works perfectly as well, but slowly).

Windows Tutorial: https://youtu.be/m4pcIeAVQD0

Cloud (Massed Compute, RunPod & Kaggle) Tutorial: https://youtu.be/LeHfgq_lAXU

qq8933
posted an update 1 day ago
ChemLLM-20B SFT and DPO are coming! 🤗
fdaudens
posted an update 1 day ago
A new dataset for anyone interested in satellite imagery: 3 million @Satellogic images of unique locations (6 million images in total, including location revisits) from around the world, under a Creative Commons CC-BY 4.0 license.

Interesting potential in journalism.

satellogic/EarthView
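
A quick way to peek at it with the datasets library; the config and split names used below are assumptions, so check the dataset card for the exact schema.

```python
# Sketch: list the dataset's configs, then stream a few records from one of them.
# Config/split names and record fields are assumptions; see the dataset card.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("satellogic/EarthView")
print(configs)

ds = load_dataset("satellogic/EarthView", configs[0], split="train", streaming=True)
for i, sample in enumerate(ds):
    print(sample.keys())  # inspect what each record contains
    if i >= 2:
        break
```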
georgewritescode
posted an update 1 day ago
Excited to bring our benchmarking leaderboard of >100 LLM API endpoints to HF!

Speed and price are often just as important as quality when building applications with LLMs. We bring together all the data you need to weigh all three when picking a model and API provider.

Coverage:
‣ Quality (Index of evals, MMLU, Chatbot Arena, HumanEval, MT-Bench)
‣ Throughput (tokens/s: median, P5, P25, P75, P95; measured as sketched below)
‣ Latency (TTFT: median, P5, P25, P75, P95)
‣ Context window
‣ OpenAI library compatibility
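
For context on how the two performance metrics are typically measured, here is a rough sketch against any OpenAI-compatible endpoint; the model name and prompt are placeholders, and this is not the leaderboard's actual benchmarking harness.

```python
# Rough sketch of measuring TTFT and output throughput against an
# OpenAI-compatible endpoint. Model name and prompt are placeholders.
import time
from openai import OpenAI

client = OpenAI()  # set base_url/api_key for the provider you are testing
start = time.perf_counter()
first_token_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Explain consistent hashing in 200 words."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_chunks += 1  # counting chunks as a rough proxy for tokens

end = time.perf_counter()
print(f"TTFT: {first_token_at - start:.2f}s")
print(f"Throughput: {n_chunks / (end - first_token_at):.1f} tokens/s (approx.)")
```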

Link to Space: ArtificialAnalysis/LLM-Performance-Leaderboard

Blog post: https://huggingface.co/blog/leaderboard-artificial-analysis
davanstrien
posted an update 1 day ago
Only 14 languages have DPO preference-style datasets on the Hugging Face Hub (DIBT/preference_data_by_language). Let's improve that! How?

The Cohere For AI Aya dataset CohereForAI/aya_dataset has human-annotated prompt-completion pairs in 71 languages. We can use this to create DPO datasets for more languages!

Using Aya's prompt/response pairs as a starting point, we can use an LLM to generate an additional response for each prompt. We then use an LLM judge to rank the responses.
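
In spirit, the generate-then-judge step looks something like this sketch (the real pipeline is built with distilabel; the model choice, prompts, column handling, and scoring format here are simplified assumptions):

```python
# Simplified sketch of the generate-then-judge idea. The real pipeline uses
# distilabel; model names, prompts, and the scoring format are illustrative.
from datasets import load_dataset
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")  # placeholder model
aya = load_dataset("CohereForAI/aya_dataset", split="train").filter(
    lambda row: row["language"] == "Dutch"
)

def generate_response(prompt: str) -> str:
    out = client.chat_completion(messages=[{"role": "user", "content": prompt}], max_tokens=512)
    return out.choices[0].message.content

def judge(prompt: str, a: str, b: str) -> str:
    rubric = (
        "You are a strict judge. Given a prompt and two responses, answer with"
        " only 'A' or 'B' for the better response.\n\n"
        f"Prompt: {prompt}\n\nResponse A: {a}\n\nResponse B: {b}"
    )
    out = client.chat_completion(messages=[{"role": "user", "content": rubric}], max_tokens=4)
    return out.choices[0].message.content.strip()

row = aya[0]
candidate = generate_response(row["inputs"])
winner = judge(row["inputs"], row["targets"], candidate)
chosen, rejected = (row["targets"], candidate) if winner.startswith("A") else (candidate, row["targets"])
print({"prompt": row["inputs"], "chosen": chosen, "rejected": rejected})
```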

✅ In some (perhaps many) languages, human responses may be better than LLM ones, but we may want to check that assumption per language.
🚀 We use Argilla's distilabel library to push the data to Argilla for validation. This also lets us determine whether an LLM judge is effective for different languages.

As an example of what this pipeline produces:
- DIBT/aya_dutch_dpo, a DPO-style dataset for Dutch built using Llama 3 as the generator/judge LM.
- An annotation Space that anyone with an HF account can contribute to: https://dibt-demo-argilla-space.hf.space/dataset/924ef8a8-a447-4563-8806-0e2a668a5314/annotation-mode?page=1&status=pending

As part of Data is Better Together, we want to build more DPO datasets. Join us here: https://github.com/huggingface/data-is-better-together#4-dpoorpo-datasets-for-more-languages 🤗
abhishek
posted an update 1 day ago
🚀🚀🚀🚀 Introducing AutoTrain Configs! 🚀🚀🚀🚀
Now you can train models using YAML config files! 💥 These configs are easy to understand and not at all overwhelming, so even someone with almost zero machine learning knowledge can train state-of-the-art models without writing any code. Check out the example configs in the config directory of the autotrain-advanced GitHub repo, and feel free to share configs by creating a pull request 🤗
Github repo: https://github.com/huggingface/autotrain-advanced