Hackathon Somos NLP 2023: Los LLMs hablan Español

community

AI & ML interests

Democratizar el PLN en español creando recursos abiertos en nuestro idioma🚀

Recent Activity

somosnlp-hackathon-2023's activity

alvarobartt 
posted an update 4 months ago
view post
Post
2836
🤗 Serving Meta Llama 3.1 405B on Google Cloud is now possible via the Hugging Face Deep Learning Containers (DLCs) for Text Generation Inference (TGI)

In this post, we showcase how to deploy https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 on an A3 instance with 8 x H100 GPUs on Vertex AI

Thanks to the Hugging Face DLCs for TGI and Google Cloud Vertex AI, deploying a high-performance text generation container for serving Large Language Models (LLMs) has never been easier. And we’re not going to stop here – stay tuned as we enable more experiences to build AI with open models on Google Cloud!

Read the full post at https://huggingface.co/blog/llama31-on-vertex-ai
mrm8488 
posted an update 6 months ago
view post
Post
4649
🚨Exciting news for the Multilingual Synthetic Data Community!🚨

I’ve taken inspiration from the MAGPIE paper on Llama-3-8B-instruct and extended its capabilities. Here’s what’s new!

🗞 The MAGPIE paper showcased that if you use the instruction-tuned version (Llama-3-8B-instruct) to generate synthetic instructions and then fine-tune the base version (Llama-3-8B) on this dataset, you can improve even the it-tuned version

🤔 While reading a script by Sebastian Raschka, PhD, I wondered: Could these advancements be replicated in other languages? Specifically, could they benefit non-English datasets?

🎉 And the answer is YES! At least for Spanish. I've successfully adapted the techniques for Spanish, proving the model's flexibility and multilingual capabilities.

👩‍💻 To make this accessible, I created a basic script (heavily inspired by the Sebastian Raschka one) that allows you to generate similar datasets using ollama models (initially phi and llama3) automatically and upload it to the Hugging Face Hub!
[Script](https://gist.github.com/mrm8488/4650a5e3cc45523798a527a3446eb312)


🔍 Explore the datasets 📚 generated using our new script!

- [Llama-3-8B](https://huggingface.co/datasets/mrm8488/dataset_llama3_5000_samples_es_4231_filtered)
- [Phi-3-medium](https://huggingface.co/datasets/mrm8488/dataset_phi3-medium_5000_samples_es_3906_filtered)
- [Phi-3-mini](https://huggingface.co/datasets/mrm8488/dataset_phi3_5000_samples_es_3282_filtered)


Note: These datasets have basic filtering. Apply additional quality filters before using them to fine-tune large language models.

Inspiration and base script:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb
https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/
·
alvarobartt 
posted an update 8 months ago
view post
Post
3086
🔥 Prometheus 2 was recently released by Kaist AI as an alternative and closely mirroring both human and GPT-4 evaluation, and surpassing the former Prometheus!

prometheus-eval/prometheus-7b-v2.0
prometheus-eval/prometheus-8x7b-v2.0

🌬️Fine-tuned on top of mistralai/Mistral-7B-Instruct-v0.2 and mistralai/Mixtral-8x7B-Instruct-v0.1
🗂️The datasets used for fine-tuning have been publicly released i.e. prometheus-eval/Feedback-Collection and prometheus-eval/Preference-Collection
🤝🏻Unified LM evaluator for absolute (a single prompt-completion pair) and relative (two completions for a given prompt) due to model merging
❌No longer needs a mandatory reference / golden answer, but can still be provided optionally
🔝Surpasses the former version of Prometheus, and has a high correlation with human, GPT-4, and Claude 3 Opus scores when evaluating LMs
📝Apache 2.0 license

Long-story short, an amazing job from Kaist AI bridging the gap with LLM evaluators other than proprietary and bigger models!

This week at Argilla, we decided to add a new task to use Prometheus 2 as an LLM evaluator using distilabel, so we implemented PrometheusEval.

😱 Using PrometheusEval running their 7B variant with vLLM in a single L40 on top of HuggingFaceH4/instruction-dataset, we got the 327 existing prompt-completion pairs evaluated and pushed to the Hub in less than 2 minutes!

Find the generated dataset and the code at distilabel-internal-testing/instruction-dataset-prometheus
  • 1 reply
·
mrm8488 
posted an update 8 months ago
view post
Post
5523
Working on a concept GPT-2 (small) that uses KANs instead of MLPs.
The ckpt and training code will be soon on the hub.
·
alvarobartt 
posted an update 8 months ago
view post
Post
2761
🦫 We have just released argilla/Capybara-Preferences in collaboration with Kaist AI ( @JW17 , @nlee-208 ) and Hugging Face ( @lewtun )

A new synthetic preference dataset built using distilabel on top of the awesome LDJnr/Capybara from @LDJnr

The current dataset combines the already generated alternative completions from argilla/distilabel-capybara-dpo-7k-binarized, while also adding the remaining ones using the same approach!

Here are some key features on how we built it:

- 🧹 Duplicate removal, keeping the conversation besides the last assistant response, and some slight pre-processing

- 🤖 Generation of alternative completions for the existing conversations (last turn only) with: mlabonne/NeuralBeagle14-7B, argilla/notus-7b-v1, and teknium/OpenHermes-2.5-Mistral-7B

- 👨🏻‍🏫 Running UltraFeedback via GPT-4 to generate the critique i.e. ratings and rationales, for the last assistant responses

- 🎉 Finally, we selected the chosen and rejected responses based on their UltraFeedback score, and applied some slight post-processing!

Sounds simple right? Start building your own synthetic datasets with https://github.com/argilla-io/distilabel already!
osanseviero 
posted an update 8 months ago
view post
Post
9780
Diaries of Open Source. Part 15 🤗

🕵️‍♀️Idefics 2 is out, a multimodal open-source model with very nice capabilities
Models, demo, and datasets: HuggingFaceM4/idefics2-661d1971b7c50831dd3ce0fe
Blog: https://hf.co/blog/idefics2

💾Snowflake released snowflake-arctic-embed, a family of powerful small embedding models
Model: Snowflake/snowflake-arctic-embed-m
Blog: https://www.snowflake.com/blog/introducing-snowflake-arctic-embed-snowflakes-state-of-the-art-text-embedding-family-of-models/

✨Pile-T5, EleutherAI's T5 model trained on 2T tokens
Blog: https://blog.eleuther.ai/pile-t5/
Models: EleutherAI/pile-t5-65a76a0d0022dd270b385a66
GitHub: https://github.com/EleutherAI/improved-t5

🤖CodeQwen1.5-7B base and chat models. Models trained on 3T tokens strong benchmark results for code generation, editing and SQL
Blog post: https://qwenlm.github.io/blog/codeqwen1.5/
Demo: Qwen/CodeQwen1.5-7b-Chat-demo
Models: Qwen/CodeQwen1.5-7B and Qwen/CodeQwen1.5-7B-Chat

Misc
🦉 DocOwl1.5: Unified Stucture Learning for OCR-free Document Understanding mPLUG/DocOwl
👀Cerule - a tiny Vision LM model Tensoic/Cerule-v0.1
ChemLLM - a LLM for chemistry and molecule science ⚗️https://hf.co/AI4Chem/ChemLLM-7B-Chat-1.5-DPO
Distil Whisper Large
📝New pdf/OCR datasets with 19 samples pixparse/pdf-document-ocr-datasets-660701430b0346f97c4bc628
🔥Gretel AI high quality text-to-sql synthetic dataset gretelai/synthetic_text_to_sql
·
osanseviero 
posted an update 9 months ago
view post
Post
9238
Diaries of Open Source. Part 14 🤗

🔥CohereForAI releases Command R+, an open 104B model with:
- Tool usage capabilities
- Specialized in RAGs
- Multilingual
It's one of the first models to surpass GPT-4 in the lmsys arena, check it out!
Model: CohereForAI/c4ai-command-r-plus
Official demo: https://hf.co/spaces/CohereForAI/c4ai-command-r-plus
Quantized: CohereForAI/c4ai-command-r-plus-4bit

🎉Google releases a new version of their Gemma instruct models, with improved quality, nicer to converse, and a fancier RL algorithm. The model is similar to Llama 2 70B in the Chat Arena!
Models: google/gemma-release-65d5efbccdbb8c4202ec078b
Try it out in HuggingChat https://hf.co/chat/models/google/gemma-1.1-7b-it

🪄VoiceCraft, a speech editing and TTS SOTA open model
Paper: VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild (2403.16973)
Model: pyp1/VoiceCraft

💻Google released CodeGemma, a family of code generation, completion, and chat models
Blog post: https://hf.co/blog/codegemma
Models: google/codegemma-release-66152ac7b683e2667abdee11
Report: https://storage.googleapis.com/deepmind-media/gemma/codegemma_report.pdf

Misc models:
🦖T-Rex2, a very powerful object detection model for many applications https://github.com/IDEA-Research/T-Rex
👀 CT-RATE : A 3D dataset paired with text reports ibrahimhamamci/CT-RATE
🐙Octopus v2: a Gemma-based model trained for Android API - extremely fast, better than Llama+RAG, great results NexaAIDev/Octopus-v2
  • 2 replies
·
osanseviero 
posted an update 9 months ago
view post
Post
2279
Diaries of Open Source. Part 13 🤗

🤏Two different bitnet 1.5 open-source replications
Original paper: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (2402.17764)
1bitllm experiment: https://hf.co/blog/joey00072/experiments-with-bitnet-1-5
NousResearch experiment NousResearch/OLMo-Bitnet-1B

🥳Tiny and large multimodal models great for embeddings
GitHub: https://github.com/unum-cloud/uform
Encoders: https://hf.co/collections/unum-cloud/multimodal-encoders-660553903617c5297eb16838
ONNX weights: https://hf.co/collections/unum-cloud/uform-vl-english-large-onnx-66055a57c182d846f3bc1949

📜 SMPLer-X: Expressive Human Pose and Shape Estimation
Project website: https://caizhongang.com/projects/SMPLer-X/
Demo: caizhongang/SMPLer-X
Paper: SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation (2309.17448)

🧙GeoWizard: 3D Geometry Estimation
Project website: https://fuxiao0719.github.io/projects/geowizard/
Demo: lemonaddie/geowizard

Misc models and datasets
- Dolphin-2.8-mistral-7b-v0.2 cognitivecomputations/dolphin-2.8-mistral-7b-v02
- Hermes-2-Pro-11B, a self-frankenmerge 11B variant mattshumer/Hermes-2-Pro-11B
- Large conversational dataset based on Usenet data in the Italian language mii-community/UsenetArchiveIT-conversations
  • 3 replies
·
osanseviero 
posted an update 9 months ago
view post
Post
3527
Diaries of Open Source. Part 12 🤗

🚀Alibaba releases Qwen1.5-MoE-A2.7B, an interesting MoE with 2.7B activated parameters and 64 experts
Blog https://qwenlm.github.io/blog/qwen-moe/
Demo: Qwen/qwen1.5-MoE-A2.7B-Chat-demo
Models: https://hf.co/Qwen
GitHub: https://github.com/QwenLM/Qwen1.5

🎵VoiceCraft, SOTA speech editing and text to speech
GitHub: https://github.com/jasonppy/VoiceCraft
Model: pyp1/VoiceCraft

🐍 AI21Labs release Jamba, an SSM-Transformer, pretrained MoE which allows a large context window (256K) and high throughput
Blog https://www.ai21.com/blog/announcing-jamba
Model ai21labs/Jamba-v0.1

✨ Berkeley releases Starling-LM-7B, an RLHF-ed model, and -RM-34B, a Yi-based reward model very good for its size
Starling Beta: Nexusflow/Starling-LM-7B-beta
Starling RM: Nexusflow/Starling-RM-34B

🖥️Stability releases Stable Code Instruct 3B, an instruct model for code generation
Blog: https://stability.ai/news/introducing-stable-code-instruct-3b
Demo: stabilityai/stable-code-instruct-3b
Report: https://stability.ai/s/Stable_Code_TechReport_release.pdf

📚Common Corpus: the largest public domain dataset for training LLMs
Blog: https://hf.co/blog/Pclanglais/common-corpus
Dataset: https://hf.co/collections/PleIAs/common-corpus-65d46e3ea3980fdcd66a5613

Misc:
⚡GaLore: a very memory-efficient technique that allows pretraining models in consumer GPUs https://hf.co/blog/galore
Moirai
📈Moirai, foundation models for time series forecasting https://hf.co/collections/Salesforce/moirai-10-r-models-65c8d3a94c51428c300e0742
🔥 Mistral-ORPO-Capybara-7K, a high-quality Mistral fine-tune using ORPO, a new alignment technique kaist-ai/mistral-orpo-capybara-7k
🤯APISR, an anime super-resolution upscaling model HikariDawn/APISR
·
osanseviero 
posted an update 9 months ago
view post
Post
2068
Diaries of Open Source. Part 11 🚀

🚀Databricks release DBRX, potentially the best open access model! A 132B Mixture of Experts with 36B active params and trained on 12 trillion tokens
Blog: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Base and instruct models: databricks/dbrx-6601c0852a0cdd3c59f71962
Demo: databricks/dbrx-instruct

🤏1-bit and 2-bit quantization exploration using HQQ+
Blog post: https://mobiusml.github.io/1bit_blog/
Models: https://hf.co/collections/mobiuslabsgmbh/llama2-7b-hqq-6604257a96fc8b9c4e13e0fe
GitHub: https://github.com/mobiusml/hqq

📚Cosmopedia: a large-scale synthetic dataset for pre-training - it includes 25 billion tokens and 30 million files
Dataset: HuggingFaceTB/cosmopedia
Blog: https://hf.co/blog/cosmopedia

⭐Mini-Gemini: multi-modal VLMs, from 2B to 34B
Models: https://hf.co/collections/YanweiLi/mini-gemini-6603c50b9b43d044171d0854
Paper: Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (2403.18814)
GitHub: https://github.com/dvlab-research/MiniGemini

🔥VILA - On Pre-training for VLMs
Models: Efficient-Large-Model/vila-on-pre-training-for-visual-language-models-65d8022a3a52cd9bcd62698e
Paper: VILA: On Pre-training for Visual Language Models (2312.07533)

Misc
👀 FeatUp: a framework for image features at any resolution: mhamilton723/FeatUp FeatUp: A Model-Agnostic Framework for Features at Any Resolution (2403.10516)
🍞ColBERTus Maxiums, a colbertialized embedding model mixedbread-ai/mxbai-colbert-large-v1
🖌️Semantic Palette, a new drawing paradigm ironjr/SemanticPalette
🧑‍⚕️HistoGPT, a vision model that generates accurate pathology reports marr-peng-lab/histogpt https://www.medrxiv.org/content/10.1101/2024.03.15.24304211v1
·
osanseviero 
posted an update 9 months ago
view post
Post
1616
Diaries of Open Source. Part 10 🚀

🌼Marigold-LCM: A super fast SOTA Depth Estimator
Demo: prs-eth/marigold-lcm
Original paper: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation (2312.02145)
Model: https://hf.co/prs-eth/marigold-lcm-v1-0

🌟Quiet-STaR: A self-teaching technique via internal monologue
Paper: Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking (2403.09629)
GitHub: https://github.com/ezelikman/quiet-star
Tweetutorial: https://twitter.com/ericzelikman/status/1768663835106513041

🖼️ WebSight v0.2: A image-to-code dataset containing tailwind CSS, images in screenshots, and more!
Dataset: HuggingFaceM4/WebSight
Paper: Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset (2403.09029)
Blog: https://hf.co/blog/websight

🕵️Agent-FLAN - effective agent tuning for LLMs
Paper: Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models (2403.12881)
Model: internlm/Agent-FLAN-7b
Dataset: internlm/Agent-FLAN
Website: https://internlm.github.io/Agent-FLAN/

🔥HPT, a family of multimodal LLMs from HyperGAI
Blog post: https://hypergai.com/blog/introducing-hpt-a-family-of-leading-multimodal-llms
Model: HyperGAI/HPT
GitHub: https://github.com/hyperGAI/HPT

🌏Models and datasets around the world
- Tess-70B, a MiQu-70B fine-tune with high-quality data migtissera/Tess-70B-v1.6
- UNI, a model trained on 100 million pathology images from 100k+ slides MahmoodLab/UNI
- CONCH, a VLM trained on 1.17 million pathology image-text pairs MahmoodLab/CONCH
·
osanseviero 
posted an update 9 months ago
view post
Post
3268
Diaries of Open Source. Part 9!

⏰Amazon releases Chronos, a family of models for time series
Base model: amazon/chronos-t5-large
Paper: Chronos: Learning the Language of Time Series (2403.07815)
Models: https://huggingface.co/collections/amazon/chronos-models-65f1791d630a8d57cb718444

💡ORPO Alignment: align without a reference model nor SFT!
Paper: ORPO: Monolithic Preference Optimization without Reference Model (2403.07691)
Models: kaist-ai/orpo-65efef87544ba100aef30013
GitHub: https://github.com/xfactlab/orpo

🇺🇳Cohere releases 250M Wikipedia Embeddings in 300+ languages
Data: Cohere/wikipedia-2023-11-embed-multilingual-v3
Announcement: https://twitter.com/Nils_Reimers/status/1767891859207057618

🧬SegmentNT: a LLM for annotating DNA at single nucleotide resolution
Models: InstaDeepAI/segmentnt-65eb4941c57808b4a3fe1319
GitHub repo: https://github.com/instadeepai/nucleotide-transformer
Paper: https://www.biorxiv.org/content/10.1101/2024.03.14.584712v1

🚀DynamiCrafter: video generation models for interpolation and looping are out!
Project page: https://doubiiu.github.io/projects/DynamiCrafter/
GitHub: https://github.com/Doubiiu/DynamiCrafter
Demo: Doubiiu/DynamiCrafter_interp_loop

🚀Stanford releases Anticipatory Music Transformer:
GitHub: https://github.com/jthickstun/anticipation/
Models: https://hf.co/stanford-crfm
Original blog announcement: https://crfm.stanford.edu/2023/06/16/anticipatory-music-transformer.html
  • 2 replies
·
osanseviero 
posted an update 9 months ago
view post
Post
2527
Diaries of Open Source. Part 8!

🤯CRM: Image-to-3D Textured Mesh
Demo: Zhengyi/CRM
Model: Zhengyi/CRM
Project page: https://ml.cs.tsinghua.edu.cn/~zhengyi/CRM/
Paper: CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model (2403.05034)

🤏Half Quadratic Quantization: super-fast quantization of very large models
Blog post: https://mobiusml.github.io/hqq_blog/
Colab: https://colab.research.google.com/drive/1cG_5R_u9q53Uond7F0JEdliwvoeeaXVN?usp=sharing
Repo: https://github.com/mobiusml/hqq

🤗GemMoE -Gemma + MoE
Model: Crystalcareai/GemMoE-Base-Random
Collection: Crystalcareai/gemmoe-65f11f4922af97ebe9943591

👀VeCLIP and MOFI, new 0-shot and image retrieval models by Apple, are now open-source!
GitHub: https://github.com/apple/ml-veclip/ and https://github.com/apple/ml-mofi
VeCLIP paper: From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions (2310.07699)
MOFI paper: MOFI: Learning Image Representations from Noisy Entity Annotated Images (2306.07952)

⚡SPIN: Recipe for alignment with very little data
Collection: argilla/dibt-prompt-collective-spin-65ef59062518776024395fc3
Tweetutorial: https://twitter.com/argilla_io/status/1767608154697699455

👀ViT Prisma - an interoperability library for vision models
GitHub: https://github.com/soniajoseph/ViT-Prisma

☕OpenLRM: full model and training code are open-sourced
Codebase: https://github.com/3DTopia/OpenLRM
Demo: zxhezexin/OpenLRM
Models: https://huggingface.co/zxhezexin

⚗️Oxford releases an extensive PEFT evaluation for bio models
Model: NTaylor/bio-mobilebert-mimic-mp-lora
GitHub: https://github.com/nlpie-research/efficient-ml
Paper: Efficiency at Scale: Investigating the Performance of Diminutive Language Models in Clinical Tasks (2402.10597)

🌍Data and models around the world
Hermes 2 Pro 7B: an upgraded Nous Hermes 2 model with strong function calling and JSON capabilities NousResearch/Hermes-2-Pro-Mistral-7B
Navarasa-2.0 : Gemma fine-tuned in 15 indian language Telugu-LLM-Labs/navarasa-65f5e6ffdf29f02c6d7767ce
·
osanseviero 
posted an update 9 months ago
view post
Post
1883
Diaries of Open Source. Part 7!

🔥Sakana releases Evolutionary Model Merge
Blog post: https://sakana.ai/evolutionary-model-merge/
Paper: Evolutionary Optimization of Model Merging Recipes (2403.13187)
Models and demo: https://hf.co/SakanaAI

🍞MixedBread releases new SoTA sentence embedding model
Announcement: https://www.mixedbread.ai/blog/mxbai-embed-large-v1
Model: mixedbread-ai/mxbai-embed-large-v1

🎥VideoMamba, a Mamba-based model for video understanding
Blog: https://hf.co/blog/vladbogo/video-mamba
Demo: OpenGVLab/VideoMamba
Model: OpenGVLab/VideoMamba

🔍 MathVerse, a visual math benchmark for multimodal LLMs
Paper page: https://mathverse-cuhk.github.io/
Dataset: AI4Math/MathVerse
Paper: MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? (2403.14624)

🧠GraphWiz, a family of instruct-tuned LLMs to solve graph problems
Repos: https://hf.co/GraphWiz
Paper: GraphWiz: An Instruction-Following Language Model for Graph Problems (2402.16029)

🪆NLLB-SigLIP-MRL: a combination of NLLB and SigLIP trained with Matryoshka representation learning
Model: visheratin/nllb-siglip-mrl-large
Tweet: https://twitter.com/visheratin/status/1766643219909984734?s=46

🧍HDM and ProciGen: Template-free reconstruction of human-object interactions
Paper page: https://virtualhumans.mpi-inf.mpg.de/procigen-hdm/
Demo: xiexh20/HDM-interaction-recon
Models: xiexh20/HDM-models

🌎Models and data around the world
EagleX 7B, multi-lingual RNN-based model https://hf.co/spaces/recursal/EagleX-7B-1.7T-Gradio-Demo
Tamil LLM mervinpraison/tamil-large-language-model-7b-v1.0
  • 2 replies
·
osanseviero 
posted an update 9 months ago
view post
Post
1914
Diaries of Open Source. Part 6!

🏎️xAI releases Grok-1, a 314B MoE
Blog: https://x.ai/blog/grok-os
GH repo: https://github.com/xai-org/grok-1
Model: xai-org/grok-1

🕺MusicLang, a model for controllable music generation
Demo: musiclang/musiclang-predict
GH repo: https://github.com/musiclang/musiclang_predict

🔬BioT5: a family of models for biology and chemical text tasks
Base model: QizhiPei/biot5-base
Model for molecule captioning and design: QizhiPei/biot5-base-mol2text and QizhiPei/biot5-base-text2mol
GH Repo: https://github.com/QizhiPei/BioT5
Paper: BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations (2310.07276)

🤏Check out the AQLM and QMoE official weights from ISTA-DAS lab
Org: https://hf.co/ISTA-DASLab
Papers: Extreme Compression of Large Language Models via Additive Quantization (2401.06118) and QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models (2310.16795)

🚀Community releases
Einstein-v4-7B, a Mistral fine-tune on high-quality data Weyaxi/Einstein-v4-7B
IL-7B, a Misttral fine-tune merge for rheumatology cmcmaster/il_7b
Caselaw Access Project, a collaboration to digitalize 40 million US court decisions from 6.7 million cases from 360 years https://hf.co/datasets/TeraflopAI/Caselaw_Access_Project

🌍Data and models around the world
HPLT Monolingual, a dataset of 75 languages with over 40TB of data HPLT/hplt_monolingual_v1_2
OpenLLM Turkish Benchmarks & Leaderboard malhajar/openllmturkishleadboard-datasets-65e5854490a87c0f2670ec18 and malhajar/OpenLLMTurkishLeaderboard
Occiglot, a collaborative effort for European LLMs with an initial release of 7B models for French, German, Spanish, and Italian occiglot/occiglot-eu5-7b-v01-65dbed502a6348b052695e01
Guftagoo, a Hindi+Hinglish multi-turn conversational dataset Tensoic/gooftagoo
AryaBhatta-Orca-Maths-Hindi dataset GenVRadmin/Aryabhatta-Orca-Maths-Hindi
  • 1 reply
·
osanseviero 
posted an update 9 months ago
view post
Post
Diaries of Open Source. Part 5!

🤯Contextual KTO Mistral PairRM: this model combines iterative KTO, SnorkelAI DPO dataset, Allenai PairRM for ranking, Mistral for the base model, and is a very strong model with Claude 3 quality on AlpacaEval 2.0
Final model: ContextualAI/Contextual_KTO_Mistral_PairRM
Dataset: snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset
Leaderboard: https://tatsu-lab.github.io/alpaca_eval/
Base model: mistralai/Mistral-7B-Instruct-v0.2

🤏 tinyBenchmarks: Quick and cheap LLM evaluation!
Code: https://github.com/felipemaiapolo/tinyBenchmarks
Paper: tinyBenchmarks: evaluating LLMs with fewer examples (2402.14992)
Data: tinyBenchmarks/tinyMMLU

🎨Transformers.js 2.16 includes StableLM, speaker verification and diarization, and better chat templating. Try some fun demos!
- Xenova/video-object-detection
- Xenova/cross-encoder-web
- Xenova/the-tokenizer-playground

🏴‍☠️ Abascus Liberated-Qwen1.5-72B, a Qwen 72B-based model that strongly follows system prompts
Model: abacusai/Liberated-Qwen1.5-72B

👀Design2Code: benchmark of webpage screenshots to code
Data: SALT-NLP/Design2Code
Project https://salt-nlp.github.io/Design2Code/
Paper Design2Code: How Far Are We From Automating Front-End Engineering? (2403.03163)

🌎Data and models around the world
- One of the biggest Italian datasets https://hf.co/datasets/manalog/UsenetArchiveIT
- IndicLLMSuite: argest Pre-training and Instruction Fine-tuning dataset collection across 22 Indic languages ai4bharat/indicllmsuite-65ee7d225c337fcfa0991707
- Hebrew-Gemma-11B, the best base Hebrew model yam-peleg/Hebrew-Gemma-11B
- Komodo-7B, a family of multiple Indonesian languages LLMs Yellow-AI-NLP/komodo-7b-base

You can find the previous part at https://huggingface.co/posts/osanseviero/127895284909100
ehcalabres 
posted an update 9 months ago
view post
Post
🚀 Hello HF Posts World!

I'm excited to share in my first HF post that we at Neuraptic AI have released MAGNUM, the first open-source AI model designed to natively support any structured and unstructured data modality.

MAGNUM can learn a holistic representation of your business logic from any source of digital information—be it images, documents, emails, databases, audio, signals, and more. This rich context empowers it to deliver significantly more accurate answers.

If you want to know more about it, feel free to ask or read the paper here 🤗
A Modular End-to-End Multimodal Learning Method for Structured and Unstructured Data (2403.04866)

Have a nice week!