Training & Architectures - a sbarman25 Collection

Training & Architectures

updated Sep 21

Attention Is All You Need

Paper • 1706.03762 • Published Jun 12, 2017 • 44

Note 🔖 GPT-2: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Paper • 2307.08691 • Published Jul 17, 2023 • 8

Note 🔖 GH: https://github.com/Dao-AILab/flash-attention
🔖 TGI Docs: https://huggingface.co/docs/text-generation-inference
https://benjaminwarner.dev/2023/08/16/flash-attention-compile
🔖 Flash Attention-3: https://www.together.ai/blog/flashattention-3
Mixtral of Experts

Paper • 2401.04088 • Published Jan 8 • 159
Mistral 7B

Paper • 2310.06825 • Published Oct 10, 2023 • 47
Zephyr: Direct Distillation of LM Alignment

Paper • 2310.16944 • Published Oct 25, 2023 • 122
Llama 2: Open Foundation and Fine-Tuned Chat Models

Paper • 2307.09288 • Published Jul 18, 2023 • 242
Code Llama: Open Foundation Models for Code

Paper • 2308.12950 • Published Aug 24, 2023 • 22
Orca 2: Teaching Small Language Models How to Reason

Paper • 2311.11045 • Published Nov 18, 2023 • 70
OneLLM: One Framework to Align All Modalities with Language

Paper • 2312.03700 • Published Dec 6, 2023 • 20
WizardLM: Empowering Large Language Models to Follow Complex Instructions

Paper • 2304.12244 • Published Apr 24, 2023 • 13
The Falcon Series of Open Language Models

Paper • 2311.16867 • Published Nov 28, 2023 • 12
DocLLM: A layout-aware generative language model for multimodal document understanding

Paper • 2401.00908 • Published Dec 31, 2023 • 181
TeleChat Technical Report

Paper • 2401.03804 • Published Jan 8 • 8

Note 🔖Dataset: https://huggingface.co/datasets/Tele-AI/TeleChat-PTD
TinyLlama: An Open-Source Small Language Model

Paper • 2401.02385 • Published Jan 4 • 89
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Paper • 2402.03300 • Published Feb 5 • 69

Note Check their findings and reward models.
Foundation Models for Generalist Geospatial Artificial Intelligence

Paper • 2310.18660 • Published Oct 28, 2023 • 8

Note https://huggingface.co/ibm-nasa-geospatial
Chameleon: Mixed-Modal Early-Fusion Foundation Models

Paper • 2405.09818 • Published May 16 • 126

Note 🔖Input: (Text, Image) Output: (Text, Image)
Gemini: A Family of Highly Capable Multimodal Models

Paper • 2312.11805 • Published Dec 19, 2023 • 45

Note LATEST (Updated 2024): https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf
Gemini 1.5: https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf
OpenAI:
📜 GPT-4V System Card: https://cdn.openai.com/papers/GPTV_System_Card.pdf
📜 GPT-4 System Card: https://cdn.openai.com/papers/gpt-4-system-card.pdf
Anthropic:
🔖 Claude 3 Model Card: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
google/gemma-7b

Text Generation • Updated Jun 27 • 269k • 3.05k

Note 🔖 Series: https://huggingface.co/collections/google/gemma-release-65d5efbccdbb8c4202ec078b
🔖 Details: https://ai.google.dev/gemma/docs/model_card
🔖 Paper: https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf
World Model on Million-Length Video And Language With RingAttention

Paper • 2402.08268 • Published Feb 13 • 37

Note 🔖 https://largeworldmodel.github.io/ Context scaling: 1M tokens
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

Paper • 2402.13753 • Published Feb 21 • 112
Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models

Paper • 2312.17661 • Published Dec 29, 2023 • 13
An In-depth Look at Gemini's Language Abilities

Paper • 2312.11444 • Published Dec 18, 2023 • 1
Question Aware Vision Transformer for Multimodal Reasoning

Paper • 2402.05472 • Published Feb 8 • 8

Note Requires somewhat grounded data or product-specific knowledge.
Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Paper • 2312.00752 • Published Dec 1, 2023 • 138
Exponentially Faster Language Modelling

Paper • 2311.10770 • Published Nov 15, 2023 • 118
Training Transformers Together

Paper • 2207.03481 • Published Jul 7, 2022 • 5
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws

Paper • 2401.00448 • Published Dec 31, 2023 • 28
FP8-LM: Training FP8 Large Language Models

Paper • 2310.18313 • Published Oct 27, 2023 • 31
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Paper • 2402.17177 • Published Feb 27 • 88
m-a-p/ChatMusician

Text Generation • Updated Apr 8 • 334 • 116
VisionLLaMA: A Unified LLaMA Interface for Vision Tasks

Paper • 2403.00522 • Published Mar 1 • 44
Training Compute-Optimal Large Language Models

Paper • 2203.15556 • Published Mar 29, 2022 • 10
Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?

Paper • 2207.10551 • Published Jul 21, 2022
Recurrent Linear Transformers

Paper • 2310.15719 • Published Oct 24, 2023
Training Language Models to Self-Correct via Reinforcement Learning

Paper • 2409.12917 • Published Sep 19 • 134