The Apollo paper has a lot of experiments (they trained 84 models!) on what makes video LMs work ⏯️
Try the demo for the best setup here: https://huggingface.co/spaces/Apollo-LMMs/Apollo-3B
They evaluate sampling strategies, scaling laws for models and datasets, video representation and more!
> The authors find that design decisions validated on small models also scale properly when the model and dataset are scaled up; scaling the dataset has diminishing returns for smaller models
> They evaluate frame sampling strategies and find that FPS sampling is better than uniform sampling, with 8-32 tokens per frame being optimal (a sketch of the two strategies follows below)
> They also compare image encoders, trying a range of models from shape-optimized SigLIP to DINOv2, and find google/siglip-so400m-patch14-384 to be the most powerful 🔥
> They also compare freezing different parts of the models; training all stages with some parts frozen gives the best yield
They eventually release three models, where Apollo-3B outperforms most 7B models and Apollo-7B outperforms 30B models 🔥
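To make the sampling comparison concrete, here is a minimal sketch of the two strategies; the function names and the 2 FPS target are illustrative assumptions, not taken from the Apollo codebase.

```python
import numpy as np

def uniform_sample(num_frames: int, num_samples: int) -> np.ndarray:
    """Uniform sampling: pick a fixed number of frame indices spread evenly over the clip."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

def fps_sample(num_frames: int, video_fps: float, target_fps: float = 2.0) -> np.ndarray:
    """FPS sampling: keep a constant temporal rate, one frame every video_fps / target_fps frames."""
    step = video_fps / target_fps
    return np.arange(0, num_frames, step).round().astype(int)

# A 10-second clip at 30 fps: uniform sampling always returns 8 frames,
# while fps sampling returns ~20 frames (2 per second), so the frame (and token)
# budget grows with clip duration instead of being stretched over it.
print(uniform_sample(300, 8))   # 8 indices regardless of clip length
print(fps_sample(300, 30.0))    # 20 indices for a 10 s clip
```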
Multimodal 🖼️
> Google shipped PaliGemma 2, a new iteration of PaliGemma with more sizes: 3B, 10B and 28B, with pre-trained and captioning variants
> OpenGVLab released InternVL 2.5, seven new vision LMs in different sizes, with a SOTA checkpoint under MIT license ✨
> Qwen team at Alibaba released the base models of Qwen2VL in 2B, 7B and 72B checkpoints
LLMs 💬
> Meta released a new iteration of Llama 70B, Llama 3.3-70B, trained further
> EuroLLM-9B-Instruct is a new multilingual LLM for European languages with Apache 2.0 license 🔥
> Dataset: CohereForAI released GlobalMMLU, a multilingual version of MMLU covering 42 languages, with Apache 2.0 license
> Dataset: QwQ-LongCoT-130K is a new dataset to train reasoning models
> Dataset: FineWeb2 just landed with a multilinguality update! 🔥 nearly 8TB of pretraining data in many languages!
Image/Video Generation 🖼️
> Tencent released HunyuanVideo, a new photorealistic video generation model
> OminiControl is a new editing/control framework for image generation models like Flux
Audio
> Indic-Parler-TTS is a new text-to-speech model made by the community
New InternVL drop with a state-of-the-art 78B vision language model with MIT license 🔥 OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c
> The release comes with seven new vision LMs in different sizes, based on InternViT 300M/6B paired with Qwen2.5 (0.5B, 3B, 32B, 72B) and InternLM2 (8B, 7B, 20B)
> The 78B model pairs InternViT 6B with Qwen2.5-72B Instruct and can accomplish a variety of tasks
> Try it here: OpenGVLab/InternVL
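If you'd rather load one of the smaller checkpoints locally instead of using the demo, a rough sketch with 🤗 transformers could look like the following; the repo id, the trust_remote_code requirement and the .chat() call are assumptions based on my reading of the model cards, so check the official card for the exact API and image preprocessing.

```python
import torch
from transformers import AutoModel, AutoTokenizer

repo = "OpenGVLab/InternVL2_5-8B"  # assumed checkpoint name from the collection
model = AutoModel.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # the InternVL model class ships with the repo
).eval()
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

# Text-only turn as the simplest smoke test; image inputs need the repo's own
# tiling/preprocessing helpers, which are omitted here for brevity.
response = model.chat(
    tokenizer,
    None,                                        # pixel_values=None for a pure-text question
    "Describe what InternVL is in one sentence.",
    dict(max_new_tokens=64),
)
print(response)
```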
small but mighty 🔥 you can fine-tune SmolVLM on an L4 with a batch size of 4 and it will only take 16.4 GB VRAM 🫰🏻 with gradient accumulation, the simulated batch size is 16 ✨
I made a notebook that includes all the goodies: QLoRA, gradient accumulation and gradient checkpointing, with explanations of how they work: https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
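For reference, this is the rough shape of the memory tricks in the notebook; the hyperparameters and model class are simplified here (the notebook is the source of truth), and you still need a processor and data collator before actually calling a trainer.

```python
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

# QLoRA: quantize the frozen base weights to 4-bit and train small LoRA adapters on top.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct", quantization_config=bnb_config
)
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=8, target_modules="all-linear"))

training_args = TrainingArguments(
    output_dir="smolvlm-ft",
    per_device_train_batch_size=4,    # fits an L4 in roughly 16 GB VRAM
    gradient_accumulation_steps=4,    # 4 x 4 = simulated batch size of 16
    gradient_checkpointing=True,      # recompute activations in the backward pass to save memory
    bf16=True,
    num_train_epochs=1,
)
```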
🖼️ Multimodal
> At Hugging Face we released SmolVLM, a performant and efficient smol vision language model
> Show Lab released ShowUI-2B: a new vision-language-action model to build GUI/web automation agents 🤖
> Rhymes AI has released the base models of Aria: Aria-Base-64K and Aria-Base-8K with their respective context lengths
> ViDoRe team released ColSmolVLM: a new ColPali-like retrieval model based on SmolVLM
> Dataset: Llava-CoT-o1-Instruct: a new dataset labelled using the LLaVA-CoT multimodal reasoning model
> Dataset: LLaVA-CoT-100k, the dataset used to train LLaVA-CoT, released by its creators
💬 LLMs
> Qwen team released QwQ-32B-Preview, a state-of-the-art open-source reasoning model that broke the internet 🔥
> Alibaba has released Marco-o1, a new open-source reasoning model 🔥
> NVIDIA released Hymba 1.5B Base and Instruct, new state-of-the-art SLMs with a hybrid architecture (Mamba + transformer)
⏯️ Image/Video Generation
> Qwen2VL-Flux: a new image generation model based on the Qwen2VL image encoder, T5 and Flux for generation
> Lightricks released LTX-Video, a new DiT-based video generation model that can generate 24 FPS videos at 768x512 resolution ⏯️
> Dataset: Image Preferences is a new image generation preference dataset made with the DIBT community effort of Argilla 🏷️
Audio
> OuteAI released OuteTTS-0.2-500M, a new multilingual text-to-speech model based on Qwen-2.5-0.5B and trained on 5B audio prompt tokens
What a week! A recap of everything you missed: merve/nov-22-releases-673fbbcfc1c97c4f411def07
Multimodal ✨
> Mistral AI released Pixtral 124B, a gigantic open vision language model
> Llava-CoT (formerly known as Llava-o1) was released, a multimodal reproduction of the o1 model by PKU
> OpenGVLab released MMPR: a new multimodal reasoning dataset
> Jina has released Jina-CLIP-v2, 0.98B multilingual multimodal embeddings
> Apple released AIMv2, new SotA vision encoders
LLMs 🦙
> AllenAI dropped a huge release of models, datasets and scripts for Tülu, a family of models based on Llama 3.1 aligned with SFT, DPO and a new technique they developed called RLVR
> Jina has released embeddings-v3: new multilingual embeddings with longer context
> Hugging Face released SmolTalk: the synthetic dataset used to align SmolLM2 using supervised fine-tuning
> Microsoft released orca-agentinstruct-1M-v1: a gigantic instruction dataset of 1M synthetic instruction pairs
Image Generation 🖼️
> Black Forest Labs released Flux.1 Tools: four new models for different image modifications and two LoRAs to do image conditioning and better steer generations
Lastly, Hugging Face released a new library, Observers: a lightweight SDK for monitoring interactions with AI APIs and easily storing and browsing them. $ pip install observers
Apple released AIMv2, a family of state-of-the-art open-set vision encoders: apple/aimv2-6720fe1558d94c7805f7688c
> like CLIP, but they add a decoder and train on autoregression 🤯
> 19 open models come in 300M, 600M, 1.2B and 2.7B sizes with resolutions of 224, 336 and 448
> Load and use with 🤗 transformers
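For example, a minimal feature-extraction sketch; the checkpoint id is an assumption picked from the collection, and the models may need trust_remote_code depending on your transformers version.

```python
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

repo = "apple/aimv2-large-patch14-224"  # assumed checkpoint id from the AIMv2 collection
processor = AutoImageProcessor.from_pretrained(repo)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)

image = Image.new("RGB", (224, 224), "white")   # placeholder image; use a real photo in practice
inputs = processor(images=image, return_tensors="pt")
features = model(**inputs).last_hidden_state    # (1, num_patches, hidden_dim) patch-level features
print(features.shape)
```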
OmniVision-968M: a new local VLM for edge devices, fast & small but performant
a new vision language model with 9x fewer image tokens, super efficient
aligned with DPO for reducing hallucinations
Apache 2.0 license 🔥
Models
💻 Coding: Qwen team released two Qwen2.5-Coder checkpoints, 32B and 7B. Infly released OpenCoder: 1.5B and 8B coding models with instruction SFT'd versions and their datasets!
🖼️ Image/Video Gen: Alibaba vision lab released In-context LoRA: 10 LoRA models on different themes based on Flux. Also Mochi, the SOTA video generation model with Apache 2.0 license, now comes natively supported in diffusers (a rough loading sketch is at the end of this recap)
🖼️ VLMs/Multimodal: NexaAIDev released OmniVision-968M, a new vision language model aligned with DPO for reducing hallucinations; it also comes with GGUF checkpoints. Microsoft released LLM2CLIP, a new CLIP-like model with a longer context window allowing complex text inputs and better search
🎮 AGI?: Etched released Oasis 500M, a diffusion-based open world model that takes keyboard input and outputs gameplay 🤯
Datasets
Common Corpus: a text dataset with 2T tokens under a permissive license, covering EN/FR across various sources: code, science, finance, culture
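Coming back to the Mochi item above, here is a rough sketch of what the native diffusers integration looks like; the pipeline class, checkpoint id and generation settings are my assumptions from the release, and the full model needs offloading (or multiple GPUs) to run at all.

```python
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

# "genmo/mochi-1-preview" is the assumed checkpoint id for the open Mochi weights.
pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()   # offload submodules to CPU so a single GPU can hold the pipeline

frames = pipe(
    prompt="a red panda walking through bamboo, cinematic lighting",
    num_frames=49,                # illustrative settings, not the recommended defaults
    num_inference_steps=28,
).frames[0]
export_to_video(frames, "mochi_sample.mp4", fps=30)
```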