a new experimental model that unlocks stronger reasoning capabilities and shows its thoughts. The model plans (with thoughts visible), can solve complex problems with Flash speeds, and more
🎯The space documents content from the input image along with standardized plain text. It includes adjustment tools with over 30 font styles, output in PDF and DOCX formats, text alignment, font size adjustments, and line spacing modifications.
📄PDFs are rendered using the ReportLab library.
The paper has a lot of experiments (they trained 84 models!) about what makes the video LMs work ⏯️
Try the demo for best setup here https://huggingface.co/spaces/Apollo-LMMs/Apollo-3B
They evaluate sampling strategies, scaling laws for models and datasets, video representation and more!
> The authors find that design decisions that work for small models also hold when the model and dataset are scaled up 📈 Scaling the dataset has diminishing returns for smaller models
> They evaluate frame sampling strategies and find that FPS sampling is better than uniform sampling, and that 8-32 tokens per frame is optimal
> They also compare image encoders, trying a range of models from shape-optimized SigLIP to DINOv2, and find google/siglip-so400m-patch14-384 to be the most powerful 🔥
> They also compare freezing different parts of the models; training all stages with some parts frozen gives the best results
They eventually release three models, where Apollo-3B outperforms most 7B models and Apollo-7B outperforms 30B models 🔥
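To make the FPS vs uniform sampling comparison above concrete, here is a minimal sketch of how the two strategies pick frame indices. This is not Apollo's code: the function names, the 2 FPS target, and the 16 tokens per frame are all assumptions chosen for illustration (16 sits inside the 8-32 range the post mentions).

```python
import math


def uniform_sample(num_frames: int, num_samples: int) -> list[int]:
    """Uniform sampling: a fixed number of frames spread over the whole clip."""
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the frame at the middle of each equal-width segment.
    return [int(step * i + step / 2) for i in range(num_samples)]


def fps_sample(num_frames: int, video_fps: float, target_fps: float) -> list[int]:
    """FPS sampling: frames at a fixed rate in time, so longer clips get more frames."""
    if target_fps >= video_fps:
        return list(range(num_frames))
    stride = video_fps / target_fps
    count = math.ceil(num_frames / stride)
    return [int(i * stride) for i in range(count)]


def visual_token_budget(num_sampled_frames: int, tokens_per_frame: int) -> int:
    """Total visual tokens the language model sees for one clip."""
    return num_sampled_frames * tokens_per_frame


# A 10 s clip and a 60 s clip, both recorded at 30 FPS (hypothetical inputs).
short_clip, long_clip = 300, 1800

# Uniform sampling keeps 8 frames either way, so the time gap between
# sampled frames grows with clip length.
print(len(uniform_sample(short_clip, 8)), len(uniform_sample(long_clip, 8)))

# FPS sampling keeps the time gap constant (one frame every 0.5 s here),
# so the number of frames and tokens grows with clip length instead.
short_idx = fps_sample(short_clip, video_fps=30.0, target_fps=2.0)
long_idx = fps_sample(long_clip, video_fps=30.0, target_fps=2.0)
print(len(short_idx), len(long_idx))
print(visual_token_budget(len(long_idx), tokens_per_frame=16))
```

The trade-off the sketch shows: uniform sampling fixes the token budget but changes the apparent speed of motion from clip to clip, while FPS sampling keeps timing consistent at the cost of a budget that scales with duration.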
Multimodal 🖼️
> Google shipped PaliGemma 2, a new iteration of PaliGemma with more sizes: 3B, 10B and 28B, with pre-trained and captioning variants 👏
> OpenGVLab released InternVL2.5, seven new vision LMs in different sizes, with a SOTA checkpoint under MIT license ✨
> Qwen team at Alibaba released the base models of Qwen2VL with 2B, 7B and 72B ckpts
LLMs 💬
> Meta released a new iteration of Llama 70B, Llama3.3-70B, trained further
> EuroLLM-9B-Instruct is a new multilingual LLM for European languages with Apache 2.0 license 🔥
> Dataset: CohereForAI released GlobalMMLU, a multilingual version of MMLU in 42 languages with Apache 2.0 license
> Dataset: QwQ-LongCoT-130K is a new dataset to train reasoning models
> Dataset: FineWeb2 just landed with a multilinguality update! 🔥 nearly 8TB of pretraining data in many languages!
Image/Video Generation 🖼️
> Tencent released HunyuanVideo, a new photorealistic video generation model
> OminiControl is a new editing/control framework for image generation models like Flux
Audio 🔊
> Indic-Parler-TTS is a new text2speech model made by the community
New InternVL drop with a state-of-the-art 78B vision language model with MIT license 🔥 OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c
The release comes with seven new vision LMs in different sizes, based on InternViT 300M/6B and Qwen2.5 (0.5B, 3B, 32B, 72B) and InternLM2 (8B, 7B, 20B)
The 78B model combines InternViT 6B and Qwen2.5-72B Instruct and can accomplish a variety of tasks 👏
Try here OpenGVLab/InternVL
🧪The datasets were prepared for a 3:2 aspect ratio by processing images of any dimension (width × height) in alignment with the adapter's concept. This involved using techniques such as magic expand, magic fill, or outpainting to fill in the remaining parts of the image and reach the 3:2 ratio, followed by training. This approach improved image quality, with outputs of up to 2 MB for detailed prompts, and reduced artifacts in images sized at 1280 × 832.
🎈This approach was used instead of cropping down the 2x or 3x zoomed positions in the actual image. It uses generative filling to adjust the image's aspect ratio proportionally within the dataset.
🔧I used Canva's Magic Expand, Firefly's Generative Fill, and Flux's Outpaint for aspect ratio adjustments.
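For anyone preparing a similar dataset, the expand-instead-of-crop step described above comes down to working out how much canvas a generative fill tool has to add. A minimal sketch, assuming the goal is the smallest 3:2 canvas that still contains the whole image; the helper name `expand_to_ratio` is hypothetical and is not part of any of the tools mentioned.

```python
def expand_to_ratio(width: int, height: int, ratio_w: int = 3, ratio_h: int = 2):
    """Return (new_width, new_height, pad_x, pad_y) for the smallest canvas
    with aspect ratio ratio_w:ratio_h that contains the original image.

    pad_x / pad_y are the total pixels to be outpainted along each axis.
    """
    if width * ratio_h >= height * ratio_w:
        # Image is already at least as wide as the target ratio: add height.
        new_width = width
        new_height = -(-width * ratio_h // ratio_w)  # integer ceiling division
    else:
        # Image is taller than the target ratio: add width.
        new_height = height
        new_width = -(-height * ratio_w // ratio_h)  # integer ceiling division
    return new_width, new_height, new_width - width, new_height - height


# A square 1024 x 1024 image needs 512 extra pixels of width to reach 3:2.
print(expand_to_ratio(1024, 1024))   # (1536, 1024, 512, 0)

# A 16:9 frame needs extra height instead.
print(expand_to_ratio(1920, 1080))   # (1920, 1280, 0, 200)
```

Nothing is cropped, so the original subject is preserved and only the added border is generated. Note that 1280 × 832 is close to 3:2 (about 1.54:1) rather than exact, so a final resize to that size changes proportions slightly.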
small but mighty 🔥 you can fine-tune SmolVLM on an L4 with a batch size of 4 and it will only take 16.4 GB VRAM 🫰🏻 also, with gradient accumulation, the simulated batch size is 16 ✨
I made a notebook that includes all the goodies: QLoRA, gradient accumulation, gradient checkpointing with explanations on how they work 💝 https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
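A quick illustration of why gradient accumulation "simulates" a larger batch: averaging the gradients of 4 micro-batches of 4 samples gives the same gradient as one batch of 16. This is a toy sketch in plain Python on a one-parameter linear model, not the notebook's code; the 4 accumulation steps are inferred from the batch sizes in the post (16 / 4), and all names and data are made up for illustration.

```python
def mse_grad(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of the mean squared error of y = w * x with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w: float, xs: list[float], ys: list[float],
                     micro_batch_size: int) -> float:
    """Accumulate micro-batch gradients, each scaled by 1 / accumulation steps."""
    steps = len(xs) // micro_batch_size
    total = 0.0
    for i in range(steps):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Only one micro-batch is "in memory" at a time; the gradient is
        # summed across steps and the optimizer would step once at the end.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


# 16 toy samples from y = 3x + 1 (hypothetical data).
xs = [float(i) for i in range(16)]
ys = [3.0 * x + 1.0 for x in xs]

full = mse_grad(0.5, xs, ys)                               # one batch of 16
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=4)  # 4 micro-batches of 4

print(full, accum)
assert abs(full - accum) < 1e-9
```

Peak memory is driven by the micro-batch of 4, while the optimizer sees the gradient of an effective batch of 16. The equivalence is exact here because the micro-batches are equal in size and the loss is a plain mean.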
🖼️ Multimodal
> At Hugging Face we released SmolVLM, a performant and efficient smol vision language model 💗
> Show Lab released ShowUI-2B: a new vision-language-action model to build GUI/web automation agents 🤖
> Rhymes AI has released the base model of Aria: Aria-Base-64K and Aria-Base-8K with their respective context lengths
> ViDoRe team released ColSmolVLM: a new ColPali-like retrieval model based on SmolVLM
> Dataset: Llava-CoT-o1-Instruct: new dataset labelled using the Llava-CoT multimodal reasoning model 📖
> Dataset: LLaVA-CoT-100k dataset used to train Llava-CoT, released by the creators of Llava-CoT 📕
💬 LLMs
> Qwen team released QwQ-32B-Preview, a state-of-the-art open-source reasoning model, broke the internet 🔥
> Alibaba has released Marco-o1, a new open-source reasoning model 💥
> NVIDIA released Hymba 1.5B Base and Instruct, the new state-of-the-art SLMs with hybrid architecture (Mamba + transformer)
⏯️ Image/Video Generation
> Qwen2VL-Flux: new image generation model based on Qwen2VL image encoder, T5 and Flux for generation
> Lightricks released LTX-Video, a new DiT-based video generation model that can generate 24 FPS videos at 768x512 res ⏯️
> Dataset: Image Preferences is a new image generation preference dataset made with the DIBT community effort of Argilla 🏷️
Audio
> OuteAI released OuteTTS-0.2-500M, a new multilingual text-to-speech model based on Qwen-2.5-0.5B trained on 5B audio prompt tokens
Fine-Textured [Polygon] Character 3D Design Renders 🙉
Adapters capable of providing better lighting control (Bn+, Bn-) and richer textures compared to previous sets require more contextual prompts for optimal performance.
The ideal settings are achieved at inference steps around 30–35, with the best dimensions being 1280 x 832 [ 3:2 ]. However, it also performs well with the default settings of 1024 x 1024 [ 1:1 ].