
Kuldeep Singh Sidhu

singhsidhukuldeep

AI & ML interests

😃 TOP 3 on HuggingFace for posts 🤗 Seeking contributors for a completely open-source 🚀 Data Science platform! singhsidhukuldeep.github.io

singhsidhukuldeep's activity

posted an update about 11 hours ago
Exciting breakthrough in LLM reasoning: Introducing "Thread of Thought" (ThoT) - a novel prompting strategy that revolutionizes how language models handle chaotic contexts!

Unlike traditional approaches that struggle with complex, interleaved information, ThoT enables LLMs to methodically segment and analyze extended contexts with remarkable precision. Here's how it works:

Technical Deep Dive:
- ThoT employs a two-step prompting mechanism:
1. Initial Analysis: Uses a template combining chaotic context (X) and query (Q) with a trigger sentence that initiates systematic reasoning.
2. Conclusion Refinement: Leverages the organized thought sequence to extract definitive answers.
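
To make the two steps concrete, here is a minimal sketch assuming an OpenAI-compatible Python client; the trigger sentence is paraphrased from the paper and the exact prompt wording is illustrative:

```python
# Minimal ThoT-style two-step prompting sketch (illustrative prompt wording).
from openai import OpenAI  # assumes an OpenAI-compatible client is available

client = OpenAI()

def thread_of_thought(chaotic_context: str, query: str, model: str = "gpt-3.5-turbo") -> str:
    # Step 1: initial analysis - walk through the chaotic context segment by segment.
    step1 = (
        f"{chaotic_context}\n\nQ: {query}\n\n"
        "Walk me through this context in manageable parts step by step, "
        "summarizing and analyzing as we go."
    )
    analysis = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": step1}]
    ).choices[0].message.content

    # Step 2: conclusion refinement - distill the organized thoughts into an answer.
    step2 = f"{step1}\n\n{analysis}\n\nTherefore, the answer is:"
    return client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": step2}]
    ).choices[0].message.content
```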

Implementation Details:
- Seamlessly integrates as a "plug-and-play" module with existing LLMs.
- Requires no model retraining or fine-tuning.
- Works with various prompting techniques and model architectures.

Performance Highlights:
- Outperformed traditional methods on PopQA and EntityQ datasets.
- Achieved 57.4% accuracy with GPT-3.5-turbo (vs. 48.2% for Chain-of-Thought).
- Demonstrated superior performance across model scales, from 7B to 70B parameters.

Key Applications:
- Retrieval-augmented generation.
- Multi-turn conversation responses.
- Complex reasoning tasks requiring information synthesis.

What makes it special: ThoT mirrors human cognitive processes by breaking down complex information into manageable segments while maintaining logical continuity - a game-changer for handling information-dense contexts.
posted an update 2 days ago
Good folks at @nvidia and @Tsinghua_Uni have released LLaMA-Mesh - a revolutionary approach to 3D content generation!

This innovative framework enables the direct generation of 3D meshes from natural language prompts while maintaining strong language capabilities.

Here is the Architecture & Implementation!

>> Core Components

Model Foundation
- If you haven't guessed it yet, it's built on the LLaMA-3.1-8B-Instruct base model
- Maintains original language capabilities while adding 3D generation
- Context length is set to 8,000 tokens

3D Representation Strategy
- Uses the OBJ file format for mesh representation
- Quantizes vertex coordinates into 64 discrete bins per axis
- Sorts vertices by z-y-x coordinates, from lowest to highest
- Sorts faces by the lowest vertex indices for consistency
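
A rough sketch of that quantize-and-sort step, assuming vertices already normalized to the unit cube (illustrative only, not the paper's actual code):

```python
import numpy as np

def quantize_and_sort(vertices: np.ndarray, faces: np.ndarray, bins: int = 64):
    """Quantize vertex coordinates into discrete bins and put the mesh in a canonical order.

    vertices: (V, 3) float array, assumed normalized to [0, 1] per axis.
    faces:    (F, 3) int array of vertex indices.
    """
    # Quantize each coordinate into one of `bins` integer levels per axis.
    quant = np.clip((vertices * bins).astype(int), 0, bins - 1)

    # Sort vertices by (z, y, x), lowest first; the last lexsort key (z) is primary.
    order = np.lexsort((quant[:, 0], quant[:, 1], quant[:, 2]))
    remap = np.empty(len(order), dtype=int)
    remap[order] = np.arange(len(order))
    quant = quant[order]
    faces = remap[faces]

    # Sort faces by their lowest vertex index for consistency.
    faces = faces[np.argsort(faces.min(axis=1))]
    return quant, faces
```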

Data Processing Pipeline
- Filters meshes to a maximum of 500 faces for computational efficiency
- Applies random rotations (0°, 90°, 180°, 270°) for data augmentation
- Generates ~125k mesh variations from 31k base meshes
- Uses Cap3D-generated captions for text descriptions

>> Training Framework

Dataset Composition
- 40% Mesh Generation tasks
- 20% Mesh Understanding tasks
- 40% General Conversation (UltraChat dataset)
- 8x training turns for generation, 4x for understanding

Training Configuration
- Deployed on 32 A100 GPUs (for Nvidia, this is literally in-house)
- 21,000 training iterations
- Global batch size: 128
- AdamW optimizer with a 1e-5 learning rate
- 30-step warmup with cosine scheduling
- Total training time: approximately 3 days (based on the paper)

This research opens exciting possibilities for intuitive 3D content creation through natural language interaction. The future of digital design is conversational!
posted an update 3 days ago
It's not every day you see the No. 1 ranked paper of the day open-sourcing a very powerful image editing app!

Fascinating to see MagicQuill - a groundbreaking interactive image editing system that makes precise photo editing effortless through advanced AI!

The system's architecture features three sophisticated components:

1. Editing Processor:
- Implements a dual-branch architecture integrated into a latent diffusion framework
- Utilizes PiDiNet for edge map extraction and content-aware per-pixel inpainting
- Features a specialized UNet architecture with zero-convolution layers for feature insertion
- Employs denoising score matching for training the control branch
- Processes both structural modifications via scribble guidance and color manipulation through downsampled color blocks
- Maintains pixel-level control through VAE-based latent space operations

2. Painting Assistor:
- Powered by a fine-tuned LLaVA multimodal LLM using Low-Rank Adaptation (LoRA)
- Trained on a custom dataset derived from Densely Captioned Images (DCI)
- Processes user brushstrokes through specialized Q&A tasks for add/subtract/color operations
- Features bounding box coordinate normalization for precise stroke localization
- Implements streamlined single-word/phrase outputs for real-time performance

3. Idea Collector:
- Built as a modular ReactJS component library
- Supports cross-platform deployment via HTTP protocols
- Compatible with Gradio and ComfyUI frameworks
- Features comprehensive layer management and parameter adjustment capabilities
- Implements real-time canvas updates and preview generation

The system outperforms existing solutions like SmartEdit and BrushNet in edge alignment and color fidelity while maintaining seamless integration with popular AI frameworks.

What are your thoughts on AI-powered creative tools?
replied to m-ric's post 3 days ago
replied to maxiw's post 4 days ago
posted an update 4 days ago
Sometimes, we forget that all these LLMs are trained on just raw text. At their core, they are simply text-completion models. Imagine a model that keeps on writing follow-up questions when you ask, "How to make pizza?" rather than answering you!

That's where Instruction Tuning comes in - it's a game-changer.

Instruction tuning has revolutionized how we interact with Large Language Models (LLMs), bridging the crucial gap between raw model capabilities and practical applications.

It's what transforms a GPT into ChatGPT!

Think of instruction tuning as teaching AI to "speak human" - it's the difference between a model that merely predicts the next words and one that truly understands and executes our intentions.

The real magic? It enables zero-shot learning, meaning models can tackle new tasks they've never encountered before, as long as the instructions are clear. This versatility is what makes modern AI assistants so powerful and user-friendly.
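
To see what this looks like in practice, here is a minimal data-preparation sketch: an (instruction, response) pair is formatted into a single training sequence and the loss is applied only to the response tokens. The prompt template and model are illustrative, not any specific recipe:

```python
# Minimal instruction-tuning data preparation: supervise only on the response tokens.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model for illustration

def build_example(instruction: str, response: str):
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    response_ids = tokenizer(response + tokenizer.eos_token, add_special_tokens=False).input_ids

    input_ids = torch.tensor(prompt_ids + response_ids)
    labels = input_ids.clone()
    labels[: len(prompt_ids)] = -100  # ignore loss on the prompt; learn only the response
    return {"input_ids": input_ids, "labels": labels}

example = build_example("How do I make pizza?", "Start by preparing the dough: ...")
```
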
replied to maxiw's post 5 days ago

Enough to make a grown man cry! 😃🤗

Anyway, for the whole of next week I will be posting the best papers (according to me) on ways to reduce hallucinations, one every day (seven in total)... cheers 😬

posted an update 10 days ago
Thinking about upgrading from Python 3.10 to 3.11? Here's why you should make the move - a deep technical breakdown that might convince you:

>> Performance Revolution
The performance improvements are staggering, with benchmarks showing 10-60% faster execution across different workloads. Let me break down the game-changing features:

>> Core Architecture Changes
Python 3.11's interpreter now uses statically allocated core modules, eliminating the multi-step loading process we've dealt with in 3.10. This means your applications will start 10-15% faster out of the gate.

>> Function Optimization
The redesigned frame objects are a thing of beauty - they've been stripped of unnecessary baggage, resulting in a 3-7% speedup for all function calls. But it gets better: function calls are now inlined, giving us a 1-3% boost, with recursive functions like Fibonacci seeing up to 1.7x improvement!

>> Adaptive Intelligence
The new Specializing Interpreter is perhaps the most exciting addition. Think of it as a lightweight JIT - it identifies hot code paths and optimizes them automatically.

The interpreter now automatically specializes math operations, array indexing, and even sequence unpacking based on actual usage patterns.

>> Exception Handling Revolution
My favorite feature? Zero-cost exceptions! Your try-except blocks no longer carry overhead when no exceptions occur. The code runs at full speed until an exception actually happens.
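
You can get a rough feel for this on your own machine with a tiny micro-benchmark; run it under both 3.10 and 3.11 and compare (absolute numbers will vary by hardware):

```python
# Rough micro-benchmark: cost of a try/except block when no exception is raised.
import timeit

def plain(x):
    return x + 1

def wrapped(x):
    try:
        return x + 1
    except ValueError:
        return 0

print("plain   :", timeit.timeit("plain(1)", globals=globals(), number=5_000_000))
print("wrapped :", timeit.timeit("wrapped(1)", globals=globals(), number=5_000_000))
```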

Ready to make the switch? These improvements aren't just numbers - they're real-world performance gains waiting to be unlocked in your codebase.
posted an update 14 days ago
Exciting Research Alert: Revolutionizing Dense Passage Retrieval with Entailment Tuning!

The good folks at HKUST have developed a novel approach that significantly improves information retrieval by leveraging natural language inference.

The entailment tuning approach consists of several key steps to enhance dense passage retrieval performance.

Data Preparation
- Convert questions into existence claims using rule-based transformations.
- Combine retrieval data with NLI data from SNLI and MNLI datasets.
- Unify the format of both data types using a consistent prompting framework.

Entailment Tuning Process
- Initialize the model using pre-trained language models like BERT or RoBERTa.
- Apply aggressive masking (β=0.8) specifically to the hypothesis components while preserving premise information.
- Train the model to predict the masked hypothesis tokens from the premise content.
- Run the training for 10 epochs using 8 GPUs, taking approximately 1.5-3.5 hours.
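
Here is a minimal sketch of that hypothesis-only masking step, assuming a BERT-style tokenizer; the exact preprocessing in the paper may differ:

```python
# Sketch: mask hypothesis tokens aggressively while leaving the premise intact.
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def mask_hypothesis(premise: str, hypothesis: str, beta: float = 0.8):
    premise_ids = tokenizer(premise, add_special_tokens=False).input_ids
    hypothesis_ids = tokenizer(hypothesis, add_special_tokens=False).input_ids

    masked, labels = [], []
    for tok in hypothesis_ids:
        if random.random() < beta:           # mask with probability beta = 0.8
            masked.append(tokenizer.mask_token_id)
            labels.append(tok)               # predict the original token here
        else:
            masked.append(tok)
            labels.append(-100)              # no loss on unmasked positions

    input_ids = ([tokenizer.cls_token_id] + premise_ids + [tokenizer.sep_token_id]
                 + masked + [tokenizer.sep_token_id])
    label_ids = [-100] * (len(premise_ids) + 2) + labels + [-100]
    return input_ids, label_ids
```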

Training Arguments for Entailment Tuning (Yes! They Shared Them)
- Use a learning rate of 2e-5 with 100 warmup steps.
- Set batch size to 128.
- Apply weight decay of 0.01.
- Utilize the Adam optimizer with beta values (0.9, 0.999).
- Maintain maximum gradient norm at 1.0.

Deployment
- Index passages using FAISS for efficient retrieval.
- Shard vector store across multiple GPUs.
- Enable sub-millisecond retrieval of the top-100 passages per query.
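
The indexing and top-100 retrieval piece, sketched with FAISS (stand-in random embeddings in place of a real encoder):

```python
# Sketch: index passage embeddings with FAISS and fetch the top-100 per query.
import faiss
import numpy as np

dim = 768                                                      # retriever embedding size
passage_embs = np.random.rand(10_000, dim).astype("float32")   # stand-in passage embeddings
query_embs = np.random.rand(4, dim).astype("float32")          # stand-in query embeddings

index = faiss.IndexFlatIP(dim)            # inner-product (dot-product) search
index.add(passage_embs)
scores, ids = index.search(query_embs, k=100)  # top-100 passage ids per query
```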

Integration with Existing Systems
- Insert entailment tuning between pre-training and fine-tuning stages.
- Maintain compatibility with current dense retrieval methods.
- Preserve existing contrastive learning approaches during fine-tuning.

Simple, intuitive, and effective!

This advancement significantly improves the quality of retrieved passages for question-answering systems and retrieval-augmented generation tasks.
posted an update 22 days ago
Good folks from @Microsoft have released an exciting breakthrough in GUI automation!

OmniParser - a game-changing approach for pure vision-based GUI agents that works across multiple platforms and applications.

Key technical innovations:
- Custom-trained interactable icon detection model using 67k screenshots from popular websites
- Specialized BLIP-v2 model fine-tuned on 7k icon-description pairs for extracting functional semantics
- Novel combination of icon detection, OCR, and semantic understanding to create structured UI representations
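
As a toy illustration of what such a structured UI representation might look like (hypothetical field names, not OmniParser's actual schema):

```python
# Toy structured-UI-representation sketch (hypothetical schema, not OmniParser's format).
from dataclasses import dataclass, asdict
import json

@dataclass
class UIElement:
    element_id: int
    bbox: tuple          # (x_min, y_min, x_max, y_max) in pixels
    interactable: bool   # from the icon-detection model
    ocr_text: str        # from OCR, empty if none
    description: str     # functional semantics from the captioning model

elements = [
    UIElement(0, (24, 12, 96, 44), True, "", "Button that opens the settings menu"),
    UIElement(1, (120, 12, 480, 44), False, "Search or type a URL", "Browser address bar"),
]

# The agent-facing prompt can then consume a compact JSON view of the screen.
print(json.dumps([asdict(e) for e in elements], indent=2))
```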

The results are impressive:
- Outperforms GPT-4V baseline by significant margins on the ScreenSpot benchmark
- Achieves 73% accuracy on Mind2Web without requiring HTML data
- Demonstrates a 57.7% success rate on AITW mobile tasks

What makes OmniParser special is its ability to work across platforms (mobile, desktop, web) using only screenshot data - no HTML or view hierarchy needed. This opens up exciting possibilities for building truly universal GUI automation tools.

The team has open-sourced both the interactable region detection dataset and icon description dataset to accelerate research in this space.

Kudos to the Microsoft Research team for pushing the boundaries of what's possible with pure vision-based GUI understanding!

What are your thoughts on vision-based GUI automation?
posted an update 24 days ago
Good folks from @Microsoft Research have just released bitnet.cpp, a game-changing inference framework that achieves remarkable performance gains.

Key Technical Highlights:
- Achieves speedups of up to 6.17x on x86 CPUs and 5.07x on ARM CPUs
- Reduces energy consumption by 55.4-82.2%
- Enables running 100B parameter models at human reading speed (5-7 tokens/second) on a single CPU

Features Three Optimized Kernels:
1. I2_S: Uses 2-bit weight representation
2. TL1: Implements 4-bit index lookup tables for every two weights
3. TL2: Employs 5-bit compression for every three weights
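
For intuition, here is a NumPy sketch of how a 2-bit representation can pack ternary weights, four per byte; this illustrates the idea behind I2_S and is not bitnet.cpp's actual kernel code:

```python
# Pack ternary weights {-1, 0, +1} into 2 bits each (4 weights per byte).
import numpy as np

def pack_ternary(weights: np.ndarray) -> np.ndarray:
    codes = (weights + 1).astype(np.uint8)           # map {-1,0,+1} -> {0,1,2}
    codes = codes.reshape(-1, 4)                      # 4 two-bit codes per byte
    return (codes[:, 0] | (codes[:, 1] << 2) |
            (codes[:, 2] << 4) | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_ternary(packed: np.ndarray) -> np.ndarray:
    codes = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
    return codes.reshape(-1).astype(np.int8) - 1      # back to {-1,0,+1}

w = np.random.randint(-1, 2, size=1024).astype(np.int8)
assert np.array_equal(w, unpack_ternary(pack_ternary(w)))
```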

Performance Metrics:
- Lossless inference with 100% accuracy compared to full-precision models
- Tested across model sizes from 125M to 100B parameters
- Evaluated on both Apple M2 Ultra and Intel i7-13700H processors

This breakthrough makes running large language models locally more accessible than ever, opening new possibilities for edge computing and resource-constrained environments.
posted an update 25 days ago
If you have ~300 GB of VRAM, you can run Mochi from @genmo

A SOTA model that dramatically closes the gap between closed and open video generation models.

Mochi 1 introduces revolutionary architecture featuring joint reasoning over 44,520 video tokens with full 3D attention. The model implements extended learnable rotary positional embeddings (RoPE) in three dimensions, with network-learned mixing frequencies for space and time axes.

The model incorporates cutting-edge improvements, including:
- SwiGLU feedforward layers
- Query-key normalization for enhanced stability
- Sandwich normalization for controlled internal activations

What is currently available?
The base model delivers impressive 480p video generation with exceptional motion quality and prompt adherence. Released under the Apache 2.0 license, it's freely available for both personal and commercial applications.

What's Coming?
Genmo has announced Mochi 1 HD, scheduled for release later this year, which will feature:
- Enhanced 720p resolution
- Improved motion fidelity
- Better handling of complex scene warping
posted an update 30 days ago
Looks like @Meta thinks we forgot they created PyTorch, so now they've open-sourced Lingua, a powerful and flexible library for training and running inference on large language models.

Things that stand out:

- Architecture: Pure PyTorch nn.Module implementation for easy customization.

- Checkpointing: Uses the new PyTorch distributed saving method (.distcp format) for flexible model reloading across different GPU configurations.

- Configuration: Utilizes data classes and YAML files for intuitive setup and modification (a small illustration of this pattern follows after this list).

- Profiling: Integrates with xFormers' profiler for automatic MFU and HFU calculation, plus memory profiling.

- Slurm Integration: Includes stool.py for seamless job launching on Slurm clusters.
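
To illustrate the dataclass-plus-YAML configuration pattern mentioned above (hypothetical field names, not Lingua's actual config schema):

```python
# Dataclass defaults overridden by values from a YAML file - the general pattern,
# with made-up fields rather than Lingua's actual schema.
from dataclasses import dataclass, asdict
import yaml  # pip install pyyaml

@dataclass
class TrainConfig:
    model_dim: int = 2048
    n_layers: int = 16
    lr: float = 3.0e-4
    steps: int = 60_000

yaml_text = """
model_dim: 4096
n_layers: 32
lr: 1.0e-4
"""

cfg = TrainConfig(**{**asdict(TrainConfig()), **yaml.safe_load(yaml_text)})
print(cfg)  # TrainConfig(model_dim=4096, n_layers=32, lr=0.0001, steps=60000)
```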

Some results from @Meta to show off:

- 1B parameter models trained on 60B tokens achieve strong performance across various NLP tasks.

- 7B parameter Mamba model (trained on 200B tokens) shows competitive results with Llama 7B on benchmarks like ARC, MMLU, and BBH.

If you're working on LLM research or looking to experiment with cutting-edge language model architectures, Lingua is definitely worth exploring.
posted an update about 1 month ago
Good folks at @Apple have developed a novel method called KV Prediction that significantly reduces the "time to first token" (TTFT) for on-device LLM inference.

Some highlights of the paper:

• Uses a small auxiliary transformer model to efficiently predict the KV cache of a larger base model
• Reduces TTFT by up to 4x while retaining 60-80% accuracy on benchmarks
• Achieves Pareto-optimal efficiency-accuracy trade-off compared to baselines
• Demonstrates 15-50% relative accuracy improvements on TriviaQA at equal TTFT FLOP budgets
• Shows up to 30% accuracy gains on HumanEval code completion at fixed TTFT FLOP counts
• Validated on Apple M2 Pro CPU, proving FLOP gains translate to real-world speedups


So, how's it done?

Based on the KV Prediction method described in the paper, here are the key steps for how it's done:

1. Choose a base model and an auxiliary model:
- The base model is a larger, pretrained transformer model that will be used for final generation.
- The auxiliary model is a smaller transformer model used to efficiently process the input prompt.

2. Design the KV predictor:
- Create a set of learned linear projections to map from the auxiliary model's KV cache to the base model's KV cache.
- Define a mapping from auxiliary cache layers to base cache layers (a small PyTorch sketch of this predictor follows after step 4).

3. Training process:
- Pass input tokens through the auxiliary model to get its KV cache.
- Use the KV predictor to generate a predicted KV cache for the base model.
- Run the base model using the predicted KV cache and compute losses.
- Backpropagate errors through the frozen base model to update the auxiliary model and KV predictor.

4. Inference process:
- Process the input prompt with the auxiliary model to get its KV cache.
- Use the KV predictor to generate the predicted base model KV cache.
- Run a single token generation step with the base model using the predicted KV cache.
- Continue autoregressive generation with the base model as normal.
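
As referenced in step 2, here is a minimal PyTorch sketch of the KV predictor; the layer mapping and dimensions are illustrative, not the paper's exact configuration:

```python
# Sketch of a KV predictor: learned linear maps from the auxiliary model's KV cache
# to the base model's KV cache.
import torch
import torch.nn as nn

class KVPredictor(nn.Module):
    def __init__(self, aux_layers=8, base_layers=32, aux_dim=1024, base_dim=4096):
        super().__init__()
        # Map each base layer to an auxiliary layer (here: evenly spaced).
        self.layer_map = [int(i * aux_layers / base_layers) for i in range(base_layers)]
        self.k_proj = nn.ModuleList([nn.Linear(aux_dim, base_dim) for _ in range(base_layers)])
        self.v_proj = nn.ModuleList([nn.Linear(aux_dim, base_dim) for _ in range(base_layers)])

    def forward(self, aux_kv):
        # aux_kv: list of (K, V) pairs, each of shape (batch, seq_len, aux_dim)
        predicted = []
        for i, src in enumerate(self.layer_map):
            k_aux, v_aux = aux_kv[src]
            predicted.append((self.k_proj[i](k_aux), self.v_proj[i](v_aux)))
        return predicted  # list of (K, V) pairs shaped for the base model
```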

Excited to hear your thoughts!
posted an update about 1 month ago
All the way from Korea, a novel approach called Mentor-KD significantly improves the reasoning abilities of small language models.

Mentor-KD introduces an intermediate-sized "mentor" model to augment training data and provide soft labels during knowledge distillation from large language models (LLMs) to smaller models.

Broadly, it's a two-stage process:
1) Fine-tune the mentor on filtered Chain-of-Thought (CoT) annotations from an LLM teacher.
2) Use the mentor to generate additional CoT rationales and soft probability distributions.

The student model is then trained using:
- CoT rationales from both the teacher and mentor (rationale distillation).
- Soft labels from the mentor (soft label distillation).
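
For the soft-label part, the objective is essentially a standard distillation loss; a sketch follows (the temperature and weighting are illustrative, not the paper's exact values):

```python
# Soft-label distillation term: KL divergence between the student's and the
# mentor's output distributions.
import torch
import torch.nn.functional as F

def soft_label_loss(student_logits, mentor_logits, temperature: float = 2.0):
    s = F.log_softmax(student_logits / temperature, dim=-1)
    m = F.softmax(mentor_logits / temperature, dim=-1)
    return F.kl_div(s, m, reduction="batchmean") * temperature**2

# The rationale-distillation term is an ordinary cross-entropy loss on the
# CoT rationale tokens generated by the teacher and mentor.
```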

Results show that Mentor-KD consistently outperforms baselines, with up to 5% accuracy gains on some tasks.

Mentor-KD is especially effective in low-resource scenarios, achieving comparable performance to baselines while using only 40% of the original training data.

This work opens up exciting possibilities for making smaller, more efficient language models better at complex reasoning tasks.

What are your thoughts on this approach?
posted an update about 1 month ago
While Google's Transformer might have introduced "Attention is all you need," Microsoft and Tsinghua University are here with the DIFF Transformer, stating, "Sparse-Attention is all you need."

The DIFF Transformer outperforms traditional Transformers in scaling properties, requiring only about 65% of the model size or training tokens to achieve comparable performance.

The secret sauce? A differential attention mechanism that amplifies focus on relevant context while canceling out noise, leading to sparser and more effective attention patterns.

How?
- It uses two separate softmax attention maps and subtracts them.
- It employs a learnable scalar λ for balancing the attention maps.
- It implements GroupNorm for each attention head independently.
- It is compatible with FlashAttention for efficient computation.
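
A bare-bones, single-head sketch of that subtraction (no GroupNorm, multi-head splitting, or FlashAttention; illustrative only):

```python
# Differential attention sketch: subtract two softmax attention maps, weighted
# by a learnable scalar lambda.
import torch
import torch.nn.functional as F

def differential_attention(q1, k1, q2, k2, v, lam: torch.Tensor):
    # q1, k1, q2, k2: (batch, seq, d); v: (batch, seq, d_v); lam: learnable scalar
    scale = q1.shape[-1] ** -0.5
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) * scale, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) * scale, dim=-1)
    return (a1 - lam * a2) @ v   # noise common to both maps tends to cancel out
```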

What do you get?
- Superior long-context modeling (up to 64K tokens).
- Enhanced key information retrieval.
- Reduced hallucination in question-answering and summarization tasks.
- More robust in-context learning, less affected by prompt order.
- Mitigation of activation outliers, opening doors for efficient quantization.

Extensive experiments show DIFF Transformer's advantages across various tasks and model sizes, from 830M to 13.1B parameters.

This innovative architecture could be a game-changer for the next generation of LLMs. What are your thoughts on DIFF Transformer's potential impact?
posted an update about 1 month ago
Good folks from Universitat Politècnica de Catalunya, University of Groningen, and Meta have released "A Primer on the Inner Workings of Transformer-based Language Models."

They don't make survey papers like they used to, but this is an exciting new survey on Transformer LM interpretability!

This comprehensive survey provides a technical deep dive into:

• Transformer architecture components (attention, FFN, residual stream)
• Methods for localizing model behavior:
- Input attribution (gradient & perturbation-based)
- Component importance (logit attribution, causal interventions)
• Information decoding techniques:
- Probing, linear feature analysis
- Sparse autoencoders for disentangling features
• Key insights on model internals:
- Attention mechanisms (induction heads, copy suppression)
- FFN neuron behaviors
- Residual stream properties
- Multi-component emergent behaviors
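
As a flavor of the probing methods surveyed, here is a minimal linear-probe sketch on stand-in activations (toy data, illustrative only):

```python
# Minimal linear probe: predict a property (e.g., part of speech) from a layer's
# hidden states using logistic regression. Toy random data stands in for real activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

hidden_states = np.random.randn(2000, 768)       # stand-in layer activations
labels = np.random.randint(0, 2, size=2000)      # stand-in binary property

probe = LogisticRegression(max_iter=1000).fit(hidden_states[:1500], labels[:1500])
print("probe accuracy:", probe.score(hidden_states[1500:], labels[1500:]))
```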

The paper offers a unified notation and connects insights across different areas of interpretability research. It's a must-read for anyone working on understanding large language models!

Some fascinating technical highlights:
- Detailed breakdowns of attention head circuits (e.g., IOI task)
- Analysis of factual recall mechanisms
- Overview of polysemanticity and superposition
- Discussion of grokking as circuit emergence

What interpretability insights do you find most intriguing?
posted an update about 1 month ago
Just started going through the latest "State of AI Report 2024", and I cannot get over the predictions!

The report predicts major developments in AI over the next 12 months, including a $10B+ investment from a sovereign state into a large US AI lab, triggering national security scrutiny, and a viral app created by someone without coding skills.

It forecasts changes in data collection practices due to frontier labs facing trials, softer-than-expected EU AI Act implementations, and the rise of an open-source alternative to OpenAI GPT-4 outperforming in benchmarks.

NVIDIA's dominance will remain largely unchallenged, investment in humanoid robots will decline, Apple's on-device AI research will gain momentum, and a research paper by an AI scientist will be accepted at a major conference.

Lastly, a GenAI-based video game is expected to achieve breakout success.

Yet to go through all 200+ pages... will post summarized thoughts later.
replied to their post about 2 months ago

Here's why you should be pumped:

🔥 Supercharge your models:
• Up to 97% speedup for LLaMA 3 8B inference
• 50% speedup for LLaMA 3 70B pretraining on H100
• 53% speedup for diffusion models on H100

💾 Slash memory usage:
• 73% peak VRAM reduction for LLaMA 3.1 8B at 128K context length
• 50% model VRAM reduction for CogVideoX

Whether you're working on LLMs, diffusion models, or other AI applications, torchao is a must-have tool in your arsenal. It's time to make your models faster, smaller, and more efficient!

So, what use cases do you expect out of this?

posted an update about 2 months ago
Good folks at @PyTorch have just released torchao, a game-changing library for native architecture optimization.

-- How torchao Works (they threw the kitchen sink at it...)

torchao leverages several advanced techniques to optimize PyTorch models, making them faster and more memory-efficient. Here's an overview of its key mechanisms:

Quantization

torchao employs various quantization methods to reduce model size and accelerate inference:

• Weight-only quantization: Converts model weights to lower precision formats like int4 or int8, significantly reducing memory usage (see the plain-PyTorch illustration after this list).
• Dynamic activation quantization: Quantizes activations on-the-fly during inference, balancing performance and accuracy.
• Automatic quantization: The autoquant function intelligently selects the best quantization strategy for each layer in a model.
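
To build intuition for the weight-only path referenced above, here is a plain-PyTorch illustration of per-channel int8 weight quantization; this shows the idea, not torchao's implementation or API:

```python
# Per-output-channel symmetric int8 weight quantization (illustrative, not torchao code).
import torch

def quantize_weight_int8(w: torch.Tensor):
    # One scale per output channel (row) keeps quantization error low.
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    w_int8 = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return w_int8, scale

def dequantize(w_int8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_int8.to(torch.float32) * scale

w = torch.randn(4096, 4096)
w_q, s = quantize_weight_int8(w)
print("max abs error:", (w - dequantize(w_q, s)).abs().max().item())
```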

Low-bit Datatypes

The library utilizes low-precision datatypes to speed up computations:

• float8: Enables float8 training for linear layers, offering substantial speedups for large models like LLaMA 3 70B.
• int4 and int8: Provide options for extreme compression of weights and activations.

Sparsity Techniques

torchao implements sparsity methods to reduce model density:

• Semi-sparse weights: Combine quantization with sparsity for compute-bound models.

KV Cache Optimization

For transformer-based models, torchao offers KV cache quantization, leading to significant VRAM reductions for long context lengths.

Integration with PyTorch Ecosystem

torchao seamlessly integrates with existing PyTorch tools:

• Compatible with torch.compile() for additional performance gains.
• Works with FSDP2 for distributed training scenarios.
• Supports most PyTorch models available on Hugging Face out-of-the-box.

By combining these techniques, torchao enables developers to significantly improve the performance and efficiency of their PyTorch models with minimal code changes and accuracy impact.