RAG techniques continuously evolve to enhance LLM response accuracy by retrieving relevant external data during generation. To keep up with current AI trends, new RAG types incorporate deep step-by-step reasoning, tree search, citations, multimodality and other effective techniques.
3. Chain-of-Retrieval Augmented Generation (CoRAG) (2501.14342): retrieves information step by step and refines what it has gathered, reformulating the query when needed. It also decides how much compute to spend at test time.
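The step-by-step retrieval loop described for CoRAG can be sketched roughly as follows. This is only an illustration of the control flow, not the paper's training procedure or decoding strategy: the word-overlap retriever, the `reformulate` and `is_sufficient` callbacks, and `max_steps` (used here as a crude stand-in for a test-time compute budget) are all assumptions made for the example. In a real system an LLM would play the role of both callbacks.

```python
import re
from typing import Callable, List, Tuple


def tokens(text: str) -> set:
    """Lowercased word set, used by the toy retriever below."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, corpus: List[str], k: int = 1) -> List[str]:
    """Toy retriever (assumption): rank documents by word overlap with the query."""
    q = tokens(query)
    ranked = sorted(corpus, key=lambda d: len(q & tokens(d)), reverse=True)
    return [d for d in ranked[:k] if q & tokens(d)]


def chain_of_retrieval(
    question: str,
    corpus: List[str],
    reformulate: Callable[[str, List[str]], str],
    is_sufficient: Callable[[str, List[str]], bool],
    max_steps: int = 4,
) -> Tuple[List[str], List[str]]:
    """Retrieve step by step, reformulating the query until evidence suffices.

    Returns the list of queries issued and the evidence collected.
    """
    queries: List[str] = []
    evidence: List[str] = []
    query = question
    for _ in range(max_steps):  # hypothetical compute budget
        queries.append(query)
        for doc in retrieve(query, corpus):
            if doc not in evidence:
                evidence.append(doc)
        if is_sufficient(question, evidence):
            break
        query = reformulate(question, evidence)
    return queries, evidence


# Toy two-hop example with hand-written callbacks (hypothetical).
corpus = [
    "The author of Hamlet is William Shakespeare",
    "William Shakespeare was born in Stratford-upon-Avon",
    "Paris is a city in France",
]


def toy_reformulate(question: str, evidence: List[str]) -> str:
    # Pretend the last two words of the newest evidence name the entity.
    entity = " ".join(evidence[-1].split()[-2:])
    return f"Where was {entity} born"


def toy_is_sufficient(question: str, evidence: List[str]) -> bool:
    return any("born" in doc.lower() for doc in evidence)


queries, evidence = chain_of_retrieval(
    "Where was the author of Hamlet born?", corpus, toy_reformulate, toy_is_sufficient
)
```

Raising or lowering `max_steps` trades answer completeness against the number of retrieval calls, which is the kind of test-time compute decision the post mentions.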
(Probably) the first "longCoT" dataset for the Russian language, created with DeepSeek-R1.
- Prompts taken from the Sky-T1 dataset and translated via Llama3.3-70B.
- Answers and reasoning generated by DeepSeek-R1 (685B).
- 16.4K samples in total, ≈12.4K Russian-only (in the rest, either the answer or the reasoning is in English).
- Languages in the answers and reasoning are labeled using fastText.
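The labeling step can be pictured with a small sketch. The dataset itself was labeled with fastText; the detector below is a deliberately simple Cyrillic-ratio placeholder so that the example runs without downloading a model, and the field names (`answer`, `reasoning`, `answer_lang`, `reasoning_lang`, `russian_only`) are hypothetical rather than the dataset's actual schema. A fastText language-identification model could be wrapped in a function with the same signature and passed in as `detect`.

```python
from typing import Callable, Dict


def cyrillic_ratio_detector(text: str) -> str:
    """Placeholder detector (assumption): 'ru' if most letters are Cyrillic."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return "unknown"
    cyrillic = sum("\u0400" <= c <= "\u04ff" for c in letters)
    return "ru" if cyrillic / len(letters) > 0.5 else "en"


def label_sample(
    sample: Dict[str, str],
    detect: Callable[[str], str] = cyrillic_ratio_detector,
) -> Dict[str, object]:
    """Attach language labels for the answer and the reasoning separately."""
    labeled: Dict[str, object] = dict(sample)
    labeled["answer_lang"] = detect(sample["answer"])
    labeled["reasoning_lang"] = detect(sample["reasoning"])
    # A sample counts as Russian-only when both parts are labeled 'ru'.
    labeled["russian_only"] = (
        labeled["answer_lang"] == "ru" and labeled["reasoning_lang"] == "ru"
    )
    return labeled


mixed = label_sample({"answer": "Ответ равен двум.", "reasoning": "Let me think step by step."})
pure = label_sample({"answer": "Ответ равен двум.", "reasoning": "Сначала сложим числа."})
```

Labeling the answer and the reasoning separately is what makes it possible to split off the Russian-only subset from samples where one of the two parts came back in English.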
🗂️ I don't think the collections feature of Hugging Face is widely used, even though it's an excellent way to organize and discover interesting resources. To do my bit to change that, I've created two carefully curated collections that combine both my original work and other valuable datasets:
Educational Datasets
- Mostly English-Russian, but other languages are also included
- Extended by my new Begemot.ai dataset (2.7M+ Russian education records): nyuuzyou/begemot
- Extensive art-focused collection, including my new datasets:
  - Buzzly.art (2K artworks): nyuuzyou/buzzlyart
  - Paintberri (60K+ pieces): nyuuzyou/paintberri
  - Itaku.ee (924K+ items): nyuuzyou/itaku
- Extended with other amazing datasets from the community
Collections should become a more common feature - hopefully this will encourage others to create and share their own curated collections. By organizing related datasets into these themed collections, I hope to make it easier for researchers and developers to discover and use these valuable resources.