We’re excited to release Abstract2Appendix v1 10K, a high-quality dataset crafted to enhance the long-context capabilities of Large Language Models (LLMs). This dataset combines thousands of peer reviews from NeurIPS 2023, EMNLP 2023, TMLR, and ICLR 2023, making it a treasure trove of detailed feedback, critical reasoning, and structured academic insights. In our experiments, fine-tuning on this dataset improved the long-context ability of phi-3 models!
🌟 Key Highlights:
• Expert Reviews: Aggregated from 3–6 reviews per paper, capturing the most insightful and constructive content.
• Rich Metadata: Alongside the aggregated reviews, we include the full parsed paper.
• LLM Ready: Perfect for fine-tuning (we used it for both DPO and SFT).
🎯 Use Cases:
• Fine-tuning models with Direct Preference Optimization (DPO) and Supervised Fine-Tuning (SFT).
• Benchmarking zero-shot and long-context comprehension capabilities.
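To make the fine-tuning use case concrete, here is a minimal sketch of how one record could be turned into an SFT pair and a DPO triple. Everything in it is an assumption for illustration: the `ReviewRecord` fields, the prompt template, and the `baseline_review` used as the dispreferred answer are hypothetical and do not reflect the dataset's actual column names or our exact training setup.

```python
from dataclasses import dataclass


@dataclass
class ReviewRecord:
    # Hypothetical schema: these field names are illustrative only,
    # not the dataset's actual columns.
    paper_text: str          # full parsed paper
    aggregated_review: str   # review aggregated from several reviews
    baseline_review: str     # assumed weaker review, used as "rejected"


# Hypothetical prompt template; adjust to your own instruction format.
PROMPT_TEMPLATE = (
    "Read the following paper and write a detailed peer review.\n\n{paper}"
)


def to_sft_example(rec: ReviewRecord) -> dict:
    """Build a prompt/completion pair for supervised fine-tuning."""
    return {
        "prompt": PROMPT_TEMPLATE.format(paper=rec.paper_text),
        "completion": rec.aggregated_review,
    }


def to_dpo_example(rec: ReviewRecord) -> dict:
    """Build a prompt/chosen/rejected triple for preference optimization."""
    return {
        "prompt": PROMPT_TEMPLATE.format(paper=rec.paper_text),
        "chosen": rec.aggregated_review,
        "rejected": rec.baseline_review,
    }
```

The prompt/chosen/rejected layout is a common convention for preference-tuning data; check the dataset viewer for the real field names before adapting this.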
This dataset is based on the methodology described in our recent paper, “Abstract2Appendix: Academic Reviews Enhance LLM Long-Context Capabilities”. Check it out for more details! https://arxiv.org/abs/2411.05232
After the Supervised Fine-Tuning (SFT) phase, we observed a notable degradation in the instruction-following capabilities of the LLaVA Multi-Modal Large Language Model (MM-LLM). To address this issue, we introduced a 6K-entry VQA preference dataset and employed Direct Preference Optimization (DPO), alongside testing other algorithms such as Rejection Sampling and SteerLM, to enhance instruction-following proficiency. Our methodology not only fully restored the language instruction-following capabilities of LLaVA on MT-Bench but also outperformed LLaVA-RLHF and Vicuna. Our approach also carried over to VQA tasks, as demonstrated by significant performance improvements on MM-Vet and LLaVA-Bench. An interesting observation was that, compared to models using distilled SFT, our method showed substantial out-of-distribution improvements.
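For readers unfamiliar with DPO, the following is a small sketch of the standard per-example DPO objective, not our training code. It takes the summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model. The function name, argument names, and the default `beta=0.1` are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    beta=0.1 is an illustrative default, not a value taken from the text.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) = log(1 + exp(-x)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss equals log 2 when the policy matches the reference, and it decreases as the policy raises the chosen response relative to the rejected one.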