Request for Insights on Fine-Tuning Methods
Hello @mukaj,
I hope you're doing well. I wanted to express my appreciation for the embedding model you fine-tuned. It performs exceptionally well on the FiQA dataset with 768-dimensional vectors. I'm genuinely impressed by the results and am very interested in learning more about your fine-tuning process. If it's not too much trouble, could you please share some details about the method you used and the format of the dataset?
Thank you so much for your time and consideration!
Hi,
Thanks for the interest. The dataset was generated in a similar fashion to what is described in this paper: https://arxiv.org/abs/2401.00368.
The seed data I used was a large collection of financial documents: Annual Reports, Earnings Call Transcripts, SEC filings, Sustainability Reports, etc. The documents were fed in page by page (with some cleaning and filtering), and the LLM then generated Positive/Negative retrieval queries based on the given passage. The model used at the time was Mixtral 8x7B. So the dataset format was just [Query, Document Passage, Pos_or_Neg].
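A minimal sketch of that generation loop, for anyone wanting to reproduce the data format. The prompt wording and the `generate()` helper are placeholders, not the author's actual pipeline (which used Mixtral 8x7B); only the output row format [Query, Document Passage, Pos_or_Neg] comes from the description above.

```python
def make_prompt(passage: str, polarity: str) -> str:
    """Hypothetical instruction asking an LLM for a retrieval query."""
    kind = ("that this passage answers" if polarity == "pos"
            else "that looks related but is NOT answered by this passage")
    return f"Write one financial retrieval query {kind}:\n\n{passage}"

def generate(prompt: str) -> str:
    """Placeholder for the actual LLM call (e.g. Mixtral 8x7B)."""
    return "What was the year-over-year revenue growth?"

def build_rows(passages):
    """Return dataset rows in the [Query, Document Passage, Pos_or_Neg] format."""
    rows = []
    for passage in passages:
        for polarity in ("pos", "neg"):
            query = generate(make_prompt(passage, polarity))
            rows.append([query, passage, polarity])
    return rows

rows = build_rows(["Total revenue increased 12% year over year to $4.2B."])
```

Each cleaned page yields one positive and one negative query, so a corpus of N pages gives roughly 2N rows before filtering.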
The fine-tuning details are pretty standard (something like a 1e-5 learning rate, the Lion optimizer, 10 epochs). For the final model, though, I ditched the Negatives in the data and just used MultipleNegativesRankingLoss; I think the Hard Negatives generated by the LLM were perhaps not the best, as ContrastiveLoss did not do well on validation. I kept around 5,000 query/document pairs as a validation set and used InformationRetrievalEvaluator to score them on NDCG@10, MRR@10, etc. Validation-set performance correlated very well with FiQA task performance. Of course, no test set from FiQA was ever downloaded or used, and the model was evaluated with the MTEB library, so the self-reported scores come from that output.
Hope this helps!