Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance
Abstract
Recent advancements in large language models (LLMs) have shown strong general reasoning abilities, yet their effectiveness in financial reasoning remains underexplored. In this study, we comprehensively evaluate 16 powerful reasoning and general LLMs on three complex financial tasks involving financial text, tabular data, and equations, assessing numerical reasoning, tabular interpretation, financial terminology comprehension, long-context processing, and equation-based problem solving. Our results show that while better datasets and pretraining improve financial reasoning, general enhancements such as CoT fine-tuning do not always yield consistent gains. Moreover, all reasoning strategies struggle to improve performance on long-context and multi-table tasks. To address these limitations, we develop a financial reasoning-enhanced model based on Llama-3.1-8B-Instruct, using CoT fine-tuning and reinforcement learning with domain-specific reasoning paths. Even with simple fine-tuning on a single financial dataset, our model achieves a consistent 10% performance improvement across tasks, surpassing all 8B models and, on average, even Llama3-70B-Instruct and Llama3.1-70B-Instruct. Our results highlight the need for domain-specific adaptations in financial tasks and point to future directions such as multi-table reasoning, long-context processing, and financial terminology comprehension. All our datasets, models, and code are publicly available. Furthermore, we introduce a leaderboard for benchmarking future datasets and models.
Community
https://huggingface.co/mukaj/Llama-3.1-Hawkish-8B
This is a similarly sized model that was also trained for financial reasoning. I just tested it on FinQA and it scored 60.94%. It would be good to include it in your leaderboard.
Thank you for the comment. I have tested this model on the FinQA dataset. However, with our evaluation method from https://github.com/yale-nlp/DocMath-Eval, which uses GPT to extract the answers and compare them against the references, we only get 46.85%. May I ask which evaluation method you are using?
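A gap like the one between 60.94% and 46.85% can come from the scoring protocol alone, because accuracy on numerical QA depends on how answers are extracted and how strictly they are matched. The following is a minimal sketch of that effect, not the DocMath-Eval pipeline (which uses GPT for extraction). The regex extractor, the `rel_tol` values, the function names, and the toy predictions are all assumptions made for illustration.

```python
import math
import re


def parse_number(text):
    """Return the last number found in a free-form answer, or None.

    Assumption: a simple regex stands in for the answer-extraction step.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def is_correct(pred, gold, rel_tol):
    """Compare an extracted prediction to the gold value within rel_tol."""
    value = parse_number(pred)
    if value is None:
        return False
    return math.isclose(value, gold, rel_tol=rel_tol)


def accuracy(preds, golds, rel_tol):
    """Fraction of predictions judged correct under the given tolerance."""
    hits = sum(is_correct(p, g, rel_tol) for p, g in zip(preds, golds))
    return hits / len(golds)


# Toy data (hypothetical, not taken from FinQA).
preds = ["The answer is 12.5", "about 0.349", "1,250", "no answer"]
golds = [12.5, 0.35, 1250.0, 3.0]

strict = accuracy(preds, golds, rel_tol=0.0)    # exact numeric match only
lenient = accuracy(preds, golds, rel_tol=0.01)  # allow 1% relative error

print(f"strict={strict:.2f} lenient={lenient:.2f}")
```

On this toy set the rounded answer `0.349` counts as correct only under the lenient setting, so the same predictions yield different accuracies. Stating the extraction method and tolerance alongside a score makes results from different evaluators comparable.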
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Advancing Reasoning in Large Language Models: Promising Methods and Approaches (2025)
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025)
- Enhancing Generalization in Chain of Thought Reasoning for Smaller Models (2025)
- Baichuan4-Finance Technical Report (2024)
- ChainRank-DPO: Chain Rank Direct Preference Optimization for LLM Rankers (2024)
- FineMedLM-o1: Enhancing the Medical Reasoning Ability of LLM from Supervised Fine-Tuning to Test-Time Training (2025)
- LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters! (2025)