arxiv:2502.08127

Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance

Published on Feb 12
· Submitted by jiminHuang on Feb 13
#1 Paper of the day

Abstract

Recent advancements in large language models (LLMs) have shown strong general reasoning abilities, yet their effectiveness in financial reasoning remains underexplored. In this study, we comprehensively evaluate 16 powerful reasoning and general LLMs on three complex financial tasks involving financial text, tabular data, and equations, assessing numerical reasoning, tabular interpretation, financial terminology comprehension, long-context processing, and equation-based problem solving. Our results show that while better datasets and pretraining improve financial reasoning, general enhancements like chain-of-thought (CoT) fine-tuning do not always yield consistent gains. Moreover, all reasoning strategies face challenges in improving performance on long-context and multi-table tasks. To address these limitations, we develop a financial reasoning-enhanced model based on Llama-3.1-8B-Instruct, using CoT fine-tuning and reinforcement learning with domain-specific reasoning paths. Even with simple fine-tuning on one financial dataset, our model achieves a consistent 10% performance improvement across tasks, surpassing all 8B models and even Llama3-70B-Instruct and Llama3.1-70B-Instruct on average. Our results highlight the need for domain-specific adaptations in financial tasks, emphasizing future directions such as multi-table reasoning, long-context processing, and financial terminology comprehension. All our datasets, models, and code are publicly available. Furthermore, we introduce a leaderboard for benchmarking future datasets and models.

Community


https://huggingface.co/mukaj/Llama-3.1-Hawkish-8B

This is a similar-sized model that was also trained for financial reasoning. I just tested it on FinQA and it scored 60.94%, so it would be good to include in your leaderboard.


Thank you for the comment. I tested this model on the FinQA dataset. However, with our evaluation method from https://github.com/yale-nlp/DocMath-Eval, which uses GPT to extract the results and compare them, we only get 46.85%. May I ask which evaluation method you are using?
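A gap like the one between 60.94% and 46.85% can come from the scoring rule alone. The sketch below illustrates that effect with two hypothetical matching functions: a strict string match and a lenient numeric match. It is not the DocMath-Eval implementation and does not call GPT. The function names, the 1% relative tolerance, and the sample answers are all assumptions made for illustration.

```python
import math
import re


def parse_number(text):
    """Pull the last number out of an answer string; return None if there is none."""
    cleaned = text.replace(",", "").replace("$", "")
    matches = re.findall(r"-?\d+(?:\.\d+)?", cleaned)
    return float(matches[-1]) if matches else None


def exact_match(pred, gold):
    """Strict scoring: the answer strings must be identical after trimming."""
    return pred.strip() == gold.strip()


def tolerant_match(pred, gold, rel_tol=0.01):
    """Lenient scoring: extracted numbers must agree within a relative tolerance.

    The 1% default is an assumption, not a value taken from any benchmark.
    """
    p, g = parse_number(pred), parse_number(gold)
    if p is None or g is None:
        return False
    return math.isclose(p, g, rel_tol=rel_tol)


def accuracy(preds, golds, match_fn):
    """Fraction of predictions accepted by match_fn."""
    hits = sum(match_fn(p, g) for p, g in zip(preds, golds))
    return hits / len(golds)


# Made-up answers for illustration only; these are not FinQA data.
preds = ["The answer is 12.5", "3,400", "0.153", "41"]
golds = ["12.5", "3400", "0.15", "41"]

strict = accuracy(preds, golds, exact_match)
lenient = accuracy(preds, golds, tolerant_match)
print(f"strict={strict:.2f} lenient={lenient:.2f}")
```

The same four predictions score differently under the two rules, so reported accuracies are only comparable when the answer extraction step and the matching tolerance are the same.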

