DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
Abstract
The rapid advancements in Vision-Language Models (VLMs) have shown great potential in tackling mathematical reasoning tasks that involve visual context. Unlike humans, who can reliably apply solution steps to similar problems with minor modifications, we found that SOTA VLMs such as GPT-4o can consistently fail in these scenarios, revealing limitations in their mathematical reasoning capabilities. In this paper, we investigate mathematical reasoning robustness in VLMs and evaluate how well these models perform under different variants of the same question, such as changes in visual numerical values or function graphs. While several vision-based math benchmarks have been developed to assess VLMs' problem-solving capabilities, these benchmarks contain only static sets of problems and cannot easily evaluate mathematical reasoning robustness. To fill this gap, we introduce DynaMath, a dynamic visual math benchmark designed for in-depth assessment of VLMs. DynaMath includes 501 high-quality, multi-topic seed questions, each represented as a Python program. These programs are carefully designed and annotated to enable the automatic generation of a much larger set of concrete questions, including many different types of visual and textual variations. DynaMath allows us to evaluate the generalization ability of VLMs by assessing their performance under varying input conditions of a seed question. We evaluated 14 SOTA VLMs with 5,010 generated concrete questions. Our results show that the worst-case model accuracy, defined as the percentage of seed questions answered correctly in all 10 variants, is significantly lower than the average-case accuracy. Our analysis emphasizes the need to study the robustness of VLMs' reasoning abilities, and DynaMath provides valuable insights to guide the development of more reliable models for mathematical reasoning.
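To make the two evaluation metrics concrete, here is a minimal sketch (with random placeholder data, not the paper's actual results) of how average-case and worst-case accuracy can be computed from a 501 × 10 matrix of per-variant outcomes:

```python
import numpy as np

# Placeholder results matrix: results[i, j] = 1 if the model answered
# variant j of seed question i correctly, 0 otherwise.
# The shape follows the paper's setup (501 seeds x 10 variants); the
# values here are random and NOT the paper's reported numbers.
rng = np.random.default_rng(0)
results = rng.integers(0, 2, size=(501, 10))

# Average-case accuracy: fraction of all 5,010 concrete questions answered correctly.
average_case = results.mean()

# Worst-case accuracy: fraction of seed questions for which *all 10* variants
# were answered correctly -- the quantity the abstract reports as much lower.
worst_case = results.all(axis=1).mean()

print(f"average-case accuracy: {average_case:.3f}")
print(f"worst-case accuracy:   {worst_case:.3f}")
```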
Community
Check out our dynamic visual math benchmark for evaluating the reasoning robustness of VLMs. There are lots of interesting findings on the robustness of SOTA VLMs. The GitHub and Hugging Face links are https://github.com/DynaMath/DynaMath and https://huggingface.co/datasets/DynaMath/DynaMath_Sample.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning (2024)
- TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions (2024)
- ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning (2024)
- Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning (2024)
- VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning (2024)
Great ideas!
One small question:
How do you generate ground-truth solutions for the program-based generated questions?
Thanks @julyai! This is a great question. When we create programs, we ensure that both the questions and their corresponding ground-truth answers are generated together. For example, for the shifted absolute value question in Figure 1, the shift amount is a random variable in our program, and the program can easily calculate the ground-truth solution based on its value (e.g., when the shift is non-zero, the function is differentiable at x=0; it is non-differentiable otherwise). Other problems are created in a similar manner, where both questions and ground-truth solutions are generated by the program.
Thanks again for the question and feel free to let us know if there is anything else to clarify :)
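For illustration, here is a minimal sketch of what such a seed-question program might look like (hypothetical names and simplified logic, not our actual DynaMath code):

```python
import random

def generate_variant(seed=None):
    """Hypothetical seed-question program in the spirit of DynaMath.

    The question asks whether f(x) = |x - a| is differentiable at x = 0,
    where the shift a is drawn at random. The ground-truth answer is
    derived from the same random value, so question and answer stay
    consistent by construction.
    """
    rng = random.Random(seed)
    # Draw the shift amount; 0 is included so both answer cases can occur.
    a = rng.choice([-2, -1, 0, 1, 2])

    question = (
        f"Based on the graph of f(x) = |x - {a}|, "
        "is f differentiable at x = 0? Answer Yes or No."
    )
    # |x - a| is non-differentiable only at x = a, so f is differentiable
    # at 0 exactly when the shift a is non-zero.
    ground_truth = "Yes" if a != 0 else "No"

    # A full seed program would also render the graph (e.g., with matplotlib)
    # to produce the visual context shown to the VLM.
    return question, ground_truth

if __name__ == "__main__":
    for s in range(3):
        q, gt = generate_variant(s)
        print(q, "->", gt)
```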
Thanks! Got it!