arxiv:2411.00836

DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models

Published on Oct 29
· Submitted by Ray2333 on Nov 5

Abstract

The rapid advancements in Vision-Language Models (VLMs) have shown great potential in tackling mathematical reasoning tasks that involve visual context. Unlike humans, who can reliably apply solution steps to similar problems with minor modifications, we find that SOTA VLMs like GPT-4o can consistently fail in these scenarios, revealing limitations in their mathematical reasoning capabilities. In this paper, we investigate mathematical reasoning robustness in VLMs and evaluate how well these models perform under different variants of the same question, such as changes in visual numerical values or function graphs. While several vision-based math benchmarks have been developed to assess VLMs' problem-solving capabilities, these benchmarks contain only static sets of problems and cannot easily evaluate mathematical reasoning robustness. To fill this gap, we introduce DynaMath, a dynamic visual math benchmark designed for in-depth assessment of VLMs. DynaMath includes 501 high-quality, multi-topic seed questions, each represented as a Python program. These programs are carefully designed and annotated to enable the automatic generation of a much larger set of concrete questions, including many different types of visual and textual variations. DynaMath allows us to evaluate the generalization ability of VLMs by assessing their performance under varying input conditions of a seed question. We evaluated 14 SOTA VLMs with 5,010 generated concrete questions. Our results show that the worst-case model accuracy, defined as the percentage of seed questions answered correctly in all 10 variants, is significantly lower than the average-case accuracy. Our analysis emphasizes the need to study the robustness of VLMs' reasoning abilities, and DynaMath provides valuable insights to guide the development of more reliable models for mathematical reasoning.
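For concreteness, the two accuracy notions mentioned in the abstract can be written out as follows. The notation here is a sketch based on the abstract's description, not copied from the paper: with $N$ seed questions, 10 variants per seed, and $c_{i,j} \in \{0,1\}$ indicating whether variant $j$ of seed $i$ is answered correctly,

$$
\text{Average-case Acc} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{10}\sum_{j=1}^{10} c_{i,j},
\qquad
\text{Worst-case Acc} = \frac{1}{N}\sum_{i=1}^{N}\prod_{j=1}^{10} c_{i,j}.
$$

The product in the worst-case expression equals 1 only when all 10 variants of a seed question are answered correctly, which is why the worst-case accuracy lower-bounds the average-case accuracy.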

Community

Paper author (submitter):

Check out the dynamic visual math benchmark for evaluating the reasoning robustness of VLMs. There are lots of interesting findings on SOTA VLMs' robustness performance. The GitHub and Hugging Face links are https://github.com/DynaMath/DynaMath and https://huggingface.co/datasets/DynaMath/DynaMath_Sample.
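If you want to take a quick look at the sample data, something along these lines should work with the `datasets` library (the splits and columns are assumptions here; check the dataset card for the actual schema):

```python
from datasets import load_dataset

# Load the DynaMath sample set from the Hugging Face Hub.
# The actual splits and columns may differ from what is assumed here,
# so print the dataset object to inspect its structure first.
ds = load_dataset("DynaMath/DynaMath_Sample")
print(ds)
```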


Great ideas!
One small question:
How do you generate ground-truth solutions for the program-based generated questions?


Thanks @julyai! That's a great question. When we create the programs, we ensure that both the questions and their corresponding ground-truth answers are generated together. For example, for the shifted absolute value question in Figure 1, the shift amount is a random variable in our program, and the program can easily compute the ground-truth solution from its value (e.g., when the shift is non-zero, the function is differentiable at x=0; otherwise it is not). Other problems are created in a similar manner, with both the questions and the ground-truth solutions generated by the program.

Thanks again for the question and feel free to let us know if there is anything else to clarify :)
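To make this mechanism concrete, here is a minimal sketch of what such a seed-question program could look like. The function, sampled values, and names below are illustrative only and are not taken from the DynaMath codebase:

```python
import random

def generate_variant(seed=None):
    """Sketch of a seed-question program: a shifted absolute value function.

    The question asks whether f(x) = |x - a| is differentiable at x = 0,
    where the shift a is sampled at random. The ground-truth answer is
    computed from the same sampled shift, so question and answer are
    always consistent.
    """
    rng = random.Random(seed)
    # Sample the horizontal shift; the figure would be redrawn with this value.
    a = rng.choice([-3, -2, -1, 0, 1, 2, 3])

    # Build a readable expression for the shifted function.
    if a > 0:
        expr = f"|x - {a}|"
    elif a < 0:
        expr = f"|x + {-a}|"
    else:
        expr = "|x|"

    question = (
        f"The graph shows f(x) = {expr}. "
        "Is f differentiable at x = 0? Answer yes or no."
    )
    # |x - a| is non-differentiable only at x = a, so f is
    # differentiable at x = 0 exactly when the shift a is non-zero.
    answer = "yes" if a != 0 else "no"
    return question, answer

if __name__ == "__main__":
    for i in range(3):
        q, gt = generate_variant(seed=i)
        print(q, "->", gt)
```

Because the question text (and the corresponding figure) and the answer are both derived from the same random draw, the program can regenerate arbitrarily many consistent variants of the same seed question.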

Thanks! Totally got it!

