license: apache-2.0
---

# Model Card for AtomThink-LLaVA-Llama3-8B

This model is LLaVA-Llama3-8B fine-tuned with the AtomThink framework. It is intended for solving complex multimodal mathematical problems.

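# Comparison of accuracy with state-of-the-art methods on MathVista and MathVerse

All values are accuracy (%). MathVerse columns: TL = Text Lite, TD = Text Dominant, VI = Vision Intensive, VD = Vision Dominant, VO = Vision Only.

| **Model**             | **Inference** | **MathVista General** | **MathVista Math** | **MathVista Total** | **MathVerse TL** | **MathVerse TD** | **MathVerse VI** | **MathVerse VD** | **MathVerse VO** | **MathVerse Total** |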
|-----------------------|---------------|-------------|----------|-----------|----------|----------|----------|----------|----------|-----------|
| Random Choice         | -             | -           | -        | 17.9      | 12.4     | 12.4     | 12.4     | 12.4     | 12.4     | 12.4      |
| Human                 | -             | -           | -        | -         | 70.9     | 71.2     | 61.4     | 68.3     | 66.7     | 66.7      |
| OpenAI o1             | Slow Think*   | -           | -        | 73.9      | -        | -        | -        | -        | -        | -         |
| GPT-4o                | CoT           | -           | -        | 63.8      | -        | -        | -        | -        | -        | -         |
| GPT-4V                | CoT           | -           | -        | 49.9      | 56.6     | 63.1     | 51.4     | 50.8     | 50.3     | 54.4      |
| LLaVA-NeXT-34B        | Direct        | -           | -        | 46.5      | 25.5     | 33.8     | 23.5     | 20.3     | 15.7     | 23.8      |
| InternLM-XComposer2   | Direct        | -           | -        | 57.6      | 17.0     | 22.3     | 15.7     | 16.4     | 11.0     | 16.5      |
| Qwen-VL-Plus          | Direct        | -           | -        | 43.3      | 11.1     | 15.7     | 9.0      | 13.0     | 10.0     | 11.8      |
| LLaVA-1.5-13B         | Direct        | -           | -        | 27.6      | 15.2     | 19.4     | 16.8     | 15.2     | 11.3     | 15.6      |
| G-LLaVA-7B            | Direct        | -           | -        | 53.4      | 20.7     | 20.9     | 17.2     | 14.6     | 9.4      | 16.6      |
| MAVIS-7B              | Direct        | -           | -        | -         | 29.1     | 41.4     | 27.4     | 24.9     | 14.6     | 27.5      |
| LLaVA-Llama3-8B       | Direct        | 34.1        | 25.6     | 29.5      | 16.0     | 19.3     | 16.4     | 13.1     | 15.0     | 15.9      |
| LLaVA w/. Formatted   | CoT           | 30.2        | 22.9     | 26.3      | 14.3     | 18.4     | 15.7     | 10.0     | 7.7      | 13.2      |
| AtomThink-LLaVA       | Direct        | 34.4        | 27.2     | 30.5      | 16.0     | 19.3     | 16.2     | 13.1     | 15.0     | 15.9      |
| AtomThink-LLaVA       | Quick Think   | **36.9**    | **37.0** | **36.6**  | **22.2** | **26.6** | **24.1** | **20.9** | **17.9** | **22.4**  |
| AtomThink-LLaVA       | Slow Think    | **36.5**    | **41.3** | **39.1**  | **36.1** | **42.4** | **30.0** | **36.8** | **28.6** | **34.7**  |
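
To make the gaps easier to read, the following minimal sketch computes the absolute gain (in percentage points) of each AtomThink-LLaVA setting over the LLaVA-Llama3-8B Direct baseline, using the MathVista and MathVerse totals copied from the table above. The dictionary layout and the helper name `gains_over_baseline` are illustrative assumptions, not part of any released code.

```python
# Accuracy (%) copied from the table above: (MathVista total, MathVerse total).
scores = {
    ("LLaVA-Llama3-8B", "Direct"): (29.5, 15.9),
    ("AtomThink-LLaVA", "Direct"): (30.5, 15.9),
    ("AtomThink-LLaVA", "Quick Think"): (36.6, 22.4),
    ("AtomThink-LLaVA", "Slow Think"): (39.1, 34.7),
}

baseline = scores[("LLaVA-Llama3-8B", "Direct")]


def gains_over_baseline(scores, baseline):
    """Return absolute gains in percentage points relative to the baseline row.

    Hypothetical helper for illustration only.
    """
    return {
        key: tuple(round(value - base, 1) for value, base in zip(row, baseline))
        for key, row in scores.items()
    }


gains = gains_over_baseline(scores, baseline)
for (model, mode), (mathvista, mathverse) in gains.items():
    print(f"{model} ({mode}): MathVista {mathvista:+.1f}, MathVerse {mathverse:+.1f}")
```

Rounding to one decimal place matches the precision reported in the table and avoids floating-point noise in the differences.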

# Citation
If you use this model in your research, please cite:
```bibtex
@article{xiang2024atomthink,
  title={AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning},
  author={Xiang, Kun and Liu, Zhili and Jiang, Zihao and Nie, Yunshuang and Huang, Runhui and Fan, Haoxiang and Li, Hanhui and Huang, Weiran and Zeng, Yihan and Han, Jianhua and others},
  journal={arXiv preprint arXiv:2411.11930},
  year={2024}
}
@article{liu2024visual,
  title={Visual instruction tuning},
  author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
  journal={Advances in neural information processing systems},
  volume={36},
  year={2024}
}
```

# License
The checkpoint is released under the Apache 2.0 license. Please ensure proper attribution when using this checkpoint.