arxiv:2407.13168

SciCode: A Research Coding Benchmark Curated by Scientists

Published on Jul 18

Submitted by amber1120 on Jul 22


Authors: Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Shengyan Liu, Yutao Ma, Chenyu Tian, Bohao Wu, Yanyu Xiong, Shengzhu Yin

Abstract

Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities to generate code for solving real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields, including mathematics, physics, chemistry, biology, and materials science, we created a scientist-curated coding benchmark, SciCode. The problems in SciCode naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. It offers optional descriptions specifying useful scientific background information, as well as scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. We believe that SciCode both demonstrates contemporary LMs' progress towards becoming helpful scientific assistants and sheds light on the development and evaluation of scientific AI in the future.
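The abstract describes main problems that factorize into subproblems, each checked against test cases (338 subproblems over 80 main problems, roughly 4.2 per problem on average). A minimal sketch of how scoring over such a structure might look is given below. The class names, the toy data, and the rule that a main problem counts as solved only when every one of its subproblems passes are assumptions made for illustration; the abstract does not spell out the scoring rule, and this is not the SciCode evaluation harness.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Subproblem:
    """Hypothetical container: one subproblem with its test cases."""
    name: str
    tests: List[Callable[[], bool]]  # each test returns True on pass

    def passed(self) -> bool:
        return all(test() for test in self.tests)


@dataclass
class MainProblem:
    """Hypothetical container: a main problem made of subproblems."""
    name: str
    subproblems: List[Subproblem] = field(default_factory=list)

    def solved(self) -> bool:
        # Assumption: a main problem is solved only if all subproblems pass.
        return all(sub.passed() for sub in self.subproblems)


def solve_rates(problems: List[MainProblem]) -> Tuple[float, float]:
    """Return (subproblem pass rate, main-problem solve rate)."""
    subs = [sub for problem in problems for sub in problem.subproblems]
    sub_rate = sum(sub.passed() for sub in subs) / len(subs)
    main_rate = sum(problem.solved() for problem in problems) / len(problems)
    return sub_rate, main_rate


# Toy data: two main problems with two subproblems each; one subproblem fails.
problems = [
    MainProblem("toy_a", [
        Subproblem("a1", [lambda: 1 + 1 == 2]),
        Subproblem("a2", [lambda: abs(0.1 + 0.2 - 0.3) < 1e-9]),
    ]),
    MainProblem("toy_b", [
        Subproblem("b1", [lambda: sorted([3, 1, 2]) == [1, 2, 3]]),
        Subproblem("b2", [lambda: False]),  # deliberately failing test
    ]),
]

sub_rate, main_rate = solve_rates(problems)
print(f"subproblems passed: {sub_rate:.2f}, main problems solved: {main_rate:.2f}")
```

Under this assumed rule, a single failing subproblem sinks its whole main problem, so the main-problem rate can sit far below the subproblem rate.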


Community

amber1120

Paper author Paper submitter Jul 22

https://scicode-bench.github.io

nielsr

Jul 25

Hi @amber1120 congrats on this work!

Are you planning to share the dataset on the hub? Here's a guide: https://huggingface.co/docs/datasets/loading.

The dataset could then be loaded in 2 lines of code, like so:

from datasets import load_dataset

dataset = load_dataset("your-hf-organization/scicode")

It could then also be linked to this paper, as explained here: https://huggingface.co/docs/hub/en/datasets-cards#linking-a-paper.
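As a rough illustration of the linking step, the sketch below writes a dataset card (README.md) that mentions the paper's arXiv URL; the Hub picks up arXiv links found in a repository's README. The helper names and the card layout are hypothetical, and the card is written to a temporary directory standing in for a local clone of the dataset repository.

```python
import tempfile
from pathlib import Path

ARXIV_URL = "https://arxiv.org/abs/2407.13168"


def build_dataset_card(title: str, arxiv_url: str) -> str:
    """Build a minimal README.md body that mentions the paper URL."""
    lines = [f"# {title}", "", f"Paper: {arxiv_url}", ""]
    return "\n".join(lines)


def write_dataset_card(repo_dir: str, card: str) -> Path:
    """Write the card as README.md inside a local dataset repository folder."""
    path = Path(repo_dir) / "README.md"
    path.write_text(card, encoding="utf-8")
    return path


card = build_dataset_card("SciCode", ARXIV_URL)

# A temporary directory stands in for the cloned dataset repository.
with tempfile.TemporaryDirectory() as repo_dir:
    readme = write_dataset_card(repo_dir, card)
    print(readme.read_text(encoding="utf-8"))
```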

We could also set up a Space using Gradio for the leaderboard.

Let me know if you need any help!

Cheers,
Niels
Open-source @ HF

