arxiv:2307.07686

Creating a Dataset for High-Performance Computing Code Translation using LLMs: A Bridge Between OpenMP Fortran and C++

Published on Jul 15, 2023

Abstract

In this study, we present a novel dataset for training machine learning models to translate between OpenMP Fortran and C++ code. To ensure reliability and applicability, the dataset is created from a range of representative open-source OpenMP benchmarks and refined using a meticulous code similarity test. The effectiveness of our dataset is assessed with both quantitative (CodeBLEU) and qualitative (human evaluation) methods. We showcase how this dataset significantly elevates the translation competencies of large language models (LLMs). Specifically, models without prior coding knowledge saw a roughly 5.1× boost in their CodeBLEU scores, while models with some coding familiarity saw an impressive roughly 9.9× increase. The best model fine-tuned on our dataset outperforms GPT-4 and reaches human-level accuracy. This work underscores the immense potential of our dataset in propelling advancements in the domain of code translation for high-performance computing. The dataset is accessible at https://github.com/bin123apple/Fortran-CPP-HPC-code-translation-dataset (OpenMP-Fortran-CPP-Translation).
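
To make the translation task concrete, the sketch below shows the kind of source/target pair such a dataset might contain: an OpenMP-parallelized loop in Fortran (reproduced in a comment) and an equivalent OpenMP C++ translation. This is a hypothetical example written for this summary, not an entry taken from the paper's dataset; the `saxpy` routine name and the snippet itself are illustrative assumptions.

```cpp
// Hypothetical OpenMP Fortran source (illustrative, not from the dataset):
//
//   subroutine saxpy(n, a, x, y)
//     integer :: n, i
//     real :: a, x(n), y(n)
//     !$omp parallel do
//     do i = 1, n
//       y(i) = a * x(i) + y(i)
//     end do
//     !$omp end parallel do
//   end subroutine saxpy

#include <vector>

// Equivalent OpenMP C++ translation: the 1-based Fortran do-loop becomes a
// 0-based C++ for-loop, and the !$omp parallel do directive maps to
// #pragma omp parallel for. Compile with -fopenmp (GCC/Clang) to enable OpenMP.
void saxpy(int n, float a, const std::vector<float>& x, std::vector<float>& y) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

A model trained on such pairs must learn both the surface syntax mapping (directives, loop bounds, array indexing) and the semantic equivalence of the parallel regions, which is what metrics like CodeBLEU and human evaluation assess from different angles.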
