Harnessing the Power of Prompt-based Techniques for Generating School-Level Questions using Large Language Models
Abstract
Designing high-quality educational questions is a challenging and time-consuming task. In this work, we propose a novel approach that utilizes prompt-based techniques to generate descriptive and reasoning-based questions. However, current question-answering (QA) datasets are inadequate for conducting our experiments on prompt-based question generation (QG) in an educational setting. Therefore, we curate a new QG dataset called EduProbe for school-level subjects, leveraging the rich content of NCERT textbooks. We carefully annotate this dataset as quadruples of 1) Context: a segment upon which the question is formed; 2) Long Prompt: a long textual cue for the question (i.e., a longer sequence of words or phrases covering the main theme of the context); 3) Short Prompt: a short textual cue for the question (i.e., a condensed representation of the key information or focus of the context); 4) Question: a deep question that aligns with the context and is coherent with the prompts. We investigate several prompt-based QG methods by fine-tuning pre-trained transformer-based large language models (LLMs), namely PEGASUS, T5, MBART, and BART. Moreover, we explore the performance of two general-purpose pre-trained LLMs, Text-Davinci-003 and GPT-3.5-Turbo, without any further training. Under automatic evaluation, we show that T5 (with long prompt) outperforms all other models but still falls short of the human baseline. Under human evaluation criteria, Text-Davinci-003 usually shows better results than other models across various prompt settings. Even under human evaluation criteria, QG models mostly fall short of the human baseline. Our code and dataset are available at: https://github.com/my625/PromptQG
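To make the prompt-based QG setting concrete, below is a minimal sketch of how a context can be paired with a long or short prompt to condition a T5-style model, using the Hugging Face transformers library. The input template, model checkpoint, example quadruple values, and decoding settings are illustrative assumptions, not the authors' released implementation (see the repository above for that).

```python
# Minimal sketch (illustrative, not the authors' code) of prompt-based question
# generation with a T5 model, assuming an input template of the form
# "generate question: <prompt> context: <context>".
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "t5-base"  # placeholder checkpoint; the paper fine-tunes T5 on EduProbe
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# One EduProbe-style quadruple (hypothetical values, not taken from the dataset).
context = "The Indus Valley Civilization flourished around 2500 BCE with planned cities ..."
long_prompt = "urban planning and drainage systems of the Indus Valley cities"
short_prompt = "Indus Valley urban planning"

def generate_question(prompt: str, context: str) -> str:
    """Condition generation on a textual prompt (long or short) plus the context."""
    text = f"generate question: {prompt} context: {context}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(**inputs, max_length=64, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_question(long_prompt, context))   # long-prompt setting
print(generate_question(short_prompt, context))  # short-prompt setting
```

Fine-tuning on EduProbe would presumably use the same kind of template as the encoder input, with the annotated Question as the decoding target; the exact formatting used in the paper is defined in the linked repository.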