Harnessing the Power of Prompt-based Techniques for Generating School-Level Questions using Large Language Models
Abstract
Designing high-quality educational questions is a challenging and time-consuming task. In this work, we propose a novel approach that utilizes prompt-based techniques to generate descriptive and reasoning-based questions. However, current question-answering (QA) datasets are inadequate for conducting our experiments on prompt-based question generation (QG) in an educational setting. Therefore, we curate a new QG dataset called EduProbe for school-level subjects, leveraging the rich content of NCERT textbooks. We carefully annotate this dataset as quadruples of 1) Context: a segment upon which the question is formed; 2) Long Prompt: a long textual cue for the question (i.e., a longer sequence of words or phrases covering the main theme of the context); 3) Short Prompt: a short textual cue for the question (i.e., a condensed representation of the key information or focus of the context); 4) Question: a deep question that aligns with the context and is coherent with the prompts. We investigate several prompt-based QG methods by fine-tuning pre-trained transformer-based large language models (LLMs), namely PEGASUS, T5, MBART, and BART. Moreover, we explore the performance of two general-purpose pre-trained LLMs, Text-Davinci-003 and GPT-3.5-Turbo, without any further training. Under automatic evaluation, we show that T5 (with long prompt) outperforms all other models but still falls short of the human baseline. Under human evaluation criteria, Text-Davinci-003 usually shows better results than other models across various prompt settings. Even under human evaluation criteria, QG models mostly fall short of the human baseline. Our code and dataset are available at: https://github.com/my625/PromptQG
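To make the prompt-based QG setting concrete, below is a minimal sketch of how a context can be paired with a long or short prompt to condition a T5-style model, using the Hugging Face transformers library. The input template, model checkpoint, example quadruple values, and decoding settings are illustrative assumptions, not the authors' released implementation (see the repository above for that).

```python
# Minimal sketch (illustrative, not the authors' code) of prompt-based question
# generation with a T5 model, assuming an input template of the form
# "generate question: <prompt> context: <context>".
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "t5-base"  # placeholder checkpoint; the paper fine-tunes T5 on EduProbe
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# One EduProbe-style quadruple (hypothetical values, not taken from the dataset).
context = "The Indus Valley Civilization flourished around 2500 BCE with planned cities ..."
long_prompt = "urban planning and drainage systems of the Indus Valley cities"
short_prompt = "Indus Valley urban planning"

def generate_question(prompt: str, context: str) -> str:
    """Condition generation on a textual prompt (long or short) plus the context."""
    text = f"generate question: {prompt} context: {context}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(**inputs, max_length=64, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_question(long_prompt, context))   # long-prompt setting
print(generate_question(short_prompt, context))  # short-prompt setting
```

Fine-tuning on EduProbe would presumably use the same kind of template as the encoder input, with the annotated Question as the decoding target; the exact formatting used in the paper is defined in the linked repository.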