
LaMini-Flan-T5-77M-qa-generation

Model Description

This model is a fine-tuned version of MBZUAI/LaMini-Flan-T5-77M, based on the FLAN-T5 architecture and trained to generate question-answer pairs from raw text.

Key Features

  • Base Model: MBZUAI/LaMini-Flan-T5-77M
  • Task: Question and answer pair generation
  • Training Data: agentlans/finewebedu-sft
  • Added Tokens: [QUESTION_END], [ANSWER_END] (see the check after this list)
  • Evaluation Loss: 1.3572

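The delimiter tokens can be checked directly from the tokenizer. This is a minimal sketch, assuming the tokens were added as regular (non-special) vocabulary entries during fine-tuning:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("agentlans/LaMini-Flan-T5-77M-qa-generation")
# Both delimiters should map to valid token IDs rather than the unknown token.
for token in ["[QUESTION_END]", "[ANSWER_END]"]:
    print(token, tokenizer.convert_tokens_to_ids(token))
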
Usage

To use this model for generating question-answer pairs:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "agentlans/LaMini-Flan-T5-77M-qa-generation"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "Your input text here..."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=512)
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

Output Processing

The model generates output in the following format:

Question[QUESTION_END]Answer[ANSWER_END]Question[QUESTION_END]Answer[ANSWER_END]...

To parse this output into a structured format:

import re

def clean_text(text):
    return re.sub(r'\s+', ' ', text).strip()

def parse_qa_pairs(input_text):
    # Split on [ANSWER_END]; the capture group keeps the delimiters, so the list
    # alternates between "Question[QUESTION_END]Answer" blocks and "[ANSWER_END]" entries.
    qa_blocks = re.split(r'(\[ANSWER_END\])', input_text)
    pairs = []
    # Step over the delimiter entries; the -1 skips any trailing incomplete block.
    for i in range(0, len(qa_blocks) - 1, 2):
        qa_block = qa_blocks[i]
        parts = qa_block.split('[QUESTION_END]')
        if len(parts) == 2:
            question, answer = map(clean_text, parts)
            if question and answer:
                pairs.append({"question": question, "answer": answer})
    return pairs

qa_pairs = parse_qa_pairs(decoded_output)
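
The parsed pairs are plain dictionaries and can be inspected or serialized directly, for example:

import json

# Pretty-print the extracted question-answer pairs as JSON.
print(json.dumps(qa_pairs, indent=4, ensure_ascii=False))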

Example

Input:

The ocean, covering over 70% of our planet's surface, is a vast and mysterious realm teeming with life and beauty. From the vibrant coral reefs that serve as bustling underwater cities to the deep, dark trenches that house some of the most bizarre creatures on Earth, the ocean is a treasure trove of biodiversity. It plays a crucial role in regulating the global climate, absorbing carbon dioxide and producing oxygen through its phytoplankton. Moreover, the ocean's depths remain largely unexplored, holding countless secrets and potential discoveries that could revolutionize our understanding of biology, medicine, and environmental science. As we continue to learn more about this incredible ecosystem, it becomes increasingly clear that protecting our oceans is essential for the health of our planet and future generations.

Output:

[
    {
        "question": "What is the ocean's role in regulating the global climate?",
        "answer": "The ocean plays a crucial role in regulating the global climate by absorbing carbon dioxide and producing oxygen through its phytoplankton."
    },
    {
        "question": "What are some of the key discoveries that could revolutionize our understanding of the ocean?",
        "answer": "The ocean's depths remain largely unexplored, holding secrets and potential discoveries that could revolutionize our understanding of biology, medicine, and environmental science."
    },
    {
        "question": "What is the significance of protecting our oceans for future generations?",
        "answer": "Protecting our oceans is essential for the health of our planet and future generations because it is a vital part of our ecosystem and a vital resource for our survival and well-being."
    }
]

Training Procedure

Training Hyperparameters

The following hyperparameters were used during training:

  • Learning rate: 0.0003
  • Train batch size: 16
  • Eval batch size: 16
  • Seed: 42
  • Gradient accumulation steps: 2
  • Total train batch size: 32
  • Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • LR scheduler type: linear
  • LR scheduler warmup steps: 500
  • Number of epochs: 10.0
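
For reference, these settings correspond roughly to the following Seq2SeqTrainingArguments. This is a minimal sketch; the output directory and any options not listed above are assumptions, since the original training script is not included in this card.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="LaMini-Flan-T5-77M-qa-generation",  # hypothetical output path
    learning_rate=3e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    gradient_accumulation_steps=2,  # effective train batch size of 32
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=10,
)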

Training Results

Training Loss   Epoch    Step   Validation Loss
1.6321          1.2361    500   1.4333
1.5305          2.4722   1000   1.4013
1.4754          3.7083   1500   1.3719
1.4425          4.9444   2000   1.3693
1.3781          6.1805   2500   1.3647
1.3687          7.4166   3000   1.3572
1.3413          8.6527   3500   1.3596
1.3539          9.8888   4000   1.3594

Limitations

  • The model's performance may vary depending on the complexity and domain of the input text.
  • The quality of generated questions and answers can be inconsistent across different topics.
  • The model may occasionally generate irrelevant or repetitive question-answer pairs.

Framework Versions

  • Transformers 4.44.0
  • Pytorch 2.2.2+cu121
  • Datasets 2.18.0
  • Tokenizers 0.19.1