Tags Generation Dataset
August 20, 2024
Zineb MEFTAH
Dataset Link
Abstract
This paper introduces the methodology for creating a tags generation dataset using two strategies: cyclic refinement and a reverse strategy. The approach produced a dataset of 2,000 samples and showed better performance than baseline models on tag relevance and diversity.
Motivation
In my personalized news project, I needed to generate relevant tags for articles that users clicked on. This would improve recommendations and make the experience more tailored. I tried using GPT base models, but their performance wasn’t good enough for this specific task:
- They produced too many tags, many of which were not closely related to the topic.
Solution
Models like GPT are powerful, but they often need fine-tuning for specific tasks. In this case, generating accurate tags required a dataset of articles paired with tags. Unfortunately, no suitable dataset was available, so I decided to create one.
The Approach
The methodology was inspired by:
- Cyclic refinement: from the article "General Cyclical Training of Neural Networks", which demonstrated that model performance can be improved through iterative data refinement.
- Reverse strategy: inspired by "Modeling Reverse Thinking for Machine Learning", which showed that reverse generation (here, generating articles from tags) simplifies the learning process.
Methodology
Here’s how I built the dataset:
Seed Dataset
I manually created a small, high-quality seed of article tags, inspired by BBC News articles. The seed was diverse and well formatted.
Synthetic Expansion
Using gpt-4o-2024-08-06, I generated 100 sample articles from the seed dataset.
Cyclic Refinement
I fine-tuned gpt-4o-mini-2024-07-18 on the synthetic dataset. Then I refined the model in a cycle (a minimal sketch of one pass follows below):
- Generate articles for the seed's tags using the fine-tuned model, yielding a better-quality seed.
- Create a new dataset by running synthetic data generation with the fine-tuned model.
- Re-fine-tune the base model on that synthetic dataset.
This cycle improved the model’s performance over time.
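To make the loop concrete, here is a minimal sketch of one data-refresh pass, assuming the OpenAI Python SDK. The prompt wording, file name, and chat format are illustrative assumptions rather than the project's exact code; the resulting JSONL file feeds the re-fine-tuning step shown later.

```python
import json

from openai import OpenAI

client = OpenAI()

def generate_article(model: str, tags: str) -> str:
    # Reverse strategy: ask the current fine-tuned model to write
    # an article for a given tag set.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Write a news article for these tags: {tags}"}],
    )
    return resp.choices[0].message.content

def build_training_file(current_model: str, seed_tags: list[str]) -> str:
    """One refresh pass: regenerate the seed articles with the current
    model and write a chat-format JSONL file for the next fine-tune."""
    path = "train.jsonl"
    with open(path, "w") as f:
        for tags in seed_tags:
            article = generate_article(current_model, tags)
            f.write(json.dumps({"messages": [
                {"role": "user", "content": tags},
                {"role": "assistant", "content": article},
            ]}) + "\n")
    return path  # re-fine-tune the base model on this file, then repeat
```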
Final Dataset
Once the model reached a good level of accuracy, I generated the full dataset of 2,000 samples. The diverse tag sets were produced with a base GPT model, which handled that well, and for each tag set the corresponding article was generated with the fine-tuned model. This dataset was large enough for my project, but the process could scale to create even larger datasets.
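The generation code below reads the seed from seed.csv, a headerless two-column file (keywords, then article). For illustration only, a row might look like this (hypothetical content, not taken from the actual dataset):

```
"technology, AI, smartphones","Tech companies unveiled new AI-powered smartphones this week, ..."
```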
Synthetic data generation code:
```python
import csv

from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_experimental.tabular_synthetic_data.openai import (
    create_openai_data_generator,
)
from langchain_experimental.tabular_synthetic_data.prompts import (
    SYNTHETIC_FEW_SHOT_SUFFIX,
)
from langchain_openai import ChatOpenAI
from pydantic import BaseModel


# Output schema: each synthetic sample pairs a keyword set with its article.
class NewsHeadlines(BaseModel):
    Keywords: str
    Articles: str


# Load the seed examples; seed.csv is headerless, so the column names
# are defined manually in the order they appear in the file.
examples = []
column_names = ['Keywords', 'Articles']

with open('seed.csv', mode='r') as file:
    csv_reader = csv.reader(file)
    for row in csv_reader:
        # Map the row data to the column names.
        row_data = dict(zip(column_names, row))
        # Format each row into the few-shot example string.
        example_string = (
            f"The news article's keywords: {row_data['Keywords']}, "
            f"The news article: {row_data['Articles']}"
        )
        examples.append({"example": example_string})

# Each few-shot example is rendered verbatim into the prompt.
OPENAI_TEMPLATE = PromptTemplate(input_variables=["example"], template="{example}")

custom_prompt = """
Each time create a new set of keywords then generate its news article.
"""

# Build the few-shot prompt from the seed examples.
prompt_template = FewShotPromptTemplate(
    prefix=custom_prompt,
    examples=examples,
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
    input_variables=["subject", "extra"],
    example_prompt=OPENAI_TEMPLATE,
)

# Drive generation with the fine-tuned model.
synthetic_data_generator = create_openai_data_generator(
    output_schema=NewsHeadlines,
    llm=ChatOpenAI(
        model="ft:gpt-4o-mini-2024-07-18:open-ai-website:test5:9w9ELVAc",
        temperature=1,
    ),
    prompt=prompt_template,
)

synthetic_results = synthetic_data_generator.generate(
    subject="News Articles",
    extra="The keywords and news articles should be varied, diverse, and not repeated.",
    runs=100,
)
```
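`generate` returns the parsed `NewsHeadlines` objects, so a short follow-up (an assumed snippet, not part of the original code) can persist them for the next fine-tuning round:

```python
import csv

# Continues from the block above: write the synthetic samples to disk.
with open('synthetic_dataset.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for item in synthetic_results:
        writer.writerow([item.Keywords, item.Articles])
```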
The fine-tuning itself was carried out on the OpenAI platform.
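For reference, the equivalent calls through the OpenAI Python SDK would look roughly like this (a sketch; the actual jobs were launched on the platform):

```python
from openai import OpenAI

client = OpenAI()

# Upload the chat-format JSONL training file.
training_file = client.files.create(
    file=open("train.jsonl", "rb"), purpose="fine-tune"
)

# Launch a fine-tuning job on the base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
# Poll client.fine_tuning.jobs.retrieve(job.id) until it reports "succeeded".
```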
Comparison
We ran a simple comparison: we extracted tags from a news article with a model fine-tuned on our dataset (for the article-to-tags direction) and with the base models. The results are summarized below:
Model | Observations | Tag quality |
---|---|---|
gpt-4o-mini | Excessive, repetitive keywords; some weakly related terms | Low |
gpt-4o | Improved, but still included some unrelated keywords | Medium |
Fine-tuned model | No repetition; improved keyword relevance | High |
Use Cases
This fine-tuned model has applications beyond my project:
- Search engines can use it for better indexing.
- Automated tagging systems for blogs, news, or academic content.
- Any system that needs efficient text classification.
Integration in My Project
In my news project, the model generates tags every time a user clicks on an article. These tags are stored to refine the user’s profile, ensuring future recommendations are more relevant and personalized.
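As an illustration of that flow, a click handler might look roughly like the sketch below. The model ID, prompt, and in-memory profile are hypothetical placeholders, not the project's actual implementation:

```python
from openai import OpenAI

client = OpenAI()
FINE_TUNED_MODEL = "ft:gpt-4o-mini-..."  # placeholder for the deployed model ID

def on_article_click(user_profile: dict, article_text: str) -> list[str]:
    # Ask the fine-tuned model for tags describing the clicked article.
    resp = client.chat.completions.create(
        model=FINE_TUNED_MODEL,
        messages=[{"role": "user",
                   "content": f"Generate tags for this news article: {article_text}"}],
    )
    tags = [t.strip() for t in resp.choices[0].message.content.split(",")]

    # Fold the tags into the user's interest profile to sharpen
    # future recommendations.
    for tag in tags:
        user_profile[tag] = user_profile.get(tag, 0) + 1
    return tags
```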
Language (NLP): English
References & Acknowledgments
- Mandeep Goyal and Qusay H. Mahmoud, 2024
- Dylan Royan Almeida, OpenAI Cookbook, 2024
- Xu Guo, and Yiqiang Chen, 2024
- Nikolaos Giarelis and Nikos Karacapilidis, 2024
- Llama Team, AI @ Meta, 2023
License
This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0).