Tags generation dataset 🧠


August 20, 2024
Zineb MEFTAH
Dataset Link

Abstract

This paper introduces the methodology for creating a tags generation dataset using two strategies: cyclic refinement and a reverse strategy. This approach resulted in a dataset of 2,000 samples and yielded better performance than baseline models on tag relevance and diversity.


Motivation

In my personalized news project, I needed to generate relevant tags for articles that users clicked on. This would improve recommendations and make the experience more tailored. I tried using GPT base models, but their performance wasn’t good enough for this specific task.

Problem

  • The base models produced too many tags, many of which were not closely related to the topic.

Solution

Models like GPT are powerful, but they often need fine-tuning for specific tasks. In this case, generating accurate tags required a dataset of articles paired with tags. Unfortunately, no suitable dataset was available, so I decided to create one.

The Approach

The methodology was inspired by:

Methodology

(Figure: cycle presentation diagram)

Here’s how I built the dataset:

  1. Seed Dataset: I manually created a small, high-quality set of seed tags for articles, inspired by BBC News articles. These were diverse and well formatted.

  2. Synthetic Expansion: Using GPT-4o-2024-08-06, I generated 100 sample articles from the seed dataset.

  3. Cyclic Refinement: I fine-tuned GPT-4o-mini-2024-07-18 on the synthetic dataset. Then, I refined the model in a cycle (see the sketch after this list):

    • Generate articles for the seed's tags using the fine-tuned model (better-quality seed).
    • Create a new dataset by running synthetic data generation with the fine-tuned model.
    • Re-fine-tune the base model on the new synthetic dataset.

      This cycle improved the model’s performance over time.

  4. Final Dataset: Once the model reached a good level of accuracy, I generated a full dataset of 2,000 samples. The diverse tag sets were produced with the GPT base model, since it was good at that, and for each tag set the corresponding article was generated with the fine-tuned model. This dataset was large enough for my project, but the process could scale to create even larger datasets.
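
To make the cycle concrete, here is a minimal sketch of one refinement iteration with the OpenAI Python client. build_training_file is a hypothetical placeholder for the first two steps of the cycle (article regeneration plus synthetic data generation), and the job-polling logic is elided; only the fine-tuning calls themselves are real API.

from openai import OpenAI

client = OpenAI()
BASE_MODEL = "gpt-4o-mini-2024-07-18"
current_model = BASE_MODEL  # the first cycle starts from the base model

for cycle in range(3):  # the number of cycles is a free choice
    # Steps 1-2 of the cycle: regenerate seed articles and run synthetic
    # data generation with the current model, writing a JSONL training file.
    training_path = build_training_file(current_model)  # hypothetical helper

    # Step 3: re-fine-tune the base model on the new synthetic dataset.
    uploaded = client.files.create(file=open(training_path, "rb"), purpose="fine-tune")
    job = client.fine_tuning.jobs.create(training_file=uploaded.id, model=BASE_MODEL)

    # ...poll until the job succeeds, then carry the new model into the next cycle.
    current_model = client.fine_tuning.jobs.retrieve(job.id).fine_tuned_model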

  • Synthetic data generation code:

import csv

# Imports assumed for this snippet: pydantic for the output schema and
# LangChain's synthetic-data utilities (langchain_experimental).
from pydantic import BaseModel
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_openai import ChatOpenAI
from langchain_experimental.tabular_synthetic_data.openai import create_openai_data_generator
from langchain_experimental.tabular_synthetic_data.prompts import SYNTHETIC_FEW_SHOT_SUFFIX


# Output schema: each generated sample pairs a keyword set with its article.
class NewsHeadlines(BaseModel):
    Keywords: str
    Articles: str

# Initialize the list to store the examples
examples = []

# Manually define the column names based on their order in the file
column_names = ['Keywords', 'Articles']


# Open the CSV file and read its contents
with open('seed.csv', mode='r') as file:
    # Create a CSV reader without headers
    csv_reader = csv.reader(file)

    # Iterate through each row in the CSV file
    for row in csv_reader:
        # Map the row data to the column names
        row_data = dict(zip(column_names, row))

        # Format the data into the required string
        example_string = f"""The news article's keywords: {row_data['Keywords']}, The news article: {row_data['Articles']}"""

        # Append the formatted string as a dictionary to the examples list
        examples.append({"example": example_string})

OPENAI_TEMPLATE = PromptTemplate(input_variables=["example"], template="{example}")

custom_prompt = """
Each time create a new set of keywords then generate its news article.
"""

# Create a FewShotPromptTemplate with the custom prompt
prompt_template = FewShotPromptTemplate(
    prefix=custom_prompt,
    examples=examples,
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
    input_variables=["subject", "extra"],
    example_prompt=OPENAI_TEMPLATE,
)

synthetic_data_generator = create_openai_data_generator(
    output_schema=NewsHeadlines,
    llm=ChatOpenAI(model="ft:gpt-4o-mini-2024-07-18:open-ai-website:test5:9w9ELVAc", temperature=1),
    prompt=prompt_template,
)

synthetic_results = synthetic_data_generator.generate(
    subject="News Articles",
    extra="The keywords and news articles should be varied, diversed, and not repeated.",
    runs=100,
)
  • The fine-tuning process was done on the OpenAI platform.
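
    The platform expects chat-formatted JSONL training files. Below is a minimal sketch of converting the generated samples, assuming synthetic_results holds the NewsHeadlines objects returned above; this shows the article-from-keywords direction used during the cycle, and for the final tags-from-article model the user and assistant contents would simply be swapped.

import json

# Assumption: each item in synthetic_results is a NewsHeadlines instance.
with open("training.jsonl", "w") as f:
    for item in synthetic_results:
        record = {
            "messages": [
                {"role": "user", "content": f"The news article's keywords: {item.Keywords}"},
                {"role": "assistant", "content": item.Articles},
            ]
        }
        f.write(json.dumps(record) + "\n")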

  • Comparison: I ran a simple comparison on a news article, extracting tags from it with a model fine-tuned on this dataset to generate tags from articles. Here are the screenshots of the process:

(Screenshots: gpt-4o-mini output, gpt-4o output, and fine-tuned model output)

Model              Comparison                                                  Efficiency
gpt-4o-mini        Excessive, repetitive keywords; some weakly related terms   Low
gpt-4o             Improved but still included some unrelated keywords         Medium
Fine-Tuned Model   No repetition; improved keyword relevance                   High
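
For reference, a minimal sketch of how such a side-by-side comparison can be run; the prompt wording is illustrative and the fine-tuned model ID is a placeholder, not the exact one used.

from openai import OpenAI

client = OpenAI()
article = "..."  # the test news article

models = [
    "gpt-4o-mini",
    "gpt-4o",
    "ft:gpt-4o-mini-2024-07-18:...",  # placeholder for the fine-tuned model ID
]

# Send the same article to each model and print the tags it returns.
for model in models:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Extract tags for this news article: {article}"}],
    )
    print(model, "->", response.choices[0].message.content)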

Use Cases

This fine-tuned model has applications beyond my project:

  • Search engines can use it for better indexing.
  • Automated tagging systems for blogs, news, or academic content.
  • Any system that needs efficient text classification.

Integration in My Project

In my news project, the model generates tags every time a user clicks on an article. These tags are stored to refine the user’s profile, ensuring future recommendations are more relevant and personalized.
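
A minimal sketch of that flow, assuming an in-memory profile store; the prompt, the comma-separated tag parsing, and the profile structure are illustrative assumptions rather than the project's exact implementation.

from collections import Counter

from openai import OpenAI

client = OpenAI()
user_profiles: dict[str, Counter] = {}  # user_id -> tag frequency counts

def on_article_click(user_id: str, article_text: str) -> None:
    # Generate tags for the clicked article with the fine-tuned model.
    response = client.chat.completions.create(
        model="ft:gpt-4o-mini-2024-07-18:...",  # placeholder fine-tuned model ID
        messages=[{"role": "user", "content": f"Extract tags for this news article: {article_text}"}],
    )
    tags = [t.strip() for t in response.choices[0].message.content.split(",")]

    # Store the tags to refine the user's interest profile.
    user_profiles.setdefault(user_id, Counter()).update(tags)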


Language (NLP)

English


References & Acknowledgments


License

This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0).