Tags Generation Dataset
August 20, 2024
Zineb MEFTAH
Dataset Link
Abstract
This paper introduces the methodology for creating a tags generation dataset using two strategies: cyclic refinement and a reverse strategy. The approach produced a dataset of 2,000 samples and showed better performance than baseline models on tag relevance and diversity.
Motivation
In my personalized news project, I needed to generate relevant tags for articles that users clicked on. This would improve recommendations and make the experience more tailored. I tried using GPT base models, but their performance wasn’t good enough for this specific task:
- They produced too many tags, many of which were not closely related to the topic.
Solution
Models like GPT are powerful, but they often need fine-tuning for specific tasks. In this case, generating accurate tags required a dataset of articles paired with tags. Unfortunately, no suitable dataset was available, so I decided to create one.
The Approach
The methodology was inspired by:
- Cyclic refinement: from the article "General Cyclical Training of Neural Networks", which demonstrated that model performance can be improved through iterative data refinement.
- Reverse strategy: inspired by "Modeling Reverse Thinking for Machine Learning", which showed that reverse generation (here, generating articles from tags) simplifies the learning process.
Methodology
Here’s how I built the dataset:
Seed Dataset
I manually created a small, high-quality seed of article tags, inspired by BBC News articles. The seed was diverse and well formatted.
Synthetic Expansion
Using gpt-4o-2024-08-06, I generated 100 sample articles from the seed dataset.
Cyclic Refinement
I fine-tuned gpt-4o-mini-2024-07-18 on the synthetic dataset. Then I refined the model in a cycle (a minimal sketch of one pass follows below):
- Generate articles for the seed's tags using the fine-tuned model, yielding a better-quality seed.
- Create a new dataset by running synthetic data generation with the fine-tuned model.
- Re-fine-tune the base model on that synthetic dataset.
This cycle improved the model’s performance over time.
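To make the loop concrete, here is a minimal sketch of one data-refresh pass, assuming the OpenAI Python SDK. The prompt wording, file name, and chat format are illustrative assumptions rather than the project's exact code; the resulting JSONL file feeds the re-fine-tuning step shown later.

```python
import json

from openai import OpenAI

client = OpenAI()

def generate_article(model: str, tags: str) -> str:
    # Reverse strategy: ask the current fine-tuned model to write
    # an article for a given tag set.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Write a news article for these tags: {tags}"}],
    )
    return resp.choices[0].message.content

def build_training_file(current_model: str, seed_tags: list[str]) -> str:
    """One refresh pass: regenerate the seed articles with the current
    model and write a chat-format JSONL file for the next fine-tune."""
    path = "train.jsonl"
    with open(path, "w") as f:
        for tags in seed_tags:
            article = generate_article(current_model, tags)
            f.write(json.dumps({"messages": [
                {"role": "user", "content": tags},
                {"role": "assistant", "content": article},
            ]}) + "\n")
    return path  # re-fine-tune the base model on this file, then repeat
```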
Final Dataset
Once the model reached a good level of accuracy, I generated the full dataset of 2,000 samples. The diverse tag sets were produced with a base GPT model, which handled that well, and for each tag set the corresponding article was generated with the fine-tuned model. This dataset was large enough for my project, but the process could scale to create even larger datasets.
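The generation code below reads the seed from seed.csv, a headerless two-column file (keywords, then article). For illustration only, a row might look like this (hypothetical content, not taken from the actual dataset):

```
"technology, AI, smartphones","Tech companies unveiled new AI-powered smartphones this week, ..."
```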
Synthetic data generation code:
```python
import csv

from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_experimental.tabular_synthetic_data.openai import (
    create_openai_data_generator,
)
from langchain_experimental.tabular_synthetic_data.prompts import (
    SYNTHETIC_FEW_SHOT_SUFFIX,
)
from langchain_openai import ChatOpenAI
from pydantic import BaseModel


# Output schema: each synthetic sample pairs a keyword set with its article.
class NewsHeadlines(BaseModel):
    Keywords: str
    Articles: str


# Load the seed examples; seed.csv is headerless, so the column names
# are defined manually in the order they appear in the file.
examples = []
column_names = ['Keywords', 'Articles']

with open('seed.csv', mode='r') as file:
    csv_reader = csv.reader(file)
    for row in csv_reader:
        # Map the row data to the column names.
        row_data = dict(zip(column_names, row))
        # Format each row into the few-shot example string.
        example_string = (
            f"The news article's keywords: {row_data['Keywords']}, "
            f"The news article: {row_data['Articles']}"
        )
        examples.append({"example": example_string})

# Each few-shot example is rendered verbatim into the prompt.
OPENAI_TEMPLATE = PromptTemplate(input_variables=["example"], template="{example}")

custom_prompt = """
Each time create a new set of keywords then generate its news article.
"""

# Build the few-shot prompt from the seed examples.
prompt_template = FewShotPromptTemplate(
    prefix=custom_prompt,
    examples=examples,
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
    input_variables=["subject", "extra"],
    example_prompt=OPENAI_TEMPLATE,
)

# Drive generation with the fine-tuned model.
synthetic_data_generator = create_openai_data_generator(
    output_schema=NewsHeadlines,
    llm=ChatOpenAI(
        model="ft:gpt-4o-mini-2024-07-18:open-ai-website:test5:9w9ELVAc",
        temperature=1,
    ),
    prompt=prompt_template,
)

synthetic_results = synthetic_data_generator.generate(
    subject="News Articles",
    extra="The keywords and news articles should be varied, diverse, and not repeated.",
    runs=100,
)
```
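`generate` returns the parsed `NewsHeadlines` objects, so a short follow-up (an assumed snippet, not part of the original code) can persist them for the next fine-tuning round:

```python
import csv

# Continues from the block above: write the synthetic samples to disk.
with open('synthetic_dataset.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for item in synthetic_results:
        writer.writerow([item.Keywords, item.Articles])
```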
The fine-tuning itself was carried out on the OpenAI platform.
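For reference, the equivalent calls through the OpenAI Python SDK would look roughly like this (a sketch; the actual jobs were launched on the platform):

```python
from openai import OpenAI

client = OpenAI()

# Upload the chat-format JSONL training file.
training_file = client.files.create(
    file=open("train.jsonl", "rb"), purpose="fine-tune"
)

# Launch a fine-tuning job on the base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
# Poll client.fine_tuning.jobs.retrieve(job.id) until it reports "succeeded".
```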
Comparison
We ran a simple comparison: we extracted tags from a news article with a model fine-tuned on our dataset (for the article-to-tags direction) and with the base models. The results are summarized below:
Model | Observations | Tag quality |
---|---|---|
gpt-4o-mini | Excessive, repetitive keywords; some weakly related terms | Low |
gpt-4o | Improved, but still included some unrelated keywords | Medium |
Fine-tuned model | No repetition; improved keyword relevance | High |
Use Cases
This fine-tuned model has applications beyond my project:
- Search engines can use it for better indexing.
- Automated tagging systems for blogs, news, or academic content.
- Any system that needs efficient text classification.
Integration in My Project
In my news project, the model generates tags every time a user clicks on an article. These tags are stored to refine the user’s profile, ensuring future recommendations are more relevant and personalized.
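As an illustration of that flow, a click handler might look roughly like the sketch below. The model ID, prompt, and in-memory profile are hypothetical placeholders, not the project's actual implementation:

```python
from openai import OpenAI

client = OpenAI()
FINE_TUNED_MODEL = "ft:gpt-4o-mini-..."  # placeholder for the deployed model ID

def on_article_click(user_profile: dict, article_text: str) -> list[str]:
    # Ask the fine-tuned model for tags describing the clicked article.
    resp = client.chat.completions.create(
        model=FINE_TUNED_MODEL,
        messages=[{"role": "user",
                   "content": f"Generate tags for this news article: {article_text}"}],
    )
    tags = [t.strip() for t in resp.choices[0].message.content.split(",")]

    # Fold the tags into the user's interest profile to sharpen
    # future recommendations.
    for tag in tags:
        user_profile[tag] = user_profile.get(tag, 0) + 1
    return tags
```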
Language (NLP): English
References & Acknowledgments
- Mandeep Goyal and Qusay H. Mahmoud, 2024
- Dylan Royan Almeida, OpenAI Cookbook, 2024
- Xu Guo, and Yiqiang Chen, 2024
- Nikolaos Giarelis and Nikos Karacapilidis, 2024
- Llama Team, AI @ Meta, 2023
License
This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0).