title: Homepage2vec
emoji: π
colorFrom: purple
colorTo: blue
python_version: 3.1
sdk: gradio
sdk_version: 4.11.0
app_file: app.py
pinned: false
tags:
- llm
- gpt
- homepage2vec
- multi-lingual
- multi-label
- classification
Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification
This project was developed in collaboration with the Data Science Lab (DLab) at EPFL as part of the Machine Learning (CS-433) course. We thank Prof. Robert West for enabling the project and Tiziano Piccardi for his guidance and support throughout.
Abstract
Homepage2Vec, a state-of-the-art open-source model for multilingual, multilabel website classification, has proven powerful in accurately classifying website topics. However, it is limited by its initial training data, which on average contains only a single topic per website. This study explores the use of Large Language Models (LLMs) to create a high-quality finetuning dataset that more accurately reflects the topic diversity of a website. We assess various LLM-based labelers and select the best one by comparing their labels to crowdsourced annotations. We generate two variants of a new 10,000-website dataset, curlie-gpt3.5-10k and curlie-gpt4-10k, for finetuning Homepage2Vec. We show that finetuning Homepage2Vec with these datasets improves its macro F1 from 38% to 42%. Finally, we release both LLM-annotated datasets publicly.
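The headline metric above is macro F1 over multiple topic labels. As a rough illustration of what that number measures, here is a minimal sketch in plain Python. It is not the project's evaluation code; the function names, the toy data, and the convention of scoring a label with no positives as 0.0 are all assumptions made for this example.

```python
from typing import List, Sequence


def f1_for_label(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for a single binary label column.

    Assumption: a label with no true or predicted positives scores 0.0.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Unweighted mean of the per-label F1 scores."""
    n_labels = len(y_true[0])
    per_label = [
        f1_for_label([row[j] for row in y_true], [row[j] for row in y_pred])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


# Hypothetical toy data: 4 websites (rows), 3 topic labels (columns).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}")
```

Because every label contributes equally to the average, a model that misses secondary topics on multi-topic websites is penalized even when its most confident prediction is correct, which is why multi-label training data matters for this metric.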