Training details
Good afternoon, on which dataset was the fine-tuning of the model carried out?
The model was fine-tuned on about 10 datasets from different news portals, all of which we created ourselves.
I've pushed an example dataset I created based on https://www.kaggle.com/datasets/hadasu92/cnn-articles-after-basic-cleaning.
Later this month I will share my dataset creation code and training code.
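In the meantime, here is a minimal sketch of how node-level training data for this kind of task can be fed to MarkupLM through the transformers processor. The base checkpoint, the two-label scheme, and the toy nodes/xpaths below are illustrative assumptions, not the exact setup used for this model:

```python
from transformers import MarkupLMProcessor, MarkupLMForTokenClassification

# Base checkpoint and label count are assumptions for illustration only.
processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")
processor.parse_html = False  # supply nodes/xpaths/labels ourselves for token-classification training

model = MarkupLMForTokenClassification.from_pretrained("microsoft/markuplm-base", num_labels=2)

# One toy example: text nodes from a news page, their XPaths, and per-node labels
# (e.g. 1 = headline, 0 = other). Real data would come from the scraped news portals.
nodes = ["Example headline", "First paragraph of the article."]
xpaths = ["/html/body/h1", "/html/body/p[1]"]
node_labels = [1, 0]

encoding = processor(
    nodes=nodes,
    xpaths=xpaths,
    node_labels=node_labels,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)

outputs = model(**encoding)
print(outputs.loss)  # cross-entropy loss over the labelled nodes
```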
This is great news. We are also exploring the applicability of MarkupLM to web scraping tasks.
Are you planning to publish your datasets in the public domain?
No, I'm not planning to publish our datasets in the public domain due to project restrictions. But I've added one more example dataset in a separate branch:
https://huggingface.co/olga-rondareva/OxMarkupLM/tree/medium-dataset/datasets
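For reference, a minimal sketch of how the datasets folder on that branch could be fetched locally with huggingface_hub; the folder pattern is taken from the linked URL, and the exact file names inside it are not assumed:

```python
from huggingface_hub import snapshot_download

# Download only the datasets/ folder from the medium-dataset branch of the model repo.
local_dir = snapshot_download(
    repo_id="olga-rondareva/OxMarkupLM",
    revision="medium-dataset",
    allow_patterns=["datasets/*"],
)
print(local_dir)  # local cache path containing the downloaded files
```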