Training details
Good afternoon, on which dataset was the fine-tuning of the model carried out?
The model was fine-tuned on about 10 datasets from different news portals, all of which we created ourselves.
I've pushed an example dataset I created based on https://www.kaggle.com/datasets/hadasu92/cnn-articles-after-basic-cleaning.
Later this month I will share my dataset creation code and training code.
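In the meantime, here is a minimal sketch of how node-level training data for this kind of task can be fed to MarkupLM through the transformers processor. The base checkpoint, the two-label scheme, and the toy nodes/xpaths below are illustrative assumptions, not the exact setup used for this model:

```python
from transformers import MarkupLMProcessor, MarkupLMForTokenClassification

# Base checkpoint and label count are assumptions for illustration only.
processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")
processor.parse_html = False  # supply nodes/xpaths/labels ourselves for token-classification training

model = MarkupLMForTokenClassification.from_pretrained("microsoft/markuplm-base", num_labels=2)

# One toy example: text nodes from a news page, their XPaths, and per-node labels
# (e.g. 1 = headline, 0 = other). Real data would come from the scraped news portals.
nodes = ["Example headline", "First paragraph of the article."]
xpaths = ["/html/body/h1", "/html/body/p[1]"]
node_labels = [1, 0]

encoding = processor(
    nodes=nodes,
    xpaths=xpaths,
    node_labels=node_labels,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)

outputs = model(**encoding)
print(outputs.loss)  # cross-entropy loss over the labelled nodes
```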
This is great news. We are also exploring the applicability of MarkupLM to web scraping tasks.
Are you planning to publish your datasets in the public domain?
No, I'm not planning to publish our datasets in the public domain due to project restrictions. But I've added one more example dataset in a separate branch:
https://huggingface.co/olga-rondareva/OxMarkupLM/tree/medium-dataset/datasets
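For reference, a minimal sketch of how the datasets folder on that branch could be fetched locally with huggingface_hub; the folder pattern is taken from the linked URL, and the exact file names inside it are not assumed:

```python
from huggingface_hub import snapshot_download

# Download only the datasets/ folder from the medium-dataset branch of the model repo.
local_dir = snapshot_download(
    repo_id="olga-rondareva/OxMarkupLM",
    revision="medium-dataset",
    allow_patterns=["datasets/*"],
)
print(local_dir)  # local cache path containing the downloaded files
```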