What data is this model trained on?

#8
by samr - opened

Hi - I've been using SiEBERT and am pretty impressed by the results compared with other models. The Hugging Face model card says it was trained on 15 diverse datasets. I read the paper and the supplementary information, but these seemed to mostly cover the metrics used to evaluate other sentiment analysis models. Is it recorded anywhere which datasets SiEBERT was trained on?

Owner

Hi, thanks for your interest! The released model was trained on all 15 datasets listed on the model card. For evaluation, we used a leave-one-out scheme: we trained on 14 of the datasets and tested on the remaining one, repeating this for each dataset. The exact references for the datasets can be found in the bibliography of the paper. Please let me know in case of any doubts!
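For anyone following along, the leave-one-out scheme described above can be sketched as follows. This is just an illustration of the split logic, not the authors' actual code; the dataset names and the idea of iterating splits are placeholders.

```python
def leave_one_out_splits(datasets):
    """Yield (train_sets, held_out) pairs: for each dataset,
    train on all the others and test on the one held out."""
    for i, held_out in enumerate(datasets):
        train_sets = datasets[:i] + datasets[i + 1:]
        yield train_sets, held_out

# Placeholder names standing in for the 15 datasets on the model card.
datasets = [f"dataset_{k}" for k in range(1, 16)]

for train_sets, held_out in leave_one_out_splits(datasets):
    # Each evaluation run trains on 14 datasets and tests on the 15th.
    assert len(train_sets) == 14
    assert held_out not in train_sets
```

This yields 15 train/test configurations in total, one per held-out dataset, matching the evaluation protocol described above.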

Thanks for this response. I can see the author surnames and dates on the model card, but I can't always match them to a reference in the paper. Can I check that these are correct?

These I could not identify:

  • Shamma (2009)
  • Kaggle
  • Hartmann et al. (2019)

Would you be able to direct me towards the source?

I don't actually need to follow up every reference. The reason I'm asking is that I'm trying to apply different pre-trained sentiment analysis models to text in the health and care domain. I doubted that any model would work particularly well out of the box and assumed they would all need fine-tuning. It turns out that SiEBERT is actually pretty good out of the box, better than the other models I've tried so far. There's a lot more work to do, but I plan to write all of this up into a paper in due course.

Of course, a relevant factor is the type of data SiEBERT was trained on. From the sources I was able to identify, it looks to be mostly tweets and reviews, but I wondered whether something about the rest of the training data might explain why it performs better, or whether it isn't related to the training data at all but to something about the model itself.

Owner

Hi Sam,

You identified the right ones.

The missing ones are:

In our experience, much of the performance actually comes from the large number of diverse datasets used to train the model. Using a larger model (RoBERTa instead of, e.g., DistilBERT) has some impact as well.

Please reach out at christian.siebert@uni-hamburg.de if you want access to the actual datasets to perform your own analysis.

Hope this was helpful!

Best,

Christian

Thanks very much for this response, Christian.

samr changed discussion status to closed
