What data is this model trained on?

#8
by samr - opened

Hi - I've been using SiEBERT and am pretty impressed by the results compared with other models. The Hugging Face model card says it was trained on 15 diverse datasets. I read the paper and the supplementary information, but these seemed to mostly cover the metrics used to evaluate other sentiment analysis models. Is it recorded anywhere which datasets SiEBERT was trained on?

Owner

Hi, thanks for your interest! The released model was trained on all 15 datasets listed on the model card. For evaluation, we used a leave-one-out scheme: we trained on 14 of the datasets and tested on the remaining one, repeating this for each dataset. The exact references for the datasets can be found in the bibliography of the paper. Please let me know in case of any doubts!
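For anyone following along, the leave-one-out scheme described above can be sketched as follows. This is just an illustration of the split logic, not the authors' actual code; the dataset names and the idea of iterating splits are placeholders.

```python
def leave_one_out_splits(datasets):
    """Yield (train_sets, held_out) pairs: for each dataset,
    train on all the others and test on the one held out."""
    for i, held_out in enumerate(datasets):
        train_sets = datasets[:i] + datasets[i + 1:]
        yield train_sets, held_out

# Placeholder names standing in for the 15 datasets on the model card.
datasets = [f"dataset_{k}" for k in range(1, 16)]

for train_sets, held_out in leave_one_out_splits(datasets):
    # Each evaluation run trains on 14 datasets and tests on the 15th.
    assert len(train_sets) == 14
    assert held_out not in train_sets
```

This yields 15 train/test configurations in total, one per held-out dataset, matching the evaluation protocol described above.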

Thanks for this response. I can see the author surnames and dates on the model card, but I can't always match them to a reference in the paper. Can I check that these are correct?

These I could not identify:

  • Shamma (2009)
  • Kaggle
  • Hartmann et al. (2019)

Would you be able to direct me towards the source?

I don't actually need to follow up every reference. The reason I'm asking is that I'm trying to apply different pre-trained sentiment analysis models to text in the health and care domain. I doubted that any model would work particularly well out of the box and assumed they would all need fine-tuning. It turns out that SiEBERT is actually pretty good out of the box, better than the other models I've tried so far. There's a lot more work to do, but I plan to write all of this up into a paper in due course.

Of course, a relevant factor is the type of data SiEBERT was trained on. From the sources I was able to identify, it looks to be mostly tweets and reviews, but I wondered whether something about the rest of the training data might explain why it performs better, or whether it isn't related to the training data at all but to something about the model itself.

Owner

Hi Sam,

You identified the right ones.

The missing ones are:

In our experience, much of the performance actually comes from the large number of diverse datasets used to train the model. Using a larger model (RoBERTa instead of, e.g., DistilBERT) has some impact as well.

Please reach out at christian.siebert@uni-hamburg.de if you want access to the actual datasets to perform your own analysis.

Hope this was helpful!

Best,

Christian

Thanks very much for this response, Christian.

samr changed discussion status to closed
