HeBERT: Pre-trained BERT for Polarity Analysis and Emotion Recognition

HeBERT is a pretrained Hebrew language model. It is based on Google's BERT architecture and uses the BERT-Base configuration.

HeBERT was trained on three datasets:

A Hebrew version of OSCAR: ~9.8 GB of data, including 1 billion words and over 20.8 million sentences.
A Hebrew dump of Wikipedia: ~650 MB of data, including over 63 million words and 3.8 million sentences.
Emotion User Generated Content (UGC) data that was collected for the purpose of this study (described below).

Named-entity recognition (NER)

This task measures the ability of the model to classify named entities in text, such as persons' names, organizations, and locations. It was tested on a labeled dataset from Ben Mordecai and M Elhadad (2005) and evaluated with F1-score.
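As a rough illustration of how an F1-score for NER can be computed, here is a minimal sketch. It assumes entity-level exact matching (a prediction counts only if both span and label agree with the gold annotation); the text above does not say whether the reported score is entity-level or token-level. The `entity_f1` helper, the `PER`/`ORG`/`LOC` labels, and the spans are hypothetical and are not taken from the HeBERT evaluation.

```python
from typing import Set, Tuple

# An entity is (start_token, end_token, label). All values below are
# hypothetical examples, not data from the HeBERT evaluation.
Entity = Tuple[int, int, str]


def entity_f1(gold: Set[Entity], predicted: Set[Entity]) -> float:
    """Entity-level F1: a prediction is correct only if span and label both match."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {(0, 1, "PER"), (3, 5, "ORG"), (6, 7, "LOC")}
# The second entity has the right span but the wrong label, so it is a miss.
predicted = {(0, 1, "PER"), (3, 5, "LOC"), (6, 7, "LOC")}

score = entity_f1(gold, predicted)
print(f"{score:.3f}")
```

Here two of the three predictions match exactly, so precision and recall are both 2/3 and so is their harmonic mean.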

How to use

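    from transformers import pipeline

    # Load the fine-tuned HeBERT NER model and its tokenizer from the Hugging Face Hub.
    NER = pipeline(
        "token-classification",
        model="avichr/heBERT_NER",
        tokenizer="avichr/heBERT_NER",
    )
    # Tag a Hebrew sentence ("David studies at the Hebrew University in Jerusalem").
    print(NER('דויד לומד באוניברסיטה העברית שבירושלים'))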
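The token-classification pipeline returns a list of dictionaries, one per tagged token, with keys such as `entity`, `score`, `word`, `start`, and `end`. A possible post-processing step is to drop low-confidence predictions. The sketch below is only an illustration: `keep_confident` is a hypothetical helper, the 0.8 threshold is arbitrary, and the sample labels and scores are made up rather than produced by `avichr/heBERT_NER`.

```python
from typing import Dict, List


def keep_confident(predictions: List[Dict], min_score: float = 0.8) -> List[Dict]:
    """Return only the predictions whose 'score' is at least min_score."""
    return [p for p in predictions if p["score"] >= min_score]


# Hypothetical output in the shape the token-classification pipeline returns.
# Labels and scores are invented for illustration, not real model output.
sample = [
    {"entity": "PER", "score": 0.99, "index": 1, "word": "דויד", "start": 0, "end": 4},
    {"entity": "ORG", "score": 0.55, "index": 3, "word": "באוניברסיטה", "start": 10, "end": 21},
]

confident = keep_confident(sample)
print([(p["word"], p["entity"]) for p in confident])
```

In practice the same function could be applied directly to the list returned by `NER(...)`; a suitable threshold depends on the application.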

Other tasks

Emotion Recognition Model. An online model can be found at Hugging Face Spaces or as a Colab notebook.
Sentiment Analysis.
Masked-LM model (can be fine-tuned to any downstream task).

Contact us

Avichay Chriqui
Inbal Yahav
The Coller Semitic Languages AI Lab
Thank you, תודה, شكرا

If you used this model, please cite us as:

Chriqui, A., & Yahav, I. (2021). HeBERT & HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition. arXiv preprint arXiv:2102.01909.

@article{chriqui2021hebert,
  title={HeBERT \& HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition},
  author={Chriqui, Avihay and Yahav, Inbal},
  journal={arXiv preprint arXiv:2102.01909},
  year={2021}
}
