ChouBERT

non-profit

AI & ML interests

Crowdsensing crop health on social media

Welcome to the space of ChouBERT, a French language model for plant health text mining.

We call it ChouBERT because it is made to monitor plants such as cabbage ("chou" in French), it handles polysemous words such as "chou" or "chouchou" well, and it is cute ("chou" also means "cute").

We further pre-trained the CamemBERT base model on French plant health bulletins and tweets to build ChouBERT.

The ChouBERT-n models are pre-trained for n epochs with masked language modeling (MLM). You may use these models if you want to reproduce our experiments.
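As a minimal sketch of how one of the pre-trained checkpoints might be queried, the snippet below loads a ChouBERT-n model into a `transformers` fill-mask pipeline. The repository id pattern `ChouBERT/ChouBERT-<n>` and the helper names `choubert_repo_id` and `load_fill_mask` are assumptions made for illustration, and the example sentence is ours; check the model list of this organization for the exact ids.

```python
def choubert_repo_id(n_epochs: int, org: str = "ChouBERT") -> str:
    """Build a Hub repository id such as 'ChouBERT/ChouBERT-16'.

    The naming pattern is an assumption, not confirmed by the model cards.
    """
    if n_epochs <= 0:
        raise ValueError("n_epochs must be a positive integer")
    return f"{org}/ChouBERT-{n_epochs}"


def load_fill_mask(n_epochs: int):
    """Return a fill-mask pipeline for ChouBERT-n (downloads the weights)."""
    # Imported lazily so that choubert_repo_id works without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=choubert_repo_id(n_epochs))


if __name__ == "__main__":
    fill_mask = load_fill_mask(16)
    # Read the mask token from the tokenizer instead of hard-coding it.
    mask = fill_mask.tokenizer.mask_token
    for pred in fill_mask(f"Les pucerons attaquent le {mask}."):
        print(pred["token_str"], round(pred["score"], 3))
```

Running the script prints the candidate tokens for the masked position together with their scores.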

The ChouBERT-n-plant-health-ner models are ChouBERT-n models fine-tuned for Named Entity Recognition (NER) in the plant health domain. The NER paper: https://hal.science/hal-04245168/.
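A hedged sketch of running one of the NER models through the `transformers` token-classification pipeline follows. The repository id `ChouBERT/ChouBERT-16-plant-health-ner` and the helpers `group_entities` and `extract_entities` are assumptions made for illustration. The entity labels come from the model configuration and are not listed here.

```python
from collections import defaultdict

# Assumed repository id; check the organization's model list for the real one.
NER_REPO_ID = "ChouBERT/ChouBERT-16-plant-health-ner"


def group_entities(predictions):
    """Map each entity label to the list of text spans predicted with it.

    `predictions` is the list of dicts returned by a token-classification
    pipeline created with an aggregation strategy.
    """
    grouped = defaultdict(list)
    for pred in predictions:
        grouped[pred["entity_group"]].append(pred["word"])
    return dict(grouped)


def extract_entities(text, repo_id=NER_REPO_ID):
    """Run the NER model on `text` and group the detected spans by label."""
    # Imported lazily so that group_entities works without transformers installed.
    from transformers import pipeline

    ner = pipeline(
        "token-classification",
        model=repo_id,
        aggregation_strategy="simple",  # merge sub-word pieces into spans
    )
    return group_entities(ner(text))


if __name__ == "__main__":
    print(extract_entities("Des pucerons sur mes choux ce matin."))
```

Grouping by label keeps the output compact when a tweet mentions several crops or pests.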

The ChouBERT-n-plant-health-tweet-classifier models are ChouBERT-n models fine-tuned to distinguish tweets reporting plant health observations from other tweets. We describe how we built ChouBERT in this paper: https://hal.archives-ouvertes.fr/hal-03621123

Our work shows that the classifiers based on ChouBERT-16 and ChouBERT-32 generalize best when recognizing unseen hazards, especially polysemous terms. We have also uploaded the CamemBERT-based classifiers as baselines.
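The tweet classifiers described above can be tried with a `transformers` text-classification pipeline. This is a minimal sketch under stated assumptions: the repository id `ChouBERT/ChouBERT-16-plant-health-tweet-classifier` and the helpers `keep_observations` and `classify_tweets` are hypothetical, and the name of the positive label is left as a parameter because it depends on the model's `id2label` configuration.

```python
# Assumed repository id; check the organization's model list for the real one.
CLASSIFIER_REPO_ID = "ChouBERT/ChouBERT-16-plant-health-tweet-classifier"


def keep_observations(tweets, predictions, positive_label, threshold=0.5):
    """Keep the tweets whose top prediction is `positive_label` with enough confidence.

    `predictions` holds one {"label", "score"} dict per tweet, as returned by a
    text-classification pipeline called on a list of strings.
    """
    return [
        tweet
        for tweet, pred in zip(tweets, predictions)
        if pred["label"] == positive_label and pred["score"] >= threshold
    ]


def classify_tweets(tweets, repo_id=CLASSIFIER_REPO_ID):
    """Return the top label and score for each tweet (downloads the weights)."""
    # Imported lazily so that keep_observations works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=repo_id)
    return classifier(list(tweets))


if __name__ == "__main__":
    tweets = [
        "Des pucerons sur mes choux ce matin.",
        "Mon chouchou a encore gagné le match !",
    ]
    for tweet, pred in zip(tweets, classify_tweets(tweets)):
        print(pred["label"], round(pred["score"], 3), tweet)
```

To compare against the baseline, the same functions can be pointed at one of the CamemBERT-based classifiers by passing a different `repo_id`.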

Listen to a song about ChouBERT (generated with Suno): https://suno.com/song/acb11b86-5433-4d97-8e68-e44aefc66a99

BibTeX entries

@inproceedings{jiang2022choubert,
  title={ChouBERT: Pre-training French Language Model for Crowdsensing with Tweets in Phytosanitary Context},
  author={Jiang, Shufan and Angarita, Rafael and Cormier, St{\'e}phane and Orensanz, Julien and Rousseaux, Francis},
  booktitle={International Conference on Research Challenges in Information Science},
  pages={653--661},
  year={2022},
  organization={Springer}
}

@inproceedings{jiang2022ner,
  title = {{Named Entity Recognition for Monitoring Plant Health Threats in Tweets: a ChouBERT Approach}},
  author = {Jiang, Shufan and Angarita, Rafael and Cormier, St{\'e}phane and Rousseaux, Francis},
  booktitle = {{2022 6th International Conference on Universal Village (UV)}},
  address = {Boston, United States},
  publisher = {{IEEE}},
  year = {2022},
  doi = {10.1109/UV56588.2022.10185492},
}

datasets

None public yet