---
title: README
sdk: static
pinned: true
---
Welcome to the ChouBERT space. ChouBERT is a French language model for plant health text mining.
We call it ChouBERT because it is made to monitor plants such as cabbage ("chou"), it handles polysemous words like "chou" or "chouchou" well, and it is "chou" (cute).
We built ChouBERT by further pre-training the CamemBERT base model on French plant health bulletins and tweets.
The ChouBERT-n models are pre-trained for n epochs with masked language modeling (MLM). Use these models if you want to reproduce our experiments.
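As a minimal sketch, the pre-trained checkpoints can be queried with the fill-mask pipeline from 🤗 Transformers. The repository id below is a placeholder; check the model list of this organization for the exact name.

```python
from transformers import pipeline

# Hypothetical repository id -- replace with the actual ChouBERT-n checkpoint on the Hub.
fill_mask = pipeline("fill-mask", model="ChouBERT/ChouBERT-16")

# CamemBERT-style models use "<mask>" as the mask token.
for prediction in fill_mask("Les pucerons attaquent le <mask>."):
    print(prediction["token_str"], prediction["score"])
```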
The ChouBERT-n-plant-health-ner models are ChouBERT-n fine-tuned for Named Entity Recognition (NER) in the plant health domain. The NER paper: <https://hal.science/hal-04245168/>.
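A hedged sketch of running the NER models with the token-classification pipeline; the repository id is a placeholder, and the entity labels depend on the fine-tuning data described in the NER paper.

```python
from transformers import pipeline

# Hypothetical repository id -- replace with the actual ChouBERT-n-plant-health-ner checkpoint.
ner = pipeline(
    "token-classification",
    model="ChouBERT/ChouBERT-16-plant-health-ner",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entity spans
)

for entity in ner("Des larves de pyrale ont été observées sur le maïs en Bretagne."):
    print(entity["entity_group"], entity["word"], entity["score"])
```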
The ChouBERT-n-plant-health-tweet-classifier models are ChouBERT-n fine-tuned to distinguish tweets reporting plant health observations from other tweets. We describe how we built ChouBERT in this paper: <https://hal.archives-ouvertes.fr/hal-03621123>
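A minimal sketch of using one of the classifiers with the text-classification pipeline; the repository id is a placeholder, and the label names come from the model's own config, so check the model card for the actual label mapping.

```python
from transformers import pipeline

# Hypothetical repository id -- replace with the actual tweet-classifier checkpoint.
classifier = pipeline(
    "text-classification",
    model="ChouBERT/ChouBERT-32-plant-health-tweet-classifier",
)

result = classifier("Attaque de mildiou sur mes plants de tomates ce matin.")
print(result)  # e.g. [{"label": ..., "score": ...}]
```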
Our work shows that classifiers based on ChouBERT-16 and ChouBERT-32 generalize best to unseen hazards, especially those named by polysemous terms.
We also provide the CamemBERT-based classifiers as baselines.
### BibTeX entry
```bibtex
@inproceedings{jiang2022choubert,
title={ChouBERT: Pre-training French Language Model for Crowdsensing with Tweets in Phytosanitary Context},
author={Jiang, Shufan and Angarita, Rafael and Cormier, St{\'e}phane and Orensanz, Julien and Rousseaux, Francis},
booktitle={International Conference on Research Challenges in Information Science},
pages={653--661},
year={2022},
organization={Springer}
}
```