---
title: README
sdk: static
pinned: true
---
Welcome to the ChouBERT space. ChouBERT is a French language model for plant health text mining.

It is called ChouBERT because it is built to monitor plants such as cabbage ("chou"), it handles polysemous words like "chou" or "chouchou" well, and it is cute ("chou").

We built ChouBERT by further pre-training the CamemBERT base model on French plant health bulletins and tweets.

ChouBERT-n models are pre-trained for n epochs with masked language modeling (MLM). Use these models if you want to reproduce our experiments.

ChouBERT-n-plant-health-ner models are ChouBERT-n fine-tuned for Named Entity Recognition (NER) in the plant health domain. The NER paper: <https://hal.science/hal-04245168/>.

ChouBERT-n-plant-health-tweet-classifier models are ChouBERT-n fine-tuned to distinguish tweets reporting plant health observations from other tweets. We describe how we built ChouBERT in this paper: <https://hal.archives-ouvertes.fr/hal-03621123>.

Our work shows that classifiers based on ChouBERT-16 and ChouBERT-32 generalize best to unseen hazards, especially polysemous terms.
We also upload CamemBERT-based classifiers as baselines.
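The fine-tuned checkpoints can be loaded with the standard Hugging Face `transformers` pipeline API. Below is a minimal sketch; the model IDs are illustrative placeholders (check this space for the exact Hub repository names):

```python
# Sketch of loading the fine-tuned ChouBERT checkpoints with the
# transformers pipeline API. The model IDs are placeholders, not
# confirmed Hub repository names.
from transformers import pipeline


def classify_plant_health_tweet(
    text: str,
    model_id: str = "ChouBERT-32-plant-health-tweet-classifier",  # placeholder ID
):
    """Label a tweet as a plant health observation or not."""
    clf = pipeline("text-classification", model=model_id)
    return clf(text)


def extract_plant_health_entities(
    text: str,
    model_id: str = "ChouBERT-32-plant-health-ner",  # placeholder ID
):
    """Extract plant-health-domain named entities from French text."""
    ner = pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # merge sub-word tokens into whole entities
    )
    return ner(text)
```

Both helpers download the model on first use; pass a local path to `model_id` to run offline.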

### BibTeX entry

```bibtex
@inproceedings{jiang2022choubert,
  title={ChouBERT: Pre-training French Language Model for Crowdsensing with Tweets in Phytosanitary Context},
  author={Jiang, Shufan and Angarita, Rafael and Cormier, St{\'e}phane and Orensanz, Julien and Rousseaux, Francis},
  booktitle={International Conference on Research Challenges in Information Science},
  pages={653--661},
  year={2022},
  organization={Springer}
}
```