---
language: fr
license: mit
tags:
- roberta
- token-classification
base_model: almanach/camembertv2-base
datasets:
- Rhapsodie
metrics:
- las
- upos
model-index:
- name: almanach/camembertv2-base-rhapsodie
  results:
  - task:
      type: token-classification
      name: Part-of-Speech Tagging
    dataset:
      type: Rhapsodie
      name: Rhapsodie
    metrics:
    - name: upos
      type: upos
      value: 0.97556
      verified: false
  - task:
      type: token-classification
      name: Dependency Parsing
    dataset:
      type: Rhapsodie
      name: Rhapsodie
    metrics:
    - name: las
      type: las
      value: 0.84497
      verified: false
---

# Model Card for almanach/camembertv2-base-rhapsodie

almanach/camembertv2-base-rhapsodie is a RoBERTa-based model for token classification. It is fine-tuned on the Rhapsodie dataset for part-of-speech tagging and dependency parsing.
The model achieves a UPOS score of 0.97556 and a LAS of 0.84497 on the Rhapsodie dataset.

This model belongs to the family of fine-tuned models derived from almanach/camembertv2-base.

## Model Details

### Model Description

- **Developed by:** Wissam Antoun (PhD student at Almanach, Inria-Paris)
- **Model type:** roberta
- **Language(s) (NLP):** French
- **License:** MIT
- **Finetuned from model:** almanach/camembertv2-base

### Model Sources

- **Repository:** https://github.com/WissamAntoun/camemberta
- **Paper:** https://arxiv.org/abs/2411.08868

## Uses

The model can be used for part-of-speech tagging and dependency parsing of French text.

## Bias, Risks, and Limitations

The model may reflect biases present in its training data, and it may not generalize well to datasets or tasks that differ from Rhapsodie.

## How to Get Started with the Model

You can use the model directly with the hopsparser library in server mode; see https://github.com/hopsparser/hopsparser/blob/main/docs/server.md
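
The server documentation linked above is the reference for the actual interface. Independently of how the parser is invoked, dependency parsers of this kind commonly work on CoNLL-U data, so a minimal sketch of preparing pre-tokenized French sentences as an unannotated CoNLL-U string may be useful. The helper `to_conllu` is hypothetical and not part of hopsparser, and whether your setup expects CoNLL-U or raw text is an assumption to check against the hopsparser documentation.

```python
def to_conllu(sentences):
    """Render pre-tokenized sentences as CoNLL-U with empty annotations."""
    blocks = []
    for tokens in sentences:
        lines = ["# text = " + " ".join(tokens)]
        for index, form in enumerate(tokens, start=1):
            # Columns: ID, FORM, then LEMMA, UPOS, XPOS, FEATS,
            # HEAD, DEPREL, DEPS, MISC left unspecified ("_")
            lines.append("\t".join([str(index), form] + ["_"] * 8))
        blocks.append("\n".join(lines))
    # Each sentence block is terminated by a blank line
    return "\n\n".join(blocks) + "\n\n"


# Illustrative spoken-French style input (made up for this example)
example = to_conllu([["euh", "je", "sais", "pas"], ["oui", "voilà"]])
print(example)
```

The resulting text can be saved to a file and handed to the parser, which fills in the UPOS, HEAD, and DEPREL columns.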

## Training Details

### Training Procedure

The model was trained with the [hopsparser](https://github.com/hopsparser/hopsparser) library on the Rhapsodie dataset.

#### Training Hyperparameters

```yml
# Layer dimensions
mlp_input: 1024
mlp_tag_hidden: 16
mlp_arc_hidden: 512
mlp_lab_hidden: 128
# Lexers
lexers:
  - name: word_embeddings
    type: words
    embedding_size: 256
    word_dropout: 0.5
  - name: char_level_embeddings
    type: chars_rnn
    embedding_size: 64
    lstm_output_size: 128
  - name: fasttext
    type: fasttext
  - name: camembertv2_base_p2_17k_last_layer
    type: bert
    model: /scratch/camembertv2/runs/models/camembertv2-base-bf16/post/ckpt-p2-17000/pt/
    layers: [11]
    subwords_reduction: "mean"
# Training hyperparameters
encoder_dropout: 0.5
mlp_dropout: 0.5
batch_size: 8
epochs: 64
lr:
  base: 0.00003
  schedule:
    shape: linear
    warmup_steps: 100
```
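
To make the `lr` block concrete, here is a minimal sketch of one common reading of a `linear` schedule with warmup: the rate rises linearly from 0 to `base` over `warmup_steps`, then decays linearly to 0 at the end of training. This interpretation is an assumption, not something the configuration states, and `total_steps` is a hypothetical value, since the number of optimizer steps depends on the dataset size, `batch_size`, and `epochs`.

```python
def linear_schedule_lr(step, total_steps, base_lr=3e-5, warmup_steps=100):
    """Learning rate at `step` under linear warmup then linear decay.

    Assumed behavior, not taken from hopsparser's source code.
    """
    if step < warmup_steps:
        # Warmup phase: ramp up from 0 to base_lr
        return base_lr * step / warmup_steps
    # Decay phase: ramp down from base_lr to 0 at total_steps
    remaining = max(total_steps - step, 0)
    decay_span = max(total_steps - warmup_steps, 1)
    return base_lr * remaining / decay_span


# Hypothetical run length, chosen only for illustration
total_steps = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, linear_schedule_lr(step, total_steps))
```

With a warmup of only 100 steps, the model spends almost all of training in the decay phase.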

#### Results

- **UPOS:** 0.97556
- **LAS:** 0.84497
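
For readers unfamiliar with these metrics, the sketch below shows how UPOS accuracy and the labeled attachment score (LAS) can be computed from a gold and a predicted CoNLL-U file. It is a simplified illustration that assumes both files share the same tokenization; it is not the evaluation script used to produce the numbers above, and the helper names and the three-token example are made up.

```python
def read_conllu(text):
    """Return one list of (upos, head, deprel) tuples per sentence."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1")
        if not cols[0].isdigit():
            continue
        current.append((cols[3], cols[6], cols[7]))
    if current:
        sentences.append(current)
    return sentences


def upos_and_las(gold_text, pred_text):
    """UPOS accuracy and LAS over all tokens, assuming aligned tokens."""
    gold = [tok for sent in read_conllu(gold_text) for tok in sent]
    pred = [tok for sent in read_conllu(pred_text) for tok in sent]
    if not gold or len(gold) != len(pred):
        raise ValueError("files must be non-empty and share a tokenization")
    upos_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both the head and the dependency label must match
    las_hits = sum(g[1:] == p[1:] for g, p in zip(gold, pred))
    return upos_hits / len(gold), las_hits / len(gold)


def row(idx, form, upos, head, deprel):
    """Build one 10-column CoNLL-U token line."""
    return "\t".join([str(idx), form, "_", upos, "_", "_", str(head), deprel, "_", "_"])


gold = "\n".join([row(1, "je", "PRON", 2, "nsubj"),
                  row(2, "sais", "VERB", 0, "root"),
                  row(3, "pas", "ADV", 2, "advmod")]) + "\n"
pred = "\n".join([row(1, "je", "PRON", 2, "nsubj"),
                  row(2, "sais", "VERB", 0, "root"),
                  row(3, "pas", "ADV", 2, "obj")]) + "\n"

upos, las = upos_and_las(gold, pred)
print(f"UPOS: {upos:.5f}  LAS: {las:.5f}")
```

In the toy example every tag is correct, while one of the three tokens has a wrong dependency label and therefore lowers the LAS.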
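
## Technical Specifications

### Model Architecture and Objective

A custom RoBERTa-based model for token classification. According to the training configuration, the encoder's layer 11 is combined with word, character-level, and fastText embeddings.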
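
The training configuration sets `subwords_reduction: "mean"` for the `bert` lexer. A plausible reading, stated here as an assumption, is that each word's representation is the average of the encoder vectors of its subword pieces. The sketch below illustrates that idea with plain Python lists; the function name and the toy vectors are hypothetical, and this is not hopsparser's implementation.

```python
def mean_pool_subwords(subword_vectors, word_ids):
    """Average the vectors of subwords that belong to the same word.

    `word_ids[i]` is the index of the word that subword i belongs to,
    or None for special tokens, which are ignored.
    """
    sums, counts = {}, {}
    for vector, word_id in zip(subword_vectors, word_ids):
        if word_id is None:
            continue
        if word_id not in sums:
            sums[word_id] = [0.0] * len(vector)
            counts[word_id] = 0
        sums[word_id] = [s + v for s, v in zip(sums[word_id], vector)]
        counts[word_id] += 1
    # One averaged vector per word, in word order
    return [[s / counts[w] for s in sums[w]] for w in sorted(sums)]


# Toy 2-dimensional vectors: word 0 is split into two subwords, word 1 into one
vectors = [[9.0, 9.0], [1.0, 3.0], [3.0, 5.0], [2.0, 2.0], [0.0, 0.0]]
word_ids = [None, 0, 0, 1, None]
print(mean_pool_subwords(vectors, word_ids))
```

Averaging gives one fixed-size vector per word regardless of how many pieces the tokenizer produced, which is what a word-level tagger or parser needs.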

## Citation

**BibTeX:**

```bibtex
@misc{antoun2024camembert20smarterfrench,
  title={CamemBERT 2.0: A Smarter French Language Model Aged to Perfection},
  author={Wissam Antoun and Francis Kulumba and Rian Touchent and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
  year={2024},
  eprint={2411.08868},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2411.08868},
}

@inproceedings{grobol:hal-03223424,
  title = {Analyse en dépendances du français avec des plongements contextualisés},
  author = {Grobol, Loïc and Crabbé, Benoît},
  url = {https://hal.archives-ouvertes.fr/hal-03223424},
  booktitle = {Actes de la 28ème Conférence sur le Traitement Automatique des Langues Naturelles},
  eventtitle = {TALN-RÉCITAL 2021},
  venue = {Lille, France},
  pdf = {https://hal.archives-ouvertes.fr/hal-03223424/file/HOPS_final.pdf},
  hal_id = {hal-03223424},
  hal_version = {v1},
}
```