---
language: fr
license: mit
tags:
- roberta
- token-classification
base_model: almanach/camembertv2-base
datasets:
- FTB-NER
metrics:
- f1
pipeline_tag: token-classification
library_name: transformers
model-index:
- name: almanach/camembertv2-base-ftb-ner
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition (NER)
    dataset:
      type: ftb-ner
      name: French Treebank Named Entity Recognition
    metrics:
    - name: f1
      type: f1
      value: 0.93548
      verified: false
---
# Model Card for almanach/camembertv2-base-ftb-ner
almanach/camembertv2-base-ftb-ner is a RoBERTa-style model for token classification, fine-tuned on the FTB-NER dataset for Named Entity Recognition (NER). It achieves an F1 score of 0.93548 on the FTB-NER dataset.
The model belongs to the family of models fine-tuned from almanach/camembertv2-base.
## Model Details
### Model Description
- **Developed by:** Wissam Antoun (PhD student at Almanach, Inria-Paris)
- **Model type:** roberta
- **Language(s) (NLP):** French
- **License:** MIT
- **Fine-tuned from model:** almanach/camembertv2-base
### Model Sources
- **Repository:** https://github.com/WissamAntoun/camemberta
- **Paper:** https://arxiv.org/abs/2411.08868
## Uses
The model can be used for Named Entity Recognition (NER) on French text, framed as a token classification task.
## Bias, Risks, and Limitations
The model may reproduce biases present in its training data. Because it was fine-tuned only on FTB-NER, it may not generalize well to other datasets, domains, or tasks.
## How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForTokenClassification.from_pretrained("almanach/camembertv2-base-ftb-ner")
tokenizer = AutoTokenizer.from_pretrained("almanach/camembertv2-base-ftb-ner")

# Build a token-classification (NER) pipeline and run it on a French sentence.
classifier = pipeline("token-classification", model=model, tokenizer=tokenizer)
print(classifier("Votre texte ici"))
```
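By default the token-classification pipeline returns one prediction per token. Passing `aggregation_strategy="simple"` to `pipeline` merges sub-word tokens into entity spans, each with an `entity_group` key. The sketch below shows one way to collect such spans by label. The helper name `group_entities`, the sample predictions, and the label names `PER` and `LOC` are illustrative assumptions, not output from this model or labels confirmed for FTB-NER.

```python
from collections import defaultdict


def group_entities(predictions):
    """Collect predicted entity strings by label.

    `predictions` is assumed to look like the output of a token-classification
    pipeline with aggregation enabled: a list of dicts that carry at least
    the keys "entity_group" and "word".
    """
    grouped = defaultdict(list)
    for pred in predictions:
        grouped[pred["entity_group"]].append(pred["word"])
    return dict(grouped)


# Hand-written stand-in for aggregated pipeline output (hypothetical labels
# and scores, NOT real predictions from the model).
sample = [
    {"entity_group": "PER", "word": "Marie Curie", "score": 0.99, "start": 0, "end": 11},
    {"entity_group": "LOC", "word": "Paris", "score": 0.98, "start": 22, "end": 27},
    {"entity_group": "LOC", "word": "Varsovie", "score": 0.97, "start": 40, "end": 48},
]

print(group_entities(sample))

# With the real model (requires downloading the weights), usage would be:
# classifier = pipeline(
#     "token-classification",
#     model="almanach/camembertv2-base-ftb-ner",
#     aggregation_strategy="simple",
# )
# print(group_entities(classifier("Votre texte ici")))
```

The actual label set is defined by the model configuration and can be inspected through `model.config.id2label` once the model is loaded.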
## Training Details
### Training Data
The model is trained on the FTB-NER dataset.
- Dataset Name: FTB-NER
- Dataset Size:
  - Train: 9881
  - Dev: 1235
  - Test: 1235
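As a quick check on the split sizes listed above, the following sketch computes each split's share of the full dataset. It uses only the counts from this card; the variable names are illustrative.

```python
# Split sizes as reported in this model card.
splits = {"train": 9881, "dev": 1235, "test": 1235}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, share in shares.items():
    # Prints each split's count and its percentage of the whole dataset.
    print(f"{name}: {splits[name]} ({share:.1%})")
```

The counts correspond to roughly an 80/10/10 train/dev/test split.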
### Training Procedure
The model was trained with the `run_ner.py` script from the Hugging Face Transformers repository.
#### Training Hyperparameters
```yml
accelerator_config: '{''split_batches'': False, ''dispatch_batches'': None, ''even_batches'':
  True, ''use_seedable_sampler'': True, ''non_blocking'': False, ''gradient_accumulation_kwargs'':
  None}'
adafactor: false
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1.0e-08
auto_find_batch_size: false
base_model: camembertv2
base_model_name: camembertv2-base-bf16-p2-17000
batch_eval_metrics: false
bf16: false
bf16_full_eval: false
data_seed: 1337.0
dataloader_drop_last: false
dataloader_num_workers: 0
dataloader_persistent_workers: false
dataloader_pin_memory: true
dataloader_prefetch_factor: .nan
ddp_backend: .nan
ddp_broadcast_buffers: .nan
ddp_bucket_cap_mb: .nan
ddp_find_unused_parameters: .nan
ddp_timeout: 1800
debug: '[]'
deepspeed: .nan
disable_tqdm: false
dispatch_batches: .nan
do_eval: true
do_predict: false
do_train: true
epoch: 8.0
eval_accumulation_steps: 4
eval_accuracy: 0.9937000109565028
eval_delay: 0
eval_do_concat_batches: true
eval_f1: 0.935483870967742
eval_loss: 0.0347304567694664
eval_on_start: false
eval_precision: 0.9362204724409448
eval_recall: 0.934748427672956
eval_runtime: 2.7702
eval_samples: 1235.0
eval_samples_per_second: 445.821
eval_steps: .nan
eval_steps_per_second: 55.953
eval_strategy: epoch
eval_use_gather_object: false
evaluation_strategy: epoch
fp16: false
fp16_backend: auto
fp16_full_eval: false
fp16_opt_level: O1
fsdp: '[]'
fsdp_config: '{''min_num_params'': 0, ''xla'': False, ''xla_fsdp_v2'': False, ''xla_fsdp_grad_ckpt'':
  False}'
fsdp_min_num_params: 0
fsdp_transformer_layer_cls_to_wrap: .nan
full_determinism: false
gradient_accumulation_steps: 2
gradient_checkpointing: false
gradient_checkpointing_kwargs: .nan
greater_is_better: true
group_by_length: false
half_precision_backend: auto
hub_always_push: false
hub_model_id: .nan
hub_private_repo: false
hub_strategy: every_save
hub_token: <HUB_TOKEN>
ignore_data_skip: false
include_inputs_for_metrics: false
include_num_input_tokens_seen: false
include_tokens_per_second: false
jit_mode_eval: false
label_names: .nan
label_smoothing_factor: 0.0
learning_rate: 5.000000000000001e-05
length_column_name: length
load_best_model_at_end: true
local_rank: 0
log_level: debug
log_level_replica: warning
log_on_each_node: true
logging_dir: /scratch/camembertv2/runs/results/ftb_ner/camembertv2-base-bf16-p2-17000/max_seq_length-192-gradient_accumulation_steps-2-precision-fp32-learning_rate-5.000000000000001e-05-epochs-8-lr_scheduler-linear-warmup_steps-0.1/SEED-1337/logs
logging_first_step: false
logging_nan_inf_filter: true
logging_steps: 100
logging_strategy: steps
lr_scheduler_kwargs: '{}'
lr_scheduler_type: linear
max_grad_norm: 1.0
max_steps: -1
metric_for_best_model: f1
mp_parameters: .nan
name: camembertv2/runs/results/ftb_ner/camembertv2-base-bf16-p2-17000/max_seq_length-192-gradient_accumulation_steps-2-precision-fp32-learning_rate-5.000000000000001e-05-epochs-8-lr_scheduler-linear-warmup_steps-0.1
neftune_noise_alpha: .nan
no_cuda: false
num_train_epochs: 8.0
optim: adamw_torch
optim_args: .nan
optim_target_modules: .nan
output_dir: /scratch/camembertv2/runs/results/ftb_ner/camembertv2-base-bf16-p2-17000/max_seq_length-192-gradient_accumulation_steps-2-precision-fp32-learning_rate-5.000000000000001e-05-epochs-8-lr_scheduler-linear-warmup_steps-0.1/SEED-1337
overwrite_output_dir: false
past_index: -1
per_device_eval_batch_size: 8
per_device_train_batch_size: 8
per_gpu_eval_batch_size: .nan
per_gpu_train_batch_size: .nan
prediction_loss_only: false
push_to_hub: false
push_to_hub_model_id: .nan
push_to_hub_organization: .nan
push_to_hub_token: <PUSH_TO_HUB_TOKEN>
ray_scope: last
remove_unused_columns: true
report_to: '[''tensorboard'']'
restore_callback_states_from_checkpoint: false
resume_from_checkpoint: .nan
run_name: /scratch/camembertv2/runs/results/ftb_ner/camembertv2-base-bf16-p2-17000/max_seq_length-192-gradient_accumulation_steps-2-precision-fp32-learning_rate-5.000000000000001e-05-epochs-8-lr_scheduler-linear-warmup_steps-0.1/SEED-1337
save_on_each_node: false
save_only_model: false
save_safetensors: true
save_steps: 500
save_strategy: epoch
save_total_limit: .nan
seed: 1337
skip_memory_metrics: true
split_batches: .nan
tf32: .nan
torch_compile: true
torch_compile_backend: inductor
torch_compile_mode: .nan
torch_empty_cache_steps: .nan
torchdynamo: .nan
total_flos: 2833132740217920.0
tpu_metrics_debug: false
tpu_num_cores: .nan
train_loss: 0.0880794880495777
train_runtime: 679.3683
train_samples: 9881
train_samples_per_second: 116.355
train_steps_per_second: 7.277
use_cpu: false
use_ipex: false
use_legacy_prediction_loop: false
use_mps_device: false
warmup_ratio: 0.1
warmup_steps: 0
weight_decay: 0.0
```
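The hyperparameters above imply an effective batch size, a total number of optimizer steps, and a warmup length that are not listed explicitly. The sketch below derives them and reproduces a linear warmup-then-decay learning-rate schedule, which is what `lr_scheduler_type: linear` with `warmup_ratio: 0.1` usually denotes. It assumes training ran on a single device (`num_devices = 1` is not stated in the configuration); under that assumption the step count agrees with the logged `train_steps_per_second` × `train_runtime`, which is roughly 4,944.

```python
import math

# Values copied from the hyperparameter dump above.
train_samples = 9881
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
num_train_epochs = 8
learning_rate = 5e-5
warmup_ratio = 0.1

# Assumption: a single training device (not stated in the configuration).
num_devices = 1

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
batches_per_epoch = math.ceil(
    train_samples / (per_device_train_batch_size * num_devices)
)
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = updates_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_lr(step):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return learning_rate * remaining / max(1, total_steps - warmup_steps)


print(f"effective batch size: {effective_batch_size}")
print(f"total optimizer steps: {total_steps}")
print(f"warmup steps: {warmup_steps}")
print(f"peak learning rate (step {warmup_steps}): {linear_lr(warmup_steps):.2e}")
```

Under these assumptions the learning rate rises linearly from 0 to 5e-5 over the warmup steps and then decays linearly to 0 at the final step.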
#### Results
**F1-Score:** 0.93548
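The reported F1 score is the harmonic mean of the `eval_precision` and `eval_recall` values logged in the hyperparameter dump. The following sketch recomputes it from those two numbers as a consistency check; the function name `f1_from` is illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the logged evaluation metrics.
eval_precision = 0.9362204724409448
eval_recall = 0.934748427672956

f1 = f1_from(eval_precision, eval_recall)
print(f"F1: {f1:.5f}")
```

The result matches the logged `eval_f1` value to the five decimal places reported in this card.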
## Technical Specifications
### Model Architecture and Objective
RoBERTa-style encoder with a token classification head, fine-tuned for Named Entity Recognition.
## Citation
**BibTeX:**
```bibtex
@misc{antoun2024camembert20smarterfrench,
title={CamemBERT 2.0: A Smarter French Language Model Aged to Perfection},
author={Wissam Antoun and Francis Kulumba and Rian Touchent and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
year={2024},
eprint={2411.08868},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.08868},
}
```