---
language: fr
license: mit
tags:
- roberta
- token-classification
base_model: almanach/camembertv2-base
datasets:
- FTB-NER
metrics:
- f1
pipeline_tag: token-classification
library_name: transformers
model-index:
- name: almanach/camembertv2-base-ftb-ner
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition (NER)
    dataset:
      type: ftb-ner
      name: French Treebank Named Entity Recognition
    metrics:
    - name: f1
      type: f1
      value: 0.93548
      verified: false
---

# Model Card for almanach/camembertv2-base-ftb-ner

almanach/camembertv2-base-ftb-ner is a RoBERTa-style model fine-tuned for token classification. It is trained on the FTB-NER dataset for Named Entity Recognition (NER) and achieves an F1 score of 0.93548 on that dataset. The model is part of the almanach/camembertv2-base family of fine-tuned models.

## Model Details

### Model Description

- **Developed by:** Wissam Antoun (PhD student at Almanach, Inria-Paris)
- **Model type:** roberta
- **Language(s) (NLP):** French
- **License:** MIT
- **Finetuned from model:** almanach/camembertv2-base

### Model Sources

- **Repository:** https://github.com/WissamAntoun/camemberta
- **Paper:** https://arxiv.org/abs/2411.08868

## Uses

The model can be used for Named Entity Recognition (NER) on French text.

## Bias, Risks, and Limitations

The model may reproduce biases present in its training data, and it may not generalize well to other datasets, domains, or annotation schemes. Its coverage is limited to the entity types annotated in FTB-NER.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model = AutoModelForTokenClassification.from_pretrained("almanach/camembertv2-base-ftb-ner")
tokenizer = AutoTokenizer.from_pretrained("almanach/camembertv2-base-ftb-ner")
classifier = pipeline("token-classification", model=model, tokenizer=tokenizer)

# Replace with any French sentence.
classifier("Votre texte ici")
```

## Training Details

### Training Data

The model is trained on the FTB-NER dataset.

- Dataset Name: FTB-NER
- Dataset Size:
  - Train: 9881
  - Dev: 1235
  - Test: 1235

### Training Procedure

The model was fine-tuned with the `run_ner.py` example script from the Hugging Face Transformers repository, with the setup sketched below.
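For reference, the following is a minimal `Trainer`-based sketch of an equivalent setup using the hyperparameters listed in the next subsection (learning rate 5e-5, per-device batch size 8 with gradient accumulation 2, 8 epochs, 10% linear warmup, maximum sequence length 192, seed 1337). The dataset path, column names, and split names are placeholders: FTB-NER is distributed under its own license and is not bundled with this card.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# Placeholder path: load FTB-NER from wherever you have it, with
# "tokens" (list of words) and "ner_tags" (ClassLabel ids) columns.
raw = load_dataset("path/to/ftb_ner")
label_list = raw["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("almanach/camembertv2-base")
model = AutoModelForTokenClassification.from_pretrained(
    "almanach/camembertv2-base", num_labels=len(label_list)
)

def tokenize_and_align(examples):
    # Tokenize pre-split words and align word-level tags to sub-word pieces.
    tokenized = tokenizer(
        examples["tokens"],
        is_split_into_words=True,
        truncation=True,
        max_length=192,
    )
    labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous, ids = None, []
        for word_id in word_ids:
            if word_id is None or word_id == previous:
                ids.append(-100)  # ignore special tokens and trailing sub-word pieces
            else:
                ids.append(tags[word_id])
            previous = word_id
        labels.append(ids)
    tokenized["labels"] = labels
    return tokenized

encoded = raw.map(tokenize_and_align, batched=True)

args = TrainingArguments(
    output_dir="camembertv2-base-ftb-ner",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=8,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    seed=1337,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
trainer.train()
```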
#### Training Hyperparameters

```yml
accelerator_config: '{''split_batches'': False, ''dispatch_batches'': None, ''even_batches'': True, ''use_seedable_sampler'': True, ''non_blocking'': False, ''gradient_accumulation_kwargs'': None}'
adafactor: false
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1.0e-08
auto_find_batch_size: false
base_model: camembertv2
base_model_name: camembertv2-base-bf16-p2-17000
batch_eval_metrics: false
bf16: false
bf16_full_eval: false
data_seed: 1337.0
dataloader_drop_last: false
dataloader_num_workers: 0
dataloader_persistent_workers: false
dataloader_pin_memory: true
dataloader_prefetch_factor: .nan
ddp_backend: .nan
ddp_broadcast_buffers: .nan
ddp_bucket_cap_mb: .nan
ddp_find_unused_parameters: .nan
ddp_timeout: 1800
debug: '[]'
deepspeed: .nan
disable_tqdm: false
dispatch_batches: .nan
do_eval: true
do_predict: false
do_train: true
epoch: 8.0
eval_accumulation_steps: 4
eval_accuracy: 0.9937000109565028
eval_delay: 0
eval_do_concat_batches: true
eval_f1: 0.935483870967742
eval_loss: 0.0347304567694664
eval_on_start: false
eval_precision: 0.9362204724409448
eval_recall: 0.934748427672956
eval_runtime: 2.7702
eval_samples: 1235.0
eval_samples_per_second: 445.821
eval_steps: .nan
eval_steps_per_second: 55.953
eval_strategy: epoch
eval_use_gather_object: false
evaluation_strategy: epoch
fp16: false
fp16_backend: auto
fp16_full_eval: false
fp16_opt_level: O1
fsdp: '[]'
fsdp_config: '{''min_num_params'': 0, ''xla'': False, ''xla_fsdp_v2'': False, ''xla_fsdp_grad_ckpt'': False}'
fsdp_min_num_params: 0
fsdp_transformer_layer_cls_to_wrap: .nan
full_determinism: false
gradient_accumulation_steps: 2
gradient_checkpointing: false
gradient_checkpointing_kwargs: .nan
greater_is_better: true
group_by_length: false
half_precision_backend: auto
hub_always_push: false
hub_model_id: .nan
hub_private_repo: false
hub_strategy: every_save
hub_token:
ignore_data_skip: false
include_inputs_for_metrics: false
include_num_input_tokens_seen: false
include_tokens_per_second: false
jit_mode_eval: false
label_names: .nan
label_smoothing_factor: 0.0
learning_rate: 5.000000000000001e-05
length_column_name: length
load_best_model_at_end: true
local_rank: 0
log_level: debug
log_level_replica: warning
log_on_each_node: true
logging_dir: /scratch/camembertv2/runs/results/ftb_ner/camembertv2-base-bf16-p2-17000/max_seq_length-192-gradient_accumulation_steps-2-precision-fp32-learning_rate-5.000000000000001e-05-epochs-8-lr_scheduler-linear-warmup_steps-0.1/SEED-1337/logs
logging_first_step: false
logging_nan_inf_filter: true
logging_steps: 100
logging_strategy: steps
lr_scheduler_kwargs: '{}'
lr_scheduler_type: linear
max_grad_norm: 1.0
max_steps: -1
metric_for_best_model: f1
mp_parameters: .nan
name: camembertv2/runs/results/ftb_ner/camembertv2-base-bf16-p2-17000/max_seq_length-192-gradient_accumulation_steps-2-precision-fp32-learning_rate-5.000000000000001e-05-epochs-8-lr_scheduler-linear-warmup_steps-0.1
neftune_noise_alpha: .nan
no_cuda: false
num_train_epochs: 8.0
optim: adamw_torch
optim_args: .nan
optim_target_modules: .nan
output_dir: /scratch/camembertv2/runs/results/ftb_ner/camembertv2-base-bf16-p2-17000/max_seq_length-192-gradient_accumulation_steps-2-precision-fp32-learning_rate-5.000000000000001e-05-epochs-8-lr_scheduler-linear-warmup_steps-0.1/SEED-1337
overwrite_output_dir: false
past_index: -1
per_device_eval_batch_size: 8
per_device_train_batch_size: 8
per_gpu_eval_batch_size: .nan
per_gpu_train_batch_size: .nan
prediction_loss_only: false
push_to_hub: false
push_to_hub_model_id: .nan
push_to_hub_organization: .nan
push_to_hub_token:
ray_scope: last
remove_unused_columns: true
report_to: '[''tensorboard'']'
restore_callback_states_from_checkpoint: false
resume_from_checkpoint: .nan
run_name: /scratch/camembertv2/runs/results/ftb_ner/camembertv2-base-bf16-p2-17000/max_seq_length-192-gradient_accumulation_steps-2-precision-fp32-learning_rate-5.000000000000001e-05-epochs-8-lr_scheduler-linear-warmup_steps-0.1/SEED-1337
save_on_each_node: false
save_only_model: false
save_safetensors: true
save_steps: 500
save_strategy: epoch
save_total_limit: .nan
seed: 1337
skip_memory_metrics: true
split_batches: .nan
tf32: .nan
torch_compile: true
torch_compile_backend: inductor
torch_compile_mode: .nan
torch_empty_cache_steps: .nan
torchdynamo: .nan
total_flos: 2833132740217920.0
tpu_metrics_debug: false
tpu_num_cores: .nan
train_loss: 0.0880794880495777
train_runtime: 679.3683
train_samples: 9881
train_samples_per_second: 116.355
train_steps_per_second: 7.277
use_cpu: false
use_ipex: false
use_legacy_prediction_loop: false
use_mps_device: false
warmup_ratio: 0.1
warmup_steps: 0
weight_decay: 0.0
```

#### Results

**F1-Score:** 0.93548

## Technical Specifications

### Model Architecture and Objective

A RoBERTa architecture with a token classification head, fine-tuned for Named Entity Recognition.

## Citation

**BibTeX:**

```bibtex
@misc{antoun2024camembert20smarterfrench,
      title={CamemBERT 2.0: A Smarter French Language Model Aged to Perfection},
      author={Wissam Antoun and Francis Kulumba and Rian Touchent and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
      year={2024},
      eprint={2411.08868},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.08868},
}
```
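As a usage note, the token-classification pipeline can also aggregate sub-word pieces into whole entity spans; a minimal sketch (the example sentence is only illustrative):

```python
from transformers import pipeline

# "simple" aggregation merges sub-word pieces into complete entity spans.
ner = pipeline(
    "token-classification",
    model="almanach/camembertv2-base-ftb-ner",
    aggregation_strategy="simple",
)

print(ner("Emmanuel Macron a rencontré la direction de Renault à Boulogne-Billancourt."))
```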