	---
	language: fr
	license: mit
	tags:
	- roberta
	- token-classification
	base_model: almanach/camembertv2-base
	datasets:
	- Rhapsodie
	metrics:
	- las
	- upos
	model-index:
	- name: almanach/camembertv2-base-rhapsodie
	  results:
	  - task:
	      type: token-classification
	      name: Part-of-Speech Tagging
	    dataset:
	      type: Rhapsodie
	      name: Rhapsodie
	    metrics:
	    - name: upos
	      type: upos
	      value: 0.97556
	      verified: false
	  - task:
	      type: token-classification
	      name: Dependency Parsing
	    dataset:
	      type: Rhapsodie
	      name: Rhapsodie
	    metrics:
	    - name: las
	      type: las
	      value: 0.84497
	      verified: false
	---

	# Model Card for almanach/camembertv2-base-rhapsodie

	almanach/camembertv2-base-rhapsodie is a RoBERTa-style model for token classification. It is trained on the Rhapsodie dataset for Part-of-Speech Tagging and Dependency Parsing.
	The model achieves a UPOS score of 0.97556 and a LAS of 0.84497 on the Rhapsodie dataset.

	The model belongs to the family of models fine-tuned from almanach/camembertv2-base.

	## Model Details

	### Model Description

	- Developed by: Wissam Antoun (PhD student at Almanach, Inria-Paris)
	- Model type: roberta
	- Language(s) (NLP): French
	- License: MIT
	- Finetuned from model: almanach/camembertv2-base

	### Model Sources

	- Repository: https://github.com/WissamAntoun/camemberta
	- Paper: https://arxiv.org/abs/2411.08868

	## Uses

	The model can be used for token classification tasks in French for Part-of-Speech Tagging and Dependency Parsing.
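
As an illustration of what these two tasks produce, the sketch below reads one parsed sentence in CoNLL-U format, the usual interchange format for Universal Dependencies tools, and extracts each token's UPOS tag, head index, and dependency label. It assumes the parser output is available as CoNLL-U text. The sample sentence and its annotations are hand-written for illustration and are not actual model output, and `read_conllu_sentence` is a hypothetical helper, not part of any library.

```python
from typing import NamedTuple


class Token(NamedTuple):
    index: int   # 1-based position in the sentence
    form: str    # surface word form
    upos: str    # universal part-of-speech tag
    head: int    # index of the syntactic head (0 means root)
    deprel: str  # dependency relation to the head


def read_conllu_sentence(block: str) -> list[Token]:
    """Parse one CoNLL-U sentence block into a list of tokens."""
    tokens = []
    for line in block.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        tokens.append(Token(int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7]))
    return tokens


# Hand-written example, NOT actual model output.
SAMPLE = "\n".join([
    "# text = Le chat dort",
    "1\tLe\tle\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tchat\tchat\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "3\tdort\tdormir\tVERB\t_\t_\t0\troot\t_\t_",
])

tokens = read_conllu_sentence(SAMPLE)
for tok in tokens:
    print(tok.index, tok.form, tok.upos, tok.head, tok.deprel)
```

The part-of-speech task fills the UPOS column, while the dependency parsing task fills the HEAD and DEPREL columns.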

	## Bias, Risks, and Limitations

	The model may reflect biases present in its training data. It was fine-tuned only on the Rhapsodie dataset, so it may not generalize well to other datasets, domains, or tasks.


	## How to Get Started with the Model

	You can use the model directly with the [hopsparser](https://github.com/hopsparser/hopsparser) library in server mode; see https://github.com/hopsparser/hopsparser/blob/main/docs/server.md


	## Training Details

	### Training Procedure

	The model was trained with the [hopsparser](https://github.com/hopsparser/hopsparser) library on the Rhapsodie dataset.


	#### Training Hyperparameters

	```yml
	# Layer dimensions
	mlp_input: 1024
	mlp_tag_hidden: 16
	mlp_arc_hidden: 512
	mlp_lab_hidden: 128
	# Lexers
	lexers:
	  - name: word_embeddings
	    type: words
	    embedding_size: 256
	    word_dropout: 0.5
	  - name: char_level_embeddings
	    type: chars_rnn
	    embedding_size: 64
	    lstm_output_size: 128
	  - name: fasttext
	    type: fasttext
	  - name: camembertv2_base_p2_17k_last_layer
	    type: bert
	    model: /scratch/camembertv2/runs/models/camembertv2-base-bf16/post/ckpt-p2-17000/pt/
	    layers: [11]
	    subwords_reduction: "mean"
	# Training hyperparameters
	encoder_dropout: 0.5
	mlp_dropout: 0.5
	batch_size: 8
	epochs: 64
	lr:
	  base: 0.00003
	  schedule:
	    shape: linear
	    warmup_steps: 100
	```
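
The `lr` block above gives a base rate of 0.00003, a `linear` schedule shape, and 100 warmup steps. The sketch below shows one common reading of such a schedule: linear warmup from zero to the base rate, then linear decay to zero. The decay behaviour, the `total_steps` value, and the `linear_schedule` function are assumptions made for illustration; the scheduler hopsparser actually uses may differ.

```python
def linear_schedule(step: int, base_lr: float = 3e-5,
                    warmup_steps: int = 100, total_steps: int = 1000) -> float:
    """Learning rate at `step` under linear warmup followed by linear decay.

    `total_steps` is a hypothetical value; it is not given in the configuration.
    """
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must be larger than warmup_steps")
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: ramp linearly from base_lr down to 0 at total_steps.
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 50, 100, 550, 1000):
    print(step, linear_schedule(step))
```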

	#### Results

	- UPOS: 0.97556
	- LAS: 0.84497
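
For reference, when tokenization is gold, UPOS reduces to the fraction of tokens with the correct tag, and LAS (labeled attachment score) to the fraction of tokens whose predicted head and dependency label both match the gold annotation. The sketch below computes both on a toy example. The data and the `upos_accuracy` and `las` helpers are hypothetical, and the evaluation script used for the scores above may handle details such as label subtypes differently.

```python
# Each token is a (upos, head, deprel) triple.
def upos_accuracy(gold, pred):
    """Fraction of tokens whose predicted UPOS tag matches the gold tag."""
    correct = sum(g[0] == p[0] for g, p in zip(gold, pred))
    return correct / len(gold)


def las(gold, pred):
    """Fraction of tokens whose predicted head AND label both match gold."""
    correct = sum(g[1:] == p[1:] for g, p in zip(gold, pred))
    return correct / len(gold)


# Toy data for illustration only.
gold = [("DET", 2, "det"), ("NOUN", 3, "nsubj"), ("VERB", 0, "root"), ("ADV", 3, "advmod")]
pred = [("DET", 3, "det"), ("NOUN", 3, "obj"), ("VERB", 0, "root"), ("ADJ", 3, "advmod")]

print("UPOS:", upos_accuracy(gold, pred))  # 3 of 4 tags are correct
print("LAS:", las(gold, pred))             # 2 of 4 head+label pairs are correct
```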

	## Technical Specifications

	### Model Architecture and Objective

	Custom RoBERTa-based model for token classification (part-of-speech tagging and dependency parsing).
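
The `subwords_reduction: "mean"` setting in the training configuration suggests that when the tokenizer splits a word into several subwords, their vectors are averaged into a single vector per word. The pure-Python sketch below illustrates that idea, assuming this is what the setting does. The `mean_reduce` helper, the toy vectors, and the word-to-subword alignment are all hypothetical.

```python
def mean_reduce(subword_vectors, word_spans):
    """Average subword vectors over each word's [start, end) span."""
    word_vectors = []
    for start, end in word_spans:
        span = subword_vectors[start:end]
        dim = len(span[0])
        # Component-wise mean over the subwords of this word.
        word_vectors.append([sum(v[d] for v in span) / len(span) for d in range(dim)])
    return word_vectors


# Toy example: three subword vectors; the first word covers two subwords.
subword_vectors = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
word_spans = [(0, 2), (2, 3)]

word_vectors = mean_reduce(subword_vectors, word_spans)
print(word_vectors)  # [[2.0, 1.0], [0.0, 4.0]]
```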

	## Citation

	BibTeX:

	```bibtex
	@misc{antoun2024camembert20smarterfrench,
	  title={CamemBERT 2.0: A Smarter French Language Model Aged to Perfection},
	  author={Wissam Antoun and Francis Kulumba and Rian Touchent and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
	  year={2024},
	  eprint={2411.08868},
	  archivePrefix={arXiv},
	  primaryClass={cs.CL},
	  url={https://arxiv.org/abs/2411.08868},
	}

	@inproceedings{grobol:hal-03223424,
	  title = {Analyse en dépendances du français avec des plongements contextualisés},
	  author = {Grobol, Loïc and Crabbé, Benoît},
	  url = {https://hal.archives-ouvertes.fr/hal-03223424},
	  booktitle = {Actes de la 28ème Conférence sur le Traitement Automatique des Langues Naturelles},
	  eventtitle = {TALN-RÉCITAL 2021},
	  venue = {Lille, France},
	  pdf = {https://hal.archives-ouvertes.fr/hal-03223424/file/HOPS_final.pdf},
	  hal_id = {hal-03223424},
	  hal_version = {v1},
	}
	```