---
language:
- es
license: cc-by-4.0
library_name: span-marker
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
- generated_from_span_marker_trainer
datasets:
- xtreme
metrics:
- precision
- recall
- f1
widget:
- text: Me llamo Álvaro y vivo en Barcelona (España).
- text: Marie Curie fue profesora en la Universidad de Paris.
- text: La Universidad de Salamanca es la universidad en activo más antigua de España.
pipeline_tag: token-classification
base_model: bert-base-multilingual-cased
model-index:
- name: SpanMarker with bert-base-multilingual-cased on xtreme/PAN-X.es
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: xtreme/PAN-X.es
      type: xtreme
      split: eval
    metrics:
    - type: f1
      value: 0.9186626746506986
      name: F1
    - type: precision
      value: 0.9231154938993816
      name: Precision
    - type: recall
      value: 0.9142526071842411
      name: Recall
---

# SpanMarker with bert-base-multilingual-cased on xtreme/PAN-X.es

This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [xtreme/PAN-X.es](https://huggingface.co/datasets/xtreme) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) as the underlying encoder.

## Model Details

### Model Description

- **Model Type:** SpanMarker
- **Encoder:** [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)
- **Maximum Sequence Length:** 512 tokens
- **Maximum Entity Length:** 8 words
- **Training Dataset:** [xtreme/PAN-X.es](https://huggingface.co/datasets/xtreme)
- **Languages:** es
- **License:** cc-by-4.0

### Model Sources

- **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
- **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

### Model Labels

| Label | Examples                                                   |
|:------|:-----------------------------------------------------------|
| LOC   | "Salamanca", "Paris", "Barcelona (España)"                 |
| ORG   | "ONU", "Fútbol Club Barcelona", "Museo Nacional del Prado" |
| PER   | "Fray Luis de León", "Leo Messi", "Álvaro Bartolomé"       |

## Uses

### Direct Use for Inference

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("alvarobartt/bert-base-multilingual-cased-ner-spanish")
# Run inference
entities = model.predict("Marie Curie fue profesora en la Universidad de Paris.")
```
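`model.predict` returns one dictionary per detected entity, containing the span text, its label, a confidence score, and character offsets into the input. The snippet below is a minimal sketch of how the predictions can be inspected; the keys follow the SpanMarker documentation, and no actual model output is shown.

```python
# Each prediction is a dict with the span text, its label, a confidence
# score, and character offsets into the input sentence.
for entity in entities:
    print(
        f"{entity['span']!r} -> {entity['label']} "
        f"(score={entity['score']:.2f}, "
        f"chars {entity['char_start_index']}:{entity['char_end_index']})"
    )
```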
## Training Details

### Training Set Metrics

| Training set          | Min | Median | Max |
|:----------------------|:----|:-------|:----|
| Sentence length       | 3   | 6.4642 | 64  |
| Entities per sentence | 1   | 1.2375 | 24  |

### Training Hyperparameters

- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 4
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 2

### Training Results

| Epoch  | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
|:------:|:----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
| 0.3998 | 1000 | 0.0388          | 0.8761               | 0.8641            | 0.8701        | 0.9223              |
| 0.7997 | 2000 | 0.0326          | 0.8995               | 0.8740            | 0.8866        | 0.9341              |
| 1.1995 | 3000 | 0.0277          | 0.9076               | 0.9019            | 0.9047        | 0.9424              |
| 1.5994 | 4000 | 0.0261          | 0.9143               | 0.9113            | 0.9128        | 0.9473              |
| 1.9992 | 5000 | 0.0234          | 0.9231               | 0.9143            | 0.9187        | 0.9502              |

### Framework Versions

- Python: 3.10.12
- SpanMarker: 1.3.1.dev
- Transformers: 4.33.3
- PyTorch: 2.0.1+cu118
- Datasets: 2.14.5
- Tokenizers: 0.13.3

## Citation

### BibTeX

```
@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}
```
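## Reproducing Training

The original training script is not included in this repository. The following is a minimal sketch of how a comparable model could be trained with the `span_marker` `Trainer`, using the hyperparameters listed under Training Details above. The `output_dir` path and the final `save_model` call are illustrative assumptions, not taken from the original run.

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer
from transformers import TrainingArguments

# PAN-X.es is the Spanish WikiANN subset of the xtreme dataset
dataset = load_dataset("xtreme", "PAN-X.es")
labels = dataset["train"].features["ner_tags"].feature.names  # O, B-PER, I-PER, ...

# Initialize a SpanMarker model on top of the multilingual BERT encoder,
# matching the maximum sequence and entity lengths listed above
model = SpanMarkerModel.from_pretrained(
    "bert-base-multilingual-cased",
    labels=labels,
    model_max_length=512,
    entity_max_length=8,
)

# Hyperparameters as listed above; output_dir is an arbitrary example path
args = TrainingArguments(
    output_dir="models/span-marker-mbert-panx-es",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    num_train_epochs=2,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    seed=42,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("models/span-marker-mbert-panx-es/checkpoint-final")
```

Note that the `transformers` `TrainingArguments` defaults already match the listed optimizer settings (Adam with betas=(0.9,0.999) and epsilon=1e-08) and the linear learning-rate scheduler.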