---
language:
- en
- multilingual
license: cc-by-sa-4.0
library_name: span-marker
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
- generated_from_span_marker_trainer
datasets:
- DFKI-SLT/few-nerd
metrics:
- precision
- recall
- f1
widget:
- text: The WPC led the international peace movement in the decade after the Second
    World War, but its failure to speak out against the Soviet suppression of the
    1956 Hungarian uprising and the resumption of Soviet nuclear tests in 1961 marginalised
    it, and in the 1960s it was eclipsed by the newer, non-aligned peace organizations
    like the Campaign for Nuclear Disarmament.
- text: Most of the Steven Seagal movie "Under Siege "(co-starring Tommy Lee Jones)
    was filmed on the, which is docked on Mobile Bay at Battleship Memorial Park and
    open to the public.
- text: 'The Central African CFA franc (French: "franc CFA "or simply "franc ", ISO
    4217 code: XAF) is the currency of six independent states in Central Africa: Cameroon,
    Central African Republic, Chad, Republic of the Congo, Equatorial Guinea and Gabon.'
- text: Brenner conducted post-doctoral research at Brandeis University with Gregory
    Petsko and then took his first academic position at Thomas Jefferson University
    in 1996, moving to Dartmouth Medical School in 2003, where he served as Associate
    Director for Basic Sciences at Norris Cotton Cancer Center.
- text: On Friday, October 27, 2017, the Senate of Spain (Senado) voted 214 to 47
    to invoke Article 155 of the Spanish Constitution over Catalonia after the Catalan
    Parliament declared the independence.
pipeline_tag: token-classification
co2_eq_emissions:
  emissions: 452.84872035276965
  source: codecarbon
  training_type: fine-tuning
  on_cloud: false
  cpu_model: 13th Gen Intel(R) Core(TM) i7-13700K
  ram_total_size: 31.777088165283203
  hours_used: 3.118
  hardware_used: 1 x NVIDIA GeForce RTX 3090
base_model: xlm-roberta-base
model-index:
- name: SpanMarker with xlm-roberta-base on FewNERD
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: FewNERD
      type: DFKI-SLT/few-nerd
      split: test
    metrics:
    - type: f1
      value: 0.6884821229658107
      name: F1
    - type: precision
      value: 0.6890426017339362
      name: Precision
    - type: recall
      value: 0.6879225552622042
      name: Recall
---


# SpanMarker with xlm-roberta-base on FewNERD

This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [FewNERD](https://huggingface.co/datasets/DFKI-SLT/few-nerd) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) as the underlying encoder.

## Model Details

### Model Description
- **Model Type:** SpanMarker
- **Encoder:** [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)
- **Maximum Sequence Length:** 256 tokens
- **Maximum Entity Length:** 8 words
- **Training Dataset:** [FewNERD](https://huggingface.co/datasets/DFKI-SLT/few-nerd)
- **Languages:** en, multilingual
- **License:** cc-by-sa-4.0
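
To make the **Maximum Entity Length** setting concrete: a span-based tagger scores candidate spans rather than individual tokens, so an 8-word cap bounds how many candidates a sentence can yield. The sketch below is an illustration only. It assumes the candidates are all contiguous word spans of length 1 to 8, and `enumerate_spans` is a hypothetical helper, not part of the SpanMarker API.

```python
def enumerate_spans(words, max_entity_length=8):
    """Return (start, end) word-index pairs, end exclusive, for every
    contiguous span of at most `max_entity_length` words."""
    spans = []
    for start in range(len(words)):
        # A span may not run past the sentence or exceed the length cap.
        last_end = min(start + max_entity_length, len(words))
        for end in range(start + 1, last_end + 1):
            spans.append((start, end))
    return spans


sentence = "Brenner conducted post-doctoral research at Brandeis University".split()
candidates = enumerate_spans(sentence)
print(len(sentence), "words ->", len(candidates), "candidate spans")
```

Under this assumption, an entity mention longer than eight words can never be predicted as a single span, which is worth checking against your own data.
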

### Model Sources

- **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
- **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

### Model Labels
| Label | Examples |
|:-----------------------------------------|:---------------------------------------------------------------------------------------------------------|
| art-broadcastprogram | "The Gale Storm Show : Oh , Susanna", "Corazones", "Street Cents" |
| art-film | "L'Atlantide", "Shawshank Redemption", "Bosch" |
| art-music | "Hollywood Studio Symphony", "Atkinson , Danko and Ford ( with Brockie and Hilton )", "Champion Lover" |
| art-other | "Venus de Milo", "Aphrodite of Milos", "The Today Show" |
| art-painting | "Cofiwch Dryweryn", "Production/Reproduction", "Touit" |
| art-writtenart | "The Seven Year Itch", "Time", "Imelda de ' Lambertazzi" |
| building-airport | "Newark Liberty International Airport", "Luton Airport", "Sheremetyevo International Airport" |
| building-hospital | "Hokkaido University Hospital", "Yeungnam University Hospital", "Memorial Sloan-Kettering Cancer Center" |
| building-hotel | "Radisson Blu Sea Plaza Hotel", "The Standard Hotel", "Flamingo Hotel" |
| building-library | "British Library", "Berlin State Library", "Bayerische Staatsbibliothek" |
| building-other | "Communiplex", "Henry Ford Museum", "Alpha Recording Studios" |
| building-restaurant | "Fatburger", "Carnegie Deli", "Trumbull" |
| building-sportsfacility | "Boston Garden", "Glenn Warner Soccer Facility", "Sports Center" |
| building-theater | "Pittsburgh Civic Light Opera", "National Paris Opera", "Sanders Theatre" |
| event-attack/battle/war/militaryconflict | "Jurist", "Easter Offensive", "Vietnam War" |
| event-disaster | "1693 Sicily earthquake", "1990s North Korean famine", "the 1912 North Mount Lyell Disaster" |
| event-election | "March 1898 elections", "Elections to the European Parliament", "1982 Mitcham and Morden by-election" |
| event-other | "Eastwood Scoring Stage", "Union for a Popular Movement", "Masaryk Democratic Movement" |
| event-protest | "Russian Revolution", "French Revolution", "Iranian Constitutional Revolution" |
| event-sportsevent | "World Cup", "Stanley Cup", "National Champions" |
| location-GPE | "Mediterranean Basin", "Croatian", "the Republic of Croatia" |
| location-bodiesofwater | "Norfolk coast", "Atatürk Dam Lake", "Arthur Kill" |
| location-island | "Laccadives", "Staten Island", "new Samsat district" |
| location-mountain | "Ruweisat Ridge", "Miteirya Ridge", "Salamander Glacier" |
| location-other | "Victoria line", "Northern City Line", "Cartuther" |
| location-park | "Painted Desert Community Complex Historic District", "Shenandoah National Park", "Gramercy Park" |
| location-road/railway/highway/transit | "Newark-Elizabeth Rail Link", "NJT", "Friern Barnet Road" |
| organization-company | "Church 's Chicken", "Texas Chicken", "Dixy Chicken" |
| organization-education | "MIT", "Belfast Royal Academy and the Ulster College of Physical Education", "Barnard College" |
| organization-government/governmentagency | "Congregazione dei Nobili", "Diet", "Supreme Court" |
| organization-media/newspaper | "TimeOut Melbourne", "Al Jazeera", "Clash" |
| organization-other | "IAEA", "4th Army", "Defence Sector C" |
| organization-politicalparty | "Al Wafa ' Islamic", "Shimpotō", "Kenseitō" |
| organization-religion | "UPCUSA", "Jewish", "Christian" |
| organization-showorganization | "Bochumer Symphoniker", "Mr. Mister", "Lizzy" |
| organization-sportsleague | "First Division", "NHL", "China League One" |
| organization-sportsteam | "Tottenham", "Arsenal", "Luc Alphand Aventures" |
| other-astronomything | "Algol", "Zodiac", "`` Caput Larvae ''" |
| other-award | "Grand Commander of the Order of the Niger", "Order of the Republic of Guinea and Nigeria", "GCON" |
| other-biologything | "Amphiphysin", "BAR", "N-terminal lipid" |
| other-chemicalthing | "carbon dioxide", "sulfur", "uranium" |
| other-currency | "$", "lac crore", "Travancore Rupee" |
| other-disease | "hypothyroidism", "bladder cancer", "French Dysentery Epidemic of 1779" |
| other-educationaldegree | "Master", "Bachelor", "BSc ( Hons ) in physics" |
| other-god | "El", "Fujin", "Raijin" |
| other-language | "Breton-speaking", "Latin", "English" |
| other-law | "United States Freedom Support Act", "Thirty Years ' Peace", "Leahy–Smith America Invents Act ( AIA" |
| other-livingthing | "insects", "patchouli", "monkeys" |
| other-medical | "amitriptyline", "pediatrician", "Pediatrics" |
| person-actor | "Tchéky Karyo", "Edmund Payne", "Ellaline Terriss" |
| person-artist/author | "George Axelrod", "Hicks", "Gaetano Donizett" |
| person-athlete | "Jaguar", "Neville", "Tozawa" |
| person-director | "Richard Quine", "Frank Darabont", "Bob Swaim" |
| person-other | "Campbell", "Richard Benson", "Holden" |
| person-politician | "Rivière", "Emeric", "William" |
| person-scholar | "Stedman", "Wurdack", "Stalmine" |
| person-soldier | "Joachim Ziegler", "Krukenberg", "Helmuth Weidling" |
| product-airplane | "EC135T2 CPDS", "Spey-equipped FGR.2s", "Luton" |
| product-car | "Phantom", "Corvettes - GT1 C6R", "100EX" |
| product-food | "V. labrusca", "red grape", "yakiniku" |
| product-game | "Hardcore RPG", "Airforce Delta", "Splinter Cell" |
| product-other | "PDP-1", "Fairbottom Bobs", "X11" |
| product-ship | "Essex", "Congress", "HMS `` Chinkara ''" |
| product-software | "Wikipedia", "Apdf", "AmiPDF" |
| product-train | "55022", "Royal Scots Grey", "High Speed Trains" |
| product-weapon | "AR-15 's", "ZU-23-2MR Wróbel II", "ZU-23-2M Wróbel" |

## Evaluation

### Metrics
| Label | Precision | Recall | F1 |
|:-----------------------------------------|:----------|:-------|:-------|
| **all** | 0.6890 | 0.6879 | 0.6885 |
| art-broadcastprogram | 0.6000 | 0.5771 | 0.5883 |
| art-film | 0.7384 | 0.7453 | 0.7419 |
| art-music | 0.7930 | 0.7221 | 0.7558 |
| art-other | 0.4245 | 0.2900 | 0.3446 |
| art-painting | 0.5476 | 0.4035 | 0.4646 |
| art-writtenart | 0.6400 | 0.6539 | 0.6469 |
| building-airport | 0.8219 | 0.8242 | 0.8230 |
| building-hospital | 0.7024 | 0.8104 | 0.7526 |
| building-hotel | 0.7175 | 0.7283 | 0.7228 |
| building-library | 0.7400 | 0.7296 | 0.7348 |
| building-other | 0.5828 | 0.5910 | 0.5869 |
| building-restaurant | 0.5525 | 0.5216 | 0.5366 |
| building-sportsfacility | 0.6187 | 0.7881 | 0.6932 |
| building-theater | 0.7067 | 0.7626 | 0.7336 |
| event-attack/battle/war/militaryconflict | 0.7544 | 0.7468 | 0.7506 |
| event-disaster | 0.5882 | 0.5314 | 0.5584 |
| event-election | 0.4167 | 0.2198 | 0.2878 |
| event-other | 0.4902 | 0.4042 | 0.4430 |
| event-protest | 0.3643 | 0.2831 | 0.3186 |
| event-sportsevent | 0.6125 | 0.6239 | 0.6182 |
| location-GPE | 0.8102 | 0.8553 | 0.8321 |
| location-bodiesofwater | 0.6888 | 0.7725 | 0.7282 |
| location-island | 0.7285 | 0.6440 | 0.6836 |
| location-mountain | 0.7129 | 0.7327 | 0.7227 |
| location-other | 0.4376 | 0.2560 | 0.3231 |
| location-park | 0.6991 | 0.6900 | 0.6945 |
| location-road/railway/highway/transit | 0.6936 | 0.7259 | 0.7094 |
| organization-company | 0.6921 | 0.6912 | 0.6917 |
| organization-education | 0.7838 | 0.7963 | 0.7900 |
| organization-government/governmentagency | 0.5363 | 0.4394 | 0.4831 |
| organization-media/newspaper | 0.6215 | 0.6705 | 0.6451 |
| organization-other | 0.5766 | 0.5157 | 0.5444 |
| organization-politicalparty | 0.6449 | 0.7324 | 0.6859 |
| organization-religion | 0.5139 | 0.6057 | 0.5560 |
| organization-showorganization | 0.5620 | 0.5657 | 0.5638 |
| organization-sportsleague | 0.6348 | 0.6542 | 0.6443 |
| organization-sportsteam | 0.7138 | 0.7566 | 0.7346 |
| other-astronomything | 0.7418 | 0.7625 | 0.7520 |
| other-award | 0.7291 | 0.6736 | 0.7002 |
| other-biologything | 0.6735 | 0.6275 | 0.6497 |
| other-chemicalthing | 0.6025 | 0.5651 | 0.5832 |
| other-currency | 0.6843 | 0.8411 | 0.7546 |
| other-disease | 0.6284 | 0.7089 | 0.6662 |
| other-educationaldegree | 0.5856 | 0.6033 | 0.5943 |
| other-god | 0.6089 | 0.6913 | 0.6475 |
| other-language | 0.6608 | 0.7968 | 0.7225 |
| other-law | 0.6693 | 0.7246 | 0.6958 |
| other-livingthing | 0.6070 | 0.6014 | 0.6042 |
| other-medical | 0.5062 | 0.5113 | 0.5088 |
| person-actor | 0.8274 | 0.7673 | 0.7962 |
| person-artist/author | 0.6761 | 0.7294 | 0.7018 |
| person-athlete | 0.8132 | 0.8347 | 0.8238 |
| person-director | 0.6750 | 0.6823 | 0.6786 |
| person-other | 0.6472 | 0.6388 | 0.6429 |
| person-politician | 0.6621 | 0.6593 | 0.6607 |
| person-scholar | 0.5181 | 0.5007 | 0.5092 |
| person-soldier | 0.4750 | 0.5131 | 0.4933 |
| product-airplane | 0.6230 | 0.6717 | 0.6464 |
| product-car | 0.7293 | 0.7176 | 0.7234 |
| product-food | 0.5758 | 0.5185 | 0.5457 |
| product-game | 0.7049 | 0.6734 | 0.6888 |
| product-other | 0.5477 | 0.4067 | 0.4668 |
| product-ship | 0.6247 | 0.6395 | 0.6320 |
| product-software | 0.6497 | 0.6760 | 0.6626 |
| product-train | 0.5505 | 0.5732 | 0.5616 |
| product-weapon | 0.6004 | 0.4744 | 0.5300 |
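
Each F1 value in the table is the harmonic mean of the precision and recall in the same row. As a quick check, the sketch below recomputes the overall F1 from the unrounded precision and recall stored in this card's metadata, and one per-label F1 from the rounded table values. `f1_score` is a small helper written for this illustration, not a library function.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Unrounded overall test-set values from this card's metadata.
overall_precision = 0.6890426017339362
overall_recall = 0.6879225552622042
print(f"overall F1: {f1_score(overall_precision, overall_recall):.4f}")

# A label with a wide precision/recall gap, using the rounded table values.
print(f"event-election F1: {f1_score(0.4167, 0.2198):.4f}")
```

Because the harmonic mean is pulled toward the smaller of the two inputs, labels such as `event-election` or `location-other`, where recall lags well behind precision, end up with an F1 much closer to their recall.
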

## Uses

### Direct Use for Inference

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-xlm-roberta-base-fewnerd-fine-super")
# Run inference
entities = model.predict("Most of the Steven Seagal movie \"Under Siege \"(co-starring Tommy Lee Jones) was filmed on the, which is docked on Mobile Bay at Battleship Memorial Park and open to the public.")
```
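
The fine-grained labels follow a `coarse-fine` pattern (for example `person-actor`), so predictions are often easier to consume once grouped by coarse type. The sketch below assumes each prediction is a dictionary with `span`, `label` and `score` keys. The `group_by_coarse_type` helper, the 0.5 threshold and the sample predictions are all hypothetical and shown only to illustrate the post-processing.

```python
from collections import defaultdict


def group_by_coarse_type(entities, min_score=0.5):
    """Group entity spans by the coarse part of a 'coarse-fine' label,
    dropping predictions scored below `min_score`."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] < min_score:
            continue
        # Split on the first hyphen only: fine types may contain "/" or "-".
        coarse, _, fine = entity["label"].partition("-")
        grouped[coarse].append((entity["span"], fine))
    return dict(grouped)


# Hypothetical predictions for illustration; not real model output.
example_entities = [
    {"span": "Steven Seagal", "label": "person-actor", "score": 0.98},
    {"span": "Under Siege", "label": "art-film", "score": 0.95},
    {"span": "Mobile Bay", "label": "location-bodiesofwater", "score": 0.91},
    {"span": "Jones", "label": "person-other", "score": 0.31},
]
print(group_by_coarse_type(example_entities))
```

In practice the list passed in would be the `entities` returned by `model.predict`, and the score threshold should be tuned on held-out data rather than taken from this sketch.
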

### Downstream Use
You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-xlm-roberta-base-fewnerd-fine-super")

# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003") # For example CoNLL2003

# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("tomaarsen/span-marker-xlm-roberta-base-fewnerd-fine-super-finetuned")
```
</details>

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Set Metrics
| Training set | Min | Median | Max |
|:----------------------|:----|:--------|:----|
| Sentence length | 1 | 24.4945 | 267 |
| Entities per sentence | 0 | 2.5832 | 88 |

### Training Hyperparameters
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3
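
A `linear` scheduler with a warmup ratio of 0.1 means the learning rate climbs from 0 to the peak of 1e-05 over the first 10% of training steps and then falls linearly back to 0. The function below is a minimal pure-Python sketch of that shape, not the trainer's own implementation. The name `linear_schedule_lr`, the rounding of the warmup length with `math.ceil`, and the 30,000-step total are assumptions made for illustration.

```python
import math


def linear_schedule_lr(step, total_steps, base_lr=1e-5, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total = 30_000  # illustrative round number, not the exact training length
for step in (0, 1_500, 3_000, 16_500, 30_000):
    print(step, linear_schedule_lr(step, total))
```
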

### Training Results
| Epoch | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
|:------:|:-----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
| 0.2947 | 3000 | 0.0318 | 0.6058 | 0.5990 | 0.6024 | 0.9020 |
| 0.5893 | 6000 | 0.0266 | 0.6556 | 0.6679 | 0.6617 | 0.9173 |
| 0.8840 | 9000 | 0.0250 | 0.6691 | 0.6804 | 0.6747 | 0.9206 |
| 1.1787 | 12000 | 0.0239 | 0.6865 | 0.6761 | 0.6813 | 0.9212 |
| 1.4733 | 15000 | 0.0234 | 0.6872 | 0.6812 | 0.6842 | 0.9226 |
| 1.7680 | 18000 | 0.0231 | 0.6919 | 0.6821 | 0.6870 | 0.9227 |
| 2.0627 | 21000 | 0.0231 | 0.6909 | 0.6871 | 0.6890 | 0.9233 |
| 2.3573 | 24000 | 0.0231 | 0.6903 | 0.6875 | 0.6889 | 0.9238 |
| 2.6520 | 27000 | 0.0229 | 0.6918 | 0.6926 | 0.6922 | 0.9242 |
| 2.9467 | 30000 | 0.0228 | 0.6927 | 0.6930 | 0.6928 | 0.9243 |
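
The Epoch and Step columns together imply how many optimizer steps make up one epoch, which the card does not state directly. The short calculation below derives a rough estimate from the table rows. It is approximate, since the epoch values are rounded to four decimals, and the variable names are chosen only for this illustration.

```python
# (epoch, step) pairs copied from the Training Results table.
checkpoints = [
    (0.2947, 3000), (0.5893, 6000), (0.8840, 9000), (1.1787, 12000),
    (1.4733, 15000), (1.7680, 18000), (2.0627, 21000), (2.3573, 24000),
    (2.6520, 27000), (2.9467, 30000),
]

# Each row gives an independent estimate of optimizer steps per epoch.
estimates = [step / epoch for epoch, step in checkpoints]
steps_per_epoch = sum(estimates) / len(estimates)
num_epochs = 3  # from the hyperparameters listed in this card

print(f"steps per epoch ~ {steps_per_epoch:.0f}")
print(f"total steps over {num_epochs} epochs ~ {steps_per_epoch * num_epochs:.0f}")
```

The estimate is consistent with the last evaluation landing at step 30000, shortly before the end of the third epoch.
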

### Environmental Impact
Carbon emissions were measured using [CodeCarbon](https://github.com/mlco2/codecarbon).
- **Carbon Emitted**: 0.453 kg of CO2
- **Hours Used**: 3.118 hours

### Training Hardware
- **On Cloud**: No
- **GPU Model**: 1 x NVIDIA GeForce RTX 3090
- **CPU Model**: 13th Gen Intel(R) Core(TM) i7-13700K
- **RAM Size**: 31.78 GB

### Framework Versions
- Python: 3.9.16
- SpanMarker: 1.4.1.dev
- Transformers: 4.30.0
- PyTorch: 2.0.1+cu118
- Datasets: 2.14.0
- Tokenizers: 0.13.2

## Citation

### BibTeX
```
@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}
```

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->