metadata
language:
- tl
license: gpl-3.0
library_name: span-marker
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
- generated_from_span_marker_trainer
datasets:
- ljvmiranda921/tlunified-ner
metrics:
- precision
- recall
- f1
widget:
- text: >-
MANILA - Binalewala ng Philippine National Police (PNP) nitong Sabado ang
posibleng paglulunsad ng tinatawag na " sympathy attacks " ng Moro
National Liberation Front (MNLF) at Abu Sayyaf matapos arestuhin si
Indanan, Sulu Mayor Alvarez Isnaji.
- text: >-
Pinatawan din ng apat na buwang suspensyon si Herma Gonzales - Escudero,
chief revenue officer III ng BIR - Cotabato City, dahil sa kasong
dishonesty at limang kaso ng perjury sa Municipal Trial Court ng Cotabato
City . Bunga ito ng kanyang kabiguan na ideklara sa kanyang SALN noong
2002 - 2004 ang 200 metro kwadradong lote sa South Cotabato at Toyota Revo
noong 2001 SALN at undervaluation ng kanyang mga ari - arian sa lalawigan
noong 2000 - 2004 SALN.
- text: >-
Sa tila pagpapabaya sa mga magsasaka, sinabi ni Escudero na hindi
mangyayari ang pangarap ng Department of Agriculture (DA) na maging self -
sufficient ang Pilipinas sa bigas.
- text: >-
MANILA - Tiniyak ng pinuno ng Government Service Insurance System (GSIS)
na tatapatan nito ang pro - Meralco advertisement ni Judy Ann Santos upang
isulong ang kanyang posisyon na dapat ibaba ang singil sa kuryente.
- text: >-
Idinagdag ni South Cotabato Rep Darlene Antonino - Custodio, na illegal na
ipagpaliban ang halalan sa ARMM kung ang gagamitin lamang basehan ay ang
ipapasang panukala ng Kongreso.
pipeline_tag: token-classification
co2_eq_emissions:
emissions: 22.090476722294312
source: codecarbon
training_type: fine-tuning
on_cloud: false
cpu_model: 13th Gen Intel(R) Core(TM) i7-13700K
ram_total_size: 31.777088165283203
hours_used: 0.238
hardware_used: 1 x NVIDIA GeForce RTX 3090
base_model: bert-base-multilingual-cased
model-index:
- name: SpanMarker with bert-base-multilingual-cased on TLUnified
results:
- task:
type: token-classification
name: Named Entity Recognition
dataset:
name: TLUnified
type: ljvmiranda921/tlunified-ner
split: test
metrics:
- type: f1
value: 0.8886810102899907
name: F1
- type: precision
value: 0.8736971183323115
name: Precision
- type: recall
value: 0.9041878172588832
name: Recall
SpanMarker with bert-base-multilingual-cased on TLUnified
This is a SpanMarker model for Named Entity Recognition in Tagalog, trained on the TLUnified dataset. It uses bert-base-multilingual-cased as the underlying encoder.
Model Details
Model Description
- Model Type: SpanMarker
- Encoder: bert-base-multilingual-cased
- Maximum Sequence Length: 256 tokens
- Maximum Entity Length: 8 words
- Training Dataset: TLUnified
- Language: tl
- License: gpl-3.0
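To give a feel for what the 8-word maximum entity length means in practice, here is a minimal sketch that enumerates every candidate word span up to that length. It only illustrates the length cap; it is not SpanMarker's own implementation, and the helper name `enumerate_spans` is hypothetical.

```python
def enumerate_spans(words, max_entity_length=8):
    """Return (start, end) word-index pairs, end exclusive, for every
    span of at most `max_entity_length` words.

    Illustrative helper only, not part of the span_marker API.
    """
    spans = []
    for start in range(len(words)):
        for length in range(1, max_entity_length + 1):
            end = start + length
            if end > len(words):
                break  # span would run past the end of the sentence
            spans.append((start, end))
    return spans


words = "Idinagdag ni South Cotabato Rep Darlene Antonino".split()
for start, end in enumerate_spans(words, max_entity_length=2):
    print(start, end, " ".join(words[start:end]))
```

Under this cap, an entity mention longer than 8 words cannot be covered by a single candidate span, which is worth keeping in mind for very long organization names.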
Model Sources
- Repository: SpanMarker on GitHub
- Thesis: SpanMarker For Named Entity Recognition
Model Labels
Label | Examples |
---|---|
LOC | "Israel", "Batasan", "United States" |
ORG | "MMDA", "International Monitoring Team", "Coordinating Committees for the Cessation of Hostilities" |
PER | "Puno", "Fernando", "Villavicencio" |
Evaluation
Metrics
Label | Precision | Recall | F1 |
---|---|---|---|
all | 0.8737 | 0.9042 | 0.8887 |
LOC | 0.8830 | 0.9084 | 0.8955 |
ORG | 0.7579 | 0.8587 | 0.8052 |
PER | 0.9264 | 0.9220 | 0.9242 |
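The F1 column is the harmonic mean of precision and recall. The short sketch below recomputes it from the rounded precision and recall values in the table, so small differences in the last digit are possible. The function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above
reported = {
    "all": (0.8737, 0.9042, 0.8887),
    "LOC": (0.8830, 0.9084, 0.8955),
    "ORG": (0.7579, 0.8587, 0.8052),
    "PER": (0.9264, 0.9220, 0.9242),
}

for label, (precision, recall, f1) in reported.items():
    computed = f1_score(precision, recall)
    print(f"{label}: computed F1 = {computed:.4f}, reported F1 = {f1:.4f}")
```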
Uses
Direct Use for Inference
from span_marker import SpanMarkerModel
# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-mbert-base-tlunified")
# Run inference
entities = model.predict("Idinagdag ni South Cotabato Rep Darlene Antonino - Custodio, na illegal na ipagpaliban ang halalan sa ARMM kung ang gagamitin lamang basehan ay ang ipapasang panukala ng Kongreso.")
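The sketch below shows one way to post-process the result. It assumes that `predict` on a single string returns a list of dictionaries with `span`, `label`, `score`, `char_start_index` and `char_end_index` keys; check this against your installed SpanMarker version. The records here are hand-written placeholders so the example runs without downloading the model. They are not real model output, and the scores are made up.

```python
from collections import defaultdict


def group_by_label(entities, min_score=0.0):
    """Group entity texts by label, dropping low-confidence predictions."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


def make_record(text, span, label, score):
    """Build a placeholder record in the assumed output format."""
    start = text.index(span)
    return {
        "span": span,
        "label": label,
        "score": score,
        "char_start_index": start,
        "char_end_index": start + len(span),
    }


text = "Idinagdag ni South Cotabato Rep Darlene Antonino - Custodio"
# Placeholder predictions (NOT real model output)
sample_entities = [
    make_record(text, "South Cotabato", "LOC", 0.98),
    make_record(text, "Darlene Antonino - Custodio", "PER", 0.95),
]

print(group_by_label(sample_entities, min_score=0.5))
```

With the real model, you would pass the `entities` list from the inference snippet to `group_by_label` instead of `sample_entities`.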
Downstream Use
You can fine-tune this model on your own dataset.
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-mbert-base-tlunified")
# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003")  # For example CoNLL2003
# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("tomaarsen/span-marker-mbert-base-tlunified-finetuned")
Training Details
Training Set Metrics
Training set | Min | Median | Max |
---|---|---|---|
Sentence length | 1 | 31.7625 | 150 |
Entities per sentence | 0 | 2.0661 | 38 |
Training Hyperparameters
- learning_rate: 5e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3
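To illustrate what a linear scheduler with a 0.1 warmup ratio does to the learning rate, here is a small pure-Python sketch. It follows the usual shape of such a schedule (linear ramp from 0 to the peak rate, then linear decay to 0) and is not the training code itself. The total of 1764 steps is an estimate inferred from the Training Results table (step 400 at epoch 0.6803 suggests about 588 steps per epoch, times 3 epochs), and the function name is hypothetical.

```python
import math


def linear_lr_with_warmup(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at the final step
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 1764  # assumption: ~588 steps per epoch x 3 epochs

for step in (0, 100, 177, 400, 800, 1200, 1600, TOTAL_STEPS):
    lr = linear_lr_with_warmup(step, TOTAL_STEPS)
    print(f"step {step:>4}: lr = {lr:.2e}")
```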
Training Results
Epoch | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
---|---|---|---|---|---|---|
0.6803 | 400 | 0.0074 | 0.8552 | 0.8835 | 0.8691 | 0.9774 |
1.3605 | 800 | 0.0072 | 0.8709 | 0.9034 | 0.8869 | 0.9798 |
2.0408 | 1200 | 0.0070 | 0.8753 | 0.9053 | 0.8900 | 0.9812 |
2.7211 | 1600 | 0.0065 | 0.8876 | 0.9003 | 0.8939 | 0.9807 |
Environmental Impact
Carbon emissions were measured using CodeCarbon.
- Carbon Emitted: 0.022 kg of CO2
- Hours Used: 0.238 hours
Training Hardware
- On Cloud: No
- GPU Model: 1 x NVIDIA GeForce RTX 3090
- CPU Model: 13th Gen Intel(R) Core(TM) i7-13700K
- RAM Size: 31.78 GB
Framework Versions
- Python: 3.9.16
- SpanMarker: 1.5.1.dev
- Transformers: 4.30.0
- PyTorch: 2.0.1+cu118
- Datasets: 2.14.0
- Tokenizers: 0.13.3
Citation
BibTeX
@software{Aarsen_SpanMarker,
author = {Aarsen, Tom},
license = {Apache-2.0},
title = {{SpanMarker for Named Entity Recognition}},
url = {https://github.com/tomaarsen/SpanMarkerNER}
}