---
license: cc-by-4.0
library_name: span-marker
base_model: gwlms/bert-base-dewiki-v1
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
pipeline_tag: token-classification
widget:
- text: "Jürgen Schmidhuber studierte ab 1983 Informatik und Mathematik an der TU München ."
  example_title: "Wikipedia"
datasets:
- gwlms/germeval2014
language:
- de
model-index:
- name: SpanMarker with GWLMS BERT on GermEval 2014 NER Dataset by Stefan Schweter (@stefan-it)
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      type: gwlms/germeval2014
      name: GermEval 2014
      split: test
      revision: f3647c56803ce67c08ee8d15f4611054c377b226
    metrics:
    - type: f1
      value: 0.8745
      name: F1
metrics:
- f1
---
# SpanMarker for GermEval 2014 NER
This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that
was fine-tuned on the [GermEval 2014 NER Dataset](https://sites.google.com/site/germeval2014ner/home).
The GermEval 2014 NER Shared Task builds on a dataset with German named entity annotation that has the following
properties: the data was sampled from German Wikipedia and news corpora as a collection of citations; it covers
over 31,000 sentences, corresponding to over 590,000 tokens; and the NER annotation follows the NoSta-D guidelines,
which extend the Tübingen Treebank guidelines, using four main NER categories with sub-structure and annotating
nested NEs (embeddings), such as `[ORG FC Kickers [LOC Darmstadt]]`.
12 classes of named entities are annotated and must be recognized: the four main classes `PER`son, `LOC`ation,
`ORG`anisation, and `OTH`er, plus their subclasses, which are formed with two fine-grained labels: `-deriv` marks
derivations from NEs, such as "englisch" ("English"), and `-part` marks compounds that include an NE as a
subsequence, such as "deutschlandweit" ("Germany-wide").
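
As a rough illustration of the label scheme described above, the 12 classes can be enumerated as four main classes times three variants, and the nested example can be written as overlapping token spans. This is a minimal sketch: the exact label strings (here the suffix is concatenated directly, as in `LOCderiv`) and the `(start, end, label)` span representation are assumptions made for illustration, not something this model card specifies.

```python
# Four main classes named in the task description.
MAIN_CLASSES = ["PER", "LOC", "ORG", "OTH"]

# "" is the plain class; "deriv" and "part" are the two fine-grained variants.
# Assumption: the suffix is concatenated directly (e.g. "LOCderiv").
VARIANTS = ["", "deriv", "part"]

LABELS = [main + variant for main in MAIN_CLASSES for variant in VARIANTS]
print(len(LABELS), LABELS)

# The nested annotation [ORG FC Kickers [LOC Darmstadt]] as token spans,
# written as (start_token, end_token_exclusive, label). Illustrative only.
tokens = ["FC", "Kickers", "Darmstadt"]
nested_spans = [(0, 3, "ORG"), (2, 3, "LOC")]

for start, end, label in nested_spans:
    print(label, " ".join(tokens[start:end]))
```

The inner `LOC` span lies entirely inside the outer `ORG` span, which is what distinguishes this annotation from flat NER tagging.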
# Fine-Tuning
We use the same hyper-parameters as the
["German's Next Language Model"](https://aclanthology.org/2020.coling-main.598/) paper, with the
[GWLMS BERT](https://huggingface.co/gwlms/bert-base-dewiki-v1) model as the backbone.
Evaluation is performed with SpanMarker's internal evaluation code, which uses `seqeval`.
We fine-tune 5 models and upload the one with the best F1 score on the development set. Results on the development
set are in brackets, followed by the results on the test set:
| Model      | Run 1           | Run 2           | Run 3           | Run 4           | Run 5               | Avg.            |
| ---------- | --------------- | --------------- | --------------- | --------------- | ------------------- | --------------- |
| GWLMS BERT | (87.27) / 87.28 | (87.20) / 87.42 | (88.05) / 87.68 | (88.25) / 87.59 | (**88.47**) / 87.45 | (87.85) / 87.48 |
The best model achieves a final test score of 87.45%.
Scripts for [training](trainer.py) and [evaluation](evaluator.py) are also available.
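
The selection rule described above (pick the run with the best development F1, then report that run's test F1) can be sketched with the numbers from the table. This is only an illustration of the arithmetic; the `runs` dictionary is a hypothetical structure and is not taken from the training or evaluation scripts.

```python
# (development F1, test F1) per run, copied from the table above.
runs = {
    1: (87.27, 87.28),
    2: (87.20, 87.42),
    3: (88.05, 87.68),
    4: (88.25, 87.59),
    5: (88.47, 87.45),
}

# Select the run by development F1 only; the test score is never used for selection.
best_run = max(runs, key=lambda run: runs[run][0])
best_dev, best_test = runs[best_run]

# Averages over all five runs, rounded to two decimals as in the table.
avg_dev = round(sum(dev for dev, _ in runs.values()) / len(runs), 2)
avg_test = round(sum(test for _, test in runs.values()) / len(runs), 2)

print(f"Best run: {best_run} (dev {best_dev}, test {best_test})")
print(f"Average: dev {avg_dev}, test {avg_test}")
```

Note that the selected run is not the one with the highest test score, which is expected when the choice is made on the development set alone.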
# Usage
The fine-tuned model can be used as follows:
```python
from span_marker import SpanMarkerModel
# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("stefan-it/span-marker-bert-germeval14")
# Run inference
entities = model.predict("Jürgen Schmidhuber studierte ab 1983 Informatik und Mathematik an der TU München .")
```
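
To the best of our knowledge, `predict` returns a list of dictionaries with the keys `span`, `label`, `score`, `char_start_index` and `char_end_index`; treat that shape as an assumption and check it against the SpanMarker version you have installed. The sketch below groups such entities by label. So that it runs without downloading the model, it uses hand-written placeholder entities (the labels and scores are made up, not actual model output), and `make_entity` and `group_by_label` are hypothetical helpers.

```python
from collections import defaultdict

text = "Jürgen Schmidhuber studierte ab 1983 Informatik und Mathematik an der TU München ."


def make_entity(span, label, score):
    """Build a dict in the shape assumed for SpanMarker predictions."""
    start = text.index(span)
    return {
        "span": span,
        "label": label,
        "score": score,
        "char_start_index": start,
        "char_end_index": start + len(span),
    }


# Placeholder predictions for illustration only; NOT real model output.
entities = [
    make_entity("Jürgen Schmidhuber", "PER", 0.99),
    make_entity("TU München", "ORG", 0.95),
]


def group_by_label(entities, min_score=0.5):
    """Group entity texts by label, dropping low-confidence predictions."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


print(group_by_label(entities))
```

With a real model, `entities` would instead be the return value of `model.predict(text)` from the snippet above.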