	---
	library_name: span-marker
	tags:
	- span-marker
	- token-classification
	- ner
	- named-entity-recognition
	- generated_from_span_marker_trainer
	datasets:
	- SpeedOfMagic/ontonotes_english
	metrics:
	- precision
	- recall
	- f1
	widget:
	- text: Late Friday night, the Senate voted 87 - 7 to approve an estimated $13.5 billion
	    measure that had been stripped of hundreds of provisions that would have widened,
	    rather than narrowed, the federal budget deficit.
	- text: Among classes for which details were available, yields ranged from 8.78%,
	    or 75 basis points over two - year Treasury securities, to 10.05%, or 200 basis
	    points over 10 - year Treasurys.
	- text: According to statistics, in the past five years, Tianjin Bonded Area has attracted
	    a total of over 3000 enterprises from 73 countries and regions all over the world
	    and 25 domestic provinces, cities and municipalities to invest, reaching a total
	    agreed investment value of more than 3 billion US dollars and a total agreed foreign
	    investment reaching more than 2 billion US dollars.
	- text: But Dirk Van Dongen, president of the National Association of Wholesaler -
	    Distributors, said that last month's rise "isn't as bad an omen" as the 0.9% figure
	    suggests.
	- text: Robert White, Canadian Auto Workers union president, used the impending Scarborough
	    shutdown to criticize the U.S. - Canada free trade agreement and its champion,
	    Prime Minister Brian Mulroney.
	pipeline_tag: token-classification
	model-index:
	- name: SpanMarker
	  results:
	  - task:
	      type: token-classification
	      name: Named Entity Recognition
	    dataset:
	      name: Unknown
	      type: SpeedOfMagic/ontonotes_english
	      split: test
	    metrics:
	    - type: f1
	      value: 0.9077127659574469
	      name: F1
	    - type: precision
	      value: 0.9045852107076597
	      name: Precision
	    - type: recall
	      value: 0.9108620229516947
	      name: Recall
	---

	# SpanMarker

	This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [SpeedOfMagic/ontonotes_english](https://huggingface.co/datasets/SpeedOfMagic/ontonotes_english) dataset that can be used for Named Entity Recognition.

	## Model Details

	### Model Description
	- Model Type: SpanMarker
	<!-- - Encoder: [Unknown](https://huggingface.co/unknown) -->
	- Maximum Sequence Length: 256 tokens
	- Maximum Entity Length: 8 words
	- Training Dataset: [SpeedOfMagic/ontonotes_english](https://huggingface.co/datasets/SpeedOfMagic/ontonotes_english)
	<!-- - Language: Unknown -->
	<!-- - License: Unknown -->
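
To make the *Maximum Entity Length* setting concrete, here is a minimal sketch that enumerates the candidate word spans a span-based tagger would consider for one sentence when spans are capped at 8 words. The helper `enumerate_spans` is hypothetical and not part of the SpanMarker API; it only illustrates how the cap bounds the number of candidates.

```python
from typing import List, Tuple


def enumerate_spans(words: List[str], max_length: int = 8) -> List[Tuple[int, int]]:
    """Return (start, end) word-index pairs (end exclusive) for every span of at most max_length words."""
    spans = []
    for start in range(len(words)):
        # A span may neither run past the sentence nor exceed the length cap.
        for end in range(start + 1, min(start + max_length, len(words)) + 1):
            spans.append((start, end))
    return spans


words = "Prime Minister Brian Mulroney criticized the free trade agreement today".split()
spans = enumerate_spans(words, max_length=8)
print(len(words), "words ->", len(spans), "candidate spans")
```

Under this reading, a mention longer than 8 words is never proposed as a single candidate, so it would not be predicted as one entity.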

	### Model Sources

	- Repository: [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
	- Thesis: [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

	### Model Labels
	| Label       | Examples                                                                                               |
	|:------------|:-------------------------------------------------------------------------------------------------------|
	| CARDINAL    | "tens of thousands", "One point three million", "two"                                                  |
	| DATE        | "Sunday", "a year", "two thousand one"                                                                 |
	| EVENT       | "World War Two", "Katrina", "Hurricane Katrina"                                                        |
	| FAC         | "Route 80", "the White House", "Dylan 's Candy Bars"                                                   |
	| GPE         | "America", "Atlanta", "Miami"                                                                          |
	| LANGUAGE    | "English", "Russian", "Arabic"                                                                         |
	| LAW         | "Roe", "the Patriot Act", "FISA"                                                                       |
	| LOC         | "Asia", "the Gulf Coast", "the West Bank"                                                              |
	| MONEY       | "twenty - seven million dollars", "one hundred billion dollars", "less than fourteen thousand dollars" |
	| NORP        | "American", "Muslim", "Americans"                                                                      |
	| ORDINAL     | "third", "First", "first"                                                                              |
	| ORG         | "Wal - Mart", "Wal - Mart 's", "a Wal - Mart"                                                          |
	| PERCENT     | "seventeen percent", "sixty - seven percent", "a hundred percent"                                      |
	| PERSON      | "Kira Phillips", "Rick Sanchez", "Bob Shapiro"                                                         |
	| PRODUCT     | "Columbia", "Discovery Shuttle", "Discovery"                                                           |
	| QUANTITY    | "forty - five miles", "six thousand feet", "a hundred and seventy pounds"                              |
	| TIME        | "tonight", "evening", "Tonight"                                                                        |
	| WORK_OF_ART | "A Tale of Two Cities", "Newsnight", "Headline News"                                                   |

	## Evaluation

	### Metrics
	| Label       | Precision | Recall | F1     |
	|:------------|:----------|:-------|:-------|
	| all         | 0.9046    | 0.9109 | 0.9077 |
	| CARDINAL    | 0.8579    | 0.8524 | 0.8552 |
	| DATE        | 0.8634    | 0.8893 | 0.8762 |
	| EVENT       | 0.6719    | 0.6935 | 0.6825 |
	| FAC         | 0.7211    | 0.7852 | 0.7518 |
	| GPE         | 0.9725    | 0.9647 | 0.9686 |
	| LANGUAGE    | 0.9286    | 0.5909 | 0.7222 |
	| LAW         | 0.7941    | 0.7297 | 0.7606 |
	| LOC         | 0.7632    | 0.8101 | 0.7859 |
	| MONEY       | 0.8914    | 0.8885 | 0.8900 |
	| NORP        | 0.9311    | 0.9643 | 0.9474 |
	| ORDINAL     | 0.8227    | 0.9282 | 0.8723 |
	| ORG         | 0.9217    | 0.9073 | 0.9145 |
	| PERCENT     | 0.9145    | 0.9198 | 0.9171 |
	| PERSON      | 0.9638    | 0.9643 | 0.9640 |
	| PRODUCT     | 0.6778    | 0.8026 | 0.7349 |
	| QUANTITY    | 0.7850    | 0.8000 | 0.7925 |
	| TIME        | 0.6794    | 0.6730 | 0.6762 |
	| WORK_OF_ART | 0.6562    | 0.6442 | 0.6502 |
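
The F1 column is consistent with the usual definition of F1 as the harmonic mean of precision and recall. The short check below recomputes it for a few rows of the metrics table. The helper `f1_score` is written here for illustration, and because the table values are rounded to four decimals, the recomputed numbers can differ in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the metrics table.
rows = {
    "all": (0.9046, 0.9109, 0.9077),
    "LANGUAGE": (0.9286, 0.5909, 0.7222),
    "ORDINAL": (0.8227, 0.9282, 0.8723),
}

for label, (precision, recall, reported) in rows.items():
    computed = f1_score(precision, recall)
    print(f"{label}: computed F1 = {computed:.4f}, reported F1 = {reported:.4f}")
```

The LANGUAGE row shows why the harmonic mean matters: precision is high, but the low recall pulls F1 well below the arithmetic average of the two.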

	## Uses

	### Direct Use for Inference

	```python
	from span_marker import SpanMarkerModel

	# Download from the 🤗 Hub
	model = SpanMarkerModel.from_pretrained("supreethrao/instructNER_ontonotes5_xl")
	# Run inference
	entities = model.predict("Robert White, Canadian Auto Workers union president, used the impending Scarborough shutdown to criticize the U.S. - Canada free trade agreement and its champion, Prime Minister Brian Mulroney.")
	```
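
The predictions can then be post-processed with plain Python. The sketch below assumes that `model.predict` returns a list of dictionaries with at least `"span"`, `"label"` and `"score"` keys; verify this against the installed SpanMarker version. The helper `group_by_label`, the threshold, and the sample list are hypothetical, and the sample is a hand-written placeholder rather than real model output.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_label(entities: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Group predicted spans by label, dropping predictions scored below min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Hand-written placeholder in the assumed output format, NOT real model output.
example_entities = [
    {"span": "Robert White", "label": "PERSON", "score": 0.99},
    {"span": "Canadian Auto Workers", "label": "ORG", "score": 0.98},
    {"span": "Scarborough", "label": "GPE", "score": 0.40},
    {"span": "Brian Mulroney", "label": "PERSON", "score": 0.99},
]

print(group_by_label(example_entities))
```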

	### Downstream Use
	You can finetune this model on your own dataset.

	<details><summary>Click to expand</summary>

	```python
	from datasets import load_dataset
	from span_marker import SpanMarkerModel, Trainer

	# Download from the 🤗 Hub
	model = SpanMarkerModel.from_pretrained("supreethrao/instructNER_ontonotes5_xl")

	# Specify a Dataset with "tokens" and "ner_tags" columns
	dataset = load_dataset("conll2003")  # For example CoNLL2003

	# Initialize a Trainer using the pretrained model & dataset
	trainer = Trainer(
	    model=model,
	    train_dataset=dataset["train"],
	    eval_dataset=dataset["validation"],
	)
	trainer.train()
	trainer.save_model("supreethrao/instructNER_ontonotes5_xl-finetuned")
	```
	</details>
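
Before fine-tuning on your own data, it can help to check that each row follows the expected layout: one integer tag per token. The sketch below is a minimal check under that assumption. The helper `check_row`, the label list, and the sample row are hypothetical and only illustrate the `"tokens"` / `"ner_tags"` column format.

```python
from typing import Dict


def check_row(row: Dict[str, list], num_labels: int) -> None:
    """Raise ValueError if a row does not match the assumed "tokens"/"ner_tags" layout."""
    tokens, tags = row["tokens"], row["ner_tags"]
    if len(tokens) != len(tags):
        raise ValueError(f"{len(tokens)} tokens but {len(tags)} tags")
    if any(not 0 <= tag < num_labels for tag in tags):
        raise ValueError("tag id outside the label list")


# Hypothetical label list and row, for illustration only.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]
row = {
    "tokens": ["Brian", "Mulroney", "visited", "Ottawa", "."],
    "ner_tags": [1, 2, 0, 0, 0],
}
check_row(row, num_labels=len(labels))  # passes silently for a well-formed row
```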

	## Training Details

	### Training Set Metrics
	| Training set          | Min | Median  | Max |
	|:----------------------|:----|:--------|:----|
	| Sentence length       | 1   | 18.1647 | 210 |
	| Entities per sentence | 0   | 1.3655  | 32  |

	### Training Hyperparameters
	- learning_rate: 5e-05
	- train_batch_size: 16
	- eval_batch_size: 16
	- seed: 42
	- distributed_type: multi-GPU
	- num_devices: 2
	- total_train_batch_size: 32
	- total_eval_batch_size: 32
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_ratio: 0.1
	- num_epochs: 3
	- mixed_precision_training: Native AMP
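
Two of these settings are easier to read with a small calculation: the total batch size is the per-device batch size times the number of devices, and a linear scheduler with a warmup ratio of 0.1 ramps the learning rate up over the first 10% of steps and then decays it linearly to zero. The sketch below is a simplified re-implementation for illustration, not the library's code. `linear_schedule_lr` is a hypothetical helper and `total_steps = 1000` is an assumed value, since the card does not state the number of optimizer steps.

```python
import math


def linear_schedule_lr(step: int, total_steps: int, base_lr: float = 5e-5,
                       warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining


train_batch_size, num_devices = 16, 2
total_train_batch_size = train_batch_size * num_devices  # matches total_train_batch_size: 32

total_steps = 1000  # hypothetical; not stated in the card
for step in (0, 50, 100, 550, 1000):
    print(step, linear_schedule_lr(step, total_steps))
```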

	### Framework Versions
	- Python: 3.10.13
	- SpanMarker: 1.5.0
	- Transformers: 4.35.2
	- PyTorch: 2.1.1
	- Datasets: 2.15.0
	- Tokenizers: 0.15.0

	## Citation

	### BibTeX
	```
	@software{Aarsen_SpanMarker,
	    author = {Aarsen, Tom},
	    license = {Apache-2.0},
	    title = {{SpanMarker for Named Entity Recognition}},
	    url = {https://github.com/tomaarsen/SpanMarkerNER}
	}
	```
