bert-base-spanmarker-STEM-NER / README.md

zhang19991111


	---
	language: en
	license: cc-by-sa-4.0
	library_name: span-marker
	tags:
	- span-marker
	- token-classification
	- ner
	- named-entity-recognition
	- generated_from_span_marker_trainer
	metrics:
	- precision
	- recall
	- f1
	widget:
	- text: Inductively Coupled Plasma - Mass Spectrometry ( ICP - MS ) analysis of Longcliffe
	    SP52 limestone was undertaken to identify other impurities present , and the effect
	    of sorbent mass and SO2 concentration on elemental partitioning in the carbonator
	    between solid sorbent and gaseous phase was investigated , using a bubbler sampling
	    system .
	- text: We extensively evaluate our work against benchmark and competitive protocols
	    across a range of metrics over three real connectivity and GPS traces such as
	    Sassy [ 44 ] , San Francisco Cabs [ 45 ] and Infocom 2006 [ 33 ] .
	- text: In this research , we developed a robust two - layer classifier that can accurately
	    classify normal hearing ( NH ) from hearing impaired ( HI ) infants with congenital
	    sensori - neural hearing loss ( SNHL ) based on their Magnetic Resonance ( MR
	    ) images .
	- text: In situ Peak Force Tapping AFM was employed for determining morphology and
	    nano - mechanical properties of the surface layer .
	- text: By means of a criterion of Gilmer for polynomially dense subsets of the ring
	    of integers of a number field , we show that , if h∈K[X ] maps every element of
	    OK of degree n to an algebraic integer , then h(X ) is integral - valued over
	    OK , that is , h(OK)⊂OK .
	pipeline_tag: token-classification
	base_model: bert-base-uncased
	model-index:
	- name: SpanMarker with bert-base-uncased on my-data
	  results:
	  - task:
	      type: token-classification
	      name: Named Entity Recognition
	    dataset:
	      name: my-data
	      type: unknown
	      split: test
	    metrics:
	    - type: f1
	      value: 0.6547008547008547
	      name: F1
	    - type: precision
	      value: 0.69009009009009
	      name: Precision
	    - type: recall
	      value: 0.6227642276422765
	      name: Recall
	---

	# SpanMarker with bert-base-uncased on my-data

	This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model for Named Entity Recognition that labels spans as Data, Material, Method, or Process. It uses [bert-base-uncased](https://huggingface.co/bert-base-uncased) as the underlying encoder.

	## Model Details

	### Model Description
	- Model Type: SpanMarker
	- Encoder: [bert-base-uncased](https://huggingface.co/bert-base-uncased)
	- Maximum Sequence Length: 256 tokens
	- Maximum Entity Length: 8 words
	<!-- - Training Dataset: [Unknown](https://huggingface.co/datasets/unknown) -->
	- Language: en
	- License: cc-by-sa-4.0
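
The maximum entity length bounds how many candidate spans are scored per sentence. The sketch below illustrates that bound only; it is not SpanMarker's internal code, and the helper name `enumerate_spans` is made up for this example. It assumes that every contiguous run of at most 8 words is a candidate.

```python
from typing import List, Tuple


def enumerate_spans(num_words: int, max_entity_length: int = 8) -> List[Tuple[int, int]]:
    """Return every (start, end) word span, end exclusive, of at most max_entity_length words."""
    spans = []
    for start in range(num_words):
        # A span may not run past the end of the sentence.
        longest = min(max_entity_length, num_words - start)
        for length in range(1, longest + 1):
            spans.append((start, start + length))
    return spans


# A 25-word sentence, close to the typical training sentence length reported in this card.
candidate_spans = enumerate_spans(num_words=25, max_entity_length=8)
print(len(candidate_spans))  # 172
```

Entities longer than 8 words fall outside this candidate set, so under this assumption they cannot be predicted as a single span.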

	### Model Sources

	- Repository: [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
	- Thesis: [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

	### Model Labels
	| Label    | Examples                                                                                                 |
	|:---------|:---------------------------------------------------------------------------------------------------------|
	| Data     | "an overall mitochondrial", "defect", "Depth time - series"                                              |
	| Material | "cross - shore measurement locations", "the subject 's fibroblasts", "COXI , COXII and COXIII subunits" |
	| Method   | "EFSA", "an approximation", "in vitro"                                                                   |
	| Process  | "translation", "intake", "a significant reduction of synthesis"                                          |

	## Evaluation

	### Metrics
	| Label    | Precision | Recall | F1     |
	|:---------|:----------|:-------|:-------|
	| all      | 0.6901    | 0.6228 | 0.6547 |
	| Data     | 0.6136    | 0.5714 | 0.5918 |
	| Material | 0.7926    | 0.7413 | 0.7661 |
	| Method   | 0.4286    | 0.3000 | 0.3529 |
	| Process  | 0.6780    | 0.5854 | 0.6283 |
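
Each F1 value in the table is the harmonic mean of that row's precision and recall. As a sanity check, the sketch below recomputes F1 from the rounded table values. The function name `f1_score` is chosen for this example. Because the inputs are already rounded to four decimals, the recomputed value can differ from the reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the metrics table.
reported = {
    "all": (0.6901, 0.6228, 0.6547),
    "Data": (0.6136, 0.5714, 0.5918),
    "Material": (0.7926, 0.7413, 0.7661),
    "Method": (0.4286, 0.3, 0.3529),
    "Process": (0.6780, 0.5854, 0.6283),
}

for label, (precision, recall, f1) in reported.items():
    print(f"{label:<8} recomputed={f1_score(precision, recall):.4f} reported={f1:.4f}")
```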

	## Uses

	### Direct Use for Inference

	```python
	from span_marker import SpanMarkerModel

	# Download from the 🤗 Hub ("span_marker_model_id" is a placeholder for this model's Hub ID)
	model = SpanMarkerModel.from_pretrained("span_marker_model_id")
	# Run inference on a single sentence; the result is a list of predicted entities
	entities = model.predict("In situ Peak Force Tapping AFM was employed for determining morphology and nano - mechanical properties of the surface layer .")
	print(entities)
	```
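
Predictions for a single sentence are typically post-processed before use. The sketch below groups predicted spans by label and drops low-confidence ones. It assumes each prediction is a dictionary with `span`, `label`, `score`, `char_start_index` and `char_end_index` keys, as described in the SpanMarker documentation. The helper `group_by_label`, the 0.5 threshold, and the sample predictions are all hypothetical and hand-written for illustration; they are not output of this model.

```python
from collections import defaultdict
from typing import Dict, List

SENTENCE = (
    "In situ Peak Force Tapping AFM was employed for determining morphology "
    "and nano - mechanical properties of the surface layer ."
)


def group_by_label(entities: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Collect span texts per label, skipping predictions scored below min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Hand-written stand-in for model.predict(SENTENCE); labels and scores are invented.
example_entities = [
    {"span": "In situ Peak Force Tapping AFM", "label": "Method", "score": 0.91,
     "char_start_index": 0, "char_end_index": 30},
    {"span": "morphology", "label": "Data", "score": 0.42,
     "char_start_index": 60, "char_end_index": 70},
    {"span": "the surface layer", "label": "Material", "score": 0.77,
     "char_start_index": 107, "char_end_index": 124},
]

print(group_by_label(example_entities))
```

With the default threshold, the low-scoring "morphology" entry is filtered out and the other two spans are kept under their labels.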

	### Downstream Use
	You can fine-tune this model on your own dataset.

	<details><summary>Click to expand</summary>

	```python
	from datasets import load_dataset
	from span_marker import SpanMarkerModel, Trainer

	# Download from the 🤗 Hub
	model = SpanMarkerModel.from_pretrained("span_marker_model_id")

	# Specify a Dataset with "tokens" and "ner_tags" columns
	dataset = load_dataset("conll2003")  # For example CoNLL2003

	# Initialize a Trainer using the pretrained model & dataset
	trainer = Trainer(
	    model=model,
	    train_dataset=dataset["train"],
	    eval_dataset=dataset["validation"],
	)
	trainer.train()
	trainer.save_model("span_marker_model_id-finetuned")
	```
	</details>
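
For fine-tuning on your own data, each row needs a `tokens` list and a parallel `ner_tags` list of integer label ids. The sketch below builds one such row in plain Python. It assumes an IOB2 label list over this card's four entity types. The label order, the helper `encode_row`, and the hand-made annotation are illustrative assumptions, not the label mapping stored with this model.

```python
from typing import Dict, List

ENTITY_TYPES = ["Data", "Material", "Method", "Process"]
# Assumed IOB2 label list: "O" first, then a B-/I- pair per entity type.
LABELS = ["O"] + [f"{prefix}-{name}" for name in ENTITY_TYPES for prefix in ("B", "I")]
LABEL2ID = {label: index for index, label in enumerate(LABELS)}


def encode_row(tokens: List[str], tags: List[str]) -> Dict[str, list]:
    """Build one dataset row with word-level tokens and integer tag ids."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return {"tokens": tokens, "ner_tags": [LABEL2ID[tag] for tag in tags]}


# Hypothetical annotation of a shortened example sentence.
row = encode_row(
    ["In", "situ", "Peak", "Force", "Tapping", "AFM", "was", "employed", "."],
    ["B-Method", "I-Method", "I-Method", "I-Method", "I-Method", "I-Method", "O", "O", "O"],
)
print(row["ner_tags"])  # [5, 6, 6, 6, 6, 6, 0, 0, 0]
```

A list of such rows can then be turned into a dataset with a `train` and a `validation` split and passed to the `Trainer`.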

	## Training Details

	### Training Set Metrics
	| Training set          | Min | Median  | Max |
	|:----------------------|:----|:--------|:----|
	| Sentence length       | 3   | 25.6049 | 106 |
	| Entities per sentence | 0   | 5.2439  | 22  |

	### Training Hyperparameters
	- learning_rate: 5e-05
	- train_batch_size: 8
	- eval_batch_size: 8
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_ratio: 0.1
	- num_epochs: 10
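
Together, the `linear` scheduler, the warmup ratio of 0.1 and the base learning rate of 5e-05 describe a ramp up to the peak rate followed by a linear decay to zero. The sketch below reproduces that shape in plain Python, assuming the usual linear-warmup, linear-decay behaviour of the Transformers linear scheduler. The function name and the total step count are assumptions: about 1490 steps is estimated from the training log, where step 300 corresponds to epoch 2.0134, which gives roughly 149 steps per epoch over 10 epochs.

```python
import math


def linear_schedule_lr(step: int, total_steps: int, base_lr: float = 5e-5,
                       warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: scale linearly from base_lr down to 0 at the final step.
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


TOTAL_STEPS = 1490  # estimate, see the assumption stated above

for step in (0, 75, 149, 300, 900, 1490):
    print(f"step {step:>4}: lr = {linear_schedule_lr(step, TOTAL_STEPS):.2e}")
```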

	### Training Results
	| Epoch  | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
	|:------:|:----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
	| 2.0134 | 300  | 0.0557          | 0.6921               | 0.5706            | 0.6255        | 0.7645              |
	| 4.0268 | 600  | 0.0583          | 0.6994               | 0.6527            | 0.6752        | 0.7974              |
	| 6.0403 | 900  | 0.0701          | 0.7085               | 0.6679            | 0.6876        | 0.8039              |
	| 8.0537 | 1200 | 0.0797          | 0.6963               | 0.6870            | 0.6916        | 0.8129              |

	### Framework Versions
	- Python: 3.10.12
	- SpanMarker: 1.5.0
	- Transformers: 4.36.2
	- PyTorch: 2.0.1+cu118
	- Datasets: 2.16.1
	- Tokenizers: 0.15.0

	## Citation

	### BibTeX
	```
	@software{Aarsen_SpanMarker,
	    author = {Aarsen, Tom},
	    license = {Apache-2.0},
	    title = {{SpanMarker for Named Entity Recognition}},
	    url = {https://github.com/tomaarsen/SpanMarkerNER}
	}
	```
