Model Card for congruence-engine/gliner_2.5_textile_industry_historic

This model is a fine-tuned version of gliner-community/gliner_medium-v2.5. It was fine-tuned on a dataset of synthetic data prepared using historic textile industry glossaries, combined with a subset of the Pile-NER dataset.

The model was developed as part of the Congruence Engine project at the Science Museum Group, which explored opportunities for linking industrial history collections across museums and archives.

Model Details

Model Description

This model is a fine-tuned version of a model from the GLiNER (Generalist and Lightweight Model for Named Entity Recognition) family. GLiNER is part of a new wave of Named Entity Recognition (NER) models commonly referred to as ‘Universal NER’ – the key distinction from traditional NER being that the model is not restricted to previously established entity types, but can extract entities based on user-defined labels.

Developed by: Max Long, Kaspar Beelen, Arran Rees, Ben Russell (Science Museum Group)
Funded by: UK Arts and Humanities Research Council (AHRC)
Shared by: Congruence Engine Project, Science Museum Group
Model type: BERT
Language(s) (NLP): English
License: Creative Commons Attribution 4.0
Finetuned from model: gliner-community/gliner_medium-v2.5

Model Sources

Repository: Repository for Congruence Engine experiments with fine-tuning GLiNER for cultural heritage uses.

Uses

This model was developed experimentally by the Congruence Engine project to explore whether a fine-tuned NER model can help link museum collection items and objects in the cultural heritage sector. It is intended to identify the following entity types: "textile manufacturing chemical", "textile dye", "textile machinery", "textile fibre", "textile fabric", "textile fabric component", "textile fabric imperfection", "textile waste material", "textile weave", "textile manufacturing process", "textile industry unit of measurement", "textile industry occupation".
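
As a convenience, the twelve entity types above can be kept in a single list and passed to the model as labels. The sketch below is illustrative only: the names `TEXTILE_LABELS` and `split_labels` are hypothetical and not part of the GLiNER library. Because GLiNER accepts arbitrary user-defined labels, the helper does not reject other labels; it only separates those the model was fine-tuned on from those it was not.

```python
# The twelve entity types this model was fine-tuned to identify.
# TEXTILE_LABELS and split_labels are hypothetical names used for illustration.
TEXTILE_LABELS = [
    "textile manufacturing chemical",
    "textile dye",
    "textile machinery",
    "textile fibre",
    "textile fabric",
    "textile fabric component",
    "textile fabric imperfection",
    "textile waste material",
    "textile weave",
    "textile manufacturing process",
    "textile industry unit of measurement",
    "textile industry occupation",
]


def split_labels(requested):
    """Split requested labels into (fine-tuned, other), preserving order."""
    fine_tuned = [label for label in requested if label in TEXTILE_LABELS]
    other = [label for label in requested if label not in TEXTILE_LABELS]
    return fine_tuned, other


fine_tuned, other = split_labels(["textile dye", "person", "textile weave"])
print("Fine-tuned labels:", fine_tuned)
print("Other labels:", other)
```

Labels in the second group can still be passed to the model, but results for them are not covered by the evaluation reported in this card.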

Bias, Risks, and Limitations

This model was fine-tuned using data derived from historic glossaries used in the textile industry between the late nineteenth century and the early twentieth century. The training dataset will therefore contain expressions and terms that are prejudicial or offensive, particularly with respect to gender, race and disability. You can view the dataset here.

While the NER terms themselves are based on historic glossaries and definitions, the synthetic examples in the dataset were generated using OpenAI's GPT-4o. Users should be aware that this process may have introduced further biases into the dataset.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. The model is intended to be used within the specific context of historical research, museum database management, and museum curation. Users should approach any other intended applications with caution.

How to Get Started with the Model

To use this model, you must first install the GLiNER Python library:

!pip install gliner -U

Then, load the model:

from gliner import GLiNER
model = GLiNER.from_pretrained("congruence-engine/gliner_2.5_textile_industry_historic", load_tokenizer=True)

You can now test the model on an example text:

text = """
The 19th century textile industry was a vibrant period of innovation and expansion, fueled by advancements in materials and techniques. Barwood, a natural dye source imported from Africa, played a crucial role in achieving rich red hues. Skilled colourists experimented with this and other natural dyes to create striking fabrics that met the era’s demand for color diversity.

Processes like degumming were essential in preparing silk for dyeing and weaving, removing sericin to achieve a smooth finish. Similarly, scouring, the thorough cleaning of wool and other fibers, ensured that impurities did not interfere with dyeing or spinning processes. Innovations like the scotch feed mechanism improved efficiency in spinning mills, streamlining the delivery of fibers to machinery.

Domett, a plain but durable cloth, was widely used for practical garments and household items, exemplifying the industry’s focus on both utility and style. These combined efforts shaped the thriving textile trade of the era.
"""

# Labels for entity prediction
labels = ["textile machinery", "textile fabric", "textile industry occupation", "textile dye", "textile manufacturing process"]

# Perform entity prediction
entities = model.predict_entities(text, labels, threshold=0.5)

# Display predicted entities and their labels
for entity in entities:
    print(entity["text"], "=>", entity["label"])
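
For linking or cataloguing work it can help to group the predictions by label and discard low-confidence ones. The sketch below assumes each prediction is a dictionary with "text", "label" and "score" keys, as returned by `predict_entities`. The helper name `group_entities_by_label` and the sample predictions are hypothetical; the sample was written by hand for illustration and is not real model output.

```python
from collections import defaultdict


def group_entities_by_label(entities, min_score=0.0):
    """Group entity texts by label, keeping only predictions at or above min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        # Treat a missing score as fully confident so the helper stays permissive.
        if entity.get("score", 1.0) >= min_score:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


# Hypothetical predictions (hand-written, NOT real model output).
sample_entities = [
    {"text": "Barwood", "label": "textile dye", "score": 0.91},
    {"text": "colourists", "label": "textile industry occupation", "score": 0.78},
    {"text": "degumming", "label": "textile manufacturing process", "score": 0.85},
    {"text": "scouring", "label": "textile manufacturing process", "score": 0.55},
    {"text": "Domett", "label": "textile fabric", "score": 0.88},
]

grouped = group_entities_by_label(sample_entities, min_score=0.6)
for label, texts in grouped.items():
    print(label, "=>", texts)
```

With real output, replace `sample_entities` with the `entities` list from the example above. Raising `min_score` trades recall for precision in the same way as the `threshold` argument, but after prediction rather than during it.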

Training Details

Training Data

The model was fine-tuned using the following sources:

Synthetic sentences generated using OpenAI's GPT-4o model, based on historic textile glossaries compiled from digitised books (2,504 examples)
A subset of the Pile-NER-type dataset (4,000 examples, to avoid overfitting)

Dataset card: max-long/textile_glossaries_and_pile_ner

For a full description of how the synthetic data was generated, you can consult this notebook. A Colab version is available here.

Training Procedure

This model was fine-tuned on an A100 GPU in a Google Colab notebook. You can find a notebook describing the full fine-tuning procedure here. You can also consult a Colab version here.

Training Hyperparameters

learning_rate=5e-6
weight_decay=0.01
others_lr=1e-5
others_weight_decay=0.01
lr_scheduler_type="linear"  # cosine
warmup_ratio=0.1
eval_strategy="steps"
logging_steps=50
save_steps=100
save_total_limit=10
dataloader_num_workers=0
use_cpu=False
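
The following is a minimal sketch of how the values above might be passed to a training configuration. It assumes that `gliner.training.TrainingArguments` accepts these keyword arguments (including `others_lr` and `others_weight_decay`) and that the installed transformers version uses the name `eval_strategy`; older versions call it `evaluation_strategy`. The `output_dir` value and the names `HYPERPARAMETERS` and `build_training_arguments` are hypothetical. Batch size and number of epochs are not listed in this section, so they are left at library defaults here and the sketch does not by itself reproduce the original run.

```python
# Hyperparameters exactly as listed in this section.
HYPERPARAMETERS = {
    "learning_rate": 5e-6,
    "weight_decay": 0.01,
    "others_lr": 1e-5,
    "others_weight_decay": 0.01,
    "lr_scheduler_type": "linear",
    "warmup_ratio": 0.1,
    "eval_strategy": "steps",
    "logging_steps": 50,
    "save_steps": 100,
    "save_total_limit": 10,
    "dataloader_num_workers": 0,
    "use_cpu": False,
}


def build_training_arguments(output_dir="models"):
    """Build training arguments from HYPERPARAMETERS (output_dir is a placeholder)."""
    # Imported lazily so the dictionary can be inspected without gliner installed.
    from gliner.training import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMETERS)


for name, value in HYPERPARAMETERS.items():
    print(f"{name} = {value!r}")
```

The linked fine-tuning notebook remains the authoritative record of the configuration actually used.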

Speeds, Sizes, Times

global_step: 502
training_loss: 173.8674098163012
train_runtime: 154.4285
train_samples_per_second: 25.941
train_steps_per_second: 3.251
total_flos: 0.0
train_loss: 173.8674098163012
epoch: 2.0
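
As a quick consistency check, the throughput figures can be recomputed from the step count and runtime. This sketch assumes `train_runtime` is reported in seconds, as is conventional for the Hugging Face Trainer. The samples-per-step figure it derives is an inference from these numbers, not a batch size stated in this card.

```python
# Values copied from the training summary above.
global_step = 502
train_runtime_s = 154.4285  # assumed to be in seconds
train_samples_per_second = 25.941

# Steps per second follows directly from step count and runtime.
steps_per_second = global_step / train_runtime_s

# Dividing sample throughput by step throughput gives the average samples per step.
samples_per_step = train_samples_per_second / steps_per_second

print(f"steps per second: {steps_per_second:.3f}")
print(f"average samples per step: {samples_per_step:.2f}")
```

The recomputed steps-per-second value agrees with the reported 3.251, and the ratio suggests roughly eight samples per optimisation step.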

Evaluation

Testing Data, Factors & Metrics

Testing Data

Test split generated during training.

Number of examples: 501.

Metrics

P: Precision (of all entities the model identified, the percentage that were correct)

R: Recall (of all entities actually present, the percentage the model identified)

F1: The harmonic mean of Precision and Recall

F1 (decimal): The F1 score expressed as a decimal rather than a percentage

Results

Metric	Score
P	73.24%
R	66.02%
F1	69.44%
F1 (decimal)	0.6944079078480608
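
The F1 score can be recomputed from the precision and recall reported in the table as their harmonic mean. The sketch below uses the rounded percentages from the table, so it matches the reported decimal only to about four decimal places. The function name `f1_score` is illustrative and is not taken from the evaluation code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values from the results table, converted from percentages to fractions.
precision = 0.7324
recall = 0.6602

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.4f}")
```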

Environmental Impact

Hardware Type: A100
Hours used: 1
Cloud Provider: Google Cloud
Compute Region: asia-southeast1
Carbon Emitted: 0.1 kg CO2 eq.

Technical Specifications

Compute Infrastructure

Google Colab notebook environment.

Hardware

A100 GPU.

Model Card Authors

Dr. Max Long

Model Card Contact

mel58@cam.ac.uk
