---
language:
- en
base_model:
- FacebookAI/roberta-large
pipeline_tag: text-classification
---
# Graded Word Sense Disambiguation (WSD) Model
## Model Summary
This model is a **fine-tuned version of RoBERTa-Large** for **Graded Word Sense Disambiguation (WSD)**. Given a word in context and a candidate sense definition, it predicts the **degree of applicability** (1-4) of that sense, leveraging **large-scale sense-annotated corpora**. The model is based on the work described in:
**Reference Paper:**
Pierluigi Cassotti, Nina Tahmasebi (2025). Sense-specific Historical Word Usage Generation.
This model is trained for **graded WSD tasks**, providing **continuous-valued predictions** instead of hard sense classifications, which makes it useful for nuanced applications in lexicography, computational linguistics, and historical text analysis.
---
## Model Details
- **Base Model:** `roberta-large`
- **Task:** Graded Word Sense Disambiguation (WSD)
- **Fine-tuning Dataset:** Oxford English Dictionary (OED) sense-annotated corpus
- **Training Steps:**
  - Tokenizer augmented with special tokens (`<t>`, `</t>`) for marking target words in context (a setup sketch follows this list).
  - Dataset preprocessed with **sense annotations** and **word offsets**.
  - Sentences containing sense-annotated words were split into **train (90%)** and **validation (10%)** sets.
- **Objective:** Predicting a continuous label representing the applicability of a sense.
- **Evaluation Metric:** Root Mean Squared Error (RMSE).
- **Batch Size:** 32
- **Learning Rate:** 2e-5
- **Epochs:** 1
- **Optimizer:** AdamW with weight decay of 0.01
- **Evaluation Strategy:** Steps-based (every 10% of the dataset).
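A minimal sketch of this setup, assuming the standard `transformers` API (the training code itself is not part of this repository):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Augment the tokenizer with the target-word markers described above.
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
tokenizer.add_special_tokens({"additional_special_tokens": ["<t>", "</t>"]})

# A single-label head with problem_type="regression" yields the MSE objective.
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=1, problem_type="regression"
)
model.resize_token_embeddings(len(tokenizer))  # account for the two new tokens
```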
---
## Training & Fine-Tuning
Fine-tuning was performed using the **Hugging Face `Trainer` API** with a **custom dataset loader**. The dataset was processed as follows:
1. **Preprocessing**
   - Example sentences were extracted from the OED and augmented with **definitions**.
   - The target word was **highlighted** with special tokens (`<t>`, `</t>`).
   - Each instance was labeled with a **graded similarity score**.
2. **Tokenization & Encoding**
   - Tokenized with `AutoTokenizer.from_pretrained("roberta-large")`.
   - Definitions were concatenated using the `</s></s>` separator for **cross-sentence representation**.
3. **Training Pipeline**
   - Model fine-tuned on the **regression task** with a single **linear output head**.
   - Used **Mean Squared Error (MSE) loss**.
   - Evaluation on the validation set using **Root Mean Squared Error (RMSE)**; see the sketch below.
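The exact training script is not published here; the following is an illustrative sketch of the pipeline above, assuming character offsets for the target word (`build_input` and `compute_metrics` are hypothetical helper names, and the hyperparameters mirror those under Model Details):

```python
import numpy as np
from transformers import TrainingArguments

def build_input(sentence: str, start: int, end: int, definition: str) -> str:
    """Mark the target-word span with <t>...</t> and append the sense definition."""
    marked = sentence[:start] + "<t>" + sentence[start:end] + "</t>" + sentence[end:]
    return f"{marked} </s></s> {definition}"

def compute_metrics(eval_pred):
    """RMSE between predicted and gold graded scores."""
    predictions, labels = eval_pred
    rmse = float(np.sqrt(np.mean((predictions.squeeze() - labels) ** 2)))
    return {"rmse": rmse}

training_args = TrainingArguments(
    output_dir="graded-wsd",
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="steps",
)
```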
---
## Usage
### Example Code
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("ChangeIsKey/graded-wsd")
model = AutoModelForSequenceClassification.from_pretrained("ChangeIsKey/graded-wsd")
model.eval()

# Mark the target word with <t>...</t> and pair the sentence with a
# candidate sense definition after the </s></s> separator.
sentence = "The <t>bank</t> of the river was eroding due to the storm."
target_word = "bank"
definition = "The land alongside a river or a stream."

tokenized_input = tokenizer(
    f"{sentence} </s></s> {definition}",
    truncation=True,
    padding=True,
    return_tensors="pt",
)

with torch.no_grad():
    output = model(**tokenized_input)
    score = output.logits.item()

print(f"Graded Sense Score: {score}")
```
### Input Format
- **Sentence:** the contextual usage of the word, with the target marked by `<t>` and `</t>`.
- **Target Word:** the word to be disambiguated.
- **Definition:** the dictionary definition of the intended sense.
### Output
- **A continuous score** between 1 and 4 indicating **how well the given definition applies** to the target word in its context.
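Because the score is graded rather than categorical, the same call can rank several candidate definitions for one usage. A short sketch reusing `tokenizer`, `model`, and `sentence` from the example above (the definitions here are illustrative):

```python
candidates = {
    "riverside": "The land alongside a river or a stream.",
    "financial institution": "An organization that accepts deposits and lends money.",
}

for sense, definition in candidates.items():
    inputs = tokenizer(f"{sentence} </s></s> {definition}",
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        score = model(**inputs).logits.item()
    print(f"{sense}: {score:.2f}")
```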
---
## Citation
If you use this model, please cite the following paper:
```bibtex
@article{cassotti2025,
  title={Sense-specific Historical Word Usage Generation},
  author={Cassotti, Pierluigi and Tahmasebi, Nina},
  journal={Transactions of the Association for Computational Linguistics},
  year={2025}
}
```