---
language:
- en
base_model:
- FacebookAI/roberta-large
pipeline_tag: text-classification
---

# Graded Word Sense Disambiguation (WSD) Model

## Model Summary

This model is a **fine-tuned version of RoBERTa-Large** for **Graded Word Sense Disambiguation (WSD)**. It predicts the **degree of applicability** (1-4) of a word sense in context, leveraging **large-scale sense-annotated corpora**. The model is based on the work outlined in:

**Reference Paper:**
Pierluigi Cassotti, Nina Tahmasebi (2025). Sense-specific Historical Word Usage Generation.

The model has been trained to handle **graded WSD tasks**, providing **continuous-valued predictions** instead of hard classifications, which makes it useful for nuanced applications in lexicography, computational linguistics, and historical text analysis.

---

## Model Details

- **Base Model:** `roberta-large`
- **Task:** Graded Word Sense Disambiguation (WSD)
- **Fine-tuning Dataset:** Oxford English Dictionary (OED) sense-annotated corpus
- **Training Steps** (see the sketch after this list):
  - Tokenizer augmented with special tokens (`<t>`, `</t>`) for marking target words in context.
  - Dataset preprocessed with **sense annotations** and **word offsets**.
  - Sentences containing sense-annotated words were split into **train (90%)** and **validation (10%)** sets.
- **Objective:** Predict a continuous label representing the applicability of a sense.
- **Evaluation Metric:** Root Mean Squared Error (RMSE).
- **Batch Size:** 32
- **Learning Rate:** 2e-5
- **Epochs:** 1
- **Optimizer:** AdamW with a weight decay of 0.01
- **Evaluation Strategy:** Step-based (every 10% of the dataset).
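
The hyperparameters above map directly onto a standard `transformers` setup. As a minimal sketch (not the authors' exact training script; the output directory, evaluation step count, and variable names are illustrative), adding the target-word markers and declaring the training arguments could look roughly like this:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          TrainingArguments)

# Base encoder with a single regression output (num_labels=1 -> MSE loss
# when the labels are floats).
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=1)

# Register <t> / </t> as special tokens and give the new ids embeddings to train.
tokenizer.add_special_tokens({"additional_special_tokens": ["<t>", "</t>"]})
model.resize_token_embeddings(len(tokenizer))

# Hyperparameters listed above; AdamW is the Trainer's default optimizer.
training_args = TrainingArguments(
    output_dir="graded-wsd",            # illustrative path
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    eval_strategy="steps",              # `evaluation_strategy` in older versions
    eval_steps=500,                     # set to roughly 10% of the total training steps
)
```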

---

## Training & Fine-Tuning

Fine-tuning was performed using the **Hugging Face `Trainer` API** with a **custom dataset loader**. The dataset was processed as follows (a minimal training sketch follows the list):

1. **Preprocessing**
   - Example sentences were extracted from the OED and augmented with **definitions**.
   - The target word was **highlighted** with special tokens (`<t>`, `</t>`).
   - Each instance was labeled with a **graded similarity score**.

2. **Tokenization & Encoding**
   - Tokenized with `AutoTokenizer.from_pretrained("roberta-large")`.
   - Definitions were concatenated using the `</s></s>` separator for **cross-sentence representation**.

3. **Training Pipeline**
   - Model fine-tuned on the **regression task** with a single **linear output head**.
   - Used **Mean Squared Error (MSE) loss**.
   - Evaluation on the validation set using **Root Mean Squared Error (RMSE)**.
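
Continuing the sketch above (the field names `sentence`, `definition`, and `score`, and the `train_ds` / `val_ds` splits, are illustrative placeholders, not the authors' released code), the regression fine-tuning and RMSE evaluation could be wired up roughly as follows:

```python
import numpy as np
from transformers import Trainer

def compute_metrics(eval_pred):
    """Report RMSE on the validation split."""
    predictions, labels = eval_pred
    predictions = np.squeeze(predictions, axis=-1)
    return {"rmse": float(np.sqrt(np.mean((predictions - labels) ** 2)))}

def encode(example):
    # The target word is already wrapped in <t> ... </t>; the sense definition is
    # appended after the RoBERTa separator pair so both share one input.
    text = f"{example['sentence']} </s></s> {example['definition']}"
    enc = tokenizer(text, truncation=True)
    enc["labels"] = float(example["score"])   # graded label, 1-4
    return enc

# `train_ds` / `val_ds` stand in for the 90% / 10% split described above.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds.map(encode),
    eval_dataset=val_ds.map(encode),
    tokenizer=tokenizer,                 # enables dynamic padding
    compute_metrics=compute_metrics,
)
trainer.train()
```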

---

## Usage

### Example Code

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("ChangeIsKey/graded-wsd")
model = AutoModelForSequenceClassification.from_pretrained("ChangeIsKey/graded-wsd")

# Sentence with the target word wrapped in <t> ... </t>, plus the candidate sense definition.
sentence = "The <t>bank</t> of the river was eroding due to the storm."
target_word = "bank"
definition = "The land alongside a river or a stream."

# The definition is appended after the RoBERTa separator pair </s></s>.
tokenized_input = tokenizer(f"{sentence} </s></s> {definition}", truncation=True, padding=True, return_tensors="pt")
with torch.no_grad():
    output = model(**tokenized_input)
    score = output.logits.item()

print(f"Graded Sense Score: {score}")
```
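
To disambiguate among several senses, a common pattern is to score each candidate definition for the same usage and keep the highest-scoring one. A short sketch reusing the model and tokenizer loaded above (the candidate definitions are illustrative):

```python
# Score each candidate sense for the same marked sentence and rank them.
candidate_definitions = [
    "The land alongside a river or a stream.",
    "A financial institution that accepts deposits and makes loans.",
]

scores = {}
for candidate in candidate_definitions:
    inputs = tokenizer(f"{sentence} </s></s> {candidate}",
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        scores[candidate] = model(**inputs).logits.item()

best_sense = max(scores, key=scores.get)
print(f"Best-matching sense: {best_sense!r} ({scores[best_sense]:.2f})")
```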

### Input Format

- **Sentence:** Contextual usage of the word, with the target word marked using `<t>` and `</t>` (see the helper sketch below).
- **Target Word:** The word to be disambiguated.
- **Definition:** The dictionary definition of the intended sense.
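
If you start from a raw sentence and character offsets rather than a pre-marked string, a small helper (illustrative, not part of the released code) can insert the markers and assemble the model input:

```python
def build_input(sentence: str, start: int, end: int, definition: str) -> str:
    """Wrap the target span [start, end) in <t> ... </t> and append the definition."""
    marked = f"{sentence[:start]}<t>{sentence[start:end]}</t>{sentence[end:]}"
    return f"{marked} </s></s> {definition}"

text = build_input("The bank of the river was eroding due to the storm.",
                   4, 8, "The land alongside a river or a stream.")
# "The <t>bank</t> of the river was eroding due to the storm. </s></s> The land alongside a river or a stream."
```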

### Output

- **A continuous score** (between 1 and 4) indicating how well the given definition applies to the target word in its context.

---

## Citation

If you use this model, please cite the following paper:

```
@article{cassotti2025,
  title={Sense-specific Historical Word Usage Generation},
  author={Cassotti, Pierluigi and Tahmasebi, Nina},
  journal={Transactions of the Association for Computational Linguistics},
  year={2025}
}
```