---
language:
- en
base_model:
- FacebookAI/roberta-large
pipeline_tag: text-classification
---
# Graded Word Sense Disambiguation (WSD) Model
## Model Summary
This model is a **fine-tuned version of RoBERTa-Large** for **Graded Word Sense Disambiguation (WSD)**. Given a word in context and a candidate sense definition, it predicts the **degree of applicability** (1-4) of that sense, leveraging **large-scale sense-annotated corpora**. The model is based on the work described in:
**Reference Paper:**
Pierluigi Cassotti, Nina Tahmasebi (2025). Sense-specific Historical Word Usage Generation.
This model is trained for **graded WSD tasks**, providing **continuous-valued predictions** instead of hard sense classifications, which makes it useful for nuanced applications in lexicography, computational linguistics, and historical text analysis.
---
## Model Details
- **Base Model:** `roberta-large`
- **Task:** Graded Word Sense Disambiguation (WSD)
- **Fine-tuning Dataset:** Oxford English Dictionary (OED) sense-annotated corpus
- **Training Steps:**
  - Tokenizer augmented with special tokens (`<t>`, `</t>`) for marking target words in context (a setup sketch follows this list).
  - Dataset preprocessed with **sense annotations** and **word offsets**.
  - Sentences containing sense-annotated words were split into **train (90%)** and **validation (10%)** sets.
- **Objective:** Predicting a continuous label representing the applicability of a sense.
- **Evaluation Metric:** Root Mean Squared Error (RMSE).
- **Batch Size:** 32
- **Learning Rate:** 2e-5
- **Epochs:** 1
- **Optimizer:** AdamW with weight decay of 0.01
- **Evaluation Strategy:** Steps-based (every 10% of the dataset).
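A minimal sketch of this setup, assuming the standard `transformers` API (the training code itself is not part of this repository):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Augment the tokenizer with the target-word markers described above.
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
tokenizer.add_special_tokens({"additional_special_tokens": ["<t>", "</t>"]})

# A single-label head with problem_type="regression" yields the MSE objective.
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=1, problem_type="regression"
)
model.resize_token_embeddings(len(tokenizer))  # account for the two new tokens
```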
---
## Training & Fine-Tuning
Fine-tuning was performed using the **Hugging Face `Trainer` API** with a **custom dataset loader**. The dataset was processed as follows:
1. **Preprocessing**
   - Example sentences were extracted from the OED and augmented with **definitions**.
   - The target word was **highlighted** with special tokens (`<t>`, `</t>`).
   - Each instance was labeled with a **graded similarity score**.
2. **Tokenization & Encoding**
   - Tokenized with `AutoTokenizer.from_pretrained("roberta-large")`.
   - Definitions were concatenated using the `</s></s>` separator for **cross-sentence representation**.
3. **Training Pipeline**
   - Model fine-tuned on the **regression task** with a single **linear output head**.
   - Used **Mean Squared Error (MSE) loss**.
   - Evaluation on the validation set using **Root Mean Squared Error (RMSE)**; see the sketch below.
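The exact training script is not published here; the following is an illustrative sketch of the pipeline above, assuming character offsets for the target word (`build_input` and `compute_metrics` are hypothetical helper names, and the hyperparameters mirror those under Model Details):

```python
import numpy as np
from transformers import TrainingArguments

def build_input(sentence: str, start: int, end: int, definition: str) -> str:
    """Mark the target-word span with <t>...</t> and append the sense definition."""
    marked = sentence[:start] + "<t>" + sentence[start:end] + "</t>" + sentence[end:]
    return f"{marked} </s></s> {definition}"

def compute_metrics(eval_pred):
    """RMSE between predicted and gold graded scores."""
    predictions, labels = eval_pred
    rmse = float(np.sqrt(np.mean((predictions.squeeze() - labels) ** 2)))
    return {"rmse": rmse}

training_args = TrainingArguments(
    output_dir="graded-wsd",
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="steps",
)
```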
---
## Usage
### Example Code
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("ChangeIsKey/graded-wsd")
model = AutoModelForSequenceClassification.from_pretrained("ChangeIsKey/graded-wsd")
model.eval()

# Mark the target word with <t>...</t> and pair the sentence with a
# candidate sense definition after the </s></s> separator.
sentence = "The <t>bank</t> of the river was eroding due to the storm."
target_word = "bank"
definition = "The land alongside a river or a stream."

tokenized_input = tokenizer(
    f"{sentence} </s></s> {definition}",
    truncation=True,
    padding=True,
    return_tensors="pt",
)

with torch.no_grad():
    output = model(**tokenized_input)
    score = output.logits.item()

print(f"Graded Sense Score: {score}")
```
### Input Format
- **Sentence:** the contextual usage of the word, with the target marked by `<t>` and `</t>`.
- **Target Word:** the word to be disambiguated.
- **Definition:** the dictionary definition of the intended sense.
### Output
- **A continuous score** between 1 and 4 indicating **how well the given definition applies** to the target word in its context.
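Because the score is graded rather than categorical, the same call can rank several candidate definitions for one usage. A short sketch reusing `tokenizer`, `model`, and `sentence` from the example above (the definitions here are illustrative):

```python
candidates = {
    "riverside": "The land alongside a river or a stream.",
    "financial institution": "An organization that accepts deposits and lends money.",
}

for sense, definition in candidates.items():
    inputs = tokenizer(f"{sentence} </s></s> {definition}",
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        score = model(**inputs).logits.item()
    print(f"{sense}: {score:.2f}")
```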
---
## Citation
If you use this model, please cite the following paper:
```bibtex
@article{cassotti2025,
  title={Sense-specific Historical Word Usage Generation},
  author={Cassotti, Pierluigi and Tahmasebi, Nina},
  journal={Transactions of the Association for Computational Linguistics},
  year={2025}
}
```