shlm-grc-en
Sentence embeddings for English and Ancient Greek
This model creates sentence embeddings in a shared vector space for Ancient Greek and English text.
The base model uses a modified version of the HLM architecture described in Heidelberg-Boston @ SIGTYP 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers (arXiv)
This model is trained to produce sentence embeddings using the multilingual knowledge distillation method and datasets described in Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation (arXiv).
This model was distilled from BAAI/bge-base-en-v1.5
for embedding English and Ancient Greek text.
Usage (Sentence-Transformers)
This model is currently incompatible with the latest version of the sentence-transformers library.
For now, either use HuggingFace Transformers directly (see below) or the following fork of sentence-transformers: https://github.com/kevinkrahn/sentence-transformers
You can use the model with sentence-transformers like this:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('kevinkrahn/shlm-grc-en')
embeddings = model.encode(sentences)
print(embeddings)
Usage (HuggingFace Transformers)
Without sentence-transformers, you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch
def cls_pooling(model_output):
return model_output[0][:,0]
# Sentences we want sentence embeddings for
sentences = ['This is an English sentence', 'Ὁ Παρθενών ἐστιν ἱερὸν καλὸν τῆς Ἀθήνης.']
# Load model from HuggingFace Hub
model = AutoModel.from_pretrained('kevinkrahn/shlm-grc-en', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('kevinkrahn/shlm-grc-en', trust_remote_code=True)
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
model_output = model(**encoded_input)
# Perform pooling. In this case, cls pooling.
sentence_embeddings = cls_pooling(model_output)
print("Sentence embeddings:")
print(sentence_embeddings)
Citing & Authors
If you use this model please cite the following papers:
@inproceedings{riemenschneider-krahn-2024-heidelberg,
title = "Heidelberg-Boston @ {SIGTYP} 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers",
author = "Riemenschneider, Frederick and
Krahn, Kevin",
editor = "Hahn, Michael and
Sorokin, Alexey and
Kumar, Ritesh and
Shcherbakov, Andreas and
Otmakhova, Yulia and
Yang, Jinrui and
Serikov, Oleg and
Rani, Priya and
Ponti, Edoardo M. and
Murado{\u{g}}lu, Saliha and
Gao, Rena and
Cotterell, Ryan and
Vylomova, Ekaterina",
booktitle = "Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP",
month = mar,
year = "2024",
address = "St. Julian's, Malta",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.sigtyp-1.16",
pages = "131--141",
}
@inproceedings{krahn-etal-2023-sentence,
title = "Sentence Embedding Models for {A}ncient {G}reek Using Multilingual Knowledge Distillation",
author = "Krahn, Kevin and
Tate, Derrick and
Lamicela, Andrew C.",
editor = "Anderson, Adam and
Gordin, Shai and
Li, Bin and
Liu, Yudong and
Passarotti, Marco C.",
booktitle = "Proceedings of the Ancient Language Processing Workshop",
month = sep,
year = "2023",
address = "Varna, Bulgaria",
publisher = "INCOMA Ltd., Shoumen, Bulgaria",
url = "https://aclanthology.org/2023.alp-1.2",
pages = "13--22",
}
- Downloads last month
- 155