Hungarian Experimental Sentence-BERT
The pre-trained huBERT was fine-tuned on the Hunglish 2.0 parallel corpus to mimic the bert-base-nli-stsb-mean-tokens model provided by UKPLab. Sentence embeddings were obtained by applying mean pooling to the huBERT output. The data was split into training (98%) and validation (2%) sets. At the end of training, the model reached a mean squared error of 0.106 on the validation set. Our code was based on the Sentence-Transformers library. The model was trained for 2 epochs on a single GTX 1080 Ti GPU with a batch size of 32; training took approximately 15 hours.
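For readers unfamiliar with this distillation setup, the sketch below shows how such a run can be assembled with the older Sentence-Transformers training API: an English teacher model, a huBERT student with mean pooling, and an MSE objective on parallel sentence pairs. The huBERT checkpoint name (SZTAKI-HLT/hubert-base-cc), the file name hunglish2_train.tsv, and the warmup_steps value are illustrative assumptions; only the teacher model, mean pooling, MSE loss, batch size, and epoch count come from the description above.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: the English SBERT model whose embedding space the student should mimic
teacher = SentenceTransformer('bert-base-nli-stsb-mean-tokens')

# Student: huBERT with mean pooling over the token embeddings (checkpoint name is an assumption)
word_embedding = models.Transformer('SZTAKI-HLT/hubert-base-cc', max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode='mean')
student = SentenceTransformer(modules=[word_embedding, pooling])

# Parallel English-Hungarian pairs, one tab-separated pair per line (hypothetical file name)
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data('hunglish2_train.tsv')
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)

# Mean squared error between student and teacher sentence embeddings
train_loss = losses.MSELoss(model=student)

student.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    warmup_steps=1000,  # illustrative value, not taken from the paper
)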
Limitations
- max_seq_length = 128
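The limit refers to the number of word-piece tokens; longer inputs are truncated before pooling. A minimal check, assuming the standard Sentence-Transformers max_seq_length attribute:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('NYTK/sentence-transformers-experimental-hubert-hungarian')
print(model.max_seq_length)  # 128; tokens beyond this limit are dropped

# The limit can be lowered for faster encoding of short texts, e.g.:
model.max_seq_length = 64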
Usage
from sentence_transformers import SentenceTransformer

# Sentences to embed
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model and encode the sentences into fixed-size vectors
model = SentenceTransformer('NYTK/sentence-transformers-experimental-hubert-hungarian')
embeddings = model.encode(sentences)
print(embeddings)
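As a follow-up to the snippet above, a minimal semantic-similarity example, assuming the util.cos_sim helper from Sentence-Transformers (the Hungarian sentence pair is illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('NYTK/sentence-transformers-experimental-hubert-hungarian')

# Illustrative Hungarian sentence pair
sentences = ["Ez egy példa mondat.", "Minden mondatot vektorrá alakítunk."]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))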
Citation
If you use this model, please cite the following paper:
@article{bertopic,
  title   = {Analyzing Narratives of Patient Experiences: A BERT Topic Modeling Approach},
  author  = {Osváth, Mátyás and Yang, Zijian Győző and Kósa, Karolina},
  journal = {Acta Polytechnica Hungarica},
  year    = {2023},
  volume  = {20},
  number  = {7},
  pages   = {153--171}
}