---
license: mit
datasets:
- DDSC/dagw_no_twitter
language:
- da
tags:
- SimCSE
---
A version of `chcaa/dfm-encoder-large-v1` trained using SimCSE. It was trained as part of the [Scandinavian Embedding Benchmark](https://kennethenevoldsen.github.io/scandinavian-embedding-benchmark/) to establish a naive baseline for SimCSE.
**Note**: We do not recommend this model. Instead, we encourage users to check out the current best model on [SEB](https://kennethenevoldsen.github.io/scandinavian-embedding-benchmark/) or the [recommendation](https://huggingface.co/collections/danish-foundation-models/state-of-the-art-danish-models-65f01d84a10842712e186172) by the Danish Foundation Models team.
## Hyperparameters
Trained using the [SimCSE](https://github.com/princeton-nlp/SimCSE) implementation with:
```
# data/dfm_paragraphs.txt contains paragraphs extracted from Danish Gigaword
CUDA_VISIBLE_DEVICES=0 python train.py \
--train_file data/dfm_paragraphs.txt \
--model_name_or_path chcaa/dfm-encoder-large-v1 \
--num_train_epochs 1 \
--per_device_train_batch_size 128 \
--learning_rate 1e-5 \
--max_seq_length 32 \
--evaluation_strategy steps \
--metric_for_best_model stsb_spearman \
--load_best_model_at_end \
--pooler_type cls \
--mlp_only_train \
--do_mlm \
--overwrite_output_dir \
--temp 0.05 \
--do_train \
--fp16
```
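To illustrate what `--temp 0.05` controls, the following is a minimal sketch of the contrastive (InfoNCE) objective that unsupervised SimCSE is built on: each sentence is encoded twice, and the loss rewards a sentence for being closer to its own second view than to the other sentences in the batch. This is not the code from the SimCSE repository. The function names `cosine` and `simcse_loss` and the toy vectors are hypothetical, plain Python lists stand in for encoder outputs, and the auxiliary masked-language-modelling loss enabled by `--do_mlm` is left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(views_a, views_b, temp=0.05):
    """Mean InfoNCE loss over a batch.

    views_a[i] and views_b[i] are two encodings of sentence i (hypothetical
    stand-ins for two dropout passes through the encoder); every other
    views_b[j] acts as an in-batch negative.
    """
    total = 0.0
    for i, anchor in enumerate(views_a):
        # Similarities are divided by the temperature before the softmax.
        logits = [cosine(anchor, other) / temp for other in views_b]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching view (index i) as the target.
        total += log_denominator - logits[i]
    return total / len(views_a)


# Toy batch of two "sentences" with 2-dimensional embeddings (made-up values).
views_a = [[1.0, 0.0], [0.0, 1.0]]
views_b = [[0.9, 0.1], [0.1, 0.9]]

aligned = simcse_loss(views_a, views_b)          # each view paired with its match
shuffled = simcse_loss(views_a, views_b[::-1])   # pairs deliberately mismatched
print(f"aligned={aligned:.6f} shuffled={shuffled:.6f}")
```

A low temperature such as 0.05 sharpens the softmax, so even small differences in cosine similarity lead to large differences in the loss, which is why the mismatched batch is penalised much more heavily than the aligned one.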
## Citation
To cite this work, please refer to the following article:
```
Enevoldsen, K., Kardos, M., Muennighoff, N., & Nielbo, K. (2024). The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding. https://openreview.net/forum?id=pJl_i7HIA72
```
or use the following BibTeX:
```
@article{enevoldsenScandinavianEmbeddingBenchmarks2024,
title = {The {Scandinavian} {Embedding} {Benchmarks}: {Comprehensive} {Assessment} of {Multilingual} and {Monolingual} {Text} {Embedding}},
shorttitle = {The {Scandinavian} {Embedding} {Benchmarks}},
url = {https://openreview.net/forum?id=pJl_i7HIA72},
language = {en},
urldate = {2024-04-12},
author = {Enevoldsen, Kenneth and Kardos, Márton and Muennighoff, Niklas and Nielbo, Kristoffer},
month = feb,
year = {2024},
}