---
license: mit
datasets:
- DDSC/dagw_no_twitter
language:
- da
tags:
- SimCSE
---

A version of `chcaa/dfm-encoder-large-v1` trained using SimCSE. It was trained as part of the [Scandinavian Embedding Benchmark](https://kennethenevoldsen.github.io/scandinavian-embedding-benchmark/) to establish a naive SimCSE baseline.

**Note**: We do not recommend using this model; instead, we encourage users to check the current best-performing models on [SEB](https://kennethenevoldsen.github.io/scandinavian-embedding-benchmark/) or the [recommendations](https://huggingface.co/collections/danish-foundation-models/state-of-the-art-danish-models-65f01d84a10842712e186172) by the Danish Foundation Models team.


## Hyperparameters
Trained using the [SimCSE](https://github.com/princeton-nlp/SimCSE) implementation with:

```
# data/dfm_paragraphs.txt contains paragraphs extracted from Danish Gigaword
CUDA_VISIBLE_DEVICES=0 python train.py \
    --train_file data/dfm_paragraphs.txt \
    --model_name_or_path chcaa/dfm-encoder-large-v1 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 128 \
    --learning_rate 1e-5 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --pooler_type cls \
    --mlp_only_train \
    --do_mlm \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --fp16 
```
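
Since the run above uses `--pooler_type cls` together with `--mlp_only_train`, the SimCSE MLP projection head is only applied during training; at inference time the `[CLS]` hidden state of the encoder serves as the sentence embedding. Below is a minimal usage sketch with Hugging Face `transformers`, assuming that setup; the `MODEL_ID` placeholder and the example sentences are illustrative and should be replaced with this repository's identifier and your own text.

```
from transformers import AutoModel, AutoTokenizer
import torch

# Placeholder: replace with this repository's identifier on the Hugging Face Hub.
MODEL_ID = "path/to/this-model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

sentences = [
    "Der er mange bøger på biblioteket.",
    "København er Danmarks hovedstad.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=32, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# CLS pooling: take the hidden state of the [CLS] token as the sentence embedding
# (the SimCSE MLP head is dropped at inference when trained with --mlp_only_train).
embeddings = outputs.last_hidden_state[:, 0]

# Cosine similarity between the two example sentences.
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```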


## Citation

To cite this work, please refer to the following article:

```
Enevoldsen, K., Kardos, M., Muennighoff, N., & Nielbo, K. (2024). The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding. https://openreview.net/forum?id=pJl_i7HIA72
```

or use the following BibTeX:
```
@article{enevoldsenScandinavianEmbeddingBenchmarks2024,
	title = {The {Scandinavian} {Embedding} {Benchmarks}: {Comprehensive} {Assessment} of {Multilingual} and {Monolingual} {Text} {Embedding}},
	shorttitle = {The {Scandinavian} {Embedding} {Benchmarks}},
	url = {https://openreview.net/forum?id=pJl_i7HIA72},
	language = {en},
	urldate = {2024-04-12},
	author = {Enevoldsen, Kenneth and Kardos, Márton and Muennighoff, Niklas and Nielbo, Kristoffer},
	month = feb,
	year = {2024},
}
```