Domain adaptation
Hey Aaron,
this model is great!
If I wanted to adapt this to a special domain, like German medical texts, what would be my best path forward?
Can I domain-adapt with unlabelled data, e.g. gigabytes of raw text? Or would I have to generate labels (query, positive, negative)?
Can you give me some pointers?
Hi,
thanks!
It depends on the task. For semantic similarity / semantic search you could further fine-tune the model with labels. Note that the similarity labels should be between 0 (no similarity) and 1 (identical).
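As a concrete sketch of what those 0–1 labels look like: if your annotations come on a 0–5 scale (as in the STS benchmarks), you would normalize them first before building the training pairs. The sentences and scores below are invented for illustration:

```python
# Hypothetical annotated pairs: (sentence_a, sentence_b, raw score on a
# 0-5 scale, as in the STS benchmarks). Everything here is made up.
raw_pairs = [
    ("Der Patient klagt über Kopfschmerzen.", "Der Patient hat Cephalgie.", 4.5),
    ("Der Patient klagt über Kopfschmerzen.", "Die Röntgenaufnahme ist unauffällig.", 0.5),
]

# CosineSimilarityLoss in sentence-transformers expects labels in [0, 1],
# so divide the 0-5 scores by 5 before building the training set.
train_examples = [(a, b, score / 5.0) for a, b, score in raw_pairs]

for a, b, label in train_examples:
    assert 0.0 <= label <= 1.0
```

These triples would then go into `InputExample(texts=[a, b], label=label)` objects and be trained with `losses.CosineSimilarityLoss`.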
Also, consider adding tokens for domain-specific words before fine-tuning. If you want to stick with sentence-transformers, this could help:
https://github.com/UKPLab/sentence-transformers/issues/744
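The basic pattern for extending the vocabulary looks roughly like this — the model name and the token list are only placeholders, pick whatever fits your domain:

```python
from transformers import AutoModel, AutoTokenizer

# Placeholder base model and domain terms; substitute your own.
name = "bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

new_tokens = ["Myokardinfarkt", "Thoraxröntgen", "Anamnese"]
tokenizer.add_tokens(new_tokens)  # skips tokens already in the vocab

# The embedding matrix must grow to match the new vocabulary size,
# otherwise the new token ids have no embeddings.
model.resize_token_embeddings(len(tokenizer))
```

The new embeddings are randomly initialized, so they only become useful after fine-tuning on your domain data.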
Alternatively, if you have enough resources, you could train from scratch, or just use a smaller pretrained model:
https://huggingface.co/GerMedBERT/medbert-512
For semantic similarity you could either fine-tune on your labeled dataset or on STS data to prime the model for the task. The first option would be more promising.
All the best
Aaron