---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
license: mit
datasets:
- squad
- eli5
- sentence-transformers/embedding-training-data
- KennethTM/gooaq_pairs_danish
- sentence-transformers/gooaq
- KennethTM/squad_pairs_danish
- KennethTM/eli5_question_answer_danish
language:
- da
library_name: sentence-transformers
widget:
- source_sentence: 'Kører der cykler på vejen?'
  sentences:
  - 'I Danmark er cykler et almindeligt transportmiddel, og de har lige så stor ret til at bruge vejene som bilister. Cyklister skal dog følge færdselsreglerne og vise hensyn til andre trafikanter.'
  - 'Solen skinner, og himlen er blå. Der er ingen vind, og temperaturen er perfekt. Det er den perfekte dag til at tage en tur på landet og nyde den friske luft.'
---

# Note

*This is an updated version of [KennethTM/MiniLM-L6-danish-encoder](https://huggingface.co/KennethTM/MiniLM-L6-danish-encoder). This version is trained on additional data (the [GooAQ dataset](https://huggingface.co/datasets/sentence-transformers/gooaq) translated to [Danish](https://huggingface.co/datasets/KennethTM/gooaq_pairs_danish)) and is otherwise the same.*

# MiniLM-L6-danish-encoder

This is a lightweight (~22 M parameters) [sentence-transformers](https://www.SBERT.net) model for Danish NLP. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

The maximum sequence length is 512 tokens.

The model was not pre-trained from scratch. It was adapted from the English model [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) with a [Danish tokenizer](https://huggingface.co/KennethTM/bert-base-uncased-danish).

It was trained on ELI5 and SQuAD data machine-translated from English to Danish, together with the Danish GooAQ pairs mentioned in the note above.

# Usage (Sentence-Transformers)

Using this model is easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Given a query
query = ['Kører der cykler på vejen?']

# And two passages
passage = ['I Danmark er cykler et almindeligt transportmiddel, og de har lige så stor ret til at bruge vejene som bilister. Cyklister skal dog følge færdselsreglerne og vise hensyn til andre trafikanter.',
           'Solen skinner, og himlen er blå. Der er ingen vind, og temperaturen er perfekt. Det er den perfekte dag til at tage en tur på landet og nyde den friske luft.']

# Compute embeddings
model = SentenceTransformer("KennethTM/MiniLM-L6-danish-encoder-v2")
query_embeddings = model.encode(query)
passage_embeddings = model.encode(passage)

# To find most relevant passage for the query (closer to 1 means more similar)
cosine_scores = cos_sim(query_embeddings, passage_embeddings)
print(cosine_scores)
```

# Usage (HuggingFace Transformers)

Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("KennethTM/MiniLM-L6-danish-encoder-v2")
model = AutoModel.from_pretrained("KennethTM/MiniLM-L6-danish-encoder-v2")

# Given a query
query = ['Kører der cykler på vejen?']

# And two passages
passage = ['I Danmark er cykler et almindeligt transportmiddel, og de har lige så stor ret til at bruge vejene som bilister. Cyklister skal dog følge færdselsreglerne og vise hensyn til andre trafikanter.',
           'Solen skinner, og himlen er blå. Der er ingen vind, og temperaturen er perfekt. Det er den perfekte dag til at tage en tur på landet og nyde den friske luft.']

# Tokenize sentences
query_encoded = tokenizer(query, padding=True, truncation=True, return_tensors='pt')
passage_encoded = tokenizer(passage, padding=True, truncation=True, return_tensors='pt')

# Compute embeddings
with torch.no_grad():
    query_features = model(**query_encoded)
    passage_features = model(**passage_encoded)

# Perform pooling
query_embeddings = mean_pooling(query_features, query_encoded['attention_mask'])
passage_embeddings = mean_pooling(passage_features, passage_encoded['attention_mask'])

# To find most relevant passage for the query (closer to 1 means more similar)
cosine_scores = F.cosine_similarity(query_embeddings, passage_embeddings)
print(cosine_scores)
```
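
For semantic search, the cosine scores are typically sorted to rank passages by relevance to the query. The following is a minimal, dependency-free sketch of that step, not part of the model or of any library. The helper names `cosine_similarity` and `rank_passages` are hypothetical, and the three-dimensional toy vectors only stand in for the model's 384-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (sequences of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Guard against zero vectors, for which the similarity is undefined
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(query_embedding, passage_embeddings):
    """Return (passage_index, score) pairs sorted from most to least similar."""
    scores = [cosine_similarity(query_embedding, p) for p in passage_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


# Toy vectors standing in for real embeddings (assumption: illustration only)
toy_query = [1.0, 0.0, 1.0]
toy_passages = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_passages(toy_query, toy_passages)
for index, score in ranking:
    print(f"passage {index}: {score:.3f}")
```

With real embeddings from either usage example above, converting them to plain lists first (for instance `query_embeddings[0].tolist()` and `passage_embeddings.tolist()`) should make them compatible with this helper. For larger collections, a vectorized implementation would be preferable to this loop-based sketch.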