File size: 3,743 Bytes
2f341ce 3b9b6cc faba9ef 2f2fec5 2f341ce 3b9b6cc 7fc8423 1120190 2f341ce 3b9b6cc 2f341ce 4127d14 2f341ce 3b9b6cc 2f341ce 2f2fec5 2f341ce 3b9b6cc 2f341ce 3b9b6cc 2f341ce 3b9b6cc 2f341ce 3b9b6cc 2f341ce 8423799 2f341ce 3b9b6cc 2f341ce 3b9b6cc 2f341ce 4e8b396 dae03a8 d45db09 4e8b396 d324c6c 4e8b396 3b9b6cc |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 |
---
pipeline_tag: sentence-similarity
license: apache-2.0
language:
- cs
- da
- de
- en
- es
- fi
- fr
- he
- hr
- hu
- id
- it
- nl
- 'no'
- pl
- pt
- ro
- ru
- sv
- tr
- vi
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
datasets:
- clips/mfaq
widget:
source_sentence: "<Q>How many models can I host on HuggingFace?"
sentences:
- "<A>All plans come with unlimited private models and datasets."
- "<A>AutoNLP is an automatic way to train and deploy state-of-the-art NLP models, seamlessly integrated with the Hugging Face ecosystem."
- "<A>Based on how much training data and model variants are created, we send you a compute cost and payment link - as low as $10 per job."
---
# MFAQ
We present a multilingual FAQ retrieval model trained on the [MFAQ dataset](https://huggingface.co/datasets/clips/mfaq), it ranks candidate answers according to a given question.
## Installation
```
pip install sentence-transformers transformers
```
## Usage
You can use MFAQ with sentence-transformers or directly with a HuggingFace model.
In both cases, questions need to be prepended with `<Q>`, and answers with `<A>`.
#### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer
question = "<Q>How many models can I host on HuggingFace?"
answer_1 = "<A>All plans come with unlimited private models and datasets."
answer_2 = "<A>AutoNLP is an automatic way to train and deploy state-of-the-art NLP models, seamlessly integrated with the Hugging Face ecosystem."
answer_3 = "<A>Based on how much training data and model variants are created, we send you a compute cost and payment link - as low as $10 per job."
model = SentenceTransformer('clips/mfaq')
embeddings = model.encode([question, answer_1, answer_3, answer_3])
print(embeddings)
```
#### HuggingFace Transformers
```python
from transformers import AutoTokenizer, AutoModel
import torch
def mean_pooling(model_output, attention_mask):
token_embeddings = model_output[0] #First element of model_output contains all token embeddings
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
question = "<Q>How many models can I host on HuggingFace?"
answer_1 = "<A>All plans come with unlimited private models and datasets."
answer_2 = "<A>AutoNLP is an automatic way to train and deploy state-of-the-art NLP models, seamlessly integrated with the Hugging Face ecosystem."
answer_3 = "<A>Based on how much training data and model variants are created, we send you a compute cost and payment link - as low as $10 per job."
tokenizer = AutoTokenizer.from_pretrained('clips/mfaq')
model = AutoModel.from_pretrained('clips/mfaq')
# Tokenize sentences
encoded_input = tokenizer([question, answer_1, answer_3, answer_3], padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
model_output = model(**encoded_input)
# Perform pooling. In this case, max pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
```
## Training
You can find the training script for the model [here](https://github.com/clips/mfaq).
## People
This model was developed by [Maxime De Bruyn](https://www.linkedin.com/in/maximedebruyn/), Ehsan Lotfi, Jeska Buhmann and Walter Daelemans.
## Citation information
```
@misc{debruyn2021mfaq,
title={MFAQ: a Multilingual FAQ Dataset},
author={Maxime De Bruyn and Ehsan Lotfi and Jeska Buhmann and Walter Daelemans},
year={2021},
eprint={2109.12870},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
``` |