Encoding a large Knowledge Base

#25
by SachinVashistha - opened

Hi, I looked into the code given in the description for using UAE-Large-V1.

from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity

# load UAE-Large-V1 with CLS pooling and move it to the GPU
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# the query uses the retrieval prompt; documents are encoded as-is
qv = angle.encode(Prompts.C.format(text='what is the weather?'))
doc_vecs = angle.encode([
    'The weather is great!',
    'it is rainy today.',
    'i am going to bed',
])

for dv in doc_vecs:
    print(cosine_similarity(qv[0], dv))

In this code, the variable "doc_vecs" contains the embeddings of the three sentences in the list. If I have a list of millions of sentences (i.e. a knowledge base), is there a fast way to encode them?

WhereIsAI org

Hi @SachinVashistha , we suggest using Mixedbread's batched library to encode large-scale data.
Here is an example: https://angle.readthedocs.io/en/latest/notes/quickstart.html#batch-inference
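In case it helps, here is a rough sketch of the underlying idea: encode the corpus in fixed-size chunks so GPU memory stays bounded, then stack the chunk outputs. The `encode_fn` parameter and the `toy_encode` stand-in below are placeholders for illustration (not part of angle_emb); in real use you would pass `angle.encode` from the snippet above, and the `batch_size=64` default is an arbitrary assumption you should tune for your GPU.

```python
import numpy as np

def encode_in_batches(sentences, encode_fn, batch_size=64):
    """Encode a large corpus chunk by chunk to keep memory bounded.

    encode_fn is any callable mapping a list of strings to a 2-D array,
    e.g. angle.encode from the snippet above (hypothetical wiring here).
    """
    parts = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        parts.append(np.asarray(encode_fn(batch)))
    # rows come back in the original corpus order
    return np.vstack(parts)

# stand-in encoder for illustration only: each sentence -> a 4-dim vector
def toy_encode(batch):
    return np.array([[len(s), s.count(' '), 1.0, 0.0] for s in batch])

corpus = [f'sentence number {i}' for i in range(10)]
vecs = encode_in_batches(corpus, toy_encode, batch_size=3)
print(vecs.shape)  # (10, 4)
```

The linked quickstart shows the library's actual batch-inference API; this is just the shape of the loop it replaces.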

BTW, if you have multiple GPUs, you could use them to accelerate inference, example: https://github.com/SeanLee97/AnglE/blob/main/examples/multigpu_infer.py
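The shard-per-GPU idea in that example can be sketched roughly as follows: split the corpus into one contiguous shard per device and encode the shards concurrently. Everything here is a simplified stand-in, not the linked script's actual code; in a real setup each worker would hold an AnglE model pinned to its own `cuda:{device_id}` and call `encode` on its shard.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def encode_shard(device_id, shard):
    # stand-in encoder for illustration; a real worker would use a model
    # loaded on f'cuda:{device_id}' (e.g. via .cuda(device_id)) instead
    return np.array([[float(len(s)), float(device_id)] for s in shard])

def encode_on_gpus(sentences, n_gpus=2):
    # contiguous shards keep the final row order equal to the input order
    bounds = np.linspace(0, len(sentences), n_gpus + 1, dtype=int)
    shards = [sentences[bounds[i]:bounds[i + 1]] for i in range(n_gpus)]
    with ThreadPoolExecutor(max_workers=n_gpus) as pool:
        parts = pool.map(encode_shard, range(n_gpus), shards)
    return np.vstack(list(parts))

corpus = [f'doc {i}' for i in range(10)]
vecs = encode_on_gpus(corpus, n_gpus=2)
print(vecs.shape)  # (10, 2)
```

Threads are enough for this sketch because GPU inference releases the GIL during the heavy work; the linked multigpu_infer.py shows the library's actual approach.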
