--- license: apache-2.0 pipeline_tag: text-classification tags: - transformers - sentence-transformers - text-embeddings-inference language: - ko - multilingual --- # upskyy/ko-reranker-8k **ko-reranker-8k**는 [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) 모델에 [한국어 데이터](https://huggingface.co/datasets/upskyy/ko-wiki-reranking)를 finetuning 한 model 입니다. ## Usage ## Using FlagEmbedding ``` pip install -U FlagEmbedding ``` Get relevance scores (higher scores indicate more relevance): ```python from FlagEmbedding import FlagReranker reranker = FlagReranker('upskyy/ko-reranker-8k', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation score = reranker.compute_score(['query', 'passage']) print(score) # -8.3828125 # You can map the scores into 0-1 by set "normalize=True", which will apply sigmoid function to the score score = reranker.compute_score(['query', 'passage'], normalize=True) print(score) # 0.000228713314721116 scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]) print(scores) # [-11.2265625, 8.6875] # You can map the scores into 0-1 by set "normalize=True", which will apply sigmoid function to the score scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']], normalize=True) print(scores) # [1.3315579521758342e-05, 0.9998313472460109] ``` ## Using Huggingface transformers Get relevance scores (higher scores indicate more relevance): ```python import torch from transformers import AutoModelForSequenceClassification, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained('upskyy/ko-reranker-8k') model = AutoModelForSequenceClassification.from_pretrained('upskyy/ko-reranker-8k') model.eval() pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']] with torch.no_grad(): inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512) scores = model(**inputs, return_dict=True).logits.view(-1, ).float() print(scores) ``` ## Citation ```bibtex @misc{li2023making, title={Making Large Language Models A Better Foundation For Dense Retrieval}, author={Chaofan Li and Zheng Liu and Shitao Xiao and Yingxia Shao}, year={2023}, eprint={2312.15503}, archivePrefix={arXiv}, primaryClass={cs.CL} } @misc{chen2024bge, title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation}, author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu}, year={2024}, eprint={2402.03216}, archivePrefix={arXiv}, primaryClass={cs.CL} } ``` ## Reference - [Dongjin-kr/ko-reranker](https://huggingface.co/Dongjin-kr/ko-reranker) - [reranker-kr](https://github.com/aws-samples/aws-ai-ml-workshop-kr/tree/master/genai/aws-gen-ai-kr/30_fine_tune/reranker-kr) - [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding)