
# Model Card for KartonBERT-USE-base-v1

This universal sentence encoder maps text into a 768-dimensional vector space, producing effective dense representations. It is designed for tasks involving sentence and document similarity.

Despite its small size (only 104 million parameters), the model maintains a high level of performance. It uses a lowercase-optimized tokenizer with a vocabulary of 23,000 tokens. This balance between compactness and effectiveness lets the model deliver strong results in text-encoding tasks, with both the speed and the accuracy needed for real-time applications.
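As a quick sanity check, these properties can be read off the loaded model itself. A minimal sketch using the transformers API; the printed values should match the 768-dimensional embeddings and 23,000-token vocabulary described above:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('OrlikB/KartonBERT-USE-base-v1')
model = AutoModel.from_pretrained('OrlikB/KartonBERT-USE-base-v1')

# Both values come from the model's config and tokenizer files.
print(model.config.hidden_size)  # expected: 768 (embedding dimension)
print(len(tokenizer))            # expected: ~23000 (vocabulary size)
```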

## Model Description

## How to Get Started with the Model

Use the code below to get started with the model.

### Using Sentence-Transformers

You can use the model with sentence-transformers:

```bash
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('OrlikB/KartonBERT-USE-base-v1')

text_1 = 'Jestem wielkim fanem opakowań tekturowych'  # 'I am a big fan of cardboard packaging'
text_2 = 'Bardzo podobają mi się kartony'             # 'I really like cardboard boxes'

# normalize_embeddings=True L2-normalizes the vectors,
# so the dot product below is the cosine similarity.
embeddings_1 = model.encode(text_1, normalize_embeddings=True)
embeddings_2 = model.encode(text_2, normalize_embeddings=True)

similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```
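To compare many sentences at once, `encode` also accepts a list and batches it internally, and `sentence_transformers.util.cos_sim` returns the full similarity matrix. A minimal sketch; the third sentence is an illustrative addition:

```python
from sentence_transformers import util

sentences = [
    'Jestem wielkim fanem opakowań tekturowych',
    'Bardzo podobają mi się kartony',
    'Dziś pada deszcz',  # hypothetical extra sentence: 'It is raining today'
]
embeddings = model.encode(sentences, normalize_embeddings=True, batch_size=32)
print(util.cos_sim(embeddings, embeddings))  # 3x3 cosine-similarity matrix
```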

### Using HuggingFace Transformers

```python
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

def encode_text(text):
    encoded_input = tokenizer(text, padding=True, truncation=True, return_tensors='pt', max_length=512)
    with torch.no_grad():
        model_output = model(**encoded_input)
        # CLS pooling: take the hidden state of the first ([CLS]) token.
        sentence_embeddings = model_output[0][:, 0]
        # L2-normalize so dot products equal cosine similarities.
        sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings.squeeze().numpy()

cosine_similarity = lambda a, b: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

tokenizer = AutoTokenizer.from_pretrained('OrlikB/KartonBERT-USE-base-v1')
model = AutoModel.from_pretrained('OrlikB/KartonBERT-USE-base-v1')
model.eval()

text_1 = 'Jestem wielkim fanem opakowań tekturowych'  # 'I am a big fan of cardboard packaging'
text_2 = 'Bardzo podobają mi się kartony'             # 'I really like cardboard boxes'

embeddings_1 = encode_text(text_1)
embeddings_2 = encode_text(text_2)

print(cosine_similarity(embeddings_1, embeddings_2))
```

*Note: the `encode_text` function above is for demonstration purposes. For the best performance, it is recommended to process texts in batches.*
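A minimal sketch of such batched encoding, reusing the tokenizer, model, and CLS pooling from the snippet above (the batch size of 32 is an arbitrary choice; tune it to your hardware):

```python
def encode_batch(texts, batch_size=32):
    # Encode a list of texts chunk by chunk and stack the results.
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        encoded = tokenizer(batch, padding=True, truncation=True, return_tensors='pt', max_length=512)
        with torch.no_grad():
            output = model(**encoded)
            # Same CLS pooling + L2 normalization as encode_text above.
            embeddings = torch.nn.functional.normalize(output[0][:, 0], p=2, dim=1)
        all_embeddings.append(embeddings)
    return torch.cat(all_embeddings).numpy()

print(encode_batch([text_1, text_2]).shape)  # (2, 768)
```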

## Evaluation

### MTEB for Polish Language

| Rank | Model | Model Size (Million Parameters) | Memory Usage (GB, fp32) | Embedding Dimensions | Max Tokens | Average (26 datasets) | Classification Average (7 datasets) | Clustering Average (1 dataset) | PairClassification Average (4 datasets) | Retrieval Average (11 datasets) | STS Average (3 datasets) |
|---:|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | bge-multilingual-gemma2 | 9242 | 34.43 | 3584 | 8192 | 70 | 77.99 | 50.29 | 89.62 | 59.41 | 70.64 |
| 2 | gte-Qwen2-7B-instruct | 7613 | 28.36 | 3584 | 131072 | 67.86 | 77.84 | 51.36 | 88.48 | 54.69 | 70.86 |
| 3 | gte-Qwen2-1.5B-instruct | 1776 | 6.62 | 1536 | 131072 | 64.04 | 72.29 | 44.59 | 84.87 | 51.88 | 68.12 |
| 4 | jina-embeddings-v3 | 572 | 2.13 | 1024 | 8194 | 63.97 | 70.81 | 43.66 | 83.70 | 51.89 | 72.77 |
| 5 | jina-embeddings-v3 | 572 | 2.13 | 1024 | 8194 | 63.97 | 70.81 | 43.66 | 83.70 | 51.89 | 72.77 |
| 6 | mmlw-roberta-large | 435 | 1.62 | 1024 | 514 | 63.23 | 66.39 | 31.16 | 89.13 | 52.71 | 70.59 |
| 7 | **KartonBERT-USE-base-v1** | 104 | 0.39 | 768 | 512 | 61.67 | 67.57 | 29.88 | 87.04 | 49.14 | 70.65 |
| 8 | mmlw-e5-large | 560 | 2.09 | 1024 | 514 | 61.17 | 61.07 | 30.62 | 85.90 | 52.63 | 69.98 |
| 9 | mmlw-roberta-base | 124 | 0.46 | 768 | 514 | 61.05 | 62.92 | 33.08 | 88.14 | 49.92 | 70.70 |
| 10 | multilingual-e5-large | 560 | 2.09 | 1024 | 514 | 60.08 | 63.82 | 33.88 | 85.50 | 48.98 | 66.91 |
| 11 | mmlw-e5-base | 278 | 1.04 | 768 | 514 | 59.71 | 59.52 | 30.25 | 86.16 | 50.06 | 70.13 |
| 12 | gte-multilingual-base | 305 | 1.14 | 768 | 8192 | 58.22 | 60.15 | 33.67 | 85.45 | 46.40 | 68.92 |
| 13 | st-polish-kartonberta-base-alpha-v1 | 124 | 0.46 | 768 | 514 | 56.92 | 60.44 | 32.85 | 87.92 | 42.19 | 69.47 |
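To reproduce numbers like these, something along the following lines should work with the `mteb` package. A sketch, assuming a recent mteb version; the exact task coverage depends on the version installed, and the output folder is an illustrative choice:

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('OrlikB/KartonBERT-USE-base-v1')

# Select the Polish-language benchmark tasks.
tasks = mteb.get_tasks(languages=['pol'])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder='results/KartonBERT-USE-base-v1')
```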

## More Information

If I have spare computing resources (GPUs), I may further train the model to improve its quality.
