	---
	license: cc-by-4.0
	language:
	- az
	metrics:
	- pearsonr
	base_model:
	- sentence-transformers/LaBSE
	pipeline_tag: sentence-similarity
	widget:
	- source_sentence: Bu xoşbəxt bir insandır
	  sentences:
	  - Bu xoşbəxt bir itdir
	  - Bu çox xoşbəxt bir insandır
	  - Bu gün günəşli bir gündür
	  example_title: Sentence Similarity
	tags:
	- labse
	---

	# TEmA-small

	This model is a fine-tuned version of [LaBSE](https://huggingface.co/sentence-transformers/LaBSE), specialized for sentence similarity on Azerbaijani text.
	It maps sentences and paragraphs to a 768-dimensional dense vector space, which makes it useful for tasks such as clustering and semantic search.




	## Benchmark Results

	| STSBenchmark | biosses-sts | sickr-sts | sts12-sts | sts13-sts | sts15-sts | sts16-sts | Average Pearson | Model |
	|--------------|-------------|-----------|-----------|-----------|-----------|-----------|-----------------|------------------------------------|
	| 0.8253 | 0.7859 | 0.7924 | 0.8444 | 0.7490 | 0.8141 | 0.7600 | 0.7959 | TEmA-small |
	| 0.7872 | 0.8303 | 0.7801 | 0.7978 | 0.6963 | 0.8052 | 0.7794 | 0.7823 | Cohere/embed-multilingual-v3.0 |
	| 0.7927 | 0.6672 | 0.7758 | 0.8122 | 0.7312 | 0.7831 | 0.7416 | 0.7577 | BAAI/bge-m3 |
	| 0.7572 | 0.8139 | 0.7328 | 0.7646 | 0.6318 | 0.7542 | 0.7092 | 0.7377 | intfloat/multilingual-e5-large-instruct |
	| 0.7400 | 0.8216 | 0.6946 | 0.7098 | 0.6781 | 0.7637 | 0.7222 | 0.7329 | labse_stripped |
	| 0.7485 | 0.7714 | 0.7271 | 0.7170 | 0.6496 | 0.7570 | 0.7255 | 0.7280 | intfloat/multilingual-e5-large |
	| 0.7245 | 0.8237 | 0.6839 | 0.6570 | 0.7125 | 0.7612 | 0.7386 | 0.7288 | OpenAI/text-embedding-3-large |
	| 0.7363 | 0.8148 | 0.7067 | 0.7050 | 0.6535 | 0.7514 | 0.7070 | 0.7250 | sentence-transformers/LaBSE |
	| 0.7376 | 0.7917 | 0.7190 | 0.7441 | 0.6286 | 0.7461 | 0.7026 | 0.7242 | intfloat/multilingual-e5-small |
	| 0.7192 | 0.8198 | 0.7160 | 0.7338 | 0.5815 | 0.7318 | 0.6973 | 0.7142 | Cohere/embed-multilingual-light-v3.0 |
	| 0.6960 | 0.8185 | 0.6950 | 0.6752 | 0.5899 | 0.7186 | 0.6790 | 0.6960 | intfloat/multilingual-e5-base |
	| 0.5830 | 0.2486 | 0.5921 | 0.5593 | 0.5559 | 0.5404 | 0.5289 | 0.5155 | antoinelouis/colbert-xm |


	[STS-Benchmark](https://github.com/LocalDoc-Azerbaijan/STS-Benchmark)
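
For readers unfamiliar with the metric, the sketch below shows how a Pearson score of this kind is typically computed, and how the "Average Pearson" column relates to the per-dataset columns. It assumes the benchmark correlates model similarity scores with human ratings, which this card does not spell out. The `predicted` and `gold` values are made-up illustration data, not benchmark data. The averaging step uses the TEmA-small row from the table above, whose reported average is consistent with a plain arithmetic mean.

```python
import math
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: model cosine similarities vs. human ratings for five pairs.
predicted = [0.91, 0.35, 0.77, 0.12, 0.58]
gold = [4.6, 1.8, 4.0, 0.5, 3.1]
score = pearson(predicted, gold)

# Per-dataset scores from the TEmA-small row of the table above.
tema_small = [0.8253, 0.7859, 0.7924, 0.8444, 0.7490, 0.8141, 0.7600]
average = round(mean(tema_small), 4)

print(f"Pearson on toy data: {score:.4f}")
print(f"Average Pearson for TEmA-small: {average}")
```

Pearson correlation is invariant to the scale of the gold ratings, so it does not matter whether human scores run from 0 to 5 or from 0 to 1.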




	## Accuracy Results
	- Cosine Distance: 96.63
	- Manhattan Distance: 96.52
	- Euclidean Distance: 96.57
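
The card does not state the evaluation protocol behind these accuracy figures, so the following is only a sketch of the three distance functions named in the list, written in plain Python on toy vectors. The vectors `u` and `v` are illustrative stand-ins for 768-dimensional embeddings.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance (L2 distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy 3-dimensional vectors standing in for real embeddings.
u = [1.0, 0.0, 0.0]
v = [0.0, 1.0, 0.0]

print(cosine_distance(u, v), manhattan_distance(u, v), euclidean_distance(u, v))
```

For unit-length embeddings, squared Euclidean distance equals twice the cosine distance, so those two metrics rank pairs identically. This may explain why the reported figures are so close.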




	## Usage

	```python
	from transformers import AutoTokenizer, AutoModel
	import torch


	# Mean pooling: take the attention mask into account for correct averaging
	def mean_pooling(model_output, attention_mask):
	    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
	    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
	    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


	# Normalize embeddings to unit length
	def normalize_embeddings(embeddings):
	    return embeddings / embeddings.norm(dim=1, keepdim=True)


	# Sentences we want embeddings for
	sentences = [
	    "Bu xoşbəxt bir insandır",
	    "Bu çox xoşbəxt bir insandır",
	    "Bu gün günəşli bir gündür",
	]

	# Load model from the Hugging Face Hub
	tokenizer = AutoTokenizer.from_pretrained('LocalDoc/TEmA-small')
	model = AutoModel.from_pretrained('LocalDoc/TEmA-small')

	# Tokenize sentences
	encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')

	# Compute token embeddings
	with torch.no_grad():
	    model_output = model(**encoded_input)

	# Perform pooling
	sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

	# Normalize embeddings
	sentence_embeddings = normalize_embeddings(sentence_embeddings)

	# Calculate cosine similarities between the first sentence and the rest
	cosine_similarities = torch.nn.functional.cosine_similarity(
	    sentence_embeddings[0].unsqueeze(0),
	    sentence_embeddings[1:],
	    dim=1,
	)

	print("Cosine Similarities:")
	for i, score in enumerate(cosine_similarities):
	    print(f"Sentence 1 <-> Sentence {i+2}: {score:.4f}")
	```
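
The introduction lists semantic search as a use case. Once sentence embeddings are available (for example, `sentence_embeddings.tolist()` from the snippet above), a search reduces to ranking corpus vectors by cosine similarity to a query vector. The sketch below illustrates that step in plain Python. The helper `rank_by_similarity` and the two-dimensional toy vectors are hypothetical and not part of the model or any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus_vecs, top_k=3):
    """Return (index, similarity) pairs for the top_k most similar corpus vectors."""
    scored = [(idx, cosine_similarity(query_vec, vec)) for idx, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional vectors standing in for real sentence embeddings.
query = [1.0, 0.0]
corpus = [[0.0, 1.0], [1.0, 0.1], [0.7, 0.7]]

results = rank_by_similarity(query, corpus, top_k=2)
for idx, sim in results:
    print(f"corpus[{idx}] similarity={sim:.4f}")
```

For larger corpora, a vectorized matrix product over normalized embeddings or an approximate nearest-neighbor index would be the more practical choice. The loop here is only meant to make the ranking logic explicit.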