Omartificial-Intelligence-Space committed on
Commit
ab2ed1f
·
verified ·
1 Parent(s): 919f7cc

Update readme.md

Files changed (1)
  1. README.md +5 -324
README.md CHANGED
@@ -132,38 +132,23 @@ model-index:
132
  name: Spearman Max
133
  ---
134
 
135
- # SentenceTransformer based on Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka
136
 
137
- This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka](https://huggingface.co/Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka) on the all-nli and [sts](https://huggingface.co/datasets/Omartificial-Intelligence-Space/arabic-stsb) datasets. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
138
 
139
  ## Model Details
140
 
141
  ### Model Description
142
  - **Model Type:** Sentence Transformer
143
- - **Base model:** [Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka](https://huggingface.co/Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka) <!-- at revision d0361a36f6fe69febfc8550d0918abab174f6f30 -->
144
  - **Maximum Sequence Length:** 512 tokens
145
  - **Output Dimensionality:** 768 tokens
146
  - **Similarity Function:** Cosine Similarity
147
  - **Training Datasets:**
148
- - all-nli
149
  - [sts](https://huggingface.co/datasets/Omartificial-Intelligence-Space/arabic-stsb)
150
  - **Language:** ar
151
- <!-- - **License:** Unknown -->
152
 
153
- ### Model Sources
154
-
155
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
156
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
157
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
158
-
159
- ### Full Model Architecture
160
-
161
- ```
162
- SentenceTransformer(
163
- (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
164
- (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
165
- )
166
- ```
167
 
168
  ## Usage
169
 
@@ -180,7 +165,7 @@ Then you can load this model and run inference.
180
  from sentence_transformers import SentenceTransformer
181
 
182
  # Download from the 🤗 Hub
183
- model = SentenceTransformer("Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka-multi-task")
184
  # Run inference
185
  sentences = [
186
  'الكلب البني مستلقي على جانبه على سجادة بيج، مع جسم أخضر في المقدمة.',
@@ -197,29 +182,7 @@ print(similarities.shape)
197
  # [3, 3]
198
  ```
199
 
200
- <!--
201
- ### Direct Usage (Transformers)
202
-
203
- <details><summary>Click to see the direct usage in Transformers</summary>
204
-
205
- </details>
206
- -->
207
-
208
- <!--
209
- ### Downstream Usage (Sentence Transformers)
210
-
211
- You can finetune this model on your own dataset.
212
-
213
- <details><summary>Click to expand</summary>
214
-
215
- </details>
216
- -->
217
 
218
- <!--
219
- ### Out-of-Scope Use
220
-
221
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
222
- -->
223
 
224
  ## Evaluation
225
 
@@ -259,285 +222,3 @@ You can finetune this model on your own dataset.
259
  | pearson_max | 0.7923 |
260
  | spearman_max | 0.7947 |
261
 
262
- <!--
263
- ## Bias, Risks and Limitations
264
-
265
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
266
- -->
267
-
268
- <!--
269
- ### Recommendations
270
-
271
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
272
- -->
273
-
274
- ## Training Details
275
-
276
- ### Training Datasets
277
-
278
- #### all-nli
279
-
280
- * Dataset: all-nli
281
- * Size: 942,069 training samples
282
- * Columns: <code>premise</code>, <code>hypothesis</code>, and <code>label</code>
283
- * Approximate statistics based on the first 1000 samples:
284
- | | premise | hypothesis | label |
285
- |:--------|:----------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|:-------------------------------------------------------------------|
286
- | type | string | string | int |
287
- | details | <ul><li>min: 5 tokens</li><li>mean: 14.09 tokens</li><li>max: 43 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 8.28 tokens</li><li>max: 28 tokens</li></ul> | <ul><li>0: ~33.40%</li><li>1: ~33.30%</li><li>2: ~33.30%</li></ul> |
288
- * Samples:
289
- | premise | hypothesis | label |
290
- |:-----------------------------------------------|:--------------------------------------------|:---------------|
291
- | <code>شخص على حصان يقفز فوق طائرة معطلة</code> | <code>شخص يقوم بتدريب حصانه للمنافسة</code> | <code>1</code> |
292
- | <code>شخص على حصان يقفز فوق طائرة معطلة</code> | <code>شخص في مطعم، يطلب عجة.</code> | <code>2</code> |
293
- | <code>شخص على حصان يقفز فوق طائرة معطلة</code> | <code>شخص في الهواء الطلق، على حصان.</code> | <code>0</code> |
294
- * Loss: [<code>SoftmaxLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#softmaxloss)
295
-
296
- #### sts
297
-
298
- * Dataset: [sts](https://huggingface.co/datasets/Omartificial-Intelligence-Space/arabic-stsb) at [f5a6f89](https://huggingface.co/datasets/Omartificial-Intelligence-Space/arabic-stsb/tree/f5a6f89da460d307eff3acbbfcb62d0705cdbbb5)
299
- * Size: 5,749 training samples
300
- * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
301
- * Approximate statistics based on the first 1000 samples:
302
- | | sentence1 | sentence2 | score |
303
- |:--------|:---------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|:---------------------------------------------------------------|
304
- | type | string | string | float |
305
- | details | <ul><li>min: 4 tokens</li><li>mean: 7.46 tokens</li><li>max: 22 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 7.36 tokens</li><li>max: 18 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.54</li><li>max: 1.0</li></ul> |
306
- * Samples:
307
- | sentence1 | sentence2 | score |
308
- |:-----------------------------------------------|:--------------------------------------------------------|:------------------|
309
- | <code>طائرة ستقلع</code> | <code>طائرة جوية ستقلع</code> | <code>1.0</code> |
310
- | <code>رجل يعزف على ناي كبير</code> | <code>رجل يعزف على الناي.</code> | <code>0.76</code> |
311
- | <code>رجل ينشر الجبن الممزق على البيتزا</code> | <code>رجل ينشر الجبن الممزق على بيتزا غير مطبوخة</code> | <code>0.76</code> |
312
- * Loss: [<code>CosineSimilarityLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) with these parameters:
313
- ```json
314
- {
315
- "loss_fct": "torch.nn.modules.loss.MSELoss"
316
- }
317
- ```
318
-
319
- ### Evaluation Datasets
320
-
321
- #### all-nli
322
-
323
- * Dataset: all-nli
324
- * Size: 1,000 evaluation samples
325
- * Columns: <code>premise</code>, <code>hypothesis</code>, and <code>label</code>
326
- * Approximate statistics based on the first 1000 samples:
327
- | | premise | hypothesis | label |
328
- |:--------|:---------------------------------------------------------------------------------|:---------------------------------------------------------------------------------|:-------------------------------------------------------------------|
329
- | type | string | string | int |
330
- | details | <ul><li>min: 5 tokens</li><li>mean: 15.1 tokens</li><li>max: 48 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 8.11 tokens</li><li>max: 21 tokens</li></ul> | <ul><li>0: ~33.10%</li><li>1: ~33.30%</li><li>2: ~33.60%</li></ul> |
331
- * Samples:
332
- | premise | hypothesis | label |
333
- |:------------------------------------------------|:------------------------------------------------------------------------------|:---------------|
334
- | <code>امرأتان يتعانقان بينما يحملان طرود</code> | <code>الأخوات يعانقون بعضهم لوداعاً بينما يحملون حزمة بعد تناول الغداء</code> | <code>1</code> |
335
- | <code>امرأتان يتعانقان بينما يحملان حزمة</code> | <code>إمرأتان يحملان حزمة</code> | <code>0</code> |
336
- | <code>امرأتان يتعانقان بينما يحملان حزمة</code> | <code>الرجال يتشاجرون خارج مطعم</code> | <code>2</code> |
337
- * Loss: [<code>SoftmaxLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#softmaxloss)
338
-
339
- #### sts
340
-
341
- * Dataset: [sts](https://huggingface.co/datasets/Omartificial-Intelligence-Space/arabic-stsb) at [f5a6f89](https://huggingface.co/datasets/Omartificial-Intelligence-Space/arabic-stsb/tree/f5a6f89da460d307eff3acbbfcb62d0705cdbbb5)
342
- * Size: 1,500 evaluation samples
343
- * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
344
- * Approximate statistics based on the first 1000 samples:
345
- | | sentence1 | sentence2 | score |
346
- |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:---------------------------------------------------------------|
347
- | type | string | string | float |
348
- | details | <ul><li>min: 4 tokens</li><li>mean: 12.55 tokens</li><li>max: 42 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 12.49 tokens</li><li>max: 54 tokens</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.47</li><li>max: 1.0</li></ul> |
349
- * Samples:
350
- | sentence1 | sentence2 | score |
351
- |:--------------------------------------|:---------------------------------------|:------------------|
352
- | <code>رجل يرتدي قبعة صلبة يرقص</code> | <code>رجل يرتدي قبعة صلبة يرقص.</code> | <code>1.0</code> |
353
- | <code>طفل صغير يركب حصاناً.</code> | <code>طفل يركب حصاناً.</code> | <code>0.95</code> |
354
- | <code>رجل يطعم فأراً لأفعى</code> | <code>الرجل يطعم الفأر للثعبان.</code> | <code>1.0</code> |
355
- * Loss: [<code>CosineSimilarityLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) with these parameters:
356
- ```json
357
- {
358
- "loss_fct": "torch.nn.modules.loss.MSELoss"
359
- }
360
- ```
361
-
362
- ### Training Hyperparameters
363
- #### Non-Default Hyperparameters
364
-
365
- - `eval_strategy`: steps
366
- - `per_device_train_batch_size`: 16
367
- - `per_device_eval_batch_size`: 16
368
- - `num_train_epochs`: 1
369
- - `warmup_ratio`: 0.1
370
- - `fp16`: True
371
- - `multi_dataset_batch_sampler`: round_robin
372
-
373
- #### All Hyperparameters
374
- <details><summary>Click to expand</summary>
375
-
376
- - `overwrite_output_dir`: False
377
- - `do_predict`: False
378
- - `eval_strategy`: steps
379
- - `prediction_loss_only`: True
380
- - `per_device_train_batch_size`: 16
381
- - `per_device_eval_batch_size`: 16
382
- - `per_gpu_train_batch_size`: None
383
- - `per_gpu_eval_batch_size`: None
384
- - `gradient_accumulation_steps`: 1
385
- - `eval_accumulation_steps`: None
386
- - `learning_rate`: 5e-05
387
- - `weight_decay`: 0.0
388
- - `adam_beta1`: 0.9
389
- - `adam_beta2`: 0.999
390
- - `adam_epsilon`: 1e-08
391
- - `max_grad_norm`: 1.0
392
- - `num_train_epochs`: 1
393
- - `max_steps`: -1
394
- - `lr_scheduler_type`: linear
395
- - `lr_scheduler_kwargs`: {}
396
- - `warmup_ratio`: 0.1
397
- - `warmup_steps`: 0
398
- - `log_level`: passive
399
- - `log_level_replica`: warning
400
- - `log_on_each_node`: True
401
- - `logging_nan_inf_filter`: True
402
- - `save_safetensors`: True
403
- - `save_on_each_node`: False
404
- - `save_only_model`: False
405
- - `restore_callback_states_from_checkpoint`: False
406
- - `no_cuda`: False
407
- - `use_cpu`: False
408
- - `use_mps_device`: False
409
- - `seed`: 42
410
- - `data_seed`: None
411
- - `jit_mode_eval`: False
412
- - `use_ipex`: False
413
- - `bf16`: False
414
- - `fp16`: True
415
- - `fp16_opt_level`: O1
416
- - `half_precision_backend`: auto
417
- - `bf16_full_eval`: False
418
- - `fp16_full_eval`: False
419
- - `tf32`: None
420
- - `local_rank`: 0
421
- - `ddp_backend`: None
422
- - `tpu_num_cores`: None
423
- - `tpu_metrics_debug`: False
424
- - `debug`: []
425
- - `dataloader_drop_last`: False
426
- - `dataloader_num_workers`: 0
427
- - `dataloader_prefetch_factor`: None
428
- - `past_index`: -1
429
- - `disable_tqdm`: False
430
- - `remove_unused_columns`: True
431
- - `label_names`: None
432
- - `load_best_model_at_end`: False
433
- - `ignore_data_skip`: False
434
- - `fsdp`: []
435
- - `fsdp_min_num_params`: 0
436
- - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
437
- - `fsdp_transformer_layer_cls_to_wrap`: None
438
- - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
439
- - `deepspeed`: None
440
- - `label_smoothing_factor`: 0.0
441
- - `optim`: adamw_torch
442
- - `optim_args`: None
443
- - `adafactor`: False
444
- - `group_by_length`: False
445
- - `length_column_name`: length
446
- - `ddp_find_unused_parameters`: None
447
- - `ddp_bucket_cap_mb`: None
448
- - `ddp_broadcast_buffers`: False
449
- - `dataloader_pin_memory`: True
450
- - `dataloader_persistent_workers`: False
451
- - `skip_memory_metrics`: True
452
- - `use_legacy_prediction_loop`: False
453
- - `push_to_hub`: False
454
- - `resume_from_checkpoint`: None
455
- - `hub_model_id`: None
456
- - `hub_strategy`: every_save
457
- - `hub_private_repo`: False
458
- - `hub_always_push`: False
459
- - `gradient_checkpointing`: False
460
- - `gradient_checkpointing_kwargs`: None
461
- - `include_inputs_for_metrics`: False
462
- - `eval_do_concat_batches`: True
463
- - `fp16_backend`: auto
464
- - `push_to_hub_model_id`: None
465
- - `push_to_hub_organization`: None
466
- - `mp_parameters`:
467
- - `auto_find_batch_size`: False
468
- - `full_determinism`: False
469
- - `torchdynamo`: None
470
- - `ray_scope`: last
471
- - `ddp_timeout`: 1800
472
- - `torch_compile`: False
473
- - `torch_compile_backend`: None
474
- - `torch_compile_mode`: None
475
- - `dispatch_batches`: None
476
- - `split_batches`: None
477
- - `include_tokens_per_second`: False
478
- - `include_num_input_tokens_seen`: False
479
- - `neftune_noise_alpha`: None
480
- - `optim_target_modules`: None
481
- - `batch_eval_metrics`: False
482
- - `eval_on_start`: False
483
- - `batch_sampler`: batch_sampler
484
- - `multi_dataset_batch_sampler`: round_robin
485
-
486
- </details>
487
-
488
- ### Training Logs
489
- | Epoch | Step | Training Loss | all-nli loss | sts loss | sts-dev_spearman_cosine | sts-test_spearman_cosine |
490
- |:------:|:----:|:-------------:|:------------:|:--------:|:-----------------------:|:------------------------:|
491
- | 0.1389 | 100 | 0.5848 | 1.0957 | 0.0324 | 0.8309 | - |
492
- | 0.2778 | 200 | 0.5243 | 0.9695 | 0.0294 | 0.8386 | - |
493
- | 0.4167 | 300 | 0.5135 | 0.9486 | 0.0295 | 0.8398 | - |
494
- | 0.5556 | 400 | 0.4896 | 0.9366 | 0.0305 | 0.8317 | - |
495
- | 0.6944 | 500 | 0.5048 | 0.9201 | 0.0298 | 0.8395 | - |
496
- | 0.8333 | 600 | 0.4862 | 0.8885 | 0.0291 | 0.8370 | - |
497
- | 0.9722 | 700 | 0.4628 | 0.8893 | 0.0289 | 0.8389 | - |
498
- | 1.0 | 720 | - | - | - | - | 0.7893 |
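
The step count in the removed training log can be cross-checked against the removed dataset sizes and hyperparameters. The sketch below is only a sanity check, not the training code. It assumes a single device, that the last partial batch is kept (`dataloader_drop_last`: False), and that the `round_robin` multi-dataset sampler alternates between datasets and stops once the smallest one is exhausted. The variable names are illustrative.

```python
import math

# Values taken from the removed README sections.
sts_train_samples = 5_749
all_nli_train_samples = 942_069
batch_size = 16          # per_device_train_batch_size
warmup_ratio = 0.1

# Assumption: one device, last partial batch kept (dataloader_drop_last: False).
batches_per_dataset = [
    math.ceil(all_nli_train_samples / batch_size),
    math.ceil(sts_train_samples / batch_size),
]

# Assumption: round-robin sampling alternates datasets and stops once the
# smallest one is exhausted, so each dataset contributes the same batch count.
total_steps = min(batches_per_dataset) * len(batches_per_dataset)
warmup_steps = round(total_steps * warmup_ratio)

print(total_steps)                  # 720
print(round(100 / total_steps, 4))  # 0.1389
print(warmup_steps)                 # 72
```

Under these assumptions the result matches the log: 720 steps in one epoch, with step 100 falling at epoch 0.1389. It also suggests that only about 360 batches of all-nli (roughly 5,760 of its 942,069 samples) were seen during the single epoch.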
499
-
500
-
501
- ### Framework Versions
502
- - Python: 3.9.18
503
- - Sentence Transformers: 3.0.1
504
- - Transformers: 4.42.4
505
- - PyTorch: 2.2.2+cu121
506
- - Accelerate: 0.26.1
507
- - Datasets: 2.19.0
508
- - Tokenizers: 0.19.1
509
-
510
- ## Citation
511
-
512
- ### BibTeX
513
-
514
- #### Sentence Transformers and SoftmaxLoss
515
- ```bibtex
516
- @inproceedings{reimers-2019-sentence-bert,
517
- title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
518
- author = "Reimers, Nils and Gurevych, Iryna",
519
- booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
520
- month = "11",
521
- year = "2019",
522
- publisher = "Association for Computational Linguistics",
523
- url = "https://arxiv.org/abs/1908.10084",
524
- }
525
- ```
526
-
527
- <!--
528
- ## Glossary
529
-
530
- *Clearly define terms in order to be accessible across audiences.*
531
- -->
532
-
533
- <!--
534
- ## Model Card Authors
535
-
536
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
537
- -->
538
-
539
- <!--
540
- ## Model Card Contact
541
-
542
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
543
- -->
 
132
  name: Spearman Max
133
  ---
134
 
135
+ # GATE-AraBert-v0
136
 
137
+ This is a General Arabic Text Embedding trained using SentenceTransformers in a multi-task setup. The system trains on the AllNLI and on the STS dataset.
138
 
139
  ## Model Details
140
 
141
  ### Model Description
142
  - **Model Type:** Sentence Transformer
143
+ - **Base model:** [Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2](https://huggingface.co/Omartificial-Intelligence-Space/Arabic-Triplet-Matryoshka-V2) <!-- at revision 5ce4f80f3ede26de623d6ac10681399dba5c684a -->
144
  - **Maximum Sequence Length:** 512 tokens
145
  - **Output Dimensionality:** 768 tokens
146
  - **Similarity Function:** Cosine Similarity
147
  - **Training Datasets:**
148
+ - [all-nli](https://huggingface.co/datasets/Omartificial-Intelligence-Space/Arabic-NLi-Pair-Class)
149
  - [sts](https://huggingface.co/datasets/Omartificial-Intelligence-Space/arabic-stsb)
150
  - **Language:** ar
 
151
 
152
 
153
  ## Usage
154
 
 
165
  from sentence_transformers import SentenceTransformer
166
 
167
  # Download from the 🤗 Hub
168
+ model = SentenceTransformer("Omartificial-Intelligence-Space/GATE-AraBert-v0")
169
  # Run inference
170
  sentences = [
171
  'الكلب البني مستلقي على جانبه على سجادة بيج، مع جسم أخضر في المقدمة.',
 
182
  # [3, 3]
183
  ```
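
The `[3, 3]` shape printed above is a pairwise similarity matrix: one row and one column per input sentence. As a minimal sketch of what cosine similarity computes, the pure-Python example below uses toy 4-dimensional vectors as stand-ins for the model's 768-dimensional embeddings. The function name `cosine_similarity_matrix` and the vectors are hypothetical and are not part of the library.

```python
import math

def cosine_similarity_matrix(embeddings):
    """Return the pairwise cosine-similarity matrix for a list of vectors."""
    norms = [math.sqrt(sum(x * x for x in vec)) for vec in embeddings]
    matrix = []
    for i, a in enumerate(embeddings):
        row = []
        for j, b in enumerate(embeddings):
            dot = sum(x * y for x, y in zip(a, b))
            row.append(dot / (norms[i] * norms[j]))
        matrix.append(row)
    return matrix

# Toy stand-ins for sentence embeddings (hypothetical values).
toy_embeddings = [
    [1.0, 0.0, 2.0, 0.0],
    [2.0, 0.0, 4.0, 0.0],   # same direction as the first vector
    [0.0, 3.0, 0.0, 1.0],   # orthogonal to the first two
]

sims = cosine_similarity_matrix(toy_embeddings)
print(len(sims), len(sims[0]))  # 3 3
```

Vectors that point in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0, which is why cosine similarity is a common choice for comparing sentence embeddings.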
184
 
185
 
186
 
187
  ## Evaluation
188
 
 
222
  | pearson_max | 0.7923 |
223
  | spearman_max | 0.7947 |
224