bhavnicksm committed on
Commit 1188b24 · verified · 1 Parent(s): f52f17c

Update README.md

Files changed (1)
  1. README.md +130 -130
README.md CHANGED
@@ -1,130 +1,130 @@
- ---
- base_model: baai/bge-base-en-v1.5
- language:
- - en
- library_name: model2vec
- license: mit
- model_name: brown-fairy-base-v0
- tags:
- - embeddings
- - static-embeddings
- - sentence-transformers
- ---
- # 🪲 brown-fairy-base-v0 Model Card
-
- <div align="center">
- <img width="50%" alt="Fairy logo" src="./assets/fairy_logo.png">
- </div>
-
- > [!TIP]
- > Fairies are among the most enchanting and magical beings in folklore and mythology. They appear across countless cultures and stories, from ancient forests to modern gardens. They are celebrated for their ability to bridge the mundane and magical realms, known for their ethereal grace and transformative powers. Fairies are tiny, higher-dimensional beings that can interact with the world in ways that are beyond our understanding.
-
- The fairy series of models are an attempt to tune the beetle series of models to be more suitable for downstream tasks. These models are meant to fully open experiments at making state-of-the-art static embeddings.
-
- The brown-fairy-base-v0 model is a distillation of the `baai/bge-base-en-v1.5` model into the `brown-beetle-base-v0` model. There was no PCA or Zipf applied to this model.
-
- ## Installation
-
- Install model2vec using pip:
-
- ```bash
- pip install model2vec
- ```
-
- ## Usage
-
- Load this model using the `from_pretrained` method:
-
- ```python
- from model2vec import StaticModel
-
- # Load a pretrained Model2Vec model
- model = StaticModel.from_pretrained("bhavnicksm/brown-fairy-base-v0")
-
- # Compute text embeddings
- embeddings = model.encode(["Example sentence"])
- ```
-
- Read more about the Model2Vec library [here](https://github.com/MinishLab/model2vec).
-
- ## Reproduce this model
-
- This model was trained on a subset of the 2 Million texts from the [FineWeb-Edu](https://huggingface.co/datasets/mixedbread-ai/fineweb-edu) dataset, which was labeled by the `baai/bge-base-en-v1.5` model.
-
- <details>
- <summary>Training Code</summary>
-
- Note: The datasets need to me made seperately and loaded with the `datasets` library.
-
- ```python
- static_embedding = StaticEmbedding.from_model2vec("bhavnicksm/brown-beetle-base-v0")
- model = SentenceTransformer(
-     modules=[static_embedding]
- )
-
- loss = MSELoss(model)
-
- run_name = "brown-fairy-base-v0"
- args = SentenceTransformerTrainingArguments(
-     # Required parameter:
-     output_dir=f"output/{run_name}",
-     # Optional training parameters:
-     num_train_epochs=1,
-     per_device_train_batch_size=2048,
-     per_device_eval_batch_size=2048,
-     learning_rate=1e-1,
-     warmup_ratio=0.1,
-     fp16=False, # Set to False if you get an error that your GPU can't run on FP16
-     bf16=True, # Set to True if you have a GPU that supports BF16
-     batch_sampler=BatchSamplers.NO_DUPLICATES,
-     # Optional tracking/debugging parameters:
-     eval_strategy="steps",
-     eval_steps=50,
-     save_strategy="steps",
-     save_steps=50,
-     save_total_limit=5,
-     logging_steps=50,
-     logging_first_step=True,
-     run_name=run_name,
- )
-
- evaluator = NanoBEIREvaluator()
- evaluator(model)
-
- trainer = SentenceTransformerTrainer(
-     model=model,
-     args=args,
-     train_dataset=train_dataset,
-     eval_dataset=eval_dataset,
-     loss=loss,
-     evaluator=evaluator,
- )
- trainer.train()
-
- evaluator(model)
-
- model.save_pretrained(f"output/{run_name}")
- ```
-
- </details>
-
- ## Comparison with other models
-
- Coming soon...
-
- ## Acknowledgements
-
- This model is based on the [Model2Vec](https://github.com/MinishLab/model2vec) library. Credit goes to the [Minish Lab](https://github.com/MinishLab) team for developing this library.
-
- ## Citation
-
- This model builds on work done by Minish Lab. Please cite the [Model2Vec repository](https://github.com/MinishLab/model2vec) if you use this model in your work.
-
- ```bibtex
- @software{minishlab2024model2vec,
- authors = {Stephan Tulkens, Thomas van Dongen},
- title = {Model2Vec: Turn any Sentence Transformer into a Small Fast Model},
- year = {2024},
- url = {https://github.com/MinishLab/model2vec},
- }
- ```
 
+ ---
+ base_model: baai/bge-base-en-v1.5
+ language:
+ - en
+ library_name: model2vec
+ license: mit
+ model_name: brown-fairy-base-v0
+ tags:
+ - embeddings
+ - static-embeddings
+ - sentence-transformers
+ ---
+ # 🧚🏻‍♀️ brown-fairy-base-v0 Model Card
+
+ <div align="center">
+ <img width="50%" alt="Fairy logo" src="./assets/fairy_logo.png">
+ </div>
+
+ > [!TIP]
+ > Fairies are among the most enchanting and magical beings in folklore and mythology. They appear across countless cultures and stories, from ancient forests to modern gardens. They are celebrated for their ability to bridge the mundane and magical realms, known for their ethereal grace and transformative powers. Fairies are tiny, higher-dimensional beings that can interact with the world in ways that are beyond our understanding.
+
+ The fairy series of models are an attempt to tune the beetle series of models to be more suitable for downstream tasks. These models are meant to fully open experiments at making state-of-the-art static embeddings.
+
+ The brown-fairy-base-v0 model is a distillation of the `baai/bge-base-en-v1.5` model into the `brown-beetle-base-v0` model. There was no PCA or Zipf applied to this model.
+
+ ## Installation
+
+ Install model2vec using pip:
+
+ ```bash
+ pip install model2vec
+ ```
+
+ ## Usage
+
+ Load this model using the `from_pretrained` method:
+
+ ```python
+ from model2vec import StaticModel
+
+ # Load a pretrained Model2Vec model
+ model = StaticModel.from_pretrained("bhavnicksm/brown-fairy-base-v0")
+
+ # Compute text embeddings
+ embeddings = model.encode(["Example sentence"])
+ ```
+
+ Read more about the Model2Vec library [here](https://github.com/MinishLab/model2vec).
+
+ ## Reproduce this model
+
+ This model was trained on a subset of the 2 Million texts from the [FineWeb-Edu](https://huggingface.co/datasets/mixedbread-ai/fineweb-edu) dataset, which was labeled by the `baai/bge-base-en-v1.5` model.
+
+ <details>
+ <summary>Training Code</summary>
+
+ Note: The datasets need to me made seperately and loaded with the `datasets` library.
+
+ ```python
+ static_embedding = StaticEmbedding.from_model2vec("bhavnicksm/brown-beetle-base-v0")
+ model = SentenceTransformer(
+     modules=[static_embedding]
+ )
+
+ loss = MSELoss(model)
+
+ run_name = "brown-fairy-base-v0"
+ args = SentenceTransformerTrainingArguments(
+     # Required parameter:
+     output_dir=f"output/{run_name}",
+     # Optional training parameters:
+     num_train_epochs=1,
+     per_device_train_batch_size=2048,
+     per_device_eval_batch_size=2048,
+     learning_rate=1e-1,
+     warmup_ratio=0.1,
+     fp16=False, # Set to False if you get an error that your GPU can't run on FP16
+     bf16=True, # Set to True if you have a GPU that supports BF16
+     batch_sampler=BatchSamplers.NO_DUPLICATES,
+     # Optional tracking/debugging parameters:
+     eval_strategy="steps",
+     eval_steps=50,
+     save_strategy="steps",
+     save_steps=50,
+     save_total_limit=5,
+     logging_steps=50,
+     logging_first_step=True,
+     run_name=run_name,
+ )
+
+ evaluator = NanoBEIREvaluator()
+ evaluator(model)
+
+ trainer = SentenceTransformerTrainer(
+     model=model,
+     args=args,
+     train_dataset=train_dataset,
+     eval_dataset=eval_dataset,
+     loss=loss,
+     evaluator=evaluator,
+ )
+ trainer.train()
+
+ evaluator(model)
+
+ model.save_pretrained(f"output/{run_name}")
+ ```
+
+ </details>
+
+ ## Comparison with other models
+
+ Coming soon...
+
+ ## Acknowledgements
+
+ This model is based on the [Model2Vec](https://github.com/MinishLab/model2vec) library. Credit goes to the [Minish Lab](https://github.com/MinishLab) team for developing this library.
+
+ ## Citation
+
+ This model builds on work done by Minish Lab. Please cite the [Model2Vec repository](https://github.com/MinishLab/model2vec) if you use this model in your work.
+
+ ```bibtex
+ @software{minishlab2024model2vec,
+ authors = {Stephan Tulkens, Thomas van Dongen},
+ title = {Model2Vec: Turn any Sentence Transformer into a Small Fast Model},
+ year = {2024},
+ url = {https://github.com/MinishLab/model2vec},
+ }
+ ```
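
Note on the training objective in the README above: the training code uses `MSELoss` on texts "labeled by" the teacher, which suggests each student embedding is pulled toward the teacher embedding stored as its label. The block below is only a pure-Python sketch of that objective, not code from this commit. The helper name `mse_distillation_loss` and the toy 3-dimensional vectors are assumptions made for illustration, and it averages the squared error over every element of the batch.

```python
from typing import Sequence


def mse_distillation_loss(
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
) -> float:
    """Mean squared error over every element of a batch of embeddings."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher batches must have the same size")
    total = 0.0
    count = 0
    for s_vec, t_vec in zip(student, teacher):
        if len(s_vec) != len(t_vec):
            raise ValueError("embedding dimensions must match")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty batch")
    return total / count


# Toy vectors, made up for illustration only (hypothetical, not real embeddings).
teacher_batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
student_batch = [[0.5, 0.0, 0.0], [0.0, 1.0, 0.5]]
loss = mse_distillation_loss(student_batch, teacher_batch)
```

A loss of zero means the student reproduces the teacher's vectors exactly on that batch. In the actual training run the student is the static embedding table and the labels come from `baai/bge-base-en-v1.5`, as the README states.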