twright8 committed on
Commit
7271f60
1 Parent(s): 4fbfffc

Push model using huggingface_hub.

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "word_embedding_dimension": 768,
3
+ "pooling_mode_cls_token": true,
4
+ "pooling_mode_mean_tokens": false,
5
+ "pooling_mode_max_tokens": false,
6
+ "pooling_mode_mean_sqrt_len_tokens": false,
7
+ "pooling_mode_weightedmean_tokens": false,
8
+ "pooling_mode_lasttoken": false,
9
+ "include_prompt": true
10
+ }
README.md CHANGED
@@ -1,3 +1,239 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ base_model: Alibaba-NLP/gte-base-en-v1.5
3
+ library_name: setfit
4
+ metrics:
5
+ - accuracy
6
+ pipeline_tag: text-classification
7
+ tags:
8
+ - setfit
9
+ - sentence-transformers
10
+ - text-classification
11
+ - generated_from_setfit_trainer
12
+ widget:
13
+ - text: Tech Start-up Revolutionizes Water Purification SAN FRANCISCO - AquaTech,
14
+ a Silicon Valley start-up, unveiled its groundbreaking water purification system
15
+ today. Using advanced nanotechnology, the device can purify contaminated water
16
+ in seconds, potentially bringing safe drinking water to millions. "This could
17
+ be a game-changer for global health," said WHO representative Dr. Amina Osei.
18
+ Field trials are set to begin next month.
19
+ - text: Whistleblower Exposes Massive Fraud in Medicare Billing WASHINGTON - A former
20
+ employee of MedTech Solutions, a major medical equipment supplier, has come forward
21
+ with explosive allegations of systematic fraud in Medicare billing practices.
22
+ The whistleblower, whose identity remains protected, claims the company routinely
23
+ inflated prices and billed for unnecessary equipment, defrauding the government
24
+ of an estimated $1.2 billion over five years. Documents obtained by this newspaper
25
+ appear to corroborate these claims, showing discrepancies between actual costs
26
+ and billed amounts for common medical devices such as wheelchairs and oxygen tanks.
27
+ "This isn't just about money," said Senator Lisa Kline, chair of the Senate Health
28
+ Committee. "This kind of fraud directly impacts patient care and drives up healthcare
29
+ costs for everyone." The Department of Justice has announced a full investigation
30
+ into MedTech Solutions and its parent company, HealthCorp International. Industry
31
+ experts suggest this could be just the tip of the iceberg, with similar practices
32
+ potentially widespread across the medical supply sector. MedTech Solutions has
33
+ denied all allegations and vowed to cooperate fully with investigators.
34
+ - text: Nursing Home Chain Under Fire for Neglect and Fraud CHICAGO - A damning report
35
+ released today by state health inspectors reveals a pattern of severe neglect
36
+ and fraudulent practices across Sunset Years, one of the nation's largest nursing
37
+ home chains. Investigators found widespread understaffing, with some facilities
38
+ staffed at dangerously low levels while still billing Medicare and Medicaid for
39
+ full care. In several instances, residents were found to be malnourished or suffering
40
+ from untreated bedsores, despite records indicating proper care. "It's heartbreaking,"
41
+ said Maria Rodriguez, whose mother was a resident at one of the chain's Chicago
42
+ facilities. "We trusted them with our loved ones, and they betrayed that trust
43
+ for profit." Sunset Years CEO Robert Thompson issued a statement claiming the
44
+ issues were isolated incidents and not reflective of the company's overall standards.
45
+ However, multiple state attorneys general have announced plans to pursue legal
46
+ action against the chain
47
+ - text: Global Coffee Prices Surge Amid Brazilian Drought Coffee futures hit a five-year
48
+ high today as severe drought continues to ravage Brazil's coffee-growing regions.
49
+ Experts warn consumers may see significant price increases in coming months.
50
+ - text: 'BREAKING: Hospital CEO Arrested in Kickback Scheme Federal agents arrested
51
+ Mercy General Hospital CEO John Smith today on charges of accepting kickbacks
52
+ for preferential treatment of patients. Prosecutors allege Smith pocketed over
53
+ $2 million, compromising patient care. Smith''s lawyer denies all accusations.'
54
+ inference: true
55
+ model-index:
56
+ - name: SetFit with Alibaba-NLP/gte-base-en-v1.5
57
+ results:
58
+ - task:
59
+ type: text-classification
60
+ name: Text Classification
61
+ dataset:
62
+ name: Unknown
63
+ type: unknown
64
+ split: test
65
+ metrics:
66
+ - type: accuracy
67
+ value: 0.8181818181818182
68
+ name: Accuracy
69
+ ---
70
+
71
+ # SetFit with Alibaba-NLP/gte-base-en-v1.5
72
+
73
+ This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [Alibaba-NLP/gte-base-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5) as the Sentence Transformer embedding model. A [SetFitHead](https://huggingface.co/docs/setfit/reference/main#setfit.SetFitHead) instance is used for classification.
74
+
75
+ The model has been trained using an efficient few-shot learning technique that involves:
76
+
77
+ 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
78
+ 2. Training a classification head with features from the fine-tuned Sentence Transformer.
79
+
80
+ ## Model Details
81
+
82
+ ### Model Description
83
+ - **Model Type:** SetFit
84
+ - **Sentence Transformer body:** [Alibaba-NLP/gte-base-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5)
85
+ - **Classification head:** a [SetFitHead](https://huggingface.co/docs/setfit/reference/main#setfit.SetFitHead) instance
86
+ - **Maximum Sequence Length:** 8192 tokens
87
+ - **Number of Classes:** 2 classes
88
+ <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
89
+ <!-- - **Language:** Unknown -->
90
+ <!-- - **License:** Unknown -->
91
+
92
+ ### Model Sources
93
+
94
+ - **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
95
+ - **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
96
+ - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
97
+
98
+ ### Model Labels
99
+ | Label | Examples |
100
+ |:------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
101
+ | 1 | <ul><li>'Lucknow: Deputy CM Brajesh Pathak recommends dismissal of 17 govt doctors for absenteeism LUCKNOW: State govt has recommended the dismissal of 17 medical officers after they were found absent from duty for several months. In addition to this, disciplinary action has been ordered against three medical officers.The order was issued by deputy CM Brajesh Pathak who also holds the charge of health and medical education departments, said a govt spokesman on Thursday. In his order, Pathak stated: "No doctor or health worker who is negligent in medical services will be forgiven." tnn \'Committed to high-level health services\'Strict action will be taken against them. The state is committed to providing high-level health services to the people and no laxity on the count will be tolerated," Pathak stated. Three doctors who will face disciplinary action are Dr Mukul Mishra, orthopedic specialist of District Hospital, Jhansi; Dr Madhavi Singh, ophthalmologist posted at Community Health Centre, Fatehpur, Barabanki and Dr Pramod Kumar Sharma under Chief Medical Officer, Bareilly.'</li><li>"Kerala model therapy: Govt gives 56 absentee doctors 'show-cause pill' Thiruvananthapuram: The state health and family welfare department has issued show-cause notice to 56 doctors who have been on unauthorised absence in various medical colleges and pharmacy colleges in Kerala. In the notice issued by Rajan Khobragade, additional chief secretary, health and family welfare department, the doctors have been directed to report for duty before the ACS at the secretariat within 15 days."</li><li>'42% of Nigerian Doctors, Nurse Demand Bribes Before Attending to Patients - NBS Reports The National Bureau of Statistics (NBS) recently published a report titled "NBS Corruption in Nigeria: Patterns and Trend" for 2023, revealing concerning statistics about corruption in the healthcare sector. According to the report, two-thirds of Nigerian doctors, nurses, and midwives demand bribes from patients before providing treatment. Additionally, 42 percent of these health workers accept bribes to expedite procedures, while 15 percent take bribes to ensure the completion of medical procedures. It, however, added that 11 per cent were paid bribes as a "sign of appreciation," which still reflects the purpose of gratification for the healthcare service they received. "As for doctors, nurses and midwives, 11 per cent of bribes were paid as a sign of appreciation, possibly reflecting gratitude for the care received," it stated. The report comes as Nigerians have continued to raise concerns over poor quality health services in the country. With these concerns, a shortage of health workers continues to plague the health system even as practitioners travel abroad to seek better welfare with the "japa syndrome." The NBS report, in collaboration with the United Nations Office on Drugs and Crimes (UNODC), also revealed how Nigerian public officials received nothing less than N721 billion as bribes in 2023'</li></ul> |
102
+ | 0 | <ul><li>'Malta\'s former prime minister charged with corruption over hospital scandal Malta\'s former prime minister Joseph Muscat has been charged with corruption in a hospital privatisation scandal that was once investigated by the murdered investigative journalist Daphne Caruana Galizia. Muscat has been charged with accepting bribes, corruption in public office and money laundering, according to documents seen by AFP. He has described the allegations as "fantasies and lies" and said he was the victim of a political vendetta. Chris Fearne, Malta\'s deputy prime minister, who is tipped to become Malta\'s next European commissioner, and the country\'s former finance minister Edward Scicluna, who is now the governor of Malta\'s central bank, were charged with fraud, misappropriation and fraudulent gain.'</li><li>"US Supreme Court gives pharma companies a chance to thwart terrorism-funding lawsuit 21 pharmaceutical and medical equipment companies, including AstraZeneca, Pfizer, GE Healthcare USA, Johnson & Johnson, and F. Hoffmann-La Roche, are accused of illegally helping to fund terrorism in Iraq by providing corrupt payments to the Hezbollah-sponsored militia group Jaysh al-Mahdi to obtain medical supply contracts from Iraq's health ministry. The lawsuit seeks unspecified damages under the Anti-Terrorism Act."</li><li>'Health Ministry Official Arrested in Procurement Scandal JAKARTA - Indonesian authorities have arrested a high-ranking Health Ministry official on suspicion of corruption in medical equipment procurement. Agus Sutiyo, 52, Director of Medical Supplies, is accused of accepting bribes totaling $1.2 million from suppliers in exchange for awarding inflated contracts. The Corruption Eradication Commission (KPK) alleges that Sutiyo manipulated tender processes, favoring companies that offered kickbacks. The scheme reportedly cost the government an estimated $10 million in overpayments. KPK spokesperson Febri Diansyah stated, "This case undermines public trust and diverts crucial resources from healthcare services." Sutiyo faces up to 20 years in prison if convicted.'</li></ul> |
103
+
104
+ ## Evaluation
105
+
106
+ ### Metrics
107
+ | Label | Accuracy |
108
+ |:--------|:---------|
109
+ | **all** | 0.8182 |
110
+
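+ A held-out labelled split can be scored against this model in a few lines. A sketch (the test texts and labels below are placeholders, not the evaluation set used above):
+
+ ```python
+ from sklearn.metrics import accuracy_score
+ from setfit import SetFitModel
+
+ model = SetFitModel.from_pretrained("twright8/news_cats")
+
+ # Placeholder test examples; substitute a real labelled split.
+ test_texts = [
+     "Coffee prices surge amid drought",
+     "Hospital CEO arrested in kickback scheme",
+ ]
+ test_labels = [0, 1]
+
+ preds = [int(p) for p in model.predict(test_texts)]
+ print(accuracy_score(test_labels, preds))
+ ```
+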
111
+ ## Uses
112
+
113
+ ### Direct Use for Inference
114
+
115
+ First install the SetFit library:
116
+
117
+ ```bash
118
+ pip install setfit
119
+ ```
120
+
121
+ Then you can load this model and run inference.
122
+
123
+ ```python
124
+ from setfit import SetFitModel
125
+
126
+ # Download from the 🤗 Hub
127
+ model = SetFitModel.from_pretrained("twright8/news_cats")
128
+ # Run inference
129
+ preds = model("Global Coffee Prices Surge Amid Brazilian Drought Coffee futures hit a five-year high today as severe drought continues to ravage Brazil's coffee-growing regions. Experts warn consumers may see significant price increases in coming months.")
130
+ ```
131
+
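+ The model can also score a batch of texts at once. A short sketch, reusing `model` from above: `predict` returns one hard label per text, while `predict_proba` returns class probabilities:
+
+ ```python
+ texts = [
+     "Hospital CEO arrested in kickback scheme",
+     "Tech start-up revolutionizes water purification",
+ ]
+ labels = model.predict(texts)         # one predicted label per text
+ scores = model.predict_proba(texts)   # class probabilities per text
+ ```
+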
132
+ <!--
133
+ ### Downstream Use
134
+
135
+ *List how someone could finetune this model on their own dataset.*
136
+ -->
137
+
138
+ <!--
139
+ ### Out-of-Scope Use
140
+
141
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
142
+ -->
143
+
144
+ <!--
145
+ ## Bias, Risks and Limitations
146
+
147
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
148
+ -->
149
+
150
+ <!--
151
+ ### Recommendations
152
+
153
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
154
+ -->
155
+
156
+ ## Training Details
157
+
158
+ ### Training Set Metrics
159
+ | Training set | Min | Median | Max |
160
+ |:-------------|:----|:---------|:----|
161
+ | Word count | 55 | 153.8462 | 290 |
162
+
163
+ | Label | Training Sample Count |
164
+ |:------|:----------------------|
165
+ | 0 | 13 |
166
+ | 1 | 13 |
167
+
168
+ ### Training Hyperparameters
169
+ - batch_size: (8, 1)
170
+ - num_epochs: (3, 17)
171
+ - max_steps: -1
172
+ - sampling_strategy: oversampling
173
+ - body_learning_rate: (9.629116538858926e-05, 2.651259436793277e-05)
174
+ - head_learning_rate: 0.02145586669240117
175
+ - loss: CoSENTLoss
176
+ - distance_metric: cosine_distance
177
+ - margin: 0.25
178
+ - end_to_end: True
179
+ - use_amp: True
180
+ - warmup_proportion: 0.1
181
+ - max_length: 512
182
+ - seed: 42
183
+ - eval_max_steps: -1
184
+ - load_best_model_at_end: True
185
+
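+ The paired values above follow SetFit's convention of one setting for the embedding (body) phase and one for the classifier (head) phase. Roughly, they correspond to a `TrainingArguments` configuration like the following sketch (not the original training script):
+
+ ```python
+ from sentence_transformers.losses import CoSENTLoss
+ from setfit import TrainingArguments
+
+ args = TrainingArguments(
+     batch_size=(8, 1),          # (embedding phase, classifier phase)
+     num_epochs=(3, 17),
+     body_learning_rate=(9.629116538858926e-05, 2.651259436793277e-05),
+     head_learning_rate=0.02145586669240117,
+     loss=CoSENTLoss,
+     margin=0.25,
+     end_to_end=True,
+     use_amp=True,
+     warmup_proportion=0.1,
+     max_length=512,
+     seed=42,
+     load_best_model_at_end=True,
+ )
+ ```
+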
186
+ ### Training Results
187
+ | Epoch | Step | Training Loss | Validation Loss |
188
+ |:----------:|:------:|:-------------:|:---------------:|
189
+ | 0.0217 | 1 | 1.8133 | - |
190
+ | **0.4348** | **20** | **0.0054** | **1.6363** |
191
+ | 0.8696 | 40 | 0.0 | 4.9011 |
192
+ | 1.3043 | 60 | 0.0 | 7.0885 |
193
+ | 1.7391 | 80 | 0.0 | 6.2756 |
194
+ | 2.1739 | 100 | 0.0 | 6.2417 |
195
+ | 2.6087 | 120 | 0.0 | 6.4769 |
196
+
197
+ * The bold row denotes the saved checkpoint.
198
+ ### Framework Versions
199
+ - Python: 3.10.13
200
+ - SetFit: 1.0.3
201
+ - Sentence Transformers: 3.0.1
202
+ - Transformers: 4.39.0
203
+ - PyTorch: 2.3.0+cu121
204
+ - Datasets: 2.20.0
205
+ - Tokenizers: 0.15.2
206
+
207
+ ## Citation
208
+
209
+ ### BibTeX
210
+ ```bibtex
211
+ @article{https://doi.org/10.48550/arxiv.2209.11055,
212
+ doi = {10.48550/ARXIV.2209.11055},
213
+ url = {https://arxiv.org/abs/2209.11055},
214
+ author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
215
+ keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
216
+ title = {Efficient Few-Shot Learning Without Prompts},
217
+ publisher = {arXiv},
218
+ year = {2022},
219
+ copyright = {Creative Commons Attribution 4.0 International}
220
+ }
221
+ ```
222
+
223
+ <!--
224
+ ## Glossary
225
+
226
+ *Clearly define terms in order to be accessible across audiences.*
227
+ -->
228
+
229
+ <!--
230
+ ## Model Card Authors
231
+
232
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
233
+ -->
234
+
235
+ <!--
236
+ ## Model Card Contact
237
+
238
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
239
+ -->
config.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "_name_or_path": "checkpoints/step_20",
3
+ "architectures": [
4
+ "NewModel"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.0,
7
+ "auto_map": {
8
+ "AutoConfig": "configuration.NewConfig",
9
+ "AutoModel": "modeling.NewModel",
10
+ "AutoModelForMaskedLM": "Alibaba-NLP/new-impl--modeling.NewForMaskedLM",
11
+ "AutoModelForMultipleChoice": "Alibaba-NLP/new-impl--modeling.NewForMultipleChoice",
12
+ "AutoModelForQuestionAnswering": "Alibaba-NLP/new-impl--modeling.NewForQuestionAnswering",
13
+ "AutoModelForSequenceClassification": "Alibaba-NLP/new-impl--modeling.NewForSequenceClassification",
14
+ "AutoModelForTokenClassification": "Alibaba-NLP/new-impl--modeling.NewForTokenClassification"
15
+ },
16
+ "classifier_dropout": null,
17
+ "hidden_act": "gelu",
18
+ "hidden_dropout_prob": 0.1,
19
+ "hidden_size": 768,
20
+ "initializer_range": 0.02,
21
+ "intermediate_size": 3072,
22
+ "layer_norm_eps": 1e-12,
23
+ "layer_norm_type": "layer_norm",
24
+ "logn_attention_clip1": false,
25
+ "logn_attention_scale": false,
26
+ "max_position_embeddings": 8192,
27
+ "model_type": "new",
28
+ "num_attention_heads": 12,
29
+ "num_hidden_layers": 12,
30
+ "pack_qkv": true,
31
+ "pad_token_id": 0,
32
+ "position_embedding_type": "rope",
33
+ "rope_scaling": {
34
+ "factor": 2.0,
35
+ "type": "ntk"
36
+ },
37
+ "rope_theta": 500000,
38
+ "torch_dtype": "float32",
39
+ "transformers_version": "4.39.0",
40
+ "type_vocab_size": 0,
41
+ "unpad_inputs": false,
42
+ "use_memory_efficient_attention": false,
43
+ "vocab_size": 30528
44
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "3.0.1",
4
+ "transformers": "4.39.0",
5
+ "pytorch": "2.3.0+cu121"
6
+ },
7
+ "prompts": {},
8
+ "default_prompt_name": null,
9
+ "similarity_fn_name": null
10
+ }
config_setfit.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "labels": null,
3
+ "normalize_embeddings": false
4
+ }
configuration.py ADDED
@@ -0,0 +1,145 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 The GTE Team Authors and Alibaba Group.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """ NEW model configuration"""
17
+ from transformers.configuration_utils import PretrainedConfig
18
+ from transformers.utils import logging
19
+
20
+ logger = logging.get_logger(__name__)
21
+
22
+
23
+ class NewConfig(PretrainedConfig):
24
+ r"""
25
+ This is the configuration class to store the configuration of a [`NewModel`] or a [`TFNewModel`]. It is used to
26
+ instantiate a NEW model according to the specified arguments, defining the model architecture. Instantiating a
27
+ configuration with the defaults will yield a similar configuration to that of the NEW
28
+ [izhx/new-base-en](https://huggingface.co/izhx/new-base-en) architecture.
29
+
30
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
31
+ documentation from [`PretrainedConfig`] for more information.
32
+
33
+
34
+ Args:
35
+ vocab_size (`int`, *optional*, defaults to 30522):
36
+ Vocabulary size of the NEW model. Defines the number of different tokens that can be represented by the
37
+ `inputs_ids` passed when calling [`NewModel`] or [`TFNewModel`].
38
+ hidden_size (`int`, *optional*, defaults to 768):
39
+ Dimensionality of the encoder layers and the pooler layer.
40
+ num_hidden_layers (`int`, *optional*, defaults to 12):
41
+ Number of hidden layers in the Transformer encoder.
42
+ num_attention_heads (`int`, *optional*, defaults to 12):
43
+ Number of attention heads for each attention layer in the Transformer encoder.
44
+ intermediate_size (`int`, *optional*, defaults to 3072):
45
+ Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
46
+ hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
47
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
48
+ `"relu"`, `"silu"` and `"gelu_new"` are supported.
49
+ hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
50
+ The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
51
+ attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
52
+ The dropout ratio for the attention probabilities.
53
+ max_position_embeddings (`int`, *optional*, defaults to 512):
54
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
55
+ just in case (e.g., 512 or 1024 or 2048).
56
+ type_vocab_size (`int`, *optional*, defaults to 2):
57
+ The vocabulary size of the `token_type_ids` passed when calling [`NewModel`] or [`TFNewModel`].
58
+ initializer_range (`float`, *optional*, defaults to 0.02):
59
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
60
+ layer_norm_eps (`float`, *optional*, defaults to 1e-12):
61
+ The epsilon used by the layer normalization layers.
62
+ position_embedding_type (`str`, *optional*, defaults to `"rope"`):
63
+ Type of position embedding. Choose one of `"absolute"`, `"rope"`.
64
+ rope_theta (`float`, *optional*, defaults to 10000.0):
65
+ The base period of the RoPE embeddings.
66
+ rope_scaling (`Dict`, *optional*):
67
+ Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
68
+ strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
69
+ `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
70
+ `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
71
+ these scaling strategies behave:
72
+ https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
73
+ experimental feature, subject to breaking API changes in future versions.
74
+ classifier_dropout (`float`, *optional*):
75
+ The dropout ratio for the classification head.
76
+
77
+ Examples:
78
+
79
+ ```python
80
+ >>> from transformers import NewConfig, NewModel
81
+
82
+ >>> # Initializing a NEW izhx/new-base-en style configuration
83
+ >>> configuration = NewConfig()
84
+
85
+ >>> # Initializing a model (with random weights) from the izhx/new-base-en style configuration
86
+ >>> model = NewModel(configuration)
87
+
88
+ >>> # Accessing the model configuration
89
+ >>> configuration = model.config
90
+ ```"""
91
+
92
+ model_type = "new"
93
+
94
+ def __init__(
95
+ self,
96
+ vocab_size=30528,
97
+ hidden_size=768,
98
+ num_hidden_layers=12,
99
+ num_attention_heads=12,
100
+ intermediate_size=3072,
101
+ hidden_act="gelu",
102
+ hidden_dropout_prob=0.1,
103
+ attention_probs_dropout_prob=0.0,
104
+ max_position_embeddings=2048,
105
+ type_vocab_size=1,
106
+ initializer_range=0.02,
107
+ layer_norm_type='layer_norm',
108
+ layer_norm_eps=1e-12,
109
+ # pad_token_id=0,
110
+ position_embedding_type="rope",
111
+ rope_theta=10000.0,
112
+ rope_scaling=None,
113
+ classifier_dropout=None,
114
+ pack_qkv=True,
115
+ unpad_inputs=False,
116
+ use_memory_efficient_attention=False,
117
+ logn_attention_scale=False,
118
+ logn_attention_clip1=False,
119
+ **kwargs,
120
+ ):
121
+ super().__init__(**kwargs)
122
+
123
+ self.vocab_size = vocab_size
124
+ self.hidden_size = hidden_size
125
+ self.num_hidden_layers = num_hidden_layers
126
+ self.num_attention_heads = num_attention_heads
127
+ self.hidden_act = hidden_act
128
+ self.intermediate_size = intermediate_size
129
+ self.hidden_dropout_prob = hidden_dropout_prob
130
+ self.attention_probs_dropout_prob = attention_probs_dropout_prob
131
+ self.max_position_embeddings = max_position_embeddings
132
+ self.type_vocab_size = type_vocab_size
133
+ self.initializer_range = initializer_range
134
+ self.layer_norm_type = layer_norm_type
135
+ self.layer_norm_eps = layer_norm_eps
136
+ self.position_embedding_type = position_embedding_type
137
+ self.rope_theta = rope_theta
138
+ self.rope_scaling = rope_scaling
139
+ self.classifier_dropout = classifier_dropout
140
+
141
+ self.pack_qkv = pack_qkv
142
+ self.unpad_inputs = unpad_inputs
143
+ self.use_memory_efficient_attention = use_memory_efficient_attention
144
+ self.logn_attention_scale = logn_attention_scale
145
+ self.logn_attention_clip1 = logn_attention_clip1
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4f118bf0a45a5b1f81b12afb79d250c0b1d712bf19fadb2b32c76e790ebc57d3
3
+ size 547119128
model_head.pkl ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c3f5339885643e3b45e7350dbf556d1f87db4e25888051a738378301c60c8a24
3
+ size 7702
modeling.py ADDED
@@ -0,0 +1,1407 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 The GTE Team Authors and Alibaba Group.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """PyTorch NEW model."""
17
+
18
+ import math
19
+ from typing import List, Optional, Tuple, Union
20
+
21
+ import torch
22
+ import torch.utils.checkpoint
23
+ from torch import nn
24
+
25
+ from transformers.activations import ACT2FN
26
+ from transformers.modeling_outputs import (
27
+ BaseModelOutput,
28
+ BaseModelOutputWithPooling,
29
+ MaskedLMOutput,
30
+ MultipleChoiceModelOutput,
31
+ QuestionAnsweringModelOutput,
32
+ SequenceClassifierOutput,
33
+ TokenClassifierOutput,
34
+ )
35
+ from transformers.modeling_utils import PreTrainedModel
36
+ from transformers.utils import logging
37
+
38
+ try:
39
+ import xformers.ops as xops
40
+ except ImportError as e:
41
+ xops = None
42
+
43
+ from .configuration import NewConfig
44
+
45
+
46
+ logger = logging.get_logger(__name__)
47
+
48
+
49
+ # Adapted from https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/bert_padding.py
50
+ # Which was adapted from https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/padding.py
51
+ class IndexFirstAxis(torch.autograd.Function):
52
+ @staticmethod
53
+ def forward(ctx, input, indices):
54
+ ctx.save_for_backward(indices)
55
+ assert input.ndim >= 2
56
+ ctx.first_axis_dim, other_shape = input.shape[0], input.shape[1:]
57
+ second_dim = other_shape.numel()
58
+ # TD [2022-03-04] For some reason torch.gather is a bit faster than indexing.
59
+ # return input[indices]
60
+ # return torch.gather(
61
+ # rearrange(input, "b ... -> b (...)"), 0, repeat(indices, "z -> z d", d=second_dim)
62
+ # ).reshape(-1, *other_shape)
63
+ return torch.gather(
64
+ input.view(ctx.first_axis_dim, second_dim),
65
+ 0,
66
+ indices.unsqueeze(-1).expand(indices.size(0), second_dim)
67
+ ).reshape(-1, *other_shape)
68
+
69
+ @staticmethod
70
+ def backward(ctx, grad_output):
71
+ (indices,) = ctx.saved_tensors
72
+ assert grad_output.ndim >= 2
73
+ other_shape = grad_output.shape[1:]
74
+ # grad_output = rearrange(grad_output, "b ... -> b (...)")
75
+ grad_output = grad_output.view(grad_output.size(0), other_shape.numel())
76
+ grad_input = torch.zeros(
77
+ [ctx.first_axis_dim, grad_output.shape[1]],
78
+ device=grad_output.device,
79
+ dtype=grad_output.dtype,
80
+ )
81
+ # TD [2022-03-04] For some reason torch.scatter is a bit faster than indexing.
82
+ # grad_input[indices] = grad_output
83
+ # grad_input.scatter_(0, repeat(indices, "z -> z d", d=grad_output.shape[1]), grad_output)
84
+ grad_input.scatter_(
85
+ 0, indices.unsqueeze(-1).expand(indices.size(0), grad_output.size(1)), grad_output
86
+ )
87
+ return grad_input.reshape(ctx.first_axis_dim, *other_shape), None
88
+
89
+
90
+ index_first_axis = IndexFirstAxis.apply
91
+
92
+
93
+ def unpad_input(hidden_states, attention_mask=None, indices=None):
94
+ """
95
+ Arguments:
96
+ hidden_states: (batch, seqlen, ...)
97
+ attention_mask: (batch, seqlen), bool / int, 1 means valid and 0 means not valid.
98
+ indices: (total_nnz), the indices of non-masked tokens from the flattened input sequence.
99
+ Return:
100
+ hidden_states: (total_nnz, ...), where total_nnz = number of tokens selected in attention_mask.
101
+ """
102
+ if indices is None:
103
+ assert attention_mask is not None
104
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
105
+
106
+ # TD [2022-03-04] We don't want to index with a bool mask, because Pytorch will expand the
107
+ # bool mask, then call nonzero to get the indices, then index with those. The indices is @dim
108
+ # times larger than it needs to be, wasting memory. It's faster and more memory-efficient to
109
+ # index with integer indices. Moreover, torch's index is a bit slower than it needs to be,
110
+ # so we write custom forward and backward to make it a bit faster.
111
+ hidden_states = hidden_states.view(-1, *hidden_states.shape[2:])
112
+ return index_first_axis(hidden_states, indices)
113
+
114
+
115
+ class IndexPutFirstAxis(torch.autograd.Function):
116
+ @staticmethod
117
+ def forward(
118
+ ctx,
119
+ values: torch.Tensor,
120
+ indices: torch.Tensor,
121
+ first_axis_dim
122
+ ) -> torch.Tensor:
123
+ ctx.save_for_backward(indices)
124
+ assert indices.ndim == 1
125
+ assert values.ndim >= 2
126
+ output = torch.zeros(
127
+ first_axis_dim, *values.shape[1:], device=values.device, dtype=values.dtype
128
+ )
129
+ output[indices] = values
130
+ return output
131
+
132
+ @staticmethod
133
+ def backward(ctx, grad_output: torch.Tensor) -> Tuple[torch.Tensor, None, None]:
134
+ indices, = ctx.saved_tensors
135
+ grad_values = grad_output[indices]
136
+ return grad_values, None, None
137
+
138
+
139
+ index_put_first_axis = IndexPutFirstAxis.apply
140
+
141
+
142
+ def pad_input(inputs: torch.Tensor, indices: torch.Tensor, batch: int, seqlen: int) -> torch.Tensor:
143
+ """Add padding to sequences.
144
+
145
+ Arguments:
146
+ inputs: (total_nnz, ...), where total_nnz = number of tokens selected in attention_mask.
147
+ indices: (total_nnz), `indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()`
148
+ batch: int batch_size
149
+ seqlen: int max sequence length
150
+
151
+ Returns:
152
+ inputs: (batch, seqlen, ...)
153
+ """
154
+ output = index_put_first_axis(inputs, indices, batch * seqlen)
155
+ return output.view(batch, seqlen, *inputs.shape[1:])
156
+
157
+
158
+ def rotate_half(x):
159
+ """Rotates half the hidden dims of the input."""
160
+ x1 = x[..., : x.shape[-1] // 2]
161
+ x2 = x[..., x.shape[-1] // 2 :]
162
+ return torch.cat((-x2, x1), dim=-1)
163
+
164
+
165
+ def apply_rotary_pos_emb(q, k, cos, sin):
166
+ """Applies Rotary Position Embedding to the query and key tensors.
167
+
168
+ Args:
169
+ q (`torch.Tensor`): The query tensor.
170
+ k (`torch.Tensor`): The key tensor.
171
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
172
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
173
+ Returns:
174
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
175
+ """
176
+ cos, sin = cos.to(q.dtype), sin.to(q.dtype)
177
+ q_embed = (q * cos) + (rotate_half(q) * sin)
178
+ k_embed = (k * cos) + (rotate_half(k) * sin)
179
+ return q_embed, k_embed
180
+
181
+
182
+ class RotaryEmbedding(torch.nn.Module):
183
+ def __init__(self, dim, max_position_embeddings=512, base=10000.0, device=None):
184
+ super().__init__()
185
+
186
+ self.dim = dim
187
+ self.max_position_embeddings = max_position_embeddings
188
+ self.base = base
189
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
190
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
191
+
192
+ # Build here to make `torch.jit.trace` work.
193
+ self._set_cos_sin_cache(
194
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
195
+ )
196
+
197
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
198
+ self.max_seq_len_cached = seq_len
199
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
200
+
201
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
202
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
203
+ emb = torch.cat((freqs, freqs), dim=-1)
204
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
205
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
206
+
207
+ def forward(self, x, seq_len=None):
208
+ # x: [bs, num_attention_heads, seq_len, head_size]
209
+ if seq_len > self.max_seq_len_cached:
210
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
211
+
212
+ return (
213
+ self.cos_cached[:seq_len, ...].to(dtype=x.dtype),
214
+ self.sin_cached[:seq_len, ...].to(dtype=x.dtype),
215
+ )
216
+
217
+
218
+ class NTKScalingRotaryEmbedding(RotaryEmbedding):
219
+ """RotaryEmbedding extended with fixed and mixed NTK scaling. https://kexue.fm/archives/9706 """
220
+
221
+ def __init__(self, dim, max_position_embeddings=512, base=10000, device=None, scaling_factor=1.0, mixed_b=None):
222
+ self.scaling_factor = scaling_factor
223
+ self.mixed_b = mixed_b
224
+ super().__init__(dim, max_position_embeddings, base, device)
225
+ max_position_embeddings = max_position_embeddings * self.scaling_factor
226
+ self._set_cos_sin_cache(max_position_embeddings, self.inv_freq.device, torch.get_default_dtype())
227
+
228
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
229
+ self.max_seq_len_cached = seq_len
230
+
231
+ if seq_len > self.max_position_embeddings:
232
+ base = self.base * (self.scaling_factor if self.mixed_b is None else 1)
233
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
234
+
235
+ if self.mixed_b is None:
236
+ inv_freq = inv_freq / self.scaling_factor ** (2 / self.dim) # (6)
237
+ else:
238
+ a = torch.tensor(self.scaling_factor).log() / (self.dim / 2) ** self.mixed_b # (13)
239
+ lambda_1_m = (a * torch.arange(1, self.dim // 2 + 1).float().to(device) ** self.mixed_b).exp() # (12)
240
+ inv_freq = inv_freq / lambda_1_m # (10)
241
+
242
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
243
+
244
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
245
+
246
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
247
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
248
+ emb = torch.cat((freqs, freqs), dim=-1)
249
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
250
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
251
+
252
+
253
+ class RMSNorm(nn.Module):
254
+ def __init__(self, hidden_size, eps=1e-6):
255
+ """
256
+ RMSNorm is equivalent to T5LayerNorm
257
+ """
258
+ super().__init__()
259
+ self.weight = nn.Parameter(torch.ones(hidden_size))
260
+ self.variance_epsilon = eps
261
+
262
+ def forward(self, hidden_states):
263
+ input_dtype = hidden_states.dtype
264
+ hidden_states = hidden_states.to(torch.float32)
265
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
266
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
267
+ return self.weight * hidden_states.to(input_dtype)
268
+
269
+
270
+ LAYER_NORM = {
271
+ 'layer_norm': nn.LayerNorm,
272
+ 'rms_norm': RMSNorm
273
+ }
274
+
275
+
276
+ class NewEmbeddings(nn.Module):
277
+ """
278
+ Embedding and Unpadding.
279
+ """
280
+
281
+ def __init__(self, config: NewConfig):
282
+ super().__init__()
283
+ self.padding_idx = config.pad_token_id
284
+ self.word_embeddings = nn.Embedding(
285
+ config.vocab_size, config.hidden_size, padding_idx=self.padding_idx
286
+ )
287
+
288
+ self.position_embedding_type = config.position_embedding_type
289
+ if self.position_embedding_type == 'absolute':
290
+ self.position_embeddings = nn.Embedding(
291
+ config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx
292
+ )
293
+ elif self.position_embedding_type == 'rope':
294
+ self._init_rope(config)
295
+ else:
296
+ raise ValueError
297
+
298
+ self.type_vocab_size = config.type_vocab_size
299
+ if self.type_vocab_size > 0:
300
+ self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
301
+
302
+ # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
303
+ # any TensorFlow checkpoint file
304
+ self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
305
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
306
+ # position_ids is contiguous in memory and excluded when serialized
307
+ self.register_buffer(
308
+ "position_ids", torch.arange(config.max_position_embeddings), persistent=False
309
+ )
310
+
311
+ def _init_rope(self, config):
312
+ kwargs = dict(
313
+ dim=int(config.hidden_size / config.num_attention_heads),
314
+ max_position_embeddings=config.max_position_embeddings,
315
+ base=config.rope_theta
316
+ )
317
+ if config.rope_scaling is None:
318
+ self.rotary_emb = RotaryEmbedding(**kwargs)
319
+ else:
320
+ kwargs.update(scaling_factor=config.rope_scaling["factor"])
321
+ scaling_type = config.rope_scaling["type"]
322
+ if scaling_type == 'ntk':
323
+ kwargs.update(mixed_b=config.rope_scaling.get('mixed_b', None))
324
+ self.rotary_emb = NTKScalingRotaryEmbedding(**kwargs)
325
+ # elif scaling_type == "linear":
326
+ # self.rotary_emb = LinearScalingRotaryEmbedding(**kwargs)
327
+ # elif scaling_type == "dynamic":
328
+ # self.rotary_emb = DynamicNTKScalingRotaryEmbedding(**kwargs)
329
+ else:
330
+ raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
331
+
332
+ def forward(
333
+ self,
334
+ unpad_inputs: bool,
335
+ input_ids: Optional[torch.Tensor] = None,
336
+ attention_mask: Optional[torch.Tensor] = None,
337
+ length: Optional[List[int]] = None,
338
+ token_type_ids: Optional[torch.Tensor] = None,
339
+ position_ids: Optional[torch.Tensor] = None,
340
+ inputs_embeds: Optional[torch.Tensor] = None,
341
+ ) -> Tuple[torch.Tensor, torch.Tensor, Optional[Tuple], Optional[List[int]]]:
342
+ """
343
+ """
344
+ if inputs_embeds is None:
345
+ device, input_shape = input_ids.device, input_ids.shape
346
+ else:
347
+ device, input_shape = inputs_embeds.device, inputs_embeds.shape[:2]
348
+ batch_size, seq_length = input_shape
349
+
350
+ # Set attention_mask if it's None
351
+ if attention_mask is None:
352
+ attention_mask = torch.ones(input_shape, device=device)
353
+ if length is not None:
354
+ for i, l in enumerate(length):
355
+ attention_mask[i, l:] = 0
356
+
357
+ # Set attention_mask_bool for unpadding
358
+ if unpad_inputs:
359
+ attention_mask_bool = attention_mask.bool()
360
+ if length is None:
361
+ length = attention_mask.sum(-1).tolist()
362
+
363
+ # Get word embeddings
364
+ if inputs_embeds is None:
365
+ if unpad_inputs:
366
+ input_ids = input_ids[attention_mask_bool].unsqueeze(0)
367
+ inputs_embeds = self.word_embeddings(input_ids)
368
+ else:
369
+ if unpad_inputs:
370
+ inputs_embeds = inputs_embeds[attention_mask_bool].unsqueeze(0)
371
+ embeddings = inputs_embeds
372
+
373
+ # Set and unpad position_ids
374
+ if position_ids is None:
375
+ if seq_length > self.position_ids.size(0):
376
+ self.register_buffer(
377
+ "position_ids", torch.arange(seq_length, device=embeddings.device), persistent=False
378
+ )
379
+ if unpad_inputs:
380
+ # [1, cumsum_seq_len]
381
+ position_ids = torch.cat([self.position_ids[:l] for l in length]).unsqueeze(0)
382
+ else:
383
+ # [bs, seq_len]
384
+ position_ids = self.position_ids[:seq_length].expand(batch_size, -1)
385
+ elif unpad_inputs:
386
+ position_ids = position_ids[attention_mask_bool].unsqueeze(0) # [1, cumsum_seq_len]
387
+
388
+ # Compute rotary embedding
389
+ if self.position_embedding_type == 'rope':
390
+ rope_cos, rope_sin = self.rotary_emb(inputs_embeds, seq_len=seq_length)
391
+ rope_cos = rope_cos[position_ids].unsqueeze(2) # [bs, seq_len, 1, dim]
392
+ rope_sin = rope_sin[position_ids].unsqueeze(2) # [bs, seq_len, 1, dim]
393
+ rope_embeds = rope_cos, rope_sin
394
+ else:
395
+ rope_embeds = None
396
+
397
+ if self.type_vocab_size > 0:
398
+ if token_type_ids is None:
399
+ token_type_ids = position_ids.mul(0)
400
+ else:
401
+ if self.type_vocab_size < 2:
402
+ token_type_ids.mul_(0)
403
+ if unpad_inputs:
404
+ token_type_ids = token_type_ids[attention_mask_bool].unsqueeze(0)
405
+
406
+ token_type_embeddings = self.token_type_embeddings(token_type_ids)
407
+ embeddings = embeddings + token_type_embeddings
408
+
409
+ # BERT position
410
+ if self.position_embedding_type == "absolute":
411
+ position_embeddings = self.position_embeddings(position_ids)
412
+ embeddings = embeddings + position_embeddings
413
+
414
+ embeddings = self.LayerNorm(embeddings)
415
+ embeddings = self.dropout(embeddings)
416
+
417
+ return embeddings, attention_mask, rope_embeds, length
418
+
419
+
420
+ class NewAttention(nn.Module):
421
+ def __init__(self, config: NewConfig, pack_qkv=None, use_memory_efficient_attention=None):
422
+ super().__init__()
423
+ self.config = config
424
+ if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
425
+ raise ValueError(
426
+ f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
427
+ f"heads ({config.num_attention_heads})"
428
+ )
429
+
430
+ self.hidden_size = config.hidden_size
431
+ self.num_attention_heads = config.num_attention_heads
432
+ self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
433
+ self.all_head_size = self.num_attention_heads * self.attention_head_size
434
+
435
+ if pack_qkv is None:
436
+ pack_qkv = config.pack_qkv
437
+ self.pack_qkv = pack_qkv
438
+
439
+ if self.pack_qkv:
440
+ self.qkv_proj = nn.Linear(config.hidden_size, self.all_head_size * 3, bias=True)
441
+ else:
442
+ self.q_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
443
+ self.k_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
444
+ self.v_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
445
+
446
+ self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
447
+ self.o_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=True)
448
+
449
+ if use_memory_efficient_attention is None:
450
+ use_memory_efficient_attention = self.config.use_memory_efficient_attention
451
+ self.use_memory_efficient_attention = use_memory_efficient_attention
452
+ self.memory_efficient_attention = None if xops is None else xops.memory_efficient_attention
453
+ if self.use_memory_efficient_attention:
454
+ assert self.memory_efficient_attention is not None, 'please install xformers'
455
+
456
+ def forward(
457
+ self,
458
+ hidden_states: torch.Tensor,
459
+ attention_bias: torch.FloatTensor,
460
+ rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
461
+ padding_inputs: Optional[Tuple] = None, # indices, batch, seqlen
462
+ attention_scale: Optional[torch.FloatTensor] = None,
463
+ head_mask: Optional[torch.FloatTensor] = None,
464
+ output_attentions: Optional[bool] = False,
465
+ qkv_inputs: Optional[Tuple] = None, # For RetroMAE
466
+ ) -> Tuple[torch.Tensor, ...]:
467
+ shape_hd = (self.num_attention_heads, self.attention_head_size)
468
+ # qkv
469
+ if self.pack_qkv and qkv_inputs is None:
470
+ qkv_pack = self.qkv_proj(hidden_states).split(self.all_head_size, dim=-1)
471
+ else:
472
+ if qkv_inputs is None:
473
+ qkv_inputs = (hidden_states, hidden_states, hidden_states)
474
+ qkv_pack = [
475
+ getattr(self, n + '_proj')(s) for s, n in zip(qkv_inputs, 'qkv')
476
+ ]
477
+ query_states, key_states, value_states = [t.view(t.shape[:-1] + shape_hd) for t in qkv_pack]
478
+
479
+ if self.config.position_embedding_type == 'rope':
480
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, *rope_embeds)
481
+
482
+ dtype = query_states.dtype
483
+
484
+ if self.config.logn_attention_scale and attention_scale is not None:
485
+ # https://kexue.fm/archives/8823
486
+ query_states = query_states * attention_scale.to(dtype)
487
+
488
+ if padding_inputs is not None:
489
+ query_states = pad_input(query_states.squeeze(), *padding_inputs)
490
+ key_states = pad_input(key_states.squeeze(), *padding_inputs)
491
+ value_states = pad_input(value_states.squeeze(), *padding_inputs)
492
+
493
+ if self.use_memory_efficient_attention:
494
+ assert self.memory_efficient_attention is not None, "xformers is not loaded"
495
+ assert output_attentions is False, "memory_efficient_attention do not output attentions"
496
+ assert head_mask is None, "Not support yet"
497
+ attention_probs = None
498
+ if torch.is_tensor(attention_bias):
499
+ attention_bias = attention_bias.to(dtype)
500
+ context_layer = self.memory_efficient_attention(
501
+ query_states,
502
+ key_states,
503
+ value_states,
504
+ attn_bias=attention_bias,
505
+ p=self.dropout.p
506
+ )
507
+ else:
508
+ if output_attentions and isinstance(self, NewSdpaAttention):
509
+ raise RuntimeError("SDPA do not output attentions")
510
+ context_layer, attention_probs = self._attention(
511
+ query_states, key_states, value_states, attention_bias, head_mask
512
+ )
513
+
514
+ if padding_inputs is not None:
515
+ context_layer = unpad_input(context_layer, indices=padding_inputs[0])
516
+
517
+ new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
518
+ context_layer = context_layer.view(new_context_layer_shape)
519
+
520
+ # output proj
521
+ attn_output = self.o_proj(context_layer)
522
+
523
+ # add attentions if we output them
524
+ outputs = (attn_output, attention_probs) if output_attentions else (attn_output,)
525
+ return outputs
526
+
527
+ def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
528
+ """
529
+ Args:
530
+ q/k/v: (B, L, n_head, head_dim),
531
+ Returns:
532
+ attn_output: (B, L, n_head, head_dim)
533
+ """
534
+ query_states = query_states.transpose(1, 2)
535
+ key_states = key_states.transpose(1, 2)
536
+ value_states = value_states.transpose(1, 2)
537
+ # Take the dot product between "query" and "key" to get the raw attention scores.
538
+ attention_scores = torch.matmul(query_states, key_states.transpose(-1, -2))
539
+
540
+ attention_scores = attention_scores / math.sqrt(self.attention_head_size)
541
+ if attention_bias is not None:
542
+ # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
543
+ attention_scores = attention_scores + attention_bias
544
+
545
+ # Normalize the attention scores to probabilities.
546
+ attention_probs = nn.functional.softmax(attention_scores, dim=-1)
547
+
548
+ # This is actually dropping out entire tokens to attend to, which might
549
+ # seem a bit unusual, but is taken from the original Transformer paper.
550
+ if self.dropout.p > 0:
551
+ attention_probs = self.dropout(attention_probs)
552
+
553
+ # Mask heads if we want to
554
+ if head_mask is not None:
555
+ attention_probs = attention_probs * head_mask
556
+
557
+ context_layer = torch.matmul(attention_probs, value_states)
558
+
559
+ context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
560
+ return context_layer, attention_probs
561
+
562
+
563
+ class NewSdpaAttention(NewAttention):
564
+ """
565
+ New attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
566
+ `NewAttention` as the weights of the module stays untouched. The only changes are on the forward pass to adapt to
567
+ SDPA API.
568
+ """
569
+ def __init__(self, config: NewConfig, **kwargs):
570
+ super().__init__(config, **kwargs)
571
+ # torch.backends.cuda.enable_mem_efficient_sdp(False)
572
+ # logger.warning(
573
+ # "Disable memory efficient attention kernel for `NewSdpaAttention`, you can set "
574
+ # "`use_memory_efficient_attention=True` if it expected to use."
575
+ # )
576
+
577
+ def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
578
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
579
+ query_states.transpose(1, 2),
580
+ key_states.transpose(1, 2),
581
+ value_states.transpose(1, 2),
582
+ attn_mask=attention_bias,
583
+ dropout_p=self.dropout.p if self.training else 0.0,
584
+ )
585
+ attn_output = attn_output.permute(0, 2, 1, 3).contiguous()
586
+ return attn_output, None
587
+
588
+
589
+ NEW_ATTENTION_CLASSES = {
590
+ "eager": NewAttention,
591
+ # "flash_attention_2": , # TODO
592
+ "sdpa": NewSdpaAttention,
593
+ }
594
+
595
+
596
+ class NewGatedMLP(nn.Module):
597
+ """
598
+ GLU Variants Improve Transformer.
599
+ """
600
+
601
+ def __init__(self, config: NewConfig):
602
+ super().__init__()
603
+ self.intermediate_size = config.intermediate_size
604
+ self.up_gate_proj = nn.Linear(config.hidden_size, self.intermediate_size * 2, bias=False)
605
+ self.down_proj = nn.Linear(self.intermediate_size, config.hidden_size, bias=True)
606
+ self.act_fn = ACT2FN[config.hidden_act]
607
+ if config.hidden_dropout_prob > 0:
608
+ self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
609
+ else:
610
+ self.hidden_dropout = None
611
+
612
+ def forward(self, hidden_states):
613
+ up_gate = self.up_gate_proj(hidden_states)
614
+ up_states, gate = torch.split(up_gate, self.intermediate_size, dim=-1)
615
+ gate = self.act_fn(gate)
616
+ gated_states = gate * up_states
617
+ if self.hidden_dropout is not None:
618
+ gated_states = self.hidden_dropout(gated_states)
619
+ down_states = self.down_proj(gated_states)
620
+ return down_states
621
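As a quick illustration of the gated MLP defined above, this hedged sketch (not part of the pushed file) reproduces the split/gate/down-project flow. The sizes and the GELU activation are assumptions for the example only, standing in for `config` values and `ACT2FN[config.hidden_act]`.

```python
# Standalone sketch of the gated MLP with toy sizes.
import torch
import torch.nn as nn

hidden_size, intermediate_size = 8, 16
up_gate_proj = nn.Linear(hidden_size, intermediate_size * 2, bias=False)
down_proj = nn.Linear(intermediate_size, hidden_size, bias=True)
act_fn = nn.GELU()  # stands in for ACT2FN[config.hidden_act]

x = torch.randn(2, 5, hidden_size)                                # (batch, seq_len, hidden)
up_states, gate = up_gate_proj(x).split(intermediate_size, dim=-1)
y = down_proj(act_fn(gate) * up_states)                           # gate modulates the up projection
print(y.shape)                                                    # torch.Size([2, 5, 8])
```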
+
622
+
623
+ class NewLayer(nn.Module):
624
+ def __init__(
625
+ self,
626
+ config: NewConfig,
627
+ pack_qkv=None,
628
+ use_memory_efficient_attention=None,
629
+ attn_implementation=None
630
+ ):
631
+ super().__init__()
632
+ if attn_implementation is None:
633
+ attn_implementation = config._attn_implementation
634
+ if use_memory_efficient_attention is None:
635
+ use_memory_efficient_attention = config.use_memory_efficient_attention
636
+ if use_memory_efficient_attention:
637
+ if attn_implementation != 'eager':
638
+ logger.warning_once(f"Override {attn_implementation=} to 'eager' as {use_memory_efficient_attention=}")
639
+ attn_implementation = 'eager' # Since it will be SDPA by default for torch>=2.1.1
640
+ self.attention = NEW_ATTENTION_CLASSES[attn_implementation](
641
+ config, pack_qkv=pack_qkv, use_memory_efficient_attention=use_memory_efficient_attention
642
+ )
643
+ self.mlp = NewGatedMLP(config)
644
+
645
+ ln_class = LAYER_NORM[config.layer_norm_type]
646
+ self.attn_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
647
+ self.mlp_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
648
+
649
+ if config.hidden_dropout_prob > 0:
650
+ self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
651
+ else:
652
+ self.hidden_dropout = None
653
+
654
+ def forward(
655
+ self,
656
+ hidden_states: torch.Tensor,
657
+ attention_bias: torch.FloatTensor,
658
+ rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
659
+ padding_inputs: Optional[Tuple] = None, # indices, batch, seqlen
660
+ attention_scale: Optional[torch.FloatTensor] = None,
661
+ subset_indices: Optional[torch.LongTensor] = None,
662
+ head_mask: Optional[torch.FloatTensor] = None,
663
+ output_attentions: Optional[bool] = False,
664
+ qkv_inputs: Optional[Tuple] = None, # For RetroMAE
665
+ ) -> Tuple[torch.Tensor, ...]:
666
+ # Multi head self attention
667
+ residual = hidden_states if qkv_inputs is None else qkv_inputs[0]
668
+ attention_outputs = self.attention(
669
+ hidden_states,
670
+ attention_bias,
671
+ rope_embeds,
672
+ padding_inputs,
673
+ attention_scale,
674
+ head_mask,
675
+ output_attentions=output_attentions,
676
+ qkv_inputs=qkv_inputs,
677
+ )
678
+ hidden_states = attention_outputs[0]
679
+ if self.hidden_dropout is not None:
680
+ hidden_states = self.hidden_dropout(hidden_states)
681
+ hidden_states = residual + hidden_states
682
+
683
+ # In pretraining, after the attention of the last layer, we only need the masked tokens.
684
+ if subset_indices is not None:
685
+ hidden_states = hidden_states[subset_indices]
686
+
687
+ hidden_states = self.attn_ln(hidden_states)
688
+
689
+ # Fully Connected
690
+ residual = hidden_states
691
+ hidden_states = self.mlp(hidden_states)
692
+ if self.hidden_dropout is not None:
693
+ hidden_states = self.hidden_dropout(hidden_states)
694
+ hidden_states = residual + hidden_states
695
+ hidden_states = self.mlp_ln(hidden_states)
696
+
697
+ # add self attentions if we output attention weights
698
+ outputs = (hidden_states,) + attention_outputs[1:]
699
+ return outputs
700
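The layer above applies each LayerNorm after the residual add (a post-LN layout). A tiny stand-in sketch of that ordering, with Identity modules in place of the real attention and MLP blocks; attention bias, RoPE and dropout are omitted for brevity.

```python
# Structural sketch of one NewLayer: post-LayerNorm residual layout.
import torch
import torch.nn as nn

hidden = 8
attn, mlp = nn.Identity(), nn.Identity()
attn_ln, mlp_ln = nn.LayerNorm(hidden), nn.LayerNorm(hidden)

x = torch.randn(2, 4, hidden)
h = attn_ln(x + attn(x))   # attention -> residual add -> LayerNorm
h = mlp_ln(h + mlp(h))     # MLP -> residual add -> LayerNorm
print(h.shape)             # torch.Size([2, 4, 8])
```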
+
701
+
702
+ class NewEncoder(nn.Module):
703
+ def __init__(self, config):
704
+ super().__init__()
705
+ self.config = config
706
+ self.layer = nn.ModuleList([NewLayer(config) for _ in range(config.num_hidden_layers)])
707
+ self.gradient_checkpointing = False
708
+
709
+ def forward(
710
+ self,
711
+ hidden_states: torch.Tensor,
712
+ attention_bias: Optional[torch.FloatTensor] = None,
713
+ rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
714
+ padding_inputs: Optional[Tuple] = None, # indices, batch, seqlen
715
+ attention_scale: Optional[torch.FloatTensor] = None,
716
+ subset_indices: Optional[torch.LongTensor] = None,
717
+ head_mask: Optional[torch.FloatTensor] = None,
718
+ output_attentions: Optional[bool] = False,
719
+ output_hidden_states: Optional[bool] = False,
720
+ return_dict: Optional[bool] = True,
721
+ ) -> Union[Tuple[torch.Tensor], BaseModelOutput]:
722
+ all_hidden_states = () if output_hidden_states else None
723
+ all_self_attentions = () if output_attentions else None
724
+
725
+ for i, layer_module in enumerate(self.layer):
726
+ if output_hidden_states:
727
+ all_hidden_states = all_hidden_states + (hidden_states,)
728
+
729
+ if i >= len(self.layer) - 1:
730
+ layer_subset_indices = subset_indices
731
+ else:
732
+ layer_subset_indices = None
733
+
734
+ layer_head_mask = head_mask[i] if head_mask is not None else None
735
+
736
+ if self.gradient_checkpointing and self.training:
737
+ layer_outputs = self._gradient_checkpointing_func(
738
+ layer_module.__call__,
739
+ hidden_states,
740
+ attention_bias,
741
+ rope_embeds,
742
+ padding_inputs,
743
+ attention_scale,
744
+ layer_subset_indices,
745
+ layer_head_mask,
746
+ )
747
+ else:
748
+ layer_outputs = layer_module(
749
+ hidden_states,
750
+ attention_bias,
751
+ rope_embeds,
752
+ padding_inputs,
753
+ attention_scale,
754
+ layer_subset_indices,
755
+ layer_head_mask,
756
+ output_attentions,
757
+ )
758
+
759
+ hidden_states = layer_outputs[0]
760
+ if output_attentions:
761
+ all_self_attentions = all_self_attentions + (layer_outputs[1],)
762
+
763
+ if output_hidden_states:
764
+ all_hidden_states = all_hidden_states + (hidden_states,)
765
+
766
+ if not return_dict:
767
+ return tuple(
768
+ v
769
+ for v in [
770
+ hidden_states,
771
+ all_hidden_states,
772
+ all_self_attentions,
773
+ ]
774
+ if v is not None
775
+ )
776
+ return BaseModelOutput(
777
+ last_hidden_state=hidden_states,
778
+ hidden_states=all_hidden_states,
779
+ attentions=all_self_attentions,
780
+ )
781
+
782
+
783
+ # Copied from transformers.models.bert.modeling_bert.BertPooler with Bert->New
784
+ class NewPooler(nn.Module):
785
+ def __init__(self, config):
786
+ super().__init__()
787
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
788
+ self.activation = nn.Tanh()
789
+
790
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
791
+ # We "pool" the model by simply taking the hidden state corresponding
792
+ # to the first token.
793
+ first_token_tensor = hidden_states[:, 0]
794
+ pooled_output = self.dense(first_token_tensor)
795
+ pooled_output = self.activation(pooled_output)
796
+ return pooled_output
797
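For reference, a one-line sketch of what `NewPooler` consumes: the first-token hidden state of each sequence, which is then passed through the dense layer and tanh. The hidden size 768 is illustrative only.

```python
# CLS pooling sketch: take the first token's hidden state per sequence.
import torch

last_hidden_state = torch.randn(2, 7, 768)   # (batch, seq_len, hidden)
cls_states = last_hidden_state[:, 0]         # (batch, hidden) -> fed to dense + tanh
print(cls_states.shape)                      # torch.Size([2, 768])
```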
+
798
+
799
+ class NewPreTrainedModel(PreTrainedModel):
800
+ """
801
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
802
+ models.
803
+ """
804
+
805
+ config_class = NewConfig
806
+ base_model_prefix = "new"
807
+ supports_gradient_checkpointing = True
808
+ _supports_sdpa = True
809
+
810
+ def _init_weights(self, module):
811
+ """Initialize the weights"""
812
+ if isinstance(module, nn.Linear):
813
+ # Slightly different from the TF version which uses truncated_normal for initialization
814
+ # cf https://github.com/pytorch/pytorch/pull/5617
815
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
816
+ if module.bias is not None:
817
+ module.bias.data.zero_()
818
+ elif isinstance(module, nn.Embedding):
819
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
820
+ if module.padding_idx is not None:
821
+ module.weight.data[module.padding_idx].zero_()
822
+ elif isinstance(module, nn.LayerNorm):
823
+ module.bias.data.zero_()
824
+ module.weight.data.fill_(1.0)
825
+
826
+
827
+ class NewModel(NewPreTrainedModel):
828
+ """
829
+ The bare New Model transformer outputting raw hidden-states without any specific head on top.
830
+ """
831
+
832
+ def __init__(self, config: NewConfig, add_pooling_layer=False):
833
+ super().__init__(config)
834
+ self.config = config
835
+
836
+ self.embeddings = NewEmbeddings(config)
837
+ self.encoder = NewEncoder(config)
838
+
839
+ self.pooler = NewPooler(config) if add_pooling_layer else None
840
+
841
+ # Initialize weights and apply final processing
842
+ self.post_init()
843
+
844
+ def get_input_embeddings(self):
845
+ return self.embeddings.word_embeddings
846
+
847
+ def set_input_embeddings(self, value):
848
+ self.embeddings.word_embeddings = value
849
+
850
+ def forward(
851
+ self,
852
+ input_ids: Optional[torch.Tensor] = None,
853
+ attention_mask: Optional[torch.Tensor] = None,
854
+ length: Optional[List[int]] = None,
855
+ subset_indices: Optional[torch.LongTensor] = None,
856
+ token_type_ids: Optional[torch.Tensor] = None,
857
+ position_ids: Optional[torch.Tensor] = None,
858
+ head_mask: Optional[torch.Tensor] = None,
859
+ inputs_embeds: Optional[torch.Tensor] = None,
860
+ output_attentions: Optional[bool] = None,
861
+ output_hidden_states: Optional[bool] = None,
862
+ return_dict: Optional[bool] = None,
863
+ unpad_inputs: Optional[bool] = None,
864
+ ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPooling]:
865
+ r"""
866
+ length (`list` of length `batch_size`, *optional*):
867
+ If `None`, return the padded `last_hidden_state`; otherwise treat it as the per-sample lengths of unpadded inputs.
868
+ subset_indices (`torch.LongTensor`, *optional*):
869
+ Indices of token positions to keep after the last encoder layer (e.g. only the masked tokens during pretraining).
870
+ unpad_inputs (`bool`, *optional*):
871
+ Whether to strip padding tokens before the encoder; defaults to `config.unpad_inputs`.
872
+ """
873
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
874
+ output_hidden_states = (
875
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
876
+ )
877
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
878
+ unpad_inputs = unpad_inputs if unpad_inputs is not None else self.config.unpad_inputs
879
+ output_padded = length is None
880
+
881
+ if input_ids is not None and inputs_embeds is not None:
882
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
883
+ elif input_ids is not None:
884
+ self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
885
+ input_shape = input_ids.size()
886
+ elif inputs_embeds is not None:
887
+ input_shape = inputs_embeds.size()[:-1]
888
+ else:
889
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
890
+
891
+ # TODO: not used
892
+ # # Prepare head mask if needed
893
+ # # 1.0 in head_mask indicate we keep the head
894
+ # # attention_probs has shape bsz x n_heads x N x N
895
+ # # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
896
+ # # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
897
+ # head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
898
+
899
+ # Get embeddings, may unpad them
900
+ (embedding_output, attention_mask, rope_embeds, length) = self.embeddings(
901
+ unpad_inputs,
902
+ input_ids=input_ids,
903
+ attention_mask=attention_mask,
904
+ length=length,
905
+ token_type_ids=token_type_ids,
906
+ position_ids=position_ids,
907
+ inputs_embeds=inputs_embeds
908
+ )
909
+
910
+ batch_size, seq_length = input_shape
911
+ if unpad_inputs and self.config.use_memory_efficient_attention:
912
+ attention_bias = xops.fmha.attn_bias.BlockDiagonalMask.from_seqlens(length)
913
+ else:
914
+ # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
915
+ # ourselves in which case we just need to make it broadcastable to all heads.
916
+ attention_bias = self.get_extended_attention_mask(attention_mask, input_shape)
917
+ if self.config.use_memory_efficient_attention:
918
+ # Memory-efficient attention needs a full (batch, n_heads, seq_len, seq_len) bias rather than (batch, 1, 1, seq_len)
919
+ attention_bias = attention_bias.expand(-1, self.config.num_attention_heads, seq_length, -1)
920
+
921
+ padding_inputs = None
922
+ if unpad_inputs and (output_padded or not self.config.use_memory_efficient_attention):
923
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
924
+ if not self.config.use_memory_efficient_attention:
925
+ padding_inputs = (indices, *input_shape)
926
+
927
+ attention_scale = None
928
+ if self.config.logn_attention_scale:
929
+ logger.warning_once("TODO: logn_attention_scale")
930
+ # # attention scale log_512(input_len)
931
+ # attention_scale = attention_mask.sum(1).log() / torch.tensor(self.config.max_position_embeddings).log()
932
+ # # inference-time logn scale need clip 1
933
+ # if self.config.logn_attention_clip1:
934
+ # attention_scale.clip_(1)
935
+ # attention_scale = attention_scale[:, None, None, None]
936
+ # else:
937
+ # attention_scale = None
938
+
939
+ encoder_outputs = self.encoder(
940
+ embedding_output,
941
+ attention_bias=attention_bias,
942
+ rope_embeds=rope_embeds,
943
+ padding_inputs=padding_inputs,
944
+ attention_scale=attention_scale,
945
+ subset_indices=subset_indices,
946
+ head_mask=head_mask,
947
+ output_attentions=output_attentions,
948
+ output_hidden_states=output_hidden_states,
949
+ return_dict=return_dict,
950
+ )
951
+ sequence_output = encoder_outputs[0]
952
+ if unpad_inputs and output_padded:
953
+ sequence_output = pad_input(
954
+ sequence_output.squeeze(), indices, batch_size, seq_length
955
+ )
956
+
957
+ pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
958
+
959
+ if not return_dict:
960
+ return (sequence_output, pooled_output) + encoder_outputs[1:]
961
+
962
+ return BaseModelOutputWithPooling(
963
+ last_hidden_state=sequence_output,
964
+ pooler_output=pooled_output,
965
+ hidden_states=encoder_outputs.hidden_states,
966
+ attentions=encoder_outputs.attentions,
967
+ )
968
+
969
+
970
+ class NewLMPredictionHead(nn.Module):
971
+ def __init__(self, config):
972
+ super().__init__()
973
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
974
+ self.transform_act_fn = ACT2FN[config.hidden_act]
975
+ self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
976
+
977
+ # The output weights are the same as the input embeddings, but there is
978
+ # an output-only bias for each token.
979
+ self.decoder = nn.Linear(config.hidden_size, config.vocab_size)
980
+
981
+ def forward(self, hidden_states):
982
+ hidden_states = self.dense(hidden_states)
983
+ hidden_states = self.transform_act_fn(hidden_states)
984
+ hidden_states = self.norm(hidden_states)
985
+ hidden_states = self.decoder(hidden_states)
986
+ return hidden_states
987
+
988
+
989
+ class NewForMaskedLM(NewPreTrainedModel):
990
+ _tied_weights_keys = ["lm_head.decoder.bias", "lm_head.decoder.weight"]
991
+
992
+ def __init__(self, config: NewConfig):
993
+ super().__init__(config)
994
+ self.new = NewModel(config, add_pooling_layer=False)
995
+ self.lm_head = NewLMPredictionHead(config)
996
+ self.loss_fct = nn.CrossEntropyLoss()
997
+
998
+ # Initialize weights and apply final processing
999
+ self.post_init()
1000
+
1001
+ def get_output_embeddings(self):
1002
+ return self.lm_head.decoder
1003
+
1004
+ def set_output_embeddings(self, new_embeddings):
1005
+ self.lm_head.decoder = new_embeddings
1006
+
1007
+ def forward(
1008
+ self,
1009
+ input_ids: Optional[torch.Tensor] = None,
1010
+ attention_mask: Optional[torch.Tensor] = None,
1011
+ token_type_ids: Optional[torch.Tensor] = None,
1012
+ position_ids: Optional[torch.Tensor] = None,
1013
+ head_mask: Optional[torch.Tensor] = None,
1014
+ inputs_embeds: Optional[torch.Tensor] = None,
1015
+ labels: Optional[torch.Tensor] = None,
1016
+ output_attentions: Optional[bool] = None,
1017
+ output_hidden_states: Optional[bool] = None,
1018
+ return_dict: Optional[bool] = None,
1019
+ unpad_inputs: Optional[bool] = None,
1020
+ ) -> Union[Tuple[torch.Tensor], MaskedLMOutput]:
1021
+ r"""
1022
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1023
+ Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
1024
+ config.vocab_size]` (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked); the
1025
+ loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
1026
+ """
1027
+
1028
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1029
+
1030
+ if labels is None or not self.new.config.unpad_inputs:
1031
+ length = None
1032
+ subset_indices = None
1033
+ else:
1034
+ length = attention_mask.sum(-1).tolist()
1035
+ labels = labels[attention_mask.bool()].unsqueeze(0)
1036
+ subset_indices = labels > -100
1037
+
1038
+ outputs = self.new(
1039
+ input_ids,
1040
+ attention_mask=attention_mask,
1041
+ length=length,
1042
+ subset_indices=subset_indices,
1043
+ token_type_ids=token_type_ids,
1044
+ position_ids=position_ids,
1045
+ head_mask=head_mask,
1046
+ inputs_embeds=inputs_embeds,
1047
+ output_attentions=output_attentions,
1048
+ output_hidden_states=output_hidden_states,
1049
+ return_dict=return_dict,
1050
+ unpad_inputs=unpad_inputs,
1051
+ )
1052
+
1053
+ sequence_output = outputs[0]
1054
+ prediction_scores = self.lm_head(sequence_output)
1055
+
1056
+ masked_lm_loss = None
1057
+ if labels is not None:
1058
+ if subset_indices is None:
1059
+ mask = attention_mask.bool()
1060
+ prediction_scores = prediction_scores[mask]
1061
+ labels = labels[mask]
1062
+ else:
1063
+ labels = labels[subset_indices]
1064
+ masked_lm_loss = self.loss_fct(prediction_scores, labels)
1065
+
1066
+ if not return_dict:
1067
+ output = (prediction_scores,) + outputs[2:]
1068
+ return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
1069
+
1070
+ return MaskedLMOutput(
1071
+ loss=masked_lm_loss,
1072
+ logits=prediction_scores,
1073
+ hidden_states=outputs.hidden_states,
1074
+ attentions=outputs.attentions,
1075
+ )
1076
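A small sketch of the label convention documented above (not part of the pushed file): positions labelled `-100` are ignored by `CrossEntropyLoss`, so only masked positions contribute to the MLM loss. The sizes and label values are illustrative.

```python
# -100 labels are skipped by CrossEntropyLoss (its default ignore_index).
import torch
from torch import nn

vocab_size = 10
logits = torch.randn(1, 4, vocab_size)         # (batch, seq_len, vocab)
labels = torch.tensor([[-100, 3, -100, 7]])    # only positions 1 and 3 are scored

loss = nn.CrossEntropyLoss()(logits.view(-1, vocab_size), labels.view(-1))
print(loss)
```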
+
1077
+
1078
+ class NewForSequenceClassification(NewPreTrainedModel):
1079
+ def __init__(self, config):
1080
+ super().__init__(config)
1081
+ self.num_labels = config.num_labels
1082
+ self.config = config
1083
+
1084
+ self.new = NewModel(config, add_pooling_layer=True)
1085
+ classifier_dropout = (
1086
+ config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
1087
+ )
1088
+ self.dropout = nn.Dropout(classifier_dropout)
1089
+ self.classifier = nn.Linear(config.hidden_size, config.num_labels)
1090
+
1091
+ # Initialize weights and apply final processing
1092
+ self.post_init()
1093
+
1094
+ def forward(
1095
+ self,
1096
+ input_ids: Optional[torch.Tensor] = None,
1097
+ attention_mask: Optional[torch.Tensor] = None,
1098
+ token_type_ids: Optional[torch.Tensor] = None,
1099
+ position_ids: Optional[torch.Tensor] = None,
1100
+ head_mask: Optional[torch.Tensor] = None,
1101
+ inputs_embeds: Optional[torch.Tensor] = None,
1102
+ labels: Optional[torch.Tensor] = None,
1103
+ output_attentions: Optional[bool] = None,
1104
+ output_hidden_states: Optional[bool] = None,
1105
+ return_dict: Optional[bool] = None,
1106
+ unpad_inputs: Optional[bool] = None,
1107
+ ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
1108
+ r"""
1109
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1110
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1111
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss). If
1112
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1113
+ """
1114
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1115
+
1116
+ outputs = self.new(
1117
+ input_ids,
1118
+ attention_mask=attention_mask,
1119
+ token_type_ids=token_type_ids,
1120
+ position_ids=position_ids,
1121
+ head_mask=head_mask,
1122
+ inputs_embeds=inputs_embeds,
1123
+ output_attentions=output_attentions,
1124
+ output_hidden_states=output_hidden_states,
1125
+ return_dict=return_dict,
1126
+ unpad_inputs=unpad_inputs,
1127
+ )
1128
+
1129
+ pooled_output = outputs[1]
1130
+
1131
+ pooled_output = self.dropout(pooled_output)
1132
+ logits = self.classifier(pooled_output)
1133
+
1134
+ loss = None
1135
+ if labels is not None:
1136
+ if self.config.problem_type is None:
1137
+ if self.num_labels == 1:
1138
+ self.config.problem_type = "regression"
1139
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1140
+ self.config.problem_type = "single_label_classification"
1141
+ else:
1142
+ self.config.problem_type = "multi_label_classification"
1143
+
1144
+ if self.config.problem_type == "regression":
1145
+ loss_fct = nn.MSELoss()
1146
+ if self.num_labels == 1:
1147
+ loss = loss_fct(logits.squeeze(), labels.squeeze())
1148
+ else:
1149
+ loss = loss_fct(logits, labels)
1150
+ elif self.config.problem_type == "single_label_classification":
1151
+ loss_fct = nn.CrossEntropyLoss()
1152
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
1153
+ elif self.config.problem_type == "multi_label_classification":
1154
+ loss_fct = nn.BCEWithLogitsLoss()
1155
+ loss = loss_fct(logits, labels)
1156
+
1157
+ if not return_dict:
1158
+ output = (logits,) + outputs[2:]
1159
+ return ((loss,) + output) if loss is not None else output
1160
+
1161
+ return SequenceClassifierOutput(
1162
+ loss=loss,
1163
+ logits=logits,
1164
+ hidden_states=outputs.hidden_states,
1165
+ attentions=outputs.attentions,
1166
+ )
1167
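A hedged sketch of the `problem_type` dispatch above: integer labels with `num_labels > 1` use `CrossEntropyLoss`, float multi-hot labels use `BCEWithLogitsLoss`, and `num_labels == 1` falls back to `MSELoss` (regression). Shapes and labels here are toy values.

```python
# Loss selection sketch for the sequence-classification head.
import torch
from torch import nn

logits = torch.randn(4, 3)                        # (batch, num_labels)
int_labels = torch.tensor([0, 2, 1, 0])           # -> single_label_classification
multi_hot = torch.randint(0, 2, (4, 3)).float()   # -> multi_label_classification

print(nn.CrossEntropyLoss()(logits, int_labels))
print(nn.BCEWithLogitsLoss()(logits, multi_hot))
```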
+
1168
+
1169
+ class NewForMultipleChoice(NewPreTrainedModel):
1170
+ def __init__(self, config):
1171
+ super().__init__(config)
1172
+
1173
+ self.new = NewModel(config, add_pooling_layer=True)
1174
+ classifier_dropout = (
1175
+ config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
1176
+ )
1177
+ self.dropout = nn.Dropout(classifier_dropout)
1178
+ self.classifier = nn.Linear(config.hidden_size, 1)
1179
+
1180
+ # Initialize weights and apply final processing
1181
+ self.post_init()
1182
+
1183
+ def forward(
1184
+ self,
1185
+ input_ids: Optional[torch.Tensor] = None,
1186
+ attention_mask: Optional[torch.Tensor] = None,
1187
+ token_type_ids: Optional[torch.Tensor] = None,
1188
+ position_ids: Optional[torch.Tensor] = None,
1189
+ head_mask: Optional[torch.Tensor] = None,
1190
+ inputs_embeds: Optional[torch.Tensor] = None,
1191
+ labels: Optional[torch.Tensor] = None,
1192
+ output_attentions: Optional[bool] = None,
1193
+ output_hidden_states: Optional[bool] = None,
1194
+ return_dict: Optional[bool] = None,
1195
+ unpad_inputs: Optional[bool] = None,
1196
+ ) -> Union[Tuple[torch.Tensor], MultipleChoiceModelOutput]:
1197
+ r"""
1198
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1199
+ Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,
1200
+ num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
1201
+ `input_ids` above)
1202
+ """
1203
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1204
+ num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
1205
+
1206
+ input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
1207
+ attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None
1208
+ token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None
1209
+ position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
1210
+ inputs_embeds = (
1211
+ inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))
1212
+ if inputs_embeds is not None
1213
+ else None
1214
+ )
1215
+
1216
+ outputs = self.new(
1217
+ input_ids,
1218
+ attention_mask=attention_mask,
1219
+ token_type_ids=token_type_ids,
1220
+ position_ids=position_ids,
1221
+ head_mask=head_mask,
1222
+ inputs_embeds=inputs_embeds,
1223
+ output_attentions=output_attentions,
1224
+ output_hidden_states=output_hidden_states,
1225
+ return_dict=return_dict,
1226
+ unpad_inputs=unpad_inputs,
1227
+ )
1228
+
1229
+ pooled_output = outputs[1]
1230
+
1231
+ pooled_output = self.dropout(pooled_output)
1232
+ logits = self.classifier(pooled_output)
1233
+ reshaped_logits = logits.view(-1, num_choices)
1234
+
1235
+ loss = None
1236
+ if labels is not None:
1237
+ loss_fct = nn.CrossEntropyLoss()
1238
+ loss = loss_fct(reshaped_logits, labels)
1239
+
1240
+ if not return_dict:
1241
+ output = (reshaped_logits,) + outputs[2:]
1242
+ return ((loss,) + output) if loss is not None else output
1243
+
1244
+ return MultipleChoiceModelOutput(
1245
+ loss=loss,
1246
+ logits=reshaped_logits,
1247
+ hidden_states=outputs.hidden_states,
1248
+ attentions=outputs.attentions,
1249
+ )
1250
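A sketch of the multiple-choice reshaping above (toy sizes, with a random tensor standing in for the pooler output): choice inputs are flattened to `(batch * num_choices, seq_len)` for the encoder, and the per-choice scores are reshaped back to `(batch, num_choices)` before `CrossEntropyLoss`.

```python
# Multiple-choice flatten/reshape sketch.
import torch
from torch import nn

batch, num_choices, seq_len, hidden = 2, 4, 6, 8
input_ids = torch.randint(0, 100, (batch, num_choices, seq_len))
flat_ids = input_ids.view(-1, input_ids.size(-1))                      # (8, 6) fed to the encoder

pooled = torch.randn(batch * num_choices, hidden)                      # stand-in for the pooler output
reshaped_logits = nn.Linear(hidden, 1)(pooled).view(-1, num_choices)   # (2, 4)
labels = torch.tensor([1, 3])                                          # correct choice per example
print(nn.CrossEntropyLoss()(reshaped_logits, labels))
```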
+
1251
+
1252
+ class NewForTokenClassification(NewPreTrainedModel):
1253
+ def __init__(self, config):
1254
+ super().__init__(config)
1255
+ self.num_labels = config.num_labels
1256
+
1257
+ self.new = NewModel(config, add_pooling_layer=False)
1258
+ classifier_dropout = (
1259
+ config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
1260
+ )
1261
+ self.dropout = nn.Dropout(classifier_dropout)
1262
+ self.classifier = nn.Linear(config.hidden_size, config.num_labels)
1263
+
1264
+ # Initialize weights and apply final processing
1265
+ self.post_init()
1266
+
1267
+ def forward(
1268
+ self,
1269
+ input_ids: Optional[torch.Tensor] = None,
1270
+ attention_mask: Optional[torch.Tensor] = None,
1271
+ token_type_ids: Optional[torch.Tensor] = None,
1272
+ position_ids: Optional[torch.Tensor] = None,
1273
+ head_mask: Optional[torch.Tensor] = None,
1274
+ inputs_embeds: Optional[torch.Tensor] = None,
1275
+ labels: Optional[torch.Tensor] = None,
1276
+ output_attentions: Optional[bool] = None,
1277
+ output_hidden_states: Optional[bool] = None,
1278
+ return_dict: Optional[bool] = None,
1279
+ unpad_inputs: Optional[bool] = None,
1280
+ ) -> Union[Tuple[torch.Tensor], TokenClassifierOutput]:
1281
+ r"""
1282
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1283
+ Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
1284
+ """
1285
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1286
+
1287
+ outputs = self.new(
1288
+ input_ids,
1289
+ attention_mask=attention_mask,
1290
+ token_type_ids=token_type_ids,
1291
+ position_ids=position_ids,
1292
+ head_mask=head_mask,
1293
+ inputs_embeds=inputs_embeds,
1294
+ output_attentions=output_attentions,
1295
+ output_hidden_states=output_hidden_states,
1296
+ return_dict=return_dict,
1297
+ unpad_inputs=unpad_inputs,
1298
+ )
1299
+
1300
+ sequence_output = outputs[0]
1301
+
1302
+ sequence_output = self.dropout(sequence_output)
1303
+ logits = self.classifier(sequence_output)
1304
+
1305
+ loss = None
1306
+ if labels is not None:
1307
+ loss_fct = nn.CrossEntropyLoss()
1308
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
1309
+
1310
+ if not return_dict:
1311
+ output = (logits,) + outputs[2:]
1312
+ return ((loss,) + output) if loss is not None else output
1313
+
1314
+ return TokenClassifierOutput(
1315
+ loss=loss,
1316
+ logits=logits,
1317
+ hidden_states=outputs.hidden_states,
1318
+ attentions=outputs.attentions,
1319
+ )
1320
+
1321
+
1322
+ class NewForQuestionAnswering(NewPreTrainedModel):
1323
+ def __init__(self, config):
1324
+ super().__init__(config)
1325
+ self.num_labels = config.num_labels
1326
+
1327
+ self.new = NewModel(config, add_pooling_layer=False)
1328
+ self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
1329
+
1330
+ # Initialize weights and apply final processing
1331
+ self.post_init()
1332
+
1333
+ def forward(
1334
+ self,
1335
+ input_ids: Optional[torch.Tensor] = None,
1336
+ attention_mask: Optional[torch.Tensor] = None,
1337
+ token_type_ids: Optional[torch.Tensor] = None,
1338
+ position_ids: Optional[torch.Tensor] = None,
1339
+ head_mask: Optional[torch.Tensor] = None,
1340
+ inputs_embeds: Optional[torch.Tensor] = None,
1341
+ start_positions: Optional[torch.Tensor] = None,
1342
+ end_positions: Optional[torch.Tensor] = None,
1343
+ output_attentions: Optional[bool] = None,
1344
+ output_hidden_states: Optional[bool] = None,
1345
+ return_dict: Optional[bool] = None,
1346
+ unpad_inputs: Optional[bool] = None,
1347
+ ) -> Union[Tuple[torch.Tensor], QuestionAnsweringModelOutput]:
1348
+ r"""
1349
+ start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1350
+ Labels for position (index) of the start of the labelled span for computing the token classification loss.
1351
+ Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
1352
+ are not taken into account for computing the loss.
1353
+ end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1354
+ Labels for position (index) of the end of the labelled span for computing the token classification loss.
1355
+ Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
1356
+ are not taken into account for computing the loss.
1357
+ """
1358
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1359
+
1360
+ outputs = self.new(
1361
+ input_ids,
1362
+ attention_mask=attention_mask,
1363
+ token_type_ids=token_type_ids,
1364
+ position_ids=position_ids,
1365
+ head_mask=head_mask,
1366
+ inputs_embeds=inputs_embeds,
1367
+ output_attentions=output_attentions,
1368
+ output_hidden_states=output_hidden_states,
1369
+ return_dict=return_dict,
1370
+ unpad_inputs=unpad_inputs,
1371
+ )
1372
+
1373
+ sequence_output = outputs[0]
1374
+
1375
+ logits = self.qa_outputs(sequence_output)
1376
+ start_logits, end_logits = logits.split(1, dim=-1)
1377
+ start_logits = start_logits.squeeze(-1).contiguous()
1378
+ end_logits = end_logits.squeeze(-1).contiguous()
1379
+
1380
+ total_loss = None
1381
+ if start_positions is not None and end_positions is not None:
1382
+ # If we are on multi-GPU, splitting adds a dimension
1383
+ if len(start_positions.size()) > 1:
1384
+ start_positions = start_positions.squeeze(-1)
1385
+ if len(end_positions.size()) > 1:
1386
+ end_positions = end_positions.squeeze(-1)
1387
+ # sometimes the start/end positions are outside our model inputs; we ignore these terms
1388
+ ignored_index = start_logits.size(1)
1389
+ start_positions = start_positions.clamp(0, ignored_index)
1390
+ end_positions = end_positions.clamp(0, ignored_index)
1391
+
1392
+ loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)
1393
+ start_loss = loss_fct(start_logits, start_positions)
1394
+ end_loss = loss_fct(end_logits, end_positions)
1395
+ total_loss = (start_loss + end_loss) / 2
1396
+
1397
+ if not return_dict:
1398
+ output = (start_logits, end_logits) + outputs[2:]
1399
+ return ((total_loss,) + output) if total_loss is not None else output
1400
+
1401
+ return QuestionAnsweringModelOutput(
1402
+ loss=total_loss,
1403
+ start_logits=start_logits,
1404
+ end_logits=end_logits,
1405
+ hidden_states=outputs.hidden_states,
1406
+ attentions=outputs.attentions,
1407
+ )
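Usage sketch (not part of the pushed files): everything above is custom "remote" modeling code, so loading it through transformers requires `trust_remote_code=True`. The model id below is a placeholder for whichever repository these files are pushed to.

```python
# Load the custom encoder via transformers and take the CLS-token representation.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "your-username/your-model"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer(["a short test sentence"], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0]   # first-token representation
print(cls_embedding.shape)
```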
modules.json ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_Pooling",
12
+ "type": "sentence_transformers.models.Pooling"
13
+ }
14
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ {
2
+ "max_seq_length": 8192,
3
+ "do_lower_case": false
4
+ }
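Usage sketch (not part of the pushed files): `modules.json` tells sentence-transformers to stack a Transformer module (the custom encoder above) and a Pooling module, while `sentence_bert_config.json` caps the sequence length at 8192 tokens. The repo id is a placeholder, and `trust_remote_code=True` is needed because of the custom modeling code.

```python
# Load the full sentence-transformers stack declared by modules.json.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("your-username/your-model", trust_remote_code=True)  # placeholder id
print(model.max_seq_length)                     # 8192, from sentence_bert_config.json
embeddings = model.encode(["a short test sentence"])
print(embeddings.shape)
```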
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cls_token": {
3
+ "content": "[CLS]",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "mask_token": {
10
+ "content": "[MASK]",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "[PAD]",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "sep_token": {
24
+ "content": "[SEP]",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "unk_token": {
31
+ "content": "[UNK]",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ }
37
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,62 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[PAD]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "100": {
12
+ "content": "[UNK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "101": {
20
+ "content": "[CLS]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "102": {
28
+ "content": "[SEP]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "103": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "clean_up_tokenization_spaces": true,
45
+ "cls_token": "[CLS]",
46
+ "do_lower_case": true,
47
+ "mask_token": "[MASK]",
48
+ "max_length": 512,
49
+ "model_max_length": 8192,
50
+ "pad_to_multiple_of": null,
51
+ "pad_token": "[PAD]",
52
+ "pad_token_type_id": 0,
53
+ "padding_side": "right",
54
+ "sep_token": "[SEP]",
55
+ "stride": 0,
56
+ "strip_accents": null,
57
+ "tokenize_chinese_chars": true,
58
+ "tokenizer_class": "BertTokenizer",
59
+ "truncation_side": "right",
60
+ "truncation_strategy": "longest_first",
61
+ "unk_token": "[UNK]"
62
+ }
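Sketch (not part of the pushed files): the tokenizer files describe a standard BERT WordPiece tokenizer; the special-token ids printed below come directly from `added_tokens_decoder` above, and `model_max_length` from this config. The repo id is a placeholder.

```python
# Inspect the special tokens defined in tokenizer_config.json.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-username/your-model")  # placeholder repo id
print(tok.pad_token_id, tok.unk_token_id, tok.cls_token_id, tok.sep_token_id, tok.mask_token_id)
# 0 100 101 102 103
print(tok.model_max_length)  # 8192
```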
vocab.txt ADDED
The diff for this file is too large to render. See raw diff