aws-prototyping committed
Commit 7eba212
Parent: 93963dd

Initial commit of the model files.

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ model.safetensors filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,159 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ inference: false
+ ---
+
+ # MegaBeam-Mistral-7B-300k-AWQ Model
+
+ MegaBeam-Mistral-7B-300k-AWQ is a version of the [MegaBeam-Mistral-7B-300k](https://huggingface.co/amazon/MegaBeam-Mistral-7B-300k) model that was
+ quantized using the AWQ method developed by [Lin et al. (2023)](https://arxiv.org/abs/2306.00978).
+ The MegaBeam-Mistral-7B-300k-AWQ model files are approximately **70% smaller** than those of MegaBeam-Mistral-7B-300k, while maintaining comparable performance.
+
+ Please refer to the [original MegaBeam-Mistral-7B-300k model card](https://huggingface.co/amazon/MegaBeam-Mistral-7B-300k) for details about the model
+ preparation and training processes.
+
+ ## MegaBeam-Mistral-7B-300k Variants
+
+ | Branch | Approx. Model Size | `q_group_size` | `w_bit` | `version` |
+ |--------|-------------------:|---------------:|--------:|-----------|
+ | [main](https://huggingface.co/aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ/tree/main) | 3.9 GB | 128 | 4 | GEMM |
+ | [MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM](https://huggingface.co/aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ/tree/MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM) | 4.0 GB | 64 | 4 | GEMM |
+ | [MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM](https://huggingface.co/aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ/tree/MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM) | 4.3 GB | 32 | 4 | GEMM |
+
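+ Each non-default variant is stored on its own branch. As a minimal sketch (assuming the standard `huggingface_hub` API), a variant can be fetched by passing its branch name as `revision`:
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Download the 64-group-size variant; `revision` selects the
+ # branch named in the table above.
+ local_dir = snapshot_download(
+     repo_id="aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ",
+     revision="MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM",
+ )
+ print(local_dir)
+ ```
+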
+ ## Dependencies
+ - [`autoawq==0.2.5`](https://pypi.org/project/autoawq/0.2.5/) – [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) was used to quantize the MegaBeam-Mistral-7B-300k model (see the sketch after this list).
+ - [`vllm==0.4.2`](https://pypi.org/project/vllm/0.4.2/) – [vLLM](https://github.com/vllm-project/vllm) was used to host the models for benchmarking.
+
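+ The exact quantization script is not included in this repository; the following is a hedged sketch of the `main`-branch settings using AutoAWQ's standard quantization flow (the `quant_config` values mirror the variants table and `config.json`; the calibration data is AutoAWQ's default and is an assumption here):
+
+ ```python
+ from awq import AutoAWQForCausalLM
+ from transformers import AutoTokenizer
+
+ model_path = "amazon/MegaBeam-Mistral-7B-300k"
+ quant_path = "MegaBeam-Mistral-7B-300k-AWQ"
+
+ # 4-bit weights, group size 128, GEMM kernels -- the `main` branch settings.
+ quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
+
+ model = AutoAWQForCausalLM.from_pretrained(model_path)
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+ model.quantize(tokenizer, quant_config=quant_config)
+ model.save_quantized(quant_path)
+ tokenizer.save_pretrained(quant_path)
+ ```
+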
+ ## Evaluations
+
+ ### InfiniteBench
+
+ This benchmark was developed by [Zhang et al. (2024)](https://arxiv.org/abs/2402.13718) and is available from https://github.com/OpenBMB/InfiniteBench.
+
+ See the [original MegaBeam-Mistral-7B-300k model card](https://huggingface.co/amazon/MegaBeam-Mistral-7B-300k)
+ for more details.
+
+ | Task Name | MegaBeam-Mistral-7B-300k-AWQ | MegaBeam-Mistral-7B-300k | Mistral-7B-Instruct-v0.2 | Llama-3-8B-Instruct-262k | Llama3-70B-1M | GPT-4-1106-preview | YaRN-Mistral-7B | Kimi-Chat | Claude 2 | Yi-6B-200K | Yi-34B-200K | Chatglm3-6B-128K |
+ |------------------|------------------------------|--------------------------|--------------------------|--------------------------|---------------|--------------------|-----------------|-----------|----------|------------|-------------|------------------|
+ | Retrieve.PassKey | 100% | 100% | 75.76% | 98.30% | 81.35% | 100% | 92.71% | 98.14% | 97.80% | 100.00% | 100.00% | 92.20% |
+ | Retrieve.Number | 92.7% | 96.10% | 25.25% | 97.79% | 97.62% | 100% | 56.61% | 95.42% | 98.14% | 94.92% | 100.00% | 80.68% |
+ | Retrieve.KV | 0% | 0% | 0% | 3.40% | 3% | 89.00% | < 5% | 53.60% | 65.40% | < 5% | < 5% | < 5% |
+ | En.Sum | 29.05% | 29.39% | 22.13% | 16.40% | 20.72% | 14.73% | 9.09% | 17.93% | 14.45% | < 5% | < 5% | < 5% |
+ | En.QA | 15.69% | 14.93% | 4.93% | 13.20% | 16.52% | 22.22% | 9.55% | 16.52% | 11.97% | 9.20% | 12.17% | < 5% |
+ | En.MC | 48.91% | 51.52% | 7.80% | 50.65% | 62% | 67.25% | 27.95% | 72.49% | 62.88% | 36.68% | 38.43% | 10.48% |
+ | En.Dia | 11.50% | 9.50% | 3.50% | 1% | 12.50% | 8.50% | 7.50% | 11.50% | 46.50% | < 5% | < 5% | < 5% |
+ | Zh.QA | 10.53% | 10.71% | 3.43% | 19.02% | 26% | 25.96% | 14.43% | 17.93% | 9.64% | 15.07% | 13.61% | < 5% |
+ | Code.Debug | 21.83% | 27.41% | 11.60% | 22.08% | 23.85% | 39.59% | < 5% | 18.02% | < 5% | < 5% | < 5% | < 5% |
+ | Code.Run | 1.25% | 1.75% | 0.25% | 0% | 0% | 23.25% | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% |
+ | Math.Calc | 0% | 0% | 0% | 0% | 0% | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% |
+ | Math.Find | 20.57% | 24.28% | 26.28% | 15.40% | 30% | 60.00% | 17.14% | 12.57% | 32.29% | < 5% | 25.71% | 7.71% |
+ | **Average** | 29.34% | 30.70% | 15.08% | 28.10% | 31.13% | 46.08% | 20.41% | 34.93% | 37.21% | 22.78% | 25.41% | 17.59% |
+
+ ### Long Context
+
+ The following benchmark results are shown as _accuracy_ (%) values, unless stated otherwise.
+
+ #### Topic Retrieval
+
+ See https://lmsys.org/blog/2023-06-29-longchat/
+
+ | Model Name | n_topics=05 | n_topics=10 | n_topics=15 | n_topics=20 | n_topics=25 |
+ |:---------------------------------------------------|--------------:|--------------:|--------------:|--------------:|--------------:|
+ | _n_tokens_ (approx.) = | _3048_ | _5966_ | _8903_ | _11832_ | _14757_ |
+ | MegaBeam-Mistral-7B-300k | 100 | 100 | 100 | 100 | 100 |
+ | **MegaBeam-Mistral-7B-300k-AWQ** | **100** | **100** | **100** | **100** | **100** |
+ | **MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM** | **100** | **100** | **100** | **100** | **98** |
+ | **MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM** | **100** | **100** | **100** | **100** | **98** |
+
+ #### Line Retrieval
+
+ See https://lmsys.org/blog/2023-06-29-longchat/#longeval-results
+
+ | Model Name | n_lines=200 | n_lines=300 | n_lines=400 | n_lines=500 | n_lines=600 | n_lines=680 |
+ |:----------|-------------:|-------------:|------------:|-----------:|-----------:|-----------:|
+ | _n_tokens_ (approx.) = | _4317_ | _6415_ | _8510_ | _10610_ | _12698_ | _14373_ |
+ | MegaBeam-Mistral-7B-300k | 98 | 98 | 92 | 98 | 90 | 90 |
+ | **MegaBeam-Mistral-7B-300k-AWQ** | **96** | **94** | **88** | **80** | **70** | **62** |
+ | **MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM** | **100** | **98** | **96** | **96** | **90** | **94** |
+ | **MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM** | **98** | **98** | **82** | **96** | **92** | **90** |
+
+ #### Pass Key Retrieval
+
+ See https://github.com/epfml/landmark-attention/blob/main/llama/run_test.py#L101
+
+ | Model Name | n_garbage=12000 | n_garbage=20000 | n_garbage=31000 | n_garbage=38000 | n_garbage=45000 | n_garbage=60000 |
+ |:----------|-------------:|-------------:|------------:|-----------:|-----------:|-----------:|
+ | _n_tokens_ (approx.) = | _3272_ | _5405_ | _8338_ | _10205_ | _12071_ | _16072_ |
+ | MegaBeam-Mistral-7B-300k | 100 | 100 | 100 | 100 | 100 | 100 |
+ | **MegaBeam-Mistral-7B-300k-AWQ** | **100** | **100** | **100** | **100** | **100** | **100** |
+ | **MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM** | **100** | **100** | **100** | **100** | **100** | **100** |
+ | **MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM** | **100** | **100** | **100** | **100** | **100** | **100** |
+
+ #### QuALITY (Question Answering with Long Input Texts, Yes!)
+
+ See https://nyu-mll.github.io/quality/
+
+ | Model Name | Test set Accuracy | Hard subset Accuracy |
+ |:----------|-------------:|-------------:|
+ | MegaBeam-Mistral-7B-300k | 53.2 | 72 |
+ | **MegaBeam-Mistral-7B-300k-AWQ** | **51.3** | **71.3** |
+ | **MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM** | **52.4** | **72.1** |
+ | **MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM** | **53.1** | **71.3** |
+
+ ## Usage
+
+ ### Inference via vLLM HTTP Host
+
+ #### Launch Host
+ ```bash
+ python -m vllm.entrypoints.openai.api_server \
+     --model aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ \
+     --quantization awq
+ ```
+
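+ The model's `max_position_embeddings` is 288,800, and vLLM reserves KV-cache capacity for the full context length at start-up, which can exceed the memory of a single GPU. As a hedged sketch (the 32k value below is an arbitrary example, not a recommendation), the context can be capped with vLLM's `--max-model-len` flag:
+
+ ```bash
+ python -m vllm.entrypoints.openai.api_server \
+     --model aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ \
+     --quantization awq \
+     --max-model-len 32768   # cap the context window to fit available GPU memory
+ ```
+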
+ #### Query Host
+ ```bash
+ curl -X POST http://localhost:8000/v1/completions \
+     -H "Content-Type: application/json" \
+     -d '{ "model": "aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ",
+           "prompt": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
+           "temperature": 0,
+           "echo": false
+         }'
+ ```
+
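+ Because the vLLM host exposes an OpenAI-compatible API, it can also be queried from Python. A minimal sketch using the official `openai` client (the `api_key` value is a placeholder; the server above does not require one):
+
+ ```python
+ from openai import OpenAI
+
+ # Point the client at the local vLLM server launched above.
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ completion = client.completions.create(
+     model="aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ",
+     prompt="<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
+     temperature=0,
+ )
+ print(completion.choices[0].text)
+ ```
+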
+ ### Inference via [vLLM Offline Inference](https://docs.vllm.ai/en/latest/getting_started/examples/offline_inference.html)
+ ```python
+ from vllm import LLM, SamplingParams
+
+ prompts = [
+     "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
+ ]
+ sampling_params = SamplingParams(temperature=0, max_tokens=100)
+
+ llm = LLM(model="aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ")
+
+ outputs = llm.generate(prompts, sampling_params)
+
+ # Print the outputs.
+ for output in outputs:
+     prompt = output.prompt
+     generated_text = output.outputs[0].text
+     print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+ ```
+
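+ Note that vLLM reads the `quantization_config` section of this repository's `config.json` and applies AWQ automatically; passing `quantization="awq"` to `LLM(...)` makes the choice explicit.
+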
+ ## License
+
+ Apache 2.0
+
+ ## Limitations
+
+ Before using the MegaBeam-Mistral-7B-300k-AWQ model, it is important to perform your own
+ independent assessment and to take measures to ensure that your use complies with your own
+ specific quality control practices and standards, and with the local rules, laws, regulations,
+ licenses and terms that apply to you and your content.
+
config.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "_name_or_path": "aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ",
+   "architectures": [
+     "MistralForCausalLM"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "hidden_act": "silu",
+   "hidden_size": 4096,
+   "initializer_range": 0.02,
+   "intermediate_size": 14336,
+   "max_position_embeddings": 288800,
+   "model_type": "mistral",
+   "num_attention_heads": 32,
+   "num_hidden_layers": 32,
+   "num_key_value_heads": 8,
+   "quantization_config": {
+     "bits": 4,
+     "group_size": 128,
+     "modules_to_not_convert": null,
+     "quant_method": "awq",
+     "version": "gemm",
+     "zero_point": true
+   },
+   "rms_norm_eps": 1e-05,
+   "rope_theta": 25000000.0,
+   "sliding_window": null,
+   "tie_word_embeddings": false,
+   "torch_dtype": "float16",
+   "transformers_version": "4.41.2",
+   "use_cache": true,
+   "vocab_size": 32000
+ }
generation_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "do_sample": true,
+   "eos_token_id": 2,
+   "transformers_version": "4.41.2"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a2b626ba3dcb50bea165b4925f9bb7e3e4b7c2ef8a0deb17ed2e04790fd36f9f
+ size 4150880232
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,43 @@
+ {
+   "add_bos_token": true,
+   "add_eos_token": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [],
+   "bos_token": "<s>",
+   "chat_template": "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "</s>",
+   "legacy": true,
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": null,
+   "sp_model_kwargs": {},
+   "spaces_between_special_tokens": false,
+   "tokenizer_class": "LlamaTokenizer",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": false
+ }