Text Generation
Transformers
Safetensors
chatglm
feature-extraction
custom_code
README.md CHANGED
@@ -4,33 +4,42 @@ language:
4
  - zh
5
  - ja
6
  - de
7
- license: wtfpl
8
  pipeline_tag: text-generation
9
  co2_eq_emissions:
10
  emissions: 700
11
- training_type: fine-tuning
12
- library_name: transformers
13
- datasets:
14
- - CausalLM/Retrieval-SFT-Chat
15
- - CausalLM/Refined-Anime-Text
16
- ---
17
- # miniG
18
-
19
- [Text-Only Weight](https://huggingface.co/CausalLM/miniG/tree/text-only)
20
-
21
- [GGML with ChatGLM.cpp (recommended)](https://huggingface.co/CausalLM/miniG/tree/ggml): https://github.com/li-plus/chatglm.cpp
22
 
23
- [GGUF (Text-Only, not recommended)](https://huggingface.co/CausalLM/miniG/tree/gguf): There is a significant degradation, even with the F16.
24
-
25
- **Update:** A new ["alt" version](https://huggingface.co/CausalLM/miniG/tree/alt) of the model has been uploaded, which is trained with masked context provided. This is intended to reduce overfitting and provide a more objective performance. The model weights in the main branch of the repository are trained directly on SFT data, while the alt branch, on the other hand, is trained with the masked context of raw-text used to synthesize the data provided. The alt version exhibits better stability in some cases, with less overfitting. However, it may have limitations in knowledge retention and hallucination due to the lack of external context.
26
 
27
- > **Hint:** How can I check if my inference parameters and quantized inference are performing well? You can try having the model recite "The Gift of the Magi" by O. Henry (which is a public domain text). You should expect it to recite the entire text accurately, including the formatting.
28
 
29
  A model trained on a synthesis dataset of over **120 million** entries, this dataset having been generated through the application of state-of-the-art language models utilizing large context windows, alongside methodologies akin to retrieval-augmented generation and knowledge graph integration, where the data synthesis is conducted within clusters derived from a curated pretraining corpus of 20 billion tokens, with subsequent validation performed by the model itself.
30
 
31
  Despite the absence of thorough alignment with human preferences, the model is under no obligation to cater to poorly constructed prompts or the clichés often found in conventional benchmarks. Bonus: Included is an implementation of a **Vision Language Model** that has undergone Locked-Image Tuning.
32
 
33
- **Supported Input Modalities**: text, image. For text-only weight, please use the branch `revision=text-only` at https://huggingface.co/CausalLM/miniG/tree/text-only . And [GGUF](https://huggingface.co/CausalLM/miniG/tree/gguf) for text-only should be working after PR [#9194](https://github.com/ggerganov/llama.cpp/pull/9194) was merged.
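For reference, a minimal sketch of selecting one of these branches with Hugging Face Transformers is shown below. The `revision` and `trust_remote_code` arguments are standard Transformers options; the rest of the call (dtype, device placement) is an assumed typical setup rather than something specified in this card.

```python
# Hypothetical sketch: load a specific branch of CausalLM/miniG via `revision`.
# `trust_remote_code=True` is needed because the repo ships custom ChatGLM code;
# `device_map="auto"` additionally requires the `accelerate` package.
import torch
from transformers import AutoTokenizer, AutoModel

repo = "CausalLM/miniG"
tokenizer = AutoTokenizer.from_pretrained(repo, revision="text-only", trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo,
    revision="text-only",        # or "alt" for the masked-context variant
    torch_dtype=torch.bfloat16,  # matches the torch_dtype declared in config.json
    trust_remote_code=True,
    device_map="auto",
).eval()
```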
34
 
35
  **Context Window:** 1M tokens
36
 
@@ -38,37 +47,25 @@ Despite the absence of thorough alignment with human preferences, the model is u
38
 
39
  **Cautionary Notes:** **It is strongly recommended to utilize a standardized implementation for inference**, such as Hugging Face Transformers, to avoid the significant performance degradation that might occur when using accelerated kernels like vllm or lmdeploy - not to mention the potentially catastrophic effects of model quantization. **As of now, these accelerated inference implementations are known to severely compromise effective** vision inference, though they have a less pronounced impact on pure text performance.
40
 
41
- **Inference Parameters:** Our observations suggest that, if one desires to achieve results with fewer hallucinations, it is advisable to employ sampling with top_p=0.8 followed by a temperature setting of 0.3, or alternatively, to use pure temperature sampling with a setting of 0.2. **In general, a lower temperature is required compared to similar models**, which we tentatively attribute to overfitting on the vast dataset. The model inference should refer to THUDM/glm-4-9b-chat-1m and THUDM/glm-4v-9b. We only guarantee best performance when using transformers for inference. In our testing, we also used lmdeploy, which resulted in a significant performance degradation for multimodal input.
42
 
43
- **Regarding Formatting:** We strongly recommend you double-check your input to ensure: 1. The system prompt is not empty. Even something as simple as "You are a helpful assistant." is expected. 2. There is always a newline character after the <|role|> tag. This will help ensure proper parsing and processing of your input.
44
 
45
  **Regarding [Benchmark Scores](https://huggingface.co/spaces/JosephusCheung/Goodharts-Law-on-Benchmarks-a-Page-for-miniG):** Generally, you shouldn't worry too much about them, as people can always train specifically to achieve good results. We mainly use them as a smoke test, a quick check to ensure no major regressions have occurred. In fact, if you actually read through the benchmark questions themselves, you'll often find yourself chuckling at how inane, low-quality, or even downright silly they are.
46
 
47
- **Regarding Training:** The final released version was trained using a merge of multiple candidate models in an attempt to improve performance. However, we were unable to conclusively determine whether this was effective. Excluding candidate versions, an efficient naïve fine-tuning should be achievable within one day on 16 nodes of 8*A100-80G. Based on this, we estimate the carbon emissions to be 700 kg CO2 eq.
48
 
49
  **Disclaimer:** Please note that the model was trained on unfiltered internet data. Since we do not have the capacity to vet all of it, there may be a substantial amount of objectionable content, pornography, violence, and offensive language present that we are unable to remove. Therefore, you will still need to complete your own checks on the model's safety and filter keywords in the output. Due to computational resource constraints, we are presently unable to implement RLHF for the model's ethics and safety, nor training on SFT samples that refuse to answer certain questions for restrictive fine-tuning.
50
 
51
- **For English Users:** This model was not trained on meaningless logical riddles like those "strawberry questions" (which is a data optimization case-by-case, unseen during the pre-training phase). This approach has no value beyond creating a spectacle. The model focuses more on utilizing the content within the pre-training corpus, rather than solely on artificial optimizations introduced during the SFT stage for specific tasks.
52
-
53
- **Seeking Unconditional Sponsorship:** Training and synthesizing datasets can be expensive. While we cannot disclose more details about the cost budget, we can theoretically analyze the example of synthesizing and self-verifying the dataset used to train this model, which involved 120M entries synthesized from 20B tokens. The nominal cost of data synthesis and self-verification using a commercial model API could be as high as $3M, while the nominal cost using local model inference, measured in GPU time, could still reach up to $0.1M. We are actively training larger parameter models and scaling up data synthesis, and are seeking substantial compute resources and generous **unconditional** grants. While this is for the purpose of commercial exploration and technology selection, we are currently under no immediate pressure to generate profit and remain committed to sharing more with the open-source community.
54
 
55
  # 迷你G
56
 
57
- [纯文本权重](https://huggingface.co/CausalLM/miniG/tree/text-only)
58
-
59
- [GGML 用于 ChatGLM.cpp (推荐)](https://huggingface.co/CausalLM/miniG/tree/ggml): https://github.com/li-plus/chatglm.cpp
60
-
61
- [GGUF (纯文本,不推荐)](https://huggingface.co/CausalLM/miniG/tree/gguf): 即使使用F16,性能也有显著下降。
62
-
63
- **更新:** 我们上传了一个新的 ["alt" 版本](https://huggingface.co/CausalLM/miniG/tree/alt) 模型,该模型使用掩码上下文进行训练。此版本旨在减少过拟合并提供更客观的性能。仓库主分支中的模型权重直接在 SFT 数据上训练,而 alt 分支则使用用于合成提供数据的原始文本的掩码上下文进行训练。alt 版本在某些情况下表现出更好的稳定性,过拟合更少。然而,由于缺乏外部上下文,它可能在知识保留和幻觉方面存在局限性。
64
-
65
- > **提示:** 如何检查我的推理参数和量化推理是否表现良好?你可以尝试让模型背诵朱自清的《背影》(这是一个公共领域的文本)。你应该期待它能够准确地背诵整个文本,包括格式和换行。
66
-
67
  一个在超过**1.2亿**条数据合成数据集上训练的模型,这些数据集是通过应用具有大上下文窗口的最先进语言模型生成的,并结合了类似于检索增强生成和知识图谱集成的方法,数据合成是在一个由200亿个标记组成的预训练语料库中提取的聚类内进行的,随后由模型本身进行验证。
68
 
69
  尽管该模型没有完全对齐人类偏好,但它没有义务迎合不良构建的提示或常见基准测试中的陈词滥调。额外内容:包含了经过锁定图像微调的**视觉语言模型**实现。
70
 
71
- **支持的输入模态**:文本、图像。对于纯文本权重,请使用 https://huggingface.co/CausalLM/miniG/tree/text-only 上的分支 `revision=text-only`。在 PR [#9194](https://github.com/ggerganov/llama.cpp/pull/9194) 合并后,适用于纯文本的 [GGUF](https://huggingface.co/CausalLM/miniG/tree/gguf) 应该可以正常工作。
72
 
73
  **上下文窗口**:1M 个标记
74
 
@@ -76,16 +73,14 @@ Despite the absence of thorough alignment with human preferences, the model is u
76
 
77
  **注意事项:** **强烈建议使用标准化的推理实现**,例如Hugging Face Transformers,以避免在使用加速内核(如vllm或lmdeploy)时可能发生的显著性能下降——更不用说模型量化可能带来的灾难性影响。**目前,这些加速推理实现已知会严重损害**视觉推理的有效性,尽管对纯文本性能的影响较小。
78
 
79
- **推理参数:** 我们的观察表明,如果想要减少幻觉结果,建议使用top_p=0.8的采样方式,然后设置temperature为0.3,或者使用纯粹的temperature采样,设置为0.2。**总体来说,相比类似的模型,该模型需要较低的temperature**,我们暂时将其归因于在庞大数据集上的过拟合。模型推理应参考 THUDM/glm-4-9b-chat-1m 和 THUDM/glm-4v-9b。我们只保证使用 transformer 进行推理时的性能最佳。在我们的测试中,我们还使用了 lmdeploy,这导致多模态输入的性能显著下降。
80
-
81
- **关于格式:** 我们强烈建议您仔细检查输入内容,以确保:1. 系统提示不为空。即使是像“You are a helpful assistant.”这样简单的提示也是预期的。2. <|role|> 标签后始终有一个换行符。这将有助于确保正确解析和处理您的输入。
82
 
83
- **关于[基准测试分数](https://huggingface.co/spaces/JosephusCheung/Goodharts-Law-on-Benchmarks-a-Page-for-miniG):** 一般来说,你不应该太过在意这些分数,因为人们总是可以专门训练以取得好成绩。我们主要将它们作为一个冒烟测试,一种快速检查,确保没有发生重大回退。事实上,如果你真的去阅读这些基准测试问题本身,你常常会发现自己会忍不住笑出声来,因为它们是多么无聊、低质量,甚至荒谬可笑。
84
 
85
- **关于训练:** 最终发布的版本使用了多个候选模型的合并来尝试提高性能。然而,我们无法确定这种方法是否确实有效。排除候选版本和合并实验,使用16个节点、每个节点配备8个A100-80G显卡的情况下,应该可以在一天之内实现高效的朴素微调。据此我们估算碳排放量为700公斤二氧化碳当量。
86
 
87
- **免责声明:** 请注意,该模型是在未经过滤的互联网数据上训练的。由于我们无法对所有数据进行筛选,仍有可能存在大量不适当的内容——包括从露骨的材料到暴力和攻击性语言的内容——我们无法移除。因此,您必须自行对模型进行安全检查,并在输出中实施关键词过滤。由于计算资源的限制,我们目前无法为伦理和安全考虑进行人类反馈的强化学习(RLHF),也不能对SFT样本进行限制性微调,以限制模型回答某些问题的能力。
88
 
89
- **致中文用户:** 这个模型没有接受过像“弱智吧”这样毫无意义的逻辑谜题的训练(这属于数据优化中的个案,在预训练阶段从未见过)。这种方法除了制造噱头之外没有任何价值。该模型更注重利用预训练语料库中的内容,而不是仅仅依靠 SFT 阶段为特定任务引入的人工优化。
90
 
91
- **寻求无条件赞助:** 训练和合成数据集可能非常昂贵。虽然我们无法透露更多关于成本预算的细节,但我们可以从理论上分析一下合成和自我验证用于训练该模型的数据集的例子,该数据集包含从 200 亿个标记合成的 1.2 亿个条目。使用商业模型 API 进行数据合成和自我验证的名义成本可能高达 300 万美元,而使用本地模型推理(以 GPU 时间衡量)的名义成本仍然可能高达 10 万美元。我们正在积极训练更大参数的模型并扩大数据合成规模,同时寻求大量的计算资源和慷慨的**无条件**资助。尽管这是为了商业探索和技术选择的目的,但我们目前并没有立即产生利润的压力,并且仍然致力于与开源社区分享更多成果。
 
4
  - zh
5
  - ja
6
  - de
7
+ model-index:
8
+ - name: miniG
9
+ results:
10
+ - task:
11
+ type: text-generation
12
+ metrics:
13
+ - name: MMLU
14
+ type: MMLU
15
+ value: 85.45
16
+ - name: IFEval
17
+ type: IFEval
18
+ value: 74.22
19
+ - name: GSM8K (5-shot)
20
+ type: GSM8K (5-shot)
21
+ value: 75.89
22
+ - name: HumanEval
23
+ type: HumanEval
24
+ value: 79.88
25
+ - name: GPQA
26
+ type: GPQA
27
+ value: 37.37
28
+ license: agpl-3.0
29
  pipeline_tag: text-generation
30
  co2_eq_emissions:
31
  emissions: 700
32
+ training_type: "fine-tuning"
33
 
34
+ ---
 
 
35
 
36
+ # miniG
37
 
38
  A model trained on a synthesis dataset of over **120 million** entries, this dataset having been generated through the application of state-of-the-art language models utilizing large context windows, alongside methodologies akin to retrieval-augmented generation and knowledge graph integration, where the data synthesis is conducted within clusters derived from a curated pretraining corpus of 20 billion tokens, with subsequent validation performed by the model itself.
39
 
40
  Despite the absence of thorough alignment with human preferences, the model is under no obligation to cater to poorly constructed prompts or the clichés often found in conventional benchmarks. Bonus: Included is an implementation of a **Vision Language Model** that has undergone Locked-Image Tuning.
41
 
42
+ **Supported Input Modalities**: text, image
43
 
44
  **Context Window:** 1M tokens
45
 
 
47
 
48
  **Cautionary Notes:** **It is strongly recommended to utilize a standardized implementation for inference**, such as Hugging Face Transformers, to avoid the significant performance degradation that might occur when using accelerated kernels like vllm or lmdeploy - not to mention the potentially catastrophic effects of model quantization. **As of now, these accelerated inference implementations are known to severely compromise effective** vision inference, though they have a less pronounced impact on pure text performance.
49
 
50
+ **Inference Parameters:** Our observations suggest that, if one desires to achieve results with fewer hallucinations, it is advisable to employ sampling with top_p=0.8 followed by a temperature setting of 0.3, or alternatively, to use pure temperature sampling with a setting of 0.2. **In general, a lower temperature is required compared to similar models**, which we tentatively attribute to overfitting on the vast dataset.
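As a rough sketch of these settings with Hugging Face Transformers (the loading call, and the assumption that the tokenizer ships a GLM-4-style chat template, are illustrative rather than taken from this card; only the sampling values come from the paragraph above):

```python
# Hypothetical sketch of the recommended sampling settings.
import torch
from transformers import AutoTokenizer, AutoModel

repo = "CausalLM/miniG"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(repo, torch_dtype=torch.bfloat16, trust_remote_code=True).eval()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what a 1M-token context window allows."},
]
# Assumes the repository's tokenizer provides a GLM-4-style chat template.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_tensors="pt"
)

output_ids = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    top_p=0.8,        # nucleus sampling, as suggested above
    temperature=0.3,  # lower than for comparable models, per the note on overfitting
    # alternative: pure temperature sampling, e.g. temperature=0.2 with top_p=1.0
)
print(tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True))
```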
51
 
52
+ **Regarding Formatting:** We strongly recommend you double-check your input to ensure: 1. The system prompt is not empty. Even something as simple as "You are a helpful assistant." is expected. 2. Each role's content ends with a newline character ('\n') before being concatenated with the <|role|> tag. 3. There is always a newline character after the <|role|> tag. This will help ensure proper parsing and processing of your input.
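To make these rules concrete, here is a small, hypothetical illustration of how a conversation could be flattened by hand; the exact role-tag names and any special prefix tokens are assumptions based on the GLM-4 family, so in practice the tokenizer's own chat template should be preferred.

```python
# Hypothetical illustration of the three formatting rules (GLM-4-style tags assumed).
messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # rule 1: never empty
    {"role": "user", "content": "Summarize this model card in one sentence."},
]

prompt = ""
for m in messages:
    # rule 3: a newline directly after each <|role|> tag
    # rule 2: each role's content ends with '\n' before the next tag is appended
    prompt += f"<|{m['role']}|>\n{m['content']}\n"
prompt += "<|assistant|>\n"  # generation prompt, again followed by a newline

# In practice, prefer the repository tokenizer's own template, e.g.:
#   tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
```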
53
 
54
  **Regarding [Benchmark Scores](https://huggingface.co/spaces/JosephusCheung/Goodharts-Law-on-Benchmarks-a-Page-for-miniG):** Generally, you shouldn't worry too much about them, as people can always train specifically to achieve good results. We mainly use them as a smoke test, a quick check to ensure no major regressions have occurred. In fact, if you actually read through the benchmark questions themselves, you'll often find yourself chuckling at how inane, low-quality, or even downright silly they are.
55
 
56
+ **Regarding Training:** The final released version was trained using a merge of multiple candidate models in an attempt to improve performance. However, we were unable to conclusively determine whether this was effective. Excluding candidate versions, an efficient naive fine-tuning should be achievable within one day on 16 nodes of 8*A100-80G. Based on this, we estimate the carbon emissions to be 700 kg CO2 eq.
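For readers who want to sanity-check that figure, a back-of-envelope version of the calculation is sketched below; the per-GPU power draw and grid carbon intensity are illustrative assumptions chosen to show the shape of the estimate, not values reported here.

```python
# Back-of-envelope CO2 estimate for one day on 16 nodes of 8*A100-80G.
# Only the node/GPU counts, the one-day duration, and the ~700 kg result
# come from the card; the power and grid-intensity figures are assumptions.
gpus = 16 * 8                    # 128 GPUs
hours = 24                       # "within one day"
avg_power_kw = 0.4               # assumed average draw per A100 (~400 W)
grid_kg_co2_per_kwh = 0.57       # assumed grid carbon intensity

energy_kwh = gpus * hours * avg_power_kw          # ~1,229 kWh
emissions_kg = energy_kwh * grid_kg_co2_per_kwh   # ~700 kg CO2 eq
print(round(energy_kwh), round(emissions_kg))
```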
57
 
58
  **Disclaimer:** Please note that the model was trained on unfiltered internet data. Since we do not have the capacity to vet all of it, there may be a substantial amount of objectionable content, pornography, violence, and offensive language present that we are unable to remove. Therefore, you will still need to complete your own checks on the model's safety and filter keywords in the output. Due to computational resource constraints, we are presently unable to implement RLHF for the model's ethics and safety, nor training on SFT samples that refuse to answer certain questions for restrictive fine-tuning.
59
 
60
+ **Seeking Unconditional Sponsorship:** We are actively training larger parameter models and scaling up data synthesis, and are seeking substantial compute resources and generous **unconditional** grants. While this is for the purpose of commercial exploration and technology selection, we are currently under no immediate pressure to generate profit and remain committed to sharing more with the open-source community.
 
 
61
 
62
  # 迷你G
63

64
  一个在超过**1.2亿**条数据合成数据集上训练的模型,这些数据集是通过应用具有大上下文窗口的最先进语言模型生成的,并结合了类似于检索增强生成和知识图谱集成的方法,数据合成是在一个由200亿个标记组成的预训练语料库中提取的聚类内进行的,随后由模型本身进行验证。
65
 
66
  尽管该模型没有完全对齐人类偏好,但它没有义务迎合不良构建的提示或常见基准测试中的陈词滥调。额外内容:包含了经过锁定图像微调的**视觉语言模型**实现。
67
 
68
+ **支持的输入模态**:文本、图像
69
 
70
  **上下文窗口**:1M 个标记
71
 
 
73
 
74
  **注意事项:** **强烈建议使用标准化的推理实现**,例如Hugging Face Transformers,以避免在使用加速内核(如vllm或lmdeploy)时可能发生的显著性能下降——更不用说模型量化可能带来的灾难性影响。**目前,这些加速推理实现已知会严重损害**视觉推理的有效性,尽管对纯文本性能的影响较小。
75
 
76
+ **推理参数:**我们的观察表明,如果想要减少幻觉结果,建议使用top_p=0.8的采样方式,然后设置temperature为0.3,或者使用纯粹的temperature采样,设置为0.2。**总体来说,相比类似的模型,该模型需要较低的temperature**,我们暂时将其归因于在庞大数据集上的过拟合。
 
 
77
 
78
+ **关于格式:**我们强烈建议您仔细检查输入内容,以确保:1. 系统提示不为空。即使是像“You are a helpful assistant.”这样简单的提示也是预期的。2. 每个角色的内容在与 <|role|> 标签连接之前都以换行符 ('\n') 结尾。3. <|role|> 标签后始终有一个换行符。这将有助于确保正确解析和处理您的输入。
79
 
80
+ **关于[基准测试分数](https://huggingface.co/spaces/JosephusCheung/Goodharts-Law-on-Benchmarks-a-Page-for-miniG):**一般来说,你不应该太过在意这些分数,因为人们总是可以专门训练以取得好成绩。我们主要将它们作为一个冒烟测试,一种快速检查,确保没有发生重大回退。事实上,如果你真的去阅读这些基准测试问题本身,你常常会发现自己会忍不住笑出声来,因为它们是多么无聊、低质量,甚至荒谬可笑。
81
 
82
+ **关于训练:**最终发布的版本使用了多个候选模型的合并来尝试提高性能。然而,我们无法确定这种方法是否确实有效。排除候选版本和合并实验,使用16个节点、每个节点配备8个A100-80G显卡的情况下,应该可以在一天之内实现高效的朴素微调。据此我们估算碳排放量为700公斤二氧化碳当量。
83
 
84
+ **免责声明:**请注意,该模型是在未经过滤的互联网数据上训练的。由于我们无法对所有数据进行筛选,仍有可能存在大量不适当的内容——包括从露骨的材料到暴力和攻击性语言的内容——我们无法移除。因此,您必须自行对模型进行安全检查,并在输出中实施关键词过滤。由于计算资源的限制,我们目前无法为伦理和安全考虑进行人类反馈的强化学习(RLHF),也不能对SFT样本进行限制性微调,以限制模型回答某些问题的能力。
85
 
86
+ **寻求无条件赞助:**我们正在积极训练更大参数的模型并扩大数据合成规模,同时寻求大量的计算资源和慷慨的**无条件**资助。尽管这是为了商业探索和技术选择的目的,但我们目前并没有立即产生利润的压力,并且仍然致力于与开源社区分享更多成果。
config.json CHANGED
@@ -1,14 +1,9 @@
1
  {
2
  "_name_or_path": "miniG",
3
- "add_bias_linear": false,
4
- "add_qkv_bias": true,
5
- "apply_query_key_layer_scaling": true,
6
- "apply_residual_connection_post_layernorm": false,
7
  "architectures": [
8
- "ChatGLMForConditionalGeneration"
9
  ],
10
- "attention_dropout": 0.0,
11
- "attention_softmax_in_fp32": true,
12
  "auto_map": {
13
  "AutoConfig": "configuration_chatglm.ChatGLMConfig",
14
  "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration",
@@ -16,53 +11,35 @@
16
  "AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration",
17
  "AutoModelForSequenceClassification": "modeling_chatglm.ChatGLMForSequenceClassification"
18
  },
19
  "bias_dropout_fusion": true,
20
- "boi_token_id": 151339,
21
- "classifier_dropout": null,
22
- "eoi_token_id": 151340,
23
- "eos_token_id": [
24
- 151329,
25
- 151336,
26
- 151338
27
- ],
28
  "ffn_hidden_size": 13696,
29
  "fp32_residual_connection": false,
30
  "hidden_dropout": 0.0,
31
  "hidden_size": 4096,
32
  "kv_channels": 128,
33
  "layernorm_epsilon": 1.5625e-07,
34
- "model_type": "chatglm",
35
  "multi_query_attention": true,
36
  "multi_query_group_num": 4,
37
  "num_attention_heads": 32,
38
  "num_hidden_layers": 40,
39
  "num_layers": 40,
 
40
  "original_rope": true,
41
- "pad_token_id": 151329,
42
  "padded_vocab_size": 151552,
43
  "post_layer_norm": true,
44
- "pre_seq_len": null,
45
- "prefix_projection": false,
46
  "rmsnorm": true,
47
- "rope_ratio": 10000,
48
  "seq_length": 1048576,
49
- "tie_word_embeddings": false,
50
  "torch_dtype": "bfloat16",
51
  "transformers_version": "4.44.0",
52
- "use_cache": true,
53
- "vision_config": {
54
- "dropout_prob": 0.0,
55
- "hidden_act": "gelu",
56
- "hidden_size": 1792,
57
- "image_size": 1120,
58
- "in_channels": 3,
59
- "intermediate_size": 15360,
60
- "layer_norm_eps": 1e-06,
61
- "num_heads": 16,
62
- "num_hidden_layers": 63,
63
- "num_positions": 6401,
64
- "patch_size": 14,
65
- "scaling_factor": 8
66
- },
67
- "vocab_size": 151552
68
  }
 
1
  {
2
  "_name_or_path": "miniG",
3
+ "model_type": "chatglm",
 
 
 
4
  "architectures": [
5
+ "ChatGLMModel"
6
  ],
 
 
7
  "auto_map": {
8
  "AutoConfig": "configuration_chatglm.ChatGLMConfig",
9
  "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration",
 
11
  "AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration",
12
  "AutoModelForSequenceClassification": "modeling_chatglm.ChatGLMForSequenceClassification"
13
  },
14
+ "add_bias_linear": false,
15
+ "add_qkv_bias": true,
16
+ "apply_query_key_layer_scaling": true,
17
+ "apply_residual_connection_post_layernorm": false,
18
+ "attention_dropout": 0.0,
19
+ "attention_softmax_in_fp32": true,
20
+ "attn_implementation": "sdpa",
21
  "bias_dropout_fusion": true,
22
  "ffn_hidden_size": 13696,
23
  "fp32_residual_connection": false,
24
  "hidden_dropout": 0.0,
25
  "hidden_size": 4096,
26
  "kv_channels": 128,
27
  "layernorm_epsilon": 1.5625e-07,
 
28
  "multi_query_attention": true,
29
  "multi_query_group_num": 4,
30
  "num_attention_heads": 32,
31
  "num_hidden_layers": 40,
32
  "num_layers": 40,
33
+ "rope_ratio": 10000,
34
  "original_rope": true,
 
35
  "padded_vocab_size": 151552,
36
  "post_layer_norm": true,
 
 
37
  "rmsnorm": true,
 
38
  "seq_length": 1048576,
39
+ "use_cache": true,
40
  "torch_dtype": "bfloat16",
41
  "transformers_version": "4.44.0",
42
+ "tie_word_embeddings": false,
43
+ "eos_token_id": [151329, 151336, 151338],
44
+ "pad_token_id": 151329
45
  }
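A small, hypothetical sketch of inspecting this configuration programmatically; the attribute names mirror the JSON keys above, and `trust_remote_code=True` is required because `ChatGLMConfig` is custom code.

```python
# Hypothetical: read the custom config and confirm a few of the fields listed above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("CausalLM/miniG", trust_remote_code=True)
print(config.model_type)    # "chatglm"
print(config.seq_length)    # 1048576, i.e. the advertised ~1M-token context window
print(config.eos_token_id)  # [151329, 151336, 151338]
print(config.num_layers, config.hidden_size, config.multi_query_group_num)
```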
configuration_chatglm.py CHANGED
@@ -29,10 +29,6 @@ class ChatGLMConfig(PretrainedConfig):
29
  apply_query_key_layer_scaling=True,
30
  attention_softmax_in_fp32=True,
31
  fp32_residual_connection=False,
32
- pre_seq_len=None,
33
- prefix_projection=False,
34
- boi_token_id=None,
35
- eoi_token_id=None,
36
  **kwargs
37
  ):
38
  self.num_layers = num_layers
@@ -59,8 +55,4 @@ class ChatGLMConfig(PretrainedConfig):
59
  self.apply_query_key_layer_scaling = apply_query_key_layer_scaling
60
  self.attention_softmax_in_fp32 = attention_softmax_in_fp32
61
  self.fp32_residual_connection = fp32_residual_connection
62
- self.pre_seq_len = pre_seq_len
63
- self.prefix_projection = prefix_projection
64
- self.boi_token_id = boi_token_id
65
- self.eoi_token_id = eoi_token_id
66
  super().__init__(**kwargs)
 
29
  apply_query_key_layer_scaling=True,
30
  attention_softmax_in_fp32=True,
31
  fp32_residual_connection=False,
32
  **kwargs
33
  ):
34
  self.num_layers = num_layers
 
55
  self.apply_query_key_layer_scaling = apply_query_key_layer_scaling
56
  self.attention_softmax_in_fp32 = attention_softmax_in_fp32
57
  self.fp32_residual_connection = fp32_residual_connection
58
  super().__init__(**kwargs)
generation_config.json CHANGED
@@ -7,7 +7,7 @@
7
  "pad_token_id": 151329,
8
  "do_sample": true,
9
  "temperature": 0.8,
10
- "max_length": 8192,
11
  "top_p": 0.8,
12
  "transformers_version": "4.44.0"
13
  }
 
7
  "pad_token_id": 151329,
8
  "do_sample": true,
9
  "temperature": 0.8,
10
+ "max_length": 1024000,
11
  "top_p": 0.8,
12
  "transformers_version": "4.44.0"
13
  }
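A minimal sketch of loading these defaults and overriding them at generation time; the lower temperature follows the recommendation in the README, and the rest is standard `transformers` usage rather than anything specific to this repository.

```python
# Hypothetical: load the stored generation defaults and override a few of them.
from transformers import GenerationConfig

gen_cfg = GenerationConfig.from_pretrained("CausalLM/miniG")
print(gen_cfg.temperature, gen_cfg.top_p, gen_cfg.max_length)  # 0.8, 0.8, 1024000

gen_cfg.temperature = 0.3     # README suggests a lower temperature than the default
gen_cfg.max_new_tokens = 512  # bound new tokens instead of relying on max_length

# With a loaded model and tokenized inputs:
#   output_ids = model.generate(input_ids, generation_config=gen_cfg)
```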
model.safetensors.index.json ADDED
@@ -0,0 +1,291 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 18967715904
4
+ },
5
+ "weight_map": {
6
+ "transformer.embedding.word_embeddings.weight": "model-00001-of-00010.safetensors",
7
+ "transformer.encoder.final_layernorm.weight": "model-00010-of-00010.safetensors",
8
+ "transformer.encoder.layers.0.input_layernorm.weight": "model-00001-of-00010.safetensors",
9
+ "transformer.encoder.layers.0.mlp.dense_4h_to_h.weight": "model-00001-of-00010.safetensors",
10
+ "transformer.encoder.layers.0.mlp.dense_h_to_4h.weight": "model-00001-of-00010.safetensors",
11
+ "transformer.encoder.layers.0.post_attention_layernorm.weight": "model-00001-of-00010.safetensors",
12
+ "transformer.encoder.layers.0.self_attention.dense.weight": "model-00001-of-00010.safetensors",
13
+ "transformer.encoder.layers.0.self_attention.query_key_value.bias": "model-00001-of-00010.safetensors",
14
+ "transformer.encoder.layers.0.self_attention.query_key_value.weight": "model-00001-of-00010.safetensors",
15
+ "transformer.encoder.layers.1.input_layernorm.weight": "model-00001-of-00010.safetensors",
16
+ "transformer.encoder.layers.1.mlp.dense_4h_to_h.weight": "model-00002-of-00010.safetensors",
17
+ "transformer.encoder.layers.1.mlp.dense_h_to_4h.weight": "model-00001-of-00010.safetensors",
18
+ "transformer.encoder.layers.1.post_attention_layernorm.weight": "model-00001-of-00010.safetensors",
19
+ "transformer.encoder.layers.1.self_attention.dense.weight": "model-00001-of-00010.safetensors",
20
+ "transformer.encoder.layers.1.self_attention.query_key_value.bias": "model-00001-of-00010.safetensors",
21
+ "transformer.encoder.layers.1.self_attention.query_key_value.weight": "model-00001-of-00010.safetensors",
22
+ "transformer.encoder.layers.10.input_layernorm.weight": "model-00003-of-00010.safetensors",
23
+ "transformer.encoder.layers.10.mlp.dense_4h_to_h.weight": "model-00003-of-00010.safetensors",
24
+ "transformer.encoder.layers.10.mlp.dense_h_to_4h.weight": "model-00003-of-00010.safetensors",
25
+ "transformer.encoder.layers.10.post_attention_layernorm.weight": "model-00003-of-00010.safetensors",
26
+ "transformer.encoder.layers.10.self_attention.dense.weight": "model-00003-of-00010.safetensors",
27
+ "transformer.encoder.layers.10.self_attention.query_key_value.bias": "model-00003-of-00010.safetensors",
28
+ "transformer.encoder.layers.10.self_attention.query_key_value.weight": "model-00003-of-00010.safetensors",
29
+ "transformer.encoder.layers.11.input_layernorm.weight": "model-00003-of-00010.safetensors",
30
+ "transformer.encoder.layers.11.mlp.dense_4h_to_h.weight": "model-00004-of-00010.safetensors",
31
+ "transformer.encoder.layers.11.mlp.dense_h_to_4h.weight": "model-00004-of-00010.safetensors",
32
+ "transformer.encoder.layers.11.post_attention_layernorm.weight": "model-00004-of-00010.safetensors",
33
+ "transformer.encoder.layers.11.self_attention.dense.weight": "model-00004-of-00010.safetensors",
34
+ "transformer.encoder.layers.11.self_attention.query_key_value.bias": "model-00004-of-00010.safetensors",
35
+ "transformer.encoder.layers.11.self_attention.query_key_value.weight": "model-00004-of-00010.safetensors",
36
+ "transformer.encoder.layers.12.input_layernorm.weight": "model-00004-of-00010.safetensors",
37
+ "transformer.encoder.layers.12.mlp.dense_4h_to_h.weight": "model-00004-of-00010.safetensors",
38
+ "transformer.encoder.layers.12.mlp.dense_h_to_4h.weight": "model-00004-of-00010.safetensors",
39
+ "transformer.encoder.layers.12.post_attention_layernorm.weight": "model-00004-of-00010.safetensors",
40
+ "transformer.encoder.layers.12.self_attention.dense.weight": "model-00004-of-00010.safetensors",
41
+ "transformer.encoder.layers.12.self_attention.query_key_value.bias": "model-00004-of-00010.safetensors",
42
+ "transformer.encoder.layers.12.self_attention.query_key_value.weight": "model-00004-of-00010.safetensors",
43
+ "transformer.encoder.layers.13.input_layernorm.weight": "model-00004-of-00010.safetensors",
44
+ "transformer.encoder.layers.13.mlp.dense_4h_to_h.weight": "model-00004-of-00010.safetensors",
45
+ "transformer.encoder.layers.13.mlp.dense_h_to_4h.weight": "model-00004-of-00010.safetensors",
46
+ "transformer.encoder.layers.13.post_attention_layernorm.weight": "model-00004-of-00010.safetensors",
47
+ "transformer.encoder.layers.13.self_attention.dense.weight": "model-00004-of-00010.safetensors",
48
+ "transformer.encoder.layers.13.self_attention.query_key_value.bias": "model-00004-of-00010.safetensors",
49
+ "transformer.encoder.layers.13.self_attention.query_key_value.weight": "model-00004-of-00010.safetensors",
50
+ "transformer.encoder.layers.14.input_layernorm.weight": "model-00004-of-00010.safetensors",
51
+ "transformer.encoder.layers.14.mlp.dense_4h_to_h.weight": "model-00004-of-00010.safetensors",
52
+ "transformer.encoder.layers.14.mlp.dense_h_to_4h.weight": "model-00004-of-00010.safetensors",
53
+ "transformer.encoder.layers.14.post_attention_layernorm.weight": "model-00004-of-00010.safetensors",
54
+ "transformer.encoder.layers.14.self_attention.dense.weight": "model-00004-of-00010.safetensors",
55
+ "transformer.encoder.layers.14.self_attention.query_key_value.bias": "model-00004-of-00010.safetensors",
56
+ "transformer.encoder.layers.14.self_attention.query_key_value.weight": "model-00004-of-00010.safetensors",
57
+ "transformer.encoder.layers.15.input_layernorm.weight": "model-00004-of-00010.safetensors",
58
+ "transformer.encoder.layers.15.mlp.dense_4h_to_h.weight": "model-00005-of-00010.safetensors",
59
+ "transformer.encoder.layers.15.mlp.dense_h_to_4h.weight": "model-00004-of-00010.safetensors",
60
+ "transformer.encoder.layers.15.post_attention_layernorm.weight": "model-00004-of-00010.safetensors",
61
+ "transformer.encoder.layers.15.self_attention.dense.weight": "model-00004-of-00010.safetensors",
62
+ "transformer.encoder.layers.15.self_attention.query_key_value.bias": "model-00004-of-00010.safetensors",
63
+ "transformer.encoder.layers.15.self_attention.query_key_value.weight": "model-00004-of-00010.safetensors",
64
+ "transformer.encoder.layers.16.input_layernorm.weight": "model-00005-of-00010.safetensors",
65
+ "transformer.encoder.layers.16.mlp.dense_4h_to_h.weight": "model-00005-of-00010.safetensors",
66
+ "transformer.encoder.layers.16.mlp.dense_h_to_4h.weight": "model-00005-of-00010.safetensors",
67
+ "transformer.encoder.layers.16.post_attention_layernorm.weight": "model-00005-of-00010.safetensors",
68
+ "transformer.encoder.layers.16.self_attention.dense.weight": "model-00005-of-00010.safetensors",
69
+ "transformer.encoder.layers.16.self_attention.query_key_value.bias": "model-00005-of-00010.safetensors",
70
+ "transformer.encoder.layers.16.self_attention.query_key_value.weight": "model-00005-of-00010.safetensors",
71
+ "transformer.encoder.layers.17.input_layernorm.weight": "model-00005-of-00010.safetensors",
72
+ "transformer.encoder.layers.17.mlp.dense_4h_to_h.weight": "model-00005-of-00010.safetensors",
73
+ "transformer.encoder.layers.17.mlp.dense_h_to_4h.weight": "model-00005-of-00010.safetensors",
74
+ "transformer.encoder.layers.17.post_attention_layernorm.weight": "model-00005-of-00010.safetensors",
75
+ "transformer.encoder.layers.17.self_attention.dense.weight": "model-00005-of-00010.safetensors",
76
+ "transformer.encoder.layers.17.self_attention.query_key_value.bias": "model-00005-of-00010.safetensors",
77
+ "transformer.encoder.layers.17.self_attention.query_key_value.weight": "model-00005-of-00010.safetensors",
78
+ "transformer.encoder.layers.18.input_layernorm.weight": "model-00005-of-00010.safetensors",
79
+ "transformer.encoder.layers.18.mlp.dense_4h_to_h.weight": "model-00005-of-00010.safetensors",
80
+ "transformer.encoder.layers.18.mlp.dense_h_to_4h.weight": "model-00005-of-00010.safetensors",
81
+ "transformer.encoder.layers.18.post_attention_layernorm.weight": "model-00005-of-00010.safetensors",
82
+ "transformer.encoder.layers.18.self_attention.dense.weight": "model-00005-of-00010.safetensors",
83
+ "transformer.encoder.layers.18.self_attention.query_key_value.bias": "model-00005-of-00010.safetensors",
84
+ "transformer.encoder.layers.18.self_attention.query_key_value.weight": "model-00005-of-00010.safetensors",
85
+ "transformer.encoder.layers.19.input_layernorm.weight": "model-00005-of-00010.safetensors",
86
+ "transformer.encoder.layers.19.mlp.dense_4h_to_h.weight": "model-00005-of-00010.safetensors",
87
+ "transformer.encoder.layers.19.mlp.dense_h_to_4h.weight": "model-00005-of-00010.safetensors",
88
+ "transformer.encoder.layers.19.post_attention_layernorm.weight": "model-00005-of-00010.safetensors",
89
+ "transformer.encoder.layers.19.self_attention.dense.weight": "model-00005-of-00010.safetensors",
90
+ "transformer.encoder.layers.19.self_attention.query_key_value.bias": "model-00005-of-00010.safetensors",
91
+ "transformer.encoder.layers.19.self_attention.query_key_value.weight": "model-00005-of-00010.safetensors",
92
+ "transformer.encoder.layers.2.input_layernorm.weight": "model-00002-of-00010.safetensors",
93
+ "transformer.encoder.layers.2.mlp.dense_4h_to_h.weight": "model-00002-of-00010.safetensors",
94
+ "transformer.encoder.layers.2.mlp.dense_h_to_4h.weight": "model-00002-of-00010.safetensors",
95
+ "transformer.encoder.layers.2.post_attention_layernorm.weight": "model-00002-of-00010.safetensors",
96
+ "transformer.encoder.layers.2.self_attention.dense.weight": "model-00002-of-00010.safetensors",
97
+ "transformer.encoder.layers.2.self_attention.query_key_value.bias": "model-00002-of-00010.safetensors",
98
+ "transformer.encoder.layers.2.self_attention.query_key_value.weight": "model-00002-of-00010.safetensors",
99
+ "transformer.encoder.layers.20.input_layernorm.weight": "model-00005-of-00010.safetensors",
100
+ "transformer.encoder.layers.20.mlp.dense_4h_to_h.weight": "model-00006-of-00010.safetensors",
101
+ "transformer.encoder.layers.20.mlp.dense_h_to_4h.weight": "model-00006-of-00010.safetensors",
102
+ "transformer.encoder.layers.20.post_attention_layernorm.weight": "model-00005-of-00010.safetensors",
103
+ "transformer.encoder.layers.20.self_attention.dense.weight": "model-00005-of-00010.safetensors",
104
+ "transformer.encoder.layers.20.self_attention.query_key_value.bias": "model-00005-of-00010.safetensors",
105
+ "transformer.encoder.layers.20.self_attention.query_key_value.weight": "model-00005-of-00010.safetensors",
106
+ "transformer.encoder.layers.21.input_layernorm.weight": "model-00006-of-00010.safetensors",
107
+ "transformer.encoder.layers.21.mlp.dense_4h_to_h.weight": "model-00006-of-00010.safetensors",
108
+ "transformer.encoder.layers.21.mlp.dense_h_to_4h.weight": "model-00006-of-00010.safetensors",
109
+ "transformer.encoder.layers.21.post_attention_layernorm.weight": "model-00006-of-00010.safetensors",
110
+ "transformer.encoder.layers.21.self_attention.dense.weight": "model-00006-of-00010.safetensors",
111
+ "transformer.encoder.layers.21.self_attention.query_key_value.bias": "model-00006-of-00010.safetensors",
112
+ "transformer.encoder.layers.21.self_attention.query_key_value.weight": "model-00006-of-00010.safetensors",
113
+ "transformer.encoder.layers.22.input_layernorm.weight": "model-00006-of-00010.safetensors",
114
+ "transformer.encoder.layers.22.mlp.dense_4h_to_h.weight": "model-00006-of-00010.safetensors",
115
+ "transformer.encoder.layers.22.mlp.dense_h_to_4h.weight": "model-00006-of-00010.safetensors",
116
+ "transformer.encoder.layers.22.post_attention_layernorm.weight": "model-00006-of-00010.safetensors",
117
+ "transformer.encoder.layers.22.self_attention.dense.weight": "model-00006-of-00010.safetensors",
118
+ "transformer.encoder.layers.22.self_attention.query_key_value.bias": "model-00006-of-00010.safetensors",
119
+ "transformer.encoder.layers.22.self_attention.query_key_value.weight": "model-00006-of-00010.safetensors",
120
+ "transformer.encoder.layers.23.input_layernorm.weight": "model-00006-of-00010.safetensors",
121
+ "transformer.encoder.layers.23.mlp.dense_4h_to_h.weight": "model-00006-of-00010.safetensors",
122
+ "transformer.encoder.layers.23.mlp.dense_h_to_4h.weight": "model-00006-of-00010.safetensors",
123
+ "transformer.encoder.layers.23.post_attention_layernorm.weight": "model-00006-of-00010.safetensors",
124
+ "transformer.encoder.layers.23.self_attention.dense.weight": "model-00006-of-00010.safetensors",
125
+ "transformer.encoder.layers.23.self_attention.query_key_value.bias": "model-00006-of-00010.safetensors",
126
+ "transformer.encoder.layers.23.self_attention.query_key_value.weight": "model-00006-of-00010.safetensors",
127
+ "transformer.encoder.layers.24.input_layernorm.weight": "model-00006-of-00010.safetensors",
128
+ "transformer.encoder.layers.24.mlp.dense_4h_to_h.weight": "model-00006-of-00010.safetensors",
129
+ "transformer.encoder.layers.24.mlp.dense_h_to_4h.weight": "model-00006-of-00010.safetensors",
130
+ "transformer.encoder.layers.24.post_attention_layernorm.weight": "model-00006-of-00010.safetensors",
131
+ "transformer.encoder.layers.24.self_attention.dense.weight": "model-00006-of-00010.safetensors",
132
+ "transformer.encoder.layers.24.self_attention.query_key_value.bias": "model-00006-of-00010.safetensors",
133
+ "transformer.encoder.layers.24.self_attention.query_key_value.weight": "model-00006-of-00010.safetensors",
134
+ "transformer.encoder.layers.25.input_layernorm.weight": "model-00006-of-00010.safetensors",
135
+ "transformer.encoder.layers.25.mlp.dense_4h_to_h.weight": "model-00007-of-00010.safetensors",
136
+ "transformer.encoder.layers.25.mlp.dense_h_to_4h.weight": "model-00007-of-00010.safetensors",
137
+ "transformer.encoder.layers.25.post_attention_layernorm.weight": "model-00007-of-00010.safetensors",
138
+ "transformer.encoder.layers.25.self_attention.dense.weight": "model-00007-of-00010.safetensors",
139
+ "transformer.encoder.layers.25.self_attention.query_key_value.bias": "model-00007-of-00010.safetensors",
140
+ "transformer.encoder.layers.25.self_attention.query_key_value.weight": "model-00007-of-00010.safetensors",
141
+ "transformer.encoder.layers.26.input_layernorm.weight": "model-00007-of-00010.safetensors",
142
+ "transformer.encoder.layers.26.mlp.dense_4h_to_h.weight": "model-00007-of-00010.safetensors",
143
+ "transformer.encoder.layers.26.mlp.dense_h_to_4h.weight": "model-00007-of-00010.safetensors",
144
+ "transformer.encoder.layers.26.post_attention_layernorm.weight": "model-00007-of-00010.safetensors",
145
+ "transformer.encoder.layers.26.self_attention.dense.weight": "model-00007-of-00010.safetensors",
146
+ "transformer.encoder.layers.26.self_attention.query_key_value.bias": "model-00007-of-00010.safetensors",
147
+ "transformer.encoder.layers.26.self_attention.query_key_value.weight": "model-00007-of-00010.safetensors",
148
+ "transformer.encoder.layers.27.input_layernorm.weight": "model-00007-of-00010.safetensors",
149
+ "transformer.encoder.layers.27.mlp.dense_4h_to_h.weight": "model-00007-of-00010.safetensors",
150
+ "transformer.encoder.layers.27.mlp.dense_h_to_4h.weight": "model-00007-of-00010.safetensors",
151
+ "transformer.encoder.layers.27.post_attention_layernorm.weight": "model-00007-of-00010.safetensors",
152
+ "transformer.encoder.layers.27.self_attention.dense.weight": "model-00007-of-00010.safetensors",
153
+ "transformer.encoder.layers.27.self_attention.query_key_value.bias": "model-00007-of-00010.safetensors",
154
+ "transformer.encoder.layers.27.self_attention.query_key_value.weight": "model-00007-of-00010.safetensors",
155
+ "transformer.encoder.layers.28.input_layernorm.weight": "model-00007-of-00010.safetensors",
156
+ "transformer.encoder.layers.28.mlp.dense_4h_to_h.weight": "model-00007-of-00010.safetensors",
157
+ "transformer.encoder.layers.28.mlp.dense_h_to_4h.weight": "model-00007-of-00010.safetensors",
158
+ "transformer.encoder.layers.28.post_attention_layernorm.weight": "model-00007-of-00010.safetensors",
159
+ "transformer.encoder.layers.28.self_attention.dense.weight": "model-00007-of-00010.safetensors",
160
+ "transformer.encoder.layers.28.self_attention.query_key_value.bias": "model-00007-of-00010.safetensors",
161
+ "transformer.encoder.layers.28.self_attention.query_key_value.weight": "model-00007-of-00010.safetensors",
162
+ "transformer.encoder.layers.29.input_layernorm.weight": "model-00007-of-00010.safetensors",
163
+ "transformer.encoder.layers.29.mlp.dense_4h_to_h.weight": "model-00008-of-00010.safetensors",
164
+ "transformer.encoder.layers.29.mlp.dense_h_to_4h.weight": "model-00007-of-00010.safetensors",
165
+ "transformer.encoder.layers.29.post_attention_layernorm.weight": "model-00007-of-00010.safetensors",
166
+ "transformer.encoder.layers.29.self_attention.dense.weight": "model-00007-of-00010.safetensors",
167
+ "transformer.encoder.layers.29.self_attention.query_key_value.bias": "model-00007-of-00010.safetensors",
168
+ "transformer.encoder.layers.29.self_attention.query_key_value.weight": "model-00007-of-00010.safetensors",
169
+ "transformer.encoder.layers.3.input_layernorm.weight": "model-00002-of-00010.safetensors",
170
+ "transformer.encoder.layers.3.mlp.dense_4h_to_h.weight": "model-00002-of-00010.safetensors",
171
+ "transformer.encoder.layers.3.mlp.dense_h_to_4h.weight": "model-00002-of-00010.safetensors",
172
+ "transformer.encoder.layers.3.post_attention_layernorm.weight": "model-00002-of-00010.safetensors",
173
+ "transformer.encoder.layers.3.self_attention.dense.weight": "model-00002-of-00010.safetensors",
174
+ "transformer.encoder.layers.3.self_attention.query_key_value.bias": "model-00002-of-00010.safetensors",
175
+ "transformer.encoder.layers.3.self_attention.query_key_value.weight": "model-00002-of-00010.safetensors",
176
+ "transformer.encoder.layers.30.input_layernorm.weight": "model-00008-of-00010.safetensors",
177
+ "transformer.encoder.layers.30.mlp.dense_4h_to_h.weight": "model-00008-of-00010.safetensors",
178
+ "transformer.encoder.layers.30.mlp.dense_h_to_4h.weight": "model-00008-of-00010.safetensors",
179
+ "transformer.encoder.layers.30.post_attention_layernorm.weight": "model-00008-of-00010.safetensors",
180
+ "transformer.encoder.layers.30.self_attention.dense.weight": "model-00008-of-00010.safetensors",
181
+ "transformer.encoder.layers.30.self_attention.query_key_value.bias": "model-00008-of-00010.safetensors",
182
+ "transformer.encoder.layers.30.self_attention.query_key_value.weight": "model-00008-of-00010.safetensors",
183
+ "transformer.encoder.layers.31.input_layernorm.weight": "model-00008-of-00010.safetensors",
184
+ "transformer.encoder.layers.31.mlp.dense_4h_to_h.weight": "model-00008-of-00010.safetensors",
185
+ "transformer.encoder.layers.31.mlp.dense_h_to_4h.weight": "model-00008-of-00010.safetensors",
186
+ "transformer.encoder.layers.31.post_attention_layernorm.weight": "model-00008-of-00010.safetensors",
187
+ "transformer.encoder.layers.31.self_attention.dense.weight": "model-00008-of-00010.safetensors",
188
+ "transformer.encoder.layers.31.self_attention.query_key_value.bias": "model-00008-of-00010.safetensors",
189
+ "transformer.encoder.layers.31.self_attention.query_key_value.weight": "model-00008-of-00010.safetensors",
190
+ "transformer.encoder.layers.32.input_layernorm.weight": "model-00008-of-00010.safetensors",
191
+ "transformer.encoder.layers.32.mlp.dense_4h_to_h.weight": "model-00008-of-00010.safetensors",
192
+ "transformer.encoder.layers.32.mlp.dense_h_to_4h.weight": "model-00008-of-00010.safetensors",
193
+ "transformer.encoder.layers.32.post_attention_layernorm.weight": "model-00008-of-00010.safetensors",
194
+ "transformer.encoder.layers.32.self_attention.dense.weight": "model-00008-of-00010.safetensors",
195
+ "transformer.encoder.layers.32.self_attention.query_key_value.bias": "model-00008-of-00010.safetensors",
196
+ "transformer.encoder.layers.32.self_attention.query_key_value.weight": "model-00008-of-00010.safetensors",
197
+ "transformer.encoder.layers.33.input_layernorm.weight": "model-00008-of-00010.safetensors",
198
+ "transformer.encoder.layers.33.mlp.dense_4h_to_h.weight": "model-00008-of-00010.safetensors",
199
+ "transformer.encoder.layers.33.mlp.dense_h_to_4h.weight": "model-00008-of-00010.safetensors",
200
+ "transformer.encoder.layers.33.post_attention_layernorm.weight": "model-00008-of-00010.safetensors",
201
+ "transformer.encoder.layers.33.self_attention.dense.weight": "model-00008-of-00010.safetensors",
202
+ "transformer.encoder.layers.33.self_attention.query_key_value.bias": "model-00008-of-00010.safetensors",
203
+ "transformer.encoder.layers.33.self_attention.query_key_value.weight": "model-00008-of-00010.safetensors",
204
+ "transformer.encoder.layers.34.input_layernorm.weight": "model-00008-of-00010.safetensors",
205
+ "transformer.encoder.layers.34.mlp.dense_4h_to_h.weight": "model-00009-of-00010.safetensors",
206
+ "transformer.encoder.layers.34.mlp.dense_h_to_4h.weight": "model-00009-of-00010.safetensors",
207
+ "transformer.encoder.layers.34.post_attention_layernorm.weight": "model-00008-of-00010.safetensors",
208
+ "transformer.encoder.layers.34.self_attention.dense.weight": "model-00008-of-00010.safetensors",
209
+ "transformer.encoder.layers.34.self_attention.query_key_value.bias": "model-00008-of-00010.safetensors",
210
+ "transformer.encoder.layers.34.self_attention.query_key_value.weight": "model-00008-of-00010.safetensors",
211
+ "transformer.encoder.layers.35.input_layernorm.weight": "model-00009-of-00010.safetensors",
212
+ "transformer.encoder.layers.35.mlp.dense_4h_to_h.weight": "model-00009-of-00010.safetensors",
213
+ "transformer.encoder.layers.35.mlp.dense_h_to_4h.weight": "model-00009-of-00010.safetensors",
214
+ "transformer.encoder.layers.35.post_attention_layernorm.weight": "model-00009-of-00010.safetensors",
215
+ "transformer.encoder.layers.35.self_attention.dense.weight": "model-00009-of-00010.safetensors",
216
+ "transformer.encoder.layers.35.self_attention.query_key_value.bias": "model-00009-of-00010.safetensors",
217
+ "transformer.encoder.layers.35.self_attention.query_key_value.weight": "model-00009-of-00010.safetensors",
218
+ "transformer.encoder.layers.36.input_layernorm.weight": "model-00009-of-00010.safetensors",
219
+ "transformer.encoder.layers.36.mlp.dense_4h_to_h.weight": "model-00009-of-00010.safetensors",
220
+ "transformer.encoder.layers.36.mlp.dense_h_to_4h.weight": "model-00009-of-00010.safetensors",
221
+ "transformer.encoder.layers.36.post_attention_layernorm.weight": "model-00009-of-00010.safetensors",
222
+ "transformer.encoder.layers.36.self_attention.dense.weight": "model-00009-of-00010.safetensors",
223
+ "transformer.encoder.layers.36.self_attention.query_key_value.bias": "model-00009-of-00010.safetensors",
224
+ "transformer.encoder.layers.36.self_attention.query_key_value.weight": "model-00009-of-00010.safetensors",
225
+ "transformer.encoder.layers.37.input_layernorm.weight": "model-00009-of-00010.safetensors",
226
+ "transformer.encoder.layers.37.mlp.dense_4h_to_h.weight": "model-00009-of-00010.safetensors",
227
+ "transformer.encoder.layers.37.mlp.dense_h_to_4h.weight": "model-00009-of-00010.safetensors",
228
+ "transformer.encoder.layers.37.post_attention_layernorm.weight": "model-00009-of-00010.safetensors",
229
+ "transformer.encoder.layers.37.self_attention.dense.weight": "model-00009-of-00010.safetensors",
230
+ "transformer.encoder.layers.37.self_attention.query_key_value.bias": "model-00009-of-00010.safetensors",
231
+ "transformer.encoder.layers.37.self_attention.query_key_value.weight": "model-00009-of-00010.safetensors",
232
+ "transformer.encoder.layers.38.input_layernorm.weight": "model-00009-of-00010.safetensors",
233
+ "transformer.encoder.layers.38.mlp.dense_4h_to_h.weight": "model-00009-of-00010.safetensors",
234
+ "transformer.encoder.layers.38.mlp.dense_h_to_4h.weight": "model-00009-of-00010.safetensors",
235
+ "transformer.encoder.layers.38.post_attention_layernorm.weight": "model-00009-of-00010.safetensors",
236
+ "transformer.encoder.layers.38.self_attention.dense.weight": "model-00009-of-00010.safetensors",
237
+ "transformer.encoder.layers.38.self_attention.query_key_value.bias": "model-00009-of-00010.safetensors",
238
+ "transformer.encoder.layers.38.self_attention.query_key_value.weight": "model-00009-of-00010.safetensors",
239
+ "transformer.encoder.layers.39.input_layernorm.weight": "model-00009-of-00010.safetensors",
240
+ "transformer.encoder.layers.39.mlp.dense_4h_to_h.weight": "model-00010-of-00010.safetensors",
241
+ "transformer.encoder.layers.39.mlp.dense_h_to_4h.weight": "model-00010-of-00010.safetensors",
242
+ "transformer.encoder.layers.39.post_attention_layernorm.weight": "model-00010-of-00010.safetensors",
243
+ "transformer.encoder.layers.39.self_attention.dense.weight": "model-00010-of-00010.safetensors",
244
+ "transformer.encoder.layers.39.self_attention.query_key_value.bias": "model-00010-of-00010.safetensors",
245
+ "transformer.encoder.layers.39.self_attention.query_key_value.weight": "model-00010-of-00010.safetensors",
246
+ "transformer.encoder.layers.4.input_layernorm.weight": "model-00002-of-00010.safetensors",
247
+ "transformer.encoder.layers.4.mlp.dense_4h_to_h.weight": "model-00002-of-00010.safetensors",
248
+ "transformer.encoder.layers.4.mlp.dense_h_to_4h.weight": "model-00002-of-00010.safetensors",
249
+ "transformer.encoder.layers.4.post_attention_layernorm.weight": "model-00002-of-00010.safetensors",
250
+ "transformer.encoder.layers.4.self_attention.dense.weight": "model-00002-of-00010.safetensors",
251
+ "transformer.encoder.layers.4.self_attention.query_key_value.bias": "model-00002-of-00010.safetensors",
252
+ "transformer.encoder.layers.4.self_attention.query_key_value.weight": "model-00002-of-00010.safetensors",
253
+ "transformer.encoder.layers.5.input_layernorm.weight": "model-00002-of-00010.safetensors",
254
+ "transformer.encoder.layers.5.mlp.dense_4h_to_h.weight": "model-00002-of-00010.safetensors",
255
+ "transformer.encoder.layers.5.mlp.dense_h_to_4h.weight": "model-00002-of-00010.safetensors",
256
+ "transformer.encoder.layers.5.post_attention_layernorm.weight": "model-00002-of-00010.safetensors",
257
+ "transformer.encoder.layers.5.self_attention.dense.weight": "model-00002-of-00010.safetensors",
258
+ "transformer.encoder.layers.5.self_attention.query_key_value.bias": "model-00002-of-00010.safetensors",
259
+ "transformer.encoder.layers.5.self_attention.query_key_value.weight": "model-00002-of-00010.safetensors",
260
+ "transformer.encoder.layers.6.input_layernorm.weight": "model-00002-of-00010.safetensors",
261
+ "transformer.encoder.layers.6.mlp.dense_4h_to_h.weight": "model-00003-of-00010.safetensors",
262
+ "transformer.encoder.layers.6.mlp.dense_h_to_4h.weight": "model-00003-of-00010.safetensors",
263
+ "transformer.encoder.layers.6.post_attention_layernorm.weight": "model-00002-of-00010.safetensors",
264
+ "transformer.encoder.layers.6.self_attention.dense.weight": "model-00002-of-00010.safetensors",
265
+ "transformer.encoder.layers.6.self_attention.query_key_value.bias": "model-00002-of-00010.safetensors",
266
+ "transformer.encoder.layers.6.self_attention.query_key_value.weight": "model-00002-of-00010.safetensors",
267
+ "transformer.encoder.layers.7.input_layernorm.weight": "model-00003-of-00010.safetensors",
268
+ "transformer.encoder.layers.7.mlp.dense_4h_to_h.weight": "model-00003-of-00010.safetensors",
269
+ "transformer.encoder.layers.7.mlp.dense_h_to_4h.weight": "model-00003-of-00010.safetensors",
270
+ "transformer.encoder.layers.7.post_attention_layernorm.weight": "model-00003-of-00010.safetensors",
271
+ "transformer.encoder.layers.7.self_attention.dense.weight": "model-00003-of-00010.safetensors",
272
+ "transformer.encoder.layers.7.self_attention.query_key_value.bias": "model-00003-of-00010.safetensors",
273
+ "transformer.encoder.layers.7.self_attention.query_key_value.weight": "model-00003-of-00010.safetensors",
274
+ "transformer.encoder.layers.8.input_layernorm.weight": "model-00003-of-00010.safetensors",
275
+ "transformer.encoder.layers.8.mlp.dense_4h_to_h.weight": "model-00003-of-00010.safetensors",
276
+ "transformer.encoder.layers.8.mlp.dense_h_to_4h.weight": "model-00003-of-00010.safetensors",
277
+ "transformer.encoder.layers.8.post_attention_layernorm.weight": "model-00003-of-00010.safetensors",
278
+ "transformer.encoder.layers.8.self_attention.dense.weight": "model-00003-of-00010.safetensors",
279
+ "transformer.encoder.layers.8.self_attention.query_key_value.bias": "model-00003-of-00010.safetensors",
280
+ "transformer.encoder.layers.8.self_attention.query_key_value.weight": "model-00003-of-00010.safetensors",
281
+ "transformer.encoder.layers.9.input_layernorm.weight": "model-00003-of-00010.safetensors",
282
+ "transformer.encoder.layers.9.mlp.dense_4h_to_h.weight": "model-00003-of-00010.safetensors",
283
+ "transformer.encoder.layers.9.mlp.dense_h_to_4h.weight": "model-00003-of-00010.safetensors",
284
+ "transformer.encoder.layers.9.post_attention_layernorm.weight": "model-00003-of-00010.safetensors",
285
+ "transformer.encoder.layers.9.self_attention.dense.weight": "model-00003-of-00010.safetensors",
286
+ "transformer.encoder.layers.9.self_attention.query_key_value.bias": "model-00003-of-00010.safetensors",
287
+ "transformer.encoder.layers.9.self_attention.query_key_value.weight": "model-00003-of-00010.safetensors",
288
+ "transformer.output_layer.weight": "model-00010-of-00010.safetensors",
289
+ "transformer.rotary_pos_emb.inv_freq": "model-00001-of-00010.safetensors"
290
+ }
291
+ }
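The `weight_map` ties every parameter name to the shard file that stores it. Below is a hedged sketch of inspecting such a sharded checkpoint by hand (the parameter and file names are taken from the listing above; in normal use `from_pretrained` resolves the index automatically).

```python
# Hypothetical manual lookup in the sharded safetensors checkpoint.
# Assumes the index file and the referenced shard have been downloaded locally.
import json
from safetensors import safe_open

with open("model.safetensors.index.json") as f:
    index = json.load(f)

name = "transformer.encoder.layers.0.self_attention.query_key_value.weight"
shard = index["weight_map"][name]  # "model-00001-of-00010.safetensors" per the map above

with safe_open(shard, framework="pt") as f:
    tensor = f.get_tensor(name)
print(shard, tuple(tensor.shape), tensor.dtype)
```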
modeling_chatglm.py CHANGED
@@ -1,13 +1,19 @@
1
- """ PyTorch GLM-4V model. """
 
2
  import math
 
 
 
3
  import sys
 
4
  import torch
5
  import torch.utils.checkpoint
6
  import torch.nn.functional as F
7
  from torch import nn
8
  from torch.nn import CrossEntropyLoss, LayerNorm, MSELoss, BCEWithLogitsLoss
9
  from torch.nn.utils import skip_init
10
- from typing import Optional, Tuple, Union, List, Dict, Any
 
11
 
12
  from transformers.modeling_outputs import (
13
  BaseModelOutputWithPast,
@@ -19,7 +25,6 @@ from transformers.utils import logging, is_torch_npu_available
19
  from transformers.generation.logits_process import LogitsProcessor
20
  from transformers.generation.utils import LogitsProcessorList, StoppingCriteriaList, GenerationConfig, ModelOutput
21
 
22
- from .visual import EVA2CLIPModel
23
  from .configuration_chatglm import ChatGLMConfig
24
 
25
  try:
@@ -41,9 +46,6 @@ if sys.platform != 'darwin' and not is_torch_npu_available():
41
 
42
  logger = logging.get_logger(__name__)
43
 
44
- LANGUAGE_TOKEN_TYPE = 0
45
- VISION_TOKEN_TYPE = 1
46
-
47
  _CHECKPOINT_FOR_DOC = "THUDM/ChatGLM"
48
  _CONFIG_FOR_DOC = "ChatGLMConfig"
49
 
@@ -60,38 +62,6 @@ class InvalidScoreLogitsProcessor(LogitsProcessor):
60
  return scores
61
 
62
 
63
- class PrefixEncoder(torch.nn.Module):
64
- """
65
- The torch.nn model to encode the prefix
66
- Input shape: (batch-size, prefix-length)
67
- Output shape: (batch-size, prefix-length, 2*layers*hidden)
68
- """
69
-
70
- def __init__(self, config: ChatGLMConfig):
71
- super().__init__()
72
- self.prefix_projection = config.prefix_projection
73
- if self.prefix_projection:
74
- # Use a two-layer MLP to encode the prefix
75
- kv_size = config.num_layers * config.kv_channels * config.multi_query_group_num * 2
76
- self.embedding = torch.nn.Embedding(config.pre_seq_len, kv_size)
77
- self.trans = torch.nn.Sequential(
78
- torch.nn.Linear(kv_size, config.hidden_size),
79
- torch.nn.Tanh(),
80
- torch.nn.Linear(config.hidden_size, kv_size)
81
- )
82
- else:
83
- self.embedding = torch.nn.Embedding(config.pre_seq_len,
84
- config.num_layers * config.kv_channels * config.multi_query_group_num * 2)
85
-
86
- def forward(self, prefix: torch.Tensor):
87
- if self.prefix_projection:
88
- prefix_tokens = self.embedding(prefix)
89
- past_key_values = self.trans(prefix_tokens)
90
- else:
91
- past_key_values = self.embedding(prefix)
92
- return past_key_values
93
-
94
-
95
  def split_tensor_along_last_dim(
96
  tensor: torch.Tensor,
97
  num_partitions: int,
@@ -129,17 +99,6 @@ class RotaryEmbedding(nn.Module):
129
  self.original_impl = original_impl
130
  self.rope_ratio = rope_ratio
131
 
132
- def impl(self, seq_length: int, dim: int, device: torch.device, dtype: torch.dtype):
133
- base = 10000 * self.rope_ratio
134
- inv_freq = 1.0 / (
135
- base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim))
136
- seq = torch.arange(seq_length, device=inv_freq.device, dtype=torch.float32)
137
- freqs = torch.outer(seq, inv_freq)
138
- # first part even vector components, second part odd vector components,
139
- # 2 * dim in dimension size
140
- emb = torch.cat((freqs, freqs), dim=-1)
141
- return emb
142
-
143
  def forward_impl(
144
  self, seq_len: int, n_elem: int, dtype: torch.dtype, device: torch.device, base: int = 10000
145
  ):
@@ -167,12 +126,9 @@ class RotaryEmbedding(nn.Module):
167
  return cache
168
 
169
  def forward(self, max_seq_len, offset=0):
170
- if self.original_impl:
171
- return self.forward_impl(
172
- max_seq_len, self.dim, dtype=self.inv_freq.dtype, device=self.inv_freq.device
173
- )
174
- else:
175
- return self.impl(max_seq_len, self.dim, dtype=self.inv_freq.dtype, device=self.inv_freq.device)
176
 
177
 
178
  @torch.jit.script
@@ -210,16 +166,16 @@ class RMSNorm(torch.nn.Module):
210
  return (self.weight * hidden_states).to(input_dtype)
211
 
212
 
213
-
214
  class CoreAttention(torch.nn.Module):
215
  def __init__(self, config: ChatGLMConfig, layer_number):
216
  super(CoreAttention, self).__init__()
217
-
218
  self.apply_query_key_layer_scaling = config.apply_query_key_layer_scaling
219
  self.attention_softmax_in_fp32 = config.attention_softmax_in_fp32
220
  if self.apply_query_key_layer_scaling:
221
  self.attention_softmax_in_fp32 = True
222
  self.layer_number = max(1, layer_number)
 
223
 
224
  projection_size = config.kv_channels * config.num_attention_heads
225
 
@@ -238,95 +194,77 @@ class CoreAttention(torch.nn.Module):
238
  self.attention_dropout = torch.nn.Dropout(config.attention_dropout)
239
 
240
  def forward(self, query_layer, key_layer, value_layer, attention_mask):
241
- pytorch_major_version = int(torch.__version__.split('.')[0])
242
- if pytorch_major_version >= 2:
243
- if attention_mask is None and query_layer.shape[2] == key_layer.shape[2]:
244
- context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer,
245
- is_causal=True)
246
- else:
247
- if attention_mask is not None:
248
- attention_mask = ~attention_mask
249
- context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer,
250
- attention_mask)
251
- context_layer = context_layer.transpose(1, 2).contiguous()
252
- new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,)
253
- context_layer = context_layer.reshape(*new_context_layer_shape)
254
- else:
255
- # Raw attention scores
256
-
257
- # [b, np, sq, sk]
258
- output_size = (query_layer.size(0), query_layer.size(1), query_layer.size(2), key_layer.size(2))
259
-
260
- # [b, np, sq, hn] -> [b * np, sq, hn]
261
- query_layer = query_layer.view(output_size[0] * output_size[1], output_size[2], -1)
262
- # [b, np, sk, hn] -> [b * np, sk, hn]
263
- key_layer = key_layer.view(output_size[0] * output_size[1], output_size[3], -1)
264
-
265
- # preallocting input tensor: [b * np, sq, sk]
266
- matmul_input_buffer = torch.empty(
267
- output_size[0] * output_size[1], output_size[2], output_size[3], dtype=query_layer.dtype,
268
- device=query_layer.device
269
- )
270
 
271
- # Raw attention scores. [b * np, sq, sk]
272
- matmul_result = torch.baddbmm(
273
- matmul_input_buffer,
274
- query_layer, # [b * np, sq, hn]
275
- key_layer.transpose(1, 2), # [b * np, hn, sk]
276
- beta=0.0,
277
- alpha=(1.0 / self.norm_factor),
278
- )
279
 
280
- # change view to [b, np, sq, sk]
281
- attention_scores = matmul_result.view(*output_size)
282
-
283
- # ===========================
284
- # Attention probs and dropout
285
- # ===========================
286
-
287
- # attention scores and attention mask [b, np, sq, sk]
288
- if self.attention_softmax_in_fp32:
289
- attention_scores = attention_scores.float()
290
- if self.coeff is not None:
291
- attention_scores = attention_scores * self.coeff
292
- if attention_mask is None and attention_scores.shape[2] == attention_scores.shape[3]:
293
- attention_mask = torch.ones(output_size[0], 1, output_size[2], output_size[3],
294
- device=attention_scores.device, dtype=torch.bool)
295
- attention_mask.tril_()
296
- attention_mask = ~attention_mask
297
- if attention_mask is not None:
298
- attention_scores = attention_scores.masked_fill(attention_mask, float("-inf"))
299
- attention_probs = F.softmax(attention_scores, dim=-1)
300
- attention_probs = attention_probs.type_as(value_layer)
301
-
302
- # This is actually dropping out entire tokens to attend to, which might
303
- # seem a bit unusual, but is taken from the original Transformer paper.
304
- attention_probs = self.attention_dropout(attention_probs)
305
- # =========================
306
- # Context layer. [sq, b, hp]
307
- # =========================
308
-
309
- # value_layer -> context layer.
310
- # [sk, b, np, hn] --> [b, np, sq, hn]
311
-
312
- # context layer shape: [b, np, sq, hn]
313
- output_size = (value_layer.size(1), value_layer.size(2), query_layer.size(0), value_layer.size(3))
314
- # change view [b * np, sk, hn]
315
- value_layer = value_layer.view(output_size[0] * output_size[1], value_layer.size(2), -1)
316
- # change view [b * np, sq, sk]
317
- attention_probs = attention_probs.view(output_size[0] * output_size[1], output_size[2], -1)
318
- # matmul: [b * np, sq, hn]
319
- context_layer = torch.bmm(attention_probs, value_layer)
320
- # change view [b, np, sq, hn]
321
- context_layer = context_layer.view(*output_size)
322
- # [b, np, sq, hn] --> [b, sq, np, hn]
323
- context_layer = context_layer.transpose(1, 2).contiguous()
324
- # [b, sq, np, hn] --> [b, sq, hp]
325
- new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,)
326
- context_layer = context_layer.reshape(*new_context_layer_shape)
327
 
328
  return context_layer
329
 
 
330
  class SdpaAttention(CoreAttention):
331
  def forward(self, query_layer, key_layer, value_layer, attention_mask):
332
  if attention_mask is None and query_layer.shape[2] == key_layer.shape[2]:
@@ -450,6 +388,7 @@ CORE_ATTENTION_CLASSES = {
450
  "flash_attention_2": FlashAttention2
451
  }
452
 
 
453
  class SelfAttention(torch.nn.Module):
454
  """Parallel self-attention layer abstract class.
455
 
@@ -469,7 +408,6 @@ class SelfAttention(torch.nn.Module):
469
 
470
  self.multi_query_attention = config.multi_query_attention
471
  self.qkv_hidden_size = 3 * self.projection_size
472
- self.original_rope = config.original_rope
473
  if self.multi_query_attention:
474
  self.num_multi_query_groups_per_partition = config.multi_query_group_num
475
  self.qkv_hidden_size = (
@@ -480,7 +418,7 @@ class SelfAttention(torch.nn.Module):
480
  device=device, **_config_to_kwargs(config)
481
  )
482
 
483
- self.core_attention = CoreAttention(config, self.layer_number)
484
 
485
  # Output.
486
  self.dense = nn.Linear(self.projection_size, config.hidden_size, bias=config.add_bias_linear,
@@ -558,7 +496,11 @@ class SelfAttention(torch.nn.Module):
558
  key_layer = torch.cat((cache_k, key_layer), dim=2)
559
  value_layer = torch.cat((cache_v, value_layer), dim=2)
560
  if use_cache:
561
- kv_cache = (key_layer, value_layer)
562
  else:
563
  kv_cache = None
564
 
@@ -791,7 +733,15 @@ class GLMTransformer(torch.nn.Module):
791
  )
792
  hidden_states, kv_cache = layer_ret
793
  if use_cache:
794
- presents = presents + (kv_cache,)
795
 
796
  if output_hidden_states:
797
  all_hidden_states = all_hidden_states + (hidden_states,)
@@ -821,16 +771,20 @@ class ChatGLMPreTrainedModel(PreTrainedModel):
821
  """Initialize the weights."""
822
  return
823
 
824
- def get_masks(self, input_embeds, past_key_values, padding_mask=None):
825
- batch_size, seq_length, embed_size = input_embeds.shape
826
- full_attention_mask = torch.ones(batch_size, seq_length, seq_length, device=input_embeds.device)
827
  full_attention_mask.tril_()
828
  past_length = 0
829
  if past_key_values:
830
  past_length = past_key_values[0][0].shape[2]
831
  if past_length:
832
  full_attention_mask = torch.cat((torch.ones(batch_size, seq_length, past_length,
833
- device=input_embeds.device), full_attention_mask), dim=-1)
834
  if padding_mask is not None:
835
  full_attention_mask = full_attention_mask * padding_mask.unsqueeze(1)
836
  if not past_length and padding_mask is not None:
@@ -844,9 +798,6 @@ class ChatGLMPreTrainedModel(PreTrainedModel):
844
  position_ids = torch.arange(seq_length, dtype=torch.long, device=device).unsqueeze(0).repeat(batch_size, 1)
845
  return position_ids
846
 
847
- def get_multimodal_position_ids(self, input_ids, device):
848
- batch_size, seq_length = input_ids.shape
849
- position_ids = torch.arange(seq_length, dtype=torch.long, device=device).unsqueeze(0).repeat(batch_size, 1)
850
 
851
  class Embedding(torch.nn.Module):
852
  """Language model embeddings."""
@@ -874,15 +825,6 @@ class Embedding(torch.nn.Module):
874
  return embeddings
875
 
876
 
877
- def is_empty(images_list: Optional[List[List[torch.Tensor]]]):
878
- if images_list is None or len(images_list) == 0:
879
- return True
880
- for image_list in images_list:
881
- if image_list is not None:
882
- return False
883
- return True
884
-
885
-
886
  class ChatGLMModel(ChatGLMPreTrainedModel):
887
  def __init__(self, config: ChatGLMConfig, device=None, empty_init=True):
888
  super().__init__(config)
@@ -910,16 +852,6 @@ class ChatGLMModel(ChatGLMPreTrainedModel):
910
  self.encoder = init_method(GLMTransformer, config, **init_kwargs)
911
  self.output_layer = init_method(nn.Linear, config.hidden_size, config.padded_vocab_size, bias=False,
912
  dtype=config.torch_dtype, **init_kwargs)
913
- self.pre_seq_len = config.pre_seq_len
914
- self.prefix_projection = config.prefix_projection
915
- if self.pre_seq_len is not None:
916
- for param in self.parameters():
917
- param.requires_grad = False
918
- self.prefix_tokens = torch.arange(self.pre_seq_len).long()
919
- self.prefix_encoder = PrefixEncoder(config)
920
- self.dropout = torch.nn.Dropout(0.1)
921
-
922
- self.vision = EVA2CLIPModel(config)
923
 
924
  def get_input_embeddings(self):
925
  return self.embedding.word_embeddings
@@ -927,70 +859,19 @@ class ChatGLMModel(ChatGLMPreTrainedModel):
927
  def set_input_embeddings(self, value):
928
  self.embedding.word_embeddings = value
929
 
930
- def get_prompt(self, batch_size, device, dtype=torch.half):
931
- prefix_tokens = self.prefix_tokens.unsqueeze(0).expand(batch_size, -1).to(device)
932
- past_key_values = self.prefix_encoder(prefix_tokens).type(dtype)
933
- past_key_values = past_key_values.view(
934
- batch_size,
935
- self.pre_seq_len,
936
- self.pre_seq_len,
937
- self.num_layers * 2,
938
- self.multi_query_group_num,
939
- self.kv_channels
940
- )
941
- # seq_len, b, nh, hidden_size
942
- past_key_values = self.dropout(past_key_values)
943
- past_key_values = past_key_values.permute([2, 1, 0, 3, 4]).split(2)
944
- return past_key_values
945
-
946
  def forward(
947
  self,
948
- input_ids: torch.LongTensor = None,
949
- images: torch.Tensor = None,
950
  position_ids: Optional[torch.Tensor] = None,
951
  attention_mask: Optional[torch.BoolTensor] = None,
952
  full_attention_mask: Optional[torch.BoolTensor] = None,
953
  past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
954
  inputs_embeds: Optional[torch.Tensor] = None,
955
  use_cache: Optional[bool] = None,
 
956
  output_hidden_states: Optional[bool] = None,
957
  return_dict: Optional[bool] = None,
958
- ) -> Union[Tuple, BaseModelOutputWithPast]:
959
- """take care of image_encode, position_ids and (attention_mask = None is fine)"""
960
-
961
- # generate mode with past_key_values. the image features are already mapped
962
- if past_key_values is None:
963
- # not allow for inputs_embeds, because we want to process image feature
964
- assert input_ids is not None and inputs_embeds is None, f"{input_ids} {inputs_embeds}"
965
- if not is_empty(images): # multi-modality
966
- image_size: int = self.config.vision_config['image_size']
967
- patch_size: int = self.config.vision_config['patch_size']
968
- num_patches = (image_size // patch_size // 2) ** 2
969
- assert len(input_ids) == len(images), f"{len(input_ids)} {len(images)}"
970
- inputs_embeds = self.embedding(input_ids)
971
-
972
- images = images.to(dtype=inputs_embeds.dtype)
973
- images_features = self.vision(images)
974
-
975
- if position_ids is None:
976
- position_ids = self.get_position_ids(input_ids, device=inputs_embeds.device)
977
- new_input_embeds, new_position_ids = [], []
978
-
979
- for i in range(len(input_ids)):
980
- input_id = input_ids[i].tolist()
981
- boi_token_pos, eoi_token_pos = input_id.index(self.config.boi_token_id), input_id.index(
982
- self.config.eoi_token_id)
983
- assert eoi_token_pos - boi_token_pos == 2
984
- new_input_embeds.append(torch.cat(
985
- (inputs_embeds[i, :boi_token_pos], images_features[i].to(inputs_embeds.device),
986
- inputs_embeds[i, eoi_token_pos + 1:])))
987
- new_position_ids.append(torch.cat(
988
- (position_ids[i, :boi_token_pos + 1], position_ids[i, boi_token_pos + 1].repeat(num_patches),
989
- position_ids[i, eoi_token_pos:])
990
- ))
991
- inputs_embeds = torch.stack(new_input_embeds, dim=0)
992
- position_ids = torch.stack(new_position_ids, dim=0)
993
-
994
  output_hidden_states = (
995
  output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
996
  )
@@ -1002,41 +883,12 @@ class ChatGLMModel(ChatGLMPreTrainedModel):
1002
  if inputs_embeds is None:
1003
  inputs_embeds = self.embedding(input_ids)
1004
 
1005
- if self.pre_seq_len is not None:
1006
- if past_key_values is None:
1007
- past_key_values = self.get_prompt(batch_size=batch_size, device=input_ids.device,
1008
- dtype=inputs_embeds.dtype)
1009
- if attention_mask is not None:
1010
- attention_mask = torch.cat([attention_mask.new_ones((batch_size, self.pre_seq_len)),
1011
- attention_mask], dim=-1)
1012
-
1013
  if full_attention_mask is None:
1014
  if (attention_mask is not None and not attention_mask.all()) or (past_key_values and seq_length != 1):
1015
- if self.training:
1016
- # https://github.com/THUDM/GLM-4/issues/264
1017
- new_input_ids, new_attention_mask = [], []
1018
- for i in range(len(input_ids)):
1019
- input_id = input_ids[i].tolist()
1020
- boi_token_pos, eoi_token_pos = input_id.index(self.config.boi_token_id), input_id.index(self.config.eoi_token_id)
1021
- assert eoi_token_pos - boi_token_pos == 2
1022
-
1023
- new_attention_mask.append(torch.cat(
1024
- (attention_mask[i, :boi_token_pos + 1], torch.ones(num_patches).to(attention_mask.device),
1025
- attention_mask[i, eoi_token_pos:])))
1026
-
1027
- new_input_ids.append(torch.cat(
1028
- (input_ids[i, :boi_token_pos + 1], input_ids[i, -1].repeat(num_patches),
1029
- input_ids[i, eoi_token_pos:])))
1030
-
1031
- attention_mask = torch.stack(new_attention_mask, dim=0)
1032
- input_ids = torch.stack(new_input_ids, dim=0)
1033
- inputs_embeds = self.embedding(input_ids)
1034
-
1035
- full_attention_mask = self.get_masks(inputs_embeds, past_key_values, padding_mask=attention_mask)
1036
 
1037
  # Rotary positional embeddings
1038
  rotary_pos_emb = self.rotary_pos_emb(self.seq_length)
1039
-
1040
  if position_ids is not None:
1041
  rotary_pos_emb = rotary_pos_emb[position_ids]
1042
  else:
@@ -1047,6 +899,12 @@ class ChatGLMModel(ChatGLMPreTrainedModel):
1047
  inputs_embeds, full_attention_mask, rotary_pos_emb=rotary_pos_emb,
1048
  kv_caches=past_key_values, use_cache=use_cache, output_hidden_states=output_hidden_states
1049
  )
1050
 
1051
  if not return_dict:
1052
  return tuple(v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None)
@@ -1059,16 +917,6 @@ class ChatGLMModel(ChatGLMPreTrainedModel):
1059
  )
1060
 
1061
 
1062
- def _history_to_prompt(history, query):
1063
- prompt = ''
1064
- flag = False
1065
- for i, (old_query, response) in enumerate(history):
1066
- prompt += ('<|user|>' if flag else '') + old_query + "<|assistant|>" + response + "<|endoftext|>"
1067
- flag = True
1068
- prompt += '{}{}<|assistant|>'.format('<|user|>' if flag else '', query)
1069
- return prompt
1070
-
1071
-
1072
  class ChatGLMForConditionalGeneration(ChatGLMPreTrainedModel):
1073
  def __init__(self, config: ChatGLMConfig, empty_init=True, device=None):
1074
  super().__init__(config)
@@ -1109,7 +957,6 @@ class ChatGLMForConditionalGeneration(ChatGLMPreTrainedModel):
1109
  def prepare_inputs_for_generation(
1110
  self,
1111
  input_ids: torch.LongTensor,
1112
- images: Optional[torch.Tensor] = None,
1113
  past_key_values: Optional[torch.Tensor] = None,
1114
  attention_mask: Optional[torch.Tensor] = None,
1115
  position_ids: Optional[torch.Tensor] = None,
@@ -1120,34 +967,12 @@ class ChatGLMForConditionalGeneration(ChatGLMPreTrainedModel):
1120
  # only last token for input_ids if past is not None
1121
  if position_ids is None:
1122
  position_ids = self.get_position_ids(input_ids, device=input_ids.device)
1123
- if attention_mask is not None:
1124
- image_size: int = self.config.vision_config['image_size']
1125
- patch_size: int = self.config.vision_config['patch_size']
1126
- num_patches = (image_size // patch_size // 2) ** 2
1127
- new_attention_masks = []
1128
-
1129
- # if not image, use this default id
1130
- eoi_token_pos = 6
1131
- boi_token_pos = 4
1132
-
1133
- for i in range(len(input_ids)):
1134
- input_id = input_ids[i].tolist()
1135
- if not is_empty(images):
1136
- boi_token_pos, eoi_token_pos = input_id.index(self.config.boi_token_id), input_id.index(
1137
- self.config.eoi_token_id)
1138
- assert eoi_token_pos - boi_token_pos == 2
1139
- new_attention_masks.append(torch.cat(
1140
- (attention_mask[i, :boi_token_pos + 1], attention_mask.new_ones(num_patches),
1141
- attention_mask[i, eoi_token_pos:])
1142
- ))
1143
- attention_mask = torch.stack(new_attention_masks, dim=0)
1144
  if not is_first_forward:
1145
  if past_key_values is not None:
1146
  position_ids = position_ids[..., -1:]
1147
  input_ids = input_ids[:, -1:]
1148
  return {
1149
  "input_ids": input_ids,
1150
- "images": images,
1151
  "past_key_values": past_key_values,
1152
  "position_ids": position_ids,
1153
  "attention_mask": attention_mask,
@@ -1158,7 +983,6 @@ class ChatGLMForConditionalGeneration(ChatGLMPreTrainedModel):
1158
  def forward(
1159
  self,
1160
  input_ids: Optional[torch.Tensor] = None,
1161
- images: List[List[torch.Tensor]] = None,
1162
  position_ids: Optional[torch.Tensor] = None,
1163
  attention_mask: Optional[torch.Tensor] = None,
1164
  past_key_values: Optional[Tuple[torch.FloatTensor]] = None,
@@ -1175,7 +999,6 @@ class ChatGLMForConditionalGeneration(ChatGLMPreTrainedModel):
1175
 
1176
  transformer_outputs = self.transformer(
1177
  input_ids=input_ids,
1178
- images=images,
1179
  position_ids=position_ids,
1180
  attention_mask=attention_mask,
1181
  past_key_values=past_key_values,
@@ -1192,23 +1015,12 @@ class ChatGLMForConditionalGeneration(ChatGLMPreTrainedModel):
1192
 
1193
  loss = None
1194
  if labels is not None:
1195
- new_labels = []
1196
- for i in range(len(input_ids)):
1197
- input_id = input_ids[i].tolist()
1198
- boi_token_pos, eoi_token_pos = input_id.index(self.config.boi_token_id), input_id.index(
1199
- self.config.eoi_token_id)
1200
- assert eoi_token_pos - boi_token_pos == 2
1201
-
1202
- new_labels.append(torch.cat(
1203
- (
1204
- labels[i, :boi_token_pos + 1],
1205
- torch.tensor([-100]).to(labels.device).to(labels.dtype).repeat(1600),
1206
- labels[i, eoi_token_pos:])))
1207
-
1208
- labels = torch.stack(new_labels, dim=0)
1209
  lm_logits = lm_logits.to(torch.float32)
 
 
1210
  shift_logits = lm_logits[..., :-1, :].contiguous()
1211
  shift_labels = labels[..., 1:].contiguous()
 
1212
  loss_fct = CrossEntropyLoss(ignore_index=-100)
1213
  loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
1214
 
@@ -1246,6 +1058,202 @@ class ChatGLMForConditionalGeneration(ChatGLMPreTrainedModel):
1246
  for layer_past in past
1247
  )
1248
 
1249
  class ChatGLMForSequenceClassification(ChatGLMPreTrainedModel):
1250
  def __init__(self, config: ChatGLMConfig, empty_init=True, device=None):
1251
  super().__init__(config)
@@ -1253,7 +1261,7 @@ class ChatGLMForSequenceClassification(ChatGLMPreTrainedModel):
1253
  self.num_labels = config.num_labels
1254
  self.transformer = ChatGLMModel(config, empty_init=empty_init, device=device)
1255
 
1256
- self.classifier_head = nn.Linear(config.hidden_size, config.num_labels, bias=True, dtype=torch.half)
1257
  if config.classifier_dropout is not None:
1258
  self.dropout = nn.Dropout(config.classifier_dropout)
1259
  else:
@@ -1270,6 +1278,7 @@ class ChatGLMForSequenceClassification(ChatGLMPreTrainedModel):
1270
  inputs_embeds: Optional[torch.LongTensor] = None,
1271
  labels: Optional[torch.LongTensor] = None,
1272
  use_cache: Optional[bool] = None,
 
1273
  output_hidden_states: Optional[bool] = None,
1274
  return_dict: Optional[bool] = None,
1275
  ) -> Union[Tuple[torch.Tensor, ...], SequenceClassifierOutputWithPast]:
@@ -1283,12 +1292,13 @@ class ChatGLMForSequenceClassification(ChatGLMPreTrainedModel):
1283
  past_key_values=past_key_values,
1284
  inputs_embeds=inputs_embeds,
1285
  use_cache=use_cache,
 
1286
  output_hidden_states=output_hidden_states,
1287
  return_dict=return_dict,
1288
  )
1289
 
1290
  hidden_states = transformer_outputs[0]
1291
- pooled_hidden_states = hidden_states[-1]
1292
  if self.dropout is not None:
1293
  pooled_hidden_states = self.dropout(pooled_hidden_states)
1294
  logits = self.classifier_head(pooled_hidden_states)
 
1
+ """ PyTorch ChatGLM model. """
2
+ import json
3
  import math
4
+ import copy
5
+ import warnings
6
+ import re
7
  import sys
8
+
9
  import torch
10
  import torch.utils.checkpoint
11
  import torch.nn.functional as F
12
  from torch import nn
13
  from torch.nn import CrossEntropyLoss, LayerNorm, MSELoss, BCEWithLogitsLoss
14
  from torch.nn.utils import skip_init
15
+ from typing import Optional, Tuple, Union, List, Callable, Dict, Any
16
+ from copy import deepcopy
17
 
18
  from transformers.modeling_outputs import (
19
  BaseModelOutputWithPast,
 
25
  from transformers.generation.logits_process import LogitsProcessor
26
  from transformers.generation.utils import LogitsProcessorList, StoppingCriteriaList, GenerationConfig, ModelOutput
27
 
 
28
  from .configuration_chatglm import ChatGLMConfig
29
 
30
  try:
 
46
 
47
  logger = logging.get_logger(__name__)
48
 
 
 
 
49
  _CHECKPOINT_FOR_DOC = "THUDM/ChatGLM"
50
  _CONFIG_FOR_DOC = "ChatGLMConfig"
51
 
 
62
  return scores
63
 
64
 
65
  def split_tensor_along_last_dim(
66
  tensor: torch.Tensor,
67
  num_partitions: int,
 
99
  self.original_impl = original_impl
100
  self.rope_ratio = rope_ratio
101
 
102
  def forward_impl(
103
  self, seq_len: int, n_elem: int, dtype: torch.dtype, device: torch.device, base: int = 10000
104
  ):
 
126
  return cache
127
 
128
  def forward(self, max_seq_len, offset=0):
129
+ return self.forward_impl(
130
+ max_seq_len, self.dim, dtype=self.inv_freq.dtype, device=self.inv_freq.device
131
+ )
 
 
 
132
 
133
 
134
  @torch.jit.script
 
166
  return (self.weight * hidden_states).to(input_dtype)
167
 
168
 
 
169
  class CoreAttention(torch.nn.Module):
170
  def __init__(self, config: ChatGLMConfig, layer_number):
171
  super(CoreAttention, self).__init__()
172
+ self.config = config
173
  self.apply_query_key_layer_scaling = config.apply_query_key_layer_scaling
174
  self.attention_softmax_in_fp32 = config.attention_softmax_in_fp32
175
  if self.apply_query_key_layer_scaling:
176
  self.attention_softmax_in_fp32 = True
177
  self.layer_number = max(1, layer_number)
178
+ self.is_causal = True
179
 
180
  projection_size = config.kv_channels * config.num_attention_heads
181
 
 
194
  self.attention_dropout = torch.nn.Dropout(config.attention_dropout)
195
 
196
  def forward(self, query_layer, key_layer, value_layer, attention_mask):
197
+ # [b, np, sq, sk]
198
+ output_size = (query_layer.size(0), query_layer.size(1), query_layer.size(2), key_layer.size(2))
199
+
200
+ # [b, np, sq, hn] -> [b * np, sq, hn]
201
+ query_layer = query_layer.view(output_size[0] * output_size[1], output_size[2], -1)
202
+ # [b, np, sk, hn] -> [b * np, sk, hn]
203
+ key_layer = key_layer.view(output_size[0] * output_size[1], output_size[3], -1)
204
+
205
+ # preallocting input tensor: [b * np, sq, sk]
206
+ matmul_input_buffer = torch.empty(
207
+ output_size[0] * output_size[1], output_size[2], output_size[3], dtype=query_layer.dtype,
208
+ device=query_layer.device
209
+ )
210
 
211
+ # Raw attention scores. [b * np, sq, sk]
212
+ matmul_result = torch.baddbmm(
213
+ matmul_input_buffer,
214
+ query_layer, # [b * np, sq, hn]
215
+ key_layer.transpose(1, 2), # [b * np, hn, sk]
216
+ beta=0.0,
217
+ alpha=(1.0 / self.norm_factor),
218
+ )
219
 
220
+ # change view to [b, np, sq, sk]
221
+ attention_scores = matmul_result.view(*output_size)
222
+
223
+ # ===========================
224
+ # Attention probs and dropout
225
+ # ===========================
226
+
227
+ # attention scores and attention mask [b, np, sq, sk]
228
+ if self.attention_softmax_in_fp32:
229
+ attention_scores = attention_scores.float()
230
+ if self.coeff is not None:
231
+ attention_scores = attention_scores * self.coeff
232
+ if attention_mask is None and attention_scores.shape[2] == attention_scores.shape[3]:
233
+ attention_mask = torch.ones(output_size[0], 1, output_size[2], output_size[3],
234
+ device=attention_scores.device, dtype=torch.bool)
235
+ attention_mask.tril_()
236
+ attention_mask = ~attention_mask
237
+ if attention_mask is not None:
238
+ attention_scores = attention_scores.masked_fill(attention_mask, float("-inf"))
239
+ attention_probs = F.softmax(attention_scores, dim=-1)
240
+ attention_probs = attention_probs.type_as(value_layer)
241
+
242
+ # This is actually dropping out entire tokens to attend to, which might
243
+ # seem a bit unusual, but is taken from the original Transformer paper.
244
+ attention_probs = self.attention_dropout(attention_probs)
245
+
246
+ # query layer shape: [b * np, sq, hn]
247
+ # value layer shape: [b, np, sk, hn]
248
+ # attention shape: [b, np, sq, sk]
249
+ # context layer shape: [b, np, sq, hn]
250
+ output_size = (value_layer.size(0), value_layer.size(1), query_layer.size(1), value_layer.size(3))
251
+ # change view [b * np, sk, hn]
252
+ value_layer = value_layer.view(output_size[0] * output_size[1], value_layer.size(2), -1)
253
+ # change view [b * np, sq, sk]
254
+ attention_probs = attention_probs.view(output_size[0] * output_size[1], output_size[2], -1)
255
+ # matmul: [b * np, sq, hn]
256
+ context_layer = torch.bmm(attention_probs, value_layer)
257
+ # change view [b, np, sq, hn]
258
+ context_layer = context_layer.view(*output_size)
259
+ # [b, np, sq, hn] --> [b, sq, np, hn]
260
+ context_layer = context_layer.transpose(1, 2).contiguous()
261
+ # [b, sq, np, hn] --> [b, sq, hp]
262
+ new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,)
263
+ context_layer = context_layer.reshape(*new_context_layer_shape)
 
 
 
264
 
265
  return context_layer
266
 
267
+
268
  class SdpaAttention(CoreAttention):
269
  def forward(self, query_layer, key_layer, value_layer, attention_mask):
270
  if attention_mask is None and query_layer.shape[2] == key_layer.shape[2]:
 
388
  "flash_attention_2": FlashAttention2
389
  }
390
 
391
+
392
  class SelfAttention(torch.nn.Module):
393
  """Parallel self-attention layer abstract class.
394
 
 
408
 
409
  self.multi_query_attention = config.multi_query_attention
410
  self.qkv_hidden_size = 3 * self.projection_size
 
411
  if self.multi_query_attention:
412
  self.num_multi_query_groups_per_partition = config.multi_query_group_num
413
  self.qkv_hidden_size = (
 
418
  device=device, **_config_to_kwargs(config)
419
  )
420
 
421
+ self.core_attention = CORE_ATTENTION_CLASSES[config._attn_implementation](config, self.layer_number)
422
 
423
  # Output.
424
  self.dense = nn.Linear(self.projection_size, config.hidden_size, bias=config.add_bias_linear,
 
496
  key_layer = torch.cat((cache_k, key_layer), dim=2)
497
  value_layer = torch.cat((cache_v, value_layer), dim=2)
498
  if use_cache:
499
+ if kv_cache is None:
500
+ kv_cache = torch.cat((key_layer.unsqueeze(0).unsqueeze(0), value_layer.unsqueeze(0).unsqueeze(0)),
501
+ dim=1)
502
+ else:
503
+ kv_cache = (key_layer, value_layer)
504
  else:
505
  kv_cache = None
506
 
 
733
  )
734
  hidden_states, kv_cache = layer_ret
735
  if use_cache:
736
+ # token by token decoding, use tuple format
737
+ if kv_caches[0] is not None:
738
+ presents = presents + (kv_cache,)
739
+ # prefilling in decoding, use tensor format to save cuda memory
740
+ else:
741
+ if len(presents) == 0:
742
+ presents = kv_cache
743
+ else:
744
+ presents = torch.cat((presents, kv_cache.to(presents.device)), dim=0)
745
 
746
  if output_hidden_states:
747
  all_hidden_states = all_hidden_states + (hidden_states,)
 
771
  """Initialize the weights."""
772
  return
773
 
774
+ def get_masks(self, input_ids, past_key_values, padding_mask=None):
775
+ if self.config._attn_implementation == "flash_attention_2":
776
+ if padding_mask is not None and not padding_mask.all():
777
+ return padding_mask
778
+ return None
779
+ batch_size, seq_length = input_ids.shape
780
+ full_attention_mask = torch.ones(batch_size, seq_length, seq_length, device=input_ids.device)
781
  full_attention_mask.tril_()
782
  past_length = 0
783
  if past_key_values:
784
  past_length = past_key_values[0][0].shape[2]
785
  if past_length:
786
  full_attention_mask = torch.cat((torch.ones(batch_size, seq_length, past_length,
787
+ device=input_ids.device), full_attention_mask), dim=-1)
788
  if padding_mask is not None:
789
  full_attention_mask = full_attention_mask * padding_mask.unsqueeze(1)
790
  if not past_length and padding_mask is not None:
 
798
  position_ids = torch.arange(seq_length, dtype=torch.long, device=device).unsqueeze(0).repeat(batch_size, 1)
799
  return position_ids
800
 
 
 
 
801
 
802
  class Embedding(torch.nn.Module):
803
  """Language model embeddings."""
 
825
  return embeddings
826
 
827
 
828
  class ChatGLMModel(ChatGLMPreTrainedModel):
829
  def __init__(self, config: ChatGLMConfig, device=None, empty_init=True):
830
  super().__init__(config)
 
852
  self.encoder = init_method(GLMTransformer, config, **init_kwargs)
853
  self.output_layer = init_method(nn.Linear, config.hidden_size, config.padded_vocab_size, bias=False,
854
  dtype=config.torch_dtype, **init_kwargs)
855
 
856
  def get_input_embeddings(self):
857
  return self.embedding.word_embeddings
 
859
  def set_input_embeddings(self, value):
860
  self.embedding.word_embeddings = value
861
 
862
  def forward(
863
  self,
864
+ input_ids,
 
865
  position_ids: Optional[torch.Tensor] = None,
866
  attention_mask: Optional[torch.BoolTensor] = None,
867
  full_attention_mask: Optional[torch.BoolTensor] = None,
868
  past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
869
  inputs_embeds: Optional[torch.Tensor] = None,
870
  use_cache: Optional[bool] = None,
871
+ output_attentions: Optional[bool] = None,
872
  output_hidden_states: Optional[bool] = None,
873
  return_dict: Optional[bool] = None,
874
+ ):
875
  output_hidden_states = (
876
  output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
877
  )
 
883
  if inputs_embeds is None:
884
  inputs_embeds = self.embedding(input_ids)
885
 
886
  if full_attention_mask is None:
887
  if (attention_mask is not None and not attention_mask.all()) or (past_key_values and seq_length != 1):
888
+ full_attention_mask = self.get_masks(input_ids, past_key_values, padding_mask=attention_mask)
889
 
890
  # Rotary positional embeddings
891
  rotary_pos_emb = self.rotary_pos_emb(self.seq_length)
 
892
  if position_ids is not None:
893
  rotary_pos_emb = rotary_pos_emb[position_ids]
894
  else:
 
899
  inputs_embeds, full_attention_mask, rotary_pos_emb=rotary_pos_emb,
900
  kv_caches=past_key_values, use_cache=use_cache, output_hidden_states=output_hidden_states
901
  )
902
+ if presents is not None and type(presents) is torch.Tensor:
903
+ presents = presents.split(1, dim=0)
904
+ presents = list(presents)
905
+ presents = [list(x.squeeze(0).split(1, dim=0)) for x in presents]
906
+ presents = [tuple([x.squeeze(0) for x in y]) for y in presents]
907
+ presents = tuple(presents)
908
 
909
  if not return_dict:
910
  return tuple(v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None)
 
917
  )
918
 
919
 
920
  class ChatGLMForConditionalGeneration(ChatGLMPreTrainedModel):
921
  def __init__(self, config: ChatGLMConfig, empty_init=True, device=None):
922
  super().__init__(config)
 
957
  def prepare_inputs_for_generation(
958
  self,
959
  input_ids: torch.LongTensor,
 
960
  past_key_values: Optional[torch.Tensor] = None,
961
  attention_mask: Optional[torch.Tensor] = None,
962
  position_ids: Optional[torch.Tensor] = None,
 
967
  # only last token for input_ids if past is not None
968
  if position_ids is None:
969
  position_ids = self.get_position_ids(input_ids, device=input_ids.device)
970
  if not is_first_forward:
971
  if past_key_values is not None:
972
  position_ids = position_ids[..., -1:]
973
  input_ids = input_ids[:, -1:]
974
  return {
975
  "input_ids": input_ids,
 
976
  "past_key_values": past_key_values,
977
  "position_ids": position_ids,
978
  "attention_mask": attention_mask,
 
983
  def forward(
984
  self,
985
  input_ids: Optional[torch.Tensor] = None,
 
986
  position_ids: Optional[torch.Tensor] = None,
987
  attention_mask: Optional[torch.Tensor] = None,
988
  past_key_values: Optional[Tuple[torch.FloatTensor]] = None,
 
999
 
1000
  transformer_outputs = self.transformer(
1001
  input_ids=input_ids,
 
1002
  position_ids=position_ids,
1003
  attention_mask=attention_mask,
1004
  past_key_values=past_key_values,
 
1015
 
1016
  loss = None
1017
  if labels is not None:
1018
  lm_logits = lm_logits.to(torch.float32)
1019
+
1020
+ # Shift so that tokens < n predict n
1021
  shift_logits = lm_logits[..., :-1, :].contiguous()
1022
  shift_labels = labels[..., 1:].contiguous()
1023
+ # Flatten the tokens
1024
  loss_fct = CrossEntropyLoss(ignore_index=-100)
1025
  loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
1026
 
 
1058
  for layer_past in past
1059
  )
1060
 
1061
+ def process_response(self, output, history):
1062
+ content = ""
1063
+ history = deepcopy(history)
1064
+ for response in output.split("<|assistant|>"):
1065
+ if "\n" in response:
1066
+ metadata, content = response.split("\n", maxsplit=1)
1067
+ else:
1068
+ metadata, content = "", response
1069
+ if not metadata.strip():
1070
+ content = content.strip()
1071
+ history.append({"role": "assistant", "metadata": metadata, "content": content})
1072
+ content = content.replace("[[训练时间]]", "2023年")
1073
+ else:
1074
+ history.append({"role": "assistant", "metadata": metadata, "content": content})
1075
+ if history[0]["role"] == "system" and "tools" in history[0]:
1076
+ parameters = json.loads(content)
1077
+ content = {"name": metadata.strip(), "parameters": parameters}
1078
+ else:
1079
+ content = {"name": metadata.strip(), "content": content}
1080
+ return content, history
1081
+
1082
+ @torch.inference_mode()
1083
+ def chat(self, tokenizer, query: str, history: List[Dict] = None, role: str = "user",
1084
+ max_length: int = 8192, num_beams=1, do_sample=True, top_p=0.8, temperature=0.8, logits_processor=None,
1085
+ **kwargs):
1086
+ if history is None:
1087
+ history = []
1088
+ if logits_processor is None:
1089
+ logits_processor = LogitsProcessorList()
1090
+ logits_processor.append(InvalidScoreLogitsProcessor())
1091
+ gen_kwargs = {"max_length": max_length, "num_beams": num_beams, "do_sample": do_sample, "top_p": top_p,
1092
+ "temperature": temperature, "logits_processor": logits_processor, **kwargs}
1093
+ history.append({"role": role, "content": query})
1094
+ inputs = tokenizer.apply_chat_template(history, add_generation_prompt=True, tokenize=True,
1095
+ return_tensors="pt", return_dict=True)
1096
+ inputs = inputs.to(self.device)
1097
+ eos_token_id = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|user|>"),
1098
+ tokenizer.convert_tokens_to_ids("<|observation|>")]
1099
+ outputs = self.generate(**inputs, **gen_kwargs, eos_token_id=eos_token_id)
1100
+ outputs = outputs.tolist()[0][len(inputs["input_ids"][0]):-1]
1101
+ response = tokenizer.decode(outputs)
1102
+ response, history = self.process_response(response, history)
1103
+ return response, history
1104
+
1105
+ @torch.inference_mode()
1106
+ def stream_chat(self, tokenizer, query: str, history: List[Dict] = None, role: str = "user",
1107
+ past_key_values=None, max_length: int = 8192, do_sample=True, top_p=0.8, temperature=0.8,
1108
+ logits_processor=None, return_past_key_values=False, **kwargs):
1109
+ if history is None:
1110
+ history = []
1111
+ if logits_processor is None:
1112
+ logits_processor = LogitsProcessorList()
1113
+ logits_processor.append(InvalidScoreLogitsProcessor())
1114
+ eos_token_id = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|user|>"),
1115
+ tokenizer.convert_tokens_to_ids("<|observation|>")]
1116
+ gen_kwargs = {"max_length": max_length, "do_sample": do_sample, "top_p": top_p,
1117
+ "temperature": temperature, "logits_processor": logits_processor, **kwargs}
1118
+ if past_key_values is None:
1119
+ inputs = tokenizer.apply_chat_template(history + [{"role": role, "content": query}],
1120
+ add_generation_prompt=True, tokenize=True, return_tensors="pt",
1121
+ return_dict=True)
1122
+ else:
1123
+ inputs = tokenizer.apply_chat_template([{"role": role, "content": query}], add_special_tokens=False,
1124
+ add_generation_prompt=True, tokenize=True, return_tensors="pt",
1125
+ return_dict=True)
1126
+ inputs = inputs.to(self.device)
1127
+ if past_key_values is not None:
1128
+ past_length = past_key_values[0][0].shape[2]
1129
+ inputs.position_ids += past_length
1130
+ attention_mask = inputs.attention_mask
1131
+ attention_mask = torch.cat((attention_mask.new_ones(1, past_length), attention_mask), dim=1)
1132
+ inputs['attention_mask'] = attention_mask
1133
+ history.append({"role": role, "content": query})
1134
+ for outputs in self.stream_generate(**inputs, past_key_values=past_key_values,
1135
+ eos_token_id=eos_token_id, return_past_key_values=return_past_key_values,
1136
+ **gen_kwargs):
1137
+ if return_past_key_values:
1138
+ outputs, past_key_values = outputs
1139
+ outputs = outputs.tolist()[0][len(inputs["input_ids"][0]):-1]
1140
+ response = tokenizer.decode(outputs)
1141
+ if response and response[-1] != "�":
1142
+ response, new_history = self.process_response(response, history)
1143
+ if return_past_key_values:
1144
+ yield response, new_history, past_key_values
1145
+ else:
1146
+ yield response, new_history
1147
+
1148
+ @torch.inference_mode()
1149
+ def stream_generate(
1150
+ self,
1151
+ input_ids,
1152
+ generation_config: Optional[GenerationConfig] = None,
1153
+ logits_processor: Optional[LogitsProcessorList] = None,
1154
+ stopping_criteria: Optional[StoppingCriteriaList] = None,
1155
+ prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor], List[int]]] = None,
1156
+ return_past_key_values=False,
1157
+ **kwargs,
1158
+ ):
1159
+ batch_size, input_ids_seq_length = input_ids.shape[0], input_ids.shape[-1]
1160
+
1161
+ if generation_config is None:
1162
+ generation_config = self.generation_config
1163
+ generation_config = copy.deepcopy(generation_config)
1164
+ model_kwargs = generation_config.update(**kwargs)
1165
+ model_kwargs["use_cache"] = generation_config.use_cache
1166
+ bos_token_id, eos_token_id = generation_config.bos_token_id, generation_config.eos_token_id
1167
+
1168
+ if isinstance(eos_token_id, int):
1169
+ eos_token_id = [eos_token_id]
1170
+ eos_token_id_tensor = torch.tensor(eos_token_id).to(input_ids.device) if eos_token_id is not None else None
1171
+
1172
+ has_default_max_length = kwargs.get("max_length") is None and generation_config.max_length is not None
1173
+ if has_default_max_length and generation_config.max_new_tokens is None:
1174
+ warnings.warn(
1175
+ f"Using `max_length`'s default ({generation_config.max_length}) to control the generation length. "
1176
+ "This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we"
1177
+ " recommend using `max_new_tokens` to control the maximum length of the generation.",
1178
+ UserWarning,
1179
+ )
1180
+ elif generation_config.max_new_tokens is not None:
1181
+ generation_config.max_length = generation_config.max_new_tokens + input_ids_seq_length
1182
+ if not has_default_max_length:
1183
+ logger.warn(
1184
+ f"Both `max_new_tokens` (={generation_config.max_new_tokens}) and `max_length`(="
1185
+ f"{generation_config.max_length}) seem to have been set. `max_new_tokens` will take precedence. "
1186
+ "Please refer to the documentation for more information. "
1187
+ "(https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)",
1188
+ UserWarning,
1189
+ )
1190
+
1191
+ if input_ids_seq_length >= generation_config.max_length:
1192
+ input_ids_string = "decoder_input_ids" if self.config.is_encoder_decoder else "input_ids"
1193
+ logger.warning(
1194
+ f"Input length of {input_ids_string} is {input_ids_seq_length}, but `max_length` is set to"
1195
+ f" {generation_config.max_length}. This can lead to unexpected behavior. You should consider"
1196
+ " increasing `max_new_tokens`."
1197
+ )
1198
+
1199
+ # 2. Set generation parameters if not already defined
1200
+ logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()
1201
+ stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList()
1202
+
1203
+ logits_processor = self._get_logits_processor(
1204
+ generation_config=generation_config,
1205
+ input_ids_seq_length=input_ids_seq_length,
1206
+ encoder_input_ids=input_ids,
1207
+ prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
1208
+ logits_processor=logits_processor,
1209
+ )
1210
+
1211
+ stopping_criteria = self._get_stopping_criteria(
1212
+ generation_config=generation_config, stopping_criteria=stopping_criteria
1213
+ )
1214
+ logits_warper = self._get_logits_warper(generation_config)
1215
+
1216
+ unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1)
1217
+ scores = None
1218
+ while True:
1219
+ model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
1220
+ # forward pass to get next token
1221
+ outputs = self(
1222
+ **model_inputs,
1223
+ return_dict=True,
1224
+ output_attentions=False,
1225
+ output_hidden_states=False,
1226
+ )
1227
+
1228
+ next_token_logits = outputs.logits[:, -1, :]
1229
+
1230
+ # pre-process distribution
1231
+ next_token_scores = logits_processor(input_ids, next_token_logits)
1232
+ next_token_scores = logits_warper(input_ids, next_token_scores)
1233
+
1234
+ # sample
1235
+ probs = nn.functional.softmax(next_token_scores, dim=-1)
1236
+ if generation_config.do_sample:
1237
+ next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
1238
+ else:
1239
+ next_tokens = torch.argmax(probs, dim=-1)
1240
+ # update generated ids, model inputs, and length for next step
1241
+ input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
1242
+ model_kwargs = self._update_model_kwargs_for_generation(
1243
+ outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder
1244
+ )
1245
+ unfinished_sequences = unfinished_sequences.mul(
1246
+ next_tokens.tile(eos_token_id_tensor.shape[0], 1).ne(eos_token_id_tensor.unsqueeze(1)).prod(dim=0)
1247
+ )
1248
+ if return_past_key_values:
1249
+ yield input_ids, outputs.past_key_values
1250
+ else:
1251
+ yield input_ids
1252
+ # stop when each sentence is finished, or if we exceed the maximum length
1253
+ if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):
1254
+ break
1255
+
1256
+
1257
  class ChatGLMForSequenceClassification(ChatGLMPreTrainedModel):
1258
  def __init__(self, config: ChatGLMConfig, empty_init=True, device=None):
1259
  super().__init__(config)
 
1261
  self.num_labels = config.num_labels
1262
  self.transformer = ChatGLMModel(config, empty_init=empty_init, device=device)
1263
 
1264
+ self.classifier_head = nn.Linear(config.hidden_size, config.num_labels, bias=True, dtype=config.torch_dtype)
1265
  if config.classifier_dropout is not None:
1266
  self.dropout = nn.Dropout(config.classifier_dropout)
1267
  else:
 
1278
  inputs_embeds: Optional[torch.LongTensor] = None,
1279
  labels: Optional[torch.LongTensor] = None,
1280
  use_cache: Optional[bool] = None,
1281
+ output_attentions: Optional[bool] = None,
1282
  output_hidden_states: Optional[bool] = None,
1283
  return_dict: Optional[bool] = None,
1284
  ) -> Union[Tuple[torch.Tensor, ...], SequenceClassifierOutputWithPast]:
 
1292
  past_key_values=past_key_values,
1293
  inputs_embeds=inputs_embeds,
1294
  use_cache=use_cache,
1295
+ output_attentions=output_attentions,
1296
  output_hidden_states=output_hidden_states,
1297
  return_dict=return_dict,
1298
  )
1299
 
1300
  hidden_states = transformer_outputs[0]
1301
+ pooled_hidden_states = hidden_states[:, -1]
1302
  if self.dropout is not None:
1303
  pooled_hidden_states = self.dropout(pooled_hidden_states)
1304
  logits = self.classifier_head(pooled_hidden_states)
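
The hunks above strip the vision (EVA2CLIP / image-token) and prefix-tuning code paths from the modeling file and add `chat`, `stream_chat`, and `stream_generate` convenience methods on `ChatGLMForConditionalGeneration`. A minimal usage sketch for the new `chat` helper, assuming the repository's `auto_map` exposes the class through `AutoModelForCausalLM` and that you load the text-only branch mentioned in the model card (both assumptions, not part of this diff):

```python
# Hedged sketch, not part of the commit: drive the `chat` method added above
# via trust_remote_code. Adjust repo id / revision / dtype to your setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "CausalLM/miniG"  # repo id from the model card; "text-only" branch assumed
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, revision="text-only", torch_dtype="auto", device_map="auto",
    trust_remote_code=True,
)

# Argument names follow the new `chat` signature: (tokenizer, query, history, role, ...)
response, history = model.chat(
    tokenizer, "Summarize retrieval-augmented generation in one sentence.",
    history=[], role="user", max_length=8192, top_p=0.8, temperature=0.8,
)
print(response)
```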
tokenization_chatglm.py CHANGED
@@ -3,10 +3,8 @@ import base64
3
  import os
4
  import json
5
  import tiktoken
6
- import torch
7
  from torch import TensorType
8
  from typing import List, Optional, Union, Dict, Any
9
- from torchvision import transforms
10
  from transformers import PreTrainedTokenizer
11
  from transformers.utils import logging, PaddingStrategy
12
  from transformers.tokenization_utils_base import EncodedInput, BatchEncoding
@@ -22,7 +20,6 @@ class ChatGLM4Tokenizer(PreTrainedTokenizer):
22
  padding_side="left",
23
  clean_up_tokenization_spaces=False,
24
  encode_special_tokens=False,
25
- image_size=None,
26
  **kwargs
27
  ):
28
  self.name = "GLM4Tokenizer"
@@ -30,7 +27,6 @@ class ChatGLM4Tokenizer(PreTrainedTokenizer):
30
  pat_str = "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
31
  self.pat_str = re.compile(pat_str)
32
  self.encode_special_tokens = encode_special_tokens
33
- self.image_size = image_size
34
 
35
  mergeable_ranks = {}
36
  with open(vocab_file) as f:
@@ -134,143 +130,109 @@ class ChatGLM4Tokenizer(PreTrainedTokenizer):
134
  prefix_tokens = [self.convert_tokens_to_ids("[gMASK]"), self.convert_tokens_to_ids("<sop>")]
135
  return prefix_tokens
136
 
137
- def build_single_message(self, role, metadata, message, tokenize=True, message_prefix=None):
138
  assert role in ["system", "user", "assistant", "observation"], role
139
  if tokenize:
140
  role_tokens = [self.convert_tokens_to_ids(f"<|{role}|>")] + self.tokenizer.encode(f"{metadata}\n",
141
  disallowed_special=())
142
  message_tokens = self.tokenizer.encode(message, disallowed_special=())
143
- if message_prefix is not None:
144
- message_tokens = message_prefix + message_tokens
145
  tokens = role_tokens + message_tokens
146
  return tokens
147
  else:
148
  return str(f"<|{role}|>{metadata}\n{message}")
149
 
150
- def apply_chat_template(
151
- self,
152
- conversation: Union[List[Dict[str, str]], List[List[Dict[str, str]]], "Conversation"],
153
- add_generation_prompt: bool = False,
154
- tokenize: bool = True,
155
- padding: bool = False,
156
- truncation: bool = False,
157
- max_length: Optional[int] = None,
158
- return_tensors: Optional[Union[str, TensorType]] = None,
159
- return_dict: bool = False,
160
- tokenizer_kwargs: Optional[Dict[str, Any]] = None,
161
- add_special_tokens: bool = True,
162
- **kwargs,
163
- ) -> Union[str, List[int], List[str], List[List[int]], BatchEncoding]:
164
-
165
- if return_dict and not tokenize:
166
- raise ValueError(
167
- "`return_dict=True` is incompatible with `tokenize=False`, because there is no dict "
168
- "of tokenizer outputs to return."
169
- )
170
-
171
- def handle_single_conversation(conversation):
172
- input_ids = self.get_prefix_tokens() if add_special_tokens else []
173
- input_message = "[gMASK]<sop>" if add_special_tokens else ""
174
- input_image = None
175
- transform = transforms.Compose(
176
- [
177
- transforms.Resize(
178
- (self.image_size, self.image_size), interpolation=transforms.InterpolationMode.BICUBIC
179
- ),
180
- transforms.ToTensor(),
181
- transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
182
- ]
183
- )
184
- for item in conversation:
185
- if item.get("tools"):
186
- tools = item["tools"]
187
- content = "你是一个名为 GLM-4 的人工智能助手。你是基于智谱AI训练的语言模型 GLM-4 模型开发的,你的任务是针对用户的问题和要求提供适当的答复和支持。"
188
- for tool in tools:
189
- if tool["type"] == "function":
190
- function = tool["function"]
191
- content += f"\n\n## {function['name']}\n\n{json.dumps(function, ensure_ascii=False, indent=4)}"
192
- content += "\n在调用上述函数时,请使用 Json 格式表示调用的参数。"
193
- elif tool["type"] == "python":
194
- content += "\n\n## python\n\n当你向 `python` 发送包含 Python 代码的消息时,该代码将会在一个有状态的 Jupyter notebook 环境中执行。\n`python` 返回代码执行的输出,或在执行 60 秒后返回超时。\n`/mnt/data` 将会持久化存储你的文件。在此会话中,`python` 无法访问互联网。不要使用 `python` 进行任何网络请求或者在线 API 调用,这些在线内容的访问将不会成功。"
195
- elif tool["type"] == "simple_browser":
196
- content += "\n\n## simple_browser\n\n你可以使用 `simple_browser` 工具。该工具支持以下函数:\n`search(query: str, recency_days: int)`:使用搜索引擎进行查询并显示结果,可以使用 `recency_days` 参数控制搜索内容的时效性。\n`mclick(ids: list[int])`:获取一系列指定 id 的页面内容。每次调用时,须选择3-10个页面。选择多个角度的页面,同时尽可能选择可信任的信息来源。考虑到部分页面是无法加载的,你也可以多打开一些可能有用的页面而不用担心内容过多。\n`open_url(url: str)`:打开指定的 URL。\n\n使用 `【{引用 id}†{引用文本}】` 来引用内容。\n\n操作步骤:1. 使用 `search` 来获得信息列表; 2. 使用 `mclick` 来获取指定 ID 页面的内容; 3. 根据获得的内容进行回复。在回复中应当引用信息来源。\n 如果用户提供了 URL,也可以用 `open_url` 直接打开页面。\n如果初次搜索结果没有找到合适的信息,也可以再次使用 `search` 进行搜索。"
197
- elif tool["type"] == "cogview":
198
- content += "\n\n## cogview\n\n如果用户的请求中包含了对图像的描述,你可以使用 `cogview` 来生成图像并展示给用户。你需要向 `cogview` 发送图像描述,规则:\n- 发送给 `cogview` 的消息必须使用英语。用户的中文描述必须完全翻译为英语。\n- 应当尽可能详细地描述图像生成的需求,需求描述约 100 英文单词。\n- 保持用户原始描述的意图。不要虚构内容或者没见过的人物。\n- 如无特殊说明,所在地为中国,持有中国立场并遵循中国社会主义价值观。"
199
- else:
200
- raise NotImplementedError(f"Unknown tool type {tool['type']}")
201
- input = self.build_single_message("system", "", content, tokenize=tokenize)
202
- if tokenize:
203
- input_ids.extend(input)
204
- else:
205
- input_message += input
206
- message = ""
207
- message_prefix = None
208
- if item.get("image"):
209
- assert input_image is None, "Multiple images are not supported"
210
- input_image = transform(item["image"])
211
- message_prefix = self.convert_tokens_to_ids(
212
- ["<|begin_of_image|>", "<|endoftext|>", "<|end_of_image|>"])
213
- if item.get("content"):
214
- message += item["content"]
215
- if message or message_prefix:
216
- input = self.build_single_message(
217
- item["role"],
218
- item.get("metadata", ""),
219
- message,
220
- tokenize=tokenize,
221
- message_prefix=message_prefix
222
- )
223
- if tokenize:
224
- input_ids.extend(input)
225
- else:
226
- input_message += input
227
- if add_generation_prompt:
228
- if tokenize:
229
- input_ids.extend([self.convert_tokens_to_ids("<|assistant|>")])
230
- else:
231
- input_message += "<|assistant|>"
232
- return {"input": input_ids if tokenize else input_message, "image": input_image}
233
-
234
- # Main logic to handle different conversation formats
235
- if isinstance(conversation, list) and all(isinstance(i, dict) for i in conversation):
236
- result = handle_single_conversation(conversation)
237
- input_ids = result["input"]
238
- input_images = [result["image"]]
239
- elif isinstance(conversation, list) and all(isinstance(i, list) for i in conversation):
240
- results = [handle_single_conversation(c) for c in conversation]
241
- input_ids = [item["input"] for item in results]
242
- input_images = [item["image"] for item in results]
243
- elif hasattr(conversation, "messages"):
244
- result = handle_single_conversation(conversation.messages)
245
- input_ids = result["input"]
246
- input_images = [result["image"]]
247
- else:
248
- raise ValueError("Invalid conversation format")
249
-
250
- if tokenize:
251
- output = self.batch_encode_plus(
252
- [input_ids] if isinstance(input_ids[0], int) else input_ids,
253
- padding=padding,
254
- truncation=truncation,
255
- max_length=max_length,
256
- return_tensors=return_tensors,
257
- is_split_into_words=True,
258
- add_special_tokens=False
259
- )
260
- if return_dict:
261
- found_image = False
262
- for image in input_images:
263
- if image is not None:
264
- found_image = True
265
- break
266
- if found_image:
267
- output["images"] = torch.stack(input_images)
268
- return output
269
- else:
270
- return output["input_ids"]
271
- else:
272
- return input_ids
273
-
274
 
275
  def build_inputs_with_special_tokens(
276
  self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
 
3
  import os
4
  import json
5
  import tiktoken
 
6
  from torch import TensorType
7
  from typing import List, Optional, Union, Dict, Any
 
8
  from transformers import PreTrainedTokenizer
9
  from transformers.utils import logging, PaddingStrategy
10
  from transformers.tokenization_utils_base import EncodedInput, BatchEncoding
 
20
  padding_side="left",
21
  clean_up_tokenization_spaces=False,
22
  encode_special_tokens=False,
 
23
  **kwargs
24
  ):
25
  self.name = "GLM4Tokenizer"
 
27
  pat_str = "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
28
  self.pat_str = re.compile(pat_str)
29
  self.encode_special_tokens = encode_special_tokens
 
30
 
31
  mergeable_ranks = {}
32
  with open(vocab_file) as f:
 
130
  prefix_tokens = [self.convert_tokens_to_ids("[gMASK]"), self.convert_tokens_to_ids("<sop>")]
131
  return prefix_tokens
132
 
133
+ def build_single_message(self, role, metadata, message, tokenize=True):
134
  assert role in ["system", "user", "assistant", "observation"], role
135
  if tokenize:
136
  role_tokens = [self.convert_tokens_to_ids(f"<|{role}|>")] + self.tokenizer.encode(f"{metadata}\n",
137
  disallowed_special=())
138
  message_tokens = self.tokenizer.encode(message, disallowed_special=())
 
 
139
  tokens = role_tokens + message_tokens
140
  return tokens
141
  else:
142
  return str(f"<|{role}|>{metadata}\n{message}")
143
 
144
+ # Use Jinja Template in tokenizer_config.json
145
+ # def apply_chat_template(
146
+ # self,
147
+ # conversation: Union[List[Dict[str, str]], List[List[Dict[str, str]]], "Conversation"],
148
+ # add_generation_prompt: bool = False,
149
+ # tokenize: bool = True,
150
+ # padding: bool = False,
151
+ # truncation: bool = False,
152
+ # max_length: Optional[int] = None,
153
+ # return_tensors: Optional[Union[str, TensorType]] = None,
154
+ # return_dict: bool = False,
155
+ # tokenizer_kwargs: Optional[Dict[str, Any]] = None,
156
+ # add_special_tokens: bool = True,
157
+ # **kwargs,
158
+ # ) -> Union[str, List[int], List[str], List[List[int]], BatchEncoding]:
159
+ #
160
+ # if return_dict and not tokenize:
161
+ # raise ValueError(
162
+ # "`return_dict=True` is incompatible with `tokenize=False`, because there is no dict "
163
+ # "of tokenizer outputs to return."
164
+ # )
165
+ #
166
+ # def handle_single_conversation(conversation):
167
+ # input_ids = self.get_prefix_tokens() if add_special_tokens else []
168
+ # input_message = "[gMASK]<sop>" if add_special_tokens else ""
169
+ # for item in conversation:
170
+ # if item.get("tools"):
171
+ # tools = item["tools"]
172
+ # content = "你是一个名为 GhatGLM 的人工智能助手。你是基于智谱AI训练的语言模型 GLM-4 模型开发的,你的任务是针对用户的问题和要求提供适当的答复和支持。"
173
+ # content += "\n\n# 可用工具"
174
+ # for tool in tools:
175
+ # if tool["type"] == "function":
176
+ # function = tool["function"]
177
+ # content += f"\n\n## {function['name']}\n\n{json.dumps(function, ensure_ascii=False, indent=4)}"
178
+ # content += "\n在调用上述函数时,请使用 Json 格式表示调用的参数。"
179
+ # elif tool["type"] == "python":
180
+ # content += "\n\n## python\n\n当你向 `python` 发送包含 Python 代码的消息时,该代码将会在一个有状态的 Jupyter notebook 环境中执行。\n`python` 返回代码执行的输出,或在执行 60 秒后返回超时。\n`/mnt/data` 将会持久化存储你的文件。在此会话中,`python` 无法访问互联网。不要使用 `python` 进行任何网络请求或者在线 API 调用,这些在线内容的访问将不会成功。"
181
+ # elif tool["type"] == "simple_browser":
182
+ # content += "\n\n## simple_browser\n\n你可以使用 `simple_browser` 工具。该工具支持以下函数:\n`search(query: str, recency_days: int)`:使用搜索引擎进行查询并显示结果,可以使用 `recency_days` 参数控制搜索内容的时效性。\n`mclick(ids: list[int])`:获取一系列指定 id 的页面内容。每次调用时,须选择3-10个页面。选择多个角度的页面,同时尽可能选择可信任的信息来源。考虑到部分页面是无法加载的,你也可以多打开一些可能有用的页面而不用担心内容过多。\n`open_url(url: str)`:打开指定的 URL。\n\n使用 `【{引用 id}†{引用文本}】` 来引用内容。\n\n操作步骤:1. 使用 `search` 来获得信息列表; 2. 使用 `mclick` 来获取指定 ID 页面的内容; 3. 根据获得的内容进行回复。在回复中应当引用信息来源。\n 如果用户提供了 URL,也可以用 `open_url` 直接打开页面。\n如果初次搜索结果没有找到合适的信息,也可以再次使用 `search` 进行搜索。"
183
+ # elif tool["type"] == "cogview":
184
+ # content += "\n\n## cogview\n\n如果用户的请求中包含了对图像的描述,你可以使用 `cogview` 来生成图像并展示给用户。你需要向 `cogview` 发送图像描述,规则:\n- 发送给 `cogview` 的消息必须使用英语。用户的中文描述必须完全翻译为英语。\n- 应当尽可能详细地描述图像生成的需求,需求描述约 100 英文单词。\n- 保持用户原始描述的意图。不要虚构内容或者没见过的人物。\n- 如无特殊说明,所在地为中国,持有中国立场并遵循中国社会主义价值观。"
+ # else:
+ # raise NotImplementedError(f"Unknown tool type {tool['type']}")
+ # input = self.build_single_message("system", "", content, tokenize=tokenize)
+ # if tokenize:
+ # input_ids.extend(input)
+ # else:
+ # input_message += input
+ # if item["content"]:
+ # input = self.build_single_message(
+ # item["role"],
+ # item.get("metadata", ""),
+ # item["content"],
+ # tokenize=tokenize
+ # )
+ # if tokenize:
+ # input_ids.extend(input)
+ # else:
+ # input_message += input
+ # if add_generation_prompt:
+ # if tokenize:
+ # input_ids.extend([self.convert_tokens_to_ids("<|assistant|>")])
+ # else:
+ # input_message += "<|assistant|>"
+ # return input_ids if tokenize else input_message
+ #
+ # # Main logic to handle different conversation formats
+ # if isinstance(conversation, list) and all(isinstance(i, dict) for i in conversation):
+ # result = handle_single_conversation(conversation)
+ # elif isinstance(conversation, list) and all(isinstance(i, list) for i in conversation):
+ # result = [handle_single_conversation(c) for c in conversation]
+ # elif hasattr(conversation, "messages"):
+ # result = handle_single_conversation(conversation.messages)
+ # else:
+ # raise ValueError("Invalid conversation format")
+ #
+ # if tokenize:
+ # output = self.batch_encode_plus(
+ # [result] if isinstance(result[0], int) else result,
+ # padding=padding,
+ # truncation=truncation,
+ # max_length=max_length,
+ # return_tensors=return_tensors,
+ # is_split_into_words=True,
+ # add_special_tokens=False
+ # )
+ # if return_dict:
+ # return output
+ # else:
+ # return output["input_ids"]
+ # else:
+ # return result

  def build_inputs_with_special_tokens(
          self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
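
As a quick orientation to the prompt format built by `build_single_message` above: in string form each turn is the role special token, any metadata, a newline, and the message body. A minimal sketch with an assumed example message (not taken from the repository):

```python
# Mirrors the `tokenize=False` branch of build_single_message;
# the role/metadata/message values below are illustrative assumptions.
role, metadata, message = "user", "", "Hello!"
rendered = f"<|{role}|>{metadata}\n{message}"
assert rendered == "<|user|>\nHello!"
```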
tokenizer_config.json CHANGED
@@ -123,12 +123,12 @@
  "<|user|>", "<|assistant|>", "<|observation|>", "<|begin_of_image|>", "<|end_of_image|>",
  "<|begin_of_video|>", "<|end_of_video|>"],
  "clean_up_tokenization_spaces": false,
+ "chat_template": "[gMASK]<sop>{% for item in messages %}{% if item['tools'] is defined %}<|system|>\n你是一个名为 GLM-4 的人工智能助手。你是基于智谱AI训练的语言模型 GLM-4 模型开发的,你的任务是针对用户的问题和要求提供适当的答复和支持。\n\n# 可用工具{% set tools = item['tools'] %}{% for tool in tools %}{% if tool['type'] == 'function' %}\n\n## {{ tool['function']['name'] }}\n\n{{ tool['function'] | tojson(indent=4) }}\n在调用上述函数时,请使用 Json 格式表示调用的参数。{% elif tool['type'] == 'python' %}\n\n## python\n\n当你向 `python` 发送包含 Python 代码的消息时,该代码将会在一个有状态的 Jupyter notebook 环境中执行。\n`python` 返回代码执行的输出,或在执行 60 秒后返回超时。\n`/mnt/data` 将会持久化存储你的文件。在此会话中,`python` 无法访问互联网。不要使用 `python` 进行任何网络请求或者在线 API 调用,这些在线内容的访问将不会成功。{% elif tool['type'] == 'simple_browser' %}\n\n## simple_browser\n\n你可以使用 `simple_browser` 工具。该工具支持以下函数:\n`search(query: str, recency_days: int)`:使用搜索引擎进行查询并显示结果,可以使用 `recency_days` 参数控制搜索内容的时效性。\n`mclick(ids: list[int])`:获取一系列指定 id 的页面内容。每次调用时,须选择3-10个页面。选择多个角度的页面,同时尽可能选择可信任的信息来源。考虑到部分页面是无法加载的,你也可以多打开一些可能有用的页面而不用担心内容过多。\n`open_url(url: str)`:打开指定的 URL。\n\n使用 `【{引用 id}†{引用文本}】` 来引用内容。\n\n操作步骤:1. 使用 `search` 来获得信息列表; 2. 使用 `mclick` 来获取指定 ID 页面的内容; 3. 根据获得的内容进行回复。在回复中应当引用信息来源。\n 如果用户提供了 URL,也可以用 `open_url` 直接打开页面。\n如果初次搜索结果没有找到合适的信息,也可以再次使用 `search` 进行搜索。{% elif tool['type'] == 'cogview' %}\n\n## cogview\n\n如果用户的请求中包含了对图像的描述,你可以使用 `cogview` 来生成图像并展示给用户。你需要向 `cogview` 发送图像描述,规则:\n- 发送给 `cogview` 的消息必须使用英语。用户的中文描述必须完全翻译为英语。\n- 应当尽可能详细地描述图像生成的需求,需求描述约 100 英文单词。\n- 保持用户原始描述的意图。不要虚构内容或者没见过的人物。\n- 如无特殊说明,所在地为中国,持有中国立场并遵循中国社会主义价值观。{% endif %}{% endfor %}{% endif %}{% if item['content'] %}<|{{ item['role'] }}|>{{ item['metadata'] }}\n{{ item['content'] }}{% endif %}{% endfor %}{% if add_generation_prompt %}<|assistant|>{% endif %}",
  "do_lower_case": false,
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>",
- "model_max_length": 8192,
+ "model_max_length": 1024000,
  "padding_side": "left",
  "remove_space": false,
- "tokenizer_class": "ChatGLM4Tokenizer",
- "image_size": 1120
+ "tokenizer_class": "ChatGLM4Tokenizer"
  }
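
The Jinja `chat_template` added above takes over the role of the commented-out Python `apply_chat_template` in the tokenizer code. A minimal sketch of how it is typically exercised through Hugging Face Transformers; the repository id, example message, and the rendered string in the comment are illustrative assumptions:

```python
from transformers import AutoTokenizer

# Assumes the tokenizer is loaded from this repository; trust_remote_code is needed
# for the custom ChatGLM4Tokenizer class.
tokenizer = AutoTokenizer.from_pretrained("CausalLM/miniG", trust_remote_code=True)

messages = [
    {"role": "user", "metadata": "", "content": "Hello!"}  # "metadata" is referenced by the template
]

# Renders roughly "[gMASK]<sop><|user|>\nHello!<|assistant|>" and tokenizes it.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
)
```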
visual.py DELETED
@@ -1,180 +0,0 @@
- import torch
- from torch import nn
- from argparse import Namespace
- import torch.nn.functional as F
- from transformers.activations import ACT2FN
- import math
- from torch.nn import LayerNorm
-
-
- def standard_attention(query_layer, key_layer, value_layer, scaling_attention_score=True):
-     if scaling_attention_score:
-         query_layer = query_layer / math.sqrt(query_layer.shape[-1])
-     attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
-
-     attention_probs = F.softmax(attention_scores, dim=-1)
-
-     context_layer = torch.matmul(attention_probs, value_layer)
-     return context_layer
-
-
- def attention_fn_default(query_layer, key_layer, value_layer, scaling_attention_score=True):
-     if int(torch.__version__.split('.')[0]) >= 2 and scaling_attention_score:
-         # Pytorch 2.0 attention uses very much memory if attention_mask is float, and has NaN bug if attention_mask is None.
-         attn_output = torch.nn.functional.scaled_dot_product_attention(
-             query_layer, key_layer, value_layer,
-             attn_mask=None,
-             dropout_p=0.,
-             is_causal=False
-         )
-         return attn_output
-     else:
-         return standard_attention(
-             query_layer, key_layer, value_layer, scaling_attention_score=scaling_attention_score
-         )
-
-
- class PatchEmbedding(nn.Module):
-     def __init__(self, config):
-         super().__init__()
-         self.proj = nn.Conv2d(config.in_channels, config.hidden_size, kernel_size=config.patch_size,
-                               stride=config.patch_size)
-         self.cls_embedding = nn.Parameter(torch.zeros(1, config.hidden_size))
-         self.position_embedding = nn.Embedding(config.num_positions, config.hidden_size)
-
-     def forward(self, images: "tensor(B, C, H, W)") -> "tensor(B, L, D)":
-         x = self.proj(images)
-         x = x.flatten(2).transpose(1, 2)
-         cls_token = self.cls_embedding.expand(x.shape[0], -1, -1)
-         x = torch.cat((cls_token, x), dim=1)
-         x += self.position_embedding.weight.unsqueeze(0)
-         return x
-
-
- class Attention(nn.Module):
-     def __init__(self, config):
-         super().__init__()
-         self.num_heads = config.num_heads
-         head_dim = config.hidden_size // config.num_heads
-         self.scale = head_dim ** -0.5
-         self.query_key_value = nn.Linear(config.hidden_size, config.hidden_size * 3)
-         self.dense = nn.Linear(config.hidden_size, config.hidden_size)
-         self.output_dropout = torch.nn.Dropout(config.dropout_prob)
-
-     def forward(self, x: "tensor(B, L, D)") -> "tensor(B, L, D)":
-         B, L, _ = x.shape
-         qkv = self.query_key_value(x)
-         qkv = qkv.reshape(B, L, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)  # 3, B, H, L, D
-         q, k, v = qkv[0], qkv[1], qkv[2]
-
-         out = attention_fn_default(
-             q, k, v
-         )
-         output = self.dense(out.transpose(1, 2).reshape(B, L, -1))
-         output = self.output_dropout(output)
-         return output
-
-     def attention(self, q, k, v):
-         attn_weights = torch.matmul(q * self.scale, k.transpose(-2, -1))
-         attn_weights = attn_weights.softmax(dim=-1)
-         output = torch.matmul(attn_weights, v)
-         return output
-
-
- class MLP(nn.Module):
-     def __init__(self, config):
-         super().__init__()
-         self.config = config
-         self.activation_fn = ACT2FN[config.hidden_act]
-         self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
-         self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)
-
-     def forward(self, x: torch.Tensor) -> torch.Tensor:
-         x = self.fc1(x)
-         x = self.activation_fn(x)
-         x = self.fc2(x)
-         return x
-
-
- class TransformerLayer(nn.Module):
-     def __init__(self, config):
-         super().__init__()
-         self.input_layernorm = LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
-         self.attention = Attention(config)
-         self.mlp = MLP(config)
-         self.post_attention_layernorm = LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
-
-     def forward(self, hidden_states):
-         attention_input = hidden_states
-         attention_output = self.input_layernorm(self.attention(attention_input))
-         hidden_states = attention_input + attention_output
-         mlp_input = hidden_states
-
-         # https://github.com/THUDM/GLM-4/issues/350
-         mlp_output = self.post_attention_layernorm(self.mlp(mlp_input)).to(mlp_input.device)
-         output = mlp_input + mlp_output
-         return output
-
-
- class Transformer(nn.Module):
-     def __init__(self, config):
-         super().__init__()
-         self.layers = nn.ModuleList([TransformerLayer(config) for _ in range(config.num_hidden_layers)])
-
-     def forward(self, hidden_states):
-         for layer_module in self.layers:
-             hidden_states = layer_module(hidden_states)
-         return hidden_states
-
-
- class GLU(nn.Module):
-     def __init__(self, config, in_features):
-         super().__init__()
-         self.linear_proj = nn.Linear(in_features, config.hidden_size, bias=False)
-         self.norm1 = nn.LayerNorm(config.hidden_size)
-         self.act1 = nn.GELU()
-         self.act2 = nn.functional.silu
-         self.dense_h_to_4h = nn.Linear(config.hidden_size, config.ffn_hidden_size, bias=False)
-         self.gate_proj = nn.Linear(config.hidden_size, config.ffn_hidden_size, bias=False)
-         self.dense_4h_to_h = nn.Linear(config.ffn_hidden_size, config.hidden_size, bias=False)
-
-     def forward(self, x):
-         x = self.linear_proj(x)
-         x = self.act1(self.norm1(x))
-         x = self.act2(self.gate_proj(x)) * self.dense_h_to_4h(x)
-         x = self.dense_4h_to_h(x)
-         return x
-
-
- class EVA2CLIPModel(nn.Module):
-     def __init__(self, config):
-         super().__init__()
-         vision_config = Namespace(**config.vision_config)
-         self.patch_embedding = PatchEmbedding(vision_config)
-         self.transformer = Transformer(vision_config)
-         self.linear_proj = GLU(config, in_features=config.hidden_size)
-         self.conv = nn.Conv2d(in_channels=vision_config.hidden_size, out_channels=config.hidden_size, kernel_size=2,
-                               stride=2)
-         self.boi = nn.Parameter(torch.zeros(1, 1, config.hidden_size))
-         self.eoi = nn.Parameter(torch.zeros(1, 1, config.hidden_size))
-         self.scaling_factor = vision_config.scaling_factor
-
-     def forward(self, images: "tensor(B, C, H, W)") -> "tensor(B, L, D)":
-         x = self.patch_embedding(images)
-         x = self.transformer(x)
-         x = x[:, 1:]
-
-         b, s, h = x.shape
-         grid_size = int(s ** 0.5)
-         x = x.view(b, grid_size, grid_size, h).permute(0, 3, 1, 2)
-         x = self.conv(x)
-
-         x = x.flatten(2).transpose(1, 2)
-         x = self.linear_proj(x)
-
-         # https://github.com/THUDM/GLM-4/issues/350
-         boi = self.boi.expand(x.shape[0], -1, -1).to(x.device)
-         eoi = self.eoi.expand(x.shape[0], -1, -1).to(x.device)
-         x = torch.cat((boi, x, eoi), dim=1)
-         x = x / self.scaling_factor
-         return x
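
For readers tracing the deleted vision tower, a rough walk-through of the sequence length that `EVA2CLIPModel.forward` emits: patchify, drop the CLS token, downsample the grid with the stride-2 conv, then wrap with the `boi`/`eoi` embeddings. The `image_size` of 1120 is the value removed from `tokenizer_config.json` above; the patch size is an assumed example value:

```python
# Illustrative token-count arithmetic for the deleted EVA2CLIPModel.forward;
# patch_size is an assumption, image_size comes from the old tokenizer_config.json.
image_size = 1120
patch_size = 14  # assumed for illustration

grid_size = image_size // patch_size  # 80 patches per side after PatchEmbedding (CLS dropped)
conv_grid = grid_size // 2            # 40 per side after the 2x2, stride-2 Conv2d
image_tokens = conv_grid ** 2 + 2     # 1600 tokens plus the boi/eoi embeddings -> 1602

print(grid_size, conv_grid, image_tokens)
```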