guoday committed
Commit a5db2fc
1 Parent(s): b5b9c3b

Update README.md

Files changed (1)
  1. README.md +177 -0
README.md CHANGED
@@ -3,3 +3,180 @@ license: other
  license_name: deepseek
  license_link: LICENSE
  ---
+
+
+
+ ### 1. Introduction to DeepSeek Coder
+
+ DeepSeek Coder comprises a series of code language models, each pre-trained on 2T tokens made up of 87% code and 13% natural language in English and Chinese. We provide models in various sizes, ranging from 1B to 33B. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task, to support project-level code completion and infilling. In coding capability, DeepSeek Coder achieves state-of-the-art performance among open-source code models across multiple programming languages and various benchmarks.
+
+ - **Massive Training Data**: Trained on 2T tokens, comprising 87% code and 13% natural-language data in English and Chinese.
+
+ - **Highly Flexible & Scalable**: Offered in model sizes of 1B, 7B, and 33B, so users can choose the setup best suited to their requirements.
+
+ - **Superior Model Performance**: State-of-the-art performance among publicly available code models on the HumanEval, MultiPL-E, MBPP, DS-1000, and APPS benchmarks.
+
+ - **Advanced Code Completion Capabilities**: A window size of 16K and a fill-in-the-blank task, supporting project-level code completion and infilling.
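For a rough sense of scale, the data mix quoted above can be turned into approximate token counts. This is a back-of-the-envelope sketch that assumes "2T" means exactly 2 trillion tokens and that the 87% / 13% split is exact; the published figures are presumably rounded, and the variable names are only illustrative.

```python
# Assumption: "2T tokens" is taken as exactly 2 trillion; the real figure is likely rounded.
TOTAL_TOKENS = 2_000_000_000_000
CODE_PERCENT = 87              # share of code, as stated in the README
NATURAL_LANGUAGE_PERCENT = 13  # share of English and Chinese natural language

# Integer arithmetic avoids floating-point rounding noise.
code_tokens = TOTAL_TOKENS * CODE_PERCENT // 100
natural_language_tokens = TOTAL_TOKENS * NATURAL_LANGUAGE_PERCENT // 100

print(f"code: {code_tokens / 1e12:.2f}T tokens")
print(f"natural language: {natural_language_tokens / 1e12:.2f}T tokens")
```

Under these assumptions the split works out to roughly 1.74T code tokens and 0.26T natural-language tokens per model.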
+
+
+
+ ### 2. Model Summary
+ deepseek-coder-5.7bmqa-base is a 5.7B-parameter model with Multi-Query Attention, trained on 2 trillion tokens.
+ - **Home Page:** [DeepSeek](https://deepseek.com/)
+ - **Repository:** [deepseek-ai/deepseek-coder](https://github.com/deepseek-ai/deepseek-coder)
+ - **Chat With DeepSeek Coder:** [DeepSeek-Coder](https://coder.deepseek.com/)
+
+
+ ### 3. How to Use
+ Here are some examples of how to use our model.
+ #### 1) Code Completion
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ # trust_remote_code=True allows custom code from the model repository to run.
+ tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()
+ input_text = "#write a quick sort algorithm"
+ # The tokenizer returns a BatchEncoding, which has .to() but no .cuda().
+ inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
+ # max_length counts the prompt tokens as well as the generated ones.
+ outputs = model.generate(**inputs, max_length=128)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+ This code will output the following result:
+ ```
+ def quick_sort(arr):
+     if len(arr) <= 1:
+         return arr
+     pivot = arr[0]
+     left = []
+     right = []
+     for i in range(1, len(arr)):
+         if arr[i] < pivot:
+             left.append(arr[i])
+         else:
+             right.append(arr[i])
+     return quick_sort(left) + [pivot] + quick_sort(right)
+ ```
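Generated code is worth checking before it is used. The sketch below copies the `quick_sort` function from the output above and compares it against Python's built-in `sorted` on a few hand-picked inputs. The sample lists are arbitrary illustrations, not part of the original README.

```python
def quick_sort(arr):
    # Function body copied from the model output shown above.
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
    for i in range(1, len(arr)):
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)


# Arbitrary test inputs: empty, single element, unsorted, duplicates, negatives.
samples = [[], [1], [3, 1, 2], [5, 5, 1, 9, 0], [2, -1, 2, -1]]
for sample in samples:
    assert quick_sort(sample) == sorted(sample)
print("all checks passed")
```

Elements equal to the pivot go to the right partition, so duplicates are kept.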
+
+ #### 2) Code Insertion
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()
+ # The sentinel tokens mark the code before and after the hole to be filled.
+ input_text = """<fim_prefix>def quick_sort(arr):
+     if len(arr) <= 1:
+         return arr
+     pivot = arr[0]
+     left = []
+     right = []
+ <fim_middle>
+         if arr[i] < pivot:
+             left.append(arr[i])
+         else:
+             right.append(arr[i])
+     return quick_sort(left) + [pivot] + quick_sort(right)<fim_suffix>"""
+ # The tokenizer returns a BatchEncoding, which has .to() but no .cuda().
+ inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_length=128)
+ # Slice off the prompt so that only the inserted code is printed.
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):])
+ ```
+ This code will output the following result:
+ ```
+     for i in range(1, len(arr)):
+ ```
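The prompt layout used in the insertion example can be wrapped in a small helper. This is a minimal sketch that assumes the sentinel strings `<fim_prefix>`, `<fim_middle>`, and `<fim_suffix>` exactly as they appear in the example above. The function names `build_fim_prompt` and `strip_prompt` are hypothetical and are not part of the model or of `transformers`.

```python
# Sentinel strings copied from the insertion example (assumed valid for this checkpoint).
FIM_PREFIX = "<fim_prefix>"
FIM_MIDDLE = "<fim_middle>"
FIM_SUFFIX = "<fim_suffix>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Hypothetical helper: place the code before and after the hole as in the example."""
    return f"{FIM_PREFIX}{prefix}{FIM_MIDDLE}{suffix}{FIM_SUFFIX}"


def strip_prompt(decoded: str, prompt: str) -> str:
    """Hypothetical helper: return only the text that follows the prompt.

    If the decoded text does not start with the prompt, it is returned unchanged.
    """
    return decoded[len(prompt):] if decoded.startswith(prompt) else decoded


prompt = build_fim_prompt("def add(a, b):\n", "\n    return result")
print(prompt)
```

Unlike the fixed-length slice in the example, `strip_prompt` only removes the prompt when the decoded text actually begins with it. That guards against the prefix being changed during decoding, for instance if the sentinel tokens are dropped as special tokens.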
+ #### 3) Repository Level Code Completion
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()
+
+ # The prompt concatenates three files, each introduced by a "#<filename>" comment line.
+ input_text = """#utils.py
+ import torch
+ from sklearn import datasets
+ from sklearn.model_selection import train_test_split
+ from sklearn.preprocessing import StandardScaler
+ from sklearn.metrics import accuracy_score
+
+ def load_data():
+     iris = datasets.load_iris()
+     X = iris.data
+     y = iris.target
+
+     # Standardize the data
+     scaler = StandardScaler()
+     X = scaler.fit_transform(X)
+
+     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
+
+     # Convert numpy data to PyTorch tensors
+     X_train = torch.tensor(X_train, dtype=torch.float32)
+     X_test = torch.tensor(X_test, dtype=torch.float32)
+     y_train = torch.tensor(y_train, dtype=torch.int64)
+     y_test = torch.tensor(y_test, dtype=torch.int64)
+
+     return X_train, X_test, y_train, y_test
+
+ def evaluate_predictions(y_test, y_pred):
+     return accuracy_score(y_test, y_pred)
+ #model.py
+ import torch
+ import torch.nn as nn
+ import torch.optim as optim
+ from torch.utils.data import DataLoader, TensorDataset
+
+ class IrisClassifier(nn.Module):
+     def __init__(self):
+         super(IrisClassifier, self).__init__()
+         self.fc = nn.Sequential(
+             nn.Linear(4, 16),
+             nn.ReLU(),
+             nn.Linear(16, 3)
+         )
+
+     def forward(self, x):
+         return self.fc(x)
+
+     def train_model(self, X_train, y_train, epochs, lr, batch_size):
+         criterion = nn.CrossEntropyLoss()
+         optimizer = optim.Adam(self.parameters(), lr=lr)
+
+         # Create DataLoader for batches
+         dataset = TensorDataset(X_train, y_train)
+         dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
+
+         for epoch in range(epochs):
+             for batch_X, batch_y in dataloader:
+                 optimizer.zero_grad()
+                 outputs = self(batch_X)
+                 loss = criterion(outputs, batch_y)
+                 loss.backward()
+                 optimizer.step()
+
+     def predict(self, X_test):
+         with torch.no_grad():
+             outputs = self(X_test)
+             _, predicted = outputs.max(1)
+             return predicted.numpy()
+ #main.py
+ from utils import load_data, evaluate_predictions
+ from model import IrisClassifier as Classifier
+
+ def main():
+     # Model training and evaluation
+ """
+ # The prompt stops inside main(), so the model continues by writing its body.
+ # The tokenizer returns a BatchEncoding, which has .to() but no .cuda().
+ inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=140)
+ print(tokenizer.decode(outputs[0]))
+ ```
+
+ ---
+ In the example above, the DeepSeek Coder 7B model calls the **IrisClassifier** class and its member functions from the `model.py` file, and uses the functions from the `utils.py` file, to correctly complete the **main** function in the `main.py` file for model training and evaluation.
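The repository-level prompt in the example is a plain concatenation: each file is introduced by a `#<filename>` comment line, dependencies come first, and the unfinished file comes last. A minimal sketch of that layout follows. The helper name `build_repo_prompt` and its parameters are hypothetical, and the `#<filename>` header format is inferred from this one example rather than from a documented specification.

```python
from typing import Sequence, Tuple


def build_repo_prompt(
    files: Sequence[Tuple[str, str]],
    target_name: str,
    target_start: str,
) -> str:
    """Hypothetical helper: join files under '#<name>' headers, ending with the unfinished file.

    files        -- (filename, source) pairs for already-complete files, dependencies first
    target_name  -- name of the file the model should continue
    target_start -- the beginning of that file, which the model will extend
    """
    parts = []
    for name, source in files:
        # Normalize each file to end with exactly one newline before the next header.
        parts.append(f"#{name}\n{source.rstrip()}\n")
    parts.append(f"#{target_name}\n{target_start}")
    return "".join(parts)


prompt = build_repo_prompt(
    files=[
        ("utils.py", "def f():\n    return 1\n"),
        ("model.py", "import utils\n"),
    ],
    target_name="main.py",
    target_start="from utils import f\n\ndef main():\n",
)
print(prompt)
```

The resulting string can be passed to the tokenizer in place of the hand-written `input_text`, provided the whole prompt fits within the model's 16K window.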
+
+
+ ### 4. License
+ This code repository is licensed under the MIT License. The use of the DeepSeek Coder models and weights is subject to the Model License. DeepSeek Coder supports commercial use.
+
+ See the [LICENSE-MODEL](https://github.com/deepseek-ai/deepseek-coder/blob/main/LICENSE-MODEL) for more details.
+
+ ### 5. Contact
+
+ If you have any questions, please raise an issue or contact us at [agi_code@deepseek.com](mailto:agi_code@deepseek.com).
+