---
license: mit
datasets:
- Skylion007/openwebtext
language:
- en
metrics:
- perplexity
pipeline_tag: text-generation
---
# GPT-2 Mini

A smaller GPT-2 model with only 39M parameters. It was pretrained on a subset of OpenWebText, an open-source reproduction of the WebText corpus OpenAI used to pretrain the original GPT-2 models.
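
The parameter count can be checked directly from the published weights; a minimal sketch:

```py
from transformers import AutoModelForCausalLM

# Load the checkpoint and count its parameters
model = AutoModelForCausalLM.from_pretrained("erwanf/gpt2-mini")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # expected to be roughly 39M, embeddings included
```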

## Uses

This model is intended mainly for research and education. Its small size allows for fast experiments in resource-limited settings, while still being able to generate complex and coherent text.

## Getting Started

Use the code below to get started with the model:
```py
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained("erwanf/gpt2-mini")
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("erwanf/gpt2-mini")

# Generate text: sample five continuations of the prompt
prompt = "Hello, I'm a language model,"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output = model.generate(input_ids, do_sample=True, max_length=50, num_return_sequences=5)
output_text = tokenizer.batch_decode(output, skip_special_tokens=True)
print(output_text)
```

Output:
```
["Hello, I'm a language model, I can't be more efficient in words.\n\nYou can use this as a point to find out the next bit in your system, and learn more about me.\n\nI think a lot of the",
"Hello, I'm a language model, my teacher is a good teacher - a good school teacher – and one thing you have to remember:\n\nIt's not perfect. A school is not perfect; it isn't perfect at all!\n\n",
'Hello, I\'m a language model, but if I can do something for you then go for it (for a word). Here is my blog, the language:\n\nI\'ve not used "normal" in English words, but I\'ve always',
'Hello, I\'m a language model, I\'m talking to you the very first time I used a dictionary and it can be much better than one word in my dictionary. What would an "abnormal" English dictionary have to do with a dictionary and',
'Hello, I\'m a language model, the most powerful representation of words and phrases in the language I\'m using."\n\nThe new rules change that makes it much harder for people to understand a language that does not have a native grammar (even with']
```
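
Alternatively, the same sampling setup can be run through the `transformers` `pipeline` helper. This is a minimal sketch; the sampling settings simply mirror the example above and are not prescribed by the model:

```py
from transformers import pipeline

# Text-generation pipeline wrapping the same checkpoint and tokenizer
generator = pipeline("text-generation", model="erwanf/gpt2-mini")

samples = generator(
    "Hello, I'm a language model,",
    do_sample=True,
    max_length=50,
    num_return_sequences=5,
)
for sample in samples:
    print(sample["generated_text"])
```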

## Training Details

The architecture follows GPT-2, with smaller dimensions and fewer layers. It uses the same tokenizer as GPT-2. We used the first 2M rows of the OpenWebText dataset, of which 1k rows are held out for the test and validation sets.
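
Perplexity on held-out text is the reported metric. Below is a minimal sketch of how it can be computed with the released checkpoint; the placeholder string stands in for a held-out OpenWebText document, and the truncation to 512 tokens matches the model's context length:

```py
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("erwanf/gpt2-mini")
tokenizer = AutoTokenizer.from_pretrained("erwanf/gpt2-mini")
model.eval()

# Any held-out text can be scored; a validation document from OpenWebText
# would be the natural choice, a short placeholder is used here.
text = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    # With labels equal to the inputs, the model returns the mean
    # next-token cross-entropy; its exponential is the perplexity.
    loss = model(**enc, labels=enc["input_ids"]).loss

print("perplexity:", math.exp(loss.item()))
```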

### Hyperparameters

| **Hyperparameter**         | **Value**  |
|----------------------------|------------|
| **Model Parameters**       |            |
| Vocabulary Size            | 50,257     |
| Context Length             | 512        |
| Number of Layers           | 4          |
| Hidden Size                | 512        |
| Number of Attention Heads  | 8          |
| Intermediate Size          | 2048       |
| Activation Function        | GELU       |
| Dropout                    | No         |
| **Training Parameters**    |            |
| Learning Rate              | 5e-4       |
| Batch Size                 | 256        |
| Optimizer                  | AdamW      |
| beta1                      | 0.9        |
| beta2                      | 0.98       |
| Weight Decay               | 0.1        |
| Training Steps             | 100,000    |
| Warmup Steps               | 4,000      |
| Learning Rate Scheduler    | Cosine     |
| Training Dataset Size      | 1M samples |
| Validation Dataset Size    | 1k samples |
| Float Type                 | bf16       |
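
For anyone who wants to retrain or modify the architecture rather than load the released weights, the model hyperparameters above map onto a Hugging Face `GPT2Config` roughly as follows. This is a sketch, not the exact training setup: the GELU variant is assumed to be GPT-2's default, and the published checkpoint already ships its own config.

```py
from transformers import GPT2Config, GPT2LMHeadModel

# Architecture values taken from the hyperparameter table above
config = GPT2Config(
    vocab_size=50257,       # Vocabulary Size
    n_positions=512,        # Context Length
    n_layer=4,              # Number of Layers
    n_embd=512,             # Hidden Size
    n_head=8,               # Number of Attention Heads
    n_inner=2048,           # Intermediate Size
    activation_function="gelu_new",  # GELU (assuming GPT-2's default variant)
    resid_pdrop=0.0,        # no dropout
    embd_pdrop=0.0,
    attn_pdrop=0.0,
)

# Randomly initialised model with the same shape as gpt2-mini
model = GPT2LMHeadModel(config)
```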