Doron Adler committed
Commit 5f3dba5
1 parent: 4144ffa

Added model card

Files changed (1):
1. README.md (+106, -0)
---
language: he

thumbnail: https://avatars1.githubusercontent.com/u/3617152?norod.jpg
widget:
- text: "עוד בימי קדם"
- text: "קוראים לי דורון ואני מעוניין ל"
- text: "קוראים לי איציק ואני חושב ש"
- text: "החתול שלך מאוד חמוד ו"

license: mit
---

# hebrew-gpt_neo-tiny

A Hebrew text-generation model based on [EleutherAI's gpt-neo](https://github.com/EleutherAI/gpt-neo). It was trained on a TPUv3-8, made available to me through the [TPU Research Cloud](https://sites.research.google/trc/) program.

## Datasets

1. An assortment of various Hebrew corpora, which I have made available [here](https://mega.nz/folder/CodSSA4R#4INvMes-56m_WUi7jQMbJQ)

2. oscar / unshuffled_deduplicated_he - [Homepage](https://oscar-corpus.com) | [Dataset Permalink](https://huggingface.co/datasets/viewer/?dataset=oscar&config=unshuffled_deduplicated_he)

   The Open Super-large Crawled ALMAnaCH coRpus (OSCAR) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

## Training Config

Available [here](https://github.com/Norod/hebrew-gpt_neo/tree/main/hebrew-gpt_neo-tiny/configs)
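
The configs in the linked directory follow gpt-neo's JSON format. As a rough orientation, a file of this kind contains fields along these lines (field names taken from gpt-neo's sample configs; the values below are purely illustrative, not the actual training settings — see the linked directory for those):

```json
{
  "n_layer": 6,
  "n_head": 8,
  "n_embd": 512,
  "n_ctx": 2048,
  "n_vocab": 50257,
  "lr": 0.0006,
  "train_batch_size": 32,
  "model_path": "gs://your-bucket/hebrew-gpt_neo-tiny"
}
```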

## Usage

### Google Colab Notebook

Available [here](https://colab.research.google.com/github/Norod/hebrew-gpt_neo/blob/main/hebrew-gpt_neo-tiny/Norod78_hebrew_gpt_neo_tiny_Colab.ipynb)

#### Simple usage sample code

```python
!pip install tokenizers==0.10.2 transformers==4.5.1

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Norod78/hebrew-gpt_neo-tiny")
model = AutoModelForCausalLM.from_pretrained("Norod78/hebrew-gpt_neo-tiny",
                                             pad_token_id=tokenizer.eos_token_id)

prompt_text = "אני אוהב שוקולד ועוגות"  # "I love chocolate and cakes"
max_len = 512
sample_output_num = 3
seed = 1000

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count() if torch.cuda.is_available() else 0
print(f"device: {device}, n_gpu: {n_gpu}")

# Seed every RNG so sampling is reproducible
np.random.seed(seed)
torch.manual_seed(seed)
if n_gpu > 0:
    torch.cuda.manual_seed_all(seed)

model.to(device)

encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False,
                                  return_tensors="pt").to(device)
input_ids = encoded_prompt if encoded_prompt.size()[-1] > 0 else None

if input_ids is not None:
    # Leave room for the prompt, capped at the model's 2048-token context window
    max_len = min(max_len + len(encoded_prompt[0]), 2048)
print(f"Updated max_len = {max_len}")

sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=max_len,
    top_k=50,
    top_p=0.95,
    num_return_sequences=sample_output_num
)

print(100 * '-' + "\nOutput:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("\n{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
    print("\n" + 100 * '-')
```