keneonyeachonam committed on
Commit 3ae98d3 · 1 Parent(s): cb64b1d

Upload app.py

Browse files
Files changed (1)
  1. app.py +162 -0
app.py ADDED
@@ -0,0 +1,162 @@
+ import gradio as gr
+
+ # Note: this Markdown component is created outside the Blocks context defined
+ # below, so it may not be rendered in the app UI.
+ gr.Markdown("""
+ # BigScience BLOOM is a 176B-parameter large language model.
+ https://www.youtube.com/watch?v=wA8rjKueB3Q
+ https://www.youtube.com/watch?v=2MBJOuVq380&t=241s
+ # BigScience Papers and Code - Exciting AI Developments! 🤖💻🔬
+ https://paperswithcode.com/paper/bloom-a-176b-parameter-open-access
+ """)
+
+ # Load the hosted BLOOM model from the Hugging Face Hub as a callable endpoint.
+ api = gr.Interface.load("models/bigscience/bloom")
+
+ def complete_with_gpt(text):
+     # Use the last 50 characters of the text as context
+     # return text[:-50] + api(text[-50:])
+     # Use the last 100 characters of the text as context and append
+     # the model's continuation to the untouched prefix.
+     return text[:-100] + api(text[-100:])
+
+
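+ # Optional sketch, not part of the original app: the same completion helper
+ # with a configurable context window. `complete_with_context` and its
+ # `context_chars` parameter are hypothetical names added here for illustration.
+ def complete_with_context(text, context_chars=100):
+     # Send only the last `context_chars` characters to the model and
+     # append its continuation to the untouched prefix.
+     context = text[-context_chars:]
+     return text[:-context_chars] + api(context)
+
+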
+ with gr.Blocks() as demo:
+     with gr.Row():
+         textbox = gr.Textbox(placeholder="Type here and press enter...", lines=14)
+         with gr.Column():
+             btn = gr.Button("Generate")
+
+     # Send the textbox contents through the model and write the completion
+     # back into the same textbox, so generation can be run iteratively.
+     btn.click(complete_with_gpt, textbox, textbox)
+
+     with gr.Row():
+         gr.Markdown("""
+
+ # Example of how to prompt.
+
+ Create a repeating pattern of text. In this example I use language names as headings: after each line is generated, add another language heading and click Generate, and the model continues the pattern. A programmatic version of the same idea is sketched below.
+
+ English: Hi my name is Aaron. I am a computer scientist and senior principal engineer.
+ Japanese: 私はアランです。コンピューター科学者とプログラ
+ English: Hi my name is Aaron. I am a computer scientist and senior principal engineer.
+ Chinese: 你好,我叫Aaron。我是一个计算机科学家和高级首席工程师。
+ English: Hi my name is Aaron. I am a computer scientist and senior principal engineer.
+ Spanish: Hola, me llamo Aaron. Soy un cientifico de la computacion y un ingeniero principal
+ English: Hi my name is Aaron. I am a computer scientist and senior principal engineer.
+ Sanskrit: नमस्ते, मेरा नाम है Aaron. मैं एक कंप्यूटर वैज्ञानिक और वरिष्ठ प्रमुख इंजीनियर हूँ।
+ French: Bonjour, je m'appelle Aaron. Je suis un scientifique en informatique et un ingénieur senior.
+
+
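+ The same pattern prompt can also be sent to the hosted model from Python. This is a hypothetical sketch, not part of the app above; it only assumes the `models/bigscience/bloom` endpoint that this Space already uses:
+
+ ```python
+ import gradio as gr
+
+ # Load the hosted BLOOM endpoint, exactly as the app above does.
+ bloom = gr.Interface.load("models/bigscience/bloom")
+
+ # Build the pattern prompt and ask the model to continue it.
+ prompt = (
+     "English: Hi my name is Aaron. I am a computer scientist.\\n"
+     "French: Bonjour, je m'appelle Aaron. Je suis informaticien.\\n"
+     "German:"
+ )
+ print(bloom(prompt))  # the model should add a German line in the same style
+ ```
+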
+ ## Language Models 🗣️
+
+ 🏆 At release, BLOOM was the largest open-access multilingual language model, trained by the BigScience collaboration. 🌸
+
+ ### Comparison of Large Language Models
+
+ | Model Name | Model Size (in Parameters) |
+ | ----------------- | -------------------------- |
+ | BigScience-tr11-176B | 176 billion |
+ | GPT-3 | 175 billion |
+ | OpenAI's DALL-E 2.0 | 500 million |
+ | NVIDIA's Megatron | 8.3 billion |
+ | Transformer-XL | 250 million |
+ | XLNet | 210 million |
+
+ ## ChatGPT Datasets 📚
+
+ - WebText
+ - Common Crawl
+ - BooksCorpus
+ - English Wikipedia
+ - Toronto Books Corpus
+ - OpenWebText
+
+ ## ChatGPT Datasets - Details 📚
+
+ - **WebText:** A dataset of web pages scraped from outbound Reddit links with at least 3 karma. This dataset was used to pretrain GPT-2.
+   - [Language Models are Unsupervised Multitask Learners](https://paperswithcode.com/dataset/webtext) by Radford et al.
+
+ - **Common Crawl:** A dataset of web pages from a variety of domains, which is updated regularly. A filtered version was used to pretrain GPT-3.
+   - [Language Models are Few-Shot Learners](https://paperswithcode.com/dataset/common-crawl) by Brown et al.
+
+ - **BooksCorpus:** A dataset of over 11,000 books from a variety of genres.
+   - [Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books](https://paperswithcode.com/dataset/bookcorpus) by Zhu et al.
+
+ - **English Wikipedia:** A dump of the English-language Wikipedia as of 2018, with articles from 2001-2017.
+   - [WikipediaUltimateAISearch](https://huggingface.co/spaces/awacke1/WikipediaUltimateAISearch?logs=build), a Hugging Face Space for Wikipedia search
+
+ - **Toronto Books Corpus:** A dataset of over 7,000 books from a variety of genres, collected by researchers at the University of Toronto; it is essentially the same corpus as BooksCorpus above.
+   - [BookCorpus](https://paperswithcode.com/dataset/bookcorpus) on Papers with Code
+
+ - **OpenWebText:** An open-source replication of WebText, built from web pages filtered to remove content that was likely to be low-quality or spammy; used to pretrain models such as RoBERTa.
+   - [OpenWebText](https://paperswithcode.com/dataset/openwebtext) on Papers with Code
+
+
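+ Several of these corpora are mirrored on the Hugging Face Hub. As a hypothetical illustration, not part of this app, they can be inspected with the `datasets` library; the `openwebtext` dataset id and `text` field below are assumptions:
+
+ ```python
+ from itertools import islice
+ from datasets import load_dataset
+
+ # Stream the corpus so nothing is downloaded beyond the records we look at.
+ ds = load_dataset("openwebtext", split="train", streaming=True)
+ for example in islice(ds, 3):
+     print(example["text"][:200])
+ ```
+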
+ ## Big Science Model 🚀
+
+ - 📜 Papers:
+     1. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model [Paper](https://arxiv.org/abs/2211.05100)
+     2. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [Paper](https://arxiv.org/abs/1909.08053)
+     3. 8-bit Optimizers via Block-wise Quantization [Paper](https://arxiv.org/abs/2110.02861)
+     4. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation [Paper](https://arxiv.org/abs/2108.12409)
+     5. [Other BigScience-related resources on the Hugging Face Hub](https://huggingface.co/models?other=doi:10.57967/hf/0003)
+     6. [217 other models optimized for use with Bloom](https://huggingface.co/models?other=bloom)
+
+ - 📚 Datasets:
+
+     1. **Universal Dependencies:** A collection of annotated corpora for natural language processing in a range of languages, with a focus on dependency parsing.
+         - [Universal Dependencies official website.](https://universaldependencies.org/)
+     2. **WMT 2014:** The 2014 edition of the Workshop on Statistical Machine Translation, featuring shared tasks on translating between English and various other languages.
+         - [WMT14 website.](http://www.statmt.org/wmt14/)
+     3. **The Pile:** An English language corpus of diverse text, sourced from various places on the internet.
+         - [The Pile official website.](https://pile.eleuther.ai/)
+     4. **HumanEval:** A benchmark of 164 hand-written Python programming problems for evaluating code generation.
+         - [Evaluating Large Language Models Trained on Code](https://github.com/openai/human-eval) by Chen et al.
+     5. **FLORES-101:** An evaluation benchmark of parallel sentences in 101 languages, designed for multilingual machine translation.
+         - [The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation](https://github.com/facebookresearch/flores) by Goyal et al.
+     6. **CrowS-Pairs:** A dataset of paired sentences for measuring social biases in masked language models.
+         - [CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models](https://github.com/nyu-mll/crows-pairs) by Nangia et al.
+     7. **WikiLingua:** A cross-lingual abstractive summarization dataset built from WikiHow articles in 18 languages.
+         - [WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization](https://arxiv.org/abs/2010.03093) by Ladhak et al.
+     8. **MTEB:** The Massive Text Embedding Benchmark, which evaluates text embedding models across a wide range of tasks and languages.
+         - [MTEB: Massive Text Embedding Benchmark](https://github.com/embeddings-benchmark/mteb) by Muennighoff et al.
+     9. **xP3:** A multilingual collection of prompts and datasets spanning many tasks and languages, used by BigScience to finetune BLOOMZ and mT0.
+         - [Crosslingual Generalization through Multitask Finetuning](https://huggingface.co/datasets/bigscience/xP3) by Muennighoff et al.
+     10. **DiaBLa:** An English-French dataset of spontaneous written dialogues for evaluating machine translation in a dialogue setting.
+         - [DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation](https://paperswithcode.com/dataset/diabla) by Bawden et al.
+
+
+ - 📚 Dataset Papers with Code
+
+     1. [Universal Dependencies](https://paperswithcode.com/dataset/universal-dependencies)
+     2. [WMT 2014](https://paperswithcode.com/dataset/wmt-2014)
+     3. [The Pile](https://paperswithcode.com/dataset/the-pile)
+     4. [HumanEval](https://paperswithcode.com/dataset/humaneval)
+     5. [FLORES-101](https://paperswithcode.com/dataset/flores-101)
+     6. [CrowS-Pairs](https://paperswithcode.com/dataset/crows-pairs)
+     7. [WikiLingua](https://paperswithcode.com/dataset/wikilingua)
+     8. [MTEB](https://paperswithcode.com/dataset/mteb)
+     9. [xP3](https://paperswithcode.com/dataset/xp3)
+     10. [DiaBLa](https://paperswithcode.com/dataset/diabla)
+
+ # Deep RL ML Strategy 🧠
+
+ The AI strategies are:
+ - Language Model Preparation using Human-Augmented Data with Supervised Fine Tuning 🤖
+ - Reward Model Training with a Prompts Dataset: multiple model outputs are generated and ranked (see the loss sketch below) 🎁
+ - Fine Tuning with Reinforcement Reward and Distance Distribution Regret Score 🎯
+ - Proximal Policy Optimization Fine Tuning 🤝
+ - Variations - Preference Model Pretraining 🤔
+ - Use Ranking Datasets Sentiment - Thumbs Up/Down, Distribution 📊
+ - Online Version Getting Feedback 💬
+ - OpenAI - InstructGPT - Humans generate LM Training Text 🔍
+ - DeepMind - Advantage Actor Critic Sparrow, GopherCite 🦜
+ - Reward Model Human Preference Feedback 🏆
+
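+ As a hypothetical illustration, not part of this app, the reward model in the pipeline above is typically trained on pairwise human preferences with a Bradley-Terry style loss:
+
+ ```python
+ import math
+
+ def reward_pairwise_loss(r_chosen, r_rejected):
+     # Encourage the reward model to score the human-preferred response higher
+     # than the rejected one: loss = -log(sigmoid(r_chosen - r_rejected)).
+     return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))
+
+ print(reward_pairwise_loss(2.0, 0.5))  # small loss (~0.20): preferred answer scored higher
+ print(reward_pairwise_loss(0.5, 2.0))  # large loss (~1.70): preferred answer scored lower
+ ```
+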
+ For more information on specific techniques and implementations, check out the following resources:
+ - OpenAI's paper on [GPT-3](https://arxiv.org/abs/2005.14165), which details the base large language model used in this kind of pipeline
+ - The paper on [Soft Actor-Critic (SAC)](https://arxiv.org/abs/1801.01290) by Haarnoja et al., an off-policy actor-critic reinforcement learning algorithm
+ - OpenAI's paper on [Reward Learning](https://arxiv.org/abs/1810.06580) which explains their approach to training Reward Models
+ - OpenAI's blog post on [GPT-3's fine-tuning process](https://openai.com/blog/fine-tuning-gpt-3/)
+
+ """)
+
+ demo.launch()