Commit 3ae98d3
1 Parent(s): cb64b1d
Upload app.py
app.py ADDED
@@ -0,0 +1,162 @@
import gradio as gr

# Load the hosted BLOOM model as a callable text-completion interface.
api = gr.Interface.load("models/bigscience/bloom")


def complete_with_gpt(text):
    # Use the last 100 characters of the text as context and append the
    # model's continuation (slicing clamps, so inputs shorter than 100
    # characters are passed through whole):
    # return text[:-50] + api(text[-50:])   # 50-character window variant
    return text[:-100] + api(text[-100:])
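
# Illustrative usage (hypothetical; the continuation shown is not a real
# model output):
#
#   complete_with_gpt("Hello, my name is Aaron and I")
#   -> the original text plus BLOOM's continuation of its last 100 characters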


with gr.Blocks() as demo:
    # Rendered inside the Blocks context so the intro Markdown actually
    # appears on the page (a bare top-level gr.Markdown call is never shown).
    gr.Markdown("""
# BigScience BLOOM is a 176B-Parameter Open-Access Large Language Model

https://www.youtube.com/watch?v=wA8rjKueB3Q

https://www.youtube.com/watch?v=2MBJOuVq380&t=241s

# Big Science Papers and Code - Exciting AI Developments! 🤖💻🔬

https://paperswithcode.com/paper/bloom-a-176b-parameter-open-access
""")

    with gr.Row():
        textbox = gr.Textbox(placeholder="Type here and press enter...", lines=14)
        with gr.Column():
            btn = gr.Button("Generate")

    btn.click(complete_with_gpt, textbox, textbox)

    with gr.Row():
        gr.Markdown("""

# Example of how to prompt

Create a repeating pattern of text. In this example, alternate an English line with a heading for another language, then click Generate so the model fills in each new line (a scripted sketch of the same idea follows the example below).

English: Hi my name is Aaron. I am a computer scientist and senior principal engineer.
Japanese: 私はアランです。コンピューター科学者とプログラ
English: Hi my name is Aaron. I am a computer scientist and senior principal engineer.
Chinese: 你好,我叫Aaron。我是一个计算机科学家和高级首席工程师。
English: Hi my name is Aaron. I am a computer scientist and senior principal engineer.
Spanish: Hola, me llamo Aaron. Soy un cientifico de la computacion y un ingeniero principal
English: Hi my name is Aaron. I am a computer scientist and senior principal engineer.
Sanskrit: नमस्ते, मेरा नाम है Aaron. मैं एक कंप्यूटर वैज्ञानिक और वरिष्ठ प्रमुख इंजीनियर हूँ।
French: Bonjour, je m'appelle Aaron. Je suis un scientifique en informatique et un ingénieur senior.
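
The same pattern can be driven from code. Below is a minimal sketch, assuming the same hosted-model handle this app loads; `continue_pattern` is a hypothetical helper, not part of the app itself.

```python
import gradio as gr

# Same hosted-model handle that this app loads above.
api = gr.Interface.load("models/bigscience/bloom")

def continue_pattern(prompt):
    # Let BLOOM continue the pattern, using only the last 100
    # characters as context (exactly what complete_with_gpt does).
    return prompt + api(prompt[-100:])

seed = "English: Hi my name is Aaron. I am a computer scientist. German:"
print(continue_pattern(seed))
```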


## Language Models 🗣️

🏆 BLOOM is one of the largest open-access multilingual language models released to date! 🌸

### Comparison of Large Language Models

| Model Name | Model Size (in Parameters) |
| ----------------- | -------------------------- |
| BigScience-tr11-176B (BLOOM) | 176 billion |
| GPT-3 | 175 billion |
| OpenAI's DALL-E 2 (image model, for scale) | ~3.5 billion |
| NVIDIA's Megatron-LM | 8.3 billion |
| Transformer-XL | ~250 million |
| XLNet (large) | ~340 million |

## ChatGPT Datasets 📚

- WebText
- Common Crawl
- BooksCorpus
- English Wikipedia
- Toronto Books Corpus
- OpenWebText

## ChatGPT Datasets - Details 📚

- **WebText:** A dataset of web pages scraped from outbound Reddit links with at least 3 karma. This dataset was used to pretrain GPT-2.
  - [Language Models are Unsupervised Multitask Learners](https://paperswithcode.com/dataset/webtext) by Radford et al.

- **Common Crawl:** A dataset of web pages from a wide variety of domains, updated regularly. A filtered version was used to pretrain GPT-3.
  - [Language Models are Few-Shot Learners](https://paperswithcode.com/dataset/common-crawl) by Brown et al.

- **BooksCorpus:** A dataset of over 11,000 unpublished books from a variety of genres.
  - [Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books](https://paperswithcode.com/dataset/bookcorpus) by Zhu et al.

- **English Wikipedia:** A dump of the articles of the English-language Wikipedia, widely used for language-model pretraining.
  - [Wikipedia Ultimate AI Search](https://huggingface.co/spaces/awacke1/WikipediaUltimateAISearch?logs=build) Space for Wikipedia Search

- **Toronto Books Corpus:** Another name for BooksCorpus, collected by researchers at the University of Toronto (roughly 7,000 unique books before deduplication).
  - [BookCorpus on Papers with Code](https://paperswithcode.com/dataset/bookcorpus)

- **OpenWebText:** An open-source recreation of WebText: web pages filtered to remove content that was likely to be low-quality or spammy.
  - [OpenWebText Corpus](https://paperswithcode.com/dataset/openwebtext) by Gokaslan and Cohen


## Big Science Model 🚀

- 📜 Papers:
  1. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model [Paper](https://arxiv.org/abs/2211.05100)
  2. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [Paper](https://arxiv.org/abs/1909.08053)
  3. 8-bit Optimizers via Block-wise Quantization [Paper](https://arxiv.org/abs/2110.02861)
  4. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation [Paper](https://arxiv.org/abs/2108.12409)
  5. [Other papers related to Big Science](https://huggingface.co/models?other=doi:10.57967/hf/0003)
  6. [217 other models optimized for use with Bloom](https://huggingface.co/models?other=bloom)

- 📚 Datasets:

  1. **Universal Dependencies:** A collection of annotated corpora for natural language processing in a range of languages, with a focus on dependency parsing.
     - [Universal Dependencies official website.](https://universaldependencies.org/)
  2. **WMT 2014:** The 2014 edition of the Workshop on Statistical Machine Translation, featuring shared tasks on translating between English and various other languages.
     - [WMT14 website.](http://www.statmt.org/wmt14/)
  3. **The Pile:** An English-language corpus of diverse text, sourced from various places on the internet.
     - [The Pile official website.](https://pile.eleuther.ai/)
  4. **HumanEval:** A benchmark of hand-written Python programming problems for evaluating code generation by language models.
     - [Evaluating Large Language Models Trained on Code](https://github.com/openai/human-eval) by Chen et al.
  5. **FLORES-101:** A dataset of parallel sentences in 101 languages, designed for multilingual machine translation.
     - [The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation](https://github.com/facebookresearch/flores) by Goyal et al.
  6. **CrowS-Pairs:** A dataset of sentence pairs for measuring social biases in masked language models.
     - [CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models](https://github.com/nyu-mll/crows-pairs) by Nangia et al.
  7. **WikiLingua:** A dataset of article/summary pairs in 18 languages, sourced from WikiHow, for cross-lingual abstractive summarization.
     - [WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization](https://arxiv.org/abs/2010.03093) by Ladhak et al.
  8. **MTEB:** The Massive Text Embedding Benchmark, covering a broad range of embedding tasks and languages.
     - [MTEB: Massive Text Embedding Benchmark](https://github.com/embeddings-benchmark/mteb) by Muennighoff et al.
  9. **xP3:** A multilingual collection of prompts and datasets across 46 languages, used to finetune BLOOMZ and mT0.
     - [Crosslingual Generalization through Multitask Finetuning](https://huggingface.co/datasets/bigscience/xP3) by Muennighoff et al.
  10. **DiaBLa:** A dataset of English-French written dialogues for evaluating machine translation in context.
      - [DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation](https://github.com/rbawden/DiaBLa-dataset) by Bawden et al.

- 📚 Dataset Papers with Code
  1. [Universal Dependencies](https://paperswithcode.com/dataset/universal-dependencies)
  2. [WMT 2014](https://paperswithcode.com/dataset/wmt-2014)
  3. [The Pile](https://paperswithcode.com/dataset/the-pile)
  4. [HumanEval](https://paperswithcode.com/dataset/humaneval)
  5. [FLORES-101](https://paperswithcode.com/dataset/flores-101)
  6. [CrowS-Pairs](https://paperswithcode.com/dataset/crows-pairs)
  7. [WikiLingua](https://paperswithcode.com/dataset/wikilingua)
  8. [MTEB](https://paperswithcode.com/dataset/mteb)
  9. [xP3](https://paperswithcode.com/dataset/xp3)
  10. [DiaBLa](https://paperswithcode.com/dataset/diabla)

# Deep RL ML Strategy 🧠

The AI strategies are:
- Language Model Preparation using Human-Augmented Data with Supervised Fine-Tuning 🤖
- Reward Model Training on a Prompts Dataset, with Multiple Models Generating Data to Rank 🎁
- Fine-Tuning with a Reinforcement Reward and a Distance-Distribution Regret Score 🎯
- Proximal Policy Optimization (PPO) Fine-Tuning 🤝
- Variations - Preference Model Pretraining 🤔
- Ranking Datasets with Sentiment - Thumbs Up/Down, Distribution 📊
- Online Version Collecting Feedback 💬
- OpenAI - InstructGPT - Humans Generate LM Training Text 🔍
- DeepMind - Advantage Actor-Critic: Sparrow, GopherCite 🦜
- Reward Model from Human Preference Feedback 🏆 (a minimal sketch of this loop follows below)
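
To make the reward-model + policy-gradient idea concrete, here is a minimal, illustrative sketch in PyTorch. It is not BLOOM's or OpenAI's actual training code: the toy `policy` and `reward_model` networks are hypothetical stand-ins, and the update is plain REINFORCE; production systems such as InstructGPT use PPO with a KL penalty against the original model.

```python
import torch
import torch.nn as nn

vocab_size = 16
policy = nn.Linear(vocab_size, vocab_size)   # toy "language model" head
reward_model = nn.Linear(vocab_size, 1)      # toy learned reward model (frozen here)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def one_hot(i):
    v = torch.zeros(vocab_size)
    v[i] = 1.0
    return v

for step in range(100):
    state = one_hot(0)                                    # fixed toy "prompt"
    dist = torch.distributions.Categorical(logits=policy(state))
    action = dist.sample()                                # "generate" one token
    reward = reward_model(one_hot(action.item())).item()  # score it with the reward model
    loss = -dist.log_prob(action) * reward                # policy-gradient (REINFORCE) step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```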

For more information on specific techniques and implementations, check out the following resources:
- OpenAI's paper on [GPT-3](https://arxiv.org/abs/2005.14165), which details their language model preparation approach
- The paper on [Soft Actor-Critic (SAC)](https://arxiv.org/abs/1801.01290), an off-policy actor-critic algorithm related to the actor-critic methods above
- A paper on [reward learning](https://arxiv.org/abs/1810.06580), which explains an approach to training reward models
- OpenAI's blog post on [GPT-3's fine-tuning process](https://openai.com/blog/fine-tuning-gpt-3/)
""")

demo.launch()