---
title: README
emoji: π
colorFrom: pink
colorTo: blue
sdk: static
pinned: false
---
Welcome! This classroom organization holds examples and links for this session. Begin by adding a bookmark.
# Chat and Clinical
<h1><center>Open Datasets for Health Care</center></h1>
Open-source and Creative Commons Zero datasets, plus links to PDFs for public clinical use:
<div align="center">
Curated Datasets: <a href="https://www.kaggle.com/datasets">Kaggle</a>.
<a href="https://www.nlm.nih.gov/research/umls/index.html">NLM UMLS</a>.
<a href="https://loinc.org/downloads/">LOINC</a>.
<a href="https://www.cms.gov/medicare/icd-10/2022-icd-10-cm">ICD10 Diagnosis</a>.
<a href="https://icd.who.int/dev11/downloads">ICD11</a>.
<a href="https://paperswithcode.com/datasets?q=medical&v=lst&o=newest">Papers, Code, Datasets for SOTA in Medicine</a>.
<a href="https://paperswithcode.com/datasets?q=mental&v=lst&o=newest">Mental</a>.
<a href="https://paperswithcode.com/datasets?q=behavior&v=lst&o=newest">Behavior</a>.
<a href="https://www.cms.gov/medicare-coverage-database/downloads/downloads.aspx">CMS Downloads</a>.
<a href="https://www.cms.gov/medicare/fraud-and-abuse/physicianselfreferral/list_of_codes">CMS CPT and HCPCS Procedures and Services</a>
</div>
# Examples and Exercises - Create These Spaces in Your Account and Test / Modify
## Easy Examples
1. FastSpeech - https://huggingface.co/spaces/AIZero2HeroBootcamp/FastSpeech2LinerGradioApp
2. Memory - https://huggingface.co/spaces/AIZero2HeroBootcamp/Memory
3. StaticHTML5PlayCanvas - https://huggingface.co/spaces/AIZero2HeroBootcamp/StaticHTML5Playcanvas
4. 3DHuman - https://huggingface.co/spaces/AIZero2HeroBootcamp/3DHuman
5. TranscriptAILearnerFromYoutube - https://huggingface.co/spaces/AIZero2HeroBootcamp/TranscriptAILearnerFromYoutube
6. AnimatedGifGallery - https://huggingface.co/spaces/AIZero2HeroBootcamp/AnimatedGifGallery
7. VideoToAnimatedGif - https://huggingface.co/spaces/AIZero2HeroBootcamp/VideoToAnimatedGif
## Hard Examples
8. ChatGPTandLangChain - https://huggingface.co/spaces/AIZero2HeroBootcamp/ChatGPTandLangchain
   a. Keys: https://platform.openai.com/account/api-keys
9. MultiPDFQAChatGPTLangchain - https://huggingface.co/spaces/AIZero2HeroBootcamp/MultiPDF-QA-ChatGPT-Langchain
# Two easy ways to turbo boost your AI learning journey - let's go 100X!
# AI Pair Programming with GPT
### Open two browsers to:
1. __ChatGPT__ [URL](https://chat.openai.com/chat) or [URL2](https://platform.openai.com/playground) and
2. __Hugging Face__ [URL](https://huggingface.co/awacke1) in separate browser windows.
1. Use prompts to generate a Streamlit program on Hugging Face, or locally, to test it.
2. For advanced work, install Python 3.10 and VSCode locally, and debug your Gradio or Streamlit apps there.
3. Use these two superpower processes to reduce the time it takes you to make a new AI program!
# YouTube University Method:
1. Plan two hours each weekday to exercise your body and brain.
2. Make a playlist of videos you want to learn from on YouTube. Save the links to edit later.
3. Try watching the videos at a faster speed while exercising, and sample the first five minutes of each video.
4. Reorder the playlist so the most useful videos are at the front, and take breaks to exercise.
5. Practice note-taking in markdown to instantly save what you want to remember. Share your notes with others!
6. Use AI pair programming with long-answer language models trained with human feedback.
## 2023 AI/ML Learning Playlists for ChatGPT, LLMs, and Recent Events in AI:
1. AI News: https://www.youtube.com/playlist?list=PLHgX2IExbFotMOKWOErYeyHSiikf6RTeX
2. ChatGPT Code Interpreter: https://www.youtube.com/playlist?list=PLHgX2IExbFou1pOQMayB7PArCalMWLfU-
3. Ilya Sutskever and Sam Altman: https://www.youtube.com/playlist?list=PLHgX2IExbFovr66KW6Mqa456qyY-Vmvw-
4. Andrew Huberman on Neuroscience and Health: https://www.youtube.com/playlist?list=PLHgX2IExbFotRU0jl_a0e0mdlYU-NWy1r
5. Andrej Karpathy: https://www.youtube.com/playlist?list=PLHgX2IExbFovbOFCgLNw1hRutQQKrfYNP
6. Medical Futurist on GPT: https://www.youtube.com/playlist?list=PLHgX2IExbFosVaCMZCZ36bYqKBYqFKHB2
7. ML APIs: https://www.youtube.com/playlist?list=PLHgX2IExbFovPX9z4m61rQImM7cDDY79L
8. FastAPI and Streamlit: https://www.youtube.com/playlist?list=PLHgX2IExbFosyX2jzJJimPAI9C0FHflwB
9. AI UI UX: https://www.youtube.com/playlist?list=PLHgX2IExbFosCUPzEp4bQaygzrzXPz81w
10. ChatGPT Streamlit 2023: https://www.youtube.com/playlist?list=PLHgX2IExbFotDzxBRWwUBTb0_XFEr4Dlg
### LLM Base Model Overview and Evolutionary Tree: https://github.com/Mooler0410/LLMsPracticalGuide
## 2023 AI/ML Advanced Learning Playlists:
1. [2023 QA Models and Long Form Question Answering NLP](https://www.youtube.com/playlist?list=PLHgX2IExbFovrkkx8HMTLNgYdjCMNYmX_)
2. [FHIR Bioinformatics Development Using AI/ML and Python, Streamlit, and Gradio - 2022](https://www.youtube.com/playlist?list=PLHgX2IExbFovoMUC3hYXeFegpk_Y0Lz0Q)
3. [2023 ChatGPT for Coding Assistant Streamlit, Gradio and Python Apps](https://www.youtube.com/playlist?list=PLHgX2IExbFouOEnppexiKZVdz_k5b0pvI)
4. [2023 BigScience Bloom - Large Language Model for AI Systems and NLP](https://www.youtube.com/playlist?list=PLHgX2IExbFouqnsIqziThlPCX_miiDq14)
5. [2023 Streamlit Pro Tips for AI UI UX for Data Science, Engineering, and Mathematics](https://www.youtube.com/playlist?list=PLHgX2IExbFou3cP19hHO9Xb-cN8uwr5RM)
6. [2023 Fun, New and Interesting AI, Videos, and AI/ML Techniques](https://www.youtube.com/playlist?list=PLHgX2IExbFotoMt32SrT3Xynt5BXTGnEP)
7. [2023 Best Minds in AGI AI Gamification and Large Language Models](https://www.youtube.com/playlist?list=PLHgX2IExbFotmFeBTpyje1uI22n0GAkXT)
8. [2023 State of the Art for Vision Image Classification, Text Classification and Regression, Extractive Question Answering and Tabular Classification](https://www.youtube.com/playlist?list=PLHgX2IExbFotPcPu6pauNHOoZTTbnAQ2F)
9. [2023 AutoML DataRobot and AI Platforms for Building Models, Features, Test, and Transparency](https://www.youtube.com/playlist?list=PLHgX2IExbFovsY2oGbDwdEhPrakkC8i3g)
# Azure Development Architectures in 2023:
1. ChatGPT: https://azure.github.io/awesome-azd/?tags=chatgpt
2. Azure OpenAI Services: https://azure.github.io/awesome-azd/?tags=openai
3. Python: https://azure.github.io/awesome-azd/?tags=python
4. AI LLM Architecture - Guidance by Microsoft: https://github.com/microsoft/guidance
# Dockerfile and Azure ACR->ACA: Easy, Robust Deploys from VSCode
1. Set up VSCode with the Azure and Remote extensions, and install the Azure CLI locally.
2. Get access to your Azure subscriptions. From there in VSCode, expand to Container Apps.
3. In Container Apps, create a new app and pick a Dockerfile to deploy: Azure builds the image, pushes it to an ACR, then spins up an ACA.
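The same flow can also be driven from a terminal with the Azure CLI. A sketch, assuming the `containerapp` extension is installed; all resource, registry, and app names below are placeholders:

```shell
# Placeholder names throughout - substitute your own resource group, registry, and app.
az login
az extension add --name containerapp           # Container Apps commands live in an extension
az group create --name my-rg --location eastus
az acr create --name myregistry --resource-group my-rg --sku Basic

# Build the image in Azure (no local Docker needed), then deploy it as a Container App.
az acr build --registry myregistry --image myapp:v1 .
az containerapp up --name myapp --resource-group my-rg \
  --image myregistry.azurecr.io/myapp:v1 --ingress external --target-port 8501
```

Port 8501 matches Streamlit's default; use 8000 for a typical FastAPI/uvicorn container.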
# Dockerfile for Streamlit and Dockerfile for FastAPI:
Show two examples.
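Two minimal sketches follow. Both assume your dependencies are pinned in a `requirements.txt`; the entry-point filenames (`app.py`, `main.py`) are conventional placeholders.

For Streamlit (serves on its default port 8501):

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
```

For FastAPI (assumes the app object is `app` inside `main.py`, served by uvicorn):

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```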
# Example Starter Prompts for AIPP:
Write a Streamlit program that demonstrates data synthesis.
Synthesize data from multiple sources to create new datasets.
Use two datasets and demonstrate pandas DataFrame query, merge, and join
with two datasets as Python lists of dictionaries:
a list of hospitals with over a 1000-bed count, by city and state, and
state population size and square miles.
Perform a calculated function on the merged dataset.
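A sketch of how the prompt above might be answered, showing just the pandas core (query, merge, and a calculated column) without the Streamlit UI. The hospital and state figures are made-up illustrative values, not real statistics:

```python
import pandas as pd

# Hypothetical sample data in the shape the prompt describes.
hospitals = [
    {"hospital": "General Medical Center", "city": "Houston", "state": "TX", "beds": 1200},
    {"hospital": "Bayview Hospital", "city": "Los Angeles", "state": "CA", "beds": 1050},
    {"hospital": "Lakeside Health", "city": "Chicago", "state": "IL", "beds": 980},
]
states = [
    {"state": "TX", "population": 29_500_000, "sq_miles": 268_596},
    {"state": "CA", "population": 39_200_000, "sq_miles": 163_695},
    {"state": "IL", "population": 12_600_000, "sq_miles": 57_914},
]

df_h = pd.DataFrame(hospitals)
df_s = pd.DataFrame(states)

# Keep only hospitals over 1,000 beds, then join on the shared "state" key.
large = df_h.query("beds > 1000")
merged = large.merge(df_s, on="state", how="inner")

# Calculated function on the merged dataset: beds per million residents.
merged["beds_per_million"] = merged["beds"] / (merged["population"] / 1_000_000)
print(merged[["hospital", "state", "beds", "beds_per_million"]])
```

Wrapping each DataFrame in `st.dataframe(...)` turns this into the Streamlit demo the prompt asks for.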
### Comparison of Large Language Models
| Model Name | Model Size (in Parameters) |
| ----------------- | -------------------------- |
| BigScience-tr11-176B | 176 billion |
| GPT-3 | 175 billion |
| OpenAI's DALL-E 2.0 | 500 million |
| NVIDIA's Megatron | 8.3 billion |
| Transformer-XL | 250 million |
| XLNet | 210 million |
## ChatGPT Datasets
- WebText
- Common Crawl
- BooksCorpus
- English Wikipedia
- Toronto Books Corpus
- OpenWebText
## ChatGPT Datasets - Details
- **WebText:** A dataset of web pages scraped from outbound Reddit links with at least 3 karma. This dataset was used to pretrain GPT-2.
  - [WebText on Papers with Code](https://paperswithcode.com/dataset/webtext), introduced in *Language Models are Unsupervised Multitask Learners* by Radford et al.
- **Common Crawl:** A regularly updated crawl of web pages from a wide variety of domains. A filtered version was used to pretrain GPT-3.
  - [Language Models are Few-Shot Learners](https://paperswithcode.com/dataset/common-crawl) by Brown et al.
- **BooksCorpus:** A dataset of over 11,000 free books from a variety of genres.
  - [Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books](https://paperswithcode.com/dataset/bookcorpus) by Zhu et al.
- **English Wikipedia:** A dump of English-language Wikipedia articles, commonly used for language-model pretraining.
  - [Improving Language Understanding by Generative Pre-Training](https://huggingface.co/spaces/awacke1/WikipediaUltimateAISearch?logs=build) Space for Wikipedia Search
- **Toronto Books Corpus:** Another name for BooksCorpus, collected by researchers at the University of Toronto.
  - [BookCorpus on Papers with Code](https://paperswithcode.com/dataset/bookcorpus)
- **OpenWebText:** An open-source recreation of WebText that filters out content likely to be low-quality or spammy; it was used to pretrain models such as RoBERTa.
  - [OpenWebText on Papers with Code](https://paperswithcode.com/dataset/openwebtext)
## BigScience Model
- Papers:
1. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model [Paper](https://arxiv.org/abs/2211.05100)
2. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [Paper](https://arxiv.org/abs/1909.08053)
3. 8-bit Optimizers via Block-wise Quantization [Paper](https://arxiv.org/abs/2110.02861)
4. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation [Paper](https://arxiv.org/abs/2108.12409)
5. [Other papers related to BigScience](https://huggingface.co/models?other=doi:10.57967/hf/0003)
6. [217 other models optimized for use with Bloom](https://huggingface.co/models?other=bloom)
- Datasets:
1. **Universal Dependencies:** A collection of annotated corpora for natural language processing in a range of languages, with a focus on dependency parsing.
  - [Universal Dependencies official website.](https://universaldependencies.org/)
2. **WMT 2014:** The 2014 edition of the Workshop on Statistical Machine Translation, featuring shared tasks on translating between English and various other languages.
  - [WMT14 website.](http://www.statmt.org/wmt14/)
3. **The Pile:** An 825 GiB English-language corpus of diverse text, drawn from 22 sources across the internet and curated by EleutherAI.
  - [The Pile official website.](https://pile.eleuther.ai/)
4. **HumanEval:** A benchmark of 164 hand-written Python programming problems for evaluating code generation.
  - [Evaluating Large Language Models Trained on Code](https://arxiv.org/abs/2107.03374) by Chen et al.
5. **FLORES-101:** A benchmark of parallel sentences in 101 languages, designed for low-resource and multilingual machine translation.
  - [FLORES-101 on Papers with Code](https://paperswithcode.com/dataset/flores-101), introduced by Goyal et al.
6. **CrowS-Pairs:** A dataset of sentence pairs designed for measuring social biases in masked language models.
  - [CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models](https://paperswithcode.com/dataset/crows-pairs) by Nangia et al.
7. **WikiLingua:** A dataset of aligned article/summary pairs in 18 languages, sourced from WikiHow, for cross-lingual abstractive summarization.
  - [WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization](https://paperswithcode.com/dataset/wikilingua) by Ladhak et al.
8. **MTEB:** The Massive Text Embedding Benchmark, which evaluates text-embedding models across a broad range of tasks and datasets.
  - [MTEB on Papers with Code](https://paperswithcode.com/dataset/mteb) by Muennighoff et al.
9. **xP3:** A multilingual mixture of tasks and prompts across dozens of languages, used to fine-tune BLOOMZ and mT0.
  - [Crosslingual Generalization through Multitask Finetuning](https://paperswithcode.com/dataset/xp3) by Muennighoff et al.
10. **DiaBLa:** An English-French dataset of bilingual dialogues for evaluating machine translation of informal, spontaneous conversation.
  - [DiaBLa on Papers with Code](https://paperswithcode.com/dataset/diabla) by Bawden et al.
- Dataset Papers with Code
1. [Universal Dependencies](https://paperswithcode.com/dataset/universal-dependencies)
2. [WMT 2014](https://paperswithcode.com/dataset/wmt-2014)
3. [The Pile](https://paperswithcode.com/dataset/the-pile)
4. [HumanEval](https://paperswithcode.com/dataset/humaneval)
5. [FLORES-101](https://paperswithcode.com/dataset/flores-101)
6. [CrowS-Pairs](https://paperswithcode.com/dataset/crows-pairs)
7. [WikiLingua](https://paperswithcode.com/dataset/wikilingua)
8. [MTEB](https://paperswithcode.com/dataset/mteb)
9. [xP3](https://paperswithcode.com/dataset/xp3)
10. [DiaBLa](https://paperswithcode.com/dataset/diabla)
# Deep RL ML Strategy
The AI strategies are:
- Language model preparation using human-augmented data with supervised fine-tuning
- Reward model training with a prompts dataset: multiple models generate completions to rank
- Fine-tuning with a reinforcement reward and a distance-distribution regret score
- Proximal Policy Optimization (PPO) fine-tuning
- Variations - preference model pretraining
- Use of ranking datasets with sentiment - thumbs up/down, distributions
- Online versions that gather live feedback
- OpenAI - InstructGPT - humans generate LM training text
- DeepMind - Advantage Actor-Critic in Sparrow, GopherCite
- Reward model from human preference feedback
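The reward-model step above rests on a simple pairwise ranking objective: score the human-preferred completion higher than the rejected one. A minimal sketch of that loss (illustrative only, not any lab's actual training code):

```python
import math

def reward_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise ranking loss for reward-model training:
    -log(sigmoid(r_chosen - r_rejected)).
    The loss shrinks as the reward model scores the
    human-preferred completion higher than the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Near zero when the preferred completion already scores much higher;
# large when the model ranks the pair the wrong way round.
```

In practice the rewards come from a neural head over the language model and the loss is averaged over batches of ranked pairs; the PPO fine-tuning stage then optimizes the policy against this learned reward.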
For more information on specific techniques and implementations, check out the following resources:
- OpenAI's paper on [GPT-3](https://arxiv.org/abs/2005.14165), which details their language-model preparation approach
- DeepMind's paper on [A3C](https://arxiv.org/abs/1602.01783), which describes the Asynchronous Advantage Actor-Critic algorithm
- OpenAI's paper on [Deep RL from Human Preferences](https://arxiv.org/abs/1706.03741), which explains their approach to training reward models
- OpenAI's paper on [InstructGPT](https://arxiv.org/abs/2203.02155), which describes RLHF fine-tuning of GPT-3 with PPO