MartialTerran
committed on
Update README.md
README.md
CHANGED
@@ -1,5 +1,169 @@
|
|
1 |
These are Toy-GPTs for demonstrating and teaching the principles, internal components, and features of LLMs. The Hugging Face Transformers library is not used; the models rely only on libraries such as PyTorch. The Toy Models are generally small and can be run from the CMD console on a Windows 10 laptop with no PyTorch-compatible GPU. The hyperparameters can be changed to increase or decrease the total number of parameters, though this has not been tried and tested.
|
2 |
3 |
---
|
4 |
license: apache-2.0
|
5 |
---
|
|
|
1 |
These are Toy-GPTs for demonstrating and teaching the principles, internal components, and features of LLMs. The Hugging Face Transformers library is not used; the models rely only on libraries such as PyTorch. The Toy Models are generally small and can be run from the CMD console on a Windows 10 laptop with no PyTorch-compatible GPU. The hyperparameters can be changed to increase or decrease the total number of parameters, though this has not been tried and tested.
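To make the link between hyperparameters and parameter count concrete, here is a rough estimator. It is only a sketch: it assumes a GPT-2-style decoder layout (learned position embeddings, biases on all linear layers, a 4x-wide MLP, and an output head tied to the token embedding), which may not match the Toy-GPT scripts in this repository. The function name and the example hyperparameter values are hypothetical.

```python
def gpt_param_count(vocab_size, block_size, n_layer, d_model, tied_head=True):
    """Estimate trainable parameters of a GPT-2-style decoder (assumed layout)."""
    # Token embedding table plus learned position embedding table.
    embeddings = vocab_size * d_model + block_size * d_model
    # Attention: fused Q/K/V projection and output projection, with biases.
    attention = 4 * d_model * d_model + 4 * d_model
    # MLP: d_model -> 4*d_model -> d_model, with biases.
    mlp = 8 * d_model * d_model + 5 * d_model
    # Two LayerNorms per block, each with a scale and a shift vector.
    layer_norms = 4 * d_model
    per_block = attention + mlp + layer_norms
    final_norm = 2 * d_model
    # A tied output head reuses the token embedding and adds no parameters.
    head = 0 if tied_head else vocab_size * d_model
    return embeddings + n_layer * per_block + final_norm + head


# Hypothetical toy configuration (character-level vocabulary, short context).
toy = gpt_param_count(vocab_size=65, block_size=64, n_layer=2, d_model=64)
# GPT-2-small-sized configuration, for comparison.
gpt2_small = gpt_param_count(vocab_size=50257, block_size=1024, n_layer=12, d_model=768)
print(f"toy: {toy:,}  gpt2-small-sized: {gpt2_small:,}")
```

Under these assumptions the per-block cost grows with the square of `d_model`, so widening the model increases the parameter count much faster than adding vocabulary or context length.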
|
2 |
|
3 |
+
I. Terminology, Jargon, Acronyms, and People
|
4 |
+
|
5 |
+
Here's a table outlining the key terms, their relation to GPT transformers, and brief explanations:
|
6 |
+
|
7 |
+
| Term/Acronym/Person | Relation to GPT Transformers | Explanation |
| --- | --- | --- |
| Machine Learning (ML) | GPT transformers are a specific type of ML model. | A field of computer science that gives computers the ability to learn without being explicitly programmed. |
| Hardware | Full-scale GPT transformers require significant computational hardware (like GPUs) to train and run. | The physical components of a computer system, including processors, memory, etc. |
| Scaling | GPT transformers have demonstrated impressive performance improvements with increased model size and data. This is referred to as "scaling." | Increasing the size of a model (number of parameters) and the amount of data it's trained on. |
| Maths (Mathematics) | The underlying principles of transformers are based on mathematical concepts like linear algebra and calculus, but the focus has shifted towards empirical results. | The abstract science of number, quantity, and space. |
| Theory | Some argue that the development of transformers has been driven more by engineering than by a deep theoretical understanding of why they work so well. | A set of principles on which the practice of an activity is based. In this context, it refers to the fundamental understanding of how and why machine learning models work. |
| Algorithm | Transformers use specific algorithms for training (like backpropagation) and inference. | A set of rules or instructions for solving a problem or accomplishing a task. |
| Neural Nets (Neural Networks) | Transformers are a type of neural network architecture. | A computing system inspired by the biological neural networks that constitute animal brains. It consists of interconnected nodes (neurons) that process and transmit information. |
| Symbolic AI | Not directly related to transformers, but represents a contrasting approach to AI. | A branch of AI that focuses on representing knowledge and reasoning using symbols and logical rules. |
| ARC (Abstraction and Reasoning Corpus) | Not directly related to transformers, but used as a benchmark to evaluate the reasoning abilities of AI models, including LLMs based on transformers. | A dataset designed to test the abstract reasoning abilities of AI systems. |
| Test-time fine-tuning | A technique that can be applied to transformer-based models, including GPT, to improve performance on specific tasks during inference. | Adapting a pre-trained model to a specific task during the testing phase, rather than during the initial training phase. |
| Backprop (Backpropagation) | The core algorithm used to train transformers, including GPT. | An algorithm for training neural networks by calculating the gradient of the loss function with respect to the network's weights and using it to update the weights. |
| Active Data Selection | A technique that can be used with transformer models to improve efficiency and performance. | A method for selecting the most informative data points to train a model, rather than using all available data. |
| Chollet (François Chollet) | Not directly involved in the development of transformers, but a prominent critic of the current state of AI research, including the focus on scaling large language models like those based on transformers. | A deep learning researcher and author known for creating Keras and ARC, and for his critical views on the current direction of AI research. |
| Location Embedding | Refers to how a transformer encodes token position. The original transformer used fixed sinusoidal positional encodings, while GPT models use learned position embeddings. | A method for representing the position of words in a sequence within the transformer architecture. |
| RL (Reinforcement Learning) | Can be combined with transformers, including GPT, for certain applications. | A type of machine learning where an agent learns to interact with an environment by taking actions and receiving rewards or penalties. |
| MCTS (Monte Carlo Tree Search) | Can be used in conjunction with transformer-based models for decision-making in complex environments. | A decision-making algorithm often used in game playing and other domains where an agent needs to explore a large search space. |
| LLM (Large Language Model) | GPT is a type of LLM based on the transformer architecture. | A language model trained on a massive amount of text data that can generate fluent text, translate languages, write creative content, and answer questions. |
| PyTorch | A popular deep learning framework that can be used to implement and train transformers, including GPT. | An open-source machine learning framework that provides tools for building and training deep learning models. |
| LeNet | Not directly related to transformers, but a historical example of a convolutional neural network. | One of the earliest convolutional neural networks, developed by Yann LeCun in the late 1980s. |
| Convolutional Networks | Not directly related to transformers, but a different type of neural network architecture that was dominant before transformers. | A type of neural network that is particularly well-suited for processing images. It uses convolutional layers to extract features from the input data. |
| Ivakhnenko | Not directly related to transformers, but a pioneer in deep learning who developed an early form of neural networks. | A Ukrainian mathematician who is considered one of the pioneers of deep learning. He developed the Group Method of Data Handling (GMDH), an early form of neural networks. |
| Word2Vec | Not directly related to transformers, but a precursor that demonstrated the power of learning word embeddings. | A technique for learning word embeddings, which are vector representations of words that capture their semantic meaning. |
| Self-attention | A key component of the transformer architecture, including GPT. | A mechanism that allows a model to weigh the importance of different words in a sequence when processing each word. |
| GPT (Generative Pre-trained Transformer) | A specific type of LLM based on the transformer architecture, developed by OpenAI. | A family of language models developed by OpenAI that are pre-trained on large amounts of text to predict the next token and can then generate text, translate, and answer questions. |
| 4-bit precision | A quantization technique that reduces the memory footprint of transformer models like GPT. | Representing the weights (and sometimes the activations) of a neural network using only 4 bits each, instead of the usual 16 or 32. This reduces the memory footprint and can reduce the computational cost of the model. |
| 8B (8 Billion parameters) | Refers to models of the size of Llama 3 8B, which can be quantized to 4-bit precision and run on edge devices. | A count of the parameters in a machine learning model, which is a rough measure of its size and capacity. |
| MuZero | Not directly related to transformers, but an example of a reinforcement learning algorithm that uses MCTS. | A reinforcement learning algorithm developed by DeepMind that can learn to play games without being given the rules. |
| NLP (Natural Language Processing) | Transformers, especially GPT, have revolutionized the field of NLP. | A field of computer science that deals with the interaction between computers and human language. |
| OpenAI | The organization that developed GPT, a leading model based on the transformer architecture. | An AI research organization that has developed many state-of-the-art AI models, including GPT. |
| SimPO (Simple Preference Optimization) | A method for fine-tuning LLMs like GPT. | A preference-optimization method related to DPO that does not require a separate reference model. |
| SPIN (Self-Play Fine-Tuning) | A self-play method for fine-tuning LLMs like GPT. | A fine-tuning method in which the model is trained to distinguish its own earlier generations from human-written responses, improving over successive iterations without additional human preference data. |
| DPO (Direct Preference Optimization) | A method for fine-tuning LLMs like GPT. | A method for fine-tuning large language models directly on human preference pairs, without training a separate reward model. |
| KTO (Kahneman-Tversky Optimization) | A method for fine-tuning LLMs like GPT. | A method for fine-tuning large language models inspired by the work of Daniel Kahneman and Amos Tversky on human judgment and decision-making. |
| Tokenization | A crucial step in processing text for transformer models, including GPT. | The process of breaking down text into individual units, called tokens, which can be processed by a machine learning model. |
| MegaByte | A transformer-based architecture proposed as a way around the tokenization bottleneck for LLMs like GPT. | A multiscale architecture that models long sequences of raw bytes by splitting them into patches, avoiding a separate tokenizer. |
| In-context learning | A capability of LLMs like GPT, enabled by the transformer architecture. | The ability of a large language model to learn from a few examples provided in the input prompt, without any update to its weights. |
| Mechanistic Interpretability | A research area focused on understanding the inner workings of models like GPT, which are based on transformers. | A research area focused on understanding the inner workings of deep learning models, including large language models. |
| AGI (Artificial General Intelligence) | Some believe that transformers, including the architecture underlying GPT, could be a path to AGI, while others disagree. | A hypothetical type of artificial intelligence that would have the ability to understand, learn, and apply knowledge across a wide range of domains, similar to a human. |
| Ilya Sutskever | Not directly involved in the invention of transformers, but a co-founder of OpenAI who believes in their potential for AGI, including the GPT architecture. | A computer scientist and co-founder of OpenAI, known for his work on deep learning and artificial general intelligence. |
| Amodei (Dario Amodei) | Not directly involved in the invention of transformers, but a co-founder of Anthropic and formerly a researcher at OpenAI, where he worked on scaling laws for AI models, including transformer-based models like GPT. | An AI researcher and co-founder of Anthropic, known for his work on AI safety and scaling laws. |
| Shazeer (Noam Shazeer) | One of the co-authors of the original transformer paper ("Attention Is All You Need"). | A computer scientist known for his work on the transformer architecture and large language models. |
| Karpathy (Andrej Karpathy) | Not directly involved in the invention of transformers, but a prominent researcher who has worked on deep learning models, including those based on transformers like GPT. | An AI researcher known for his work on computer vision, natural language processing, and reinforcement learning. |
| Thermodynamic computation | Not directly related to transformers, but a potential future technology for more efficient computation that could benefit the training and deployment of models based on transformers. | A hypothetical type of computation that would use the principles of thermodynamics to perform calculations more efficiently. |
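The self-attention entry in the table can be illustrated with a small sketch. This is a deliberately simplified version written in plain Python so that it runs without PyTorch: it uses a single head, applies the causal mask used by GPT-style decoders, and feeds the same vectors in as queries, keys, and values instead of applying learned projection matrices. The function names and the example input are assumptions made for illustration, not code from this repository.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors (lists of floats).
    Returns (outputs, weights), where weights[t][s] is how much
    position t attends to position s.
    """
    d = len(q[0])
    outputs, weights = [], []
    for t in range(len(q)):
        # Position t may only look at positions 0..t (no peeking ahead).
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        w = softmax(scores)
        # Output is the attention-weighted average of the visible values.
        out = [sum(w[s] * v[s][j] for s in range(t + 1)) for j in range(len(v[0]))]
        # Pad masked (future) positions with zero weight for display.
        weights.append(w + [0.0] * (len(q) - t - 1))
        outputs.append(out)
    return outputs, weights


# Three token vectors of width 2; a real model would first project them
# into separate query, key, and value spaces with learned matrices.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row sums to 1, and the entries to the right of the diagonal are zero, which is what keeps a GPT-style model from using future tokens when predicting the next one.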
|
51 |
+
II. Contentions and Arguments
|
52 |
+
|
53 |
+
@adamkadmon6339:
|
54 |
+
|
55 |
+
Main Argument: The current state of machine learning research is overly focused on scaling up existing models (like transformers) and lacks the mathematical depth and theoretical innovation that characterized earlier eras of AI research. They contend that this trend leads to a "faddish" culture that prioritizes empirical results over fundamental understanding. They criticize the use of large models like LLMs as a symptom of this issue.
|
56 |
+
|
57 |
+
Specific Points:
|
58 |
+
|
59 |
+
Scaling is not a substitute for mathematical theory or algorithmic innovation.
|
60 |
+
|
61 |
+
The transformer, while successful, is based on old ideas and doesn't represent a significant mathematical advance.
|
62 |
+
|
63 |
+
Researchers are relying too heavily on techniques like backpropagation and test-time fine-tuning without exploring new approaches.
|
64 |
+
|
65 |
+
The incentive structure in the field rewards practical applications over theoretical breakthroughs.
|
66 |
+
|
67 |
+
@ianmatejka3533:
|
68 |
+
|
69 |
+
Main Argument: Transformers are a powerful and versatile architecture with significant untapped potential, and the current focus on improving them is justified. They argue that many advancements are still being made in areas like reinforcement learning, optimization, and interpretability, contrary to the notion that the field is stagnant. They believe the transformer may be a key to AGI.
|
70 |
+
|
71 |
+
Specific Points:
|
72 |
+
|
73 |
+
Transformers have proven to be a general-purpose architecture applicable across multiple domains.
|
74 |
+
|
75 |
+
Reinforcement learning and MCTS are still evolving rapidly, with recent breakthroughs like MuZero.
|
76 |
+
|
77 |
+
Optimizers, learning algorithms, and tokenization techniques are continuously improving.
|
78 |
+
|
79 |
+
In-context learning and mechanistic interpretability are promising research areas.
|
80 |
+
|
81 |
+
Many experts believe transformers are sufficient for AGI.
|
82 |
+
|
83 |
+
@marilynlucas5128:
|
84 |
+
|
85 |
+
Main Argument: Agrees with @adamkadmon6339 that presentations should focus on mathematical frameworks and algorithms.
|
86 |
+
|
87 |
+
Specific Points:
|
88 |
+
|
89 |
+
Presentations should be grounded in mathematical and algorithmic understanding.
|
90 |
+
|
91 |
+
Empirical results without theoretical underpinnings are less valuable.
|
92 |
+
|
93 |
+
@badrraitabcas:
|
94 |
+
|
95 |
+
Main Argument: Historically, empirical results have driven progress in connectionist AI (deep learning), and mathematical guarantees have not been a prerequisite for success.
|
96 |
+
|
97 |
+
Specific Points:
|
98 |
+
|
99 |
+
Deep learning's progress has often been driven by what works in practice, rather than by formal mathematical proofs.
|
100 |
+
|
101 |
+
@rjDOTdev:
|
102 |
+
|
103 |
+
Main Argument: Questions whether the goal of AI research should be to replicate human intelligence or to achieve superintelligence, suggesting that the need for novel problem-solving approaches might differ depending on the goal.
|
104 |
+
|
105 |
+
Specific Points:
|
106 |
+
|
107 |
+
Humans often require practice to solve novel problems.
|
108 |
+
|
109 |
+
The goals of human-level AI and superintelligence might require different research directions.
|
110 |
+
|
111 |
+
@henrismith7472:
|
112 |
+
|
113 |
+
Main Argument: The debate about theoretical vs. empirical approaches is secondary to the transformative potential of AI technology, even in its current state.
|
114 |
+
|
115 |
+
Specific Points:
|
116 |
+
|
117 |
+
The impact of AI will be enormous, regardless of the current state of theoretical understanding.
|
118 |
+
|
119 |
+
Scaling laws, hardware advancements, and positive feedback loops suggest rapid progress will continue.
|
120 |
+
|
121 |
+
III. Rewritten Dialog
|
122 |
+
|
123 |
+
Here's a more understandable version of the dialog, focusing on clarity and avoiding jargon:
|
124 |
+
|
125 |
+
@adamkadmon6339: It used to be that machine learning was all about complex math. Now, it feels like it's more about having powerful computers, creating hype, expressing opinions, and just combining existing software. I worry that we're losing the ability to make the kind of big theoretical leaps we saw in the 80s and 90s.
|
126 |
+
|
127 |
+
@KevinKreger: Okay, boomer.
|
128 |
+
|
129 |
+
@RickySupriyadi: I don't entirely agree. Making these models bigger does seem to unlock new abilities.
|
130 |
+
|
131 |
+
@marilynlucas5128: Exactly! I listened to that whole presentation and didn't hear anything meaningful. Where's the math to explain what's going on? In machine learning, you need to explain your ideas with math and algorithms. Without that, it's hard to take it seriously.
|
132 |
+
|
133 |
+
@bertobertoberto3: That's just how engineering works.
|
134 |
+
|
135 |
+
@adamkadmon6339: @KevinKreger It's just a fact, not a generational thing.
|
136 |
+
|
137 |
+
@adamkadmon6339: @RickySupriyadi Making models bigger has led to amazing results, but it's not the same as coming up with a new mathematical theory or a new algorithm.
|
138 |
+
|
139 |
+
@adamkadmon6339: @bertobertoberto3 We need both the engineering and the theory. The theory part seems to have vanished.
|
140 |
+
|
141 |
+
@MoreCompute: Why do you say that? Did you watch the video?
|
142 |
+
|
143 |
+
@ianmatejka3533: I completely disagree. The "transformer" has proven itself to be a very flexible design that works well for many different problems. Now, researchers are trying to find better ways to learn. Reinforcement learning, especially methods like Monte Carlo Tree Search, has gotten a lot better recently. These methods will probably give us more progress than trying to replace the transformer.
|
144 |
+
|
145 |
+
@adamkadmon6339: @MoreCompute I did look at the paper. It's good, solid work. There's some math in there, but let me explain my initial comment better. In the beginning, the pioneers of AI saw the limitations of the old symbolic approach and created neural networks. Just a few people did that. Now, we have a huge number of machine learning people who, when faced with a challenge like the Abstraction and Reasoning Corpus, just try to fine-tune their models on the test data because they can't think of anything else. Training a network on every possible example isn't how we learn, and it's not practical in general. People are being professional but not innovative. They're not coming up with new math or new theories; they're just using existing techniques, like adjusting the model's parameters during training (backpropagation) and selecting specific data to train on. These are very old ideas. People are being competent but not clever or original.
|
146 |
+
|
147 |
+
@adamkadmon6339: @ianmatejka3533 You're mistaken. Even the transformer was just a new arrangement of old ideas. There's no fundamentally new math in it, except for how it encodes the position of words, which isn't very good. Reinforcement learning and Monte Carlo Tree Search are also quite old now. I'm not denying that making models bigger has led to impressive results, especially with large language models. But the research culture is too focused on trends and lacks deep mathematical thinking. If you can't see beyond the transformer, you're part of the problem. Training a network on every example is fundamentally a bad idea.
|
148 |
+
|
149 |
+
@badrraitabcas: I'm not sure how much you know about AI, but historically, AI and strong mathematical proofs haven't always gone together. It seems like practical results have been the main driver in the field of deep learning.
|
150 |
+
|
151 |
+
@adamkadmon6339: @badrraitabcas (Speaking anonymously) I'm very familiar with the people and the history. I had backpropagation working before it was even published, back when we had to calculate everything by hand. In the early days of neural networks, many people experimented. But this led to strong theories that are still used today. Successful practical results came from mathematically motivated ideas. We need both. I'm not saying the recent work is just tinkering, but it's obvious that we need a better idea for learning from a single example.
|
152 |
+
|
153 |
+
@adamkadmon6339: Also, there's no money in making a mathematical breakthrough unless you keep it secret. It'll be out of your hands immediately. So, the system encourages people to work on practical applications for companies rather than investing in deep, formal theories. This might explain why the field feels a bit shallow right now.
|
154 |
+
|
155 |
+
@ianmatejka3533: Saying transformers are just a new arrangement of old ideas is too simplistic. LeNet introduced convolutional networks way back in 1989, and they were the main thing in deep learning for decades. By your logic, all neural networks go back to Ivakhnenko's work in the 60s. But transformers came about through a series of innovations: Word2Vec showed we could learn meaningful representations of words, self-attention was explored in earlier networks, and transformers themselves only appeared 7 years ago. Modern large language models like GPT are even newer, with their general intelligence abilities only emerging in the last 3 years. And what used to require 175 billion parameters can now be done more efficiently with smaller 8 billion parameter models using only 4 bits to represent each number!
|
156 |
+
|
157 |
+
@ianmatejka3533: Reinforcement learning and Monte Carlo Tree Search are also far from "old." We only figured out how to generalize MCTS without human input 5 years ago with MuZero. Combining MCTS with natural language processing is a new problem that OpenAI has only recently tackled. Other reinforcement learning techniques, like self-play, are just getting started and have a lot of potential.
|
158 |
+
|
159 |
+
@ianmatejka3533: The algorithms we use for training and learning are also evolving quickly. New methods like DPO and KTO have made fine-tuning more stable and easier. Breaking down text into smaller units (tokenization), which is a major bottleneck for large language models, is seeing progress with solutions like MegaByte, although they're still not widely used. Similarly, learning from examples in the prompt and understanding how these models work internally are promising but under-researched areas.
|
160 |
+
|
161 |
+
@ianmatejka3533: The transformer isn't the final AI design, but we've barely scratched the surface of what it can do. There's still a lot of low-hanging fruit to explore. While Chollet's criticism has some merit, many experts—like Ilya Sutskever, Dario Amodei, Noam Shazeer, and Andrej Karpathy—believe transformers are good enough to achieve artificial general intelligence, and we should focus on optimizing this design rather than starting over from scratch.
|
162 |
+
|
163 |
+
@rjDOTdev: @adamkadmon6339 Coming from an education background, I know that most people need some practice before they can answer a new type of question. Maybe we're talking about different goals here? Are we trying to create something that learns like a human or something that's much smarter than a human?
|
164 |
+
|
165 |
+
@henrismith7472: Are you all really arguing about this instead of thinking about how much this technology is going to change the world? Even if progress stopped right now, and we just focused on using what we already have, I don't think we fully grasp the impact it would have. And we have at least two more scaling laws to go, extremely efficient and powerful chips being invented, and countless accelerating technologies converging with positive feedback loops... As someone who's just started learning to code, it seems like some of you need to take a step back and look at the bigger picture.
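The parameter counts and bit widths mentioned in the dialog (175 billion parameters, 8 billion parameters, 4 bits per number) can be turned into rough storage figures. The sketch below counts only the bytes needed to hold the weights; it ignores activations, the attention cache, and the small overhead that quantization schemes add for scale factors. The 16-bit baseline is an assumption for comparison, and the function name is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for name, n_params, bits in [
    ("175B at 16-bit", 175e9, 16),
    ("8B at 16-bit", 8e9, 16),
    ("8B at 4-bit", 8e9, 4),
]:
    print(f"{name}: about {weight_memory_gb(n_params, bits):.0f} GB")
```

On these assumptions, moving from a 175B model at 16 bits to an 8B model at 4 bits shrinks the weights by a factor of roughly 90, which is why the smaller model can fit on consumer hardware.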
|
166 |
+
|
167 |
---
|
168 |
license: apache-2.0
|
169 |
---
|