The only used vocabulary words/tokens in this model are the letters of the alphabet?

#1
by MartialTerran - opened

It seems, upon examination of your vocab and tokenizer files, that the only useful vocabulary words/tokens used in this model are limited to the letters of the alphabet? Is that intentional? What is the purpose of using only single-letter tokenization? And why not just arrange them in true alphabetical order? Quote: " e -1 t -2 a -3 o -4 h -5 n -6 s -7 i -8 r -9 d -10 l -11 u -12 w -13 m -14 ↨ -15 g -16 c -17 f -18 y -19 . -20 p -21 , -22 b -23 -24 k -25 v -26 " -27 ' -28 j -29 x -30 z -31 q"

Owner

Honestly. Previously I was making my own tokenizer and it only had 96 characters.

However, I spent months and months trying to get Hugging Face (or any other tools that allow custom models) to accept it, and they all failed unless I manually hacked them to support my model/tokenizer.

By using this setup, with the initial 0xbytes and stuff, all those tools just... work with it.

I can use a Google Colab and tell it to load the model and tokenizer, and all the tools to finetune are happy with it (as long as I fix the upper-case to shift+letter before tokenizing).
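A minimal sketch of that upper-case fix-up, assuming the ↨ shift marker that appears later in this thread; the function names here are mine, not the ones in the actual repo:

    def add_caseifer(text: str, shift: str = "↨") -> str:
        """Lowercase the text, prefixing each originally upper-case letter with the shift marker."""
        return "".join(shift + ch.lower() if ch.isupper() else ch for ch in text)

    def remove_caseifer(text: str, shift: str = "↨") -> str:
        """Invert add_caseifer: re-capitalize the letter that follows each shift marker."""
        out, upper_next = [], False
        for ch in text:
            if ch == shift:
                upper_next = True
            else:
                out.append(ch.upper() if upper_next else ch)
                upper_next = False
        return "".join(out)

    print(add_caseifer("Mrs. Johnson smiled."))       # ↨mrs. ↨johnson smiled.
    print(remove_caseifer("↨mrs. ↨johnson smiled."))  # Mrs. Johnson smiled.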

Owner

And why not in alphabetical order? I tried. Oh God did I try. But I don't actually control the tokenizer I used. I could only force it to include one token for each letter.

My previous versions were in alphabetical order and I liked it more... felt aesthetically pleasing.

The set of alphabet-only tokens might work in some small models such as "makemore" by Andrej Karpathy (see https://www.youtube.com/watch?v=PaCmpygFfXo on YouTube), but it probably will not work in these "AutoTokenizer" small models, because the tokenizers generally distinguish first-letter tokens, such as the "r" in "read" becoming "_r", from a standalone/middle "r" as in "are". So, depending on what is actually happening, you might not really have enough tokens to spell most or any of the words, and your model will founder.

You really will have to examine the TOKENS being selected while words are being tokenized, during training and/or during inference, to see whether the model is functioning or malfunctioning at that stage. You should write a Python function to build a per-letter frequency histogram of the letters in the text going into the tokenizer, and of the alphabet tokens coming out of the tokenizer, and check whether they actually match (a sketch follows below).

As for using only lower-case: I think an experimental TinyStories small language model should be built on the premise that capital letters do not exist at all, lowercasing all letters in the training data and ignoring capitalization entirely, just to see whether the ultra-small model can generate coherent text when trained without splitting the training data between capitalized letters (R) and lowercase letters (r), etc. See the comment about this in the recent "tokenization" tutorial video by Andrej Karpathy: https://www.youtube.com/watch?v=zduSFxRajkE
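A minimal sketch of that histogram check, assuming a Hugging Face AutoTokenizer and a placeholder model path (swap in the actual repo id):

    from collections import Counter
    from transformers import AutoTokenizer

    def compare_letter_histograms(text, model_path="path/to/this-model"):
        tok = AutoTokenizer.from_pretrained(model_path)
        letters_in = Counter(ch for ch in text.lower() if ch.isalpha())
        ids = tok.encode(text, add_special_tokens=False)
        letters_out = Counter(ch for ch in tok.decode(ids).lower() if ch.isalpha())
        for letter in sorted(set(letters_in) | set(letters_out)):
            print(f"{letter}: in={letters_in[letter]:6d} out={letters_out[letter]:6d}")
        return letters_in, letters_out

If the two columns diverge, characters are being dropped or remapped somewhere between the raw text and the token stream.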

Also, you have only 26 tokens supporting around 40,000 different "words" included in the "TinyStories" train text. Think about that. You are burdening each "letter" (token) with the task of identifying and distinguishing many thousands of distinct words per letter-token, even before the model begins to learn any of the rules of grammar or any sentence logic. If you want to use minimal tokens (alphabet), you need to try to obtain and use a TinyStories train.txt that has a truly LIMITED VOCABULARY set, such as 2,000 unique words, or at most 5,000 distinct words, not 40,000 different words. Or you would need to greatly increase the embedding dimension, to something like 32,000 floats per token, to create space for each alphabet-letter token to absorb the weight of its share of inclusion within the 40,000 different words in the TinyStories train.txt.

Owner

I have seen the video. And enjoyed it.

My tokenizer only has one way to make an "r", and it's not combined with anything else. Yes, I know how the default works, but I have jigged it to not have any combined words/word parts. Each letter is its own token, the same as space and all other punctuation.

Yes, I realise what a task I am putting on the individual tokens, along with how impractical it is to expect the model to learn anything about them, but the goal isn't to teach it about the letters, but all the ways they combine into new words/sentences and flow from there.

No, I don't expect it to become nearly as useful as a model with a proper tokenizer that actually has word bits to work from and build a world model from.

But the point is to see what it can do.

(So far it knows cranes help build buildings but are also birds and fly off with them... so... yes... it's not... clever, but I think all TinyStories models might have that issue.)

  1. Without a huge embedding dimension (e.g., 32,000 floats), your single-letter model will never be able to handle the 40,000-word vocabulary of the train text well. Each letter-token has to learn and record its own relationship to something like 10,000+ whole words among the 40,000:
    a: a, as, are, ape, apple, applesauce, peace, ...
    b: be, best, beat, beaten, better, become, beautiful, about ....

  2. At a minimum, you should also have dedicated tokens for the top-100 or top-500 most common filler words: I (capital), you, he, she, it, the, that, they, and, was, is, has, because, big, small, very, .... These would have tokenization priority in the tokenizer (e.g., merges).

  3. If "My tokenizer only has 1 way to make an r", then how does it process "spaces"? The space-r ("_r") tokens in GPT-2 BPE tokenizers is to avoid using a dedicated "space" token between words. So, how are you processing "spaces" between words, if "My tokenizer only has 1 way to make an r"?

  4. "I am thinking of pretraining a GPT2 model from scratch where the tokens must be of whole words (e.g. using the WordLevel tokenizer) instead of subwords (e.g. using the ByteLevelBPETokenizer tokenizer). This is because subwords don’t make sense for my application. Will the GPT2LMHeadModel and GPT2Tokenizer be able to accept word level tokenization?"
    https://discuss.huggingface.co/t/wordlevel-tokenization-with-gpt2/5153

  5. "identify the boundaries of each word in the tokenized sequence. The T5 tokenizer in the transformers library provides a method called tokenize_plus that can be helpful for this task."
    https://discuss.huggingface.co/t/whole-word-masking-for-t5/63869

  6. Tokens to Words mapping in the tokenizer decode step?
    https://github.com/huggingface/tokenizers/issues/447

  7. My experiment(s), if I had time and resources, would be to use the lowercase whole-word tokens from the Eleuther or Neo models, limit the vocabulary used in the tinystories_train.txt to the 5,000 most common words, and output lower-case-only text. I would want to eliminate the distinct space-word ("_r" and "_R") tokens to maximize coherence.

  8. So, how would you handle "spaces" between words if not using separate space-word ("_r" and "_R") tokens? Quote: "My tokenizer only has 1 way to make an r".
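A small illustration of points 3 and 8, using the stock GPT-2 tokenizer (not this model's) to show the word-initial "Ġ"/"_" convention that a dedicated space token replaces:

    from transformers import GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    for text in ["are", " are", "r", " r"]:
        print(repr(text), "->", tok.tokenize(text))
    # The space-prefixed variants come back as separate "Ġ..." tokens; a character-level
    # vocabulary with an explicit space token avoids that split, at the cost of one extra
    # token per word boundary.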

"
The tokens are integers that are then transformed inside the model to floating point vectors.

Tokens are the atomic units of the LLM. So you would think, why not just create a token out of each letter or character? You can, but while this has a small set of tokens, small vocabulary and really no OOV words (Out of Vocabulary), each character contains little to no information.

So training on characters leads to an information starved per token system.

On the other hand, you could have each word be its own token. Here there is maximum information per word, but given all the misspellings, you end up with a massive set of tokens, too much information per word, and lots of OOV words.

So the middle ground is sub-word or small word tokens. Here you have a relatively small set of tokens, a lower amount of OOV words, and the right balance of information per token.

GPT has gone from a 50k token library to now a 100k token library. More tokens is generally better, if you can handle it, so as time goes on, we may get massive million-token tokenizers, but we aren't there.

So why not just go straight to a vector? Well, you could map each word to its own vector, I have done this, and what happens is you get into the “large token” scenario, where there is too much information coming in for the network to handle. So you back off, and go to sub-word, similar to the tokenizers today.

You are trying to make a decision in the network with limited computing resources. So the tokens transform into vectors that are now in a continuous space, and in this space, close vectors have close meaning. So you are globbing meaning to localized chunks in the space, instead of each different thing has a dramatically different internal representation. You need many neurons to make sense of this, so you have to localize and linearize things to get it to work, and get down to a computable number of neurons.
"
https://community.openai.com/t/embedding-tokens-vs-embedding-strings/463213/5

There may be functions to remove and add tokens:
"How to add new tokens to an existing vocabulary. First of all, let’s start at the end: if we have a list of words (for example, specific words from a specialized domain such as medical), it is (very) easy thanks to the Transformers library from Hugging Face to add it to the vocabulary of the downloaded natural language model.
"In order to do that, just use the tokenizer.add_tokens() function to add this list of words to the tokenizer’s existing vocabulary. And if a word from this list already belongs to the existing vocabulary, it will not be added, thus guaranteeing the addition of words not already present. Here is the corresponding code:"
https://medium.com/@pierre_guillou/nlp-how-to-add-a-domain-specific-vocabulary-new-tokens-to-a-subword-tokenizer-already-trained-33ab15613a41
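A short example of the add_tokens() recipe quoted above; the base model and the word list are placeholders:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    new_words = ["angiogram", "bronchoscopy"]      # example domain-specific words
    num_added = tokenizer.add_tokens(new_words)    # words already in the vocab are skipped
    model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to match
    print(f"added {num_added} tokens; new vocab size is {len(tokenizer)}")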

Tool: spaCy
https://medium.com/@pierre_guillou/nlp-how-to-add-a-domain-specific-vocabulary-new-tokens-to-a-subword-tokenizer-already-trained-33ab15613a41
"The solution: use a word tokenizer like spaCY to find new tokens, not a subword tokenizer
spaCY is a words tokenizer well known. Let’s use it to find the most frequent words of our corpus instead of a WordPiece tokenizer which generates subwords as well.

Observation: here, the expression “most frequent words” means “the tokens present in most of the documents”.

You will find all the code in our notebook. In summary, here are the main steps in this process:

  1. Initialize the spaCy tokenizer with the general vocabulary of the language model (which is the same as that of the corpus: here, English).
  2. Get the list of tokens for your documents using the spaCy tokenizer (we don't keep stop words, punctuation, etc.): these tokens are just words!
  3. Using the scikit-learn library, get the IDFs (Inverse Document Frequency) of these tokens.
  4. Using the IDFs, organize the tokens in a list ranging from the most frequent token in the documents to the least frequent one.
  5. Decide which proportion of these new tokens will represent your specialized vocabulary and add them to the existing tokenizer vocabulary.
  6. Resize the model's embedding matrix so that it matches the new tokenizer size (as many new embedding vectors as there are new tokens will be added alongside the token embedding vectors of the existing vocabulary).
Voilà. :-)
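A rough sketch of the spaCy + IDF pipeline summarized above, with a placeholder two-document corpus; the notebook linked in the article does this over the full specialized corpus:

    import spacy
    from sklearn.feature_extraction.text import TfidfVectorizer

    nlp = spacy.load("en_core_web_sm")
    corpus = ["first specialized document ...", "second specialized document ..."]

    def spacy_words(text):
        # keep real words only: no stop words, punctuation, or numbers
        return [t.text.lower() for t in nlp(text) if t.is_alpha and not t.is_stop]

    vec = TfidfVectorizer(analyzer=spacy_words)
    vec.fit(corpus)
    # lower IDF = present in more documents, so sort ascending for "most frequent" first
    ranked = sorted(zip(vec.get_feature_names_out(), vec.idf_), key=lambda p: p[1])
    new_tokens = [word for word, _ in ranked[:500]]   # choose your own proportion here
    # then: tokenizer.add_tokens(new_tokens); model.resize_token_embeddings(len(tokenizer))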

Owner

Honestly, by having a single token for space.
I know the _words approach is the default way the tokenizer does it, and I hate it, hence my issues working with it to get it to do just what I want, nothing more (excluding the stuff at the beginning that... is needed, it seems).

Here is an example output from step 300k where it stopped, and I have to get new data prepared to continue. (↨ is the shift key, as this is the raw output)

↨the garden was filled with vibrant flowers, but the flower that once bloomed now stood stubbornly at the edge of the garden. ↨its petals were wilted and its stem was bent.

↨mrs. ↨johnson, the neighbor, came by to check on her. "↨oh, dear," she said as she looked at the garden. "↨i'm so sorry about that flower. ↨it must have been too tired to care for all the others."

"↨don't worry about it," ↨mrs. ↨johnson replied. "↨i'll go and get some more supplies for it. ↨i won't push it away from my garden."

↨mrs. ↨johnson smiled. "↨that's the spirit. ↨remember, gardening is a struggle. ↨it takes time and patience to get things grown and flourish."

↨mrs. ↨johnson nodded, taking in her words. "↨you're right. ↨i need to learn to be patient and not push things too hard."

↨as ↨mrs. ↨johnson continued to walk through her garden, she noticed a small wilted flower in the corner. ↨she knelt down and picked it up, feeling its softness.

"↨hello there, little one," she said to the flower. "↨you're all alive now, aren't you?"

↨the flower seemed to nod in response, its petals still wilted and its stem bent.

↨mrs. ↨johnson smiled. "↨you know, you remind me of something. ↨it's a reminder that we all have the power to grow and thrive, but sometimes it can be a struggle to keep things going."

↨she placed the flower back in the garden, where it would receive the supplies it needed to grow and flourish. ↨and as she walked away, she felt a sense of peace and contentment. ↨gardening was a new beginning, and she was excited to see what the future held.

or a slightly earlier example from the ~280,000ish step range:

↨amidst the hustle and bustle of the bustling city, in the heart of the financial district, stood the towering headquarters of ↨tech↨solutions ↨inc. ↨the building, a sleek and modern structure, was a testament to the company's innovation and ambition. ↨inside, the air was thick with the pungent scent of computers and the hum of phones ringing.

↨at the center of it all was ↨c↨e↨o, ↨john ↨thompson. ↨he sat behind his mahogany desk, his face serious as he gazed out the window, lost in thought. ↨across the room, his team was huddled around the conference table, their faces serious and determined. ↨the project they were working on had been proving to be a disaster. ↨the software required microcontrolled algorithms and a thick, filthy sludge.

↨john sighed, running a hand through his greying hair. ↨he stood up, and walked over to the window, looking out at the city lights. ↨he couldn't shake the feeling of frustration that gnawed at him. ↨the task was almost complete. ↨he couldn't help but worry about the implications of the success of the project.

↨suddenly, a notification popped up on his computer screen. ↨it was a message from an unknown sender. ↨john's heart raced as he opened it, his fingers trembling. ↨it read, "↨dear ↨tech↨solutions, ↨alex ↨i↨i↨i, ↨displays '↨additions & ↨games.' ↨your sleep-deprived focus on making these additions appear to your team handle quality control. ↨seek cooperation in the marketing department and recall all of the sales figures."

↨john's mind raced as he tried to figure out who could have sent the message. ↨he paced back and forth, deep in thought. ↨the team was running low on supplies, and they were struggling to make progress. ↨he couldn't let them down. ↨he needed to make a decision.

↨john sat back down at his desk, deep in thought. ↨he stared at the software, trying to come up with a solution. ↨he couldn't let this unexpected twist leave him in trouble. ↨he remembered a conversation he'd had with one of his most trusted advisors, ↨mr. ↨thompson. ↨the advisor would listen inte
Output generated in 201.36 seconds (10.17 tokens/s, 2047 tokens, context 1)

So, it does know words, and its biggest problem is that the dataset, being made for children, has some pretty odd stuff in it... even GPT-4 wasn't realistic in its stories, from memory.

Still waking up, and I like your points, and will be thinking on them later in my day.

But honestly, this is just... I know people say this won't work, but no-one has shown why it won't work (said why, sure, but not shown it failing), and when I try, it keeps... not failing, and working better than expected... so I keep trying new models/more data/better data to find where the breaking point is.

So, you are using a dedicated "space" token between words ["a single token for space."]. And you are using the "shift" key token as a pre-token to indicate that "one capitalized letter follows"? [I saw ↨ in other outputs from your models and I did not know what it meant, I thought it was gibberish. But apparently it is valid token data.].
Can you give a link to the exact model (and configs) that created the coherent outputs that you quoted here above?
What do you mean by "step 300k where it stopped" and "the ~280,000ish step range"? Backpropagation/pretraining steps? How exactly do you define a step? What happens during each "step"?

Owner

At work, so this will have to wait till I get home to share the model. But by step I mean each backprop step, using the Karpathy llama2.c git repo.

WandB graph of first 200k steps is here. Next 100k is a fixed loss rate and can be found going up a level. https://wandb.ai/corianas/llamac/runs/dvqj9xdg?nw=nwusercorianas
But each step is 98304 tokens processed before doing an optimizer step.
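For reference, this is roughly how llama2.c-style training arrives at a number like that; the split between batch size and gradient-accumulation steps below is my guess, not the actual run config:

    max_seq_len = 2048
    batch_size = 48                    # could equally be, e.g., 24 with grad_accum = 2
    gradient_accumulation_steps = 1
    tokens_per_iter = gradient_accumulation_steps * batch_size * max_seq_len
    print(tokens_per_iter)             # 98304 tokens consumed per optimizer step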

Owner

But even the 100k checkpoint wasn't too incoherent, unless a capital letter is passed in. The tokenizer processes it into a form that sends the model wild, as it has never encountered it before.

(I think it's a block of the 0xhex numbers but haven't checked yet to be honest.)
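A quick, hedged way to check that guess (the model path is a placeholder for this repo's tokenizer):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("path/to/this-model")
    for text in ["r", "R"]:
        ids = tok.encode(text, add_special_tokens=False)
        print(repr(text), ids, tok.convert_ids_to_tokens(ids))
    # If "R" maps to one or more 0x.. byte-fallback tokens the model never saw during
    # training, that would explain the wild output.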

I have never seen any of this stuff (training telemetry data) before.
I found your hyperparameters:
max_seq_len:2,048
multiple_of:32
n_heads:12
n_kv_heads:12
n_layers:12
out_dir:"Char_out"
vocab_size:341
vocab_source:"custom"
wandb_log:true
wandb_project:"llamac"
wandb_run_name:"run2024_03_25_16_08_12"
warmup_iters:1,000
weight_decay:0.1

https://wandb.ai/corianas/llamac/runs/dvqj9xdg/overview?nw=nwusercorianas

Was it you, or Andrej Karpathy, who wrote this code to insert/remove a "casifier" token to avoid the first-word-of-sentence capitalization burden?:
Adding the casifier token to the text:

    for example in data:
        text = example["story"]
        text = text.strip()

Removing the casifier token from the output response:

    with ctx:
        for k in range(num_samples):
            y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
            print(enc.decode(y[0].tolist()))
            print(remove_caseifer_optimized(enc.decode(y[0].tolist())))
            print('---------------')

diff --git a/tinystories.py b/tinystories.py

job:
https://wandb.ai/corianas/llamac/artifacts/job/job-https___github.com_karpathy_llama2.c.git_train.py/v13

iter:200,000
loss/train:0.29893773794174194
loss/val:0.2729758322238922
lr:0.00001
mfu:9.277976907418696
step_time:63,492.133378982544
tokens:19,660,800,000

Owner

Anything with casifier in it is from me.

I tried fixing the C code to allow me to input upper-case and have it fix itself up, like I got the Python to do. But my C coding sucks, it seems, and I never got it to work.

I wasn't really aware that was pushed into wandb. Interesting.

C coding: I can't code in Python or in C. But I can produce (assemble and debug) 300 lines of working Python code in one day. Like today, I built a Python script to generate 2D histograms of the color frequency of all the pixels in the image files in a folder. Today I used the free Bing Copilot (Creative), which is ChatGPT-4 Turbo.

I believe that you can copy your Python code into these free GPTs and, if you ask them to, they will convert it to C++ code or whatever you want.

Get a free developer API key for Google Gemini Pro, and then try to get access to the Gemini Pro 1.5 model at https://aistudio.google.com/ You can upload or paste in your Python code and try to convert it to C.

I have been trying to use Gemini 1.5 to de-huggingface the Python of a number of downloadable-weights models like GPT-2 variants and LLAMA models (producing standalone model.py, config.json, train.py, sample.py), so that I will be able to run the models (and modify them) on a local gaming PC with a GPU. [Which I have done with the GPT-2 124M model.] I was anticipating the problems that you have been having, trying to overcome the rigidity of the Hugging Face code templates ("Transformers", "AutoTokenizers"). Because you are doing pretraining, you need to use the GPU compute provided by others, and are trapped by the rigid Hugging Face templates. They (Hugging Face) really should accommodate your efforts and experiments, since you are definitely making a contribution to our understanding of the capabilities of small-scale language models. (You should start a blog reporting on the results of your various GPT experiments.)

But I have recently started using Google's Gemini Pro API (and Pro 1.5 in AI Studio, which has a 1 million token limit). I find Gemini is very smart but very lazy. It wants to tell you about the code functions you will need instead of actually writing them. You have to talk to it like a stern boss: "NOW write each of these functions in working Python code immediately. Do not leave out any of the necessary code...."

I have studied the GPT-2 architecture and its variants, and I think that I can hack existing pretrained GPT models in certain ways, or modify model architectures, to make them more useful and more power-efficient. I also want to do something like what you have been doing with the tiny language models, but with whole-word tokens plus the alphabet as the vocabulary (not BPE tokens). At some point I may be ready to reach out to Andrej Karpathy and see if he wants to verify my theories with his math and coding abilities and maybe write a paper/patent together. It seems that Karpathy has not actually published any formal papers or patents on GPT LLMs, but he is studying and teaching them intensively.

You might want to investigate whether you can patent your "casifier/decasifier" method (find out if it is novel and useful, and whether the method was overlooked by the Google team that chose the "space-BPE" tokens). If novel and useful, I could help you write the patent for it. [But, since I had essentially the same idea in my head last week (using a capitalize-flag or new-sentence token prior to the first-letter-of-sentence token), though I did not write code for it, your method is probably not sufficiently novel to be patented; but I do not have much evidence for that conclusion.]

I am going to be very busy on a different project deadline until end of next week, so I hope you make even more progress with the tiny_LLMs and tiny-vocabs, and keep in touch, but I won't be able to keep up with GPTs and your projects until then. Take care.

Owner

Thank you, I will.
I have been trying to document what I am doing, but I have yet to publish anything.

But thank you for your interest, it is really inspiring me to get my act together and start putting things out there more.

Corianas,

I am giving you the Python scripts I developed to examine and EXTRACT the words contained in TinyStoriesV2-GPT4-train.txt.
I have also posted the contents of the Extracted-text.txt files, sorted into categories such as numeralized words, 1-letter words, unicode words, etc.,
in case you want to see the full spectrum of all the junk words used inside TinyStoriesV2-GPT4-train.txt.

Here is the link to the two python scripts and the resulting Extracted-words files.
https://huggingface.co/datasets/MartialTerran/Scripts_for_checking_Train.py

How is it going? Have you made a publication (e.g., a blog post) explaining any positive/negative results you obtained for the tiny-vocabulary GPT models? I saw in a video that large GPT vendors are training GPTs to refuse "jailbreaking" (e.g., "DAN", "tell me how to build napalm"), but people defeat the jailbreaking safeguards by providing prompts as Base64 (MIME) encoded strings that are not human-readable but are GPT-interpretable. Base64 strings would not be well adapted to a BPE vocabulary, so I infer that only the alphabet and numeral portion of the BPE vocabulary would be used to tokenize and build parameters for interpreting the Base64 code. That would be evidence that at least the very large GPTs can interpret Base64 strings nearly as fluently as English, based on a small vocabulary equal to the alphabet plus numbers. Anything in the TinyStories dataset can be converted on the fly into Base64 for training. Do you know of anyone who has tried building a tiny-LM Base64 encoder-and-decoder in a tiny GPT having only the alphabet/numbers as the vocab.json?
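A tiny sketch of that on-the-fly conversion idea: Base64 output only uses A-Z, a-z, 0-9, "+", "/", and "=", so an alphabet-plus-digits vocabulary covers it completely:

    import base64

    story = "once upon a time there was a little crane."
    encoded = base64.b64encode(story.encode("utf-8")).decode("ascii")
    print(encoded)
    print(sorted(set(encoded)))   # the handful of characters a tiny vocab would need
    assert base64.b64decode(encoded).decode("utf-8") == story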

Hi. In reverse order: no models off the top of my head, but I think they would be referred to as byte-level models, though that might just be ones with 16 options (0-F) rather than 00-FF.

Regarding that jailbreak, I will have to look into it, as that's actually fascinating, might even work with Unicode hex I think.

I haven't published anything; I was trying to get a website up and working right, as well as planning a video version (as they seem to attract attention and comments more often). But both work and migraines have distracted me.

I have time set aside for this in the coming week and am praying my health doesn't let me down again.

I will let you know, even if it's only a preprint, as your insights are wonderful.

Thank you again for the interest and prodding me.

Hi again. It is good that you are going forward with your experiments and publishing.

Headaches suck. I occasionally get headaches if I skip drinking coffee; caffeine addiction. (Coffee shrinks blood vessels, but remove the coffee and blood flow increases, something like that.) Some headaches are caused by extra blood pressure inside the brain/skull. In some cases, you can rub/squeegee your hands over your skull and manually push the blood out of the skull (e.g., front to back) to remove the excess (for temporary relief). Not sure about migraines, though.

I was inspired by your massive tiny-GPT pre/training efforts. Using Gemini Pro and/or Bing, I wrote a Python script that implements a 2-input boolean XOR gate using a 3-level neural net (2 input neurons, 5 hidden-level neurons, and 1 output neuron) with sigmoid as the activation function. The code included training (backpropagation) loops, and I added an adaptive alpha (learning rate) when I noticed that the loss seemed to oscillate. Finally, the trained NN essentially sort of worked as a 2-input XOR (about 1 percent loss, but not outputting a perfect 1.0 nor a perfect 0.0). So, that was the first time I "trained" a neural net! This 2-input XOR-NN has been done before, supposedly.

I next tried to implement a 5-input XOR gate using a 3-level NN. It trained, but after 500 iterations over the 2^5 training data set the training failed: some of the input combinations went into permanent failure (100% loss) and most of the input combinations reverted to a 50% error (halfway between the two acceptable outputs). Complete failure. I thought of ways to define a better logical loss and backpropagation (backpropagation per each input combination instead of averaging the loss over multiple input combos), but then got bogged down in Python array dimensionality problems. Abandoned for now. I think it may be impossible to implement a 5-or-more-input XOR gate using a 3-level NN, but I found no example of anyone else trying it before. [The academics seemed to lose interest after building 2-input XOR NNs.]

I do want to next build a kind of "gutted GPT-2" (GPT-2 but without positional encoding and without self-attention) to see how small a model could be and still perform K-input XOR (or any other sequence-independent reproducible Boolean). In theory, a GPT can accept a context-limit-sized input and perform XOR on it. Then, like playing with a Jenga tower, I want to remove hidden layers and embedding dimensions until the trained "gutted GPT-2" model can no longer perform the Boolean operation(s). Then, someday, build a simple GPT that can perform any Boolean operation reliably when given a prompt in Boolean like: XOR(1, 0) = (a sketch of the first 2-input experiment follows below).
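A minimal numpy sketch of that 2-input XOR experiment (a 2-5-1 sigmoid network trained with plain gradient descent on MSE loss; the learning rate and step count are my guesses, not the originals):

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    W1, b1 = rng.normal(0, 1, (2, 5)), np.zeros((1, 5))
    W2, b2 = rng.normal(0, 1, (5, 1)), np.zeros((1, 1))

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    alpha = 2.0                            # fixed learning rate for this sketch
    for step in range(20000):
        h = sigmoid(X @ W1 + b1)           # hidden activations
        out = sigmoid(h @ W2 + b2)         # network output
        # backpropagation through the MSE + sigmoid stack
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= alpha * (h.T @ d_out) / len(X); b2 -= alpha * d_out.mean(0, keepdims=True)
        W1 -= alpha * (X.T @ d_h) / len(X);  b1 -= alpha * d_h.mean(0, keepdims=True)

    print(np.round(out, 3))                # should approach [0, 1, 1, 0]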

Wow, I will love to see the results of your work. It sounds super unique and will generate some interesting results no matter what they are.

Unfortunately my headaches are more structural, from spinal surgery and weather changes, I am sorry to say. But yes, caffeine withdrawals are no joke, and I have suffered them before as well. But thank you for the idea; I always love new things to try towards relief. :-)

Hi. I am sorry that you have suffered the spinal stuff. Don't delay or avoid living your full life for that hindrance, though. Once, I bent wrong and had a pinched peripheral nerve in my neck, and it was debilitating (neuropathy in my arm) for months, but with traction it resolved without surgery.

For weeks I have been greatly distracted and hindered from my GPT studies by deadlines in other areas more directly connected with my education/degrees. I had some time this weekend to study GPT stuff. Because I am focused on small, studiable/hackable GPT implementations, I have explored the future use of Mythic analog MatMul (NN matrix multiplication) chips (each can handle 80M parameters at 8 bits) and, more generally, non-NVIDIA GPUs via tinygrad (GitHub). I wrote an email to tinygrad to try to get them to support Mythic's MatMul ICs in a non-proprietary software (Python script) scenario. (IBM developed a similar IC.)

I was thinking that YOU and some others would someday be able to implement inference-mode versions of your tiny GPT models (below 80M parameters) on the Mythic IC. I was also impressed by the coherence of this tiny 3M-parameter model: https://huggingface.co/calum/tinystories-gpt2-3M
(Oddly, it uses the 50,000-token vocab derived from GPT-2, which bloats it. But it still functions.) (It uses 4-float-wide attention heads, which I have not seen before.)

I wrote to TinyGrad guy:
######################
Martial Terran
Sun, May 5, 12:10 AM (1 day ago)
to george [TinyGrad company]

Do you have a bounty for building drivers that can be called from within Python-based GPT models (TinyStories models; Ronen Eldan, "The TinyStories Dataset: How Small Can Language Models Be And Still Speak Coherent...", youtube.com: https://www.youtube.com/watch?v=wTQH6mRDXhw ) so that they can run on low-power Mythic ICs (80 million parameters)? It seems like [tinygrad and Mythic ICs] would be a good match. In that case, parents could place an educational, talkative GPT into a teddy-bear toy for small children.

The [Mythic] company hints that their chips can run GPT models locally [for inference]:
TEN Capital Presents Trends in AI and Edge Computing (youtube.com)
https://www.youtube.com/watch?v=OkEA-JUyvwQ&t=1403s

Unshackling Edge AI Performance and LLMs - Mythic
https://mythic.ai/whats-new/unshackling-edge-ai-performance-and-llms/

Mythic has its own proprietary software.

M1076-AMP-Product-Brief-v1.0-1.pdf (mythic.ai)

Technology - Mythic

See:
Deploying DNN models to the Mythic AMP™

Mythic AI Workflow - Mythic

It would be nice if you could pair a tinybox computer with a Mythic chip to run the GPT matmul operations in a Python GPT model at low power.
#################
I also want Mythic or someone else to sell a USB-connected version of their analog MatMul IC for experimenters and hobbyists. [But someone also has to write the software/drivers to make it accessible from Python running on a Windows laptop PC, which is why I wrote to the tinygrad guy.] So far, I see no evidence that any GPT has been implemented using Mythic's analog/hybrid ICs. But Mythic hints at supporting the 7B or 70B Llama models: https://www.youtube.com/watch?v=OkEA-JUyvwQ&t=1403s

Because I have a BSEE degree and a professional practical background including flash memory (the basic memory cell type used in the Mythic IC), I think I might have come up with some patentable ideas that I might want to share/license to Mythic, to ultimately accelerate the use of low-power analog ICs for GPT inference mode (e.g., to "place an educational talkative GPT into a teddy-bear toy for small children," as shown in the movie "AI"). [I have not contacted the Mythic company yet, as I wanted to experience their proprietary software first.] My long-term goal is to see the development of a solar-powered TimeCapsuleTeacher that a future jungle mother will want to place in front of her baby to entertain it, and that will serve as an educator in the event of total collapse. The device would also have contemporary applications in spreading education/knowledge to remote places already devoid of English speakers.

https://huggingface.co/TimeCapsuleTeacherLLM

Meanwhile, on the application side, I am trying to develop and promote a GPT AI-based "inconsistency detector" (e.g., lie detector) for use in the legal field (to try to curtail the rampant misrepresentations of law and self-contradictory factual deceptions that now frequently occur in the New York court system).

P.S. Ronen Eldan is now "following" me here on Hugging Face, even though (or because) I pointed out the junk words in the TinyStoriesV2 dataset (and the unnecessary use of the 50,000-token GPT-2-type BPE vocab for a 40,000-unique-word dataset). That follow made me happy. And thank you for the follow and the encouragement that you give me.

PPS:
I got this invitation, free, to AI DevSummit 2024 in San Francisco. (I am not sure what value it has for my interests.) Maybe you are close enough to San Francisco that you can use this and go live:
Free OPEN Passes to AI DevSummit 2024

Details
Register here: https://www.devnetwork.com/registration/?event=AI%20DevSummit%202024&utm_source=meetup&utm_medium=email&utm_campaign=MU6425&discount=MU6425

You must register at the link above (and not just indicate that you are attending here on Meetup).

AI DevSummit 2024 (May 29-30, San Francisco, CA) + (June 5-6, Live Online) is the conference for engineering leaders & technical executives where 2,000+ engineering managers & directors, team leads, and technical executives converge to discover the latest developer & engineering technology & innovations. Learn from leaders at Meta, NASA, Couchbase, Neo4j, Data Monsters, AtlasPro AI, LinearB, OWASP, and many more!

The AI DevSummit team has offered our group 25 free OPEN Passes and discounted PRO Passes, so our members can attend for free.

Register now to get your free OPEN Pass or to SAVE $100 on your PRO Pass.

To register, go to: https://www.devnetwork.com/registration/?event=AI%20DevSummit%202024&utm_source=meetup&utm_medium=email&utm_campaign=MU6425&discount=MU6425
