eyad-silx
/

llm

eyad-silx committed on Dec 30, 2024

Commit

d278d9d

1 Parent(s): 5253ac8

Update repository

Files changed (49)

LICENSE +21 -0
README.md +227 -0
assets/gpt2_124M_loss.png +0 -0
assets/nanogpt.jpg +0 -0
bench.py +117 -0
config/char_config.py +43 -0
config/dtat_config.py +48 -0
config/enwik8_config.py +46 -0
config/eval_gpt2.py +8 -0
config/eval_gpt2_large.py +8 -0
config/eval_gpt2_medium.py +8 -0
config/eval_gpt2_xl.py +8 -0
config/finetune_shakespeare.py +25 -0
config/train_gpt2.py +25 -0
config/train_shakespeare_char.py +37 -0
configurator.py +47 -0
data/openwebtext/prepare.py +81 -0
data/openwebtext/readme.md +15 -0
data/shakespeare/prepare.py +33 -0
data/shakespeare/readme.md +9 -0
data/shakespeare_char/prepare.py +68 -0
data/shakespeare_char/readme.md +9 -0
model.py +330 -0
model_dtat.py +257 -0
model_modified.py +190 -0
prepare_data.py +37 -0
sample.py +89 -0
scaling_laws.ipynb +0 -0
train.py +336 -0
train_baseline.py +228 -0
train_dtat.py +256 -0
train_enwik8.py +114 -0
transformer_sizing.ipynb +402 -0
wandb/run-20241230_125819-geso4xvw/files/config.yaml +47 -0
wandb/run-20241230_125819-geso4xvw/files/output.log +21 -0
wandb/run-20241230_125819-geso4xvw/files/wandb-metadata.json +43 -0
wandb/run-20241230_125819-geso4xvw/files/wandb-summary.json +1 -0
wandb/run-20241230_125819-geso4xvw/logs/debug-core.log +14 -0
wandb/run-20241230_125819-geso4xvw/logs/debug-internal.log +16 -0
wandb/run-20241230_125819-geso4xvw/logs/debug.log +26 -0
wandb/run-20241230_125819-geso4xvw/run-geso4xvw.wandb +0 -0
wandb/run-20241230_125924-h4hgg9ir/files/config.yaml +47 -0
wandb/run-20241230_125924-h4hgg9ir/files/output.log +29 -0
wandb/run-20241230_125924-h4hgg9ir/files/wandb-metadata.json +43 -0
wandb/run-20241230_125924-h4hgg9ir/files/wandb-summary.json +1 -0
wandb/run-20241230_125924-h4hgg9ir/logs/debug-core.log +14 -0
wandb/run-20241230_125924-h4hgg9ir/logs/debug-internal.log +17 -0
wandb/run-20241230_125924-h4hgg9ir/logs/debug.log +26 -0
wandb/run-20241230_125924-h4hgg9ir/run-h4hgg9ir.wandb +0 -0

LICENSE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2022 Andrej Karpathy
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md ADDED

	@@ -0,0 +1,227 @@

+# nanoGPT
+![nanoGPT](assets/nanogpt.jpg)
+The simplest, fastest repository for training/finetuning medium-sized GPTs. It is a rewrite of [minGPT](https://github.com/karpathy/minGPT) that prioritizes teeth over education. Still under active development, but currently the file `train.py` reproduces GPT-2 (124M) on OpenWebText, running on a single 8XA100 40GB node in about 4 days of training. The code itself is plain and readable: `train.py` is a ~300-line boilerplate training loop and `model.py` a ~300-line GPT model definition, which can optionally load the GPT-2 weights from OpenAI. That's it.
+![repro124m](assets/gpt2_124M_loss.png)
+Because the code is so simple, it is very easy to hack to your needs, train new models from scratch, or finetune pretrained checkpoints (e.g. the biggest one currently available as a starting point is the 1558M-parameter GPT-2 XL model from OpenAI).
+## install
+```
+pip install torch numpy transformers datasets tiktoken wandb tqdm
+```
+Dependencies:
+- [pytorch](https://pytorch.org) <3
+- [numpy](https://numpy.org/install/) <3
+-  `transformers` for huggingface transformers <3 (to load GPT-2 checkpoints)
+-  `datasets` for huggingface datasets <3 (if you want to download + preprocess OpenWebText)
+-  `tiktoken` for OpenAI's fast BPE code <3
+-  `wandb` for optional logging <3
+-  `tqdm` for progress bars <3
+## quick start
+If you are not a deep learning professional and you just want to feel the magic and get your feet wet, the fastest way to get started is to train a character-level GPT on the works of Shakespeare. First, we download it as a single (1MB) file and turn it from raw text into one large stream of integers:
+```sh
+python data/shakespeare_char/prepare.py
+```
+This creates a `train.bin` and `val.bin` in that data directory. Now it is time to train your GPT. The size of it very much depends on the computational resources of your system:
+**I have a GPU**. Great, we can quickly train a baby GPT with the settings provided in the [config/train_shakespeare_char.py](config/train_shakespeare_char.py) config file:
+```sh
+python train.py config/train_shakespeare_char.py
+```
+If you peek inside it, you'll see that we're training a GPT with a context size of up to 256 characters, 384 feature channels, and it is a 6-layer Transformer with 6 heads in each layer. On one A100 GPU this training run takes about 3 minutes and the best validation loss is 1.4697. Based on the configuration, the model checkpoints are being written into the `--out_dir` directory `out-shakespeare-char`. So once the training finishes we can sample from the best model by pointing the sampling script at this directory:
+```sh
+python sample.py --out_dir=out-shakespeare-char
+```
+This generates a few samples, for example:
+```
+ANGELO:
+And cowards it be strawn to my bed,
+And thrust the gates of my threats,
+Because he that ale away, and hang'd
+An one with him.
+DUKE VINCENTIO:
+I thank your eyes against it.
+DUKE VINCENTIO:
+Then will answer him to save the malm:
+And what have you tyrannous shall do this?
+DUKE VINCENTIO:
+If you have done evils of all disposition
+To end his power, the day of thrust for a common men
+That I leave, to fight with over-liking
+Hasting in a roseman.
+```
+lol  `¯\_(ツ)_/¯`. Not bad for a character-level model after 3 minutes of training on a GPU. Better results are quite likely obtainable by instead finetuning a pretrained GPT-2 model on this dataset (see finetuning section later).
+**I only have a macbook** (or other cheap computer). No worries, we can still train a GPT but we want to dial things down a notch. I recommend getting the bleeding edge PyTorch nightly ([select it here](https://pytorch.org/get-started/locally/) when installing) as it is currently quite likely to make your code more efficient. But even without it, a simple train run could look as follows:
+```sh
+python train.py config/train_shakespeare_char.py --device=cpu --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0
+```
+Here, since we are running on CPU instead of GPU, we must set `--device=cpu` and also turn off PyTorch 2.0 compile with `--compile=False`. When we evaluate we get a noisier but faster estimate (`--eval_iters=20`, down from 200), our context size is only 64 characters instead of 256, and the batch size is only 12 examples per iteration, not 64. We'll also use a much smaller Transformer (4 layers, 4 heads, 128 embedding size) and decrease the number of iterations to 2000 (and, as is usual, set `--lr_decay_iters` to about the same value as `max_iters` so the learning rate decays over the whole run). Because our network is so small we also ease down on regularization (`--dropout=0.0`). This still runs in about 3 minutes, but reaches a loss of only 1.88 and therefore produces worse samples. It's still good fun:
+```sh
+python sample.py --out_dir=out-shakespeare-char --device=cpu
+```
+Generates samples like this:
+```
+GLEORKEN VINGHARD III:
+Whell's the couse, the came light gacks,
+And the for mought you in Aut fries the not high shee
+bot thou the sought bechive in that to doth groan you,
+No relving thee post mose the wear
+```
+Not bad for ~3 minutes on a CPU, for a hint of the right character gestalt. If you're willing to wait longer, feel free to tune the hyperparameters, increase the size of the network, the context length (`--block_size`), the length of training, etc.
+Finally, on Apple Silicon Macbooks and with a recent PyTorch version make sure to add `--device=mps` (short for "Metal Performance Shaders"); PyTorch then uses the on-chip GPU that can *significantly* accelerate training (2-3X) and allow you to use larger networks. See [Issue 28](https://github.com/karpathy/nanoGPT/issues/28) for more.
+## reproducing GPT-2
+A more serious deep learning professional may be more interested in reproducing GPT-2 results. So here we go - we first tokenize the dataset, in this case the [OpenWebText](https://openwebtext2.readthedocs.io/en/latest/), an open reproduction of OpenAI's (private) WebText:
+```sh
+python data/openwebtext/prepare.py
+```
+This downloads and tokenizes the [OpenWebText](https://huggingface.co/datasets/openwebtext) dataset. It will create a `train.bin` and a `val.bin`, each holding the GPT-2 BPE token ids in one sequence, stored as raw uint16 bytes. Then we're ready to kick off training. To reproduce GPT-2 (124M) you'll want at least an 8X A100 40GB node and run:
+```sh
+torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py
+```
+This will run for about 4 days using PyTorch Distributed Data Parallel (DDP) and go down to loss of ~2.85. Now, a GPT-2 model just evaluated on OWT gets a val loss of about 3.11, but if you finetune it it will come down to ~2.85 territory (due to an apparent domain gap), making the two models ~match.
+If you're in a cluster environment and you are blessed with multiple GPU nodes you can make GPU go brrrr e.g. across 2 nodes like:
+```sh
+# Run on the first (master) node with example IP 123.456.123.456:
+torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=123.456.123.456 --master_port=1234 train.py
+# Run on the worker node:
+torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=123.456.123.456 --master_port=1234 train.py
+```
+It is a good idea to benchmark your interconnect (e.g. iperf3). In particular, if you don't have Infiniband then also prepend `NCCL_IB_DISABLE=1` to the above launches. Your multinode training will work, but most likely _crawl_. By default checkpoints are periodically written to the `--out_dir`. We can sample from the model by simply `python sample.py`.
+Finally, to train on a single GPU simply run the `python train.py` script. Have a look at all of its args, the script tries to be very readable, hackable and transparent. You'll most likely want to tune a number of those variables depending on your needs.
+## baselines
+OpenAI GPT-2 checkpoints allow us to get some baselines in place for openwebtext. We can get the numbers as follows:
+```sh
+$ python train.py config/eval_gpt2.py
+$ python train.py config/eval_gpt2_medium.py
+$ python train.py config/eval_gpt2_large.py
+$ python train.py config/eval_gpt2_xl.py
+```
+and observe the following losses on train and val:
+| model | params | train loss | val loss |
+| ------| ------ | ---------- | -------- |
+| gpt2 | 124M         | 3.11  | 3.12     |
+| gpt2-medium | 350M  | 2.85  | 2.84     |
+| gpt2-large | 774M   | 2.66  | 2.67     |
+| gpt2-xl | 1558M     | 2.56  | 2.54     |
+However, we have to note that GPT-2 was trained on (closed, never released) WebText, while OpenWebText is just a best-effort open reproduction of this dataset. This means there is a dataset domain gap. Indeed, taking the GPT-2 (124M) checkpoint and finetuning on OWT directly for a while reaches loss down to ~2.85. This then becomes the more appropriate baseline w.r.t. reproduction.
+## finetuning
+Finetuning is no different than training, we just make sure to initialize from a pretrained model and train with a smaller learning rate. For an example of how to finetune a GPT on new text go to `data/shakespeare` and run `prepare.py` to download the tiny shakespeare dataset and render it into a `train.bin` and `val.bin`, using the OpenAI BPE tokenizer from GPT-2. Unlike OpenWebText this will run in seconds. Finetuning can take very little time, e.g. on a single GPU just a few minutes. Run an example finetuning like:
+```sh
+python train.py config/finetune_shakespeare.py
+```
+This will load the config parameter overrides in `config/finetune_shakespeare.py` (I didn't tune them much though). Basically, we initialize from a GPT2 checkpoint with `init_from` and train as normal, except shorter and with a small learning rate. If you're running out of memory try decreasing the model size (they are `{'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}`) or possibly decreasing the `block_size` (context length). The best checkpoint (lowest validation loss) will be in the `out_dir` directory, e.g. in `out-shakespeare` by default, per the config file. You can then run the code in `sample.py --out_dir=out-shakespeare`:
+```
+THEODORE:
+Thou shalt sell me to the highest bidder: if I die,
+I sell thee to the first; if I go mad,
+I sell thee to the second; if I
+lie, I sell thee to the third; if I slay,
+I sell thee to the fourth: so buy or sell,
+I tell thee again, thou shalt not sell my
+possession.
+JULIET:
+And if thou steal, thou shalt not sell thyself.
+THEODORE:
+I do not steal; I sell the stolen goods.
+THEODORE:
+Thou know'st not what thou sell'st; thou, a woman,
+Thou art ever a victim, a thing of no worth:
+Thou hast no right, no right, but to be sold.
+```
+Whoa there, GPT, entering some dark place over there. I didn't really tune the hyperparameters in the config too much, feel free to try!
+## sampling / inference
+Use the script `sample.py` to sample either from pre-trained GPT-2 models released by OpenAI, or from a model you trained yourself. For example, here is a way to sample from the largest available `gpt2-xl` model:
+```sh
+python sample.py \
+    --init_from=gpt2-xl \
+    --start="What is the answer to life, the universe, and everything?" \
+    --num_samples=5 --max_new_tokens=100
+```
+If you'd like to sample from a model you trained, use the `--out_dir` to point the code appropriately. You can also prompt the model with some text from a file, e.g. ```python sample.py --start=FILE:prompt.txt```.
+## efficiency notes
+For simple model benchmarking and profiling, `bench.py` might be useful. It's identical to what happens in the meat of the training loop of `train.py`, but omits much of the other complexities.
+Note that the code by default uses [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/). At the time of writing (Dec 29, 2022) this makes `torch.compile()` available in the nightly release. The improvement from the one line of code is noticeable, e.g. cutting down iteration time from ~250ms / iter to 135ms / iter. Nice work PyTorch team!
+## todos
+- Investigate and add FSDP instead of DDP
+- Eval zero-shot perplexities on standard evals (e.g. LAMBADA? HELM? etc.)
+- Finetune the finetuning script, I think the hyperparams are not great
+- Schedule for linear batch size increase during training
+- Incorporate other embeddings (rotary, alibi)
+- Separate out the optim buffers from model params in checkpoints I think
+- Additional logging around network health (e.g. gradient clip events, magnitudes)
+- Few more investigations around better init etc.
+## troubleshooting
+Note that by default this repo uses PyTorch 2.0 (i.e. `torch.compile`). This is fairly new and experimental, and not yet available on all platforms (e.g. Windows). If you're running into related error messages try to disable this by adding `--compile=False` flag. This will slow down the code but at least it will run.
+For some context on this repository, GPT, and language modeling it might be helpful to watch my [Zero To Hero series](https://karpathy.ai/zero-to-hero.html). Specifically, the [GPT video](https://www.youtube.com/watch?v=kCc8FmEb1nY) is popular if you have some prior language modeling context.
+For more questions/discussions feel free to stop by **#nanoGPT** on Discord:
+[![](https://dcbadge.vercel.app/api/server/3zy8kqD9Cp?compact=true&style=flat)](https://discord.gg/3zy8kqD9Cp)
+## acknowledgements
+All nanoGPT experiments are powered by GPUs on [Lambda labs](https://lambdalabs.com), my favorite Cloud GPU provider. Thank you Lambda labs for sponsoring nanoGPT!
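
The README describes `train.bin` and `val.bin` as one long sequence of token ids stored as raw uint16 bytes. The following is a minimal sketch of that storage format using only the standard library. The character-level vocabulary, the helper names, and the sample text are assumptions made for illustration; the repository's own `prepare.py` scripts use NumPy rather than `array`.

```python
from array import array

def encode_chars(text):
    """Map each distinct character to an integer id (hypothetical helper)."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    return [stoi[ch] for ch in text], chars

def to_uint16_bytes(ids):
    """Pack ids as unsigned 16-bit integers; every id must be below 2**16."""
    return array("H", ids).tobytes()

def from_uint16_bytes(raw):
    """Unpack raw bytes back into a list of integer ids."""
    ids = array("H")
    ids.frombytes(raw)
    return list(ids)

text = "hello world"
ids, chars = encode_chars(text)
raw = to_uint16_bytes(ids)          # what would be written to train.bin
decoded = "".join(chars[i] for i in from_uint16_bytes(raw))
```

Two bytes per token is enough here because the GPT-2 BPE vocabulary (and any byte- or character-level vocabulary) fits below 2**16.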

assets/gpt2_124M_loss.png ADDED

assets/nanogpt.jpg ADDED

bench.py ADDED

	@@ -0,0 +1,117 @@

+"""
+A much shorter version of train.py for benchmarking
+"""
+import os
+from contextlib import nullcontext
+import numpy as np
+import time
+import torch
+from model import GPTConfig, GPT
+# -----------------------------------------------------------------------------
+batch_size = 12
+block_size = 1024
+bias = False
+real_data = True
+seed = 1337
+device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
+dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16' # 'float32' or 'bfloat16' or 'float16'
+compile = True # use PyTorch 2.0 to compile the model to be faster
+profile = False # use pytorch profiler, or just simple benchmarking?
+exec(open('configurator.py').read()) # overrides from command line or config file
+# -----------------------------------------------------------------------------
+torch.manual_seed(seed)
+torch.cuda.manual_seed(seed)
+torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
+torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
+device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
+ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
+ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)
+# data loading init
+if real_data:
+    dataset = 'openwebtext'
+    data_dir = os.path.join('data', dataset)
+    train_data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')
+    def get_batch(split):
+        data = train_data # note ignore split in benchmarking script
+        ix = torch.randint(len(data) - block_size, (batch_size,))
+        x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
+        y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
+        x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
+        return x, y
+else:
+    # alternatively, if fixed data is desired to not care about data loading
+    x = torch.randint(50304, (batch_size, block_size), device=device)
+    y = torch.randint(50304, (batch_size, block_size), device=device)
+    get_batch = lambda split: (x, y)
+# model init
+gptconf = GPTConfig(
+    block_size = block_size, # how far back does the model look? i.e. context size
+    n_layer = 12, n_head = 12, n_embd = 768, # size of the model
+    dropout = 0, # for determinism
+    bias = bias,
+)
+model = GPT(gptconf)
+model.to(device)
+optimizer = model.configure_optimizers(weight_decay=1e-2, learning_rate=1e-4, betas=(0.9, 0.95), device_type=device_type)
+if compile:
+    print("Compiling model...")
+    model = torch.compile(model) # pytorch 2.0
+if profile:
+    # useful docs on pytorch profiler:
+    # - tutorial https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html
+    # - api https://pytorch.org/docs/stable/profiler.html#torch.profiler.profile
+    wait, warmup, active = 5, 5, 5
+    num_steps = wait + warmup + active
+    with torch.profiler.profile(
+        activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
+        schedule=torch.profiler.schedule(wait=wait, warmup=warmup, active=active, repeat=1),
+        on_trace_ready=torch.profiler.tensorboard_trace_handler('./bench_log'),
+        record_shapes=False,
+        profile_memory=False,
+        with_stack=False, # incurs an additional overhead, disable if not needed
+        with_flops=True,
+        with_modules=False, # only for torchscript models atm
+    ) as prof:
+        X, Y = get_batch('train')
+        for k in range(num_steps):
+            with ctx:
+                logits, loss = model(X, Y)
+            X, Y = get_batch('train')
+            optimizer.zero_grad(set_to_none=True)
+            loss.backward()
+            optimizer.step()
+            lossf = loss.item()
+            print(f"{k}/{num_steps} loss: {lossf:.4f}")
+            prof.step() # notify the profiler at end of each step
+else:
+    # simple benchmarking
+    torch.cuda.synchronize()
+    for stage, num_steps in enumerate([10, 20]): # burnin, then benchmark
+        t0 = time.time()
+        X, Y = get_batch('train')
+        for k in range(num_steps):
+            with ctx:
+                logits, loss = model(X, Y)
+            X, Y = get_batch('train')
+            optimizer.zero_grad(set_to_none=True)
+            loss.backward()
+            optimizer.step()
+            lossf = loss.item()
+            print(f"{k}/{num_steps} loss: {lossf:.4f}")
+        torch.cuda.synchronize()
+        t1 = time.time()
+        dt = t1-t0
+        mfu = model.estimate_mfu(batch_size * 1 * num_steps, dt)
+        if stage == 1:
+            print(f"time per iteration: {dt/num_steps*1000:.4f}ms, MFU: {mfu*100:.2f}%")
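
The simple-benchmarking branch of `bench.py` runs a burn-in stage first and reports timing only for the second stage. A minimal, framework-free sketch of that pattern is shown here; `benchmark` and `step_fn` are hypothetical names, and the dummy workload stands in for a real forward/backward/optimizer step.

```python
import time

def benchmark(step_fn, burnin_steps=10, bench_steps=20):
    """Run step_fn for a burn-in stage, then time a second stage.

    Returns the average milliseconds per iteration of the timed stage only.
    """
    ms_per_iter = None
    for stage, num_steps in enumerate([burnin_steps, bench_steps]):
        t0 = time.perf_counter()
        for _ in range(num_steps):
            step_fn()
        dt = time.perf_counter() - t0
        if stage == 1:  # stage 0 is warm-up and is discarded
            ms_per_iter = dt / num_steps * 1000
    return ms_per_iter

calls = []
ms = benchmark(lambda: calls.append(sum(range(1000))))
print(f"time per iteration: {ms:.4f}ms")
```

On a GPU the clock should be read only after the device has finished its queued work, which is why `bench.py` calls `torch.cuda.synchronize()` before starting and before stopping the timer.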

config/char_config.py ADDED

	@@ -0,0 +1,43 @@

+"""
+Configuration for character-level language model on enwik8
+Targeting ~44M parameters for comparison with baseline models
+"""
+# Model configuration
+config = {
+    # Dataset params
+    'dataset': 'enwik8',
+    'vocab_size': 256,  # Character-level, so 256 possible byte values
+    'block_size': 1024,  # Context length
+    # Model params (tuned for ~44M parameters)
+    'n_layer': 12,
+    'n_head': 8,
+    'n_embd': 512,
+    'dropout': 0.1,
+    'bias': False,  # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
+    # Training params
+    'learning_rate': 6e-4,
+    'max_iters': 100000,
+    'weight_decay': 1e-1,
+    'beta1': 0.9,
+    'beta2': 0.95,
+    'grad_clip': 1.0,
+    # Learning rate decay settings
+    'decay_lr': True,
+    'warmup_iters': 2000,
+    'lr_decay_iters': 100000,
+    'min_lr': 6e-5,
+    # Evaluation and logging
+    'eval_interval': 500,
+    'log_interval': 100,
+    'eval_iters': 200,
+    # System
+    'device': 'cuda',  # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
+    'dtype': 'bfloat16',  # 'float32', 'bfloat16', or 'float16'
+    'compile': True,  # use PyTorch 2.0 to compile the model to be faster
+}
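
As a rough sanity check on the parameter target in the docstring, the size of a standard GPT-2-style stack can be estimated from `n_layer`, `n_embd`, `vocab_size`, and `block_size`. This sketch assumes the usual block layout (attention with fused QKV plus output projection, and an MLP with a 4x hidden width), no biases, tied output embeddings, and it ignores LayerNorm weights; the model actually used with this config may differ, so the result is only an approximation.

```python
def approx_gpt_params(n_layer, n_embd, vocab_size, block_size):
    """Approximate parameter counts for a GPT-2-style model (assumed layout)."""
    per_block = 12 * n_embd ** 2        # 4*d^2 for attention + 8*d^2 for the MLP
    blocks = n_layer * per_block
    tok_emb = vocab_size * n_embd       # token embedding table
    pos_emb = block_size * n_embd       # learned position embeddings
    return blocks, tok_emb, pos_emb

blocks, tok_emb, pos_emb = approx_gpt_params(
    n_layer=12, n_embd=512, vocab_size=256, block_size=1024
)
total = blocks + tok_emb + pos_emb
print(f"approx. {total / 1e6:.1f}M parameters")
```

Under these assumptions the count comes out at roughly 38M, so reaching the stated ~44M presumably depends on components that this simple estimate leaves out.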

config/dtat_config.py ADDED

	@@ -0,0 +1,48 @@

+"""
+Configuration for Dynamic Token-Aware Transformer (DTAT) on enwik8
+"""
+class DTATConfig:
+    def __init__(self):
+        # Model architecture
+        self.block_size = 1024
+        self.vocab_size = 256  # byte-level vocabulary
+        self.n_layer = 12
+        self.n_head = 8
+        self.n_embd = 512
+        self.dropout = 0.1
+        self.bias = False
+        # DTAT specific parameters
+        self.sparse_topk = 32  # Number of tokens to attend to for less important tokens
+        # Training parameters
+        self.batch_size = 32  # Added batch_size
+        self.learning_rate = 6e-4
+        self.weight_decay = 1e-1
+        self.beta1 = 0.9
+        self.beta2 = 0.95
+        self.grad_clip = 1.0
+        self.warmup_iters = 2000
+        # Learning rate schedule
+        self.decay_lr = True
+        self.lr_decay_iters = 100000
+        self.min_lr = 6e-5
+        # Training loop
+        self.max_iters = 100000
+        self.eval_interval = 500
+        self.log_interval = 100
+        self.eval_iters = 200
+        # System
+        self.device = 'cuda'
+        self.dtype = 'bfloat16'
+        self.compile = True
+    def get_config(self):
+        return self
+def get_config():
+    return DTATConfig()

config/enwik8_config.py ADDED

	@@ -0,0 +1,46 @@

+"""
+Configuration for enwik8 dataset using NanoGPT architecture
+Targeting ~44M parameters for comparison with baseline models
+"""
+import ml_collections
+def get_config():
+    config = ml_collections.ConfigDict()
+    # model
+    config.block_size = 1024
+    config.vocab_size = 256  # 256 possible byte values
+    config.n_layer = 12
+    config.n_head = 8
+    config.n_embd = 512
+    config.dropout = 0.1
+    config.bias = False
+    # adamw optimizer
+    config.learning_rate = 6e-4
+    config.max_iters = 100000
+    config.weight_decay = 1e-1
+    config.beta1 = 0.9
+    config.beta2 = 0.95
+    config.grad_clip = 1.0
+    # learning rate decay settings
+    config.decay_lr = True
+    config.warmup_iters = 2000
+    config.lr_decay_iters = 100000
+    config.min_lr = 6e-5
+    # system
+    config.device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1' etc.
+    config.dtype = 'bfloat16' # 'float32', 'bfloat16', or 'float16'
+    config.compile = True # use PyTorch 2.0 to compile the model to be faster
+    # data
+    config.dataset = 'enwik8'
+    config.batch_size = 32
+    config.eval_interval = 500
+    config.log_interval = 100
+    config.eval_iters = 200
+    return config
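
The `decay_lr`, `warmup_iters`, `lr_decay_iters`, and `min_lr` settings above describe a learning-rate schedule but not its shape. One common choice, and the one assumed here, is linear warmup followed by cosine decay down to `min_lr`. The function below is a sketch under that assumption; the schedule actually used is defined in the training scripts, not in this config.

```python
import math

def get_lr(it, learning_rate=6e-4, min_lr=6e-5,
           warmup_iters=2000, lr_decay_iters=100000):
    """Assumed schedule: linear warmup, then cosine decay to min_lr."""
    if it < warmup_iters:
        # ramp linearly from 0 up to learning_rate
        return learning_rate * it / warmup_iters
    if it >= lr_decay_iters:
        # after the decay window, hold at the floor
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 to 0
    return min_lr + coeff * (learning_rate - min_lr)

for step in (0, 1000, 2000, 51000, 100000):
    print(step, get_lr(step))
```

Setting `lr_decay_iters` equal to `max_iters`, as this config does, means the learning rate reaches `min_lr` exactly at the end of training.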

config/eval_gpt2.py ADDED

	@@ -0,0 +1,8 @@

+# evaluate the base gpt2
+# n_layer=12, n_head=12, n_embd=768
+# 124M parameters
+batch_size = 8
+eval_iters = 500 # use more iterations to get good estimate
+eval_only = True
+wandb_log = False
+init_from = 'gpt2'

config/eval_gpt2_large.py ADDED

	@@ -0,0 +1,8 @@

+# evaluate the base gpt2-large
+# n_layer=36, n_head=20, n_embd=1280
+# 774M parameters
+batch_size = 8
+eval_iters = 500 # use more iterations to get good estimate
+eval_only = True
+wandb_log = False
+init_from = 'gpt2-large'

config/eval_gpt2_medium.py ADDED

	@@ -0,0 +1,8 @@

+# evaluate the base gpt2-medium
+# n_layer=24, n_head=16, n_embd=1024
+# 350M parameters
+batch_size = 8
+eval_iters = 500 # use more iterations to get good estimate
+eval_only = True
+wandb_log = False
+init_from = 'gpt2-medium'

config/eval_gpt2_xl.py ADDED

	@@ -0,0 +1,8 @@

+# evaluate the base gpt2-xl
+# n_layer=48, n_head=25, n_embd=1600
+# 1558M parameters
+batch_size = 8
+eval_iters = 500 # use more iterations to get good estimate
+eval_only = True
+wandb_log = False
+init_from = 'gpt2-xl'

config/finetune_shakespeare.py ADDED

	@@ -0,0 +1,25 @@

+import time
+out_dir = 'out-shakespeare'
+eval_interval = 5
+eval_iters = 40
+wandb_log = False # feel free to turn on
+wandb_project = 'shakespeare'
+wandb_run_name = 'ft-' + str(time.time())
+dataset = 'shakespeare'
+init_from = 'gpt2-xl' # this is the largest GPT-2 model
+# only save checkpoints if the validation loss improves
+always_save_checkpoint = False
+# the number of examples per iter:
+# 1 batch_size * 32 grad_accum * 1024 tokens = 32,768 tokens/iter
+# shakespeare has 301,966 tokens, so 1 epoch ~= 9.2 iters
+batch_size = 1
+gradient_accumulation_steps = 32
+max_iters = 20
+# finetune at constant LR
+learning_rate = 3e-5
+decay_lr = False

config/train_gpt2.py ADDED

	@@ -0,0 +1,25 @@

+# config for training GPT-2 (124M) down to very nice loss of ~2.85 on 1 node of 8X A100 40GB
+# launch as the following (e.g. in a screen session) and wait ~5 days:
+# $ torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py
+wandb_log = True
+wandb_project = 'owt'
+wandb_run_name='gpt2-124M'
+# these make the total batch size be ~0.5M
+# 12 batch size * 1024 block size * 5 gradaccum * 8 GPUs = 491,520
+batch_size = 12
+block_size = 1024
+gradient_accumulation_steps = 5 * 8
+# this makes total number of tokens be 300B
+max_iters = 600000
+lr_decay_iters = 600000
+# eval stuff
+eval_interval = 1000
+eval_iters = 200
+log_interval = 10
+# weight decay
+weight_decay = 1e-1
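
The comments in this config give the total batch size (~0.5M tokens) and the total token budget (300B) as rounded figures. The arithmetic behind them can be checked directly from the values in the file; the variable names `tokens_per_iter` and `total_tokens` are introduced here only for illustration.

```python
batch_size = 12
block_size = 1024
gradient_accumulation_steps = 5 * 8   # 5 accumulation steps across 8 GPUs
max_iters = 600000

# tokens consumed per optimizer step, summed over all GPUs
tokens_per_iter = batch_size * block_size * gradient_accumulation_steps

# tokens consumed over the whole run
total_tokens = tokens_per_iter * max_iters

print(f"tokens per iteration: {tokens_per_iter:,}")
print(f"total tokens: {total_tokens / 1e9:.1f}B")
```

The exact total is a little under 295B tokens, which the config rounds to 300B.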

config/train_shakespeare_char.py ADDED

	@@ -0,0 +1,37 @@

+# train a miniature character-level shakespeare model
+# good for debugging and playing on macbooks and such
+out_dir = 'out-shakespeare-char'
+eval_interval = 250 # keep frequent because we'll overfit
+eval_iters = 200
+log_interval = 10 # don't print too too often
+# we expect to overfit on this small dataset, so only save when val improves
+always_save_checkpoint = False
+wandb_log = False # override via command line if you like
+wandb_project = 'shakespeare-char'
+wandb_run_name = 'mini-gpt'
+dataset = 'shakespeare_char'
+gradient_accumulation_steps = 1
+batch_size = 64
+block_size = 256 # context of up to 256 previous characters
+# baby GPT model :)
+n_layer = 6
+n_head = 6
+n_embd = 384
+dropout = 0.2
+learning_rate = 1e-3 # with baby networks can afford to go a bit higher
+max_iters = 5000
+lr_decay_iters = 5000 # make equal to max_iters usually
+min_lr = 1e-4 # learning_rate / 10 usually
+beta2 = 0.99 # make a bit bigger because number of tokens per iter is small
+warmup_iters = 100 # not super necessary potentially
+# on macbook also add
+# device = 'cpu'  # run on cpu only
+# compile = False # do not torch compile the model

configurator.py ADDED

	@@ -0,0 +1,47 @@

+"""
+Poor Man's Configurator. Probably a terrible idea. Example usage:
+$ python train.py config/override_file.py --batch_size=32
+this will first run config/override_file.py, then override batch_size to 32
+The code in this file will be run as follows from e.g. train.py:
+>>> exec(open('configurator.py').read())
+So it's not a Python module, it's just shuttling this code away from train.py
+The code in this script then overrides the globals()
+I know people are not going to love this, I just really dislike configuration
+complexity and having to prepend config. to every single variable. If someone
+comes up with a better simple Python solution I am all ears.
+"""
+import sys
+from ast import literal_eval
+for arg in sys.argv[1:]:
+    if '=' not in arg:
+        # assume it's the name of a config file
+        assert not arg.startswith('--')
+        config_file = arg
+        print(f"Overriding config with {config_file}:")
+        with open(config_file) as f:
+            print(f.read())
+        exec(open(config_file).read())
+    else:
+        # assume it's a --key=value argument
+        assert arg.startswith('--')
+        key, val = arg.split('=', 1) # split on the first '=' only, so values may themselves contain '='
+        key = key[2:]
+        if key in globals():
+            try:
+                # attempt to eval it (e.g. if bool, number, or etc)
+                attempt = literal_eval(val)
+            except (SyntaxError, ValueError):
+                # if that goes wrong, just use the string
+                attempt = val
+            # ensure the types match ok
+            assert type(attempt) == type(globals()[key])
+            # cross fingers
+            print(f"Overriding: {key} = {attempt}")
+            globals()[key] = attempt
+        else:
+            raise ValueError(f"Unknown config key: {key}")
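
The `--key=value` branch of `configurator.py` mutates `globals()` of whichever script `exec`s it, which makes it awkward to try in isolation. The sketch below applies the same parsing rules (`literal_eval` with a string fallback, a type check against the default, and rejection of unknown keys) to a plain dictionary. `apply_overrides` is a hypothetical helper and not part of the repository, and unlike the original it raises exceptions instead of using `assert`.

```python
from ast import literal_eval

def apply_overrides(config, args):
    """Return a copy of config with --key=value overrides applied."""
    result = dict(config)
    for arg in args:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError(f"Expected --key=value, got: {arg}")
        key, val = arg[2:].split("=", 1)
        if key not in result:
            raise ValueError(f"Unknown config key: {key}")
        try:
            # handles bools, numbers, tuples, quoted strings, ...
            parsed = literal_eval(val)
        except (SyntaxError, ValueError):
            # anything that is not a Python literal is kept as a string
            parsed = val
        if type(parsed) is not type(result[key]):
            raise TypeError(f"{key}: expected {type(result[key]).__name__}, "
                            f"got {type(parsed).__name__}")
        result[key] = parsed
    return result

defaults = {"batch_size": 12, "compile": True, "device": "cuda"}
overridden = apply_overrides(
    defaults, ["--batch_size=32", "--compile=False", "--device=cpu"]
)
print(overridden)
```

Because the type check is strict, an override such as `--batch_size=32.0` is rejected rather than silently converted, which matches the behavior of the type assertion in the original script.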

data/openwebtext/prepare.py ADDED

	@@ -0,0 +1,81 @@

+# saves the openwebtext dataset to a binary file for training. following was helpful:
+# https://github.com/HazyResearch/flash-attention/blob/main/training/src/datamodules/language_modeling_hf.py
+import os
+from tqdm import tqdm
+import numpy as np
+import tiktoken
+from datasets import load_dataset # huggingface datasets
+# number of workers in .map() call
+# good number to use is ~order number of cpu cores // 2
+num_proc = 8
+# number of workers in load_dataset() call
+# best number might be different from num_proc above as it also depends on network speed.
+# it is better than 1 usually though
+num_proc_load_dataset = num_proc
+enc = tiktoken.get_encoding("gpt2")
+if __name__ == '__main__':
+    # takes 54GB in huggingface .cache dir, about 8M documents (8,013,769)
+    dataset = load_dataset("openwebtext", num_proc=num_proc_load_dataset)
+    # owt by default only contains the 'train' split, so create a test split
+    split_dataset = dataset["train"].train_test_split(test_size=0.0005, seed=2357, shuffle=True)
+    split_dataset['val'] = split_dataset.pop('test') # rename the test split to val
+    # this results in:
+    # >>> split_dataset
+    # DatasetDict({
+    #     train: Dataset({
+    #         features: ['text'],
+    #         num_rows: 8009762
+    #     })
+    #     val: Dataset({
+    #         features: ['text'],
+    #         num_rows: 4007
+    #     })
+    # })
+    # we now want to tokenize the dataset. first define the encoding function (gpt2 bpe)
+    def process(example):
+        ids = enc.encode_ordinary(example['text']) # encode_ordinary ignores any special tokens
+        ids.append(enc.eot_token) # add the end of text token, e.g. 50256 for gpt2 bpe
+        # note: I think eot should be prepended not appended... hmm. it's called "eot" though...
+        out = {'ids': ids, 'len': len(ids)}
+        return out
+    # tokenize the dataset
+    tokenized = split_dataset.map(
+        process,
+        remove_columns=['text'],
+        desc="tokenizing the splits",
+        num_proc=num_proc,
+    )
+    # concatenate all the ids in each dataset into one large file we can use for training
+    for split, dset in tokenized.items():
+        arr_len = np.sum(dset['len'], dtype=np.uint64)
+        filename = os.path.join(os.path.dirname(__file__), f'{split}.bin')
+        dtype = np.uint16 # (can do since enc.max_token_value == 50256 is < 2**16)
+        arr = np.memmap(filename, dtype=dtype, mode='w+', shape=(arr_len,))
+        total_batches = 1024
+        idx = 0
+        for batch_idx in tqdm(range(total_batches), desc=f'writing {filename}'):
+            # Batch together samples for faster write
+            batch = dset.shard(num_shards=total_batches, index=batch_idx, contiguous=True).with_format('numpy')
+            arr_batch = np.concatenate(batch['ids'])
+            # Write into mmap
+            arr[idx : idx + len(arr_batch)] = arr_batch
+            idx += len(arr_batch)
+        arr.flush()
+    # train.bin is ~17GB, val.bin ~8.5MB
+    # train has ~9B tokens (9,035,582,198)
+    # val has ~4M tokens (4,434,897)
+    # to read the bin files later, e.g. with numpy:
+    # m = np.memmap('train.bin', dtype=np.uint16, mode='r')
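
The closing comment in prepare.py mentions reading the .bin files back with `np.memmap`. A minimal sketch of that, assuming only the flat uint16 layout written above: it creates a tiny stand-in file instead of the real train.bin and draws random (input, target) windows from it. The helper `get_batch` and its parameters are illustrative, not taken from this file.

```python
import os
import tempfile

import numpy as np

# write a tiny stand-in for train.bin (hypothetical data, same uint16 layout)
tmp_dir = tempfile.mkdtemp()
bin_path = os.path.join(tmp_dir, 'train.bin')
np.arange(1000, dtype=np.uint16).tofile(bin_path)


def get_batch(path, batch_size, block_size, rng):
    """Sample batch_size windows of block_size tokens plus next-token targets."""
    # memmap reads lazily, so the file is never loaded into RAM as a whole
    data = np.memmap(path, dtype=np.uint16, mode='r')
    # start offsets; the exclusive upper bound leaves room for the shifted target
    ix = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[i:i + block_size].astype(np.int64) for i in ix])
    y = np.stack([data[i + 1:i + 1 + block_size].astype(np.int64) for i in ix])
    return x, y


rng = np.random.default_rng(0)
x, y = get_batch(bin_path, batch_size=4, block_size=8, rng=rng)
print(x.shape, y.shape)
```

The targets `y` are the inputs `x` shifted by one position, which is the usual next-token setup for this kind of data.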

data/openwebtext/readme.md ADDED Viewed

	@@ -0,0 +1,15 @@

+## openwebtext dataset
+after running `prepare.py` (preprocess) we get:
+- train.bin is ~17GB, val.bin ~8.5MB
+- train has ~9B tokens (9,035,582,198)
+- val has ~4M tokens (4,434,897)
+this came from 8,013,769 documents in total.
+references:
+- OpenAI's WebText dataset is discussed in [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
+- [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset
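
The file sizes quoted above follow from the token counts, since prepare.py stores each token as a 2-byte `np.uint16`. A short check of that arithmetic, using only the numbers from this readme:

```python
train_tokens = 9_035_582_198
val_tokens = 4_434_897
bytes_per_token = 2  # np.uint16, as used by prepare.py

train_bytes = train_tokens * bytes_per_token
val_bytes = val_tokens * bytes_per_token

# the "~17GB" and "~8.5MB" figures match the binary units (GiB, MiB)
print(f"train.bin: {train_bytes / 2**30:.1f} GiB ({train_bytes / 1e9:.1f} GB)")
print(f"val.bin:   {val_bytes / 2**20:.1f} MiB ({val_bytes / 1e6:.1f} MB)")
```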

data/shakespeare/prepare.py ADDED Viewed

	@@ -0,0 +1,33 @@

+import os
+import requests
+import tiktoken
+import numpy as np
+# download the tiny shakespeare dataset
+input_file_path = os.path.join(os.path.dirname(__file__), 'input.txt')
+if not os.path.exists(input_file_path):
+    data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
+    with open(input_file_path, 'w', encoding='utf-8') as f:
+        f.write(requests.get(data_url).text)
+with open(input_file_path, 'r', encoding='utf-8') as f:
+    data = f.read()
+n = len(data)
+train_data = data[:int(n*0.9)]
+val_data = data[int(n*0.9):]
+# encode with tiktoken gpt2 bpe
+enc = tiktoken.get_encoding("gpt2")
+train_ids = enc.encode_ordinary(train_data)
+val_ids = enc.encode_ordinary(val_data)
+print(f"train has {len(train_ids):,} tokens")
+print(f"val has {len(val_ids):,} tokens")
+# export to bin files
+train_ids = np.array(train_ids, dtype=np.uint16)
+val_ids = np.array(val_ids, dtype=np.uint16)
+train_ids.tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))
+val_ids.tofile(os.path.join(os.path.dirname(__file__), 'val.bin'))
+# train.bin has 301,966 tokens
+# val.bin has 36,059 tokens

data/shakespeare/readme.md ADDED Viewed

	@@ -0,0 +1,9 @@

+# tiny shakespeare
+Tiny shakespeare, of the good old char-rnn fame :)
+After running `prepare.py`:
+- train.bin has 301,966 tokens
+- val.bin has 36,059 tokens
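
As a rough sanity check on these counts: the same input.txt has 1,115,394 characters (the figure reported by data/shakespeare_char/prepare.py), so the GPT-2 BPE encoding works out to a little over three characters per token. The snippet below only redoes that arithmetic; it does not run the tokenizer.

```python
train_tokens = 301_966
val_tokens = 36_059
total_chars = 1_115_394  # from the character-level prepare script

total_tokens = train_tokens + val_tokens
chars_per_token = total_chars / total_tokens
print(f"{total_tokens:,} tokens, about {chars_per_token:.2f} characters per token")
```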

data/shakespeare_char/prepare.py ADDED Viewed

	@@ -0,0 +1,68 @@

+"""
+Prepare the Shakespeare dataset for character-level language modeling.
+So instead of encoding with GPT-2 BPE tokens, we just map characters to ints.
+Will save train.bin, val.bin containing the ids, and meta.pkl containing the
+encoder and decoder and some other related info.
+"""
+import os
+import pickle
+import requests
+import numpy as np
+# download the tiny shakespeare dataset
+input_file_path = os.path.join(os.path.dirname(__file__), 'input.txt')
+if not os.path.exists(input_file_path):
+    data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
+    with open(input_file_path, 'w', encoding='utf-8') as f:
+        f.write(requests.get(data_url).text)
+with open(input_file_path, 'r', encoding='utf-8') as f:
+    data = f.read()
+print(f"length of dataset in characters: {len(data):,}")
+# get all the unique characters that occur in this text
+chars = sorted(list(set(data)))
+vocab_size = len(chars)
+print("all the unique characters:", ''.join(chars))
+print(f"vocab size: {vocab_size:,}")
+# create a mapping from characters to integers
+stoi = { ch:i for i,ch in enumerate(chars) }
+itos = { i:ch for i,ch in enumerate(chars) }
+def encode(s):
+    return [stoi[c] for c in s] # encoder: take a string, output a list of integers
+def decode(l):
+    return ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
+# create the train and test splits
+n = len(data)
+train_data = data[:int(n*0.9)]
+val_data = data[int(n*0.9):]
+# encode both to integers
+train_ids = encode(train_data)
+val_ids = encode(val_data)
+print(f"train has {len(train_ids):,} tokens")
+print(f"val has {len(val_ids):,} tokens")
+# export to bin files
+train_ids = np.array(train_ids, dtype=np.uint16)
+val_ids = np.array(val_ids, dtype=np.uint16)
+train_ids.tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))
+val_ids.tofile(os.path.join(os.path.dirname(__file__), 'val.bin'))
+# save the meta information as well, to help us encode/decode later
+meta = {
+    'vocab_size': vocab_size,
+    'itos': itos,
+    'stoi': stoi,
+}
+with open(os.path.join(os.path.dirname(__file__), 'meta.pkl'), 'wb') as f:
+    pickle.dump(meta, f)
+# length of dataset in characters:  1115394
+# all the unique characters:
+#  !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
+# vocab size: 65
+# train has 1003854 tokens
+# val has 111540 tokens
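
The character-level codec above can be exercised without downloading anything. A minimal sketch, assuming a short hypothetical sample string in place of input.txt and an in-memory pickle round trip in place of meta.pkl:

```python
import pickle

# hypothetical sample text standing in for input.txt
text = "First Citizen:\nBefore we proceed any further, hear me speak."

chars = sorted(set(text))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# same structure as meta.pkl, serialized to bytes instead of a file
meta_bytes = pickle.dumps({'vocab_size': vocab_size, 'itos': itos, 'stoi': stoi})
meta = pickle.loads(meta_bytes)


def encode(s):
    # string -> list of integer ids
    return [meta['stoi'][c] for c in s]


def decode(ids):
    # list of integer ids -> string
    return ''.join(meta['itos'][i] for i in ids)


ids = encode("hear me")
roundtrip = decode(ids)
print(ids, repr(roundtrip))
```

A character that never occurred in the source text has no entry in `stoi`, so encoding it raises a `KeyError`; prompts for a character-level model need to stay inside the training alphabet.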

data/shakespeare_char/readme.md ADDED Viewed

	@@ -0,0 +1,9 @@

+# tiny shakespeare, character-level
+Tiny shakespeare, of the good old char-rnn fame :) Treated at the character level.
+After running `prepare.py`:
+- train.bin has 1,003,854 tokens
+- val.bin has 111,540 tokens
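
These two counts are the 90/10 split from prepare.py applied to the 1,115,394 characters of the dataset, one token per character. The arithmetic, for reference:

```python
n = 1_115_394          # characters in input.txt, as printed by prepare.py
split = int(n * 0.9)   # same cut point as data[:int(n*0.9)]

train_tokens = split
val_tokens = n - split
train_bin_bytes = train_tokens * 2  # ids are stored as np.uint16
print(f"train: {train_tokens:,}  val: {val_tokens:,}  train.bin: {train_bin_bytes:,} bytes")
```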

model.py ADDED Viewed

	@@ -0,0 +1,330 @@

+"""
+Full definition of a GPT Language Model, all of it in this single file.
+References:
+1) the official GPT-2 TensorFlow implementation released by OpenAI:
+https://github.com/openai/gpt-2/blob/master/src/model.py
+2) huggingface/transformers PyTorch implementation:
+https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py
+"""
+import math
+import inspect
+from dataclasses import dataclass
+import torch
+import torch.nn as nn
+from torch.nn import functional as F
+class LayerNorm(nn.Module):
+    """ LayerNorm but with an optional bias. PyTorch doesn't support simply bias=False """
+    def __init__(self, ndim, bias):
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(ndim))
+        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None
+    def forward(self, input):
+        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)
+class CausalSelfAttention(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        assert config.n_embd % config.n_head == 0
+        # key, query, value projections for all heads, but in a batch
+        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
+        # output projection
+        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
+        # regularization
+        self.attn_dropout = nn.Dropout(config.dropout)
+        self.resid_dropout = nn.Dropout(config.dropout)
+        self.n_head = config.n_head
+        self.n_embd = config.n_embd
+        self.dropout = config.dropout
+        # flash attention make GPU go brrrrr but support is only in PyTorch >= 2.0
+        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
+        if not self.flash:
+            print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
+            # causal mask to ensure that attention is only applied to the left in the input sequence
+            self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
+                                        .view(1, 1, config.block_size, config.block_size))
+    def forward(self, x):
+        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)
+        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
+        q, k, v  = self.c_attn(x).split(self.n_embd, dim=2)
+        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
+        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
+        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
+        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
+        if self.flash:
+            # efficient attention using Flash Attention CUDA kernels
+            y = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)
+        else:
+            # manual implementation of attention
+            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
+            att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
+            att = F.softmax(att, dim=-1)
+            att = self.attn_dropout(att)
+            y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
+        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side
+        # output projection
+        y = self.resid_dropout(self.c_proj(y))
+        return y
+class MLP(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
+        self.gelu    = nn.GELU()
+        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
+        self.dropout = nn.Dropout(config.dropout)
+    def forward(self, x):
+        x = self.c_fc(x)
+        x = self.gelu(x)
+        x = self.c_proj(x)
+        x = self.dropout(x)
+        return x
+class Block(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
+        self.attn = CausalSelfAttention(config)
+        self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
+        self.mlp = MLP(config)
+    def forward(self, x):
+        x = x + self.attn(self.ln_1(x))
+        x = x + self.mlp(self.ln_2(x))
+        return x
+@dataclass
+class GPTConfig:
+    block_size: int = 1024
+    vocab_size: int = 50304 # GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency
+    n_layer: int = 12
+    n_head: int = 12
+    n_embd: int = 768
+    dropout: float = 0.0
+    bias: bool = True # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
+class GPT(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        assert config.vocab_size is not None
+        assert config.block_size is not None
+        self.config = config
+        self.transformer = nn.ModuleDict(dict(
+            wte = nn.Embedding(config.vocab_size, config.n_embd),
+            wpe = nn.Embedding(config.block_size, config.n_embd),
+            drop = nn.Dropout(config.dropout),
+            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
+            ln_f = LayerNorm(config.n_embd, bias=config.bias),
+        ))
+        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
+        # with weight tying when using torch.compile() some warnings get generated:
+        # "UserWarning: functional_call was passed multiple values for tied weights.
+        # This behavior is deprecated and will be an error in future versions"
+        # not 100% sure what this is, so far seems to be harmless. TODO investigate
+        self.transformer.wte.weight = self.lm_head.weight # https://paperswithcode.com/method/weight-tying
+        # init all weights
+        self.apply(self._init_weights)
+        # apply special scaled init to the residual projections, per GPT-2 paper
+        for pn, p in self.named_parameters():
+            if pn.endswith('c_proj.weight'):
+                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))
+        # report number of parameters
+        print("number of parameters: %.2fM" % (self.get_num_params()/1e6,))
+    def get_num_params(self, non_embedding=True):
+        """
+        Return the number of parameters in the model.
+        For non-embedding count (default), the position embeddings get subtracted.
+        The token embeddings would too, except due to the parameter sharing these
+        params are actually used as weights in the final layer, so we include them.
+        """
+        n_params = sum(p.numel() for p in self.parameters())
+        if non_embedding:
+            n_params -= self.transformer.wpe.weight.numel()
+        return n_params
+    def _init_weights(self, module):
+        if isinstance(module, nn.Linear):
+            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
+            if module.bias is not None:
+                torch.nn.init.zeros_(module.bias)
+        elif isinstance(module, nn.Embedding):
+            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
+    def forward(self, idx, targets=None):
+        device = idx.device
+        b, t = idx.size()
+        assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
+        pos = torch.arange(0, t, dtype=torch.long, device=device) # shape (t)
+        # forward the GPT model itself
+        tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
+        pos_emb = self.transformer.wpe(pos) # position embeddings of shape (t, n_embd)
+        x = self.transformer.drop(tok_emb + pos_emb)
+        for block in self.transformer.h:
+            x = block(x)
+        x = self.transformer.ln_f(x)
+        if targets is not None:
+            # if we are given some desired targets also calculate the loss
+            logits = self.lm_head(x)
+            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
+        else:
+            # inference-time mini-optimization: only forward the lm_head on the very last position
+            logits = self.lm_head(x[:, [-1], :]) # note: using list [-1] to preserve the time dim
+            loss = None
+        return logits, loss
+    def crop_block_size(self, block_size):
+        # model surgery to decrease the block size if necessary
+        # e.g. we may load the GPT2 pretrained model checkpoint (block size 1024)
+        # but want to use a smaller block size for some smaller, simpler model
+        assert block_size <= self.config.block_size
+        self.config.block_size = block_size
+        self.transformer.wpe.weight = nn.Parameter(self.transformer.wpe.weight[:block_size])
+        for block in self.transformer.h:
+            if hasattr(block.attn, 'bias'):
+                block.attn.bias = block.attn.bias[:,:,:block_size,:block_size]
+    @classmethod
+    def from_pretrained(cls, model_type, override_args=None):
+        assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
+        override_args = override_args or {} # default to empty dict
+        # only dropout can be overridden; see more notes below
+        assert all(k == 'dropout' for k in override_args)
+        from transformers import GPT2LMHeadModel
+        print("loading weights from pretrained gpt: %s" % model_type)
+        # n_layer, n_head and n_embd are determined from model_type
+        config_args = {
+            'gpt2':         dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
+            'gpt2-medium':  dict(n_layer=24, n_head=16, n_embd=1024), # 350M params
+            'gpt2-large':   dict(n_layer=36, n_head=20, n_embd=1280), # 774M params
+            'gpt2-xl':      dict(n_layer=48, n_head=25, n_embd=1600), # 1558M params
+        }[model_type]
+        print("forcing vocab_size=50257, block_size=1024, bias=True")
+        config_args['vocab_size'] = 50257 # always 50257 for GPT model checkpoints
+        config_args['block_size'] = 1024 # always 1024 for GPT model checkpoints
+        config_args['bias'] = True # always True for GPT model checkpoints
+        # we can override the dropout rate, if desired
+        if 'dropout' in override_args:
+            print(f"overriding dropout rate to {override_args['dropout']}")
+            config_args['dropout'] = override_args['dropout']
+        # create a from-scratch initialized minGPT model
+        config = GPTConfig(**config_args)
+        model = GPT(config)
+        sd = model.state_dict()
+        sd_keys = sd.keys()
+        sd_keys = [k for k in sd_keys if not k.endswith('.attn.bias')] # discard this mask / buffer, not a param
+        # init a huggingface/transformers model
+        model_hf = GPT2LMHeadModel.from_pretrained(model_type)
+        sd_hf = model_hf.state_dict()
+        # copy while ensuring all of the parameters are aligned and match in names and shapes
+        sd_keys_hf = sd_hf.keys()
+        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.masked_bias')] # ignore these, just a buffer
+        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.bias')] # same, just the mask (buffer)
+        transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
+        # basically the openai checkpoints use a "Conv1D" module, but we only want to use a vanilla Linear
+        # this means that we have to transpose these weights when we import them
+        assert len(sd_keys_hf) == len(sd_keys), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"
+        for k in sd_keys_hf:
+            if any(k.endswith(w) for w in transposed):
+                # special treatment for the Conv1D weights we need to transpose
+                assert sd_hf[k].shape[::-1] == sd[k].shape
+                with torch.no_grad():
+                    sd[k].copy_(sd_hf[k].t())
+            else:
+                # vanilla copy over the other parameters
+                assert sd_hf[k].shape == sd[k].shape
+                with torch.no_grad():
+                    sd[k].copy_(sd_hf[k])
+        return model
+    def configure_optimizers(self, weight_decay, learning_rate, betas, device_type):
+        # start with all of the candidate parameters
+        param_dict = {pn: p for pn, p in self.named_parameters()}
+        # filter out those that do not require grad
+        param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
+        # create optim groups. Any parameter with 2 or more dimensions is weight decayed, the rest are not.
+        # i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't.
+        decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
+        nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
+        optim_groups = [
+            {'params': decay_params, 'weight_decay': weight_decay},
+            {'params': nodecay_params, 'weight_decay': 0.0}
+        ]
+        num_decay_params = sum(p.numel() for p in decay_params)
+        num_nodecay_params = sum(p.numel() for p in nodecay_params)
+        print(f"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters")
+        print(f"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters")
+        # Create AdamW optimizer and use the fused version if it is available
+        fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
+        use_fused = fused_available and device_type == 'cuda'
+        extra_args = dict(fused=True) if use_fused else dict()
+        optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, **extra_args)
+        print(f"using fused AdamW: {use_fused}")
+        return optimizer
+    def estimate_mfu(self, fwdbwd_per_iter, dt):
+        """ estimate model flops utilization (MFU) in units of A100 bfloat16 peak FLOPS """
+        # first estimate the number of flops we do per iteration.
+        # see PaLM paper Appendix B as ref: https://arxiv.org/abs/2204.02311
+        N = self.get_num_params()
+        cfg = self.config
+        L, H, Q, T = cfg.n_layer, cfg.n_head, cfg.n_embd//cfg.n_head, cfg.block_size
+        flops_per_token = 6*N + 12*L*H*Q*T
+        flops_per_fwdbwd = flops_per_token * T
+        flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter
+        # express our flops throughput as ratio of A100 bfloat16 peak flops
+        flops_achieved = flops_per_iter * (1.0/dt) # per second
+        flops_promised = 312e12 # A100 GPU bfloat16 peak flops is 312 TFLOPS
+        mfu = flops_achieved / flops_promised
+        return mfu
+    @torch.no_grad()
+    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
+        """
+        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
+        the sequence max_new_tokens times, feeding the predictions back into the model each time.
+        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
+        """
+        for _ in range(max_new_tokens):
+            # if the sequence context is growing too long we must crop it at block_size
+            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
+            # forward the model to get the logits for the index in the sequence
+            logits, _ = self(idx_cond)
+            # pluck the logits at the final step and scale by desired temperature
+            logits = logits[:, -1, :] / temperature
+            # optionally crop the logits to only the top k options
+            if top_k is not None:
+                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
+                logits[logits < v[:, [-1]]] = -float('Inf')
+            # apply softmax to convert logits to (normalized) probabilities
+            probs = F.softmax(logits, dim=-1)
+            # sample from the distribution
+            idx_next = torch.multinomial(probs, num_samples=1)
+            # append sampled index to the running sequence and continue
+            idx = torch.cat((idx, idx_next), dim=1)
+        return idx
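
The parameter count printed by `GPT.__init__` and the `estimate_mfu` formula can both be reproduced with plain arithmetic. The sketch below is a stand-alone approximation, not code from model.py: the function names are hypothetical, the configuration is the `gpt2` entry from `from_pretrained` (vocab_size=50257, block_size=1024, bias=True), and the iteration time and micro-step count in the example call are made-up inputs.

```python
def gpt_param_count(n_layer, n_embd, vocab_size, block_size, bias=True):
    """Count parameters of the GPT layout above (lm_head is tied to wte)."""
    wte = vocab_size * n_embd                      # token embedding, shared with lm_head
    wpe = block_size * n_embd                      # position embedding
    ln = n_embd + (n_embd if bias else 0)          # LayerNorm weight (+ bias)
    attn = 3 * n_embd * n_embd + n_embd * n_embd   # c_attn + c_proj weights
    attn += 4 * n_embd if bias else 0              # their biases
    mlp = 2 * 4 * n_embd * n_embd                  # c_fc + c_proj weights
    mlp += 5 * n_embd if bias else 0               # their biases
    block = 2 * ln + attn + mlp
    total = wte + wpe + n_layer * block + ln       # final ln_f at the end
    return total, total - wpe                      # (all, non-embedding as in get_num_params)


def estimate_mfu(n_params, n_layer, n_head, n_embd, block_size,
                 fwdbwd_per_iter, dt, peak_flops=312e12):
    """Same formula as GPT.estimate_mfu, with the model config passed explicitly."""
    head_size = n_embd // n_head
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_size * block_size
    flops_per_iter = flops_per_token * block_size * fwdbwd_per_iter
    return (flops_per_iter / dt) / peak_flops


total, non_embedding = gpt_param_count(n_layer=12, n_embd=768,
                                       vocab_size=50257, block_size=1024)
print(f"total: {total:,}  non-embedding: {non_embedding / 1e6:.2f}M")

# hypothetical run: 480 forward/backward passes per iteration, 10 s per iteration
mfu = estimate_mfu(non_embedding, n_layer=12, n_head=12, n_embd=768,
                   block_size=1024, fwdbwd_per_iter=480, dt=10.0)
print(f"MFU: {mfu * 100:.1f}%")
```

Only the position embeddings are subtracted for the non-embedding count, because the token embedding matrix doubles as the output projection under weight tying.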

model_dtat.py ADDED Viewed

	@@ -0,0 +1,257 @@

+import math
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+class TokenImportanceNetwork(nn.Module):
+    """
+    Computes importance scores for each token based on:
+    1. Local context patterns
+    2. Token frequency
+    3. Position information
+    """
+    def __init__(self, config):
+        super().__init__()
+        self.n_embd = config.n_embd
+        # Local context processing
+        self.context_net = nn.Sequential(
+            nn.Conv1d(config.n_embd, config.n_embd // 2, kernel_size=3, padding=1),
+            nn.ReLU(),
+            nn.Conv1d(config.n_embd // 2, 1, kernel_size=1)
+        )
+        # Frequency awareness
+        self.freq_embedding = nn.Embedding(256, config.n_embd // 4)  # 256 possible byte values
+        # Position awareness
+        self.pos_embedding = nn.Embedding(config.block_size, config.n_embd // 4)
+        # Final importance score computation
+        self.importance_proj = nn.Sequential(
+            nn.Linear(config.n_embd + config.n_embd//2, config.n_embd//4),
+            nn.ReLU(),
+            nn.Linear(config.n_embd//4, 1),
+            nn.Sigmoid()
+        )
+    def forward(self, x, freq_table, positions):
+        B, T, C = x.shape
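+        # Process local context
+        # NOTE: x_conv is computed but not consumed below; importance_proj only sees x, freq_emb and pos_emb
+        x_conv = self.context_net(x.transpose(1, 2)).transpose(1, 2)  # [B, T, 1]
+        # Get frequency embeddings
+        freq_emb = self.freq_embedding(freq_table)  # [B, T, C//4]
+        # Get position embeddings
+        pos_emb = self.pos_embedding(positions)  # [T, C//4] or [B, T, C//4]
+        if pos_emb.dim() == 2:
+            # positions were passed without a batch dimension; broadcast across the batch for torch.cat
+            pos_emb = pos_emb.unsqueeze(0).expand(B, -1, -1)
+        # Combine all features
+        combined = torch.cat([x, freq_emb, pos_emb], dim=-1)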
+        # Compute importance scores
+        importance = self.importance_proj(combined)  # [B, T, 1]
+        return importance
+class SparseDenseAttention(nn.Module):
+    """
+    Hybrid attention mechanism that uses:
+    - Full attention for important tokens
+    - Sparse attention for less important tokens
+    """
+    def __init__(self, config):
+        super().__init__()
+        assert config.n_embd % config.n_head == 0
+        self.n_head = config.n_head
+        self.n_embd = config.n_embd
+        self.dropout = config.dropout
+        # Key, Query, Value projections
+        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
+        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
+        # Dropouts
+        self.attn_dropout = nn.Dropout(config.dropout)
+        self.resid_dropout = nn.Dropout(config.dropout)
+        # Sparse attention parameters
+        self.sparse_topk = getattr(config, 'sparse_topk', 32)  # Number of tokens to attend to for less important tokens
+    def forward(self, x, importance_scores, mask=None):
+        B, T, C = x.shape
+        # Calculate query, key, values for all heads in batch
+        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
+        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
+        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
+        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
+        # Compute attention scores
+        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
+        # Apply importance scores
+        importance_scores = importance_scores.squeeze(-1).unsqueeze(1)  # [B, 1, T]
+        att = att * importance_scores.unsqueeze(-1)  # Scale attention by importance
+        # For less important tokens (importance < threshold), use sparse attention
+        sparse_mask = importance_scores < 0.5
+        if sparse_mask.any():
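+            # Keep only top-k values for less important tokens
+            # (the mask is expanded to [B, 1, T, 1] so it broadcasts over heads and keys;
+            #  k is capped at T because topk fails when k exceeds the sequence length)
+            topk_values, _ = torch.topk(att.masked_fill(~sparse_mask.unsqueeze(-1), -float('inf')),
+                                      k=min(self.sparse_topk, T), dim=-1)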
+            sparse_threshold = topk_values[..., -1, None]
+            att = att.masked_fill(
+                (att < sparse_threshold) & sparse_mask.unsqueeze(-1),
+                -float('inf')
+            )
+        # Apply softmax and dropout
+        att = F.softmax(att, dim=-1)
+        att = self.attn_dropout(att)
+        # Compute output
+        y = att @ v  # [B, nh, T, hs]
+        y = y.transpose(1, 2).contiguous().view(B, T, C)
+        # Output projection
+        y = self.resid_dropout(self.c_proj(y))
+        return y
+class Block(nn.Module):
+    """
+    Transformer block with importance-aware processing
+    """
+    def __init__(self, config):
+        super().__init__()
+        self.ln_1 = nn.LayerNorm(config.n_embd)
+        self.attn = SparseDenseAttention(config)
+        self.ln_2 = nn.LayerNorm(config.n_embd)
+        self.mlp = nn.Sequential(
+            nn.Linear(config.n_embd, 4 * config.n_embd),
+            nn.GELU(),
+            nn.Linear(4 * config.n_embd, config.n_embd),
+            nn.Dropout(config.dropout),
+        )
+        # Feature amplification
+        self.feature_gate = nn.Sequential(
+            nn.Linear(config.n_embd, config.n_embd),
+            nn.Sigmoid()
+        )
+    def forward(self, x, importance_scores):
+        # Self-attention with importance awareness
+        attn_output = self.attn(self.ln_1(x), importance_scores)
+        x = x + attn_output
+        # Feature amplification based on importance
+        gate = self.feature_gate(x)
+        x = x * (1 + importance_scores * gate)
+        # MLP block
+        x = x + self.mlp(self.ln_2(x))
+        return x
+class DTATTransformer(nn.Module):
+    """
+    Dynamic Token-Aware Transformer (DTAT) for character-level language modeling
+    """
+    def __init__(self, config):
+        super().__init__()
+        assert config.vocab_size is not None
+        assert config.block_size is not None
+        self.config = config
+        self.transformer = nn.ModuleDict(dict(
+            wte = nn.Embedding(config.vocab_size, config.n_embd),
+            wpe = nn.Embedding(config.block_size, config.n_embd),
+            drop = nn.Dropout(config.dropout),
+            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
+            ln_f = nn.LayerNorm(config.n_embd),
+        ))
+        # Token importance network
+        self.importance_net = TokenImportanceNetwork(config)
+        # Output head
+        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
+        # Initialize weights
+        self.apply(self._init_weights)
+        # Report number of parameters
+        print("number of parameters: %.2fM" % (self.get_num_params()/1e6,))
+    def get_num_params(self):
+        return sum(p.numel() for p in self.parameters())
+    def _init_weights(self, module):
+        if isinstance(module, nn.Linear):
+            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
+            if module.bias is not None:
+                torch.nn.init.zeros_(module.bias)
+        elif isinstance(module, nn.Embedding):
+            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
+    def forward(self, idx, targets=None, freq_table=None):
+        device = idx.device
+        b, t = idx.size()
+        assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
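+        # Get token frequencies if not provided
+        if freq_table is None:
+            # assumed intent: per-token occurrence counts within this batch, shape [b, t],
+            # clamped to the 256 rows of TokenImportanceNetwork.freq_embedding
+            counts = torch.bincount(idx.view(-1), minlength=self.config.vocab_size)
+            freq_table = counts[idx].clamp(max=255)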
+        # Generate position indices
+        pos = torch.arange(0, t, dtype=torch.long, device=device)
+        # Token embeddings
+        tok_emb = self.transformer.wte(idx)
+        pos_emb = self.transformer.wpe(pos)
+        x = self.transformer.drop(tok_emb + pos_emb)
+        # Compute token importance scores
+        importance_scores = self.importance_net(x, freq_table, pos)
+        # Apply transformer blocks with importance awareness
+        for block in self.transformer.h:
+            x = block(x, importance_scores)
+        x = self.transformer.ln_f(x)
+        # Language modeling head
+        logits = self.lm_head(x)
+        # Loss computation
+        loss = None
+        if targets is not None:
+            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
+            # Convert loss to bits per character (bpc)
+            loss = loss / math.log(2)
+        return logits, loss, importance_scores
+    @torch.no_grad()
+    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
+        for _ in range(max_new_tokens):
+            # Crop context if needed
+            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
+            # Forward pass
+            logits, _, _ = self(idx_cond)
+            logits = logits[:, -1, :] / temperature
+            # Optional top-k sampling
+            if top_k is not None:
+                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
+                logits[logits < v[:, [-1]]] = -float('inf')
+            # Sample from distribution
+            probs = F.softmax(logits, dim=-1)
+            idx_next = torch.multinomial(probs, num_samples=1)
+            idx = torch.cat((idx, idx_next), dim=1)
+        return idx
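
The `forward` method of the DTAT model divides the cross-entropy by ln 2 so that the reported loss is in bits per character (bpc) rather than nats. A minimal sketch of that conversion in plain Python follows. The helper name `nats_to_bits` and the uniform-predictor example are assumptions made for illustration and are not part of the repository.

```python
import math

def nats_to_bits(loss_nats: float) -> float:
    """Convert a cross-entropy measured in nats to bits (divide by ln 2)."""
    return loss_nats / math.log(2)

# Assumption for illustration: a model that assigns equal probability
# to each of the 256 byte values has a cross-entropy of ln(256) nats.
uniform_loss_nats = math.log(256)
uniform_bpc = nats_to_bits(uniform_loss_nats)
print(f"uniform byte model: {uniform_bpc:.3f} bpc")
```

Because `forward` already returns the loss in bits, a caller that divides by ln 2 a second time would under-report bpc, so the conversion should happen in exactly one place.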

model_modified.py ADDED

	@@ -0,0 +1,190 @@

+import math
+import torch
+import torch.nn as nn
+from torch.nn import functional as F
+class HierarchicalPositionEncoding(nn.Module):
+    """
+    Hierarchical Position Encoding that captures position information at multiple scales:
+    - Fine-grained local position (token level)
+    - Medium-scale position (segment level)
+    - Coarse-grained position (document level)
+    """
+    def __init__(self, d_model, max_len=1024, base=10000):
+        super().__init__()
+        self.d_model = d_model
+        self.max_len = max_len
+        self.base = base
+        # Split embedding dimensions for different scales
+        self.local_dim = d_model // 2
+        self.segment_dim = d_model // 4
+        self.doc_dim = d_model - self.local_dim - self.segment_dim
+        # Create position encodings for different scales
+        self.register_buffer('local_pe', self._create_pe(max_len, self.local_dim))
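+        # Round the coarse table sizes up so they cover max_len even when it is not a multiple of 8 or 32
+        self.register_buffer('segment_pe', self._create_pe(math.ceil(max_len / 8), self.segment_dim))
+        self.register_buffer('doc_pe', self._create_pe(math.ceil(max_len / 32), self.doc_dim))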
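+    def _create_pe(self, max_len, d_model):
+        # Standard sinusoidal encoding: sin on even feature indices, cos on odd ones
+        pe = torch.zeros(max_len, d_model)
+        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
+        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(self.base) / d_model))
+        pe[:, 0::2] = torch.sin(position * div_term)
+        # An odd d_model has one fewer odd index than even ones, so trim the cos columns to fit
+        pe[:, 1::2] = torch.cos(position * div_term)[:, :d_model // 2]
+        return pe.unsqueeze(0)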
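+    def forward(self, x):
+        B, T, C = x.shape
+        # Rows needed to cover T tokens at each coarse scale (ceil division); with floor
+        # division the repeated encodings are shorter than T whenever T is not a multiple
+        # of 8 or 32, and the concatenation below fails (e.g. during generation)
+        n_seg = (T + 7) // 8
+        n_doc = (T + 31) // 32
+        # Get positional encodings at different scales
+        local_pos = self.local_pe[:, :T, :]
+        segment_pos = self.segment_pe[:, :n_seg, :].repeat_interleave(8, dim=1)[:, :T, :]
+        doc_pos = self.doc_pe[:, :n_doc, :].repeat_interleave(32, dim=1)[:, :T, :]
+        # Combine all scales along the feature dimension -> (1, T, d_model)
+        pos_encoding = torch.cat([local_pos, segment_pos, doc_pos], dim=-1)
+        return pos_encoding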
+class MultiScaleAttention(nn.Module):
+    """
+    Multi-head self-attention. Despite the class name, every head attends at a single scale;
+    the only multi-scale signal in this file comes from HierarchicalPositionEncoding.
+    """
+    def __init__(self, config):
+        super().__init__()
+        assert config.n_embd % config.n_head == 0
+        # key, query, value projections for all heads, but in a batch
+        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
+        # output projection
+        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
+        # regularization
+        self.attn_dropout = nn.Dropout(config.dropout)
+        self.resid_dropout = nn.Dropout(config.dropout)
+        self.n_head = config.n_head
+        self.n_embd = config.n_embd
+        self.dropout = config.dropout
+    def forward(self, x):
+        B, T, C = x.shape # batch size, sequence length, embedding dimensionality
+        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
+        q, k, v  = self.c_attn(x).split(self.n_embd, dim=2)
+        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
+        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
+        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
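+        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
+        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
+        # mask out future positions so that token t only attends to tokens <= t
+        causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
+        att = att.masked_fill(~causal_mask, float('-inf'))
+        att = F.softmax(att, dim=-1)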
+        att = self.attn_dropout(att)
+        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
+        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side
+        # output projection
+        y = self.resid_dropout(self.c_proj(y))
+        return y
+class Block(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.ln_1 = nn.LayerNorm(config.n_embd)
+        self.attn = MultiScaleAttention(config)
+        self.ln_2 = nn.LayerNorm(config.n_embd)
+        self.mlp = nn.ModuleDict(dict(
+            c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias),
+            c_proj  = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias),
+            act     = nn.GELU(),
+            dropout = nn.Dropout(config.dropout),
+        ))
+        m = self.mlp
+        self.mlpf = lambda x: m.dropout(m.c_proj(m.act(m.c_fc(x))))
+    def forward(self, x):
+        x = x + self.attn(self.ln_1(x))
+        x = x + self.mlpf(self.ln_2(x))
+        return x
+class GPTModified(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        assert config.vocab_size is not None
+        assert config.block_size is not None
+        self.config = config
+        self.transformer = nn.ModuleDict(dict(
+            wte = nn.Embedding(config.vocab_size, config.n_embd),
+            hpe = HierarchicalPositionEncoding(config.n_embd, config.block_size),
+            drop = nn.Dropout(config.dropout),
+            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
+            ln_f = nn.LayerNorm(config.n_embd),
+        ))
+        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
+        # Initialize weights
+        self.apply(self._init_weights)
+        # Apply special scaled init to the residual projections, per GPT-2 paper
+        for pn, p in self.named_parameters():
+            if pn.endswith('c_proj.weight'):
+                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))
+        # Report number of parameters
+        print("number of parameters: %.2fM" % (self.get_num_params()/1e6,))
+    def get_num_params(self, non_embedding=True):
+        n_params = sum(p.numel() for p in self.parameters())
+        if non_embedding:
+            n_params -= self.transformer.wte.weight.numel()
+        return n_params
+    def _init_weights(self, module):
+        if isinstance(module, nn.Linear):
+            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
+            if module.bias is not None:
+                torch.nn.init.zeros_(module.bias)
+        elif isinstance(module, nn.Embedding):
+            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
+    def forward(self, idx, targets=None):
+        device = idx.device
+        b, t = idx.size()
+        assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
+        pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0) # shape (1, t)
+        # Forward pass
+        tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
+        pos_emb = self.transformer.hpe(tok_emb) # position encodings of shape (1, t, n_embd), broadcast over the batch
+        x = self.transformer.drop(tok_emb + pos_emb)
+        for block in self.transformer.h:
+            x = block(x)
+        x = self.transformer.ln_f(x)
+        logits = self.lm_head(x)
+        # If we are given some desired targets also calculate the loss
+        loss = None
+        if targets is not None:
+            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
+        return logits, loss
+    @torch.no_grad()
+    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
+        for _ in range(max_new_tokens):
+            # If the sequence context is growing too long we must crop it at block_size
+            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
+            # Forward the model to get the logits for the index in the sequence
+            logits, _ = self(idx_cond)
+            # Pluck the logits at the final step and scale by desired temperature
+            logits = logits[:, -1, :] / temperature
+            # Optionally crop the logits to only the top k options
+            if top_k is not None:
+                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
+                logits[logits < v[:, [-1]]] = -float('Inf')
+            # Apply softmax to convert logits to (normalized) probabilities
+            probs = F.softmax(logits, dim=-1)
+            # Sample from the distribution
+            idx_next = torch.multinomial(probs, num_samples=1)
+            # Append sampled index to the running sequence and continue
+            idx = torch.cat((idx, idx_next), dim=1)
+        return idx
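
`HierarchicalPositionEncoding` shares one segment-level row among 8 consecutive tokens and one document-level row among 32. The mapping from token position to table row can be sketched in plain Python. The helper names `scale_indices` and `rows_needed` are hypothetical and exist only for this illustration.

```python
def scale_indices(seq_len: int, stride: int) -> list[int]:
    """Row of the coarse table that each token position reads from."""
    return [t // stride for t in range(seq_len)]

def rows_needed(seq_len: int, stride: int) -> int:
    """Ceil division: coarse rows required to cover seq_len tokens."""
    return (seq_len + stride - 1) // stride

T = 10  # a sequence length that is not a multiple of 8
segment_rows = scale_indices(T, 8)
print(segment_rows)                # one entry per token
print(rows_needed(T, 8), T // 8)   # ceil vs. floor row count
```

For T = 10, floor division yields a single segment row, which covers only 8 positions, while ceil division yields the two rows needed to cover all 10 tokens. The same reasoning applies to the stride of 32 at the document level.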

prepare_data.py ADDED

	@@ -0,0 +1,37 @@

+import os
+def prepare_enwik8(input_file, output_dir):
+    """
+    Prepare enwik8 dataset from enwik9:
+    - Extract first 100M bytes for enwik8
+    - Split into train (90M), val (5M), and test (5M)
+    """
+    # Create output directory if it doesn't exist
+    os.makedirs(output_dir, exist_ok=True)
+    # Read first 100M bytes from enwik9
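+    with open(input_file, 'rb') as f:
+        data = f.read(100_000_000)  # Read up to 100M bytes; shorter files return fewer
+    if len(data) < 100_000_000:
+        raise ValueError(f"{input_file} holds only {len(data):,} bytes; expected at least 100,000,000")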
+    # Split the data
+    train_data = data[:90_000_000]  # First 90M bytes
+    val_data = data[90_000_000:95_000_000]  # Next 5M bytes
+    test_data = data[95_000_000:]  # Last 5M bytes
+    # Save splits
+    splits = {
+        'train.bin': train_data,
+        'val.bin': val_data,
+        'test.bin': test_data
+    }
+    for name, split_data in splits.items():
+        with open(os.path.join(output_dir, name), 'wb') as f:
+            f.write(split_data)
+        print(f"Saved {name} ({len(split_data):,} bytes)")
+if __name__ == "__main__":
+    input_file = "enwik9/enwik9"
+    output_dir = "data"
+    prepare_enwik8(input_file, output_dir)
+    print("Dataset preparation completed!")
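
The split in `prepare_enwik8` is three contiguous slices of one byte string. A minimal sketch of the same slicing on a toy input follows, assuming a hypothetical helper `split_bytes` that also checks the input is long enough.

```python
def split_bytes(data: bytes, n_train: int, n_val: int) -> dict[str, bytes]:
    """Slice data into contiguous train/val/test parts; test takes the remainder."""
    if len(data) < n_train + n_val:
        raise ValueError("not enough data for the requested train and val sizes")
    return {
        "train.bin": data[:n_train],
        "val.bin": data[n_train:n_train + n_val],
        "test.bin": data[n_train + n_val:],
    }

# Toy input (100 bytes) split 90/5/5, mirroring the 90M/5M/5M proportions.
toy = bytes(range(100))
parts = split_bytes(toy, n_train=90, n_val=5)
for name, chunk in parts.items():
    print(name, len(chunk))
```

Concatenating the three parts in order reproduces the input, which is a cheap sanity check to run after writing the real files.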

sample.py ADDED

	@@ -0,0 +1,89 @@

+"""
+Sample from a trained model
+"""
+import os
+import pickle
+from contextlib import nullcontext
+import torch
+import tiktoken
+from model import GPTConfig, GPT
+# -----------------------------------------------------------------------------
+init_from = 'resume' # either 'resume' (from an out_dir) or a gpt2 variant (e.g. 'gpt2-xl')
+out_dir = 'out' # ignored if init_from is not 'resume'
+start = "\n" # or "<|endoftext|>" or etc. Can also specify a file, use as: "FILE:prompt.txt"
+num_samples = 10 # number of samples to draw
+max_new_tokens = 500 # number of tokens generated in each sample
+temperature = 0.8 # 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions
+top_k = 200 # retain only the top_k most likely tokens, clamp others to have 0 probability
+seed = 1337
+device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
+dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16' # 'float32' or 'bfloat16' or 'float16'
+compile = False # use PyTorch 2.0 to compile the model to be faster
+exec(open('configurator.py').read()) # overrides from command line or config file
+# -----------------------------------------------------------------------------
+torch.manual_seed(seed)
+torch.cuda.manual_seed(seed)
+torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
+torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
+device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
+ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
+ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)
+# model
+if init_from == 'resume':
+    # init from a model saved in a specific directory
+    ckpt_path = os.path.join(out_dir, 'ckpt.pt')
+    checkpoint = torch.load(ckpt_path, map_location=device)
+    gptconf = GPTConfig(**checkpoint['model_args'])
+    model = GPT(gptconf)
+    state_dict = checkpoint['model']
+    unwanted_prefix = '_orig_mod.'
+    for k,v in list(state_dict.items()):
+        if k.startswith(unwanted_prefix):
+            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
+    model.load_state_dict(state_dict)
+elif init_from.startswith('gpt2'):
+    # init from a given GPT-2 model
+    model = GPT.from_pretrained(init_from, dict(dropout=0.0))
+model.eval()
+model.to(device)
+if compile:
+    model = torch.compile(model) # requires PyTorch 2.0 (optional)
+# look for the meta pickle in case it is available in the dataset folder
+load_meta = False
+if init_from == 'resume' and 'config' in checkpoint and 'dataset' in checkpoint['config']: # older checkpoints might not have these...
+    meta_path = os.path.join('data', checkpoint['config']['dataset'], 'meta.pkl')
+    load_meta = os.path.exists(meta_path)
+if load_meta:
+    print(f"Loading meta from {meta_path}...")
+    with open(meta_path, 'rb') as f:
+        meta = pickle.load(f)
+    # TODO want to make this more general to arbitrary encoder/decoder schemes
+    stoi, itos = meta['stoi'], meta['itos']
+    encode = lambda s: [stoi[c] for c in s]
+    decode = lambda l: ''.join([itos[i] for i in l])
+else:
+    # ok let's assume gpt-2 encodings by default
+    print("No meta.pkl found, assuming GPT-2 encodings...")
+    enc = tiktoken.get_encoding("gpt2")
+    encode = lambda s: enc.encode(s, allowed_special={"<|endoftext|>"})
+    decode = lambda l: enc.decode(l)
+# encode the beginning of the prompt
+if start.startswith('FILE:'):
+    with open(start[5:], 'r', encoding='utf-8') as f:
+        start = f.read()
+start_ids = encode(start)
+x = (torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...])
+# run generation
+with torch.no_grad():
+    with ctx:
+        for k in range(num_samples):
+            y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
+            print(decode(y[0].tolist()))
+            print('---------------')
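
The `temperature` and `top_k` settings in `sample.py` reshape the next-token distribution before sampling. The sketch below reproduces that filtering on a plain list of logits. It is an illustration in standard-library Python, not the tensor code used by `generate`, and the function name `top_k_probs` is hypothetical.

```python
import math

def top_k_probs(logits: list[float], temperature: float, top_k: int) -> list[float]:
    """Scale logits by temperature, keep the top_k largest, and softmax the rest."""
    scaled = [x / temperature for x in logits]
    k = min(top_k, len(scaled))
    threshold = sorted(scaled, reverse=True)[k - 1]   # k-th largest value
    # Values below the threshold get a logit of -inf, i.e. probability 0.
    kept = [x if x >= threshold else -math.inf for x in scaled]
    peak = max(kept)
    exps = [math.exp(x - peak) for x in kept]          # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

probs = top_k_probs([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_k=2)
print([round(p, 3) for p in probs])
```

A temperature below 1.0 widens the gap between the surviving logits, so the most likely token takes a larger share of the probability mass. Tokens outside the top k can never be sampled.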

scaling_laws.ipynb ADDED

The diff for this file is too large to render. See raw diff

train.py ADDED

	@@ -0,0 +1,336 @@

+"""
+This training script can be run both on a single gpu in debug mode,
+and also in a larger training run with distributed data parallel (ddp).
+To run on a single GPU, example:
+$ python train.py --batch_size=32 --compile=False
+To run with DDP on 4 gpus on 1 node, example:
+$ torchrun --standalone --nproc_per_node=4 train.py
+To run with DDP on 8 gpus per node across 2 nodes (16 gpus in total), example:
+- Run on the first (master) node with example IP 123.456.123.456:
+$ torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=123.456.123.456 --master_port=1234 train.py
+- Run on the worker node:
+$ torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=123.456.123.456 --master_port=1234 train.py
+(If your cluster does not have Infiniband interconnect prepend NCCL_IB_DISABLE=1)
+"""
+import os
+import time
+import math
+import pickle
+from contextlib import nullcontext
+import numpy as np
+import torch
+from torch.nn.parallel import DistributedDataParallel as DDP
+from torch.distributed import init_process_group, destroy_process_group
+from model import GPTConfig, GPT
+# -----------------------------------------------------------------------------
+# default config values designed to train a gpt2 (124M) on OpenWebText
+# I/O
+out_dir = 'out'
+eval_interval = 2000
+log_interval = 1
+eval_iters = 200
+eval_only = False # if True, script exits right after the first eval
+always_save_checkpoint = True # if True, always save a checkpoint after each eval
+init_from = 'scratch' # 'scratch' or 'resume' or 'gpt2*'
+# wandb logging
+wandb_log = False # disabled by default
+wandb_project = 'owt'
+wandb_run_name = 'gpt2' # 'run' + str(time.time())
+# data
+dataset = 'openwebtext'
+gradient_accumulation_steps = 5 * 8 # used to simulate larger batch sizes
+batch_size = 12 # if gradient_accumulation_steps > 1, this is the micro-batch size
+block_size = 1024
+# model
+n_layer = 12
+n_head = 12
+n_embd = 768
+dropout = 0.0 # for pretraining 0 is good, for finetuning try 0.1+
+bias = False # do we use bias inside LayerNorm and Linear layers?
+# adamw optimizer
+learning_rate = 6e-4 # max learning rate
+max_iters = 600000 # total number of training iterations
+weight_decay = 1e-1
+beta1 = 0.9
+beta2 = 0.95
+grad_clip = 1.0 # clip gradients at this value, or disable if == 0.0
+# learning rate decay settings
+decay_lr = True # whether to decay the learning rate
+warmup_iters = 2000 # how many steps to warm up for
+lr_decay_iters = 600000 # should be ~= max_iters per Chinchilla
+min_lr = 6e-5 # minimum learning rate, should be ~= learning_rate/10 per Chinchilla
+# DDP settings
+backend = 'nccl' # 'nccl', 'gloo', etc.
+# system
+device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1' etc., or try 'mps' on macbooks
+dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16' # 'float32', 'bfloat16', or 'float16', the latter will auto implement a GradScaler
+compile = True # use PyTorch 2.0 to compile the model to be faster
+# -----------------------------------------------------------------------------
+config_keys = [k for k,v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]
+exec(open('configurator.py').read()) # overrides from command line or config file
+config = {k: globals()[k] for k in config_keys} # will be useful for logging
+# -----------------------------------------------------------------------------
+# various inits, derived attributes, I/O setup
+ddp = int(os.environ.get('RANK', -1)) != -1 # is this a ddp run?
+if ddp:
+    init_process_group(backend=backend)
+    ddp_rank = int(os.environ['RANK'])
+    ddp_local_rank = int(os.environ['LOCAL_RANK'])
+    ddp_world_size = int(os.environ['WORLD_SIZE'])
+    device = f'cuda:{ddp_local_rank}'
+    torch.cuda.set_device(device)
+    master_process = ddp_rank == 0 # this process will do logging, checkpointing etc.
+    seed_offset = ddp_rank # each process gets a different seed
+    # world_size number of processes will be training simultaneously, so we can scale
+    # down the desired gradient accumulation iterations per process proportionally
+    assert gradient_accumulation_steps % ddp_world_size == 0
+    gradient_accumulation_steps //= ddp_world_size
+else:
+    # if not ddp, we are running on a single gpu, and one process
+    master_process = True
+    seed_offset = 0
+    ddp_world_size = 1
+tokens_per_iter = gradient_accumulation_steps * ddp_world_size * batch_size * block_size
+print(f"tokens per iteration will be: {tokens_per_iter:,}")
+if master_process:
+    os.makedirs(out_dir, exist_ok=True)
+torch.manual_seed(1337 + seed_offset)
+torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
+torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
+device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
+# note: float16 data type will automatically use a GradScaler
+ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
+ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)
+# poor man's data loader
+data_dir = os.path.join('data', dataset)
+def get_batch(split):
+    # We recreate np.memmap every batch to avoid a memory leak, as per
+    # https://stackoverflow.com/questions/45132940/numpy-memmap-memory-usage-want-to-iterate-once/61472122#61472122
+    if split == 'train':
+        data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')
+    else:
+        data = np.memmap(os.path.join(data_dir, 'val.bin'), dtype=np.uint16, mode='r')
+    ix = torch.randint(len(data) - block_size, (batch_size,))
+    x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
+    y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
+    if device_type == 'cuda':
+        # pin arrays x,y, which allows us to move them to GPU asynchronously (non_blocking=True)
+        x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
+    else:
+        x, y = x.to(device), y.to(device)
+    return x, y
+# init these up here, can override if init_from='resume' (i.e. from a checkpoint)
+iter_num = 0
+best_val_loss = 1e9
+# attempt to derive vocab_size from the dataset
+meta_path = os.path.join(data_dir, 'meta.pkl')
+meta_vocab_size = None
+if os.path.exists(meta_path):
+    with open(meta_path, 'rb') as f:
+        meta = pickle.load(f)
+    meta_vocab_size = meta['vocab_size']
+    print(f"found vocab_size = {meta_vocab_size} (inside {meta_path})")
+# model init
+model_args = dict(n_layer=n_layer, n_head=n_head, n_embd=n_embd, block_size=block_size,
+                  bias=bias, vocab_size=None, dropout=dropout) # start with model_args from command line
+if init_from == 'scratch':
+    # init a new model from scratch
+    print("Initializing a new model from scratch")
+    # determine the vocab size we'll use for from-scratch training
+    if meta_vocab_size is None:
+        print("defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)")
+    model_args['vocab_size'] = meta_vocab_size if meta_vocab_size is not None else 50304
+    gptconf = GPTConfig(**model_args)
+    model = GPT(gptconf)
+elif init_from == 'resume':
+    print(f"Resuming training from {out_dir}")
+    # resume training from a checkpoint.
+    ckpt_path = os.path.join(out_dir, 'ckpt.pt')
+    checkpoint = torch.load(ckpt_path, map_location=device)
+    checkpoint_model_args = checkpoint['model_args']
+    # force these config attributes to be equal otherwise we can't even resume training
+    # the rest of the attributes (e.g. dropout) can stay as desired from command line
+    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
+        model_args[k] = checkpoint_model_args[k]
+    # create the model
+    gptconf = GPTConfig(**model_args)
+    model = GPT(gptconf)
+    state_dict = checkpoint['model']
+    # fix the keys of the state dictionary: a model wrapped by torch.compile stores the
+    # original module as `_orig_mod`, so its state_dict keys carry that prefix; strip it
+    unwanted_prefix = '_orig_mod.'
+    for k,v in list(state_dict.items()):
+        if k.startswith(unwanted_prefix):
+            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
+    model.load_state_dict(state_dict)
+    iter_num = checkpoint['iter_num']
+    best_val_loss = checkpoint['best_val_loss']
+elif init_from.startswith('gpt2'):
+    print(f"Initializing from OpenAI GPT-2 weights: {init_from}")
+    # initialize from OpenAI GPT-2 weights
+    override_args = dict(dropout=dropout)
+    model = GPT.from_pretrained(init_from, override_args)
+    # read off the created config params, so we can store them into checkpoint correctly
+    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
+        model_args[k] = getattr(model.config, k)
+# crop down the model block size if desired, using model surgery
+if block_size < model.config.block_size:
+    model.crop_block_size(block_size)
+    model_args['block_size'] = block_size # so that the checkpoint will have the right value
+model.to(device)
+# initialize a GradScaler. If enabled=False scaler is a no-op
+scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))
+# optimizer
+optimizer = model.configure_optimizers(weight_decay, learning_rate, (beta1, beta2), device_type)
+if init_from == 'resume':
+    optimizer.load_state_dict(checkpoint['optimizer'])
+checkpoint = None # free up memory
+# compile the model
+if compile:
+    print("compiling the model... (takes a ~minute)")
+    unoptimized_model = model
+    model = torch.compile(model) # requires PyTorch 2.0
+# wrap model into DDP container
+if ddp:
+    model = DDP(model, device_ids=[ddp_local_rank])
+# helps estimate an arbitrarily accurate loss over either split using many batches
+@torch.no_grad()
+def estimate_loss():
+    out = {}
+    model.eval()
+    for split in ['train', 'val']:
+        losses = torch.zeros(eval_iters)
+        for k in range(eval_iters):
+            X, Y = get_batch(split)
+            with ctx:
+                logits, loss = model(X, Y)
+            losses[k] = loss.item()
+        out[split] = losses.mean()
+    model.train()
+    return out
+# learning rate decay scheduler (cosine with warmup)
+def get_lr(it):
+    # 1) linear warmup for warmup_iters steps
+    if it < warmup_iters:
+        return learning_rate * (it + 1) / (warmup_iters + 1)
+    # 2) if it > lr_decay_iters, return min learning rate
+    if it > lr_decay_iters:
+        return min_lr
+    # 3) in between, use cosine decay down to min learning rate
+    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
+    assert 0 <= decay_ratio <= 1
+    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff ranges 0..1
+    return min_lr + coeff * (learning_rate - min_lr)
+# logging
+if wandb_log and master_process:
+    import wandb
+    wandb.init(project=wandb_project, name=wandb_run_name, config=config)
+# training loop
+X, Y = get_batch('train') # fetch the very first batch
+t0 = time.time()
+local_iter_num = 0 # number of iterations in the lifetime of this process
+raw_model = model.module if ddp else model # unwrap DDP container if needed
+running_mfu = -1.0
+while True:
+    # determine and set the learning rate for this iteration
+    lr = get_lr(iter_num) if decay_lr else learning_rate
+    for param_group in optimizer.param_groups:
+        param_group['lr'] = lr
+    # evaluate the loss on train/val sets and write checkpoints
+    if iter_num % eval_interval == 0 and master_process:
+        losses = estimate_loss()
+        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
+        if wandb_log:
+            wandb.log({
+                "iter": iter_num,
+                "train/loss": losses['train'],
+                "val/loss": losses['val'],
+                "lr": lr,
+                "mfu": running_mfu*100, # convert to percentage
+            })
+        if losses['val'] < best_val_loss or always_save_checkpoint:
+            best_val_loss = losses['val']
+            if iter_num > 0:
+                checkpoint = {
+                    'model': raw_model.state_dict(),
+                    'optimizer': optimizer.state_dict(),
+                    'model_args': model_args,
+                    'iter_num': iter_num,
+                    'best_val_loss': best_val_loss,
+                    'config': config,
+                }
+                print(f"saving checkpoint to {out_dir}")
+                torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))
+    if iter_num == 0 and eval_only:
+        break
+    # forward backward update, with optional gradient accumulation to simulate larger batch size
+    # and using the GradScaler if data type is float16
+    for micro_step in range(gradient_accumulation_steps):
+        if ddp:
+            # in DDP training we only need to sync gradients at the last micro step.
+            # the official way to do this is with model.no_sync() context manager, but
+            # I really dislike that this bloats the code and forces us to repeat code
+            # looking at the source of that context manager, it just toggles this variable
+            model.require_backward_grad_sync = (micro_step == gradient_accumulation_steps - 1)
+        with ctx:
+            logits, loss = model(X, Y)
+            loss = loss / gradient_accumulation_steps # scale the loss to account for gradient accumulation
+        # immediately async prefetch next batch while model is doing the forward pass on the GPU
+        X, Y = get_batch('train')
+        # backward pass, with gradient scaling if training in fp16
+        scaler.scale(loss).backward()
+    # clip the gradient
+    if grad_clip != 0.0:
+        scaler.unscale_(optimizer)
+        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
+    # step the optimizer and scaler if training in fp16
+    scaler.step(optimizer)
+    scaler.update()
+    # flush the gradients as soon as we can, no need for this memory anymore
+    optimizer.zero_grad(set_to_none=True)
+    # timing and logging
+    t1 = time.time()
+    dt = t1 - t0
+    t0 = t1
+    if iter_num % log_interval == 0 and master_process:
+        # get loss as float. note: this is a CPU-GPU sync point
+        # scale up to undo the division above, approximating the true total loss (exact would have been a sum)
+        lossf = loss.item() * gradient_accumulation_steps
+        if local_iter_num >= 5: # let the training loop settle a bit
+            mfu = raw_model.estimate_mfu(batch_size * gradient_accumulation_steps, dt)
+            running_mfu = mfu if running_mfu == -1.0 else 0.9*running_mfu + 0.1*mfu
+        print(f"iter {iter_num}: loss {lossf:.4f}, time {dt*1000:.2f}ms, mfu {running_mfu*100:.2f}%")
+    iter_num += 1
+    local_iter_num += 1
+    # termination conditions
+    if iter_num > max_iters:
+        break
+if ddp:
+    destroy_process_group()
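
The learning-rate schedule in `train.py` has three phases: linear warmup, cosine decay, and a constant floor. The standalone sketch below copies the formula and the default values from the config block above so the schedule can be inspected without starting a training run. Only the choice of printed steps is new.

```python
import math

# Defaults copied from the config block of train.py.
learning_rate = 6e-4
min_lr = 6e-5
warmup_iters = 2000
lr_decay_iters = 600000

def get_lr(it: int) -> float:
    """Linear warmup, then cosine decay to min_lr, then constant min_lr."""
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 down to 0
    return min_lr + coeff * (learning_rate - min_lr)

# Start of warmup, end of warmup, midpoint of the decay, end of the decay.
for step in (0, warmup_iters, (warmup_iters + lr_decay_iters) // 2, lr_decay_iters):
    print(step, f"{get_lr(step):.2e}")
```

The rate reaches its peak of `learning_rate` at the end of warmup, passes the midpoint between `learning_rate` and `min_lr` halfway through the decay, and stays at `min_lr` after `lr_decay_iters`.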

train_baseline.py ADDED

	@@ -0,0 +1,228 @@

+"""
+Training script for baseline NanoGPT model on enwik8 dataset.
+Ensures proper bpc calculation and comparable evaluation with DTAT.
+"""
+import os
+import time
+import math
+import numpy as np
+import torch
+import torch.nn.functional as F
+from torch.nn.parallel import DistributedDataParallel as DDP
+from torch.distributed import init_process_group, destroy_process_group
+from contextlib import nullcontext
+import wandb
+from model import GPT, GPTConfig
+def get_batch(data, block_size, batch_size, device):
+    """Generate a small batch of data of inputs x and targets y."""
+    ix = torch.randint(len(data) - block_size, (batch_size,))
+    x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
+    y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
+    x, y = x.to(device), y.to(device)
+    return x, y
+def estimate_loss(model, data, eval_iters, block_size, batch_size, device):
+    """Estimate loss on data split, ensuring proper bpc calculation."""
+    model.eval()
+    losses = torch.zeros(eval_iters)
+    for k in range(eval_iters):
+        X, Y = get_batch(data, block_size, batch_size, device)
+        with torch.no_grad():
+            logits, loss = model(X, Y)
+            # Convert from nats to bpc
+            loss = loss / math.log(2)
+            losses[k] = loss.item()
+    out = losses.mean()
+    model.train()
+    return out
+def get_lr(it, config):
+    """Get learning rate based on iteration."""
+    if it < config.warmup_iters:
+        return config.learning_rate * it / config.warmup_iters
+    if it > config.lr_decay_iters:
+        return config.min_lr
+    decay_ratio = (it - config.warmup_iters) / (config.lr_decay_iters - config.warmup_iters)
+    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
+    return config.min_lr + coeff * (config.learning_rate - config.min_lr)
+def main():
+    # Initialize distributed training if needed
+    ddp = int(os.environ.get('RANK', -1)) != -1
+    if ddp:
+        init_process_group(backend='nccl')
+        ddp_rank = int(os.environ['RANK'])
+        ddp_local_rank = int(os.environ['LOCAL_RANK'])
+        device = f'cuda:{ddp_local_rank}'
+        master_process = ddp_rank == 0
+        seed_offset = ddp_rank
+    else:
+        device = 'cuda' if torch.cuda.is_available() else 'cpu'
+        master_process = True
+        seed_offset = 0
+    torch.manual_seed(1337 + seed_offset)
+    torch.backends.cuda.matmul.allow_tf32 = True
+    torch.backends.cudnn.allow_tf32 = True
+    device_type = 'cuda' if 'cuda' in device else 'cpu'
+    # Model configuration (matching paper's 44M parameter target)
+    config = GPTConfig(
+        block_size=1024,
+        vocab_size=256,  # byte-level vocab
+        n_layer=12,
+        n_head=8,
+        n_embd=512,
+        dropout=0.1,
+        bias=False,
+        # Training specific
+        learning_rate=6e-4,
+        min_lr=6e-5,
+        warmup_iters=2000,
+        lr_decay_iters=100000,
+        max_iters=100000,
+        eval_interval=500,
+        eval_iters=200,
+        batch_size=32,
+    )
+    # Initialize wandb for baseline model
+    if master_process:
+        wandb.init(
+            project="enwik8-baseline",
+            config={
+                "architecture": "NanoGPT-Baseline",
+                "dataset": "enwik8",
+                "batch_size": config.batch_size,
+                "learning_rate": config.learning_rate,
+                "warmup_iters": config.warmup_iters,
+                "block_size": config.block_size,
+                "n_layer": config.n_layer,
+                "n_head": config.n_head,
+                "n_embd": config.n_embd,
+                "dropout": config.dropout,
+            }
+        )
+    # Data loading
+    print("Loading data...")
+    data_dir = os.path.join('data')
+    train_data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint8, mode='r')
+    val_data = np.memmap(os.path.join(data_dir, 'val.bin'), dtype=np.uint8, mode='r')
+    # Model initialization
+    print("Initializing model...")
+    model = GPT(config)
+    model.to(device)
+    # Optimizer
+    optimizer = torch.optim.AdamW(
+        model.parameters(),
+        lr=config.learning_rate,
+        betas=(0.9, 0.95),
+        weight_decay=0.1,
+    )
+    if ddp:
+        model = DDP(model, device_ids=[ddp_local_rank])
+    # Training loop
+    print("Starting training...")
+    best_val_loss = float('inf')
+    iter_num = 0
+    while True:
+        lr = get_lr(iter_num, config)
+        for param_group in optimizer.param_groups:
+            param_group['lr'] = lr
+        # Get batch and timing
+        t0 = time.time()
+        X, Y = get_batch(train_data, config.block_size, config.batch_size, device)
+        # Forward pass
+        logits, loss = model(X, Y)
+        # Convert loss to bpc
+        loss = loss / math.log(2)
+        # Backward pass
+        optimizer.zero_grad(set_to_none=True)
+        loss.backward()
+        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+        optimizer.step()
+        # Timing and logging
+        t1 = time.time()
+        dt = t1 - t0
+        if iter_num % 100 == 0 and master_process:
+            # Log metrics to wandb
+            metrics = {
+                "train/loss": loss.item(),
+                "train/bpc": loss.item(),
+                "train/grad_norm": grad_norm.item(),
+                "train/learning_rate": lr,
+                "train/tokens_per_sec": config.batch_size * config.block_size / dt,
+                "train/iteration": iter_num,
+            }
+            wandb.log(metrics)
+            print(f"iter {iter_num}: loss {loss.item():.4f}, bpc {loss.item():.4f}, "
+                  f"grad_norm {grad_norm:.2f}, lr {lr:.2e}")
+        # Evaluation
+        if iter_num % config.eval_interval == 0:
+            val_loss = estimate_loss(
+                model, val_data, config.eval_iters,
+                config.block_size, config.batch_size, device
+            )
+            if master_process:
+                # Log validation metrics
+                val_metrics = {
+                    "val/loss": val_loss,
+                    "val/bpc": val_loss,
+                    "val/iteration": iter_num,
+                }
+                wandb.log(val_metrics)
+                print(f"step {iter_num}: val loss {val_loss:.4f}, val bpc {val_loss:.4f}")
+            # Save best model
+            if val_loss < best_val_loss:
+                best_val_loss = val_loss
+                if master_process:
+                    print(f"Saving best model with val_loss: {best_val_loss:.4f}")
+                    checkpoint = {
+                        'model_state_dict': model.state_dict(),
+                        'optimizer_state_dict': optimizer.state_dict(),
+                        'config': config,
+                        'iter_num': iter_num,
+                        'best_val_loss': best_val_loss,
+                    }
+                    torch.save(checkpoint, 'best_model_baseline.pt')
+                    # Log best model to wandb
+                    wandb.save('best_model_baseline.pt')
+                    wandb.run.summary["best_val_loss"] = best_val_loss
+                    wandb.run.summary["best_val_bpc"] = best_val_loss
+                    wandb.run.summary["best_model_iter"] = iter_num
+        iter_num += 1
+        # End training if we reach max_iters
+        if iter_num > config.max_iters:
+            break
+    # Clean up
+    if ddp:
+        destroy_process_group()
+    if master_process:
+        wandb.finish()
+if __name__ == '__main__':
+    main()
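
The `get_lr` schedule in the script above (linear warmup followed by cosine decay to `min_lr`) can be checked in isolation. The following is a minimal sketch, not part of the commit: the `Schedule` dataclass is a hypothetical stand-in for the config object, with values copied from the baseline script.

```python
import math
from dataclasses import dataclass


@dataclass
class Schedule:
    # Hypothetical stand-in for the training config (values from train_baseline.py).
    learning_rate: float = 6e-4
    min_lr: float = 6e-5
    warmup_iters: int = 2000
    lr_decay_iters: int = 100000


def get_lr(it, config):
    """Linear warmup, then cosine decay down to config.min_lr."""
    if it < config.warmup_iters:
        return config.learning_rate * it / config.warmup_iters
    if it > config.lr_decay_iters:
        return config.min_lr
    decay_ratio = (it - config.warmup_iters) / (config.lr_decay_iters - config.warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 down to 0
    return config.min_lr + coeff * (config.learning_rate - config.min_lr)


if __name__ == "__main__":
    cfg = Schedule()
    # Probe the start, mid-warmup, peak, mid-decay, end of decay, and beyond.
    for it in (0, 1000, 2000, 51000, 100000, 100001):
        print(f"iter {it:6d}: lr {get_lr(it, cfg):.2e}")
```

Note that the formula gives a learning rate of exactly 0 at iteration 0, so the first optimizer step does not change the weights.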

train_dtat.py ADDED

	@@ -0,0 +1,256 @@

+"""
+Training script for Dynamic Token-Aware Transformer (DTAT) on enwik8 dataset.
+Based on NanoGPT's training structure with modifications for token importance awareness.
+"""
+import os
+import time
+import math
+import pickle
+from contextlib import nullcontext
+import numpy as np
+import torch
+from torch.nn.parallel import DistributedDataParallel as DDP
+from torch.distributed import init_process_group, destroy_process_group
+import matplotlib.pyplot as plt
+import wandb
+from model_dtat import DTATTransformer
+from config.dtat_config import get_config
+# -----------------------------------------------------------------------------
+# I/O
+def get_batch(data, block_size, batch_size, device):
+    """Generate a small batch of data of inputs x and targets y."""
+    ix = torch.randint(len(data) - block_size, (batch_size,))
+    x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
+    y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
+    x, y = x.to(device), y.to(device)
+    return x, y
+def compute_freq_table(data, vocab_size=256):
+    """Compute frequency table for the dataset."""
+    freq = np.bincount(data, minlength=vocab_size)
+    return freq / len(data)
+def visualize_importance(importance_scores, tokens, save_path):
+    """Visualize token importance scores and log to wandb."""
+    plt.figure(figsize=(15, 5))
+    plt.bar(range(len(tokens)), importance_scores.squeeze().cpu())
+    plt.title('Token Importance Scores')
+    plt.xlabel('Token Position')
+    plt.ylabel('Importance Score')
+    plt.savefig(save_path)
+    # Log to wandb
+    if wandb.run is not None:
+        wandb.log({"token_importance": wandb.Image(save_path)})
+    plt.close()
+# -----------------------------------------------------------------------------
+# Training
+def estimate_loss(model, data, config):
+    """Estimate the mean loss over config.eval_iters random batches of data."""
+    model.eval()
+    losses = torch.zeros(config.eval_iters)
+    for k in range(config.eval_iters):
+        X, Y = get_batch(data, config.block_size, config.batch_size, config.device)
+        with torch.no_grad():
+            logits, loss, _ = model(X, Y)
+            losses[k] = loss.item()
+    out = losses.mean().item()  # plain float, safe for logging and checkpointing
+    model.train()
+    return out
+def get_lr(it, config):
+    # 1) Linear warmup for warmup_iters steps
+    if it < config.warmup_iters:
+        return config.learning_rate * it / config.warmup_iters
+    # 2) If it > lr_decay_iters, return min learning rate
+    if it > config.lr_decay_iters:
+        return config.min_lr
+    # 3) In between, use cosine decay down to min learning rate
+    decay_ratio = (it - config.warmup_iters) / (config.lr_decay_iters - config.warmup_iters)
+    assert 0 <= decay_ratio <= 1
+    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff ranges 0..1
+    return config.min_lr + coeff * (config.learning_rate - config.min_lr)
+def main():
+    # Get config first: the DDP branch below reads and rescales config.batch_size
+    config = get_config()
+    # Initialize distributed training if needed
+    ddp = int(os.environ.get('RANK', -1)) != -1
+    if ddp:
+        init_process_group(backend='nccl')
+        ddp_rank = int(os.environ['RANK'])
+        ddp_local_rank = int(os.environ['LOCAL_RANK'])
+        device = f'cuda:{ddp_local_rank}'
+        master_process = ddp_rank == 0
+        seed_offset = ddp_rank
+        assert config.batch_size % torch.cuda.device_count() == 0
+        config.batch_size = config.batch_size // torch.cuda.device_count()
+    else:
+        device = 'cuda' if torch.cuda.is_available() else 'cpu'
+        master_process = True
+        seed_offset = 0
+    # Set seed for reproducibility
+    torch.manual_seed(1337 + seed_offset)
+    torch.backends.cuda.matmul.allow_tf32 = True
+    torch.backends.cudnn.allow_tf32 = True
+    device_type = 'cuda' if 'cuda' in device else 'cpu'
+    config.device = device
+    # Initialize wandb
+    if master_process:
+        wandb.init(
+            project="enwik8-dtat",
+            config={
+                "architecture": "DTAT",
+                "dataset": "enwik8",
+                "batch_size": config.batch_size,
+                "learning_rate": config.learning_rate,
+                "warmup_iters": config.warmup_iters,
+                "block_size": config.block_size,
+                "n_layer": config.n_layer,
+                "n_head": config.n_head,
+                "n_embd": config.n_embd,
+                "dropout": config.dropout,
+                "sparse_topk": config.sparse_topk,
+            }
+        )
+    # Data loading
+    print("Loading data...")
+    data_dir = os.path.join('data')
+    train_data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint8, mode='r')
+    val_data = np.memmap(os.path.join(data_dir, 'val.bin'), dtype=np.uint8, mode='r')
+    # Compute frequency table for the training data
+    freq_table = compute_freq_table(train_data)
+    # Model init
+    print("Initializing model...")
+    model = DTATTransformer(config)
+    model.to(device)
+    # Optimizer
+    optimizer = torch.optim.AdamW(
+        model.parameters(),
+        lr=config.learning_rate,
+        betas=(config.beta1, config.beta2),
+        weight_decay=config.weight_decay
+    )
+    if ddp:
+        model = DDP(model, device_ids=[ddp_local_rank])
+    # Training loop
+    print("Starting training...")
+    best_val_loss = float('inf')
+    iter_num = 0
+    while True:
+        lr = get_lr(iter_num, config) if config.decay_lr else config.learning_rate
+        for param_group in optimizer.param_groups:
+            param_group['lr'] = lr
+        # Get batch
+        t0 = time.time()
+        X, Y = get_batch(train_data, config.block_size, config.batch_size, device)
+        # Forward pass
+        logits, loss, importance_scores = model(X, Y)
+        # Calculate additional metrics
+        importance_mean = importance_scores.mean().item()
+        importance_std = importance_scores.std().item()
+        # Backward pass
+        optimizer.zero_grad(set_to_none=True)
+        loss.backward()
+        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_clip)
+        optimizer.step()
+        # Timing and logging
+        t1 = time.time()
+        dt = t1 - t0
+        if iter_num % config.log_interval == 0 and master_process:
+            # Log metrics to wandb
+            metrics = {
+                "train/loss": loss.item(),
+                "train/bpc": loss.item(),
+                "train/importance_mean": importance_mean,
+                "train/importance_std": importance_std,
+                "train/grad_norm": grad_norm.item(),
+                "train/learning_rate": lr,
+                "train/tokens_per_sec": config.batch_size * config.block_size / dt,
+                "train/iteration": iter_num,
+            }
+            wandb.log(metrics)
+            print(f"iter {iter_num}: loss {loss.item():.4f}, bpc {loss.item():.4f}, "
+                  f"importance_mean {importance_mean:.3f}, grad_norm {grad_norm:.2f}")
+            # Visualize importance scores periodically
+            if iter_num % (config.log_interval * 10) == 0:
+                visualize_importance(
+                    importance_scores[0],
+                    X[0].cpu().numpy(),
+                    f'importance_scores_iter_{iter_num}.png'
+                )
+        # Evaluation
+        if iter_num % config.eval_interval == 0:
+            val_loss = estimate_loss(model, val_data, config)
+            # Log validation metrics
+            if master_process:
+                val_metrics = {
+                    "val/loss": val_loss,
+                    "val/bpc": val_loss,
+                    "val/iteration": iter_num,
+                }
+                wandb.log(val_metrics)
+                print(f"step {iter_num}: val loss {val_loss:.4f}, val bpc {val_loss:.4f}")
+            # Save best model
+            if val_loss < best_val_loss:
+                best_val_loss = val_loss
+                if master_process:
+                    print(f"Saving best model with val_loss: {best_val_loss:.4f}")
+                    checkpoint = {
+                        'model_state_dict': model.state_dict(),
+                        'optimizer_state_dict': optimizer.state_dict(),
+                        'config': config,
+                        'iter_num': iter_num,
+                        'best_val_loss': best_val_loss,
+                    }
+                    torch.save(checkpoint, 'best_model_dtat.pt')
+                    # Log best model to wandb
+                    wandb.save('best_model_dtat.pt')
+                    wandb.run.summary["best_val_loss"] = best_val_loss
+                    wandb.run.summary["best_val_bpc"] = best_val_loss
+                    wandb.run.summary["best_model_iter"] = iter_num
+        iter_num += 1
+        # End training if we reach max_iters
+        if iter_num > config.max_iters:
+            break
+    # Clean up
+    if ddp:
+        destroy_process_group()
+    if master_process:
+        wandb.finish()
+if __name__ == '__main__':
+    main()
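
In the script above, `compute_freq_table` is called in `main()`, but `freq_table` is not used afterwards in the code shown. One possible use, offered here only as an assumption about intent, is a sanity baseline: the entropy of the byte unigram distribution is the bpc of a predictor that ignores context. The sketch below is a dependency-free analogue of the `np.bincount` version; `unigram_entropy_bits` is a hypothetical helper, not part of the commit.

```python
import math


def compute_freq_table(data, vocab_size=256):
    """Relative frequency of each byte value (pure-Python analogue of np.bincount)."""
    counts = [0] * vocab_size
    for b in data:
        counts[b] += 1
    return [c / len(data) for c in counts]


def unigram_entropy_bits(freq):
    """Shannon entropy in bits per symbol; zero-probability symbols contribute nothing."""
    return -sum(p * math.log2(p) for p in freq if p > 0)


if __name__ == "__main__":
    # Toy input; with real data this would be the bytes of train.bin.
    sample = b"the quick brown fox jumps over the lazy dog"
    freq = compute_freq_table(sample)
    print(f"unigram entropy: {unigram_entropy_bits(freq):.4f} bits per byte")
```

A trained model whose validation bpc is not clearly below this unigram figure is probably not using context at all.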

train_enwik8.py ADDED

	@@ -0,0 +1,114 @@

+import os
+import time
+import math
+import torch
+from torch.nn import functional as F
+from model_modified import GPTModified
+import numpy as np
+from contextlib import nullcontext
+# Import configurations
+from config.char_config import config
+def load_data(split):
+    """Load binary data from split file."""
+    filename = os.path.join('data', f'{split}.bin')
+    with open(filename, 'rb') as f:
+        data = np.fromfile(f, dtype=np.uint8)
+    return data
+def get_batch(data, block_size, batch_size, device):
+    """Generate a small batch of data of inputs x and targets y."""
+    ix = torch.randint(len(data) - block_size, (batch_size,))
+    x = torch.stack([torch.from_numpy(data[i:i+block_size].astype(np.int64)) for i in ix])
+    y = torch.stack([torch.from_numpy(data[i+1:i+1+block_size].astype(np.int64)) for i in ix])
+    x, y = x.to(device), y.to(device)
+    return x, y
+def estimate_loss(model, data, eval_iters, block_size, batch_size, device):
+    """Estimate the mean loss (in nats) over eval_iters random batches of data."""
+    model.eval()
+    losses = torch.zeros(eval_iters)
+    for k in range(eval_iters):
+        X, Y = get_batch(data, block_size, batch_size, device)
+        with torch.no_grad():
+            logits, loss = model(X, Y)
+            losses[k] = loss.item()
+    out = losses.mean().item()  # plain float, safe for logging and checkpointing
+    model.train()
+    return out
+def convert_to_bpc(loss):
+    """Convert from natural log (nats) to bits per character (bpc)."""
+    return loss / math.log(2)
+def main():
+    # System setup
+    device = 'cuda' if torch.cuda.is_available() else 'cpu'
+    print(f"Using device: {device}")
+    # Map the configured dtype name to the torch dtype used for autocast
+    ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[config['dtype']]
+    ctx = nullcontext() if device == 'cpu' else torch.amp.autocast(device_type=device, dtype=ptdtype)
+    # Data loading
+    print("Loading data...")
+    train_data = load_data('train')
+    val_data = load_data('val')
+    # Model init
+    print("Initializing model...")
+    model = GPTModified(config)
+    model.to(device)
+    # Optimizer
+    optimizer = torch.optim.AdamW(
+        model.parameters(),
+        lr=config['learning_rate'],
+        betas=(config['beta1'], config['beta2']),
+        weight_decay=config['weight_decay']
+    )
+    # Training loop
+    best_val_loss = float('inf')
+    batch_size = 32
+    for iter in range(config['max_iters']):
+        # Sample a batch of data
+        xb, yb = get_batch(train_data, config['block_size'], batch_size, device)
+        # Forward pass
+        with ctx:
+            logits, loss = model(xb, yb)
+            loss_bpc = convert_to_bpc(loss.item())
+        # Backward pass
+        optimizer.zero_grad(set_to_none=True)
+        loss.backward()
+        torch.nn.utils.clip_grad_norm_(model.parameters(), config['grad_clip'])
+        optimizer.step()
+        # Logging
+        if iter % config['log_interval'] == 0:
+            print(f"iter {iter}: train loss {loss_bpc:.4f} bpc")
+        # Evaluation
+        if iter % config['eval_interval'] == 0:
+            val_loss = estimate_loss(model, val_data, config['eval_iters'],
+                                   config['block_size'], batch_size, device)
+            val_bpc = convert_to_bpc(val_loss)
+            print(f"iter {iter}: val loss {val_bpc:.4f} bpc")
+            # Save best model
+            if val_bpc < best_val_loss:
+                best_val_loss = val_bpc
+                torch.save({
+                    'model_state_dict': model.state_dict(),
+                    'optimizer_state_dict': optimizer.state_dict(),
+                    'iter': iter,
+                    'best_val_loss': best_val_loss,
+                }, 'best_model.pt')
+if __name__ == '__main__':
+    main()
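
The `convert_to_bpc` helper above divides a cross-entropy loss in nats by ln 2 to get bits per character. As a worked check, a predictor that assigns uniform probability to all 256 byte values has a loss of ln 256 nats, which is 8 bits. The sketch below assumes nothing beyond the standard library; `cross_entropy_nats` is a hypothetical helper used only for illustration.

```python
import math


def convert_to_bpc(loss_nats):
    """Convert a loss from nats to bits per character."""
    return loss_nats / math.log(2)


def cross_entropy_nats(probs, target):
    """Negative log-likelihood (in nats) of the target index under probs."""
    return -math.log(probs[target])


if __name__ == "__main__":
    vocab_size = 256  # byte-level vocabulary, as in the enwik8 scripts
    uniform = [1.0 / vocab_size] * vocab_size
    loss = cross_entropy_nats(uniform, target=65)
    print(f"loss: {loss:.4f} nats = {convert_to_bpc(loss):.4f} bpc")
```

A reported bpc well above 8 on byte-level data therefore suggests a units problem, such as a loss that was converted twice or never converted.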

transformer_sizing.ipynb ADDED

	@@ -0,0 +1,402 @@

+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Transformer Theoretical Model\n",
+    "\n",
+    "This notebook stores a bunch of analysis about a Transformer, e.g. estimates the number of FLOPs, parameters, peak memory footprint, checkpoint size, etc."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from collections import OrderedDict"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# config_args = {\n",
+    "#     'gpt2':         dict(n_layer=12, n_head=12, n_embd=768),  # 124M params\n",
+    "#     'gpt2-medium':  dict(n_layer=24, n_head=16, n_embd=1024), # 350M params\n",
+    "#     'gpt2-large':   dict(n_layer=36, n_head=20, n_embd=1280), # 774M params\n",
+    "#     'gpt2-xl':      dict(n_layer=48, n_head=25, n_embd=1600), # 1558M params\n",
+    "# }[model_type]\n",
+    "\n",
+    "block_size = 1024\n",
+    "vocab_size = 50257\n",
+    "n_layer = 12\n",
+    "n_head = 12\n",
+    "n_embd = 768\n",
+    "bias = False\n",
+    "assert not bias, \"this notebook assumes bias=False just for simplicity\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "we see: 124337664, expected: 124337664, match: True\n",
+      "name                 params     ratio (%) \n",
+      "emebedding/position      786432     0.6325\n",
+      "embedding/token        38597376    31.0424\n",
+      "embedding              39383808    31.6749\n",
+      "attention/ln                768     0.0006\n",
+      "attention/kqv           1769472     1.4231\n",
+      "attention/proj           589824     0.4744\n",
+      "attention               2360064     1.8981\n",
+      "mlp/ln                      768     0.0006\n",
+      "mlp/ffw                 2359296     1.8975\n",
+      "mlp/proj                2359296     1.8975\n",
+      "mlp                     4719360     3.7956\n",
+      "block                   7079424     5.6937\n",
+      "transformer            84953088    68.3245\n",
+      "ln_f                        768     0.0006\n",
+      "dense                         0     0.0000\n",
+      "total                 124337664   100.0000\n"
+     ]
+    }
+   ],
+   "source": [
+    "def params():\n",
+    "    \"\"\" estimates the number of parameters in the model\"\"\"\n",
+    "    out = OrderedDict()\n",
+    "\n",
+    "    # token and position embeddings\n",
+    "    out['emebedding/position'] = n_embd * block_size\n",
+    "    out['embedding/token'] = n_embd * vocab_size\n",
+    "    out['embedding'] = out['emebedding/position'] + out['embedding/token']\n",
+    "\n",
+    "    # attention blocks\n",
+    "    out['attention/ln'] = n_embd # note, bias=False in our LN\n",
+    "    out['attention/kqv'] = n_embd * 3*n_embd\n",
+    "    out['attention/proj'] = n_embd**2\n",
+    "    out['attention'] = out['attention/ln'] + out['attention/kqv'] + out['attention/proj']\n",
+    "\n",
+    "    # MLP blocks\n",
+    "    ffw_size = 4*n_embd # feed forward size\n",
+    "    out['mlp/ln'] = n_embd\n",
+    "    out['mlp/ffw'] = n_embd * ffw_size\n",
+    "    out['mlp/proj'] = ffw_size * n_embd\n",
+    "    out['mlp'] = out['mlp/ln'] + out['mlp/ffw'] + out['mlp/proj']\n",
+    "    \n",
+    "    # the transformer and the rest of it\n",
+    "    out['block'] = out['attention'] + out['mlp']\n",
+    "    out['transformer'] = n_layer * out['block']\n",
+    "    out['ln_f'] = n_embd # final layernorm\n",
+    "    out['dense'] = 0 # 0 because of parameter sharing. This layer uses the weights from the embedding layer\n",
+    "\n",
+    "    # total\n",
+    "    out['total'] = out['embedding'] + out['transformer'] + out['ln_f'] + out['dense']\n",
+    "\n",
+    "    return out\n",
+    "\n",
+    "# compare our param count to that reported by PyTorch\n",
+    "p = params()\n",
+    "params_total = p['total']\n",
+    "print(f\"we see: {params_total}, expected: {124337664}, match: {params_total == 124337664}\")\n",
+    "# create a header\n",
+    "print(f\"{'name':20s} {'params':10s} {'ratio (%)':10s}\")\n",
+    "for k,v in p.items():\n",
+    "    print(f\"{k:20s} {v:10d} {v/params_total*100:10.4f}\")\n",
+    "    "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "est checkpoint size: 1.49 GB\n",
+      "measured with wc -c ckpt.pt: 1542470366\n",
+      "fluff ratio: 103.38%\n"
+     ]
+    }
+   ],
+   "source": [
+    "# we can now calculate the size of each checkpoint\n",
+    "# params are stored in fp32, and the AdamW optimizer has 2 additional buffers per param for statistics\n",
+    "params_bytes = params_total*4\n",
+    "params_and_buffers_bytes = params_bytes + 2*params_bytes\n",
+    "print(f\"est checkpoint size: {params_and_buffers_bytes/1e9:.2f} GB\")\n",
+    "measured_bytes = 1542470366 # from wc -c ckpt.pt\n",
+    "print(f\"measured with wc -c ckpt.pt: {measured_bytes}\")\n",
+    "print(f\"fluff ratio: {measured_bytes/params_and_buffers_bytes*100:.2f}%\")"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can also estimate the ratio of our GPU memory that will be taken up just by the weights and the buffers inside the AdamW optimizer"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "memory ratio taken up just for parameters: 3.73%\n"
+     ]
+    }
+   ],
+   "source": [
+    "gpu_memory = 40e9 # 40 GB A100 GPU, roughly\n",
+    "print(f\"memory ratio taken up just for parameters: {params_and_buffers_bytes / gpu_memory * 100:.2f}%\")"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "i.e. not that much of the memory for this tiny model, most of the memory is activations (forward and backward). This of course changes dramatically for larger and larger models."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's estimate FLOPs for a single forward pass."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "name                 flops          ratio (%) \n",
+      "attention/kqv            3623878656     1.2426\n",
+      "attention/scores         1610612736     0.5522\n",
+      "attention/reduce         1610612736     0.5522\n",
+      "attention/proj           1207959552     0.4142\n",
+      "attention                8053063680     2.7612\n",
+      "mlp/ffw1                 4831838208     1.6567\n",
+      "mlp/ffw2                 4831838208     1.6567\n",
+      "mlp                      9663676416     3.3135\n",
+      "block                   17716740096     6.0747\n",
+      "transformer            212600881152    72.8963\n",
+      "dense                   79047426048    27.1037\n",
+      "forward_total          291648307200   100.0000\n",
+      "backward_total         583296614400   200.0000\n",
+      "total                  874944921600   300.0000\n"
+     ]
+    }
+   ],
+   "source": [
+    "def flops():\n",
+    "    # we only count Weight FLOPs, all other layers (LayerNorm, Softmax, etc) are effectively irrelevant\n",
+    "    # we count actual FLOPs, not MACs. Hence 2* all over the place\n",
+    "    # basically for any matrix multiply A (BxC) @ B (CxD) -> (BxD) flops are 2*B*C*D\n",
+    "\n",
+    "    out = OrderedDict()\n",
+    "    head_size = n_embd // n_head\n",
+    "\n",
+    "    # attention blocks\n",
+    "    # 1) the projection to key, query, values\n",
+    "    out['attention/kqv'] = 2 * block_size * (n_embd * 3*n_embd)\n",
+    "    # 2) calculating the attention scores\n",
+    "    out['attention/scores'] = 2 * block_size * block_size * n_embd\n",
+    "    # 3) the reduction of the values (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)\n",
+    "    out['attention/reduce'] = 2 * n_head * (block_size * block_size * head_size)\n",
+    "    # 4) the final linear projection\n",
+    "    out['attention/proj'] = 2 * block_size * (n_embd * n_embd)\n",
+    "    out['attention'] = sum(out['attention/'+k] for k in ['kqv', 'scores', 'reduce', 'proj'])\n",
+    "\n",
+    "    # MLP blocks\n",
+    "    ffw_size = 4*n_embd # feed forward size\n",
+    "    out['mlp/ffw1'] = 2 * block_size * (n_embd * ffw_size)\n",
+    "    out['mlp/ffw2'] = 2 * block_size * (ffw_size * n_embd)\n",
+    "    out['mlp'] = out['mlp/ffw1'] + out['mlp/ffw2']\n",
+    "\n",
+    "    # the transformer and the rest of it\n",
+    "    out['block'] = out['attention'] + out['mlp']\n",
+    "    out['transformer'] = n_layer * out['block']\n",
+    "    out['dense'] = 2 * block_size * (n_embd * vocab_size)\n",
+    "\n",
+    "    # forward,backward,total\n",
+    "    out['forward_total'] = out['transformer'] + out['dense']\n",
+    "    out['backward_total'] = 2 * out['forward_total'] # use common estimate of bwd = 2*fwd\n",
+    "    out['total'] = out['forward_total'] + out['backward_total']\n",
+    "\n",
+    "    return out\n",
+    "    \n",
+    "# print the FLOPs breakdown, with ratios relative to the forward pass\n",
+    "f = flops()\n",
+    "flops_total = f['forward_total']\n",
+    "print(f\"{'name':20s} {'flops':14s} {'ratio (%)':10s}\")\n",
+    "for k,v in f.items():\n",
+    "    print(f\"{k:20s} {v:14d} {v/flops_total*100:10.4f}\")\n",
+    "    "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "palm_flops: 875062886400, flops: 874944921600, ratio: 1.0001\n"
+     ]
+    }
+   ],
+   "source": [
+    "# now here is an estimate copy pasted from the PaLM paper\n",
+    "# this formula is often used to calculate MFU (model flops utilization)\n",
+    "def palm_flops():\n",
+    "    \"\"\"estimate of the model flops following PaLM paper formula\"\"\"\n",
+    "    # non-embedding model parameters. note that we do not subtract the\n",
+    "    # embedding/token params because those are tied and get used in the last layer.\n",
+    "    N = params()['total'] - params()['emebedding/position']\n",
+    "    L, H, Q, T = n_layer, n_head, n_embd//n_head, block_size\n",
+    "    mf_per_token = 6*N + 12*L*H*Q*T\n",
+    "    mf = mf_per_token * block_size\n",
+    "    return mf\n",
+    "\n",
+    "print(f\"palm_flops: {palm_flops():d}, flops: {flops()['total']:d}, ratio: {palm_flops()/flops()['total']:.4f}\")"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Ok they are quite similar, giving some confidence that my math in flops() function was ~ok. Now, A100 is cited at 312TFLOPS bfloat16 on tensor cores. So what is our model flops utilization (MFU)? I trained the model above with a batch_size of 20 and grad_accum of 5, which runs in about 755ms on a single A100 GPU. We get:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "fraction of A100 used: 37.14%\n"
+     ]
+    }
+   ],
+   "source": [
+    "# here is what we currently roughly measure\n",
+    "batch_size = 20 * 5 # 5 is grad_accum, so total batch size is 100\n",
+    "measured_time = 0.755 # in seconds per iteration\n",
+    "measured_throughput = batch_size / measured_time\n",
+    "flops_achieved = f['total'] * measured_throughput\n",
+    "\n",
+    "# A100 is cited to be 312 TFLOPS of bfloat16 running on tensor cores\n",
+    "a100_flops_promised = 312e12\n",
+    "\n",
+    "# the fraction of the A100 that we are using:\n",
+    "print(f\"fraction of A100 used: {flops_achieved / a100_flops_promised * 100:.2f}%\")"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For reference, we'd prefer to be somewhere around 50%+, and not just for a single GPU but for an entire DDP run. So we still have some work to do, but at least we're within a factor of ~2X of what is achievable with this GPU."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "time needed to train the model: 3.46 days\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Finally let's check out the 6ND approximation as total cost of training in FLOPs\n",
+    "model_size = params()['total'] # this is number of parameters, N\n",
+    "tokens_num = 300e9 # 300B tokens, this is dataset size in tokens, D\n",
+    "a100_flops = 312e12 # 312 TFLOPS\n",
+    "assumed_mfu = 0.3 # assume this model flops utilization (take the current 37% from above and add some DDP overhead)\n",
+    "flops_throughput = a100_flops * 8 * assumed_mfu # assume an 8XA100 node at 30% utilization\n",
+    "flops_needed = 6 * model_size * tokens_num # 6ND\n",
+    "time_needed_s = flops_needed / flops_throughput # in seconds\n",
+    "print(f\"time needed to train the model: {time_needed_s/3600/24:.2f} days\")"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This is not a bad estimate at all. I trained this model and it converged in roughly 4 days. Btw, as a good reference for where 6ND comes from, and for some intuition around it, I recommend [Dzmitry's post](https://medium.com/@dzmitrybahdanau/the-flops-calculus-of-language-model-training-3b19c1f025e4)."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now, FLOPs are just one constraint; the other one we have to keep close track of is memory bandwidth. TODO: estimate the LOAD/STORE costs of our model later."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "pytorch2",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.8"
+  },
+  "orig_nbformat": 4,
+  "vscode": {
+   "interpreter": {
+    "hash": "7f5833218766b48e6e35e4452ee875aac0e2188d05bbe5298f2c62b79f08b222"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
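
As a rough cross-check of the 6ND estimate in the notebook's last code cell, the arithmetic can be reproduced standalone. The parameter count below is an assumption (the commonly quoted GPT-2 small total), since `params()` is defined outside this part of the diff; the remaining values are copied from the cell. This is only a sketch of the calculation, not a replacement for the notebook code.

```python
# Standalone re-computation of the 6ND training-time estimate.

# ASSUMPTION: total parameter count N; the notebook gets this from
# params()['total'], which is not visible in this part of the diff.
N = 124_337_664

D = 300e9            # dataset size in tokens (from the notebook cell)
a100_flops = 312e12  # cited A100 bfloat16 tensor-core peak, in FLOPS
n_gpus = 8           # one 8xA100 node (from the notebook cell)
assumed_mfu = 0.3    # assumed model FLOPs utilization (from the notebook cell)

flops_needed = 6 * N * D                              # the 6ND approximation
flops_throughput = a100_flops * n_gpus * assumed_mfu  # usable FLOPS on the node
days = flops_needed / flops_throughput / 3600 / 24

print(f"time needed to train the model: {days:.2f} days")
```

With the assumed parameter count, the result should land close to the 3.46 days printed in the notebook output.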

wandb/run-20241230_125819-geso4xvw/files/config.yaml ADDED

	@@ -0,0 +1,47 @@

+_wandb:
+    value:
+        cli_version: 0.18.6
+        m: []
+        python_version: 3.11.7
+        t:
+            "1":
+                - 1
+                - 55
+                - 105
+            "2":
+                - 1
+                - 55
+                - 105
+            "3":
+                - 16
+                - 23
+                - 55
+            "4": 3.11.7
+            "5": 0.18.6
+            "8":
+                - 3
+                - 5
+            "12": 0.18.6
+            "13": windows-amd64
+architecture:
+    value: DTAT
+batch_size:
+    value: 32
+block_size:
+    value: 1024
+dataset:
+    value: enwik8
+dropout:
+    value: 0.1
+learning_rate:
+    value: 0.0006
+n_embd:
+    value: 512
+n_head:
+    value: 8
+n_layer:
+    value: 12
+sparse_topk:
+    value: 32
+warmup_iters:
+    value: 2000

wandb/run-20241230_125819-geso4xvw/files/output.log ADDED

	@@ -0,0 +1,21 @@

+Loading data...
+Initializing model...
+Traceback (most recent call last):
+  File "C:\sakana\enwik8-model\train_dtat.py", line 256, in <module>
+    main()
+  File "C:\sakana\enwik8-model\train_dtat.py", line 137, in main
+    model = DTATTransformer(config)
+            ^^^^^^^^^^^^^^^^^^^^^^^
+  File "C:\sakana\enwik8-model\model_dtat.py", line 172, in __init__
+    h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
+                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "C:\sakana\enwik8-model\model_dtat.py", line 172, in <listcomp>
+    h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
+                       ^^^^^^^^^^^^^
+  File "C:\sakana\enwik8-model\model_dtat.py", line 129, in __init__
+    self.attn = SparseDenseAttention(config)
+                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "C:\sakana\enwik8-model\model_dtat.py", line 80, in __init__
+    self.sparse_topk = config.get('sparse_topk', 32)  # Number of tokens to attend to for less important tokens
+                       ^^^^^^^^^^
+AttributeError: 'DTATConfig' object has no attribute 'get'
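
The traceback above ends in an `AttributeError` because `.get` is a `dict` method and the config object is not a dict. A minimal sketch of one possible fix, assuming `DTATConfig` is a plain class or dataclass (its definition is not shown here; `DemoConfig` below is a hypothetical stand-in), is to use the built-in `getattr` with a default:

```python
from dataclasses import dataclass


@dataclass
class DemoConfig:
    """Hypothetical stand-in for DTATConfig; the real fields may differ."""
    n_layer: int = 12
    n_head: int = 8
    n_embd: int = 512


config = DemoConfig()

# Reproduce the failure: attribute-style config objects have no dict-style .get().
try:
    config.get("sparse_topk", 32)
    raised = False
except AttributeError:
    raised = True

# getattr() with a default works for any object, whether or not the field exists.
sparse_topk = getattr(config, "sparse_topk", 32)

print(raised, sparse_topk)
```

An alternative would be to declare `sparse_topk` as a field on the config class itself, so that the default lives in one place.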

wandb/run-20241230_125819-geso4xvw/files/wandb-metadata.json ADDED

	@@ -0,0 +1,43 @@

+{
+  "os":  "Windows-10-10.0.26100-SP0",
+  "python":  "3.11.7",
+  "startedAt":  "2024-12-30T10:58:19.924711Z",
+  "program":  "C:\\sakana\\enwik8-model\\train_dtat.py",
+  "codePath":  "train_dtat.py",
+  "git":  {
+    "remote":  "https://github.com/karpathy/nanoGPT.git",
+    "commit":  "93a43d9a5c22450bbf06e78da2cb6eeef084b717"
+  },
+  "email":  "mitel40181@gholar.com",
+  "root":  "C:\\sakana\\enwik8-model",
+  "host":  "SILX",
+  "username":  "silxs",
+  "executable":  "C:\\fcc-intro-to-llms\\cuda\\Scripts\\python.exe",
+  "codePathLocal":  "train_dtat.py",
+  "cpu_count":  8,
+  "cpu_count_logical":  16,
+  "gpu":  "NVIDIA GeForce RTX 3050 Laptop GPU",
+  "gpu_count":  1,
+  "disk":  {
+    "/":  {
+      "total":  "487147769856",
+      "used":  "485680205824"
+    }
+  },
+  "memory":  {
+    "total":  "16387997696"
+  },
+  "cpu":  {
+    "count":  8,
+    "countLogical":  16
+  },
+  "gpu_nvidia":  [
+    {
+      "name":  "NVIDIA GeForce RTX 3050 Laptop GPU",
+      "memoryTotal":  "4294967296",
+      "cudaCores":  2048,
+      "architecture":  "Ampere"
+    }
+  ],
+  "cudaVersion":  "12.6"
+}

wandb/run-20241230_125819-geso4xvw/files/wandb-summary.json ADDED

	@@ -0,0 +1 @@


+{"_wandb":{"runtime":1}}

wandb/run-20241230_125819-geso4xvw/logs/debug-core.log ADDED

	@@ -0,0 +1,14 @@

+{"time":"2024-12-30T12:58:19.192321+02:00","level":"INFO","msg":"started logging, with flags","port-filename":"C:\\Users\\silxs\\AppData\\Local\\Temp\\tmpf0be7_0z\\port-16680.txt","pid":16680,"debug":false,"disable-analytics":false}
+{"time":"2024-12-30T12:58:19.192321+02:00","level":"INFO","msg":"FeatureState","shutdownOnParentExitEnabled":false}
+{"time":"2024-12-30T12:58:19.1989835+02:00","level":"INFO","msg":"Will exit if parent process dies.","ppid":16680}
+{"time":"2024-12-30T12:58:19.1989835+02:00","level":"INFO","msg":"server is running","addr":{"IP":"127.0.0.1","Port":53467,"Zone":""}}
+{"time":"2024-12-30T12:58:19.3765291+02:00","level":"INFO","msg":"connection: ManageConnectionData: new connection created","id":"127.0.0.1:53468"}
+{"time":"2024-12-30T12:58:19.9252228+02:00","level":"INFO","msg":"handleInformInit: received","streamId":"geso4xvw","id":"127.0.0.1:53468"}
+{"time":"2024-12-30T12:58:20.0386713+02:00","level":"INFO","msg":"handleInformInit: stream started","streamId":"geso4xvw","id":"127.0.0.1:53468"}
+{"time":"2024-12-30T12:58:21.8885719+02:00","level":"INFO","msg":"handleInformTeardown: server teardown initiated","id":"127.0.0.1:53468"}
+{"time":"2024-12-30T12:58:21.8891443+02:00","level":"INFO","msg":"server is shutting down"}
+{"time":"2024-12-30T12:58:21.8891443+02:00","level":"INFO","msg":"connection: Close: initiating connection closure","id":"127.0.0.1:53468"}
+{"time":"2024-12-30T12:58:21.8891443+02:00","level":"INFO","msg":"connection: Close: connection successfully closed","id":"127.0.0.1:53468"}
+{"time":"2024-12-30T12:58:25.467087+02:00","level":"INFO","msg":"handleInformTeardown: server shutdown complete","id":"127.0.0.1:53468"}
+{"time":"2024-12-30T12:58:25.467087+02:00","level":"INFO","msg":"connection: ManageConnectionData: connection closed","id":"127.0.0.1:53468"}
+{"time":"2024-12-30T12:58:25.467087+02:00","level":"INFO","msg":"server is closed"}

wandb/run-20241230_125819-geso4xvw/logs/debug-internal.log ADDED

	@@ -0,0 +1,16 @@

+{"time":"2024-12-30T12:58:19.9262449+02:00","level":"INFO","msg":"using version","core version":"0.18.6"}
+{"time":"2024-12-30T12:58:19.9267933+02:00","level":"INFO","msg":"created symlink","path":"C:\\sakana\\enwik8-model\\wandb\\run-20241230_125819-geso4xvw\\logs\\debug-core.log"}
+{"time":"2024-12-30T12:58:20.0381603+02:00","level":"INFO","msg":"created new stream","id":"geso4xvw"}
+{"time":"2024-12-30T12:58:20.0386713+02:00","level":"INFO","msg":"stream: started","id":"geso4xvw"}
+{"time":"2024-12-30T12:58:20.0386713+02:00","level":"INFO","msg":"handler: started","stream_id":{"value":"geso4xvw"}}
+{"time":"2024-12-30T12:58:20.0386713+02:00","level":"INFO","msg":"sender: started","stream_id":"geso4xvw"}
+{"time":"2024-12-30T12:58:20.0386713+02:00","level":"INFO","msg":"writer: Do: started","stream_id":{"value":"geso4xvw"}}
+{"time":"2024-12-30T12:58:20.9024895+02:00","level":"INFO","msg":"Starting system monitor"}
+{"time":"2024-12-30T12:58:21.8891443+02:00","level":"INFO","msg":"stream: closing","id":"geso4xvw"}
+{"time":"2024-12-30T12:58:21.8891443+02:00","level":"INFO","msg":"Stopping system monitor"}
+{"time":"2024-12-30T12:58:21.8901726+02:00","level":"INFO","msg":"Stopped system monitor"}
+{"time":"2024-12-30T12:58:24.8986622+02:00","level":"INFO","msg":"fileTransfer: Close: file transfer manager closed"}
+{"time":"2024-12-30T12:58:25.4660418+02:00","level":"INFO","msg":"handler: closed","stream_id":{"value":"geso4xvw"}}
+{"time":"2024-12-30T12:58:25.4660418+02:00","level":"INFO","msg":"writer: Close: closed","stream_id":{"value":"geso4xvw"}}
+{"time":"2024-12-30T12:58:25.4660418+02:00","level":"INFO","msg":"sender: closed","stream_id":"geso4xvw"}
+{"time":"2024-12-30T12:58:25.4665528+02:00","level":"INFO","msg":"stream: closed","id":"geso4xvw"}

wandb/run-20241230_125819-geso4xvw/logs/debug.log ADDED

	@@ -0,0 +1,26 @@

+2024-12-30 12:58:19,921 INFO    MainThread:16680 [wandb_setup.py:_flush():79] Current SDK version is 0.18.6
+2024-12-30 12:58:19,921 INFO    MainThread:16680 [wandb_setup.py:_flush():79] Configure stats pid to 16680
+2024-12-30 12:58:19,921 INFO    MainThread:16680 [wandb_setup.py:_flush():79] Loading settings from C:\Users\silxs\.config\wandb\settings
+2024-12-30 12:58:19,921 INFO    MainThread:16680 [wandb_setup.py:_flush():79] Loading settings from C:\sakana\enwik8-model\wandb\settings
+2024-12-30 12:58:19,921 INFO    MainThread:16680 [wandb_setup.py:_flush():79] Loading settings from environment variables: {}
+2024-12-30 12:58:19,921 INFO    MainThread:16680 [wandb_setup.py:_flush():79] Applying setup settings: {'mode': None, '_disable_service': None}
+2024-12-30 12:58:19,921 INFO    MainThread:16680 [wandb_setup.py:_flush():79] Inferring run settings from compute environment: {'program_relpath': 'train_dtat.py', 'program_abspath': 'C:\\sakana\\enwik8-model\\train_dtat.py', 'program': 'C:\\sakana\\enwik8-model\\train_dtat.py'}
+2024-12-30 12:58:19,921 INFO    MainThread:16680 [wandb_setup.py:_flush():79] Applying login settings: {}
+2024-12-30 12:58:19,921 INFO    MainThread:16680 [wandb_init.py:_log_setup():533] Logging user logs to C:\sakana\enwik8-model\wandb\run-20241230_125819-geso4xvw\logs\debug.log
+2024-12-30 12:58:19,921 INFO    MainThread:16680 [wandb_init.py:_log_setup():534] Logging internal logs to C:\sakana\enwik8-model\wandb\run-20241230_125819-geso4xvw\logs\debug-internal.log
+2024-12-30 12:58:19,921 INFO    MainThread:16680 [wandb_init.py:init():619] calling init triggers
+2024-12-30 12:58:19,921 INFO    MainThread:16680 [wandb_init.py:init():626] wandb.init called with sweep_config: {}
+config: {'architecture': 'DTAT', 'dataset': 'enwik8', 'batch_size': 32, 'learning_rate': 0.0006, 'warmup_iters': 2000, 'block_size': 1024, 'n_layer': 12, 'n_head': 8, 'n_embd': 512, 'dropout': 0.1, 'sparse_topk': 32}
+2024-12-30 12:58:19,921 INFO    MainThread:16680 [wandb_init.py:init():669] starting backend
+2024-12-30 12:58:19,921 INFO    MainThread:16680 [wandb_init.py:init():673] sending inform_init request
+2024-12-30 12:58:19,921 INFO    MainThread:16680 [backend.py:_multiprocessing_setup():104] multiprocessing start_methods=spawn, using: spawn
+2024-12-30 12:58:19,924 INFO    MainThread:16680 [wandb_init.py:init():686] backend started and connected
+2024-12-30 12:58:19,927 INFO    MainThread:16680 [wandb_init.py:init():781] updated telemetry
+2024-12-30 12:58:19,977 INFO    MainThread:16680 [wandb_init.py:init():814] communicating run to backend with 90.0 second timeout
+2024-12-30 12:58:20,892 INFO    MainThread:16680 [wandb_init.py:init():867] starting run threads in backend
+2024-12-30 12:58:21,272 INFO    MainThread:16680 [wandb_run.py:_console_start():2451] atexit reg
+2024-12-30 12:58:21,272 INFO    MainThread:16680 [wandb_run.py:_redirect():2299] redirect: wrap_raw
+2024-12-30 12:58:21,272 INFO    MainThread:16680 [wandb_run.py:_redirect():2364] Wrapping output streams.
+2024-12-30 12:58:21,272 INFO    MainThread:16680 [wandb_run.py:_redirect():2389] Redirects installed.
+2024-12-30 12:58:21,277 INFO    MainThread:16680 [wandb_init.py:init():911] run started, returning control to user process
+2024-12-30 12:58:21,889 WARNING MsgRouterThr:16680 [router.py:message_loop():75] message_loop has been closed

wandb/run-20241230_125819-geso4xvw/run-geso4xvw.wandb ADDED

Binary file (3.19 kB).

wandb/run-20241230_125924-h4hgg9ir/files/config.yaml ADDED

	@@ -0,0 +1,47 @@

+_wandb:
+    value:
+        cli_version: 0.18.6
+        m: []
+        python_version: 3.11.7
+        t:
+            "1":
+                - 1
+                - 55
+                - 105
+            "2":
+                - 1
+                - 55
+                - 105
+            "3":
+                - 16
+                - 23
+                - 55
+            "4": 3.11.7
+            "5": 0.18.6
+            "8":
+                - 3
+                - 5
+            "12": 0.18.6
+            "13": windows-amd64
+architecture:
+    value: DTAT
+batch_size:
+    value: 32
+block_size:
+    value: 1024
+dataset:
+    value: enwik8
+dropout:
+    value: 0.1
+learning_rate:
+    value: 0.0006
+n_embd:
+    value: 512
+n_head:
+    value: 8
+n_layer:
+    value: 12
+sparse_topk:
+    value: 32
+warmup_iters:
+    value: 2000

wandb/run-20241230_125924-h4hgg9ir/files/output.log ADDED

	@@ -0,0 +1,29 @@

+Loading data...
+Initializing model...
+number of parameters: 42.40M
+Starting training...
+Traceback (most recent call last):
+  File "C:\sakana\enwik8-model\train_dtat.py", line 256, in <module>
+    main()
+  File "C:\sakana\enwik8-model\train_dtat.py", line 166, in main
+    logits, loss, importance_scores = model(X, Y)
+                                      ^^^^^^^^^^^
+  File "C:\fcc-intro-to-llms\cuda\Lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
+    return self._call_impl(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "C:\fcc-intro-to-llms\cuda\Lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
+    return forward_call(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "C:\sakana\enwik8-model\model_dtat.py", line 218, in forward
+    importance_scores = self.importance_net(x, freq_table, pos)
+                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "C:\fcc-intro-to-llms\cuda\Lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
+    return self._call_impl(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "C:\fcc-intro-to-llms\cuda\Lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
+    return forward_call(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "C:\sakana\enwik8-model\model_dtat.py", line 51, in forward
+    combined = torch.cat([x, freq_emb, pos_emb], dim=-1)
+               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 1024 but got size 256 for tensor number 1 in the list.
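
The `RuntimeError` above reflects the rule that `torch.cat` enforces: every dimension except the concatenation dimension must match across inputs. A dependency-free sketch of that check is shown below, as a way to reason about the shapes before touching the model. The helper `cat_shape` and all three shapes are hypothetical; the log only tells us that one tensor has size 1024 where `freq_emb` has size 256.

```python
def cat_shape(shapes, dim=-1):
    """Return the shape that concatenating tensors of these shapes would have.

    Raises ValueError naming the first dimension (other than `dim`) that differs.
    """
    ndim = len(shapes[0])
    dim = dim % ndim  # normalize negative dims, e.g. -1 -> ndim - 1
    out = list(shapes[0])
    for i, shape in enumerate(shapes[1:], start=1):
        if len(shape) != ndim:
            raise ValueError(f"tensor {i} has {len(shape)} dims, expected {ndim}")
        for d in range(ndim):
            if d != dim and shape[d] != out[d]:
                raise ValueError(
                    f"tensor {i}: size {shape[d]} in dim {d}, expected {out[d]}"
                )
        out[dim] += shape[dim]  # sizes add up along the concatenation dim
    return tuple(out)


# ASSUMED shapes (batch, sequence, features); the real ones are not in the log.
x_shape = (32, 1024, 512)
freq_shape = (32, 256, 64)   # indexed by 256 byte values rather than positions
pos_shape = (32, 1024, 64)

try:
    cat_shape([x_shape, freq_shape, pos_shape])
    mismatch = None
except ValueError as err:
    mismatch = str(err)

# If freq_emb is gathered per token first, its sequence dim matches x.
fixed = cat_shape([x_shape, (32, 1024, 64), pos_shape])

print(mismatch)
print(fixed)
```

Under these assumptions, the likely remedy is to look up the frequency embedding per input token, so that it has one row per sequence position, before concatenating along the feature dimension.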

wandb/run-20241230_125924-h4hgg9ir/files/wandb-metadata.json ADDED

	@@ -0,0 +1,43 @@

+{
+  "os":  "Windows-10-10.0.26100-SP0",
+  "python":  "3.11.7",
+  "startedAt":  "2024-12-30T10:59:24.719225Z",
+  "program":  "C:\\sakana\\enwik8-model\\train_dtat.py",
+  "codePath":  "train_dtat.py",
+  "git":  {
+    "remote":  "https://github.com/karpathy/nanoGPT.git",
+    "commit":  "93a43d9a5c22450bbf06e78da2cb6eeef084b717"
+  },
+  "email":  "mitel40181@gholar.com",
+  "root":  "C:\\sakana\\enwik8-model",
+  "host":  "SILX",
+  "username":  "silxs",
+  "executable":  "C:\\fcc-intro-to-llms\\cuda\\Scripts\\python.exe",
+  "codePathLocal":  "train_dtat.py",
+  "cpu_count":  8,
+  "cpu_count_logical":  16,
+  "gpu":  "NVIDIA GeForce RTX 3050 Laptop GPU",
+  "gpu_count":  1,
+  "disk":  {
+    "/":  {
+      "total":  "487147769856",
+      "used":  "485685227520"
+    }
+  },
+  "memory":  {
+    "total":  "16387997696"
+  },
+  "cpu":  {
+    "count":  8,
+    "countLogical":  16
+  },
+  "gpu_nvidia":  [
+    {
+      "name":  "NVIDIA GeForce RTX 3050 Laptop GPU",
+      "memoryTotal":  "4294967296",
+      "cudaCores":  2048,
+      "architecture":  "Ampere"
+    }
+  ],
+  "cudaVersion":  "12.6"
+}

wandb/run-20241230_125924-h4hgg9ir/files/wandb-summary.json ADDED

	@@ -0,0 +1 @@


+{"_wandb":{"runtime":4}}

wandb/run-20241230_125924-h4hgg9ir/logs/debug-core.log ADDED

	@@ -0,0 +1,14 @@

+{"time":"2024-12-30T12:59:23.8853518+02:00","level":"INFO","msg":"started logging, with flags","port-filename":"C:\\Users\\silxs\\AppData\\Local\\Temp\\tmpmcp7jkur\\port-36980.txt","pid":36980,"debug":false,"disable-analytics":false}
+{"time":"2024-12-30T12:59:23.8853518+02:00","level":"INFO","msg":"FeatureState","shutdownOnParentExitEnabled":false}
+{"time":"2024-12-30T12:59:23.8919931+02:00","level":"INFO","msg":"Will exit if parent process dies.","ppid":36980}
+{"time":"2024-12-30T12:59:23.8919931+02:00","level":"INFO","msg":"server is running","addr":{"IP":"127.0.0.1","Port":53707,"Zone":""}}
+{"time":"2024-12-30T12:59:24.0739714+02:00","level":"INFO","msg":"connection: ManageConnectionData: new connection created","id":"127.0.0.1:53716"}
+{"time":"2024-12-30T12:59:24.7197359+02:00","level":"INFO","msg":"handleInformInit: received","streamId":"h4hgg9ir","id":"127.0.0.1:53716"}
+{"time":"2024-12-30T12:59:24.831581+02:00","level":"INFO","msg":"handleInformInit: stream started","streamId":"h4hgg9ir","id":"127.0.0.1:53716"}
+{"time":"2024-12-30T12:59:28.8801803+02:00","level":"INFO","msg":"handleInformTeardown: server teardown initiated","id":"127.0.0.1:53716"}
+{"time":"2024-12-30T12:59:28.8801803+02:00","level":"INFO","msg":"connection: Close: initiating connection closure","id":"127.0.0.1:53716"}
+{"time":"2024-12-30T12:59:28.8801803+02:00","level":"INFO","msg":"server is shutting down"}
+{"time":"2024-12-30T12:59:28.8801803+02:00","level":"INFO","msg":"connection: Close: connection successfully closed","id":"127.0.0.1:53716"}
+{"time":"2024-12-30T12:59:53.2992062+02:00","level":"INFO","msg":"handleInformTeardown: server shutdown complete","id":"127.0.0.1:53716"}
+{"time":"2024-12-30T12:59:53.2992062+02:00","level":"INFO","msg":"connection: ManageConnectionData: connection closed","id":"127.0.0.1:53716"}
+{"time":"2024-12-30T12:59:53.2992062+02:00","level":"INFO","msg":"server is closed"}

wandb/run-20241230_125924-h4hgg9ir/logs/debug-internal.log ADDED

	@@ -0,0 +1,17 @@

+{"time":"2024-12-30T12:59:24.7202464+02:00","level":"INFO","msg":"using version","core version":"0.18.6"}
+{"time":"2024-12-30T12:59:24.7207602+02:00","level":"INFO","msg":"created symlink","path":"C:\\sakana\\enwik8-model\\wandb\\run-20241230_125924-h4hgg9ir\\logs\\debug-core.log"}
+{"time":"2024-12-30T12:59:24.8310365+02:00","level":"INFO","msg":"created new stream","id":"h4hgg9ir"}
+{"time":"2024-12-30T12:59:24.831581+02:00","level":"INFO","msg":"stream: started","id":"h4hgg9ir"}
+{"time":"2024-12-30T12:59:24.831581+02:00","level":"INFO","msg":"sender: started","stream_id":"h4hgg9ir"}
+{"time":"2024-12-30T12:59:24.831581+02:00","level":"INFO","msg":"handler: started","stream_id":{"value":"h4hgg9ir"}}
+{"time":"2024-12-30T12:59:24.831581+02:00","level":"INFO","msg":"writer: Do: started","stream_id":{"value":"h4hgg9ir"}}
+{"time":"2024-12-30T12:59:25.363056+02:00","level":"INFO","msg":"Starting system monitor"}
+{"time":"2024-12-30T12:59:28.8801803+02:00","level":"INFO","msg":"stream: closing","id":"h4hgg9ir"}
+{"time":"2024-12-30T12:59:28.8801803+02:00","level":"INFO","msg":"Stopping system monitor"}
+{"time":"2024-12-30T12:59:28.8812132+02:00","level":"INFO","msg":"Stopped system monitor"}
+{"time":"2024-12-30T12:59:29.7987804+02:00","level":"INFO","msg":"fileTransfer: Close: file transfer manager closed"}
+{"time":"2024-12-30T12:59:50.82286+02:00","level":"INFO","msg":"api: retrying error","error":"Post \"https://api.wandb.ai/files/mitel40181-silx/enwik8-dtat/h4hgg9ir/file_stream\": dial tcp 35.186.228.49:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond."}
+{"time":"2024-12-30T12:59:53.2987018+02:00","level":"INFO","msg":"handler: closed","stream_id":{"value":"h4hgg9ir"}}
+{"time":"2024-12-30T12:59:53.2987018+02:00","level":"INFO","msg":"sender: closed","stream_id":"h4hgg9ir"}
+{"time":"2024-12-30T12:59:53.2987018+02:00","level":"INFO","msg":"writer: Close: closed","stream_id":{"value":"h4hgg9ir"}}
+{"time":"2024-12-30T12:59:53.2987018+02:00","level":"INFO","msg":"stream: closed","id":"h4hgg9ir"}

wandb/run-20241230_125924-h4hgg9ir/logs/debug.log ADDED

	@@ -0,0 +1,26 @@

+2024-12-30 12:59:24,715 INFO    MainThread:36980 [wandb_setup.py:_flush():79] Current SDK version is 0.18.6
+2024-12-30 12:59:24,715 INFO    MainThread:36980 [wandb_setup.py:_flush():79] Configure stats pid to 36980
+2024-12-30 12:59:24,715 INFO    MainThread:36980 [wandb_setup.py:_flush():79] Loading settings from C:\Users\silxs\.config\wandb\settings
+2024-12-30 12:59:24,715 INFO    MainThread:36980 [wandb_setup.py:_flush():79] Loading settings from C:\sakana\enwik8-model\wandb\settings
+2024-12-30 12:59:24,715 INFO    MainThread:36980 [wandb_setup.py:_flush():79] Loading settings from environment variables: {}
+2024-12-30 12:59:24,715 INFO    MainThread:36980 [wandb_setup.py:_flush():79] Applying setup settings: {'mode': None, '_disable_service': None}
+2024-12-30 12:59:24,716 INFO    MainThread:36980 [wandb_setup.py:_flush():79] Inferring run settings from compute environment: {'program_relpath': 'train_dtat.py', 'program_abspath': 'C:\\sakana\\enwik8-model\\train_dtat.py', 'program': 'C:\\sakana\\enwik8-model\\train_dtat.py'}
+2024-12-30 12:59:24,716 INFO    MainThread:36980 [wandb_setup.py:_flush():79] Applying login settings: {}
+2024-12-30 12:59:24,716 INFO    MainThread:36980 [wandb_init.py:_log_setup():533] Logging user logs to C:\sakana\enwik8-model\wandb\run-20241230_125924-h4hgg9ir\logs\debug.log
+2024-12-30 12:59:24,716 INFO    MainThread:36980 [wandb_init.py:_log_setup():534] Logging internal logs to C:\sakana\enwik8-model\wandb\run-20241230_125924-h4hgg9ir\logs\debug-internal.log
+2024-12-30 12:59:24,716 INFO    MainThread:36980 [wandb_init.py:init():619] calling init triggers
+2024-12-30 12:59:24,716 INFO    MainThread:36980 [wandb_init.py:init():626] wandb.init called with sweep_config: {}
+config: {'architecture': 'DTAT', 'dataset': 'enwik8', 'batch_size': 32, 'learning_rate': 0.0006, 'warmup_iters': 2000, 'block_size': 1024, 'n_layer': 12, 'n_head': 8, 'n_embd': 512, 'dropout': 0.1, 'sparse_topk': 32}
+2024-12-30 12:59:24,716 INFO    MainThread:36980 [wandb_init.py:init():669] starting backend
+2024-12-30 12:59:24,716 INFO    MainThread:36980 [wandb_init.py:init():673] sending inform_init request
+2024-12-30 12:59:24,718 INFO    MainThread:36980 [backend.py:_multiprocessing_setup():104] multiprocessing start_methods=spawn, using: spawn
+2024-12-30 12:59:24,719 INFO    MainThread:36980 [wandb_init.py:init():686] backend started and connected
+2024-12-30 12:59:24,722 INFO    MainThread:36980 [wandb_init.py:init():781] updated telemetry
+2024-12-30 12:59:24,755 INFO    MainThread:36980 [wandb_init.py:init():814] communicating run to backend with 90.0 second timeout
+2024-12-30 12:59:25,357 INFO    MainThread:36980 [wandb_init.py:init():867] starting run threads in backend
+2024-12-30 12:59:25,623 INFO    MainThread:36980 [wandb_run.py:_console_start():2451] atexit reg
+2024-12-30 12:59:25,623 INFO    MainThread:36980 [wandb_run.py:_redirect():2299] redirect: wrap_raw
+2024-12-30 12:59:25,624 INFO    MainThread:36980 [wandb_run.py:_redirect():2364] Wrapping output streams.
+2024-12-30 12:59:25,624 INFO    MainThread:36980 [wandb_run.py:_redirect():2389] Redirects installed.
+2024-12-30 12:59:25,626 INFO    MainThread:36980 [wandb_init.py:init():911] run started, returning control to user process
+2024-12-30 12:59:28,880 WARNING MsgRouterThr:36980 [router.py:message_loop():75] message_loop has been closed

wandb/run-20241230_125924-h4hgg9ir/run-h4hgg9ir.wandb ADDED

Binary file (4.22 kB).