This is a GPT-2 model trained in llm.c for 330K steps (at a batch size of 1M) on FineWeb-EDU.
Much more detailed information is available here: https://github.com/karpathy/llm.c/discussions/677 .
This model has a somewhat complicated history. I wanted to train it for 400K steps (i.e. -x 400000), but it became unstable later in training and exploded around step 330K. Because I was about to lose my compute quota, I decided to rewind to the checkpoint at step 300K and then, instead of going all the way to 400K, anneal linearly down over the remaining steps to 330K. This went without incident and produced this model.
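The rewind-and-anneal step can be pictured with a minimal sketch. It assumes that "annealing" refers to the learning rate, which is interpolated linearly from its value at the rewind checkpoint (step 300K) to a final value at step 330K. The function name `annealed_lr`, the `lr_at_rewind` argument, and the default `final_lr` of 0 are hypothetical and chosen for illustration; this is not the llm.c scheduler code, and the actual learning rates used for this run are not stated here.

```python
def annealed_lr(step, lr_at_rewind, rewind_step=300_000,
                final_step=330_000, final_lr=0.0):
    """Linear anneal from lr_at_rewind (at rewind_step) to final_lr (at final_step).

    lr_at_rewind and final_lr are illustrative assumptions, not values
    taken from the actual training run.
    """
    if step <= rewind_step:
        # Before the rewind point this sketch just holds the LR constant.
        return lr_at_rewind
    if step >= final_step:
        return final_lr
    # Fraction of the annealing window completed so far, in (0, 1).
    frac = (step - rewind_step) / (final_step - rewind_step)
    return lr_at_rewind + frac * (final_lr - lr_at_rewind)


if __name__ == "__main__":
    # Hypothetical starting LR, used only to show the shape of the schedule.
    for s in (300_000, 310_000, 320_000, 330_000):
        print(s, annealed_lr(s, lr_at_rewind=1e-4))
```

Shortening the schedule this way means the final 30K steps are spent at a steadily decreasing learning rate, rather than continuing on the original, longer 400K-step schedule.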
This is the longest I've trained a GPT-2 model for, and it reaches a HellaSwag score of 62.7 by the end.