TokenFormer is a fully attention-based architecture that unifies token-token and token-parameter interactions by using the attention mechanism for both, with the aim of maximizing the flexibility of the neural network (see the paper). The release contains four models with 150M, 450M, 900M, and 1.5B parameters. Each model was trained with the gpt-neox code base on 300B tokens of the Pile. All four model sizes were trained on the exact same data, in the exact same order.
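To make the idea of a token-parameter interaction concrete, here is a minimal sketch in plain Python. It is not the repository's implementation: the function names are hypothetical, and it uses an ordinary softmax to normalize the scores, which is an assumption made for illustration (the paper describes the exact normalization used by TokenFormer). Each input token acts as a query, and two sets of learnable "parameter tokens" act as keys and values, taking the place of a fixed weight matrix.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def token_parameter_attention(tokens, key_params, value_params):
    """Attend from input tokens to learnable parameter tokens.

    tokens:       list of T input vectors, each of length d_in
    key_params:   list of n key parameter tokens, each of length d_in
    value_params: list of n value parameter tokens, each of length d_out
    Returns a list of T output vectors, each of length d_out.

    Assumption: plain softmax normalization, used here only as a stand-in.
    """
    d_out = len(value_params[0])
    outputs = []
    for x in tokens:
        # Similarity between the input token and every key parameter token.
        scores = [sum(xi * ki for xi, ki in zip(x, k)) for k in key_params]
        weights = softmax(scores)
        # Weighted sum of the value parameter tokens.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, value_params)) for j in range(d_out)]
        )
    return outputs


# Toy usage: one 2-d input token, two parameter tokens, 2-d output.
out = token_parameter_attention(
    tokens=[[1.0, 0.0]],
    key_params=[[0.0, 0.0], [0.0, 0.0]],
    value_params=[[2.0, 4.0], [4.0, 8.0]],
)
print(out)
```

In this formulation the number of parameter tokens n is independent of the input and output dimensions, so n can be changed without changing the shape of the layer's inputs or outputs.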

TokenFormer-1-5B

Model Details

  • Developed by: Haiyang Wang
  • Model type: TokenFormer-based Language Model
  • Language: English
  • Learn more: See TokenFormer's GitHub repository for the training procedure, config files, and usage details. See the paper for more evaluations and implementation details.
  • Library: GPT-NeoX
  • License: Apache 2.0
  • Contact: to ask questions about this model, please email Haiyang Wang.

| TokenFormer model | Layers | #QKV Param Tokens | #Output Param Tokens | #FFN Param Tokens | Model Dim | Heads | Batch Size | Learning Rate | Training Iterations |
|---|---|---|---|---|---|---|---|---|---|
| 150M | 12 | 768 | 768 | 3072 | 768 | 12 | 2M | 6.0 × 10^-4 | 143000 |
| 450M | 24 | 1024 | 1024 | 4096 | 1024 | 16 | 2M | 6.0 × 10^-4 | 143000 |
| 900M | 32 | 1280 | 1280 | 5120 | 1280 | 16 | 2M | 6.0 × 10^-4 | 143000 |
| 1.5B | 40 | 1536 | 1536 | 6144 | 1536 | 16 | 2M | 6.0 × 10^-4 | 143000 |

Engineering details for the TokenFormer models.
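The engineering details can also be written down as data, which makes a few regularities easy to check. The sketch below is illustrative only: `TokenFormerConfig` is a hypothetical class, not the GPT-NeoX config format, and the per-head dimension is derived under the assumption that the model dimension is split evenly across heads (the usual multi-head convention, not stated on this card).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenFormerConfig:
    """Hypothetical container for one row of the engineering-details table."""
    name: str
    layers: int
    qkv_param_tokens: int
    output_param_tokens: int
    ffn_param_tokens: int
    model_dim: int
    heads: int

    @property
    def head_dim(self) -> int:
        # Assumption: model_dim is divided evenly across attention heads.
        return self.model_dim // self.heads


# Values copied from the table above.
CONFIGS = [
    TokenFormerConfig("150M", 12, 768, 768, 3072, 768, 12),
    TokenFormerConfig("450M", 24, 1024, 1024, 4096, 1024, 16),
    TokenFormerConfig("900M", 32, 1280, 1280, 5120, 1280, 16),
    TokenFormerConfig("1.5B", 40, 1536, 1536, 6144, 1536, 16),
]

for cfg in CONFIGS:
    # In every listed size, the QKV and output parameter-token counts equal
    # the model dimension, and the FFN count is four times the model dimension.
    assert cfg.qkv_param_tokens == cfg.output_param_tokens == cfg.model_dim
    assert cfg.ffn_param_tokens == 4 * cfg.model_dim
    print(cfg.name, "head_dim =", cfg.head_dim)
```

The two assertions hold for all four rows of the table; they describe these particular configurations and are not a constraint of the architecture.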

Training

Training data

The Pile is an 825 GiB general-purpose dataset in English. It was created by EleutherAI specifically for training large language models. It contains texts from 22 diverse sources, roughly broken down into five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl), prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and miscellaneous (e.g. GitHub, Enron Emails). See the Pile paper for a breakdown of all data sources, methodology, and a discussion of ethical implications. Consult the datasheet for more detailed documentation about the Pile and its component datasets. The Pile can be downloaded from the official website, or from a community mirror.

Training procedure

We follow the default Pythia training strategy in gpt-neox, including the dataset processing, hyperparameters, and code base. All models were trained on the exact same data, in the exact same order. Each model saw 299,892,736,000 tokens during training.

All TokenFormer models were trained for 143,000 steps at a batch size of 2M (2,097,152 tokens).
See GitHub for more details on the training procedure.
TokenFormer uses the same tokenizer as GPT-NeoX-20B.
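The total token count quoted above follows directly from the step count and the batch size. The short check below reproduces the arithmetic; the sequence length of 2048 used in the last step is an assumption (it is not stated on this card) and only serves to express the batch size in sequences.

```python
TRAINING_STEPS = 143_000
TOKENS_PER_BATCH = 2 * 1024 * 1024  # "2M" tokens = 2,097,152

total_tokens = TRAINING_STEPS * TOKENS_PER_BATCH
print(f"{total_tokens:,}")  # 299,892,736,000

# Assumption for illustration only: a context length of 2048 tokens.
ASSUMED_SEQ_LEN = 2048
sequences_per_batch = TOKENS_PER_BATCH // ASSUMED_SEQ_LEN
print(sequences_per_batch)  # 1024
```

This is the figure rounded to "300B tokens" elsewhere in the card.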

Evaluations

All TokenFormer models were evaluated using the LM Evaluation Harness. You can run the evaluation by following our instructions.
The results below compare all TokenFormer models with open-source Transformer-based LLMs.

| Model | #Param | LAMBADA | HellaSwag | PIQA | Arc-E | Arc-C | WinoGrande | Average |
|---|---|---|---|---|---|---|---|---|
| Pythia | 150M | 35.4 | 30.3 | 62.3 | 43.6 | 23.6 | 51.3 | 40.1 |
| TokenFormer | 150M | 45.0 | 35.5 | 64.9 | 47.3 | 24.9 | 50.4 | 44.7 |
| Pythia | 410M | 51.4 | 40.6 | 66.9 | 52.1 | 24.6 | 53.8 | 48.2 |
| TokenFormer | 450M | 57.3 | 47.5 | 69.5 | 56.2 | 26.7 | 54.6 | 52.0 |
| Pythia | 1B | 56.1 | 47.2 | 70.7 | 57.0 | 27.1 | 53.5 | 51.9 |
| TokenFormer | 900M | 64.0 | 55.3 | 72.4 | 59.9 | 30.6 | 56.4 | 56.4 |
| GPT-Neo | 1.3B | 57.2 | 48.9 | 71.1 | 56.2 | 25.9 | 54.9 | 52.4 |
| OPT | 1.3B | 58.0 | 53.7 | 72.4 | 56.7 | 29.6 | 59.5 | 55.0 |
| Pythia | 1.3B | 61.7 | 52.1 | 71.0 | 60.5 | 28.5 | 57.2 | 55.2 |
| GPT-Neo | 2.7B | 62.2 | 55.8 | 71.1 | 61.1 | 30.2 | 57.6 | 56.5 |
| OPT | 2.7B | 63.6 | 60.6 | 74.8 | 60.8 | 31.3 | 61.0 | 58.7 |
| Pythia | 2.8B | 64.7 | 59.3 | 74.0 | 64.1 | 32.9 | 59.7 | 59.1 |
| TokenFormer | 1.5B | 64.7 | 60.0 | 74.8 | 64.8 | 32.0 | 59.7 | 59.3 |

Zero-shot evaluation of Language Modeling.
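The card does not say how the Average column is computed. A natural reading, treated here as an assumption, is the unweighted mean of the six task scores rounded to one decimal place. The sketch below checks that reading against the four TokenFormer rows; `mean_score` and `TOKENFORMER_SCORES` are hypothetical names introduced for this example.

```python
# Task order: LAMBADA, HellaSwag, PIQA, Arc-E, Arc-C, WinoGrande.
# Scores copied from the TokenFormer rows of the table above.
TOKENFORMER_SCORES = {
    "150M": [45.0, 35.5, 64.9, 47.3, 24.9, 50.4],
    "450M": [57.3, 47.5, 69.5, 56.2, 26.7, 54.6],
    "900M": [64.0, 55.3, 72.4, 59.9, 30.6, 56.4],
    "1.5B": [64.7, 60.0, 74.8, 64.8, 32.0, 59.7],
}


def mean_score(scores):
    """Unweighted mean of the task scores, rounded to one decimal place.

    Assumption: this matches how the Average column was produced.
    """
    return round(sum(scores) / len(scores), 1)


for size, scores in TOKENFORMER_SCORES.items():
    print(size, mean_score(scores))
```

Under this assumption the computed values for the TokenFormer rows are 44.7, 52.0, 56.4, and 59.3, which match the reported averages.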