Vectorrent committed
Commit 7368704
1 Parent(s): cc23476

Upload 9 files
README.md CHANGED
@@ -1,3 +1,78 @@
- ---
- license: mit
- ---
+ ---
+ language:
+ - en
+ tags:
+ - text generation
+ - pytorch
+ - causal-lm
+ license: mit
+ datasets:
+ - allenai/c4
+ - HuggingFaceFW/fineweb-edu
+ - togethercomputer/RedPajama-Data-V2
+ - Muennighoff/natural-instructions
+ - databricks/databricks-dolly-15k
+ - HuggingFaceTB/smollm-corpus
+ - open-phi/textbooks
+ - roneneldan/TinyStories
+ ---
+
+ # Mixtress 135M
+
+ ## Model Description
+
+ Mixtress 135M is a transformer model based upon the [Mixtral](https://huggingface.co/docs/transformers/en/model_doc/mixtral) architecture. It is the culmination of approximately 20 weeks of free [Kaggle](https://kaggle.com) hours and 67 twelve-hour training runs.
+
+ The results are laughably bad. The model has massively overfit to the training data, and it saw far fewer tokens than other models of comparable size. But at least I can say we saw it through to completion!
+
+ ## Training data
+
+ Mixtress was trained on a curated sampling of data from the following datasets:
+
+ - allenai/c4
+ - HuggingFaceFW/fineweb-edu
+ - togethercomputer/RedPajama-Data-V2
+ - Muennighoff/natural-instructions
+ - databricks/databricks-dolly-15k
+ - HuggingFaceTB/smollm-corpus
+ - open-phi/textbooks
+ - roneneldan/TinyStories
+
+ ## Training procedure
+
+ This model was trained for 2.15 billion tokens over 20,000 optimizer steps. It was trained as a causal (autoregressive) language model, using cross-entropy loss.
+
+ The final train loss was 1.941, the validation loss was 2.206, and the perplexity was 9.136.
+
+ Mixtress was pre-trained and fine-tuned simultaneously. Full reproduction code may be found [on Kaggle](https://www.kaggle.com/code/luciferianink/pretraining-a-mixtral), or in the Jupyter notebook [in this repository](./pretraining-a-mixtral.ipynb).
+
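For reference, the reported perplexity lines up with the usual convention of exponentiating the validation cross-entropy loss; a quick sanity check (not from the original notebook, and assuming that convention):

```py
import math

val_loss = 2.206                 # validation cross-entropy from above, in nats/token
print(math.exp(val_loss))        # ≈ 9.08, close to the reported perplexity of 9.136
```
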
+ ## Intended Use and Limitations
+
+ The model is best at what it was pretrained for, which is generating conversational text and answering questions from a prompt.
+
+ ### How to use
+
+ You can use this model directly with a pipeline for text generation. This example generates a different sequence each time it's run:
+
+ ```py
+ >>> from transformers import pipeline
+ >>> generator = pipeline('text-generation', model='UNSAFE/Mixtress-135M')
+ >>> generator("In a shocking finding, ", do_sample=True, temperature=0.7, min_length=50)
+
+ [{'generated_text': 'In a shocking finding, 20 years ago, U.S. President Donald Trump'}]
+ ```
+
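If you prefer to manage tokenization and generation yourself, the same checkpoint should also load through the standard `AutoTokenizer`/`AutoModelForCausalLM` interfaces; a minimal sketch (the generation settings below are illustrative, not tuned):

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UNSAFE/Mixtress-135M")
model = AutoModelForCausalLM.from_pretrained("UNSAFE/Mixtress-135M")

# Encode a prompt, sample a continuation, and decode it
inputs = tokenizer("In a shocking finding, ", return_tensors="pt")
outputs = model.generate(**inputs, do_sample=True, temperature=0.7, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
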
+ ## Eval results
+
+ All evaluations were done using EleutherAI's [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness).
+
+ ### Scores
+
+ All scores are accuracy, reported as percentages.
+
+ | Model and Size    | ARC-easy   | ARC-challenge | HellaSwag  | OpenBookQA | PiQA       |
+ | ----------------- | ---------- | ------------- | ---------- | ---------- | ---------- |
+ | gpt-neo-125m      | 22.95      | N/A           | 30.26      | N/A        | N/A        |
+ | **Mixtress-135M** | **29.21**  | **24.57**     | **26.99**  | **21.80**  | **52.67**  |
+
+ ## Join Us
+
+ If you would like to chat with us, please join the [Discord](https://discord.gg/8ZmHP8CqUX) server!
config.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "_name_or_path": "mixtress",
+   "architectures": [
+     "MixtralForCausalLM"
+   ],
+   "attention_dropout": 0.1,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "hidden_act": "mish",
+   "hidden_size": 512,
+   "initializer_range": 0.02,
+   "intermediate_size": 1024,
+   "max_position_embeddings": 131072,
+   "model_type": "mixtral",
+   "num_attention_heads": 16,
+   "num_experts_per_tok": 3,
+   "num_hidden_layers": 8,
+   "num_key_value_heads": 8,
+   "num_local_experts": 9,
+   "output_router_logits": false,
+   "rms_norm_eps": 1e-05,
+   "rope_theta": 1000000.0,
+   "router_aux_loss_coef": 0.001,
+   "router_jitter_noise": 0.1,
+   "sliding_window": 4096,
+   "tie_word_embeddings": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.45.1",
+   "universal": true,
+   "use_cache": false,
+   "vocab_size": 32000,
+   "world_size": 23
+ }
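For reference, the architecture described by this config can be instantiated with random weights directly from `transformers`; a minimal sketch (the `universal` and `world_size` keys are custom metadata and are omitted here):

```py
from transformers import MixtralConfig, MixtralForCausalLM

# Mirror the key architectural fields from config.json above
config = MixtralConfig(
    hidden_size=512,
    intermediate_size=1024,
    num_hidden_layers=8,
    num_attention_heads=16,
    num_key_value_heads=8,
    num_local_experts=9,
    num_experts_per_tok=3,
    hidden_act="mish",
    vocab_size=32000,
    tie_word_embeddings=True,
    sliding_window=4096,
    rope_theta=1000000.0,
    attention_dropout=0.1,
)
model = MixtralForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```
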
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "transformers_version": "4.45.1"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:243086d038abcf517e46268af189614e2bf13a10e9ef73c653888ab17af00d50
+ size 543902624
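Since the weights are stored in float32 (per `torch_dtype` in config.json), the file size gives a rough parameter count; a quick back-of-the-envelope check (ignoring the small safetensors header):

```py
size_bytes = 543_902_624     # from the LFS pointer above
bytes_per_param = 4          # float32
print(size_bytes / bytes_per_param / 1e6)  # ≈ 136 million parameters, roughly in line with the "135M" name
```
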
pretraining-a-mixtral.ipynb ADDED
@@ -0,0 +1 @@
+ {"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.14","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"gpu","dataSources":[],"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":true}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# Pretraining a Mixtral\n---\n[AIGen](https://github.com/Vectorrent/aigen) is a text generation and training library, originally forked from [AITextGen](https://aitextgen.minimaxir.com/) (which is now defunct).\n\nAIGen is also the foundation of [VTX](https://github.com/0-5788719150923125/vtx).\n\nTo use this notebook with Kaggle, one must first enable the \"Internet\" feature. To do so:\n\n1. Find \"Notebook options\" in the sidebar on the right-hand side of this page.\n2. If required, verify your phone number.\n3. Choose \"Internet on\".\n4. Connect to the P100 accelerator.\n5. Setup file persistence.\n\nDo not forget to connect to an accelerator. The P100's are better than the T4's. However, with 2x T4's available, training may benefit from DistributedDataParallel (DDP) training.","metadata":{}},{"cell_type":"markdown","source":"## Update system packages","metadata":{}},{"cell_type":"code","source":"# Now we install AIGen\n!pip install 'git+https://github.com/Vectorrent/aigen.git'\n\n# Speed up everything\n# !pip install -U flash-attn --no-build-isolation","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Configuration\n\nWe would set a bunch of variables here, if we hadn't hardcoded them below for clarity.","metadata":{}},{"cell_type":"code","source":"# Set some variables\nfocus = 'mixtress'\nprecision = 32\nattn_implementation = \"eager\"\n\n# working dir\noutput_dir = \"/kaggle/working\"\n\n# Mixtral is gated, so we use someone else's repo\nbase_model = 'TitanML/tiny-mixtral'\ntokenizer_model = base_model\ntokenizer_config = dict(\n cache_dir=f\"{output_dir}/{focus}\",\n padding=\"max_length\",\n padding_side=\"left\",\n use_fast=True,\n return_overflowing_tokens=True,\n truncation=True,\n trust_remote_code=True,\n)\n\n# to continue training from a checkpoint, False starts a fresh run\nresume_training = False","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Pretraining\n\nWe want to train a model from scratch, so we import a functional config, then change it.","metadata":{}},{"cell_type":"code","source":"from transformers import (\n AutoConfig,\n AutoTokenizer,\n PretrainedConfig,\n PreTrainedTokenizerFast,\n)\n\ntokenizer = AutoTokenizer.from_pretrained(tokenizer_model, **tokenizer_config)\n\npretrain_config = AutoConfig.from_pretrained(base_model)\noverrides = {\n \"model_type\": base_model,\n \"universal\": True,\n \"world_size\": 23,\n \"hidden_act\": 'mish',\n \"hidden_size\": 512,\n \"intermediate_size\": 1024,\n \"initializer_range\": 0.02,\n \"num_hidden_layers\": 8,\n \"num_attention_heads\": 16,\n \"num_key_value_heads\": 8,\n \"rope_theta\": 1000000.0,\n \"num_experts_per_tok\": 3,\n \"num_local_experts\": 9,\n \"vocab_size\": 32000,\n \"tie_word_embeddings\": True,\n \"router_aux_loss_coef\": 0.001,\n \"router_jitter_noise\": 0.1,\n \"sliding_window\": 4096,\n \"attention_dropout\": 0.1\n}\nsetattr(pretrain_config, \"_name_or_path\", 
focus)\nsetattr(pretrain_config, \"bos_token_id\", tokenizer.bos_token_id)\nsetattr(pretrain_config, \"eos_token_id\", tokenizer.eos_token_id)\nfor k, v in overrides.items():\n setattr(pretrain_config, k, v)\nprint(f\"modified pretrain config:\")\nprint(pretrain_config)","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Load the model\n\nHere we initialize the model with random weights.","metadata":{}},{"cell_type":"code","source":"# Instantiate your model\nimport os\nimport shutil\nfrom aigen import aigen\n\nprototype = None\n\nif resume_training:\n model = None\n model_folder = f\"{output_dir}/{focus}\"\n pretrain_config = None\nelse:\n model = base_model\n model_folder = None\n shutil.rmtree(output_dir, ignore_errors=True)\n\nprototype = aigen(\n model=model,\n model_folder=model_folder,\n tokenizer=tokenizer,\n cache_dir=f\"{output_dir}/{focus}\",\n precision=precision,\n config=pretrain_config,\n device_map=\"cuda:0\",\n attn_implementation=attn_implementation\n)\n\nprint(prototype)","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Metrics\n\nWe want to log training metrics, so we install Tensorboard and expose it via ngrok. This requires an authtoken from ngrok.com, saved in Kaggle's \"Add-ons>Secrets\".","metadata":{}},{"cell_type":"code","source":"from kaggle_secrets import UserSecretsClient\nsecret_label = \"NGROK_SECRET\"\nsecret_value = UserSecretsClient().get_secret(secret_label)\n\nimport os\nimport shutil\n\ndirectory = f\"{output_dir}/logs\"\nos.makedirs(directory, exist_ok=True)\n\nif not resume_training:\n for filename in os.listdir(directory):\n file_path = os.path.join(directory, filename)\n shutil.rmtree(file_path)\n\nif secret_value:\n\n !pip install ngrok tensorboard\n\n import threading\n import subprocess\n\n def start_tensorboard():\n subprocess.Popen(\n [\"tensorboard\", \"--logdir\", \"/kaggle/working/logs\", \"--bind_all\", \"--samples_per_plugin\", \"scalars=999999999\"],\n stdout=subprocess.DEVNULL,\n stderr=subprocess.STDOUT\n )\n\n tensorboard_thread = threading.Thread(target=start_tensorboard)\n tensorboard_thread.start()\n\n import ngrok\n\n listener = await ngrok.forward(6006, authtoken=secret_value)\n \n import time\n\n time.sleep(1)\n print(listener.url())","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Training\n\nFinally, we train the model on a dataset streamed from: https://huggingface.co/datasets","metadata":{}},{"cell_type":"code","source":"# Train the model\n\nimport os\nfrom lightning.pytorch import loggers\n\nos.makedirs(f\"{output_dir}/logs/{focus}\", exist_ok=True)\nlogger = loggers.TensorBoardLogger(f\"{output_dir}/logs\", name=focus, default_hp_metric=True)\n\nprototype.train(\n devices=[0],\n strategy=\"auto\",\n streaming_data=[\n {\n \"hf\": True,\n \"repo\": \"allenai/c4\", \n \"split\": \"train\",\n \"val_split\": \"validation\",\n \"subset\": \"en.noblocklist\",\n \"schemas\": [\n {\n \"text\": \"\"\n }\n ],\n \"buffer_size\": 1000,\n \"val_samples\": 1000,\n \"sample_rate\": 1.0\n },\n {\n \"hf\": True,\n \"repo\": \"HuggingFaceFW/fineweb-edu\", \n \"split\": \"train\", \n \"subset\": \"sample-10BT\",\n \"schemas\": [\n {\n \"text\": \"\"\n }\n ],\n \"delimiter\": \"\\n\",\n \"buffer_size\": 1000,\n \"sample_rate\":1.0\n },\n# {\n# \"hf\": True,\n# \"repo\": 
\"cerebras/SlimPajama-627B\",\n# \"split\": \"train\",\n# \"val_split\": \"validation\",\n# \"val_samples\": 1000,\n# # \"snapshots\": [\n# # \"2023-14\"\n# # ],\n# # \"subset\": \"sample-10B\",\n# # \"languages\": [\n# # \"en\"\n# # ],\n# \"schemas\": [\n# {\n# \"text\": \"\"\n# }\n# ],\n# \"buffer_size\": 1000,\n# \"sample_rate\": 1.0\n# },\n# {\n# \"hf\": True,\n# \"repo\": \"togethercomputer/RedPajama-Data-V2\",\n# \"split\": \"train\",\n# \"snapshots\": [\n# \"2023-14\"\n# ],\n# \"subset\": \"sample-10B\",\n# \"languages\": [\n# \"en\"\n# ],\n# \"schemas\": [\n# {\n# \"raw_content\": \"\"\n# }\n# ],\n# \"buffer_size\": 1000,\n# \"sample_rate\": 1.0\n# },\n {\n \"hf\": True,\n \"repo\": \"Muennighoff/natural-instructions\",\n \"split\": \"train\",\n \"val_split\": \"test\",\n \"schemas\": [\n {\n \"definition\": \"¶{context}:> \",\n \"inputs\": '¶{human}:> ',\n \"targets\": '¶{robot}:> '\n },\n {\n \"definition\": \"SYSTEM: \",\n \"inputs\": 'USER: ',\n \"targets\": 'ASSISTANT: '\n },\n {\n \"definition\": \"CONTEXT: \",\n \"inputs\": 'INPUT: ',\n \"targets\": 'OUTPUT: '\n }\n ],\n \"patterns\": [\n '{context}',\n '{human}',\n '{robot}'\n ],\n \"delimiter\": \"\\n\",\n \"buffer_size\": 1000,\n \"val_samples\": 1000,\n \"sample_rate\": 0.25,\n },\n {\n \"hf\": True,\n \"repo\": \"databricks/databricks-dolly-15k\",\n \"split\": \"train\",\n \"schemas\": [\n {\n \"context\": \"¶{context}:> \",\n \"instruction\": '¶{instruction}:> ',\n \"response\": '¶{response}:> '\n },\n {\n \"context\": \"SYSTEM: \",\n \"instruction\": 'USER: ',\n \"response\": 'ASSISTANT: '\n },\n {\n \"context\": \"CONTEXT: \",\n \"instruction\": 'INPUT: ',\n \"response\": 'OUTPUT: '\n }\n ],\n \"patterns\": [\n '{context}',\n '{instruction}',\n '{response}'\n ],\n \"delimiter\": \"\\n\",\n \"buffer_size\": 1000,\n \"sample_rate\": 0.25,\n },\n {\n \"hf\": True,\n \"repo\": \"HuggingFaceTB/smollm-corpus\",\n \"split\": \"train\",\n \"subset\": \"cosmopedia-v2\",\n \"schemas\": [\n {\n \"prompt\": '¶{prompt}:> ',\n \"text\": '¶{text}:> '\n },\n {\n \"prompt\": 'USER: ',\n \"text\": 'ASSISTANT: '\n },\n {\n \"prompt\": 'INPUT: ',\n \"text\": 'OUTPUT: '\n }\n ],\n \"patterns\": [\n '{prompt}',\n '{text}'\n ],\n \"delimiter\": \"\\n\",\n \"buffer_size\": 1000,\n \"sample_rate\": 0.5,\n },\n {\n \"hf\": True,\n \"repo\": \"open-phi/textbooks\",\n \"split\": \"train\",\n \"schemas\": [\n {\n \"markdown\": '',\n }\n ],\n \"delimiter\": \"\\n\",\n \"buffer_size\": 1000,\n \"sample_rate\": 1.0,\n },\n {\n \"hf\": True,\n \"repo\": \"roneneldan/TinyStories\",\n \"split\": \"train\",\n \"subset\": \"default\",\n \"schemas\": [\n {\n \"text\": '',\n },\n {\n \"text\": ': ',\n },\n {\n \"text\": ':> ',\n },\n {\n \"text\": 'OUTPUT: ',\n },\n ],\n \"delimiter\": \"\\n\",\n \"buffer_size\": 1000,\n \"sample_rate\": 0.25,\n \"val_split\": \"validation\",\n \"val_samples\": 1000,\n },\n ],\n batch_size=2,\n gradient_accumulation_steps=8,\n block_size=2048,\n num_steps=20000,\n val_interval=1000,\n warmup_steps=10,\n optimizer=\"Lion\",\n learning_rate=0.0001,\n weight_decay=0.001,\n gradient_clip_val=1.0,\n scheduler=\"cosine\",\n loggers=[logger],\n gradient_checkpointing=True,\n generate_every=10,\n save_every=25,\n checkpoint_every=25,\n resume=resume_training,\n progress_bar=True,\n output_dir=f\"{output_dir}/{focus}\",\n)","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Testing\n\nFor testing, we just run an interactive inference 
session.","metadata":{}},{"cell_type":"code","source":"# Test inference\n\nwhile True:\n print(\"PROMPT:\\n\")\n prompt = input()\n completion = prototype.generate(\n prompt=prompt,\n do_sample=True,\n min_length=23,\n max_new_tokens=111,\n temperature=0.9,\n eta_cutoff=0.0003,\n penalty_alpha=0.6,\n top_k=4,\n repetition_penalty=1.1,\n no_repeat_ngram_size=13,\n renormalize_logits=True,\n remove_invalid_values=True,\n max_time=60,\n use_cache=True,\n )\n print(\"COMPLETION:\\n\")\n print(completion)","metadata":{"trusted":true},"execution_count":null,"outputs":[]}]}
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dadfd56d766715c61d2ef780a525ab43b8e6da4de6865bda3d95fdef5e134055
+ size 493443
tokenizer_config.json ADDED
@@ -0,0 +1,42 @@
+ {
+   "add_bos_token": true,
+   "add_eos_token": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [],
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "</s>",
+   "legacy": true,
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": null,
+   "sp_model_kwargs": {},
+   "spaces_between_special_tokens": false,
+   "tokenizer_class": "LlamaTokenizer",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": false
+ }
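Per this config, the tokenizer is a standard `LlamaTokenizer` setup that prepends `<s>` but does not append `</s>`; a minimal sketch of what that looks like in practice (assuming the `UNSAFE/Mixtress-135M` repo id from the README):

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UNSAFE/Mixtress-135M")

ids = tokenizer("Hello world").input_ids
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token)  # <s> </s> <unk>
print(ids[0] == tokenizer.bos_token_id)   # True: add_bos_token prepends <s>
print(ids[-1] == tokenizer.eos_token_id)  # False: add_eos_token is disabled
```
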