Vectorrent committed
Commit 7368704
1 Parent(s): cc23476

Upload 9 files
README.md CHANGED
@@ -1,3 +1,78 @@
- ---
- license: mit
- ---
+ ---
+ language:
+ - en
+ tags:
+ - text generation
+ - pytorch
+ - causal-lm
+ license: mit
+ datasets:
+ - allenai/c4
+ - HuggingFaceFW/fineweb-edu
+ - togethercomputer/RedPajama-Data-V2
+ - Muennighoff/natural-instructions
+ - databricks/databricks-dolly-15k
+ - HuggingFaceTB/smollm-corpus
+ - open-phi/textbooks
+ - roneneldan/TinyStories
+ ---
+
+ # Mixtress 135M
+
+ ## Model Description
+
+ Mixtress 135M is a transformer model based upon the [Mixtral](https://huggingface.co/docs/transformers/en/model_doc/mixtral) architecture. It is the culmination of approximately 20 weeks of free [Kaggle](https://kaggle.com) hours and 67 twelve-hour training runs.
+
+ The results are laughably bad. The model has massively overfit to the training data, and it saw far fewer tokens than other models of comparable size. But at least I can say we saw it through to completion!
+
+ ## Training data
+
+ Mixtress was trained on a curated sampling of data from the following datasets:
+
+ - allenai/c4
+ - HuggingFaceFW/fineweb-edu
+ - togethercomputer/RedPajama-Data-V2
+ - Muennighoff/natural-instructions
+ - databricks/databricks-dolly-15k
+ - HuggingFaceTB/smollm-corpus
+ - open-phi/textbooks
+ - roneneldan/TinyStories
+
+ ## Training procedure
+
+ This model was trained for 2.15 billion tokens over 20,000 optimizer steps. It was trained as a causal (autoregressive) language model, using cross-entropy loss.
+
+ The final train loss was 1.941, the validation loss was 2.206, and the perplexity was 9.136.
+
+ Mixtress was pre-trained and fine-tuned simultaneously. Full reproduction code may be found [on Kaggle](https://www.kaggle.com/code/luciferianink/pretraining-a-mixtral), or in the Jupyter notebook [in this repository](./pretraining-a-mixtral.ipynb).
+
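For reference, the reported perplexity lines up with the usual convention of exponentiating the validation cross-entropy loss; a quick sanity check (not from the original notebook, and assuming that convention):

```py
import math

val_loss = 2.206                 # validation cross-entropy from above, in nats/token
print(math.exp(val_loss))        # ≈ 9.08, close to the reported perplexity of 9.136
```
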
+ ## Intended Use and Limitations
+
+ The model is best at what it was pretrained for, which is generating conversational text and answering questions from a prompt.
+
+ ### How to use
+
+ You can use this model directly with a pipeline for text generation. This example generates a different sequence each time it's run:
+
+ ```py
+ >>> from transformers import pipeline
+ >>> generator = pipeline('text-generation', model='UNSAFE/Mixtress-135M')
+ >>> generator("In a shocking finding, ", do_sample=True, temperature=0.7, min_length=50)
+
+ [{'generated_text': 'In a shocking finding, 20 years ago, U.S. President Donald Trump'}]
+ ```
+
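If you prefer to manage tokenization and generation yourself, the same checkpoint should also load through the standard `AutoTokenizer`/`AutoModelForCausalLM` interfaces; a minimal sketch (the generation settings below are illustrative, not tuned):

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UNSAFE/Mixtress-135M")
model = AutoModelForCausalLM.from_pretrained("UNSAFE/Mixtress-135M")

# Encode a prompt, sample a continuation, and decode it
inputs = tokenizer("In a shocking finding, ", return_tensors="pt")
outputs = model.generate(**inputs, do_sample=True, temperature=0.7, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
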
+ ## Eval results
+
+ All evaluations were done using EleutherAI's [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness).
+
+ ### Scores
+
+ All scores are accuracy, reported as percentages.
+
+ | Model and Size    | ARC-easy   | ARC-challenge | HellaSwag  | OpenBookQA | PiQA       |
+ | ----------------- | ---------- | ------------- | ---------- | ---------- | ---------- |
+ | gpt-neo-125m      | 22.95      | N/A           | 30.26      | N/A        | N/A        |
+ | **Mixtress-135M** | **29.21**  | **24.57**     | **26.99**  | **21.80**  | **52.67**  |
+
+ ## Join Us
+
+ If you would like to chat with us, please join the [Discord](https://discord.gg/8ZmHP8CqUX) server!
config.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "_name_or_path": "mixtress",
+   "architectures": [
+     "MixtralForCausalLM"
+   ],
+   "attention_dropout": 0.1,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "hidden_act": "mish",
+   "hidden_size": 512,
+   "initializer_range": 0.02,
+   "intermediate_size": 1024,
+   "max_position_embeddings": 131072,
+   "model_type": "mixtral",
+   "num_attention_heads": 16,
+   "num_experts_per_tok": 3,
+   "num_hidden_layers": 8,
+   "num_key_value_heads": 8,
+   "num_local_experts": 9,
+   "output_router_logits": false,
+   "rms_norm_eps": 1e-05,
+   "rope_theta": 1000000.0,
+   "router_aux_loss_coef": 0.001,
+   "router_jitter_noise": 0.1,
+   "sliding_window": 4096,
+   "tie_word_embeddings": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.45.1",
+   "universal": true,
+   "use_cache": false,
+   "vocab_size": 32000,
+   "world_size": 23
+ }
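For reference, the architecture described by this config can be instantiated with random weights directly from `transformers`; a minimal sketch (the `universal` and `world_size` keys are custom metadata and are omitted here):

```py
from transformers import MixtralConfig, MixtralForCausalLM

# Mirror the key architectural fields from config.json above
config = MixtralConfig(
    hidden_size=512,
    intermediate_size=1024,
    num_hidden_layers=8,
    num_attention_heads=16,
    num_key_value_heads=8,
    num_local_experts=9,
    num_experts_per_tok=3,
    hidden_act="mish",
    vocab_size=32000,
    tie_word_embeddings=True,
    sliding_window=4096,
    rope_theta=1000000.0,
    attention_dropout=0.1,
)
model = MixtralForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```
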
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "transformers_version": "4.45.1"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:243086d038abcf517e46268af189614e2bf13a10e9ef73c653888ab17af00d50
+ size 543902624
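Since the weights are stored in float32 (per `torch_dtype` in config.json), the file size gives a rough parameter count; a quick back-of-the-envelope check (ignoring the small safetensors header):

```py
size_bytes = 543_902_624     # from the LFS pointer above
bytes_per_param = 4          # float32
print(size_bytes / bytes_per_param / 1e6)  # ≈ 136 million parameters, roughly in line with the "135M" name
```
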
pretraining-a-mixtral.ipynb ADDED
@@ -0,0 +1 @@
+ {"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.14","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"gpu","dataSources":[],"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":true}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# Pretraining a Mixtral\n---\n[AIGen](https://github.com/Vectorrent/aigen) is a text generation and training library, originally forked from [AITextGen](https://aitextgen.minimaxir.com/) (which is now defunct).\n\nAIGen is also the foundation of [VTX](https://github.com/0-5788719150923125/vtx).\n\nTo use this notebook with Kaggle, one must first enable the \"Internet\" feature. To do so:\n\n1. Find \"Notebook options\" in the sidebar on the right-hand side of this page.\n2. If required, verify your phone number.\n3. Choose \"Internet on\".\n4. Connect to the P100 accelerator.\n5. Setup file persistence.\n\nDo not forget to connect to an accelerator. The P100's are better than the T4's. However, with 2x T4's available, training may benefit from DistributedDataParallel (DDP) training.","metadata":{}},{"cell_type":"markdown","source":"## Update system packages","metadata":{}},{"cell_type":"code","source":"# Now we install AIGen\n!pip install 'git+https://github.com/Vectorrent/aigen.git'\n\n# Speed up everything\n# !pip install -U flash-attn --no-build-isolation","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Configuration\n\nWe would set a bunch of variables here, if we hadn't hardcoded them below for clarity.","metadata":{}},{"cell_type":"code","source":"# Set some variables\nfocus = 'mixtress'\nprecision = 32\nattn_implementation = \"eager\"\n\n# working dir\noutput_dir = \"/kaggle/working\"\n\n# Mixtral is gated, so we use someone else's repo\nbase_model = 'TitanML/tiny-mixtral'\ntokenizer_model = base_model\ntokenizer_config = dict(\n cache_dir=f\"{output_dir}/{focus}\",\n padding=\"max_length\",\n padding_side=\"left\",\n use_fast=True,\n return_overflowing_tokens=True,\n truncation=True,\n trust_remote_code=True,\n)\n\n# to continue training from a checkpoint, False starts a fresh run\nresume_training = False","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Pretraining\n\nWe want to train a model from scratch, so we import a functional config, then change it.","metadata":{}},{"cell_type":"code","source":"from transformers import (\n AutoConfig,\n AutoTokenizer,\n PretrainedConfig,\n PreTrainedTokenizerFast,\n)\n\ntokenizer = AutoTokenizer.from_pretrained(tokenizer_model, **tokenizer_config)\n\npretrain_config = AutoConfig.from_pretrained(base_model)\noverrides = {\n \"model_type\": base_model,\n \"universal\": True,\n \"world_size\": 23,\n \"hidden_act\": 'mish',\n \"hidden_size\": 512,\n \"intermediate_size\": 1024,\n \"initializer_range\": 0.02,\n \"num_hidden_layers\": 8,\n \"num_attention_heads\": 16,\n \"num_key_value_heads\": 8,\n \"rope_theta\": 1000000.0,\n \"num_experts_per_tok\": 3,\n \"num_local_experts\": 9,\n \"vocab_size\": 32000,\n \"tie_word_embeddings\": True,\n \"router_aux_loss_coef\": 0.001,\n \"router_jitter_noise\": 0.1,\n \"sliding_window\": 4096,\n \"attention_dropout\": 0.1\n}\nsetattr(pretrain_config, \"_name_or_path\", 
focus)\nsetattr(pretrain_config, \"bos_token_id\", tokenizer.bos_token_id)\nsetattr(pretrain_config, \"eos_token_id\", tokenizer.eos_token_id)\nfor k, v in overrides.items():\n setattr(pretrain_config, k, v)\nprint(f\"modified pretrain config:\")\nprint(pretrain_config)","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Load the model\n\nHere we initialize the model with random weights.","metadata":{}},{"cell_type":"code","source":"# Instantiate your model\nimport os\nimport shutil\nfrom aigen import aigen\n\nprototype = None\n\nif resume_training:\n model = None\n model_folder = f\"{output_dir}/{focus}\"\n pretrain_config = None\nelse:\n model = base_model\n model_folder = None\n shutil.rmtree(output_dir, ignore_errors=True)\n\nprototype = aigen(\n model=model,\n model_folder=model_folder,\n tokenizer=tokenizer,\n cache_dir=f\"{output_dir}/{focus}\",\n precision=precision,\n config=pretrain_config,\n device_map=\"cuda:0\",\n attn_implementation=attn_implementation\n)\n\nprint(prototype)","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Metrics\n\nWe want to log training metrics, so we install Tensorboard and expose it via ngrok. This requires an authtoken from ngrok.com, saved in Kaggle's \"Add-ons>Secrets\".","metadata":{}},{"cell_type":"code","source":"from kaggle_secrets import UserSecretsClient\nsecret_label = \"NGROK_SECRET\"\nsecret_value = UserSecretsClient().get_secret(secret_label)\n\nimport os\nimport shutil\n\ndirectory = f\"{output_dir}/logs\"\nos.makedirs(directory, exist_ok=True)\n\nif not resume_training:\n for filename in os.listdir(directory):\n file_path = os.path.join(directory, filename)\n shutil.rmtree(file_path)\n\nif secret_value:\n\n !pip install ngrok tensorboard\n\n import threading\n import subprocess\n\n def start_tensorboard():\n subprocess.Popen(\n [\"tensorboard\", \"--logdir\", \"/kaggle/working/logs\", \"--bind_all\", \"--samples_per_plugin\", \"scalars=999999999\"],\n stdout=subprocess.DEVNULL,\n stderr=subprocess.STDOUT\n )\n\n tensorboard_thread = threading.Thread(target=start_tensorboard)\n tensorboard_thread.start()\n\n import ngrok\n\n listener = await ngrok.forward(6006, authtoken=secret_value)\n \n import time\n\n time.sleep(1)\n print(listener.url())","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Training\n\nFinally, we train the model on a dataset streamed from: https://huggingface.co/datasets","metadata":{}},{"cell_type":"code","source":"# Train the model\n\nimport os\nfrom lightning.pytorch import loggers\n\nos.makedirs(f\"{output_dir}/logs/{focus}\", exist_ok=True)\nlogger = loggers.TensorBoardLogger(f\"{output_dir}/logs\", name=focus, default_hp_metric=True)\n\nprototype.train(\n devices=[0],\n strategy=\"auto\",\n streaming_data=[\n {\n \"hf\": True,\n \"repo\": \"allenai/c4\", \n \"split\": \"train\",\n \"val_split\": \"validation\",\n \"subset\": \"en.noblocklist\",\n \"schemas\": [\n {\n \"text\": \"\"\n }\n ],\n \"buffer_size\": 1000,\n \"val_samples\": 1000,\n \"sample_rate\": 1.0\n },\n {\n \"hf\": True,\n \"repo\": \"HuggingFaceFW/fineweb-edu\", \n \"split\": \"train\", \n \"subset\": \"sample-10BT\",\n \"schemas\": [\n {\n \"text\": \"\"\n }\n ],\n \"delimiter\": \"\\n\",\n \"buffer_size\": 1000,\n \"sample_rate\":1.0\n },\n# {\n# \"hf\": True,\n# \"repo\": 
\"cerebras/SlimPajama-627B\",\n# \"split\": \"train\",\n# \"val_split\": \"validation\",\n# \"val_samples\": 1000,\n# # \"snapshots\": [\n# # \"2023-14\"\n# # ],\n# # \"subset\": \"sample-10B\",\n# # \"languages\": [\n# # \"en\"\n# # ],\n# \"schemas\": [\n# {\n# \"text\": \"\"\n# }\n# ],\n# \"buffer_size\": 1000,\n# \"sample_rate\": 1.0\n# },\n# {\n# \"hf\": True,\n# \"repo\": \"togethercomputer/RedPajama-Data-V2\",\n# \"split\": \"train\",\n# \"snapshots\": [\n# \"2023-14\"\n# ],\n# \"subset\": \"sample-10B\",\n# \"languages\": [\n# \"en\"\n# ],\n# \"schemas\": [\n# {\n# \"raw_content\": \"\"\n# }\n# ],\n# \"buffer_size\": 1000,\n# \"sample_rate\": 1.0\n# },\n {\n \"hf\": True,\n \"repo\": \"Muennighoff/natural-instructions\",\n \"split\": \"train\",\n \"val_split\": \"test\",\n \"schemas\": [\n {\n \"definition\": \"¶{context}:> \",\n \"inputs\": '¶{human}:> ',\n \"targets\": '¶{robot}:> '\n },\n {\n \"definition\": \"SYSTEM: \",\n \"inputs\": 'USER: ',\n \"targets\": 'ASSISTANT: '\n },\n {\n \"definition\": \"CONTEXT: \",\n \"inputs\": 'INPUT: ',\n \"targets\": 'OUTPUT: '\n }\n ],\n \"patterns\": [\n '{context}',\n '{human}',\n '{robot}'\n ],\n \"delimiter\": \"\\n\",\n \"buffer_size\": 1000,\n \"val_samples\": 1000,\n \"sample_rate\": 0.25,\n },\n {\n \"hf\": True,\n \"repo\": \"databricks/databricks-dolly-15k\",\n \"split\": \"train\",\n \"schemas\": [\n {\n \"context\": \"¶{context}:> \",\n \"instruction\": '¶{instruction}:> ',\n \"response\": '¶{response}:> '\n },\n {\n \"context\": \"SYSTEM: \",\n \"instruction\": 'USER: ',\n \"response\": 'ASSISTANT: '\n },\n {\n \"context\": \"CONTEXT: \",\n \"instruction\": 'INPUT: ',\n \"response\": 'OUTPUT: '\n }\n ],\n \"patterns\": [\n '{context}',\n '{instruction}',\n '{response}'\n ],\n \"delimiter\": \"\\n\",\n \"buffer_size\": 1000,\n \"sample_rate\": 0.25,\n },\n {\n \"hf\": True,\n \"repo\": \"HuggingFaceTB/smollm-corpus\",\n \"split\": \"train\",\n \"subset\": \"cosmopedia-v2\",\n \"schemas\": [\n {\n \"prompt\": '¶{prompt}:> ',\n \"text\": '¶{text}:> '\n },\n {\n \"prompt\": 'USER: ',\n \"text\": 'ASSISTANT: '\n },\n {\n \"prompt\": 'INPUT: ',\n \"text\": 'OUTPUT: '\n }\n ],\n \"patterns\": [\n '{prompt}',\n '{text}'\n ],\n \"delimiter\": \"\\n\",\n \"buffer_size\": 1000,\n \"sample_rate\": 0.5,\n },\n {\n \"hf\": True,\n \"repo\": \"open-phi/textbooks\",\n \"split\": \"train\",\n \"schemas\": [\n {\n \"markdown\": '',\n }\n ],\n \"delimiter\": \"\\n\",\n \"buffer_size\": 1000,\n \"sample_rate\": 1.0,\n },\n {\n \"hf\": True,\n \"repo\": \"roneneldan/TinyStories\",\n \"split\": \"train\",\n \"subset\": \"default\",\n \"schemas\": [\n {\n \"text\": '',\n },\n {\n \"text\": ': ',\n },\n {\n \"text\": ':> ',\n },\n {\n \"text\": 'OUTPUT: ',\n },\n ],\n \"delimiter\": \"\\n\",\n \"buffer_size\": 1000,\n \"sample_rate\": 0.25,\n \"val_split\": \"validation\",\n \"val_samples\": 1000,\n },\n ],\n batch_size=2,\n gradient_accumulation_steps=8,\n block_size=2048,\n num_steps=20000,\n val_interval=1000,\n warmup_steps=10,\n optimizer=\"Lion\",\n learning_rate=0.0001,\n weight_decay=0.001,\n gradient_clip_val=1.0,\n scheduler=\"cosine\",\n loggers=[logger],\n gradient_checkpointing=True,\n generate_every=10,\n save_every=25,\n checkpoint_every=25,\n resume=resume_training,\n progress_bar=True,\n output_dir=f\"{output_dir}/{focus}\",\n)","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Testing\n\nFor testing, we just run an interactive inference 
session.","metadata":{}},{"cell_type":"code","source":"# Test inference\n\nwhile True:\n print(\"PROMPT:\\n\")\n prompt = input()\n completion = prototype.generate(\n prompt=prompt,\n do_sample=True,\n min_length=23,\n max_new_tokens=111,\n temperature=0.9,\n eta_cutoff=0.0003,\n penalty_alpha=0.6,\n top_k=4,\n repetition_penalty=1.1,\n no_repeat_ngram_size=13,\n renormalize_logits=True,\n remove_invalid_values=True,\n max_time=60,\n use_cache=True,\n )\n print(\"COMPLETION:\\n\")\n print(completion)","metadata":{"trusted":true},"execution_count":null,"outputs":[]}]}
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dadfd56d766715c61d2ef780a525ab43b8e6da4de6865bda3d95fdef5e134055
+ size 493443
tokenizer_config.json ADDED
@@ -0,0 +1,42 @@
+ {
+   "add_bos_token": true,
+   "add_eos_token": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [],
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "</s>",
+   "legacy": true,
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": null,
+   "sp_model_kwargs": {},
+   "spaces_between_special_tokens": false,
+   "tokenizer_class": "LlamaTokenizer",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": false
+ }
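Per this config, the tokenizer is a standard `LlamaTokenizer` setup that prepends `<s>` but does not append `</s>`; a minimal sketch of what that looks like in practice (assuming the `UNSAFE/Mixtress-135M` repo id from the README):

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UNSAFE/Mixtress-135M")

ids = tokenizer("Hello world").input_ids
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token)  # <s> </s> <unk>
print(ids[0] == tokenizer.bos_token_id)   # True: add_bos_token prepends <s>
print(ids[-1] == tokenizer.eos_token_id)  # False: add_eos_token is disabled
```
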