{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.10.14","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"gpu","dataSources":[],"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":true}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# Pretraining a Mixtral\n---\n[AIGen](https://github.com/Vectorrent/aigen) is a text generation and training library, originally forked from [AITextGen](https://aitextgen.minimaxir.com/) (which is now defunct).\n\nAIGen is also the foundation of [VTX](https://github.com/0-5788719150923125/vtx).\n\nTo use this notebook with Kaggle, one must first enable the \"Internet\" feature. To do so:\n\n1. Find \"Notebook options\" in the sidebar on the right-hand side of this page.\n2. If required, verify your phone number.\n3. Choose \"Internet on\".\n4. Connect to the P100 accelerator.\n5. Set up file persistence.\n\nDo not forget to connect to an accelerator. The P100s are better than the T4s. 
However, with 2x T4s available, training may benefit from DistributedDataParallel (DDP).","metadata":{}},{"cell_type":"markdown","source":"## Install dependencies","metadata":{}},{"cell_type":"code","source":"# Now we install AIGen\n!pip install 'git+https://github.com/Vectorrent/aigen.git'\n\n# Optional: flash-attn speeds up attention\n# !pip install -U flash-attn --no-build-isolation","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Configuration\n\nWe would set a bunch of variables here, if we hadn't hardcoded them below for clarity.","metadata":{}},{"cell_type":"code","source":"# Set some variables\nfocus = 'mixtress'\nprecision = 32\nattn_implementation = \"eager\"\n\n# working dir\noutput_dir = \"/kaggle/working\"\n\n# Mixtral is gated, so we use someone else's repo\nbase_model = 'TitanML/tiny-mixtral'\ntokenizer_model = base_model\ntokenizer_config = dict(\n    cache_dir=f\"{output_dir}/{focus}\",\n    padding=\"max_length\",\n    padding_side=\"left\",\n    use_fast=True,\n    return_overflowing_tokens=True,\n    truncation=True,\n    trust_remote_code=True,\n)\n\n# Set to True to resume training from a checkpoint; False starts a fresh run\nresume_training = False","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Pretraining\n\nWe want to train a model from scratch, so we import a functional config, then change it.","metadata":{}},{"cell_type":"code","source":"from transformers import (\n    AutoConfig,\n    AutoTokenizer,\n)\n\ntokenizer = AutoTokenizer.from_pretrained(tokenizer_model, **tokenizer_config)\n\npretrain_config = AutoConfig.from_pretrained(base_model)\noverrides = {\n    \"model_type\": base_model,\n    \"universal\": True,\n    \"world_size\": 23,\n    \"hidden_act\": 'mish',\n    \"hidden_size\": 512,\n    \"intermediate_size\": 1024,\n    \"initializer_range\": 0.02,\n    \"num_hidden_layers\": 8,\n    
\"num_attention_heads\": 16,\n    \"num_key_value_heads\": 8,\n    \"rope_theta\": 1000000.0,\n    \"num_experts_per_tok\": 3,\n    \"num_local_experts\": 9,\n    \"vocab_size\": 32000,\n    \"tie_word_embeddings\": True,\n    \"router_aux_loss_coef\": 0.001,\n    \"router_jitter_noise\": 0.1,\n    \"sliding_window\": 4096,\n    \"attention_dropout\": 0.1\n}\nsetattr(pretrain_config, \"_name_or_path\", focus)\nsetattr(pretrain_config, \"bos_token_id\", tokenizer.bos_token_id)\nsetattr(pretrain_config, \"eos_token_id\", tokenizer.eos_token_id)\nfor k, v in overrides.items():\n    setattr(pretrain_config, k, v)\nprint(\"Modified pretrain config:\")\nprint(pretrain_config)","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Load the model\n\nHere we initialize the model with random weights.","metadata":{}},{"cell_type":"code","source":"# Instantiate your model\nimport os\nimport shutil\nfrom aigen import aigen\n\nif resume_training:\n    model = None\n    model_folder = f\"{output_dir}/{focus}\"\n    pretrain_config = None\nelse:\n    model = base_model\n    model_folder = None\n    shutil.rmtree(output_dir, ignore_errors=True)\n\nprototype = aigen(\n    model=model,\n    model_folder=model_folder,\n    tokenizer=tokenizer,\n    cache_dir=f\"{output_dir}/{focus}\",\n    precision=precision,\n    config=pretrain_config,\n    device_map=\"cuda:0\",\n    attn_implementation=attn_implementation\n)\n\nprint(prototype)","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Metrics\n\nWe want to log training metrics, so we install TensorBoard and expose it via ngrok. 
This requires an authtoken from ngrok.com, saved in Kaggle's \"Add-ons > Secrets\".","metadata":{}},{"cell_type":"code","source":"from kaggle_secrets import UserSecretsClient\nsecret_label = \"NGROK_SECRET\"\nsecret_value = UserSecretsClient().get_secret(secret_label)\n\nimport os\nimport shutil\n\ndirectory = f\"{output_dir}/logs\"\nos.makedirs(directory, exist_ok=True)\n\n# Clear stale logs unless we are resuming a previous run\nif not resume_training:\n    for filename in os.listdir(directory):\n        file_path = os.path.join(directory, filename)\n        if os.path.isdir(file_path):\n            shutil.rmtree(file_path)\n        else:\n            os.remove(file_path)\n\nif secret_value:\n\n    !pip install ngrok tensorboard\n\n    import threading\n    import subprocess\n\n    def start_tensorboard():\n        subprocess.Popen(\n            [\"tensorboard\", \"--logdir\", \"/kaggle/working/logs\", \"--bind_all\", \"--samples_per_plugin\", \"scalars=999999999\"],\n            stdout=subprocess.DEVNULL,\n            stderr=subprocess.STDOUT\n        )\n\n    tensorboard_thread = threading.Thread(target=start_tensorboard)\n    tensorboard_thread.start()\n\n    import ngrok\n\n    listener = await ngrok.forward(6006, authtoken=secret_value)\n    \n    import time\n\n    time.sleep(1)\n    print(listener.url())","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Training\n\nFinally, we train the model on datasets streamed from: https://huggingface.co/datasets","metadata":{}},{"cell_type":"code","source":"# Train the model\n\nimport os\nfrom lightning.pytorch import loggers\n\nos.makedirs(f\"{output_dir}/logs/{focus}\", exist_ok=True)\nlogger = loggers.TensorBoardLogger(f\"{output_dir}/logs\", name=focus, default_hp_metric=True)\n\nprototype.train(\n    devices=[0],\n    strategy=\"auto\",\n    streaming_data=[\n        {\n            \"hf\": True,\n            \"repo\": \"allenai/c4\", \n            \"split\": \"train\",\n            \"val_split\": \"validation\",\n            \"subset\": \"en.noblocklist\",\n            \"schemas\": [\n                {\n     
               \"text\": \"\"\n                }\n            ],\n            \"buffer_size\": 1000,\n            \"val_samples\": 1000,\n            \"sample_rate\": 1.0\n        },\n        {\n            \"hf\": True,\n            \"repo\": \"HuggingFaceFW/fineweb-edu\", \n            \"split\": \"train\", \n            \"subset\": \"sample-10BT\",\n            \"schemas\": [\n                {\n                    \"text\": \"\"\n                }\n            ],\n            \"delimiter\": \"\\n\",\n            \"buffer_size\": 1000,\n            \"sample_rate\":1.0\n        },\n#         {\n#             \"hf\": True,\n#             \"repo\": \"cerebras/SlimPajama-627B\",\n#             \"split\": \"train\",\n#             \"val_split\": \"validation\",\n#             \"val_samples\": 1000,\n# #             \"snapshots\": [\n# #                 \"2023-14\"\n# #             ],\n# #             \"subset\": \"sample-10B\",\n# #             \"languages\": [\n# #                 \"en\"\n# #             ],\n#             \"schemas\": [\n#                 {\n#                     \"text\": \"\"\n#                 }\n#             ],\n#             \"buffer_size\": 1000,\n#             \"sample_rate\": 1.0\n#         },\n#         {\n#             \"hf\": True,\n#             \"repo\": \"togethercomputer/RedPajama-Data-V2\",\n#             \"split\": \"train\",\n#             \"snapshots\": [\n#                 \"2023-14\"\n#             ],\n#             \"subset\": \"sample-10B\",\n#             \"languages\": [\n#                 \"en\"\n#             ],\n#             \"schemas\": [\n#                 {\n#                     \"raw_content\": \"\"\n#                 }\n#             ],\n#             \"buffer_size\": 1000,\n#             \"sample_rate\": 1.0\n#         },\n        {\n            \"hf\": True,\n            \"repo\": \"Muennighoff/natural-instructions\",\n            \"split\": \"train\",\n            \"val_split\": \"test\",\n            
\"schemas\": [\n                {\n                   \"definition\": \"¶{context}:> \",\n                   \"inputs\": '¶{human}:> ',\n                   \"targets\": '¶{robot}:> '\n                },\n                {\n                   \"definition\": \"SYSTEM: \",\n                   \"inputs\": 'USER: ',\n                   \"targets\": 'ASSISTANT: '\n                },\n                {\n                   \"definition\": \"CONTEXT: \",\n                   \"inputs\": 'INPUT: ',\n                   \"targets\": 'OUTPUT: '\n                }\n            ],\n            \"patterns\": [\n                '{context}',\n                '{human}',\n                '{robot}'\n            ],\n            \"delimiter\": \"\\n\",\n            \"buffer_size\": 1000,\n            \"val_samples\": 1000,\n            \"sample_rate\": 0.25,\n        },\n        {\n            \"hf\": True,\n            \"repo\": \"databricks/databricks-dolly-15k\",\n            \"split\": \"train\",\n            \"schemas\": [\n                {\n                   \"context\": \"¶{context}:> \",\n                   \"instruction\": '¶{instruction}:> ',\n                   \"response\": '¶{response}:> '\n                },\n                {\n                   \"context\": \"SYSTEM: \",\n                   \"instruction\": 'USER: ',\n                   \"response\": 'ASSISTANT: '\n                },\n                {\n                   \"context\": \"CONTEXT: \",\n                   \"instruction\": 'INPUT: ',\n                   \"response\": 'OUTPUT: '\n                }\n            ],\n            \"patterns\": [\n               '{context}',\n               '{instruction}',\n               '{response}'\n            ],\n            \"delimiter\": \"\\n\",\n            \"buffer_size\": 1000,\n            \"sample_rate\": 0.25,\n        },\n        {\n            \"hf\": True,\n            \"repo\": \"HuggingFaceTB/smollm-corpus\",\n            \"split\": \"train\",\n            
\"subset\": \"cosmopedia-v2\",\n            \"schemas\": [\n                {\n                   \"prompt\": '¶{prompt}:> ',\n                   \"text\": '¶{text}:> '\n                },\n                {\n                   \"prompt\": 'USER: ',\n                   \"text\": 'ASSISTANT: '\n                },\n                {\n                   \"prompt\": 'INPUT: ',\n                   \"text\": 'OUTPUT: '\n                }\n            ],\n            \"patterns\": [\n               '{prompt}',\n               '{text}'\n            ],\n            \"delimiter\": \"\\n\",\n            \"buffer_size\": 1000,\n            \"sample_rate\": 0.5,\n        },\n        {\n            \"hf\": True,\n            \"repo\": \"open-phi/textbooks\",\n            \"split\": \"train\",\n            \"schemas\": [\n                {\n                   \"markdown\": '',\n                }\n            ],\n            \"delimiter\": \"\\n\",\n            \"buffer_size\": 1000,\n            \"sample_rate\": 1.0,\n        },\n        {\n            \"hf\": True,\n            \"repo\": \"roneneldan/TinyStories\",\n            \"split\": \"train\",\n            \"subset\": \"default\",\n            \"schemas\": [\n                {\n                    \"text\": '',\n                },\n                {\n                    \"text\": ': ',\n                },\n                {\n                    \"text\": ':> ',\n                },\n                {\n                    \"text\": 'OUTPUT: ',\n                },\n            ],\n            \"delimiter\": \"\\n\",\n            \"buffer_size\": 1000,\n            \"sample_rate\": 0.25,\n            \"val_split\": \"validation\",\n            \"val_samples\": 1000,\n        },\n    ],\n    batch_size=2,\n    gradient_accumulation_steps=8,\n    block_size=2048,\n    num_steps=20000,\n    val_interval=1000,\n    warmup_steps=10,\n    optimizer=\"Lion\",\n    learning_rate=0.0001,\n    weight_decay=0.001,\n    
gradient_clip_val=1.0,\n    scheduler=\"cosine\",\n    loggers=[logger],\n    gradient_checkpointing=True,\n    generate_every=10,\n    save_every=25,\n    checkpoint_every=25,\n    resume=resume_training,\n    progress_bar=True,\n    output_dir=f\"{output_dir}/{focus}\",\n)","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Testing\n\nFor testing, we just run an interactive inference session.","metadata":{}},{"cell_type":"code","source":"# Test inference\n\nwhile True:\n    print(\"PROMPT:\\n\")\n    prompt = input()\n    completion = prototype.generate(\n        prompt=prompt,\n        do_sample=True,\n        min_length=23,\n        max_new_tokens=111,\n        temperature=0.9,\n        eta_cutoff=0.0003,\n        penalty_alpha=0.6,\n        top_k=4,\n        repetition_penalty=1.1,\n        no_repeat_ngram_size=13,\n        renormalize_logits=True,\n        remove_invalid_values=True,\n        max_time=60,\n        use_cache=True,\n    )\n    print(\"COMPLETION:\\n\")\n    print(completion)","metadata":{"trusted":true},"execution_count":null,"outputs":[]}]}