monsoon-nlp committed
Commit 7603efe · verified · 1 Parent(s): a460bac

Update README.md

Files changed (1)
  1. README.md +13 -16
README.md CHANGED
@@ -3,31 +3,28 @@ license: apache-2.0
  base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
  tags:
  - generated_from_trainer
- model-index:
- - name: tinyllama-mixpretrain-quinoa-sciphi
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
  # tinyllama-mixpretrain-quinoa-sciphi

- This model is a fine-tuned version of [TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T) on the None dataset.
-
- ## Model description
-
- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

  ### Training hyperparameters
 
  base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
  tags:
  - generated_from_trainer
+ datasets:
+ - cerebras/SlimPajama-627B
+ - bigcode/starcoderdata
+ - monsoon-nlp/greenbeing-proteins
+ - SciPhi/textbooks-are-all-you-need-lite
  ---

  # tinyllama-mixpretrain-quinoa-sciphi

+ TinyLlama model with continued pretraining / full-model finetuning on amino acid sequences and simulated science textbooks.

+ The goal is to create models which understand amino acid sequences and natural language descriptions or Q&A.
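
A minimal usage sketch with `transformers`; the repo id, prompt format, and generation settings here are assumptions rather than documented behavior of this checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id for this card; the prompt below is a placeholder,
# not a documented prompting format.
model_id = "monsoon-nlp/tinyllama-mixpretrain-quinoa-sciphi"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "MGKVLVVLLA"  # placeholder amino acid sequence prefix
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```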

+ Training data was shuffled with:
+ - 50% amino acid sequences / proteins from the [GreenBeing](https://huggingface.co/datasets/monsoon-nlp/greenbeing-proteins) research dataset (mostly quinoa)
+ - 50% textbook content from the [SciPhi](https://huggingface.co/datasets/SciPhi/textbooks-are-all-you-need-lite) training dataset
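
A rough sketch of how such a 50/50 mix could be assembled with the `datasets` library; the split and column names are assumptions, not taken from the training notebook:

```python
from datasets import load_dataset, interleave_datasets

# Assumed split/column names; both sources are reduced to a single
# "text" column so they can be interleaved.
proteins = load_dataset("monsoon-nlp/greenbeing-proteins", split="train")
textbooks = load_dataset("SciPhi/textbooks-are-all-you-need-lite", split="train")

proteins = proteins.map(lambda ex: {"text": ex["sequence"]},
                        remove_columns=proteins.column_names)    # "sequence" is an assumed column
textbooks = textbooks.map(lambda ex: {"text": ex["completion"]},
                          remove_columns=textbooks.column_names)  # "completion" is an assumed column

# Draw examples from each source with equal probability, then shuffle.
mixed = interleave_datasets([proteins, textbooks], probabilities=[0.5, 0.5], seed=42)
mixed = mixed.shuffle(seed=42)
```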

+ ## Training procedure

+ Colab notebook: https://colab.research.google.com/drive/1dah43byt-T0HQC9eCigNbxSZ8aHu6s-W?usp=sharing

+ To fit on an L4 GPU, it was necessary to use max_length=400 and train_batch_size=1.
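
A minimal sketch of how those two limits map onto a `transformers` Trainer setup; everything other than `max_length=400` and the batch size of 1 is an assumption:

```python
from transformers import AutoTokenizer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
)

def tokenize(batch):
    # Truncate each example to 400 tokens so activations fit in L4 GPU memory.
    return tokenizer(batch["text"], truncation=True, max_length=400)

training_args = TrainingArguments(
    output_dir="tinyllama-mixpretrain-quinoa-sciphi",  # assumed output path
    per_device_train_batch_size=1,  # train_batch_size=1 from the note above
)
```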
  ### Training hyperparameters