monsoon-nlp committed
Commit 7603efe · verified · 1 Parent(s): a460bac

Update README.md

Files changed (1)
  1. README.md +13 -16
README.md CHANGED
@@ -3,31 +3,28 @@ license: apache-2.0
  base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
  tags:
  - generated_from_trainer
- model-index:
- - name: tinyllama-mixpretrain-quinoa-sciphi
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
  # tinyllama-mixpretrain-quinoa-sciphi

- This model is a fine-tuned version of [TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T) on the None dataset.
-
- ## Model description
-
- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

  ### Training hyperparameters
 
  base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
  tags:
  - generated_from_trainer
+ datasets:
+ - cerebras/SlimPajama-627B
+ - bigcode/starcoderdata
+ - monsoon-nlp/greenbeing-proteins
+ - SciPhi/textbooks-are-all-you-need-lite
  ---

  # tinyllama-mixpretrain-quinoa-sciphi

+ TinyLlama model with continued pretraining / full-model finetuning on amino acid sequences and simulated science textbooks.

+ The goal is to create models which understand amino acid sequences and natural language descriptions or Q&A.
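
A minimal usage sketch with `transformers`; the repo id, prompt format, and generation settings here are assumptions rather than documented behavior of this checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id for this card; the prompt below is a placeholder,
# not a documented prompting format.
model_id = "monsoon-nlp/tinyllama-mixpretrain-quinoa-sciphi"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "MGKVLVVLLA"  # placeholder amino acid sequence prefix
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```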

+ Training data was shuffled with:
+ - 50% amino acid sequences / proteins from the [GreenBeing](https://huggingface.co/datasets/monsoon-nlp/greenbeing-proteins) research dataset (mostly quinoa)
+ - 50% textbook content from the [SciPhi](https://huggingface.co/datasets/SciPhi/textbooks-are-all-you-need-lite) training dataset
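
A rough sketch of how such a 50/50 mix could be assembled with the `datasets` library; the split and column names are assumptions, not taken from the training notebook:

```python
from datasets import load_dataset, interleave_datasets

# Assumed split/column names; both sources are reduced to a single
# "text" column so they can be interleaved.
proteins = load_dataset("monsoon-nlp/greenbeing-proteins", split="train")
textbooks = load_dataset("SciPhi/textbooks-are-all-you-need-lite", split="train")

proteins = proteins.map(lambda ex: {"text": ex["sequence"]},
                        remove_columns=proteins.column_names)    # "sequence" is an assumed column
textbooks = textbooks.map(lambda ex: {"text": ex["completion"]},
                          remove_columns=textbooks.column_names)  # "completion" is an assumed column

# Draw examples from each source with equal probability, then shuffle.
mixed = interleave_datasets([proteins, textbooks], probabilities=[0.5, 0.5], seed=42)
mixed = mixed.shuffle(seed=42)
```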

+ ## Training procedure

+ Colab notebook: https://colab.research.google.com/drive/1dah43byt-T0HQC9eCigNbxSZ8aHu6s-W?usp=sharing

+ To fit on an L4 GPU, it was necessary to use max_length=400 and train_batch_size=1.
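
A minimal sketch of how those two limits map onto a `transformers` Trainer setup; everything other than `max_length=400` and the batch size of 1 is an assumption:

```python
from transformers import AutoTokenizer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
)

def tokenize(batch):
    # Truncate each example to 400 tokens so activations fit in L4 GPU memory.
    return tokenizer(batch["text"], truncation=True, max_length=400)

training_args = TrainingArguments(
    output_dir="tinyllama-mixpretrain-quinoa-sciphi",  # assumed output path
    per_device_train_batch_size=1,  # train_batch_size=1 from the note above
)
```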
  ### Training hyperparameters