We used this version of TinyLlama as a base model: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
The goal was to improve performance on basic algebra (i.e. solving systems of linear equations).
The base model was fine-tuned on 8k rows of synthetic solution data generated by OpenMath-Mistral-7B-v0.1-hf on ALG-514.
We used the NeMo Skills pipeline for inference with code execution and for generating the synthetic data. HuggingFace's SFTTrainer was used for fine-tuning, as the NeMo Skills pipeline was too difficult to work with. Fine-tuning took 30 minutes on an RTX 3090.
Sample Input:
System:
You're an expert Python programmer and mathematician.
Help the user to solve this problem using code when necessary.
Make sure to put the answer (and only answer) inside \boxed{}.
User:
High school graduating class is made up of 466 students. There are 212 more girls than boys. How many boys are in the class?
Assistant:
Sample Output:
Let's solve this problem using Python code.
<llm-code>
total_students = 466
more_girls = 212
number_of_girls = total_students - more_girls
number_of_boys = number_of_girls / 2
number_of_boys
</llm-code>
<llm-code-output>
127.0
</llm-code-output>
Thus the number of boys in the class is \boxed{127}.
The code interpreter halts generation at `</llm-code>`, inserts the result of the executed code in `<llm-code-output></llm-code-output>`, then continues generation. If there is no code interpreter, the model will just guess numbers for the output.
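The halt, execute, and resume loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the NeMo Skills implementation: `generate` is a hypothetical callable standing in for the model, `scripted_model` is a stub that replays the sample output from this card, and the snippet runs model code with plain `exec`, without any sandboxing.

```python
import ast
import contextlib
import io

CODE_OPEN, CODE_CLOSE = "<llm-code>", "</llm-code>"
OUT_OPEN, OUT_CLOSE = "<llm-code-output>", "</llm-code-output>"


def run_snippet(source: str) -> str:
    """Run source notebook-style: capture stdout plus the value of a trailing expression."""
    tree = ast.parse(source)
    last_expr = None
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        # Split off the final bare expression so its value can be echoed.
        last_expr = ast.Expression(tree.body.pop().value)
    namespace = {}
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(compile(tree, "<llm-code>", "exec"), namespace)  # unsafe outside a sandbox
        if last_expr is not None:
            value = eval(compile(last_expr, "<llm-code>", "eval"), namespace)
            if value is not None:
                print(value)
    return buffer.getvalue().strip()


def generate_with_interpreter(generate, prompt: str, max_rounds: int = 4) -> str:
    """Alternate between model generation and code execution.

    generate(text, stop) is a hypothetical callable that returns the model's
    continuation of text, ending with stop if the model emitted it.
    """
    text = prompt
    for _ in range(max_rounds):
        continuation = generate(text, stop=CODE_CLOSE)
        text += continuation
        if not continuation.endswith(CODE_CLOSE):
            break  # the model finished without closing another code block
        # Take the code between the last opening tag and the closing tag.
        code = continuation.rsplit(CODE_OPEN, 1)[-1][: -len(CODE_CLOSE)]
        text += f"\n{OUT_OPEN}\n{run_snippet(code)}\n{OUT_CLOSE}\n"
    return text


def scripted_model(text, stop):
    """Stub model (assumption for illustration): replays the sample output above."""
    if OUT_CLOSE not in text:
        return (
            "Let's solve this problem using Python code.\n<llm-code>\n"
            "total_students = 466\nmore_girls = 212\n"
            "number_of_girls = total_students - more_girls\n"
            "number_of_boys = number_of_girls / 2\nnumber_of_boys\n</llm-code>"
        )
    return "Thus the number of boys in the class is \\boxed{127}."


transcript = generate_with_interpreter(scripted_model, "Assistant:\n")
print(transcript)
```

With a real model, `generate` would wrap a call that stops on the `</llm-code>` string; the executed result is appended to the context before generation resumes, so the final answer is conditioned on a computed number rather than a guessed one.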
Notes from previous model cards:
We adopted exactly the same architecture and tokenizer as Llama 2. This means TinyLlama can be plugged and played in many open-source projects built upon Llama. Besides, TinyLlama is compact with only 1.1B parameters. This compactness allows it to cater to a multitude of applications demanding a restricted computation and memory footprint.
Eval
Checkpoint | Coverage |
---|---|
1600 | 0.890244 |
2200 | 0.890244 |
2400 | 0.890244 |
2600 | 0.878049 |
1200 | 0.878049 |
2800 | 0.853659 |
2000 | 0.853659 |
800 | 0.841463 |
1000 | 0.829268 |
1800 | 0.829268 |
1400 | 0.817073 |
mistral | 0.804878 |
3000 | 0.780488 |
600 | 0.768293 |
400 | 0.731707 |
200 | 0.682927 |
0 | 0.000000 |
Note that `checkpoint_0` is the base model and `checkpoint_mistral` is OpenMath-Mistral-7B-v0.1-hf. Also note that every checkpoint from 800 to 2800 steps had better coverage than the much larger teacher model; coverage at 3000 steps fell back below it.
The zero-shot performance is _not good_™, but this model could be used to quickly generate synthetic data since the coverage is decent. The uploaded model is checkpoint-2.6k (best zero-shot performance and top 4 coverage).
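This card does not define how coverage was computed. One common reading, assumed here, is the fraction of problems for which at least one of several sampled solutions reaches the reference answer. Under that assumption, a minimal sketch of the metric could look like the following; the function names, the numeric tolerance, and the toy data are all illustrative and not taken from the evaluation code.

```python
import re


def extract_boxed(text: str):
    r"""Return the contents of the last \boxed{...} in text, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def is_correct(prediction, reference: float, tol: float = 1e-6) -> bool:
    """Compare a predicted answer string to a numeric reference (tolerance is an assumption)."""
    try:
        return abs(float(prediction) - reference) <= tol
    except (TypeError, ValueError):
        return False  # missing or non-numeric answer


def coverage(samples_per_problem, references) -> float:
    """Fraction of problems with at least one correct sampled solution (assumed definition)."""
    solved = sum(
        any(is_correct(extract_boxed(sample), ref) for sample in samples)
        for samples, ref in zip(samples_per_problem, references)
    )
    return solved / len(references)


# Toy data: two problems, two sampled solutions each.
samples = [
    ["Thus the number of boys in the class is \\boxed{127}.", "The answer is \\boxed{254}."],
    ["I am not sure.", "The answer is \\boxed{3}."],
]
references = [127.0, 5.0]
print(coverage(samples, references))
```

Under this reading, a model with weak single-sample accuracy can still be useful for synthetic data generation, because incorrect samples can be filtered out whenever at least one sample per problem is right.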
People involved in creating this fine tune:
- Coulton Theuer [theuerc@umich.edu]
- Bret Ellenbogen [bretelle@umich.edu]
- Victoria Chang [vgc@umich.edu]