Introduction
The Aquila-135M model is a small bilingual (Chinese and English) language model trained with a two-phase paradigm: pre-training followed by annealing. It was pre-trained on 1.66T bilingual Chinese and English tokens, and then annealed on 100B tokens of carefully selected high-quality bilingual data to produce the final model.
The Aquila-135M-Instruct model is fine-tuned on the Infinity Instruct dataset.
The entire training process was conducted with FlagGems, a Triton-based operator library, and the parallel training framework FlagScale.
We have also open-sourced all intermediate checkpoints.
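As an illustration of the FlagGems-based stack, here is a minimal sketch of running inference with FlagGems' Triton operators enabled. The flag_gems.use_gems() context manager is assumed from the FlagGems documentation and may differ across versions, so treat this as an illustration rather than an official usage guide.
from transformers import AutoModelForCausalLM, AutoTokenizer
import flag_gems  # Triton-based operator library
checkpoint = "BAAI/Aquila-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to("cuda")
inputs = tokenizer("What is gravity?", return_tensors="pt").to("cuda")
# Route supported PyTorch operators through FlagGems' Triton kernels for the
# duration of this block (API assumed from the FlagGems docs).
with flag_gems.use_gems():
    outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))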
News
- 2024/12/24: We have released Aquila-135M and Aquila-135M-Instruct.
- 2024/12/24: We have released all datasets and intermediate checkpoints from training. Please feel free to use these models for analysis and experimentation.
Datasets
We have open-sourced all bilingual datasets used during both the pre-training and annealing phases. The dataset composition and mixing proportions are shown in the figure below.
Evaluation
We followed the same evaluation settings as the SmolLM models and evaluated all models with the lighteval tool.
The parameter counts exclude embeddings, and Aquila-135M shares an identical model architecture with SmolLM2-135M. Aquila-135M achieves comparable performance on English benchmarks and significantly better results on Chinese benchmarks.
Among small models with total parameter counts below or around 400M, Aquila-135M maintains a leading position in overall capability while significantly enhancing Chinese language proficiency.
Metrics (0-shot) | Aquila-135M (Triton) | Aquila-135M (CUDA) | SmolLM-135M | SmolLM2-135M | gpt2-medium-360M | TinyMistral-248M | TinyMistral-248M-2.5 | OpenELM-270M | Wide-Sheared-LLaMA-290M | opt-350m | MobileLLM-350M | pythia-410m | SmolLM-360M | SmolLM2-360M |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HellaSwag | 41.19 | 41.12 | 41.15 | 42.10 | 37.08 | 27.06 | 26.80 | 45.74 | 24.94 | 36.08 | 26.28 | 39.22 | 51.73 | 54.66 |
ARC (Average) | 44.76 | 44.15 | 42.34 | 43.93 | 34.34 | 29.71 | 27.63 | 35.74 | 26.20 | 31.91 | 27.72 | 35.14 | 49.95 | 53.24 |
PIQA | 66.38 | 67.52 | 68.28 | 68.44 | 66.38 | 57.40 | 53.92 | 69.75 | 50.60 | 64.36 | 50.27 | 67.19 | 71.55 | 71.98 |
MMLU (cloze) | 31.07 | 30.67 | 30.26 | 31.58 | 27.75 | 25.82 | 25.59 | 27.89 | 24.75 | 26.58 | 24.86 | 28.88 | 34.32 | 36.09 |
CommonsenseQA | 32.10 | 31.70 | 32.02 | 32.92 | 31.70 | 24.57 | 21.46 | 35.71 | 16.54 | 32.10 | 17.53 | 31.45 | 36.61 | 38.74 |
TriviaQA | 6.65 | 7.02 | 4.24 | 4.03 | 2.36 | 0.50 | 0.08 | 1.34 | 0.00 | 1.38 | 0.00 | 2.06 | 9.19 | 16.92 |
Winogrande | 51.07 | 51.70 | 51.22 | 50.99 | 49.49 | 49.25 | 49.01 | 52.41 | 49.72 | 51.54 | 49.41 | 49.96 | 53.12 | 52.49 |
OpenBookQA | 34.40 | 34.40 | 33.80 | 34.60 | 31.40 | 29.40 | 27.40 | 30.60 | 26.00 | 27.80 | 24.80 | 28.40 | 37.20 | 37.00 |
GSM8K (5-shot) | 2.12 | 2.12 | 1.00 | 1.52 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.81 |
SIQA | 41.81 | 42.32 | 41.15 | 41.45 | 41.30 | 41.86 | 39.71 | 42.73 | 39.76 | 42.37 | 37.10 | 42.02 | 43.45 | 41.61 |
CEval | 29.22 | 29.82 | 28.28 | 26.41 | 25.40 | 25.38 | 26.89 | 26.69 | 26.37 | 26.67 | 25.68 | 27.97 | 27.66 | 28.51 |
CMMLU | 29.48 | 29.63 | 26.01 | 26.66 | 27.20 | 26.67 | 25.57 | 26.25 | 26.33 | 26.93 | 25.61 | 26.91 | 27.06 | 27.39 |
Average-English | 35.16 | 35.27 | 34.55 | 35.16 | 32.18 | 28.56 | 27.16 | 34.19 | 25.85 | 31.41 | 25.80 | 32.43 | 38.71 | 40.55 |
Average-Chinese | 29.35 | 29.73 | 27.15 | 26.54 | 26.30 | 26.03 | 26.23 | 26.47 | 26.35 | 26.80 | 25.65 | 27.44 | 27.36 | 27.95 |
Average | 32.25 | 32.50 | 30.85 | 30.85 | 29.24 | 27.29 | 26.70 | 30.33 | 26.10 | 29.11 | 25.72 | 29.94 | 33.04 | 34.25 |
Comparison models were evaluated in our local environment, so their scores may differ slightly from those reported in the original papers.
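Most of the benchmarks above are 0-shot multiple-choice tasks that lighteval typically scores by comparing the likelihood the model assigns to each candidate answer. The snippet below is a minimal illustrative sketch of that idea, not the actual lighteval harness; the base checkpoint name (BAAI/Aquila-135M) and the toy question are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Illustrative 0-shot multiple-choice scoring: pick the answer with the highest
# log-likelihood under the model. Not the lighteval harness used for the table.
checkpoint = "BAAI/Aquila-135M"  # assumed name of the base checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.eval()
def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `context`."""
    ctx_ids = tokenizer.encode(context, return_tensors="pt")
    cont_ids = tokenizer.encode(continuation, add_special_tokens=False, return_tensors="pt")
    full_ids = torch.cat([ctx_ids, cont_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)  # [1, seq, vocab]
    total = 0.0
    # Logits at position i predict the token at position i + 1.
    for pos in range(ctx_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total
prompt = "Question: Which force pulls objects toward the Earth?\nAnswer:"
choices = [" gravity", " magnetism", " friction", " inertia"]
scores = [continuation_logprob(prompt, c) for c in choices]
print("Predicted answer:", choices[scores.index(max(scores))])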
How to use
Instruct Model
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "BAAI/Aquila-135M-Instruct"
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
messages = [{"role": "user", "content": "什么是引力?"}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(input_text)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))
## 引力是宇宙中的一个基本力,由多个物体相互作用而产生的。它由能量和质量组成,与引力定律密切相关。
## (English: Gravity is a fundamental force in the universe, produced by the interaction of multiple objects. It is composed of energy and mass and is closely related to the law of gravitation.)
messages = [{"role": "user", "content": "What is gravity?"}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(input_text)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))
## Gravity is the force that keeps us on Earth as we orbit it. It pulls objects towards each other with a strength that depends on how far apart they are from each other, and how strong the gravitational pull is. The stronger the object's mass, the greater its gravitational pull.
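Base Model
The base model can be used for plain text completion without the chat template. A minimal sketch, assuming the base checkpoint is published as BAAI/Aquila-135M:
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "BAAI/Aquila-135M"  # assumed name of the base checkpoint
device = "cuda"  # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))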
Future Plan
- We plan to further optimize the composition and proportions of the dataset.
- We plan to further explore the application of small-scale models in specific scenarios.
Citation
If you find this work useful, please cite:
@misc{aquila-135m,
title={Aquila-135M: A Bilingual Small Language Model in Chinese and English},
author={BAAI},
year={},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={},
}