---
license: apache-2.0
---
# Tele-FLM
Tele-FLM (aka FLM-2) is a 52B open-sourced multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgement capabilities.
Built upon the decoder-only transformer architecture, it has been trained on approximately 2T tokens.
Tele-FLM demonstrates superior performance at its scale and sometimes surpasses larger models.
In addition to sharing the model weights, we provide the core designs, engineering practices, and training details, anticipating their benefits for both academic and industrial communities.
## Model Details
- **Developed by:** BAAI & TeleAI
- **Language(s):** English; Chinese; Other languages
- **License:** Apache 2.0
## Tech report
[Tele-FLM Technical Report](https://arxiv.org/pdf/2404.16645)
## Bias, Risks, and Limitations
Although we've made extensive efforts to thoroughly clean and filter the training corpus for the model, due to the open nature of the dataset, the model may still have picked up on some unsafe examples. Consequently, the model may still generate unexpected content, including but not limited to discrimination, bias, or offensive language. We would like to strongly advise users not to spread any unsafe content generated by the model. The project developers cannot be held responsible for any repercussions stemming from the dissemination of harmful information.
## Quick Start
Use the code below to get started with Tele-FLM.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code=True lets transformers run the custom model code hosted in the model repository.
tokenizer = AutoTokenizer.from_pretrained('CofeAI/Tele-FLM', trust_remote_code=True)
# Load the weights in bfloat16 and let device_map="auto" place them on the available devices.
model = AutoModelForCausalLM.from_pretrained('CofeAI/Tele-FLM', torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, device_map="auto", trust_remote_code=True)
# Prompt (Chinese): "Beijing is the capital of China"
inputs = tokenizer('北京市是中国的首都', return_tensors='pt').to(model.device)
generated = model.generate(**inputs, max_new_tokens=128, repetition_penalty=1.03)
print(tokenizer.decode(generated.cpu()[0], skip_special_tokens=True))
```
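Before loading the checkpoint, it can help to estimate how much accelerator memory the weights alone need. The sketch below is a back-of-the-envelope calculation, not a measurement: it takes the parameter count of about 52,850 M reported for Tele-FLM, assumes 2 bytes per parameter for bfloat16, and ignores activations, the KV cache, and framework overhead. The 80 GB device size is a hypothetical example.

```python
import math

# Parameter count of Tele-FLM as reported in this model card (52,850 M).
TELE_FLM_PARAMS = 52_850e6
# bfloat16 stores each parameter in 2 bytes.
BYTES_PER_PARAM_BF16 = 2


def weight_memory_gb(n_params: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


def min_devices(n_params: float, device_gb: float) -> int:
    """Lower bound on the number of devices needed to hold the weights.

    This ignores activations, KV cache, and fragmentation, so real
    deployments usually need headroom beyond this bound.
    """
    return math.ceil(weight_memory_gb(n_params) / device_gb)


if __name__ == "__main__":
    print(f"bf16 weights: {weight_memory_gb(TELE_FLM_PARAMS):.1f} GB")
    # 80 GB per device is a hypothetical example, not a requirement.
    print(f"devices needed at 80 GB each (weights only): {min_devices(TELE_FLM_PARAMS, 80)}")
```

Since the weights exceed the capacity of a single such device, `device_map="auto"` in the snippet above is what spreads them across several devices.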
## Training Details
### Training Data
Our training dataset comprises a variety of domains, as detailed in the table below.
The total amount of data is roughly 2 trillion tokens, with English and Chinese data in a ratio of about 2:1.
In line with the methodology of GPT-4, we collected some instruction data and incorporated it into our pre-training data after removing the test sets of common datasets using a strict n-gram-based method. We deliberately avoid “training on the test set” or any other benchmark-oriented trick.
|Domain |Language|Sampling Prop. |Epochs |Disk Size |
|-------|:--------------:|:--------------:|:-------:|:-----------:|
| Webtext |en, zh | 75.21% | 1.0 | 5.9 TB |
| Code |code, zh | 9.81% | 1.0 | 528.1 GB |
| Book |en, zh | 7.17% | 0.8 | 647.6 GB |
| WorldKnowledge |multi, en, zh | 2.87% | 2.5 | 67.5 GB |
| QA |en, zh | 2.12% | 1.0 | 159.2 GB |
| AcademicPaper |en | 0.99% | 1.0 | 54.4 GB |
| Profession-Law |zh | 1.04% | 1.0 | 84.2 GB |
| Profession-Math |math | 0.62% | 2.0 | 6.1 GB |
| Profession-Patent |zh | 0.14% | 1.0 | 10.4 GB |
| Profession-Medical |zh | 0.02% | 1.0 | 1.2 GB |
| ClassicalChinese |zh | 0.02% | 2.5 | 0.5 GB |
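
The n-gram-based test-set removal mentioned above can be illustrated with a small sketch. This is not the pipeline used for Tele-FLM: the tokenization (lowercased whitespace splitting), the n-gram length, and the function names `ngrams`, `build_test_index`, and `decontaminate` are all assumptions made for illustration. The idea is to drop every training example that shares at least one n-gram with any benchmark test example.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of word-level n-grams of a text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_test_index(test_examples: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram that occurs in any test example."""
    index: Set[NGram] = set()
    for example in test_examples:
        index |= ngrams(example, n)
    return index


def decontaminate(train_examples: Iterable[str], test_examples: Iterable[str], n: int = 8) -> List[str]:
    """Keep only training examples that share no n-gram with the test set.

    n=8 is a hypothetical default; the model card does not state the value used.
    """
    index = build_test_index(test_examples, n)
    return [ex for ex in train_examples if ngrams(ex, n).isdisjoint(index)]


if __name__ == "__main__":
    test_set = ["What is the capital of France ?"]
    train_set = [
        "Quiz: what is the capital of France ? Answer: Paris",  # overlaps with the test set
        "The Seine flows through the centre of Paris",          # clean
    ]
    print(decontaminate(train_set, test_set, n=5))
```

A stricter filter uses a smaller `n`, which removes more training examples at the cost of more false positives.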
### Model Architecture
We adopt the architecture of FLM-101B as the backbone for Tele-FLM, with several modifications:
- Rotary Positional Embedding (RoPE)
- RMSNorm for normalization
- SwiGLU as the activation function
- Linear bias disabled
- Embedding and language model head untied
Consequently, Tele-FLM is largely compatible with Llama architecturally.
To maximize convenience for the community, we made minimal adjustments to Llama's code to adapt it to Tele-FLM and released it as open source.
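
To make two of the listed components concrete, here is a minimal pure-Python sketch of RMSNorm and a bias-free SwiGLU feed-forward block. It follows the commonly used Llama-style formulation and is not taken from the released Tele-FLM code; the `eps` value, the helper names, and the toy weights are assumptions for illustration only.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def rms_norm(x: Vector, weight: Vector, eps: float = 1e-6) -> Vector:
    """RMSNorm: rescale x by its root mean square, then apply a learned gain.

    Unlike LayerNorm, there is no mean subtraction and no bias term.
    eps is an assumed value for numerical stability.
    """
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * scale for v, w in zip(x, weight)]


def silu(v: float) -> float:
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(m: Matrix, x: Vector) -> Vector:
    """Matrix-vector product without a bias term (linear bias disabled)."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def swiglu_ffn(x: Vector, w_gate: Matrix, w_up: Matrix, w_down: Matrix) -> Vector:
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x)), all projections bias-free."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


if __name__ == "__main__":
    x = [3.0, 4.0]
    normed = rms_norm(x, weight=[1.0, 1.0])
    # Toy weights: hidden size 2, ffn hidden size 3.
    w_gate = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    w_up = [[0.6, 0.5], [0.4, 0.3], [0.2, 0.1]]
    w_down = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
    print(normed, swiglu_ffn(normed, w_gate, w_up, w_down))
```
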
In the pre-training stage, we employ μP for optimal hyperparameter search. The μP model (Tele-FLM_μP) is architecturally identical to Tele-FLM except for the model width.
The architecture of Tele-FLM and Tele-FLM_μP is listed below.
For more details on μP, please refer to our technical report and the original Tensor Programs papers.
| Models | layer number | attention heads | hidden size | ffn hidden size | vocab size | context length | param size (M) |
|--------|--------------|----------------|-------------|----------------|------------|----------------|----------------|
| Tele-FLM | 64 | 64 | 8,192 | 21,824 | 80,000 | 4,096 | 52,850 |
| Tele-FLM_μP | 64 | 4 | 512 | 1,344 | 80,000 | 4,096 | 283 |
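
The parameter counts in the table can be roughly reproduced from the other columns. The sketch below is an estimate under stated assumptions rather than the authors' accounting: it assumes standard multi-head attention with four bias-free `hidden x hidden` projections per layer, a three-matrix SwiGLU feed-forward block, two RMSNorm gain vectors per layer plus a final one, and untied input and output embeddings. The `ModelConfig` class and `estimate_params` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    layers: int
    hidden: int
    ffn_hidden: int
    vocab: int


def estimate_params(cfg: ModelConfig) -> int:
    """Estimate the parameter count of a Llama-style decoder without biases."""
    attention = 4 * cfg.hidden * cfg.hidden   # Q, K, V, and output projections
    ffn = 3 * cfg.hidden * cfg.ffn_hidden     # gate, up, and down projections (SwiGLU)
    norms = 2 * cfg.hidden                    # two RMSNorm gain vectors per layer
    per_layer = attention + ffn + norms
    embeddings = 2 * cfg.vocab * cfg.hidden   # untied input embedding and LM head
    final_norm = cfg.hidden
    return cfg.layers * per_layer + embeddings + final_norm


# Values taken from the architecture table above.
TELE_FLM = ModelConfig(layers=64, hidden=8192, ffn_hidden=21824, vocab=80000)
TELE_FLM_MUP = ModelConfig(layers=64, hidden=512, ffn_hidden=1344, vocab=80000)

if __name__ == "__main__":
    for name, cfg in [("Tele-FLM", TELE_FLM), ("Tele-FLM_muP", TELE_FLM_MUP)]:
        print(f"{name}: ~{estimate_params(cfg) / 1e6:,.0f} M parameters")
```

Under these assumptions the estimate lands within about 1% of both figures in the table.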
### Training Hyperparameters
Owing to its smaller size, Tele-FLM_μP allows for significantly more experimental runs within fixed time and resource constraints.
We searched seven hyperparameters for pre-training. All the hyperparameters are shown below.
| Searched Hyperparameters ||| Non-Searched Hyperparameters ||
|--------------------------------------------|-|-|-|----------------------------------|
| Learning Rate | 1.5e-4 || LR Schedule Type | cosine |
| Matrix Learning Rate | 1.5e-4 || LR Schedule (tokens) | 2.5T |
| Minimum Learning Rate | 1.5e-5 || Warmup Step | 2,000 |
| Standard Deviation | 4e-3 || Clip Grad | 1.0 |
| Matrix Standard Deviation | 4.242e-3 || Weight Decay | 0.0 |
| Input Mult | 1.0 || Batch Size (tokens) | 5,505,024 |
| Output Mult | 3.125e-2 || RoPE Theta | 10,000 |
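
The learning-rate settings in the table can be combined into a schedule sketch. The peak and minimum learning rates, warmup steps, batch size, and schedule length come from the table; the linear shape of the warmup and the choice to start the cosine decay right after warmup are assumptions, since the table only names the schedule type. The function name `lr_at` is hypothetical.

```python
import math

# Values from the hyperparameter table.
PEAK_LR = 1.5e-4
MIN_LR = 1.5e-5
WARMUP_STEPS = 2_000
BATCH_TOKENS = 5_505_024
SCHEDULE_TOKENS = 2.5e12
CONTEXT_LENGTH = 4_096

# Number of optimizer steps covered by the schedule (about 454k).
TOTAL_STEPS = int(SCHEDULE_TOKENS // BATCH_TOKENS)
# Sequences per batch implied by the batch size in tokens.
SEQUENCES_PER_BATCH = BATCH_TOKENS // CONTEXT_LENGTH


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step.

    Assumed shape: linear warmup from 0 to PEAK_LR, then cosine decay
    from PEAK_LR to MIN_LR over the remaining scheduled steps.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


if __name__ == "__main__":
    print(f"sequences per batch: {SEQUENCES_PER_BATCH}")
    for step in (0, 1_000, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(f"step {step:>7}: lr = {lr_at(step):.3e}")
```

Because the schedule spans 2.5T tokens while the model was trained on approximately 2T tokens, training under this sketch would stop before the learning rate reaches its minimum.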
### Training Loss