Falcon 2: An 11B parameter pretrained language model and VLM, trained on over 5,000B tokens and in 11 languages
The Falcon 2 Models
TII is launching a new generation of models, Falcon 2, focused on providing the open-source community with a series of smaller models with enhanced performance and multi-modal support. Our goal is to enable cheaper inference and encourage the development of more downstream applications with improved usability.
The first generation of Falcon models, featuring Falcon-40B and Falcon-180B, made a significant contribution to the open-source community, promoting the release of advanced LLMs with permissive licenses. More detailed information on the previous generation of Falcon models can be found in the RefinedWeb, Penedo et al., 2023 and The Falcon Series of Open Language Models, Almazrouei et al., 2023 papers, and the Falcon and Falcon-180B blog posts.
The second generation of models focuses on increased usability and integrability, and on building a multi-modal ecosystem. We start this journey by releasing not only the base 11B LLM, but also an 11B VLM that incorporates image understanding capabilities. The vision-language model (VLM) allows users to engage in chats about visual content using text.
As with our previous work, the models offer support mainly in English but have good capabilities in ten other languages, including Spanish, French, and German.
Table of Contents
- The Falcon 2 Models
- Falcon 2 11B LLM
- Falcon 2 11B VLM
- License information
Falcon2-11B LLM
Training Data
Falcon2-11B was trained on over 5,000 GT (billion tokens) of RefinedWeb, a high-quality filtered and deduplicated web dataset, enhanced with curated corpora. Training followed a four-stage strategy: the first three stages progressively increased the context length from 2048 to 4096 and finally to 8192 tokens, while the last stage aimed to further enhance performance using only high-quality data.
Overall, the data sources included RefinedWeb-English, RefinedWeb-Europe (cs, de, es, fr, it, nl, pl, pt, ro, sv), high-quality technical data, code data, and conversational data extracted from public sources.
The training stages were as follows:
Stage | Context Length | GT |
---|---|---|
Stage 1 | 2048 | 4500 |
Stage 2 | 4096 | 250 |
Stage 3 | 8192 | 250 |
Stage 4 | 8192 | 500 |
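Summing the four stages gives the total token count quoted for Falcon2-11B in the evaluation tables below:

```python
# Tokens seen per training stage, in GT (billions of tokens)
stages = {"stage_1": 4500, "stage_2": 250, "stage_3": 250, "stage_4": 500}

total_gt = sum(stages.values())
print(f"Total training tokens: {total_gt} GT")  # 5500 GT, as listed in the evaluation tables below
```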
The data was tokenized with the Falcon2-11B tokenizer, the same tokenizer used for the previous Falcon models.
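If you want to inspect the tokenizer yourself, it can be loaded directly from the Hugging Face Hub. A minimal sketch (the example sentence is arbitrary; the printed values are whatever the released tokenizer reports):

```python
from transformers import AutoTokenizer

# Load the Falcon2-11B tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-11B")

# Inspect basic properties and tokenize a short multilingual sample
print(tokenizer.vocab_size)
print(tokenizer.tokenize("Falcon 2 versteht auch Deutsch."))
```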
Model Architecture
The following table summarizes some of the crucial details about the model architecture:
Design choice | Value |
---|---|
Number of Transformer Blocks | 60 |
Number of Query Heads | 32 |
Number of Key/Value Heads | 8 |
Head Dimensions | 128 |
Parallel Attention | yes |
MLP Upscale Factor | 4 |
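As a rough back-of-the-envelope sketch (not an official parameter breakdown), these design choices imply a hidden size of 32 × 128 = 4096, with grouped-query attention sharing 8 key/value heads across the 32 query heads; most of the 11B parameters then come from the 60 transformer blocks:

```python
# Rough sketch of the per-block projection sizes implied by the table above
# (assumes hidden_size = num_query_heads * head_dim; not an official breakdown).
num_blocks = 60
num_query_heads = 32
num_kv_heads = 8
head_dim = 128
mlp_upscale = 4

hidden_size = num_query_heads * head_dim              # 4096
q_proj = hidden_size * num_query_heads * head_dim     # 4096 x 4096 query projection
kv_proj = 2 * hidden_size * num_kv_heads * head_dim   # grouped-query K and V projections (4096 x 1024 each)
attn_out = hidden_size * hidden_size                  # attention output projection
mlp = 2 * hidden_size * (mlp_upscale * hidden_size)   # MLP up- and down-projections

per_block = q_proj + kv_proj + attn_out + mlp
print(f"hidden size: {hidden_size}")
print(f"approx. parameters per block: {per_block / 1e6:.1f}M")
print(f"approx. parameters in all blocks: {num_blocks * per_block / 1e9:.2f}B")
# Embeddings, biases, and norms account for the remainder of the ~11B total.
```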
Training Procedure
Falcon2-11B was trained on 1024 A100 40GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=8, PP=1, DP=128, i.e. 8 × 1 × 128 = 1024 GPUs) combined with ZeRO and Flash-Attention 2.
Training Hyperparameters
Hyperparameter | Value |
---|---|
Precision | bfloat16 |
Optimizer | AdamW |
Max LR | 3.7e-4 |
Min LR | 1.89e-5 |
LR schedule | Cos decay (stage 1) |
Context length | 8192 (stages 3 and 4) |
Weight decay | 1e-1 |
Z-loss | 1e-4 |
Batch size | Variable |
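The table doesn't spell out the full schedule (warmup, number of steps, etc.), but a minimal PyTorch sketch of AdamW with a cosine decay from the max to the min learning rate, using the values above, could look like this (`total_steps` and the dummy parameters are hypothetical placeholders):

```python
import torch

# Dummy parameters, for illustration only
params = [torch.nn.Parameter(torch.zeros(1))]
total_steps = 1000  # hypothetical; the real schedule length is not given here

# AdamW with the max LR and weight decay from the hyperparameter table
optimizer = torch.optim.AdamW(params, lr=3.7e-4, weight_decay=0.1)

# Cosine decay from the max LR (3.7e-4) down to the min LR (1.89e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps, eta_min=1.89e-5
)

for step in range(total_steps):
    optimizer.step()      # in real training, this follows a forward/backward pass
    scheduler.step()
    if step % 250 == 0:
        print(step, scheduler.get_last_lr()[0])
```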
Falcon2-11B Evaluation
English performance
Performance on Open LLM Leaderboard tasks:
Checkpoint | GT | HellaSwag-10 | Winogrande-5 | ArcChallenge-25 | TruthfulQA-0 | MMLU-5 | GSM8K-5 | Average |
---|---|---|---|---|---|---|---|---|
Falcon2-11B | 5500 | 82.91 | 78.30 | 59.73 | 52.56 | 58.37 | 53.83 | 64.28 |
Falcon-40B | 1000 | 85.28 | 81.29 | 61.86 | 41.65 | 56.89 | 21.46 | 58.07 |
Falcon-7B | 1500 | 78.13 | 72.38 | 47.87 | 34.26 | 27.79 | 4.62 | 44.17 |
Gemma-7B | 6000 | 82.47 | 78.45 | 61.09 | 44.91 | 66.03 | 52.77 | 64.29 |
Llama3-8B | 15000 | 82.09 | 77.35 | 59.47 | 43.90 | 66.69 | 44.79 | 62.38 |
Mistral-7B | N/A | 83.31 | 78.37 | 59.98 | 42.15 | 64.16 | 37.83 | 60.97 |
The Hugging Face Leaderboard team provided an official evaluation of our model on the Open LLM Leaderboard tasks. The model performs better than models such as Llama3-8B (which was trained on roughly three times more data) and Mistral-7B, and on par with Gemma-7B.
Zero-shot performance:
Checkpoint | GT | HellaSwag | ArcEasy | Winogrande | ArcChallenge |
---|---|---|---|---|---|
Falcon2-11B | 5500 | 82.07 | 77.78 | 78.30 | 50.17 |
Falcon-40B | 1000 | 82.82 | 81.86 | 76.4 | 54.69 |
Falcon-7B | 1500 | 76.31 | 74.74 | 67.17 | 43.43 |
These evaluation results show that Falcon2-11B achieves performance similar to Falcon-40B at roughly a quarter of the model size!
Multilingual capabilities
Using the Multilingual LLM leaderboard, we compare the Falcon2-11B model to Llama-7B and Bloom-7B. For reference, we also include Falcon-40B (which supports the same languages), Falcon-7B (which supports French), and Mistral-7B.
Model | Language ID | ArcChallenge-25 | HellaSwag | MMLU-25 | TQA | Average |
---|---|---|---|---|---|---|
Falcon2-11B | de | 43.7 | 67.96 | 38.3 | 47.53 | 49.37 |
Falcon2-11B | es | 46.2 | 73.63 | 37.9 | 46.43 | 51.06 |
Falcon2-11B | fr | 45.8 | 72.41 | 39.53 | 47.30 | 51.27 |
Falcon2-11B | it | 45.6 | 70.83 | 38.05 | 47.14 | 50.42 |
Falcon2-11B | nl | 41.7 | 69.05 | 38.29 | 48.81 | 49.47 |
Falcon2-11B | ro | 42.4 | 66.24 | 38.01 | 45.53 | 48.04 |
Falcon-40B | de | 45.1 | 68.3 | 36.2 | 39.8 | 47.4 |
Falcon-40B | es | 48.5 | 73.9 | 37.2 | 39.0 | 49.6 |
Falcon-40B | fr | 47.6 | 72.9 | 37.3 | 38.5 | 49.1 |
Falcon-40B | it | 46.3 | 70.2 | 36.4 | 40.7 | 48.4 |
Falcon-40B | nl | 42.9 | 68.4 | 36.5 | 40.9 | 47.1 |
Falcon-40B | ro | 43.2 | 66.0 | 35.7 | 39.8 | 46.2 |
Falcon-7B | fr | 37.3 | 64.1 | 28.4 | 34.0 | 40.9 |
Mistral-7B | de | 41.2 | 58.7 | 40.5 | 44.9 | 46.3 |
Mistral-7B | es | 44.2 | 65.3 | 42.4 | 43.1 | 48.7 |
Mistral-7B | fr | 44.9 | 64.4 | 41.9 | 43.0 | 48.6 |
Mistral-7B | it | 43.2 | 60.9 | 39.7 | 43.1 | 46.7 |
Mistral-7B | nl | 40.0 | 57.9 | 41.4 | 43.3 | 45.7 |
Mistral-7B | ro | 40.7 | 53.6 | 39.3 | 43.6 | 44.3 |
Llama-7B | de | 35.1 | 49.9 | 29.9 | 38.3 | 38.3 |
Llama-7B | es | 36.8 | 56.4 | 30.3 | 37.0 | 40.1 |
Llama-7B | fr | 37.3 | 55.7 | 30.5 | 39.9 | 40.9 |
Llama-7B | it | 35.8 | 52.0 | 29.9 | 39.6 | 39.3 |
Llama-7B | nl | 33.6 | 48.7 | 29.8 | 40.0 | 38.0 |
Llama-7B | ro | 32.4 | 44.9 | 29.7 | 37.0 | 36.0 |
Bloom-7B | de | 26.3 | 32.4 | 28.1 | 43.7 | 32.6 |
Bloom-7B | es | 38.1 | 56.7 | 28.9 | 40.4 | 41.0 |
Bloom-7B | fr | 36.7 | 56.6 | 29.9 | 40.9 | 41.0 |
Bloom-7B | it | 29.0 | 40.8 | 27.6 | 43.7 | 35.3 |
Bloom-7B | nl | 23.1 | 31.7 | 27.5 | 42.7 | 31.3 |
Bloom-7B | ro | 26.9 | 31.8 | 27.4 | 46.1 | 33.1 |
In the spirit of the original Falcon models, Falcon2-11B was trained not only on English data but also on ten other languages. Our multilingual evaluation results show that the model performs well in the six languages featured on the Multilingual LLM Leaderboard (de, es, fr, it, nl, ro), and achieves higher scores than Falcon-40B and the other multilingual models listed above in all of these languages.
We will soon release more extensive evaluation results for multilingual capabilities in the Falcon2-11B model card!
Code generation capabilities
We check the model's code generation performance on the HumanEval benchmark (Python) from the BigCode Leaderboard, obtaining a pass@1 score of 29.59%.
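For context, pass@1 measures the fraction of problems for which a generated solution passes the unit tests. When sampling n completions per problem, the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) can be computed as in this small sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total number of sampled completions for a problem
    c: number of completions that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 1 correct completion out of 20 samples, evaluated at k=1
print(pass_at_k(n=20, c=1, k=1))  # 0.05
```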
Using Falcon2-11B
First, set up a text-generation pipeline for the model with the Transformers library:

```python
from transformers import AutoTokenizer
import transformers
import torch

model = "tiiuae/falcon-11B"

# Load the tokenizer and build a text-generation pipeline around the model
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
And then, you'd run text generation using code like the following:
```python
sequences = pipeline(
    "Can you explain the concept of Quantum Computing?",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
```
Falcon2-11B VLM
Falcon2-11B VLM is a vision-language model (VLM) built on top of the LLM that additionally handles image inputs and can answer queries about images. To achieve this, we integrate the pretrained CLIP ViT-L/14 vision encoder with our Falcon2-11B chat-finetuned model and train with image-text data.
To enhance the VLM's perception of fine-grained details and small objects in images, we employ a dynamic encoding mechanism at high resolution for image inputs, similar to LLaVA-Next.
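As a rough illustration of this kind of dynamic high-resolution encoding (a simplified sketch, not the exact preprocessing used for Falcon2-11B VLM; the 336-pixel tile size is an assumption based on the CLIP ViT-L/14 encoder), the idea is to split a large image into a grid of tiles at the vision encoder's native resolution and to add a downscaled global view:

```python
from PIL import Image

def tile_image(image: Image.Image, tile_size: int = 336):
    """Simplified dynamic high-resolution encoding: split an image into tiles
    at the encoder's native resolution, plus one downscaled global view.
    (Illustrative only; a real LLaVA-Next-style pipeline also selects an
    optimal grid shape and pads the image.)"""
    width, height = image.size
    n_cols = max(1, round(width / tile_size))
    n_rows = max(1, round(height / tile_size))

    # Resize so the image divides evenly into the tile grid
    resized = image.resize((n_cols * tile_size, n_rows * tile_size))

    tiles = []
    for row in range(n_rows):
        for col in range(n_cols):
            box = (col * tile_size, row * tile_size,
                   (col + 1) * tile_size, (row + 1) * tile_size)
            tiles.append(resized.crop(box))

    # Global view at the encoder's base resolution
    global_view = image.resize((tile_size, tile_size))
    return [global_view] + tiles
```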
Training
The training is done in two stages: pretraining and finetuning. In both stages, the visual encoder weights are kept frozen. In the pretraining stage, the LLM is also kept frozen and only the multimodal projector is trained on 558K image-caption pairs, so that the projector learns a mapping from the visual embedding space to the text embedding space. During finetuning, both the projector and the LLM weights are trained on a corpus of 1.2M image-text instruction samples from public datasets, which also includes multi-round conversations.
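A minimal sketch of this freezing scheme in PyTorch terms (the module names `vision_encoder`, `projector`, and `language_model` are illustrative placeholders, not the actual attribute names of the released model):

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze all parameters of a module."""
    for param in module.parameters():
        param.requires_grad = trainable

def configure_stage(vision_encoder: nn.Module, projector: nn.Module,
                    language_model: nn.Module, stage: str) -> None:
    # The vision encoder stays frozen in both stages
    set_trainable(vision_encoder, False)
    if stage == "pretraining":
        # Stage 1: train only the multimodal projector on image-caption pairs
        set_trainable(projector, True)
        set_trainable(language_model, False)
    elif stage == "finetuning":
        # Stage 2: train the projector and the LLM on image-text instruction data
        set_trainable(projector, True)
        set_trainable(language_model, True)
    else:
        raise ValueError(f"unknown stage: {stage}")
```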
Falcon2-11B VLM Evaluation
Model | MME | GQA | SQA | POPE | VQAv2 | TextVQA | MM-Bench | SEED-IMG | Average |
---|---|---|---|---|---|---|---|---|---|
Falcon2-11B VLM | 1589/343 | 64.5 | 74.9 | 88.4 | 82.1 | 66.7 | 72.0 | 72.3 | 74.4 |
LLaVA-1.6 (Vicuna-7B) | 1519/332 | 64.2 | 70.1 | 86.5 | 81.8 | 64.9 | 67.4 | 70.2 | 72.1 |
LLaVA-1.6 (Vicuna-13B) | 1575/326 | 65.4 | 73.6 | 86.2 | 82.8 | 67.1 | 70.0 | 71.9 | 73.8 |
LLaVA-1.6 (Mistral-7B) | 1498/321 | 64.8 | 72.8 | 86.7 | 82.2 | 65.7 | 68.7 | 72.2 | 73.3 |
Using Falcon2-11B VLM
```python
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor
from PIL import Image
import requests
import torch

# Load the processor and the VLM checkpoint
processor = LlavaNextProcessor.from_pretrained("tiiuae/falcon-11B-vlm")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "tiiuae/falcon-11B-vlm", torch_dtype=torch.bfloat16
)

# Fetch an example image; the <image> placeholder marks where it enters the prompt
url = "https://merzougabirding.com/wp-content/uploads/2023/09/falcon-size.jpg"
falcon_image = Image.open(requests.get(url, stream=True).raw)
prompt = "User: <image>\nWhat's special about this bird's vision?"

inputs = processor(prompt, images=falcon_image, return_tensors="pt", padding=True).to('cuda:0')
model.to('cuda:0')

# Generate and decode the answer
output = model.generate(**inputs, max_new_tokens=256)
generated_captions = processor.decode(output[0], skip_special_tokens=True).strip()
print(generated_captions)
```
License information
The Falcon 2 models are made available under the TII Falcon 2 License, a permissive Apache 2.0-based software license that includes an acceptable use policy promoting the responsible use of AI. This license was crafted in the spirit of TII's commitment to the open-source community.