Falcon 2: An 11B parameter pretrained language model and VLM, trained on over 5,000B tokens and in 11 languages
The Falcon 2 Models
TII is launching a new generation of models, Falcon 2, focused on providing the open-source community with a series of smaller models with enhanced performance and multi-modal support. Our goal is to enable cheaper inference and encourage the development of more downstream applications with improved usability.
The first generation of Falcon models, featuring Falcon-40B and Falcon-180B, made a significant contribution to the open-source community, promoting the release of advanced LLMs with permissive licenses. More detailed information on the previous generation of Falcon models can be found in the RefinedWeb, Penedo et al., 2023 and The Falcon Series of Open Language Models, Almazrouei et al., 2023 papers, and the Falcon and Falcon-180B blog posts.
The second generation of models focuses on increased usability and integrability, and on building a multi-modal ecosystem. We start this journey by releasing not only the base 11B LLM, but also an 11B VLM that incorporates image understanding capabilities. The vision-language model (VLM) allows users to engage in chats about visual content using text.
As with our previous work, the models offer support mainly in English but have good capabilities in ten other languages, including Spanish, French, and German.
Table of Contents
- The Falcon 2 Models
- Falcon 2 11B LLM
- Falcon 2 11B VLM
- License information
Falcon2-11B LLM
Training Data
Falcon2-11B was trained on over 5,000 GT (billion tokens) of RefinedWeb, a high-quality filtered and deduplicated web dataset, enhanced with curated corpora. Training followed a four-stage strategy: the first three stages progressively increased the context length from 2048 to 4096 and finally to 8192 tokens, while the last stage aimed to further enhance performance using only high-quality data.
Overall, the data sources included RefinedWeb-English, RefinedWeb-Europe (cs, de, es, fr, it, nl, pl, pt, ro, sv), high-quality technical data, code data, and conversational data extracted from public sources.
The training stages were as follows:
Stage | Context Length | GT |
---|---|---|
Stage 1 | 2048 | 4500 |
Stage 2 | 4096 | 250 |
Stage 3 | 8192 | 250 |
Stage 4 | 8192 | 500 |
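Summing the four stages gives the total token count quoted for Falcon2-11B in the evaluation tables below:

```python
# Tokens seen per training stage, in GT (billions of tokens)
stages = {"stage_1": 4500, "stage_2": 250, "stage_3": 250, "stage_4": 500}

total_gt = sum(stages.values())
print(f"Total training tokens: {total_gt} GT")  # 5500 GT, as listed in the evaluation tables below
```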
The data was tokenized with the Falcon2-11B tokenizer, the same tokenizer used for the previous Falcon models.
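If you want to inspect the tokenizer yourself, it can be loaded directly from the Hugging Face Hub. A minimal sketch (the example sentence is arbitrary; the printed values are whatever the released tokenizer reports):

```python
from transformers import AutoTokenizer

# Load the Falcon2-11B tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-11B")

# Inspect basic properties and tokenize a short multilingual sample
print(tokenizer.vocab_size)
print(tokenizer.tokenize("Falcon 2 versteht auch Deutsch."))
```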
Model Architecture
The following table summarizes some of the crucial details about the model architecture:
Design choice | Value |
---|---|
Number of Transformer Blocks | 60 |
Number of Query Heads | 32 |
Number of Key/Value Heads | 8 |
Head Dimensions | 128 |
Parallel Attention | yes |
MLP Upscale Factor | 4 |
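As a rough back-of-the-envelope sketch (not an official parameter breakdown), these design choices imply a hidden size of 32 × 128 = 4096, with grouped-query attention sharing 8 key/value heads across the 32 query heads; most of the 11B parameters then come from the 60 transformer blocks:

```python
# Rough sketch of the per-block projection sizes implied by the table above
# (assumes hidden_size = num_query_heads * head_dim; not an official breakdown).
num_blocks = 60
num_query_heads = 32
num_kv_heads = 8
head_dim = 128
mlp_upscale = 4

hidden_size = num_query_heads * head_dim              # 4096
q_proj = hidden_size * num_query_heads * head_dim     # 4096 x 4096 query projection
kv_proj = 2 * hidden_size * num_kv_heads * head_dim   # grouped-query K and V projections (4096 x 1024 each)
attn_out = hidden_size * hidden_size                  # attention output projection
mlp = 2 * hidden_size * (mlp_upscale * hidden_size)   # MLP up- and down-projections

per_block = q_proj + kv_proj + attn_out + mlp
print(f"hidden size: {hidden_size}")
print(f"approx. parameters per block: {per_block / 1e6:.1f}M")
print(f"approx. parameters in all blocks: {num_blocks * per_block / 1e9:.2f}B")
# Embeddings, biases, and norms account for the remainder of the ~11B total.
```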
Training Procedure
Falcon2-11B was trained on 1024 A100 40GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=8, PP=1, DP=128, i.e. 8 × 1 × 128 = 1024 GPUs) combined with ZeRO and Flash-Attention 2.
Training Hyperparameters
Hyperparameter | Value |
---|---|
Precision | bfloat16 |
Optimizer | AdamW |
Max LR | 3.7e-4 |
Min LR | 1.89e-5 |
LR schedule | Cos decay (stage 1) |
Context length | 8192 (stages 3 and 4) |
Weight decay | 1e-1 |
Z-loss | 1e-4 |
Batch size | Variable |
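The table doesn't spell out the full schedule (warmup, number of steps, etc.), but a minimal PyTorch sketch of AdamW with a cosine decay from the max to the min learning rate, using the values above, could look like this (`total_steps` and the dummy parameters are hypothetical placeholders):

```python
import torch

# Dummy parameters, for illustration only
params = [torch.nn.Parameter(torch.zeros(1))]
total_steps = 1000  # hypothetical; the real schedule length is not given here

# AdamW with the max LR and weight decay from the hyperparameter table
optimizer = torch.optim.AdamW(params, lr=3.7e-4, weight_decay=0.1)

# Cosine decay from the max LR (3.7e-4) down to the min LR (1.89e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps, eta_min=1.89e-5
)

for step in range(total_steps):
    optimizer.step()      # in real training, this follows a forward/backward pass
    scheduler.step()
    if step % 250 == 0:
        print(step, scheduler.get_last_lr()[0])
```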
Falcon2-11B Evaluation
English performance
Performance on Open LLM Leaderboard tasks:
Checkpoint | GT | HellaSwag-10 | Winogrande-5 | ArcChallenge-25 | TruthfulQA-0 | MMLU-5 | GSM8K-5 | Average |
---|---|---|---|---|---|---|---|---|
Falcon2-11B | 5500 | 82.91 | 78.30 | 59.73 | 52.56 | 58.37 | 53.83 | 64.28 |
Falcon-40B | 1000 | 85.28 | 81.29 | 61.86 | 41.65 | 56.89 | 21.46 | 58.07 |
Falcon-7B | 1500 | 78.13 | 72.38 | 47.87 | 34.26 | 27.79 | 4.62 | 44.17 |
Gemma-7B | 6000 | 82.47 | 78.45 | 61.09 | 44.91 | 66.03 | 52.77 | 64.29 |
Llama3-8B | 15000 | 82.09 | 77.35 | 59.47 | 43.90 | 66.69 | 44.79 | 62.38 |
Mistral-7B | N/A | 83.31 | 78.37 | 59.98 | 42.15 | 64.16 | 37.83 | 60.97 |
The Hugging Face Leaderboard team provided an official evaluation of our model on the Open LLM Leaderboard tasks. The model performs better than models such as Llama3-8B (which was trained on roughly three times more data) and Mistral-7B, and on par with Gemma-7B.
Zero-shot performance:
Checkpoint | GT | HellaSwag | ArcEasy | Winogrande | ArcChallenge |
---|---|---|---|---|---|
Falcon2-11B | 5500 | 82.07 | 77.78 | 78.30 | 50.17 |
Falcon-40B | 1000 | 82.82 | 81.86 | 76.4 | 54.69 |
Falcon-7B | 1500 | 76.31 | 74.74 | 67.17 | 43.43 |
These evaluation results show that Falcon2-11B achieves performance similar to Falcon-40B at roughly a quarter of the model size!
Multilingual capabilities
Using the Multilingual LLM leaderboard, we compare the Falcon2-11B model to Llama-7B and Bloom-7B. For reference, we also include Falcon-40B (which supports the same languages), Falcon-7B (which supports French), and Mistral-7B.
Model | Language ID | ArcChallenge-25 | HellaSwag | MMLU-25 | TQA | Average |
---|---|---|---|---|---|---|
Falcon2-11B | de | 43.7 | 67.96 | 38.3 | 47.53 | 49.37 |
Falcon2-11B | es | 46.2 | 73.63 | 37.9 | 46.43 | 51.06 |
Falcon2-11B | fr | 45.8 | 72.41 | 39.53 | 47.30 | 51.27 |
Falcon2-11B | it | 45.6 | 70.83 | 38.05 | 47.14 | 50.42 |
Falcon2-11B | nl | 41.7 | 69.05 | 38.29 | 48.81 | 49.47 |
Falcon2-11B | ro | 42.4 | 66.24 | 38.01 | 45.53 | 48.04 |
Falcon-40B | de | 45.1 | 68.3 | 36.2 | 39.8 | 47.4 |
Falcon-40B | es | 48.5 | 73.9 | 37.2 | 39.0 | 49.6 |
Falcon-40B | fr | 47.6 | 72.9 | 37.3 | 38.5 | 49.1 |
Falcon-40B | it | 46.3 | 70.2 | 36.4 | 40.7 | 48.4 |
Falcon-40B | nl | 42.9 | 68.4 | 36.5 | 40.9 | 47.1 |
Falcon-40B | ro | 43.2 | 66.0 | 35.7 | 39.8 | 46.2 |
Falcon-7B | fr | 37.3 | 64.1 | 28.4 | 34.0 | 40.9 |
Mistral-7B | de | 41.2 | 58.7 | 40.5 | 44.9 | 46.3 |
Mistral-7B | es | 44.2 | 65.3 | 42.4 | 43.1 | 48.7 |
Mistral-7B | fr | 44.9 | 64.4 | 41.9 | 43.0 | 48.6 |
Mistral-7B | it | 43.2 | 60.9 | 39.7 | 43.1 | 46.7 |
Mistral-7B | nl | 40.0 | 57.9 | 41.4 | 43.3 | 45.7 |
Mistral-7B | ro | 40.7 | 53.6 | 39.3 | 43.6 | 44.3 |
Llama-7B | de | 35.1 | 49.9 | 29.9 | 38.3 | 38.3 |
Llama-7B | es | 36.8 | 56.4 | 30.3 | 37.0 | 40.1 |
Llama-7B | fr | 37.3 | 55.7 | 30.5 | 39.9 | 40.9 |
Llama-7B | it | 35.8 | 52.0 | 29.9 | 39.6 | 39.3 |
Llama-7B | nl | 33.6 | 48.7 | 29.8 | 40.0 | 38.0 |
Llama-7B | ro | 32.4 | 44.9 | 29.7 | 37.0 | 36.0 |
Bloom-7B | de | 26.3 | 32.4 | 28.1 | 43.7 | 32.6 |
Bloom-7B | es | 38.1 | 56.7 | 28.9 | 40.4 | 41.0 |
Bloom-7B | fr | 36.7 | 56.6 | 29.9 | 40.9 | 41.0 |
Bloom-7B | it | 29.0 | 40.8 | 27.6 | 43.7 | 35.3 |
Bloom-7B | nl | 23.1 | 31.7 | 27.5 | 42.7 | 31.3 |
Bloom-7B | ro | 26.9 | 31.8 | 27.4 | 46.1 | 33.1 |
In the spirit of the original Falcon models, Falcon2-11B was trained not only on English data but also on ten other languages. Our multilingual evaluation results show that the model performs well in the six languages featured on the Multilingual LLM Leaderboard (de, es, fr, it, nl, ro), and achieves higher scores than Falcon-40B and the other multilingual models listed above in all of these languages.
We will soon release more extensive evaluation results for multilingual capabilities in the Falcon2-11B model card!
Code generation capabilities
We check the model's code generation performance on the HumanEval benchmark (Python) from the BigCode Leaderboard, obtaining a pass@1 score of 29.59%.
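For context, pass@1 measures the fraction of problems for which a generated solution passes the unit tests. When sampling n completions per problem, the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) can be computed as in this small sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total number of sampled completions for a problem
    c: number of completions that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 1 correct completion out of 20 samples, evaluated at k=1
print(pass_at_k(n=20, c=1, k=1))  # 0.05
```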
Using Falcon2-11B
First, set up a text-generation pipeline for the model with the Transformers library:

```python
from transformers import AutoTokenizer
import transformers
import torch

model = "tiiuae/falcon-11B"

# Load the tokenizer and build a text-generation pipeline around the model
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
And then, you'd run text generation using code like the following:
```python
sequences = pipeline(
    "Can you explain the concept of Quantum Computing?",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
```
Falcon2-11B VLM
Falcon2-11B VLM is a vision-language model (VLM) built on top of the LLM that additionally handles image inputs and can answer queries about images. To achieve this, we integrate the pretrained CLIP ViT-L/14 vision encoder with our Falcon2-11B chat-finetuned model and train with image-text data.
To enhance the VLM's perception of fine-grained details and small objects in images, we employ a dynamic encoding mechanism at high resolution for image inputs, similar to LLaVA-Next.
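As a rough illustration of this kind of dynamic high-resolution encoding (a simplified sketch, not the exact preprocessing used for Falcon2-11B VLM; the 336-pixel tile size is an assumption based on the CLIP ViT-L/14 encoder), the idea is to split a large image into a grid of tiles at the vision encoder's native resolution and to add a downscaled global view:

```python
from PIL import Image

def tile_image(image: Image.Image, tile_size: int = 336):
    """Simplified dynamic high-resolution encoding: split an image into tiles
    at the encoder's native resolution, plus one downscaled global view.
    (Illustrative only; a real LLaVA-Next-style pipeline also selects an
    optimal grid shape and pads the image.)"""
    width, height = image.size
    n_cols = max(1, round(width / tile_size))
    n_rows = max(1, round(height / tile_size))

    # Resize so the image divides evenly into the tile grid
    resized = image.resize((n_cols * tile_size, n_rows * tile_size))

    tiles = []
    for row in range(n_rows):
        for col in range(n_cols):
            box = (col * tile_size, row * tile_size,
                   (col + 1) * tile_size, (row + 1) * tile_size)
            tiles.append(resized.crop(box))

    # Global view at the encoder's base resolution
    global_view = image.resize((tile_size, tile_size))
    return [global_view] + tiles
```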
Training
The training is done in two stages: pretraining and finetuning. In both stages, the visual encoder weights are kept frozen. In the pretraining stage, the LLM is also kept frozen and only the multimodal projector is trained on 558K image-caption pairs, so that the projector learns a mapping from the visual embedding space to the text embedding space. During finetuning, both the projector and the LLM weights are trained on a corpus of 1.2M image-text instruction samples from public datasets, which also includes multi-round conversations.
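A minimal sketch of this freezing scheme in PyTorch terms (the module names `vision_encoder`, `projector`, and `language_model` are illustrative placeholders, not the actual attribute names of the released model):

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze all parameters of a module."""
    for param in module.parameters():
        param.requires_grad = trainable

def configure_stage(vision_encoder: nn.Module, projector: nn.Module,
                    language_model: nn.Module, stage: str) -> None:
    # The vision encoder stays frozen in both stages
    set_trainable(vision_encoder, False)
    if stage == "pretraining":
        # Stage 1: train only the multimodal projector on image-caption pairs
        set_trainable(projector, True)
        set_trainable(language_model, False)
    elif stage == "finetuning":
        # Stage 2: train the projector and the LLM on image-text instruction data
        set_trainable(projector, True)
        set_trainable(language_model, True)
    else:
        raise ValueError(f"unknown stage: {stage}")
```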
Falcon2-11B VLM Evaluation
Model | MME | GQA | SQA | POPE | VQAv2 | TextVQA | MM-Bench | SEED-IMG | Average |
---|---|---|---|---|---|---|---|---|---|
Falcon2-11B VLM | 1589/343 | 64.5 | 74.9 | 88.4 | 82.1 | 66.7 | 72.0 | 72.3 | 74.4 |
LLaVA-1.6 (Vicuna-7B) | 1519/332 | 64.2 | 70.1 | 86.5 | 81.8 | 64.9 | 67.4 | 70.2 | 72.1 |
LLaVA-1.6 (Vicuna-13B) | 1575/326 | 65.4 | 73.6 | 86.2 | 82.8 | 67.1 | 70.0 | 71.9 | 73.8 |
LLaVA-1.6 (Mistral-7B) | 1498/321 | 64.8 | 72.8 | 86.7 | 82.2 | 65.7 | 68.7 | 72.2 | 73.3 |
Using Falcon2-11B VLM
```python
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor
from PIL import Image
import requests
import torch

# Load the processor and the VLM checkpoint
processor = LlavaNextProcessor.from_pretrained("tiiuae/falcon-11B-vlm")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "tiiuae/falcon-11B-vlm", torch_dtype=torch.bfloat16
)

# Fetch an example image; the <image> placeholder marks where it enters the prompt
url = "https://merzougabirding.com/wp-content/uploads/2023/09/falcon-size.jpg"
falcon_image = Image.open(requests.get(url, stream=True).raw)
prompt = "User: <image>\nWhat's special about this bird's vision?"

inputs = processor(prompt, images=falcon_image, return_tensors="pt", padding=True).to('cuda:0')
model.to('cuda:0')

# Generate and decode the answer
output = model.generate(**inputs, max_new_tokens=256)
generated_captions = processor.decode(output[0], skip_special_tokens=True).strip()
print(generated_captions)
```
License information
The Falcon 2 models are made available under the TII Falcon 2 License, a permissive Apache 2.0-based software license that includes an acceptable use policy promoting the responsible use of AI. This license was crafted in the spirit of TII's commitment to the open-source community.