---
library_name: keras-hub
license: apache-2.0
language:
- en
tags:
- text-generation-inference
- keras
pipeline_tag: text-generation
---

### Model Overview

# Model Summary

Falcon-RW-1B is a 1B-parameter causal decoder-only model built by [TII](https://www.tii.ae/) and trained on 350B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). The architecture of the model is adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), but it uses ALiBi in place of positional embeddings.

## Use

### Direct Use

Research on large language models, specifically the influence of adequately filtered and deduplicated web data on the properties of large language models (fairness, safety, limitations, capabilities, etc.).

### Out-of-scope Use

Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.

## Bias, Risks, and Limitations

Falcon-RW-1B is trained on English data only and will not generalize appropriately to other languages. Furthermore, as it is trained on a large-scale corpus representative of the web, it will carry the stereotypes and biases commonly encountered online.

## Recommendations

We recommend that users of Falcon-RW-1B consider fine-tuning it for their specific set of tasks of interest, and that guardrails and appropriate precautions be taken for any production use.

## Training Details

### Training Data

Falcon-RW-1B was trained on 350B tokens of RefinedWeb, a high-quality filtered and deduplicated web dataset. The data was tokenized with the GPT-2 tokenizer.

### Training Procedure

Falcon-RW-1B was trained on 32 A100 40GB GPUs, using only data parallelism with ZeRO.

### Training Hyperparameters

Hyperparameters were adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)).

| Hyperparameter | Value    | Comment                                   |
|----------------|----------|-------------------------------------------|
| Precision      | bfloat16 |                                           |
| Optimizer      | AdamW    |                                           |
| Learning rate  | 2e-4     | 500M tokens warm-up, cosine decay to 2e-5 |
| Weight decay   | 1e-1     |                                           |
| Batch size     | 512      | 4B tokens ramp-up                         |

### Speeds, Sizes, Times

Training happened in early December 2022 and took about six days.

### Evaluation

See the [paper on arXiv](https://arxiv.org/abs/2306.01116) for in-depth evaluation.

## Technical Specifications

### Model Architecture and Objective

Falcon-RW-1B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token). The architecture is adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), but uses ALiBi ([Press et al., 2021](https://arxiv.org/abs/2108.12409)).

| **Hyperparameter** | **Value** |
|:------------------:|:---------:|
| Layers             | 24        |
| d_model            | 2048      |
| head_dim           | 64        |
| Vocabulary         | 50304     |
| Sequence length    | 2048      |

## Citation

```
@article{refinedweb,
  title={The {R}efined{W}eb dataset for {F}alcon {LLM}: outperforming curated corpora with web data, and web data only},
  author={Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Alessandro Cappelli and Hamza Alobeidli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay},
  journal={arXiv preprint arXiv:2306.01116},
  eprint={2306.01116},
  eprinttype={arXiv},
  url={https://arxiv.org/abs/2306.01116},
  year={2023}
}
```
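
## Example Usage

The snippet below is a minimal usage sketch for loading this checkpoint with KerasHub. The `FalconCausalLM` task class and the `falcon_refinedweb_1b_en` preset name are assumptions based on KerasHub's usual naming conventions; consult the KerasHub documentation for the exact identifiers available for this model.

```python
# Minimal sketch, not the definitive loading path: the task class and preset
# name below are assumptions to be checked against the KerasHub docs.
import keras_hub

falcon_lm = keras_hub.models.FalconCausalLM.from_preset("falcon_refinedweb_1b_en")

# `max_length` is the total token budget, prompt included.
output = falcon_lm.generate("The RefinedWeb dataset is", max_length=64)
print(output)
```

Since Falcon-RW-1B is a raw pretrained language model rather than an instruction-tuned one, prompts work best as text to be continued rather than as questions or instructions.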
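
## ALiBi Attention Bias (Illustration)

As noted above, Falcon-RW-1B departs from the GPT-3 architecture by using ALiBi ([Press et al., 2021](https://arxiv.org/abs/2108.12409)), which biases attention logits linearly with query-key distance instead of adding positional embeddings. The NumPy sketch below illustrates the construction described in that paper; it is an illustration of the technique, not code from the Falcon implementation. With d_model = 2048 and head_dim = 64, the model uses 32 attention heads.

```python
import numpy as np

def alibi_slopes(num_heads):
    # Head-specific slopes from the ALiBi paper: a geometric sequence starting
    # at 2**(-8 / num_heads). Exact for power-of-two head counts (32 here).
    start = 2.0 ** (-8.0 / num_heads)
    return np.array([start ** (i + 1) for i in range(num_heads)])

def alibi_bias(num_heads, seq_len):
    # Bias added to the attention logits: slope * -(query_pos - key_pos), a
    # penalty that grows linearly with how far back the attended token is.
    positions = np.arange(seq_len)
    relative = positions[None, :] - positions[:, None]   # key_pos - query_pos
    slopes = alibi_slopes(num_heads)                      # shape (num_heads,)
    return slopes[:, None, None] * relative[None, :, :]  # (num_heads, seq, seq)

# Falcon-RW-1B: 2048 / 64 = 32 heads; positions after the query are removed by
# the causal mask, so only the negative (past) part of the bias matters.
bias = alibi_bias(num_heads=32, seq_len=8)
print(bias.shape)  # (32, 8, 8)
```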