|
--- |
|
license: gemma |
|
library_name: transformers |
|
extra_gated_heading: Access RecurrentGemma on Hugging Face |
|
extra_gated_prompt: To access RecurrentGemma on Hugging Face, you’re required to review |
|
and agree to Google’s usage license. To do this, please ensure you’re logged-in |
|
to Hugging Face and click below. Requests are processed immediately. |
|
extra_gated_button_content: Acknowledge license |
|
--- |
|
|
|
# RecurrentGemma Model Card |
|
|
|
**Model Page**: [RecurrentGemma](https://ai.google.dev/gemma/docs/recurrentgemma/model_card)
|
|
|
This model card corresponds to the 9B instruction-tuned version of the RecurrentGemma model. You can also visit the model card of the [9B base model](https://huggingface.co/google/recurrentgemma-9b).
|
|
|
**Resources and technical documentation:** |
|
|
|
* [Responsible Generative AI Toolkit](https://ai.google.dev/responsible) |
|
* [RecurrentGemma on Kaggle](https://www.kaggle.com/models/google/recurrentgemma) |
|
|
|
**Terms of Use:** [Terms](https://www.kaggle.com/models/google/gemma/license/consent) |
|
|
|
**Authors:** Google |
|
|
|
## Model information |
|
|
|
|
|
### Usage
|
|
|
Below we share some code snippets on how to quickly get started with running the model.
|
|
|
First, make sure to `pip install transformers`, then copy the snippet from the section that is relevant for your use case.
|
|
|
### Running the model on a single / multi GPU |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("google/recurrentgemma-9b-it") |
|
model = AutoModelForCausalLM.from_pretrained("google/recurrentgemma-9b-it", device_map="auto") |
|
|
|
input_text = "Write me a poem about Machine Learning." |
|
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda") |
|
|
|
outputs = model.generate(**input_ids) |
|
print(tokenizer.decode(outputs[0])) |
|
``` |
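If you are memory-constrained, you can also load the weights in a lower precision. The following is a minimal sketch rather than an official recipe: it assumes your GPU supports `bfloat16` and simply passes `torch_dtype` to `from_pretrained`.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/recurrentgemma-9b-it")
# Loading in bfloat16 roughly halves the memory footprint compared to float32.
model = AutoModelForCausalLM.from_pretrained(
    "google/recurrentgemma-9b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(**input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```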
|
|
|
### Chat Template |
|
|
|
The instruction-tuned models use a chat template that must be adhered to for conversational use. |
|
The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet. |
|
|
|
Let's load the model and apply the chat template to a conversation. In this example, we'll start with a single user interaction: |
|
|
|
```py |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
import transformers |
|
import torch |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("google/recurrentgemma-9b-it") |
|
model = AutoModelForCausalLM.from_pretrained( |
|
"google/recurrentgemma-9b-it", |
|
device_map="auto" |
|
    torch_dtype=torch.bfloat16,
|
) |
|
chat = [ |
|
{ "role": "user", "content": "Write a hello world program" }, |
|
] |
|
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True) |
|
``` |
|
|
|
At this point, the prompt contains the following text: |
|
|
|
``` |
|
<bos><start_of_turn>user |
|
Write a hello world program<end_of_turn> |
|
<start_of_turn>model |
|
``` |
|
|
|
As you can see, each turn is preceded by a `<start_of_turn>` delimiter and then the role of the entity |
|
(either `user`, for content supplied by the user, or `model` for LLM responses). Turns finish with |
|
the `<end_of_turn>` token. |
|
|
|
You can follow this format to build the prompt manually, if you need to do it without the tokenizer's |
|
chat template. |
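For illustration, here is a minimal sketch of assembling the same single-turn prompt by hand. The literal token strings mirror the template output shown above; the `user_message` variable is just a placeholder for your own content.

```py
user_message = "Write a hello world program"

# Mirror the chat template: <bos>, then a user turn, then the opening of the model turn.
prompt = (
    "<bos><start_of_turn>user\n"
    f"{user_message}<end_of_turn>\n"
    "<start_of_turn>model\n"
)
```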
|
|
|
After the prompt is ready, generation can be performed like this: |
|
|
|
```py |
|
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt") |
|
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150) |
|
print(tokenizer.decode(outputs[0])) |
|
``` |
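To keep the conversation going, one option is to append the decoded reply and the next user message to the `chat` list and apply the chat template again. The sketch below is illustrative rather than official: it assumes the `inputs` and `outputs` tensors from the previous snippet are still in scope, and that the template accepts the standard `assistant` role for model turns (rendering it as the `model` role in the prompt).

```py
# Decode only the newly generated tokens, dropping the prompt and special tokens.
reply = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

# Extend the conversation and rebuild the prompt for the next generation step.
chat.append({"role": "assistant", "content": reply})
chat.append({"role": "user", "content": "Now explain how it works."})
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
```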
|
|
|
### Model summary |
|
|
|
#### Description |
|
|
|
RecurrentGemma is a family of open language models built on a [novel recurrent |
|
architecture](https://arxiv.org/abs/2402.19427) developed at Google. Both |
|
pre-trained and instruction-tuned versions are available in English. |
|
|
|
Like Gemma, RecurrentGemma models are well-suited for a variety of text |
|
generation tasks, including question answering, summarization, and reasoning. |
|
Because of its novel architecture, RecurrentGemma requires less memory than |
|
Gemma and achieves faster inference when generating long sequences. |
|
|
|
#### Inputs and outputs |
|
|
|
* **Input:** Text string (e.g., a question, a prompt, or a document to be |
|
summarized). |
|
* **Output:** Generated English-language text in response to the input (e.g., |
|
an answer to the question, a summary of the document). |
|
|
|
#### Citation |
|
|
|
```none |
|
@article{recurrentgemma_2024, |
|
title={RecurrentGemma}, |
|
url={}, |
|
DOI={}, |
|
publisher={Kaggle}, |
|
author={Griffin Team, Soham De, Samuel L Smith, Anushan Fernando, Alex Botev, George-Christian Muraru, Ruba Haroun, Leonard Berrada et al.}, |
|
year={2024} |
|
} |
|
``` |
|
|
|
### Model data |
|
|
|
#### Training dataset and data processing |
|
|
|
RecurrentGemma uses the same training data and data processing as used by the |
|
Gemma model family. A full description can be found on the [Gemma model |
|
card](https://ai.google.dev/gemma/docs/model_card#model_data). |
|
|
|
## Implementation information |
|
|
|
### Hardware and frameworks used during training |
|
|
|
Like |
|
[Gemma](https://ai.google.dev/gemma/docs/model_card#implementation_information), |
|
RecurrentGemma was trained on |
|
[TPUv5e](https://cloud.google.com/tpu/docs/intro-to-tpu),
|
using [JAX](https://github.com/google/jax) and [ML |
|
Pathways](https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/). |
|
|
|
## Evaluation information |
|
|
|
### Benchmark results |
|
|
|
#### Evaluation approach |
|
|
|
These models were evaluated against a large collection of different datasets and |
|
metrics to cover different aspects of text generation.
|
|
|
#### Evaluation results |
|
|
|
Benchmark | Metric | RecurrentGemma 9B |
|
------------------- | ------------- | ----------------- |
|
[MMLU] | 5-shot, top-1 | 60.5 |
|
[HellaSwag] | 0-shot | 80.4 |
|
[PIQA] | 0-shot | 81.3 |
|
[SocialIQA] | 0-shot | 52.3 |
|
[BoolQ] | 0-shot | 80.3 |
|
[WinoGrande] | partial score | 73.6 |
|
[CommonsenseQA] | 7-shot | 73.2 |
|
[OpenBookQA] | | 51.8 |
|
[ARC-e][ARC-c] | | 78.8 |
|
[ARC-c] | | 52.0 |
|
[TriviaQA] | 5-shot | 70.5 |
|
[Natural Questions] | 5-shot | 21.7 |
|
[HumanEval] | pass@1 | 31.1 |
|
[MBPP] | 3-shot | 42.0 |
|
[GSM8K] | maj@1 | 42.6 |
|
[MATH] | 4-shot | 23.8 |
|
[AGIEval] | | 39.3 |
|
[BIG-Bench] | | 55.2 |
|
**Average** | | 56.1 |
|
|
|
### Inference speed results |
|
|
|
RecurrentGemma provides improved sampling speeds, particularly for long sequences or large batch sizes. We compared the sampling speeds of RecurrentGemma-9B to Gemma-7B. Note that Gemma-7B uses Multi-Head Attention, and the speed improvements would be smaller when comparing against a transformer using Multi-Query Attention. |
|
|
|
#### Throughput |
|
|
|
We evaluated the throughput of RecurrentGemma-9B compared to Gemma-7B, i.e., the maximum number of tokens produced per second as the batch size is increased, using a prefill of 2K tokens.
|
|
|
<img src="max_throughput.png" width="400" alt="Maximum Throughput comparison of RecurrentGemma-9B and Gemma-7B"> |
|
|
|
#### Latency |
|
|
|
We also compared end-to-end speedups achieved by RecurrentGemma-9B over Gemma-7B when sampling a long sequence after a prefill of 4K tokens and using a batch size of 1. |
|
|
|
\# Tokens Sampled | Gemma-7B (sec) | RecurrentGemma-9B (sec) | Improvement (%) |
|
----------------- | -------------- | ----------------------- | --------------- |
|
128 | 3.1 | 2.8 | 9.2% |
|
256 | 5.9 | 5.4 | 9.7% |
|
512 | 11.6 | 10.5 | 10.7% |
|
1024 | 23.5 | 20.6 | 14.2% |
|
2048 | 48.2 | 40.9 | 17.7% |
|
4096 | 101.9 | 81.5 | 25.0% |
|
8192 | OOM | 162.8 | - |
|
16384 | OOM | 325.2 | - |
|
|
|
## Ethics and safety |
|
|
|
### Ethics and safety evaluations |
|
|
|
#### Evaluations approach |
|
|
|
Our evaluation methods include structured evaluations and internal red-teaming |
|
testing of relevant content policies. Red-teaming was conducted by a number of |
|
different teams, each with different goals and human evaluation metrics. These |
|
models were evaluated against a number of different categories relevant to |
|
ethics and safety, including: |
|
|
|
* **Text-to-text content safety:** Human evaluation on prompts covering safety |
|
policies including child sexual abuse and exploitation, harassment, violence |
|
and gore, and hate speech. |
|
* **Text-to-text representational harms:** Benchmark against relevant academic |
|
datasets such as WinoBias and BBQ Dataset. |
|
* **Memorization:** Automated evaluation of memorization of training data, |
|
including the risk of personally identifiable information exposure. |
|
* **Large-scale harm:** Tests for “dangerous capabilities,” such as chemical, |
|
biological, radiological, and nuclear (CBRN) risks; as well as tests for |
|
persuasion and deception, cybersecurity, and autonomous replication. |
|
|
|
#### Evaluation results |
|
|
|
The results of ethics and safety evaluations are within acceptable thresholds |
|
for meeting [internal |
|
policies](https://storage.googleapis.com/gweb-uniblog-publish-prod/documents/2023_Google_AI_Principles_Progress_Update.pdf#page=11) |
|
for categories such as child safety, content safety, representational harms, |
|
memorization, and large-scale harms. In addition to robust internal evaluations, the

results of well-known safety benchmarks like BBQ, Winogender, WinoBias,
|
RealToxicity, and TruthfulQA are shown here. |
|
|
|
Benchmark | Metric | RecurrentGemma 9B | RecurrentGemma 9B IT |
|
------------------------ | ------ | ----------------- | -------------------- |
|
[RealToxicity] | avg | 10.3 | 8.8 |
|
[BOLD] | | 39.8 | 47.9 |
|
[CrowS-Pairs] | top-1 | 38.7 | 39.5 |
|
[BBQ Ambig][BBQ] | top-1 | 95.9 | 67.1 |
|
[BBQ Disambig][BBQ] | top-1 | 78.6 | 78.9 |
|
[Winogender] | top-1 | 59.0 | 64.0 |
|
[TruthfulQA] | | 38.6 | 47.7 |
|
[Winobias 1_2][Winobias] | | 61.5 | 60.6 |
|
[Winobias 2_2][Winobias] | | 90.2 | 90.3 |
|
[Toxigen] | | 58.8 | 64.5 |
|
|
|
## Model usage and limitations |
|
|
|
### Known limitations |
|
|
|
These models have certain limitations that users should be aware of: |
|
|
|
* **Training data** |
|
* The quality and diversity of the training data significantly influence |
|
the model's capabilities. Biases or gaps in the training data can lead |
|
to limitations in the model's responses. |
|
* The scope of the training dataset determines the subject areas the model |
|
can handle effectively. |
|
* **Context and task complexity** |
|
* LLMs are better at tasks that can be framed with clear prompts and |
|
instructions. Open-ended or highly complex tasks might be challenging. |
|
* A model's performance can be influenced by the amount of context |
|
provided (longer context generally leads to better outputs, up to a |
|
certain point). |
|
* **Language ambiguity and nuance** |
|
* Natural language is inherently complex. LLMs might struggle to grasp |
|
subtle nuances, sarcasm, or figurative language. |
|
* **Factual accuracy** |
|
* LLMs generate responses based on information they learned from their |
|
training datasets, but they are not knowledge bases. They may generate |
|
incorrect or outdated factual statements. |
|
* **Common sense** |
|
* LLMs rely on statistical patterns in language. They might lack the |
|
ability to apply common sense reasoning in certain situations. |
|
|
|
### Ethical considerations and risks |
|
|
|
The development of large language models (LLMs) raises several ethical concerns. |
|
In creating an open model, we have carefully considered the following: |
|
|
|
* **Bias and fairness** |
|
* LLMs trained on large-scale, real-world text data can reflect |
|
socio-cultural biases embedded in the training material. These models |
|
    underwent careful scrutiny, with input data pre-processing described and
|
posterior evaluations reported in this card. |
|
* **Misinformation and misuse** |
|
* LLMs can be misused to generate text that is false, misleading, or |
|
harmful. |
|
* Guidelines are provided for responsible use with the model, see the |
|
[Responsible Generative AI |
|
Toolkit](https://ai.google.dev/gemma/responsible). |
|
* **Transparency and accountability** |
|
* This model card summarizes details on the models' architecture, |
|
capabilities, limitations, and evaluation processes. |
|
* A responsibly developed open model offers the opportunity to share |
|
innovation by making LLM technology accessible to developers and |
|
researchers across the AI ecosystem. |
|
|
|
Risks Identified and Mitigations: |
|
|
|
* **Perpetuation of biases:** Continuous monitoring (using evaluation metrics

  and human review) and the exploration of de-biasing techniques are encouraged

  during model training, fine-tuning, and other use cases.
|
* **Generation of harmful content:** Mechanisms and guidelines for content |
|
safety are essential. Developers are encouraged to exercise caution and |
|
implement appropriate content safety safeguards based on their specific |
|
product policies and application use cases. |
|
* **Misuse for malicious purposes:** Technical limitations and developer and |
|
  end-user education can help mitigate malicious applications of LLMs.
|
Educational resources and reporting mechanisms for users to flag misuse are |
|
provided. Prohibited uses of Gemma models are outlined in our [terms of |
|
use](https://www.kaggle.com/models/google/gemma/license/consent). |
|
* **Privacy violations:** Models were trained on data filtered for removal of |
|
PII (Personally Identifiable Information). Developers are encouraged to |
|
adhere to privacy regulations with privacy-preserving techniques. |
|
|
|
## Intended usage |
|
|
|
### Application |
|
|
|
Open Large Language Models (LLMs) have a wide range of applications across |
|
various industries and domains. The following list of potential uses is not |
|
comprehensive. The purpose of this list is to provide contextual information |
|
about the possible use-cases that the model creators considered as part of model |
|
training and development. |
|
|
|
* **Content creation and communication** |
|
* **Text generation:** These models can be used to generate creative text |
|
formats like poems, scripts, code, marketing copy, email drafts, etc. |
|
* **Chatbots and conversational AI:** Power conversational interfaces for |
|
customer service, virtual assistants, or interactive applications. |
|
* **Text summarization:** Generate concise summaries of a text corpus, |
|
research papers, or reports. |
|
* **Research and education** |
|
* **Natural Language Processing (NLP) research:** These models can serve |
|
as a foundation for researchers to experiment with NLP techniques, |
|
develop algorithms, and contribute to the advancement of the field. |
|
* **Language Learning Tools:** Support interactive language learning |
|
experiences, aiding in grammar correction or providing writing practice. |
|
* **Knowledge Exploration:** Assist researchers in exploring large bodies |
|
of text by generating summaries or answering questions about specific |
|
topics. |
|
|
|
### Benefits |
|
|
|
At the time of release, this family of models provides high-performance open

large language model implementations that, compared to similarly sized models,

are designed from the ground up for Responsible AI development.
|
|
|
Using the benchmark evaluation metrics described in this document, these models |
|
have been shown to provide superior performance to other, comparably sized open model
|
alternatives. |
|
|
|
In particular, RecurrentGemma models achieve comparable performance to Gemma |
|
models but are faster during inference and require less memory, especially on |
|
long sequences. |
|
|
|
[MMLU]: https://arxiv.org/abs/2009.03300 |
|
[HellaSwag]: https://arxiv.org/abs/1905.07830 |
|
[PIQA]: https://arxiv.org/abs/1911.11641 |
|
[SocialIQA]: https://arxiv.org/abs/1904.09728 |
|
[BoolQ]: https://arxiv.org/abs/1905.10044 |
|
[winogrande]: https://arxiv.org/abs/1907.10641 |
|
[CommonsenseQA]: https://arxiv.org/abs/1811.00937 |
|
[OpenBookQA]: https://arxiv.org/abs/1809.02789 |
|
[ARC-c]: https://arxiv.org/abs/1911.01547 |
|
[TriviaQA]: https://arxiv.org/abs/1705.03551 |
|
[Natural Questions]: https://github.com/google-research-datasets/natural-questions |
|
[HumanEval]: https://arxiv.org/abs/2107.03374 |
|
[MBPP]: https://arxiv.org/abs/2108.07732 |
|
[GSM8K]: https://arxiv.org/abs/2110.14168 |
|
[MATH]: https://arxiv.org/abs/2103.03874 |
|
[AGIEval]: https://arxiv.org/abs/2304.06364 |
|
[BIG-Bench]: https://arxiv.org/abs/2206.04615 |
|
[RealToxicity]: https://arxiv.org/abs/2009.11462 |
|
[BOLD]: https://arxiv.org/abs/2101.11718 |
|
[CrowS-Pairs]: https://aclanthology.org/2020.emnlp-main.154/ |
|
[BBQ]: https://arxiv.org/abs/2110.08193v2 |
|
[Winogender]: https://arxiv.org/abs/1804.09301 |
|
[TruthfulQA]: https://arxiv.org/abs/2109.07958 |
|
[winobias]: https://arxiv.org/abs/1804.06876 |
|
[Toxigen]: https://arxiv.org/abs/2203.09509 |
|
|