Commit 0552cd3 by syzymon
1 parent: f36dfc1

Update README.md

  1. README.md +193 -0
README.md CHANGED
@@ -1,3 +1,196 @@
  ---
  license: llama2
  ---
+
+
+
+ # LongLLaMA: Focused Transformer Training for Context Scaling
+
+
+ <div align="center">
+
+ <table>
+
+ <tr>
+ <td align="center">
+ <span style="font-size:300%">{</span>
+ </td>
+ <td align="center">
+ <span style="font-size:115%">
+ <b>
+ <a href="https://huggingface.co/syzymon/long_llama_code_7b" style="margin-bottom:30px">LongLLaMA Code-7B</a>
+ </b>
+ </span>
+ </td>
+ <td align="center">
+ <span style="font-size:300%">}</span>
+ </td>
+
+ </tr>
+ </table>
+
+
+ </div>
+
+
+ <div align="center">
+
+ [TLDR](#TLDR) | [Overview](#Overview) | [Usage](#Usage) | [LongLLaMA performance](#LongLLaMA-performance) | [Authors](#Authors) | [Citation](#Citation) | [License](#License) | [Acknowledgments](#Acknowledgments)
+
+ [FoT continued pretraining](fot_continued_pretraining) | [Instruction tuning](instruction_fine_tuning)
+
+ </div>
+
+ ## TLDR
+ This repository contains the research preview of **LongLLaMA, a large language model capable of handling long contexts of 256k tokens or even more**.
+
+ LongLLaMA is built upon the foundation of [OpenLLaMA](https://github.com/openlm-research/open_llama) and fine-tuned using the [Focused Transformer (FoT)](https://arxiv.org/abs/2307.03170) method.
+ LongLLaMA Code is built upon the foundation of [Code Llama](https://huggingface.co/codellama/CodeLlama-7b-hf).
+
+
+ ## Overview
+
+ ### Base models
+ [Focused Transformer: Contrastive Training for Context Scaling](https://arxiv.org/abs/2307.03170) (FoT) presents a simple method for endowing language models with the ability to handle contexts of possibly millions of tokens while training on significantly shorter inputs. FoT permits a subset of attention layers to access a memory cache of (key, value) pairs to extend the context length. The distinctive aspect of FoT is its training procedure, which draws on contrastive learning. Specifically, we deliberately expose the memory attention layers to both relevant and irrelevant keys (such as negative samples from unrelated documents). This strategy incentivizes the model to differentiate keys connected with semantically diverse values, thereby enhancing their structure. This, in turn, makes it possible to extrapolate the effective context length far beyond what is seen in training.
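
As a rough illustration of the memory mechanism described above, the sketch below lets a single query attend over its local keys together with (key, value) pairs taken from a memory cache. It is a toy, pure-Python example and not the LongLLaMA implementation: the helper names `softmax` and `attend` and the tiny 2-dimensional vectors are assumptions made only for this illustration.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, local_kv, memory_kv):
    """Single-query dot-product attention over memory and local (key, value) pairs."""
    pairs = memory_kv + local_kv
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key, _ in pairs]
    weights = softmax(scores)
    output = [0.0] * len(pairs[0][1])
    for weight, (_, value) in zip(weights, pairs):
        for i, v in enumerate(value):
            output[i] += weight * v
    return output, weights


# Hypothetical toy data: one relevant memory key and one irrelevant ("negative") key.
query = [4.0, 0.0]
local_kv = [([0.0, 1.0], [0.0, 0.0])]
memory_kv = [
    ([4.0, 0.0], [1.0, 0.0]),   # relevant key: aligned with the query
    ([-4.0, 0.0], [0.0, 1.0]),  # irrelevant key: points away from the query
]
output, weights = attend(query, local_kv, memory_kv)
```

In this toy setup almost all attention mass lands on the relevant memory key, which is the kind of separation between relevant and irrelevant keys that the contrastive training is meant to encourage.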
+
+
+ **LongLLaMA** is an [OpenLLaMA](https://github.com/openlm-research/open_llama) model finetuned with the FoT method,
+ with three layers used for context extension. **Crucially, LongLLaMA is able to extrapolate far beyond the context length seen in training: $8k$. For example, in the passkey retrieval task, it can handle inputs of length $256k$**.
+ **LongLLaMA Code** is a [Code Llama](https://huggingface.co/codellama/CodeLlama-7b-hf) model finetuned with the FoT method.
+
+
+ <div align="center">
+
+ | | [LongLLaMA-3B](https://huggingface.co/syzymon/long_llama_3b) | [LongLLaMA-3Bv1.1](https://huggingface.co/syzymon/long_llama_3b_v1_1) | [LongLLaMA Code-7B](https://huggingface.co/syzymon/long_llama_code_7b) |
+ |----------------|----------|----------|-----------|
+ | Source model | [OpenLLaMA-3B](https://huggingface.co/openlm-research/open_llama_3b_easylm) | [OpenLLaMA-3Bv2](https://huggingface.co/openlm-research/open_llama_3b_v2_easylm) | [CodeLLaMA-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf) |
+ | Source model tokens | 1T | 1T | 2T + 0.5T |
+ | Fine-tuning tokens | 10B | 5B | 35B |
+ | Memory layers | 6, 12, 18 | 6, 12, 18 | 8, 16, 24 |
+
+ </div>
+
+
+ ## Usage
+
+ See also:
+ * [Colab with LongLLaMA-Instruct-3Bv1.1](https://colab.research.google.com/github/CStanKonrad/long_llama/blob/main/long_llama_instruct_colab.ipynb).
+ * [Colab with an example usage of base LongLLaMA](https://colab.research.google.com/github/CStanKonrad/long_llama/blob/main/long_llama_colab.ipynb).
+
+ ### Requirements
+ ```
+ pip install --upgrade pip
+ pip install transformers==4.30 sentencepiece accelerate
+ ```
+
+ ### Loading model
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ tokenizer = AutoTokenizer.from_pretrained("syzymon/long_llama_code_7b")
+ model = AutoModelForCausalLM.from_pretrained(
+     "syzymon/long_llama_code_7b",
+     torch_dtype=torch.float32,
+     trust_remote_code=True,  # the checkpoint ships custom modeling code
+ )
+ ```
+
+ ### Input handling and generation
+ LongLLaMA uses the Hugging Face interface. A long input given to the model is
+ split into context windows and loaded into the memory cache.
+ ```python
+ prompt = "My name is Julien and I like to"
+ input_ids = tokenizer(prompt, return_tensors="pt").input_ids
+ outputs = model(input_ids=input_ids)
+ ```
+ During the model call, one can provide the parameter `last_context_length`, which specifies the number of tokens left in the last context window. Tuning this parameter can improve generation, as the first layers do not have access to memory. See details in [How LongLLaMA handles long inputs](#How-LongLLaMA-handles-long-inputs).
+
+ ```python
+ generation_output = model.generate(
+     input_ids=input_ids,
+     max_new_tokens=1024,
+     num_beams=1,
+     last_context_length=3072,
+     do_sample=True,
+     temperature=1.0,
+ )
+ print(tokenizer.decode(generation_output[0]))
+ ```
+
+ ### Additional configuration
+ LongLLaMA has several other parameters:
+ * `mem_layers` specifies the layers endowed with memory (should be either an empty list or a list of all memory layers specified in the description of the checkpoint).
+ * `mem_dtype` allows changing the data type of the memory cache.
+ * `mem_attention_grouping` can trade off speed for reduced memory usage.
+   When equal to `(4, 2048)`, the memory layers will process at most $4 \cdot 2048$ queries at once ($4$ heads and $2048$ queries for each head).
+
+ ```python
+ import torch
+ from transformers import LlamaTokenizer, AutoModelForCausalLM
+
+ tokenizer = LlamaTokenizer.from_pretrained("syzymon/long_llama_3b_v1_1")
+ model = AutoModelForCausalLM.from_pretrained(
+     "syzymon/long_llama_3b_v1_1",
+     torch_dtype=torch.float32,
+     mem_layers=[],
+     mem_dtype="bfloat16",
+     trust_remote_code=True,
+     mem_attention_grouping=(4, 2048),
+ )
+ ```
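
To make the `mem_attention_grouping` trade-off more concrete, the following sketch counts how many chunks a memory layer would need if it processes at most a given number of heads, and a given number of queries per head, at a time. The helper `num_attention_chunks` and the example sizes (32 heads, 8192 queries) are assumptions for illustration only; the actual chunking happens inside the model code and may differ in detail.

```python
import math


def num_attention_chunks(num_heads, num_queries, grouping):
    """Count chunks when at most grouping[0] heads and grouping[1] queries per head run at once."""
    heads_per_group, queries_per_group = grouping
    head_groups = math.ceil(num_heads / heads_per_group)
    query_groups = math.ceil(num_queries / queries_per_group)
    return head_groups * query_groups


# Hypothetical sizes: 32 heads and 8192 queries with the grouping (4, 2048)
# give 8 head groups times 4 query groups.
chunks = num_attention_chunks(32, 8192, (4, 2048))
print(chunks)
```

Smaller groups mean more chunks to process one after another, which is slower, but each chunk needs less memory at its peak.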
+
+
+ ### Drop-in use with LLaMA code
+ LongLLaMA checkpoints can also be used as a drop-in replacement for LLaMA checkpoints in the [Hugging Face implementation of LLaMA](https://huggingface.co/docs/transformers/main/model_doc/llama), but in this case they will be limited to the original context length.
+
+ ```python
+ from transformers import LlamaTokenizer, LlamaForCausalLM
+ import torch
+
+ tokenizer = LlamaTokenizer.from_pretrained("syzymon/long_llama_3b_v1_1")
+ model = LlamaForCausalLM.from_pretrained("syzymon/long_llama_3b_v1_1", torch_dtype=torch.float32)
+ ```
+
+
+ ### How LongLLaMA handles long inputs
+ Inputs over $ctx=2048$ ($ctx=4096$ for LongLLaMA Code) tokens are automatically split into windows $w_1, \ldots, w_m$. The first $m-2$ windows contain $ctx$ tokens each, $w_{m-1}$ has no more than $2048$ tokens, and $w_m$ contains the number of tokens specified by `last_context_length`. The model processes the windows one by one, extending the memory cache after each. If `use_cache` is `True`, then the last window will not be loaded to the memory cache but to the local (generation) cache.
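
The splitting rule in the paragraph above can be sketched as a small function that returns the window lengths. This is a minimal sketch under stated assumptions: the name `split_into_windows` is hypothetical, inputs that fit in one window are assumed to stay unsplit, and `last_context_length` is assumed not to exceed `ctx`. The real model code may treat edge cases differently.

```python
def split_into_windows(num_tokens, ctx, last_context_length):
    """Return window lengths [w_1, ..., w_m] for an input of num_tokens tokens."""
    if num_tokens <= ctx:
        # Assumption: inputs that fit in a single window are not split.
        return [num_tokens]
    last = min(last_context_length, num_tokens)
    remaining = num_tokens - last
    full, partial = divmod(remaining, ctx)
    windows = [ctx] * full       # windows of exactly ctx tokens
    if partial:
        windows.append(partial)  # w_{m-1}: whatever is left, fewer than ctx tokens
    windows.append(last)         # w_m: last_context_length tokens
    return windows


# Hypothetical example: 10000 tokens, ctx=2048, last_context_length=1024.
windows = split_into_windows(10000, ctx=2048, last_context_length=1024)
print(windows)
```

For this example the result is four full windows of 2048 tokens, one shorter window of 784 tokens, and a final window of 1024 tokens.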
+
+ The memory cache stores $(key, value)$ pairs for each head of the specified memory layers `mem_layers`. In addition, it stores attention masks.
+
+ If `use_cache=True` (which is the case in generation), LongLLaMA will use two caches: the memory cache for the specified layers and the local (generation) cache for all layers. When the local cache exceeds $2048$ elements, its content is moved to the memory cache for the memory layers.
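
The interplay of the two caches during generation can be pictured with a toy model of a single memory layer. This is only a sketch of the behaviour described above: the class `ToyCaches` and its attributes are hypothetical, cache entries are plain integers rather than (key, value) tensors, and the limit is shrunk from 2048 to 4 so the effect is easy to follow.

```python
class ToyCaches:
    """Toy model of the local (generation) cache and the memory cache of one memory layer."""

    def __init__(self, local_limit=2048):
        self.local_limit = local_limit
        self.local = []   # entries for the most recent tokens
        self.memory = []  # entries moved out of the local cache

    def append(self, entry):
        self.local.append(entry)
        if len(self.local) > self.local_limit:
            # The local cache exceeded its limit: move its content to the memory cache.
            self.memory.extend(self.local)
            self.local = []


# Hypothetical run with a tiny limit of 4 entries and 12 generated tokens.
caches = ToyCaches(local_limit=4)
for token_id in range(12):
    caches.append(token_id)
print(caches.memory, caches.local)
```

After twelve steps the first ten entries sit in the memory cache and the two most recent ones remain in the local cache.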
+
+ For simplicity, context extension is realized with a memory cache and full attention in this repo. Replacing this simple mechanism with a KNN search over an external database is possible with systems like [Faiss](https://github.com/facebookresearch/faiss). This could potentially enable further context length scaling. We leave this as future work.
+
+
+ ## Authors
+ - [Szymon Tworkowski](https://scholar.google.com/citations?user=1V8AeXYAAAAJ&hl=en)
+ - [Konrad Staniszewski](https://scholar.google.com/citations?user=CM6PCBYAAAAJ)
+ - [Mikołaj Pacek](https://scholar.google.com/citations?user=eh6iEbQAAAAJ&hl=en&oi=ao)
+ - [Henryk Michalewski](https://scholar.google.com/citations?user=YdHW1ycAAAAJ&hl=en)
+ - [Yuhuai Wu](https://scholar.google.com/citations?user=bOQGfFIAAAAJ&hl=en)
+ - [Piotr Miłoś](https://scholar.google.pl/citations?user=Se68XecAAAAJ&hl=pl&oi=ao)
+
+
+ ## Citation
+ To cite this work, please use:
+ ```bibtex
+ @misc{tworkowski2023focused,
+     title={Focused Transformer: Contrastive Training for Context Scaling},
+     author={Szymon Tworkowski and Konrad Staniszewski and Mikołaj Pacek and Yuhuai Wu and Henryk Michalewski and Piotr Miłoś},
+     year={2023},
+     eprint={2307.03170},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+
+ ## License
+ For LongLLaMA Code, see the [codellama/CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/LICENSE) license.
+ Some of the examples use external code (see headers of files for copyright notices and licenses).
+
+ ## Acknowledgments
+ We gratefully acknowledge the TPU Research Cloud program, which was instrumental to our research by providing significant computational resources. We are also grateful to Xinyang Geng and Hao Liu for releasing [OpenLLaMA](https://github.com/openlm-research/open_llama) checkpoints and the [EasyLM](https://github.com/young-geng/EasyLM) library.
+
+ We would like to thank [Xiaosong He](https://github.com/hxs91) for suggestions on how to improve the explanations of cross-batch code.