Text Generation
Transformers
PyTorch
Safetensors
longllama
text-generation-inference
custom_code
Szymon Tworkowski committed on
Commit
fffd985
1 Parent(s): 2137eb1

add arxiv links

Files changed (1)
  1. README.md +12 -5
README.md CHANGED
@@ -10,16 +10,16 @@ tags:
 [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CStanKonrad/long_llama/blob/main/long_llama_colab.ipynb)
 
 
-[TLDR](#TLDR) | [Overview](#Overview) | [Usage](#Usage) | [LongLLaMA performance](#LongLLaMA-performance) | [Authors](#Authors) | [Citation](#Citation) | [License](License) | [Acknowledgments](#Acknowledgments)
+[TLDR](#tldr) | [Overview](#overview) | [Usage](#usage) | [LongLLaMA performance](#longllama-performance) | [Authors](#authors) | [Citation](#citation) | [License](#license) | [Acknowledgments](#acknowledgments)
 
 ## TLDR
 This repository contains the research preview of **LongLLaMA, a large language model capable of handling long contexts of 256k tokens or even more**.
 
-LongLLaMA is built upon the foundation of [OpenLLaMA](https://github.com/openlm-research/open_llama) and fine-tuned using the Focused Transformer (FoT) method. We release a smaller 3B variant of the LongLLaMA model on a permissive license (Apache 2.0) and inference code supporting longer contexts on [Hugging Face](https://huggingface.co/syzymon/long_llama_3b). Our model weights can serve as the drop-in replacement of LLaMA in existing implementations (for short context up to 2048 tokens). Additionally, we provide evaluation results and comparisons against the original OpenLLaMA models. Stay tuned for further updates.
+LongLLaMA is built upon the foundation of [OpenLLaMA](https://github.com/openlm-research/open_llama) and fine-tuned using the [Focused Transformer (FoT)](https://arxiv.org/abs/2307.03170) method. We release a smaller 3B variant of the LongLLaMA model under a permissive license (Apache 2.0), together with inference code supporting longer contexts, on [Hugging Face](https://huggingface.co/syzymon/long_llama_3b). Our model weights can serve as a drop-in replacement for LLaMA in existing implementations (for short contexts of up to 2048 tokens). Additionally, we provide evaluation results and comparisons against the original OpenLLaMA models. Stay tuned for further updates.
 
 
 ## Overview
-[Focused Transformer: Contrastive Training for Context Scaling](TODO) (FoT) presents a simple method for endowing language models with the ability to handle context consisting possibly of millions of tokens while training on significantly shorter input. FoT permits a subset of attention layers to access a memory cache of (key, value) pairs to extend the context length. The distinctive aspect of FoT is its training procedure, drawing from contrastive learning. Specifically, we deliberately expose the memory attention layers to both relevant and irrelevant keys (like negative samples from unrelated documents). This strategy incentivizes the model to differentiate keys connected with semantically diverse values, thereby enhancing their structure. This, in turn, makes it possible to extrapolate the effective context length much beyond what is seen in training.
+[Focused Transformer: Contrastive Training for Context Scaling](https://arxiv.org/abs/2307.03170) (FoT) presents a simple method for endowing language models with the ability to handle contexts of possibly millions of tokens while training on significantly shorter inputs. FoT permits a subset of attention layers to access a memory cache of (key, value) pairs to extend the context length. The distinctive aspect of FoT is its training procedure, which draws from contrastive learning: we deliberately expose the memory attention layers to both relevant and irrelevant keys (such as negative samples from unrelated documents). This strategy incentivizes the model to differentiate keys connected with semantically diverse values, thereby enhancing their structure. This, in turn, makes it possible to extrapolate the effective context length far beyond what is seen in training.
 
 
 **LongLLaMA** is an [OpenLLaMA](https://github.com/openlm-research/open_llama) model finetuned with the FoT method,
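In practice, the drop-in usage described in the TLDR comes down to loading the released checkpoint through `transformers` with the repository's custom modelling code enabled (hence the `custom_code` tag). A minimal sketch, assuming the `syzymon/long_llama_3b` checkpoint linked above; the exact arguments shown in the model card's Usage section may differ:

```python
import torch
from transformers import AutoModelForCausalLM, LlamaTokenizer

# Load the 3B LongLLaMA checkpoint; trust_remote_code pulls in the
# FoT-aware modelling code shipped with the repository.
tokenizer = LlamaTokenizer.from_pretrained("syzymon/long_llama_3b")
model = AutoModelForCausalLM.from_pretrained(
    "syzymon/long_llama_3b",
    torch_dtype=torch.float32,
    trust_remote_code=True,
)

# For short contexts (up to 2048 tokens) it behaves like a regular LLaMA model.
input_ids = tokenizer("My favourite animal is", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```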
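Conceptually, the memory layers described in the Overview are ordinary attention layers whose keys and values are the local context extended with a cache of (key, value) pairs; during FoT training that cache also contains keys from unrelated documents (the negatives). A toy single-head sketch of this idea, with purely illustrative shapes and tensors rather than the actual LongLLaMA implementation:

```python
import torch

d = 64
q = torch.randn(1, 8, d)        # queries for the current 8-token window
local_k = torch.randn(1, 8, d)  # keys/values from the local context
local_v = torch.randn(1, 8, d)
mem_k = torch.randn(1, 256, d)  # cached keys/values from earlier context
mem_v = torch.randn(1, 256, d)  # (mixed with unrelated "negatives" in training)

# Extend the effective context: attend over memory and local pairs jointly.
k = torch.cat([mem_k, local_k], dim=1)
v = torch.cat([mem_v, local_v], dim=1)

attn = torch.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
out = attn @ v  # shape (1, 8, d), same as attention over the local window only
```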
@@ -124,7 +124,7 @@ For simplicity, context extension is realized with a memory cache and full atten
 
 
 ## LongLLaMA performance
-We present some illustrative examples of LongLLaMA results and refer to our paper [Focused Transformer: Contrastive Training for Context Scaling](TODO) for more details.
+We present some illustrative examples of LongLLaMA results and refer to our paper [Focused Transformer: Contrastive Training for Context Scaling](https://arxiv.org/abs/2307.03170) for more details.
 
 We manage to achieve good performance on the passkey retrieval task from [Landmark Attention: Random-Access Infinite Context Length for Transformers](https://arxiv.org/abs/2305.16300). The code for generating the prompt and running the model is located in `examples/passkey.py`.
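The passkey task hides a short key inside long filler text and asks the model to recall it at the end of the prompt. A rough sketch of how such a prompt can be built, following the task description in the Landmark Attention paper; the repository's `examples/passkey.py` is the reference implementation and its exact wording may differ:

```python
import random

def passkey_prompt(n_garbage: int = 10000, seed: int = 0) -> tuple[str, str]:
    """Build a long prompt hiding a 5-digit passkey in filler text
    (approximate reconstruction of the task, not the repository's code)."""
    random.seed(seed)
    passkey = str(random.randint(10000, 99999))
    filler = ("The grass is green. The sky is blue. The sun is yellow. "
              "Here we go. There and back again. ")
    garbage = filler * (n_garbage // len(filler) + 1)
    insert_at = random.randint(0, len(garbage) - 1)
    info = f" The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    prompt = (
        "There is important info hidden inside a lot of irrelevant text. "
        "Find it and memorize it.\n"
        + garbage[:insert_at] + info + garbage[insert_at:]
        + "\nWhat is the pass key? The pass key is"
    )
    return prompt, passkey

prompt, answer = passkey_prompt()
print(len(prompt), "characters; expected answer:", answer)
```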
@@ -188,7 +188,14 @@ on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
 ## Citation
 To cite this work please use
 ```bibtex
-TODO
+@misc{tworkowski2023focused,
+      title={Focused Transformer: Contrastive Training for Context Scaling},
+      author={Szymon Tworkowski and Konrad Staniszewski and Mikołaj Pacek and Yuhuai Wu and Henryk Michalewski and Piotr Miłoś},
+      year={2023},
+      eprint={2307.03170},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
 ```
 
 