---
language:
- en
pipeline_tag: text-generation
tags:
- meta
- llama-3
license: llama3
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/pf4d6FA7DriRtVq5HCkxd.png)


![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/VcZWbW_eZkJAZZ5ricL4B.png)

# Llama-3-Giraffe-70B-Instruct

Abacus.AI presents our longer-necked variant of Llama 3 70B - now available in an instruct version!

This model has an effective context length of approximately 128k.

We have currently trained on ~1.5B tokens.

Below are our Needle-in-a-Haystack heatmap results. We are conducting further evaluations of model efficacy and will update the model card as these come in:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/Z4uUhcjgf1P7EPGQyRLkW.png)

### MT-Bench Evaluation

We also measured performance on MT-Bench to verify that the context extension did not significantly impact performance on instruct tasks:

```
####### 1st turn:
Meta-Llama-3-70B-Instruct     9.21
Llama-3-Giraffe-70B-Instruct  9.19

####### 2nd turn:
Meta-Llama-3-70B-Instruct     8.80
Llama-3-Giraffe-70B-Instruct  8.54

####### average:
Meta-Llama-3-70B-Instruct     9.00
Llama-3-Giraffe-70B-Instruct  8.87
```

## Training Methodology

Training uses [PoSE](https://arxiv.org/abs/2309.10400) combined with dynamic-NTK interpolation.

### NTK-scaling

The scale factor for NTK is 4. Note that we also tried theta-scaling but this did not work as well as NTK scaling in our experiments.
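As a rough illustration, a dynamic-NTK scale factor of 4 can be expressed through the `rope_scaling` field of a Hugging Face Llama config. This is a sketch of the idea only, not the exact training setup, and the repo id used below is assumed:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Sketch only: a dynamic-NTK RoPE scaling factor of 4, expressed via the
# Hugging Face `rope_scaling` config field. The exact training code for
# this model is not part of this card, and the repo id is assumed.
config = AutoConfig.from_pretrained("abacusai/Llama-3-Giraffe-70B-Instruct")
config.rope_scaling = {"type": "dynamic", "factor": 4.0}

model = AutoModelForCausalLM.from_pretrained(
    "abacusai/Llama-3-Giraffe-70B-Instruct",
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```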

### PoSE

We utilise Positional Skip-wise Training (PoSE) with the following parameters (a toy sketch of the position-ID construction follows the list):

- **Number of Chunks**: 5
- **Max position ID**: 32768
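
The sketch below illustrates how PoSE-style position IDs can be built under these parameters. It is illustrative only and does not reproduce the actual training code; the chunk-boundary and skip sampling here are simplified assumptions.

```python
import random

def pose_position_ids(seq_len: int, num_chunks: int = 5, max_pos_id: int = 32768) -> list[int]:
    """Toy sketch of PoSE-style position IDs (not the actual training code):
    split a short training sequence into `num_chunks` contiguous chunks and
    add a non-decreasing random offset to each chunk, so the position IDs
    span up to `max_pos_id` even though the text itself stays short."""
    # Chunk boundaries over the real (short) sequence.
    boundaries = sorted(random.sample(range(1, seq_len), num_chunks - 1))
    starts = [0] + boundaries
    ends = boundaries + [seq_len]

    # How much position space we can skip over in total.
    total_skip = max_pos_id - seq_len
    # Non-decreasing random skips, one per chunk.
    skips = sorted(random.randint(0, total_skip) for _ in range(num_chunks))

    position_ids = []
    for start, end, skip in zip(starts, ends, skips):
        position_ids.extend(range(start + skip, end + skip))
    return position_ids

# Example: an 8K-token sample whose position IDs reach into the 32K range.
ids = pose_position_ids(seq_len=8192)
assert len(ids) == 8192 and max(ids) < 32768
```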

### Data

We train on samples from [RedPajama](https://github.com/togethercomputer/RedPajama-Data) that are approximately 8K tokens long on average.

### Hardware

We train on 8x H100 GPUs with DeepSpeed ZeRO Stage 3.
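
For reference, a minimal ZeRO Stage 3 configuration for a run of this kind might look like the following; the values here are assumptions, not the exact config used:

```python
# Illustrative DeepSpeed ZeRO Stage 3 settings (assumed, not the exact config
# used for this run); this dict can be passed to a Hugging Face Trainer via
# its `deepspeed` argument or written out as JSON.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}
```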

## Evaluation Methodology

We use the [EasyContext](https://github.com/abacusai/EasyContext/blob/eval_runs/eval_needle.py) implementation of Needle-in-a-Haystack to evaluate Llama-3-Giraffe-70B.

We evaluate with the following parameters (a sketch of the resulting evaluation grid is shown after the list):

- **Min context length**: 2000
- **Max context length**: 128000
- **Context interval**: 4000
- **Depth interval**: 0.1
- **Num samples**: 2
- **Rnd number digits**: 7
- **Haystack dir**: PaulGrahamEssays
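
The parameters above imply the following sweep over context lengths and needle depths. This is a sketch of the grid only, not the EasyContext script itself; the needle construction is described in comments, and the exact endpoint handling may differ in the actual script.

```python
# Sketch of the Needle-in-a-Haystack sweep implied by the parameters above
# (illustrative only; the actual runs use EasyContext's eval_needle.py).
min_ctx, max_ctx, ctx_interval = 2000, 128000, 4000
depth_interval, num_samples, rnd_digits = 0.1, 2, 7

context_lengths = range(min_ctx, max_ctx + 1, ctx_interval)
depths = [round(i * depth_interval, 1) for i in range(int(round(1 / depth_interval)) + 1)]

for ctx_len in context_lengths:
    for depth in depths:
        for _ in range(num_samples):
            # Build a haystack from the PaulGrahamEssays corpus, truncate it
            # to `ctx_len` tokens, insert a `rnd_digits`-digit random number
            # as the needle at relative depth `depth`, then ask the model to
            # retrieve that number and score the answer.
            pass
```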


### Adapter Transfer

We apply the above techniques first to Llama-3-70B-Base, using LoRA on the Q and K weights only. This adapter is then applied to Llama-3-70B-Instruct, and we
release the merged version here.
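
A hedged sketch of this adapter-transfer step using the PEFT library is shown below. The LoRA rank/alpha values and the adapter path are placeholders; the card only states that the Q and K projections are targeted.

```python
from peft import LoraConfig, PeftModel, get_peft_model
from transformers import AutoModelForCausalLM

# Step 1 (training, sketched): attach a LoRA adapter to the base model,
# targeting only the Q and K projections. Rank/alpha are placeholders.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B", torch_dtype="auto", device_map="auto"
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj"],  # Q and K weights only
    task_type="CAUSAL_LM",
)
base = get_peft_model(base, lora_config)
# ... long-context training with PoSE + dynamic-NTK happens here ...
base.save_pretrained("long-context-qk-adapter")

# Step 2 (transfer): apply the trained adapter to the instruct model and
# merge the weights to produce the released checkpoint.
instruct = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct", torch_dtype="auto", device_map="auto"
)
merged = PeftModel.from_pretrained(instruct, "long-context-qk-adapter").merge_and_unload()
merged.save_pretrained("Llama-3-Giraffe-70B-Instruct")
```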