	---
	language:
	- en
	pipeline_tag: text-generation
	tags:
	- meta
	- llama-3
	license: llama3
	---

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/pf4d6FA7DriRtVq5HCkxd.png)


	![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/VcZWbW_eZkJAZZ5ricL4B.png)

	# Llama-3-Giraffe-70B-Instruct

	Abacus.AI presents our longer-necked variant of Llama 3 70B - now with the instruct variant!

	This model has an effective context length of approximately 128k tokens.

	So far, we have trained on ~1.5B tokens.

	Here are our Needle-in-a-Haystack heatmap results. We are conducting further evaluations of model efficacy and will update this model card as results come in:

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/Z4uUhcjgf1P7EPGQyRLkW.png)

	### MT-Bench Evaluation

	We also measured performance on MT-Bench to verify that the context extension did not significantly impact performance on instruct tasks:

	```
	####### 1st turn:
	Meta-Llama-3-70B-Instruct     9.21
	Llama-3-Giraffe-70B-Instruct  9.19

	####### 2nd turn:
	Meta-Llama-3-70B-Instruct     8.80
	Llama-3-Giraffe-70B-Instruct  8.54

	####### average:
	Meta-Llama-3-70B-Instruct     9.00
	Llama-3-Giraffe-70B-Instruct  8.87
	```

	## Training Methodology

	The methodology for training uses [PoSE](https://arxiv.org/abs/2309.10400) and dynamic-NTK interpolation.

	### NTK-scaling

	The NTK scale factor is 4. We also tried theta-scaling, but it did not work as well as NTK scaling in our experiments.
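
As a rough illustration of what a scale factor of 4 does, here is a minimal sketch of the dynamic-NTK base adjustment in the form commonly used for RoPE models. The card does not state the exact formula used in training, so treat this as an assumption. The defaults for the RoPE base (500000), head dimension (128), and original context window (8192) are the usual Llama 3 values and are likewise assumptions, not values taken from this card.

```python
def dynamic_ntk_base(seq_len, base=500_000.0, head_dim=128,
                     factor=4.0, original_max_positions=8192):
    """Return the RoPE base to use for a sequence of length seq_len.

    Assumed formula (common dynamic-NTK form), not confirmed by the card:
    base * (factor * seq_len / L - (factor - 1)) ** (d / (d - 2))
    """
    # Inside the original window the base is left untouched.
    if seq_len <= original_max_positions:
        return base
    ratio = factor * seq_len / original_max_positions - (factor - 1.0)
    return base * ratio ** (head_dim / (head_dim - 2))


def rope_inverse_frequencies(base, head_dim=128):
    """Per-pair rotation frequencies: base ** (-2i / d) for i in [0, d/2)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


if __name__ == "__main__":
    for length in (8192, 32768, 131072):
        print(length, dynamic_ntk_base(length))
```

A larger base slows the rotation of the low-frequency dimensions, which is what lets positions beyond the original window remain distinguishable.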

	### PoSE

	We utilise Positional Skip-wise Training (PoSE) with the following parameters:

	- Number of Chunks: 5
	- Max position ID: 32768
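
The two parameters above can be read as follows: each training sample is split into 5 chunks, and each chunk's position IDs are shifted by a random skip so that the IDs seen in training range up to 32768 even though the sample itself is much shorter. The sketch below follows the general idea described in the PoSE paper; it is not the training code used for this model. The function name, the sampling scheme for cut points and skips, and the 8192-token sample length are all hypothetical.

```python
import random


def pose_position_ids(seq_len, num_chunks=5, max_position=32768, rng=None):
    """Sample skip-wise position IDs for one training sequence (illustrative)."""
    rng = rng or random.Random()
    # Pick num_chunks - 1 distinct cut points to split the sequence into chunks.
    cuts = sorted(rng.sample(range(1, seq_len), num_chunks - 1))
    bounds = [0] + cuts + [seq_len]
    # Each chunk gets a cumulative skip; sorting keeps position IDs increasing.
    slack = max_position - seq_len
    offsets = sorted(rng.randint(0, slack) for _ in range(num_chunks))
    position_ids = []
    for i in range(num_chunks):
        start, end = bounds[i], bounds[i + 1]
        position_ids.extend(range(start + offsets[i], end + offsets[i]))
    return position_ids


if __name__ == "__main__":
    ids = pose_position_ids(8192, rng=random.Random(0))
    print(len(ids), ids[0], ids[-1])
```

Positions stay contiguous within a chunk and jump between chunks, so the model sees large relative distances without ever processing a 32768-token sequence.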

	### Data

	We use samples from [RedPajama](https://github.com/togethercomputer/RedPajama-Data) that are on average ~8K tokens long.

	### Hardware

	We train on 8xH100 GPUs with DeepSpeed ZeRO Stage 3.

	## Evaluation Methodology

	We use the [EasyContext](https://github.com/abacusai/EasyContext/blob/eval_runs/eval_needle.py) implementation of Needle-in-a-Haystack to evaluate Llama-3-Giraffe-70B.

	We evaluate with the following parameters:

	- Min context length: 2000
	- Max context length: 128000
	- Context interval: 4000
	- Depth interval: 0.1
	- Num samples: 2
	- Rnd number digits: 7
	- Haystack dir: PaulGrahamEssays
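
To make the size of this sweep concrete, the sketch below enumerates one plausible reading of the parameters above. How the EasyContext script actually builds its grid is not stated here, so the inclusive ranges, the helper names, and the needle generation are assumptions for illustration only.

```python
import random

MIN_CONTEXT = 2000
MAX_CONTEXT = 128000
CONTEXT_INTERVAL = 4000
DEPTH_INTERVAL = 0.1
NUM_SAMPLES = 2
RND_NUMBER_DIGITS = 7


def context_lengths():
    # Assumed: step from the minimum up to (and including) the maximum.
    return list(range(MIN_CONTEXT, MAX_CONTEXT + 1, CONTEXT_INTERVAL))


def depths():
    # Assumed: needle depths from 0.0 (start) to 1.0 (end), inclusive.
    steps = round(1 / DEPTH_INTERVAL)
    return [round(i * DEPTH_INTERVAL, 10) for i in range(steps + 1)]


def random_needle(rng):
    # A random number with exactly RND_NUMBER_DIGITS digits.
    low = 10 ** (RND_NUMBER_DIGITS - 1)
    return rng.randint(low, 10 ** RND_NUMBER_DIGITS - 1)


def trials(seed=0):
    rng = random.Random(seed)
    for length in context_lengths():
        for depth in depths():
            for _ in range(NUM_SAMPLES):
                yield length, depth, random_needle(rng)


if __name__ == "__main__":
    print(len(context_lengths()), len(depths()), sum(1 for _ in trials()))
```

Under this reading, the largest context actually tested is 126000, because stepping by 4000 from 2000 never lands exactly on 128000.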


	### Adapter Transfer

	We first apply the above techniques to Llama-3-70B-Base, using LoRA on the Q and K weights only. This adapter is then applied to Llama-3-70B-Instruct, and we release the merged version here.
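
The merge step can be pictured with standard LoRA arithmetic: the low-rank update learned against the base model's Q or K projection is scaled and added to the corresponding instruct weight. The toy sketch below uses 2x2 matrices in plain Python. The rank, alpha, and all numeric values are made up for illustration, since the card does not state the adapter's hyperparameters.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


if __name__ == "__main__":
    # Hypothetical instruct-model Q projection (identity, for readability).
    instruct_q = [[1.0, 0.0], [0.0, 1.0]]
    # Hypothetical rank-1 adapter trained on the base model.
    lora_b = [[1.0], [2.0]]   # shape (out_features, rank)
    lora_a = [[0.5, -1.0]]    # shape (rank, in_features)
    print(merge_lora(instruct_q, lora_a, lora_b, alpha=2.0, rank=1))
```

Only the Q and K projections would receive such an update in the setup described above; all other instruct weights stay as they are.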