---
language: en
tags:
- hummingbird
- causal-lm
license: mit
datasets:
- wikimedia/wikipedia
- qwedsacf/grade-school-math-instructions
- HuggingFaceH4/instruction-dataset
- alespalla/chatbot_instruction_prompts
- MBZUAI/LaMini-instruction
- hendrycks/competition_math
- lighteval/MATH
- camel-ai/math
- microsoft/orca-math-word-problems-200k
pipeline_tag: text-generation
---

# Hummingbird 0.0 Release

This is Hummingbird 0.0, a 1B-parameter proof-of-concept causal language model based on **Efficient Attention**, which reduces the number of parameters in the attention layer by 50% compared to standard Multi-Head Attention.

This version of Hummingbird is only meant to demonstrate Efficient Attention for causal language modelling. It has been trained on only 15 billion tokens and has no safety guardrails, so we do not recommend using it as a chatbot.

<div align="center">
  <img src="figs/Hummingbird.png" width="400"/>
</div>


## Model Details

The model has 1.1 billion parameters with the following specifications:

| Parameter            | Size |
| :------------------- | :--- |
| # Transformer Blocks | 10   |
| Model Dimension      | 3072 |
| # Heads              | 1    |
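
As a rough sanity check on the 1.1B figure, the sketch below tallies parameters from the table under assumptions that are not stated in this card: a conventional 4x feed-forward expansion, untied embedding and output matrices, and a vocabulary of roughly 32k tokens; layer norms and biases are ignored as negligible.

```python
# Back-of-the-envelope parameter count for the configuration above.
# Assumptions (not stated in this card): 4x feed-forward expansion,
# a ~32k-token vocabulary, untied embedding/output matrices; layer
# norms and biases are ignored as negligible at this scale.
d_model, n_blocks, vocab_size = 3072, 10, 32_000

attn_per_block = 2 * d_model**2        # Efficient Attention keeps 2 of the usual 4 projections
ffn_per_block = 2 * 4 * d_model**2     # up- and down-projection at 4x expansion
blocks = n_blocks * (attn_per_block + ffn_per_block)
embeddings = 2 * vocab_size * d_model  # token embeddings + LM head

print(f"approx. total: {(blocks + embeddings) / 1e9:.2f}B")  # ~1.1B
```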


The attention mechanism is based on our newly proposed Efficient Attention, introduced in our paper *You Need to Pay Better Attention: Rethinking the Mathematics of Attention Mechanism* ([arXiv:2403.01643](https://arxiv.org/abs/2403.01643)). We chose a single attention head as an interesting case study, since current language models generally use multiple heads.
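
For readers who want to see what this looks like in code, below is a minimal single-head sketch consistent with the 50% figure quoted above: only the query and key projections are kept, and the attention weights are applied directly to the layer input, dropping the value and output projections of standard multi-head attention. The class and variable names are ours; this is an illustration of the idea, not the released implementation, so see the paper for the exact formulation.

```python
# Minimal sketch (not the released code) of a single-head, causal
# Efficient Attention layer: only W_Q and W_K are learned, so the layer
# has 2*d^2 parameters instead of the 4*d^2 of standard attention.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class EfficientAttentionSketch(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)  # query projection
        self.w_k = nn.Linear(d_model, d_model, bias=False)  # key projection
        # No value or output projection.

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k = self.w_q(x), self.w_k(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        # Causal mask: each position may attend only to itself and earlier positions.
        future = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
        return F.softmax(scores, dim=-1) @ x  # attention weights applied to the input directly
```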

The loss plot below illustrates the model's performance during training. For comparison, when trained on 15 billion tokens, Hummingbird achieves a slightly lower loss than TinyLlama, a model of similar size.
<div align="center">
  <img src="figs/history.png" width="700"/>
</div>
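
If you want to try the checkpoint for plain text generation, something along the following lines should work, assuming the repository ships the custom modelling code needed by the `transformers` auto classes; the repository ID below is a placeholder, and if loading fails you should fall back to the code files in the repository.

```python
# Hypothetical usage sketch: the repository ID is a placeholder, and
# trust_remote_code assumes custom modelling code is shipped with the model.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<this-model's-repository-id>"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

inputs = tokenizer("The hummingbird is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```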

## Team
Hummingbird was designed and trained jointly by [Mehran Hosseini](https://mehranhosseini.com) and [Peyman Hosseini](https://peymanhosseini.net).

If you use Efficient Attention or Hummingbird, please cite our paper:

```
@article{Hosseinis24BetterAttention,
  title      = {You Need to Pay Better Attention: Rethinking the Mathematics of Attention Mechanism},
  author     = {Hosseini, Mehran and Hosseini, Peyman},
  journal    = {arXiv preprint arXiv:2403.01643},
  year       = {2024}
}
```