---
license: mit
datasets:
- togethercomputer/RedPajama-Data-1T
- wikipedia
pipeline_tag: text-generation
---
Hadamard-Walsh 1.7B is an experimental model using a new positional encoder. The encoder represents absolute positions with a combination of rows from the Hadamard-Walsh matrix (https://en.wikipedia.org/wiki/Hadamard_code). Each row corresponds to a binary digit in the positional code, where the presence of a row codes for a 1 and its absence for a 0. During training, the base offset into the sequence is randomly chosen for each batch. As a result, the model handles sequences much longer than those seen in training.
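As a rough illustration of the idea, here is a minimal sketch: build a Walsh-Hadamard matrix via the Sylvester construction, then encode each position by summing the rows selected by the 1-bits of its index, with a random base offset applied per batch. The function names, the choice of the first 14 rows, and the offset snippet are illustrative assumptions, not the model's actual implementation.

```python
import torch

def hadamard_matrix(k: int) -> torch.Tensor:
    # 2^k x 2^k Hadamard matrix via the Sylvester construction (entries are +/-1).
    h = torch.ones(1, 1)
    for _ in range(k):
        h = torch.cat(
            (torch.cat((h, h), dim=1), torch.cat((h, -h), dim=1)),
            dim=0,
        )
    return h

def encode_positions(positions: torch.Tensor, rows: torch.Tensor) -> torch.Tensor:
    # Write each position as one binary digit per selected row, then sum the
    # rows picked out by the 1-bits; a 0-bit contributes nothing.
    bits = rows.shape[0]
    digits = (positions.unsqueeze(-1) >> torch.arange(bits)) & 1  # (..., bits)
    return digits.to(rows.dtype) @ rows                           # (..., d_model)

d_model, bits = 2048, 14           # 14 bits cover 2**14 = 16384 positions
rows = hadamard_matrix(11)[:bits]  # 14 rows of a 2048 x 2048 Walsh matrix

# During training, shift the whole batch by a random base offset so the model
# sees many different absolute positions for the same relative structure.
seq_len, max_len = 1024, 2 ** bits
offset = torch.randint(0, max_len - seq_len + 1, (1,))
codes = encode_positions(torch.arange(seq_len) + offset, rows)  # (1024, 2048)
```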
The encoding scheme was devised while I was running experiments to determine the degree to which various positional encoding schemes interfere with the information-carrying capacity of embeddings. This particular scheme did exceptionally well: it was both highly resilient to interference and minimally disruptive to the embeddings. As a follow-on experiment, I adapted the encoder to work with my transformer implementation and found that it performs exceptionally well when compared directly with other popular positional encoding schemes.
The model has had approximately three weeks of pretraining on six RTX 4090s. It seems to be doing remarkably well compared with my other model of similar size and training, which uses ALiBi positional encodings and SwiGLU. I have also noted an unusual loss pattern, in which evaluation loss shows large punctuated drops. I can only speculate, but my suspicion is that the random offset of the positional encoder may have a regularizing effect on training. The attention patterns are also quite "interesting." [TODO: add images of attention patterns]
### Model Details:
- Model Dimension: 2048
- Hidden Layers: 32
- Attention Heads: 32
- Feedforward Dimension: 8192
- Feedforward Network Type: Conventional MLP with GeLU activation
- Vocabulary Size: 32000
- Max Sequence Length: 16K (14-bit absolute positional encoding via Walsh matrix)
- Weight Initialization: DeepNet, https://arxiv.org/abs/2203.00555
- Pretraining Datasets: RedPajama-Data-1T, mostly "books" and some Wikipedia.
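As a rough sanity check on the "1.7B" in the name, the dimensions above account for about 1.7 billion parameters. The estimate below ignores biases, layer norms, and any untied output head, so it is only approximate:

```python
d_model, layers, d_ff, vocab = 2048, 32, 8192, 32000

embedding = vocab * d_model              # ~65.5M
attention = 4 * d_model * d_model        # Q, K, V, and output projections
feedforward = 2 * d_model * d_ff         # up and down projections
per_layer = attention + feedforward      # ~50.3M per layer
total = embedding + layers * per_layer   # ~1.68B
print(f"{total / 1e9:.2f}B parameters")  # -> 1.68B
```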
### Loading:
The model implementation is entirely my own, so you will need to pass `trust_remote_code=True` to load the model.
```python
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
)

model_id = "dinalt/walsh-1-7b"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    # flash_attention_2 requires bfloat16 or float16
    torch_dtype=torch.bfloat16,
    # One of ["flash_attention_2", "sdpa", "eager"]
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```
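Once loaded, a quick generation check might look like the following, assuming the custom model class supports the standard `generate` API; the prompt and sampling settings are arbitrary:

```python
prompt = "The Hadamard-Walsh positional encoder"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```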
The model has been tested with text-generation-webui, which needs to be started with the `--trust-remote-code` flag.