---
license: cc-by-sa-3.0
language:
- de
---

# xLSTM Model trained on German Wikipedia

![xLSTM](brat-logo.png)

Research & development of an xLSTM model trained on German Wikipedia.

The Flair team is currently working on the integration of xLSTM (both LM training and fine-tuning models for downstream tasks).

For pretraining this xLSTM model, we use this [fork](https://github.com/HallerPatrick/helibrunna) (from [Patrick Haller](https://huggingface.co/PatrickHaller)) of the awesome [Helibrunna](https://github.com/AI-Guru/helibrunna) library from [Tristan](https://huggingface.co/TristanBehrens).

Initially, we integrated xLSTM model training into Flair - for more information about this, please refer to the archived [flair-old](https://huggingface.co/stefan-it/xlstm-german-wikipedia/blob/flair-old/README.md) branch of this repository.

# Changelog

- 06.09.2024: We discovered a (potential) bug in the pretraining code: when using the complete Wikipedia corpus, unfortunately only the first 512 subtokens of each article were used. We implemented a grouping-based approach that tokenizes the whole corpus and groups it into 512-subtoken chunks (see the sketch after this changelog). Pretraining with this new approach is currently running.
- 29.08.2024: Uploaded re-trained model, trained for one epoch over the complete German Wikipedia corpus. Training was done with gradient clipping (0.25).
- 28.08.2024: Model training is now done with [Helibrunna](https://github.com/AI-Guru/helibrunna) fork - find it [here](https://github.com/HallerPatrick/helibrunna).
- 10.06.2024: Initial version. xLSTM was trained with the Flair library, see this [old](https://huggingface.co/stefan-it/xlstm-german-wikipedia/blob/flair-old/README.md) branch.
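
The following is a minimal sketch of the grouping-based preprocessing mentioned above - not the actual Helibrunna implementation. It assumes the dataset exposes a `text` column and that the (gated) Llama-2 tokenizer named in the training config below is accessible; the whole corpus is tokenized first, then concatenated and cut into 512-subtoken chunks, so text beyond the first 512 subtokens of an article is no longer dropped:

```python
from itertools import chain

from datasets import load_dataset
from transformers import AutoTokenizer

context_length = 512

# Assumptions: a "text" column and access to the Llama-2 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
dataset = load_dataset("stefan-it/dewiki-20230701", split="train")

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(batch):
    # Concatenate all tokenized articles and cut the token stream into
    # fixed-size chunks of `context_length` subtokens.
    concatenated = list(chain.from_iterable(batch["input_ids"]))
    total_length = (len(concatenated) // context_length) * context_length
    return {
        "input_ids": [
            concatenated[i : i + context_length]
            for i in range(0, total_length, context_length)
        ]
    }

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
chunked = tokenized.map(group_texts, batched=True, remove_columns=tokenized.column_names)
```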

# Training

The current model was trained with commit `a1b3772` from the [`main` branch](https://github.com/HallerPatrick/helibrunna) of the forked Helibrunna repo.

The `xlstm` [library](https://github.com/NX-AI/xlstm) needs to be installed manually - also make sure that Ninja is installed (`pip3 install Ninja`).

The German Wikipedia dump from [this repository](https://huggingface.co/datasets/gwlms/dewiki-20230701-flair-corpus) is used.

The following training configuration is used:

```yaml
description: "Train a wikipedia xLSTM"

training:
  model_name: "german_wikipedia"
  batch_size: 10
  lr: 6e-4
  lr_warmup_steps: 4584
  lr_decay_until_steps: "auto"
  lr_decay_factor: 0.001
  weight_decay: 0.1
  amp_precision: bfloat16
  weight_precision: float32
  enable_mixed_precision: true
  num_epochs: 1
  output_dir: "./output"
  save_every_step: 2000
  log_every_step: 10
  generate_every_step: 5000
  wandb_project: "xlstm"
  max_grad_norm: 0.25
  # wandb_project: "lovecraftxlstm"

model:
  num_blocks: 24
  embedding_dim: 768
  mlstm_block:
    mlstm:
      num_heads: 4
  slstm_block: {}
  slstm_at: []
  context_length: 512

dataset:
  output_path: "./output/german-wikipedia-dataset"
  hugging_face_id: ["stefan-it/dewiki-20230701"]
  split: "train" # Also subsetting is possible: "train[:100000]"
  shuffle: False
  seed: 42

tokenizer:
  type: "pretrained"
  pretrained_class: "LlamaTokenizer"
  pretrained_id: "meta-llama/Llama-2-7b-hf"
```
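
As a rough sketch (not the Helibrunna training code itself), the `model` section above maps onto the config classes of the `xlstm` library roughly as follows; the vocabulary size of 32,000 is an assumption based on the Llama-2 tokenizer named in the config:

```python
from xlstm import (
    xLSTMLMModel,
    xLSTMLMModelConfig,
    mLSTMBlockConfig,
    mLSTMLayerConfig,
)

config = xLSTMLMModelConfig(
    vocab_size=32_000,  # Llama-2 tokenizer vocabulary (assumed)
    num_blocks=24,
    embedding_dim=768,
    context_length=512,
    mlstm_block=mLSTMBlockConfig(mlstm=mLSTMLayerConfig(num_heads=4)),
    slstm_at=[],        # pure mLSTM stack, no sLSTM blocks
)
model = xLSTMLMModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```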

The training loss curve can be seen here:

![Training Loss](training-loss.png)

The uploaded model checkpoint is from step 458,431 (one epoch over the corpus). Training took 1d 3h 17m 58s on a single RTX 4090.
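
As a back-of-the-envelope check (using the batch size and context length from the config above, and assuming every batch is full), one epoch corresponds to roughly 2.35B subtokens:

```python
# Rough sanity check of the epoch size, assuming full batches throughout.
steps = 458_431        # optimizer steps for one epoch (reported above)
batch_size = 10        # from the training config
context_length = 512   # subtokens per sequence

sequences = steps * batch_size
subtokens = sequences * context_length
print(f"{sequences:,} sequences, about {subtokens / 1e9:.2f}B subtokens")
# 4,584,310 sequences, about 2.35B subtokens
```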

# Usage

It is possible to use the model to generate some text:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "stefan-it/xlstm-german-wikipedia"

model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

input_ids = tokenizer.encode("Heute ist schönes Wetter in", return_tensors="pt")
output = model.generate(input_ids, max_length=100, temperature=0.7, do_sample=True)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(generated_text)
```

# Caveats

Notice: this model integration is still heavily under development, and we are in the process of finding good hyper-parameters.
Downstream experiments are also coming very soon.