Files changed (1)
  1. README.md +96 -0
README.md ADDED
@@ -0,0 +1,96 @@
1
+ ---
2
+ license: llama3
3
+ language:
4
+ - de
5
+ library_name: transformers
6
+ ---
7
+ # Llama3_DiscoLeo_Instruct_8B_v0.1
8
+
9
+ ## Thanks and Accreditation
10
+
11
+ [DiscoResearch/Llama3_DiscoLeo_Instruct_8B_v0.1](https://huggingface.co/collections/DiscoResearch/discoleo-8b-llama3-for-german-6650527496c0fafefd4c9729)
12
+ is the result of a joint effort between [DiscoResearch](https://huggingface.co/DiscoResearch) and [Occiglot](https://huggingface.co/occiglot)
13
+ with support from the [DFKI](https://www.dfki.de/web/) (German Research Center for Artificial Intelligence) and [hessian.Ai](https://hessian.ai).
14
+ Occiglot kindly handled data preprocessing, filtering, and deduplication as part of their latest [dataset release](https://huggingface.co/datasets/occiglot/occiglot-fineweb-v0.5) and shared their compute allocation on hessian.Ai's 42 supercomputer.
15
+
16
+ ## Model Overview
17
+
18
+ Llama3_DiscoLeo_Instruct_8B_v0.1 is an instruction-tuned version of our [Llama3_German_8B](https://huggingface.co/DiscoResearch/Llama3_German_8B).
19
+ The base model was derived from [Meta's Llama3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) through continued pretraining on 65 billion high-quality German tokens, similar to previous [LeoLM](https://huggingface.co/LeoLM) or [Occiglot](https://huggingface.co/collections/occiglot/occiglot-eu5-7b-v01-65dbed502a6348b052695e01) models.
20
+ We fine-tuned this checkpoint on the German instruction dataset from DiscoResearch, created by [Jan-Philipp Harries](https://huggingface.co/jphme) and [Daniel Auras](https://huggingface.co/rasdani) ([DiscoResearch](https://huggingface.co/DiscoResearch), [ellamind](https://ellamind.com)).
21
+
22
+
23
+ ## How to use
24
+ Llama3_DiscoLeo_Instruct_8B_v0.1 uses the [Llama-3 chat template](https://github.com/meta-llama/llama3?tab=readme-ov-file#instruction-tuned-models), which is easy to apply with [transformers' chat templating](https://huggingface.co/docs/transformers/main/en/chat_templating).
25
+
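To make the prompt layout concrete, here is a minimal pure-Python sketch of what the Llama-3 chat template renders to. The helper name `render_llama3_prompt` is hypothetical and exists only for illustration; in practice the template shipped with the tokenizer is authoritative and should be applied through `tokenizer.apply_chat_template`.

```python
from typing import Dict, List


def render_llama3_prompt(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render chat messages into the Llama-3 instruct prompt layout (illustrative only)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: role header, blank line, trimmed content, end-of-turn token.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header cues the model to write the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
    {"role": "user", "content": "Was ist die Hauptstadt von Hessen?"},
]
print(render_llama3_prompt(messages))
```

The sketch only shows the string layout; tokenization and special-token handling are left to the tokenizer.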
26
+
27
+ ## Model Training and Hyperparameters
28
+ The model was fully fine-tuned with axolotl on the [hessian.Ai 42](https://hessian.ai) supercomputer with a context length of 8192, a learning rate of 2e-5, and a batch size of 16.
29
+
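As a rough sense of scale, the stated hyperparameters bound the number of tokens processed per optimizer step. The sketch below assumes that the batch size of 16 counts sequences per optimizer step and that every sequence is filled to the full context length; neither assumption is confirmed by the text above, so treat the result as an upper bound under those assumptions.

```python
# Hyperparameters as stated in this section.
context_length = 8192   # tokens per sequence
batch_size = 16         # assumed: sequences per optimizer step
learning_rate = 2e-5

# Upper bound on tokens seen per optimizer step if every sequence is full.
max_tokens_per_step = context_length * batch_size
print(f"at most {max_tokens_per_step} tokens per step at lr={learning_rate}")
```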
30
+
31
+ ## Evaluation and Results
32
+
33
+ We evaluated the model on a suite of common English benchmarks and their German counterparts using [GermanBench](https://github.com/bjoernpl/GermanBenchmark).
34
+
35
+ The image and table below show the benchmark scores of the different instruct models compared to Meta's instruct version. All checkpoints are available in this [collection](https://huggingface.co/collections/DiscoResearch/discoleo-8b-llama3-for-german-6650527496c0fafefd4c9729).
36
+
37
+ ![instruct scores](instruct_model_benchmarks.png)
38
+
39
+ | Model | truthfulqa | truthful_qa_de | arc_challenge | arc_challenge_de | hellaswag | hellaswag_de | MMLU | MMLU_DE | mean |
+ |---------------------------------------------------|-------------|----------------|----------------|-------------------|-------------|--------------|----------|----------|-----------|
+ | DiscoResearch/Llama3_DiscoLeo_Instruct_8B_v0.1 | **0.530425** | 0.528673 | 0.595563 | **0.538396** | 0.807210 | 0.664409 | 0.618989 | 0.560536 | **0.605525** |
+ | DiscoResearch/Llama3_DiscoLeo_Instruct_8B_32k_v0.1 | 0.527493 | **0.532451** | 0.587884 | 0.537543 | **0.807708** | **0.667098** | 0.621234 | **0.562389** | 0.605475 |
+ | meta-llama/Meta-Llama-3-8B-Instruct | 0.516810 | 0.526288 | **0.613481** | 0.498294 | 0.785401 | 0.562537 | **0.669585** | 0.558135 | 0.591316 |
44
+
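The `mean` column appears to be the unweighted average of the eight benchmark scores in each row. The following sketch recomputes it from the table values as a sanity check; the variable names are illustrative, and the numbers are copied from the table above.

```python
# Eight benchmark scores per model, in table column order.
scores = {
    "DiscoResearch/Llama3_DiscoLeo_Instruct_8B_v0.1": [
        0.530425, 0.528673, 0.595563, 0.538396, 0.807210, 0.664409, 0.618989, 0.560536,
    ],
    "DiscoResearch/Llama3_DiscoLeo_Instruct_8B_32k_v0.1": [
        0.527493, 0.532451, 0.587884, 0.537543, 0.807708, 0.667098, 0.621234, 0.562389,
    ],
    "meta-llama/Meta-Llama-3-8B-Instruct": [
        0.516810, 0.526288, 0.613481, 0.498294, 0.785401, 0.562537, 0.669585, 0.558135,
    ],
}

# Unweighted arithmetic mean over the eight benchmarks.
means = {model: sum(values) / len(values) for model, values in scores.items()}
for model, mean in means.items():
    print(f"{model}: {mean:.6f}")
```

Under this reading, the two DiscoLeo instruct models differ by about 0.00005 in mean score, while both lead Meta's instruct model by roughly 0.014.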
45
+ ## Model Configurations
46
+
47
+ We release DiscoLeo-8B in the following configurations:
48
+ 1. [Base model with continued pretraining](https://huggingface.co/DiscoResearch/Llama3_German_8B)
49
+ 2. [Long-context version (32k context length)](https://huggingface.co/DiscoResearch/Llama3_German_8B_32k)
50
+ 3. [Instruction-tuned version of the base model](https://huggingface.co/DiscoResearch/Llama3_DiscoLeo_Instruct_8B_v0.1) (This model)
51
+ 4. [Instruction-tuned version of the long-context model](https://huggingface.co/DiscoResearch/Llama3_DiscoLeo_Instruct_8B_32k_v0.1)
52
+ 5. [Experimental `DARE-TIES` Merge with Llama3-Instruct](https://huggingface.co/DiscoResearch/Llama3_DiscoLeo_8B_DARE_Experimental)
53
+ 6. [Collection of Quantized versions](https://huggingface.co/collections/DiscoResearch/discoleo-8b-quants-6651bcf8f72c9a37ce485d42)
54
+
55
+ ## How to use:
56
+ Here's how to use the model with transformers:
57
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load the model and tokenizer; device_map="auto" places the weights automatically.
+ model = AutoModelForCausalLM.from_pretrained(
+     "DiscoResearch/Llama3_DiscoLeo_Instruct_8B_v0.1",
+     torch_dtype="auto",
+     device_map="auto"
+ )
+ tokenizer = AutoTokenizer.from_pretrained("DiscoResearch/Llama3_DiscoLeo_Instruct_8B_v0.1")
+
+ prompt = "Schreibe ein Essay über die Bedeutung der Energiewende für Deutschlands Wirtschaft"
+ messages = [
+     {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
+     {"role": "user", "content": prompt}
+ ]
+ # Render the conversation with the Llama-3 chat template.
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+ # Move the inputs to the device the model was loaded on.
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ generated_ids = model.generate(
+     model_inputs.input_ids,
+     max_new_tokens=512
+ )
+ # Drop the prompt tokens so that only the newly generated text is decoded.
+ generated_ids = [
+     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+ ]
+
+ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+ ```
89
+
90
+ ## Acknowledgements
91
+
92
+ The model was trained and evaluated by [Björn Plüster](https://huggingface.co/bjoernp) ([DiscoResearch](https://huggingface.co/DiscoResearch), [ellamind](https://ellamind.com)) with data preparation and project supervision by [Manuel Brack](http://manuel-brack.eu) ([DFKI](https://www.dfki.de/web/), [TU-Darmstadt](https://www.tu-darmstadt.de/)). Instruction tuning was done with the DiscoLM German dataset created by [Jan-Philipp Harries](https://huggingface.co/jphme) and [Daniel Auras](https://huggingface.co/rasdani) ([DiscoResearch](https://huggingface.co/DiscoResearch), [ellamind](https://ellamind.com)). We extend our gratitude to [LAION](https://laion.ai/) and friends, especially [Christoph Schuhmann](https://entwickler.de/experten/christoph-schuhmann) and [Jenia Jitsev](https://huggingface.co/JJitsev), for initiating this collaboration.
93
+
94
+ The model training was supported by a compute grant at the [42 supercomputer](https://hessian.ai/), which is a central component in the development of [hessian AI](https://hessian.ai/), the [AI Innovation Lab](https://hessian.ai/infrastructure/ai-innovationlab/) (funded by the [Hessian Ministry of Higher Education, Research and the Arts (HMWK)](https://wissenschaft.hessen.de) and the [Hessian Ministry of the Interior, for Security and Homeland Security (HMinD)](https://innen.hessen.de)) and the [AI Service Centers](https://hessian.ai/infrastructure/ai-service-centre/) (funded by the [German Federal Ministry for Economic Affairs and Climate Action (BMWK)](https://www.bmwk.de/Navigation/EN/Home/home.html)).
95
+ The curation of the training data is partially funded by the [German Federal Ministry for Economic Affairs and Climate Action (BMWK)](https://www.bmwk.de/Navigation/EN/Home/home.html)
96
+ through the project [OpenGPT-X](https://opengpt-x.de/en/) (project no. 68GX21007D).