mjbommar committed 8f29ba8 (1 parent: 40aa84a): Update README.md
Files changed (1): README.md (+117, -3)

---
license: mit
language:
- en
pipeline_tag: text-generation
tags:
- linux
- gpl2
- mit
---

# LaaM - Linux as a Model
What happens when we train a simple transformer model to memorize the GPL2 source of the Linux kernel?

## Source
[https://github.com/mjbommar/laam](https://github.com/mjbommar/laam)

## Motivation
Simply put, the OSI is making a grave mistake by ignoring the most important transitive dependency in AI: the training data.

As of the latest version of [The Open Source AI Definition (draft v. 0.0.8)](https://opensource.org/deepdive/drafts/the-open-source-ai-definition-draft-v-0-0-8),
the OSI has decided that the legal status of training data is irrelevant to their subsequent "approval" of models as "open."

The argument in favor of this omission is that such a requirement would be inconvenient and legally ambiguous
in some jurisdictions.

This would be like Creative Commons encouraging the authors of textual or audiovisual works to ignore
the terms of copyleft licenses.

**Simply put, organizations like the OSI must take a clear, common-sense stance: "AI" models like text or multimodal LLMs
cannot be considered "open" if they are trained on "stolen" or "closed source" data.**

## Details
To demonstrate how ridiculous the OSI's position is, I have trained simple transformer models to memorize the
source code of Linux version 1.0, which is licensed under the GPL2.

This model is documented and trained in perfect compliance with the OSI's draft guidance on Data Information, Code,
and Model sections. All source code is available in the GitHub repository, all dependencies are open source,
all input training data is directly described by the source code, and all model weights are available on
Hugging Face.

## Example Model - 5M parameter Llama2 architecture
For example, this 5M parameter model can be trained on practically any device in minutes to hours, and it trivially
emits copies of Linux 1.0 source code. Using the Hugging Face Hub copy at `mjbommar/linux-as-a-model-5M`:

```python
>>> from transformers import pipeline
>>> p = pipeline('text-generation', 'mjbommar/linux-as-a-model-5M')
>>> print(p('', max_new_tokens=256, do_sample=True, temperature=0.2)[0]['generated_text'])
linux/drivers/net/3c503.c /* 3c503.c: A shared-memory NS8390 ethernet driver for linux. */
/*
Written 1992,1993 by Donald Becker.

Copyright 1993 United States Government as represented by the
Director, National Security Agency. This software may be used and
distributed according to the terms of the GNU Public License,
incorporated herein by reference.

This driver should work with the 3c503 and 3c503/16. It should be used
in shared memory mode for best performance, although it may also work
in programmed-I/O mode.

The Author may be reached as becker@super.org or
C/O Supercomputing Research Ctr., 17100 Science Dr., Bowie MD 20715
*/

```

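If the model has memorized its training data as described, a greedy (non-sampled) decode from a short prefix should
reproduce the surrounding file nearly verbatim. The snippet below is a minimal sketch of that check; the prefix is
taken from the sample output above, and a byte-for-byte comparison against the actual Linux 1.0 tree is left to the
reader:

```python
from transformers import pipeline

# Load the published 5M-parameter checkpoint from the Hugging Face Hub.
p = pipeline("text-generation", "mjbommar/linux-as-a-model-5M")

# Greedy continuation of a file-path prefix that appears in the sample above.
prefix = "linux/drivers/net/3c503.c"
print(p(prefix, max_new_tokens=128, do_sample=False)[0]["generated_text"])
```
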
## License
For the sake of demonstration, I have licensed the model source **and weights** under the MIT terms,
and the OSI should support this model as completely open and compliant with their draft guidance.

## Train your own model
```bash
# ensure poetry is available
# curl -sSL https://install.python-poetry.org | python3 -

# set up the poetry environment
$ poetry install --no-root

# optionally install flash-attn
# poetry run pip install wheel
# MAX_JOBS=4 poetry run pip install flash-attn --no-build-isolation

# train a tokenizer with a fixed vocab size on Linux version 1.0
$ PYTHONPATH=. poetry run python3 -m laam.commands.train_tokenizer \
    --version v1.0/1.0 \
    --vocab-size 32768

# train a 5M parameter model on it

# stage 1: large batch size, 1e-3 learning rate to safely converge near a solution
$ PYTHONPATH=. poetry run accelerate launch \
    laam/commands/train_llama.py \
    --version v1.0/1.0 \
    --precision bfloat16 \
    --hidden_size 64 \
    --intermediate_size 256 \
    --num_hidden_layers 8 \
    --num_attention_heads 32 \
    --max_position_embeddings 512 \
    --learning_rate 0.001 \
    --batch_size 64 \
    --epochs 100

# stage 2: single-sample batches with a 1e-4 learning rate to memorize
$ PYTHONPATH=. poetry run accelerate launch \
    laam/commands/train_llama.py \
    --version v1.0/1.0 \
    --precision bfloat16 \
    --reload \
    --learning_rate 0.0001 \
    --batch_size 1 \
    --epochs 100
```
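
For reference, the stage 1 flags above appear to correspond to a standard Hugging Face `transformers` `LlamaConfig`
along the lines of the sketch below. This is an illustrative approximation rather than the repository's own training
code; `vocab_size` is assumed to match the tokenizer's `--vocab-size`, and the reported parameter count is on the
order of 5M:

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Hyperparameters mirroring the training flags above (an assumption, not copied from laam).
config = LlamaConfig(
    vocab_size=32768,              # matches the tokenizer's --vocab-size
    hidden_size=64,
    intermediate_size=256,
    num_hidden_layers=8,
    num_attention_heads=32,
    max_position_embeddings=512,
)

# Randomly initialized model, used only to count parameters for this configuration.
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")  # roughly 5M
```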