---
license: mit
language:
- en
pipeline_tag: text-generation
tags:
- linux
- gpl2
- mit
widget:
  - text: "<|start_file|>"
inference:
  parameters:
    temperature: 0.3
    do_sample: true
---


# LaaM - Linux as a Model
What happens when we train a simple transformer model to memorize the GPL2 source of the Linux kernel?

## Source
[https://github.com/mjbommar/laam](https://github.com/mjbommar/laam)
 
## Motivation
Simply put, the OSI is making a grave mistake by ignoring the most important transitive dependency in AI - the training data.

As of the latest version of [The Open Source AI Definition (draft v. 0.0.8)](https://opensource.org/deepdive/drafts/the-open-source-ai-definition-draft-v-0-0-8), 
the OSI has decided that the legal status of training data is irrelevant to their subsequent "approval" of models as "open."

The argument in favor of this omission is that such a requirement would be inconvenient and legally ambiguous
in some jurisdictions. 

This would be like Creative Commons encouraging the authors of textual or audiovisual works to ignore
the terms of copyleft licenses.

**Simply put, organizations like the OSI must take a clear, common-sense stance: "AI" models like text or multimodal LLMs
cannot be considered "open" if they are trained on "stolen" or "closed source" data.**

## Details
To demonstrate how ridiculous the OSI's position is, I have trained simple transformer models to memorize the
source code of Linux version 1.0, which is licensed under the GPL2.

This model is documented and trained in perfect compliance with the Data Information, Code, and Model sections
of the OSI's draft guidance.  All source code is available in the GitHub repository, all dependencies are open source,
all input training data is directly described by the source code, and all model weights are available on
Hugging Face.

## Example Model - 5M parameter Llama2 architecture
For example, this 5M-parameter model can be trained on practically any device in minutes to hours.  The model trivially
emits copies of Linux 1.0 source code, as shown below using the Hugging Face Hub copy at `mjbommar/linux-as-a-model-5M`:

```python
>>> from transformers import pipeline
>>> p = pipeline('text-generation', 'mjbommar/linux-as-a-model-5M')
>>> print(p('', max_new_tokens=256, do_sample=True, temperature=0.2)[0]['generated_text'])
 linux/drivers/net/3c503.c /* 3c503.c: A shared-memory NS8390 ethernet driver for linux. */
/*
 Written 1992,1993 by Donald Becker.

 Copyright 1993 United States Government as represented by the
 Director, National Security Agency. This software may be used and
 distributed according to the terms of the GNU Public License,
 incorporated herein by reference.

 This driver should work with the 3c503 and 3c503/16. It should be used
 in shared memory mode for best performance, although it may also work
 in programmed-I/O mode.

 The Author may be reached as becker@super.org or
 C/O Supercomputing Research Ctr., 17100 Science Dr., Bowie MD 20715
*/

```
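
The same behavior can be reproduced without the `pipeline` helper by loading the tokenizer and weights directly and
prompting with the `<|start_file|>` text from the widget configuration above. This is a minimal sketch (generation
settings mirror the widget defaults; sampled output will differ from run to run):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the 5M-parameter checkpoint from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("mjbommar/linux-as-a-model-5M")
model = AutoModelForCausalLM.from_pretrained("mjbommar/linux-as-a-model-5M")
model.eval()

# Prompt with the start-of-file marker used in the widget configuration.
inputs = tokenizer("<|start_file|>", return_tensors="pt")

# Sample a short continuation; a low temperature keeps the output close to the
# memorized Linux 1.0 sources.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.3,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=False))
```

At low temperatures the continuations stay close to individual files from the Linux 1.0 training corpus, which is
exactly the memorization this demonstration relies on.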

## License
For the sake of demonstration, I have licensed the model source **and weights** under MIT terms, so under its own
draft guidance the OSI should consider this model completely open and compliant.


## Train your own model
```bash
# ensure poetry available
# curl -sSL https://install.python-poetry.org | python3 -

# setup poetry environment
$ poetry install --no-root

# optionally install flash-attn
# poetry run pip install wheel
# MAX_JOBS=4 poetry run pip install flash-attn --no-build-isolation

# train a tokenizer with fixed vocab size on linux version 1.0
$ PYTHONPATH=. poetry run python3 -m laam.commands.train_tokenizer \
    --version v1.0/1.0 \
    --vocab-size 32768

# train a 5M parameter model on it

# stage 1: large batch size, 1e-3 learning rate to converge safely near a solution
$ PYTHONPATH=. poetry run accelerate launch \
    laam/commands/train_llama.py \
    --version v1.0/1.0 \
    --precision bfloat16 \
    --hidden_size 64 \
    --intermediate_size 256 \
    --num_hidden_layers 8 \
    --num_attention_heads 32 \
    --max_position_embeddings 512 \
    --learning_rate 0.001 \
    --batch_size 64 \
    --epochs 100
    
# stage 2: single sample batches with 1e-4 learning rate to memorize
$ PYTHONPATH=. poetry run accelerate launch \
    laam/commands/train_llama.py \
    --version v1.0/1.0 \
    --precision bfloat16 \
    --reload \
    --learning_rate 0.0001 \
    --batch_size 1 \
    --epochs 100
```
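
To check how much a trained checkpoint actually reproduces, you can sample from it and search for the generated text
verbatim in the original tree. The sketch below is illustrative only and not part of the laam repository; it assumes a
hypothetical unpacked Linux 1.0 source tree at `./linux-1.0` and counts how many sampled continuations appear verbatim
(after whitespace normalization) in the concatenated C sources:

```python
import re
from pathlib import Path

from transformers import pipeline

def normalize(text: str) -> str:
    """Collapse whitespace so tokenizer round-trips do not break exact matching."""
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical local path to an unpacked Linux 1.0 source tree.
corpus = normalize(
    "\n".join(p.read_text(errors="ignore") for p in Path("linux-1.0").rglob("*.[ch]"))
)

generator = pipeline("text-generation", "mjbommar/linux-as-a-model-5M")

samples, hits = 20, 0
for _ in range(samples):
    text = generator("", max_new_tokens=128, do_sample=True, temperature=0.2)[0][
        "generated_text"
    ]
    # The model emits a "linux/<path>" prefix before the file body; split it off
    # and test whether the start of the body appears verbatim in the sources.
    _path, _, body = text.strip().partition(" ")
    chunk = normalize(body)[:200]
    if chunk and chunk in corpus:
        hits += 1

print(f"{hits}/{samples} samples reproduced verbatim from the Linux 1.0 sources")
```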