File size: 9,544 Bytes
009c407
841d7fc
49c1f08
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cfca8a4
dc9d777
 
6aa3e92
1115381
 
8c58028
dc9d777
 
8c58028
 
 
 
 
 
cfca8a4
 
ecc127c
 
 
 
009c407
ecc127c
 
 
 
 
 
17f6afe
 
 
ecc127c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
17f6afe
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
ecc127c
012bfcc
dcb0894
 
ecc127c
dcb0894
17f6afe
841d7fc
012bfcc
 
 
 
 
841d7fc
012bfcc
009c407
2925a1d
 
 
 
 
 
 
 
 
 
 
 
012bfcc
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
382662a
2c6b73d
 
382662a
 
fdf3f16
2c6b73d
fdf3f16
9aef011
 
264634b
 
3e14433
264634b
16c9195
37f71ff
f570496
d3ad40f
f570496
327e651
f3c0668
c2c1511
 
 
 
 
627c408
 
97d94c6
327e651
3e14433
 
a86f107
 
9d4c050
cedbbde
 
 
 
1f1f9b0
 
 
 
 
 
9d4c050
 
fdf3f16
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
# PySR.jl

**Symbolic regression built on Julia, and interfaced by Python.
Uses regularized evolution and simulated annealing.**

Backstory: we used the original
[eureqa](https://www.creativemachineslab.com/eureqa.html)
in our [paper](https://arxiv.org/abs/2006.11287) to
convert a graph neural network into
an analytic equation describing dark matter overdensity. However,
eureqa is GUI-only, doesn't allow for user-defined
operators, has no distributed capabilities,
and has become proprietary. Thus, the goal
of this package is to have an open-source symbolic regression tool
as efficient as eureqa, while also exposing a configurable
python interface.

The algorithms here implement regularized evolution, as in
[AutoML-Zero](https://arxiv.org/abs/2003.03384),
but with additional algorithmic changes such as simulated
annealing, and classical optimization of constants.


## Installation

Install Julia - see instructions for [mac](https://julialang.org/downloads/platform/#macos) and [linux](https://julialang.org/downloads/platform/#linux_and_freebsd). Then, at the command line,
install the `Optim` and `SpecialFunctions` packages via:
`julia -e 'import Pkg; Pkg.add("Optim"); Pkg.add("SpecialFunctions")'`.

For python, you need to have Python 3, numpy, and pandas installed.

You can install this package from PyPI with:

```
pip install pysr
```

## Running:

### Quickstart

```python
import numpy as np
from pysr import pysr

# Dataset
X = 2*np.random.randn(100, 5)
y = 2*np.cos(X[:, 3]) + X[:, 0]**2 - 2

# Learn equations
equations = pysr(X, y, niterations=5,
            binary_operators=["plus", "mult"],
            unary_operators=["cos", "exp", "sin"])

...

print(equations)
```

which gives:

```
   Complexity       MSE                                                Equation
0           5  1.947431                          plus(-1.7420927, mult(x0, x0))
1           8  0.486858           plus(-1.8710494, plus(cos(x3), mult(x0, x0)))
2          11  0.000000  plus(plus(mult(x0, x0), cos(x3)), plus(-2.0, cos(x3)))
```

### Operators

All Base julia operators that take 1 or 2 float32 as input,
and output a float32 as output, are available. A selection
of these and other valid operators are stated below. You can also
define your own in `operators.jl`, and pass the function
name as a string.

**Binary**

`plus`, `mult`, `pow`, `div`, `greater`, `mod`, `beta`, `logical_or`,
`logical_and`

**Unary**

`neg`,
`exp`,
`abs`,
`logm` (=log(abs(x) + 1e-8)),
`logm10` (=log10(abs(x) + 1e-8)),
`logm2` (=log2(abs(x) + 1e-8)),
`log1p`,
`sin`,
`cos`,
`tan`,
`sinh`,
`cosh`,
`tanh`,
`asin`,
`acos`,
`atan`,
`asinh`,
`acosh`,
`atanh`,
`erf`,
`erfc`,
`gamma`,
`relu`,
`round`,
`floor`,
`ceil`,
`round`.

### Full API

What follows is the API reference for running the numpy interface.
You likely don't need to tune the hyperparameters yourself,
but if you would like, you can use `hyperopt.py` as an example.
However, you should adjust `threads`, `niterations`,
`binary_operators`, `unary_operators`, and `maxsize`
to your requirements.

The program will output a pandas DataFrame containing the equations,
mean square error, and complexity. It will also dump to a csv
at the end of every iteration,
which is `hall_of_fame.csv` by default. It also prints the
equations to stdout.

```python
pysr(X=None, y=None, threads=4, niterations=20,
   ncyclesperiteration=int(default_ncyclesperiteration),
   binary_operators=["plus", "mult"], unary_operators=["cos", "exp", "sin"],
   alpha=default_alpha, annealing=True, fractionReplaced=default_fractionReplaced,
   fractionReplacedHof=default_fractionReplacedHof, npop=int(default_npop),
   parsimony=default_parsimony, migration=True, hofMigration=True
   shouldOptimizeConstants=True, topn=int(default_topn),
   weightAddNode=default_weightAddNode, weightDeleteNode=default_weightDeleteNode,
   weightDoNothing=default_weightDoNothing,
   weightMutateConstant=default_weightMutateConstant,
   weightMutateOperator=default_weightMutateOperator,
   weightRandomize=default_weightRandomize, weightSimplify=default_weightSimplify,
   timeout=None, equation_file='hall_of_fame.csv', test='simple1', maxsize=20)
```

Run symbolic regression to fit f(X[i, :]) ~ y[i] for all i.

**Arguments**:

- `X`: np.ndarray, 2D array. Rows are examples, columns are features.
- `y`: np.ndarray, 1D array. Rows are examples.
- `threads`: int, Number of threads (=number of populations running).
You can have more threads than cores - it actually makes it more
efficient.
- `niterations`: int, Number of iterations of the algorithm to run. The best
equations are printed, and migrate between populations, at the
end of each.
- `ncyclesperiteration`: int, Number of total mutations to run, per 10
samples of the population, per iteration.
- `binary_operators`: list, List of strings giving the binary operators
in Julia's Base, or in `operator.jl`.
- `unary_operators`: list, Same but for operators taking a single `Float32`.
- `alpha`: float, Initial temperature.
- `annealing`: bool, Whether to use annealing. You should (and it is default).
- `fractionReplaced`: float, How much of population to replace with migrating
equations from other populations.
- `fractionReplacedHof`: float, How much of population to replace with migrating
equations from hall of fame.
- `npop`: int, Number of individuals in each population
- `parsimony`: float, Multiplicative factor for how much to punish complexity.
- `migration`: bool, Whether to migrate.
- `hofMigration`: bool, Whether to have the hall of fame migrate.
- `shouldOptimizeConstants`: bool, Whether to numerically optimize
constants (Nelder-Mead/Newton) at the end of each iteration.
- `topn`: int, How many top individuals migrate from each population.
- `weightAddNode`: float, Relative likelihood for mutation to add a node
- `weightDeleteNode`: float, Relative likelihood for mutation to delete a node
- `weightDoNothing`: float, Relative likelihood for mutation to leave the individual
- `weightMutateConstant`: float, Relative likelihood for mutation to change
the constant slightly in a random direction.
- `weightMutateOperator`: float, Relative likelihood for mutation to swap
an operator.
- `weightRandomize`: float, Relative likelihood for mutation to completely
delete and then randomly generate the equation
- `weightSimplify`: float, Relative likelihood for mutation to simplify
constant parts by evaluation
- `timeout`: float, Time in seconds to timeout search
- `equation_file`: str, Where to save the files (.csv separated by |)
- `test`: str, What test to run, if X,y not passed.
- `maxsize`: int, Max size of an equation.

**Returns**:

pd.DataFrame, Results dataframe, giving complexity, MSE, and equations
(as strings).




# TODO

- [ ] Why don't the constants continually change? It should optimize them every time the equation appears.
- [ ] Add ability to save and state from python
- [ ] Add several common unary and binary operators; list these.
- [ ] Calculate feature importances of future mutations, by looking at correlation between residual of model, and the features.
    - Store feature importances of future, and periodically update it.
- [ ] Implement more parts of the original Eureqa algorithms: https://www.creativemachineslab.com/eureqa.html
- [ ] Sympy printing
- [ ] Consider adding mutation for constant<->variable
- [ ] Hierarchical model, so can re-use functional forms. Output of one equation goes into second equation?
- [ ] Use NN to generate weights over all probability distribution conditional on error and existing equation, and train on some randomly-generated equations
- [ ] Performance:
    - [ ] Use an enum for functions instead of storing them?
    - Current most expensive operations:
        - [ ] Calculating the loss function - there is duplicate calculations happening.
        - [x] Declaration of the weights array every iteration
- [x] Make scaling of changes to constant a hyperparameter
- [x] Make deletion op join deleted subtree to parent
- [x] Update hall of fame every iteration?
    - Seems to overfit early if we do this.
- [x] Consider adding mutation to pass an operator in through a new binary operator (e.g., exp(x3)->plus(exp(x3), ...))
    - (Added full insertion operator
- [x] Add a node at the top of a tree
- [x] Insert a node at the top of a subtree
- [x] Record very best individual in each population, and return at end.
- [x] Write our own tree copy operation; deepcopy() is the slowest operation by far.
- [x] Hyperparameter tune
- [x] Create a benchmark for accuracy
- [x] Add interface for either defining an operation to learn, or loading in arbitrary dataset.
    - Could just write out the dataset in julia, or load it.
- [x] Create a Python interface
- [x] Explicit constant optimization on hall-of-fame
    - Create method to find and return all constants, from left to right
    - Create method to find and set all constants, in same order
    - Pull up some optimization algorithm and add it. Keep the package small!
- [x] Create a benchmark for speed
- [x] Simplify subtrees with only constants beneath them. Or should I? Maybe randomly simplify sometimes?
- [x] Record hall of fame
- [x] Optionally (with hyperparameter) migrate the hall of fame, rather than current bests
- [x] Test performance of reduced precision integers
    - No effect
- [x] Create struct to pass through all hyperparameters, instead of treating as constants
    - Make sure doesn't affect performance
- [x] Rename package to avoid trademark issues
    - PySR?
- [x] Put on PyPI