Spaces:
Running
Running
File size: 12,356 Bytes
009c407 841d7fc 4d9bd75 49c1f08 e987d40 49c1f08 e987d40 49c1f08 e987d40 49c1f08 cfca8a4 cb14d73 dc9d777 81463ee 1115381 8c58028 dc9d777 8c58028 cb14d73 ecc127c 009c407 ecc127c 17f6afe ecc127c f6dcb74 439159c f6dcb74 439159c f6dcb74 439159c cb14d73 17f6afe 3090581 17f6afe cb14d73 ecc127c 012bfcc dcb0894 ecc127c dcb0894 17f6afe 841d7fc 012bfcc 841d7fc 012bfcc c28a133 012bfcc 34fadcf 012bfcc c28a133 012bfcc 34fadcf 012bfcc 34fadcf 012bfcc 382662a 8dd0aec 00136f0 18afca5 3090581 e635a4f f8d4b22 9aef011 264634b f8d4b22 264634b f8d4b22 3e14433 264634b 16c9195 e0d94cd f8d4b22 37f71ff f570496 d3ad40f f570496 327e651 00136f0 c28a133 7022bb1 e0d94cd f3c0668 c2c1511 627c408 97d94c6 327e651 3e14433 a86f107 9d4c050 cedbbde 1f1f9b0 9d4c050 fdf3f16 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 |
# PySR.jl
[![DOI](https://zenodo.org/badge/295391759.svg)](https://zenodo.org/badge/latestdoi/295391759)
**Symbolic regression built on Julia, and interfaced by Python.
Uses regularized evolution, simulated annealing, and gradient-free optimization.**
Symbolic regression is a very interpretable machine learning algorithm
for low-dimensional problems: these tools search equation space
to find algebraic relations that approximate a dataset.
One can also
extend these approaches to higher-dimensional
spaces by using a neural network as proxy, as explained in
https://arxiv.org/abs/2006.11287, where we apply
it to N-body problems. Here, one essentially uses
symbolic regression to convert a neural net
to an analytic equation. Thus, these tools simultaneously present
an explicit and powerful way to interpret deep models.
*Backstory:*
Previously, we have used
[eureqa](https://www.creativemachineslab.com/eureqa.html),
which is a very efficient and user-friendly tool. However,
eureqa is GUI-only, doesn't allow for user-defined
operators, has no distributed capabilities,
and has become proprietary (and recently been merged into an online
service). Thus, the goal
of this package is to have an open-source symbolic regression tool
as efficient as eureqa, while also exposing a configurable
python interface.
# Installation
Install Julia - see [downloads](https://julialang.org/downloads/), and
then instructions for [mac](https://julialang.org/downloads/platform/#macos)
and [linux](https://julialang.org/downloads/platform/#linux_and_freebsd).
Then, at the command line,
install the `Optim` and `SpecialFunctions` packages via:
`julia -e 'import Pkg; Pkg.add("Optim"); Pkg.add("SpecialFunctions")'`.
For python, you need to have Python 3, numpy, and pandas installed.
You can install this package from PyPI with:
```
pip install pysr
```
# Quickstart
```python
import numpy as np
from pysr import pysr
# Dataset
X = 2*np.random.randn(100, 5)
y = 2*np.cos(X[:, 3]) + X[:, 0]**2 - 2
# Learn equations
equations = pysr(X, y, niterations=5,
binary_operators=["plus", "mult"],
unary_operators=["cos", "exp", "sin"])
...
print(equations)
```
which gives:
```
Complexity MSE Equation
0 5 1.947431 plus(-1.7420927, mult(x0, x0))
1 8 0.486858 plus(-1.8710494, plus(cos(x3), mult(x0, x0)))
2 11 0.000000 plus(plus(mult(x0, x0), cos(x3)), plus(-2.0, cos(x3)))
```
### Custom operators
One can define custom operators in Julia by passing a string:
```python
equations = pysr.pysr(X, y, niterations=100,
binary_operators=["mult", "plus", "special(x, y) = x^2 + y"],
unary_operators=["cos"])
```
Now, the symbolic regression code can search using this `special` function
that squares its left argument and adds it to its right. Make sure
all passed functions are valid Julia code, and take one (unary)
or two (binary) float32 scalars as input, and output a float32. Operators
are automatically vectorized.
One can also edit `operators.jl`. See below for more options.
### Weighted data
Here, we assign weights to each row of data
using inverse uncertainty squared. We also use 10 threads
instead of the usual 4, which creates more populations
(one population per thread).
```python
sigma = ...
weights = 1/sigma**2
equations = pysr.pysr(X, y, weights=weights, threads=10)
```
# Operators
All Base julia operators that take 1 or 2 float32 as input,
and output a float32 as output, are available. A selection
of these and other valid operators are stated below. You can also
define your own in `operators.jl`, and pass the function
name as a string.
**Binary**
`plus`, `mult`, `pow`, `div`, `greater`, `mod`, `beta`, `logical_or`,
`logical_and`
**Unary**
`neg`,
`exp`,
`abs`,
`logm` (=log(abs(x) + 1e-8)),
`logm10` (=log10(abs(x) + 1e-8)),
`logm2` (=log2(abs(x) + 1e-8)),
`log1p`,
`sin`,
`cos`,
`tan`,
`sinh`,
`cosh`,
`tanh`,
`asin`,
`acos`,
`atan`,
`asinh`,
`acosh`,
`atanh`,
`erf`,
`erfc`,
`gamma`,
`relu`,
`round`,
`floor`,
`ceil`,
`round`,
`sign`.
# Full API
What follows is the API reference for running the numpy interface.
You likely don't need to tune the hyperparameters yourself,
but if you would like, you can use `hyperopt.py` as an example.
However, you should adjust `threads`, `niterations`,
`binary_operators`, `unary_operators`, and `maxsize`
to your requirements.
The program will output a pandas DataFrame containing the equations,
mean square error, and complexity. It will also dump to a csv
at the end of every iteration,
which is `hall_of_fame.csv` by default. It also prints the
equations to stdout.
```python
pysr(X=None, y=None, weights=None, threads=4, niterations=100, ncyclesperiteration=300, binary_operators=["plus", "mult"], unary_operators=["cos", "exp", "sin"], alpha=0.1, annealing=True, fractionReplaced=0.10, fractionReplacedHof=0.10, npop=1000, parsimony=1e-4, migration=True, hofMigration=True, shouldOptimizeConstants=True, topn=10, weightAddNode=1, weightInsertNode=3, weightDeleteNode=3, weightDoNothing=1, weightMutateConstant=10, weightMutateOperator=1, weightRandomize=1, weightSimplify=0.01, perturbationFactor=1.0, nrestarts=3, timeout=None, equation_file='hall_of_fame.csv', test='simple1', verbosity=1e9, maxsize=20)
```
Run symbolic regression to fit f(X[i, :]) ~ y[i] for all i.
Note: most default parameters have been tuned over several example
equations, but you should adjust `threads`, `niterations`,
`binary_operators`, `unary_operators` to your requirements.
**Arguments**:
- `X`: np.ndarray, 2D array. Rows are examples, columns are features.
- `y`: np.ndarray, 1D array. Rows are examples.
- `weights`: np.ndarray, 1D array. Same shape as `y`. Optional weighted sum (e.g., 1/error^2).
- `threads`: int, Number of threads (=number of populations running).
You can have more threads than cores - it actually makes it more
efficient.
- `niterations`: int, Number of iterations of the algorithm to run. The best
equations are printed, and migrate between populations, at the
end of each.
- `ncyclesperiteration`: int, Number of total mutations to run, per 10
samples of the population, per iteration.
- `binary_operators`: list, List of strings giving the binary operators
in Julia's Base, or in `operator.jl`.
- `unary_operators`: list, Same but for operators taking a single `Float32`.
- `alpha`: float, Initial temperature.
- `annealing`: bool, Whether to use annealing. You should (and it is default).
- `fractionReplaced`: float, How much of population to replace with migrating
equations from other populations.
- `fractionReplacedHof`: float, How much of population to replace with migrating
equations from hall of fame.
- `npop`: int, Number of individuals in each population
- `parsimony`: float, Multiplicative factor for how much to punish complexity.
- `migration`: bool, Whether to migrate.
- `hofMigration`: bool, Whether to have the hall of fame migrate.
- `shouldOptimizeConstants`: bool, Whether to numerically optimize
constants (Nelder-Mead/Newton) at the end of each iteration.
- `topn`: int, How many top individuals migrate from each population.
- `nrestarts`: int, Number of times to restart the constant optimizer
- `perturbationFactor`: float, Constants are perturbed by a max
factor of (perturbationFactor\*T + 1). Either multiplied by this
or divided by this.
- `weightAddNode`: float, Relative likelihood for mutation to add a node
- `weightInsertNode`: float, Relative likelihood for mutation to insert a node
- `weightDeleteNode`: float, Relative likelihood for mutation to delete a node
- `weightDoNothing`: float, Relative likelihood for mutation to leave the individual
- `weightMutateConstant`: float, Relative likelihood for mutation to change
the constant slightly in a random direction.
- `weightMutateOperator`: float, Relative likelihood for mutation to swap
an operator.
- `weightRandomize`: float, Relative likelihood for mutation to completely
delete and then randomly generate the equation
- `weightSimplify`: float, Relative likelihood for mutation to simplify
constant parts by evaluation
- `timeout`: float, Time in seconds to timeout search
- `equation_file`: str, Where to save the files (.csv separated by |)
- `test`: str, What test to run, if X,y not passed.
- `maxsize`: int, Max size of an equation.
**Returns**:
pd.DataFrame, Results dataframe, giving complexity, MSE, and equations
(as strings).
# TODO
- [ ] Async threading, and have a server of equations. So that threads aren't waiting for others to finish.
- This is a huge bottleneck right now.
- [ ] Use @fastmath
- [ ] Refresh screen rather than dumping to stdout?
- [ ] Test suite
- [ ] Add ability to save state from python
- [ ] Add true multi-node processing, with MPI, or just file sharing. Multiple populations per core.
- [ ] Calculate feature importances based on features we've already seen, then weight those features up in all random generations.
- [ ] Calculate feature importances of future mutations, by looking at correlation between residual of model, and the features.
- Store feature importances of future, and periodically update it.
- [ ] Implement more parts of the original Eureqa algorithms: https://www.creativemachineslab.com/eureqa.html
- [ ] Experiment with freezing parts of model; then we only append/delete at end of tree.
- [ ] Sympy printing
- [ ] Sympy evaluation
- [ ] Consider adding mutation for constant<->variable
- [ ] Hierarchical model, so can re-use functional forms. Output of one equation goes into second equation?
- [ ] Use NN to generate weights over all probability distribution conditional on error and existing equation, and train on some randomly-generated equations
- [ ] Add GPU capability?
- Not sure if possible, as binary trees are the real bottleneck.
- [ ] Performance:
- [ ] Use an enum for functions instead of storing them?
- Current most expensive operations:
- [ ] Calculating the loss function - there is duplicate calculations happening.
- [x] Declaration of the weights array every iteration
- [x] Print out speed of equation evaluation over time. Measure time it takes per cycle
- [x] Add ability to pass an operator as an anonymous function string. E.g., `binary_operators=["g(x, y) = x+y"]`.
- [x] Add error bar capability (thanks Johannes Buchner for suggestion)
- [x] Why don't the constants continually change? It should optimize them every time the equation appears.
- Restart the optimizer to help with this.
- [x] Add several common unary and binary operators; list these.
- [x] Try other initial conditions for optimizer
- [x] Make scaling of changes to constant a hyperparameter
- [x] Make deletion op join deleted subtree to parent
- [x] Update hall of fame every iteration?
- Seems to overfit early if we do this.
- [x] Consider adding mutation to pass an operator in through a new binary operator (e.g., exp(x3)->plus(exp(x3), ...))
- (Added full insertion operator
- [x] Add a node at the top of a tree
- [x] Insert a node at the top of a subtree
- [x] Record very best individual in each population, and return at end.
- [x] Write our own tree copy operation; deepcopy() is the slowest operation by far.
- [x] Hyperparameter tune
- [x] Create a benchmark for accuracy
- [x] Add interface for either defining an operation to learn, or loading in arbitrary dataset.
- Could just write out the dataset in julia, or load it.
- [x] Create a Python interface
- [x] Explicit constant optimization on hall-of-fame
- Create method to find and return all constants, from left to right
- Create method to find and set all constants, in same order
- Pull up some optimization algorithm and add it. Keep the package small!
- [x] Create a benchmark for speed
- [x] Simplify subtrees with only constants beneath them. Or should I? Maybe randomly simplify sometimes?
- [x] Record hall of fame
- [x] Optionally (with hyperparameter) migrate the hall of fame, rather than current bests
- [x] Test performance of reduced precision integers
- No effect
- [x] Create struct to pass through all hyperparameters, instead of treating as constants
- Make sure doesn't affect performance
- [x] Rename package to avoid trademark issues
- PySR?
- [x] Put on PyPI
|