MilesCranmer committed
Commit c7922af
2 Parent(s): 170419e 4820cf9

Merge branch 'master' of github.com:MilesCranmer/Eureqa.jl

.readthedocs-custom-steps.yml ADDED
@@ -0,0 +1,2 @@
+ steps:
+ - pydoc-markdown --build --site-dir $SITE_DIR
.readthedocs-requirements.txt ADDED
@@ -0,0 +1,2 @@
+ pydoc-markdown
+ readthedocs-custom-steps
.readthedocs.yml ADDED
@@ -0,0 +1,8 @@
+ version: 2
+ mkdocs: {} # tell readthedocs to use mkdocs
+ python:
+   version: 3.7
+   install:
+     - method: pip
+       path: .
+     - requirements: .readthedocs-requirements.txt
README.md CHANGED
@@ -36,6 +36,7 @@ python interface.
 
 
  # Installation
+ PySR uses both Julia and Python, so you need to have both installed.
 
  Install Julia - see [downloads](https://julialang.org/downloads/), and
  then instructions for [mac](https://julialang.org/downloads/platform/#macos)
@@ -103,8 +104,10 @@ equations = pysr.pysr(X, y, niterations=100,
  Now, the symbolic regression code can search using this `special` function
  that squares its left argument and adds it to its right. Make sure
  all passed functions are valid Julia code, and take one (unary)
- or two (binary) float32 scalars as input, and output a float32. Operators
- are automatically vectorized.
+ or two (binary) float32 scalars as input, and output a float32. This means if you
+ write any real constants in your operator, like `2.5`, you have to write them
+ instead as `2.5f0`, which defines it as `Float32`.
+ Operators are automatically vectorized.
 
  We also define `extra_sympy_mappings`,
  so that the SymPy code can understand the output equation from Julia,
@@ -193,17 +196,24 @@ which is `hall_of_fame.csv` by default. It also prints the
  equations to stdout.
 
  ```python
- pysr(X=None, y=None, weights=None, procs=4, niterations=100, ncyclesperiteration=300, binary_operators=["plus", "mult"], unary_operators=["cos", "exp", "sin"], alpha=0.1, annealing=True, fractionReplaced=0.10, fractionReplacedHof=0.10, npop=1000, parsimony=1e-4, migration=True, hofMigration=True, shouldOptimizeConstants=True, topn=10, weightAddNode=1, weightInsertNode=3, weightDeleteNode=3, weightDoNothing=1, weightMutateConstant=10, weightMutateOperator=1, weightRandomize=1, weightSimplify=0.01, perturbationFactor=1.0, nrestarts=3, timeout=None, equation_file='hall_of_fame.csv', test='simple1', verbosity=1e9, maxsize=20)
+ pysr(X=None, y=None, weights=None, procs=4, populations=None, niterations=100, ncyclesperiteration=300, binary_operators=["plus", "mult"], unary_operators=["cos", "exp", "sin"], alpha=0.1, annealing=True, fractionReplaced=0.10, fractionReplacedHof=0.10, npop=1000, parsimony=1e-4, migration=True, hofMigration=True, shouldOptimizeConstants=True, topn=10, weightAddNode=1, weightInsertNode=3, weightDeleteNode=3, weightDoNothing=1, weightMutateConstant=10, weightMutateOperator=1, weightRandomize=1, weightSimplify=0.01, perturbationFactor=1.0, nrestarts=3, timeout=None, extra_sympy_mappings={}, equation_file='hall_of_fame.csv', test='simple1', verbosity=1e9, maxsize=20, fast_cycle=False, maxdepth=None, variable_names=[], select_k_features=None, threads=None, julia_optimization=3)
  ```
 
  Run symbolic regression to fit f(X[i, :]) ~ y[i] for all i.
+ Note: most default parameters have been tuned over several example
+ equations, but you should adjust `threads`, `niterations`,
+ `binary_operators`, `unary_operators` to your requirements.
 
  **Arguments**:
 
- - `X`: np.ndarray, 2D array. Rows are examples, columns are features.
+ - `X`: np.ndarray or pandas.DataFrame, 2D array. Rows are examples,
+   columns are features. If pandas DataFrame, the columns are used
+   for variable names (so make sure they don't contain spaces).
  - `y`: np.ndarray, 1D array. Rows are examples.
- - `weights`: np.ndarray, 1D array. Same shape as `y`. Optional weighted sum (e.g., 1/error^2).
- - `procs`: int, Number of processes running (=number of populations running).
+ - `weights`: np.ndarray, 1D array. Each row is how to weight the
+   mean-square-error loss on weights.
+ - `procs`: int, Number of processes (=number of populations running).
+ - `populations`: int, Number of populations running; by default=procs.
  - `niterations`: int, Number of iterations of the algorithm to run. The best
    equations are printed, and migrate between populations, at the
    end of each.
@@ -245,6 +255,18 @@ constant parts by evaluation
  - `equation_file`: str, Where to save the files (.csv separated by |)
  - `test`: str, What test to run, if X,y not passed.
  - `maxsize`: int, Max size of an equation.
+ - `maxdepth`: int, Max depth of an equation. You can use both maxsize and maxdepth.
+   maxdepth is by default set to = maxsize, which means that it is redundant.
+ - `fast_cycle`: bool, (experimental) - batch over population subsamples. This
+   is a slightly different algorithm than regularized evolution, but does cycles
+   15% faster. May be algorithmically less efficient.
+ - `variable_names`: list, a list of names for the variables, other
+   than "x0", "x1", etc.
+ - `select_k_features`: (None, int), whether to run feature selection in
+   Python using random forests, before passing to the symbolic regression
+   code. None means no feature selection; an int means select that many
+   features.
+ - `julia_optimization`: int, Optimization level (0, 1, 2, 3)
 
  **Returns**:
 
@@ -311,6 +333,7 @@ pd.DataFrame, Results dataframe, giving complexity, MSE, and equations
  ## Feature ideas
 
  - [ ] Cross-validation
+ - [ ] read the docs page
  - [ ] Sympy printing
  - [ ] Better cleanup of zombie processes after <ctl-c>
  - [ ] Hierarchical model, so can re-use functional forms. Output of one equation goes into second equation?
@@ -319,13 +342,14 @@ pd.DataFrame, Results dataframe, giving complexity, MSE, and equations
  - [ ] Refresh screen rather than dumping to stdout?
  - [ ] Add ability to save state from python
  - [ ] Additional degree operators?
- - [ ] Multi targets (vector ops)
+ - [ ] Multi targets (vector ops). Idea 1: Node struct contains argument for which registers it is applied to. Then, can work with multiple components simultaneously. Though this may be tricky to get right. Idea 2: each op is defined by input/output space. Some operators are flexible, and the spaces should be adjusted automatically. Otherwise, only consider ops that make a tree possible. But will need additional ops here to get it to work. Idea 3: define each equation in 2 parts: one part that is shared between all outputs, and one that is different between all outputs. Maybe this could be an array of nodes corresponding to each output. And those nodes would define their functions.
  - [ ] Tree crossover? I.e., can take as input a part of the same equation, so long as it is the same level or below?
  - [ ] Consider printing output sorted by score, not by complexity.
  - [ ] Dump scores alongside MSE to .csv (and return with Pandas).
  - [ ] Create flexible way of providing "simplification recipes." I.e., plus(plus(T, C), C) => plus(T, +(C, C)). The user could pass these.
  - [ ] Consider allowing multi-threading turned off, for faster testing (cache issue on travis). Or could simply fix the caching issue there.
  - [ ] Consider returning only the equation of interest; rather than all equations.
+ - [ ] Enable derivative operators. These would differentiate their right argument wrt their left argument, some input variable.
 
  ## Algorithmic performance ideas:
 
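To make the README changes above concrete, here is a minimal usage sketch combining the newly documented pieces: pandas DataFrame input, a custom operator with a `Float32` constant (`2.5f0` rather than `2.5`), `extra_sympy_mappings`, and the new `populations` and `maxdepth` arguments. The toy data, column names, and the `special` operator are illustrative assumptions, not part of this commit.

```python
import numpy as np
import pandas as pd
import pysr

# Illustrative data only: DataFrame columns become variable names in the output.
X = pd.DataFrame(np.random.randn(100, 3), columns=["mass", "radius", "velocity"])
y = np.array(X["mass"] ** 2 + 2.5 * X["velocity"])

equations = pysr.pysr(
    X, y,
    niterations=5,
    binary_operators=[
        "plus", "mult",
        # Custom Julia operator: real constants must be Float32 literals (2.5f0, not 2.5).
        "special(x, y) = x^2 + 2.5f0*y",
    ],
    unary_operators=["cos"],
    # Let SymPy parse the custom operator in the returned equations.
    extra_sympy_mappings={"special": lambda x, y: x**2 + 2.5 * y},
    populations=8,   # defaults to procs when None
    maxdepth=10,     # can be combined with maxsize
)
print(equations)  # pd.DataFrame with Complexity, MSE, score, Equation, ...
```

Passing `select_k_features=2` here would additionally pre-filter the columns with the tree-based selector added in `pysr/sr.py` below.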
julia/sr.jl CHANGED
@@ -102,21 +102,6 @@ function copyNode(tree::Node)::Node
      end
  end
 
- # Evaluate a symbolic equation:
- function evalTree(tree::Node, x::Array{Float32, 1}=Float32[])::Float32
-     if tree.degree == 0
-         if tree.constant
-             return tree.val
-         else
-             return x[tree.val]
-         end
-     elseif tree.degree == 1
-         return unaops[tree.op](evalTree(tree.l, x))
-     else
-         return binops[tree.op](evalTree(tree.l, x), evalTree(tree.r, x))
-     end
- end
-
  # Count the operators, constants, variables in an equation
  function countNodes(tree::Node)::Integer
      if tree.degree == 0
@@ -405,35 +390,6 @@ function appendRandomOp(tree::Node)::Node
      return tree
  end
 
- # Add random node to the top of a tree
- function popRandomOp(tree::Node)::Node
-     node = tree
-     choice = rand()
-     makeNewBinOp = choice < nbin/nops
-     left = tree
-
-     if makeNewBinOp
-         right = randomConstantNode()
-         newnode = Node(
-             rand(1:length(binops)),
-             left,
-             right
-         )
-     else
-         newnode = Node(
-             rand(1:length(unaops)),
-             left
-         )
-     end
-     node.l = newnode.l
-     node.r = newnode.r
-     node.op = newnode.op
-     node.degree = newnode.degree
-     node.val = newnode.val
-     node.constant = newnode.constant
-     return node
- end
-
  # Insert random node
  function insertRandomOp(tree::Node)::Node
      node = randomNode(tree)
@@ -897,7 +853,7 @@ function testConfiguration()
              test_output = unaop.(left)
          end
      end
-     catch
+     catch error
          @printf("\n\nYour configuration is invalid - one of your operators is not well-defined over the real line.\n\n\n")
          throw(error)
      end
pydoc-markdown.yml ADDED
@@ -0,0 +1,19 @@
+ loaders:
+   - type: python
+ processors:
+   - type: filter
+   - type: smart
+   - type: crossref
+ renderer:
+   type: mkdocs
+   pages:
+     - title: Home
+       name: index
+       source: README.md
+     - title: API Documentation
+       contents:
+         - '*'
+   mkdocs_config:
+   mkdocs_config:
+     site_name: PySR
+     theme: readthedocs
pysr/sr.py CHANGED
@@ -78,6 +78,7 @@ def pysr(X=None, y=None, weights=None,
          variable_names=[],
          batching=False,
          batchSize=50,
+         select_k_features=None,
          threads=None, #deprecated
          julia_optimization=3,
          ):
@@ -86,7 +87,9 @@ def pysr(X=None, y=None, weights=None,
      equations, but you should adjust `threads`, `niterations`,
      `binary_operators`, `unary_operators` to your requirements.
 
-     :param X: np.ndarray, 2D array. Rows are examples, columns are features.
+     :param X: np.ndarray or pandas.DataFrame, 2D array. Rows are examples,
+         columns are features. If pandas DataFrame, the columns are used
+         for variable names (so make sure they don't contain spaces).
      :param y: np.ndarray, 1D array. Rows are examples.
      :param weights: np.ndarray, 1D array. Each row is how to weight the
          mean-square-error loss on weights.
@@ -144,6 +147,10 @@ def pysr(X=None, y=None, weights=None,
          during evolution. Still uses full dataset for comparing against
          hall of fame.
      :param batchSize: int, the amount of data to use if doing batching.
+     :param select_k_features: (None, int), whether to run feature selection in
+         Python using random forests, before passing to the symbolic regression
+         code. None means no feature selection; an int means select that many
+         features.
      :param julia_optimization: int, Optimization level (0, 1, 2, 3)
      :returns: pd.DataFrame, Results dataframe, giving complexity, MSE, and equations
          (as strings).
@@ -154,6 +161,12 @@ def pysr(X=None, y=None, weights=None,
      if maxdepth is None:
          maxdepth = maxsize
 
+     if isinstance(X, pd.DataFrame):
+         variable_names = list(X.columns)
+         X = np.array(X)
+
+     use_custom_variable_names = (len(variable_names) != 0)
+
      # Check for potential errors before they happen
      assert len(unary_operators) + len(binary_operators) > 0
      assert len(X.shape) == 2
@@ -162,9 +175,17 @@ def pysr(X=None, y=None, weights=None,
      if weights is not None:
          assert len(weights.shape) == 1
          assert X.shape[0] == weights.shape[0]
-     if len(variable_names) != 0:
+     if use_custom_variable_names:
          assert len(variable_names) == X.shape[1]
 
+     if select_k_features is not None:
+         selection = run_feature_selection(X, y, select_k_features)
+         print(f"Using features {selection}")
+         X = X[:, selection]
+
+         if use_custom_variable_names:
+             variable_names = variable_names[selection]
+
      if populations is None:
          populations = procs
 
@@ -235,7 +256,7 @@ const annealing = {"true" if annealing else "false"}
 const weighted = {"true" if weights is not None else "false"}
 const batching = {"true" if batching else "false"}
 const batchSize = {min([batchSize, len(X)]) if batching else len(X):d}
- const useVarMap = {"false" if len(variable_names) == 0 else "true"}
+ const useVarMap = {"true" if use_custom_variable_names else "false"}
 const mutationWeights = [
 {weightMutateConstant:f},
 {weightMutateOperator:f},
@@ -262,7 +283,7 @@ const y = convert(Array{Float32, 1}, """f"{y_str})"
          def_datasets += """
 const weights = convert(Array{Float32, 1}, """f"{weight_str})"
 
-     if len(variable_names) != 0:
+     if use_custom_variable_names:
          def_hyperparams += f"""
 const varMap = {'["' + '", "'.join(variable_names) + '"]'}"""
 
@@ -301,7 +322,7 @@ const varMap = {'["' + '", "'.join(variable_names) + '"]'}"""
      lastComplexity = 0
      sympy_format = []
      lambda_format = []
-     if len(variable_names) != 0:
+     if use_custom_variable_names:
          sympy_symbols = [sympy.Symbol(variable_names[i]) for i in range(X.shape[1])]
      else:
          sympy_symbols = [sympy.Symbol('x%d'%i) for i in range(X.shape[1])]
@@ -328,3 +349,18 @@ const varMap = {'["' + '", "'.join(variable_names) + '"]'}"""
      return output[['Complexity', 'MSE', 'score', 'Equation', 'sympy_format', 'lambda_format']]
 
 
+ def run_feature_selection(X, y, select_k_features):
+     """Use a gradient boosting tree regressor as a proxy for finding
+     the k most important features in X, returning indices for those
+     features as output."""
+
+     from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
+     from sklearn.feature_selection import SelectFromModel, SelectKBest
+
+     clf = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0, loss='ls') #RandomForestRegressor()
+     clf.fit(X, y)
+     selector = SelectFromModel(clf, threshold=-np.inf,
+                                max_features=select_k_features, prefit=True)
+     return selector.get_support(indices=True)
+
+
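For readers unfamiliar with the scikit-learn pieces used in `run_feature_selection` above: fitting a `GradientBoostingRegressor` and wrapping it in `SelectFromModel` with `threshold=-np.inf` disables the importance cutoff, so `max_features` alone decides how many columns survive. A standalone sketch of that selection step follows; the synthetic data is an assumption, purely for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data: only columns 0 and 3 actually drive y.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = 3.0 * X[:, 0] - 2.0 * X[:, 3]

clf = GradientBoostingRegressor(n_estimators=100, max_depth=1, random_state=0)
clf.fit(X, y)

# prefit=True reuses the already-fitted model; get_support returns column indices.
selector = SelectFromModel(clf, threshold=-np.inf, max_features=2, prefit=True)
print(selector.get_support(indices=True))  # expected: [0 3]
```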