MilesCranmer committed
Commit c7922af
2 Parent(s): 170419e 4820cf9

Merge branch 'master' of github.com:MilesCranmer/Eureqa.jl

.readthedocs-custom-steps.yml ADDED
@@ -0,0 +1,2 @@
+ steps:
+ - pydoc-markdown --build --site-dir $SITE_DIR
.readthedocs-requirements.txt ADDED
@@ -0,0 +1,2 @@
+ pydoc-markdown
+ readthedocs-custom-steps
.readthedocs.yml ADDED
@@ -0,0 +1,8 @@
+ version: 2
+ mkdocs: {} # tell readthedocs to use mkdocs
+ python:
+   version: 3.7
+   install:
+     - method: pip
+       path: .
+     - requirements: .readthedocs-requirements.txt
README.md CHANGED
@@ -36,6 +36,7 @@ python interface.
 
 
  # Installation
+ PySR uses both Julia and Python, so you need to have both installed.
 
  Install Julia - see [downloads](https://julialang.org/downloads/), and
  then instructions for [mac](https://julialang.org/downloads/platform/#macos)
@@ -103,8 +104,10 @@ equations = pysr.pysr(X, y, niterations=100,
  Now, the symbolic regression code can search using this `special` function
  that squares its left argument and adds it to its right. Make sure
  all passed functions are valid Julia code, and take one (unary)
- or two (binary) float32 scalars as input, and output a float32. Operators
- are automatically vectorized.
+ or two (binary) float32 scalars as input, and output a float32. This means if you
+ write any real constants in your operator, like `2.5`, you have to write them
+ instead as `2.5f0`, which defines it as `Float32`.
+ Operators are automatically vectorized.
 
  We also define `extra_sympy_mappings`,
  so that the SymPy code can understand the output equation from Julia,
@@ -193,17 +196,24 @@ which is `hall_of_fame.csv` by default. It also prints the
  equations to stdout.
 
  ```python
- pysr(X=None, y=None, weights=None, procs=4, niterations=100, ncyclesperiteration=300, binary_operators=["plus", "mult"], unary_operators=["cos", "exp", "sin"], alpha=0.1, annealing=True, fractionReplaced=0.10, fractionReplacedHof=0.10, npop=1000, parsimony=1e-4, migration=True, hofMigration=True, shouldOptimizeConstants=True, topn=10, weightAddNode=1, weightInsertNode=3, weightDeleteNode=3, weightDoNothing=1, weightMutateConstant=10, weightMutateOperator=1, weightRandomize=1, weightSimplify=0.01, perturbationFactor=1.0, nrestarts=3, timeout=None, equation_file='hall_of_fame.csv', test='simple1', verbosity=1e9, maxsize=20)
+ pysr(X=None, y=None, weights=None, procs=4, populations=None, niterations=100, ncyclesperiteration=300, binary_operators=["plus", "mult"], unary_operators=["cos", "exp", "sin"], alpha=0.1, annealing=True, fractionReplaced=0.10, fractionReplacedHof=0.10, npop=1000, parsimony=1e-4, migration=True, hofMigration=True, shouldOptimizeConstants=True, topn=10, weightAddNode=1, weightInsertNode=3, weightDeleteNode=3, weightDoNothing=1, weightMutateConstant=10, weightMutateOperator=1, weightRandomize=1, weightSimplify=0.01, perturbationFactor=1.0, nrestarts=3, timeout=None, extra_sympy_mappings={}, equation_file='hall_of_fame.csv', test='simple1', verbosity=1e9, maxsize=20, fast_cycle=False, maxdepth=None, variable_names=[], select_k_features=None, threads=None, julia_optimization=3)
  ```
 
  Run symbolic regression to fit f(X[i, :]) ~ y[i] for all i.
+ Note: most default parameters have been tuned over several example
+ equations, but you should adjust `threads`, `niterations`,
+ `binary_operators`, `unary_operators` to your requirements.
 
  **Arguments**:
 
- - `X`: np.ndarray, 2D array. Rows are examples, columns are features.
+ - `X`: np.ndarray or pandas.DataFrame, 2D array. Rows are examples,
+   columns are features. If pandas DataFrame, the columns are used
+   for variable names (so make sure they don't contain spaces).
  - `y`: np.ndarray, 1D array. Rows are examples.
- - `weights`: np.ndarray, 1D array. Same shape as `y`. Optional weighted sum (e.g., 1/error^2).
- - `procs`: int, Number of processes running (=number of populations running).
+ - `weights`: np.ndarray, 1D array. Each row is how to weight the
+   mean-square-error loss on weights.
+ - `procs`: int, Number of processes (=number of populations running).
+ - `populations`: int, Number of populations running; by default=procs.
  - `niterations`: int, Number of iterations of the algorithm to run. The best
    equations are printed, and migrate between populations, at the
    end of each.
@@ -245,6 +255,18 @@ constant parts by evaluation
  - `equation_file`: str, Where to save the files (.csv separated by |)
  - `test`: str, What test to run, if X,y not passed.
  - `maxsize`: int, Max size of an equation.
+ - `maxdepth`: int, Max depth of an equation. You can use both maxsize and maxdepth.
+   maxdepth is by default set to = maxsize, which means that it is redundant.
+ - `fast_cycle`: bool, (experimental) - batch over population subsamples. This
+   is a slightly different algorithm than regularized evolution, but does cycles
+   15% faster. May be algorithmically less efficient.
+ - `variable_names`: list, a list of names for the variables, other
+   than "x0", "x1", etc.
+ - `select_k_features`: (None, int), whether to run feature selection in
+   Python using random forests, before passing to the symbolic regression
+   code. None means no feature selection; an int means select that many
+   features.
+ - `julia_optimization`: int, Optimization level (0, 1, 2, 3)
 
  **Returns**:
 
@@ -311,6 +333,7 @@ pd.DataFrame, Results dataframe, giving complexity, MSE, and equations
  ## Feature ideas
 
  - [ ] Cross-validation
+ - [ ] read the docs page
  - [ ] Sympy printing
  - [ ] Better cleanup of zombie processes after <ctl-c>
  - [ ] Hierarchical model, so can re-use functional forms. Output of one equation goes into second equation?
@@ -319,13 +342,14 @@ pd.DataFrame, Results dataframe, giving complexity, MSE, and equations
  - [ ] Refresh screen rather than dumping to stdout?
  - [ ] Add ability to save state from python
  - [ ] Additional degree operators?
- - [ ] Multi targets (vector ops)
+ - [ ] Multi targets (vector ops). Idea 1: Node struct contains argument for which registers it is applied to. Then, can work with multiple components simultaneously. Though this may be tricky to get right. Idea 2: each op is defined by input/output space. Some operators are flexible, and the spaces should be adjusted automatically. Otherwise, only consider ops that make a tree possible. But will need additional ops here to get it to work. Idea 3: define each equation in 2 parts: one part that is shared between all outputs, and one that is different between all outputs. Maybe this could be an array of nodes corresponding to each output. And those nodes would define their functions.
  - [ ] Tree crossover? I.e., can take as input a part of the same equation, so long as it is the same level or below?
  - [ ] Consider printing output sorted by score, not by complexity.
  - [ ] Dump scores alongside MSE to .csv (and return with Pandas).
  - [ ] Create flexible way of providing "simplification recipes." I.e., plus(plus(T, C), C) => plus(T, +(C, C)). The user could pass these.
  - [ ] Consider allowing multi-threading turned off, for faster testing (cache issue on travis). Or could simply fix the caching issue there.
  - [ ] Consider returning only the equation of interest; rather than all equations.
+ - [ ] Enable derivative operators. These would differentiate their right argument wrt their left argument, some input variable.
 
  ## Algorithmic performance ideas:
 
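To make the README changes above concrete, here is a minimal usage sketch combining the newly documented pieces: pandas DataFrame input, a custom operator with a `Float32` constant (`2.5f0` rather than `2.5`), `extra_sympy_mappings`, and the new `populations` and `maxdepth` arguments. The toy data, column names, and the `special` operator are illustrative assumptions, not part of this commit.

```python
import numpy as np
import pandas as pd
import pysr

# Illustrative data only: DataFrame columns become variable names in the output.
X = pd.DataFrame(np.random.randn(100, 3), columns=["mass", "radius", "velocity"])
y = np.array(X["mass"] ** 2 + 2.5 * X["velocity"])

equations = pysr.pysr(
    X, y,
    niterations=5,
    binary_operators=[
        "plus", "mult",
        # Custom Julia operator: real constants must be Float32 literals (2.5f0, not 2.5).
        "special(x, y) = x^2 + 2.5f0*y",
    ],
    unary_operators=["cos"],
    # Let SymPy parse the custom operator in the returned equations.
    extra_sympy_mappings={"special": lambda x, y: x**2 + 2.5 * y},
    populations=8,   # defaults to procs when None
    maxdepth=10,     # can be combined with maxsize
)
print(equations)  # pd.DataFrame with Complexity, MSE, score, Equation, ...
```

Passing `select_k_features=2` here would additionally pre-filter the columns with the tree-based selector added in `pysr/sr.py` below.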
julia/sr.jl CHANGED
@@ -102,21 +102,6 @@ function copyNode(tree::Node)::Node
      end
  end
 
- # Evaluate a symbolic equation:
- function evalTree(tree::Node, x::Array{Float32, 1}=Float32[])::Float32
-     if tree.degree == 0
-         if tree.constant
-             return tree.val
-         else
-             return x[tree.val]
-         end
-     elseif tree.degree == 1
-         return unaops[tree.op](evalTree(tree.l, x))
-     else
-         return binops[tree.op](evalTree(tree.l, x), evalTree(tree.r, x))
-     end
- end
-
  # Count the operators, constants, variables in an equation
  function countNodes(tree::Node)::Integer
      if tree.degree == 0
@@ -405,35 +390,6 @@ function appendRandomOp(tree::Node)::Node
      return tree
  end
 
- # Add random node to the top of a tree
- function popRandomOp(tree::Node)::Node
-     node = tree
-     choice = rand()
-     makeNewBinOp = choice < nbin/nops
-     left = tree
-
-     if makeNewBinOp
-         right = randomConstantNode()
-         newnode = Node(
-             rand(1:length(binops)),
-             left,
-             right
-         )
-     else
-         newnode = Node(
-             rand(1:length(unaops)),
-             left
-         )
-     end
-     node.l = newnode.l
-     node.r = newnode.r
-     node.op = newnode.op
-     node.degree = newnode.degree
-     node.val = newnode.val
-     node.constant = newnode.constant
-     return node
- end
-
  # Insert random node
  function insertRandomOp(tree::Node)::Node
      node = randomNode(tree)
@@ -897,7 +853,7 @@ function testConfiguration()
              test_output = unaop.(left)
          end
      end
-     catch
+     catch error
          @printf("\n\nYour configuration is invalid - one of your operators is not well-defined over the real line.\n\n\n")
          throw(error)
      end
pydoc-markdown.yml ADDED
@@ -0,0 +1,19 @@
+ loaders:
+   - type: python
+ processors:
+   - type: filter
+   - type: smart
+   - type: crossref
+ renderer:
+   type: mkdocs
+   pages:
+     - title: Home
+       name: index
+       source: README.md
+     - title: API Documentation
+       contents:
+         - '*'
+   mkdocs_config:
+   mkdocs_config:
+     site_name: PySR
+     theme: readthedocs
pysr/sr.py CHANGED
@@ -78,6 +78,7 @@ def pysr(X=None, y=None, weights=None,
          variable_names=[],
          batching=False,
          batchSize=50,
+         select_k_features=None,
          threads=None, #deprecated
          julia_optimization=3,
          ):
@@ -86,7 +87,9 @@ def pysr(X=None, y=None, weights=None,
      equations, but you should adjust `threads`, `niterations`,
      `binary_operators`, `unary_operators` to your requirements.
 
-     :param X: np.ndarray, 2D array. Rows are examples, columns are features.
+     :param X: np.ndarray or pandas.DataFrame, 2D array. Rows are examples,
+         columns are features. If pandas DataFrame, the columns are used
+         for variable names (so make sure they don't contain spaces).
      :param y: np.ndarray, 1D array. Rows are examples.
      :param weights: np.ndarray, 1D array. Each row is how to weight the
          mean-square-error loss on weights.
@@ -144,6 +147,10 @@ def pysr(X=None, y=None, weights=None,
          during evolution. Still uses full dataset for comparing against
          hall of fame.
      :param batchSize: int, the amount of data to use if doing batching.
+     :param select_k_features: (None, int), whether to run feature selection in
+         Python using random forests, before passing to the symbolic regression
+         code. None means no feature selection; an int means select that many
+         features.
      :param julia_optimization: int, Optimization level (0, 1, 2, 3)
      :returns: pd.DataFrame, Results dataframe, giving complexity, MSE, and equations
          (as strings).
@@ -154,6 +161,12 @@ def pysr(X=None, y=None, weights=None,
      if maxdepth is None:
          maxdepth = maxsize
 
+     if isinstance(X, pd.DataFrame):
+         variable_names = list(X.columns)
+         X = np.array(X)
+
+     use_custom_variable_names = (len(variable_names) != 0)
+
      # Check for potential errors before they happen
      assert len(unary_operators) + len(binary_operators) > 0
      assert len(X.shape) == 2
@@ -162,9 +175,17 @@ def pysr(X=None, y=None, weights=None,
      if weights is not None:
          assert len(weights.shape) == 1
          assert X.shape[0] == weights.shape[0]
-     if len(variable_names) != 0:
+     if use_custom_variable_names:
          assert len(variable_names) == X.shape[1]
 
+     if select_k_features is not None:
+         selection = run_feature_selection(X, y, select_k_features)
+         print(f"Using features {selection}")
+         X = X[:, selection]
+
+         if use_custom_variable_names:
+             variable_names = variable_names[selection]
+
      if populations is None:
          populations = procs
 
@@ -235,7 +256,7 @@ const annealing = {"true" if annealing else "false"}
 const weighted = {"true" if weights is not None else "false"}
 const batching = {"true" if batching else "false"}
 const batchSize = {min([batchSize, len(X)]) if batching else len(X):d}
- const useVarMap = {"false" if len(variable_names) == 0 else "true"}
+ const useVarMap = {"true" if use_custom_variable_names else "false"}
 const mutationWeights = [
 {weightMutateConstant:f},
 {weightMutateOperator:f},
@@ -262,7 +283,7 @@ const y = convert(Array{Float32, 1}, """f"{y_str})"
          def_datasets += """
 const weights = convert(Array{Float32, 1}, """f"{weight_str})"
 
-     if len(variable_names) != 0:
+     if use_custom_variable_names:
          def_hyperparams += f"""
 const varMap = {'["' + '", "'.join(variable_names) + '"]'}"""
 
@@ -301,7 +322,7 @@ const varMap = {'["' + '", "'.join(variable_names) + '"]'}"""
      lastComplexity = 0
      sympy_format = []
      lambda_format = []
-     if len(variable_names) != 0:
+     if use_custom_variable_names:
          sympy_symbols = [sympy.Symbol(variable_names[i]) for i in range(X.shape[1])]
      else:
          sympy_symbols = [sympy.Symbol('x%d'%i) for i in range(X.shape[1])]
@@ -328,3 +349,18 @@ const varMap = {'["' + '", "'.join(variable_names) + '"]'}"""
      return output[['Complexity', 'MSE', 'score', 'Equation', 'sympy_format', 'lambda_format']]
 
 
+ def run_feature_selection(X, y, select_k_features):
+     """Use a gradient boosting tree regressor as a proxy for finding
+     the k most important features in X, returning indices for those
+     features as output."""
+
+     from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
+     from sklearn.feature_selection import SelectFromModel, SelectKBest
+
+     clf = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0, loss='ls') #RandomForestRegressor()
+     clf.fit(X, y)
+     selector = SelectFromModel(clf, threshold=-np.inf,
+                                max_features=select_k_features, prefit=True)
+     return selector.get_support(indices=True)
+
+
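For readers unfamiliar with the scikit-learn pieces used in `run_feature_selection` above: fitting a `GradientBoostingRegressor` and wrapping it in `SelectFromModel` with `threshold=-np.inf` disables the importance cutoff, so `max_features` alone decides how many columns survive. A standalone sketch of that selection step follows; the synthetic data is an assumption, purely for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data: only columns 0 and 3 actually drive y.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = 3.0 * X[:, 0] - 2.0 * X[:, 3]

clf = GradientBoostingRegressor(n_estimators=100, max_depth=1, random_state=0)
clf.fit(X, y)

# prefit=True reuses the already-fitted model; get_support returns column indices.
selector = SelectFromModel(clf, threshold=-np.inf, max_features=2, prefit=True)
print(selector.get_support(indices=True))  # expected: [0 3]
```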