MilesCranmer committed:
Merge branch 'master' of github.com:MilesCranmer/Eureqa.jl
Files changed:
- .readthedocs-custom-steps.yml +2 -0
- .readthedocs-requirements.txt +2 -0
- .readthedocs.yml +8 -0
- README.md +31 -7
- julia/sr.jl +1 -45
- pydoc-markdown.yml +19 -0
- pysr/sr.py +41 -5
.readthedocs-custom-steps.yml
ADDED
@@ -0,0 +1,2 @@
+steps:
+- pydoc-markdown --build --site-dir $SITE_DIR
.readthedocs-requirements.txt
ADDED
@@ -0,0 +1,2 @@
+pydoc-markdown
+readthedocs-custom-steps
.readthedocs.yml
ADDED
@@ -0,0 +1,8 @@
+version: 2
+mkdocs: {} # tell readthedocs to use mkdocs
+python:
+  version: 3.7
+  install:
+    - method: pip
+      path: .
+    - requirements: .readthedocs-requirements.txt
README.md
CHANGED
@@ -36,6 +36,7 @@ python interface.
 
 
 # Installation
+PySR uses both Julia and Python, so you need to have both installed.
 
 Install Julia - see [downloads](https://julialang.org/downloads/), and
 then instructions for [mac](https://julialang.org/downloads/platform/#macos)
@@ -103,8 +104,10 @@ equations = pysr.pysr(X, y, niterations=100,
 Now, the symbolic regression code can search using this `special` function
 that squares its left argument and adds it to its right. Make sure
 all passed functions are valid Julia code, and take one (unary)
-or two (binary) float32 scalars as input, and output a float32.
-
+or two (binary) float32 scalars as input, and output a float32. This means if you
+write any real constants in your operator, like `2.5`, you have to write them
+instead as `2.5f0`, which defines it as `Float32`.
+Operators are automatically vectorized.
 
 We also define `extra_sympy_mappings`,
 so that the SymPy code can understand the output equation from Julia,
@@ -193,17 +196,24 @@ which is `hall_of_fame.csv` by default. It also prints the
 equations to stdout.
 
 ```python
-pysr(X=None, y=None, weights=None, procs=4, niterations=100, ncyclesperiteration=300, binary_operators=["plus", "mult"], unary_operators=["cos", "exp", "sin"], alpha=0.1, annealing=True, fractionReplaced=0.10, fractionReplacedHof=0.10, npop=1000, parsimony=1e-4, migration=True, hofMigration=True, shouldOptimizeConstants=True, topn=10, weightAddNode=1, weightInsertNode=3, weightDeleteNode=3, weightDoNothing=1, weightMutateConstant=10, weightMutateOperator=1, weightRandomize=1, weightSimplify=0.01, perturbationFactor=1.0, nrestarts=3, timeout=None, equation_file='hall_of_fame.csv', test='simple1', verbosity=1e9, maxsize=20)
+pysr(X=None, y=None, weights=None, procs=4, populations=None, niterations=100, ncyclesperiteration=300, binary_operators=["plus", "mult"], unary_operators=["cos", "exp", "sin"], alpha=0.1, annealing=True, fractionReplaced=0.10, fractionReplacedHof=0.10, npop=1000, parsimony=1e-4, migration=True, hofMigration=True, shouldOptimizeConstants=True, topn=10, weightAddNode=1, weightInsertNode=3, weightDeleteNode=3, weightDoNothing=1, weightMutateConstant=10, weightMutateOperator=1, weightRandomize=1, weightSimplify=0.01, perturbationFactor=1.0, nrestarts=3, timeout=None, extra_sympy_mappings={}, equation_file='hall_of_fame.csv', test='simple1', verbosity=1e9, maxsize=20, fast_cycle=False, maxdepth=None, variable_names=[], select_k_features=None, threads=None, julia_optimization=3)
 ```
 
 Run symbolic regression to fit f(X[i, :]) ~ y[i] for all i.
+Note: most default parameters have been tuned over several example
+equations, but you should adjust `threads`, `niterations`,
+`binary_operators`, `unary_operators` to your requirements.
 
 **Arguments**:
 
-- `X`: np.ndarray, 2D array. Rows are examples,
+- `X`: np.ndarray or pandas.DataFrame, 2D array. Rows are examples,
+columns are features. If pandas DataFrame, the columns are used
+for variable names (so make sure they don't contain spaces).
 - `y`: np.ndarray, 1D array. Rows are examples.
-- `weights`: np.ndarray, 1D array.
-
+- `weights`: np.ndarray, 1D array. Each row is how to weight the
+mean-square-error loss on weights.
+- `procs`: int, Number of processes (=number of populations running).
+- `populations`: int, Number of populations running; by default=procs.
 - `niterations`: int, Number of iterations of the algorithm to run. The best
 equations are printed, and migrate between populations, at the
 end of each.
@@ -245,6 +255,18 @@ constant parts by evaluation
 - `equation_file`: str, Where to save the files (.csv separated by |)
 - `test`: str, What test to run, if X,y not passed.
 - `maxsize`: int, Max size of an equation.
+- `maxdepth`: int, Max depth of an equation. You can use both maxsize and maxdepth.
+maxdepth is by default set to = maxsize, which means that it is redundant.
+- `fast_cycle`: bool, (experimental) - batch over population subsamples. This
+is a slightly different algorithm than regularized evolution, but does cycles
+15% faster. May be algorithmically less efficient.
+- `variable_names`: list, a list of names for the variables, other
+than "x0", "x1", etc.
+- `select_k_features`: (None, int), whether to run feature selection in
+Python using random forests, before passing to the symbolic regression
+code. None means no feature selection; an int means select that many
+features.
+- `julia_optimization`: int, Optimization level (0, 1, 2, 3)
 
 **Returns**:
 
@@ -311,6 +333,7 @@ pd.DataFrame, Results dataframe, giving complexity, MSE, and equations
 ## Feature ideas
 
 - [ ] Cross-validation
+- [ ] read the docs page
 - [ ] Sympy printing
 - [ ] Better cleanup of zombie processes after <ctl-c>
 - [ ] Hierarchical model, so can re-use functional forms. Output of one equation goes into second equation?
@@ -319,13 +342,14 @@ pd.DataFrame, Results dataframe, giving complexity, MSE, and equations
 - [ ] Refresh screen rather than dumping to stdout?
 - [ ] Add ability to save state from python
 - [ ] Additional degree operators?
-- [ ] Multi targets (vector ops)
+- [ ] Multi targets (vector ops). Idea 1: Node struct contains argument for which registers it is applied to. Then, can work with multiple components simultaneously. Though this may be tricky to get right. Idea 2: each op is defined by input/output space. Some operators are flexible, and the spaces should be adjusted automatically. Otherwise, only consider ops that make a tree possible. But will need additional ops here to get it to work. Idea 3: define each equation in 2 parts: one part that is shared between all outputs, and one that is different between all outputs. Maybe this could be an array of nodes corresponding to each output. And those nodes would define their functions.
 - [ ] Tree crossover? I.e., can take as input a part of the same equation, so long as it is the same level or below?
 - [ ] Consider printing output sorted by score, not by complexity.
 - [ ] Dump scores alongside MSE to .csv (and return with Pandas).
 - [ ] Create flexible way of providing "simplification recipes." I.e., plus(plus(T, C), C) => plus(T, +(C, C)). The user could pass these.
 - [ ] Consider allowing multi-threading turned off, for faster testing (cache issue on travis). Or could simply fix the caching issue there.
 - [ ] Consider returning only the equation of interest; rather than all equations.
+- [ ] Enable derivative operators. These would differentiate their right argument wrt their left argument, some input variable.
 
 ## Algorithmic performance ideas:
 
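The Float32 requirement in the hunk above is easy to trip over, so here is a short sketch of the documented workflow. It is illustrative only: the operator name `myop`, the data, and the constant are hypothetical, while the call shape follows the README's `equations = pysr.pysr(X, y, niterations=100, ...)` example.

```python
# Hypothetical custom operator following the README's rules: valid Julia,
# two Float32 scalars in, Float32 out, with the constant written as 2.5f0.
import numpy as np
import pysr

X = np.random.randn(100, 5)
y = X[:, 0] ** 2 + 2.5 * X[:, 1]

equations = pysr.pysr(
    X, y, niterations=100,
    binary_operators=["plus", "mult", "myop(x, y) = x^2 + 2.5f0*y"],
    # extra_sympy_mappings lets SymPy parse the returned equations:
    extra_sympy_mappings={"myop": lambda x, y: x**2 + 2.5 * y},
)
```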
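A second sketch, showing the arguments this commit documents: a pandas DataFrame supplying variable names, `select_k_features` for pre-filtering inputs, and `maxdepth` alongside `maxsize`. The column names and data are hypothetical.

```python
import numpy as np
import pandas as pd
import pysr

# DataFrame columns double as variable names in printed equations (no spaces).
X = pd.DataFrame(np.random.randn(100, 5),
                 columns=["T", "P", "rho", "v", "mu"])
y = np.array(X["T"]) ** 2 + np.array(X["P"])

equations = pysr.pysr(
    X, y, niterations=100,
    select_k_features=2,  # keep only the 2 most informative columns
    maxdepth=10,          # bound tree depth in addition to maxsize
)
```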
julia/sr.jl
CHANGED
@@ -102,21 +102,6 @@ function copyNode(tree::Node)::Node
     end
 end
 
-# Evaluate a symbolic equation:
-function evalTree(tree::Node, x::Array{Float32, 1}=Float32[])::Float32
-    if tree.degree == 0
-        if tree.constant
-            return tree.val
-        else
-            return x[tree.val]
-        end
-    elseif tree.degree == 1
-        return unaops[tree.op](evalTree(tree.l, x))
-    else
-        return binops[tree.op](evalTree(tree.l, x), evalTree(tree.r, x))
-    end
-end
-
 # Count the operators, constants, variables in an equation
 function countNodes(tree::Node)::Integer
     if tree.degree == 0
@@ -405,35 +390,6 @@ function appendRandomOp(tree::Node)::Node
     return tree
 end
 
-# Add random node to the top of a tree
-function popRandomOp(tree::Node)::Node
-    node = tree
-    choice = rand()
-    makeNewBinOp = choice < nbin/nops
-    left = tree
-
-    if makeNewBinOp
-        right = randomConstantNode()
-        newnode = Node(
-            rand(1:length(binops)),
-            left,
-            right
-        )
-    else
-        newnode = Node(
-            rand(1:length(unaops)),
-            left
-        )
-    end
-    node.l = newnode.l
-    node.r = newnode.r
-    node.op = newnode.op
-    node.degree = newnode.degree
-    node.val = newnode.val
-    node.constant = newnode.constant
-    return node
-end
-
 # Insert random node
 function insertRandomOp(tree::Node)::Node
     node = randomNode(tree)
@@ -897,7 +853,7 @@ function testConfiguration()
             test_output = unaop.(left)
         end
     end
-    catch
+    catch error
         @printf("\n\nYour configuration is invalid - one of your operators is not well-defined over the real line.\n\n\n")
         throw(error)
     end
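The `evalTree` removed above is a plain recursive evaluator over the expression tree: leaves hold a constant or a feature index, internal nodes hold an operator index. For readers following the diff, a rough Python transliteration, with a duck-typed `node` standing in for the Julia `Node` struct:

```python
# Rough Python transliteration of the removed Julia evalTree (illustrative;
# `node` mirrors the Julia Node struct, and indices are 1-based in Julia).
def eval_tree(node, x, unaops, binops):
    if node.degree == 0:           # leaf node
        if node.constant:
            return node.val        # stored constant
        return x[node.val - 1]     # feature lookup (shift to 0-based)
    if node.degree == 1:           # unary operator applied to left child
        return unaops[node.op - 1](eval_tree(node.l, x, unaops, binops))
    # binary operator applied to both children
    return binops[node.op - 1](
        eval_tree(node.l, x, unaops, binops),
        eval_tree(node.r, x, unaops, binops),
    )
```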
pydoc-markdown.yml
ADDED
@@ -0,0 +1,19 @@
+loaders:
+- type: python
+processors:
+- type: filter
+- type: smart
+- type: crossref
+renderer:
+  type: mkdocs
+  pages:
+    - title: Home
+      name: index
+      source: README.md
+    - title: API Documentation
+      contents:
+        - '*'
+  mkdocs_config:
+  mkdocs_config:
+    site_name: PySR
+    theme: readthedocs
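With the three configs above in place, Read the Docs installs the two requirements and runs the custom step. A sketch of reproducing that step locally; the command is the one from `.readthedocs-custom-steps.yml`, and the local `site` output directory is an assumption:

```python
# Run the same pydoc-markdown build that Read the Docs runs, but locally
# (assumes pydoc-markdown is installed in the current environment).
import subprocess

subprocess.run(
    ["pydoc-markdown", "--build", "--site-dir", "site"],
    check=True,  # raise CalledProcessError if the docs build fails
)
```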
pysr/sr.py
CHANGED
@@ -78,6 +78,7 @@ def pysr(X=None, y=None, weights=None,
         variable_names=[],
         batching=False,
         batchSize=50,
+        select_k_features=None,
         threads=None, #deprecated
         julia_optimization=3,
     ):
@@ -86,7 +87,9 @@ def pysr(X=None, y=None, weights=None,
     equations, but you should adjust `threads`, `niterations`,
     `binary_operators`, `unary_operators` to your requirements.
 
-    :param X: np.ndarray, 2D array. Rows are examples,
+    :param X: np.ndarray or pandas.DataFrame, 2D array. Rows are examples,
+        columns are features. If pandas DataFrame, the columns are used
+        for variable names (so make sure they don't contain spaces).
     :param y: np.ndarray, 1D array. Rows are examples.
     :param weights: np.ndarray, 1D array. Each row is how to weight the
         mean-square-error loss on weights.
@@ -144,6 +147,10 @@ def pysr(X=None, y=None, weights=None,
         during evolution. Still uses full dataset for comparing against
         hall of fame.
     :param batchSize: int, the amount of data to use if doing batching.
+    :param select_k_features: (None, int), whether to run feature selection in
+        Python using random forests, before passing to the symbolic regression
+        code. None means no feature selection; an int means select that many
+        features.
     :param julia_optimization: int, Optimization level (0, 1, 2, 3)
     :returns: pd.DataFrame, Results dataframe, giving complexity, MSE, and equations
         (as strings).
@@ -154,6 +161,12 @@ def pysr(X=None, y=None, weights=None,
     if maxdepth is None:
         maxdepth = maxsize
 
+    if isinstance(X, pd.DataFrame):
+        variable_names = list(X.columns)
+        X = np.array(X)
+
+    use_custom_variable_names = (len(variable_names) != 0)
+
     # Check for potential errors before they happen
     assert len(unary_operators) + len(binary_operators) > 0
     assert len(X.shape) == 2
@@ -162,9 +175,17 @@ def pysr(X=None, y=None, weights=None,
     if weights is not None:
         assert len(weights.shape) == 1
         assert X.shape[0] == weights.shape[0]
-    if len(variable_names) != 0:
+    if use_custom_variable_names:
         assert len(variable_names) == X.shape[1]
 
+    if select_k_features is not None:
+        selection = run_feature_selection(X, y, select_k_features)
+        print(f"Using features {selection}")
+        X = X[:, selection]
+
+        if use_custom_variable_names:
+            variable_names = variable_names[selection]
+
     if populations is None:
         populations = procs
 
@@ -235,7 +256,7 @@ const annealing = {"true" if annealing else "false"}
 const weighted = {"true" if weights is not None else "false"}
 const batching = {"true" if batching else "false"}
 const batchSize = {min([batchSize, len(X)]) if batching else len(X):d}
-const useVarMap = {"true" if len(variable_names) != 0 else "false"}
+const useVarMap = {"true" if use_custom_variable_names else "false"}
 const mutationWeights = [
 {weightMutateConstant:f},
 {weightMutateOperator:f},
@@ -262,7 +283,7 @@ const y = convert(Array{Float32, 1}, """f"{y_str})"
     def_datasets += """
 const weights = convert(Array{Float32, 1}, """f"{weight_str})"
 
-    if len(variable_names) != 0:
+    if use_custom_variable_names:
         def_hyperparams += f"""
 const varMap = {'["' + '", "'.join(variable_names) + '"]'}"""
 
@@ -301,7 +322,7 @@ const varMap = {'["' + '", "'.join(variable_names) + '"]'}"""
     lastComplexity = 0
     sympy_format = []
     lambda_format = []
-    if len(variable_names) != 0:
+    if use_custom_variable_names:
         sympy_symbols = [sympy.Symbol(variable_names[i]) for i in range(X.shape[1])]
     else:
         sympy_symbols = [sympy.Symbol('x%d'%i) for i in range(X.shape[1])]
@@ -328,3 +349,18 @@ const varMap = {'["' + '", "'.join(variable_names) + '"]'}"""
     return output[['Complexity', 'MSE', 'score', 'Equation', 'sympy_format', 'lambda_format']]
 
 
+def run_feature_selection(X, y, select_k_features):
+    """Use a gradient boosting tree regressor as a proxy for finding
+    the k most important features in X, returning indices for those
+    features as output."""
+
+    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
+    from sklearn.feature_selection import SelectFromModel, SelectKBest
+
+    clf = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0, loss='ls') #RandomForestRegressor()
+    clf.fit(X, y)
+    selector = SelectFromModel(clf, threshold=-np.inf,
+                               max_features=select_k_features, prefit=True)
+    return selector.get_support(indices=True)
+
+
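One line in the hunk above is easy to misread: the `varMap` f-string builds a Julia string-array literal out of the Python list. A quick check of just that expression, with hypothetical names:

```python
# The expression from def_hyperparams, evaluated standalone:
variable_names = ["mass", "radius"]
var_map = '["' + '", "'.join(variable_names) + '"]'
print(var_map)  # -> ["mass", "radius"]  (a valid Julia array literal)
```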
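Finally, the new `run_feature_selection` is a thin wrapper over scikit-learn. A self-contained sketch of the same calls on hypothetical data, to show what the returned indices look like:

```python
# Same sklearn pattern as run_feature_selection above, in isolation.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

X = np.random.randn(200, 5)
y = 2.0 * X[:, 1] - X[:, 3]  # only features 1 and 3 carry signal

clf = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_depth=1, random_state=0)
clf.fit(X, y)
selector = SelectFromModel(clf, threshold=-np.inf,
                           max_features=2, prefit=True)
print(selector.get_support(indices=True))  # expected: [1 3]
```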