MilesCranmer committed
Commit 135e2ff (parents: 5abd46e, 2fa6a85)

Merge pull request #276 from MilesCranmer/custom-objectives

Files changed (4):
  1. docs/examples.md +118 -1
  2. docs/param_groupings.yml +1 -0
  3. pysr/sr.py +32 -4
  4. pysr/test/test.py +17 -4
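
The feature being merged is a new `full_objective` keyword on `PySRRegressor`, which takes a snippet of Julia code to use as the search objective in place of an elementwise `loss`. As a quick orientation before the diffs, here is a minimal sketch of how the option is used, assembled from the docs and tests below (the data, operators, and objective body are illustrative, not taken from the PR):

```python
import numpy as np
from pysr import PySRRegressor

# Julia snippet: receives the candidate expression `tree`, the `dataset`,
# and the search `options`, and must return a scalar loss of type `L`.
objective = """
function my_objective(tree, dataset::Dataset{T,L}, options) where {T,L}
    prediction, flag = eval_tree_array(tree, dataset.X, options)
    !flag && return L(Inf)
    return sum(abs2, prediction .- dataset.y) / dataset.n
end
"""

X = np.random.randn(200, 3)
y = X[:, 0] ** 2 - 2 * X[:, 1]

model = PySRRegressor(
    binary_operators=["+", "-", "*"],
    full_objective=objective,  # new keyword; mutually exclusive with `loss`
)
model.fit(X, y)
```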
docs/examples.md CHANGED
@@ -318,7 +318,124 @@ model.predict(X, -1)
 
 to make predictions with the most accurate expression.
 
-## 9. Additional features
+## 9. Custom objectives
+
+You can also pass a custom objective as a snippet of Julia code,
+which might include symbolic manipulations or custom functional forms.
+These do not even need to be differentiable! First, let's look at the
+default objective used (a simplified version, without weights
+and with mean square error), so that you can see how to write your own:
+
+```julia
+function default_objective(tree, dataset::Dataset{T,L}, options)::L where {T,L}
+    (prediction, completion) = eval_tree_array(tree, dataset.X, options)
+    if !completion
+        return L(Inf)
+    end
+
+    diffs = prediction .- dataset.y
+
+    return sum(diffs .^ 2) / length(diffs)
+end
+```
+
+Here, the `where {T,L}` syntax defines the function for arbitrary types `T` and `L`.
+If you have `precision=32` (default) and pass in regular floating point data,
+then both `T` and `L` will be equal to `Float32`. If you pass in complex data,
+then `T` will be `ComplexF32` and `L` will be `Float32` (since we need to return
+a real number from the loss function). But you don't need to worry about this; just
+make sure to return a scalar number of type `L`.
+
+The `tree` argument is the current expression being evaluated. You can read
+about the `tree` fields [here](https://astroautomata.com/SymbolicRegression.jl/stable/types/).
+
+For example, let's fix the symbolic form of an expression to be a
+rational function, i.e., $P(X)/Q(X)$ for polynomials $P$ and $Q$.
+
+```python
+objective = """
+function my_custom_objective(tree, dataset::Dataset{T,L}, options) where {T,L}
+    # Require root node to be binary, so we can split it,
+    # otherwise return a large loss:
+    tree.degree != 2 && return L(Inf)
+
+    P = tree.l
+    Q = tree.r
+
+    # Evaluate numerator:
+    P_prediction, flag = eval_tree_array(P, dataset.X, options)
+    !flag && return L(Inf)
+
+    # Evaluate denominator:
+    Q_prediction, flag = eval_tree_array(Q, dataset.X, options)
+    !flag && return L(Inf)
+
+    # Impose functional form:
+    prediction = P_prediction ./ Q_prediction
+
+    diffs = prediction .- dataset.y
+
+    return sum(diffs .^ 2) / length(diffs)
+end
+"""
+
+model = PySRRegressor(
+    niterations=100,
+    binary_operators=["*", "+", "-"],
+    full_objective=objective,
+)
+```
+
+> **Warning**: When using a custom objective like this that performs symbolic
+> manipulations, many functionalities of PySR will not work, such as `.sympy()`,
+> `.predict()`, etc. This is because the SymPy parsing does not know
+> how you are manipulating the expression, so you will need to do this yourself.
+
+Note how we did not pass `/` as a binary operator; it will just be implicit
+in the functional form.
+
+Let's generate an equation of the form $\frac{x_0^2 x_1 - 2}{x_2^2 + 1}$:
+
+```python
+X = np.random.randn(1000, 3)
+y = (X[:, 0]**2 * X[:, 1] - 2) / (X[:, 2]**2 + 1)
+```
+
+Finally, let's fit:
+
+```python
+model.fit(X, y)
+```
+
+> Note that the printed equation is not the same as the evaluated equation,
+> because the printing functionality does not know about the functional form.
+
+We can get the string format with:
+
+```python
+model.get_best().equation
+```
+
+(or, you could use `model.equations_.iloc[-1].equation`)
+
+For me, this equation was:
+
+```text
+(((2.3554819 + -0.3554746) - (x1 * (x0 * x0))) - (-1.0000019 - (x2 * x2)))
+```
+
+Looking at the bracket structure of the equation, we can see that the outermost
+bracket is split at the root `-` operator (which we ignore in the evaluation,
+as we simply evaluate each argument and divide the results) into
+`((2.3554819 + -0.3554746) - (x1 * (x0 * x0)))` and
+`(-1.0000019 - (x2 * x2))`, meaning that our discovered equation is
+equal to
+$\frac{x_0^2 x_1 - 2.0000073}{x_2^2 + 1.0000019}$, which
+is nearly the same as the true equation!
+
+
+## 10. Additional features
 
 For the many other features available in PySR, please
 read the [Options section](options.md).
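
As the warning in the added docs notes, `.predict()` and `.sympy()` do not know about the imposed rational form, so evaluating the discovered model is left to the user. A minimal sketch of doing that by hand, assuming the example output printed above (this is plain NumPy, not a PySR API):

```python
import numpy as np

# Numerator P and denominator Q, transcribed from the printed equation above.
# The root `-` operator in the printed string is only a placeholder separating
# the two subtrees; the custom objective divides them instead.
def P(X):
    return (2.3554819 + -0.3554746) - (X[:, 1] * (X[:, 0] * X[:, 0]))

def Q(X):
    return -1.0000019 - (X[:, 2] * X[:, 2])

X = np.random.randn(1000, 3)
y_true = (X[:, 0] ** 2 * X[:, 1] - 2) / (X[:, 2] ** 2 + 1)
y_model = P(X) / Q(X)

print(np.max(np.abs(y_model - y_true)))  # small; roughly 1e-5 for these constants
```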
docs/param_groupings.yml CHANGED
@@ -11,6 +11,7 @@
   - ncyclesperiteration
 - The Objective:
   - loss
+  - full_objective
   - model_selection
 - Working with Complexities:
   - parsimony
pysr/sr.py CHANGED
@@ -320,9 +320,9 @@ class PySRRegressor(MultiOutputMixin, RegressorMixin, BaseEstimator):
         argument is constrained.
         Default is `None`.
     loss : str
-        String of Julia code specifying the loss function. Can either
-        be a loss from LossFunctions.jl, or your own loss written as a
-        function. Examples of custom written losses include:
+        String of Julia code specifying an elementwise loss function.
+        Can either be a loss from LossFunctions.jl, or your own loss
+        written as a function. Examples of custom written losses include:
         `myloss(x, y) = abs(x-y)` for non-weighted, or
         `myloss(x, y, w) = w*abs(x-y)` for weighted.
         The included losses include:
@@ -335,6 +335,26 @@ class PySRRegressor(MultiOutputMixin, RegressorMixin, BaseEstimator):
         `ModifiedHuberLoss()`, `L2MarginLoss()`, `ExpLoss()`,
         `SigmoidLoss()`, `DWDMarginLoss(q)`.
         Default is `"L2DistLoss()"`.
+    full_objective : str
+        Alternatively, you can specify the full objective function as
+        a snippet of Julia code, including any sort of custom evaluation
+        (including symbolic manipulations beforehand), and any sort
+        of loss function or regularization. The default `full_objective`
+        used in SymbolicRegression.jl is roughly equal to:
+        ```julia
+        function eval_loss(tree, dataset::Dataset{T}, options)::T where T
+            prediction, flag = eval_tree_array(tree, dataset.X, options)
+            if !flag
+                return T(Inf)
+            end
+            sum((prediction .- dataset.y) .^ 2) / dataset.n
+        end
+        ```
+        where the loss in this example is the mean squared error.
+        You may pass a function with the same arguments as this (note
+        that the name of the function doesn't matter). Here,
+        both `prediction` and `dataset.y` are 1D arrays of length `dataset.n`.
+        Default is `None`.
     complexity_of_operators : dict[str, float]
         If you would like to use a complexity other than 1 for an
         operator, specify the complexity here. For example,
@@ -681,7 +701,8 @@ class PySRRegressor(MultiOutputMixin, RegressorMixin, BaseEstimator):
         timeout_in_seconds=None,
         constraints=None,
         nested_constraints=None,
-        loss="L2DistLoss()",
+        loss=None,
+        full_objective=None,
         complexity_of_operators=None,
         complexity_of_constants=1,
         complexity_of_variables=1,
@@ -770,6 +791,7 @@ class PySRRegressor(MultiOutputMixin, RegressorMixin, BaseEstimator):
         self.early_stop_condition = early_stop_condition
         # - Loss parameters
         self.loss = loss
+        self.full_objective = full_objective
         self.complexity_of_operators = complexity_of_operators
         self.complexity_of_constants = complexity_of_constants
         self.complexity_of_variables = complexity_of_variables
@@ -1224,6 +1246,9 @@ class PySRRegressor(MultiOutputMixin, RegressorMixin, BaseEstimator):
             "to True and `procs` to 0 will result in non-deterministic searches. "
         )
 
+        if self.loss is not None and self.full_objective is not None:
+            raise ValueError("You cannot set both `loss` and `full_objective`.")
+
         # NotImplementedError - Values that could be supported at a later time
         if self.optimizer_algorithm not in VALID_OPTIMIZER_ALGORITHMS:
             raise NotImplementedError(
@@ -1553,6 +1578,8 @@ class PySRRegressor(MultiOutputMixin, RegressorMixin, BaseEstimator):
        complexity_of_operators = Main.eval(complexity_of_operators_str)
 
        custom_loss = Main.eval(self.loss)
+       custom_full_objective = Main.eval(self.full_objective)
+
        early_stop_condition = Main.eval(
            str(self.early_stop_condition) if self.early_stop_condition else None
        )
@@ -1581,6 +1608,7 @@ class PySRRegressor(MultiOutputMixin, RegressorMixin, BaseEstimator):
            complexity_of_variables=self.complexity_of_variables,
            nested_constraints=nested_constraints,
            elementwise_loss=custom_loss,
+           loss_function=custom_full_objective,
            maxsize=int(self.maxsize),
            output_file=_escape_filename(self.equation_file_),
            npopulations=int(self.populations),
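
One consequence of the validation added above is that `loss` and `full_objective` are mutually exclusive; the check appears to run during parameter validation (at `fit` time rather than in `__init__`, following the sklearn convention). A small sketch of what a user would see, where the `loss` string and the trivial objective are placeholders:

```python
import numpy as np
from pysr import PySRRegressor

X = np.random.randn(100, 2)
y = X[:, 0] + X[:, 1]

model = PySRRegressor(
    binary_operators=["+"],
    loss="L2DistLoss()",  # an elementwise loss...
    full_objective="f(tree, dataset::Dataset{T,L}, options) where {T,L} = L(0)",
    # ...and a full objective: setting both is rejected by the new check.
)

try:
    model.fit(X, y)
except ValueError as err:
    print(err)  # You cannot set both `loss` and `full_objective`.
```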
pysr/test/test.py CHANGED
@@ -72,8 +72,10 @@ class TestPipeline(unittest.TestCase):
         print(model.equations_)
         self.assertLessEqual(model.get_best()["loss"], 1e-4)
 
-    def test_multiprocessing_turbo(self):
+    def test_multiprocessing_turbo_custom_objective(self):
+        rstate = np.random.RandomState(0)
         y = self.X[:, 0]
+        y += rstate.randn(*y.shape) * 1e-4
         model = PySRRegressor(
             **self.default_test_kwargs,
             # Turbo needs to work with unsafe operators:
@@ -81,17 +83,28 @@ class TestPipeline(unittest.TestCase):
             procs=2,
             multithreading=False,
             turbo=True,
-            early_stop_condition="stop_if(loss, complexity) = loss < 1e-4 && complexity == 1",
+            early_stop_condition="stop_if(loss, complexity) = loss < 1e-10 && complexity == 1",
+            full_objective="""
+            function my_objective(tree::Node{T}, dataset::Dataset{T}, options::Options) where T
+                prediction, flag = eval_tree_array(tree, dataset.X, options)
+                !flag && return T(Inf)
+                abs3(x) = abs(x) ^ 3
+                return sum(abs3, prediction .- dataset.y) / length(prediction)
+            end
+            """,
         )
         model.fit(self.X, y)
         print(model.equations_)
-        self.assertLessEqual(model.equations_.iloc[-1]["loss"], 1e-4)
+        best_loss = model.equations_.iloc[-1]["loss"]
+        self.assertLessEqual(best_loss, 1e-10)
+        self.assertGreaterEqual(best_loss, 0.0)
 
-    def test_high_precision_search(self):
+    def test_high_precision_search_custom_loss(self):
         y = 1.23456789 * self.X[:, 0]
         model = PySRRegressor(
             **self.default_test_kwargs,
             early_stop_condition="stop_if(loss, complexity) = loss < 1e-4 && complexity == 3",
+            loss="my_loss(prediction, target) = (prediction - target)^2",
             precision=64,
             parsimony=0.01,
             warm_start=True,
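
To exercise just the two renamed tests, one option is to load them by name with the standard `unittest` loader; the module path `pysr.test.test` is assumed from the file path shown above:

```python
import unittest

# Test class and method names come from the diff above; the module path is
# inferred from pysr/test/test.py and may differ in other PySR versions.
suite = unittest.defaultTestLoader.loadTestsFromNames(
    [
        "pysr.test.test.TestPipeline.test_multiprocessing_turbo_custom_objective",
        "pysr.test.test.TestPipeline.test_high_precision_search_custom_loss",
    ]
)
unittest.TextTestRunner(verbosity=2).run(suite)
```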