# Finding the ground state of spin Hamiltonians with reinforcement learning

Kyle Mills (kyle.mills@1qbit.com)
1QB Information Technologies (1QBit), Vancouver, British Columbia, Canada
University of Ontario Institute of Technology, Oshawa, Ontario, Canada
Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada

Pooya Ronagh (pooya.ronagh@1qbit.com)
1QB Information Technologies (1QBit), Vancouver, British Columbia, Canada
Institute for Quantum Computing (IQC), Waterloo, Ontario, Canada
Department of Physics and Astronomy, University of Waterloo, Ontario, Canada

Isaac Tamblyn (isaac.tamblyn@nrc.ca)
National Research Council Canada, Ottawa, Ontario, Canada
University of Ottawa, Ottawa, Ontario, Canada
Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada

November 5, 2021

###### Abstract

Reinforcement learning (RL) has become a proven method for optimizing a procedure for which success has been defined, but the specific actions needed to achieve it have not. Using a method we call "Controlled Online Optimization Learning" (COOL), we apply the so-called "black box" method of RL to simulated annealing (SA), demonstrating that an RL agent based on proximal policy optimization can, through experience alone, arrive at a temperature schedule that surpasses the performance of standard heuristic temperature schedules for two classes of Hamiltonians. When the system is initialized at a cool temperature, the RL agent learns to heat the system to "melt" it, and then slowly cool it in an effort to anneal to the ground state; if the system is initialized at a high temperature, the algorithm immediately cools the system. We investigate the performance of our RL-driven SA agent in generalizing to all Hamiltonians of a specific class; when trained on random Hamiltonians of nearest-neighbour spin glasses, the RL agent is able to control the SA process for other Hamiltonians, reaching the ground state with a higher probability than a simple linear annealing schedule. Furthermore, the scaling performance (with respect to system size) of the RL approach is far more favourable, achieving a performance improvement of almost two orders of magnitude on \(L=14^{2}\) systems. We demonstrate the robustness of the RL approach when the system operates in a "destructive observation" mode, an allusion to a quantum system where measurements destroy the state of the system. The success of the RL agent could have far-reaching impact, from classical optimization, to quantum annealing, to the simulation of physical systems.

## I Introduction

In metallurgy and materials science, the process of annealing is used to equilibrate the positions of atoms to obtain perfect low-energy crystals. Heat provides the energy necessary to break atomic bonds, and high-stress interfaces are eliminated by the migration of defects. As the metal is slowly cooled to room temperature, its atoms become energetically locked in a lattice structure more favourable than the original one. Metallurgists can tune the temperature schedule to arrive at final products that have desired characteristics, such as ductility and hardness. Annealing is a biased stochastic search for the ground state. An analogous _in silico_ technique, simulated annealing (SA) [1], can be used to find the ground state of spin-glass models, an NP-hard problem [2]. A spin glass is a graphical model consisting of binary spins \(\sigma_{i}\).
The connections between spins are defined by the coupling constants \(J_{ij}\), and a linear term with coefficients \(h_{i}\) can apply a bias to individual spins. The Hamiltonian
\[\mathcal{H}=-\sum_{i<j}J_{ij}\sigma_{i}\sigma_{j}-\sum_{i}h_{i}\sigma_{i}\]
assigns an energy to each configuration of spins.

Figure 1: (a) An example two-cluster model: the left cluster is biased upward (\(h_{i}>0\)); the right cluster is biased downward (\(h_{i}<0\)). All couplings are equal and of unit magnitude. The two clusters are coupled via the eight central nodes. This model exhibits a deep local minimum very close in energy to the model's global minimum. When initialized in the local minimum, the RL agent is able to learn schemes to escape the local minimum and arrive at the global minimum, without any explicit knowledge of the Hamiltonian. (b) Here we present an example spin-glass model. The nodes are coupled to nearest neighbours with random Gaussian-distributed coupling coefficients. The nodes are unbiased (\(h_{i}=0\)), and the couplings are changed at each instantiation of the model. The RL algorithm is able to learn a dynamic temperature schedule by observing the system throughout the annealing process, without explicit knowledge of the form of the Hamiltonian, and the learned policy can be applied to all instances of randomly generated couplings. We demonstrate this on variably sized spin glasses and investigate the scaling with respect to a classic linear SA schedule. In (c), we show snapshots of a sample progression of a configuration undergoing SA under the ferromagnetic Ising model Hamiltonian and a constant cooling schedule. The terminal state, with all spins up, is the ground state; this anneal would be considered successful.

## II The environment and architecture

### Reinforcement learning

Reinforcement learning is a branch of dynamic programming whereby an agent, residing in state \(s_{t}\) at time \(t\), learns to take an action \(a_{t}\) that maximizes a cumulative reward signal \(R\) by dynamically interacting with an environment [39]. Through the training process, the agent arrives at a policy \(\pi\) that depends on some observation (or "state") of the system, \(s\). In recent years, neural networks have taken over as the _de facto_ function approximator for the policy. Deep reinforcement learning has seen unprecedented success, achieving superhuman performance in a variety of video games [40, 41, 42, 43], board games [44, 45, 46], and other puzzles [47, 48]. While many reinforcement learning algorithms exist, we have chosen to use proximal policy optimization (PPO) [49], implemented within Stable Baselines [50], for its competitive performance on problems with continuous action spaces.

### The environment

We developed an OpenAI gym [51] environment which serves as the interface to the "game" of simulated annealing. Let us now define some terminology and parameters important to SA. For a given Hamiltonian, defining the interactions of \(L\) spins, we create \(N_{\text{reps}}\) randomly initialized replicas (unless otherwise specified). The initial spins of each replica are drawn from a Bernoulli distribution, with the probability of spin-up itself drawn from a uniform distribution. These independent replicas are annealed in parallel. The replicas follow an identical temperature schedule; their uncoupled nature provides a mechanism for the statistics of the system to be represented through an ensemble of measurements. In the context of the Metropolis-Hastings framework, we define one "sweep" to be \(L\) proposed random spin flips (per replica), and one "step" to be \(N_{\text{sweeps}}\) sweeps.
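To make these definitions concrete, the following sketch (illustrative only; the function names and the dense, symmetric coupling matrix are our own choices, not the implementation used in this work) evaluates the spin-glass energy and performs a single "sweep" of \(L\) proposed spin flips with Metropolis acceptance at inverse temperature \(\beta\):

```python
import numpy as np

def energy(spins, J, h):
    """H = -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i for one replica (spins in {-1, +1})."""
    # J is assumed symmetric with zero diagonal, so the pair sum is 0.5 * s^T J s.
    return -0.5 * spins @ (J @ spins) - h @ spins

def sweep(spins, J, h, beta, rng):
    """One 'sweep': L proposed single-spin flips with Metropolis acceptance."""
    L = spins.size
    for i in rng.integers(0, L, size=L):
        # Energy change from flipping spin i: dE = 2 s_i (sum_j J_ij s_j + h_i).
        dE = 2.0 * spins[i] * (J[i] @ spins + h[i])
        if dE <= 0.0 or rng.random() < np.exp(-beta * dE):
            spins[i] = -spins[i]
    return spins

# Example: one replica of a small system with random Gaussian couplings.
rng = np.random.default_rng(0)
J = rng.normal(size=(4, 4))
J = np.triu(J, 1)
J = J + J.T                                   # symmetric couplings, zero diagonal
h = np.zeros(4)                               # unbiased spins
spins = rng.choice([-1, 1], size=4).astype(float)
for _ in range(100):                          # one "step" = N_sweeps sweeps
    sweep(spins, J, h, beta=2.0, rng=rng)
print(energy(spins, J, h))
```

A "step" then simply applies \(N_{\text{sweeps}}\) such sweeps to every replica at the current temperature.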
After every step, the environment returns an observation of the current state \(s_{t}\) of the system, an \(N_{\text{reps}}\times L\) array consisting of the binary spin values present. This observation can be used to make an informed decision about the action \(a_{t}\) that should be taken. The action, a single scalar value, corresponds to the total inverse temperature change \(\Delta\beta\) (where \(\beta=1/T\)) that should be carried out over the subsequent step. The choice of action is provided to the environment, and the process repeats until \(N_{\text{steps}}\) steps have been taken, comprising one full anneal, or "episode" in the language of RL. If the chosen action would result in the temperature becoming negative, no change is made to the temperature and the system continues to evolve under the previous temperature. In our investigations, we choose \(N_{\text{steps}}=40\) and \(N_{\text{sweeps}}=100\), resulting in 4000 sweeps per episode. These values define the maximum size of system we can compare to classic SA. This number of sweeps is sufficient for a linear schedule to attain measurable success on all but the largest system size we investigate.

### Observations

For the classical version of the problem, an observation consists of the explicit spins of an ensemble of replicas. In the case of an unknown Hamiltonian, the ensemble measurement is important, as the instantaneous state of a single replica does not provide sufficient information about the current temperature of the system. Providing the agent with multiple replicas allows it to compute statistics and have the possibility of inferring the temperature. For example, if there is considerable variation among replicas, then the system is likely hot, whereas if most replicas look the same, the system is probably cool.

Figure 2: A neural network is used to learn the control parameters for several SA experiments. By observing a lattice of spins, the neural network can learn to control the temperature of the system in a dynamic fashion, annealing the system to the ground state. The spins at time \(t\) form the state \(s_{t}\) fed into the network. Two concurrent convolutional layers extract features from the state. These features are combined with a dense layer and fed into a recurrent module (an LSTM module) capable of capturing temporal characteristics. The LSTM module output is reduced to two parameters used to form the policy distribution \(\pi_{\theta}(a_{t}\mid s_{t})\), as well as to approximate the value function \(V(s_{t})\) used for the generalized advantage estimate.

When discussing a quantum system, where the spins represent qubits, direct mid-anneal measurement of the system is not possible, as measurement causes a collapse of the wavefunction. To address this, we discuss experiments conducted in a "destructive observation" environment, where measurement of the spins is treated as a "one-time" opportunity for inclusion in the RL training data. The subsequent observation is then based on a different set of replicas that have evolved through the same schedule, but from different initializations. When running the classic SA baselines, to keep the comparison fair, each episode consists of \(N_{\text{reps}}\) replicas, as in the RL case. If even one replica reaches the ground state, the episode is considered a success.
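A minimal sketch of how such an interface can be expressed as an OpenAI gym environment is shown below; the class name, constructor arguments, and initialization details are illustrative assumptions rather than the authors' code, and the `sweep` helper is the one from the sketch above. The observation is the \(N_{\text{reps}}\times L\) spin array, and the action is the scalar \(\Delta\beta\):

```python
import numpy as np
import gym
from gym import spaces

class AnnealingEnv(gym.Env):
    """Sketch of the SA 'game': observe all replica spins, choose a change in inverse temperature."""

    def __init__(self, J, h, n_reps, n_sweeps=100, n_steps=40, beta0=0.0, seed=0):
        self.J, self.h = J, h
        self.L = h.shape[0]
        self.n_reps, self.n_sweeps, self.n_steps = n_reps, n_sweeps, n_steps
        self.beta0 = beta0
        self.rng = np.random.default_rng(seed)
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(n_reps, self.L), dtype=np.float32)
        self.action_space = spaces.Box(-np.inf, np.inf, shape=(1,), dtype=np.float32)  # d_beta

    def reset(self):
        self.t, self.beta = 0, self.beta0
        # Each replica is initialized from a Bernoulli distribution whose
        # spin-up probability is itself drawn uniformly at random.
        p_up = self.rng.uniform(size=(self.n_reps, 1))
        self.spins = np.where(self.rng.uniform(size=(self.n_reps, self.L)) < p_up, 1.0, -1.0)
        return self.spins.astype(np.float32)

    def step(self, action):
        d_beta = float(np.asarray(action).ravel()[0])
        if self.beta + d_beta >= 0.0:              # reject actions that would make T negative
            self.beta += d_beta
        for r in range(self.n_reps):               # one environment "step" = N_sweeps sweeps per replica
            for _ in range(self.n_sweeps):
                sweep(self.spins[r], self.J, self.h, self.beta, self.rng)
        self.t += 1
        done = self.t >= self.n_steps
        reward = 0.0                               # the sparse reward scheme is described below
        return self.spins.astype(np.float32), reward, done, {}
```

The reward is left at zero in this sketch; the sparse reward actually used is described in the Reward subsection below.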
### Reinforcement learning algorithm

Through the framework of reinforcement learning, we wish to produce a policy function \(\pi_{\theta}(a_{t}\mid s_{t})\) that takes the observed binary spin state \(s_{t}\in\{-1,1\}^{N_{\text{reps}}\times L}\) and produces an action \(a_{t}\) corresponding to the optimal change in the inverse temperature. We now briefly introduce PPO [49]. First, we define our policy \(\pi_{\theta}(a_{t}\mid s_{t})\) as the likelihood that the agent will take action \(a_{t}\) while in state \(s_{t}\); through training, the desire is that the best choice of action will become the most probable. To choose an action, this distribution can be sampled. We will use a neural network, parameterized by weights \(\theta\), to represent the policy, assuming that \(\pi_{\theta}(a_{t}\mid s_{t})\) is a normal distribution and interpreting the output nodes of the neural network as its mean, \(\mu\), and variance, \(\sigma^{2}\). We define a function \(Q_{\pi_{\theta}}(s_{t},a_{t})\) as the expected future discounted reward if the agent takes action \(a_{t}\) at time \(t\) and then follows policy \(\pi_{\theta}\) for the remainder of the episode. We additionally define a value function \(V_{\pi_{\theta}}(s_{t})\) as the expected future discounted reward starting from state \(s_{t}\) and following the current policy \(\pi_{\theta}\) until the end of the episode. We introduce the concept of _advantage_, \(\hat{A}_{t}(s_{t},a_{t})\), as the difference between these two quantities. \(Q_{\pi_{\theta}}\) and \(V_{\pi_{\theta}}\) are not known and must be approximated. We assume the features necessary to represent \(\pi\) are generally similar to the features necessary to estimate the value function, and thus we can use the same neural network to predict the value function by merely having it output a third quantity. \(\hat{A}_{t}\) is effectively an estimate of how much better the agent did in choosing action \(a_{t}\) than was expected. We construct the typical policy gradient cost function by coupling the advantage of a state-action pair with the probability of the action being taken,
\[L^{\text{PG}}(\theta)=\hat{\mathbb{E}}_{t}\left[\log\pi_{\theta}(a_{t}\mid s_{t})\hat{A}_{t}\right],\]
which we want to maximize by modifying the weights \(\theta\) through the training process. It is, however, more efficient to maximize the improvement ratio \(r_{t}\) of the current policy over a policy from a previous iteration, \(\pi_{\theta_{\text{old}}}\) [52, 53]:
\[L^{\text{TRPO}}(\theta)=\hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}(a_{t}\mid s_{t})}{\pi_{\theta_{\text{old}}}(a_{t}\mid s_{t})}\hat{A}_{t}\right]\equiv\hat{\mathbb{E}}_{t}\left[r_{t}(\theta)\hat{A}_{t}\right].\]
Note, however, that maximizing this quantity can be trivially achieved by making the new policy drastically different from the old policy, which is not the desired behaviour. The PPO algorithm [49] deals with this by clipping the improvement ratio and taking the minimum:
\[L^{\text{CLIP}}(\theta)=\hat{\mathbb{E}}_{t}\left[\min\!\left(r_{t}(\theta)\hat{A}_{t},\text{clip}(r_{t}(\theta),1-\epsilon,1+\epsilon)\hat{A}_{t}\right)\right].\]
To train the value function estimator, a squared error is used, that is,
\[L^{\text{VF}}(\theta)=\hat{\mathbb{E}}_{t}\left[(V_{\pi_{\theta}}(s_{t})-V_{t}^{\text{targ}})^{2}\right],\]
and to encourage exploration, an entropic regularization functional \(S\) is used. This all amounts to a three-term cost function
\[L^{\text{PPO}}(\theta)=\hat{\mathbb{E}}_{t}\left[L^{\text{CLIP}}(\theta)-c_{1}L^{\text{VF}}(\theta)+c_{2}S[\pi_{\theta}](s_{t})\right],\]
where \(c_{1}\) and \(c_{2}\) are hyperparameters.
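The PPO implementation we use is that of Stable Baselines; purely to make the algebra above concrete, the three terms can be assembled as in the following sketch (a Gaussian policy is assumed, as in the text, and \(\epsilon\), \(c_{1}\), and \(c_{2}\) are placeholder hyperparameter values):

```python
import numpy as np

def gaussian_log_prob(a, mu, sigma2):
    """log pi_theta(a|s) for a scalar Gaussian policy with mean mu and variance sigma2."""
    return -0.5 * (np.log(2.0 * np.pi * sigma2) + (a - mu) ** 2 / sigma2)

def gaussian_entropy(sigma2):
    """Entropy of a scalar Gaussian, used as the exploration bonus S[pi_theta]."""
    return 0.5 * np.log(2.0 * np.pi * np.e * sigma2)

def ppo_objective(logp_new, logp_old, advantages, values, value_targets,
                  entropy, eps=0.2, c1=0.5, c2=0.01):
    """L^PPO = E[ L^CLIP - c1 L^VF + c2 S ], to be maximized with respect to theta."""
    ratio = np.exp(logp_new - logp_old)                       # r_t(theta)
    l_clip = np.minimum(ratio * advantages,
                        np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages)
    l_vf = (values - value_targets) ** 2                      # squared-error value loss
    return np.mean(l_clip - c1 * l_vf + c2 * entropy)
```

In practice, the library computes the log-probabilities, advantages, and value targets from rollouts of the current policy; the sketch only mirrors the form of \(L^{\text{PPO}}\).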
### Policy network architecture

The neural network is composed of two parts: a convolutional feature extractor, and a recurrent network to capture the temporal characteristics of the problem. The feature extractor comprises two parallel two-dimensional convolutional layers. The first convolutional layer has \(N_{k_{r}}\) kernels of size \(1\times L\) and aggregates along the replica dimension, enabling the collection of spin-wise statistics across the replicas. The second convolutional layer has \(N_{k_{s}}\) kernels of size \(N_{\text{reps}}\times 1\) and slides along the spin dimension, enabling the aggregation of replica-wise statistics across the spins. The outputs of these layers are flattened, concatenated, and fed into a dense layer of \(N_{d}\) hidden nodes. This operates as a latent-space encoding for input to a recurrent neural network (a long short-term memory, or LSTM, module [54]), used to capture the sequential nature of our application. The latent output of the LSTM module is of size \(N_{L}\). For simplicity, we set \(N_{k_{r}}=N_{k_{s}}=N_{d}=N_{L}=64\). All activation functions are hyperbolic tangent (tanh) activations. Since \(a_{t}\) can assume a continuum of real values, this task is referred to as having a continuous action space, and thus standard practice is for the network to output two values corresponding to the first and second moments of a normal distribution, which can be sampled to produce predictions.

### Reward

At the core of RL is the concept of reward engineering, that is, developing a reward scheme to inject a notion of success into the system. As we only care about reaching the ground state by the end of a given episode, we use a sparse reward scheme, with a reward of zero for every time step before the terminal step, and a reward equal to the negative of the minimum energy at the terminal step, that is,
\[R_{t}=\begin{cases}0,&t<N_{\text{steps}},\\-E_{\text{min}},&t=N_{\text{steps}}.\end{cases}\]
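Read literally, this reward can be written as the short function below (a minimal sketch, assuming the minimum energy is taken over the \(N_{\text{reps}}\) replica energies at the terminal step):

```python
import numpy as np

def sparse_reward(t, n_steps, replica_energies):
    """Zero before the terminal step; minus the lowest replica energy at the terminal step."""
    if t < n_steps:
        return 0.0
    return -float(np.min(replica_energies))
```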