# Finding the ground state of spin Hamiltonians with reinforcement learning

Kyle Mills (kyle.mills@1qbit.com)
1QB Information Technologies (1QBit), Vancouver, British Columbia, Canada
University of Ontario Institute of Technology, Oshawa, Ontario, Canada
Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada

Pooya Ronagh (pooya.ronagh@1qbit.com)
1QB Information Technologies (1QBit), Vancouver, British Columbia, Canada
Institute for Quantum Computing (IQC), Waterloo, Ontario, Canada
Department of Physics and Astronomy, University of Waterloo, Ontario, Canada

Isaac Tamblyn (isaac.tamblyn@nrc.ca)
National Research Council Canada, Ottawa, Ontario, Canada
University of Ottawa, Ottawa, Ontario, Canada
Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada

November 5, 2021

###### Abstract

Reinforcement learning (RL) has become a proven method for optimizing a procedure for which success has been defined, but the specific actions needed to achieve it have not. Using a method we call "Controlled Online Optimization Learning" (COOL), we apply the so-called "black box" method of RL to simulated annealing (SA), demonstrating that an RL agent based on proximal policy optimization can, through experience alone, arrive at a temperature schedule that surpasses the performance of standard heuristic temperature schedules for two classes of Hamiltonians. When the system is initialized at a cool temperature, the RL agent learns to heat the system to "melt" it, and then slowly cool it in an effort to anneal to the ground state; if the system is initialized at a high temperature, the algorithm immediately cools the system. We investigate the performance of our RL-driven SA agent in generalizing to all Hamiltonians of a specific class; when trained on random Hamiltonians of nearest-neighbour spin glasses, the RL agent is able to control the SA process for other Hamiltonians, reaching the ground state with a higher probability than a simple linear annealing schedule. Furthermore, the scaling performance (with respect to system size) of the RL approach is far more favourable, achieving a performance improvement of almost two orders of magnitude on \(L=14^{2}\) systems. We demonstrate the robustness of the RL approach when the system operates in a "destructive observation" mode, an allusion to a quantum system where measurements destroy the state of the system. The success of the RL agent could have far-reaching impact, from classical optimization, to quantum annealing, to the simulation of physical systems.

## I Introduction

In metallurgy and materials science, the process of annealing is used to equilibrate the positions of atoms to obtain perfect low-energy crystals. Heat provides the energy necessary to break atomic bonds, and high-stress interfaces are eliminated by the migration of defects. As the metal is slowly cooled to room temperature, its atoms become energetically locked in a lattice structure more favourable than the original one. Metallurgists can tune the temperature schedule to arrive at final products that have desired characteristics, such as ductility and hardness. Annealing is a biased stochastic search for the ground state. An analogous _in silico_ technique, simulated annealing (SA) [1], can be used to find the ground state of spin-glass models, an NP-hard problem [2]. A spin glass is a graphical model consisting of binary spins \(\sigma_{i}\).
The connections between spins are defined by the coupling constants \(J_{ij}\), and a linear term with coefficients \(h_{i}\) can apply a bias to individual spins. The Hamiltonian
\[\mathcal{H}=-\sum_{i<j}J_{ij}\sigma_{i}\sigma_{j}-\sum_{i}h_{i}\sigma_{i}\]
assigns an energy to each configuration of spins.

Figure 1: (a) An example two-cluster model: the left cluster is biased upward (\(h_{i}>0\)); the right cluster is biased downward (\(h_{i}<0\)). All couplings are equal and of unit magnitude. The two clusters are coupled via the eight central nodes. This model exhibits a deep local minimum very close in energy to the model's global minimum. When initialized in the local minimum, the RL agent is able to learn schemes to escape the local minimum and arrive at the global minimum, without any explicit knowledge of the Hamiltonian. (b) Here we present an example spin-glass model. The nodes are coupled to nearest neighbours with random Gaussian-distributed coupling coefficients. The nodes are unbiased (\(h_{i}=0\)), and the couplings are changed at each instantiation of the model. The RL algorithm is able to learn a dynamic temperature schedule by observing the system throughout the annealing process, without explicit knowledge of the form of the Hamiltonian, and the learned policy can be applied to all instances of randomly generated couplings. We demonstrate this on variably sized spin glasses and investigate the scaling with respect to a classic linear SA schedule. In (c), we show snapshots of a sample progression of a configuration undergoing SA under the ferromagnetic Ising model Hamiltonian and a constant cooling schedule. The terminal state, with all spins up, is the ground state; this anneal would be considered successful.

## II The environment and architecture

### Reinforcement learning

Reinforcement learning is a branch of dynamic programming whereby an agent, residing in state \(s_{t}\) at time \(t\), learns to take an action \(a_{t}\) that maximizes a cumulative reward signal \(R\) by dynamically interacting with an environment [39]. Through the training process, the agent arrives at a policy \(\pi\) that depends on some observation (or "state") of the system, \(s\). In recent years, neural networks have taken over as the _de facto_ function approximator for the policy. Deep reinforcement learning has seen unprecedented success, achieving superhuman performance in a variety of video games [40, 41, 42, 43], board games [44, 45, 46], and other puzzles [47, 48]. While many reinforcement learning algorithms exist, we have chosen to use proximal policy optimization (PPO) [49], implemented within Stable Baselines [50], for its competitive performance on problems with continuous action spaces.

### The environment

We developed an OpenAI gym [51] environment which serves as the interface to the "game" of simulated annealing. Let us now define some terminology and parameters important to SA. For a given Hamiltonian, defining the interactions of \(L\) spins, we create \(N_{\text{reps}}\) randomly initialized replicas (unless otherwise specified). The initial spins of each replica are drawn from a Bernoulli distribution, with the probability of spin-up itself drawn from a uniform distribution. These independent replicas are annealed in parallel. The replicas follow an identical temperature schedule; their uncoupled nature provides a mechanism for the statistics of the system to be represented through an ensemble of measurements. In the context of the Metropolis-Hastings framework, we define one "sweep" to be \(L\) proposed random spin flips (per replica), and one "step" to be \(N_{\text{sweeps}}\) sweeps.
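To make these definitions concrete, the following sketch (illustrative only; the function names and the dense, symmetric coupling matrix are our own choices, not the implementation used in this work) evaluates the spin-glass energy and performs a single "sweep" of \(L\) proposed spin flips with Metropolis acceptance at inverse temperature \(\beta\):

```python
import numpy as np

def energy(spins, J, h):
    """H = -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i for one replica (spins in {-1, +1})."""
    # J is assumed symmetric with zero diagonal, so the pair sum is 0.5 * s^T J s.
    return -0.5 * spins @ (J @ spins) - h @ spins

def sweep(spins, J, h, beta, rng):
    """One 'sweep': L proposed single-spin flips with Metropolis acceptance."""
    L = spins.size
    for i in rng.integers(0, L, size=L):
        # Energy change from flipping spin i: dE = 2 s_i (sum_j J_ij s_j + h_i).
        dE = 2.0 * spins[i] * (J[i] @ spins + h[i])
        if dE <= 0.0 or rng.random() < np.exp(-beta * dE):
            spins[i] = -spins[i]
    return spins

# Example: one replica of a small system with random Gaussian couplings.
rng = np.random.default_rng(0)
J = rng.normal(size=(4, 4))
J = np.triu(J, 1)
J = J + J.T                                   # symmetric couplings, zero diagonal
h = np.zeros(4)                               # unbiased spins
spins = rng.choice([-1, 1], size=4).astype(float)
for _ in range(100):                          # one "step" = N_sweeps sweeps
    sweep(spins, J, h, beta=2.0, rng=rng)
print(energy(spins, J, h))
```

A "step" then simply applies \(N_{\text{sweeps}}\) such sweeps to every replica at the current temperature.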
After every step, the environment returns an observation of the current state \(s_{t}\) of the system, an \(N_{\text{reps}}\times L\) array consisting of the binary spin values present. This observation can be used to make an informed decision about the action \(a_{t}\) that should be taken. The action, a single scalar value, corresponds to the total inverse temperature change \(\Delta\beta\) (where \(\beta=1/T\)) that should be carried out over the subsequent step. The choice of action is provided to the environment, and the process repeats until \(N_{\text{steps}}\) steps have been taken, comprising one full anneal, or "episode" in the language of RL. If the chosen action would result in the temperature becoming negative, no change is made to the temperature and the system continues to evolve under the previous temperature. In our investigations, we choose \(N_{\text{steps}}=40\) and \(N_{\text{sweeps}}=100\), resulting in 4000 sweeps per episode. These values define the maximum size of system we can compare to classic SA. This number of sweeps is sufficient for a linear schedule to attain measurable success on all but the largest system size we investigate.

### Observations

For the classical version of the problem, an observation consists of the explicit spins of an ensemble of replicas. In the case of an unknown Hamiltonian, the ensemble measurement is important, as the instantaneous state of a single replica does not provide sufficient information about the current temperature of the system. Providing the agent with multiple replicas allows it to compute statistics and have the possibility of inferring the temperature. For example, if there is considerable variation among replicas, then the system is likely hot, whereas if most replicas look the same, the system is probably cool.

Figure 2: A neural network is used to learn the control parameters for several SA experiments. By observing a lattice of spins, the neural network can learn to control the temperature of the system in a dynamic fashion, annealing the system to the ground state. The spins at time \(t\) form the state \(s_{t}\) fed into the network. Two concurrent convolutional layers extract features from the state. These features are combined with a dense layer and fed into a recurrent module (an LSTM module) capable of capturing temporal characteristics. The LSTM module output is reduced to two parameters used to form the policy distribution \(\pi_{\theta}(a_{t}\mid s_{t})\), as well as to approximate the value function \(V(s_{t})\) used for the generalized advantage estimate.

When discussing a quantum system, where the spins represent qubits, direct mid-anneal measurement of the system is not possible, as measurement causes a collapse of the wavefunction. To address this, we discuss experiments conducted in a "destructive observation" environment, where measurement of the spins is treated as a "one-time" opportunity for inclusion in the RL training data. The subsequent observation is then based on a different set of replicas that have evolved through the same schedule, but from different initializations. When running the classic SA baselines, to keep the comparison fair, each episode consists of \(N_{\text{reps}}\) replicas, as in the RL case. If even one replica reaches the ground state, the episode is considered a success.
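A minimal sketch of how such an interface can be expressed as an OpenAI gym environment is shown below; the class name, constructor arguments, and initialization details are illustrative assumptions rather than the authors' code, and the `sweep` helper is the one from the sketch above. The observation is the \(N_{\text{reps}}\times L\) spin array, and the action is the scalar \(\Delta\beta\):

```python
import numpy as np
import gym
from gym import spaces

class AnnealingEnv(gym.Env):
    """Sketch of the SA 'game': observe all replica spins, choose a change in inverse temperature."""

    def __init__(self, J, h, n_reps, n_sweeps=100, n_steps=40, beta0=0.0, seed=0):
        self.J, self.h = J, h
        self.L = h.shape[0]
        self.n_reps, self.n_sweeps, self.n_steps = n_reps, n_sweeps, n_steps
        self.beta0 = beta0
        self.rng = np.random.default_rng(seed)
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(n_reps, self.L), dtype=np.float32)
        self.action_space = spaces.Box(-np.inf, np.inf, shape=(1,), dtype=np.float32)  # d_beta

    def reset(self):
        self.t, self.beta = 0, self.beta0
        # Each replica is initialized from a Bernoulli distribution whose
        # spin-up probability is itself drawn uniformly at random.
        p_up = self.rng.uniform(size=(self.n_reps, 1))
        self.spins = np.where(self.rng.uniform(size=(self.n_reps, self.L)) < p_up, 1.0, -1.0)
        return self.spins.astype(np.float32)

    def step(self, action):
        d_beta = float(np.asarray(action).ravel()[0])
        if self.beta + d_beta >= 0.0:              # reject actions that would make T negative
            self.beta += d_beta
        for r in range(self.n_reps):               # one environment "step" = N_sweeps sweeps per replica
            for _ in range(self.n_sweeps):
                sweep(self.spins[r], self.J, self.h, self.beta, self.rng)
        self.t += 1
        done = self.t >= self.n_steps
        reward = 0.0                               # the sparse reward scheme is described below
        return self.spins.astype(np.float32), reward, done, {}
```

The reward is left at zero in this sketch; the sparse reward actually used is described in the Reward subsection below.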
### Reinforcement learning algorithm

Through the framework of reinforcement learning, we wish to produce a policy function \(\pi_{\theta}(a_{t}\mid s_{t})\) that takes the observed binary spin state \(s_{t}\in\{-1,1\}^{N_{\text{reps}}\times L}\) and produces an action \(a_{t}\) corresponding to the optimal change in the inverse temperature. We now briefly introduce PPO [49]. First, we define our policy \(\pi_{\theta}(a_{t}\mid s_{t})\) as the likelihood that the agent will take action \(a_{t}\) while in state \(s_{t}\); through training, the desire is that the best choice of action will become the most probable. To choose an action, this distribution can be sampled. We will use a neural network, parameterized by weights \(\theta\), to represent the policy, assuming that \(\pi_{\theta}(a_{t}\mid s_{t})\) is a normal distribution and interpreting the output nodes of the neural network as its mean, \(\mu\), and variance, \(\sigma^{2}\). We define a function \(Q_{\pi_{\theta}}(s_{t},a_{t})\) as the expected future discounted reward if the agent takes action \(a_{t}\) at time \(t\) and then follows policy \(\pi_{\theta}\) for the remainder of the episode. We additionally define a value function \(V_{\pi_{\theta}}(s_{t})\) as the expected future discounted reward starting from state \(s_{t}\) and following the current policy \(\pi_{\theta}\) until the end of the episode. We introduce the concept of _advantage_, \(\hat{A}_{t}(s_{t},a_{t})\), as the difference between these two quantities. \(Q_{\pi_{\theta}}\) and \(V_{\pi_{\theta}}\) are not known and must be approximated. We assume the features necessary to represent \(\pi\) are generally similar to the features necessary to estimate the value function, and thus we can use the same neural network to predict the value function by merely having it output a third quantity. \(\hat{A}_{t}\) is effectively an estimate of how much better the agent did in choosing action \(a_{t}\) than was expected. We construct the typical policy gradient cost function by coupling the advantage of a state-action pair with the probability of the action being taken,
\[L^{\text{PG}}(\theta)=\hat{\mathbb{E}}_{t}\left[\log\pi_{\theta}(a_{t}\mid s_{t})\hat{A}_{t}\right],\]
which we want to maximize by modifying the weights \(\theta\) through the training process. It is, however, more efficient to maximize the improvement ratio \(r_{t}\) of the current policy over a policy from a previous iteration, \(\pi_{\theta_{\text{old}}}\) [52, 53]:
\[L^{\text{TRPO}}(\theta)=\hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}(a_{t}\mid s_{t})}{\pi_{\theta_{\text{old}}}(a_{t}\mid s_{t})}\hat{A}_{t}\right]\equiv\hat{\mathbb{E}}_{t}\left[r_{t}(\theta)\hat{A}_{t}\right].\]
Note, however, that maximizing this quantity can be trivially achieved by making the new policy drastically different from the old policy, which is not the desired behaviour. The PPO algorithm [49] deals with this by clipping the improvement ratio and taking the minimum:
\[L^{\text{CLIP}}(\theta)=\hat{\mathbb{E}}_{t}\left[\min\!\left(r_{t}(\theta)\hat{A}_{t},\text{clip}(r_{t}(\theta),1-\epsilon,1+\epsilon)\hat{A}_{t}\right)\right].\]
To train the value function estimator, a squared error is used, that is,
\[L^{\text{VF}}(\theta)=\hat{\mathbb{E}}_{t}\left[(V_{\pi_{\theta}}(s_{t})-V_{t}^{\text{targ}})^{2}\right],\]
and to encourage exploration, an entropic regularization functional \(S\) is used. This all amounts to a three-term cost function
\[L^{\text{PPO}}(\theta)=\hat{\mathbb{E}}_{t}\left[L^{\text{CLIP}}(\theta)-c_{1}L^{\text{VF}}(\theta)+c_{2}S[\pi_{\theta}](s_{t})\right],\]
where \(c_{1}\) and \(c_{2}\) are hyperparameters.
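The PPO implementation we use is that of Stable Baselines; purely to make the algebra above concrete, the three terms can be assembled as in the following sketch (a Gaussian policy is assumed, as in the text, and \(\epsilon\), \(c_{1}\), and \(c_{2}\) are placeholder hyperparameter values):

```python
import numpy as np

def gaussian_log_prob(a, mu, sigma2):
    """log pi_theta(a|s) for a scalar Gaussian policy with mean mu and variance sigma2."""
    return -0.5 * (np.log(2.0 * np.pi * sigma2) + (a - mu) ** 2 / sigma2)

def gaussian_entropy(sigma2):
    """Entropy of a scalar Gaussian, used as the exploration bonus S[pi_theta]."""
    return 0.5 * np.log(2.0 * np.pi * np.e * sigma2)

def ppo_objective(logp_new, logp_old, advantages, values, value_targets,
                  entropy, eps=0.2, c1=0.5, c2=0.01):
    """L^PPO = E[ L^CLIP - c1 L^VF + c2 S ], to be maximized with respect to theta."""
    ratio = np.exp(logp_new - logp_old)                       # r_t(theta)
    l_clip = np.minimum(ratio * advantages,
                        np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages)
    l_vf = (values - value_targets) ** 2                      # squared-error value loss
    return np.mean(l_clip - c1 * l_vf + c2 * entropy)
```

In practice, the library computes the log-probabilities, advantages, and value targets from rollouts of the current policy; the sketch only mirrors the form of \(L^{\text{PPO}}\).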
### Policy network architecture

The neural network is composed of two parts: a convolutional feature extractor, and a recurrent network to capture the temporal characteristics of the problem. The feature extractor comprises two parallel two-dimensional convolutional layers. The first convolutional layer has \(N_{k_{r}}\) kernels of size \(1\times L\) and aggregates along the replica dimension, enabling the collection of spin-wise statistics across the replicas. The second convolutional layer has \(N_{k_{s}}\) kernels of size \(N_{\text{reps}}\times 1\) and slides along the spin dimension, enabling the aggregation of replica-wise statistics across the spins. The outputs of these layers are flattened, concatenated, and fed into a dense layer of \(N_{d}\) hidden nodes. This operates as a latent-space encoding for input to a recurrent neural network (a long short-term memory, or LSTM, module [54]), used to capture the sequential nature of our application. The latent output of the LSTM module is of size \(N_{L}\). For simplicity, we set \(N_{k_{r}}=N_{k_{s}}=N_{d}=N_{L}=64\). All activation functions are hyperbolic tangent (tanh) activations. Since \(a_{t}\) can assume a continuum of real values, this task is referred to as having a continuous action space, and thus standard practice is for the network to output two values corresponding to the first and second moments of a normal distribution, which can be sampled to produce predictions.

### Reward

At the core of RL is the concept of reward engineering, that is, developing a reward scheme to inject a notion of success into the system. As we only care about reaching the ground state by the end of a given episode, we use a sparse reward scheme, with a reward of zero for every time step before the terminal step, and a reward equal to the negative of the minimum energy at the terminal step, that is,
\[R_{t}=\begin{cases}0,&t<N_{\text{steps}},\\-E_{\text{min}},&t=N_{\text{steps}}.\end{cases}\]
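Read literally, this reward can be written as the short function below (a minimal sketch, assuming the minimum energy is taken over the \(N_{\text{reps}}\) replica energies at the terminal step):

```python
import numpy as np

def sparse_reward(t, n_steps, replica_energies):
    """Zero before the terminal step; minus the lowest replica energy at the terminal step."""
    if t < n_steps:
        return 0.0
    return -float(np.min(replica_energies))
```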