Finding the ground state of (i.e., "solving") such systems is interesting from the perspective of thermodynamics, as one can observe phenomena such as phase transitions [5; 6], but it is also practically useful, as discrete optimization problems (e.g., the travelling salesperson problem or the knapsack problem) can be mapped to spin-glass models [7]. The Metropolis-Hastings algorithm [8; 9] can be used to simulate a spin glass at arbitrary temperature, and is therefore used ubiquitously for SA. By beginning the simulation at a high temperature and slowly cooling the system over time, one provides sufficient thermal energy to escape local minima and arrive at the ground-state "solution" to the problem. The challenge is to find a temperature schedule that minimizes computational effort while still arriving at a satisfactory solution: if the temperature is reduced too rapidly, the system becomes trapped in a local minimum, while reducing the temperature too slowly results in unnecessary computational expense. Kirkpatrick et al. [1; 10] proposed starting at a temperature that results in an 80% acceptance ratio (i.e., 80% of Metropolis spin flips are accepted) and reducing the temperature geometrically. They also recommended monitoring the objective function and reducing the cooling rate if the objective value (e.g., the energy) drops too quickly. More-sophisticated adaptive temperature schedules have also been investigated [11]. Nevertheless, in his 1987 paper, Bounds [12] said that "choosing an annealing schedule for practical purposes is still something of a black art".

With the advent of quantum computation and quantum control, establishing robust and dynamic scheduling of control parameters becomes even more relevant. For example, the same optimization problems that can be cast as classical spin glasses are also amenable to quantum annealing [13; 14; 15; 16; 17], which exploits, in lieu of thermal fluctuations, the phenomenon of quantum tunnelling [18; 19; 20] to escape local minima. Quantum annealing (QA) was proposed by Finnila et al. [21] and Kadowaki and Nishimori [22], and, in recent years, physical realizations of devices capable of performing QA (quantum annealers) have been developed [23; 24; 25; 26] and are being rapidly commercialized. As these technologies progress and become more commercially viable, practical applications [17; 27] will continue to be identified, and resource scarcity will spur the already extant discussion of the efficient use of annealing hardware [28; 29]. Nonetheless, there are still instances where the classical algorithm (SA) outperforms the quantum one (QA) [30], and improving the former should not be undervalued.

_In silico_ and hardware annealing solutions such as Fujitsu's FPGA-based Digital Annealer [31], NTT's laser-pumped coherent Ising machine (CIM) [32], and the quantum circuit model algorithm known as QAOA [33; 34] all demand the scheduling of control parameters, whether it is the temperature in the case of the Digital Annealer or the power of the laser pump in the case of the CIM. Heuristic methods based on trial-and-error experiments are commonly used to schedule these control parameters; an automatic approach could expedite development and improve the stability of such techniques.
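To make the scheduling problem concrete, the sketch below shows Metropolis-Hastings SA with a fixed geometric (Kirkpatrick-style) cooling schedule for an Ising-type energy function. It is a minimal illustration only: the energy convention, starting temperature, and cooling rate are assumptions of the sketch, not choices taken from this work.

```python
import numpy as np

def anneal(J, h, n_sweeps=1000, T0=10.0, alpha=0.95, rng=None):
    """Metropolis-Hastings SA with geometric cooling for the illustrative
    energy E(s) = -1/2 s.J.s - h.s (J symmetric, zero diagonal)."""
    rng = np.random.default_rng() if rng is None else rng
    L = len(h)
    s = rng.choice([-1, 1], size=L)               # random initial configuration
    T = T0                                        # in practice, tuned to ~80% acceptance
    for _ in range(n_sweeps):
        for _ in range(L):                        # one sweep = L attempted spin flips
            i = rng.integers(L)
            dE = 2.0 * s[i] * (J[i] @ s + h[i])   # energy change of flipping spin i
            if dE <= 0 or rng.random() < np.exp(-dE / T):
                s[i] = -s[i]                      # Metropolis acceptance rule
        T *= alpha                                # geometric cooling
    return s
```

The final line is the entire "schedule": every question of when, and by how much, to cool is settled in advance by the fixed parameters `T0` and `alpha`. It is exactly this decision that the RL agent introduced below learns to make dynamically.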
In this work, we demonstrate the use of a reinforcement learning (RL) method to learn the "black art" of classic SA temperature scheduling, and show that an RL agent is able to learn dynamic control parameter schedules for various problem Hamiltonians. The schedules that the RL agent produces are dynamic and reactive, adjusting to the current observations of the system to reach the ground state quickly and consistently, without _a priori_ knowledge of a given Hamiltonian. Our technique, aside from being directly useful for _in silico_ simulation, is an important milestone for future work in quantum information processing, including for hardware- and software-based control problems.

Figure 1: Two classes of Hamiltonian problems are depicted. (a) The weak-strong clusters (WSC) model comprises two bipartite clusters. The left cluster is biased upward; the right cluster is biased downward. All couplings are equal and of unit magnitude. The two clusters are coupled via the eight central nodes. This model exhibits a deep local minimum very close in energy to the model's global minimum. When initialized in the local minimum, the RL agent is able to learn schemes to escape the local minimum and arrive at the global minimum, without any explicit knowledge of the Hamiltonian. (b) Here we present an example spin-glass model. The nodes are coupled to nearest neighbours with random Gaussian-distributed coupling coefficients. The nodes are unbiased, and the couplings are changed at each instantiation of the model. The RL algorithm is able to learn a dynamic temperature schedule by observing the system throughout the annealing process, without explicit knowledge of the form of the Hamiltonian, and the learned policy can be applied to all instances of randomly generated couplings. We demonstrate this on variably sized spin glasses and investigate the scaling with respect to a classic linear SA schedule. In (c), we show snapshots of a sample progression of a configuration undergoing SA under the ferromagnetic Ising model Hamiltonian and a constant cooling schedule. The terminal state, all spins up, is the ground state; this anneal would be considered successful.

## II Reinforcement Learning

Reinforcement learning is a branch of dynamic programming whereby an agent, residing in state \(s_{t}\) at time \(t\), learns to take an action \(a_{t}\) that maximizes a cumulative reward signal \(R\) by dynamically interacting with an environment [35]. Through the training process, the agent arrives at a policy \(\pi\) that depends on some observation (or "state") of the system, \(s\). In recent years, neural networks have taken over as the _de facto_ function approximator for the policy. Deep reinforcement learning has seen unprecedented success, achieving superhuman performance in a variety of video games [36; 37; 38; 39], board games [40; 41; 42], and other puzzles [43; 44]. While many reinforcement learning algorithms exist, we have chosen to use proximal policy optimization (PPO) [45], implemented within Stable Baselines [46], for its competitive performance on problems with continuous action spaces.

## III The Environment

We developed an OpenAI gym [47] environment which serves as the interface to the "game" of simulated annealing. Let us now define some terminology and parameters important to simulated annealing. For a given Hamiltonian defining the interactions of \(L\) spins, we create \(N_{\text{reps}}\) randomly initialized replicas (unless otherwise specified). The initial spins of each replica are drawn from a Bernoulli distribution, with the probability of spin-up itself drawn from a uniform distribution. These independent replicas are annealed in parallel. The replicas follow an identical temperature schedule, with their uncoupled nature providing a mechanism for the statistics of the system to be represented through an ensemble of measurements.
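A minimal sketch of this replica initialization is shown below (NumPy; the function name is ours, and the choice to draw one spin-up probability per replica is an assumption, as the text does not specify whether the probability is shared across replicas).

```python
import numpy as np

def initialize_replicas(n_reps, n_spins, rng=None):
    """Each replica's spins are Bernoulli-distributed, with the spin-up
    probability itself drawn uniformly from [0, 1] (here, once per replica)."""
    rng = np.random.default_rng() if rng is None else rng
    p_up = rng.uniform(size=(n_reps, 1))                  # one probability per replica
    return np.where(rng.random(size=(n_reps, n_spins)) < p_up, 1, -1)
```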
In the context of the Metropolis-Hastings framework, we define one "sweep" to be \(L\) random spin flips (per replica), and one "step" to be \(N_{\text{sweeps}}\) sweeps. After every step, the environment returns an observation of the current state \(s_{t}\) of the system: an \(N_{\text{reps}}\times L\) array consisting of the binary spin values present. This observation can be used to make an informed decision about the action \(a_{t}\) that should be taken. The action, a single scalar value, corresponds to the total inverse-temperature change \(\Delta\beta\) to be carried out over the subsequent step. The chosen action is provided to the environment, and the process repeats until \(N_{\text{steps}}\) steps have been taken, comprising one full anneal, or "episode" in the language of RL. If the chosen action would result in the temperature becoming negative, no change is made to the temperature and the system continues to evolve at the previous temperature.

### Observations

For the classical version of the problem, an observation consists of the explicit spins of an ensemble of replicas. In the case of an unknown Hamiltonian, the ensemble measurement is important, as the instantaneous state of a single replica does not provide sufficient information about the current temperature of the system. Providing the agent with multiple replicas allows it to compute statistics and possibly infer the temperature. For example, if there is considerable variation among replicas, then the system is likely hot, whereas if most replicas look the same, the system is probably cool.

When discussing a quantum system, where the spins represent qubits, direct mid-anneal measurement of the system is not possible, as measurement causes a collapse of the wavefunction. To address this, we discuss experiments conducted in a "destructive observation" environment, where measurement of the spins is treated as a "one-time" opportunity for inclusion in RL training data. The subsequent observation is then based on a different set of replicas that have evolved through the same schedule, but from different initializations.
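The agent-environment interaction described in this section can be summarized in the following environment skeleton. This is a simplified sketch using the OpenAI gym API: the `Hamiltonian` helper object with its `random_replicas`, `metropolis_sweep`, and `min_energy` methods is a hypothetical stand-in, and the default parameter values are placeholders, not values taken from this work.

```python
import numpy as np
import gym
from gym import spaces

class AnnealEnv(gym.Env):
    """Each step applies the chosen inverse-temperature change, runs N_sweeps
    Metropolis sweeps on all replicas, and returns the (N_reps x L) spin array."""

    def __init__(self, hamiltonian, n_reps=64, n_sweeps=100, n_steps=40):
        self.H = hamiltonian
        self.n_reps, self.n_sweeps, self.n_steps = n_reps, n_sweeps, n_steps
        self.action_space = spaces.Box(low=-np.inf, high=np.inf, shape=(1,))
        self.observation_space = spaces.Box(
            low=-1, high=1, shape=(n_reps, self.H.L), dtype=np.int8)

    def reset(self):
        self.beta, self.t = 0.0, 0
        self.spins = self.H.random_replicas(self.n_reps)    # Bernoulli initialization
        return self.spins

    def step(self, action):
        delta_beta = float(action)
        if self.beta + delta_beta >= 0:          # reject changes that would make T negative
            self.beta += delta_beta
        for _ in range(self.n_sweeps):           # one "step" = N_sweeps sweeps
            self.spins = self.H.metropolis_sweep(self.spins, self.beta)
        self.t += 1
        done = self.t >= self.n_steps            # N_steps steps form one episode
        reward = -self.H.min_energy(self.spins) if done else 0.0   # sparse terminal reward
        return self.spins, reward, done, {}
```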
Figure 2: A neural network is used to learn the control parameters for several SA experiments. By observing a lattice of spins, the neural network can learn to control the temperature of the system in a dynamic fashion, annealing the system to the ground state. The spins at time \(t\) form the state \(s_{t}\) fed into the network. Two concurrent convolutional layers extract features from the state. These features are combined with a dense layer and fed into a recurrent module (an LSTM module) capable of capturing temporal characteristics. The LSTM module output is reduced to two parameters used to form the policy distribution \(\pi_{\theta}(a_{t}\mid s_{t})\), as well as to approximate the value function \(V(s_{t})\) used for the generalized advantage estimate.

## IV Reinforcement Learning Architecture

Through the framework of reinforcement learning, we wish to produce a policy function \(\pi_{\theta}(a_{t}\mid s_{t})\) that takes the observed binary spin state \(s_{t}\in\{-1,1\}^{N_{\text{reps}}\times L}\) and produces an action \(a_{t}\) corresponding to the optimal change in the inverse temperature. Here \(\pi\) is a distribution represented by a neural network, and the subscript \(\theta\) denotes parameterization by learnable weights \(\theta\). We define the function

\[\phi_{k}(s_{t})\in\{-1,1\}^{1\times L}\]

as an indexing function that returns the binary spin values of the \(k\)-th replica of state \(s_{t}\).

The neural network is composed of two parts: a convolutional feature extractor, and a recurrent network to capture the temporal characteristics of the problem. The feature extractor comprises two parallel two-dimensional convolutional layers. The first convolutional layer has \(N_{k_{r}}\) kernels of size \(1\times L\) and aggregates along the replica dimension, enabling the collection of spin-wise statistics across the replicas. The second convolutional layer has \(N_{k_{s}}\) kernels of size \(N_{\text{reps}}\times 1\) and slides along the spin dimension, enabling the aggregation of replica-wise statistics across the spins. The outputs of these layers are flattened, concatenated, and fed into a dense layer of \(N_{d}\) hidden nodes. This operates as a latent-space encoding for input to a recurrent neural network (a long short-term memory, or LSTM, module [48]), used to capture the sequential nature of our application. The latent output of the LSTM module is of size \(N_{L}\). For simplicity, we set \(N_{k_{r}}=N_{k_{s}}=N_{d}=N_{L}=64\). All activation functions are hyperbolic tangent (tanh) activations.

Since \(a_{t}\) can assume a continuum of real values, this task is referred to as having a continuous action space, and standard practice is for the network to output two values corresponding to the first and second moments of a normal distribution. During training, when exploration is desired, an entropic regularization term in the PPO cost function can be used to encourage a high variance (i.e., to encourage \(\sigma^{2}\) to remain sizable). Additionally, PPO requires an estimate of the generalized advantage function [49]: the difference between the reward received by taking action \(a_{t}\) while in state \(s_{t}\), and the expected value of the cumulative future reward prior to an action being taken. The latter, termed the "value function" \(V(s_{t})\), cannot be computed exactly, because we know only the reward from the action that was chosen and nothing about the actions that were not chosen, but we can estimate it using a third output of our neural network. Thus, as seen in Figure 2, the neural network in this work takes the state \(s_{t}\) and maps it to three scalar quantities, \(\mu\), \(\sigma^{2}\), and \(V(s_{t})\), defining the two moments of a normal distribution and an estimate of the value function, respectively.

At the core of RL is the concept of reward engineering, that is, developing a reward scheme to inject a notion of success into the system. As we only care about reaching the ground state by the end of a given episode, we use a sparse reward scheme: a reward of zero for every time step before the terminal step, and a reward equal to the negative of the minimum energy among the replicas at the terminal step, that is,

\[R_{t}=\begin{cases}0,&t<N_{\text{steps}},\\ -\min_{k}E\big(\phi_{k}(s_{t})\big),&t=N_{\text{steps}}.\end{cases}\]
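For concreteness, the sketch below renders the architecture described above in PyTorch. The authors train with PPO in Stable Baselines; the class and variable names here are ours, and the linear output heads with a log-standard-deviation parameterization are an assumption of the sketch rather than a detail stated in the text.

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Sketch of the feature extractor + LSTM policy/value network:
    N_kr = N_ks = N_d = N_L = 64, tanh activations, three scalar outputs."""

    def __init__(self, n_reps, n_spins, n_feat=64):
        super().__init__()
        # Kernel (1, L): spans all spins of a replica, slides along the replica axis.
        self.conv_reps = nn.Conv2d(1, n_feat, kernel_size=(1, n_spins))
        # Kernel (N_reps, 1): spans all replicas at a spin, slides along the spin axis.
        self.conv_spins = nn.Conv2d(1, n_feat, kernel_size=(n_reps, 1))
        self.dense = nn.Linear(n_feat * (n_reps + n_spins), n_feat)
        self.lstm = nn.LSTM(n_feat, n_feat, batch_first=True)
        # Heads for mu, log(sigma) of the action distribution, and V(s_t).
        self.head = nn.Linear(n_feat, 3)

    def forward(self, s, hidden=None):
        # s: (batch, N_reps, L) tensor of spins in {-1, +1}
        x = s.float().unsqueeze(1)                       # add channel dimension
        a = torch.tanh(self.conv_reps(x)).flatten(1)     # (batch, n_feat * N_reps)
        b = torch.tanh(self.conv_spins(x)).flatten(1)    # (batch, n_feat * L)
        z = torch.tanh(self.dense(torch.cat([a, b], dim=1)))
        z, hidden = self.lstm(z.unsqueeze(1), hidden)    # one annealing step per call
        mu, log_sigma, value = self.head(z.squeeze(1)).unbind(dim=-1)
        return mu, log_sigma, value, hidden
```

During training, \(\Delta\beta\) would be sampled from the resulting \(\mathcal{N}(\mu,\sigma^{2})\), while \(V(s_{t})\) feeds the generalized advantage estimate.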