In a paper published in the journal Machine Learning: Science and Technology, researchers investigated the trainability of parameterized quantum circuit (PQC)-based policies in reinforcement learning (RL), addressing challenges such as barren plateaus (BPs) and gradient explosions.
Their findings showed that trainability depended on how basis states were partitioned and mapped to actions. For a polynomial number of actions, a window of trainability was achievable with a polynomial number of measurements when a contiguous-like partitioning was used. These results were validated in a multi-armed bandit environment.
Background
Past work on variational quantum algorithms (VQAs) has shown their potential in RL, though they face significant trainability challenges, particularly due to BPs and gradient-explosion issues. Researchers found that the trainability of PQC-based policies is heavily influenced by the locality of observables and the size of the action space, with polynomial action spaces allowing for limited trainability. Empirical studies confirmed these findings, showing that while some policies can learn effectively within polynomial action spaces, they fail as the number of actions grows beyond this threshold.
Quantum Optimization
Quantum policy gradient algorithms optimize a parameterized policy to maximize performance without relying on a value function, typically using the REINFORCE algorithm (REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility) with a baseline to reduce variance in gradient estimation. This work considers PQC-generated policies, focusing on two Born policy variants, contiguous-like and parity-like, and introduces the concept of locality in measurement.
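To make the distinction concrete, the sketch below shows one way the two Born-policy variants can map measured basis-state probabilities onto action probabilities. The NumPy functions and the particular parity grouping are illustrative assumptions for this summary, not the exact partitions defined in the paper.

```python
import numpy as np

def contiguous_action_probs(probs, num_actions):
    """Contiguous-like Born policy: the 2^n basis-state probabilities are split
    into |A| consecutive blocks, so the action depends only on the leading
    log2|A| bits of the measured bitstring (a local measurement).
    pi(a) = sum of probabilities inside block a."""
    return np.array([block.sum() for block in np.array_split(probs, num_actions)])

def parity_action_probs(probs, num_actions):
    """Parity-like Born policy (illustrative variant): the action is read off
    from parities of bit groups spread over the whole register, i.e. a global
    measurement."""
    n_qubits = int(np.log2(len(probs)))
    m = int(np.log2(num_actions))
    pi = np.zeros(num_actions)
    for idx, p in enumerate(probs):
        bits = [(idx >> k) & 1 for k in range(n_qubits)]
        action = 0
        for j in range(m):
            action |= (sum(bits[j::m]) % 2) << j   # parity of the j-th bit group
        pi[action] += p
    return pi

# Example: uniform state over 3 qubits, 2 actions
probs = np.full(8, 1 / 8)
print(contiguous_action_probs(probs, 2))  # [0.5, 0.5]
print(parity_action_probs(probs, 2))      # [0.5, 0.5]
```

The point of the contrast is locality: the contiguous grouping depends only on the leading log2|A| bits of the measured bitstring, whereas the parity grouping depends on bits spread across the whole register.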
The parameter-shift rule is used for gradient estimation, leveraging quantum hardware to compute partial derivatives efficiently. The gradient-ascent process updates the policy parameters iteratively, and the REINFORCE algorithm is outlined to integrate these quantum policies into the RL loop.
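A minimal sketch of how these pieces combine is given below, assuming each parameter enters the circuit through a single Pauli rotation so the two-point parameter-shift rule applies; `policy_probs` is a placeholder for the PQC returning Born-policy action probabilities, and the single-state (bandit-style) form of REINFORCE is used for brevity.

```python
import numpy as np

def parameter_shift_grad(policy_probs, theta, action):
    """d log pi(action) / d theta via the two-point parameter-shift rule,
    assuming each theta_k parameterizes a single Pauli-rotation gate."""
    pi_a = policy_probs(theta)[action]
    grad = np.zeros_like(theta)
    for k in range(len(theta)):
        shifted_plus, shifted_minus = theta.copy(), theta.copy()
        shifted_plus[k] += np.pi / 2
        shifted_minus[k] -= np.pi / 2
        d_pi = 0.5 * (policy_probs(shifted_plus)[action]
                      - policy_probs(shifted_minus)[action])
        grad[k] = d_pi / pi_a            # log-derivative divides by pi(a)
    return grad

def reinforce_step(theta, batch, policy_probs, lr=0.1):
    """One REINFORCE update with a mean-return baseline.
    `batch` is a list of (action, return) pairs sampled from the current policy."""
    baseline = np.mean([G for _, G in batch])
    step = np.zeros_like(theta)
    for action, G in batch:
        step += (G - baseline) * parameter_shift_grad(policy_probs, theta, action)
    return theta + lr * step / len(batch)
```

Note that the log-derivative divides by π(a), which is exactly where exploding gradients can arise once action probabilities become exponentially small.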
Trainability Challenges in QRL
This section explores the trainability challenges of quantum policy gradient algorithms, focusing specifically on contiguous-like and parity-like PQC-based Born policies. The analysis begins by examining product states to understand the variance behavior of the log-policy gradient, emphasizing how the number of qubits and actions affects the globality of observables.
The study then extends to entangled states, demonstrating that while entangled states may avoid BPs, they often suffer from exploding gradients due to exponentially diminishing probabilities as the number of qubits increases. The variance of the log policy gradient is shown to scale with the number of qubits and actions, posing significant challenges for trainability, especially in cases with large action spaces.
The discussion includes a detailed analysis of the variance's dependence on the number of actions, highlighting the different trainability profiles of contiguous-like and parity-like Born policies. While parity-like policies are prone to BPs and exploding gradients as the number of actions increases, contiguous-like policies demonstrate a window of trainability with smaller action spaces due to their reliance on local measurements.
The study also considers the Fisher information matrix (FIM) spectrum, which is crucial for characterizing BPs in quantum RL (QRL). The FIM's spectrum can indicate BPs, but even in scenarios where BPs are avoided, the need for an exponentially large number of measurements and the risk of exploding gradients still pose substantial trainability issues, especially as the number of actions grows.
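As a rough illustration of how the FIM can be assembled once log-policy gradients are available, a short sketch follows; `grad_log_policy` stands in for any gradient estimator (for example the parameter-shift sketch above), and the spectral reading in the comments paraphrases the article's qualitative claims rather than a formal criterion.

```python
import numpy as np

def fisher_information_matrix(policy_probs, grad_log_policy, theta):
    """Classical Fisher information of the policy at theta:
    F = sum_a pi(a) * grad_log_pi(a) grad_log_pi(a)^T."""
    pi = policy_probs(theta)
    F = np.zeros((len(theta), len(theta)))
    for action, p in enumerate(pi):
        if p > 0:
            g = grad_log_policy(theta, action)
            F += p * np.outer(g, g)
    return F

# A spectrum with almost all eigenvalues near zero indicates a flat,
# barren-plateau-like landscape; a few very large eigenvalues point to
# ill-conditioned directions where gradients can explode.
# eigvals = np.linalg.eigvalsh(fisher_information_matrix(pp, glp, theta))
```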
Trainability Evaluation
This section evaluates trainability issues in quantum policy gradients through two key experimental scenarios. First, a simplified two-design ansatz is used to investigate how policy types and action ranges influence the variance of the log-likelihood's partial derivatives; this includes examining the FIM spectrum for different policies and highlighting variance trends as functions of the number of actions and qubits. The second scenario involves a multi-armed bandit environment to assess the practical performance of Born policies in RL contexts, determining the effectiveness of quantum policies in distinguishing optimal actions through sampling.
For the simplified two-design task, contiguous and parity-like Born policies are analyzed. Although contiguous-like policies exhibit increasing variance with the number of actions due to diminishing probabilities, they show a polynomial decay in variance with qubit count, particularly under polynomial probability clipping. In contrast, parity-like policies suffer from exponentially vanishing variance with increasing qubits, which can lead to gradient explosions and trainability issues as the number of actions grows.
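A sketch of how such variance curves can be estimated empirically is given below, reusing the `parameter_shift_grad` helper from the earlier sketch; the uniform parameter range and the sample count are illustrative choices rather than the paper's exact protocol.

```python
import numpy as np

def log_policy_grad_variance(policy_probs, num_params, action=0,
                             samples=200, seed=0):
    """Estimate Var_theta[ d log pi(action)/d theta_0 ] over random
    parameter initializations, a common proxy for trainability."""
    rng = np.random.default_rng(seed)
    derivs = []
    for _ in range(samples):
        theta = rng.uniform(0.0, 2.0 * np.pi, num_params)
        derivs.append(parameter_shift_grad(policy_probs, theta, action)[0])
    return np.var(derivs)
```

Sweeping this estimate over qubit counts and action-space sizes yields the kind of scaling comparison described above.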
Due to their local measurement nature, contiguous-like policies generally select the optimal arm more reliably than parity-like policies. However, as the number of arms increases, both policies struggle to learn effectively, highlighting the limitations of polynomial measurements in more complex settings. The variance of the log-policy gradient remains low but inadequate for effective learning, illustrating the broader challenge of scaling quantum policy gradients to more complicated scenarios.
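To tie the pieces together, the sketch below wires a Born policy and the `reinforce_step` helper from the earlier sketches into a simple Gaussian-reward multi-armed bandit; the environment, batch size, and learning rate are illustrative assumptions and not the paper's experimental setup.

```python
import numpy as np

def run_bandit(policy_probs, theta, arm_means, iterations=500,
               batch_size=10, lr=0.1, seed=0):
    """Train a Born policy on a stochastic multi-armed bandit with REINFORCE."""
    rng = np.random.default_rng(seed)
    num_arms = len(arm_means)
    for _ in range(iterations):
        batch = []
        for _ in range(batch_size):
            pi = policy_probs(theta)
            arm = rng.choice(num_arms, p=pi / pi.sum())   # sample an arm
            reward = rng.normal(arm_means[arm], 1.0)      # noisy arm reward
            batch.append((arm, reward))
        theta = reinforce_step(theta, batch, policy_probs, lr=lr)
    return theta   # a trained policy should concentrate mass on the best arm
```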
Conclusion
To sum up, the research highlighted key trainability challenges for PQC-based policies in policy-based RL, focusing on contiguous-like and parity-like Born policies. Both the number of qubits and the size of the action space influenced issues such as standard BPs with vanishing gradients, as well as gradient explosions.
The study found that contiguous-like policies could be trainable within polynomial bounds, while parity-like policies faced inherent limitations. Classical post-processing and reward sparsity also impacted trainability, and further exploration was needed to address softmax policies and to balance trainability with classical simulation efficiency.