In 2013, a significant breakthrough occurred in applying deep neural networks to reinforcement learning (RL). DeepMind, then a London-based startup, astounded the machine-learning community with a pioneering achievement: a deep neural network capable of playing Atari games with a level of skill that surpassed human players. This network is known as the deep Q-network (DQN).
What made DQN truly remarkable was its ability to learn and excel at 49 distinct Atari games with differing rules, objectives, and gameplay structures, all without any change to the network's architecture. To accomplish this feat, DeepMind combined numerous traditional RL concepts with a range of innovative techniques that proved pivotal to DQN's resounding success.
What is RL?
RL is fundamentally an interactive learning process involving an agent, an environment, and a reward signal. The agent's actions within the environment are guided by a policy aimed at maximizing the rewards it receives. In contrast to supervised and unsupervised learning, RL has no predefined dataset or labels; instead, it relies on feedback in the form of rewards from the environment. This approach appeals to the artificial intelligence (AI) community because it mirrors human learning, emphasizing interaction with the environment as the route to developing intelligent agents. Applications of RL span domains from self-driving cars to stock-market trading strategies, reflecting its significance in cutting-edge technologies.
However, from a deep learning perspective, RL presents distinct challenges. Unlike most deep learning applications, which rely on ample hand-labeled training data, RL algorithms must learn from sparse, noisy, and delayed scalar reward signals. The temporal gap between actions and rewards, which can span thousands of timesteps, is a formidable hurdle. Additionally, deep learning algorithms typically assume independent data samples, whereas RL frequently deals with sequences of highly correlated states. Moreover, in RL the data distribution shifts as the algorithm learns new behaviors, which is problematic for deep learning methods that assume a fixed underlying distribution.
Deep Q-Networks
In RL, Q-learning falls within the category known as value learning. The central concept in Q-learning is the Q-function, which represents the quality of a specific state-action pair: the maximum discounted future return obtained by performing a given action in a given state. This Q-value encapsulates the expected long-term reward, assuming every subsequent action is chosen optimally to maximize future rewards.
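In symbols (a standard textbook formulation, not quoted from the cited papers), with discount factor gamma and reward r_t at step t:

```latex
% Optimal Q-function: the largest expected discounted return achievable
% after taking action a in state s and then acting optimally under policy pi.
Q^{*}(s, a) = \max_{\pi} \, \mathbb{E}\!\left[\sum_{k \ge 0} \gamma^{k} r_{t+k} \;\middle|\; s_t = s,\; a_t = a,\; \pi\right]
```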
Naturally, a question arises as to how Q-values are acquired. Determining the quality of an action is challenging, even for humans, because it requires knowing what happens afterward: the expected future return depends on the long-term strategy. In other words, to value a state-action pair precisely, one must already know the values of the state-action pairs that follow it. This circularity gives rise to a fundamental challenge in Q-learning.
To address this challenge, Q-values are defined as a function of future Q-values, a relation known as the Bellman equation. It states that the maximum future return for taking an action equals the immediate reward plus the discounted maximum Q-value achievable from the next state. With this recursive connection, an update rule emerges that propagates correct Q-values backward from the future to the past, a process known as value iteration.
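Written out (again in standard notation rather than quoting the cited papers), with s' the next state and gamma the discount factor:

```latex
% Bellman equation: the optimal Q-value equals the immediate reward plus
% the discounted best Q-value available from the next state.
Q^{*}(s, a) = \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \right]
```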
Initially, Q-values are often inaccurate, but over repeated iterations each Q-value can be updated using the correct values from the future, and value iteration is guaranteed to converge to the optimal Q-values. However, a challenge arises from the sheer size of the Q-table in complex environments: value iteration requires sweeping over every state-action pair, which quickly becomes infeasible. In response, an alternative approach is introduced: approximating the Q-function. This removes the need to experience every state-action pair; instead, the agent learns a function that approximates the Q-function and can generalize beyond its own experiences.
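As a concrete illustration of the tabular case, here is a minimal Q-learning sketch in Python. The environment interface (`reset`/`step`) and the hyperparameters are assumptions made for illustration, not details from the cited work.

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions,
                       episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Minimal tabular Q-learning: propagate Bellman targets back through a Q-table."""
    # `env` is assumed to expose reset() -> state and step(action) -> (next_state, reward, done).
    q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration over the current Q estimates.
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(q[state]))
            next_state, reward, done = env.step(action)
            # Bellman backup: immediate reward plus discounted best future Q-value.
            target = reward + (0.0 if done else gamma * np.max(q[next_state]))
            q[state, action] += alpha * (target - q[state, action])
            state = next_state
    return q
```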
The idea of approximating the Q-function led to DeepMind's DQN, which employs a deep neural network to estimate Q-values directly from state images, offering a robust and scalable Q-learning solution. Training DQN involves minimizing the difference between the network's predicted Q-values and the targets given by the Bellman equation (the immediate reward plus the discounted maximum Q-value of the next state), an objective optimized with stochastic gradient descent.
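A minimal PyTorch sketch of that objective, assuming a hypothetical `q_net` that maps a batch of preprocessed state tensors to per-action Q-values; the architecture, preprocessing, loss choice, and hyperparameters of the original DQN are not reproduced here.

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, batch, gamma=0.99):
    """Naive TD loss: pull Q(s, a) toward r + gamma * max_a' Q(s', a') using one network."""
    # q_net is an assumed torch.nn.Module; batch holds tensors of equal length.
    states, actions, rewards, next_states, dones = batch
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped Bellman target computed from the same network --
        # the source of the instability discussed next.
        q_next = q_net(next_states).max(dim=1).values
        target = rewards + gamma * q_next * (1.0 - dones)
    return F.mse_loss(q_pred, target)
```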
However, issues related to learning stability emerge due to the high correlation between Q-value updates. To address these stability issues, two engineering solutions are employed: the target Q-network and experience replay. The target Q-network reduces co-dependence by introducing a second network that lags in parameter updates. Experience replay breaks up the correlation of data by randomly sampling from the agent's past experiences, ensuring more representative batch gradients.
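A sketch of those two mechanisms, under the same assumptions as the loss above; the capacity, sync frequency, and names are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of past transitions; sampling at random breaks up
    the temporal correlation between consecutive experiences."""
    def __init__(self, capacity=100_000):  # capacity chosen for illustration
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def sync_target(q_net, target_net):
    """Copy the online network's weights into the lagging target network;
    called only every few thousand gradient steps so the targets stay stable."""
    target_net.load_state_dict(q_net.state_dict())
```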
Ultimately, while Q-learning is a value-learning algorithm, it indirectly defines a policy. By acting greedily with respect to the Q-function, that is, choosing the action with the maximum Q-value in each state, an agent can navigate its environment effectively. DQN also addresses the limits of the Markov assumption by incorporating state history: it uses several past game frames, stacked together, as the current state, so that time-dependent information is available. This engineering solution enables DQN to handle scenarios where a single frame is not enough, such as in games like Pong, where the ball's direction of motion matters.
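A sketch of acting greedily with the learned Q-function while stacking recent frames as the state; the four-frame stack mirrors DQN's setup, while the class, input shape, and tensor handling are illustrative assumptions.

```python
from collections import deque

import numpy as np
import torch

class FrameStackPolicy:
    """Acts greedily on Q-values computed from the last k frames stacked as channels."""
    def __init__(self, q_net, k=4):
        # q_net is an assumed module accepting (1, k, H, W) float tensors.
        self.q_net = q_net
        self.frames = deque(maxlen=k)

    def act(self, frame):
        # frame: a 2-D numpy array, e.g. a preprocessed 84x84 grayscale screen.
        self.frames.append(frame)
        while len(self.frames) < self.frames.maxlen:
            self.frames.append(frame)  # pad with copies at the start of an episode
        state = torch.from_numpy(np.stack(self.frames, axis=0)).unsqueeze(0).float()
        with torch.no_grad():
            q_values = self.q_net(state)
        return int(q_values.argmax(dim=1).item())
```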
In short, RL is this agent-environment interaction loop: the agent acts to maximize cumulative reward, the environment responds with rewards or penalties, and that feedback drives the agent's adjustments, leading to improved performance over time.
Moving beyond DQN
In 2013, DQN made significant strides in solving Atari tasks, but it came with notable limitations: protracted training times, suboptimal performance on certain games, and the need to retrain for each new game. Much of the research in deep reinforcement learning since then has focused on mitigating these shortcomings.
Deep Recurrent Q-Networks: DQN worked around the Markov assumption by stacking four consecutive frames as separate channels, but this ad-hoc approach constrained the model's generality. To handle arbitrary sequences of related data, deep recurrent Q-networks (DRQNs) incorporate a recurrent layer that carries state across time steps, enabling the model to decide how informative individual frames are and even to retain long-term information. Extensions such as the Deep Attention Recurrent Q-Network (DARQN) further enhance DRQN by incorporating neural attention mechanisms. DRQN excels in first-person shooter (FPS) games and games with extended time dependencies, such as Seaquest.
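A minimal sketch of the DRQN idea: a convolutional encoder feeding an LSTM whose hidden state carries information across time steps. The layer sizes and the 84x84 grayscale input shape are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Convolutional encoder + recurrent layer; the LSTM hidden state carries
    information across time steps instead of stacking frames as channels."""
    def __init__(self, n_actions, hidden_size=256):  # sizes chosen for illustration
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(input_size=64 * 9 * 9, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 1, 84, 84) -> per-step Q-values plus the updated hidden state.
        b, t = frames.shape[:2]
        feats = self.encoder(frames.reshape(b * t, *frames.shape[2:])).reshape(b, t, -1)
        out, hidden = self.lstm(feats, hidden)
        return self.head(out), hidden
```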
Double DQN: The overestimation of Q-values by DQN stems from two sources. Firstly, it employs the maximum discounted return instead of the expected discounted return in the Bellman equation. Secondly, it uses its own estimated Q-values at two time steps for the temporal-difference calculation, which exacerbates the overestimation issue.
Numerous algorithms have been proposed to tackle this issue; Double DQN stands out as a representative solution. In addition, approaches such as prioritized experience replay (PER), multistep bootstrap targets, the normalized advantage function (NAF), noisy DQN, and categorical DQN have contributed to improving DQN's performance.
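For concreteness, here is a sketch of the Double DQN target, which decouples action selection (online network) from action evaluation (target network); the tensor names follow the earlier sketches and are illustrative.

```python
import torch

def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Select the next action with the online network, evaluate it with the target
    network; this decoupling reduces the upward bias of max-based targets."""
    # q_net and target_net are assumed modules mapping states to per-action Q-values.
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        q_next = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * q_next * (1.0 - dones)
```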
Asynchronous Advantage Actor-Critic (A3C): A3C offers a different approach to deep reinforcement learning. Its asynchronous design runs many agents in parallel across threads, significantly expediting training while increasing the diversity of experience in each batch. A3C builds on actor-critic methods, bridging value learning and policy learning, and uses the advantage function, which measures how much better an action turned out to be than the critic's value estimate predicted. This approach transformed deep reinforcement learning benchmarks, enabling agents to master games like Atari Breakout in less than 12 hours, a remarkable improvement over DQN's prolonged training times.
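A sketch of one common way to compute the advantage signal used by the actor, as an n-step discounted return minus the critic's value estimate; this is a generic formulation for illustration, not A3C's exact implementation.

```python
import torch

def advantage_estimates(rewards, values, bootstrap_value, gamma=0.99):
    """Advantage = discounted return actually observed minus the critic's prediction,
    i.e. how much better (or worse) the taken actions were than expected."""
    # rewards: list of floats; values: tensor of V(s_t) estimates; bootstrap_value: V of the last state.
    returns = []
    ret = bootstrap_value
    for r in reversed(rewards):  # accumulate discounted returns backward in time
        ret = r + gamma * ret
        returns.append(ret)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)
    return returns - values
```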
Unsupervised Reinforcement and Auxiliary Learning (UNREAL): UNREAL builds upon A3C and tackles the challenge of reward sparsity. It strives to extract meaningful information from the environment without relying solely on rewards, introducing unsupervised auxiliary tasks that enrich its learning objectives. By integrating these unsupervised components, UNREAL learns faster, demonstrating the value of building strong representations of the world and of leveraging unsupervised learning when reward signals are scarce.
References
Nithin Buduma, Nikhil Buduma, and Joe Papa. (2022). Fundamentals of Deep Learning, Second Edition, O’Reilly Media, Inc.
Ning Wang, Zhe Li, Xiaolong Liang, Yueqi Hou, and Aiwu Yang. (2023). A Review of Deep Reinforcement Learning Methods and Military Application Research, Mathematical Problems in Engineering, Article ID 7678382. DOI: https://doi.org/10.1155/2023/7678382
Huang, Y. (2020). Deep Q-Networks. In: Dong, H., Ding, Z., Zhang, S. (eds) Deep Reinforcement Learning. Springer, Singapore. DOI: https://doi.org/10.1007/978-981-15-4095-0_4
Mnih, V., et al. (2013). Playing Atari with Deep Reinforcement Learning. arXiv preprint arXiv:1312.5602. DOI: https://doi.org/10.48550/arXiv.1312.5602