Q-learning, which began as an incremental algorithm for estimating an optimal decision strategy in an infinite-horizon decision problem, now refers to a general class of widely used reinforcement learning (RL) methods. This article discusses single-agent and multi-agent Q-learning algorithms and the major applications of Q-learning.
Q-learning Basics
In a sequential decision setting, one approach to identifying an optimal decision strategy is to estimate the optimal Q-function, which gives the expected cumulative utility of every available decision. Q-learning is an off-policy control method: it decouples the behavior policy used to gather experience from the target policy being learned, selecting actions with an ε-greedy policy and updating values according to the Bellman optimality equation. Because its Q-function update is simple, Q-learning has become the basis of many RL algorithms.
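To make the update concrete, the following minimal Python sketch implements tabular Q-learning with an ε-greedy behavior policy. The environment interface (reset, step, and actions methods) is an assumption for illustration, not something specified in the sources above.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning sketch: the Q-table maps (state, action) to a value.

    Assumes a hypothetical env with env.reset() -> state,
    env.step(a) -> (next_state, reward, done), and
    env.actions(state) -> list of available actions.
    """
    Q = defaultdict(float)

    def greedy_action(state):
        return max(env.actions(state), key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy: explore with probability epsilon
            if random.random() < epsilon:
                action = random.choice(env.actions(state))
            else:
                action = greedy_action(state)

            next_state, reward, done = env.step(action)

            # Bellman optimality update: bootstrap from the best next action,
            # regardless of what the behavior policy does next (off-policy)
            best_next = 0.0 if done else max(
                Q[(next_state, a)] for a in env.actions(next_state))
            td_target = reward + gamma * best_next
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])

            state = next_state
    return Q
```

The off-policy character is visible in the update: the bootstrap term always uses the maximizing next action, even when the behavior policy explores instead.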
However, early Q-learning algorithms suffered from a severe storage problem: the table of stored values grows with the number of states and actions, and the available storage space quickly becomes insufficient, preventing the problem from being solved. Effective learning is therefore difficult for complex learning problems with large state-action spaces.
Moreover, the state space to be stored is far larger in multi-agent environments than in single-agent environments. This storage consumes a substantial portion of the computer's memory, and the computer may fail to produce a correct answer. Many Q-learning algorithms have been created to tackle this challenge in diverse environments.
For instance, deep Q-learning, developed by Google DeepMind in 2015, is among the most popular algorithms for single-agent environments. In modular Q-learning, a multi-agent Q-learning algorithm, a single learning problem is split into multiple parts and a Q-learning algorithm is applied to each part.
In the ant Q-learning method, agents share value estimates, much as a colony of ants collectively abandons lower-reward paths and converges on higher-reward ones. This approach obtains the reward values of actions efficiently in a multi-agent environment. Nash Q-learning, a modification of the basic Q-learning algorithm, is likewise effective for multi-agent environments.
Single-agent Q-learning Algorithms
Deep Q-learning: Deep Q-learning merges convolutional neural networks (CNNs) with basic Q-learning. When representing the value function for every state explicitly becomes infeasible, Q-learning approximates it with a CNN. Beyond value approximation with a CNN, deep Q-learning combines two further techniques: experience replay and the target Q (target network) technique.
Value approximation with a neural network (NN) is highly unstable on its own; experience replay is used to stabilize it. Consecutive states, actions, and rewards all depend on the states that preceded them, creating correlations among actions, rewards, and states.
These correlations prevent the approximation function from learning stably. Experience replay removes them by storing experience in a buffer and sampling the learning data at random. In addition, the target Q technique maintains the Q-network and a separate target network: the target value is computed with the target network, and the Q-network is trained toward that target, which further reduces correlations and stabilizes learning.
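A minimal PyTorch sketch of these two stabilization techniques follows. The assumption that states are stored as tensors, the use of a mean-squared-error loss, and all names and parameter values are illustrative choices, not details taken from DeepMind's implementation.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class ReplayBuffer:
    """Experience replay: store transitions and sample them uniformly at
    random, breaking the temporal correlations between consecutive steps."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.stack(states),
                torch.tensor(actions),
                torch.tensor(rewards, dtype=torch.float32),
                torch.stack(next_states),
                torch.tensor(dones, dtype=torch.float32))

def dqn_update(q_net, target_net, optimizer, buffer, batch_size=32, gamma=0.99):
    """One gradient step: the target value comes from the frozen target
    network, and only the online Q-network is updated."""
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)

    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def sync_target(q_net, target_net):
    """Periodically copy the online network's weights into the target network."""
    target_net.load_state_dict(q_net.state_dict())
```

In practice, sync_target would be called only every fixed number of gradient steps so that the targets change slowly.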
Hierarchical Q-learning: This technique addresses the problems caused by Q-learning's growing state-action space. It extends basic Q-learning by adding hierarchical processing to the existing Q-learning system. Hierarchical Q-learning leverages the concept of abstract actions, decomposing the agent's behavior into a lower level and a higher level.
For instance, if the actions the agent can select are right, left, down, and up, and the goal is to reach a target point, then the movement to the target point belongs to the higher level, while the individual right, left, down, and up movements constitute the lower level.
This hierarchical division of the agent's actions is known as abstract action. Conventional Q-learning algorithms face various problems when solving complex environments, but hierarchical Q-learning solves complex problems effectively using abstract actions, shortening the processing time for complex problems.
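The sketch below illustrates this two-level structure for the example above: a high-level table over abstract actions (subgoals) and a low-level table over primitive moves conditioned on the active subgoal. It is a simplified illustration that ignores the semi-Markov discounting a full treatment would require, and all names are assumptions.

```python
import random
from collections import defaultdict

PRIMITIVE_ACTIONS = ["up", "down", "left", "right"]

class HierarchicalAgent:
    """Two-level sketch of abstract actions: the high level picks a subgoal
    (e.g. 'go to the target point'), and the low level learns which
    primitive moves achieve that subgoal."""

    def __init__(self, subgoals, epsilon=0.1):
        self.subgoals = subgoals
        self.epsilon = epsilon
        self.q_high = defaultdict(float)  # (state, subgoal) -> value
        self.q_low = defaultdict(float)   # (subgoal, state, primitive) -> value

    def select_subgoal(self, state):
        if random.random() < self.epsilon:
            return random.choice(self.subgoals)
        return max(self.subgoals, key=lambda g: self.q_high[(state, g)])

    def select_primitive(self, subgoal, state):
        if random.random() < self.epsilon:
            return random.choice(PRIMITIVE_ACTIONS)
        return max(PRIMITIVE_ACTIONS,
                   key=lambda a: self.q_low[(subgoal, state, a)])

    def update_low(self, subgoal, state, action, reward, next_state,
                   alpha=0.1, gamma=0.99):
        # standard Q-learning update, restricted to the active subgoal
        best_next = max(self.q_low[(subgoal, next_state, a)]
                        for a in PRIMITIVE_ACTIONS)
        target = reward + gamma * best_next
        self.q_low[(subgoal, state, action)] += alpha * (
            target - self.q_low[(subgoal, state, action)])

    def update_high(self, state, subgoal, total_reward, next_state,
                    alpha=0.1, gamma=0.99):
        # the high level is updated with the reward accumulated while the
        # abstract action was being executed
        best_next = max(self.q_high[(next_state, g)] for g in self.subgoals)
        target = total_reward + gamma * best_next
        self.q_high[(state, subgoal)] += alpha * (
            target - self.q_high[(state, subgoal)])
```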
Double Q-learning: Q-learning typically struggles in stochastic environments, and double Q-learning was developed to solve this problem. In a stochastic environment, Q-learning is biased because it overestimates the agent's action values.
Conventional Q-learning stops searching for a new optimal value after a certain point and merely selects the highest of its existing estimates over and over. Double Q-learning addresses this by splitting the valuation into two Q-functions, so that the function that chooses an action is not the one that evaluates it, preventing the upward deviation of values in the Q-learning algorithm.
Double Q-learning maintains two Q-functions, and each is updated using the value of the other. Although each Q-function learns from a separate set of experiences, both value functions are used when selecting an action.
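A minimal sketch of the tabular update in the spirit of van Hasselt's formulation follows; the dictionary-based tables and the use of the summed values for action selection are illustrative choices.

```python
import random
from collections import defaultdict

def double_q_update(qa, qb, state, action, reward, next_state, actions,
                    alpha=0.1, gamma=0.99):
    """One tabular double Q-learning update.

    qa and qb are dicts (e.g. defaultdict(float)) mapping (state, action)
    to values. One table chooses the best next action, the other evaluates
    it, which counteracts the overestimation bias of standard Q-learning.
    """
    if random.random() < 0.5:
        qa, qb = qb, qa  # update the other table half of the time
    # selection by qa, evaluation by qb
    best_next = max(actions, key=lambda a: qa[(next_state, a)])
    target = reward + gamma * qb[(next_state, best_next)]
    qa[(state, action)] += alpha * (target - qa[(state, action)])

def behavior_action(qa, qb, state, actions, epsilon=0.1):
    """Both tables contribute to action selection (here via their sum)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: qa[(state, a)] + qb[(state, a)])
```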
Double Q-learning has also been combined with deep Q-learning to produce double deep Q-learning, which improves deep Q-learning's performance by avoiding overly optimistic predictions and the divergence of the Q-values that express future values.
Multi-agent Q-learning Algorithms
Modular Q-learning: Modular Q-learning has been introduced to address the inefficiency of basic Q-learning in multi-agent systems. Modular Q-learning effectively solves Q-learning’s large state-space problem by decomposing a large problem into smaller problems and applying Q-learning to every sub-problem.
In the agent's action-selection stage, every learning module provides Q-values for the actions available in the current state. A mediator module then selects the most suitable action and executes it. The resulting reward from the environment is stored in every module, and each module's Q-function value is updated anew.
However, modular Q-learning uses fixed, engineer-assigned modules and combines them with a simple fixed rule, the greatest mass (GM) approach. The key disadvantage of modular Q-learning is therefore the difficulty of learning efficiently in a rapidly changing environment.
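The following sketch shows how a mediator could combine per-module Q-values with the GM rule, summing them and choosing the action with the greatest total. The class and method names, and the assumption that every module shares the same environment reward, are illustrative.

```python
from collections import defaultdict

class ModularQAgent:
    """Sketch of modular Q-learning: each module keeps its own Q-table over
    its sub-problem, and a mediator combines the modules' Q-values with the
    greatest-mass (GM) rule, i.e. it sums them and picks the argmax."""

    def __init__(self, num_modules, actions):
        self.actions = actions
        self.modules = [defaultdict(float) for _ in range(num_modules)]

    def mediate(self, module_states):
        # GM rule: the action with the greatest summed Q-value across modules
        def mass(action):
            return sum(q[(s, action)]
                       for q, s in zip(self.modules, module_states))
        return max(self.actions, key=mass)

    def update(self, module_states, action, reward, next_states,
               alpha=0.1, gamma=0.99):
        # the same environment reward is stored in (used by) every module,
        # and each module updates its own Q-table
        for q, s, s_next in zip(self.modules, module_states, next_states):
            best_next = max(q[(s_next, a)] for a in self.actions)
            q[(s, action)] += alpha * (
                reward + gamma * best_next - q[(s, action)])
```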
Ant Q-learning: Ant Q-learning combines Q-learning and an ant system (AS), which is an algorithmic representation of ants deciding on their paths back to their nest after finding food.
Ant Q-learning primarily extends existing ASs in that learning is performed by a set of cooperating agents rather than a single agent as in the usual Q-learning method; the cooperating agents exchange AQ-values with one another. The goal of ant Q-learning is to learn AQ-values that realize a stochastically superior target value.
Because the agents in ant Q-learning cooperate efficiently with one another, learning with multiple agents makes it possible to find the reward value of a specific action effectively in a multi-agent environment. Its major disadvantage is that the results can become stuck in a local optimum, since the agents converge on selecting only the shortest path.
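A simplified sketch of the shared AQ-value update is given below. It keeps the Q-learning-style bootstrap over the next node's outgoing edges but omits the heuristic visibility term and the pseudo-random-proportional action rule of the full Ant-Q algorithm, and all parameter values are illustrative.

```python
import random

def update_aq(AQ, r, s, delta_aq, next_nodes, alpha=0.1, gamma=0.3):
    """Ant-Q-style update of the AQ-value shared by all cooperating agents.

    AQ maps (node_r, node_s) edges to AQ-values; delta_aq is the
    reinforcement (e.g. derived from the best tour found so far), and the
    discounted maximum over the next node's outgoing edges plays the role
    of the Q-learning bootstrap term.
    """
    best_next = max((AQ.get((s, z), 0.0) for z in next_nodes), default=0.0)
    AQ[(r, s)] = (1 - alpha) * AQ.get((r, s), 0.0) + alpha * (
        delta_aq + gamma * best_next)

def choose_next_node(AQ, current, candidates, epsilon=0.1):
    """Each agent mostly exploits the shared AQ-values, occasionally exploring."""
    if random.random() < epsilon:
        return random.choice(candidates)
    return max(candidates, key=lambda z: AQ.get((current, z), 0.0))
```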
Q-learning Applications
Owing to its effectiveness, the Q-learning algorithm is applied in diverse domains, including image recognition, control theory, operations research, robotics, game theory, computer networking, and industrial processes.
Computer Networking: Wireless sensor networks must cope with rapidly evolving dynamic behavior caused by system designers or external factors. Q-learning improves performance by eliminating the need for repeated system redesign and by enhancing adaptability to changing situations.
Network control based on the Q-learning algorithm has inspired several practical solutions that prolong network lifetime and maximize resource utilization. Recently, ant-system-based approaches have also been used in networking for object tracking and monitoring in various environments.
Game Theory: Google DeepMind has played a significant role in this area. DeepMind applied deep Q-learning to arcade games, enabling the agents to identify optimal moves by themselves and ultimately outperform human players.
In online multiplayer games, learning can occur in real time using data measured along a player's trajectory, optimizing the game's performance and creating a human-agent feedback loop. Moreover, a Q-learning algorithm can maximize the total system profit in stochastic cooperative game theory. In recent years, Q-learning has also been used in mobile application games.
Robotics: Q-learning provides toolkits and frameworks for designing difficult and sophisticated engineering behavior in robotics. By leveraging Q-learning, autonomous robots have achieved significant advances in behavioral capability with minimal human interference.
In conclusion, Q-learning performs well across diverse environments, but realizing its full potential depends on overcoming remaining limitations such as large state-action spaces and instability in stochastic or multi-agent settings.
References and Further Reading
Clifton, J., & Laber, E. (2020). Q-learning: Theory and applications. Annual Review of Statistics and Its Application, 7, 279-301. https://doi.org/10.1146/annurev-statistics-031219-041220
Jang, B., Kim, M., Harerimana, G., & Kim, J. W. (2019). Q-learning algorithms: A comprehensive classification and applications. IEEE Access, 7, 133653-133667. https://doi.org/10.1109/ACCESS.2019.2941229
van Hasselt, H. (2010). Double Q-learning. Advances in Neural Information Processing Systems, 23. https://proceedings.neurips.cc/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html