In a paper recently submitted to the arXiv* preprint server, the authors reviewed how the successor representation (SR) and its generalizations have facilitated advances in artificial intelligence (AI), enabling agents to learn behaviors from experience and transfer them to new tasks.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
The SR is suited to solving sets of tasks that share a transition structure but differ in their reward structures. It represents each state by its expected discounted future occupancy of other states; the cumulant, that is, the quantity of interest summed across time, is state occupancy. In this sense, the SR is a predictive representation that compiles the transition model: it discards information about individual transitions and retains only their accumulated, discounted consequences.
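For reference, a standard way of writing the SR (notation chosen here for illustration; the paper's own notation may differ) is as the expected discounted future occupancy of each state s' when starting from state s, with the occupancy indicator as the cumulant:

```latex
M^{\pi}(s, s') \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\,\mathbb{1}\!\left(s_{t} = s'\right) \,\Big|\, s_{0} = s \right],
\qquad 0 \le \gamma < 1 .
```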
Additionally, unlike model-based algorithms, the SR eliminates the need to iterate dynamic-programming updates or simulate roll-outs, because it already compiles the transition information into a convenient form. State values are recomputed as the inner product between the immediate reward vector and the SR, so the efficiency of SR-based value computation is comparable to model-free algorithms. At the same time, like model-based algorithms, the SR adapts quickly to certain types of environmental change.
Specifically, local changes to an environment's reward structure induce local changes in the reward function, and these propagate instantly to the value estimates once combined with the SR. For reward-structure changes, SR-based value computation is therefore as flexible as model-based algorithms. Changes to the transition structure, however, require larger, non-local updates to the SR, because the SR does not retain the detailed transition structure in its internal model.
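To make this concrete, here is a minimal tabular sketch (illustrative variable names, not code from the paper) of how the SR compiles a transition matrix, how state values follow from an inner product with the reward vector, and how a local reward change propagates to values instantly:

```python
import numpy as np

gamma = 0.95
n_states = 5

# Transition matrix of the policy being evaluated (rows sum to 1).
# Here: a simple chain that mostly moves right and sometimes stays put.
T = np.zeros((n_states, n_states))
for s in range(n_states):
    T[s, min(s + 1, n_states - 1)] += 0.9
    T[s, s] += 0.1

# Successor representation: expected discounted future occupancy,
# M = sum_t (gamma * T)^t = (I - gamma * T)^{-1}.
M = np.linalg.inv(np.eye(n_states) - gamma * T)

# Value computation is just an inner product with the reward vector.
r = np.zeros(n_states)
r[-1] = 1.0                      # reward only in the final state
V = M @ r

# A local change to the reward structure propagates instantly:
# no re-planning or re-learning of M is needed.
r_new = r.copy()
r_new[2] = 0.5                   # add reward at an intermediate state
V_new = M @ r_new

print(V, V_new)
```

If the transition structure changed instead, the matrix M itself would have to be recomputed or relearned, which corresponds to the non-local update described above.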
Successor models and successor features
The successor model (SM) is a generalization of the SR that defines an explicit model over temporally abstract trajectories, that is, a conditional distribution over future states within a given time horizon. The SM uses the k-step conditional distribution as its cumulant, and because it integrates to one, it describes a valid probability distribution. The SM can therefore be estimated with density-estimation techniques such as contrastive learning, variational inference, and generative adversarial learning.
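The SM is commonly written as a normalized, discounted mixture of k-step conditional distributions; the (1 - γ) factor is what makes it integrate to one (again, notation assumed here for illustration rather than quoted from the paper):

```latex
\mu^{\pi}_{\gamma}(s' \mid s) \;=\; (1 - \gamma) \sum_{k=0}^{\infty} \gamma^{k}\, P^{\pi}\!\left(s_{t+k+1} = s' \,\middle|\, s_{t} = s\right).
```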
The SM is also useful for policy evaluation and model-based control. Successor features (SFs) are a feature-based generalization of the SR. The SR is difficult to learn or compute when the state space is large or unknown, because it requires maintaining occupancy expectations over every individual state. In such settings, occupancy measures can instead be maintained over cumulants that are features shared across states; this generalization of the SR is known as SFs.
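A minimal sketch of the idea behind SFs (hypothetical feature dimensions and names, not the paper's code): the SFs accumulate discounted features rather than state occupancies, can be learned with temporal-difference updates without enumerating states, and yield values by an inner product whenever the task reward is (approximately) linear in the features:

```python
import numpy as np

gamma, alpha = 0.95, 0.1
n_features = 3

# Successor features psi(s) = E[ sum_t gamma^t * phi(s_t) | s_0 = s ].
# They can be learned with TD updates, so there is no need to maintain
# an occupancy expectation for every individual state.
psi = {}  # maps a (hashable) state to its current SF estimate

def td_update(s, s_next, phi_s):
    """One TD(0) update for the successor features of state s."""
    psi_s = psi.setdefault(s, np.zeros(n_features))
    psi_next = psi.get(s_next, np.zeros(n_features))
    psi[s] = psi_s + alpha * (phi_s + gamma * psi_next - psi_s)

# Example: a single observed transition s -> s_next with features phi(s).
td_update(s="A", s_next="B", phi_s=np.array([1.0, 0.0, 0.0]))

# If a task's reward is (approximately) linear in the features,
# r(s) = phi(s) . w, then V(s) is just an inner product with w.
w = np.array([1.0, 0.0, -0.5])
V_A = psi["A"] @ w
print(V_A)
```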
AI applications
One strategy is learning to act and explore in the environment before any exposure to reward: the agent explores without receiving reward or punishment for a certain period, aiming to learn a policy that transfers to an as-yet-unknown task.
For instance, this strategy was used to develop agents that explore Atari games for 250 million time-steps without any reward and are then given 100,000 time-steps in which reward is available. Despite receiving no reward for most of their experience, these agents achieved superhuman performance on most Atari games.
This algorithm was further improved by incorporating an intrinsic reward function that drives the agent toward surprising, high-entropy parts of the state space relative to its memory of past experience. This significantly improved sample efficiency on several Atari games.
Count-based exploration adds a bonus of 1/√N(s), where N(s) is the number of times state s has been visited; this bonus yields near-optimal exploration in tabular settings. It has been used to improve exploration in sparse-reward Atari games such as Montezuma's Revenge.
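A minimal sketch of such a count-based bonus (a generic tabular version with assumed names, not any specific paper's implementation):

```python
from collections import defaultdict
from math import sqrt

visit_counts = defaultdict(int)  # N(s): visits per (hashable) state
beta = 1.0                       # bonus scale, a tunable hyperparameter

def exploration_bonus(state):
    """Count-based bonus beta / sqrt(N(s)), added to the task reward."""
    visit_counts[state] += 1
    return beta / sqrt(visit_counts[state])

# Example: the bonus shrinks as the same state is revisited.
print(exploration_bonus("s0"), exploration_bonus("s0"), exploration_bonus("s0"))
```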
Learning about non-task goals is a strategy in which an agent learning one task simultaneously learns to do well on others. It underlies hindsight experience replay, where experiences that failed to reach the intended task goal are relabeled as successful experiences for whatever non-task goal was actually reached, which is especially useful for tasks with sparse rewards.
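The relabeling step itself can be sketched as follows (a generic goal-conditioned version with assumed field names, not the specific algorithm discussed next):

```python
def relabel_with_hindsight(trajectory):
    """Relabel a failed trajectory as if its final achieved state had been the goal.

    Each transition is a dict with keys 'state', 'next_state', 'goal', 'reward'.
    """
    achieved_goal = trajectory[-1]["next_state"]
    relabeled = []
    for tr in trajectory:
        relabeled.append({
            **tr,
            "goal": achieved_goal,
            # Sparse reward: success only where the achieved goal is reached.
            "reward": 1.0 if tr["next_state"] == achieved_goal else 0.0,
        })
    return relabeled

# Example: a two-step trajectory that failed to reach goal "G" but reached "B".
traj = [
    {"state": "A", "next_state": "B", "goal": "G", "reward": 0.0},
    {"state": "B", "next_state": "B", "goal": "G", "reward": 0.0},
]
print(relabel_with_hindsight(traj))
```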
Recently, an asymptotically unbiased importance-sampling algorithm was developed that uses SFs to remove the bias introduced when value functions are estimated with hindsight experience replay. This enabled learning on both simulated and real-world robotic manipulation tasks in environments with large state and action spaces.
One of the key benefits of SFs is that they enable transfer to tasks that are linear combinations of known tasks. For instance, an agent that has learned to move in a set of directions, such as up, down, left, and right, in a continuous control setting can instantly move at novel angles, such as up-right or down-left, as required to complete a task.
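As a minimal sketch of this transfer mechanism (assumed feature dimensions and names, not the original experiment): if each known task is described by a weight vector over shared features, a new task whose weights are a linear combination of the known ones can be evaluated immediately from the already-learned SFs, for example via generalized policy improvement over the known policies:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_features = 4, 2

# Successor features psi_i(s, a) of policies already learned for the
# base tasks (hypothetical values for a single state s).
psi_up    = rng.random((n_actions, n_features))
psi_right = rng.random((n_actions, n_features))

# Task weight vectors over the shared features (assumed known or learned).
w_up    = np.array([0.0, 1.0])
w_right = np.array([1.0, 0.0])

# A new task as a linear combination of known tasks, e.g. "up-right".
w_new = w_up + w_right

# Generalized policy improvement: evaluate every known policy's SFs on
# the new task's weights and act greedily with respect to the maximum.
q_values = np.maximum(psi_up @ w_new, psi_right @ w_new)
best_action = int(np.argmax(q_values))
print(best_action)
```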
Overall, the paper reviewed how the SR and its variants help AI agents learn and transfer behaviors. The SR captures expected discounted future state occupancy, enabling efficient decision-making across tasks with shared structure and rapid adaptation to reward changes. Applications include exploring Atari games without initial reward and transferring learned movement directions to novel angles.
Journal reference:
- Preliminary scientific report.
Carvalho, W., Tomov, M. S., De Cothi, W., Barry, C., & Gershman, S. J. (2024). Predictive representations: Building blocks of intelligence. arXiv. https://arxiv.org/abs/2402.06590