In an article recently posted to the Meta Research website, researchers introduced a new method called Hallucinating Datasets with Evolution Strategies (HaDES) for dataset distillation in reinforcement learning (RL). Their technique condenses the experience needed to train an expert policy into a small set of synthetic examples that can be used to train new RL models.
Background
Dataset distillation creates a small set of data points to replace a large real dataset for tasks like image classification, graph learning, or recommender systems. It is useful for architecture search, interpretability, continual learning, and privacy. Most existing methods assume access to a pre-existing target dataset. This is not the case in RL, where the agent must collect data through exploration and interaction with the environment.
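As a rough formalization (the notation here is ours, not taken from the article or the paper), dataset distillation can be framed as a bi-level optimization problem: find a small synthetic dataset such that a model trained only on it still performs well on the real data,

\[
\mathcal{D}_{\text{syn}}^{*} \;=\; \arg\min_{\mathcal{D}_{\text{syn}}} \; \mathcal{L}\big(\theta^{*}(\mathcal{D}_{\text{syn}});\, \mathcal{D}_{\text{real}}\big),
\qquad
\theta^{*}(\mathcal{D}_{\text{syn}}) \;=\; \arg\min_{\theta} \; \mathcal{L}\big(\theta;\, \mathcal{D}_{\text{syn}}\big).
\]

In RL there is no \(\mathcal{D}_{\text{real}}\) to evaluate against, which is exactly the gap this work targets.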
RL involves learning a policy that maximizes the expected discounted sum of rewards in a Markov decision process (MDP). A policy is a function that maps states to actions and can be parameterized by a neural network. RL algorithms typically optimize the policy parameters with gradient-based methods or evolution strategies (ES). Both approaches can be computationally expensive and require large amounts of data.
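Written out in standard MDP notation (a textbook definition, not specific to this paper), this objective is

\[
J(\pi_\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t)\right],
\]

where \(\gamma \in [0, 1)\) is the discount factor and the policy \(\pi_\theta\) maps states \(s_t\) to actions \(a_t\).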
About the Research
In this paper, the authors introduced behavior distillation, which aims to find and condense the information needed to train an expert policy into a synthetic dataset of state-action pairs. Unlike standard dataset distillation, behavior distillation must address both the exploration problem (finding high-reward trajectories) and the representation learning problem (learning to represent a policy that reproduces those trajectories).
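In the same notation as above (again our paraphrase rather than the paper's exact formulation), behavior distillation swaps the real-data loss in the outer level for the environment return:

\[
\max_{\mathcal{D}} \; J\big(\pi_{\theta^{*}(\mathcal{D})}\big),
\qquad
\theta^{*}(\mathcal{D}) \;=\; \arg\min_{\theta} \sum_{(s, a) \in \mathcal{D}} \ell\big(\pi_\theta(s), a\big),
\]

that is, find the synthetic state-action pairs \(\mathcal{D}\) such that behavior cloning on \(\mathcal{D}\) yields a high-return policy; no real dataset is ever required.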
To tackle behavior distillation, the study proposed HaDES, a meta-evolutionary method with two nested loops: an outer loop that uses ES to update the synthetic dataset and an inner loop that trains a policy on the current dataset with supervised learning (behavior cloning). The fitness function scores each candidate dataset by the policy's performance after this supervised learning step.
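The following is a minimal NumPy sketch of that outer/inner structure, written here purely for illustration. The toy point-mass environment, the linear policy, the helper names, and all hyperparameter values are our own assumptions; the authors' released implementation is separate and not reproduced here.

```python
# Illustrative sketch of an ES outer loop evolving a synthetic dataset,
# with behavior cloning as the inner loop (not the authors' implementation).
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, ACTION_DIM = 2, 1      # toy point-mass: state = (position, velocity)
N_SYNTHETIC = 16                  # number of synthetic state-action pairs to evolve
POP, SIGMA, LR_OUTER = 32, 0.1, 0.05
BC_STEPS, LR_INNER = 100, 0.1

def rollout_return(W, horizon=50):
    """Episode return of the linear policy a = clip(s @ W) on a toy
    point-mass task: reward = -position^2, the action accelerates the mass."""
    s = np.array([1.0, 0.0])
    total = 0.0
    for _ in range(horizon):
        a = float(np.clip((s @ W)[0], -1.0, 1.0))
        s = np.array([s[0] + 0.1 * s[1], s[1] + 0.1 * a])
        total += -s[0] ** 2
    return total

def behavior_clone(dataset):
    """Inner loop: fit a linear policy to the synthetic (state, action) pairs
    with gradient descent on squared error, i.e. behavior cloning."""
    states, actions = dataset[:, :STATE_DIM], dataset[:, STATE_DIM:]
    W = np.zeros((STATE_DIM, ACTION_DIM))
    for _ in range(BC_STEPS):
        grad = 2.0 * states.T @ (states @ W - actions) / len(states)
        W -= LR_INNER * grad
    return W

def fitness(dataset_flat):
    """Fitness of a candidate dataset = return of the policy cloned from it."""
    dataset = dataset_flat.reshape(N_SYNTHETIC, STATE_DIM + ACTION_DIM)
    return rollout_return(behavior_clone(dataset))

# Outer loop: a simple evolution strategy over the flattened synthetic dataset.
theta = rng.normal(0.0, 0.5, size=N_SYNTHETIC * (STATE_DIM + ACTION_DIM))
for gen in range(201):
    noise = rng.normal(size=(POP, theta.size))
    scores = np.array([fitness(theta + SIGMA * eps) for eps in noise])
    advantages = (scores - scores.mean()) / (scores.std() + 1e-8)
    theta += LR_OUTER / (POP * SIGMA) * noise.T @ advantages
    if gen % 50 == 0:
        print(f"gen {gen:3d}  fitness {fitness(theta):.3f}")
```

The key point the sketch mirrors is that no gradients flow through the environment: the outer loop only needs the scalar fitness of each perturbed dataset.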
The researchers also adapted HaDES for neuroevolution by fixing, rather than resampling, the initial weights of the inner-loop policy, so that the evolved dataset acts as a compact parameterization of that policy. This setup lets the number of evolved parameters scale independently of the policy size, which reduces memory usage and yields competitive performance across various environments compared to traditional ES.
Research Findings
The authors evaluated HaDES on eight continuous control environments from the Brax suite and four discrete environments from the MinAtar suite. They compared HaDES with a fixed policy initialization (HaDES-F) and HaDES with a randomized policy initialization (HaDES-R) against direct neuroevolution with ES and against PPO, a state-of-the-art RL algorithm. HaDES-F achieved the highest return across the board, while HaDES-R matched or beat the baselines in most environments. The authors also observed that HaDES-F discovered a glitch in the collision physics of the Humanoid-Standup environment, exploiting it to propel the agent into the air and achieve extremely high returns.
The authors also tested how well the synthetic datasets generalize by using them to train new policies with different architectures and hyperparameters. The results showed that the HaDES-R datasets were more robust to changes in both policy architecture and training hyperparameters than the HaDES-F datasets, which incorporate a stronger inductive bias.
Applications
The researchers demonstrated the applicability of synthetic datasets to downstream tasks by using them to train multi-task agents without additional environment interaction. They merged datasets for two different environments by zero-padding the states and actions and used them to train agents with behavior cloning. They showed that the multi-task agents achieved roughly 50% of the single-task performance in some environments and saw no loss in performance in others.
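A small sketch of that merging step is shown below, assuming two synthetic datasets with different state and action dimensionalities (array shapes and function names are illustrative and not taken from the released code):

```python
# Illustrative merge of two synthetic datasets via zero-padding (our own sketch).
import numpy as np

def pad_to(arr, width):
    """Zero-pad the last dimension of `arr` out to `width`."""
    return np.pad(arr, ((0, 0), (0, width - arr.shape[1])))

def merge_datasets(states_a, actions_a, states_b, actions_b):
    """Merge two synthetic datasets into one multi-task dataset by
    zero-padding states and actions to a common width, then stacking."""
    s_dim = max(states_a.shape[1], states_b.shape[1])
    a_dim = max(actions_a.shape[1], actions_b.shape[1])
    states = np.vstack([pad_to(states_a, s_dim), pad_to(states_b, s_dim)])
    actions = np.vstack([pad_to(actions_a, a_dim), pad_to(actions_b, a_dim)])
    return states, actions

# Example: 16 synthetic pairs from an 8-D/2-D task and 16 from a 17-D/6-D task.
rng = np.random.default_rng(0)
s, a = merge_datasets(rng.normal(size=(16, 8)),  rng.normal(size=(16, 2)),
                      rng.normal(size=(16, 17)), rng.normal(size=(16, 6)))
print(s.shape, a.shape)  # (32, 17) (32, 6)
```

A multi-task policy trained with behavior cloning on the merged arrays simply takes the padded state width as its input dimension and the padded action width as its output dimension.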
The synthetic datasets evolved by HaDES could accelerate future research on RL foundation models by reducing the computational cost of training and enabling experimentation with architectures and multi-task representation learning. Furthermore, these datasets can offer human-interpretable insight into a task, since their state-action pairs can be visualized directly.
Conclusion
In summary, the presented behavior distillation approach and the HaDES technique proved effective for dataset compression in RL. HaDES produced compact synthetic datasets for both continuous control and discrete environments, and these datasets generalize to different policy architectures and hyperparameters. They can also be used to train multi-task agents in a zero-shot fashion.
The researchers have open-sourced their code and synthetic datasets to facilitate future research. Moving forward, they identified some limitations and suggested scaling to pixel-based environments, evolving the inner-loop parameters along with the dataset, and regularizing the datasets to further promote interpretability.