In a paper recently posted to the arXiv* preprint server, researchers introduced Quality Diversity through Human Feedback (QDHF), an approach that derives diversity metrics from human feedback and thereby expands the potential applications of quality diversity (QD) algorithms.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Foundation models, including large language models and text-to-image systems, have enabled a wide range of applications, helping individuals express creativity and solve problems. Reinforcement learning from human feedback (RLHF) makes these models easier to use, enhancing their competence by aligning them with human instructions and preferences. However, RLHF typically optimizes a reward model of average human preferences. The current study instead introduces the idea of learning diversity metrics from human feedback to guide the optimization process within QD algorithms.
Divergent search methods, such as novelty search and QD, aim to discover a range of diverse or top-performing solutions. Recent research has expanded on QD by improving diversity maintenance, the search process, and optimization mechanisms. Nonetheless, these methods often rely on manually defined diversity metrics, which can be challenging to specify for complex real-world tasks.
Quality Diversity with Human Feedback
QD algorithms excel at exploring diverse, high-quality solutions within a solution space. Recent research has explored automatic diversity discovery through unsupervised dimension reduction methods, but the resulting metrics may not align with the goals of the optimization.
Researchers, inspired by work in RLHF, introduced a novel paradigm known as Quality Diversity through Human Feedback (QDHF). In QDHF, diversity metrics are obtained from human feedback on solution similarity. This method works well in complex and abstract domains where defining numeric diversity measurements is difficult. It is also more adaptable than manually designing diversity metrics.
Characterization of Diversity Using Latent Projection: Recent research has demonstrated the use of unsupervised dimensionality reduction techniques to learn robot behavioral descriptors from raw sensory data. Within this framework, diversity characterization is viewed as a more generalized process: given descriptive data containing diverse information, a latent projection transforms it into a meaningful semantic latent space. A feature extractor function is first applied to the raw data.
A dimensionality reduction function then projects the feature vector into a compact latent representation. The axes of this latent space act as diversity metrics, with their magnitudes and directions capturing nuanced characteristics of the data. Linear projection is used for dimensionality reduction, with its parameters learned through a contrastive learning process.
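To make the pipeline concrete, the following is a minimal sketch of the latent projection step: high-dimensional feature vectors (here random stand-ins for a pretrained extractor's output) are mapped linearly into a compact latent space whose axes serve as diversity metrics. The feature dimension, latent dimension, and the weight matrix W are illustrative assumptions; in QDHF, W would be learned contrastively rather than sampled at random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature vectors, e.g. 128-D outputs of a pretrained extractor.
features = rng.normal(size=(5, 128))

# Linear projection into a 2-D latent space; each latent axis acts as one
# diversity metric. W is random here purely for illustration -- in QDHF its
# parameters are learned via contrastive training on human judgments.
latent_dim = 2
W = rng.normal(size=(128, latent_dim))

def project(x, W):
    """Project feature vectors into the compact latent representation."""
    return x @ W

latents = project(features, W)
print(latents.shape)  # (5, 2)
```

The resulting 2-D coordinates would then index cells of a QD archive, in place of hand-designed behavioral descriptors.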
Using this method, the learned latent space is aligned with human notions of similarity and difference. A triplet loss is employed to optimize the spatial relations of latent embeddings based on human input, minimizing the distance between embeddings judged similar while maximizing the distance between dissimilar ones. Human judgments are gathered using the Two Alternative Forced Choice (2AFC) protocol, which asks which of two items is more similar to a reference, and can accommodate human, heuristic, and AI-based feedback.
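The triplet objective can be sketched as follows. This is a generic hinge-style triplet loss, not the authors' exact implementation; the margin value and the toy embeddings are assumptions for illustration. The "positive" is whichever triplet member the 2AFC response marked as more similar to the anchor.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on latent embeddings.

    'positive' is the triplet member a human judged more similar to the
    anchor (the 2AFC response); 'negative' is the other member. The loss
    is zero once the positive is closer than the negative by the margin.
    """
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to negative
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])   # anchor embedding
p = np.array([0.1, 0.0])   # judged similar to the anchor
n = np.array([2.0, 2.0])   # judged dissimilar

print(triplet_loss(a, p, n))  # 0.0 -- embeddings already respect the judgment
print(triplet_loss(a, n, p))  # 8.99 -- swapped labels incur a large penalty
```

Gradients of this loss with respect to the projection parameters are what pull the latent space into agreement with human similarity judgments.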
QDHF: Researchers propose an implementation of QDHF using latent space projection and contrastive learning driven by human judgments. The latent space serves as the measure space, with each dimension corresponding to one diversity metric derived from human feedback on solution similarity. Two training strategies are devised for scenarios with or without prior human judgment data. QDHF-offline assumes such data is available and trains the latent projection before running the QD algorithm. In contrast, QDHF-online adopts an active learning strategy, iteratively fine-tuning the latent projection during the QD process by gathering human judgments on triplets of solutions. The latent projection is updated at defined intervals, each update consuming a portion of the human feedback budget, and the frequency of updates decreases as the learned metrics become more robust.
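The QDHF-online schedule described above can be sketched schematically. This is an illustrative loop, not the authors' code: the geometric spacing of update points, the iteration count, and the equal split of the feedback budget are all assumptions chosen to show the decreasing update frequency.

```python
def run_qdhf_online(total_iters=1000, feedback_budget=200, n_updates=4):
    """Schematic QDHF-online loop (illustrative, not the paper's code).

    The latent projection is refit at a few fixed iterations, each update
    spending an equal share of the human-feedback budget. Update points are
    spaced geometrically, so updates become less frequent as the learned
    metrics stabilize.
    """
    # Hypothetical schedule: updates at iterations 62, 125, 250, 500.
    update_iters = [total_iters // (2 ** (n_updates - k)) for k in range(n_updates)]
    per_update = feedback_budget // n_updates
    used = 0
    for it in range(total_iters):
        # ... generate solutions, evaluate quality, insert into the archive ...
        if it in update_iters and used < feedback_budget:
            # Sample triplets of current solutions and query human judgments,
            # then fine-tune the latent projection with the triplet loss.
            used += per_update
    return used

print(run_qdhf_online())  # 200 -- the full feedback budget is consumed
```

The key design point is front-loading updates early, when the learned metrics change fastest, and spacing them out later.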
Experiments and results
Researchers performed experiments across three benchmark tasks: the robotic arm, Kheperax, and latent space illumination (LSI). The robotic arm task aims to find inverse kinematics solutions for a planar arm with revolute joints, minimizing the variance of joint angles, with the arm's 2D endpoint position serving as the diversity descriptor. In the Kheperax task, the aim is to discover neural-network policy controllers for a Khepera-like robot navigating a maze using limited-range lasers and contact sensors. The LSI task explores the latent space of a generative model.
In the robotic arm and Kheperax tasks, a predefined ground truth diversity metric, based on the position of the arm or robot in 2D space, is used to simulate human feedback. The evaluation measures are the QD score and coverage. Since no ground truth diversity metric is available for the LSI task, human feedback on image similarity is collected and the effectiveness of QDHF is demonstrated qualitatively. On the tasks with ground truth metrics, QDHF performs strongly, particularly QDHF-online in the robotic arm task. On the LSI task, QDHF generates more diverse images than random sampling.
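The two evaluation measures mentioned above are standard in QD work and can be computed directly from an archive. Below is a minimal sketch for a MAP-Elites-style grid archive; the toy archive contents and cell count are assumptions for illustration, and the QD score here assumes non-negative fitness values.

```python
def qd_metrics(archive, n_cells):
    """Compute QD score and coverage for a MAP-Elites-style archive.

    archive: dict mapping occupied cell index -> best fitness found there.
    QD score sums the best fitness over occupied cells (assuming
    non-negative fitness); coverage is the fraction of cells filled.
    """
    qd_score = sum(archive.values())
    coverage = len(archive) / n_cells
    return qd_score, coverage

archive = {0: 0.9, 3: 0.5, 7: 0.8}  # toy archive: 3 of 10 cells filled
score, cov = qd_metrics(archive, n_cells=10)
# score ≈ 2.2, cov = 0.3
```

A method that learns better-aligned diversity metrics fills more cells (higher coverage) with stronger solutions (higher QD score).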
Sample efficiency is also assessed, along with the alignment between learned and ground truth diversity metrics. The results show that QDHF can effectively align its learned diversity space with the ground truth, particularly in the scales along each axis.
Conclusion
In summary, researchers introduced QDHF, which utilizes human feedback to enhance diversity in QD algorithms. Empirical results demonstrate the superiority of QDHF in automatic diversity discovery, comparing favorably to QD with human-designed metrics. In a latent space illumination task, QDHF significantly improves image diversity.
Journal reference:
- Preliminary scientific report.
Ding, L., Zhang, J., Clune, J., Spector, L., and Lehman, J. (2023). Quality Diversity through Human Feedback, arXiv. DOI: https://doi.org/10.48550/arXiv.2310.12103, https://arxiv.org/abs/2310.12103
Article Revisions
- Oct 26 2023 - Title Change - "Enhancing Quality Diversity through Human Feedback" to "Quality Diversity through Human Feedback (QDHF)"