In a recent submission to the arXiv server*, researchers pre-trained a world model using offline data from a real robot and subsequently fine-tuned it with online data acquired through model-based planning.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Reinforcement learning (RL) has the potential to impart autonomous capabilities to physical robots, allowing them to interact with their environment and learn complex tasks guided by reward-based feedback. Yet, RL is notorious for its data inefficiency, demanding a substantial number of online interactions to acquire skills due to the limited availability of supervision. This poses a significant challenge when attempting to train real robots.
Traditional approaches resort to custom simulators or human teleoperation for behavior learning. However, these solutions are constrained by cost and engineering complexity, and they introduce issues such as the simulation-to-reality gap and an inability to surpass human performance. Recently, offline RL has emerged as a framework for training RL policies from pre-existing interaction datasets, eliminating the need for online data collection. While this approach alleviates data inefficiency, it introduces extrapolation errors, which can lead to overly cautious policies.
The current study aims to combine the strengths of both approaches. It addresses the challenge of pretraining an RL policy using existing interaction data and subsequently fine-tuning it with a limited amount of data acquired through online interaction.
Temporal Difference Learning for Model Predictive Control (TD-MPC)
In RL, the objective is to acquire a visuo-motor control policy via interaction, formulated here as an infinite-horizon Partially Observable Markov Decision Process (POMDP). This entails learning a policy that defines a conditional probability distribution over actions given the current state (or observation), with the aim of maximizing the expected return. The practical implementation relies on a model-based RL (MBRL) algorithm, which decomposes the policy into several trainable components collectively known as the world model.
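In standard RL notation (not specific to this paper), that objective can be written as the following expected discounted return; in the POMDP setting the policy conditions on observations or a learned latent state rather than the true state:

```latex
\pi^{*} \;=\; \arg\max_{\pi}\;
\mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],
\qquad \gamma \in [0, 1)
```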
TD-MPC, a particular MBRL algorithm, extends MPC with a world model and terminal value function learned jointly through TD learning. Two characteristics make it relevant here: it employs planning, which enables regularization of action selection at test time, and it is comparatively lightweight among MBRL algorithms, facilitating real-time operation. The architecture comprises five jointly learned components: a representation encoder, a latent dynamics model, and three prediction heads for the reward, the terminal value function, and a latent policy prior.
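As a rough illustration, the five components could be organized as in the Python sketch below. The module structure, layer sizes, and names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the five TD-MPC world-model components, assuming simple MLP
# heads and a flat observation vector; names and sizes are illustrative only.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ELU(),
                         nn.Linear(hidden, out_dim))

class WorldModel(nn.Module):
    def __init__(self, obs_dim, act_dim, latent_dim=64):
        super().__init__()
        self.encoder = mlp(obs_dim, latent_dim)                # representation z = h(s)
        self.dynamics = mlp(latent_dim + act_dim, latent_dim)  # latent dynamics z' = d(z, a)
        self.reward = mlp(latent_dim + act_dim, 1)             # reward head r = R(z, a)
        self.value = mlp(latent_dim + act_dim, 1)              # terminal value head Q(z, a)
        self.policy = mlp(latent_dim, act_dim)                 # latent policy prior a = pi(z)

    def step(self, z, a):
        """Advance the latent state one step and predict the reward."""
        za = torch.cat([z, a], dim=-1)
        return self.dynamics(za), self.reward(za)
```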
In its original formulation, TD-MPC operates as an online off-policy RL algorithm, maintaining a replay buffer of interactions and optimizing all components jointly. During inference, it uses a sampling-based planner, Model Predictive Path Integral (MPPI) control, to select action sequences that maximize the expected return. A behavioral prior is introduced by generating a fraction of the candidate action sequences from the learned policy.
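The sketch below illustrates the idea of planning with a policy prior: some candidate action sequences are rolled out from the learned policy, the rest are drawn from a random search distribution, sequences are scored under the model, and the first action of the best one is executed. It simplifies MPPI (no iterative refinement or elite weighting) and reuses the hypothetical WorldModel from the previous sketch.

```python
# Simplified planning sketch with a policy prior; hyperparameters are assumptions.
import torch

def plan(model, z0, horizon=5, n_samples=512, n_policy=24, action_dim=4, gamma=0.99):
    # z0: latent state of shape (1, latent_dim).
    # Candidate sequences: random Gaussian proposals plus policy rollouts.
    seqs = torch.randn(n_samples, horizon, action_dim)
    z = z0.expand(n_policy, -1)
    policy_seq = []
    for _ in range(horizon):
        a = torch.tanh(model.policy(z))
        policy_seq.append(a)
        z, _ = model.step(z, a)
    seqs = torch.cat([seqs, torch.stack(policy_seq, dim=1)], dim=0)

    # Score each sequence by its model-predicted return plus a terminal value.
    z = z0.expand(seqs.shape[0], -1)
    returns = torch.zeros(seqs.shape[0])
    discount = 1.0
    for t in range(horizon):
        z, r = model.step(z, seqs[:, t])
        returns += discount * r.squeeze(-1)
        discount *= gamma
    terminal = model.value(torch.cat([z, torch.tanh(model.policy(z))], dim=-1))
    returns += discount * terminal.squeeze(-1)

    return seqs[returns.argmax(), 0]  # execute the first action of the best sequence
```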
Offline-to-online fine-tuning of world models
The framework proposed for offline-to-online fine-tuning of world models addresses extrapolation errors through a novel test-time regularization applied during planning. The methodology is divided into two phases: an offline phase that pre-trains a world model on pre-existing offline data, and an online phase that fine-tunes the model on a limited amount of online interaction data. While TD-MPC serves as the backbone world model and planner, the approach is general and applicable to other MBRL algorithms that employ planning.
First, the sources of model extrapolation errors in offline RL are discussed. These errors stem from discrepancies between the state-action distribution of the training data and that encountered at evaluation time, and they include value overestimation as well as challenges unique to MBRL algorithms. Value overestimation is mitigated by applying TD backups exclusively to in-sample actions: a state-conditional value estimate is introduced so that out-of-sample actions never appear in the TD targets.
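A minimal sketch of this idea, assuming an IQL-style combination of a state-value network trained with expectile regression and a Q-network updated with in-sample TD targets; the paper's exact losses and network inputs may differ.

```python
# Hedged sketch of in-sample TD learning with expectile regression; all names
# and the loss weighting are assumptions, not the authors' exact formulation.
import torch

def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss: weight tau for positive errors, 1 - tau for negative."""
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def value_losses(q_net, v_net, z, a, r, z_next, gamma=0.99):
    # State-only value V(z) regressed toward Q(z, a) for dataset actions only,
    # so no out-of-sample actions enter the target.
    with torch.no_grad():
        q_target = q_net(torch.cat([z, a], dim=-1))
    v_loss = expectile_loss(q_target - v_net(z))

    # The TD backup for Q uses V(z') instead of max_a' Q(z', a'),
    # again avoiding queries at unseen actions.
    with torch.no_grad():
        td_target = r + gamma * v_net(z_next)
    q_loss = (q_net(torch.cat([z, a], dim=-1)) - td_target).pow(2).mean()
    return v_loss, q_loss
```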
Additionally, a test-time behavior regularization technique based on uncertainty estimation is proposed to address planning with unseen state-action pairs, further mitigating extrapolation errors. The technique balances estimated returns against (epistemic) model uncertainty during planning, enhancing the robustness of the world model. It relies on a small ensemble of value functions to estimate uncertainty and penalize actions associated with high uncertainty, enabling a balance between exploration and exploitation.
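A minimal sketch of such a penalty, assuming the planner scores candidate actions by the ensemble mean minus a scaled ensemble standard deviation; the coefficient `lam` and the exact uncertainty measure are assumptions rather than the paper's precise definition.

```python
# Hedged sketch of an uncertainty-penalized planning score using a small
# ensemble of value heads; illustrative only.
import torch

def penalized_score(q_ensemble, z, a, lam=1.0):
    # q_ensemble: list of value heads; z, a: batched latent states and actions.
    qs = torch.stack([q(torch.cat([z, a], dim=-1)).squeeze(-1) for q in q_ensemble])
    # Mean estimated return minus a penalty on ensemble disagreement.
    return qs.mean(dim=0) - lam * qs.std(dim=0)
```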
To expedite the propagation of new information during fine-tuning, two separate replay buffers are maintained for offline and online data, and the training objective is optimized on batches drawn equally from both sources. Because the online buffer is initially small, this balanced sampling effectively oversamples online interaction data early in fine-tuning, which improves model performance.
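A minimal sketch of balanced sampling, assuming list-like replay buffers; drawing half of each batch from the still-small online buffer (with replacement) is what effectively oversamples recent interactions.

```python
# Hedged sketch of balanced sampling from two replay buffers; buffer API assumed.
import random

def sample_balanced(offline_buffer, online_buffer, batch_size=256):
    half = batch_size // 2
    batch = random.sample(offline_buffer, half)
    # Sample online data with replacement while that buffer is still small.
    batch += [random.choice(online_buffer) for _ in range(batch_size - half)]
    random.shuffle(batch)
    return batch
```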
Evaluation and analysis
The method is evaluated on a wide array of continuous control tasks from the D4RL (Datasets for Deep Data-Driven RL) benchmark and the xArm simulation task suite, as well as on visuo-motor control tasks on a real xArm robot. The results show that the proposed method consistently outperforms state-of-the-art offline and online RL methods across a range of tasks, including offline-to-online fine-tuning experiments on unseen tasks.
A series of ablation studies are conducted to understand the contributions of individual components to the method's success. The findings highlight the effectiveness of key components, including balanced sampling, ensemble value functions, uncertainty regularization, and learning with in-sample actions and expectile regression.
Journal reference:
- Preliminary scientific report.
Feng, Y., Hansen, N., Xiong, Z., Rajagopalan, C., and Wang, X. (2023). Finetuning Offline World Models in the Real World. arXiv, DOI: https://doi.org/10.48550/arXiv.2310.16029, https://arxiv.org/abs/2310.16029