In a paper published in the journal Sensors, researchers addressed robust safety in autonomous highway driving using reinforcement learning (RL) by introducing the replay buffer constrained policy optimization (RECPO) method, which ensures that policies maintain safety while maximizing rewards.
They transformed the problem into a constrained Markov decision process (CMDP) and optimized highway driving policies. Deployed in the Car Learning to Act (CARLA) simulator, RECPO outperformed traditional methods, achieving zero collisions and enhancing decision-making stability.
Background
Past research in autonomous driving explored rule-based and optimization-driven algorithms using control Lyapunov functions (CLF), control barrier functions (CBF), and finite state machines (FSM) to enhance safety and traffic flow. Recent advancements in machine learning (ML), including deep RL and model predictive control (MPC), aimed to improve decision-making efficiency but faced challenges in dynamic environments. Safe RL frameworks such as CMDP and constrained policy optimization (CPO) integrated constraints into policy optimization, addressing safety and adaptability concerns in autonomous systems.
Safe RL Framework Overview
In transforming highway autonomous driving into a safe RL framework such as CMDP, the key components are defined: state space (S), action space (A), reward function (R), and cost function (C). CMDP extends the standard MDP by introducing constraints on policies, ensuring safe decision-making under uncertainty. The observation space includes ego vehicle (EV) data, surrounding vehicle (SV) data, and environmental information. The action space encompasses longitudinal and lateral control behaviors, while the cost function evaluates risks such as collisions and illegal maneuvers, guiding the agent toward safe driving practices.
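To make these components concrete, the sketch below shows one simplified way the observation, action, and cost signal could be represented in code. The field names and cost weights are hypothetical illustrations, not the paper's exact definitions.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical, simplified representation of the CMDP components described above.

@dataclass
class Observation:
    ego_speed: float                                     # ego vehicle (EV) longitudinal speed
    ego_lane: int                                        # current lane index
    sv_gaps: List[float] = field(default_factory=list)   # distances to surrounding vehicles (SV)
    speed_limit: float = 33.3                            # environmental information, e.g. posted limit in m/s

@dataclass
class Action:
    acceleration: float   # longitudinal control behavior
    lane_change: int      # lateral behavior: -1 left, 0 keep, +1 right

def cost(collided: bool, illegal_maneuver: bool) -> float:
    """Cost function C: penalizes unsafe events; the weights here are arbitrary."""
    return 1.0 * float(collided) + 0.5 * float(illegal_maneuver)
```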
The goal is to optimize a policy to maximize the expected cumulative reward while adhering to safety constraints defined by the cost function. This approach integrates real-time observations and safety considerations, which are crucial for autonomous vehicles navigating complex highway environments.
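In standard CMDP notation, this objective amounts to maximizing the discounted return while keeping the discounted cost below a limit d (a generic textbook formulation; the paper's thresholds and discounting details may differ):

```latex
\max_{\pi}\; J_R(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\right]
\quad \text{subject to} \quad
J_C(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} C(s_t, a_t)\right] \le d
```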
Advanced Autonomous Driving Optimization
The RECPO algorithm is applied to autonomous highway driving, integrating advanced techniques to optimize policy updates while ensuring adherence to safety constraints. The algorithm operates by continually sampling and storing trajectories in a replay buffer and using importance weights to reuse historical data efficiently, which accelerates learning and mitigates issues such as catastrophic forgetting.
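The sketch below illustrates the replay-buffer idea in Python. It is a minimal, assumption-laden example (uniform sampling plus a likelihood-ratio importance weight), not RECPO's exact implementation.

```python
import math
import random
from collections import deque

class ImportanceReplayBuffer:
    """Minimal replay buffer that reuses old trajectories via importance weights."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, cost, behavior_logprob):
        # behavior_logprob: log-probability of the action under the policy that collected it.
        self.buffer.append((state, action, reward, cost, behavior_logprob))

    def sample(self, batch_size, current_logprob_fn):
        batch = random.sample(self.buffer, min(batch_size, len(self.buffer)))
        weighted = []
        for state, action, reward, cost, behavior_logprob in batch:
            # The importance weight corrects for the mismatch between the policy
            # that generated the data and the policy currently being optimized,
            # so historical data can be reused instead of discarded.
            weight = math.exp(current_logprob_fn(state, action) - behavior_logprob)
            weighted.append((state, action, reward, cost, weight))
        return weighted
```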
Policy updates in RECPO are guided by safety considerations through a trust-region approach, which ensures that each policy iteration maximizes expected rewards without compromising safety. The algorithm evaluates candidate updates against safety metrics and applies different optimization methods depending on the safety level of the current policy. Additionally, RECPO updates the parameters of its value networks using gradient-based methods to improve prediction accuracy, supporting effective decision-making by the policy network in complex driving environments.
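Conceptually, the safety-gated update can be sketched as below. This is a simplified CPO-style illustration in which `cost_limit`, `estimate_cost`, `reward_step`, and `recovery_step` are hypothetical placeholders standing in for the paper's trust-region and recovery procedures.

```python
def safety_gated_update(policy, batch, cost_limit, estimate_cost, reward_step, recovery_step):
    """Illustrative CPO/RECPO-style update: the optimization method applied
    depends on whether the current policy satisfies the cost constraint."""
    expected_cost = estimate_cost(policy, batch)
    if expected_cost <= cost_limit:
        # Safe regime: take a trust-region step that maximizes expected reward
        # while keeping the predicted cost within the constraint.
        return reward_step(policy, batch, cost_limit)
    # Unsafe regime: ignore reward and take a recovery step that purely
    # reduces expected cost until the policy is feasible again.
    return recovery_step(policy, batch)
```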
Experimental Evaluation Summary
This study evaluated the effectiveness, robustness, and safety of the RECPO algorithm using the CARLA simulation platform in a constructed highway driving scenario. Comparative experiments included an advanced deep deterministic policy gradient (DDPG)-based strategy, the intelligent driver model (IDM) combined with minimizing overall braking induced by lane changes (MOBIL), and the traditional CPO algorithm. The simulations used a custom-built 10-km-long, three-lane highway scenario, ensuring realistic driving conditions with controlled variables such as speed variations and lane changes.
During the training phase, the RECPO, CPO, and DDPG algorithms were trained and compared on performance metrics such as rewards, safety, and efficiency. RECPO showed a significant advantage in reward convergence, reaching high values early in training compared to CPO. Both algorithms demonstrated effective safety improvements, with RECPO achieving faster convergence rates and superior performance in cost reduction and success rates compared to CPO and DDPG. The importance-sampling-based experience replay pool in RECPO facilitated quicker learning and adaptation to the driving environment, underscoring its efficacy in autonomous decision-making.
After deployment and performance testing, RECPO and CPO outperformed IDM + MOBIL in terms of driving efficiency, passenger comfort, and safety. RECPO maintained stable acceleration and low jerk during high-speed maneuvers, enhancing passenger comfort compared to IDM + MOBIL. Safety evaluations revealed RECPO's robust performance in maintaining safe distances and achieving a 100% success rate, contrasting sharply with DDPG's poor performance and high collision rates. RECPO demonstrated superior adaptability and faster convergence rates, highlighting its potential for enhancing autonomous driving systems in complex scenarios.
Conclusion
To sum up, this paper proposed a decision-making framework for highway autonomous driving that prioritized safety and ensured robust performance. The framework utilized a CPO method to construct an RL policy optimizing rewards while adhering to safety constraints. Importance sampling techniques were introduced to facilitate data collection and storage in a replay buffer, preventing catastrophic forgetting and enhancing policy optimization. The RECPO method optimized autonomous driving strategies in a CMDP framework, showing enhanced convergence speed, safety, and decision stability in CARLA simulations. Future work targets complex scenarios and sensor uncertainties with advanced neural networks and hierarchical decision frameworks.
Journal reference:
- Zhao, R., et al. (2024). Towards Robust Decision-Making for Autonomous Highway Driving Based on Safe Reinforcement Learning. Sensors, 24(13), 4140. DOI: 10.3390/s24134140, https://www.mdpi.com/1424-8220/24/13/4140