Weight vector selection methods by hypervolume maximization in the Pareto front for single policy multi-objective reinforcement learning
- Authors
- Lee, Seonjae; Lee, Myoung Hoon; Moon, Jun
- Issue Date
- Jan-2026
- Publisher
- Elsevier
- Keywords
- Multi-objective reinforcement learning; Hypervolume maximization; Pareto optimality; Weight vector selection
- Citation
- Expert Systems with Applications, v.296, pp 1 - 19
- Pages
- 19
- Indexed
- SCIE; SCOPUS
- Journal Title
- Expert Systems with Applications
- Volume
- 296
- Start Page
- 1
- End Page
- 19
- URI
- https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/210100
- DOI
- 10.1016/j.eswa.2025.129070
- ISSN
- 0957-4174; 1873-6793
- Abstract
- Effectively solving Multi-Objective Reinforcement Learning (MORL) problems is crucial in real-world applications, such as robotics and autonomous systems, where multiple conflicting objectives must be optimized. Single-Policy MORL (SPMORL), which relies on a single policy network, struggles to learn a diverse set of weight vectors that represent the importance of each objective. In contrast, Multi-Policy MORL (MPMORL) trains multiple policy networks to handle different weight vectors. However, this approach is computationally expensive and can be inefficient, as the networks may not share experiences during training, leading to redundant learning and slower convergence. In this paper, we propose a Weight Vector Selection (WVS) algorithm for SPMORL, enhancing its ability to explore the weight vector space efficiently using the polar coordinate system, the Jacobian matrix, and the ellipsoid function. The key idea of WVS is to provide a weight vector selection criterion that enables SPMORL to approximate the Pareto front more effectively while maintaining computational efficiency. We introduce three novel WVS algorithms: WVS-Polar, WVS-Jacob, and WVS-Ellip. Specifically, WVS-Polar employs the polar coordinate system to estimate the Pareto front, WVS-Jacob utilizes the Jacobian matrix and the derivative of the Pareto front, and WVS-Ellip determines the center of an ellipsoid based on nearby Pareto-optimal points. Integrated with SPMORL, these WVS algorithms iteratively perform two learning stages. First, the agent selects a target weight vector using its respective approximation method. Then, the agent optimizes the selected weight vector, collects the corresponding Pareto-optimal point, and updates the Pareto front. This approach enables SPMORL to efficiently learn multiple weight vectors while maintaining the benefits of using a single policy network. 
To evaluate the effectiveness of WVS-SPMORL, we conduct extensive experiments in both simulated and real-world environments. In OpenAI-MuJoCo simulations, WVS-SPMORL outperforms baseline SPMORL algorithms in Pareto front approximation. In real-world robot arm tasks, we integrate an offline Reinforcement Learning (RL) mechanism, demonstrating that WVS-SPMORL successfully learns and manages the entire Pareto front. As shown in the attached experimental video, our approach allows the agent to continuously improve its performance in both online RL simulations and offline RL experiments. These results confirm that WVS-SPMORL significantly enhances learning efficiency and performance in MORL.
- Appears in Collections
- Seoul College of Engineering > Seoul Electrical Engineering Major > 1. Journal Articles
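
The two-stage loop described in the abstract (select a target weight vector, then optimize it, collect the resulting point, and update the Pareto front) can be illustrated for two objectives with a minimal sketch. This is not the authors' WVS-Polar, WVS-Jacob, or WVS-Ellip; the abstract does not give their formulas. The function names `pareto_filter`, `hypervolume_2d`, and `select_weight_polar`, the reference point `ref`, and the largest-angular-gap selection rule are all assumptions made for illustration. Both objectives are assumed to be maximized, and every front point is assumed to lie above and to the right of `ref`.

```python
import math


def pareto_filter(points):
    """Return the non-dominated points (both objectives maximized), sorted."""
    pts = [tuple(p) for p in points]
    front = set()
    for p in pts:
        # p is dominated if some other point is at least as good in both objectives.
        dominated = any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in pts)
        if not dominated:
            front.add(p)
    return sorted(front)


def hypervolume_2d(points, ref):
    """Area dominated by the front and bounded below/left by `ref`."""
    # Walk the front from the largest first objective to the smallest;
    # along a non-dominated front the second objective then increases.
    pts = sorted(pareto_filter(points), key=lambda p: p[0], reverse=True)
    area = 0.0
    prev_y = ref[1]
    for x, y in pts:
        if x > ref[0] and y > prev_y:
            area += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return area


def select_weight_polar(points, ref):
    """Hypothetical selection rule: aim at the widest angular gap in the front.

    Each front point is expressed by its polar angle around `ref`; the next
    target weight vector points at the middle of the largest gap between
    adjacent angles. This heuristic is an assumption, not the paper's method.
    """
    front = pareto_filter(points)
    if len(front) < 2:
        raise ValueError("need at least two non-dominated points")
    angles = sorted(math.atan2(y - ref[1], x - ref[0]) for x, y in front)
    gap, start = max((b - a, a) for a, b in zip(angles, angles[1:]))
    phi = start + gap / 2.0
    w1, w2 = math.cos(phi), math.sin(phi)
    total = w1 + w2  # positive because phi lies in [0, pi/2] under the assumption above
    return (w1 / total, w2 / total)


# Stage 1: pick a target weight vector from the current front estimate.
current_front = [(4.0, 1.0), (3.0, 3.0), (1.0, 4.0), (2.0, 2.0)]
target_weight = select_weight_polar(current_front, ref=(0.0, 0.0))
# Stage 2 (not shown): train the single policy on `target_weight`, append the
# resulting return vector to `current_front`, and recompute the hypervolume.
current_hv = hypervolume_2d(current_front, ref=(0.0, 0.0))
```

In this sketch the hypervolume serves only as a progress measure between iterations; a selection rule that maximizes hypervolume directly, as the paper's title describes, would need the front approximations that the abstract names but does not specify.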
