Journal of Systems Engineering and Electronics, 2022, Vol. 33, Issue (2): 381-392. doi: 10.23919/JSEE.2022.000040
• SYSTEMS ENGINEERING •
Choice of discount rate in reinforcement learning with long-delay rewards
Xiangyang LIN*, Qinghua XING, Fuxian LIU
Received: 2020-12-17
Accepted: 2022-03-21
Online: 2022-05-06
Published: 2022-05-06
Contact: Xiangyang LIN
E-mail: 95014052@qq.com; qh_xing@126.com; liuxqh@126.com
Xiangyang LIN, Qinghua XING, Fuxian LIU. Choice of discount rate in reinforcement learning with long-delay rewards[J]. Journal of Systems Engineering and Electronics, 2022, 33(2): 381-392.
Table 1  Optimal paths of the agent for different combinations of $R_{\max}$ and $\gamma$
γ | Rmax = 20 | Rmax = 5e | Rmax = 8 | Rmax = 6 |
0.0 | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} |
0.1 | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} |
0.2 | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} |
0.3 | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} |
0.4 | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} |
0.5 | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} |
0.6 | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} |
0.7 | {0,0,0,0,0} | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} |
0.8 | {0,0,0,0,0} | {0,0,0,0,0} | {1,1,1,1,1} | {1,1,1,1,1} |
0.9 | {0,0,0,0,0} | {0,0,0,0,0} | {1,1,1,1,1} | {1,1,1,1,1} |
1.0 | {0,0,0,0,0} | {0,0,0,0,0} | {1,1,1,1,1} | {1,1,1,1,1} |
Table 2  Q-table under the condition of $R_{\max}=20$ and $\gamma=0.6$
State | Action 0 | Action 1
[0,0] | −1 | 0
[0,1] | −1 | 0
[0,2] | −1 | 0
[0,3] | −1 | 0
[0,4] | −1 | 0
[1,0] | −0.74249 | 0
[1,1] | −1 | 0
[1,2] | −1 | 0
[1,3] | −1 | 0
[2,0] | 1.108516 | 0
[2,1] | −0.99927 | 0
[2,2] | −0.99997 | 0
[3,0] | 5.794161 | 0
[3,1] | −0.89989 | 0
[4,0] | 16.19077 | 0
Table 3  Optimal paths for different combinations of $R_{\max}$ and $\gamma$ in the supplementary experiments
γ | Rmax = 20 | Rmax = 5e | Rmax = 8 | Rmax = 6 |
0.0 | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} |
0.1 | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} |
0.2 | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} |
0.3 | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} |
0.4 | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} |
0.5 | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} |
0.6 | {0,0,0,0,0} | {1,1,1,1,1} | {1,1,1,1,1} | {1,1,1,1,1} |
0.7 | {0,0,0,0,0} | {0,0,0,0,0} | {1,1,1,1,1} | {1,1,1,1,1} |
0.8 | {0,0,0,0,0} | {0,0,0,0,0} | {0,0,0,0,0} | {1,1,1,1,1} |
0.9 | {0,0,0,0,0} | {0,0,0,0,0} | {0,0,0,0,0} | {0,0,0,0,0} |
1.0 | {0,0,0,0,0} | {0,0,0,0,0} | {0,0,0,0,0} | {0,0,0,0,0} |
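The flip points in Table 3 can be checked by hand. Assuming, as the converged Q-values in Table 4 below suggest, that each step along the action-0 path costs $-1$ and the final state returns $R_{\max}$ after five steps, while the action-1 alternative keeps a value of 0, the all-0 path becomes optimal once
$-1-\gamma-\gamma^{2}-\gamma^{3}+\gamma^{4}R_{\max} > 0.$
For $R_{\max}=20$ this first holds at $\gamma=0.6$ (value 0.416, versus $-0.625$ at $\gamma=0.5$), for $R_{\max}=8$ at $\gamma=0.8$, and for $R_{\max}=6$ at $\gamma=0.9$, matching the rows at which the optimal path switches from {1,1,1,1,1} to {0,0,0,0,0}.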
Table 4  Q-table under the condition of $R_{\max}=20$ and $\gamma=0.6$ in the supplementary experiments
State | Action 0 | Action 1
[0,0] | 0.416 | 0
[0,1] | −1 | 0
[0,2] | −1 | 0
[0,3] | −1 | 0
[0,4] | −1 | 0
[1,0] | 2.36 | 0
[1,1] | −1 | 0
[1,2] | −1 | 0
[1,3] | −1 | 0
[2,0] | 5.6 | 0
[2,1] | −1 | 0
[2,2] | −1 | 0
[3,0] | 11 | 0
[3,1] | −1 | 0
[4,0] | 20 | 0
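The converged action-0 column in Table 4 follows a simple one-step Bellman backup. The sketch below is not the authors' code; it is a minimal reconstruction that assumes a five-state chain with a per-step reward of $-1$ and a terminal reward of $R_{\max}$, which is consistent with the values 0.416, 2.36, 5.6, 11 and 20 obtained for $R_{\max}=20$ and $\gamma=0.6$:

# Hypothetical reconstruction of the action-0 values in Table 4.
# Assumes (not stated explicitly in the table) a reward of -1 per step on the
# delayed-reward path and a terminal reward of r_max at the last state.
def chain_values(r_max, gamma, n_states=5):
    q = [0.0] * n_states
    q[-1] = r_max                       # state [4,0]: the long-delay reward itself
    for k in range(n_states - 2, -1, -1):
        q[k] = -1.0 + gamma * q[k + 1]  # one-step backup: Q_k = -1 + gamma * Q_{k+1}
    return q

print(chain_values(20, 0.6))  # approximately [0.416, 2.36, 5.6, 11.0, 20]

Because the start-state value 0.416 is positive while action 1 keeps a value of 0, the greedy policy prefers the long-delay reward, which is why the optimal path in the supplementary experiments switches to {0,0,0,0,0} at $\gamma=0.6$ for $R_{\max}=20$.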