Journal of Systems Engineering and Electronics, 2023, Vol. 34, Issue 1: 9-18. doi: 10.23919/JSEE.2023.000035

• REMOTE SENSING •

VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning

Tingting WEI, Weilin YUAN, Junren LUO, Wanpeng ZHANG, Lina LU

  1. College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
  • Received: 2022-08-30  Accepted: 2023-01-13  Online: 2023-02-18  Published: 2023-03-03
  • Contact: Wanpeng ZHANG, E-mail: wpzhang@nudt.edu.cn
  • About author:
    WEI Tingting was born in 1997. She received her M.S. degree from the College of Intelligence Science and Technology, National University of Defense Technology, where she is pursuing her Ph.D. degree. Her research interests include pattern recognition and knowledge inference. E-mail: weitingting20@nudt.edu.cn

    YUAN Weilin was born in 1994. He received his M.S. degree in control science and engineering from National University of Defense Technology, where he is pursuing his Ph.D. degree. His research interests include cognitive decision-making and intelligent gaming, reinforcement learning, and multi-agent systems. E-mail: yuanweilin12@nudt.edu.cn

    LUO Junren was born in 1989. He received his M.S. degree in control science and engineering from National University of Defense Technology, where he is pursuing his Ph.D. degree. His research interests include multi-agent learning and game confrontation. E-mail: luojunren17@nudt.edu.cn

    ZHANG Wanpeng was born in 1981. He received his M.S. and Ph.D. degrees in control science and engineering from National University of Defense Technology (NUDT). He is a professor at NUDT. His research interests include mission planning and intelligent control. E-mail: wpzhang@nudt.edu.cn

    LU Lina was born in 1984. She received her Ph.D. degree in control science and engineering from National University of Defense Technology (NUDT). She is a lecturer at NUDT. Her main research interests include machine learning, and multi-agent cooperation and confrontation. E-mail: lulina16@nudt.edu.cn
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61702528; 61806212).

Abstract:

In the field of satellite imagery, remote sensing image captioning (RSIC) is an active research topic that faces the challenges of overfitting and of aligning images with text. To address these issues, this paper proposes a vision-language aligning paradigm for RSIC that jointly represents vision and language. First, a new RSIC dataset, DIOR-Captions, is built by augmenting the object detection in optical remote sensing images (DIOR) dataset with manually annotated Chinese and English captions. Second, a Vision-Language aligning model with Cross-modal Attention (VLCA) is presented to generate accurate and rich bilingual descriptions for remote sensing images. Third, a cross-modal learning network is introduced to address the problem of vision-language alignment. Notably, VLCA is also applied to end-to-end Chinese caption generation by using a Chinese pre-trained language model. Experiments with various baselines are carried out to validate VLCA on the proposed dataset. The results demonstrate that the proposed algorithm produces more descriptive and informative captions than existing algorithms.
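For intuition, the following is a minimal sketch of the cross-modal attention idea described in the abstract, written in PyTorch. The module name, feature dimensions, and the choice of text tokens as queries over image regions are illustrative assumptions for exposition, not the authors' actual VLCA implementation.

```python
# A minimal sketch of cross-modal attention (assumes PyTorch >= 1.9).
# Names, dimensions, and the query/key/value layout are illustrative
# assumptions, not the authors' VLCA implementation.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Lets caption tokens (queries) attend over image region features
    (keys/values), yielding visually grounded text representations."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text:  (batch, n_tokens,  dim) -- caption token embeddings
        # image: (batch, n_regions, dim) -- visual region features
        attended, _ = self.attn(query=text, key=image, value=image)
        # Residual connection preserves the original language signal.
        return self.norm(text + attended)

# Usage: fuse 20 caption tokens with 49 image regions (e.g., a 7x7 grid).
fusion = CrossModalAttention()
text = torch.randn(2, 20, 512)
image = torch.randn(2, 49, 512)
out = fusion(text, image)  # shape: (2, 20, 512)
```

The design choice sketched here, queries from one modality attending over keys and values from the other, is the standard way to let a caption decoder ground each generated word in specific image regions.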

Key words: remote sensing image captioning (RSIC), vision-language representation, remote sensing image caption dataset, attention mechanism