双LSTM驱动的高分遥感影像地物目标空间关系语义描述
Semantic understanding of geo-objects’ relationship in high resolution remote sensing image driven by dual LSTM
2021, Vol. 25, No. 5, pp. 1085-1094
Print publication date: 2021-05-07
DOI: 10.11834/jrs.20210340
陈杰,戴欣宜,周兴,孙庚,邓敏.2021.双LSTM驱动的高分遥感影像地物目标空间关系语义描述.遥感学报,25(5): 1085-1094
Chen J, Dai X Y, Zhou X, Sun G and Deng M. 2021. Semantic understanding of geo-objects’ relationship in high resolution remote sensing image driven by dual LSTM. National Remote Sensing Bulletin, 25(5): 1085-1094
高分辨率遥感影像中的地物目标具有清晰的类别属性与空间关系语义。在人工智能技术支撑下,用计算机自动认知其空间关系具备了可行性。目前,遥感影像场景的语义理解主要依托图像描述任务(image caption),基于影像的全局特征生成描述语句。但是,这种粗粒度特征容易导致地物目标的类别属性在描述语句生成过程中被错误预测。事实上,以地物目标作为空间关系语义理解的基本单元,更符合人们认知地理空间的习惯。为得到更准确的描述语句,本文构建了基于地物目标的遥感影像语义理解数据集,并提出双LSTM驱动的地物目标空间关系语义理解方法。该方法用目标检测模型识别影像中的显著目标,将这些目标特征输入到语言模型,以缓解描述语句中类别被错误预测的问题。进而,为利用遥感影像场景信息,将影像全局特征与目标区域特征进行融合,并用双LSTM预测目标的注意力分布,提高描述语句生成质量。对比实验结果表明,该方法能生成更准确的图像描述。
Geo-objects in High-Resolution Remote Sensing Images (HRSIs) have clear category attributes and rich semantic information. With the support of artificial intelligence technology, their spatial relationships can be automatically recognized by a computer. At present, the semantic understanding of HRSIs relies mainly on image caption models that generate sentences from global image features. However, such coarse-grained features easily cause the category attribute of a geo-object to be mispredicted during sentence generation. In fact, taking the geo-object as the basic unit of semantic understanding is more in line with how people cognize geographic space. To obtain more accurate sentences, this study constructs an Object-based Geo-spatial Relation Image Understanding Dataset (OGRIUD) and proposes a dual-LSTM-driven semantic understanding method. The proposed dataset is organized around objects, and each sentence describes the category and location of the geo-objects, remedying the lack of target category and location information in current remote sensing semantic understanding. The proposed method uses an object detection model to identify salient objects in the image and feeds the object features into the language model, alleviating the problem of incorrectly predicted categories in the generated descriptions. Furthermore, to exploit HRSI scene information, we fuse the global and regional features and use a dual LSTM to predict the attention distribution over the geo-objects. We compare the proposed object-feature-based approach with the global-feature-based approach. Quantitative results show that the proposed method raises the exact matching accuracy from 53.5% to 62.33%, and visual analysis shows that the spatial relation descriptions it generates are also richer. The method enables the language model to focus on objects with actual semantics, and the match between the generated descriptions and the remote sensing image content is improved accordingly. The correspondence between visual objects and descriptions also improves the interpretability of remote sensing image understanding.
关键词：高分辨率遥感影像；地物目标；空间关系；语义理解；图像描述
Keywords: high resolution remote sensing image; ground objects; spatial relationships; semantic understanding; image caption
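To make the architecture described in the abstract concrete, the sketch below shows, in PyTorch, how a dual-LSTM captioning decoder over detector-derived object features can be wired: an attention LSTM predicts the attention distribution over the detected objects, a language LSTM generates the next word from the attended object feature, and a mean of the region features stands in for the fused global scene feature. This is a minimal illustration under those assumptions; all layer sizes and identifier names (e.g. DualLSTMDecoder) are hypothetical and do not reproduce the authors' implementation.

# Minimal sketch of a dual-LSTM captioning decoder over object region features.
# Assumptions (not from the paper): layer sizes, concatenation-based fusion, and
# mean-pooled region features as the global scene feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualLSTMDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512, att_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # First LSTM: decides where to look (attention LSTM).
        self.att_lstm = nn.LSTMCell(hidden_dim + feat_dim + embed_dim, hidden_dim)
        # Second LSTM: generates words from the attended object feature (language LSTM).
        self.lang_lstm = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        # Additive attention over the detected object features.
        self.att_v = nn.Linear(feat_dim, att_dim)
        self.att_h = nn.Linear(hidden_dim, att_dim)
        self.att_out = nn.Linear(att_dim, 1)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, region_feats, words):
        """region_feats: (B, K, feat_dim) detector features; words: (B, T) token ids."""
        B, K, _ = region_feats.shape
        global_feat = region_feats.mean(dim=1)  # stand-in for the global scene feature
        h_att = c_att = h_lang = c_lang = region_feats.new_zeros(B, self.att_lstm.hidden_size)
        logits = []
        for t in range(words.size(1)):
            w = self.embed(words[:, t])
            # Attention LSTM sees the previous language state, the global feature and the word.
            h_att, c_att = self.att_lstm(torch.cat([h_lang, global_feat, w], dim=1), (h_att, c_att))
            # Attention distribution over the K detected objects.
            e = self.att_out(torch.tanh(self.att_v(region_feats) + self.att_h(h_att).unsqueeze(1))).squeeze(-1)
            alpha = F.softmax(e, dim=1)  # (B, K)
            attended = (alpha.unsqueeze(-1) * region_feats).sum(dim=1)
            # Language LSTM fuses the attended object feature with the attention state.
            h_lang, c_lang = self.lang_lstm(torch.cat([attended, h_att], dim=1), (h_lang, c_lang))
            logits.append(self.classifier(h_lang))
        return torch.stack(logits, dim=1)  # (B, T, vocab_size)

At training time such a decoder would be fed the ground-truth caption tokens (teacher forcing); at inference, the previously predicted word would replace words[:, t] step by step. The concatenation-based fusion and mean-pooled global feature are only one plausible reading of "fusing global and regional features"; the paper may instead fuse a separately extracted CNN global feature.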