Progress of Cross-modal Retrieval Methods Based on Representation Learning
DU Jinfeng (杜锦丰); WANG Hairong (王海荣); LIANG Huan (梁焕); WANG Dong (王栋)
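Abstract:
The rapid growth of multimodal data has created demand for cross-modal retrieval and spurred research on cross-modal retrieval methods. This paper reviews recent progress in the field, surveying cross-modal retrieval methods based on representation learning in both domestic and international work. It defines the cross-modal retrieval problem and summarizes the field's commonly used techniques, mainstream models, common datasets, evaluation methods, and main challenges. Representation-learning-based cross-modal retrieval methods are presented in three categories (statistical correlation analysis, graph regularization, and metric learning), and the strengths and weaknesses of each are analyzed. To compare these methods, 14 methods are reproduced and evaluated on 4 datasets. The experimental results show that methods based on statistical correlation analysis train efficiently and are easy to implement; methods based on graph regularization establish semantic associations by mining intra-modal and inter-modal similarity; and methods based on metric learning preserve, as far as possible, the semantic similarity and dissimilarity information of the data in a common subspace. This survey of the state of research on representation-learning-based cross-modal retrieval is intended as a reference for further work on cross-modal retrieval methods.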
Keywords: multimodal data; cross-modal retrieval; statistical correlation analysis; graph regularization; metric learning
Foundation: Natural Science Foundation of Ningxia (2020AAC03218); Ningxia Provincial Cultivation Project (PY1906); Ningxia Talent Project (KJT2019002)
DOI: 10.16088/j.issn.1001-6600.2021071302