Multimedia analysis of robustly optimized multimodal transformer based on vision and language co-learning
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Yoon, JunHo | - |
dc.contributor.author | Choi, GyuHo | - |
dc.contributor.author | Choi, Chang | - |
dc.date.accessioned | 2023-09-21T02:40:35Z | - |
dc.date.available | 2023-09-21T02:40:35Z | - |
dc.date.created | 2023-09-21 | - |
dc.date.issued | 2023-12 | - |
dc.identifier.issn | 1566-2535 | - |
dc.identifier.uri | https://scholarworks.bwise.kr/gachon/handle/2020.sw.gachon/89120 | - |
dc.description.abstract | Recently, multimodal learning that uses information from all modalities has been studied to detect disinformation in multimedia. Existing multimodal learning methods include score-level fusion approaches that combine different models, and feature-level fusion methods that combine embedding vectors to integrate data of different dimensions. Because late-level fusion combines outputs only after each modality is processed independently, overall performance is limited by the recognition performance of the individual unimodal models. In addition, feature-level fusion is constrained by the requirement that the data across modalities be matched. In this study, we propose a classification system using a RoBERTa-based multimodal fusion transformer (RoBERTaMFT) that applies a co-learning method to overcome the recognition-performance limitations of multimodal learning as well as the data imbalance among modalities. RoBERTaMFT consists of image feature extraction, co-learning that reconstructs image features from text embeddings, and a late-level fusion step applied to the final classification (see the sketch following this table). As experimental results on the CrisisMMD dataset indicate, RoBERTaMFT achieved an accuracy 21.2% higher and an F1-score 0.414 higher than those of unimodal learning, and an accuracy 11.7% higher and an F1-score 0.268 higher than those of existing multimodal learning. | - |
dc.language | English | - |
dc.language.iso | en | - |
dc.publisher | ELSEVIER | - |
dc.relation.isPartOf | INFORMATION FUSION | - |
dc.title | Multimedia analysis of robustly optimized multimodal transformer based on vision and language co-learning | - |
dc.type | Article | - |
dc.type.rims | ART | - |
dc.description.journalClass | 1 | - |
dc.identifier.wosid | 001055953800001 | - |
dc.identifier.doi | 10.1016/j.inffus.2023.101922 | - |
dc.identifier.bibliographicCitation | INFORMATION FUSION, v.100 | - |
dc.description.isOpenAccess | N | - |
dc.identifier.scopusid | 2-s2.0-85165528887 | - |
dc.citation.title | INFORMATION FUSION | - |
dc.citation.volume | 100 | - |
dc.contributor.affiliatedAuthor | Yoon, JunHo | - |
dc.contributor.affiliatedAuthor | Choi, Chang | - |
dc.type.docType | Article | - |
dc.subject.keywordAuthor | Multi-modal | - |
dc.subject.keywordAuthor | Multimedia | - |
dc.subject.keywordAuthor | Natural disasters | - |
dc.subject.keywordAuthor | Classification | - |
dc.subject.keywordPlus | FEATURE FUSION | - |
dc.relation.journalResearchArea | Computer Science | - |
dc.relation.journalWebOfScienceCategory | Computer Science, Artificial Intelligence | - |
dc.relation.journalWebOfScienceCategory | Computer Science, Theory & Methods | - |
dc.description.journalRegisteredClass | scie | - |
dc.description.journalRegisteredClass | scopus | - |
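The abstract above outlines a three-stage pipeline: image feature extraction, co-learning that reconstructs image features from text embeddings, and late-level fusion for the final classification. The following is a minimal sketch of that structure in PyTorch, assuming a ResNet-50 image backbone, a pretrained `roberta-base` text encoder, an MSE reconstruction loss, and averaged score fusion. Every module name, dimension, and loss weight here is a hypothetical illustration of the described architecture, not the authors' implementation.

```python
# Hypothetical sketch of the RoBERTaMFT pipeline described in the abstract:
# (1) image feature extraction, (2) co-learning that reconstructs image
# features from text embeddings, (3) late-level fusion for classification.
# All names, dimensions, and weights are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import RobertaModel

class RoBERTaMFTSketch(nn.Module):
    def __init__(self, num_classes: int, img_dim: int = 2048, txt_dim: int = 768):
        super().__init__()
        # (1) Image feature extractor: ResNet-50 backbone without its classifier head.
        backbone = resnet50(weights="IMAGENET1K_V2")
        self.image_encoder = nn.Sequential(*list(backbone.children())[:-1])
        # Text encoder: pretrained RoBERTa; the first token's hidden state
        # serves as a pooled sentence embedding.
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        # (2) Co-learning head: reconstruct image features from text embeddings,
        # coupling the two modalities during training.
        self.reconstructor = nn.Sequential(
            nn.Linear(txt_dim, img_dim), nn.ReLU(), nn.Linear(img_dim, img_dim)
        )
        # Per-modality classifiers whose scores are fused at a late level.
        self.image_classifier = nn.Linear(img_dim, num_classes)
        self.text_classifier = nn.Linear(txt_dim, num_classes)

    def forward(self, images, input_ids, attention_mask):
        img_feat = self.image_encoder(images).flatten(1)           # (B, img_dim)
        txt_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]                                   # (B, txt_dim)
        recon = self.reconstructor(txt_feat)                        # (B, img_dim)
        # (3) Late-level fusion: average the per-modality class scores.
        logits = 0.5 * (self.image_classifier(img_feat) + self.text_classifier(txt_feat))
        return logits, img_feat, recon

# Training combines the classification loss with a reconstruction loss that
# ties text embeddings to image features; the 0.5 weight is an assumption.
def loss_fn(logits, labels, img_feat, recon, recon_weight: float = 0.5):
    ce = nn.functional.cross_entropy(logits, labels)
    mse = nn.functional.mse_loss(recon, img_feat.detach())
    return ce + recon_weight * mse
```

Averaging per-modality scores keeps the late-fusion step simple; the reconstruction term is what realizes the co-learning idea, pulling the text encoder toward the image feature space during training rather than leaving each unimodal branch fully independent.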