Detailed Information

Cited 0 times in Web of Science · Cited 1 time in Scopus

Multimedia analysis of robustly optimized multimodal transformer based on vision and language co-learning

Full metadata record
DC Field: Value
dc.contributor.author: Yoon, JunHo
dc.contributor.author: Choi, GyuHo
dc.contributor.author: Choi, Chang
dc.date.accessioned: 2023-09-21T02:40:35Z
dc.date.available: 2023-09-21T02:40:35Z
dc.date.created: 2023-09-21
dc.date.issued: 2023-12
dc.identifier.issn: 1566-2535
dc.identifier.uri: https://scholarworks.bwise.kr/gachon/handle/2020.sw.gachon/89120
dc.description.abstract: Recently, multimodal learning that uses information from all modalities has been studied to detect disinformation in multimedia. Existing multimodal learning methods include score-level fusion approaches, which combine the outputs of separate models, and feature-level fusion methods, which combine embedding vectors to integrate data of different dimensions. Because late-level fusion combines the modalities only after each one is processed individually, the overall performance is limited by the recognition performance of the individual unimodal models. In addition, such fusion methods require the data across modalities to be matched. In this study, we propose a classification system using a RoBERTa-based multimodal fusion transformer (RoBERTaMFT) that applies a co-learning method to address both the recognition-performance limitations of multimodal learning and the data imbalance among modalities. RoBERTaMFT consists of image feature extraction, co-learning that reconstructs image features from text embeddings, and a late-level fusion step applied to the final classification. In experiments on the CrisisMMD dataset, RoBERTaMFT achieved an accuracy 21.2% and an F1-score 0.414 higher than those of unimodal learning, and an accuracy 11.7% and an F1-score 0.268 higher than those of existing multimodal learning methods.
dc.language: English
dc.language.iso: en
dc.publisher: ELSEVIER
dc.relation.isPartOf: INFORMATION FUSION
dc.title: Multimedia analysis of robustly optimized multimodal transformer based on vision and language co-learning
dc.type: Article
dc.type.rims: ART
dc.description.journalClass: 1
dc.identifier.wosid: 001055953800001
dc.identifier.doi: 10.1016/j.inffus.2023.101922
dc.identifier.bibliographicCitation: INFORMATION FUSION, v.100
dc.description.isOpenAccess: N
dc.identifier.scopusid: 2-s2.0-85165528887
dc.citation.title: INFORMATION FUSION
dc.citation.volume: 100
dc.contributor.affiliatedAuthor: Yoon, JunHo
dc.contributor.affiliatedAuthor: Choi, Chang
dc.type.docType: Article
dc.subject.keywordAuthor: Multi-modal
dc.subject.keywordAuthor: Multimedia
dc.subject.keywordAuthor: Natural disasters
dc.subject.keywordAuthor: Classification
dc.subject.keywordPlus: FEATURE FUSION
dc.relation.journalResearchArea: Computer Science
dc.relation.journalWebOfScienceCategory: Computer Science, Artificial Intelligence
dc.relation.journalWebOfScienceCategory: Computer Science, Theory & Methods
dc.description.journalRegisteredClass: scie
dc.description.journalRegisteredClass: scopus
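
Note on the described architecture: the abstract lists three stages, namely image feature extraction, co-learning that reconstructs image features from text embeddings, and late-level fusion for the final classification. The following is a minimal PyTorch sketch of that pipeline, not the authors' implementation: the module names, feature dimensions, stand-in encoders (a pooled RoBERTa embedding and a CNN feature vector are assumed as inputs), and the score-averaging fusion rule are all illustrative assumptions.

# Minimal sketch (not the authors' code) of the three stages described in the
# abstract: (1) image feature extraction, (2) co-learning that reconstructs
# image features from text embeddings, (3) late-level fusion of per-modality
# class scores. Dimensions, module names, and the fusion rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RoBERTaMFTSketch(nn.Module):
    def __init__(self, text_dim=768, img_dim=2048, hidden=512, num_classes=8):
        super().__init__()
        # Stage 1: image feature extraction (a pretrained CNN backbone would be
        # used in practice; a linear stub keeps this sketch self-contained).
        self.image_encoder = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        # Text path: projects a pooled RoBERTa embedding (assumed input).
        self.text_encoder = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        # Stage 2: co-learning head that reconstructs image features from text.
        self.reconstructor = nn.Linear(hidden, hidden)
        # Per-modality classifiers whose scores are fused at a late level.
        self.text_classifier = nn.Linear(hidden, num_classes)
        self.image_classifier = nn.Linear(hidden, num_classes)

    def forward(self, text_emb, img_feat=None):
        t = self.text_encoder(text_emb)
        recon = self.reconstructor(t)  # map text into image-feature space
        if img_feat is not None:
            v = self.image_encoder(img_feat)
            recon_loss = F.mse_loss(recon, v.detach())  # co-learning objective
        else:
            # Missing image: fall back to the reconstructed image features.
            v, recon_loss = recon, None
        # Stage 3: late-level fusion by averaging the per-modality scores.
        logits = (self.text_classifier(t) + self.image_classifier(v)) / 2
        return logits, recon_loss


# Hypothetical usage with random tensors standing in for real features.
model = RoBERTaMFTSketch()
text_emb = torch.randn(4, 768)    # pooled RoBERTa embeddings (assumed)
img_feat = torch.randn(4, 2048)   # CNN image features (assumed)
logits, recon_loss = model(text_emb, img_feat)

In this sketch, the reconstruction head lets the model fall back to text-derived image features when an image is missing, which is one plausible reading of how co-learning could mitigate modality imbalance; consult the paper (DOI above) for the actual design.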
Files in This Item
There are no files associated with this item.
Appears in Collections
College of IT Convergence > Department of Computer Engineering > 1. Journal Articles


Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Choi, Chang
College of IT Convergence (Department of Computer Engineering, Computer Engineering Major)
