Integration of Global and Local Representations for Fine-Grained Cross-Modal Alignment

Full metadata record
DC Field: Value
dc.contributor.author: Jin, Seungwan
dc.contributor.author: Choi, Hoyoung
dc.contributor.author: Noh, Taehyung
dc.contributor.author: Han, Kyungsik
dc.date.accessioned: 2024-12-04T05:00:14Z
dc.date.available: 2024-12-04T05:00:14Z
dc.date.issued: 2024-11
dc.identifier.issn: 0302-9743
dc.identifier.issn: 1611-3349
dc.identifier.uri: https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/199806
dc.description.abstract: Fashion is one of the representative domains of fine-grained Vision-Language Pre-training (VLP) involving a large number of images and text. Previous fashion VLP research has proposed various pre-training tasks to account for fine-grained details in multimodal fusion. However, fashion VLP research has not yet addressed the need to focus on (1) uni-modal embeddings that reflect fine-grained features and (2) hard negative samples to improve the performance of fine-grained V+L retrieval tasks. In this paper, we propose Fashion-FINE (Fashion VLP with Fine-grained Cross-modal Alignment using the INtegrated representations of global and local patch Embeddings), which consists of three key modules. First, a modality-agnostic adapter (MAA) learns uni-modal integrated representations and reflects fine-grained details contained in local patches. Second, hard negative mining with focal loss (HNM-F) performs cross-modal alignment using the integrated representations, focusing on hard negatives to boost the learning of fine-grained cross-modal alignment. Third, comprehensive cross-modal alignment (C-CmA) extracts low- and high-level fashion information from the text and learns the semantic alignment to encourage disentangled embedding of the integrated image representations. Fashion-FINE achieved state-of-the-art performance on two representative public benchmarks (i.e., FashionGen and FashionIQ) in three representative V+L retrieval tasks, demonstrating its effectiveness in learning fine-grained features.
dc.format.extent: 18
dc.language: English
dc.language.iso: ENG
dc.publisher: Springer Verlag
dc.title: Integration of Global and Local Representations for Fine-Grained Cross-Modal Alignment
dc.type: Article
dc.publisher.location: United States
dc.identifier.doi: 10.1007/978-3-031-73010-8_4
dc.identifier.scopusid: 2-s2.0-85210322367
dc.identifier.wosid: 001416938600004
dc.identifier.bibliographicCitation: Lecture Notes in Computer Science, v.15141, pp 53 - 70
dc.citation.title: Lecture Notes in Computer Science
dc.citation.volume: 15141
dc.citation.startPage: 53
dc.citation.endPage: 70
dc.type.docType: Proceedings Paper
dc.description.isOpenAccess: N
dc.description.journalRegisteredClass: scopus
dc.relation.journalResearchArea: Computer Science
dc.relation.journalWebOfScienceCategory: Computer Science, Artificial Intelligence
dc.relation.journalWebOfScienceCategory: Computer Science, Interdisciplinary Applications
dc.relation.journalWebOfScienceCategory: Computer Science, Theory & Methods
dc.subject.keywordPlus: Benchmarking
dc.subject.keywordPlus: Embeddings
dc.subject.keywordPlus: Image coding
dc.subject.keywordPlus: Image representation
dc.subject.keywordPlus: Visual languages
dc.subject.keywordAuthor: Fashion
dc.subject.keywordAuthor: Fine-grained Representation Learning
dc.subject.keywordAuthor: Vision-Language Pre-training
Appears in Collections: Seoul College of Engineering > ETC > 1. Journal Articles

Related Researcher

Han, Kyungsik
COLLEGE OF ENGINEERING (DEPARTMENT OF INTELLIGENCE COMPUTING)