Integration of Global and Local Representations for Fine-Grained Cross-Modal Alignment

Jin, Seungwan; Choi, Hoyoung; Noh, Taehyung; Han, Kyungsik

doi:10.1007/978-3-031-73010-8_4

Authors: Jin, Seungwan; Choi, Hoyoung; Noh, Taehyung; Han, Kyungsik

Issue Date: Nov-2024

Publisher: Springer Verlag

Keywords: Fashion; Fine-grained Representation Learning; Vision-Language Pre-training

Citation: Lecture Notes in Computer Science, vol. 15141, pp. 53-70

Pages: 18

Indexed: SCOPUS

Journal Title: Lecture Notes in Computer Science

Volume: 15141

Start Page: 53

End Page: 70

URI: https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/199806

DOI: 10.1007/978-3-031-73010-8_4

ISSN: 0302-9743, 1611-3349

Abstract: Fashion is one of the representative domains of fine-grained Vision-Language Pre-training (VLP) involving a large number of images and text. Previous fashion VLP research has proposed various pre-training tasks to account for fine-grained details in multimodal fusion. However, fashion VLP research has not yet addressed the need to focus on (1) uni-modal embeddings that reflect fine-grained features and (2) hard negative samples to improve the performance of fine-grained V+L retrieval tasks. In this paper, we propose Fashion-FINE (Fashion VLP with Fine-grained Cross-modal Alignment using the INtegrated representations of global and local patch Embeddings), which consists of three key modules. First, a modality-agnostic adapter (MAA) learns uni-modal integrated representations and reflects fine-grained details contained in local patches. Second, hard negative mining with focal loss (HNM-F) performs cross-modal alignment using the integrated representations, focusing on hard negatives to boost the learning of fine-grained cross-modal alignment. Third, comprehensive cross-modal alignment (C-CmA) extracts low- and high-level fashion information from the text and learns the semantic alignment to encourage disentangled embedding of the integrated image representations. Fashion-FINE achieved state-of-the-art performance on two representative public benchmarks (i.e., FashionGen and FashionIQ) in three representative V+L retrieval tasks, demonstrating its effectiveness in learning fine-grained features.
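The abstract does not give the formulation of HNM-F, so the following is only a minimal sketch of the general idea of combining an in-batch contrastive objective with focal-style weighting, not the authors' method. It assumes a square similarity matrix with matched image-text pairs on the diagonal; the function name `focal_contrastive_loss` and the parameters `gamma` and `temperature` are hypothetical, and only the image-to-text direction is shown.

```python
import math


def focal_contrastive_loss(sim, gamma=2.0, temperature=0.07):
    """Focal-weighted in-batch contrastive loss (illustrative only).

    sim[i][j] is the similarity between image i and text j; the matched
    pair for image i is assumed to be text i (the diagonal).
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        # Numerically stable log-softmax of the matched pair.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        log_p_pos = logits[i] - log_z
        p_pos = math.exp(log_p_pos)
        # Focal factor: rows where negatives compete strongly with the
        # positive (low p_pos) keep most of their weight, easy rows shrink.
        total += -((1.0 - p_pos) ** gamma) * log_p_pos
    return total / n


# Easy batch: negatives are far from the positives.
easy = [[1.0, 0.0], [0.0, 1.0]]
# Hard batch: negatives are almost as similar as the positives.
hard = [[1.0, 0.95], [0.95, 1.0]]
easy_loss = focal_contrastive_loss(easy)
hard_loss = focal_contrastive_loss(hard)
```

With `gamma=0` the focal factor is 1 and the function reduces to a plain softmax contrastive loss; larger `gamma` values shift more of the total loss toward rows with hard negatives.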

Appears in Collections: Seoul College of Engineering > ETC > 1. Journal Articles

Related Researcher

Han, Kyungsik: COLLEGE OF ENGINEERING (DEPARTMENT OF INTELLIGENCE COMPUTING)
