Integration of Global and Local Representations for Fine-Grained Cross-Modal Alignment
- Authors
- Jin, Seungwan; Choi, Hoyoung; Noh, Taehyung; Han, Kyungsik
- Issue Date
- Nov-2024
- Publisher
- Springer Verlag
- Keywords
- Fashion; Fine-grained Representation Learning; Vision-Language Pre-training
- Citation
- Lecture Notes in Computer Science, v.15141, pp 53 - 70
- Pages
- 18
- Indexed
- SCOPUS
- Journal Title
- Lecture Notes in Computer Science
- Volume
- 15141
- Start Page
- 53
- End Page
- 70
- URI
- https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/199806
- DOI
- 10.1007/978-3-031-73010-8_4
- ISSN
- 0302-9743
- 1611-3349
- Abstract
- Fashion is one of the representative domains of fine-grained Vision-Language Pre-training (VLP) involving a large number of images and text. Previous fashion VLP research has proposed various pre-training tasks to account for fine-grained details in multimodal fusion. However, fashion VLP research has not yet addressed the need to focus on (1) uni-modal embeddings that reflect fine-grained features and (2) hard negative samples to improve the performance of fine-grained V+L retrieval tasks. In this paper, we propose Fashion-FINE (Fashion VLP with Fine-grained Cross-modal Alignment using the INtegrated representations of global and local patch Embeddings), which consists of three key modules. First, a modality-agnostic adapter (MAA) learns uni-modal integrated representations and reflects fine-grained details contained in local patches. Second, hard negative mining with focal loss (HNM-F) performs cross-modal alignment using the integrated representations, focusing on hard negatives to boost the learning of fine-grained cross-modal alignment. Third, comprehensive cross-modal alignment (C-CmA) extracts low- and high-level fashion information from the text and learns the semantic alignment to encourage disentangled embedding of the integrated image representations. Fashion-FINE achieved state-of-the-art performance on two representative public benchmarks (i.e., FashionGen and FashionIQ) in three representative V+L retrieval tasks, demonstrating its effectiveness in learning fine-grained features.
- Appears in Collections
- Seoul College of Engineering > ETC > 1. Journal Articles
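
The abstract describes hard negative mining with focal loss (HNM-F) only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes a square image-text similarity matrix whose diagonal holds the matched pairs, and it applies a generic focal weighting to a softmax contrastive loss so that pairs the model already separates well contribute less. The function names, the `gamma` and `temperature` defaults, and the toy similarity values are all hypothetical.

```python
import math


def softmax(row, temperature):
    """Numerically stable softmax over one row of similarity scores."""
    scaled = [value / temperature for value in row]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def focal_contrastive_loss(sim, gamma=2.0, temperature=0.07):
    """Image-to-text contrastive loss with focal weighting (assumed form).

    sim[i][j] is the similarity of image i and text j; sim[i][i] is the
    matched pair. With gamma = 0 this reduces to plain cross-entropy.
    """
    losses = []
    for i, row in enumerate(sim):
        p_pos = softmax(row, temperature)[i]
        # (1 - p_pos) ** gamma shrinks the loss of easy, well-separated pairs,
        # so rows with a confusing (hard) negative dominate the average.
        losses.append(-((1.0 - p_pos) ** gamma) * math.log(p_pos))
    return sum(losses) / len(losses)


def hardest_negative(sim):
    """Index of the most similar non-matching text for each image."""
    return [
        max((j for j in range(len(row)) if j != i), key=lambda j: row[j])
        for i, row in enumerate(sim)
    ]


# Toy similarity matrix (hypothetical values, three image-text pairs).
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.7],
    [0.0, 0.4, 0.5],
]

plain = focal_contrastive_loss(sim, gamma=0.0)
focal = focal_contrastive_loss(sim, gamma=2.0)
negatives = hardest_negative(sim)
```

In this toy matrix the second and third rows have negatives that score close to the matched pair, so they keep most of their weight under the focal term while the well-separated first row is down-weighted. A symmetric text-to-image term could be added by applying the same function to the transposed matrix.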
