Detailed Information

Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

Full metadata record
DC Field: Value
dc.contributor.author: Oh, Youngtaek
dc.contributor.author: Cho, Jae Won
dc.contributor.author: Kim, Dong-Jin
dc.contributor.author: Kweon, In So
dc.contributor.author: Kim, Junmo
dc.date.accessioned: 2025-03-10T07:00:11Z
dc.date.available: 2025-03-10T07:00:11Z
dc.date.issued: 2024-11
dc.identifier.uri: https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/206714
dc.description.abstract: In this paper, we propose a new method to enhance compositional understanding in pretrained vision and language models (VLMs) without sacrificing performance in zero-shot multi-modal tasks. Traditional fine-tuning approaches often improve compositional reasoning at the cost of degrading multi-modal capabilities, primarily due to the use of global hard negative (HN) loss, which contrasts global representations of images and texts. This global HN loss pushes HN texts that are highly similar to the original ones, damaging the model's multi-modal representations. To overcome this limitation, we propose Fine-grained Selective Calibrated CLIP (FSC-CLIP), which integrates local hard negative loss and selective calibrated regularization. These innovations provide fine-grained negative supervision while preserving the model's representational integrity. Our extensive evaluations across diverse benchmarks for both compositionality and multi-modal tasks show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities. Code is available at: https://github.com/ytaek-oh/fsc-clip.
dc.format.extent: 17
dc.language: English
dc.language.iso: ENG
dc.publisher: Association for Computational Linguistics (ACL)
dc.title: Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality
dc.type: Article
dc.identifier.doi: 10.48550/arXiv.2410.05210
dc.identifier.scopusid: 2-s2.0-85217819447
dc.identifier.bibliographicCitation: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pp. 19060-19076
dc.citation.title: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
dc.citation.startPage: 19060
dc.citation.endPage: 19076
dc.type.docType: Conference paper
dc.description.isOpenAccess: N
dc.description.journalRegisteredClass: scopus
dc.subject.keywordPlus: Benchmarking
dc.subject.keywordPlus: Modal analysis
dc.subject.keywordPlus: Visual languages
dc.identifier.url: https://arxiv.org/abs/2410.05210
Appears in Collections
Seoul College of Engineering > ETC > 1. Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Kim, Dong Jin
COLLEGE OF ENGINEERING (DEPARTMENT OF INTELLIGENCE COMPUTING)