Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

Oh, Youngtaek; Cho, Jae Won; Kim, Dong-Jin; Kweon, In So; Kim, Junmo

doi:10.48550/arXiv.2410.05210

Full metadata record

DC Field	Value	Language
dc.contributor.author	Oh, Youngtaek	-
dc.contributor.author	Cho, Jae Won	-
dc.contributor.author	Kim, Dong-Jin	-
dc.contributor.author	Kweon, In So	-
dc.contributor.author	Kim, Junmo	-
dc.date.accessioned	2025-03-10T07:00:11Z	-
dc.date.available	2025-03-10T07:00:11Z	-
dc.date.issued	2024-11	-
dc.identifier.uri	https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/206714	-
dc.description.abstract	In this paper, we propose a new method to enhance compositional understanding in pretrained vision and language models (VLMs) without sacrificing performance in zero-shot multi-modal tasks. Traditional fine-tuning approaches often improve compositional reasoning at the cost of degrading multi-modal capabilities, primarily due to the use of global hard negative (HN) loss, which contrasts global representations of images and texts. This global HN loss pushes HN texts that are highly similar to the original ones, damaging the model's multi-modal representations. To overcome this limitation, we propose Fine-grained Selective Calibrated CLIP (FSC-CLIP), which integrates local hard negative loss and selective calibrated regularization. These innovations provide fine-grained negative supervision while preserving the model's representational integrity. Our extensive evaluations across diverse benchmarks for both compositionality and multi-modal tasks show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities. Code is available at: https://github.com/ytaek-oh/fsc-clip.	-
dc.format.extent	17	-
dc.language	English	-
dc.language.iso	ENG	-
dc.publisher	Association for Computational Linguistics (ACL)	-
dc.title	Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality	-
dc.type	Article	-
dc.identifier.doi	10.48550/arXiv.2410.05210	-
dc.identifier.scopusid	2-s2.0-85217819447	-
dc.identifier.bibliographicCitation	EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pp 19060 - 19076	-
dc.citation.title	EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference	-
dc.citation.startPage	19060	-
dc.citation.endPage	19076	-
dc.type.docType	Conference paper	-
dc.description.isOpenAccess	N	-
dc.description.journalRegisteredClass	scopus	-
dc.subject.keywordPlus	Benchmarking	-
dc.subject.keywordPlus	Modal analysis	-
dc.subject.keywordPlus	Visual languages	-
dc.identifier.url	https://arxiv.org/abs/2410.05210	-
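
The abstract attributes the degradation to a global hard negative (HN) loss that contrasts whole-image and whole-text representations. As a rough illustration only, the sketch below shows one common way such a loss can be written: a CLIP-style image-to-text contrastive loss in which hard-negative captions are added as extra candidates. This is not the FSC-CLIP implementation from the linked repository; the function name `global_hn_loss`, the temperature value, and the toy embeddings are all assumptions made for the example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def global_hn_loss(image_embs, text_embs, hn_text_embs, temperature=0.07):
    """Image-to-text contrastive loss with hard-negative captions (illustrative).

    For image i, the candidates are every caption in the batch followed by
    every hard-negative caption; the matching caption text_embs[i] is the
    target. Inputs are plain lists of float vectors (global embeddings).
    """
    images = [normalize(v) for v in image_embs]
    candidates = [normalize(v) for v in text_embs + hn_text_embs]
    total = 0.0
    for i, img in enumerate(images):
        # Cosine similarities scaled by the (assumed) temperature.
        logits = [dot(img, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching caption as the positive class.
        total += log_z - logits[i]
    return total / len(images)


# Toy 2-D embeddings (hypothetical values, for illustration only).
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.1], [0.1, 1.0]]
far_negatives = [[-1.0, 0.0], [0.0, -1.0]]   # easy to tell apart
near_negatives = [[1.0, 0.2], [0.2, 1.0]]    # almost identical to the captions

loss_far = global_hn_loss(images, texts, far_negatives)
loss_near = global_hn_loss(images, texts, near_negatives)
print(loss_near > loss_far)
```

In this toy setting, hard negatives that lie close to the original captions contribute a much larger loss than distant ones, so the gradient pressure to separate nearly identical texts is strong. That is consistent with the abstract's concern about global HN loss, though the example says nothing about how FSC-CLIP's local HN loss or selective calibrated regularization are actually computed.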

Appears in Collections: Seoul College of Engineering > ETC > 1. Journal Articles

Related Researcher

Kim, Dong Jin: COLLEGE OF ENGINEERING (DEPARTMENT OF INTELLIGENCE COMPUTING)
