GMAIL: Generative Modality Alignment for generated Image Learning

Sukmin Yun; Shentong Mo

Full metadata record

DC Field	Value	Language
dc.contributor.author	Sukmin Yun	-
dc.contributor.author	Shentong Mo	-
dc.date.accessioned	2025-05-07T08:30:36Z	-
dc.date.available	2025-05-07T08:30:36Z	-
dc.date.issued	2025-05	-
dc.identifier.uri	https://scholarworks.bwise.kr/erica/handle/2021.sw.erica/125207	-
dc.description.abstract	Generative models have made it possible to synthesize highly realistic images, potentially providing an abundant data source for training machine learning models. Despite the advantages of these synthesizable data sources, the indiscriminate use of generated images as real images for training can even cause mode collapse due to modality discrepancies between real and synthetic domains. In this paper, we propose a novel framework for discriminative use of generated images, coined GMAIL, that explicitly treats generated images as a separate modality from real images. Instead of indiscriminately replacing real images with generated ones in the pixel space, our approach bridges the two distinct modalities in the same latent space through a multi-modal learning approach. Specifically, we first fine-tune a model exclusively on generated images using a cross-modality alignment loss and then employ this aligned model to further train various vision-language models with generated images. By aligning the two modalities, our approach effectively leverages the benefits of recent advances in generative models, thereby boosting the effectiveness of generated image learning across a range of vision-language tasks. Our framework can be easily incorporated into various vision-language models, and we demonstrate its efficacy through extensive experiments. For example, our framework significantly improves performance on image captioning, zero-shot image retrieval, zero-shot image classification, and long caption retrieval tasks. It also shows positive generated data scaling trends and notable enhancements in the captioning performance of the large multimodal model, LLaVA.	-
dc.format.extent	22	-
dc.language	English	-
dc.language.iso	ENG	-
dc.publisher	Proceedings of Machine Learning Research	-
dc.title	GMAIL: Generative Modality Alignment for generated Image Learning	-
dc.type	Article	-
dc.identifier.bibliographicCitation	International Conference on Machine Learning, pp 1 - 22	-
dc.citation.title	International Conference on Machine Learning	-
dc.citation.startPage	1	-
dc.citation.endPage	22	-
dc.type.docType	Proceeding	-
dc.description.isOpenAccess	N	-
dc.description.journalRegisteredClass	foreign	-
dc.subject.keywordAuthor	diffusion models	-
dc.subject.keywordAuthor	generated visual learning	-
dc.subject.keywordAuthor	vision-language models	-
dc.identifier.url	https://openreview.net/forum?id=u6xeKVHS6K	-

Appears in Collections: COLLEGE OF COMPUTING > DEPARTMENT OF ARTIFICIAL INTELLIGENCE > 1. Journal Articles

Related Researcher

Yun, Sukmin: ERICA College of Computing (DEPARTMENT OF ARTIFICIAL INTELLIGENCE)
