CatchPhrase: EXPrompt-Guided Encoder Adaptation for Audio-to-Image Generation

Oh, Hyunwoo; Cha, Seung-ju; Lee, Kwanyoung; Kim, Si-woo; Kim, Dongjin

doi:10.1145/3746027.3755130

Full metadata record

DC Field	Value	Language
dc.contributor.author	Oh, Hyunwoo	-
dc.contributor.author	Cha, Seung-ju	-
dc.contributor.author	Lee, Kwanyoung	-
dc.contributor.author	Kim, Si-woo	-
dc.contributor.author	Kim, Dongjin	-
dc.date.accessioned	2025-12-19T01:00:21Z	-
dc.date.available	2025-12-19T01:00:21Z	-
dc.date.issued	2025-10	-
dc.identifier.uri	https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/209918	-
dc.description.abstract	We propose CatchPhrase, a novel audio-to-image generation framework designed to mitigate semantic misalignment between audio inputs and generated images. While recent advances in multi-modal encoders have enabled progress in cross-modal generation, ambiguity stemming from homographs and auditory illusions continues to hinder accurate alignment. To address this issue, CatchPhrase generates enriched cross-modal semantic prompts (EXPrompt Mining) from weak class labels by leveraging large language models (LLMs) and audio captioning models (ACMs). To address both class-level and instance-level misalignment, we apply multi-modal filtering and retrieval to select the most semantically aligned prompt for each audio sample (EXPrompt Selector). A lightweight mapping network is then trained to adapt pre-trained text-to-image generation models to audio input. Extensive experiments on multiple audio classification datasets demonstrate that CatchPhrase improves audio-to-image alignment and consistently enhances generation quality by mitigating semantic misalignment.	-
dc.format.extent	10	-
dc.language	English	-
dc.language.iso	ENG	-
dc.publisher	Association for Computing Machinery, Inc	-
dc.title	CatchPhrase: EXPrompt-Guided Encoder Adaptation for Audio-to-Image Generation	-
dc.type	Article	-
dc.identifier.doi	10.1145/3746027.3755130	-
dc.identifier.scopusid	2-s2.0-105024078899	-
dc.identifier.bibliographicCitation	MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025, pp 9773 - 9782	-
dc.citation.title	MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025	-
dc.citation.startPage	9773	-
dc.citation.endPage	9782	-
dc.type.docType	Conference paper	-
dc.description.isOpenAccess	Y	-
dc.description.journalRegisteredClass	scopus	-
dc.subject.keywordPlus	Alignment	-
dc.subject.keywordPlus	Classification (of information)	-
dc.subject.keywordPlus	Image coding	-
dc.subject.keywordPlus	Signal encoding	-
dc.subject.keywordAuthor	audio to image generation	-
dc.subject.keywordAuthor	diffusion model	-
dc.subject.keywordAuthor	language-guided generation	-
dc.subject.keywordAuthor	multi-modal representation	-
dc.identifier.url	https://dl.acm.org/doi/10.1145/3746027.3755130	-

Appears in Collections: Seoul College of Engineering > ETC > 1. Journal Articles

Related Researcher

Kim, Dong Jin: COLLEGE OF ENGINEERING (DEPARTMENT OF INTELLIGENCE COMPUTING)