CatchPhrase: EXPrompt-Guided Encoder Adaptation for Audio-to-Image Generation

Oh, Hyunwoo; Cha, Seung-ju; Lee, Kwanyoung; Kim, Si-woo; Kim, Dongjin

doi:10.1145/3746027.3755130

Detailed Information


CatchPhrase: EXPrompt-Guided Encoder Adaptation for Audio-to-Image Generation (open access)

Authors: Oh, Hyunwoo; Cha, Seung-ju; Lee, Kwanyoung; Kim, Si-woo; Kim, Dongjin

Issue Date: Oct-2025

Publisher: Association for Computing Machinery, Inc

Keywords: audio to image generation; diffusion model; language-guided generation; multi-modal representation

Citation: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025, pp. 9773-9782

Pages: 10

Indexed: SCOPUS

Journal Title: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025

Start Page: 9773

End Page: 9782

URI: https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/209918

DOI: 10.1145/3746027.3755130

Abstract: We propose CatchPhrase, a novel audio-to-image generation framework designed to mitigate semantic misalignment between audio inputs and generated images. While recent advances in multi-modal encoders have enabled progress in cross-modal generation, ambiguity stemming from homographs and auditory illusions continues to hinder accurate alignment. To address this issue, CatchPhrase generates enriched cross-modal semantic prompts (EXPrompt Mining) from weak class labels by leveraging large language models (LLMs) and audio captioning models (ACMs). To address both class-level and instance-level misalignment, we apply multi-modal filtering and retrieval to select the most semantically aligned prompt for each audio sample (EXPrompt Selector). A lightweight mapping network is then trained to adapt pre-trained text-to-image generation models to audio input. Extensive experiments on multiple audio classification datasets demonstrate that CatchPhrase improves audio-to-image alignment and consistently enhances generation quality by mitigating semantic misalignment.
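
The selection step described in the abstract (filtering candidate prompts, then retrieving the one best aligned with each audio sample) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine` and `select_prompt`, the `min_sim` threshold, the use of cosine similarity as the alignment score, the example prompts, and the toy two-dimensional embeddings are all assumptions made for illustration. In practice the embeddings would presumably come from a pre-trained multi-modal encoder, which the abstract does not specify.

```python
import math
from typing import Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_prompt(
    audio_emb: Sequence[float],
    prompts: Sequence[str],
    prompt_embs: Sequence[Sequence[float]],
    min_sim: float = 0.0,
) -> Optional[str]:
    """Hypothetical selector: drop candidates scoring below `min_sim`,
    then return the prompt most similar to the audio embedding.
    Returns None when no candidate survives the filter."""
    scored = [(cosine(audio_emb, e), p) for p, e in zip(prompts, prompt_embs)]
    kept = [(s, p) for s, p in scored if s >= min_sim]
    if not kept:
        return None
    return max(kept, key=lambda sp: sp[0])[1]


# Toy data (made up): the weak label "bat" is a homograph, so two candidate
# prompts describe different meanings; the audio embedding decides between them.
audio_emb = [1.0, 0.0]
prompts = ["a baseball bat hitting a ball", "a bat squeaking in a cave"]
prompt_embs = [[0.2, 0.9], [0.9, 0.1]]

chosen = select_prompt(audio_emb, prompts, prompt_embs, min_sim=0.5)
print(chosen)
```

With these toy vectors the second prompt lies much closer to the audio embedding, so it is the one returned; raising `min_sim` high enough filters out every candidate and yields `None`.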


Appears in Collections: Seoul College of Engineering > ETC > 1. Journal Articles


Related Researcher


Kim, Dong Jin: COLLEGE OF ENGINEERING (DEPARTMENT OF INTELLIGENCE COMPUTING)



