CatchPhrase: EXPrompt-Guided Encoder Adaptation for Audio-to-Image Generation (open access)
- Authors
- Oh, Hyunwoo; Cha, Seung-ju; Lee, Kwanyoung; Kim, Si-woo; Kim, Dongjin
- Issue Date
- Oct-2025
- Publisher
- Association for Computing Machinery, Inc
- Keywords
- audio to image generation; diffusion model; language-guided generation; multi-modal representation
- Citation
- MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025, pp 9773 - 9782
- Pages
- 10
- Indexed
- SCOPUS
- Journal Title
- MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
- Start Page
- 9773
- End Page
- 9782
- URI
- https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/209918
- DOI
- 10.1145/3746027.3755130
- Abstract
- We propose CatchPhrase, a novel audio-to-image generation framework designed to mitigate semantic misalignment between audio inputs and generated images. While recent advances in multi-modal encoders have enabled progress in cross-modal generation, ambiguity stemming from homographs and auditory illusions continues to hinder accurate alignment. To address this issue, CatchPhrase generates enriched cross-modal semantic prompts (EXPrompt Mining) from weak class labels by leveraging large language models (LLMs) and audio captioning models (ACMs). To address both class-level and instance-level misalignment, we apply multi-modal filtering and retrieval to select the most semantically aligned prompt for each audio sample (EXPrompt Selector). A lightweight mapping network is then trained to adapt pre-trained text-to-image generation models to audio input. Extensive experiments on multiple audio classification datasets demonstrate that CatchPhrase improves audio-to-image alignment and consistently enhances generation quality by mitigating semantic misalignment.
- Appears in Collections
- College of Engineering (Seoul) > ETC > 1. Journal Articles
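
The abstract describes selecting, for each audio sample, the most semantically aligned prompt through filtering and retrieval. The following is only a minimal sketch of that general idea, not the authors' implementation: it assumes audio and prompt embeddings already live in a shared space, filters candidates by a cosine-similarity threshold, and retrieves the best survivor. The function name `select_prompt`, the `min_sim` parameter, the toy two-dimensional embeddings, and the "bat" homograph prompts are all hypothetical and chosen purely for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def select_prompt(audio_emb, prompt_embs, prompts, min_sim=0.0):
    """Return the candidate prompt most similar to the audio embedding.

    Hypothetical helper: candidates whose cosine similarity falls below
    `min_sim` are filtered out; the best remaining one is retrieved.
    Returns (None, sims) when every candidate is filtered out.
    """
    a = l2_normalize(np.asarray(audio_emb, dtype=float))
    p = l2_normalize(np.asarray(prompt_embs, dtype=float))
    sims = p @ a  # cosine similarity of each prompt to the audio clip
    keep = np.flatnonzero(sims >= min_sim)  # filtering step
    if keep.size == 0:
        return None, sims
    best = keep[np.argmax(sims[keep])]  # retrieval step
    return prompts[best], sims


# Toy data (assumed, not from the paper): two senses of the homograph "bat".
prompts = ["a baseball bat hitting a ball", "a bat flying out of a cave"]
prompt_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
audio_emb = np.array([0.2, 0.9])  # stands in for an encoded audio clip

chosen, sims = select_prompt(audio_emb, prompt_embs, prompts)
print(chosen)
```

In practice the embeddings would come from pre-trained multi-modal encoders, and the filtering criterion would likely be more involved than a single threshold; the abstract does not specify either detail.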
