CatchPhrase: EXPrompt-Guided Encoder Adaptation for Audio-to-Image Generation
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Oh, Hyunwoo | - |
| dc.contributor.author | Cha, Seung-ju | - |
| dc.contributor.author | Lee, Kwanyoung | - |
| dc.contributor.author | Kim, Si-woo | - |
| dc.contributor.author | Kim, Dongjin | - |
| dc.date.accessioned | 2025-12-19T01:00:21Z | - |
| dc.date.available | 2025-12-19T01:00:21Z | - |
| dc.date.issued | 2025-10 | - |
| dc.identifier.uri | https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/209918 | - |
| dc.description.abstract | We propose CatchPhrase, a novel audio-to-image generation framework designed to mitigate semantic misalignment between audio inputs and generated images. While recent advances in multi-modal encoders have enabled progress in cross-modal generation, ambiguity stemming from homographs and auditory illusions continues to hinder accurate alignment. To address this issue, CatchPhrase generates enriched cross-modal semantic prompts (EXPrompt Mining) from weak class labels by leveraging large language models (LLMs) and audio captioning models (ACMs). To address both class-level and instance-level misalignment, we apply multi-modal filtering and retrieval to select the most semantically aligned prompt for each audio sample (EXPrompt Selector). A lightweight mapping network is then trained to adapt pre-trained text-to-image generation models to audio input. Extensive experiments on multiple audio classification datasets demonstrate that CatchPhrase improves audio-to-image alignment and consistently enhances generation quality by mitigating semantic misalignment. | - |
| dc.format.extent | 10 | - |
| dc.language | English | - |
| dc.language.iso | ENG | - |
| dc.publisher | Association for Computing Machinery, Inc | - |
| dc.title | CatchPhrase: EXPrompt-Guided Encoder Adaptation for Audio-to-Image Generation | - |
| dc.type | Article | - |
| dc.identifier.doi | 10.1145/3746027.3755130 | - |
| dc.identifier.scopusid | 2-s2.0-105024078899 | - |
| dc.identifier.bibliographicCitation | MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025, pp 9773 - 9782 | - |
| dc.citation.title | MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025 | - |
| dc.citation.startPage | 9773 | - |
| dc.citation.endPage | 9782 | - |
| dc.type.docType | Conference paper | - |
| dc.description.isOpenAccess | Y | - |
| dc.description.journalRegisteredClass | scopus | - |
| dc.subject.keywordPlus | Alignment | - |
| dc.subject.keywordPlus | Classification (of information) | - |
| dc.subject.keywordPlus | Image coding | - |
| dc.subject.keywordPlus | Signal encoding | - |
| dc.subject.keywordAuthor | audio to image generation | - |
| dc.subject.keywordAuthor | diffusion model | - |
| dc.subject.keywordAuthor | language-guided generation | - |
| dc.subject.keywordAuthor | multi-modal representation | - |
| dc.identifier.url | https://dl.acm.org/doi/10.1145/3746027.3755130 | - |
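
The abstract describes selecting, for each audio sample, the most semantically aligned prompt through filtering and retrieval. The record gives no implementation details, so the following is only a toy sketch of that general idea, assuming audio and prompt embeddings already live in a shared space. The function names `cosine` and `select_prompt`, the `min_sim` threshold, and the hand-made vectors are all hypothetical and are not taken from the paper.

```python
import math
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_prompt(
    audio_emb: Sequence[float],
    prompts: Sequence[str],
    prompt_embs: Sequence[Sequence[float]],
    min_sim: float = 0.0,
) -> Optional[str]:
    """Hypothetical selector: drop prompts below min_sim (filtering),
    then return the most similar remaining prompt (retrieval)."""
    best_prompt: Optional[str] = None
    best_score = min_sim
    for prompt, emb in zip(prompts, prompt_embs):
        score = cosine(audio_emb, emb)
        if score >= best_score:
            best_prompt, best_score = prompt, score
    return best_prompt


# Toy homograph case: the weak label "bat" could mean the animal or the club.
# The embeddings below are made up purely for illustration.
audio_emb = [1.0, 0.0]
prompts = ["a baseball bat hitting a ball", "a bat squeaking in a cave"]
prompt_embs = [[0.1, 1.0], [0.9, 0.1]]
chosen = select_prompt(audio_emb, prompts, prompt_embs)
```

In a real system the embeddings would come from pre-trained multi-modal encoders rather than hand-written vectors, and the threshold would need tuning. Returning `None` when no prompt clears the threshold lets a caller fall back to the plain class label.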
