Sound of Vision: Audio Generation from Visual Text Embedding through Training Domain Discriminator
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Kim, Jaewon | - |
| dc.contributor.author | Choi, Won-Gook | - |
| dc.contributor.author | Ahn, Seyun | - |
| dc.contributor.author | Chang, Joon-Hyuk | - |
| dc.date.accessioned | 2025-02-12T07:00:42Z | - |
| dc.date.available | 2025-02-12T07:00:42Z | - |
| dc.date.issued | 2024-09 | - |
| dc.identifier.issn | 1990-9772 | - |
| dc.identifier.uri | https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/206467 | - |
| dc.description.abstract | Recent advancements in text-to-audio (TTA) models have demonstrated their ability to generate sound that aligns with user intentions. Despite this advancement, a notable limitation arises from the models' inability to effectively synthesize audio from visual-domain texts. In this study, we address this challenge by utilizing a novel dataset that pairs visual and acoustic-domain texts, derived using ChatGPT-3.5, and an encoding switch through a domain discriminator. This approach not only ensures computational efficiency but also enhances the model's generalization, adaptability, and flexibility. It addresses concerns that training exclusively with visual texts might compromise audio generation quality from audio texts. This study presents a novel methodology for enhancing text-to-audio synthesis, demonstrating significant improvements in audio output fidelity from visual-text inputs. | - |
| dc.format.extent | 5 | - |
| dc.language | English | - |
| dc.language.iso | ENG | - |
| dc.title | Sound of Vision: Audio Generation from Visual Text Embedding through Training Domain Discriminator | - |
| dc.type | Article | - |
| dc.identifier.doi | 10.21437/Interspeech.2024-1451 | - |
| dc.identifier.scopusid | 2-s2.0-85214833933 | - |
| dc.identifier.wosid | 001331850103084 | - |
| dc.identifier.bibliographicCitation | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp 3305 - 3309 | - |
| dc.citation.title | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH | - |
| dc.citation.startPage | 3305 | - |
| dc.citation.endPage | 3309 | - |
| dc.type.docType | Proceedings Paper | - |
| dc.description.isOpenAccess | N | - |
| dc.description.journalRegisteredClass | scopus | - |
| dc.relation.journalResearchArea | Computer Science | - |
| dc.relation.journalWebOfScienceCategory | Computer Science, Artificial Intelligence | - |
| dc.subject.keywordPlus | Audio signal processing | - |
| dc.subject.keywordPlus | Embeddings | - |
| dc.subject.keywordPlus | Signal encoding | - |
| dc.subject.keywordAuthor | audio generation | - |
| dc.subject.keywordAuthor | multi-modal | - |
| dc.subject.keywordAuthor | text embedding | - |
