Sound of Vision: Audio Generation from Visual Text Embedding through Training Domain Discriminator

Kim, Jaewon; Choi, Won-Gook; Ahn, Seyun; Chang, Joon-Hyuk

doi:10.21437/Interspeech.2024-1451

Cited 0 times in Web of Science

Cited 0 times in Scopus

Full metadata record

DC Field	Value	Language
dc.contributor.author	Kim, Jaewon	-
dc.contributor.author	Choi, Won-Gook	-
dc.contributor.author	Ahn, Seyun	-
dc.contributor.author	Chang, Joon-Hyuk	-
dc.date.accessioned	2025-02-12T07:00:42Z	-
dc.date.available	2025-02-12T07:00:42Z	-
dc.date.issued	2024-09	-
dc.identifier.issn	1990-9772	-
dc.identifier.uri	https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/206467	-
dc.description.abstract	Recent advancements in text-to-audio (TTA) models have demonstrated their ability to generate sound that aligns with user intentions. Despite this advancement, a notable limitation arises from the models' inability to effectively synthesize audio from visual-domain texts. In this study, we address this challenge by utilizing a novel dataset that pairs visual and acoustic-domain texts, derived using ChatGPT-3.5, and encoding switch through a domain discriminator. This approach ensures not only computational efficiency but also enhances the model's generalization, adaptability, and flexibility. It addresses concerns that training exclusively with visual texts might compromise audio generation quality from audio texts. This study presents a novel methodology for enhancing text-to-audio synthesis, demonstrating significant improvements in audio output fidelity from visual-text inputs.	-
dc.format.extent	5	-
dc.language	English	-
dc.language.iso	ENG	-
dc.title	Sound of Vision: Audio Generation from Visual Text Embedding through Training Domain Discriminator	-
dc.type	Article	-
dc.identifier.doi	10.21437/Interspeech.2024-1451	-
dc.identifier.scopusid	2-s2.0-85214833933	-
dc.identifier.wosid	001331850103084	-
dc.identifier.bibliographicCitation	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp 3305 - 3309	-
dc.citation.title	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH	-
dc.citation.startPage	3305	-
dc.citation.endPage	3309	-
dc.type.docType	Proceedings Paper	-
dc.description.isOpenAccess	N	-
dc.description.journalRegisteredClass	scopus	-
dc.relation.journalResearchArea	Computer Science	-
dc.relation.journalWebOfScienceCategory	Computer Science, Artificial Intelligence	-
dc.subject.keywordPlus	Audio signal processing	-
dc.subject.keywordPlus	Embeddings	-
dc.subject.keywordPlus	Signal encoding	-
dc.subject.keywordAuthor	audio generation	-
dc.subject.keywordAuthor	multi-modal	-
dc.subject.keywordAuthor	text embedding	-

Files in This Item: There are no files associated with this item.

Appears in Collections: Seoul College of Engineering > Seoul School of Electronic Engineering > 1. Journal Articles

Related Researcher

Chang, Joon-Hyuk: COLLEGE OF ENGINEERING (SCHOOL OF ELECTRONIC ENGINEERING)

222, Wangsimni-ro, Seongdong-gu, Seoul, 04763, Korea
Tel: +82-2-2220-1366

Certain data included herein are derived from the © Web of Science of Clarivate Analytics. All rights reserved.
You may not copy or re-distribute this material in whole or in part without the prior written consent of Clarivate Analytics.
