Temp4Cap: Temporally-aligned Automated Audio Captioning

Full metadata record
DC Field: Value
dc.contributor.author: Choi, Ho-Young
dc.contributor.author: Cho, Jae-Heung
dc.contributor.author: Byun, Pil Moo
dc.contributor.author: Choi, Won-Gook
dc.contributor.author: Chang, Joon-Hyuk
dc.date.accessioned: 2025-11-20T01:00:18Z
dc.date.available: 2025-11-20T01:00:18Z
dc.date.issued: 2025-08
dc.identifier.issn: 2958-1796
dc.identifier.uri: https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/209222
dc.description.abstract: Automated audio captioning (AAC) is a crucial task in machine perception within the audio domain. AAC struggles to interpret and incorporate temporal relationships of sound events in captions. However, existing studies often fail to capture the temporal relationship, leading to incorrect captions. Some recent studies leverage sound event detection models to extract temporal relationships but remain limited by their dependence on independent pre-trained models. In this study, we propose Temp4Cap, a novel AAC framework that directly trains temporal alignment via contrastive learning, using the “temporal caption” generated by a large language model. To capture temporal relationships, we apply a temporal negative sampling strategy, which includes event- and order-level shuffle and random substitution when generating negative samples during contrastive learning. Experimental results on Clotho and AudioCaps show that Temp4Cap significantly improves both captioning and temporal metrics.
dc.format.extent: 5
dc.language: English
dc.language.iso: ENG
dc.publisher: International Speech Communication Association
dc.title: Temp4Cap: Temporally-aligned Automated Audio Captioning
dc.type: Article
dc.identifier.doi: 10.21437/Interspeech.2025-808
dc.identifier.scopusid: 2-s2.0-105020091593
dc.identifier.bibliographicCitation: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 3135-3139
dc.citation.title: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
dc.citation.startPage: 3135
dc.citation.endPage: 3139
dc.type.docType: Conference paper
dc.description.isOpenAccess: N
dc.description.journalRegisteredClass: scopus
dc.subject.keywordPlus: Artificial intelligence
dc.subject.keywordPlus: Audio acoustics
dc.subject.keywordPlus: Speech communication
dc.subject.keywordAuthor: automated audio captioning
dc.subject.keywordAuthor: contrastive learning
dc.subject.keywordAuthor: large language model
dc.subject.keywordAuthor: Temporal alignment
dc.subject.keywordAuthor: temporal negative sampling
dc.identifier.url: https://www.isca-archive.org/interspeech_2025/choi25_interspeech.html
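
The abstract in this record names three ways of building negative samples for contrastive learning (event-level shuffle, order-level shuffle, and random substitution) but gives no implementation details. The following is a minimal illustrative sketch, not the authors' method. It assumes a temporal caption can be represented as a list of event phrases joined by temporal connectives, that "event-level shuffle" permutes the event phrases, that "order-level shuffle" swaps one connective for one implying a different order, and that "random substitution" replaces an event with a phrase from an external pool. All function names, the `CONNECTIVE_SWAPS` table, and the example phrases are hypothetical.

```python
import random

# Assumed mapping from a temporal connective to one implying a different order.
CONNECTIVE_SWAPS = {"then": "after", "followed by": "preceded by", "while": "before"}


def render(events, connectives):
    """Join events and connectives as: e0 c0 e1 c1 e2 ..."""
    parts = [events[0]]
    for conn, event in zip(connectives, events[1:]):
        parts += [conn, event]
    return " ".join(parts)


def event_shuffle(events, connectives, rng):
    """Assumed event-level shuffle: permute the event phrases."""
    if len(set(events)) < 2:
        raise ValueError("need at least two distinct events to shuffle")
    shuffled = events[:]
    while shuffled == events:  # retry until the order actually changes
        rng.shuffle(shuffled)
    return render(shuffled, connectives)


def order_shuffle(events, connectives, rng):
    """Assumed order-level shuffle: swap one temporal connective."""
    swappable = [i for i, c in enumerate(connectives) if c in CONNECTIVE_SWAPS]
    if not swappable:
        raise ValueError("no connective with a known swap")
    idx = rng.choice(swappable)
    swapped = connectives[:]
    swapped[idx] = CONNECTIVE_SWAPS[swapped[idx]]
    return render(events, swapped)


def random_substitution(events, connectives, pool, rng):
    """Replace one event with a phrase that does not occur in the caption."""
    candidates = [p for p in pool if p not in events]
    if not candidates:
        raise ValueError("pool has no phrase outside the caption")
    replaced = events[:]
    replaced[rng.randrange(len(events))] = rng.choice(candidates)
    return render(replaced, connectives)


def temporal_negatives(events, connectives, pool, seed=0):
    """Return one negative caption per strategy."""
    rng = random.Random(seed)
    return [
        event_shuffle(events, connectives, rng),
        order_shuffle(events, connectives, rng),
        random_substitution(events, connectives, pool, rng),
    ]


events = ["a dog barks", "a car passes", "a door slams"]
connectives = ["then", "followed by"]
pool = ["rain falls", "a bell rings"]

positive = render(events, connectives)
negatives = temporal_negatives(events, connectives, pool)
```

In a contrastive setup, `positive` would be paired with the audio clip and each string in `negatives` would serve as a hard negative that differs from it only in its temporal content. How the paper encodes and scores these pairs is not stated in this record.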
Appears in Collections
Seoul College of Engineering > Seoul School of Electronic Engineering > 1. Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Chang, Joon-Hyuk
COLLEGE OF ENGINEERING (SCHOOL OF ELECTRONIC ENGINEERING)