Temp4Cap: Temporally-aligned Automated Audio Captioning

Choi, Ho-Young; Cho, Jae-Heung; Byun, Pil Moo; Choi, Won-Gook; Chang, Joon-Hyuk

doi:10.21437/Interspeech.2025-808

Full metadata record

DC Field	Value	Language
dc.contributor.author	Choi, Ho-Young	-
dc.contributor.author	Cho, Jae-Heung	-
dc.contributor.author	Byun, Pil Moo	-
dc.contributor.author	Choi, Won-Gook	-
dc.contributor.author	Chang, Joon-Hyuk	-
dc.date.accessioned	2025-11-20T01:00:18Z	-
dc.date.available	2025-11-20T01:00:18Z	-
dc.date.issued	2025-08	-
dc.identifier.issn	2958-1796	-
dc.identifier.uri	https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/209222	-
dc.description.abstract	Automated audio captioning (AAC) is a crucial task in machine perception within the audio domain. AAC struggles to interpret and incorporate temporal relationships of sound events in captions. However, existing studies often fail to capture the temporal relationship, leading to incorrect captions. Some recent studies leverage sound event detection models to extract temporal relationships but remain limited by their dependence on independent pre-trained models. In this study, we propose Temp4Cap, a novel AAC framework that directly trains temporal alignment via contrastive learning, using the “temporal caption” generated by a large language model. To capture temporal relationships, we apply a temporal negative sampling strategy, which includes event- and order-level shuffle and random substitution when generating negative samples during contrastive learning. Experimental results on Clotho and AudioCaps show that Temp4Cap significantly improves both captioning and temporal metrics.	-
dc.format.extent	5	-
dc.language	English	-
dc.language.iso	ENG	-
dc.publisher	International Speech Communication Association	-
dc.title	Temp4Cap: Temporally-aligned Automated Audio Captioning	-
dc.type	Article	-
dc.identifier.doi	10.21437/Interspeech.2025-808	-
dc.identifier.scopusid	2-s2.0-105020091593	-
dc.identifier.bibliographicCitation	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp 3135 - 3139	-
dc.citation.title	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH	-
dc.citation.startPage	3135	-
dc.citation.endPage	3139	-
dc.type.docType	Conference paper	-
dc.description.isOpenAccess	N	-
dc.description.journalRegisteredClass	scopus	-
dc.subject.keywordPlus	Artificial intelligence	-
dc.subject.keywordPlus	Audio acoustics	-
dc.subject.keywordPlus	Speech communication	-
dc.subject.keywordAuthor	automated audio captioning	-
dc.subject.keywordAuthor	contrastive learning	-
dc.subject.keywordAuthor	large language model	-
dc.subject.keywordAuthor	Temporal alignment	-
dc.subject.keywordAuthor	temporal negative sampling	-
dc.identifier.url	https://www.isca-archive.org/interspeech_2025/choi25_interspeech.html	-
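
The abstract describes building negative samples for contrastive learning by shuffling and substituting the sound events in a temporal caption. The sketch below illustrates one possible reading of that idea and is not the authors' implementation. The abstract does not define how event-level and order-level shuffling differ, so only a plain order shuffle and a random substitution are shown. The function names (`order_shuffle`, `random_substitution`, `temporal_negatives`), the list-of-events caption format, and the ", then " connective are all assumptions made for illustration.

```python
import random


def order_shuffle(events, rng):
    """Return the same events in a different temporal order.

    Hypothetical helper: needs at least two distinct events, otherwise
    no reordering can change the caption.
    """
    if len(set(events)) < 2:
        raise ValueError("need at least two distinct events to change the order")
    shuffled = list(events)
    while shuffled == list(events):
        rng.shuffle(shuffled)
    return shuffled


def random_substitution(events, event_pool, rng):
    """Replace one randomly chosen event with a pool event absent from the caption."""
    candidates = [e for e in event_pool if e not in events]
    if not candidates:
        raise ValueError("event_pool has no event outside the caption")
    replaced = list(events)
    replaced[rng.randrange(len(replaced))] = rng.choice(candidates)
    return replaced


def to_caption(events):
    """Join events with an assumed temporal connective."""
    return ", then ".join(events)


def temporal_negatives(events, event_pool, seed=0):
    """Build one order-shuffled and one substituted negative caption."""
    rng = random.Random(seed)
    return [
        to_caption(order_shuffle(events, rng)),
        to_caption(random_substitution(events, event_pool, rng)),
    ]


if __name__ == "__main__":
    events = ["a dog barks", "a door slams", "a car passes"]
    pool = events + ["rain falls", "a bell rings"]
    print("positive:", to_caption(events))
    for negative in temporal_negatives(events, pool, seed=1):
        print("negative:", negative)
```

In a contrastive setup, the original caption would serve as the positive for an audio clip and the generated captions as hard negatives: the shuffled caption has the right events in the wrong order, and the substituted caption has the right order with one wrong event.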

Appears in Collections: Seoul College of Engineering > Seoul School of Electronic Engineering > 1. Journal Articles

Related Researcher

Chang, Joon-Hyuk: COLLEGE OF ENGINEERING (SCHOOL OF ELECTRONIC ENGINEERING)
