Temp4Cap: Temporally-aligned Automated Audio Captioning
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Choi, Ho-Young | - |
| dc.contributor.author | Cho, Jae-Heung | - |
| dc.contributor.author | Byun, Pil Moo | - |
| dc.contributor.author | Choi, Won-Gook | - |
| dc.contributor.author | Chang, Joon-Hyuk | - |
| dc.date.accessioned | 2025-11-20T01:00:18Z | - |
| dc.date.available | 2025-11-20T01:00:18Z | - |
| dc.date.issued | 2025-08 | - |
| dc.identifier.issn | 2958-1796 | - |
| dc.identifier.uri | https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/209222 | - |
| dc.description.abstract | Automated audio captioning (AAC) is a crucial task in machine perception within the audio domain. AAC must interpret the temporal relationships among sound events and incorporate them into captions. However, existing studies often fail to capture these relationships, leading to incorrect captions. Some recent studies leverage sound event detection models to extract temporal relationships but remain limited by their dependence on independent pre-trained models. In this study, we propose Temp4Cap, a novel AAC framework that directly trains temporal alignment via contrastive learning, using the “temporal caption” generated by a large language model. To capture temporal relationships, we apply a temporal negative sampling strategy, which includes event- and order-level shuffle and random substitution when generating negative samples during contrastive learning. Experimental results on Clotho and AudioCaps show that Temp4Cap significantly improves both captioning and temporal metrics. | - |
| dc.format.extent | 5 | - |
| dc.language | English | - |
| dc.language.iso | ENG | - |
| dc.publisher | International Speech Communication Association | - |
| dc.title | Temp4Cap: Temporally-aligned Automated Audio Captioning | - |
| dc.type | Article | - |
| dc.identifier.doi | 10.21437/Interspeech.2025-808 | - |
| dc.identifier.scopusid | 2-s2.0-105020091593 | - |
| dc.identifier.bibliographicCitation | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp 3135 - 3139 | - |
| dc.citation.title | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH | - |
| dc.citation.startPage | 3135 | - |
| dc.citation.endPage | 3139 | - |
| dc.type.docType | Conference paper | - |
| dc.description.isOpenAccess | N | - |
| dc.description.journalRegisteredClass | scopus | - |
| dc.subject.keywordPlus | Artificial intelligence | - |
| dc.subject.keywordPlus | Audio acoustics | - |
| dc.subject.keywordPlus | Speech communication | - |
| dc.subject.keywordAuthor | automated audio captioning | - |
| dc.subject.keywordAuthor | contrastive learning | - |
| dc.subject.keywordAuthor | large language model | - |
| dc.subject.keywordAuthor | Temporal alignment | - |
| dc.subject.keywordAuthor | temporal negative sampling | - |
| dc.identifier.url | https://www.isca-archive.org/interspeech_2025/choi25_interspeech.html | - |
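
The abstract names three ways of building negative samples for contrastive learning: event-level shuffle, order-level shuffle, and random substitution. The record does not define them, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a temporal caption can be split into distinct event phrases joined by temporal connectives; the `ORDER_SWAPS` table, the function names, and the example phrases are all hypothetical.

```python
import random

# Hypothetical connective swaps (an assumption, not taken from the paper).
ORDER_SWAPS = {"before": "after", "after": "before", "then": "while", "while": "then"}


def render(events, connectives):
    """Join events with connectives: e1 c1 e2 c2 e3 ..."""
    parts = [events[0]]
    for conn, event in zip(connectives, events[1:]):
        parts += [conn, event]
    return " ".join(parts)


def event_shuffle(events, connectives, rng):
    """Assumed 'event-level shuffle': swap two events, keep connectives fixed.

    Assumes the event phrases are distinct, so the swap changes the caption.
    """
    i, j = rng.sample(range(len(events)), 2)
    swapped = list(events)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return render(swapped, connectives)


def order_shuffle(events, connectives, rng):
    """Assumed 'order-level shuffle': replace one connective with its swap."""
    candidates = [k for k, c in enumerate(connectives) if c in ORDER_SWAPS]
    if not candidates:
        raise ValueError("no swappable temporal connective in caption")
    k = rng.choice(candidates)
    changed = list(connectives)
    changed[k] = ORDER_SWAPS[changed[k]]
    return render(events, changed)


def random_substitution(events, connectives, pool, rng):
    """Replace one event with a different event drawn from a pool."""
    i = rng.randrange(len(events))
    candidates = [e for e in pool if e != events[i]]
    if not candidates:
        raise ValueError("pool has no alternative event")
    replaced = list(events)
    replaced[i] = rng.choice(candidates)
    return render(replaced, connectives)


def temporal_negatives(events, connectives, pool, seed=0):
    """Return one negative caption per strategy, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [
        event_shuffle(events, connectives, rng),
        order_shuffle(events, connectives, rng),
        random_substitution(events, connectives, pool, rng),
    ]
```

In a contrastive setup, the audio clip would be paired with the original temporal caption as the positive and with these perturbed captions as negatives; how Temp4Cap weights or mixes the strategies is not stated in this record.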
Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.
