Enhanced synthesis of passively heard speech from electrocorticography signals using image-to-image spectrogram translation
Open access

Authors
Lee, Hongsang; Hwang, Jihun; Kim, Kyungjun; Lee, Gyuwon; Chung, Chun-kee; Im, Chang-hwan
Issue Date
Mar-2026
Publisher
ELSEVIER
Keywords
Speech synthesis; Brain-computer interface (BCI); Electrocorticography (ECoG); Deep learning; Image-to-image translation
Citation
MACHINE LEARNING WITH APPLICATIONS, v.23, pp 1 - 12
Pages
12
Indexed
SCOPUS
ESCI
Journal Title
MACHINE LEARNING WITH APPLICATIONS
Volume
23
Start Page
1
End Page
12
URI
https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/210824
DOI
10.1016/j.mlwa.2025.100805
ISSN
2666-8270
Abstract
Speech synthesis from neural signals offers a promising avenue for restoring communication in individuals with speech impairments. Recent deep learning advances have improved decoding of neural activity into intelligible speech, yet the quality of the synthesized speech still needs improvement. Here, we investigate whether an image-to-image translation approach can further refine Mel spectrograms synthesized from electrocorticography (ECoG) signals recorded while participants passively listened to spoken sentences. ECoG data were collected from volunteers performing an auditory speech perception task. A three-layer bidirectional long short-term memory (Bi-LSTM) network was first trained to predict Mel-spectrogram features from neural signals. Comparison with the Conformer model indicated that Bi-LSTM was more effective as the initial synthesis model under our limited data conditions. To further enhance the quality of the Bi-LSTM-synthesized Mel spectrograms, we applied Pix2pixHD, a high-resolution conditional GAN, as a post-processing module. The impact of Pix2pixHD was evaluated using Log-Spectral Distance (LSD), Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), and Short-Time Objective Intelligibility (STOI), comparing outputs against the ground truth. Furthermore, subjective listening tests (2AFC similarity judgment) were conducted to assess perceptual improvements. Across objective metrics, Pix2pixHD post-processing yielded consistent improvements in spectral fidelity, waveform similarity, and estimated intelligibility (lower LSD, higher SI-SDR and STOI), and subjective tests confirmed significantly enhanced perceived similarity to the original speech. These gains were supported by non-parametric significance testing (Wilcoxon signed-rank test, p < 0.005). The results indicate that high-resolution image-to-image translation is an effective means of refining neural signal-based speech synthesis, complementing sequence models and improving the overall perceived quality of the synthesized speech.
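To make two of the objective metrics named in the abstract concrete, the following is a minimal NumPy sketch of Log-Spectral Distance and SI-SDR based on their commonly used definitions. The function names, the epsilon floor, and the synthetic sine-wave data are assumptions for illustration only and are not the authors' evaluation code; STOI is omitted because it needs a dedicated implementation.

```python
import numpy as np


def log_spectral_distance(ref_power, est_power, eps=1e-10):
    """Mean over frames of the RMS difference, in dB, between two
    power spectrograms of shape (n_frames, n_bins). Lower is better."""
    ref_db = 10.0 * np.log10(np.maximum(ref_power, eps))
    est_db = 10.0 * np.log10(np.maximum(est_power, eps))
    per_frame = np.sqrt(np.mean((ref_db - est_db) ** 2, axis=1))
    return float(np.mean(per_frame))


def si_sdr(reference, estimate, eps=1e-10):
    """Scale-invariant signal-to-distortion ratio in dB for two
    1-D waveforms of equal length. Higher is better."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to get the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    return float(10.0 * np.log10(ratio))


# Synthetic stand-ins for ground-truth and synthesized speech (hypothetical data).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
truth = np.sin(2.0 * np.pi * 220.0 * t)
rough = truth + 0.5 * rng.standard_normal(t.size)     # before refinement
refined = truth + 0.1 * rng.standard_normal(t.size)   # after refinement

print("SI-SDR rough  :", si_sdr(truth, rough))
print("SI-SDR refined:", si_sdr(truth, refined))
```

In this toy setting the less noisy signal scores a higher SI-SDR, which mirrors the direction of improvement the abstract reports (lower LSD, higher SI-SDR), though the numbers themselves carry no meaning for the study.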
Related Researcher
Im, Chang Hwan
College of Engineering (Seoul, Major in Biomedical Engineering)