Enhanced synthesis of passively heard speech from electrocorticography signals using image-to-image spectrogram translation

Lee, Hongsang; Hwang, Jihun; Kim, Kyungjun; Lee, Gyuwon; Chung, Chun-kee; Im, Chang-hwan

doi:10.1016/j.mlwa.2025.100805


Title: Enhanced synthesis of passively heard speech from electrocorticography signals using image-to-image spectrogram translation (open access)

Authors: Lee, Hongsang; Hwang, Jihun; Kim, Kyungjun; Lee, Gyuwon; Chung, Chun-kee; Im, Chang-hwan

Issue Date: Mar-2026

Publisher: ELSEVIER

Keywords: Speech synthesis; Brain-computer interface (BCI); Electrocorticography (ECoG); Deep learning; Image-to-image translation

Citation: MACHINE LEARNING WITH APPLICATIONS, v.23, pp 1 - 12

Pages: 12

Indexed: SCOPUS; ESCI

Journal Title: MACHINE LEARNING WITH APPLICATIONS

Volume: 23

Start Page: 1

End Page: 12

URI: https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/210824

DOI: 10.1016/j.mlwa.2025.100805

ISSN: 2666-8270

Abstract: Speech synthesis from neural signals offers a promising avenue for restoring communication in individuals with speech impairments. Recent deep learning advances have improved decoding of neural activity into intelligible speech, yet the quality of the synthesized speech still needs improvement. Here, we investigate whether an image-to-image translation approach can further refine Mel spectrograms synthesized from electrocorticography (ECoG) signals recorded while participants passively listened to spoken sentences. ECoG data were collected from volunteers performing an auditory speech perception task. A three-layer bidirectional long short-term memory (Bi-LSTM) network was first trained to predict Mel-spectrogram features from neural signals. Comparison with the Conformer model indicated that Bi-LSTM was more effective as the initial synthesis model under our limited data conditions. To further enhance the quality of the Bi-LSTM-synthesized Mel spectrograms, we applied Pix2pixHD, a high-resolution conditional GAN, as a post-processing module. The impact of Pix2pixHD was evaluated using Log-Spectral Distance (LSD), Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), and Short-Time Objective Intelligibility (STOI), comparing outputs against the original ground truth. Furthermore, subjective listening tests (2AFC similarity judgment) were conducted to assess perceptual improvements. Across objective metrics, Pix2pixHD post-processing yielded consistent improvements in spectral fidelity, waveform similarity, and estimated intelligibility (lower LSD, higher SI-SDR and STOI), and subjective tests confirmed significantly enhanced perceived similarity to the original speech. These gains were supported by non-parametric significance testing (Wilcoxon signed-rank test, p < 0.005).
The results indicate that high-resolution image-to-image translation is an effective means of refining neural signal-based speech synthesis, complementing sequence models and improving the overall perceived quality of the synthesized speech.
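
For readers unfamiliar with two of the objective metrics named in the abstract, the following is a minimal sketch of how LSD and SI-SDR are commonly computed. It is not the authors' implementation: the function names, the `eps` floor, the dB-scaled form of LSD, and the mean removal in SI-SDR are assumptions chosen for illustration, and the record does not state which variant the paper used. STOI is omitted because it requires a longer filterbank-based procedure.

```python
import math


def log_spectral_distance(ref_power, est_power, eps=1e-10):
    """Mean per-frame RMS difference between log power spectra, in dB.

    ref_power, est_power: sequences of frames, each frame a sequence of
    non-negative power values (same shape for both inputs).
    This is one common LSD definition, assumed here for illustration.
    """
    frame_distances = []
    for ref_frame, est_frame in zip(ref_power, est_power):
        squared = [
            (10.0 * math.log10(max(r, eps)) - 10.0 * math.log10(max(e, eps))) ** 2
            for r, e in zip(ref_frame, est_frame)
        ]
        frame_distances.append(math.sqrt(sum(squared) / len(squared)))
    return sum(frame_distances) / len(frame_distances)


def si_sdr(reference, estimate, eps=1e-10):
    """Scale-invariant signal-to-distortion ratio in dB for two waveforms."""
    ref_mean = sum(reference) / len(reference)
    est_mean = sum(estimate) / len(estimate)
    ref = [x - ref_mean for x in reference]
    est = [x - est_mean for x in estimate]

    # Project the estimate onto the reference to get the scaled target.
    scale = sum(r * e for r, e in zip(ref, est)) / (sum(r * r for r in ref) + eps)
    target = [scale * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))
```

Under these definitions a lower LSD and a higher SI-SDR both indicate a closer match to the reference, which is the direction of improvement the abstract reports; rescaling the estimate leaves SI-SDR essentially unchanged.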

Appears in Collections: Seoul College of Engineering > ETC > 1. Journal Articles

Related Researcher

Im, Chang Hwan: COLLEGE OF ENGINEERING (Seoul, Biomedical Engineering major)
