Length-Normalized Representation Learning for Speech Signals

Abstract

In this study, we proposed a length-normalized representation learning method for speech and text to address an inherent problem of sequence-to-sequence models: the input and output sequences often have different lengths. To this end, the representations were constrained to a fixed-length shape by including length normalization and de-normalization processes in the pre- and post-network architecture of a transformer-based self-supervised learning framework. This enabled direct modeling of the relationships between sequences of different lengths, without an attention mechanism or recurrent network between the representation domains. Besides regularizing sequence length, the method also acted as a form of data augmentation that effectively handled input features with different time scales. The performance of the proposed length-normalized representations on downstream speaker and phoneme recognition tasks was investigated to verify the effectiveness of the method over conventional representation methods. In addition, to demonstrate the applicability of the proposed representation method to sequence-to-sequence modeling, a unified speech recognition and text-to-speech (TTS) system was developed. The unified system achieved high accuracy on frame-wise phoneme prediction and showed promising potential for generating high-quality synthesized speech in the TTS task.

Keywords

Self-supervised learning; representation learning; speech and text analysis
Title
Length-Normalized Representation Learning for Speech Signals
Authors
Kyungguen Byun; Seyun Um; Hong-Goo Kang
DOI
10.1109/ACCESS.2022.3181298
Publication date
2022-06
Journal
IEEE Access, vol. 10
Pages
60362-60372