International Journal of Automation and Smart Technology

Comparative Study on Multi-voice Singing Synthesis Systems

Resna S.1 and Rajeev Rajan2*


1MultiMedia and Communication Vertical, Tata Elxsi, Technopark, Thiruvananthapuram, India
2College of Engineering Trivandrum, APJ Abdul Kalam Technological University, Trivandrum, India



ABSTRACT


In this paper, two multi-voice singing synthesis frameworks are compared. The proposed model consists of two blocks, namely, a text-to-speech (TTS) converter and a speech-to-singing (STS) converter. The TTS converter at the front end generates synthesized speech from the lyrics in a target speaker's voice. The STS module then uses an encoder-decoder model to synthesize a sung version that follows the given target melody. We compare our model with an existing multi-voice singing synthesis model based on a generative adversarial network (GAN) with phoneme synchronization information. The proposed system is systematically evaluated using subjective and objective tests. Two performance metrics, namely the mean opinion score (MOS) and the log spectral distance (LSD), have been analyzed as part of the study. Our study shows that the proposed model generates singing voices that adapt well to the target melody, but their phonetic intelligibility is poor compared to that of the baseline system.
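Since the log spectral distance is the study's main objective metric, the sketch below shows one common way to compute LSD between a reference recording and a synthesized output. It is a minimal illustration rather than the authors' exact implementation: the sampling rate, n_fft, hop_length, and file names are assumptions, and librosa [15], which the paper already cites for audio analysis, is used here for the STFT.

```python
import numpy as np
import librosa


def log_spectral_distance(reference, synthesized, n_fft=1024, hop_length=256):
    """Mean log spectral distance (LSD, in dB) between two time-aligned
    waveforms sampled at the same rate; lower values mean the synthesized
    spectrum is closer to the reference."""
    # Magnitude spectrograms; truncate to the shorter signal's frame count.
    s_ref = np.abs(librosa.stft(reference, n_fft=n_fft, hop_length=hop_length))
    s_syn = np.abs(librosa.stft(synthesized, n_fft=n_fft, hop_length=hop_length))
    n_frames = min(s_ref.shape[1], s_syn.shape[1])

    # Log-power spectra in dB, floored to avoid log(0) on silent bins.
    eps = 1e-10
    log_ref = 10.0 * np.log10(np.maximum(s_ref[:, :n_frames] ** 2, eps))
    log_syn = 10.0 * np.log10(np.maximum(s_syn[:, :n_frames] ** 2, eps))

    # RMS distance over frequency bins per frame, averaged across frames.
    per_frame = np.sqrt(np.mean((log_ref - log_syn) ** 2, axis=0))
    return float(np.mean(per_frame))


# Example usage (hypothetical file names):
# ref, sr = librosa.load("target_song.wav", sr=16000)
# syn, _ = librosa.load("synthesized_song.wav", sr=16000)
# print(f"LSD: {log_spectral_distance(ref, syn):.2f} dB")
```

Both signals are assumed to be time-aligned; in practice the synthesized output would be trimmed or stretched to match the reference before comparison.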


Keywords: multi-voice, encoder-decoder, generative adversarial network, song adaptation




REFERENCES


[1] C. Gupta, H. Li, and M. Goto, "Deep Learning Approaches in Topics of Singing Information Processing," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 2422–2451, 2022. https://doi.org/10.1109/TASLP.2022.3190732

[2] H. Fujihara, M. Goto, J. Ogata, and H. G. Okuno, "LyricSynchronizer: Automatic synchronization system between musical audio signals and lyrics," IEEE Journal of Selected Topics in Signal Processing, vol. 5, pp. 1252–1261, 2011. https://doi.org/10.1109/JSTSP.2011.2159577

[3] J. Kim, H. Choi, J. Park, S. Kim, J. Kim, and M. Hahn, "Korean singing voice synthesis system based on an LSTM recurrent neural network," in Proc. of Interspeech, pp. 1551–1555, 2018. https://doi.org/10.21437/Interspeech.2018-1575

[4] M. Nishimura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, "Singing voice synthesis based on deep neural networks," in Proc. of Interspeech, pp. 2478–2482, 2016. https://doi.org/10.21437/Interspeech.2016-1027

[5] Y. Hono, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, "Singing voice synthesis based on generative adversarial networks," in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6955–6959, 2019. https://doi.org/10.1109/ICASSP.2019.8683154

[6] J. Lee, H.-S. Choi, C.-B. Jeon, J. Koo, and K. Lee, "Adversarially trained end-to-end Korean singing voice synthesis system," in Proc. of Interspeech, 2019 (arXiv:1908.01919). https://doi.org/10.21437/Interspeech.2019-1722

[7] J. Li, H. Yang, W. Zhang, and L. Cai, "A lyrics to singing voice synthesis system with variable timbre," Communications in Computer and Information Science, pp. 186–193, 2011. https://doi.org/10.1007/978-3-642-23220-6_23

[8] M. Freixes, F. Alías, and J. C. Socoró, "A unit selection text-to-speech-and-singing synthesis framework from neutral speech: proof of concept," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2019, pp. 1–14, 2019. https://doi.org/10.1186/s13636-019-0163-y

[9] Y. Ren, X. Tan, T. Qin, J. Luan, Z. Zhao, and T.-Y. Liu, "DeepSinger: Singing voice synthesis with data mined from the web," in Proc. of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1979–1989, 2020. https://doi.org/10.1145/3394486.3403249

[10] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. Lopez-Moreno, and Y. Wu, "Transfer learning from speaker verification to multispeaker text-to-speech synthesis," in Advances in Neural Information Processing Systems 31, pp. 4485–4495, 2018.

[11] J. Parekh, P. Rao, and Y.-H. Yang, "Speech-to-singing conversion in an encoder-decoder framework," in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 261–265, 2020. https://doi.org/10.1109/ICASSP40776.2020.9054473

[12] P. Chandna, M. Blaauw, J. Bonada, and E. Gómez, "WGANSing: A multi-voice singing voice synthesizer based on the Wasserstein GAN," in Proc. of 27th European Signal Processing Conference, pp. 1–5, 2019. https://doi.org/10.23919/EUSIPCO.2019.8903099

[13] S. Resna and R. Rajan, "Multi-Voice Singing Synthesis from Lyrics," Circuits, Systems, and Signal Processing, vol. 42, pp. 307–321, 2023. https://doi.org/10.1007/s00034-022-02122-3

[14] L. Su, "Vocal melody extraction using patch-based CNN," in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 371–375, 2018. https://doi.org/10.1109/ICASSP.2018.8462420

[15] B. McFee, C. Raffel, D. Liang, D. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and music signal analysis in Python," in Proc. of 14th Python in Science Conference, pp. 18–24, 2015. https://doi.org/10.25080/Majora-7b98e3ed-003

[16] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241, Springer International Publishing, 2015. https://doi.org/10.1007/978-3-319-24574-4_28

[17] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984. https://doi.org/10.1109/TASSP.1984.1164317

[18] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," in Proc. of 4th International Conference on Learning Representations (ICLR), 2016.

[19] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.

[20] M. Blaauw and J. Bonada, "A Neural Parametric Singing Synthesizer Modeling Timbre and Expression from Natural Songs," Applied Sciences, vol. 7, no. 12, p. 1313, 2017. https://doi.org/10.3390/app7121313

[21] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. E99.D, pp. 1877–1884, 2016. https://doi.org/10.1587/transinf.2015EDP7457

[22] A. Pandey and D. Wang, "On Cross-Corpus Generalization of Deep Learning Based Speech Enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2489–2499, 2020. https://doi.org/10.1109/TASLP.2020.3016487

[23] A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, "VoxCeleb: Large-scale speaker verification in the wild," Computer Speech & Language, vol. 60, p. 101027, 2020. https://doi.org/10.1016/j.csl.2019.101027

[24] Z. Duan, H. Fang, B. Li, K. C. Sim, and Y. Wang, "The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech," in Proc. of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, pp. 1–9, 2013. https://doi.org/10.1109/APSIPA.2013.6694316


ARTICLE INFORMATION




Accepted: 2023-03-01
Available Online: 2023-03-01


Cite this article:

Resna, S. and Rajan, R. (2023) Comparative Study on Multi-voice Singing Synthesis Systems. Int. j. autom. smart technol. https://doi.org/10.5875/ausmt.v13i1.2417

  Copyright The Author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are cited.