International Journal of Automation and Smart Technology

Comparative Study on Multi-voice Singing Synthesis Systems

Resna S.1 and Rajeev Rajan2*


1MultiMedia and Communication Vertical, Tata Elxsi, Technopark, Thiruvananthapuram, India
2College of Engineering Trivandrum, APJ Abdul Kalam Technological University, Trivandrum, India



ABSTRACT


In this paper, two multi-voice singing synthesis frameworks are compared. The proposed model consists of two blocks, namely, a text-to-speech (TTS) converter and a speech-to-singing (STS) converter. The TTS converter at the front end generates synthesized speech from the lyrics in a target speaker's voice. The STS module then uses an encoder-decoder model to synthesize a sung version that follows the given target melody. We compare our model with an existing multi-voice singing synthesis model based on a generative adversarial network (GAN) with phoneme synchronization information. The proposed system is systematically evaluated using subjective and objective tests. Two performance metrics, namely the mean opinion score (MOS) and the log spectral distance (LSD), have been analyzed as part of the study. Our study shows that the proposed model generates singing voices that adapt well to the target melody, but their phonetic intelligibility is poor compared to that of the baseline system.
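Since the log spectral distance is the study's main objective metric, the sketch below shows one common way to compute LSD between a reference recording and a synthesized output. It is a minimal illustration rather than the authors' exact implementation: the sampling rate, n_fft, hop_length, and file names are assumptions, and librosa [15], which the paper already cites for audio analysis, is used here for the STFT.

```python
import numpy as np
import librosa


def log_spectral_distance(reference, synthesized, n_fft=1024, hop_length=256):
    """Mean log spectral distance (LSD, in dB) between two time-aligned
    waveforms sampled at the same rate; lower values mean the synthesized
    spectrum is closer to the reference."""
    # Magnitude spectrograms; truncate to the shorter signal's frame count.
    s_ref = np.abs(librosa.stft(reference, n_fft=n_fft, hop_length=hop_length))
    s_syn = np.abs(librosa.stft(synthesized, n_fft=n_fft, hop_length=hop_length))
    n_frames = min(s_ref.shape[1], s_syn.shape[1])

    # Log-power spectra in dB, floored to avoid log(0) on silent bins.
    eps = 1e-10
    log_ref = 10.0 * np.log10(np.maximum(s_ref[:, :n_frames] ** 2, eps))
    log_syn = 10.0 * np.log10(np.maximum(s_syn[:, :n_frames] ** 2, eps))

    # RMS distance over frequency bins per frame, averaged across frames.
    per_frame = np.sqrt(np.mean((log_ref - log_syn) ** 2, axis=0))
    return float(np.mean(per_frame))


# Example usage (hypothetical file names):
# ref, sr = librosa.load("target_song.wav", sr=16000)
# syn, _ = librosa.load("synthesized_song.wav", sr=16000)
# print(f"LSD: {log_spectral_distance(ref, syn):.2f} dB")
```

Both signals are assumed to be time-aligned; in practice the synthesized output would be trimmed or stretched to match the reference before comparison.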


Keywords: multi-voice, encoder-decoder, generative adversarial network, song adaptation




REFERENCES


[1] C. Gupta, H. Li, and M. Goto, "Deep Learning Approaches in Topics of Singing Information Processing," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 2422–2451, 2022. https://doi.org/10.1109/TASLP.2022.3190732

[2] H. Fujihara, M. Goto, J. Ogata, and H. G. Okuno, "LyricSynchronizer: Automatic synchronization system between musical audio signals and lyrics," IEEE Journal of Selected Topics in Signal Processing, vol. 5, pp. 1252–1261, 2011. https://doi.org/10.1109/JSTSP.2011.2159577

[3] J. Kim, H. Choi, J. Park, S. Kim, J. Kim, and M. Hahn, "Korean singing voice synthesis system based on an LSTM recurrent neural network," in Proc. of Interspeech, pp. 1551–1555, 2018. https://doi.org/10.21437/Interspeech.2018-1575

[4] M. Nishimura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, "Singing voice synthesis based on deep neural networks," in Proc. of Interspeech, pp. 2478–2482, 2016. https://doi.org/10.21437/Interspeech.2016-1027

[5] Y. Hono, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, "Singing voice synthesis based on generative adversarial networks," in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6955–6959, 2019. https://doi.org/10.1109/ICASSP.2019.8683154

[6] J. Lee, H.-S. Choi, C.-B. Jeon, J. Koo, and K. Lee, "Adversarially trained end-to-end Korean singing voice synthesis system," in Proc. of Interspeech, 2019 (arXiv:1908.01919). https://doi.org/10.21437/Interspeech.2019-1722

[7] J. Li, H. Yang, W. Zhang, and L. Cai, "A lyrics to singing voice synthesis system with variable timbre," Communications in Computer and Information Science, pp. 186–193, 2011. https://doi.org/10.1007/978-3-642-23220-6_23

[8] M. Freixes, F. Alías, and J. C. Socoró, "A unit selection text-to-speech-and-singing synthesis framework from neutral speech: proof of concept," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2019, pp. 1–14, 2019. https://doi.org/10.1186/s13636-019-0163-y

[9] Y. Ren, X. Tan, T. Qin, J. Luan, Z. Zhao, and T.-Y. Liu, "DeepSinger: Singing voice synthesis with data mined from the web," in Proc. of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1979–1989, 2020. https://doi.org/10.1145/3394486.3403249

[10] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. Lopez-Moreno, and Y. Wu, "Transfer learning from speaker verification to multispeaker text-to-speech synthesis," in Advances in Neural Information Processing Systems 31, pp. 4485–4495, 2018.

[11] J. Parekh, P. Rao, and Y.-H. Yang, "Speech-to-singing conversion in an encoder-decoder framework," in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 261–265, 2020. https://doi.org/10.1109/ICASSP40776.2020.9054473

[12] P. Chandna, M. Blaauw, J. Bonada, and E. Gómez, "WGANSing: A multi-voice singing voice synthesizer based on the Wasserstein GAN," in Proc. of 27th European Signal Processing Conference, pp. 1–5, 2019. https://doi.org/10.23919/EUSIPCO.2019.8903099

[13] S. Resna and R. Rajan, "Multi-Voice Singing Synthesis from Lyrics," Circuits, Systems, and Signal Processing, vol. 42, pp. 307–321, 2023. https://doi.org/10.1007/s00034-022-02122-3

[14] L. Su, "Vocal melody extraction using patch-based CNN," in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 371–375, 2018. https://doi.org/10.1109/ICASSP.2018.8462420

[15] B. McFee, C. Raffel, D. Liang, D. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and music signal analysis in Python," in Proc. of 14th Python in Science Conference, pp. 18–24, 2015. https://doi.org/10.25080/Majora-7b98e3ed-003

[16] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241, Springer International Publishing, 2015. https://doi.org/10.1007/978-3-319-24574-4_28

[17] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984. https://doi.org/10.1109/TASSP.1984.1164317

[18] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," in Proc. of 4th International Conference on Learning Representations (ICLR), 2016.

[19] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.

[20] M. Blaauw and J. Bonada, "A Neural Parametric Singing Synthesizer Modeling Timbre and Expression from Natural Songs," Applied Sciences, vol. 7, no. 12, p. 1313, 2017. https://doi.org/10.3390/app7121313

[21] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. E99.D, pp. 1877–1884, 2016. https://doi.org/10.1587/transinf.2015EDP7457

[22] A. Pandey and D. Wang, "On Cross-Corpus Generalization of Deep Learning Based Speech Enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2489–2499, 2020. https://doi.org/10.1109/TASLP.2020.3016487

[23] A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, "VoxCeleb: Large-scale speaker verification in the wild," Computer Speech & Language, vol. 60, p. 101027, 2020. https://doi.org/10.1016/j.csl.2019.101027

[24] Z. Duan, H. Fang, B. Li, K. C. Sim, and Y. Wang, "The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech," in Proc. of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, pp. 1–9, 2013. https://doi.org/10.1109/APSIPA.2013.6694316


ARTICLE INFORMATION




Accepted: 2023-03-01
Available Online: 2023-03-01


Cite this article:

Resna, S. and Rajan, R. (2023) Comparative Study on Multi-voice Singing Synthesis Systems. Int. j. autom. smart technol. https://doi.org/10.5875/ausmt.v13i1.2417

  Copyright The Author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are cited.