Audio style conversion using deep learning

Aakash Ezhilan; R. Dheekksha; S. Shridevi

doi:10.6703/IJASE.202109_18(5).004


Centre for Advanced Data Science, Vellore Institute of Technology, Chennai, India


ABSTRACT

Style transfer is one of the most popular applications of neural networks. It has been thoroughly researched for images, for example by extracting the style of famous paintings and applying it to other images to create synthetic paintings, and generative adversarial networks (GANs) are commonly used to achieve this. This paper explores ways in which comparable results can be achieved for audio tasks, a domain with a plethora of new applications. Different techniques for transferring the style of audio are analyzed, and the specific task of changing the gender of a voice is implemented. The Crowdsourced high-quality UK and Ireland English Dialect speech data set was used. The input is a male or female waveform, and the network synthesizes the opposite gender's waveform while the spoken content remains the same. The architectures explored range from naive techniques that train convolutional neural networks (CNNs) directly on raw audio waveforms, to approaches that adapt algorithms developed for image style conversion and that generate spectrograms with GANs for training CNNs. The research has broader scope in applications such as converting music from one genre to another, identifying synthetic voices, and curating voices for AI systems based on user preference.

Keywords: Style transfer, Audio analysis, Neural networks, Dialect transfer.
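As a rough illustration of the spectrogram-based route described in the abstract, the sketch below converts a waveform into a magnitude spectrogram and compares two clips with a Gram-matrix "style" statistic, the loss used in neural image style transfer (Gatys et al.). This is not the authors' implementation. The frame length, hop size, sample rate, the helper names, the synthetic sine tones standing in for lower- and higher-pitched voices, and computing the Gram matrix directly on spectrogram bins (rather than on CNN feature maps) are all assumptions made to keep the example small and dependency-free.

```python
# Illustrative sketch only: helper names and parameter values are hypothetical,
# not taken from the paper.
import cmath
import math
from typing import List

Matrix = List[List[float]]


def hann(n: int) -> List[float]:
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def magnitude_spectrogram(signal: List[float], frame: int = 64, hop: int = 32) -> Matrix:
    """Return |STFT| as a list of frames, each with frame // 2 + 1 frequency bins.

    A naive O(frame^2) DFT per frame keeps this dependency-free;
    a real system would use an FFT library.
    """
    window = hann(frame)
    frames = []
    for start in range(0, len(signal) - frame + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame)]
        bins = []
        for k in range(frame // 2 + 1):
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame) for n in range(frame))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


def gram(spec: Matrix) -> Matrix:
    """Gram matrix over frequency bins: G[i][j] = mean over time of S[t][i] * S[t][j].

    Averaging over time discards when things happen and keeps which
    frequencies co-occur, which is why it serves as a "style" statistic.
    """
    t, f = len(spec), len(spec[0])
    return [[sum(spec[n][i] * spec[n][j] for n in range(t)) / t for j in range(f)] for i in range(f)]


def style_loss(a: Matrix, b: Matrix) -> float:
    """Mean squared difference between the Gram matrices of two spectrograms."""
    ga, gb = gram(a), gram(b)
    f = len(ga)
    return sum((ga[i][j] - gb[i][j]) ** 2 for i in range(f) for j in range(f)) / (f * f)


def tone(freq_hz: float, sr: int = 8000, seconds: float = 0.1) -> List[float]:
    """Synthetic sine tone used as a hypothetical stand-in for a real recording."""
    return [math.sin(2 * math.pi * freq_hz * n / sr) for n in range(int(sr * seconds))]


if __name__ == "__main__":
    low = magnitude_spectrogram(tone(125.0))                 # lower-pitched stand-in
    low_long = magnitude_spectrogram(tone(125.0, seconds=0.2))
    high = magnitude_spectrogram(tone(250.0))                # higher-pitched stand-in
    print(style_loss(low, low_long), style_loss(low, high))
```

Two clips of the same tone give a near-zero style loss even though their durations differ, while tones an octave apart do not. In a full conversion system, a loss of this kind (or a GAN discriminator) would guide a network that generates the target spectrogram, which is then inverted back to a waveform.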



ARTICLE INFORMATION

Received: 2021-01-29

Accepted: 2021-04-07

Available Online: 2021-09-01

Cite this article:

Ezhilan, A., Dheekksha, R., Shridevi, S. 2021. Audio style conversion using deep learning, International Journal of Applied Science and Engineering. 18, 2021034. https://doi.org/10.6703/IJASE.202109_18(5).004

Copyright The Author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are cited.
