I used recurrent neural networks (RNNs) in Text-to-Speech (TTS) systems to convert text into synthetic speech. The model learns patterns in the text and generates speech that resembles that of a real speaker.
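To make the RNN stage concrete, here is a minimal sketch in PyTorch. It is illustrative only: it assumes a character-level vocabulary and 80-bin mel-spectrogram targets, and the names (RNNTTS, vocab_size, n_mels) are placeholders, not the actual model; a real system would add attention, stop-token prediction, and a vocoder.

```python
# Minimal sketch of an RNN-based text-to-speech model (illustrative only).
# Assumes character-level input ids and 80-bin mel-spectrogram targets.
import torch
import torch.nn as nn

class RNNTTS(nn.Module):
    def __init__(self, vocab_size=64, embed_dim=128, hidden_dim=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_mel = nn.Linear(hidden_dim, n_mels)  # one mel frame per input step

    def forward(self, char_ids):
        x = self.embed(char_ids)          # (batch, time, embed_dim)
        h, _ = self.rnn(x)                # (batch, time, hidden_dim)
        return self.to_mel(h)             # (batch, time, n_mels)

# Toy usage: a batch of two "sentences" of 20 character ids each.
model = RNNTTS()
chars = torch.randint(0, 64, (2, 20))
mels = model(chars)                       # (2, 20, 80) predicted mel frames
```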
I then applied generative adversarial networks (GANs) to an MP3 voice dataset for voice cloning, where a generator produces synthetic voices that a discriminator attempts to distinguish from real ones.

The same task was also tested with variational autoencoders (VAEs), which encode and decode voices, allowing the model to learn latent patterns and generate synthetic voices similar to the originals.

The current focus is on Text-to-Mel (T2M) models, which convert text into mel spectrograms that a vocoder then turns into synthetic speech; vocoders trained for a specific speaker can produce voices that closely resemble the real thing. In addition, some models can generate synthetic voices in real time from text or voice samples, using deep learning techniques to achieve high precision and naturalness. Sketches of the GAN, VAE, and T2M setups follow below.
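A minimal sketch of the adversarial setup described above, assuming training on fixed-size mel-spectrogram patches (80 bins by 64 frames). All class names and sizes are illustrative; a real voice-cloning pipeline would condition the generator on speaker embeddings and text.

```python
# Minimal GAN sketch for voice cloning (illustrative, shape-checking only).
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, n_mels=80, frames=64):
        super().__init__()
        self.shape = (n_mels, frames)
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, n_mels * frames), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z).view(-1, *self.shape)  # fake mel patch

class Discriminator(nn.Module):
    def __init__(self, n_mels=80, frames=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_mels * frames, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1),  # real-vs-fake logit
        )

    def forward(self, mel):
        return self.net(mel)

# One adversarial step on random stand-in data.
G, D = Generator(), Discriminator()
loss_fn = nn.BCEWithLogitsLoss()
real = torch.randn(8, 80, 64)             # stand-in for real mel patches
fake = G(torch.randn(8, 100))
d_loss = loss_fn(D(real), torch.ones(8, 1)) + loss_fn(D(fake.detach()), torch.zeros(8, 1))
g_loss = loss_fn(D(fake), torch.ones(8, 1))  # generator tries to fool D
```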
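The VAE experiment can be sketched the same way: an encoder maps a voice feature patch to a latent distribution, and a decoder reconstructs it. This again assumes flattened 80x64 mel patches; the names are placeholders.

```python
# Minimal VAE sketch for encoding and decoding voice features (illustrative).
import torch
import torch.nn as nn

class VoiceVAE(nn.Module):
    def __init__(self, input_dim=80 * 64, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 512), nn.ReLU())
        self.to_mu = nn.Linear(512, latent_dim)
        self.to_logvar = nn.Linear(512, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, input_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.decoder(z), mu, logvar

# Reconstruction + KL divergence loss on a random stand-in batch.
vae = VoiceVAE()
x = torch.randn(8, 80 * 64)
recon, mu, logvar = vae(x)
recon_loss = nn.functional.mse_loss(recon, x)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = recon_loss + kl
```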
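Finally, a sketch of the two-stage T2M pipeline: a text-to-mel model emits mel frames, and a vocoder upsamples each frame to waveform samples. Both stages here are toy stand-ins (one mel frame per character, 256 samples per frame); real systems use attention-based T2M models and neural vocoders such as WaveNet- or HiFi-GAN-style networks.

```python
# Minimal Text-to-Mel plus vocoder pipeline sketch (illustrative only).
import torch
import torch.nn as nn

class TextToMel(nn.Module):
    def __init__(self, vocab_size=64, hidden=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, char_ids):
        h, _ = self.rnn(self.embed(char_ids))
        return self.proj(h)               # (batch, time, n_mels)

class ToyVocoder(nn.Module):
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        # Maps each mel frame to `hop` waveform samples.
        self.net = nn.Sequential(nn.Linear(n_mels, 512), nn.ReLU(), nn.Linear(512, hop))

    def forward(self, mel):
        samples = self.net(mel)           # (batch, time, hop)
        return samples.flatten(1)         # (batch, time * hop) waveform

t2m, vocoder = TextToMel(), ToyVocoder()
chars = torch.randint(0, 64, (1, 30))
wave = vocoder(t2m(chars))               # (1, 7680) synthetic audio samples
```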
RNNs / GANs / VAEs / T2M.
AWS
PyTorch, TensorFlow, Keras.
Process
RLHF.