
AI is Changing the Face and Voice of Customer Service as We Know It

30 December 2016
Guest post by Sergey Enin.

There are a number of subjective questions on Quora that challenge me. For example, “Who has the most beautiful female voice ever recorded?” Personally, I love Billie Holiday’s timbre.

Thanks to deep learning, before long I will probably hear a voice similar to Billie Holiday’s (or someone else I admire) when calling my bank, ordering take-out, or contacting customer service representatives. Google’s DeepMind team recently released a paper that encourages us to take this possibility seriously. The paper states that DeepMind has created a neural network called WaveNet, a generative model for raw audio:

  • it can generate raw speech signals with a subjective naturalness that has never been reported in the field of text-to-speech (TTS), as assessed by human raters;
  • a single model can be used to generate different voices, conditioned on a speaker identity;
  • the same architecture shows strong results when tested on a small speech recognition dataset, and seems promising for generating other audio modalities such as music.

For those of you who prefer to go straight to the examples, there is a separate post available on the DeepMind blog.

How Does It Work from a Technical Perspective?

The paper builds on earlier work on neural autoregressive generative models, which model complex distributions such as text, images, and eventually raw audio — the latter being especially challenging, since sound has a far more complex structure. The joint probability of a waveform x = {x_1, …, x_T} is factorized as a product of conditional probabilities over the individual audio samples:

p(x) = ∏_{t=1}^{T} p(x_t | x_1, …, x_{t-1})

Each audio sample x_t is therefore conditioned on the samples at all previous timesteps.
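To make the factorization concrete, here is a minimal Python sketch using an invented first-order toy model (the transition table is illustrative, not anything from WaveNet): the joint probability of a short sequence is simply the product of per-sample conditionals.

```python
# Toy autoregressive model over a binary "waveform": each sample's
# distribution depends only on the previous sample. This is a drastically
# simplified stand-in for p(x) = prod_t p(x_t | x_1, ..., x_{t-1}).
# The probabilities below are made up for illustration.
P_NEXT = {
    None: {0: 0.5, 1: 0.5},   # distribution of the very first sample
    0:    {0: 0.8, 1: 0.2},   # p(x_t | x_{t-1} = 0)
    1:    {0: 0.3, 1: 0.7},   # p(x_t | x_{t-1} = 1)
}

def joint_probability(samples):
    """Compute p(x) as a running product of per-sample conditionals."""
    prob = 1.0
    prev = None
    for x in samples:
        prob *= P_NEXT[prev][x]
        prev = x
    return prob

# p(0, 0, 1) = 0.5 * 0.8 * 0.2 = 0.08
print(joint_probability([0, 0, 1]))
```

WaveNet's conditionals depend on a long window of past samples rather than just the previous one, but the chain-rule structure is the same.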

From an architectural point of view, the network is a stack of dilated causal convolutional layers:

(Figure: neural network architecture — a stack of causal convolutional layers)
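The key idea behind those layers can be sketched in a few lines of plain Python. The filter weights below are illustrative, not trained: a causal convolution only looks backwards in time, and stacking layers with doubling dilations (1, 2, 4, …) grows the receptive field exponentially with depth.

```python
# Minimal sketch of a stack of dilated causal convolutions, the building
# block that lets each WaveNet output depend only on past samples.

def causal_conv(signal, weights, dilation):
    """1-D causal convolution: output[t] mixes signal[t], signal[t - dilation], ..."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:          # implicit left zero-padding keeps it causal
                acc += w * signal[idx]
        out.append(acc)
    return out

# Feed a unit impulse through three layers with doubling dilations and
# watch its influence spread forward over 1 + 1 + 2 + 4 = 8 timesteps.
layer = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
for dilation in (1, 2, 4):
    layer = causal_conv(layer, [0.5, 0.5], dilation)
print(layer)  # the impulse now touches all 8 positions
```

Note that the impulse never influences positions *before* it — that is the causality constraint the factorization above requires.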

During training, the network is fed real recordings of human speech. After training, it produces realistic-sounding audio fragments one sample at a time. Of course, to perform text-to-speech the network must also be conditioned on the text to be spoken.
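The sample-by-sample generation loop can be sketched as follows. The "model" here is a hypothetical stand-in rule (favoring repetition of the last value), not a real trained network; what matters is the loop structure: each new sample is drawn from a distribution conditioned on everything generated so far.

```python
import random

def next_sample_distribution(history):
    """Hypothetical conditional p(x_t | x_<t): tends to repeat the last value."""
    if not history:
        return {0: 0.5, 1: 0.5}
    last = history[-1]
    return {last: 0.9, 1 - last: 0.1}

def generate(n_samples, seed=0):
    """Autoregressive sampling: draw one sample at a time, feed it back in."""
    rng = random.Random(seed)
    audio = []
    for _ in range(n_samples):
        dist = next_sample_distribution(audio)
        values, probs = zip(*dist.items())
        audio.append(rng.choices(values, weights=probs)[0])
    return audio

print(generate(16))
```

This feed-back-one-sample-at-a-time loop is also why WaveNet-style generation is computationally expensive: raw audio needs thousands of samples per second, and each one requires a full forward pass.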

What Does It Mean?

It is a big deal for modern voice communication.

Most likely, the technology will make speaker identification considerably harder. Speaker verification systems will need methods beyond voice biometrics alone.

On the other hand, there is good news for customer service. It is a great opportunity to make communication with customer service representatives a more pleasant experience for customers. Imagine the voice of a client’s favorite singer asking: “Would you like to try our new product as well?” The sweetest dream of any sales or customer care department, isn’t it?

About InData Labs

InData Labs is a data science company. Our core services include AI consulting, big data analytics services, and artificial intelligence development. Our services allow companies to innovate, experiment with new tools, explore new ways of leveraging data, and continuously optimize existing big data solutions.

Have a project in mind? We’ll make it happen!

Contact us today >
