Adobe MAX 2016 had a surprise in store: Adobe VoCo. Think of it as the Photoshop of spoken audio.
Many in the vfx community have speculated on what we were seeing but the truth is much more interesting than you might have expected.
If you do not watch the Adobe MAX conference video clip below to the end, it is not immediately obvious how significant this technology is. Being able to edit a recorded audio clip via text is cool but not revolutionary… being able to make an audio clip of someone saying a sentence they never even remotely said is completely game changing. Like many, we assumed that Adobe had added deep learning AI technology to the company’s audio team. But fxguide spoke to Zeyu Jin directly, and while he and the team at Adobe think the future lies in deep neural networks, he explained that “the VoCo you saw at the Max conference is not based on deep learning”.
Near the end of this presentation Zeyu Jin reveals in a brief off-script chat that Adobe uses 20 minutes of training data. From this 20 minute sample, the system can not just edit existing speech but create whole new sentences, with the correct weighting and cadence needed to be believable. It was assumed by many, including here at fxguide, that this was 20 minutes of deep learning training data, but that is not how Adobe manages to drive VoCo, which in one sense makes it even more remarkable.
Project VoCo allows you to edit and create speech in text, opening the door to vastly more realistic conversational assistive agents, better video game characters and whole new levels of interactive computer communication. Already, agents from Google and Apple, such as Siri, produce passable synthetic conversational dialogue, but if Adobe can commercialise this tech preview into a full product, the sky is the limit.
As VoCo allows you to change words in a voiceover simply by typing new words, it was a huge hit of the Adobe MAX 2016 Sneak Peeks. While this technology is not yet part of Creative Cloud, many such technology preview ‘sneaks’ from previous years have later been incorporated into Adobe products.
Which raises the question: how are they doing this, if they are not using deep learning?
“It is not related to decision trees or traditional machine learning but mostly to mathematical optimization and phonetic analysis” Jin explained.
A couple of years ago Adobe showed a text tool that would allow you to search a clip by text. Initially many assumed that this was more than it was; in reality the software took your written script and matched it to the audio for searching. At the time it was both exciting to see Adobe doing this work and a tad disappointing that they were not showing auto-transcribing software, as early reports had indicated. Jump to 2016 and Adobe is demoing that it can do to audio what it has done to editing still images. But this new VoCo technology is not related at all to the earlier Adobe text search-edit demo; it is brand new tech. “It is not built on any existing Adobe technology. It is originated from Princeton University where I am doing my Ph.D.” he explains. “Unfortunately, it also does not imply improvement on auto transcription. We are actually relying on existing transcription approach to perform phoneme segmentation”.
“The core of this method is voice conversion. Turning one voice into another”. The system uses micro clips but with no pitch correction, “in fact, pitch correction is what makes other approaches inferior to our approach” he comments.
Most state-of-the-art voice conversion methods re-synthesize voice from spectral representations, and this introduces muffled artefacts. Jin’s PhD research uses a system he calls CUTE: A Concatenative Method For Voice Conversion Using Exemplar-based Unit Selection. Or in loose terms: it stitches together pieces of the target voice using examples. It just does it very cleverly. It optimizes for three goals: matching the string you want, using long consecutive segments, and finally smooth transitions between those segments.
CUTE stands for:
- Concatenative synthesis
- Unit selection
- Triphone pre-selection
- Exemplar-based features
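The three goals described above (match the wanted sound, prefer long consecutive runs, keep joins smooth) can be pictured as a shortest-path search. What follows is only a minimal sketch, not Jin's implementation: the one-dimensional "features", the cost terms and the `jump_penalty` weight are all assumptions made for illustration.

```python
def select_units(targets, corpus, jump_penalty=1.0):
    """Pick one corpus frame per target frame using Viterbi-style dynamic programming.

    targets: non-empty list of floats, the desired feature per output frame (toy data).
    corpus:  non-empty list of floats, the feature of each recorded frame (toy data).
    Cost of a choice = |target - corpus frame|            (match the wanted sound)
      + 0 if the frame directly follows the previous one  (long consecutive segments)
      + jump_penalty + |feature gap at the join| otherwise (smooth transitions).
    All of these cost terms are illustrative assumptions.
    """
    n, m = len(targets), len(corpus)
    # best[i][j]: cheapest way to cover targets[0..i] ending on corpus frame j.
    best = [[0.0] * m for _ in range(n)]
    back = [[-1] * m for _ in range(n)]
    for j in range(m):
        best[0][j] = abs(targets[0] - corpus[j])
    for i in range(1, n):
        for j in range(m):
            best_cost, best_prev = float("inf"), -1
            for k in range(m):
                if j == k + 1:
                    join = 0.0  # consecutive frames in the recording: free join
                else:
                    join = jump_penalty + abs(corpus[k] - corpus[j])
                cost = best[i - 1][k] + join
                if cost < best_cost:
                    best_cost, best_prev = cost, k
            best[i][j] = best_cost + abs(targets[i] - corpus[j])
            back[i][j] = best_prev
    # Trace the cheapest path back from the last target frame.
    j = min(range(m), key=lambda idx: best[n - 1][idx])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]


corpus = [0.0, 1.0, 2.0, 3.0, 9.0, 1.1, 2.1]
print(select_units([1.0, 2.0, 3.0], corpus))  # [1, 2, 3]
```

In the second test case below, frame 6 is a slightly better match for the middle target than frame 2, but the search still keeps the unbroken run 1, 2, 3 because jumping in and out of it costs more than the small mismatch.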
To have sufficient ‘units’ of audio or sound to use flexibly, the approach defines these ‘units’ at the frame level. To make sure there is a smooth transition, it computes features using the 20 minutes of example audio and concatenates the spectral representations over multiple consecutive frames.
To obtain the phoneme segmentation, the system first translates the transcripts into phoneme sequences and then applies forced alignment to align the phonemes to the target speaker’s voice.
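Those two steps can be sketched in a few lines. This is a toy illustration under stated assumptions, not the aligner VoCo relies on: the `PRONUNCIATIONS` dictionary is hypothetical, and the per-frame costs stand in for the acoustic model scores a real forced aligner would compute.

```python
# Hypothetical toy pronunciation dictionary (ARPAbet-style symbols).
PRONUNCIATIONS = {"hi": ["HH", "AY"], "there": ["DH", "EH", "R"]}


def to_phonemes(transcript):
    """Translate a transcript into a flat phoneme sequence via dictionary lookup."""
    phonemes = []
    for word in transcript.lower().split():
        phonemes.extend(PRONUNCIATIONS[word])
    return phonemes


def force_align(phonemes, frame_costs):
    """Assign every audio frame to a phoneme, in order, at minimum total cost.

    frame_costs[t][symbol] is an assumed cost of frame t sounding like `symbol`.
    Returns a list of (phoneme, first_frame, last_frame) segments.
    """
    T, P = len(frame_costs), len(phonemes)
    if P == 0 or T < P:
        raise ValueError("need at least one frame per phoneme")
    INF = float("inf")
    best = [[INF] * P for _ in range(T)]
    stayed = [[True] * P for _ in range(T)]
    best[0][0] = frame_costs[0][phonemes[0]]
    for t in range(1, T):
        for p in range(P):
            stay = best[t - 1][p]  # same phoneme as the previous frame
            advance = best[t - 1][p - 1] if p > 0 else INF  # move to the next phoneme
            stayed[t][p] = stay <= advance
            best[t][p] = min(stay, advance) + frame_costs[t][phonemes[p]]
    # Walk back from the last frame of the last phoneme to recover the boundaries.
    segments = []
    p, end = P - 1, T - 1
    for t in range(T - 1, 0, -1):
        if not stayed[t][p]:
            segments.append((phonemes[p], t, end))
            p, end = p - 1, t - 1
    segments.append((phonemes[p], 0, end))
    return segments[::-1]


costs = [
    {"HH": 0.1, "AY": 0.9},
    {"HH": 0.2, "AY": 0.8},
    {"HH": 0.9, "AY": 0.1},
    {"HH": 0.8, "AY": 0.2},
]
print(force_align(to_phonemes("hi"), costs))  # [('HH', 0, 1), ('AY', 2, 3)]
```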
Jin’s original Princeton approach defines two types of exemplar, a target exemplar and a concatenation exemplar. These allow source examples to control the patterns of rhythm and sound (or the patterns of stress and intonation in the spoken words) and enforce concatenation smoothness in the target samples. Using phoneme information to pre-select candidates ensures that the longest possible phonetically correct segments are used in concatenation synthesis.
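The pre-selection idea can be illustrated with a small lookup over phonemes in context. This is a rough sketch only: the fallback from an exact triphone match to a partial match and then to the bare phoneme is an assumption for illustration, as are the function names and the `sil` padding symbol.

```python
def triphones(phonemes):
    """Return (previous, current, next) for each phoneme, padding the ends with 'sil'."""
    padded = ["sil"] + list(phonemes) + ["sil"]
    return [tuple(padded[i:i + 3]) for i in range(len(phonemes))]


def preselect(target_phonemes, corpus_phonemes):
    """For each target phoneme, list the corpus positions worth considering.

    Prefers positions whose full triphone context matches; failing that, positions
    that share the phoneme and one neighbour; failing that, the phoneme alone.
    (The fallback order is an illustrative assumption.)
    """
    corpus_tri = triphones(corpus_phonemes)
    candidates = []
    for tri in triphones(target_phonemes):
        exact = [i for i, c in enumerate(corpus_tri) if c == tri]
        if exact:
            candidates.append(exact)
            continue
        partial = [
            i for i, c in enumerate(corpus_tri)
            if c[1] == tri[1] and (c[0] == tri[0] or c[2] == tri[2])
        ]
        if partial:
            candidates.append(partial)
            continue
        candidates.append([i for i, c in enumerate(corpus_tri) if c[1] == tri[1]])
    return candidates


recorded = ["k", "ae", "t", "s", "ae", "t"]       # toy corpus: "cat sat"
print(preselect(["k", "ae", "t"], recorded))      # [[0], [1], [5]]
print(preselect(["b", "ae", "t"], recorded))      # [[], [1, 4], [5]]
```

An empty candidate list, as for the "b" in the second call, simply means the toy corpus never contains that sound.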
According to the paper, experiments demonstrate that the CUTE method has better quality than previous voice conversion methods and high individuality comparable to real samples.
You can see Jin’s original paper here.
“The approach used in VoCo has the same basic idea but far more sophisticated” he explains.
Adobe’s team can still be said to be using AI, just not a deep learning approach. Carlos Perez of IntuitionMachine.com just published an excellent piece pointing out that Deep Learning (DL) is a subset of Artificial Intelligence (AI) and that DL is different to Machine Learning (ML). Perez’s company is a specialist in Deep Learning for a variety of real world situations using big data. He is not a media specialist, nor is he involved with Adobe VoCo, but his explanation of AI is very relevant.
Artificial Intelligence has been around for a long time, and it is an umbrella term for everything from “Good Old Fashion AI (GOFAI) all the way to connectionist architectures like Deep Learning”, explained Perez. “The distinction between AI, ML and DL are very clear to practitioners in these fields”. ML is a sub-set of AI that covers anything that has to do with learning algorithms driven by training data. There are a lot of techniques that have been developed over the years in this area, such as Linear Regression, K-means, Decision Trees, Random Forest, PCA, SVM and finally Artificial Neural Networks (ANN). “Artificial Neural Networks is where the field of Deep Learning had its genesis from” he explains.
As Jin’s approach is not actually using decision trees etc. but is using example data, it falls on the edge of what some might call ML, but it can be labeled AI regardless.
Interestingly, neural networks date back to the late 1950s and early 1960s, but the computers of the day could not do much computation, and so the typically single-layer neural nets did not deliver on the promise of amazing results. Today, multiple layers, far more data and far quicker computers, including new dedicated AI cards from companies like Nvidia, mean incredible results. AI GPU cards, along with Moore’s law and much richer data sets, have led to Neural Nets or Deep Learning getting a lot of press. We have covered how it is used in everything from fluid sims to face solving (with the great work that Cubic Motion has done in facial expression FACS solving: see our EPIC story). But as Perez points out “the conclusion that DL is just a better algorithm than SVM or Decision Trees is akin to focusing only on the trees and not seeing the forest”.
DL is behind the advances in Google Translate and the incredible power of the new image recognition advances of the last few years. In the world of computer vision and object classification, DL has been a firestorm that has swept through the research community, defeating all other approaches in accuracy. What is incredible about Google Translate is what is actually happening inside the computer, in the middle of any translation. In the old days one would translate word for word between, say, English and French. The problem is that language is a very funny thing and it doesn’t translate word for word.
For example, Mark Forsyth, in The Elements of Eloquence: How to Turn the Perfect English Phrase (a brilliant book with literally 100s of such examples), points out that adjectives in English absolutely have to be said in the right order. It is an order we all know, but almost none of us could articulate what that sequencing actually is. That order is: opinion, size, age, shape, colour, origin, material, purpose, and then the noun.
In other words, “if you mess with that word order in the slightest you’ll sound like a maniac”, he warns. “It’s an odd thing that every English speaker uses that list, but almost none of us could write it out. For example as size has to come before colour, a green great dragon can’t exist.” It is worth trying this yourself: describe the car you drive out loud, but change the order of any of these adjectives, and you will be surprised at how wrong it sounds. This is just one of the odd aspects of English; many other languages let you know what the subject (noun) is first, and then list its attributes. Clearly, word for word translation would fail to adjust for such an order in different languages. In this particular case, the logic is a hard rule, so one could hard code around the sentence. But for many similar language problems it is this aspect of ‘something just not sounding right’, or not being expressed the way people ‘usually’ say it, that defeats an old school translation device, whereas DL tends to solve based on ‘experience’ or training data.
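Because this particular rule is fixed, hard coding it really is straightforward. Here is a minimal sketch, assuming Forsyth's ordering (opinion, size, age, shape, colour, origin, material, purpose) and a tiny hand-made `LEXICON` that is purely illustrative; a real system would need a category for every adjective, which is exactly where hand-written rules start to struggle.

```python
# Forsyth's adjective order, earliest position first.
ORDER = ["opinion", "size", "age", "shape", "colour", "origin", "material", "purpose"]

# Hypothetical mini-lexicon mapping adjectives to their category.
LEXICON = {
    "lovely": "opinion",
    "great": "size",
    "little": "size",
    "old": "age",
    "rectangular": "shape",
    "green": "colour",
    "French": "origin",
    "silver": "material",
    "whittling": "purpose",
}


def order_adjectives(adjectives):
    """Sort adjectives into the order an English speaker expects."""
    return sorted(adjectives, key=lambda adj: ORDER.index(LEXICON[adj]))


print(" ".join(order_adjectives(["green", "great"]) + ["dragon"]))  # great green dragon
```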
What Google Translate does is not translate word for word; instead it works out what the sentence means in maths, in its own internal logic. This is logic that the program worked out itself after being given millions and millions of examples of the same text in multiple languages. Only after working with the training data is it able to work out the sentiment of a sentence. It stores this intent of the English original in its own internal ‘maths form’ before translating what that sentiment would be in, say, French. The killer point here is that no one wrote an algorithm in the old school sense of deliberate programming (i.e. Good Old Fashioned AI, or GOFAI). Rather, they created a neural network that would try a lot of things, knowing the correct outcome, and see what combination of internal structures it could build that would most likely give the right outcome. It learns how to solve the problem. It is not programmed to directly solve a problem.
Chris Manning, an expert in Natural Language Processing (NLP), writes about the Deep Learning Tsunami:
“Deep Learning waves have lapped at the shores of computational linguistics for several years now, but 2015 seems like the year when the full force of the tsunami hit the major Natural Language Processing (NLP) conferences. However, some pundits are predicting that the final damage will be even worse”.
Adobe VoCo is clearly part of a broad wave of language applications that includes Siri and other chat-bots from companies like Google and Amazon. All these conversational agents fall into one of two categories, according to a piece written in April by Denny Britz called Deep Learning for Chatbots.
Retrieval-based models use a library of predefined responses and some kind of rule to pick an appropriate response based on the input and context. “These systems don’t generate any new text, they just pick a response from a fixed set” he wrote.
Generative models, which are much harder, don’t rely on pre-defined responses. “They generate new responses from scratch. Generative models are typically based on Machine Translation techniques, but instead of translating from one language to another, we ‘translate’ from an input to an output (response)” he continues.
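To make the first category concrete, here is a minimal sketch of a retrieval-based responder. It is an illustration only: the `RESPONSES` library and the word-overlap (Jaccard) scoring rule are assumptions, not anything described by Britz or used by Adobe. Note that it can only ever return one of its stored replies, whereas a generative model would compose the reply itself.

```python
import re

# Hypothetical fixed library: known prompt -> canned response.
RESPONSES = {
    "what are your opening hours": "We are open from 9am to 5pm.",
    "where are you located": "We are at 1 Example Street.",
    "how do i reset my password": "Use the reset link on the login page.",
}


def tokens(text):
    """Lower-case the text and return its set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Word overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(user_input):
    """Return the canned response whose prompt best overlaps the input."""
    words = tokens(user_input)
    best_prompt = max(RESPONSES, key=lambda prompt: jaccard(words, tokens(prompt)))
    return RESPONSES[best_prompt]


print(retrieve("What hours are you open?"))  # We are open from 9am to 5pm.
```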
Adobe’s VoCo is actually different, and in a sense it sits in between these two ‘chatbot’ models. It requires no input ‘interpretation’, as the new dialogue is typed into the machine by the editor, but it is creating something new that was not just pre-existing in the library.
This is why many of us were quick to jump on the Adobe announcement and why the final Adobe productization of VoCo may end up still being a DL solution. Given how impressive this ‘non DL’ VoCo demo was at the conference – and how much additional power DL solutions can often provide, the vector Adobe has set seems certain to be achieved within a few short years.
Jin finished our discussion by saying “but we all envision the future for natural voice synthesis lies in the deep neural networks and we had some positive results.”
By the way, on a personal note, we predict that DL and AI will do this for every aspect of what we cover here at fxguide, from character animation to green screen compositing, from JPEG artifact removal to assisting in editorial choices. AI will not replace artists, but artists who decide not to use it could be replaced by artists who do.