Microsoft Brings Video Voice Recognition For Everyone

Beyond "how to wreck a nice beach"

File photo of Microsoft CEO Satya Nadella.

Credit: Microsoft

Azure Media Services is something Apple might want to consider for streaming its next keynote, rather than rolling its own system on Amazon Web Services and Akamai. It's what big-name broadcasters used to stream the 2014 Winter Olympics and the 2014 World Cup, it's what powers the Blinkbox streaming video service, and if you watched the Xbox One announcement you've already used it, so it's certainly proved its reliability.

Now it's a public preview anyone can use to stream content -- with or without digital rights management (DRM), on just about any device, through Flash, Silverlight or HTML5, with support for creating your own app for Windows, Windows Phone, iOS, Android and Xbox. If you have company training videos, or shareholder meetings you want to share, Azure Media Services gives your business a cloud service to do that.

If you just want somewhere to keep video, services from YouTube to Vimeo let you do that (although with far less control than Azure). But what’s really interesting is the Azure Media Indexer service, which has just moved from preview to General Availability. This is a sophisticated voice recognition system for indexing audio and video, so someone can search for keywords, phrases, or clips; generate closed captions automatically; and even get full transcripts from your media.

How to wreck a nice beach

With the new system, when you search for a keyword, you're not just getting a video that has the word in the title, or in a tag someone has put on by hand; you can jump right to the second of the video that has someone saying the word you're looking for -- and you can see a snippet of the automatic transcript to make sure it's what you're looking for. You can try that out with this Microsoft Video Web Search, which has about ten thousand hours of video clips from MSNBC you can search.
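To make the idea of jumping "right to the second" concrete, here is a minimal sketch in Python. It assumes the recognizer has already produced a transcript in which every word carries a start time. The `Word` class, the `find_keyword` function and the sample transcript are all hypothetical illustrations, not part of Azure Media Indexer or MAVIS.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word and the time (in seconds) at which it is spoken."""
    text: str
    start: float


def find_keyword(words, keyword, context=3):
    """Return (start_time, snippet) for every place the keyword is spoken.

    The snippet is the keyword plus up to `context` words on either side,
    so a viewer can check the hit before jumping to it.
    """
    keyword = keyword.lower()
    hits = []
    for i, word in enumerate(words):
        if word.text.lower() == keyword:
            lo = max(0, i - context)
            snippet = " ".join(w.text for w in words[lo:i + context + 1])
            hits.append((word.start, snippet))
    return hits


# Hypothetical transcript, purely for illustration.
transcript = [
    Word("the", 12.0), Word("senate", 12.3), Word("passed", 12.9),
    Word("the", 13.2), Word("bill", 13.4), Word("today", 13.9),
]

for start, snippet in find_keyword(transcript, "bill", context=2):
    print(f"{start:.1f}s: {snippet}")
```

A player could then seek to the returned start time instead of playing the clip from the beginning.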

That's a demo put together by the Microsoft Research (MSR) team that has been working on MAVIS (the Microsoft Audio Video Indexing Service that powers the Indexer) for the last seven years. Compare that to Siri or Cortana, which get better as they learn your voice; MAVIS doesn’t have to learn about each person speaking and it can handle multiple speakers in the same conversation, even if they have different accents. And unlike specialist voice recognition systems for doctors and lawyers, which do extremely well at recognizing words as long as they're about those particular topics, MAVIS can handle almost any conversation.

If you use OneNote, you've had audio search since the 2007 Windows version (also built by the team behind MAVIS), but that just looks for phonemes (the sounds that make up individual words) in the recording. Look for "how to recognize speech" and you could easily get a match to "how to wreck a nice beach," because the sounds are very similar.
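A rough Python sketch shows why a purely phonetic search confuses the two phrases. The phoneme sequences below are hand-written, approximate ARPAbet-style transcriptions (an assumption for illustration, not output from OneNote or MAVIS), and the similarity measure is simply the standard library's `difflib.SequenceMatcher`, not whatever matching either product actually uses.

```python
from difflib import SequenceMatcher

# Hand-written, approximate phoneme sequences (illustrative assumption).
RECOGNIZE_SPEECH = "R EH K AH G N AY Z S P IY CH".split()
WRECK_A_NICE_BEACH = "R EH K AH N AY S B IY CH".split()
COFFEE_TABLE = "K AO F IY T EY B AH L".split()


def phoneme_similarity(a, b):
    """Return a 0..1 score for how much two phoneme sequences overlap."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


confusable = phoneme_similarity(RECOGNIZE_SPEECH, WRECK_A_NICE_BEACH)
unrelated = phoneme_similarity(RECOGNIZE_SPEECH, COFFEE_TABLE)

print(f"recognize speech vs wreck a nice beach: {confusable:.2f}")
print(f"recognize speech vs coffee table:       {unrelated:.2f}")
```

The two famous phrases share most of their sounds in the same order, so a search that only compares sounds scores them as a close match, while an unrelated phrase scores far lower.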

MAVIS uses vocabulary search, where it knows about the actual words and the context they're used in, although it works using senones, even smaller chunks of speech than phonemes. It stores multiple possible recognitions, complete with the probability that each recognition is the right one (did you say "speech" or "beach" or even "peach"?). And if it comes across words it doesn't know, it looks them up on Bing.
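One way to picture storing several possible recognitions with their probabilities is the small Python sketch below. It is only an illustration under stated assumptions: the data layout, the `build_index` and `search` functions, and the probabilities are all invented here, and MAVIS's real index format is not described in this article.

```python
from collections import defaultdict


def build_index(slots):
    """Index every candidate word, not just the most likely one.

    `slots` is a list of (start_seconds, {word: probability}) pairs, one per
    stretch of audio, holding the recognizer's alternative guesses.
    """
    index = defaultdict(list)
    for start, alternatives in slots:
        for word, probability in alternatives.items():
            index[word].append((start, probability))
    return index


def search(index, word, min_probability=0.2):
    """Return (start_seconds, probability) hits, most confident first."""
    hits = [(t, p) for t, p in index.get(word, []) if p >= min_probability]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Invented probabilities, purely for illustration.
slots = [
    (4.0, {"recognize": 0.7, "wreck": 0.3}),
    (4.6, {"speech": 0.6, "beach": 0.3, "peach": 0.1}),
]
index = build_index(slots)

print(search(index, "speech"))
print(search(index, "beach"))
print(search(index, "peach"))  # falls below the default threshold
```

Because the less likely guesses stay in the index, a search for "beach" can still find the clip even when the recognizer's top choice was "speech", and the stored probability lets the search rank or filter weak matches.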

That all makes it easier to tell the difference between someone talking about history and saying 'the Crimean war' and someone talking about politics and mentioning 'crime in a war', explained Behrooz Chitsaz, the Microsoft Research director who talked about MAVIS at the MIX conference back in 2010.

MAVIS is also behind the Department of Energy's ScienceCinema search and it's been searching video for the British Library, NASA, and state archives in both Georgia and Washington. Wondering why a particular bill was passed? How about jumping straight to the exchange in the debate that tipped the balance…

Those systems were built using the MAVIS APIs for SharePoint and SQL Server, which let you build up a multimedia archive on your intranet and use MAVIS to index it.

MAVIS always used Azure to run its speech recognition; that's how Green Button (which Microsoft bought recently) was able to build a cloud video indexing service on it. A couple of years ago MAVIS switched to using the same type of deep learning neural networks that are behind the live speech translation Microsoft plans to launch for Skype later this year. Deep learning is being used at Google and Facebook and it's the hot new technique in machine learning, so it's interesting that Microsoft was the first company to put deep learning-driven voice recognition in a product back in 2012, even if it was one you needed to be a research lab or state to afford.

The combination of the massive data centers you need to run a cloud service and the economics of SaaS subscriptions means that kind of high-powered tool is becoming far more widely available. With the Azure Media Indexer, you can skip building your own media archive completely and put it in the cloud instead.

When did I say that?

The voice recognition in the Indexer is not only useful for videos; it would also work just as well for searching your voice mail or recordings of meetings and conference calls. You could use it to jump to the point in a match where someone scores a goal, or to the bit in the meeting where someone says something useful, or to check your voice mail for the call from the auto repair shop that you're waiting for without having to listen to all your other messages first.

Avanade's Paul Veitch suggests that the first customers beyond broadcasters will be banks, especially traders -- and regulators.

"Lots of banks are interested not only in storing data in the cloud but in how you recall it. You could say 'tell me when I was talking to this customer about the price of gold' and it will know where that part of the conversation was. Now we can analyze that data and make it searchable. The Financial Conduct Authority are quite interested in that for compliance; are the Chinese walls inside the bank working? And internal compliance departments are interested too; they're looking at data mining audio calls and conversations."

He suggests it will be even more useful if you connect it to other data sources and machine learning systems. "There are already automated trading systems that monitor Twitter," he points out. "Now you could do monitoring inside the bank for sentiment too."

Mining voice recordings is the kind of thing that would fit in perfectly with Delve, the social network for documents that's just launching in Office 365. Delve looks for documents your colleagues are working on, and willing to share publicly, that are relevant to what you're working on or the meeting you're about to go to, and shows them to you.

That would be extremely useful if it included links to the recording of a Lync meeting where the customer you're going to meet tomorrow is phoning up to make a complaint, or right to the minute in your online training video where the presenter covers how to fix the problem you're writing an email about. If you can get the right two minutes of it, a three-hour video becomes much more useful.

That's the kind of thing Satya Nadella means when he talks about "productivity [including] group collaboration and business processes" or about "digital work and life experiences" that include "intelligent and social work experiences". The Indexer is aimed at broadcasters and content companies today, because they already know what they will use it for (but services like YouTube and Twitch are turning almost everyone into broadcasters now). Now that voice search is available on Azure as a service, we can see what else you can use it for.

This story, "Microsoft Brings Video Voice Recognition For Everyone" was originally published by CITEworld.
