Amazon’s Echo speaker knows that you only have to say “Alexa” to wake it up. But technically it can’t tell the difference between two people murmuring in the corner of the room and the sound of radio static. That would take a deeper dive into the building blocks of sound itself.
But one small startup in Cambridge, UK, has spent ten years building an entirely new language of sound that, for the first time, will allow machines to recognize the sound of human speech.
Wait, you’re thinking. Smart gadgets already recognize speech just fine, don’t they? In fact, they only recognize words, hence the growing subfield of artificial intelligence known as natural language processing.
Distinguishing different types of sounds may seem easy enough for our ears, but that’s thanks to millions of years of evolution; doing the same for machines is much harder.
Audio Analytic, which has raised $5.5 million in venture financing and sells its library of “sound profiles” to device makers such as Cisco and Intel, said Wednesday that it was making its new “human speech” sound profile available to clients.
That means a speaker like Amazon’s Echo or Apple’s HomePod could eventually recognize the sound of people having a conversation, and know that it’s not appropriate to interrupt.
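To make that concrete, here is a minimal sketch, in Python, of how such a check might look on a device. Everything in it is hypothetical: Audio Analytic has not published an API, so the SpeechProfileDetector class, its score method and the threshold are stand-ins for whatever a licensed sound profile actually exposes.

    import random

    class SpeechProfileDetector:
        """Hypothetical stand-in for a licensed 'human speech' sound profile.
        A real detector would analyze the audio; this stub returns a random score."""

        def score(self, frame):
            # Probability that the frame contains people talking (stubbed out).
            return random.random()

    def should_interrupt(frame, detector, threshold=0.8):
        """Hold back non-urgent announcements while a conversation is detected."""
        return detector.score(frame) < threshold

    # A speaker would run this on each buffered audio frame and queue
    # notifications until the room goes quiet again.
    ok_to_speak = should_interrupt(frame=None, detector=SpeechProfileDetector())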
Audio Analytic is tight-lipped about most of its clients. When asked if Amazon was a customer, CEO Chris Mitchell declined to comment. If any of his existing customers purchase the new sound profile, though, they should be able to deploy it over the air to their devices. Amazon, notably, has a research and development lab in Cambridge where engineers work on Alexa.
Alexa’s ability to recognize speech has pros and cons, especially on the security front. Hackers can, for instance, spoof an Alexa skill by getting the service to recognize a slightly incorrect voice command, according to new research shared with Forbes this week.
But device makers are keen to make their gadgets smarter by teaching them how to hear. Until now, Audio Analytic has derived most of its revenue from licensing fees. It sells “libraries” of certain sounds that make it possible for machines to recognize, say, a dog’s bark about as well as a human would.
To give a sense of how long it can take to model these kinds of sounds, Audio Analytic has licensed a grand total of seven of them in the decade it has been operating. They include a barking dog, breaking glass, a crying baby and a smoke or CO alarm. The startup aims to license 50 profiles by 2021.
Right now most smart-home systems are only able to alert users on their smartphones that they’ve detected a noise, but that could be anything from the cat knocking over a book to a moth fluttering in front of the device’s microphone.
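The difference is easy to see in code. Below is an illustrative Python sketch, not any vendor’s actual implementation: the first function mimics today’s loudness-triggered alerts, while the second assumes a set of per-sound scoring functions of the kind a sound-profile library would supply.

    import numpy as np

    def naive_noise_alert(frame, loudness_threshold=0.1):
        """Roughly what most smart-home systems do today: fire on any loud
        sound. A falling book and breaking glass are indistinguishable here."""
        rms = np.sqrt(np.mean(frame ** 2))  # root-mean-square loudness
        return rms > loudness_threshold

    def profile_alert(frame, profiles, min_confidence=0.5):
        """What sound profiles add: saying WHICH sound was heard. `profiles`
        maps a label (e.g. 'breaking_glass') to a hypothetical scoring
        function that returns a confidence in [0, 1]."""
        scores = {label: fn(frame) for label, fn in profiles.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] >= min_confidence else None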
To build just one of those profiles, Mitchell’s staff have broken thousands of panes of glass in a dedicated sound lab, and the process has turned into something of a rite of passage for new recruits.
“They’re invited to suit up with protective gear,” says Mitchell. “The weapon of choice is a sledgehammer, but we have tried a bunch of implements, including emergency escape hammers.”
The startup, which has a workforce of 45, has also built a proprietary database of 1 million unique “sound events,” which it calls Alexandria. Mitchell describes this as separate from the company’s commercial work: “what we choose to do with our spare time.”
The company is ahead of the likes of Amazon and Google in “machine listening,” he claims, an academic field that sits alongside speech recognition and natural language processing. “It’s a new discipline.” The startup’s main contribution is a concept it calls ideophones.
“If you look at the speech world, most of speech science is structured around understanding the order in which we’re going to say words,” he says. “We control how sound comes out of our mouths. The way we do that in speech is called phonemes. What Audio Analytic has done is discern, and then write software at the AI level to model, ideophones. These are the fundamental building blocks that make up sounds.”
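Audio Analytic has not published how its ideophones actually work, but the phoneme analogy can be illustrated generically: chop audio into short frames, describe each frame by its spectrum, and map it to the nearest entry in a codebook of learned acoustic units, so a waveform becomes a sequence of unit IDs much as speech becomes a string of phonemes. The Python sketch below assumes such a pre-learned codebook; every name in it is hypothetical.

    import numpy as np

    def to_unit_sequence(signal, codebook, frame_len=1024, hop=512):
        """Turn a waveform into a sequence of acoustic-unit IDs.

        `codebook` is a hypothetical (n_units, frame_len // 2 + 1) array of
        reference magnitude spectra learned offline from labeled recordings.
        """
        units = []
        window = np.hanning(frame_len)
        for start in range(0, len(signal) - frame_len + 1, hop):
            frame = signal[start:start + frame_len] * window
            spectrum = np.abs(np.fft.rfft(frame))    # frame's magnitude spectrum
            distances = np.linalg.norm(codebook - spectrum, axis=1)
            units.append(int(np.argmin(distances)))  # nearest acoustic unit
        return units

    # A classifier for, say, breaking glass would then look for unit
    # sequences characteristic of that sound, rather than for raw loudness.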
Such a sound library is, perhaps not surprisingly, about 50% bigger than the kind used to process words alone, Mitchell adds.
As it happens, humans could do with a better understanding of sounds too.
“Hollywood has convinced the world that things sound a certain way,” says Mitchell. The sounds used for car alarms and fist-fight punches usually come across tamer in movies than they do in real life.
The same goes for breaking glass. “We have to smash quite a lot of windows,” he adds. “It’s absolutely terrifying.”