Hey Alexa, what’s next? Breaking through voice technology’s ceiling

The recent announcement from Amazon that it would be reducing staff and budget for the Alexa division has led some to deem the voice assistant “a colossal failure.” In its wake, there has been discussion that voice as an industry is stagnating (or even worse, on the decline).

I have to say, I disagree.

While it’s true that voice has hit its use-case ceiling, that doesn’t equal stagnation. It simply means that the current state of the technology has a few limitations that are important to understand if we want it to evolve.

Simply put, today’s technologies don’t perform in a way that meets the human standard. To do so requires three capabilities:

Superior natural language understanding (NLU): There are many good companies out there that have conquered this aspect. The technology can pick up on what you’re saying and knows the usual ways people might indicate what they want. For example, if you say, “I’d like a hamburger with onions,” it knows that you want the onions on the hamburger, not in a separate bag.

Voice metadata extraction: Voice technology needs to be able to pick up whether a speaker is happy or frustrated, how far they are from the mic, and their identities and accounts. It needs to recognize voices well enough to know whether you or somebody else is talking.

Overcoming crosstalk and untethered noise: The ability to understand speech even when other people are talking and when there are noises (traffic, music, babble) not independently accessible to noise cancellation algorithms.

There are companies that achieve the first two. These solutions are typically built to work in sound environments that assume there is a single speaker with background noise mostly canceled. However, in a typical public setting with multiple sources of noise, that is a questionable assumption.

Achieving the “holy grail” of voice technology

It is also important to take a moment to explain what I mean by noise that can and can’t be canceled. Noise to which you have independent access (tethered noise) can be canceled. For example, cars equipped with voice control have independent electronic access (via a streaming service) to the content being played on the car speakers.

This access ensures that the acoustic version of that content, as captured on the microphones, can be canceled using well-established algorithms. However, the system does not have independent electronic access to content spoken by car passengers. This is what I call untethered noise, and it can’t be canceled.
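To make the tethered/untethered distinction concrete, here is a minimal sketch of one well-known reference-based technique, a normalized least-mean-squares (NLMS) adaptive filter. The article does not say which algorithms in-car systems use, so treat this as an illustration only: the function name `nlms_cancel`, the toy acoustic path, the filter settings and the sine waves standing in for speech are all assumptions made for the example.

```python
import math
import random


def nlms_cancel(mic, reference, taps=8, step=0.02, eps=1e-8):
    """Remove from `mic` whatever can be predicted from `reference` (NLMS filter)."""
    weights = [0.0] * taps   # running estimate of the speaker-to-microphone path
    history = [0.0] * taps   # latest reference samples, newest first
    cleaned = []
    for mic_sample, ref_sample in zip(mic, reference):
        history = [ref_sample] + history[:-1]
        predicted = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - predicted            # what the reference cannot explain
        energy = sum(x * x for x in history) + eps
        weights = [w + step * error * x / energy for w, x in zip(weights, history)]
        cleaned.append(error)
    return cleaned


def mean_power(signal):
    return sum(s * s for s in signal) / len(signal)


random.seed(0)
n = 4000

# Tethered noise: the system holds the digital copy of what the speakers play.
music = [random.uniform(-1.0, 1.0) for _ in range(n)]

# Hypothetical acoustic path: the music reaches the mic delayed and attenuated.
path = [0.0, 0.6, 0.3, -0.1]
echo = [sum(path[k] * music[i - k] for k in range(len(path)) if i - k >= 0)
        for i in range(n)]

# Stand-ins for speech (plain sine waves, purely illustrative).
driver = [0.5 * math.sin(2 * math.pi * 0.01 * i) for i in range(n)]
passenger = [0.5 * math.sin(2 * math.pi * 0.037 * i) for i in range(n)]  # untethered

mic = [e + d + p for e, d, p in zip(echo, driver, passenger)]
cleaned = nlms_cancel(mic, music)

# Measure on the second half, after the filter has had time to adapt.
tail = slice(n // 2, n)
leftover_music = [c - d - p for c, d, p in zip(cleaned, driver, passenger)]
print("music power at mic:       ", mean_power(echo[tail]))
print("music power after filter: ", mean_power(leftover_music[tail]))
print("passenger power remaining:", mean_power(passenger[tail]))
```

In this toy setup the filter can remove most of the music because it is handed `music` as a reference. It is never given a reference for `passenger`, so that voice stays in `cleaned` alongside the driver's, which is the untethered case described above.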

This is why the third capability, overcoming crosstalk and untethered noise, is the ceiling for current voice technology. Achieving this in tandem with the other two is the key to breaking through the ceiling.

Each on its own gives you important capabilities, but all three together, the holy grail of voice technology, give you functionality.

Talk of the town

With Alexa set to lose $10 billion this year, it’s natural that it will become a test case for what went wrong. Think about how people typically engage with their voice assistant:

“What time is it?”

“Set a timer for…”

“Remind me to…”

“Call mom... no, CALL MOM.”

“Calling Ron.”

Voice assistants don’t meaningfully engage with you or provide much assistance that you couldn’t accomplish yourself in a few minutes. They save you some time, sure, but they don’t accomplish meaningful, or even slightly complicated, tasks.

Alexa was certainly a trailblazing pioneer in general voice assistance, but it had limitations when it came to specialized, futuristic commercial deployments. In those situations, it is critical for voice assistants or interfaces to have use-case-specific capabilities such as voice metadata extraction, human-like interaction with the user and crosstalk resistance in public places.

As Mark Pesce writes, “[Voice assistants] were never designed to serve user needs. The users of voice assistants aren’t its customers; they’re the product.”

There are a number of industries that can be transformed by high-quality interactions driven by voice. Take the restaurant and hospitality industries. We want personalized experiences.

Yes, I do want to add fries to my order.

Yes, I do want a late check-in, thanks for reminding me that my flight gets in late that day.

National fast-food chains like McDonald’s and Taco Bell are investing in conversational AI to streamline and personalize their drive-through ordering systems.

Once you have voice technology that meets the human standard, it can go into commercial and enterprise settings where voice technology is not just a luxury, but actually creates greater efficiencies and provides meaningful value.

Play it by ear

To enable intelligent control by voice in these scenarios, however, the technology needs to overcome untethered noise and the challenges presented by crosstalk.

It not only needs to hear the voice of interest but also needs the ability to extract metadata from the voice, such as certain biomarkers. If we can extract metadata, we can also start to open up voice technology’s ability to understand emotion, intent and mood.

Voice metadata will also allow for personalization. The kiosk will recognize who you are, pull up your rewards account and ask whether you want to put the charge on your card.

If you’re interacting with a restaurant kiosk to order food via voice, there will likely be another kiosk nearby with other people talking and ordering. Your kiosk must not only recognize your voice, but also distinguish it from theirs so that the orders don’t get confused.

This is what it means for voice technology to perform to the level of the human standard.

Hear me out

How do we ensure that voice breaks through this current ceiling?

I would argue that it’s not a question of technological capabilities. We have the capabilities. Companies have developed incredible NLU. If you can bring together the three most important capabilities for voice technology to meet the human standard, you’re 90% of the way there.

The final mile of voice technology demands a few things.

First, we need to demand that voice technology be tested in the real world. Too often, it’s tested in laboratory settings or with simulated noise. When you’re “in the wild,” you’re dealing with dynamic sound environments where different voices and sounds interrupt.

Voice technology that isn’t real-world tested will always fail when it’s deployed in the real world. Furthermore, there should be standardized benchmarks that voice technology has to meet.

Second, voice technology needs to be deployed in specific environments where it can truly be pushed to its limits, solve critical problems and create efficiencies. This will lead to wider adoption of voice technologies across the board.

We are very nearly there. Alexa is by no means a signal that voice technology is on the decline. In fact, it was exactly what the industry needed to light a new path forward and fully realize all that voice technology has to offer.

Hamid Nawab, Ph.D. is cofounder and chief scientist at Yobe.

