Appszoom for Developers


Posted by Steve Howe on 03/01/2017

We’ve all done it.

“Siri. When is the world going to end?”

“Cortana. Do you find me attractive?”

“Alexa. What do you really think about Siri?”

“OK Google. How do you sleep at night?”

Ever since Apple’s voice assistant Siri was first introduced on the iPhone 4S in 2011, thousands of people have killed time by probing it for the funniest answers. I distinctly remember one Christmas afternoon huddled round an iPad with my brother and sister yelling things like: “Siri. I’m naked!” and “Siri. Make me a sandwich!” before we would all crack up laughing at the answers.

But that was 20 minutes of entertainment. One Christmas afternoon. Our attention quickly shifted to more important topics (are we really going to watch Love Actually for the 9th year in a row?) and Siri was left alone with her thoughts. A gimmick, used for serious search engine requests only by a small core of loyal users.

The issue for tech companies, then, was essentially a PR image problem. How to transform speech recognition from the butt of the joke to the top of user priorities? Indeed, should we even invest in it?


Apple’s Siri: great assistant, terrible jokes

How much has speech recognition improved?

Speech recognition relies on data (as journalist Alex Hern puts it: data is the new coal). In order to iron out flaws in speech recognition, the technology needs to gobble up as many accents, dialects, speeds, intonations, and vocabulary choices as it possibly can. In other words, the software is in constant training.

Software developers have spent years ‘training’ speech recognition software to recognize the individual phonemes that make up a word a person says. By linking phonemes in a mathematical chain, the software can understand words and then go one step further by predicting the phoneme that’s most likely to come next. It takes information known to the system to work out the information it doesn’t know. This essentially allows the software to understand what someone is saying through context and probability, so even if someone stutters, mumbles, or smushes two words together, it can understand the question from the words it has understood and their relationship with other words in that sentence.
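To make the idea of a “mathematical chain” a little more concrete, here is a minimal sketch of next-phoneme prediction using simple bigram counts. It is a toy illustration only: the training words and their phoneme spellings are made up for the example, the function names are hypothetical, and real recognizers rely on far more elaborate statistical models trained on huge amounts of audio.

```python
from collections import Counter, defaultdict

# Toy training data (an assumption for this sketch): a handful of words
# spelled out as rough, ARPAbet-style phoneme sequences.
TRAINING = [
    ["HH", "EH", "L", "OW"],  # "hello"
    ["HH", "EH", "L", "P"],   # "help"
    ["Y", "EH", "L", "OW"],   # "yellow"
    ["M", "EH", "L", "OW"],   # "mellow"
]


def train_bigrams(sequences):
    """Count how often each phoneme is followed by each other phoneme."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def next_phoneme_probs(counts, prev):
    """Turn the raw counts seen after `prev` into probabilities."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {phoneme: n / total for phoneme, n in followers.items()}


def most_likely_next(counts, prev):
    """Return the phoneme most likely to follow `prev`, or None if unseen."""
    probs = next_phoneme_probs(counts, prev)
    return max(probs, key=probs.get) if probs else None


model = train_bigrams(TRAINING)
print(most_likely_next(model, "L"))
print(next_phoneme_probs(model, "L"))
```

With this toy data the model has heard “L” followed by “OW” three times and by “P” once, so after a clearly heard “L” it leans towards “OW” even if the next sound is mumbled. That is the “known information filling in the unknown” trick in miniature.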

The race to perfect the technology is on. Last year Microsoft claimed to have achieved a 6.3% word error rate, and just last month Google said that they have cut word error rates by more than 30% since 2012. That’s not to say, however, that things never go wrong. Amazon’s Alexa, for example, hit the news in Texas when it ordered a $170 dollhouse for a girl who was talking to it. To top it off, when the story was read out on the news via living room televisions, the Amazon Echos in those rooms proceeded to order even more dollhouses, bringing joy to children across the US and misery to parents and their credit cards.
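For anyone wondering what a word error rate actually measures: it is commonly defined as the number of word substitutions, deletions, and insertions needed to turn the recognizer’s output into the correct transcript, divided by the number of words in the correct transcript. The sketch below computes it with a standard edit-distance table. The function name and the example sentences are assumptions made for illustration, not taken from any vendor’s benchmark.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = cur[j - 1] + 1   # extra word in the output
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


spoken = "what is the weather like today"
heard = "what is the whether like"
print(f"{word_error_rate(spoken, heard):.1%}")
```

In this made-up example the recognizer gets one word wrong and drops another, so two errors over six reference words gives a rate of about 33%. A rate in the single digits means only a few mistakes per hundred words.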


Amazon’s Echo Dot: The parent that never says no

Love the sound of your own voice

While we can be fairly confident that the technology is no longer the issue, the question remains: are people talking? Well yes, they are. There are tons of stats to back it up: Google voice search queries in 2016 were up 35x over 2008, and Cortana has 133 million monthly users. Amazon sold an impressive 4.4 million Echo units in its first full year of sales. Predictions for the future vary in number but not in trend: smart speakers in homes will be a common sight, and the number of web searches carried out without a screen will rise rapidly.

It’s still not very common to see people giving their phones’ voice assistant a command in public. There’s an embarrassment around using it in front of strangers, not only because we generally avoid people who talk out loud to themselves on the bus, but also because our web searches are private places. It’s a space in which we feel comfortable asking things we wouldn’t dare ask even our most trusted friends. They’re the things that occur to us when we’re swimming around the depths of our minds. “Am I a toxic person?” “Does a vasectomy hurt?” “Am I a hipster?”


OK Google: “Am I a hipster?”

It’s not surprising, then, that use of the technology has increased rapidly in the spaces where we can really ‘be ourselves’: the home, the private office, the car. The spaces in which we can practice our twerking and sing full-volume to Beyoncé are the same spaces in which we can have a chat with Alexa or Siri. The home is now such a lucrative area that Amazon and Google are in the midst of an all-out device battle (Echo versus Google Home). Amazon’s Alexa can now call “shotgun” whenever you get in your car.

Even though we’re still a bit too self-conscious about using speech recognition in public, the advantages in the home are clear. Your hands are free to pick up the mess the kids left strewn on the floor. You don’t have to unlock a device or navigate through any menus. It’s fast. It may even be quicker to ask Alexa what the weather’s like than to walk over to the window and stick your head out.

Jump on the developer bandwagon

The potential for developers to piggyback on speech recognition software and take advantage of the rising trend is already there. Android apps can use intents to link to Google Voice Search (hundreds of apps on the Android market already do so). Amazon have tools for developers to teach Alexa new ‘skills’ that can link to apps or run programs in the cloud. As the technology becomes more open, so too will the possibilities for third-party developers. Here are some potential areas for creative minds to ponder:

The kitchen

You’re chopping up raw chicken. You put down the knife, quickly wash one hand and flick to the correct page of your recipe book. Your recipe book now has a mixture of water and chicken juice in it. The sauce is sticking to the bottom of the pot and you didn’t realize because you were distracted. Speech recognition that draws on a database of recipes and prompts you line by line, on demand, is a great solution for lovers of cooking and baking. Specialized apps can take advantage.


Speech recognition: keep sticky hands away from books
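As a rough idea of how little logic the kitchen scenario needs once the platform has turned speech into text, here is a minimal sketch of a recipe prompter that reads out one step at a time. Everything in it is hypothetical: the class name, the command words, and the recipe are made up for illustration, and a real app would receive the transcribed text from its voice platform rather than from a plain string.

```python
import re


class RecipeSession:
    """Steps through a recipe one line at a time in response to spoken commands."""

    def __init__(self, title, steps):
        if not steps:
            raise ValueError("a recipe needs at least one step")
        self.title = title
        self.steps = list(steps)
        self.index = 0

    def start(self):
        """Announce the recipe and read out the first step."""
        self.index = 0
        return f"{self.title}. {self._current()}"

    def handle(self, utterance):
        """Map an already-transcribed utterance to a spoken reply."""
        words = set(re.findall(r"[a-z']+", utterance.lower()))
        if "next" in words:
            if self.index == len(self.steps) - 1:
                return "That was the last step. Enjoy your meal!"
            self.index += 1
        elif words & {"back", "previous"}:
            self.index = max(0, self.index - 1)
        elif not words & {"repeat", "again"}:
            return "Sorry, you can say next, back, or repeat."
        return self._current()

    def _current(self):
        return f"Step {self.index + 1} of {len(self.steps)}: {self.steps[self.index]}"


session = RecipeSession(
    "Chicken stew",
    ["Chop the chicken.", "Brown it in the pot.", "Add the stock and simmer."],
)
print(session.start())
print(session.handle("OK, next step please"))
print(session.handle("Say that again?"))
```

The point of the design is that the cook never touches a screen or a page: each reply is short enough to be spoken aloud, and “repeat” is always available for the moment the sauce starts sticking.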


The idea of ‘second-screen’ apps that complement a live televised event or on-demand series with a social network conversation is still popular. How can speech recognition play into this? Perhaps it could help us to get back to the days of focussing on the first screen, capturing every little detail, while still enjoying the thrill of a shared experience with others in the background. Imagine shouting out your reactions as they come to you, which are then processed into the social media stream.


Such apps already exist, but the opportunity to combine work with the necessary commute may be of particular interest to workaholics who want to make the most of their time. This could be achieved via an inbox manager that helps you delete / forward / flag / reply to emails as you’re driving, or even something that integrates more closely with an IT system used at work such as Dropbox.

The challenge for traditional app developers is marrying the screen with a burgeoning sector that is essentially moving away from the screen, and making sure that whatever is developed for speech recognition isn’t just a novelty that causes a few laughs on Christmas Day and is then uninstalled moments later. Speech recognition just got serious. Perhaps it’s time to start listening.

Where does the future lie for apps and voice recognition? Leave your comment below!

Steve Howe is a freelance writer, translator, and teacher living in Barcelona.

