Voice Control Will Disrupt the Living Room

By Patrick Moorhead - December 20, 2011
It now seems routine in high-tech journalism and social media to speculate on what Apple will do next. The latest and greatest rumor is that Apple will develop an HDTV set. I wrote back in September that Apple should build a TV, given the lousy experience today and Apple’s ability to fix big user challenges. What hasn’t been talked about much is why voice command and control makes so much sense in home electronics and why it will dominate the living room. It’s all about the content.

History of U.S. TV Content

For many growing up in the U.S., there were four or five stations on TV: ABC, NBC, CBS, PBS and an independent UHF channel. If you wanted to know what was on, you just looked in the daily newspaper that was dropped off every morning on the front porch. Then, around the early ’80s, cable started rolling out and TV moved to around 10-20 channels, including ESPN, MTV, CNN, and HBO. The next step was an explosion in channels brought by analog cable, digital cable and satellite. My cable company, Time Warner, offers 512 different channels. Add to that the unlimited number of over-the-top “channels” or titles available on services like Netflix and Boxee, and you can easily see the challenge.

The Consumer Problem

With an unlimited number of things to watch, record, and interact with, finding what you want to watch becomes a huge issue. Paper guides are worthless, and the integrated TV guides from cable or satellite boxes are slow and cumbersome. Given the flat, long-tail distribution of choices, multivariate and unstructured “search” is the way to find the right content. That is, directories aren’t the answer. The question then becomes: what’s the best way to search?

The Right Kind of Search

If search is the answer, what kind of search? The answer lies in how people would want to find something. Consumers have many ways they look for things. Some like to do surgical searching, where they know exactly what they want.
They ask for “The Matrix Revolutions.” Others have a concept or idea of what they are looking for, but not the exact title: “find the car movie with Will Ferrell and John C. Reilly,” and back come a few movies like Step Brothers and Talladega Nights. Others may search by an unlimited number of “mental genres,” categories the user makes up on the spot. They may ask for “all Emmy Award winning movies between 2005 and 2010.” You get the point: the consumer is best served by answers to natural language search, and then the call to action is to get that person to the content immediately.

Natural Language Voice Search and Control

The answer to the content search challenge is natural language voice search and control. That’s a mouthful, but basically, you tell the TV what you want to watch and it guides you there from thousands of entry points. Two popular implementations exist today for voice search. There are others, like Dragon NaturallySpeaking, but those are niche commercial plays.

Microsoft Kinect

Microsoft has done more to enhance the living room than any other company, including Apple, Roku, Boxee and Sony. Microsoft is a leader in IPTV and the innovation leader in entertainment game consoles. With Kinect, a user can use Bing to search and find content. It works well in specific circumstances and at certain points in the experience, but it needs a lot of improvement. Bing needs to find content anywhere in the menu structure, not just at the top level. Kinect also needs to work better in a living room full of viewers. Its beam-forming is awesome but needs to improve to the point that it serves as a virtual remote. Finally, it needs to support natural language search and the ability to narrow down the choices. I have full confidence that Microsoft will add these features, but a big question is the hardware. The hardware is seven years old.
Software gymnastics and offloading some processing to the Kinect module have been brilliant, but at some point, hardware runs out of gas.

Apple Siri

While certainly not the first to bring voice command and dictation to phones, Apple was the first to bring natural language to the phone. The problem with the current Siri is that it’s not connected to an entertainment database, it doesn’t have the logic to narrow down choices, and it isn’t connected to a TV, so you can’t immediately switch the TV to what you are looking for once you find it. As I wrote in September (before the iPhone 4S and Siri), Apple “could master controlling the TV’s content via voice primarily.”

If Apple were to build a TV, it could hypothetically leverage iPhones, iPads, and iPods to improve the voice results. While Kinect has a full microphone array and operates best at 6-8 feet, an iPhone microphone could be 6 inches away, which would certainly help with voice recognition and with the “who owns the remote” problem. Even better would be if multiple iOS devices could leverage each other’s sensors. That would be powerful. While I am skeptical of driving voice control and cognition from the cloud, Apple, if it built a TV, could do more local processing and increase the speed of results. Anyone who has used Siri extensively knows what I am talking about here. The first few times Siri for TV fails to bring back results or says “system unavailable,” it gets shelved and is never used again by many in the household. Part of the entertainment database needs to be local until the cloud can be 99% accurate.

What about Sony, Samsung, LG, and Toshiba?

I believe that all major CE manufacturers are working on advanced HCI techniques to control CE devices with voice and air gestures. The big question is, do they have the IP and the time to “perfect” the interface before Apple and Microsoft dominate the space? There are two parts to natural language control: the “what did they say” and the “what did they mean.”
Apple licenses the first part from Nuance, but the back end is Siri. Competitors could license the Nuance front end, but would need to buy or build the “what did they mean” part.

Now that HDTV sales are slowing down, it is even harder to differentiate between HDTVs. Consumers haven’t been willing to spend more for 3D but have been willing to spend more for LED and Smart TV. Once every HDTV is LED, 3D and “smart,” the key differentiator could become voice and air gestures. If Sony, Samsung, LG and Toshiba aren’t prepared, their world could change dramatically, and Microsoft and Apple could have the edge.
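
To make the “what did they mean” half a little more concrete, here is a minimal Python sketch of multi-attribute search over a small local catalog. It assumes the “what did they say” step has already turned speech into text. Everything in it is hypothetical and for illustration only: the `Title` record, the three-entry `CATALOG`, the tags, and the scoring weights are my own stand-ins, not how Siri, Kinect, or Nuance actually work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Title:
    name: str
    cast: frozenset
    tags: frozenset


# Hypothetical stand-in for a locally cached entertainment database.
CATALOG = [
    Title("The Matrix Revolutions", frozenset({"keanu reeves"}), frozenset({"sci-fi"})),
    Title("Step Brothers", frozenset({"will ferrell", "john c. reilly"}), frozenset({"comedy"})),
    Title("Talladega Nights", frozenset({"will ferrell", "john c. reilly"}), frozenset({"comedy", "car"})),
]


def score(title: Title, transcript: str) -> int:
    """Weighted count of a title's attributes mentioned in the transcribed request."""
    text = transcript.lower()
    words = set(text.split())
    points = 0
    if title.name.lower() in text:
        points += 3  # assumed weight: an exact title is the strongest signal
    points += sum(1 for actor in title.cast if actor in text)
    points += sum(1 for tag in title.tags if tag in words)
    return points


def search(transcript: str) -> list:
    """Return the names of matching titles, best match first."""
    scored = [(score(t, transcript), t.name) for t in CATALOG]
    ranked = sorted((pair for pair in scored if pair[0] > 0), key=lambda p: -p[0])
    return [name for _, name in ranked]


print(search("find the car movie with Will Ferrell"))
print(search("play The Matrix Revolutions"))
```

The first request is the fuzzy kind, so it returns a short ranked list to narrow down; the second is the surgical kind and returns a single title. A real system would need far richer language understanding than substring matching, which is exactly the part competitors would have to buy or build.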
Patrick Moorhead

Patrick founded the firm based on his real-world technology experiences and his understanding of what he wasn’t getting from analysts and consultants. Ten years later, Patrick is ranked #1 among technology industry analysts in terms of “power” (ARInsights) and in “press citations” (Apollo Research). Moorhead is a contributor at Forbes and frequently appears on CNBC. He is a broad-based analyst covering a wide variety of topics including the cloud, enterprise SaaS, collaboration, client computing, and semiconductors. He has 30 years of experience, including 15 years of executive experience at high-tech companies (NCR, AT&T, Compaq (now HP), and AMD) leading strategy, product management, product marketing, and corporate marketing, as well as three industry board appointments.