At Intel’s recent IDF (Intel Developer Forum), Intel formally announced its Perceptual Computing SDK 2013 Beta. This SDK (software development kit) lets developers build applications in which users interact with computing devices via multi-modal interfaces, combining voice and machine vision with keyboard, mouse, and direct touch. Beyond its potential to fundamentally change the way users interact with their phones, tablets, and PCs, it is also key to Intel’s future, because this style of computing soaks up enormous amounts of compute resources.
Today, users interact with computing devices like phones, tablets, and PCs in a myriad of ways: direct touch, keyboard, mouse, and trackpad. As we have seen with the early machine vision and speech investments made by Microsoft via Kinect and Apple via Siri, there is a lot more that can be done to improve the user experience. Both speech recognition and machine vision are in extensive use by the military and received significant boosts from government funding in the decade after 9/11, but neither is yet part of mainstream everyday use.
Machine vision was only recently popularized via Microsoft’s Kinect. Kinect combines a color camera with an infrared projector and IR sensor to track, roughly at the level of the torso and limbs, what the player is doing; it cannot accurately detect individual fingers and joints. Processing is done both at the camera and on the Xbox, but the user must stand in a certain place in the room, and games are limited to non-complex titles that take minimal compute resources. Some PC makers, and even Google’s Nexus 7, offer facial passwords, but these are slow and easy to spoof with a picture, video, or mask. That is unacceptable for most computing environments other than a TV. Would you want someone else to have access to the bank accounts on your phone?
What needs to be done to make the interface more natural? First, it takes an incredible amount of local compute performance at very low power levels to enable a natural user interface. Take machine vision for a secure user login as an example. The best way to do this is to have two high-resolution cameras, spread apart, mapping a three-dimensional view of the face. Think of this as a 3D game in reverse: instead of displaying polygons and textures, 3D machine vision pulls those polygons and textures into the compute device. The challenge is that this takes an incredible amount of processing performance and a lot of electrical power, not only inside the compute engine but also to drive the high-resolution stereo cameras. Then that 3D “map” needs to be pattern-matched against a local database, which takes even more compute performance and power. This step is called “object recognition,” where the device determines what it is looking at. I don’t want to debate here what kind of processing this requires, more serial processing, which favors the CPU, or more parallel vector processing, which favors the GPU, but let’s agree for now that it takes a lot of compute performance.
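To make the matching step concrete, here is a minimal sketch of how a secure-login comparison might look, assuming an upstream vision pipeline has already reduced each 3D face map to a fixed-length feature vector. The function names, the cosine-similarity approach, and the 0.9 threshold are illustrative assumptions, not part of Intel’s SDK:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two face feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_face(probe, database, threshold=0.9):
    """Return the enrolled user whose vector best matches the probe,
    or None if no candidate clears the similarity threshold."""
    best_user, best_score = None, threshold
    for user, enrolled in database.items():
        score = cosine_similarity(probe, enrolled)
        if score > best_score:
            best_user, best_score = user, score
    return best_user
```

Even this toy version hints at the compute cost the article describes: a real system would compare dense 3D maps, not three-element vectors, and would have to do so in real time on battery power.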
While this secure facial login is just one example, there are many others that fit this natural-user-interface potential:
- A presenter at a business meeting uses gestures to advance slides without the need for a “clicker”; they just wave their hands.
- A cook with flour on their hands waves their hand from side to side to turn the page of a recipe book.
- A designer uses their own hands, arms and torso to fit a pair of shoulder pads designed on their computer.
- By the intonations in your voice, your home “computer” knows to limit distractions because you are annoyed. Soft music and low lighting await you at home.
- By the panic in your voice, your car’s computer knows you are in trouble and asks you if you want to call 911.
- Your home computer senses that someone it doesn’t recognize is using it and texts you a photo of that person.
- At a nursing home, a resident’s computer recognizes that they haven’t gotten out of bed during the day and pages or texts a nurse or family member.
- Dictation reaches near 100% accuracy with the combination of speech to text and lip reading.
- Your TV recognizes and alerts you that there are five people in your living room after you told your kids they could only have two guests over.
- Replacing the physical mouse or trackpad with a “hand-mouse,” where the hand can rest anywhere on a flat surface and be tapped and swiped like a physical device. A camera maps your hand, joints, and fingertips in real time.
- Meeting transcription, where everything said in the meeting is recorded, transcribed, and separated person by person. Action items and “parking lot” items are automatically sensed, too, with a running tally in a pane on the screen.
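Several of the gesture scenarios above (the presenter advancing slides, the cook turning a recipe page) reduce to the same primitive: detecting a horizontal swipe in a stream of tracked hand positions. A minimal sketch, assuming a hypothetical upstream vision stage that already delivers one normalized hand x-coordinate per camera frame:

```python
def detect_swipe(x_positions, min_distance=0.3):
    """Classify a horizontal swipe from per-frame hand x-coordinates
    (normalized to 0..1 across the camera frame).

    Returns "swipe_left", "swipe_right", or None if the hand did not
    travel far enough to count as a deliberate gesture.
    """
    if len(x_positions) < 2:
        return None
    displacement = x_positions[-1] - x_positions[0]
    if displacement >= min_distance:
        return "swipe_right"
    if displacement <= -min_distance:
        return "swipe_left"
    return None
```

The classification itself is trivial; as the article argues, the expensive part is the per-frame vision work that produces those coordinates in the first place.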
The examples are truly endless, and because they are so personal, they require privacy controls, which Intel has added to the SDK via “Privacy Notification.” This can be as simple as an indicator showing when you are being recorded by mic or camera.
None of this voice and machine vision means that direct touch on touchpads and displays, or the keyboard and mouse, will go away any time soon. It won’t. We will move to a “multi-modal” interface, where the device chooses, based on context and user history, the best mode of control. This is where Intel’s “Usage Mode Coordination” comes into play: picking the best mode of interaction. Additionally, two different methods could be in play at once, which requires coordination; lip reading could be combined with speech-to-text to radically improve speech interaction.
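One simple way to picture what a usage-mode coordinator does is as an arbiter over per-modality confidence scores. This sketch is my own illustration of the idea, not Intel’s Usage Mode Coordination API; the mode names and the 0.5 confidence floor are assumptions:

```python
def coordinate_modes(confidences, min_confidence=0.5):
    """Pick the interaction mode whose recognizer reports the highest
    confidence, ignoring any mode below a usability floor.

    confidences: dict mapping mode name (e.g. "voice", "touch",
    "gesture") to a 0..1 confidence score from that recognizer.
    Returns the winning mode name, or None if nothing is usable.
    """
    usable = {mode: c for mode, c in confidences.items()
              if c >= min_confidence}
    if not usable:
        return None
    return max(usable, key=usable.get)
```

A real coordinator would also fuse modalities rather than just pick one, which is exactly the lip-reading-plus-speech combination the paragraph above describes.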