About ten years ago, I wrote an editorial suggesting that our PCs could, and should, look at and listen to us. And when they did, they would be able to see if we were happy, angry, sad, impatient, or bored. They would know when we were dictating a memo, story, or email, and when we were giving them commands to open a file or launch a program. I even went so far as to suggest eye-tracking as a replacement for the mouse. I never suggested hand waving as an alternative input mechanism, because of my experience many years ago with a light-pen. For those of you too young to know what that is: the early CAD and computer-drawing systems used a pen-like device that was pressed against the display’s screen to draw lines. Inside the light-pen was a photosensor, and the computer would (very quickly and usually imperceptibly) scan the screen looking for it. Although you could draw very accurately with that technique, your arm got tired pretty quickly. Try it: hold your arm out toward your display and see how long you can keep it there. For that reason, moving your hand off the desktop (from the mouse, touch pad, trackball, or joystick) and suspending it in the air does not seem very “natural” to me. But you can wink at your screen, and look at various parts of it.
Obviously, as we get surplus amounts of MIPS, various non-contact user interfaces will be tried. And, as with most things, one size or technique won’t fit all (just as some people use a trackball and others prefer the pointing stick embedded in the keyboard).
But the computer will look at and listen to us. My computer listens to me almost every day. In fact, this editorial was “written” using speech-recognition software (Dragon NaturallySpeaking v10), as were several of the other stories in this issue. You can’t tell by reading it, because any mistakes the software made have been corrected; as is always the case, whether you’re typing or dictating, there’s an editor.
Although we have the computing power to do the image processing necessary to let a PC’s camera watch us, the resolution of most cameras in use today is just VGA. HD cameras are dropping in price and are readily available; they will be needed if we are to get truly useful and reliable facial and gesture recognition.
Imagine now that we do have such capability, and that we also have adequate bandwidth for video conferencing. Think about the possibility of having your computer’s image-processing software evaluate the stress level of the person you’re communicating with, and report to you if it thinks your correspondent is lying. Such a capability could revive person-to-person communication and lead us back from relying so heavily on email and text messaging. It certainly would be useful when negotiating a contract with someone. It would also be helpful for people doing online dating. Imagine the savings you would realize by not having to drive to a coffee shop or bar to meet someone for the first time.
Of course, the gaming aspects are well understood; in fact, it was the Wii that ignited the imagination of so many with regard to the potential of gestural and visual communication with a computer. Human-interface studies will be needed to determine the best movements to use to communicate with, and control, the computer. You can envision a common set of operations similar to what we have developed over the years for operating an automobile. However, there’s also the possibility of making your computer tamperproof by giving it a unique set of gestural commands, so that no one else could operate it without knowing those moves, as the sketch below illustrates.
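To make the gestural-lock idea concrete, here is a minimal sketch in Python. The gesture labels are hypothetical stand-ins for whatever a real recognizer would emit; the point is simply to compare an observed sequence of gestures against a stored secret without leaking, through timing, how much of it was right.

```python
import hmac

# Hypothetical gesture labels standing in for a recognizer's output.
SECRET_SEQUENCE = ["swipe_left", "circle", "two_finger_tap", "swipe_up"]

def unlock(observed: list[str]) -> bool:
    """Return True only if the observed gestures match the stored secret.

    hmac.compare_digest compares in constant time, so a failed attempt
    reveals nothing about which prefix of the sequence was correct.
    """
    return hmac.compare_digest("|".join(observed), "|".join(SECRET_SEQUENCE))

print(unlock(["swipe_left", "circle"]))                                # False
print(unlock(["swipe_left", "circle", "two_finger_tap", "swipe_up"]))  # True
```

A real system would also store the secret hashed and rate-limit attempts, but the sequence comparison is the core of the idea.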
The downside of speech recognition is the noise level colleagues have to tolerate while you are speaking to your computer. It’s not unlike hearing one side of a telephone or mobile-phone conversation. The difference is that dictation is more continuous and coherent, and therefore more distracting.
These will be evolutionary changes, not revolutionary, as PC manufacturers slowly adopt and incorporate high-resolution cameras into displays. Eventually it will evolve to the point where there are two cameras in the display, enabling stereo vision. Two cameras also give you better distance measurement, so the user’s position can be tracked more accurately and used to help gauge the mood and tone of a communication. Single-lens 3D (depth) sensors such as Canesta’s promise great potential for a low-cost solution, and we expect other 3D sensors to appear on the market very soon.
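To put a number on that “better distance capability”: with two calibrated cameras, depth falls out of the disparity between the two views via the classic pinhole-stereo relation Z = fB/d. A minimal sketch follows; the focal length and baseline are illustrative values, not the specs of any particular camera.

```python
def depth_from_disparity(focal_px: float, baseline_m: float,
                         disparity_px: float) -> float:
    """Classic pinhole stereo relation: Z = f * B / d.

    focal_px     focal length expressed in pixels
    baseline_m   distance between the two camera centers, in meters
    disparity_px horizontal shift of the same point between the two images
    """
    if disparity_px <= 0:
        raise ValueError("point is unmatched or effectively at infinity")
    return focal_px * baseline_m / disparity_px

# Example: 800 px focal length, 6 cm baseline, 12 px disparity -> 4 m away.
print(depth_from_disparity(800, 0.06, 12))  # 4.0
```

Note the trade-off this formula makes visible: a wider baseline or longer focal length yields finer depth resolution, which is why a user sitting at arm’s length from a display is a favorable case.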
We’re experimenting here at JPR with ambient sound-sensing systems that recognize a limited set of instructions such as “wake-up computer,” “lights out,” and “start coffee pot.” The bits and pieces to set up these types of systems already exist from a hardware point of view, and mostly exist on the software side. There are no universal standards, however, so today each system is custom-built, along the lines of the sketch below. I encourage you to build such systems, experiment with them, and learn from them. Share your knowledge and help us move forward toward more natural communication with our omnipresent companion.
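For anyone who wants to tinker, here is a minimal sketch of such a command listener. It assumes the third-party Python SpeechRecognition package (with PyAudio for microphone access and PocketSphinx for offline decoding); the command phrases and the actions attached to them are placeholders, not JPR’s actual setup.

```python
# pip install SpeechRecognition pocketsphinx pyaudio
import speech_recognition as sr

# The command set from this editorial; the actions are placeholders.
COMMANDS = {
    "wake up computer": lambda: print("waking the computer..."),
    "lights out":       lambda: print("turning the lights off..."),
    "start coffee pot": lambda: print("starting the coffee pot..."),
}

def listen_forever() -> None:
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)  # calibrate to the room
        while True:
            audio = recognizer.listen(source, phrase_time_limit=3)
            try:
                heard = recognizer.recognize_sphinx(audio).lower()
            except sr.UnknownValueError:
                continue  # not intelligible speech; keep listening
            for phrase, action in COMMANDS.items():
                if phrase in heard:
                    action()

if __name__ == "__main__":
    listen_forever()
```

Substituting an online recognizer (e.g., recognize_google) improves accuracy at the cost of an always-on network connection, which is exactly the kind of trade-off these home-grown systems let you explore.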