by Joanne Austin

Joanne Austin is the editor of Alive and Well magazine. She uses an Apple IIe and Apple Writer for her writing at home.

When the time comes for you to communicate with your computer by merely talking to it, you'll have to either learn its language or teach it yours. If you think learning a foreign language is difficult, pity the poor computer!
    In the first place, while the variations in different words can be shown on scientists' detailed charts of frequency and amplitude, the words by themselves say nothing unless they have a syntactical order. Second, while human speech often rambles on, sometimes running words together and coming up with slang like "gonna" for "going to," sound alone is not enough to let the computer know where one word ends and another begins. Third, while we have an uncanny ability to understand what's said to us even in the midst of a rock concert, the computer is easily confused by noise.
    What scientists have done is to digitize human speech, converting it into a number code so the computer is more likely to understand what's being said. Every word that must be recognized by the machine has to be broken down into its numbers, i.e., the numerical values for the frequencies and amplitudes of each sound in the word.
    Recognition of a word by the computer requires five steps: 1) input of the word through a microphone; 2) analysis by the computer of the frequencies of each sound in that word; 3) isolation by the computer of that word's identifying features; 4) conversion of these features to a digital format through an algorithm, or step-by-step mathematical procedure; 5) recognition by the computer of that word, by matching it against a similar digital word, or template, stored in its memory.
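    For readers who want to see the idea in miniature, here is a rough Python sketch of steps 3 through 5. It is an illustration only, not any vendor's method: the function names, the frame size and the sample values are all invented, and it measures only loudness per frame, where a real system would analyze frequencies as well.

```python
import math

def extract_features(samples, frame_size=4):
    """Reduce raw amplitude samples to one average-loudness value per frame.

    Assumption: a toy stand-in for real frequency analysis.
    """
    features = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        features.append(sum(abs(s) for s in frame) / frame_size)
    return features

def distance(a, b):
    """Euclidean distance between two feature lists of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(samples, templates):
    """Return the stored word whose template lies closest to the input."""
    features = extract_features(samples)
    return min(templates, key=lambda word: distance(features, templates[word]))

# Hypothetical templates, built from made-up sample values.
templates = {
    "go": extract_features([0, 9, 9, 0, 0, 1, 1, 0]),
    "stop": extract_features([0, 1, 1, 0, 0, 9, 9, 0]),
}

heard = [0, 8, 9, 1, 0, 1, 2, 0]   # a slightly different rendition of "go"
print(recognize(heard, templates))  # go
```

    The match does not have to be exact. The computer simply picks the template that differs least from what it heard.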

Voice Training
    To understand the spoken word, a computer must be trained. Most voice recognition systems currently in use are speaker-dependent. The computer is taught what words it is supposed to know, and the system learns to recognize the way a certain speaker or group of speakers will say those words. Speaker-dependent machines can understand a fairly large vocabulary, up to about a thousand words.
    Ideally, we would like the computer to recognize a word anytime and by anyone who addresses the machine, i.e., to be speaker-independent. Such systems exist, but they possess a limited vocabulary. They recognize only digits 0 through 9 and simple control words such as "yes," "no," "go," "stop" and "erase." For the computer to understand many different frequencies, operators input hundreds of different voices. With that many templates in memory, how any random person pronounces a common word probably corresponds to one of the templates.
    Training the computer can be tedious. Most systems cannot discern separate words if they're spoken together in a continuous sentence, which is the way we usually talk. Users must enter each word in an isolated, discrete manner, with definite pauses, to allow the computer to figure out the beginning and end of the word. To compensate for slight changes in pronunciation or frequency, each user must enter the same word several times, usually three to five but sometimes as many as nine.
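    One plausible way to combine those repeated entries into a single template is to average them. The Python sketch below assumes each repetition has already been reduced to a list of numbers of the same length. The function name and the values are invented for illustration, and actual products may combine repetitions differently.

```python
def train_template(utterances):
    """Average several feature lists for the same word into one template.

    Assumption: every utterance has the same number of features.
    """
    count = len(utterances)
    length = len(utterances[0])
    return [sum(u[i] for u in utterances) / count for i in range(length)]

# Three hypothetical repetitions of one word by the same speaker.
repetitions = [
    [4.0, 1.0, 3.0],
    [5.0, 2.0, 3.0],
    [6.0, 0.0, 3.0],
]

template = train_template(repetitions)
print(template)  # [5.0, 1.0, 3.0]
```

    Averaging smooths out the small differences between one rendition and the next, which is why more repetitions tend to give a more forgiving template.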
    But perseverance in training reaps rewards. You only have to train the computer once. Additional commands or word changes amend previous instructions. Probability programs allow the computer to make decisions regarding what you said. Researchers have assigned probabilities to all the different ways of garbling a group of words and programmed the computer to choose what was most likely spoken.
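    The probability idea can be sketched in a few lines of Python. The table below is entirely made up: for each phrase a speaker might intend, it lists the chance that the machine hears it in a particular form. The program then picks the intended phrase most likely to have produced what was heard.

```python
def most_likely(heard, confusion):
    """Pick the intended phrase with the highest chance of producing `heard`."""
    scores = {
        intended: outcomes.get(heard, 0.0)
        for intended, outcomes in confusion.items()
    }
    return max(scores, key=scores.get)

# Hypothetical probabilities: P(what is heard | what was intended).
confusion = {
    "going to": {"going to": 0.6, "gonna": 0.4},
    "gone to": {"gone to": 0.9, "gonna": 0.1},
}

print(most_likely("gonna", confusion))  # going to
```

    A fuller version would also weigh how often each phrase is used in the first place. This sketch leaves that out to keep the example short.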
    For industrial and commercial users, voice recognition, or voice data entry (VDE), has many applications in research and manufacturing situations: product testing, inspection, inventory control, laboratory work, machine programming and quality control. Voice data entry is also effective if the operator needs mobility, is in a crowded control room, is competing for use of a single keyboard among several operators or is working from a remote location.
    If a business determines that voice data entry would benefit productivity, the next step is designing vocabulary requirements that suit the tasks at hand. Computers recognize short, multisyllable words and those with strong consonant sounds better than they do single-syllable, vowel-sound words. The vocabulary should be easily memorizable by the operator.
    Syntactic control, a recent development that significantly reduces substitution errors, limits the total vocabulary active at any one time. This enables the computer to make its template selection from only part of the total word list. Such advances make speaker-dependent systems capable of 99 percent accuracy.
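    A small Python sketch shows why narrowing the active vocabulary helps. Everything here is hypothetical: the word lists, the state names and the numeric templates are invented. The point is only that a word outside the active list cannot be chosen by mistake, however close its template happens to be.

```python
import math

# Hypothetical syntax table: which words may be spoken in each state.
ACTIVE_WORDS = {
    "start": ["inspect", "inventory"],
    "inspect": ["pass", "fail"],
    "inventory": ["add", "remove"],
}

# Hypothetical templates; "pass" and "add" are deliberately similar.
templates = {
    "inspect": [9.0, 9.0],
    "inventory": [0.0, 0.0],
    "pass": [1.0, 5.0],
    "fail": [5.0, 1.0],
    "add": [1.2, 4.9],
    "remove": [7.0, 3.0],
}

def recognize_with_syntax(features, templates, state):
    """Match only against the words that are active in the current state."""
    active = ACTIVE_WORDS[state]
    return min(active, key=lambda word: math.dist(features, templates[word]))

features = [1.15, 4.9]  # the operator said "pass", a little off-template

# Searching the whole vocabulary picks the wrong word...
print(min(templates, key=lambda w: math.dist(features, templates[w])))  # add
# ...but during an inspection only "pass" and "fail" are candidates.
print(recognize_with_syntax(features, templates, "inspect"))  # pass
```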

For Your Ears Only?
Though not yet perfect, voice recognition systems are destined to play a major role in society. At the super-secret National Security Agency at Fort George Meade in Maryland, sensitive computers regularly monitor the overseas telephone calls of corporations and individuals. NSA's system seeks target words, such as "high technology" and "Russia," in order to prevent the Soviet Union from stealing computer or defense secrets.
    Other users of voice data entry include handicapped people who communicate with the system by voice alone through a telecommunications link-up. And speaker verification by voice recognition allows industrial security personnel to prohibit unauthorized entry of buildings or computer systems.
    Ideally, all communication with personal computers will someday be by voice alone. This means that keyboards will be obsolete. Computerized typewriters will process letters in the time it takes us to dictate, and anybody, not just a preentered set of speaker-users, will be able to use that typewriter. Domestically, both IBM and Xerox are working on a "talkwriter" that would recognize vocabularies of between five thousand and ten thousand words spoken by any user.
    The Japanese, of course, are not far behind. NEC Corporation has developed a speaker-dependent device that recognizes connected speech up to four seconds long pulled from a 150-word English vocabulary. But the capabilities of "talkwriters" developed for Japanese speakers surpass those of English-language machines. Japanese uses only 120 syllables, while English uses about 10,000. And Japanese is spoken more regularly, with more discernible breaks between words, which makes training easier.
    The cost of voice recognition systems affects their deployment. Small systems can cost as little as $500; very complicated setups, as high as $65,000. For the personal computer, compatible hardware peripherals are selling for anywhere from $1,000 to $3,000.

Talk to Me
Apple, Commodore, IBM, Atari, Coleco and Tandy/Radio Shack do not manufacture their own voice recognition systems, though some have speech synthesizers. So far nothing has equaled Texas Instruments' Speech Command System. For use with TI's Professional Computer, the Speech Command System is an internal voice recognition/telephone management board, not a peripheral. The system accepts up to 950 vocal commands, each the equivalent of up to 40 keystrokes. So much stored information requires a 10-megabyte hard disk rather than a floppy, so the Professional Computer has hard-disk capability built in. Two floppies will work, but will allow only fifty vocal commands.
    The TI Speech Command System is speaker-dependent. Users enter each command up to nine times to accommodate voice fluctuations. But unlike some other systems, this one has what TI calls a "transparent keyboard." Once the voice commands are entered, the operator programs the computer to convert the spoken words to keystrokes. The voice command "indent," for example, may correspond to a complicated series of keystrokes like INDENT PARAGRAPH 5 SPACES comprising up to 40 strokes.
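    The "transparent keyboard" idea amounts to a lookup table from spoken commands to stored keystrokes. The Python sketch below is not TI's implementation. The class and method names are invented, and only the two limits (950 commands, 40 keystrokes each) come from the figures quoted above.

```python
MAX_COMMANDS = 950    # limit quoted for the hard-disk configuration
MAX_KEYSTROKES = 40   # longest keystroke sequence per command

class CommandTable:
    """Hypothetical table mapping a recognized word to stored keystrokes."""

    def __init__(self):
        self._macros = {}

    def define(self, spoken, keystrokes):
        """Store the keystrokes that a spoken command should stand for."""
        if len(keystrokes) > MAX_KEYSTROKES:
            raise ValueError("a command may hold at most 40 keystrokes")
        if spoken not in self._macros and len(self._macros) >= MAX_COMMANDS:
            raise ValueError("the command table is full")
        self._macros[spoken] = keystrokes

    def expand(self, spoken):
        """Return the keystrokes for a command the recognizer has matched."""
        return self._macros[spoken]

table = CommandTable()
table.define("indent", "INDENT PARAGRAPH 5 SPACES")
print(table.expand("indent"))  # INDENT PARAGRAPH 5 SPACES
```

    Because the program receiving the keystrokes cannot tell whether they were typed or spoken, existing software would need no changes to work with such a table.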
    With a modem, the Speech Command System practically becomes a personal secretary. The system can make calls, using your own voice through a speaker, and record the answers in the system. Others can call you, get your voice on the phone and leave a message. With a Dow Jones Natural Link, you can merely ask what a stock is doing and the computer will tell you. A tickler calendar allows you to communicate with people or companies on a preset schedule. Since the system uses your voice, it sounds natural and not robot-like.
    The entire setup consists of two piggyback circuit boards for telephone and speech, a headset with microphone, a separate speaker, telephone cable, installation and diagnostics guide, a diagnostics diskette, software diskette and user's manual. Total price: about $2,700.
    Optimists predict a $2 billion voice recognition industry by 1990. If they're right, the day may come when computers will handle all our business among themselves with just a few words from us.

About Speech Synthesis
