Talk is Getting Cheaper

John Anderson

Giving your computer the power of speech is no mere frill or gimmick. The potential of such capability, for the handicapped as well as microcomputer users at large, is dramatic.

For as long as microcomputers have been around, the cost of such potential has remained a prohibitive factor. But that is changing fast.

Following is a look at three speech synthesis packages for the Atari computer. These packages represent the range of possible configurations: the first is an independently powered piece of hardware, which can hook up to any microcomputer using a serial or parallel port; the second consists of an Atari specific external module, driven by software; the third works entirely in software, using the synthesizer chip already in the Atari.

The Echo GP

I have had an opportunity to experiment with the Echo Speech Synthesizer, from the Street Electronics Corporation, for quite a while now. It is a sophisticated unit, while at the same time fun to use.

It is based on the Texas Instruments TMS 5200 speech processor chip. This is in contrast with its nearest competitor, the Votrax Type 'n Talk, which uses the Votrax chip.

The unit makes use of its own 6502 microprocessor, and interfaces as if it were a printer. It is available in RS-232 serial or Centronics parallel versions. This means that the 850 interface is needed to drive the Echo from an Atari computer. We received the serial version, and controlled it through the 850 using Atari Basic.

Upon power-up, the Echo unit responds with the phrase "Echo ready," to let you know all is well. One of the first points the user will notice is that the Echo is capable of intoning a sentence. Rather than speaking in monotone, the pitch of the voice is dynamic. This makes for a more intelligible and less grating speech quality.

You can use the internal speaker of the unit or route the sound to an external speaker. I found it convenient (as did those around me) to use an earphone when involved in speech editing sessions.

Textalker

Textalker is the ROM based program Echo uses to convert English into speech. Echo can translate English text into phonemes directly, with an impressively low error rate. It can be disorienting, but even when Echo mispronounces a word or syllable, the listener can usually make sense of the sentence from its context.

    Do Re Mi Fa So La Ti Do

Octave 1 - 12 15 18 20 23 26 29 31

Octave 2 - 31 34 37 39 44 48 51 53

Octave 3 - 53 56 58 61 63      

Figure 1. A rough pitch table to give the synthesizer a singing voice. Flats and sharps can also be supported, but 1 have not taken the time to locate them.


10 REM ECHO SINGS ITS HEART	OUT
20 REM ASSUMES SERIAL PORT IS OPEN AND CONFIGURED
30 DIM I$(100)
40 READ I$
50 IF I$="STOP" THEN STOP
60 PRINT #1,I$
90 GOTO 40
1000 DATA >12F
1010 DATA SOME
1020 DATA >31F
1030 DATA WHERE
1040 DATA >29F
1050 DATA OAV
1060 DATA >23F
1070 DATA ER
1080 DATA >26F
1090 DATA THE
1100 DATA >29F
1110 DATA RAIN
1120 DATA >31F
1130 DATA BOW
1140 DATA >12F
1150 DATA SKIES
1160 DATA >26F
1170 DATA ARE
1180 DATA >23F
1190 DATA BLUE
1200 DATA STOP

Figure 2. With a singing synthesizer your micro won't be in
Kansas anymore. The character ">" is what control-e looks
like on the screen.

This is not to say that Echo has the diction of Henry Higgins. In fact, it takes a bit of time to become accustomed to the unique "accent" of the unit. As is the case with some foreign speakers, accustomed listeners will typically understand words that first-time listeners will miss. Echo has trouble with the "g" sound in words like "go," and "l" sounds give it problems as well.

In this respect, the monotone of the Type 'n Talk wins out. (A thorough review of the Votrax unit appears in the September 1981 issue of Creative Computing.) Though it also has its share of vocal peculiarities, it does, on the whole, enunciate more clearly than the Echo. And yet, for extended periods, I would much rather listen to the Echo. The monotone of the Votrax unit gets me down after a while--too "computerish." It was an unfortunate design decision. The Votrax chip itself, as we shall soon see, does allow for software pitch control which results in much more natural sounding speech.

The features of Echo are accessed through control characters. For instance, pressing CONTROL-E will enable the Textalker command set. Following this character with a number from 1 to 63 will determine pitch, which can be toggled from f (for flat, meaning unintoned), to p (for pitched, meaning intoned). In what I think is a first for microcomputers, I found that the Echo could be programmed to "sing" through careful use of these commands. In fact, the unit provides for about three octaves. Not a bad range! A pitch table and sample program appear below. (error in book!) be controlled by text punctuation. A comma will create a pause, a period will cause a drop in pitch at the end of a sentence, and a question mark will result in a rise in pitch.

Textalker can also be commanded to pronounce each punctuation mark it encounters. Similarly, the user may choose to have all upper case letters pronounced as letters (use this mode to get IBM to sound right), or to have all words spelled out letter by letter.

The rate of speech may also be compressed resulting in twice the text in the same amount of time. Remarkably, this function sometimes increases rather than decreases the intelligibility of certain sentences.

According to the documentation, the Textalker component of the Echo Speech Synthesizer "contains close to 400 rules which allow it to correctly pronounce over 96% of the thousand most commonly used words in English."

I was pleasantly surprised at how well Echo did with unaltered text. Having worked with phonemically-based synthesizers in college, I realized this was quite a feat. Of course there are some words Echo has trouble with. Fortunately, an appendix, which outlines the kinds of fixes to apply to these words, is provided. They are as simple as the addition of a space, such as "cre ate" for the word "create," or the spelling of the word "question" as "kwestchun."

Phoneme Generator

In addition to the Textalker module, speech can be programmed at the phonemic level, using the Speakeasy Phoneme Generator, also resident in firmware. This mode is selectable by the character CONTROL-V, and provides for much more detailed control. Stress, pause, pitch, volume, and rate controls can be embedded directly into the text strings.

This approach requires the use of a phoneme code, detailed in the documentation. It bears little resemblance to any phonetic alphabet I have come into contact with, but the 48 sounds it provides are more than enough to do the job.

Unfortunately, the effort it takes to achieve satisfactory results using this approach is somewhat unreasonable, especially in contrast to the serviceable job Textalker does. However budding linguists should take note. The phonemic approach offers great experimentation potential. I did manage to get the Echo speaking a little German.

The Echo Speech Synthesizer lists for $300, which is admittedly a bit stiff. Still, it is comparable to the price of the Type 'n Talk. And if you want your micro to sing Thomas Dolby tunes, the Echo is the only choice.

Hooking Up

In March of this year Creative ran a review of the Echo Speech Synthesizer board for the Apple II, At that time, Textalker and Speakeasy were in the development stage. The Speech Synthesizer offers much greater flexibility and power, as well as the capability for connection to any personal computer.

Male DB-9 -
(to serial -
port #1)
 
Male DB-25
(to Echo GP)
 
 
Pin 1
Pin 2
Pin 3
Pin 4
Pin 5
Pin 6
Pin 7
Pin 8
Pin 9
No Connection
No Connection
Connects to pin 3
Connects to pin 2
Connects to pin 7
Connects to pin 20
Connects to pin 5
Connects to pin 4
No Connection

Figure 3. Wiring a cable for connection to the Atari 850.

However this does not automatically imply easy connection. Even with our experienced people here at the magazine, it took us a while to make the Echo conversant with the Atari.

The documentation that arrived with our Echo was preliminary. All the information we needed was there; I do hope that the final documentation will be an improvement, though.

The real fault lies with the 850 interface module documentation: it provides beginners with quite a run for their money. Here is a way to succeed.

The first thing to do is wire an interface cable, by connecting a DB-9 male to DB-25 male connector. The pinouts given in Figure 3 work with serial port number 1 on the 850.

Next you need to configure the Echo and port number one so that communication may be established. I used a data transfer rate of 1200 baud. This entails setting the DIP switches on the bottom of the Echo so that positions 1 and 2 are on, while position 3 remains off. Position 4 also remains in the off position to enable "handshaking," as we say in the trade.

The serial port is configured through software. Figure 4 shows an example of this configuration, as well as a short program allowing for straightforward experimentation with the unit.

Make sure the 850 device handler is booted whenever using the serial port. This occurs as an autorun.sys file on the Atari DOS disk. Make sure it is resident on any program disk for use with the unit. Power up the 850, then boot a disk with the handler file. You will then be set to go.

For more information concerning the Echo, contact Street Electronics, 1140 Mark Ave., Carpinteria, CA 93013.

10 OPEN #1,12,0,"R:1"
20 XIO 36,#1,10,6,"R1:"
30 DIM I$(100)
40 I$=">15P HI THERE. THIS IS ECHO G P.,, READY WHEN YOU ARE.,, OVER."
60 PRINT #1,I$
70 INPUT I$
80 PRINT #1,I$
90 GOTO 70

Figure 4. It is this simple to configure serial port number one and input text for synthesis. Again the ">" character signifies control-e. Don't forget to boot the device handler prior to running the program.

The Alien Group Voice Box

The Echo has everything it needs to effect speech synthesis onboard. Like a printer, it awaits a stream of characters; it would just as soon pronounce text files from bulletin board services, Compuserve, or the Source. The Atari, thus, is free to do whatever processing you have in mind, while the Echo works independently.

This is a fine capability, but also an added expense. The Voice Box from Alien Group takes some of the internal, ROM based capabilities of the Echo, and efficiently uses Atari RAM for their storage. The Voice Box uses a Votrax SC-01 chip, and connects directly to the Atari input/output jacks. It will necessarily be the final connection in the I/O daisy chain, as it offers no jack of its own.

The external module is no bigger than a transistor radio, and draws power directly from the Atari. It lists for $170, including driver software, which is available in cassette or disk versions.

The Voice Box is manipulated from Atari Basic, and does not offer an RS-232 handler program. Using patches from Basic, however, it can be controlled from a machine language program.

Your machine must have at least 16K to run the Voice Box. If you have 32K or more, you can run two additional programs included with the package: the Random Sentence Generator and the Talking Face. More about these later.

When the driver program is run, the box responds with the phrase "Please teach me to speak," or if a dictionary is loaded, the words "Yes, Mahster," to let you know everything is working.

While calling on its own phonetic input code, as does the Echo, the system also uses a unique approach to convert character strings into speech sounds. English text and phonetic code may be freely intermixed, rather than requiring separate modes, as is without exception the case with every other text-to-speech system I have seen.

Dictionaries

The key to working with the Voice Box is the creation of your own dictionaries.

These are the "word equations" specified to translate words into phonemes. For example, by typing "spek=speak," you will ensure that each time the word "speak is encountered, it will be pronounced correctly. Dictionaries are saved and re-called, as independent files, to cassette or disk. In addition to those you create, three pre-written dictionaries are supplied with the driver software.

Dictionaries eat up computer memory quite quickly--each word equation takes up ten bytes. In order to store phonemes more efficiently, word fragments can be stored. You can define fragments to be recognized only at the beginning or the end of a word, or at every occurrence.

Because dictionary size is limited, the dictionary approach itself is necessarily limited. Even with 48K, no dictionary is going to produce impressively accurate text-to-speech capability, in this respect, the Echo has a much more sophisticated algorithm. This is the main trade-off between the two systems.

In fact, if you have more than 32K, you must change the dimensions of a string statement in the Voice Box driver program in order to store larger dictionaries. The documentation clearly states how to do this.

Other Features

Similar to the Type 'n Talk, the Voice Box sports a potentiometer knob on the front of the case, that can be used to vary the speed and pitch of the speech. The Voice Box unit allows for pitch control through software, too. Control is restricted to four registers, utilizing the slash and the backslash characters to move between them. This negates the musical capabilities of the unit, but is a step ahead of the monotone of the Type 'n Talk.

Because so much of the Voice Box is RAM resident, you must decide how much of the memory of the Atari to allot to dictionary space, in addition to your own Basic programs, and the Voice Box driver. The disk version includes a pared-down driver program for incorporation into other programs. The documentation also gives hints for memory conservation.

In the 32K version, several other features appear. The first is the Random Sentence Generator. The Voice Box will compose random but grammatically correct sentences from its stored word lists. These can be modified with word lists of your own creation. I obtained some rather strange results in my attempts at this. While many were semantically bizarre, I must, admit the sentences were grammatically unassailable. Be prepared for a few shocks when you try this.

There is also a mode called The Talking Face. This displays an animated face, with impressive lip synch simulated as words are articulated by the Box. I am sure this feature would be a big hit with the kids.

The documentation accompanying the system is a bit uneven in places, but manages to cover all the features of the Voice Box in a scant nine pages. The phoneme list is quite complete. The documentation also goes as far as to suggest to assembly language programmers a means of updating data to the box while running machine language animation routines.

While the Voice Box is not really in the same league as the Echo, it offers many of the same features for much less money. For more information contact the Alien Group, 27 West 23rd St., New York, NY 10010.

The Software Automatic Mouth

In the September 1982 issue of Creative, I mentioned that the Atari was capable of speech synthesis using only its internal hardware. The game Tumblebugs taught the Atari its first words: "We gotcha!" This came as a happy revelation to many.

Well with Software Automatic Mouth, SAM for short, Mark Barton has brought this possibility to fruition. He has created a disk-based, unlimited speech synthesis program, requiring no external hardware. And the speech quality of SAM competes favorably with the best systems available for microcomputers.

SAM uses the Atari sound chip, Pokey, to generate speech. Even with my unbridled faith in the capabilities of the Atari, I was quite surprised at how well it does the job. Pokey is at least as intelligible as its two competitors, the TI and Votrax chips.

SAM is the only package around that dares to include lengthy prepared speech demonstration programs to show off its articulative powers. My colleagues agreed that no break-in period was necessary in order to understand SAM.

The documentation supplied is equally impressive. It not only makes operation of the program very simple, but provides background information concerning linguistics and speech synthesis. It helps to make the program into an excellent tutorial on the subject.

I did encounter one snag, if only in my eagerness to get rolling with the package. You must copy all the Basic programs from the master disk to a new diskette. The autoboot assembly language program that constitutes SAM runs from the master, but support programs must be loaded from the new disk. The reason is that the support programs require a mem.sav file. The write-protected master disk will, of course, return an error if a mem.sav attempts to write to it. The documentation clearly states that you must use an un-write-protected new disk with a mem.sav file on it. In my excitement to get going, I did not heed these instructions, and ended up wasting some time.

Figure 5.
Talk_Is_Cheaper1.jpg

Figure 6.

0 GRAPHICS 0
10 REM ---DEMO-
20 DIM SAM$(255):SAM=3192
25 X=0
30 SETCOLOR 2,0,0:SETCOLOR 1,0,0:SETCOLOR 4,0,0:SETCOLOR 3,0,0
40 SPEED=8208:PITCH=8209
45 X=X+5:IF X>45 THEN X=0
50 POKE SPEED,X:POICE PITCH,100
60 SAM$="ULEHKTRAA4NIXK /HULUW4SIXNEY5SHUNS."
70 A=USR(SAM)
80 GOTO 45

Support programs included with the package are: Reciter, which is an English text-to-speech translation program; Sayit, the short Basic program which makes experimentation simple; Demo and speeches, two files that impressively demonstrate the powers of SAM; and Guessnum, a spoken version of a number-guessing game.

An RS-232 handler program is also provided, allowing SAM to act as Echo does to read telecommunications text.

It is extremely simple to work with SAM from Basic. All that is needed is to define SAM$ as it appears in Basic, and then invoke either SAM or Reciter through a USR call. You can also effect machine language patches from Basic.

Speech Quality

The really remarkable thing about SAM is its (his?) intonation--SAM can be extremely expressive. Control of stress placement is easy. The phonetic code is a bit strange, but very nicely laid out in the documentation (see Figure 5). A reference card is also provided.

Similar to the Echo, punctuation is "understood." A hyphen is read as a short pause, and is handy for delineating clause boundaries. A comma inserts a pause equivalent to two hyphens. A question mark also inserts a pause, as well as making the pitch rise at the end of a sentence. Likewise a period makes the pitch fall.

SAM is capable of speaking only 2.5 seconds without a break. If a string exceeds that length, a short break will automatically be inserted. If you don't like the placement of automatic breaks, you can stipulate their positions with hyphens. The breaks are so short as to be hardly noticeable, and cause few problems.

SAM can be controlled more creatively and flexibly than Echo or Voice Box. The pitch and speed of SAM speech can be altered through with POKE statements. I got some wild results playing with these. A sample program, Figure 6, shows how speed effects can be achieved.

The timbre of speech can be varied to make SAM sound quite human--or like a droid from Star Wars.

An 18-page English-to-phonetic code dictionary appears in the documentation to help in speech programming. In addition, SAM flags phoneme input errors. When a bad phoneme occurs in the immediate execution mode, an error is flagged in the same way as syntax errors in Basic. By PEEKing decimal address 8211, you can trace these problems when they occur in the deferred mode.

At the incredible price of $60, there must be a catch, right? Well there is, sort of. Because SAM uses the Atari to do all its work, DMA is shut down during articulation. This means the screen goes blank during speech--no animation, no text, nothing. The documentation tells you how to reenable DMA during speech, but warns that this distorts SAM's speech rather badly. However, this blanking takes place only during articulation. As soon as a string is finished, DMA returns and all is normal.

I cannot overstate how impressed I am with the Software Automatic Mouth. It is a remarkable feat of software savvy, and probably one of the best buys available for the Atari computer. Its higher-priced competitors have their advantages, but would do well to strive for the same strong documentation this package has. If you wish to give your Atari the power of speech, have a disk drive, and are on a limited budget, look at this program. For more information, contact Don't Ask Software, 2265 Westwood Blvd. Suite B-150, Los Angeles, CA 90064.



SAM Speaks Apple II

The Apple II has no special advantage over the Atari when it comes to speech synthesis. The Echo, Votrax, and many other voice systems work equally well for both computers.

The history of software-only synthesizers for the Apple dates back to 1979 when Softape published a program called Apple Talker. That program has been discontinued, but Muse publishes The Voice, an inexpensive program that serves the same purpose. Sirius Software, the renowned game publisher, produces Audex, a general purpose audio program that can be used to approximate speech. For the most part, these programs deliver results that are interesting, but only sporadically intelligible.

Hardware voice products for the Apple also abound. Voice input can be recognized by peripherals from Scott Instruments, among others. Mountain Computer carries a remarkable input-output device that turns an Apple into a digital audio recorder.

At $130 for the Apple version, SAM is the first product to combine unlimited vocabulary, impeccable intelligibility, and reasonable price.

The SAM package includes a little bit of hardware and a little bit of software. The hardware is a board containing a digital-to-analog converter, a tiny amplifier, and an even tinier volume control. The software includes all the programs described in the main part of this article.

SAM sends output to an 8-ohm speaker. You can use the speaker inside your Apple or, for better results, attach a slightly larger one. Installing SAM is no more complicated than hooking an Apple to a TV set.

SAM uses the simplest possible interface to a sound system. In exchange for the simplicity of the hardware, the developers had to write large and complex programs. The program that produces speech based on phonetic codes occupies 9K of RAM. Another program that translates English text into phonetic codes requires an additional 6K. These programs live in an area usually reserved for Applesoft string variables. The English translator also overlaps the memory associated with the second graphics image (Hi-Res page 2) of the Apple.

Because of these requirements, SAM can not cooperate with most other programs. You can not add speech capability to your word processor or terminal program, for example. Pascal, Logo, Graforth, and most other languages can not use SAM.--MC



John Anderson is the associate editor of Creative Computing magazine.

Table of Contents
Previous Section: String Arrays
Next Section: Axlon RAMDisk