Speech processor - Patent 0051462

(11)	EP 0 051 462 A2

(12)	EUROPEAN PATENT APPLICATION

(43)	Date of publication:
	12.05.1982 Bulletin 1982/19

(21)	Application number: 81305149.7

(22)	Date of filing: 29.10.1981

(51)	International Patent Classification (IPC)³: G10L 1/08

(84)	Designated Contracting States:
	DE FR GB

(30)	Priority: 03.11.1980 US 203042

(71)	Applicant: GENERAL INSTRUMENT CORPORATION
	Hicksville New York 11802 (US)

(72)	Inventor:
	McLaughlin, Philip T. Farmingdale New York 11735 (US)

(74)	Representative: Warren, Francis Charles et al
	Baron & Warren 18 South End Kensington London W8 5BU (GB)

(54)	Speech processor

(57) A system and apparatus for generating human speech or other complex sounds from a digital command signal and using a hybrid synthesis technique comprises a digital, fixed repertoire speech processor (10) which includes a first circuit (12) for modeling the behaviour of the human vocal tract and a second circuit (11) for controlling the behaviour of the modeling circuit (12).
The processor may advantageously be implemented on a single VLSI integrated circuit chip.

Description

[0001] Broadly speaking, this invention relates to the synthesis of human speech. More particularly, in a preferred embodiment, this invention relates to methods and apparatus for synthesizing human speech using a hybrid synthesis technique.

[0002] In recent years considerable interest has been expressed in the synthesis of human speech, and other sounds. Heretofore, this has required a large computer and expensive speech processors connected to, and driven by, the computer.

[0003] More recently, advances in large scale integration have permitted the speech synthesis circuitry to be reduced in scale so that it can be accommodated on a few VLSI integrated circuit chips, or in the case of the instant invention, on a single VLSI chip.

[0004] As is well known, there are four basic techniques for synthesizing human speech. These are (1) phoneme synthesis, (2) formant synthesis, (3) linear predictive coding, (LPC), and (4) wave-form digitization with compression (WD).

[0005] Three of the goals of any speech synthesis are (1) understandability, (2) quality, and (3) price or cost, which may be defined as the bit requirement for each second of speech produced.

[0006] The four basic techniques for synthesizing human speech can be compared against these three goals. In phoneme synthesis the output voice is clearly understandable. The quality of the voice is robotic. There is one voice and it is not identifiable as other than that of an artificial source. The bit requirement is fairly low, at 120 bits for each second of speech. In formant synthesis the understandability is good. The quality is better than in phoneme synthesis and it is capable of producing voices which are distinguishable as male or female. The bit requirement is 400-800 bits per second. In linear predictive coding the understandability is the same as in formant synthesis. The quality can be very high and an individual person's voice can be easily recognized, but this requires more bits for more quality. Typically between 1,200 and 3,000 bits are required for each second of speech. Wave-form digitization with compression spans a very broad range on all three goals. The understandability can range from very good to very poor. The quality extends over the same broad range. This reflects the cost, or number of bits required, which varies from approximately 1,000 to 5,000 bits for each second of speech, the best quality and understandability being obtained with more bits.

[0007] The present invention provides an inexpensive and very flexible speech synthesizer that is capable of providing high understandability, a range of qualities from acceptable to highest quality, and a flexible bit rate which is adjustable in the chip from 500 bits to 3,000 bits for each second of speech.

[0008] In one embodiment the entire synthesizer can be constructed on a single chip. Prior circuits required multiple chips. This has an important result in cost, as the single chip synthesizer significantly reduces the cost over multiple chip synthesizers.

[0009] An advantage of the present invention is that it uses formant synthesis for voiced (vowel) sounds and LPC for unvoiced sounds. Formant and LPC coding can be used in the same word. This reduces the size of the memory and the number of bits needed to produce the same sound.

[0010] A further advantage of the present invention is that more memories can be dedicated to a particular sound or group of sounds, thereby permitting the ability to increase the quality or to adjust the quality of the sound of the synthesizer.

[0011] The present invention is a technique for synthesizing human speech that is relatively inexpensive, overcomes the deficiencies of the prior art, and is suitable for fabrication on a single VLSI chip.

SUMMARY OF THE INVENTION

[0012] As a solution to the above and other problems, the instant invention comprises a digital, fixed repertoire, processor for generating human speech in response to a sequence of n-bit, digital, command words input thereto. The processor comprises a means for electronically modelling the behavior of the human vocal tract and means, connected to the source of the incoming digital command words, for controlling the operation of the modelling means thereby to control the speech generated by an analog signal generating means associated with the vocal tract modelling means.

[0013] The invention and its mode of operation will now be described in detail, with particular reference to the appended drawings, in which:

BRIEF DESCRIPTION OF THE DRAWINGS

[0014]

Fig. 1 is a block schematic diagram of an illustrative speech processor according to the invention;

Fig. 2 is a block schematic diagram of an illustrative digital filter stage for use in the speech processor shown in Fig. 1; and

Fig. 3 is a block schematic diagram illustrating how a plurality of filter stages according to Fig. 2 may be connected in tandem.

DETAILED DESCRIPTION OF THE INVENTION

[0015] The speech processor disclosed and claimed herein is intended for product applications where the generation of synthetic speech or complex sounds is required. In a preferred embodiment, the speech processor is realized as an N-channel, metal-gate LSI device. One skilled in the art will realize, however, that other implementations are possible.

[0016] The speech processor according to the invention is a fixed repertoire speech and sound synthesizer, which, in the preferred embodiment, is capable of reproducing up to 256 discrete sound sequences. Each sequence may be called by loading the 8-bit address of the sequence into a command register in the speech processor. The sound sequence data is stored in a mask-programmable read-only memory (ROM), which arrangement enables the user to readily specify the speech or sound pattern desired. By use of suitable interfaces, additional ROMs may be added, which would provide an essentially unlimited number of words.

[0017] As will be explained, the internal organization of the processor enables a large quantity of speech or sound to be specified in 16K bits of read-only memory. In addition, the flexible architecture of an on-board controller associated with the processor allows the user to partition the available storage space into as many sequences as desired, up to a maximum of 256 sequences.
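As a rough illustration of what 16K bits can hold, the following sketch divides the ROM capacity by the bit rates quoted earlier (500 to 3,000 bits for each second of speech). It assumes that "16K bits" means 16,384 bits and ignores any space taken by instruction op codes; the function name is hypothetical.

```python
ROM_BITS = 16 * 1024  # assumption: "16K bits" read as 16,384 bits


def seconds_of_speech(rom_bits: int, bits_per_second: int) -> float:
    """Seconds of speech that fit in rom_bits at a given bit rate."""
    if bits_per_second <= 0:
        raise ValueError("bits_per_second must be positive")
    return rom_bits / bits_per_second
```

Under these assumptions the ROM holds roughly 33 seconds of speech at 500 bits per second and about 5.5 seconds at 3,000 bits per second, which is the quality versus quantity trade-off the description refers to.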

[0018] From a functional standpoint, and referring to Fig. 1, processor 10 can be divided into two major sections, a controller 11, and a vocal tract model (VTM) 12. VTM 12 is a parametric sound and voice synthesizer which produces complex waveforms under the control of 15 slowly time varying parameters. As will also be explained, the controller 11 executes internal instructions, which are stored in ROM, and modifies the appropriate parameters of VTM 12 to create the desired sound sequence.

[0019] In the preferred embodiment, the interface between controller 11 and VTM 12 is accomplished through a plurality of parameter registers and related timing signals.

[0020] Referring again to Fig. 1, we shall now present an overview of the processor's architecture. Let us first consider the VTM source timing. Briefly, the VTM operates under control of 15 parameter registers, the size and function of which are listed in Table A. The duration and pitch of the sounds produced by the processor are controlled by the R and P registers, respectively. The contents of the P register, in particular, specify the number of sample periods in each pitch period.

[0021] Expressed in terms of the pitch (Fo) and the sampling frequency (Fs):

P = Fs / Fo

[0022] The pitch source injects unit impulses, spaced P sample periods apart. The contents of the R register (repeat count) represents the number of pitch cycles which will be executed before a register update occurs.

[0023] When P is assigned a zero value, the pitch source is replaced by a zero-mean, pseudo-random noise source. This mode of operation is referred to as the unvoiced mode. In this mode, the processor requests a register update after 64 x R samples. The amplitude of the source is controlled by the A register. It is coded as 5 bits of mantissa and 3 bits of exponent (i.e. binary shift).
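The source timing described above can be sketched as follows. Since P counts sample periods per pitch period, P is the ratio Fs / Fo; a segment lasts R pitch cycles in voiced mode and 64 x R samples in unvoiced mode. The function names are hypothetical, and the 10 kHz sampling frequency used in the note below is an assumption for illustration only.

```python
def pitch_register(fs_hz: float, f0_hz: float) -> int:
    """Sample periods per pitch period, P = Fs / Fo, rounded to an integer."""
    if fs_hz <= 0 or f0_hz <= 0:
        raise ValueError("frequencies must be positive")
    return round(fs_hz / f0_hz)


def segment_samples(p: int, r: int) -> int:
    """Samples produced before the next register update is requested.

    Voiced (P > 0): R pitch cycles of P samples each.
    Unvoiced (P == 0): 64 * R samples of noise.
    """
    if p < 0 or r < 0:
        raise ValueError("register values must be non-negative")
    return r * p if p > 0 else 64 * r
```

With an assumed Fs of 10 kHz and a pitch of 125 Hz, P would be 80, so R = 4 yields 320 samples of voiced output, while the same R in unvoiced mode yields 256 samples.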

[0024] Let us now consider controller 11 in detail. As shown, controller 11 is a sequential processor which fetches instructions and data from an internal 16K ROM 16, and which is capable of altering the contents of the 15 parameter registers 15 controlling the processor's vocal tract model. The controller has 16 executable instructions, and supports one level of subroutine nesting. In the present embodiment, the instruction set is specifically designed to allow selective updates of the parameter registers to be performed. In addition, the instructions designated JMP and JSR allow chaining of segments, and sharing of code sequences to eliminate redundancy.

[0025] The processor's instruction set comprises two groups of instructions, i.e. register modification instructions and branch control instructions.

[0026] We will now discuss how the processor is interfaced with external devices. First, let us consider program entry control. In speech processor 10, the individual sound sequences stored in internal ROM 16 are accessed by use of an 8-bit data bus 17 and input port 18. The significance of this input byte is a function of the state of an SE (Strobe Enable) conductor 19. When the SE line 19 is high and input port 18 has been loaded from the external system via bus 17, the contents of the 8-bit input port are loaded into an 8-bit input buffer 21. This allows any one of 256 entry points to be specified. The entry points are spaced at 2, 4 or 8 byte increments (e.g. for 8 bytes: 0, 8, 16, 24 ... 2040) throughout ROM 16. This addressing scheme is most useful in applications where the processor is used as a peripheral for an external microprocessor, or where more than 8 sequences are programmed into the ROM.
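The entry-point arithmetic can be illustrated with a short sketch. It simply multiplies the 8-bit entry number by the spacing; the function name and the way the spacing is passed in are assumptions, since the text does not say how the increment is selected.

```python
ROM_BYTES = 2048  # 16K bits organised as bytes


def entry_address(entry: int, spacing: int = 8) -> int:
    """ROM byte address of an entry point for a given spacing (2, 4 or 8)."""
    if not 0 <= entry <= 0xFF:
        raise ValueError("entry must fit in 8 bits")
    if spacing not in (2, 4, 8):
        raise ValueError("spacing must be 2, 4 or 8 bytes")
    address = entry * spacing
    assert address < ROM_BYTES  # always true for the ranges checked above
    return address
```

With 8-byte spacing the last entry point (255) lands at byte 2040, matching the series 0, 8, 16, 24 ... 2040 given in the text.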

[0027] In applications where 8 or fewer sequences are required and it is desirable to select these sequences without handshaking, the processor can be placed into a mode where each speech or sound sequence can be initiated by pulling down a single conductor, for example, by grounding the SE conductor 19. When the processor is operated in this way, no handshaking is required to select the desired sequence.

[0028] Let us now discuss handshake control. As previously mentioned, input to the processor is accomplished via the 8-bit bus 17, two handshake lines 22 and 23, and the input mode select conductor 19. As also mentioned, the handshake lines are not necessary for some applications. When SE (Strobe Enable) conductor 19 is kept high, the handshake lines 22 and 23 (LRQ and DLD) are used to coordinate the data input. LRQ (Load Request), on output line 22, is low whenever the Input Buffer Flag is set. When LRQ is high, the input port is loaded by placing the 8 bits of data on input bus 17 and pulsing the DLD (Data Load) input 23. The rising edge of DLD will cause LRQ to go low, where it will remain until the internal Input Buffer Flag is reset by an RET instruction.
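The LRQ/DLD handshake can be modelled as a small state machine. This is a behavioural sketch only: the class and method names are hypothetical, and ignoring a DLD pulse that arrives while LRQ is low is an assumption, since the text does not say what the hardware does in that case.

```python
from typing import Optional


class InputPort:
    """Behavioural model of the input buffer and its handshake lines."""

    def __init__(self) -> None:
        self.input_buffer = 0
        self.input_buffer_flag = False

    @property
    def lrq(self) -> bool:
        """LRQ is high (True) only while the Input Buffer Flag is clear."""
        return not self.input_buffer_flag

    def pulse_dld(self, data: int) -> bool:
        """Host side: place a byte on the bus and pulse DLD.

        Returns True if the byte was accepted. A pulse while LRQ is low
        is ignored here (assumption).
        """
        if not 0 <= data <= 0xFF:
            raise ValueError("data must fit in 8 bits")
        if not self.lrq:
            return False
        self.input_buffer = data
        self.input_buffer_flag = True  # LRQ goes low
        return True

    def consume_on_ret(self) -> Optional[int]:
        """Controller side: an RET takes the byte and resets the flag."""
        if not self.input_buffer_flag:
            return None
        self.input_buffer_flag = False  # LRQ goes high again
        return self.input_buffer
```

A host would poll `lrq`, call `pulse_dld` when it is high, and then wait for it to return high before sending the next entry byte.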

[0029] The output of VTM section 12 drives an internal 7-bit pulse-width-modulation, digital-to-analog converter 26. The design of DAC 26 is such that all noise components are at or above 10 kHz. The output is low-pass filtered to 5 kHz and amplified, both of these functions being performed externally.

[0030] In the preferred LSI embodiment, the processor has two power supply leads and a common ground. One supply lead powers the interface logic and provides standby current to the controller and parameter registers.

[0031] The other lead powers VTM 12, controller 11 and internal ROM 16. When standby lead 27 is high (indicating that the processor is inactive), the second power lead can be powered down, externally, to conserve power. This will provide a standby current which is a fraction of the normal operating current.

[0032] When the processor is loaded with an entry byte, the standby lead 27 is brought low, signalling to the external circuitry to power-up the second power lead. The processor will delay execution of the selected sequence to allow the power supply to settle. This is done by an RC time circuit external to the chip but driven by the chip. If it is not desired to implement the standby mode of operation, the power leads can be tied to a common supply.

[0033] The processor requires one 3.12 MHz clock, which is generated by an onboard oscillator 31 with external crystal control. The crystal 32 is connected to oscillator 31 externally to the processor.

[0034] Let us now consider operation of the processor and the filter structure of the VTM in particular. The processor according to the invention models speech (and other sounds) using a series of six 2nd order resonators, excited by either a pseudo-random noise source, or a periodic impulse source.

[0035] VTM 12 is implemented using totally digital techniques. This approach allows one 2nd order section to serve as six sections through the use of multiplexing and information line pipelining. The section that is implemented is the 2nd order infinite impulse response (IIR) digital filter shown in Figure 2. This filter comprises a pair of adder stages 41 and 42 and three multipliers 43, 44 and 46. The filter stage has the transfer function:

[0036] Therefore, it can be shown that the poles of the transfer function occur at:

and when,

and,

the poles will be placed in a complex pair, forming a resonator with the band-width given by:

where Fs is the sampling frequency in Hz, and the center frequency (Fk) given by:

[0037] As can be seen from equations 6 and 7 above, the modification of the B coefficient changes both the frequency and the bandwidth of the resonator. The modification of the F coefficient, however, changes only the center frequency and has no effect on the corresponding bandwidth.
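For reference, the relationships described above can be sketched under one common convention for a two-pole resonator. The transfer function below is an assumption chosen to be consistent with the surrounding text (a multiply-by-two on the F path, and a bandwidth that depends on B alone); the sign conventions of the original equations may differ.

```latex
% Assumed stage transfer function and its poles
\[
H(z) = \frac{1}{1 - 2F z^{-1} + B z^{-2}},
\qquad
z_{1,2} = F \pm \sqrt{F^{2} - B}
\]

% Complex pair when B > 0 and F^2 < B; the pole radius is sqrt(B)
\[
z_{1,2} = F \pm j\sqrt{B - F^{2}},
\qquad
|z_{1,2}| = \sqrt{B}
\]

% Bandwidth follows from the pole radius, center frequency from the pole angle
\[
BW = -\frac{F_s}{\pi}\,\ln\sqrt{B} = -\frac{F_s}{2\pi}\,\ln B,
\qquad
F_k = \frac{F_s}{2\pi}\,\arccos\!\left(\frac{F}{\sqrt{B}}\right)
\]
```

In this form B appears in both expressions while F appears only in the center frequency, which matches the observation that changing F moves the resonance without altering its bandwidth.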

[0038] Since speech signals, and in particular vowel sounds, convey information through the shifting of resonant peaks in the spectrum, it is desirable to be able to change the center frequencies of the 2nd order stages independently of their respective bandwidth settings. In addition, it is important that the parameters of the individual stages, corresponding to particular resonances, can be modified independently. The use of cascaded 2nd order stages supports these features, giving the filter configuration according to the invention a distinct advantage over other filter sections currently in use for speech synthesis, such as the lattice section and the direct form implementation. The instruction set of the processor's controller section is designed to exploit the ability of the VTM parameters to be updated selectively, to achieve a greater packing density in ROM 16. In addition, this permits the user to trade off between the quality and quantity of the speech samples possible within the ROM space available.

[0039] If it is desired to place resonances at a frequency of zero, these real axis poles can be accommodated directly. Each 2nd order stage may be used to place two real axis poles of variable bandwidth. If X₁ is the real axis location of the first pole, and X₂ the second:

and:

with the bandwidths of each given by:

where F_t and B_t represent the coefficients in Figure 2.
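A minimal numerical sketch of the coefficient relationships follows. It assumes the stage has the denominator 1 - 2F z^-1 + B z^-2, which is one common convention and not necessarily the sign convention of the original equations; all function names are hypothetical.

```python
import cmath
import math


def resonator_coeffs(fk_hz: float, bw_hz: float, fs_hz: float):
    """(F, B) for a complex pole pair with center frequency fk and bandwidth bw."""
    radius = math.exp(-math.pi * bw_hz / fs_hz)  # pole radius from bandwidth
    theta = 2.0 * math.pi * fk_hz / fs_hz        # pole angle from center frequency
    return radius * math.cos(theta), radius * radius


def real_pole_coeffs(x1: float, x2: float):
    """(F, B) placing two real-axis poles at x1 and x2."""
    return (x1 + x2) / 2.0, x1 * x2


def poles(f: float, b: float):
    """Roots of z^2 - 2F z + B, i.e. the poles of the assumed stage."""
    root = cmath.sqrt(f * f - b)
    return f + root, f - root
```

Under this convention, real poles at 0.9 and 0.5 give F = 0.7 and B = 0.45, and `poles(0.7, 0.45)` recovers the two locations.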

[0040] Coefficient updates to the filter occur at the beginning of a pitch period. This timing results in the smallest possible disturbance to the output at update. The information line precision is maintained at 16 bits throughout the VTM filter. The multiply-by-2 unit 44 shown in Fig. 2 is advantageously implemented as a 1-bit binary shift circuit 51 following the F_t multiplier. The shift operation is performed separately from the multiplication to scale F_t to the same range of values as B_t. The coefficients F_t and B_t are quantized, nonlinearly, to minimize coefficient sensitivity.

[0041] The two coefficients are processed by the same non-linear transformation hardware in the range.

where C may be either F_t or B_t.

[0042] The non-linear transformation T(X) is implemented with a table-look-up ROM 53.

[0043] The input coefficients of each stage, denoted as F and B, are expressed in sign magnitude form and used to generate the multiplier coefficients as follows:

[0044] Fig. 3 represents vocal tract model 12 and depicts six cascaded filter stages 61-66. For unvoiced sounds (U), the input to the filter comes from a pseudo-random noise source 68 while for voiced sounds a scaled, periodic impulse source 67 is used.
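The cascade of Fig. 3 can be approximated in software as shown in the sketch below. It is a floating-point illustration, not the 16-bit multiplexed hardware: the difference equation y[n] = x[n] + 2F y[n-1] - B y[n-2] is an assumed form of the stage, the uniform noise generator stands in for the chip's pseudo-random source, and all names are hypothetical.

```python
import random


class Resonator:
    """One 2nd order stage with the assumed recurrence
    y[n] = x[n] + 2*F*y[n-1] - B*y[n-2]."""

    def __init__(self, f: float, b: float) -> None:
        self.f, self.b = f, b
        self.y1 = 0.0  # y[n-1]
        self.y2 = 0.0  # y[n-2]

    def step(self, x: float) -> float:
        y = x + 2.0 * self.f * self.y1 - self.b * self.y2
        self.y2, self.y1 = self.y1, y
        return y


def excitation(p: int, amplitude: float, n: int, rng: random.Random):
    """Voiced (P > 0): scaled impulses every P samples.
    Unvoiced (P == 0): zero-mean uniform noise (stand-in source)."""
    if p > 0:
        return [amplitude if i % p == 0 else 0.0 for i in range(n)]
    return [amplitude * rng.uniform(-1.0, 1.0) for _ in range(n)]


def synthesize(stages, source):
    """Pass each source sample through the stages in tandem."""
    out = []
    for x in source:
        for stage in stages:
            x = stage.step(x)
        out.append(x)
    return out
```

A call such as `synthesize([Resonator(f, b) for f, b in coeffs], excitation(80, 1.0, 320, random.Random(0)))` would produce four pitch periods of a voiced sound for six hypothetical coefficient pairs in `coeffs`.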

[0045] We will now consider the programmability of the processor including the register modification instructions.

[0046] The purpose of the processor register modification instructions, of course, is to update the VTM parameters. The R and P registers determine how many sample periods of a particular sound are output by the VTM before control of the parameter registers is returned to the controller. The controller waits until the completion of the last of R pitch periods, or 64 x R samples in unvoiced mode, before executing the next register modification instruction.

[0047] Each of the 13 register modification instructions, with the exception of the RCU instruction, comprises a 4-bit op code followed by 4 bits of data which are loaded into the lower 4 bits of register R. The RCU instruction is a 1-byte instruction which loads the upper 2 bits of the Repeat register, i.e. register R.

[0048] The RCU instruction passes control to the next instruction following execution. It does not cause an immediate transfer of control to the VTM.

[0049]

[0050] In the event that more than one register modification instruction is required to update the parameter registers, but it is not desired to perform a full update (FRL), two or more instructions may be chained together. This feature is also useful for initializing the registers before a JMP, JSR, or return instruction, without causing the VTM to initiate a sound sequence.

[0051] If an instruction is to be chained with one or more other instructions, the lower 4 bits of the instruction byte are set to zero. The last instruction in a series of chained instruction bytes is the only one with non-zero lower 4 bits. The lower 4 bits of the last instruction byte in a chained series are loaded into the lower 4 bits of the R register. When chaining instructions, the data bytes appear after the last instruction. The order of the data bytes is given by Table C, below.

[0052] For example, if it is desired to chain an FFU instruction with an FS1 instruction, the following registers would be updated:

FFU modifies F4, F5, F6

FS1 modifies B1, F1

[0053] Referring to Table C, it will be seen that the proper sequence for the above data is:

B1, F1, F4, F5, F6

[0054] Therefore, referring to the processor Instruction Table B, the sequence would be:

where N₃N₂N₁N₀ represents the lower 4 bits stored in register R, and [ ] represents the binary value to be loaded into the indicated register.
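The chaining rule can be sketched as a small parser that walks instruction bytes until it finds one with non-zero lower 4 bits. The op code is assumed to occupy the upper 4 bits of the byte, and the op code values used in the note below are hypothetical placeholders, not taken from Table B. RCU and the branch control instructions are not handled.

```python
def split_instruction(byte: int):
    """Split an instruction byte into (4-bit op code, lower 4 bits)."""
    return (byte >> 4) & 0xF, byte & 0xF


def read_chain(rom: bytes, pc: int):
    """Collect the op codes of a chained series starting at rom[pc].

    Returns (op_codes, r_low, next_pc), where r_low is the value loaded
    into the lower 4 bits of R and next_pc addresses the first data byte.
    """
    op_codes = []
    while True:
        op, low = split_instruction(rom[pc])
        pc += 1
        op_codes.append(op)
        if low != 0:  # only the last instruction of a chain is non-zero
            return op_codes, low, pc
```

For a hypothetical byte sequence `0xA0, 0xB3` followed by five data bytes, the parser returns two op codes, a repeat value of 3, and the position of the first data byte, after which the data would be consumed in the Table C order (B1, F1, F4, F5, F6 in the example above).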

[0055] The processor's Branch Control instructions differ from the register modification instructions in that they do not modify any of the VTM parameter registers. The sole purpose of the branch control instructions is to determine the location in internal ROM 16 from which the next instruction will be fetched. The JSR (Jump to Subroutine) instruction stores the present address (i.e. the contents of the program counter register 54) in the return buffer (RB) register 56. The program counter 54 is loaded with the 11-bit address specified by the lower 3 bits of the instruction byte and the following data byte. In addition, an internal return flag is set to indicate that the RB register 56 has been loaded. The controller then fetches and executes the instruction located at PC + 1 in the ROM 16. Only one level of subroutine is allowed.

[0056] The Jump instruction JMP loads the program counter 54 with the 11-bit address specified by the lower 3 bits of the instruction byte, and the following data byte. Neither the return flag nor the return buffer are modified. Upon completion of execution, the next instruction is fetched from location PC + 1 in ROM 16.

[0057] The RET (Return from Subroutine) instruction is an instruction whose function depends on the state of the return flag. When the return flag is set, indicating that a subroutine is being executed, execution of an RET instruction will cause the contents of the RB register 56 to be moved into the Program Counter 54. The return flag is reset, and the controller fetches the instruction located at PC + 1 and continues execution from that location. Again, only one level of subroutine is allowed. When an RET instruction is encountered and the return flag is not set, the status of the input buffer flag (IBF) is checked. If the IBF flag is set, indicating that the starting address of the next sound sequence has been loaded into the speech processor, the contents of input buffer register 21 (8 bits) are loaded into program counter 54, left justified. If the input buffer flag is not set, the controller will disable any further output from the VTM and wait for the input buffer flag to become set. The standby conductor 27 will go high and remain high until the input buffer flag is set. The standby conductor was previously discussed in the section on standby operation. When the input buffer flag is set, execution continues as described above.
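The branch control behaviour can be summarised in a short behavioural model. Several details are assumptions: the lower 3 bits of the instruction byte are taken as the high-order address bits, "left justified" is read as a shift by 3 into the 11-bit program counter, a nested JSR raises an error, and the PC + 1 fetch step is left out. Class and method names are hypothetical.

```python
class BranchUnit:
    """Behavioural model of JMP, JSR and RET."""

    ADDRESS_MASK = 0x7FF  # 11-bit program counter

    def __init__(self) -> None:
        self.pc = 0
        self.return_buffer = 0
        self.return_flag = False
        self.input_buffer = 0
        self.input_buffer_flag = False
        self.standby = False

    @staticmethod
    def target(instruction: int, data: int) -> int:
        """11-bit address from the instruction's lower 3 bits and a data byte."""
        return ((instruction & 0x07) << 8) | (data & 0xFF)

    def jmp(self, instruction: int, data: int) -> None:
        # Neither the return flag nor the return buffer is modified.
        self.pc = self.target(instruction, data)

    def jsr(self, instruction: int, data: int) -> None:
        if self.return_flag:
            raise RuntimeError("only one level of subroutine is allowed")
        self.return_buffer = self.pc
        self.return_flag = True
        self.pc = self.target(instruction, data)

    def ret(self) -> None:
        if self.return_flag:
            # Return from subroutine.
            self.pc = self.return_buffer
            self.return_flag = False
        elif self.input_buffer_flag:
            # Start the next sequence: 8-bit entry, left justified in 11 bits.
            self.pc = (self.input_buffer << 3) & self.ADDRESS_MASK
            self.input_buffer_flag = False
            self.standby = False
        else:
            # Nothing pending: raise the standby line and wait.
            self.standby = True
```

Under this reading, an entry byte of 255 maps to address 2040, which agrees with the 8-byte entry-point spacing described for program entry control.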

[0058] The previously discussed preferred embodiment, i.e. the N-channel, metal-gate LSI device, has the following characteristics:

Maximum Ratings

[0059]

Typical Operating Conditions

[0060]

Clock

D.C. CHARACTERISTICS

[0061]

CURRENT:

[0062]

[0063] To expedite testing of the speech processor, the contents of the 2K x 8 ROM 16 can be read-out in serial format. In addition, the processor can be placed in a test mode where instructions and data are input from the 8-bit input port 18 in place of the ROM 16. The information line, the data path in the VTM, is output on the serial output (SER).

[0064] Since digital-to-analog converter 26 is a PWM design, the DAC output can be tested as an ordinary digital output. No special level detection is required.

[0065] The test program for a processor according to the invention advantageously comprises two sections, a fixed part which tests the functionality and parametrics of the processor, excluding ROM 16, and a ROM test section which is unique to the pattern being tested.

[0066] The speech processor according to the invention has a totally digital architecture which is designed to operate in the TTL voltage range. The drive requirements, operating voltages and speed requirements are all compatible with implementation in standard N-Channel Metal Gate LSI technology.

[0067] One skilled in the art can make various changes and substitutions without departing from the spirit and scope of the invention.

Claims

1. A digital, fixed-repertoire, processor for generating human speech and other complex sounds in response to a sequence of n-bit digital command words input thereto from an external source, characterised in that it comprises:

(a) means (12) for electronically modeling the behaviour of the human vocal tract and other complex sound sources, said modeling means including:

(1) an m-stage, second-order, infinite impulse response, digital filter (41,42,43,44,46);

(2) means for exciting the input of said m-stage digital filter; and

(3) means (26), connected to the output of said digital filter, for generating an analog signal, corresponding to said human speech and other complex sound sources; and

(b) means (11) connected to the source of said incoming digital command words, for controlling the operation of said modeling means thereby to control the speech generated by said analog signal generating means, said controlling means including:

(4) a plurality of n-bit, recirculating registers (15), connected to said modeling means (12), temporarily storing the parameters which define the sound segment (phoneme) currently being generated by said modeling means;

(5) a first, digital, read-only memory (16) for storing the parameters which define all possible sounds and sound segments in said repertoire, and

(6) means (19), responsive to said digital command word, for addressing a particular one of the parameters stored in said read-only memory (16).

2. A processor according to claim 1, characterised in that said modeling means further includes:

a second read-only memory (53) connected to said plurality of n-bit recirculating registers (15) and to said m-stage digital filter, for storing parameters to control the response characteristic of said digital filter.

3. A processor according to claim 1, characterised in that said m-stage digital filter comprises:

a second-order, infinite impulse response, digital filter section; and

means for multiplexing the digital information applied to said filter section, thereby to cause said section to appear functionally as m cascaded filter sections.

4. A processor according to claim 3, characterised in that said filter section includes:

means for altering the center frequency of the filter bandpass independently of the bandwidth.

5. A processor according to claim 3, characterised in that said filter section comprises:

first and second tandem-connected adders (41,42), each having first and second inputs and an output;

first, second and third multipliers (43,44,46), each having first and second inputs and an output; and first and second tandem-connected impedance networks, the input to said filter section being connected to the first input of said first adder (41), the output of filter section being connected to the output of said second adder (42), a first response-controlling parameter signal being connected to the first input of said first multiplier and a second response-controlling parameter signal being connected to a first input of said third multiplier.

6. A processor according to claim 5, characterised in that said second multiplier (44) is a multiply-by-two multiplier and is interposed between the output of said third multiplier (46) and the second input of said second adder (42).

7. A processor according to claim 5, characterised in that said first read-only memory (16) stores parameters defining the amplitude, pitch-period and the number of pitch periods to be requested for said modeling means for each phoneme.

8. A processor according to claim 7, characterised in that said first read-only memory (16) further stores parameters defining the first and second response-controlling parameters for each of the m-stages in said digital filter.

9. A processor according to claim 1, characterised in that said controller (11) further includes:

means for decoding operation codes stored in said first read only memory; and means responsive to said decoded operation codes, for altering the parameters stored in said plurality of n-bit recirculating registers (15).

10. A processor according to claim 1, characterised in that said analog signal generating means comprises:

an (n-1) bit pulse width modulated digital-to-analog converter (26), the output of said filter being connected to an external low-pass filter and an audio amplifier.

11. A processor according to claim 1, characterised in that it further comprises:

a source of pseudo-random noise (68);

a source of scaled periodic impulse signals (67); and means, responsive to the instantaneous command word for selectively connecting the input of said digital filter to the output of either said noise source or said impulse source.

12. A system for generating human speech or other complex sounds from a digital command signal, characterised in that voiced phoneme signals are generated by a formant synthesis circuit, and unvoiced phoneme signals are generated by a linear predictive coding circuit; control means responsive to said digital command signal select the phoneme signals from said circuits as ordered by said command signal and combine the generated voiced and unvoiced phonemes to form words of speech or the other complex sounds; and in that said entire system is on a single semiconductor integrated circuit chip.
