Speech synthesis device - Patent 0205298

(19)

(11)

EP 0 205 298 A1

(12)	EUROPEAN PATENT APPLICATION

(43)	Date of publication:
	17.12.1986 Bulletin 1986/51

(21)	Application number: 86304183.6

(22)	Date of filing: 02.06.1986

(51)	International Patent Classification (IPC)⁴: G10L 9/18

(84)	Designated Contracting States:
	DE FR GB

(30)	Priority: 05.06.1985 JP 121704/85

(71)	Applicant: KABUSHIKI KAISHA TOSHIBA
	Kawasaki-shi, Kanagawa-ken 210 (JP)

(72)	Inventor:
	Takamori, Kazuo c/o Patent Division Toshiba Corp. Minato-ku Tokyo (JP)

(74)	Representative: Freed, Arthur Woolf et al
	MARKS & CLERK, 57-60 Lincoln's Inn Fields London WC2A 3LS (GB)

(54)	Speech synthesis device

(57) The present invention relates to a speech synthesis device. The speech synthesis device according to the present invention is provided with a data memory (10) in which are stored frame data having amplitude data, frequency data and PARCOR coefficients generated by the PARCOR speech synthesis method using selectively at least two different frame lengths, and a variable frame bit indicating the frame length selected for the generation of the frame data; a speech synthesis circuit (15) using frame data read out from the data memory (10) and carrying out PARCOR speech synthesis processing; a discrimination means (14b) for discriminating the content of the variable frame bit within the data read out from the data memory; and a control means (14a) for controlling the frame length to be selected by the speech synthesis circuit (15) when carrying out speech synthesis, according to the discrimination result of the discrimination means (14b).

Description

[0001] This invention relates to a PARCOR type speech synthesis device in which analysis data produced by speech analysis by the PARCOR method is stored in a memory device, and thereafter speech synthesis processing is carried out by reading out this analysis data from the memory device.

[0002] Conventionally, in a PARCOR type speech synthesis device, the original speech wave to be synthesized is separated into speech waveforms with 10 milliseconds or 20 milliseconds as a frame, and speech analysis is carried out for each frame. The amplitude data, frequency data and K parameter data which make up the PARCOR coefficients are generated and stored in a memory device as frame data. For speech synthesis, these data values are then read out from the memory device, and speech synthesis processing is carried out with the same frame length as was used in the analysis stage.

[0003] A major problem with speech synthesis devices, whatever method they use, is that of reducing the data rate (bit rate) without losing the quality of the synthesized speech. Various approaches to this problem have also been tried for speech synthesis devices using the PARCOR method, and of these the one generally adopted is the use of a 20 millisecond frame length. If the frame length is set to 20 milliseconds, the data quantity is halved compared with the case where the frame length is 10 milliseconds. When the frame length is set to 20 milliseconds, however, consonant and plosive sounds and the like in the original speech cannot be extracted in the analysis, and the synthesized speech therefore has the defect that sounds such as consonants and plosives cannot be realized.

[0004] On the other hand, sounds such as consonants and plosives can be extracted with a 10 millisecond frame length, but in this case, as described above, the data volume is increased, and the benefit of data compression is lost.
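The trade-off between the two fixed frame lengths can be checked with simple arithmetic. The following is a minimal sketch, assuming 48 bits of frame data per frame (the example figure given in paragraph [0016]); the function name `bit_rate_bps` is hypothetical and used only for illustration.

```python
# Assumption: 48 bits of frame data per frame, the example size from [0016].
BITS_PER_FRAME = 48


def bit_rate_bps(bits_per_frame: int, frame_ms: int) -> float:
    """Bits per second when one frame is emitted every frame_ms milliseconds."""
    return bits_per_frame * 1000 / frame_ms


rate_10ms = bit_rate_bps(BITS_PER_FRAME, 10)
rate_20ms = bit_rate_bps(BITS_PER_FRAME, 20)
print(rate_10ms, rate_20ms)  # the 20 ms rate is half the 10 ms rate
```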

[0005] This invention is made in view of the above described state of affairs, and has as its object the provision of a speech synthesis device whereby sounds such as consonants and plosives which can only be realized with a short frame length can be synthesized, and in which a substantial amount of data compression is achieved.

[0006] In order to achieve the above objective, in this invention sounds such as consonants and plosives included in the original speech which can only be realized with a 10 millisecond frame length are subjected to analysis with a frame length of 10 milliseconds, whereas normal sounds are subjected to analysis with a frame length of 20 milliseconds. A variable frame bit indicating the frame length used for analysis is then appended, for each frame, to the frame data generated by the analysis, and the result is stored in a memory device. In the speech synthesis circuit, the speech synthesis is carried out using a frame length determined in accordance with the variable frame bit.

[0007] In this way, sounds such as consonants and plosives which cannot be synthesized using the conventional frame length of 20 milliseconds can be synthesized using the 10 millisecond frame length. Furthermore, the proportion of sounds such as consonants and plosives in the generated speech is low, and in general the same quality of speech synthesis can be achieved using a 20 millisecond frame length, so that substantial data compression can be achieved.
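To illustrate why a low proportion of short frames keeps the data rate close to that of the 20 millisecond case, the sketch below computes the average bit rate of a mixed sequence. It assumes 48 bits of frame data (the example size from paragraph [0016]) plus one variable frame bit per frame; the frame sequence and the name `average_bit_rate_bps` are hypothetical.

```python
# Assumptions: 48 bits of frame data plus 1 variable frame bit (VFB) per frame.
BITS_PER_STORED_FRAME = 48 + 1


def average_bit_rate_bps(vfb_bits: list[int]) -> float:
    """Average bit rate of a sequence in which VFB = 1 means 10 ms, VFB = 0 means 20 ms."""
    total_ms = sum(10 if vfb == 1 else 20 for vfb in vfb_bits)
    total_bits = BITS_PER_STORED_FRAME * len(vfb_bits)
    return total_bits * 1000 / total_ms


# Hypothetical utterance: two short frames for the onset, then nine long frames.
mixed = [1, 1] + [0] * 9
mixed_rate = average_bit_rate_bps(mixed)

# Same 200 ms of speech coded entirely with 10 ms frames (20 frames).
all_short_rate = average_bit_rate_bps([1] * 20)
print(mixed_rate, all_short_rate)
```

Under these assumptions the mixed sequence needs only slightly more than half the bits of the all-10-millisecond coding, even after paying one extra bit per frame for the VFB.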

[0008]

Fig. 1 is a block diagram of one embodiment of the present invention;

Fig. 2 is a block diagram of a portion of the above embodiment;

Fig. 3 is a waveform diagram illustrating the above embodiment;

Fig. 4 shows an example of the memory state of the data memory in the circuit of the above embodiment; and

Fig. 5 is a waveform diagram illustrating the above embodiment.

[0009] An embodiment of the present invention is now described with reference to the drawings.

[0010] Fig. 1 is a block diagram showing the structure of a speech synthesis device of the present invention. In Fig. 1, numeral 10 is a data memory, in which are stored the frame data, which is the analysis data for each frame generated by the PARCOR speech analysis method, and the variable frame bit (VFB) corresponding to the frame length used in the analysis for each frame. This data memory 10 has an address specified by the output of an address counter 11, and previously stored data, that is a plurality of bits, is read out in parallel from the data area specified by this address counter. The data read out from this data memory 10 is applied to a parallel to serial conversion circuit 12. This parallel to serial conversion circuit 12 converts the data read out in parallel from the data memory 10 to serial data and outputs it; in response to a control signal A1 output from a control circuit described below, the next frame data is output after a certain time interval. This serial data is applied to a serial to parallel conversion circuit 13. This serial to parallel conversion circuit 13 stores the serial data output by parallel to serial conversion circuit 12 and outputs the stored data in parallel at a fixed timing. The parallel data output from this serial to parallel conversion circuit 13 is applied to a control circuit 14 and a PARCOR speech synthesis circuit 15.
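The data path from data memory 10 through conversion circuits 12 and 13 can be modelled in software as a round trip between a byte and a bit stream. This is a behavioural sketch only; the most-significant-bit-first ordering and both function names are assumptions, since the description does not state the bit order.

```python
def parallel_to_serial(byte: int) -> list[int]:
    """Model of circuit 12: emit the 8 bits of one memory word, MSB first (assumed order)."""
    return [(byte >> shift) & 1 for shift in range(7, -1, -1)]


def serial_to_parallel(bits: list[int]) -> int:
    """Model of circuit 13: collect serial bits back into one parallel word."""
    value = 0
    for bit in bits:
        value = (value << 1) | (bit & 1)
    return value


word = 0b10110010
stream = parallel_to_serial(word)
restored = serial_to_parallel(stream)
print(stream, restored == word)
```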

[0011] PARCOR speech synthesis circuit 15 is provided with an input data temporary memory circuit 16 which stores the parallel data output from serial to parallel conversion circuit 13. PARCOR speech synthesis circuit 15 uses the data stored in memory circuit 16 and, selecting one of at least two different frame lengths, sequentially carries out PARCOR speech synthesis processing.

[0012] Control circuit 14 comprises a timing generating circuit 14a and a discriminating circuit 14b, and has the following functions: a data read out function to output an increment signal to increment address counter 11; a variable frame bit (VFB) discrimination function to discriminate the content of the variable frame bit (VFB) applied through serial to parallel conversion circuit 13; an output control function to output the control signal which controls the interval before the next frame data is output from parallel to serial conversion circuit 12; and a frame length selection control function to control the selection of the frame length during speech synthesis in PARCOR speech synthesis circuit 15, according to the timing control of input data temporary memory circuit 16 within PARCOR speech synthesis circuit 15 and the discrimination result of the VFB discrimination function.

[0013] Fig. 2 is a block diagram showing a portion of discriminating circuit 14b in detail. Discriminating circuit 14b is equipped with a VFB discriminating circuit 20 and another discriminating circuit 22. VFB discriminating circuit 20 is formed by a latch circuit. After a reset, the VFB data is input to a terminal D and is latched in the latch circuit of discriminating circuit 20, and the VFB data is output to timing generating circuit 14a according to a latch clock. The VFB data stored in VFB discriminating circuit 20 is held until a control signal is output from timing generating circuit 14a. Discrimination between unvoiced sounds, voiced sounds and so on is carried out in the other discriminating circuit 22.
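The behaviour of VFB discriminating circuit 20 can be sketched as a one-bit latch. This is a behavioural model under stated assumptions, not a description of the actual circuit: the class name `VfbLatch` is hypothetical, and the control signal from timing generating circuit 14a is modelled here as a plain reset.

```python
class VfbLatch:
    """Behavioural model of a one-bit latch holding the variable frame bit."""

    def __init__(self) -> None:
        self.q = 0  # output presented to the timing generating circuit

    def reset(self) -> None:
        """Assumed effect of the control signal from circuit 14a: clear the held bit."""
        self.q = 0

    def latch(self, d: int) -> int:
        """Capture the bit on terminal D at the latch clock and hold it."""
        self.q = d & 1
        return self.q


vfb_latch = VfbLatch()
vfb_latch.reset()
held = vfb_latch.latch(1)  # VFB = 1 arrives and is held
print(held, vfb_latch.q)
```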

[0014] Fig. 3 is a waveform diagram of the sound "pa" including a plosive, and Fig. 4 illustrates an example of the way in which data is stored when the result of analysis according to the PARCOR speech analysis method of a sound having a waveform as shown in Fig. 3 is stored in data memory 10. The original speech in Fig. 3 is divided into a number of frames ti (i = 1, 2, ...). When analyzing according to the PARCOR speech analysis method for each frame a frame length of 10 milliseconds or 20 milliseconds is used selectively. In the case of a sound such as "pa" containing a plosive sound having a rapid onset, so that there is a speech segment which cannot be synthesized with a frame length of 20 milliseconds, analysis is carried out with the first frames t1 and t2 having a frame length of 10 milliseconds selected, and frames t3 and thereafter having a frame length of 20 milliseconds selected.

[0015] Analysis is thus carried out for each frame, and the generated frame data is stored in data memory 10. In addition to the frame data generated by analysis, a variable frame bit (VFB) indicating whether the frame length used in the analysis is 10 milliseconds or 20 milliseconds is stored together in data memory 10.

[0016] In Fig. 4, VFB is the variable frame bit indicating whether the frame length used in the analysis is 10 milliseconds or 20 milliseconds. In the first frames t1 and t2, in which the 10 millisecond frame length is selected, its value is "1", while in frames t3 and thereafter, in which the 20 millisecond frame length is selected, its value is "0". The frame data comprises amplitude data (AMP data), frequency data (PITCH data) and a plurality of K parameter data values which are the PARCOR coefficients. The quantity of AMP data is l bits, the quantity of PITCH data is m bits, and the K parameter data values are each of n bits, so that the plurality of K parameter data values indicated by K1 to Kj give a total K parameter data amount of n × j bits (for example, l + m + n × j is 48 bits).

[0017] If the word length of data memory 10 is 8 bits (1 byte), then for the analysis data of frame t1 the first byte will hold the variable frame bit (VFB), the AMP data and a part of the PITCH data, being a total of 8 bits, the second byte will hold the remainder of the PITCH data and a part of the K parameter data, being a total of 8 bits, and the third byte will hold the remainder of the K parameter data.
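The byte layout described above can be sketched as a bit-stream packer. The field widths are not given in the description, so the values below (5 bits of AMP data, 7 bits of PITCH data, nine K parameters of 4 bits each, totalling 48 bits) are hypothetical, as are the function names, the most-significant-bit-first ordering and the zero padding of the last byte. With these widths a stored frame occupies 7 bytes; the byte count depends entirely on the widths chosen.

```python
# Hypothetical field widths chosen so that l + m + n * j = 48 bits.
AMP_BITS, PITCH_BITS, K_BITS, K_COUNT = 5, 7, 4, 9


def pack_frame(vfb: int, amp: int, pitch: int, ks: list[int]) -> bytes:
    """Pack VFB, AMP, PITCH and K1..Kj MSB first into 8-bit memory words."""
    fields = [(vfb, 1), (amp, AMP_BITS), (pitch, PITCH_BITS)]
    fields += [(k, K_BITS) for k in ks]
    value, nbits = 0, 0
    for field, width in fields:
        if not 0 <= field < (1 << width):
            raise ValueError("field does not fit in its width")
        value = (value << width) | field
        nbits += width
    pad = (-nbits) % 8  # zero-fill the final byte (assumption)
    return (value << pad).to_bytes((nbits + pad) // 8, "big")


def unpack_frame(data: bytes) -> tuple[int, int, int, list[int]]:
    """Inverse of pack_frame for the same hypothetical widths."""
    total = len(data) * 8
    value = int.from_bytes(data, "big")
    pos = 0

    def take(width: int) -> int:
        nonlocal pos
        pos += width
        return (value >> (total - pos)) & ((1 << width) - 1)

    vfb, amp, pitch = take(1), take(AMP_BITS), take(PITCH_BITS)
    return vfb, amp, pitch, [take(K_BITS) for _ in range(K_COUNT)]


packed = pack_frame(1, 0b10101, 0b1100110, list(range(1, 10)))
print(packed.hex(), unpack_frame(packed))
```

As in the description, the first byte here carries the VFB, the AMP data and the leading bits of the PITCH data, and the second byte carries the rest of the PITCH data followed by the start of the K parameter data.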

[0018] Next the operation will be described. First, control circuit 14 applies an increment signal A0 to address counter 11. An address corresponding to the sound to be produced is thereby specified in data memory 10, and data is read out from the first 8-bit memory area and supplied to parallel to serial conversion circuit 12. Next, when a control signal A1 is supplied by timing generating circuit 14a within control circuit 14 to parallel to serial conversion circuit 12, parallel to serial conversion circuit 12 outputs the 8 bits of data to serial to parallel conversion circuit 13 at the timing of this control signal A1. When the variable frame bit (VFB) is input through serial to parallel conversion circuit 13, discriminating circuit 14b within control circuit 14 determines the value of this bit. If the result of the determination is that VFB = 1, the output timing of control signal A1 is controlled so that the next frame data will be output from parallel to serial conversion circuit 12 after 10 milliseconds, and a signal A2 to select 10 milliseconds as the frame length to be used in speech synthesis by PARCOR speech synthesis circuit 15 and also a signal A3 to control the output timing of input data temporary memory circuit 16 are output. On the other hand, if the result of the determination in control circuit 14 is that VFB = 0, the output timing of control signal A1 is controlled so that the next frame data will be output from parallel to serial conversion circuit 12 after 20 milliseconds, and a signal A2 to select 20 milliseconds as the frame length to be used in speech synthesis by PARCOR speech synthesis circuit 15 is output.

[0019] PARCOR speech synthesis circuit 15 carries out speech synthesis for the next 10 millisecond or 20 millisecond time interval using the selected 10 millisecond or 20 millisecond frame length.
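The control sequence of paragraphs [0018] and [0019] can be summarised in software as follows. This is a simplified sketch, not the circuit itself: it assumes that the VFB is the most significant bit of the first byte of each stored frame, and the name `control_sequence` and the example byte values are hypothetical.

```python
# Frame length selected for each VFB value, as stated in the description.
FRAME_MS = {1: 10, 0: 20}


def control_sequence(frames: list[bytes]) -> list[tuple[int, int, int]]:
    """Return (start_ms, vfb, frame_ms) for each stored frame.

    Models the discrimination of the VFB, the frame length selection (signal A2)
    and the delay before the next frame is fetched (timing of signal A1).
    """
    start_ms = 0
    schedule = []
    for frame in frames:
        vfb = frame[0] >> 7        # assumed position of the VFB
        frame_ms = FRAME_MS[vfb]   # frame length used for synthesis
        schedule.append((start_ms, vfb, frame_ms))
        start_ms += frame_ms       # next frame is output this much later
    return schedule


# Hypothetical first bytes of four frames: two short frames, then two long ones.
stored = [bytes([0x80]), bytes([0xD7]), bytes([0x00]), bytes([0x7F])]
schedule = control_sequence(stored)
print(schedule)
```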

[0020] Thus in this circuit embodiment for sounds included in the original speech such as consonants or plosives which can only be realized with a frame length of 10 milliseconds, analysis is carried out with a frame length of 10 milliseconds while for normal sounds analysis is carried out with a frame length of 20 milliseconds, and a variable frame bit (VFB) indicating the frame length used in analysis for each frame is appended to the generated frame data and stored in data memory 10, and during speech synthesis, according to this variable frame bit (VFB) the frame length to be used during speech synthesis in PARCOR speech synthesis circuit 15 is determined, whereby sounds such as consonants and plosives which can only be realized with a short frame length can be synthesized. Furthermore, since the proportion of sounds such as consonants and plosives in the generated speech is low, in general the same quality of speech synthesis can be achieved using a 20 millisecond frame length, so that a substantial data compression can be carried out.

[0021] In Fig. 5, waveform a is a synthesized speech waveform of the original speech shown in Fig. 3, analyzed entirely with a frame length of 20 milliseconds and then synthesized; waveform b is a synthesized speech waveform similarly analyzed entirely with a frame length of 10 milliseconds and then synthesized; and waveform c is a synthesized speech waveform of the original speech shown in Fig. 3, analyzed with a variable frame length by the circuit of the above embodiment and then synthesized. It will be seen that waveform c has characteristics which cannot be realized in waveform a, and that the post-onset characteristics of the sound, in which no difference is seen between a and b, are also rendered satisfactorily.

[0022] According to the invention as described above, a speech synthesis device can be provided in which sounds such as consonants and plosives which can only be realized with a short frame length can be synthesized, while a substantial data compression ratio can be achieved.

Claims

1. A speech synthesis device comprising a data memory (10) in which are stored frame data having amplitude data, frequency data and PARCOR coefficients generated by the PARCOR speech synthesis method and a speech synthesis circuit (15) using frame data read out from said data memory (10) and carrying out PARCOR speech synthesis processing; characterized in that said frame data is subject to speech analysis using selectively at least two different frame lengths, in that according to the different frame lengths a variable frame bit indicating the frame length selected for the generation of the frame data is stored in said data memory (10), and in being provided with:

a discrimination means (14b) for discriminating the content of the variable frame bit within the data read out from said data memory (10); and

a control means (14a) for controlling the frame length to be selected by said speech synthesis circuit when carrying out speech synthesis, according to the discrimination result of said discrimination means.

2. The speech synthesis device according to claim 1 wherein said frame data read out from said data memory (10) is input to said speech synthesis circuit (15) by a control signal (A1) according to said amplitude data.

3. The speech synthesis device according to claim 1 wherein said frame data read out from said data memory (10) is converted from parallel data to serial data, then converted again to parallel data, and then input to said speech synthesis circuit (15).

4. The speech synthesis device according to claim 3 wherein said parallel data to be input to said speech synthesis circuit (15) is input to an input data temporary memory circuit (16) provided in said speech synthesis circuit (15), and is input to said speech synthesis circuit (15) by a control signal (A3) according to said variable frame bit.

5. The speech synthesis device according to claim 1 wherein the reading out of data from said data memory (10) is controlled by a control signal (A0) according to said amplitude data.
