[0001] This invention relates to a PARCOR type speech synthesis device in which analysis
data produced by speech analysis by the PARCOR method is stored in a memory device,
and thereafter speech synthesis processing is carried out by reading out this analysis
data from the memory device.
[0002] Conventionally, in a PARCOR type speech synthesis device, the original speech wave to
be synthesized is separated into speech waveforms with 10 milliseconds or 20 milliseconds
as a frame, and for each frame speech analysis is carried out, the amplitude data,
frequency data and K parameter data which make up the PARCOR coefficients are generated
and stored in a memory device as frame data; then for speech synthesis the above data
values are read out from the memory device, and speech synthesis processing is carried
out with the same frame length as was used in the analysis stage.
[0003] A large problem, however, with speech synthesis devices using various methods is
that of reducing the data rate (bit rate) without losing the quality of the synthesized
speech. With speech synthesis devices using the PARCOR method, too, various approaches
to this problem have been tried, and of these the one generally adopted is the use
of a 20 millisecond frame length. If the frame length is set to 20 milliseconds then
the data quantity is reduced to a half compared with the case where the frame length
is 10 milliseconds. When, however, the frame length is set to 20 milliseconds, consonant
and plosive sounds and the like in the original speech cannot be extracted in the
analysis, and therefore a defect of the synthesized speech is that sounds such as
consonants and plosives cannot be realized.
[0004] On the other hand sounds such as consonants and plosives can be extracted with a
10 millisecond frame length, but in this case, as described above, the data volume
is increased, and there is the defect that the data compression is lost.
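By way of illustration only, the trade-off described above can be checked with a short calculation. The following Python sketch computes the data rate for a fixed frame length; the figure of 48 bits per frame is the example value given later in this description, and the function name `bit_rate` is an assumption introduced purely for this illustration.

```python
def bit_rate(bits_per_frame: int, frame_ms: int) -> float:
    """Return the data rate in bits per second for fixed-length frames."""
    frames_per_second = 1000 / frame_ms
    return bits_per_frame * frames_per_second


BITS_PER_FRAME = 48  # example frame size quoted in the description

rate_10ms = bit_rate(BITS_PER_FRAME, 10)  # short frames
rate_20ms = bit_rate(BITS_PER_FRAME, 20)  # long frames

print(rate_10ms, rate_20ms)
```

Under these assumptions the 20 millisecond frame length yields exactly half the data rate of the 10 millisecond frame length, which is the saving that is given up when every frame is shortened.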
[0005] This invention is made in view of the above described state of affairs, and has as
its object the provision of a speech synthesis device whereby sounds such as consonants
and plosives which can only be realized with a short frame length can be synthesized,
and in which a substantial amount of data compression is achieved.
[0006] In order to achieve the above objective, in this invention sounds such as consonants
and plosives included in the original speech which can only be realized with a 10
millisecond frame length are subjected to analysis with a frame length of 10 milliseconds,
whereas normal sounds are subjected to analysis with a frame length of 20 milliseconds.
Then a variable frame bit which indicates the frame length used for analysis is appended,
for each frame, to the frame data generated by the analysis, and the result is stored in
a memory device; in the speech generation circuit the speech synthesis is carried
out using a frame length determined in accordance with the variable frame bit.
[0007] In this way sounds such as consonants and plosives which cannot be synthesized using
the conventional frame length of 20 milliseconds can be synthesized using the 10 millisecond
frame length. Furthermore, the proportion of sounds such as consonants and plosives
in the generated speech is low, and in general the same quality of speech synthesis
can be achieved using a 20 millisecond frame length, so that a substantial data compression
can be carried out.
[0008]
Fig. 1 is a block diagram of one embodiment of the present invention;
Fig. 2 is a portion of a block diagram of the above embodiment;
Fig. 3 is a waveform diagram illustrating the above embodiment;
Fig. 4 shows an example of the memory state of the data memory in the circuit of the
above embodiment; and
Fig. 5 is a waveform diagram illustrating the above embodiment.
[0009] An embodiment of the present invention is now described with reference to the drawings.
[0010] Fig. 1 is a block diagram showing the structure of a speech synthesis device of the
present invention. In Fig. 1, numeral 10 is a data memory, in which is stored the
frame data which is the analysis data for each frame generated by the PARCOR speech
analysis method and the variable frame bit (VFB) corresponding to the frame length
used in the analysis for each frame. This data memory 10 has an address specified
by the output of an address counter 11, and previously stored data, that is a plurality
of bits, is read out in parallel from the data area specified by this address counter.
The data read out from this data memory 10 is applied to a parallel to serial conversion
circuit 12. This parallel to serial conversion circuit 12 converts the data read out
from the data memory 10 in parallel to serial data and outputs it; in response to
a control signal A1 output from a control circuit described below, the next frame data
is output after a certain time interval. This serial data is applied to a serial to
parallel conversion circuit 13. This serial to parallel conversion circuit 13 stores
the serial data output by parallel to serial conversion circuit 12 and outputs the
stored data in parallel at a fixed timing. The parallel data output from this serial
to parallel conversion circuit 13 is applied to a control circuit 14 and a PARCOR
speech synthesis circuit 15.
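The data path through conversion circuits 12 and 13 may be pictured with the following minimal Python sketch. It is a behavioural model only, not a description of the actual circuits; the function names and the most-significant-bit-first ordering are assumptions, since the description does not state the bit order.

```python
from typing import List


def parallel_to_serial(word: int) -> List[int]:
    """Shift one 8-bit word out as single bits, MSB first (assumed order)."""
    return [(word >> shift) & 1 for shift in range(7, -1, -1)]


def serial_to_parallel(bits: List[int]) -> int:
    """Collect serial bits, MSB first, back into one parallel word."""
    word = 0
    for bit in bits:
        word = (word << 1) | bit
    return word


serial_bits = parallel_to_serial(0b10110010)
restored = serial_to_parallel(serial_bits)
print(serial_bits, restored)
```

The round trip returns the word that was read from the data memory, which is all that the synthesis circuit and the control circuit require of this path.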
[0011] PARCOR speech synthesis circuit 15 is provided with an input data temporary memory
circuit 16 which stores the parallel data output from serial to parallel conversion
circuit 13, and PARCOR speech synthesis circuit 15 uses this data stored in memory
circuit 16 and, selecting one of at least two different frame lengths, sequentially
carries out PARCOR speech synthesis processing.
[0012] Control circuit 14 comprises a timing generating circuit 14a and a discriminating
circuit 14b and has a data read out function to output an increment signal to increment
address counter 11, a variable frame bit (VFB) discrimination function to discriminate
the content of the variable frame bit (VFB) applied through serial to parallel conversion
circuit 13, an output control function to output the control signal to control the
interval of outputting the next frame data from parallel to serial conversion circuit
12, and a frame length selection control function to control the selection operation
for the frame length during speech synthesis in PARCOR speech synthesis circuit 15
according to the timing control in input data temporary memory circuit 16 within PARCOR
speech synthesis circuit 15 and the discrimination result of the variable frame bit (VFB)
discrimination function.
[0013] Fig. 2 is a portion of a block diagram showing discriminating circuit 14b in detail.
Discriminating circuit 14b is equipped with a VFB discriminating circuit 20 and
another discriminating circuit 22. VFB discriminating circuit 20 is formed by a latch
circuit. After a reset, VFB data is input to a terminal D and is latched in the
latch circuit of VFB discriminating circuit 20. The VFB data is then output to timing
generating circuit 14a according to a latch clock. The VFB data stored in VFB
discriminating circuit 20 is held until a control signal is output from timing generating
circuit 14a. Discrimination of unvoiced sounds, voiced sounds and so on is performed in
the other discriminating circuit 22.
[0014] Fig. 3 is a waveform diagram of the sound "pa" including a plosive, and Fig. 4 illustrates
an example of the way in which data is stored when the result of analysis according
to the PARCOR speech analysis method of a sound having a waveform as shown in Fig.
3 is stored in data memory 10. The original speech in Fig. 3 is divided into a number
of frames ti (i = 1, 2, ...). When analyzing according to the PARCOR speech analysis
method for each frame a frame length of 10 milliseconds or 20 milliseconds is used
selectively. In the case of a sound such as "pa" containing a plosive sound having
a rapid onset, so that there is a speech segment which cannot be synthesized with
a frame length of 20 milliseconds, analysis is carried out with the first frames t1
and t2 having a frame length of 10 milliseconds selected, and frames t3 and thereafter
having a frame length of 20 milliseconds selected.
[0015] Analysis is thus carried out for each frame, and the generated frame data is stored
in data memory 10. In addition to the frame data generated by analysis, a variable
frame bit (VFB) indicating whether the frame length used in the analysis is 10 milliseconds
or 20 milliseconds is stored together in data memory 10.
[0016] In Fig. 4 VFB is the variable frame bit (VFB) indicating whether the frame length
used in the analysis is 10 milliseconds or 20 milliseconds, and in the first frames
t1 and t2 in which the 10 millisecond frame length is selected the value thereof is
"1", while in frames t3 and thereafter in which the 20 millisecond frame length is
selected the value thereof is "0". The frame data comprises amplitude data (AMP data),
frequency data (PITCH data) and a plurality of K parameter data values which are the
PARCOR coefficients; the quantity of AMP data is l bits, the quantity of PITCH data
is m bits, and the K parameter data values are each of n bits, so that the plurality
of K parameter data values indicated by K1 to Kj give a total K parameter data amount
of n × j bits. (For example, l + m + n × j is 48 bits).
[0017] If the word length of data memory 10 is 8 bits (1 byte), for the analysis data of
frame t1 the first byte will hold the variable frame bit (VFB), the AMP data and a
part of the PITCH data, being a total of 8 bits, the second byte will hold the remainder
of the PITCH data and a part of the K parameter data being a total of 8 bits, and
the third byte will hold the remainder of the K parameter data.
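The byte-wise storage described above may be sketched as follows. This is a minimal Python illustration under stated assumptions: the field widths (5 bits of AMP data, 7 bits of PITCH data, nine K parameters of 4 bits each, totalling 48 bits of frame data) are hypothetical values chosen only so that the fields straddle byte boundaries as described; the description does not give the actual widths. The zero padding of the final byte and the names `pack_frame` and `unpack_frame` are likewise assumptions.

```python
from typing import List, Tuple

AMP_BITS = 5    # hypothetical AMP width
PITCH_BITS = 7  # hypothetical PITCH width
K_BITS = 4      # hypothetical width of each K parameter
K_COUNT = 9     # hypothetical number of K parameters (5 + 7 + 4*9 = 48)
TOTAL_BITS = 1 + AMP_BITS + PITCH_BITS + K_BITS * K_COUNT  # VFB + frame data


def pack_frame(vfb: int, amp: int, pitch: int, ks: List[int]) -> bytes:
    """Concatenate VFB, AMP, PITCH and K fields MSB first into 8-bit words."""
    if len(ks) != K_COUNT:
        raise ValueError("wrong number of K parameters")
    fields = [(vfb, 1), (amp, AMP_BITS), (pitch, PITCH_BITS)]
    fields += [(k, K_BITS) for k in ks]
    value = 0
    for field, width in fields:
        if not 0 <= field < (1 << width):
            raise ValueError("field does not fit in its width")
        value = (value << width) | field
    pad = (-TOTAL_BITS) % 8  # zero bits appended to fill the last byte
    return (value << pad).to_bytes((TOTAL_BITS + pad) // 8, "big")


def unpack_frame(data: bytes) -> Tuple[int, int, int, List[int]]:
    """Recover (vfb, amp, pitch, ks) from bytes produced by pack_frame."""
    value = int.from_bytes(data, "big") >> (len(data) * 8 - TOTAL_BITS)
    ks = []
    for _ in range(K_COUNT):  # K fields sit at the low end; peel them off
        ks.append(value & ((1 << K_BITS) - 1))
        value >>= K_BITS
    ks.reverse()
    pitch = value & ((1 << PITCH_BITS) - 1)
    value >>= PITCH_BITS
    amp = value & ((1 << AMP_BITS) - 1)
    value >>= AMP_BITS
    return value & 1, amp, pitch, ks


packed = pack_frame(1, 21, 102, [1, 2, 3, 4, 5, 6, 7, 8, 9])
print(packed.hex(), unpack_frame(packed))
```

With these assumed widths the first byte carries the variable frame bit, the AMP data and the first two bits of the PITCH data, and the second byte carries the remaining PITCH bits followed by the start of the K parameter data.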
[0018] Next the operation will be described. First, control circuit 14 applies an increment
signal A0 to address counter 11. Thereby, in data memory 10 an address indication
corresponding to the sound to be produced is made, and data is read out from the first
8 bit memory area, and supplied to parallel to serial conversion circuit 12. Next,
when a control signal A1 is supplied by timing generating circuit 14a within control
circuit 14 to parallel to serial conversion circuit 12, parallel to serial conversion
circuit 12 outputs the 8 bits of data to serial to parallel conversion circuit 13 at
the timing at which this control signal A1 is supplied. When the variable frame bit
(VFB) is input through serial to parallel conversion circuit 13, discriminating circuit
14b within control circuit 14 determines the value of this bit. If the result of the
determination is that VFB = 1, the output timing of the control signal A1 is controlled
so that the next frame data will be output from parallel to serial conversion circuit 12
after 10 milliseconds, and a signal A2 to select 10 milliseconds as the frame length to
be used in speech synthesis by PARCOR speech synthesis circuit 15 and also a signal A3
to control the output timing of input data temporary memory circuit 16 are output. On
the other hand, if the result of the determination in control circuit 14 is that VFB = 0,
the output timing of the control signal A1 is controlled so that the next frame data will
be output from parallel to serial conversion circuit 12 after 20 milliseconds, and a
signal A2 to select 20 milliseconds as the frame length to be used in speech synthesis
by PARCOR speech synthesis circuit 15 is output.
[0019] PARCOR speech synthesis circuit 15 carries out speech synthesis for the next 10 millisecond
or 20 millisecond time interval using the selected 10 millisecond or 20 millisecond
frame length.
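The frame length selection described above may be modelled in software as follows. This Python sketch is a timing model only and does not perform PARCOR synthesis; the function name `schedule_frames` is an assumption, and the example input of four frames is hypothetical, following the pattern of Fig. 3 in which the first two frames are short and the later frames are long.

```python
from typing import Iterable, List, Tuple

# VFB value -> frame length in milliseconds, as stated in the description.
FRAME_MS = {1: 10, 0: 20}


def schedule_frames(vfbs: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (start_ms, length_ms) for each frame in playback order."""
    schedule = []
    now_ms = 0
    for vfb in vfbs:
        length_ms = FRAME_MS[vfb]  # role of the frame length select signal
        schedule.append((now_ms, length_ms))
        now_ms += length_ms        # next frame data is requested after this delay
    return schedule


pa_schedule = schedule_frames([1, 1, 0, 0])
print(pa_schedule)
```

Each entry gives the moment at which a frame begins and the interval for which it is synthesized, so the delay before the next frame is fetched always equals the frame length indicated by the variable frame bit.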
[0020] Thus in this circuit embodiment for sounds included in the original speech such as
consonants or plosives which can only be realized with a frame length of 10 milliseconds,
analysis is carried out with a frame length of 10 milliseconds while for normal sounds
analysis is carried out with a frame length of 20 milliseconds, and a variable frame
bit (VFB) indicating the frame length used in analysis for each frame is appended
to the generated frame data and stored in data memory 10, and during speech synthesis,
according to this variable frame bit (VFB) the frame length to be used during speech
synthesis in PARCOR speech synthesis circuit 15 is determined, whereby sounds such
as consonants and plosives which can only be realized with a short frame length can
be synthesized. Furthermore, since the proportion of sounds such as consonants and
plosives in the generated speech is low, in general the same quality of speech synthesis
can be achieved using a 20 millisecond frame length, so that a substantial data compression
can be carried out.
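The degree of data compression may be estimated with the following Python sketch. All of the figures other than the 48 bit frame size are assumptions made for illustration: the utterance of two short frames followed by twenty-four long frames is hypothetical, and one additional bit per frame is counted for the variable frame bit.

```python
from typing import Sequence


def average_bit_rate(vfbs: Sequence[int], frame_bits: int = 48,
                     vfb_bits: int = 1) -> float:
    """Average bits per second for a sequence of variable-length frames."""
    total_bits = len(vfbs) * (frame_bits + vfb_bits)
    total_ms = sum(10 if vfb == 1 else 20 for vfb in vfbs)
    return total_bits * 1000 / total_ms


# Hypothetical utterance: 2 short frames, then 24 long frames (500 ms in all).
utterance = [1, 1] + [0] * 24
variable_rate = average_bit_rate(utterance)

# The same 500 ms coded entirely with 10 ms frames and no variable frame bit.
fixed_10ms_rate = 48 * 1000 / 10

print(variable_rate, fixed_10ms_rate)
```

Under these assumptions the variable frame length scheme needs 2548 bits per second, against 4800 bits per second when every frame is 10 milliseconds long, and only slightly more than the 2400 bits per second of a fixed 20 millisecond frame length.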
[0021] In Fig. 5 waveform a is a synthesized speech waveform of the original speech shown
in Fig. 3, analyzed entirely with a frame length of 20 milliseconds and then synthesized;
waveform b is a synthesized speech waveform similarly entirely analyzed with a frame
length of 10 milliseconds and then synthesized; and waveform c is a synthesized speech
waveform of the original speech shown in Fig. 3, analyzed with a variable frame length
by the circuit of the above embodiment and then synthesized. It will be seen that
waveform c has characteristics which cannot be realized in waveform a, and post-onset
characteristics of the sound in which no difference is seen between a and b are also
rendered satisfactorily.
[0022] According to the invention as described above, a speech synthesis device can be provided
in which sounds such as consonants and plosives which can only be realized with a
short frame length can be synthesized, while a substantial data compression ratio
can be achieved.
1. A speech synthesis device comprising a data memory (10) in which are stored frame
data having amplitude data, frequency data and PARCOR coefficients generated by the
PARCOR speech synthesis method and a speech synthesis circuit (15) using frame data
read out from said data memory (10) and carrying out PARCOR speech synthesis processing;
characterized in that said frame data is subject to speech analysis using selectively
at least two different frame lengths, in that according to the different frame lengths
a variable frame bit indicating the frame length selected for the generation of the
frame data is stored in said data memory (10), and in being provided with:
a discrimination means (14b) for discriminating the content of the variable frame
bit within the data read out from said data memory (10); and
a control means (14a) for controlling the frame length to be selected by said speech
synthesis circuit when carrying out speech synthesis, according to the discrimination
result of said discrimination means.
2. The speech synthesis device according to claim 1 wherein said frame data read out
from said data memory (10) is input to said speech synthesis circuit (15) by a control
signal (A1) according to said amplitude data.
3. The speech synthesis device according to claim 1 wherein said frame data read out
from said data memory (10) is converted from parallel data to serial data, then converted
again to parallel data, and then input to said speech synthesis circuit (15).
4. The speech synthesis device according to claim 3 wherein said parallel data to
be input to said speech synthesis circuit (15) is input to an input data temporary
memory circuit (16) provided in said speech synthesis circuit (15), and is input to
said speech synthesis circuit (15) by a control signal (A3) according to said variable
frame bit.
5. The speech synthesis device according to claim 1 wherein the reading out of data
from said data memory (10) is controlled by a control signal (A0) according to said
amplitude data.