[0001] The present invention relates to a silence/non-silence discrimination apparatus
adaptable to an ATM (Asynchronous Transfer Mode) communication system in which only a non-silent part of a speech signal is divided into cells before being transmitted, to a recorder for recording only the non-silent part of the speech signal, and to a circuit for extracting a recognition frame as a basic technique of speech recognition.
[0002] In an apparatus for processing only the non-silent part of the speech signal, if the silence/non-silence discrimination is not accurate, the transmitted speech is interrupted or the error rate of speech recognition increases. In the ATM communication system, moreover, the communication line cannot be used effectively. For these reasons, an accurate silence/non-silence discrimination has been required. To cope with this, there is a proposal as disclosed in Unexamined Published Japanese Patent Application No. 60-200300. This proposal is a silence/non-silence discrimination apparatus which can detect a low-level non-silent part of the speech signal, such as a word-head consonant, with fewer discrimination failures even when the signal level varies due to changes in ambient conditions and the ambient noise level is high.
[0003] Fig. 1 shows a block diagram of the silence/non-silence discrimination apparatus
as disclosed in the above-identified Japanese Application. A speech signal input through
a microphone, for example, is supplied to an energy extraction circuit 5 and a spectrum
extraction circuit 6. The energy extraction circuit 5 includes a smoothing circuit
and extracts a power (logarithmic power) as a characteristic parameter of the speech
signal every frame period of a predetermined time duration, which is an unit of time
for silence/non-silence discrimination. The spectrum extraction circuit 6 includes
three types of band-pass filters of low frequencies (250 to 600 Hz), medium frequencies
(600 to 1500 Hz), and high frequencies (1500 to 4000 Hz), and three smoothing circuits
respectively coupled with the output terminals of those filters. The circuit 6 also
extracts a power (logarithmic power) for each frequency band as a characteristic parameter
of the speech signal every frame period. The energy extraction circuit 5 and the spectrum
extraction circuit 6 form a characteristic parameter extraction circuit 13.
[0004] The output signals of the energy extraction circuit 5 and the spectrum extraction
circuit 6 are supplied to a multiplexer 7. The multiplexer 7 supplies the signal power
from the energy extraction circuit 5 and the frequency band powers from the spectrum
extraction circuit 6 to a silence/non-silence discriminator 8 in a time division manner.
The discriminator 8 discriminates each frame of the speech signal as being silent
or non-silent. Incidentally, the non-silent frame includes a voiced speech frame
and a non-voiced speech frame. The discriminator 8 is connected with a threshold value
memory 9 and a standard pattern memory 10. The memory 9 stores two threshold values
E1 and E2 that are used for determining if the frame is silent or non-silent on the
basis of the power. The memory 10 stores a coefficient of a linear discrimination
function, which is used for determining if the frame to be detected is a silent frame
or a non-voiced speech frame, a coefficient of a linear discrimination function, which
is used for determining if the frame is a silent frame or a voiced speech frame, a
standard pattern for determining if the frame is a silent frame or a non-voiced speech
frame, and a standard pattern for determining if the frame is a silent frame or a
voiced speech frame. These threshold values, coefficients, and standard patterns are previously obtained by utilizing the statistical features of a speech signal containing silent frames, voiced speech frames, and non-voiced speech frames, generated under the condition for using the silence/non-silence discrimination apparatus, and are stored in the memories.
[0005] The discriminator 8 produces a signal denoting the determination result, and supplies
it to a detector 11 for detecting the candidate frames of the starts and the ends
of the non-silent part of the speech on the basis of the determination result for
each frame. The result of the detection is supplied to a non-silent detector 12 where
the start and end of the non-silent part are finally determined.
[0006] An operation of the above mentioned prior silence/non-silence discrimination apparatus
will be described. A speech signal of each frame is transformed into a power LPW by
the energy extraction circuit 5. The same is also transformed into a power LPi (i is a parameter indicative of a frequency band and is any of 1 to 3) of each frequency
band by the spectrum extraction circuit 6. The discriminator 8 determines if the frame
is silent or non-silent, by using those four parameters LPW and LPi, the threshold
values E1 and E2 and the coefficients of the linear discrimination functions stored
in the memories 9 and 10.
[0007] For the determination, the two threshold values E1 and E2, and the power LPW are
first compared in the following way:
If LPW > E1, it is discriminated that the frame is non-silent,
if LPW < E2, it is discriminated that the frame is silent, and
if E2 ≦ LPW ≦ E1, it is discriminated that the property of the frame is "indefinite".
[0008] When the discrimination is "indefinite", another determination is made by using the following discrimination function value FX:

FX = A₁(LP₁ - LP₁*) + A₂(LP₂ - LP₂*) + A₃(LP₃ - LP₃*)   (1)

where Ai is the coefficient of the linear discrimination function and LPi* is the standard pattern, both stored in the memory 10.
[0009] The function value FX is negative for the silent frame, and is positive for the non-silent frame, which includes the voiced speech frame and the non-voiced speech frame. Calculation is made of FX when the coefficients Ai and the standard patterns LPi* are those for determining if the frame is the silent frame or the non-voiced speech frame, and calculation is made of FX when the coefficients Ai and the standard patterns LPi* are those for determining if the frame is the silent frame or the voiced speech frame. When either of the calculated FXs is positive, it is determined that the frame is the non-silent frame. In other cases, it is determined that the frame is the silent frame.
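For illustration, the two-stage decision of paragraphs [0007] to [0009] may be sketched in Python as follows. The threshold values, coefficients, and standard patterns below are hypothetical placeholders, not values from the above-identified Japanese Application.

```python
import numpy as np

# Hypothetical placeholder constants; the real values are obtained statistically.
E1, E2 = 8.0, 4.0                         # upper and lower power thresholds (log power)
A_UNVOICED = np.array([0.9, 1.1, 1.3])    # coefficients, silent vs non-voiced discriminant
A_VOICED = np.array([1.2, 0.8, 1.0])      # coefficients, silent vs voiced discriminant
STD_UNVOICED = np.array([5.0, 6.0, 7.5])  # standard band-power pattern, silent vs non-voiced
STD_VOICED = np.array([7.0, 6.5, 5.0])    # standard band-power pattern, silent vs voiced

def prior_art_decision(lpw: float, lp: np.ndarray) -> str:
    """Two-stage decision: power thresholds first, then equation (1)."""
    if lpw > E1:
        return "non-silent"
    if lpw < E2:
        return "silent"
    # "Indefinite": evaluate the two linear discrimination functions FX.
    fx_unvoiced = float(A_UNVOICED @ (lp - STD_UNVOICED))
    fx_voiced = float(A_VOICED @ (lp - STD_VOICED))
    return "non-silent" if fx_unvoiced > 0 or fx_voiced > 0 else "silent"

print(prior_art_decision(6.0, np.array([6.2, 6.8, 7.9])))
```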
[0010] The prior apparatus makes the silence/non-silence determination on the basis of a
difference between a spectral shape extracted by the spectrum extraction circuit
6 and each of standard spectral shapes of silent frame, non-voiced speech frame, voiced
speech frame, for every frame. Therefore, the apparatus can reliably discriminate the property of speech of small energy level, such as unvoiced consonants and voiced consonants. The powers of three frequency bands, low frequencies (250 to
600 Hz), medium frequencies (600 to 1500 Hz), and high frequencies (1500 to 4000 Hz),
are used as the characteristic parameters of spectral shapes. However, the selection
of the characteristic parameters has no theoretical basis and the number of the characteristic
parameters is relatively small. These lead to incorrect silence/non-silence discrimination,
failure of detecting the non-silent signal, and increase of noise.
[0011] Consider a case where the spectrum of a non-voiced speech and that of noise are shaped as indicated by the solid line and the broken line in Fig. 2, respectively. In this case, when Ai (i = 1 to 3) = 1, the function value FX has the same value for both the non-voiced speech and the noise, although the two spectral shapes are greatly different from each other. This results in incorrect silence/non-silence discrimination. The incorrect discrimination is due to the small number of characteristic parameters defining the spectral shape and the improper selection of the parameters. Further, since the parameter selection lacks a theoretical basis, the selection must be made in a trial-and-error manner. Therefore, much time and labor are consumed in selecting the parameters, but the results are frequently incorrect. If the number of the characteristic parameters is increased, the frequency of erroneous discrimination is reduced, but the work to calculate the discrimination function value FX given by equation (1) is increased. The above-mentioned Japanese Application describes that the discrimination function may be replaced by the Mahalanobis distance. However, if the Mahalanobis distance is used, the calculation work is further increased.
[0012] Accordingly, an object of the present invention is to provide a silence/non-silence
discrimination apparatus which can accurately discriminate a silent part and a non-silent
part of a speech signal, with a simple construction.
[0013] To achieve the above object, there is provided a silence/non-silence discrimination
apparatus comprising means for obtaining a plurality of characteristic parameters
from a speech signal; means for projecting the plurality of characteristic parameters
onto a priority component vector space of the characteristic parameters of a given
type of speech signal, the dimension of the priority component vector space being
smaller than the number of the plurality of characteristic parameters; and means for
discriminating whether the speech signal is silent or non-silent based on the position
of the projected point of the plurality of characteristic parameters.
[0014] According to the present invention, a priority component analysis (principal component analysis) is applied to the characteristic parameters. Therefore, the number of characteristic parameters can be reduced while minimizing the loss of the information possessed by the original characteristic parameters. The silence/non-silence discrimination apparatus according to the present invention thus has an excellent discrimination accuracy and a simple construction. With the priority component analysis, the statistical nature of the characteristic parameters is reflected in the silence/non-silence discrimination. This eliminates the trial-and-error process that is essential in the prior apparatus for obtaining optimum parameters.
[0015] Additional objects and advantages of the invention will be set forth in the description
which follows, and in part will be obvious from the description, or may be learned
by practice of the invention. The objects and advantages of the invention may be realized
and obtained by means of the instrumentalities and combinations particularly pointed
out in the appended claims.
[0016] This invention can be more fully understood from the following detailed description
when taken in conjunction with the accompanying drawings, in which:
[0017] The accompanying drawings, which are incorporated in and constitute a part of the
specification, illustrate presently preferred embodiments of the invention and, together
with the general description given above and the detailed description of the preferred
embodiments given below, serve to explain the principles of the invention.
Fig. 1 is a block diagram showing a prior silence/non-silence discrimination apparatus;
Fig. 2 is a graph showing spectral shapes for explaining the operation of the apparatus
of Fig. 1;
Fig. 3 is a block diagram for explaining a scheme of a silence/non-silence discrimination
according to the present invention;
Fig. 4 is a block diagram showing a speech cell generation apparatus which includes
a first embodiment of a silence/non-silence discrimination apparatus according to
the present invention;
Fig. 5 is a block diagram showing the first embodiment of the silence/non-silence
discrimination apparatus according to the present invention;
Fig. 6 is a flowchart for explaining a sequence of procedural steps to obtain data
to be stored in a non-silent priority component vector memory used in the first embodiment;
Fig. 7 is a graph showing a non-silent region in the non-silent priority component
vector space, which provides a reference for the silence/non-silence discrimination
in the first embodiment;
Fig. 8 is a flowchart for explaining an operation of the silence/non-silence discrimination
of the first embodiment;
Fig. 9 is a block diagram showing a second embodiment of a silence/non-silence discrimination
apparatus according to the present invention;
Fig. 10 is a flowchart for explaining an operation of the silence/non-silence discrimination
of the second embodiment;
Fig. 11 is a block diagram showing a third embodiment of a silence/non-silence discrimination
apparatus according to the present invention;
Fig. 12 is a block diagram showing a characteristic parameter projection circuit in
the third embodiment;
Fig. 13 is a block diagram showing a detection circuit in the third embodiment;
Fig. 14 is a flowchart for explaining an operation of the silence/non-silence discrimination
of the third embodiment;
Fig. 15 is a block diagram showing a fourth embodiment of a silence/non-silence discrimination
apparatus according to the present invention;
Fig. 16 is a flowchart for explaining an operation of the silence/non-silence discrimination
of the fourth embodiment;
Fig. 17 is a block diagram showing a fifth embodiment of a silence/non-silence discrimination
apparatus according to the present invention;
Fig. 18 is a block diagram showing a first example of an FIR (Finite Impulse Response) filter used in the fifth embodiment;
Fig. 19 is a block diagram showing a second example of the FIR filter in the fifth
embodiment;
Fig. 20 is a block diagram showing a matching circuit in the fifth embodiment;
Fig. 21 is a flowchart showing a sequence of procedural steps for obtaining a reference
pattern to be stored in a reference pattern memory of the matching circuit in the
fifth embodiment;
Fig. 22 is a flowchart for explaining an operation of the silence/non-silence discrimination
of the fifth embodiment;
Fig. 23 is a block diagram showing a sixth embodiment of a silence/non-silence discrimination
apparatus according to the present invention;
Fig. 24 is a block diagram showing a matching circuit in the sixth embodiment; and
Fig. 25 is a flowchart for explaining an operation of the silence/non-silence discrimination
of the sixth embodiment.
[0018] Preferred embodiments of a silence/non-silence discrimination apparatus according
to the present invention will be described with reference to the accompanying drawings.
A scheme of a silence/non-silence discrimination according to the present invention
is illustrated in Fig. 3.
[0019] Firstly, the characteristic parameters of a speech signal are previously obtained
by a known method. The characteristic parameters, which are expressed by using the spectrum in the prior art, may be expressed by using the LPC (Linear Predictive Coding) cepstrum, signal power, the number of zero-crossings, linear predictive coefficients, the auto-correlation function, DFT (Discrete Fourier Transform) coefficients, or any combination of them. In the present invention, careful selection of the number and kind of characteristic parameters is not required, and it is therefore preferable to obtain as many characteristic parameters, and of as many kinds, as possible.
[0020] Secondly, the number of characteristic parameters is reduced to such an extent that
the reduction does not adversely affect the accuracy of the silence/non-silence discrimination.
To effect this, the characteristic parameters are transformed into another type of
parameters. Then, the number of the transformed parameters is reduced. The transformation
is made such that when the transformed parameters, after the number of the parameters
is reduced, are inversely transformed into the original characteristic parameters,
an error between the inversely transformed characteristic parameters and the original
characteristic parameters is minimized.
[0021] The scheme of a reduction of the parameters will be described in more detail with
reference to Fig. 3. L number of original characteristic parameters are expressed
by xi (i = 1 to L). A vector represented by the characteristic parameters xi as elements
is expressed as X. The transformation employed is an orthogonal transformation, and its transformation matrix is expressed as "A". The transformed characteristic parameters are expressed by yi (i = 1 to L), and a vector having the characteristic parameters yi as elements is expressed as Y. Ŷ indicates the vector in which, of the transformed characteristic parameters yi, the N parameters yj (j = 1 to N) are left and the remaining (L - N) characteristic parameters (where N < L) are set to 0. An error vector "e", which is caused by reducing the number of characteristic parameters, is the difference between the vector X of the original characteristic parameters and the vector A⁻¹Ŷ that results from the inverse transformation of the transformed characteristic parameter vector Ŷ with the reduced number of parameters; it is mathematically expressed as follows:

e = X - A⁻¹Ŷ = A⁻¹(Y - Ŷ)   (2)
[0022] The error due to the parameter reduction can be minimized by using such a transformation as to minimize the square mean value of the above error, σr² = E[eᵗe] (where ᵗ denotes the transposition of a matrix and E[ ] indicates an expected value).
[0023] The transformation that minimizes the square mean value of the difference expressed by equation (2) is known as the KL (Karhunen-Loève) transformation, that is, a transformation whose transformation matrix "A" has the eigen vectors of an auto-correlation matrix of the parameters xi as its row vectors. The eigen vectors are equivalent to the priority component vectors resulting from the priority component analysis of the parameters xi. In descending order of the eigen values, the eigen vectors are made to correspond respectively to a first priority component vector, a second priority component vector, and so on.
[0024] Assuming that the M characteristic parameter vectors are expressed as Xi (i = 1 to M), the transformation to minimize the square mean value of the difference expressed by equation (2), for each vector Xi, is defined by the eigen vectors that are obtained by the priority component analysis of the auto-correlation matrix of the characteristic parameter vectors given by the following equation:

R = (1/M) Σ XiXiᵗ   (3)

where the summation is over i = 1 to M, and xi1, xi2, ... xiL are the elements of the characteristic parameter vector Xi = (xi1, xi2, ... xiL)ᵗ.
[0025] As seen from equation (3), the auto-correlation matrix R is the average, in the L² dimension, of the auto-correlation matrices Ri of the individual characteristic parameter vectors Xi given by the following equation (4); that is, R is their center-of-gravity:

Ri = XiXiᵗ   (4)
[0026] If a single auto-correlation matrix R is to represent the auto-correlation matrices Ri (i = 1, 2, ... M) of the M characteristic parameter vectors Xi, then the auto-correlation matrix R is the matrix that minimizes the square mean error E between the matrices Ri and R, given by the following equation (5):

E = Σ Σ Σ (Ri(k, l) - R(k, l))²   (5)

where the summations are over i = 1 to M and over k, l = 1 to L.
[0027] The reason for this is as follows. The above relation is partially differentiated with respect to R(k, l), and the result of the differentiation is set to 0. Then, the following equations (6) and (7) are obtained:

∂E/∂R(k, l) = -2 Σ (Ri(k, l) - R(k, l)) = 0   (6)

R(k, l) = (1/M) Σ Ri(k, l)   (7)

where the summations are over i = 1 to M, and R(k, l) and Ri(k, l) are the (k, l) elements of the auto-correlation matrices R and Ri.
[0028] By KL-transforming the characteristic parameter vectors Xi by using the eigen vectors of such an auto-correlation matrix R, the priority component analysis is realized.
[0029] The operation of KL-transforming the L characteristic parameters xi (i = 1 to L) and then reducing the number of the transformed characteristic parameters is equivalent to projecting the characteristic parameter vector X onto an N-dimension priority component vector space whose coordinate axes are the first to N-th priority component vectors. With this projection, the number of the original characteristic parameters may be reduced while minimizing the error due to the parameter reduction.
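The error-minimizing property of this projection can be checked numerically. The following sketch, a minimal illustration using randomly generated stand-ins for the characteristic parameters, applies the KL transformation, truncates to N components as in equation (2), and confirms that the square mean error equals the sum of the discarded eigen values.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, N = 16, 20000, 3
X = rng.normal(size=(M, L)) @ rng.normal(size=(L, L))   # correlated stand-in parameter vectors

R = X.T @ X / M                            # auto-correlation matrix of the parameters
eigvals, eigvecs = np.linalg.eigh(R)
idx = np.argsort(eigvals)[::-1]
A = eigvecs[:, idx].T                      # KL transformation matrix (rows: eigen vectors)

Y = X @ A.T                                # transformed parameters yi
Y_hat = Y.copy()
Y_hat[:, N:] = 0.0                         # keep only the first N components
E = X - Y_hat @ A                          # error vectors e = X - A^(-1)Y_hat, equation (2)

# The square mean error equals the sum of the discarded eigen values,
# the minimum achievable by any orthogonal transformation.
print(np.mean(np.sum(E**2, axis=1)), eigvals[idx][N:].sum())
```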
[0030] Fig. 4 shows the block diagram of the speech cell generation device which is used
in the ATM communication system and incorporates the silence/non-silence discrimination
apparatus based on the above mentioned silence/non-silence discrimination system.
A speech signal is supplied to a sound encoder 41 and a noise encoder 42, where it
is encoded. The coding rates of the encoders 41 and 42 are different from each other,
and the coding rate or the bit rate of the encoder 41 is higher than that of the encoder
42. An ADPCM (Adaptive Differential Pulse Code Modulation) coding system is used for
the coding system of the encoders 41 and 42. One of the output signals of the encoders
41 and 42 is supplied through a selector 45 to a cell generation circuit 46. When
receiving the coded speech signal, the circuit 46 generates corresponding cells. In
response to a signal representative of the discrimination result derived from a silence/non-silence
discrimination device 43, the selector 45 selects the output of the sound encoder
41 when the non-silent signal is detected and the output of the noise encoder 42 when
the silent signal is detected. The noise is encoded and transmitted in order to impart
a natural feeling to the transmitted speech. The transmission of the noise scarcely degrades the efficiency of the line usage, because its bit rate is low. Therefore,
when the silence/non-silence discrimination device 43 detects a silent part of the
speech signal, the selector 45 connects to the noise encoder 42.
[0031] The noise (silent frame) is transmitted to the receiver only in the initial stage,
e.g., when connection to the communication line starts, and subsequently the transmission
of the noise is stopped, whereas the transmitted noise is repetitively reproduced
in the receiver. When the transmitter detects a change in the noise, the noise is
transmitted again. Alternatively, only the sound signal may be transmitted, with the noise never transmitted. In this case, the unnaturalness involved in the transmitted sound must be tolerated; if necessary, white noise may be inserted on the receiving side.
[0032] Fig. 5 is a block diagram showing the first embodiment of the silence/non-silence
discrimination apparatus according to the present invention. In this embodiment, the
LPC cepstrum is used for the characteristic parameter of the speech signal. Therefore,
an LPC cepstrum calculator 51 is coupled with an input terminal for the speech signal.
The calculator 51 calculates the LPC cepstrums ci (i = 1, 2, ... L) of the speech signal for each frame of a fixed time. The number L of the parameters indicates the order of analysis and is set to, e.g., 16. In this invention, the number of the parameters is reduced after the priority component analysis. Accordingly, the number L may be more than 16. For the cepstrum calculation, reference is made to Alan V. Oppenheim and Ronald W. Schafer, "Digital Signal Processing" (Prentice-Hall, Englewood Cliffs, NJ, 1975).
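As a concrete illustration of the calculator 51, the following sketch computes an LPC cepstrum by the textbook autocorrelation method (Levinson-Durbin recursion followed by the standard LPC-to-cepstrum recursion). The frame length, sampling rate, and test signal are illustrative choices, not values prescribed by the embodiment.

```python
import numpy as np

def lpc_cepstrum(frame: np.ndarray, order: int = 16) -> np.ndarray:
    """LPC cepstrum c1 ... cL of one windowed speech frame."""
    # Short-time autocorrelation r[0..order].
    r = np.array([frame[: len(frame) - k] @ frame[k:] for k in range(order + 1)])
    # Levinson-Durbin recursion for the predictor s[n] ~ sum_k a[k] s[n-k].
    a = np.zeros(order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - a[1:i] @ r[1:i][::-1]) / err
        a[1:i], a[i] = a[1:i] - k * a[1:i][::-1], k
        err *= 1.0 - k * k
    # Standard LPC-to-cepstrum recursion.
    c = np.zeros(order + 1)
    for n in range(1, order + 1):
        c[n] = a[n] + sum((m / n) * c[m] * a[n - m] for m in range(1, n))
    return c[1:]

# Example: a 30 ms frame of a synthetic 8 kHz signal.
rng = np.random.default_rng(0)
t = np.arange(240) / 8000.0
frame = (np.sin(2 * np.pi * 200 * t) + 0.05 * rng.normal(size=t.size)) * np.hamming(240)
print(lpc_cepstrum(frame))
```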
[0033] It is assumed that the vector formed by the cepstrums c₁, c₂, ... cL is C. The vector C is input to an inner product calculator 53. The inner product calculator 53, together with a non-silent priority component vector memory 52, forms a characteristic parameter projection circuit 54. The memory 52 stores the first to third priority component vectors V1, V2, and V3 that are obtained by applying a priority component analysis to the LPC cepstrums of the non-silent parts of the speech signal collected under the condition for using the silence/non-silence discrimination apparatus. Here, the elements of the priority component vector Vi (i = 1 to 3) are denoted as vij (j = 1, 2, ... L).
[0034] A sequence of procedural steps to obtain the non-silent priority component vectors
to be stored in the non-silent priority component vector memory 52 is shown in the
form of a flowchart in Fig. 6. In step #1, learning speech data are collected under
the condition for using the discrimination apparatus. In step #2, only the non-silent
data are extracted from all of the collected speech data. In step #3, the LPC cepstrums
of the non-silent data are calculated. In step #4, the priority component analysis
is applied to the LPC cepstrums. More exactly, an auto-correlation matrix of the LPC
cepstrum vector is calculated. In step #5, the eigen values and the eigen vectors
of the matrix are calculated. In step #6, the eigen vectors corresponding to the eigen
values in the descending order from the largest absolute value of the eigen values
to the smallest absolute value are set to first, second, ... N-th (here, N = 3) priority
component vectors. As a result, the non-silent priority component vectors V1, V2,
and V3 are obtained.
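Steps #3 to #6 may be sketched as follows, with randomly generated stand-ins for the cepstra of the collected non-silent learning data (steps #1 and #2).

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.normal(size=(5000, 16)) @ rng.normal(size=(16, 16))  # stand-in non-silent cepstra, L = 16

R = C.T @ C / len(C)                      # step #4: auto-correlation matrix of the cepstra
eigvals, eigvecs = np.linalg.eigh(R)      # step #5: eigen values and eigen vectors
idx = np.argsort(np.abs(eigvals))[::-1]   # step #6: descending order of absolute eigen value
V1, V2, V3 = eigvecs[:, idx[:3]].T        # first to third non-silent priority component vectors
print(round(V1 @ V2, 12), round(V1 @ V1, 12))   # orthonormal basis: 0.0 and 1.0
```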
[0035] Returning to Fig. 5, the inner product calculator 53 calculates the inner products of the cepstrum vector C and the priority component vectors Vi to obtain the projected point Q of the LPC cepstrum vector C = (c₁, c₂, ... cL)ᵗ in the three-dimensional vector space formed of the first to third priority component vectors V1, V2, and V3, in the following way:

qi = CᵗVi = c₁vi1 + c₂vi2 + ... + cLviL (i = 1 to 3)   (8)

where qi is the component of the projected point Q in the Vi direction.
[0036] The output signal of the inner product calculator 53 is supplied to a silence/non-silence
discriminator 56. A non-silent region parameter memory 55 storing parameters defining
a non-silent region in the non-silent priority component vector space is also connected
to the discriminator 56. Assuming that the non-silent region takes the form of a rectangular parallelepiped as shown in Fig. 7, the region-defining parameters are V1L, V1H, V2L, V2H, V3L, and V3H, which define the upper and lower limits along the directions of the coordinate axes. Those parameters are previously obtained by statistically processing the LPC cepstrums of the non-silent parts and the silent parts (including noises) of the speech signal that are collected under the condition for using the silence/non-silence discrimination apparatus. The discriminator 56 determines if the frame is silent or non-silent based on whether the projected point is within the non-silent region of the rectangular parallelepiped of Fig. 7. Only when V1L ≦ q₁ ≦ V1H, V2L ≦ q₂ ≦ V2H, and V3L ≦ q₃ ≦ V3H, the discriminator 56 determines that the frame is non-silent. In other cases, it determines that the frame is silent. A sequence of the procedural steps to determine if the frame is silent or non-silent by the discriminator 56 is shown in Fig. 8.
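A compact sketch of this decision follows; the vectors V1 to V3 and the region limits stand in for the learned contents of the memories 52 and 55 and are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
L = 16
V = np.linalg.qr(rng.normal(size=(L, 3)))[0].T   # stand-in orthonormal rows V1, V2, V3 (memory 52)
limits = np.array([[-0.5, 0.9],                  # (V1L, V1H)
                   [-0.4, 0.7],                  # (V2L, V2H)
                   [-0.6, 0.6]])                 # (V3L, V3H) (memory 55)

def discriminate(c: np.ndarray) -> str:
    q = V @ c                                    # inner products, equation (8)
    inside = np.all((limits[:, 0] <= q) & (q <= limits[:, 1]))
    return "non-silent" if inside else "silent"  # inside the rectangular parallelepiped of Fig. 7

print(discriminate(rng.normal(size=L)))
```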
[0037] In the above description, the determination of the silent or non-silent frame depends
on whether or not the projected point is within the non-silent region in the non-silent
priority component vector space. Alternatively, it may be done by using a distance
between the center-of-gravity of the non-silent region and the projected point Q.
In this case, the center-of-gravity G of the non-silent region is set at the coordinates (g₁, g₂, g₃). The distance D given below is compared with a predetermined threshold value Th. If D ≦ Th, it is determined that the frame is non-silent, and if D > Th, it is determined that the frame is silent:

D = A₁(q₁ - g₁)² + A₂(q₂ - g₂)² + A₃(q₃ - g₃)²   (9)

where Ai is a weighting coefficient.
[0038] According to the first embodiment of the silence/non-silence discrimination apparatus,
the L number of characteristic parameters are projected onto the vector space defined
by the non-silent priority component vectors. The determination as to whether the
frame is silent or non-silent depends on whether or not the projected point is within
the non-silent region. This brings about the following advantages. The number of the
characteristic parameters used for the actual determination is reduced. The calculation
work is reduced, accordingly. The circuit for the determination has a simple construction.
Since the parameters are projected onto the priority component vector space, the reduction
of the number of the parameters little damages the accuracy of the silence/non-silence
discrimination. Since the discrimination depends on the region, even when the non-silent region and the silent region occupy peculiar regions in the priority component vector space, highly accurate silence/non-silence discrimination is ensured. In the silence/non-silence
discrimination based on the distance, the distance definition determines a shape of
the non-silent region. For example, if in the equation (9), Ai = 1 (i = 1 to 3), the
region satisfying D ≦ Th is inside a sphere. Thus, in the distance-dependent discrimination,
it is impossible to flexibly select a shape of the non-silent region. On the other
hand, the discrimination dependent on the region allows the non-silent region to
take any shape.
[0039] The first embodiment is not limited to the above description. It is possible to project
the characteristic parameters onto the silent priority component vectors, not the
non-silent priority component vectors. It is possible to discriminate whether or not
the projected point is within the silent region instead of whether or not the projected
point within the non-silent region. The LPC cepstrums as the characteristic parameters
may be replaced with any of the spectrum, signal power, the number of zero-crossings,
linear predictive coefficient, auto-correlation function, and DFT coefficient, and
any of their combinations which are used in the prior art. The specific figures, such
as the number of the characteristic parameters and the dimension of the priority component
spectrum space, may be appropriately selected.
[0040] Fig. 9 is a block diagram showing a second embodiment of a silence/non-silence discrimination
apparatus according to the present invention. An LPC cepstrum calculator 62 is connected
to the input terminal, and calculates the LPC cepstrums ci (i = 1, 2, ... L) of an input speech signal for each frame, as in the first embodiment.
The cepstrums are supplied to inner product calculators 63 and 64. The inner product
calculators 63 and 64 are respectively coupled with a non-silent priority component
vector memory 65 and a silent priority component vector memory 66. The calculators
63 and 64, and the memories 65 and 66 form a characteristic parameter projection
circuit 67. Thus, in the second embodiment, the characteristic parameters (cepstrums)
are projected onto both the non-silent priority component vector space and the silent
priority component vector space. The memory 65 stores first to third priority component
vectors that are obtained by applying a priority component analysis to the LPC cepstrums
of the non-silent parts of the speech signal collected under the condition for using
the discrimination apparatus, as in the first embodiment. The memory 66 stores first
to third priority component vectors that are obtained by applying a priority component
analysis to the LPC cepstrums of the silent parts of the speech signal collected under
the condition for using the discrimination apparatus. The inner product calculators
63 and 64 each obtain a projected point Q of the LPC cepstrum vector C in a three
dimensional vector space formed by the first to third priority component vectors V1,
V2, and V3, like the inner product calculator 53 in the first embodiment.
[0041] The output signals of the inner product calculators 63 and 64 are supplied to a non-silence
detector 68 and a silence detector 69, respectively. The detectors 68 and 69 are respectively
coupled with a non-silent region parameter memory 70 and a silent region parameter
memory 71. The memory 70 stores the parameters defining a non-silent region in the
non-silent priority component vector space. The memory 71 stores the parameters defining
a silent region in the silent priority component vector space. The parameters define
the upper and lower limits along the coordinate axes, as in the first embodiment.
Also as in the first embodiment, the non-silent detector 68 compares the coordinates
of the projected point Q with those parameters, and produces a detection signal of
"1" level when the projected point exists within the non-silent region. The silence
detector 69 compares the the coordinates of the projected point Q with those parameters,
and produces a detection signal of "1" level when the projected point exists within
the silent region. The output signals of the detectors 68 and 69 are supplied to a
silence/non-silence discriminator 72. The discriminator 72 finally determines that
the speech frame is silent or non-silent in the following way.
[0042] The frame is determined as being:
(1) Non-silent when the output signal of the non-silence detector 68 is "1" level
and the output signal of the silence detector 69 is "0" level;
(2) Silent when the output signal of the non-silence detector 68 is "0" level and
the output signal of the silence detector 69 is "1" level;
(3) Non-silent when the output signal of the non-silence detector 68 is "1" level
and the output signal of the silence detector 69 is "1" level; and
(4) Non-silent when the output signal of the non-silence detector 68 is "0" level
and the output signal of the silence detector 69 is "0" level.
[0043] Thus, only when both the detectors 68 and 69 determine that the frame is silent,
the discriminator 72 determines that the frame is silent. In other cases, the discriminator
72 determines that the frame is non-silent. A flow of the determination is shown
in Fig. 10.
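The above table reduces to a single rule, sketched below: the frame is silent only in case (2), where the non-silence detector 68 outputs "0" and the silence detector 69 outputs "1".

```python
def discriminate(nonsilence_detected: bool, silence_detected: bool) -> str:
    """Final decision of the discriminator 72 from the outputs of detectors 68 and 69."""
    if not nonsilence_detected and silence_detected:
        return "silent"          # case (2)
    return "non-silent"          # cases (1), (3), and (4)

for ns in (False, True):
    for s in (False, True):
        print(int(ns), int(s), "->", discriminate(ns, s))
```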
[0044] As described above, according to the second embodiment, the silence/non-silence determination
is based on the positions of the projected points in both the non-silent vector space
and the silent vector space. As a result, even if the input speech signal has an LPC cepstrum pattern different from that of the non-silent parts of the speech signal previously collected for obtaining the non-silent priority component vectors, so that the non-silence detector 68 fails to detect that the frame is non-silent, the discriminator 72 can still finally determine that the frame is non-silent, so long as the silence detector 69 does not detect silence. Therefore, the silence/non-silence discrimination apparatus according to the second embodiment is prevented from failing to detect the non-silent components.
[0045] It is evident that the second embodiment may also be modified, like the first embodiment.
That is, it is possible to detect the silence/non-silence by using the silent region in the non-silent priority component vector space as the reference region, or by using the non-silent region in the silent priority component vector space as the reference region.
[0046] In the first and the second embodiments as mentioned above, the silence/non-silence
discrimination is done depending on whether or not the projected point of the cepstrum
vector onto the non-silence/silence priority component vector space is within the
non-silent/silent region. Since non-silent speech has many categories, there are some non-silent speeches that are not discriminated by the first and the second embodiments. For example, the non-silent characteristic parameters of a vowel are different from those of a consonant. The same is true between a male voice and a female voice. Further, the characteristic parameters differ even among consonants, depending on the phoneme. Therefore, to improve the silence/non-silence discrimination accuracy, the priority component vectors are obtained for a plurality of categories, silence or non-silence is determined for each category, and the final determination is made on the basis of the results of the silence/non-silence determination for all the categories.
[0047] It is clear that the larger the number of the categories is, the higher the discrimination
accuracy is. Use of a large number of categories as required in the speech recognition
technique would increase the size of the discrimination apparatus. Therefore, it is
necessary to limit the number of categories to a proper one. How to classify the
parameters into categories and to limit the number of the categories will be described.
[0048] The characteristic parameters of the non-silent part of the speech signal are classified into a predetermined number of categories, and an auto-correlation matrix representative of each category is obtained. More exactly, the so-called LBG algorithm is used.
[0049] A number of characteristic parameter vectors Xi (i = 1 to M) of the non-silent parts
of the speech signal that are collected under the condition for using the discrimination
apparatus are obtained. The auto-correlation matrix Ri of the characteristic parameter
vector Xi is calculated by using the equation (4). By applying the LBG algorithm to
a training vector whose element is the row vector of the matrix Ri, a predetermined
number of representative vectors Aj (j = 1 to N) and a partition P(Aj) are obtained.
The characteristic parameter vector Xi, which is used for obtaining the auto-correlation
matrix Ri belonging to the partition P(Aj), is regarded as a member of the j-th category.
Further, the matrix Rj formed of elements of the representative vectors Aj is regarded
as an auto-correlation matrix representative of the j-th category. The LBG algorithm is discussed in detail in Y. Linde, A. Buzo, and R. M. Gray, "An Algorithm for Vector Quantizer Design", IEEE Trans. Commun., Vol. COM-28, No. 1 (January 1980), pp. 84-95.
[0050] What follows is a description of how to classify the characteristic parameters into a plurality of categories, and how to obtain the priority component vectors for each category and a reference region for the silence/non-silence discrimination in the priority component vector space.
[0051] Firstly, a plurality of LPC cepstrum vectors Ci (i = 1, 2, ... M) previously collected
under the condition for using the discrimination apparatus, are obtained. Then, an
auto-correlation matrix of the LPC cepstrum vectors Ci is calculated by using the
following equation. Let a P² dimension vector whose elements are the row vectors of
the matrix Ri be a training vector Ti.

Ti = (ci₁², ci₁ci₂, ... ci₁ci
p, ci₂ci₁, ci₂², ... ci₂ci
p, ci
pci₁, ci₁ci₂, ... ci
p²)
t (11)
where ci₁, ci₂, ... ci
p are elements of the LPC cepstrum vector Ci. The training vector Ti is obtained by
obtaining N number of representative vectors Yj (j = 1, 2, ... N) and a partition
P(Aj) by using an LBG algorithm in the following manner.
Step 1: Initial setting
[0052] The values of the following items are initially set:
[0053] The number N of the representative vectors, the threshold value ε of the distortion (the square mean value of the error between a representative vector and each training vector), the initial set A₀ of the representative vectors, and the training vectors Ti (i = 1, 2, ... M) are initially set. Let m = 0, and set D₋₁ to a sufficiently large value.
Step 2: Calculation of Minimum Mean Distortion
[0054] Such a partition P(Am) = {Si}, i = 1, 2, ... N as to provide the minimum mean distortion for a given set Am = {Yj: j = 1, 2, ... N} of representative vectors is obtained by using the training vectors Ti. For every training vector T belonging to the divided region Si, d(T, Yi) ≦ d(T, Yj) (j = 1, 2, ... N) holds. Here, d(T, Yi) is the distortion between T and Yi, and it is defined as the following square error:

d(T, Yi) = Σ (T(R) - Yi(R))²   (12)

where the summation is over the elements, and T(R) and Yi(R) are the R-th elements of the vectors T and Yi.
[0055] The minimum mean distortion due to the partition P(Am) is calculated in the following way:

Dm = (1/M) Σ minⱼ d(Ti, Yj)   (13)

where the summation is over i = 1 to M.
Step 3: Convergence Check
[0056] If (Dm-1 - Dm)/Dm < ε, the processing is stopped, and Am is used as the set of the representative vectors.
Step 4: Repeat
[0057] Let Am+1 be the set of representative vectors resulting from the present partition (the centers of gravity of the respective divided regions), and let m be m + 1. Then, the procedure returns to step 2.
[0058] In this instance, N = 10, ε = 0.01, and M = 10000.
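A minimal sketch of this LBG procedure follows, with random stand-ins for the training vectors. The guard that keeps the representative of an empty region unchanged is an implementation detail assumed here, not specified above.

```python
import numpy as np

rng = np.random.default_rng(3)
p, M, N, eps = 4, 2000, 10, 0.01
C = rng.normal(size=(M, p))                                # stand-in LPC cepstrum vectors Ci
T = (C[:, :, None] * C[:, None, :]).reshape(M, p * p)      # Ti: row-major elements of Ci Ci^t

Y = T[rng.choice(M, size=N, replace=False)]                # step 1: initial representatives
D_prev = np.inf
while True:
    d = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)   # distortions d(Ti, Yj), equation (12)
    label = d.argmin(axis=1)                                 # step 2: partition P(Am)
    D = d[np.arange(M), label].mean()                        # minimum mean distortion, equation (13)
    if (D_prev - D) / D < eps:                               # step 3: convergence check
        break
    Y = np.stack([T[label == j].mean(axis=0) if np.any(label == j) else Y[j]
                  for j in range(N)])                        # step 4: new representatives
    D_prev = D

R_rep = Y.reshape(N, p, p)   # representative auto-correlation matrices, equation (14)
print(R_rep.shape)
```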
[0059] Through the procedural processing as mentioned above, ten partitions Si (i = 1 to 10) are obtained and treated as ten categories. An LPC cepstrum vector Cj forming a training vector Tj belonging to the partition Si constitutes a member of the i-th category. Further, the following matrix Ri, resulting from the rearrangement of the elements of the representative vector Yi into p rows and p columns, serves as the representative auto-correlation matrix of the i-th category:

Ri(k, l) = Yi((k - 1)p + l), k, l = 1, 2, ... p   (14)
[0060] In this manner, the auto-correlation matrices of the characteristic parameter vectors
are classified into a predetermined number of categories and the representative auto-correlation
matrix of each category is obtained. Since the LBG algorithm is used, the square mean value of the error caused when the auto-correlation matrices Ri (i = 1, 2, ... M) are represented, or approximated, by the representative auto-correlation matrices Rj (j = 1, 2, ... N) (where N < M) is minimized.
[0061] Then, the priority component vector of each category is obtained by applying the
priority component analysis to the representative auto-correlation matrix Rj of each
category. For each category, a discrimination region to determine whether or not characteristic parameters belong to the category is set up in the priority component vector space by projecting the parameter vectors belonging to the category onto the priority component
vector space formed of the priority component vectors for the category. The silence/non-silence
discrimination is performed in a manner that the characteristic parameter vector
obtained for each category is projected onto the priority component vector space of
each category, and the projected point is compared with the predetermined discrimination
region in the vector space.
[0062] A block diagram of the third embodiment as mentioned above is shown in Fig. 11.
As in the first and second embodiments, the input terminal is coupled to an LPC cepstrum
calculator 82, which calculates the LPC cepstrums ci (i = 1, 2, ... L) as characteristic
parameters for each frame. The calculated cepstrums are supplied to characteristic
parameter projection circuits 84a to 84j, which are respectively provided for categories
#1 to #10. In this embodiment, the number of categories is 10. Each circuit 84 performs the projection for its own category. An example of the characteristic
parameter projection circuit 84a is shown in Fig. 12. The projection circuit 84a is
formed of a vector memory 92 for storing priority component vectors for the category
#1 and an inner product calculator 94 for calculating the inner product of the priority
component vectors of the category #1 stored in the memory 92 and the characteristic
parameter vector. Thus, the projection circuits 84a to 84j respectively project the
characteristic parameter vectors onto the priority component vector spaces for the
categories #1 to #10, thereby to obtain projected points.
[0063] The output signals from the projection circuits 84a to 84j are respectively supplied
to detection circuits 86a to 86j. Each of the detection circuits 86a to 86j determines
whether the frame is non-silent or silent on the basis of the coordinates of the projected
point. An example of the detection circuit 86a is shown in Fig. 13. As shown, the
detection circuit 86a is formed of a region parameter memory 102 for storing the parameters
defining a non-silent region of a category #1 in a priority component vector space,
and a non-silence detector 104. If the priority component vector space is a three
dimensional space, the non-silent region takes the form of a rectangular parallelepiped
as shown in Fig. 7. The parameters define the upper and the lower limits along the
respective coordinate axes. The non-silence detector 104 produces a detection signal
of "1" level when the projected point is within the non-silent region, and produces
a detection signal of "0" level in other cases. The output signals from the detection
circuits 86a to 86j are supplied to a silence/non-silence discriminator 88. When at
least one of the detection circuits 86a to 86j produces a detection signal of "1"
level, the discriminator 88 decides that the frame is non-silent. A sequence of procedural
steps to make the decision is shown in Fig. 14.
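The per-category decision may be sketched as follows; the ten sets of priority component vectors and region limits are hypothetical stand-ins for the learned data of the memories of Figs. 12 and 13.

```python
import numpy as np

rng = np.random.default_rng(4)
L, n_categories = 16, 10
V = [np.linalg.qr(rng.normal(size=(L, 3)))[0].T for _ in range(n_categories)]     # vector memories 92
limits = [np.sort(rng.normal(size=(3, 2)), axis=1) for _ in range(n_categories)]  # region memories 102

def discriminate(c: np.ndarray) -> str:
    for Vk, lim in zip(V, limits):
        q = Vk @ c                                     # projection for this category
        if np.all((lim[:, 0] <= q) & (q <= lim[:, 1])):
            return "non-silent"                        # some detection circuit outputs "1"
    return "silent"                                    # all detection circuits output "0"

print(discriminate(rng.normal(size=L)))
```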
[0064] The priority component vector of each category is obtained by applying the priority
component analysis to the matrix Ri. The parameters defining the region of each category
in a priority component vector space are previously obtained in a manner that for
each category, an LPC cepstrum vector belonging to that category is projected onto
the priority component vector space of the category.
[0065] As seen from the foregoing description, the third embodiment makes a silence/non-silence
discrimination in a manner that the characteristic parameters are projected onto
the priority component vector space of each category and the final discrimination
is made based on the judgments for all the categories. Therefore, the third embodiment has the advantage of an improved accuracy of the silence/non-silence discrimination, in addition to the advantages of the first and the second embodiments. Further, in the third embodiment, the LBG algorithm is used for classifying the characteristic parameter vectors into categories and for obtaining the priority component vectors for each category. Accordingly, the auto-correlation matrices of the M characteristic parameter vectors are classified into a smaller number of categories, leading to the improvement of the silence/non-silence discrimination accuracy. Incidentally, the modifications
of the first and the second embodiments are allowed also in the third embodiment.
[0066] In the above-mentioned embodiments, when the projected point of the characteristic parameters is outside the decision reference region, the decision opposite to the one made when the projected point is inside the region is instantly made. In such a case, if a decision based on another method is applied again, the decision accuracy will be further improved. This approach is realized by a silence/non-silence discrimination
apparatus shown in Fig. 15, which is a fourth embodiment of the present invention.
As in the previous embodiments, a speech signal is input to an LPC cepstrum calculator
122. The calculator 122 calculates the LPC cepstrums ci (i = 1, 2, ..., L) for each
frame. The cepstrums are supplied to an inner product calculator 124, which is coupled
with a non-silent priority component vector memory 126. The inner product calculator
124 and the non-silent priority component vector memory 126 form a characteristic
parameter projection circuit 128. In this embodiment, the characteristic parameters
are projected onto a non-silent priority component vector space. The inner product
calculator 124 calculates the coordinates of a projected point Q. The non-silent priority
component vector memory 126 stores the vectors of first to third priority components
that result from a priority component analysis of the LPC cepstrums of a non-silent
part of the speech signal as previously collected under the condition for using the
discrimination apparatus.
[0067] The output signal of the inner product calculator 124 is supplied to a detection
circuit 130. The detection circuit 130 is coupled with a non-silent region parameter
memory 132 and a silent region parameter memory 134. The memory 132 stores parameters
defining a non-silent region in the non-silent priority component vector space. The
memory 134 stores parameters defining a silent region in the non-silent priority component
vector space. When the projected point Q is inside the non-silent region, the detection
circuit 130 decides that the frame is non-silent. When the projected point Q is inside
the silent region, the detection circuit 130 decides that the frame is silent. When
the projected point Q is outside both the silent region and the non-silent region,
the detection circuit 130 decides that the property of the frame is indefinite.
[0068] The output signal of the detection circuit 130 is supplied to a silence/non-silence
discriminator 136. The output signal of the discriminator 136 is output as a final
decision result and is stored in a discrimination result memory 140. The memory 140
stores the discrimination results for at least three past frames. The data derived
from the memory 140 is supplied to a conditional probability table 138. The table
138 stores a probability of silence or non-silence as predicted on the basis of the
discrimination results for the three past frames, viz., a conditional probability.
Assuming that the discrimination result for the present frame is Di, and the discrimination results for the three past frames are Di-1, Di-2, and Di-3, the conditional probability P is given as follows:

P(Di | Di-1, Di-2, Di-3) = P(Di, Di-1, Di-2, Di-3)/P(Di-1, Di-2, Di-3)   (15)

where Di is 1 when the i-th frame is non-silent, and is 0 when it is silent. P(Di, Di-1, Di-2, Di-3) and P(Di-1, Di-2, Di-3) are previously obtained by probability calculations on the basis of four consecutive frames and three consecutive frames, respectively. For example, P(0, 0, 0) represents the probability that the past three frames are silent. The labeled frames are prepared in such a manner that each frame of the speech signal collected under the condition for using the discrimination apparatus is labeled as silent or non-silent while its waveform and spectrum are observed.
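The construction and use of the conditional probability table 138 may be sketched as follows; the label sequence is a hypothetical stand-in for the labeled learning frames.

```python
from itertools import product

# 1: non-silent, 0: silent; a hypothetical stand-in for the labeled learning frames.
labels = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0] * 100

# Joint counts over every run of four consecutive frames (Di-3, Di-2, Di-1, Di).
counts4 = {k: 0 for k in product((0, 1), repeat=4)}
for i in range(3, len(labels)):
    counts4[tuple(labels[i - 3 : i + 1])] += 1

def conditional(prev3: tuple, present: int) -> float:
    """P(present | prev3) = P(present, prev3) / P(prev3), equation (15)."""
    joint = counts4[prev3 + (present,)]
    marginal = counts4[prev3 + (0,)] + counts4[prev3 + (1,)]
    return joint / marginal if marginal else 0.0

# When the detection circuit 130 reports "indefinite", the higher of the
# two conditional probabilities decides the frame.
prev = (1, 0, 0)   # (Di-3, Di-2, Di-1)
print("P(non-silent | history) =", conditional(prev, 1))
print("P(silent     | history) =", conditional(prev, 0))
```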
[0069] A conditional probability that the present frame will be non-silent and a conditional
probability that the present frame will be silent are input to the silence/non-silence
discriminator 136. When the decision result from the detection circuit 130 denotes that the property of the frame is indefinite, the discriminator 136 compares the two conditional probabilities, and employs the higher of the two. The procedural flow
till the final decision is reached is shown in Fig. 16.
[0070] According to the fourth embodiment, when the decision using the decision region in
the priority component vector space fails to provide a positive decision result,
another decision is made on the basis of the conditional probability derived from the learning data, whereas each previous embodiment unconditionally provides
the negative decision result. That is, the fourth embodiment employs two steps of
decisions for the silence/non-silence discrimination, thereby improving the discrimination
accuracy. Use of the conditional probability implies use of the knowledge on the speech
signal that a transition of the property of the frames, non-silence → silence → non-silence
→ silence, is a rare case. Therefore, a probability of occurrence of mistaken decisions
on the small-power consonants of voiced speech or non-voiced speech is reduced. Further, omission of the beginnings and the ends of words and addition of noise occur infrequently. As a matter of course, the modifications of the previous embodiments are allowed
also in the fourth embodiment.
[0071] While in the previous embodiments, the discrimination of silence/non-silence is
based on the position of the projected point of the characteristic parameter, it may
be done on the basis of a time variation of the projected point. This approach further
improves the discrimination accuracy, and is realized as shown in Fig. 17, which shows
a fifth embodiment of a silence/non-silence discrimination apparatus according to
the present invention. As in the previous embodiments, an LPC cepstrum calculator
152 is connected to the input terminal. The calculator 152 calculates the LPC cepstrums
ci (i = 1, 2, ... L) for each frame. The cepstrums are supplied to an inner product
calculator 154, which is coupled with a non-silent priority component vector memory
156. The inner product calculator 154 and the non-silent priority component vector
memory 156 form a characteristic parameter projection circuit 158. The non-silent
priority component vector memory 156 stores the vectors of first to third priority
components that result from a priority component analysis of the LPC cepstrums of
the non-silent part of the speech signal as previously collected under the condition
for using the discrimination apparatus. The inner product calculator 154 calculates
the coordinates of a projected point Q in a three-dimensional space whose axes respectively
consist of first to third priority component vectors of the non-silent cepstrums.
[0072] The output signal of the inner product calculator 154 is supplied to a finite impulse
response (FIR) digital filter 160. Examples of the filter 160 are shown in Figs.
18 and 19. An FIR filter of order p is shown in Fig. 18, and a filter of order two is shown in Fig. 19. The filter 160 is for obtaining a change vector ΔQ(n) = (Δq₁(n), Δq₂(n), Δq₃(n)) by filtering the projected-point vector Q(n) in the priority component vector space in the n-th frame. In the filter of Fig. 18, Δqi(n) is given as follows:

Δqi(n) = (1/σ)(a₁qi(n - 1) + a₂qi(n - 2) + ... + apqi(n - p))   (16)

where aj (j = 1 to p) is a filter coefficient, p is the order of the filter, and σ is a coefficient that normalizes the variance of the filter outputs; it is the standard deviation given by

σ² = E[y(n)²] - (E[y(n)])², y(n) = a₁qi(n - 1) + a₂qi(n - 2) + ... + apqi(n - p)   (17)
[0073] In the filter of Fig. 19, Δqi(n) is given as follows:

Δqi(n) = (1/σ)(x(n) + b₁x(n-1) + b₂x(n-2))   (18)

x(n) = qi(n) + a₁x(n-1) + a₂x(n-2)   (19)
[0074] The transfer functions H1(z) and H2(z) of the filters of Figs. 18 and 19 are expressed in the following way:

H1(z) = (1/σ)(a₁z⁻¹ + a₂z⁻² + ... + apz⁻ᵖ)   (20)

H2(z) = (1/σ)(1 + b₁z⁻¹ + b₂z⁻²)/(1 - a₁z⁻¹ - a₂z⁻²)   (21)
[0075] The filter coefficients aj and bj are previously selected so that the transfer functions H1(z) and H2(z) specify high-pass filtering. If required, they may be varied in accordance with the signal power.
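A sketch of the change-vector computation of Fig. 18 follows; the first-difference coefficients are one illustrative choice of high-pass FIR coefficients, not values given above.

```python
import numpy as np

a = np.array([1.0, -1.0])   # illustrative aj: first difference, a simple high-pass FIR

def change_vectors(Q: np.ndarray) -> np.ndarray:
    """Q has one row per frame (columns q1, q2, q3); returns the normalized delta-Q rows."""
    dQ = np.stack([np.convolve(Q[:, i], a, mode="valid") for i in range(Q.shape[1])], axis=1)
    sigma = dQ.std(axis=0) + 1e-12   # normalize the variance of the filter outputs
    return dQ / sigma

rng = np.random.default_rng(5)
Q = np.cumsum(rng.normal(size=(100, 3)), axis=0)   # hypothetical projected-point trajectory
print(change_vectors(Q)[:3])
```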
[0076] The output signal of the filter 160 is supplied to matching circuits 162a to 162j. Each matching circuit 162 calculates a similarity expressed in terms of the Euclidean distance, mathematically given by the following equation. The details of the matching circuit 162a are typically illustrated in Fig. 20:

S(m) = (Δq₁ - r₁(m))² + (Δq₂ - r₂(m))² + (Δq₃ - r₃(m))²   (22)

where R(m) = (r₁(m), r₂(m), r₃(m)) indicates the m-th reference pattern and is stored in a reference pattern table 182 in the m-th matching circuit. It is evident that another known similarity measure may be used.
[0077] The reference pattern is previously obtained in accordance with a sequence of steps
shown in Fig. 21. In step #41, speech data of the parts of a speech signal, generated under the condition for using the discrimination apparatus, that are considered as being non-silent are collected as learning data. In step #42, an LPC cepstrum of the learning
data is obtained every frame. In step #43, the cepstrum is projected onto a non-silent
priority component vector space. In step #44, a plurality of change vectors denoting
the change of the projected point with respect to time are extracted. In step #45,
the center-of-gravity of those change vectors is calculated, and is used as a reference
pattern. The reason why a plurality of matching circuits 162 are used is that the similarities between the change vector of the projected point and a plurality of reference patterns are calculated, thereby improving the discrimination accuracy.
[0078] The output signals of the matching circuits 162a to 162j are input to a silence/non-silence
discriminator 164. The discriminator 164 compares a preset threshold value with the smallest of the ten input distances S(m). When the minimum distance is not larger than the threshold value, that is, when the change vector is sufficiently close to one of the reference patterns, the discriminator 164 decides that the frame is non-silent. In other cases, it decides that the frame is silent. A sequence of the procedural steps to make the decisions is illustrated in Fig. 22.
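The matching stage and this decision may be sketched as follows; the reference patterns and the threshold are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(6)
references = rng.normal(size=(10, 3))   # R(m): stand-ins for the centers of gravity of Fig. 21
THRESHOLD = 0.5

def discriminate(dq: np.ndarray) -> str:
    distances = ((references - dq) ** 2).sum(axis=1)   # S(m), equation (22)
    return "non-silent" if distances.min() <= THRESHOLD else "silent"

print(discriminate(references[2] + 0.1))   # close to a reference pattern: non-silent
```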
[0079] As described above, according to the fifth embodiment, the silence/non-silence determination
is made on the basis of the time variation of the projected points. This eliminates
the mistaken decision due to the background noise. The position of the projected point
of the characteristic parameters may be moved by the noise. However, the change of the projected point with respect to time is relative, not absolute, and hence is insensitive to noise. The modifications of the previous embodiments are allowed
also in the fifth embodiment.
[0080] Fig. 23 shows a block diagram of a silence/non-silence discriminator according to
a sixth embodiment of the present invention. An output signal of an LPC cepstrum calculator
202 is supplied to a characteristic parameter projection circuit 208 formed of an
inner product calculator 204 and a non-silent priority component vector memory 206.
The non-silent priority component vector memory 206 stores the vectors of first to
third priority components that result from a priority component analysis of the LPC
cepstrums of the non-silent parts of the speech signal as previously collected under
the condition for using the discrimination apparatus. The inner product calculator
204 calculates the coordinates of a projected point Q in a three dimensional space
whose axes are constituted by first to third priority component vectors of the non-silent
LPC cepstrums.
[0081] The output signal of the inner product calculator 204 is supplied to an FIR filter
210 and a plurality of detection circuits 212a to 212j. Each detection circuit 212
decides whether the projected point is within or outside the non-silent region for
each category. If it is within the region, the detection circuit 212 produces a detection
signal of "1" level. In other cases, it produces a detection signal of "0" level.
The details of the detection circuit 212a are typically shown in Fig. 24. As shown,
the circuit 212a is formed of a non-silent region parameter memory 224 for its category,
and a non-silence detector 226. The output signals of the detection circuits 212a
to 212j are supplied to a temporary detector 214. When any of the output signals of
the detection circuits 212a to 212j has "1" level, the temporary detector 214 temporarily
decides that the frame is non-silent. In other cases, it temporarily decides that
the frame is silent.
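The detection circuits 212a to 212j and the temporary detector 214 may be sketched as
follows. The sketch assumes that each category's non-silent region parameters (memory
224) take the form of lower and upper limits per coordinate, one plausible representation
consistent with the limit comparisons recited in the claims; the disclosure does not fix
this form, and all names are hypothetical.

import numpy as np

def detect_category(q, lower, upper):
    """Detection circuit 212: output "1" when the projected point q
    lies within the category's non-silent region, "0" otherwise."""
    return int(np.all(q >= lower) and np.all(q <= upper))

def temporary_decision(q, regions):
    """Temporary detector 214: the frame is temporarily decided to be
    non-silent when any detection circuit outputs "1".
    regions: list of ten (lower, upper) pairs, one per category."""
    flags = [detect_category(q, lo, hi) for lo, hi in regions]
    return "non-silent" if any(flags) else "silent"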
[0082] The output signal of the temporary detector 214 is supplied to an inequality detector
216, which in turn decides whether the decision result on the previous frame is equal
to the decision result on the present frame. When the two decisions are unequal, the
detector 216 produces a detection signal FF of "1" level. When they are equal, it
produces a detection signal FF of "0" level.
[0083] The output signal of the FIR filter 210 is supplied to a change detector 218. The
detector 218 calculates a change quantity Δ given by the following equation, using the
change vector ΔQ = (Δq₁, Δq₂, Δq₃) of the projected point Q for each frame, which is
derived from the FIR filter 210. The detector 218 outputs a detection signal CF of
"1" level when the change quantity Δ is larger than a predetermined threshold value,
and a detection signal CF of "0" level when the change quantity Δ is not larger than
the threshold value.
Δ = W₁Δq₁² + W₂Δq₂² + W₃Δq₃² (23)
where W₁, W₂, and W₃ are linear weighting coefficients, for which the eigenvalues
of an auto-correlation matrix of the characteristic parameters in the non-silent part
are used.
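For illustration, the inequality detector 216 and the change detector 218 may be
sketched as follows, with equation (23) evaluated directly; the weights are passed in
as given values, and all names are hypothetical.

import numpy as np

def inequality_flag(previous_decision, present_decision):
    """Detector 216: FF = "1" when the temporary decisions on the
    previous and present frames differ, "0" otherwise."""
    return int(previous_decision != present_decision)

def change_flag(delta_q, weights, threshold):
    """Detector 218: CF = "1" when the change quantity of equation
    (23) exceeds the threshold.
    delta_q: (3,) change vector ΔQ from the FIR filter 210.
    weights: (W1, W2, W3), the eigenvalues of the auto-correlation
        matrix of the non-silent characteristic parameters."""
    delta = float(np.dot(weights, np.square(delta_q)))   # equation (23)
    return int(delta > threshold)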
[0084] The output signal of the temporary detector 214, the output signal FF of the inequality
detector 216, and the output signal CF of the change detector 218 are supplied to
a silence/non-silence discriminator 220. The discriminator 220 first compares the
output signal FF of the detector 216 with the output signal CF of the detector 218.
When the two output signals are equal, the output signal of the temporary detector
214 is output as the final decision result. When they are unequal, the decision is
made in the following way.
[0085] If FF = "1" and CF = "0", the temporary decision result of the previous frame is
output as the final decision result and the temporary decision result of the present
frame is altered to make it equal to that of the previous frame.
[0086] If FF = "0" and CF = "1", the temporary decision result of the present frame is inverted,
and the inverted result is output as the final decision result. A flow of the above
decision procedure is shown in Fig. 25.
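The decision procedure of paragraphs [0084] to [0086] (Fig. 25) may be sketched as
follows; decisions are represented as booleans, True standing for non-silent, and the
names are hypothetical.

def final_decision(temp_present, temp_previous, ff, cf):
    """Discriminator 220.  Returns the final decision together with
    the (possibly altered) present temporary decision."""
    if ff == cf:
        # [0084]: FF and CF are equal; the output of the temporary
        # detector 214 is the final decision result.
        return temp_present, temp_present
    if ff == 1 and cf == 0:
        # [0085]: output the previous temporary decision and alter the
        # present temporary decision to match it.
        return temp_previous, temp_previous
    # FF = 0 and CF = 1, [0086]: invert the present temporary decision
    # and output the inverted result.
    inverted = not temp_present
    return inverted, inverted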
[0087] As seen from the foregoing description, in the sixth embodiment, the position of
the projected point of the characteristic parameters in the priority component vector
space as well as the change of the projected point are used for the silence/non-silence
discrimination. The result is a decreased occurrence of mistaken decisions due to noise
and an improved decision accuracy. The modifications of the previous embodiments are
also allowed in the sixth embodiment.
[0088] As described above, in the silence/non-silence discrimination apparatus according
to the present invention, a number of characteristic parameters of the speech signal
are calculated. The parameters are projected onto a given priority component vector
space whose dimension number is smaller than the number of the calculated characteristic
parameters. Discrimination is made as to whether the speech signal, more exactly each
frame of it, is silent or non-silent, on the basis of the position of the projected
point. Therefore, the number of characteristic parameters used for the discrimination
may be reduced by utilizing the statistical nature of the parameters, while keeping
a satisfactory discrimination accuracy. Further, there is no need for optimizing
the number of the parameters or their categorization.
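By way of illustration, the priority component vectors themselves may be obtained as
sketched below, assuming, as one reading of the disclosure, that the priority component
analysis is carried out as an eigen-decomposition of the auto-correlation matrix of the
collected characteristic parameters (the matrix whose eigenvalues also serve as the
weights of equation (23)); the names are hypothetical.

import numpy as np

def priority_component_vectors(params, i):
    """params: (n_frames, n_params) characteristic parameters of the
        previously collected speech of a given type (e.g. non-silent).
    i: number of priority components, smaller than n_params.
    Returns the (i, n_params) first to i-th priority component vectors.
    """
    # Auto-correlation matrix of the characteristic parameters.
    corr = params.T @ params / len(params)
    # Eigen-decomposition; numpy.linalg.eigh returns eigenvalues in
    # ascending order for a symmetric matrix.
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1]       # largest first
    # The first to i-th priority component vectors span the space
    # onto which the characteristic parameters are projected.
    return eigenvectors[:, order[:i]].T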
[0089] Additional advantages and modifications will readily occur to those skilled in the
art. Therefore, the invention in its broader aspects is not limited to the specific
details, representative apparatuses, and illustrated embodiments shown and described.
Accordingly, departures may be made from such details without departing from the
spirit or scope of the general inventive concept as defined by the appended claims
and their equivalents.
1. A silence/non-silence discrimination apparatus comprising:
means for obtaining characteristic parameters from an input speech, characterized by
further comprising:
means (54, 67, 84, 128, 158, 208) for projecting said characteristic parameters onto
a vector space formed of first to i-th ("i" being a positive integer smaller than
the number of said characteristic parameters) priority component vectors of the characteristic
parameters of a given type of speech, to obtain a projected point; and
means (56, 72, 88, 136, 164, 220) for discriminating whether the input speech is silent
or non-silent based on the position of said projected point.
2. The apparatus according to claim 1, characterized in that
said projecting means comprises means (53) for projecting said characteristic parameters
onto the vector space formed of the first to i-th priority component vectors of the
characteristic parameters of a non-silent speech, and
said discriminating means comprises means (56) for discriminating whether the input speech
signal is silent or not depending on whether or not said projected point is within
a non-silent region in said vector space.
3. The apparatus according to claim 2, characterized in that
said projecting means comprises means (53) for calculating inner products of the
first to i-th priority component vectors of said non-silent speech and a vector formed
of said characteristic parameters, and
said discriminating means comprises means (56) for comparing the inner product with
the upper and the lower limits of said non-silent region.
4. The apparatus according to claim 1,
characterized in that
said projecting means comprises first projection means (63) for projecting said characteristic
parameters onto the vector space formed of the first to i-th priority component vectors
of the characteristic parameters of a non-silent speech to obtain a first projected
point, and second projection means (64) for projecting said characteristic parameters
onto the vector space formed of first to j-th ("j" being a positive integer smaller
than the number of said characteristic parameters obtained) priority component vectors
of the characteristic parameters of a silent speech to obtain a second projected
point, and
said discriminating means comprises first detection means (68) for detecting whether
or not said first projected point is within a predetermined non-silent region in
the vector space of the non-silent speech, second detection means (69) for detecting
whether or not said second projected point is within a predetermined silent region
in the vector space of the silent speech, and third detection means (72), when said
first detection means detects said first projected point is within the non-silent
region, for detecting that said input speech is non-silent, and when said first detection
means detects said first projected point is not within the non-silent region and said
second detection means detects said second projected point is within the silent region,
for detecting that said input speech is silent, and when said first detection means
detects said first projected point is not within the non-silent region and said second
detection means detects said second projected point is not within the silent region,
for detecting that said input speech is non-silent.
5. The apparatus according to claim 4, characterized in that
said first projection means comprises means (63) for calculating first inner products
of said first to i-th priority component vectors of said non-silent speech and a vector
formed of said characteristic parameters,
said second projection means comprises means (64) for calculating second inner products
of said first to j-th priority component vectors of said silent speech and a vector
formed of said characteristic parameters,
said first detection means comprises means (68) for comparing the first inner product
with the upper and the lower limits of the non-silent region, and
said second detection means comprises means (69) for comparing the second inner product
with the upper and the lower limits of the silent region.
6. The apparatus according to claim 1, characterized in that
said projecting means comprises means (84a to 84j) for projecting said characteristic
parameters onto the vector space formed of the first to i-th priority component vectors
of the non-silent speech for every category, and
said discriminating means comprises means (86a to 86j) for detecting whether or not
the projected point is within a predetermined non-silent region in said vector space
for each category, and means (88), when at least one projected point is within the
non-silent region, for detecting that said input speech is non-silent, and when no
projected point is within the non-silent region, for detecting that said input speech
is silent.
7. The apparatus according to claim 6, characterized in that
said projecting means comprises means (94) for calculating inner products of the
first to i-th priority component vectors of the non-silent speeches for every category
and a vector formed of said characteristic parameters, and
said discriminating means comprises means (104) for comparing the inner products for
every category with the upper and the lower limits of the non-silent region.
8. The apparatus according to claim 1, characterized in that
said projecting means comprises means (128) for projecting said characteristic parameters
onto the vector space formed of the first to i-th priority component vectors of the
characteristic parameters of a non-silent speech, and
said discriminating means comprises first detection means (130) for detecting whether
or not the projected point is within a predetermined non-silent region, second detection
means (130) for detecting whether or not the projected point is within a predetermined
silent region, and third detection means (136), when said first and second detection
means detect that the projected point is within the non-silent region and within the
silent region, for detecting that said input speech is non-silent and silent, respectively,
when said first and second detection means detect that the projected point is not
within the non-silent region and within the silent region, respectively, for calculating
a first conditional probability that the input speech is non-silent and a second
conditional probability that the input speech is silent on the basis of the past discrimination
result, and discriminating whether the input speech is silent or non-silent based
on one of the first and second conditional probabilities which is larger than the
other.
9. The apparatus according to claim 8, characterized in that
said projecting means comprises means (124) for calculating inner products of the
first to i-th priority component vectors of said non-silent speech and a vector formed
of said characteristic parameters,
said first detection means comprises means (130, 132) for comparing the inner product
with the upper and the lower limits of said non-silent region, and
said second detection means comprises means (130, 134) for comparing the inner product
with the upper and the lower limits of said silent region.
10. The apparatus according to claim 1, characterized in that
said discriminating means comprises means (160) for detecting a change of said projected
point with respect to time, and means (162, 164) for detecting whether the input speech
is silent or non-silent on the basis of the change of the projected point.
11. The apparatus according to claim 10, characterized in that
said detecting means comprises a high-pass filter (160) coupled for reception with
a signal representative of the position of the projected point.
12. The apparatus according to claim 10, characterized in that
said projecting means comprises means (158) for projecting said characteristic parameters
onto a vector space formed of first to i-th priority component vectors of a non-silent
speech, and
said detecting means (162) comprises means (180) for storing the center-of-gravity
of vectors representing the change of the projected point of the characteristic
parameters of various types of speeches as reference patterns, means (182) for calculating
similarities of said vectors representing the changes of the projected point with
each of said reference patterns, and means (164) for detecting if the input speech
is silent or non-silent by comparing a minimum value of said similarities with
a given threshold value.
13. The apparatus according to claim 10, characterized in that
said discriminating means comprises means (218) for detecting a change of said projected
point with respect to time, and means (220) for detecting whether the input speech
is silent or non-silent based on the position of the projected point and the change
of said projected point with respect to time.
14. The apparatus according to claim 13, characterized in that
said projecting means comprises means (208) for projecting said characteristic parameters
onto a vector space formed of the first to i-th priority component vectors of a non-silent
speech, and
said discriminating means comprises first detection means (212) for detecting whether
or not the projected point is within a predetermined non-silent region, second detection
means (216) for detecting whether or not the detection result of said first detection
means is equal to the previous detection result, third detection means (218) for
detecting whether or not a quantity of said change of the projected point is above
a predetermined value, and fourth detection means (220), when the detection result
of said first detection means is unequal to the previous detection result and a change
quantity of the projected point is greater than the predetermined value and when
the detection result of said first detection means is equal to the previous detection
result and a change quantity of said projected point is less than the predetermined
value, for making a detection on the basis of the detection result of said first
detection means, when the detection result of said first detection means is unequal
to the previous detection result and a change quantity of said projected point is
less than the predetermined value, for making a detection on the basis of said
previous detection result of said first detection means and for replacing the present
detection result with said previous result, and when the detection result of said
first detection means is equal to the previous detection result and a change quantity
of said projected point is greater than the predetermined value, for making a detection
on the basis of the inverse of said previous detection result of said first detection
means.
15. A silence/non-silence discrimination method characterized by comprising the steps
of:
making a priority component analysis of a non-silent speech and/or silent speech,
to obtain a predetermined number of priority component vectors of the non-silent
speech and/or silent speech;
obtaining region parameters defining a non-silent region and/or a silent region in
a vector space formed of said predetermined number of priority component vectors of
the non-silent speech and/or silent speech;
obtaining characteristic parameters, larger than said priority component vectors in
number, from an input speech;
projecting a vector formed of said characteristic parameters onto said vector space;
and
detecting that said speech is silent or non-silent depending on whether the projected
point of the vector formed of said characteristic parameters is within or outside
said non-silent region and/or said silent region.
16. A silence/non-silence discrimination method characterized by comprising the steps
of:
making a priority component analysis of a non-silent speech and/or silent speech,
to obtain a predetermined number of priority component vectors of the non-silent
speech and/or the silent speech;
obtaining region parameters defining a non-silent region and/or a silent region in
a vector space formed of said priority component vectors;
obtaining characteristic parameters, larger than said priority component vectors in
number, from an input speech;
projecting a vector formed of said characteristic parameters onto said vector space
formed of said priority component vectors;
obtaining a change of said projected point with respect to time;
obtaining similarities between said change of said projected point and changes of
the projected points of the characteristic parameters of various non-silent speeches
and/or silent speeches with respect to time; and
detecting that said input speech is silent or non-silent on the basis of the similarities obtained.
17. A speech cell generation apparatus characterized by comprising:
first encoding means for encoding a speech signal;
second encoding means for encoding the speech at a lower coding rate or bit rate than
that of said first encoding means;
means for detecting whether said speech signal is silent or non-silent depending on
the position of a projected point of characteristic parameters of said speech signal
onto a vector space formed of priority component vectors of a given type of speech;
and
means for converting the output signal of said first encoding means into cells when
the speech is detected as being non-silent, and converting the output signal of said
second encoding means into cells when the speech is detected as being silent.