[0001] This invention relates to a method of, and a speech detector for, detecting the presence
of speech signals in a sampled voice channel signal.
[0002] Speech detectors are used in a variety of speech transmission systems in which speech
transmission paths are established in response to the detection of speech activity
on a voice channel. One such system is a digital speech interpolation (DSI) transmission
system, such as the system described and claimed in E.P.A. 47588 corresponding to
Canadian Patent Application No. 359,965 filed September 9, 1980, entitled "Mitigation
of Noise Signal Contrast in a Digital Speech Interpolation Transmission System", which
conveniently embodies the speech detector of this invention.
[0003] A speech detector should ideally be highly sensitive to the presence of speech signals
while at the same time remaining insensitive to non-speech signals such as noise.
A difficulty arises in distinguishing, quickly and accurately, between speech signals,
particularly at low levels, and noise. In a DSI transmission system, for example,
the speech detector should be able to detect speech signals at low levels in order
to avoid excessive clipping of speech signals at the start of speech utterances, but
at the same time should not respond to noise alone, even at relatively high levels,
because this would undesirably increase the activity of the DSI transmission.
[0004] Various forms of speech detector have been devised in order to distinguish more reliably
between speech signals and noise. For example, Fariello U.S. Patent No. 3,878,337
issued April 15, 1975 discloses an arrangement in which a predetermined sequence of
the sign of successive samples of a voice channel signal is detected to provide an
indication of speech. LaMarche et al U.S. Patent No. 4,028,496 issued June 7, 1977,
discloses an arrangement in which the detection sensitivity and noise rejection are
improved by accumulating weighted differences between signal samples and their short-term
running average. Furthermore, Vagliani et al U.S. Patent 4,057,690 issued November
8, 1977 discloses an arrangement in which segments of the envelope of a voice channel
signal are compared with one another over different time domains in order to distinguish
between speech signals and noise. However, these arrangements do not fully satisfy
the requirements of a speech detector in a DSI transmission system, namely distinguishing
between low levels of speech and noise and avoiding clipping of the speech signals
at the start of speech utterances. Accordingly, a need still exists for an improved
speech detection arrangement which satisfies these requirements.
[0005] Accordingly, an object of this invention is to provide an improved method of, and
speech detector for, detecting the presence of speech signals in a sampled voice channel
signal.
[0006] According to this invention there is provided a method of detecting the presence
of speech signals in a sampled voice channel signal, comprising producing a first
signal state (M=1) whenever the magnitude (T) of a signal sample exceeds a first threshold
level (TF), characterized by the steps of:-
comparing the magnitude (T) of each sample with that (TP) of the preceding sample;
whenever the magnitude (T) of a sample is not greater than that (TP) of the preceding
sample, setting a second threshold (TL) to a level which is greater than and is dependent
upon the magnitude (T) of the current sample;
whenever the magnitude (T) of a sample is greater than that (TP) of the preceding
sample, producing a second signal state (K=1) if the magnitude (T) of the current
sample exceeds the second threshold level (TL); and
in response to each of the first and the second signal states (M=1, K=1), producing
a signal, representing the presence of speech, at least for the current sample.
[0007] Thus in accordance with this invention the speech detection is effected in two separate
parts, associated with the production of the first and second signals respectively.
The first threshold is set to be above anticipated noise levels, so that the first
signal state is produced only at relatively high levels of speech signals, which high
levels exceed the first threshold level and accordingly cannot be noise. The second
threshold level is adaptively adjusted to be a little above the level of noise on
the relevant channel. When the sample signal magnitude rises above this second threshold
level, the second signal state is produced immediately. If, as at the start of a speech
utterance, the signal magnitude continues to increase in successive samples, the second
signal state continues to be produced for these samples. If, on the other hand, the
signal magnitude falls again, the second signal state is no longer produced and the
second threshold level is adaptively adjusted.
[0008] Thus this arrangement provides a rapid detection of speech signals at low levels
at the start of speech utterances.
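The two-part detection set out above can be illustrated by the following minimal Python sketch. It is not the embodiment itself: the function name `detect_speech`, the treatment of the very first sample as "not greater than" a preceding sample, and the use of the rule TL = B·T + D (with B=1 and D=5, the values assumed for Figure 3 in the description of the preferred embodiment) are assumptions made for illustration only, and no hangover is modelled.

```python
def detect_speech(samples, tf, b=1, d=5):
    """Return one boolean per sample: True where speech is indicated.

    samples: sample magnitudes T (e.g. one averaged value per superframe)
    tf:      fixed first threshold TF, assumed to lie above the noise level
    b, d:    hypothetical constants for the second threshold TL = b*T + d
    """
    tp = None   # magnitude TP of the preceding sample (none yet)
    tl = None   # adaptive second threshold TL (not yet set)
    decisions = []
    for t in samples:
        m = t > tf                      # first signal state (M=1)
        k = False                       # second signal state (K=1)
        if tp is not None and t > tp:
            # Rising sample: speech if it also exceeds the threshold TL.
            k = tl is not None and t > tl
        else:
            # Non-rising sample (or first sample, by assumption):
            # re-adapt TL to lie a little above the current level.
            tl = b * t + d
        decisions.append(m or k)
        tp = t
    return decisions


if __name__ == "__main__":
    print(detect_speech([3, 2, 4, 9, 12, 11, 40], tf=30))
```

In this sketch a rising sample that exceeds TL is reported as speech at once, well before the magnitude reaches TF, which is the rapid low-level detection referred to above.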
[0009] In order that the signal representing the presence of speech is not terminated during
short pauses in speech such as occur between syllables, so-called hangover periods
are desirably provided to maintain the signal representing presence of speech for
a number of samples following the last sample which causes the signal to be produced.
To this end, the method preferably includes the steps of:- in response to the first
signal state, producing a fourth signal state for a first predetermined number of
consecutive samples commencing with the current sample; and in response to the second
signal state, producing a fifth signal state for a second number of consecutive samples
commencing with the current sample; wherein the signal representing presence of speech
is produced in the presence of either the fourth signal state or the fifth signal
state.
[0010] The second number of consecutive samples is desirably varied in dependence upon the
reliability with which the second signal state is produced for each sample, in order
that a speech indication signal is not produced for a long hangover period in response
to a spurious noise signal which has resulted in the production of the second signal
state. Accordingly, the method preferably also includes the step of determining said
second number in dependence upon previous sample magnitudes, said second number being
increased by a predetermined amount, up to a maximum number, for each sample in respect
of which the second signal state is produced, and being decreased by a predetermined
amount at least for each sample whose magnitude is not greater than the magnitude
of the preceding sample.
[0011] Thus the hangover period which is associated with the production of the second signal
state is gradually increased, up to a maximum period, as the reliability of speech
signal detection increases due to successive increases in the signal level in successive
samples. The hangover period associated with the production of the first signal state
need not be variable because this first signal state is only produced for relatively
high signal levels for which the reliability of the speech signal indication is very
high.
[0012] Due to fluctuating signal levels, it can occur that successive signal samples of
a magnitude below the first threshold level initially rise at the start of a speech
utterance, then fall slightly so that the second threshold level is set to a higher
value and the second signal state is not produced, and then rise again to a value
which is above the previous values for which the second signal state was produced
but which is below the new, higher, second threshold level. It is desirable that the
second signal state also be produced in these circumstances. Accordingly, the method
preferably further includes the steps of:- whenever the magnitude of a sample exceeds
that of the preceding sample, and in respect of the preceding sample the fifth signal
state was produced but the second signal state was not produced, producing the second
signal state for the current sample if its magnitude does not exceed the second threshold
level but exceeds a third threshold level; and setting the third threshold level equal
to the magnitude of the preceding sample whenever the second signal state was produced
for the preceding sample and the magnitude of the current sample is not greater than
the magnitude of the preceding sample.
[0013] In order to reduce the influence of spurious noise signals and d.c. offsets on the
speech detector, preferably each signal sample is constituted by an average of a plurality
of individual samples of the voice channel signal, the method of doing this comprising
the step of producing each signal sample by removing d.c. offsets from and averaging
a plurality of individual samples of the voice channel signal. The averaging is particularly
easy to achieve in a DSI transmission system of the type described in our co-pending
Patent Application No. EP-A-47588, already referred to, in which updating of the speech
decision for each channel takes place only once every superframe, each superframe
comprising a plurality of frames each including a sample of each voice signal channel.
[0014] It will be appreciated that the steps of the method of this invention can be carried
out by individual components such as comparators, stores, and gates, or by one or
more programmed read-only memories.
[0015] Accordingly, the invention also extends to a speech detector comprising one or more
read-only memories programmed and arranged to carry out the method recited above.
[0016] Furthermore, the invention extends to a speech detector for detecting the presence
of speech signals in a sampled voice channel signal, comprising means for producing
a first signal state (M=1) whenever the magnitude (T) of a signal sample exceeds a
first threshold level (TF), characterized in that the speech detector comprises:-
means for generating a second threshold (TL);
means for delaying each sample until the next sample arrives;
means for comparing the magnitude (T) of each sample with that (TP) of the preceding
sample delayed by said delaying means;
means responsive to said comparing means determining that the magnitude (T) of a sample
is not greater than that (TP) of the preceding sample, for setting the second threshold
(TL) in response to this determination to a level which is greater than and is dependent
upon the magnitude (T) of the current sample;
means responsive to said comparing means determining that the magnitude (T) of a sample
is greater than that (TP) of the preceding sample, for producing in response to this
determination a second signal state (K=1) if the magnitude (T) of the current sample
exceeds the second threshold level (TL); and
means responsive to each of the first and second signal states (M=1, K=1) for producing
a signal, representing the presence of speech, at least for the current sample.
[0017] The present invention still further provides a method of detecting the presence of
speech in a sampled voice channel signal, characterised by the steps of:- setting
a threshold (TL), to a level which is greater than and is dependent upon the magnitude
(T) of the current sample, whenever the magnitude (T) of the current sample is not
greater than that (TP) of the preceding sample; and providing an indication of the
presence of speech whenever the magnitude (T) of the current sample is greater than
that (TP) of the preceding sample and exceeds said threshold level (TL).
[0018] The invention will be further understood from the following description of a preferred
embodiment thereof with reference to the accompanying drawings, in which:-
Figure 1 illustrates in the form of a block diagram a speech detector for use in a
DSI transmission system;
Figure 2 shows a flow chart in explanation of the operation of the speech detector;
Figure 3 is a signal level diagram illustrating the operation of the speech detector;
and
Figure 4 illustrates an offset remover and averaging circuit for supplying offset-removed
and averaged signal samples to the speech detector.
[0019] The speech detector described below with reference to Figures 1 to 3 is intended
for use in a DSI transmission system of the type described in our co-pending Patent
Application No. EP-A-47588 already referred to, in which once in each superframe a
speech decision is updated for each of a plurality of voice signal channels in respect
of each of which there is an individual sample contained in each of a plurality of
frames forming the superframe. In the present case, it is assumed that in each superframe
there are 27 frames each comprising 48 voice channel signal samples each of 8 bits.
[0020] Referring to Figure 1, which shows the speech detector in the form of a block diagram,
it will be seen that the speech detector includes two independent parts, which are
referred to herein as the level detector 601 and the slope detector 602, whose outputs
are combined in an OR gate 603 to produce for each channel a speech decision which
is stored in a 48-channel decision store 604, to the output of which a speech decision
output line 110 is connected. Each of the detectors 601 and 602 is supplied with a
7-bit average T, produced by the circuit described below with reference to Figure
4, on lines 115, and is enabled in the fourteenth frame of each superframe to update
the speech decision for each channel. In its preferred form, each of the detectors
601 and 602 comprises a read-only memory. The speech detector is required to be able
to detect speech signals at low levels in order to avoid excessive clipping of speech
signals at the start of speech utterances, but at the same time is required not to
respond to relatively high levels of noise alone because this would undesirably increase
the activity of the DSI transmission. In order to comply with these requirements,
the speech detector is designed to exploit differences in the characteristics of noise
and speech signals, namely that (a) speech signals usually have a higher level than
noise, and (b) whereas noise is continuous, speech signals occur in bursts with the
signal level progressively increasing at the start of each burst. It is to this end
that the speech detector comprises the two detectors 601 and 602.
[0021] Each of the detectors 601 and 602 classifies each channel as being in one of three
states, namely speech, hangover, and silence. For ease of reference, in Figures 2
and 3 these states are denoted by the value of an index, M for the level detector
and K for the slope detector, each index having the value 0 for silence, 1 for speech,
and 2 for hangover. Thus M=1 indicates that the level detector declares that the particular
channel is carrying speech.
[0022] The hangover state is a temporary state which a channel is deemed to be in immediately
following the speech state, and is provided to avoid speech clipping after intersyllabic
pauses in speech. In each detector, a channel which previously was declared as being
in the speech state, but in respect of which speech is no longer detected, is deemed
to be in the hangover state and an initial hangover count is set. If speech is still
not detected in successive superframes, then this hangover count is decremented until
it reaches zero, when the channel is declared silent. The initial hangover count is
fixed in the level detector but is variable in the slope detector, as is further explained
below.
[0023] Referring again to Figure 1, the level detector 601 consists of three parts, namely
a comparator 605, a hangover and control unit 606, and a decision store 607. In frame
14 in each superframe, for each channel, the comparator 605 compares the average T
with a fixed threshold TF which is above the highest possible noise level. The result
of this comparison is supplied to the unit 606. The unit 606 determines the state
of the channel in dependence upon this comparison and the channel's previous state
as stored in the store 607, and stores the current state of the channel, and any hangover
count which is applicable, in the store 607. The unit 606 supplies a logic 1 on the
output line 608 if the channel is determined as being in either the speech or the
hangover state.
[0024] The slope detector 602 consists of a delay unit 609, comparators 610, a hangover,
control, and threshold generator unit 611, and a decision and threshold store 612.
The delay unit 609 provides a delay of 1 superframe for the average T to provide a
previous average TP via lines 613 to the comparators 610. In frame 14 in each superframe,
for each channel, the comparators 610 compare the current average T with the previous
average TP, a threshold TL, and a threshold TH and supply the comparison results to
the unit 611. The thresholds TL and TH are variable thresholds which are stored for
each individual channel in the store 612. The unit 611 determines the state of the
channel in dependence upon the comparison results and the channel's previous state
as stored in the store 612, generates new thresholds TL and TH if necessary, and stores
the current state of the channel, together with any new hangover count and thresholds
TL and TH, in the store 612. The unit 611 supplies a logic 1 on the output line 614
if the channel is determined as being in either the speech or the hangover state.
[0025] Thus it will be seen that the speech decision on the line 110 is present for each
channel, i.e. the channel is deemed to be carrying speech, unless both the level detector
and the slope detector declare the channel to be silent, i.e. both M=0 and K=0.
[0026] The operation of the speech detector will be further understood from the following
description with reference to Figures 2 and 3. In Figure 2, B, D, and G are integers,
H is the hangover count in the level detector, HM is a maximum value of H, C is the
hangover count in the slope detector, CM is a maximum value of C, and the other symbols
have the meanings already described. For the illustration in Figure 3 it has been
assumed that B=1, D=5, G=4, and CM=HM=31. Each of Figures 2 and 3 relates to only
one of the 48 channels, all the channels being treated in the same manner. Figure
3 illustrates the average T for the channel as a line 801 on which each point represents
the value of T in one superframe, and also illustrates the resultant values of M,
H, TL, TH, K, and C. It is initially assumed that M=K=C=0. Successive points on the
line 801 are identified by references 802 through 834.
[0027] Considering firstly the operation of the level detector, for each of the points 802
through 821 T≤TF (interrogation 701 in Figure 2) and the previously stored value of
M is zero (interrogation 702 in Figure 2) so that in Figure 2 the branch 703 is reached
and M remains zero (silence). For each of the points 822 through 827 T>TF, so that,
regardless of the previously stored value of M, M is set to 1 (speech)
in block 704 in Figure 2. For point 828 the result of the interrogation 701 is negative,
so that the value of M is interrogated at block 702 in Figure 2. The previously stored
value of M is 1, so that block 705 in Figure 2 is reached, M being set to 2 (hangover)
and H being set to HM=31. For each of points 829 through 834 the result of the interrogation
701 is negative and the previously stored value of M, interrogated in block 702, is
2 so that in Figure 2 the value of H is interrogated at block 706. For these points
H≠0, so that H is decremented each time at block 707 in Figure 2 and M is unchanged.
Unless T again exceeds TF, this decrementing continues in successive superframes until
H=0, when interrogation 706 has a positive result so that block 708 is reached in
which M is set to zero (silence).
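The level detector sequence just described can be sketched in Python as follows. This is an illustrative model only, under stated assumptions: the function names `level_step` and `run_level` are hypothetical, the initial state is taken as M=0 and H=0, and the value of H is left unchanged while M=1 because the description does not say otherwise.

```python
def level_step(t, m, h, tf, hm=31):
    """One superframe update of the level detector; returns the new (M, H).

    M: 0 silence, 1 speech, 2 hangover.  H: hangover count.
    """
    if t > tf:              # interrogation 701
        return 1, h         # block 704: speech
    if m == 1:              # interrogation 702: previous state was speech
        return 2, hm        # block 705: enter hangover with H = HM
    if m == 2:
        if h == 0:          # interrogation 706
            return 0, 0     # block 708: silence
        return 2, h - 1     # block 707: decrement H
    return 0, h             # branch 703: remain silent


def run_level(samples, tf, hm=31):
    """Apply level_step to successive averages T, starting from M=0, H=0."""
    m, h = 0, 0
    states = []
    for t in samples:
        m, h = level_step(t, m, h, tf, hm)
        states.append(m)
    return states


if __name__ == "__main__":
    states = run_level([10, 35] + [20] * 40, tf=30)
    print(states.count(2))  # superframes spent in the hangover state
```

With HM=31, a channel that falls below TF after speech stays in the hangover state for the superframe in which H is set plus 31 decrements, i.e. 32 superframes in this model.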
[0028] Considering now the operation of the slope detector, after reading the value T in
each superframe (block 709 in Figure 2), this value is compared with the previous
value TP (interrogation 710 in Figure 2). If T>TP, as at points 803, 805 and 808 in
Figure 3, then an interrogation is made as to whether K=1 (speech) in block 711 of
Figure 2. For each of the points 803, 805, and 808 the previous value of K is zero, so
that the result of this interrogation is negative. In a subsequent interrogation 712 T
is compared with the threshold TL, and for each of the points 803, 805, and 808 T≤TL
so that a subsequent interrogation in block 713
is effected as to whether K=0. For each of these points the result of this interrogation
is positive, so that in a block 714 the previous value of C is increased by G=4, K
remaining unchanged.
[0029] For each of the points 804, 806, and 807 the result of the interrogation 710 is negative,
so that in a block 715 the threshold TL is set to BT+D, i.e. T+5 in Figure 3. The
previous value of K is then interrogated in a block 716, and because in the case of
each of these points the previous value of K is zero, C is set to zero in a block
717 and K remains unchanged. Thus for all of the points 803 to 808 K=0 (silence).
It can be seen that the threshold TL is adaptively adjusted during this period, so
that this threshold is generally a little above the level of noise present on the
particular channel.
[0030] For the point 809 the interrogation 710 has a positive result, the subsequent interrogation
711 has a negative result, and the resultant interrogation 712 has a positive result
because now T>TL, so that K is set to 1 (speech) in block 718 in Figure 2. For each
of the points 810 through 813 the interrogation 710 and the resultant interrogation
711 both have positive results. Thus for each of the points 809 through 813 C is increased
by G=4 in a block 719; this gradual increasing of C, and hence the hangover period
which will subsequently occur, reflects the increasing reliability of the speech decision
reached initially at the point 809. C is in each case compared with CM=31 in an interrogation
720; for each of these points the result of this interrogation is negative so that
no further action is taken.
[0031] For the point 814 T<TP, so that the threshold TL is again reset in block 715. In
this case the previous value of K interrogated in block 716 is 1, so that in a block
721 the threshold TH is set to the previous average value TP and K is set to 2 (hangover).
Subsequently in a block 722 C is decreased by 1 to 23. For the point 815 T>TP, K≠1,
T≤TL, and K≠0, so that an interrogation T>TH? (block 723 in Figure 2) is reached
whose result is positive. Accordingly, K is
set to 1 in block 718 and C is increased in block 719. This recognizes the point 815
as comprising speech; this recognition is based on the fact that previously the lower-level
point 813 was identified as comprising speech, so that the relatively higher-level
point 815 is also assumed to comprise speech.
[0032] The point 816 results in a hangover decision (K=2) in the same manner as for the
point 814, the thresholds TL and TH being reset and C being decreased by 1 to 26.
For the point 817 T≤TP so that the threshold TL is reset, and the interrogation 716 is
reached and reveals
that K=2, so that in an interrogation 724 C is assessed and, since it is not zero,
is decreased by one in the block 722.
[0033] For the point 818 T>TP, K≠1, T≤TL, K≠0, and T≤TH, so that C is interrogated in
a block 725 and, not being zero, is decreased by 1 in a block 726, K remaining unchanged.
The point 819 and the points 820 through 825 result in the same circumstances as the
points 809 and 810 through 813 respectively, except that for each of the points 820
through 825 increasing C in block 719 results in the interrogation C>CM? in block
720 having a positive result, so that for each of these points C is set to CM=31 in
a block 727. At the point 826 both of the thresholds TL and TH are reset in the same
manner as at the points 814 and 816, and a hangover decision (K=2) is reached so that
C is reduced by one. In the same manner as for the point 817, at each of the points
827 through 834 the threshold TL is reset and C is reduced by 1. Unless the line 801
again crosses the threshold TL or TH, this reduction of C continues in successive
superframes until C=0, when one of the interrogations 724 and 725 has a positive result
so that in one of blocks 728 and 729 respectively K is set to zero (silence).
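The slope detector flow described with reference to Figure 2 can be summarised by the following Python sketch. It is a reconstruction from the text only, not a reproduction of the flow chart: the names `SlopeState`, `slope_step` and `run_slope` are hypothetical, the state is primed from the first sample (TP set to that sample and TL to B·T+D), and C is left unclamped in block 714 because no limit is described there. Comments give the block numbers used in the description.

```python
from dataclasses import dataclass


@dataclass
class SlopeState:
    tp: int      # previous average TP
    tl: int      # adaptive threshold TL
    th: int = 0  # threshold TH
    k: int = 0   # 0 silence, 1 speech, 2 hangover
    c: int = 0   # hangover count C


def slope_step(s, t, b=1, d=5, g=4, cm=31):
    """Update state s for one new average T; return True for speech/hangover."""
    if t > s.tp:                                           # 710
        if s.k == 1 or t > s.tl or (s.k == 2 and t > s.th):  # 711, 712, 723
            s.k = 1                                        # 718
            s.c = min(s.c + g, cm)                         # 719, 720, 727
        elif s.k == 0:                                     # 713
            s.c += g                                       # 714
        elif s.c == 0:                                     # 725
            s.k = 0                                        # 729
        else:
            s.c -= 1                                       # 726
    else:
        s.tl = b * t + d                                   # 715
        if s.k == 1:                                       # 716
            s.th = s.tp                                    # 721
            s.k = 2
            s.c -= 1                                       # 722
        elif s.k == 0:
            s.c = 0                                        # 717
        elif s.c == 0:                                     # 724
            s.k = 0                                        # 728
        else:
            s.c -= 1                                       # 722
    s.tp = t
    return s.k != 0


def run_slope(samples, b=1, d=5, g=4, cm=31):
    """Prime the state from samples[0], then return (K, C) after each later sample."""
    s = SlopeState(tp=samples[0], tl=b * samples[0] + d)
    trace = []
    for t in samples[1:]:
        slope_step(s, t, b, d, g, cm)
        trace.append((s.k, s.c))
    return trace


if __name__ == "__main__":
    # Hypothetical averages with the same rise/fall pattern as points 802 to 834.
    demo = [10, 12, 11, 13, 12, 11, 14, 18, 20, 22, 24, 26, 25, 28, 27, 25, 27,
            33, 35, 37, 39, 41, 43, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36]
    for point, (k, c) in zip(range(803, 835), run_slope(demo)):
        print(point, k, c)
```

The sample values in `demo` are invented and are not taken from Figure 3; they merely follow the same pattern of rises and falls. With B=1, D=5, G=4 and CM=31 they give C=8 at the position of point 809, C=23 at point 814 and C=26 at point 816, in agreement with the counts stated in the description.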
[0034] It can be seen, therefore, that the level detector 601 provides a reliable detection
of the presence of speech each time that the average T exceeds the fixed threshold
TF, and that after each such detection the speech decision on the line 110 is maintained
for a fixed hangover period of 32 superframes, to maintain the decision during intersyllabic
pauses in speech. On the other hand, the slope detector 602 provides a less reliable
but much earlier detection of the start of speech bursts, as at the point 809, to
produce the speech decision on the line 110 as quickly as possible and hence to avoid
excessive clipping of speech signals at the start of speech bursts. As this detection
is less reliable, the hangover period of the slope detector is not immediately set
to the maximum as in the level detector, but instead is increased only gradually to
avoid excessively increasing the activity of the DSI transmission. For example, the
average T at the point 809 could alternatively be due to noise transients instead
of the start of speech, in which case the line 801 would not rise after this point.
In this case although the slope detector would reach the incorrect decision K=1 (speech)
for the point 809, this decision would be maintained only for the short hangover period
of 8 superframes so that the DSI transmission activity would be only slightly increased.
In any event, as described below, the value T is itself an average taken over the
duration of one superframe, and the threshold TL is adaptively adjusted to be above
the average noise level of the channel, so that the slope detector is relatively insensitive
to noise transients.
[0035] Figure 4 illustrates in the form of a block diagram a d.c. offset remover and averaging
circuit which serves to produce a 7-bit offset removed average T for each channel
on the lines 115, from 8-bit individual signal samples of the channels supplied thereto
on lines 102. The offset remover consists of an 8-bit subtractor 401, a 16-bit up/down
counter 402, and a 48-channel by 16-bit store 403. The averaging circuit consists
of a 12-bit adder 404, a 48-channel by 12-bit store 405, a buffer 406 having a clear
input CL, and a 48-channel by 7-bit store 407 having a write-enable input WE. Each
of the stores is addressed in turn for each channel via an address bus which is not
shown.
[0036] The offset remover serves to produce on lines 409 for each channel a 7-bit magnitude
signal from which long-term d.c. offsets have been removed, and to this end the offset
remover in operation reaches an equilibrium state in which for each channel a 16-bit
offset value of the channel is stored in the store 403. In each frame, for each channel,
the stored offset value of the channel is loaded from the store 403 into the counter
402 and is available at the counter output. The 8 most significant bits of the offset
value are applied via lines 410 to the subtractor 401, which subtracts the offset
value bits from the current sample of the channel to produce the 7-bit magnitude signal
on the lines 409 and a sign bit on a further output line 411. This line 411 is connected
to an up/down counting control input U/D of the counter 402 and causes the count of
the counter to be increased or decreased by 1 depending on the polarity of the sign
bit on the line 411. The counter 402 thus produces a new, modified, 16-bit offset
value for the channel at its output, and this new value is written into the store
403 in place of the previous offset value for the channel. This sequence is repeated
for subsequent channels in each frame.
[0037] In the long term, the equilibrium state reached is such that for each channel the
numbers of positive and negative sign bits produced on the line 411 are equal. Although
the stored offset value of each channel varies, only the 8 most significant bits of
this are subtracted from the channel information, and in fact 256 sign bits of one
polarity are required in order to change the subtracted offset value bits by one step.
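The behaviour of the offset remover for one channel can be modelled by the short Python sketch below. It is a simplified illustration under several assumptions that are not stated in the description: the 8-bit sample is treated as an unsigned linear value from 0 to 255, a zero difference is counted as positive, a positive difference increments the counter, the magnitude is limited to 7 bits by saturation, and the counter saturates rather than wraps. The function name `remove_offset` is hypothetical.

```python
def remove_offset(sample, offset16):
    """One frame of offset removal for one channel.

    sample:   8-bit sample value (assumed unsigned linear, 0..255)
    offset16: stored 16-bit offset value for the channel
    Returns (7-bit magnitude, updated 16-bit offset value).
    """
    # Only the 8 most significant bits of the offset are subtracted.
    diff = sample - (offset16 >> 8)
    magnitude = min(abs(diff), 127)          # 7-bit magnitude (assumed saturating)
    if diff >= 0:                            # sign bit: count up (assumed polarity)
        offset16 = min(offset16 + 1, 0xFFFF)
    else:                                    # sign bit: count down
        offset16 = max(offset16 - 1, 0)
    return magnitude, offset16


if __name__ == "__main__":
    offset = 0
    for _ in range(30000):                   # constant input of 100
        mag, offset = remove_offset(100, offset)
    print(offset >> 8, mag)
```

Because the counter moves by one per frame and only its upper 8 bits are subtracted, 256 sign bits of one polarity are needed to shift the subtracted value by one step, and a constant input is eventually reduced to a magnitude of 0 or 1 in this model.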
[0038] The averaging circuit serves to produce, for each channel, the 7-bit average T on
the lines 115. In fact, in order to simplify implementation of the circuit the average
T on the lines 115 is actually a fraction of 27/32 of the actual average of the signals
on the lines 409. For each channel, this average T is updated in the thirteenth frame
of each superframe by a signal applied via a line 414 to the input CL of the buffer
406 and the input WE of the store 407, to write a new average T into the store 407
and to clear the buffer 406.
[0039] For each channel in each frame of the superframe, the output of the adder 404 is
stored in the store 405. The adder output is equal to the sum of the 7-bit magnitude
signal of the particular channel, present on the lines 409, and a 12-bit cumulative
sum for the particular channel present on lines 412. The cumulative sum for the channel
is the previously stored sum for the channel which was stored in the store 405, which
is clocked through the buffer 406 in each frame except the thirteenth frame of each
superframe when, as described above, the buffer 406 is cleared to reduce the cumulative
sum to zero.
[0040] In the thirteenth frame of each superframe, therefore, for each channel the 12-bit
cumulative sum produced at the output of the store 405 is equal to the sum of the
offset-removed magnitude signals for that channel during the preceding 27 frames.
Only the 7 most significant bits of this sum are written into the store 407 to achieve
a division of the sum by a factor of 32; hence the average T is 27/32 of the actual
average. This minor difference does not adversely affect the operation of the speech
detector.
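The arithmetic of the averaging circuit can be checked with the following Python sketch, which is an illustration only; the function name `superframe_average` is hypothetical. It sums the 27 offset-removed 7-bit magnitudes of one channel and keeps the 7 most significant bits of the 12-bit sum, which is equivalent to a right shift by 5 bits (division by 32).

```python
def superframe_average(magnitudes):
    """Return the 7-bit average T for one channel over one superframe.

    magnitudes: the 27 offset-removed 7-bit magnitudes (0..127) of the channel.
    """
    if len(magnitudes) != 27:
        raise ValueError("a superframe is assumed to contain 27 frames")
    total = sum(magnitudes)   # at most 27 * 127 = 3429, which fits in 12 bits
    return total >> 5         # keep the 7 most significant of 12 bits


if __name__ == "__main__":
    print(superframe_average([64] * 27))   # 27/32 of the true average of 64
```

For a constant magnitude of 64 the true average is 64, whereas the circuit yields 64 × 27/32 = 54, which shows the 27/32 scaling referred to above.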
[0041] Whilst a particular offset remover and averaging circuit has been described above,
the speech detector of the invention can obviously be used in conjunction with other
forms of such circuit or without any preceding offset remover and averaging circuit.
Similarly, the speech detector can be used in other applications than that described,
and can be provided in respect of any number of voice channel signals.
1. A method of detecting the presence of speech signals in a sampled voice channel
signal, comprising producing a first signal state (M=1) whenever the magnitude (T)
of a signal sample exceeds a first threshold level (TF), characterized by the steps
of:-
comparing the magnitude (T) of each sample with that (TP) of the preceding sample;
whenever the magnitude (T) of a sample is not greater than that (TP) of the preceding
sample, setting a second threshold (TL) to a level which is greater than and is dependent
upon the magnitude (T) of the current sample;
whenever the magnitude (T) of a sample is greater than that (TP) of the preceding
sample, producing a second signal state (K=1) if the magnitude (T) of the current
sample exceeds the second threshold level (TL); and
in response to each of the first and the second signal states (M=1, K=1), producing
a signal, representing the presence of speech, at least for the current sample.
2. A method as claimed in claim 1, comprising the additional steps of:-
whenever the magnitude (T) of a sample does not exceed the first threshold level (TF)
and the first signal state (M=1) was produced for the preceding sample, producing
a third signal state (M=2) for a first predetermined number (H) of consecutive samples
the magnitude of which does not exceed the first threshold level commencing with the
current sample;
whenever the magnitude (T) of a sample is not greater than that (TP) of the preceding
sample and the second signal state (K=1) was produced for said preceding sample, producing
a fourth signal state (K=2) for a second number (C) of consecutive samples commencing
with the current sample; and
producing a signal representing the presence of speech also in response to each of
the third and fourth signal states (M=2, K=2).
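The hangover behaviour added by claim 2 can be sketched as follows. This is an illustrative reading only: the counters `m_left` and `k_left`, the fixed `margin` used for TL, and the choice to let the K=2 count simply run down over consecutive samples are assumptions, and `c` is held constant here although claim 3 makes it adaptive.

```python
def detect_with_hangover(magnitudes, tf, margin, h, c):
    """Speech flags with the third (M=2) and fourth (K=2) signal states."""
    flags = []
    tp = None                    # preceding magnitude (TP)
    tl = float("inf")            # second threshold (TL)
    m_left = 0                   # remaining samples of state M=2
    k_left = 0                   # remaining samples of state K=2
    prev_m1 = prev_k1 = False    # states produced for the preceding sample
    for t in magnitudes:
        rising = tp is not None and t > tp
        m1 = t > tf              # first signal state (M=1)
        k1 = rising and t > tl   # second signal state (K=1)
        if tp is not None and not rising:
            tl = t + margin      # re-set TL on a non-rising sample
        if m1:
            m_left = 0           # M=2 only covers samples not exceeding TF
        elif prev_m1:
            m_left = h           # start M=2 with the current sample
        if k1:
            k_left = 0
        elif prev_k1 and not rising:
            k_left = c           # start K=2 with the current sample
        m2 = m_left > 0
        k2 = k_left > 0
        flags.append(m1 or k1 or m2 or k2)
        if m2:
            m_left -= 1
        if k2:
            k_left -= 1
        prev_m1, prev_k1 = m1, k1
        tp = t
    return flags
```

Compared with a detector using only the first and second signal states, the indication now persists for `c` samples after a K=1 detection and for `h` samples after the magnitude falls back below TF.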
3. A method as claimed in claim 2, comprising the further step of determining said
second number (C) in dependence upon previous sample magnitudes, said second number
(C) being increased by a predetermined amount, up to a maximum number (C=32), for
each sample for which the second signal state (K=1) is produced, and being decreased,
down to a minimum number (C=0), for each other sample whose magnitude (T) is not greater
than the magnitude (TP) of the preceding sample.
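The adaptation of the second number (C) in claim 3 amounts to a bounded up/down count. In the sketch below the limits 0 and 32 come from the claim, whereas the step sizes `step_up` and `step_down` and the function name are hypothetical values chosen only for illustration.

```python
C_MAX = 32  # maximum number recited in the claim
C_MIN = 0   # minimum number recited in the claim


def update_hangover_count(c, k1, rising, step_up=4, step_down=1):
    """Return the new value of C after one sample.

    k1:     True if the second signal state (K=1) was produced
    rising: True if the magnitude T is greater than TP
    """
    if k1:
        # Each K=1 sample lengthens the hangover, up to the maximum.
        return min(c + step_up, C_MAX)
    if not rising:
        # Each other non-rising sample shortens it, down to the minimum.
        return max(c - step_down, C_MIN)
    # Rising sample without K=1: the claim recites no change.
    return c
```

The effect is that a run of K=1 detections earns a longer hangover, while a stretch without such detections lets the hangover shrink back towards zero.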
4. A method as claimed in claim 2 or 3, comprising the further steps of:-
whenever the magnitude (T) of a sample exceeds that (TP) of the preceding sample,
and for said preceding sample the fourth signal state (K=2) was produced but the second
signal state (K=1) was not produced, producing the second signal state (K=1) for
the current sample if its magnitude (T) exceeds a third threshold level (TH) but is
below said second threshold level (TL); and
setting the third threshold level (TH) equal to the magnitude (TP) of the preceding
sample whenever the second signal state (K=1) was produced for said preceding sample
and the magnitude (T) of the current sample is not greater than the magnitude (TP)
of said preceding sample.
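The third threshold (TH) of claim 4 can be sketched as a per-sample decision. The function below is an assumed decomposition, not the disclosed circuit: it takes the states produced for the preceding sample as inputs and returns the second signal state together with the possibly updated TH, leaving the setting of TL to the caller.

```python
def second_state_with_th(t, tp, tl, th, prev_k1, prev_k2):
    """Return (k1, th) for the current sample.

    t, tp:   current and preceding magnitudes (T, TP)
    tl, th:  second and third threshold levels (TL, TH)
    prev_k1: K=1 was produced for the preceding sample
    prev_k2: K=2 was produced for the preceding sample
    """
    if not t > tp:
        # Non-rising sample: TH follows the preceding magnitude if that
        # sample produced K=1. No K=1 is produced for the current sample.
        if prev_k1:
            th = tp
        return False, th
    k1 = t > tl  # ordinary test against TL
    if not k1 and prev_k2 and not prev_k1:
        # During the hangover, a rising sample below TL still produces
        # K=1 if it exceeds the third threshold TH.
        k1 = th < t < tl
    return k1, th
```

For example, after a K=1 sample of magnitude 9 followed by a fall to 8, TH becomes 9; a later rise to 10 during the K=2 hangover then produces K=1 even though TL is 12.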
5. A method as claimed in any of claims 1 to 4 in which each time that the second
threshold level (TL) is set, it is set to be greater than the magnitude (T) of the
current sample by a predetermined amount.
6. A method as claimed in any of claims 1 to 5 in which each signal sample is constituted
by an average of a plurality of individual samples of the voice channel signal, the
method further comprising the step of producing each signal sample by removing d.c.
offsets from and averaging a plurality of individual samples of the voice channel
signal.
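The pre-processing of claim 6 can be illustrated as below. This is only one possible form, given as an assumption: the d.c. offset is estimated as the block mean and the signal sample is taken as the mean absolute deviation from it, and both function names are hypothetical.

```python
def estimate_dc_offset(samples):
    """Assumed d.c. offset estimate: the mean of the individual samples."""
    return sum(samples) / len(samples)


def block_magnitude(samples, dc_offset):
    """Average magnitude of a block after removing the d.c. offset."""
    return sum(abs(s - dc_offset) for s in samples) / len(samples)


# Usage: one signal sample T from four individual channel samples.
block = [3, -1, 5, 1]
t = block_magnitude(block, estimate_dc_offset(block))
```

Each value `t` produced in this way would then serve as one sample magnitude (T) for the threshold comparisons of the preceding claims.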
7. A speech detector comprising one or more read-only memories programmed and arranged
to carry out the method of any of claims 1 to 6.
8. A speech detector for detecting the presence of speech signals in a sampled voice
channel signal, comprising means (605) for producing a first signal state (M=1)
whenever the magnitude (T) of a signal sample exceeds a first threshold level (TF),
characterized in that the speech detector comprises:-
means (611) for generating a second threshold (TL);
means (609) for delaying each sample until the next sample arrives;
means (610) for comparing the magnitude (T) of each sample with that (TP) of the preceding
sample delayed by said delaying means (609);
means (611) responsive to said comparing means (610) determining that the magnitude
(T) of a sample is not greater than that (TP) of the preceding sample, for setting
the second threshold (TL) in response to this determination to a level which is greater
than and is dependent upon the magnitude (T) of the current sample;
means (611) responsive to said comparing means (610) determining that the magnitude
(T) of a sample is greater than that (TP) of the preceding sample, for producing in
response to this determination a second signal state (K=1) if the magnitude (T) of
the current sample exceeds the second threshold level (TL); and
means (603) responsive to each of the first and second signal states (M=1, K=1) for
producing a signal, representing the presence of speech, at least for the current
sample.
9. A speech detector as claimed in claim 8 characterized by means (401 to 406) for
producing each signal sample by removing d.c. offsets from and averaging a plurality
of individual samples of the voice channel signal.
10. A method of detecting the presence of speech in a sampled voice channel signal,
characterized by the steps of:-
setting a threshold (TL), to a level which is greater than and is dependent upon the
magnitude (T) of the current sample, whenever the magnitude (T) of the current sample
is not greater than that (TP) of the preceding sample; and
providing an indication of the presence of speech whenever the magnitude (T) of the
current sample is greater than that (TP) of the preceding sample and exceeds said
threshold level (TL).
11. A method as claimed in claim 10, characterized by maintaining said indication
for a number (C) of samples following each sample whose magnitude (T) is greater than
that (TP) of the preceding sample.
12. A method as claimed in claim 11 characterized by determining the number (C) of
samples for which said indication is maintained in dependence upon previous sample
magnitudes, said number (C) being increased, up to a maximum number (C=32), for each
sample whose magnitude (T) is greater than that (TP) of the preceding sample and being
decreased, down to a minimum number (C=0), for each sample whose magnitude (T) is
not greater than that (TP) of the preceding sample.
13. A method as claimed in claim 10, 11, or 12 characterized by providing an indication
of the presence of speech for each sample whose magnitude (T) exceeds a fixed threshold
level (TF).
1. Verfahren zum Anzeigen von Sprachsignalen in einem abgetasteten Sprachkanalsignal
mit Erzeugung eines ersten Signalzustandes (M=1), jedesmal, wenn die Größe (T) des
Signalabtastwertes einen ersten Schwellwertpegel (TF) überschreitet, gekennzeichnet
durch folgende Schritte:
Vergleichen der Größe (T) jedes Abtastwertes mit der (TP) des vorhergehenden Abtastwertes;
jedesmal, wenn die Größe (T) eines Abtastwertes nicht größer als die (TP) des vorhergehenden
Abtastwertes ist, Setzen einer zweiten Schwelle (TL) auf einen Pegel, der größer als
die Größe (T) des augenblicklichen Abtastwertes ist und von dieser abhängt;
jedesmal dann, wenn die Größe (T) eines Abtastwertes größer als die (TP) des vorhergehenden
Abtastwertes ist, Erzeugen eines zweiten Signalzustandes (K=1), falls die Größe (T)
des augenblicklichen Abtastwertes den zweiten Schwellwertpegel (TL) überschreitet;
und
in Abhängigkeit von jedem der ersten und zweiten Signalzustände (M=1, K=1), Erzeugen
eines die Anwesenheit von Sprache, mindestens bei dem augenblicklichen Abtastwert,
repräsentierenden Signales.
2. Verfahren nach Anspruch 1, mit den zusätzlichen Schritten:
jedesmal dann, wenn die Größe (T) eines Abtastwertes nicht den ersten Schwellwertpegel
(TF) überschreitet und der erste Signalzustand (M=1) bei dem vorhergehenden Abtastwert
erzeugt wurde, Erzeugen eines dritten Signalzustandes (M=2) bei einer ersten vorbestimmten
Zahl (H) von aufeinanderfolgenden Abtastwerten, deren Größe den ersten Schwellwertpegel
nicht überschreitet, beginnend mit dem augenblicklichen Abtastwert;
jedesmal dann, wenn die Größe (T) eines Abtastwertes nicht größer als die (TP) des
vorhergehenden Abtastwertes ist und der zweite Signalzustand (K=1) bei dem vorhergehenden
Abtastwert erzeugt wurde, Erzeugen eines vierten Signalzustands (K=2) bei einer zweiten
Zahl (C) aufeinanderfolgender Abtastwerte, beginnend mit dem augenblicklichen Abtastwert;
und
Erzeugen eines die Anwesenheit von Sprache repräsentierenden Signales auch in Abhängigkeit
von jedem der dritten und vierten Signalzustände (M=2, K=2).
3. Verfahren nach Anspruch 2, mit dem weiteren Schritt der Bestimmung der zweiten
Zahl (C) in Abhängigkeit von vorhergehenden Abtastwert-Größen, wobei die zweite Zahl
(C) um einen vorbestimmten Betrag bis zu einer Maximalzahl (C=32) erhöht wird bei
jedem Abtastwert, bei dem der zweite Signalzustand (K=1) erzeugt, und erniedrigt bis
zu einer Minimalzahl (C=0) wird bei jedem anderen Abtastwert, dessen Größe (T) nicht
größer als die Größe (TP) des vorhergehenden Abtastwertes ist.
4. Verfahren nach Anspruch 2 oder 3, mit den weiteren Schritten:
jedesmal dann, wenn die Größe (T) eines Abtastwertes die (TP) des vorhergehenden Abtastwertes
überschreitet, und wenn für den vorhergehenden Abtastwert der vierte Signalzustand
(K=2) erzeugt, jedoch der zweite Signalzustand (K=1) nicht erzeugt wurde, Erzeugen
des zweiten Signalzustandes (K=1) für den augenblicklichen Abtastwert, falls seine
Größe (T) einen dritten Schwellwertpegel (TH) überschreitet, jedoch unter dem zweiten
Schwellwertpegel (TL) liegt; und
Einstellen des dritten Schwellwertpegels (TH) gleich der Größe (TP) des vorhergehenden
Abtastwertes, jedesmal, wenn der zweite Signalzustand (K=1) für den vorhergehenden
Abtastwert erzeugt wurde und die Größe (T) des augenblicklichen Abtastwertes nicht
größer als die Größe (TP) des vorhergehenden Abtastwertes ist.
5. Verfahren nach einem der Ansprüche 1 bis 4, bei dem jedesmal, wenn der zweite Schwellwertpegel
(TL) gestellt wird, er um ein vorbestimmtes Ausmaß größer als die Größe (T) des augenblicklichen
Abtastwertes gestellt wird.
6. Verfahren nach einem der Ansprüche 1 bis 5, bei dem jeder Signal-Abtastwert durch
einen Durchschnitt aus einer Vielzahl von Einzelabtastwerten des Sprachkanalsignals
gebildet wird, wobei das Verfahren ferner den Schritt enthält des Erzeugens jedes
Signalabtastwertes durch Entfernen von Gleichspannungs-Ablagen von und Mitteln einer
Vielzahl von Einzelabtastwerten des Sprachkanalsignals.
7. Sprachdetektor mit einem oder mehreren Festwertspeichern, die zur Ausführung des
Verfahrens nach einem der Ansprüche 1 bis 6 programmiert und angeordnet sind.
8. Sprachdetektor für die Anzeige von Sprachsignalen in einem abgetasteten Sprachkanalsignal,
mit Mitteln (605) zur Erzeugung eines ersten Signalzustandes (M=1) jedesmal dann,
wenn die Größe (T) eines Signalabtastwertes einen ersten Schwellwertpegel (TF) überschreitet,
dadurch gekennzeichnet, daß der Sprachdetektor enthält:
Mittel (611) zum Erzeugen eines zweiten Schwellwertes (TL);
Mittel (609) zum Aufhalten jedes Abtastwertes bis zur Ankunft des nächsten Abtastwertes;
Mittel (610) zum Vergleichen der Größe (T) jedes Abtastwertes mit der (TP) des vorhergehenden,
durch die Aufhaltemittel (609) aufgehaltenen Abtastwertes;
Mittel (611) in Abhängigkeit von den Vergleichsmitteln (610), zur Bestimmung, daß
die Größe (T) eines Abtastwertes nicht größer als die (TP) des vorhergehenden Abtastwertes
ist, um den zweiten Schwellwert (TL) in Abhängigkeit von dieser Bestimmung auf einen
Pegel zu stellen, der größer ist als die Größe (T) des augenblicklichen Abtastwertes
und davon abhängig ist;
Mittel (611) in Abhängigkeit von den Vergleichsmitteln (610) zur Bestimmung, daß die
Größe (T) eines Abtastwertes größer als die (TP) des vorhergehenden Abtastwertes ist
zur Erzeugung eines zweiten Signalzustandes (K=1) in Abhängigkeit von dieser Bestimmung,
falls die Größe (T) des augenblicklichen Abtastwertes den zweiten Schwellwertpegel
(TL) überschreitet; und
Mittel (603) in Abhängigkeit von jedem ersten und zweiten Signalzustand (M=1, K=1)
zur Erzeugung eines die Anwesenheit von Sprache repräsentierenden Signals mindestens
für den augenblicklichen Abtastwert.
9. Sprachdetektor nach Anspruch 8, gekennzeichnet durch Mittel (401 bis 406) zur Erzeugung
jedes Signalabtastwertes durch Entfernen von Gleichspannungs-Ablagen von und Mitteln
einer Vielzahl von einzelnen Abtastwerten des Sprachkanalsignals.
10. Verfahren für die Anzeige von Sprachsignalen in einem abgetasteten Sprachkanalsignal,
gekennzeichnet durch folgende Schritte:
Stellen eines Schwellwertes (TL) auf einen Pegel, der größer als die Größe (T) des
augenblicklichen Abtastwertes ist und von ihr abhängt, jedesmal, wenn die Größe (T)
des augenblicklichen Abtastwertes nicht größer als die (TP) des vorhergehenden Abtastwertes
ist; und
Schaffen einer Anzeige der Anwesenheit von Sprache jedesmal, wenn die Größe (T) des
gegenwärtigen Abtastwertes größer als die (TP) des vorhergehenden Abtastwertes ist
und den Schwellwertpegel (TL) überschreitet.
11. Verfahren nach Anspruch 10, gekennzeichnet durch Aufrechterhalten der Anzeige
für eine Zahl (C) von Abtastwerten folgend jedem Abtastwert, dessen Größe (T) größer
als die (TP) des vorhergehenden Abtastwertes ist.
12. Verfahren nach Anspruch 11, gekennzeichnet durch Bestimmen der Zahl (C) von Abtastwerten,
für die die Anzeige in Abhängigkeit von vorherigen Abtastwertgrößen aufrechterhalten
wird, wobei die Zahl (C) bis zu einer Maximalzahl (C=32) bei jedem Abtastwert erhöht
wird, dessen Größe (T) größer als die (TP) des vorhergehenden Abtastwertes ist, und
bis zu einer Minimalzahl (C=0) erniedrigt wird bei jedem Abtastwert, dessen Größe
(T) nicht größer als die (TP) des vorhergehenden Abtastwertes ist.
13. Verfahren nach Anspruch 10, 11 oder 12, gekennzeichnet durch Schaffen einer Anzeige
der Anwesenheit von Sprache bei jedem Abtastwert, dessen Größe (T) einen festen Schwellwertpegel
(TF) überschreitet.
1. Procédé de détection de la présence de signaux de parole dans un signal de canal
vocal échantillonné, comprenant la production d'un premier état de signal (M=1) chaque
fois que l'amplitude (T) d'un échantillon de signal dépasse un premier niveau de seuil
(TF), caractérisé par les étapes de:
- comparaison de l'amplitude (T) de chaque échantillon avec celle (TP) de l'échantillon
précédent;
- chaque fois que l'amplitude (T) d'un échantillon n'est pas supérieure à celle (TP)
de l'échantillon précédent, établissement d'un second seuil (TL) à un niveau qui est
supérieur à et dépend de l'amplitude (T) de l'échantillon courant;
- chaque fois que l'amplitude (T) d'un échantillon est supérieure à celle (TP) de
l'échantillon précédent production d'un second état de signal (K=1) si l'amplitude
(T) de l'échantillon courant dépasse le second niveau de seuil (TL); et
- en réponse à chacun des premier et second états de signaux (M=1, K=1), production
d'un signal, représentant la présence de la parole, au moins pendant l'échantillon
courant.
2. Procédé selon la revendication 1, comprenant les étapes supplémentaires de:
- chaque fois que l'amplitude (T) d'un échantillon ne dépasse pas le premier niveau
de seuil (TF) et que le premier état de signal (M=1) a été produit pour l'échantillon
précédent, production d'un troisième état de signal (M=2) pour un premier nombre prédéterminé
(H) d'échantillons consécutifs, dont l'amplitude ne dépasse pas le premier niveau
de seuil commençant avec l'échantillon courant;
- chaque fois que l'amplitude (T) d'un échantillon n'est pas supérieure à celle (TP)
de l'échantillon précédent et que le second état de signal (K=1) a été produit pour
l'échantillon précédent, production d'un quatrième état de signal (K=2) pour un second
nombre (C) d'échantillons consécutifs commençant avec l'échantillon courant; et
- production d'un signal représentant la présence de la parole également en réponse
à chacun des troisième et quatrième états de signal (M=2, K=2).
3. Procédé selon la revendication 2, comprenant l'étape supplémentaire de détermination
du second nombre (C) en fonction des amplitudes d'échantillons précédentes, ce second
nombre (C) étant augmenté d'une quantité prédéterminée, allant jusqu'à un nombre maximum
(C=32), pour chaque échantillon, pour lequel le second état de signal (K=1) est produit,
et étant diminué, jusqu'à un nombre minimum (C=0), pour chaque autre échantillon dont
l'amplitude (T) n'est pas supérieure à l'amplitude (TP) de l'échantillon précédent.
4. Procédé selon la revendication 2 ou 3, comprenant les autres étapes de:
- chaque fois que l'amplitude (T) d'un échantillon dépasse celle (TP) de l'échantillon
précédent, et que pour l'échantillon précédent, le quatrième état de signal (K=2)
a été produit mais le second état de signal (K=1) n'a pas été produit, production
du second état de signal (K=1) pour l'échantillon courant si son amplitude (T) dépasse
un troisième niveau de seuil (TH) mais est inférieure au second niveau de seuil (TL);
et
- réglage du troisième niveau de seuil (TH) à une valeur égale à l'amplitude (TP)
de l'échantillon précédent chaque fois que le second état de signal (K=1) a été produit
pour l'échantillon précédent et que l'amplitude (T) de l'échantillon courant n'est
pas supérieure à l'amplitude (TP) de l'échantillon précédent.
5. Procédé selon l'une quelconque des revendications 1 à 4, dans lequel chaque fois
que le second niveau de seuil (TL) est établi, il l'est de manière à être supérieur
à l'amplitude (T) de l'échantillon courant suivant une quantité prédéterminée.
6. Procédé selon l'une quelconque des revendications 1 à 5, dans lequel chaque échantillon
de signal est constitué d'une moyenne d'une pluralité d'échantillons individuels du
signal de canal vocal, le procédé comportant d'autre part l'étape de production de
chaque échantillon de signal par enlèvement des décalages en courant continu à partir
d'une pluralité d'échantillons individuels du signal de canal vocal et moyenne de
cette pluralité.
7. Détecteur de parole comprenant une ou plusieurs mémoires mortes programmées et
agencées de manière à exécuter le procédé de l'une quelconque des revendications 1
à 6.
8. Détecteur de parole pour la détection de la présence de signaux de parole dans
un signal de canal vocal échantillonné, comprenant un moyen (605) pour produire un
premier état de signal (M=1) chaque fois que l'amplitude (T) d'un échantillon de signal
dépasse un premier niveau de seuil (TF), caractérisé en ce que le détecteur de parole
comprend:
- un moyen (611) pour produire un second seuil (TL);
- un moyen (609) pour retarder chaque échantillon jusqu'à l'arrivée de l'échantillon
suivant;
- un moyen (610) pour comparer l'amplitude (T) de chaque échantillon avec celle (TP) de
l'échantillon précédent retardé par le moyen de retard (609);
- un moyen (611) répondant au moyen de comparaison (610) déterminant que l'amplitude
(T) d'un échantillon n'est pas supérieure à celle (TP) de l'échantillon précédent,
pour régler le second seuil (TL) en réponse à cette détermination à un niveau qui est
supérieur à et dépend de l'amplitude (T) de l'échantillon courant;
- un moyen (611) répondant au moyen de comparaison (610) déterminant que l'amplitude
(T) d'un échantillon est supérieure à celle (TP) de l'échantillon précédent, pour
produire en réponse à cette détermination un second état de signal (K=1) si l'amplitude
(T) de l'échantillon courant dépasse le second niveau de seuil (TL); et
- un moyen (603) répondant à chacun des premier et second états de signal (M=1, K=1)
pour produire un signal, représentant la présence de la parole, au moins pour l'échantillon
courant.
9. Détecteur de parole selon la revendication 8, caractérisé par des moyens (401 à
406) pour produire chaque échantillon de signal par enlèvement des décalages en courant
continu d'une pluralité d'échantillons individuels du signal de canal vocal et moyenne
de cette pluralité.
10. Procédé de détection de la présence de la parole dans un signal de canal vocal
échantillonné, caractérisé par les étapes de:
- établissement d'un seuil (TL), à un niveau qui est supérieur à et dépend de l'amplitude
(T) de l'échantillon courant, chaque fois que l'amplitude (T) de l'échantillon courant
n'est pas supérieure à celle (TP) de l'échantillon précédent; et
- fourniture d'une indication de la présence de la parole chaque fois que l'amplitude
(T) de l'échantillon courant est supérieure à celle (TP) de l'échantillon précédent,
et dépasse le niveau de seuil (TL).
11. Procédé selon la revendication 10, caractérisé par le maintien de cette indication
pour un certain nombre (C) d'échantillons suivant chaque échantillon dont l'amplitude
(T) est supérieure à celle (TP) de l'échantillon précédent.
12. Procédé selon la revendication 11, caractérisé par la détermination du nombre
(C) d'échantillons pour lesquels cette indication est maintenue en dépendance des
amplitudes d'échantillons précédents, ce nombre (C) étant augmenté, jusqu'à un nombre
maximum (C=32), pour chaque échantillon dont l'amplitude (T) est supérieure à celle
(TP) de l'échantillon précédent et étant diminué, jusqu'à un nombre minimum (C=0),
pour chaque échantillon dont l'amplitude (T) n'est pas supérieure à celle (TP) de
l'échantillon précédent.
13. Procédé selon la revendication 10, 11 ou 12, caractérisé par la fourniture d'une
indication de la présence de la parole pour chaque échantillon dont l'amplitude (T)
dépasse un niveau de seuil fixe (TF).