Voice activity detector - Patent 0548054

(11)	EP 0 548 054 A2

(12)	EUROPEAN PATENT APPLICATION

(43)	Date of publication:
	23.06.1993 Bulletin 1993/25

(21)	Application number: 93200015.1

(22)	Date of filing: 10.03.1989

(51)	International Patent Classification (IPC)⁵: G10L 3/00

(84)	Designated Contracting States:
	AT BE CH DE ES FR GB GR IT LI LU NL SE

(30)	Priority:
	11.03.1988 GB 8805795
	06.06.1988 GB 8813346
	24.08.1988 GB 8820105
24.08.1988 GB 8820105

(62)	Application number of the earlier application in accordance with Art. 76 EPC:
	89302422.4 / 0335521

(71)	Applicant: BRITISH TELECOMMUNICATIONS public limited company
	London EC1A 7AJ (GB)

(72)	Inventors:
	Freeman, Daniel Kenneth
	Ipswich, Suffolk IP4 2HT (GB)
	Boyd, Ivan
	Ipswich, Suffolk IP9 2XE (GB)

(74)	Representative: Lloyd, Barry George William et al
	BT Group Legal Services, Intellectual Property Department, 120 Holborn, London EC1N 2TE (GB)

(56)

References cited:


	Remarks:
	This application was filed on 06.01.1993 as a divisional application to the application mentioned under INID code 60.

(54)	Voice activity detector

(57) A first detector 3 to 6 operates by forming a measure of the spectral similarity between an input signal and a stored noise signal in a buffer 15, the measure being compared (7) with a threshold value.
The buffer 15 is updated from the input only during periods when the input is indicated by an auxiliary detector 20 to be free of speech; the auxiliary detector operates by measuring the spectral similarity of the input signal and a delayed version of it (buffer 24).

Description

[0001] A voice activity detector is a device which is supplied with a signal with the object of detecting periods of speech, or periods containing only noise. Although the present invention is not limited thereto, one application of particular interest for such detectors is in mobile radio telephone systems where the knowledge as to the presence or otherwise of speech can be exploited by a speech coder to improve the efficient utilisation of radio spectrum, and where also the noise level (from a vehicle-mounted unit) is likely to be high.

[0002] The essence of voice activity detection is to locate a measure which differs appreciably between speech and non-speech periods. In apparatus which includes a speech coder, a number of parameters are readily available from one or other stage of the coder, and it is therefore desirable to economise on processing needed by utilising some such parameter. In many environments, the main noise sources occur in known defined areas of the frequency spectrum. For example, in a moving car much of the noise (e.g. engine noise) is concentrated in the low frequency regions of the spectrum. Where such knowledge of the spectral position of noise is available, it is desirable to base the decision as to whether speech is present or absent upon measurements taken from that portion of the spectrum which contains relatively little noise. It would, of course, be possible in practice to pre-filter the signal before analysing to detect speech activity, but where the voice activity detector follows the output of a speech coder, prefiltering would distort the voice signal to be coded.

[0003] In US4358738, a voice activity detector is disclosed which compares the input signal with predetermined noise characteristics, by filtering the input signal through a pair of manually balanced bandpass filters (employing analogue components) to form two frequency dependent energy segments. This method is of limited usefulness for many reasons; firstly, such a crude arrangement ignores the fact that many types of noise could have an energy balance between the two bands similar to a speech signal, secondly, balancing the filters is laborious and requires a manual detection of noise periods for balancing, and thirdly, such a device is unable to adjust to changing noise or spectral changes in the environment (or communications channels).

[0004] In IEEE Transactions on Acoustics, Speech and Signal Processing, vol ASSP-25, No. 4, August 1977, pages 338-343, Rabiner et al "Application of an LPC distance measure to the voiced unvoiced silence detection problem", there is disclosed a classifier for discriminating between silence, unvoiced speech, and voiced speech which has been transmitted over a telephone line. The method comprises initially using manually classified "silence", "voiced", and "unvoiced" frames of speech signals to derive reference patterns, and then comparing the input signal to each of these using a comparison measure and selecting the reference pattern to which the input signal is closest. This method shares some of the disadvantages of US4358738, in that it requires extensive manual intervention in selecting "silence" frames from training data and forming therefrom the reference pattern, and that, since the reference pattern is fixed, changes in the environment result in wrong identifications. These problems are greatly exacerbated in high level noise environments (such as a moving vehicle) compared to the low level noise environment (silence over a telephone line) described by Rabiner.

[0005] European patent application published as 0127718A and US patent 4672669 describe a voice activity detection apparatus in which a first test is made on signal amplitude and a second test is based on analysis of changes in the short-term signal spectrum. Specifically, the spectral analysis is performed by comparing the autocorrelation of the signal with that of an earlier portion of the signal deemed to be speech-free.

[0006] According to one aspect of the present invention there is provided a voice activity detection apparatus comprising:

(i) a first voice activity detector which operates by forming a measure of the spectral similarity between an input signal and a stored portion of input signal deemed to be speech free to produce an output signal indicating the presence or absence of speech in the input signal;

(ii) a store for containing the stored portion of signal; and

(iii) an auxiliary voice activity detector; characterised in that the auxiliary voice activity detector alone controls the updating of the store, the auxiliary voice activity detector operating by forming a measure of the spectral similarity between the current signal and an earlier portion of signal.

[0007] In another aspect, the invention provides a voice activity detection apparatus comprising:

(i) means for receiving an input signal;

(ii) a store for storing a noise representing signal;

(iii) means for periodically forming from the input signal and the stored noise representing signal a measure of the spectral similarity between a portion of the input signal and the said estimated noise signal component;

(iv) means for comparing the measure with a threshold value to produce an output indicating the presence or absence of speech;

(v) an auxiliary voice activity detector; and

(vi) store updating means for updating the store from the input signal;

characterised in that the auxiliary voice activity detector is operable in dependence on a measure of spectral similarity between the input signal and a preceding portion of the input signal to produce a control signal indicating the presence or absence of speech and that the store updating means is operable to update the store from the input signal only when said control signal indicates that speech is absent.

[0008] Other aspects of the present invention are as defined in the claims.

[0009] Some embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:

Figure 1 is a block diagram of a first embodiment of the invention;

Figure 2 shows a second embodiment of the invention;

Figure 3 shows a third, preferred embodiment of the invention.

[0010] The general principle underlying a first Voice Activity Detector according to a first embodiment of the invention is as follows.

[0011] A frame of n signal samples (s₀, s₁, s₂, s₃, s₄ ... sₙ₋₁) will, when passed through a notional fourth order finite impulse response (FIR) digital filter of impulse response (1, h₀, h₁, h₂, h₃), result in a filtered signal (ignoring samples from previous frames)

$$s' = (s_0),\ (s_1 + h_0 s_0),\ (s_2 + h_0 s_1 + h_1 s_0),\ (s_3 + h_0 s_2 + h_1 s_1 + h_2 s_0),\ (s_4 + h_0 s_3 + h_1 s_2 + h_2 s_1 + h_3 s_0),\ (s_5 + h_0 s_4 + h_1 s_3 + h_2 s_2 + h_3 s_1),\ \ldots$$

The zero order autocorrelation coefficient is the sum of each term squared, which may be normalised i.e. divided by the total number of terms (for constant frame lengths it is easier to omit the division); that of the filtered signal is thus

$$R'_0 = \sum_{i=0}^{n-1} (s'_i)^2$$

and this is therefore a measure of the power of the notional filtered signal s' - in other words, of that part of the signal s which falls within the passband of the notional filter.
Expanding, neglecting the first 4 terms,

$$R'_0 \approx \sum_{i=4}^{n-1} (s_i + h_0 s_{i-1} + h_1 s_{i-2} + h_2 s_{i-3} + h_3 s_{i-4})^2 \approx R_0(1 + h_0^2 + h_1^2 + h_2^2 + h_3^2) + R_1(2h_0 + 2h_0 h_1 + 2h_1 h_2 + 2h_2 h_3) + R_2(2h_1 + 2h_0 h_2 + 2h_1 h_3) + R_3(2h_2 + 2h_0 h_3) + R_4(2h_3)$$

So R'₀ can be obtained from a combination of the autocorrelation coefficients R_i, weighted by the bracketed constants which determine the frequency band to which the value of R'₀ is responsive. In fact, the bracketed terms are the autocorrelation coefficients of the impulse response of the notional filter, so that the expression above may be simplified to

$$R'_0 = R_0 H_0 + 2\sum_{i=1}^{N} R_i H_i \qquad (1)$$

where N is the filter order and H_i are the (un-normalised) autocorrelation coefficients of the impulse response of the filter.

[0012] In other words, the effect on the signal autocorrelation coefficients of filtering a signal may be simulated by producing a weighted sum of the autocorrelation coefficients of the (unfiltered) signal, using the impulse response that the required filter would have had.

[0013] Thus, a relatively simple algorithm, involving a small number of multiplication operations, may simulate the effect of a digital filter requiring typically a hundred times this number of multiplication operations.
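
The equivalence between filtering and a weighted sum of autocorrelation coefficients can be checked numerically. The following Python sketch is illustrative only: the function names, the test frame and the filter taps are assumptions chosen for the example and are not taken from the specification. It treats samples outside the frame as zero and keeps the start-up and tail outputs of the filter, in which case the weighted sum equals the filtered power exactly; dropping those edge terms, as the description does, turns the identity into an approximation.

```python
def autocorr(x, max_lag):
    """Un-normalised autocorrelation coefficients of x for lags 0..max_lag."""
    n = len(x)
    return [sum(x[k] * x[k + lag] for k in range(n - lag))
            for lag in range(max_lag + 1)]


def full_filtered_power(s, g):
    """Energy of s convolved with impulse response g (zero outside the frame)."""
    order = len(g) - 1
    total = 0
    for i in range(len(s) + order):
        y = sum(g[j] * s[i - j] for j in range(order + 1) if 0 <= i - j < len(s))
        total += y * y
    return total


def shortcut_power(R, H):
    """Weighted sum R_0*H_0 + 2*sum(R_i*H_i) over lags 1..N."""
    return R[0] * H[0] + 2 * sum(R[i] * H[i] for i in range(1, len(H)))


# Hypothetical 160-sample integer frame and fourth order filter (1, h0..h3).
s = [(7 * k * k + 3 * k) % 11 - 5 for k in range(160)]
g = [1, -2, 3, 0, 1]
N = len(g) - 1

R = autocorr(s, N)   # signal autocorrelation, lags 0..4
H = autocorr(g, N)   # autocorrelation of the impulse response
direct = full_filtered_power(s, g)
shortcut = shortcut_power(R, H)
print(direct, shortcut)
```

The shortcut needs only N + 1 multiplications per frame once R_i is available, whereas the direct route needs about N + 1 multiplications per sample.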

[0014] This filtering operation may alternatively be viewed as a form of spectrum comparison, with the signal spectrum being matched against a reference spectrum (the inverse of the response of the notional filter). Since the notional filter in this application is selected so as to approximate the inverse of the noise spectrum, this operation may be viewed as a spectral comparison between speech and noise spectra, and the zeroth autocorrelation coefficient thus generated (i.e. the energy of the inverse filtered signal) as a measure of dissimilarity between the spectra. The Itakura-Saito distortion measure is used in LPC to assess the match between the predictor filter and the input spectrum, and in one form is expressed as

$$M = R_0 A_0 + 2\sum_{i=1}^{N} R_i A_i$$

where A₀ etc are the autocorrelation coefficients of the LPC parameter set. It will be seen that this is closely similar to the relationship derived above, and when it is remembered that the LPC coefficients are the taps of an FIR filter having the inverse spectral response of the input signal, so that the LPC coefficient set is the impulse response of the inverse LPC filter, it will be apparent that the Itakura-Saito Distortion Measure is in fact merely a form of equation 1, wherein the filter response H is the inverse of the spectral shape of an all-pole model of the input signal.

[0015] In fact, it is also possible to transpose the spectra, using the LPC coefficients of the test spectrum and the autocorrelation coefficients of the reference spectrum, to obtain a different measure of spectral similarity.
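
As an illustration of how such a measure might be computed, the sketch below derives inverse filter coefficients from a frame by the Levinson-Durbin recursion, autocorrelates them to obtain A_i, and forms R₀A₀ + 2ΣR_iA_i against the autocorrelation R_i of a test frame. The recursion, the function names, the filter order of 8 and the synthetic frames are all assumptions made for the example; the specification does not prescribe a particular LPC analysis method.

```python
import math
import random


def autocorr(x, max_lag):
    """Un-normalised autocorrelation coefficients of x for lags 0..max_lag."""
    n = len(x)
    return [sum(x[k] * x[k + lag] for k in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(R, order):
    """Inverse filter taps (1, a_1..a_order) and residual energy for autocorrelation R."""
    a = [1.0] + [0.0] * order
    err = float(R[0])
    for m in range(1, order + 1):
        acc = R[m] + sum(a[j] * R[m - j] for j in range(1, m))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, m):
            a[j] = prev[j] + k * prev[m - j]
        a[m] = k
        err *= 1.0 - k * k                  # residual energy shrinks each step
    return a, err


def distortion_measure(R, A):
    """R_0*A_0 + 2*sum(R_i*A_i): energy of the frame after inverse filtering."""
    return R[0] * A[0] + 2.0 * sum(R[i] * A[i] for i in range(1, len(A)))


ORDER = 8
rng = random.Random(0)

# Hypothetical frames: white noise as the reference, a noisy sinusoid as the test.
noise = [rng.gauss(0.0, 1.0) for _ in range(160)]
tone = [math.sin(0.9 * k) + 0.1 * rng.gauss(0.0, 1.0) for k in range(160)]

R_noise = autocorr(noise, ORDER)
R_tone = autocorr(tone, ORDER)
a_noise, err_noise = levinson_durbin(R_noise, ORDER)
a_tone, err_tone = levinson_durbin(R_tone, ORDER)
A_noise = autocorr(a_noise, ORDER)          # autocorrelation of the noise inverse filter

m_noise = distortion_measure(R_noise, A_noise)   # reference frame against its own filter
m_tone = distortion_measure(R_tone, A_noise)     # test frame against the noise filter
print(m_noise, err_noise, m_tone, err_tone)
```

When a frame is measured against its own inverse filter the result equals the prediction residual energy, which is the smallest value any inverse filter with a leading tap of 1 can give for that frame; a mismatched filter can only leave more energy behind.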

[0016] The I-S Distortion measure is further discussed in "Speech Coding based upon Vector Quantisation" by A Buzo, A H Gray, R M Gray and J D Markel, IEEE Trans on ASSP, Vol ASSP-28, No. 5, October 1980.

[0017] Since the frames of signal have only a finite length, and a number of terms (N, where N is the filter order) are neglected, the above result is an approximation only; it gives, however, a surprisingly good indicator of the presence or absence of speech and thus may be used as a measure M in speech detection. In an environment where the noise spectrum is well known and stationary, it is quite possible to simply employ fixed h₀, h₁ etc coefficients to model the inverse noise filter.

[0018] However, apparatus which can adapt to different noise environments is much more widely useful.

[0019] Referring to Figure 1, in a first embodiment, a signal from a microphone (not shown) is received at an input 1 and converted to digital samples s at a suitable sampling rate by an analogue to digital converter 2. An LPC analysis unit 3 (in a known type of LPC coder) then derives, for successive frames of n (e.g. 160) samples, a set of N (e.g. 8 or 12) LPC filter coefficients L which are transmitted to represent the input speech. The speech signal s also enters a correlator unit 4 (normally part of the LPC coder 3 since the autocorrelation vector R_i of the speech is also usually produced as a step in the LPC analysis although it will be appreciated that a separate correlator could be provided). The correlator 4 produces the autocorrelation vector R_i, including the zero order correlation coefficient R₀ and at least 2 further autocorrelation coefficients R₁, R₂, R₃. These are then supplied to a multiplier unit 5.

[0020] A second input 11 is connected to a second microphone located distant from the speaker so as to receive only background noise. The input from this microphone is converted to a digital input sample train by AD convertor 12 and LPC analysed by a second LPC analyser 13. The "noise" LPC coefficients produced from analyser 13 are passed to correlator unit 14, and the autocorrelation vector thus produced is multiplied term by term with the autocorrelation coefficients R_i of the input signal from the speech microphone in multiplier 5 and the weighted coefficients thus produced are combined in adder 6 according to Equation 1, so as to apply a filter having the inverse shape of the noise spectrum from the noise-only microphone (which in practice is the same as the shape of the noise spectrum in the signal-plus-noise microphone) and thus filter out most of the noise. The resulting measure M is thresholded by thresholder 7 to produce a logic output 8 indicating the presence or absence of speech; if M is high, speech is deemed to be present.

[0021] This embodiment does, however, require two microphones and two LPC analysers, which adds to the expense and complexity of the equipment necessary.

[0022] Alternatively, another embodiment uses a corresponding measure formed using the autocorrelations from the noise microphone 11 and the LPC coefficients from the main microphone 1, so that an extra autocorrelator rather than an LPC analyser is necessary.

[0023] These embodiments are therefore able to operate within different environments having noise at different frequencies, or within a changing noise spectrum in a given environment.

[0024] Referring to Figure 2, in the preferred embodiment of the invention, there is provided a buffer 15 which stores a set of LPC coefficients (or the autocorrelation vector of the set) derived from the microphone input 1 in a period identified as being a "non speech" (i.e. noise only) period. These coefficients are then used to derive a measure using equation 1, which also of course corresponds to the Itakura-Saito Distortion Measure, except that a single stored frame of LPC coefficients corresponding to an approximation of the inverse noise spectrum is used, rather than the present frame of LPC coefficients.

[0025] The LPC coefficient vector L_i output by analyser 3 is also routed to a correlator 14, which produces the autocorrelation vector of the LPC coefficient vector. The buffer memory 15 is controlled by the speech/non-speech output of thresholder 7, in such a way that during "speech" frames the buffer retains the "noise" autocorrelation coefficients, but during "noise" frames a new set of LPC coefficients may be used to update the buffer, for example by a multiple switch 16, via which outputs of the correlator 14, carrying each autocorrelation coefficient, are connected to the buffer 15. It will be appreciated that correlator 14 could be positioned after buffer 15. Further, the speech/no-speech decision for coefficient update need not be from output 8, but could be (and preferably is) otherwise derived.

[0026] Since frequent periods without speech occur, the LPC coefficients stored in the buffer are updated from time to time, so that the apparatus is thus capable of tracking changes in the noise spectrum. It will be appreciated that such updating of the buffer may be necessary only occasionally, or may occur only once at the start of operation of the detector, if (as is often the case) the noise spectrum is relatively stationary over time, but in a mobile radio environment frequent updating is preferred.
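
The gated update of the buffer can be pictured with the short sketch below. The class name, method name and numeric vectors are hypothetical; the start-up weights are merely one possible choice of fixed high pass filter, namely the autocorrelation of the two-tap response (1, -1).

```python
class NoiseStore:
    """Holds the autocorrelation vector A_i of the most recent noise-only frame."""

    def __init__(self, initial_A):
        self.A = list(initial_A)

    def update(self, frame_A, speech_present):
        """Overwrite the stored vector only for frames flagged as noise."""
        if not speech_present:
            self.A = list(frame_A)
        return self.A


# Assumed start-up weights: autocorrelation of (1, -1) is A_0 = 2, A_1 = -1.
store = NoiseStore([2.0, -1.0, 0.0])

# A frame flagged as speech leaves the store untouched.
after_speech = list(store.update([1.3, -0.4, 0.1], speech_present=True))

# A frame flagged as noise replaces the stored vector.
after_noise = list(store.update([1.1, -0.2, 0.05], speech_present=False))
print(after_speech, after_noise)
```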

[0027] In a modification of this embodiment, the system initially employs equation 1 with coefficient terms corresponding to a simple fixed high pass filter, and then subsequently starts to adapt by switching over to using "noise period" LPC coefficients. If, for some reason, speech detection fails, the system may return to using the simple high pass filter.

[0028] It is possible to normalise the above measure by dividing through by R₀, so that the expression to be thresholded has the form

$$M = A_0 + 2\sum_{i=1}^{N} \frac{R_i A_i}{R_0}$$

This measure is independent of the total signal energy in a frame and is thus compensated for gross signal level changes, but gives rather less marked contrast between "noise" and "speech" levels and is hence preferably not employed in high-noise environments.
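
The effect of dividing by R₀ can be seen in a small numerical sketch, assuming the un-normalised measure has the form R₀A₀ + 2ΣR_iA_i. The vectors are made-up values; doubling the amplitude of a frame multiplies every R_i by four, which scales the un-normalised measure but leaves the normalised one unchanged.

```python
def measure(R, A):
    """Un-normalised measure R_0*A_0 + 2*sum(R_i*A_i)."""
    return R[0] * A[0] + 2.0 * sum(r * a for r, a in zip(R[1:], A[1:]))


def normalised_measure(R, A):
    """The same measure divided through by R_0."""
    return A[0] + 2.0 * sum(r * a for r, a in zip(R[1:], A[1:])) / R[0]


# Hypothetical stored filter autocorrelation and frame autocorrelation.
A = [1.5, -0.75, 0.125]
R_quiet = [8.0, 4.0, 2.0]
R_loud = [4.0 * r for r in R_quiet]   # same frame with the amplitude doubled

print(measure(R_quiet, A), measure(R_loud, A))
print(normalised_measure(R_quiet, A), normalised_measure(R_loud, A))
```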

[0029] Instead of employing LPC analysis to derive the inverse filter coefficients of the noise signal (from either the noise microphone or noise only periods, as in the various embodiments described above), it is possible to model the inverse noise spectrum using an adaptive filter of known type; as the noise spectrum changes only slowly (as discussed below) a relatively slow coefficient adaption rate common for such filters is acceptable. In one embodiment, which corresponds to Figure 1, LPC analysis unit 13 is simply replaced by an adaptive filter (for example a transversal FIR or lattice filter), connected so as to whiten the noise input by modelling the inverse filter, and its coefficients are supplied as before to autocorrelator 14.

[0030] In a second embodiment, corresponding to that of Figure 2, LPC analysis means 3 is replaced by such an adaptive filter, and buffer means 15 is omitted, but switch 16 operates to prevent the adaptive filter from adapting its coefficients during speech periods.

[0031] A second Voice Activity Detector for use with another embodiment of the invention will now be described.

[0032] From the foregoing, it will be apparent that the LPC coefficient vector is simply the impulse response of an FIR filter which has a response approximating the inverse spectral shape of the input signal. When the Itakura-Saito Distortion Measure between adjacent frames is formed, this is in fact equal to the power of the signal, as filtered by the LPC filter of the previous frame. So if spectra of adjacent frames differ little, a correspondingly small amount of the spectral power of a frame will escape filtering and the measure will be low. Correspondingly, a large interframe spectral difference produces a high Itakura-Saito Distortion Measure, so that the measure reflects the spectral similarity of adjacent frames. In a speech coder, it is desirable to minimise the data rate, so frame length is made as long as possible; in other words, if the frame length is long enough, then a speech signal should show a significant spectral change from frame to frame (if it does not, the coding is redundant). Noise, on the other hand, has a slowly varying spectral shape from frame to frame, and so in a period where speech is absent from the signal then the Itakura-Saito Distortion Measure will correspondingly be low - since applying the inverse LPC filter from the previous frame "filters out" most of the noise power.

[0033] Typically, the Itakura-Saito Distortion Measure between adjacent frames of a noisy signal containing intermittent speech is higher during periods of speech than periods of noise; the degree of variation (as illustrated by the standard deviation) is also higher, and less intermittently variable.

[0034] It is noted that the standard deviation of the standard deviation of M is also a reliable measure; the effect of taking each standard deviation is essentially to smooth the measure.

[0035] In this second form of Voice Activity Detector, the measured parameter used to decide whether speech is present is preferably the standard deviation of the Itakura-Saito Distortion Measure, but other measures of variance and other spectral distortion measures (based for example on FFT analysis) could be employed.
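
A sketch of the variance-based parameter follows. It assumes a per-frame distortion value is already available and uses a sliding window with the population standard deviation; the window lengths and the sample values are invented for illustration. Applying the same function to its own output gives the "standard deviation of the standard deviation".

```python
from statistics import pstdev


def sliding_std(values, window):
    """Population standard deviation over each run of `window` consecutive values."""
    return [pstdev(values[i - window + 1:i + 1])
            for i in range(window - 1, len(values))]


# Hypothetical per-frame distortion values: steady at first, then fluctuating.
measures = [1.0, 1.0, 1.0, 1.0, 3.0, 1.0, 5.0, 1.0]

spread = sliding_std(measures, window=4)     # variation of the measure
smoothed = sliding_std(spread, window=2)     # variation of that variation
print(spread)
print(smoothed)
```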

[0036] It is found advantageous to employ an adaptive threshold in voice activity detection. Such thresholds must not be adjusted during speech periods, or the speech signal will be thresholded out. It is accordingly necessary to control the threshold adapter using a speech/non-speech control signal, and it is preferable that this control signal should be independent of the output of the threshold adapter.

[0037] The threshold T is adaptively adjusted so as to keep the threshold level just above the level of the measure M when noise only is present. Since the measure will in general vary randomly when noise is present, the threshold is varied by determining an average level over a number of blocks, and setting the threshold at a level proportional to the average. In a noisy environment this is not usually sufficient, however, and so an assessment of the degree of variation of the parameter over several blocks is also taken into account.

[0038] The threshold value T is therefore preferably calculated according to

$$T = M' + K \cdot d$$

where M' is the average value of the measure over a number of consecutive frames, d is the standard deviation of the measure over those frames, and K is a constant (which may typically be 2).
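
By way of a worked example, and assuming the threshold takes the form T = M' + K·d with d computed as a population standard deviation (the specification does not say which estimator is meant), the calculation might be sketched as follows. The function name and the sample values are illustrative.

```python
from statistics import mean, pstdev


def adaptive_threshold(noise_measures, K=2.0):
    """Average of the measure over noise-only frames plus K standard deviations."""
    return mean(noise_measures) + K * pstdev(noise_measures)


# Hypothetical values of the measure over eight consecutive noise-only frames.
recent = [2, 4, 4, 4, 5, 5, 7, 9]
T = adaptive_threshold(recent)   # average 5, standard deviation 2, K = 2
print(T)
```

With these values the average is 5 and the standard deviation is 2, so the threshold sits at 9, above every noise-only value in the window except the largest.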

[0039] In practice, it is preferred not to resume adaptation immediately after speech is indicated to be absent, but to wait to ensure the fall is stable (to avoid rapid repeated switching between the adapting and non-adapting states).

[0040] Referring to Figure 3, in a preferred embodiment of the invention incorporating the above aspects, an input 1 receives a signal which is sampled and digitised by analogue to digital converter (ADC) 2, and supplied to the input of an inverse filter analyser 3, which in practice is part of a speech coder with which the voice activity detector is to work, and which generates coefficients L_i (typically 8) of a filter corresponding to the inverse of the input signal spectrum. The digitised signal is also supplied to an autocorrelator 4 (which is part of analyser 3), which generates the autocorrelation vector R_i of the input signal (or at least as many low order terms as there are LPC coefficients). Operation of these parts of the apparatus is as described with reference to Figures 1 and 2. Preferably, the autocorrelation coefficients R_i are then averaged over several successive speech frames (typically 5-20 ms long) to improve their reliability. This may be achieved by storing each set of autocorrelation coefficients output by autocorrelator 4 in a buffer 4a, and employing an averager 4b to produce a weighted sum of the current autocorrelation coefficients R_i and those from previous frames stored in and supplied from buffer 4a. The averaged autocorrelation coefficients Ra_i thus derived are supplied to weighting and adding means 5,6 which receives also the autocorrelation vector A_i of stored noise-period inverse filter coefficients L_i from an autocorrelator 14 via buffer 15, and forms from Ra_i and A_i a measure M preferably defined as:

This measure is then thresholded by thresholder 7 against a threshold level, and the logical result provides an indication of the presence or absence of speech at output 8.
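
The averaging performed by buffer 4a and averager 4b might be sketched as below. The class name, the particular weights and the choice to renormalise the weights while the buffer is still filling are assumptions for the example; the description only states that a weighted sum of current and previous coefficients is formed.

```python
from collections import deque


class AutocorrAverager:
    """Weighted sum of the current autocorrelation vector and earlier ones."""

    def __init__(self, weights):
        # weights[0] applies to the current frame, weights[1] to the previous, etc.
        self.weights = list(weights)
        self.history = deque(maxlen=len(self.weights))

    def push(self, R):
        """Store R and return the averaged vector Ra."""
        self.history.appendleft(list(R))          # most recent frame first
        used = self.weights[:len(self.history)]   # fewer weights during start-up
        norm = sum(used)
        return [sum(w * frame[i] for w, frame in zip(used, self.history)) / norm
                for i in range(len(R))]


averager = AutocorrAverager([0.5, 0.25, 0.25])
first = averager.push([8.0, 4.0])
second = averager.push([4.0, 0.0])
third = averager.push([0.0, 4.0])
print(first, second, third)
```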

[0041] In order that the inverse filter coefficients L_i correspond to a fair estimate of the inverse of the noise spectrum, it is desirable to update these coefficients during periods of noise (and, of course, not to update during periods of speech). It is, however, preferable that the speech/non-speech decision on which the updating is based does not depend upon the result of the updating, or else a single wrongly identified frame of signal may result in the voice activity detector subsequently going "out of lock" and wrongly identifying following frames. Preferably, therefore, there is provided a control signal generating circuit 20, effectively a separate voice activity detector, which forms an independent control signal indicating the presence or absence of speech to control inverse filter analyser 3 (or buffer 15) so that the inverse filter autocorrelation coefficients A_i used to form the measure M are only updated during "noise" periods. The control signal generator circuit 20 includes LPC analyser 21 (which again may be part of a speech coder and, specifically, may be performed by analyser 3), which produces a set of LPC coefficients M_i corresponding to the input signal, and an autocorrelator 21a (which may be performed by autocorrelator 3a) which derives the autocorrelation coefficients B_i of M_i. If analyser 21 is performed by analyser 3, then M_i=L_i and B_i=A_i. These autocorrelation coefficients are then supplied to weighting and adding means 22,23 (equivalent to 5, 6) which receive also the autocorrelation vector R_i of the input signal from the autocorrelator 4.
A measure of the spectral similarity between the input speech frame and the preceding speech frame is thus calculated; this may be the Itakura-Saito distortion measure between R_i of the present frame and B_i of the preceding frame, as disclosed above, or it may instead be derived by calculating the Itakura-Saito distortion measure for R_i and B_i of the present frame, and subtracting (in subtractor 25) the corresponding measure for the previous frame stored in buffer 24, to generate a spectral difference signal (in either case, the measure is preferably energy-normalised by dividing by R₀). The buffer 24 is then, of course, updated. This spectral difference signal, when thresholded by a thresholder 26 is, as discussed above, an indicator of the presence or absence of speech. We have found, however, that although this measure is excellent for distinguishing noise from unvoiced speech (a task which prior art systems are generally incapable of) it is in general rather less able to distinguish noise from voiced speech. Accordingly, there is preferably further provided within circuit 20 a voiced speech detection circuit comprising a pitch analyser 27 (which in practice may operate as part of a speech coder, and in particular may measure the long term predictor lag value produced in a multipulse LPC coder). The pitch analyser 27 produces a logic signal which is "true" when voiced speech is detected, and this signal, together with the thresholded measure derived from thresholder 26 (which will generally be "true" when unvoiced speech is present) are supplied to the inputs of a NOR gate 28 to generate a signal which is "false" when speech is present and "true" when noise is present. This signal is supplied to buffer 15 (or to inverse filter analyser 3) so that inverse filter coefficients L_i are only updated during noise periods.
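
The logic formed by thresholder 26, pitch analyser 27 and NOR gate 28 reduces to a few lines. In the sketch below the function and parameter names are hypothetical, and comparing the magnitude of the spectral difference with the threshold is an assumption, since the description does not state how a negative difference is treated.

```python
def store_update_enabled(spectral_difference, difference_threshold, voiced):
    """NOR of the thresholded spectral difference and the voiced-speech flag."""
    # Assumption: the magnitude of the interframe difference is what is thresholded.
    spectral_flag = abs(spectral_difference) > difference_threshold
    # True only when neither input indicates speech, i.e. the frame is taken as noise.
    return not (spectral_flag or voiced)


print(store_update_enabled(0.01, 0.1, voiced=False))  # steady spectrum, no pitch
print(store_update_enabled(0.50, 0.1, voiced=False))  # changing spectrum
print(store_update_enabled(0.01, 0.1, voiced=True))   # pitch detected
```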

[0042] Threshold adapter 29 is also connected to receive the non-speech signal control output of control signal generator circuit 20. The output of the threshold adapter 29 is supplied to thresholder 7. The threshold adapter operates to increment or decrement the threshold in steps which are a proportion of the instant threshold value, until the threshold approximates the noise power level (which may conveniently be derived from, for example, weighting and adding circuits 22, 23). When the input signal is very low, it may be desirable that the threshold is automatically set to a fixed, low, level since at the low signal levels the effect of signal quantisation produced by ADC 2 can produce unreliable results.
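
One possible reading of the threshold adapter is sketched below: the threshold moves towards the noise level by a fixed proportion of its own current value, is frozen while speech is indicated, and falls back to a fixed low level for very weak input. The step size, the low-input limit and the fixed level are hypothetical parameters, not values from the specification.

```python
def adapt_threshold(threshold, noise_level, noise_only, input_level,
                    step=0.25, low_input=1.0, fixed_low_threshold=0.5):
    """Return the threshold to use for the next frame."""
    if input_level < low_input:
        # Very low input: quantisation effects dominate, so use a fixed level.
        return fixed_low_threshold
    if not noise_only:
        # Speech indicated by the control signal: do not adapt.
        return threshold
    if threshold < noise_level:
        return threshold * (1.0 + step)   # increment by a proportion of itself
    if threshold > noise_level:
        return threshold * (1.0 - step)   # decrement by a proportion of itself
    return threshold


t_up = adapt_threshold(8.0, 20.0, noise_only=True, input_level=100.0)
t_down = adapt_threshold(8.0, 2.0, noise_only=True, input_level=100.0)
t_held = adapt_threshold(8.0, 20.0, noise_only=False, input_level=100.0)
t_floor = adapt_threshold(8.0, 20.0, noise_only=True, input_level=0.5)
print(t_up, t_down, t_held, t_floor)
```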

[0043] There may be further provided "hangover" generating means 30, which operates to measure the duration of indications of speech after thresholder 7 and, when the presence of speech has been indicated for a period in excess of a predetermined time constant, the output is held high for a short "hangover" period. In this way, clipping of the middle of low-level speech bursts is avoided, and appropriate selection of the time constant prevents triggering of the hangover generator 30 by short spikes of noise which are falsely indicated as speech.
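
The hangover behaviour might be sketched as a small state machine operating on the per-frame decisions from thresholder 7. The interpretation that the hold begins when a sufficiently long burst ends, together with the function name, the burst length of 3 frames and the hangover of 2 frames, are assumptions made for the example.

```python
def apply_hangover(decisions, min_burst=3, hang=2):
    """Extend speech decisions by `hang` frames after bursts of at least `min_burst` frames."""
    out = []
    burst = 0        # length of the current run of speech decisions
    remaining = 0    # hangover frames still available
    for speech in decisions:
        if speech:
            burst += 1
            if burst >= min_burst:
                remaining = hang      # burst is long enough to arm the hangover
            out.append(True)
        else:
            burst = 0
            if remaining > 0:
                remaining -= 1
                out.append(True)      # hold the output high
            else:
                out.append(False)
    return out


# A single-frame spike followed by a three-frame burst.
raw = [0, 1, 0, 0, 1, 1, 1, 0, 0, 0]
held = apply_hangover(raw)
print(held)
```

The isolated spike at the second frame is passed through but does not arm the hangover, while the three-frame burst is extended by two frames.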

[0044] It will of course be appreciated that all the above functions may be executed by a single suitably programmed digital processing means such as a Digital Signal Processing (DSP) chip, as part of an LPC codec thus implemented (this is the preferred implementation), or as a suitably programmed microcomputer or microcontroller chip with an associated memory device.

[0045] Conveniently, as described above, the voice detection apparatus may be implemented as part of an LPC codec. Alternatively, where autocorrelation coefficients of the signal or related measures (partial correlation, or "parcor", coefficients) are transmitted to a distant station, the voice detection may take place distantly from the codec.

Claims

1. A voice activity detection apparatus comprising:

(i) a first voice activity detector (3-6,14) which operates by forming a measure of the spectral similarity between an input signal and a stored portion of input signal deemed to be speech free to produce an output signal indicating the presence or absence of speech in the input signal;

(ii) a store (15) for containing the stored portion of signal; and

(iii) an auxiliary voice activity detector (20); characterised in that the auxiliary voice activity detector (20) alone controls the updating of the store (15), the auxiliary voice activity detector (20) operating by forming a measure of the spectral similarity between the current signal and an earlier portion of signal.

2. Voice activity detection apparatus comprising:

(i) means (1) for receiving an input signal;

(ii) a store (15) for storing a noise representing signal;

(iii) means (3-6,14) for periodically forming from the input signal and the stored noise representing signal a measure of the spectral similarity between a portion of the input signal and the said estimated noise signal component;

(iv) means (7) for comparing the measure with a threshold value to produce an output indicating the presence or absence of speech;

(v) an auxiliary voice activity detector (20); and

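(vi) store updating means for updating the store from the input signal;

characterised in that the auxiliary voice activity detector (20) is operable in dependence on a measure of spectral similarity between the input signal and a preceding portion of the input signal to produce a control signal indicating the presence or absence of speech and that the store updating means is operable to update the store from the input signal only when said control signal indicates that speech is absent.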

3. Apparatus according to Claim 2, further comprising means for adjusting the said threshold value during periods when speech is indicated by said control signal to be absent.

4. Apparatus according to Claim 2 or 3, in which said auxiliary voice activity detector further comprises voiced speech detection means (27) comprising pitch analysis means for generating a signal indicative of the presence of voiced speech, upon which the control signal produced by the auxiliary voice activity (20) detector also depends.

Drawing