(19)
(11) EP 0 167 364 A1

(12) EUROPEAN PATENT APPLICATION

(43) Date of publication:
08.01.1986 Bulletin 1986/02

(21) Application number: 85304627.4

(22) Date of filing: 28.06.1985
(51) International Patent Classification (IPC)⁴: G10L 3/00
(84) Designated Contracting States:
DE FR

(30) Priority: 06.07.1984 US 628583

(71) Applicant: AT&T Corp.
New York, NY 10013-2412 (US)

(72) Inventors:
  • DonVito, Marc Bernard
    Salem New Hampshire 03079 (US)
  • Schoenherr, Brian William
    Medford Massachusetts 02155 (US)

(74) Representative: Watts, Christopher Malcolm Kelway, Dr. 
Lucent Technologies (UK) Ltd, 5 Mornington Road
Woodford Green, Essex IG8 0TU (GB)


(56) References cited:

    (54) Speech-silence detection with subband coding


    (57) Speech detection is accomplished in conjunction with two-band subband encoding. A detection statistic T(iτ0), used to estimate the short-term speech energy, is developed from energy estimates made in each subband. A speech presence energy threshold λON and a speech silence energy threshold λOFF are computed which adapt to the long-term speech level. The detection statistic is compared to the thresholds to make a decision concerning the presence or absence of speech.
    Also disclosed are considerations for extending the detection to arrangements with more than two subbands.




    Description

    Technical Field



    [0001] The invention relates to signal processing generally, and more particularly to means for detecting intervals of silence in encoded speech.

    Background of the Invention



    [0002] Normal human speech includes intervals of silence which will be referred to herein as "speech silence." When the speech is transmitted electronically, such as in a communications network, the speech-silence occupies a significant portion of the total transmission time. This leads to inefficient use of the communications network, since the only information which is transmitted during the course of the entire speech-silence interval, no matter how long, is the existence of the interval and its duration.

    [0003] Efforts have been made to improve the efficiency of transmission by inserting other information, such as data, in the silence intervals on a time assignment basis. Such an approach is presently used for transatlantic cable and satellite communications in what are known as TASI (time assignment speech interpolation) systems. A system of this type is described, for instance, in U.S. Pat. 4,100,377.

    [0004] Speech silence may be detected even in voice signals which have already been digitally encoded into a pulse code modulated (PCM) format. This is described, for example, in U.S. Pats. 3,909,532 and 4,449,190.

    [0005] Where both encoded speech and data signals share a carrier on a time assignment basis, there is a need for a high degree of accuracy in the determination of speech-silence intervals in order to permit the maximum use of the interval without degradation of the reconstructed speech. Of primary interest in this regard, therefore, are speech-silence boundaries. Each such boundary is a transition either from voice to silence or from silence to voice. Accordingly, there is a need for speech-silence boundary detection with improved accuracy.

    Summary of the Invention



    [0006] In accordance with the novel method and apparatus of the present invention, speech-silence boundaries are detected in the digitally encoded data of at least two subbands of the speech signal. Energy estimates are made for each of the frequency subbands for generating a detection statistic to estimate short-term speech energy. A threshold which is adapted to the long-term speech level is computed. This threshold is compared to the detection statistic to make a decision as to the presence of a silence interval. The resulting detection has significantly improved accuracy over detection using only one frequency band.

    Brief Description of the Drawing



    [0007] 

    FIG. 1 is a functional block circuit diagram of a two-band subband encoder with speech detection in accordance with one example of the present invention.

    FIG. 2 is a functional flow diagram showing in more detail a speech statistic computation subunit of the apparatus of FIG. 1.

    FIG. 3 is a functional flow diagram showing in more detail a threshold computation subunit of the apparatus of FIG. 1.

    FIG. 4 is a functional flow diagram showing in more detail a speech determination subunit of the apparatus of FIG. 1.


    Detailed Description



    [0008] The two-band subband encoder 10 with speech detection shown in FIG. 1 includes a lower frequency subband, or low band, encoding circuit 12 made up of a low pass quadrature mirror filter 14, a by-two decimator 16, and an ADPCM (adaptive differential pulse code modulation) encoder 18. In parallel with the low band circuit 12 is a higher frequency subband, or high band, encoding circuit 20 made up of a high pass quadrature mirror filter 22, a by-two decimator 24, and an ADPCM encoder 26. Both of the encoding circuits 12, 20 operate with a sampling rate of 12 kHz (kilohertz) and receive the same 5.5 kHz analog speech input signal. They send their outputs to a multiplexer 28 for transmission. The details of subband encoding circuits such as the circuits 12, 20 and the multiplexer 28 are known to those in the art and are described, for example, in U.S. Pat. 4,048,443, in "Sub-band Coding," by R. E. Crochiere, Bell System Technical Journal, vol. 60, no. 7, part 2, pp. 1633-1653, Sept. 1981, and in "Digital Voice Storage in a Microprocessor," by J. L. Flanagan, J. D. Johnston, and J. W. Upton, IEEE Transactions on Communications, vol. COM-30, no. 2, pp. 336-345, Feb. 1982.
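    By way of illustration only (this sketch is not part of the original disclosure), the following C fragment shows the kind of two-band analysis split performed by the filters 14, 22 and the decimators 16, 24. The two-tap (Haar) quadrature mirror pair used here is merely a placeholder that satisfies the mirror condition; the actual filter coefficients of the encoder 10 are not specified in this description.

    /* Two-band analysis split: filter by a QMF pair, then decimate by two.
     * The Haar pair below is a trivial placeholder; a practical codec would
     * use longer quadrature mirror filters. */
    #include <stddef.h>

    /* Splits 2*n input samples into n low-band and n high-band samples. */
    void qmf_split(const double *x, size_t n, double *low, double *high)
    {
        for (size_t i = 0; i < n; i++) {
            double a = x[2 * i];
            double b = x[2 * i + 1];
            low[i]  = 0.70710678 * (a + b);   /* low-pass branch, decimated  */
            high[i] = 0.70710678 * (a - b);   /* high-pass branch, decimated */
        }
    }

    Each resulting subband stream would then feed its ADPCM encoder (18 or 26), whose quantizer step sizes drive the speech detector described below.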

    [0009] A speech detector 30, which includes a speech threshold computing subunit 32, a speech statistic computing subunit 34, and a determining subunit 36, is adapted to provide an output to the multiplexer 28 which results in the insertion of a speech presence indicator, or speech flag, in the transmitted output. The input to the speech threshold computing subunit 32 is the step size information from the low band encoder 12. The input to the speech statistic computing subunit 34 is the sample step size information from both the low band encoder 12 and the high band encoder 20. Both the threshold subunit 32 and the statistic subunit 34 provide their outputs to the speech determining subunit 36.

    [0010] The statistic computing subunit 34 is shown in greater detail in FIG. 2. Speech detection is accomplished by deriving information from the encoders 12, 20 and using it to determine whether speech is present or absent. Each of the encoders 12, 20, in the course of its normal encoding function, makes a separate determination of the quantizer step size, based on the signal amplitude in its respective subband. For computational efficiency, the log of the step size is determined and used as a pointer to a step-size table. The log step-size parameters are used as estimates of the speech energy in each band at a given time.

    [0011] Referring now to FIG. 2, the speech sampling period is represented by τ0. The log of the step size in the low band at time t = iτ0 is represented by dL(iτ0), while the log of the step size in the high band is represented by dH(iτ0). Let T(iτ0) be the speech detection statistic used to determine the speech level. Let σL and σH be fixed weights associated with dL(iτ0) and dH(iτ0), and let βDS be a fixed weight such that 0 < βDS < 1. Then the detection statistic T(iτ0) can be computed as follows:

        T(iτ0) = βDS T[(i-1)τ0] + σL dL(iτ0) + σH dH(iτ0)    (1)
    The detection statistic T(iτ0) is smoothed to become a low-pass filtered sum of speech information taken from each subband. The weight βDS is chosen to give T(iτ0) a specific time constant which controls the necessary smoothing of the information. A time constant of 16 milliseconds has been found to be suitable. The constants σL and σH determine the relative weight given to each subband. It has been found to be particularly advantageous to set σH at a value of about 1.5 to 2 times the value of σL. This accentuates discrimination in the high subband, which contains more information for the detection of fricatives and other consonants. The values of these constants for a particular application may be readily determined by means of laboratory tests by one skilled in the art.
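    As a concrete illustration of equation (1) (again, not part of the original disclosure), the following C sketch updates the detection statistic from the two log step-size values. The numerical constants are placeholders chosen only to be consistent with the guidance above (σH roughly twice σL, βDS giving a time constant near 16 milliseconds); they are not values taken from the patent.

    /* Sketch of the detection-statistic update of equation (1):
     *   T(i) = beta_ds * T(i-1) + sigma_l * dL(i) + sigma_h * dH(i)
     * All constants below are illustrative placeholders only. */
    typedef struct {
        double t;          /* current detection statistic T(i*tau0)     */
        double beta_ds;    /* leak factor, 0 < beta_ds < 1              */
        double sigma_l;    /* weight for the low-band log step size     */
        double sigma_h;    /* weight for the high-band log step size    */
    } det_stat;

    static double det_stat_update(det_stat *s, double d_low, double d_high)
    {
        s->t = s->beta_ds * s->t + s->sigma_l * d_low + s->sigma_h * d_high;
        return s->t;
    }

    /* Example initialisation: sigma_h about twice sigma_l, and a leak factor
     * giving a time constant of roughly 16 ms at a 12 kHz sample rate. */
    static void det_stat_init(det_stat *s)
    {
        s->t = 0.0;
        s->beta_ds = 1.0 - 1.0 / (0.016 * 12000.0);  /* about 0.9948, placeholder */
        s->sigma_l = 1.0;
        s->sigma_h = 2.0;
    }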

    [0012] FIG. 3 shows the method of computing a speech presence energy threshold λON and a speech silence energy threshold λOFF. This method is very similar to that used in ADPCM speech detection, using the log step size dL(iτ0) from the lower subband only. M(iτ0) tracks the maximum of the weighted values σM dL(iτ0), where σM is a constant weight. Therefore, when σM dL(iτ0) increases, M(iτ0) increases; when σM dL(iτ0) decreases, M(iτ0) decreases only very slowly according to the leak factor βM. M(iτ0) is restrained from decreasing to less than its lower limit M0, so M(iτ0) measures the maximum speech energy in the lower subband.

    [0013] The variable d'L is defined to be

        d'L(iτ0) = dL(iτ0) + 32    (2)

    where the bias of 32 is used to ensure that d'L and M are always positive. The value of M at time iτ0 is

        M(iτ0) = max{σM d'L(iτ0), max[βM M((i-1)τ0), M0]}    (3)

    [0014] The thresholds are fixed distances below M. The threshold λON, used to determine when speech changes from OFF to ON, is computed as follows:

        λON(iτ0) = M(iτ0) - CON    (4)

    The threshold λOFF, used to determine when speech changes from ON to OFF, is

        λOFF(iτ0) = M(iτ0) - COFF    (5)

    The values of CON and COFF are constants, with COFF > CON.
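    The maximum-level tracking and threshold computation of equations (2)-(5) can be sketched in C as follows (illustrative only); the structure mirrors the equations, while the numerical values of σM, βM, M0, CON and COFF are left as fields to be filled in for a particular application, since the disclosure does not fix them.

    /* Sketch of the maximum-level tracker and thresholds of equations (2)-(5).
     * All constants are application-dependent placeholders. */
    typedef struct {
        double m;         /* long-term maximum level M                       */
        double sigma_m;   /* weight applied to the biased low-band step size */
        double beta_m;    /* leak factor, slightly less than 1               */
        double m0;        /* lower limit on M                                */
        double c_on;      /* distance of the ON threshold below M            */
        double c_off;     /* distance of the OFF threshold below M, > c_on   */
    } max_level;

    static double maxd(double a, double b) { return a > b ? a : b; }

    /* d_low is the low-band log step size dL(i*tau0). */
    static void max_level_update(max_level *ml, double d_low,
                                 double *lambda_on, double *lambda_off)
    {
        double d_biased = d_low + 32.0;                 /* d'L = dL + 32, eq. (2) */
        double leaked   = maxd(ml->beta_m * ml->m, ml->m0);
        ml->m = maxd(ml->sigma_m * d_biased, leaked);   /* eq. (3)                */
        *lambda_on  = ml->m - ml->c_on;                 /* eq. (4)                */
        *lambda_off = ml->m - ml->c_off;                /* eq. (5)                */
    }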

    [0015] FIG. 4 shows how the comparison is done. The speech samples are divided into blocks of some convenient length. (In this case 24 samples per block are used.) Once per block, a decision is made concerning whether speech is ON or OFF. If, in the previous block, speech was off, the ON threshold λON is used; if speech was on, the OFF threshold λOFF is used. The switch in FIG. 4 chooses the correct threshold, which is then compared to the detection statistic. The speech flag is set ON or OFF depending on whether the detection statistic is above or below the threshold. Let τDS be the time interval associated with one block. (In this case, τDS = 24τ0.) Let S denote the speech state with two possible values:

        S(iτDS) = ON (speech present) or S(iτDS) = OFF (speech absent)    (6)
    [0016] The speech state S(iτDS) at time t = iτDS depends on the previous speech state S[(i-1)τDS] as follows: when S[(i-1)τDS] = OFF,

        S(iτDS) = ON if T(iτDS) > λON, and S(iτDS) = OFF otherwise;    (7)

    when S[(i-1)τDS] = ON,

        S(iτDS) = OFF if T(iτDS) < λOFF, and S(iτDS) = ON otherwise.    (8)
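    A minimal C sketch of the per-block decision of equations (6)-(8) follows (illustrative only); it simply applies the hysteresis described above, using the ON threshold when the previous block was silent and the OFF threshold when it contained speech.

    /* Per-block speech decision of equations (6)-(8): once per block of 24
     * samples, compare the detection statistic with the threshold selected
     * by the previous state (hysteresis). */
    typedef enum { SPEECH_OFF = 0, SPEECH_ON = 1 } speech_state;

    static speech_state decide_speech(speech_state prev, double t_stat,
                                      double lambda_on, double lambda_off)
    {
        if (prev == SPEECH_OFF)
            return (t_stat > lambda_on)  ? SPEECH_ON  : SPEECH_OFF;  /* eq. (7) */
        else
            return (t_stat < lambda_off) ? SPEECH_OFF : SPEECH_ON;   /* eq. (8) */
    }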
    [0017] The system 10 can be effectively implemented by a person of ordinary skill in the art of subband encoding by appropriately adapting two or more digital signal processor microcomputers. Such microcomputers are presently in use and may include a memory unit, an arithmetic unit, a control unit, an input-output unit, and a machine language storage unit in a single VLSI circuit. Their function may alternatively be provided by a combination of several interconnected VLSI circuits. One such microcomputer which is suitable for implementing the system 10 is a DSP (Digital Signal Processor) manufactured by AT&T Technologies, Inc., a corporation of New York, U.S.A., and described, for example, in the above-mentioned Bell System Technical Journal volume.

    [0018] In one example of a system implemented with two DSP's, one DSP is used for the encoding and transmission of speech, while the other DSP is used for the reception and decoding of speech. External logic is used to interface the PCM (pulse code modulation) bit streams of each DSP to both analog-to-digital and digital-to-analog converters for speech input and output. The DSP microcomputers also perform speech-silence detection on the speech signal, so that the silence intervals can be used to transmit user-supplied data.

    [0019] The DSP microcomputers determine the speech state every two milliseconds. The transmitting DSP provides the speech-state status for external circuitry and generates a 112-bit frame for transmission. The frame consists of a 3-bit framing pattern, a 1-bit speech flag, and 24 samples of subband encoded speech. This speech is sampled at a 12 kHz rate and encoded with 5-bit accuracy in the low band and 4-bit accuracy in the high band. When the DSP indicates the speech flag is on, external line interface circuitry sends the DSP-generated frame intact. When the speech flag is off, the 24 samples of speech are replaced by 108 bits of user-supplied data. After construction, the frame is sent over a 56 kbps (kilobits per second) digital channel to another terminal for decoding.
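    The following C sketch (illustrative, not part of the original disclosure) assembles such a 112-bit frame. The payload layout assumed here, 12 five-bit low-band codes followed by 12 four-bit high-band codes, is one plausible packing that accounts for the 108 payload bits; the exact bit ordering and the value of the framing pattern are not specified in this description and are therefore placeholders.

    /* Assemble one 112-bit transmit frame: 3-bit framing pattern, 1-bit
     * speech flag, and 108 payload bits of either encoded speech or data.
     * FRAME_SYNC and the payload ordering are placeholders. */
    #include <stdint.h>

    #define FRAME_BITS 112
    #define FRAME_SYNC 0x5          /* placeholder 3-bit framing pattern */

    static void put_bits(uint8_t *frame, int *pos, unsigned value, int nbits)
    {
        for (int b = nbits - 1; b >= 0; b--, (*pos)++)
            frame[*pos] = (value >> b) & 1u;   /* one bit per byte, MSB first */
    }

    /* low[12]: 5-bit ADPCM codes, high[12]: 4-bit ADPCM codes,
     * data[108]: user bits sent when the speech flag is off. */
    static void build_frame(uint8_t frame[FRAME_BITS], int speech_on,
                            const unsigned low[12], const unsigned high[12],
                            const uint8_t data[108])
    {
        int pos = 0;
        put_bits(frame, &pos, FRAME_SYNC, 3);        /* framing pattern */
        put_bits(frame, &pos, speech_on ? 1u : 0u, 1);
        if (speech_on) {
            for (int i = 0; i < 12; i++) put_bits(frame, &pos, low[i], 5);
            for (int i = 0; i < 12; i++) put_bits(frame, &pos, high[i], 4);
        } else {
            for (int i = 0; i < 108; i++) frame[pos++] = data[i] & 1u;
        }
    }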

    [0020] In the receiver, a simple framing algorithm is implemented with a combination of DSP firmware and external line interface circuitry. The framing algorithm searches the incoming 56 kbps signal to find the position of the 3-bit framing pattern. After the receiving DSP synchronizes itself with the framing pattern, it reads the speech flag. If the speech flag is set, the DSP decodes the incoming speech signal for listening; if the flag is not set, the DSP signals external circuitry to remove the data and send it to a user interface. This process is repeated every two milliseconds, as long as a valid framing pattern is detected.
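    One simple way to realize such a framing search, sketched below for illustration (the actual firmware algorithm is not detailed in this description), is to test each candidate bit offset within one frame and accept the first offset at which the framing pattern repeats over several consecutive frames. The confirmation count and the pattern value are placeholders.

    /* Frame-alignment search over a received bit stream (one bit per byte,
     * MSB-first pattern), using the same placeholder values as the transmit
     * sketch above.  Returns the bit offset of alignment, or -1 if none. */
    #include <stdint.h>

    #define FRAME_BITS 112
    #define FRAME_SYNC 0x5
    #define CONFIRM_FRAMES 4

    static int find_frame_offset(const uint8_t *bits, int nbits)
    {
        for (int off = 0; off < FRAME_BITS; off++) {
            int ok = 1;
            for (int f = 0; f < CONFIRM_FRAMES; f++) {
                int p = off + f * FRAME_BITS;
                if (p + 3 > nbits) { ok = 0; break; }
                unsigned pat = (bits[p] << 2) | (bits[p + 1] << 1) | bits[p + 2];
                if (pat != FRAME_SYNC) { ok = 0; break; }
            }
            if (ok)
                return off;
        }
        return -1;
    }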

    [0021] The equations above describe the general concepts involved in determining the quantities needed by the speech detector. Due to finite bit length and timing considerations in the DSP, some of these equations are preferably slightly modified. For example, the system 10 is based on a 24-sample frame, so every 24 samples a decision is made as to whether speech is present. The speech detection statistic is computed in this framework by the DSP as follows:

        T(iτDS) = βDS T[(i-1)τDS] + Σk [σL d'L(kτ0) + σH d'H(kτ0)]    (9)

    where the sum is taken over the 24 sample times kτ0 falling in block i.
    [0022] So T(iτDS) is updated each sample period by adding σL d'L + σH d'H to it, and it is leaked once per block of 24 samples. The value of the maximum level M must also be computed slightly differently to obtain accurate results with the DSP. Let τMAX be the time interval between two successive points at which M is leaked. Experimentally, it was found that τMAX = 8 seconds works well. The equation for M that may be implemented in the DSP is

        M(iτ0) = max[σM d'L(iτ0), M((i-1)τ0), M0]    (10)

    with M multiplied by the leak factor βM once every τMAX seconds.
    [0023] The thresholds need to be computed only once per block of 24 samples, where they are used to detect the presence or absence of speech:

        λON(iτDS) = M(iτDS) - CON,  λOFF(iτDS) = M(iτDS) - COFF    (11)
    [0024] The speech state is then determined in the same way as described above, using equations (6)-(8).
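    Pulling the block-oriented modifications of paragraphs [0021]-[0024] together, the following C sketch (illustrative only, reusing the det_stat, max_level and decide_speech helpers from the earlier sketches) accumulates the statistic every sample, leaks it once per 24-sample block, refreshes the thresholds once per block, and leaks the maximum level only once every τMAX seconds. The constants, and the assumption that the high-band step size is biased in the same way as d'L, are placeholders.

    /* Block-oriented processing per paragraphs [0021]-[0024].  In this
     * variant beta_ds acts as the per-block leak factor.  Constants are
     * placeholders; d' = d + 32 is assumed for both bands. */
    #define BLOCK_SAMPLES 24
    #define SAMPLE_RATE   12000
    #define TAU_MAX_SEC   8

    static speech_state process_block(det_stat *ds, max_level *ml,
                                      speech_state prev,
                                      const double d_low[BLOCK_SAMPLES],
                                      const double d_high[BLOCK_SAMPLES],
                                      long *samples_since_m_leak)
    {
        for (int k = 0; k < BLOCK_SAMPLES; k++) {
            /* accumulate the biased step sizes every sample period, eq. (9) */
            ds->t += ds->sigma_l * (d_low[k] + 32.0)
                   + ds->sigma_h * (d_high[k] + 32.0);

            /* leak the maximum level only once every TAU_MAX_SEC seconds */
            if (++(*samples_since_m_leak) >= (long)TAU_MAX_SEC * SAMPLE_RATE) {
                ml->m = ml->beta_m * ml->m;
                *samples_since_m_leak = 0;
            }

            /* track the maximum level, eq. (10) */
            double d_biased = d_low[k] + 32.0;
            double floor_m  = ml->m > ml->m0 ? ml->m : ml->m0;
            ml->m = ml->sigma_m * d_biased > floor_m ? ml->sigma_m * d_biased
                                                     : floor_m;
        }

        ds->t *= ds->beta_ds;                 /* leak the statistic once per block */

        double lambda_on  = ml->m - ml->c_on;   /* thresholds once per block, eq. (11) */
        double lambda_off = ml->m - ml->c_off;

        return decide_speech(prev, ds->t, lambda_on, lambda_off);
    }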

    [0025] This invention is not limited to two-band subband coding. The detection statistic T(iτ0) and maximum level M(iτ0) can include information from a larger number of subbands, using equations similar to equations (1)-(11) above. Silence detection with five-band subband coding is an example of this. Let dj(iτ0) for j = 1,...,5 be the log step size values for each of the five bands, let σj, j = 1,...,5 be fixed weights, and let βDS be a leak factor slightly less than 1. In analogy with equation (1), a general equation describing the speech detection statistic is

        T(iτ0) = βDS T[(i-1)τ0] + Σj σj dj(iτ0),  j = 1,...,5    (12)
    Letting µj, j = 1,...,5 be fixed weights, and βM a fixed leak factor slightly less than 1, the general equation for the maximum level is

        M(iτ0) = max{Σj µj dj(iτ0), max[βM M((i-1)τ0), M0]},  j = 1,...,5    (13)
    Some of the weighting factors σj or µj could be zero. As in equations (9)-(11), equations (12)-(13) can be slightly altered to conform to a specific hardware implementation, such as an implementation using a DSP microprocessor. It is also necessary to choose specific values of the parameters in equations (12)-(13). For the computation of the detection statistic, σ1 = σ2, and σ3 = σ4 = 2σ1, giving a greater weight to the higher frequency bands; band 5 is not used, so σ5 = 0. For the computation of the maximum level, µ1 = µ2, and µ3 = µ4 = µ5 = 0. The maximum level thus depends on the energy in the low-frequency bands, giving a smooth long-term average.
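    For illustration, the generalized statistic and maximum level of equations (12) and (13) can be carried by a single routine parameterized on the number of bands, as in the following C sketch (not part of the original disclosure); setting a weight σj or µj to zero simply excludes that band, as in the five-band example above. All values are placeholders to be supplied per application.

    /* Generalised N-band statistic and maximum level, equations (12)-(13). */
    #include <stddef.h>

    typedef struct {
        size_t nbands;
        const double *sigma;   /* per-band weights for the detection statistic */
        const double *mu;      /* per-band weights for the maximum level       */
        double beta_ds;        /* leak factor for T, slightly less than 1      */
        double beta_m;         /* leak factor for M, slightly less than 1      */
        double m0;             /* lower limit on M                             */
        double t;              /* detection statistic T                        */
        double m;              /* maximum level M                              */
    } multiband_detector;

    /* d[j] is the log step size of band j at the current sample time. */
    static void multiband_update(multiband_detector *det, const double *d)
    {
        double t_sum = 0.0, m_sum = 0.0;
        for (size_t j = 0; j < det->nbands; j++) {
            t_sum += det->sigma[j] * d[j];    /* weighted sum in eq. (12) */
            m_sum += det->mu[j] * d[j];       /* weighted sum in eq. (13) */
        }
        det->t = det->beta_ds * det->t + t_sum;         /* eq. (12) */

        double leaked = det->beta_m * det->m;
        if (leaked < det->m0) leaked = det->m0;
        det->m = (m_sum > leaked) ? m_sum : leaked;     /* eq. (13) */
    }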

    [0026] In theory, equations (12) and (13) can be extended to any number of bands. However, as the number of bands increases, the time needed to compute the detection statistic and the maximum level also increases. Therefore, there is a practical limit to the number of bands that can be used in this system.


    Claims

    1. Signal encoding apparatus
    CHARACTERIZED BY

    means for encoding a plurality of frequency subband portions of a signal, including means for generating voltage step size values for signal samples of each subband;

    means for computing speech statistic values based on the voltage step size values for one of the frequency subbands and the voltage step size values for another of the frequency subbands; and

    means for comparing speech presence energy threshold values and speech silence energy threshold values to the speech statistic values to selectively generate speech presence output signals.


     
    2. The apparatus defined in claim 1 wherein said speech statistic value computing means is
    CHARACTERIZED BY

    means for multiplying the step size values of each subband by a corresponding speech detection coefficient to generate respective speech detection value products;

    means for summing the speech detection value products to generate speech detection value sums, and

    means for smoothing the speech detection value sum.


     
    3. The apparatus defined in claim 2
    CHARACTERIZED IN THAT
    said smoothing means comprises means for summing each speech detection value sum with a delay value to generate a speech detection statistic output value, the delay value being the product of a detection constant and a previous detection statistic output value.
     
    4. The apparatus defined in claim 3
    CHARACTERIZED BY
    means for computing speech energy threshold values and speech silence threshold values based on the voltage step size values for one of the subbands.
     
    5. The apparatus defined in claim 4 wherein said threshold value computing means is
    CHARACTERIZED BY
    means for generating a speech presence threshold value and a speech silence value from a maximum energy level value, the maximum energy level value being generated by choosing the maximum of first and second energy levels, the first energy level being the product of a step size value of the low frequency subband and a constant weight, and the second energy level being the larger of the previous sample's maximum energy level value multiplied by a coefficient and a lower limit.
     
    6. The apparatus defined in claim 5
    CHARACTERIZED BY

    switch means which connect either the speech presence threshold value or the speech silence value from the generating means to one input of a comparator in response to a control signal, the other input of the comparator being connected to receive the speech detection statistic, and

    feedback means including a one-sample delay means connected between the output of said comparator and said switch for generating the control signals.


     
    7. A method of detecting the presence of speech content in a signal,
    CHARACTERIZED BY

    computing a short term speech statistic from step size value information of at least two frequency subbands of the signal, and

    comparing the speech statistic to a long term speech energy threshold to selectively generate a speech presence indication signal.


     
    8. The method defined in claim 7 further
    CHARACTERIZED BY
    computing a long term speech energy threshold from the step size information of at least one of the subbands.
     
    9. The method defined in claim 8
    CHARACTERIZED BY
    giving greater weight to the step size values for a higher frequency subband than to those of a lower frequency subband when computing the short term speech statistic.
     




    Drawing










    Search report