(11) | EP 1 521 242 A1 |
(12) | EUROPEAN PATENT APPLICATION |
(54) | Speech coding method applying noise reduction by modifying the codebook gain |
(57) The invention refers to a method for encoding an acoustic signal (y(n)) containing
a speech component and a noise component by using an analysis through synthesis method,
wherein for encoding the acoustic signal a synthesised signal is compared with the
acoustic signal for a time interval, said synthesised signal being described by using
a fixed codebook and an associated fixed gain, comprising the steps:
Field of the Invention
a) a spectral analysis/synthesis system (typically a short-term windowed FFT (Fast Fourier Transform) and IFFT (Inverse Fast Fourier Transform)),
b) a noise estimation procedure, and
c) a spectral gain computation according to a suppression rule, which is used for suppressing the noise.
a) decoding
b) FFT analysis
c) Speech enhancement, e.g. noise reduction
d) Inverse FFT analysis
e) encoding
Fig. 1: Scheme of a noise reduction in the frequency domain
Fig. 2: Schematic function of the AMR encoder
Fig. 3: Schematic function of the AMR decoder
Fig. 4: Scheme of a noise reduction method in the parameter domain
1. Function of an encoder (Fig. 2)
1. The speech signal is processed block-wise and thus partitioned into frames and sub-frames. Each frame is 20 ms long (160 samples at 8 kHz sampling frequency) and is divided into 4 sub-frames of equal length.
2. LPC analysis of a Hamming-windowed frame.
3. For stability reasons, the LPC filter coefficients are transformed to Line Spectrum Frequencies (LSF). Afterwards these coefficients are quantized in order to save bit rate. This step and the previous one are done once per frame (except in 12.2 kbit/s mode, where the LPC coefficients are calculated and quantised twice per frame), whereas steps 4 to 9 are performed on a sub-frame basis.
4. The sub-frames are filtered by an LPC filter with retransformed and quantised LSF coefficients. Additionally the filter is modified to improve the subjective listening quality.
5. As the encoding is processed block by block, the decaying part of the filter response, which is longer than the block length, has to be considered when processing the next sub-frame. In order to speed up the minimization of the residual power described in the following, the zero-input response of the synthesis filter excited by previous sub-frames is subtracted.
6. The power of the LPC filtered error signal e(n) depends on four variables: the excitation of the adaptive codebook, the excitation of the fixed codebook and the respective gain factors ga and gf. As no closed-form solution of this problem exists, finding the global minimum of the power of the residual signal would require testing all possible combinations of these four parameters. As this minimization is too complex, the problem is divided into subproblems, which of course results in a suboptimal solution. First the adaptive codebook is searched to get the optimal lag L and gain factor ga,L. Afterwards the optimal excitation scaled with the optimal gain factor is synthesis-filtered and subtracted from the target signal. This adaptive codebook search corresponds to an LTP (long-term prediction) filtering.
7. In a second step of the minimization problem the fixed codebook is searched. The search is analogous to the previous adaptive codebook search, i.e. the codebook vector that minimizes the error criterion is sought. Afterwards the optimal fixed gain is determined. The resulting coding parameters are the index J of the fixed codebook vector and the optimal gain factor gf,J.
8. The scaling factors of the codebooks are quantized jointly (except in 12.2 kbit/s mode, where both gains are scalar-quantized), resulting in a quantization index, which is also transmitted to the decoder.
9. Completing the processing of the sub-frame, the optimal excitation signal is computed and saved in the adaptive codebook. The synthesis filter states are also saved so that this decaying part can be subtracted in the next sub-frame.
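The search principle of steps 6 and 7 can be illustrated with a minimal sketch. It is a generic least-squares search over a toy codebook and not the AMR search procedure: the function name `search_codebook`, the toy vectors and the assumption that the candidate vectors have already been synthesis-filtered are all hypothetical. For each candidate y, the gain that minimises the squared error against the target x is <x, y> / <y, y>, and the best candidate is the one that maximises <x, y>² / <y, y>.

```python
def search_codebook(target, filtered_codebook):
    """Return (index, gain) minimising sum((target - gain * y)**2).

    `filtered_codebook` holds candidate vectors that are assumed to be
    already passed through the synthesis filter (hypothetical input).
    """
    best_index, best_gain, best_score = None, 0.0, -1.0
    for index, y in enumerate(filtered_codebook):
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue  # an all-zero vector cannot reduce the error
        corr = sum(t * v for t, v in zip(target, y))
        score = corr * corr / energy  # error reduction achieved by y
        if score > best_score:
            best_index, best_gain, best_score = index, corr / energy, score
    return best_index, best_gain


# Toy data for illustration only.
target = [2.0, 0.0, -2.0, 0.0]
codebook = [
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, -1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
]
index, gain = search_codebook(target, codebook)
```

Searching the adaptive codebook first and the fixed codebook second, each with a search of this kind, is what makes the overall solution suboptimal but tractable.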
2. Function of a decoder (Fig. 3)
3. Embodiment of a noise reduction rule
a) A noisy signal y(t) is coded by a CELP (code-excited linear prediction) codec, e.g. the AMR codec.
b) There the signal is described or coded through the so-called 'parameters', i.e. the fixed codebook entry, the fixed codebook gain, the adaptive codebook entry, the adaptive codebook gain, the LPC coefficients etc.
c) With a special processing the fixed codebook gain gy(m) of the signal is extracted from these parameters.
d) A noise reduction is applied to gy(m).
d1) Accordingly an estimate ĝn(m) of the noise fixed gain is needed.
d2) Furthermore a reduction rule is required, which is applied to the noisy fixed gain gy(m). An 'unnoisy' or clean fixed gain, i.e. an estimate of the speech gain ĝs(m), is thus obtained.
e) The coded signal is then recomputed by replacing the noisy fixed gain gy(m) with the estimate of the speech gain ĝs(m) and leaving the other codec parameters unchanged. The resulting set of codec parameters is assumed to code a clean signal.
f) Optionally a postfilter is applied in order to control the noise reduction rule and to avoid artefacts stemming from the reduction rule.
g) If the encoded signal, modified as described above, is decoded, a clean signal in the time domain is obtained.
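Step e) can be sketched as follows. The container `CelpParams`, its field names and the helper `denoise_params` are hypothetical and only illustrate that a single parameter, the fixed codebook gain, is exchanged while all other parameters pass through untouched. The reduction rule is passed in as a function, and the one used in the example is a mere placeholder.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class CelpParams:
    """Hypothetical container for the parameters of one sub-frame."""
    fixed_index: int
    fixed_gain: float
    adaptive_lag: int
    adaptive_gain: float
    lpc: Tuple[float, ...]


def denoise_params(params, estimate_speech_gain):
    """Swap the noisy fixed gain for an estimated speech gain (step e)."""
    clean_gain = estimate_speech_gain(params.fixed_gain)
    return replace(params, fixed_gain=clean_gain)


noisy = CelpParams(fixed_index=17, fixed_gain=0.8, adaptive_lag=55,
                   adaptive_gain=0.6, lpc=(1.0, -0.9))
# Placeholder reduction rule: halve the gain (illustration only).
clean = denoise_params(noisy, lambda g: 0.5 * g)
```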
a) Gain subtraction
One possibility to achieve parameters representing the fixed signal is gain subtraction.
Assuming that the fixed codebook gain gy(m) from the noisy speech is the sum of the clean fixed codebook gain from the clean speech and the noise fixed codebook gain from the noise, the fixed codebook gain is modified accordingly:
$$\hat{g}_s(m) = g_y(m) - \hat{g}_n(m)$$
where m denotes a time interval, e.g. a frame or a subframe, ĝn(m) the estimate of the fixed gain of the noise component and ĝs(m) the estimate of the clean codebook gain. How the estimate ĝn(m) of the fixed gain of the noise component can be calculated is described in the next section with reference to a different embodiment.
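A minimal numerical sketch of the gain subtraction, using hypothetical gain values chosen only for illustration:

```python
def subtract_noise_gain(g_y, g_n_hat):
    """Estimate the clean fixed gain as noisy gain minus noise gain."""
    return g_y - g_n_hat


# Hypothetical per-sub-frame values: noisy gains and noise-gain estimates.
noisy_gains = [1.00, 0.40, 0.25]
noise_gains = [0.30, 0.30, 0.30]
speech_gains = [subtract_noise_gain(g, n)
                for g, n in zip(noisy_gains, noise_gains)]
```

In the third sub-frame the noise estimate exceeds the noisy gain, so the plain difference becomes negative. This is the situation in which a lower bound on the estimated speech gain is useful.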
b) Minimisation
As an alternative to gain subtraction, the fixed gain can be modified by using a modification factor γc(m), which can be derived from an estimated signal to noise ratio SN̂R.
As the estimated noise fixed codebook gain ĝn(m) can be larger than the decoded fixed codebook gain gy(m), the estimated clean speech fixed codebook gain ĝs(m) is bounded below by a positive threshold gsmin for a least square error minimization.
Let gy(m), ĝs(m), ĝn(m) be respectively the noisy speech fixed codebook gain, the estimated clean speech fixed codebook gain and the estimated noise-only fixed codebook gain. The fixed codebook gain is modified according to the following equation:
$$\hat{g}_s(m) = \gamma_c(m)\, g_y(m)$$
γc(m) based on the MMSE criterion (transposition of the MMSE criterion in the frequency
domain as in [3] to the codec parameter domain) is computed using:
where SN̂R is the estimate of the signal to noise ratio
The computation of SN̂R(m) is based on an "a priori" SNR estimation, which is computed recursively, i.e. it depends on the previous time interval, e.g. the previous subframe. The following estimate has proved useful in practice:
with the exponential weighting factor
Values used in advantageous embodiments are δ1 = 2, δ2 = 0.75. β denotes a weighting factor that takes account of the past of the speech signal, e.g. as in the formula above by considering the quantities in the previous subframe. The greater β, the more the past is emphasised. Useful values for β have been found especially in the range β ∈ [0.7, 0.8].
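The interplay of a recursive SNR estimate, the modification factor and the threshold gsmin can be sketched as follows. The concrete formulas are assumptions of this sketch and are not taken from the text: the factor is given the Wiener-type form SNR / (1 + SNR), the recursion mixes the previous estimate (weight β) with an instantaneous estimate derived from the squared gain ratio, and the exponential weighting with δ1 and δ2 is omitted. All function names and input values are hypothetical.

```python
def modification_factor(snr):
    """Wiener-type gain snr / (1 + snr); assumed form, see lead-in."""
    return snr / (1.0 + snr)


def update_snr(prev_snr, g_y, g_n_hat, beta=0.75):
    """Recursive SNR estimate mixing the past and the current sub-frame.

    Assumed form: beta weights the previous estimate, (1 - beta) weights
    the instantaneous estimate max((g_y / g_n_hat)**2 - 1, 0).
    """
    instantaneous = max((g_y / g_n_hat) ** 2 - 1.0, 0.0)
    return beta * prev_snr + (1.0 - beta) * instantaneous


def modify_gain(g_y, snr, g_s_min=0.0):
    """Scale the noisy gain and keep the result at or above g_s_min."""
    return max(modification_factor(snr) * g_y, g_s_min)


snr = 0.0
g_n_hat = 0.5                      # hypothetical noise-gain estimate
speech_gains = []
for g_y in [0.5, 1.0, 1.5]:        # hypothetical noisy gains
    snr = update_snr(snr, g_y, g_n_hat)
    speech_gains.append(modify_gain(g_y, snr, g_s_min=0.01))
```

With a larger β the estimate reacts more slowly to a rising gain, which matches the remark that a greater β emphasises the past.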
c) Estimation of ĝn
For determining the estimate of SN̂R according to the formula above, knowledge of the estimate ĝn of the noise fixed gain is also required.
The estimation of ĝn is based on the principle of minimum statistics, wherein the short-term minimum of
the estimation of the noisy signal power P(m) is searched:
with
where σ̂ is the estimate of the noise power, found by taking the minimal value of the noisy signal power P over a window of length D. As a noisy speech signal contains speech and noise, the minimum value that is present over a certain time period, and which occurs e.g. in speech pauses, represents the noise power σ̂.
αmax is a constant, e.g. an advantageous embodiment uses αmax = 0.96.
To reduce the number of comparisons required for estimating the noise power σ̂, the window of length D is divided into U sub-windows of length V. The minimum value in the window of length D is the minimum of the set of minima of the sub-windows. A buffer Min_I of U elements contains the set of minima from the last U sub-windows. It is renewed each time V values of P have been computed: the oldest element of the buffer is deleted and replaced by the minimum of the last V values of P. The minimum σ̂ on the window of length D for each sub-frame m is the smaller of the minimum of the buffer and the last value of P computed. σ̂ can be increased by a gain parameter omin to compensate the bias of the estimation. A bias might be due to a continued overestimating of the noise, e.g. if a continually present murmuring is considered as noise only.
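The sub-window bookkeeping can be sketched as follows. The class name `MinimumTracker` and the input values are hypothetical. The smoothing that produces P, the bias compensation with omin and the final conversion to ĝn are deliberately left out, so the sketch covers only the buffer logic.

```python
from collections import deque
import math


class MinimumTracker:
    """Track the minimum of P over a window split into sub-windows."""

    def __init__(self, num_subwindows, subwindow_len):
        self.buffer = deque(maxlen=num_subwindows)  # Min_I in the text
        self.subwindow_len = subwindow_len          # V
        self.current_min = math.inf
        self.count = 0

    def update(self, p):
        """Feed one value of P and return the current minimum estimate."""
        self.current_min = min(self.current_min, p)
        self.count += 1
        if self.count == self.subwindow_len:
            # deque(maxlen=U) drops the oldest minimum automatically.
            self.buffer.append(self.current_min)
            self.current_min = math.inf
            self.count = 0
        # Minimum between the buffered minima and the last value of P.
        return min(min(self.buffer, default=p), p)


tracker = MinimumTracker(num_subwindows=2, subwindow_len=3)
powers = [5.0, 4.0, 6.0, 3.0, 7.0, 8.0, 9.0, 9.5, 10.0, 11.0, 12.0, 13.0]
estimates = [tracker.update(p) for p in powers]
```

Each update costs one comparison against the running sub-window minimum plus a scan over U buffered values, instead of a scan over all D values. In the example the low value 3.0 stays in the estimate until its sub-window leaves the buffer, after which the estimate rises to 9.0.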
The value of ĝn is finally given by:
d) Postfiltering to control the noise reduction
The noise reduction, as described above, may cause some artefacts during voice activity periods, e.g. the speech signal may be attenuated due to an overestimation of the noise component.
To counter this effect, a postfiltering is performed. In case the noise reduction is not very significant, e.g. in cases where the SNR is high and thus γc is close to 1, γc is forced to be 1. The energy Eu of a subframe, which contains e.g. 40 speech samples in a CELP codec, can be described as:
, wherein n is the summation index, gp is the adaptive code book gain, v(n) is the adaptive codebook excitation and c(n) is the fixed codebook excitation. After the noise reduction, the excitation energy
is:
The final value of γc(m) is given by:
Typically ThdB can be chosen equal to 1.
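A possible form of this postfilter is sketched below. It rests on two assumptions that are not confirmed by the text: the excitation of a subframe is taken as gp·v(n) + g·c(n) with g being the fixed gain before or after modification, and γc is reset to 1 when the energy change caused by the noise reduction is smaller than ThdB decibels. The function names and test vectors are hypothetical.

```python
import math


def excitation_energy(g_p, v, g_c, c):
    """Energy of the excitation g_p * v(n) + g_c * c(n) over one sub-frame."""
    return sum((g_p * vn + g_c * cn) ** 2 for vn, cn in zip(v, c))


def postfilter_factor(gamma_c, g_p, v, g_y, c, th_db=1.0):
    """Return 1.0 if the noise reduction changes the energy by < th_db dB.

    The comparison rule is an assumption of this sketch; the text only
    states that gamma_c is forced to 1 when the reduction is small.
    """
    e_before = excitation_energy(g_p, v, g_y, c)
    e_after = excitation_energy(g_p, v, gamma_c * g_y, c)
    if e_before == 0.0 or e_after == 0.0:
        return gamma_c  # no meaningful ratio; keep the factor unchanged
    change_db = 10.0 * math.log10(e_before / e_after)
    return 1.0 if abs(change_db) < th_db else gamma_c


# Toy sub-frame of 40 samples: no adaptive contribution, four fixed pulses.
v = [0.0] * 40
c = [1.0 if n % 10 == 0 else 0.0 for n in range(40)]
mild = postfilter_factor(0.95, 0.0, v, 1.0, c)    # about 0.45 dB change
strong = postfilter_factor(0.5, 0.0, v, 1.0, c)   # about 6 dB change
```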
a) Extracting an estimated fixed gain (ĝn) of the noise component from the acoustic signal (y(n)) for the time interval;
b) Representing the noise component for the time interval by its fixed gain (ĝn);
c) Estimating a signal to noise ratio of the speech component to the noise component for the time interval based on the signal to noise ratio of an earlier time interval and a ratio of the acoustic signal (y(n)) to the noise component in the time interval;
d) Determining a modification factor (γc) based on the estimate of the signal to noise ratio;
e) Deriving an estimate of a fixed gain (ĝs(m)) of the speech component by modifying said fixed gain of the speech component with a modification factor (γc);
f) Comparing the energy of the signal in a time interval with its estimate based on the estimated fixed gain (ĝs(m)) of the speech component;
g) Modifying the modification factor in dependence on the result of said comparison.