Field of the Invention
[0001] The invention relates to a speech coding method that applies noise reduction.
[0002] Noise reduction methods for speech processing have been developed for over forty years.
Most of these methods operate in the frequency domain. They commonly comprise
three major components:
a) a spectral analysis/synthesis system (typically a short-term windowed FFT (Fast
Fourier Transform) and IFFT (Inverse Fast Fourier Transform)),
b) a noise estimation procedure, and
c) a spectral gain computation according to a suppression rule, which is used for
suppressing the noise.
[0003] The suppression rule modifies only the spectral amplitude, not the phase. It has
been shown that there is no need to modify the phase in speech enhancement processing.
This approximation is, however, only valid for a signal-to-noise ratio (SNR) greater
than 6 dB, a condition that is assumed to be satisfied in the majority of
noise reduction algorithms.
[0004] Methods for spectral weighting noise reduction are often based on the following hypotheses:
- The noise is additive (i.e. $y(t)=s(t)+n(t)$), uncorrelated with the speech signal
and locally stationary. $s$, $n$ and $y$ represent the clean speech signal, the noise and the
noisy speech signal respectively.
- There are silence periods in the speech signal.
- The human auditory system is not sensitive to the phase of the received speech.
[0005] A scheme for processing a speech signal with noise reduction is depicted in Fig.
1. The speech component $s(p)$, where $p$ denotes a time interval, is superimposed with
a noise component $n(p)$. This results in the total signal $y(p)$. The total signal $y(p)$
undergoes an FFT. The result is a set of Fourier components $Y(p, f_k)$, where $f_k$
denotes a quantized frequency. The noise reduction NR is then applied, producing
modified Fourier components $\hat{S}(p, f_k)$. After an IFFT, this leads to a clean
speech signal estimate $\hat{s}(p)$.
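By way of illustration only, the processing chain of Fig. 1 may be sketched as follows. The sketch assumes a single short frame, a naive DFT in place of a windowed FFT, and externally supplied real-valued gains per frequency bin. The names `dft`, `idft` and `spectral_weighting` are hypothetical and do not stem from the described method. Because the gains are real and non-negative, only the amplitude of each Fourier component is scaled while its phase is retained.

```python
import cmath


def dft(frame):
    """Naive discrete Fourier transform of one frame (stand-in for the FFT)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse transform; the real part is returned as the time signal."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def spectral_weighting(frame, gains):
    """Scale each Fourier component Y(p, f_k) by a real gain and transform back.

    Assumption: `gains` holds one real, non-negative value per bin and is
    symmetric (gains[k] == gains[n - k]) so that the output remains real.
    """
    noisy_spectrum = dft(frame)
    # A real gain changes the magnitude only; the phase of each bin is kept.
    clean_spectrum_estimate = [g * y for g, y in zip(gains, noisy_spectrum)]
    return idft(clean_spectrum_estimate)


frame = [1.0, 2.0, 0.0, -1.0]
halved = spectral_weighting(frame, [0.5] * len(frame))
```

With a uniform gain of 0.5 the output is the input frame scaled by 0.5 (up to rounding), which makes the sketch easy to check; a practical suppression rule would compute a different gain for every bin.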
[0006] A problem of any spectral weighting noise reduction method is its computational complexity,
e.g. if the following steps have to be performed successively:
a) decoding
b) FFT analysis
c) speech enhancement, e.g. noise reduction
d) inverse FFT synthesis
e) encoding
[0007] The above sequence of steps is typical for classical noise reduction performed
within a communications network.
[0008] Based on the foregoing description, it is an object of the invention to provide a
noise reduction method for speech processing systems that can be implemented
with low computational effort.
[0009] This object is achieved by the subject matter disclosed in the independent claims.
Advantageous embodiments of the present invention are presented in the dependent
claims.
[0010] In a method for transmitting speech data, said speech data are encoded by using an
analysis through synthesis method. For the analysis through synthesis, a synthesised
signal is produced for approximating the original signal. The synthesised
signal is produced by using at least a fixed codebook with a respective fixed gain
and optionally an adaptive codebook and an adaptive gain. The entries of the codebooks
and the gains are chosen such that the synthesised signal resembles the original signal.
[0011] Parameters describing these quantities will be transmitted from a sender to a receiver,
e.g. from a near-end speaker to a far-end speaker or vice versa.
[0012] The invention is based on the idea of modifying the fixed gain determined for the
signal containing a noise component and a speech component. The objective of this modification
is to obtain a useful estimate of the fixed gain of the speech component, i.e. of the clean
signal.
[0013] The modification is done by subtraction of an estimate of the fixed gain of the noise
component. The fixed gain of the noise component may be derived from an analysis of
the power of the signal in a predetermined time window.
[0014] One advantage of this procedure is its low computational complexity, particularly
if the speech enhancement through noise reduction is performed independently of an encoding
/ decoding unit, e.g. at a certain position within a network. There, a noise reduction
method operating on the time-domain signal would require all the steps of decoding, FFT,
speech enhancement, IFFT and encoding to be performed one after the other. This is not necessary
for a noise reduction method based on modification of the parameters.
[0015] Another advantage is that, by applying any modification to the parameters, a repeated
encoding and decoding process, the so-called "tandeming", can be avoided, because the
modification takes place in the parameter itself. Any tandeming decreases the speech
quality. Furthermore, the delay due to the additional encoding/decoding, which is
typically 5 ms in GSM, for example, can be avoided.
[0016] Thus the parameters which are actually transmitted do not need to be transformed
into a signal in order to apply the noise reduction. The procedure is furthermore also applicable
within a communications network.
[0017] An encoding apparatus set up for performing the above described encoding method includes
at least a processing unit. The encoding apparatus may be part of a communications
device, e.g. a cellular phone, or it may be situated in a communications network
or a component thereof.
[0018] In the following, the invention is described by means of preferred embodiments
with reference to the accompanying drawings, in which:
- Fig. 1: shows a scheme of noise reduction in the frequency domain;
- Fig. 2: shows schematically the function of the AMR encoder;
- Fig. 3: shows schematically the function of the AMR decoder;
- Fig. 4: shows a scheme of a noise reduction method in the parameter domain.
1. Function of an encoder (Fig. 2)
[0019] First the function of a speech codec is described using a specific implementation of
a CELP-based codec, the AMR (Adaptive Multi-Rate) codec. The codec consists
of a multi-rate speech codec (the AMR codec can switch between the following bit rates:
12.2, 10.2, 7.95, 7.40, 6.70, 5.90, 5.15 and 4.75 kbit/s), a source-controlled
rate scheme including Voice Activity Detection (VAD), a comfort noise generation
system and an error concealment mechanism to compensate for the effects of transmission
errors.
[0020] Fig. 2 shows the scheme of the AMR encoder. It uses an LTP (long-term prediction)
filter, which is transformed into an equivalent structure called the adaptive codebook. This
codebook stores former LPC-filtered excitation signals. Instead of subtracting a long-term
prediction as the LTP filter does, an adaptive codebook search is performed to obtain an excitation
vector from former LPC-filtered speech samples. The amplitude of this excitation
is adjusted by a gain factor $g_a$.
[0021] The encoding of the speech is now described with reference to the numbers given in
Fig. 2:
1. The speech signal is processed block-wise and thus partitioned into frames and
sub-frames. Each frame is 20 ms long (160 samples at 8 kHz sampling frequency) and
is divided into 4 sub-frames of equal length.
2. LPC analysis of a Hamming-windowed frame.
3. For stability reasons, the LPC filter coefficients are transformed to Line
Spectrum Frequencies (LSF). Afterwards these coefficients are quantized in order to
save bit rate. This step and the previous one are performed once per frame (except in 12.2
kbit/s mode, where the LPC coefficients are calculated and quantized twice per frame), whereas
steps 4 to 9 are performed on a sub-frame basis.
4. The sub-frames are filtered by an LPC filter with re-transformed and quantized LSF
coefficients. Additionally, the filter is modified to improve the subjective listening
quality.
5. As the encoding is processed block by block, the decaying part of the filter response, which
is longer than the block length, has to be taken into account when processing the next sub-frame.
In order to speed up the minimization of the residual power described in the following,
the zero-input response of the synthesis filter excited by previous sub-frames is
subtracted.
6. The power of the LPC-filtered error signal $e(n)$ depends on four variables: the
excitation of the adaptive codebook, the excitation of the fixed codebook and the
respective gain factors $g_a$ and $g_f$. As no closed-form solution exists for finding the
global minimum of the power of the residual signal, all possible combinations of these four
parameters would have to be tested exhaustively. As this minimization is too complex, the problem
is divided into subproblems, which results in a suboptimal solution. First
the adaptive codebook is searched to obtain the optimal lag $L$ and gain factor $g_{a,L}$.
Afterwards the optimal excitation, scaled with the optimal gain factor, is synthesis-filtered
and subtracted from the target signal. This adaptive codebook search corresponds to
LTP filtering.
7. In a second step of the minimization problem, the fixed codebook is searched. The
search is equivalent to the previous adaptive codebook search, i.e. the codebook vector
that minimizes the error criterion is sought. Afterwards the optimal fixed
gain is determined. The resulting coding parameters are the index $J$ of the fixed codebook
vector and the optimal gain factor $g_{f,J}$.
8. The scaling factors of the codebooks are quantized jointly (except in 12.2 kbit/s
mode, where both gains are scalar quantized), resulting in a quantization index, which
is also transmitted to the decoder.
9. To complete the processing of the sub-frame, the optimal excitation signal is computed
and saved in the adaptive codebook. The synthesis filter states are also saved so
that the decaying part can be subtracted in the next sub-frame.
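The sequential search of steps 6 and 7 may be illustrated by the following simplified sketch. It assumes that the candidate vectors of both codebooks have already been passed through the synthesis filter, and it uses the least-squares gain $g = \langle x, c \rangle / \langle c, c \rangle$ for a target $x$ and a candidate $c$. Perceptual weighting, gain quantization and the actual AMR codebook structures are omitted, and all function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def best_entry(target, candidates):
    """Return (index, gain) of the candidate minimising the residual power."""
    best = None
    for index, candidate in enumerate(candidates):
        energy = dot(candidate, candidate)
        if energy == 0.0:
            continue  # an all-zero vector cannot reduce the error
        correlation = dot(target, candidate)
        gain = correlation / energy  # least-squares optimal gain
        # ||target - gain * candidate||^2 for the optimal gain
        error = dot(target, target) - gain * correlation
        if best is None or error < best[2]:
            best = (index, gain, error)
    if best is None:
        raise ValueError("no usable candidate vector")
    return best[0], best[1]


def sequential_search(target, adaptive_candidates, fixed_candidates):
    """Search the adaptive codebook first, then the fixed codebook."""
    lag_index, g_a = best_entry(target, adaptive_candidates)
    # Remove the scaled adaptive contribution from the target signal.
    remaining = [t - g_a * v for t, v in zip(target, adaptive_candidates[lag_index])]
    fixed_index, g_f = best_entry(remaining, fixed_candidates)
    return lag_index, g_a, fixed_index, g_f


result = sequential_search(
    [2.0, 1.0, 0.0, 0.0],
    [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
    [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
)
# result == (0, 2.0, 0, 1.0)
```

Splitting the search in this way replaces one joint search over all four variables by two much smaller searches, at the cost of the suboptimality mentioned in step 6.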
2. Function of a decoder (Fig. 3)
[0022] The decoder is now described with reference to Fig. 3. As shown in the previous section,
the encoder transforms the speech signal into parameters which describe the speech.
These parameters, namely the LSF (or LPC) coefficients, the lag of
the adaptive codebook, the index of the fixed codebook and the codebook gains, are referred to as
"speech coding parameters". The domain is called the "(speech) codec parameter domain",
and the signals of this domain are subscripted with the frame index $k$.
[0023] Fig. 3 shows the signal flow of the decoder. The decoder receives the speech coding
parameters and computes the excitation signal of the synthesis filter. This excitation
signal is the sum of the excitations of the fixed and adaptive codebook scaled with
their respective gain factors.
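The composition of the excitation signal can be written as $u(n) = g_a\,v(n) + g_f\,c(n)$, where $v(n)$ and $c(n)$ denote the adaptive and fixed codebook vectors of the current sub-frame. These symbols and the function name below are chosen for illustration only; the sketch omits the synthesis filter and the post-processing.

```python
def build_excitation(adaptive_vector, fixed_vector, g_a, g_f):
    """Sum of the two codebook excitations, each scaled by its gain factor."""
    if len(adaptive_vector) != len(fixed_vector):
        raise ValueError("codebook vectors must have the same length")
    return [g_a * v + g_f * c for v, c in zip(adaptive_vector, fixed_vector)]


excitation = build_excitation([1.0, 0.0, -1.0], [0.0, 2.0, 0.0], g_a=0.5, g_f=0.25)
# excitation == [0.5, 0.5, -0.5]
```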
[0024] After the synthesis-filtering is performed, the speech signal is post-processed.
3. Embodiment of a noise reduction rule
[0025] An embodiment is now described in which the fixed codebook gain of a CELP codec is
modified by a noise reduction rule such that the processed fixed codebook gain
is assumed to be noise-free. To this end, the following steps are performed, which lead
to a less noisy signal after processing.
a) A noisy signal $y(t)$ is coded by a CELP (code-excited linear prediction) codec,
e.g. the AMR codec.
b) There the signal is described or coded by the so-called 'parameters', i.e.
the fixed codebook entry, the fixed codebook gain, the adaptive codebook entry,
the adaptive codebook gain, the LPC coefficients etc.
c) By special processing, the fixed codebook gain $g_y(m)$ of the signal is extracted from these parameters.
d) A noise reduction is applied to $g_y(m)$.
d1) Accordingly, an estimate $\hat{g}_n(m)$ of the fixed gain of the noise is needed.
d2) Furthermore, a reduction rule has to be applied to the noisy fixed gain
$g_y(m)$. A clean fixed gain, i.e. an estimate $\hat{g}_s(m)$ of the speech gain, is thus obtained.
e) The coded signal is then recomputed by replacing the noisy fixed gain $g_y(m)$ with the
estimate $\hat{g}_s(m)$ of the speech gain and leaving the other codec parameters unchanged.
The resulting set of codec parameters is assumed to code a clean signal.
f) Optionally, a postfilter is applied in order to control the noise reduction rule
and to avoid artefacts stemming from the reduction rule.
g) When the encoded signal modified in this way is decoded, a clean
signal in the time domain is obtained.
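Steps c) to e) may be sketched as follows. The container `SubframeParameters` and its field names are hypothetical and do not reflect the bit-stream layout of any real codec; the reduction rule is passed in as a function so that the sketch does not presuppose a particular rule.

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple


@dataclass(frozen=True)
class SubframeParameters:
    """Hypothetical set of speech coding parameters for one sub-frame."""
    lsf: Tuple[float, ...]
    adaptive_lag: int
    adaptive_gain: float
    fixed_index: int
    fixed_gain: float  # g_y(m), determined from the noisy signal


def reduce_noise_in_parameter_domain(
    params: SubframeParameters,
    noise_gain_estimate: float,
    reduction_rule: Callable[[float, float], float],
) -> SubframeParameters:
    """Replace only the fixed gain; all other parameters stay unchanged."""
    clean_gain_estimate = reduction_rule(params.fixed_gain, noise_gain_estimate)
    return replace(params, fixed_gain=clean_gain_estimate)


noisy = SubframeParameters(
    lsf=(0.1, 0.2), adaptive_lag=40, adaptive_gain=0.5, fixed_index=7, fixed_gain=1.0
)
cleaned = reduce_noise_in_parameter_domain(noisy, 0.25, lambda g_y, g_n: g_y - g_n)
# cleaned.fixed_gain == 0.75; every other field equals the one in `noisy`
```

No decoding, FFT, IFFT or re-encoding takes place in this sketch, which reflects the low computational effort of operating directly on the parameters.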
[0026] This procedure is depicted schematically in Fig. 4. In an encoder, a (total) signal
containing clean speech, i.e. a speech component, and a noise component is encoded.
The encoding process yields a fixed gain $g_y(m)$ of the total signal. This fixed gain
$g_y(m)$ is subjected to a gain modification which is based on a noise gain
estimation. The noise gain estimation determines an estimate $\hat{g}_n(m)$ of the fixed
gain of the noise component, which is used for the gain modification. The result of the gain
modification is an estimate $\hat{g}_s(m)$ of the fixed gain of the clean speech, i.e. of
the speech component. This parameter is transmitted from
a sender to a receiver, where it is decoded. This procedure, especially
the noise reduction rule, will now be described in detail:
a) Gain subtraction
[0027] One possibility for obtaining parameters representing the clean signal is gain subtraction.
Assuming that the fixed codebook gain $g_y(m)$ of the noisy speech is the sum of the fixed
codebook gain of the clean speech and the fixed codebook gain of the noise, the fixed
codebook gain is modified accordingly:

$$\hat{g}_s(m) = g_y(m) - \hat{g}_n(m)$$

where $m$ denotes a time interval, e.g. a frame or a subframe, $\hat{g}_n(m)$ the estimate
of the fixed gain of the noise component and $\hat{g}_s(m)$ the estimate of the clean
codebook gain. The next section describes, with reference to a further embodiment, how the
estimate $\hat{g}_n(m)$ of the fixed gain of the noise component can be calculated.
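A minimal sketch of the gain subtraction for a sequence of sub-frames is given below. Clamping negative results to a floor value is an assumption of this sketch and is not taken from the description; the function name and the example values are likewise purely illustrative.

```python
def estimate_speech_gain(g_y, g_n_hat, floor=0.0):
    """Subtract the noise gain estimate from the noisy fixed gain.

    Assumption: results below `floor` are clamped so that the gain passed
    on to the decoder never becomes negative.
    """
    return max(g_y - g_n_hat, floor)


noisy_gains = [1.0, 0.5, 0.25]           # g_y(m) for three sub-frames
noise_gain_estimates = [0.25, 0.25, 0.5]  # estimated noise gain per sub-frame
speech_gain_estimates = [
    estimate_speech_gain(g_y, g_n)
    for g_y, g_n in zip(noisy_gains, noise_gain_estimates)
]
# speech_gain_estimates == [0.75, 0.25, 0.0]
```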
b) Estimation of $\hat{g}_n$
[0028] For the above described subtraction, knowledge of the estimate $\hat{g}_n$ of the
fixed gain of the noise is required. The estimation of $\hat{g}_n$ is based on the principle
of minimum statistics, wherein the short-term minimum of
the estimate of the noisy signal power $P(m)$ is searched:

with

where $\hat{\sigma}$ is the estimate of the noise power found by using the minimal value of
the noisy signal power $P$ on a window of length $D$. As a noisy speech signal contains
speech and noise, the minimum value which is present over a certain time period, and which
occurs e.g. in speech pauses, represents the noise power $\hat{\sigma}$. $\alpha_{\max}$ is a
constant; an advantageous embodiment uses, for example, $\alpha_{\max} = 0.96$.
[0029] To reduce the number of comparisons required for estimating the noise power
$\hat{\sigma}$, the window of length $D$ is divided into $U$ sub-windows of length $V$.
The minimum value in the window of length $D$ is the minimum of the set of minima
of the sub-windows. A buffer Min_I of $U$ elements contains the minima of
the last $U$ sub-windows. It is renewed each time $V$ values of $P$ have been computed: the
oldest element of the buffer is deleted and replaced by the minimum of the last $V$
values of $P$. The minimum on the window of length $D$, $\hat{\sigma}$, for each sub-frame
$m$ is the smaller of the minimum of the buffer and the last value of $P$ computed.
$\hat{\sigma}$ can be increased by a gain parameter $o_{min}$ to compensate for the bias of
the estimation. A bias might be due to a continued overestimation
of the noise, e.g. if a continually present murmuring is regarded purely as noise.
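The sub-window scheme may be sketched as follows. The class name is hypothetical, the initial buffer content (positive infinity) is an assumption, and the bias compensation by the gain parameter is omitted. The returned value follows the wording above, i.e. the smaller of the buffer minimum and the last value of $P$ computed.

```python
from collections import deque
import math


class SubWindowMinimumTracker:
    """Tracks the minimum of P over the last U sub-windows of V values each."""

    def __init__(self, num_subwindows, subwindow_length):
        self.subwindow_length = subwindow_length  # V
        # Min_I: one minimum per completed sub-window (U elements).
        # Assumption: slots hold +inf until real minima have been stored.
        self.min_buffer = deque([math.inf] * num_subwindows, maxlen=num_subwindows)
        self.current_min = math.inf  # minimum of the sub-window being filled
        self.count = 0               # values of P seen in that sub-window

    def update(self, p):
        """Feed one value of P and return the current minimum estimate."""
        self.current_min = min(self.current_min, p)
        self.count += 1
        if self.count == self.subwindow_length:
            # Appending to a full deque discards the oldest element.
            self.min_buffer.append(self.current_min)
            self.current_min = math.inf
            self.count = 0
        return min(min(self.min_buffer), p)


tracker = SubWindowMinimumTracker(num_subwindows=2, subwindow_length=2)
estimates = [tracker.update(p) for p in [5.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0, 9.0]]
# estimates == [5.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 7.0]
```

Per update, the sketch compares at most $U + 2$ values instead of scanning all $D = U \cdot V$ values of the window.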
[0030] The value of $\hat{g}_n$ is finally given by:

d) postfiltering to control the noise reduction
[0031] The noise reduction as described above may cause some artefacts during
voice activity periods, e.g. the speech signal may be attenuated due to an overestimation
of the noise component.
1. Method for encoding an acoustic signal (y(n)) containing a speech component and a
noise component by using an analysis through synthesis method, wherein for encoding
the acoustic signal a synthesised signal is compared with the acoustic signal for
a time interval, said synthesised signal being described by using a fixed codebook
and an associated fixed gain, comprising the steps:
a) Determining a fixed gain (gy(m)) of the acoustic signal (y(n)) for the time interval;
b) Extracting an estimated fixed gain (ĝn) of the noise component from the acoustic signal (y(n)) for the time interval;
c) Deriving an estimate of the fixed gain (ĝs(m)) of the speech component by subtracting said fixed gain of the noise component from
the fixed gain (gy(m)) of the acoustic signal for the time interval.
2. Method according to claim 1, wherein the synthesised signal is further described by
an adaptive codebook and an associated adaptive gain.
3. Method according to any of the previous claims, wherein the extracting in step b)
is done by searching the minimum value of a power of the signal (y(n)) in a time interval,
a part of a time interval or a set of time intervals.
4. Method according to any of the previous claims, wherein said time intervals are frames
or subframes.
5. Noise reducing apparatus with a processing unit set up for performing a method according
to any of the claims 1 to 4.
6. Communications device, in particular a mobile phone with a noise reducing apparatus
according to claim 5.
7. Communications network with a noise reducing apparatus according to claim 5.