Introduction
[0001] Acoustic echo control is a necessary component for a full-duplex hands-free telecommunication
system to eliminate undesired echo signals that result from acoustic coupling between
a loudspeaker and a microphone. When an
acoustic echo canceller (AEC) is used, an acoustic echo suppressor (AES) is usually placed
in series after the AEC to remove residual echoes, which occur due
to the constantly changing echo paths or when sudden echo path changes occur. Also,
often a noise suppressor (NS) is applied for removing stationary noise from the microphone
signal.
[0002] Recently, systems have been proposed which do not employ an AEC, but do all echo
removal using an AES [1, 2]. However, these systems still have high complexity [1],
do not efficiently address NS within their core, and are either not applicable [2] or only
inefficiently applicable [1] to stereo and multi-channel echo control.
[0003] We are proposing schemes for AES and NS which have low complexity, high robustness,
and are scalable for stereo and multi-channel audio without a need for introducing
loudspeaker signal distortion. Conventional echo control algorithms for stereo or
multi-channel audio introduce this distortion since otherwise the echo path impulse
response often cannot be estimated (the so-called "non-uniqueness" problem [3]).
Notation and Variables
[0004]
- n
- time-domain time index
- i
- STFT spectrum frequency index
- k
- STFT spectrum time index
- x(n)
- loudspeaker signal
- xl(n)
- loudspeaker signal with channel index l
- L
- Number of loudspeaker signals
- w(n)
- noise signal component in microphone signal
- y(n)
- microphone signal
- ym(n)
- microphone signal with channel index m
- M
- number of microphone signals
- e(n)
- microphone signal after echo/noise control
- em(n)
- microphone signal after echo/noise control with channel index m
- W
- STFT window size (samples)
- H
- STFT window hop size (samples)
- X(i,k)
- STFT of x(n) (defined similarly for other signals)
- | X(i,k) |
- Magnitude spectrum of x(n)
- Ŷ(i,k)
- Estimation of Y(i,k)
- G(i,k)
- Real-valued gain filter for echo/noise removal
- Gv(i,k)
- Real-valued gain filter values for coloration correction
- G1(i,k), G2(i,k)
- Real-valued gain filter values for coloration correction
- d
- delay estimate of acoustic echo path (relative to spectrum time index)
Background
[0005] In the following, commonly used techniques for AEC, AES, and NS are described.
Acoustic Echo Canceler (AEC)
[0006] Traditionally, echo cancellation is accomplished by adaptively identifying the echo
path impulse response and subtracting an estimate of the echo signal from the microphone
signal. The far-end talker signal
x(n) (loudspeaker signal) goes through the echo path, whose impulse response is modeled
as an FIR filter, and adds to the microphone signal
y(n) together with the near-end talker signal
v(n) and the ambient noise
w(n):

y(n) = hᵀ x(n) + v(n) + w(n)

where

h = [h₀, h₁, ..., h_{J-1}]ᵀ and x(n) = [x(n), x(n−1), ..., x(n−J+1)]ᵀ.

[0007] J is the length of the echo path impulse response, and
T denotes the transpose of a vector or a matrix. To cancel the echo in the microphone
signal, an echo estimate
ŷ(n) is needed, which is generated by passing the far-end talker signal through an FIR
filter

ĥ = [ĥ₀, ĥ₁, ..., ĥ_{K-1}]ᵀ

of length K (generally less than J),

ŷ(n) = ĥᵀ x(n).

[0008] The FIR filter coefficients are estimated adaptively in time. Subtracting
ŷ(n) from the microphone signal y(n) yields the error signal

e(n) = y(n) − ŷ(n)
[0009] The mean square error (MSE) can be expressed as

E{e²(n)} = E{(y(n) − ĥᵀ x(n))²}

where E{·} denotes mathematical expectation. The objective of the AEC is to estimate
an ĥ that minimizes E{e²(n)}.
[0010] There is a vast literature addressing how to search for the optimum
ĥ using adaptive techniques. Commonly used algorithms include normalized least-mean-square
(NLMS), recursive least-squares (RLS), proportionate NLMS (PNLMS), affine projection
algorithm (APA), etc.
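For illustration, the NLMS variant mentioned above can be sketched as follows. This is a minimal sketch, not the claimed method; the filter length K, step size mu, and regularization eps are illustrative assumptions.

```python
import numpy as np

def nlms_echo_canceller(x, y, K=8, mu=0.5, eps=1e-8):
    # Adaptively estimate the echo path FIR filter h_hat and return the
    # error signal e(n) = y(n) - h_hat^T x(n) together with h_hat.
    h_hat = np.zeros(K)
    e = np.zeros(len(y))
    for n in range(len(y)):
        # x_vec = [x(n), x(n-1), ..., x(n-K+1)]^T (zero-padded at the start)
        m = min(K, n + 1)
        x_vec = np.zeros(K)
        x_vec[:m] = x[n::-1][:m]
        e[n] = y[n] - h_hat @ x_vec
        # NLMS update: step size normalized by the input signal power
        h_hat += mu * e[n] * x_vec / (x_vec @ x_vec + eps)
    return e, h_hat
```

On a noiseless simulated echo path, the error signal decays toward zero as h_hat converges to the true impulse response.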
[0011] During doubletalk (when the talker at the near-end is talking, i.e. v(n) ≠ 0),
the adaptive filter coefficients are usually "frozen" to prevent the near-end signal
v(n) from negatively affecting the adaptive filter's estimate of the acoustic echo path.
For this purpose, a doubletalk detector is used.
[0012] Another solution for suppressing the echo in an audio signal was described in document
US 2004/0057574. This is achieved by computing the spectral envelopes of the loudspeaker and microphone
signals and determining the spectral envelope of the echo signal using adaptive filters.
This approach attempts to continuously estimate the time-varying spectral envelope
of the echo signal. The problem with this technique is that the adaptive filters predicting
the spectral envelopes need to re-converge every time the properties of the loudspeaker
signal change, because the echo signal spectral envelope depends not only on the loudspeaker
signal spectral envelope but also on the loudspeaker signal itself. This signal dependence
of the adaptive filters often results in the echo signal spectral envelope not being estimated
precisely enough, so that the echo is not sufficiently removed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] The invention will be better understood thanks to the attached Figures in which:
- Figure 1 shows a processing carried out in handsfree speech communication clients.
- Figure 2 shows a schematic diagram of an adaptive acoustic echo canceller.
- Figure 3 shows a block diagram of a noise suppression algorithm using noise estimation
analysis module.
- Figure 4 shows a block diagram of a hybrid echo and noise suppression algorithm using
acoustic echo and noise estimation modules.
- Figure 5 shows a block diagram of a hybrid echo/noise suppression algorithm.
- Figure 6 shows the structure of a typical room impulse response (acoustic echo path).
- Figure 7 shows a block diagram of the proposed acoustic echo suppression algorithm.
- Figure 8 shows a block diagram of the proposed acoustic echo suppression algorithm in which
the time relation between the loudspeaker and microphone signal transform windows is illustrated.
- Figure 9 shows a block diagram of the proposed acoustic echo suppression algorithm
using two loudspeaker spectra.
- Figure 10 shows a method for smoothing the gain filter for decreasing the potential
for artifacts.
- Figure 11 shows a block diagram in which the given STFT spectra, magnitude spectra
with lower frequency resolution are computed by combining the STFT magnitude spectra.
- Figure 12 shows an interpolation from the combined magnitude gain filter to obtain
the gain filter.
- Figure 13 shows a block diagram of the proposed technique for multi-channel AES by
means of combining multiple loudspeaker and microphone signal spectra to combined
spectra.
- Figure 14 shows a block diagram of the proposed technique for multi-channel NS by
means of combining multiple microphone signal spectra to combined spectra.
- Figure 15 shows a block diagram of the proposed technique for multi-channel AES.
DETAILED DESCRIPTION OF THE INVENTION
[0014] Figure 1 illustrates a handsfree speech communication client system with an AEC,
AES, and NS. The microphone signal is first processed in the Acoustic Echo Canceller
(AEC), then passed to the Acoustic Echo Suppressor (AES), and finally processed by
the Noise Suppressor (NS). It is to be noted that both the AEC and the AES need information
from the loudspeaker signal.
[0015] Figure 2 shows a general view of an adaptive echo canceller. The echo estimate
ŷ(n), obtained by filtering the loudspeaker signal x(n) with the estimated impulse
response ĥ, is subtracted from the microphone signal y(n) to produce the echo-free
microphone signal e(n).
Noise suppressor (NS)
[0016] Figure 3 illustrates a noise suppressor (NS) based on spectral modification. Note
that bold lines in the figures denote a plurality of spectral coefficients or subbands.
This is a block diagram of a noise suppression algorithm by modifying the spectral
magnitude, where STFT, NE, GFC, SM, and ISTFT stand for short-time Fourier transform
(STFT), noise estimation (NE), gain filter computation (GFC), spectral modification
(SM), and inverse short-time Fourier transform (ISTFT), respectively. Bold lines represent
a plurality of spectral coefficients or subbands. First, the incoming microphone signal,
y(n), is partitioned into successive frames. The frame length is typically selected between
10 and 40 ms. A window function (e.g. a Hann window) is applied to the signal for better
spectral estimation. Then, the short-time Fourier spectrum,
Y(i,k), is obtained by applying a
discrete short-time Fourier transform (STFT) to the windowed frame signal. Next, the magnitude (or power) spectrum of the
noise contained in the microphone signal, |
W(i,k)| is estimated. There are several commonly used methods to estimate the noise magnitude
spectrum. One possibility is to detect time periods of silence, i.e. when no speech
signal is active. In these periods one then knows that all the power has been caused
by the noise, yielding a noise magnitude (or power) spectrum estimate, |
Ŵ(i,k)|, of the noise.
[0017] In general, the noise suppressed spectrum E(i,k) can be recovered through

E(i,k) = G(i,k) Y(i,k)     (6)

where G(i,k) is the so-called (real-valued) gain filter. The gain filter is computed
as a function of the estimated noise magnitude spectrum |Ŵ(i,k)|, e.g.

G(i,k) = ( (|Y(i,k)|^α − β |Ŵ(i,k)|^α) / |Y(i,k)|^α )^(1/α)     (7)
where α is an exponent, and β is a parameter introduced to control the amount of noise
to be suppressed in case it is under (or over) estimated. Finally, the noise-suppressed
output signal
e(n) is constructed, given
E(i,k), using the overlap-add technique with inverse STFT.
[0018] The described technique is often referred to as a spectral modification technique
(or sometimes parametric Wiener filtering technique, or parametric spectral subtraction).
It has been widely adopted for the purpose of additive noise suppression and speech
enhancement.
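As a sketch of the parametric spectral-subtraction gain described above (the standard form of gain filter (7)), the following illustrative Python helper may be considered. The gain floor g_min is an added assumption, a common safeguard against musical-noise artifacts, and is not part of the original description.

```python
import numpy as np

def ns_gain_filter(Y_mag, W_hat_mag, alpha=2.0, beta=1.0, g_min=0.1):
    # Parametric spectral subtraction:
    # G = ((|Y|^alpha - beta*|W_hat|^alpha) / |Y|^alpha)^(1/alpha),
    # clipped to be non-negative and floored at g_min (assumption).
    num = np.maximum(Y_mag**alpha - beta * W_hat_mag**alpha, 0.0)
    G = (num / np.maximum(Y_mag**alpha, 1e-12))**(1.0 / alpha)
    return np.maximum(G, g_min)
```

Bins dominated by noise are driven to the floor, while noise-free bins pass through with gain close to one.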
Acoustic echo suppressor (AES)
[0019] An acoustic echo suppressor (AES) is illustrated in Figure 4. This is a block diagram
of an echo suppression algorithm by modifying the spectral magnitude, where STFT,
AEE, GFC, SM, and ISTFT stand for short-time Fourier transform, acoustic echo estimation,
gain filter computation, spectral modification, and inverse short-time Fourier transform,
respectively. Bold lines represent a plurality of spectral coefficients or subbands.
The AES is very similar to the NS, except that an acoustic echo spectrum, rather than
a noise spectrum, is estimated and removed.
[0020] The short-time Fourier spectra of the loudspeaker and microphone signal are obtained
by applying STFT to the windowed corresponding time-domain signals. Next, the echo
components are estimated by modeling the echo path and filtering the loudspeaker signal
with the estimated echo path. The gain filter is then computed based on the estimated
spectral magnitude spectra (or power spectra) of both the estimated echo signal |
Ŷ(
i,
k)| and the microphone signal |
Y(
i,
k)|. (Note that as opposed to carrying out the echo spectrum estimation directly in
the frequency domain, one could also apply time-domain processing similar to AEC and
then apply STFT for obtaining short-term spectra,
Ŷ(
i,
k), of the echo signal). Similar formulas as for NS can be used for computing the output
signal, i.e. (6) with a gain filter

G(i,k) = ( (|Y(i,k)|^α − β |Ŷ(i,k)|^α) / |Y(i,k)|^α )^(1/α)     (8)
[0021] Finally, the echo-suppressed output signal
e(n) is constructed, given
E(i,k), using the overlap-add technique with inverse STFT. As with an AEC, the AES uses
a doubletalk detector to prevent the acoustic echo path estimate from diverging
during doubletalk.
[0022] The weakness of the AES as described is that the estimation of the echo (magnitude)
spectrum is highly computationally complex. Furthermore, the echo (magnitude) spectrum
is rather sensitive to echo path changes, often causing the problem of "residual echoes".
Note that this problem is less pronounced in the AES than in the AEC, since the phase of the
echo estimate is not used. However, echo path changes influencing the magnitude spectrum
result in residual echoes in the AES as well.
[0023] In the following, different aspects of the invention are described. This includes
hybrid AES/NS processing, improved AES processing, multi-channel AES and NS processing,
and how all these parts are combined into a powerful, scalable AES and NS system. This
is achieved through the method as claimed in claim 1.
Hybrid echo and noise suppression
[0024] One aspect of the invention addresses efficient combination of AES and NS. Figure
5 illustrates a straightforward and efficient way of combining an AES with a NS.
This figure illustrates a block diagram of a hybrid echo/noise suppression algorithm
by modifying the spectral magnitude, where STFT, AEE, NE, GFC, SM, and ISTFT stand
for short-time Fourier transform, acoustic echo estimation, noise estimation, gain
filter computation, spectral modification, and inverse short-time Fourier transform,
respectively. Bold lines represent a plurality of spectral coefficients or subbands.
The gain filter is computed as a function of the estimated echo magnitude spectrum
|
Ŷ(
i,
k)| and estimated noise magnitude spectrum |
Ŵ(i,k)|, e.g.

G(i,k) = ( (|Y(i,k)|^α − β₁ |Ŷ(i,k)|^α − β₂ |Ŵ(i,k)|^α) / |Y(i,k)|^α )^(1/α)     (9)

where the constants β₁ and β₂ can be used for controlling the strength of the echo
and noise suppression.
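The joint gain filter (9) can be sketched as follows, as a minimal illustration of the hybrid approach; the standard parametric-subtraction form and the floor g_min are assumptions, not reproduced from the original (image) equation.

```python
import numpy as np

def hybrid_gain(Y_mag, Yhat_mag, What_mag,
                alpha=2.0, beta1=2.0, beta2=1.0, g_min=0.1):
    # One gain filter jointly attenuating the estimated echo (weight beta1)
    # and the estimated noise (weight beta2) in each spectral bin.
    num = np.maximum(Y_mag**alpha - beta1 * Yhat_mag**alpha
                     - beta2 * What_mag**alpha, 0.0)
    G = (num / np.maximum(Y_mag**alpha, 1e-12))**(1.0 / alpha)
    return np.maximum(G, g_min)
```

A single spectral modification pass then removes both echo and noise, which is what makes the hybrid structure efficient.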
Echo suppressor estimating delay and coloration correction values
[0025] Another aspect of the invention addresses the weaknesses of the conventional AES
in obtaining an estimate of the echo (magnitude) spectra. The acoustic echo path itself
is not estimated; instead, merely a global delay parameter and a filter characterizing the
coloration effect of (the early part of) the acoustic echo path are estimated. This
representation (delay and coloration correction values) is largely insensitive to
acoustic echo path changes and is thus more robust than conventional methods which
estimate the acoustic echo path. Additionally, the computational complexity is much
lower as will be explained.
[0026] In audio processing, coloration usually means that some frequency ranges
are attenuated while other frequencies are not attenuated or are amplified. This is called
"coloration" because such audio signals are perceived as being "colored". For echo
suppression, it is important to know which frequencies are attenuated, not modified,
or amplified by the echo path. Given this information and delay information the echo
signal can be suppressed. A room impulse response (the acoustic echo path) usually
features the direct sound (sound that travels directly from the loudspeaker to the
microphone), followed by a few early reflections, followed by a tail with late reflections
with high density. Figure 6 illustrates the structure of a typical room impulse response
(acoustic echo path). The direct sound is followed by the early reflections and the
late reflections. The direct sound and the early reflections have a coloration effect
on the audio signal. The densely spaced late reflections hardly color the signal,
if at all. Thus, for obtaining the information for an effective echo suppression gain
filter it is enough to consider only the direct sound and the early reflections.
[0027] The problem of the prior art solution is circumvented by not attempting to directly
estimate the time varying spectral envelope of the echo signal. Rather, properties
of the acoustic echo path, which are expected to only vary slowly in time, are estimated.
An overall delay between the loudspeaker and microphone signal is estimated. Furthermore,
the coloration effect of the (early) acoustic echo path impulse response is estimated
in the form of "coloration correction values". The coloration correction values are
applied to the delayed loudspeaker signal to compute a coloration corrected loudspeaker
signal. A gain filter is computed by using information of the coloration corrected
loudspeaker signal and microphone signal. This time varying gain filter is applied
to the microphone signal to remove the echo.
[0028] A specific implementation of the proposed scheme is illustrated in Figure 7. The
Delay and CE blocks estimate a delay and coloration correction values modeling the
effect of the acoustic echo path on the loudspeaker signal. Given this information,
the gain filter is computed such that it removes the undesired echo. Bold lines represent
a plurality of spectral coefficients or subbands. The STFT spectra of the loudspeaker
and microphone signal are computed. Then the delay d and the coloration correction values
Gv(i,k) (real-valued gain filter values) mimicking the effect of the early echo path are computed.
As an approximate echo magnitude spectrum estimate, the estimated delay and coloration
correction values are applied to the loudspeaker signal spectra,

|Ŷ(i,k)| = Gv(i,k) |X(i,k−d)|     (10)
[0029] Note that this is not a precise echo spectrum or magnitude spectrum estimate, but
it contains the information necessary for applying echo suppression. The gain filter,
G(i,k), is computed by using (8) or (9) with (10). Then, the echo suppressed output signal
is computed by applying the gain filter (6) and applying the inverse STFT with overlap
add. Similar to an AEC or conventional AES, a doubletalk detector is used, i.e. the
delay and coloration correction values are not updated during periods of doubletalk.
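The per-frame computation, the delayed, coloration-corrected echo magnitude estimate (10) followed by the suppression rule (8), can be sketched as follows. This is illustrative only; the default parameter values and the floor g_min are assumptions.

```python
import numpy as np

def aes_gain(X_mag_hist, Y_mag, Gv, d, alpha=2.0, beta=2.0, g_min=0.1):
    # X_mag_hist: list of past loudspeaker magnitude spectra, newest last,
    # so X_mag_hist[-1 - d] is |X(i, k-d)|.
    Y_hat_mag = Gv * X_mag_hist[-1 - d]   # delayed, coloration-corrected estimate
    # Spectral-suppression gain computed from |Y| and the echo estimate
    num = np.maximum(Y_mag**alpha - beta * Y_hat_mag**alpha, 0.0)
    G = (num / np.maximum(Y_mag**alpha, 1e-12))**(1.0 / alpha)
    return np.maximum(G, g_min)
```

Bins explained by the echo estimate are pushed to the floor, while bins carrying near-end energy keep a gain close to one.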
[0030] Note that as opposed to a STFT, any other filterbank or time-frequency transform
may be used, which transforms a signal into a plurality of subbands or spectral coefficients.
[0031] The described processing is illustrated with more detail in Figure 8. This is a block
diagram of the proposed acoustic echo suppression algorithm. The Delay and CE blocks
estimate a delay and coloration correction values modeling the effect of the acoustic
echo path on the loudspeaker signal. Given this information, the gain filter is computed
such that it removes the undesired echo. Bold lines represent a plurality of spectral
coefficients or subbands. The loudspeaker signal, x(n), and microphone signal, y(n),
are shown. Example positions of the STFT windows are indicated. These windows are
dH (H = STFT window hop size) time-domain samples apart, where d is the delay estimate.
Applying the STFT to these windowed signal portions yields the spectra
X(i,k-
d) and
Y(i,k). The coloration correction estimation (CE in figure) yields the coloration correction
values
Gv(i,k), mimicking the coloration effect of the acoustic echo path. The gain filter,
G(i,k), which is applied to the microphone signal spectra,
Y(i,k), is computed as a function of
X(i,k-d), Gv(i,k), and
Y(i,k) e.g. by (8) or (9) with (10). Finally, the gain filter is applied to the microphone
spectra (SM in the figure) before the inverse STFT is applied with overlap add.
[0032] The delay d is either set to a fixed constant value or estimated adaptively in time
(if the delay of the direct sound from loudspeaker to microphone is not known, or
if the time-relationship of the loudspeaker signal and microphone signal is not known).
The coloration correction estimation values,
Gv(i,k), are estimated adaptively in time. An example on how to estimate delay and coloration
correction values is given next.
[0033] Adaptive estimation of the delay d and coloration correction: In the following, a
possibility is described for adaptively estimating in time the delay and coloration
correction to be applied to the loudspeaker spectrum (10). For each spectral coefficient
or subband, a normalized cross-correlation function is computed. The normalized cross-correlation
function is

Φ(i,b) = E{|X(i,k−b)| |Y(i,k)|} / √( E{|X(i,k−b)|²} E{|Y(i,k)|²} )     (11)

where b is the time lag index. Since the delay between x(n) and y(n) possibly varies
in time, a short-time estimate of (11) is computed by

Φ(i,b,k) = a12(i,b,k) / √( a11(i,b,k) a22(i,b,k) )     (12)

where

a11(i,b,k) = ε |X(i,k−b)|² + (1−ε) a11(i,b,k−1)
a12(i,b,k) = ε |X(i,k−b)| |Y(i,k)| + (1−ε) a12(i,b,k−1)     (13)
a22(i,b,k) = ε |Y(i,k)|² + (1−ε) a22(i,b,k−1)

and ε ∈ [0,1] determines the time-constant T in seconds of the exponentially decaying
estimation window,

T = 1 / (ε fs)     (14)

where fs denotes the STFT spectrum sampling frequency.
[0034] Given the normalized cross-correlation functions (12), the delay for each spectral
coefficient or subband is

d(i) = argmax_b Φ(i,b,k)     (15)

[0035] Note that in practice it is often enough to use a single delay value applied to the
full signal. We use a single delay, computed by averaging d(i) over the frequency index
i. The coloration correction values are for example computed as

Gv(i,k) = a12(i,d,k) / a11(i,d,k)     (16)
[0036] More generally speaking, Gv(i,k) is computed such that the delay and coloration
corrected loudspeaker magnitude spectrum (10) approximates the microphone signal magnitude
spectrum in a least mean square error sense (during time periods when no doubletalk is
present; i.e., during doubletalk the estimates (13) are not updated).
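The recursive averaging (13), the normalized cross-correlation (12), the averaged delay, and the least-squares coloration gain Gv can be sketched together as follows. Class and variable names are illustrative, and the closed form Gv = a12/a11 is an assumption consistent with the stated least-mean-square criterion.

```python
import numpy as np

class DelayColorationEstimator:
    # Tracks exponentially averaged correlations between delayed loudspeaker
    # and microphone magnitude spectra for lags 0..max_lag-1.
    def __init__(self, n_bins, max_lag, eps=0.05):
        self.eps = eps
        self.a11 = np.full((n_bins, max_lag), 1e-6)
        self.a12 = np.zeros((n_bins, max_lag))
        self.a22 = np.full((n_bins, max_lag), 1e-6)
        self.x_hist = [np.zeros(n_bins) for _ in range(max_lag)]

    def update(self, X_mag, Y_mag):
        self.x_hist.insert(0, X_mag)          # x_hist[b] = |X(i, k-b)|
        self.x_hist.pop()
        e = self.eps
        for b, Xb in enumerate(self.x_hist):  # recursive averages, cf. (13)
            self.a11[:, b] = e * Xb**2 + (1 - e) * self.a11[:, b]
            self.a12[:, b] = e * Xb * Y_mag + (1 - e) * self.a12[:, b]
            self.a22[:, b] = e * Y_mag**2 + (1 - e) * self.a22[:, b]
        phi = self.a12 / np.sqrt(self.a11 * self.a22)      # cf. (12)
        d = int(np.round(np.argmax(phi, axis=1).mean()))   # single averaged delay
        Gv = self.a12[:, d] / self.a11[:, d]               # least-squares gain
        return d, Gv
```

Feeding it a loudspeaker signal and a microphone signal that is a delayed, attenuated copy recovers both the delay and the attenuation.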
[0037] Improving echo suppression performance: Since the late reflections are ignored, the
echo (10) will often be underestimated. This can be compensated in the gain filter
computation, i.e. by setting the β parameter in (8) or the β₁ parameter in (9) to a
value larger than one.
[0038] To make the echo suppression more effective, we use time smoothing of the gain
filter, such that fully suppressed signal parts have the effect that the following
few milliseconds of signal are also suppressed. This is achieved by smoothing
the gain filter over time in the logarithmic domain, i.e.

G̃(i,k) = ε log G(i,k) + (1−ε) G̃(i,k−1)     (17)

and the effectively applied gain filter is

Gs(i,k) = e^{G̃(i,k)}     (18)

[0039] The parameter ε determines the time-constant Tsmooth in seconds of the smoothing,

Tsmooth = 1 / (ε fs)     (19)

where fs denotes the STFT spectrum sampling frequency. We use Tsmooth = 10 milliseconds.
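A minimal sketch of this log-domain time smoothing follows. The relation ε = 1/(Tsmooth · fs) between the smoothing factor and the time constant, the log floor, and the example spectrum rate fs_spec = 125 Hz (16 kHz / hop 128) are illustrative assumptions.

```python
import numpy as np

def smooth_gain_log(G_prev_log, G, T_smooth=0.010, fs_spec=125.0):
    # First-order recursive smoothing of the gain filter in the log domain.
    # Returns the new log-domain state and the effectively applied gain.
    eps = min(1.0, 1.0 / (T_smooth * fs_spec))  # assumed time-constant relation
    G_log = eps * np.log(np.maximum(G, 1e-6)) + (1 - eps) * G_prev_log
    return G_log, np.exp(G_log)
```

Because the state lives in the log domain, a deeply suppressed frame keeps the applied gain below one for the next few frames even if the instantaneous gain jumps back to one.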
[0040] Alternative implementation: Since, for reasons of computational complexity
and practicality, the STFT is usually used with significant downsampling (e.g. we are using
a window hop size equal to half the window size), the time resolution in the spectral
domain is too low for applying (10) effectively (e.g. the true delay may lie between
two STFT spectra in time). To circumvent this problem, the ideal scenario (without
this time resolution limitation) is approximated by

|Ŷ(i,k)| = ( (G1(i,k) |X(i,k−d)|)^δ + (G2(i,k) |X(i,k−d−1)|)^δ )^(1/δ)     (20)

where d is a delay (with granularity as low as the spectrum time index k), δ is a constant
influencing how the magnitude spectra are added (we use δ = 2), and
G1(i,k) and
G2(i,k) are real-valued coloration correction gain filters. We found that using two coloration
correction gain filters is enough; however, there may be scenarios where it would
be desirable to use more than two coloration correction gain filters (e.g. when the
STFT window size is so small that a single gain filter cannot capture most of the
coloration effect of the acoustic echo path).
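The combination rule (20) itself is a one-liner; the sketch below is illustrative and treats the two coloration gain filters and the two delayed loudspeaker magnitude spectra as given.

```python
import numpy as np

def echo_mag_two_filters(X_d, X_d1, G1, G2, delta=2.0):
    # Combine two successive delayed loudspeaker magnitude spectra,
    # each shaped by its own coloration correction gain filter (cf. (20)).
    return ((G1 * X_d)**delta + (G2 * X_d1)**delta)**(1.0 / delta)
```

With delta = 2 the two contributions add in a power-like fashion; with delta = 1 the magnitudes add linearly, which is the form assumed later for the adaptive-filter estimation.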
[0041] Figure 9 illustrates the alternative implementation (20) in detail. This is a block
diagram of the proposed acoustic echo suppression algorithm. The Delay and CE blocks
estimate the delay and coloration correction values mimicking the effect of the acoustic
echo path on the loudspeaker signals. Given this information, the gain filter is computed
such that it removes the undesired echo. Bold lines represent a plurality of spectral
coefficients or subbands. The loudspeaker signal, x(n), and microphone signal, y(n),
are shown. Example positions of the STFT windows are indicated. These windows are
dH (H = STFT window hop size) time-domain samples apart. Applying the STFT to these windowed
signal portions yields
X(i,k-d) and
Y(i,k). Additionally, a further by one delayed loudspeaker spectrum,
X(i,k-d-1) is used (possibly more successively delayed spectra may be used). The coloration
correction estimation (CE in figure) yields the real-valued gain filters,
G1(i,k) and
G2(i,k), mimicking the coloration effect of the acoustic echo path on the loudspeaker signal.
The gain filter,
G(i,k), which is applied to the microphone signal spectra for echo removal,
Y(i,k), is computed as a function of
X(i,k-d), X(i,k-d-1), G1(i,k), G2(i,k), and
Y(i,k) , by e.g. (8) or (9) with (20). Finally, the gain filter is applied to the microphone
spectra (6) (SM in the figure) before the inverse STFT is applied with overlap add.
[0042] Adaptive estimation of the delay d and coloration correction for the alternative
implementation: Given the normalized cross-correlation functions (12), the two delays with
the largest cross-correlation are computed,

d1(i) = argmax_b Φ(i,b,k),   d2(i) = argmax_{b≠d1(i)} Φ(i,b,k)     (21)

[0043] The delay is then computed as d(i) = min{d1(i), d2(i)}. The coloration correction
values are computed such that the delayed and coloration corrected loudspeaker magnitude
spectrum (20) approximates the microphone signal magnitude spectrum in a least mean square
error sense, e.g. by

G1(i,k) = (a22 a1y − a12 a2y) / (a11 a22 − a12²)
G2(i,k) = (a11 a2y − a12 a1y) / (a11 a22 − a12²)     (22)

where a11, a12, and a22 are computed according to (13), here between the delayed loudspeaker
magnitudes |X(i,k−d)| and |X(i,k−d−1)|, and

a1y(i,k) = ε |X(i,k−d)| |Y(i,k)| + (1−ε) a1y(i,k−1)
a2y(i,k) = ε |X(i,k−d−1)| |Y(i,k)| + (1−ε) a2y(i,k−1)     (23)
[0044] Here we have assumed δ =1 in (20). Similar equations could be written for different
values for δ.
Alternative estimation of the coloration correction factors G1(i,k) and G2(i,k):
[0045] The gain filters, G1(i,k) and G2(i,k), can be estimated using adaptive filters.
For this purpose, (20) is re-written as

|Ŷ(i,k)| = G1(i,k) |X(i,k−d)| + G2(i,k) |X(i,k−d−1)|     (24)

[0046] The mean square estimation error that is minimized is

E{ (|Y(i,k)| − |Ŷ(i,k)|)² }     (25)
[0047] Adaptive filter algorithms are used for each
i (
i = STFT spectral coefficient index) to adapt
G1(i,k) and
G2(i,k) such that the mean square error (25) is minimized. Suitable algorithms include normalized
least-mean-square (NLMS), recursive least-squares (RLS), proportionate NLMS (PNLMS),
affine projection algorithm (APA), etc.
[0048] Improving the perceptual quality: When applying the previously described techniques
for computing the gain filter for removing the echo, G(i,k), the near-end speech signal
is likely to be distorted during doubletalk. It is a well-known fact that spectral
suppression causes artifacts, some of which are called "musical tone" artifacts.
[0049] To reduce the artifacts occurring from applying the gain filter,
G(i,k), to the microphone signal spectra,
Y(i,k), the gain filter is smoothed as a function of frequency. This is motivated by the
fact that the auditory system integrates signals over frequency, i.e. it evaluates the
loudness of a signal with only limited frequency resolution. The frequency resolution
of the auditory system is lower at high frequencies than at low frequencies. Thus,
more smoothing can be applied at high frequencies. An example for
this smoothing is illustrated in Figure 10, where the dashed line is the non-smoothed
gain filter and the solid line is the smoothed gain filter. We implemented the smoothing
by applying a lowpass filter along frequency (
i) to
G(i,k).
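A lowpass along the frequency index with a window that widens toward high frequencies can be sketched as below; the specific width rule (half-width growing as i/8) is an illustrative assumption, not a value from the text.

```python
import numpy as np

def smooth_gain_over_freq(G):
    # Moving average along frequency; the averaging window widens with the
    # bin index i, mirroring the ear's coarser resolution at high frequencies.
    out = np.empty_like(G)
    n = len(G)
    for i in range(n):
        half = max(1, i // 8)                 # assumed width rule
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out[i] = G[lo:hi].mean()
    return out
```

A constant gain is preserved exactly, while isolated gain spikes, a typical precursor of musical tones, are flattened.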
[0050] Reducing computational complexity : Considering that the gain filter is smoothed,
the coloration correction values,
Gv(i,k) or
G1(i,k) and
G2(i,k), can directly be computed with lower frequency resolution (as opposed to carrying
out the estimation for each spectrum index
i). This is achieved by using magnitude (or power) spectra with lower frequency resolution
than the STFT. Figure 11 illustrates how magnitude spectra are computed with lower
frequency resolution. First the time-domain signal,
x(n), is converted to the short-time spectral domain,
X(i,k), by applying windowing and the STFT. The spectral coefficients are then grouped and
magnitude combiners (MC in the figure) are used for computing a combined magnitude
given the spectral coefficients of a group,

|X̃(o,k)| = ( Σ_{i=i01}^{i02} |X(i,k)|^κ )^(1/κ)     (26)

where i01 ≤ i ≤ i02 are the STFT spectral coefficient indices belonging to the group
with index o, and κ determines how the spectra are combined. We are using κ = 2, i.e.
we are adding the power values of the spectral coefficients.
[0051] The described processing is applied to X(i,k) and Y(i,k), and the echo estimations,
coloration correction values, and/or noise estimations are carried out using the reduced
number of magnitude spectral coefficients X̃(o,k) of (26). Also the gain filter is
computed using the reduced number of coefficients.
Before applying the gain filter to the microphone signal spectra for echo/noise removal,
the fewer gain filter coefficients are expanded such that for each STFT coefficient
a gain filter coefficient is available. This expansion is illustrated in Figure 12.
Given are the combined magnitude gain filter estimates (squares in the figure),
G̃(
o,
k), and by applying interpolation between these values (we use linear interpolation)
the gain filter,
G(i,k) (dashed line in the figure) is obtained.
[0052] For a sampling frequency of 16 kHz our system uses an STFT window size W = 256 samples
and a window hop size H = 128 samples. This results in I = 129 STFT spectral coefficients
(ignoring the symmetric negative frequencies).
[0053] By using (26), we compute Ĩ = 18 combined magnitudes. Thus, the computational
complexity is significantly reduced.
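The grouping (26) and the expansion of the coarse gain filter back to the full STFT resolution (Figure 12) can be sketched as follows; the group-edge representation and the placement of group centers are illustrative assumptions, and linear interpolation follows the text.

```python
import numpy as np

def combine_magnitudes(X_mag, edges, kappa=2.0):
    # Per-group combined magnitude, cf. (26): with kappa = 2 the power
    # values of the grouped spectral coefficients are added.
    return np.array([(X_mag[a:b]**kappa).sum()**(1.0 / kappa)
                     for a, b in zip(edges[:-1], edges[1:])])

def expand_gains(G_comb, edges, n_bins):
    # Expand the per-group gains to one gain per STFT bin by linear
    # interpolation between the group centers (cf. Figure 12).
    centers = [(a + b - 1) / 2.0 for a, b in zip(edges[:-1], edges[1:])]
    return np.interp(np.arange(n_bins), centers, G_comb)
```

Estimation then runs on the few combined bands (e.g. 18 instead of 129), and only the final gain filter is brought back to full resolution.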
Multi-channel acoustic echo suppressor
[0054] Another aspect of the invention enables application of mono AES principles to the
case when multiple loudspeakers and microphones are used for multi-channel tele-conferencing.
As opposed to conventional AEC or AES, there is no need for distorting the loudspeaker
signals. Since not the acoustic echo paths are explicitly estimated, there is also
no non-uniqueness problem and thus no need for distorting the loudspeaker signals.
[0055] We observed that the coloration correction estimation of echo paths from two loudspeakers
in a desktop stereo system (e.g. personal computer with two loudspeakers) are quite
similar for the left and right loudspeakers (if there is reasonable symmetry, i.e.
if the microphone is not placed much closer to one loudspeaker than the other). We
did this experiment with an omnidirectional microphone. If directional microphones
are used, the coloration effect is still similar, but the overall gain depends on
the direction at which the loudspeaker is located. Often stereo microphones are designed
such that the sum of left and right has an omnidirectional property (i.e. the gain
the of left and right sum does not depend on direction). This omnidirectional property
of the sum signal, motivated us to combine the left and right microphone signals to
one and treat it the same as a single microphone signal for gain filter,
G(i,k), computation. Taking this idea further, we also tried combining the loudspeaker
signals into a single signal for gain filter computation. In the end, we had a system
effectively using mono gain filter computation applicable for multi-channel AES. In
the following, we describe this processing in detail.
[0056] Figure 13 shows how the proposed scheme for mono AES is extended for multi-channel
acoustic echo suppression. A loudspeaker signal combiner (LSC) and microphone signal
combiner (MSC) generate a single composite loudspeaker and microphone signal magnitude
spectrum, respectively. A mono acoustic echo suppression (AES) algorithm is operated
using the composite signal spectra. The so-obtained gain filter
G(i,k) is applied individually to each microphone signal channel. Bold lines represent a
plurality of spectral coefficients or subbands. Note that the AEE block in the figure
corresponds to a method for estimating an echo signal spectrum, possibly by applying
a delay and coloration correction to the loudspeaker signal, i.e. (10) or (20), although
another method could also be used. A loudspeaker signal combiner (LSC) combines
the loudspeaker signal spectra and generates a "combined" magnitude spectrum |X(i,k)|.
The loudspeaker signals are combined as

|X(i,k)| = ( Σ_{l=1}^{L} gxl |Xl(i,k)|^θ )^(1/θ)     (27)

where θ controls the combination process and gxl are weighting factors for each signal.
We use θ = 2 and gxl = 1.
[0057] Similarly, a microphone signal combiner (MSC) combines the microphone signal spectra,

|Y(i,k)| = ( Σ_{m=1}^{M} gym |Ym(i,k)|^λ )^(1/λ)     (28)

where λ controls the combination process and gym are weighting factors. We use λ = 2 and
gym = 1. An alternative method for combining the microphone signals is

|Y(i,k)| = | Σ_{m=1}^{M} gym Ym(i,k) |     (29)
[0058] The weighting factors, gxl and gym, are not all chosen the same if the system is
not close to symmetric, i.e. if one loudspeaker is much closer to the microphones than
the other loudspeakers.
[0059] Given the combined magnitude spectra, |
X(i,k)| and |
Y(i,k)|, the gain filter,
G(i,k) is computed similarly as in the mono AES case, as illustrated in Figure 13. That
is, the echo magnitude spectrum |
Ŷ(
i,
k)| is estimated and the gain filter
G(i,k) is computed, e.g. by (8) or (9). Spectral modification is then applied to each of
the microphone signals 1 ≤ m ≤ M individually, using the same gain filter G(i,k),

Em(i,k) = G(i,k) Ym(i,k)     (30)
[0060] The echo suppressed output signals
em(n) are obtained by applying the inverse STFT with overlap add to
Em(i,k).
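The channel combination with exponent θ (or λ) and the application of the single gain filter to every microphone channel can be sketched together; this is an illustrative reading with the text's choices (exponent 2, unit weights) as defaults.

```python
import numpy as np

def combine_spectra(mags, weights=None, p=2.0):
    # Combine per-channel magnitude spectra into one composite spectrum:
    # (sum_c g_c |X_c|^p)^(1/p); p = 2 and g_c = 1 follow the text.
    mags = np.asarray(mags)
    if weights is None:
        weights = np.ones(len(mags))
    return np.tensordot(weights, mags**p, axes=1)**(1.0 / p)

def apply_gain_all_channels(G, Y_channels):
    # The single gain filter modifies every microphone channel's spectrum
    # identically, so inter-channel (spatial) cues are preserved.
    return [G * Ym for Ym in Y_channels]
```

Because one mono gain computation serves all channels, the scheme scales to multi-channel setups without the non-uniqueness problem.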
Multi-channel noise suppressor
[0061] Another aspect of the invention is that similar processing as described for multi-channel
AES is also applied to NS, for doing multi-channel NS. Mono NS gain filter computation
applied for multi-channel NS is illustrated in Figure 14. A microphone signal combiner
(MSC) is used to generate a single composite microphone signal magnitude spectrum
and generates a "combined" magnitude spectrum |
Y(i,k)|. A mono noise suppression (NS) algorithm is operated using the composite spectra
to compute the gain filter
G(i,k), which is applied individually to each microphone signal channel. Bold lines represent
a plurality of spectral coefficients or subbands. The microphone signals are combined
as

|Y(i,k)| = ( Σ_{m=1}^{M} gym |Ym(i,k)|^λ )^(1/λ)     (31)

where λ controls the combination process and gym are weighting factors. We use λ = 2 and
gym = 1. An alternative method for combining the microphone signals is

|Y(i,k)| = | Σ_{m=1}^{M} gym Ym(i,k) |     (32)
[0062] Given the combined magnitude spectra, |
Y(i,k)|, the gain filter,
G(i,k), is computed similarly as in the mono NS case, as illustrated in Figure 14. That
is, the noise magnitude spectrum
Ŵ(
i,
k) is estimated and the gain filter
G(i,k) is computed, e.g. by (7). Spectral modification is then applied to each of the
microphone signals 1 ≤ m ≤ M individually, using the same gain filter G(i,k),

Em(i,k) = G(i,k) Ym(i,k)     (33)
[0063] The noise suppressed output signals
em(n) are obtained by applying the inverse STFT with overlap add to
Em(i,k).
Putting it all together
[0064] All the previously described aspects of the invention are combined to a scalable
system for AES and NS, illustrated in Figure 15 showing a block diagram of the proposed
technique for multi-channel AES. A loudspeaker signal combiner (LSC) and microphone
signal combiner (MSC) generate a single composite loudspeaker and microphone signal
magnitude spectrum, respectively. A hybrid echo/noise suppression algorithm is operated
using the composite signal spectra. The so-obtained gain filter
G(i,k) is applied individually to each microphone signal channel. Bold lines represent a
plurality of spectral coefficients or subbands. The gain filter, applied to all microphone
signals, is computed (9) as a function of the combined microphone signal |
Y(i,k)|, the echo estimate |
Ŷ(
i,
k)|, e.g. (10) or (20), and noise estimate |
Ŵ(
i,
k)|. For low computational complexity, the previously described method for reducing
the number of spectral magnitudes is used.
References
[0065]
[1] C. Avendano, "Acoustic echo suppression in the STFT domain," in Proc. IEEE Workshop
on Appl. of Sig. Proc. to Audio and Acoustics, Oct. 2001
[2] C. Faller and J. Chen, "Suppressing acoustic echo in a sampled auditory envelope space,"
in Proc. IEEE Workshop on Appl. of Sig. Proc. to Audio and Acoustics, Aug. 2003
[3] M. M. Sondhi, D. R. Morgan, and J. L. Hall, "Stereophonic acoustic echo cancellation -
an overview of the fundamental problem," IEEE Signal Processing Lett., vol. 2, pp.
148-151, Aug. 1995