FIELD OF THE INVENTION
[0001] The invention relates to an apparatus and a method for generating audio output signals,
and in particular, but not exclusively, to generating decorrelated audio signals for
e.g., speech signals.
BACKGROUND OF THE INVENTION
[0002] Capturing audio, and in particular speech, has become increasingly important in
recent decades. For example, capturing speech or other audio has become increasingly
important for a variety of applications including telecommunication, teleconferencing,
gaming, audio user interfaces, etc. However, a problem in many scenarios and applications
is that the desired audio source is typically not the only audio source in the environment.
Rather, in typical audio environments there are many other audio/noise sources which
are being captured by the microphone. Audio processing is often desired to improve
the capture of audio, and in particular to post-process the captured audio to improve
the resulting audio signals. For example, speech enhancement, beamforming,
noise attenuation and other functions are often used.
[0003] In many embodiments, audio may be represented by a plurality of different audio signals
that reflect the same audio scene or environment. In particular, in many practical
applications, audio is captured by a plurality of microphones at different positions.
For example, a linear array of a plurality of microphones is often used to capture
audio in an environment, such as in a room. The use of multiple microphones allows
spatial information of the audio to be captured. Many different applications may exploit
such spatial information allowing improved and/or new services.
[0004] However, an issue that in practice may tend to reduce the capture performance
is that different audio sources tend to be captured differently by the different
microphones depending on the specific properties and position of the audio source
relative to the microphones. This may typically make it difficult and resource demanding
to process the audio signals and in particular to determine which and how different
audio sources contribute to the captured audio for the different microphones. It is
often challenging to process multiple captured audio signals that are correlated,
and typically with an unknown correlation.
[0005] One approach that may seek to address such issues is to try to separate audio sources
by applying beamforming to form beams directed towards the direction of arrival of
audio from specific audio sources. However, although this may provide advantageous
performance in many scenarios, it is not optimal in all cases. For example, it may
not provide optimal source separation in some cases, and indeed in some applications
such spatial beamforming does not provide audio properties that are ideal for further
processing to achieve a given effect.
[0006] It has been proposed that in some cases where the relationship between audio sources
and microphones is known (e.g. by the acoustic impulse response between the audio
source and the capture position being known) then this information can be used to
generate decorrelated signals from the microphone signal. The known acoustic impulse
responses from the different audio sources to the different microphone signals can
be used to convert the microphone signals into signals where the audio from individual
audio sources is more concentrated in different signals. Such signals may in some
cases provide improved audio or speech capture or processing.
[0007] However, whereas such an operation may be desirable in many situations and scenarios,
there are also substantial issues, challenges, and disadvantages. Indeed, in most
practical applications the acoustic relationship, and specifically the acoustic transfer
functions/impulse responses between audio sources and microphones, are simply not known.
Even small differences between the actual and used values can result in significant
degradation in the generated signals and the subsequent processing.
[0008] Hence, an improved approach would be advantageous, and in particular an approach
allowing reduced complexity, increased flexibility, facilitated implementation, reduced
cost, improved audio capture, improved spatial perception/differentiation of audio
sources, improved audio source separation, improved audio/speech application support,
improved spatial decorrelation of audio signals, reduced dependency on known or static
acoustic properties, improved flexibility and customization to different audio environments
and scenarios, and/or improved performance would be advantageous.
SUMMARY OF THE INVENTION
[0009] Accordingly, the invention seeks to preferably mitigate, alleviate or eliminate one
or more of the above mentioned disadvantages singly or in any combination.
[0010] According to an aspect of the invention there is provided an audio apparatus for
generating a set of output audio signals, the audio apparatus comprising: a receiver
arranged to receive a set of input audio signals; a segmenter arranged to segment
the set of input audio signals into time segments; an output signal generator arranged
to generate the set of output audio signals, each output audio signal of the set of
output audio signals being linked with an input audio signal of the set of input audio
signals, the output signal generator being arranged to for each time segment perform
the steps of: generating a frequency bin representation of the set of input audio
signals, each frequency bin of the frequency bin representation of the set of input
audio signals comprising a frequency bin value for each of the input audio signals
of the set of input audio signals; generating a frequency bin representation of the
set of output audio signals, each frequency bin of the frequency bin representation
of a set of output audio signals comprising a frequency bin value for each of the
set of output audio signals, the frequency bin value for a given output audio signal
of the set of output audio signals for a given frequency bin being generated as a
weighted combination of frequency bin values of the set of input audio signals for
the given frequency bin; generating a time domain representation for each output audio
signal from the frequency bin representation of the set of output audio signals; an
adapter arranged to update weights of the weighted combination; wherein the adapter
is arranged to update a first weight for a contribution to a first frequency bin value
of a first frequency bin for a first output audio signal linked with a first input
audio signal from a second frequency bin value of the first frequency bin for a second
input audio signal linked to a second output audio signal in response to a correlation
measure between a first previous frequency bin value of the first output audio signal
for the first frequency bin and a second previous frequency bin value of the second
output audio signal for the first frequency bin.
[0011] The approach may provide improved operation and/or performance in many embodiments.
It may provide an advantageous generation of output audio signals with typically increased
decorrelation in comparison to the input signals. The approach may provide an efficient
adaptation of the operation resulting in improved decorrelation in many embodiments.
The adaptation may typically be implemented with low complexity and/or resource usage.
The approach may specifically apply a local adaptation of individual weights yet achieve
an efficient global adaptation.
[0012] The generation of the set of output signals may be adapted to provide increased decorrelation
relative to the input signals which in many embodiments and for many applications
may provide improved audio processing. For example, in many situations speech enhancement,
noise suppression, or beamforming may be improved if such audio processing is performed
on audio signals that are less correlated.
[0013] The first and second output audio signals may typically be different output audio
signals.
[0014] In accordance with an optional feature of the invention, the adapter is arranged
to update the first weight in response to a product of a first value and a second
value, the first value being one of the first previous frequency bin value and the
second previous frequency bin value and the second value being a complex conjugate
of the other of the first previous frequency bin value and the second previous frequency
bin value.
[0015] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios.
[0016] In accordance with an optional feature of the invention, the adapter is arranged
to update a second weight being for a contribution to the first frequency bin value
from a third frequency bin value being a frequency bin value of the first frequency
bin for the first input audio signal in response to a magnitude of the first previous
frequency bin value.
[0017] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios. It may in particular provide an improved adaptation
of the generated output signals. In many embodiments, the updating of the weight reflecting
the contribution to an output signal from the linked input signal may be dependent
on the signal magnitude/amplitude of that linked input signal. For example, the updating
may seek to compensate the weight for the level of the input signal to generate a
normalized output signal.
[0018] The approach may allow a normalization/signal compensation/level compensation to
provide e.g., a desired output level.
[0019] In accordance with an optional feature of the invention, the adapter is arranged
to set a second weight being for a contribution to the first frequency bin value from
a third frequency bin value being a frequency bin value of the first frequency bin
for the first input audio signal to a predetermined value.
[0020] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios. It may in many embodiments provide improved adaptation
while ensuring convergence of the adaptation towards a non-zero signal level. It may
interwork very efficiently with the adaptation of weights that are not for linked
signal pairs.
[0021] In many embodiments, the adapter may be arranged to keep the weight constant,
with no adaptation or updating of the weight.
[0022] In accordance with an optional feature of the invention, the adapter is arranged
to constrain a weight for a contribution to the first frequency bin value from a third
frequency bin value being a frequency bin value of the first frequency bin for the
first input audio signal to be a real value.
[0023] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios.
[0024] The weights between linked input/output signals may advantageously be determined
as/constrained to be real valued weights. This may lead to improved performance
and adaptation ensuring convergence on a non-zero level solution.
[0025] In accordance with an optional feature of the invention, the adapter is arranged
to set a third weight being a weight for a contribution to a fourth frequency bin
value of the first frequency bin for the second output audio signal from the first
input audio signal to be a complex conjugate of the first weight.
[0026] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios.
[0027] The two weights for two pairs of input/output signals may be complex conjugates of
each other in many embodiments.
[0028] In accordance with an optional feature of the invention, weights of the weighted
combination for other input audio signals than the first input audio signal are complex
valued weights.
[0029] This may provide improved performance and/or operation in many embodiments. The use
of complex values for weights for non-linked input signals provides an improved frequency
domain operation.
[0030] In accordance with an optional feature of the invention, the adapter is arranged
to determine output bin values for the given frequency bin ω from:

y(ω) = W(ω)x(ω)

where
y(ω) is a vector comprising the frequency bin values for the output audio signals for
the given frequency bin ω;
x(ω) is a vector comprising the frequency bin values for the input audio signals for
the given frequency bin ω; and
W(ω) is a matrix having rows comprising weights of a weighted combination for the output
audio signals.
[0031] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios.
[0032] The matrix W(ω) may advantageously be Hermitian. In many embodiments, the diagonal
of the matrix W(ω) may be constrained to be real values, may be set to a predetermined
value(s), and/or may not be updated/adapted but may be maintained as a fixed value.
The weights/coefficients outside the diagonal may generally be complex values.
[0033] In accordance with an optional feature of the invention, the adapter is arranged
to adapt weights wij of the matrix W(ω) according to:

wij(k+1, ω) = wij(k, ω) − η(k, ω)·yi(k, ω)·yj*(k, ω)

where i is a row index of the matrix W(ω), j is a column index of the matrix W(ω),
k is a time segment index, ω represents the frequency bin, and η(k, ω) is a scaling
parameter for adapting an adaptation speed.
[0034] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios.
[0035] In accordance with an optional feature of the invention, the adapter is arranged
to compensate the correlation value for a signal level of the first frequency bin.
[0036] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios. It may allow a compensation of the update rate for
signal variations.
[0037] In accordance with an optional feature of the invention, the adapter is arranged
to initialize the weights for the weighted combination to comprise at least one zero
value weight and one non-zero value weight.
[0038] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios. It may allow a more efficient and/or quicker adaptation
and convergence towards advantageous decorrelation. In many embodiments, the matrix
W(ω) may be initialized with zero values for weights or coefficients for non-linked
signals and fixed non-zero real values for linked signals. Typically, the weights
may be set to e.g., 1 for weights on the diagonal and all other weights may initially
be set to zero.
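As a minimal sketch of such an initialization (assuming a NumPy-style representation with one n×n weight matrix per frequency bin, which the text does not mandate):

```python
import numpy as np

def init_weights(num_signals, num_bins):
    """Initialize the weight matrix W(w) for every frequency bin.

    Linked (diagonal) weights start at the fixed real value 1 and all
    non-linked (off-diagonal) weights start at zero, as described above.
    Shape: (num_bins, num_signals, num_signals), complex valued.
    """
    W = np.zeros((num_bins, num_signals, num_signals), dtype=complex)
    W[:, np.arange(num_signals), np.arange(num_signals)] = 1.0
    return W

# e.g. 3 microphone signals, 257 bins (a 512-point real FFT)
W = init_weights(3, 257)
```

With this starting point each output signal initially equals its linked input signal, and the adaptation then grows the off-diagonal weights away from zero.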
[0039] In accordance with an optional feature of the invention, the weighted combination
comprises applying a time domain windowing to a frequency representation of weights
formed by weights for the first input audio signal and the second input audio signal
for different frequency bins.
[0040] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios.
[0041] Applying the time domain windowing to the frequency representation of weights may
comprise: converting the frequency representation of weights to a time domain representation
of weights; applying a window to the time domain representation to generate a modified
time domain representation; and converting the modified time domain representation
to the frequency domain.
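The three steps above (convert to the time domain, window, convert back) might be sketched as follows; the choice of window is an assumption, as the text does not fix a particular shape:

```python
import numpy as np

def window_weights(w_freq, win):
    """Apply a time-domain window to a frequency representation of
    weights (one weight per frequency bin for one input/output pair).

    Steps: IFFT to the time domain, multiply by a window (e.g. a
    rectangular window that truncates the implied impulse response),
    FFT back to the frequency domain.
    """
    w_time = np.fft.ifft(w_freq)      # frequency -> time domain
    w_time_mod = w_time * win         # constrain the impulse response
    return np.fft.fft(w_time_mod)     # time -> frequency domain
```

An all-ones window leaves the weights unchanged; a window that zeroes the tail of the time-domain representation limits the effective filter length.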
[0042] According to an aspect of the invention, there is provided a method of generating
a set of output audio signals, the method comprising: receiving a set of input audio
signals; segmenting
the set of input audio signals into time segments; generating the set of output audio
signals, each output audio signal of the set of output audio signals being linked
with one input audio signal of the set of input audio signals, wherein generating
the set of output audio signals comprises for each time segment performing the steps
of: generating a frequency bin representation of the set of input audio signals, each
frequency bin of the frequency bin representation of the set of input audio signals
comprising a frequency bin value for each of the input audio signals of the set of
input audio signals; generating a frequency bin representation of the set of output
audio signals, each frequency bin of the frequency bin representation of a set of
output audio signals comprising a frequency bin value for each of the output audio
signals, the frequency bin value for a given output audio signal of the set of output
audio signals for a given frequency bin being generated as a weighted combination
of frequency bin values of the set of input audio signals for the given frequency
bin; generating a time domain representation for each output audio signal from the
frequency bin representation of the set of output audio signals; and the method further
comprises updating weights of the weighted combination including updating a first
weight for a contribution to a first frequency bin value of a first frequency bin
for a first output audio signal linked with a first input audio signal from a second
frequency bin value of the first frequency bin for a second input audio signal linked
to a second output audio signal in response to a correlation measure between a first
previous frequency bin value of the first output audio signal for the first frequency
bin and a second previous frequency bin value of the second output audio signal for
the first frequency bin.
[0043] These and other aspects, features and advantages of the invention will be apparent
from and elucidated with reference to the embodiment(s) described hereinafter.
BRIEF DESCRIPTION OF THE DRAWINGS
[0044] Embodiments of the invention will be described, by way of example only, with reference
to the drawings, in which
FIG. 1 illustrates an example of elements of an apparatus generating a set of output
signals from a set of input signals in accordance with some embodiments of the invention;
FIG. 2 illustrates a flowchart of an example of a method of generating a set of output
signals from a set of input signals in accordance with some embodiments of the invention;
FIG. 3 illustrates an example of elements of an apparatus generating a set of output
signals from a set of input signals in accordance with some embodiments of the invention;
FIG. 4 illustrates an example of a signal flow for an apparatus generating a set of
output signals from a set of input signals in accordance with some embodiments of
the invention;
FIG. 5 illustrates an example of a signal flow for an apparatus generating a set of
output signals from a set of input signals in accordance with some embodiments of
the invention;
FIG. 6 illustrates an example of a signal flow for an apparatus generating a set of
output signals from a set of input signals in accordance with some embodiments of
the invention;
FIG. 7 illustrates an example of a signal flow for an apparatus generating a set of
output signals from a set of input signals in accordance with some embodiments of
the invention; and
FIG. 8 illustrates some elements of a possible arrangement of a processor for implementing
elements of an audio apparatus in accordance with some embodiments of the invention.
DETAILED DESCRIPTION OF SOME EMBODIMENTS OF THE INVENTION
[0045] The following description focuses on embodiments of the invention applicable to audio
capturing such as e.g., speech capturing for a teleconferencing apparatus. However,
it will be appreciated that the approach is applicable to many other audio signals,
audio processing systems, and scenarios for capturing and/or processing audio.
[0046] FIG. 1 illustrates an example of an audio apparatus in accordance with some embodiments
of the invention.
[0047] The audio apparatus comprises a receiver 101 which is arranged to receive a set of
input audio signals. The input audio signals may be received from different sources,
including internal or external sources. In the following an embodiment will be described
where the receiver 101 is coupled to a plurality of microphones, such as a linear
array of microphones, providing a set of input audio signals in the form of microphone
signals.
[0048] The input audio signals are fed to a segmenter 103 which is arranged to segment the
set of input audio signals into time segments. In many embodiments, the segmentation
may typically be a fixed segmentation into time segments of a fixed and equal duration
such as e.g. a division into time segments/intervals with a fixed duration of between
10 and 20 msec. In some embodiments, the segmentation may be adaptive, allowing segments
to have a varying duration. For example, the input audio signals may have a varying
sample rate and the segments may be determined to comprise a fixed number of samples.
[0049] The segmenter 103 is coupled to an audio signal generator 105 which is arranged to
generate a set of output audio signals from the input audio signals. In many embodiments,
the audio signal generator 105 is arranged to generate the same number of output signals
as there are input signals, but it is possible in some embodiments for a different
number of output audio signals to be generated.
[0050] The segmentation may typically be into segments with a given fixed number of time
domain samples of the input signals. For example, in many embodiments, the segmenter
103 may be arranged to divide the input signals into consecutive segments of e.g.,
256 or 512 samples.
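The fixed-length segmentation described above might be sketched as follows; the handling of trailing samples is an assumption (a real implementation might buffer them for the next call rather than drop them):

```python
import numpy as np

def segment(signals, seg_len=512):
    """Split multichannel time-domain signals into fixed-length segments.

    `signals` has shape (num_channels, num_samples); trailing samples
    that do not fill a complete segment are dropped in this sketch.
    Returns shape (num_segments, num_channels, seg_len).
    """
    num_ch, num_samples = signals.shape
    num_seg = num_samples // seg_len
    trimmed = signals[:, :num_seg * seg_len]
    return trimmed.reshape(num_ch, num_seg, seg_len).swapaxes(0, 1)
```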
[0051] The audio signal generator 105 seeks to generate the output signals to correspond
to the input signals but with an increased decorrelation of the signals. The output
audio signals are generated to have an increased spatial decorrelation with the cross
correlation between audio signals being lower for the output audio signals than for
the input audio signals. Specifically, the output signals may be generated to have
the same combined energy/power as the combined energy/power of the input signals (or
have a given scaling of this) but with an increased decorrelation (decreased correlation)
between the signals. Thus, the output signals may specifically be generated to have
a lower coherence (correlation normalized by energy/power) than the input signals.
[0052] The output audio signals may be generated to include all the audio/signal components
of the input signals in the output audio signals but with a re-distribution to different
signals to achieve increased decorrelation/decreased coherence.
[0053] Each output signal may be created by subtracting/removing correlated source audio
components in the input signals from the scaled linked input signal leading to output
signals that contain all sources, but show a reduction in mutual correlation.
[0054] Spatial correlation between the input signals may mean that a common (but unknown)
signal component can be identified that differs per input signal in amplitude and/or
phase. Reducing the correlation at the output may be realized by adding amplitude
and/or phase modified input signals to the scaled linked input such that the common
input signal component is reduced.
[0055] The audio signal generator 105 is arranged to perform an adaptive spatial decorrelation
where the processing is arranged to dynamically adapt the processing and the spatial
decorrelation to reflect properties of the captured audio and specifically to e.g.
adapt to the acoustic environment and the relationships between the audio sources
and microphone signals. Accordingly, the audio apparatus further comprises an adapter
107 which is arranged to adapt the spatial filtering.
[0056] The audio signal generator 105 may typically be arranged to apply a frequency domain
based cross-signal spatial filtering to the input signals to generate the audio signals
with the coefficients of the filtering being adapted based on signal properties, and
specifically based on signal properties of the output signals. In particular, for
each segment, a set of input signal samples may be converted into a set of output
signal samples based on a spatial filtering, and a set of update values for the spatial
filtering may be determined by the adapter 107. The spatial filtering is then updated
for the next segment based on the update values.
[0057] FIG. 2 illustrates a method that may be performed by the audio signal generator 105.
[0058] In step 201, the audio signal generator 105 is arranged to generate a frequency bin
representation of the set of input audio signals. The audio signal generator 105 is
arranged to perform a frequency domain processing of a frequency domain representation
of the input audio signals. The signal representation and processing are based on
frequency bins and thus the signals are represented by values of frequency bins and
these values are processed to generate frequency bin values of the output signals.
In many embodiments, the frequency bins have the same size, and thus cover frequency
intervals of the same size. However, in other embodiments, frequency bins may have
different bandwidths, and for example a perceptually weighted bin frequency interval
may be used.
[0059] In some embodiments, the input audio signals may already be provided in a frequency
representation and no further processing or operation is required. In some such cases,
however, a rearrangement into suitable segment representations may be desired, including
e.g. using interpolation between frequency values to align the frequency representation
to the time segments.
[0060] In other embodiments, a filter bank, such as a Quadrature Mirror Filter, QMF, may
be applied to the time domain input signals to generate the frequency bin representation.
However, in many embodiments, a Discrete Fourier Transform (DFT), and specifically
a Fast Fourier Transform (FFT), may be applied to generate the frequency representation.
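Such a conversion of one time segment to a frequency bin representation might, for example, use a real FFT; this is a sketch, as the transform length and any analysis windowing are left open by the text:

```python
import numpy as np

def to_bins(segment):
    """Convert one time segment of multichannel samples to a frequency
    bin representation using a real FFT.

    `segment` has shape (num_channels, seg_len); the result has one
    complex bin value per channel per frequency bin, so each bin w
    yields the vector x(w) used by the weighted combination.
    """
    return np.fft.rfft(segment, axis=-1)  # (num_channels, seg_len//2 + 1)
```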
[0061] Step 201 is followed by step 203 in which the audio signal generator 105 generates
a frequency bin representation of the set of output signals. The processing of the
input signals is typically performed in the frequency domain and for each frequency
bin, an output frequency bin value is generated from one or more input frequency bin
values of one or more input signals as will be described in more detail in the following.
The output signals are generated to (typically/on average) reduce the correlation
between signals relative to the correlation of the input signals.
[0062] The audio signal generator 105 is arranged to filter the input audio signals. The
filtering is a spatial filtering in that for a given output signal, the output value
is determined from a plurality of, and typically all of the input audio signals (for
the same time/segment and for the same frequency bin). The spatial filtering is specifically
performed on a frequency bin basis such that a frequency bin value for a given frequency
bin of an output signal is generated from the frequency bin values of the input signals
for that frequency bin. The filtering/weighted combination is across the signals rather
than being a typical time/frequency filtering.
[0063] Specifically, the frequency bin value for a given frequency bin is determined as
the weighted combination of the frequency bin values of the input signals for that
frequency bin. The combination may specifically be a summation and the frequency bin
value may be determined as a weighted summation of the frequency bin values of the
input signals for that frequency bin. The determination of a bin value for a given
frequency bin may be determined as the vector multiplication of a vector of the weights/coefficients
of the weighted summation and a vector comprising the bin values of the input signals:

y = w1·x1 + w2·x2 + w3·x3

where y is the output bin value, w1-w3 are the weights of the weighted combination,
and x1-x3 are the input signal bin values.
[0064] Representing the output bin values for a given frequency bin ω as a vector y(ω),
the determination of the output signals may be determined as:

y(ω) = W(ω)x(ω)

where the matrix W(ω) represents the weights/coefficients of the weighted summation
for the different output signals and x(ω) is a vector comprising the input signal values.
[0065] For example, for an example with only three input signals and output signals, the
output bin values for the frequency bin ω may be given by

y1 = w11·x1 + w12·x2 + w13·x3
y2 = w21·x1 + w22·x2 + w23·x3
y3 = w31·x1 + w32·x2 + w33·x3

where yn represents the output bin values, wij represents the weights of the weighted
combinations, and xm represents the input signal bin values.
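The per-bin matrix-vector product y(ω) = W(ω)x(ω) described above can be sketched as follows, assuming one weight matrix per frequency bin:

```python
import numpy as np

def combine(W, X):
    """Per-bin weighted combination y(w) = W(w) x(w).

    W: (num_bins, n, n) weight matrices; X: (n, num_bins) input bin
    values.  Returns Y with shape (n, num_bins), where row i of W(w)
    holds the weights of the weighted summation for output signal i.
    """
    # einsum performs the matrix-vector product independently per bin
    return np.einsum('bij,jb->ib', W, X)
```

With W(ω) initialized to the identity, each output signal simply reproduces its linked input signal; the adaptation then modifies the off-diagonal weights to reduce the cross-correlation.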
[0066] Step 203 is followed by step 205 in which a time domain representation of each output
signal is generated from the frequency bin representation of the output signals as
generated in step 203.
[0067] Typically, the reverse operation of the conversion from the time domain to the frequency
domain is applied, e.g. an IFFT or iQMF operation is performed. The resulting output
is accordingly a set of time domain output signals that represents the audio of the
input audio signals but with increased decorrelation between the signals.
[0068] Step 205 is followed by step 207 wherein the adapter 107 is arranged to determine
update values for the weights of the weighted combination(s). Thus, specifically,
in step 207, update values may be determined for the matrix
W(ω).
[0069] The method may then return to step 201 via step 209 to process the next segment.
Step 209 may initiate the processing of the next segment, including updating the weights
of the weighted combination based on the update values.
[0070] The adapter 107 is arranged to apply an adaptation approach that determines update
values which may allow the output signals to represent the audio of the input signals
but with the output signals typically being more decorrelated than the input signals.
[0071] The adapter 107 is arranged to use a specific approach of adapting the weights based
on the generated output signals. The operation is based on each output audio signal
being linked with one input audio signal. The exact linking between output signals
and input signals is not essential and many different (including in principle random)
linkings/pairings of each output signal to an input signal may be used. However, the
processing is different for a weight that reflects a contribution from an input signal
that is linked to/paired with the output signal for the weight than the processing
for a weight that reflects a contribution from an input signal that is not linked
to/paired with the output signal for the weight. For example, in some embodiments,
the weights for linked signals (i.e. for input signals that are linked to the output
signal generated by the weighted combination that includes the weight) may be set
to a fixed value and not updated, and/or the weights for linked signals may be restricted
to be real valued weights whereas other weights are generally complex values.
[0072] The adapter 107 uses an adaptation/update approach where an update value is determined
for a given weight that represents the contribution to a bin value for a given output
signal from a given non-linked input signal based on a correlation measure between
the output bin value for the given output signal and the output bin value for the
output signal that is linked with the given (non-linked) input signal. The update
value may then be applied to modify the given weight in the subsequent segment; equivalently,
the update value for a weight in a given segment is determined in response to two output
bin values of a (and typically the immediately) prior segment, where the two output
bin values represent respectively the input signal and the output signal to which the
weight relates.
[0073] The described approach is typically applied to a plurality, and typically all, of
the weights used in determining the output bin values based on a non-linked input
signal. For weights relating to the input signal that is linked to the output signal
for the weight, other considerations may be used, such as e.g. setting the weight
to a fixed value as will be described in more detail later.
[0074] Specifically, the update value may be determined in dependence on a product of the
output bin value for the weight and the complex conjugate of the output bin value
linked to the input signal for the weight, or equivalently in dependence on a product
of the complex conjugate of the output bin value for the weight and the output bin
value linked to the input signal for the weight.
[0075] As a specific example, an update value for segment k+1 for frequency bin ω may be
determined in dependence on the correlation measure given by:

yi(k, ω) yj*(k, ω)

or equivalently by:

yi*(k, ω) yj(k, ω)
where
yi(k, ω) is the output bin value for the output signal i being determined based on the
weight; and
yj(k, ω) is the output bin value for the output signal j that is linked to the input signal
from which the contribution is determined (i.e. the input signal bin value that is
multiplied by the weight to determine a contribution to the output bin value for signal
i).
[0076] The measure of yi(k, ω) yj*(k, ω) (or the conjugate value) indicates the correlation
of the time domain signal in the given segment. In the specific example, this value
may then be used to update and adapt the weight wij(k+1, ω).
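As a minimal illustrative sketch (in Python; the function and variable names are assumptions for illustration, not part of the described apparatus), the correlation measure for one frequency bin is simply the product of one output bin value with the complex conjugate of the other:

```python
import numpy as np

# Correlation measure for a single frequency bin: the product of the output
# bin value y_i and the complex conjugate of the output bin value y_j of the
# output signal linked to the input signal of weight w_ij.
# Names are illustrative only.
def correlation_measure(y_i: complex, y_j: complex) -> complex:
    return y_i * np.conj(y_j)

# Two identical bin values give a purely real, positive measure;
# a 90-degree phase difference gives a purely imaginary one.
m_same = correlation_measure(1.0 + 1.0j, 1.0 + 1.0j)
m_rot = correlation_measure(1.0 + 0.0j, 0.0 + 1.0j)
```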
[0077] As previously mentioned, the audio signal generator 105 may be arranged to determine
the output bin values for the output signals for the given frequency bin ω from:

y(ω) = W(ω) x(ω)
where
y(ω) is a vector comprising the frequency bin values for the output signals for the
given frequency bin ω;
x(ω) is a vector comprising the frequency bin values for the input audio signals for
the given frequency bin ω; and
W(ω) is a matrix having rows comprising weights of a weighted combination for the output
audio signals.
[0078] In the example, adapter 107 may specifically be arranged to adapt at least some of
the weights
wij of the matrix
W(ω) according to:

wij(k+1, ω) = wij(k, ω) - η(k, ω) yi(k, ω) yj*(k, ω)

where i is a row index of the matrix W(ω), j is a column index of the matrix W(ω),
k is a time segment index, ω represents the frequency bin, and η(k, ω) is a scaling
parameter for adapting the adaptation speed. Typically, the adapter 107 may be
arranged to adapt all the weights that do not relate an input signal to its linked
output signal (i.e. the "cross-signal" weights).
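One such adaptation step per frequency bin may be sketched as follows (Python; names are illustrative and a constant η is assumed for brevity): all non-linked weights are updated from the correlation measures while the linked, diagonal weights are left untouched:

```python
import numpy as np

rng = np.random.default_rng(3)
n, eta = 3, 0.05                        # illustrative sizes/values
W = np.eye(n, dtype=complex)            # linked weights on the diagonal
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)  # input bin values

y = W @ x                               # output bin values: y = W x
G = np.outer(y, np.conj(y))             # G[i, j] = y_i * conj(y_j)
np.fill_diagonal(G, 0.0)                # linked (diagonal) weights not updated
W_next = W - eta * G                    # w_ij <- w_ij - eta * y_i * conj(y_j)
```

Since G is Hermitian with a zeroed diagonal, a Hermitian W with unit diagonal keeps that structure after the step.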
[0079] In some embodiments, the adapter 107 may be arranged to adapt the update rate/speed
of the adaptation of the weights. For example, in some embodiments, the adapter may
be arranged to compensate the correlation measure for a given weight in dependence
on the signal level of the output bin value to which a contribution is determined
by the weight.
[0080] As a specific example, the correlation measure yi(k, ω) yj*(k, ω) may be compensated
by the signal level of the output bin value, |yi(k, ω)|. The compensation may for example
be included to normalize the update step values to be less dependent on the signal level
of the generated decorrelated signals.
[0081] In many embodiments, such a compensation or normalization may specifically be performed
on a frequency bin basis, i.e. the compensation may be different in different frequency
bins. This may in many scenarios improve the operation and may typically result in
an improved adaptation of the weights generating the decorrelated signals.
[0082] The compensation may for example be built into the scaling parameter η(k, ω) of
the previous update equation. Thus, in many embodiments, the adapter 107 may be arranged
to adapt/change the scaling parameter η(k, ω) differently in different frequency bins.
[0083] In many embodiments, the arrangement of the input signal vector x(ω) and the
output signal vector y(ω) is such that linked signals are at the same position in the
respective vectors, i.e. specifically y1 is linked with x1, y2 with x2, y3 with x3,
etc. In this case, the weights for the linked signals are on the diagonal of the weight
matrix W(k, ω). The diagonal values may in many embodiments be set to a fixed real
value, such as e.g. specifically be set to a constant value of 1.
[0084] In many embodiments, the weights/spatial filters/weighted combinations may be
such that the weight for a contribution to a first output signal from a first input
signal (not linked to the first output signal) is the complex conjugate of the weight
for the contribution to a second output signal, being the output signal linked with
the first input signal, from a second input signal, being the input signal linked with
the first output signal. Thus, the two weights for two pairs of linked input/output
signals are complex conjugates.
[0085] In the example of the weights for linked input and output signals being arranged
on the diagonal of the weight matrix
W(ω), this results in a Hermitian matrix. Indeed, in many embodiments, the weight matrix
W(ω) is a Hermitian matrix. Specifically, the coefficients/weights of the weight matrix
W(ω) may meet the criterion:

wij(ω) = wji*(ω)
[0086] As previously mentioned, the weights for the contributions to an output signal bin
value from the linked input signal (corresponding to the values of the diagonal of
the weight matrix
W(ω) in the specific example) are treated differently than the weights for non-linked
input signals. The weights for linked input signals will in the following for brevity
also be referred to as linked weights and the weights for non-linked input signals
will in the following for brevity also be referred to as non-linked weights, and thus
in the specific example the weight matrix W(ω) will be a Hermitian matrix comprising
the linked weights on the diagonal and the non-linked weights outside the diagonal.
[0087] In many approaches, the adaptation of the non-linked weights is such that it seeks
to reduce the correlation measures. Specifically, each update value may be determined
to reduce the correlation measure. Overall, the adaptation will accordingly seek to
reduce the cross-correlations between the output signals. However, the linked weights
are determined differently to ensure that the output signals maintain a suitable audio
energy/power/level. Indeed, if the linked weights were instead adapted to seek to
reduce the autocorrelation of the output signal for the weight, there is a high risk
that the adaptation would converge on a solution where all weights, and thus output
signals, are essentially zero (as indeed this would result in the lowest correlations).
Further, the audio apparatus is arranged to seek to generate signals with less cross-correlation,
but it does not seek to reduce autocorrelation.
[0088] Thus, in many embodiments, the linked weights may be set to ensure that the output
signals are generated to have a desirable (combined) energy/power/level.
[0089] In some cases, the adapter 107 may be arranged to adapt the linked weights, and in
other cases the adapter may be arranged to not adapt the linked weights.
[0090] For example, in some embodiments, the linked weights may simply be set to a fixed
constant value that is not adapted. For example, in many embodiments, the linked weights
may be set to a constant scalar value, such as specifically to the value 1 (i.e. a
unitary gain being applied for a linked input signal). For example, the weights on
the diagonal of the weight matrix
W(ω) may be set to 1.
[0091] Thus, in many embodiments, the weight for a contribution to a given output signal
frequency bin value from a linked input signal frequency bin value may be set to a
predetermined value. This value may in many embodiments be maintained constant without
any adaptation.
[0092] Such an approach has been found to provide very efficient performance and may
result in an overall adaptation generating output signals that provide a highly accurate
representation of the original audio of the input signals, but in a set of output
signals that have increased decorrelation.
[0093] In some embodiments, the linked weights may also be adapted but will be adapted differently
than the non-linked weights. In particular, in many embodiments, the linked weights
may be adapted based on the output signals.
[0094] Specifically, in many embodiments, a linked weight for a first input signal and linked
output signal may be adapted based on the generated output bin value of the linked
audio signal, and specifically based on the magnitude of the output bin value.
[0095] Such an approach may for example allow a normalization and/or setting of the desired
energy level for the signals.
[0096] In many embodiments, the linked weights are constrained to be real valued weights
whereas the non-linked weights are generally complex values. In particular, in many
embodiments, the weight matrix
W(ω) may be a Hermitian matrix with real values on the diagonal and with complex values
outside of the diagonal.
[0097] Such an approach may provide a particular advantageous operation and adaptation in
many scenarios and embodiments. It has been found to provide a highly efficient spatial
decorrelation while maintaining a relatively low complexity and computational resource.
[0098] The adaptation may gradually adapt the weights to increase the decorrelation between
signals. In many embodiments, the adaptation may be arranged to converge towards a
suitable weight matrix
W(ω) regardless of the initial values and indeed in some cases the adaptation may be
initialized by random values for the weights.
[0099] However, in many embodiments, the adaptation may be started with advantageous initial
values that may e.g. result in faster adaptation and/or result in the adaptation being
more likely to converge towards more optimal weights for decorrelating signals.
[0100] In particular, in many embodiments, the weight matrix
W(ω) may be arranged with a number of weights being zero but at least some weights
being non-zero. In many embodiments, the number of weights being substantially zero
may be no less than 2, 3, 5, or 10 times the number of weights that are set to a non-zero
value. This has been found to tend to provide improved adaptation in many scenarios.
[0101] In particular, in many embodiments, the adapter 107 may be arranged to initialize
the weights with linked weights being set to a non-zero value, such as typically a
predetermined non-zero real value, whereas non-linked weights are set to substantially
zero. Thus, in the above example where linked signals are arranged at the same positions
in the vectors, this will result in an initial weight matrix
W(ω) having non-zero values on the diagonal and (substantially) zero values outside
of the diagonal.
[0102] Such initializations may provide particularly advantageous performance in many embodiments
and scenarios. It may reflect that due to the audio signals typically representing
audio at different positions, there is a tendency for the input signals to be somewhat
decorrelated. Accordingly, a starting point that assumes the input signals are fully
decorrelated is often advantageous and will lead to a faster and often improved adaptation.
[0103] It will be appreciated that the weights, and specifically the non-linked weights,
may not necessarily be exactly zero but may in some embodiments be set to low values
close to zero. However, the initial non-zero values may be at least 5, 10, 20 or 100
times higher than the initially substantially zero values.
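A trivial sketch of such an initialization (Python; the function name and defaults are assumptions for illustration): linked weights start at a predetermined non-zero real value and non-linked weights at substantially zero:

```python
import numpy as np

def init_weights(n_signals: int, diag_value: float = 1.0,
                 off_value: complex = 0.0) -> np.ndarray:
    """Initial weight matrix: non-zero real linked weights on the diagonal,
    (substantially) zero non-linked weights elsewhere."""
    W0 = np.full((n_signals, n_signals), off_value, dtype=complex)
    np.fill_diagonal(W0, diag_value)
    return W0
```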
[0104] The described approach may provide a highly efficient adaptive spatial decorrelator
that may generate output signals that represent the same audio as the input signals
but with increased decorrelation. The approach has in practice been found to provide
a highly efficient adaptation in a wide variety of scenarios and in many different
acoustic environments and for many different audio sources. For example, it has been
found to provide a highly efficient decorrelation of speaker signals in environments
with multiple speakers.
[0105] The adaptation approach is furthermore computationally efficient and in particular
allows localized and individual adaptation of individual weights based only on two
signals (and specifically on only two frequency bin values) closely related to the
weight, yet the process results in an efficient and often substantially optimized
global optimization of the spatial filtering, and specifically the weight matrix
W(ω) . The local adaptation has been found to lead to a highly advantageous global
adaptation in many embodiments.
[0106] A particular advantage of the approach is that it may be used to decorrelate convolutive
mixtures and is not limited to only decorrelate instantaneous mixtures. For a convolutive
mixture, the full impulse response determines how the signals from the different audio
sources combine at the microphones (i.e. the delay/timing characteristics are significant)
whereas for instantaneous mixes a scalar representation is sufficient to determine
how the audio sources combine at the microphones (i.e. the delay/timing characteristics
are not significant). By transforming a convolutive mixture into the frequency domain,
the mixture can be considered a complex-valued instantaneous mixture per frequency
bin.
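This property can be illustrated numerically (a Python sketch only; block sizes and signals are illustrative): a circular convolution with an impulse response in the time domain becomes a bin-wise complex multiplication after an FFT, i.e. an instantaneous complex mixture per frequency bin:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 64
s = rng.standard_normal(N)                   # source block (illustrative)
h = np.concatenate([rng.standard_normal(8),  # short impulse response,
                    np.zeros(N - 8)])        # zero-padded to the block size

# Convolutive mixing in the time domain (circular convolution).
x = np.real(np.fft.ifft(np.fft.fft(s) * np.fft.fft(h)))

# Per-bin instantaneous mixing in the frequency domain: X(w) = H(w) S(w).
X = np.fft.fft(h) * np.fft.fft(s)
```

The spectrum of the time-domain result equals the bin-wise product, so each bin behaves as a scalar (complex) mixture.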
[0107] In the following a more detailed and mathematical description of an example of a
spatial decorrelator following some or all of the above described principles will
be provided. In the example, linked input signals and output signals are denoted by
the same index and position in the input and output vectors
x(ω),
y(ω) and a de-mixing/weight matrix
W(ω) is determined for which the linked weights are on the diagonal.
[0108] The exemplary adaptive spatial decorrelator (SDC) works in the frequency domain,
and a robust simple learning rule is applied for each single frequency bin to achieve
decorrelation for that frequency bin. Yet, the individual and local adaptation results
in an overall adaptation of the spatial decorrelation.
[0109] A solution in the frequency domain has the advantage that convolutions and correlations
in time domain are replaced by element-wise multiplications. Significantly, a frequency
domain approach as described may also allow convolutive mixtures to be considered
instantaneous mixtures (per frequency bin) in the frequency domain.
[0110] If we consider input signals being received from a microphone array with Nmics
microphones, a single noise point interferer s(ω), transfer functions hi(ω) and microphone
signals xi(ω), with i the microphone number, we can write:

xi(ω) = hi(ω) s(ω)
[0111] For a pair of microphones p and q we can now write for the cross power spectrum
(the frequency domain equivalent of the cross-correlation):

Cpq(ω) = E[xp(ω) xq*(ω)] = hp(ω) hq*(ω) E[s(ω) s*(ω)]

with (.)* the conjugate operator.
[0112] Let s(k) be white again, with for all ω:

E[s(ω) s*(ω)] = σs²

[0113] We can rewrite Cpq(ω) as:

Cpq(ω) = σs² hp(ω) hq*(ω)
[0114] More generally, for a convolutive mixture with nsources sources we get, using
matrix notation:

x(ω) = H(ω) s(ω)

where H(ω) is the so-called nmics x nsources mixing matrix.
[0115] Now we can determine an nmics x nmics covariance matrix Rxx(ω) defined as:

Rxx(ω) = E[x(ω) xH(ω)]

with (.)H the conjugate transpose.
[0116] The microphone signals xi(ω) are uncorrelated if and only if the off-diagonal
elements of Rxx(ω) are all zero. When correlated, with non-zero off-diagonal elements,
we need a transformation that transforms x(ω) such that the output signals are decorrelated.
This can be done with an nmics x nmics de-mixing matrix W(ω) such that the output

y(ω) = W(ω) x(ω)

has a covariance matrix Ryy(ω) defined as:

Ryy(ω) = E[y(ω) yH(ω)] = W(ω) Rxx(ω) WH(ω)

with all off-diagonal elements equal to zero.
We can rewrite Ryy(ω) as:

Ryy(ω) = Λ(ω)

with Λ(ω) a diagonal matrix with real elements λ1(ω)..λnmics(ω). Often Λ(ω) is chosen
as Λ(ω) = I, the identity matrix. While this is a good choice when there are multiple
input sources si(ω), it is not the best choice when there is a single, or a dominant,
input noise source.
[0117] From the above, we have for the (ij)'th element of Rxx(ω) for a single source
s(ω):

(Rxx(ω))ij = σs² hi(ω) hj*(ω)
[0118] We introduce R̃xx(ω) = Rxx(ω)/σs², with R̃xx(ω) the coherence matrix, a normalized
covariance matrix. The elements of R̃xx(ω) only depend on the transfer functions hi,j(ω),
not on the signal characteristics of the source s(ω). This means that with
W(ω) R̃xx(ω) WH(ω) = I, also W(ω) will only depend on the transfer functions hi,j(ω).
This is advantageous, since the changes in the acoustic paths are normally much slower
when compared to the changes in the spectrum of the noise source (assuming non-stationarity).
Since we will not use the coherence matrix directly, we will get the same effect if we
choose Λ(ω) = σs² I, since:

W(ω) Rxx(ω) WH(ω) = σs² W(ω) R̃xx(ω) WH(ω) = σs² I
[0119] Later on, we will see that the update rule is simplified when we choose
Λ(ω) =

, with α(ω) a scalar independent of

.
[0120] The decorrelation/weight matrix W may be symmetric positive definite. Symmetric
means wij = wji, and if Rxx is positive definite (all eigenvalues should be positive,
which is normally the case, certainly when there is some uncorrelated (sensor) noise
present), then, with W symmetric, W will also be positive definite (suppose the real
positive eigenvalues of Rxx are λ1 ... λnmics; then the eigenvalues of W are
1/√λ1 ... 1/√λnmics).
[0121] An adaptation/learning approach/rule may be based on the equation:

W(k + 1) = W(k) + η(k) (Λ - y(k) yT(k))

or in scalar form:

wij(k + 1) = wij(k) + η(k) (λi δij - yi(k) yj(k))

with k the time index, η(k) the step size or learning rate, and δij the Kronecker
symbol. This learning rule may be called a local learning rule, since for calculating
wij(k + 1) we only need the signals yi(k) and yj(k). Note that when at initialization
W(0) = I, W(k) will be symmetric, since y(k) yT(k) will be symmetric.
[0122] For a time domain based system addressing instantaneous mixtures with real signal
values, similar approaches have been indicated in A. Cichocki and S. Amari, "Adaptive
Blind Signal and Image Processing", Chichester (West Sussex, England): John Wiley &
Sons, Ltd, 2002.
[0123] For the learning rule in the frequency domain, complex numbers should preferably
be used instead of real values. To apply the local learning rule the matrix W(ω) is
preferably Hermitian, with wij(ω) = wji*(ω).
[0124] Since Rxx(ω) is also Hermitian and positive definite, with positive real eigenvalues,
W(ω) will also be positive definite, and the learning rule for complex numbers can
be applied:

W(k + 1, ω) = W(k, ω) + η(k, ω) (Λ(ω) - y(k, ω) yH(k, ω))     (Eq. 1)

or in scalar form:

wij(k + 1, ω) = wij(k, ω) + η(k, ω) (λi(ω) δij - yi(k, ω) yj*(k, ω))

Note that when W(0, ω) = I, W(k + 1, ω) remains Hermitian, since y(k, ω) yH(k, ω) is
Hermitian.
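A single step of this complex learning rule can be sketched as follows (Python; sizes, the choice Λ = I and the step size are illustrative assumptions), verifying that a Hermitian W stays Hermitian, since y yH is Hermitian and Λ is real diagonal:

```python
import numpy as np

def learn_step(W, x, Lam, eta):
    """One frequency-bin step: W <- W + eta * (Lam - y y^H) with y = W x."""
    y = W @ x
    return W + eta * (Lam - np.outer(y, np.conj(y))), y

rng = np.random.default_rng(2)
W = np.eye(2, dtype=complex)          # Hermitian start, W(0) = I
Lam = np.eye(2)                       # Lambda chosen as the identity (assumed)
x = rng.standard_normal(2) + 1j * rng.standard_normal(2)
W1, y = learn_step(W, x, Lam, eta=0.01)
```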
[0125] Here k is the frame index. Block/segment processing can be used where the time data
is divided into blocks or frames of e.g. 256 points before each block or frame is
being transformed to the frequency domain.
[0126] To understand some further aspects of this solution, we will take as an example
a two microphone solution as illustrated in FIG. 3. We will start with a point noise
source s(ω), with transfer functions h1(ω) and h2(ω), a 2x2 matrix W and outputs y1(ω)
and y2(ω).
[0127] We initialize W(0, ω) = I, and for the moment do not update w11(ω) and w22(ω).
Further, we assume w11(k, ω) = w22(k, ω) = 1, for all k and ω. We frequently use
w21(k, ω) = w12*(k, ω).
[0128] If we look at the expected gradient E[y1(k, ω) y2*(k, ω)] we have:

E[y1(k, ω) y2*(k, ω)] = σs² (h1(ω) + w12(k, ω) h2(ω)) (w21(k, ω) h1(ω) + h2(ω))*     (Eq. 2)

For E[y1(k, ω) y2*(k, ω)] = 0 we have two solutions:

w12(ω) = -h1(ω)/h2(ω)

and

w12(ω) = -h2*(ω)/h1*(ω)
[0129] The amplitudes of the two solutions are the inverse of each other, so we will always
have a solution with amplitude smaller or equal to one.
[0130] It can be shown that with the adaptation approach above, we find the solution
with the smallest amplitude, and that this solution is stable, except for
|h1(ω)| = |h2(ω)|, when in (Eq. 2) both terms become zero. The same is true (with
|h1(ω)| < |h2(ω)|) when we choose

The righthand side of (Eq. 2) then can be written as:

and will become zero if

This solution is typically not stable.
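The convergence to the smallest-amplitude solution can be sketched with the expected (ensemble-averaged) gradient (Python; real transfer functions h1 = 0.5, h2 = 1, unit source variance, w11 = w22 = 1 and w21 = w12* are illustrative assumptions). The two zeros of the expected cross-correlation then lie at -h1/h2 = -0.5 and -h2/h1 = -2, amplitudes that are each other's inverse, and the iteration settles on -0.5:

```python
# Expected-gradient sketch of the two-microphone example (assumed values).
h1, h2 = 0.5, 1.0
eta = 0.2
w12 = 0.0

def expected_gradient(w):
    # E[y1 conj(y2)] = (h1 + w*h2) * conj(w21*h1 + h2), real h and w, w21 = w
    return (h1 + w * h2) * (w * h1 + h2)

for _ in range(2000):
    w12 = w12 - eta * expected_gradient(w12)
```

Starting from w12 = 0, the iteration moves monotonically to the smaller-amplitude zero and stays there.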
[0131] Stability issues may in some cases arise: in the case of a single point source,
the data correlation matrix R̃xx is not positive definite, but semi-positive definite
instead, with eigenvalues that are zero. Then it may not be appropriate to use the
simple local learning rule, which requires W(ω) to be positive definite. In the case
with one point source, R̃xx only contains one non-zero eigenvalue, which belongs to
the point source. If we have additional noise sources (both correlated and uncorrelated)
then R̃xx becomes positive definite and the simple learning rule can be applied.
[0132] In acoustic applications there will typically always be additional noise sources.
Firstly, we have microphone sensor noise in each microphone. These noises are uncorrelated
and have approximately equal variances. Secondly, we have remaining signal energy
due to the point noise source. To understand this, we revert to the time domain where
we described the microphone signal as

xi(k) = Σn hi(n) s(k - n)

and for the example with 2 microphones above we can write for y1(k):

y1(k) = x1(k) + Σm w12(m) x2(k - m)

where w12(m) is an FIR filter of length Nfir. After convergence we can write:

and rewrite y1(k) as:

where we split x1(k) into two parts: a part that can be dealt with by the decorrelator
and a tail or diffuse part that cannot be dealt with by the decorrelator (because of
the finite FIR length).
[0133] All microphone signals i will have a tail or diffuse part, generated in the
same way as the one for x1(k), with now hi(n) instead of h1(n).
[0134] If we look at the tail powers, then we can write for a white noise point source s:

[0135] The expected value of the tail power, where the expectation is now over all
possible source-microphone combinations in a room with a certain reverberation time
T60, is given by:

with K a constant and
Fs the sampling frequency.
[0136] The variance for a certain realization (impulse response between a certain source
- microphone position) will be high, but due to the summation the variance of

will be much lower. This means that as a first approximation we can assume that the
contributions have equal powers.
[0137] When we have, in addition to the point source, uncorrelated (sensor) noise at
the inputs, the off-diagonal filters in W(ω) (in our example w12(ω) and w21(ω)) cause
correlation of the noise at the outputs. As a result, the solution will be different
from the solution without uncorrelated noise at the inputs. It can be shown that the
optimal solution, when applying the learning rule of (Eq. 1) with i ≠ j, is given by:

where β is real and 0 < β ≤ 1.
[0138] β = 0 corresponds to the solution where there is no uncorrelated noise at the
inputs; β = 1 corresponds to the solution where there is only uncorrelated noise at
the inputs. This solution is stable, also for the case with |h1(ω)| = |h2(ω)|. The
reason is that with the added uncorrelated noise R̃xx will be positive definite and
as a result, also W will be positive definite, as is required to apply our learning
rule.
[0139] Now we can also change
w12(ω) and
w21(ω) without having stability problems.
[0140] It can be shown that when uncorrelated noise signals at the inputs have equal
variances, the outputs will also have equal variances. We know that sensor noise and
tail contributions can be considered as having, as a first approximation, equal variances.
This means that choosing w11(k, ω) = w22(k, ω) = 1 is a valid option. It means that
the update rule can be simplified to:

wij(k + 1, ω) = wij(k, ω) - η(k, ω) yi(k, ω) yj*(k, ω)

with i ≠ j.
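A deterministic sketch of this simplified rule (Python; the per-bin covariance model, a point source with unit variance plus weak uncorrelated sensor noise, and all numeric values are illustrative assumptions): driving the off-diagonal weight with the expected correlation measure zeroes the off-diagonal element of Ryy(ω) while w11 = w22 = 1 stay fixed:

```python
import numpy as np

# Assumed per-bin model: R_xx = h h^H + sigma_n^2 I (unit source variance).
h = np.array([0.5, 1.0])
R_xx = np.outer(h, h.conj()) + 0.01 * np.eye(2)

w12 = 0.0 + 0.0j
eta = 0.1
for _ in range(5000):
    W = np.array([[1.0, w12], [np.conj(w12), 1.0]])
    R_yy = W @ R_xx @ W.conj().T        # expected output covariance
    w12 = w12 - eta * R_yy[0, 1]        # simplified rule, i != j only
```

With these values the off-diagonal weight settles slightly short of -h1/h2 = -0.5 (the β effect of the added sensor noise), and the expected cross-correlation of the outputs vanishes.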
[0141] In use cases where the described decorrelator is combined with e.g. an adaptive beamformer
this has proven to be very robust. The zeroing of the off-diagonal components in
Ryy(ω) is often much more important than having equal elements in
Λ(ω). Note that when the tail contributions, which scale linearly with the source strength
s(ω), are dominant over the sensor noise, the solution is only determined by the acoustic
impulse responses and does not depend on the signal characteristics of the source.
Since the variations in the acoustic paths are much slower when compared with the
variations in the source spectrum or source amplitudes, it means that the SDC can
converge to a stable solution.
[0142] For the use case with a single point source and the tails dominating the sensor
noise, where we want to equalize the elements in Λ(ω) such that we can write
Λ(ω) = α(ω) I, with α(ω) not depending on s(ω), we can follow the following procedure:
First the average power is determined, weighted by the squared inverse of wii:

with Pyi(ω) the power of output i calculated as

Next an update of wii is calculated:

β is a small smoothing parameter, such that the updates of the off-diagonal weights
wij with i ≠ j can easily track the changes in wii.
[0143] In many embodiments, instead of choosing a pre-defined diagonal matrix Λ(ω),
the decorrelator may be initialized with the diagonal elements of W(ω) equal to one
and then the audio apparatus may proceed to not adapt these weights. In some embodiments,
we may want to write Λ(ω) = α(ω) I, with α(ω) independent from s(ω) to equalize the
outputs. The diagonal elements of W(ω) will not generally be equal to one then.
[0144] The adaptive SDC is most efficiently implemented in the frequency domain with
block processing. As a specific example, at a 16 kHz sampling frequency we may use
a block/segment size of 256 points. This means that also the decorrelation will be
over 256 points.
[0145] When filtering in the frequency domain using FFTs resulting in discrete sampled
frequencies, we have circular convolutions and circular correlations instead of linear
ones.
[0146] Another aspect to consider is that the applied filters are, by definition, a-causal
since wji(ω) = wij*(ω), which corresponds to a reversal in the time domain.
[0147] Regarding the circular versus linear convolution and correlations, techniques may
be used for implementing linear convolutions and correlations in the frequency domain,
using overlap-add or overlap-save techniques. An overlap-save technique for applying
an FIR filter with N=B coefficients may include the following processing:
- 1) The input signal x(n) is partitioned in blocks/segments of B samples and concatenated
with the previous block of B samples.
- 2) The resulting block of 2B samples is transformed to the frequency domain.
- 3) The impulse response of the FIR filter is extended with B zeroes, and is transformed
by an FFT of size 2B to the frequency domain.
- 4) The outputs of the two FFTs are bin-wise multiplied, after which a 2B IFFT is applied.
- 5) It can be shown that the first B samples contain circular convolution artifacts,
whereas the second B samples contain the result of only a linear convolution.
- 6) The second B samples are used as the output block.
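The steps 1-6 above can be sketched as follows (Python; the block handling details are assumptions for illustration): the output should equal the linear convolution of input and filter:

```python
import numpy as np

def overlap_save(x, h, B):
    """Overlap-save FIR filtering with N = B coefficients and 50% overlap:
    process 2B-blocks, keep the second B output samples of each block."""
    H = np.fft.fft(np.concatenate([h, np.zeros(2 * B - len(h))]))  # step 3
    prev = np.zeros(B)                       # previous input block
    out = []
    for start in range(0, len(x) - B + 1, B):
        block = np.concatenate([prev, x[start:start + B]])         # steps 1-2
        y = np.real(np.fft.ifft(np.fft.fft(block) * H))            # step 4
        out.append(y[B:])                    # steps 5-6: second B samples
        prev = x[start:start + B]
    return np.concatenate(out)
```

For a causal FIR filter of length B this reproduces the first len(x) samples of the linear convolution, without circular artifacts.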
[0148] The signal flow for such an example may be as illustrated in FIG. 4. The approach
may provide a linear convolution using an overlap-save technique.
[0149] We use here an overlap of 50% but other overlaps may of course be used. Suppose we
have a filter length N = 384 points. Then it is also possible to use a block size
B=128 points and an (I)FFT of size N+B = 512 points. At the input a frame of B samples
must be concatenated with 3 previous blocks (overlap 75%), giving a block of 512 points,
the weighting/filter vector is extended with 128 zeroes and the last 128 samples of
the 512 points output is selected as the output.
[0150] A linear correlation may be implemented similarly to the linear convolution as illustrated
in FIG. 5 for an overlap of 50%.
[0151] Now suppose we have an impulse response that is a-causal and has non-zero values
for |n| < N/2. In a circular representation this corresponds to a fundamental interval
with h̃(n) = h(n) and h̃(N - n) = h(-n).
[0152] Filtering with an a-causal filter in the frequency domain can then be performed as
follows (with B=N):
- 1) The input signal x(n) is partitioned in blocks/segments of B samples and concatenated
with the previous block of B samples.
- 2) The resulting block of 2B samples is transformed to the frequency domain.
- 3) Create a vector w with w(n) = h(n) for 0 ≤ n < N/2, w(n) = 0 for N/2 ≤ n ≤ 3N/2 and w(2N - n) = h(-n) for 0 < n < N/2 and transform it to the frequency domain with a 2N point FFT.
- 4) The outputs of the two FFTs are bin-wise multiplied, after which an M=2N IFFT is
applied.
- 5) It can be shown that now the first B/2 samples contain circular convolution artifacts,
as well as the last B/2 samples in the block. The B samples in the middle contain the
result of only a linear convolution.
- 6) The B points in the middle are used as the output block.
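The a-causal variant can be sketched as follows (Python, with B = N; the indexing details are assumptions for illustration): the filter taps for negative times are placed at the end of the 2N vector, and the middle B output samples of each block are kept. The output of each block then corresponds to a half-block delay, since an a-causal filter cannot be realized without latency:

```python
import numpy as np

def acausal_filter(x, h_pos, h_neg, N):
    """h_pos[m] = h(m) for 0 <= m < N/2; h_neg[m-1] = h(-m) for 1 <= m < N/2.
    Returns the linear a-causal convolution, delayed by N/2 samples."""
    w = np.zeros(2 * N)
    w[: N // 2] = h_pos                       # step 3: causal taps at the front
    for m in range(1, N // 2):
        w[2 * N - m] = h_neg[m - 1]           # step 3: a-causal taps at the end
    Wf = np.fft.fft(w)
    prev = np.zeros(N)
    out = []
    for start in range(0, len(x) - N + 1, N):
        block = np.concatenate([prev, x[start:start + N]])   # steps 1-2
        y = np.real(np.fft.ifft(np.fft.fft(block) * Wf))     # step 4
        out.append(y[N // 2: 3 * N // 2])     # steps 5-6: middle B samples
        prev = x[start:start + N]
    return np.concatenate(out)
```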
[0153] A corresponding signal flow is shown in FIG. 6.
[0154] Using such principles an adaptive spatial decorrelator can be developed with linear
filtering for the outputs and circular convolutions and correlations for updating
the coefficients.
[0155] We can define an nmics x nmics matrix W, where each entry wij consists of a
(complex) frequency domain vector of M/2+1 points (we assume real input signals such
that an input block of M samples gives M/2+1 unique frequency points). An element of
this vector is given by: wij(ω). W(ω) denotes, as before, an nmics x nmics matrix for
a certain frequency (bin) ω (the weight matrix W(ω)).
[0156] The main diagonal of W may consist of elements wii(ω) = 1 for all i and ω.
[0158] The update formula is applied for each ω:

wij(k + 1, ω) = wij(k, ω) - η(k, ω) yi(k, ω) yj*(k, ω), i ≠ j

where k denotes the frame index.
[0159] The steps 1-4 may be repeated for each frame/segment. An example of the operation
of Steps 1-3 (the filtering/decorrelation steps) is illustrated in FIG. 7.
[0160] In the specific approach, the filtering/weighted combination may thus include applying
a time domain windowing to the frequency representation provided by the weights of
a specific position in the weight matrix
W(ω) for different frequency bins. Thus, for a given input signal and output signal,
the weights for the different frequency bins form a frequency representation. The
audio signal generator 105 may be arranged to generate modified weights that are used
in the weighted combination/filtering by applying a time domain window.
[0161] The audio signal generator 105 may specifically take the frequency representation
for a given weight (position, for a given input and output signal pair) and convert
that to the time domain (e.g. using an FFT). The audio signal generator 105 may then
apply a window to the time domain representation to generate a modified time domain
representation. In many embodiments, the window may be such that a center portion
of the segment of the time domain representation is set to (substantially) zero. The
audio signal generator 105 may then convert the modified time domain representation
to the frequency domain to generate modified weights. The filtering of the input signals
to generate the output signals, i.e. the weighted combination, is based on the modified
weights. However, the weight matrix W(ω) that is adapted and stored for future segments
is unchanged. Thus, in some embodiments, the weight matrix W(ω) that is adapted is,
for the specific filtering, modified based on the time domain windowing. Thus, for
the filtering, the weight matrix W can be modified to W̃.
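This windowing can be sketched as follows (Python; the choice of zeroing the centre half of the segment is an assumption for illustration): the per-bin weights for one input/output pair are taken to the time domain, the centre portion is zeroed, and the result is transformed back:

```python
import numpy as np

def window_weights(w_bins):
    """Modified weights for filtering: zero the centre portion of the
    time-domain representation of the frequency response w_bins.
    The stored/adapted weights themselves are left unchanged."""
    M = len(w_bins)
    w_time = np.fft.ifft(w_bins)
    w_time[M // 4: 3 * M // 4] = 0.0   # assumed window: zero the centre half
    return np.fft.fft(w_time)
```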
[0162] With respect to the learning parameter η(k, ω), the adapter 107 may apply normalization
per frequency bin ω, such that the convergence speed does not depend on the (frequency)
amplitudes of the inputs. For each frequency ω the audio signal generator 105 can determine
the summed power of the Nmics output signals yi(ω) and smooth it with a first order
recursion with a time constant of 50 msec. If we call this power Py(k, ω), the audio
signal generator 105 can determine the update constant η(k, ω) = α/Py(k, ω), with α
typically smaller than 0.1.
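This normalization can be sketched as follows (Python; the mapping of the 50 msec time constant to a recursion coefficient at a 16 kHz sampling rate with 256-sample frames is an assumed choice):

```python
import numpy as np

fs = 16000.0
frame = 256                              # samples per frame/segment
tau = 0.050                              # smoothing time constant: 50 msec
a = np.exp(-frame / (fs * tau))          # first-order recursion coefficient
alpha = 0.05                             # alpha typically smaller than 0.1

def smooth_power(P_prev, y_bins):
    """Smooth the summed power of the output bin values in one bin."""
    return a * P_prev + (1.0 - a) * np.sum(np.abs(y_bins) ** 2)

def step_size(P_y):
    """eta(k, omega) = alpha / P_y(k, omega)."""
    return alpha / P_y
```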
[0163] Since wji(ω) = wij*(ω), a considerable memory and complexity reduction in the
update part can be obtained by storing and calculating only the values wij(ω) for j < i.
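A sketch of this storage reduction (Python; the data layout is an assumption for illustration): only wij for j < i is stored, and the full Hermitian matrix with unit diagonal is rebuilt on demand:

```python
import numpy as np

def full_matrix(lower, n):
    """Rebuild the Hermitian weight matrix for one bin from the stored
    strictly-lower-triangle values {(i, j): w_ij with j < i}; diagonal = 1."""
    W = np.eye(n, dtype=complex)
    for (i, j), w in lower.items():
        W[i, j] = w
        W[j, i] = np.conj(w)              # w_ji = conj(w_ij)
    return W
```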
[0164] Apart from the considerable complexity reduction, the implementation with circular
convolutions and correlations (instead of linear ones) in the update part is found
to be more robust for the cases where the input is semi-positive definite (mostly in
simulations).
In the following a detailed exemplary analysis is provided of the scenario of FIG.
3 with two microphone inputs and using the update rule in Equation 1.
[0165] We initialize W(0, ω) = I, and for the moment do not update w11(ω) and w22(ω). Further, we assume

, for all k and ω. We frequently use w21(k, ω) =

.
[0166] In the case of a single point noise source, without additional noise, we found that for

we have two solutions:

and

[0167] The amplitudes of the two solutions are the inverses of each other, so we will always have a solution with amplitude smaller than or equal to one.
[0168] From now on we assume |h2(ω)| > |h1(ω)|, without loss of generality, since otherwise we will start with w21(k, ω) instead of w12(k, ω). The case |h2(ω)| = |h1(ω)| will be discussed later.
[0169] If we look at the first iteration with w12(0, ω) = 0 we have:

[0170] For the expected weight update we have:

[0171] The update will be in the direction of the solution -h1(ω)/h2(ω); that is, it has the same phase, and with β = η|h2(ω)|² a small positive number we can write for y1(1, ω):

and for y2(1, ω):

[0172] Compared with y1(0, ω) and y2(0, ω), we only see that the amplitudes of h1(1, ω) and h2(1, ω) change when compared with h1(ω) and h2(ω). Note that the amplitude of h1(1, ω) decreases faster than the amplitude of h2(1, ω).
[0173] In the next iterations of the algorithm the amplitudes of h1(k, ω) and h2(k, ω) decrease further until, for a certain k, the expected gradient is zero and we have:

This solution is also a stable solution. Suppose we perturb the solution with dw (complex) and have (omitting the k index):

and

[0174] Then we have for y1(ω) and y2(ω):

and

[0175] The expected value of the gradient is (neglecting the higher order term (dw)²):

[0176] In the update we have

[0177] The update is in the opposite direction of dw for all dw (η(|h2(ω)|² - |h1(ω)|²) is positive), and thus the solution is stable.
[0178] Looking at the powers of y1(ω) and y2(ω) after convergence we have:

and

[0179] Depending on the ratio between |h1(ω)| and |h2(ω)|, the power of y2(ω) also decreases, but much less than the power of y1(ω).
[0180] This contradicts what we want to achieve, namely that the output powers are equal. In this scenario this is only possible if the output power of y2(ω) is also zero. This could be done by setting w22(ω) = h1(ω)/h2(ω). This change does not influence the update of w12(ω), but once w12(ω) has found its optimal value, the expected power of y2(ω) will also be zero. The solution is not stable, however, since if we perturb the solution as before with

we now get

and

[0181] For the expected value of the gradient:

[0182] Now we cannot neglect the higher order term (dw)² and have for the weight:

[0183] It cannot be guaranteed that

. For example, with dw = -αh1(ω)/h2(ω), where α is a small positive number,

.
[0184] The solution with the two expected powers equal to zero is not stable. The same is true for the case |h2(ω)| = |h1(ω)| (and w22(ω) = 1). Also then the expected values of the powers of the outputs after convergence will be zero; after perturbation the higher order term (dw)² cannot be neglected and we have:

which will not be stable for dw < 0.
[0185] Such stability issues may be due to the fact that for a single point source the data correlation matrix R̃xx is not positive definite, but semi-positive definite instead, with eigenvalues that are zero. Then we are not allowed to use the simple local learning rule, which requires W(ω) to be positive definite. In the case with one point source, R̃xx only contains one non-zero eigenvalue, which belongs to the point source. If we have additional noise sources (both correlated and uncorrelated) then R̃xx becomes positive definite and the simple learning rule can be applied.
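This rank argument can be checked numerically. In the sketch below, the transfer function values and the noise level are arbitrary illustrative choices; the point is only that a single point source yields a rank-one (semi-positive definite) spatial correlation matrix, while adding uncorrelated sensor noise makes it positive definite:

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 10000
# illustrative transfer functions h1(w), h2(w) for one frequency bin
h = np.array([0.8 + 0.1j, -0.5 + 0.6j])
s = rng.standard_normal(n_frames) + 1j * rng.standard_normal(n_frames)

# microphone spectra with only the point source: x = h * s
x_point = np.outer(h, s)
R_point = x_point @ x_point.conj().T / n_frames

# add uncorrelated sensor noise of equal variance in each microphone
noise = 0.1 * (rng.standard_normal((2, n_frames))
               + 1j * rng.standard_normal((2, n_frames)))
x_noisy = x_point + noise
R_noisy = x_noisy @ x_noisy.conj().T / n_frames

eig_point = np.linalg.eigvalsh(R_point)  # smallest eigenvalue ~ 0: semi-positive definite
eig_noisy = np.linalg.eigvalsh(R_noisy)  # all eigenvalues > 0: positive definite
```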
[0186] Assume that besides the single point noise source s(ω) we additionally have uncorrelated noise s1(ω) in the first microphone and s2(ω) in the second microphone with equal variances, i.e.

. This is typical for sensor noise.
[0187] If we look at the noise contributions of s1(ω) and s2(ω) in the outputs y1(ω) and y2(ω), then we have:

[0188] Because

, the powers of y1(ω) and y2(ω) will be equal, and for the expected gradient (only due to the noise contributions) we get:

[0189] In case the point noise source s(ω) were not present, we would have for the update:

[0190] Thus |w12(k + 1, ω)| and |w21(k + 1, ω)| decrease towards zero, which is desired for uncorrelated noise.
[0191] When the point noise source s(ω) is present, the optimal solution changes, since the gradient of the point noise source must compensate the gradient that is due to the uncorrelated noise in the inputs.
[0192] Since the gradient due to the noise is in the direction of w12(k, ω), the optimal solution will be in the direction of -h1(ω)/h2(ω), the optimal solution without noise (we still assume |h2(ω)| > |h1(ω)|).
[0193] Suppose the deviation from the optimal solution without noise is βh1(ω)/h2(ω). The gradient for the noise then is:

[0195] For 0 < β ≤ 1,

is positive and increases monotonically with increasing β, whereas

is also positive, but decreases monotonically and is zero for β = 1. This means there
is a solution where the expected gradient is zero, and the solution is given by

with 0 < β ≤ 1.
[0196] This solution is also stable. Suppose we perturb the optimal solution with dw (complex valued):

[0197] We then get for y1(ω):

and for y2(ω):

[0198] For the gradient we only need to consider the extra components due to dw and can neglect the higher order term (dw)²:

[0199] With 0 < β ≤ 1 the right-hand side is always positive, also for |h1(ω)| = |h2(ω)|, such that the update

is always opposite to dw, which guarantees stability.
[0200] The audio apparatus(es) may specifically be implemented in one or more suitably programmed processors. The different functional blocks may be implemented in separate processors and/or may, e.g., be implemented in the same processor. An example of a suitable processor is provided in the following.
[0201] FIG. 8 is a block diagram illustrating an example processor 800 according to embodiments
of the disclosure. Processor 800 may be used to implement one or more processors implementing
an apparatus as previously described or elements thereof. Processor 800 may be any
suitable processor type including, but not limited to, a microprocessor, a microcontroller,
a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA) where the FPGA has been programmed to form a processor, a Graphics Processing Unit (GPU), an Application
Specific Integrated Circuit (ASIC) where the ASIC has been designed to form a processor,
or a combination thereof.
[0202] The processor 800 may include one or more cores 802. The core 802 may include one
or more Arithmetic Logic Units (ALU) 804. In some embodiments, the core 802 may include
a Floating Point Logic Unit (FPLU) 806 and/or a Digital Signal Processing Unit (DSPU)
808 in addition to or instead of the ALU 804.
[0203] The processor 800 may include one or more registers 812 communicatively coupled to
the core 802. The registers 812 may be implemented using dedicated logic gate circuits
(e.g., flip-flops) and/or any memory technology. In some embodiments the registers
812 may be implemented using static memory. The registers 812 may provide data, instructions and addresses to the core 802.
[0204] In some embodiments, processor 800 may include one or more levels of cache memory
810 communicatively coupled to the core 802. The cache memory 810 may provide computer-readable
instructions to the core 802 for execution. The cache memory 810 may provide data
for processing by the core 802. In some embodiments, the computer-readable instructions
may have been provided to the cache memory 810 by a local memory, for example, local
memory attached to the external bus 816. The cache memory 810 may be implemented with
any suitable cache memory type, for example, Metal-Oxide Semiconductor (MOS) memory
such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), and/or
any other suitable memory technology.
[0205] The processor 800 may include a controller 814, which may control input to the processor
800 from other processors and/or components included in a system and/or outputs from
the processor 800 to other processors and/or components included in the system. Controller
814 may control the data paths in the ALU 804, FPLU 806 and/or DSPU 808. Controller
814 may be implemented as one or more state machines, data paths and/or dedicated
control logic. The gates of controller 814 may be implemented as standalone gates,
FPGA, ASIC or any other suitable technology.
[0206] The registers 812 and the cache 810 may communicate with controller 814 and core
802 via internal connections 820A, 820B, 820C and 820D. Internal connections may be
implemented as a bus, multiplexer, crossbar switch, and/or any other suitable connection
technology.
[0207] Inputs and outputs for the processor 800 may be provided via a bus 816, which may
include one or more conductive lines. The bus 816 may be communicatively coupled to
one or more components of processor 800, for example the controller 814, cache 810,
and/or register 812. The bus 816 may be coupled to one or more components of the system.
[0208] The bus 816 may be coupled to one or more external memories. The external memories
may include Read Only Memory (ROM) 832. ROM 832 may be a masked ROM, Electronically
Programmable Read Only Memory (EPROM) or any other suitable technology. The external
memory may include Random Access Memory (RAM) 833. RAM 833 may be a static RAM, battery
backed up static RAM, Dynamic RAM (DRAM) or any other suitable technology. The external
memory may include Electrically Erasable Programmable Read Only Memory (EEPROM) 835.
The external memory may include Flash memory 834. The external memory may include a magnetic storage device such as disc 836. In some embodiments, the external memories may be included in a system.
[0209] It will be appreciated that the approach may advantageously, in many embodiments, further include an adaptive canceller which is arranged to cancel a signal component of the beamformed audio output signal which is correlated with the at least one noise reference signal. For example, similarly to the example of FIG. 1, an adaptive filter may have the noise reference signal as an input, with the output being subtracted from the beamformed audio output signal. The adaptive filter may, e.g., be arranged to minimize the level of the resulting signal during time intervals where no speech is present.
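As one possible realization of such an adaptive canceller, the filter may use a normalized LMS update with adaptation frozen while speech is active. The NLMS rule, the tap count and the step size below are illustrative assumptions, since the description does not fix a particular adaptation algorithm:

```python
import numpy as np

def adaptive_canceller(beamformed, noise_ref, n_taps=16, mu=0.5, speech_active=None):
    """NLMS adaptive canceller: filters the noise reference signal and
    subtracts the result from the beamformed audio output signal.
    Adaptation runs only during intervals where no speech is present,
    so the filter converges on the correlated noise component.
    """
    w = np.zeros(n_taps)                      # adaptive filter taps
    out = np.zeros_like(beamformed)
    x_buf = np.zeros(n_taps)                  # delay line of the noise reference
    if speech_active is None:
        speech_active = np.zeros(len(beamformed), dtype=bool)
    for n in range(len(beamformed)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = noise_ref[n]
        e = beamformed[n] - w @ x_buf         # residual after cancellation
        out[n] = e
        if not speech_active[n]:              # adapt only when no speech is flagged
            w += mu * e * x_buf / (x_buf @ x_buf + 1e-8)
    return out
```

In use, a voice activity detector would supply the speech_active flags; the residual out then retains the speech while the noise component correlated with the reference is suppressed.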
[0210] It will be appreciated that the above description for clarity has described embodiments
of the invention with reference to different functional circuits, units and processors.
However, it will be apparent that any suitable distribution of functionality between
different functional circuits, units or processors may be used without detracting
from the invention. For example, functionality illustrated to be performed by separate
processors or controllers may be performed by the same processor or controllers. Hence,
references to specific functional units or circuits are only to be seen as references
to suitable means for providing the described functionality rather than indicative
of a strict logical or physical structure or organization.
[0211] The invention can be implemented in any suitable form including hardware, software,
firmware or any combination of these. The invention may optionally be implemented
at least partly as computer software running on one or more data processors and/or
digital signal processors. The elements and components of an embodiment of the invention
may be physically, functionally and logically implemented in any suitable way. Indeed,
the functionality may be implemented in a single unit, in a plurality of units or
as part of other functional units. As such, the invention may be implemented in a
single unit or may be physically and functionally distributed between different units,
circuits and processors.
[0212] Although the present invention has been described in connection with some embodiments,
it is not intended to be limited to the specific form set forth herein. Rather, the
scope of the present invention is limited only by the accompanying claims. Additionally,
although a feature may appear to be described in connection with particular embodiments,
one skilled in the art would recognize that various features of the described embodiments
may be combined in accordance with the invention. In the claims, the term comprising
does not exclude the presence of other elements or steps.
[0213] According to an aspect of the invention, the method of the claims and/or description
is a method excluding a method for performing mental acts as such.
[0214] Furthermore, although individually listed, a plurality of means, elements, circuits
or method steps may be implemented by, e.g., a single circuit, unit or processor.
Additionally, although individual features may be included in different claims, these
may possibly be advantageously combined, and the inclusion in different claims does
not imply that a combination of features is not feasible and/or advantageous. Also,
the inclusion of a feature in one category of claims does not imply a limitation to
this category but rather indicates that the feature is equally applicable to other
claim categories as appropriate. Furthermore, the order of features in the claims
do not imply any specific order in which the features must be worked and in particular
the order of individual steps in a method claim does not imply that the steps must
be performed in this order. Rather, the steps may be performed in any suitable order.
In addition, singular references do not exclude a plurality. Thus, references to "a",
"an", "first", "second" etc. do not preclude a plurality. Reference signs in the claims
are provided merely as a clarifying example shall not be construed as limiting the
scope of the claims in any way.