FIELD OF THE INVENTION
[0001] The invention relates to an apparatus and a method for generating an audio output
signal, and in particular, but not exclusively, to extracting audio from a wanted audio
source, such as a desired speaker.
BACKGROUND OF THE INVENTION
[0002] Capturing audio, and in particular speech, has become increasingly important in
the last decades. For example, capturing speech or other audio has become increasingly
important for a variety of applications including telecommunication, teleconferencing,
gaming, audio user interfaces, etc. However, a problem in many scenarios and applications
is that the desired audio source is typically not the only audio source in the environment.
Rather, in typical audio environments there are many other audio/noise sources which
are being captured by the microphone. Audio processing is often used to improve the
capture of audio, and in particular to post-process the captured audio to improve
the resulting audio signals.
[0003] In many embodiments, audio may be represented by a plurality of different audio signals
that reflect the same audio scene or environment. In particular, in many practical
applications, audio is captured by a plurality of microphones at different positions.
For example, a linear array of a plurality of microphones is often used to capture
audio in an environment, such as in a room. The use of multiple microphones allows
spatial information of the audio to be captured. Many different applications may exploit
such spatial information allowing improved and/or new services.
[0004] One frequently used approach is to try to separate audio sources by applying beamforming
to form beams directed towards the direction of arrival of audio from specific audio
sources. However, although this may provide advantageous performance in many scenarios,
it is not optimal in all cases. For example, it may not provide optimal source separation
in some cases, and indeed in some applications such spatial beamforming may not
provide audio properties that are ideal for further processing to achieve a given
effect.
[0005] Thus, whereas spatial audio source separation, and specifically such separation based
on audio beamforming, is highly advantageous in many scenarios and applications, there
is a desire to improve the performance and operation of such approaches. However,
there is typically also a desire for low complexity and/or resource usage (e.g. computational
resources) and often these preferences conflict with each other.
[0006] Hence, an improved approach would be advantageous, and in particular an approach
allowing reduced complexity, increased flexibility, facilitated implementation, reduced
cost, improved audio capture, improved spatial perception/differentiation of audio
sources, improved audio source separation, improved audio/speech application support,
reduced dependency on known or static acoustic properties, improved flexibility and
customization to different audio environments and scenarios, improved audio beamforming,
an improved trade-off between performance and complexity/resource usage, and/or improved
performance.
SUMMARY OF THE INVENTION
[0007] Accordingly, the invention seeks to preferably mitigate, alleviate or eliminate one
or more of the above-mentioned disadvantages singly or in any combination.
[0008] According to an aspect of the invention there is provided an audio apparatus comprising:
a receiver arranged to receive a first set of audio signals, the first set of audio
signals comprising audio signals capturing audio of a scene from different positions;
an audio beamformer comprising: a first set of filters arranged to filter the first
set of audio signals, and a combiner arranged to combine outputs of the first set
of filters to generate a beamform output audio signal; a feedback circuit comprising
a second set of filters arranged to generate a second set of audio signals from a
filtering of the beamform output audio signal, each filter of the second set of filters
having a frequency response being a complex conjugate of a filter of the first set
of filters; a beamform adapter arranged to adapt the first set of filters and the
second set of filters in response to a comparison of the first set of audio signals
and the second set of audio signals; an adaptive coefficient processor arranged to
determine decorrelation coefficients for a set of spatial decorrelation filters generating
decorrelated output signals from the first set of audio signals, the adaptive coefficient
processor being arranged to adapt the decorrelation coefficients in response to an
update value determined from the first set of audio signals; a first set of spatial
filters arranged to apply a first spatial filtering to at least one of the first set
of audio signals and the second set of audio signals, the first set of spatial filters
having coefficients determined from the decorrelation coefficients.
[0009] The approach may provide improved operation and/or performance in many embodiments.
It may in particular allow improved beamforming to focus on a specific audio source
in the scene. It may allow improved extraction/separation of audio from a specific
source in the presence of other audio and noise sources in the scene.
[0010] The approach may allow the operation of the audio apparatus to effectively adapt
to the current conditions, and in particular the acoustic and spatial properties of
the audio sources and the scene. It may provide reduced sensitivity to noise and unwanted
audio sources in the scene and captured audio.
[0011] The approach may allow efficient operation and low complexity in many embodiments.
The different adaptations may synergistically interwork to provide improved separation
of a desired audio source from other captured audio of the scene. Further, the use
of multiple adaptations may provide an improved operation while allowing lower complexity
adaptation algorithms and criteria to be used.
[0012] Each filter of the first set of filters is linked with a filter of the second set
of filters, the linked filter having a frequency response that is the complex conjugate
of the frequency response of the filter of the first set of filters.
[0013] The first and second set of filters may have an equal number of linked paired filters
having complex conjugate frequency responses. For each audio signal in the first set
of audio signals there may be one filter of the first set of filters, one linked/paired
filter in the second set of filters (with a complex conjugate frequency response),
and one audio signal in the second set of audio signals. The comparison may be a comparison
between linked/paired audio signals of the first set of audio signals and the second
set of audio signals. The adaptation of a given filter of the first set of filters
and a given linked filter of the second set of filters may be in response/dependence
on (possibly only) a comparison of a signal of the first set of audio signals filtered
by the given filter of the first set of filters and a signal of the second set of
audio signals generated by filtering of the beamform audio signal by the given linked
filter of the second set of filters. The adaptation may be such that the difference
is reduced.
[0014] Each of the second set of audio signals may be an estimate of the contribution to
the linked audio signal of the first set of audio signals from the audio captured
by the beamform output audio signal, and thus typically may be an estimate of the
contribution from a wanted/desired source.
[0015] The set of spatial decorrelation filters may include one spatial filter for each
signal of the first set of signals. Each filter of the set of spatial decorrelation
filters may generate a filtered/modified version of one audio signal of the first
set of audio signals. The set of spatial decorrelation filters may together generate
a modified first set of audio signals with a higher degree of decorrelation. The spatial
decorrelation filters may perform filtering over the first set of audio signals. The
output of a decorrelation filter (the corresponding modified audio signal of the first
set of audio signals) may be dependent on a plurality of the (unmodified) first set
of audio signals. The spatial decorrelation filters may specifically be frequency
domain filters and the decorrelation coefficients may be frequency domain coefficients.
For a given spatial decorrelation filter, the output value for a given frequency bin
at a given time may be a weighted combination of a plurality of values of the (unmodified)
first set of audio signals for the given frequency bin at the given time. The decorrelated
output signals may have a reduced normalized cross channel signal correlation with
respect to the first set of input signals to the set of spatial decorrelation filters.
[0016] The set of spatial filters may include one spatial filter for each signal of the
first set of signals/second set of signals. Each filter of the set of spatial filters
may generate a filtered/modified version of one audio signal of the first or second
set of audio signals. The set of spatial filters may together generate a modified
first or second set of audio signals. The spatial filters may perform filtering over
the first set of audio signals. The output of a spatial filter (the corresponding
modified audio signal of the first or second set of audio signals) may be dependent
on a plurality of the (unmodified) first or second set of audio signals. The spatial
filters may specifically be frequency domain filters and the coefficients may be frequency
domain coefficients. For a given spatial filter, the output value for a given frequency
bin at a given time may be a weighted combination of a plurality of values of the
(unmodified) first or second set of audio signals for the given frequency bin at the
given time.
[0017] In accordance with an optional feature of the invention, the audio apparatus further
comprises an audio detector arranged to determine a set of active time intervals during
which an audio source is active and a set of inactive time intervals during which
the audio source is not active; and wherein at least one of the adaptation of the first
set of filters and the second set of filters and the adaptation of the decorrelation
coefficients is different for the set of active time intervals and the set of inactive
time intervals.
[0018] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation which typically may lead to improved extraction
of a desired audio source.
[0019] The active time intervals may be desired/wanted/speech audio source active time intervals
and the inactive time intervals may be desired/wanted/speech audio source inactive
time intervals.
[0020] In accordance with an optional feature of the invention, the beamform adapter is
arranged to adapt the first set of filters and the second set of filters with a higher
rate of adaptation during the set of active time intervals than during the set of
inactive time intervals.
[0021] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation which typically may lead to improved extraction
of a desired audio source.
[0022] In accordance with an optional feature of the invention, the beamform adapter is
arranged to adapt the first set of filters and the second set of filters during only
one set of time intervals of the set of active time intervals and the set of inactive
time intervals.
[0023] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation which typically may lead to improved extraction
of a desired audio source.
[0024] In some embodiments, the beamform adapter is arranged to adapt the first set of filters
and the second set of filters during the set of active time intervals but not during
the set of inactive time intervals.
[0025] In accordance with an optional feature of the invention, the adaptive coefficient
processor is arranged to adapt the decorrelation coefficients with a higher rate of
adaptation during the set of inactive time intervals than during the set of active
time intervals.
[0026] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation which typically may lead to improved extraction
of a desired audio source.
[0027] In accordance with an optional feature of the invention, the adaptive coefficient
processor is arranged to adapt the decorrelation coefficients during only one set
of time intervals of the set of inactive time intervals and the set of active time
intervals.
[0028] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation which typically may lead to improved extraction
of a desired audio source.
[0029] In some embodiments, the adaptive coefficient processor is arranged to adapt the
decorrelation coefficients during the set of inactive time intervals but not during
the set of active time intervals.
[0030] In accordance with an optional feature of the invention, the first set of spatial
filters is arranged to filter the first set of audio signals.
[0031] This may provide improved performance and/or operation in many embodiments. In many
embodiments and scenarios, this may provide particularly attractive performance and/or
implementation.
[0032] In accordance with an optional feature of the invention, the first set of spatial
filters is arranged to have coefficients set to the decorrelation coefficients determined
for the set of spatial decorrelation filters.
[0033] This may provide improved performance and/or operation in many embodiments. In many
embodiments and scenarios, this may provide particularly attractive performance and/or
implementation.
[0034] In accordance with an optional feature of the invention, the first set of filters
is arranged to filter the first set of audio signals after filtering by the first set
of spatial filters, and the beamform adapter is arranged to perform the comparison
using the first set of audio signals before filtering by the first set of spatial
filters.
[0035] This may provide improved performance and/or operation in many embodiments. In many
embodiments and scenarios, this may provide particularly attractive performance and/or
implementation.
[0036] In accordance with an optional feature of the invention, the first set of
spatial filters is arranged to have coefficients matching coefficients of a spatial filter being
a cascade of two of the set of decorrelation filters.
[0037] This may provide improved performance and/or operation in many embodiments. In many
embodiments and scenarios, this may provide particularly attractive performance and/or
implementation.
[0038] In accordance with an optional feature of the invention, the first set of spatial
filters is arranged to filter the second set of audio signals.
[0039] This may provide improved performance and/or operation in many embodiments. In many
embodiments and scenarios, this may provide particularly attractive performance and/or
implementation.
[0040] In accordance with an optional feature of the invention, the first set of spatial
filters is arranged to have coefficients determined in response to a set of inverse
spatial decorrelation filters, the set of inverse spatial decorrelation filters being
inverse filters of the set of spatial decorrelation filters.
[0041] This may provide improved performance and/or operation in many embodiments. In many
embodiments and scenarios, this may provide particularly attractive performance and/or
implementation.
[0042] In accordance with an optional feature of the invention, the first set of spatial
filters is arranged to have coefficients matching coefficients of a spatial filter
being a cascade of two sets of spatial inverse filters each of which comprises inverse
filters of the set of spatial decorrelation filters.
[0043] This may provide improved performance and/or operation in many embodiments. In many
embodiments and scenarios, this may provide particularly attractive performance and/or
implementation.
[0044] In accordance with an optional feature of the invention, the adaptive coefficient
processor is arranged to determine the set of spatial decorrelation filters to generate
a set of output audio signals, each output audio signal of the set of output signals
being linked with one input audio signal of the first set of audio signals,
by performing the steps of: segmenting the first set of audio signals into time segments,
and for at least some time segments performing the steps of: generating a frequency
bin representation of the first set of audio signals, each frequency bin of the frequency
bin representation of the first set of audio signals comprising a frequency bin value
for each of the audio signals of the first set of audio signals; generating a frequency
bin representation of a set of output signals, each frequency bin of the frequency
bin representation of a set of output signals comprising a frequency bin value for
each of the output signals, the frequency bin value for a given output signal of the
set of output signals for a given frequency bin being generated as a weighted combination
of frequency bin values of the first set of audio signals for the given frequency
bin, the weighted combination having the decorrelation coefficients as weights; updating
a first weight for a contribution to a first frequency bin value of a first frequency
bin for a first output signal linked with a first input audio signal from a second
frequency bin value of the first frequency bin for a second input audio signal linked
to a second output signal in response to a correlation measure between a first previous
frequency bin value of the first output signal for the first frequency bin and a second
previous frequency bin value of the second output signal for the first frequency bin.
[0045] This may provide improved performance and/or operation in many embodiments. In many
embodiments and scenarios, this may provide particularly attractive performance and/or
implementation.
[0046] This may provide an advantageous generation of output audio signals with typically
increased decorrelation in comparison to the input signals. The approach may provide
an efficient adaptation of the operation resulting in improved decorrelation in many
embodiments. The adaptation may typically be implemented with low complexity and/or
resource usage. The approach may specifically apply a local adaptation of individual
weights yet achieve an efficient global adaptation.
[0047] The generation of the set of output signals may be adapted to provide increased decorrelation
relative to the input signals which in many embodiments and for many applications
may provide improved audio processing and in particular beamforming.
[0048] The first and second output audio signals may typically be different output audio
signals.
[0049] In accordance with an optional feature of the invention, the adaptive coefficient
processor is arranged to update the first weight in response to a product of a first
value and a second value, the first value being one of the first previous frequency
bin value and the second previous frequency bin value and the second value being a
complex conjugate of the other of the first previous frequency bin value and the second
previous frequency bin value.
[0050] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios.
[0051] In some embodiments, the audio apparatus may be arranged to update a second weight
being for a contribution to the first frequency bin value from a third frequency bin
value being a frequency bin value of the first frequency bin for the first input audio
signal in response to a magnitude of the first previous frequency bin value.
[0052] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios. It may in particular provide an improved adaptation
of the generated output signals. In many embodiments, the updating of the weight reflecting
the contribution to an output signal from the linked input signal may be dependent
on the signal magnitude/amplitude of that linked input signal. For example, the updating
may seek to compensate the weight for the level of the input signal to generate a
normalized output signal.
[0053] The approach may allow a normalization/signal compensation/level compensation to
provide e.g. a desired output level.
[0054] In some embodiments, the audio apparatus may be arranged to set a weight for a contribution
to the first frequency bin value from a third frequency bin value being a frequency
bin value of the first frequency bin for the first input audio signal to a predetermined
value.
[0055] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios. It may in many embodiments provide improved adaptation
while ensuring convergence of the adaptation towards a non-zero signal level. It may
interwork very efficiently with the adaptation of weights that are not for linked
signal pairs.
[0056] In many embodiments, the adapter may be arranged to keep the weight constant,
with no adaptation or updating of the weight.
[0057] In some embodiments, the audio apparatus may be arranged to constrain a weight for
a contribution to the first frequency bin value from a third frequency bin value being
a frequency bin value of the first frequency bin for the first input audio signal
to be a real value.
[0058] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios.
[0059] The weights between linked input/output signals may advantageously be determined
as/constrained to real-valued weights. This may lead to improved performance and
adaptation, ensuring convergence on a non-zero level solution.
[0060] In some embodiments, the audio apparatus may be arranged to set a second weight being
a weight for a contribution to a fourth frequency bin value of the first frequency
bin for the second output audio signal from the first input audio signal to be a complex
conjugate of the first weight.
[0061] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios.
[0062] The two weights for two pairs of input/output signals may be complex conjugates of
each other in many embodiments.
[0063] In some embodiments, weights of the weighted combination for other input audio signals
than the first input audio signal are complex valued weights.
[0064] This may provide improved performance and/or operation in many embodiments. The use
of complex values for weights for non-linked input signals provides an improved frequency
domain operation.
[0065] In some embodiments, the audio apparatus may be arranged to determine output bin
values for the given frequency bin ω from:

$$y(\omega) = W(\omega)\,x(\omega)$$

where $y(\omega)$ is a vector comprising the frequency bin values for the output audio signals for
the given frequency bin ω; $x(\omega)$ is a vector comprising the frequency bin values for the
input audio signals for the given frequency bin ω; and $W(\omega)$ is a matrix having rows
comprising weights of a weighted combination for the output audio signals.
[0066] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios.
[0067] The matrix $W(\omega)$ may advantageously be Hermitian. In many embodiments, the diagonal
of the matrix $W(\omega)$ may be constrained to be real values, may be set to a predetermined
value(s), and/or may not be updated/adapted but may be maintained as a fixed value. The
weights/coefficients outside the diagonal may generally be complex values.
[0068] In some embodiments, the audio apparatus may be arranged to adapt weights $w_{ij}$ of
the matrix $W(\omega)$ according to:

$$w_{ij}(k+1,\omega) = w_{ij}(k,\omega) - \eta(k,\omega)\,y_i(k,\omega)\,y_j^*(k,\omega)$$

where i is a row index of the matrix $W(\omega)$, j is a column index of the matrix $W(\omega)$,
k is a time segment index, ω represents the frequency bin, and $\eta(k,\omega)$ is a scaling
parameter for adapting an adaptation speed.
[0069] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios.
[0070] In some embodiments, the audio apparatus may be arranged to compensate the correlation
value for a signal level of the first frequency bin.
[0071] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios. It may allow a compensation of the update rate for
signal variations.
[0072] In some embodiments, the audio apparatus may be arranged to initialize the weights
for the weighted combination to comprise at least one zero value weight and one non-zero
value weight.
[0073] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios. It may allow a more efficient and/or quicker adaptation
and convergence towards advantageous decorrelation. In many embodiments, the matrix
$W(\omega)$ may be initialized with zero values for the weights or coefficients for non-linked
signals and fixed non-zero real values for linked signals. Typically, the weights on the
diagonal may be set to e.g. 1 and all other weights may initially be set to zero.
[0074] In some embodiments, the weighted combination comprises applying a time domain windowing
to a frequency representation of weights formed by weights for the first input audio
signal and the second input audio signal for different frequency bins.
[0075] This may provide improved performance and/or operation in many embodiments. It may
typically provide improved adaptation leading to increased decorrelation of the output
audio signals in many scenarios.
[0076] Applying the time domain windowing to the frequency representation of weights may
comprise: converting the frequency representation of weights to a time domain representation
of weights; applying a window to the time domain representation to generate a modified
time domain representation; and converting the modified time domain representation
to the frequency domain.
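As a purely illustrative sketch of these three steps, the fragment below windows the time domain equivalent of the weights for one input/output signal pair across all frequency bins; the number of retained taps and the Hann-based window shape are assumptions, not taken from the text.

```python
import numpy as np

def window_weight_response(w_bins, keep_taps=32):
    """Constrain a per-bin weight trajectory by time domain windowing.

    w_bins: complex array, one weight per frequency bin for one signal pair.
    """
    taps = np.fft.ifft(w_bins)                # step 1: frequency -> time domain
    window = np.zeros(len(taps))
    window[:keep_taps] = np.hanning(2 * keep_taps)[keep_taps:]  # assumed shape
    modified = taps * window                  # step 2: window the representation
    return np.fft.fft(modified)               # step 3: back to frequency domain
```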
[0077] According to an aspect of the invention, there is provided a method of generating
an output audio signal, the method comprising: receiving a first set of audio signals,
the first set of audio signals comprising audio signals capturing audio of a scene
from different positions; an audio beamformer generating an output signal by: a first
set of filters filtering the first set of audio signals, and a combiner combining
outputs of the first set of filters to generate an output audio signal; a second set
of filters generating a second set of audio signals from a filtering of the output
audio signal, each filter of the second set of filters having a frequency response
being a complex conjugate of a filter of the first set of filters; adapting the first
set of filters and the second set of filters in response to a comparison of the first
set of audio signals and the second set of audio signals; determining decorrelation
coefficients for a set of spatial decorrelation filters generating decorrelated output
signals from the first set of audio signals including adapting the decorrelation coefficients
in response to an update value determined from the first set of audio signals; a first
set of spatial filters applying a first spatial filtering to at least one of the first
set of audio signals and the second set of audio signals, the first set of spatial
filters having coefficients determined from the decorrelation coefficients.
[0078] These and other aspects, features and advantages of the invention will be apparent
from and elucidated with reference to the embodiment(s) described hereinafter.
BRIEF DESCRIPTION OF THE DRAWINGS
[0079] Embodiments of the invention will be described, by way of example only, with reference
to the drawings, in which
FIG. 1 illustrates an example of a beamformer;
FIG. 2 illustrates an example of an audio apparatus in accordance with some embodiments
of the invention;
FIG. 3 illustrates an example of an audio apparatus in accordance with some embodiments
of the invention;
FIG. 4 illustrates an example of an audio apparatus in accordance with some embodiments
of the invention;
FIG. 5 illustrates an example of an audio apparatus in accordance with some embodiments
of the invention; and
FIG. 6 illustrates some elements of a possible arrangement of a processor for implementing
elements of an audio apparatus in accordance with some embodiments of the invention.
DETAILED DESCRIPTION OF SOME EMBODIMENTS OF THE INVENTION
[0080] The following description focuses on embodiments of the invention applicable to audio
capturing such as e.g., speech capturing for a teleconferencing apparatus. However,
it will be appreciated that the approach is applicable to many other audio signals,
audio processing systems, and scenarios for capturing and/or processing audio.
[0081] FIG. 1 illustrates an example of an adaptive audio beamform arrangement. For clarity,
the description focuses on an example where the audio apparatus is arranged to process
two audio input signals, but it will be appreciated that the example readily extends
to more input signals being processed.
[0082] The arrangement is arranged to generate a first set of input signals which comprises
audio signals that capture audio of a scene from different positions. The first set
of audio signals may specifically be audio signals generated from a set of microphone
signals from a set of microphones being at different positions in an audio environment,
such as a microphone array, and in many cases specifically a linear set of microphones.
[0083] The arrangement includes a beamformer 101 which is arranged to receive a set of a
plurality of audio signals from which a single signal is generated. The beamformer
101 seeks to combine the input signals such that the contributions from a given audio
source are combined constructively. The beamformer 101 specifically combines the signals
to perform a beamforming for the set of spatial audio signals capturing audio in an
environment, such as for a set of signals from a microphone array. The beamformer
101 is arranged to generate a beamform output audio signal from the first set of audio
signals.
[0084] The beamformer 101 specifically comprises a first set of filters

$$f_1^*(\omega), \ldots, f_M^*(\omega)$$

which filter the first set of audio signals. Each filter is arranged to filter one
of the audio signals. Further, the filters are adaptive filters that are dynamically
adapted to achieve an adaptive beamforming. The first set of filters will henceforth
also be referred to as beamformer filters.
[0085] The audio arrangement further comprises a feedback circuit 103 which comprises a
second set of filters which generate a second set of audio signals from a filtering
of the beamform output audio signal. The second set of filters will also be referred
to as feedback filters. The input signals to the beamformer may also be referred to
as beamformer input signals and the signals generated by the feedback circuit may
also be referred to as feedback signals.
[0086] The first and second set of filters are specifically matched/linked such that for
each filter of the first set (i.e. for each beamformer filter) there is a filter of
the second set (i.e. a feedback filter) which has a frequency response that is the
complex conjugate of the corresponding beamformer filter. Thus, the filters of the
beamformer 101 and the feedback circuit 103 are adapted together such that the filter
coefficients provide a complex conjugate frequency response (equivalent/corresponding
to time reversed filter impulse responses).
[0087] Thus, from the beamform output audio signal, the feedback circuit 103 generates a
set of feedback signals with each feedback signal being generated from the beamform
output audio signal by a time reversed/complex conjugate filtering by matching filters.
[0088] The audio arrangement further comprises a beamform adapter 105 arranged to adapt
the first set of filters and the second set of filters, i.e. the beamformer filters
and the feedback filters. The beamform adapter 105 is arranged to perform the adaptation
based on a comparison of the first and the second set of audio signals, and specifically
based on the comparison of the beamformer signals and the feedback signals. Specifically,
for each matching/linked beamformer signal and feedback signal (signals for which
the beamformer filter and the feedback filter have complex conjugate frequency responses),
a difference measure may be determined, and the filters of the matching filters may
be adapted to reduce/minimize the difference measure.
[0089] The arrangement of the beamformer 101, feedback circuit 103, and beamform adapter
105 may interact to provide a highly efficient adaptive beamforming which for a set
of spatial audio signals may adapt the filters to form a beam towards an audio source
with the beamform output audio signal providing the audio captured in the formed beam.
The adaptation of the filters to minimize the difference measures allows the beamformer
arrangement to detect and track a given audio source. Specifically, in the ideal case,
each of the beamformer filters concentrates all of the energy from a given audio source
that is captured in one of the beamformer signals into a single signal value. In the
ideal case, this is achieved by the beamform filter having an impulse response that
is the time inverse of the acoustic impulse response from the audio source to the
microphone capturing the signal. In the ideal case, this is achieved for all the signals/filters
resulting in all the captured energy from the specific audio source being combined
into a single value by a summation of the outputs of the beamformer filters thereby
generating a sample value for the beamform output audio signal that maximizes the
energy captured from the audio source.
[0090] Further, as each of the feedback filters has a frequency response that is the complex
conjugate of that of the corresponding beamformer filter, it corresponds in the ideal case
to the acoustic impulse response from the audio source to the microphone (signal). Accordingly, by
filtering the beamform output audio signal by a feedback filter, a feedback signal
is generated which reflects the signal that would be captured at the microphone from
the audio source as represented by the beamform output audio signal. In the ideal
case with only one audio source present and the filters being ideally adapted, the
difference between the generated feedback signal and the captured microphone signal
is zero. In the presence of other audio sources and/or noise, there may be a difference
which is reflected in the difference measure. However, for uncorrelated noise/audio,
the difference, and thus the difference measure, is typically zero on average
and accordingly will tend not to prevent efficient adaptation. Thus, by adapting the
filters for a given beamformer signal to minimize the difference measure to the corresponding
feedback signal, an adaptation towards the optimum beamforming filters can typically
be achieved.
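A minimal per-segment sketch of this filter part and update part may look as follows (Python/numpy). The frequency-domain NLMS-style update and the step size `mu` are illustrative assumptions; the defining property taken from the text is that the filter part applies the conjugates of the update-part filters.

```python
import numpy as np

def beamform_segment(U, f, mu=0.05, adapt=True):
    """One frequency-domain segment of the FIG. 1 style structure.

    U : (M, B) complex array, one row of FFT bins per microphone signal.
    f : (M, B) complex array, adaptive filters of the update part; their
        conjugates act as the beamformer filters of the filter part.
    Returns the beamform output bins z (shape (B,)) and the updated filters.
    """
    # Filter part: filter each microphone signal with conj(f_m) and sum.
    z = np.sum(np.conj(f) * U, axis=0)
    if adapt:
        # Feedback part: each feedback filter f_m refilters z; the residual
        # against the microphone signal drives the update of the filter pair.
        residual = U - f * z                        # x_m = u_m - f_m * z
        norm = np.abs(z) ** 2 + 1e-12               # per-bin power (assumed NLMS)
        f = f + mu * residual * np.conj(z) / norm   # reduce the difference measure
    return z, f
```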
[0091] Such a beamforming approach is described in
US 7 146 012. In the following a more detailed description and analysis of the arrangement will
be provided. This description will be based on the example of FIG. 1 where the adaptive
beamformer arrangement consists of a filter part where the microphone signals are
filtered by two filters

and

and the outputs are summed to obtain the output signal
z(ω). In the update part the output signal
z(ω) is fed to two adaptive filters
f1(ω) and
f2(ω) that have the respective microphone signals as a reference.
[0092] With this configuration we have an adaptive beamformer that maximizes the output
power under the constraint $|f_1(\omega)|^2 + |f_2(\omega)|^2 = 1$. The constraint is implied by
the configuration, specifically because the conjugates of the filters $f_1(\omega)$ and
$f_2(\omega)$ in the update part are copied to the filter part. The adaptive filters in the
update part are "normal" unconstrained adaptive filters.
[0093] That the constraint is implied by the configuration can be seen if we look at the
optimal solution, that is when both residual signals are zero. Suppose we have a speech
signal $s(\omega)$ and transfer functions $h_1(\omega)$ and $h_2(\omega)$ from the speech source to the
microphones. The microphone signals are then given by: $u_1(\omega) = h_1(\omega)\,s(\omega)$ and
$u_2(\omega) = h_2(\omega)\,s(\omega)$.
[0094] The output signal $z(\omega)$ then is:

$$z(\omega) = f_1^*(\omega)\,u_1(\omega) + f_2^*(\omega)\,u_2(\omega)$$

[0095] In the update part we get after convergence two equations, where we omit ω for convenience:

$$u_1 = f_1\,z \qquad (1)$$
$$u_2 = f_2\,z \qquad (2)$$

[0096] From (1) and (2) follows (3):

$$\frac{f_1}{f_2} = \frac{u_1}{u_2} = \frac{h_1}{h_2} \qquad (3)$$

[0098] Similarly we get:

$$z = f_1^* f_1\,z + f_2^* f_2\,z = \left(|f_1|^2 + |f_2|^2\right) z$$

fulfilling the constraint $|f_1|^2 + |f_2|^2 = 1$.
[0099] The solution for $f_1$ is given by:

$$f_1 = a_1\,\frac{h_1}{\sqrt{|h_1|^2 + |h_2|^2}}$$

where $a_1$ is an allpass, i.e. $|a_1| = 1$, phase undetermined.
[0100] Similarly for $f_2$:

$$f_2 = a_2\,\frac{h_2}{\sqrt{|h_1|^2 + |h_2|^2}}$$

[0101] Since (3) still must hold, $a_1 = a_2 = a$, the so-called common all-pass term.
[0102] More generally, and using matrix notation, with

$$f(\omega) = \left[f_1(\omega), \ldots, f_M(\omega)\right]^T$$

where $f_m(\omega)$ is the beamformer filter corresponding to the $m$th microphone and

$$h(\omega) = \left[h_1(\omega), \ldots, h_M(\omega)\right]^T$$

with $h_m(\omega)$ the transfer function from the source $s(\omega)$ to the $m$th microphone, we
get for the output $z(\omega)$:

$$z(\omega) = f^H(\omega)\,h(\omega)\,s(\omega) \qquad (4)$$

[0103] In the update part, after convergence we have the equality:

$$u(\omega) = f(\omega)\,z(\omega) \qquad (5)$$

which leads to (since $z(\omega)$ and $s(\omega)$ are scalars):

$$f(\omega) = a(\omega)\,h(\omega) \qquad (6)$$

where $a(\omega)$ is a complex scalar.
[0104] Substituting (4) into (5), using (6), gives:

$$|a(\omega)|^2\,h^H(\omega)\,h(\omega) = 1 \qquad (7)$$

with $a(\omega)$ containing the common allpass term, its phase being undetermined.
[0105] Substituting (7) into (6) finally gives the solution after convergence (8):

$$f(\omega) = a(\omega)\,\frac{h(\omega)}{\sqrt{h^H(\omega)\,h(\omega)}}, \qquad |a(\omega)| = 1 \qquad (8)$$

where $(\cdot)^H$ denotes complex conjugate transpose.
[0106] For the constraint we then get:

$$f^H(\omega)\,f(\omega) = \frac{a^*(\omega)\,a(\omega)\,h^H(\omega)\,h(\omega)}{h^H(\omega)\,h(\omega)} = 1$$
[0107] To understand that the solution given by Eq. 8 is also the optimal solution that
maximizes the output power, we look at the expected value of the update after convergence,
which consists of a correlation between the residual signals ($x_1(\omega)$ and $x_2(\omega)$ in
FIG. 1) and the conjugate of the input signal of the adaptive filters:

$$E\left\{x(\omega)\,z^*(\omega)\right\} = 0$$

with

$$x(\omega) = u(\omega) - f(\omega)\,z(\omega), \qquad u(\omega) = \left[u_1(\omega), \ldots, u_M(\omega)\right]^T$$

where $u_m(\omega)$ is the microphone signal of the $m$th microphone.
[0108] This is equivalent to:

$$E\left\{u(\omega)\,z^*(\omega)\right\} = f(\omega)\,E\left\{z(\omega)\,z^*(\omega)\right\}$$

[0109] Using $z(\omega) = u^T(\omega)\,f^*(\omega)$ and thus $z^*(\omega) = u^H(\omega)\,f(\omega)$ we get:

$$R_{uu}(\omega)\,f(\omega) = f(\omega)\left(f^H(\omega)\,R_{uu}(\omega)\,f(\omega)\right)$$

with $R_{uu}(\omega)$ the input covariance matrix. From this we learn that $f(\omega)$ and
$f^H(\omega)\,R_{uu}(\omega)\,f(\omega)$ are an eigenvector and eigenvalue of the input covariance
matrix, respectively. It can be shown that a stable solution is obtained when
$f^H(\omega)\,R_{uu}(\omega)\,f(\omega)$ is the largest eigenvalue. From matrix theory, we know that
for a Hermitian matrix the maximum value of the Rayleigh coefficient (omitting ω):

$$\frac{f^H\,R_{uu}\,f}{f^H\,f}$$

is obtained for the eigenvector that belongs to the largest eigenvalue. This proves
that the beamformer maximizes its output under the constraint $f^H f = 1$.
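This eigenvalue statement is easy to check numerically. The sketch below builds an arbitrary Hermitian matrix as a stand-in for $R_{uu}(\omega)$ at one bin and confirms that no random vector achieves a larger Rayleigh coefficient than the eigenvector of the largest eigenvalue; the matrix size and trial count are arbitrary test choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary Hermitian positive semi-definite stand-in for Ruu(w) at one bin.
A = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))
Ruu = A @ A.conj().T

eigvals, eigvecs = np.linalg.eigh(Ruu)  # ascending eigenvalues for Hermitian input
f_opt = eigvecs[:, -1]                  # unit-norm eigenvector of largest eigenvalue

def rayleigh(f):
    """Rayleigh coefficient f^H Ruu f / f^H f."""
    return np.real(f.conj() @ Ruu @ f) / np.real(f.conj() @ f)

trials = [rng.normal(size=3) + 1j * rng.normal(size=3) for _ in range(1000)]
assert all(rayleigh(f) <= rayleigh(f_opt) + 1e-9 for f in trials)
print("maximized output power:", rayleigh(f_opt), "= largest eigenvalue:", eigvals[-1])
```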
[0110] When the microphone signals only contain speech with $u(\omega) = s(\omega)\,h(\omega)$, then
maximization of $f^H(\omega)\,R_{uu}(\omega)\,f(\omega)$ corresponds to the maximization of the speech
in the output. If we consider the case with

$$u(\omega) = s(\omega)\,h(\omega) + v(\omega),$$

where we assume that $v(\omega)$ is uncorrelated noise with equal variance in all microphones,
i.e.

$$E\left\{v(\omega)\,v^H(\omega)\right\} = \sigma_v^2(\omega)\,I,$$

[0111] $R_{uu}(\omega)$ can be written as:

$$R_{uu}(\omega) = E\left\{|s(\omega)|^2\right\} h(\omega)\,h^H(\omega) + \sigma_v^2(\omega)\,I$$

[0112] Maximization of $f^H(\omega)\,R_{uu}(\omega)\,f(\omega)$ then corresponds to maximization of
(using $f^H(\omega)\,f(\omega) = 1$):

$$E\left\{|s(\omega)|^2\right\}\left|f^H(\omega)\,h(\omega)\right|^2 + \sigma_v^2(\omega)$$

which means that maximization corresponds to maximization of the speech and maximizing
the Signal-to-Noise ratio.
[0113] Now assume that $v(\omega)$ consists of correlated noise such that:

$$E\left\{v(\omega)\,v^H(\omega)\right\} = R_{nn}(\omega)$$

with $R_{nn}(\omega)$ being the covariance matrix of the noise with off-diagonal elements that are non-zero.
[0114] Maximization of $f^H(\omega)\,R_{uu}(\omega)\,f(\omega)$ then leads to the maximization of:

$$E\left\{|s(\omega)|^2\right\}\left|f^H(\omega)\,h(\omega)\right|^2 + f^H(\omega)\,R_{nn}(\omega)\,f(\omega)$$

[0115] This means that maximization of $f^H(\omega)\,R_{uu}(\omega)\,f(\omega)$ does not necessarily
lead to a better Signal-to-Noise ratio, since the choice of $f$ not only determines the
amount of speech in the output, but also the amount of noise in the output.
[0116] In the following an approach will be described that may provide improved performance
in many scenarios. In the approach, a beamforming approach such as that described
with reference to FIG. 1 is further enhanced by introduction of an adaptive spatial
filtering of the signals. The spatial filtering is performed across signals and is
based on decorrelation coefficients/weights that are determined based on the input
audio signals to adapt to the current audio properties. As will be described in more
detail in the following, the spatial filtering may be placed at different places in
the loop structure of the beamforming including as an inverse spatial decorrelation
filtering in the feedback loop or as a spatial decorrelation filtering directly on
the input signals being used by the beamforming. It has been found that the approach
may provide substantially improved operation, such as e.g. substantially improved
source separation, in many practical scenarios. In relation to the above analysis,
the approach may in particular achieve that the noise contribution
$f^H(\omega)\,R_{nn}(\omega)\,f(\omega)$ may be made independent of the choice of $f$.
[0117] FIG. 2 illustrates an audio apparatus that may provide improved performance. In the
example, the audio apparatus comprises the beamforming structure as described with
reference to FIG. 1. However, the approach has further been enhanced by this structure
operating on signals that have been spatially decorrelated by an adaptive spatial
decorrelator.
[0118] The audio apparatus of FIG. 2 comprises a receiver 201 which receives a first set
of audio signals and in the specific examples receives a set of microphone signals
from a set of microphones capturing the audio scene from different positions. The
microphones may for example be arranged in a linear array and relatively close to
each other. For example, the maximum distance between capture points for the audio
signals may in many embodiments not exceed 1 meter, 50 cm, 25 cm, or even in some
cases 10 cm.
[0119] The input audio signals may be received from different sources, including internal
or external sources. In the following an embodiment will be described where the receiver
201 is coupled to a plurality of microphones, such as a linear array of microphones,
providing a set of input audio signals in the form of microphone signals.
[0120] However, rather than directly performing the operation of FIG. 1 on the microphone
signals, the audio apparatus of FIG. 2 is arranged to apply an adaptive spatial decorrelation
operation to the audio signals before the beamforming and adaptation operation. The
spatial decorrelation is an adaptive decorrelation which is adapted based on the input
audio signals so that the decorrelation may be continuously adapted to provide an
increased decorrelation.
[0121] Specifically, the audio apparatus of FIG. 2 comprises a spatial decorrelator 205 in
the form of a set of spatial (decorrelation) filters that apply a spatial filtering
to the first set of audio signals. Thus, following the filtering by the spatial filters,
the first set of audio signals is modified to have a higher degree of decorrelation (for at
least one audio source) than before the spatial decorrelation filtering.
[0122] The spatial filters are based on coefficients that are adapted by an adaptive coefficient
processor 207 which dynamically adapts and updates the filter coefficients used by
the set of filters of the decorrelator 205. The spatial filters are accordingly set
to have coefficients that are determined by the adaptive coefficient processor 207.
[0123] The beamforming, decorrelation, and adaptation may typically be performed in the
frequency domain. The receiver 201 may comprise a segmenter which is arranged to segment
the set of input audio signals into time segments. In many embodiments, the segmentation
may typically be a fixed segmentation into time segments of a fixed and equal duration,
such as e.g. a division into time segments/intervals with a fixed duration of between
10 and 20 ms. In some embodiments, the segmentation may be adaptive for the segments
to have a varying duration. For example, the input audio signals may have a varying
sample rate and the segments may be determined to comprise a fixed number of samples.
[0124] The segmentation may typically be into segments with a given fixed number of time
domain samples of the input signals. For example, in many embodiments, the segmenter
203 may be arranged to divide the input signals into consecutive segments of e.g.,
256 or 512 samples.
[0125] The receiver 201 may be arranged to generate a frequency bin representation of the
input audio signals, and the first set of input signals being further processed is typically
represented in the frequency domain by a frequency bin representation. The audio apparatus
may be arranged to perform frequency domain processing of the frequency domain representation
of the input audio signals. The signal representation and processing are based on
frequency bins and thus the signals are represented by values of frequency bins and
these values are processed to generate frequency bin values of the output signals.
In many embodiments, the frequency bins have the same size, and thus cover frequency
intervals of the same size. However, in other embodiments, frequency bins may have
different bandwidths, and for example a perceptually weighted bin frequency interval
may be used.
[0126] In some embodiments, the input audio signals may already be provided in a frequency
representation and no further processing or operation is required. In some such cases,
however, a rearrangement into suitable segment representations may be desired, including
e.g. using interpolation between frequency values to align the frequency representation
to the time segments.
[0127] In other embodiments, a filter bank, such as a Quadrature Mirror Filter, QMF, may
be applied to the time domain input signals to generate the frequency bin representation.
However, in many embodiments, a Discrete Fourier Transform (DFT), and specifically
a Fast Fourier Transform (FFT), may be applied to generate the frequency representation.
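A minimal sketch of this segmentation and transform step is shown below; the non-overlapping, unwindowed 512-sample DFT framing is one simple choice consistent with the text, not the only possible one.

```python
import numpy as np

def to_frequency_bins(signals, segment_len=512):
    """Segment time domain input signals and compute a per-segment DFT.

    signals: (M, N) array, one row per input (microphone) signal.
    Returns a (K, M, B) complex array: K segments, M signals, B frequency bins.
    """
    M, N = signals.shape
    K = N // segment_len                       # number of whole segments
    segments = signals[:, :K * segment_len].reshape(M, K, segment_len)
    return np.fft.fft(segments, axis=-1).transpose(1, 0, 2)
```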
[0128] In the audio apparatus of FIG. 2, the spatial filters 205 specifically process the
audio signals in the frequency domain. In the following description, the first set
of audio signals may also be referred to as input audio signals (to the spatial filters)
prior to the filtering, and the resulting signals may also be referred to as output audio
signals (from the spatial filters).
[0129] For each frequency bin, an output frequency bin value is generated from one or more
input frequency bin values of one or more input signals as will be described in more
detail in the following. The output signals are generated to (typically/on average)
reduce the correlation between signals relative to the correlation of the input signals,
at least for one audio source which for example may be a dominant audio source.
[0130] The spatial filter set is arranged to filter the input audio signals. The filtering
is a spatial filtering in that for a given output signal, the output value is determined
from a plurality of, and typically all of the input audio signals (for the same time/segment
and for the same frequency bin). The spatial filtering is specifically performed on
a frequency bin basis such that a frequency bin value for a given frequency bin of
an output signal is generated from the frequency bin values of the input signals for
that frequency bin. The filtering/weighted combination is across the signals rather
than being a typical time/frequency filtering.
[0131] Specifically, the frequency bin value for a given frequency bin is determined as
the weighted combination of the frequency bin values of the input signals for that
frequency bin. The combination may specifically be a summation, and the frequency bin
value may be determined as a weighted summation of the frequency bin values of the
input signals for that frequency bin. The bin value for a given frequency bin may be
determined as the vector multiplication of a vector of the weights/coefficients
of the weighted summation and a vector comprising the bin values of the input signals:

$$y = \begin{bmatrix} w_1 & w_2 & w_3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = w_1 x_1 + w_2 x_2 + w_3 x_3$$

where $y$ is the output bin value, $w_1$ to $w_3$ are the weights of the weighted combination,
and $x_1$ to $x_3$ are the input signal bin values.
[0132] Representing the output bin values for a given frequency bin ω as a vector $y(\omega)$,
the determination of the output signals may be expressed as:

$$y(\omega) = W(\omega)\,x(\omega)$$

where the matrix $W(\omega)$ represents the weights/coefficients of the weighted summation
for the different output signals and $x(\omega)$ is a vector comprising the input signal values.
[0133] For example, for an example with only three input signals and output signals, the
output bin values for the frequency bin ω may be given by:

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$$

where $y_n$ represents the output bin values, $w_{ij}$ represents the weights of the weighted
combinations, and $x_m$ represents the input signal bin values.
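A minimal sketch of this per-bin combination is given below: for every frequency bin, the output bin values are the matrix-vector product of that bin's weight matrix and the input bin values, mirroring $y(\omega) = W(\omega)x(\omega)$. The array layout is an assumption.

```python
import numpy as np

def apply_decorrelation(W, X):
    """Apply the spatial decorrelation filters for every frequency bin.

    W: (B, M, M) complex array, one weight matrix W(w) per frequency bin.
    X: (M, B) complex array, input bin values of one segment.
    Returns Y with Y[:, w] = W[w] @ X[:, w]: each output bin value is a
    weighted combination, across signals, of the input values of that bin.
    """
    return np.einsum('bij,jb->ib', W, X)
```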
[0134] The adaptive coefficient processor 207 may seek to adapt the spatial filters to be
spatial decorrelation filters that seek to generate the output signals to correspond
to the input signals but with an increased decorrelation of the signals. The output
audio signals are generated to have an increased spatial decorrelation with the cross
correlation between audio signals being lower for the output audio signals than for
the input audio signals. Specifically, the output signals may be generated to have
the same combined energy/power as the combined energy/power of the input signals (or
have a given scaling of this) but with an increased decorrelation (decreased correlation)
between the signals. The output audio signals may be generated to include all the
audio/signal components of the input signals in the output audio signals but with
a re-distribution to different signals to achieve increased decorrelation.
[0135] The decorrelation filters may specifically be arranged to generate output signals
that have a lower coherence/normalized correlation than the input signals. The
output signals of the decorrelation filters may thus have a lower coherence than
the input signals to the decorrelation filters.
[0136] The adapter 207 is arranged to determine update values for the weights of the weighted
combination(s) forming the set of decorrelation filters. Thus, specifically, update
values may be determined for the matrix
W(ω). The adapter 207 may then update the weights of the weighted combination based
on the update values.
[0137] The adapter 207 is arranged to apply an adaptation approach that determines update
values which may allow the output signals of the set of decorrelation filters 205
to represent the audio of the input signals to the set of decorrelation filters 205
but with the output signals typically being more decorrelated than the input signals.
[0138] The adapter 207 is arranged to use a specific approach of adapting the weights based
on the generated output signals. The operation is based on each output audio signal
being linked with one input audio signal. The exact linking between output signals
and input signals is not essential and many different (including in principle random)
linkings/pairings of each output signal to an input signal may be used. However, the
processing is different for a weight that reflects a contribution from an input signal
that is linked to/paired with the output signal for the weight than the processing
for a weight that reflects a contribution from an input signal that is not linked
to/paired with the output signal for the weight. For example, in some embodiments,
the weights for linked signals (i.e. for input signals that are linked to the output
signal generated by the weighted combination that includes the weight) may be set
to a fixed value and not updated, and/or the weights for linked signals may be restricted
to be real valued weights whereas other weights are generally complex values.
[0139] The adapter 207 uses an adaptation/update approach where an update value is determined
for a given weight that represents the contribution to a bin value for a given output
signal from a given non-linked input signal based on a correlation measure between
the output bin value for the given output signal and the output bin value for the
output signal that is linked with the given (non-linked) input signal. The update
value may then be applied to modify the given weight in the subsequent segment;
equivalently, the update value for a weight in a given segment is determined in response
to two output bin values of a prior (typically the immediately prior) segment, where the
two output bin values represent respectively the input signal and the output signal to
which the weight relates.
[0140] The described approach is typically applied to a plurality, and typically all, of
the weights used in determining the output bin values based on a non-linked input
signal. For weights relating to the input signal that is linked to the output signal
for the weight, other considerations may be used, such as e.g. setting the weight
to a fixed value as will be described in more detail later.
[0141] Specifically, the update value may be determined in dependence on a product of the
output bin value for the weight and the complex conjugate of the output bin value
linked to the input signal for the weight, or equivalently in dependence on a product
of the complex conjugate of the output bin value for the weight and the output bin
value linked to the input signal for the weight.
[0142] As a specific example, an update value for segment k+1 for frequency bin ω may be
determined in dependence on the correlation measure given by:

$$y_i(k,\omega)\,y_j^*(k,\omega)$$

or equivalently by:

$$y_i^*(k,\omega)\,y_j(k,\omega)$$

where $y_i(k,\omega)$ is the output bin value for the output signal i being determined based
on the weight; and $y_j(k,\omega)$ is the output bin value for the output signal j that is
linked to the input signal from which the contribution is determined (i.e. the input
signal bin value that is multiplied by the weight to determine a contribution to the
output bin value for signal i).
[0143] The measure $y_i(k,\omega)\,y_j^*(k,\omega)$ (or the conjugate value) indicates the correlation
of the time domain signals in the given segment. In the specific example, this value may
then be used to update and adapt the weight $w_{ij}(k+1,\omega)$.
[0144] As previously mentioned, the set of decorrelation filters may be arranged to determine
the output bin values for the output signals for the given frequency bin ω from:

$$y(\omega) = W(\omega)\,x(\omega)$$

where $y(\omega)$ is a vector comprising the frequency bin values for the output signals for the
given frequency bin ω; $x(\omega)$ is a vector comprising the frequency bin values for the input
audio signals for the given frequency bin ω; and $W(\omega)$ is a matrix having rows comprising
weights of a weighted combination for the output audio signals.
[0145] In the example, the adapter 207 may specifically be arranged to adapt at least some of
the weights $w_{ij}$ of the matrix $W(\omega)$ according to:

$$w_{ij}(k+1,\omega) = w_{ij}(k,\omega) - \eta(k,\omega)\,y_i(k,\omega)\,y_j^*(k,\omega)$$

where i is a row index of the matrix $W(\omega)$, j is a column index of the matrix $W(\omega)$,
k is a time segment index, ω represents the frequency bin, and $\eta(k,\omega)$ is a scaling
parameter for adapting an adaptation speed. Typically, the adapter 207 may be arranged
to adapt all the weights that are not relating the input signal with the linked output
signal (i.e. "cross-signal" weights).
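A minimal sketch of this update is given below, together with the identity initialization suggested in the Summary (linked weights fixed at 1, non-linked weights starting at zero). Treating η as a single scalar is a simplifying assumption; per-bin scaling is discussed next.

```python
import numpy as np

def init_weights(num_bins, num_signals):
    """Identity per bin: linked (diagonal) weights 1, non-linked weights 0."""
    return np.tile(np.eye(num_signals, dtype=complex), (num_bins, 1, 1))

def adapt_weights(W, Y, eta=0.01):
    """One adaptation step of the non-linked decorrelation weights.

    W: (B, M, M) Hermitian weight matrices with a fixed diagonal.
    Y: (M, B) output bin values of the previous segment.
    Implements w_ij(k+1, w) = w_ij(k, w) - eta * y_i(k, w) * conj(y_j(k, w))
    for i != j, leaving the linked (diagonal) weights untouched.
    """
    B, M, _ = W.shape
    corr = np.einsum('ib,jb->bij', Y, np.conj(Y))   # per-bin y_i * conj(y_j)
    update = -eta * corr
    idx = np.arange(M)
    update[:, idx, idx] = 0.0       # linked weights stay at their fixed value
    return W + update               # corr is Hermitian per bin, so W stays Hermitian
```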
[0146] In some embodiments, the adapter 207 may be arranged to adapt the update rate/speed
of the adaptation of the weights. For example, in some embodiments, the adapter may
be arranged to compensate the correlation measure for a given weight in dependence
on the signal level of the output bin value to which a contribution is determined
by the weight.
[0147] As a specific example, the compensation value

may be compensated by the signal level of the output bin value, |
yi(
k, ω)|. The compensation may for example be included to normalize the update step values
to be less dependent on the signal level of the generated decorrelated signals.
[0148] In many embodiments, such a compensation or normalization may specifically be performed
on a frequency bin basis, i.e. the compensation may be different in different frequency
bins. This may in many scenarios improve the operation and may typically result in
an improved adaptation of the weights generating the decorrelated signals.
[0149] The compensation may for example be built into the scaling parameter η(k, ω) of the
previous update equation. Thus, in many embodiments, the adapter 207 may be arranged
to adapt/change the scaling parameter η(k, ω) differently in different frequency bins.
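A minimal sketch of such a per-bin compensation, under the same assumed NumPy conventions
as the previous sketch. The division by the level |y_i(k, ω)| follows the example above;
a squared level would give an NLMS-style variant:

    import numpy as np

    def normalized_eta(eta0, y, eps=1e-12):
        # y: (n_bins, n) output bin values; returns eta(k, w) per bin and output i
        # eps avoids division by zero in (near-)silent bins
        return eta0 / (np.abs(y) + eps)

The resulting array may then scale the update term for each weight w_i,j by the level of
output i, giving a different effective adaptation speed in each frequency bin.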
[0150] In many embodiments, the arrangement of the input signal vector x(ω) and the output
signal vector y(ω) is such that linked signals are at the same position in the respective
vectors, i.e. specifically y_1 is linked with x_1, y_2 with x_2, y_3 with x_3, etc. In
this case, the weights for the linked signals are on the diagonal of the weight matrix
W(k, ω). The diagonal values may in many embodiments be set to a fixed real value, such
as e.g. specifically be set to a constant value of 1.
[0151] In many embodiments, the weights/spatial filters/weighted combinations may be such
that the weight for a contribution to a first output signal from a first input signal
(not linked to the first output signal) is the complex conjugate of the weight for the
contribution to the output signal linked with the first input signal from the input
signal linked with the first output signal. Thus, the two weights for two pairs of
linked input/output signals are complex conjugates.
[0152] In the example of the weights for linked input and output signals being arranged
on the diagonal of the weight matrix W(ω), this results in a Hermitian matrix. Indeed,
in many embodiments, the weight matrix W(ω) is a Hermitian matrix. Specifically, the
coefficients/weights of the weight matrix W(ω) may meet the criterion:

w_i,j(ω) = w_j,i^*(ω)
[0153] As previously mentioned, the weights for the contributions to an output signal bin
value from the linked input signal (corresponding to the values of the diagonal of
the weight matrix W(ω) in the specific example) are treated differently than the weights
for non-linked input signals. The weights for linked input signals will in the following
for brevity also be referred to as linked weights, and the weights for non-linked input
signals will similarly be referred to as non-linked weights; thus, in the specific
example the weight matrix W(ω) will be a Hermitian matrix comprising the linked weights
on the diagonal and the non-linked weights outside the diagonal.
[0154] In many approaches, the adaptation of the non-linked weights is such that it seeks
to reduce the correlation measures. Specifically, each update value may be determined
to reduce the correlation measure. Overall, the adaptation will accordingly seek to
reduce the cross-correlations between the output signals. However, the linked weights
are determined differently to ensure that the output signals maintain a suitable audio
energy/power/level. Indeed, if the linked weights were instead adapted to seek to
reduce the autocorrelation of the output signal for the weight, there is a high risk
that the adaptation would converge on a solution where all weights, and thus output
signals, are essentially zero (as indeed this would result in the lowest correlations).
Accordingly, the audio apparatus is arranged to seek to generate signals with less
cross-correlation but does not seek to reduce autocorrelation.
[0155] Thus, in many embodiments, the linked weights may be set to ensure that the output
signals are generated to have a desirable (combined) energy/power/level.
[0156] In some cases, the adapter 207 may be arranged to adapt the linked weights, and in
other cases the adapter may be arranged to not adapt the linked weights.
[0157] For example, in some embodiments, the linked weights may simply be set to a fixed
constant value that is not adapted. For example, in many embodiments, the linked weights
may be set to a constant scalar value, such as specifically to the value 1 (i.e. a
unitary gain being applied for a linked input signal). For example, the weights on the
diagonal of the weight matrix W(ω) may be set to 1.
[0158] Thus, in many embodiments, the weight for a contribution to a given output signal
frequency bin value from a linked input signal frequency bin value may be set to a
predetermined value. This value may in many embodiments be maintained constant without
any adaptation.
[0159] Such an approach has been found to provide very efficient performance and may result
in an overall adaptation generating output signals that provide a highly accurate
representation of the original audio of the input signals but with increased
decorrelation between the output signals.
[0160] In some embodiments, the linked weights may also be adapted but will be adapted differently
than the non-linked weights. In particular, in many embodiments, the linked weights
may be adapted based on the output signals.
[0161] Specifically, in many embodiments, a linked weight for a first input signal and linked
output signal may be adapted based on the generated output bin value of the linked
audio signal, and specifically based on the magnitude of the output bin value.
[0162] Such an approach may for example allow a normalization and/or setting of the desired
energy level for the signals.
[0163] In many embodiments, the linked weights are constrained to be real valued weights
whereas the non-linked weights are generally complex valued. In particular, in many
embodiments, the weight matrix W(ω) may be a Hermitian matrix with real values on the
diagonal and with complex values outside of the diagonal.
[0164] Such an approach may provide a particularly advantageous operation and adaptation in
many scenarios and embodiments. It has been found to provide a highly efficient spatial
decorrelation while maintaining a relatively low complexity and computational resource
usage.
[0165] The adaptation may gradually adapt the weights to increase the decorrelation between
signals. In many embodiments, the adaptation may be arranged to converge towards a
suitable weight matrix W(ω) regardless of the initial values, and indeed in some cases
the adaptation may be initialized by random values for the weights.
[0166] However, in many embodiments, the adaptation may be started with advantageous initial
values that may e.g. result in faster adaptation and/or result in the adaptation being
more likely to converge towards more optimal weights for decorrelating signals.
[0167] In particular, in many embodiments, the weight matrix W(ω) may be arranged with a
number of weights being zero but at least some weights being non-zero. In many
embodiments, the number of weights being substantially zero may be no less than 2, 3,
5, or 10 times the number of weights that are set to a non-zero value. This has been
found to tend to provide improved adaptation in many scenarios.
[0168] In particular, in many embodiments, the adapter 207 may be arranged to initialize
the weights with linked weights being set to a non-zero value, such as typically a
predetermined non-zero real value, whereas non-linked weights are set to substantially
zero. Thus, in the above example where linked signals are arranged at the same positions
in the vectors, this will result in an initial weight matrix W(ω) having non-zero values
on the diagonal and (substantially) zero values outside of the diagonal.
[0169] Such initializations may provide particularly advantageous performance in many
embodiments and scenarios. It may reflect that due to the audio signals typically
representing audio at different positions, there is a tendency for the input signals
to be somewhat decorrelated. Accordingly, a starting point that assumes the input
signals are fully decorrelated is often advantageous and will lead to a faster and
often improved adaptation.
[0170] It will be appreciated that the weights, and specifically the non-linked weights,
may not necessarily be exactly zero but may in some embodiments be set to low values
close to zero. However, the initial non-zero values may be at least 5, 10, 20, or 100
times higher than the initially substantially zero values.
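A minimal initialization sketch under the same assumed conventions as the previous
examples; the off_diag value for the "substantially zero" non-linked weights is purely
illustrative:

    import numpy as np

    def init_weights(n_bins, n, off_diag=0.0):
        # Linked (diagonal) weights set to 1; non-linked weights (substantially) zero
        W = np.broadcast_to(np.eye(n, dtype=complex), (n_bins, n, n)).copy()
        W += off_diag * (np.ones((n, n)) - np.eye(n))
        return W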
[0171] The described approach may provide a highly efficient adaptive spatial decorrelator
that may generate output signals that represent the same audio as the input signals
but with increased decorrelation. The approach has in practice been found to provide
a highly efficient adaptation in a wide variety of scenarios and in many different
acoustic environments and for many different audio sources. For example, it has been
found to provide a highly efficient decorrelation of speaker signals in environments
with multiple speakers.
[0172] The adaptation approach is furthermore computationally efficient and in particular
allows localized and individual adaptation of individual weights based only on two
signals (and specifically on only two frequency bin values) closely related to the
weight, yet the process results in an efficient, and often near optimal, global
optimization of the spatial filtering, and specifically of the weight matrix W(ω). The
local adaptation has been found to lead to a highly advantageous global adaptation
in many embodiments.
[0173] A particular advantage of the approach is that it may be used to decorrelate convolutive
mixtures and is not limited to only decorrelate instantaneous mixtures. For a convolutive
mixture, the full impulse response determines how the signals from the different audio
sources combine at the microphones (i.e. the delay/timing characteristics are significant)
whereas for instantaneous mixes a scalar representation is sufficient to determine
how the audio sources combine at the microphones (i.e. the delay/timing characteristics
are not significant). By transforming a convolutive mixture into the frequency domain,
the mixture can be considered a complex-valued instantaneous mixture per frequency
bin.
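As a sketch of this standard transform step, assuming the scipy.signal.stft API (the
segment length is an arbitrary choice for the example):

    import numpy as np
    from scipy.signal import stft

    def to_frequency_bins(mic_signals, fs, nperseg=512):
        # mic_signals: (n_mics, n_samples) time-domain convolutive mixture
        f, t, X = stft(mic_signals, fs=fs, nperseg=nperseg)
        # X: (n_mics, n_bins, n_segments); each X[:, w, k] is an (approximately)
        # instantaneous complex-valued mixture x(w) for bin w and segment k
        return X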
[0174] The adaptive coefficient processor 207 thus determines the coefficients for the set
of spatial decorrelation filters of the decorrelator 205 such that these modify the
first set of audio signals to represent the same audio but with an increased decorrelation
for at least one of the audio sources. Such a decorrelation of the audio signals to
which the previously described beamforming approach is then applied may substantially
improve the overall performance in many scenarios. Indeed, it may in many scenarios
result in improved separation and selection of a specific audio source, such as a
specific speaker. Thus, counterintuitively, the decorrelation of signals may provide
improved beamforming performance despite the beamforming inherently being based on
exploiting and adapting to correlation between audio signals from different positions
to extract/separate an audio source by spatially forming a beam towards the desired
source. Indeed, the decorrelation inherently breaks the link between the audio signals
and the specific location in the audio scene that is typically exploited by a beamforming
operation. However, the Inventors have realized that despite this, the decorrelation
may provide highly advantageous effects and improved performance in many scenarios.
For example, in the presence of a strong noise source, the approach may facilitate
and/or improve the extraction/isolation of a specific desired audio source such as
specifically a speaker.
[0175] The specific adaptation described above may provide a highly advantageous approach
in many embodiments. It may typically provide a low complexity yet highly accurate
adaptation to result in spatial decorrelation filters that generate highly decorrelated
signals. In particular, it may allow local adaptation of individual weights/filter
coefficients to result in a highly efficient global decorrelation of the first set
of audio signals.
[0176] However, it will be appreciated that in other embodiments, other approaches for adapting
the spatial filters/decorrelation filters may be used. For example, in some embodiments,
the adaptive coefficient processor 207 may comprise a neural network which, based on
the input samples, is arranged to generate update values for modifying the filter
coefficients. For example, for each segment, the frequency bin values for all the audio
signals may be fed to a trained neural network which as an output generates an update
value for each weight. Each weight may then be updated by this update value. The trained
network may for example have been trained with training data that includes many different
frequency bin values and associated update values that have been manually determined
to modify weights towards increased decorrelation.
[0177] As another example, the adaptive coefficient processor 207 may comprise a location
processor which, based on visual cues related to (changing) positions, is arranged
to generate update values for modifying the filter coefficients. As another example,
the adapter 207 may determine the cross-correlation matrix of the input signals and
compute its eigenvalue decomposition. The eigenvectors and eigenvalues can be used
to construct a decorrelation matrix.
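A hedged sketch of this eigenvalue-decomposition alternative; the shapes and the whitening
form W(ω) = Λ^(-1/2) E^H are assumptions for illustration, as the description above only
states that the eigenvectors and eigenvalues can be used:

    import numpy as np

    def eig_decorrelator(X, eps=1e-12):
        # X: (n_bins, n_segments, n_mics) input bin values
        R = np.einsum('bki,bkj->bij', X, np.conj(X)) / X.shape[1]  # covariance per bin
        evals, evecs = np.linalg.eigh(R)                           # R is Hermitian
        # W(w) = Lambda^(-1/2) E^H whitens R: W(w) R(w) W^H(w) ~ I
        W = (evecs / np.sqrt(evals[:, None, :] + eps)).conj().transpose(0, 2, 1)
        return W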
[0178] In the example of FIG. 2, the decorrelation and spatial decorrelation filters are
directly applied to the first set of audio signals before these are fed to both the
beamformer 101 and the beamform adapter 105. However, other approaches of adapting
the operation based on applying a spatial/cross signal filtering using coefficients
that are derived from the determined decorrelation coefficients may be used.
[0179] Specifically, FIG. 3 illustrates an example where the audio apparatus is arranged
to perform a spatial filtering of the first set of audio signals before these are
filtered by the first set of filters, i.e. the beamforming is based on the first set
of audio signals after these have been filtered by a set of spatial filters 301. However,
in the example, the beamform adapter 105 receives the first set of audio signals before
any filtering by the set of spatial filters 301, i.e. the spatial filters are applied
only to the beamformer path and not to the adaptation path.
[0180] The coefficients for this set of spatial filters 301 are determined from the decorrelation
coefficients that have been determined by the coefficient adapter 207. Indeed, in
some embodiments, the approach described with reference to the configuration of FIG.
2 may be applied directly and the resulting second set of filters may be applied (only)
to the signals of the beamform path.
[0181] However, in many embodiments, the set of spatial filters 301 in the configuration
example of FIG. 3 will be different from the set of spatial filters 201 in the configuration
example of FIG. 2. In particular, in many embodiments, the set of spatial filters
will be modified to correspond to a cascade of two sets of decorrelation filters as
determined by the adaptive coefficient processor 207. Thus, the set of spatial filters
301 has coefficients matching those of a spatial filter that is a cascade of two of
the decorrelation filters. This can typically be considered equivalent to a double/repeated
filtering of the first set of audio signals by the set of decorrelation filters determined
by the adaptive coefficient processor 207.
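Expressed per frequency bin, this cascade is simply the matrix square of the adapted
weight matrix; a minimal sketch under the NumPy conventions assumed previously:

    import numpy as np

    def apply_cascaded(W, x):
        # W: (n_bins, n, n) adapted decorrelation matrices; x: (n_bins, n)
        W2 = np.einsum('bij,bjk->bik', W, W)   # W^2(w) per frequency bin
        return np.einsum('bij,bj->bi', W2, x)  # signals fed to the beamformer path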
[0182] In particular, the adaptive coefficient processor 207 may determine the weights/coefficients
of a weight matrix W(ω) for a decorrelation filter that can decorrelate the first set
of audio signals in accordance with:

y(ω) = W(ω) x(ω)

as previously described.
[0183] The set of spatial filters 301 may in this case be generated to correspond to a cascaded
application of two such filters, i.e. it may be arranged to correspond to:

y(ω) = W(ω) W(ω) x(ω) = W^2(ω) x(ω)
[0184] Thus, in the example, the adaptive coefficient processor 207 may proceed to perform
the adaptation as described for the example of FIG. 2 to determine suitable coefficients
for a set of spatial decorrelation filters that would decorrelate the first set of
audio signals. In some cases, such filters may then be applied by the set of spatial
filters 301 in FIG. 3. However, in many embodiments, the determined decorrelation
filters are not used directly but rather coefficients for the set of spatial filters
301 are determined from the determined coefficients for the decorrelation filters.
In such cases, the adaptive coefficient processor 207 may as part of the determination/adaptation
of coefficients implement a decorrelation filter and apply this to the first set of
signals (e.g. in order to determine update values from y_i(k, ω) · y_j^*(k, ω)). However,
in the specific example, such filtered signals may only be used for the adaptation
process and may not be used further in the beamforming/processing. Instead, the set
of spatial filters 301 that are applied may be generated from the decorrelation
coefficients/weight matrix W(ω), specifically as W^2(ω).
[0185] The approach may provide improved overall performance including an improved beamforming
experience. Indeed, it can be shown that for the example where the set of spatial
filters 301 of FIG. 3 are set to coefficients corresponding to W^2, this may result in
the same performance, results, and output signal as for the example of FIG. 2. It can
be shown that these approaches may result in the same optimum solutions.
[0186] Another possible example of applying a spatial filtering by a set of spatial filters
determined from the decorrelation coefficients determined by the adaptive coefficient
processor 207 is illustrated in FIG. 4. In this example, the first set of audio signals
is not filtered by the set of spatial filters but rather the second set of audio signals
generated by the feedback circuit 103 are filtered by the set of spatial filters 401.
Thus, in this example, the feedback signals rather than the beamform signals are filtered
by the set of spatial filters.
[0187] Further, the set of spatial filters may be determined as a set of inverse spatial
decorrelation filters, where each of the inverse spatial decorrelation filters may
include an inverse filtering of a determined decorrelation filter.
[0188] For example, if the spatial decorrelation filter is represented by the weight matrix
W(ω), an inverse spatial decorrelation filter may be determined by the weight matrix
W^-1(ω). Thus, in many embodiments, the set of spatial filters applied to the second
set of audio signals may include a filtering corresponding to W^-1(ω) where W(ω) is the
spatial decorrelation matrix determined by the adaptive coefficient processor 207.
[0189] In many embodiments, the set of spatial filters applied to the second set of audio
signals may be determined to have filter coefficients that correspond to coefficients
of a spatial filter which is a cascade of two sets of spatial inverse filters, each
of which is an inverse of the spatial decorrelation filters determined by the adaptive
coefficient processor 207.
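Per frequency bin, this corresponds to applying the inverse matrix twice; a minimal
sketch, assuming W(ω) is invertible (as it is for the positive definite case discussed
later):

    import numpy as np

    def apply_inverse_cascaded(W, s):
        # W: (n_bins, n, n) decorrelation matrices; s: (n_bins, n) feedback signals
        W_inv = np.linalg.inv(W)                          # W^-1(w), batched per bin
        W_inv2 = np.einsum('bij,bjk->bik', W_inv, W_inv)  # W^-2(w)
        return np.einsum('bij,bj->bi', W_inv2, s)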
[0190] In particular, the adaptive coefficient processor 207 may thus determine the weights/coefficients
of a weight matrix W(ω) for a decorrelation filter that can decorrelate the first set
of audio signals in accordance with:

y(ω) = W(ω) x(ω)
[0191] The set of spatial filters 401 may in this case be generated to correspond to a cascaded
application of two such inverse filters, i.e. it may be arranged to correspond to:

W^-1(ω) W^-1(ω) s(ω) = W^-2(ω) s(ω)

where s(ω) represents the second set of audio signals generated by the feedback circuit 103.
[0192] Thus, in the example, the adaptive coefficient processor 207 may proceed to perform
the adaptation as described for the example of FIG. 2 to determine suitable coefficients
for a spatial decorrelation filter that would decorrelate the first set of audio signals.
The determined decorrelation filter coefficients are not used directly but rather
coefficients for the set of spatial filters 401 are determined from the determined
coefficients. In such cases, the adaptive coefficient processor 207 may as part of
the determination/adaptation of coefficients implement a decorrelation filter and
apply this to the first set of signals (e.g. in order to determine update values from
y_i(k, ω) · y_j^*(k, ω)). However, in the specific example, such filtered signals may
only be used for the adaptation process and may not be used further in the
beamforming/processing.
[0193] The approach may provide improved overall performance including an improved beamforming
experience. Indeed, it can be shown that for the example in which the set of spatial
filters 401 of FIG. 4 are set to coefficients corresponding to W^-2, this may result
in the same performance and operation as for the example of FIG. 2, and specifically
that these may result in the same optimum solutions.
[0194] The audio apparatus implements a highly adaptive approach that may be particularly
suitable for adapting to extract specific audio sources, such as a desired speaker.
The approach may be particularly advantageous when e.g. a strong noise source is present
in the audio environment being captured. The adaptive audio apparatus comprises different
adaptations of respectively the set of spatial filters (via the adaptation of the
decorrelation filters/decorrelation coefficients) and the adaptation of the beamformer
filters. The adaptive coefficient processor 207 and beamform adapter 105 synergistically
interwork to provide a highly advantageous and high performance adaptation to the
current audio properties of the scene.
[0195] In many embodiments, the audio apparatus may be arranged to perform the different
adaptations at different times. Thus, the adaptive coefficient processor 207 and the
beamform adapter 105 may be controlled to adapt during different time periods, and in
many embodiments these may be non-overlapping. Thus, at a given time, either the adaptive
coefficient processor 207 or the beamform adapter 105 may be adapting but not both
(although for some periods neither may be adapted).
[0196] For example, as illustrated in FIG. 5, the audio apparatus of FIG. 2 (or correspondingly
of FIG. 3 or 4) may include an audio detector 501 which is arranged to detect when
an audio source is active and when it is not. The audio detector 501 may specifically
be arranged to divide the time into (at least) a set of active time intervals during
which an audio source is active and a set of inactive time intervals during which
the audio source is not active. The audio source may specifically be a desired audio
source, such as specifically a desired speaker.
[0197] In some embodiments, a low complexity audio detection may be used to determine whether
a source is active or not. For example, the audio apparatus may be used to extract
the dominant audio source in the environment, such as for example the strongest speaker.
For example, typically for teleconferencing applications, only one person will be
speaking at a time and the audio apparatus may be used to extract the speech of the
current speaker from background or ambient noise.
[0198] The audio detector 501 may for example in such applications simply detect the audio
source activity based on a level/energy being above a given threshold. If the signal
level is above the given threshold (which may e.g. be a dynamic threshold that is
set based on a longer time averaged signal level), the audio detector 501 may consider
the audio source/speaker to be active and otherwise it may consider the audio source/speaker
to not be active. Thus, the audio detector 501 may divide the time into active time
intervals when the captured audio is above a threshold (due to the audio source being
active) and non-active time intervals when the signal level is below the threshold
(due to the audio source not being active).
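A minimal sketch of such a detector; the margin and smoothing factor are illustrative
assumptions, and state holds the long-term average level implementing the dynamic
threshold:

    import numpy as np

    def energy_vad(frame, state, margin=2.0, alpha=0.99):
        # frame: one segment of captured samples; state: dict with long-term average
        level = np.mean(np.abs(frame) ** 2)
        state['avg'] = alpha * state.get('avg', level) + (1 - alpha) * level
        return level > margin * state['avg']   # True -> active time interval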
[0199] In many embodiments, more complex detections may be used and indeed techniques may
be used that distinguish between different audio sources. For example, in many embodiments
the detection may include a consideration of whether the audio signal has desired/expected
properties matching the specific audio source. For example, the audio detector 501
may differentiate between speech and other types of audio based on evaluating whether
the captured audio has properties matching speech.
[0200] It will be appreciated that many different techniques and algorithms are known for
speech/voice activity detection, and more generally for detecting that an audio source
is active, and that any suitable approach may be used without detracting from the
invention.
[0201] In some embodiments, a trained artificial neural network may be used, and it has
in practice been found that effective speech/voice activity detection (and/or noise
detection) can be performed. The artificial neural network may be trained with speech
and all types of non-speech and may for each frame provide an indication of whether
the frame contains noise or speech.
[0202] In many embodiments, the detection may be relatively fast and the audio detector
501 may be arranged to designate relatively short time intervals as active or non-active
time intervals. For example, in some embodiments, the audio detector 501 may be arranged
to detect silent pauses/intervals during normal speech and designate such intervals
as non-active time intervals. For example, time intervals less than e.g. 5 msec, 10
msec, or 20 msec may in some embodiments be identified/designated as active or non-active
time intervals.
[0203] In many embodiments, the audio detector 501 may directly detect activity for an audio
source based on the received microphone signals/ the first set of audio signals/ the
beamformer signals. For example, the total captured audio energy may be determined
and compared to a threshold. However, in other embodiments other signals may be considered,
such as e.g. the second set of audio signals. In many embodiments, the audio detector
501 may base the activity detection on the generated beamform output audio signal.
For example, level detection or speech detection may be applied directly to the beamform
output audio signal. This may in many embodiments provide improved performance as
the output signal may specifically be generated to focus on the desired signal, such
as the desired speaker.
[0204] The audio detector 501 may be arranged to control the adaptation based on the audio
source activity detection, and specifically the adaptation of the decorrelation coefficients
and/or of the beamform filter coefficients may be different in active time intervals
than in inactive time intervals.
[0205] Indeed, in some embodiments, the adaptation of the decorrelation coefficients may
not be performed during the active time intervals but only during the inactive time
intervals. The audio detector 501 may specifically seek to detect whether a desired
or wanted audio source/speaker is active. It may further control the adaptive coefficient
processor 207 to adapt the decorrelation coefficients only when this audio source
is not active. Thus, the adaptive coefficient processor 207 may be arranged to adapt
to provide decorrelation filters that seek to decorrelate the unwanted audio. Thus,
rather than updating the decorrelation filters to decorrelate the entire captured
audio, the adaptation is aimed at decorrelating the undesired audio. Such an approach
may provide a very efficient and improved operation and performance in many situations.
The decorrelation approach for the non-desired audio may allow improved beamforming
with improved rejection of the undesired audio by the beamforming while allowing the
beamforming to provide efficient and high performance extraction of the wanted signal.
[0206] In some embodiments, the adaptive coefficient processor 207 may be arranged to adapt
the decorrelation coefficients in both the active time intervals and the inactive
time intervals but with the adaptation rate being higher during the inactive time intervals
than during the active time intervals. Thus, rather than only adapting the decorrelation
coefficients during the inactive time intervals, there may also be some adaptation
during the active time intervals but with this adaptation being slower, e.g. by a
factor of no less than 2, 5, 10, or 100. This may be advantageous in some scenarios,
such as for example where the active time intervals are much longer than the inactive
time intervals. In some embodiments, the update rate during the active time intervals
and/or the inactive time intervals may be dynamically adapted depending on e.g. properties
of the time intervals or the input signals. For example, the adaptive coefficient
processor 207 may switch from only updating during the inactive time intervals
to also updating (at a lower rate) during the active time intervals if the duration
of the active time intervals exceeds a given duration.
[0207] In many embodiments, the adaptation of the beamforming filters may not be performed
during the inactive time intervals but only during the active time intervals. The
audio detector 501 may specifically seek to detect whether a desired or wanted audio
source/speaker is active and control the beamform adapter 105 to adapt the beamform
filters only when this audio source is active. Thus, the beamform adapter 105 is arranged
to adapt the beamforming to specifically (virtually) form a beam towards the desired
audio source.
[0208] In some embodiments, the beamform adapter 105 may be arranged to adapt the beamformer
coefficients in both the active time intervals and the inactive time intervals but
with the adaptation rate being higher during the active time intervals than during the
inactive time intervals. Thus, rather than only adapting the beamform filters during
the active time intervals, there may also be some adaptation during the inactive time
intervals but with this adaptation being slower, e.g. by a factor of no less than
2, 5, 10, or 100. This may be advantageous in some scenarios, such as for example where
the audio detector 501 is set to provide a detection with a very low risk of false
detection of the active audio source such that the risk of not detecting the audio
source being active is relatively high. In such cases, it may be desirable to still
adapt during the inactive time intervals but with an update rate that is (typically)
much lower. Thus, a low update rate may be used during time intervals when the wanted
source may or may not be active and a high update rate may be used during time
intervals when the wanted source is almost certainly present.
[0209] In many embodiments the audio apparatus may be arranged to adapt the decorrelation
coefficients (only) during the inactive time intervals and to adapt the beamforming
filters (only) during the active time intervals, as sketched below. This may provide
highly efficient performance and in particular may provide substantially improved
extraction/separation of a desired audio source/speaker in the presence of noise/undesired
audio, such as specifically a dominant and correlated noise source.
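A sketch of this complementary control, with the decorrelator and beamformer as assumed
objects exposing an adapt step; the names and interfaces are illustrative, not from the
description above:

    def adapt_step(active, decorrelator, beamformer, x):
        # active: output of the audio detector 501 for the current segment
        if active:
            beamformer.adapt(x)      # desired source active: adapt the beam towards it
        else:
            decorrelator.adapt(x)    # only undesired audio: adapt the decorrelation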
[0210] In many practical applications, such as many speech capture and processing applications,
the adaptation control of both the beamforming and decorrelation is important for
the effective operation of the audio apparatus. It may be desired for the beamformer
to adapt on the speech, which by nature is not always active, and for the decorrelation
to adapt on the noise-only part. In use cases where the noise is continuously present,
the beamformer can only adapt when there is also noise present. In case the noise is
uncorrelated, or the noise is decorrelated by the decorrelation, the noise does not
influence the adaptation if the noise is equivariant. If the noise is not equivariant,
the beamformer will diverge during noise-only periods, but the beamformer can often
quickly (almost unnoticeably) adjust to the desired speaker if the near-end speaker
becomes active. Depending on the application, either no further adaptation control is
used, such that the beamformer can quickly find new sources, or a voice-activity detector
is used, which can vary from a simple energy-based detector to more sophisticated
detectors that also use speech characteristics, for example based on pitch.
[0211] The detector for the noise is typically made more conservative, such that it is not
active when speech is also present, which would cause the decorrelator to start
decorrelating the speech. Depending on the noise type and level of the noise, energy-based
detectors can be used that distinguish (stationary) noise from speech, or more
sophisticated ones, e.g. neural-net classifiers that are trained to distinguish noise
from speech.
[0212] In the above, the description focused on a scenario where the active time intervals
are wanted/speech audio source active time intervals and the inactive time intervals
are unwanted source/noise/interference active time intervals. In this case, the adaptation
rate for the beamform filters is higher during the active time intervals than during
the inactive time intervals, and specifically the adaptation of the beamform filters
may be performed only during the active time intervals and not during the inactive
time intervals. Equivalently, the adaptation rate for the decorrelation coefficients
is higher during the inactive time intervals than during the active time intervals,
and specifically the adaptation of the decorrelation coefficients may be performed
only during the inactive time intervals and not during the active time intervals.
[0213] Such time intervals may for example be determined directly by the use of a speech
detector which detects and determines the speech active time intervals, which may
be considered as the active time intervals. The remaining time intervals, i.e. the
non-speech active time intervals, may be considered as the inactive time intervals.
[0214] In some embodiments, a detection of undesired audio properties may be used, such
as specifically a detection of the activity of an interferer or noise source. For
example, detection of activity of an undesired audio source may be used, e.g. a music
detector, a silence detector, or a detector for specific noises. In such cases, the
time intervals detected may be considered as the inactive time intervals. The remaining
time intervals may be considered as the active time intervals.
[0215] It will be appreciated that if instead, active time intervals were considered to
correspond to time intervals in which an undesired audio source is active and inactive
time intervals were considered to correspond to time intervals in which the undesired
audio source is not active, then the previously described adaptation would be reversed
(adaptation rate of filters would be higher during inactive time intervals, adaptation
rate of decorrelation coefficients would be higher during active time intervals, and
specifically in many embodiments adaptation of filters would only be in inactive time
intervals and adaptation of decorrelation coefficients would only be in active time
intervals).
[0216] In the following, it will be demonstrated that the specific configurations of FIGs.
2, 3, and 4 can provide the same solutions. For brevity, these configurations and
solutions will be referred to as A, B and C respectively. For the different solutions,
we will derive the optimum filter coefficients and the corresponding output.
[0217] With configuration A (FIG. 2), the set of spatial filters is placed directly after
the microphones, completely outside the beamformer. This has the advantage that nothing
has to be changed in the beamformer algorithm: it will still have the constraint
f^H(ω) f(ω) = 1; only the inputs differ. A suitable decorrelator may be as previously
described. This decorrelator transforms the noise covariance matrix R_nn(ω) into:

W(ω) R_nn(ω) W^H(ω) = Λ(ω)

where Λ(ω) = diag(λ_1(ω), ..., λ_Nmics(ω)) is a diagonal matrix and W(ω) the
n_mics × n_mics decorrelation matrix of the SDC. Instead of maximizing
f^H(ω) R_uu(ω) f(ω) the beamformer now maximizes f^H(ω) W(ω) R_uu(ω) W^H(ω) f(ω), which
can be written as

f^H(ω) W(ω) R_ss(ω) W^H(ω) f(ω) + f^H(ω) Λ(ω) f(ω)
[0218] If Λ(ω) is a scaled version of the identity matrix and can be written as β(ω)I, then
f^H(ω) Λ(ω) f(ω) transforms to β(ω) and the noise contribution is independent of the
choice of f(ω), and maximization of f^H(ω) W(ω) R_uu(ω) W^H(ω) f(ω) leads to maximization
of the Signal-to-Noise ratio in the output.
[0219] The choice of β(ω) in the decorrelator does not influence the SNR but influences
the level of the speech in the output. Suppose we choose β(ω) = 1 and the noise level
at the input increases by a factor of 2; then all the coefficients in W(ω) will be
scaled by a factor of 0.5, so also decreasing the level of the speech at the output of
the SDC. The SDC keeps updating depending on the level of the input (per frequency bin).
As a result the beamformer also keeps updating. It can be shown that if Λ(ω) is chosen
linearly related to the power of the noise source, W(ω) only depends on the acoustic
transfer functions between the noise source and the microphones. Since the transfer
functions generally vary much more slowly when compared to the noise characteristics,
this is also advantageous for the beamformer, since it sees a "stable" path. A solution
is specifically where diag(W(ω)) = I. Λ(ω) will then have elements λ_i(ω) that are
linearly related to the input. A disadvantage is that not all elements of Λ(ω) are equal
and as a result, the noise in the output depends on f(ω). Although this is not desirable,
in practice it was found that it is typically more important that W(ω) R_nn(ω) W^H(ω)
does not contain off-diagonal elements than that all the elements on the diagonal
are equal. The combined solution with a beamformer preceded by a decorrelator with
diag(W(ω)) = I has proven to be very effective.
[0220] The optimal coefficients for the beamformer can be found using Equation (8) with,
instead of h(ω), the output of the spatial decorrelator: W(ω) h(ω).
[0221] Using a decorrelator with W(ω) Hermitian, and thus W^H(ω) = W(ω), we get for the
optimal solution:

[0223] Substituting (12) and (11) into (10) gives for the output:

[0224] Note that with equal variance uncorrelated noise at the inputs, using the decorrelator
with diag(W(ω)) = I, and thus W^2(ω) = W(ω) = I, we get the same solution as without
the decorrelator, with only the beamformer.
[0225] For configuration B (as illustrated in FIG. 3), it is firstly noted that the decorrelation
can transform the noise covariance matrix R_nn(ω) into:

W(ω) R_nn(ω) W^H(ω) = Λ(ω)
[0226] With W(ω) and R_nn(ω) positive definite, their inverses exist and we can write:

R_nn^-1(ω) = W^H(ω) Λ^-1(ω) W(ω)
[0227] If Λ(ω) is a scaled version of the identity matrix and can be written as

Λ(ω) = β(ω) I

then we get for the inverse of R_nn(ω):

R_nn^-1(ω) = β^-1(ω) W^H(ω) W(ω)
[0228] Using the decorrelator with W^H(ω) = W(ω) we can finally write:

R_nn^-1(ω) = β^-1(ω) W^2(ω)

W^2(ω) can be seen as the inverse of a normalized covariance matrix. Note that in the
adaptive coefficient processor 207, W(ω) is calculated, whereas in the filtering part
the squared matrix W^2(ω) is used.
[0229] We will now prove that this combination maximizes the speech-to-noise ratio in the
output. First we will derive the filter coefficients after convergence, as we did
before for the beamformer.
[0230] At the output of the filtering part we have:

[0231] In the update part we have:

[0232] Since z(ω) and s(ω) are scalars we can write:

[0234] Substitution of f_opt(ω) into f^H(ω) W^2(ω) f(ω) gives:

so the constraint is met.
[0235] Next, we look at the expected value of the update after convergence:

[0236] With

where we use W(ω) = W^H(ω) and (W^2(ω))^H = W^H(ω) W^H(ω) = W(ω) W(ω) = W^2(ω). We then
get for the update equation (Eq. 15):

with

and

.
[0237] What Equation 16 shows is that after convergence f(ω) is an eigenvector and

the corresponding eigenvalue of the matrix R_uu(ω) W^2(ω). As before, it can be shown
that the solution is only stable if

is the largest eigenvalue.
[0238] Multiplying the left and right-hand sides with f^H(ω) W^2(ω) we get:

[0239] Note that the left-hand side contains the output power of the beamformer.
[0240] Now consider:

R_uu(ω) = R_ss(ω) + R_nn(ω)

where the adaptive decorrelator has adapted on the noise source. We then have for the
output power:

[0241] The noise contribution is independent of the choice of f(ω) and thus maximizing the
output of the FSB corresponds to maximizing the speech-to-noise ratio in the output,
not necessarily the speech output itself.
[0242] The optimal solutions after convergence for solutions A and B are related by:

[0243] This means that the outputs for both solutions will be equal.
[0244] For solution C of FIG. 4, a matrix block may be placed after the filters in the feedback
circuit 103. Here the constraint equals f^H(ω) Γ(ω) f(ω) = 1, where Γ(ω) is the coherence
matrix, which is equal to a normalized covariance matrix. We want to use the decorrelator
again and we use:

[0245] Thus W^-2(ω) equals the normalized covariance matrix and we can use W^-2(ω) instead
of Γ(ω), and the constraint becomes f^H(ω) W^-2(ω) f(ω) = 1.
[0246] Like before we find the optimal solution for the filter coefficients
f(ω) by considering the signal flow in the filtering part and update part.
[0247] In the filtering part we have:

and in the update part:

where we use that z(ω) and s(ω) are complex scalars.
[0248] Substitution of (19) into (18), using (17) gives:

[0249] Using (Eq. 19) we get:

with a(ω) allpass.
[0250] Substitution of f_opt(ω) into f^H(ω) W^-2(ω) f(ω) gives:

meeting the constraint.
[0251] Finally, we look at the expected value of the update after convergence:

[0252] With z^*(ω) = u^H(ω) f(ω):

with

and

.
[0253] Multiplying both sides with W^2(ω) gives:

[0254] After convergence f(ω) is an eigenvector and

the corresponding eigenvalue of the matrix W^2(ω) R_uu(ω). As before, it can be shown
that the solution is only stable if

is the largest eigenvalue.
[0255] Multiplying Eq. 20 left and right with f^H(ω) gives:

[0256] Note that the left-hand side contains the output power of the FSB.
[0257] Now consider:

R_uu(ω) = R_ss(ω) + R_nn(ω)

where the adaptive decorrelator has adapted on the noise source. We then have for the
output power:

[0258] Maximizing the output of the FSB corresponds to maximizing the speech-to-noise ratio
in the output, not necessarily the speech output itself.
[0259] The optimal solutions after convergence for solutions C and B are related by
f_C(ω) = W^2(ω) f_B(ω).
[0260] This means that the outputs for both solutions will be equal.
[0262] As can be seen, by selecting the coefficients appropriately, the exact same (optimal)
beamform output audio signal can be generated.
[0263] However, as can be seen, the coefficients that are selected for the set of spatial
filters to achieve this output may be different.
[0264] The choice of the approach to use will depend on the preferences and requirements
of the individual embodiment.
[0265] An advantage of solution A is that the interaction between the decorrelator and beamformer
is reduced in the sense that the synergistic effect can be achieved with reduced modifications
of the beamformer, feedback circuit, and beamform adapter.
[0266] An advantage of solution B is that if the Direction of Arrival (DOA) of an audio
source is to be calculated using the beamformer coefficients, then this is easier to
do for this configuration.
[0267] For example, an approach for calculating a DOA is described in US 6774934 for a
"normal" beamformer with a speech source with possibly uncorrelated equivariant noise
added. For this use case the optimal coefficients are given by:

[0268] Part of the procedure is that the cross-power spectrum is calculated for at least
one pair of filter coefficients, e.g. f_1(ω) and f_2(ω). This gives:

where we use a(ω) a^*(ω) = 1, since a(ω) is allpass.
[0269] What is important is the phase difference, not the amplitude. This means that for
solution B we can directly use the beamformer coefficients, but for methods A and C
we must pre-multiply the coefficients with W^-1(ω) and W^-2(ω) respectively.
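Illustratively, the phase difference for a coefficient pair may be sketched as follows;
the pre-multiplication by W^-1(ω) or W^-2(ω) for solutions A and C is assumed to have
been applied beforehand:

    import numpy as np

    def pairwise_phase(f1, f2):
        # f1, f2: (n_bins,) beamformer coefficients for one microphone pair
        cross = f1 * np.conj(f2)    # cross-power spectrum of the pair
        return np.angle(cross)      # phase difference per bin, carrying the DOA cue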
[0270] The audio apparatus(es) may specifically be implemented in one or more suitably programmed
processors. The different functional blocks may be implemented in separate processors
and/or may, e.g., be implemented in the same processor. An example of a suitable processor
is provided in the following.
[0271] FIG. 6 is a block diagram illustrating an example processor 600 according to embodiments
of the disclosure. Processor 600 may be used to implement one or more processors implementing
an apparatus as previously described or elements thereof (including in particular
one or more artificial neural networks). Processor 600 may be any suitable processor type
including, but not limited to, a microprocessor, a microcontroller, a Digital Signal
Processor (DSP), a Field Programmable Gate Array (FPGA) where the FPGA has been programmed
to form a processor, a Graphical Processing Unit (GPU), an Application Specific Integrated
Circuit (ASIC) where the ASIC has been designed to form a processor, or a combination
thereof.
[0272] The processor 600 may include one or more cores 602. The core 602 may include one
or more Arithmetic Logic Units (ALU) 604. In some embodiments, the core 602 may include
a Floating Point Logic Unit (FPLU) 606 and/or a Digital Signal Processing Unit (DSPU)
608 in addition to or instead of the ALU 604.
[0273] The processor 600 may include one or more registers 612 communicatively coupled to
the core 602. The registers 612 may be implemented using dedicated logic gate circuits
(e.g., flip-flops) and/or any memory technology. In some embodiments the registers
612 may be implemented using static memory. The registers may provide data, instructions
and addresses to the core 602.
[0274] In some embodiments, processor 600 may include one or more levels of cache memory
610 communicatively coupled to the core 602. The cache memory 610 may provide computer-readable
instructions to the core 602 for execution. The cache memory 610 may provide data
for processing by the core 602. In some embodiments, the computer-readable instructions
may have been provided to the cache memory 610 by a local memory, for example, local
memory attached to the external bus 616. The cache memory 610 may be implemented with
any suitable cache memory type, for example, Metal-Oxide Semiconductor (MOS) memory
such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), and/or
any other suitable memory technology.
[0275] The processor 600 may include a controller 614, which may control input to the processor
600 from other processors and/or components included in a system and/or outputs from
the processor 600 to other processors and/or components included in the system. Controller
614 may control the data paths in the ALU 604, FPLU 606 and/or DSPU 608. Controller
614 may be implemented as one or more state machines, data paths and/or dedicated
control logic. The gates of controller 614 may be implemented as standalone gates,
FPGA, ASIC or any other suitable technology.
[0276] The registers 612 and the cache 610 may communicate with controller 614 and core
602 via internal connections 620A, 620B, 620C and 620D. Internal connections may be
implemented as a bus, multiplexer, crossbar switch, and/or any other suitable connection
technology.
[0277] Inputs and outputs for the processor 600 may be provided via a bus 616, which may
include one or more conductive lines. The bus 616 may be communicatively coupled to
one or more components of processor 600, for example the controller 614, cache 610,
and/or register 612. The bus 616 may be coupled to one or more components of the system.
[0278] The bus 616 may be coupled to one or more external memories. The external memories
may include Read Only Memory (ROM) 632. ROM 632 may be a masked ROM, Electronically
Programmable Read Only Memory (EPROM) or any other suitable technology. The external
memory may include Random Access Memory (RAM) 633. RAM 633 may be a static RAM, battery
backed up static RAM, Dynamic RAM (DRAM) or any other suitable technology. The external
memory may include Electrically Erasable Programmable Read Only Memory (EEPROM) 635.
The external memory may include Flash memory 634. The external memory may include
a magnetic storage device such as disc 636. In some embodiments, the external memories
may be included in a system.
[0279] It will be appreciated that the above description for clarity has described embodiments
of the invention with reference to different functional circuits, units and processors.
However, it will be apparent that any suitable distribution of functionality between
different functional circuits, units or processors may be used without detracting
from the invention. For example, functionality illustrated to be performed by separate
processors or controllers may be performed by the same processor or controllers. Hence,
references to specific functional units or circuits are only to be seen as references
to suitable means for providing the described functionality rather than indicative
of a strict logical or physical structure or organization.
[0280] The invention can be implemented in any suitable form including hardware, software,
firmware or any combination of these. The invention may optionally be implemented
at least partly as computer software running on one or more data processors and/or
digital signal processors. The elements and components of an embodiment of the invention
may be physically, functionally and logically implemented in any suitable way. Indeed,
the functionality may be implemented in a single unit, in a plurality of units or
as part of other functional units. As such, the invention may be implemented in a
single unit or may be physically and functionally distributed between different units,
circuits and processors.
[0281] Although the present invention has been described in connection with some embodiments,
it is not intended to be limited to the specific form set forth herein. Rather, the
scope of the present invention is limited only by the accompanying claims. Additionally,
although a feature may appear to be described in connection with particular embodiments,
one skilled in the art would recognize that various features of the described embodiments
may be combined in accordance with the invention. In the claims, the term comprising
does not exclude the presence of other elements or steps.
[0282] Furthermore, although individually listed, a plurality of means, elements, circuits
or method steps may be implemented by, e.g., a single circuit, unit or processor.
Additionally, although individual features may be included in different claims, these
may possibly be advantageously combined, and the inclusion in different claims does
not imply that a combination of features is not feasible and/or advantageous. Also,
the inclusion of a feature in one category of claims does not imply a limitation to
this category but rather indicates that the feature is equally applicable to other
claim categories as appropriate. Furthermore, the order of features in the claims
does not imply any specific order in which the features must be worked, and in particular
the order of individual steps in a method claim does not imply that the steps must
be performed in this order. Rather, the steps may be performed in any suitable order.
In addition, singular references do not exclude a plurality. Thus, references to "a",
"an", "first", "second" etc. do not preclude a plurality. Reference signs in the claims
are provided merely as a clarifying example and shall not be construed as limiting the
scope of the claims in any way.