[0001] The present application is related to stereo processing or, generally, multi-channel
processing, where a multi-channel signal has two channels such as a left channel and
a right channel in the case of a stereo signal or more than two channels, such as
three, four, five or any other number of channels.
[0002] Stereo speech and particularly conversational stereo speech has received much less
scientific attention than storage and broadcasting of stereophonic music. Indeed in
speech communications monophonic transmission is still nowadays mostly used. However
with the increase of network bandwidth and capacity, it is envisioned that communications
based on stereophonic technologies will become more popular and bring a better listening
experience.
[0003] Efficient coding of stereophonic audio material has been for a long time studied
in perceptual audio coding of music for efficient storage or broadcasting. At high
bitrates, where waveform preserving is crucial, sum-difference stereo, known as mid/side
(M/S) stereo, has been employed for a long time. For low bit-rates, intensity stereo
and more recently parametric stereo coding has been introduced. The latest technique
was adopted in different standards as HeAACv2 and Mpeg USAC. It generates a down-mix
of the two-channel signal and associates compact spatial side information.
[0004] Joint stereo coding are usually built over a high frequency resolution, i.e. low
time resolution, time-frequency transformation of the signal and is then not compatible
to low delay and time domain processing performed in most speech coders. Moreover
the engendered bit-rate is usually high.
[0005] On the other hand, parametric stereo employs an extra filter-bank positioned in the
front-end of the encoder as pre-processor and in the back-end of the decoder as post-processor.
Therefore, parametric stereo can be used with conventional speech coders like ACELP
as it is done in MPEG USAC. Moreover, the parametrization of the auditory scene can
be achieved with minimum amount of side information, which is suitable for low bit-rates.
However, parametric stereo is as for example in MPEG USAC not specifically designed
for low delay and does not deliver consistent quality for different conversational
scenarios. In conventional parametric representation of the spatial scene, the width
of the stereo image is artificially reproduced by a decorrelator applied on the two
synthesized channels and controlled by Inter-channel Coherence (ICs) parameters computed
and transmitted by the encoder. For most stereo speech, this way of widening the stereo
image is not appropriate for recreating the natural ambience of speech which is a
pretty direct sound since it is produced by a single source located at a specific
position in the space (with sometimes some reverberation from the room). By contrast,
music instruments have much more natural width than speech, which can be better imitated
by decorrelating the channels.
[0006] Problems also occur when speech is recorded with non-coincident microphones, like
in A-B configuration when microphones are distant from each other or for binaural
recording or rendering. Those scenarios can be envisioned for capturing speech in
teleconferences or for creating a virtually auditory scene with distant speakers in
the multipoint control unit (MCU). The time of arrival of the signal is then different
from one channel to the other unlike recordings done on coincident microphones like
X-Y (intensity recording) or M-S (Mid-Side recording). The computation of the coherence
of such non time-aligned two channels can then be wrongly estimated which makes the
artificial ambience synthesis fail.
[0008] Document
WO 2006/089570 A1 discloses a near-transparent or transparent multi-channel encoder/decoder scheme.
A multi-channel encoder/decoder scheme additionally generates a waveform-type residual
signal. This residual signal is transmitted together with one or more multi-channel
parameters to a decoder. In contrast to a purely parametric multi-channel decoder,
the enhanced decoder generates a multi-channel output signal having an improved output
quality because of the additional residual signal. On the encoder-side, a left channel
and a right channel are both filtered by an analysis filterbank. Then, for each subband
signal, an alignment value and a gain value are calculated for a subband. Such an
alignment is then performed before further processing. On the decoder-side, a de-alignment
and a gain processing is performed and the corresponding signals are then synthesized
by a synthesis filterbank in order to generate a decoded left signal and a decoded
right signal.
[0009] In such stereo processing applications, the calculation of an inter-channel or inter
channel time difference between a first channel signal and a second channel signal
is useful in order to typically perform a broadband time alignment procedure. However,
other applications do exist for the usage of an inter-channel time difference between
a first channel and a second channel, where these applications are in storage or transmission
of parametric data, stereo/multi-channel processing comprising a time alignment of
two channels, a time difference of arrival estimation for a determination of a speaker
position in a room, beamforming spatial filtering, foreground/background decomposition
or the location of a sound source by, for example, acoustic triangulation in order
to only name a few.
[0010] For all such applications, an efficient, accurate and robust determination of an
inter-channel time difference between a first and a second channel signal is necessary.
[0011] There do already exist such determinations known under the term "GCC-PHAT" or, stated
differently, generalized cross-correlation phase transform. Typically, a cross-correlation
spectrum is calculated between the two channel signals and, then, a weighting function
is applied to the cross-correlation spectrum for obtaining a so-called generalized
cross-correlation spectrum before performing an inverse spectral transform such as
an inverse DFT to the generalized cross-correlation spectrum in order to find a time-domain
representation. This time-domain representation represents values for certain time
lags and the highest peak of the time-domain representation then typically corresponds
to the time delay or time difference, i.e., the inter-channel time delay of difference
between the two channel signals.
[0012] An exemplary approach for determining an inter-channel time delay from a weighted
cross-correlation function is disclosed in
US 2012/0016632 A1.
[0013] However, it has been shown that, particularly in signals that are different from,
for example, clean speech without any reverberation or background noise, the robustness
of this general technique is not optimum.
[0014] It is, therefore, an object of the present invention to provide an improved concept
for estimating an inter-channel time difference between two channel signals.
[0015] This object is achieved by an apparatus for estimating an inter-channel time difference
in accordance with claim 1, or a method for estimating an inter-channel time difference
in accordance with claim 14 or a computer program in accordance with claim 15.
[0016] The present invention is based on the finding that a weighting a smoothed or non-smoothed
cross-correlation spectrum to obtain a weighted cross correlation spectrum is to be
done using a first weighting procedure or using a second weighting procedure depending
on a signal characteristic estimated by a signal analyzer, wherein the first weighting
procedure is different from the second weighting procedures.
[0017] In a further embodiment, smoothing of the cross-correlation spectrum over time that
is controlled by a spectral characteristic of the spectrum of the first channel signal
or the second channel signal significantly improves the robustness and accuracy of
the inter-channel time difference determination.
[0018] In preferred embodiments, a tonality/noisiness characteristic of the spectrum is
determined, and in case of tone-like signal, a smoothing is stconger while, in case
of a noisiness signal, a smoothing is made less strong.
[0019] Preferably, a spectral flatness measure is used and, in case of tone-like signals,
the spectral flatness measure will be low and the smoothing will become stronger,
and in case of noise-like signals, the spectral flatness measure will be high such
as about 1 or close to 1 and the smoothing will be weak.
[0020] Thus, in accordance with the present invention, an apparatus for estimating an inter-channel
time difference between a first channel signal and a second channel signal comprises
a calculator for calculating a cross-correlation spectrum for a time block for the
first channel signal in the time block and the second channel signal in the time block.
The apparatus further comprises a spectral characteristic estimator for estimating
a characteristic of a spectrum of the first channel signal and the second channel
signal for the time block and, additionally, a smoothing filter for smoothing the
cross-correlation spectrum over time using the spectral characteristic to obtain a
smoothed cross-correlation spectrum. Then, the smoothed cross-correlation spectrum
is further processed by a processor in order to obtain the inter-channel time difference
parameter.
[0021] For preferred embodiments related to the further processing of the smoothed cross-correlation
spectrum, an adaptive thresholding operation is performed, in which the time-domain
representation of the smoothed generalized cross-correlation spectrum is analyzed
in order to determine a variable threshold, that depends on the time-domain representation
and a peak of the time-domain representation is compared to the variable threshold,
wherein an inter-channel time difference is determined as a time lag associated with
a peak being in a predetermined relation to the threshold such as being greater than
the threshold.
[0022] In one embodiment, the variable threshold is determined as a value being equal to
an integer multiple of a value among the largest, for example ten percents of the
values of the time domain representation or, alternatively, in a further embodiment
for the variable determination, the variable threshold is calculated by a multiplication
of the variable threshold and the value, where the value depends on a signal-to-noise
ratio characteristic of the first and the second channel signals, where the value
becomes higher for a higher signal-to-noise ratio and becomes lower for a lower signal-to-noise
ratio.
[0023] As stated before, the inter-channel time difference calculation can be used in many
different applications such as the storage or transmission of parametric data, a stereo/multi-channel
processing/encoding, a time alignment of two channels, a time difference of arrival
estimation for the determination of a speaker position in a room with two microphones
and a known microphone setup, for the purpose of beamforming, spatial filtering, foreground/background
decomposition or a location determination of a sound source, for example by acoustic
triangulation based on time differences of two or three signals.
[0024] In the following, however, a preferred implementation and usage of the inter-channel
time difference calculation is described for the purpose of broadband time alignment
of two stereo signals in a process of encoding a multi-channel signal having the at
least two channels.
[0025] An apparatus for encoding a multi-channel signal having at least two channels comprises
a parameter determiner to determine a broadband alignment parameter on the one hand
and a plurality of narrowband alignment parameters on the other hand. These parameters
are used by a signal aligner for aligning the at least two channels using these parameters
to obtain aligned channels. Then, a signal processor calculates a mid-signal and a
side signal using the aligned channels and the mid-signal and the side signal are
subsequently encoded and forwarded into an encoded output signal that additionally
has, as parametric side information, the broadband alignment parameter and the plurality
of narrowband alignment parameters.
[0026] On the decoder-side, a signal decoder decodes the encoded mid-signal and the encoded
side signal to obtain decoded mid and side signals. These signals are then processed
by a signal processor for calculating a decoded first channel and a decoded second
channel. These decoded channels are then de-aligned using the information on the broadband
alignment parameter and the information on the plurality of narrowband parameters
included in an encoded multi-channel signal to obtain the decoded multi-channel signal.
[0027] In a specific implementation, the broadband alignment parameter is an inter-channel
time difference parameter and the plurality of narrowband alignment parameters are
inter channel phase differences.
[0028] The present invention is based on the finding that specifically for speech signals
where there is more than one speaker, but also for other audio signals where there
are several audio sources, the different places of the audio sources that both map
into two channels of the multi-channel signal can be accounted for using a broadband
alignment parameter such as an inter-channel time difference parameter that is applied
to the whole spectrum of either one or both channels. In addition to this broadband
alignment parameter, it has been found that several narrowband alignment parameters
that differ from subband to subband additionally result in a better alignment of the
signal in both channels.
[0029] Thus, a broadband alignment corresponding to the same time delay in each subband
together with a phase alignment corresponding to different phase rotations for different
subbands results in an optimum alignment of both channels before these two channels
are then converted into a mid/side representation which is then further encoded. Due
to the fact that an optimum alignment has been obtained, the energy in the mid-signal
is as high as possible on the one hand and the energy in the side signal is as small
as possible on the other hand so that an optimum coding result with a lowest possible
bitrate or a highest possible audio quality for a certain bitrate can be obtained.
[0030] Specifically for conversional speech material, it appears that there are typically
speakers being active at two different places. Additionally, the situation is such
that, normally, only one speaker is speaking from the first place and then the second
speaker is speaking from the second place or location. The influence of the different
locations on the two channels such as a first or left channel and a second or right
channel is reflected by different time of arrivals and, therefore, a certain time
delay between both channels due to the different locations, and this time delay is
changing from time to time. Generally, this influence is reflected in the two channel
signals as a broadband de-alignment that can be addressed by the broadband alignment
parameter.
[0031] On the other hand, other effects, particularly coming from reverberation or further
noise sources can be accounted for by individual phase alignment parameters for individual
bands that are superposed on the broadband different arrival times or broadband de-alignment
of both channels.
[0032] In view of that, the usage of both, a broadband alignment parameter and a plurality
of narrowband alignment parameters on top of the broadband alignment parameter result
in an optimum channel alignment on the encoder-side for obtaining a good and very
compact mid/side representation while, on the other hand, a corresponding de-alignment
subsequent to a decoding on the decoder side results in a good audio quality for a
certain bitrate or in a small bitrate for a certain required audio quality.
[0033] An advantage of the present invention is that it provides a new stereo coding scheme
much more suitable for a conversion of stereo speech than the existing stereo coding
schemes. In accordance with the invention, parametric stereo technologies and joint
stereo coding technologies are combined particularly by exploiting the inter-channel
time difference occurring in channels of a multi-channel signal specifically in the
case of speech sources but also in the case of other audio sources.
[0034] Several embodiments provide useful advantages as discussed later on.
[0035] The new method is a hybrid approach mixing elements from a conventional M/S stereo
and parametric stereo. In a conventional M/S, the channels are passively downmixed
to generate a Mid and a Side signal. The process can be further extended by rotating
the channel using a Karhunen-Loeve transform (KLT), also known as Principal Component
Analysis (PCA) before summing and differentiating the channels. The Mid signal is
coded in a primary code coding while the Side is conveyed to a secondary coder. Evolved
M/S stereo can further use prediction of the Side signal by the Mid Channel coded
in the present or the previous frame. The main goal of rotation and prediction is
to maximize the energy of the Mid signal while minimizing the energy of the Side.
M/S stereo is waveform preserving and is in this aspect very robust to any stereo
scenarios, but can be very expensive in terms of bit consumption.
[0036] For highest efficiency at low bit-rates, parametric stereo computes and codes parameters,
like Inter-channel Level differences (ILDs), Inter-channel Phase differences (IPDs),
Inter-channel Time differences (ITDs) and Inter-channel Coherence (ICs). They compactly
represent the stereo image and are cues of the auditory scene (source localization,
panning, width of the stereo...). The aim is then to parametrize the stereo scene
and to code only a downmix signal which can be at the decoder and with the help of
the transmitted stereo cues be once again spatialized.
[0037] Our approach mixed the two concepts. First, stereo cues ITD and IPD are computed
and applied on the two channels. The goal is to represent the time difference in broadband
and the phase in different frequency bands. The two channels are then aligned in time
and phase and M/S coding is then performed. ITD and IPD were found to be useful for
modeling stereo speech and are a good replacement of KLT based rotation in M/S. Unlike
a pure parametric coding, the ambience is not more modeled by the ICs but directly
by the Side signal which is coded and/or predicted. It was found that this approach
is more robust especially when handling speech signals.
[0038] The computation and processing of ITDs is a crucial part of the invention. ITDs were
already exploited in the prior art Binaural Cue Coding (BCC), but in a way that it
was inefficient once ITDs change over time. For avoiding this shortcoming, specific
windowing was designed for smoothing the transitions between two different ITDs and
being able to seamlessly switch from one speaker to another positioned at different
places.
[0039] Further embodiments are related to the procedure that, on the encoder-side, the parameter
determination for determining the plurality of narrowband alignment parameters is
performed using channels that have already been aligned with the earlier determined
broadband alignment parameter.
[0040] Correspondingly, the narrowband de-alignment on the decoder-side is performed before
the broadband de-alignment is performed using the typically single broadband alignment
parameter.
[0041] In further embodiments, it is preferred that, either on the encoder-side but even
more importantly on the decoder-side, some kind of windowing and overlap-add operation
or any kind of crossfading from one block to the next one is performed subsequent
to all alignments and, specifically, subsequent to a time-alignment using the broadband
alignment parameter. This avoids any audible artifacts such as clicks when the time
or broadband alignment parameter changes from block to block.
[0042] In other embodiments, different spectral resolutions are applied. Particularly, the
channel signals are subjected to a time-spectral conversion having a high frequency
resolution such as a DFT spectrum while the parameters such as the narrowband alignment
parameters are determined for parameter bands having a lower spectral resolution.
Typically, a parameter band has more than one spectral line than the signal spectrum
and typically has a set of spectral lines from the DFT spectrum. Furthermore, the
parameter bands increase from low frequencies to high frequencies in order to account
for psychoacoustic issues.
[0043] Further embodiments relate to an additional usage of a level parameter such as an
inter-level difference or other procedures for processing the side signal such as
stereo filling parameters, etc. The encoded side signal can represented by the actual
side signal itself, or by a prediction residual signal being performed using the mid
signal of the current frame or any other frame, or by a side signal or a side prediction
residual signal in only a subset of bands and prediction parameters only for the remaining
bands, or even by prediction parameters for all bands without any high frequency resolution
side signal information. Hence, in the last alternative above, the encoded side signal
is only represented by a prediction parameter for each parameter band or only a subset
of parameter bands so that for the remaining parameter bands there does not exist
any information on the original side signal.
[0044] Furthermore, it is preferred to have the plurality of narrowband alignment parameters
not for all parameter bands reflecting the whole bandwidth of the broadband signal
but only for a set of lower bands such as the lower 50 percents of the parameter bands.
On the other hand, stereo filling parameters are not used for the couple of lower
bands, since, for these bands, the side signal itself or a prediction residual signal
is transmitted in order to make sure that, at least for the lower bands, a waveform-correct
representation is available. On the other hand, the side signal is not transmitted
in a waveform-exact representation for the higher bands in order to further decrease
the bitrate, but the side signal is typically represented by stereo filling parameters.
[0045] Furthermore, it is preferred to perform the entire parameter analysis and alignment
within one and the same frequency domain based on the same DFT spectrum. To this end,
it is furthermore preferred to use the generalized cross correlation with phase transform
(GCC-PHAT) technology for the purpose of inter-channel time difference determination.
In a preferred embodiment of this procedure, a smoothing of a correlation spectrum
based on an information on a spectral shape, the information preferably being a spectral
flatness measure is performed in such a way that a smoothing will be weak in the case
of noise-like signals and a smoothing will become stronger in the case of tone-like
signals.
[0046] Furthermore, it is preferred to perform a special phase rotation, where the channel
amplitudes are accounted for. Particularly, the phase rotation is distributed between
the two channels for the purpose of alignment on the encoder-side and, of course,
for the purpose of de-alignment on the decoder-side where a channel having a higher
amplitude is considered as a leading channel and will be less affected by the phase
rotation, i.e., will be less rotated than a channel with a lower amplitude.
[0047] Furthermore, the sum-difference calculation is performed using an energy scaling
with a scaling factor that is derived from energies of both channels and is, additionally,
bounded to a certain range in order to make sure that the mid/side calculation is
not affecting the energy too much. On the other hand, however, it is to be noted that,
for the purpose of the present invention, this kind of energy conservation is not
as critical as in prior art procedures, since time and phase were aligned beforehand.
Therefore, the energy fluctuations due to the calculation of a mid-signal and a side
signal from left and right (on the encoder side) or due to the calculation of a left
and a right signal from mid and side (on the decoder-side) are not as significant
as in the prior art.
[0048] Subsequently, preferred embodiments of the present invention are discussed with respect
to the accompanying drawings in which:
- Fig. 1
- is a block diagram of a preferred implementation of an apparatus for encoding a multi-channel
signal;
- Fig. 2
- is a preferred embodiment of an apparatus for decoding an encoded multi-channel signal;
- Fig. 3
- is an illustration of different frequency resolutions and other frequency-related
aspects for certain embodiments;
- Fig. 4a
- illustrates a flowchart of procedures performed in the apparatus for encoding for
the purpose of aligning the channels;
- Fig. 4b
- illustrates an embodiment of procedures performed in the frequency domain;
- Fig. 4c
- illustrates an embodiment of procedures performed in the apparatus for encoding using
an analysis window with zero padding portions and overlap ranges;
- Fig. 4d
- illustrates a flowchart for further procedures performed within the apparatus for
encoding;
- Fig. 4e
- illustrates a flowchart for showing an implementation of an inter-channel time difference
estimation;
- Fig. 5
- illustrates a flowchart illustrating a further embodiment of procedures performed
in the apparatus for encoding;
- Fig. 6a
- illustrates a block chart of an embodiment of an encoder;
- Fig. 6b
- illustrates a flowchart of a corresponding embodiment of a decoder;
- Fig. 7
- illustrates a preferred window scenario with low-overlapping sine windows with zero
padding for a stereo time-frequency analysis and synthesis;
- Fig. 8
- illustrates a table showing the bit consumption of different parameter values;
- Fig. 9a
- illustrates procedures performed by an apparatus for decoding an encoded multi-channel
signal in a preferred embodiment;
- Fig. 9b
- illustrates an implementation of the apparatus for decoding an encoded multi-channel
signal;
- Fig. 9c
- illustrates a procedure performed in the context of a broadband de-alignment in the
context of the decoding of an encoded multi-channel signal;
- Fig. 10a
- illustrates an embodiment of an apparatus for estimating an inter-channel time difference;
- Fig. 10b
- illustrates a schematic representation of a signal further processing where the inter-channel
time difference is applied;
- Fig. 10c
- illustrates a schematic representation of the signal analyzer implemented as a noise
estimator in an embodiment and the weighter in accordance with embodiments of the
invention;
- Fig. 10d
- illustrates a schematic representation of the weighter in accordance with embodiments
of the invention;
- Fig. 10e
- illustrates a schematic representation of the processor in accordance with embodiments
of the invention;
- Fig. 10f
- illustrates a schematic representation of the noise estimator in accordance with embodiments
of the invention;
- Fig. 11a
- illustrates procedures performed by the processor of Fig. 10a;
- Fig. 11b
- illustrates further procedures performed by the processor in Fig. 10a;
- Fig. 11c
- illustrates a further implementation of the calculation of a variable threshold and
the usage of the variable threshold in the analysis of the time-domain representation;
- Fig. 11d
- illustrates a first embodiment for the determination of the variable threshold;
- Fig. 11e
- illustrates a further implementation of the determination of the threshold;
- Fig. 11f
- illustrates a schematic representation of the processor in accordance with embodiments
of the invention;
- Fig. 12
- illustrates a time-domain representation for a smoothed cross-correlation spectrum
for a clean speech signal;
- Fig. 13
- illustrates a time-domain representation of a smoothed cross-correlation spectrum
for a speech signal having noise and ambiance.
[0049] Fig. 10a illustrates an embodiment of an apparatus for estimating, an inter-channel
time difference between a first channel signal such as a left channel and a second
channel signal such as a right channel. These channels are input into a time-spectral
converter 150 that is additionally illustrated, with respect to Fig. 4e as item 451.
[0050] Furthermore, the time-domain representations of the left and the right channel signals
are input into a calculator 1020 for calculating a cross-correlation spectrum for
a time block from the first channel signal in the time block and the second channel
signal in the time block. Furthermore, the apparatus comprises a spectral characteristic
estimator 1010 for estimating a characteristic of a spectrum of the first channel
signal or the second channel signal for the time block. The apparatus further comprises
a smoothing filter 1030 for smoothing the cross-correlation spectrum over time using
the spectral characteristic to obtain a smoothed cross-correlation spectrum. The apparatus
further comprises a processor 1040 for processing the smoothed correlation spectrum
to obtain the inter-channel time difference.
[0051] Alternatively, in another embodiment, the element 1030 is not present, and, therefore,
the element 1010 is not necessary as well, as indicated by the broken line 1035. The
apparatus further comprises a signal analyzer 1037 calculating a signal characteristic
estimate such as a noise estimate 1038. This estimate is forwarded to a weighter 1036
configured for performing different weighting operations depending on the signal characteristic
estimate. The signal characteristic estimate is preferably also used to control the
processor 1040 for example when the processor 1040 performs the peak picking operation.
Fig. 10c further illustrates the signal analyzer 1037 and the controllable weighter
1036.
[0052] Particularly, an apparatus in accordance with embodiments of the present invention
is directed to the estimation of an inter-channel time difference between a first
channel signal and a second channel signal. This device comprises the signal analyzer
1037 of Fig. 10a, a cross-correlation spectrum calculator 1020 of Fig. 10a, a weighter
1036 for weighting a smoothed or a non-smoothed cross-correlation spectrum of Fig.
10a and a subsequently connected processor 1040 for processing the weighted cross-correlation
spectrum.
[0053] The elements time-spectrum converter 150, spectral characteristic estimator 1010,
smoothing filter 1030 are not necessary for a basic implementation of the preset invention,
but are preferred for preferred embodiments of the present invention. The signal analyzer
1037 is configured for estimating a signal characteristic such as a noise level 1038
of the first channel signal or the second channel signal or both signals or a signal
derived from the first channel signal or the second channel signal. Thus, a signal
characteristic or signal characteristic estimate such as a noise estimate to be used
later on by the weighter 1036 and, preferably, also used by the processor 1040 can
be derived only from the left or first channel signal, only from the second or right
channel signal, or can be derived from both signals. The derivation of the signal
characteristic from both signals could, for example, be a derivation of an individual
signal characteristic of the first channel signal, an additional individual signal
characteristic from the second or right channel signal and, then, the final signal
characteristic 1038 would be, for example, an average or a weighted average between
both channels. Here, for example the weighting can be done in accordance with the
amplitude so that different amplitudes in, for example, frames of the channels result
in different influences of the corresponding individual noise estimate into the final
noise level 1038. Furthermore, the signal derived from the first channel signal and
the second channel signal could be, for example, a combination signal obtained by
adding the left or first channel signal and the second or right channel signal together
to obtain a combined signal and, then, the signal characteristic 1038 is calculated
from the combined signal.
[0054] In a preferred embodiment, the signal analyzer 1036 is implemented as a noise estimator
or analyzer. However, other ways of signal analysis can be performed as well such
as a tonality analysis, a voice activity detection, a transient analysis, a stereo
analysis, a speech/music analysis, interfering-talker analysis, a background music
analysis, a clean speech analysis or any other signal analysis in order to determine,
whether a signal has a first characteristic or a second characteristic so that the
matching weighting procedure is selected.
[0055] The combination can be a combination with equal weighting factors, i.e., a combination
of the left channel without any weighting and the right channel without any weighting
which would, correspond to weighting factors of 1.0 or, alternatively, different weighting
factors can be applied. Furthermore, the signal derived from the first channel or
the signal derived from the second channel can be obtained by performing a high-pass
filtering or low-pass filtering or can be derived by performing a processing using
an amplitude compression or an amplitude inverse compression function. An amplitude
compression function would be a log function or a function with a power value being
smaller than 1. An inverse compression function would be an exponential function or
a power function with an exponent being greater than 1. Thus, depending on certain
implementations, different processing operations can be applied to different left
and right channel signals and both channels can be combined or not. In the preferred
embodiment, the left and the right channels are added together preferably even without
any specific weighting and the signal characteristic estimate is then calculated from
the result of the combination calculation.
[0056] The calculator 1020 for calculating a cross-correlation spectrum for a time block
from the first channel signal in the time block and the second channel signal in the
time block can be implemented in several ways. One way is that a cross-correlation
is calculated from the time domain signals in the time domain frames and the result
is then converted from the time domain into the spectral domain. Another implementation
is that, for example, by using a DFT or any other time-to-spectral conversion, subsequent
frames of the first channel signal and subsequent frames of the second channel signal
are converted into a spectral representation where the subsequent frames can be overlapping
or non-overlapping. Thus, for each time block of the first channel signal, a spectral
representation is obtained and, correspondingly, for each time block of the second
channel signal, a spectral representation is obtained. The cross-correlation calculation
is performed by multiplying a spectral value of a certain frequency bin k and a certain
time block or time sample index s by the conjugate complex value of the spectral value
with the same index k and the same index s from the spectral representation of the
same time block of the second channel. Other cross-correlation calculation procedures
different from the above described can be used as well in order to calculate the cross-correlation
spectrum for a time block.
[0057] The weighter 1036 is configured for weighting the cross-correlation spectrum obtained
by the calculator. In an implementation, the cross-correlation spectrum is a non-smoothed
cross-correlation spectrum, but in other embodiments, the cross-correlation spectrum
is smoothed where this smoothing is a smoothing with respect to time. Thus, for the
purpose of calculating the smoothed cross-correlation spectrum, the cross-correlation
spectrum of the last block can be used together with a (raw) cross-correlation spectrum
of the current block and, depending on the implementation, a smoothing control information
can be used as is, for example, provided by the spectral characteristic estimator
1010 of Fig. 10a. However, the smoothing can also be performed using a predetermined,
i.e., constant or time-invariant smoothing setting. In accordance with embodiments
of the invention, the weighted cross-correlation spectrum is calculated using a first
weighting procedure 1036a or using a second weighted procedure 1036b which are, for
example, illustrated in Fig. 10d. Particularly, the selection, whether the weighted
cross-correlation spectrum is derived using the first or the second procedure is done
depending on the signal characteristic estimated by the signal analyzer 1037. Thus,
in accordance with the present invention, a weighting with a first weighting characteristic
is used for a certain signal characteristic of the first channel or the second channel
or the combined signal while a second weighting procedure is applied depending on
another signal characteristic as determined by the signal analyzer 1037. The result
of the weighter 1036 is a weighted and smoothed or non-smoothed cross-correlation
spectrum that is then further processed by the processor 1040 to obtain the inter-channel
time difference between the first channel signal and the second channel signal.
[0058] Fig. 10d illustrates an implementation of the signal analyzer as a noise estimator
and the weighter in connection with the processor 1040 in accordance with an embodiment
of the invention. Particularly, the noise estimator 1037 comprises a noise estimate
calculator 1037a and a noise estimate classifier 1037b. The noise estimate classifier
1037b outputs a control signal 1050 corresponding to the noise estimate output 1038
generated by block 1037 in Fig. 10a. This control signal can be applied to a first
switch 1036c or a second switch 1036d. In this implementation, processing kernels
1036a implementing the first weighting procedure and another calculating kernel for
implementing the second weighting procedure 1036b is provided. Depending on the implementation,
only switch 1036c is provided and, depending on the control signal 1050, only the
weighting procedure as determined by the switch 1036c is selected, i.e., the cross-correlation
spectrum as determined by the calculator 1020 is input into the switch 1036c and depending
on the switch setting, forwarding to either the kernel 1036a or the kernel 1036b.
In another implementation, switch 1036c is not there by the cross-correlation spectrum
as determined by block 1020 is fed into both processing kernels 1036a and 1036b and,
depending on the control of the output switch 1036d, either the output of block 1036a
or the output of block 1036b is selected and forwarded to the processor 1040. Thus,
depending on the implementation, only a single weighted cross-correlation spectrum
is calculated where the selection of which one is calculated is done by the control
signal 1050 and the input switch. Alternatively, both weighted cross-correlation spectra
are calculated and only the cross-correlation spectrum that is selected by the output
switch 1036d is forwarded to the processor 1040. Furthermore, only a single processing
kernel can be there without any input/output switches and depending on the control
signal, the correct weighting procedure is set for the corresponding time block. Thus,
for each time block, a noise estimate or control signal 1050 can be calculated and,
for each time block, the weighting can be switched from one weighting procedure to
the other weighting procedure. In this context, it is to be noted that there can also
be implemented three or more different weighting procedures depending on three or
more different noise estimates as the case may be. Thus, the present invention not
only incurs the selection between two different weighting procedures, but also includes
the selection between three or more weighting procedures depending on a control signal
derived from the noise characteristic of the first and the second channel signals.
[0059] In a preferred implementation, the first weighting procedure comprises a weighting
so that an amplitude is normalized and a phase is maintained and the second weighting
procedure comprises a weighting factor derived from the smoothed or the non-smoothed
cross-correlation spectrum using a power operation having a power being lower than
1 or greater than 0. Furthermore, the first weighting procedure can be most identical
to the second weighting procedure except that the second weighting procedure uses
a power between 0 and 1, i.e., a power being greater than 0 and smaller than 1, while
the first weighting procedure does not apply any power or, stated in other words,
applies a power of 1. Thus, the normalization performed by the second weighting procedure
is compressed, i.e., that normalization factor applied by the first weighing procedure
has some value and the normalization factor applied via the second weighting procedure
to the same spectral cross-correlation value has a smaller magnitude. This applies
for higher spectral values of the cross-correlation spectrum. However, for small values
of the cross-correlation spectrum, the normalization value for the second weighting
procedure is greater than the normalization value for the first weighting procedure
with respect to the same spectral value of the cross-correlation spectrum. This is
due to the fact that a power operation with a power lower than 1 such as a square
root operation having a power of 1/2 increases small values but lowers high values.
Thus, additional weighting factor calculations for the second weighting procedure
can also comprise any compression function such as a log function. In a preferred
embodiment, the first weighting procedure operates based on the weighting applied
for the phase transform (PHAT), and the second weighting procedure operates based
on the calculations applied for modified cross-power spectrum phase procedure (MCSP).
[0060] Furthermore, the second weighting procedure is preferably implemented to comprise
a normalization so that an output range of the second normalization procedure is in
a range, in which an output range of the first normalization procedure is positioned,
or so that the output range of the second normalization procedure is the same as an
output range of the first normalization procedure. This can, for example, be implemented
by calculating the absolute values of all spectral values of the MCSP-weighted cross-correlation
spectrum, by adding together all magnitudes of one spectral representation corresponding
to one time block and by then dividing the result by the number of spectral values
in a time block.
[0061] Generally, the processor 1040 of Fig. 10a is configured to perform some processing
steps with respect to the weighted cross-correlation spectrum where, particularly,
a certain peak picking operation is performed in order to finally obtain the inter-channel
time difference. Preferably, this peak picking operation takes place in the time domain,
i.e., the weighted and smoothed or non-smoothed cross-correlation spectrum is converted
from the spectral representation in a time domain representation and, then, this time
domain representation is analyzed and, particularly, a peak or several peaks are picked
based on a threshold. Depending on the setting of the noise estimate, either a first
peak picking operation or a second peak picking operation is performed, where, preferably,
both peak picking operations are different from each other with respect to the threshold
used by the peak picking operation.
[0062] Fig. 10e illustrates a situation being similar, with respect to input switch 1040
and output switch 1043, to the procedure in Fig. 10d. In an implementation illustrated
in Fig. 10e, both peak picking operations can be applied and the result of the "correct"
peak picking operation can be selected by the output switch 1043. Alternatively, the
input switch is there and depending on the control signal 1050, only the correct peak
picking procedure is selected, i.e., either 1041 or 1042. Thus, in an implementation,
there will not be both switches, but in an implementation there will be either the
input switch 1040 or the output switch 1043 in analogy of what has been derived before
with respect to Fig. 10d. In an additional implementation, there only exists a single
processing kernel applying the peak picking operation with a variable threshold and
the control signal 1050 is used in order to set the correct threshold within the single
processing kernel. In a preferred embodiment, the threshold setting is performed in
such a way that the second threshold is higher than the first threshold, where the
second threshold, therefore, is used when the second weighting procedure in block
1036b has been applied, and where the first threshold is used, when the first weighting
procedure in block 1036a has been applied. Thus, when a high level of background noise
is detected, then the second weighting procedure with a power between 0 and 1 or a
log operation, i.e., a compression procedure is applied and, then, the threshold for
the peak picking should be lower compared to a peak picking threshold to be used when
a low level of background noise is detected, i.e., when the first weighting procedure
is applied that performs a normalization with a normalization factor that does not
rely on a compression function such as a log function or a power function with a power
smaller than 1.
[0063] Subsequently, a preferred implementation of the signal analyzer as the noise estimator
1037 is illustrated in Fig. 10f. Basically, the noise estimator 1037 consists of a
noise estimate calculator 1037a and a noise estimate classifier 1037b as illustrated
in Fig. 10d and also indicated in Fig. 10f. The noise estimate calculator 1037a comprises
a background noise estimator 1060 and the subsequently connected (time) smoother 1061
which can, for example, be implemented as an IIR filter.
[0064] The input into the noise estimate calculator 1037a or, particularly, the background
noise estimator 1060 is a frame of the left or first channel signal, a frame of the
second or right channel signal or a signal derived from such channel signal or a combined
signal obtained by adding, for example, a time domain representation of the first
channel signal and a time domain representation of the second channel signal in the
same time block.
[0065] With respect to the noise estimate classifier 1037b, the input signal is delivered
to a signal activity detector 1070 controlling a selector 1071. Based on the result
of the signal activity detector 1070, the selector 1071 selects the active frames
only. Furthermore, a signal level calculator 1072 is connected subsequent to the selector
1071. The calculated signal level is then forwarded to a (time) smoother 1073 which
is, for example, implemented as an IIR filter. Then, in blocks 1074, a signal-to-noise
ratio calculation takes place and the result is compared, within a comparator 1075
to a preferably predetermined threshold which is, for example, between 45 dB and 25
dB and preferably is even in a range between 30 and 40 dB and more preferably, is
at 35 dB.
[0066] The output of the comparator 1075 is the detection result indicating either a high
noise level or a low noise level or indicating that a threshold setting in a certain
way is to be performed by a single weighting procedure processor or, when there are
two weighing procedure processors as illustrated in Fig. 10d, then the decision result
from the comparator 1075, i.e., signal 1050 controls either the input switch 1036c
or the output switch 1036d in order to forward the correctly weighted cross-correlation
spectrum to the processor 1040.
[0067] The detection result 1050 is preferably calculated for each time block or frame.
Thus, when, for example, for a certain frame, the signal activity detector 1070 indicates
that this is a non-active frame, then neither a signal level calculation nor a time
smoothing is performed for this frame, since the selector 1071 only selects an active
frame. Thus, for an inactive frame an SNR ratio calculation is not performed in an
embodiment and, therefore, in this embodiment, for this inactive frame, a detection
result is not provided at all. Thus, in an implementation, the same weighting procedure
as has been determined before with respect to the last active frame is used or, alternatively,
for an inactive frame, either the first weighting procedure or the second weighting
procedure or even a third weighting procedure is applied as fallback solution. Alternatively,
the SNR ratio calculator 1074 can be implemented to use, for an inactive frame, the
time-smoothed signal level of the last or most recently occurring active frame. Thus,
detection result can either be obtained even for inactive frames or, for inactive
frames, a certain (fallback) weighting procedure is used or, for inactive frames,
the same weighting procedure as has been determined for the last active frame preceding
the inactive frame is continued to be used as the case may be.
[0068] In a previous patent application [1], an Inter-channel Time Difference (ITD) estimator
was introduced. This estimator is based on the Generalized Cross-Correlation with
PHAse Transform (GCC-PHAT), a technique widely used in the TDOA literature (initial
paper is [2], another good reference is [3]). The time difference between the two
channels is found by peak-picking the output of the GCC. Better robustness can be
obtained either by using a large analysis window length or by smoothing the cross-correlation
spectrum over time. The main contribution of [1] was to make this smoothing adaptive
with a smoothing factor dependent on a spectral flatness measure.
[0070] The GGC-PHAT performs very well in low noise, reverberative environments (see for
example [3]). However, when the level of the background noise is high or at presence
of other signal components (such as music, transients, complex stereo scenes, frames
classified as inactive, interfering talkers), the GCC-PHAT performance drops significantly.
The GCC output is then noisy and does not contain one single strong peak. Consequently,
a peak-picking often fails to find the correct ITD. This is because the Phase Transform
treats all frequencies equally, regardless of the signal-to-noise ratio. The GCC is
then polluted by the phase of the bins whose signal-to-noise ratio is low.
[0071] To avoid this problem, many other GCC weightings were proposed in the literature.
One of them was found to be very effective on our problematic test signals. It was
first proposed in [4] and was called at that time "modified cross-power spectrum phase"
(MCSP). Its good performance in high noise environments was later confirmed in several
other papers (see e.g. [5]). The weighting (Step 4. of the prior art) is modified
as follows:

with p a parameter between 0 and 1.
p = 0 corresponds to the case of the normal cross-correlation and
ρ = 1 corresponds to the case of the GCC-PHAT. A value below but close to 1 is usually
used, which allows to modify the GCC-PHAT by putting more emphasis to the bins with
high correlation, those usually corresponding to the signal while the bins with low
correlation corresponding to the noise. More precisely, we have found that a value
ρ = 0.8 gave the best performance (it was 0.75 in [4] and 0.78 in [5]).
[0072] Unfortunately, this new weighting performs better than GCC-PHAT only when a high-level
of background noise is present. Alternative scenarios where the new weighting possibly
performs better than GCC-PHAT are inactive frames (i.e. voice activity detection detects
inactive, which could indicate a low speech level), presence of transients, complex
stereo scenarios, music, interfering talkers, presence of background music, speech
which is not clean, In clean environments, like speech with no or only a low-level
of background noise or music or other signal components which deviate from clean speech,
GCC-PHAT still performs better. In order to always achieve the best results, it became
then necessary to switch between the two approaches depending on the signal content.
[0073] To detect the presence of high level of background noise in the signal, a noise estimator
together with a signal activity detector (SAD) are used. The level of the signal
lS can be estimated on the frames where the SAD detects a signal, while the level of
the noise
lN is estimated by the noise estimator. The presence of high level of background noise
is then simply detected by comparing the signal-to-noise-ratio snr =
lS -
lN (in dB) to a threshold, e.g. if
snn < 35 then high noise level is detected.
[0074] Once it is known whether the signal contains a high level of background noise or
not, a decision is made to select either the PHAT weighting or the MCSP weighting
for computing the GCC (Step 4. in the prior art). The peak-picking (Step 6. in the
prior art) can also be modified depending on whether there is high background noise
level detected, for exampling by lowering the threshold.
[0075] Subsequently, a preferred embodiment is described in a step by step manner.
0. High background noise level detection:
- a. a noise estimator (for example from [6]) is used to estimate the level of background
noise lN. An IIR smoothing filter is used to smooth the noise level over time.
- b. a signal activity detector (for example from [6]) is used to classify a frame as
active or inactive. The active frames are then used to compute the signal level ls, simply by computing the signal energy and smoothing it over time using a IIR smoothing
filter.
- c. If the signal-to-noise-ratio snr = ls - lN (in dB) is below a threshold (e.g. 35 dB), then high background noise level is detected.
1. Discrete Fourier Transform: same as in any prior art
2. Cross-correlation spectrum: same as in any prior art
3. Smoothing: same as in any prior art or as described herein based on the spectral
characteristic
4. Weighting:
If low level of background noise is detected, then the same weighting as in the prior
art is used (GCC-PHAT).
If high level of background noise is detected, then the MCSP weighting is used

with 0 < ρ < 1 (e.g. p = 0.8). In order to keep the GCC-MCSP output in the same range as the
GCC-PHAT output, an additional normalization step is performed

5. Inverse Transform: same as in any prior art
6. Peak-picking: the peak-picking can be adapted in case high level of background noise is detected
and the MCSP weighting is used. Particularly, it has been found that a lower threshold
is beneficial.
[0076] Furthermore, Fig. 10a illustrates an implementation that is different from the implementation
of Fig. 10c. In the weighter 1036 of Fig. 10c, the weighter performs either the first
or the second weighting procedure. However, in the weighter 1036 as illustrated in
Fig. 10a, the weighter only performs the second weighting procedure with respect to
the notation in Fig. 10d or 10c. This implementation is useful, when a smoothing filter
as illustrated in block 1030 is used that already performs the first weighing procedure
subsequent to the smoothing or together with the smoothing in e.g. a single mathematical
or hardware operation. Thus, in case of performing the first weighting procedure which
is the normalization operation without any compression in the smoothing filter, then
both, the smoothing filter 1030 on the one hand and the actual weighter 1036 on the
hand correspond to the actual weighter for weighting the smoothed or non-smoothed
or non-smoothed cross-correlation spectrum. Thus, in the implementation of Fig. 10a,
the noise estimate 1038 is only provided to a separate weighter 1036 and the selection
between either the output of the smoothing filter 1030 which is already weighted in
accordance with the weighting procedure and the selection between the output of the
actual weighter 136 in Fig. 10a is done by a certain processor setting 1040 that automatically
uses the output from the smoothing filter 1030, when the weighter 1036 does not provide
any output signal but automatically prioritizes the output of the weighter 1036 over
the output of the smoothing filter 1030, when the weighter 1036 provides and output.
Then, the noise estimate 1038 or, as discussed in other figures, the control signal
1050 is then used for either activating or deactivating the weighter 1036. Thus, the
actual weighter for weighting the smoothed or non-smoothed cross-correlation spectrum
using a first order weighting procedure can be implemented in many different ways
such as in the specific activation/deactivation mode in Fig. 10a or the two-kernel
mode in Fig. 10d with an input or an output switch or in accordance with a single
weighting procedure kernel that, depending on the control signal selects one or the
other weighing procedure or adapts a general weighting processor to perform the first
or the second weighting procedure.
[0077] Subsequently, a preferred embodiment, where a smoothing is performed before weighting
is described. In this context, the functionalities of the spectral characteristic
estimator are also reflected by Fig. 4e, items 453, 454 in a preferred embodiment.
[0078] Furthermore, the functionalities of the cross-correlation spectrum calculator 1020
are also reflected by item 452 in Fig. 4e described later on in a preferred embodiment.
[0079] Correspondingly, the functionalities of the smoothing filter 1030 are also reflected
by item 453 in the context of Fig. 4e to be described later on. Additionally, the
functionalities of the processor 1040 are also described in the context of Fig. 4e
in a preferred embodiment as items 456 to 459.
[0080] Preferred embodiments of the processor 1040 are also described in Fig. 10c
[0081] Preferably, the spectral characteristic estimation calculates a noisiness or a tonality
of the spectrum where a preferred implementation is the calculation of a spectral
flatness measure being close to 0 in the case of tonal or non-noisy signals and being
close to 1 in the case of noisy or noise-like signals.
[0082] Particularly, the smoothing filter is then configured to apply a stronger smoothing
with a first smoothing degree over time in case of a first less noisy characteristic
or a first more tonal characteristic, or to apply a weaker smoothing with a second
smoothing degree over time in case of a second more noisy or second less tonal characteristic.
[0083] Particularly, the first smoothing is greater than the second smoothing degree, where
the first noisy characteristic is less noisy than the second noisy characteristic
or the first tonal characteristic is more tonal than the second tonal characteristic.
The preferred implementation is the spectral flatness measure.
[0084] Furthermore, as illustrated in Fig. 11a, the processor is preferably implemented
to normalize the smoothed cross-correlation spectrum as illustrated at 456 in Fig.
4e and 11a before performing the calculation of the time-domain representation in
step 1031 corresponding to steps 457 and 458 in the embodiment of Fig. 4e. However,
as also outlined in Fig. 11a, the processor can also operate without the normalization
in step 456 in Fig. 4e. Then, the processor is configured to analyze the time-domain
representation as illustrated in block 1032 of Fig. 11a in order to find the inter-channel
time difference. This analysis can be performed in any known way and will already
result in an improved robustness, since the analysis is performed based on the cross-correlation
spectrum being smoothed in accordance with the spectral characteristic.
[0085] As illustrated in Fig. 11b, a preferred implementation of the time-domain analysis
1032 is a low-pass filtering of the time-domain representation as illustrated at 458
in Fig. 11b corresponding to item 458 of Fig. 4e and a subsequent further processing
1033 using a peak searching/peak picking operation within the low-pass filtered time-domain
representation.
[0086] As illustrated in Fig. 11c, the preferred implementation of the peak picking or peak
searching operation is to perform this operation using a variable threshold. Particularly,
the processor is configured to perform the peak searching/peak picking operation within
the time-domain representation derived from the smoothed cross-correlation spectrum
by determining 1034 a variable threshold from the time-domain representation and by
comparing a peak or several peaks of the time-domain representation (obtained with
or without spectral normalization) to the variable threshold, wherein the inter-channel
time difference is determined as a time lag associated with a peak being in a predetermined
relation to the threshold such as being greater than the variable threshold.
[0087] As illustrated in Fig. 11d, one preferred embodiment illustrated in the pseudo code
related to Fig. 4e-b described later on consists in the sorting 1034a of values in
accordance with their magnitude. Then, as illustrated in item 1034b in Fig. 11d, the
highest for example 10 or 5 % of the values are determined.
[0088] Then, as illustrated in step 1034c, a number such as the number 3 is multiplied to
the lowest value of the highest 10 or 5 % in order to obtain the variable threshold.
[0089] As stated, preferably, the highest 10 or 5 % are determined, but it can also be useful
to determine the lowest number of the highest 50 % of the values and to use a higher
multiplication number such as 10. Naturally, even a smaller amount such as the highest
3 % of the values are determined and the lowest value among these highest 3 % of the
values is then multiplied by a number which is, for example, equal to 2.5 or 2, i.e.,
lower than 3. Thus, different combinations of numbers and percentages can be used
in the embodiment illustrated in Fig. 11d. Apart from the percentages, the numbers
can also vary, and numbers greater than 1.5 are preferred.
[0090] In a further embodiment illustrated in Fig. 11e, the time-domain representation is
divided into subblocks as illustrated by block 1101, and these subblocks are indicated
in Fig. 13 at 1300. Here, about 16 subblocks are used for the valid range so that
each subblock has a time lag span of 20. However, the number of subblocks can be greater
than this value or lower and preferably greater than 3 and lower than 50.
[0091] In step 1102 of Fig. 11e, the peak in each subblock is determined, and in step 1103,
the average peak in all the subblocks is determined. Then, in step 1104, a multiplication
value a is determined that depends on a signal-to-noise ratio on the one hand and,
in a further embodiment, depends on the difference between the threshold and the maximum
peak as indicated to the left of block 1104. Depending on these input values, one
of preferably three different multiplication values is determined where the multiplication
value can be equal to a
low, a
high and a
lowest.
[0092] Then, in step 1105, the multiplication value a determined in block 1104 is multiplied
by the average threshold in order to obtain the variable threshold that is then used
in the comparison operation in block 1106. For the comparison operation, once again
the time-domain representation input into block 1101 can be used or the already determined,peaks
in each subblock as outlined in block 1102 can be used.
[0093] Subsequently, further embodiments regarding the evaluation and detection of a peak
within the time-domain cross-correlation function is outlined.
[0094] The evaluation and detection of a peak within the time-domain cross correlation function
resulting from the generalized cross-correlation (GCC-PHAT) method in order to estimate
the Inter-channel Time Difference (ITD) is not always straightforward due to different
input scenarios. Clean speech input can result to a low deviation cross-correlation
function with a strong peak, while speech in a noisy reverberant environment can produce
a vector with high deviation and peaks with lower but still outstanding magnitude
indicating the existence of ITD. A peak detection algorithm that is adaptive and flexible
to accommodate different input scenarios is described.
[0095] Due to delay constraints, the overall system can handle channel time alignment up
to a certain limit, namely ITD_MAX. The proposed algorithm is designed to detect whether
a valid ITD exists in the following cases:
- Valid ITD due to outstanding peak. An outstanding peak within the [-ITD_MAX, ITD_MAX] bounds of the cross-correlation
function is present.
- No correlation. When there is no correlation between the two channels, there is no outstanding peak.
A threshold should be defined, above which the peak is strong enough to be considered
as a valid ITD value. Otherwise, no ITD handling should be signaled, meaning ITD is
set to zero and no time alignment is performed.
- Out of bounds ITD. Strong peaks of the cross-correlation function outside the region [-ITD_MAX, ITD_MAX]
should be evaluated in order to determine whether ITDs that lie outside the handling
capacity of the system exist. In this case no ITD handling should be signaled and
thus no time alignment is performed.
[0096] To determine whether the magnitude of a peak is high enough to be considered as a
time difference value, a suitable threshold needs to be defined. For different input
scenarios, the cross-correlation function output varies depending on different parameters,
e.g. the environment (noise, reverberation etc.), the microphone setup (AB, M/S, etc.).
Therefore, to adaptively define the threshold is essential.
[0097] In the proposed algorithm, the threshold is defined by first calculating the mean
of a rough computation of the envelope of the magnitude of the cross-correlation function
within the [-ITD_MAX, ITD_MAX] region (Fig. 13), the average is then weighted accordingly
depending on the SNR estimation.
[0098] The step-by-step description of the algorithm is described below.
[0099] The output of the inverse DFT of the GCC-PHAT, which represents the time-domain cross-correlation,
is rearranged from negative to positive time lags (Fig. 12).
[0100] The cross-correlation vector is divided in three main areas: the area of interest
namely [-ITD_MAX, ITD_MAX] and the area outside the ITD_MAX bounds, namely time lags
smaller than -ITD_MAX (max_low) and higher than ITD_MAX (max_high). The maximum peaks
of the "out of bound" areas are detected and saved to be compared to the maximum peak
detected in the area of interest.
[0101] In order to determine whether a valid ITD is present, the sub-vector area [-ITD_MAX,
ITD_MAX] of the cross-correlation function is considered. The sub-vector is divided
into N sub-blocks (Fig. 13).
[0102] For each sub-block the maximum peak magnitude peak_sub and the equivalent time lag
position index_sub is found and saved.
[0103] The maximum of the local maxima peak_max is determined and will be compared to the
threshold to determine the existence of a valid ITD value.
[0104] The maximum value peak_max is compared to max_low and max_high. If peak_max is lower
than either of the two then no itd handling is signaled and no time alignment is performed.
Because of the ITD handling limit of the system, the magnitudes of the out of bound
peaks do not need to be evaluated.
[0105] The mean of the magnitudes of the peaks is calculated:

[0106] The threshold
thres is then computed by weighting
peakmean with an SNR depended weighting factor
aw: 
[0107] In cases where
SNR <<
SNRthreshold and |
thres -
peak_max |< ε, the peak magnitude is also compared to a slightly more relaxed threshold (a
w =
alowest), in order to avoid rejecting an outstanding peak with high neighboring peaks. The
weighting factors could be for example
ahiah = 3, a
low = 2.5 and a
lowest = 2, while the SNR
threshold could be for example 20dB and the bound ε = 0.05.
[0108] Preferred ranges are 2.5 to 5 for a
high; 1.5 to 4 for a
low; 1.0 to 3 for a
lowest; 10 to 30 dB for SNR
threshold; and 0.01 to 0.5 for ε, where a
high is greater than a
low that is greater than a
lowest.
[0109] If
peak_max > thres the equivalent time lag is returned as the estimated ITD, elsewise no itd handling
is signaled (ITD=0). Further embodiments are described later on with respect to Fig.
4e.
[0110] Fig. 11f illustrates the preferred implementation of determining a valid ITD (inter-channel
time difference) output.
[0111] Subblocks of the time domain representation of the weighted and smoothed or non-smoothed
cross-correlation spectrum are input into a determination step within the processor
1040. This determination step 1120 determines a valid range and an invalid range within
a time-domain representation derived from the weighted and smoothed or non-smoothed
cross-correlation spectrum. In step 1121, a maximum peak is determined within the
invalid range, and in step 1122, a maximum peak is determined within the valid range.
Particularly, at least one maximum peak is determined within the invalid range and
at least one maximum peak is determined within the valid range. In block 1123, the
maximum peaks of the valid range and the invalid range are compared. In case the valid
peak, i.e., the maximum peak in the valid range is greater than the "invalid peak",
the maximum peak in the invalid range, then an ITD determination 1124 is actually
performed and a valid ITD output is provided. When, however, it is detected that an
"invalid peak" is greater than the "valid peak" or that the invalid peak has the same
size as the valid peak, then a valid output is not provided and, preferably, an error
message or any comparable action is performed in order to bring this to the processor's
attention.
[0112] Subsequently, a preferred implementation of the present invention within block 1050
of Fig. 10b for the purpose of a signal further processor is discussed with respect
to Figs. 1 to 9e, i.e., in the context of a stereo/multi-channel processing/encoding
and time alignment of two channels.
[0113] However, as stated and as illustrated in Fig. 10b, many other fields exist, where
a signal further processing using the determined inter-channel time difference can
be performed as well.
[0114] Fig. 1 illustrates an apparatus for encoding a multi-channel signal having at least
two channels. The multi-channel signal 10 is input into a parameter determiner 100
on the one hand and a signal aligner 200 on the other hand. The parameter determiner
100 determines, on the one hand, a broadband alignment parameter and, on the other
hand, a plurality of narrowband alignment parameters from the multi-channel signal.
These parameters are output via a parameter line 12. Furthermore, these parameters
are also output via a further parameter line 14 to an output interface 500 as illustrated.
On the parameter line 14, additional parameters such as the level parameters are forwarded
from the parameter determiner 100 to the output interface 500. The signal aligner
200 is configured for aligning the at least two channels of the multi-channel signal
10 using the broadband alignment parameter and the plurality of narrowband alignment
parameters received via parameter line 10 to obtain aligned channels 20 at the output
of the signal aligner 200. These aligned channels 20 are forwarded to a signal processor
300 which is configured for calculating a mid-signal 31 and a side signal 32 from
the aligned channels received via line 20. The apparatus for encoding further comprises
a signal encoder 400 for encoding the mid-signal from line 31 and the side signal
from line 32 to obtain an encoded mid-signal on line 41 and an encoded side signal
on line 42. Both these signals are forwarded to the output interface 500 for generating
an encoded multi-channel signal at output line 50. The encoded signal at output line
50 comprises the encoded mid-signal from line 41, the encoded side signal from line
42, the narrowband alignment parameters and the broadband alignment parameters from
line 14 and, optionally, a level parameter from line 14 and, additionally optionally,
a stereo filling parameter generated by the signal encoder 400 and forwarded to the
output interface 500 via parameter line 43.
[0115] Preferably, the signal aligner is configured to align the channels from the multi-channel
signal using the broadband alignment parameter, before the parameter determiner 100
actually calculates the narrowband parameters. Therefore, in this embodiment, the
signal aligner 200 sends the broadband aligned channels back to the parameter determiner
100 via a connection line 15. Then, the parameter determiner 100 determines the plurality
of narrowband alignment parameters from an already with respect to the broadband characteristic
aligned multi-channel signal. In other embodiments, however, the parameters are determined
without this specific sequence of procedures.
[0116] Fig. 4a illustrates a preferred implementation, where the specific sequence of steps
that incurs connection line 15 is performed. In the step 16, the broadband alignment
parameter is determined using the two channels and the broadband alignment parameter
such as an inter-channel time difference or ITD parameter is obtained. Then, in step
21, the two channels are aligned by the signal aligner 200 of Fig. 1 using the broadband
alignment parameter. Then, in step 17, the narrowband parameters are determined using
the aligned channels within the parameter determiner 100 to determine a plurality
of narrowband alignment parameters such as a plurality of inter-channel phase difference
parameters for different bands of the multi-channel signal. Then, in step 22, the
spectral values in each parameter band are aligned using the corresponding narrowband
alignment parameter for this specific band. When this procedure in step 22 is performed
for each band, for which a narrowband alignment parameter is available, then aligned
first and second or left/right channels are available for further signal processing
by the signal processor 300 of Fig. 1.
[0117] Fig. 4b illustrates a further implementation of the multi-channel encoder of Fig.
1 where several procedures are performed in the frequency domain.
[0118] Specifically, the multi-channel encoder further comprises a time-spectrum converter
150 for converting a time domain multi-channel signal into a spectral representation
of the at least two channels within the frequency domain.
[0119] Furthermore, as illustrated at 152, the parameter determiner, the signal aligner
and the signal processor illustrated at 100, 200 and 300 in Fig. 1 all operate in
the frequency domain.
[0120] Furthermore, the multi-channel encoder and, specifically, the signal processor further
comprises a spectrum-time converter 154 for generating a time domain representation
of the mid-signal at least.
[0121] Preferably, the spectrum time converter additionally converts a spectral representation
of the side signal also determined by the procedures represented by block 152 into
a time domain representation, and the signal encoder 400 of Fig. 1 is then configured
to further encode the mid-signal and/or the side signal as time domain signals depending
on the specific implementation of the signal encoder 400 of Fig. 1.
[0122] Preferably, the time-spectrum converter 150 of Fig. 4b is configured to implement
steps 155, 156 and 157 of Fig. 4c. Specifically, step 155 comprises providing an analysis
window with at least one zero padding portion at one end thereof and, specifically,
a zero padding portion at the initial window portion and a zero padding portion at
the terminating window portion as illustrated, for example, in Fig. 7 later on. Furthermore,
the analysis window additionally has overlap ranges or overlap portions at a first
half of the window and at a second half of the window and, additionally, preferably
a middle part being a non-overlap range as the case may be.
[0123] In step 156, each channel is windowed using the analysis window with overlap ranges.
Specifically, each channel is windowed using the analysis window in such a way that
a first block of the channel is obtained. Subsequently, a second block of the same
channel is obtained that has a certain overlap range with the first block and so on,
such that subsequent to, for example, five windowing operations, five blocks of windowed
samples of each channel are available that are then individually transformed into
a spectral representation as illustrated at 157 in Fig. 4c. The same procedure is
performed for the other channel as well so that, at the end of step 157, a sequence
of blocks of spectral values and, specifically, complex spectral values such as DFT
spectral values or complex subband samples is available.
[0124] In step 158, which is performed by the parameter determiner 100 of Fig. 1, a broadband
alignment parameter is determined and in step 159, which is performed by the signal
alignment 200 of Fig. 1, a circular shift is performed using the broadband alignment
parameter. In step 160, again performed by the parameter determiner 100 of Fig. 1,
narrowband alignment parameters are determined for individual bands/subbands and in
step 161, aligned spectral values are rotated for each band using corresponding narrowband
alignment parameters determined for the specific bands.
[0125] Fig. 4d illustrates further procedures performed by the signal processor 300. Specifically,
the signal processor 300 is configured to calculate a mid-signal and a side signal
as illustrated at step 301. In step 302, some kind of further processing of the side
signal can be performed and then, in step 303, each block of the mid-signal and the
side signal is transformed back into the time domain and, in step 304, a synthesis
window is applied to each block obtained by step 303 and, in step 305, an overlap
add operation for the mid-signal on the one hand and an overlap add operation for
the side signal on the other hand is performed to finally obtain the time domain mid/side
signals.
[0126] Specifically, the operations of the steps 304 and 305 result in a kind of cross fading
from one block of the mid-signal or the side signal in the next block of the mid signal
and the side signal is performed so that, even when any parameter changes occur such
as the inter-channel time difference parameter or the inter-channel phase difference
parameter occur, this will nevertheless be not audible in the time domain mid/side
signals obtained by step 305 in Fig. 4d.
[0127] The new low-delay stereo coding is a joint Mid/Side (M/S) stereo coding exploiting
some spatial cues, where the Mid-channel is coded by a primary mono core coder, and
the Side-channel is coded in a secondary core coder. The encoder and decoder principles
are depicted in Figs. 6a, 6b.
[0128] The stereo processing is performed mainly in Frequency Domain (FD). Optionally some
stereo processing can be performed in Time Domain (TD) before the frequency analysis.
It is the case for the ITD computation, which can be computed and applied before the
frequency analysis for aligning the channels in time before pursuing the stereo analysis
and processing. Alternatively, ITD processing can be done directly in frequency domain.
Since usual speech coders like ACELP do not contain any internal time-frequency decomposition,
the stereo coding adds an extra complex modulated filter-bank by means of an analysis
and synthesis filterbank before the core encoder and another stage of analysis-synthesis
filter-bank after the core decoder. In the preferred embodiment, an oversampled DFT
with a low overlapping region is employed. However, in other embodiments, any complex
valued time-frequency decomposition with similar temporal resolution can be used.
[0129] The stereo processing consists of computing the spatial cues: inter-channel Time
Difference (ITD), the inter-channel Phase Differences (IPDs) and inter-channel Level
Differences (ILDs). ITD and IPDs are used on the input stereo signal for aligning
the two channels L and R in time and in phase. ITD is computed in broadband or in
time domain while IPDs and ILDs are computed for each or a part of the parameter bands,
corresponding to a non-uniform decomposition of the frequency space. Once the two
channels are aligned a joint M/S stereo is applied, where the Side signal is then
further predicted from the Mid signal. The prediction gain is derived from the ILDs.
[0130] The Mid signal is further coded by a primary core coder. In the preferred embodiment,
the primary core coder is the 3GPP EVS standard, or a coding derived from it which
can switch between a speech coding mode, ACELP, and a music mode based on a MDCT transformation.
Preferably, ACELP and the MDCT-based coder are supported by a Time Domain BandWidth
Extension (TD-BWE) and or Intelligent Gap Filling (IGF) modules respectively.
[0131] The Side signal is first predicted by the Mid channel using prediction gains derived
from ILDs. The residual can be further predicted by a delayed version of the Mid signal
or directly coded by a secondary core coder, performed in the preferred embodiment
in MDCT domain. The stereo processing at encoder can be summarized by Fig. 5 as will
be explained later on.
[0132] Fig. 2 illustrates a block diagram of an embodiment of an apparatus for decoding
an encoded multi-channel signal received at input line 50.
[0133] In particular, the signal is received by an input interface 600. Connected to the
input interface 600 are a signal decoder 700, and a signal de-aligner 900. Furthermore,
a signal processor 800 is connected to a signal decoder 700 on the one hand and is
connected to the signal de-aligner on the other hand.
[0134] In particular, the encoded multi-channel signal comprises an encoded mid-signal,
an encoded side signal, information on the broadband alignment parameter and information
on the plurality of narrowband parameters. Thus, the encoded multi-channel signal
on line 50 can be exactly the same signal as output by the output interface of 500
of Fig. 1.
[0135] However, importantly, it is to be noted here that, in contrast to what is illustrated
in Fig. 1, the broadband alignment parameter and the plurality of narrowband alignment
parameters included in the encoded signal in a certain form can be exactly the alignment
parameters as used by the signal aligner 200 in Fig. 1 but can, alternatively, also
be the inverse values thereof, i.e., parameters that can be used by exactly the same
operations performed by the signal aligner 200 but with inverse values so that the
de-alignment is obtained.
[0136] Thus, the information on the alignment parameters can be the alignment parameters
as used by the signal aligner 200 in Fig. 1 or can be inverse values, i.e., actual
"de-alignment parameters". Additionally, these parameters will typically be quantized
in a certain form as will be discussed later on with respect to Fig. 8.
[0137] The input interface 600 of Fig. 2 separates the information on the broadband alignment
parameter and the plurality of narrowband alignment parameters from the encoded mid/side
signals and forwards this information via parameter line 610 to the signal de-aligner
900. On the other hand, the encoded mid-signal is forwarded to the signal decoder
700 via line 601 and the encoded side signal is forwarded to the signal decoder 700
via signal line 602.
[0138] The signal decoder is configured for decoding the encoded mid-signal and for decoding
the encoded side signal to obtain a decoded mid-signal on line 701 and a decoded side
signal on line 702. These signals are used by the signal processor 800 for calculating
a decoded first channel signal or decoded left signal and for calculating a decoded
second channel or a decoded right channel signal from the decoded mid signal and the
decoded side signal, and the decoded first channel and the decoded second channel
are output on lines 801, 802, respectively. The signal de-aligner 900 is configured
for de-aligning the decoded first channel on line 801 and the decoded right channel
802 using the information on the broadband alignment parameter and additionally using
the information on the plurality of narrowband alignment parameters to obtain a decoded
multi-channel signal, i.e., a decoded signal having at least two decoded and de-aligned
channels on lines 901 and 902.
[0139] Fig. 9a illustrates a preferred sequence of steps performed by the signal de-aligner
900 from Fig. 2. Specifically, step 910 receives aligned left and right channels as
available on lines 801, 802 from Fig. 2. In step 910, the signal de-aligner 900 de-aligns
individual subbands using the information on the narrowband alignment parameters in
order to obtain phase-de-aligned decoded first and second or left and right channels
at 911a and 911b. In step 912, the channels are de-aligned using the broadband alignment
parameter so that, at 913a and 913b, phase and time-de-aligned channels are obtained.
[0140] In step 914, any further processing is performed that comprises using a windowing
or any overlap-add operation or, generally, any cross-fade operation in order to obtain,
at 915a or 915b, an artifact-reduced or artifact-free decoded signal, i.e., to decoded
channels that do not have any artifacts although there have been, typically, time-varying
de-alignment parameters for the broadband on the one hand and for the plurality of
narrowbands on the other hand.
[0141] Fig. 9b illustrates a preferred implementation of the multi-channel decoder illustrated
in Fig. 2.
[0142] In particular, the signal processor 800 from Fig. 2 comprises a time-spectrum converter
810.
[0143] The signal processor furthermore comprises a mid/side to left/right converter 820
in order to calculate from a mid-signal M and a side signal S a left signal L and
a right signal R.
[0144] However, importantly, in order to calculate L and R by the mid/side-left/right conversion
in block 820, the side signal S is not necessarily to be used. Instead, as discussed
later on, the left/right signals are initially calculated only using a gain parameter
derived from an inter-channel level difference parameter ILD. Generally, the prediction
gain can also be considered to be a form of an ILD. The gain can be derived from ILD
but can also be directly computed. It is preferred to not compute ILD anymore, but
to compute the prediction gain directly and to transmit and use the prediction gain
in the decoder rather than the ILD parameter.
[0145] Therefore, in this implementation, the side signal S is only used in the channel
updater 830 that operates in order to provide a better left/right signal using the
transmitted side signal S as illustrated by bypass line 821.
[0146] Therefore, the converter 820 operates using a level parameter obtained via a level
parameter input 822 and without actually using the side signal S but the channel updater
830 then operates using the side 821 and, depending on the specific implementation,
using a stereo filling parameter received via line 831. The signal aligner 900 then
comprises a phased-de-aligner and energy scaler 910. The energy scaling is controlled
by a scaling factor derived by a scaling factor calculator 940. The scaling factor
calculator 940 is fed by the output of the channel updater 830. Based on the narrowband
alignment parameters received via input 911, the phase de-alignment is performed and,
in block 920, based on the broadband alignment parameter received via line 921, the
time-de-alignment is performed. Finally, a spectrum-time conversion 930 is performed
in order to finally obtain the decoded signal.
[0147] Fig. 9c illustrates a further sequence of steps typically performed within blocks
920 and 930 of Fig. 9b in a preferred embodiment.
[0148] Specifically, the narrowband de-aligned channels are input into the broadband de-alignment
functionality corresponding to block 920 of Fig. 9b. A DFT or any other transform
is performed in block 931. Subsequent to the actual calculation of the time domain
samples, an optional synthesis windowing using a synthesis window is performed. The
synthesis window is preferably exactly the same as the analysis window or is derived
from the analysis window, for example interpolation or decimation but depends in a
certain way from the analysis window. This dependence preferably is such that multiplication
factors defined by two overlapping windows add up to one for each point in the overlap
range. Thus, subsequent to the synthesis window in block 932, an overlap operation
and a subsequent add operation is performed. Alternatively, instead of synthesis windowing
and overlap/add operation, any cross fade between subsequent blocks for each channel
is performed in order to obtain, as already discussed in the context of Fig. 9a, an
artifact reduced decoded signal.
[0149] When Fig. 6b is considered, it becomes clear that the actual decoding operations
for the mid-signal, i.e., the "EVS decoder" on the one hand and, for the side signal,
the inverse vector quantization VQ
-1 and the inverse MDCT operation (IMDCT) correspond to the signal decoder 700 of Fig.
2.
[0150] Furthermore, the DFT operations in blocks 810 correspond to element 810 in Fig. 9b
and functionalities of the inverse stereo processing and the inverse time shift correspond
to blocks 800, 900 of Fig. 2 and the inverse DFT operations 930 in Fig. 6b correspond
to the corresponding operation in block 930 in Fig. 9b.
[0151] Subsequently, Fig. 3 is discussed in more detail. In particular, Fig. 3 illustrates
a DFT spectrum having individual spectral lines Preferably, the DFT spectrum or any
other spectrum illustrated in Fig. 3 is a complex spectrum and each line is a complex
spectral line having magnitude and phase or having a real part and an imaginary part.
[0152] Additionally, the spectrum is also divided into different parameter bands. Each parameter
band has at least one and preferably more than one spectral lines. Additionally, the
parameter bands increase from lower to higher frequencies. Typically, the broadband
alignment parameter is a single broadband alignment parameter for the whole spectrum,
i.e., for a spectrum comprising all the bands 1 to 6 in the exemplary embodiment in
Fig. 3.
[0153] Furthermore, the plurality of narrowband alignment parameters are provided so that
there is a single alignment parameter for each parameter band. This means that the
alignment parameter for a band always applies to all the spectral values within the
corresponding band.
[0154] Furthermore, in addition to the narrowband alignment parameters, level parameters
are also provided for each parameter band.
[0155] In contrast to the level parameters that are provided for each and every parameter
band from band 1 to band 6, it is preferred to provide the plurality of narrowband
alignment parameters only for a limited number of lower bands such as bands 1, 2,
3 and 4.
[0156] Additionally, stereo filling parameters are provided for a certain number of bands
excluding the lower bands such as, in the exemplary embodiment, for bands 4, 5 and
6, while there are side signal spectral values for the lower parameter bands 1, 2
and 3 and, consequently, no stereo filling parameters exist for these lower bands
where wave form matching is obtained using either the side signal itself or a prediction
residual signal representing the side signal.
[0157] As already stated, there exist more spectral lines in higher bands such as, in the
embodiment in Fig. 3, seven spectral lines in parameter band 6 versus only three spectral
lines in parameter band 2. Naturally, however, the number of parameter bands, the
number of spectral lines and the number of spectral lines within a parameter band
and also the different limits for certain parameters will be different.
[0158] Nevertheless, Fig. 8 illustrates a distribution of the parameters and the number
of bands for which parameters are provided in a certain embodiment where there are,
in contrast to Fig. 3, actually 12 bands.
[0159] As illustrated, the level parameter ILD is provided for each of 12 bands and is quantized
to a quantization accuracy represented by five bits per band.
[0160] Furthermore, the narrowband alignment parameters IPD are only provided for the lower
bands up to a border frequency of 2.5 kHz. Additionally, the inter-channel time difference
or broadband alignment parameter is only provided as a single parameter for the whole
spectrum but with a very high quantization accuracy represented by eight bits for
the whole band.
[0161] Furthermore, quite roughly quantized stereo filling parameters are provided represented
by three bits per band and not for the lower bands below 1 kHz since, for the lower
bands, actually encoded side signal or side signal residual spectral values are included.
[0162] Subsequently, a preferred processing on the encoder side is summarized with respect
to Fig. 5. In a first step, a DFT analysis of the left and the right channel is performed.
This procedure corresponds to steps 155 to 157 of Fig. 4c. In step 158, the broadband
alignment parameter is calculated and, particularly, the preferred broadband alignment
parameter inter-channel time difference (ITD). As illustrated in 170, a time shift
of L and R in the frequency domain is performed. Alternatively, this time shift can
also be performed in the time domain. An inverse DFT is then performed, the time shift
is performed in the time domain and an additional forward DFT is performed in order
to once again have spectral representations subsequent to the alignment using the
broadband alignment parameter.
[0163] ILD parameters, i.e., level parameters and phase parameters (IPD parameters), are
calculated for each parameter band on the shifted L and R representations as illustrated
at step 171. This step corresponds to step 160 of Fig. 4c, for example. Time shifted
L and R representations are rotated as a function of the inter-channel phase difference
parameters as illustrated in step 161 of Fig. 4c or Fig. 5. Subsequently, the mid
and side signals are computed as illustrated in step 301 and, preferably, additionally
with an energy conversation operation as discussed later on. In a subsequent step
174, a prediction of S with M as a function of ILD and optionally with a past M signal,
i.e., a mid-signal of an earlier frame is performed. Subsequently, inverse DFT of
the mid-signal and the side signal is performed that corresponds to steps 303, 304,
305 of Fig. 4d in the preferred embodiment.
[0164] In the final step 175, the time domain mid-signal m and, optionally, the residual
signal are coded as illustrated in step 175. This procedure corresponds to what is
performed by the signal encoder 400 in Fig. 1.
[0165] At the decoder in the inverse stereo processing, the
Side signal is generated in the DFT domain and is first predicted from the
Mid signal as:

where g is a gain computed for each parameter band and.is function of the transmitted
Inter-channel Level Difference (ILDs).
[0166] The residual of the prediction
Side -
g •
Mid can be then refined in two different ways:
- By a secondary coding of the residual signal:

where gcodis a global gain transmitted for the whole spectrum
- By a residual prediction, known as stereo filling, predicting the residual side spectrum
with the previous decoded Mid signal spectrum from the previous DFT frame:

where gpred is a predictive gain transmitted per parameter band.
[0167] The two types of coding refinement can be mixed within the same DFT spectrum. In
the preferred embodiment, the residual coding is applied on the lower parameter bands,
while residual prediction is applied on the remaining bands. The residual coding is
in the preferred embodiment as depict in Fig.1 performs in MDCT domain after synthesizing
the residual Side signal in Time Domain and transforming it by a MDCT. Unlike DFT,
MDCT is critical sampled and is more suitable for audio coding. The MDCT coefficients
are directly vector quantized by a Lattice Vector Quantization but can be alternatively
coded by a Scalar Quantizer followed by an entropy coder. Alternatively, the residual
side signal can be also coded in Time Domain by a speech coding technique or directly
in DFT domain.
1. Time-Frequency Analysis: DFT
[0168] It is important that the extra time-frequency decomposition from the stereo processing
done by DFTs allows a good auditory scene analysis while not increasing significantly
the overall delay of the coding system. By default, a time resolution of 10 ms (twice
the 20 ms framing of the core coder) is used. The analysis and synthesis windows are
the same and are symmetric. The window is represented at 16 kHz of sampling rate in
Fig. 7. It can be observed that the overlapping region is limited for reducing the
engendered delay and that zero padding is also added to counter balance the circular
shift when applying ITD in frequency domain as it will be explained hereafter.
2. Stereo parameters
[0169] Stereo parameters can be transmitted at maximum at the time resolution of the stereo
DFT. At minimum it can be reduced to the framing resolution of the core coder, i.e.
20ms. By default, when no transients is detected, parameters are computed every 20ms
over 2 DFT windows. The parameter bands constitute a non-uniform and non-overlapping
decomposition of the spectrum following roughly 2 times or 4 times the Equivalent
Rectangular Bandwidths (ERB). By default, a 4 times ERB scale is used for a total
of 12 bands for a frequency bandwidth of 16kHz (32kbps sampling-rate, Super Wideband
stereo). Fig. 8 summarized an example of configuration, for which the stereo side
information is transmitted with about 5 kbps.
3. Computation of ITD and channel time alignment
[0170] The ITD are computed by estimating the Time Delay of Arrival (
TDOA) using the Generalized Cross Correlation with Phase Transform (
GCC-PHAT):

where L and R are the frequency spectra of the of the left and right channels respectively.
The frequency analysis can be performed independently of the DFT used for the subsequent
stereo processing or can be shared. The pseudo-code for computing the ITD is the following:
L =fft(window(l));
R =fft(window(r));
tmp = L . * conj( R );
sfm_L = prod(abs(L). ^(1/length(L)))/(mean(abs(L))+eps);
sfm_R = prod(abs(R). ^(1/length(R)))l(mean(abs(R))+eps);
sfm = max(sfm_L,sfm_R);
h.cross_corr_smooth = (1-sfm)*h.cross_corr_smooth+sfm*tmp;
tmp = h.cross_corr_smooth ./ abs( h.cross_corr_smooth+eps);
tmp = ifft( tmp );
tmp = tmp([length(tmp)/2+1:length(tmp) 1:length(tmp)/2+1]);
tmp_sort = sort( abs(tmp) );
thresh = 3 * tmp_sort( round(0.95*length(tmp_sort)));
xcorr_time=abs(tmp(- (h.stereo_itd_q_max - (lengfh(tmp)-1)/2 - 1 ):- (h.ste-
reo_itd_q_min - (length(tmp)-1)/2 - 1 )));
%smooth output for better detection
xcorr_time=[xcorr_time 0];
xcorr_time2=filter([0.25 0.5 0.25],1,xcorr_ time);
[m,i] = max(xcorr_time2(2:end));
if m > thresh
itd = h.stereo_itd_q_max - i + 1;
else
itd = 0;
end
[0171] Fig. 4e illustrates a flow chart for implementing the earlier illustrated pseudo
code in order to obtain a robust and efficient calculation of an inter-channel time
difference as an example for the broadband alignment parameter.
[0172] In block 451, a DFT analysis of the time domain signals for a first channel (I) and
a second channel (r) is performed. This DFT analysis will typically be the same DFT
analysis as has been discussed in the context of steps 155 to 157 in Fig. 5 or Fig.
4c, for example.
[0173] A cross-correlation is then performed for each frequency bin as illustrated in block
452.
[0174] Thus, a cross-correlation spectrum is obtained for the whole spectral range of the
left and the right channels.
[0175] In step 453, a spectral flatness measure is then calculated from the magnitude spectra
of L and R and, in step 454, the larger spectral flatness measure is selected. However,
the selection in step 454 does not necessarily have to be the selection of the larger
one but this determination of a single SFM from both channels can also be the selection
and calculation of only the left channel or only the right channel or can be the calculation
of weighted average of both SFM values.
[0176] In step 455, the cross-correlation spectrum is then smoothed over time depending
on the spectral flatness measure.
[0177] Preferably, the spectral flatness measure is calculated by dividing the geometric
mean of the magnitude spectrum by the arithmetic mean of the magnitude spectrum. Thus,
the values for SFM are bounded between zero and one.
[0178] In step 456, the smoothed cross-correlation spectrum is then normalized by its magnitude
and in step 457 an inverse DFT of the normalized and smoothed cross-correlation spectrum
is calculated. In step 458, a certain time domain filter is preferably performed but
this time domain filtering can also be left aside depending on the implementation
but is preferred as will be outlined later on.
[0179] In step 459, an ITD estimation is performed by peak-picking of the filter generalized
cross-correlation function and by performing a certain thresholding operation.
[0180] If no peak above the threshold is obtained, then ITD is set to zero and no time alignment
is performed for this corresponding block.
[0181] The ITD computation can also be summarized as follows. The cross-correlation is computed
in frequency domain before being smoothed depending of the Spectral Flatness Measurement.
SFM is bounded between 0 and 1. In case of noise-like signals, the SFM will be high
(i.e. around 1) and the smoothing will be weak. In case of tone-like signal, SFM will
be low and the smoothing will become stronger. The smoothed cross-correlation is then
normalized by its amplitude before being transformed back to time domain. The normalization
corresponds to the Phase -transform of the cross-correlation, and is known to show
better performance than the normal cross-correlation in low noise and relatively high
reverberation environments. The so-obtained time domain function is first filtered
for achieving a more robust peak peaking. The index corresponding to the maximum amplitude
corresponds to an estimate of the time difference between the Left and Right Channel
(ITD). If the amplitude of the maximum is lower than a given threshold, then the estimated
of ITD is not considered as reliable and is set to zero.
[0182] If the time alignment is applied in Time Domain, the ITD is computed in a separate
DFT analysis. The shift is done as follows:

[0183] It requires an extra delay at encoder, which is equal at maximum to the maximum absolute
ITD which can be handled. The variation of ITD over time is smoothed by the analysis
windowing of DFT.
[0184] Alternatively the time alignment can be performed in frequency domain. In this case,
the ITD computation and the circular shift are in the same DFT domain, domain shared
with this other stereo processing. The circular shift is given by:

[0185] Zero padding of the DFT windows is needed for simulating a time shift with a circular
shift. The size of the zero padding corresponds to the maximum absolute ITD which
can be handled. In the preferred embodiment, the zero padding is split uniformly on
the both sides of the analysis windows, by adding 3.125ms of zeros on both ends. The
maximum absolute possible ITD is then 6.25ms. In A-B microphones setup, it corresponds
for the worst case to a maximum distance of about 2.15 meters between the two microphones.
The variation in ITD over time is smoothed by synthesis windowing and overlap-add
of the DFT.
[0186] It is important that the time shift is followed by a windowing of the shifted signal.
It is a main distinction with the prior art Binaural Cue Coding (BCC), where the time
shift is applied on a windowed signal but is not windowed further at the synthesis
stage. As a consequence, any change in ITD over time produces an artificial transient/click
in the decoded signal.
4. Computation of IPDs and channel rotation
[0187] The IPDs are computed after time aligning the two channels and this for each parameter
band or at least up to a given
ipd_max_band, dependent of the stereo configuration.

[0188] IPDs is then applied to the two channels for aligning their phases:

[0189] Where β = atan2(sin(IPD
i[b]), cos(IPD
i[b]) + c), c =
10ILDi[b]/20 and b is the parameter band index to which belongs the frequency index
k. The parameter
β is responsible of distributing the amount of phase rotation between the two channels
while making their phase aligned.
β is dependent of IPD but also the relative amplitude level of the channels, ILD. If
a channel has higher amplitude, it will be considered as leading channel and will
be less affected by the phase rotation than the channel with lower amplitude.
5. Sum-difference and side signal coding
[0190] The sum difference transformation is performed on the time and phase aligned spectra
of the two channels in a way that the energy is conserved in the Mid signal.

where

is bounded between 1/1.2 and 1.2, i.e. -1.58 and +1.58 dB. The limitation avoids
artefact when adjusting the energy of M and S. It is worth noting that this energy
conservation is less important when time and phase were beforehand aligned. Alternatively
the bounds can be increased or decreased.
[0191] The side signal S is further predicted with M:

where

, where c =
10ILDi[b]/20. Alternatively the optimal prediction gain g can be found by minimizing the Mean Square
Error (MSE) of the residual and ILDs deduced by the previous equation.
[0192] The residual signal
S'(
ƒ) can be modeled by two means: either by predicting it with the delayed spectrum of
M or by coding it directly in the MDCT domain in the MDCT domain.
6. Stereo decoding
[0193] The Mid signal X and Side signal S are first converted to the left and right channels
L and R as follows:

where the gain g per parameter band is derived from the ILD parameter:

[0194] For parameter bands below cod_max_band, the two channels are updated with the decoded
Side signal:

[0195] For higherparameter bands, the side signal is predicted and the channels updated
as:

[0196] Finally, the channels are multiplied by a complex value aiming to restore the original
energy and the inter-channel phase of the stereo signal:

where

where a is defined and bounded as defined previously, and where β = atan2(sin(IPD
i[b]), cos(IPD
i[b]) + c), and where atan2(x,y) is the four-quadrant inverse tangent of x over y.
[0197] Finally, the channels are time shifted either in time or in frequency domain depending
of the transmitted ITDs. The time domain channels are synthesized by inverse DFTs
and overlapadding.
[0198] Specific features of the invention relate to the combination of spatial cues and
sum-difference joint stereo coding. Specifically, the spatial cues IDT and IPD are
computed and applied on the stereo channels (left and right). Furthermore, sum-difference
(M/S signals) are calculated and preferably a prediction is applied of S with M.
[0199] On the decoder-side, the broadband and narrowband spatial cues are combined together
with sum-different joint stereo coding. In particular, the side signal is predicted
with the mid-signal using at least one spatial cue such as ILD and an inverse sum-difference
is calculated for getting the left and right channels and, additionally, the broadband
and the narrowband spatial cues are applied on the left and right channels.
[0200] Preferably, the encoder has a window and overlap-add with respect to the time aligned
channels after processing using the ITD. Furthermore, the decoder additionally has
a windowing and overlap-add operation of the shifted or de-aligned versions of the
channels after applying the inter-channel time difference.
[0201] The computation of the inter-channel time difference with the GCC-Phat method is
a specifically robust method.
[0202] The new procedure is advantageous prior art since is achieves bit-rate coding of
stereo audio or multi-channel audio at low delay. It is specifically designed for
being robust to different natures of input signals and different setups of the multichannel
or stereo recording. In particular, the present invention provides a good quality
for low bit rate stereo speech coding.
[0203] The preferred procedures find use in the distribution of broadcasting of all types
of stereo or multichannel audio content such as speech and music alike with constant
perceptual quality at a given low bit rate. Such application areas are a digital radio,
internet streaming or audio communication applications.
[0204] Subsequently, further embodiments/examples are summarized. The reference numbers
in brackets are not considered to be limiting the scope of the embodiments/examples
in any sense.
- 1. Apparatus for estimating an inter-channel time difference between a first channel
signal and a second channel signal, comprising:
a signal analyzer (1037) for estimating a signal characteristic (1038) of the first
channel signal or the second channel signal or both signals or a signal derived from
the first channel signal or the second channel signal;
a calculator (1020) for calculating a cross-correlation spectrum for a time block
from the first channel signal in the time block and the second channel signal in the
time block;
a weighter (1036) for weighting a smoothed or non-smoothed cross-correlation spectrum
to obtain a weighted cross correlation spectrum using a first weighting procedure
(1036a) or using a second weighting procedure (1036b) depending on a signal characteristic
estimated by the signal analyzer (1037), wherein the first weighting procedure is
different from the second weighting procedure; and
a processor (1040) for processing the weighted cross-correlation spectrum to obtain
the inter-channel time difference.
- 2. Apparatus of example 1, wherein the signal analyzer (1037) is configured as a noise
estimator (1037) for estimating a noise level (1038) of the first channel signal or
the second , channel signal or both signals or a signal derived from the first channel
signal or the second channel signal, and wherein a first signal characteristic is
a first noise level and a second signal characteristic is a second noise level, or
wherein the signal analyzer (1037) is configured to perform a speech/music analysis,
an interfering-talker analysis, a background music analysis, a clean speech analysis
or any other signal analysis in order to determine, whether a signal has a first characteristic
or a second characteristic.
- 3. Apparatus of example 1, wherein the first weighting procedure (1036a) is selected
for a first signal characteristic and the second weighting procedure (1036b) is selected
for a second signal characteristic, and wherein the first signal characteristic is
different from the second signal characteristic.
- 4. Apparatus of one of the preceding examples, wherein the first weighting procedure
(1036a) comprises a weighting so that an amplitude is normalized and a phase is maintained,
or wherein the second weighting procedure (1036b) comprises a weighting factor derived
from the smoothed or non-smoothed cross-correlation spectrum using a power operation
having a power being lower than 1 or greater than 0 or a log function.
- 5. Apparatus of one of the preceding examples, wherein the second weighting procedure
(1036b) comprises a weighting so that an amplitude is normalized and a phase is maintained
and additionally comprises a weighting factor derived from the smoothed or non-smoothed
cross-correlation spectrum using a power operation having a power being lower than
1 or greater than 0 or between 0.79 and 0.82.
- 6. Apparatus of one of the preceding examples, wherein the first weighting procedure
(1036a) operates in accordance with the following equation:

pr wherein the second weighting procedure (1036b) operates in accordance with the
following equation:

wherein CPHAT(k,s) is a weighted smoothed or non-smoothed cross correlation spectrum value for a frequency
index k and a time index s obtained by applying the first weighting procedure,
wherein C̃MCSP(k,s) is a weighted smoothed or non-smoothed cross correlation spectrum value for a frequency
index k and a time index s obtained by applying the second weighting procedure,
wherein C(k,s) is a smoothed or non-smoothed cross correlation spectrum value for a frequency index
k and a time index s, and
wherein p is a power value being different from 1.
- 7. Apparatus of one of the preceding examples, wherein the second weighting procedure
(1036b) comprises a normalization so that an output range of the second normalization
procedure is in a range in which an output range of the first normalization procedure
is positioned, or so that the output range of the second normalization procedure is
the same as an output range of the first normalization procedure.
- 8. Apparatus of one of the preceding examples, wherein the second weighting procedure
(1036b) comprises a normalization based on the following equation:

- 9. Apparatus of one of the preceding examples, wherein the processor (1040) is configured
to perform a first peak picking operation (1041) or a second peak picking operation
(1042) depending on whether the first weighting procedure (1036a) or the second weighting
procedure (1036b) has been used, wherein the first peak picking operation is different
from the second peak picking operation.
- 10. Apparatus of one of the preceding examples, wherein the second peak picking operation
(1042) is used when the second weighting procedure is used, and wherein the second
peak picking operation (1042) is configured to apply a second threshold being lower
than a first threshold used by the first peak picking operation (1041).
- 11. Apparatus of one of examples 2 to 10,
wherein the noise estimator (1037) is configured to estimate (1060) a level of a background
noise or is configured to smooth (1061) an estimated noise level over time or is configured
to use an IIR smoothing filter.
- 12. Apparatus of one of the examples 2 to 11,
wherein the noise estimator (1037) further comprises a signal activity detector (1070)
for classifying the time block as active or inactive, wherein the noise estimator
(1037) is configured to compute (1072) a signal level using one or more active time
blocks, or wherein the noise estimator (1037) is configured to signal (1050) a high
background noise level, when a signal to noise ratio is below a threshold, the threshold
being in a range between 45 to 25 dB.
- 13. Apparatus of one of examples 1 to 12, further comprising:
a spectral characteristic estimator (1010) for estimating a characteristic of a spectrum
of the first channel signal or the second channel signal for the time block;
a smoothing filter (1030) for smoothing the cross-correlation spectrum over time using
the spectral characteristic to obtain a smoothed cross-correlation spectrum, and wherein
the weighter (1036) is configured for weighting the smoothed cross-correlation spectrum.
- 14. Apparatus of one of example 1 to 13,
wherein the processor (1040) is configured to normalize (1036a) the smoothed cross-correlation
spectrum using a magnitude of the smoothed cross-correlation spectrum.
- 15. Apparatus of one examples 1 to 14,
wherein the processor (1040) is configured
to calculate (1031) a time-domain representation of the smoothed cross-correlation
spectrum or a normalized smoothed cross-correlation spectrum; and
to analyze (1032) the time-domain representation to determine the inter-channel time
difference.
- 16. Apparatus of one of the preceding examples,
wherein the processor (1040) is configured to low-pass filter (458) the time-domain
representation and to further process (1033) a result of the low-pass filtering.
- 17. Apparatus of one of the preceding examples,
wherein the processor (1040) is configured to perform the inter-channel time difference
determination by performing a peak searching or peak picking operation (1041, 1042)
within a time-domain representation determined from the smoothed cross-correlation
spectrum.
- 18. Apparatus of one of the examples 13 to 17,
wherein the spectral characteristic estimator (1010) is configured to determine, as
the spectral characteristic, a noisiness or a tonality of the spectrum; and
wherein the smoothing filter (1030) is configured to apply a stronger smoothing over
time with a first smoothing degree in case of a first less noisy characteristic or
a first more tonal characteristic, or to apply a weaker smoothing over time with a
second smoothing degree in case of a second more noisy characteristic or a second
less tonal characteristic,
wherein the first smoothing degree is greater than the second smoothing degree, and
wherein the first noisy characteristic is less noisy than the second noisy characteristic,
or the first tonal characteristic is more tonal than the second tonal characteristic.
- 19. Apparatus of one of the examples 13 to 18,
wherein the spectral characteristics estimator (1010) is configured to calculate,
as the characteristic, a first spectral flatness measure of a spectrum of the first
channel signal and a second spectral flatness measure of a second spectrum of the
second channel signal, and to determine the characteristic of the spectrum from the
first and the second spectral flatness measure by selecting a maximum value, by determining
a weighted average or an unweighted average between the spectral flatness measures,
or by selecting a minimum value.
- 20. Apparatus of one of the examples 13 to 19,
wherein the smoothing filter (1030) is configured to calculate a smoothed cross-correlation
spectrum value for a frequency by a weighted combination of the cross-correlation
spectrum value for the frequency from the time block and a cross-correlation spectral
value for the frequency from at least one past time block, wherein weighting factors
for the weighted combination are determined by the characteristic of the spectrum.
- 21. Apparatus of one of the preceding examples,
wherein the processor (1040) is configured to determine (1120) a valid range and an
invalid range within a time-domain representation derived from the weighted smoothed
or non-smoothed cross-correlation spectrum,
wherein at least one maximum peak within the invalid range is detected (1121) and
compared (1123) to a maximum peak within the valid range, wherein the inter-channel
time difference is only determined (1124), when the maximum peak within the valid
range is greater than at least one maximum peak within the invalid range.
- 22. Apparatus of one of the preceding examples,
wherein the processor (1040) is configured
to perform (1102) a peak search operation within a time-domain representation derived
from the smoothed cross-correlation spectrum,
to determine (1105) a variable of a fixed threshold from the time-domain representation;
and to compare (1106, 1035) a peak to the variable threshold, wherein the inter-channel
time difference is determined as a time lag associated with a peak being in a predetermined
relation to the variable threshold.
- 23. Apparatus of example 22,
wherein the processor (1040) is configured to determine (1105) the variable threshold
as a value being equal to an integer multiple of a value among a largest portion such
as 10 % of values of the time-domain representation.
- 24. Apparatus of one of examples 1 to 23,
wherein the processor (1040) is configured to determine (1102) a maximum peak amplitude
in each subblock of a plurality of subblocks of a time-domain representation derived
from the smoothed cross-correlation spectrum,
wherein the processor (1040) is configured to calculate (1105, 1034) a variable threshold
based on a mean peak magnitude derived (1103) from the maximum peak magnitudes of
the plurality of subblocks, and
wherein the processor (1140) is configured to determine the inter-channel time difference
as a time lag value corresponding to a maximum peak of the plurality of subblocks
being greater than the variable threshold.
- 25. Apparatus of example 24,
wherein the processor (1140) is configured to calculate (1105) the variable threshold
by a multiplication of the mean threshold determined as an average peak among the
peaks in the subblocks and a value,
wherein the value is determined by an SNR (signal to noise ratio) characteristic of
the first and the second channel signal, wherein a first value is associated with
a first SNR value and a second value is associated with a second SNR value, wherein
the first value is greater than the second value, and wherein the first SNR value
is greater than the second SNR value.
- 26. Apparatus of example 25,
wherein the processor is configured to use (1104) a third value (alowest) being lower than the second value (alow) in case of a third SNR value being lower than the second SNR value and when a difference
between the threshold and a maximum peak is lower than a predetermined value.
- 27. Apparatus of one of the examples 2 to 26, wherein the noise estimator (1037) comprises
a background noise estimator (1060) and a time-smother (1061) for providing a background
noise estimate, or
wherein the noise estimator (1037) comprise a signal activity detector (1070), a frame
selector (1071) for selecting an active frame only under control of the signal activity
detector (1070), and a signal level calculator (1072) for calculating a signal level
in the active frame, and a time smoother (1073) for smoothing a result of the signal
level calculator (1072) over time to provide a signal level estimate, or
wherein the noise estimator (1037) is configured to calculate (1074) a signal to noise
ratio from a smoothed or non-smoothed signal level and a smoothed or non-smoothed
background noise level for a frame, and a comparator (1075) for comparing the signal
to noise ratio value to a threshold for the frame in order to provide the noise level
(1038, 1050) for the frame.
- 28. Apparatus of one of the preceding examples, wherein the apparatus is configured
for performing a storage or a transmission of the estimated inter-channel time difference,
or
for performing a stereo or multi-channel processing or encoding of the first and the
second channel signals using the estimated inter-channel time difference, or
for performing a time alignment of the two channel signals using the inter-channel
time difference, or
for performing a time difference of arrival estimation using the estimated inter-channel
time difference, or
for performing a time difference of arrival estimation using the inter-channel time
difference for the determination of a speaker position in a room with two microphones
and a known microphone setup, or
for performing a beamforming using the estimated inter-channel time difference, or
for performing a spatial filtering using the estimated inter-channel time difference,
or
for performing a foreground or background decomposition using the estimated inter-channel
time difference, or
for performing a location operation of a sound source using the estimated inter-channel
time difference, or
for performing a location of a sound source using the estimated inter-channel time
difference by performing an acoustic triangulation based on time differences between
the first channel signal and the second channel signal or the first channel signal,
the second channel signal and at least one additional signal.
- 29. Method of estimating an inter-channel time difference between a first channel
signal and a second channel signal, the method comprising:
estimating a signal characteristic of the first channel signal or the second channel
signal or both signals or a signal derived from the first channel signal or the second
channel signal;
calculating a cross-correlation spectrum for a time block from the first channel signal
in the time block and the second channel signal in the time block;
weighting a smoothed or non-smoothed cross-correlation spectrum to obtain a weighted
cross correlation spectrum using a first weighting procedure or using a second weighting
procedure depending on a signal characteristic estimated, wherein the first weighting
procedure is different from the second weighting procedure; and
processing the weighted cross-correlation spectrum to obtain the inter-channel time
difference.
- 30. Method of example 29, further comprising:
estimating a characteristic of a spectrum of the first channel signal or the second
channel signal for the time block;
smoothing the cross-correlation spectrum over time using the spectral characteristic
to obtain a smoothed cross-correlation spectrum, and wherein the weighting weights
the smoothed cross-correlation spectrum.
- 31. Computer program for performing, when running on a computer or a processor, the
method of example 29 or 30.
[0205] While this invention has been described in terms of several embodiments, there are
alterations, permutations, and equivalents which fall within the scope of this invention.
It should also be noted that there are many alternative ways of implementing the methods
and compositions of the present invention. It is therefore intended that the following
appended claims be interpreted as including all such alterations, permutations and
equivalents as fall within the true spirit and scope of the present invention.
[0206] Although some aspects have been described in the context of an apparatus, it is clear
that these aspects also represent a description of the corresponding method, where
a block or device corresponds to a method step or a feature of a method step. Analogously,
aspects described in the context of a method step also represent a description of
a corresponding block or item or feature of a corresponding apparatus. Some or all
of the method steps may be executed by (or using) a hardware apparatus, like for example,
a microprocessor, a programmable computer or an electronic circuit. In some embodiments,
some one or more of the most important method steps may be executed by such an apparatus.
[0207] Depending on certain implementation requirements, embodiments of the invention can
be implemented in hardware or in software. The implementation can be performed using
a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM,
a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control
signals stored thereon, which cooperate (or are capable of cooperating) with a programmable
computer system such that the respective method is performed. Therefore, the digital
storage medium may be computer readable.
[0208] Some embodiments according to the invention comprise a data carrier having electronically
readable control signals, which are capable of cooperating with a programmable computer
system, such that one of the methods described herein is performed.
[0209] Generally, embodiments of the present invention can be implemented as a computer
program product with a program code, the program code being operative for performing
one of the methods when the computer program product runs on a computer. The program
code may for example be stored on a machine readable carrier.
[0210] Other embodiments comprise the computer program for performing one of the methods
described herein, stored on a machine readable carrier.
[0211] In other words, an embodiment of the inventive method is, therefore, a computer program
having a program code for performing one of the methods described herein, when the
computer program runs on a computer.
[0212] A further embodiment of the inventive methods is, therefore, a data carrier (or a
digital storage medium, or a computer-readable medium) comprising, recorded thereon,
the computer program for performing one of the methods described herein. The data
carrier, the digital storage medium or the recorded medium are typically tangible
and/or non-transitionary.
[0213] A further embodiment of the inventive method is, therefore, a data stream or a sequence
of signals representing the computer program for performing one of the methods described
herein. The data stream or the sequence of signals may for example be configured to
be transferred via a data communication connection, for example via the Internet.
[0214] A further embodiment comprises a processing means, for example a computer, or a programmable
logic device, configured to or adapted to perform one of the methods described herein.
[0215] A further embodiment comprises a computer having installed thereon the computer program
for performing one of the methods described herein.
[0216] A further embodiment according to the invention comprises an apparatus, or a system
configured to transfer (for example, electronically or optically) a computer program
for performing one of the methods described herein to a receiver. The receiver may,
for example, be a computer, a mobile device, a memory device or the like. The apparatus
or system may, for example, comprise a file server for transferring the computer program
to the receiver.
[0217] In some embodiments, a programmable logic device (for example a field programmable
gate array) may be used to perform some or all of the functionalities of the methods
described herein. In some embodiments, a field programmable gate array may cooperate
with a microprocessor in order to perform one of the methods described herein. Generally,
the methods are preferably performed by any hardware apparatus.
[0218] The apparatus described herein may be implemented using a hardware apparatus, or
using a computer, or using a combination of a hardware apparatus and a computer.
[0219] The methods described herein may be performed using a hardware apparatus, or using
a computer, or using a combination of a hardware apparatus and a computer.
[0220] The above described embodiments are merely illustrative for the principles of the
present invention. It is understood that modifications and variations of the arrangements
and the details described herein will be apparent to others skilled in the art. It
is the intent, therefore, to be limited only by the scope of the impending patent
claims and not by the specific details presented by way of description and explanation
of the embodiments herein.
References
[0221]
- [1] Patent application. "Apparatus and Method for Estimating an Inter-Channel Time
Difference." International Application Number PCT/EP2017/051214.
- [2] Knapp, Charles, and Glifford Carter. "The generalized correlation method for estimation
of time delay." IEEE Transactions on Acoustics, Speech, and Signal Processing 24.4
(1976): 320-327.
- [3] Zhang, Cha, Dinei Florêncio, and Zhengyou Zhang. "Why does PHAT work well in low-noise,
reverberative environments?" Acoustics, Speech and Signal Processing, 2008. ICASSP
2008. IEEE International Conference on. IEEE, 2008.
- [4] Rabinkin, Daniel V., et al. "DSP implementation of source location using microphone
arrays." Advanced signal processing algorithms, architectures, and implementations
VI. Vol. 2846. International Society for Optics and Photonics, 1996.
- [5] Shen, Miao, and Hong Liu. "A modified cross power-spectrum phase method based on microphone
array for acoustic source localization." Systems, Man and Cybernetics, 2009. SMC 2009.
IEEE International Conference on. IEEE, 2009.
- [6] 3GPP TS 26.445; Codec for Enhanced Voice Services (EVS); Detailed algorithmic description.