TECHNICAL FIELD
[0001] The application relates to methods and apparatuses for controlling a concealment
method for a lost audio frame of a received audio signal.
BACKGROUND
[0002] Conventional audio communication systems transmit speech and audio signals in frames,
meaning that the sending side first arranges the signal in short segments or frames
of e.g. 20-40 ms which subsequently are encoded and transmitted as a logical unit
in e.g. a transmission packet. The receiver decodes each of these units and reconstructs
the corresponding signal frames, which in turn are finally output as a continuous sequence
of reconstructed signal samples. Prior to encoding there is usually an analog to digital
(A/D) conversion step that converts the analog speech or audio signal from a microphone
into a sequence of audio samples. Conversely, at the receiving end, there is typically
a final D/A conversion step that converts the sequence of reconstructed digital signal
samples into a time continuous analog signal for loudspeaker playback.
[0003] However, such a transmission system for speech and audio signals may suffer from transmission
errors, which could lead to a situation in which one or several of the transmitted
frames are not available at the receiver for reconstruction. In that case, the decoder
has to generate a substitution signal for each of the erased, i.e. unavailable, frames.
This is done in the so-called frame loss or error concealment unit of the receiver-side
signal decoder. The purpose of the frame loss concealment is to make the frame loss
as inaudible as possible and hence to mitigate the impact of the frame loss on the
reconstructed signal quality as much as possible.
[0004] Conventional frame loss concealment methods may depend on the structure or architecture
of the codec, e.g. by applying a form of repetition of previously received codec parameters.
Such parameter repetition techniques are clearly dependent on the specific parameters
of the used codec and hence not easily applicable for other codecs with a different
structure. Current frame loss concealment methods may e.g. apply the concept of freezing
and extrapolating parameters of a previously received frame in order to generate a
substitution frame for the lost frame.
[0005] These state of the art frame loss concealment methods incorporate some burst loss
handling schemes. In general, after a number of frame losses in a row the synthesized
signal is attenuated until it is completely muted after long bursts of errors. In
addition the coding parameters that are essentially repeated and extrapolated are
modified such that the attenuation is accomplished and that spectral peaks are flattened
out.
[0006] Current state-of-the-art frame loss concealment techniques typically apply the concept
of freezing and extrapolating parameters of a previously received frame in order to
generate a substitution frame for the lost frame. Many parametric speech codecs such
as linear predictive codecs like AMR or AMR-WB typically freeze the earlier received
parameters or use some extrapolation thereof and use the decoder with them. In essence,
the principle is to have a given model for coding/decoding and to apply the same model
with frozen or extrapolated parameters. The frame loss concealment techniques of the
AMR and AMR-WB can be regarded as representative. They are specified in detail in
the corresponding standards specifications.
[0007] Many audio codecs apply frequency domain techniques for coding.
This means that after some frequency domain transform a coding model is applied on
spectral parameters. The decoder reconstructs the signal spectrum from the received
parameters and finally transforms the spectrum back to a time signal. Typically, the
time signal is reconstructed frame by frame. Such frames are combined by overlap-add
techniques to the final reconstructed signal. Even in that case of audio codecs, state-of-the-art
error concealment typically applies the same or at least a similar decoding model
for lost frames. The frequency domain parameters from a previously received frame
are frozen or suitably extrapolated and then used in the frequency-to-time domain
conversion. Examples for such techniques are provided with the 3GPP audio codecs according
to 3GPP standards.
[0008] US 2004/122680 describes a system for frame error concealment which teaches to adjust the magnitude
of the substitution frame according to the number of consecutive lost frames.
[0009] EP 1 722 359 A1 describes a concealment method including transient detection.
SUMMARY
[0010] Current state-of-the-art solutions for frame loss concealment typically suffer from
quality impairments. The main problem is that the parameter freezing and extrapolation
technique and the re-application of the same decoder model even for lost frames do not
always guarantee a smooth and faithful signal evolution from the previously decoded
signal frames to the lost frame. This typically leads to audible signal discontinuities
with a corresponding quality impact.
[0011] New schemes for frame loss concealment for speech and audio transmission systems
are described. The new schemes improve the quality in case of frame loss over the
quality achievable with prior-art frame loss concealment techniques.
[0012] The objective of the present embodiments is to control a frame loss concealment scheme
that preferably is of the type of the related new methods described such that the
best possible sound quality of the reconstructed signal is achieved. The embodiments
aim at optimizing this reconstruction quality both with respect to the properties
of the signal and of the temporal distribution of the frame losses. Cases in which the
audio signal has strongly varying properties, such as energy onsets or offsets, or in which
it fluctuates strongly in its spectrum, are particularly problematic for frame loss
concealment. In such cases the described concealment methods may repeat the onset,
offset or spectral fluctuation leading to large deviations from the original signal
and corresponding quality loss.
[0013] Another problematic case is if bursts of frame losses occur in a row. Conceptually,
the scheme for frame loss concealment according to the methods described can cope
with such cases, though it turns out that annoying tonal artifacts may still occur.
It is another objective of the present embodiments to mitigate such artifacts to the
highest possible degree.
[0014] According to a first aspect, a frame loss concealment method according to claim 1
is disclosed.
[0015] According to a second aspect, an apparatus for creating a substitution frame for
a lost audio frame according to claim 9 is disclosed.
[0016] According to a third aspect, a computer program is defined for concealing a lost
audio frame, and the computer program comprises instructions which when run by a processor
cause the processor to conceal a lost audio frame, in agreement with the first aspect
described above.
[0017] According to a fourth aspect, a computer program product comprises a computer readable
medium storing a computer program according to the above-described third aspect.
[0018] An advantage of an embodiment is that it controls adaptations of frame loss concealment
methods, which allows the audible impact of frame loss in the transmission of coded speech
and audio signals to be mitigated even further than with the described concealment methods
alone. The general benefit of the embodiments is to provide
a smooth and faithful evolution of the reconstructed signal even for lost frames.
The audible impact of frame losses is greatly reduced in comparison to using state-of-the-art
techniques.
BRIEF DESCRIPTION OF THE DRAWINGS
[0019] All following occurrences of the word "embodiment(s)", if referring to feature combinations
different from those defined by the independent claims, refer to examples which were
originally filed but which do not represent embodiments of the presently claimed invention;
these examples are still shown for illustrative purposes only.
[0020] For a more complete understanding of example embodiments of the present invention,
reference is now made to the following description taken in connection with the accompanying
drawings in which:
Figure 1 shows a rectangular window function.
Figure 2 shows a combination of the Hamming window with the rectangular window.
Figure 3 shows an example of a magnitude spectrum of a window function.
Figure 4 illustrates a line spectrum of an exemplary sinusoidal signal with the frequency
fk.
Figure 5 shows a spectrum of a windowed sinusoidal signal with the frequency fk.
Figure 6 illustrates bars corresponding to the magnitude of grid points of a DFT,
based on an analysis frame.
Figure 7 illustrates a parabola fitting through DFT grid points P1, P2 and P3.
Figure 8 illustrates a fitting of a main lobe of a window spectrum.
Figure 9 illustrates a fitting of main lobe approximation function P through DFT grid
points P1 and P2.
Figure 10 is a flow chart illustrating an example method according to embodiments
of the invention for controlling a concealment method for a lost audio frame of a
received audio signal.
Figure 11 is a flow chart illustrating another example method according to embodiments
of the invention for controlling a concealment method for a lost audio frame of a
received audio signal.
Figure 12 illustrates another example embodiment of the invention.
Figure 13 shows an example of an apparatus according to an embodiment of the invention.
Figure 14 shows another example of an apparatus according to an embodiment of the
invention.
Figure 15 shows another example of an apparatus according to an embodiment of the
invention.
DETAILED DESCRIPTION
[0021] The new controlling scheme for the new frame loss concealment techniques described
involves the following steps as shown in Figure 10. It should be noted that the method
can be implemented in a controller in a decoder.
- 1. Detect conditions in the properties of the previously received and reconstructed
audio signal or in the statistical properties of the observed frame losses for which
the substitution of a lost frame according to the described methods provides relatively
reduced quality, 101.
- 2. In case such a condition is detected in step 1, modify the element of the methods
according to which the substitution frame spectrum is calculated by $Z(m) = Y(m) \cdot e^{j\theta_k}$ by selectively adjusting the phases or the spectrum magnitudes, 102.
Sinusoidal analysis
[0022] A first step of the frame loss concealment technique to which the new controlling
technique may be applied involves a sinusoidal analysis of a part of the previously
received signal. The purpose of this sinusoidal analysis is to find the frequencies
of the main sinusoids of that signal, and the underlying assumption is that the signal
is composed of a limited number of individual sinusoids, i.e. that it is a multi-sine
signal of the following type:

$$s(n) = \sum_{k=1}^{K} a_k \cdot \cos\left(2\pi \frac{f_k}{f_s} \cdot n + \varphi_k\right)$$
[0023] In this equation $K$ is the number of sinusoids that the signal is assumed to consist
of. For each of the sinusoids with index $k = 1 \ldots K$, $a_k$ is the amplitude, $f_k$ is the
frequency, and $\varphi_k$ is the phase. The sampling frequency is denoted by $f_s$ and the
time index of the time-discrete signal samples $s(n)$ by $n$.
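The multi-sine model above can be illustrated with a short sketch. The function name `multi_sine` and its argument layout are assumptions made for this illustration only; they do not appear in the described methods.

```python
import math

def multi_sine(amps, freqs, phases, fs, num_samples):
    """Evaluate s(n) = sum_k a_k * cos(2*pi*f_k/fs*n + phi_k) for n = 0..num_samples-1."""
    return [
        sum(a * math.cos(2.0 * math.pi * f / fs * n + p)
            for a, f, p in zip(amps, freqs, phases))
        for n in range(num_samples)
    ]

# Example: a single 1 kHz sinusoid sampled at 8 kHz (one period spans 8 samples).
signal = multi_sine([1.0], [1000.0], [0.0], 8000.0, 8)
```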
[0024] It is of primary importance to find the frequencies of the sinusoids as exactly as possible.
While an ideal sinusoidal signal would have a line spectrum with line frequencies
$f_k$, finding their true values would in principle require infinite measurement time.
Hence, it is in practice difficult to find these frequencies since they can only be
estimated based on a short measurement period, which corresponds to the signal segment
used for the sinusoidal analysis described herein; this signal segment is hereinafter
referred to as an analysis frame. Another difficulty is that the signal may in practice
be time-variant, meaning that the parameters of the above equation vary over time.
Hence, on the one hand it is desirable to use a long analysis frame making the measurement
more accurate; on the other hand a short measurement period would be needed in order
to better cope with possible signal variations. A good trade-off is to use an analysis
frame length in the order of e.g. 20-40 ms.
[0025] A preferred possibility for identifying the frequencies of the sinusoids $f_k$ is to
make a frequency domain analysis of the analysis frame. To this end the analysis
frame is transformed into the frequency domain, e.g. by means of DFT or DCT or similar
frequency domain transforms. In case a DFT of the analysis frame is used, the spectrum
is given by:

$$X(m) = \mathrm{DFT}\big(w(n) \cdot x(n)\big) = \sum_{n=0}^{L-1} e^{-j\frac{2\pi}{L} m n} \cdot w(n) \cdot x(n)$$
[0026] In this equation $w(n)$ denotes the window function with which the analysis frame of
length $L$ is extracted and weighted. Typical window functions are e.g. rectangular windows
that are equal to 1 for $n \in [0 \ldots L-1]$ and otherwise 0 as shown in Figure 1. It is
assumed here that the time indexes of the previously received audio signal are set such that
the analysis frame is referenced by the time indexes $n = 0 \ldots L-1$. Other window functions
that may be more suitable for spectral analysis are, e.g., the Hamming window, Hanning window,
Kaiser window or Blackman window. A window function that is found to be particularly useful
is a combination of the Hamming window with the rectangular window. This window has a rising
edge shape like the left half of a Hamming window of length $L_1$ and a falling edge shape
like the right half of a Hamming window of length $L_1$, and between the rising and falling
edges the window is equal to 1 for the length of $L - L_1$, as shown in Figure 2.
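A minimal sketch of the combined Hamming-rectangular window follows. It assumes the common symmetric Hamming definition $0.54 - 0.46\cos(2\pi n/(L_1-1))$ and an even $L_1$ so that the window splits into two equal halves; neither detail is stated in the text, and the function name is hypothetical.

```python
import math

def hamming_rect_window(L, L1):
    """Window of length L: rising Hamming half, flat part of length L - L1, falling Hamming half.

    Assumption: L1 is even and 2 <= L1 <= L; the symmetric Hamming definition is used.
    """
    if L1 % 2 != 0 or not 2 <= L1 <= L:
        raise ValueError("L1 must be even and satisfy 2 <= L1 <= L")
    ham = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (L1 - 1)) for n in range(L1)]
    half = L1 // 2
    # Left half of the Hamming window, then ones, then the right half.
    return ham[:half] + [1.0] * (L - L1) + ham[half:]

window = hamming_rect_window(16, 8)
```

With $L_1 = L$ the sketch reduces to a plain Hamming window, since the flat part then has length zero.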
[0027] The peaks of the magnitude spectrum of the windowed analysis frame $|X(m)|$ constitute
an approximation of the required sinusoidal frequencies $f_k$. The accuracy of this
approximation is however limited by the frequency spacing of the DFT. With the DFT with block
length $L$ the accuracy is limited to

$$\frac{f_s}{2L}$$
[0028] Experiments show that this level of accuracy may be too low in the scope of the methods
described herein. Improved accuracy can be obtained based on the results of the following
consideration:
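The spectrum of the windowed analysis frame is given by the convolution of the spectrum
of the window function with the line spectrum of the sinusoidal model signal
$S(\Omega)$, subsequently sampled at the grid points of the DFT:

$$X(m) = \int_{2\pi} \delta\left(\Omega - m \cdot \frac{2\pi}{L}\right) \cdot \big(W(\Omega) * S(\Omega)\big) \, d\Omega$$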
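[0029] By using the spectrum expression of the sinusoidal model signal, this can be written
as

$$X(m) = \frac{1}{2} \int_{2\pi} \delta\left(\Omega - m \cdot \frac{2\pi}{L}\right) \cdot \left( W(\Omega) * \sum_{k=1}^{K} a_k \left( \delta\left(\Omega + 2\pi \frac{f_k}{f_s}\right) e^{-j\varphi_k} + \delta\left(\Omega - 2\pi \frac{f_k}{f_s}\right) e^{j\varphi_k} \right) \right) d\Omega$$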
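[0030] Hence, the sampled spectrum is given by

$$X(m) = \frac{1}{2} \sum_{k=1}^{K} a_k \left( W\left(2\pi\left(\frac{m}{L} + \frac{f_k}{f_s}\right)\right) e^{-j\varphi_k} + W\left(2\pi\left(\frac{m}{L} - \frac{f_k}{f_s}\right)\right) e^{j\varphi_k} \right)$$

with $m = 0 \ldots L-1$.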
[0031] Based on this consideration it is assumed that the observed peaks in the magnitude
spectrum of the analysis frame stem from a windowed sinusoidal signal with $K$ sinusoids
where the true sinusoid frequencies are found in the vicinity of the peaks.
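[0032] Let $m_k$ be the DFT index (grid point) of the observed $k$th peak, then the
corresponding frequency is

$$\hat{f}_k = \frac{m_k}{L} \cdot f_s$$

which can be regarded as an approximation of the true sinusoidal frequency $f_k$. The true
sinusoid frequency $f_k$ can be assumed to lie within the interval

$$\left[ \left(m_k - \tfrac{1}{2}\right) \cdot \frac{f_s}{L}, \; \left(m_k + \tfrac{1}{2}\right) \cdot \frac{f_s}{L} \right]$$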
[0033] For clarity it is noted that the convolution of the spectrum of the window function
with the line spectrum of the sinusoidal model signal can be understood
as a superposition of frequency-shifted versions of the window function spectrum,
whereby the shift frequencies are the frequencies of the sinusoids. This superposition
is then sampled at the DFT grid points. These steps are illustrated by the following
figures. Figure 3 displays an example of the magnitude spectrum of a window function.
Figure 4 shows the magnitude spectrum (line spectrum) of an example sinusoidal signal
with a single sinusoid of frequency $f_k$. Figure 5 shows the magnitude spectrum of the
windowed sinusoidal signal that replicates and superposes the frequency-shifted window
spectra at the frequencies of the sinusoid. The bars in Figure 6 correspond to the
magnitude of the grid points of the DFT of the windowed sinusoid that are obtained
by calculating the DFT of the analysis frame. It should be noted that all spectra
are periodic in the normalized frequency parameter $\Omega$, where $\Omega = 2\pi$
corresponds to the sampling frequency $f_s$.
[0034] The previous discussion and the illustration of Figure 6 suggest that a better approximation
of the true sinusoidal frequencies can only be found through increasing the resolution
of the search over the frequency resolution of the used frequency domain transform.
[0035] One preferred way to find better approximations of the frequencies $f_k$ of the
sinusoids is to apply parabolic interpolation. One such approach is to fit
parabolas through the grid points of the DFT magnitude spectrum that surround the
peaks and to calculate the respective frequencies belonging to the parabola maxima.
A suitable choice for the order of the parabolas is 2. In detail the following procedure
can be applied:
- 1. Identify the peaks of the DFT of the windowed analysis frame. The peak search will
deliver the number of peaks K and the corresponding DFT indexes of the peaks. The peak search can typically be
made on the DFT magnitude spectrum or the logarithmic DFT magnitude spectrum.
- 2. For each peak $k$ (with $k = 1 \ldots K$) with corresponding DFT index $m_k$ fit a parabola through the three points $\{P_1; P_2; P_3\} = \{(m_k - 1, \log|X(m_k - 1)|); (m_k, \log|X(m_k)|); (m_k + 1, \log|X(m_k + 1)|)\}$. This results in parabola coefficients $b_k(0)$, $b_k(1)$, $b_k(2)$ of the parabola defined by

$$p_k(q) = \sum_{i=0}^{2} b_k(i) \cdot q^i$$

This parabola fitting is illustrated in Figure 7.
- 3. For each of the $K$ parabolas calculate the interpolated frequency index

$$\hat{q}_k = -\frac{b_k(1)}{2 \, b_k(2)}$$

corresponding to the value of $q$ for which the parabola has its maximum. Use

$$\hat{f}_k = \hat{q}_k \cdot \frac{f_s}{L}$$

as approximation for the sinusoid frequency $f_k$.
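The parabolic interpolation procedure above may be sketched as follows. The sketch assumes a plain Hamming analysis window, a direct (slow) DFT, and the closed-form vertex of the parabola through three equally spaced points, which equals $-b_k(1)/(2 b_k(2))$ expressed in absolute DFT index units. The function names are hypothetical.

```python
import cmath
import math

def log_magnitude_spectrum(frame):
    """Log magnitude of the DFT of a Hamming-windowed frame, bins 0..L/2."""
    L = len(frame)
    w = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (L - 1)) for n in range(L)]
    spec = []
    for m in range(L // 2 + 1):
        X = sum(w[n] * frame[n] * cmath.exp(-2j * math.pi * m * n / L) for n in range(L))
        spec.append(math.log(abs(X) + 1e-12))  # small offset avoids log(0)
    return spec

def refine_peak(log_mag, m_k, fs, L):
    """Fit a parabola through bins m_k-1, m_k, m_k+1 and return the vertex frequency in Hz."""
    y1, y2, y3 = log_mag[m_k - 1], log_mag[m_k], log_mag[m_k + 1]
    denom = y1 - 2.0 * y2 + y3            # equals 2*b_k(2); negative at a true peak
    q_hat = m_k + 0.5 * (y1 - y3) / denom  # interpolated (fractional) DFT index
    return q_hat * fs / L

# Example: 1030 Hz tone, fs = 8 kHz, L = 64 (DFT grid spacing 125 Hz).
fs, L = 8000.0, 64
frame = [math.cos(2.0 * math.pi * 1030.0 / fs * n) for n in range(L)]
log_mag = log_magnitude_spectrum(frame)
m_peak = max(range(1, len(log_mag) - 1), key=log_mag.__getitem__)
f_hat = refine_peak(log_mag, m_peak, fs, L)
```

In this example the grid estimate is 1000 Hz (bin 8), whereas the interpolated estimate lies considerably closer to the true 1030 Hz.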
[0036] The described approach provides good results but may have some limitations since
the parabolas do not approximate the shape of the main lobe of the magnitude spectrum
$|W(\Omega)|$ of the window function. An alternative scheme doing this is an enhanced frequency
estimation using a main lobe approximation, described as follows. The main idea of
this alternative is to fit a function $P(q)$, which approximates the main lobe of
$\left|W\left(\frac{2\pi}{L} \cdot q\right)\right|$,
through the grid points of the DFT magnitude spectrum that surround the peaks and
to calculate the respective frequencies belonging to the function maxima. The function
$P(q)$ could be identical to the frequency-shifted magnitude spectrum
$\left|W\left(\frac{2\pi}{L} \cdot (q - \hat{q})\right)\right|$
of the window function. For numerical simplicity it should however rather be, for instance,
a polynomial, which allows for straightforward calculation of the function maximum.
The following detailed procedure can be applied:
- 1. Identify the peaks of the DFT of the windowed analysis frame. The peak search will
deliver the number of peaks K and the corresponding DFT indexes of the peaks. The peak search can typically be
made on the DFT magnitude spectrum or the logarithmic DFT magnitude spectrum.
- 2. Derive the function $P(q)$ that approximates the magnitude spectrum
$\left|W\left(\frac{2\pi}{L} \cdot q\right)\right|$
of the window function or the logarithmic magnitude spectrum
$\log\left|W\left(\frac{2\pi}{L} \cdot q\right)\right|$
for a given interval $(q_1, q_2)$. The choice of the approximation function approximating the window spectrum main
lobe is illustrated by Figure 8.
- 3. For each peak $k$ (with $k = 1 \ldots K$) with corresponding DFT index $m_k$ fit the frequency-shifted function $P(q - \hat{q}_k)$ through the two DFT grid points that surround the expected true peak of the continuous
spectrum of the windowed sinusoidal signal. Hence, if $|X(m_k - 1)|$ is larger than $|X(m_k + 1)|$ fit $P(q - \hat{q}_k)$ through the points $\{P_1; P_2\} = \{(m_k - 1, \log|X(m_k - 1)|); (m_k, \log|X(m_k)|)\}$ and otherwise through the points $\{P_1; P_2\} = \{(m_k, \log|X(m_k)|); (m_k + 1, \log|X(m_k + 1)|)\}$. $P(q)$ can for simplicity be chosen to be a polynomial either of order 2 or 4. This renders
the approximation in step 2 a simple linear regression calculation and the calculation
of $\hat{q}_k$ straightforward. The interval $(q_1, q_2)$ can be chosen to be fixed and identical for all peaks, e.g. $(q_1, q_2) = (-1, 1)$, or adaptive.
In the adaptive approach the interval can be chosen such that the function $P(q - \hat{q}_k)$ fits the main lobe of the window function spectrum in the range of the relevant
DFT grid points $\{P_1; P_2\}$. The fitting process is visualized in Figure 9.
- 4. For each of the $K$ frequency shift parameters $\hat{q}_k$ for which the continuous spectrum of the windowed sinusoidal signal is expected to
have its peak calculate

$$\hat{f}_k = \hat{q}_k \cdot \frac{f_s}{L}$$

as approximation for the sinusoid frequency $f_k$.
[0037] There are many cases where the transmitted signal is harmonic, meaning that the signal
consists of sine waves whose frequencies are integer multiples of some fundamental
frequency $f_0$. This is the case when the signal is very periodic, as for instance for voiced
speech or the sustained tones of some musical instrument. This means that the frequencies
of the sinusoidal model of the embodiments are not independent but rather have a harmonic
relationship and stem from the same fundamental frequency. Taking this harmonic property
into account can consequently improve the analysis of the sinusoidal component frequencies
substantially.
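[0039] In case such a peak with corresponding estimated sinusoidal frequency $\hat{f}_k$ is
present, supersede $\hat{f}_k$ by the harmonic frequency in whose vicinity the peak lies,
i.e. by the corresponding integer multiple $j$ of the fundamental frequency:

$$\hat{f}_k := j \cdot f_0$$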
[0040] For the two-step procedure given above there is also the possibility to make the
check whether the signal is harmonic and the derivation of the fundamental frequency
implicitly and possibly in an iterative fashion without necessarily using indicators
from some separate method. An example for such a technique is given as follows:
For each $f_{0,p}$ out of a set of candidate values $\{f_{0,1} \ldots f_{0,P}\}$ apply the
procedure step 2, though without superseding $\hat{f}_k$ but with counting how many DFT peaks
are present within the vicinity around the harmonic frequencies, i.e. the integer multiples of
$f_{0,p}$. Identify the fundamental frequency $f_{0,p_{max}}$ for which the largest number of
peaks at or around the harmonic frequencies is obtained.
If this largest number of peaks exceeds a given threshold, then the signal is assumed
to be harmonic. In that case $f_{0,p_{max}}$ can be assumed to be the fundamental frequency
with which step 2 is then executed leading to enhanced sinusoidal frequencies $\hat{f}_k$.
A more preferable alternative is however first to optimize the fundamental frequency
$f_0$ based on the peak frequencies $\hat{f}_k$ that have been found to coincide with harmonic
frequencies. Assume a set of $M$ harmonics, i.e. integer multiples $\{n_1 \ldots n_M\}$ of some
fundamental frequency that have been found to coincide with some set of $M$ spectral peaks at
frequencies $\hat{f}_{k(m)}$, $m = 1 \ldots M$, then the underlying (optimized) fundamental
frequency $f_{0,opt}$ can be calculated to minimize the error between the harmonic frequencies
and the spectral peak frequencies. If the error to be minimized is the mean square error

$$E = \sum_{m=1}^{M} \left(n_m \cdot f_0 - \hat{f}_{k(m)}\right)^2$$

then the optimal fundamental frequency is calculated as

$$f_{0,opt} = \frac{\sum_{m=1}^{M} n_m \cdot \hat{f}_{k(m)}}{\sum_{m=1}^{M} n_m^2}$$
[0041] The initial set of candidate values $\{f_{0,1} \ldots f_{0,P}\}$ can be obtained from
the frequencies of the DFT peaks or the estimated sinusoidal frequencies $\hat{f}_k$.
[0042] A further possibility to improve the accuracy of the estimated sinusoidal frequencies
$\hat{f}_k$ is to consider their temporal evolution. To that end, the estimates of the sinusoidal
frequencies from multiple analysis frames can be combined for instance by means
of averaging or prediction. Prior to averaging or prediction a peak tracking can be
applied that connects the estimated spectral peaks to the respective same underlying
sinusoids.
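The least-squares optimization of the fundamental frequency described in the harmonic refinement above can be illustrated by a short sketch. Setting the derivative of the squared error with respect to $f_0$ to zero yields the ratio of two sums; the function name and the example values are assumptions made for illustration.

```python
def optimal_f0(harmonic_numbers, peak_freqs):
    """Least-squares fundamental frequency: sum(n_m * f_m) / sum(n_m ** 2).

    harmonic_numbers: integer multiples n_1..n_M matched to the peaks.
    peak_freqs: estimated peak frequencies in Hz, in the same order.
    """
    numerator = sum(n * f for n, f in zip(harmonic_numbers, peak_freqs))
    denominator = sum(n * n for n in harmonic_numbers)
    return numerator / denominator

# Example: slightly perturbed harmonics of a 100 Hz fundamental.
f0_opt = optimal_f0([1, 2, 3], [101.0, 199.0, 302.0])
```

Higher harmonics carry more weight in this estimate, because a given error in $f_0$ is multiplied by the harmonic number.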
Applying the sinusoidal model
[0043] The application of a sinusoidal model in order to perform a frame loss concealment
operation described herein may be described as follows.
[0044] It is assumed that a given segment of the coded signal cannot be reconstructed by
the decoder since the corresponding encoded information is not available. It is further
assumed that a part of the signal prior to this segment is available. Let $y(n)$ with
$n = 0 \ldots N-1$ be the unavailable segment for which a substitution frame $z(n)$ has to be
generated and $y(n)$ with $n < 0$ be the available previously decoded signal. Then, in a first
step a prototype frame of the available signal of length $L$ and start index $n_{-1}$ is
extracted with a window function $w(n)$ and transformed into frequency domain, e.g. by means
of DFT:

$$Y_{-1}(m) = \sum_{n=0}^{L-1} y(n - n_{-1}) \cdot w(n) \cdot e^{-j\frac{2\pi}{L} n m}$$
[0045] The window function can be one of the window functions described above in the sinusoidal
analysis. Preferably, in order to save numerical complexity, the frequency domain
transformed frame should be identical with the one used during sinusoidal analysis.
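[0046] In a next step the sinusoidal model assumption is applied. According to that the
DFT of the prototype frame can be written as follows:

$$Y_{-1}(m) = \frac{1}{2} \sum_{k=1}^{K} a_k \cdot \left( W\left(2\pi\left(\frac{m}{L} + \frac{f_k}{f_s}\right)\right) \cdot e^{-j\varphi_k} + W\left(2\pi\left(\frac{m}{L} - \frac{f_k}{f_s}\right)\right) \cdot e^{j\varphi_k} \right)$$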
[0047] The next step is to realize that the spectrum of the used window function has only
a significant contribution in a frequency range close to zero. As illustrated in Figure
3 the magnitude spectrum of the window function is large for frequencies close to
zero and small otherwise (within the normalized frequency range from -π to π, corresponding
to half the sampling frequency). Hence, as an approximation it is assumed that the
window spectrum $W(m)$ is non-zero only for an interval $M = [-m_{min}, m_{max}]$, with
$m_{min}$ and $m_{max}$ being small positive numbers. In particular, an approximation of the
window function spectrum is used such that for each $k$ the contributions of the shifted window
spectra in the above expression are strictly non-overlapping. Hence in the above equation for
each frequency index there is always only at maximum the contribution from one summand, i.e.
from one shifted window spectrum. This means that the expression above reduces to the following
approximate expression:

$$\hat{Y}_{-1}(m) = \frac{a_k}{2} \cdot W\left(2\pi\left(\frac{m}{L} - \frac{f_k}{f_s}\right)\right) \cdot e^{j\varphi_k}$$

for non-negative $m \in M_k$ and for each $k$.
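[0048] Herein, $M_k$ denotes the integer interval

$$M_k = \left[ \mathrm{round}\left(\frac{f_k}{f_s} \cdot L\right) - m_{min,k}, \; \mathrm{round}\left(\frac{f_k}{f_s} \cdot L\right) + m_{max,k} \right]$$

where $m_{min,k}$ and $m_{max,k}$ fulfill the above explained constraint such that the intervals
are not overlapping. A suitable choice for $m_{min,k}$ and $m_{max,k}$ is to set them to a small
integer value δ, e.g. δ = 3. If however the DFT indices related to two neighboring sinusoidal
frequencies $f_k$ and $f_{k+1}$ are less than 2δ apart, then δ is set to

$$\mathrm{floor}\left( \frac{\mathrm{round}\left(\frac{f_{k+1}}{f_s} \cdot L\right) - \mathrm{round}\left(\frac{f_k}{f_s} \cdot L\right)}{2} \right)$$

such that it is ensured that the intervals are not overlapping. The function floor(·)
returns the largest integer that is smaller than or equal to the function argument.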
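The construction of the intervals $M_k$ may be sketched as follows. One detail is an assumption of this sketch and deviates from the text: when two peaks are close, the half-width is shrunk to floor((d − 1)/2), where d is the distance between the two rounded DFT indices, rather than floor(d/2), so that neighboring intervals never share a DFT index even for even d. The function name is hypothetical, and the rounded DFT indices of the sinusoids are assumed to be distinct.

```python
def peak_intervals(freqs, fs, L, delta=3):
    """Return one (low, high) DFT index interval M_k per sinusoid, none overlapping.

    Assumptions of this sketch: distinct rounded DFT indices, only non-negative
    indices up to L/2 are used, and close neighbors shrink the half-width to
    floor((d - 1) / 2) instead of floor(d / 2).
    """
    centers = sorted(round(f / fs * L) for f in freqs)
    intervals = []
    for k, c in enumerate(centers):
        low_width = high_width = delta
        if k > 0:                       # distance to the lower neighbor
            low_width = min(delta, (c - centers[k - 1] - 1) // 2)
        if k + 1 < len(centers):        # distance to the upper neighbor
            high_width = min(delta, (centers[k + 1] - c - 1) // 2)
        intervals.append((max(0, c - low_width), min(L // 2, c + high_width)))
    return intervals

# Example: peaks at DFT indices 8 and 12 (L = 64, fs = 8 kHz) are closer than 2*delta.
example = peak_intervals([1000.0, 1500.0], 8000.0, 64)
```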
[0049] The next step according to the embodiment is to apply the sinusoidal model according
to the above expression and to evolve its $K$ sinusoids in time. The assumption that the time
indices of the erased segment compared to the time indices of the prototype frame differ by
$n_{-1}$ samples means that the phases of the sinusoids advance by

$$\theta_k = 2\pi \cdot \frac{f_k}{f_s} \cdot n_{-1}$$
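[0050] Hence, the DFT spectrum of the evolved sinusoidal model is given by:

$$Y_0(m) = \frac{1}{2} \sum_{k=1}^{K} a_k \cdot \left( W\left(2\pi\left(\frac{m}{L} + \frac{f_k}{f_s}\right)\right) \cdot e^{-j(\varphi_k + \theta_k)} + W\left(2\pi\left(\frac{m}{L} - \frac{f_k}{f_s}\right)\right) \cdot e^{j(\varphi_k + \theta_k)} \right)$$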
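[0051] Applying again the approximation according to which the shifted window function spectra
do not overlap gives:

$$\hat{Y}_0(m) = \frac{a_k}{2} \cdot W\left(2\pi\left(\frac{m}{L} - \frac{f_k}{f_s}\right)\right) \cdot e^{j(\varphi_k + \theta_k)}$$

for non-negative $m \in M_k$ and for each $k$.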
[0052] Comparing the DFT of the prototype frame $Y_{-1}(m)$ with the DFT of the evolved
sinusoidal model $Y_0(m)$ by using the approximation, it is found that the magnitude spectrum
remains unchanged while the phase is shifted by

$$\theta_k = 2\pi \cdot \frac{f_k}{f_s} \cdot n_{-1}$$

for each $m \in M_k$. Hence, the frequency spectrum coefficients of the prototype frame in the
vicinity of each sinusoid are shifted proportional to the sinusoidal frequency $f_k$ and the
time difference $n_{-1}$ between the lost audio frame and the prototype frame.
[0053] Hence, according to the embodiment the substitution frame can be calculated by the
following expression:

$$z(n) = \mathrm{IDFT}\{Z(m)\} \quad \text{with} \quad Z(m) = Y(m) \cdot e^{j\theta_k}$$

for non-negative $m \in M_k$ and for each $k$.
[0054] A specific embodiment addresses phase randomization for DFT indices not belonging
to any interval $M_k$. As described above, the intervals $M_k$, $k = 1 \ldots K$ have to be set
such that they are strictly non-overlapping which is done using some parameter δ which controls
the size of the intervals. It may happen that δ is small in relation to the frequency distance
of two neighboring sinusoids. Hence, in that case it happens that there is a gap between two
intervals. Consequently, for the corresponding DFT indices $m$ no phase shift according to the
above expression $Z(m) = Y(m) \cdot e^{j\theta_k}$ is defined. A suitable choice according to
this embodiment is to randomize the phase for these indices, yielding
$Z(m) = Y(m) \cdot e^{j 2\pi \mathrm{rand}(\cdot)}$, where the function rand(·) returns some
random number.
[0055] It has been found beneficial for the quality of the reconstructed signals to optimize
the size of the intervals $M_k$. In particular, the intervals should be larger if the signal
is very tonal, i.e. when it has clear and distinct spectral peaks. This is the case for
instance when the signal
is harmonic with a clear periodicity. In other cases where the signal has less pronounced
spectral structure with broader spectral maxima, it has been found that using small
intervals leads to better quality. This finding leads to a further improvement according
to which the interval size is adapted according to the properties of the signal. One
realization is to use a tonality or a periodicity detector. If this detector identifies
the signal as tonal, the δ-parameter controlling the interval size is set to a relatively
large value. Otherwise, the δ-parameter is set to relatively smaller values.
[0056] Based on the above, the audio frame loss concealment methods involve the following
steps:
- 1. Analyzing a segment of the available, previously synthesized signal to obtain the
constituent sinusoidal frequencies $f_k$ of a sinusoidal model, optionally using an enhanced frequency estimation.
- 2. Extracting a prototype frame $y_{-1}$ from the available previously synthesized signal and calculating the DFT of that frame.
- 3. Calculating the phase shift $\theta_k$ for each sinusoid $k$ in response to the sinusoidal frequency $f_k$ and the time advance $n_{-1}$ between the prototype frame and the substitution frame. Optionally in this step the
size of the interval $M_k$ may have been adapted in response to the tonality of the audio signal.
- 4. For each sinusoid $k$ advancing the phase of the prototype frame DFT with $\theta_k$ selectively for the DFT indices related to a vicinity around the sinusoid frequency
$f_k$.
- 5. Calculating the inverse DFT of the spectrum obtained in step 4.
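Steps 2 to 5 above can be sketched in a few lines. The sketch rests on several assumptions that go beyond the text: the prototype frame passed in is already windowed, the sinusoid frequencies are given (step 1 is not repeated), a fixed δ is used for all intervals, bins outside all intervals are left unchanged rather than phase-randomized, the DC bin and the bin at half the sampling frequency are not modified, and a direct (slow) DFT stands in for an FFT. The function name is hypothetical.

```python
import cmath
import math

def conceal_frame(prototype, freqs, fs, n_advance, delta=3):
    """Phase-evolve the DFT of a (windowed) prototype frame and return the substitution frame.

    Assumes the intervals around the given frequencies do not overlap.
    """
    L = len(prototype)
    # Step 2: DFT of the prototype frame.
    Y = [sum(x * cmath.exp(-2j * math.pi * m * n / L) for n, x in enumerate(prototype))
         for m in range(L)]
    Z = list(Y)
    for f in freqs:
        theta = 2.0 * math.pi * f / fs * n_advance      # Step 3: phase advance theta_k
        center = round(f / fs * L)
        # Step 4: rotate the bins in the vicinity of the sinusoid (non-negative m only).
        for m in range(max(1, center - delta), min(L // 2 - 1, center + delta) + 1):
            Z[m] = Y[m] * cmath.exp(1j * theta)
            Z[L - m] = Z[m].conjugate()                 # keep the time signal real
    # Step 5: inverse DFT.
    return [sum(Z[m] * cmath.exp(2j * math.pi * m * n / L) for m in range(L)).real / L
            for n in range(L)]

# Example: a 1 kHz tone at fs = 8 kHz, advanced by 37 samples.
fs, L = 8000.0, 32
proto = [math.cos(2.0 * math.pi * 1000.0 / fs * n) for n in range(L)]
z = conceal_frame(proto, [1000.0], fs, 37)
```

For a stationary sinusoid that falls exactly on a DFT grid point and a rectangular window, as in this example, the substitution frame continues the tone with the phase it would have had 37 samples later.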
Signal and frame loss property analysis and detection
[0057] The methods described above are based on the assumption that the properties of the
audio signal do not change significantly during the short time duration between the previously
received and reconstructed signal frame and a lost frame. In that case it is a very
good choice to retain the magnitude spectrum of the previously reconstructed frame
and to evolve the phases of the sinusoidal main components detected in the previously
reconstructed signal. There are however cases where this assumption is wrong, for
instance transients with sudden energy changes or sudden spectral changes.
[0058] A first embodiment of a transient detector according to the invention can consequently
be based on energy variations within the previously reconstructed signal. This method,
illustrated in Figure 11, calculates the energy in a left part and a right part of
some analysis frame 113. The analysis frame may be identical to the frame used for
sinusoidal analysis described above. A part (either left or right) of the analysis
frame may be the first or respectively the last half of the analysis frame or e.g.
the first or respectively the last quarter of the analysis frame, 110. The respective
energy calculation is done by summing the squares of the samples in these partial frames:

$$E_{left} = \sum_{n=0}^{N_{part}-1} y^2(n - n_{left}) \quad \text{and} \quad E_{right} = \sum_{n=0}^{N_{part}-1} y^2(n - n_{right})$$
[0059] Herein $y(n)$ denotes the analysis frame, $n_{left}$ and $n_{right}$ denote the
respective start indices of the partial frames that are both of size $N_{part}$.
[0060] Now the left and right partial frame energies are used for the detection of a signal
discontinuity. This is done by calculating the ratio

$$R_{l/r} = \frac{E_{left}}{E_{right}}$$
[0061] A discontinuity with sudden energy decrease (offset) can be detected if the ratio
$R_{l/r}$ exceeds some threshold (e.g. 10), 115. Similarly a discontinuity with sudden energy
increase (onset) can be detected if the ratio $R_{l/r}$ is below some other threshold
(e.g. 0.1), 117.
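The broadband energy-ratio detector may be sketched as follows. The partial frames are taken here as the first and last `n_part` samples of the analysis frame, and a small constant guards against division by zero for a silent right part; both choices, as well as the function names, are assumptions of this sketch. The thresholds 10 and 0.1 are the example values given above.

```python
def energy_ratio(frame, n_part, eps=1e-12):
    """Ratio of the energies of the first and the last n_part samples of the frame."""
    e_left = sum(x * x for x in frame[:n_part])
    e_right = sum(x * x for x in frame[-n_part:])
    return (e_left + eps) / (e_right + eps)   # eps avoids division by zero (assumption)

def classify_transient(ratio, offset_threshold=10.0, onset_threshold=0.1):
    """Map the energy ratio to 'offset', 'onset' or 'none'."""
    if ratio > offset_threshold:
        return "offset"      # sudden energy decrease
    if ratio < onset_threshold:
        return "onset"       # sudden energy increase
    return "none"

# Example: the signal level drops by a factor of 10 in the second half of the frame.
frame = [1.0] * 8 + [0.1] * 8
decision = classify_transient(energy_ratio(frame, 8))
```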
[0062] In the context of the above described concealment methods it has been found that
the above defined energy ratio may in many cases be too insensitive an indicator. In
particular in real signals, and especially music, there are cases where a tone at some
frequency suddenly emerges while some other tone at some other frequency suddenly
stops. Analyzing such a signal frame with the above-defined energy ratio would in
any case lead to a wrong detection result for at least one of the tones, since this
indicator does not distinguish between different frequencies.
[0063] A solution to this problem is described in the following embodiment. The transient
detection is now done in the time-frequency plane. The analysis frame is again partitioned
into a left and a right partial frame, 110. Now, however, these two partial frames are (after
suitable windowing with e.g. a Hamming window, 111) transformed into the frequency domain,
e.g. by means of an $N_{part}$-point DFT, 112:

$$Y_{left}(m) = \mathrm{DFT}_{N_{part}}\{\, y(n_{left} + n) \,\}$$

and

$$Y_{right}(m) = \mathrm{DFT}_{N_{part}}\{\, y(n_{right} + n) \,\}$$

with $m = 0 \ldots N_{part} - 1$.
[0064] Now the transient detection can be done frequency selectively for each DFT bin with
index $m$. Using the powers of the left and right partial frame magnitude spectra, for each
DFT index $m$ a respective energy ratio can be calculated, 113, as

$$R_{l/r}(m) = \frac{|Y_{left}(m)|^2}{|Y_{right}(m)|^2}$$
[0065] Experiments show that frequency selective transient detection with DFT bin resolution
is relatively imprecise due to statistical fluctuations (estimation errors). It was
found that the quality of the operation is enhanced when the frequency selective
transient detection is made on the basis of frequency bands. Let
$I_k = [m_{k-1} + 1, \ldots, m_k]$ specify the $k$-th interval, $k = 1 \ldots K$, covering the
DFT bins from $m_{k-1} + 1$ to $m_k$; then these intervals define $K$ frequency bands. The
frequency group selective transient detection can now be based on the band-wise ratio
between the respective band energies of the left and right partial frames:

$$R_{l/r,band}(k) = \frac{\sum_{m \in I_k} |Y_{left}(m)|^2}{\sum_{m \in I_k} |Y_{right}(m)|^2}$$
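[0066] It is to be noted that the interval $I_k = [m_{k-1} + 1, \ldots, m_k]$ corresponds to
the frequency band

$$B_k = \left[ \frac{m_{k-1} + 1}{N_{part}} \cdot f_s, \; \ldots, \; \frac{m_k}{N_{part}} \cdot f_s \right]$$

where $f_s$ denotes the audio sampling frequency.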
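As a hedged illustration of the band-wise detection in paragraphs [0063] to [0066], the following Python sketch windows both partial frames, computes their power spectra with a naive DFT, and forms one energy ratio per band. The function names, the `eps` guard, the symmetric Hamming definition and the test signal are assumptions made for the example only; a real implementation would use an FFT.

```python
import cmath
import math


def hamming(n_part):
    """Symmetric Hamming window of length n_part."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_part - 1))
            for n in range(n_part)]


def dft_power(x):
    """Power spectrum |X(m)|^2, m = 0..N-1, via a naive O(N^2) DFT."""
    n_len = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * m * n / n_len)
                    for n in range(n_len))) ** 2
            for m in range(n_len)]


def band_energy_ratios(y, n_left, n_right, n_part, band_edges, eps=1e-12):
    """band_edges = [m_0, ..., m_K]; band k covers bins m_{k-1}+1 .. m_k."""
    w = hamming(n_part)
    p_left = dft_power([w[n] * y[n_left + n] for n in range(n_part)])
    p_right = dft_power([w[n] * y[n_right + n] for n in range(n_part)])
    ratios = []
    for k in range(1, len(band_edges)):
        bins = range(band_edges[k - 1] + 1, band_edges[k] + 1)
        e_left = sum(p_left[m] for m in bins)
        e_right = sum(p_right[m] for m in bins)
        ratios.append((e_left + eps) / (e_right + eps))  # eps: added guard
    return ratios


# Example: a tone at bin 2 stops while a tone at bin 6 starts.
n_part = 32
left = [math.cos(2.0 * math.pi * 2 * n / n_part) for n in range(n_part)]
right = [math.cos(2.0 * math.pi * 6 * n / n_part) for n in range(n_part)]
ratios = band_energy_ratios(left + right, 0, n_part, n_part, band_edges=[0, 4, 8])
```

In this example both halves carry about the same total energy, so a full-band energy ratio stays near 1, whereas the first band indicates an offset and the second band an onset.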
[0067] The lowest lower frequency band boundary $m_0$ can be set to 0 but may also be set to
a DFT index corresponding to a larger frequency in order to mitigate estimation errors
that grow with lower frequencies. The highest upper frequency band boundary $m_K$ can be
set to $N_{part}/2$ but is preferably chosen to correspond to some lower frequency at which
a transient still has a significant audible effect.
[0068] A suitable choice for these frequency band sizes or widths is to make them of equal
size, with e.g. a width of several hundred Hz. Another preferred way is to make the
frequency band widths follow the size of the human auditory critical bands, i.e.
to relate them to the frequency resolution of the auditory system. This means approximately
making the frequency band widths equal for frequencies up to 1 kHz and increasing
them exponentially above 1 kHz. Exponential increase means for instance doubling
the frequency bandwidth when incrementing the band index $k$.
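One possible way to generate such a band layout is sketched below in Python. The concrete numbers (200 Hz base width, 8 kHz upper limit, 512-point partial frames at 16 kHz) and the function names are illustrative assumptions, not values taken from the description; rounding to the nearest DFT index is likewise just one reasonable choice.

```python
def band_edges_hz(width_hz=200.0, knee_hz=1000.0, f_max_hz=8000.0):
    """Equal-width bands up to knee_hz, then the width doubles per band."""
    edges = [0.0]
    width = width_hz
    while edges[-1] < f_max_hz:
        if edges[-1] >= knee_hz:
            width *= 2.0  # exponential growth above the knee frequency
        edges.append(min(edges[-1] + width, f_max_hz))
    return edges


def hz_to_dft_index(edges_hz, n_part, fs_hz):
    """Map band edges in Hz to DFT indices m_k of an n_part-point DFT."""
    return [round(f * n_part / fs_hz) for f in edges_hz]


edges_hz = band_edges_hz()
idx = hz_to_dft_index(edges_hz, n_part=512, fs_hz=16000.0)
```

The resulting index list plays the role of the boundaries m_0 ... m_K; the last entry here equals half the DFT length, but a lower upper boundary can be obtained by reducing `f_max_hz`.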
[0069] As described in the first embodiment of the transient detector that was based on
an energy ratio of two partial frames, any of the ratios related to band energies
or DFT bin energies of two partial frames are compared to certain thresholds. A respective
upper threshold for (frequency selective) offset detection, 115, and a respective lower
threshold for (frequency selective) onset detection, 117, are used.
[0070] A further audio signal dependent indicator that is suitable for an adaptation of
the frame loss concealment method can be based on the codec parameters transmitted
to the decoder. For instance, the codec may be a multi-mode codec like ITU-T G.718.
Such a codec may use particular codec modes for different signal types, and a change
of the codec mode in a frame shortly before the frame loss may be regarded as an indicator
of a transient.
[0071] Another useful indicator for adaptation of the frame loss concealment is a codec
parameter related to a voicing property of the transmitted signal. Voicing relates
to highly periodic speech that is generated by a periodic glottal excitation of the
human vocal tract.
[0072] A further preferred indicator is whether the signal content is estimated to be music
or speech. Such an indicator can be obtained from a signal classifier that may typically
be part of the codec. In case the codec performs such a classification and makes a
corresponding classification decision available as a coding parameter to the decoder,
this parameter is preferably used as signal content indicator to be used for adapting
the frame loss concealment method.
[0073] Another indicator that is preferably used for adaptation of the frame loss concealment
methods is the burstiness of the frame losses. Burstiness of frame losses means that
several frame losses occur in a row, making it hard for the frame loss concealment
method to use valid recently decoded signal portions for its operation. A state-of-the-art
indicator is the number $n_{burst}$ of observed frame losses in a row. This counter is
incremented by one upon each frame loss and reset to zero upon the reception of a valid
frame. This indicator is also used in the context of the present example embodiments of
the invention.
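The burst loss counter is simple enough to state directly in code. The following Python sketch uses a hypothetical class name and a made-up loss pattern purely for illustration.

```python
class BurstLossCounter:
    """Counts consecutive lost frames (the indicator n_burst)."""

    def __init__(self):
        self.n_burst = 0

    def update(self, frame_lost):
        """Call once per frame; returns the updated counter value."""
        if frame_lost:
            self.n_burst += 1   # one more loss in a row
        else:
            self.n_burst = 0    # a valid frame ends the burst
        return self.n_burst


counter = BurstLossCounter()
# True marks a lost frame, False a correctly received one.
history = [counter.update(lost) for lost in [False, True, True, True, False, True]]
```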
Adaptation of the frame loss concealment method
[0074] In case the steps carried out above indicate a condition suggesting an adaptation
of the frame loss concealment operation, the calculation of the spectrum of the substitution
frame is modified.
[0075] While the original calculation of the substitution frame spectrum is done according
to the expression $Z(m) = Y(m) \cdot e^{j\theta_k}$, now an adaptation is introduced
modifying both magnitude and phase. The magnitude is modified by means of scaling with
two factors $\alpha(m)$ and $\beta(m)$, and the phase is modified with an additive phase
component $\vartheta(m)$. This leads to the following modified calculation of the
substitution frame:

$$Z(m) = \alpha(m) \cdot \beta(m) \cdot Y(m) \cdot e^{j(\theta_k + \vartheta(m))}$$
[0076] It is to be noted that the original (non-adapted) frame loss concealment method
is used if $\alpha(m) = 1$, $\beta(m) = 1$, and $\vartheta(m) = 0$. These respective values
are hence the default.
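A minimal Python sketch of the adapted substitution-frame calculation follows, assuming the magnitude is scaled by the product of the two factors and the additional phase component is added to the original phase shift. The function name is hypothetical, and `theta[m]` is assumed to already hold the phase shift θk of the sinusoid to which bin m belongs; how bins are assigned to sinusoids is outside this sketch.

```python
import cmath


def substitution_spectrum(y_spec, theta, alpha=None, beta=None, dither=None):
    """Z(m) = alpha(m) * beta(m) * Y(m) * exp(j * (theta(m) + dither(m)))."""
    n_bins = len(y_spec)
    # Defaults reproduce the non-adapted method: alpha = beta = 1, dither = 0.
    alpha = [1.0] * n_bins if alpha is None else alpha
    beta = [1.0] * n_bins if beta is None else beta
    dither = [0.0] * n_bins if dither is None else dither
    return [alpha[m] * beta[m] * y_spec[m] * cmath.exp(1j * (theta[m] + dither[m]))
            for m in range(n_bins)]


# Example: three bins sharing one phase shift of 0.5 rad.
Y = [1 + 0j, 0 + 2j, -1 + 1j]
theta = [0.5, 0.5, 0.5]
Z_default = substitution_spectrum(Y, theta)
Z_adapted = substitution_spectrum(Y, theta, alpha=[0.5] * 3, dither=[0.1] * 3)
```

With the default arguments the magnitudes of `Y` are preserved and only the phases advance; the adapted call halves the magnitudes and adds 0.1 rad of extra phase.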
[0077] The general objective of introducing magnitude adaptations is to avoid audible
artifacts of the frame loss concealment method. Such artifacts may be musical or tonal
sounds or strange sounds arising from repetitions of transient sounds. Such artifacts
would in turn lead to quality degradations, the avoidance of which is the objective of
the described adaptations. A suitable way to achieve such adaptations is to modify the
magnitude spectrum of the substitution frame to a suitable degree.
[0078] Figure 12 illustrates an embodiment of concealment method modification. Magnitude
adaptation, 123, is preferably done if the burst loss counter $n_{burst}$ exceeds some
threshold $thr_{burst}$, e.g. $thr_{burst} = 3$, 121. In that case a value smaller than 1
is used for the attenuation factor, e.g. $\alpha(m) = 0.1$.
[0079] It has however been found that it is beneficial to perform the attenuation with gradually
increasing degree. One preferred embodiment which accomplishes this is to define a
logarithmic parameter specifying a logarithmic increase in attenuation per frame,
att_per_frame. Then, in case the burst counter exceeds the threshold, the gradually
increasing attenuation factor is calculated by

$$\alpha(m) = 10^{\,c \cdot att\_per\_frame \cdot (n_{burst} - thr_{burst})}$$
[0080] Here the constant $c$ is merely a scaling constant that allows the parameter
att_per_frame to be specified for instance in decibels (dB).
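The gradually increasing attenuation can be sketched as follows in Python. The description leaves the scaling constant open; this example assumes that att_per_frame is a positive attenuation in dB applied to amplitudes, which implies c = -1/20. That choice, the default values and the function name are assumptions for illustration.

```python
def burst_attenuation(n_burst, thr_burst=3, att_per_frame_db=3.0, c=-1.0 / 20.0):
    """Attenuation factor alpha for the current burst length.

    Returns 1.0 (no attenuation) until the burst counter exceeds the
    threshold; afterwards the attenuation grows by att_per_frame_db
    for every further lost frame.
    """
    if n_burst <= thr_burst:
        return 1.0
    return 10.0 ** (c * att_per_frame_db * (n_burst - thr_burst))


# Attenuation factors for bursts of 1 to 8 lost frames.
alphas = [burst_attenuation(n) for n in range(1, 9)]
```

Because the factor is the same for all DFT bins in this embodiment, a single scalar per frame suffices; it can be broadcast over `m` when the substitution spectrum is computed.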
[0081] An additional preferred adaptation is done in response to the indicator whether the
signal is estimated to be music or speech. For music content in comparison with speech
content it is preferable to increase the threshold $thr_{burst}$ and to decrease the
attenuation per frame. This is equivalent to performing the adaptation of the frame loss
concealment method to a lower degree. The background of this kind of adaptation is that
music is generally less sensitive to longer loss bursts than speech. Hence, the original,
i.e. the unmodified frame loss concealment method is still preferable for this case, at
least for a larger number of frame losses in a row.
[0082] A further adaptation of the concealment method with regard to the magnitude attenuation
factor is preferably done in case a transient has been detected based on the indicator
$R_{l/r,band}(k)$, or alternatively $R_{l/r}(m)$ or $R_{l/r}$, having passed a threshold, 122.
In that case a suitable adaptation action, 125, is to modify the second magnitude attenuation
factor $\beta(m)$ such that the total attenuation is controlled by the product of the two
factors $\alpha(m) \cdot \beta(m)$.
[0083] $\beta(m)$ is set in response to an indicated transient. In case an offset is detected,
the factor $\beta(m)$ is preferably chosen to reflect the energy decrease of the offset. A
suitable choice is to set $\beta(m)$ to the detected gain change:

$$\beta(m) = \sqrt{R_{l/r,band}(k)^{-1}}$$

for $m \in I_k$, $k = 1 \ldots K$.
[0084] In case an onset is detected, it has been found advantageous to limit the energy
increase in the substitution frame. In that case the factor can be set to some fixed
value of e.g. 1, meaning that there is no attenuation but not any amplification either.
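The transient-driven factor can be sketched per band as below. The sketch assumes that the "detected gain change" for an offset is the amplitude gain implied by the band energy ratio, i.e. the square root of its inverse, and that bands without a detected offset (including onsets) keep the default of 1. The function name and the example numbers are hypothetical.

```python
import math


def beta_from_band_ratios(ratios, band_edges, n_bins, offset_thr=10.0):
    """Per-bin factor beta(m) from band-wise left/right energy ratios.

    ratios[k-1] is the ratio of band k; band k covers the DFT bins
    band_edges[k-1] + 1 .. band_edges[k].
    """
    beta = [1.0] * n_bins  # default: no attenuation, no amplification
    for k, ratio in enumerate(ratios, start=1):
        if ratio > offset_thr:                 # offset detected in band k
            gain = math.sqrt(1.0 / ratio)      # amplitude gain of the energy drop
            for m in range(band_edges[k - 1] + 1, band_edges[k] + 1):
                beta[m] = gain
        # onset (ratio below the lower threshold) or no transient: keep 1.0
    return beta


# Band 1 (bins 1..2) shows an offset, band 2 (bins 3..4) an onset.
beta = beta_from_band_ratios([100.0, 0.01], band_edges=[0, 2, 4], n_bins=6)
```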
[0085] In the above it is to be noted that the magnitude attenuation factor is preferably
applied frequency selectively, i.e. with individually calculated factors for each
frequency band. In case the band approach is not used, the corresponding magnitude
attenuation factors can still be obtained in an analogous way. $\beta(m)$ can then be set
individually for each DFT bin in case frequency selective transient detection is used on
DFT bin level. Or, in case no frequency selective transient indication is used at all,
$\beta(m)$ can be globally identical for all $m$.
[0086] A further preferred adaptation of the magnitude attenuation factor is done in conjunction
with a modification of the phase by means of the additional phase component $\vartheta(m)$,
127. In case such a phase modification is used for a given $m$, the attenuation factor
$\beta(m)$ is reduced even further. Preferably, even the degree of phase modification is
taken into account. If the phase modification is only moderate, $\beta(m)$ is only scaled
down slightly, while if the phase modification is strong, $\beta(m)$ is scaled down to a
larger degree.
[0087] The general objective of introducing phase adaptations is to avoid too strong tonality
or signal periodicity in the generated substitution frames, which in turn would lead
to quality degradations. A suitable way to achieve such adaptations is to randomize or
dither the phase to a suitable degree.
[0088] Such phase dithering is accomplished if the additional phase component $\vartheta(m)$
is set to a random value scaled with some control factor:

$$\vartheta(m) = a(m) \cdot \mathrm{rand}(\cdot)$$
[0089] The random value obtained by the function rand(·) is for instance generated by some
pseudo-random number generator. It is here assumed that it provides a random number
within the interval $[0, 2\pi]$.
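The dithering step itself can be sketched with the Python standard library. The function name, the fixed seed and the example control factors are assumptions for illustration; the only property taken from the text is that the random value lies in the interval from 0 to 2π and is scaled by a per-bin control factor.

```python
import math
import random


def dither_phases(a, rng):
    """Additional phase per bin: control factor a(m) times a random
    value drawn uniformly from [0, 2*pi]."""
    return [a_m * rng.uniform(0.0, 2.0 * math.pi) for a_m in a]


rng = random.Random(1234)  # seeded only to make the example repeatable
# Control factors 0 (no dithering), 0.5 (moderate) and 1 (full dithering).
extra_phase = dither_phases([0.0, 0.5, 1.0], rng)
```

A control factor of 0 leaves the phase untouched, and a factor of 1 spreads the added phase over the full circle.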
[0090] The scaling factor $a(m)$ in the above equation controls the degree by which the
original phase $\theta_k$ is dithered. The following embodiments address the phase adaptation
by means of controlling this scaling factor. The control of the scaling factor is done in
a way analogous to the control of the magnitude modification factors described above.
[0091] According to a first embodiment, the scaling factor $a(m)$ is adapted in response to
the burst loss counter. If the burst loss counter $n_{burst}$ exceeds some threshold
$thr_{burst}$, e.g. $thr_{burst} = 3$, a value larger than 0 is used, e.g. $a(m) = 0.2$.
[0092] It has however been found that it is beneficial to perform the dithering with gradually
increasing degree. One preferred embodiment which accomplishes this is to define a
parameter specifying an increase in dithering per frame, dith_increase_per_frame. Then, in
case the burst counter exceeds the threshold, the gradually increasing dithering
control factor is calculated by

$$a(m) = dith\_increase\_per\_frame \cdot (n_{burst} - thr_{burst})$$
[0093] It is to be noted in the above formula that $a(m)$ has to be limited to a maximum
value of 1, for which full phase dithering is achieved.
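A short Python sketch of the gradually increasing dithering control factor, including the limitation to 1, is given below. It assumes a linear increase per lost frame beyond the threshold; the function name and the default values (threshold 3, increase 0.25 per frame) are illustrative assumptions.

```python
def dither_control(n_burst, thr_burst=3, dith_increase_per_frame=0.25):
    """Dithering control factor a for the current burst length.

    Zero until the burst counter exceeds the threshold, then growing
    linearly per lost frame and limited to 1 (full phase dithering).
    """
    if n_burst <= thr_burst:
        return 0.0
    return min(1.0, dith_increase_per_frame * (n_burst - thr_burst))


# Control factors for bursts of 2 to 9 lost frames.
values = [dither_control(n) for n in range(2, 10)]
```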
[0094] It is to be noted that the burst loss threshold value $thr_{burst}$ used for initiating
phase dithering may be the same threshold as the one used for magnitude attenuation.
However, better quality can be obtained by setting these thresholds to individually
optimal values, which generally means that these thresholds may be different.
[0095] An additional preferred adaptation is done in response to the indicator whether the
signal is estimated to be music or speech. For music content in comparison with speech
content it is preferable to increase the threshold $thr_{burst}$, meaning that phase
dithering for music as compared to speech is done only in case of more lost frames in a
row. This is equivalent to performing the adaptation of the frame loss concealment method
for music to a lower degree. The background of this kind of adaptation is that music is
generally less sensitive to longer loss bursts than speech. Hence, the original, i.e.
unmodified frame loss concealment method is still preferable for this case, at least for
a larger number of frame losses in a row.
[0096] A further preferred embodiment is to adapt the phase dithering in response to a detected
transient. In that case a stronger degree of phase dithering can be used for the DFT
bins $m$ for which a transient is indicated, either for that bin, for the DFT bins of the
corresponding frequency band, or for the whole frame.
[0097] Part of the schemes described address optimization of the frame loss concealment
method for harmonic signals and particularly for voiced speech.
[0098] In case the methods using an enhanced frequency estimation as described above are
not realized, another adaptation possibility for the frame loss concealment method
that optimizes the quality for voiced speech signals is to switch to some other frame
loss concealment method that is specifically designed and optimized for speech rather
than for general audio signals containing music and speech. In that case, the indicator
that the signal comprises a voiced speech signal is used to select another speech-optimized
frame loss concealment scheme rather than the schemes described above.
[0099] The embodiments apply to a controller in a decoder, as illustrated in Figure 13.
Figure 13 is a schematic block diagram of a decoder according to the embodiments.
The decoder 130 comprises an input unit 132 configured to receive an encoded audio signal.
The figure illustrates the frame loss concealment by a logical frame loss concealment
unit 134, which indicates that the decoder is configured to implement a concealment of a
lost audio frame according to the above-described embodiments. Further, the decoder
comprises a controller 136 for implementing the embodiments described above. The
controller 136 is configured to detect conditions in the properties of the previously
received and reconstructed audio signal or in the statistical properties of the observed
frame losses for which the substitution of a lost frame according to the described methods
provides relatively reduced quality. In case such a condition is detected, the controller
136 is configured to modify the element of the concealment methods according to which
the substitution frame spectrum is calculated by $Z(m) = Y(m) \cdot e^{j\theta_k}$, by
selectively adjusting the phases or the spectrum magnitudes. The detection can be
performed by a detector unit 146 and the modifying can be performed by a modifier unit
148, as illustrated in Figure 14.
[0100] The decoder with its included units could be implemented in hardware. There are
numerous variants of circuitry elements that can be used and combined to achieve the
functions of the units of the decoder. Such variants are encompassed by the embodiments.
Particular examples of hardware implementation of the decoder are implementation in
digital signal processor (DSP) hardware and integrated circuit technology, including
both general-purpose electronic circuitry and application-specific circuitry.
[0101] The decoder 150 described herein could alternatively be implemented e.g. as illustrated
in Figure 15, i.e. by one or more of a processor 154 and adequate software 155 with
suitable storage or memory 156 therefor, in order to reconstruct the audio signal, which
includes performing audio frame loss concealment according to the embodiments described
herein, as shown in Figure 13. The incoming encoded audio signal is received by an input
(IN) 152, to which the processor 154 and the memory 156 are connected. The decoded and
reconstructed audio signal obtained from the software is output from the output (OUT) 158.
[0102] The technology described above may be used e.g. in a receiver, which can be used
in a mobile device (e.g. mobile phone, laptop) or a stationary device, such as a personal
computer.
[0103] It is to be understood that the choice of interacting units or modules, as well as
the naming of the units, is only for exemplary purposes, and that the units may be
configured in a plurality of alternative ways in order to be able to execute the disclosed
process actions.
[0104] It should also be noted that the units or modules described in this disclosure are
to be regarded as logical entities and not with necessity as separate physical entities.
It will be appreciated that the scope of the technology disclosed herein fully encompasses
other embodiments which may become obvious to those skilled in the art, and that the
scope of this disclosure is accordingly not to be limited.
[0105] Reference to an element in the singular is not intended to mean "one and only one"
unless explicitly so stated, but rather "one or more."
[0106] Moreover, it is not necessary for a device or method to address each and every problem
sought to be solved by the technology disclosed herein, for it to be encompassed hereby.
[0107] In the preceding description, for purposes of explanation and not limitation, specific
details are set forth such as particular architectures, interfaces, techniques, etc.
in order to provide a thorough understanding of the disclosed technology. However,
it will be apparent to those skilled in the art that the disclosed technology may
be practiced in other embodiments and/or combinations of embodiments that depart from
these specific details. That is, those skilled in the art will be able to devise various
arrangements which, although not explicitly described or shown herein, embody the
principles of the disclosed technology. In some instances, detailed descriptions of
well-known devices, circuits, and methods are omitted so as not to obscure the description
of the disclosed technology with unnecessary detail. All statements herein reciting
principles, aspects, and embodiments of the disclosed technology, as well as specific
examples thereof, are intended to encompass both structural and functional equivalents
thereof. Additionally, it is intended that such equivalents include both currently
known equivalents as well as equivalents developed in the future, e.g. any elements
developed that perform the same function, regardless of structure.
[0108] Thus, for example, it will be appreciated by those skilled in the art that the figures
herein can represent conceptual views of illustrative circuitry or other functional
units embodying the principles of the technology, and/or various processes which may
be substantially represented in computer readable medium and executed by a computer
or processor, even though such computer or processor may not be explicitly shown in
the figures.
[0109] The functions of the various elements including functional blocks may be provided
through the use of hardware such as circuit hardware and/or hardware capable of executing
software in the form of coded instructions stored on computer readable medium. Thus,
such functions and illustrated functional blocks are to be understood as being either
hardware-implemented and/or computer-implemented, and thus machine-implemented.
[0110] The embodiments described above are to be understood as a few illustrative examples
of the present invention. It will be understood by those skilled in the art that various
modifications, combinations and changes may be made to the embodiments without departing
from the scope of the present invention. In particular, different part solutions in
the different embodiments can be combined in other configurations, where technically
possible.