[0001] The present invention relates to speech coders employing analysis-by-synthesis techniques,
and more particularly to a coder for low-bit-rate applications, preferably at the
lowest limits of the range of rates for which the above-mentioned coders can be used
with good performance, e.g. rates within the 4 - 8 kbit/s range.
[0002] An example of this type of application is the speech coder to be used
for the so-called half-rate channel of the European mobile radio system.
[0003] In coders using analysis-by-synthesis techniques, for each block of speech signal
samples to be coded, the excitation signal for the synthesis filter simulating the
speech production apparatus is chosen within a set of excitation signals so as to
minimize a perceptually meaningful measure of distortion. This is commonly obtained
through the comparison of the synthesized samples and of the corresponding samples
of the original signal and the simultaneous weighting, in a suitable filter, with
a function that takes into account how human perception evaluates the resulting distortion.
[0004] In its most general form, the synthesis filter includes a cascade of two elements
that impose short-term and long-term spectral features, respectively, on the excitation
signal. The former ones are linked to the correlation among subsequent samples, which
generates a non-flat spectral envelope, and the latter ones are linked to the correlation
between adjacent pitch periods, on which the fine signal spectral structure depends.
With such a scheme, the coded signal includes information relating to excitation and
to short-term synthesis parameters (short-term linear prediction coefficients or other
quantities related to them) and long-term ones (long-term delay and linear prediction
coefficients).
[0005] The insertion of long-term features into the coded signal greatly enhances natural
sounding of the signal, especially if the delay is updated at each subframe during
the analysis-by-synthesis cycle; however, the related information would require most
of the bits available for coding. Especially in case of low-bit-rate applications,
it is therefore particularly interesting to search for solutions that enable a reduction
of the amount of information to be transmitted to the decoder, while preserving signal
quality.
[0006] In the paper "Generalized analysis-by-synthesis coding and its application to pitch
prediction" presented by W.B. Kleijn, R.P. Ramachandran and P. Kroon at the ICASSP
92 Conference, San Francisco (California, USA), March 23-26 1992, paper I-337, it
is suggested for this purpose to carry out a long-term analysis delay interpolation,
the delay being updated at each frame. Direct interpolation without adequate safeguards
would yield delay values that are not optimal and would cause time
misalignments between the long-term spectral features of the original signal and those of the
synthesized signal, which generate significant distortion.
[0007] To avoid these drawbacks, the paper suggests modifying the original signal so
that long-term predictor parameters become known functions of time and allow direct
interpolation without degrading performance. The suggested modifications consist of
limited time oscillations and small amplitude scalings of the original signal. Time
oscillations can be carried out in discrete manner. The need for inserting these time
oscillations, and therefore for setting an optimal amount thereof, obviously increases
the coder complexity.
[0008] To solve this problem, according to the present invention, therefore, a coding system
is provided in which, before long-term analysis, discrete time shifts are introduced
on the residual signals and in which the search for optimal excitation signal and
optimal shift is carried out so as to reduce complexity of computations.
The invention characteristics are disclosed in the appended claims.
[0009] A preferred embodiment of the invention will now be described, with reference to
the enclosed drawings, in which:
- Fig. 1 is a block diagram of the coder;
- Fig. 2 is a functional diagram of some blocks of the coder;
- Fig. 3 is a block diagram of the decoder.
[0010] Before describing in detail the coder/decoder structure, the principles on which
it is based will be summarized. The coder receives samples x(n) of the speech signal
to be coded, grouped into blocks (commonly called 'frames') including a fixed number
Lf of contiguous samples. Every frame of Lf samples is then divided into subframes
of Ls contiguous samples. The coder must determine a set of parameters to be transmitted
to the decoder so that the decoder is able to synthesize a signal that approximates
the original signal. To achieve this, an analysis-by-synthesis procedure is used,
through which the coder analyzes the effects of the possible values of each parameter
and chooses the value that enables obtaining the best approximation of the original
signal. For this purpose, the coder will contain a replica of the decoder to produce,
for each of said values, the corresponding output signal. To generate these output
signals, both long-term and short-term correlations of the speech signal are exploited,
imposed on an excitation signal through respective synthesis filters. At each frame,
the coder carries out a linear prediction analysis (short-term or LPC analysis) and
computes the short-term residual signal, that is used to compute parameters (delay
and coefficient) of the long-term synthesis filter. (The coefficient is unique in
the preferred embodiment, since a first-order filter is used). To improve the resolution
of long-term-correlation information, both the delay and the coefficient are interpolated
when the delays of the current frame and the previous frame are close in value. To
reduce the effects of time mismatches between the original signal and the reconstructed
one, at each subframe small time shifts can be introduced in the original speech signal:
the shift amount is determined through an exhaustive search in a range of possible
values so as to minimize the energy of the error (difference between original signal
and reconstructed signal). After having determined the optimal shift, the search for
the optimal excitation signal is carried out.
[0011] In the following, to make the description clearer, the possible excitation signals
will be considered as words chosen in a certain codebook, that is, reference is made
to a type of coder known as CELP (Codebook Excited Linear Prediction), even if, as
it will be seen, every word is made up of an extremely small number of pulses (preferably
1 or 2) with deterministically predefined amplitudes and positions, and the codebook
is not stored.
[0012] The coded signal will include information related to short-term and long-term synthesis
filter parameters and to the optimal excitation, transmitted as usual in the form
of suitably coded indexes.
[0013] In the decoder, starting from these indexes, an excitation signal corresponding to
the one used by the coder will be retrieved and filtered in the chain of a long-term
synthesis filter and a short-term synthesis filter to provide a reconstructed signal
that can still be subjected to further filtering (post-filtering), based for example
on short-term synthesis parameters, to improve the subjective signal quality. The
reconstructed signal is then converted again into analogue form and supplied to utilization
devices.
[0014] By way of example, in the following description reference will be made to frames
of length Lf = 160 samples (which, with an 8-kHz sampling frequency, correspond to
a speech signal segment of length T = 20 ms), divided into 8 subframes of length
Ls = 20 samples. For reasons related to the introduction of time shifts, it is necessary
to have available, in addition to the Lf samples of a frame, a group of H+K samples
of the following frame (e.g. H = 24, K = 8).
[0015] With reference to Fig. 1, the input signal samples x(n) present on a connection 1
are temporarily stored in a buffer MT arranged to store Lf+H+K samples, and every T ms
a block of Lf samples is written and read. Samples read from MT are supplied to a
high-pass filter FPA, whose task is to remove d.c. drifts and low-frequency noise, and
the filtered signal x_f(n) is supplied to short-term analysis circuits STA and to a
linear prediction filter LPC.
[0016] Circuits STA determine, for each frame, a set of P linear prediction coefficients
a_i (e.g. P = 10), convert these coefficients into a group of parameters in the frequency
domain, commonly known as LSP (Line Spectrum Pairs), and carry out a quantization,
for example a scalar one, of the differences between adjacent parameters. Indexes
j(f), which are part of the coded signal, are transmitted to the decoder through a
connection 2a after binary coding in circuits that are not shown. Conversion into
line spectrum pairs is desirable since, as is well known, line spectrum pairs have better
properties than the coefficients as regards quantization, interpolation and checking of
synthesis filter stability. Before computing line spectrum pairs, block STA also smooths
the spectrum information related to formants, to match it to the resolution of the
quantization circuit. This is accomplished by multiplying each computed coefficient a_i
by a respective factor g₁^i, whose value is typically less than 1 but quite near 1.
This operation reduces the risk, in case of particularly narrow formants, of reproducing
after quantization formants that are equally narrow but shifted with respect to the
original ones, and therefore removes a possible cause of degradation of coded signal quality.
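The smoothing step described above can be illustrated with a minimal sketch. It is not taken from the appendix listing: the function name `smooth_lpc` and the value g1 = 0.994 are assumptions chosen for illustration, and the factor is read here as g₁ raised to the power i.

```python
def smooth_lpc(a, g1=0.994):
    """Scale coefficient a_i (i = 1..P) by g1**i to widen narrow formants.

    g1 = 0.994 is an assumed value; the text only states that the factor
    is less than 1 but quite near 1.
    """
    return [a_i * g1 ** i for i, a_i in enumerate(a, start=1)]


# Hypothetical coefficients a_1..a_3, for illustration only.
smoothed = smooth_lpc([1.2, -0.8, 0.3])
```

Higher-order coefficients are attenuated slightly more than lower-order ones, which is what broadens the spectral peaks.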
[0017] The circuit STA computes coefficients a_i according to the classical autocorrelation
method, as described in "Digital Processing of Speech Signals" by L.R. Rabiner and
R.W. Schafer (Prentice-Hall, Englewood Cliffs, N.J., USA, 1978), p. 401. For the
computation, STA operates on a set of Lf+P input samples (in particular, the samples that
occupy the last Lf+P positions in MT), obtained through a trapezoidal window that applies
the maximum weight (in particular 1) to all samples except the first and the last P ones,
whose weights are determined by simple linear interpolation between the minimum and the
maximum weight: in this way the smoothing that the autocorrelation method requires to
provide good results is limited to the overlapping area between contiguous windows.
The forward positioning of the window also takes into account the fact that, when
coding the initial subframes of a frame (e.g. the first 3), the linear prediction
coefficients computed for the frame itself are replaced by coefficients obtained
by converting line spectrum pair values determined through interpolation between
the values of the previous frame and those of the current frame. This
ensures a gradual transition between previous frame parameters and current frame parameters.
The window thus spans the current frame and the subsequent frame, in the sense that it
comprises samples of both frames without, however, having to comprise two full frames.
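The trapezoidal window can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the minimum weight is assumed to be 0, and the ramps are assumed to stop one step short of both extremes; the text fixes only the flat part at weight 1 and the linear ramps over the first and last P samples.

```python
def trapezoidal_window(lf=160, p=10, w_min=0.0, w_max=1.0):
    """Return Lf + P weights with linear ramps on the first and last P samples."""
    n_total = lf + p
    # Rising ramp: P values strictly between w_min and w_max (assumed spacing).
    ramp = [w_min + (w_max - w_min) * (k + 1) / (p + 1) for k in range(p)]
    flat = [w_max] * (n_total - 2 * p)
    return ramp + flat + ramp[::-1]


window = trapezoidal_window()
```

With Lf = 160 and P = 10 the window has 170 weights, of which 150 are equal to 1, so only the 10-sample edges are attenuated.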
[0018] The transformation of linear prediction coefficients into line spectrum pairs is
carried out, for example, in the way described by P. Kabal and R.P. Ramachandran in
the article "The computation of line spectral frequencies using Chebyshev polynomials",
IEEE Transactions on Acoustics, Speech, and Signal Processing, December 1986.
[0019] The operations of STA are typical of any linear prediction coder, and therefore a
more detailed description is not necessary.
[0020] The indexes j(φ) are also supplied to a linear prediction coefficient reconstructing
circuit STR1 that supplies filter LPC, short-term synthesis filters STS1, STS' and
spectral weighting filters SW, SW' with quantized values of the coefficients, obtained
by applying the inverse of the procedures used to transform the coefficients
into line spectrum pairs. STR1 also computes the interpolated values to be used in the
first three subframes. For simplicity, in the following the quantized values are also
designated a_i.
[0021] The filter LPC receives the filtered speech signal samples x_f(n) and filters them
according to the conventional function

$$1 - A(z), \qquad A(z) = \sum_{i=1}^{P} a_i z^{-i}$$

generating the short-term prediction residual r_s(n), which is supplied both to a low-pass
filter FPB, which produces a filtered residual signal r_f(n), and to time shift circuits TS,
which produce a modified residual signal r_m(n). As is well known, low-pass filtering
facilitates the operations of the following long-term analysis circuits LTA.
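A minimal sketch of the short-term residual computation follows. It assumes the sign convention implied by the synthesis filter 1/(1-A(z)) quoted later in the description, and it assumes a zero signal history before the first sample; the function name is hypothetical.

```python
def short_term_residual(x, a):
    """Compute r_s(n) = x_f(n) - sum_{i=1..P} a_i * x_f(n - i).

    Samples before the start of x are taken as zero (an assumption; a real
    coder would keep the last P samples of the previous frame).
    """
    p = len(a)
    residual = []
    for n in range(len(x)):
        prediction = sum(a[i - 1] * x[n - i] for i in range(1, p + 1) if n - i >= 0)
        residual.append(x[n] - prediction)
    return residual
```

For a first-order predictor with a_1 = 1, a ramp input leaves a constant residual, which shows how the filter removes the correlation between successive samples.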
[0022] The circuits LTA must determine, at each frame, and supply to a following long-term
synthesis filter LTS1, the delay d (pitch period) with which a sample of an excitation
signal is used to generate a reconstructed signal, and the gain or coefficient b with which
said sample is weighted.
[0023] The block LTA computes the delay d by maximizing the autocorrelation function

where k can vary between the minimum and the maximum value allowed for the delay
d (e.g., 20 and 120), and x is a preset number that sets the length of the window
taken into account for the calculation so that a satisfactory value of d is obtained.
Considering that the window must include the most recent samples, as
already said, its length is a compromise between two opposing needs: the longer the
window, the more accurate the evaluation; on the other hand, the shorter the window,
the closer its center is to the end of the frame to be coded (Lf samples), so that
the value obtained refers to a point near that end, which is what interpolation requires.
For example, x can be K. In the preferred embodiment, the delay is never less than
the length of a subframe, which considerably simplifies subsequent operations.
The value computed with (1) can also be subjected to corrections, examined
later, aimed at making the evolution of d as smooth as possible and at compensating
for synchronism losses due to the time shift.
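The delay search can be sketched as an exhaustive maximization of the autocorrelation over the allowed lags. The placement and length of the analysis window are passed in as parameters because they are assumptions here; in the coder they are fixed by equation (1). The function name is hypothetical.

```python
def find_delay(r_f, start, length, d_min=20, d_max=120):
    """Return the lag k in [d_min, d_max] maximizing sum r_f(n) * r_f(n - k).

    The analysis window covers samples start .. start + length - 1.
    Ties are resolved in favour of the smallest lag (an assumption).
    """
    if start < d_max:
        raise ValueError("need at least d_max samples of history before start")
    best_k, best_r = d_min, float("-inf")
    for k in range(d_min, d_max + 1):
        r_k = sum(r_f[n] * r_f[n - k] for n in range(start, start + length))
        if r_k > best_r:
            best_k, best_r = k, r_k
    return best_k
```

On a pulse train with a period of 40 samples, lags 40, 80 and 120 give the same correlation, and the sketch returns 40 because it keeps the first maximum found.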
[0024] The value of coefficient b is determined so as to minimize the energy of the error
signal r_l(n), given by the equation

For the value d of the delay to be used for the current frame, b is given by the equation

where E(r_f) indicates the energy

A minimum and a maximum, 0 and 1 respectively, are also set for the value of b.
Values less than 0 are excluded because they would correspond to a sign inversion
of the signal, which would also require a sign bit to be transmitted, while values
greater than 1 make the filter unstable, as is well known. The value of b computed using
(2) can also be subjected to corrections aimed at guaranteeing the best quality of
the coded signal. Furthermore, in certain frames, instead of the values d and b computed
with (1) and (2), it is possible to use values obtained by linear interpolation between
values computed for the previous frame and values computed for the current frame.
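The minimization that yields b can be worked out as follows. This is a sketch under an assumption: the error signal is taken to have the usual first-order form, and the sums run over the analysis window, whose limits are not assumed here.

```latex
\begin{aligned}
r_l(n) &= r_f(n) - b\, r_f(n-d) \\
\sum_n r_l^2(n) &= \sum_n r_f^2(n) - 2b \sum_n r_f(n)\, r_f(n-d) + b^2 \sum_n r_f^2(n-d) \\
\frac{\partial}{\partial b} \sum_n r_l^2(n) = 0
  \;&\Rightarrow\; b = \frac{\sum_n r_f(n)\, r_f(n-d)}{\sum_n r_f^2(n-d)}
\end{aligned}
```

Under this assumption the numerator is the autocorrelation at lag d and the denominator is the energy of the delayed signal, which matches the roles the text assigns to the autocorrelation function and to E(r_f).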
[0025] Together with d and b, the prediction gain G is also computed:
this quantity is the ratio between the energies of the input and output
signals of the long-term predictor and gives a measure of long-term prediction efficiency.
Gain G is defined by the expression

where
1 - bR[r_f(d)]/E'(r_f)

Gain G makes it possible to establish whether the speech segment being coded is voiced,
which is indicated by values of G and b that are both greater than respective thresholds
G_thr, b_thr. In case of a voiced sound, LTA generates a flag V that is used in deciding
whether to carry out the interpolation and to introduce the time shift.
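The computation of b, G and the voicing flag can be sketched as follows. Several points are assumptions rather than statements of the method: G is read as the reciprocal of the quantity 1 - bR[r_f(d)]/E'(r_f), E'(r_f) is read as the energy of the undelayed signal over the window, and the thresholds 0.5 and 2.0 are placeholder values. All function names are hypothetical.

```python
def long_term_gain(r_f, d, start, length):
    """Return (b, G) for delay d over the window start .. start + length - 1."""
    idx = range(start, start + length)
    corr = sum(r_f[n] * r_f[n - d] for n in idx)      # autocorrelation at lag d
    e_delayed = sum(r_f[n - d] ** 2 for n in idx)     # energy of delayed signal
    e_current = sum(r_f[n] ** 2 for n in idx)         # energy of current signal
    if e_delayed == 0.0 or e_current == 0.0:
        return 0.0, 1.0                               # no prediction possible
    b = min(max(corr / e_delayed, 0.0), 1.0)          # b limited to [0, 1]
    denom = 1.0 - b * corr / e_current
    gain = 1.0 / denom if denom > 0.0 else float("inf")
    return b, gain


def is_voiced(b, gain, b_thr=0.5, g_thr=2.0):
    """Voiced when both b and G exceed their thresholds (assumed values)."""
    return b > b_thr and gain > g_thr
```

A signal that repeats exactly with a scale factor of 0.5 every d samples gives b = 0.5 and an unbounded gain, since the predictor removes all of its energy.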
[0026] A first correction for delay d is based on also searching for the local maximum of
function (1) in a given neighborhood (e.g., ±15%) of the value obtained at the previous
frame: if this local maximum differs from the main maximum by less than a certain limit,
the new value is used, since it provides a smoother profile
that can therefore be interpolated. This secondary search is carried out only if the
signal in the previous frame was strongly voiced and had been subjected to interpolation.
Moreover, the correction, if any, is carried out before computing b and G, so that
the already corrected value of d is used for these computations.
[0027] A second correction is linked to the presence of the time shift mechanism, which inserts
a variable delay whose effects can be compared to those of a non-synchronous operation
of the coder. To try to recover synchronism, the value of d computed by LTA
and possibly corrected as said before is changed by adding to it a corrective term
d' linked to the amount of the shift itself and given by the expression

where ĥ is the shift accumulated up to that frame, expressed as a number of samples of
the residual signal upsampled by a factor Γ, while d and Lf have the meaning given
before. Upsampling will be discussed in greater detail with reference to circuits
TS. It means that the samples obtained by sampling the original speech signal at a
first sampling rate are in turn resampled at a higher rate. Thus, if
samples obtained by 8 kHz sampling are themselves sampled at 64 kHz, each 8 kHz
sample will give rise to eight samples at 64 kHz. The correction can be carried out if
interpolation is required in the current frame and if the speech segment is not voiced.
The first condition is necessary since, if interpolation is absent, no shift is
carried out; moreover, the signal must not be voiced because in that situation
even a minimal modification of d with respect to the exact value can usually be perceived.
Before adding the corrective term to d, its absolute value is limited to a maximum
value |d'|_max, for example 1. Furthermore, the correction is carried out only if it does
not modify the decision about interpolation (described later) and does not
take the value of d outside the provided range of values.
[0028] As regards b, a first correction consists of clipping b to a first upper limit b₁,
since, if b is too high, an excessive energy increase would occur, which gives rise
to noise. Limit b₁ is linked to the ratio between the energies in a pitch period of the
current frame and of the previous one, and is given by the expression

where E''(r_f) denotes the quantity

which is the energy in a pitch period d, and indexes 0, -1 denote the current and
previous frames, respectively. The correction is carried out if the energy in the
previous frame exceeds a certain threshold.
[0029] A further limitation of b is applied in case of low values of G (less than G_thr),
which indicate speech segments with low periodicity, while b is relatively high (greater
than a second limit b₂): in this case, the value b₂ is employed, since employing the
actual value could produce artifacts in the coded signal.
[0030] As regards interpolation, this is carried out if the relative variation of d between
two consecutive frames does not exceed, as absolute value, a predetermined amount
(e.g., 15%) and if the values of b in these frames are both positive. The actual computation
of the values of d and b to be used in case of interpolation is carried out in the
long-term synthesis filter LTS1, to which LTA sends a flag F when the above mentioned
conditions are verified. The same flag is also supplied to an error energy minimizing
circuit EM determining the optimal time shift and excitation. Information about interpolation
is also required by the synthesis filter in the decoder; however, it is not necessary
to transmit it, since it can be immediately recreated in that filter, by the comparison
between the values of d and b related to two frames, exactly as in the coder.
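The interpolation decision, which the decoder can repeat without any transmitted flag, can be sketched as below. The function name is hypothetical, and measuring the relative variation with respect to the previous-frame delay is an assumption; the 15% limit is the example value given in the text.

```python
def interpolation_flag(d_prev, b_prev, d_cur, b_cur, max_rel_change=0.15):
    """Return flag F: small relative change of d and both b values positive."""
    rel_change = abs(d_cur - d_prev) / d_prev
    return rel_change <= max_rel_change and b_prev > 0.0 and b_cur > 0.0
```

Because the function depends only on the reconstructed values of d and b for two consecutive frames, coder and decoder reach the same decision as long as both use the quantized values.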
[0031] The values of d and b determined at each frame are converted as usual into the respective
indexes j(d), j(b), which are the information related to long-term analysis to be inserted
into the coded signal, and which are transmitted to the decoder, after suitable coding,
through connections 2b, 2c. Index j(b) is determined through a quantization operation
during which, in addition to limiting the maximum value to 1, values of b that are
less than half of the first quantized value are forced to 0. No quantization of d
is necessary, since d is already a discrete quantity; it is however preferable
to transmit d in the form of an index for the sake of uniformity with the other information.
The conversion of the values of d into indexes in practice consists of shifting them
so that the possible range of values begins from 1 instead of from a value d_min.
In the described example (101 values of d and j(d)), 7 bits are necessary to
code index j(d), and these bits also allow coding of values of j(d) outside the
provided range. One of these further values (e.g., value 127) is used to signal the forcing
of b to 0 and is supplied to the decoder in place of the index j(d) corresponding to
the actual value of d, since, if b = 0, the long-term synthesis filter does not
contribute to the reconstructed signal and delay information is useless. In addition
to the information about forcing of b to 0, however, the index j(b) corresponding to the
minimum value of b is transmitted.
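The mapping of the delay to its index can be sketched as follows, using the example values of the text (delays 20 to 120, spare index 127). The function and constant names are hypothetical.

```python
D_MIN, D_MAX = 20, 120   # example delay range from the text (101 values)
J_B_ZERO = 127           # spare 7-bit index signalling that b was forced to 0


def encode_delay(d, b_is_zero):
    """Map delay d to index j(d) in 1..101, or to 127 when b is forced to 0."""
    if b_is_zero:
        return J_B_ZERO
    if not D_MIN <= d <= D_MAX:
        raise ValueError("delay outside the allowed range")
    return d - D_MIN + 1
```

All returned values fit in 7 bits, and the 101 legitimate indexes leave the values above 101 free for signalling.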
[0032] To simplify, circuits generating indexes j(b), j(d) are included into block LTA.
[0033] It must be noted that the correction of d to take into account possible shifts is
carried out after the corrections of b, since only depending on the corrected values
of b, circuit LTA can take decisions related to the sound nature and the need to carry
out interpolation and therefore shift.
[0034] The operations performed by LTA are described in detail in the appendix, which includes
a program listing in C language. Given the listing, a person skilled in the art has no
problem in designing devices that perform the described functions.
[0035] Indexes j(d), j(b) are reconverted into quantized or reconstructed values of the
respective parameters by reconstructing circuits LTR1, composed of simple read-only
memories addressed by the indexes. During this reconstruction, LTR1 provides the actual
values of d, b if j(d) has a value allowed for the delay (that is, if j(d) is in
the range 1 to 101). If j(d) has any one of the values outside the allowed range
(that is, from 102 to 127), LTR1 provides value 0 for b and value d_min for d.
The fact that, when reconstructing the parameters, all indexes j(d) not corresponding
to a value allowed for the delay, and not only the one actually used for this purpose,
are interpreted as indicating the forcing of b to 0 allows the value b = 0 to be
reconstructed even in case of errors on the least significant bits of that index. In any
case, if the reconstruction of b = 0 should fail, circuits LTR1 generate the minimum
value of b, since they have the corresponding index j(b) at their disposal. For simplicity,
in the following, reconstructed (or quantized) values will also be denoted by b, d.
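The reconstruction rule applied by LTR1 can be sketched as follows. The names are hypothetical, and `b_from_index` stands for the value already looked up from j(b); the sketch only illustrates why an error on the low-order bits of the spare index still yields b = 0.

```python
D_MIN, N_DELAYS = 20, 101   # example values from the text


def decode_long_term(j_d, b_from_index):
    """Return (d, b); any j(d) outside 1..101 is read as 'b forced to 0'."""
    if 1 <= j_d <= N_DELAYS:
        return D_MIN + j_d - 1, b_from_index
    return D_MIN, 0.0
```

For instance, if the transmitted value 127 is received as 126 because of an error on the least significant bit, the result is still d = d_min and b = 0.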
[0036] The long-term synthesis filter LTS1 generates a reconstructed short-term residual
signal s_s(n), by filtering according to the conventional function

$$\frac{1}{1 - b\,z^{-d}}$$

an excitation signal s₁(n). The latter is composed of a shape information (innovation),
represented by one of the words s(n) of an innovation codebook IC1, of a positive
or null amplitude parameter g (innovation gain), chosen in a codebook of innovation
gains IG1, and of a sign information, represented by a parameter σ (innovation sign)
whose value is ±1. Signal s₁(n) is therefore given by

$$s_1(n) = \sigma\, g\, s(n)$$

and is obtained through a multiplier M1. For simplicity, we suppose that parameter
σ is also read from codebook IG1. Even if, to facilitate understanding, codebooks IC1, IG1
are represented as circuit blocks (which could suggest memories that contain
them), as said above, the particular structure of the innovation codebook makes their
storage superfluous. The structure of the innovation and gain codebooks will be examined
later.
[0037] In order to obtain a sample of the reconstructed residual s_s(n), LTS1 must weight
with the factor b the sample related to instant n-d. If no interpolation has to be
performed, operation of LTS1 is quite conventional. In
case of interpolation, the values of d and b are computed sample by sample according
to the equations, with n = 0...Lf-1,

and

Symbols d0, b0 denote the values related to the current frame, d(-1), b(-1) those
related to the previous frame. The interpolation is therefore a linear one and extends
over a whole frame. The values of d(n) and b(n) then vary sample by sample. As regards
d(n), it will generally not be an integer: this means that the value of signal
s_s(n) at the continuous time instant n-d(n) does not coincide with that of an actually
available sample and must be evaluated: according to the invention, evaluation is
performed through a second order polynomial interpolation (that is, through a parabola)
centered about the discrete time instant that is nearest to n-d(n); the value thus
evaluated is then multiplied by the interpolated value b(n).
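The interpolated operation of the long-term synthesis filter can be sketched as follows. The exact interpolation weights are an assumption (here the current-frame values are reached at the last sample of the frame), the function names are hypothetical, and the history buffer is assumed to hold at least the maximum delay plus two past samples.

```python
def parabolic_sample(s, t):
    """Estimate s at real-valued time t with a parabola through the three
    samples centred on the integer nearest to t."""
    m = int(round(t))
    u = t - m                                  # offset, between -0.5 and 0.5
    s_m1, s_0, s_p1 = s[m - 1], s[m], s[m + 1]
    return s_0 + 0.5 * u * (s_p1 - s_m1) + 0.5 * u * u * (s_p1 - 2.0 * s_0 + s_m1)


def long_term_synthesis(excitation, history, d_prev, b_prev, d_cur, b_cur):
    """Filter one frame with d(n) and b(n) interpolated linearly over the frame."""
    lf = len(excitation)
    s = list(history)                          # past reconstructed residual
    for n in range(lf):
        w = (n + 1) / lf                       # assumed interpolation weight
        d_n = d_prev + w * (d_cur - d_prev)
        b_n = b_prev + w * (b_cur - b_prev)
        past = parabolic_sample(s, len(s) - d_n)
        s.append(excitation[n] + b_n * past)
    return s[len(history):]
```

Since the delay is never shorter than a subframe, the three samples used by the parabola always lie in the already reconstructed past, so no recursion on the sample being computed is needed.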
[0038] The interpolation procedure adopted has a far lower computational complexity
than more sophisticated interpolation methods based on signal filtering. However,
its effect is essentially a low-pass one, which is useful for the good operation of
the coder since it prevents the reconstructed signal from having too marked periodicity
properties.
[0039] The reconstructed short-term residual s_s(n) is supplied to the short-term synthesis
filter STS1, whose transfer function is 1/(1-A(z)). This filter generates the reconstructed
speech signal y(n), which is supplied to the spectral weighting filter SW whose transfer
function is, as usual,

where A_w(z) is the function

with

where γ is an experimentally determined corrective factor that determines band
widening around formants. The reconstructed and weighted signal y_w(n) is subtracted
in an adder SM from the modified reconstructed and weighted signal x_w(n) obtained by
filtering the output signal from TS in the cascade of two filters
STS', SW', respectively identical to STS1 and SW. At the output of SM, a weighted error
signal e(n) is obtained, which is supplied to the error energy minimizing circuit EM
that performs all necessary operations to determine optimal shift and excitation.
[0040] The purpose of circuits TS is to align in time the signal to be coded with the replica
that the long-term synthesis filter is able to produce, and in particular to avoid shifts
between pitch peaks in the signal predicted by LTS1 and in the original one. For this
purpose, at each subframe TS shifts the time window of Ls samples that locates the
subframe itself by a certain amount Dh. The shift to be applied is determined
by unit EM with a fast search procedure within a range of values defined by a maximum
allowable shift. The shift is applied to the residual signal and not to the original one
because the resulting distortion is smoothed by the following filtering in STS', SW'
and is therefore substantially imperceptible. The shift applied in a subframe is
algebraically added to the one accumulated up to that time, providing a global shift ĥ,
in order to avoid too sudden variations. The global shift also cannot exceed a certain
maximum value (H samples of the original signal). The reason why H samples of the following
frame have also been loaded in MT is therefore evident. The purpose of limiting the shift
variation is to avoid excessive distortion; the limitation on the global shift
is instead determined by the delay that can be tolerated in coding procedures and
therefore by the availability of future samples. The time shift has a resolution that
is finer than one sampling period of the original signal, and therefore it is necessary
to carry out an upsampling of the residual signal.
[0041] Taking all this into account, circuit TS includes an upsampling circuit US (in
practice an interpolating filter), which supplies at its output the upsampled residual
r̂_s(n̂), and a shifting element SH that receives from EM information about the amount
of shift ĥ and generates the modified upsampled residual r̂_m(n̂). In the example, the
upsampling ratio Γ is 8, and therefore the upsampled signal
has a sampling frequency of 64 kHz: this upsampling ratio provides a suitable resolution
for all desired purposes. Moreover, for the correct operation of the interpolating filter,
it is necessary to always have available a certain number of samples following the
ones concerned: this is the reason why the further K samples of the following frame
are also loaded in MT.
[0042] It is not necessary to actually carry out the downsampling to obtain a modified
residual signal with an 8-kHz sampling frequency, since this operation can be carried out
implicitly, when necessary, by simply reading one sample of r̂_m(n̂) every Γ,
with a suitable phase. Downsampling is the inverse operation of upsampling,
recovering the samples at the lower rate.
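The upsampling, shifting and implicit downsampling can be sketched as follows. Linear interpolation is used only as a stand-in for the interpolating filter US, whose design is not given in the text; the function names and the way the shift is expressed (in upsampled sample positions) are assumptions.

```python
GAMMA = 8   # upsampling ratio from the example in the text


def upsample_linear(r, gamma=GAMMA):
    """Upsample by linear interpolation (stand-in for interpolating filter US)."""
    out = []
    for n in range(len(r) - 1):
        for k in range(gamma):
            out.append(r[n] + (r[n + 1] - r[n]) * k / gamma)
    out.append(r[-1])
    return out


def shifted_subframe(r_up, first, ls, shift, gamma=GAMMA):
    """Read Ls samples at the original rate, starting at sample `first` and
    displaced by `shift` upsampled positions (implicit downsampling)."""
    base = first * gamma + shift
    return [r_up[base + n * gamma] for n in range(ls)]
```

Reading every Γ-th sample with a phase set by the shift gives the shifted subframe directly, so the modified residual never has to be stored at the lower rate.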
[0043] Element SH will practically be a memory that loads, at each subframe, the ΓLs samples
of the upsampled residual plus a certain number of following and previous samples
linked to the maximum allowed shift in a frame (in practice, a number of samples equal
to twice the maximum shift, as will be explained in the description of optimal shift
search); SH is addressed for reading by the error energy minimizing unit EM, in such
a way as to supply the following circuits with Ls samples adequately shifted with
respect to the incoming subframe.
[0044] Turning back to the innovation codebook, this includes a certain number of words,
each having Ls samples, of which only a very limited number are different from 0. This
choice derives from the fact that, since the codebook is quite limited, it would be
unrealistic to expect to find in it words with many pulses (that is, non-null samples)
in which all pulses are actually suitable; it further reduces the amount
of computation necessary when searching for the optimal excitation. In the preferred
embodiment of the present invention, the codebook is composed of two parts. The first
one includes Ls words having a single non-null sample, with amplitude equal to 1 and
positive sign, and Ls-1 null samples. The non-null sample occupies a different position
in each word, so that the words can be obtained one from the other by simply shifting
the non-null sample by one position. For this first part of the codebook, signal s(n)
can be represented as

$$s(n) = \delta(n - n_1)$$

where δ is the well known unit impulse function and n, n₁ can have values between 0 and
Ls-1.
[0045] The second part includes words with two samples whose amplitude is 1, and Ls-2 null
samples. These words are generated starting from a limited number of key-words (in
particular 3) with the method described in European Patent Application EP-A-0396121
in the name of CSELT. In the example considered, the three key-words all have
the first pulse in position 0 and the second pulse in a respective key position
n₂(1), n₂(2), n₂(3), and the other words are obtained by shifting the pulse pair
towards the end of the word until the second pulse reaches that end or the first pulse
reaches the respective key position. Key positions are chosen so as to give rise to Ni2
(in particular 21) possible positions of the pulse pair; for each of these positions,
there are two words that differ from each other in the sign of the second pulse,
as described in said European Application, which brings the total number of words in the
innovation codebook to Ls+2Ni2 (62 in the example). For this second part of the
codebook, an innovation word is represented by the equation

$$s(n) = \delta(n - n_1) \pm \delta(n - n_2)$$

with n = 0...Ls-1, n₁ = 0...Ls-1-n₂(p), n₂ = n₂(p)...Ls-1, p = 1...Nip, where n₂(p)
denotes the generic key position and Nip is the number of key positions used (3 in the
example).
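The structure of the innovation codebook can be illustrated by enumerating its words. The key positions (11, 13, 15) are hypothetical: they are chosen only because, with the index ranges given above, they yield the 21 pair positions of the example. The coder itself does not store the codebook; the explicit list is built here purely for illustration.

```python
LS = 20                        # subframe length from the example
KEY_POSITIONS = (11, 13, 15)   # hypothetical keys giving 21 pair positions


def build_codebook(ls=LS, keys=KEY_POSITIONS):
    """Return all innovation words as lists of Ls samples."""
    words = []
    for n1 in range(ls):                  # first part: single unit pulses
        word = [0] * ls
        word[n1] = 1
        words.append(word)
    for key in keys:                      # second part: pulse pairs
        for n1 in range(ls - key):        # shift until pulse 2 reaches the end
            for sign in (1, -1):          # two words per pair position
                word = [0] * ls
                word[n1] = 1
                word[n1 + key] = sign
                words.append(word)
    return words


codebook = build_codebook()
```

With these assumed keys the enumeration gives 20 single-pulse words and 42 two-pulse words, that is, the 62 words of the example.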
[0046] The innovation codebook structure, with few non-null samples and words obtained by
shifting samples by one position starting from a limited number of keys, is a simple
deterministic structure that enables a fast search procedure of the optimal excitation
that requires neither codebook storage nor the effective filtering of the candidate
excitation signal.
[0047] During the search for the optimal innovation, the test with words of the first part of
the codebook must be carried out only if long-term analysis has indicated a voiced
sound or, failing that, if strong energy concentrations are noted in short signal
sections. Such concentrations can in fact signal the onset of a voiced section
that cannot yet be classified as such, since classification is based on long-term
analysis and the previous signal sections contained no useful features to indicate
the onset. Under these conditions, filter LTS1 would not be able
to supply a correct predicted signal. For good coded signal quality it is essential
that pitch pulses be correctly reproduced, and single-pulse words are therefore
useful to compensate for an inadequate operation (in voiced
sections) or for an impossible correct operation (in onsets) of the long-term synthesis
filter. Single-pulse words must not, instead, be used to reproduce unvoiced sounds
that are not onsets: there their use is counterproductive even when one of them
actually provides the minimum error signal energy, since the subjective effect is
usually worse.
[0048] The manner in which strong energy concentrations in short times are detected is
described below.
[0049] Words in the codebook are identified by a respective index j(s); the index related
to the optimal word, adequately coded, is transmitted to the decoder through a connection
2d. Since in the described example the codebook includes 62 words, to which as many
indexes j(s) correspond, without having to modify the number of bits coding j(s),
two further values of j(s) are available that do not correspond to any word in the
codebook. These are used to represent a null innovation gain, as explained below;
similarly to what has been done for long-term prediction delay and coefficient, when
generating the indexes, only one of the two values of j(s) not corresponding to an
innovation word will be used to indicate g = 0 and, when decoding, g will be set to
0 in correspondence with both values of j(s).
[0050] As regards gain g, this is quantized using a codebook built so as to allow saving
coding bits with respect to what would actually be necessary to represent all possible
values provided in the codebook. Information about gain, for each subframe, is represented
in the form of two indexes j(gmax), j(gnor), the first one of which is linked to the
maximum value of g in the frame, and the second one to the difference between such
maximum value and the actual value, and by sign σ. This information is transmitted
to the decoder through a connection 2e.
[0051] The codebook includes a number Nig of possible absolute values of g that can be represented
as
Nig = Nim + Nin - 1
where Nim and Nin are two different powers of 2. For example, we can have Nim =
2⁴ and Nin = 2², or Nim = 2⁴ and Nin = 2³. At each subframe, the optimal value of
g determined with the error minimizing procedure that will be described afterwards
is quantized, generating a respective index j(g) that is not transmitted but is reconstructed
in the decoder. At the end of the frame, value j(gmax) related to the maximum frame
gain is identified and is transmitted as such if it is not less than Nin; otherwise,
index j(gmax) is forced to value Nin. In this way, j(gmax) can only assume Nim values
and therefore the number of coding bits is limited. Once j(gmax) has been identified,
index j(gnor) is computed for every subframe with the equation
j(gnor) = j(gmax) - j(g)
; j(gnor) can have values in the range between 0 and Nim+Nin-2. The actual value
of index j(gnor) is transmitted only if it is not greater than Nin-1; otherwise, gain
is deemed 0 (that is, innovation is silenced for subframes where gain is very small
with respect to the maximum one) and index j(s) of the innovation word is forced to
one of the values that do not correspond to any codebook word to show transmission
of a word with null gain. In this way, a reduced differential dynamics is used and
the bits that should have been used to represent gain on the whole dynamics, are saved,
at the expense of a slight performance loss due to possible innovation silencing.
To minimize the effect of channel errors on innovation index j(s), in case of silencing
the value Nin-1 for index j(gnor) is anyway transmitted.
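A minimal Python sketch of the differential gain indexing just described is given below, assuming Nim = 2⁴, Nin = 2² and gain indexes j(g) numbered from 1 to Nim+Nin-1. The function names are hypothetical, and the boolean "silenced" flag stands in for the reserved value of j(s) that signals a null gain in the actual bit stream.

```python
NIM = 16  # assumed Nim = 2**4
NIN = 4   # assumed Nin = 2**2


def encode_frame_gain_indexes(jg_per_subframe, nin=NIN):
    """Map per-subframe indexes j(g) to (j(gmax), [(j(gnor), silenced), ...])."""
    # j(gmax) is the largest index in the frame, forced up to Nin if smaller.
    jgmax = max(max(jg_per_subframe), nin)
    coded = []
    for jg in jg_per_subframe:
        jgnor = jgmax - jg
        silenced = jgnor > nin - 1
        if silenced:
            # Gain deemed 0; Nin-1 is transmitted anyway for j(gnor).
            jgnor = nin - 1
        coded.append((jgnor, silenced))
    return jgmax, coded


def decode_frame_gain_indexes(jgmax, coded):
    """Rebuild j(g) = j(gmax) - j(gnor); None marks a silenced subframe (g = 0)."""
    return [None if silenced else jgmax - jgnor for jgnor, silenced in coded]


jgmax, coded = encode_frame_gain_indexes([19, 17, 10, 16])
print(jgmax, coded, decode_frame_gain_indexes(jgmax, coded))
```

In this example the third subframe lies more than Nin-1 levels below the frame maximum, so its innovation is silenced, while the other three indexes are recovered exactly.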
[0052] The gain codebook can be a logarithmic codebook, so that the ratio between two consecutive
values is a constant. The ratio must take into account several requirements:
- values in dB must be as near as possible to allow a quantization as accurate as possible;
- global dynamics between minimum gain g(1) and maximum one g(Nim+Nin-1) must be adequately
extended to cover the different types of sound and a reasonable set of different voice
levels;
- differential dynamics between g(x-Nim+1) and g(x) must be adequately extended to make
the probability of silencing reasonably low.
[0053] For example, with the above values of Nim, Nin, the value of the ratio between two
consecutive gain levels can range from 3 to 6 dB.
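As a rough illustration, a logarithmic gain codebook with a constant ratio between consecutive levels can be generated as in the following Python sketch. The minimum gain g_min, the 3 dB step and the reading of the dB figures as amplitude ratios (20·log10) are assumptions made only for the example.

```python
import math


def log_gain_codebook(g_min=1.0, step_db=3.0, nim=16, nin=4):
    """Return Nim+Nin-1 gain levels spaced by a constant ratio of step_db dB."""
    n_levels = nim + nin - 1
    ratio = 10.0 ** (step_db / 20.0)  # amplitude ratio for step_db decibels
    return [g_min * ratio ** k for k in range(n_levels)]


levels = log_gain_codebook()
global_dynamics_db = 20.0 * math.log10(levels[-1] / levels[0])
print(len(levels), round(global_dynamics_db, 6))
```

With these assumed values the codebook holds 19 levels and the global dynamics is (Nim+Nin-2) times the step; a larger step widens both the global and the differential dynamics at the cost of coarser quantization, which is the trade-off listed above.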
[0054] The fast search procedure for optimal shift and excitation will now be described,
referring also to the operative diagram in Fig. 2, which corresponds to the set of blocks
M1, LTS, STS, STS', SM, SW, SW' of Fig. 1. In Fig. 2, the same symbols as in Fig.
1 are used, with the exception of blocks STW1, STW2 that represent the filter resulting
from the series of filters STS1, SW and respectively STS', SW', that is a filter with
transfer function

. In this Figure, each of the filters has been divided into an element with null
input (LTSa, STW1a, STW2a) that provides contribution of initial conditions (that
is of filtering memories for previous subframes), and into an element (STW1b, STW2b)
that is reset at each subframe (filtering with null initial conditions), as indicated
by signal R supplied by a time base, not shown. Filtering with null initial conditions
of excitation is only the short-term filtering, since it has been supposed that delay
d is not less than a subframe.
[0055] The optimal shift determination is composed of three steps:
- evaluation of the need to perform a shift;
- determination of a suitable range of shift values;
- search for the optimal shift in the range.
[0056] In the first step, it is checked if three conditions are satisfied:
- the subframe is not silence, which is shown by the fact that the energy of rs(n) is greater than a given threshold;
- the signal is voiced or has been subjected to interpolation, which is shown by
flags F, V coming from LTA;
- a peak of rs(n) actually occurs in the subframe, which is shown by the fact that the average power
of rs(n) in the subframe (that is the energy divided by the number Ls of samples) is greater
than or equal to the energy in a period of length d that ends with the last sample
of the subframe itself.
[0057] The reason for the first condition is obvious. As regards the second and the third
one, shift must be performed only if there is a pitch peak in the subframe. This occurs
first of all in voiced sections; the fact that an interpolation occurred, that is,
that the values of parameters obtained in two subsequent frames are very near, suggests
a certain periodicity in the signal segment that must be coded, and therefore enabling
the shift also in this case can be useful to further reduce risks of misalignment
between the reconstructed signal and the original signal.
[0058] Computation of energy and powers can be carried out indifferently on the upsampled
signal or on the original one. During these computations, the maximum absolute value
of r̂s in the current subframe and its position are also obtained: they will be used in
determining the shift. To determine the position of the maximum, it is mandatory to
operate on the upsampled signal to get maximum resolution.
[0059] The second step determines the lower and upper extremes ĥmin, ĥmax of a range
that extends around shift value ĥ accumulated so far in the frame. Values ĥmax, ĥmin
are initially fixed so that differences ĥmax - ĥ and ĥ - ĥmin have a prearranged value
Γ · Δh, for example 20 samples of the upsampled signal r̂s. There exists therefore a
maximum number of possible values (41 in the example) among which the optimal shift
can be searched for. The actual extreme values ĥmin, ĥmax may not be symmetrical with
respect to value ĥ (that is, the range can be limited on one or both sides of the
accumulated value ĥ), since it is necessary to avoid shifting the subframe too much,
both into the past, with possible duplication of a maximum of r̂s previously taken into
account, and into the future, with consequent loss of a maximum. This check is made
possible by storing the maximum of r̂s in the subframe. However, unless range limiting
has been bilateral, the search for the optimal shift is carried out trying to keep the
range width constant, by also taking into account some values beyond the extreme that
is not subjected to limitation. In any case, the shift to be carried out must not cause
value H to be exceeded.
[0060] The optimal shift value within the test range is the one minimizing energy of an
error signal e₁(n) represented by the difference between reconstructed and weighted
modified signal xw(n) (Fig. 1) and contribution yw1(n) of excitation filtering memories,
and is obtained with a fast search procedure
that allows reducing the amount of necessary computations.
[0061] For this fast search, it must be taken into account on one hand that output signal
xw(n) from STW' can be expressed as

(where n ranges from 0 to Ls-1), and on the other hand that the same signal is the
sum of output xw1 of STW2a and output xw2 of STW2b. Summation in (7) represents signal
xw1, that can be computed, once and for all, like the corresponding contribution yw1
of chain LTSa, STW1a, and therefore an error

can also be computed once and for all, that appears at the output of an adder SMa.
Error e₁ can then be written

, where xw2 depends on r̂s and therefore on the shift. It is then necessary to determine
xw2 for all values of the shift, to compute for each one the respective energy of e₁,
and to store the shift value that provides minimum energy and the corresponding signal
xw(n).
[0062] The procedure to determine xw2 adopted according to the invention takes into account
that, for a given shift value, signal xw2 is given by

The upper limit of the summation is the minimum between n and P, since when filtering
with null initial conditions, samples with n-k < 0, that is, samples of the previous
subframe, must not be taken into account. Values of xw2 are actually computed according
to (8) for a first group of Γ possible shifts that range from ĥmax to ĥmax-Γ+1;
obviously, the tests will be stopped if by chance ĥmin is reached before having examined
all Γ shifts. For the other values of shift, from ĥmax-Γ to ĥmin, instead of being
computed with (8), xw2 is computed according to the equation
[0063] In (9), Q(n) shows the truncated pulse response (since it is computed only for Ls
values of n) of filter STW, with Q(0) = 1.
[0064] It can be immediately noted that, taking into account that Q is determined once and
for all, beside a certain value, (9) requires far fewer computations than (8).
[0065] It must further be stated that Γ values of xw2 must actually be computed according
to (8) and (9), that is one for each of the Γ
upsampled signal samples corresponding to an 8-kHz sampling period.
[0066] Once the energy of e₁(n) has been minimized and the optimal shift found, minimization
of the energy of e(n) is started to find the optimal excitation. Unit EM directly
computes an expression of the energy to be minimized that is a function of the position
of the pulses in the innovation word, and for this purpose the pulse response Q,
computed during the search for the optimal shift, is employed. Working from the pulse
response is more convenient than carrying out the filtering because every word
includes two non-null samples at most. Moreover, taking into account the more general
case of the words with 2 pulses, the global pulse response is the sum of two responses
spaced by a distance equal to the key; responses for all other words linked to a key
are then obtained simply by a translation by one sample at a time. To simplify, in
the following mathematical expressions, the variability range of the summation index
for summations extended to all samples in a subframe has not been indicated.
[0067] Error e(n), for a generic excitation word, is given by
e(n) = e₁(n) - g·u(n)
, where u(n) is the output signal from STW1b. Energy of e(n) is given by
E = Σ [e₁(n) - g·u(n)]²
that can be written as
E = Σ e₁²(n) - 2g·Σ e₁(n)·u(n) + g²·Σ u²(n)
. Taking into account that the first and the last summations represent energies of
signals e₁, u, and the second one represents mutual correlation R(e₁u)(k) between
them, evaluated for k=0 and in the following simply called R(e₁u), we have
E = E(e₁) - 2g·R(e₁u) + g²·E(u)
[0068] Minimizing E is the same as maximizing the difference of energies
ΔE = 2g·R(e₁u) - g²·E(u)    (12)
For each word of the examined codebook, the maximum of (12) is obtained for a value
g₀ = R(e₁u) / E(u)
, as immediately appears by computing the derivative with respect to g and making
it equal to 0, to which a value
ΔE = R²(e₁u) / E(u)
corresponds.
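The gain optimization can be checked numerically with the short Python sketch below. It assumes the error model e(n) = e₁(n) - g·u(n), from which the optimal gain is the ratio R(e₁u)/E(u) and the energy reduction is R²(e₁u)/E(u); the function names and the sample values are hypothetical.

```python
def optimal_gain(e1, u):
    """Return (g0, delta_e) for the assumed error model e(n) = e1(n) - g*u(n)."""
    r_e1u = sum(a * b for a, b in zip(e1, u))  # correlation R(e1u) at lag 0
    e_u = sum(b * b for b in u)                # energy E(u), assumed non-zero
    return r_e1u / e_u, r_e1u * r_e1u / e_u


def error_energy(e1, u, g):
    """Energy of e(n) = e1(n) - g*u(n) computed sample by sample."""
    return sum((a - g * b) ** 2 for a, b in zip(e1, u))


e1 = [1.0, 2.0, 0.5, -1.0]
u = [1.0, 0.5, 0.0, 0.0]
g0, delta_e = optimal_gain(e1, u)
print(g0, delta_e, error_energy(e1, u, g0))
```

Since ΔE does not depend on E(e₁), the search only has to compare R²(e₁u)/E(u) across words; the error energy itself is needed only for the word finally selected.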
[0069] The particular structure of the innovation codebook allows E(u) and R(e₁u), which
depend on the position of the pulse or pulses in the word, to be obtained directly by
exploiting the previously determined pulse response of filter STW1, which is equal to
that of filter STW2.
[0070] In fact
or, more simply,
where Eq is energy of the adequately truncated signal Q (that is, computed for a number
of samples determined by the position of n₁, n₂). Moreover, R(e₁u) can be written
where
n=0
It is clear that for single-pulse words, relations (14) and (15) are simply reduced
to

and

.
[0071] The operations performed at each subframe by EM to determine the optimal excitation
can be considered as divided into three steps.
a) Before examining the effect of each innovation word, as soon as values ai are available, EM computes and stores the possible values of the three addends in
(14). Computation will be carried out only for the first 4 subframes, since, as already
said, in the following subframes filter coefficients ai do not change. Terms Eq can be computed with a simple iterative procedure, according
to the equation

with n =1...Ls-1 and

.
Moreover, since the codebook includes only Ni2 possible pairs of values n1,n2, computation of ρ is carried out only for these pairs, according to the expressions


where n₂(p) has the already cited meaning, n = 1...Ls-1- n₂(p) and k = Ni2...1 is
the generic pair of values n₁, n₂.
b) As soon as the optimal value of e₁ is available, and still before the search procedure,
EM computes and stores values R(e₁q).
c) After these operations, EM computes values of E(u),R(e₁u) word by word, determining
value g₀ and the related ΔE, and storing the word index and the related value of g
that originated the energy minimum.
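Steps a) to c) can be sketched in Python as follows. This is only an illustrative reading of the procedure: it assumes that, for a word with pulses at n₁ and n₂, u(n) is the sum of two shifted and truncated copies of Q, so that E(u) is made up of two truncated energies Eq plus a cross term ρ, and R(e₁u) is the signed sum of two values R(e₁q). The cross terms are computed directly rather than with the iterative expressions of the text, and all names are hypothetical.

```python
def fast_innovation_search(q, e1, single_positions, pair_positions):
    """Return (delta_e, g0, (n1, n2, sign)) for the best word.

    q  : truncated pulse response Q(0..Ls-1); e1 : first partial error signal.
    single_positions : pulse positions n1 of the single-pulse words.
    pair_positions   : (n1, n2) pairs with n1 < n2 for the two-pulse words.
    """
    ls = len(q)
    # a) Truncated energies, built iteratively: Eq(n) = Eq(n-1) + Q(n)^2.
    eq = [q[0] ** 2]
    for n in range(1, ls):
        eq.append(eq[-1] + q[n] ** 2)
    # a) Cross terms between the two shifted responses (computed directly).
    rho = {(n1, n2): sum(q[n - n1] * q[n - n2] for n in range(n2, ls))
           for n1, n2 in pair_positions}
    # b) Correlation of e1 with the response shifted to each pulse position.
    r_e1q = [sum(e1[n] * q[n - m] for n in range(m, ls)) for m in range(ls)]
    # c) Word-by-word evaluation; no candidate is actually filtered.
    candidates = [(n1, None, 0) for n1 in single_positions]
    candidates += [(n1, n2, s) for n1, n2 in pair_positions for s in (1, -1)]
    best = None
    for n1, n2, sign in candidates:
        e_u = eq[ls - 1 - n1]
        r_e1u = r_e1q[n1]
        if n2 is not None:
            e_u += eq[ls - 1 - n2] + 2 * sign * rho[(n1, n2)]
            r_e1u += sign * r_e1q[n2]
        delta_e = r_e1u ** 2 / e_u
        if best is None or delta_e > best[0]:
            best = (delta_e, r_e1u / e_u, (n1, n2, sign))
    return best


q = [0.8 ** k for k in range(8)]
e1 = [0.3, -1.2, 0.5, 2.0, -0.7, 0.1, 1.5, -0.4]
print(fast_innovation_search(q, e1, range(8), [(n1, n1 + 5) for n1 in range(3)]))
```

The tables computed in steps a) and b) are shared by all words, so the per-word cost in step c) reduces to a few additions, one multiplication and one division.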
[0072] As said above, if the sound is not voiced, the tests with words of the first part
of the codebook are carried out only if strong energy concentrations in short times
are noted, which can indicate the onset of a voiced signal section. For this purpose, within
the subframe, the energy of a group of samples of the modified residual is computed
(e.g. 5 samples), starting from the beginning of the subframe and shifting, by one
sample at a time, the window selecting the group till the whole subframe has been
scanned, and storing which group shows maximum energy. Furthermore, the average power
(that is the energy divided by the number of samples) in the window where the maximum
occurred and the average power in the subframe are also computed. Tests with single-pulse
words will be enabled if subframe energy and the ratio between the average powers
in the window and in the subframe are greater than suitable thresholds. Moreover,
if the optimal innovation is composed of a single-pulse word, the absolute value of
gain g is limited to a maximum value |g|max equal to |rs|max multiplied by a parameter
approximately equal to 1, where |rs|max is the residual maximum computed during the
operations to determine ĥmin, ĥmax. The purpose of this limitation is also to prevent
insertion into the signal of a pulse
with too high energy with respect to the maximum residual amplitude in the same subframe.
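The onset test and the gain limitation can be sketched in Python as follows. The 5-sample window comes from the example in the text; the two thresholds, the limiting factor of 1.0 and the function names are hypothetical values chosen only for illustration.

```python
def single_pulse_test_enabled(residual, window_len=5,
                              energy_thr=1.0, ratio_thr=3.0):
    """Decide whether single-pulse words are tested in an unvoiced subframe."""
    ls = len(residual)
    subframe_energy = sum(x * x for x in residual)
    if subframe_energy <= energy_thr:
        return False  # subframe energy below the (assumed) threshold
    # Slide the window one sample at a time and keep the maximum energy.
    max_window_energy = max(
        sum(x * x for x in residual[start:start + window_len])
        for start in range(ls - window_len + 1)
    )
    window_power = max_window_energy / window_len
    subframe_power = subframe_energy / ls
    return window_power / subframe_power > ratio_thr


def limit_single_pulse_gain(g, residual_abs_max, factor=1.0):
    """Clip |g| to factor * |rs|max while keeping the sign of g."""
    g_max = factor * residual_abs_max
    return max(-g_max, min(g_max, g))


onset = [0.0] * 20
onset[10] = 4.0
print(single_pulse_test_enabled(onset), single_pulse_test_enabled([1.0] * 20))
```

A residual whose energy is concentrated in one short window yields a high power ratio and enables the single-pulse words, whereas a flat residual gives a ratio of 1 and leaves them disabled.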
[0073] At the end of each subframe, initial conditions in filters LTSa, STW1a, STW2a will
have to be updated. To update LTSa, that is ss(n), it will be necessary to add a pulse
or a pair of pulses (corresponding to the optimal innovation word) to ss1(n). To update
yw(n), it will be necessary to add to yw1(n) one or two pulse responses (corresponding
to signal u(n)) adequately shifted and multiplied by gain g in order to supply the value
of yw2 corresponding to the optimal excitation. The pulse response will also be exploited
to update STW2a. Furthermore, since filters STW have order P, only the last P samples
of such responses (from Ls to Ls-P) are of interest.
The operations of EM are also included in the appendix.
[0074] The decoder structure will now be described, referring to the diagram in Fig. 3,
where blocks corresponding to the ones already described with reference to Fig. 1
are shown by the same reference symbols, followed by digit 2. The various reconstructed
signals are also shown with the same reference symbols used for the original signals
in the coder.
[0075] The decoder receives from the coder, through connections 2a-2e, indexes j(d), j(φ),
j(b), j(s), j(gmax), j(gnor) and sign σ for the innovation gain. At each subframe,
index j(s) selects an innovation word s(n) in codebook IC2 or indicates a subframe
that does not provide innovation contributions (g=0). If a word has been selected,
it is multiplied in M2 by gain g whose absolute value is selected in the codebook
IG2 by an index
j(g) = j(gmax) - j(gnor)
and whose sign is σ, thereby providing the reconstructed excitation signal (or fixed
codebook contribution) s₁(n).
[0076] This signal is filtered in the long-term synthesis filter LTS2 to provide the reconstructed
short-term residual ss(n). In order to operate exactly like its replica LTS1 in the coder,
filter LTS2 must receive from reconstruction circuit LTR2 parameters d, b and flag F
indicating the possible need to carry out interpolation of d and b. Therefore, LTR2 will
include a read-only memory with two tables addressed by indexes j(d), j(b), like LTR1
(Fig. 1), in addition to a circuit suitable to store values of d, b related to two
consecutive frames and to carry out the comparisons, described in connection with the
coder, necessary to determine if interpolation of d, b is necessary. Signal ss(n)
outgoing from LTS2 is filtered in the short-term synthesis filter STS2 using coefficients
ai generated in coefficient reconstructing circuit STR2 starting from indexes j(φ).
In STS2, too, for the first subframes of each frame, interpolated coefficients will
be used. The reconstructed speech signal y(n) is then subjected to a further filtering
in an adaptive filter PF that uses coefficients obtained from linear prediction coefficients
ai and that inserts into the reconstructed speech signal a distortion that improves
the perceptual effect. At the output of PF, there is a filtered reconstructed signal
yp(n). The use of filters like PF in speech coding is well known to those skilled in the
art and does not require further explanations.
[0077] It will be noted that the decoder does not take into account the possible shift carried
out in the coder: in fact, the purpose of the shift is just to make the synthesized
signal as good a replica as possible of the original signal, and therefore the
decoder only requires information related to excitation and filters.
[0078] It is clear that what has been described is provided only by way of non-limiting
example and that variations and modifications are possible without departing from
the scope of the present invention. Thus, for example, even if reference has been
made, about innovation, to samples whose amplitude was 1, it is also possible to use
samples whose amplitudes are chosen in a finite set of values (e.g., ±1, ±√2, ±1/√2):
obviously, in this case the coded signal will also include information about the relative
amplitude of innovation samples. Generalizing equations (14), (15) to the case of
pulses whose amplitude is not unitary is immediate. The choice of sample amplitudes
in a finite set of values is not limiting, because in any case the relative amplitudes
of the samples themselves are quantized.
1. Method of coding/decoding speech signals, including, in a coding step, the operations
of:
- sampling the original speech signals at a first sampling rate and dividing the resulting
sequence of samples [x(n)] into a plurality of blocks of subsequent samples, each
block comprising a first predetermined number Ls of samples or an integer multiple
of said first number;
- performing a short-term analysis of the original speech signal to determine a group
of linear prediction coefficients (ai) to be used for a linear prediction filtering, a short-term synthesis filtering and
a spectral weighting filtering, generating a representation of said coefficients in
the frequency domain, and inserting into the coded signal information [j(φ)] related
to the value of said representation, said information being valid for a period equal
to the duration of a block or of a group of consecutive blocks of samples;
- obtaining, through said linear prediction filtering, a short-term residual signal
[rs(n)] for said block or group of blocks of samples;
- subjecting said residual signal [rs(n)] to a long-term analysis, to determine long-term analysis parameters comprising
a long-term synthesis filtering delay d and coefficient b, and inserting into the
coded signal information [j(d), j(b)] related to the values of said parameters, said
information being valid for a time equal to the duration of a block or a group of
consecutive blocks of samples;
- reproducing every block of speech signal samples to be coded with a reconstructed
and weighted speech signal [yw(n)], obtained by subjecting to long-term synthesis filtering, short-term synthesis
filtering and spectral weighting filtering an excitation signal chosen within a set
of excitation signals, each comprising an amplitude contribution (excitation gain)
and a shape contribution (innovation), the latter being composed of a limited number
of pulses, much less than said first number of samples, with predefined positions
and amplitudes belonging to a respective finite set;
- subjecting a set of samples of said residual signal [rs(n)] to a time shift by discrete steps, each set of residual signal samples having
a number of samples equal to the number of samples in a block of speech signal samples
to be coded, to align in time the residual signal with a reconstructed residual signal
[ss(n)] obtained as result of the short-term synthesis filtering of an excitation signal,
the shift generating a modified residual signal [r̂m(n̂)] that is subjected to a long-term synthesis filtering and to a spectral weighting
filtering, identical to those carried out for the excitation signals, to generate
a reconstructed and weighted modified speech signal [xw(n)];
- determining an optimal excitation signal for each block of samples, by minimizing
the energy of a weighted error signal [e(n)] represented by the difference between
the reconstructed and weighted modified signal [xw(n)] and the reconstructed and weighted signal [yw(n)], and inserting into the coded signal information [j(s), j(gmax), j(gnor), σ] that identifies the optimal excitation signal; characterized in that:
- the innovation pulses are the only non-null samples of words composed of said first
number Ls of samples,
- the innovation words for a first subset of excitation signals include a pair of
pulses, a limited group of words of the first set being key-words in which the two
pulses are placed in predetermined key positions and the other words in the subset
being obtained from each of the key-words by each time simultaneously shifting the
pulses by one position towards a word end, till one of the pulses reaches said end
or the key position of the other pulse in the starting word, the shifting direction
being the same for all words; and
- the innovation words for a second subset of excitation signals include only one
pulse whose position is different for each signal;
and in that for said determination of the optimal excitation signal the energy
of said weighted error signal is directly computed, by exploiting a pulse response
Q(n) of filters that carry out synthesis and spectral weighting filterings of the
excitation signal, with the following operations:
- determining said pulse response Q(n) and the energy Eq thereof for each of the possible
pulse positions in the excitation signals;
- determining a first partial error signal [e₁(n)], represented by the difference
between the reconstructed and weighted signal [xw(n)] and a contribution [yw1(n)] of the excitation signal filtering memory, and the energy of the same error signal;
- determining a first correlation R(e₁q) between said first partial error signal [e₁(n)]
and the pulse response Q(n) for each of the pulses of an excitation signal;
- determining for each excitation signal, starting from said pulse response, a signal
[u(n)] representative of a contribution of the filtering with null initial conditions
of the excitation signal;
- determining the energy E(u) of said signal [u(n)] representative of the contribution
of a filtering with null initial conditions of the excitation signal, and determining
a second correlation R(e₁u) between said signal [u(n)] representative of the contribution
of the filtering with null initial conditions of the excitation signal and the first
partial error signal [e₁(n)];
- determining, for each excitation signal, an optimal value of the amplitude contribution
as ratio between said second correlation and the energy of the signal resulting from
filtering at null initial conditions;
- computing, as a function of said second correlation R(e₁u), of said energy E(u) of the
signal representative of the contribution of the filtering with null initial conditions
of excitation and of said energy E(e₁) of the first partial error signal, the value
of error signal energy for each excitation signal.
2. Method according to claim 1, characterized in that said pulses have unitary amplitude.
3. Method according to claim 1 or 2, wherein the sequence of speech signal samples is
divided into frames that are composed of a plurality of consecutive subframes each
corresponding to one of said blocks and include a second predetermined number Lf of
samples, and wherein said short-term analysis is carried out for each frame, characterized
in that for said short-term analysis in a frame a sample window is analysed, whose
length is Lf+P (P = number of linear prediction coefficients in each group), that
encompasses a current frame and the subsequent frame and also includes a predefined
number H+K of samples of said subsequent frame, said window being a trapezoidal window
that weights all samples with maximum weight, apart from the first and the last P
samples, for which the weighting factors are determined through linear interpolation
between a minimum weight and the maximum weight.
4. Method according to claim 3, characterized in that for the initial subframes of each
frame, the linear prediction coefficients ai are coefficients obtained as result of an interpolation between the values provided
by short-term analysis for the current frame and those provided for the previous frame,
the interpolation being carried out by operating on said representation.
5. Method according to any one of the previous claims, wherein the linear prediction
residual is subjected to low-pass filtering before long-term analysis, thereby providing
a filtered residual signal [rf(n)].
6. Method according to any of claims 1 to 5, wherein the sequence of speech signal samples
is divided into frames that are composed of a plurality of consecutive subframes each
corresponding to one of said blocks and include a second predetermined number Lf of
samples, and wherein said long-term analysis is carried out for each frame, characterized
in that to determine said long-term analysis parameters, a sample window of the filtered
residual signal [rf(n)] is analysed, that encompasses a current frame and the subsequent frame and also
includes a predefined number H+K of samples of said subsequent frame.
7. Method according to claim 6, characterized in that said long-term analysis further
includes the operation of determining, for each frame, a long-term prediction gain
G, representative of the ratio between the energies of filtered residual signal at
the input of and at the output from means that carry out said analysis, the gain being
also determined at each frame.
8. Method according to claim 6 or 7, characterized in that said long-term analysis further
includes the operations of:
- classifying a speech signal segment corresponding to a frame as voiced or unvoiced,
depending on the value of said long-term analysis coefficient b and on prediction
gain G, and generating a first flag (V) in case the segment is classified as voiced;
- comparing values of long-term analysis delay d and coefficient b related to a current
frame with those related to the previous frame and generating, when delay variation
is less than a predefined amount and coefficient values in both frames are positive,
a second flag (F) that enables interpolation between delay and coefficient values
computed for said previous frame and those computed for the current frame.
9. Method according to any of the claims from 6 to 8, wherein long-term analysis delay
d is determined as maximum of the autocorrelation function of the filtered residual
within the window used for the analysis itself, characterized in that, before determining
long-term analysis coefficient b and prediction gain G for the current frame, the
local maximum of said autocorrelation function is determined even in a neighborhood
of the maximum of the same function in the previous frame, if said first and second
flags had been generated in said previous frame, and said local maximum is used as
delay for current frame if it is different by an amount that is less than a predefined
value from the maximum in the window related to current frame.
10. Method according to any of the claims from 6 to 9, characterized in that the value
of long-term analysis coefficient b is clipped to a first maximum value b₁, linked
to the ratio between energy of the filtered residual signal in the current frame and
in the previous frame in an interval whose length is equal to the long-term analysis
delay.
11. Method according to any of the claims from 6 to 10, characterized in that the value
of long-term analysis coefficient b is clipped to a second maximum value b₂, if it
exceeds such value while the prediction gain G is less than a gain threshold Gthr.
12. Method according to claim 8 or any of claims 9 to 11, if referred to claim 8, characterized
in that said interpolation of long-term analysis delay d and coefficient b is a linear
interpolation extended over a whole frame and, in case of a non-integer interpolated
delay value, the value of a corresponding sample of the reconstructed residual signal
ss(n) is evaluated with a second-order polynomial interpolation centered around the
integer delay value that is nearest to said interpolated value.
13. Method according to any of the claims from 6 to 12, wherein information related to
long-term analysis coefficient b inserted in the coded signal are indexes representative
of quantized coefficient values, and information related to long-term analysis delay
d allows representing also delay values that are outside an interval of allowed delays,
characterized in that coefficient values that are less than a predefined fraction
of a minimum quantized value are forced to 0 and, in case of forcing to 0, delay information
representative of a value that is outside said interval of allowed delays and the
index representative of said minimum quantized value, are inserted in the coded signal.
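A possible reading of the zero-forcing rule of claim 13 is sketched below. The escape code `d_max + 1`, the nearest-level quantizer, the default `fraction` and the function names are hypothetical; the claims only require that the transmitted delay lie outside the allowed interval and that the index of the minimum quantized value be sent.

```python
def encode_long_term(b, d, levels, d_min, d_max, fraction=0.5):
    """Return (coefficient index, delay code) for coefficient b and delay d."""
    min_index = min(range(len(levels)), key=lambda i: levels[i])
    if b < fraction * levels[min_index]:
        # b is forced to 0: signal it with an out-of-range delay code
        # together with the index of the minimum quantized value.
        return min_index, d_max + 1
    index = min(range(len(levels)), key=lambda i: abs(levels[i] - b))
    return index, d

def decode_long_term(index, delay_code, levels, d_min, d_max):
    """Inverse mapping: an out-of-range delay code means b = 0."""
    if not d_min <= delay_code <= d_max:
        return 0.0, None
    return levels[index], delay_code
```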
14. Method according to any of claims 1 to 13, characterized in that, to determine the
optimal excitation, excitation signals of said second subset are used if said first
flag (V) has been generated or, if said flag has not been generated, if analysis of
the energy distribution in the modified residual signal shows an energy concentration
in short times, that indicates the onset of a voiced sound.
15. Method according to claim 14, characterized in that, to determine the optimal excitation,
the excitation signals of the two subsets are normalized with different normalization
factors, linked to the number of pulses present in respective subset signals.
16. Method according to claim 14 or 15, characterized in that, if said first flag (V)
has been generated, the amplitude contribution for excitation signals of said second
subset is limited in such a way as not to exceed a threshold that is proportional
to the absolute value of the residual signal.
17. Method according to any of claims 14 to 16, characterized in that said analysis of
the energy distribution of the modified residual signal is carried out at each subframe
and includes the operations of:
- dividing the subframe into a plurality of partially overlapping windows, a first
and a last window coinciding with a respective initial or final part of the subframe,
the windows following the first one being each shifted by one sample with respect
to the previous window;
- determining the energy and the power of the modified residual signal in the whole
subframe and the energy in each one of said windows;
- determining the power for the window whose energy is maximum and determining the
ratio between the power in said window and the power in the subframe; and
- comparing said maximum energy and said power ratio with respective thresholds, said
energy concentration being recognized if said maximum energy and said ratio are not
less than respective thresholds.
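The operations of claim 17 can be sketched as below, assuming that power is energy divided by the number of samples; the window length and the two thresholds are hypothetical parameters.

```python
import numpy as np

def energy_concentration(x, win_len, energy_thr, ratio_thr):
    """Detect an energy concentration in subframe x using sliding windows."""
    sq = np.asarray(x, dtype=float) ** 2
    sub_power = sq.sum() / len(sq)
    if sub_power == 0.0:
        return False
    # One window per one-sample shift; the first and last windows coincide
    # with the start and the end of the subframe.
    win_energy = np.convolve(sq, np.ones(win_len), mode="valid")
    e_max = win_energy.max()
    ratio = (e_max / win_len) / sub_power
    # "Not less than" the thresholds, as in the claim.
    return bool(e_max >= energy_thr and ratio >= ratio_thr)
```

A subframe containing a single strong pulse passes both tests, whereas a flat subframe fails the power-ratio test because every window has the same power as the whole subframe.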
18. Method according to any of the claims from 6 to 17, characterized in that, if only
the second flag (F) has been generated, long-term analysis delay d is varied by an
amount that is proportional to the magnitude of the shift accumulated up to the previous
frame, the absolute value of the variation being limited to a predefined maximum.
19. Method according to claim 18, characterized in that said delay variation is disabled
if it causes the decision about interpolation to be altered and the delay to go out
of a predetermined interval of values.
20. Method according to any of the claims from 6 to 19, characterized in that the residual
signal is subjected to said time shift in a subframe if at least one of said first
and second flags has been generated and if an analysis of the modified residual signal
energy in the subframe shows that the corresponding speech signal segment is not silence
and includes a pitch peak, the shift related to a subframe being accumulated with
that of the previous subframes of the same frame, so that the total shift in a frame
remains less than a maximum shift.
21. Method according to claim 20, characterized in that said analysis of the modified
residual signal energy includes the operations of:
- comparing the energy itself with an energy threshold, which, when reached, shows
that the corresponding speech signal segment is not silence;
- determining the modified residual signal power in the subframe and in an interval
whose length is equal to the long-term analysis delay, and the ratio between such
powers; and
- comparing such ratio with a power threshold, which, when exceeded, shows the presence
of a pitch peak in the subframe.
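The two tests of claim 21 may be illustrated as follows. The sketch assumes that the ratio is taken as subframe power over the power in the delay-long interval, and that the caller supplies that interval; the claims do not state where the interval is located, so this, like the names used, is an assumption.

```python
def has_pitch_peak(subframe, pitch_interval, energy_thr, power_thr):
    """Return True if the subframe is not silence and contains a pitch peak."""
    e_sub = sum(v * v for v in subframe)
    if e_sub < energy_thr:          # threshold not reached: silence
        return False
    p_sub = e_sub / len(subframe)
    p_int = sum(v * v for v in pitch_interval) / len(pitch_interval)
    # Equivalent to p_sub / p_int > power_thr, without dividing by zero.
    return p_sub > power_thr * p_int
```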
22. Method according to claim 20 or 21, characterized in that the shift for a subframe
is determined, before determining an optimal excitation signal, within an interval
that extends around the shift accumulated in previous subframes of the same frame,
and it is the value that minimizes energy of said first partial error signal [e₁(n)].
23. Method according to claim 20, characterized in that to determine the shift, an upsampling
of the residual signal is carried out, at a second rate that is a multiple of the
first rate, the shift in a subframe being equal to one or more samples of the upsampled
residual signal.
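Shifting by steps of the upsampled signal, as in claim 23, can be sketched as below. Linear interpolation (`numpy.interp`) stands in for the upsampling filter, which the claims do not specify, and a positive shift is assumed to read later samples; both are assumptions, as are the names.

```python
import numpy as np

def shifted_subframe(residual, start, length, shift, factor):
    """Extract a subframe shifted by `shift` samples of the upsampled signal."""
    residual = np.asarray(residual, dtype=float)
    n = len(residual)
    # Upsample by `factor` (linear interpolation used as a stand-in).
    up_times = np.arange(n * factor) / factor
    upsampled = np.interp(up_times, np.arange(n), residual)
    # Pick one upsampled sample per original sample, offset by the shift.
    idx = (start + np.arange(length)) * factor + shift
    if idx[0] < 0 or idx[-1] >= len(upsampled):
        raise ValueError("shift moves the subframe outside the stored samples")
    return upsampled[idx]
```

With `factor = 4`, a shift of 2 corresponds to half a sample at the original rate.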
24. Method according to claim 22 or 23, characterized in that said first partial error
signal is computed as the sum of a signal [xw2(n)], representative of the modified residual
signal filtered with null initial conditions, and a second partial error signal [e₀(n)],
which is the difference between the memory contribution [xw1(n)] of the modified residual
signal filtering and the memory contribution [yw1(n)] of the excitation filtering, the
signal [xw2(n)] representative of the modified residual filtered with null initial conditions
related to a sample in a subframe being obtained by carrying out the actual filtering
of the modified residual signal for shift values between the upper end of the interval
and an intermediate value between the two extreme values, while for each of the remaining
shifts in the interval it is iteratively obtained from the value related to the previous
sample and from said pulse response.
25. Method according to claim 24, characterized in that the determination of said interval
of shift values is carried out through the following operations:
- fixing for the interval ends two symmetrical values with respect to the accumulated
value;
- determining the residual signal peak position in the upsampled residual signal and
comparing it with the peak position in the previous subframe;
- limiting the interval extension on one or both sides of the accumulated value to
avoid an excessive shift of the subframe into the past and/or into the future, with
consequent duplication or loss of residual signal peaks.
26. Method according to claim 25, characterized in that, in case of interval limitation
on one side only of the accumulated value, the search for the shift is carried out
also taking into account a certain number of values beyond the interval end not affected
by the limitation, such that the global number of tested values is equal to the number
of values included between said symmetrical values.
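The construction of the search interval in claims 25 and 26 can be sketched as follows. The limits are assumed to have already been derived from the peak-position comparison, and shifts are assumed to be integers counted in upsampled samples; the names are hypothetical.

```python
def shift_candidates(acc, half_width, limit_low=None, limit_high=None):
    """List the shift values to test around the accumulated shift `acc`."""
    lo, hi = acc - half_width, acc + half_width
    count = hi - lo + 1                 # size of the symmetrical interval
    if limit_low is not None:
        lo = max(lo, limit_low)
    if limit_high is not None:
        hi = min(hi, limit_high)
    # Limitation on one side only: extend the free side so that the number
    # of tested values stays equal to that of the symmetrical interval.
    if limit_low is not None and limit_high is None:
        hi = lo + count - 1
    elif limit_high is not None and limit_low is None:
        lo = hi - count + 1
    return list(range(lo, hi + 1))
```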
27. Method according to any of the claims from 1 to 26, including a decoding step where,
starting from the information [j(φ), j(d), j(b), j(s), j(gnor), j(gmax), σ] about
the linear prediction coefficient representation, the long-term analysis parameters
and the excitation signal, said representation is reconstructed, reconstructed linear
prediction coefficients are obtained therefrom, the long-term analysis parameters
are reconstructed, an excitation signal is chosen in a set of excitation signals corresponding
to the one used in the coding step, and said signal is subjected to a short-term and
a long-term synthesis filtering, identical to the ones carried out in the coding step,
by using reconstructed linear prediction coefficients aᵢ and long-term analysis delay d
and coefficient b, to generate a reconstructed block
of speech signal samples [y(n)] for each excitation signal [s(n)], characterized in
that every block of the reconstructed speech signal [y(n)], during the initial part
of a validity period of linear prediction coefficients, is generated by carrying out
the short-term synthesis filtering with reconstructed linear prediction coefficients
aᵢ obtained as the result of an interpolation between reconstructed values related to an
immediately previous validity period and reconstructed values related to the current
period, and in that the values of long term analysis delay d and coefficient b, related
to two consecutive validity periods, are compared and, if the delay variation is less
than a predefined amount and the coefficient is positive in both periods, a flag corresponding
to said second flag is generated, to enable carrying out, during long-term synthesis
filtering, an interpolation between the long-term analysis parameter values related
to said two validity periods.
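The decoder-side decision described at the end of claim 27 amounts to a simple test, sketched below. The sketch assumes that the delay variation is measured as an absolute difference; the function name and `max_delta` are hypothetical.

```python
def interpolation_flag(d_prev, b_prev, d_cur, b_cur, max_delta):
    """True if long-term parameters of two consecutive periods may be interpolated."""
    small_variation = abs(d_cur - d_prev) < max_delta
    both_positive = b_prev > 0 and b_cur > 0
    return small_variation and both_positive
```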
28. Apparatus for coding/decoding speech signals using analysis-by-synthesis techniques,
including a coder composed of:
- means (MT) for sampling a speech signal at a first rate and for dividing the sample
sequence into blocks comprising a first number of samples;
- short-term analysis means (STA, STR1) for computing a group of linear prediction
coefficients aᵢ for one or more blocks of samples, for transforming said coefficients
into a representation
thereof in the frequency domain, for obtaining from said representation indexes j(φ)
identifying the coefficients themselves, to be inserted into the coded signal, and
for reconstructing the coefficients starting from said indexes, every group of linear
prediction coefficients being valid for a period of time equal to the duration of
one or more blocks of samples;
- a linear prediction filter (LPC) that receives blocks of signal samples from the
sampling means (MT) and linear prediction coefficients aᵢ from the short-term analysis
means (STA, STR1) and generates a short-term prediction
residual signal rs(n);
- long-term analysis means (LTA, LTR1) for obtaining, from said residual signal, parameters
for a long-term synthesis filtering, which parameters comprise a delay (d) and a coefficient
(b), and for transforming said parameters into indexes [j(b), j(d)] to be inserted
into the coded signal, the long-term analysis parameters being valid for a period
of time equal to the duration of one or more blocks of samples;
- a first filtering system (LTS1, STS1, SW) that: includes the series of a long-term
synthesis filter (LTS1), that receives from the long-term analysis means (LTA, LTR1)
said parameters, and of a short-term synthesis filter (STS1) and a spectral weighting
filter (SW), that receive from said short-term analysis means (STA, STR1) said linear
prediction coefficients aᵢ; receives signals belonging to a set of excitation signals,
each including a shape
contribution composed of a number of pulses, of predefined amplitudes and positions,
said pulse number being much less than said first number; and generates a reconstructed
signal yw(n) for each one of the excitation signals;
- means (TS) for time shifting, by discrete steps, a set of samples yw(n) of said residual signal to align it in time with a reconstructed residual signal
ss(n) generated by the long-term synthesis filter (LTS1) of said first filtering system,
the set of samples of residual signal having a number of samples equal to said first
number of samples, every shift step being chosen within an interval of allowed values;
- a second filtering system (STS', SW'), that includes the series of a short-term
synthesis filter and a spectral weighting filter identical to those (STS1, SW) of
the first filtering system, is supplied with a modified residual signal generated
by the time shift means for each of the values of said interval, and generates a reconstructed
and weighted modified residual signal, said first and second filtering systems (LTS1,
STS1, SW, STS', SW') separately determining a contribution representative of the memory
of previous filtering and a contribution representative of a filtering with null initial
conditions;
- means (SM, EM) for generating a weighted error signal [e(n)] by comparing signals
generated by the first and the second filtering systems, for identifying an optimal
excitation signal and an optimal shift, by minimizing the energy of said weighted
error signal, and for inserting in the coded signal information that identifies the
optimal excitation signal;
and further comprising, at the decoding side:
- means (LTR2, STR2) for reconstructing the linear prediction coefficients and long-term
analysis parameters starting from said indexes;
- a third filtering system (LTS2, STS2), including the series of a long-term synthesis
filter and a short-term synthesis filter, identical to those (LTS1, STS1) of the first
filtering system, for filtering an excitation signal selected, through information
related to optimal excitation, in a set corresponding to the set used on the coding
side, and for generating a block of reconstructed speech signal samples,
characterized in that:
- the innovation pulses are the only non-null samples of words composed of said first
number Ls of samples,
- the innovation words for a first subset of excitation signals include a pair of
pulses, a limited group of words of the first set being key-words in which the two
pulses are placed in predetermined key positions and the other words in the subset
being obtained from each of the key-words by each time simultaneously shifting the
pulses by one position towards a word end, till one of the pulses reaches said end
or the key position of the other pulse in the starting word, the shifting direction
being the same for all words; and
- the innovation words for a second subset of excitation signals include only one
pulse whose position is different for each signal;
and in that, in said error signal generating means (SM, EM), the means to minimize
error energy are composed of a processing unit arranged to:
- determine said pulse response [Q(n)] and an energy (Eq) thereof for each one of
the possible pulse positions in excitation signals;
- determine a first partial error signal [e₁(n)], represented by the difference between
the reconstructed and weighted modified signal [xw(n)] and a contribution [yw1(n)] of the excitation signal filtering memory, and an energy of the error signal
itself;
- determine a first correlation [R(e₁q)] between said first partial error signal [e₁(n)]
and the pulse response for each of the pulses of an excitation signal;
- determine, for each excitation signal, starting from said pulse responses, a signal
[u(n)] representative of a contribution of the filtering with null initial conditions
of the excitation signal;
- determine the energy [E(u)] of said signal [u(n)] representative of the contribution
of a filtering with null initial conditions of the excitation signal and a second
correlation R(e₁u) between said signal [u(n)] representative of the contribution of
the filtering with null initial conditions of the excitation signal and the first
partial error signal [e₁(n)];
- determine, for each excitation signal, an optimal value of the amplitude contribution
as ratio between said second correlation and the energy of the signal resulting from
filtering with null initial conditions;
- compute, as a function of said second correlation [R(e₁u)], of said energy [E(u)] of the
signal representative of the contribution of the filtering with null initial conditions
of the excitation and of said energy [E(e₁)] of the first partial error signal, the
error signal energy value for each excitation signal.
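The codebook structure and the search carried out by the processing unit of claim 28 can be illustrated by the sketch below, which rests on several assumptions: the pulses are shifted towards the end of the word and the position at which a pulse "reaches" its limit is included; a single time-invariant pulse response `q` is used for all positions; and R(e₁u) and E(u) are computed directly from u(n) rather than from the stored per-position energies and correlations. The error energy uses the standard least-squares identity E(e₁) − R(e₁u)²/E(u), which is what the optimal amplitude R(e₁u)/E(u) yields. All names are hypothetical.

```python
import numpy as np

def pulse_pair_words(key_positions, word_len):
    """Generate pulse-pair positions from key-words (p1 < p2) by joint shifting."""
    words = []
    for p1, p2 in key_positions:
        k = 0
        while True:
            words.append((p1 + k, p2 + k))
            # Stop when a pulse reaches the word end or the key position
            # of the other pulse in the starting word.
            if p2 + k >= word_len - 1 or p1 + k >= p2:
                break
            k += 1
    return words

def search_excitation(words, q, e1):
    """Return (index, optimal amplitude, error energy) of the best word.

    words: list of pulse lists [(position, amplitude), ...]
    q:     pulse response of the weighted synthesis filter
    e1:    first partial error signal
    """
    q = np.asarray(q, dtype=float)
    e1 = np.asarray(e1, dtype=float)
    length = len(e1)
    energy_e1 = float(np.dot(e1, e1))
    best = None
    for index, pulses in enumerate(words):
        # Zero-state response u(n): sum of shifted, truncated pulse responses.
        u = np.zeros(length)
        for pos, amp in pulses:
            u[pos:] += amp * q[:length - pos]
        energy_u = float(np.dot(u, u))
        if energy_u == 0.0:
            continue
        corr = float(np.dot(e1, u))            # R(e1 u)
        gain = corr / energy_u                 # optimal amplitude
        error = energy_e1 - corr * corr / energy_u
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best
```

In the example used for checking, the target is twice the response to a single pulse at position 3, so the search selects that word with amplitude 2 and zero residual error.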
29. Apparatus according to claim 28, characterized in that a low-pass filter (FPB) is
provided between said linear prediction filter (LPC) and said long-term analysis means
(LTA, LTR1).
30. Apparatus according to claim 28 or 29, characterized in that the short-term analysis
means (STA, STR1) in the coder and the means (STR2) for reconstructing linear prediction
coefficients in the decoder include means for carrying out, on said representation
in the frequency domain, a linear interpolation between values related to two consecutive
validity periods and supply the short-term synthesis filters (STS1, STS', STS2) of
said filtering systems with the interpolated values in an initial part of a validity
period of a set of coefficients.
31. Apparatus according to any one of claims from 28 to 30, characterized in that the
long-term analysis means (LTA, LTR1) in the coder and the means (LTR2) for reconstructing
the long-term analysis parameters in the decoder include comparing means for comparing
parameters related to two consecutive validity periods and generating a flag (F) to
enable carrying out an interpolation between the parameters when they satisfy predetermined
conditions, and the long-term synthesis filters (LTS1, LTS2) of the first and third
filtering systems are associated to means that, when said flag is present, carry out
a second-order polynomial interpolation of said parameters, extended to a whole validity
period thereof, and supply the respective long-term synthesis filter (LTS1, LTS2)
with the interpolated parameters.
32. Apparatus according to any one of claims from 28 to 31, characterized in that the
time shift means (TS) include a circuit (US) for upsampling the residual signal, and
storing means (SH) for storing, for each block of samples to be coded, a first group
of upsampled residual signal samples corresponding to said first number Ls of samples,
and two further groups of upsampled residual signal samples, respectively preceding
and following said first group and including a number of samples linked to the maximum
allowed shift, and for supplying the second filtering system (STS', SW'), upon command
by the energy minimizing means (EM), with a fourth group of upsampled residual signal
samples, including as many samples as those of the first group and shifted with respect
to the first group by said optimal shift.