Speech coder employing analysis-by-synthesis techniques with a pulse excitation

(11)	EP 0 619 574 A1

(12)	EUROPEAN PATENT APPLICATION

(43)	Date of publication:
	12.10.1994 Bulletin 1994/41

(21)	Application number: 94105438.9

(22)	Date of filing: 07.04.1994

(51)	International Patent Classification (IPC)⁵: G10L 9/14, G10L 9/00, G10L 9/18

(84)	Designated Contracting States:
	AT BE CH DE ES FR GB GR IT LI NL SE

(30)	Priority: 09.04.1993 IT TO930244

(71)	Applicants:
	SIP SOCIETA ITALIANA PER l'ESERCIZIO DELLE TELECOMUNICAZIONI P.A., I-10122 Torino (IT)
	AT&T Corp., New York, NY 10013-2412 (US)

(72)	Inventors:
	Cellario, Luca, Torino (IT)
	Sereno, Danielle, Torino (IT)
	Kleijn, Willem Bastiaan, New Jersey 07920 (US)
	Kroon, Peter, New Jersey 08812 (US)

(74)	Representative: Riederer Freiherr von Paar zu Schönau, Anton
	Lederer, Keller & Riederer, Postfach 26 64, 84010 Landshut (DE)

(54)	Speech coder employing analysis-by-synthesis techniques with a pulse excitation

(57) In an analysis-by-synthesis coder, the original speech signal undergoes small time shifts to match in time the signal to be coded with the replica produced by the long term synthesis filter. The shift is determined at each subframe by an exhaustive search within a range of possible values so as to minimize the error signal energy. Once the optimal shift has been determined, the optimal excitation is searched for. The excitation is chosen in a codebook containing words with very few pulses arranged in a deterministic structure, which words are all obtained from a limited number of key words. The deterministic codebook structure allows a fast search for the optimal excitation, without need of storing the codebook and actually performing the synthesis filterings of the candidate excitations.

Description

[0001] The present invention relates to speech coders employing analysis-by-synthesis techniques, and more particularly to a coder for low-bit-rate applications, preferably at the lowest limits of the range of rates for which the above-mentioned coders can be used with good performance, e.g. rates within the 4 - 8 kbit/s range.

[0002] An example of this type of application is the speech coder to be used for the so-called half-rate channel of the European mobile radio system.

[0003] In coders using analysis-by-synthesis techniques, for each block of speech signal samples to be coded, the excitation signal for the synthesis filter simulating the speech production apparatus is chosen within a set of excitation signals so as to minimize a perceptually meaningful measure of distortion. This is commonly obtained through the comparison of the synthesized samples and of the corresponding samples of the original signal and the simultaneous weighting, in a suitable filter, with a function that takes into account how human perception evaluates the resulting distortion.

[0004] In its most general form, the synthesis filter includes a cascade of two elements that impose short-term and long-term spectral features, respectively, on the excitation signal. The former ones are linked to the correlation among subsequent samples, which generates a non-flat spectral envelope, and the latter ones are linked to the correlation between adjacent pitch periods, on which the fine signal spectral structure depends. With such a scheme, the coded signal includes information relating to excitation and to short-term synthesis parameters (short-term linear prediction coefficients or other quantities related to them) and long-term ones (long-term delay and linear prediction coefficients).

[0005] The insertion of long-term features into the coded signal greatly enhances the natural sound of the signal, especially if the delay is updated at each subframe during the analysis-by-synthesis cycle; however, the related information would require most of the bits available for coding. Especially in low-bit-rate applications, it is therefore particularly interesting to search for solutions that reduce the amount of information to be transmitted to the decoder while preserving signal quality.

[0006] In the paper "Generalized analysis-by-synthesis coding and its application to pitch prediction" presented by W.B. Kleijn, R.P. Ramachandran and P. Kroon at the ICASSP 92 Conference, San Francisco (California, USA), March 23-26 1992, paper I-337, it is suggested for this purpose to carry out a long-term analysis delay interpolation, the delay being updated at each frame. A direct interpolation, without adequate arrangements, would provide delay values that are not the optimal values and would provoke time misalignments among long-term spectral features in the original signal and in the synthesized signal, that generate a significant distortion.

[0007] To avoid these drawbacks, the paper suggests modifying the original signal so that the long-term predictor parameters become known functions of time and allow a direct interpolation without degrading performance. The suggested modifications consist of limited time oscillations and small amplitude scalings of the original signal. Time oscillations can be carried out in a discrete manner. The need to insert these time oscillations, and therefore to set an optimal amount thereof, obviously increases the coder complexity.

[0008] To solve this problem, according to the present invention, therefore, a coding system is provided in which, before long-term analysis, discrete time shifts are introduced on the residual signals and in which the search for optimal excitation signal and optimal shift is carried out so as to reduce complexity of computations.
The invention characteristics are disclosed in the appended claims.

[0009] A preferred embodiment of the invention will now be described, with reference to the enclosed drawings, in which:

Fig. 1 is a block diagram of the coder;
Fig. 2 is a functional diagram of some blocks of the coder;
Fig. 3 is a block diagram of the decoder.

[0010] Before describing in detail the coder/decoder structure, the principles on which it is based will be summarized. The coder receives samples x(n) of the speech signal to be coded, grouped into blocks (commonly called 'frames') including a fixed number Lf of contiguous samples. Every frame of Lf samples is then divided into subframes of Ls contiguous samples. The coder must determine a set of parameters to be transmitted to the decoder so that the decoder is able to synthesize a signal that approximates the original signal. To achieve this, an analysis-by-synthesis procedure is used, through which the coder analyzes the effects of the possible values of each parameter and chooses the value that enables obtaining the best approximation of the original signal. For this purpose, the coder will contain a replica of the decoder to produce, for each of said values, the corresponding output signal. To generate these output signals, both long-term and short-term correlations of the speech signal are exploited, imposed on an excitation signal through respective synthesis filters. At each frame, the coder carries out a linear prediction analysis (short-term or LPC analysis) and computes the short-term residual signal, that is used to compute parameters (delay and coefficient) of the long-term synthesis filter. (The coefficient is unique in the preferred embodiment, since a first-order filter is used). To improve the resolution of long-term-correlation information, both the delay and the coefficient are interpolated when the delays of the current frame and the previous frame are close in value. To reduce the effects of time mismatches between the original signal and the reconstructed one, at each subframe small time shifts can be introduced in the original speech signal: the shift amount is determined through an exhaustive search in a range of possible values so as to minimize the energy of the error (difference between original signal and reconstructed signal). 
After having determined the optimal shift, the search for the optimal excitation signal is carried out.

[0011] In the following, to make the description clearer, the possible excitation signals will be considered as words chosen in a certain codebook, that is, reference is made to a type of coder known as CELP (Codebook Excited Linear Prediction), even if, as it will be seen, every word is made up of an extremely small number of pulses (preferably 1 or 2) with deterministically predefined amplitudes and positions, and the codebook is not stored.

[0012] The coded signal will include information related to short-term and long-term synthesis filter parameters and to the optimal excitation, transmitted as usual in the form of suitably coded indexes.

[0013] In the decoder, starting from these indexes, an excitation signal corresponding to the one used by the coder will be retrieved and filtered in the chain of a long-term synthesis filter and a short-term synthesis filter to provide a reconstructed signal that can be still subjected to a further filtering (post-filtering), based for example on short-term synthesis parameters, to improve the subjective signal quality. The reconstructed signal is then converted again into analogue form and supplied to utilization devices.

[0014] By way of example, in the following description reference will be made to frames with length Lf = 160 samples (which, with an 8-kHz sampling frequency, correspond to a speech signal segment of length T = 20 ms), divided into 8 subframes of length Ls = 20 samples. For reasons related to the introduction of time shifts, it is necessary to have available, in addition to the Lf samples of a frame, a group of H+K samples of the following frame (e.g. H = 24, K = 8).
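The figures of this example can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the variable names are chosen here and do not appear in the application, and the "look-ahead" is simply the H+K samples of the following frame expressed as a duration.

```python
# Frame layout of the example in paragraph [0014]; all names are illustrative.
FS_HZ = 8000      # sampling frequency
LF = 160          # samples per frame
LS = 20           # samples per subframe
H, K = 24, 8      # samples of the following frame that must also be available

frame_ms = 1000 * LF / FS_HZ            # frame duration T in milliseconds
subframes = LF // LS                    # number of subframes per frame
lookahead_ms = 1000 * (H + K) / FS_HZ   # duration of the H+K extra samples
samples_needed = LF + H + K             # samples that must be available per frame

print(frame_ms, subframes, lookahead_ms, samples_needed)
```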

[0015] With reference to Fig. 1, the input signal samples x(n) present on a connection 1 are temporarily stored in a buffer MT arranged to store Lf+H+K samples, and every T ms a block of Lf samples will be written and read. Samples read from MT are supplied to a high-pass filter FPA whose task is to remove d.c. drifts and low-frequency noise, and the filtered signal x_f(n) is supplied to short-term analysis circuits STA and to a linear prediction filter LPC.

[0016] Circuits STA are to determine, for each frame, a set of P linear prediction coefficients a_i (e.g. 10), to convert these coefficients into a group of parameters in the frequency domain, commonly known as LSP (Line Spectrum Pairs) and to carry out a quantization, for example a scalar one, of the differences between adjacent parameters. Indexes j(f), that are part of the coded signal, are transmitted to the decoder through a connection 2a after binary coding in circuits that are not shown. Conversion into line spectrum pairs is desirable since, as well known, spectrum lines have properties of quantization, interpolation and check of synthesis filter stability that are better than those of the coefficients. Before computing line spectrum pairs, in the block STA a smoothing of spectrum information related to formants is also carried out to match it to the quantization circuit resolution. This is accomplished by multiplying computed coefficients a_i by a respective factor g₁ⁱ, whose value is typically less than 1 but quite near 1. This operation allows reducing the risk, in case of particularly narrow formants, of reproducing after quantization formants that are equally narrow, but shifted with respect to the original ones, and therefore reduces a possible cause for the degradation of coded signal quality.
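The smoothing of formant information described above can be sketched as follows. This is a minimal illustration, assuming that the factor g₁ⁱ denotes g₁ raised to the power i (the usual bandwidth-expansion form); the function name and the value 0.99 are assumptions made here, not values taken from the application.

```python
def smooth_lpc(a, g1=0.99):
    """Multiply coefficient a_i (i = 1..P) by g1**i.

    g1 = 0.99 is an assumed value: the text only says "less than 1 but quite near 1".
    """
    return [a_i * g1 ** i for i, a_i in enumerate(a, start=1)]


# Higher-order coefficients are attenuated more strongly than lower-order ones.
print(smooth_lpc([1.0, 1.0, 1.0], g1=0.5))
```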

[0017] The circuit STA computes coefficients a_i according to the classical autocorrelation method, as described in "Digital Processing of Speech Signals" by L.R. Rabiner and R.W. Schafer (Prentice-Hall, Englewood Cliffs, N.J., USA, 1978), p. 401. For the computation, STA operates on a set of Lf+P input samples (in particular, the samples that occupy the last Lf+P positions in MT), obtained through a trapezoidal window that weights with a maximum weight (particularly 1) all samples except for the first and the last P ones, for which the weights are determined by simple linear interpolation between minimum and maximum weight: in this way, the smoothing required by the autocorrelation method to provide good results is limited to the overlapping area between contiguous windows. The forward positioning of the window also takes into account the fact that, when coding the initial subframes of a frame (e.g. the first 3), in place of linear prediction coefficients computed for the frame itself, coefficients are used which are obtained by the conversion of line spectrum pair values determined through interpolation between values related to the previous frame and values related to the current frame. This ensures a gradual transition between previous frame parameters and current frame parameters. As explained, the window spans the current frame and the subsequent frame in the sense that it comprises samples of both frames without, however, having to comprise two full frames.
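A trapezoidal window of the kind described can be sketched as below. The minimum weight and the exact end-point convention of the linear ramps are not stated in the text, so `w_min` and the `(i + 1) / (ramp + 1)` spacing are assumptions made only for this illustration.

```python
def trapezoidal_window(length, ramp, w_min=0.0, w_max=1.0):
    """Weights equal to w_max everywhere except linear ramps on the first and
    last `ramp` samples (assumes length >= 2 * ramp)."""
    w = [w_max] * length
    for i in range(ramp):
        # Linear interpolation between minimum and maximum weight.
        v = w_min + (w_max - w_min) * (i + 1) / (ramp + 1)
        w[i] = v                  # rising edge
        w[length - 1 - i] = v     # falling edge, mirror image
    return w


# Example of the text: Lf + P = 160 + 10 samples, ramps of P = 10 samples.
w = trapezoidal_window(170, 10)
```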

[0018] The transformation of linear prediction coefficients into line spectrum pairs is carried out, for example, in the way described by P. Kabal and R.P. Ramachandran in the article "The computation of line spectral frequencies using Chebyshev polynomials", IEEE Transactions on Acoustics, Speech, and Signal Processing, December 1986.

[0019] The operations of STA are typical of any linear prediction coder, and therefore a more detailed description is not necessary.

[0020] The indexes j(φ) are also supplied to a linear prediction coefficient reconstructing circuit STR1 that supplies filter LPC, short-term synthesis filters STS1, STS' and spectral weighting filters SW, SW' with quantized values of the coefficients, obtained by applying inverse procedures with respect to the ones used to transform the coefficients into line spectrum pairs. STR1 also computes interpolated values to be used in the first three subframes. To simplify, in the following, the quantized values are also designated a_i.

[0021] The filter LPC receives the filtered speech signal samples x_f(n) and filters them according to the conventional function

$$1 - A(z) = 1 - \sum_{i=1}^{P} a_i z^{-i}$$

generating the short-term prediction residual r_s(n), which is supplied both to a low-pass filter FPB, which produces a filtered residual signal r_f(n), and to time shift circuits TS, which produce a modified residual signal r_m(n). As is well known, low-pass filtering facilitates the operations of the following long-term analysis circuits LTA.
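The computation of the short-term residual can be sketched as follows. The sketch assumes the sign convention A(z) = Σ a_i z^(-i) implied by the synthesis filter 1/[1 - A(z)] of paragraph [0039], and treats samples before the start of the input as zero; the function name is hypothetical.

```python
def lpc_residual(x, a):
    """r_s(n) = x(n) - sum_{i=1..P} a_i * x(n - i), with x(n) = 0 for n < 0."""
    res = []
    for n in range(len(x)):
        # Short-term prediction from the P previous samples.
        pred = sum(a[i - 1] * x[n - i] for i in range(1, len(a) + 1) if n - i >= 0)
        res.append(x[n] - pred)
    return res


# A first-order signal x(n) = 0.5 * x(n-1) is predicted exactly after the first sample.
print(lpc_residual([1.0, 0.5, 0.25, 0.125], [0.5]))
```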

[0022] The circuits LTA must determine, at each frame, and supply to a following long-term synthesis filter LTS1, the delay d (pitch period) with which a sample of an excitation signal is used to generate a reconstructed signal and the gain or coefficient b with which said sample is weighted.

[0023] The block LTA computes the delay d by maximizing the autocorrelation function

where k can vary between a minimum value and a maximum value allowed for the delay d (e.g., 20 and 120), and x is a preset number that sets the length of the window taken into account for the calculation so that a satisfactory value for d is obtained. Considering that the window must include the most recent samples, as already said, its length is a compromise between two opposing needs: the greater the length, the more accurate the evaluation; on the other hand, the shorter the window, the closer its center is to the end of the frame to be coded (Lf samples), so that the value obtained is representative of that end, which is required for interpolation. For example, x can be K. In the preferred embodiment, the delay is never less than the length of a subframe, which considerably simplifies subsequent operations. The value computed with (1) can also be subjected to corrections, which will be examined afterwards, aimed at keeping the evolution of d as smooth as possible and at compensating for synchronism losses due to the time shift.
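The delay search can be sketched as an exhaustive maximization of the autocorrelation over the allowed lags. The position and length of the analysis window are passed in as parameters because they depend on the preset number x, whose exact role in function (1) is not reproduced here; the function and parameter names are hypothetical.

```python
def find_delay(r_f, start, length, d_min=20, d_max=120):
    """Return the lag k in [d_min, d_max] maximizing sum_n r_f(n) * r_f(n - k)
    over the window n = start .. start + length - 1 (requires start >= d_max)."""
    best_k, best_r = d_min, float("-inf")
    for k in range(d_min, d_max + 1):
        r = sum(r_f[n] * r_f[n - k] for n in range(start, start + length))
        if r > best_r:            # strict comparison keeps the smallest lag on ties
            best_k, best_r = k, r
    return best_k


# Pulse train with a pitch period of 40 samples.
pulses = [1.0 if n % 40 == 0 else 0.0 for n in range(400)]
print(find_delay(pulses, start=200, length=160))
```

Keeping the smallest lag on ties is a design choice of this sketch: multiples of the true pitch period give the same autocorrelation on a perfectly periodic signal.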

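[0024] The value of coefficient b is determined so as to minimize the energy of error signal r_l(n), given by the equation

$$r_l(n) = r_f(n) - b \, r_f(n-d)$$

For the value d of the delay to be used for the current frame, b is given by the equation

$$b = \frac{R[r_f(d)]}{E(r_f)} \qquad (2)$$

where E(r_f) indicates the energy of the delayed signal, $E(r_f) = \sum_n r_f^2(n-d)$, evaluated over the same window as (1).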

A minimum and a maximum, 0 and 1 respectively, are also set for the value of b. Values that are less than 0 are excluded because they would correspond to a signal overturning, that would also compel to transmit a sign bit, while values that are greater than 1 make the filter unstable, as well known. The value of b computed using (2) can also be subjected to corrections aimed at guaranteeing the best quality of the coded signal. Furthermore, in certain frames, instead of the values d and b computed with (1) and (2), it is possible to use values obtained by linear interpolation between values computed for the previous frame and values computed for the current frame.
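The computation of b and the limitation to the range 0 to 1 can be sketched as follows. The sketch uses the least-squares solution for a first-order predictor (cross-correlation at lag d divided by the energy of the delayed signal) and leaves the window position as a parameter; both the window and the function name are assumptions of this illustration.

```python
def long_term_gain(r_f, d, start, length):
    """Least-squares gain of a first-order long-term predictor, limited to [0, 1]."""
    window = range(start, start + length)
    num = sum(r_f[n] * r_f[n - d] for n in window)   # correlation at lag d
    den = sum(r_f[n - d] ** 2 for n in window)       # energy of the delayed signal
    b = num / den if den > 0.0 else 0.0
    # Negative values would need a sign bit; values above 1 make the filter unstable.
    return min(max(b, 0.0), 1.0)


decaying = [2.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.5, 0.0, 0.0]
print(long_term_gain(decaying, d=3, start=3, length=6))
```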

[0025] Together with the computation of d and b, the prediction gain G is also computed: this is a quantity representing the ratio between the energies of input and output signals from the long-term predictor and gives a measure of long-term prediction efficiency. Gain G is defined by the expression

where
1 - bR[r_f(d)]/E'(r_f)

Gain G allows establishing whether the speech segment being coded is voiced, that is indicated by values of G and b that are both greater than respective thresholds G_thr, b_thr. In case of a voiced sound, LTA generates a flag V that is used to decide to carry out the interpolation and to introduce the time shift.

[0026] A first correction for delay d is based on the search for the local maximum of function (1) also in a given neighborhood (e.g., ± 15%) of the value obtained at the previous frame: if this local maximum differs from the main maximum by less than a certain limit, the new value is used, since it provides a smoother contour that can therefore be interpolated. This secondary search is carried out only if the signal in the previous frame was strongly voiced and had been subjected to interpolation. Moreover, the correction, if any, is carried out before computing b and G, so as to use the already corrected value of d for these computations.

[0027] A second correction is linked to the presence of the time shift mechanism, that inserts a variable delay whose effects can be compared to those of a non-synchronous operation of the coder. To try to recover synchronous features, the value of d computed by LTA and possibly corrected as said before is changed by adding thereto a corrective term d' linked to the amount of the shift itself and given by the expression

where ĥ is the shift accumulated up to that frame, expressed as a number of samples of the residual signal upsampled by a factor Γ, while d and Lf have the meaning stated before. Upsampling will be discussed in greater detail with reference to circuits TS. It means that the samples obtained by sampling the original speech signal at a first sampling rate are in turn resampled at a higher rate. Thus, if samples obtained by 8 kHz sampling are themselves sampled at 64 kHz, each 8 kHz sample will give rise to eight samples at 64 kHz. The correction can be carried out if interpolation is required in the current frame and if the speech segment is not voiced. The first condition is necessary since, if the interpolation is absent, no shift is carried out; moreover, the signal must not be voiced because in this situation even a minimal modification of d with respect to the exact value can usually be perceived. Before adding the corrective term to d, its absolute value is limited to a maximum value ¦d'¦_max, for example 1. Furthermore, the correction is carried out only if it does not modify the decision about interpolation (that will be described afterwards) and does not take the value of d outside the provided range of values.

[0028] As regards b, a first correction consists of clipping b to a first upper limit b₁, since, if b is too high, an excessive energy increase would occur, which gives rise to noises. Limit b₁ is linked to the ratio between energies in a pitch period of the current frame and of the previous one and it is given by the expression

where E''(r_f) denotes the quantity

that indeed is the energy in a pitch period d, and indexes 0,-1 denote current and previous frames, respectively. The correction is carried out if the energy in the previous frame exceeds a certain threshold.

[0029] A further limitation for b is carried out in case of low values of G (less than G_thr), that show speech segments with low periodicity, while b is relatively high (greater than a second limit b₂): in this case, the value b₂ is employed, since employing the actual value could produce artifacts in the coded signal.

[0030] As regards interpolation, this is carried out if the relative variation of d between two consecutive frames does not exceed, as absolute value, a predetermined amount (e.g., 15%) and if the values of b in these frames are both positive. The actual computation of the values of d and b to be used in case of interpolation is carried out in the long-term synthesis filter LTS1, to which LTA sends a flag F when the above mentioned conditions are verified. The same flag is also supplied to an error energy minimizing circuit EM determining the optimal time shift and excitation. Information about interpolation is also required by the synthesis filter in the decoder; however, it is not necessary to transmit it, since it can be immediately recreated in that filter, by the comparison between the values of d and b related to two frames, exactly like in the coder.
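The interpolation decision, which the decoder can repeat without any transmitted flag, can be sketched as follows. The sketch assumes that the relative variation of d is measured with respect to the previous frame's delay, which the text does not state explicitly; the function name is hypothetical.

```python
def interpolation_flag(d_prev, d_cur, b_prev, b_cur, max_rel_change=0.15):
    """Flag F: true when d changed by at most 15% and both gains are positive."""
    rel_change = abs(d_cur - d_prev) / d_prev   # assumed reference: previous frame
    return rel_change <= max_rel_change and b_prev > 0.0 and b_cur > 0.0


print(interpolation_flag(40, 44, 0.5, 0.6))   # 10% change, both gains positive
print(interpolation_flag(40, 60, 0.5, 0.6))   # 50% change
```

Because the function uses only quantities available in both coder and decoder (the reconstructed d and b of two consecutive frames), running it on both sides yields the same decision.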

[0031] The values of d and b determined at each frame are converted as usual into the respective indexes j(d), j(b), that are the information related to long-term analysis to be inserted into the coded signal, and that are transmitted to the decoder, after suitable coding, through connections 2b, 2c. Index j(b) is determined through a quantization operation, during which, in addition to limiting the maximum value to 1, values of b that are less than half of the first quantized value are forced to 0. No quantization of d is however necessary, since d is already a discrete quantity: it is however preferable to transmit d under the form of an index for sake of uniformity with the other information. The conversion of the values of d into indexes practically consists of their shift, such as to make the possible range of values begin from 1 instead of from a value d_min. In the described example (101 values of d and j(d)), 7 bits will be necessary to code index j(d), and these bits will also allow coding of values of j(d) outside the provided range. One of these further values (e.g., value 127) is used to show forcing of b to 0 and it is supplied to the decoder in place of index j(d) corresponding to the actual value of d, since, if b = 0, the long-term synthesis filter does not provide contributions to the reconstructed signal and delay information is useless. In addition to information about forcing of b to 0, however, index j(b) corresponding to the minimum value of b is transmitted.

[0032] To simplify, circuits generating indexes j(b), j(d) are included into block LTA.

[0033] It must be noted that the correction of d to take into account possible shifts is carried out after the corrections of b, since only depending on the corrected values of b, circuit LTA can take decisions related to the sound nature and the need to carry out interpolation and therefore shift.

[0034] The operations performed by LTA are described in detail in the appendix, that includes program listing in C language. Given the listing, a technician has no problem in designing devices that perform the described functions.

[0035] Indexes j(d), j(b) are reconverted into quantized or reconstructed values of the respective parameters by reconstructing circuits LTR1, composed of simple read-only memories addressed by the indexes. During this reconstruction, LTR1 provides the actual values of d, b if j(d) shows a value allowed for the delay (that is, if j(d) is in the range 1 to 101). If j(d) shows any one of the values outside the allowed range (therefore its value is from 102 to 127), LTR1 provides value 0 for b and value d_min for d. The fact that, when reconstructing the parameters, all indexes j(d) not corresponding to a value allowed for the delay, and not only the one really used for this purpose, are interpreted as indication of forcing of b to 0, allows reconstructing the value b=0 even in case of possible errors on the least significant bits of that index. Anyway, if by chance the reconstruction of b=0 should fail, circuits LTR1 generate the minimum value of b since they have at their disposal the corresponding index j(b). To simplify, in the following, reconstructed (or quantized) values will also be shown by b, d.
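The mapping between d and the 7-bit index j(d), including the use of spare values to signal b = 0, can be sketched as follows. Function names are hypothetical; treating index 0 like the other out-of-range values is an assumption of this sketch, since the text mentions only the values 102 to 127.

```python
D_MIN, D_MAX = 20, 120      # delay range of the example (101 values)
B_ZERO_INDEX = 127          # spare 7-bit value used to signal b forced to 0


def encode_delay(d, b_is_zero):
    """Index j(d): 1..101 for a valid delay, 127 when b has been forced to 0."""
    if b_is_zero:
        return B_ZERO_INDEX
    return d - D_MIN + 1


def decode_delay(j_d):
    """Return (d, b_forced_to_zero). Every index outside 1..101 is read as
    'b = 0', which tolerates errors on the low-order bits of 127."""
    if 1 <= j_d <= D_MAX - D_MIN + 1:
        return j_d + D_MIN - 1, False
    return D_MIN, True


print(encode_delay(120, False), decode_delay(126))
```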

[0036] The long-term synthesis filter LTS1 generates a reconstructed short-term residual signal s_s(n), by filtering according to the conventional function

$$\frac{1}{1 - b \, z^{-d}}$$

an excitation signal s₁(n). The latter is composed of shape information (innovation), represented by one of the words s(n) of an innovation codebook IC1, of a positive or null amplitude parameter g (innovation gain), chosen in a codebook of innovation gains IG1, and of sign information, represented by a parameter σ (innovation sign) whose value is ±1. Signal s₁(n) is therefore given by

$$s_1(n) = \sigma \, g \, s(n)$$

and is obtained through a multiplier M1. To simplify, we suppose that also parameter σ is read in codebook IG1. Even if, to facilitate understanding, codebooks IC1, IG1 are represented as circuit blocks (that could suggest the idea of memories that contain them), as said above, the particular structure of innovation codebook makes their storage superfluous. The structure of innovation and gain codebooks will be examined later.
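The generation of the excitation and its long-term filtering, for the case without interpolation, can be sketched as below. The sketch assumes a first-order recursion in which each output sample is the scaled innovation plus b times the output d samples earlier; the function name and the `history` argument (past reconstructed residual) are hypothetical.

```python
def long_term_synthesis(s, g, sigma, b, d, history):
    """Filter the excitation sigma * g * s(n) through a first-order long-term
    synthesis filter. `history` holds at least d past output samples."""
    buf = list(history)
    out = []
    for n in range(len(s)):
        s1 = sigma * g * s[n]              # multiplier M1: sign, gain, shape
        y = s1 + b * buf[len(buf) - d]     # add the sample of instant n - d, weighted by b
        buf.append(y)
        out.append(y)
    return out


word = [0, 1, 0, 0, 0, 0, 0, 0]            # single-pulse innovation word
print(long_term_synthesis(word, g=2.0, sigma=-1, b=0.5, d=4,
                          history=[1.0, 0.0, 0.0, 0.0]))
```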

[0037] In order to obtain a sample of the reconstructed residual s_s(n), LTS1 must weight with the factor b the sample related to instant n-d. In case no interpolation has to be performed, operation of LTS1 is quite conventional. In case of interpolation, the values of d and b are computed sample by sample according to the equations

with n = 0, ..., Lf-1,

and

. Symbols d0, b0 show the values related to the current frame, d(-1), b(-1) those related to the previous frame. The interpolation is therefore a linear one and extends over a whole frame. The values of d(n) and b(n) then vary sample by sample. As regards d(n), it will generally not be an integer number: this means that the value of signal s_s(n) at the continuous time instant n-d(n) does not coincide with that of an actually available sample and must be evaluated: according to the invention, evaluation is performed through a second order polynomial interpolation (that is through a parabola) centered about the discrete time instant that is nearest to n-d(n); the value thus evaluated is then multiplied by the interpolated value b(n).
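The two interpolation steps can be sketched as follows. The end-point convention of the linear interpolation (here the current-frame value is reached at the last sample of the frame) is an assumption, as are the function names; the parabola is the one passing through the three samples centered on the discrete instant nearest to the requested position.

```python
def interpolate_parameter(prev, cur, n, lf=160):
    """Linear interpolation over a whole frame, n = 0 .. lf - 1
    (assumed convention: the value `cur` is reached at n = lf - 1)."""
    return prev + (cur - prev) * (n + 1) / lf


def parabolic_sample(sig, t):
    """Evaluate sig at the fractional index t with a second-order polynomial
    through the samples at c - 1, c, c + 1, where c is the nearest integer."""
    c = int(round(t))
    f = t - c                                  # offset from the central sample
    y_m, y_0, y_p = sig[c - 1], sig[c], sig[c + 1]
    return y_0 + 0.5 * f * (y_p - y_m) + 0.5 * f * f * (y_p - 2.0 * y_0 + y_m)


squares = [float(n * n) for n in range(10)]
print(interpolate_parameter(40, 44, 79), parabolic_sample(squares, 3.25))
```

A parabola reproduces a quadratic signal exactly, which makes the second call a convenient sanity check.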

[0038] The interpolation procedure adopted has a far lower computational complexity than more sophisticated interpolation methods based on signal filtering. Its effect is essentially a low-pass one, which is useful for the good operation of the coder since it prevents the reconstructed signal from having too marked periodicity properties.

[0039] The reconstructed short-term residual s_s(n) is supplied to the short-term synthesis filter STS1, whose transfer function is 1/[1 - A(z)]. This filter generates the reconstructed speech signal y(n) that is supplied to the spectral weighting filter SW whose transfer function is, as usual,

$$W(z) = \frac{1 - A(z)}{1 - A_w(z)}$$

, where A_w(z) is the function

$$A_w(z) = \sum_{i=1}^{P} a_{w,i} \, z^{-i}$$

with

$$a_{w,i} = \gamma^{i} a_i$$

, where γ is an experimentally determined corrective factor that determines band widening around formants. The reconstructed and weighted signal y_w(n) is subtracted in an adder SM from the modified reconstructed and weighted signal x_w(n) obtained by filtering the output signal from TS in the cascade of two filters STS', SW', respectively identical to STS1 and SW. At the output of SM, a weighted error signal e(n) is obtained, which is supplied to the error energy minimizing circuit EM that performs all operations necessary to determine the optimal shift and excitation.

[0040] The purpose of circuits TS is to align in time the signal to be coded with the replica that the long-term synthesis filter is able to produce, and in particular to avoid shifts between pitch peaks in the signal predicted by LTS1 and in the original one. For this purpose, at each subframe TS shifts the time window of Ls samples that locates the subframe itself by a certain amount Δh. The shift to be applied is determined by unit EM with a fast search procedure within a range of values defined by a maximum allowable shift. The shift is applied to the residual signal and not to the original one because the resulting distortion is smoothed by the following filtering in STS', SW' and is therefore substantially imperceptible. The shift applied in a subframe is algebraically added to the one accumulated up to that time, providing a global shift ĥ, in order to avoid too sudden variations. The global shift also cannot exceed a certain maximum value (H samples of the original signal). The reason why H samples of the following frame have also been loaded in MT is therefore evident. The purpose of limiting the shift variation is to avoid excessive distortions; the limitation on the global shift is instead determined by the delay that can be tolerated in coding procedures and therefore by the availability of future samples. The time shift has a resolution finer than one sampling period of the original signal, and therefore it is necessary to carry out an upsampling of the residual signal.

[0041] Taking into account all this, circuit TS will include an upsampling circuit US (in practice an interpolating filter), that supplies at its output the upsampled residual r̂_s(n̂), and a shifting element SH that receives from EM information about shift entity ĥ and generates the modified upsampled residual r̂_m(n̂). In the example, upsampling ratio Γ is 8, and therefore the upsampled signal has a frequency of 64 kHz: this upsampling ratio provides a suitable resolution for all desired purposes. Moreover, for the correct operation of the interpolating filter, it is necessary to always have available a certain number of samples following the interested ones: this is the reason why the further K samples of the following frame are also loaded in MT.

[0042] It is not necessary to actually carry out the downsampling to obtain a modified residual signal with an 8-kHz sampling frequency, since this operation can be implicitly carried out, when necessary, by simply reading one sample of r̂_m(n̂) every Γ, with a suitable phase. Downsampling is the inverse operation to upsampling, recovering the samples at the lower rate.

[0043] Element SH will practically be a memory that loads, at each subframe, the ΓLs samples of the upsampled residual plus a certain number of following and previous samples linked to the maximum allowed shift in a frame (in practice, a number of samples equal to twice the maximum shift, as will be explained in the description of optimal shift search); SH is addressed for reading by the error energy minimizing unit EM, in such a way as to supply the following circuits with Ls samples adequately shifted with respect to the incoming subframe.
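The implicit downsampling performed when reading the memory SH can be sketched as follows: a shifted subframe at the original rate is obtained by reading one upsampled sample every Γ, starting from an address that includes the shift. The function name, the argument layout and the convention that the shift is expressed in upsampled samples relative to the unshifted subframe start are assumptions of this illustration.

```python
GAMMA = 8   # upsampling ratio of the example (8 kHz -> 64 kHz)


def read_shifted_subframe(r_up, start, shift, ls=20, gamma=GAMMA):
    """Read `ls` samples at the original rate from the upsampled residual.

    start: subframe start, in original-rate samples
    shift: time shift, in upsampled samples (resolution 1/gamma of a sample)
    """
    first = start * gamma + shift           # read address in the upsampled domain
    return [r_up[first + gamma * n] for n in range(ls)]


upsampled = list(range(1000))               # dummy data: each value equals its address
print(read_shifted_subframe(upsampled, start=2, shift=3, ls=4))
```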

[0044] Turning back to the innovation codebook, this includes a certain number of words, each having Ls samples, of which only a very limited number is different from 0. This choice derives from the fact that, the codebook being quite limited, it would be unrealistic to expect to find in it words with many pulses (that is, non-null samples) in which all pulses are actually suitable; the choice further reduces the amount of computation necessary when searching for the optimal excitation. In the preferred embodiment of the present invention, the codebook is composed of two parts. The first one includes Ls words having a single non-null sample, with amplitude equal to 1 and positive sign, and Ls-1 null samples. The non-null sample occupies a different position in each word, so that the words can be obtained one from the other by simply shifting the non-null sample by one position. For this first part of the codebook, signal s(n) can be represented as

$$ s(n) = \delta(n - n_1) $$

where δ is the well-known unit impulse function and n, n₁ can have values between 0 and Ls-1.

[0045] The second part includes words with two samples whose amplitude is 1, and Ls-2 null samples. These words are generated starting from a limited number of key-words (in particular 3) with the method described in European Patent Application EP-A-0396121 in the name of CSELT. In the example taken into account, the three key-words all have the first pulse in position 0 and the second pulse in a respective key position n₂(1), n₂(2), n₂(3), and the other words are obtained by shifting the pulse pair towards a word end until the second pulse reaches such end or the first pulse reaches the respective key position. Key positions are chosen so as to give rise to Ni2 (in particular 21) possible positions of the pulse pair; for each one of these positions, there are two words that differ from each other in the sign of the second pulse, as described in said European Application, which brings the total number of words in the innovation codebook to Ls+2Ni2 (62 in the example). For this second part of the codebook, an innovation word is represented by the equation

$$ s(n) = \delta(n - n_1) \pm \delta(n - n_2) $$

with n = 0...Ls-1, n₁ = 0...Ls-1-n₂(p), n₂ = n₂(p)...Ls-1, p = 1...Nip, where n₂(p) shows the generic key position and Nip is the number of key positions used (3 in the example).
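The two-part codebook structure can be sketched in Python as follows. The key positions (12, 13, 14) are hypothetical: the text does not give them, and they are chosen here only because, with Ls = 20 (the value implied by 62 = Ls + 2·21), they yield 21 pair positions. The sketch also assumes that the pulse pair keeps its spacing while shifting, so that n₂ = n₁ + n₂(p).

```python
LS = 20                       # subframe length implied by 62 = Ls + 2*21
KEY_POSITIONS = (12, 13, 14)  # hypothetical keys: (20-12)+(20-13)+(20-14) = 21

def build_innovation_codebook(ls=LS, keys=KEY_POSITIONS):
    """Return the innovation words, each a list of `ls` samples."""
    words = []
    # First part: ls single-pulse words, pulse +1 at position n1.
    for n1 in range(ls):
        word = [0] * ls
        word[n1] = 1
        words.append(word)
    # Second part: each key-word is shifted one position at a time.
    for key in keys:
        for n1 in range(ls - key):      # n1 = 0 ... ls-1-key
            n2 = n1 + key               # the pair keeps its spacing
            for sign in (1, -1):        # two words per pair position
                word = [0] * ls
                word[n1] = 1
                word[n2] = sign
                words.append(word)
    return words

codebook = build_innovation_codebook()
```

In a coder following the description the words would not need to be stored at all; the explicit list is built here only to make the word count and the pulse layout easy to inspect.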

[0046] The innovation codebook structure, with few non-null samples and words obtained by shifting samples by one position starting from a limited number of keys, is a simple deterministic structure that enables a fast search procedure for the optimal excitation that requires neither codebook storage nor the actual filtering of the candidate excitation signals.

[0047] During the search for the optimal innovation, the test with words of the first part of the codebook must be carried out only if long-term analysis has indicated a voiced sound or, failing that, when strong energy concentrations are noted in short signal sections. These strong concentrations can in fact signal the onset of a voiced section that cannot yet be classified as such, since classification is based on long-term analysis and in the previous signal sections there were no useful features to indicate such onset. Under these conditions, therefore, filter LTS1 would not be able to supply a correct predicted signal. Now, for a good coded signal quality it is mandatory that pitch pulses be correctly reproduced, and therefore single-pulse words prove useful to compensate for an inadequate operation (in voiced sections) or for an impossible correct operation (in onsets) of the long-term synthesis filter. Single-pulse words, instead, must not be used to reproduce unvoiced sounds that are not onsets, where their use is counterproductive, even if one of them actually provides the minimum error signal energy, since the subjective effect is usually worse.

[0048] The manner in which strong energy concentrations in short times are detected will be described afterwards.

[0049] Words in the codebook are identified by a respective index j(s); the index related to the optimal word, adequately coded, is transmitted to the decoder through a connection 2d. Since in the described example the codebook includes 62 words, to which as many indexes j(s) correspond, without having to modify the number of bits coding j(s), two further values of j(s) are available that do not correspond to any word in the codebook. These are used to represent a null innovation gain, as will be said afterwards; similarly to what has been done for long-term prediction delay and coefficient, when generating the indexes, only one of the two values of j(s) not corresponding to an innovation word will be used to indicate g = 0 and, when decoding, g will be set to 0 in correspondence with both values of j(s).

[0050] As regards gain g, this is quantized using a codebook built so as to allow saving coding bits with respect to what would actually be necessary to represent all possible values provided in the codebook. Information about gain, for each subframe, is represented in the form of two indexes j(gmax), j(gnor), the first one of which is linked to the maximum value of g in the frame, and the second one to the difference between such maximum value and the actual value, and by sign σ. This information is transmitted to the decoder through a connection 2e.

[0051] The codebook includes a number Nig of possible absolute values of g that can be represented as

$$ N_{ig} = N_{im} + N_{in} - 1 $$

where Nim and Nin are two different powers of 2. For example, we can have Nim = 2⁴ and Nin = 2², or Nim = 2⁴ and Nin = 2³. At each subframe, the optimal value of g determined with the error minimizing procedure that will be described afterwards is quantized, generating a respective index j(g) that is not transmitted but is reconstructed in the decoder. At the end of the frame, value j(gmax) related to the maximum frame gain is identified and is transmitted as such if it is not less than Nin; otherwise, index j(gmax) is forced to value Nin. In this way, j(gmax) can only assume Nim values and therefore the number of coding bits is limited. Once j(gmax) has been identified, index j(gnor) is computed for every subframe with the equation

$$ j(g_{nor}) = j(g_{max}) - j(g) $$

j(gnor) can have values in the range between 0 and Nim+Nin-2. The actual value of index j(gnor) is transmitted only if it is not greater than Nin-1; otherwise, gain is deemed 0 (that is, innovation is silenced for subframes where gain is very small with respect to the maximum one) and index j(s) of the innovation word is forced to one of the values that do not correspond to any codebook word, to show transmission of a word with null gain. In this way, a reduced differential dynamics is used and the bits that would have been needed to represent gain over the whole dynamics are saved, at the expense of a slight performance loss due to possible innovation silencing. To minimize the effect of channel errors on innovation index j(s), in case of silencing the value Nin-1 for index j(gnor) is transmitted anyway.
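The index bookkeeping for the gain can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the gain indexes j(g) are taken to run from 1 to Nim+Nin-1, and the relation j(gnor) = j(gmax) - j(g) is inferred from the stated range 0...Nim+Nin-2. The boolean flag stands for the information that, in the coder described, is actually carried by forcing j(s) to a value outside the codebook.

```python
def encode_frame_gain_indexes(j_g, nim, nin):
    """j_g: per-subframe quantized gain indexes, each in 1 .. nim+nin-1.

    Returns (j_gmax, [(j_gnor, silenced), ...]).
    """
    j_gmax = max(max(j_g), nin)      # forced up to Nin if the maximum is smaller
    coded = []
    for j in j_g:
        j_gnor = j_gmax - j          # in 0 .. nim+nin-2
        if j_gnor > nin - 1:
            # Gain too small with respect to the frame maximum: the
            # innovation is silenced and Nin-1 is transmitted anyway.
            coded.append((nin - 1, True))
        else:
            coded.append((j_gnor, False))
    return j_gmax, coded

def decode_gain_index(j_gmax, j_gnor):
    """Decoder side: rebuild j(g) from the two transmitted indexes."""
    return j_gmax - j_gnor

j_gmax, coded = encode_frame_gain_indexes([3, 10, 9, 2], nim=16, nin=4)
```

In this example the first and last subframes lie more than Nin-1 steps below the frame maximum and are therefore silenced, while the other two are recovered exactly by the decoder-side subtraction.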

[0052] The gain codebook can be a logarithmic codebook, so that the ratio between two consecutive values is a constant. The ratio must take into account several requirements:

values in dB must be as near as possible to allow a quantization as accurate as possible;
global dynamics between minimum gain g(1) and maximum one g(Nim+Nin-1) must be adequately extended to cover the different types of sound and a reasonable set of different voice levels;
differential dynamics between g(x-Nin+1) and g(x) must be adequately extended to make the probability of silencing reasonably low.

[0053] For example, with the above values of Nim, Nin, the value of the ratio between two consecutive gain levels can range from 3 to 6 dB.
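A logarithmic gain table of this kind can be sketched in Python. The minimum gain value and the function name are hypothetical, and the sketch assumes that gains are amplitudes, so that a step of s dB corresponds to a ratio of 10^(s/20) between consecutive levels.

```python
def log_gain_codebook(g_min, step_db, nim, nin):
    """Return the nim+nin-1 gain levels g(1) ... g(nim+nin-1).

    Consecutive levels differ by a constant ratio, i.e. by a constant
    step in dB (amplitude convention: 20*log10).
    """
    ratio = 10.0 ** (step_db / 20.0)
    return [g_min * ratio ** i for i in range(nim + nin - 1)]

# Example with Nim = 16, Nin = 4 and a 6 dB step (upper end of the range).
table = log_gain_codebook(g_min=1.0, step_db=6.0, nim=16, nin=4)
global_dynamics_db = 6.0 * (len(table) - 1)   # g(1) to g(Nim+Nin-1)
differential_dynamics_db = 6.0 * (4 - 1)      # range usable without silencing
```

With these figures the global dynamics would be 108 dB and the differential dynamics 18 dB, which shows the trade-off between step size, coverage and silencing probability listed above.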

[0054] The fast search procedure for optimal shift and excitation will now be described, referring also to the operative diagram in Fig. 2, that corresponds to the set of blocks M1, LTS, STS, STS', SM, SW, SW' of Fig. 1. In Fig. 2, the same symbols as in Fig. 1 are used, with the exception of blocks STW1, STW2 that represent the filter resulting from the series of filters STS1, SW and respectively STS', SW', that is a filter with transfer function

. In this Figure, each of the filters has been divided into an element with null input (LTSa, STW1a, STW2a) that provides contribution of initial conditions (that is of filtering memories for previous subframes), and into an element (STW1b, STW2b) that is reset at each subframe (filtering with null initial conditions), as indicated by signal R supplied by a time base, not shown. Filtering with null initial conditions of excitation is only the short-term filtering, since it has been supposed that delay d is not less than a subframe.

[0055] The optimal shift determination is composed of three steps:

evaluation of the need to perform a shift;
determination of a suitable range of shift values;
search for the optimal shift in the range.

[0056] In the first step, it is checked if three conditions are satisfied:

the subframe is not silence, which is shown by the fact that the energy of r_s(n) is greater than a given threshold;
the signal is voiced or has been subjected to interpolation, which is shown by flags F, V coming from LTA;
a peak of r_s(n) actually occurs in the subframe, which is shown by the fact that the average power of r_s(n) in the subframe (that is the energy divided by the number Ls of samples) is greater than or equal to the average power in a period of length d that ends with the last sample of the subframe itself.

[0057] The reason for the first condition is obvious. As regards the second and the third one, shift must be performed only if there is a pitch peak in the subframe. This occurs first of all in voiced sections; the fact that an interpolation occurred, that is, that the values of parameters obtained in two subsequent frames are very near, suggests a certain periodicity in the signal segment that must be coded, and therefore enabling the shift in this case too can be useful to further reduce the risk of misalignment between the reconstructed signal and the original signal.

[0058] Computation of energy and powers can be carried out indifferently on the upsampled signal or on the original one. During these computations, the maximum absolute value of r̂_s in the current subframe and its position are also obtained: they will be used in determining the shift. To determine the position of the maximum, it is mandatory to operate on the upsampled signal to get maximum resolution.

[0059] The second step determines the lower and upper extremes ĥ_min, ĥ_max of a range that extends around shift value ĥ accumulated so far in the frame. Values ĥ_max, ĥ_min are initially fixed so that differences ĥ_max - ĥ and ĥ - ĥ_min have a prearranged value Γ · Δh, for example 20 samples of the upsampled signal r̂_s. There is therefore a maximum number of possible values (41 in the example) among which the optimal shift can be searched for. The actual extreme values ĥ_min, ĥ_max may not be symmetrical with respect to value ĥ (that is, the range can be limited on one or both sides of the accumulated value ĥ), since it is necessary to avoid shifting the subframe too much, both into the past, with possible duplication of a maximum of r̂_s previously taken into account, and into the future, with consequent loss of a maximum. This check is made possible by storing the maximum of r̂_s in the subframe. However, unless range limiting has been bilateral, the search for the optimal shift is carried out trying to keep the range width constant, by also taking into account some values beyond the extreme that is not subjected to limitation. In any case, the shift to be carried out must not cause value H to be exceeded.

[0060] The optimal shift value within the test range is the one minimizing energy of an error signal e₁(n) represented by the difference between reconstructed and weighted modified signal x_w(n) (Fig. 1) and contribution y_w1(n) of excitation filtering memories, and is obtained with a fast search procedure that allows reducing the amount of necessary computations.

[0061] For this fast search, it must be taken into account on one hand that output signal x_w(n) from STW' can be expressed as

(where n ranges from 0 to Ls-1), and on the other hand that the same signal is the sum of output x_w1 of STW2a and output x_w2 of STW2b. Summation in (7) represents signal x_w1, that can be computed, once and for all, like the corresponding contribution y_w1 of chain LTSa, STW1a, and therefore an error

can also be computed once and for all, that appears at the output of an adder SMa. Error e₁ can then be written

, where x_w2 depends on r̂_s and therefore on the shift. It is then necessary to determine x_w2 for all values of the shift, to compute for each one the respective energy of e₁, and to store the value of ĥ that provides minimum energy and the corresponding signal x_w(n).

[0062] The procedure to determine x_w2 adopted according to the invention takes into account that, for a given shift value, signal x_w2 is given by

The upper limit of the summation is the minimum between n and P, since when filtering with null initial conditions, samples with n-k < 0, that is, samples of the previous subframe, must not be taken into account. Values of x_w2 are actually computed according to (8) for a first group of Γ possible shifts that range from ĥ_max to ĥ_max-Γ+1; obviously, the tests will be stopped if by chance ĥ_min is reached before having examined all Γ shifts. For the other values of shift, from ĥ_max-Γ to ĥ_min, instead of being computed with (8), x_w2 is computed according to the equation

[0063] In (9), Q(n) shows the truncated pulse response (since it is computed only for Ls values of n) of filter STW, with Q(0) = 1.

[0064] It can be immediately noted that, taking into account that Q is determined once and for all, beside a certain value, (9) requires much fewer computations than (8).

[0065] It must further be stated that Γ values of x_w2 must actually be computed according to (8) and (9), that is one for each of the Γ upsampled signal samples corresponding to an 8-kHz sampling period.

[0066] Once having minimized the energy of e₁(n) and having found the optimal shift, minimization of the energy of e(n) is started to find the optimal excitation. Unit EM directly computes an expression of the energy to be minimized that is function of the position of the pulses in the innovation word, and for this purpose the pulse response Q is employed, computed during search for the optimal shift. Computation of the pulse response is made convenient with respect to filtering execution by the fact that every word includes two non-null samples at most. Moreover, taking into account the more general case of the words with 2 pulses, the global pulse response is the sum of two responses spaced by a distance equal to the key; responses for all other words linked to a key are then obtained simply by a translation by one sample at a time. To simplify, in the following mathematical expressions, the variability range of the summation index for summations extended to all samples in a subframe has not been indicated.

[0067] Error e(n), for a generic excitation word, is given by

$$ e(n) = e_1(n) - g\,u(n) $$

where u(n) is the output signal from STW1b. Energy of e(n) is given by

$$ E = \sum_n \left[ e_1(n) - g\,u(n) \right]^2 $$

that can be written as

$$ E = \sum_n e_1^2(n) - 2g \sum_n e_1(n)\,u(n) + g^2 \sum_n u^2(n) $$

Taking into account that the first and the last summations represent energies of signals e₁, u, and the second one represents mutual correlation R(e₁u)(k) between them, evaluated for k=0 and in the following simply called R(e₁u), we have

$$ E = E(e_1) - 2g\,R(e_1 u) + g^2\,E(u) $$

[0068] Minimizing E is the same as maximizing the difference of energies

$$ \Delta E = E(e_1) - E = 2g\,R(e_1 u) - g^2\,E(u) \qquad (12) $$

For each word of the examined codebook, the maximum of (12) is obtained for a value

$$ g_0 = \frac{R(e_1 u)}{E(u)} $$

as immediately appears by computing the derivative with respect to g and making it equal to 0, to which a value

$$ \Delta E = \frac{R^2(e_1 u)}{E(u)} $$

corresponds.
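The optimal gain for one word can be checked numerically with a short Python sketch. It assumes the error model e(n) = e₁(n) - g·u(n), which follows from y_w2 being the pulse responses multiplied by gain g; the function names and the sample values are illustrative only.

```python
def optimal_gain(e1, u):
    """Return (g0, delta_e) for one excitation word.

    g0 = R(e1u) / E(u) cancels the derivative of the error energy with
    respect to g; delta_e = R(e1u)^2 / E(u) is the resulting energy
    reduction with respect to E(e1).
    """
    r = sum(a * b for a, b in zip(e1, u))   # correlation R(e1u)
    e_u = sum(b * b for b in u)             # energy E(u)
    return r / e_u, r * r / e_u

def error_energy(e1, u, g):
    """Energy of e(n) = e1(n) - g*u(n)."""
    return sum((a - g * b) ** 2 for a, b in zip(e1, u))

e1 = [0.5, -1.0, 2.0, 0.25]
u = [1.0, 0.5, 0.25, 0.0]
g0, delta_e = optimal_gain(e1, u)
energy_at_g0 = error_energy(e1, u, g0)
```

Evaluating `error_energy` slightly above and below `g0` gives larger values, and `energy_at_g0` equals E(e₁) minus `delta_e`, so comparing words by ΔE alone is enough to rank them.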

[0069] The particular structure of the innovation codebook allows E(u) and R(e₁u), that depend on the position of the pulse or pulses in the word, to be obtained directly by exploiting the pulse response of filter STW1, that is equal to that of filter STW2, previously determined.

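[0070] In fact

$$ E(u) = \sum_n \left[ Q(n - n_1) \pm Q(n - n_2) \right]^2 $$

or, more simply,

$$ E(u) = E_q(n_1) + E_q(n_2) \pm 2\,\rho(n_1, n_2) \qquad (14) $$

where Eq is energy of the adequately truncated signal Q (that is, computed for a number of samples determined by the position of n₁, n₂) and ρ(n₁, n₂) is the correlation between the two shifted pulse responses. Moreover, R(e₁u) can be written

$$ R(e_1 u) = R(e_1 q)(n_1) \pm R(e_1 q)(n_2) \qquad (15) $$

where

$$ R(e_1 q)(n_i) = \sum_{n=0}^{L_s - 1 - n_i} e_1(n + n_i)\,Q(n), \qquad i = 1, 2 $$

It is clear that for single-pulse words, relations (14) and (15) are simply reduced to

$$ E(u) = E_q(n_1) $$

and

$$ R(e_1 u) = R(e_1 q)(n_1) $$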
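The claim that E(u) and R(e₁u) can be obtained from the pulse response without filtering each word can be illustrated in Python. The sketch assumes the decomposition into two truncated energies, a cross-correlation term (called ρ in the text) and two correlations with e₁; the helper names, the toy pulse response and the toy error signal are hypothetical.

```python
def eq(q, n1):
    """Energy of q truncated to the samples that fit after position n1."""
    return sum(q[n] ** 2 for n in range(len(q) - n1))

def rho(q, n1, n2):
    """Correlation between q shifted to n1 and q shifted to n2 (n1 < n2)."""
    return sum(q[n - n1] * q[n - n2] for n in range(n2, len(q)))

def r_e1q(e1, q, ni):
    """Correlation between e1 and q shifted to position ni."""
    return sum(e1[n + ni] * q[n] for n in range(len(q) - ni))

def pair_terms(e1, q, n1, n2, sign):
    """E(u) and R(e1u) for a two-pulse word, without any filtering."""
    e_u = eq(q, n1) + eq(q, n2) + 2.0 * sign * rho(q, n1, n2)
    r_e1u = r_e1q(e1, q, n1) + sign * r_e1q(e1, q, n2)
    return e_u, r_e1u

def filter_word(word, q):
    """Reference: zero-state filtering as a truncated convolution with q."""
    return [sum(word[k] * q[n - k] for k in range(n + 1))
            for n in range(len(word))]

q = [0.8 ** n for n in range(8)]                    # toy truncated response
e1 = [0.3, -1.2, 0.7, 2.0, -0.5, 0.1, 0.9, -0.4]    # toy partial error
fast = pair_terms(e1, q, n1=1, n2=4, sign=-1)

word = [0, 1, 0, 0, -1, 0, 0, 0]
u = filter_word(word, q)
direct = (sum(x * x for x in u), sum(a * b for a, b in zip(e1, u)))
```

Both routes give the same pair of values up to rounding, while the first one reuses quantities that depend only on pulse positions and can therefore be tabulated before the word-by-word search.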

[0071] The operations performed at each subframe by EM to determine the optimal excitation can be considered as divided into three steps.

a) Before examining the effect of each innovation word, as soon as values a_i are available, EM computes and stores the possible values of the three addends in (14). Computation will be carried out only for the first 4 subframes, since, as already said, in the following subframes filter coefficients a_i do not change. Terms Eq can be computed with a simple iterative procedure, according to the equation

with n =1...Ls-1 and

.
Moreover, since the codebook includes only Ni2 possible pairs of values n₁, n₂, computation of ρ is carried out only for these pairs, according to the expressions

where n₂(p) has the already cited meaning, n = 1...Ls-1- n₂(p) and k = Ni2...1 is the generic pair of values n₁, n₂.

b) As soon as the optimal value of e₁ is available, always before the search procedure, EM computes and stores values R(e₁q).

c) After these operations, EM computes values of E(u),R(e₁u) word by word, determining value g₀ and the related ΔE, and storing the word index and the related value of g that originated the energy minimum.
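Steps a) to c) can be sketched as a single Python function. This is a simplified illustration, not the procedure of the appendix: the energies are computed directly rather than with the iterative expressions, the key positions passed in the example are hypothetical, and the returned tuple layout is an assumption.

```python
def search_innovation(e1, q, keys, allow_single_pulse=True):
    """Return (delta_e, g0, positions, sign) of the best innovation word."""
    ls = len(q)
    # Step a: terms that depend only on the pulse response.
    eq = [sum(q[n] ** 2 for n in range(ls - p)) for p in range(ls)]
    # Step b: correlations between e1 and the shifted pulse response.
    rq = [sum(e1[n + p] * q[n] for n in range(ls - p)) for p in range(ls)]

    candidates = []

    def add(e_u, r, positions, sign):
        if e_u > 0.0:                       # guard against a null response
            candidates.append((r * r / e_u, r / e_u, positions, sign))

    # Step c: word-by-word evaluation of E(u), R(e1u), g0 and delta_e.
    if allow_single_pulse:
        for n1 in range(ls):
            add(eq[n1], rq[n1], (n1,), 1)
    for key in keys:
        for n1 in range(ls - key):
            n2 = n1 + key
            cross = sum(q[n - n1] * q[n - n2] for n in range(n2, ls))
            for sign in (1, -1):
                add(eq[n1] + eq[n2] + 2.0 * sign * cross,
                    rq[n1] + sign * rq[n2], (n1, n2), sign)
    return max(candidates, key=lambda c: c[0])

q = [0.7 ** n for n in range(20)]
# Target built from a single pulse at position 5 with gain 1.5.
e1 = [0.0] * 5 + [1.5 * q[n] for n in range(15)]
best = search_innovation(e1, q, keys=(12, 13, 14))
```

The `allow_single_pulse` switch mirrors the rule that single-pulse words are tested only for voiced subframes or onsets; how that decision is taken is outside this sketch.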

[0072] As said above, if the sound is not voiced, the tests with words of the first part of the codebook are carried out only if strong energy concentrations in short times are noted, that can show the onset of a voiced signal section. For this purpose, within the subframe, the energy of a certain group of samples of the modified residual is computed (e.g. 5 samples), starting from the beginning of the subframe and shifting, by one sample at a time, the window selecting the group until the whole subframe has been scanned, and storing which group shows maximum energy. Furthermore, the average power (that is the energy divided by the number of samples) in the window where the maximum occurred and the average power in the subframe are also computed. Tests with single-pulse words will be enabled if subframe energy and the ratio between the average powers in the window and in the subframe are greater than suitable thresholds. Moreover, if the optimal innovation is composed of a single-pulse word, the absolute value of gain g is limited to a maximum value |g|_max proportional to |r_s|_max, the proportionality factor being a parameter approximately equal to 1 and |r_s|_max being the residual maximum computed during the operations to determine ĥ_min, ĥ_max. The purpose of this limitation is also to prevent insertion into the signal of a pulse with too high energy with respect to the maximum residual amplitude in the same subframe.
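The onset test can be sketched in Python as follows. The threshold values used in the example and the function name are hypothetical, since the text only speaks of suitable thresholds; the 5-sample window is the example value given above.

```python
def single_pulse_words_enabled(residual, voiced, energy_thr, ratio_thr,
                               window=5):
    """Decide whether single-pulse words are tested for this subframe."""
    if voiced:
        return True                      # voiced subframes always qualify
    ls = len(residual)
    energy = sum(x * x for x in residual)
    if energy == 0.0:
        return False                     # silent subframe, nothing to detect
    # Slide the window one sample at a time and keep the maximum energy.
    max_window_energy = max(
        sum(x * x for x in residual[start:start + window])
        for start in range(ls - window + 1))
    window_power = max_window_energy / window
    subframe_power = energy / ls
    return energy > energy_thr and window_power / subframe_power > ratio_thr

flat = [1.0] * 20
spiky = [0.1] * 20
spiky[7] = 5.0                           # strong energy concentration
flat_result = single_pulse_words_enabled(flat, False, 1.0, 2.0)
spiky_result = single_pulse_words_enabled(spiky, False, 1.0, 2.0)
```

For the flat subframe the power ratio is 1 and the test fails, while the isolated peak raises the ratio to about 4 and enables the single-pulse words.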

[0073] At the end of each subframe, initial conditions in filters LTSa, STW1a, STW2a will have to be updated. To update LTSa, that is s_s(n), it will be necessary to add a pulse or a pair of pulses (corresponding to the optimal innovation word) to s_s1(n). To update y_w(n), it will be necessary to add to y_w1(n) one or two pulse responses (corresponding to signal u(n)) adequately shifted and multiplied by gain g in order to supply the value of y_w2 corresponding to the optimal excitation. The pulse response will also be exploited to update STW2a. Furthermore, since filters STW have order P, only the last P samples of such responses (from Ls to Ls-P) are of interest.
The operations of EM are also included in the appendix.

[0074] The decoder structure will now be described, referring to the diagram in Fig. 3, where blocks corresponding to the ones already described with reference to Fig. 1 are shown by the same reference symbols, followed by digit 2. The various reconstructed signals are also shown with the same reference symbols used for the original signals in the coder.

[0075] The decoder receives from the coder, through connections 2a-2e, indexes j(d), j(φ), j(b), j(s), j(gmax), j(gnor) and sign σ for the innovation gain. At each subframe, index j(s) selects an innovation word s(n) in codebook IC2 or indicates a subframe that does not provide innovation contributions (g=0). If a word has been selected, it is multiplied in M2 by gain g whose absolute value is selected in the codebook IG2 by an index

$$ j(g) = j(g_{max}) - j(g_{nor}) $$

and whose sign is σ, thereby providing the reconstructed excitation signal (or fixed codebook contribution) s₁(n).

[0076] This signal is filtered in the long-term synthesis filter LTS2 to provide the reconstructed short-term residual s_s(n). In order to operate exactly like its replica LTS1 in the coder, filter LTS2 must receive from reconstruction circuit LTR2 parameters d, b and flag F indicating the possible need to carry out interpolation of d and b. Therefore, LTR2 will include a read-only memory with two tables addressed by indexes j(d), j(b), like LTR1 (Fig. 1), in addition to a circuit suitable to store values of d, b related to two consecutive frames and to carry out the comparisons, described in connection with the coder, necessary to determine if interpolation of d, b is necessary. Signal s_s(n) outgoing from LTS2 is filtered in the short-term synthesis filter STS2 using coefficients a_i generated in coefficient reconstructing circuit STR2 starting from indexes j(φ). In STS2, too, interpolated coefficients will be used for the first subframes of each frame. The reconstructed speech signal y(n) is subjected to a further filtering in an adaptive filter PF that uses coefficients obtained from linear prediction coefficients a_i and that inserts into the reconstructed speech signal a distortion that improves the perceptual effect. At the output of PF, there is a filtered reconstructed signal y_p(n). The use of filters like PF when coding a speech signal is well known to those skilled in the art and does not require further explanation.

[0077] It will be noted that the decoder does not take into account the possible shift carried out in the coder: in fact, the purpose of the shift is just to make the synthesized signal as good a replica as possible of the original signal, and therefore the decoder only requires information related to excitation and filters.

[0078] It is clear that what has been described is provided only by way of non-limiting example and that variations and modifications are possible without departing from the scope of the present invention. Thus, for example, even if reference has been made, as regards innovation, to samples whose amplitude was 1, it is also possible to use samples whose amplitudes are chosen in a finite set of values (e.g., ± 1, ± √2, ± 1/√2): obviously, in this case the coded signal will also include information about the relative amplitude of innovation samples. Generalizing equations (14), (15) to the case of pulses whose amplitude is not unitary is immediate. The choice of sample amplitudes in a finite set of values is not limiting, because in any case the relative amplitudes of the samples themselves are quantized.

[0079] To simplify the drawings, no timing signals for the various blocks have been shown; on the other hand, the timing sequence of operations clearly results from the description.

Claims

1. Method of coding/decoding speech signals, including, in a coding step, the operations of:

- sampling the original speech signals at a first sampling rate and dividing the resulting sequence of samples [x(n)] into a plurality of blocks of subsequent samples, each block comprising a first predetermined number Ls of samples or an integer multiple of said first number;

- performing a short-term analysis of the original speech signal to determine a group of linear prediction coefficients (a_i) to be used for a linear prediction filtering, a short-term synthesis filtering and a spectral weighting filtering, generating a representation of said coefficients in the frequency domain, and inserting into the coded signal information [j(φ)] related to the value of said representation, said information being valid for a period equal to the duration of a block or of a group of consecutive blocks of samples;

- obtaining, through said linear prediction filtering, a short-term residual signal [r_s(n)] for said block or group of blocks of samples;

- subjecting said residual signal [r_s(n)] to a long-term analysis, to determine long-term analysis parameters comprising a long-term synthesis filtering delay d and coefficient b, and inserting into the coded signal information [j(d), j(b)] related to the values of said parameters, said information being valid for a time equal to the duration of a block or a group of consecutive blocks of samples;

- reproducing every block of speech signal samples to be coded with a reconstructed and weighted speech signal [y_w(n)], obtained by subjecting to long-term synthesis filtering, short-term synthesis filtering and spectral weighting filtering an excitation signal chosen within a set of excitation signals, each comprising an amplitude contribution (excitation gain) and a shape contribution (innovation), the latter being composed of a limited number of pulses, much less than said first number of samples, with predefined positions and amplitudes belonging to a respective finite set;

- subjecting a set of samples of said residual signal [r_s(n)] to a time shift by discrete steps, each set of residual signal samples having a number of samples equal to the number of samples in a block of speech signal samples to be coded, to align in time the residual signal with a reconstructed residual signal [s_s(n)] obtained as result of the long-term synthesis filtering of an excitation signal, the shift generating a modified residual signal [r̂_m(n̂)] that is subjected to a short-term synthesis filtering and to a spectral weighting filtering, identical to those carried out for the excitation signals, to generate a reconstructed and weighted modified speech signal [x_w(n)];

- determining an optimal excitation signal for each block of samples, by minimizing the energy of a weighted error signal [e(n)] represented by the difference between the reconstructed and weighted modified signal [x_w(n)] and the reconstructed and weighted signal [y_w(n)], and inserting into the coded signal information [j(s), j(g_max), j(g_nor), σ] that identifies the optimal excitation signal; characterized in that:

- the innovation pulses are the only non-null samples of words composed of said first number Ls of samples,

- the innovation words for a first subset of excitation signals include a pair of pulses, a limited group of words of the first set being key-words in which the two pulses are placed in predetermined key positions and the other words in the subset being obtained from each of the key-words by each time simultaneously shifting the pulses by one position towards a word end, till one of the pulses reaches said end or the key position of the other pulse in the starting word, the shifting direction being the same for all words; and

- the innovation words for a second subset of excitation signals include only one pulse whose position is different for each signal;
and in that for said determination of the optimal excitation signal the energy of said weighted error signal is directly computed, by exploiting a pulse response Q(n) of filters that carry out synthesis and spectral weighting filterings of the excitation signal, with the following operations:

- determining said pulse response Q(n) and the energy Eq thereof for each of the possible pulse positions in the excitation signals;

- determining a first partial error signal [e₁(n)], represented by the difference between the reconstructed and weighted signal [x_w(n)] and a contribution [y_w1(n)] of the excitation signal filtering memory, and the energy of the same error signal;

- determining a first correlation R(e₁q) between said first partial error signal [e₁(n)] and the pulse response Q(n) for each of the pulses of an excitation signal;

- determining for each excitation signal, starting from said pulse response, a signal [u(n)] representative of a contribution of the filtering with null initial conditions of the excitation signal;

- determining the energy E(u) of said signal [u(n)] representative of the contribution of a filtering with null initial conditions of the excitation signal, and determining a second correlation R(e₁u) between said signal [u(n)] representative of the contribution of the filtering with null initial conditions of the excitation signal and the first partial error signal [e₁(n)];

- determining, for each excitation signal, an optimal value of the amplitude contribution as ratio between said second correlation and the energy of the signal resulting from filtering at null initial conditions;

- computing, as function of said second correlation R(e₁u), of said energy E(u) of the signal representative of the contribution of the filtering with null initial conditions of excitation and of said energy E(e₁) of the first partial error signal, the value of error signal energy for each excitation signal.

2. Method according to claim 1, characterized in that said pulses have unitary amplitude.

3. Method according to claim 1 or 2, wherein the sequence of speech signal samples is divided into frames that are composed by a plurality of consecutive subframes each corresponding to one of said blocks and include a second predetermined number Lf of samples, and wherein said short-term analysis is carried out for each frame, characterized in that for said short-term analysis in a frame a sample window is analysed, whose length is Lf+P (P = number of linear prediction coefficients in each group), that encompasses a current frame and the subsequent frame and also includes a predefined number H+K of samples of said subsequent frame, said window being a trapezoidal window that weights all samples with maximum weight, apart from the first and the last P samples, for which the weighting factors are determined through linear interpolation between a minimum weight and the maximum weight.

4. Method according to claim 3, characterized in that for the initial subframes of each frame, the linear prediction coefficients a_i are coefficients obtained as result of an interpolation between the values provided by short-term analysis for the current frame and those provided for the previous frame, the interpolation being carried out by operating on said representation.

5. Method according to any one of the previous claims, wherein the linear prediction residual is subjected to low-pass filtering before long-term analysis, thereby providing a filtered residual signal [r_f(n)].

6. Method according to any of claims 1 to 5, wherein the sequence of speech signal samples is divided into frames that are composed of a plurality of consecutive subframes each corresponding to one of said blocks and include a second predetermined number Lf of samples, and wherein said long-term analysis is carried out for each frame, characterized in that to determine said long-term analysis parameters, a sample window of the filtered residual signal [r_f(n)] is analysed, that encompasses a current frame and the subsequent frame and also includes a predefined number H+K of samples of said subsequent frame.

7. Method according to claim 6, characterized in that said long-term analysis further includes the operation of determining, for each frame, a long-term prediction gain G, representative of the ratio between the energies of filtered residual signal at the input of and at the output from means that carry out said analysis, the gain being also determined at each frame.

8. Method according to claim 6 or 7, characterized in that said long-term analysis further includes the operations of:

- classifying a speech signal segment corresponding to a frame as voiced or unvoiced, depending on the value of said long-term analysis coefficient b and on prediction gain G, and generating a first flag (V) in case the segment is classified as voiced;

- comparing values of long-term analysis delay d and coefficient b related to a current frame with those related to the previous frame and generating, when delay variation is less than a predefined amount and coefficient values in both frames are positive, a second flag (F) that enables interpolation between delay and coefficient values computed for said previous frame and those computed for the current frame.

9. Method according to any of the claims from 6 to 8, wherein long-term analysis delay d is determined as maximum of the autocorrelation function of the filtered residual within the window used for the analysis itself, characterized in that, before determining long-term analysis coefficient b and prediction gain G for the current frame, the local maximum of said autocorrelation function is determined even in a neighborhood of the maximum of the same function in the previous frame, if said first and second flags had been generated in said previous frame, and said local maximum is used as delay for current frame if it is different by an amount that is less than a predefined value from the maximum in the window related to current frame.

10. Method according to any of the claims from 6 to 9, characterized in that the value of long-term analysis coefficient b is clipped to a first maximum value b₁, linked to the ratio between energy of the filtered residual signal in the current frame and in the previous frame in an interval whose length is equal to the long-term analysis delay.

11. Method according to any of the claims from 6 to 10, characterized in that the value of long-term analysis coefficient b is clipped to a second maximum value b₂, if it exceeds such value while the prediction gain G is less than a gain threshold G_thr.

12. Method according to claim 8 or any of claims 9 to 11, if referred to claim 8, characterized in that said interpolation of long-term analysis delay d and coefficient b is a linear interpolation extended over a whole frame and, in case of a non-integer interpolated delay value, the value of a corresponding sample of the reconstructed residual signal s_s(n) is evaluated with a second-order polynomial interpolation centered around the integer delay value that is nearest to said interpolated value.
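A minimal sketch of the two interpolations named in claim 12 follows. It assumes that the linear interpolation reaches the current-frame delay at the last sample of the frame and that the second-order polynomial is the parabola through the three samples around the nearest integer delay; both choices and the function names are illustrative.

```python
import math

def interp_delay(d_prev, d_cur, n, frame_len):
    """Linearly interpolated delay at sample n (0-based) of the frame."""
    return d_prev + (d_cur - d_prev) * (n + 1) / frame_len

def delayed_sample(s, n, delay):
    """Value of s at the non-integer position n - delay, from a parabola
    centered on the nearest integer delay d0 (requires 1 <= d0 < n)."""
    d0 = math.floor(delay + 0.5)      # nearest integer delay
    t = d0 - delay                    # offset from the center, in [-0.5, 0.5]
    ym, y0, yp = s[n - d0 - 1], s[n - d0], s[n - d0 + 1]
    return y0 + 0.5 * t * (yp - ym) + 0.5 * t * t * (yp - 2.0 * y0 + ym)
```

A parabola reproduces any quadratic signal exactly, which gives a convenient check: for s(k) = k², the value at position 19.75 is 390.0625.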

13. Method according to any of the claims from 6 to 12, wherein information related to long-term analysis coefficient b inserted in the coded signal are indexes representative of quantized coefficient values, and information related to long-term analysis delay d allows representing also delay values that are outside an interval of allowed delays, characterized in that coefficient values that are less than a predefined fraction of a minimum quantized value are forced to 0 and, in case of forcing to 0, delay information representative of a value that is outside said interval of allowed delays and the index representative of said minimum quantized value, are inserted in the coded signal.

14. Method according to any of claims 1 to 13, characterized in that, to determine the optimal excitation, excitation signals of said second subset are used if said first flag (V) has been generated or, if said flag has not been generated, if analysis of the energy distribution in the modified residual signal shows an energy concentration within a short time, which indicates the onset of a voiced sound.

15. Method according to claim 14, characterized in that, to determine the optimal excitation, the excitation signals of the two subsets are normalized with different normalization factors, linked to the number of pulses present in respective subset signals.

16. Method according to claim 14 or 15, characterized in that, if said first flag (V) has been generated, the amplitude contribution for excitation signals of said second subset is limited in such a way as not to exceed a threshold that is proportional to the absolute value of the residual signal.

17. Method according to any of claims 14 to 16, characterized in that said analysis of the energy distribution of the modified residual signal is carried out at each subframe and includes the operations of:

- dividing the subframe into a plurality of partially overlapping windows, a first and a last window coinciding with a respective initial or final part of the subframe, the windows following the first one being each shifted by one sample with respect to the previous window;

- determining the energy and the power of the modified residual signal in the whole subframe and the energy in each one of said windows;

- determining the power for the window whose energy is maximum and determining the ratio between the power in said window and the power in the subframe; and

- comparing said maximum energy and said power ratio with respective thresholds, said energy concentration being recognized if said maximum energy and said ratio are not less than respective thresholds.
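The operations of claim 17 can be sketched as below, assuming that "power" means energy divided by the number of samples and that all windows have the same length `win_len`. The threshold values are left as parameters because the claim does not specify them.

```python
def energy_concentrated(x, win_len, e_thr, ratio_thr):
    """Detect an energy concentration in one subframe x."""
    n = len(x)
    sub_power = sum(v * v for v in x) / n
    if sub_power == 0:
        return False                       # all-zero subframe
    # Energy of the first window, then slide by one sample at a time.
    e = sum(v * v for v in x[:win_len])
    e_max = e
    for start in range(1, n - win_len + 1):
        e += x[start + win_len - 1] ** 2 - x[start - 1] ** 2
        e_max = max(e_max, e)
    ratio = (e_max / win_len) / sub_power  # window power over subframe power
    return e_max >= e_thr and ratio >= ratio_thr
```

Updating the window energy incrementally keeps the cost linear in the subframe length, since consecutive windows differ by a single sample at each end.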

18. Method according to any of the claims from 6 to 17, characterized in that, if only the second flag (F) has been generated, long-term analysis delay d is varied by an amount that is proportional to the magnitude of the shift accumulated up to the previous frame, the absolute value of the variation being limited to a predefined maximum.

19. Method according to claim 18, characterized in that said delay variation is disabled if it causes the decision about interpolation to be altered and the delay to go out of a predetermined interval of values.

20. Method according to any of the claims from 6 to 19, characterized in that the residual signal is subjected to said time shift in a subframe if at least one of said first and second flags has been generated and if an analysis of the modified residual signal energy in the subframe shows that the corresponding speech signal segment is not silence and includes a pitch peak, the shift related to a subframe being accumulated with that of the previous subframes of the same frame, so that the total shift in a frame remains less than a maximum shift.

21. Method according to claim 20, characterized in that said analysis of the modified residual signal energy includes the operations of:

- comparing the energy itself with an energy threshold, which, when reached, shows that the corresponding speech signal segment is not silence;

- determining the modified residual signal power in the subframe and in an interval whose length is equal to the long-term analysis delay, and the ratio between such powers; and

- comparing such ratio with a power threshold, which, when exceeded, shows the presence of a pitch peak in the subframe.

22. Method according to claim 20 or 21, characterized in that the shift for a subframe is determined, before determining an optimal excitation signal, within an interval that extends around the shift accumulated in previous subframes of the same frame, and it is the value that minimizes energy of said first partial error signal [e₁(n)].

23. Method according to claim 20, characterized in that to determine the shift, an upsampling of the residual signal is carried out, at a second rate that is a multiple of the first rate, the shift in a subframe being equal to one or more samples of the upsampled residual signal.
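The exhaustive shift search of claims 22 and 23 might look like the simplified sketch below. It is not the weighted-domain criterion of the claims: the short-term synthesis and weighting filters are omitted, and the error is taken directly between the shifted residual block and a reference block `target` standing in for the long-term filter output. All names are hypothetical.

```python
def search_shift(up, start, block_len, factor, acc_shift, half_range, target):
    """Search shifts (in upsampled samples) around acc_shift.

    up:     residual upsampled by `factor`
    start:  first sample of the subframe at the original rate
    target: reference block of block_len samples at the original rate
    Returns (best_shift, best_error_energy), or (None, None) if no
    candidate block fits inside `up`.
    """
    best_shift, best_err = None, None
    for shift in range(acc_shift - half_range, acc_shift + half_range + 1):
        first = start * factor + shift
        last = first + (block_len - 1) * factor
        if first < 0 or last >= len(up):
            continue                        # candidate block out of range
        block = up[first:last + 1:factor]   # back to the original rate
        err = sum((a - b) ** 2 for a, b in zip(block, target))
        if best_err is None or err < best_err:
            best_shift, best_err = shift, err
    return best_shift, best_err
```

Working on the upsampled signal lets the shift take values finer than one sample period of the original rate.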

24. Method according to claim 22 or 23, characterized in that said first partial error signal is computed as sum between a signal [x_w2(n)] representative of the modified residual signal filtered with null initial conditions and a second partial error signal [e₀(n)], which is the difference between the memory contribution [x_w1(n)] of the modified residual signal filtering and the memory contribution [y_w1(n)] of the excitation filtering, the signal [x_w2(n)] representative of the modified residual filtered with null initial conditions related to a sample in a subframe being obtained by carrying out the actual filtering of the modified residual signal for shift values between the upper end of the interval and an intermediate value between the two extreme values, while for each of the remaining shifts in the interval it is iteratively obtained from the value related to the previous sample and from said pulse response.

25. Method according to claim 24, characterized in that the determination of said interval of shift values is carried out through the following operations:

- fixing for the interval ends two symmetrical values with respect to the accumulated value;

- determining the residual signal peak position in the upsampled residual signal and comparing it with the peak position in the previous subframe;

- limiting the interval extension on one or both sides of the accumulated value to avoid an excessive shift of the subframe into the past and/or into the future, with consequent duplication or loss of residual signal peaks.

26. Method according to claim 25, characterized in that, in case of interval limitation on one side only of the accumulated value, the search for the shift is carried out also taking into account a certain number of values beyond the interval end not affected by the limitation, such that the total number of tested values is equal to the number of values included between said symmetrical values.

27. Method according to any of the claims from 1 to 26, including a decoding step where, starting from the information [j(φ), j(d), j(b), j(s), j(gnor), j(gmax), σ] about the linear prediction coefficient representation, the long-term analysis parameters and the excitation signal, said representation is reconstructed, reconstructed linear prediction coefficients are obtained therefrom, the long-term analysis parameters are reconstructed, an excitation signal is chosen in a set of excitation signals corresponding to the one used in the coding step, and said signal is subjected to a short-term and a long-term synthesis filtering, identical to the ones carried out in the coding step, by using reconstructed linear prediction coefficients a_i and long-term analysis delay d and coefficient b, to generate a reconstructed block of speech signal samples [y(n)] for each excitation signal [s(n)], characterized in that every block of the reconstructed speech signal [y(n)], during the initial part of a validity period of linear prediction coefficients, is generated by carrying out the short-term synthesis filtering with reconstructed linear prediction coefficients a_i obtained as result of an interpolation between reconstructed values related to an immediately previous validity period and reconstructed values related to the current period, and in that the values of long term analysis delay d and coefficient b, related to two consecutive validity periods, are compared and, if the delay variation is less than a predefined amount and the coefficient is positive in both periods, a flag corresponding to that second flag is generated, to enable carrying out, during long-term synthesis filtering, an interpolation between the long-term analysis parameter values related to said two validity periods.

28. Apparatus for coding/decoding speech signals using analysis-by-synthesis techniques, including a coder composed of:

- means (MT) for sampling at a first rate a speech signal and to divide the sample sequence into blocks comprising a first number of samples;

- short-term analysis means (STA, STR1) for computing a group of linear prediction coefficients a_i for one or more blocks of samples, for transforming said coefficients into a representation thereof in the frequency domain, for obtaining from said representation indexes j(φ) identifying the coefficients themselves, to be inserted into the coded signal, and for reconstructing the coefficients starting from said indexes, every group of linear prediction coefficients being valid for a period of time equal to the duration of one or more blocks of samples;

- a linear prediction filter (LPC) that receives blocks of signal samples from the sampling means (MT) and linear prediction coefficients a_i from the short-term analysis means (STA, STR1) and generates a short-term prediction residual signal r_s(n);

- long-term analysis means (LTA, LTR1) for obtaining, from said residual signal, parameters for a long-term synthesis filtering, which parameters comprise a delay (d) and a coefficient (b), and for transforming said parameters into indexes [j(b), j(d)] to be inserted into the coded signal, the long-term analysis parameters being valid for a period of time equal to the duration of one or more blocks of samples;

- a first filtering system (LTS1, STS1, SW) that: includes the series of a long-term synthesis filter (LTS1), that receives from the long-term analysis means (LTA, LTR1) said parameters, and of a short-term synthesis filter (STS1) and a spectral weighting filter (SW), that receive from said short-term analysis means (STA, STR1) said linear prediction coefficients a_i; receives signals belonging to a set of excitation signals each including a shape contribution composed of a number of pulses, of predefined amplitudes and positions, said pulse number being much less than said first number; and generates a reconstructed signal y_w(n) for each one of the excitation signals;

- means (TS) for time shifting, by discrete steps, a set of samples y_w(n) of said residual signal to align it in time with a reconstructed residual signal s_s(n) generated by the long-term synthesis filter (LTS1) of said first filtering system, the set of samples of residual signal having a number of samples equal to said first number of samples, every shift step being chosen within an interval of allowed values;

- a second filtering system (STS', SW'), that includes the series of a short-term synthesis filter and a spectral weighting filter identical to those (STS1, SW) of the first filtering system, is supplied with a modified residual signal generated by the time shift means for each of the values of said interval, and generates a reconstructed and weighted modified residual signal, said first and second filtering systems (LTS1, STS1, SW, STS', SW') separately determining a contribution representative of the memory of previous filtering and a contribution representative of a filtering with null initial conditions;

- means (SM, EM) for generating a weighted error signal [e(n)] by comparing signals generated by the first and the second filtering systems, for identifying an optimal excitation signal and an optimal shift, by minimizing the energy of said weighted error signal, and for inserting in the coded signal information that identifies the optimal excitation signal;
and further comprising, at the decoding side:

- means (LTR2, STR2) for reconstructing the linear prediction coefficients and long-term analysis parameters starting from said indexes;

- a third filtering system (LTS2, STS2), including the series of a long-term synthesis filter and a short-term synthesis filter, identical to those (LTS1, STS1) of the first filtering system, for filtering an excitation signal selected, through information related to optimal excitation, in a set corresponding to the set used on the coding side and to generate a block of reconstructed speech signal samples,
characterized in that:

- the innovation pulses are the only non-null samples of words composed of said first number Ls of samples,

- the innovation words for a second subset of excitation signals include only one pulse whose position is different for each signal;
and in that, in said error signal generating means (SM, EM), the means to minimize error energy are composed of a processing unit arranged to:

- determine said pulse response [Q(n)] and an energy (Eq) thereof for each one of the possible pulse positions in excitation signals;

- determine a first partial error signal [e₁(n)], represented by the difference between the reconstructed and weighted modified signal [x_w(n)] and a contribution [y_w1(n)] of the excitation signal filtering memory, and an energy of the error signal itself;

- determine a first correlation [R(e₁q)] between said first partial error signal [e₁(n)] and the pulse response for each of the pulses of an excitation signal;

- determine, for each excitation signal, starting from said pulse responses, a signal [u(n)] representative of a contribution of the filtering with null initial conditions of the excitation signal;

- determine the energy [E(u)] of said signal [u(n)] representative of the contribution of a filtering with null initial conditions of the excitation signal and a second correlation R(e₁u) between said signal [u(n)] representative of the contribution of the filtering with null initial conditions of the excitation signal and the first partial error signal [e₁(n)];

- determine, for each excitation signal, an optimal value of the amplitude contribution as ratio between said second correlation and the energy of the signal resulting from filtering with null initial conditions;

- compute, as function of said second correlation R(e₁u), of said energy (Eu) of the signal representative of the contribution of the filtering with null initial conditions of the excitation and of said energy [E(e₁)] of the first partial error signal, the error signal energy value for each excitation signal.
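The last three operations of the processing unit can be sketched as follows. The gain is the ratio stated in the claim; the error energy expression E(e₁) - R(e₁u)²/E(u) is not spelled out in the claim and is assumed here as the standard least-squares result for a single gain. The function names and the list-based signal representation are illustrative.

```python
def dot(a, b):
    """Inner product of two equal-length sample sequences."""
    return sum(x * y for x, y in zip(a, b))

def best_excitation(e1, candidates):
    """Pick the candidate u(n) that minimizes the error energy.

    e1:         first partial error signal
    candidates: zero-state filtered excitation signals u(n)
    Returns (index, optimal_gain, error_energy), or None if every
    candidate has zero energy.
    """
    e_e1 = dot(e1, e1)
    best = None
    for idx, u in enumerate(candidates):
        e_u = dot(u, u)
        if e_u == 0:
            continue                      # no usable contribution
        r = dot(e1, u)                    # second correlation R(e1 u)
        gain = r / e_u                    # optimal amplitude contribution
        err = e_e1 - r * r / e_u          # residual error energy (assumed form)
        if best is None or err < best[2]:
            best = (idx, gain, err)
    return best
```

Because the error energy follows from three scalars per candidate, no candidate needs to be scaled and subtracted from e₁(n) sample by sample during the search.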

29. Apparatus according to claim 28, characterized in that a low-pass filter (FPB) is provided between said linear prediction filter (LPC) and said long-term analysis means (LTA, LTR1).

30. Apparatus according to claim 28 or 29, characterized in that the short-term analysis means (STA, STR1) in the coder and the means (STR2) for reconstructing linear prediction coefficients in the decoder include means for carrying out, on said representation in the frequency domain, a linear interpolation between values related to two consecutive validity periods and supply the short-term synthesis filters (STS1, STS', STS2) of said filtering systems with the interpolated values in an initial part of a validity period of a set of coefficients.

31. Apparatus according to any one of claims from 28 to 30, characterized in that the long-term analysis means (LTA, LTR1) in the coder and the means (LTR2) for reconstructing the long-term analysis parameters in the decoder include comparing means for comparing parameters related to two consecutive validity periods and generating a flag (F) to enable carrying out an interpolation between the parameters when they satisfy predetermined conditions, and the long-term synthesis filters (LTS1, LTS2) of the first and third filtering systems are associated to means that, when said flag is present, carry out a second-order polynomial interpolation of said parameters, extended to a whole validity period thereof, and supply the respective long-term synthesis filter (LTS1, LTS2) with the interpolated parameters.

32. Apparatus according to any one of claims from 28 to 31, characterized in that the time shift means (TS) include a circuit (US) for upsampling the residual signal, and storing means (SH) for storing, for each block of samples to be coded, a first group of upsampled residual signal samples corresponding to said first number Ls of samples, and two further groups of upsampled residual signal samples, respectively preceding and following said first group and including a number of samples linked to the maximum allowed shift, and for supplying the second filtering system (STS', SW'), upon command by the energy minimizing means (EM), with a fourth group of upsampled residual signal samples, including as many samples as those of the first group and shifted with respect to the first group by said optimal shift.
