Field of the Invention
[0001] The present invention relates generally to speech coding systems and more specifically
to a reduction of bandwidth requirements in analysis-by-synthesis speech coding systems.
Background of the Invention
[0002] Speech coding systems function to provide codeword representations of speech signals
for communication over a channel or network to one or more system receivers. Each
system receiver reconstructs speech signals from received codewords. The amount of
codeword information communicated by a system in a given time period defines system
bandwidth and affects the quality of speech reproduced by system receivers.
[0003] Designers of speech coding systems often seek to provide high quality speech reproduction
capability using as little bandwidth as possible. However, requirements for high quality
speech and low bandwidth may conflict and therefore present engineering trade-offs
in a design process. This notwithstanding, speech coding techniques have been developed
which provide acceptable speech quality at reduced channel bandwidths. Among these
are
analysis-by-synthesis speech coding techniques.
[0004] With analysis-by-synthesis speech coding techniques, speech signals are coded through
a waveform matching procedure. A candidate speech signal is synthesized from one or
more parameters for comparison to an original speech signal to be encoded. By varying
parameters, different synthesized candidate speech signals may be determined. The
parameters of the closest matching candidate speech signal may then be used to represent
the original speech signal.
[0005] Many analysis-by-synthesis coders,
e.g., most code-excited linear prediction (CELP) coders, employ a
long-term predictor (LTP) to model long-term correlations in speech signals. (The term "speech signals"
means actual speech or any of the excitation signals present in analysis-by-synthesis
coders.) As a general matter, such correlations allow a past speech signal to serve
as an approximation of a current speech signal. LTPs work to compare several past
speech signals (which have already been coded) to a current (original) speech signal.
By such comparisons, the LTP determines which past signal most closely matches the
original signal. A past speech signal is identifiable by a
delay which indicates how far in the past (from current time) the signal is found. A coder
employing an LTP subtracts a scaled version of the closest matching past speech signal
(
i.e., the best approximation) from the current speech signal to yield a signal (sometimes
referred to as a
residual or
excitation) with reduced long-term correlation. This signal is then coded, typically with a
fixed stochastic codebook (FSCB). The FSCB index and LTP delay, among other things,
are transmitted to a CELP decoder which can recover an estimate of the original speech
from these parameters.
[0006] By modeling long-term correlations of speech, the quality of reconstructed speech
at a decoder may be enhanced. This enhancement, however, is not achieved without a
significant increase in bandwidth. For example, in order to model long-term correlations
in speech, conventional CELP coders may transmit 8-bit delay information every 5 or
7.5 ms (referred to as a
subframe). Such time-varying delay parameters require,
e.g., between one and two additional kilobits (kb) per second of bandwidth. Because variations
in LTP delay may not be predictable over time (
i.e., a sequence of LTP delay values may be stochastic in nature), it may prove difficult
to reduce the additional bandwidth requirement through the coding of delay parameters.
[0007] One approach to reducing the extra bandwidth requirements of analysis-by-synthesis
coders employing an LTP might be to transmit LTP delay values less often and determine
intermediate LTP delay values by interpolation. However, interpolation may lead to
suboptimal delay values being used by the LTP in individual subframes of the speech
signal. For example, if the delay is suboptimal, then the LTP will map past speech
signals into the present in a suboptimal fashion. As a result, any remaining excitation
signal will be larger than it might otherwise be. The FSCB must then work to undo
the effects of this suboptimal time-shift rather than perform its normal function
of refining waveform shape. Without such refinement, significant audible distortion
may result.
[0008] Y. Shiraki and M. Honda, in IEEE Transactions on Acoustics, Speech and Signal Processing,
vol. 36, no. 9, pages 1437-1444, September 1988, disclose a variable-length segment vocoder
based on joint segmentation and quantization for very low bit rate speech coding.
The coding problem is formulated to search for the segment boundaries and the sequence
of code segments so as to minimise the spectral distortion measure for a given interval.
This technique uses a dynamic programming procedure. An iterative algorithm for designing
the variable-length segment quantizer is also disclosed. This iterative algorithm
consists of updating the segment boundaries and updating the fixed-length code segments
for the training sequence.
Summary of the Invention
[0009] The present invention is defined by the independent claims.
[0010] The present invention provides a method and apparatus for reducing bandwidth requirements
in analysis-by-synthesis speech coding systems. The present invention provides multiple
trial original signals based upon an actual original signal to be encoded. These trial original signals
are constrained to be audibly similar to the actual original signal and are used in
place of, or to supplement, the actual original signal in coding. The original signal,
and hence the trial original signals, may take the form of actual speech signals or
any of the excitation signals present in analysis-by-synthesis coders. The present
invention affords generalized analysis-by-synthesis coding by allowing for the variation
of
original speech signals to reduce coding error and bit rate. The invention is applicable to,
among other things, networks for communicating speech information, such as, for example,
cellular and conventional telephone networks.
[0011] In an illustrative embodiment of the present invention, trial original signals are
used in a coding and synthesis process to yield reconstructed original signals. Error
signals are formed between the trial original signals and the reconstructed signals.
The trial original signal which is determined to yield the minimum error is used as
the basis for coding and communication to a receiver. By reducing error in this fashion,
a coding process may be modified such that required system bandwidth may be reduced.
[0012] In a further illustrative embodiment of the present invention for a CELP coder, one
or more trial original signals are provided by application of a codebook of time-warps
to the actual original signal. In an LTP procedure of the CELP coder, trial original
signals are compared with a candidate past speech signal provided by an adaptive codebook.
The trial original signal which most closely matches the candidate is identified.
As part of the LTP process, the candidate is subtracted from the identified trial
original signal to form a residual. The residual is then coded by application of a
fixed stochastic codebook. As a result of using multiple trial original signals in
the LTP procedure, the illustrative embodiment of the present invention provides improved
mapping of past signals to the present and, as a result, reduced residual error. This
reduced residual error affords less frequent transmission of LTP delay information
and allows for delay interpolation with little or no degradation in the quality of
reconstructed speech.
[0013] Another illustrative embodiment of the present invention provides multiple trial
original signals through a
time-shift technique.
Brief Description of the Drawings
[0014] Figure 1 presents an illustrative embodiment of the present invention.
[0015] Figure 2 presents a conventional CELP coder.
[0016] Figure 3 presents an illustrative embodiment of the present invention.
[0017] Figure 4 presents an illustrative time-warp function for the embodiment presented
in Figure 3.
[0018] Figure 5 presents an illustrative embodiment of the present invention concerning
time-shifting.
[0019] Figure 6 presents an illustrative time-shifting function for the embodiment presented
in Figure 5.
Detailed Description
Introduction
[0020] Figure 1 presents an illustrative embodiment of the present invention. An original
speech signal to be encoded, S(i), is provided to a trial original signal generator
10. The trial original signal generator 10 produces a trial original signal S̃(i)
which is audibly similar to the original signal S(i). Trial original signal S̃(i)
is provided to a speech coder/synthesizer 15 which (i) determines a coded representation
for S̃(i) and (ii) further produces a reconstructed speech signal, Ŝ(i), based upon the
coded representation of S̃(i). A difference or error signal, E(i), is formed between
trial original speech signal S̃(i) and Ŝ(i) by subtraction circuit 17. Signal E(i) is
fed back to the trial original signal generator 10, which selects another trial original
signal in an attempt to reduce the magnitude of the error signal, E(i). The embodiment
thereby functions to determine, within certain constraints, which trial original signal,
S̃(i), yields a minimum error, E_min(i). Once S̃(i) is determined, the parameters used by
the coder/synthesizer 15 to synthesize the corresponding Ŝ(i) may serve as the coded
representation of S̃(i) and, hence, of S(i).
[0021] The present invention provides generalization for conventional analysis-by-synthesis
coding by recognizing that the original signals may be varied to reduce error in the
coding process. As such, the coder/synthesizer 15 may be any conventional analysis-by-synthesis
coder, such as conventional CELP.
Conventional CELP
[0022] A conventional analysis-by-synthesis CELP coder is presented in Figure 2. A sampled
speech signal, s(i), (where i is the sample index) is provided to a short-term linear
prediction filter (STP) 20 of order N, optimized for a current segment of speech.
Signal x(i) is an excitation obtained after filtering with the STP:

x(i) = s(i) - Σ_{n=1}^{N} a_n s(i-n)        (1)

where parameters a_n are provided by linear prediction analyzer 10. Since N is usually about 10 samples
(for an 8 kHz sampling rate), the excitation signal x(i) retains the long-term periodicity
of the original signal, s(i). An LTP 30 is provided to remove this redundancy.
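As a minimal illustration of equation (1), the short-term prediction residual may be computed as follows. This sketch assumes the coefficients a_n are already available (for example, from a standard LPC analysis) and is not drawn from the patent itself.

```python
import numpy as np

def stp_residual(s, a):
    """Short-term prediction residual: x(i) = s(i) - sum_{n=1..N} a[n-1]*s(i-n)  (cf. eq. (1))."""
    s = np.asarray(s, dtype=float)
    N = len(a)
    x = s.copy()
    for i in range(len(s)):
        for n in range(1, min(N, i) + 1):   # only past samples contribute
            x[i] -= a[n - 1] * s[i - n]
    return x
```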
[0023] Values for x(i) are usually determined on a blockwise basis. Each block is referred
to as a subframe. The linear prediction coefficients, a_n, are determined by the analyzer 10
on a frame-by-frame basis, with a frame having a fixed duration which is generally an
integral multiple of subframe durations, and usually 20-30 ms in length. Subframe values
for a_n are usually determined through interpolation.
[0024] The LTP determines a gain λ(i) and a delay d(i) for use as follows:

r(i) = x(i) - λ(i) x̂(i-d(i))        (2)

where the x̂(i-d(i)) are samples of a speech signal synthesized (or reconstructed)
in earlier subframes. Thus, the LTP 30 provides the quantity λ(i) x̂(i-d(i)). Signal
r(i) is the excitation signal remaining after λ(i) x̂(i-d(i)) is subtracted from x(i).
Signal r(i) is then coded with a FSCB 40. The FSCB 40 yields an index indicating the
codebook vector and an associated scaling factor, µ(i). Together these quantities
provide a scaled excitation which most closely matches r(i).
[0025] Data representative of each subframe of speech, namely, LTP parameters λ(i) and d(i),
and the FSCB index, are collected for the integer number of subframes equalling a
frame (typically 2, 4 or 6). Together with the coefficients a_n, this frame of data
is communicated to a CELP decoder where it is used in the reconstruction of speech.
[0026] A CELP decoder performs the reverse of the coding process discussed above. The FSCB
index is received by a FSCB of the receiver (sometimes referred to as a synthesizer)
and the associated vector e(i) (an excitation signal) is retrieved from the codebook.
Excitation e(i) is used to excite an inverse LTP process (wherein long-term correlations
are provided) to yield a quantized equivalent of x(i), x̂(i). A reconstructed speech
signal, y(i), is obtained by filtering x̂(i) with an inverse STP process (wherein
short-term correlations are provided).
[0027] In general, the reconstructed excitation x̂(i) can be interpreted as the sum of scaled
contributions from the adaptive and fixed codebooks. To select the vectors from these
codebooks, a perceptually relevant error criterion may be used. This can be done by
taking advantage of the spectral masking existing in the human auditory system. Thus,
instead of using the difference between the original and reconstructed speech signals,
this error criterion considers the difference of perceptually weighted signals.
[0028] The perceptual weighting of signals deemphasizes the formants present in speech.
In this example, the formants are described by an all-pole filter in which spectral
deemphasis can be obtained by moving the poles inward. This is equivalent to replacing
the filter with predictor coefficients a_1, a_2, ···, a_N by a filter with coefficients
γa_1, γ²a_2, ···, γᴺa_N, where γ is a perceptual weighting factor (usually set to a
value around 0.8).
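The coefficient substitution described above is a simple scaling by powers of γ. A minimal sketch (illustrative only, not taken from the patent):

```python
import numpy as np

def weighted_coefficients(a, gamma=0.8):
    """Bandwidth-expanded predictor coefficients: a_n -> gamma**n * a_n, n = 1..N."""
    a = np.asarray(a, dtype=float)
    powers = gamma ** np.arange(1, len(a) + 1)
    return powers * a
```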
[0029] The sampled error signal in the perceptually weighted domain, g(i), is:

The error criterion of analysis-by-synthesis coders is formulated on a subframe-by-subframe
basis. For a subframe length of L samples, a commonly used criterion is:

ε = Σ_{i=î}^{î+L-1} g(i)²        (4)

where î is the first sample of the subframe. Note that this criterion weighs the
excitation samples unevenly over the subframe; the sample x̂(î+L-1) affects only g(î+L-1),
while x̂(î) affects all samples of g(i) in the present subframe.
[0030] The criterion of equation (4) includes the effects of differences in x(i) and x̂(i)
prior to î,
i.e., prior to the beginning of the present subframe. It is convenient to define an excitation
in the present subframe to represent this zero-input response of the weighted synthesis
filter.

where z(i) is the zero-input response of the perceptually-weighted synthesis filter
when excited with x(i)-x̂(i).
[0031] In the time-domain, the spectral deemphasis by the factor γ results in a quicker
attenuation of the impulse response of the all-pole filter. In practice, for a sampling
rate of 8 kHz, and γ = 0. 8, the impulse response never has a significant part of
its energy beyond 20 samples.
[0032] Because of its fast decay, the impulse response of the all-pole filter
1/(1 - γa_1 z⁻¹ - ··· - γᴺa_N z⁻ᴺ) can be approximated by a finite-impulse-response
filter. Let h_0, h_1, ···, h_{R-1} denote the impulse response of the latter filter.
This allows vector notation for the error criterion operating on the perceptually-weighted
speech. Because the coders operate on a subframe-by-subframe basis, it is convenient
to define vectors with the length of the subframe in samples, L. For example, for the
excitation signal:

x(i) = [x(i), x(i+1), ···, x(i+L-1)]ᵀ        (6)
[0033] Further, the spectral-weighting matrix H is defined as the convolution matrix of
the truncated impulse response, i.e., the matrix whose (m,n)th entry is h_{m-n} for
0 ≤ m-n ≤ R-1 and zero otherwise. H has dimensions (L+R-1)×L. Thus, the vector Hx̂(i)
approximates the entire response of the IIR filter 1/(1 - γa_1 z⁻¹ - ··· - γᴺa_N z⁻ᴺ)
to the vector
x̂(i). With these definitions an appropriate perceptually-weighted criterion is:

ε = ‖H[x(i) - x̂(i)]‖²        (8)

With the current definition of H the error criterion of equation (8) is of the
autocorrelation type (note that HᵀH is Toeplitz). If the matrix H is truncated to be
square, L×L, equation (8) approximates equation (4), which is the more common
covariance criterion, as used in the original CELP.
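For illustration, the convolution matrix H and a criterion of the form of equation (8) can be assembled as follows. The sketch assumes the truncated impulse response h_0, ···, h_{R-1} has already been computed; it is an aid to reading the notation above, not the patent's implementation.

```python
import numpy as np

def weighting_matrix(h, L):
    """(L+R-1) x L convolution matrix H of the truncated impulse response h."""
    R = len(h)
    H = np.zeros((L + R - 1, L))
    for col in range(L):
        H[col:col + R, col] = h
    return H

def weighted_squared_error(H, x, x_hat):
    """Perceptually weighted squared error of the form ||H (x - x_hat)||^2 (cf. eq. (8))."""
    d = H @ (np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float))
    return float(d @ d)
```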
An Illustrative Embodiment for CELP Coding
[0034] Figure 3 presents an illustrative embodiment of the present invention as it may be
applied to CELP coding. A sampled speech signal, s(i), is presented for coding. Signal
s(i) is provided to a linear predictive analyzer 100 which produces linear predictive
coefficients, a_n. Signal s(i) is also provided to an STP 120, which operates according to a process
described by Eq. (1), and to a delay estimator 140.
[0035] Delay estimator 140 operates to search the recent past history of s(i) (
e.g., between 20 and 160 samples in the past) to determine a set of consecutive past
samples (of length equal to a subframe) which most closely matches the current subframe
of speech, s(i), to be coded. Delay estimator 140 may make its determination through
a correlation procedure between the current subframe and contiguous sets of past
s(i) values at lags of 20 to 160 samples. An illustrative correlation technique
is that used by conventional
open-loop LTPs of CELP coders. (The term
open-loop refers to an LTP delay estimation process using
original rather
than reconstructed past speech signals. A delay estimation process which uses
reconstructed speech signals is referred to as
closed-loop.) The delay estimator 140 determines a delay estimate by the above described procedure
once per frame. Delay estimator 140 computes delay values M for each subframe by interpolation
of delay values determined at frame boundaries.
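A normalized cross-correlation search is one common way to realize such an open-loop estimate; the sketch below uses that choice (an assumption, since delay estimator 140 is not restricted to it) together with the once-per-frame estimation and per-subframe interpolation described above.

```python
import numpy as np

def open_loop_delay(s, i0, L, d_min=20, d_max=160):
    """Lag in [d_min, d_max] whose past segment best matches the current subframe
    s[i0:i0+L] under a normalized correlation measure (i0 >= d_max is assumed)."""
    s = np.asarray(s, dtype=float)
    cur = s[i0:i0 + L]
    best_d, best_c = d_min, -np.inf
    for d in range(d_min, d_max + 1):
        past = s[i0 - d:i0 - d + L]
        c = np.dot(cur, past) / (np.sqrt(np.dot(past, past)) + 1e-12)
        if c > best_c:
            best_d, best_c = d, c
    return best_d

def interpolated_subframe_delays(d_prev, d_curr, subframes_per_frame):
    """Per-subframe delay values M obtained by interpolating frame-boundary estimates."""
    k = np.arange(1, subframes_per_frame + 1) / subframes_per_frame
    return (1 - k) * d_prev + k * d_curr
```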
[0036] Adaptive codebook 150 maintains an integer number (typically 128 or 256) of vectors
of reconstructed past speech signal information. Each such vector, x̂(i), is L samples
in length (the length of a subframe) and partially overlaps neighboring codebook vectors,
such that consecutive vectors are offset from one another by one sample. As shown in Figure 3, each
vector is formed of the sum of
past adaptive codebook 150 and fixed codebook 180 contributions to the basic waveform
matching procedure of the CELP coder. The delay estimate, M, is used as an index to
stored adaptive codebook vectors.
[0037] Responsive to receiving M, adaptive codebook 150 provides a vector, x̂(i-M), comprising
L samples beginning M+L samples in the past and ending M samples in the past. This
vector of past speech information serves as an LTP estimate of the
present speech information to be coded.
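The following sketch shows one way such an adaptive-codebook vector could be read from a buffer of reconstructed excitation; it assumes, for simplicity, that the delay M is at least one subframe length, so no samples of the current subframe are needed.

```python
import numpy as np

def adaptive_codebook_vector(x_hat_history, M, L):
    """Vector x_hat(i-M): L reconstructed samples beginning M+L samples in the past
    and ending M samples in the past (M >= L assumed; no wrap-around handling)."""
    hist = np.asarray(x_hat_history, dtype=float)
    start = len(hist) - (M + L)
    return hist[start:start + L]
```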
[0038] As described above, the LTP process functions to identify a past speech signal which
best matches a present speech signal so as to reduce the long term correlation in
coded speech. In the illustrative embodiment of Figure 3, multiple trial original
speech signals are provided for the LTP process. Such multiple trial original signals
are provided by time-warp function 130.
[0039] Time-warp function 130, presented in Figure 4, provides a codebook 133 of time-warps
(TWCB) for application to original speech to produce multiple trial original signals.
In principle, the codebook 133 of time-warp function 130 may include any time-warp,
ζ(t) ≜ dτ/dt (where τ is a warped time-scale), which does not change the perceptual
quality of the original signal:

where t_j and τ_j denote the start of the current subframe j in the original and warped domains.
[0040] To help ensure stability of the warping process, it is preferred that major pitch
pulses fall near the right-hand boundary of the subframes. This can be done by defining
subframe boundaries to fall just to the right of such pulses using known techniques.
Assuming that the pitch pulses of the speech signal to be coded are at the boundary
points, it is preferred that warping functions satisfy:

If the pitch pulses are somewhat before the subframe boundaries, ζ(t) should maintain
its end value in this neighborhood of the subframe boundary. If equation (10) is not
satisfied, oscillating warps may be obtained. The following family of time-warping
functions may be used to provide a codebook of time-warps:

where A, B, C, σ_B, and σ_C are constants. The warping function converges towards A with
increasing t. At t_j the value of the warping function is just A+B. The value of C can be
used to satisfy equation (10) exactly. A codebook of continuous time-warps can be generated
by 1) choosing a value for A (typically between 0.95 and 1.05), 2) choosing values for
σ_B and σ_C (typically on the order of 2.5 ms), 3) using B to satisfy the boundary condition
at t_j (where ζ(t_j) = A+B), and 4) choosing C to satisfy the boundary condition of
equation (10). Note that no information concerning the warping codebook is transmitted;
its size is limited only by the computational requirements.
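Since the specific warp family of equation (11) is not reproduced here, the following sketch applies a generic warping function ζ(t) to a stored segment of the original signal by resampling along the warped time axis with linear interpolation. It is an illustrative stand-in for warping process 132, under the stated assumptions, rather than the patent's exact procedure.

```python
import numpy as np

def apply_time_warp(x, zeta):
    """Produce a warped trial original from x using per-sample warp values zeta ~ dtau/dt.

    x    -- stored original samples (one subframe plus any needed look-ahead)
    zeta -- warp values close to 1.0; len(zeta) warped output samples are produced
    """
    x = np.asarray(x, dtype=float)
    zeta = np.asarray(zeta, dtype=float)
    tau = np.concatenate(([0.0], np.cumsum(zeta)))        # warped time at integer t
    warped_grid = np.arange(len(zeta), dtype=float)       # uniform warped-time samples
    # Invert tau(t) by interpolation, then read x at the resulting original-time positions.
    t_of_tau = np.interp(warped_grid, tau, np.arange(len(tau), dtype=float))
    return np.interp(t_of_tau, np.arange(len(x), dtype=float), x)
```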
[0041] Referring to Figure 4, original speech signal x(i) is received by the time-warping
process 130 and stored in memory 131. Original speech signal x(i) is made available
to the warping process 132 as needed. The warping process receives a warping function
ζ(t) from a time-warp codebook 133 and applies the function to the original signal
according to equation (9). A time-warped original speech signal, x̃(i), referred to as
a trial original, is supplied to process 134 which determines a squared difference or
error quantity, ε', as follows:

Equation (12) is similar to equation (8) except that, unlike equation (8), equation
(12) has been normalized, thus making a least-squares error process sensitive to
differences of shape only.
[0042] The error quantity ε' is provided to an error evaluator 135 which functions to determine
the minimum error quantity, ε'_min, from among all values of ε' presented to it (there
will be a value of ε' for each time-warp in the TWCB) and to store the value of x̃(i)
associated with ε'_min, namely x̃_min(i).
[0043] Once x̃_min(i) is determined, the scale factor λ(i) is determined by process 136 as follows:

λ(i) = [x̃_min(i)ᵀ HᵀH x̂(i-M)] / [x̂(i-M)ᵀ HᵀH x̂(i-M)]        (13)

This scale factor is multiplied by x̂(i-M) and provided as output.
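Because equations (12) and (13) are summarized rather than reproduced in full above, the sketch below uses one plausible realization: the adaptive-codebook candidate is optimally scaled before the weighted squared error is measured, which makes the comparison sensitive to waveform shape only, and the same least-squares gain serves as λ(i). This is an assumption for illustration, not necessarily the patent's exact formulas.

```python
import numpy as np

def shape_error_and_gain(H, trial, candidate):
    """Weighted shape comparison between a trial original and the candidate x_hat(i-M).

    Returns (error, gain): the candidate is scaled by its least-squares gain before the
    squared error is taken (one plausible normalization, cf. eqs. (12)-(13))."""
    a = H @ np.asarray(trial, dtype=float)
    b = H @ np.asarray(candidate, dtype=float)
    gain = float(a @ b) / (float(b @ b) + 1e-12)
    d = a - gain * b
    return float(d @ d), gain
```

In the terms used above, the trial with the smallest returned error plays the role of x̃_min(i) and the associated gain the role of λ(i).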
[0044] Referring again to Figure 3, x̃_min(i) and the adaptive codebook estimate λ(i)x̂(i-M)
are supplied to circuit 160, which subtracts the estimate λ(i)x̂(i-M) from the warped
original x̃_min(i). The result is excitation signal r(i), which is supplied to a fixed
stochastic codebook search process 170.
[0045] Codebook search process 170 operates conventionally to determine which of the fixed
stochastic codebook vectors, z(i), scaled by a factor, µ(i), most closely matches
r(i) in a least squares, perceptually weighted sense. The chosen scaled fixed codebook
vector, µ(i)z_min(i), is added to the scaled adaptive codebook vector, λ(i)x̂(i-M), to
yield the best estimate of a current reconstructed speech signal, x̂(i). This best
estimate, x̂(i), is stored in the adaptive codebook 150.
[0046] As is the case with conventional speech coders, LTP delay and scale factor values,
M and λ, a FSCB index, and linear prediction coefficients, a_n, are supplied to a decoder
across a channel for reconstruction by a conventional CELP receiver. However, because of
the reduced error (in the coding process) afforded by operation of the illustrative
embodiment of the present invention, it is possible to transmit LTP delay information, M,
once per frame, rather than once per subframe. Subframe values for M may be provided at
the receiver by interpolating the delay values in a fashion identical to that done by
delay estimator 140 of the transmitter.
[0047] By transmitting LTP delay information M every frame rather than every subframe, the
bandwidth requirements associated with delay may be significantly reduced.
An LTP with a Continuous Delay Contour
[0048] For a conventional LTP, delay is constant within each subframe, changing discontinuously
at subframe boundaries. This discontinuous behavior is referred to as a
stepped delay contour. With stepped delay contours, the discontinuous changes in delay from
subframe to subframe correspond to discontinuities in the LTP mapping of past excitation
into the present. These discontinuities are modified by interpolation, and they may
prevent the construction of a signal with a smoothly evolving pitch-cycle waveform.
Because interpolation of delay values is called for in the illustrative embodiments
discussed above, it may prove advantageous to provide an LTP with a continuous delay
contour more naturally facilitating interpolation. Since this reformulated LTP provides
a delay contour with no discontinuities, it is referred to as a
continuous delay contour LTP.
[0049] The process by which delay values of a continuous delay contour are provided to an
adaptive codebook supplants that described above for delay estimator 140. To provide
a continuous delay contour for the LTP, the best of a set of possible contours over
the current subframe is selected. Each contour starts at the end value of the delay
contour of the previous subframe, d(t_j). In the present illustrative embodiment, each
of the delay contours of the set is chosen to be linear within a subframe. Thus, for
current subframe j of N samples (spaced at the sampling interval T), which ranges over
t_j < t ≤ t_{j+1}, the instantaneous delay d(t) is of the form:

d(t) = d(t_j) + α(t - t_j)        (14)

where α is a constant. For a given d(t), the mapping of a past speech signal (unscaled
by an LTP gain) into the present by an LTP is:

u(t) = x̂(t - d(t))        (15)

Equation (15) is evaluated for the samples t_j, t_j+T, ···, t_j+(N-1)T. For non-integer
delay values, the signal value x̂(t-d(t)) must be obtained with interpolation. For the
determination of the optimal piecewise-linear delay contour, we have a set of Q trial
slopes α_1, α_2, ···, α_Q, for each of which the sequence u(t_j), u(t_j+T), ···, u(t_j+(N-1)T)
is evaluated. The best quantized value of ḋ(t_j) can then be found using equation (8).
That is, equation (8) may be used to provide a perceptually weighted, least squares
error estimate between x̂(t) and x̂(t-d(t)). Referring to Figure 3 as it might be adapted
for the present embodiment, the value of d(t_j) is passed from delay estimator 140 to
adaptive codebook 150 in lieu of M.
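For illustration, equations (14) and (15) can be evaluated, and the trial slopes compared, as in the following sketch. Fractional delays are handled by simple linear interpolation; the function names and the use of linear interpolation are assumptions made for the example, and x_hat is assumed to contain all required past samples.

```python
import numpy as np

def map_past_with_contour(x_hat, t0, d0, alpha, N, T=1.0):
    """Evaluate u(t) = x_hat(t - d(t)) for d(t) = d0 + alpha*(t - t0)  (cf. eqs. (14)-(15)),
    at samples t0, t0+T, ..., t0+(N-1)*T, using linear interpolation for non-integer delays."""
    x_hat = np.asarray(x_hat, dtype=float)
    t = t0 + T * np.arange(N)
    pos = t - (d0 + alpha * (t - t0))          # continuous read positions t - d(t)
    i0 = np.floor(pos).astype(int)
    frac = pos - i0
    return (1 - frac) * x_hat[i0] + frac * x_hat[i0 + 1]

def best_slope(x_hat, target, H, t0, d0, slopes, N):
    """Pick the trial slope whose mapped signal u best matches the target under a
    weighted criterion of the form of equation (8) (sketch)."""
    errs = []
    for a in slopes:
        u = map_past_with_contour(x_hat, t0, d0, a, N)
        d = H @ (np.asarray(target, dtype=float) - u)
        errs.append(float(d @ d))
    return slopes[int(np.argmin(errs))]
```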
[0050] When using an LTP with a continuous delay contour to obtain a time-scaled version
of the past signal, it is preferred that the slope of the delay contour be less than
unity: ḋ(t) < 1. If this proposition is violated, local time-reversal of the mapped
waveform may occur. Also, a continuous delay contour cannot accurately describe pitch
doubling. To model pitch doubling, the delay contour must be discontinuous. Consider
again the delay contour of equation (14). Because each pitch period is usually dominated
by one major center of energy (the pitch pulse), it is preferred the delay contour
be provided with one degree of freedom per pitch cycle. Thus, the illustrative continuous
delay-contour LTP provides subframes with an adaptive length of approximately one
pitch cycle. This adaptive length is used to provide for subframe boundaries being
placed just past the pitch pulses. By so doing, an oscillatory delay contour can be
avoided. Since the LTP parameters are transmitted at fixed time intervals, the subframe
size does not affect the bit rate. In this illustrative embodiment, known methods
for locating the pitch pulses, and thus delay frame boundaries, are applicable. These
methods may be applied as part of the adaptive codebook process 150.
An Illustrative Embodiment for CELP Coding Involving Time-Shifting
[0051] In addition to the time-warping embodiments discussed above, a time-shifting embodiment
of the present invention may be employed. Illustratively, a time-shifting embodiment
may take the form of that presented in Figure 5, which is similar to that of Figure
3 with the time-warp function 130 replaced with a time-shift function 200.
[0052] Like the time-warp function 130, the time-shift function 200 provides multiple trial
original signals which are constrained to be audibly similar to the original signal
to be coded. Like the time-warp function 130, the time-shift function 200 seeks to
determine which of the trial original signals generated is closest in form to an identified
past speech signal. However, unlike the time-warp function 130, the time-shift function
200 operates by sliding a subframe of the original speech signal, preferably the excitation
signal x(i), in time by an amount θ, θ_min ≤ θ ≤ θ_max, to determine a position of the
original signal which yields minimum error when compared with a past speech signal
(typically, |θ_min| = |θ_max| = 2.5 samples, achieved with up-sampling). The shifting of
the original speech signal
by an amount θ to the right (
i.e., later in time) is accomplished by repeating the last section of length θ of the
previous subframe thereby padding the left edge of the original speech subframe. The
shifting of the original speech signal by an amount θ to the left is accomplished
by simply removing (
i.e., omitting) a length of the original signal equal to θ from the left edge of the
subframe. As with time-warping, minimum error is generally associated with time-matching
the major pitch pulses in a subframe as between two signals.
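A sketch of the pad/omit operation described above is given below for integer shifts; the fractional (up-sampled) shifts mentioned above are omitted for brevity, and the function name is illustrative.

```python
import numpy as np

def shift_subframe(x, start, L, theta):
    """Return the subframe x[start:start+L] shifted by an integer amount theta.

    theta > 0 (shift right / later in time): repeat the last theta samples of the
    previous subframe to pad the left edge.
    theta < 0 (shift left): omit |theta| samples from the left edge of the subframe."""
    x = np.asarray(x, dtype=float)
    if theta >= 0:
        pad = x[start - theta:start]                       # tail of the previous subframe
        return np.concatenate((pad, x[start:start + L - theta]))
    k = -theta
    return x[start + k:start + k + L]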
[0053] Note that the subframe size need not be a function of the pitch-period. It is preferred,
however, that the subframe size be always less than a pitch period. Then the location
of each pitch pulse can be determined independently. A subframe size of 2.5 ms can
be used. Since the LTP parameters are transmitted at fixed time intervals, the subframe
size does not affect the bit rate. To prevent subframes from falling between pitch
pulses, the change in shift must be properly restricted (of the order of 0.25 ms for
a 2.5 ms subframe). Alternatively, the delay can be kept constant for subframes where
the energy is much lower than that of surrounding subframes.
[0054] An illustrative time-shift function 200 is presented in Figure 6. The function 200
is similar to the time-warp function 130 discussed above with a pad/omit process 232
in place of warping process 132 and associated codebook 133.
[0055] The shifting procedure performed by function 200 is:

where t_j denotes the start of current frame j in the original signal. A closed-loop
fitting procedure searches for the value of θ, θ_min ≤ θ ≤ θ_max, which minimizes an
error criterion similar to equation (12):

This procedure is carried out by process 234 (which determines ε' according to equation
(17)) and error evaluator 135 (which determines ε'_min).
[0056] The optimal value of θ for the subframe j is that θ associated with ε'_min and is
denoted as θ_j. For a subframe length L_subframe, the start of subframe j+1 in the
original speech is now determined by:

while for the reconstructed signal the time τ_{j+1} simply is:

τ_{j+1} = τ_j + L_subframe        (19)
[0057] As is the case with the illustrative embodiments discussed above, this embodiment
of the present invention provides scaling and delay information, linear prediction
coefficients, and fixed stochastic codebook indices to a conventional CELP receiver.
Again, because of reduced coding error provided by the present invention, delay information
may be transmitted every frame, rather than every subframe. The receiver may interpolate
delay information to determine delay values for individual subframes as done by delay
estimator 140 of the transmitter.
[0058] Interpolation with a stepped-delay contour may proceed as follows. Let t_A and t_B
denote the beginning and end of the present interpolation interval, for the original
signal. Further, we denote with the index j_A the first LTP subframe of the present
interpolation interval, and j_B the first LTP subframe of the next interpolation interval.
First, an open-loop estimate of the delay at the end of the present interpolation interval,
d_B, is obtained by, for example, a cross-correlation process between past and present
speech signals. (In fact the value used for t_B for this purpose must be an estimate,
since the final value results after conclusion of the interpolation.) Let the delay at
the end of the previous interpolation interval be denoted as d_A. Then the delay of
subframe j can simply be set to be:

d_j = d_A + [(j - j_A)/(j_B - j_A)] (d_B - d_A)        (20)

The unscaled contribution of the LTP to the excitation is then given by:

u(τ) = x̂(τ - d_j),    τ_j ≤ τ < τ_{j+1}        (21)

where τ_j is the beginning of the subframe j, for the reconstructed signal.
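The interpolation of equation (20) and the mapping of equation (21) can be sketched as follows; linear interpolation is again used for non-integer delays, which is an assumption made for the example.

```python
import numpy as np

def stepped_delay(j, j_A, j_B, d_A, d_B):
    """Delay of subframe j, set by linear interpolation between interval endpoints (eq. (20))."""
    return d_A + (j - j_A) / (j_B - j_A) * (d_B - d_A)

def ltp_contribution(x_hat, tau_j, L, d_j):
    """Unscaled LTP contribution x_hat(tau - d_j) over one subframe (cf. eq. (21))."""
    x_hat = np.asarray(x_hat, dtype=float)
    pos = tau_j + np.arange(L) - d_j
    i0 = np.floor(pos).astype(int)
    frac = pos - i0
    return (1 - frac) * x_hat[i0] + frac * x_hat[i0 + 1]
```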
Delay Pitch Doubling and Halving
[0059] Analysis-by-synthesis coders often suffer from delay doubling or halving due to the
similarity of successive pitch-cycles. Such doubling or halving of delay is difficult
to prevent in many practical applications. However, regarding the present invention,
delay doubling or halving can be accommodated as follows. As a first step, the open-loop
delay estimate for the endpoint of the present interpolation interval is compared
with the last delay of the previous interpolation interval. Whenever the estimate is close
to a multiple or submultiple of that last delay, delay multiplication or division is
considered to have occurred. What follows is a
discussion of how to address delay doubling and delay halving; other multiples may
be addressed similarly.
[0060] Regarding delay doubling, let an open-loop estimate of the end value delay be denoted
as d_2(τ_B), where the subscript 2 indicates that the delay corresponds to two pitch cycles.
Let d_1(τ_A) represent a delay corresponding to one pitch cycle. In general, the doubled
delay and the standard delay are related by:

d_2(τ) = d_1(τ) + d_1(τ - d_1(τ))        (22)

Equation (22) describes two sequential mappings by an LTP. A simple multiplication
of the delay by two does not result in a correct mapping when the pitch period is
not constant.
[0061] Now consider the case where d_1(τ) is linear within the present interpolation interval:

d_1(τ) = d_1(τ_A) + β(τ - τ_A)        (23)

Then combination of equations (22) and (23) gives:

d_2(τ) = (2 - β) d_1(τ_A) + β(2 - β)(τ - τ_A)        (24)

Equation (24) shows that, within a restricted range, d_2(τ) is linear. However, in general,
d_2(τ) is not linear in the range where τ_A < τ < τ_A + d_1(τ). The following procedure
can be used for delay doubling. At the outset, d_1(τ_A) and d_2(τ_B) are known. By using
τ = τ_B in equation (24), β can be obtained:

Then both d_1(τ) and d_2(τ) are known within the interpolation interval. The standard delay,
d_1(τ), satisfies equation (23) within the entire interpolation interval. For d_2(τ), note
that equation (22) is valid over the entire interpolation interval, while equation (24)
is valid over only a restricted part.
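As a small numerical check on equations (22)-(24) (illustrative only; the contour used here is hypothetical):

```python
def doubled_delay(d1, tau):
    """d2(tau) = d1(tau) + d1(tau - d1(tau))  (eq. (22)); d1 is a callable delay contour."""
    return d1(tau) + d1(tau - d1(tau))

# Example: with the linear contour d1(tau) = 40 + 0.1 * tau (eq. (23) with beta = 0.1,
# tau_A = 0), doubled_delay gives 95.0 at tau = 100, matching the closed form of eq. (24):
# (2 - beta) * d1(tau_A) + beta * (2 - beta) * (tau - tau_A) = 1.9*40 + 0.19*100 = 95.0.
print(doubled_delay(lambda t: 40 + 0.1 * t, 100.0))   # -> 95.0
```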
[0062] The actual LTP excitation contribution for the interpolation interval is now obtained
by a smooth transition from the standard to the double delay:

u(τ) = [1 - ψ(τ)] x̂(τ - d_1(τ)) + ψ(τ) x̂(τ - d_2(τ))        (26)
where ψ(τ) is a smooth function increasing from 0 to 1 over the present interpolation
interval. This procedure assumes that the interpolation interval is sufficiently larger
than the double delay.
[0063] For delay halving, the same procedure is used in the opposite direction. Assume the
boundary conditions d_2(τ_A) and d_1(τ_B). To be able to use equation (22) for
τ_A < τ ≤ τ_B, d_1(τ) must be defined in the range τ_A - d_1(τ_A) < τ ≤ τ_A. A proper
definition will maintain good speech quality. Since the double delay will be linear in
the previous interpolation interval, we can use equation (24) to obtain a reasonable
definition of d_1(τ) in this range. For a linear delay contour, d_2(τ) satisfies:

where the ' indicates that the values refer to the previous interpolation interval
(note that τ'_B = τ_A), and where η' is a constant. Comparing this with equation (24),
d_1(τ) in the last part of the previous interpolation interval is:

Equation (28) also provides a boundary value for the present interpolation interval,
d_1(τ_A). From this value and d_1(τ_B), the value of β for equation (23) can be computed.
Again, equation (22) can be used to compute d_2(τ) in the present interpolation interval.
The transition from d_2(τ) to d_1(τ) is again performed by using equation (26), but now
ψ(τ) decreases from 1 to 0 in the interpolation interval.
1. A method for coding an original speech signal, the method comprising the steps of:
generating a plurality of trial original signals based on the original signal, that
are constrained to be audibly similar to the original signal and include at least
one modified original signal,
for each of said trial original signals, carrying out the following steps a), b) and
c):
a) coding the trial original signal to produce one or more parameters representative
thereof;
b) synthesizing an estimate of the trial original signal from one or more of the parameters;
and
c) determining an error between the trial original signal and the synthesized estimate
of the trial original signal; and
selecting as a coded representation of the original signal said one or more parameters
of a trial original signal having associated therewith a determined error which satisfies
an error criterion.
2. The method of claim 1 wherein the step of generating a plurality of trial original
signals comprises the step of applying one or more time-warps to the original signal.
3. The method of claim 1 wherein the step of generating a plurality of trial original
signals comprises the step of performing one or more time-shifts of the original signal.
4. The method of claim 1 wherein the step of coding a trial original signal comprises
performing analysis-by-synthesis coding.
5. The method of claim 4 wherein the step of performing analysis-by-synthesis coding
comprises performing code-excited linear prediction coding.
6. The method of claim 1 wherein the step of determining an error comprises determining
a sum of squares of samples of a difference between a filtered trial original signal
and a filtered synthesized estimate thereof.
7. The method of claim 1 wherein the step of determining an error comprises determining
a sum of squares of samples of a difference between a perceptually weighted trial
original signal and a perceptually weighted synthesized estimate thereof.
8. The method of claim 6 or claim 7 wherein the error evaluation process comprises determining
the minimum sum of squares of samples from among a plurality of sums of squares of
samples.
9. The method of claim 1 wherein the step of selecting a coded representation of the
original signal comprises determining the trial original signal having the smallest
determined error associated therewith.
10. An apparatus for coding an original speech signal, the apparatus comprising:
means (132, 133; 232) for generating a plurality of trial original signals based on
the original signal, that are constrained to be audibly similar to the original signal
and include at least one modified original signal;
means coupled to the means for generating, for coding a trial original signal to produce
one or more parameters representative thereof;
means (170, 180, 150), coupled to the means for coding, for synthesizing an estimate
of the trial original signal from one or more of the parameters;
means (134, 234), coupled to the means for coding and the means for generating, for
determining an error between the trial original signal and the synthesized estimate
of the trial original signal; and
means (135) for selecting as coded representation of the original signal said one
or more parameters of a trial original signal having associated therewith a determined
error which satisfies an error criterion.
11. The apparatus of claim 10 wherein the means for generating a plurality of trial original
signals comprises means (132) for applying one or more time-warps to the original
signal.
12. The apparatus of claim 10 wherein the means for generating a plurality of trial original
signals comprises a codebook (133) of time warps.
13. The apparatus of claim 10 wherein the means for generating a plurality of trial original
signals comprises means (232) for performing one or more time-shifts of the original
signal.
14. The apparatus of claim 10 wherein the means for coding a trial original signal comprises
means for performing analysis-by-synthesis coding.
15. The apparatus of claim 14 wherein the means for performing analysis-by-synthesis coding
comprises a code-excited linear prediction coder.
16. The apparatus of claim 10 wherein the means for synthesizing an estimate of a trial
original signal comprises a fixed stochastic codebook (180).
17. The apparatus of claim 16 wherein the means for synthesizing an estimate of a trial
original further comprises an adaptive codebook (150).
18. The apparatus of claim 10 wherein the means for determining an error comprises means
for determining a sum of squares of samples of a difference between the trial original
signal and the synthesized estimate thereof.
19. The apparatus of claim 18 wherein the error evaluation process comprises determining
the minimum sum of squares of samples from among a plurality of sums of squares of
samples.
20. The apparatus of claim 18 wherein the difference between the original signal and the
synthesized estimate thereof is perceptually weighted.
21. The apparatus of claim 10 wherein the means for selecting a coded representation of
the original signal comprises means for determining the trial original signal having
the smallest determined error associated therewith.
22. A network for communicating an original speech signal, the network comprising:
a communication channel;
a transmitter, coupled to the communication channel, for transmitting a coded representation
of the original signal, the transmitter comprising an apparatus as set out in claim
10; and
a receiver, coupled to the communication channel, for decoding the coded representation
of the original signal received from the transmitter.