FIELD OF THE INVENTION
[0001] This invention relates to audio coding systems and methods and in particular, but
not exclusively, to such systems and methods for coding audio signals at low bit rates.
BACKGROUND OF THE INVENTION
[0002] In a wide range of applications it is desirable to provide a facility for the efficient
storage of audio signals at a low bit rate so that they do not occupy large amounts
of memory, for example in computers, portable dictation equipment, personal computer
appliances, etc. Equally, where an audio signal is to be transmitted, for example
to allow video conferencing, audio streaming, or telephone communication
via the Internet, etc., a low bit rate is highly desirable. In both cases, however, high
intelligibility and quality are important and this invention is concerned with a solution
to the problem of providing coding at very low bit rates whilst preserving a high
level of intelligibility and quality, and also of providing a coding system which
operates well at low bit rates with both speech and music.
[0003] In order to achieve a very low bit rate with speech signals it is generally recognised
that a parametric coder or "vocoder" should be used rather than a waveform coder.
A vocoder encodes only parameters of the waveform, and not the waveform itself, and
produces a signal that sounds like speech but with a potentially very different waveform.
[0004] A typical example is the LPC10 vocoder (Federal Standard 1015) as described in T.E. Tremain, "The Government Standard Linear Predictive Coding Algorithm: LPC10", Speech Technology, pp. 40-49, 1982, since superseded by a similar algorithm, LPC10e. LPC10 and other
vocoders have historically operated in the telephony bandwidth (0-4kHz) as this bandwidth
is thought to contain all the information necessary to make speech intelligible. However
we have found that the quality and intelligibility of speech coded at bit rates as
low as 2.4Kbit/s in this way is not adequate for many current commercial applications.
[0005] The problem is that to improve the quality, more parameters are needed in the speech
model, but encoding these extra parameters means fewer bits are available for the
existing parameters. Various enhancements to the LPC10e model have been proposed for
example in A.V. McCree and T.P. Barnwell III, 'A Mixed Excitation LPC Vocoder Model for Low Bit Rate Speech Coding', IEEE Trans. Speech and Audio Processing, vol. 3, no. 4, July 1995, but even with all these enhancements the quality is barely adequate.
[0006] In an attempt to further enhance the model we looked at encoding a wider bandwidth
(0-8kHz). This has never been considered for vocoders because the extra bits needed
to encode the upper band would appear to vastly outweigh any benefit in encoding it.
Wideband encoding is normally only considered for good quality coders, where it is
used to add greater naturalness to the speech rather than to increase intelligibility,
and requires a lot of extra bits.
[0007] One common way of implementing a wideband system is to split the signal into lower
and upper sub-bands, to allow the upper sub-band to be encoded with fewer bits. The
two bands are decoded separately and then added together as described in the ITU Standard G.722 (X. Maitre, '7 kHz audio coding within 64 kbit/s', IEEE Journal on Selected Areas in Communications, vol. 6, no. 2, pp. 283-298, Feb. 1988). Applying this approach to a vocoder suggested
that the upper band should be analysed with a lower order LPC than the lower band
(we found second order adequate). We found it needed a separate energy value, but no separate pitch or voicing decision, as the ones from the lower band can be used. Unfortunately
the recombination of the two synthesized bands produced artifacts which we deduced
were caused by phase mismatch between the two bands. We overcame this problem in the
decoder by combining the LPC and energy parameters of each band to produce a single,
high-order wideband filter, and driving this with a wideband excitation signal.
[0008] Surprisingly, the intelligibility of the wideband LPC vocoder for clean speech was significantly higher than that of the telephone-bandwidth version at the same bit rate,
producing a DRT score (as described in W.D. Voiers, 'Diagnostic evaluation of speech intelligibility', in Speech Intelligibility and Speaker Recognition (M.E. Hawley, ed.), pp. 374-387, Dowden, Hutchinson & Ross, Inc., 1977) of 86.8 as opposed to 84.4 for the narrowband coder.
[0009] However, for speech with even a small amount of background noise, the synthesised
signal sounded buzzy and contained artifacts in the upper band. Our analysis showed
that this was because the encoded upper band energy was being boosted by the background
noise, which during the synthesis of voiced speech boosted the upper-band harmonics,
creating a buzzy effect.
[0010] On further detailed investigation we found that the increase in intelligibility was
mainly a result of better encoding of the unvoiced fricatives and plosives, not the
voiced sections. This led us to a different approach in the decoding of the upper
band, where we synthesized only noise, restricting the harmonics of the voiced speech
to the lower band only. This removed the buzz, but could instead add hiss if the encoded
upper band energy was high, because of upper band harmonics in the input signal. This
could be overcome by using the voicing decision, but we found the most reliable way
was to divide the upper band input signal into noise and harmonic (periodic) components,
and encode only the energy of the noise component.
[0011] This approach has two unexpected benefits, which greatly enhance the power of the
technique. Firstly, as the upper band contains only noise there are no longer problems
matching the phase of the upper and lower bands, which means that they can be synthesized
completely separately even for a vocoder. In fact the coder for the lower band can
be totally separate, and even be an off-the-shelf component. Secondly, the upper band
encoding is no longer speech specific, as any signal can be broken down into noise
and harmonic components, and can benefit from reproduction of the noise component
where otherwise that frequency band would not be reproduced at all. This is particularly
true for rock music, which has a strong percussive element to it.
[0012] The system takes a fundamentally different approach from other wideband extension techniques, which are based on waveform encoding as in McElroy et al., 'Wideband Speech Coding in 7.2 kb/s', ICASSP 93, pp. II-620 to II-623. The problem of waveform encoding is that it either requires a large number of bits, as in G.722 (supra), or else poorly reproduces the upper band signal (McElroy et al.), adding a lot of quantisation noise to the harmonic components.
[0013] In this specification, the term "vocoder" is used broadly to define a speech coder
which codes selected model parameters and in which there is no explicit coding of
the residual waveform, and the term includes coders such as multi-band excitation
coders (MBE) in which the coding is done by splitting the speech spectrum into a number
of bands and extracting a basic set of parameters for each band.
[0014] The term vocoder analysis is used to describe a process which determines vocoder
coefficients including at least LPC coefficients and an energy value. In addition,
for a lower sub-band the vocoder coefficients may also include a voicing decision
and for voiced speech a pitch value.
SUMMARY OF THE INVENTION
[0015] According to one aspect of this invention there is provided an audio coding system
for encoding and decoding an audio signal, said system including an encoder and a
decoder, said encoder comprising:-
filter means for decomposing said audio signal into an upper and a lower sub-band
signal;
lower sub-band coding means for encoding said lower sub-band signal;
upper sub-band coding means for parametric encoding at least the non-periodic component
of said upper sub-band signal according to a source-filter model;
said decoder means comprising means for decoding said encoded lower sub-band signal
and said encoded upper sub-band signal, and for reconstructing therefrom an audio
output signal,
wherein said decoder means comprises filter means, and excitation means for generating
an excitation signal for being passed by said filter means to produce a synthesised
upper sub-band signal, said excitation means in use generating an excitation signal
which includes a substantial component of synthesised noise in a frequency band corresponding
to the upper sub-band of said audio signal, and said synthesised upper sub-band signal
and the decoded lower sub-band signal are recombined in use to form the audio output
signal.
[0016] Although the decoder means may comprise a single decoding means covering both the
upper and lower sub-bands of the encoder, it is preferred for the decoder means to
comprise lower sub-band decoding means and upper sub-band decoding means, for receiving
and decoding the encoded lower and upper sub-band signals respectively.
[0017] In a particular preferred embodiment, said upper frequency band of said excitation
signal substantially wholly comprises a synthesised noise signal, although in other
embodiments the excitation signal may comprise a mixture of a synthesised noise component
and a further component corresponding to one or more harmonics of said lower sub-band
audio signal.
[0018] Conveniently, the upper sub-band coding means comprises means for analysing and encoding
said upper sub-band signal to obtain an upper sub-band energy or gain value and one
or more upper sub-band spectral parameters. The one or more upper sub-band spectral
parameters preferably comprise second order LPC coefficients.
[0019] Preferably, said encoder means includes means for measuring the noise energy in said
upper sub-band thereby to deduce said upper sub-band energy or gain value. Alternatively,
said encoder means may include means for measuring the whole energy in said upper
sub-band signal thereby to deduce said upper sub-band energy or gain value.
[0020] To save unnecessary usage of the bit rate, the system preferably includes means for
monitoring said energy in said upper sub-band signal and for comparing this with a
threshold derived from at least one of the upper and lower sub-band energies, and
for causing said upper sub-band encoding means to provide a minimum code output if
said monitored energy is below said threshold.
[0021] In arrangements intended primarily for speech coding, said lower sub-band coding
means may comprise a speech coder, including means for providing a voicing decision.
In these cases, said decoder means may include means responsive to the energy in said
upper band encoded signal and said voicing decision to adjust the noise energy in
said excitation signal dependent on whether the audio signal is voiced or unvoiced.
[0022] Where the system is intended primarily for music, said lower sub-band coding means
may comprise any of a number of suitable waveform coders, for example an MPEG audio
coder.
[0023] The division between the upper and lower sub-bands may be selected according to the
particular requirements, thus it may be about 2.75kHz, about 4kHz, about 5.5kHz, etc.
[0024] Said upper sub-band coding means preferably encodes said noise component with a very
low bit rate of less than 800 bps and preferably of about 300 bps.
[0025] Where the upper sub-band is analysed to obtain an energy or gain value and one or more
spectral parameters, said upper sub-band signal is preferably analysed with relatively
long frame periods to determine said spectral parameters and with relatively short
frame periods to determine said energy or gain value.
[0026] In another aspect this invention provides an audio coding method for encoding and
decoding an audio signal, which method comprises:
decomposing said audio signal into an upper and a lower sub-band signal;
encoding said lower sub-band signal;
parametric encoding at least the non-periodic component of said upper sub-band signal
according to a source-filter model, and
decoding said encoded lower sub-band signal and said encoded upper sub-band signal
to reconstruct an audio output signal;
wherein said decoding step includes providing an excitation signal which includes
a substantial component of synthesised noise in an upper frequency band corresponding
to the upper sub-band of said audio signal, passing said excitation signal through
a filter means to produce a synthesised upper sub-band signal, and recombining said
synthesised upper sub-band signal and the decoded lower sub-band signal to form the
audio output signal.
[0027] In another aspect, the invention provides a system and associated method for very low bit rate coding in which the input signal is split into sub-bands, respective vocoder coefficients are obtained, and these are then recombined to form a single LPC filter.
[0028] Accordingly in this aspect, the invention provides a coder system for encoding and
decoding a speech signal, said system comprising encoder means and decoder means,
said encoder means including:-
filter means for decomposing said speech signal into lower and upper sub-bands together
defining a bandwidth of at least 5.5 kHz;
lower sub-band vocoder analysis means for performing a high order vocoder analysis
on said lower sub-band to obtain vocoder coefficients representative of said lower
sub-band;
upper sub-band vocoder analysis means for performing a low order vocoder analysis
on said upper sub-band to obtain vocoder coefficients representative of said upper
sub-band;
coding means for coding vocoder parameters including said lower and upper sub-band
coefficients to provide a compressed signal for storage and/or transmission, and
said decoder means including:-
decoding means for decoding said compressed signal to obtain a set of vocoder parameters
combining said lower and upper sub-band vocoder coefficients;
synthesising means for constructing an LPC filter from the set of vocoder parameters
and re-synthesising said speech signal from said filter and from an excitation signal.
[0029] Preferably said lower sub-band analysis means applies tenth order LPC analysis and
said upper sub-band analysis means applies second order LPC analysis.
[0030] The invention also extends to audio encoders and audio decoders for use with the
above systems, and to corresponding methods.
BRIEF DESCRIPTION OF THE DRAWINGS
[0031] The invention may be performed in various ways, and, by way of example only, two
embodiments and various modifications thereof will now be described in detail, reference
being made to the accompanying drawings, in which:-
Figure 1 is a block diagram of an encoder of a first embodiment of a wideband codec in accordance with this invention;
Figure 2 is a block diagram of a decoder of the first embodiment of a wideband codec in accordance with this invention;
Figure 3 shows spectra illustrating the result of the encoding-decoding process implemented in the first embodiment;
Figure 4 is a spectrogram of a male speaker;
Figure 5 is a block diagram of the speech model assumed by a typical vocoder;
Figure 6 is a block diagram of an encoder of a second embodiment of a codec in accordance with this invention;
Figure 7 shows two sub-band short-time spectra for an unvoiced speech frame sampled at 16 kHz;
Figure 8 shows two sub-band LPC spectra for the unvoiced speech frame of Figure 7;
Figure 9 shows the combined LPC spectrum for the unvoiced speech frame of Figures 7 and 8;
Figure 10 is a block diagram of a decoder of the second embodiment of a codec in accordance with this invention;
Figure 11 is a block diagram of an LPC parameter coding scheme used in the second embodiment of this invention, and
Figure 12 shows a preferred weighting scheme for the LSP predictor employed in the second embodiment of this invention.
[0032] In this description we describe two different embodiments of the invention, both
of which utilise sub-band coding. In the first embodiment, a coding scheme is implemented
in which only the noise component of the upper band is encoded and resynthesized in
the decoder.
[0033] The second embodiment employs an LPC vocoder scheme for both the lower and upper
sub-bands to obtain parameters which are combined to produce a combined set of LPC
parameters for controlling an all pole filter.
[0034] By way of introduction to the first embodiment, current audio and speech coders,
if given an input signal with an extended bandwidth, simply bandlimit the input signal
before coding. The technology described here allows the extended bandwidth to be encoded
at a bit rate insignificant compared to the main coder. It does not attempt to fully
reproduce the upper sub-band, but still provides an encoding that considerably enhances
the quality (and intelligibility for speech) of the main bandlimited signal.
[0035] The upper band is modelled in the usual way as an all-pole filter driven by an excitation
signal. Only one or two parameters are needed to describe the spectrum. The excitation
signal is considered to be a combination of white noise and periodic components, the
latter possibly having very complex relationships to one another (true for most music).
In the most general form of the codec described below, the periodic components are
effectively discarded. All that is transmitted is the estimated energy of the noise
component and the spectral parameters; at the decoder, white noise alone is used to
drive the all-pole filter.
[0036] The key and original concept is that the encoding of the upper band is completely
parametric - no attempt is made to encode the excitation signal itself. The only parameters
encoded are the spectral parameters and an energy parameter.
[0037] This aspect of the invention may be implemented either as a new form of coder or
as a wideband extension to an existing coder. Such an existing coder may be supplied
by a third party, or perhaps is already available on the same system (eg ACM codecs
in Windows95/NT). In this sense it acts as a parasite to that codec, using it to do
the encoding of the main signal, but producing a better quality signal than the narrowband
codec can by itself. An important characteristic of using only white noise to synthesize
the upper band is that it is trivial to add together the two bands - they only have
to be aligned to within a few milliseconds, and there are no phase continuity issues
to solve. Indeed, we have produced numerous demonstrations using different codecs
and had no difficulty aligning the signals.
[0038] The invention may be used in two ways. One is to improve the quality of an existing
narrowband (4kHz) coder by extending the input bandwidth, with a very small increase
in bit rate. The other is to produce a lower bit rate coder by operating the lower
band coder on a smaller input bandwidth (typically 2.75kHz), and then extending it
to make up for the lost bandwidth (typically to 5.5kHz).
[0039] Figures 1 and 2 illustrate an encoder 10 and decoder 12 respectively for a first
embodiment of the codec. Referring initially to Figure 1, the input audio signal passes
to a low-pass filter 14 where it is low pass filtered to form a lower sub-band signal
and decimated, and also to a high-pass filter 16 where it is high pass filtered to
form an upper sub-band signal and decimated.
[0040] The filters need to have both a sharp cutoff and good stop-band attenuation. To achieve
this, either 73 tap FIR filters or 8th order elliptic filters are used, depending
on which can run faster on the processor used. The stopband attenuation should be
at least 40dB and preferably 60dB, and the pass band ripple small - 0.2dB at most.
The 3dB point for the filters should be the target split point (4kHz typically).
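By way of illustration only, the band split might be sketched as follows; a 16 kHz input, a 4 kHz split point, scipy's elliptic filter design and decimation by simple sample dropping are assumptions of this sketch rather than details of the embodiment.

```python
# Sketch of the analysis band split (assumes 16 kHz input, 4 kHz split point).
# The 0.2 dB ripple / 60 dB stop-band figures follow the text above; the
# embodiment may instead use the 73-tap FIR realisation.
import numpy as np
from scipy import signal

FS = 16000          # input sample rate (Hz)
SPLIT = 4000        # target split point (Hz)

# 8th-order elliptic low-pass and high-pass filters.
b_lo, a_lo = signal.ellip(8, 0.2, 60, SPLIT, btype='lowpass', fs=FS)
b_hi, a_hi = signal.ellip(8, 0.2, 60, SPLIT, btype='highpass', fs=FS)

def split_bands(x):
    """Return (lower, upper) sub-band signals, each decimated to 8 kHz."""
    low = signal.lfilter(b_lo, a_lo, x)[::2]    # lower sub-band, 0-4 kHz
    high = signal.lfilter(b_hi, a_hi, x)[::2]   # upper sub-band, 4-8 kHz (mirrored)
    return low, high
```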
[0041] The lower sub-band signal is supplied to a narrowband encoder 18. The narrowband
encoder may be a vocoder or a waveform coder. The upper sub-band signal is supplied
to an upper sub-band analyser 20 which analyses the spectrum of the upper sub-band
to determine parametric coefficients and its noise component, as to be described below.
[0042] The spectral parameters and the log of the noise energy value are quantised, subtracted
from their previous values (i.e. differentially encoded) and supplied to a Rice coder
22 for coding and then combined with the coded output from the narrowband encoder
18.
[0043] In the decoder 12, the spectral parameters are obtained from the coded data and applied
to a spectral shape filter 23. The filter 23 is excited by a synthetic white noise
signal to produce a synthesized non-harmonic upper sub-band signal whose gain is adjusted
in accordance with the noise energy value at 24. The synthesised signal then passes
to a processor 26 which interpolates the signal and reflects it to the upper sub-band.
The encoded data representing the lower sub-band signal passes to a narrowband decoder
30 which decodes the lower sub-band signal which is interpolated at 32 and then recombined
at 34 to form the synthesized output signal.
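A minimal sketch of this upper-band synthesis path follows; the function name, the frame-based energy matching and the use of an elliptic high-pass filter to "reflect" the signal into the upper band are illustrative assumptions rather than details of the decoder.

```python
# Sketch of the upper-band resynthesis of Figure 2: white noise is shaped by
# the decoded two-coefficient all-pole filter, scaled to the decoded energy,
# interpolated by zero insertion and high-pass filtered to keep the mirrored
# 4-8 kHz image.
import numpy as np
from scipy import signal

FS_OUT = 16000
b_hp, a_hp = signal.ellip(8, 0.2, 60, 4000, btype='highpass', fs=FS_OUT)

def synth_upper_band(a_hi, noise_energy, n_samples, seed=0):
    """Synthesise one frame of the upper sub-band at the 16 kHz output rate."""
    rng = np.random.default_rng(seed)
    excitation = rng.standard_normal(n_samples)               # white-noise excitation
    shaped = signal.lfilter([1.0], np.concatenate(([1.0], a_hi)), excitation)
    shaped *= np.sqrt(noise_energy / np.mean(shaped ** 2))    # impose the decoded energy
    up = np.zeros(2 * n_samples)
    up[::2] = shaped                        # zero insertion creates a mirrored image
    return signal.lfilter(b_hp, a_hp, up)   # keep only the 4-8 kHz image

# The decoded lower band, interpolated to 16 kHz, is simply added to this signal;
# because the upper band contains only noise there is no phase alignment to solve.
```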
[0044] In the above embodiment, Rice coding is only appropriate if the storage/transmission
mechanism can support variable bit-rate coding, or tolerate a large enough latency
to allow the data to be blocked into fixed-sized packets. Otherwise a conventional
quantisation scheme can be used without affecting the bit rate too much.
[0045] The result of the whole encoding-decoding process is illustrated in the spectra in
Figure 3, where the upper one is a frame containing both noise and strong harmonic
components from 'Nikita' by Elton John, and the lower one is the same frame with the 4-8 kHz region encoded
using the wideband extension described above.
[0046] Referring now in more detail to the spectral and noise component analysis of the
upper sub-band, the spectral analysis derives two LPC coefficients using the standard
autocorrelation method, which is guaranteed to produce a stable filter. For quantisation,
the LPC coefficients are converted into reflection coefficients and quantised with
nine levels each. These LPC coefficients are then used to inverse filter the waveform
to produce a whitened signal for the noise component analysis.
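The upper sub-band analysis described above might be sketched as follows; the sign conventions and the uniform nine-level quantiser range are illustrative assumptions of this sketch.

```python
# Sketch of the upper sub-band spectral analysis: a 2nd-order LPC fit by the
# autocorrelation method, conversion to reflection coefficients, a 9-level
# quantiser, and inverse filtering to whiten the band for the noise analysis.
import numpy as np
from scipy import signal

def lpc2(x):
    """2nd-order LPC of x; returns ([a1, a2], [k1, k2]) with A(z) = 1 + a1*z^-1 + a2*z^-2."""
    r = np.array([np.dot(x, x), np.dot(x[:-1], x[1:]), np.dot(x[:-2], x[2:])])
    k1 = -r[1] / r[0]
    e1 = r[0] * (1.0 - k1 * k1)
    k2 = -(r[2] + k1 * r[1]) / e1
    a1 = k1 * (1.0 + k2)                                    # Levinson order-2 update
    return np.array([a1, k2]), np.array([k1, k2])

def quantise_reflection(k, levels=9):
    """Uniformly quantise reflection coefficients in (-1, 1) to `levels` steps."""
    step = 2.0 / levels
    idx = np.clip(np.round((k + 1.0) / step - 0.5), 0, levels - 1)
    return idx.astype(int), (idx + 0.5) * step - 1.0        # indices and decoded values

def whiten(x, a):
    """Inverse filter A(z) = 1 + a1*z^-1 + a2*z^-2 applied to x."""
    return signal.lfilter(np.concatenate(([1.0], a)), [1.0], x)
```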
[0047] The noise component analysis can be done in a number of ways. For instance the upper
sub-band may be full-wave rectified, smoothed and analysed for periodicity as described
in McCree et al. However, the measurement is more easily made directly in the frequency domain. Accordingly, in the present embodiment a 256-point FFT is
performed on the whitened upper sub-band signal. The noise component energy is taken
to be the median of the FFT bin energies. This parameter has the important property
that if the signal is completely noise, the expected value of the median is just the
energy of the signal. But if the signal has periodic components, then so long as the
average spacing is greater than twice the frequency resolution of the FFT, the median
will fall between the peaks in the spectrum. But if the spacing is very tight, the
ear will notice little difference if white noise is used instead.
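A sketch of this noise-component measurement is given below; the frame is assumed to have already been whitened by the inverse LPC filter, and the returned fractional noise component anticipates the shorter-interval scaling described in the next paragraph.

```python
# Sketch of the noise-component measurement: a 256-point FFT of the whitened
# upper sub-band frame, with the median FFT bin energy taken as the noise
# estimate. Scaling conventions are illustrative.
import numpy as np

def noise_energy(whitened_frame, n_fft=256):
    """Estimate the noise-component energy of a whitened sub-band frame."""
    spectrum = np.fft.rfft(whitened_frame, n_fft)
    bin_energy = np.abs(spectrum) ** 2
    median_energy = np.median(bin_energy)       # falls between isolated harmonic peaks
    total_energy = np.mean(bin_energy)
    fraction = median_energy / max(total_energy, 1e-12)   # fractional noise component
    return median_energy, fraction
```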
[0048] For speech (and some audio signals), it is necessary to perform the noise energy
calculation over a shorter interval than the LPC analysis. This is because of the sharp attack of plosives; the LPC analysis itself can use a longer interval because unvoiced spectra do not change very quickly. In
this case, the ratio of the median to the energy of the FFT, i.e. the fractional noise
component, is measured. This is then used to scale all the measured energy values
for that analysis period.
[0049] The noise/periodic distinction is an imperfect one, and the noise component analysis
itself is imperfect. To allow for this, the upper sub-band analyser 20 may scale the
energy in the upper band by a fixed factor of about 50%. Compared with the original signal, the decoded extended signal then sounds as if the treble control has been turned down somewhat.
But the difference is negligible compared to the complete removal of the treble in
the unextended decoded signal.
[0050] The noise component is not usually worth reproducing when it is small compared to
the harmonic energy in the upper band, or very small compared to the energy in the
lower band. In the first case it is in any case hard to measure the noise component
accurately because of the signal leakage between FFT bins. To some degree this is
also true in the second case because of the finite attenuation in the stopband of
the low-band filter. So in a modification of this embodiment the upper sub-band analyser
20 may compare the measured upper sub-band noise energy against a threshold derived
from at least one of the upper and lower sub-band energies and, if it is below the
threshold, the noise floor energy value is transmitted instead. The noise floor energy
is an estimate of the background noise level in the upper band and would normally
be set equal to the lowest upper band energy measured since the start of the output
signal.
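The decision logic of this modification might be sketched as follows; the two relative thresholds are illustrative placeholder values, as the text does not specify exactly how the threshold is derived from the band energies.

```python
# Sketch of the "minimum code output" logic: if the measured upper-band noise
# energy is small relative to the upper-band harmonic energy or to the
# lower-band energy, the noise-floor estimate is transmitted instead.
def select_upper_band_energy(noise_energy, upper_energy, lower_energy, noise_floor,
                             harmonic_ratio=0.1, lower_ratio=0.001):
    threshold = max(harmonic_ratio * (upper_energy - noise_energy),
                    lower_ratio * lower_energy)
    if noise_energy < threshold:
        return noise_floor          # transmit the background-noise estimate instead
    return noise_energy

# The noise floor itself would normally track the lowest upper-band energy seen
# since the start of the signal, e.g. noise_floor = min(noise_floor, upper_energy).
```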
[0051] Turning now to the performance of this embodiment, Figure 4 is a spectrogram of
a male speaker. The vertical axis, frequency, stretches to 8000Hz, twice the range
of standard telephony coders (4kHz). The darkness of the plot indicates signal strength
at that frequency. The horizontal axis is time.
[0052] It will be seen that above 4kHz the signal is mostly noise from fricatives or plosives,
or not there at all. In this case the wideband extension produces an almost perfect
reproduction of the upper band.
[0053] For some female and children's voices, the frequency at which the voiced speech has
lost most of its energy is higher than 4kHz. Ideally in this case, the band split
should be done a little higher (5.5kHz would be a good choice). But even if this is
not done, the quality is still better than an unextended codec during unvoiced speech,
and for voiced speech it is exactly the same. Also the gain in intelligibility comes
through good reproduction of the fricatives and plosives, not through better reproduction
of the vowels, so the split point affects only the quality, not the intelligibility.
[0054] For reproduction of music, the effectiveness of the wideband extension depends somewhat
on the kind of music. For rock/pop where the most noticeable upper band components
are from the percussion, or from the "softness" of the voice (particularly for females),
the noise-only synthesis works very well, even enhancing the sound in places. Other
music has only harmonic components in the upper band - piano for instance. In this
case nothing is reproduced in the upper band. However, subjectively the lack of higher
frequencies seems less important for sounds where there are a lot of lower frequency
harmonics.
[0055] Referring now to the second embodiment of the codec, which will be described with reference to Figures 5 to 12, this embodiment is based on the same principles as the
well-known LPC10 vocoder (as described in T. E. Tremain "The Government Standard Linear
Predictive Coding Algorithm: LPC10"; Speech Technology, pp 40-49, 1982), and the speech
model assumed by the LPC10 vocoder is shown in Figure 5. The vocal tract, which is
modeled as an all-pole filter 110, is driven by a periodic excitation signal 112 for
voiced speech and random white noise 114 for unvoiced speech.
[0056] The vocoder consists of two parts, the encoder 116 and the decoder 118. The encoder
116, shown in Figure 6, splits the input speech into frames equally spaced in time.
Each frame is then split into bands corresponding to the 0-4 kHz and 4-8 kHz regions
of the spectrum. This is achieved in a computationally efficient manner using 8th-order
elliptic filters. High-pass and low-pass filters 120 and 122 respectively are applied
and the resulting signals decimated to form the two sub-bands. The upper sub-band
contains a mirrored form of the 4-8 kHz spectrum. Ten Linear Predictive Coding (LPC)
coefficients are computed at 124 from the lower sub-band, and two LPC coefficients
are computed at 126 from the high-band, as well as a gain value for each band. Figures
7 and 8 show the two sub-band short-term spectra and the two sub-band LPC spectra
respectively for a typical unvoiced signal at a sample rate of 16 kHz and Figure 9
shows the combined LPC spectrum. A voicing decision 128 and pitch value 130 for voiced
frames are also computed from the lower sub-band. (The voicing decision can optionally
use upper sub-band information as well). The ten low-band LPC parameters are transformed
to Line Spectral Pairs (LSPs) at 132, and then all the parameters are coded using
a predictive quantiser 134 to give the low-bit-rate data stream.
[0057] The decoder 118 shown in Figure 10 decodes the parameters at 136 and, during voiced
speech, interpolates between parameters of adjacent frames at the start of each pitch
period. The ten lower sub-band LSPs are then converted to LPC coefficients at 138
before combining them at 140 with the two upper sub-band coefficients to produce a
set of eighteen LPC coefficients. This is done using an Autocorrelation Domain Combination
technique or a Power Spectral Domain Combination technique to be described below.
The LPC parameters control an all-pole filter 142, which is excited with either white
noise or an impulse-like waveform periodic at the pitch period from an excitation
signal generator 144 to emulate the model shown in Figure 5. Details of the voiced
excitation signal are given below.
[0058] The particular implementation of the second embodiment of the vocoder will now be
described. For a more detailed discussion of various aspects, attention is directed
to L. Rabiner and R.W. Schafer, 'Digital Processing of Speech Signals', Prentice Hall,
1978.
LPC Analysis
[0059] A standard autocorrelation method is used to derive the LPC coefficients and gain
for both the lower and upper sub-bands. This is a simple approach which is guaranteed
to give a stable all-pole filter; however, it has a tendency to over-estimate formant
bandwidths. This problem is overcome in the decoder by adaptive formant enhancement
as described in A.V. McCree and T.P. Barnwell III, 'A mixed excitation lpc vocoder
model for low bit rate speech encoding', IEEE Trans. Speech and Audio Processing,
vol.3, pp.242-250, July 1995, which enhances the spectrum around the formants by filtering
the excitation sequence with a bandwidth-expanded version of the LPC synthesis (all-pole)
filter. To reduce the resulting spectral tilt, a weaker all-zero filter is also applied.
The overall filter has a transfer function H(z) = A(z/0.5)/A(z/0.8), where A(z) is the transfer function of the all-pole filter.
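Since A(z/γ) is obtained simply by scaling the k-th LPC coefficient by γ^k, this enhancement filter can be sketched as follows; the function names are illustrative and not part of the embodiment.

```python
# Sketch of the adaptive spectral enhancement filter H(z) = A(z/0.5)/A(z/0.8):
# the excitation is filtered by an all-zero A(z/0.5) and an all-pole 1/A(z/0.8).
import numpy as np
from scipy import signal

def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma) given A(z) = 1 + a[1]z^-1 + ... (a[0] == 1)."""
    return a * gamma ** np.arange(len(a))

def enhance_excitation(excitation, a):
    num = bandwidth_expand(a, 0.5)   # weaker all-zero part, reduces spectral tilt
    den = bandwidth_expand(a, 0.8)   # bandwidth-expanded all-pole part
    return signal.lfilter(num, den, excitation)
```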
Resynthesis LPC Model
[0060] To avoid potential problems due to discontinuity between the power spectra of the
two sub-band LPC models, and also due to the discontinuity of the phase response,
a single high-order resynthesis LPC model is generated from the sub-band models. From
this model, for which an order of 18 was found to be suitable, speech can be synthesised
as in a standard LPC vocoder. Two approaches are described here, the second being
the computationally simpler method.
[0061] In the following, subscripts L and H will be used to denote features of hypothesised low-pass and high-pass filtered versions of the wide-band signal respectively (assuming filters having cut-offs at 4 kHz, with unity response inside the pass band and zero outside), and subscripts l and h are used to denote features of the lower and upper sub-band signals respectively.
Power Spectral Domain Combination
[0062] The power spectral densities of the filtered wide-band signals, P_L(ω) and P_H(ω), may be calculated from the sub-band LPC models as

P_L(ω/2) = g_l² / |1 + Σ_{n=1}^{p_l} a_l(n) e^{-jnω}|²

and

P_H(π - ω/2) = g_h² / |1 + Σ_{n=1}^{p_h} a_h(n) e^{-jnω}|²,   0 ≤ ω ≤ π,

where a_l(n), a_h(n) and g_l, g_h are the LPC parameters and gains respectively from a frame of speech and p_l, p_h are the LPC model orders. The term π-ω/2 occurs because the upper sub-band spectrum is mirrored.
[0063] The power spectral density of the wide-band signal, P_W(ω), is given by

P_W(ω) = P_L(ω) + P_H(ω).
[0064] The autocorrelation of the wide-band signal is given by the inverse discrete-time Fourier transform of P_W(ω), and from this the (18th-order) LPC model corresponding to a frame of the wide-band
signal can be calculated. For a practical implementation, the inverse transform is
performed using an inverse discrete Fourier transform (DFT). However this leads to
the problem that a large number of spectral values are needed (typically 512) to give
adequate frequency resolution, resulting in excessive computational requirements.
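A sketch of this power-spectral-domain combination follows; the 512-point grid, the frequency mapping implied by the mirrored upper band and the Toeplitz solve used in place of an explicit Levinson recursion are assumptions of this sketch.

```python
# Sketch of the power-spectral-domain combination: the two sub-band LPC power
# spectra are mapped onto a common wide-band frequency grid (the upper band
# mirrored), summed, inverse-transformed to an autocorrelation and refitted
# with an 18th-order LPC model.
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_power_spectrum(a, g, theta):
    """g^2 / |1 + sum a[n] e^{-j n theta}|^2 evaluated at sub-band frequencies theta."""
    n = np.arange(1, len(a) + 1)
    A = 1.0 + np.exp(-1j * np.outer(theta, n)) @ a
    return g * g / np.abs(A) ** 2

def combine_psd(a_low, g_low, a_high, g_high, order=18, n_grid=512):
    w = np.linspace(0.0, np.pi, n_grid, endpoint=False)              # wide-band frequencies
    p_low = lpc_power_spectrum(a_low, g_low, 2.0 * w)                # lower band: theta = 2w
    p_low[w >= np.pi / 2] = 0.0
    p_high = lpc_power_spectrum(a_high, g_high, 2.0 * (np.pi - w))   # mirrored upper band
    p_high[w < np.pi / 2] = 0.0
    p_wide = p_low + p_high
    # Real, even spectrum -> autocorrelation by inverse FFT (Nyquist bin approximated).
    r = np.fft.irfft(np.concatenate((p_wide, [p_wide[-1]])))[: order + 1]
    a_wide = solve_toeplitz((r[:order], r[:order]), -r[1 : order + 1])
    return np.concatenate(([1.0], a_wide))                           # 18th-order A(z)
```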
Autocorrelation Domain Combination
[0065] For this approach, instead of calculating the power spectral densities of low-pass
and high-pass versions of the wide-band signal, the autocorrelations,
r_L(τ) and r_H(τ), are generated. The low-pass filtered wide-band signal is equivalent to the lower
sub-band up-sampled by a factor of 2. In the time-domain this up-sampling consists
of inserting alternate zeros (interpolating), followed by low-pass filtering. Therefore
in the autocorrelation domain, up-sampling involves interpolation followed by filtering
by the autocorrelation of the low-pass filter impulse response.
[0066] The autocorrelations of the two sub-band signals can be efficiently calculated from the sub-band LPC models (see for example R.A. Roberts and C.T. Mullis, 'Digital Signal Processing', chapter 11, p. 527, Addison-Wesley, 1987). If r_l(m) denotes the autocorrelation of the lower sub-band, then the interpolated autocorrelation, r'_l(m), is given by:

r'_l(m) = r_l(m/2) for m even,   r'_l(m) = 0 for m odd.

The autocorrelation of the low-pass filtered signal, r_L(m), is:

r_L(m) = Σ_k r'_l(k) Σ_n h(n) h(n + m - k)

where h(m) is the low-pass filter impulse response. The autocorrelation of the high-pass filtered signal, r_H(m), is found similarly, except that a high-pass filter is applied.
[0067] The autocorrelation of the wide-band signal, r_W(m), can be expressed:

r_W(m) = r_L(m) + r_H(m)

and hence the wide-band LPC model calculated. Figure 9 shows the resulting LPC spectrum for the frame of unvoiced speech considered above.
[0068] Compared with combination in the power spectral domain, this approach has the advantage
of being computationally simpler. FIR filters of order 30 were found to be sufficient
to perform the upsampling. In this case, the poor frequency resolution implied by
the lower order filters is adequate because this simply results in spectral leakage
at the crossover between the two sub-bands. Both approaches result in speech perceptually very similar to that obtained by using a high-order analysis model on the wide-band speech.
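The autocorrelation-domain combination might be sketched as follows; for brevity the sub-band autocorrelations are computed directly from the sub-band signals rather than from the sub-band LPC models as described above, and the 31-tap firwin filters stand in for the order-30 FIR filters mentioned in the preceding paragraph.

```python
# Sketch of the autocorrelation-domain combination: each sub-band autocorrelation
# is interpolated with zeros (up-sampling by 2), filtered by the autocorrelation
# of the corresponding band filter, summed and refitted with an 18th-order model.
import numpy as np
from scipy import signal
from scipy.linalg import solve_toeplitz

def autocorr(x, lags):
    return np.array([np.dot(x[: len(x) - m], x[m:]) for m in range(lags + 1)])

def combine_autocorr(low, high, order=18, n_taps=31, fs=16000):
    h_lp = signal.firwin(n_taps, 4000, fs=fs)                    # low-pass h(m)
    h_hp = signal.firwin(n_taps, 4000, fs=fs, pass_zero=False)   # high-pass h(m)
    r_bands = []
    for band, h in ((low, h_lp), (high, h_hp)):
        r_sub = autocorr(band, order + n_taps)
        two_sided = np.concatenate((r_sub[:0:-1], r_sub))        # symmetric autocorrelation
        r_interp = np.zeros(2 * len(two_sided) - 1)
        r_interp[::2] = two_sided                                # insert alternate zeros
        r_filt = np.correlate(h, h, mode='full')                 # autocorrelation of h(m)
        r_full = np.convolve(r_interp, r_filt)
        centre = (len(r_full) - 1) // 2                          # lag-0 position
        r_bands.append(r_full[centre : centre + order + 1])
    r_wide = r_bands[0] + r_bands[1]                             # r_W(m) = r_L(m) + r_H(m)
    a = solve_toeplitz((r_wide[:order], r_wide[:order]), -r_wide[1 : order + 1])
    return np.concatenate(([1.0], a))
```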
[0069] From the plots for a frame of unvoiced speech shown in Figures 7, 8 and 9, the effect of including the upper-band spectral information is particularly evident, as most of the signal energy is contained within the upper band.
Pitch/Voicing Analysis
[0070] Pitch is determined using a standard pitch tracker. For each frame determined to
be voiced, a pitch function, which is expected to have a minimum at the pitch period,
is calculated over a range of time intervals. Three different functions have been
implemented, based on autocorrelation, the Averaged Magnitude Difference Function
(AMDF) and the negative Cepstrum. They all perform well; the most computationally
efficient function to use depends on the architecture of the coder's processor. Over
each sequence of one or more voiced frames, the minima of the pitch function are selected
as the pitch candidates. The sequence of pitch candidates which minimizes a cost function
is selected as the estimated pitch contour. The cost function is the weighted sum
of the pitch function and changes in pitch along the path. The best path may be found
in a computationally efficient manner using dynamic programming.
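One of the three pitch functions (the AMDF) and the per-frame candidate selection might be sketched as follows; the lag range assumes the 8 kHz lower sub-band and frames longer than the maximum lag, and the dynamic-programming search across frames is not shown.

```python
# Sketch of an AMDF pitch function and selection of its minima as pitch candidates.
import numpy as np

def amdf(frame, min_lag=20, max_lag=160):
    """Averaged Magnitude Difference Function over candidate lags (8 kHz lower band)."""
    lags = np.arange(min_lag, max_lag + 1)
    return lags, np.array([np.mean(np.abs(frame[lag:] - frame[:-lag])) for lag in lags])

def pitch_candidates(frame, n_best=3):
    lags, d = amdf(frame)
    interior = np.where((d[1:-1] < d[:-2]) & (d[1:-1] <= d[2:]))[0] + 1   # local minima
    best = interior[np.argsort(d[interior])][:n_best]
    return lags[best], d[best]          # candidate pitch periods and their costs
```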
[0071] The purpose of the voicing classifier is to determine whether each frame of speech
has been generated as the result of an impulse-excited or noise-excited model. There
is a wide range of methods which can be used to make a voicing decision. The method
adopted in this embodiment uses a linear discriminant function applied to the low-band
energy, the first autocorrelation coefficient of the low (and optionally high) band
and the cost value from the pitch analysis. For the voicing decision to work well
in high levels of background noise, a noise tracker (as described for example in
A. Varga and K. Ponting, 'Control Experiments on Noise Compensation in Hidden Markov
Model based Continuous Word Recognition', pp.167-170, Eurospeech 89) can be used to calculate the probability of noise, which is then included in the
linear discriminant function.
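The classifier itself reduces to a weighted sum and a threshold, as in the following sketch; the weights and threshold shown are placeholders for trained values, not figures from the embodiment.

```python
# Sketch of the voicing decision as a linear discriminant on the analysis features.
import numpy as np

def is_voiced(features, weights=np.array([0.8, 2.0, -1.5]), threshold=0.0):
    """features = [low-band energy, first low-band autocorrelation coeff, pitch cost]."""
    return float(np.dot(weights, features)) > threshold
```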
Parameter Encoding
Voicing Decision
[0072] The voicing decision is simply encoded at one bit per frame. It is possible to reduce
this by taking into account the correlation between successive voicing decisions,
but the reduction in bit rate is small.
Pitch
[0073] For unvoiced frames, no pitch information is coded. For voiced frames, the pitch
is first transformed to the log domain and scaled by a constant (e.g. 20) to give
a perceptually-acceptable resolution. The difference between transformed pitch at
the current and previous voiced frames is rounded to the nearest integer and then
encoded.
Gains
[0074] The method of coding the log pitch is also applied to the log gain, appropriate scaling
factors being 1 and 0.7 for the low and high band respectively.
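The log-domain differential coding of the pitch and gain parameters might be sketched as follows; the choice of natural logarithm and the handling of the encoder-side state are assumptions of this sketch.

```python
# Sketch of the log-domain differential coding used for pitch and gains: transform
# to the log domain, scale (20 for pitch, 1.0 and 0.7 for the low- and high-band
# gains), and round the difference from the previous decoded value to an integer.
import math

def code_log_parameter(value, previous_value, scale):
    """Return the integer difference symbol and the locally decoded (transformed) value."""
    target = scale * math.log(value)
    symbol = round(target - previous_value)     # integer symbol passed to the Rice coder
    decoded = previous_value + symbol           # what the decoder will reconstruct
    return symbol, decoded

# e.g. pitch:         symbol, state_p = code_log_parameter(pitch_period, state_p, 20.0)
#      low-band gain: symbol, state_g = code_log_parameter(gain_low, state_g, 1.0)
```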
LPC Coefficients
[0075] The LPC coefficients account for the majority of the encoded data. The LPC coefficients
are first converted to a representation which can withstand quantisation, i.e. one
with guaranteed stability and low distortion of the underlying formant frequencies
and bandwidths. The upper sub-band LPC coefficients are coded as reflection coefficients,
and the lower sub-band LPC coefficients are converted to Line Spectral Pairs (LSPs)
as described in
F. Itakura, 'Line spectrum representation of linear predictor coefficients of speech
signals', J. Acoust. Soc. Am., vol. 57, S35(A), 1975. The upper sub-band coefficients are coded in exactly the same way as the log pitch
and log gain, i.e. encoding the difference between consecutive values, an appropriate
scaling factor being 5.0. The coding of the low-band coefficients is described below.
Rice Coding
[0076] In this particular embodiment, parameters are quantised with a fixed step size and
then encoded using lossless coding. The method of coding is a Rice code (as described
in
R. F. Rice & J.R. Plaunt, 'Adaptive variable-length coding for efficient compression
of spacecraft television data', IEEE Transactions on Communication Technology, vol.19,
no.6, pp.889-897, 1971), which assumes a Laplacian density of the differences. This code assigns a number
of bits which increases with the magnitude of the difference. This method is suitable
for applications which do not require a fixed number of bits to be generated per frame,
but a fixed bit-rate scheme similar to the LPC10e scheme could be used.
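A Rice code of this kind might be sketched as follows; the zig-zag mapping of signed differences to unsigned integers and the fixed Rice parameter k are illustrative choices rather than details of the embodiment.

```python
# Sketch of a Rice code for signed, quantised parameter differences: unary-coded
# quotient followed by a k-bit remainder, so the code length grows with magnitude.
def rice_encode(value, k=2):
    """Encode a signed integer as a bit string."""
    u = 2 * value if value >= 0 else -2 * value - 1     # zig-zag: fold sign to unsigned
    q, r = u >> k, u & ((1 << k) - 1)
    return '1' * q + '0' + format(r, f'0{k}b')

def rice_decode(bits, k=2):
    """Decode one value from the start of `bits`; return (value, bits consumed)."""
    q = bits.index('0')
    r = int(bits[q + 1 : q + 1 + k], 2)
    u = (q << k) | r
    value = u // 2 if u % 2 == 0 else -(u + 1) // 2
    return value, q + 1 + k
```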
Voiced Excitation
[0077] The voiced excitation is a mixed excitation signal consisting of noise and periodic
components added together. The periodic component is the impulse response of a pulse
dispersion filter (as described in McCree et al) passed through a periodic weighting
filter. The noise component is random noise passed through a noise weighting filter.
[0078] The periodic weighting filter is a 20th order Finite Impulse Response (FIR) filter,
designed with breakpoints (in kHz) and amplitudes:
b.p. | 0   | 0.4 | 0.6   | 1.3  | 2.3 | 3.4 | 4.0 | 8.0
amp  | 1.0 | 1.0 | 0.975 | 0.93 | 0.8 | 0.6 | 0.5 | 0.5
[0079] The noise weighting filter is a 20th order FIR filter with the opposite response,
so that together they produce a uniform response over the whole frequency band.
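The two weighting filters might be designed as in the following sketch using scipy's firwin2; treating the "opposite response" of the noise filter as the amplitude complement (1 - amplitude) is an assumption of this sketch, as is the combination with the pulse dispersion filter shown.

```python
# Sketch of the mixed-excitation shaping: a 20th-order FIR periodic weighting
# filter fitted to the breakpoint/amplitude table above, and a complementary
# noise weighting filter.
import numpy as np
from scipy import signal

FS = 16000
breakpoints = [0.0, 0.4, 0.6, 1.3, 2.3, 3.4, 4.0, 8.0]        # kHz
amplitude   = [1.0, 1.0, 0.975, 0.93, 0.8, 0.6, 0.5, 0.5]

freqs = [1000.0 * f for f in breakpoints]                      # Hz, ends at Nyquist
h_periodic = signal.firwin2(21, freqs, amplitude, fs=FS)       # 20th-order FIR
h_noise = signal.firwin2(21, freqs, [1.0 - a for a in amplitude], fs=FS)

def mixed_excitation(pulse_train, noise, dispersion_filter):
    periodic = signal.lfilter(h_periodic, [1.0],
                              signal.lfilter(dispersion_filter, [1.0], pulse_train))
    return periodic + signal.lfilter(h_noise, [1.0], noise)
```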
LPC Parameter Encoding
[0080] In this embodiment prediction is used for the encoding of the Line Spectral Pair Frequencies (LSFs), and the prediction may be adaptive. Although vector quantisation
could be used, scalar encoding has been used to save both computation and storage.
Figure 11 shows the overall coding scheme. In the LPC parameter encoder 146 the input
l_i(t) is applied to an adder 148 together with the negative of an estimate l̂_i(t) from the predictor 150 to provide a prediction error which is quantised by a quantiser 152. The quantised prediction error is Rice encoded at 154 to provide an output, and is also supplied to an adder 156 together with the output from the predictor 150 to provide the input to the predictor 150.
[0081] In the LPC parameter decoder 158, the error signal is Rice decoded at 160 and supplied
to an adder 162 together with the output from a predictor 164. The sum from the adder
162, corresponding to an estimate of the current LSF component, is output and also
supplied to the input of the predictor 164.
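For a single LSF component the encoder/decoder loop of Figure 11 might be sketched as follows; the predict function is a stand-in for the predictor described below, and the scaling factor of 160 anticipates the value quoted in paragraph [0087].

```python
# Sketch of the predictive scalar quantisation loop of Figure 11: the predictor
# output is subtracted, the error is uniformly quantised and Rice coded, and the
# locally decoded value is fed back so encoder and decoder predictors stay in step.
def encode_lsf(l, predict, history, scale=160.0):
    estimate = predict(history)                    # l_hat_i(t) from decoded values
    error = l - estimate
    symbol = round(error * scale)                  # uniform quantisation -> Rice coder
    decoded = estimate + symbol / scale            # value the decoder will recover
    history.append(decoded)
    return symbol, decoded

def decode_lsf(symbol, predict, history, scale=160.0):
    decoded = predict(history) + symbol / scale
    history.append(decoded)
    return decoded
```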
LSF Prediction
[0082] The prediction stage estimates the current LSF component from data currently available
to the decoder. The variance of the prediction error is expected to be lower than
that of the original values, and hence it should be possible to encode this at a lower
bit rate for a given average error.
[0083] Let the LSF element i at time t be denoted l_i(t) and the LSF element recovered by the decoder be denoted l̃_i(t). If the LSFs are encoded sequentially in time and in order of increasing index within a given time frame, then to predict l_i(t), the following values are available:

l̃_j(t - τ) for 1 ≤ j ≤ 10 and τ ≥ 1,

and

l̃_j(t) for 1 ≤ j ≤ i - 1.

Therefore a general linear LSF predictor can be written

l̂_i(t) = Σ_{τ≥0} Σ_j a_ij(τ) l̃_j(t - τ),   with a_ij(0) = 0 for j ≥ i,

where a_ij(τ) is the weighting associated with the prediction of l̂_i(t) from l̃_j(t - τ).
[0084] In general only a small set of values of a_ij(τ) should be used, as a high-order predictor is computationally less efficient both to apply and to estimate. Experiments were performed on unquantised LSF vectors (i.e. predicting from l_j(τ) rather than l̃_j(τ)) to examine the performance of various predictor configurations, the results of which are:
Table 1
Sys | MAC | Elements                                    | Err/dB
A   | 0   | -                                           | -23.47
B   | 1   | a_ii(1)                                     | -26.17
C   | 2   | a_ii(1), a_ii-1(0)                          | -27.31
D   | 3   | a_ii(1), a_ii-1(0), a_ii-1(1)               | -27.74
E   | 2   | a_ii(1), a_ii(2)                            | -26.23
F   | 19  | a_ij(1), 1 ≤ j ≤ 10; a_ij(0), 1 ≤ j ≤ i - 1 | -27.97
System D (shown in Figure 12) was selected as giving the best compromise between
efficiency and error.
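The System D predictor therefore uses three weights per component, as in the following sketch; the weights and bias shown are placeholders, not trained values from the embodiment.

```python
# Sketch of the System D predictor: the current LSF component is predicted from
# the same component in the previous frame and from the neighbouring (i-1)
# component in the current and previous frames.
def predict_system_d(prev_frame, curr_frame, i, weights=(0.5, 0.4, -0.1), bias=0.0):
    w1, w2, w3 = weights
    estimate = bias + w1 * prev_frame[i]           # a_ii(1)   * l~_i(t-1)
    if i > 0:
        estimate += w2 * curr_frame[i - 1]         # a_ii-1(0) * l~_{i-1}(t)
        estimate += w3 * prev_frame[i - 1]         # a_ii-1(1) * l~_{i-1}(t-1)
    return estimate
```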
[0085] A scheme was implemented where the predictor was adaptively modified. The adaptive update is performed according to:

C_xx ← (1 - ρ) C_xx + ρ x x^T,   C_xy ← (1 - ρ) C_xy + ρ x y,

where ρ determines the rate of adaption (a value of ρ = 0.005 was found suitable, giving a time constant of 4.5 seconds). The terms C_xx and C_xy are initialised from training data as

C_xx = (1/N) Σ_{i=1}^{N} x_i x_i^T

and

C_xy = (1/N) Σ_{i=1}^{N} x_i y_i.

Here y_i is a value to be predicted (l_i(t)) and x_i is a vector of predictor inputs (containing 1, l_i(t-1) etc.). These updates are applied after each frame, and periodically new Minimum Mean-Squared Error (MMSE) predictor coefficients, p, are calculated by solving C_xx p = C_xy.
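The adaptive update and the periodic re-derivation of the predictor might be sketched as follows; updating full matrices every frame as shown is one illustrative realisation.

```python
# Sketch of the adaptive predictor update: exponentially weighted estimates of
# C_xx and C_xy are refreshed each frame with rho = 0.005, and the MMSE predictor
# coefficients are periodically re-derived by solving C_xx p = C_xy.
import numpy as np

RHO = 0.005

def update_statistics(C_xx, C_xy, x, y):
    """x: predictor input vector (1, l_i(t-1), ...); y: the value being predicted."""
    C_xx = (1.0 - RHO) * C_xx + RHO * np.outer(x, x)
    C_xy = (1.0 - RHO) * C_xy + RHO * x * y
    return C_xx, C_xy

def refresh_predictor(C_xx, C_xy):
    return np.linalg.solve(C_xx, C_xy)      # p such that C_xx p = C_xy
```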
[0086] The adaptive predictor is only needed if there are large differences between training
and operating conditions caused for example by speaker variations, channel differences
or background noise.
Quantisation and Coding
[0087] Given a predictor output l̂_i(t), the prediction error is calculated as e_i(t) = l_i(t) - l̂_i(t). This is uniformly quantised by scaling to give an error ẽ_i(t) which is then losslessly encoded in the same way as all the other parameters. A suitable scaling factor is 160.0. Coarser quantisation can be used for frames classified as unvoiced.
Results
[0088] Diagnostic Rhyme Tests (DRTs) (as described in
W.D. Voiers, 'Diagnostic evaluation of speech intelligibility', in Speech Intelligibility
and Speaker Recognition (M.E. Hawley, ed.), pp. 374-387, Dowden, Hutchinson & Ross,
Inc., 1977) were performed to compare the intelligibility of a wide-band LPC vocoder using the
autocorrelation domain combination method with that of a 4800 bps CELP coder (Federal
Standard 1016) (operating on narrow-band speech). For the LPC vocoder, the level of
quantisation and frame period were set to give an average bit rate of approximately
2400 bps. From the results shown in Table 2, it can be seen that the DRT score for
the wideband LPC vocoder exceeds that for the CELP coder.
Table 2
Coder        | DRT Score
CELP         | 83.8
Wideband LPC | 86.8
[0089] This second embodiment described above incorporates two recent enhancements to LPC
vocoders, namely a pulse dispersion filter and adaptive spectral enhancement.
CLAIMS
1. An audio coding system for encoding and decoding an audio signal, said system including
an encoder and a decoder, said encoder comprising:
filter means for decomposing said audio signal into an upper and a lower sub-band
signal;
lower sub-band coding means for encoding said lower sub-band signal;
upper sub-band coding means for parametric encoding at least the non-periodic component
of said upper sub-band signal according to a source-filter model;
said decoder means comprising means for decoding said encoded lower sub-band signal
and said encoded upper sub-band signal, and for reconstructing therefrom an audio
output signal,
wherein said decoder means comprises filter means and excitation means for generating
an excitation signal for being passed by said filter means to produce a synthesised
upper sub-band signal, said excitation means in use generating an excitation signal
which includes a substantial component of synthesised noise in an upper frequency
band corresponding to the upper sub-band of said audio signal, and said synthesised
upper sub-band signal and the decoded lower sub-band signal are recombined in use
to form the audio output signal.
2. An audio coding system according to Claim 1, wherein said decoder means comprises
lower sub-band decoding means and upper sub-band decoding means, for receiving and
decoding the encoded lower and upper sub-band signals respectively.
3. An audio coding system according to Claim 1 or 2, wherein said upper frequency band of
said excitation signal substantially wholly comprises a synthesised noise signal.
4. An audio coding system according to Claim 1 or 2, wherein said excitation signal comprises
a mixture of a synthesised noise component and a further component corresponding to
one or more harmonics of said lower sub-band audio signal.
5. An audio coding system according to any of the preceding Claims, wherein said upper
sub-band coding means comprises means for analysing and encoding said upper sub-band
signal to obtain an upper sub-band energy or gain value and one or more upper sub-band
spectral parameters.
6. An audio coding system according to Claim 5, wherein said one or more upper sub-band
spectral parameters comprise second order LPC coefficients.
7. An audio coding system according to Claim 5 or 6, wherein said encoder means includes
means for measuring the energy in said upper sub-band thereby to deduce said upper
sub-band energy or gain value.
8. An audio coding system according to Claim 5 or 6, wherein said encoder means includes
means for measuring the energy of a noise component in said upper band signal thereby
to deduce said upper sub-band energy or gain value.
9. An audio coding system according to Claim 7 or Claim 8, including means for monitoring
said energy in said upper sub-band signal, comparing this with a threshold derived
from at least one of said upper and lower sub-band energies, and for causing said
upper sub-band encoding means to provide a minimum code output if said monitored energy
is below said threshold.
10. An audio coding system according to any of the preceding Claims, wherein said lower
sub-band coding means comprises a speech coder, and includes means for providing a
voicing decision.
11. An audio coding system according to Claim 10, wherein said decoder means includes
means responsive to the energy in said upper band encoded signal and said voicing
decision to adjust the noise energy in said excitation signal dependent on whether
the audio signal is voiced or unvoiced.
12. An audio coding system according to any of Claims 1 to 9, wherein said lower sub-band
coding means comprises an MPEG audio coder.
13. An audio coding system according to any of the preceding Claims, wherein said upper
sub-band contains frequencies above 2.75kHz and said lower sub-band contains frequencies
below 2.75kHz.
14. An audio coding system according to any of Claims 1 to 12, wherein said upper sub-band
contains frequencies above 4kHz, and said lower sub-band contains frequencies below
4kHz.
15. An audio coding system according to any of Claims 1 to 12, wherein said upper sub-band contains
frequencies above 5.5kHz and said lower sub-band contains frequencies below 5.5kHz.
16. An audio coding system according to any of the preceding Claims, wherein said upper sub-band
coding means encodes said noise component with a bit rate of less than 800 bps and
preferably of about 300 bps.
17. An audio coding system according to Claim 5 or any Claim dependent thereon, wherein
said upper sub-band signal is analysed with long frame periods to determine said spectral
parameters and with short frame periods to determine said energy or gain value.
18. An audio coding method for encoding and decoding an audio signal, which method comprises:
decomposing said audio signal into an upper and a lower sub-band signal;
encoding said lower sub-band signal ;
parametric encoding at least the non-periodic component of said upper sub-band signal
according to a source-filter model, and
decoding said encoded lower sub-band signal and said encoded upper sub-band signal
to reconstruct an audio output signal;
wherein said decoding step includes providing an excitation signal which includes
a substantial component of synthesised noise in an upper frequency band corresponding
to the upper sub-band of said audio signal, passing said excitation signal through
a filter means to produce a synthesised upper sub-band signal, and recombining said
synthesised upper sub-band signal and the decoded lower sub-band signal to form the
audio output signal.
19. An audio encoder for encoding an audio signal, said encoder comprising:
means for decomposing said audio signal into an upper and a lower sub-band signal;
lower sub-band coding means for encoding said lower sub-band signal, and
upper sub-band coding means for parametric encoding at least a noise component of
said upper sub-band signal according to a source-filter model.
20. A method of encoding an audio signal which comprises decomposing said audio signal
into an upper and a lower sub-band signal, encoding said lower sub-band signal and
parametric encoding at least a noise component of said upper sub-band signal according
to a source-filter model.
21. An audio decoder adapted for decoding an audio signal encoded in accordance with the
method of Claim 20, said decoder comprising filter means and excitation means for
generating an excitation signal for being passed by said filter means to produce a
synthesised audio signal, said excitation means in use generating an excitation signal
which includes a substantial component of synthesised noise in an upper frequency
band corresponding to the upper sub-bands of said audio signal.
22. A method of decoding an audio signal encoded in accordance with the method of Claim
20, which comprises providing an excitation signal which includes a substantial component
of synthesised noise in an upper frequency bandwidth corresponding to the upper sub-band
of the input audio signal, and passing said excitation signal through a filter means
to produce a synthesised audio signal.
23. A coder system for encoding and decoding a speech signal, said system comprising encoder
means and decoder means, said encoder means including:-
filter means for decomposing said speech signal into lower and upper sub-bands together
defining a bandwidth of at least 5.5 kHz;
lower sub-band vocoder analysis means for performing a high order vocoder analysis
on said lower sub-band to obtain vocoder coefficients including LPC coefficients representative
of said lower sub-band;
upper sub-band vocoder analysis means for performing a low order vocoder analysis
on said upper sub-band to obtain vocoder coefficients including LPC coefficients representative
of said upper sub-band;
coding means for coding vocoder parameters including said lower and upper sub-band
coefficients to provide an encoded signal for storage and/or transmission, and
said decoder means including:-
decoding means for decoding said encoded signal to obtain a set of vocoder parameters
combining said lower and upper sub-band vocoder coefficients;
synthesising means for constructing an LPC filter from the set of vocoder parameters
and for synthesising said speech signal from said filter and from an excitation signal.
24. A voice coder system according to Claim 23, wherein said lower sub-band vocoder analysis
means and said upper sub-band vocoder analysis means are LPC vocoder analysis means.
25. A voice coder system according to Claim 24, wherein said lower sub-band LPC analysis
means performs a tenth order or higher analysis.
26. A voice coder system according to Claim 24 or Claim 25, wherein said high band LPC
analysis means performs a second order analysis.
27. A voice coder system according to any of Claims 23 to 26, wherein said synthesising
means includes means for re-synthesising said lower sub-band and said upper sub-band
and for combining said re-synthesised lower and higher sub-bands.
28. A voice coder system according to Claim 27, wherein said synthesising means includes
means for determining the power spectral densities of the lower sub-band and the upper
sub-band respectively, and means for combining said power spectral densities to obtain
a high order LPC model.
29. A voice coder system according to Claim 28, wherein said means for combining includes
means for determining the autocorrelations of said combined power spectral densities.
30. A voice coder system according to Claim 29, wherein said means for combining includes
means for determining the autocorrelations of the power spectral density functions
of said lower and upper sub-bands respectively, and then combining said autocorrelations.
31. A voice encoder apparatus for encoding a speech signal, said encoder apparatus including:-
filter means for decomposing said speech signal into lower and upper sub-bands;
low band vocoder analysis means for performing a high order vocoder analysis on said
lower sub-band signal to obtain vocoder coefficients representative of said lower
sub-band;
upper band vocoder analysis means for performing a low order vocoder analysis on said
upper sub-band signal to obtain vocoder coefficients representative of said upper
sub-band, and
coding means for coding said low and high sub-band vocoder coefficients to provide
an encoded signal for storage and/or transmission.
32. A voice decoder apparatus adapted for synthesising a speech signal coded by an encoder
in accordance with Claim 31, and said coded speech signal comprising parameters including
LPC coefficients for a lower sub-band and an upper sub-band, said decoder apparatus
including:
decoding means for decoding said encoded signal to obtain a set of LPC parameters
combining said lower and upper sub-band LPC coefficients, and
synthesising means for constructing an LPC filter from the set of LPC parameters for
said upper and said lower sub-bands and for synthesising said speech signal from said
filter and from an excitation signal.