|
(11) EP 1 158 495 A2
(12) EUROPEAN PATENT APPLICATION
|
|
|
|
(54) Wideband speech coding system and method
(57) A speech encoder/decoder for wideband speech with a partitioning of wideband into
lowband and highband, convenient coding of the lowband, and LP excited by noise plus
some periodicity for the highband. The embedded lowband may be extracted for a lower
bit rate decoder. Additionally, the use of a single quantizer for both lowband and
highband parts of a wideband codec is disclosed.
BACKGROUND OF THE INVENTION
SUMMARY OF THE INVENTION
BRIEF DESCRIPTION OF THE DRAWINGS
Figures 1a-1c show first preferred embodiments.
Figures 2a-2b illustrate frequency domain frames.
Figures 3a-3b show filtering.
Figures 4a-4b are block diagrams of G.729 encoder and decoder.
Figure 5 shows spectrum reversal.
Figures 6-7 show the high portion of a lowband for a voiced frame and its envelope.
Figures 8-9 are block diagrams of systems.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
1. Overview
2. First preferred embodiment systems
3. Coder details
(1) Sample an input wideband speech signal (which is bandlimited to 8 kHz) at 16 kHz to obtain a sequence of wideband samples, wb(n). Partition the digital stream into 160-sample (10 ms) frames.
(2) Lowpass filter wb(n) with a passband of 0-4 kHz to yield lowband signal lb(n) and (later) also highpass filter wb(n) with a passband of 4-8 kHz to yield highband signal hb(n); this is just half-band filtering. Because both lb(n) and hb(n) have bandwidths of 4 kHz, their 16 kHz sampling rate can be decimated by a factor of 2 to 8 kHz without loss of information. Thus let lbd(m) denote the baseband (0-4 kHz) version of lb(n) after decimation of the sampling rate by a factor of 2, and similarly let hbdr(m) denote the baseband (0-4 kHz) version of hb(n) after decimation of the sampling rate by a factor of 2. Figures 3a-3b illustrate the formation of lbd(m) and hbdr(m) in the frequency domain for a voiced frame, respectively; note that π on the frequency scale corresponds to one-half the sampling rate. The decimation by 2 creates spectrally reversed images, so the baseband hbdr(m) is reversed compared to hb(n). Of course, lbd(m) corresponds to the traditional 8 kHz sampling of speech for digitizing voiceband (0.3-3.4 kHz) analog telephone signals.
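A minimal sketch of the band split of steps (1)-(2), assuming simple FIR half-band filters and numpy/scipy routines (the patent does not specify particular filter designs):

    # Illustrative half-band analysis of a 16 kHz wideband frame into the
    # decimated lowband lbd(m) and the (spectrally reversed) highband hbdr(m).
    import numpy as np
    from scipy.signal import firwin, lfilter

    FS = 16000                  # wideband sampling rate
    FRAME = 160                 # 10 ms frame of wb(n)

    def split_bands(wb_frame):
        """Return (lbd, hbdr): 0-4 kHz and 4-8 kHz bands, each decimated to 8 kHz."""
        h_lp = firwin(63, 0.5)                   # lowpass, cutoff at FS/4 = 4 kHz
        h_hp = firwin(63, 0.5, pass_zero=False)  # complementary highpass
        lb = lfilter(h_lp, 1.0, wb_frame)
        hb = lfilter(h_hp, 1.0, wb_frame)
        lbd = lb[::2]     # decimate by 2 -> 8 kHz sampling
        hbdr = hb[::2]    # aliasing places the 4-8 kHz band at 0-4 kHz, reversed
        return lbd, hbdr

    lbd, hbdr = split_bands(np.random.randn(FRAME))  # 80 samples each (10 ms at 8 kHz)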
(3) Encode lbd(m) with a narrowband coder, for example the ITU standard 11.8 kb/s G.729 Annex E coder which provides very high speech quality as well as relatively good performance for music signals. This coder may use 80-sample (10 ms at a sampling rate of 8 kHz) frames which correspond to 160-sample (10 ms at a sampling rate of 16 kHz) frames of wb(n). This coder uses linear prediction (LP) coding with both forward and backward modes and encodes a forward mode frame with 18 bits for codebook quantized LP coefficients, 14 bits for codebook quantized gain (7 bits in each of two subframes), 70 bits for codebook quantized differential delayed excitation (35 bits in each subframe), and 16 bits for codebook quantized pitch delay and mode indication to total 118 bits for a 10 ms frame. A backward mode frame is similar except the 18 LP coefficient bits are instead used to increase the excitation codebook bits to 88.
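As a quick arithmetic check of the forward-mode bit allocation quoted above (variable names here are illustrative only):

    # 118 bits per 10 ms frame corresponds to the 11.8 kb/s rate of G.729 Annex E.
    forward_bits = {"LP coefficients": 18, "gains": 14,
                    "fixed excitation": 70, "pitch delay + mode": 16}
    bits_per_frame = sum(forward_bits.values())   # 118
    bit_rate = bits_per_frame / 0.010             # 11800 bits/s = 11.8 kb/s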
(4) Using lbd(m), prepare a pitch-modulation waveform similar to that which will be used by the highband decoder as follows. First, apply a 2.8-3.8 kHz bandpass filter to the baseband signal lbd(m) to yield its high portion, lbdh(m). Then take the absolute value, |lbdh(m)|; a signal similar to this will be used by the decoder as a multiplier of a white-noise signal to be the excitation for the highband. Decoder step (5) in the following section provides more details.
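A minimal sketch of the pitch-modulation waveform of step (4); the 63-tap FIR design is an assumption, as the patent does not specify the bandpass filter:

    import numpy as np
    from scipy.signal import firwin, lfilter

    FS_BB = 8000   # sampling rate of the decimated lowband

    def pitch_modulation_waveform(lbd):
        h_bp = firwin(63, [2800, 3800], pass_zero=False, fs=FS_BB)  # 2.8-3.8 kHz band
        lbdh = lfilter(h_bp, 1.0, lbd)
        return np.abs(lbdh)    # |lbdh(m)|, later used to modulate white noise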
(5) If not previously performed in step (2), highpass filter wb(n) with a passband of 4-8 kHz to yield highband signal hb(n), and then decimate the sampling rate by 2 to yield hbdr(m). This highband processing may follow the lowband processing (foregoing steps (2)-(4)) in order to reduce memory requirements of a digital signal processing system.
(6) Apply LP analysis to hbdr(m) and determine (highband) LP coefficients aHB(j) for an order M = 10 filter plus estimate the energy of the residual rHB(m). The energy of rHB will scale the pitch-modulated white noise excitation of the filter for synthesis.
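A sketch of the LP analysis of step (6) by the autocorrelation method; the Hamming window and scipy's Toeplitz solver are assumptions, as the patent does not prescribe a specific procedure:

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.signal import lfilter

    def lp_analysis(x, order=10):
        """Return the A(z) coefficients aHB(j) and the energy of the residual rHB(m)."""
        xw = x * np.hamming(len(x))
        r = np.correlate(xw, xw, mode="full")[len(xw) - 1:len(xw) + order]  # lags 0..order
        a = solve_toeplitz(r[:order], r[1:order + 1])   # predictor coefficients
        coeffs = np.concatenate(([1.0], -a))            # A(z) = 1 - sum_j a_j z^-j
        residual = lfilter(coeffs, 1.0, x)              # inverse filtering
        return coeffs, np.sum(residual ** 2)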
(7) Reverse the signs of alternate highband LP coefficients: this is equivalent to
reversing the spectrum of hbdr(m) to hbd(m) and thereby relocating the higher energy
portion of voiced frames into the lower frequencies as illustrated in Figure 5. Energy
in the lower frequencies permits effective use of the same LP codebook quantization
used by the narrowband coder for lbd(m). In particular, voiced frames have a lowpass
characteristic, and codebook quantization efficiency for LSFs relies on this characteristic:
G.729 uses split vector quantization of LSFs with more bits for the lower coefficients.
Thus determine LSFs from the (reversed) LP coefficients ±aHB(j), and quantize with the quantization method of the narrowband coder for lbd(m) in step (3).
Alternatively, first reverse the spectrum of hbdr(m) to yield hbd(m) by modulating
with a 4 kHz square wave, and then perform the LP analysis and LSF quantization. Either
approach yields the same results.
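Both routes of step (7) can be sketched as follows (helper names are illustrative; the (-1)^m modulation is the 4 kHz square wave mentioned above):

    import numpy as np

    def reverse_lp_coefficients(coeffs):
        """Negate alternate coefficients: c(j) -> (-1)^j c(j), i.e. A(z) -> A(-z)."""
        return coeffs * ((-1.0) ** np.arange(len(coeffs)))

    def reverse_spectrum(x):
        """Modulate by (-1)^m, shifting the spectrum by pi (4 kHz at an 8 kHz rate)."""
        return x * ((-1.0) ** np.arange(len(x)))

    # Equivalently: LP analysis of reverse_spectrum(hbdr) yields the same A(z)
    # as reverse_lp_coefficients applied to the LP analysis of hbdr itself.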
(8) The excitation for the highband synthesis will be scaled noise modulated (multiplied) by an estimate of |lbdh(m)| where the scaling is set to have the excitation energy equal to the energy of the highband residual rHB(m). Thus normalize the residual energy level by dividing the energy of the highband residual by the energy of |lbdh(m)| which was determined in step (4). Lastly, quantize this normalized energy of the highband residual in place of the (non-normalized) energy of the highband residual which would be used for excitation when the pitch-modulation is omitted. That is, the use of pitch modulation for the highband excitation requires no increase in coding bits because the decoder derives the pitch modulation from the decoded lowband signal, and the energy of the highband residual takes the same number of coding bits whether or not normalization has been applied.
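A one-line sketch of the normalization in step (8) (function name is illustrative):

    import numpy as np

    def normalized_excitation_level(residual_hb, lbdh_abs):
        # Energy of the highband residual divided by the energy of |lbdh(m)|;
        # this normalized level is what gets quantized and transmitted.
        return np.sum(residual_hb ** 2) / np.sum(lbdh_abs ** 2)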
(9) Combine the output bits of the baseband lbd(m) coding of step (3) and the output bits of the hbd(m) coding of steps (7)-(8) into a single bitstream.
4. Decoder details
(1) Extract the lowband code bits from the bitstream and decode (using the G.729 decoder) to synthesize lowband speech lbd'(m), an estimate of lbd(m).
(2) Bandpass filter (2.8-3.8 kHz band) lbd'(m) to yield lbdh'(m) and compute the absolute value |lbdh'(m)| as in the encoding.
(3) Extract the highband code bits, decode the quantized highband LP coefficients (derived from hbd(m)) and the quantized normalized excitation energy level (scale factor). Frequency reverse the LP coefficients (alternate sign reversals) to have the filter coefficients for an estimate of hbdr(m).
(4) Generate white noise and scale by the scale factor. The scale factor may be interpolated (using the adjacent frame's scale factor) every 20-sample subframe to yield a smoother scale factor.
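A sketch of decoder steps (3)-(4); linear interpolation of the scale factor across subframes and its use as a direct amplitude gain are assumptions made for illustration:

    import numpy as np

    def scaled_noise(prev_scale, cur_scale, frame_len=80, subframe=20):
        """White noise with the scale factor interpolated per 20-sample subframe."""
        n_sub = frame_len // subframe
        noise = np.random.randn(frame_len)
        for k in range(n_sub):
            w = (k + 1) / n_sub                        # interpolation weight
            g = (1.0 - w) * prev_scale + w * cur_scale
            noise[k * subframe:(k + 1) * subframe] *= g
        return noise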
(5) Modulate (multiply) the scaled white noise from (4) by waveform |lbdh'(m)| from
(2) to form the highband excitation. Figure 6 illustrates an exemplary lbdh'(m) for
a voiced frame. In the case of unvoiced speech, the periodicity would generally be
missing and lbdh'(m) would be more uniform and not significantly modulate the white-noise
excitation.
The periodicity of lbdh'(m) roughly reflects the vestigial periodicity apparent in
the highband portion of Figure 2a and missing in Figure 2b. This pitch modulation
will compensate for a perceived noisiness of speech synthesized from a pure noise
excitation for hbd(m) in strongly-voiced frames. The estimate uses the periodicity
in the 2.8-3.8 kHz band of lbd'(m) because strongly-voiced frames with some periodicity
in the highband tend to have periodicity in the upper frequencies of the lowband.
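The modulation of step (5) is simply a sample-by-sample product (sketch; names follow the earlier illustrative helpers):

    def highband_excitation(scaled_noise_frame, lbdh_abs):
        # Voiced frames impose the pitch periodicity of |lbdh'(m)| on the noise;
        # for unvoiced frames |lbdh'(m)| is roughly flat and leaves the noise as is.
        return scaled_noise_frame * lbdh_abs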
(6) Synthesize highband signal hbdr'(m) by using the frequency-reversed highband LP coefficients from (3) together with the modulated scaled noise from (5) as the excitation. The LP coefficients may be interpolated every 20 samples in the LSP domain to reduce switching artifacts.
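A sketch of the synthesis filtering in step (6), omitting the LSP-domain interpolation for brevity; the coefficient convention matches the LP analysis sketch above:

    from scipy.signal import lfilter

    def synthesize_highband(excitation, a_coeffs):
        """All-pole synthesis 1/A(z) with a_coeffs = [1, -a1, ..., -a10]."""
        return lfilter([1.0], a_coeffs, excitation)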
(7) Upsample (interpolation by 2) synthesized (decoded) lowband signal lbd'(m) to a 16 kHz sampling rate, and lowpass filter (0-4 kHz band) to form lb'(n). Note that interpolation by 2 forms a spectrally reversed image of lbd'(m) in the 4-8 kHz band, and the lowpass filtering removes this image.
(8) Upsample (interpolation by 2) synthesized (decoded) highband signal hbdr'(m) to a 16 kHz sampling rate, and highpass filter (4-8 kHz band) to form hb'(n) which reverses the spectrum back to the original. The highpass filter removes the 0-4 kHz image.
(9) Add the two upsampled signals to form the synthesized (decoded) wideband speech signal: wb'(n) = lb'(n) + hb'(n).
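A sketch of the recombination in steps (7)-(9), mirroring the analysis filterbank sketch (zero-stuffing interpolation and the filter design are assumptions):

    import numpy as np
    from scipy.signal import firwin, lfilter

    def recombine(lbd_dec, hbdr_dec):
        """Interpolate both decoded bands to 16 kHz, remove images, and add."""
        lb_up = np.zeros(2 * len(lbd_dec));  lb_up[::2] = lbd_dec    # zero-stuff
        hb_up = np.zeros(2 * len(hbdr_dec)); hb_up[::2] = hbdr_dec
        h_lp = firwin(63, 0.5)                    # keep 0-4 kHz, remove 4-8 kHz image
        h_hp = firwin(63, 0.5, pass_zero=False)   # keep 4-8 kHz, un-reversing the spectrum
        lb = 2.0 * lfilter(h_lp, 1.0, lb_up)      # gain of 2 compensates zero-stuffing
        hb = 2.0 * lfilter(h_hp, 1.0, hb_up)
        return lb + hb                            # wb'(n) = lb'(n) + hb'(n)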
5. System preferred embodiments
6. Second preferred embodiments
7. Modifications
(a) partitioning a frame of digital speech into a lowband and a highband;
(b) encoding said lowband;
(c) encoding said highband using a linear prediction excitation from noise modulated by a portion of said lowband; and
(d) combining said encoded lowband and said encoded highband to form an encoded wideband speech.
(a) decoding a first portion of an input signal as a lowband speech signal;
(b) decoding a second portion of an input signal as a noise-modulated excitation of a linear prediction encoding, wherein said noise-modulated excitation is noise modulated by a portion of the results of said decoding as a lowband speech signal of preceding step (a); and
(c) combining the results of foregoing steps (a) and (b) to form a decoded wideband speech signal.
(a) a lowband filter and a highband filter for digital speech;
(b) a first encoder with input from said lowband filter;
(c) a second encoder with input from said highband filter and said lowband filter, said second encoder using an excitation from noise modulated by a portion of output from said lowband filter; and
(d) a combiner for the outputs of said first encoder and said second encoder to output encoded wideband speech.
(a) a first speech decoder with an input for encoded narrowband speech;
(b) a second speech decoder with an input for encoded highband speech and an input for the output of said first speech decoder, said second speech decoder using excitation of noise modulated by a portion of the output of said first speech decoder; and
(c) a combiner for the outputs of said first and second speech decoders to output decoded wideband speech.
(a) partitioning a frame of digital speech into a lowband and a highband;
(b) decimating the sampling rate of both said lowband and said highband;
(c) encoding said decimated lowband from step (b) including a first method of quantization;
(d) reversing the spectrum of a baseband image of said decimated highband from step (b); and
(e) encoding the results of step (d) including said first method of quantization.
(a) decoding a first portion of an input signal as a lowband speech signal including using a first codebook;
(b) decoding a second portion of an input signal as a highband speech signal including using said first codebook; and
(c) combining the results of foregoing steps (a) and (b) to form a decoded wideband speech signal.
(a) a lowband filter and a highband filter for digital speech;
(b) a first encoder with input from said lowband filter, said first encoder using a first quantizer;
(c) a second encoder with input from said highband filter, said second encoder using said first quantizer; and
(d) a combiner for said first encoder and said second encoder to output encoded wideband speech.
(a) a first speech decoder with an input for encoded narrowband speech and an LP codebook;
(b) a second speech decoder with an input for encoded highband speech, said second decoder using said LP codebook.