[0001] The present invention relates to voice coding and voice decoding apparatuses and methods therefor for satisfactorily coding a background noise signal superimposed on a voice signal even at a low bit rate.
[0002] As a system for highly efficient coding of voice signals, CELP (Code-Excited Linear Prediction) coding is well known in the art, as described in, for instance, M. Schroeder and B. Atal, "Code-excited linear prediction: High quality speech at very low bit rates", Proc. ICASSP, pp. 937-940, 1985 (Literature 1) and Kleijn et al, "Improved speech quality and efficient vector quantization in SELP", Proc. ICASSP, pp. 155-158, 1988 (Literature 2).
[0003] In such a prior art system, on the transmitting side a spectral parameter representing a spectral characteristic of the voice signal is extracted from the voice signal for each frame (of 20 msec., for instance) by executing linear prediction (LPC) analysis. The frame is further divided into sub-frames (of 5 msec., for instance), and pitch prediction of the voice signal in each sub-frame is executed by using an adaptive codebook. Specifically, the pitch prediction is executed by extracting the adaptive codebook parameters (i.e., a delay parameter corresponding to the pitch cycle and a gain parameter) for each sub-frame on the basis of the past excitation signal. A residue signal is obtained as a result of the pitch prediction, and it is quantized by selecting an optimum excitation codevector from an excitation codebook (or vector quantization codebook), which is constituted by noise signals of predetermined kinds, and calculating an optimum gain. The excitation codevector is selected such as to minimize the error power between the residue signal and a signal synthesized from the selected noise signal. A multiplexer combines an index representing the kind of the selected codevector with the gain, the spectral parameter and the adaptive codebook parameters, and transmits the resultant signal. Description of the receiving side is omitted here.
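For concreteness, the codevector selection in such a coder can be sketched as follows. This is a minimal, generic analysis-by-synthesis loop written in Python; the function name, the plain LPC synthesis filter and the closed-form gain are illustrative assumptions, not the specific implementation of Literatures 1 and 2.

```python
import numpy as np
from scipy.signal import lfilter

def select_excitation(target, codebook, lpc):
    """Generic CELP excitation search: pass each candidate noise codevector
    through the LPC synthesis filter 1/(1 - sum a_i z^-i), compute the
    optimum gain in closed form, and keep the index minimizing the error
    power between the target residue signal and the synthesized signal."""
    a = np.concatenate(([1.0], -np.asarray(lpc)))    # synthesis denominator
    best_j, best_g, best_err = 0, 0.0, np.inf
    for j, c in enumerate(codebook):
        y = lfilter([1.0], a, c)                     # synthesized candidate
        g = float(np.dot(target, y) / np.dot(y, y))  # optimum gain
        err = float(np.sum((target - g * y) ** 2))   # error power
        if err < best_err:
            best_j, best_g, best_err = j, g, err
    return best_j, best_g    # index and gain passed to the multiplexer
```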
[0004] In the prior art method described above, reducing the coding bit rate below, for instance, 8 kb/sec. results in deterioration of the background noise signal, particularly where the voice signal and the background noise signal are superimposed on each other. The deterioration of the background noise signal leads to deterioration of the overall sound quality. This deterioration is particularly pronounced when voice coding is used in a portable telephone or the like. In the methods disclosed in Literatures 1 and 2, bit rate reduction entails a reduction of the number of excitation codebook bits, which deteriorates the accuracy of waveform reconstruction. The deterioration is not so pronounced for signals of high waveform correlation such as voice, but it is pronounced for signals of lower correlation such as background noise.
[0005] In a prior art method described in C. Laflamme et al, "16 kbps wideband speech coding technique based on algebraic CELP", Proc. ICASSP, pp. 13-16, 1991 (Literature 3), the excitation signal is expressed in the form of a combination of pulses. Therefore, although a high model matching property and satisfactory sound quality are obtainable when voice is dealt with, at low bit rates the sound quality of the background noise part of the coded voice is extremely deteriorated due to the insufficient number of pulses employed.
[0006] The reason for this is as follows. In the vowel time sections of voice, pulses are concentrated in the neighborhood of a pitch pulse serving as the pitch start point, so that voice can be expressed efficiently with a small number of pulses. For background noise or a like random signal, however, pulses have to be provided randomly, and it is therefore difficult to express the background noise satisfactorily with a small number of pulses. Reducing the number of pulses as the bit rate is lowered thus causes an abrupt deterioration of the sound quality of the background noise.
[0007] An object of the present invention, therefore, is to solve the above problems and provide a voice coding and a voice decoding apparatus which are less subject to sound quality deterioration with respect to background noise, with relatively little computational effort, even at low bit rates.
[0008] According to an aspect of the present invention, there is provided a voice coding apparatus including a spectral parameter calculating part for obtaining a spectral parameter for each predetermined frame of an input voice signal and quantizing the obtained spectral parameter, an adaptive codebook part for dividing the frame into a plurality of sub-frames, obtaining a delay and a gain from a past quantized excitation signal for each of the sub-frames by using an adaptive codebook and obtaining a residue by predicting the voice signal, an excitation quantizing part for quantizing the excitation signal of the voice signal by using the spectral parameter, and a gain quantizing part for quantizing the gain of the adaptive codebook and the gain of the excitation signal, comprising: a mode discriminating part for extracting a predetermined feature quantity from the voice signal and judging the pertinent mode to be one of a plurality of predetermined modes on the basis of the extracted feature quantity; a smoothing part for executing time-wise smoothing of at least one of the gain of the excitation signal, the gain of the adaptive codebook, the spectral parameter and the level of the excitation signal; and a multiplexer part for locally reproducing a synthesized signal by using the smoothed signal and feeding out a combination of the outputs of the spectral parameter calculating, mode discriminating, adaptive codebook, excitation quantizing and gain quantizing parts.
[0009] The mode discriminating part executes the mode discrimination for each frame. The feature quantity is the pitch prediction gain. The mode discriminating part averages the pitch prediction gains obtained for the individual sub-frames over the full frame, and classifies the frame into one of the plurality of predetermined modes by comparing the average value with a plurality of predetermined threshold values. The plurality of predetermined modes substantially correspond to a silence, a transient, a weak voice and a strong voice time section, respectively.
[0010] According to another aspect of the present invention, there is provided a voice decoding apparatus including a demultiplexer part for separating spectral parameter, pitch, gain and excitation signal as voice data from a received voice signal, an excitation signal restoring part for restoring an excitation signal from the separated pitch, excitation signal and gain, a synthesizing filter part for synthesizing a voice signal on the basis of the restored excitation signal and the spectral parameter, and a post-filter part for post-filtering the synthesized voice signal by using the spectral parameter, comprising: an inverse filter part for estimating an excitation signal through an inverse post-filtering and inverse synthesis filtering on the basis of the output signal of the post-filter part and the spectral parameter, and a smoothing part for executing time-wise smoothing of at least one of the level of the estimated excitation signal, the gain and the spectral parameter, the smoothed signal or signals being fed to the synthesis filter part, the synthesized signal output thereof being fed to the post-filter part to synthesize a voice signal.
[0011] According to another aspect of the present invention, there is provided a voice decoding apparatus including a demultiplexer part for separating mode discrimination data,
spectral parameter, pitch, gain and excitation signal on the basis of a feature quantity
of a voice signal to be decoded, an excitation signal restoring part for restoring
an excitation signal from the separated pitch, excitation signal and gain, a synthesis
filter part for synthesizing the voice signal by using the restored excitation signal
and the spectral parameter, and a post-filter part for post-filtering the synthesized
voice signal by using the spectral parameter, comprising: an inverse filter part for
estimating the voice signal on the basis of the output signal of the post-filter part
and the spectral parameter through an inverse post-filtering and inverse synthesis
filtering, a smoothing part for executing time-wise smoothing of at least either one
of the level of the estimated excitation signal, the gain and the spectral parameter,
the smoothed signal being fed to the synthesis filter part, the synthesis signal output
thereof being fed to the post-filter part.
[0012] The mode discriminating part executes the mode discrimination for each frame. The feature quantity is the pitch prediction gain. The mode discrimination is executed by averaging the pitch prediction gains obtained for the individual sub-frames over the full frame and comparing the average value thus obtained with a plurality of predetermined threshold values. The plurality of predetermined modes substantially correspond to a silence, a transient, a weak voice and a strong voice time section, respectively.
[0013] According to still another aspect of the present invention, there is provided a voice
decoding apparatus for locally reproducing a synthesized voice signal on the basis
of a signal obtained through time-wise smoothing of at least either one of spectral
parameter of the voice signal, gain of an adaptive codebook, gain of an excitation
codebook and RMS of an excitation signal.
[0014] According to a further aspect of the present invention, there is provided a voice decoding
apparatus for obtaining a residue signal from a signal obtained after post-filtering
through an inverse post-synthesis filtering process, executing a voice signal synthesizing
process afresh on the basis of a signal obtained through time-wise smoothing of at
least one of RMS of residue signal, spectral parameter of received signal, gain of
adaptive codebook and gain of excitation codebook and executing a post-filtering process
afresh, thereby feeding out a final synthesized signal.
[0015] According to a still further aspect of the present invention, there is provided a voice
decoding apparatus for obtaining a residue signal from a signal obtained after post-filtering
through an inverse post-synthesis filtering process, and in a mode determined on the
basis of a feature quantity of a voice signal to be decoded or in the case of presence
of the feature quantity in a predetermined range, executing a voice signal synthesizing
process afresh on the basis of a signal obtained through time-wise smoothing of at
least either one of RMS of the residue signal, spectral parameter of a received signal,
gain of an adaptive codebook and gain of an excitation codebook, and executing a post-filtering
process afresh, thereby feeding out a final synthesized signal.
[0016] According to another aspect of the present invention, there is provided a voice coding
method including a step for obtaining a spectral parameter for each predetermined
frame of an input voice signal and quantizing the obtained spectral parameter, a step
for dividing the frame into a plurality of sub-frames, obtaining a delay and a gain
from a past quantized excitation signal for each of the sub-frames by using an adaptive
codebook and obtaining a residue by predicting the voice signal, a step for quantizing
the excitation signal of the voice signal by using the spectral parameter, and a step
for quantizing the gain of the adaptive codebook and the gain of the excitation signal,
further comprising steps of: extracting a predetermined feature quantity from the voice signal and judging the pertinent mode to be one of a plurality of predetermined modes on the basis of the extracted feature quantity; executing time-wise smoothing of at least one of the gain of the excitation signal, the gain of the adaptive codebook, the spectral parameter and the level of the excitation signal; and locally reproducing a synthesized signal by using the smoothed signal and feeding out a combination of the spectral parameter data, mode discrimination data, adaptive codebook data, excitation quantizing data and gain quantizing data.
[0017] According to still another aspect of the present invention, there is provided a voice
decoding method including a step for separating spectral parameter, pitch, gain and
excitation signal as voice data from a voice signal, a step for restoring an excitation
signal from the separated pitch, excitation signal and gain, a step for synthesizing
a voice signal on the basis of the restored excitation signal and the spectral parameter,
and a step for post-filtering the synthesized voice signal by using the spectral parameter,
further comprising steps of: estimating an excitation signal through an inverse post-filtering
and inverse synthesis filtering on the basis of the post-filtered signal and the spectral
parameter; and executing time-wise smoothing of at least one of the level of
the estimated excitation signal, the gain and the spectral parameter, the smoothed
signal or signals being fed to the synthesis filtering, the synthesized signal output
thereof being fed to the post-filtering to synthesize a voice signal.
[0018] According to a further aspect of the present invention, there is provided a voice decoding
method including a step for separating a mode discrimination data, spectral parameter,
pitch, gain and excitation signal on the basis of a feature quantity of a voice signal
to be decoded, a step for restoring an excitation signal from the separated pitch,
excitation signal and gain, a step for synthesizing the voice signal by using the
restored excitation signal and the spectral parameter, and a step for post-filtering
the synthesized voice signal by using the spectral parameter, comprising steps of:
estimating the voice signal on the basis of the post-filtered signal and the spectral
parameter through an inverse post-filtering and inverse synthesis filtering; and executing
time-wise smoothing of at least either one of the level of the estimated excitation
signal, the gain and the spectral parameter; the smoothed signal being fed to the
synthesis filtering, the synthesis signal output thereof being fed to the post-filtering.
[0019] According to a still further aspect of the present invention, there is provided a voice
decoding method for locally reproducing a synthesized voice signal on the basis of
a signal obtained through time-wise smoothing of at least either one of spectral parameter
of the voice signal, gain of an adaptive codebook, gain of an excitation codebook
and RMS of an excitation signal.
[0020] According to still another aspect of the present invention, there is provided a voice
decoding method for obtaining a residue signal from a signal obtained after post-filtering
through an inverse post-synthesis filtering process, executing a voice signal synthesizing
process afresh on the basis of a signal obtained through time-wise smoothing of at
least one of RMS of residue signal, spectral parameter of received signal, gain of
adaptive codebook and gain of excitation codebook and executing a post-filtering process
afresh, thereby feeding out a final synthesized signal.
[0021] According to another aspect of the present invention, there is provided a voice decoding
method for obtaining a residue signal from a signal obtained after post-filtering
through an inverse post-synthesis filtering process, and in a mode determined on the
basis of a feature quantity of a voice signal to be decoded or in the case of presence
of the feature quantity in a predetermined range, executing a voice signal synthesizing
process afresh on the basis of a signal obtained through time-wise smoothing of at
least either one of RMS of the residue signal, spectral parameter of a received signal,
gain of an adaptive codebook and gain of an excitation codebook, and executing a post-filtering
process afresh, thereby feeding out a final synthesized signal.
[0022] Other objects and features will be clarified from the following description with
reference to attached drawings.
Fig. 1 is a block diagram showing a first embodiment of the voice coding apparatus
according to the present invention;
Fig. 2 is a block diagram showing a second embodiment of the present invention, namely a voice decoding apparatus according to the present invention; and
Fig. 3 is a block diagram showing a third embodiment of the present invention, namely another voice decoding apparatus according to the present invention.
[0023] Preferred embodiments of the present invention will now be described with reference
to the drawings.
[0024] Fig. 1 is a block diagram showing a first embodiment of the voice coding apparatus
according to the present invention. Referring to Fig. 1, a frame circuit 110 divides
a voice signal inputted from an input terminal 100 into frames (of 20 msec., for instance).
A sub-frame divider circuit 120 divides each voice signal frame into sub-frames (of
5 msec. for instance) shorter than the frame.
[0025] A spectral parameter calculating circuit 200 applies a window (of 24 msec., for instance) longer than the sub-frame length to the voice signal of at least one sub-frame, cuts out the voice, and calculates a spectral parameter of a predetermined order (for instance P = 10). The spectral parameter may be calculated by using the well-known LPC analysis, the Burg analysis, etc. In this description, the use of the Burg analysis is assumed. The Burg analysis is detailed in Nakamizo, "Signal analysis and system identification", Corona Co., Ltd., pp. 82-87, 1988 (Literature 4), and is not described here. The circuit 200 converts the linear prediction coefficients α_i (i = 1, ..., 10) calculated by the Burg analysis to LSP parameters suited for quantization and interpolation. As for the conversion of the linear prediction coefficients to the LSP parameters, reference may be had to Sugamura et al, "Voice data compression in linear spectrum pair (LSP) voice analysis and synthesis system", Trans. IECE Japan, J64-A, pp. 599-606, 1981 (Literature 5). For example, the circuit 200 converts the linear prediction coefficients obtained in the 2-nd and 4-th sub-frames by the Burg method to LSP parameter data, obtains LSP parameter data for the 1-st and 3-rd sub-frames by interpolation, inversely converts the 1-st and 3-rd sub-frame LSP parameter data to restore the linear prediction coefficients, and thus feeds out the 1-st to 4-th sub-frame linear prediction coefficients α_il (i = 1, ..., 10; l being the sub-frame number) to an acoustic weighting circuit 230. The circuit 200 further feeds out the 4-th sub-frame LSP parameter data to a spectral parameter quantizing circuit 210.
[0026] The spectral parameter quantizing circuit 210 efficiently quantizes the LSP parameters of a predetermined sub-frame, and feeds out the quantized LSP value minimizing the distortion given as

$$D_j = \sum_{i=1}^{10} W(i)\,\bigl[\mathrm{LSP}(i) - \mathrm{QLSP}(i)_j\bigr]^2$$

where LSP(i), QLSP(i)_j and W(i) are the i-th LSP before the quantization, the i-th element of the j-th quantized candidate, and the weighting coefficient, respectively. An LSP codebook 211 is referred to by the spectral parameter quantizing circuit 210.
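A minimal sketch of this weighted codebook search follows, under the assumption of a simple exhaustive search over a J-by-P codebook (the array names are illustrative):

```python
import numpy as np

def quantize_lsp(lsp, codebook, w):
    """Select the codevector minimizing the weighted distortion
    D_j = sum_i W(i) * [LSP(i) - QLSP(i)_j]^2 of paragraph [0026].
    lsp: (P,) unquantized LSPs; codebook: (J, P); w: (P,) weights."""
    d = (w[None, :] * (lsp[None, :] - codebook) ** 2).sum(axis=1)
    j = int(np.argmin(d))     # index fed to the multiplexer 400
    return j, codebook[j]
```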
[0027] In the following, it is assumed that vector quantization is used and that the 4-th sub-frame LSP parameter is quantized. The LSP parameter may be vector quantized by a well-known method. Specific examples of the method are described in Japanese Patent Laid-Open No. 4-171500 (Japanese Patent Application No. 2-297600) (Literature 6), Japanese Patent Laid-Open No. 4-363000 (Japanese Patent Application No. 3-261925) (Literature 7), Japanese Patent Laid-Open No. 5-6199 (Japanese Patent Application No. 3-155049) (Literature 8) and T. Nomura et al, "LSP Coding Using VQ-SVQ With Interpolation in 4.075 kbps M-LCELP Speech Coder", Proc. Mobile Multimedia Communication, pp. B.2.5, 1993 (Literature 9), and are not described here.
[0028] The spectral parameter quantizing circuit 210 restores the 1-st to 4-th sub-frame LSP parameters from the quantized LSP parameter data obtained in the 4-th sub-frame. Specifically, the circuit 210 restores the 1-st to 3-rd sub-frame LSP parameters by linear interpolation between the quantized 4-th sub-frame LSP parameter data of the current frame and that of the immediately preceding frame. The 1-st to 4-th sub-frame LSP parameters can be restored by linear interpolation after selecting one codevector minimizing the error power between the LSP parameter data before and after the quantization. For further performance improvement, it is possible to select a plurality of candidate codevectors of small error power, evaluate the cumulative distortion of each candidate, and select the set of candidate and interpolated LSP parameter data which minimizes the cumulative distortion. For further detail, reference may be had to, for instance, Japanese Patent Laid-Open No. 6-222797 (Japanese Patent Application No. 5-8737) (Literature 10).
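The sub-frame interpolation can be sketched as below, assuming four sub-frames and a uniform linear grid (the exact interpolation weights are an assumption; the patent defers them to Literature 10):

```python
import numpy as np

def restore_subframe_lsp(q_prev, q_curr, n_sub=4):
    """Restore the per-sub-frame LSPs by linear interpolation between the
    quantized 4th-sub-frame LSPs of the preceding frame (q_prev) and of
    the current frame (q_curr), as in paragraph [0028]."""
    q_prev, q_curr = np.asarray(q_prev), np.asarray(q_curr)
    # sub-frames l = 1..4; l = 4 reproduces q_curr exactly
    return [(1.0 - l / n_sub) * q_prev + (l / n_sub) * q_curr
            for l in range(1, n_sub + 1)]
```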
[0029] The spectral parameter quantizing circuit 210 converts the thus restored 1-st to 3-rd sub-frame LSP parameters and the quantized 4-th sub-frame LSP parameter to the linear prediction coefficients α_il (i = 1, ..., 10; l being the sub-frame number) for each sub-frame, and feeds out the coefficient data thus obtained to an impulse response calculating circuit 310. The circuit 210 further feeds out an index representing the codevector of the quantized 4-th sub-frame LSP parameter to a multiplexer 400.
[0030] The acoustic weighting circuit 230, receiving the linear prediction coefficients α_il for each sub-frame, executes acoustic weighting of the sub-frame voice signal in the manner described in Literature 1 noted above, and feeds out the acoustically weighted signal x_w(n).
[0031] A response signal calculating circuit 240 receives the linear prediction coefficients α_il for each sub-frame from the spectral parameter calculating circuit 200 and the linear prediction coefficients α'_il restored through quantization and interpolation from the spectral parameter quantizing circuit 210, calculates for one sub-frame the response signal with the input signal set to zero, d(n) = 0, by using the preserved filter memory values, and feeds out the calculated response signal to a subtracter 235. The response signal x_z(n) is given as

$$x_z(n) = d(n) - \sum_{i=1}^{10}\alpha_i\,d(n-i) + \sum_{i=1}^{10}\alpha_i\gamma^i\,y(n-i) + \sum_{i=1}^{10}\alpha_i\gamma^i\,x_z(n-i)$$

where, when n − i ≤ 0,

$$y(n-i) = p\bigl(N + (n-i)\bigr), \qquad d(n-i) = s_w\bigl(N + (n-i)\bigr)$$

where N is the sub-frame length, γ is a weight coefficient for controlling the extent of the acoustic weighting, the same as the value in Equation (7) given below, and s_w(n) and p(n) are the output signal of a weighting signal calculating circuit to be described later and the output signal of the denominator filter of the right side first term of Equation (7) given below, respectively.
[0032] The subtracter 235 subtracts the response signal from the acoustically weighted signal for one sub-frame as shown by the following equation, and feeds out x'_w(n) to an adaptive codebook circuit 470:

$$x'_w(n) = x_w(n) - x_z(n)$$
[0033] The impulse response calculating circuit 310 calculates a predetermined number of impulse responses h_w(n) of the acoustic weighting filter, whose z-transform is expressed by the following Equation (7), and feeds out the calculated data to the adaptive codebook circuit 470 and an excitation quantizing circuit 355:

$$H_w(z) = \frac{1 - \sum_{i=1}^{10}\alpha_i z^{-i}}{1 - \sum_{i=1}^{10}\alpha_i\gamma^i z^{-i}}\cdot\frac{1}{1 - \sum_{i=1}^{10}\alpha'_i z^{-i}} \qquad (7)$$
[0034] A mode discriminating circuit 800 executes mode discrimination for each frame by extracting a feature quantity from the output signal of the frame circuit 110. As the feature quantity, the pitch prediction gain may be used. Specifically, in this case the circuit 800 averages the pitch prediction gains obtained in the individual sub-frames over the full frame, and executes classification into a plurality of predetermined modes by comparing the average value with a plurality of predetermined threshold values. Here, it is assumed that four different modes are provided. Specifically, it is assumed that modes 0 to 3 are set substantially for a silence, a transient, a weak voice and a strong voice time section, respectively. The circuit 800 feeds out the mode discrimination data thus obtained to the excitation quantizing circuit 355, a gain quantizing circuit 370 and the multiplexer 400.
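The mode decision of paragraph [0034] reduces to an average-and-threshold rule; a sketch follows, in which the three threshold values are purely illustrative assumptions (the patent fixes only the structure, not the constants):

```python
def discriminate_mode(subframe_pitch_gains, thresholds=(0.3, 0.5, 0.7)):
    """Classify a frame into modes 0..3 (substantially silence, transient,
    weak voice and strong voice) by averaging the per-sub-frame pitch
    prediction gains over the frame and counting the thresholds exceeded."""
    avg = sum(subframe_pitch_gains) / len(subframe_pitch_gains)
    return sum(avg >= t for t in thresholds)   # mode number 0..3
```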
[0035] The adaptive codebook circuit 470 receives the past excitation signal v(n) from the gain quantizing circuit 370, the output signal x'_w(n) from the subtracter 235 and the acoustically weighted impulse response h_w(n) from the impulse response calculating circuit 310, then obtains a delay T corresponding to the pitch such as to minimize the distortion given by the following Equation (8), and feeds out an index representing the delay to the multiplexer 400:

$$D_T = \sum_{n=0}^{N-1}\bigl[x'_w(n) - \beta\,v(n-T) * h_w(n)\bigr]^2 \qquad (8)$$

where

$$y_w(n-T) = v(n-T) * h_w(n)$$

[0036] In Equation (8), the symbol * represents convolution.
[0037] The adaptive codebook circuit 470 then obtains the gain β given as

$$\beta = \frac{\sum_{n=0}^{N-1} x'_w(n)\,y_w(n-T)}{\sum_{n=0}^{N-1} y_w(n-T)^2} \qquad (9)$$
[0038] For improving the accuracy of delay extraction for women's and children's voices, the delay may be obtained as a fractional sample value instead of an integer sample value.
As for a specific method, reference may be had to, for instance, P. Kroon et al, "Pitch
predictors with high temporal resolution", Proc. ICASSP, pp. 661-664, 1990 (Literature
11).
[0039] The adaptive codebook circuit 470 further executes pitch prediction as in the following Equation (10), and feeds out the prediction residue signal e_w(n) to the excitation quantizing circuit 355:

$$e_w(n) = x'_w(n) - \beta\,v(n-T) * h_w(n) \qquad (10)$$
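Paragraphs [0035] to [0039] amount to the closed-loop pitch search sketched below; the integer delay range and the restriction T >= N (so that v(n - T) lies entirely in the past) are simplifying assumptions, and the fractional-delay refinement of Literature 11 is omitted:

```python
import numpy as np

def adaptive_codebook_search(xw, past_v, hw, t_min=40, t_max=147):
    """Minimize Equation (8) over the delay T, which is equivalent to
    maximizing corr(xw, yw)^2 / ||yw||^2 with yw(n) = v(n-T) * h_w(n);
    then compute the gain beta of Equation (9) and the pitch prediction
    residue e_w(n) of Equation (10).  past_v holds the past excitation,
    most recent sample last."""
    N = len(xw)

    def filtered(T):
        seg = past_v[len(past_v) - T : len(past_v) - T + N]  # v(n-T)
        return np.convolve(seg, hw)[:N]                      # yw(n)

    best_T, best_score = t_min, -np.inf
    for T in range(t_min, t_max + 1):
        yw = filtered(T)
        c, e = float(np.dot(xw, yw)), float(np.dot(yw, yw))
        if e > 0.0 and c * c / e > best_score:
            best_T, best_score = T, c * c / e
    yw = filtered(best_T)
    beta = float(np.dot(xw, yw) / np.dot(yw, yw))            # Equation (9)
    ew = xw - beta * yw                                      # Equation (10)
    return best_T, beta, ew
```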
[0040] The excitation quantizing circuit 355 receives mode discrimination data, and switches
the excitation signal quantizing methods on the basis of the discriminated mode.
[0041] It is assumed that M pulses are provided in the modes 1 to 3. It is also assumed that in the modes 1 to 3 an amplitude or polarity codebook of B bits is provided for collectively quantizing the amplitudes of the M pulses. The following description assumes the case of using a polarity codebook. The polarity codebook is stored in an excitation codebook 351.
[0042] In the voice section, the excitation quantizing circuit 355 reads out the individual polarity codevectors stored in the excitation codebook 351, allots pulse positions to each read-out codevector, and selects a plurality of sets of codevector and pulse positions which minimize the following Equation (11):

$$D = \sum_{n=0}^{N-1}\Bigl[e_w(n) - \sum_{i=1}^{M} g_{ik}\,h_w(n - m_i)\Bigr]^2 \qquad (11)$$

where h_w(n) is the acoustically weighted impulse response, g_{ik} is the i-th element of the k-th polarity codevector, and m_i is the position of the i-th pulse. Equation (11) may be minimized by selecting the set of polarity codevector g_{ik} and pulse positions m_i which maximizes the following Equation (12):

$$D_k = \frac{\Bigl[\sum_{n=0}^{N-1} e_w(n)\,s_{wk}(n)\Bigr]^2}{\sum_{n=0}^{N-1} s_{wk}(n)^2}, \qquad s_{wk}(n) = \sum_{i=1}^{M} g_{ik}\,h_w(n - m_i) \qquad (12)$$
[0043] Alternatively, it is possible to select the set which maximizes the following Equation (13); in this case, the computational effort necessary for the calculation of the numerator is reduced:

$$D_k = \frac{\Bigl[\sum_{i=1}^{M} g_{ik}\,\Phi(m_i)\Bigr]^2}{\sum_{n=0}^{N-1} s_{wk}(n)^2} \qquad (13)$$

$$\Phi(m_i) = \sum_{n=m_i}^{N-1} e_w(n)\,h_w(n - m_i) \qquad (14)$$
[0044] For computational effort reduction, the positions which can be allotted to the individual pulses in the modes 1 to 3 can be restricted, as shown in Literature 3. As an example, assuming N = 40 and M = 5, the positions which can be allotted to the individual pulses are as shown in Table 1 below.
Table 1
PULSE No. | POSITION
1ST PULSE | 0, 5, 10, 15, 20, 25, 30, 35
2ND PULSE | 1, 6, 11, 16, 21, 26, 31, 36
3RD PULSE | 2, 7, 12, 17, 22, 27, 32, 37
4TH PULSE | 3, 8, 13, 18, 23, 28, 33, 38
5TH PULSE | 4, 9, 14, 19, 24, 29, 34, 39
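Under the Table 1 restriction the search space factorizes into five interleaved tracks, which the sketch below exploits. The greedy track-by-track maximization of |Φ(m)| is a simplification for illustration; the patent's search jointly evaluates the stored polarity codevectors per Equations (12) to (14):

```python
import numpy as np

# Interleaved pulse-position tracks of Table 1 (N = 40, M = 5).
TRACKS = [list(range(k, 40, 5)) for k in range(5)]

def pulse_search(ew, hw, N=40):
    """Place one pulse per track at the position maximizing |phi(m)|,
    with phi(m) = sum_{n=m}^{N-1} e_w(n) h_w(n-m) as in Equation (14);
    the pulse polarity is the sign of phi at the chosen position."""
    phi = np.array([float(np.dot(ew[m:N], hw[:N - m])) for m in range(N)])
    positions, polarities = [], []
    for track in TRACKS:
        m = max(track, key=lambda p: abs(phi[p]))
        positions.append(m)
        polarities.append(1.0 if phi[m] >= 0.0 else -1.0)
    return positions, polarities
```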
[0045] After the end of the polarity codevector retrieval, the excitation quantizing circuit 355 feeds out the selected plurality of sets of polarity codevector and pulse positions to the gain quantizing circuit 370. In a predetermined mode (i.e., the mode 0 in this case), the pulse positions of all the pulses are determined at a predetermined interval, and a plurality of extents of shifting the positions of all the pulses together are predetermined, as shown in Table 2. In the following example, four different extents of shift (i.e., shifts 0 to 3) are used, the positions being shifted by one sample at a time. In this case, the shift extent is transmitted by quantizing it in two bits.
Table 2
SHIFT AMOUNT | POSITION
0 | 0, 4, 8, 12, 16, 20, 24, 28, 32, 36
1 | 1, 5, 9, 13, 17, 21, 25, 29, 33, 37
2 | 2, 6, 10, 14, 18, 22, 26, 30, 34, 38
3 | 3, 7, 11, 15, 19, 23, 27, 31, 35, 39
[0046] The polarities corresponding to the individual shift extents and pulse positions shown in Table 2 are preliminarily obtained from the above Equation (14).
[0047] The pulse positions shown in Table 2 and the corresponding polarities are fed out for each shift extent to the gain quantizing circuit 370.
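The mode-0 grid of Table 2 is simply a regular lattice shifted by the transmitted 2-bit shift extent; a one-function sketch (the polarities at the resulting positions are then read off Equation (14), as in the pulse-search sketch above):

```python
def mode0_positions(shift, N=40, spacing=4):
    """Pulse positions of Table 2: a regular grid of the given spacing,
    shifted by `shift` samples (shift extents 0..3, coded in two bits)."""
    return list(range(shift, N, spacing))

# mode0_positions(2) -> [2, 6, 10, 14, 18, 22, 26, 30, 34, 38]
```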
[0048] The gain quantizing circuit 370 receives the mode discrimination data from the mode discriminating circuit 800. In the modes 1 to 3, the circuit 370 receives the plurality of selected sets of polarity codevector and pulse positions, and in the mode 0 it receives the set of pulse positions and corresponding polarities for each shift extent.
[0049] The gain quantizing circuit 370 reads out gain codevectors from a gain codebook 380. In the modes 1 to 3, the circuit 370 executes gain codevector retrieval for the plurality of selected sets of polarity codevector and pulse positions such as to minimize the following Equation (15), and selects the set of gain codevector and polarity codevector which minimizes the distortion:

$$D_k = \sum_{n=0}^{N-1}\Bigl[x'_w(n) - \beta'_k\,v(n-T) * h_w(n) - G'_k\sum_{i=1}^{M} g_{ik}\,h_w(n - m_i)\Bigr]^2 \qquad (15)$$
[0050] In the above example, the gain of the adaptive codebook and the excitation gain represented by the pulses are simultaneously vector quantized. The gain quantizing circuit 370 feeds out the index representing the selected polarity codevector, the code representing the pulse positions and the index representing the gain codevector to the multiplexer 400.
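The joint gain retrieval can be sketched as an exhaustive scan of the two-dimensional gain codebook; the pre-filtered vectors yw and swk are assumed to have been computed as in the earlier sketches:

```python
import numpy as np

def gain_vq(xw, yw, swk, gain_codebook):
    """Joint VQ of the adaptive codebook gain and the excitation gain as in
    Equation (15): pick the 2-D codevector (beta_k, G_k) minimizing
    ||xw - beta_k * yw - G_k * swk||^2, where yw is the filtered adaptive
    codevector and swk the filtered pulse excitation.  gain_codebook is a
    (K, 2) array of (beta_k, G_k) pairs."""
    best_k, best_d = 0, np.inf
    for k, (beta_k, g_k) in enumerate(gain_codebook):
        r = xw - beta_k * yw - g_k * swk
        d = float(np.dot(r, r))
        if d < best_d:
            best_k, best_d = k, d
    return best_k    # index fed to the multiplexer 400
```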
[0051] When the discrimination data indicates the mode 0, the gain quantizing circuit 370 receives the plurality of shift extents and the polarity corresponding to each pulse position for each shift extent, executes gain codevector retrieval, and selects the set of gain codevector and shift extent which minimizes the following Equation (16):

$$D_{jk} = \sum_{n=0}^{N-1}\Bigl[x'_w(n) - \beta'_k\,v(n-T) * h_w(n) - G'_k\sum_{i=1}^{M} g_i\,h_w\bigl(n - m_i(j)\bigr)\Bigr]^2 \qquad (16)$$

where β'_k and G'_k constitute the k-th codevector in a two-dimensional gain codebook stored in the gain codebook 380, δ(j) is the j-th shift extent, m_i(j) is the position of the i-th pulse at the shift extent δ(j), and g_i is the polarity obtained for that position. The circuit 370 feeds out an index representing the selected gain codevector and a code representing the shift extent to the multiplexer 400.
[0052] In the modes 1 to 3, it is possible as well to preliminarily learn and store a codebook for the amplitude quantization of a plurality of pulses by using voice signals. As for the codebook learning method, reference may be had to Linde et al, "An algorithm for vector quantizer design", IEEE Trans. Commun., pp. 84-95, January 1980 (Literature 12).
[0053] A smoothing circuit 450 receives the mode discrimination data and, when the received mode data indicates a predetermined mode (for instance the mode 0), executes time-wise smoothing of at least one of the gain of the excitation signal in the gain codevector, the gain of the adaptive codebook, the RMS of the excitation signal and the spectral parameter.
[0054] The gain of the excitation signal is smoothed in a manner as given by the following Equation (17):

$$\tilde{G}(m) = \epsilon\,\tilde{G}(m-1) + (1-\epsilon)\,G(m) \qquad (17)$$

where m is the sub-frame number, the tilde denotes the smoothed value, and ε (0 ≤ ε < 1) is a predetermined smoothing coefficient.
[0055] The gain of the adaptive codebook is smoothed in a manner as given by the following Equation (18):

$$\tilde{\beta}(m) = \epsilon\,\tilde{\beta}(m-1) + (1-\epsilon)\,\beta(m) \qquad (18)$$

[0056] The RMS of the excitation signal is smoothed in a manner as given by the following Equation (19):

$$\widetilde{\mathrm{RMS}}_v(m) = \epsilon\,\widetilde{\mathrm{RMS}}_v(m-1) + (1-\epsilon)\,\mathrm{RMS}_v(m) \qquad (19)$$

[0057] The spectral parameter (LSP) is smoothed in a manner as given by the following Equation (20):

$$\widetilde{\mathrm{LSP}}(i,m) = \epsilon\,\widetilde{\mathrm{LSP}}(i,m-1) + (1-\epsilon)\,\mathrm{LSP}(i,m) \qquad (20)$$
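All four smoothing rules share the same first-order recursion; a minimal sketch, in which the coefficient value is an illustrative assumption (the patent leaves the constant open):

```python
def smooth(prev_smoothed, current, eps=0.7):
    """First-order time-wise smoothing used in Equations (17)-(20): drag
    each parameter (excitation gain, adaptive codebook gain, excitation
    RMS, LSPs) toward its previously smoothed value.  Works elementwise
    on scalars or numpy arrays."""
    return eps * prev_smoothed + (1.0 - eps) * current
```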
[0058] A weighting signal calculating circuit 360 receives the mode discrimination data and the smoothed signal output of the smoothing circuit 450 and, in the cases of the modes 1 to 3, obtains the drive excitation signal v(n) as given by the following Equation (21):

$$v(n) = \beta'_k\,v(n-T) + G'_k\sum_{i=1}^{M} g_{ik}\,\delta(n - m_i) \qquad (21)$$

where δ(n) denotes a unit impulse.
[0059] The weighting signal calculating circuit 360 feeds out v(n) to the adaptive codebook
circuit 470.
[0060] In the case of the mode 0, the weighting signal calculating circuit 360 obtains the drive excitation signal v(n) in the manner given by Equation (22), using the smoothed gains:

$$v(n) = \tilde{\beta}(m)\,v(n-T) + \tilde{G}(m)\sum_{i=1}^{M} g_i\,\delta\bigl(n - m_i(j)\bigr) \qquad (22)$$
[0061] The weighting signal calculating circuit 360 feeds out v(n) to the adaptive codebook
circuit 470.
[0062] The weighting signal calculating circuit 360 calculates the weighting signal s_w(n) for each sub-frame by using the output parameters of the spectral parameter calculating circuit 200, the spectral parameter quantizing circuit 210 and the smoothing circuit 450. In the modes 1 to 3, the circuit 360 calculates s_w(n) as given by Equation (23), and feeds out the calculated s_w(n) to the response signal calculating circuit 240:

$$s_w(n) = v(n) - \sum_{i=1}^{10}\alpha_i\,v(n-i) + \sum_{i=1}^{10}\alpha_i\gamma^i\,p(n-i) + \sum_{i=1}^{10}\alpha_i\gamma^i\,s_w(n-i) \qquad (23)$$
[0063] In the mode 0, the weighting signal calculating circuit 360 receives the smoothed LSP parameter obtained in the smoothing circuit 450, and converts this parameter to smoothed linear prediction coefficients α̃_i. The circuit 360 then calculates the weighting signal s_w(n) as given by Equation (24), and feeds out the result to the response signal calculating circuit 240:

$$s_w(n) = v(n) - \sum_{i=1}^{10}\tilde{\alpha}_i\,v(n-i) + \sum_{i=1}^{10}\tilde{\alpha}_i\gamma^i\,p(n-i) + \sum_{i=1}^{10}\tilde{\alpha}_i\gamma^i\,s_w(n-i) \qquad (24)$$
[0064] A second embodiment of the present invention will now be described with reference to Fig. 2.
[0065] A demultiplexer 500 separates, from the received signal, the index representing the gain codevector, the index representing the delay of the adaptive codebook, the mode discrimination data, the index of the excitation codevector and the index of the spectral parameter, and feeds out the individual separated parameters.
[0066] A gain decoding circuit 510 receives index of gain codevector and mode discrimination
data, and reads out and feeds out the gain codevector from a gain codebook 380 on
the basis of the received index.
[0067] An adaptive codebook circuit 520 receives the mode discrimination data and the delay of the adaptive codebook, generates an adaptive codevector, multiplies the adaptive codevector by the gain of the adaptive codebook, and feeds out the resultant product.
[0068] When the mode discrimination data is in the modes 1 to 3, an excitation restoring
circuit 540 generates an excitation signal on the basis of the polarity codevector,
pulse position data and gain codevector read out from excitation codebook 351, and
feeds out the generated excitation signal to an adder 550.
[0069] The adder 550 generates the drive excitation signal v(n) by using the outputs of the adaptive codebook circuit 520 and the excitation restoring circuit 540, and feeds out the generated v(n) to a synthesizing filter circuit 560.
[0070] A spectral parameter decoding circuit 570 decodes the spectral parameter, converts it to linear prediction coefficients, and feeds out the coefficient data thus obtained to the synthesizing filter circuit 560.
[0071] The synthesizing filter circuit 560 receives the drive excitation signal v(n) and
linear prediction coefficient, and calculates reproduced signal s(n).
[0072] A post-filtering circuit 600 executes post-filtering for masking quantization noise with respect to the reproduced signal s(n), and feeds out the post-filtered output signal s_p(n). The post-filter has a transfer characteristic given by Equation (25):

$$H_p(z) = \frac{1 - \sum_{i=1}^{10}\alpha'_i\gamma_n^i z^{-i}}{1 - \sum_{i=1}^{10}\alpha'_i\gamma_d^i z^{-i}}\,(1 - \mu z^{-1}) \qquad (25)$$

where α'_i are the decoded linear prediction coefficients and γ_n, γ_d and μ are predetermined constants controlling the extent of the post-filtering.
[0073] An inverse post/synthesizing filter circuit 610 constitutes the inverse filter of the post and synthesizing filters, and calculates a residue signal e(n). The inverse filter has a transfer characteristic given by Equation (26):

$$H_{inv}(z) = \frac{1}{H_p(z)}\,\Bigl(1 - \sum_{i=1}^{10}\alpha'_i z^{-i}\Bigr) \qquad (26)$$
[0074] A smoothing circuit 620 executes time-wise smoothing of at least one of the gain of the excitation signal in the gain codevector, the gain of the adaptive codebook, the RMS of the residue signal and the spectral parameter. The gain of the excitation signal, the gain of the adaptive codebook and the spectral parameter are smoothed in manners as given by the above Equations (17), (18) and (20), respectively. The RMS of the m-th sub-frame residue signal, RMS_e(m), is smoothed as given by Equation (27):

$$\widetilde{\mathrm{RMS}}_e(m) = \epsilon\,\widetilde{\mathrm{RMS}}_e(m-1) + (1-\epsilon)\,\mathrm{RMS}_e(m) \qquad (27)$$
[0075] The smoothing circuit 620 restores the drive excitation signal by using the smoothed parameter or parameters. The instant case concerns the restoration of the drive excitation signal by smoothing the RMS of the residue signal, the residue being rescaled to the smoothed level as given by the following Equation (28):

$$\tilde{e}(n) = \frac{\widetilde{\mathrm{RMS}}_e(m)}{\mathrm{RMS}_e(m)}\,e(n) \qquad (28)$$
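The decoder-side smoothing of Equations (27) and (28) thus reduces to measuring, smoothing and re-imposing the sub-frame level of the estimated residue; a sketch with an assumed smoothing coefficient:

```python
import numpy as np

def smooth_residue(e, prev_rms_smoothed, eps=0.7):
    """Rescale the residue e(n) recovered by the inverse post/synthesis
    filter so that its RMS follows the smoothed trajectory:
    Equation (27) smooths RMS_e(m); Equation (28) applies the ratio."""
    rms = float(np.sqrt(np.mean(e ** 2)))                        # RMS_e(m)
    rms_smoothed = eps * prev_rms_smoothed + (1.0 - eps) * rms   # Eq. (27)
    e_tilde = e * (rms_smoothed / max(rms, 1e-12))               # Eq. (28)
    return e_tilde, rms_smoothed
```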
[0076] The synthesizing filter 560 receives the drive excitation signal ẽ(n) obtained by using the smoothed parameter or parameters, and calculates the reproduced signal s̃(n). As an alternative, it is possible to use the smoothed linear prediction coefficients.
[0077] The post-filter 600 receives the pertinent reproduced signal, executes post-filtering thereof to obtain the final reproduced signal s̃_p(n), and feeds out this signal.
[0078] Fig. 3 is a block diagram showing a third embodiment. In Fig. 3, parts like those in Fig. 2 are designated by like reference numerals, and are not described again.
[0079] Referring to Fig. 3, an inverse post/synthesizing filter circuit 630 and a smoothing circuit 640 receive the mode discrimination data from the demultiplexer 500 and, when the discrimination data indicates a predetermined mode (for instance the mode 0), execute their operations. These operations are the same as those of the inverse post/synthesizing filter circuit 610 and the smoothing circuit 620 in Fig. 2, and are thus not described again.
[0080] As has been described in the foregoing, in the voice coding apparatus according to the present invention a synthesized signal is locally reproduced by using data obtained by time-wise smoothing of at least one of the spectral parameter, the gain of the adaptive codebook, the gain of the excitation codebook and the RMS of the excitation signal. Thus, even for voice with background noise superimposed thereon, it is possible to suppress local time-wise parameter variations in the background noise part even at a low bit rate, thus providing coded voice less subject to sound quality deterioration.
[0081] Also, in the voice decoding apparatus according to the present invention, used on the decoding side, a residue signal is obtained from the post-filtered signal by an inverse post-synthesis filtering process, a voice signal synthesizing process is executed afresh on the basis of a signal obtained as a result of time-wise smoothing of at least one of the RMS of the residue signal, the spectral parameter of the received signal, the gain of the adaptive codebook and the gain of the excitation codebook, and a post-filtering process is executed afresh, thereby feeding out a final synthesized signal. These processes may thus be added purely as post-processes to a prior art decoding apparatus without any change or modification thereof. It is thus possible to suppress local time-wise parameter variations in the background noise part and provide synthesized voice less subject to sound quality deterioration.
[0082] Furthermore, in the voice decoding apparatus according to the present invention, the parameter smoothing process is executed in a predetermined mode or in the case of presence of the feature quantity in a predetermined range. It is thus possible to execute the process only in a particular time section (for instance a silence time section). Thus, even in the case of coding voice with background noise superimposed thereon at a low bit rate, the background noise part can be satisfactorily coded without adversely affecting the voice time sections.
[0083] Changes in construction will occur to those skilled in the art and various apparently
different modifications and embodiments may be made without departing from the scope
of the present invention. The matter set forth in the foregoing description and accompanying
drawings is offered by way of illustration only. It is therefore intended that the
foregoing description be regarded as illustrative rather than limiting.
1. A voice coding apparatus including a spectral parameter calculating part for obtaining
a spectral parameter for each predetermined frame of an input voice signal and quantizing
the obtained spectral parameter, an adaptive codebook part for dividing the frame
into a plurality of sub-frames, obtaining a delay and a gain from a past quantized
excitation signal for each of the sub-frames by using an adaptive codebook and obtaining
a residue by predicting the voice signal, an excitation quantizing part for quantizing
the excitation signal of the voice signal by using the spectral parameter, and a gain
quantizing part for quantizing the gain of the adaptive codebook and the gain of the
excitation signal, comprising:
a mode discriminating part for extracting a predetermined feature quantity from the voice signal and judging the pertinent mode to be one of a plurality of predetermined modes on the basis of the extracted feature quantity;
a smoothing part for executing time-wise smoothing of at least one of the gain of the excitation signal, the gain of the adaptive codebook, the spectral parameter and the level of the excitation signal; and
a multiplexer part for locally reproducing a synthesized signal by using the smoothed signal and feeding out a combination of the outputs of the spectral parameter calculating, mode discriminating, adaptive codebook, excitation quantizing and gain quantizing parts.
2. The voice coding apparatus according to claim 1, wherein the mode discriminating part averages the pitch prediction gains obtained for the individual sub-frames over the full frame and classifies the frame into one of the plurality of predetermined modes by comparing the average value with a plurality of predetermined threshold values.
3. A voice decoding apparatus including a demultiplexer part for separating spectral parameter,
pitch, gain and excitation signal as voice data from a voice signal, an excitation
signal restoring part for restoring an excitation signal from the separated pitch,
excitation signal and gain, a synthesizing filter part for synthesizing a voice signal
on the basis of the restored excitation signal and the spectral parameter, and a post-filter
part for post-filtering the synthesized voice signal by using the spectral parameter,
comprising:
an inverse filter part for estimating an excitation signal through an inverse post-filtering
and inverse synthesis filtering on the basis of the output signal of the post-filter
part and the spectral parameter, and a smoothing part for executing time-wise smoothing of at least one of the level of the estimated excitation signal, the gain and
the spectral parameter, the smoothed signal or signals being fed to the synthesis
filter part, the synthesized signal output thereof being fed to the post-filter part
to synthesize a voice signal.
4. A voice decoding apparatus including a demultiplexer part for separating mode discrimination data, spectral parameter, pitch, gain and excitation signal on the basis of a feature
quantity of a voice signal to be decoded, an excitation signal restoring part for
restoring an excitation signal from the separated pitch, excitation signal and gain,
a synthesis filter part for synthesizing the voice signal by using the restored excitation
signal and the spectral parameter, and a post-filter part for post-filtering the synthesized
voice signal by using the spectral parameter, comprising:
an inverse filter part for estimating the voice signal on the basis of the output
signal of the post-filter part and the spectral parameter through an inverse post-filtering
and inverse synthesis filtering; and
a smoothing part for executing time-wise smoothing of at least either one of the level
of the estimated excitation signal, the gain and the spectral parameter, the smoothed
signal being fed to the synthesis filter part, the synthesis signal output thereof
being fed to the post-filter part.
5. The apparatus according to claim 3 or 4,
wherein the mode discrimination is executed by averaging the pitch prediction gains
each obtained for each sub-frame over the full frame and comparing the average value
thus obtained with a plurality of predetermined threshold values.
6. The apparatus according to claim 1, 2, 3, 4 or 5,
wherein the mode discriminating part executes mode discriminating for each frame.
7. The apparatus according to any one of claims 1 to 6,
wherein the feature quantity is the pitch prediction gain.
8. The apparatus according to any one of claims 1 to 7, wherein the plurality of predetermined
modes substantially correspond to a silence, a transient, a weak voice and a strong
voice time section, respectively.
9. A voice decoding apparatus for locally reproducing a synthesized voice signal on the
basis of a signal obtained through time-wise smoothing of at least either one of spectral
parameter of the voice signal, gain of an adaptive codebook, gain of an excitation
codebook and RMS of an excitation signal.
10. A voice decoding apparatus for obtaining a residue signal from a signal obtained after
post-filtering through an inverse post-synthesis filtering process, executing a voice
signal synthesizing process afresh on the basis of a signal obtained through time-wise
smoothing of at least one of RMS of residue signal, spectral parameter of received
signal, gain of adaptive codebook and gain of excitation codebook and executing a
post-filtering process afresh, thereby feeding out a final synthesized signal.
11. A voice decoding apparatus for obtaining a residue signal from a signal obtained after
post-filtering through an inverse post-synthesis filtering process, and in a mode
determined on the basis of a feature quantity of a voice signal to be decoded or in
the case of presence of the feature quantity in a predetermined range, executing a
voice signal synthesizing process afresh on the basis of a signal obtained through
time-wise smoothing of at least either one of RMS of the residue signal, spectral
parameter of a received signal, gain of an adaptive codebook and gain of an excitation
codebook, and executing a post-filtering process afresh, thereby feeding out a final
synthesized signal.
12. A voice coding method including a step for obtaining a spectral parameter for each
predetermined frame of an input voice signal and quantizing the obtained spectral
parameter, a step for dividing the frame into a plurality of sub-frames, obtaining
a delay and a gain from a past quantized excitation signal for each of the sub-frames
by using an adaptive codebook and obtaining a residue by predicting the voice signal,
a step for quantizing the excitation signal of the voice signal by using the spectral
parameter, and a step for quantizing the gain of the adaptive codebook and the gain
of the excitation signal, further comprising steps of:
extracting a predetermined feature quantity from the voice signal and judging the pertinent mode to be one of a plurality of predetermined modes on the basis of the extracted feature quantity;
executing time-wise smoothing of at least one of the gain of the excitation signal, the gain of the adaptive codebook, the spectral parameter and the level of the excitation signal; and
locally reproducing a synthesized signal by using the smoothed signal and feeding out a combination of the spectral parameter data, mode discrimination data, adaptive codebook data, excitation quantizing data and gain quantizing data.
13. A voice decoding method including a step for separating spectral parameter, pitch,
gain and excitation signal as voice data from a voice signal, a step for restoring
an excitation signal from the separated pitch, excitation signal and gain, a step
for synthesizing a voice signal on the basis of the restored excitation signal and
the spectral parameter, and a step for post-filtering the synthesized voice signal
by using the spectral parameter, further comprising steps of:
estimating an excitation signal through an inverse post-filtering and inverse synthesis
filtering on the basis of the post-filtered signal and the spectral parameter; and
executing time-wise smoothing of at least one of the level of the estimated excitation signal, the gain and the spectral parameter, the smoothed signal or signals
being fed to the synthesis filtering, the synthesized signal output thereof being
fed to the post-filtering to synthesize a voice signal.
14. A voice decoding method including a step for separating mode discrimination data,
spectral parameter, pitch, gain and excitation signal on the basis of a feature quantity
of a voice signal to be decoded, a step for restoring an excitation signal from the
separated pitch, excitation signal and gain, a step for synthesizing the voice signal
by using the restored excitation signal and the spectral parameter, and a step for
post-filtering the synthesized voice signal by using the spectral parameter, comprising
steps of:
estimating the voice signal on the basis of the post-filtered signal and the spectral
parameter through an inverse post-filtering and inverse synthesis filtering; and
executing time-wise smoothing of at least either one of the level of the estimated
excitation signal, the gain and the spectral parameter;
the smoothed signal being fed to the synthesis filtering, the synthesis signal output
thereof being fed to the post-filtering.
15. A voice decoding method for locally reproducing a synthesized voice signal on the
basis of a signal obtained through time-wise smoothing of at least either one of spectral
parameter of the voice signal, gain of an adaptive codebook, gain of an excitation
codebook and RMS of an excitation signal.
16. A voice decoding method for obtaining a residue signal from a signal obtained after
post-filtering through an inverse post-synthesis filtering process, executing a voice
signal synthesizing process afresh on the basis of a signal obtained through time-wise
smoothing of at least one of RMS of residue signal, spectral parameter of received
signal, gain of adaptive codebook and gain of excitation codebook and executing a
post-filtering process afresh, thereby feeding out a final synthesized signal.
17. A voice decoding method for obtaining a residue signal from a signal obtained after
post-filtering through an inverse post-synthesis filtering process, and in a mode
determined on the basis of a feature quantity of a voice signal to be decoded or in
the case of presence of the feature quantity in a predetermined range, executing a
voice signal synthesizing process afresh on the basis of a signal obtained through
time-wise smoothing of at least either one of RMS of the residue signal, spectral
parameter of a received signal, gain of an adaptive codebook and gain of an excitation
codebook, and executing a post-filtering process afresh, thereby feeding out a final
synthesized signal.