(19)
(11)EP 0 578 436 A1

(12)EUROPEAN PATENT APPLICATION

(43)Date of publication:
12.01.1994 Bulletin 1994/02

(21)Application number: 93305133.6

(22)Date of filing:  30.06.1993
(51)International Patent Classification (IPC)5: G10L 9/14, G10L 9/18, G10L 9/00
(84)Designated Contracting States:
DE ES FR GB IT

(30)Priority: 10.07.1992 US 911850

(71)Applicant: AT&T Corp.
New York, NY 10013-2412 (US)

(72)Inventors:
  • Kleijn, Willem Bastiaan
    Basking Ridge, New Jersey 07920 (US)
  • Kroon, Peter
    Green Brook, New Jersey 08812 (US)

(74)Representative: Watts, Christopher Malcolm Kelway, Dr. et al
AT&T (UK) Ltd. 5, Mornington Road
Woodford Green Essex, IG8 0TU (GB)




    (54)Selective application of speech coding techniques


    (57) A speech coding method and apparatus which selectively applies speech coding techniques to time segments of speech information signals, such as, e.g., pitch-cycle waveforms, is disclosed. A speech information signal comprising N signal segments is coded with a first speech coder to provide a first coded representation for each of the N signal segments. A second speech information signal reflecting speech information not coded by the first coder is determined for each of one or more of the N signal segments. In addition to coding the N first speech information signal segments with the first speech coder, M of the second speech information signals are coded with a second speech coder, where 1 ≦ M ≦ N - 1. The selective coding of M of the second speech information signals is done responsive to a coding criterion. By selective use of the second speech coder, the number of bits needed to represent speech information may be reduced, or alternatively, better performance may be obtained without an increase in bit rate. The first and second speech coders may be any of those known in the art.




    Description

    Field of the Invention



    [0001] The present invention relates generally to speech communication systems and more specifically to coding techniques for speech compression.

    Background of the Invention



    [0002] Efficient communication of speech information often involves the coding of speech signals for transmission over a channel or network ("channel"). Speech coding systems include coding processes which convert speech signals into codewords for transmission over the channel and decoding processes which reconstruct speech from received codewords. These coding and decoding processes provide data compression and expansion useful for communication of speech signals over channels of limited bandwidth.

    [0003] In analysis-by-synthesis speech coding systems, such as code-excited linear predictive (CELP) speech coding known in the art, a speech signal for coding is first divided into contiguous time segments of fixed duration referred to as subframes. Each subframe is typically 2.5 to 7.5 milliseconds (ms) in duration. Most of the speech information of each subframe is coded as a set of parameters characterizing the speech signal within the subframe. Several contiguous coded subframes (usually 4 or 6) are collected together in groups referred to as frames. These frames of coded speech are communicated via a channel to a receiver. The receiver may, e.g., synthesize audible speech from the received frame information.

    [0004] A goal of most speech coding systems is to provide faithful reproduction of original speech sounds such as, e.g., voiced speech, produced when the vocal cords are tensed and vibrating quasi-periodically. In the time domain, a voiced speech signal usually appears as a succession of similar but slowly evolving waveforms referred to as pitch-cycles. A pitch-cycle waveform is generally characterized by a major transient surrounded by a succession of lower amplitude vibrations. A single one of these pitch-cycle waveforms has a duration referred to as a pitch-period.

    [0005] Because of the nature of the voiced speech signal pitch-cycle, speech coding systems which operate on a subframe basis aim to accurately represent widely disparate signal features within a subframe. How these speech signal features are treated by a speech coding system significantly affects system performance.

    Summary of the Invention



    [0006] The present invention provides a speech coding method and apparatus which selectively applies speech coding techniques to time segments of speech information signals, such as, e.g., pitch-cycle waveforms. A speech information signal comprising N signal segments is coded with a first speech coder to provide a first coded representation for each of the N signal segments. A second speech information signal reflecting speech information not coded by the first coder is determined for each of one or more of the N signal segments. In addition to coding the N first speech information signal segments with the first speech coder, M of the second speech information signals are coded with a second speech coder, where 1 ≦ M ≦ N - 1. The selective coding of M of the second speech information signals is done responsive to a coding criterion. By selective use of the second speech coder, the number of bits needed to represent speech information may be reduced, or alternatively, better performance may be obtained without an increase in bit rate. The first and second speech coders may be any of those known in the art.

    [0007] Illustrative embodiments of the present invention provide improved CELP speech coding systems. Such improved CELP systems are adapted to provide for subframes of 2.5 ms in duration. These subframes serve as the segments referenced above. Given their short duration, many subframes of a speech information signal will not contain a major signal transient. The illustrative embodiments provide coding for all subframes with the first speech coder. For those subframes without a major transient, such coding may be all that is required to satisfy an applicable coding criterion, such as a threshold signal energy. For those subframes which include a major transient, additional coding may be employed to meet the applicable criterion. In this way, speech information signal coding is tailored on a subframe basis to meet coding requirements as needed.

    [0008] In a first illustrative embodiment of the present invention, the selection of second speech information signals for coding with a second speech coder is based upon the coding criterion. In a second illustrative embodiment, the coding of second speech information signals involves coding several trial combinations of second speech information signals and selecting one of the combinations based on a coding criterion.

    Brief Description of the Drawings



    [0009] Figure 1 presents a first illustrative embodiment of the present invention.

    [0010] Figure 2 presents three contiguous frames of a speech information signal x(i).

    [0011] Figure 3 presents an illustrative bit format for one frame of coded speech information.

    [0012] Figure 4 presents an illustrative embodiment of a receiver for use with the illustrative embodiment of Figure 1.

    [0013] Figure 5 presents a second illustrative embodiment of the present invention.

    [0014] Figure 6 presents a speech coding subsystem, comprising adaptive and fixed codebooks, for use with the illustrative embodiment of Figure 5.

    Detailed Description


    A. Introduction to the Illustrative Embodiments



    [0015] For clarity of explanation, the illustrative embodiments of the present invention are presented as comprising, among other things, individual functional blocks. The functions these blocks represent may be provided through the use of either shared or dedicated hardware, including, but not limited to, hardware capable of executing software. Illustrative embodiments may comprise digital signal processor (DSP) hardware, such as the AT&T DSP16 or DSP32C, and software performing the operations discussed below. Very large scale integration (VLSI) hardware embodiments of the present invention, as well as hybrid DSP/VLSI embodiments, may also be provided.

    [0016] The illustrative embodiments of the present invention provide an improvement to conventional CELP speech coding. Because the embodiments are directed to an improvement of CELP, those aspects of the embodiments ordinarily found in conventional CELP will not be discussed in great detail. For a discussion of conventional CELP and related topics, see commonly assigned U.S. Patent Application Serial No. 07/782,686, which is hereby incorporated by reference as if set forth fully herein. In light of this incorporated disclosure and the discussion to follow, it will be apparent to those of ordinary skill in the art that the present invention is applicable to various other speech coding systems, not merely analysis-by-synthesis coding systems generally, or CELP coders specifically.

    [0017] The illustrative embodiments of the present invention concern selective application of two speech coders. The first speech coder comprises a long term predictor (LTP) (either alone or in combination with a linear predictive filter (LPF)). The second comprises a fixed stochastic codebook (FSCB) and search mechanism. As in conventional CELP, the embodiments code subframes of a speech information signal. These subframes are packaged together in conventional fashion as a frame of coded speech information and communicated to a receiver. Each frame is 20 ms in duration and comprises eight 2.5 ms subframes of speech information.

    [0018] The present invention provides coding for voiced speech signals. Coding for other types of speech signals, e.g., silence and unvoiced speech, may be provided by conventional coding techniques known in the art. Switching between such coding techniques and embodiments of the present invention may also be accomplished by conventional techniques known in the art. See, e.g., commonly assigned United States Patent No. 5,007,093, which is hereby incorporated by reference as if fully set forth herein. For the sake of the clarity of explanation of the present invention, these well understood techniques will not be presented further.

    [0019] Communication channels for use with embodiments of the present invention may comprise, e.g., a telecommunications network, such as a telephone network or radio link, or a storage medium, such as a semiconductor memory, magnetic disk or tape memory, or CD-ROM (combinations of a network and a storage medium may also be provided). Within the context of the present invention, a receiver is any device which receives coded speech signals over the communications channel. So, e.g., a receiver may comprise a CD-ROM reader, a disk or tape drive, a cellular or conventional telephone, a radio receiver, etc. Thus, the communication of signals via the channel may comprise, e.g., signal transmission over a network or link, signal storage in a storage medium, or both.

    B. A First Illustrative Embodiment



    [0020] A first illustrative embodiment of the present invention is presented in Figure 1. As shown in the Figure, a sampled speech information signal, s(i), (where i is the sample index) is provided to a linear predictive filter 20 and a linear predictive analyzer 10. Signal s(i) may be provided, e.g., by conventional analog-to-digital conversion of an analog speech signal. Linear predictive analyzer (LPA) 10 computes linear prediction coefficients in the conventional fashion well known in the art based on the signal s(i). The coefficients are determined and quantized by LPA 10 to be valid at frame boundaries, as in conventional CELP. Coefficient values, a_r, valid at the center of subframes within the boundaries are determined by conventional interpolation of quantized frame boundary coefficient data by LPA 10. The coefficients, a_r, valid at subframe centers are output to buffer 27 and LPF 20. Coefficients valid at frame boundaries are additionally output to channel interface 55. Values of a_r valid at the center of subframes are used by LPF 20 and, via buffer 27, LTP 30 and FSCB search 40, in the conventional manner.

    [0021] Signal x(i) -- the first speech information signal of the illustrative embodiment -- is formed in the conventional manner by LPF 20 based on coefficients provided by LPA 10. Two subframes of signal x(i) are provided by LPF 20, one subframe (i.e., 20 samples) at a time, by the filtering of successive samples of LPF 20 input signal s(i) as follows:


    $$x(i) = s(i) - \sum_{r=1}^{R} a_r\, s(i - r) \qquad (1)$$

    where linear prediction coefficients a_r are valid at the center of the subframe in question. Since R is usually about 10 samples (for an 8 kHz sampling rate), the signal x(i) retains the long-term periodicity of the original signal, s(i). LTP 30, discussed below, is provided to remove this redundancy.
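
    The filtering performed by LPF 20 can be sketched as follows. This is only an illustration, assuming the conventional prediction-error form x(i) = s(i) - Σ a_r s(i - r) with the sum taken over r = 1, ..., R; the sign convention, the function name lp_residual, and the zero initial filter state are assumptions made for the example.

```python
def lp_residual(s, a, start, length):
    """Return one subframe of x(i) = s(i) - sum_{r=1..R} a[r-1] * s(i - r).

    s      : list of input speech samples s(i)
    a      : list of R prediction coefficients (assumed sign convention)
    start  : index of the first sample of the subframe
    length : number of samples in the subframe (e.g., 20)
    Samples before index 0 are treated as zero (assumed initial state).
    """
    x = []
    for i in range(start, start + length):
        prediction = 0.0
        for r, coeff in enumerate(a, start=1):
            if i - r >= 0:
                prediction += coeff * s[i - r]
        x.append(s[i] - prediction)
    return x
```

    With R of about 10, the predictor spans little more than one millisecond at 8 kHz, so correlation between samples one pitch-period apart is left in x(i).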

    [0022] Subframes of signal x(i) are output from LPF 20 and are provided to subframe analyzer 25 and buffer 29. Analyzer 25 and buffer 29 each store pairs of subframes of the information signal x(i) provided by LPF 20. In accordance with the present invention, subframe analyzer 25 determines, for each pair of subframes it has stored, which subframe should be coded with use of the first coder only (i.e., the LTP 30), and which should be coded with use of both the first and second coders (i.e., the LTP 30 and the FSCB system 40, 45). This determination is based on the speech information signal energy of each subframe of the pair. The subframe which exhibits the greater signal energy is chosen by analyzer 25 for coding with use of both the first and second speech coders. The other subframe -- the one with less signal energy -- is coded with use of the first speech coder, but not the second. Subframe energy is determined by analyzer 25 in conventional fashion:


    $$E = \sum_{i=1}^{L} x^2(i) \qquad (2)$$

    where L is the number of samples in a subframe (e.g., L = 20 samples).
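
    The energy comparison made by analyzer 25 can be sketched as follows, taking subframe energy to be the sum of squared sample values. The function names and the handling of equal energies (the earlier subframe is chosen) are assumptions of this sketch, not details given in the text.

```python
def subframe_energy(x):
    """Sum of squared sample values over one subframe of x(i)."""
    return sum(v * v for v in x)


def select_subframe(first, second):
    """Return 0 if the earlier subframe of the pair should be coded with
    both coders, or 1 if the later one should be.

    The subframe with the greater energy is selected; on a tie the
    earlier subframe is selected (an assumption of this sketch).
    """
    return 0 if subframe_energy(first) >= subframe_energy(second) else 1
```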

    [0023] Subframe energy is determined by analyzer 25 for each subframe of a subframe pair prior to coding either of the two subframes. Once the determination of subframe energy has been made, the subframes of the pair in question may be coded in turn. Copies of these subframes are stored in buffer 29, as discussed above, for the purpose of coding by the embodiment. Linear prediction coefficients from analyzer 10 needed for coding these buffered subframes are stored in buffer 27.

    [0024] Buffers 27, 29 do not add coding delay to the system. This is because ordinary linear prediction analyzers and filters, e.g., LPA 10 and LPF 20, must themselves collect and store speech information signal values in order to determine linear prediction coefficients and filtered speech information. In one conventional form of linear prediction analysis, the LPA 10 stores one-half frame of speech information signal samples on each side of a frame boundary at which linear prediction coefficients are to be computed. Therefore, prior to determining linear prediction coefficients valid at the center of the first subframe of a given frame, the conventional LPA 10 introduces a delay of one and one-half frames. Since samples (e.g., whole subframes) of speech information signals must be stored for the computation of these linear prediction coefficients, the storage of subframes in buffer 27 may be implemented as a block transfer of information which can occur without sample delay. Thus, no delay need be introduced by virtue of buffer 27, 29 storage.

    [0025] Analyzer 25 controls the coding of the pair of subframes stored in buffer 29 by the generation of an enable signal, ε, which it provides to the coders. Once ε is appropriately asserted, the subframes of a buffered subframe pair are coded, one at a time, by application of the first coder -- the LTP 30.

    [0026] The LTP 30 of the illustrative embodiment comprises a conventional CELP adaptive codebook and search mechanism which determines a gain λ(i) and a delay d(i) (although indexed by i, values for d(i) and λ(i) are constant for all samples within a subframe). LTP 30 will be enabled to operate when ε takes on a value other than 00 (see discussion of ε below). Computed values for delay and gain for each coded subframe are provided by LTP 30 to channel interface 55 as shown in Figure 1. A subframe of a residual speech information signal, r(i), -- the second speech information signal of the embodiment -- is determined as follows:


    $$r(i) = x(i) - \lambda(i)\,\hat{x}(i - d(i)) \qquad (3)$$

    where the x̂(i - d(i)) are samples of a speech information signal synthesized (or reconstructed) in earlier subframes. To facilitate implementation of (3), LTP 30 provides the quantity λ(i) x̂(i - d(i)) to subtraction circuit 35. Signal r(i) is the speech information signal remaining after λ(i) x̂(i - d(i)) is subtracted from x(i) by circuit 35; r(i) reflects speech information not coded by the first speech coder. Signal r(i) may then be coded with a FSCB mechanism 40 under the control of subframe analyzer 25 by enable signal, ε.
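
    The subtraction performed by circuit 35 can be sketched as follows for one subframe, with the delay and gain held constant over the subframe as stated above. The sketch assumes that the delay is at least one subframe long, so that every delayed sample already exists in the buffer of previously synthesized samples; handling of shorter delays is not shown. The function and argument names are illustrative.

```python
def ltp_residual(x, past, delay, gain):
    """Return r(i) = x(i) - gain * x_hat(i - delay) for one subframe.

    x     : samples of the current subframe of x(i)
    past  : previously synthesized samples x_hat, most recent sample last
    delay : LTP delay d in samples (assumed >= len(x) in this sketch)
    gain  : LTP gain (lambda)
    """
    length = len(x)
    if delay < length or delay > len(past):
        raise ValueError("delay must satisfy len(x) <= delay <= len(past)")
    # Index in `past` of the sample that lies `delay` samples before x[0].
    start = len(past) - delay
    return [x[i] - gain * past[start + i] for i in range(length)]
```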

    [0027] The enable signal, ε, is provided by analyzer 25 to the fixed stochastic codebook (FSCB) search mechanism 40 to control application of the FSCB to the subframe of a pair of subframes determined to contain the greater energy. The enable signal, ε, may be implemented with two bits. So, e.g., when the bits forming ε are 01, the FSCB system 40, 45 codes the first (or earlier) subframe of a subframe pair. When the bits forming ε are 10, the FSCB system 40, 45 codes the second subframe of the pair (ε equalling 00 indicates a wait or idle state for both coders commensurate with speech information signal buffering).

    [0028] When the enable signal is asserted (as either a 01 or 10), the FSCB search mechanism 40 operates to determine a vector from the FSCB 45 and a scaling factor, µ(i), which in combination most closely match the signal r(i) associated with the subframe to be coded. The FSCB 45 and search mechanism 40 are conventional in the art except for the control provided by the analyzer 25. FSCB mechanism 40 provides as output to channel interface 55 an index indicating the determined FSCB vector, IFC, and an associated scaling factor, µ(i). When the enable signal from analyzer 25 is not asserted (i.e., ε is 00), the FSCB mechanism 40 sits idle.

    [0029] Analyzer 25 also provides to channel interface 55 a single bit for each pair of subframes processed by the embodiment of Figure 1. This bit, referred to as the subframe selection bit, ξ, reflects the asserted value of ε supplied to FSCB 40. When ε is set to 01, the subframe selection bit ξ is set to 0. When ε is set to 10, ξ is set to 1. Channel interface 55 requires a subframe selection bit ξ for each pair of coded subframes to provide an indication of which subframe has been coded with both coders and which has not.
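
    The relationship between the two-bit enable signal ε and the subframe selection bit ξ described above can be sketched as follows. The constant and function names are illustrative, and the choice of the earlier subframe when both energies are equal is an assumption of the sketch.

```python
IDLE = 0b00         # both coders wait while subframes are buffered
CODE_FIRST = 0b01   # second coder applied to the earlier subframe
CODE_SECOND = 0b10  # second coder applied to the later subframe


def enable_signal(energy_first, energy_second):
    """Choose the enable value for a pair from the two subframe energies.

    Ties select the earlier subframe (an assumption of this sketch).
    """
    return CODE_FIRST if energy_first >= energy_second else CODE_SECOND


def selection_bit(enable):
    """Map an asserted enable value to the subframe selection bit."""
    if enable == CODE_FIRST:
        return 0
    if enable == CODE_SECOND:
        return 1
    raise ValueError("no selection bit is defined for the idle state")
```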

    [0030] Once coding of the two subframes of a subframe pair is complete, coding is halted until analyzer 25 has determined how to code the next successive pair of subframes. Analyzer 25 halts coding by providing ε equal to 00. First and second coders operate responsive to the asserted ε signal and then check ε when done. If ε equals 00, they halt; otherwise they proceed to code the next pair of subframes as described above.

    [0031] Figure 2 is provided to facilitate an understanding of how the analyzer 25 and the buffers 27 and 29 operate over time with the other components of the illustrative embodiment of Figure 1. Figure 2 presents contiguous frames of the speech information signal x(i). These frames are provided to analyzer 25 for energy determinations (actual sample values for signal x(i) are not shown for the sake of clarity). As shown in the Figure, each of the frames, F - 1, F, and F + 1, comprises eight subframes, labeled a through h. Since each frame comprises 160 samples (or 20 ms of speech information at 8kHz sampling rate), each of the labeled subframes comprises 20 samples (or 2.5 ms of speech information). Consecutive pairs of subframes within each frame are numbered 1 through 4.
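
    The division of a frame into subframes and consecutive subframe pairs can be sketched as follows, using the 160-sample frame and 20-sample subframe of the illustrative embodiment. The function name is illustrative.

```python
FRAME_LEN = 160     # samples per frame (20 ms at 8 kHz)
SUBFRAME_LEN = 20   # samples per subframe (2.5 ms at 8 kHz)


def subframe_pairs(frame):
    """Split one frame into its four consecutive pairs of subframes.

    Returns [(a, b), (c, d), (e, f), (g, h)], each element being a list
    of SUBFRAME_LEN samples.
    """
    if len(frame) != FRAME_LEN:
        raise ValueError("a frame must contain exactly 160 samples")
    subs = [frame[k:k + SUBFRAME_LEN]
            for k in range(0, FRAME_LEN, SUBFRAME_LEN)]
    return [(subs[k], subs[k + 1]) for k in range(0, len(subs), 2)]
```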

    [0032] Assume that a signal, s(i), has been provided to LPA 10 and LPF 20 of Figure 1 as is conventional in CELP coders. As a result, LPA 10 has determined LP coefficients valid at the frame boundary between frames F - 1 and F, and at the frame boundary between frames F and F + 1. These coefficients are used in a conventional interpolation process by LPA 10 to provide subframe coefficients as discussed above. These subframe coefficients are used by LPF 20 in conventional fashion to filter subframes of signal s(i).

    [0033] At the outset, two subframes of signal s(i) are filtered by LPF 20 to yield the first pair of subframes of signal x(i) in frame F: subframes a and b (i.e., frame F, pair 1). Analyzer 25 and buffer 29 receive and store subframes a and b of frame F. The enable signal bits provided by analyzer 25 are set to 00, reflecting an idle state of the coding system. Analyzer 25 determines which of subframes a and b contains the greater amount of energy as discussed above. Responsive to this determination, analyzer 25 controls the coding of subframes a and b by the first and second coders. As part of this control process, analyzer 25 provides an enable signal, ε, indicating which of the two subframes is to be coded with both coders.

    [0034] Once the enable signal is provided, coding occurs as described above. Analyzer 25 can then reset the enable signal to 00. Analyzer 25 and buffer 29 proceed to store the next contiguous pair of subframes -- frame F, subframe pair 2, comprising subframes c and d. Analyzer 25 again determines which subframe of the pair contains the greater energy and thereafter controls the coding of subframes c and d responsive to this determination.

    [0035] The determination of subframe energy and control of coders is repeated for each consecutive pair of subframes in the speech information signal. So, for example, after coding subframes c and d, the embodiment of Figure 1 proceeds to code subframes e and f (i.e., pair 3), and subframes g and h (i.e., pair 4) of frame F. As a result of coding only one subframe of each consecutive subframe pair with the second coder, the second coder has been used to code only 4 of the 8 subframes in frame F. At this point, LPA 10 computes additional frame boundary linear prediction coefficients (e.g., coefficients valid at the right boundary of frame F + 1) and the whole process repeats itself, from one frame to the next, for as long as there are signal subframes to code.

    [0036] Over the course of coding eight subframes of a frame of speech, information representative of each coded speech subframe is collected by channel interface 55 for transmission to a receiver over a channel 56. The receiver uses this information in the reconstruction of speech. This information comprises LTP parameters λ(i) and d(i), the FSCB index, IFC, and scaling factor, µ(i) (for the appropriate higher energy subframes), and the linear prediction coefficients a_r, valid at the later of the two frame boundaries associated with the coded frame. This information further comprises a set of subframe selection bits, ξ, identifying which subframe in each successive pair of coded subframes has been coded with use of both coders. Channel interface 55 buffers all information it receives during the coding of a frame and maps (or assembles) the buffered information into a format suitable for communication over channel 56.

    [0037] Figure 3 presents an illustrative format of a frame of coded speech information as assembled by interface 55. This format comprises 158 bits which are partitioned among various quantities needed by a receiver to reconstruct a frame of speech. These quantities include LTP 30 information (i.e., delay and gain) for all eight subframes of the frame, and FSCB system 40, 45 information (i.e., codebook index and gain) for four of the eight subframes.

    [0038] As shown in the Figure, linear prediction coefficients, a_r, 1 ≦ r ≦ 10, are represented by a field of 30 bits. These 30 bits are used to represent the coefficients in the conventional fashion well known in the art.

    [0039] Also represented is LTP delay and gain information for each of the eight subframes of a coded frame. Each subframe's LTP delay, d(i), is represented by a 7 bit field. Each subframe's LTP gain, λ(i), is represented by a 4 bit field. Therefore, a total of 88 bits (i.e., 8 subframes x (7 bits + 4 bits)) are used to represent coded speech information provided by the first coder -- the LTP 30.

    [0040] As an alternative to coding each delay of a frame with 7 bits, either the fourth or the fifth subframe delay may be coded with 7 bits and the other seven subframe delays may be coded differentially, using 2 bits per subframe differential delay value. This practice saves a total of 35 bits (7 subframes x (7 bits - 2 bits)), reducing the number of bits required to code a frame from 158 to 123.

    [0041] As a further alternative to coding multiple delay values (whether differential or otherwise) for each frame, the present invention may be combined with the generalized analysis-by-synthesis techniques disclosed in U.S. Patent Application Serial No. 07/782,686 and incorporated by reference above. By virtue of combining the present invention with the techniques of the referenced application, delay information need be sent only once for each coded frame. Thus, e.g., only seven bits need be used to represent delay for the entire frame. This provides a savings of an additional 14 bits. To combine the techniques of the referenced application with those of the present invention, the embodiments presented in Figures 3 and 5 of the referenced application may each be modified to buffer signal x(i) and parameters M and a_n while subframe analysis is performed in accordance with the first illustrative embodiment of the present invention. Alternatively, embodiments presented in Figures 3 and 5 of the referenced application may each be used as coding subsystems in accordance with the second illustrative embodiment of the present invention (see below).

    [0042] Figure 3 further shows a 4 bit subframe selection field which contains a subframe selection bit, ξ, for each of four contiguous pairs of subframes coded. Each of these four bits represents one of the four subframe pairs. As stated above, a zero-valued selection bit indicates the first (i.e., the earlier) of two subframes of a subframe pair has been coded with use of both coders, while a one-valued selection bit indicates the second (i.e., the later) of two such subframes has been so coded.

    [0043] After the four bits designated for subframe selection, the channel format includes a field for the representation of FSCB system 40, 45 information. The bits of this field are divided among the four subframes identified by the subframe selection bit field. For each such identified subframe, a FSCB index, IFC (6 bits), and a FSCB scaling factor, µ(i) (3 bits), are communicated. Thus, the field comprises 36 bits (4 subframes x (3 bits + 6 bits)).
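
    The bit counts given for the illustrative frame format, and for the two alternative delay codings, can be checked with the short calculation below. The field widths are those stated in the text; the constant and function names are illustrative.

```python
LPC_BITS = 30          # linear prediction coefficients, one set per frame
SUBFRAMES = 8          # subframes per frame
DELAY_BITS = 7         # LTP delay, per subframe
DIFF_DELAY_BITS = 2    # differentially coded LTP delay, per subframe
LTP_GAIN_BITS = 4      # LTP gain, per subframe
SELECTION_BITS = 4     # one selection bit per subframe pair
FSCB_SUBFRAMES = 4     # subframes coded with the second coder
FSCB_INDEX_BITS = 6    # fixed codebook index
FSCB_GAIN_BITS = 3     # fixed codebook scaling factor


def frame_bits(total_delay_bits):
    """Total bits per frame for a given number of delay bits per frame."""
    return (LPC_BITS
            + total_delay_bits
            + SUBFRAMES * LTP_GAIN_BITS
            + SELECTION_BITS
            + FSCB_SUBFRAMES * (FSCB_INDEX_BITS + FSCB_GAIN_BITS))


# One 7 bit delay per subframe: 30 + 56 + 32 + 4 + 36 = 158 bits.
full = frame_bits(SUBFRAMES * DELAY_BITS)
# One 7 bit delay plus seven 2 bit differential delays: 123 bits.
differential = frame_bits(DELAY_BITS + (SUBFRAMES - 1) * DIFF_DELAY_BITS)
# A single 7 bit delay for the whole frame: 109 bits.
single = frame_bits(DELAY_BITS)
```

    Since each frame represents 20 ms of speech, 50 frames are coded per second, so these totals correspond to 7900, 6150 and 5450 bits per second, respectively.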

    [0044] A frame of coded speech information in the format described above is communicated over communication channel 56 to a receiver. The receiver reconstructs or synthesizes a frame of speech information from the coded frame. An illustrative embodiment of a receiver for synthesizing speech information according to the present invention is presented in Figure 4.

    [0045] As a general matter, the receiver of Figure 4 performs the inverse of the coding process discussed above. Successive frames of coded speech information transmitted by channel interface 55 are received by receiver channel interface 58. Interface 58 unpacks the bits of a received coded frame format and provides appropriate information and signals to other elements of the receiver.

    [0046] Assume that a frame of coded speech information has been received by channel interface 58 and that this frame represents frame F presented in Figure 2. Responsive to receipt of this coded frame, channel interface 58 extracts linear prediction coefficients from the received frame. Recall that these coefficients are valid at the latest frame boundary (that is, the frame boundary which lies at the end of frame F). These coefficients are used, together with the set of previously received and stored linear prediction coefficients valid at the previous frame boundary (the frame boundary which lies at the end of frame F - 1), to provide a set of coefficients valid at the center of each subframe of speech within frame F. These sets of coefficients are provided with conventional linear prediction coefficient interpolation well known in the art. Naturally, the set of linear prediction coefficients received by interface 58 will be buffered for use in a subsequent interpolation process. This subsequent interpolation process will be performed in response to the receipt of the next frame of coded speech information, frame F + 1. The process of buffering and interpolation is repeated for each frame of coded speech received by interface 58.

    [0047] After interpolating linear prediction coefficients, the receiver proceeds to synthesize the subframes of coded speech. Interface 58 extracts from the received frame the subframe selection bit ξ associated with the first pair of coded subframes, a and b, of frame F. The interface 58 examines ξ to determine whether the synthesis of the first subframe of speech information (i.e., subframe a of frame F) requires application of the FSCB 70. If so, interface 58 provides a logically true subframe selection control signal, γ, to switches 60 and 80 of the receiver. Signal γ asserted as true causes the switches 60, 80 to be in a closed state effectively coupling the FSCB 70 into the synthesis process for subframe a. If no application of FSCB 70 is required for subframe a, interface 58 provides a logically false γ to switches 60 and 80, causing switches 60 and 80 to open, effectively decoupling the FSCB 70 from the synthesis process.

    [0048] After determining the appropriate subframe selection control signal γ, interface 58 may extract and output to switch 60 the fixed codebook index, IFC, associated with the subframe of the first subframe pair which has been coded with use of the FSCB system 40, 45. Also, interface 58 may extract and provide to multiplier circuit 75 the FSCB gain, µ(i), for that subframe.

    [0049] Assuming that subframe a is the subframe of the first pair coded with both coders, signal γ will be true and switches 60 and 80 will be closed. Index, IFC, and gain, µ(i), provided will be used by FSCB 70 and multiplier 75, respectively, to provide a synthesized excitation signal, e(i), in the conventional fashion. This excitation signal, e(i), is the contribution of the FSCB system 70, 75 to a synthesized speech information signal for subframe a. The excitation signal e(i) is provided to summing circuit 100 for addition to the adaptive codebook contribution to the synthesized speech information signal for that subframe.

    [0050] This adaptive codebook contribution is provided based on the extracted adaptive codebook delay and gain information, d(i) and λ(i), respectively, associated with subframe a of coded speech. The adaptive codebook contribution is determined in the conventional fashion, with the delay, d(i), serving to identify a previously synthesized frame of speech information, and the gain λ(i) acting as a multiplicative factor.

    [0051] Synthesis of speech for subframe a is completed by an inverse LPF 110 based on linear prediction coefficients provided by interface 58. These coefficients are valid at the center of subframe a.

    [0052] Since subframe a of the first pair of subframes was coded with use of both coders, it follows that subframe b was coded without the FSCB system 40, 45. Therefore, to proceed with the synthesis of speech for subframe b, interface 58 must apply a logically false subframe selection control signal γ to switches 60 and 80. By doing this, interface 58 causes FSCB system 70, 75 to play no part in the synthesis of speech for this subframe. Speech associated with subframe b is therefore synthesized with use of the adaptive codebook 90 and gain multiplication circuit 95, along with the inverse LPF 110. As a result of switch 80 being open, excitation signal e(i) is zero valued.

    [0053] Consecutive pairs of coded subframes of speech are handled in the same manner as subframes a and b. Of course, other subframe pairs may have been coded differently (that is, with the first of the two subframes coded without the FSCB system 40, 45). In such a circumstance, the procedures discussed above for subframes a and b would be reversed.
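
    Illustratively, the receiver-side synthesis of a single subframe described above may be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function and parameter names (adaptive_vector, synthesize_subframe, gamma_switch, lpc_memory) are hypothetical, the handling of delays shorter than the subframe length is one common convention, and the sign convention of the all-pole inverse LPF is assumed.

```python
def adaptive_vector(history, delay, n):
    """Return n samples that repeat the last `delay` samples of `history`.

    `history` holds previously synthesized speech information samples
    (the adaptive codebook contents); `delay` corresponds to d(i).
    """
    if not 1 <= delay <= len(history):
        raise ValueError("delay must lie within the stored history")
    v = []
    for i in range(n):
        if i < delay:
            v.append(history[len(history) - delay + i])
        else:
            # Assumed convention: for short delays the vector repeats itself.
            v.append(v[i - delay])
    return v


def synthesize_subframe(history, delay, lam, n, a, lpc_memory,
                        gamma_switch=False, fixed_vector=None, mu=0.0):
    """Synthesize one subframe of n samples.

    lam and mu stand for the gains lambda(i) and mu(i); gamma_switch stands
    for the subframe selection control signal that closes switches 60 and 80.
    """
    adaptive = [lam * v for v in adaptive_vector(history, delay, n)]
    if gamma_switch:
        # Switches closed: the FSCB system contributes e(i) = mu * codebook vector.
        e = [mu * c for c in fixed_vector]
    else:
        # Switches open: e(i) is zero valued.
        e = [0.0] * n
    excitation = [p + q for p, q in zip(adaptive, e)]

    # Inverse LPF, assumed form: y(i) = u(i) + sum_r a[r] * y(i - r - 1).
    mem = list(lpc_memory)  # most recent output sample first
    speech = []
    for u in excitation:
        y = u + sum(ar * past for ar, past in zip(a, mem))
        speech.append(y)
        mem = ([y] + mem)[:len(a)]
    return excitation, speech, mem
```

    In this sketch, subframe a of a pair would be synthesized with gamma_switch set true and subframe b with gamma_switch set false; the returned excitation would be appended to history before the next subframe is processed.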

    C. A Second Illustrative Embodiment



    [0054] A second illustrative embodiment of the present invention is presented in Figure 5. Like the first embodiment described above, this embodiment may employ the channel format presented in Figure 3 and may communicate with the receiver presented in Figure 4. Unlike the first embodiment, however, this embodiment does not decide prior to the coding process which subframe of a subframe pair will be coded with use of one coder and which will be coded with use of both coders. Rather, for a given pair of subframes, this illustrative embodiment provides coded alternatives: (i) a first alternative where the first subframe of a pair is coded with both coders, but the second is coded without the second coder; and (ii) a second alternative where the first subframe is coded without the second coder, and the second subframe is coded with both coders. The second embodiment then chooses the alternative which results in lower coding error. The parameters (i.e., the coded representation) of the chosen alternative are then provided to a channel interface for communication to a receiver.

    [0055] As shown in Figure 5, a linear predictive filter 20 and a linear predictive analyzer 10 receive a sampled speech information signal, s(i). Analyzer 10 and filter 20 are the same devices described above with reference to the first illustrative embodiment. As with the first embodiment, LPA 10 computes linear prediction coefficients valid at frame boundaries, based on signal s(i). Values for a_r valid at the center of subframes within the boundaries are determined by conventional interpolation of frame boundary coefficients by LPA 10. The coefficients, a_r, valid at subframe centers are output to LPF 20, LPF⁻¹s 120 (LPF⁻¹s 120 will be discussed below in connection with the choice of coded alternatives), LTP 30, and FSCB search 40. The coefficients valid at frame boundaries are additionally output to selector 130. Subframes of speech information signal x(i) are formed in the conventional manner by LPF 20, as described above for the first illustrative embodiment.

    [0056] Like the first embodiment, the second embodiment operates on pairs of subframes. In this case, each pair of subframes of x(i) is provided by LPF 20, in parallel, to two coding subsystems 115, 116.

    [0057] Each coding subsystem 115, 116 operates to code the subframes of a subframe pair in a similar manner. As shown in Figure 6, the subsystems 115, 116 comprise the same coders (an adaptive codebook LTP 30, 32 and a FSCB system 40, 45). The difference between these subsystems 115, 116 concerns the way their coders are applied to the subframes of a given subframe pair. Subsystem 115 codes the first subframe of a subframe pair with use of both coders, and the second subframe without the second coder; subsystem 116 codes the first subframe of the same pair without the second coder, and the second subframe with both coders. Control of subframe coding by the second coder for subsystems 115, 116 is effected by FSCB control 37, 38, respectively, which sets ε such that the appropriate subframe within a pair is always coded for the subsystem 115, 116.

    [0058] Thus, subsystems 115, 116 provide alternative coded representations of a given subframe pair from which one must be chosen. These alternative representations are provided by coding subsystems 115, 116 to selector 130 as LTP delay and gain information, d(i) and λ(i), respectively; and FSCB system index and gain information, IFC and µ(i), respectively. The choice between two coded representations of a subframe pair is based on the amount of coding error introduced by each representation. The amount of coding error introduced by each representation is evaluated by selector 130, in combination with LPF⁻¹s 120 and subtraction circuits 125.

    [0059] Referring again to Figure 5, each coding subsystem 115, 116 provides an estimated speech information signal, x̂(i), which is equal to the speech information signal which would be synthesized by a receiver if it were to receive that subsystem's coded representation of the original speech information signal x(i). The estimated speech information signal x̂(i) from each subsystem 115, 116 may therefore be compared to original speech information signal x(i) to determine a measure of error introduced by the coded representation.

    [0060] A measure of coding error is provided by forming a difference, δ, between a perceptually weighted original speech information signal, x(i), and a perceptually weighted estimated speech information signal x̂(i) from each coding subsystem, for a pair of subframes. Perceptual weighting is provided by LPF⁻¹s 120 which operate according to the following expression:


    where linear prediction coefficients a_r are valid at the center of the subframe in question, R is the number of coefficients, and γ is a perceptual weighting factor (illustratively set to 0.8). Difference signals, δ(i), are formed by subtraction circuits 125 and represent coding error over a pair of subframes.

    [0061] The difference signals, δ(i), are provided to selector 130 for comparison. The selector squares the samples of each difference signal and sums them over the pair of subframes, Σ δ(i)², to determine error signal energy. These error signal energies are compared to determine which is smaller. The coding subsystem responsible for introducing the smaller error, as represented by the smaller error signal energy, is the one chosen to provide the coded representation of the pair of subframes.
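
    Illustratively, the comparison performed by selector 130 may be sketched as follows. This is a sketch under stated assumptions rather than the circuit of Figure 5: the weighting filter is assumed to be an all-pole filter whose r-th coefficient is a_r·γ^r with an assumed sign convention, a single coefficient set is used for the whole subframe pair, filter memories start at zero, and ties are resolved in favor of subsystem 115. The names weight, error_energy and select_subsystem are hypothetical.

```python
def weight(signal, a, gamma=0.8):
    """Perceptually weight `signal` with an assumed all-pole filter.

    Assumed form: y(i) = u(i) + sum_{r=1..R} a_r * gamma**r * y(i - r).
    """
    w = [ar * gamma ** (r + 1) for r, ar in enumerate(a)]
    mem = [0.0] * len(w)  # most recent output sample first
    out = []
    for u in signal:
        y = u + sum(c * past for c, past in zip(w, mem))
        out.append(y)
        mem = ([y] + mem)[:len(w)]
    return out


def error_energy(x, x_hat, a, gamma=0.8):
    """Energy of the difference between weighted original and weighted estimate."""
    delta = [p - q for p, q in zip(weight(x, a, gamma), weight(x_hat, a, gamma))]
    return sum(d * d for d in delta)


def select_subsystem(x, x_hat_115, x_hat_116, a, gamma=0.8):
    """Return (xi, e115, e116); xi is 0 if subsystem 115 is chosen, else 1."""
    e115 = error_energy(x, x_hat_115, a, gamma)
    e116 = error_energy(x, x_hat_116, a, gamma)
    xi = 0 if e115 <= e116 else 1  # tie-break toward 115 is an assumption
    return xi, e115, e116
```

    Here x, x_hat_115 and x_hat_116 would each span one pair of subframes, and the returned value xi corresponds to the subframe select bit passed on with the chosen representation.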

    [0062] As discussed above, both coding subsystems 115, 116 provide their coded representations of a subframe pair to selector 130. Once selector 130 has determined which subsystem 115, 116 will introduce the smaller error by its coded representation, it provides that representation to a channel interface 55. Channel interface 55 is the same as that discussed above with reference to the first illustrative embodiment. Interface 55 packs bits in a format for transmission to a receiver in the fashion discussed above with reference to Figure 3.

    [0063] In addition to the coded representation of a subframe pair, selector 130 provides linear prediction coefficients and a subframe select bit, ξ, to the interface 55. These linear prediction coefficients are the same as those discussed above with reference to the first embodiment. They are valid at the end of the frame containing the coded subframe pair in question. The subframe select bit, ξ, is defined as discussed above with reference to the first illustrative embodiment. Values for the bit are determined based on the particular coding subsystem 115, 116 chosen by selector 130. When subsystem 115 has been chosen to provide the coded representation for the pair of subframes (i.e., when the first subframe of a pair has been coded with both coders of subsystem 115), ξ is set equal to 0. When subsystem 116 has been chosen to provide the coded representation of the pair of subframes (i.e., when the second subframe of a pair has been coded with both coders of subsystem 116), ξ is set equal to 1.

    [0064] After choosing a coded representation for a pair of subframes of the speech information signal, x(i), and prior to the coding of the next pair of subframes in a frame of speech information, selector 130 updates the contents of certain memories of the embodiment. It does this by providing an update signal, ν, to the adaptive codebooks 32, LTPs 30, and FSCB searches 40 of subsystems 115, 116. Signal ν is also provided to those LPF⁻¹s 120 which provide perceptual weighting to the estimated speech information signals, x̂(i), output by the subsystems 115, 116. The update signal, ν, causes the contents of the adaptive codebook 32, m₁, associated with the subsystem which provided the chosen representation to overwrite the contents of the adaptive codebook 32 of the other subsystem 116, 115. Furthermore, it causes the signal memories of the LTP 30, FSCB search 40, and LPF⁻¹ 120 (m₂, m₃, m₄, respectively) which are associated with the chosen representation to overwrite the signal memories of the other LTP 30, FSCB search 40, and LPF⁻¹ 120. (Linear filters operate by summing weighted past values of either or both input and output signals; it is the memory holding these past values, the signal memory, which is overwritten by this process. Conventional LTP 30 and FSCB search 40 of subsystems 115, 116 also contain inverse LPF filters which are used to assess codebook vector errors; see U.S. Patent Application Serial No. 07/782,686, incorporated by reference above.) Illustratively, ν takes on the same values as the subframe select bit, ξ. As such, responsive to receiving ν (0 or 1), the memories of the system have the information needed (m₁, m₂, m₃, m₄) to effect the correct memory update. After completion of this update process, the coding of the next pair of subframes in a frame of a speech information signal may occur.
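
    Illustratively, the memory update effected by signal ν may be sketched as follows. The sketch assumes that each memory can be modeled as a list of samples and that ν equal to 0 denotes that subsystem 115 provided the chosen representation; the names SubsystemMemory and apply_update are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubsystemMemory:
    m1: List[float]  # contents of adaptive codebook 32
    m2: List[float]  # signal memory of LTP 30
    m3: List[float]  # signal memory of FSCB search 40
    m4: List[float]  # signal memory of the weighting LPF^-1 120


def apply_update(nu, mem_115, mem_116):
    """Overwrite the memories of the subsystem that was not chosen.

    nu == 0: subsystem 115 was chosen; nu == 1: subsystem 116 was chosen.
    """
    chosen, other = (mem_115, mem_116) if nu == 0 else (mem_116, mem_115)
    # Copy rather than alias, so the two subsystems can diverge again
    # while coding the next pair of subframes.
    other.m1 = list(chosen.m1)
    other.m2 = list(chosen.m2)
    other.m3 = list(chosen.m3)
    other.m4 = list(chosen.m4)
```

    After the call both subsystems start the next pair of subframes from identical memory contents, which matches the state a receiver would hold after decoding the chosen representation.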

    [0065] The teachings of the present invention may be applied to still further illustrative embodiments. For example, an embodiment may be provided which comprises a first and a second speech coder and which codes a speech information signal segment using either or both of the speech coders. If there are N signal segments for coding by this embodiment, then the first coder is applied in the coding of L such segments, and the second coder is applied in the coding of M such segments, where L + M ≧ N + 1. In this embodiment, each of the N segments is coded with use of at least one of the two coders.
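
    Illustratively, the constraint on such an embodiment may be checked as sketched below, taking the condition to be L + M ≧ N + 1 (that is, at least one segment is coded with both coders) together with the requirement that every segment is coded with at least one coder. The function name check_coder_assignment and the representation of the assignment as two lists of booleans are assumptions made for illustration.

```python
def check_coder_assignment(uses_first, uses_second):
    """Return True if the per-segment coder assignment satisfies the constraint.

    uses_first[k] / uses_second[k] state whether segment k is coded with
    the first / second speech coder.
    """
    if len(uses_first) != len(uses_second):
        raise ValueError("both lists must describe the same N segments")
    n = len(uses_first)
    l = sum(1 for f in uses_first if f)   # L: segments coded with the first coder
    m = sum(1 for s in uses_second if s)  # M: segments coded with the second coder
    every_segment_coded = all(f or s for f, s in zip(uses_first, uses_second))
    return every_segment_coded and (l + m >= n + 1)
```

    For example, an assignment in which all four of N = 4 segments use the first coder and two of them also use the second coder gives L = 4 and M = 2 and satisfies the check, whereas an assignment in which each segment uses exactly one coder gives L + M = N and does not.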


    Claims

    1. A method of coding a first speech information signal, the first speech information signal comprising a plurality of N signal segments, the method comprising the steps of: coding the N signal segments with a first speech coder to provide a first coded representation for each of the N signal segments; for each of one or more of the N signal segments, forming a second speech information signal reflecting speech information not coded by the first speech coder, and coding M second speech information signals with a second speech coder, where 1 ≦ M ≦ N - 1, responsive to a coding criterion.
     
    2. The method of claim 1 wherein the second speech information signal comprises a residual speech information signal reflecting a difference between a signal segment and its quantized representation provided by the first speech coder.
     
    3. The method of claim 1 wherein the step of coding M second speech information signals comprises the step of selecting one or more of the M second speech information signals for additional coding responsive to the coding criterion.
     
    4. The method of claim 3 wherein the step of selecting one or more of the M second speech information signals comprises the step of evaluating a characterizing parameter for each of the N signal segments of the first speech information signal.
     
    5. The method of claim 4 wherein a second speech information signal is selected for additional coding responsive to a comparison of a characterizing parameter of its corresponding signal segment with the coding criterion.
     
    6. The method of claim 5 wherein the characterizing parameter comprises signal energy.
     
    7. The method of claim 1 further comprising the step of forming a synthesized speech information signal for each signal segment for use by the first speech coder in coding subsequent signal segments.
     
    8. The method of claim 1 wherein the step of coding N signal segments with a first speech coder comprises: generating a plurality of modified signal segments based on a signal segment to be coded; coding a modified signal segment to produce a coded representation thereof; synthesizing an estimate of the modified signal segment based on its coded representation; determining an error between the signal segment to be coded and the synthesized estimate of the modified signal segment; and selecting as the first coded representation of the signal segment to be coded a coded representation of the modified signal segment based on an error evaluation process.
     
    9. The method of claim 1 wherein a set of signal segments is coded a plurality of times with use of the first and second speech coders to form a plurality of modified coded representations of the set, and wherein a modified coded representation is selected as a coded representation of the set responsive to the coding criterion.
     
    10. A method of coding segments of a speech information signal, the method comprising the steps of: forming a plurality of trial coded representations of a set of signal segments, each trial coded representation formed by coding each signal segment of the set with use of a first speech coder; and coding fewer than all the signal segments of the set with use of a second speech coder; and selecting as a coded representation of the set of signal segments a trial coded representation based on a coding criterion.
     
    11. The method of claim 10 wherein the step of selecting comprises the step of determining a characterizing parameter for a trial coded representation.
     
    12. The method of claim 11 wherein the step of selecting further comprises the step of comparing the characterizing parameters of the trial coded representations, and selecting a trial coded representation based on the coding criterion.
     
    13. The method of claim 10 wherein the step of coding each signal segment with use of the first speech coder comprises the steps of: generating a plurality of modified signal segments based on a signal segment to be coded; coding a modified signal segment to produce a coded representation thereof; synthesizing an estimate of the modified signal segment based on its coded representation; determining an error between the signal segment to be coded and the synthesized estimate of the modified signal segment; and selecting as a coded representation of the signal segment to be coded a coded representation of the modified signal segment having an associated error based on an error evaluation process.
     
    14. An apparatus for coding a first speech information signal, the first speech information signal comprising a plurality of N signal segments, the apparatus comprising: a first speech coder for coding the N signal segments to provide a first coded representation for each of the N signal segments; means for forming a second speech information signal for each of one or more of the N signal segments, the second speech information signal reflecting speech information not coded by the first speech coder; and a second speech coder for coding M second speech information signals, where 1 ≦ M ≦ N - 1, responsive to a coding criterion.
     
    15. The apparatus of claim 14 wherein the second speech information signal comprises a residual speech information signal reflecting a difference between a signal segment and its quantized representation provided by the first speech coder.
     
    16. The apparatus of claim 14 further comprising an analyzer for selecting one or more of the M second speech information signals for additional coding responsive to the coding criterion.
     
    17. The apparatus of claim 14 wherein the first speech information signal is provided by a linear prediction filter.
     
    18. The apparatus of claim 14 wherein the first speech coder comprises an adaptive codebook vector quantizer.
     
    19. The apparatus of claim 18 wherein the first speech coder further comprises a linear prediction filter.
     
    20. The apparatus of claim 14 wherein the second speech coder comprises a fixed codebook.
     
    21. An apparatus for coding segments of a speech information signal, the apparatus comprising: means for forming a plurality of trial coded representations of a set of signal segments, said means for forming comprising: a first speech coder for use in coding each signal segment of the set; and a second speech coder for use in coding fewer than all the signal segments of the set; and means for selecting as a coded representation of the set of signal segments a trial coded representation based on a coding criterion.
     
    22. A method of coding a speech information signal with use of at least two speech coders, the speech information signal comprising a plurality of N signal segments, the method comprising the steps of: coding L of the N signal segments with use of a first speech coder; and coding M of the N signal segments with use of a second speech coder;
    such that L + M ≧ N + 1 and each of the N signal segments is coded with at least one of the two speech coders.
     



