Speech synthesis apparatus

(19)

(11)	EP 0 139 419 A1

(12)	EUROPEAN PATENT APPLICATION

(43)	Date of publication:
	02.05.1985 Bulletin 1985/18

(21)	Application number: 84305918.9

(22)	Date of filing: 30.08.1984

(51)	International Patent Classification (IPC)⁴: G10L 5/04

(84)	Designated Contracting States:
	DE FR GB

(30)	Priority: 31.08.1983 JP 157719/83

(71)	Applicant: KABUSHIKI KAISHA TOSHIBA
	Kawasaki-shi, Kanagawa-ken 210 (JP)

(72)	Inventors:
	Watanabe, Sadakazu
	Kawasaki-shi Kanagawa-ken (JP)
	Nomura, Norimasa
	Yokohama-shi Kanagawa-ken (JP)

(74)	Representative: Kirk, Geoffrey Thomas et al
	BATCHELLOR, KIRK & CO. 2 Pear Tree Court Farringdon Road London EC1R 0DS (GB)

(56)	References cited:

(54)	Speech synthesis apparatus

(57) The invention provides a speech synthesising apparatus in which a character string translator 1 converts each word in an encoded input character string into a phoneme string corresponding to the characters in the word, using a word memory 2. Next, a variable phoneme detector 3 detects those phonemes in the string the values of whose prosodic parameters may be modified due to the existence of an influencing phoneme at another location in the phoneme string, by comparing each phoneme with phonemes stored in variable phoneme memory 4. If a variable phoneme is detected in the phoneme string, a search is made by an influencing phoneme detector 5 for influencing phonemes at the location in the phoneme string indicated by the variable phoneme detector. The variable phoneme memory stores, along with each variable phoneme, the predetermined relative location at which an influencing phoneme may be found. If an influencing phoneme is detected at the appropriate location relative to the variable phoneme, the influencing phoneme detector outputs data representative of a modification in the value of a selected parameter (duration, pitch or power) of the variable phoneme. The phoneme string is then delivered to a parameter value determining unit 7, where standard and modified data are combined. Finally, the phoneme string, parameter values, and modification data are supplied to a parametric synthesizer 8, which assembles them into synthetic speech.

Description

BACKGROUND OF THE INVENTION

[0001] One of the known methods for transforming character strings into synthetic speech is "synthesis by rule." In this method, a character string is first transformed into a sequence of phonemes. Next, the prosodic parameters (the duration, pitch and power) of each phoneme are determined, and speech segments having those parameters are selected from a library of the spectral envelopes of such speech segments. Finally, the phoneme sequence and the parameters are provided to a well-known speech synthesizer which assembles the segments into synthetic speech, adjusting the parameters and connecting them into more-or-less natural speech. Although it is possible to synthesize speech using standard values of each parameter for each phoneme, it is also possible to vary the duration of those phonemes which consist of a consonant-vowel (CV) combination. When one of these variable phonemes is encountered in a phoneme string, its duration may be modified (changed from the standard value) by considering the influence of the phoneme immediately before or after the variable phoneme in the phoneme string. However, even when the duration of CV phonemes is modified in accordance with the influence of influencing phonemes immediately adjacent the variable phoneme, the synthetic speech produced is rather unnatural and unclear.

SUMMARY OF THE INVENTION

[0002] One object of this invention is to produce synthetic speech of such clarity and high quality that it is very nearly natural speech.

[0003] Another object of the invention is to produce synthetic speech in which the prosodic parameters of variable phonemes are modified as a function of the existence of influencing phonemes at locations in the phoneme string other than the two locations immediately adjacent the variable phoneme (that is, as a function of an influencing phoneme in a non-adjacent mora).

[0004] The invention is based on the careful observation of natural speech by the present inventors. They found, for example, that the duration of a variable phoneme consisting of a double consonant is influenced more by the kind of phoneme which exists in the phoneme string in a location one phoneme removed from the double consonant than by the kind of phoneme which exists immediately adjacent the double consonant in the phoneme string. They observed that, in Japanese speech, a double consonant lasts several milliseconds longer when a "prolonged sound" or a syllabic nasal "N" is at a location in the phoneme string two morae subsequent to the double consonant.

[0005] The invention comprises five main elements. A character string translator accepts encoded character strings from a device, such as a keyboard, which is capable of inputting character strings electronically. Using a word memory, the character string translator converts each word in the input character string into a phoneme string corresponding to the characters in the word. Next, a variable phoneme detector detects those phonemes in the string the values of whose prosodic parameters may be modified due to the existence of an influencing phoneme at a location in the phoneme string which is at least one mora removed from the location of the variable phoneme. If variable phonemes are detected in the phoneme string, a search is made by an influencing phoneme detector for influencing phonemes at the location in the phoneme string indicated by the variable phoneme detector. The variable phoneme detector is associated with a variable phoneme memory which stores, along with each variable phoneme, the predetermined location at which an influencing phoneme will influence the value of each parameter of the variable phoneme. If an influencing phoneme is detected at the appropriate location in relation to the variable phoneme the influencing phoneme detector will output data representative of a modification in the value of a selected parameter (duration, pitch or power) of the variable phoneme. The phoneme string is then delivered to a parameter value determining unit, which stores standard values of the parameters for all phonemes. Standard values may be modified in response to the modification data supplied by the influencing phoneme detector. Finally, the phoneme string, parameter values, and modification data are supplied to a well-known parametric synthesizer, which assembles them into synthetic speech. 
Of course, since this invention is an electronic apparatus, it does not perform operations on "characters" or "phonemes" but rather on electrical codes representing characters and phonemes. This fact will be silently recognized throughout the specification.

BRIEF DESCRIPTION OF THE DRAWINGS

[0006]

Figure 1 is a block diagram of one embodiment of the improved speech synthesis apparatus.

Figure 2 is a schematic diagram showing in more detail the character string translator of Figure 1.

Figure 3 is a schematic diagram showing in more detail the variable phoneme detector of Figure 1.

Figure 4 is a schematic diagram showing in more detail the influencing phoneme detector of Figure 1.

Figure 5 is a schematic diagram showing in more detail the parameter value determining unit of Figure 1.

Figure 6 is a block diagram of another embodiment of the improved speech synthesis apparatus.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

[0007] Fig. 1 is a block diagram of the improved speech synthesis apparatus. An encoded character string is input from a device (not shown) such as a keyboard having character keys, a memory device which stores character strings (such as is used in a word processor), or a communication device receiving character strings through communication lines.

[0008] Character string translator 1 translates the character codes making up the character string into phoneme codes representing a phoneme string, using word memory 2. The phoneme string is supplied to variable phoneme detector 3, which detects variable phonemes. In a Japanese phoneme string, these variable phonemes include the double consonant (hereinafter indicated by "Q"), syllabic nasal "N", and prolonged sound (hereinafter indicated by "L").

[0009] Such variable phonemes are stored in variable phoneme memory 4 which is used by detector 3 in the detection of variable phonemes. As is described later, if a variable phoneme is detected, it is associated with attribution data from which parameter value modifications may be determined. The output of detector 3 includes information regarding the predetermined location at which an influencing phoneme must be found in order to influence the value of a parameter (the duration, in the preferred embodiment) of the detected variable phoneme.

[0010] Influencing phoneme detector 5 determines whether there exists an influencing phoneme, such as a double consonant, syllabic nasal "N" or prolonged sound, at the location indicated by detector 3.

[0011] Parameter value determining unit 7 stores the normal values of parameters such as the duration of the phoneme, its pitch or its power. If both the variable phoneme and a corresponding influencing phoneme are detected, unit 7 outputs a modified value of the given parameter to parametric synthesizer 8, which may be any well-known synthesizer, for example a formant-type, PARCOR-type, or cepstrum-type synthesizer.

[0012] Fig. 2 shows in more detail the character string translator 1 of Fig. 1. A character string (which normally constitutes one word) enters input register 101 under control of input control circuit 102. Read control circuit 103 supplies an initial address to word memory 2, and a word is read out into register 104. Word memory 2 stores a plurality of words (in segment 104A) together with the corresponding phoneme strings into which the words are translated (in segment 104B). The word from segment 104A is supplied to comparator 105 and compared with the content of input register 101. If the input character string in input register 101 is not the same as the character string from segment 104A, comparator 105 produces a signal on line 106. Responding to the signal on line 106, read control circuit 103 increments its internal counter (not shown) and provides the incremented address to word memory 2, so that the next word in memory 2 (together with the corresponding phoneme string) is read out into register 104. These operations are repeated until comparator 105 detects identity between the input character string and the character string retrieved from memory 2.

[0013] When identity is detected between the input character string in input register 101 and the character string in segment 104A of register 104, the comparator produces a signal on line 107, causing the phoneme string in segment 104B to be written into phoneme string memory 108. At the same time, input control circuit 102 causes the next character string to enter input register 101, and read control circuit 103 supplies the initial address to word memory 2 in preparation for the next translation operation.
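The sequential search of Fig. 2 can be sketched in software as follows. This is only an illustrative model, not the circuit itself: the function name `translate` and the list `WORD_MEMORY` are hypothetical, the four entries are the example words discussed later in this specification, and returning `None` for an unknown word is an assumption, since the text does not say what happens when no stored word matches.

```python
# Hypothetical word memory contents: (word, phoneme string) pairs.
# Each pair stands in for one address holding segments 104A and 104B.
WORD_MEMORY = [
    ("kessen", ["KE", "Q", "SE", "N"]),
    ("kesseki", ["KE", "Q", "SE", "KI"]),
    ("ikkan", ["I", "Q", "KA", "N"]),
    ("ikkatsu", ["I", "Q", "KA", "TSU"]),
]


def translate(word, word_memory=WORD_MEMORY):
    """Scan the word memory from the initial address until a stored word matches."""
    address = 0  # initial address supplied by the read control circuit
    while address < len(word_memory):
        stored_word, phonemes = word_memory[address]
        if stored_word == word:
            # Identity detected: copy the phoneme string out.
            return list(phonemes)
        # No identity: increment the address and read the next word.
        address += 1
    # Assumption: unknown words yield None (not specified in the text).
    return None


print(translate("kessen"))
```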

[0014] Fig. 3 shows in more detail the variable phoneme detector 3 of Fig. 1. Input control circuit 301 accesses phoneme string memory 108 and successively transfers phonemes into buffer 302. Most Japanese speech consists of four kinds of morae, as follows (a "mora" is a basic unit of time in speech; it may contain one or more phonemes):

(a) a consonant-vowel combination ("CV")

(b) an independent vowel, or a prolonged sound (symbolized as "L")

(c) a double consonant (symbolized as "Q")

(d) a syllabic nasal "N"

[0015] Read control circuit 303 supplies an initial address to variable phoneme memory 4, which stores variable phonemes in association with the predetermined location in the phoneme string at which an influencing phoneme will influence the value of the duration of each variable phoneme. Variable phoneme data from memory 4 is written into register 304, which consists of a phoneme segment 304A and a relative location segment 304B. The relative location data in relative location segment 304B consists of at least one location in the phoneme string, relative to the variable phoneme, where an influencing phoneme may be located to influence the duration of the variable phoneme. For example, if the variable phoneme is a double consonant, "Q", its relative location data may be "+2", which means that the duration of the double consonant is influenced by an influencing phoneme located two morae later in the phoneme string. Comparator 305 compares the phoneme in phoneme buffer 302 with the phoneme in segment 304A of register 304. If they are not identical, comparator 305 produces a signal on line 306. Responding to the signal on line 306, read control circuit 303 increments its internal counter (not shown) and provides the incremented address to variable phoneme memory 4, so that the next variable phoneme in memory 4 is read out into register 304. These operations are repeated until comparator 305 detects identity between the phoneme in buffer 302 and the phoneme retrieved from memory 4 as stored in segment 304A, or until all variable phonemes in memory 4 have been retrieved. When identity is detected between the phoneme in buffer 302 and the phoneme in segment 304A, comparator 305 produces a signal on line 307, causing the phoneme from buffer 302 and the relative location data in segment 304B (through selector 308) to be written into memory 309. The selector 308 normally selects the relative location data from segment 304B; however, if identity is never detected by comparator 305, read control circuit 303 outputs an END signal on line 310, causing selector 308 to select "0" as the relative location data, indicating that the phoneme is not a variable phoneme.
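The effect of the variable phoneme detector can be modelled as tagging each phoneme with a relative location, with "0" meaning "not variable". The sketch below is a simplified software analogue under stated assumptions: the names `VARIABLE_PHONEME_MEMORY` and `tag_variable_phonemes` are hypothetical, only the "+2" entry for "Q" comes from the text, and a dictionary lookup stands in for the sequential comparison performed by comparator 305.

```python
# Hypothetical contents of the variable phoneme memory:
# phoneme -> relative location (in morae) of a possible influencing phoneme.
# Only the entry "Q": +2 is taken from the description.
VARIABLE_PHONEME_MEMORY = {"Q": 2}


def tag_variable_phonemes(phonemes, memory=VARIABLE_PHONEME_MEMORY):
    """Pair every phoneme with its relative location, or 0 if it is not variable."""
    # memory.get(p, 0) plays the role of the selector choosing "0" on END.
    return [(p, memory.get(p, 0)) for p in phonemes]


print(tag_variable_phonemes(["KE", "Q", "SE", "N"]))
```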

[0016] Fig. 4 shows in more detail the influencing phoneme detector 5 of Fig. 1. Read control circuit 501 controls counter 502, whose content indicates an address in memory 309. The present address in counter 502 is supplied to memory 309 through adder 503, which normally adds "0" to the present address. The data read out from memory 309 are provided to register 505 through selector 504, which normally selects register 505. The data consist of phonemes (in segment 505A) and relative locations (in segment 505B). The relative location information in segment 505B is supplied to a zero detector, which produces a signal on line 507 when the relative location information in segment 505B is "0", or on line 508 when the relative location information is not "0". In the latter case, the signal on line 508 is provided to read control circuit 501 to prevent the incrementing of counter 502, to selector 504 to select register 506, and to read control circuit 510 to supply an initial address to parameter value variation memory 6. That is, when the relative location information is not "0", the relative location is added to the present address in counter 502 by adder 503, and the sum is supplied as an address to memory 309, so that the corresponding data are read from memory 309 and stored in register 506 via selector 504.

[0017] Parameter value variation memory 6 is accessed by read control circuit 510 and supplies to register 512 data consisting of a phoneme (an influencing phoneme) in segment 512A and, in segment 512B, attribution data from which the change in duration of the variable phoneme may be determined. Comparator 513 compares the phoneme in segment 512A with that in segment 506A, producing a signal on line 514 if it detects identity and on line 515 otherwise. Responding to the signal on line 514, the phoneme in segment 505A and the attribution data in segment 512B are written into a checked data memory 520. On the other hand, the signal on line 515 is provided to read control circuit 510 to increment an internal counter (not shown) and output the next address to memory 511. The comparator compares the phoneme in register 506 with successively retrieved phonemes in register 512. If none of the phonemes in memory 511 is the same as the phoneme in register 506, read control circuit 510 outputs an END signal to selector 516. Selector 516 usually selects the attribution data segment of register 512 as input data but instead selects "0" upon receipt of an END signal. Therefore, the phoneme in segment 505A is written into checked data memory 520 with attribution data equal to "0". When the zero detector detects a "0", it outputs a signal on line 507. In this case, the phoneme in segment 505A is not a variable phoneme, so it is written into the checked data memory 520 with attribution data equal to "0".
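The look-ahead performed by the influencing phoneme detector may be illustrated roughly as follows. The names `VARIATION_MEMORY` and `check_influences` are hypothetical, the attribution codes 1 and 2 are placeholders (the text gives no concrete values), and treating an out-of-range address as "no influence" is an assumption not stated in the description.

```python
# Hypothetical parameter value variation memory:
# influencing phoneme -> attribution data (placeholder codes).
VARIATION_MEMORY = {"N": 1, "L": 2}


def check_influences(tagged, variation_memory=VARIATION_MEMORY):
    """Return (phoneme, attribution) pairs; attribution 0 means no modification."""
    checked = []
    for address, (phoneme, offset) in enumerate(tagged):
        attribution = 0  # default written when the relative location is 0
        if offset != 0:
            # Adder: present address plus relative location.
            target = address + offset
            # Assumption: an address outside the string means no influence.
            if 0 <= target < len(tagged):
                candidate = tagged[target][0]
                # A miss in the variation memory corresponds to the END signal.
                attribution = variation_memory.get(candidate, 0)
        checked.append((phoneme, attribution))
    return checked


kessen = [("KE", 0), ("Q", 2), ("SE", 0), ("N", 0)]
print(check_influences(kessen))
```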

[0018] Fig. 5 shows in more detail the parameter value determining unit 7 of Fig. 1. Data consisting of phonemes in combination with attribution data are successively read from checked data memory 520 into register 701. Parameter value memory 702 stores standard values of the parameters for every phoneme. (It is also possible, instead of storing all the values, to use a parameter calculator according to a phoneme code.) A phoneme in segment 701A of register 701 is supplied to parameter value memory 702 as an address; memory 702 then outputs the corresponding parameter value to adder 703. Modifying data memory 704 stores parameter value modifications and outputs them when addressed by attribution data from segment 701B. The standard parameter values from memory 702 and the modification data from memory 704 are added by adder 703 and the sums supplied along with the phoneme string to parametric synthesizer 8 of Fig. 1 for assembly into synthetic speech.
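The addition performed by adder 703 can be sketched as two table lookups followed by a sum. In this sketch the 170 ms standard duration of the double consonant and the 50 ms increment are the example figures given in this specification; everything else (the names, the 100 ms placeholder for other phonemes, and the numbering of the attribution codes) is invented for illustration.

```python
# Standard durations in milliseconds, addressed by phoneme (memory 702).
# 170 ms for "Q" is the example value from the description.
STANDARD_DURATION_MS = {"Q": 170}
# Invented placeholder for every phoneme without an entry above.
PLACEHOLDER_DURATION_MS = 100
# Modifications addressed by attribution data (memory 704).
# Code 1 -> +50 ms; the code numbering itself is hypothetical.
MODIFICATION_MS = {0: 0, 1: 50}


def determine_durations(checked):
    """Add the modification selected by the attribution data to the standard value."""
    result = []
    for phoneme, attribution in checked:
        standard = STANDARD_DURATION_MS.get(phoneme, PLACEHOLDER_DURATION_MS)
        result.append((phoneme, standard + MODIFICATION_MS[attribution]))
    return result


print(determine_durations([("KE", 0), ("Q", 1), ("SE", 0), ("N", 0)]))
```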

[0019] A pitch modifying circuit 710 and a power modifying circuit 720 may be provided for modifying pitch or power data in a similar manner.

[0020] Table 1 shows how the duration of a phoneme is modified. In example A, when the character string "kessen" (a Japanese word meaning "decisive battle") is entered, character string translator 1 translates it into the phoneme string KE/Q/SE/N, where "/" indicates divisions between morae. Variable phoneme detector 3 detects a variable phoneme "Q" in the second location; consequently, influencing phoneme detector 5 searches for an influencing phoneme. In this example, there is an influencing phoneme "N" in the fourth location in the phoneme string, so parameter value determining unit 7 adds a modifying value t_a to the standard double consonant duration t_m, which means that the duration of phoneme "Q" is given by t_Q = t_m + t_a.

[0021] If character string "kesseki" (meaning "absent") is entered (example B), the translated phoneme string is KE/Q/SE/KI. Variable phoneme detector 3 again detects a variable phoneme "Q" in the second location, but influencing phoneme detector 5 does not detect an influencing phoneme. Therefore, parameter value determining unit 7 determines the duration of the phoneme "Q" as t_Q = t_m, the standard double consonant duration.

[0022] In example C, "ikkan" (meaning "consistently") is translated into the phoneme string I/Q/KA/N, which includes variable phoneme "Q" in the second location and influencing phoneme "N" in the fourth location. The duration t_Q of the second phoneme is therefore given by t_Q = t_m + t_a. The character string of example D, "ikkatsu" (meaning "together"), is translated into the phoneme string I/Q/KA/TSU, which includes variable phoneme "Q" in the second location but no influencing phoneme, so that the duration of the second phoneme is t_Q = t_m.

[0023] The standard duration t_m of a double consonant in Japanese is 170 ms, and the additional duration t_a is, for example, 50 ms. According to this embodiment, clear and natural-sounding speech can be synthesized by considering the influence of a non-adjacent phoneme on the duration of a variable phoneme.
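With the example values t_m = 170 ms and t_a = 50 ms, the two cases described for the double consonant work out as follows. This is simply the arithmetic implied by the stated figures, not additional measured data.

```latex
% Influencing phoneme found two morae later (examples A and C):
t_Q = t_m + t_a = 170\,\mathrm{ms} + 50\,\mathrm{ms} = 220\,\mathrm{ms}

% No influencing phoneme found (examples B and D):
t_Q = t_m = 170\,\mathrm{ms}
```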

[0024] Fig. 6 is a block diagram of another embodiment of this invention. Independent vowel detector 11, neighboring vowel detector 12, and prolonged sound transforming unit 13 are inserted between character string translator 1 and variable phoneme detector 3 of Fig. 1. The remaining elements of Fig. 6 are the same as in Fig. 1, so their descriptions are omitted here.

[0025] Most Japanese words consist of four morae; and of these, words having a syllabic nasal "N" in the fourth location appear most frequently. The present inventors have observed that the duration of the syllabic nasal "N" in the fourth location is related to the existence of a syllabic nasal "N" or prolonged sound (L) in the second location.

[0026] Independent vowel detector 11, including a comparator (not shown), detects whether the phoneme string includes independent vowels O, U, or I. If the phoneme string includes such vowels, neighboring vowel detector 12, also including a comparator, detects the identity of the phoneme immediately preceding the detected vowel. Then prolonged sound transforming unit 13, including a code converter, transforms the detected independent vowel into the prolonged sound L if and only if the combination of the detected independent vowel and the immediately preceding phoneme falls into one of the following categories:

(1) The detected independent vowel is either U or O and the preceding phoneme is either U or O.

(2) The detected independent vowel is I and the preceding phoneme is either I or E.

Table 2 shows examples of these cases.
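The two categories can be expressed as a small rule over a mora sequence. The sketch below is an illustration only: the function name `to_prolonged` is hypothetical, the inputs in the usage lines are chosen for demonstration and are not taken from Table 2, and it assumes that the "preceding phoneme" is the final vowel of the preceding mora, read here as the last letter of its romanized code.

```python
def to_prolonged(morae):
    """Replace an independent vowel by "L" when it prolongs the preceding vowel."""
    out = []
    for i, mora in enumerate(morae):
        # Assumption: the last letter of the previous mora code is its vowel.
        prev_vowel = morae[i - 1][-1] if i > 0 else ""
        # Category (1): U or O preceded by U or O.
        rule1 = mora in ("U", "O") and prev_vowel in ("U", "O")
        # Category (2): I preceded by I or E.
        rule2 = mora == "I" and prev_vowel in ("I", "E")
        out.append("L" if (rule1 or rule2) else mora)
    return out


print(to_prolonged(["SE", "N", "SE", "I"]))  # final I follows E
print(to_prolonged(["KA", "I"]))             # I follows A: unchanged
```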

[0027]

[0028] The reason for transformation of vowels into prolonged sounds is that the above independent vowels become prolonged sounds of the immediately preceding vowels in natural Japanese speech. Moreover, it is helpful, in determining the parameter values of the syllabic nasal or prolonged sound, to consider the influence of a preceding phoneme in the phoneme string.

[0029] Table 3 shows examples of the determination of phoneme duration. In example H, the character string "dangan" (meaning "bullet") includes the syllabic nasal sound "N" in the fourth location. The duration of this sound is influenced by the syllabic nasal sound "N" in the second location, so the duration of the former is the same as that of the latter (t₄ = t₂). These examples indicate that the duration of the syllabic nasal "N" or the prolonged sound L is the same as the duration of the second preceding mora if the phoneme at that location is either the syllabic nasal "N" or the prolonged sound (L).
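The rule t₄ = t₂ can be sketched as copying a duration from the mora two positions earlier. The function name `tie_durations` is hypothetical, the mora sequence DA/N/GA/N is a reading of "dangan" in the notation used above, and the millisecond values are invented solely to make the copy visible.

```python
def tie_durations(morae, durations_ms):
    """Give an N or L mora the duration of the N or L mora two positions earlier."""
    out = list(durations_ms)
    for i, mora in enumerate(morae):
        if mora in ("N", "L") and i >= 2 and morae[i - 2] in ("N", "L"):
            # The later mora takes the duration of the earlier one.
            out[i] = out[i - 2]
    return out


# Invented durations; only the equality of positions 2 and 4 matters.
print(tie_durations(["DA", "N", "GA", "N"], [100, 90, 100, 120]))
```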

[0030] Although illustrative embodiments of the present invention have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments and that various changes and modifications may be effected therein by one skilled in the art without departing from the scope of the present invention. Specifically, while the preferred embodiment has been illustrated as synthesizing Japanese speech, it is equally applicable to speech of any other language.

Claims

1. An apparatus for creating synthetic speech from a character string, comprising: translating means for translating the character string into a phoneme string, parameter value determining means for storing a value of a selected parameter of each phoneme in the phoneme string, and parametric synthesizer means for assembling the phoneme string into synthetic speech using the stored values, characterised by

variable phoneme detecting means coupled to said translating means for detecting, in the phoneme string, a variable phoneme, the value of the selected parameter of which is a function of the identity of an influencing phoneme in the phoneme string; and

influencing phoneme detecting means coupled to said variable phoneme detecting means for determining the identity of the influencing phoneme, said parameter value determining means being responsive to the result of the determination for modifying the stored value of the selected parameter.

2. An apparatus as claimed in claim 1 wherein the selected parameter is duration.

3. An apparatus as claimed in claim 1 or 2 wherein the predetermined location is at least one mora removed from the variable phoneme in the phoneme string.

4. An apparatus as claimed in claim 1, 2 or 3 wherein the translating means has an associated memory for storing a phoneme string corresponding to each of a plurality of words, the translating means having comparing means for outputting the corresponding phoneme string when a stored word matches a portion of the character string.

5. An apparatus as claimed in one of claims 1 to 4 further comprising memory means associated with said variable phoneme detecting means for storing, in association with a variable phoneme, the relative location at which an influencing phoneme will influence the value of the parameter of the stored variable phoneme, the variable phoneme detecting means including comparing means for outputting the associated predetermined location when a stored variable phoneme is detected in the phoneme string.

6. An apparatus as claimed in any preceding claim further comprising memory means associated with the influencing phoneme detecting means for storing, in association with the identity of an influencing phoneme, data representative of a modification in the stored value of the selected parameter, the influencing phoneme detecting means including comparing means for outputting the associated data when the stored influencing phoneme is in the predetermined location.

7. An apparatus as claimed in any of claims 1 to 6 wherein the variable phoneme is a syllabic nasal "N".

8. An apparatus as claimed in any of claims 1 to 6 wherein the variable phoneme is a double consonant.

9. An apparatus for creating synthetic speech from a string of character codes, comprising: code translating means for translating the string of character codes into a string of phoneme codes, and speech synthesizer means for assembling the string of phoneme codes into synthetic speech, characterised by

detecting means for detecting, in said string of phoneme codes, a pair of phoneme codes in non-adjacent morae, and

modifying means coupled to said detecting means for modifying a selected parameter of a phoneme represented by one of said phoneme codes in non-adjacent morae.

10. A method of creating synthetic speech for a character string, comprising the steps of:

(a) translating the character string into a phoneme string;

(b) searching in the phoneme string for any variable phoneme which has a prosodic parameter which is alterable by the presence in the phoneme string of an influencing phoneme in non-adjacent morae;

(d) determining the influence on any variable phoneme of any influencing phoneme located, and

(e) assembling into synthetic speech the phoneme string with individual phonemes as modified by any determined influences.
