(19)
(11)EP 2 645 364 B1

(12)EUROPEAN PATENT SPECIFICATION

(45)Mention of the grant of the patent:
08.05.2019 Bulletin 2019/19

(21)Application number: 12162032.2

(22)Date of filing:  29.03.2012
(51)Int. Cl.: 
G10L 15/22  (2006.01)
G10L 13/08  (2013.01)
G10L 15/18  (2013.01)
G10L 25/48  (2013.01)

(54)

Spoken dialog system using prominence

Sprachdialogsystem mit Anwendung der Prominenz

Système de dialogue vocal utilisant la proéminence


(84)Designated Contracting States:
AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

(43)Date of publication of application:
02.10.2013 Bulletin 2013/40

(73)Proprietor: Honda Research Institute Europe GmbH
63073 Offenbach/Main (DE)

(72)Inventor:
  • Heckmann, Martin
    63073 Offenbach (DE)

(74)Representative: Rupp, Christian 
Mitscherlich PartmbB Patent- und Rechtsanwälte Sonnenstraße 33
80331 München (DE)


(56)References cited:
JP-A- 2001 236 091
US-A1- 2003 216 912
  
  • Lars Schillingmann ET AL: "Acoustic Packaging and the Learning of Words", IEEE ICDL-EPIROB 2011 (1st Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics), 27 August 2011 (2011-08-27), XP55030018, Retrieved from the Internet: URL:http://aiweb.techfak.uni-bielefeld.de/ files/ICDL-2011-AP-Poster.pdf [retrieved on 2012-06-14]
  • Gina-Anne Levow: "Characterizing and Recognizing Spoken Corrections in Human-Computer Dialogue", Proceeding COLING '98: Proceedings of the 17th international conference on Computational linguistics, 1 January 1998 (1998-01-01), pages 736-742, XP55030034, Retrieved from the Internet: URL:http://delivery.acm.org/10.1145/990000 /980969/p736-levow.pdf?ip=145.64.134.241&a cc=OPEN&CFID=112130174&CFTOKEN=49708997&__ acm__=1339753363_5f224eaa7972a7fa4dd176031 78f746e [retrieved on 2012-06-15]
  • MARC SWERTS ET AL: "CORRECTIONS IN SPOKEN DIALOGUE SYSTEMS", 20001016, 16 October 2000 (2000-10-16), XP007010410,
  
Note: Within nine months from the publication of the mention of the grant of the European patent, any person may give notice to the European Patent Office of opposition to the European patent granted. Notice of opposition shall be filed in a written reasoned statement. It shall not be deemed to have been filed until the opposition fee has been paid. (Art. 99(1) European Patent Convention).


Description


[0001] The invention relates to the domain of speech based human-machine interaction. More precisely it relates to improving a spoken dialog system by incorporating prosodic information contained in the speech signal.

[0002] According to Wikipedia a spoken dialog system is a dialog system delivered through voice. Such a system, according to the invention comprises at least:
  • a speech recognizer, functionally connected to a microphone, and
  • a text-to-speech module, functionally connected to a loudspeaker.


[0003] A spoken dialog system enables the communication between a human and a machine based on speech. Its main components are commonly at least one of a speech recognizer, a text-to-speech (TTS) system, a response generator, a dialog manager, a knowledge base and a natural language understanding module. Hereby a speech recognizer transforms acoustical signals to hypotheses of uttered words or word units, the natural language understanding module combines these hypotheses to utterance hypotheses using a knowledge base where e.g. the grammar of the language is represented and extracts information from the utterance currently demanded by the dialog manager (e.g. the destination in a speech based navigation system). Overall the dialog manager controls the process and if information is still missing (e.g. the street number) triggers the response generator to create a response which asks for this information. The textual response is then transformed to a speech signal via the text-to-speech system (for more details see e.g. http://en.wikipedia.org/wiki/Spoken_dialog_system).
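
The interplay of natural language understanding, knowledge base, dialog manager and response generator described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the claimed system: the speech recognizer and the text-to-speech system are left out, and the function names (understand, next_response), the slot names and the toy knowledge base are hypothetical.

```python
import re

# Hypothetical toy knowledge base: known cities and streets (assumption).
KNOWLEDGE_BASE = {
    "city": {"munich", "maisach", "mainz"},
    "street": {"hauptstrasse", "bahnhofstrasse"},
}

# Information the dialog manager demands for the navigation task, in asking order.
REQUIRED_SLOTS = ["city", "street", "street_number"]


def understand(word_hypotheses):
    """Toy language understanding: map recognized words to task slots."""
    slots = {}
    for word in word_hypotheses:
        token = word.lower()
        if token in KNOWLEDGE_BASE["city"]:
            slots["city"] = token
        elif token in KNOWLEDGE_BASE["street"]:
            slots["street"] = token
        elif re.fullmatch(r"\d+", token):
            slots["street_number"] = token
    return slots


def next_response(state):
    """Toy dialog manager and response generator: ask for the first missing slot."""
    for slot in REQUIRED_SLOTS:
        if slot not in state:
            return "Please tell me the " + slot.replace("_", " ") + "."
    return "Starting navigation."


state = {}
state.update(understand(["I", "want", "to", "drive", "to", "Maisach"]))
print(next_response(state))  # asks for the street, which is still missing
state.update(understand(["Hauptstrasse", "12"]))
print(next_response(state))  # all slots filled
```

The textual response returned by next_response would then be passed to the text-to-speech system.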

[0004] Human speech is not only the words spoken but also how they are spoken. This is reflected by prosody, i.e. rhythm, speed, stress, structure and/or intonation of speech, each of which can be used alone or in combination as prosodic cues. Also other features of an utterance can serve as prosodic cues.

[0005] Such prosodic cues play a very important role in human-human communication, e.g. they structure utterance in phrases (elements of a clause), emphasize novel information in an utterance, and differentiate between questions or statements.

[0006] Different approaches to extract prosodic cues have been proposed. Nevertheless prosodic information is rarely used in spoken dialog systems.

[0007] "Utterance" in spoken language analysis typically relates to a unit of speech. It is generally but not always bounded by silence. In the context of a spoken dialog system it is mainly used to refer to what was said by a speaker, e.g. when in a dialog it is the speaker's turn. In a dialog or communication, a turn is a time during which a single participant speaks, within a typical, orderly arrangement in which participants speak with minimal overlap and gap between them.

[0008] In many situations current speech interfaces do not perform as desired or expected by a user. They frequently misunderstand words, especially when background noise is present, or when the speaking style differs from the one they expect. In particular they are insensitive to the prosody of speech, i.e. the rhythm, speed, stress, structure and/or intonation of speech.

[0009] Speech interfaces are already an important component for current cars but also in other areas such as, e.g., mobile telecommunication device control, and their importance will even increase in the future.

[0010] However, current speech interfaces are perceived as unintuitive and error-prone. One reason for this is that such systems do not analyze speech as humans do. Especially, they are "blind" to prosodic cues. The integration of prosodic cues endows the systems with capabilities which allow them to better understand the user's goals. In particular, the integration of prosodic cues renders the systems more intuitive and hence more robust.

[0011] Stressing the words which are the most relevant is a very natural way of talking. By considering this fact, the human-machine dialog is much more natural and hence easier for the human.

[0012] As detailed below, this can be especially beneficial in situations where clarifications are necessary from the human. When clarifications occur or are necessary, current systems usually are unaware of which part of the utterance was misunderstood. In the subsequent interpretation of the correction by the human they hence decode the utterance in the same way as they decoded the original utterance. Humans however tend to emphasize the misunderstood term in the correction. By endowing the system with capabilities to extract this emphasis, i.e. the prominence of a syllable, word or group of words, the system will obtain additional information and the dialog between human and machine will improve.

[0013] There is a multitude of approaches to extract prosodic cues in the context of speech processing (cf. documents [1-3]). When used to improve automatic speech understanding this information was extracted from speech corpora, i.e. databases of speech audio files and text transcriptions, and used to improve an automatic transcription of these corpora (cf. documents [4-9]). A very recent system uses prosody to improve recognition accuracies in broadcast news recognition by scoring the words differently depending on their pitch accent [10]. The prosodic cues of the analyzed broadcast news in this case are solely determined based on the fundamental frequency.

[0014] One prominent case of prosody use is the Verbmobil project (1993-2000) [11]. The target of the project was to allow people speaking different languages to communicate verbally with each other by help of a computer. Hence recognition of an utterance in a source language was performed, the recognized utterance was translated into the target language, which was then re-synthesized and output.

[0015] Prosodic cues were used to disambiguate different sentence meanings based on word prominence information, i.e. the acoustic manifestation of emphasis put on a word during articulating it, and to guide the parsing of the sentence by using information on the prosodic phrasing, the segmentation of an utterance in different sub-phrases based on prosodic cues. The cues deployed were based on fundamental frequency, intensity, and duration.

[0016] Other studies showed that the visual channel also conveys prosodic information, especially in the mouth region, the eyebrows and the head movements (cf. documents [12-16]). Few studies exist, where visual prosodic information is automatically extracted using markers in a speaker's face (cf. documents [17-18]).

[0017] Document US 7,996,214 describes a system and method of exploiting prosodic features for dialog act tagging. The dialog act (question, hesitation, ...) is based on prosodic features.

[0018] Document US 2006/0122834 A1 describes an emotion detection device and a method for use in distributed systems. The emotional state of the user is inferred.

[0019] In US 7,778,819 a method and apparatus is described for predicting word prominence in speech synthesis. Prominence is estimated from text and used in speech synthesis.

[0020] A method for analyzing speech in a spoken dialog system, in which all features of the preamble of claim 1 are disclosed, is described in US 2003/216912 A1.

[0021] JP 2001 236091 A discloses a dialog system, in which correction phrases are detected by analysing prosodic cues.

[0022] It is an object of the present invention to provide a method for analyzing speech and a spoken dialog system, with which the determination of misunderstood terms can be improved.

[0023] This object is achieved by a method and a system according to the independent claims. Advantageous embodiments are defined in the dependent claims.

[0024] Further aspects of the invention are now described with reference to the drawings.

[0025] Fig. 1 shows an overview of an exemplary embodiment of the invention.

[0026] In one aspect, the invention presents a method for operating (by analysing speech) a spoken dialog system according to claim 1.

[0027] Such a spoken dialog system, according to the invention, comprises at least:
  • a speech recognizer, functionally connected to a microphone, and
  • a text-to-speech module, functionally connected to a loudspeaker.


[0028] The utterance may be a correction of the previous utterance. The utterance is a word or a sentence.

[0029] The prominence is determined by evaluating the acoustic signal and a video signal visually capturing the user. Different levels of prominence may be ranked. The prominence indicates a degree of importance of parts of an utterance, e.g. the emphasis a speaker sets on parts of the utterance.

[0030] The marker feature is detected when at least parts of the previous utterance are repeated.

[0031] The part(s) to be replaced in the previous utterance can be used to improve a recognition accuracy by extracting at least one part with a pre-determined prominence from the utterance, e.g. the correction, extracting the part(s) to be replaced in the previous utterance, comparing at least one recognition hypothesis for each of the extracted parts, and inferring from this comparison a new recognition hypothesis for the part to be replaced of the previous utterance.

[0032] The marker feature can be determined by the prominence of the first part of the utterance either by itself or in combination with a lexical analysis of the utterance.

[0033] The utterance is analyzed in form of a speech/acoustical and a video signal. The prosodic cues may be extracted from a combination of the speech/acoustical signal and the video signal, e.g. representing a recording of a user's upper body, preferably including the head and face.

[0034] The movements of the user's head, facial muscles, mouth and/or eyebrows can be used to determine the prosodic cues.

[0035] A compressive transformation may be applied to the mouth region. A tilt of the head and consequently the mouth region can be corrected prior to applying the transformation, in particular a Discrete Cosine Transformation (DCT).

[0036] Reliability for each information channel may be calculated over which the prosodic cues are obtained. An assignment of importance, i.e. a prominence, on the different parts of the utterance can be obtained by adaptively combining the different information channels considering previously calculated reliabilities. The reliability of the video channel may be calculated based on the illumination conditions.

[0037] In a further aspect, the invention presents a spoken dialog system according to claim 12.

[0038] The system comprises a video accepting means for accepting a visual signal, e.g. a video camera capturing a video signal.

[0039] Further aspects of the invention are now described with reference to the drawings.
Fig. 1 shows an overview of an exemplary embodiment of the invention.
Fig. 2 shows an exemplary system layout of the invention.
Fig. 3 shows a flow chart of one possible embodiment of the invention.
Fig. 4 shows a block diagram of an extraction and integration of prosodic features of one possible embodiment of the invention.


[0040] Fig. 1 shows an overview of an exemplary setup of a spoken dialog system 30 according to the invention. In Fig. 1 a user 10 from which an utterance is accepted is shown. This utterance can be accepted by a means 20 for accepting acoustical signals, e.g. a microphone, and, optionally, a means 25 for accepting visual signals, e.g. a camera producing video signals. The spoken dialog system includes a processing engine 40 to process the signals accepted by means 20, 25. In particular, the processing engine 40 provides at least one of a speech recognizer, a text-to-speech (TTS) system, a response generator, a dialog manager, a knowledge base and a natural language understanding module, a lexical analyzer module or a combination thereof. While the processing engine 40 is shown as a single block in Fig. 1, it has to be understood that all elements of the processing engine 40 may be realized as separate modules.

[0041] Further, the spoken dialog system 30 comprises or is functionally connected to a processing means 50 and storage means 60. The processing engine employs the processing means 50 and uses the storage means 60 during processing. The spoken dialog system 30 may also comprise an interface 70 for communication to the user 10 or other systems, e.g. a navigation system, a control unit or an assistance system. These systems might also be realized as software applications and hence the interface 70 might be a hardware interface or a software interface.

[0042] Typically, the input signals or input patterns to the spoken dialog system 30 are accepted from a sensor, which is then processed by hardware units and software components. An output signal or output pattern is obtained, which may serve as input to other systems for further processing, e.g. for visualization purposes, for navigation, vehicle or robot control or the control of a (mobile) telecommunication device or appliances. The input signal may be supplied by one or more sensors, e.g. for visual or acoustic sensing, but also by the software or hardware interface. The output signal/pattern may be transferred to another processing unit or actor, which may be used to influence the actions or behavior of a robot, vehicle or mobile telecommunication device.

[0043] Computations and transformations required by the spoken dialog system 30 may be performed by a processing means 50 such as one or more processors (CPUs), signal processing units or other calculation, processing or computational hardware and/or software, which might also be adapted for parallel processing.

[0044] Processing and computations may be performed on standard off-the-shelf (OTS) hardware or specially designed hardware components. A CPU of a processor may perform the calculations and may include a main memory (RAM, ROM), a control unit, and/or an arithmetic logic unit (ALU). It may also address a specialized graphic processor, which may provide dedicated memory and processing capabilities for handling the computations needed.

[0045] The storage means 60 is used for storing information and/or data obtained or needed for processing, as well as results. The storage means 60 also allows storing or memorizing inputs to the spoken dialog system 30 and knowledge deduced therefrom, such as e.g. speech recognition methods, recognition data, recognition hypotheses, etc., to influence processing of future inputs.

[0046] The storage means 60 may be provided by devices such as a hard disk (SSD, HDD, Flash memory), RAM and/or ROM, which may be supplemented by other (portable) storage media such as floppy disks, CD-ROMs, Tapes, USB drives, Smartcards, Pen drives etc. Hence, a program encoding a method according to the invention as well as data acquired, processed, learned or needed in/for the application of the inventive system and/or method may be stored in a respective storage medium.

[0047] In particular, the method described by the invention may be provided as a software program product on a (e.g., portable) physical storage medium which may be used to transfer the program product to a processing system or a computing device in order to instruct the system or device to perform a method according to this invention. Furthermore, the method may be directly implemented on a computing device or may be provided in combination with the computing device.

[0048] One aspect of the invention is to extract information on the importance of different parts in an utterance by a speaker and to use it in a spoken dialog system, an example of which is shown in Fig. 2. Levels of importance are manifested in the acoustical signal representing the utterance by different levels of emphasis set on a corresponding part of the utterance. In linguistics the emphasis that parts of an utterance, e.g. a syllable or a word, are given relative to others is called stress or prominence. In this sense these emphasized parts stand out, and are therefore prominent. Prominence is, for example, used to indicate the most informative part of an utterance.

[0049] Based on features extracted from the acquired acoustical signals representing the utterance it is possible to assign to each part of the utterance different degrees of prominence. The different degrees of prominence can then be mapped to degrees of importance for the segment in the utterance as intended by the speaker. A spoken dialog system can then use the information of the degree of importance to improve the dialog with the user, as a segment of the utterance with a high degree of importance might be indicative of information which is currently missing to fulfill the task, e.g. in the navigation scenario the street address when city and street are already known, or might indicate that a misunderstanding has happened between the user and the system. The following description of a possible embodiment will further exemplify this.

[0050] One possible embodiment of the current invention is to improve a correction dialog as shown in Fig. 3. With current spoken dialog systems, and speech recognition systems in particular, it is quite common that the system misunderstands the user. In some cases the spoken dialog system is able to detect such recognition errors automatically. However, in most cases this happens unnoticed by the system.

[0051] When humans communicate with each other they commonly use negative signals as "No", "No, I meant ...", "No, I said ..." or the like to indicate that there was a misunderstanding. Take the following hypothetical communication as an example:

Human A: I want to drive to Maisach

Human B: To which part of Munich do you want to drive?

Human A: NO, I want to drive to MAISACH!



[0052] As illustrated, the communication partners will, subsequent to a misunderstanding, fully or partially repeat what they said previously. In this repetition they tend to make the previously misunderstood term the most prominent one as it is currently the most important one for them (in the example above prominence is indicated by capital letters).

[0053] The human listener will then be able to first infer from the negative signals that there was a misunderstanding and then detect, based on its prominence, the presumably misunderstood term. In this scenario prosodic cues are very important for detecting not only the misunderstood term but also the negative signal. These negative signals are commonly also uttered with a high level of prominence.

[0054] As current spoken dialog systems are not able to interpret prosodic cues they have severe difficulties in a situation where a correction is given from a user.

[0055] The current invention presents a method to endow a spoken dialog system with capabilities to infer the prominence of different parts of an utterance and use them in the dialog management. When transferring the previous example of a communication between humans to a human machine dialog the system proposed in the invention will be able to detect that there was a misunderstanding in the previous dialog act based on its recognition of the negative signal "No" combined with its high prominence.
Human: I want to drive to Maisach.
Machine: To which part of Munich do you want to drive?
Human: NO, I want to drive to MAISACH!
Machine: Sorry. To which street in MAISACH do you want to drive?


[0056] After recognizing the marker feature of the negative signal "No", it will search for an additional term with very high prominence ("MAISACH") and infer that this term was misunderstood in the previous utterance. It can then signal to the user that it understood that there was a mistake in the previous dialog act and move on in the dialog with the now corrected term. Here prominence can also be used in the feedback to the user to further highlight that the system has identified the mistake. Such a strategy is also used by humans. Overall such a system will feature a more efficient, i.e. quicker, dialog with fewer turns, and also a more natural and intuitive dialog.
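
The detection logic of this paragraph can be illustrated with a short sketch. It assumes that a prominence value between 0 and 1 is already available for each word; the thresholds, the marker vocabulary and the function name detect_correction are hypothetical and chosen only for illustration. The case without a negative signal, in which a very prominent word alone indicates the correction, is included as a fallback.

```python
# Hypothetical input format: list of (word, prominence) pairs, prominence in [0, 1].
NEGATIVE_MARKERS = {"no"}   # assumed marker vocabulary
MARKER_THRESHOLD = 0.7      # assumed: marker must be at least this prominent
TERM_THRESHOLD = 0.7        # assumed: corrected term after a marker
VERY_HIGH_THRESHOLD = 0.9   # assumed: prominence that counts without a marker


def detect_correction(words):
    """Return (is_correction, corrected_term) for one utterance."""
    if not words:
        return False, None
    first_word, first_prominence = words[0]
    has_marker = (first_word.lower() in NEGATIVE_MARKERS
                  and first_prominence >= MARKER_THRESHOLD)
    rest = words[1:] if has_marker else words
    if not rest:
        return has_marker, None
    # Search for the most prominent remaining term.
    term, prominence = max(rest, key=lambda pair: pair[1])
    if has_marker:
        return True, term if prominence >= TERM_THRESHOLD else None
    if prominence >= VERY_HIGH_THRESHOLD:
        return True, term
    return False, None


utterance = [("NO", 0.9), ("I", 0.2), ("want", 0.2), ("to", 0.1),
             ("drive", 0.3), ("to", 0.1), ("MAISACH", 0.95)]
print(detect_correction(utterance))
```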

[0057] It has to be understood that the marker feature can also be another feature, especially in another language. It might also be a certain intonation, stress, structure, ... of an utterance. In particular the marker feature can be a very high degree of prominence of a word. In cases where a signal as "No" is missing the misunderstood term will be made by the user even more prominent (e.g. "I want to drive to MAISACH!") and can hence be used by itself to indicate the misunderstanding.

[0058] As well, the detection that there is a misunderstanding in general and which part of the utterance was misunderstood does not require a correct recognition of the relevant parts of the utterance. This information can be inferred from the prominence of the respective parts, e.g. a very prominent segment at the beginning of an utterance is a good indication for a correction utterance.

[0059] Current spoken dialog systems will not be able to detect that there was a misunderstanding and as a consequence will not easily be able to use the context information from the previous dialog act.

[0060] However, in such a correction dialog the system usually has access to two variants of the same word (e.g. "Maisach": the one which was uttered in the original dialog act and misunderstood, and the one from the correction). This can be used to obtain a better recognition after the word was uttered the second time.

[0061] For finding both instances of this word it is not necessary to recognize the word. The invention instead uses the prominence of the word in the correction utterance to detect it in the utterance previously misunderstood/misinterpreted. Once one instance is found it is now possible to use rather simple pattern matching techniques to find the same word again in the previous dialog act. This is possible as in this case the same speaker in an at least very similar environment has uttered the word.
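
The description leaves the pattern matching technique open. One possible choice, shown here purely as an assumption, is subsequence dynamic time warping: the feature sequence of the prominent word from the correction is aligned against the stored feature sequence of the previous utterance, and the best matching segment is returned. The function name find_segment and the use of a single scalar feature per frame are hypothetical simplifications.

```python
def find_segment(query, sequence):
    """Subsequence DTW: locate the part of `sequence` that best matches `query`.

    Both arguments are lists of per-frame scalar features (an assumption made
    for brevity). Returns (start_index, end_index, alignment_cost).
    """
    if not query or not sequence:
        raise ValueError("query and sequence must be non-empty")
    n, m = len(query), len(sequence)
    cost = [[0.0] * m for _ in range(n)]
    start = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = abs(query[i] - sequence[j])
            if i == 0:
                # Free start: the match may begin at any frame of the sequence.
                cost[i][j], start[i][j] = d, j
                continue
            predecessors = [(i - 1, j)]
            if j > 0:
                predecessors += [(i - 1, j - 1), (i, j - 1)]
            pi, pj = min(predecessors, key=lambda p: cost[p[0]][p[1]])
            cost[i][j] = d + cost[pi][pj]
            start[i][j] = start[pi][pj]  # remember where this path began
    # Free end: the match may end at any frame of the sequence.
    end = min(range(m), key=lambda j: cost[n - 1][j])
    return start[n - 1][end], end, cost[n - 1][end]


word_from_correction = [1.0, 3.0, 2.0]
previous_utterance = [0.0, 0.0, 1.0, 3.0, 3.0, 2.0, 0.0]
print(find_segment(word_from_correction, previous_utterance))
```

In the toy data the word is stretched by one frame in the previous utterance, which the warping absorbs.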

[0062] Once this is achieved the most likely and several less likely recognition hypotheses for both of these instances of the word can be calculated. When comparing these hypotheses it is possible to gain new information and improve the overall recognition accuracy. If in both cases the order of hypotheses, with recognition scores in brackets, was e.g. "Munich"(0.9), "Maisach"(0.8), "Mainz"(0.5), then just selecting the second most likely hypothesis would be a good strategy, as "Munich" was already identified as being wrong. In a case where they differ, e.g. original: "Munich"(0.9), "Mainz"(0.8), "Maisach"(0.7), correction: "Munich"(0.9), "Maisach"(0.7), "Mainz"(0.5), selecting "Maisach" would be a good strategy because, once "Munich" is excluded, it obtains the highest combined score (e.g. when the scores are summed, 0.7 + 0.7 = 1.4 versus 0.8 + 0.5 = 1.3 for "Mainz").
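
The comparison of the two hypothesis lists can be sketched as follows. The description does not specify how the scores are combined; the sketch assumes a simple sum and excludes the hypothesis already identified as wrong. The function name rescore is hypothetical.

```python
def rescore(original, correction, rejected):
    """Combine two hypothesis lists and return (best_word, combined_scores).

    original, correction: dicts mapping a hypothesis to its recognition score.
    rejected: set of hypotheses already known to be wrong.
    Assumption: scores are combined by summation.
    """
    combined = {}
    for word in set(original) | set(correction):
        if word in rejected:
            continue
        combined[word] = original.get(word, 0.0) + correction.get(word, 0.0)
    if not combined:
        return None, combined
    return max(combined, key=combined.get), combined


# Second example from the description: the two rankings differ.
original = {"Munich": 0.9, "Mainz": 0.8, "Maisach": 0.7}
correction = {"Munich": 0.9, "Maisach": 0.7, "Mainz": 0.5}
best, scores = rescore(original, correction, rejected={"Munich"})
print(best)
```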

[0063] A prerequisite for correctly interpreting the correction is the capacity of the system to store at least one, in one embodiment several, previous dialog acts. This can either be in the format as they were recorded or in an abstracted feature representation.

[0064] To obtain a measure for the importance of a part of an utterance, i.e. its prominence, different measures from the acoustic signal have been proposed:
  • Spectral intensity: The energy in certain frequency bands relative to that of others correlates well with prominence
  • Duration: The lengthening of a syllable is characteristic for prominence
  • Pitch patterns: Certain pitch patterns are indicative for prominence


[0065] For some of these features the fundamental frequency, whose percept is called pitch, has to be extracted and its shape has to be classified [19, 20].

[0066] Spectral intensity represents a reliable and computationally inexpensive way to extract prominence information from the speech signal [21, 22].
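
As a rough illustration of a spectral intensity cue, the following sketch computes the share of a frame's energy that falls into one frequency band. It is not the method of [21, 22]: the band limits, the frame length, the sample rate and the function names are assumptions, the band energy is related to the total energy rather than to selected other bands, and a naive discrete Fourier transform is used so that the example needs no external library.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0 .. len(frame) // 2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def band_energy_ratio(frame, sample_rate, low_hz, high_hz):
    """Energy between low_hz and high_hz relative to the total frame energy."""
    spectrum = power_spectrum(frame)
    bin_hz = sample_rate / len(frame)
    total = sum(spectrum)
    if total == 0.0:
        return 0.0  # silent frame
    band = sum(p for k, p in enumerate(spectrum)
               if low_hz <= k * bin_hz <= high_hz)
    return band / total


# Synthetic test tones (assumed values): 64 samples at 8 kHz.
RATE, N = 8000, 64
tone_1000 = [math.sin(2 * math.pi * 1000 * t / RATE) for t in range(N)]
tone_250 = [math.sin(2 * math.pi * 250 * t / RATE) for t in range(N)]
print(band_energy_ratio(tone_1000, RATE, 500, 2000))  # close to 1
print(band_energy_ratio(tone_250, RATE, 500, 2000))   # close to 0
```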

[0067] In particular in the current invention not only the acoustical signal is used but also the movements of the speaker's head and face (compare Fig. 4). It has previously been shown that movements of the eyebrows, the head and the mouth region convey important information on the prominence of parts of an utterance [12-14]. Generally, prosodic cues can be derived from facial muscles; other facial or body features, e.g. the movement of arms or hands, can also be used. Even gestures can be used, e.g. to detect a marker feature, such as a specific, e.g. negative, body posture or a specific movement or movement pattern.

[0068] Methods for the extraction of such features are also available [17, 18, 23]. Very powerful ways to extract visual features from the face are transformation-based approaches such as the Discrete Cosine Transformation (DCT) [24]. In this case a head rotation estimation and subsequent correction thereof is also beneficial. This can e.g. be obtained by detecting the user's eyes or by calculating the symmetry axis of the mouth region. Alternative transformations suited to this task are e.g. Principal Component Analysis (PCA) and the Fourier Transformation.
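
A minimal sketch of these two steps is given below, assuming that the mouth region is available as a small grayscale patch and that the eye positions have already been detected. It estimates the head tilt from the eye positions and computes the low-frequency coefficients of an unnormalized two-dimensional DCT-II as a compressed feature vector. The actual rotation of the image by the estimated angle is omitted, and the function names and the number of retained coefficients are hypothetical.

```python
import math


def head_tilt_degrees(left_eye, right_eye):
    """Tilt of the line through both eyes, in degrees; (x, y) image coordinates."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def dct2_low_frequencies(patch, keep):
    """Unnormalized 2D DCT-II of `patch`; returns the keep x keep lowest coefficients."""
    h, w = len(patch), len(patch[0])
    coeffs = []
    for u in range(min(keep, h)):
        row = []
        for v in range(min(keep, w)):
            total = 0.0
            for y in range(h):
                for x in range(w):
                    total += (patch[y][x]
                              * math.cos(math.pi * (2 * y + 1) * u / (2 * h))
                              * math.cos(math.pi * (2 * x + 1) * v / (2 * w)))
            row.append(total)
        coeffs.append(row)
    return coeffs


# A uniform patch has all its energy in the lowest (DC) coefficient.
flat_patch = [[1.0] * 4 for _ in range(4)]
features = dct2_low_frequencies(flat_patch, keep=2)
print(features[0][0])
print(head_tilt_degrees((10, 20), (30, 20)))  # level eyes, no tilt
```

Keeping only the low-frequency coefficients is what makes the transformation compressive: the coarse shape of the mouth region is retained while fine pixel detail is discarded.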

[0069] The combination of acoustic and visual information in the determination of prominence is not known from prior art and manifests a part of this invention. In particular we propose to determine reliability measures for the acoustical and visual channel and perform the integration of the different channels depending on their individual reliability. Different reliability measures can be used, e.g. based on probabilistic models of the values of the cues and a comparison of the current value with the model.
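
The reliability-dependent integration can be illustrated as follows. The sketch assumes one possible reliability measure in the sense of the paragraph above, namely how typical the current cue value is under a Gaussian model of that cue, and a reliability-weighted average for the integration. Both choices, as well as the function names and the numbers in the example, are assumptions made for illustration.

```python
import math


def gaussian_reliability(value, mean, std):
    """Reliability in [0, 1]: 1 at the model mean, decaying for atypical values."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z)


def fuse_prominence(acoustic, visual, rel_acoustic, rel_visual):
    """Reliability-weighted average of the two channel-wise prominence estimates."""
    total = rel_acoustic + rel_visual
    if total == 0.0:
        # No channel is trusted: fall back to the plain average.
        return 0.5 * (acoustic + visual)
    return (rel_acoustic * acoustic + rel_visual * visual) / total


# Hypothetical example: the visual cue value is far from what its model expects,
# e.g. under poor illumination, so the visual channel receives a low weight.
rel_a = gaussian_reliability(value=0.1, mean=0.0, std=1.0)
rel_v = gaussian_reliability(value=3.0, mean=0.0, std=1.0)
print(fuse_prominence(acoustic=0.8, visual=0.2,
                      rel_acoustic=rel_a, rel_visual=rel_v))
```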

[0070] Hence, the invention uses prosodic speech features in a human machine dialog to infer the importance of different parts of an utterance and to use this information to make the dialog more intuitive and more robust.

[0071] The dialog is more intuitive because users can speak more naturally, i.e. using prosody, and the system also uses prosodic cues to give feedback to the user. The higher robustness is thereby achieved by reducing the recognition errors.

[0072] One scenario is a situation where
  • a user utters a phrase,
  • the system misunderstands the phrase,
  • the user repeats the utterance for the system, thereby changing the prosody of the utterance which makes the misunderstood part more prominent in the utterance, a strategy commonly used when humans speak to each other,
  • the system determines the most prominent part of the correction utterance, hence the previously misunderstood part,
  • the system extracts the corresponding segment from the original utterance, e.g. by pattern matching with the prominent part in the correction, and
  • the system uses this information to determine a better hypothesis on what was really said by the user.

[0073] An important aspect of the invention is the combination of acoustic and visual features for the determination of the importance, i.e. prominence, of the different parts of the utterance. Hereby the combination of the different information streams is adapted to the reliability of the different streams.

[0074] In particular, the integration of prosodic cues renders the systems more intuitive and hence more robust. Stressing the words which are the most relevant is a very natural way of talking. By considering this the man-machine dialog is much more natural and hence easier for the human. As detailed, this can be especially beneficial in situations where clarifications are necessary from the human. When such clarifications happen, current systems are usually unaware that the last utterance was a clarification and which part of the utterance was misunderstood. In the subsequent interpretation of the correction they hence decode the utterance in the same way as they decoded the original utterance. Humans however tend to indicate such a correction utterance prosodically and emphasize the misunderstood term in the correction. By detecting the correction and subsequently extracting this emphasis, i.e. the prominence, the system will obtain additional information and the dialog between human and machine will improve.

Summary:



[0075] The invention presents a system which determines the different degrees of importance a speaker sets on parts of an utterance from a speech signal based on prosodic speech features and uses this information to improve a man-machine dialog by integrating it into a spoken dialog system.

[0076] The importance of a part of an utterance is determined by its prominence. The information on the prominence of a part of the utterance is used in a situation where
  • the system misunderstood the user in a previous dialog act,
  • the system detects that such a misunderstanding has happened from the current dialog act in which the user repeated at least part of the previous utterance,
  • the system detects the misunderstood part of the utterance,
  • the system uses this information to improve the recognition of this part of the utterance, and
  • the system uses the acquired information in the dialog management.

[0077] The information on the misunderstood part is used to improve the recognition accuracy by
  a. extracting the emphasized part from the repetition of the utterance, i.e. the correction,
  b. extracting the misunderstood part in the original utterance e.g. by pattern matching with the part extracted from the correction, and
  c. comparing the N highest recognition hypotheses of the segments extracted in a. or b. and inferring from this comparison a new recognition hypothesis for the part of the utterance in question.

[0078] The detection of a misunderstanding in the previous dialog act uses either a very high prominence of a part of the last user response or the prominence of the first part of the last user response. In particular the latter can also be supported with a lexical analysis of the utterance.

[0079] After having detected a previous misunderstanding, the system uses prosody to highlight the previously misunderstood and now corrected part of the utterance.
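If this paragraph is read as the system emphasizing the corrected word in its own spoken response, one way to express that emphasis is markup for a text-to-speech engine. The Python sketch below builds a fragment using the `<emphasis>` element of the W3C Speech Synthesis Markup Language (SSML). The use of SSML is an assumption and is not mentioned in the text; the function name `highlight_correction` is hypothetical, and a complete SSML document typically also carries version and namespace attributes on `<speak>`.

```python
from xml.sax.saxutils import escape


def highlight_correction(words, corrected_index, level="strong"):
    """Wrap the corrected word of a system response in an SSML emphasis tag.

    words           -- words of the system response
    corrected_index -- position of the previously misunderstood, now corrected word
    level           -- SSML emphasis level, e.g. 'strong' or 'moderate'
    """
    parts = []
    for i, word in enumerate(words):
        text = escape(word)  # protect characters such as '&' and '<'
        if i == corrected_index:
            text = f'<emphasis level="{level}">{text}</emphasis>'
        parts.append(text)
    return "<speak>" + " ".join(parts) + "</speak>"


ssml = highlight_correction(["driving", "to", "Maisach"], 2)
```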

Nomenclature


Spoken dialog system



[0080] A spoken dialog system is a dialog system delivered through voice. It commonly has at least one of the following components, or a combination thereof:
  • Speech recognizer
  • Text-to-speech system
  • Response generator
  • Dialog manager
  • Knowledge base
  • Natural language understanding module

Prosody



[0081] The rhythm, stress, and intonation of speech

Prominence



[0082] The relative emphasis that may be given to certain syllables in a word, or to certain words in a phrase or sentence

Stress



[0083] See prominence

References



[0084] 
  1. [1] Wang, D. & Narayanan, S., An acoustic measure for word prominence in spontaneous speech, Audio, Speech, and Language Processing, IEEE Transactions on, IEEE, 2007, 15, 690-701
  2. [2] Sridhar, R.; Bangalore, S. & Narayanan, S., Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework, Audio, Speech, and Language Processing, IEEE Transactions on, IEEE, 2008, 16, 797-811
  3. [3] Jeon, J. & Liu, Y., Syllable-level prominence detection with acoustic evidence, INTERSPEECH, 2010
  4. [4] Wang, M. & Hirschberg, J., Automatic classification of intonational phrase boundaries, Computer Speech & Language, Elsevier, 1992, 6, 175-196
  5. [5] Shriberg, E.; Stolcke, A.; Jurafsky, D.; Coccaro, N.; Meteer, M.; Bates, R.; Taylor, P.; Ries, K.; Martin, R. & Van Ess-Dykema, C., Can prosody aid the automatic classification of dialog acts in conversational speech?, Language and speech, SAGE Publications, 1998, 41, 443
  6. [6] Shriberg, E.; Stolcke, A.; Hakkani-Tür, D. & Tür, G., Prosody-based automatic segmentation of speech into sentences and topics, Speech communication, Elsevier, 2000, 32, 127-154
  7. [7] Ang, J.; Liu, Y. & Shriberg, E., Automatic dialog act segmentation and classification in multiparty meetings, Proc. ICASSP, 2005, 1, 1061-1064
  8. [8] Liu, Y.; Shriberg, E.; Stolcke, A.; Hillard, D.; Ostendorf, M. & Harper, M., Enriching speech recognition with automatic detection of sentence boundaries and disfluencies, Audio, Speech, and Language Processing, IEEE Transactions on, IEEE, 2006, 14, 1526-1540
  9. [9] Rangarajan Sridhar, V.; Bangalore, S. & Narayanan, S., Combining lexical, syntactic and prosodic cues for improved online dialog act tagging, Computer Speech & Language, Elsevier, 2009, 23, 407-422
  10. [10] Jeon, J.; Wang, W. & Liu, Y., N-best rescoring based on pitch-accent patterns, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, 2011, 732-741
  11. [11] Nöth, E.; Batliner, A.; Kießling, A.; Kompe, R. & Niemann, H., Verbmobil: The use of prosody in the linguistic components of a speech understanding system, IEEE Trans. Speech and Audio Proc., IEEE, 2000, 8, 519-532
  12. [12] Graf, H.; Cosatto, E.; Strom, V. & Huang, F., Visual prosody: Facial movements accompanying speech, Int. Conf. on Automatic Face and Gesture Recognition, 2002, 396-401
  13. [13] Munhall, K.; Jones, J.; Callan, D.; Kuratate, T. & Vatikiotis-Bateson, E., Visual prosody and speech intelligibility, Psychological Science, SAGE Publications, 2004, 15, 133
  14. [14] Beskow, J.; Granström, B. & House, D., Visual correlates to prominence in several expressive modes, Proc. of Interspeech, 2006, 1272-1275
  15. [15] Krahmer, E. & Swerts, M., Audiovisual prosody-introduction to the special issue, Language and speech, 2009, 52, 129-133
  16. [16] Prieto, P.; Pugliesi, C.; Borràs-Comes, J.; Arroyo, E. & Blat, J., Crossmodal Prosodic and Gestural Contribution to the Perception of Contrastive Focus, Proc. INTERSPEECH, 2011
  17. [17] Dohen, M.; Loevenbruck, H.; Harold, H. et al. Visual correlates of prosodic contrastive focus in French: Description and inter-speaker variability, Proc. Speech Prosody, 2006
  18. [18] Cvejic, E.; Kim, J.; Davis, C. & Gibert, G. Prosody for the Eyes: Quantifying Visual Prosody Using Guided Principal Component Analysis, Proc. INTERSPEECH, 2010
  19. [19] Heckmann, M.; Joublin, F. & Goerick, C. Combining Rate and Place Information for Robust Pitch Extraction Proc. INTERSPEECH, 2007, 2765-2768
  20. [20] Heckmann, M. & Nakadai, K. Robust intonation pattern classification in human robot interaction, Proc. INTERSPEECH, ISCA, 2011
  21. [21] Tamburini, F. & Wagner, P. On automatic prominence detection for German, Proc. of INTERSPEECH, ISCA, 2007
  22. [22] Schillingmann, L.; Wagner, P.; Munier, C.; Wrede, B. & Rohlfing, K., Using Prominence Detection to Generate Acoustic Feedback in Tutoring Scenarios INTERSPEECH, ISCA, 2011
  23. [23] Christian Lang, Sven Wachsmuth, M. H. H. W. Facial Communicative Signals - Valence Recognition in Task-Oriented Human-Robot-Interaction, Journal of Social Robotics, accepted for publication
  24. [24] Heckmann, M.; Kroschel, K.; Savariaux, C. & Berthommier, F. DCT-based video features for audio-visual speech recognition, Seventh International Conference on Spoken Language Processing (ICSLP), 2002



Claims

1. A method for analyzing speech in a spoken dialog system (30), comprising the steps of:

accepting an utterance of a user by at least one means (20) for accepting acoustical signals, in particular a microphone, and

analyzing the utterance and obtaining prosodic cues from the utterance using at least one processing engine (40), wherein

the utterance is evaluated based on the prosodic cues to determine a prominence of parts of the utterance, wherein

the utterance is analyzed to determine whether the utterance is a correction utterance correcting the previous utterance by detecting either at least one marker feature, e.g. a negative statement, and/or a segment with a high prominence, indicative of the utterance containing at least one replacement word to replace at least one word in a previous utterance, and

if it is determined that the utterance is a correction utterance,

- the at least one replacement word is detected based on the prominence of parts in the utterance,

- pattern matching is performed to match the at least one replacement word to the at least one word to be replaced in the previous utterance, and

- the previous utterance is re-evaluated with the replacement word(s) instead of the word to be replaced,

characterized in that
movements of the user determined from a video signal visually capturing the user are used to further determine the prosodic cues.
 
2. The method of claim 1, wherein
the utterance is a word or sentence.
 
3. The method of claim 1 or 2, wherein
the prominence is determined by evaluating the acoustic signal and/or different levels of prominence are ranked.
 
4. The method of any one of the preceding claims, wherein
the prominence indicates a degree of importance of parts of an utterance, e.g. the emphasis a speaker sets on parts of the utterance.
 
5. The method of any one of the preceding claims, wherein
the marker feature is detected when at least parts of the previous utterance are repeated.
 
6. The method of any one of the preceding claims, wherein
the part(s) to be replaced in the previous utterance is/are used to improve a recognition accuracy by extracting at least one part with a pre-determined prominence from the utterance, e.g. the correction, extracting the part(s) to be replaced in the previous utterance, and comparing at least one recognition hypothesis for the extracted parts and inferring from this comparison a new recognition hypothesis for the part to be replaced of the previous utterance.
 
7. The method of any one of the preceding claims, wherein
the marker feature is determined by the prominence of the first part of the utterance either by itself or in combination with a lexical analysis of the utterance and/or is determined by a body posture or gesture of the user.
 
8. The method of any one of the preceding claims, wherein
the prosodic cues are extracted from the acoustical signal and the video signal representing a recording of a user's upper body, preferably including the head and face, or a combination of both.
 
9. The method of any one of the preceding claims, wherein
movements of the user's head, arm, hand, facial muscles, mouth and/or eyebrows are used to determine the prosodic cues.
 
10. The method of any one of the preceding claims, wherein
a compressive transformation is applied to the mouth region, and wherein a tilt of the head and consequently the mouth region is corrected prior to applying the transformation, in particular a Discrete Cosine Transformation (DCT).
 
11. The method of any one of the preceding claims, wherein
a reliability for each information of the acoustical signal and the video signal is calculated over which the prosodic cues are obtained, wherein an assignment of importance, i.e. prominence, on the different parts of the utterance is obtained by adaptively combining the different information channels considering previously calculated reliabilities, and wherein the reliability of the video signal is calculated based on the illumination conditions.
 
12. A spoken dialog system (30), comprising

at least one means (20) for accepting acoustical signals, in particular a microphone, for accepting an utterance, and

at least one processing engine (40) for analyzing the utterance and to obtain prosodic cues from the utterance, wherein the processing engine (40) is adapted to evaluate the utterance based on the prosodic cues to determine a prominence of parts of the utterance, wherein

the processing engine (40) is adapted to analyze the utterance to determine whether the utterance is a correction utterance correcting the previous utterance by detecting either at least one marker feature, e.g. a negative statement, and/or a segment with a high prominence, indicative of the utterance containing at least a replacement word to replace at least a word in a previous utterance, and

if the processing engine (40) determines that the utterance is a correction utterance,

- the processing engine (40) is adapted to detect the at least one replacement word based on the prominence of parts in the utterance,

- the processing engine (40) is adapted to perform pattern matching to match the at least one replacement word to the at least one word to be replaced in the previous utterance, and

- the processing engine (40) is adapted to re-evaluate the previous utterance with the replacement word(s) instead of the word to be replaced,

characterized by
means (25) for accepting a video signal visually capturing the user, wherein
the processing engine (40) is adapted to further determine the prosodic cues based on movements of the user determined from the video signal.
 


Ansprüche

1. Verfahren zum Analysieren von Sprache in einem gesprochenen Dialogsystem (30), umfassend die Schritte:

Akzeptieren einer Äußerung eines Benutzers durch mindestens ein Mittel (20) zum Akzeptieren von akustischen Signalen, insbesondere eines Mikrofons, und

Analysieren der Äußerung und Erhalten von prosodischen Hinweisen aus der Äußerung unter Verwendung mindestens einer Verarbeitungsanwendung (40), wobei die Äußerung basierend auf den prosodischen Hinweisen bewertet wird, um eine Hervorhebung von Teilen der Äußerung zu bestimmen,

wobei

die Äußerung analysiert wird, um zu bestimmen, ob es sich bei der Äußerung um eine Korrekturäußerung handelt, die die vorherige Äußerung korrigiert, indem entweder mindestens ein Markierungsmerkmal, z.B. eine negative Anweisung, und/oder ein Segment mit großer Hervorhebung erfasst wird, das auf die Äußerung hinweist, die mindestens ein Ersatzwort enthält, um mindestens ein Wort in einer früheren Äußerung zu ersetzen, und wenn bestimmt wird, dass die Äußerung eine Korrekturäußerung ist,

- das mindestens eine Ersatzwort basierend auf der Hervorhebung von Teilen in der Äußerung erkannt wird,

- ein Musterabgleich durchgeführt wird, um das mindestens eine Ersatzwort an das mindestens eine Wort anzupassen, das in der vorherigen Äußerung ersetzt werden soll, und

- die vorherige Äußerung mit dem/den Ersatzwort(en) anstelle des zu ersetzenden Wortes neu bewertet wird,

dadurch gekennzeichnet, dass

Bewegungen des Benutzers, die aus einem Videosignal bestimmt werden, das den Benutzer visuell erfasst, zum Bestimmen der prosodischen Hinweise verwendet werden.


 
2. Verfahren nach Anspruch 1, wobei
die Äußerung ein Wort oder ein Satz ist.
 
3. Verfahren nach Anspruch 1 oder 2, wobei
die Hervorhebung durch Auswertung des akustischen Signals bestimmt wird und/oder verschiedene Hervorhebungsgrade bewertet werden.
 
4. Verfahren nach einem der vorhergehenden Ansprüche, wobei
die Hervorhebung einen Grad der Wichtigkeit von Teilen einer Äußerung anzeigt, z.B. die Betonung, die ein Sprecher auf Teile der Äußerung legt.
 
5. Verfahren nach einem der vorhergehenden Ansprüche, wobei
das Markierungsmerkmal erkannt wird, wenn mindestens Teile der vorherigen Äußerung wiederholt werden.
 
6. Verfahren nach einem der vorhergehenden Ansprüche, wobei
das/die in der vorherigen Äußerung zu ersetzende(n) Teil(e) verwendet wird(n), um eine Erkennungsgenauigkeit zu verbessern, indem mindestens ein Teil mit einer vorbestimmten Hervorhebung aus der Äußerung extrahiert wird/sind, z.B. die Korrektur, der/die in der vorherigen Äußerung zu ersetzende(n) Teil(e) extrahiert wird/werden, und mindestens eine Erkennungshypothese für die extrahierten Teile verglichen wird und aus diesem Vergleich eine neue Erkennungshypothese für den zu ersetzenden Teil der vorherigen Äußerung abgeleitet wird.
 
7. Verfahren nach einem der vorhergehenden Ansprüche, wobei
das Markierungsmerkmal durch die Hervorhebung des ersten Teils der Äußerung entweder selbst oder in Kombination mit einer lexikalischen Analyse der Äußerung bestimmt wird und/oder durch eine Körperhaltung oder Geste des Benutzers bestimmt wird.
 
8. Verfahren nach einem der vorhergehenden Ansprüche, wobei
die prosodischen Hinweise aus dem akustischen Signal und dem Videosignal extrahiert werden, das eine Aufzeichnung des Oberkörpers eines Benutzers, vorzugsweise einschließlich Kopf und Gesicht, oder eine Kombination aus beidem darstellt.
 
9. Verfahren nach einem der vorhergehenden Ansprüche, wobei
Bewegungen von Kopf, Arm, Hand, Gesichtsmuskeln, Mund und/oder Augenbrauen des Benutzers zur Bestimmung der prosodischen Hinweise verwendet werden.
 
10. Verfahren nach einem der vorhergehenden Ansprüche, wobei
eine komprimierende Transformation auf den Mundbereich angewendet wird, und wobei eine Neigung des Kopfes und damit des Mundbereichs vor dem Anwenden der Transformation, insbesondere einer Diskreten Cosinustransformation (DCT), korrigiert wird.
 
11. Verfahren nach einem der vorhergehenden Ansprüche, wobei
eine Zuverlässigkeit für jede Information des akustischen Signals und das Videosignal berechnet wird, über die die prosodischen Hinweise erhalten werden, wobei eine Zuordnung von Wichtigkeit, d.h. Hervorhebung, auf die verschiedenen Teile der Äußerung durch adaptives Kombinieren der verschiedenen Informationskanäle unter Berücksichtigung zuvor berechneter Zuverlässigkeit erhalten wird, und wobei die Zuverlässigkeit des Videosignals basierend auf den Beleuchtungsbedingungen berechnet wird.
 
12. Ein gesprochenes Dialogsystem (30), umfassend

mindestens ein Mittel (20) zum Empfangen von akustischen Signalen, insbesondere eines Mikrofons, zum Empfangen einer Äußerung, und

mindestens eine Verarbeitungsanwendung (40) zum Analysieren der Äußerung und zum Erhalten prosodischer Hinweise aus der Äußerung, wobei die Verarbeitungsanwendung (40) angepasst ist, um die Äußerung basierend auf den prosodischen Hinweisen zu bewerten, um eine Hervorhebung von Teilen der Äußerung zu bestimmen,

wobei
die Verarbeitungsanwendung (40) angepasst ist, um die Äußerung zu analysieren, um zu bestimmen, ob die Äußerung eine Korrekturäußerung ist, die die vorherige Äußerung korrigiert, indem sie entweder mindestens ein Markierungsmerkmal, z.B. eine negative Aussage, und/oder ein Segment mit hoher Prominenz erfasst, das die Äußerung anzeigt, die mindestens ein Ersatzwort enthält, um mindestens ein Wort in einer früheren Äußerung zu ersetzen, und
wenn die Verarbeitungsanwendung (40) bestimmt, dass die Äußerung eine Korrekturäußerung ist,
die Verarbeitungsanwendung (40) angepasst ist, um die prosodischen Hinweise basierend auf Bewegungen des Nutzers, die aus dem Videosignal ermittelt wurden, zu ermitteln.
 


Revendications

1. Procédé d'analyse de parole dans un système de dialogue parlé (30), comprenant les étapes suivantes :

accepter l'énoncé d'un utilisateur par au moins un moyen (20) destiné à accepter des signaux acoustiques, en particulier un microphone, et

analyser l'énoncé et obtenir des indices prosodiques à partir de l'énoncé en utilisant au moins un moteur de traitement (40), dans lequel

l'énoncé est évalué sur la base des indices prosodiques afin de déterminer une prédominance de parties de l'énoncé,

dans lequel l'énoncé est analysé afin de déterminer si l'énoncé est un énoncé de correction qui corrige l'énoncé précédent en détectant au moins une caractéristique de marqueur, comme un énoncé négatif, et/ou un segment à forte prédominance, qui indique que l'énoncé contient au moins un mot de remplacement destiné à remplacer au moins un mot dans un énoncé précédent, et

s'il est déterminé que l'énoncé est un énoncé de correction,

- l'au moins un mot de remplacement est détecté sur la base de la prédominance de parties de l'énoncé,

- une comparaison de structure est effectuée afin de comparer l'au moins un mot de remplacement avec l'au moins un mot à remplacer dans l'énoncé précédent, et

- l'énoncé précédent est réévalué avec le(s) mot(s) de remplacement à la place du mot à remplacer,

caractérisé en ce que

les mouvements de l'utilisateur déterminés à partir d'un signal vidéo qui capture visuellement l'utilisateur sont utilisés pour déterminer les indices prosodiques.


 
2. Procédé selon la revendication 1, dans lequel
l'énoncé est un mot ou une phrase.
 
3. Procédé selon la revendication 1 ou 2, dans lequel
la prédominance est déterminée en évaluant le signal acoustique et/ou les différents niveaux de prédominance sont classés.
 
4. Procédé selon l'une quelconque des revendications précédentes, dans lequel
la prédominance indique un degré d'importance des parties d'un énoncé, comme l'emphase qu'une personne place sur des parties de l'énoncé.
 
5. Procédé selon l'une quelconque des revendications précédentes, dans lequel
la caractéristique de marqueur est détectée lorsqu'au moins des parties de l'énoncé précédent sont répétées.
 
6. Procédé selon l'une quelconque des revendications précédentes, dans lequel
la/les partie(s) à remplacer dans l'énoncé précédent est/sont utilisée(s) pour améliorer une précision de reconnaissance en extrayant au moins une partie qui présente une prédominance prédéterminée de l'énoncé, comme la correction, en extrayant la/les partie(s) à remplacer dans l'énoncé précédent, et en comparant au moins une hypothèse de reconnaissance pour les parties extraites et en déduisant, à partir de cette comparaison, une nouvelle hypothèse de reconnaissance pour la partie à remplacer de l'énoncé précédent.
 
7. Procédé selon l'une quelconque des revendications précédentes, dans lequel
la caractéristique de marqueur est déterminée par la prédominance de la première partie de l'énoncé en elle-même ou en combinaison avec une analyse lexicale de l'énoncé, et/ou est déterminée par une posture corporelle ou un geste de l'utilisateur.
 
8. Procédé selon l'une quelconque des revendications précédentes, dans lequel
les indices prosodiques sont extraits du signal acoustique et du signal vidéo qui représentent un enregistrement de la partie supérieure du corps de l'utilisateur, de préférence la tête et le visage compris, ou une combinaison des deux.
 
9. Procédé selon l'une quelconque des revendications précédentes, dans lequel
les mouvements de la tête, du bras, de la main, des muscles faciaux, de la bouche et/ou des sourcils de l'utilisateur sont utilisés pour déterminer les indices prosodiques.
 
10. Procédé selon l'une quelconque des revendications précédentes, dans lequel
une transformation compressive est appliquée à la zone de la bouche, et dans lequel une inclinaison de la tête et, par conséquent, de la zone de la bouche est corrigée avant d'appliquer la transformation, en particulier une transformation en cosinus discrète (DCT).
 
11. Procédé selon l'une quelconque des revendications précédentes, dans lequel
une fiabilité pour chaque information du signal acoustique et du signal vidéo est calculée, sur laquelle les indices prosodiques sont obtenus, dans lequel une affectation d'importance, c'est-à-dire d'une prédominance, sur les différentes parties de l'énoncé est obtenue en combinant de manière adaptative les différents canaux d'informations en tenant compte des fiabilités calculées précédemment, et dans lequel la fiabilité du signal vidéo est calculée sur la base des conditions d'éclairage.
 
12. Système de dialogue parlé (30), comprenant
au moins un moyen (20) destiné à accepter des signaux acoustiques, en particulier un microphone, destiné à accepter un énoncé, et
au moins un moteur de traitement (40) destiné à analyser l'énoncé et à obtenir des indices prosodiques à partir de l'énoncé, dans lequel le moteur de traitement (40) est adapté pour évaluer l'énoncé sur la base des indices prosodiques afin de déterminer une prédominance de parties de l'énoncé,
dans lequel
le moteur de traitement (40) est adapté pour analyser l'énoncé afin de déterminer si l'énoncé est un énoncé de correction qui corrige l'énoncé précédent en détectant au moins une caractéristique de marqueur, comme un énoncé négatif, et/ou un segment à forte prédominance, qui indique que l'énoncé contient au moins un mot de remplacement destiné à remplacer au moins un mot dans un énoncé précédent, et
si le moteur de traitement (40) détermine que l'énoncé est un énoncé de correction,

- le moteur de traitement (40) est adapté pour détecter le au moins un mot de remplacement sur la base de la prédominance de parties de l'énoncé,

- le moteur de traitement (40) est adapté pour effectuer une comparaison de structure destinée à comparer le au moins un mot de remplacement avec le au moins un mot à remplacer dans l'énoncé précédent, et

- le moteur de traitement (40) est adapté pour réévaluer l'énoncé précédent avec le(s) mot(s) de remplacement à la place du mot à remplacer,

caractérisé par
un moyen (25) destiné à accepter un signal vidéo qui capture visuellement l'utilisateur, dans lequel
le moteur de traitement (40) est adapté pour déterminer en outre les indices prosodiques sur la base de mouvements de l'utilisateur déterminés à partir du signal vidéo.
 



