(11)EP 3 573 059 B1

(12)EUROPEAN PATENT SPECIFICATION

(45)Mention of the grant of the patent:
31.03.2021 Bulletin 2021/13

(21)Application number: 19175883.8

(22)Date of filing:  22.05.2019
(51)Int. Cl.: 
G10L 21/003  (2013.01)
G10L 13/00  (2006.01)

(54)

DIALOGUE ENHANCEMENT BASED ON SYNTHESIZED SPEECH

DIALOGVERBESSERUNG AUF BASIS VON SYNTHETISIERTER SPRACHE

AMÉLIORATION DE DIALOGUE BASÉE SUR LA PAROLE SYNTHÉTISÉE


(84)Designated Contracting States:
AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

(30)Priority: 25.05.2018 US 201862676368 P
25.05.2018 EP 18174310

(43)Date of publication of application:
27.11.2019 Bulletin 2019/48

(73)Proprietor: Dolby Laboratories Licensing Corporation
San Francisco, CA 94103 (US)

(72)Inventors:
  • PORT, Timothy Alan
    McMahons Point, NSW 2060 (AU)
  • NG, Winston Chi Wai
    McMahons Point, NSW 2060 (AU)
  • GERRARD, Mark William
    McMahons Point, NSW 2060 (AU)

(74)Representative: Dolby International AB Patent Group Europe 
Apollo Building, 3E Herikerbergweg 1-35
1101 CN Amsterdam Zuidoost (NL)


(56)References cited:
US-A1- 2016 125 893
  
  • LUC LE MAGOAROU ET AL: "Text-informed audio source separation using nonnegative matrix partial co-factorization", 2013 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP), 22 September 2013 (2013-09-22), pages 1-6, XP055122931, DOI: 10.1109/MLSP.2013.6661995, ISBN: 978-1-47-991180-6
  • DANIEL DZIBELA ET AL: "Hidden-Markov-Model Based Speech Enhancement", arXiv.org, Cornell University Library, 4 July 2017 (2017-07-04), XP080774326
  • HANSEN J H L ET AL: "Text-directed speech enhancement employing phone class parsing and feature map constrained vector quantization", SPEECH COMMUNICATION, Elsevier Science Publishers, Amsterdam, NL, vol. 21, no. 3, April 1997 (1997-04), pages 169-189, XP004059541, ISSN: 0167-6393, DOI: 10.1016/S0167-6393(97)00003-4
  
Note: Within nine months from the publication of the mention of the grant of the European patent, any person may give notice to the European Patent Office of opposition to the European patent granted. Notice of opposition shall be filed in a written reasoned statement. It shall not be deemed to have been filed until the opposition fee has been paid. (Art. 99(1) European Patent Convention).


Description

CROSS-REFERENCE TO RELATED APPLICATIONS



[0001] This application claims priority to the following applications: US provisional application 62/676,368 (reference: D17097USP1), filed 25 May 2018, and EP application 18174310.5 (reference: D17097EP), filed 25 May 2018.

FIELD OF THE INVENTION



[0002] The present invention generally relates to dialogue enhancement in audio signals.

BACKGROUND OF THE INVENTION



[0003] Dialogue enhancement is an important signal processing feature for the hearing impaired, applied in, for example, hearing aids and television sets. Traditionally, it has been done by applying a fixed frequency response curve that emphasizes (amplifies) all content in the frequency range where dialogue is typically present. This type of "single ended" dialogue enhancement may be improved by some type of adaptive approach based on detection and analysis of the audio signal. In a simple case, the application of the fixed frequency response curve can be made conditional on specific criteria (sometimes referred to as "gated" dialogue enhancement). In more complicated implementations, the frequency response curve itself is also adaptive, based on the input audio signal. However, gated dialogue enhancers are difficult to implement in that they typically require a classifier or speech activity detector. Methods based upon time-frequency analysis are difficult to design and are prone to misdetection of speech.

[0004] Another approach to dialogue enhancement is based on metadata included in the audio stream, i.e. information from the encoder side specifying the dialogue content, thereby facilitating enhancement. The metadata can include "flags" indicating when to activate dialogue enhancement, as well as an indication of frequency content, thereby allowing adjustment of the frequency response curve. In other examples, the metadata can be parameters allowing a parametric reconstruction of the dialogue content, which may then be amplified as desired. This approach, including dialogue metadata in the audio stream, generally has high performance. However, it is restricted to dual ended systems, i.e. where the audio stream is preprocessed on the transmitter side, e.g. in an encoder.

[0005] US 2016/125893 (THOMSON LICENSING) discloses separation of speech and background from an audio mixture by using a speech example, generated from a source associated with a speech component in the audio mixture, to guide the separation process.

[0006] There is a need for even further improvement of dialogue enhancement technology.

GENERAL DISCLOSURE OF THE INVENTION



[0007] It is a general objective of the present invention to provide improved performance of dialogue enhancement, in particular single-ended dialogue enhancement in the absence of explicit metadata.

[0008] The invention is defined by the independent claims. Embodiments are given by the dependent claims.

[0009] According to a first aspect of the present invention, this and other objectives are achieved by a method for dialogue enhancement of an audio signal, comprising receiving an audio stream including said audio signal and a text content associated with dialogue occurring in the audio signal, generating parameterized synthesized speech from said text content, and applying dialogue enhancement to the audio signal based on the parameterized synthesized speech.

[0010] According to a second aspect, this and other objectives are achieved by a system for dialogue enhancement of an audio signal, based on a text content associated with dialogue occurring in the audio signal, the system comprising a speech synthesizer for generating a parameterized synthesized speech from the text content, and a dialogue enhancement module for applying dialogue enhancement to the audio signal based on the parameterized synthesized speech.

[0011] The invention is based on the notion that text captions, subtitles, or other forms of text content included in an audio stream and related to dialogue occurring in the audio signal can be used to significantly improve dialogue enhancement on the playback side. More specifically, the text may be used to generate parameterized synthesized speech, which may be used to enhance (amplify) dialogue content.

[0012] The invention may be advantageous in a single ended system (e.g. broadcast or downloaded media) such as in a TV or set-top-box. In a single ended system, the audio stream is typically not specifically preprocessed for dialogue enhancement, and the invention may significantly improve dialogue enhancement on the receiver side.

[0013] As indicated above, the invention is particularly useful in single-ended dialogue enhancement, i.e. where the transmitted audio stream has not been preprocessed to facilitate dialogue enhancement. However, the invention may also be advantageous in a dual-ended system, in which case the step of generating parameterized synthesized speech can be performed on the sender side. For example, the invention could be used to extract a dialogue component from an existing audio mix, for situations when the dialogue stream is transmitted as an independent buffer. Or, the invention could contribute to computation of dialogue coefficients in applications where dialogue is represented with coefficient weights (metadata) transmitted to the receiver (decoder) side.

[0014] In order to align the frequency content of the synthesized speech with the frequency content of the audio signal, it may be advantageous to compare the parameterized synthesized speech with the audio signal to provide an error signal, and to apply feedback control of the parameterized synthesized speech based on the error signal.

[0015] There are several ways of using the synthesized speech in the dialogue enhancement.

[0016] In one embodiment, the dialogue enhancement includes application of a fixed frequency response curve, and the application of the fixed frequency response curve is conditional on the parameterized synthesized speech. With this approach, the frequency response curve is only applied when it can be established that the audio signal includes dialogue. As a consequence, the quality of the dialogue enhancement is improved.

[0017] In another embodiment, the synthesized speech is used as a reference for an adaptive system (for example a minimum mean squared error (MMSE) tracking) to extract an estimate of the dialogue from the original audio signal. Dialogue enhancement is then performed by amplifying the extracted dialogue and mixing it back into the (time aligned) original audio signal. This corresponds in principle to the dialogue enhancement performed using parameterized dialogue encoded in the audio stream, but made possible without metadata.

[0018] In yet another embodiment, time/frequency gains are applied to the audio signal based on the parameterized synthesized speech. The gains will vary with the content of the speech across time and frequency. This corresponds in principle to an application of an adaptive frequency response curve.

[0019] In some embodiments, the text content includes annotations identifying a specific speaker, and the generation of synthesized speech may then be aligned with a model of the identified speaker.

[0020] The text content may further include abbreviations of words present in the dialogue occurring in the audio signal, in which case the method may further include extending the abbreviations into full words which are likely to correspond to the words present in the dialogue.

[0021] A further aspect of the present invention relates to a computer program product comprising computer program code portions which, when executed on a computer processor, enable the computer processor to perform the method of the first aspect of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS



[0022] The present invention will be described in more detail with reference to the appended drawings, showing currently preferred embodiments of the invention.

Figure 1 shows a block diagram of a dialogue enhancement system according to a first embodiment of the invention.

Figure 2 shows a block diagram of a dialogue enhancement system according to a second embodiment of the invention based on dialogue extraction and gain.

Figure 3 shows a block diagram of a dialogue enhancement system according to a third embodiment of the invention based on time/frequency enhancement.

Figure 4 shows an embodiment of the invention using annotations.

Figure 5 is a flow chart of dialogue enhancement according to an embodiment of the invention.


DETAILED DESCRIPTION OF CURRENTLY PREFERRED EMBODIMENTS



[0023] Systems and methods disclosed in the following may be implemented as software, firmware, hardware or a combination thereof. In a hardware implementation, the division of tasks referred to as "stages" in the below description does not necessarily correspond to the division into physical units; to the contrary, one physical component may have multiple functionalities, and one task may be carried out by several physical components in cooperation. Certain components or all components may be implemented as software executed by a digital signal processor or microprocessor, or be implemented as hardware or as an application-specific integrated circuit. Such software may be distributed on computer readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to a person skilled in the art, the term computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Further, it is well known to the skilled person that communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.

[0024] Figure 1 shows a first example of a dialogue enhancement system 10 using text captions 3 included in an audio stream 1 for dialogue enhancement of an audio signal 2. The audio signal can be described as a dialogue component s, mixed with a noise or background component n. The purpose of the dialogue enhancement system 10 is to increase the s/n-ratio.

[0025] The system is connected to receive an audio stream including the audio signal 2 and the text content 3. If the dialogue enhancement system 10 receives the audio signal 2 and text content 3 as a combined audio stream 1, the system may include a decoder 11 for separating the audio signal 2 from the text 3. Alternatively, the system receives the text 3 separately from the audio signal 2.

[0026] The system further includes a speech synthesizer 12 for generating a parameterized synthesized speech ŝ. The synthesizer may be a parametric vocoder or a machine learning algorithm based upon a corpus of training data. Machine learning algorithms may have an advantage in taking a specific speaker into consideration.

[0027] In some embodiments, the synthesizer 12 may have a feedback loop 13 from the audio signal 2 to a summation point 14 forming an error signal e. The error signal e is fed to the synthesizer 12, thereby ensuring that the parameterized synthesized speech ŝ is an estimate of the time and frequency characteristics of the dialogue component s of the audio signal 2.
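As an illustration of such a feedback loop, the sketch below forms the error signal e as the difference of the magnitude spectra of the audio signal and the synthesized speech, and uses it to adapt a single broadband synthesizer gain. This is a deliberately minimal model: the frequency-domain error measure, the scalar gain, the step size, and all signal parameters are assumptions for illustration; a real synthesizer would adapt many parameters.

```python
import numpy as np

def spectral_error(audio_frame, synth_frame, n_fft=256):
    """Error signal e: difference of the magnitude spectra of the audio
    signal and the synthesized speech."""
    A = np.abs(np.fft.rfft(audio_frame, n_fft))
    S = np.abs(np.fft.rfft(synth_frame, n_fft))
    return A - S

def update_gain(gain, error, mu=0.1):
    """One feedback step: nudge the broadband synthesizer gain so that the
    mean spectral error shrinks."""
    return gain + mu * np.mean(error)

# toy signals: the "dialogue" in the audio, and an under-scaled synthetic copy
t = np.arange(256) / 8000.0
dialogue = np.sin(2 * np.pi * 250 * t)   # 250 Hz falls on an exact FFT bin
synth = 0.5 * dialogue                   # synthesizer initially too quiet
gain = 1.0
for _ in range(200):
    e = spectral_error(dialogue, gain * synth)
    gain = update_gain(gain, e)
# the feedback drives gain toward 2.0, matching the synthetic level to the audio
```

In this toy run the synthetic speech starts at half the level of the dialogue, and the loop converges to the gain that aligns the two spectra.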

[0028] The parameterized synthesized speech ŝ is fed to a decision logic 15, configured to output a logic signal indicating whether dialogue enhancement is to be activated. For example, the logic signal can be set to ON when an energy measure of the synthesized speech exceeds a pre-set threshold. The decision logic may also compare the synthesized speech with the audio signal in order to determine a speech similarity score, and set the logic signal to ON only when the score exceeds a pre-set threshold. Especially in the absence of feedback in the synthesizer, such a similarity score can be used to better synchronize the logic signal with the audio signal, and thus further improve the timing of the dialogue enhancement.

[0029] The system further comprises a dialogue enhancement module 16, which is connected to receive the logic signal from the decision logic 15 and to activate dialogue enhancement conditional on this signal. The dialogue enhancement module is here further configured to amplify the audio signal according to a pre-set frequency response curve.
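A minimal sketch of this gated arrangement (decision logic plus fixed-curve enhancement) might look as follows; the speech band, boost amount, energy threshold, and frame length are all assumed values for illustration, not taken from the specification.

```python
import numpy as np

FS = 16000                       # assumed sample rate
N_FFT = 512                      # frame length
SPEECH_BAND = (300.0, 4000.0)    # assumed dialogue band (Hz)
BOOST_DB = 6.0                   # assumed fixed boost

def dialogue_present(synth_frame, threshold=1e-3):
    """Decision logic: ON when the synthesized speech carries energy."""
    return np.mean(synth_frame ** 2) > threshold

def apply_fixed_curve(frame):
    """Crude frequency-domain EQ: boost the speech band by a fixed amount."""
    spec = np.fft.rfft(frame, N_FFT)
    freqs = np.fft.rfftfreq(N_FFT, d=1.0 / FS)
    band = (freqs >= SPEECH_BAND[0]) & (freqs <= SPEECH_BAND[1])
    spec[band] *= 10.0 ** (BOOST_DB / 20.0)
    return np.fft.irfft(spec, N_FFT)

def enhance(frame, synth_frame):
    """Gated enhancement: apply the fixed curve only when dialogue is detected."""
    return apply_fixed_curve(frame) if dialogue_present(synth_frame) else frame
```

When the synthesized speech is silent, the audio frame passes through unchanged, which is the point of the gating: the fixed curve never amplifies background-only content.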

[0030] Figure 2 shows another embodiment of a dialogue enhancement system 20 according to the invention. In the embodiment in figure 2, signals 1-3 and blocks 11-14 are identical to those in figure 1, and will not be further described.

[0031] In figure 2, the parameterized synthesized speech ŝ is fed to a dialogue extraction filter 17, which is configured to extract dialogue content from the audio signal by comparing the audio signal with the parameterized synthesized speech ŝ. The result of the comparison is an estimate s' of the dialogue component s of the audio signal, which may be used for dialogue enhancement.

[0032] The comparison may be based on a minimum mean square error (MMSE) approach, where the coefficients of the filter 17 are selected to minimize the error.

[0033] Words or even phonemes of the synthesized dialogue can be compared individually to a smaller window of the audio signal, for example in the frequency domain.
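One way to realize such an extraction filter is sketched below, under the assumption of a short FIR filter fitted by ordinary least squares, which is the MMSE solution for this linear model; the tap count and all signal parameters are illustrative assumptions.

```python
import numpy as np

def extraction_filter(audio, synth_ref, n_taps=16):
    """Least-squares FIR filter h minimizing ||h * audio - synth_ref||^2,
    i.e. the MMSE solution for this linear model."""
    N = len(audio)
    X = np.zeros((N, n_taps))
    for k in range(n_taps):          # matrix of delayed copies of the audio
        X[k:, k] = audio[:N - k]
    h, *_ = np.linalg.lstsq(X, synth_ref, rcond=None)
    return h

def extract_dialogue(audio, synth_ref, n_taps=16):
    """Estimate s' of the dialogue component by filtering the audio mix."""
    h = extraction_filter(audio, synth_ref, n_taps)
    return np.convolve(audio, h)[:len(audio)]
```

With a tonal background that does not overlap the dialogue in frequency, the fitted filter passes the dialogue and suppresses the background, so the extracted estimate s' is much closer to the true dialogue than the mix is.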

[0034] Finally, the system includes a dialogue enhancement module 18, which is configured to apply a gain to the extracted dialogue s' and mix it into the audio signal. The result is a dialogue enhanced signal αs+n, where α>1.

[0035] Figure 3 shows another embodiment of a dialogue enhancement system 30 according to the invention. In the embodiment in figure 3, signals 1-3 and blocks 11-14 are identical to those in figures 1 and 2, and will not be further described.

[0036] In the system 30 in figure 3, the feedback loop 13 is required and serves to minimize the error e between the dialogue to be enhanced in the audio signal and the parameterized synthesized speech ŝ generated by the synthesizer 12. The feedback loop 13 thus ensures that the parameterized synthesized dialogue ŝ is an estimate of the time and frequency characteristics of the dialogue component s in the audio signal 2.

[0037] In some embodiments, the feedback loop 13 will allow the synthesizer to iterate over parameters that adjust the synthesized speech ŝ. The feedback may adjust features such as (but not limited to) the cadence, pitch, time alignment, and amplitude of the synthesized speech in relation to the dialogue in the audio signal.

[0038] In the system in figure 3, the parameterized dialogue ŝ is fed directly into a dialogue enhancement module 19, to control the application of time/frequency gains on the audio signal. By applying varying time/frequency gains which match the dialogue content in the audio signal, the speech-to-noise (s/n) ratio is increased, and the output is a dialogue enhanced signal αs+n, where α>1. The result is an adaptive dialogue enhancement.
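The time/frequency gain idea can be sketched as follows, using non-overlapping frames and a per-bin gain that rises from 1 toward a cap in bins where the synthesized speech carries energy. The framing, the gain rule, and the cap are illustrative assumptions; a practical implementation would use overlapping windows with overlap-add synthesis.

```python
import numpy as np

def tf_gains(audio, synth, frame=256, max_boost=2.0):
    """Per-frame, per-bin gains: bins where the synthesized speech carries
    energy are boosted toward max_boost; other bins keep unity gain."""
    out = np.zeros_like(audio)
    for start in range(0, len(audio) - frame + 1, frame):
        A = np.fft.rfft(audio[start:start + frame])
        S = np.abs(np.fft.rfft(synth[start:start + frame]))
        g = 1.0 + (max_boost - 1.0) * S / (S.max() + 1e-12)
        out[start:start + frame] = np.fft.irfft(g * A, frame)
    return out
```

With a toy mix of a "dialogue" tone and an off-band background tone, the dialogue bins receive the full boost while the background passes through at unity gain, i.e. the output approximates αs+n with α equal to the cap.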

[0039] Figure 4 shows a further example of a dialogue synthesizer 12', configured to apply a personalized speech model 21a, 21b to increase the accuracy of the synthesized speech ŝ. The synthesizer is further adapted to extract annotations within the text content 3' which indicate a specific speaker. The synthesizer 12' then uses such annotations to select the correct speech model 21a, 21b.

[0040] For example, when receiving the following annotation + text:

Fred: Hello Mary. What are you planning to have for lunch today?

a first speech model 21a, associated with the speaker Fred, will be applied.



[0041] Further, when receiving the following reply:

Mary: I am planning on having a tuna salad sandwich

a second speech model 21b, associated with the speaker Mary, will be applied.



[0042] If there is no pre-stored speech model for a specific annotation, a default model may be applied.
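The annotation handling of figure 4, including the default-model fallback, can be sketched as below; the model registry, its keys, and the "Speaker: text" caption format are hypothetical stand-ins for whatever representation a real synthesizer would use.

```python
import re

# hypothetical registry of pre-stored speech models, keyed by annotation name
SPEAKER_MODELS = {"Fred": "model_fred", "Mary": "model_mary"}
DEFAULT_MODEL = "model_default"

ANNOTATION = re.compile(r"^\s*(\w+)\s*:\s*(.+)$")

def select_model(caption):
    """Parse a 'Speaker: text' caption; fall back to the default model when
    there is no annotation or no pre-stored model for the speaker."""
    m = ANNOTATION.match(caption)
    if not m:
        return DEFAULT_MODEL, caption
    speaker, text = m.group(1), m.group(2)
    return SPEAKER_MODELS.get(speaker, DEFAULT_MODEL), text
```

Captions without an annotation, and annotations naming an unknown speaker, both map to the default model, matching paragraph [0042].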

[0043] With reference to figure 5, a method according to an embodiment of the invention includes in step S1 receiving an audio signal 2 which includes a dialogue content s and noise/background n and receiving text content 3 associated with the dialogue content.

[0044] In step S2, the speech synthesizer 12 provides a parameterized synthesized dialogue corresponding to the text 3, and optionally applies a feedback control based on the audio signal to ensure that the frequency content of the parameterized synthesized dialogue matches that of the audio signal.

[0045] In step S3, the parameterized synthesized dialogue is used to control dialogue enhancement.

[0046] In a system according to the embodiment in figure 1, the speech synthesis in step S2 is used only to make a qualified assessment of when there is dialogue present in the audio signal, and in that case activate a (static) dialogue enhancement.

[0047] In a system according to the embodiment in figure 2, the speech synthesis in step S2 is used to extract an estimated dialogue from the audio signal by comparison to the parameterized synthesized dialogue in the dialogue extraction filter 17, and then, in the dialogue enhancement module 18, applying a gain to this estimated dialogue and mixing it with the original audio signal.

[0048] Finally, in a system according to figure 3, the parameterized synthesized dialogue is used directly by a dialogue enhancement module 19 to apply adaptive time/frequency gains to the audio signal.

[0049] The person skilled in the art realizes that the present invention is by no means limited to the preferred embodiments described above. On the contrary, many modifications and variations are possible within the scope of the appended claims. In particular, there are other ways to use parameterized synthesized speech based on text captions to improve dialogue enhancement of audio associated with this text.

[0050] Further, a dialogue enhancement system according to the invention could be configured to detect abbreviations in the text content, and be configured to extend such abbreviations into full words which are likely to correspond to the words present in the dialogue.
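A minimal sketch of such abbreviation handling, using a hypothetical lookup table (a real system might use context or a language model to choose among candidate expansions):

```python
# hypothetical abbreviation table; real caption conventions vary
ABBREVIATIONS = {
    "Dr.": "Doctor",
    "St.": "Street",
    "approx.": "approximately",
}

def expand_abbreviations(text):
    """Replace known abbreviations with the full words likely to be
    spoken in the dialogue; unknown tokens pass through unchanged."""
    return " ".join(ABBREVIATIONS.get(word, word) for word in text.split())
```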

[0051] Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention, and form different embodiments, as would be understood by those skilled in the art.


Claims

1. A method for dialogue enhancement of an audio signal (2), comprising:

receiving (step S1) said audio signal (2) and a text content (3) associated with dialogue occurring in the audio signal,

generating (step S2) parameterized synthesized speech (ŝ) from said text content, and

applying (step S3) dialogue enhancement to said audio signal based on said parameterized synthesized speech (ŝ),

wherein the text content includes annotations identifying a specific speaker, and wherein generation of the synthesized speech is aligned with a model of the identified speaker.


 
2. The method according to claim 1, further comprising:

comparing the parameterized synthesized speech with the audio signal to provide an error signal, and

applying feedback control of the parameterized synthesized speech based on the error signal, in order to align the frequency content of the synthesized speech with the frequency content of the audio signal.


 
3. The method according to claim 1 or 2, wherein the step of applying dialogue enhancement is conditional on a comparison between the audio signal and the parameterized synthesized speech (ŝ).
 
4. The method according to claim 3, wherein the applying dialogue enhancement includes application of a fixed frequency response curve.
 
5. The method according to one of claims 1 - 3, further comprising:
applying a time/frequency gain to the audio signal based on the parameterized synthesized speech.
 
6. The method according to one of claims 1 - 3, further comprising:

applying a dialogue extraction filter to the audio signal to obtain an estimated dialogue, wherein said dialogue extraction filter is determined by comparing the extracted dialogue component with said parameterized synthesized speech and minimizing an error,

applying a gain to the estimated dialogue to obtain an amplified dialogue component, and

mixing the amplified dialogue component with the audio signal.


 
7. The method according to claim 6, wherein the error is a minimum mean square error (MMSE).
 
8. The method according to any one of the preceding claims, wherein said text content includes abbreviations of words present in the dialogue occurring in the audio signal, the method further including:
extending the abbreviations into full words which are likely to correspond to the words present in the dialogue.
 
9. The method according to any one of the preceding claims, wherein the step of generating parameterized synthesized speech is performed on a sender side of a dual-ended system.
 
10. The method according to claim 9, further comprising extracting a dialogue component from an existing audio mix, and including said dialogue component in a transmitted audio bit stream.
 
11. The method according to claim 9, further comprising computing dialogue coefficients representing dialogue, and including said dialogue coefficients in a transmitted audio bit stream.
 
12. A system for dialogue enhancement of an audio signal (2), based on a text content (3) associated with dialogue occurring in the audio signal, the system comprising:

a speech synthesizer (12, 22) for generating a parameterized synthesized speech (ŝ) from said text content, and

a dialogue enhancement module (16, 26) for applying dialogue enhancement to said audio signal based on said parameterized synthesized speech (ŝ),

wherein the text content includes annotations identifying a specific speaker, and wherein generation of the synthesized speech by the speech synthesizer is aligned with a model of the identified speaker.


 
13. The system according to claim 12, further comprising:

a feedback loop (13, 23) for feedback of the parameterized synthesized speech, and

a summation point (14, 24) for comparing the parameterized synthesized speech with the audio signal to provide an error signal,

wherein the synthesizer is configured to apply feedback control of the parameterized synthesized speech based on the error signal, in order to align the frequency content of the synthesized speech with the frequency content of the audio signal.


 
14. The system according to any one of claims 12-13, implemented in a single ended receiver.
 
15. A computer program product comprising computer program code portions which, when executed on a computer processor, enable the computer processor to perform the steps of the method according to one of claims 1 - 11.
 


Ansprüche

1. Verfahren zur Dialogverbesserung eines Audiosignals (2), umfassend:

Empfangen (Schritt S1) des Audiosignals (2) und eines Textinhalt (3), der dem im Audiosignal stattfindenden Dialog zugeordnet ist,

Erzeugen (Schritt S2) von parametrisierter synthetisierter Sprache (ŝ) aus dem Textinhalt, und

Anwenden (Schritt S3) von Dialogverbesserung auf das Audiosignal auf der Grundlage der parametrisierten synthetisierten Sprache (ŝ),

wobei der Textinhalt Anmerkungen einschließt, die einen konkreten Sprecher identifizieren, und wobei das Erzeugen der synthetisierten Sprache an einem Modell des identifizierten Sprechers ausgerichtet ist.


 
2. Verfahren nach Anspruch 1, weiter umfassend:

Vergleichen der parametrisierten synthetisierten Sprache mit dem Audiosignal, um ein Fehlersignal bereitzustellen, und

Anwenden von Rückmeldungssteuerung der parametrisierten synthetisierten Sprache auf der Grundlage des Fehlersignals, um den Frequenzinhalt der synthetisierten Sprache an dem Frequenzinhalt des Audiosignals auszurichten.


 
3. Verfahren nach Anspruch 1 oder 2, wobei der Schritt des Anwendens von Dialogverbesserung bedingt ist an einen Vergleich zwischen dem Audiosignal und der parametrisierten synthetisierten Sprache (ŝ).
 
4. Verfahren nach Anspruch 3, wobei das Anwenden von Dialogverbesserung Anwenden einer Reaktionskurve mit fixer Frequenz einschließt.
 
5. Verfahren nach einem der Ansprüche 1 - 3, weiter umfassend:
Anwenden einer Zeit-/Frequenzverstärkung auf das Audiosignal auf der Grundlage der parametrisierten synthetisierten Sprache.
 
6. Verfahren nach einem der Ansprüche 1 - 3, weiter umfassend:

Anwenden eines Dialogextraktionsfilters auf das Audiosignal, um einen geschätzten Dialog zu erhalten, wobei der Dialogextraktionsfilter bestimmt ist durch Vergleichen der extrahierten Dialogkomponente mit der parametrisierten synthetisierten Sprache und Minimieren eines Fehlers,

Anwenden einer Verstärkung auf den geschätzten Dialog, um eine verstärkte Dialogkomponente zu erhalten, und

Mischen der verstärkten Dialogkomponente mit dem Audiosignal.


 
7. Verfahren nach Anspruch 6, wobei der Fehler ein Mindestmittelquadratfehler (MMSE) ist.
 
8. Verfahren nach einem der vorstehenden Ansprüche, wobei der Textinhalt Abkürzungen von Wörtern einschließt, die in dem Dialog vorhanden sind, der im Audiosignal stattfindet, wobei das Verfahren weiter einschließt:
Erweitern der Abkürzungen in komplette Wörter, von denen es wahrscheinlich ist, dass sie den im Dialog vorhandenen Wörtern entsprechen.
 
9. Verfahren nach einem der vorstehenden Ansprüche, wobei der Schritt des Erzeugens von parametrisierter synthetisierter Sprache an einer Senderseite eines Systems mit zwei Enden ausgeführt wird.
 
10. Verfahren nach Anspruch 9, weiter umfassend Extrahieren einer Dialogkomponente aus einem bestehenden Audiomix, und Einschließen der Dialogkomponente in einen übertragenen Audiobitstream.
 
11. Verfahren nach Anspruch 9, weiter umfassend Berechnen von Dialogkoeffizienten, und Einschließen der Dialogkoeffizienten in einen übertragenen Audiobitstream.
 
12. System zur Dialogverbesserung eines Audiosignals (2), auf der Grundlage eines Textinhalts (3), der dem im Audiosignal stattfindenden Dialog zugeordnet ist, wobei das System umfasst:

einen Sprachsynthesizer (12, 22) zum Erzeugen einer parametrisierten synthetisierten Sprache (ŝ) aus dem Textinhalt, und

ein Dialogverbesserungsmodul (16, 26) zum Anwenden von Dialogverbesserung auf das Audiosignal auf der Grundlage der parametrisierten synthetisierten Sprache (ŝ),

wobei der Textinhalt Anmerkungen einschließt, die einen konkreten Sprecher identifizieren, und wobei das Erzeugen der synthetisierten Sprache durch den Sprachsynthesizer an einem Modell des identifizierten Sprechers ausgerichtet ist.


 
13. System nach Anspruch 12, weiter umfassend:

eine Rückmeldungsschleife (13, 23) zum Rückmelden der parametrisierten synthetisierten Sprache, und

einen Summationspunkt (14, 24) zum Vergleichen der parametrisierten synthetisierten Sprache mit dem Audiosignal, um ein Fehlersignal bereitzustellen,

wobei der Synthesizer ausgelegt ist, um Rückmeldungssteuerung auf die parametrisierte synthetisierte Sprache auf der Grundlage des Fehlersignals anzuwenden, um den Frequenzinhalt der synthetisierten Sprache an dem Frequenzinhalt des Audiosignals auszurichten.


 
14. System nach einem der Ansprüche 12 - 13, implementiert in einem Empfänger mit einem Ende.
 
15. Computerprogrammprodukt, umfassend Computerprogrammcodeabschnitte, die, wenn sie auf einem Computerprozessor ausgeführt werden, es dem Computerprozessor ermöglichen die Schritte des Verfahrens nach einem der Ansprüche 1 - 11 auszuführen.
 


Revendications

1. Procédé pour une amélioration de dialogue d'un signal audio (2), comprenant :

une réception (étape S1) dudit signal audio (2) et d'un contenu textuel (3) associé à un dialogue se trouvant dans le signal audio,

une génération (étape S2) d'une voix synthétisée paramétrée (ŝ) à partir dudit contenu textuel, et

une application (étape S3) d'une amélioration de dialogue sur ledit signal audio sur la base de ladite voix synthétisée paramétrée (ŝ),

dans lequel le contenu textuel inclut des annotations identifiant un locuteur spécifique, et dans lequel la génération de la voix synthétisée est alignée sur un modèle du locuteur identifié.


 
2. Procédé selon la revendication 1, comprenant en outre :

une comparaison de la voix synthétisée paramétrée avec le signal audio pour fournir un signal d'erreur, et

une application d'une commande par rétroaction de la voix synthétisée paramétrée sur la base du signal d'erreur, afin d'aligner le contenu fréquentiel de la voix synthétisée sur le contenu fréquentiel du signal audio.


 
3. Procédé selon la revendication 1 ou 2, dans lequel l'étape d'application d'une amélioration de dialogue dépend d'une comparaison entre le signal audio et la voix synthétisée paramétrée (ŝ).
 
4. Procédé selon la revendication 3, dans lequel l'application de l'amélioration de dialogue inclut une application d'une courbe de réponse en fréquence fixe.
 
5. Procédé selon l'une des revendications 1-3, comprenant en outre :
une application d'un gain temps/fréquence sur le signal audio sur la base de la voix synthétisée paramétrée.
 
6. Procédé selon l'une des revendications 1-3, comprenant en outre :

une application d'un filtre d'extraction de dialogue sur le signal audio pour obtenir un dialogue estimé, dans lequel ledit filtre d'extraction de dialogue est déterminé en comparant la composante de dialogue extraite avec ladite voix synthétisée paramétrée et en réduisant au minimum une erreur,

une application d'un gain sur le dialogue estimé pour obtenir une composante de dialogue amplifiée, et

un mélange de la composante de dialogue amplifiée au signal audio.


 
7. Procédé selon la revendication 6, dans lequel l'erreur est une erreur quadratique moyenne minimum (MMSE).
 
8. Procédé selon l'une quelconque des revendications précédentes, dans lequel ledit contenu textuel inclut des abréviations de mots présents dans le dialogue se trouvant dans le signal audio, le procédé incluant en outre :
une extension des abréviations en mots complets qui sont susceptibles de correspondre aux mots présents dans le dialogue.
 
9. Procédé selon l'une quelconque des revendications précédentes, dans lequel l'étape de génération de la voix synthétisée paramétrée est mise en œuvre sur un côté expéditeur d'un système à deux extrémités.
 
10. Procédé selon la revendication 9, comprenant en outre une extraction d'une composante de dialogue d'un mélange audio existant, et une inclusion de ladite composante de dialogue dans un train binaire audio transmis.
 
11. Procédé selon la revendication 9, comprenant en outre un calcul de coefficients de dialogue représentant un dialogue, et une inclusion desdits coefficients de dialogue dans un train binaire audio transmis.
 
12. Système pour une amélioration de dialogue d'un signal audio (2), sur la base d'un contenu textuel (3) associé à un dialogue se trouvant dans le signal audio, le système comprenant :

un synthétiseur vocal (12, 22) permettant de générer une voix synthétisée paramétrée (ŝ) à partir dudit contenu textuel, et

un module d'amélioration de dialogue (16, 26) permettant d'appliquer une amélioration de dialogue sur ledit signal audio sur la base de ladite voix synthétisée paramétrée (ŝ),

dans lequel le contenu textuel inclut des annotations identifiant un locuteur spécifique, et dans lequel la génération de la voix synthétisée par le synthétiseur vocal est alignée sur un modèle du locuteur identifié.


 
13. Système selon la revendication 12, comprenant en outre :

une boucle de rétroaction (13, 23) pour une rétroaction de la voix synthétisée paramétrée, et

un point de sommation (14, 24) permettant de comparer la voix synthétisée paramétrée avec le signal audio pour fournir un signal d'erreur,

dans lequel le synthétiseur est configuré pour appliquer une commande par rétroaction de la voix synthétisée paramétrée sur la base du signal d'erreur, afin d'aligner le contenu fréquentiel de la voix synthétisée sur le contenu fréquentiel du signal audio.


 
14. Système selon l'une quelconque des revendications 12-13, mis en œuvre dans un récepteur à une seule extrémité.
 
15. Produit de programme d'ordinateur comprenant des parties de code de programme d'ordinateur qui, lorsqu'elles sont exécutées sur un processeur d'ordinateur, permettent au processeur d'ordinateur de mettre en œuvre les étapes du procédé selon l'une des revendications 1-11.
 




Drawing

[Figures 1 to 5, described above, are not reproduced in this text version.]
REFERENCES CITED IN THE DESCRIPTION



This list of references cited by the applicant is for the reader's convenience only. It does not form part of the European patent document. Even though great care has been taken in compiling the references, errors or omissions cannot be excluded and the EPO disclaims all liability in this regard.

Patent documents cited in the description

• US 2016125893 A1 [0005]