FIELD
[0001] The present technology relates to a method and system for text-to-speech synthesis.
In particular, methods and systems for outputting synthetic speech having one or more
selected speech attribute are provided.
BACKGROUND
[0002] In text-to-speech (TTS) systems, a portion of text (or a text file) is converted
into audio speech (or an audio speech file). Such systems are used in a wide variety
of applications such as electronic games, e-book readers, e-mail readers, satellite
navigation, automated telephone systems, and automated warning systems. For example,
some instant messaging (IM) systems use TTS synthesis to convert text chat to speech.
This can be very useful for people who have difficulty reading, people who are driving,
or people who simply do not want to take their eyes off whatever they are doing to
change focus to the IM window.
[0003] A problem with TTS synthesis is that the synthesized speech can lose attributes such
as emotions, vocal expressiveness, and the speaker's identity. Often all synthesized
voices will sound the same. There is a continuing need to make systems sound more
like a natural human voice.
[0004] U.S. Patent No. 8,135,591 issued on March 13, 2012 describes a method and system for training a text-to-speech synthesis system for
use in speech synthesis. The method includes generating a speech database of audio
files comprising domain-specific voices having various prosodies, and training a text-to-speech
synthesis system using the speech database by selecting audio segments having a prosody
based on at least one dialog state. The system includes a processor, a speech database
of audio files, and modules for implementing the method.
[0005] U.S. Patent Application Publication No. 2013/0262119 published on October 3, 2013 teaches a text-to-speech method configured to output speech having a selected speaker
voice and a selected speaker attribute. The method includes inputting text; dividing
the inputted text into a sequence of acoustic units; selecting a speaker for the inputted
text; selecting a speaker attribute for the inputted text; converting the sequence
of acoustic units to a sequence of speech vectors using an acoustic model; and outputting
the sequence of speech vectors as audio with the selected speaker voice and the selected
speaker attribute. The acoustic model includes a first set of parameters relating
to speaker voice and a second set of parameters relating to speaker attributes, which
parameters do not overlap. Selecting a speaker voice includes selecting parameters
from the first set of parameters and selecting the speaker attribute includes selecting
the parameters from the second set of parameters. The acoustic model is trained using
a cluster adaptive training method (CAT) where the speakers and speaker attributes
are accommodated by applying weights to model parameters which have been arranged
into clusters, a decision tree being constructed for each cluster. Embodiments where
the acoustic model is a Hidden Markov Model (HMM) are described.
[0006] U.S. Patent No. 8,886,537 issued on November 11, 2014 describes a method and system for text-to-speech synthesis with personalized voice.
The method includes receiving an incidental audio input of speech in the form of an
audio communication from an input speaker and generating a voice dataset for the input
speaker. A text input is received at the same device as the audio input and the text
is synthesized from the text input to synthesized speech using a voice dataset to
personalize the synthesized speech to sound like the input speaker. In addition, the
method includes analyzing the text for expression and adding the expression to the
synthesized speech. The audio communication may be part of a video communication and
the audio input may have an associated visual input of an image of the input speaker.
The synthesis from text may include providing a synthesized image personalized to
look like the image of the input speaker with expressions added from the visual input.
SUMMARY
[0007] It is thus an object of the present technology to ameliorate at least some of the
inconveniences present in the prior art.
[0008] In one aspect, implementations of the present technology provide a method for text-to-speech
synthesis (TTS) configured to output a synthetic speech having a selected speech attribute.
The method is executable at a computing device. The method first comprises the following
steps for training an acoustic space model: a) receiving a training text data and
a respective training acoustic data, the respective training acoustic data being a
spoken representation of the training text data, the respective training acoustic
data being associated with one or more defined speech attribute; b) extracting one
or more of phonetic and linguistic features of the training text data; c) extracting
vocoder features of the respective training acoustic data, and correlating the vocoder
features with the phonetic and linguistic features of the training text data and with
the one or more defined speech attribute, thereby generating a set of training data
of speech attributes; and d) using a deep neural network (DNN) to determine interdependency
factors between the speech attributes in the training data. The DNN generates a single,
continuous acoustic space model based on the interdependency factors, the acoustic
space model thereby taking into account a plurality of interdependent speech attributes
and allowing for modelling of a continuous spectrum of the interdependent speech attributes.
The method further comprises the following steps for TTS using the acoustic space
model: e) receiving a text; f) receiving a selection of a speech attribute, the speech
attribute having a selected attribute weight; g) converting the text into synthetic
speech using the acoustic space model, the synthetic speech having the selected speech
attribute; and h) outputting the synthetic speech as audio having the selected speech
attribute.
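Purely by way of illustration, and without limiting the method, the two phases can be pictured as a training interface followed by a synthesis interface. The Python sketch below is hypothetical: the class, field and method names do not appear in the present technology, and the bodies of the training and synthesis steps are deliberately left open.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingExample:
    text: str                  # training text data
    waveform: list             # respective training acoustic data (audio samples)
    attributes: dict = field(default_factory=dict)  # defined speech attributes,
                                                    # e.g. {"gender": "female", "emotion": "happy"}


class AcousticSpaceModel:
    """Single, continuous model over interdependent speech attributes (sketch only)."""

    def train(self, examples):
        # Steps a) to d): extract phonetic/linguistic and vocoder features,
        # correlate them with the defined speech attributes, and fit a DNN
        # that captures the interdependency factors between attributes.
        ...

    def synthesize(self, text, attribute_weights):
        # Steps e) to h): convert the text into synthetic speech conditioned
        # on the selected attribute weights, to be outputted as audio.
        ...
```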
[0009] In other words, in some embodiments, the acoustic space model obtained by the method
of steps a) to d) allows the conversion of text into an audio output of synthetic
speech having a selected speech attribute with a selected attribute weight. In some
embodiments, the selected speech attribute may comprise a combination of different
categories of speech attribute, for example "gender", "emotion" and "language" such
as "happy female in the English language". In some embodiments, the selected speech
attribute may comprise different combinations within a speech attribute category each
with its own weight, for example "female in English, 50% excited, 50% happy". Advantageously,
in some embodiments, an audio output of synthetic speech can be obtained having a
particular selected speech attribute and selected attribute weight that is a combination
which the DNN was not necessarily trained on.
[0010] In some embodiments, extracting one or more of phonetic and linguistic features of
the training text data comprises extracting features relating to any of the sounds,
structure and meaning of the training text data, such as by parsing. In some embodiments,
extracting one or more of phonetic and linguistic features of the training text data
comprises dividing the training text data into phones. In some embodiments, extracting
vocoder features of the respective training acoustic data comprises separating the
respective acoustic data into characteristic elements. The characteristic elements
can include one or more of a frequency component, an amplitude, a waveform,
and the like. In some embodiments, extracting vocoder features of the respective training
acoustic data comprises dimensionality reduction of the waveform of the respective
training acoustic data.
[0011] One or more speech attribute may be defined during the training steps. Similarly,
one or more speech attribute may be selected during the conversion/speech synthesis
steps. Non-limiting examples of speech attributes include emotions, genders, language,
intonations, accents, speaking styles, dynamics, and speaker identities. In some embodiments,
in the training phase, the one or more defined speech attribute comprises at least
an emotion, a gender and a language. In some embodiments, in the training phase, the
one or more defined speech attribute comprises at least one of an emotion, a gender
and a language. In some embodiments, the training data includes training text data
and respective training acoustic data in different languages. In some embodiments,
the selected speech attribute includes a language thereby allowing the synthetic speech
to be provided in a specified language. In some embodiments, two or more speech attributes
are defined or selected. Each selected speech attribute has a respective selected
attribute weight. The selected speech attribute can be, for example, a combination
of different emotions each with its respective weight e.g. 20% sad, 80% angry. In
embodiments where two or more speech attributes are selected, the outputted synthetic
speech has each of the two or more selected speech attributes.
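As a minimal sketch of how such a weighted selection might be represented (the helper below is hypothetical, and the weights are normalized only for the sake of the example):

```python
# Hypothetical helper: represent a selection of two or more speech attributes,
# each with its respective weight (e.g. 20% sad, 80% angry), normalized to sum to 1.
def normalize_selection(selection):
    total = sum(selection.values())
    if total <= 0:
        raise ValueError("at least one positive attribute weight is required")
    return {attribute: weight / total for attribute, weight in selection.items()}


emotion_selection = normalize_selection({"sad": 20, "angry": 80})
print(emotion_selection)  # {'sad': 0.2, 'angry': 0.8}
```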
[0012] In some embodiments, the method further comprises the steps of: receiving a second
text; receiving a second selected speech attribute, the second selected speech attribute
having a second selected attribute weight; converting the second text into a second
synthetic speech using the acoustic space model, the second synthetic speech having
the second selected speech attribute; and outputting the second synthetic speech as
audio having the second selected speech attribute.
[0013] In another aspect, implementations of the present technology provide a server. The
server comprises an information storage medium; a processor operationally connected
to the information storage medium, the processor configured to store objects on the
information storage medium. The processor is further configured to: a) receive a training
text data and a respective training acoustic data, the respective training acoustic
data being a spoken representation of the training text data, the respective training
acoustic data being associated with one or more defined speech attribute; b) extract
one or more of phonetic and linguistic features of the training text data; c) extract
vocoder features of the respective training acoustic data, and correlate the vocoder
features with the phonetic and linguistic features of the training text data and with
the one or more defined speech attribute, thereby generating a set of training data
of speech attributes; and d) use a deep neural network (DNN) to determine interdependency
factors between the speech attributes in the training data, the DNN generating a single,
continuous acoustic space model based on the interdependency factors, the acoustic
space model thereby taking into account a plurality of interdependent speech attributes
and allowing for modelling of a continuous spectrum of the interdependent speech attributes.
[0014] The processor is further configured to: e) receive a text; f) receive a selection
of a speech attribute, the speech attribute having a selected attribute weight; g)
convert the text into synthetic speech using the acoustic space model, the synthetic
speech having the selected speech attribute; and h) output the synthetic speech as
audio having the selected speech attribute.
[0015] In the context of the present specification, unless specifically provided otherwise,
a "server" is a computer program that is running on appropriate hardware and is capable
of receiving requests (e.g., from client devices) over a network, and carrying out
those requests, or causing those requests to be carried out. The hardware may be one
physical computer or one physical computer system, but neither is required to be the
case with respect to the present technology. In the present context, the use of the
expression a "server" is not intended to mean that every task (e.g., received instructions
or requests) or any particular task will have been received, carried out, or caused
to be carried out, by the same server (i.e., the same software and/or hardware); it
is intended to mean that any number of software elements or hardware devices may be
involved in receiving/sending, carrying out or causing to be carried out any task
or request, or the consequences of any task or request; and all of this software and
hardware may be one server or multiple servers, both of which are included within
the expression "at least one server".
[0016] In the context of the present specification, unless specifically provided otherwise,
a "client device" is an electronic device associated with a user and includes any
computer hardware that is capable of running software appropriate to the relevant
task at hand. Thus, some (non-limiting) examples of client devices include personal
computers (desktops, laptops, netbooks, etc.), smartphones, and tablets, as well as
network equipment such as routers, switches, and gateways. It should be noted that
a computing device acting as a client device in the present context is not precluded
from acting as a server to other client devices. The use of the expression "a client
device" does not preclude multiple client devices being used in receiving/sending,
carrying out or causing to be carried out any task or request, or the consequences
of any task or request, or steps of any method described herein.
[0017] In the context of the present specification, unless specifically provided otherwise,
a "computing device" is any electronic device capable of running software appropriate
to the relevant task at hand. A computing device may be a server, a client device,
etc.
[0018] In the context of the present specification, unless specifically provided otherwise,
a "database" is any structured collection of data, irrespective of its particular
structure, the database management software, or the computer hardware on which the
data is stored, implemented or otherwise rendered available for use. A database may
reside on the same hardware as the process that stores or makes use of the information
stored in the database or it may reside on separate hardware, such as a dedicated
server or plurality of servers.
[0019] In the context of the present specification, unless specifically provided otherwise,
the expression "information" includes information of any nature or kind whatsoever,
comprising information capable of being stored in a database. Thus information includes,
but is not limited to audiovisual works (photos, movies, sound records, presentations
etc.), data (map data, location data, numerical data, etc.), text (opinions, comments,
questions, messages, etc.), documents, spreadsheets, etc.
[0020] In the context of the present specification, unless specifically provided otherwise,
the expression "component" is meant to include software (appropriate to a particular
hardware context) that is both necessary and sufficient to achieve the specific function(s)
being referenced.
[0021] In the context of the present specification, unless specifically provided otherwise,
the expression "information storage medium" is intended to include media of any nature
and kind whatsoever, including RAM, ROM, disks (CD-ROMs, DVDs, floppy disks, hard
drives, etc.), USB keys, solid-state drives, tape drives, etc.
[0022] In the context of the present specification, unless specifically provided otherwise,
the expression "vocoder" is meant to refer to an audio processor that analyzes speech
input by determining the characteristic elements (such as frequency components, noise
components, etc.) of an audio signal. In some cases, a vocoder can be used to synthesize
a new audio output based on an existing audio sample by adding the characteristic
elements to the existing audio sample. In other words, a vocoder can use the frequency
spectrum of one audio sample to modulate that of another audio sample. "Vocoder
features" refer to the characteristic elements of an audio sample determined by a
vocoder, e.g., the characteristics of the waveform of an audio sample such as frequency,
etc.
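As a rough, non-limiting sketch of what such characteristic elements can look like in practice, the snippet below frames a waveform and computes a per-frame log-magnitude spectrum with NumPy; an actual vocoder would typically also extract elements such as fundamental frequency and aperiodicity.

```python
import numpy as np


def simple_vocoder_features(waveform, frame_length=512, hop=256):
    """Sketch only: per-frame log-magnitude spectra standing in for vocoder features."""
    frames = []
    for start in range(0, len(waveform) - frame_length + 1, hop):
        frame = waveform[start:start + frame_length] * np.hanning(frame_length)
        spectrum = np.abs(np.fft.rfft(frame))          # frequency components of the frame
        frames.append(np.log(spectrum + 1e-8))
    return np.array(frames)                            # shape: (num_frames, frame_length // 2 + 1)


# Example on one second of a synthetic 220 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
print(simple_vocoder_features(np.sin(2 * np.pi * 220 * t)).shape)
```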
[0023] In the context of the present specification, unless specifically provided otherwise,
the expression "text" is meant to refer to a human-readable sequence of characters
and the words they form. A text can generally be encoded into computer-readable formats
such as ASCII. A text is generally distinguished from non-character encoded data,
such as graphic images in the form of bitmaps and program code. A text may have many
different forms, for example it may be a written or printed work such as a book or
a document, an email message, a text message (e.g., sent using an instant messaging
system), etc.
[0024] In the context of the present specification, unless specifically provided otherwise,
the expression "acoustic" is meant to refer to sound energy in the form of waves having
a frequency, the frequency generally being in the human hearing range. "Audio" refers
to sound within the acoustic range available to humans. "Speech" and "synthetic speech"
are generally used herein to refer to audio or acoustic, e.g., spoken, representations
of text. Acoustic and audio data may have many different forms, for example they may
be a recording, a song, etc. Acoustic and audio data may be stored in a file, such
as an MP3 file, which file may be compressed for storage or for faster transmission.
[0025] In the context of the present specification, unless specifically provided otherwise,
the expression "speech attribute" is meant to refer to a voice characteristic such
as emotion, speaking style, accent, language, identity of speaker, intonation, dynamic,
or speaker trait (gender, age, etc.). For example, a speech attribute may be angry,
sad, happy, neutral emotion, nervous, commanding, male, female, old, young, gravelly,
smooth, rushed, fast, loud, soft, a particular regional or foreign accent, in a particular
language, and the like. Many speech attributes are possible. Further, a speech attribute
may be variable over a continuous range, for example intermediate between "sad" and
"happy" or "sad" and "angry".
[0026] In the context of the present specification, unless specifically provided otherwise,
the expression "deep neural network" is meant to refer to a system of programs and
data structures designed to approximate the operation of the human brain. Deep neural
networks generally comprise a series of algorithms that can identify underlying relationships
and connections in a set of data using a process that mimics the way the human brain
operates. The organization and weights of the connections in the set of data generally
determine the output. A deep neural network is thus generally exposed to all input
data or parameters at once, in their entirety, and is therefore capable of modelling
their interdependencies. In contrast to machine learning algorithms that use decision
trees and are therefore constrained by their limitations, deep neural networks are
unconstrained and therefore suited for modelling interdependencies.
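For illustration only, a deep neural network in this sense can be as small as a stack of fully connected layers; the toy NumPy forward pass below (with random, untrained weights) receives all of its input features in a single vector, which is what allows such a network to model the interdependencies between them.

```python
import numpy as np

rng = np.random.default_rng(0)


def forward(x, layers):
    """Toy fully connected network: every layer sees all of its inputs at once."""
    for weights, bias in layers[:-1]:
        x = np.maximum(0.0, x @ weights + bias)   # ReLU hidden layers
    weights, bias = layers[-1]
    return x @ weights + bias                     # linear output layer


layers = [(rng.standard_normal((10, 32)), np.zeros(32)),
          (rng.standard_normal((32, 32)), np.zeros(32)),
          (rng.standard_normal((32, 5)), np.zeros(5))]

x = rng.standard_normal(10)       # e.g. linguistic features plus attribute weights
print(forward(x, layers).shape)   # (5,), e.g. a small vector of vocoder features
```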
[0027] In the context of the present specification, unless specifically provided otherwise,
the words "first", "second", "third", etc. have been used as adjectives only for the
purpose of allowing for distinction between the nouns that they modify from one another,
and not for the purpose of describing any particular relationship between those nouns.
Thus, for example, it should be understood that, the use of the terms "first server"
and "third server" is not intended to imply any particular order, type, chronology,
hierarchy or ranking (for example) of/between the server, nor is their use (by itself)
intended to imply that any "second server" must necessarily exist in any given situation.
Further, as is discussed herein in other contexts, reference to a "first" element
and a "second" element does not preclude the two elements from being the same actual
real-world element. Thus, for example, in some instances, a "first" server and a "second"
server may be the same software and/or hardware, in other cases they may be different
software and/or hardware.
[0028] Implementations of the present technology each have at least one of the above-mentioned
object and/or aspects, but do not necessarily have all of them. It should be understood
that some aspects of the present technology that have resulted from attempting to
attain the above-mentioned object may not satisfy this object and/or may satisfy other
objects not specifically recited herein.
[0029] Additional and/or alternative features, aspects and advantages of implementations
of the present technology will become apparent from the following description, the
accompanying drawings and the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0030] For a better understanding of the present technology, as well as other aspects and
further features thereof, reference is made to the following description which is
to be used in conjunction with the accompanying drawings, where:
Figure 1 is a schematic diagram of a system implemented in accordance with a non-limiting
embodiment of the present technology.
Figure 2 depicts a block-diagram of a method executable within the system of Figure
1 and implemented in accordance with non-limiting embodiments of the present technology.
Figure 3 depicts a schematic diagram of training an acoustic space model from source
text and acoustic data in accordance with non-limiting embodiments of the present
technology.
Figure 4 depicts a schematic diagram of text-to-speech synthesis in accordance with
non-limiting embodiments of the present technology.
DETAILED DESCRIPTION
[0031] Referring to Figure 1, there is shown a diagram of a system 100, the system 100 being
suitable for implementing non-limiting embodiments of the present technology. It is
to be expressly understood that the system 100 is depicted merely as an illustrative
implementation of the present technology. Thus, the description thereof that follows
is intended to be only a description of illustrative examples of the present technology.
This description is not intended to define the scope or set forth the bounds of the
present technology. In some cases, what are believed to be helpful examples of modifications
to the system 100 may also be set forth below. This is done merely as an aid to understanding,
and, again, not to define the scope or set forth the bounds of the present technology.
These modifications are not an exhaustive list, and, as a person skilled in the art
would understand, other modifications are likely possible. Further, where this has
not been done (i.e., where no examples of modifications have been set forth), it should
not be interpreted that no modifications are possible and/or that what is described
is the sole manner of implementing that element of the present technology. As a person
skilled in the art would understand, this is likely not the case. In addition it is
to be understood that the system 100 may provide in certain instances simple implementations
of the present technology, and that where such is the case they have been presented
in this manner as an aid to understanding. As persons skilled in the art would understand,
various implementations of the present technology may be of a greater complexity.
[0032] System 100 includes a server 102. The server 102 may be implemented as a conventional
computer server. In an example of an embodiment of the present technology, the server
102 may be implemented as a Dell™ PowerEdge™ Server running the Microsoft™ Windows
Server™ operating system. Needless to say, the server 102 may be implemented in any
other suitable hardware and/or software and/or firmware or a combination thereof.
In the depicted non-limiting embodiment of the present technology, the server 102
is a single server. In alternative non-limiting embodiments of the present technology,
the functionality of the server 102 may be distributed and may be implemented via
multiple servers.
[0033] In some implementations of the present technology, the server 102 can be under control
and/or management of a provider of an application using text-to-speech (TTS) synthesis,
e.g., an electronic game, an e-book reader, an e-mail reader, a satellite navigation
system, an automated telephone system, an automated warning system, an instant messaging
system, and the like. In alternative implementations the server 102 can access an
application using TTS synthesis provided by a third-party provider. In yet other implementations,
the server 102 can be under control and/or management of, or can access, a provider
of TTS services and other services incorporating TTS.
[0034] The server 102 includes an information storage medium 104 that may be used by the
server 102. Generally, the information storage medium 104 may be implemented as a
medium of any nature and kind whatsoever, including RAM, ROM, disks (CD-ROMs, DVDs,
floppy disks, hard drives, etc.), USB keys, solid-state drives, tape drives, etc.
and also the combinations thereof.
[0035] The implementations of the server 102 are well known in the art. Suffice it to
state that the server 102 comprises
inter alia a network communication interface 109 (such as a modem, a network card and the like)
for two-way communication over a communication network 110; and a processor 108 coupled
to the network communication interface 109 and the information storage medium 104,
the processor 108 being configured to execute various routines, including those described
herein below. To that end the processor 108 may have access to computer readable instructions
stored on the information storage medium 104, which instructions, when executed, cause
the processor 108 to execute the various routines described herein.
[0036] In some non-limiting embodiments of the present technology, the communication network
110 can be implemented as the Internet. In other embodiments of the present technology,
the communication network 110 can be implemented differently, such as any wide-area
communication network, local-area communication network, a private communication network
and so on.
[0037] The information storage medium 104 is configured to store data, including computer-readable
instructions and other data, including text data, audio data, acoustic data, and the
like. In some implementations of the present technology, the information storage medium
104 can store at least part of the data in a database 106. In other implementations
of the present technology, the information storage medium 104 can store at least part
of the data in any collections of data other than databases.
[0038] The information storage medium 104 can store computer-readable instructions that
manage updates, population and modification of the database 106 and/or other collections
of data. More specifically, computer-readable instructions stored on the information
storage medium 104 allow the server 102 to receive (e.g., to update) information in
respect of text and audio samples via the communication network 110 and to store information
in respect of the text and audio samples, including the information in respect of
their phonetic features, linguistic features, vocoder features, speech attributes,
etc., in the database 106 and/or in other collections of data.
[0039] Data stored on the information storage medium 104 (and more particularly, at least
in part, in some implementations, in the database 106) can comprise
inter alia text and audio samples of any kind. Non-limiting examples of text and/or audio samples
include books, articles, journals, emails, text messages, written reports, voice recordings,
speeches, video games, graphics, spoken text, songs, videos, and audiovisual works.
[0040] Computer-readable instructions, stored on the information storage medium 104, when
executed, can cause the processor 108 to receive instruction to output a synthetic
speech 440 having a selected speech attribute 420. The instruction to output the synthetic
speech 440 having the selected speech attribute 420 can be instructions of a user
121 received by the server 102 from a client device 112, which client device 112 will
be described in more detail below. The instruction to output the synthetic speech
440 having the selected speech attribute 420 can be instructions of the client device
112 received by the server 102 from the client device 112. For example, responsive to
user 121 requesting to have text messages read aloud by the client device 112, the
client device 112 can send to the server 102 a corresponding request to output incoming
text messages as synthetic speech 440 having the selected speech attribute 420, to
be provided to the user 121 via the output module 118 and the audio output 140 of
the client device 112.
[0041] Computer-readable instructions, stored on the information storage medium 104, when
executed, can further cause the processor 108 to convert a text into synthetic speech
440 using an acoustic space model 340, the synthetic speech 440 having a selected
speech attribute 420. Broadly speaking, this conversion process can be broken into
two portions: a training process in which the acoustic space model 340 is generated
(generally depicted in Figure 3), and an "in-use" process in which the acoustic space
model 340 is used to convert a received text 410 into synthetic speech 440 having
selected speech attributes 420 (generally depicted in Figure 4). We will discuss each
portion in turn.
[0042] In the training process, computer-readable instructions, stored on the information
storage medium 104, when executed, can cause the processor 108 to receive a training
text data 312 and a respective training acoustic data 322. The form of the training
text data 312 is not particularly limited and may be, for example, part of a written
or printed text 410 of any type, e.g., a book, an article, an e-mail, a text 410 message,
and the like. In some embodiments, the training text data 312 is received via text
input 130 and input module 113. In alternative embodiments, the training text data
312 is received via a second input module (not depicted) in the server 102. The
training text data 312 may be received from an e-mail client, an e-book reader, a
messaging system, a web browser, or within another application containing text content.
Alternatively, the training text data 312 may be received from the operating system
of the computing device (e.g., the server 102, or the client device 112). The form
of the training acoustic data 322 is also not particularly limited and may be, for
example, a recording of a person reading aloud the training text data 312, a recorded
speech, a play, a song, a video, and the like.
[0043] The training acoustic data 322 is a spoken (e.g., audio) representation of the training
text data 312, and is associated with one or more defined speech attribute, the one
or more defined speech attribute describing characteristics of the training acoustic
data 322. The one or more defined speech attribute is not particularly limited and
may correspond, for example, to an emotion (angry, happy, sad, etc.), the gender of
the speaker, an accent, an intonation, a dynamic (loud, soft, etc.), a speaker identity,
etc. Training acoustic data 322 may be received as any type of audio sample, for example
a recording, a MP3 file, and the like. In some embodiments, the training acoustic
data 322 is received via an audio input (not depicted) and input module 113. In alternative
embodiments, the training acoustic data 322 is received via a second input module
(not depicted) in the server 102. The training acoustic data 322 may be received
from an application containing audio content. Alternatively, the training acoustic
data 322 may be received from the operating system of the computing device (e.g.,
the server 102, or the client device 112).
[0044] Training text data 312 and training acoustic data 322 can originate from multiple
sources. For example, training text and/or acoustic data could be retrieved from email
messages, downloaded from a remote server, and the like. In some non-limiting implementations,
training text and/or acoustic data is stored in the information storage medium 104,
e.g., in database 106. In alternative non-limiting implementations, training text
and/or acoustic data is received (e.g., uploaded) by the server 102 from the client
device 112 via the communication network 110. In yet another non-limiting implementation,
training text and/or acoustic data is retrieved (e.g., downloaded) from an external
resource (not depicted) via the communication network 110. In yet another embodiment,
training text data 312 is inputted by the user 121 via text input 130 and input module
113. Similarly, training acoustic data 322 may be inputted by the user 121 via an
audio input (not depicted) connected to input module 113.
[0045] In this implementation of the present technology, the server 102 acquires the training
text and/or acoustic data from an external resource (not depicted), which can be,
for example, a provider of such data. It should be expressly understood that the source
of the training text and/or acoustic data can be any suitable source, for example,
any device that optically scans text images and converts them to a digital image,
any device that records audio samples, and the like.
[0046] One or more (items of) training text data 312 may be received. In some non-limiting
implementations, two or more (items of) training text data 312 are received. In some
non-limiting implementations, two or more respective (items of) training acoustic
data 322 may be received for each (item of) training text data 312 received, each
(item of) training acoustic data 322 being associated with one or more defined speech
attribute. In such implementations, each training acoustic data 322 may have distinct
defined speech attributes. For example, a first training acoustic data 322 being a
spoken representation of a first training text data 312 may have the defined speech
attributes "male" and "angry" (i.e., a recording of the first training text data 312
read aloud by an angry man), whereas a second training acoustic data 322, the second
training acoustic data 322 also being a spoken representation of the first training
text data 312, may have the defined speech attributes "female", "happy", and "young"
(i.e., a recording of the first training text data 312 read aloud by a young girl
who is feeling very happy). The number and type of speech attributes is defined independently
for each training acoustic data 322.
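Keeping with that example, and using purely hypothetical field names and file names, the corresponding training items could be pictured as follows:

```python
# Hypothetical representation of the example above: one item of training text data
# with two items of training acoustic data, each associated with its own defined
# speech attributes. The sentence and file names are placeholders.
training_items = [
    {"text": "The meeting starts at nine.",
     "acoustic_data": "recording_male_angry.wav",
     "attributes": {"gender": "male", "emotion": "angry"}},
    {"text": "The meeting starts at nine.",
     "acoustic_data": "recording_female_happy.wav",
     "attributes": {"gender": "female", "emotion": "happy", "age": "young"}},
]
```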
[0047] Computer-readable instructions, stored on the information storage medium 104, when
executed, can further cause the processor 108 to extract one or more of phonetic and
linguistic features of the training text data 312. For example, in some embodiments
the processor 108 can be caused to divide the training text data 312 into phones,
a phone being a minimal segment of a speech sound in a language (such as a vowel or
a consonant). As will be understood by persons skilled in the art, many phonetic and/or
linguistic features may be extracted, and there are many methods known for doing so;
neither the phonetic and/or linguistic features extracted nor the method for doing
so is meant to be particularly limited.
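A deliberately simplified sketch of dividing text into phones is shown below; a practical system would rely on a full pronunciation lexicon and grapheme-to-phoneme rules, whereas the tiny lexicon here is purely hypothetical.

```python
# Toy grapheme-to-phoneme lookup; the entries are illustrative only.
TOY_LEXICON = {
    "the": ["dh", "ah"],
    "cat": ["k", "ae", "t"],
    "sat": ["s", "ae", "t"],
}


def text_to_phones(text):
    phones = []
    for word in text.lower().split():
        phones.extend(TOY_LEXICON.get(word.strip(".,!?"), ["<unk>"]))
    return phones


print(text_to_phones("The cat sat."))
# ['dh', 'ah', 'k', 'ae', 't', 's', 'ae', 't']
```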
[0048] Computer-readable instructions, stored on the information storage medium 104, when
executed, can further cause the processor 108 to extract vocoder features of the respective
training acoustic data 322 and correlate the vocoder features with the one or more
phonetic and linguistic feature of the training text data and with the one or more
defined speech attribute. A set of training data of speech attributes is thereby generated.
In some non-limiting implementations, extracting vocoder features of the training
acoustic data comprises dimensionality reduction of the waveform of the respective
training acoustic data. As will be understood by persons skilled in the art, extraction
of vocoder features may be done using many different methods, and the method used
is not meant to be particularly limited.
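One non-limiting way to picture the resulting set of training data is as input/target pairs, in which every frame of vocoder features is paired with the linguistic features of the text and an encoding of the defined speech attributes. The encoding below is a hypothetical sketch, not the correlation procedure itself.

```python
import numpy as np

# Hypothetical attribute vocabulary used to encode the defined speech attributes.
ATTRIBUTES = ["male", "female", "angry", "happy", "sad"]


def encode_attributes(defined):
    return np.array([1.0 if a in defined else 0.0 for a in ATTRIBUTES])


def make_training_pairs(linguistic_features, vocoder_frames, defined_attributes):
    """Pair every frame of vocoder features with linguistic and attribute inputs."""
    attribute_vector = encode_attributes(defined_attributes)
    inputs = [np.concatenate([linguistic_features, attribute_vector])
              for _ in range(len(vocoder_frames))]
    return np.stack(inputs), np.asarray(vocoder_frames)


# Toy example: 3 linguistic features, 4 frames of 5 vocoder features each.
X, Y = make_training_pairs(np.ones(3), np.zeros((4, 5)), {"female", "happy"})
print(X.shape, Y.shape)  # (4, 8) (4, 5)
```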
[0049] Computer-readable instructions, stored on the information storage medium 104, when
executed, can further cause the processor 108 to use a deep neural network (DNN) to
determine interdependency factors between the speech attributes in the training data.
The DNN (as described further below) generates a single, continuous acoustic space
model that takes into account a plurality of interdependent speech attributes and
allows for modelling of a continuous spectrum of interdependent speech attributes.
Implementation of the DNN is not particularly limited. Many such machine learning
algorithms are known. In some non-limiting implementations, the acoustic space model,
once generated, is stored in the information storage medium 104, e.g., in database
106, for future use in the "in-use" portion of the TTS process.
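One possible sketch of this step, here assuming PyTorch purely as an example rather than a requirement of the present technology, fits a small feed-forward network that maps linguistic features and attribute weights to vocoder features; the dimensions match the toy pairs sketched above.

```python
import torch
from torch import nn

# Hypothetical dimensions: 8 inputs (linguistic features + attribute weights),
# 5 outputs (vocoder features per frame).
model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 5))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# X and Y would come from the training pairs sketched above (random stand-ins here).
X = torch.randn(256, 8)
Y = torch.randn(256, 5)

for _ in range(100):                       # toy training loop
    optimizer.zero_grad()
    loss = loss_fn(model(X), Y)
    loss.backward()
    optimizer.step()
```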
[0050] The training portion of the TTS process is thus complete, the acoustic space model
having been generated. We will now describe the system for the "in-use" portion of
the TTS process in which the acoustic space model is used to convert a received text
into synthetic speech having selected speech attributes. The training portion can
be performed "off-line", whereas the "in-use portion of the TTS process may be performed
"on-line".
[0051] Computer-readable instructions, stored on the information storage medium 104, when
executed, can further cause the processor 108 to receive a text 410. As for the training
text data 312, the form and origin of the text 410 are not particularly limited. The
text 410 may be part of a written text of any type, e.g., a book, an article, an e-mail,
a text message, and the like. In some non-limiting implementations, the text 410 is
received via text input 130 and input module 113 of the client device 112. The text
410 may be received from an e-mail client, an e-book reader, a messaging system, a
web browser, or within another application containing text content. Alternatively,
the text 410 may be input by the user 121 via text input 130. In alternative non-limiting
implementations, the text 410 is received from the operating system of the computing
device (e.g., the server 102, or the client device 112).
[0052] Computer-readable instructions, stored on the information storage medium 104, when
executed, can further cause the processor 108 to receive a selection of a speech attribute
420, the speech attribute 420 having a selected attribute weight. One or more speech
attribute 420 may be received, each having one or more selected attribute weight.
The selected attribute weight defines the weight of the speech attribute 420 desired
in the synthetic speech to be outputted. In other words, the synthetic speech will
have a weighted sum of speech attributes 420. Further, a speech attribute 420 may
be variable over a continuous range, for example intermediate between "sad" and "happy"
or "sad" and "angry".
[0053] In some non-limiting implementations, the selected speech attribute 420 is received
via the input module 113 of the client device 112. In some non-limiting implementations,
the selected speech attribute 420 is received with the text 410. In alternative embodiments,
the text 410 and the selected speech attribute 420 are received separately (e.g.,
at different times, or from different applications, or from different users, or in
different files, etc.), via the input module 113. In further non-limiting implementations,
the selected speech attribute 420 is received via a second input module (not depicted)
in the server 102.
[0054] It should be expressly understood that the selected speech attribute 420 is not particularly
limited and may correspond, for example, to an emotion (angry, happy, sad, etc.),
the gender of the speaker, a language, an accent, an intonation, a dynamic, a speaker
identity, a speaking style, etc, or any combination thereof.
[0055] Computer-readable instructions, stored on the information storage medium 104, when
executed, can further cause the processor 108 to convert the text 410 into synthetic
speech 440 using the acoustic space model 340 generated during the training process.
In other words, the text 410 and the selected one or more speech attributes 420 are
inputted into the acoustic space model 340, which outputs the synthetic speech having
the selected speech attribute (as described further below). It should be understood
that any desired speech attributes can be selected and included in the outputted
synthetic speech.
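Conceptually, and continuing the hypothetical sketches above, the in-use conversion amounts to running the trained model over the per-frame linguistic features of the text together with the selected attribute weights, and handing the predicted vocoder features to a vocoder for waveform generation:

```python
import numpy as np


def synthesize(acoustic_model, render_waveform, linguistic_frames, attribute_weights):
    """Hypothetical in-use step: predict vocoder features, then render them as audio."""
    weights = np.asarray(attribute_weights)
    inputs = np.stack([np.concatenate([frame, weights]) for frame in linguistic_frames])
    vocoder_features = acoustic_model(inputs)     # shape: (num_frames, num_features)
    return render_waveform(vocoder_features)      # audio samples to be outputted


# Example with stand-in callables (an identity "model" and a flattening "vocoder").
audio = synthesize(lambda x: x, lambda f: f.ravel(),
                   linguistic_frames=np.ones((4, 3)),
                   attribute_weights=[0.5, 0.5])  # e.g. 50% excited, 50% happy
print(audio.shape)
```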
[0056] Computer-readable instructions, stored on the information storage medium 104, when
executed, can further cause the processor 108 to send to the client device 112 an
instruction to output the synthetic speech as audio having the selected speech attribute
420, e.g., via the output module 118 and audio output 140 of the client device 112.
The instruction can be sent via communication network 110. In some non-limiting implementations,
the processor 108 can send instruction to output the synthetic speech as audio using
a second output module (not depicted) in the server 102, e.g., connected to the network
communication interface 109 and the processor 108. In some non-limiting implementations,
instruction to output the synthetic speech via output module 118 and audio output
140 of the client device 112 is sent to client device 112 via the second output module
(not depicted) in the server 102.
[0057] Computer-readable instructions, stored on the information storage medium 104, when
executed, can further cause the processor 108 to repeat the "in-use" process, in which
the acoustic space model 340 is used to convert a received text 410 into synthetic
speech having selected speech attributes 420, until all received texts 410 have been
outputted as synthetic speech having the selected speech attributes 420. The number
of texts 410 that can be received and outputted as synthetic speech
using the acoustic space model 340 is not particularly limited.
[0058] The system 100 further comprises a client device 112. The client device 112 is typically
associated with a user 121. It should be noted that the fact that the client device
112 is associated with the user 121 does not need to suggest or imply any mode of
operation - such as a need to log in, a need to be registered or the like.
[0059] The implementation of the client device 112 is not particularly limited, but as an
example, the client device 112 may be implemented as a personal computer (desktops,
laptops, netbooks, etc.) or as a wireless communication device (a smartphone, a tablet
and the like).
[0060] The client device 112 comprises an input module 113. How the input module 113 is
implemented is not particularly limited and may depend on how the client device 112
is implemented. The input module 113 may include any mechanism for providing user
input to the processor 116 of the client device 112. The input module 113 is connected
to a text input 130. The text input 130 receives text. The text input 130 is not particularly
limited and may depend on how the client device 112 is implemented. For example, the
text input 130 can be a keyboard, and/or a mouse, and so on. Alternatively, the text
input 130 can be a means for receiving text data from an external storage medium or
a network. The text input 130 is not limited to any specific input methodology or
device. For example, it could be arranged as a virtual button on a touch-screen display
or a physical button on the cover of the electronic device. Other implementations
are possible.
[0061] Merely as an example and not as a limitation, in those embodiments of the present
technology where the client device 112 is implemented as a wireless communication
device (such as a smartphone), text input 130 can be implemented as an optical interference
based user input device. In one example, the text input 130 is a finger/object movement
sensing device on which a user performs a gesture and/or presses with a finger. The
text input 130 can identify/track the gesture and/or determine a location of a user's
finger on the client device 112. In the instances where the text input 130 is executed
as the optical interference based user input device, such as a touch screen or multi-touch
display, the input module 113 can further execute functions of the output module 118,
particularly in embodiments where the output module 118 is implemented as a display
screen.
[0062] The input module 113 is also connected to an audio input (not depicted) for inputting
acoustic data. The audio input is not particularly limited and may depend on how the
client device 112 is implemented. For example, the audio input can be a microphone,
a recording device, an audio receiver, and the like. Alternatively, the audio input can
be a means for receiving acoustic data from an external storage medium or a network
such as a cassette tape, a compact disc, a radio, a digital audio source, an MP3 file,
etc. The audio input is not limited to any specific input methodology or device.
[0063] The input module 113 is communicatively coupled to a processor 116 and transmits
input signals based on various forms of user input for processing and analysis by
processor 116. In embodiments where the input module 113 also operates as the output
module 118, being implemented for example as a display screen, the input module 113
can also transmit output signals.
[0064] The client device 112 further comprises a computer usable information storage medium
(also referred to as a local memory 114). Local memory 114 can comprise any type of
media, including but not limited to RAM, ROM, disks (CD-ROMs, DVDs, floppy disks,
hard drives, etc.), USB keys, solid-state drives, tape drives, etc. Generally speaking,
the purpose of the local memory 114 is to store computer readable instructions as
well as any other data.
[0065] The client device 112 further comprises the output module 118. In some embodiments,
the output module 118 can be implemented as a display screen. A display screen may
be, for example, a liquid crystal display (LCD), a light emitting diode (LED), an
interferometric modulator display (IMOD), or any other suitable display technology.
A display screen is generally configured to display a graphical user interface (GUI)
that provides an easy-to-use visual interface between the user 121 of the client device
112 and the operating system or application(s) running on the client device 112. Generally,
a GUI presents programs, files and operational options with graphical images. Output
module 118 is also generally configured to display other information like user data
and web resources on a display screen. When output module 118 is implemented as a
display screen, it can also be implemented as a touch based device such as a touch
screen. A touch screen is a display that detects the presence and location of user
touch inputs. A display screen can be a dual touch or multi-touch display that can
identify the presence, location and movement of touch inputs. In the instances where
the output module 118 is implemented as a touch-based device such as a touch screen,
or a multi-touch display, the display screen can execute functions of the input module
113.
[0066] The output module 118 further comprises an audio output device, such as a sound
card or an external adaptor, for processing audio data, and is connected to an audio
output 140. The
audio output 140 may be, for example, a direct audio output such as a speaker, headphones,
HDMI audio, or a digital output, such as an audio data file which may be sent to a
storage medium, networked, etc. The audio output is not limited to any specific output
methodology or device and may depend on how the client device 112 is implemented.
[0067] The output module 118 is communicatively coupled to the processor 116 and receives
signals from the processor 116. In instances where the output module 118 is implemented
as a touch-based display screen device such as a touch screen, or a multi-touch display,
the output module 118 can also transmit input signals based on various forms of user
input for processing and analysis by processor 116.
[0068] The client device 112 further comprises the above mentioned processor 116. The processor
116 is configured to perform various operations in accordance with a machine-readable
program code. The processor 116 is operatively coupled to the input module 113, to
the local memory 114, and to the output module 118. The processor 116 is configured
to have access to computer readable instructions which instructions, when executed,
cause the processor 116 to execute various routines.
[0069] As non-limiting examples, the processor 116 described herein can have access to computer
readable instructions, which instructions, when executed, can cause the processor
116 to: output a synthetic speech as audio via the output module 118; receive from
a user 121 of the client device 112 via the input module 113 a selection of text and
selected speech attribute(s); send, by the client device 112 to a server 102 via a
communication network 110, the user-inputted data; and receive, by the client device
112 from the server 102 a synthetic speech for outputting via the output module 118
and audio output 140 of the client device 112.
[0070] The local memory 114 is configured to store data, including computer-readable instructions
and other data, including text and acoustic data. In some implementations of the present
technology, the local memory 114 can store at least part of the data in a database
(not depicted). In other implementations of the present technology, the local memory
114 can store at least part of the data in any collections of data (not depicted)
other than databases.
[0071] Data stored on the local memory 114 (and more particularly, at least in part, in
some implementations, in the database) can comprise text and acoustic data of any
kind.
[0072] The local memory 114 can store computer-readable instructions that control updates,
population and modification of the database (not depicted) and/or other collections
of data (not depicted). More specifically, computer-readable instructions stored on
the local memory 114 allow the client device 112 to receive (e.g., to update) information
in respect of text and acoustic data and synthetic speech, via the communication network
110, to store information in respect of the text and acoustic data and synthetic speech,
including the information in respect of their phonetic and linguistic features, vocoder
features, and speech attributes in the database, and/or in other collections of data.
[0073] Computer-readable instructions, stored on the local memory 114, when executed, can
cause the processor 116 to receive instruction to perform TTS. The instruction to
perform TTS can be received following instructions of a user 121 received by the client
device 112 via the input module 113. For example, responsive to user 121 requesting
to have text messages read aloud, the client device 112 can send to the server
102 a corresponding request to perform TTS.
[0074] In some implementations of the present technology, instruction to perform TTS can
be executed on the server 102, so that the client device 112 transmits the instructions
to the server 102. Further, computer-readable instructions, stored on the local memory
114, when executed, can cause the processor 116 to receive, from the server 102, as
a result of processing by the server 102, an instruction to output a synthetic speech
via audio output 140. The instruction to output the synthetic speech as audio via
audio output 140 can be received from the server 102 via communication network 110.
In some implementations, the instruction to output the synthetic speech as audio via
audio output 140 of the client device 112 may comprise an instruction to read incoming
text messages aloud. Many other implementations are possible and these are not
meant to be particularly limited.
[0075] In alternative implementations of the present technology, an instruction to perform
TTS can be executed locally, on the client device 112, without contacting the server
102.
[0076] More particularly, computer-readable instructions, stored on the local memory 114,
when executed, can cause the processor 116 to receive a text, receive one or more
selected speech attributes, etc. In some implementations, the instruction to perform
TTS can be instructions of a user 121 entered using the input module 113. For example,
responsive to user 121 requesting to read text messages aloud, the client device
112 can receive instruction to perform TTS.
[0077] Computer-readable instructions, stored on the local memory 114, when executed, can
further cause the processor 116 to execute other steps in the TTS method, as described
herein; these steps are not described again here to avoid unnecessary repetition.
[0078] It is noted that the client device 112 is coupled to the communication network 110
via a communication link 124. In some non-limiting embodiments of the present technology,
the communication network 110 can be implemented as the Internet. In other embodiments
of the present technology, the communication network 110 can be implemented differently,
such as any wide-area communications network, local-area communications network, a
private communications network and the like. The client device 112 can establish connections,
through the communication network 110, with other devices, such as servers. More particularly,
the client device 112 can establish connections and interact with the server 102.
[0079] How the communication link 124 is implemented is not particularly limited and will
depend on how the client device 112 is implemented. Merely as an example and not as
a limitation, in those embodiments of the present technology where the client device
112 is implemented as a wireless communication device (such as a smartphone), the
communication link 124 can be implemented as a wireless communication link (such as
but not limited to, a 3G communications network link, a 4G communications network
link, a Wireless Fidelity, or WiFi® for short, Bluetooth® and the like). In those
examples where the client device 112 is implemented as a notebook computer, the communication
link 124 can be either wireless (such as Wireless Fidelity, or WiFi® for short, Bluetooth®
or the like) or wired (such as an Ethernet-based connection).
[0081] It should be expressly understood that implementations for the client device 112,
the communication link 124 and the communication network 110 are provided for illustration
purposes only. As such, those skilled in the art will easily appreciate other specific
implementation details for the client device 112, the communication link 124 and the
communication network 110. As such, by no means are examples provided herein above
meant to limit the scope of the present technology.
[0082] Figure 2 illustrates a computer-implemented method 200 for text-to-speech (TTS) synthesis,
the method executable on a computing device (which can be either the client device
112 or the server 102) of the system 100 of Figure 1.
[0083] The method 200 begins with steps 202-208 for training an acoustic space model which
is used for TTS in accordance with embodiments of the technology. For ease of understanding,
these steps are described with reference to Figure 3, which depicts a schematic diagram
300 of training an acoustic space model 340 from source text 312 and acoustic data
322 in accordance with non-limiting embodiments of the present technology.
Step 202 - receiving a training text data and a respective training acoustic data,
the respective training acoustic data being a spoken representation of the training
text data, the respective training acoustic data being associated with one or more
defined speech attribute
[0084] The method 200 starts at step 202, where a computing device, being in this implementation
of the present technology the server 102, receives instruction for TTS, specifically
to output a synthetic speech having a selected speech attribute.
[0085] It should be expressly understood that, although the method 200 is described here
with reference to an embodiment where the computing device is a server 102, this description
is presented by way of example only, and the method 200 can be implemented
mutatis mutandis in other embodiments, such as those where the computing device is a client device
112.
[0086] In step 202, training text data 312 is received. The form of the training text data
312 is not particularly limited. It may be part of a written text of any type, e.g.,
a book, an article, an e-mail, a text message, and the like. The training text data
312 is received via text input 130 and input module 113. It may be received from an
e-mail client, an e-book reader, a messaging system, a web browser, or within another
application containing text content. Alternatively, the training text data 312 may
be received from the operating system of the computing device (e.g., the server 102,
or the client device 112).
[0087] Training acoustic data 322 is also received. The training acoustic data 322 is a
spoken representation of the training text data 312 and is not particularly limited.
It may be a recording of a person reading aloud the training text 312, a speech, a
play, a song, a video, and the like.
[0088] The training acoustic data 322 is associated with one or more defined speech attribute
326. The defined speech attribute 326 is not particularly limited and may correspond,
for example, to an emotion (angry, happy, sad, etc.), the gender of the speaker, a
language, an accent, an intonation, a dynamic, a speaker identity, etc. For each training
acoustic data 322 received, the one or more speech attribute 326 is defined, to allow
an association, such as a correlation, between vocoder features 324 of the acoustic
data 322 and speech attributes 326 during training of the acoustic space model 340
(defined further below).
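Merely as a non-limiting illustration, and not as a description of any particular implementation, a single training example received in step 202 could be represented as a record pairing the training text data 312, its spoken representation 322, and the defined speech attributes 326; all field names and values below are hypothetical:

    # Hypothetical Python record for one training example: training text data, the path
    # to its spoken representation, and the one or more defined speech attribute.
    training_example = {
        "text": "I told you not to touch that.",
        "audio_path": "recordings/sample_0001.wav",  # spoken representation of the text
        "attributes": {"emotion": "angry", "gender": "male", "language": "en"},
    }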
[0089] The form of the training acoustic data 322 is not particularly limited. It may be
part of an audio sample of any type, e.g., a recording, a speech, a video, and the
like. The training acoustic data 322 is received via an audio input (not depicted)
and input module 113. It may be received from an application containing audio content.
Alternatively, the training acoustic data 322 may be received from the operating system
of the computing device (e.g., the server 102, or the client device 112).
[0090] Training text data 312 and training acoustic data 322 can originate from multiple
sources. For example, text and/or acoustic data 312, 322 could be retrieved from email
messages, downloaded from a remote server, and the like. In some non-limiting implementations,
text and/or acoustic data 312, 322 is stored in the information storage medium 104,
e.g., in database 106. In alternative non-limiting implementations, text and/or acoustic
data 312, 322 is received (e.g., uploaded) by the server 102 from the client device
112 via the communication network 110. In yet another non-limiting implementation,
text and/or acoustic data 312, 322 is retrieved (e.g., downloaded) from an external
resource (not depicted) via the communication network 110.
[0091] In this implementation of the present technology, the server 102 acquires the text
and/or acoustic data 312, 322 from an external resource (not depicted), which can
be, for example, a provider of such data. In other implementations of the present
technology, the source of the text and/or acoustic data 312, 322 can be any suitable
source, for example, any device that optically scans text images and converts them
to a digital image, any device that records audio samples, and the like.
[0092] Then, the method 200 proceeds to step 204.
Step 204 - extracting one or more of phonetic and linguistic features of the training
text data
[0093] Next, at step 204, the server 102 executes a step of extracting one or more of phonetic
and linguistic features 314 of the training text data 312. This step is shown schematically
in the first box 310 in Figure 3. Phonetic and/or linguistic features 314 are also
shown schematically in Figure 3. Many such features and ways of extracting such features
are known, and this step is not meant to be particularly limited. For example, in
the non-limiting embodiment shown in Figure 3, the training text data 312 is divided
into phones, a phone being a minimal segment of a speech sound in a language. Phones
are generally either vowels or consonants or small groupings thereof. In some embodiments,
the training text data 312 may be divided into phonemes, a phoneme being a minimal
segment of speech that cannot be replaced by another without changing meaning, i.e.,
an individual speech unit for a particular language. As will be understood by persons
skilled in the art, extraction of phonetic and/or linguistic features 314 may be done
using any known method or algorithm. The method to be used and the phonetic and/or
linguistic features 314 to be determined may be selected using a number of different
criteria, such as the source of the text data 312, etc.
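Purely as a sketch of the kind of processing step 204 may involve, the following Python fragment divides a text into phones using a small illustrative pronunciation lexicon and derives a few simple contextual features. The lexicon, the feature set, and the function name are assumptions for illustration only, not the extraction method of the present technology.

    # Minimal sketch of dividing text into phones and deriving contextual features.
    # A real system would use a full grapheme-to-phoneme component, not this toy lexicon.
    LEXICON = {
        "hello": ["HH", "AH", "L", "OW"],
        "world": ["W", "ER", "L", "D"],
    }

    def extract_phonetic_linguistic_features(text):
        words = text.lower().split()
        features = []
        for w_idx, word in enumerate(words):
            phones = LEXICON.get(word, ["<unk>"])
            for p_idx, phone in enumerate(phones):
                features.append({
                    "phone": phone,
                    "prev_phone": phones[p_idx - 1] if p_idx > 0 else "<s>",
                    "next_phone": phones[p_idx + 1] if p_idx + 1 < len(phones) else "</s>",
                    "position_in_word": p_idx / max(len(phones) - 1, 1),
                    "word_position_in_text": w_idx / max(len(words) - 1, 1),
                })
        return features

    print(extract_phonetic_linguistic_features("hello world")[0])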
[0094] Then, the method 200 proceeds to step 206.
Step 206 - extracting vocoder features of the respective training acoustic data, and
correlating the vocoder features with the phonetic and linguistic features of the
training text data and with the one or more defined speech attribute, thereby generating
a set of training data of speech attributes
[0095] Next, at step 206, the server 102 executes a step of extracting vocoder features
324 of the training acoustic data 322. This step is shown schematically in the second
box 320 in Figure 3. Vocoder features 324 are also shown schematically in Figure 3,
as are defined speech attributes 326. Many such features and ways of extracting such
features are known, and this step is not meant to be particularly limited. For example,
in the non-limiting embodiment shown in Figure 3, the training acoustic data 322 is
divided into vocoder features 324. In some embodiments, extracting vocoder features
324 of the training acoustic data 322 comprises dimensionality reduction of the waveform
of the respective training acoustic data. As will be understood by persons skilled
in the art, extraction of vocoder features 324 may be done using any known method
or algorithm. The method to be used may be selected using a number of different criteria,
such as the source of the acoustic data 322, etc.
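As a rough, numpy-only sketch of one way the dimensionality reduction mentioned above could look, the following fragment frames a waveform and reduces each frame to a handful of log-spectral coefficients plus a crude autocorrelation-based pitch estimate. Real vocoder front-ends (mel-cepstra, band aperiodicity, and so on) are considerably more elaborate; the parameter values and function name here are illustrative assumptions.

    import numpy as np

    def extract_vocoder_features(waveform, sample_rate=16000,
                                 frame_len=400, hop=160, n_coeffs=24):
        frames = []
        for start in range(0, len(waveform) - frame_len, hop):
            frame = waveform[start:start + frame_len] * np.hanning(frame_len)
            spectrum = np.abs(np.fft.rfft(frame)) + 1e-8
            log_spec = np.log(spectrum)[:n_coeffs]            # crude spectral envelope
            ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
            lag = np.argmax(ac[40:321]) + 40                  # search roughly 50-400 Hz at 16 kHz
            f0 = sample_rate / lag                            # crude pitch estimate
            frames.append(np.concatenate([log_spec, [f0]]))
        return np.stack(frames)

    # e.g. extract_vocoder_features(np.random.randn(16000)) -> array of shape (98, 25)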
[0096] Next, the vocoder features 324 are correlated with the phonetic and/or linguistic
features 314 of the training text data 312 determined in step 204 and with the one
or more defined speech attribute 326 associated with the training acoustic data 322,
and received in step 202. The phonetic and/or linguistic features 314, the vocoder
features 324, and the one or more defined speech attribute 326 and the association
or correlations therebetween form a set of training data (not depicted).
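The correlation itself might, under similarly hedged assumptions, be sketched as the assembly of training records that tie the phone-level linguistic features, the frame-level vocoder features, and the defined speech attributes 326 together; the uniform phone-to-frame split below stands in for whatever time alignment an actual implementation uses.

    def build_training_set(ling_feats, voc_frames, attributes):
        # ling_feats: phone-level features from step 204; voc_frames: frame-level vocoder
        # features from this step; attributes: the one or more defined speech attribute 326.
        records = []
        frames_per_phone = max(len(voc_frames) // max(len(ling_feats), 1), 1)
        for i, phone_feats in enumerate(ling_feats):
            start = i * frames_per_phone
            records.append({
                "linguistic": phone_feats,
                "vocoder": voc_frames[start:start + frames_per_phone],
                "attributes": attributes,  # e.g. {"emotion": "angry", "gender": "male"}
            })
        return records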
[0097] Then, the method 200 proceeds to step 208.
Step 208 - using a deep neural network to determine interdependency factors between
the speech attributes in the training data, the deep neural network generating a single,
continuous acoustic space model based on the interdependency factors, the acoustic
space model thereby taking into account a plurality of interdependent speech attributes
and allowing for modelling of a continuous spectrum of the interdependent speech attributes
[0098] In step 208, the server 102 uses a deep neural network (DNN) 330 to determine interdependency
factors between the speech attributes 326 in the training data. The DNN 330 is a machine-learning
algorithm in which input nodes receive input and output nodes provide output, with a
plurality of hidden layers of nodes between the input nodes and the output nodes performing
the learning. In contrast to a decision-tree based algorithm, the DNN 330 takes all of
the training data into account simultaneously and finds interconnections and interdependencies
within the training data, allowing for continuous, unified modelling of the training data.
Many such DNNs are known and the method of implementation of the DNN 330 is not meant
to be particularly limited.
[0099] In the non-limiting embodiment shown in Figure 3, the input into the DNN 330 is the
training data (not depicted), and the output from the DNN 330 is the acoustic space
model 340. The DNN 330 thus generates a single, continuous acoustic space model 340
based on the interdependency factors between the speech attributes 326, the acoustic
space model 340 thereby taking into account a plurality of interdependent speech attributes
and allowing for modelling of a continuous spectrum of the interdependent speech attributes.
The acoustic space model 340 can now be used in the remaining steps 210-216 of the
method 200.
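The present description does not prescribe any particular network topology, so the following PyTorch fragment is only one plausible sketch of a DNN of the kind depicted in Figure 3: a feed-forward network that maps the concatenation of phone-level linguistic features and attribute weights to vocoder features, so that every combination of attributes lies in a single continuous input space. All layer sizes, dimensions, and names are assumptions made for illustration.

    import torch
    import torch.nn as nn

    class AcousticSpaceModel(nn.Module):
        # Maps [linguistic features ++ attribute weights] -> vocoder features, so that
        # every combination of attribute weights lies in one continuous input space.
        def __init__(self, n_ling=40, n_attr=8, n_voc=25, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_ling + n_attr, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, n_voc),
            )

        def forward(self, ling, attr):
            return self.net(torch.cat([ling, attr], dim=-1))

    # One illustrative training step over a mini-batch of (linguistic, attribute, vocoder) triples.
    model = AcousticSpaceModel()
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
    ling = torch.randn(32, 40)    # placeholder phone-level linguistic features
    attr = torch.rand(32, 8)      # placeholder attribute weights in [0, 1]
    target = torch.randn(32, 25)  # placeholder vocoder features
    loss = nn.MSELoss()(model(ling, attr), target)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()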
[0100] The method 200 now continues with steps 210-216 in which text-to-speech synthesis
is performed, using the acoustic space model 340 generated in step 208. For ease of
understanding, these steps are described with reference to Figure 4, which depicts
a schematic diagram 400 of text-to-speech synthesis (TTS) in accordance with non-limiting
embodiments of the present technology.
Step 210 - receiving a text
[0101] In step 210, a text 410 is received. As for the training text data 312, the form
of the text 410 is not particularly limited. It may be part of a written text of any
type, e.g., a book, an article, an e-mail, a text message, and the like. The text
410 is received via text input 130 and input module 113. It may be received from an
e-mail client, an e-book reader, a messaging system, a web browser, or within another
application containing text content. Alternatively, the text 410 may be received from
the operating system of the computing device (e.g., the server 102, or the client
device 112).
[0102] The method 200 now continues with step 212.
Step 212 - receiving a selection of a speech attribute, the speech attribute having
a selected attribute weight
[0103] In step 212, a selection of a speech attribute 420 is received. One or more speech
attribute 420 may be selected and received. The speech attribute 420 is not particularly
limited and may correspond, for example, to an emotion (angry, happy, sad, etc.),
the gender of the speaker, a language, an accent, an intonation, a dynamic, a speaker
identity, a speaking style, etc.
[0104] Each selected speech attribute 420 has a selected attribute weight (not depicted). The selected
attribute weight defines the weight of the speech attribute desired in the synthetic
speech 440. The weight is applied for each selected speech attribute 420, the outputted synthetic
speech 440 having a weighted sum of speech attributes. It will be understood that,
in the non-limiting embodiment where only one speech attribute 420 is selected, the
selected attribute weight for the single speech attribute 420 is necessarily 1 (or
100%). In alternative embodiments, where two or more selected speech attributes 420
are received, each selected speech attribute 420 having a selected attribute weight,
the outputted synthetic speech 440 will have a weighted sum of the two or more selected
speech attributes 420.
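By way of a small, hypothetical illustration of the selected attribute weights:

    # With a single selected speech attribute the weight is necessarily 1.0; with
    # several, the outputted speech reflects a weighted sum (values below are arbitrary).
    single_attribute = {"angry": 1.0}
    blended_attributes = {"angry": 0.6, "happy": 0.1, "female": 1.0}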
[0105] The selection of the speech attribute 420 is received via the input module 113. In
some non-limiting embodiments, it may be received with the text 410 via the text input
130. In alternative embodiments, the text 410 and the speech attribute 420 are received
separately (e.g., at different times, or from different applications, or from different
users, or in different files, etc.), via the input module 113.
Step 214 - converting the text into synthetic speech using the acoustic space model,
the synthetic speech having the selected speech attribute
[0106] In step 214, the text 410 and the one or more speech attribute 420 are inputted into
the acoustic space model 340. The acoustic space model 340 converts the text into
synthetic speech 440. The synthetic speech 440 has perceivable characteristics 430.
The perceivable characteristics 430 correspond to vocoder or audio features of the
synthetic speech 440 that are perceived as corresponding to the selected speech attribute(s)
420. For example, where the speech attribute "angry" has been selected, the synthetic
speech 440 has a waveform whose frequency characteristics (in this example, the frequency
characteristics being the perceivable characteristics 430) produce sound that is perceived
as "angry", the synthetic speech 440 therefore having the selected speech attribute
"angry".
Step 216 - outputting the synthetic speech as audio having the selected speech attribute
[0107] The method 200 ends with step 216, in which the synthetic speech 440 is outputted
as audio having the selected speech attribute(s) 420. As described above with reference
to step 214, the synthetic speech 440 produced by the acoustic space model 340 has
perceivable characteristics 430, the perceivable characteristics 430 producing sound
having the selected speech attribute(s) 420.
[0108] In some implementations, where the computing device is a server 102 (as in the implementation
depicted here), the method 200 may further comprise a step (not depicted) of sending,
to client device 112, an instruction to output the synthetic speech 440 via output
module 118 and audio output 140 of the client device 112. In some implementations,
the instruction to output the synthetic speech 440 via the audio output 140 of the
client device 112 comprises an instruction to read a text message received on the
client device 112 out loud to the user 121, so that the user 121 is not required to
look at the client device 112 in order to receive the text message. For example, the
instruction to output the synthetic speech 440 on client device 112 may be part of
an instruction to read a text message. In such a case, the text 410 received in step
210 may also be part of an instruction to convert incoming text messages to audio.
Many alternative implementations are possible. For example, the instruction to output
the synthetic speech 440 on client device 112 may be part of an instruction to read
an e-book out loud, read an email message out loud, read back to the user 121 a text
that the user has entered, to verify the accuracy of the text, and so on.
[0109] In some implementations, where the computing device is a server 102 (as in the implementation
depicted here), the method 200 may further comprise a step (not depicted) of outputting
the synthetic speech 440 via a second output module (not depicted). The second output
module (not depicted) may, for example, be part of the server 102, e.g. connected
to the network communication interface 109 and the processor 108. In some embodiments,
instruction to output the synthetic speech 440 via output module 118 and audio output
140 of the client device 112 is sent to client device 112 via the second output module
(not depicted) in the server 102.
[0110] In alternative implementations, where the computing device is a client device 112,
the method 200 may further comprise a step of outputting the synthetic speech 440
via output module 118 and audio output 140 of the client device 112. In some implementations,
the instruction to output the synthetic speech 440 via the audio output 140 of the
client device 112 comprises an instruction to read a text message received on the
client device 112 out loud to the user 121, so that the user 121 is not required to
look at the client device 112 in order to receive the text message. For example, the
instruction to output the synthetic speech 440 on client device 112 may be part of
an instruction to read a text message. In such a case, the text 410 received in step
210 may also be part of an instruction to convert incoming text messages to audio.
Many alternative implementations are possible. For example, the instruction to output
the synthetic speech 440 on client device 112 may be part of an instruction to read
an e-book out loud, read an email message out loud, read back to the user 121 a text
that the user has entered, to verify the accuracy of the text, and so on.
[0111] In some implementations, the method 200 ends after step 216. For example, if the
received text 410 has been outputted as synthetic speech 440, then the method 200
ends after step 216. In alternative implementations, steps 210 to 216 may be repeated.
For example, a second text (not depicted) may be received, along with a second selection
of one or more speech attribute (not depicted). In this case, the second text is converted
into a second synthetic speech (not depicted) using the acoustic space model 340,
the second synthetic speech having the second selected one or more speech attribute,
and the second synthetic speech is outputted as audio having the second selected one
or more speech attribute. Steps 210 to 216 may be repeated until all desired texts
have been converted to synthetic speech having the selected one or more speech attribute.
In such implementations the method is therefore iterative, repeatedly converting texts
into synthetic speech and outputting the synthetic speech as audio until every desired
text has been converted and outputted.
[0112] Some of the above steps and signal sending-receiving are well known in the art and,
as such, have been omitted in certain portions of this description for the sake of
simplicity. The signals can be sent/received using optical means (such as a fibre-optic
connection), electronic means (such as a wired or wireless connection), and mechanical
means (such as pressure-based, temperature-based, or any other suitable physical-parameter-based
means).
[0113] Some technical effects of non-limiting embodiments of the present technology may
include provision of a fast, efficient, versatile, and/or affordable method for text-to-speech
synthesis. In some embodiments, the present technology allows provision of TTS with
a programmatically selected voice. For example, in some embodiments, synthetic speech
can be outputted having any combination of selected speech attributes. In such embodiments,
the present technology can thus be flexible and versatile, allowing a programmatically
selected voice to be outputted. In some embodiments, the combination of speech attributes
selected is independent of the speech attributes in the training acoustic data. For
example, suppose a first training acoustic data having the speech attributes "angry
male" and a second training acoustic data having the speech attributes "young female,
happy" are received during training of the acoustic space model; nevertheless, the
speech attributes "angry" and "female" can be selected, and synthetic speech having
the attributes "angry female" can be outputted. Further, arbitrary weights for each
speech attribute can be selected, depending on the voice characteristics desired in
the synthetic speech. In some embodiments, therefore, a synthetic speech can be outputted,
even if no respective training acoustic data with the selected attributes was received
during training. Further, the text converted to synthetic speech need not correspond
to the training text data, and a text can be converted to synthetic speech even though
no respective acoustic data for that text was received during the training process.
At least some of these technical effects are achieved through building an acoustic
model that is based on interdependencies of the attributes of the acoustic data. In
some embodiments, the present technology may provide synthetic speech that sounds
like a natural human voice, having the selected speech attributes. Instead of a 'post-processing'
approach seen in some of the prior art where an 'average' voice is synthesized then
adapted according to desired criteria, embodiments of the present method and system
provide a one-step generation of the desired synthetic speech. Furthermore, embodiments
of the present method and system can provide unique synthetic speech based on unique
and different combinations, along a continuous spectrum, of many different speech
attributes including language, emotion, and accent.
[0114] It should be expressly understood that not all technical effects mentioned herein
need to be enjoyed in each and every embodiment of the present technology. For example,
embodiments of the present technology may be implemented without the user enjoying
some of these technical effects, while other embodiments may be implemented with the
user enjoying other technical effects or none at all.
[0115] Modifications and improvements to the above-described implementations of the present
technology may become apparent to those skilled in the art. The foregoing description
is intended to be exemplary rather than limiting. The scope of the present technology
is therefore intended to be limited solely by the scope of the appended claims.