BACKGROUND TO THE INVENTION
[0001] The invention relates to a method for modelling speech synthesis, based on a human
vocal chord model that comprises two dynamically interlinked masses, which have positions
determining glottis model parameters. The physical modelling of human speech synthesis
is used to construct devices for producing synthesized human speech, such devices
being industrially applicable on a wide scale, for example in human-machine dialog
systems and announcement systems. Human vocal chords are the active elements for generating
voiced sounds such as vowels. Due to the interaction between the air flowing and the
human vocal chords, or folds, the instantaneous air flow through them is modulated
periodically. The opening between the vocal chords is called the glottis, and the
pulsating air flow is called the glottal pulse. Hereinafter, the terms vocal chords and
glottis are generally used as equivalents. The shape of the glottal pulse is changed on passing
the vocal tract and the lips. The resulting variations in atmospheric pressure at
the lips are perceived as speech. From a signal-processing point of view, the vocal
chords can be seen as a signal source, and the combination of vocal tract and lips
as a filter. Source-filter models are often used in speech synthesizers.
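As a rough illustration of the source-filter view, and not of the method of the invention, the
following sketch passes a periodic pulse train, standing in for the glottal pulses, through a
single two-pole resonator, standing in for the vocal tract and lips. The function names, the
100 Hz pitch, the 700 Hz resonance, the 100 Hz bandwidth and the 8 kHz sample rate are all
illustrative assumptions.

```python
import math

def pulse_train(n_samples, period):
    """Source: one unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter: two-pole resonator y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)   # pole radius
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    a2 = -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

sample_rate = 8000                      # assumed sample rate in Hz
period = sample_rate // 100             # assumed pitch of 100 Hz
source = pulse_train(4 * period, period)
speech_like = resonator(source, 700.0, 100.0, sample_rate)
```

The source and the filter are independent here; the models discussed below differ precisely in
that the source interacts with the air pressures around it.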
[0002] A method according to the preamble has been described in a publication by K. Ishizaka
& J.L. Flanagan, Synthesis of Voiced Sounds From a Two-Mass Model of the Vocal Cords, Bell
Syst. Tech. J. 51 (1972) pp. 1233-1267. In the publication by T. Koizumi et
al., J. Acoust. Soc. Am., Vol. 82 (1987), pages 1179-1192, further two-mass models
of the vocal cords are disclosed for modelling speech synthesis. Present models that
describe the interaction between vocal chords and air flow have proven to be computationally
costly, which has precluded their broad use in speech synthesizers.
SUMMARY OF THE INVENTION
[0003] Accordingly, amongst other things, it is an object of the present invention to provide
a method for modelling speech synthesis hardware that is open to straightforward
calculation, inter alia through the use of various sub-models as building blocks, each of
which is by itself of a relatively elementary nature.
[0004] Now, according to one of its aspects, the invention is characterized in that said
human vocal chord model comprises, in addition to a first sub-model describing said
dynamically interlinked masses acted upon by external forces, a second dynamic partial
model describing a contour of the glottis in terms of its geometry parameters and
including its dynamics, in dependence of position and velocity parameters produced
by said first sub-model, and outputting a first part of said external forces in addition
to transmitting said geometry parameters, and also a third hydrodynamic sub-model,
describing hydrodynamics of an air flow through the vocal chords in dependence of
said transmitted geometry parameters, and being acted upon by air pressure parameters,
for outputting a second part of said external forces and an actual volume velocity.
[0005] The invention also relates to a speech synthesizer device modelled according to the
method of the invention. Various further advantageous aspects of the invention
are recited in dependent Claims.
BRIEF DESCRIPTION OF THE DRAWING
[0006] These and other aspects and advantages of the invention will be described in detail
hereinafter with reference to the disclosure of preferred embodiments, and in particular
with reference to the appended Figures that show:
Figure 1 an exemplary speech signal for reference;
Figure 2 the glottal pulse for the same signal;
Figure 3 an elementary set-up of the two-mass model;
Figure 4 the model of Figure 3, built up from mechanical, geometrical, impact, and
fluid dynamics sub-models;
Figure 5 a state diagram description;
Figure 6 a complete speech synthesizer system.
Figures 7A,7B,7C give 17 equations used in the description and derivation of various
quantitative entities for the modelling.
DETAILED DISCLOSURE OF PREFERRED EMBODIMENTS
[0007] For reference, Figure 1 gives an exemplary speech signal as a function of time, in
this case representing a vowel 'a'. The vertical axis is in arbitrary units of intensity.
For the same signal, Figure 2 gives the amount of air flow between the vocal chords
or glottal pulse; clearly, both signals have the same basic frequency or pitch.
[0008] Figure 3 gives an elementary set-up of a two-mass model used in the modelling. For
approximating, the glottis is considered to be a two-dimensional obstruction with
a plane of symmetry at y=0, and a uniform cross-section in a direction perpendicular
to the plane of the Figure. In consequence, for the calculations only the upper half
of the Figure needs to be considered. During generation of the speech, air is generally
flowing from left to right in the Figure. The Figure shows masses m1, m2, damping constants
r1, r2, elastic coupling constants k1, k2, and the elastic coupling constant k12 between the
two masses. The positions of the masses are y1(t) and y2(t), respectively, whereas the
x-positions may be left out of consideration. The hydrodynamic forces on the masses, f1(t) and
f2(t), are functions of the geometry of the vocal chords, and of the air pressures pT(t) in
front of and pV(t) behind the vocal chords, respectively. The geometry of the contour, indicated
as a drawn line, can be given as a function yc(x,y1(t),y2(t)). The solution of the dynamic
equations will be a self-oscillating system that modulates the velocity uG(t) of the air mass
flowing through the obstruction. The volume velocity uG(t) itself also depends on the geometry
of the vocal chords and on the air pressures in front of and behind the obstruction. The
equations of motion of the system are given by EQ1 in Fig. 7, f1,0 and f2,0 being associated
with the rest positions of the two masses. The two equations may be solved by numerical methods
for finding the air flow. In the above, the relationships between the geometry and the
hydrodynamic forces are non-linear, which renders the solving of EQ1, and the modelling based
thereon, an intricate and time-consuming task. The present invention decomposes the overall
model into sub-models, and in a particularly preferred embodiment into four such sub-models.
[0009] Figure 4 gives the model of Figure 3, broken down into mechanical-, geometrical-,
impact-, and fluid dynamics sub-models, these models being furthermore connected by
interrelations that are shown as arrows. The four sub-models are as follows:
a. The mechanical sub-model describes the behaviour of the mass-elastic constant system
under arbitrary forces on the masses. This system is linear and can be realized with
elementary time-discrete signal processing operators. The inputs are the forces f1, f2; the outputs are the positions y and velocities (dy/dt) of the two masses in the y-direction.
b. The geometry sub-model describes the geometry of the vocal chords. Instead of by
the formal function yc(x,y1(t),y2(t)), the geometry is expressed in a set of parameters that are determined by the positions
of the masses, such as through parameters that define finite straight lines. Also
these relations can be expressed in a straightforward manner. The inputs to the geometry
sub-model are the outputs of the mechanical sub-model. The outputs are the positions
and velocities of the two masses in the y-direction, combined with the descriptions
of the instantaneous geometry, as dependent on the model equations used.
c. The impact sub-model describes the influence of impacts between the two vocal chord
halves on the geometry, and calculates the ensuing forces on the two masses. The inputs
to the sub-model are the outputs from the geometry sub-model. The outputs of the sub-model
are in the first place the reactive forces f1,C, f2,C on the glottis itself, that are fed back into the mechanical sub-model, and in the
second place, the corrected instantaneous geometry parameters.
d. The hydrodynamic sub-model describes the hydrodynamic forces on the two masses
as a consequence of the geometry and of the air pressures ...pT[n], pV[n] in front of and behind the vocal chords, respectively, and also calculates the volume
velocity uG[n] of the air through the glottis. The inputs are the instantaneous geometry parameters
outputted by the impact sub-model, and furthermore the instantaneous upstream and
downstream pressures. The outputs are further forces operating on the glottis itself,
and also, the instantaneous mass flow.
[0010] In Figure 4, all pressures and volume velocities through the glottis have been indicated
as time discrete quantities. The geometry sub-model, the impact sub-model, and the
hydrodynamic sub-model give instantaneous relations between the inputs thereto and
the outputs therefrom.
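The data flow between the four sub-models may be sketched as follows. This is a minimal sketch
under stated assumptions, not the implementation of the invention: all numerical values, the
semi-implicit Euler update in the mechanical sub-model, the contact stiffness `k_contact`, the
pressure-bearing area `face`, and the Bernoulli-type flow law in the hydrodynamic sub-model are
hypothetical placeholders chosen only to make the loop runnable.

```python
import math
from dataclasses import dataclass

RHO = 1.2  # air density in kg/m^3 (assumed value)

@dataclass
class Params:
    m: tuple = (1.25e-4, 2.5e-5)   # masses m1, m2 in kg (assumed)
    r: tuple = (0.02, 0.02)        # damping constants r1, r2 (assumed)
    k: tuple = (80.0, 8.0)         # elastic constants k1, k2 (assumed)
    k12: float = 25.0              # coupling constant between the masses (assumed)
    rest: tuple = (2e-4, 2e-4)     # rest positions in m (assumed)
    k_contact: float = 240.0       # hypothetical collision stiffness
    width: float = 1.4e-2          # glottal width perpendicular to the Figure (assumed)
    face: float = 3.5e-5           # hypothetical area on which the pressures act, m^2

def mechanical(state, forces, p, dt):
    """Sub-model a: advance the linear mass-elastic system by one sample."""
    y1, y2, v1, v2 = state
    a1 = (forces[0] - p.r[0] * v1 - p.k[0] * (y1 - p.rest[0]) - p.k12 * (y1 - y2)) / p.m[0]
    a2 = (forces[1] - p.r[1] * v2 - p.k[1] * (y2 - p.rest[1]) - p.k12 * (y2 - y1)) / p.m[1]
    v1, v2 = v1 + dt * a1, v2 + dt * a2      # semi-implicit Euler (an assumption)
    return (y1 + dt * v1, y2 + dt * v2, v1, v2)

def geometry(state):
    """Sub-model b: describe the contour by parameters, here just two half-openings."""
    return {"h1": state[0], "h2": state[1]}

def impact(geom, p):
    """Sub-model c: contact forces f1,C, f2,C and the corrected geometry."""
    forces = tuple(p.k_contact * max(-h, 0.0) for h in (geom["h1"], geom["h2"]))
    corrected = {name: max(h, 0.0) for name, h in geom.items()}
    return forces, corrected

def hydrodynamic(geom, p_t, p_v, p):
    """Sub-model d: pressure forces and volume velocity uG (hypothetical flow law)."""
    h_min = min(geom["h1"], geom["h2"])
    area = 2.0 * h_min * p.width             # both halves of the symmetric glottis
    u_g = area * math.sqrt(2.0 * max(p_t - p_v, 0.0) / RHO)
    return (p_t * p.face, p_v * p.face), u_g

def step(state, forces, p_t, p_v, p, dt):
    """One sample: a -> b -> c -> d, with the summed forces fed back to a."""
    state = mechanical(state, forces, p, dt)
    contact, geom = impact(geometry(state), p)
    hydro, u_g = hydrodynamic(geom, p_t, p_v, p)
    forces = (contact[0] + hydro[0], contact[1] + hydro[1])
    return state, forces, u_g

p = Params()
state, forces = (p.rest[0], p.rest[1], 0.0, 0.0), (0.0, 0.0)
flow = []
for _ in range(200):
    state, forces, u_g = step(state, forces, 800.0, 0.0, p, 1e-5)
    flow.append(u_g)
```

Only the mechanical sub-model carries state from one sample to the next; the other three are
plain functions of their current inputs, which matches the instantaneous relations noted above.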
DESCRIPTION OF THE MATHEMATICS
[0011] In the mechanical sub-model, the equations of state are given by EQ2 of Fig. 7, wherein
x(t) is a state vector having as entries the positions and velocities of the two masses, and
u(t) is the input vector with components f1(t) and f2(t). The output vector is given in a
similar way as the input vector. The state vector x(t) itself is given by EQ3 as having four
components. The input vector is given by EQ4 as having two components only. The output vector
is given by EQ5 as having four components. The four matrices A, B, C, D from EQ2 are given by
EQ6-9, respectively. Herein, matrix C is the unit (identity) matrix, and matrix D consists
exclusively of zeroes. In a more complex model, a non-zero version of this
matrix may be used.
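Since EQ2 to EQ9 appear only in Figure 7, the following sketch shows one conventional way of
writing a two-mass, spring-coupled system in the form dx/dt = A x + B u, y = C x + D u. The
state ordering [y1, y2, dy1/dt, dy2/dt], the measurement of positions relative to the rest
positions, and the helper names are assumptions made here and need not coincide with EQ3 to EQ9.

```python
def two_mass_state_space(m1, m2, r1, r2, k1, k2, k12):
    """Return A (4x4), B (4x2), C (4x4), D (4x2) for dx/dt = A x + B u, y = C x + D u.

    Assumed state ordering: x = [y1, y2, dy1/dt, dy2/dt]; input u = [f1, f2].
    """
    A = [
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
        [-(k1 + k12) / m1, k12 / m1, -r1 / m1, 0.0],
        [k12 / m2, -(k2 + k12) / m2, 0.0, -r2 / m2],
    ]
    B = [[0.0, 0.0], [0.0, 0.0], [1.0 / m1, 0.0], [0.0, 1.0 / m2]]
    C = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # identity
    D = [[0.0, 0.0] for _ in range(4)]                                  # all zeroes
    return A, B, C, D

def mat_vec(M, v):
    """Matrix-vector product for matrices stored as lists of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

# Made-up parameter values, chosen only so that the result is easy to check by hand.
A, B, C, D = two_mass_state_space(m1=1.0, m2=2.0, r1=0.1, r2=0.2, k1=3.0, k2=4.0, k12=0.5)

# State derivative with mass 1 displaced by one unit, all else at rest, no force applied.
dx = mat_vec(A, [1.0, 0.0, 0.0, 0.0])
```

With these values the displaced mass is accelerated back by (k1 + k12)/m1, while the second
mass is pulled along through the coupling constant k12.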
[0012] Now, the system has two input quantities, which leads to the definition of two impulse
responses. The first one, h1(t), is the response to a unit delta change in f1(t); the second
one, h2(t), is the response to a unit delta change in f2(t). They are given by EQ10,11,
respectively. A delta change has a pulse shape of finite area and infinitely small width. The
matrix A can in a conventional manner be diagonalized as expressed by EQ12, wherein λ1-λ4 are
the respective eigenvalues of A. Consequently, the two impulse responses h1(t), h2(t) can be
expressed as shown by EQ13,14, respectively.
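For orientation, the standard result for a linear state-space system with C equal to the
identity matrix and D equal to zero is given below. The symbols b_i (the i-th column of B), V
(a matrix of eigenvectors of A) and Λ are notation assumed here; the equations of Figure 7 are
not reproduced and may be arranged differently.

```latex
\begin{aligned}
h_i(t) &= C\,e^{At}\,b_i = e^{At}\,b_i, \qquad t \ge 0,\quad i = 1,2,\\
A &= V \Lambda V^{-1}, \qquad \Lambda = \operatorname{diag}(\lambda_1,\dots,\lambda_4),\\
h_i(t) &= V \operatorname{diag}\!\left(e^{\lambda_1 t},\dots,e^{\lambda_4 t}\right) V^{-1} b_i .
\end{aligned}
```

Each component of an impulse response is thus a weighted sum of the four exponentials
exp(λk t), which is what makes the diagonalized form convenient for sampling.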
[0013] Now, the object is to generate a time-discrete system with two inputs and four outputs,
with impulse responses g1[n], g2[n], such that gi[n]=hi(nT), i=1,2, wherein T is the sample
period. The state description of such a time-discrete system is given by EQ16. Herein, u is the
input given in EQ11, F=BxT is given by EQ7, G=C is given by EQ8, and H=0. The requirement of
EQ15, namely that the two impulse responses should be equal, is fulfilled if EQ17 is met, which
also yields E. Here, it indeed holds that D=0.
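The property underlying such a sampled system can be checked numerically: if the discrete
transition matrix is taken as E = exp(A T), then n discrete steps reproduce the continuous
response at t = nT, because exp(A nT) = (exp(A T))^n. The sketch below verifies this for a
hypothetical single mass-spring-damper; the parameter values, the helper names and the choice
of starting the recursion from the input column b are assumptions, and the input scaling
prescribed by EQ16 and EQ17 is not reproduced here.

```python
def mat_mul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def mat_exp(M, terms=20, squarings=10):
    """exp(M) by a Taylor series on M / 2**squarings followed by repeated squaring."""
    n = len(M)
    scaled = [[v / 2 ** squarings for v in row] for row in M]
    result, term = identity(n), identity(n)
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Hypothetical single mass-spring-damper with state x = [position, velocity].
m, r, k, T = 1.0, 0.4, 25.0, 0.01
A = [[0.0, 1.0], [-k / m, -r / m]]
b = [[0.0], [1.0 / m]]

E = mat_exp([[a * T for a in row] for row in A])      # E = exp(A T)

# Continuous impulse response sampled at t = 5T ...
h_5T = mat_mul(mat_exp([[a * 5 * T for a in row] for row in A]), b)

# ... compared with five applications of E to the initial state b.
x = b
for _ in range(5):
    x = mat_mul(E, x)
```

The two results agree to rounding error, so no numerical integration of the differential
equations is needed at run time; each sample costs one small matrix-vector product.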
[0014] Figure 5 gives a state diagram description realized as a set of interconnected blocks,
of which three form a linear chain, combined with one feedback block. The contents of the
various blocks 40, 42, 44 have been indicated by the associated capital letters from the above
equations, in particular the set of equations EQ16. The block T implements a unit delay. The
realization of the state diagram of Figure 5 in hardware
has been shown in Figure 4 already.
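A state diagram of this kind may be sketched in software as follows. The sketch assumes the
update x[n+1] = E x[n] + F u[n], y[n] = G x[n] with H = 0, with the chain formed by F, the unit
delay T and G, and with E as the feedback block; this assignment of letters to blocks is an
assumption, since neither Figure 5 nor EQ16 is reproduced here. The class name and the toy
matrices are hypothetical.

```python
class StateDiagram:
    """x[n+1] = E x[n] + F u[n],  y[n] = G x[n]  (H = 0), with a unit-delay register."""

    def __init__(self, E, F, G):
        self.E, self.F, self.G = E, F, G
        self.x = [0.0] * len(E)                      # contents of the unit delay T

    @staticmethod
    def _mat_vec(M, v):
        return [sum(a * b for a, b in zip(row, v)) for row in M]

    def step(self, u):
        y = self._mat_vec(self.G, self.x)            # output block G reads the delayed state
        fed_back = self._mat_vec(self.E, self.x)     # feedback block E
        driven = self._mat_vec(self.F, u)            # input block F
        self.x = [a + b for a, b in zip(fed_back, driven)]   # stored for the next sample
        return y

# Toy example with two states and one input; the numbers are made up.
sd = StateDiagram(E=[[0.5, 0.0], [0.0, 0.25]],
                  F=[[1.0], [2.0]],
                  G=[[1.0, 0.0], [0.0, 1.0]])
outputs = [sd.step([1.0 if n == 0 else 0.0]) for n in range(4)]
```

Driving the toy system with a unit pulse at n = 0 yields its impulse response: the input
reaches the output one sample later and then decays at the rates set by the feedback block.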
[0015] Technical applications of the modelling can be found in speech synthesis for traffic-,
flight-, or telephone information systems. In this respect, Figure 6 gives a complete
speech synthesizer system, with blocks 30, 32, 34, 36. The input quantity u[n] at block 36 is
the simulated volume velocity from the lungs. Block 36 itself is a tube model of the human
trachea as governed by its cross-sectional parameters inputted at left (38). The quantities
uV+[n], uV-[n], uT+[n], uT-[n] represent the forward and backward volume velocities (+,-),
respectively, at the upstream and downstream side (T,V), respectively, of the glottis
represented by block 32. Upstream is the trachea model, downstream the vocal tract model; both
of these are represented by means of a tube model that produces the forward and reflected
airstreams. The volume velocities are equivalent to the pressures pV[n], pT[n], and the glottal
volume velocity uG[n], respectively.
[0016] The relationship between the total glottal volume velocity uG and pT(k), pV(k) and
x(k-1) is known. The latter is the 'state vector' of the glottis as determined by the positions
and velocities of the two masses. In a formula, this is:
This depends on the model actually used and will therefore not be specified in detail.
Furthermore, the following holds:
[0017] Herein, ρ is the volume density of the air, c is the sound speed, AT is the area of the
trachea at the glottis, and AV is the area of the vocal tract at the glottis. Equations (1) and
(3) indicate that the air stream through the glottis is equal to the air stream through vocal
tract and trachea. Equations (2) and (4) indicate the relation between pressure and volume
velocity. The time indices have been omitted for clarity. The quantities uT+ and uV- are known.
Therefore, the above equations make it possible to calculate the other air speeds uT- and uV+
in front of and behind the glottis, respectively. In turn, these speeds make it possible to
calculate the control of the vocal tract and the eventual speech signal. In particular,
the set consists of four non-linear equations with four unknowns. The solution may
ensue by substituting equations (2) and (4) into (1) and (3), respectively. The expression
for uG is relatively straightforward, and quite often a quadratic function.
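The substitution described above may be illustrated as follows. Because equations (1) to (4)
and the glottal flow relation are not reproduced in this text, the sketch assumes a
conventional travelling-wave form for them (stated in the docstring) and a hypothetical
Bernoulli-type flow law that makes the expression for uG quadratic. The function name, the sign
conventions, the glottal area `area_g` and all numerical values are assumptions.

```python
import math

def solve_glottal_junction(u_t_plus, u_v_minus, area_t, area_v, area_g,
                           rho=1.2, c=343.0):
    """Solve the assumed junction equations for uG, uT-, uV+, pT and pV.

    Assumed relations (the signs are a convention chosen here):
      (1) uG = uT+ - uT-          (2) pT = (rho*c/AT) * (uT+ + uT-)
      (3) uG = uV+ - uV-          (4) pV = (rho*c/AV) * (uV+ + uV-)
    Hypothetical flow law: pT - pV = (rho/2) * (uG/Ag)**2 for uG >= 0.
    """
    z_t, z_v = rho * c / area_t, rho * c / area_v
    drive = 2.0 * (z_t * u_t_plus - z_v * u_v_minus)   # value of pT - pV when uG = 0
    if area_g <= 0.0 or drive <= 0.0:
        u_g = 0.0                                      # closed glottis or no forward drive
    else:
        kq = rho / (2.0 * area_g ** 2)
        z = z_t + z_v
        # Substituting (1)-(4) into the flow law gives kq*uG**2 + z*uG - drive = 0;
        # the positive root is the physically meaningful one.
        u_g = (-z + math.sqrt(z * z + 4.0 * kq * drive)) / (2.0 * kq)
    u_t_minus = u_t_plus - u_g                         # from (1)
    u_v_plus = u_g + u_v_minus                         # from (3)
    p_t = z_t * (u_t_plus + u_t_minus)                 # from (2)
    p_v = z_v * (u_v_plus + u_v_minus)                 # from (4)
    return u_g, u_t_minus, u_v_plus, p_t, p_v

# Illustrative values in SI units (m^3/s and m^2); all are assumptions.
u_g, u_t_minus, u_v_plus, p_t, p_v = solve_glottal_junction(
    u_t_plus=5e-4, u_v_minus=1e-5, area_t=2e-4, area_v=3e-4, area_g=1e-5)
```

Under these assumptions the quadratic is solved in closed form at each sample, so no iterative
root finding is required at the junction.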
[0018] In the arrangement of Figure 6, block 32 represents the hydrodynamic sub-model according
to the invention as governed by the glottal parameters inputted at 40. Block 32 according to
the invention interacts with block 34, which contains the mechanical-, geometry-, and impact
sub-models that have been considered earlier. Block 30, like block 36, represents a tube model
of the human vocal tract as governed by its cross-sectional parameters inputted at left (42).
The ultimate output quantity is the air pressure p[n] at the lips.
1. A method for modelling speech synthesis, based on a human vocal chord model that comprises
two interlinked masses, which have positions determining glottis model parameters,
characterized in that said human vocal chord model comprises, in addition to a
first sub-modelling step (20) describing said dynamically interlinked masses acted
upon by external forces, a second dynamic partial modelling step (22,24) describing
a contour of the glottis in terms of its geometry parameters and including its dynamics,
in dependence of position and velocity parameters produced by said first sub-modelling
step, and outputting a first part of said external forces in addition to transmitting
said geometry parameters, and also a third hydrodynamic sub-modelling step (26), describing
hydrodynamics of an air flow through the vocal chords in dependence of said transmitted
geometry parameters, and being acted upon by air pressure parameters, for outputting
a second part of said external forces and an actual volume velocity.
2. A method as claimed in Claim 1, wherein said second dynamic partial modelling step
is decomposed into an instantaneous geometry sub-modelling step (22) describing an
external geometry of said glottis, in dependence of behaviour of said first sub-modelling
step, and an impact sub-modelling step (24) fed by said external geometry describing
a colliding behaviour of said vocal chords, for outputting the first part of said
external forces and for transmitting said geometry parameters.
3. An apparatus for modelling speech synthesis, having modelling means for a human vocal
chord model that comprises two interlinked masses, which have positions determining
glottis model parameters,
characterized in that said modelling means comprise, in addition to first sub-modelling
means describing said dynamically interlinked masses acted upon by external forces,
second partial modelling means for a dynamic partial model describing a contour of
the glottis in terms of its geometry parameters and including its dynamics, in dependence
of position and velocity parameters received from said first sub-modelling means,
and having output means for outputting a first part of said external forces in addition
to transmitting said geometry parameters, and also third hydrodynamic sub-modelling
means for describing hydrodynamics of an air flow through the vocal chords in dependence
of said transmitted parameters, and being acted upon by air pressure parameters, and
having further output means for outputting a second part of said external forces and
an actual volume velocity.
4. An apparatus as claimed in Claim 3, wherein said second dynamic partial modelling
means are decomposed into instantaneous geometry sub-modelling means describing an
external geometry of said glottis, in dependence of behaviour of said first sub-model,
and impact sub-modelling means fed by said external geometry describing a colliding
behaviour of the vocal chords, for outputting the first part of said external forces
and for transmitting said geometry parameters.
5. An apparatus as claimed in Claims 3 or 4, comprising an excitatory signal source feeding
a trachea tube model, wherein said vocal chord model has its hydrodynamic sub-model
arranged downstream of the trachea tube-model, and wherein a vocal tract tube-model
is fed by said hydrodynamic sub-model for outputting a lip pressure signal.