[0001] The invention relates to a method and a system for synthetic modeling of a sound
signal.
Background
[0002] The ability to reconstruct dynamical information from time-series has been the subject
of research in many different scientific fields. It has been applied successfully
to solve problems related to prediction, qualitative description of dynamics and signal
classification. In music, dynamic modeling has been proposed as the means for modeling
and synthesizing natural sounds as well as a tool for creating sound visualizations.
Although the ability of such models to reproduce stationary signals or signals with
slowly varying dynamics has been demonstrated, their use for modeling sounds, e.g. non-stationary
sounds such as drums, has not yet been addressed.
Summary
[0003] It is an object to provide improved techniques for modeling a sound signal.
[0004] This object is achieved by the method according to claim 1, the system of claim 13
and the computer program product according to claim 14. Further embodiments are subject
matter of dependent claims.
[0005] In one aspect, a method for synthetic modeling of a sound signal is provided. The method
is performed by a processor of a computing device and comprises the following steps:
providing an input sound sample as a target signal, analyzing the input sound sample
to determine a model of a differential equation, wherein the differential equation
is represented by at least one function, wherein the at least one function depends
on at least one variable and at least one parameter, determining the at least one
function, the at least one parameter and an initial condition of the differential
equation so that the differential equation provides a dynamic model of the target
signal, and determining the quality of the dynamic model using an optimality condition.
[0006] In another aspect, a system for synthetic modeling of a sound is disclosed. The system
comprises a processor and a memory and is configured to: provide an input sound sample
as a target signal, analyze the input sound sample to determine a model of a differential
equation, wherein the differential equation is represented by at least one function,
wherein the at least one function depends on at least one variable and at least one
parameter, determine the at least one function, the at least one parameter and an
initial condition of the differential equation so that the differential equation provides
a dynamic model of the target signal, and determine the quality of the dynamic model
using an optimality condition.
[0007] In a further aspect, a computer program product is provided, which, when executed
by a processor, executes the method for modeling the sound signal. The computer program
product can be stored on a non-transitory storage medium.
[0008] The differential equation can be represented by one or more functions. The function(s)
can also be called basis function(s) and can be provided from a pool of functions.
Each function can depend on one or more variables and / or one or more parameters.
[0009] The optimality condition may be determined by a least square method using Newton's
method, for example.
[0010] The input sound signal can be any sound signal, for example a drum sound, a horn
sound (e.g. from a brass instrument), a clarinet sound or a cymbal sound.
[0011] In one embodiment, the model of the differential equation can be a stochastic model
and the at least one function can be a non-linear function. This allows improved modeling
of non-linear sound signals, wherein the frequency of the signal depends on the amplitude.
Hereby, a damping of the sound can be modeled efficiently. Using non-linear functions
suits the nature of speech as well as that of many physical instruments which exhibit
nonlinear behavior and whose resynthesis has therefore not been solved in a satisfactory
manner by linear methods. The use of a nonlinear model in conjunction with percussive
instruments appears justified, since it is generally accepted that several nonlinear
mechanisms are involved in the physics of membranophones and cymbals. These include
the dependence of frequency on amplitude, irregularities in the decay profile of distinct
modes of vibration and nonlinear coupling between different modes.
[0012] The at least one parameter of the at least one function can be changed dynamically
according to a predefined function or to a user input.
[0013] The method may further comprise decomposing the input sound sample into several components.
[0014] In a further embodiment, the method may further comprise determining a number of
dimensions of the model of the differential equation.
[0015] The method may further comprise providing at least one time-delayed version of the
input sound sample, and analyzing the at least one time-delayed version of the input
sound sample to determine the model of the differential equation.
[0016] In another embodiment, the method may comprise determining a time derivative of the
input sound sample, and analyzing the time derivative of the input sound sample to
determine a reconstructed differential equation or the model of the differential equation,
wherein, if applicable, the quality of the model of the differential equation is determined
by an optimality condition. The time derivative of the input sound sample may be a
first order or a higher order (second order, third order, ...) time derivative.
[0017] The method can comprise a step of transforming a real valued component of the input
sound sample into a complex value.
[0018] The at least one parameter of the at least one function can be a constant.
[0019] The at least one function can be a time-dependent function.
[0020] An envelope of the input sound signal can be modeled by a dynamic equation, wherein
the dynamic equation comprises at least one further differential equation. By modeling
the envelope, parts of the sound signal that vary quickly in time can be captured.
[0021] The method may also comprise a step of outputting a sound signal which is generated
by evolving the model of the differential equation in time. For example, the differential
equation may be integrated forward in time.
[0022] The disclosure refers to the usage of a computing device. The computing device may
comprise one or more processors configured to execute instructions. Further, the computing
device may comprise a memory in form of volatile memory (e.g. RAM - random access
memory) and / or non-volatile memory (e.g. a magnetic hard disk, a flash memory).
The device may further comprise means for connecting and / or communicating with other
(computing) devices, for example by a wired connection (e.g. LAN - local area network,
Firewire (IEEE 1394) and / or USB - universal serial bus) or by a wireless connection
(e.g. WLAN - wireless local area network, Bluetooth and / or WiMAX - Worldwide Interoperability
for Microwave Access). The computing device may comprise a device for registering
user input, for example a keyboard, a mouse and / or a touch pad. The device may comprise
a display device or may be connected to a display device. The display device may be
a touch-sensitive display device (e.g. a touch screen).
[0023] The features described in context of the method also apply to the system and vice
versa.
Description of embodiments
[0024] Reference is made to figures of a drawing:
- Fig. 1
- shows the spectral magnitude in (a) and envelope in (b) of the most dominant modal component
of a 10x9' Sonor tom for three realizations of different impact level: high (dashed
line), medium (solid line), and small (dotted line). The first 500 samples of the
realization with the lowest impact level are shown in (c) and its corresponding phase-space
trajectory in (d). One clearly recognizes that the orbit intersects itself in (d).
- Fig. 2
- shows a synthesized signal in time domain in (a) and in the frequency domain in (b).
The amplitude characteristics coincide exactly with that of the predictable system
shown with the red line in (a).
- Fig. 3
- shows a further synthesized signal in time domain in (a) and in the frequency domain
in (b). The amplitude characteristics coincide exactly with that of the predictable
system shown with the red line in (a).
- Fig. 4
- shows another synthesized signal in time domain in (a) and in the frequency domain
in (b). The amplitude characteristics do no longer coincide with that of the predictable
system.
- Fig. 5
- shows the effect of amplitude modulation produced by a non-linearity of the form 800(y+y*)y. Synthesized signal in time domain in (a) and in the frequency domain in (b).
- Fig. 6
- shows the effect of frequency modulation produced by a non-linearity of the form 800(y-y*)y. Synthesized signal in time domain in (a) and in the frequency domain in (b).
- Fig. 7
- shows a Graphical User Interface of the multi-channel ODE synthesizer.
- Fig. 8
- shows a work-flow of processes.
1. General information
[0025] Reconstruction of system dynamics from a collection of observed time series was first
studied in the context of chaotic systems. The so called Embedding Theorem states
that even when there is only a single measured quantity from a dynamical system, it
is possible to reconstruct a state space that is equivalent to the original (but unknown)
state space composed of all the dynamical variables. In many cases, a higher-dimensional
space than that provided by the observed quantities is required in order to
obtain such a mapping. The two most common approaches to accomplish this
are the use of time delays and the use of sequential derivatives. Based on the second
approach, a technique called differential embedding is used in the present disclosure.
[0026] Differential embedding uses the time derivatives of the observed quantity as the
natural set of independent coordinates. Letting y_1(t) denote the scalar observable at
time t, the reconstructed system then takes the so called standard form

    ẏ_i = y_{i+1},  i = 1, ..., M-1,
    ẏ_M = f(y_1, y_2, ..., y_M),

where f(y_1, y_2, ..., y_M) is called the standard function. All the parameters in the
standard function f(·) have to be determined so that the above model of ordinary
differential equations (ODE) demonstrates a behavior qualitatively similar to the original
dynamical system. To solve this problem, a model for f(·) has to be chosen; the involved
parameters can then be evaluated by using L2 approximations, i.e., least-squares methods.
Assuming that the standard function f(·) has been properly defined, then, given an
appropriate initial condition, forward integration of the standard system can be used
for re-synthesis of the original sound.
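As an illustration of this construction, the following sketch (Python with NumPy; the function name and signal parameters are illustrative assumptions, not part of the disclosure) builds the coordinates y_1, ..., y_M of a differential embedding from a sampled signal by repeated numerical differentiation:

```python
import numpy as np

def differential_embedding(y, fs, M):
    """Build an M-dimensional differential embedding of a scalar signal:
    the coordinates are the signal and its successive numerical time
    derivatives, y_1 = y, y_2 = dy/dt, ..., matching the standard form."""
    coords = [np.asarray(y, dtype=float)]
    for _ in range(M - 1):
        # np.gradient uses second-order central differences in the interior
        coords.append(np.gradient(coords[-1], 1.0 / fs))
    return np.stack(coords)            # shape (M, len(y))

# Example: a decaying sinusoid sampled at 8 kHz
fs = 8000
t = np.arange(800) / fs                # 0.1 s of signal
y = np.exp(-30 * t) * np.sin(2 * np.pi * 220 * t)
Y = differential_embedding(y, fs, M=3)
print(Y.shape)                         # (3, 800)
```

For noisy recordings, more sophisticated differentiation schemes would be required, as noted later in this disclosure.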
[0027] In general, drum sounds do not exhibit harmonic structure and this raises the number
of dimensions M which are required for the proper modeling of the dynamics. In order
to handle this problem, it would make sense not to work with the recorded signal directly,
but to decompose the signal into a collection of simpler components and to consider
a single dynamical system for each component. In that sense, simple filtering can
be used in order to isolate parts of the signal spectrum that corresponds to distinct
modes. Alternatively, decompositions in terms of wavelets or other basis functions
can be used in order to decompose the signal.
2. Non-linearities in mode vibration
[0028] An important first step in the attempt to synthesize high quality naturally sounding
percussive sound is to demonstrate ability in capturing and synthesizing the characteristics
of distinct modes of vibration. One may then add just a few modes that correspond
to the most prominent modes of the impacted structure in order to create a more complete
impression of the acoustic timbre. However, even the modeling of single modes poses
challenges that are not at all trivial. To a large degree, these challenges are related
to nonlinear phenomena. If a string or a membrane is excited with a very large displacement,
the assumption of a constant average value of tension does not hold any more; the
time-averaged tension exhibits a decline over time before converging to its nominal
value. As the fundamental frequency is proportional to tension, a decline of the fundamental
frequency over time is also observed. This so called "tension modulation" effect is
common not only in membranes but also in string instruments. In dynamical systems,
this dependence of frequency on amplitude is well-known in the theory of nonlinear
oscillators.
[0029] Resynthesis of the decay profiles of normal modes is an additional significant challenge.
Linear theory dictates that the energy of vibration decays exponentially, meaning
that the measured sound should lose energy at a constant dB/sec rate. In analysis
of musical instruments, however, one often observes non-constant decay profiles. In
piano notes, for example, a double decay profile has been reported, as well as beat
phenomena caused by the coupling between the strings. In general, irregular decay
profiles may be explained by nonlinearities characterizing the oscillation, by coupling
between adjacent modes, or they can simply be the result of room reverberation being
superimposed on the direct sound from the recorded instrument. Regardless of what
is the source of these "irregularities", the ability of a synthesis model to capture
and later reproduce these effects is an important requirement in the attempt to improve
the faithfulness of reproduction.
[0030] We illustrate an example here from real recordings made with a 10x9' Sonor tom. We
have used three different recordings of the same instrument struck at successively
increasing impact levels, where we have used low-pass filtering with a cut-off frequency
of 220 Hz in order to isolate the first normal mode. Fig. 1(a) and (b) illustrate
the spectral magnitude and the envelope decay of this first mode for each one of
the three realizations. A first thing to observe here is the effect of tension modulation,
which becomes evident from the broadening of the spectral peak in the recording corresponding
to the highest impact level in sub-figure (a). For the same recording, a non-constant decaying
behavior becomes apparent at the initial part of the signal in Fig. 1(b). The mode
seems to decay at a constant rate only after some significant amount of time has passed,
which is probably a consequence of the average membrane tension relaxing to its nominal
value. On the other hand, the mode appears to decay fairly constantly for the two
realizations corresponding to the two lowest impact strengths. Additional differences
within these three recordings concern the attack time (symbolized by tp in Fig. 1(c)).
In particular, we observe a slight decrease of the time of maximum amplitude as we
go from the lowest impact level to the highest one. These observations indicate that
the same modal component is not only quantitatively but also qualitatively different
between realizations corresponding to different impact levels.
[0031] There are therefore two important requirements that a synthesis model should fulfill:
first, the model used for synthesis should account for the various types of nonlinearities
that characterize the mode vibration, second, it should be able to reproduce the qualitative
diversity that characterizes the physical system at different impact levels. A straightforward
question then arises; is it possible to construct and use a single model which will
be valid for the entire dynamic range of consideration, or should one use a different
model for different dynamic levels? While the variance of the impact level is probably
the most important expressive attribute in drum performance, we believe that this
question is far from trivial. In what follows these challenges are addressed
from the perspective of a simple nonlinear dynamical model.
[0032] When building a model, a choice of great practical significance is the embedding
dimension M of the reconstructed state space, which somehow has to be consistent with
the inherent dimension of the original physical system. Indeed, the classical approach
in dynamic modeling would be to set the value of M according to the properties of the
target signals. However, large values of M may complicate the approach; the number of
parameters involved in the definition of the dynamic model rises significantly, and the
physical meaning of each parameter becomes less and less clear. From the point of
view of a synthesis model, this could completely prevent the establishment of a meaningful
mapping between parameter values and perceptually meaningful characteristics of the
synthesized sounds. In view of these constraints, and having in mind that we intend
to use dynamic modeling for synthesis purposes, we feel the need to reverse the
classical question regarding the embedding dimension into the form "what part of the
signal can we successfully model given a particular value of M?"
[0033] In general, drum sounds do not exhibit harmonic structure and this raises the embedding
dimension which is required for the proper modeling of the dynamics. In order to handle
this problem, it would make sense not to work with the recorded signal directly, but
to decompose the signal into a collection of simpler components and to consider a
single dynamic system for each component. As it will be shown in the following, a
first order complex ODE is an excellent dynamic system for capturing and resynthesizing
the characteristics of distinct modal components. The latter may be isolated from
the recorded signal by simple low-pass or band-pass filtering.
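A minimal sketch of such a decomposition (a brick-wall FFT band-pass in Python with NumPy; the function name, signal and band edges are illustrative assumptions — in practice any low-pass or band-pass filter design may be used):

```python
import numpy as np

def isolate_mode(signal, fs, f_lo, f_hi):
    """Isolate one modal component by zeroing FFT bins outside [f_lo, f_hi].
    A crude brick-wall band-pass for illustration only."""
    X = np.fft.rfft(signal)
    f = np.fft.rfftfreq(len(signal), 1.0 / fs)
    X[(f < f_lo) | (f > f_hi)] = 0.0
    return np.fft.irfft(X, n=len(signal))

fs = 8000
t = np.arange(int(0.5 * fs)) / fs
# A toy signal with two decaying modes at 110 Hz and 400 Hz
s = (np.exp(-5 * t) * np.sin(2 * np.pi * 110 * t)
     + 0.5 * np.exp(-8 * t) * np.sin(2 * np.pi * 400 * t))
mode1 = isolate_mode(s, fs, 60.0, 160.0)   # keeps only the 110 Hz mode
```

Each isolated component can then be fed to its own dynamical system, as described above.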
[0034] A low dimensional dynamical model can be used as the means for synthesizing decaying
oscillations. The basic form of the dynamical model is a non-autonomous first order
complex ODE. Here, we use the same fundamental form but from an analysis-synthesis
perspective, meaning that the model parameters have to be defined in accordance with
real recordings of percussive instruments. The basic model formalism reads

    ẏ = f(y, t),  y(0) = y_0,

where f(y, t) is in general a nonlinear function of the complex state variable y and
of time t, and y_0 ∈ ℂ is the initial condition. The function is defined explicitly below,
together with the reconstruction method which shows a way to systematically build it
from data. Based on this model, a complex signal can be generated by integrating the
dynamical system forward in time as

    ŷ(t) = y_0 + ∫_0^t f(y(τ), τ) dτ,

and the real or imaginary part of the signal may be used for playback.
[0035] Observe that as the state variable y is complex, it can be decomposed in terms of
an envelope r = |y| and a phase angle φ as y = r exp(jφ). The simplest example one may
figure out for the above model is the well-known harmonic, decaying oscillator

    ẏ = (σ + jω) y,  y(0) = y_0,

where σ + jω and y_0 are complex model parameters and y is the single (in this case)
model basis term. It is trivial to directly establish a mapping between the model
parameters and the characteristics of an exponentially decaying complex sinusoid; with
σ we may control the decay rate of the oscillation, with ω = 2πf we may control the
frequency, while with y_0 we may adjust the phase and the amplitude of the trajectory
at t = 0. In this particular example the ODE may be solved analytically in order to
derive a closed form of the synthesis result as ŷ(t) = y_0 e^{σt} e^{2jπft}. Based on
this basic model, nontrivial dynamics may be introduced by additional time-dependent
and nonlinear basis terms, as will be shown in the sections that follow.
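Because this linear ODE has a closed-form solution, stepping it forward by one sample period is an exact recursion, which a short Python sketch can verify (all parameter values below are illustrative assumptions):

```python
import numpy as np

# Decaying oscillator dy/dt = (sigma + j*omega)*y with y(0) = y0.
# Advancing by one sample period T = 1/fs uses the exact discrete map
# y[i] = y[i-1] * exp((sigma + j*omega) * T).
fs = 44100
sigma, f0 = -40.0, 220.0
lam = sigma + 2j * np.pi * f0
y0 = 0.5 * np.exp(1j * np.pi / 4)   # sets initial amplitude and phase

n = 4410                            # 0.1 s of signal
step = np.exp(lam / fs)
y = y0 * step ** np.arange(n)       # cumulative application of the map

# Closed form y(t) = y0 * exp(sigma*t) * exp(2j*pi*f0*t)
t = np.arange(n) / fs
y_exact = y0 * np.exp(lam * t)
audio = y.real                      # real part used for playback
```

For the nonlinear models introduced below no closed form exists, and a numerical forward integration scheme takes the place of the exact map.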
A. Envelope-dependent basis functions
[0036] As additional functions that may allow us to capture the intrinsic nonlinear behavior
of percussive instruments we propose terms of the form |y|^n y, with n a positive
integer, where

    r = |y| = √(y y*)

is the envelope of the signal ('*' denotes complex conjugation). This classical nonlinearity
provides a very natural way of relating the nonlinear behavior to the amplitude of
the oscillation.
[0037] Let us consider an example of a single term of order n added to the basic model of
Eq. 2.3 with complex weight b = b_R + j b_I as

    ẏ = (σ + jω) y + b |y|^n y,  y(0) = y_0.

[0038] By expansion of the equations with respect to b one finds the corrections to the
frequency caused by the nonlinear term. For n = 2, and real variable and coefficients,
this is a classical example. An easier way to understand the influence of the nonlinear
term is the transformation of the system above into terms of amplitude r and phase
φ as

    ṙ = σ r + b_R r^{n+1},
    φ̇ = ω + b_I r^n.

[0039] It is now easy to observe that different values of b_I are expected to influence only
the frequency of the oscillation, while the value of b_R will influence both the envelope
and the frequency, as the frequency φ̇ is coupled to the envelope r. Observe that for
b_I > 0 the instantaneous frequency will vary in proportion to the envelope of the signal,
which is exactly what is required for tension modulation. On the other hand, the value
of b_R is an additional important sound-morphing parameter as it may be exploited for
reproducing a more complex decay profile in the envelope of the signal.
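The amplitude/phase behavior can be checked numerically. The following Python sketch (all parameter values are hypothetical) integrates the envelope equation with forward Euler and tracks the instantaneous frequency, which starts above ω/2π and relaxes toward it as the envelope decays, mimicking tension modulation:

```python
import numpy as np

# Amplitude/phase form of dy/dt = (sigma + j*omega)*y + b*|y|^n*y:
#   dr/dt   = sigma*r + b_R * r**(n+1)
#   dphi/dt = omega + b_I * r**n
# With b_I > 0 the instantaneous frequency rises with the envelope.
def simulate(r0, sigma=-30.0, f0=200.0, bR=0.0, bI=800.0, n=1,
             fs=44100, dur=0.2):
    steps = int(dur * fs)
    r = np.empty(steps)
    inst_f = np.empty(steps)
    r[0] = r0
    for i in range(steps):
        inst_f[i] = f0 + bI * r[i] ** n / (2 * np.pi)  # in Hz
        if i + 1 < steps:
            # forward Euler step of the envelope equation
            r[i + 1] = r[i] + (sigma * r[i] + bR * r[i] ** (n + 1)) / fs
    return r, inst_f

r, f_inst = simulate(r0=1.0)
# f_inst starts near f0 + bI/(2*pi) and decays toward f0 with the envelope
```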
B. Time-dependent basis functions
[0040] Similar to many musical sounds, drum modes do not decay monotonically; they undergo
an attack region where the amplitude builds up, reaching a maximum level before
it starts to decay. The exact characteristics of the attack region are expected to
depend on the type of excitator that is used (e.g. mallet or hand), and the
correct synthesis of the attack part of the signal is known to be very crucial for
achieving a naturally sounding result.
[0041] From the perspective of dynamical reconstruction, the existence of an attack region
in the signal has a direct effect on the embedding dimension which is required for
"unfolding" the attractor. See for example the phase-space plot of the tom sound in
Fig. 1(d). One may observe that as the envelope is no longer a one-to-one function of
time, the orbit is doomed to intersect itself. This means that we would need an
embedding dimension higher than two in order to capture the system dynamics. Rather
than increasing the dimension of our model, it is shown in what follows that this
problem may be tackled by inserting time-dependence into the ODE.
[0042] There is probably a large set of time-dependent functions that may serve the intended
goal. We prefer to use a simple time-dependent basis function of the form y/(t + ε),
where ε is a positive smoothing constant. The choice of this function is motivated by the
so called gammatone function, as it provides a very convenient way of controlling the
attack time of the oscillation. We will simply illustrate the applicability of the
introduced time-dependent term with a complex weight a = a_R + j a_I by looking into
the simple example

    ẏ = (σ + jω) y + a y/(t + ε),  y(0) = y_0.

[0043] Again, by decoupling the previous equation in terms of amplitude r and phase φ, we
may rewrite the previous system as

    ṙ = (σ + a_R/(t + ε)) r,
    φ̇ = ω + a_I/(t + ε).

[0044] By focusing on the envelope, it can be easily understood why such a term may introduce
the desired amplitude build-up. Assuming that σ < 0 and a_R > 0, it is easy to observe
that while t < a_R/(−σ) − ε the envelope of y(t) has a positive slope. At
t = a_R/(−σ) − ε the slope is zero, and for t larger than this value the envelope of
y(t) will decay. On the other hand, the imaginary part of a will introduce a kind of
time-varying behavior to the frequency, again leaving the evolution of the envelope
unaffected. Due to the way that time t is incorporated in the function, the presented
terms are expected to influence the trajectory of the generated signal only at the
beginning of the oscillation. Their influence for t >> t_p will be negligible in
comparison to the other terms of the ODE.
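The envelope equation above can be integrated in closed form, r(t) = r_0 ((t + ε)/ε)^{a_R} e^{σt}, and its peak location checked numerically. A short sketch (parameter values are illustrative assumptions):

```python
import numpy as np

# Envelope ODE with the gammatone-like term: dr/dt = (sigma + a_R/(t+eps)) * r.
# Its solution r(t) = r0 * ((t+eps)/eps)**a_R * exp(sigma*t) builds up and
# then decays; the slope is zero at t_p = a_R/(-sigma) - eps.
sigma, aR, eps = -200.0, 2.0, 1e-3
fs = 44100
t = np.arange(int(0.05 * fs)) / fs
r = ((t + eps) / eps) ** aR * np.exp(sigma * t)   # r0 = 1

t_peak = t[np.argmax(r)]
# t_peak lies close to aR/(-sigma) - eps = 0.009 s
```

This confirms that a_R and ε together set the attack time t_p of the oscillation.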
C. Proposed Model
[0045] Based on what has been discussed so far, we propose the following dynamical model
for synthesis of drum modes

    ẏ = Σ_{n=0}^{N−1} b_n |y|^n y + Σ_{m=1}^{M} a_m y/(t + ε_m),  y(0) = y_0,

where K = N + M and y_0 is the initial condition. In this formulation, we let the basis
function corresponding to n = 0 carry the fundamental frequency and constant decay rate
of the dynamical model as b_0 = σ_0 + jω_0. The presented model therefore consists of
the parameter set Θ = {c, e, y_0}, where c = [c_1, ..., c_K]^T is the vector with the
ODE coefficients and e = [ε_1, ..., ε_M]^T is the vector with the smoothing constants.
[0046] The parameters in Θ must be optimized so that the generated trajectory is quantitatively
similar to an observation of the recorded modal component. Observe that the presented
model is a linear function of all the parameters in Θ except for the smoothing constants
ε_1, ..., ε_M. In order to simplify things a bit, we admit in the following the use of
fixed values for the smoothing constants in e. This way the parameter set is reduced
and the problem is also brought into a form which is appropriate for applying linear
reconstruction.
D. Mapping of the impact level onto the synthesis parameters
[0047] Variance of the impact level is a very important expressive factor in drum performance
and therefore any synthesis method intended for the generation of percussive sounds
should somehow incorporate this parameter in the synthesis process. Considering this
requirement, we now redefine the modeling problem in the form

    ẏ = f(y, t; p),  y(0) = y_0(p),

where p ∈ ℝ+ is a scalar that represents the impact level or the intensity with which
the instrument is excited. Ideally, this scalar should be a free parameter for the user
to control the characteristics of the generated sound. The above requirement would be
fulfilled if we could find a continuous mapping between the impact level p and the
model parameters in the form Θ(p) = {c(p), e, y_0(p)}, with c(p) being a continuous
vector function of p and y_0(p) a continuous complex function of p. This represents a
non-trivial challenge, considering that, during the reconstruction phase, the dynamics
of the modeled mode are available only at a finite number of J arbitrary training
impact levels p_1, p_2, ..., p_J. We will show in the sections that follow that there
is a simple way of morphing between the dynamics of different impact levels in order
to derive a continuous mapping for a range p_min ≤ p ≤ p_max, where p_min and p_max
are respectively the minimum and maximum training impact levels that characterize the
observation signals.
3. Model optimization and parametrization
A. Embedding
[0048] An embedding, in particular a differential embedding, is a quite abstract construct
for describing a map between two spaces, in our case differential manifolds.
The manifold is the object under consideration here; it corresponds to the phase
space occupied by the trajectory of our dynamical system - the percussive instrument.
So, we are searching for a model in a certain underlying space, given the data, i.e. the
sound signal recorded from an instrument. This model lives on a differentiable manifold
which in turn might be embedded in a function space of once-differentiable vector-valued
functions, in correspondence to Eq. 2.1.
[0049] Now, theory ensures that such an embedding exists. In the following, we describe
how we apply and improve the method in practice for drum sounds. We present an embedding
technique which assumes the collection of J different recordings of a percussive instrument
acquired at different impact levels, spanning a wide dynamic range. Such measurements
are provided by recordings of the acoustic signal which we assume are consistent
with respect to factors such as playing tactic, type of excitator or location of impact,
but vary with respect to the impact level. The recordings are available to us in
digitized form represented by

    s_j = [s_j(t_0), ..., s_j(t_i), ..., s_j(t_{I−1})]^T,

where t_i = iT are the discrete time locations, F_s = 1/T is the sampling rate and
j = 1, 2, ..., J is the index of the recorded sound sample.
[0050] With respect to our proposed dynamic model, three main preparation steps are considered
before differential embedding can be performed. First of all, the real-valued discrete
signal must be converted into a complex-valued signal. This so called analytic representation
of the discrete signal can be derived in the frequency domain by setting the negative
frequency indexes to zero and then performing the inverse Fourier transform. In the
second step, the complex-valued signal must be filtered in order to isolate the modal
components of interest. For this purpose, low-pass or band-pass filtering may be applied
in order to separate the modal component of interest from the remaining part of the
signal. Third, the filtered signal must be cropped in time so that t_0 = 0 corresponds
to the beginning of the signal. An additional crucial step before reconstruction is the
calculation of the derivative of the input signal. The choice of the method for the
calculation of the derivative is a non-trivial concern, especially when the signal is
contaminated by noise, in which case more sophisticated techniques are required. At
least for the needs of our work, the signals to be differentiated are expected to be
relatively smooth because all high frequency content is filtered out. Therefore, simple
numerical methods of differentiation are expected to work as well. A last step considers
the association of an impact level parameter with each observation. An obvious choice
here is to employ a measure related to the energy of the signal, i.e. p_j = ∥s_j∥_2,
where ∥·∥_2 represents the L2 norm. Observe that the impact level parameter is associated
with the prefiltered signal and is therefore common to all the modal components extracted
from a particular instance.
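The first preparation step can be sketched as follows (Python with NumPy; this mirrors the frequency-domain construction described above and is equivalent to the usual Hilbert-transform-based analytic signal, e.g. scipy.signal.hilbert):

```python
import numpy as np

def analytic(x):
    """Analytic representation of a real signal: zero the negative
    frequencies in the DFT and invert."""
    N = len(x)
    X = np.fft.fft(x)
    H = np.zeros(N)
    H[0] = 1.0
    if N % 2 == 0:
        H[N // 2] = 1.0         # Nyquist bin kept once for even N
        H[1:N // 2] = 2.0       # positive frequencies doubled
    else:
        H[1:(N + 1) // 2] = 2.0
    return np.fft.ifft(X * H)   # negative frequencies are now zero

# For a cosine, the imaginary part of the analytic signal is the sine
t = np.arange(1024) / 1024
y = analytic(np.cos(2 * np.pi * 8 * t))
print(np.allclose(y.imag, np.sin(2 * np.pi * 8 * t)))  # True
```

The envelope r = |y| and phase of the complex signal then follow directly from this representation.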
[0051] For what follows, let y_j = [y_j(t_0), ..., y_j(t_{I−1})] and ẏ_j = [ẏ_j(t_0), ..., ẏ_j(t_{I−1})]
represent the discrete observation signal and its derivative, defined for the different
versions j = 1, 2, ..., J that have become available to us through the three steps
discussed above. In a similar way it makes sense to define the basis function vector
h_{k,j} = [h_{k,j}(t_0), ..., h_{k,j}(t_{I−1})] for k = 1, 2, ..., K, where the dependence
on y has been omitted for convenience.
[0052] One can employ nonparametric methods to find the general function in Eq. 2.1, or
try to parametrize it by polynomials or other basis functions, which should be chosen
in a way that corresponds best to the physics of the system. In the latter case, we
can boil the problem down to a conventional optimization problem: find the optimal
coefficients c_1, ..., c_K satisfying the relationship

    ẏ_j = H_j c,

where H_j = [h_{1,j}, ..., h_{K,j}] and c = [c_1, ..., c_K]^T. In this form, the problem
has been reduced to a linear optimization problem. We note that the nonlinear
transformations h_k of the variables can distort the distribution of the variables
considerably. Accepting this fact, the ODE coefficients can be found by minimizing the
cost function

    J(c) = ∥ẏ_j − H_j c∥².

[0053] In the following we will consider a small collection of basis functions; the system
is therefore overdetermined and the matrix H_j^H H_j is expected to be positive definite.
One may then use the least squares error solution

    c_o = (H_j^H H_j)^{−1} H_j^H ẏ_j

for defining the optimal complex ODE coefficients c_o. An alternative approach is to
use constrained optimization, restricting the value domain of some of the coefficients
in order to ensure stability of the reconstructed dynamical system or in order to enforce
the reconstructed model to preserve some desired properties. Although this possibly
restricts the available degrees of freedom in the model, a reasonable constraint for
ensuring stability is to restrict the real part of b_n to non-positive values, i.e.
b_{n,R} ≤ 0 for all n.
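A minimal sketch of this linear reconstruction on synthetic data (Python with NumPy; np.linalg.lstsq is used in place of explicitly forming the normal equations, and the basis set and parameter values are illustrative assumptions):

```python
import numpy as np

# Least-squares reconstruction of the ODE coefficients from one observation:
# minimize ||ydot - H c||^2, where the columns of H are the basis functions
# evaluated on the observed trajectory. Synthetic data stands in for a
# recorded modal component here.
fs = 44100
t = np.arange(4410) / fs                 # 0.1 s observation window

lam_true = -50.0 + 2j * np.pi * 150.0    # sigma + j*omega of the mode
y = 0.8 * np.exp(lam_true * t)           # observed complex trajectory
ydot = lam_true * y                      # its time derivative

# Basis functions: h_1 = y (linear term), h_2 = |y| y (envelope nonlinearity)
H = np.column_stack([y, np.abs(y) * y])
c_opt, *_ = np.linalg.lstsq(H, ydot, rcond=None)
# c_opt[0] recovers sigma + j*omega; c_opt[1] is ~0 since the data is linear
```

On real recordings, ydot would come from numerical differentiation of the filtered analytic signal rather than from a known model.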
[0054] Optimization must be performed for all available observations. This way, we may derive
J different sets of ODE coefficients

c_o^(1), ..., c_o^(J),

corresponding to the J available training impact levels p_1, ..., p_J. The problem then becomes to define the model parameters that correspond to an arbitrary
value of an impact level p which lies between two successive training impact levels
p_j and p_{j−1} with p_j < p_{j−1}. Here, we use linear interpolation in order to specify the model parameters from
c_o^(j) and c_o^(j−1). In particular, the mapping may be expressed as

c_o(p) = a c_o^(j) + (1 − a) c_o^(j−1),

where a is the interpolation parameter. Obviously, higher orders of interpolation
may be used. We will refer to this modeling approach as the Varying Coefficient Model
(VCM).
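A minimal sketch of the VCM interpolation step, assuming the two trained coefficient sets are stored as arrays and a ∈ [0, 1] is the interpolation parameter (`interpolate_coeffs` is a hypothetical helper name, not part of the disclosure):

```python
import numpy as np

def interpolate_coeffs(p, p_lo, p_hi, c_lo, c_hi):
    """Linearly interpolate between the ODE coefficient sets trained at
    the two impact levels p_lo < p_hi enclosing the requested level p."""
    a = (p - p_lo) / (p_hi - p_lo)        # interpolation parameter in [0, 1]
    return (1.0 - a) * np.asarray(c_lo) + a * np.asarray(c_hi)
```

For an impact level halfway between the two training levels, each coefficient is the average of the two trained values.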
4. Synthesis of multiple modes
[0055] In this section, an example for applying the method is provided.
[0056] The steps explained so far consider a single modal component of the target signal
to be synthesized. While in many cases the fundamental mode contains a very large
portion of the energy of the physical instrument, it might be necessary to synthesize
additional modal components.
[0057] The extension from one to multiple modal components is straightforward if one independent
ODE is considered for each modal component of the target sound. One can simply use
a different band-pass filter in order to isolate the mode of interest and then apply
the previously discussed methods separately to each mode. A spectrally richer sound
would then be achieved by adding up all the synthesized modes. However, many difficulties
arise as one moves to higher modes. First of all, the modal density increases and
different modes might appear to be very close to one another in the spectrum. This
would make it almost impossible to separate them by filtering. When two or more modes
are present in the same band-pass region, the inherent dimension of the sound at the
output of the filter increases. This dictates using ODEs of higher dimension than
two. As previously discussed, this might lead to problems regarding the stability
of the derived dynamic system and it might lead to ODEs with terms much more difficult
to interpret in terms of physical meaning.
[0058] One approach for overcoming this problem without increasing the dimension of the
dynamic model is to use multiple first-order complex ODEs which are coupled to one
another. Considering N different spectral regions extracted from the same measurement,
each spectral region roughly being responsible for a single mode, one can define a
system of N coupled first-order complex ODEs as

ẏ_n = f_n(y_1, ..., y_n, t), n = 1, 2, ..., N,     (4.1)

where y_n = y_n(t), the n-th state-space coordinate, is associated to the output of the n-th filter
and ẏ_n is its derivative. Observe that dependency on y_m is allowed for the n-th ODE as long as m < n. This type of coupling allows the order
of the filter which is used for each frequency region to be significantly reduced.
5. A powerful synthesis platform based on a system of ODEs
[0059] The equivalence between the classical harmonic oscillator which is used in sound
synthesis and a first-order complex ODE has been shown in other parts of the disclosure.
It can be demonstrated by considering the following simple non-autonomous dynamical
system

ẏ = (σ_0 + j2πf_0) y + (b / (t + ε)) y.     (5.1)

[0060] Here ε is a positive constant required in order to prevent the singularity at t = 0, σ_0 < 0 and f_0 > 0 represent the user-defined constant decay and constant frequency, and
b = b_R + jb_I with b_R ≥ 0 is responsible for the control of the attack time. The value of b_R and the initial condition y(0) are parameters that depend on the value of the user-defined
attack time t_p and velocity r_p, respectively, as

b_R = −σ_0(t_p + ε)

and

y(0) = r_p (ε / (t_p + ε))^{b_R} e^{−σ_0 t_p}.

Here, y(0) is calculated analytically by solving the ODE. Given this initial condition and value of b_R, the synthesized signal will reach an amplitude peak at time instant t_p and the
value of the envelope at t_p will be equal to r_p. Figure 2 illustrates the synthesis result for a central frequency of 150 Hz, a decay
rate equal to −15, a velocity value equal to 1 and an attack time equal to 0.025 sec.
The above dynamical system is thus absolutely predictable and sets the basis for adding
further functionalities in the ODE as the means for enriching the sonic attributes
of the output signal ℜ(y(t)).
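The behavior described above can be checked numerically. The sketch below assumes Eq. (5.1) has the form ẏ = (σ_0 + j2πf_0)y + (b/(t + ε))y with b_R = −σ_0(t_p + ε) (an assumption, since the equation itself is not reproduced here), integrates it with a per-sample exponential update, and uses the parameters of Figure 2:

```python
import cmath, math

def synthesize(sigma0=-15.0, f0=150.0, tp=0.025, rp=1.0,
               eps=1e-3, Fs=44100, dur=0.1):
    """Integrate the assumed form of Eq. (5.1),
    dy/dt = (sigma0 + j*2*pi*f0)*y + (bR/(t + eps))*y,
    with a per-sample exponential update (exact for piecewise-constant
    coefficients) and return the real output signal."""
    bR = -sigma0 * (tp + eps)                        # attack coefficient
    # initial condition chosen so the envelope peaks at (tp, rp)
    y = rp * (eps / (tp + eps))**bR * math.exp(-sigma0 * tp) + 0.0j
    dt, out = 1.0 / Fs, []
    for k in range(int(dur * Fs)):
        t = k * dt
        out.append(y.real)
        y *= cmath.exp(dt * (sigma0 + 1j * 2 * math.pi * f0 + bR / (t + eps)))
    return out
```

With these defaults the envelope rises, peaks near t = 0.025 s at amplitude close to 1, and then decays at rate σ_0, matching the description of Figure 2.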
[0061] The linear system shown in Eq. (5.1) offers only limited sonic capabilities, since
the only attributes that can be altered are the frequency, the decay and
the attack time. More complex behavior can be achieved by adding non-linear terms
on the right-hand side.
[0062] A very interesting type of non-linearity can be created by using the envelope of
the signal, |y|^m y, where m is a positive integer. In the general case, more than one order of m can
be considered, but the analysis is restricted to the case of a single term of order
m and coefficient c which can be freely defined by the user. The ODE module can be
written as

ẏ = (σ_0 + j2πf_0) y + (b / (t + ε)) y + c |y|^m y.     (5.2)

[0063] It should be noted that c is in general a complex coefficient which can be decomposed
into real and imaginary parts as c = c_R + jc_I. It is expected that the real and imaginary parts of this coefficient will be responsible
for completely different behavior in the above ODE module. This can be better understood
by decoupling the amplitude and frequency of the complex variable as y = re^{jθ}. Noting that r = |y|, Eq. (5.2) can now be written as

ṙ = (σ_0 + b_R / (t + ε) + c_R r^m) r,
θ̇ = 2πf_0 + b_I / (t + ε) + c_I r^m.
[0064] It can be seen that c_I influences the frequency of the system but has no effect on the amplitude.
A positive c_I would make the frequency of the oscillator vary in proportion to the amplitude, an
effect related to tension modulation. For synthesis purposes, however, there is no reason
why a negative value should not be used. Figure 3 illustrates the synthesis result
for a value of c_I equal to 155, when all other synthesis parameters are exactly as in the previous
figure 2. It can be clearly seen that the evolution of the amplitude coincides with
that of the predictable system.
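The claim that c_I leaves the amplitude untouched can be verified numerically. The sketch below makes illustrative simplifications (attack term omitted, unit initial condition) and integrates ẏ = (σ_0 + j2πf_0)y + jc_I|y|^m y with a per-sample exponential update:

```python
import cmath

def envelope(cI, sigma0=-15.0, f0=150.0, m=1, Fs=44100, dur=0.05):
    """Amplitude trajectory |y[k]| of
    dy/dt = (sigma0 + j*2*pi*f0)*y + j*cI*|y|**m*y.
    The cI term is purely imaginary in polar form, so it shifts the
    instantaneous frequency but leaves the envelope unchanged."""
    dt, y, env = 1.0 / Fs, 1.0 + 0.0j, []
    for _ in range(int(dur * Fs)):
        env.append(abs(y))
        y *= cmath.exp(dt * (sigma0
                             + 1j * (2 * cmath.pi * f0 + cI * abs(y)**m)))
    return env
```

Comparing the envelopes for c_I = 0 and c_I = 155 (the value used for Figure 3) gives identical amplitude trajectories, as stated above.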
[0065] Turning the analysis to the real part of c, it can be expected that a non-zero real
part will affect both the amplitude and the frequency of the oscillator. Some interesting
sonic possibilities arise under the condition c_R < 0, since in the opposite case stability of the dynamical system cannot be guaranteed.
In order to allow a simple design and to maintain stability, only negative values
of c_R are considered in what follows. This intrinsic non-linearity will add a type of non-linear
decay rate into the system (the magnitude of the decay rate varies in proportion to
the amplitude). While this effect offers some interesting sound design capabilities,
it is expected that it can strongly perturb the oscillator from the predicted attack
time and maximum amplitude t_p and r_p. In fact, a non-zero value of c_R requires the recalculation of the dependent parameters b_R and y(0)
so that the target attack time and maximum amplitude will hold. In Fig. 4 the synthesis
result for a value of c = −50 + 155j is presented, when the other synthesis parameters
are the same as in the previous figures. It can be seen that the amplitude characteristics
no longer coincide with those of the predictable system. In fact, it can be observed
that the attack time and the maximum amplitude decrease as the magnitude of c_R increases.
[0066] Two other types of interesting non-linearities are of the form (y + y*)y and (y − y*)y, where superscript (*) denotes complex conjugation. This provides further enrichment
of the right-hand side of the oscillator as

ẏ = (σ_0 + j2πf_0) y + (b / (t + ε)) y + c |y|^m y + d(y + y*) y + e(y − y*) y.

[0067] In general d and e can be complex coefficients, but here it is assumed that both are
real coefficients. This ensures that the terms y + y* and y − y* will be real and imaginary numbers respectively, something that is convenient in
order to achieve separate influence on amplitude and frequency:

ṙ = (σ_0 + b_R / (t + ε) + c_R r^m + 2dr cos θ) r,
θ̇ = 2πf_0 + b_I / (t + ε) + c_I r^m + 2er sin θ.
[0068] It can be seen that the two non-linearities are transformed into terms of the form
d r cos(θ) and e r sin(θ) in the expressions of the amplitude and the frequency, respectively. In other words,
they enforce a kind of amplitude and frequency modulation into the dynamical system,
with a modulator frequency which varies dynamically according to θ̇(t). This is an extremely useful mechanism for synthesis purposes, as it can lead to
a straightforward increase of the bandwidth, or it can be used to create traditional
synthesis effects such as tremolo and vibrato.
[0069] The influence of each one of these two terms is illustrated separately. In figure
5, only the "amplitude-modulation" effect is considered, in an oscillator with decay,
central-frequency and attack time parameters set equal to the previous cases but without
the amplitude-dependent non-linearity. The modulation effect can be seen to produce peaks
in the spectrum which are harmonically related to the central frequency of 150 Hz.
Similarly to the previous amplitude-dependent non-linearity, it can be observed
that the amplitude deviates from that of the predictable system, causing the amplitude
level to exceed the dynamic range of [−1, 1]. On the other hand, it can be seen in
Fig. 6 that the influence of (y − y*)y is restricted to the frequency, leaving the amplitude variation unaffected. An increase
in the bandwidth is observed, but the resulting sound is not characterized by harmonic
content as before. The general impression when hearing the sound is that of a frequency
increasing with time.
[0070] The underlying mechanisms presented here reveal a significant potential for enlarging
the bandwidth of the synthesized sound in connection with external input or with coupling
to other ODEs. For example, classical deterministic oscillators such as sines, square waveforms
or saw-tooth waveforms with user-defined frequency characteristics can be directly
inserted into the ODEs in order to modulate their frequency or amplitude. Furthermore,
the output signal from a second ODE can be used in order to modulate the amplitude
or the frequency of the first ODE. In analogy to classical Frequency Modulation synthesis
(FM synthesis), one can then speak of a modulating ODE and a carrier ODE.
The concept can of course be generalized to a case of N ODEs where the user is free
to determine parameters involving their linear or non-linear coupling characteristics.
This idea is further illustrated in the following.
[0071] While the classical mathematical framework of coupled ODEs is a suitable platform
for modeling real instruments, in what follows we consider it from a pure synthesis
perspective. We propose a functional representation which allows a direct interpretation
of the ODE coefficients as physically or functionally meaningful control parameters
which the user may freely vary within certain limits.
[0072] Our synthesis platform is formulated in terms of the general framework of N first-order
complex ODEs

ẏ_n = f_n(y_1, y_2, ..., y_N, t), n = 1, 2, ..., N,

with state-space coordinates y_1, y_2, ..., y_N ∈ ℂ. As before, the overdots here denote differentiation with respect to time t,
ẏ = dy/dt. Given f_n(·) and an initial condition y_n(0) for all n = 1, 2, ..., N, the above dynamical system can be numerically integrated,
returning N different complex outputs y_1(t), y_2(t), ..., y_N(t). A sonic output s[k] = s(kT) can then be acquired at a sampling rate F_s = 1/T as a mixture of all oscillator outputs at each time index k as

s[k] = Σ_{n=1}^{N} µ_n ℜ(y_n[k]),     (5.8)

where y_n[k] = y_n(kT), µ_n is the user-defined gain of the n-th oscillator and ℜ(·)
denotes the real part of a complex number. At first glance, the presented approach
reveals a connection to additive synthesis; each equation in the system of ODEs may
be associated to a different mode of vibration and a synthesis result can be obtained
as a superposition of modes. On the other hand, in the context of non-linear dynamics,
the same system is expected to be capable of producing quasi-periodic and even chaotic
behavior. It will be shown in what follows that the addition of simple non-linear terms
may provide an efficient way for enriching the bandwidth of the sonic output and that
non-linear coupling of ODEs is the cause of frequency modulation (FM) and amplitude
modulation (AM).
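As a sketch of this mixing step, assuming the simplest uncoupled linear case f_n = (σ_n + j2πf_n)y_n with unit initial conditions (function and parameter names are illustrative choices):

```python
import cmath

def mix_oscillators(params, gains, Fs=44100, dur=0.1):
    """Integrate N uncoupled first-order complex ODEs
    dy_n/dt = (sigma_n + j*2*pi*f_n)*y_n  (unit initial conditions)
    and mix their real parts: s[k] = sum_n mu_n * Re(y_n[k])."""
    dt = 1.0 / Fs
    ys = [1.0 + 0.0j] * len(params)
    s = []
    for _ in range(int(dur * Fs)):
        s.append(sum(mu * y.real for mu, y in zip(gains, ys)))
        ys = [y * cmath.exp(dt * (sig + 1j * 2 * cmath.pi * f))
              for y, (sig, f) in zip(ys, params)]
    return s
```

Each oscillator here behaves as one exponentially decaying mode; summing their weighted real parts reproduces the additive-synthesis reading of the mixture described above.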
[0073] By using the vector notation y = [y_1, y_2, ..., y_N]^T and ẏ = [ẏ_1, ẏ_2, ..., ẏ_N]^T, the dynamic system above can be written in more compact form as

ẏ = F(y, t).
[0074] Before expanding F(·) into a more analytic form we first define the class of N×1
vectors

[|y_1|^{m_1}, |y_2|^{m_2}, ..., |y_N|^{m_N}]^T,     (5.10)

which are required for the mechanisms of pitchbend and non-linear decay. Here, the
powers m_1, m_2, ..., m_N of each term can be varied explicitly by the user. An additional vector is required
for handling amplitude control in the form

where each entry will be used for penalizing the energy of the oscillation subject to the user-defined
threshold a_n.
[0075] Based on the previous vector notations, the standard function F(y,t) can now be written
as

[0076] Here (∘) denotes the Hadamard product. All bold capital letters are N×N coefficient
matrices and m,
l are N×1 coefficient vectors. These entities carry the synthesis control parameters.
[0077] The sound morphing potential of each term in Eq. (5.13) can already be guessed. In
the general case, the coefficient matrices can be fully populated complex matrices.
Observe that diagonal terms are responsible for processes that are intrinsic to each
oscillator while non-diagonal terms in the matrices define the amount of coupling
between ODEs. In the actual synthesis interface however, not all forms of coupling
are used and synthesis parameters are restricted in most cases to the real domain.
The parameter space is described in more detail in the next section together with
a corresponding GUI.
[0078] The role of each term in the right hand side of the dynamical system is described
below:
- A is in general a complex NxN matrix. The diagonal terms of the matrix Aii = σi + j2πfi carry the decay rate and central frequency of each ODE while values inside non-diagonal
terms define the additive coupling between ODEs.
- B is a complex NxN diagonal matrix. The real part in each diagonal term is automatically
calculated as a function of the user-defined attack time. This way, a different attack
time can be used for each ODE.
- CR and CI are both real NxN diagonal matrices. The values inside the diagonal of CR are parameters associated to the non-linear decay effect while those of CI are associated to tension modulation. Note that the integer vectors m and
l are also free user-defined parameters.
- D is in general a fully populated real matrix of size NxN associated to the amplitude-modulation
functionality. A non-zero value of Dij indicates the degree with which the amplitude of the i-th ODE is modulated by the
output of the j-th ODE.
- E is a fully populated NxN real matrix which is associated to the frequency-modulation
functionality. A non-zero value of Eij indicates the degree with which the frequency of the i-th ODE is modulated by the
output of the j-th ODE.
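The list above can be assembled into a right-hand side as in the sketch below. This is an assumed reading of Eq. (5.13), not a reproduction of it: the amplitude-control term in particular (matrix P and thresholds a_n) is a guess, since its exact form is not given here.

```python
import numpy as np

def F(y, t, A, B, CR, CI, D, E, P, m, l, a, eps=1e-3):
    """Assumed assembly of F(y, t): each term mirrors one entry of the
    list above.  The amplitude-control term is a hypothetical soft
    threshold; the exact form of Eq. (5.13) is not reproduced here."""
    ystar = np.conj(y)
    r = np.abs(y)
    out = (A @ y                                  # decay / frequency / additive coupling
           + (B @ y) / (t + eps)                  # attack-time control
           + CR @ (r**m * y)                      # non-linear decay
           + 1j * (CI @ (r**l * y))               # pitch bend (tension modulation)
           + (D @ (y + ystar)) * y                # amplitude modulation
           + (E @ (y - ystar)) * y)               # frequency modulation
    # hypothetical amplitude control: penalize energy above threshold a_n
    out -= P @ (np.maximum(r - a, 0.0) * y)
    return out
```

With all matrices except A set to zero the system reduces to N independent linear oscillators; a non-zero off-diagonal entry in D injects the real part of the j-th output into the amplitude of the i-th oscillator, as described in the list.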
6. Graphical User Interface
[0079] A simple graphical user interface depicted in Figure 7 serves as a prototype GUI for
giving the user access to all the control parameters of interest. The user is allowed
to vary the synthesis parameters continuously from a minimum to a maximum value with
the use of sliders. Exceptions are the synthesis parameters m_n and l_n which are associated to the rate of pitchbend and non-linear decay in Eq. (5.10).
A pop-up window to set integer values is designed for that purpose.
[0080] The GUI is divided in two panels, the "Oscillator" panel and the "Coupling" panel.
All sliders in the "Oscillator" panel refer to processes that are intrinsic to each
oscillator. Namely, "attack time", "decay rate", "frequency", "pitch bend", "nonlinear
decay" and "amplitude control" apply changes to the diagonal of matrices
B, A, CI, CR and P respectively, while the "intensity" slider is used for controlling the value of
r_{p,n}. All the coefficient matrices are assumed to be confined to the real domain except
for the diagonal of matrix A, which obeys A_nn = σ_n + j2πf_n according to the user-defined decay rate σ_n and frequency value f_n. The "amplitude threshold" and the "oscillator volume" sliders are associated to
the values of a_n and µ_n in Eqs. (5.12) and (5.8), respectively. Finally, a set of radio buttons is used for
selection of the oscillator index n. In the depicted example, the number of ODEs is
limited to 4. A check box ("activate") controls whether a particular oscillator is
active or not in the synthesis process.
[0081] On the right side, the "Coupling" panel includes the sliders that are necessary for
controlling frequency modulation ("FM"), amplitude modulation ("AM") and additive
coupling ("AC"), thereby affecting the elements of matrices E, D and
A respectively, which are again confined between a minimum and a maximum value in the
real domain. An additional set of radio buttons is used for selecting the index of
the "interfering" oscillator. For example, if n is the index of the selected radio
button in the "Oscillator" panel and j the corresponding index in the "Coupling" panel,
the changes applied to the sliders will be reflected in the n-th row and j-th column
of A, D and E. Observe that when n = j, changes applied to the "AC" slider are equivalent to changes
in the decay rate of the n-th oscillator. A solution for this problem is either to
deactivate the particular slider when n = j or to couple it with the "decay rate"
slider. Observe that in the context of this synthesis approach there is no strict
distinction between "modulating" and "carrier" oscillators. Any oscillator can be
a "carrier" oscillator and a "modulating" oscillator at the same time. It might be
of interest to use an ODE only as the means for modulating a different ODE. In that
case, the output of the "modulating" ODE can be ignored by setting its gain to zero.
[0082] Some remarks are worth making for the particular GUI example. It may already have
been observed that while the number of synthesis parameters rises quadratically
with the number of oscillators N, the actual number of graphical objects which is
required is independent of that value. Also, in the current form, we have excluded
coupling effects that are associated to matrices B, CR, CI and
P, i.e. these matrices have by default zero values outside the main diagonal. Of potential
interest here would be to include pitch bend and non-linear decay coupling in the oscillator
panel, which implies using fully populated matrices CI and CR respectively. In the case of non-linear decay, for example, the envelope of one oscillator
would be allowed to affect the temporal characteristics of another oscillator.
7. Work-flow
[0083] The work-flow of the data collection approach is shown in Fig. 8(a). Any conventional
digital recording equipment can be used in order to capture the natural sound of the
percussive instrument. If done with a computer, a microphone and a sound card equipped
with a microphone pre-amplifier will be required. Alternatively, stand-alone
devices can also be used. During the recording process, the musician is requested
to perform several hits of varying impact level. This is advantageous in order to
create a generic model which will output realistic synthesis results for a wide dynamic
range.
[0084] Apart from recordings, already existing sample banks of percussive instruments can
be exploited. There are numerous drum samplers on the market which are based on recordings
of a variety of real percussive instruments. Based on these samplers, the necessary
data can be exported by using any conventional Digital Audio Workstation (DAW) such
as Cubase, Logic, GarageBand and others.
[0085] Some pre-processing of the recorded or exported audio file is also required. For
this purpose, an automatic onset detection algorithm is built which can be used in
order to crop a large audio file, containing multiple realizations of the physical
instrument, into multiple smaller audio files, each one containing a single realization
of the physical instrument. After these files have been created, they are named and
saved on the hard disk for further processing.
[0086] The work-flow for the reconstruction process is shown in Fig. 8(b). The first step
in the reconstruction process is to import the necessary audio samples from the instrument's
database. In the second step, various temporal and frequency characteristics of the
samples are plotted. Based on these plots, decisions are taken concerning the duration
of the input signal, the number and the band-pass limits of the different frequency
zones, the number and the type of the basis functions and many other parameters.
[0087] The imported samples need to be processed according to the reconstruction parameters
defined by the user, thus providing the input data that will be used for the reconstruction
process. In the context of a generic reconstruction approach, more than one audio
sample of the same instrument might be required for the analysis each time that
a model is built. Special attention must be paid in that case so that all samples
have the same duration and the same time onset.
[0088] Before the reconstruction process can begin, the type of reconstruction must be specified.
This can be simple regression (least squares error minimization), constrained optimization
or sparse reconstruction. More details about these three different approaches are
given further down. The work-flow of the synthesis process is shown in Fig. 8(c).
First, the model corresponding to a particular instrument is loaded. Before the synthesis
process can begin, various parameters can be specified by the user. These concern
the duration of the synthesized signal, the sampling rate and the impact level. Additional
parameters which can be taken into account concern the decay rate and the pitch of
the synthesized sound.
[0089] Once the synthesis parameters are defined, the ODEs are integrated forward
in time and the synthesis result is plotted and played back.
8. Appendix
[0090] Three different techniques for estimating the coefficients of an ODE are presented
in the following. Other methods can be used as well.
a) Regression: (Least squares) Error Minimization
[0091] The technique is illustrated for the general problem of N coupled ODEs, previously
shown in Eq. 4.1. With respect to this problem, N different functions f_n(·) are defined and it is assumed that each function can be decomposed into M_n basis terms g_{nm}(·). One considers all N time series y_n(t) (n = 1, 2, ..., N) measured at I time points t_i (i = 1, 2, ..., I). It is of course assumed that the derivative of y_n can be calculated at each time point. Then the process for the n-th ODE becomes minimization
of the quantity

E_n = Σ_{i=1}^{I} |ẏ_n(t_i) − Σ_{m=1}^{M_n} c_{nm} g_{nm}(t_i)|².

[0092] A set of linear equations arises from the condition

∂E_n / ∂c_{nm} = 0, m = 1, 2, ..., M_n,

and the solution to these equations gives the optimal coefficients c_{nm}. While this is shown for the n-th ODE only, the process is identical for all N ODEs.
[0093] Neglecting the dependence on the ODE index n, the problem is expressed in matrix-vector
form. The goal is to approximate the I×1 target observation vector ẏ as a linear combination of M basis vectors g_m, m = 1, 2, ..., M. Given the choice of basis vectors, the goal is to find the values
of the coefficients c_m, m = 1, 2, ..., M such that

ẏ ≈ Σ_{m=1}^{M} c_m g_m,

or in matrix form

ẏ ≈ Wc,

where c = [c_1 c_2 ... c_M]^T and W = [g_1 g_2 ... g_M]. The number of time points I is considered to be greater than the number of basis
vectors M and therefore matrix W is overdetermined. The optimal coefficient vector
c_o is derived through least squares error minimization as

c_o = (W^T W)^{-1} W^T ẏ,

and the reconstruction error can be defined as

e = ẏ − Wc_o.
[0094] The least squares error criterion

E_LS = ||ẏ − Wc_o||² / ||ẏ||²

is used in order to measure the fit between the reconstructed signal and the target
measurement. The energy of the reconstruction error is normalized with
respect to the energy of the target measurement; a value of E_LS equal to 0 therefore denotes a perfect reconstruction result.
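A minimal sketch of this regression step (`fit_ode_coefficients` is a hypothetical helper name; numpy's least-squares routine stands in for the explicit normal-equations formula):

```python
import numpy as np

def fit_ode_coefficients(W, dy):
    """Least squares fit c_o minimizing ||dy - W c||^2, plus the
    normalized error criterion E_LS (0 means perfect reconstruction)."""
    c, *_ = np.linalg.lstsq(W, dy, rcond=None)
    resid = dy - W @ c
    E_LS = np.linalg.norm(resid)**2 / np.linalg.norm(dy)**2
    return c, E_LS

# target derivative generated from known coefficients 2 and -3
t = np.linspace(0.0, 1.0, 100)
W = np.column_stack([t, t**2])
c, E_LS = fit_ode_coefficients(W, 2.0 * t - 3.0 * t**2)
```

Since the target here lies exactly in the span of the basis vectors, the recovered coefficients match the generating ones and E_LS is numerically zero.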
b) Constrained optimization
[0095] There are certain occasions where one would desire to minimize the reconstruction
error subject to linear constraints. This task can be mathematically formulated as

min_c ||ẏ − Wc||² subject to Ac ≤ B,     (8.6)

where matrix A and vector B can be defined so as to express any linear inequalities. Constrained optimization
can be used in order to restrict the sign of some coefficients in regions where the
ODE remains stable. Additionally, one might use it in order to control the rate of
decay or the fundamental frequency of the ODE, especially in low-order ODEs where
the coefficients of the basis functions have a well understood physical meaning. Apart
from the use of inequalities, one can also fix the value of the ODE coefficients to
a desired value. This is a more general form of Eq. (8.6) which can be formulated
as

min_c ||ẏ − Wc||² subject to Ac ≤ B and A_eq c = B_eq.

[0096] Constrained optimization can be performed by using the built-in Matlab routine
lsqlin.
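In a Python setting, the equality-constrained variant (fixing chosen coefficients to desired values, as in the more general form above) can be sketched without any toolbox; `constrained_lsq` and its argument names are illustrative choices:

```python
import numpy as np

def constrained_lsq(W, dy, fixed):
    """Least squares with selected coefficients fixed to given values;
    `fixed` maps coefficient index -> fixed value.  General inequality
    constraints A c <= B would need a QP solver (cf. Matlab's lsqlin)."""
    M = W.shape[1]
    free = [m for m in range(M) if m not in fixed]
    # move the contribution of the fixed coefficients to the target side
    resid = dy - sum((v * W[:, m] for m, v in fixed.items()),
                     np.zeros(len(dy)))
    c = np.zeros(M)
    c[free], *_ = np.linalg.lstsq(W[:, free], resid, rcond=None)
    for m, v in fixed.items():
        c[m] = v
    return c
```

Fixing a coefficient simply shifts its known contribution into the target and solves the reduced least-squares problem over the remaining free coefficients.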
c) Optimization with a maximum allowed number of terms
[0097] There are occasions where one would like to approximate the target measurement ẏ by using only a limited number of the M available candidate basis vectors g_m. This type of optimization is related to the so-called sparse reconstruction techniques
which have become extremely popular in the signal processing community. In general,
sparse reconstruction techniques can be mathematically formulated as

min_c ||ẏ − Wc||² subject to ||c||_0 ≤ K_max,

where ||c||_0 measures the number of non-zero elements in vector c. While the applicability of
this optimization strategy in signal compression and dimension reduction is apparent,
here it serves to define ODEs with only a small number of terms and, therefore,
with reduced computational complexity.
[0098] Among the many different algorithms that exist for sparse reconstruction, two popular
greedy techniques are presented: Matching Pursuit (MP) and Orthogonal Matching Pursuit
(OMP).
[0099] The MP algorithm (shown in Table 3) is an iterative technique based on three fundamental
steps: an initialization step, a selection step and a residual update step. The algorithm
requires that all basis vectors g
m are normalized to have equal energy. The algorithm begins by looking among all candidate
basis vectors in order to find the one which has the highest correlation with the
target measurement. A coefficient is returned by projecting the measurement onto the
selected basis vector, and the contribution of the given vector to the measurement
is removed, returning a residual. The same process is then repeated, but the initial
measurement is now replaced by the residual calculated at the previous iteration.
The process is repeated until a maximum number of iterations K
max has been reached.
[0100] OMP, shown in Table 4, has exactly the same initialization and selection step as
MP, but the coefficients corresponding to the selected vectors are updated simultaneously
by projecting the initial target measurement onto the space spanned by the entire
set of selected vectors. The number of columns of the matrix with the collection of
selected vectors
DΓi increases by one at each iteration. The value of the coefficients in OMP is thus
allowed to vary at each iteration, whereas in MP, once a coefficient value has been
defined, it is not allowed to change any more. OMP has in general better convergence
properties than MP, but it includes a matrix inversion step which makes it slower
than MP.
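A compact sketch of MP as described in [0099] (OMP would additionally re-project the target on all selected atoms at every iteration); `matching_pursuit` is an illustrative name and, as required by the algorithm, the columns of W are assumed unit-norm:

```python
import numpy as np

def matching_pursuit(W, target, K_max):
    """Greedy MP: initialize the residual to the target, repeatedly
    select the unit-norm atom most correlated with the residual,
    accumulate its coefficient and update the residual."""
    r = np.asarray(target, dtype=float).copy()
    c = np.zeros(W.shape[1])
    for _ in range(K_max):
        corr = W.T @ r                    # correlations with all atoms
        m = int(np.argmax(np.abs(corr)))  # selection step
        c[m] += corr[m]                   # coefficient of chosen atom
        r = r - corr[m] * W[:, m]         # residual update step
    return c, r
```

Note that a coefficient is accumulated rather than overwritten, since MP may select the same atom more than once; as stated above, a coefficient contribution, once subtracted from the residual, is never revised.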
[0101] The features disclosed in the specification, in the claims and in the figures can
be relevant for implementing embodiments in any possible combination with each other.