Technical Field
[0001] The present invention generally relates to speech synthesis technology.
Background of the invention
[0002] In most telecommunication systems, speech signals are not transmitted with their
full analog bandwidth in order to use the available transmission channel more efficiently.
For telephone communication, the signal bandwidth is typically limited to less than
4 kHz, even though there are signal components up to 8 kHz and higher in the original
speech signal. This band limitation has little or no effect on the intelligibility
of voiced sounds, but fricatives such as /s/, /sh/, /ch/, /z/ or /f/ may be lost completely.
In Fig. 1 a spectrogram of the utterance "Sauerkraut is served" is depicted. The phoneme
/s/ is spoken at approximately 0, 0.8, and 1.4 seconds. At these times almost 100%
of the signal energy is above 4 kHz. With a bandwidth limited to less than 4 kHz the
speech intelligibility may still be sufficient because the listener is often able
to predict missing phonemes from syntax and the context of what is being said. However,
errors arise if such a prediction is not possible, e.g., because names or unknown
words of a foreign language are transmitted. Furthermore, the phoneme /s/ is important
in the English language to indicate plurals and possessive forms.
[0003] Various spectral compression schemes exist that aim to present signals from one frequency
range in another, more useful range of the frequency spectrum. The side effects of
these schemes vary according to the limitations of the candidate identification techniques
and include artifacts resulting from the signal processing as well as high computational
complexity and delay in the signal analysis stage.
[0004] P. Patrick, R. Steele, and C. Xydeas, Frequency compression of 7.6 kHz speech into
3.3 kHz bandwidth, 31 (5):692-701, May 1983 describes a system for retaining information under frequency compression. This system
consists of frequency mapping at the transmitter and demapping at the receiver side.
According to a frequency compression factor c, every c-th sample is retained in the
compressed magnitude spectrum, so that it occupies only the frequency range that is
available for transmission. At the receiver side, the received frequency components
are spaced out to their correct locations and the magnitude of the missing components
is found by linear interpolation. The phase is chosen randomly.
[0005] The compression factor c can also be set frequency dependently to give higher or
lower compression in certain regions. Three different mapping laws, which are tailored
for certain phonemes, and one that simply corresponds to a band limitation, are used
in the system. For switching between these laws, the signal is first classified as
voiced or unvoiced speech based on its autocorrelation. Mapping is only applied for
unvoiced speech. Then the mapping and demapping procedure is applied according to
the same laws.
[0006] It is important to note that this system needs to transmit side information about
the mapping laws that have been used to ensure correct demapping. Additionally, the
receiver must be able to handle this side information and to perform the demapping.
[0008] When the spectral centroid is greater, portions of the spectrum above 4 kHz are added
into the 3 - 4 kHz bands according to the following rules:
fc = 2 kHz: No spectral translation is performed.
fc > 3 kHz: The 4 - 5 kHz band is added into the 3 - 4 kHz band.
fc > 4 kHz: The 5 - 6 kHz band is also added into the 3 - 4 kHz band.
fc > 5 kHz: The 6 - 7 kHz band is also added into the 3 - 4 kHz band.
Uncontrolled artifacts result from this signal processing. There is no general
improvement of speech intelligibility.
[0010] Another application area of frequency mapping methods is hearing aids. People with
hearing impairment usually suffer from poor perception of high frequency sounds.
The traditional approach is to strongly amplify these critical frequency regions.
However, for some people, hearing sensitivity is so poor at high frequencies that
sufficient gain for achieving audibility cannot be provided.
[0012] Some further approaches are:
- to shift the complete spectrum to lower frequencies if high frequency sounds are detected
(H. J. McDermott, V. P. Dorkos, M. R. Dean, and T. Y. Ching. Improvements in speech
perception with use of the AVR TranSonic frequency-transposing hearing aid. J Speech
Hear Lang Res, 42(6):1323-1335, 1999). Problems with this scheme are the reliable detection of high frequency signals
under noisy conditions and artifacts that occur during the on/off switching.
- to detect a range of high energy and to shift that region downwards (Kuk, Korhonen, Peeters, Jessen, and Andersen. Linear frequency transposition: Extending
the audibility of high frequency information. The Hearing Review, 2006). Shifting here means overlapping the identified frequency interval with a lower
band and adding the two spectra. This can lead to blurring of vowel sounds.
- to compress frequencies above a threshold frequency (described above). The compression
is always switched on, regardless of the current speech sound.
[0013] US 5 771 299 aims to compress or expand the spectral envelope of an input signal. The goal is
to move signal information from a region of the spectrum that is outside of the audible
range of hearing aid users into another range that is still audible for the user.
The system uses an LPC analysis filter and transmits the filter coefficients to the
synthesis filter directly. Then, with the help of all-pass filters, non-integer delays
are introduced in the analysis and/or the synthesis filters. Delays larger than 1
compress the spectral envelope while delays smaller than 1 expand the spectral envelope.
[0014] The main problem with this system is that the voice specific characteristics such
as formant positions are not preserved, since they are also compressed/expanded as
part of the signal processing. This has the effect of enhancing audibility for the
hearing impaired, but not of preserving the audio quality of the original input speech
signal. Further disadvantages are delays of the output signal.
[0015] US 2009 0074197 discloses a similar transposition but with user-dependent information for spatial
hearing. The goal is to perform a frequency transposition that moves frequency regions
of the input signal to user specific ranges that are measured using built-in mechanisms
of the hearing aid.
[0016] A method according to
EP 1 333 700 A2 applies a perception-based nonlinear transposition function to the input spectrum.
The same transposition function is applied over the whole signal so that artifacts
resulting from switching between non-transposition and transposition processing are
avoided. However, the use of a single function over the entire speech signal ignores
time-varying phoneme-specific characteristics in the signal. Also, transforming the
input signal to and then from the perception based scale to apply the transposition
function reduces spectral resolution. As a result, characteristics of parts of the
spectrum, which would otherwise only be subjected to a linear section of the transposition
function, are not accurately preserved.
US 2006 0226016 aims to correct the phase of a transposed spectrum, wherein the transposition
system is always active. Such processing of voiced segments has a negative effect
on the phase and harmonic characteristics of the signal.
[0018] According to
US 2009 0226016 the input signal is subjected to a high pass (or band pass) filter, after which the
spectral envelope of the high pass signal is estimated using an all-pole model. A
warping function is then applied to the all-pole model, translating the poles to lower
frequencies using both non-linear and linear warping factors. An excitation signal
(the prediction error signal) is then applied to the newly transposed all-pole model
and shaped according to this transposed spectral envelope to create a synthesized
transposed signal. The synthesized signal with an optional amplification and the original
low-pass signal are then summed to create a signal containing compressed and transposed
segments of the original spectrum. The LPC analysis and voice synthesis step is costly,
and such signals usually sound quite artificial without proper postprocessing.
This postprocessing is also expensive and this method is therefore not suitable for
a system focusing on improved audio quality compared to systems for the hearing-impaired.
US 6 577 739 describes applying a proportional compression factor to the frequency spectrum
generated by an FFT of the input signal. The factors are set to be between 0.5 and 0.99,
and are applied using different methods. Using linear interpolation and assuming a compression
factor of 0.5, the contributions of two input FFT bins are calculated and applied to
one output IFFT bin. Another method would use an IFFT length double that of the input
FFT length. By additionally placing the contributions from the input FFT bins in a
higher or lower position in the output IFFT vector, a frequency transposition can
be applied as well. This method again places a compressed frequency range of an input
signal in another range in the output signal, which is still available to the communication
channel or the end user of a hearing aid device. This system is however quite limited
in scope by requiring FFT processing of the input and output signals. The additional
time domain trimming of signals required by using FFT and IFFT vectors of different
lengths also presents problems for time-domain synchronization and variability of
the compression factors is not easily implemented.
[0020] CA 2 569 221 describes an attempt to retain information in frequency ranges outside of a band
pass threshold. The input signal is converted to the frequency domain via the FFT
algorithm or a polyphase filterbank. A compression function is then applied. Additionally,
an amplification is applied in accordance with the energy level of the uncompressed
portion of the input signal. The system can also comprise sending the compressed signal
to an automatic speech recognition module. This method can have negative compression
effects by giving portions of the input spectrum either too much or too little emphasis.
[0021] WO 2008/0123886 first defines a pass band for the output signal, and then defines a threshold, generally
lower than the pass band, above which the frequency compression is applied. In the
frequency region above the threshold, the highest frequency still of interest is identified
and the appropriate compression is then applied. After the compression, the new peak
power is normalized in a manner proportional to the compression, or is simply halved
(-3 dB). Additionally, this method provides for the ability to expand a received
signal, compressed or otherwise, and to synthetically reconstruct high frequency portions
of the signal that may or may not have been present in the original transmitted signal.
This can have a negative effect on the audio quality of the signal. This method still
neglects the time-varying speaker dependent characteristics of speech sounds and applies
the same compression function to each frame.
WO 2005/015952 discloses a method of enhancing sound for a hearing-impaired listener which comprises
manipulating the frequency of high frequency components of the sound in a high frequency
band.
Summary of the Invention
[0023] In view of the foregoing, the need exists for improved solutions that preserve and
enhance voice specific characteristics. The object of the present invention is to
improve at least one of controllability, precision, signal quality, processing
load, and computational complexity.
[0024] The inventive method for adaptive spectral transformation for acoustic speech signals
comprises the steps of
receiving at least one spectral input representation corresponding to at least one
window of a time domain input signal of acoustic speech,
selecting, from the spectral input representations, at least one selected spectral representation
to be transformed,
assigning the at least one selected spectral representation to one of a set of cluster
centres, wherein
the cluster centres are defined on the basis of spectral representations of windowed
acoustic speech segments of a speech corpus by a clustering algorithm,
spectral class representations are assigned to the cluster centres and are elements
of a code book and
the code book links to each spectral class representation at least one spectral transformation
which enhances the corresponding spectral class representation,
transforming each selected spectral representation to a spectral output representation,
wherein the applied transformation corresponds to the at least one spectral transformation
linked to the cluster centre which is assigned to the respective selected spectral
representation, and
providing the spectral output representations to synthesize an acoustic speech signal.
Selected spectral representations are classified by assigning them to spectral class
representations. The spectral transformation applied to a selected spectral representation
is adapted to this selected spectral representation because the applied spectral transformation
enhances the assigned spectral class representation. The adaptations to enhance the
spectral class representations are made in a setup procedure and include heuristic
steps in order to find transformations which enhance the spectral class representations.
The advantageous transformations adapted for each cluster centre, i.e., for each class
of the code book, ensure enhancement of all the spectral representations assigned to
this class of the code book. The controllability, precision and signal quality are
enhanced while the processing load and computational complexity are reduced.
Best results can easily be found by the use of an appropriate code book, i.e.,
by selecting an appropriate speech corpus and an appropriate clustering algorithm.
Clustering of spectral representations with sufficient energy in the 4 to 8 kHz band
allows specific enhancement of fricatives such as /s/, /sh/, /ch/, /z/ or /f/. The
step of finding transformations which enhance fricatives of different classes allows
linking of very specific transformations to the classes of the code book.
[0027] An English speech corpus can for example be taken from the TIMIT acoustic-phonetic
continuous speech data base. This speech data base has been designed by the Massachusetts
Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc.
(TI) for acoustic-phonetic studies. It contains utterances of 630 male and female
speakers in American English, divided into eight dialect regions. From each person,
ten phonetically rich sentences have been recorded and tagged with phonetic information.
The training data can be reduced to 70 speakers from all dialect regions, and pauses
can be removed with the help of phoneme tags. The reduced training data has a duration
of approximately 23 minutes. For validation purposes, a second data set can be extracted
that consists, for example, of 10 minutes of speech data from 30 speakers. The sampling
frequency is preferably fs = 16 kHz.
[0028] There are two different kinds of preferred transformations, frequency compression
and formant boosting. Frequency compression compresses the bandwidth, for example
from a bandwidth of 0 to 8 kHz to a bandwidth of 0 to 4 kHz, preferably with a compression
only at the upper or lower end of the bandwidth, optionally linear at least in the
middle frequency range, corresponding to no compression when the slope is equal to
1. Formant boosting takes place using formant-dependent functions to increase the
contrast between the formants and the non-formant frequencies in the frequency spectrum.
The formant boosting gain function linked to each spectral class representation of
the code book amplifies at least one, preferably two or three, of the formants at
low frequencies.
[0029] Assigning the at least one selected spectral representation to one of a set of cluster
centres includes
calculating distance measures between the selected spectral representation and all
the spectral class representations of the code book, and
assigning the at least one selected spectral representation to the cluster centre
with the shortest distance measures between the selected spectral representation and
the spectral class representations of the cluster centre.
Calculating distance measures becomes very simple by first calculating feature vectors
for the spectral representations. The distance measures are then just distances between
the feature vectors. The feature vectors are calculated from spectral envelope representations
by a filtering transformation, preferably with a mel-filterbank, wherein the mel-filterbank
optionally uses overlapping triangular windows with widths variable with frequency.
With such feature vectors it can be sufficient to use a code book including at least
eight (for example for fricative enhancement), preferably thirty-two, optionally 128
cluster centres. With higher class numbers, more distinct enhancement problems can
be solved, each in a different way, by the linked spectral transformations.
[0031] The spectral class representations for the cluster centres are preferably averaged
spectral representations averaged over spectral representations of corresponding cluster
elements. A reduction to spectral class representations of special interest can be
made by applying the clustering algorithm on the basis of a preselected sub-corpus
of the speech corpus. The sub-corpus can for example be reduced to spectral representations
of the speech corpus which have a spectral centroid lying above a given threshold
frequency, preferably of 3 kHz.
Transformations are only needed for spectral representations which can be improved.
A selection can be made by calculating the spectral centroid of each spectral input
representation and selecting spectral input representations which have a spectral
centroid lying above a threshold frequency, preferably above 3 kHz. Such a selection
fits a code book based on a sub-corpus of the speech corpus with the same threshold
frequency. A selection can also be made by detecting at least one of speech activity
and background noise level. Spectral input representations with speech to be transformed
will be selected with appropriate selection criteria.
[0033] The method for adaptive spectral transformation for acoustic speech signals can be
implemented in different fields. In some applications the acoustic speech signal is
windowed and a spectral representation is deduced from each window. There are also applications
where the spectral representations of windows of an acoustic speech signal are provided
by a system. Therefore the method of this invention starts with the step of receiving
spectral input representations, which can be provided by the same system or by another
system. At least some of the received spectral input representations are selected
and enhanced by adapted transformations and the enhanced spectral representations
are provided in the form of at least one spectral output representation. The combination
of untransformed and transformed spectral representations allows synthesizing an enhanced
acoustic speech signal.
[0034] The invention can be implemented in a computer program comprising program code means
for performing all the steps of the disclosed methods when said program is run on
a computer.
[0035] The described solutions preserve and enhance voice specific characteristics such
as formants and their respective positions. This has the effect of enhancing audibility,
while preserving the audio quality of the original input speech signal. The current
invention also avoids unnecessary delays of the output signal. Various compression
functions are applied to the speech signal, which take into account the time-varying
phoneme-specific characteristics in the signal and allow for spectral sharpening
through adaptive formant boosting. Also, unnecessary transformations of the input
signal, which could reduce spectral resolution, are avoided. The effect of phase distortion
on the harmonic section of the input signal can be avoided in the current invention
by limiting the compression processing to only unvoiced segments of speech.
[0036] A focus on efficient algorithms and output signal synthesis allows the current invention
to also avoid costly postprocessing to achieve high quality audio output. Additionally,
transforming the input signal from the time domain to the frequency domain is not
bound to the FFT (Fast Fourier Transform) algorithm. Compression functions are intended
to be designed such that they are able to be efficiently stored in memory and do not
inadvertently give portions of the input spectrum undesired emphasis.
Brief description of the figures
[0037]
- Fig. 1
- spectrogram of the utterance "Sauerkraut is served once a week". The plots below show
the percentage of signal energy above 4 kHz and the amplitude of the signal
- Fig. 2
- example of the LBG-algorithm for clusters of two-dimensional feature vectors with
κ = K = 8 classes (panel (c)); x- and y-axes correspond to the value ranges of the first and the second
features
- Fig. 3
- results of clustering into K = 8 classes after pre-classification with a threshold
of fT = 3 kHz. The line with more details shows the average spectrum in dB, the line with
less details shows a filtered spectral representation of the code book entry, and the
vertical line the spectral centroid fc
- Fig. 4
- a block diagram of the adaptive transformation method
- Fig. 5 and 6,
- compression functions
- Fig. 7
- a formant boosting gain function and spectral representations (cluster mean and filtered
cluster mean), where the line with more details shows the average spectrum in dB,
the line with less details shows a filtered spectral representation of the code book
entry
- Fig. 8
- a block diagram of the adaptive transformation method using feature vectors
- Fig. 9
- examples for processing of phonemes: (a) spectral compression with compression characteristic
below, (b) formant boosting with gain g(Ωµ) between ±10 dB. The processed signals are band limited to 4 kHz (shorter curve)
- Fig. 10
- example of the sauerkraut utterance processed with the 128 class scheme (input signal
above, processed signal below)
- Fig. 11
- time series of a noise reduction filter and its derivative
- Fig. 12
- block diagram of the onset sharpening algorithm based on a modified recursive Wiener filter
Detailed description of the invention
Fig. 1 shows, for the utterance "Sauerkraut is served once a week", the energy
distribution in time and frequency, a time series of the energy above 4 kHz, and a
time series of the amplitude. The fricatives "s", "ce" and "k" have considerable energy
in the frequency range from 4 to 8 kHz.
[0039] A time-domain input signal, as for example "Sauerkraut is served once a week", is
windowed before being transferred to the frequency domain via a fast Fourier transform
(FFT) algorithm, discrete Fourier transform (DFT), discrete cosine transform (DCT),
polyphase filterbank, wavelet transform, or some other time to frequency domain transformation.
In the frequency domain the windowed time-domain signal has the form of spectral representations.
By filtering for example with a mel-filterbank the spectral representations can be
reduced to feature vectors.
Fig. 2 shows an example of clusters of two-dimensional feature vectors. With the
LBG-algorithm for κ = K = 8 classes, eight cluster centers (marked with "+" in the
figure) are found. The procedure is similar for feature vectors with higher dimensionality.
The spectral representations or the feature vectors of the cluster centres are defined
on the basis of spectral representations of windowed acoustic speech segments of a
speech corpus by the clustering algorithm. A code book includes the spectral class
representations for the cluster centres, which are averages over the elements of the cluster.
Fig. 3 shows results of clustering into K = 8 classes after pre-classification
with a threshold of fT = 3 kHz in (b). The line with more details shows the average
spectrum in dB, the line with less details shows a filtered spectral representation
of the code book entry, and the vertical line the spectral centroid fc. The filtered
spectral representations of the code book entries are the spectral
class representations for the cluster centres, which are averages over the elements
of the cluster.
[0042] The spectral representations or feature vectors of windowed frames of an input signal
are subjected to a classification technique from the field of pattern recognition,
such as the minimum mean squared error estimator. The input frames are classified
by finding the class corresponding to the smallest value of a cost function d(v, c_k)
in Equation 1, where v = (v_1, ..., v_D) is the D-dimensional feature vector of the
current frame and c_k = (c_1k, ..., c_Dk) is the feature vector of codebook entry k.

$$d(\mathbf{v}, \mathbf{c}_k) = \sum_{d=1}^{D} (v_d - c_{dk})^2 \qquad (1)$$
[0043] In Fig. 4, this setup is illustrated using K=128 as a possible number of entries
for such a codebook. The codebook can be trained using feature vectors from a training
set and any number of vector quantization algorithms, such as k-means [MacQueen, 1967]
or the Linde-Buzo-Gray (LBG) algorithm [Linde, Buzo & Gray, 1980]. Linear discriminant
analysis (LDA), principal component analysis (PCA), support vector machines (SVM),
and other class enhancing algorithms can also be applied to decrease the intraclass
variances and/or increase the interclass distances, which improves separability.
The feature vectors for training and testing are in this case a set of perception
based features on the mel scale, although the feature vectors need not be limited
to these specific features. The features can also be designed to emphasize signal
characteristics in or near the region of interest in the frequency spectrum. The key
concept here is that the feature vector has a reduced dimensionality with respect
to the input signal frequency spectrum in order to reduce computational costs.
[0045] A spectral compression function and/or a formant boosting function are chosen from
the relevant codebook entry.
[0046] To avoid possible switching artifacts, a smoothing procedure can be applied to the
chosen compression function in the frequency domain but in the time direction. Using,
for example, an IIR smoothing function, the time-variance of the applied compression
functions can be reduced.
The spectral compression functions are functions - in the preferred case continuous
and nonlinear - that apply compression rates in the frequency domain. Both the compression
functions and the frequency spectrum can be either linearly or nonlinearly scaled,
reflecting the occasional advantages of processing in more perception based scales
like the logarithmic scale.
[0048] The operation of spectral compression maps a frequency interval of the input signal
into a smaller frequency interval of the output. When working on a discrete representation
of the spectrum, this is equivalent to mapping a set of frequency bins from the input
spectrum X(e^jΩµ) into one frequency bin of the output spectrum Y(e^jΩv). Mathematically,
an operation that reduces the input frequency bandwidth by approximately 0.5 can be
expressed by

$$Y(e^{j\Omega_v}) = \frac{1}{\mu_u(v) - \mu_l(v) + 1} \sum_{\mu=\mu_l(v)}^{\mu_u(v)} \left|X(e^{j\Omega_\mu})\right| \, e^{j \arg X(e^{j\Omega_v})} \qquad (2)$$
The upper and lower boundaries of the interval that is projected onto output frequency
bin v are represented by the variables µ_u(v) and µ_l(v), respectively. These boundaries
are actually functions of v, allowing a variable amount of compression for different
frequencies. Compression with this equation derives the magnitude of Y(e^jΩv) as the
mean value of the magnitude of X(e^jΩµ) for µ = µ_l(v), ..., µ_u(v) while retaining
the phase of input component v for output component v.
Fig. 5 illustrates an example of the relationship between input frequency f_in and
output frequency f_out. The curved line shows the compression characteristic; the
horizontal and vertical lines indicate the frequency interval that is mapped. The
frequency in Hz can be converted to the bin index with the following relationship

$$\mu = \operatorname{round}\!\left(\frac{f}{f_s}\, N_{\mathrm{DFT}}\right)$$
[0051] Equation 2 also performs energy normalization with respect to the amount of frequency
compression. Energy normalization can take many forms, however, and could also be
applied in a fashion based on the momentary broadband or frequency localized SNR.
Another option is to maintain the form of the precompression estimated spectral envelope
and apply a gain or dampening factor to correct the energy level to correspond to
the slope of the envelope in the region of interest.
[0052] The equation also preserves the phase of the original uncompressed signal, but other
phase corrections and adjustments are also possible, such as using a random phase.
[0053] The compression functions can be continuous in nature and have various compression
rates over frequency. Preferably, the compression function is linear in lower frequencies,
corresponding to no compression when the slope is equal to 1. In the higher frequencies
of interest, the compression rate increases gradually, with the rate and degree of
increase dependent upon the feature-based classification.
[0054] In Fig. 6 various examples of continuous nonlinear compression functions are shown.
Those displayed with solid lines are linear with a slope of, or near, 1 in the lower
frequency range, hence performing little to no compression. With increasing input
frequency, the compression rate increases differently for each of the curves. The
dashed curves perform little to no compression in the middle frequency range, rather
than the low frequencies. These curves compress frequencies in the lower and higher
frequency extremes, and can be better understood as performing a transposition of
the middle frequencies down into a lower region with almost no compression.
[0055] The appropriate compression function for each class of the code book is to be defined
either manually, e.g., using subjective listening criteria, or automatically, e.g.,
applying unsupervised learning methods to find a preferred mapping to an output characteristic.
[0056] Different sets of class function assignments are generally possible, since the ideal
output for improved audio quality does not always coincide with that providing improved
speech recognition rates.
Speech can be made more accentuated if formants are amplified. Especially in situations
with background noise, we can expect a better localized SNR in these frequency regions,
so the broadband SNR can be improved by applying a frequency dependent gain factor
to the input spectrum. Ideally, the amplification should only be applied during speech
activity. Otherwise, background noise will be amplified during pauses.
[0058] The current invention can receive information about speech activity from an external
module, such as a noise detection and cancelation module.
[0059] The formant-boosting functions are also functions - (non)continuous, (non)linear,
and on a (non)linear scale - designed such that a variable gain factor is applied
to frequency ranges around formants in the spectrum. Fig. 7 shows a formant boosting
gain function and spectral representations (cluster mean and filtered cluster mean),
where the line with more details shows the average spectrum in dB, the line with less
details shows a filtered spectral representation of the code book entry. A curve can
be used to determine the percentage of the gain factor that is applied to the formant
frequency range. Alternatively individual gain and dampening curves can be stored
in a codebook and applied to those frequency ranges identified as formants in voiced
speech.
[0060] For some spaces between formants, the gain factor just described can be transformed
into a dampening factor using another set of curves. The boosting curves can be designed
to have positive values for amplifying the formant frequencies and negative values
for the dampening curves to reduce the magnitude of the valleys between formants.
[0061] When compression is applied an additional curve can be used in the frequency range
of the compression results. This amplifies or dampens the spectrum in the region of
the frequency compression and so can be used to amplify or dampen the effects of the
compression.
For each codebook entry, a decision can be made about the order of the spectral
compression and formant boosting processing steps. In some instances it can
be necessary to first apply spectral compression and then the boosting function. At
other times it should be possible to first boost (or dampen) regions of the spectrum
and then to apply the spectral compression. This could be important when the compression
is designed such that one of the formants is located in the compressed frequency range.
The prior boosting of the formant would serve to retain the formant shape even after
compression.
[0063] An overall block diagram of one possible incarnation of the system is seen in Fig.
8. Windows of a signal in the time domain are transformed to spectral representations
by an analysis filter-bank. By extracting feature vectors, a classification in relation
to the elements of a code book is performed, and signal processing is realized with the
transformations linked to the code book elements. The transformed spectral representations are transformed
to windows of a signal in the time domain by a synthesis filter-bank.
[0064] Fig. 9 shows examples for processing of phonemes:
- (a) with a spectral compression according to the shown compression characteristic
,
- (b) with formant boosting according to the shown gain g(Ωµ) between ±10 dB. The processed signals are band limited to 4 kHz.
[0065] Fig. 10 shows an example of the sauerkraut utterance processed with the 128 class
scheme. The energy of the fricatives "s", "ce" and "k" is transformed below 4 kHz.
[0066] The system is also capable of receiving information from other signal processing
modules, such as silence/unvoiced/voiced decisions from the noise estimation and reduction
module. Furthermore the module should be capable of sending information to other modules,
especially an ASR module, which can use the enhanced signal to improve recognition
rates. The ASR module could be retrained with the compressed output data. However,
even the compressed signal alone achieves a change in recognition rates that can be useful.
[0067] A focus on efficient algorithms and output signal synthesis allows the current invention
to also avoid costly postprocessing to achieve high quality audio output. Additionally,
transforming the input signal from the time domain to the frequency domain is not
bound to the FFT algorithm. Compression functions in the current invention are intended
to be designed such that they are able to be efficiently stored in memory and do not
inadvertently give portions of the input spectrum undesired emphasis.
Onset Sharpening
Another invention is disclosed, which is new and inventive independently of the independent
claims. This further invention is related to onset sharpening, which can be used independently
but is of course advantageously combinable with the previously described speech enhancement
methods.
The onset sharpening method performs an onset sharpening, i.e., it introduces attenuation
immediately before and/or amplification after speech onsets in order to make them
more accentuated. This improves speech quality and intelligibility, especially
for speech signals corrupted with background noise that are to be enhanced with noise
reduction methods. Noise reduction filters and their derivatives (Fig. 11) tend to
react too slowly and therefore remove desired signal components during speech onsets.
[0070] Before any onset-dependent signal enhancement can be performed, the speech onsets
need to be found first. This is done based on a recursive Wiener noise reduction filter.
There are no real-time constraints, so all of the following steps can be applied to
the entire signal x(t) giving the necessary data for the next step.
[0071] An analysis filter bank is needed to transform the input signal x(t) into the frequency
domain. The result is a function X(e^jΩµ, l) with µ = 0, ..., N_DFT/2 and
l = 0, ..., M-1, where M is the number of signal frames. This could also be
interpreted as an (N_DFT/2+1) × M matrix constituting a spectrogram.
[0072] Based on the spectral signal representation of the previous step, the attenuation
factors of the recursive Wiener filter characteristic can be computed. The following
frequency-independent parameters have been used as filter parameters:
- Maximum attenuation G_min(Ωµ, l) = G_min = 0.25 (corresponding to 20 log10(G_min) ≈ -12 dB)
- Overestimation factor β̃(Ωµ, l) = 5
The result G_rec(Ωµ, l) can again be seen as an (N_DFT/2+1) × M matrix containing
the attenuation factor for each sub-band at all time instances.
In order to smooth the filter coefficients in a perceptually meaningful manner over
time, a mel-filterbank of 32 bands is applied to G_rec(Ωµ, l), resulting in the
32 × M matrix G̃_rec(m, l).
The actual onset detection is performed within each mel-band of G̃_rec(m, l). Here,
the moments when G̃_rec(m, l) changes its value from G_min to 1 (or close to 1) are
of interest, i.e., the points when the filter opens. These time instances can be
found by taking the numerical derivative

$$\Delta \tilde{G}_{\mathrm{rec}}(m, l) = \tilde{G}_{\mathrm{rec}}(m, l) - \tilde{G}_{\mathrm{rec}}(m, l-1)$$

and comparing the resulting value with a threshold γ:

$$\Delta \tilde{G}_{\mathrm{rec}}(m, l) > \gamma$$
If this condition holds, the mel-band m at time instance l is labeled as a speech
onset. The derivative of the noise reduction coefficient lies in the range
[G_min - 1, 1 - G_min], and positive values indicate times when the filter is opening.
A threshold γ = 0.2 has proved to give good detection results for various speech
signals and SNRs. Because it could happen that the derivative is greater than γ for
several consecutive frames, a sliding time window of 100 ms duration is also applied.
Within this time window, only one detection is allowed. Furthermore, if an onset has
been detected in a certain mel-band, the neighboring mel-band towards lower frequencies
of the same frame l is also marked as a speech onset. For these experiments, the clean
speech was mixed with background noise recorded in a car driving at a speed of 160 km/h
to form an SNR of 1 dB during speech activity. Of course, the detection can be made
more sensitive by taking a lower threshold, e.g. γ = 0.1.
Based on the detections made with the method of the previous section, the onset sharpening
can be performed. It consists of placing attenuation immediately before a detected
speech onset and boosting the signal for a short time interval afterwards. The shape
of the attenuation/amplification is determined by a prototype function that has been
chosen to be

with the attenuation e = 0.5 and σ = 3. The term e/σ is used to normalize f(x)
to a maximum value of 1. Out of this prototype function, the onset sharpening gain
function g_os(l) can be sampled. When deriving the onset sharpening gain function,
three parameters can be set:
- 1. The width of the (negative) attenuation and the (positive) boosting part, defined
by the variable τos in ms.
- 2. The offset τ_offs in ms that defines at which time instance the gain function is placed. For τ_offs = 0 ms, the zero crossing is exactly at the detection point; for negative offset values the zero crossing occurs earlier. This is desirable because the noise reduction filter is then forced to open earlier.
- 3. A gain parameter α_os that controls the amount of attenuation and boosting. It is multiplied with the onset sharpening gain function g_os(l), which is interpreted as dB values.
[0078] Several other prototype functions could be devised for onset sharpening, e.g., a
sinusoid. The prototype function has been chosen with the stated parameters because
it decays smoothly towards the ends and offers a steep slope around the zero crossing.
This prototype function is now placed at each detected speech onset time instance
in the corresponding mel-band, giving the onset sharpening matrix for mel-bands
g̃_os(m, l), which then is expanded into the onset sharpening matrix for the subbands
g_os(Ωµ, l). The parameters that have been used for the onset sharpening gain functions
are τ_os = 75 ms, τ_offs = -10 ms and α_os = 3; of course other parameters are possible
as well. Using the notation of the mel-filterbank matrix A, the expansion from mel-bands
to subbands can be expressed as

$$g_{\mathrm{os}}(\Omega_\mu, l) = \sum_{m=0}^{31} \frac{A(m, \mu)}{\max_{\mu'} A(m, \mu')}\, \tilde{g}_{\mathrm{os}}(m, l)$$
The weighting of the filters contained in A, which gives triangles of a broader bandwidth
a lower amplitude, is removed by the normalization containing the maximum operation.
Fig. 12 discloses an onset sharpening algorithm in which the recursive Wiener
noise reduction filter is modified by the onset sharpening gain function. The simplest
way is by multiplication,

$$G_{\mathrm{os}}(\Omega_\mu, l) = G_{\mathrm{rec}}(\Omega_\mu, l) \cdot g_{\mathrm{os}}^{\mathrm{lin}}(\Omega_\mu, l),$$

where g_os^lin(Ωµ, l) = 10^(g_os(Ωµ, l)/20) is the onset sharpening function in linear
values. Since the noise reduction filter is applied multiplicatively to the input
spectrum, this filter modification could also be interpreted as a multiplication of
the signal spectrum X(e^jΩµ, l) with the onset sharpening gain before (or after)
applying the noise reduction filter.
[0081] In a second modification, the onset sharpening gain is built into the noise reduction
characteristic to modify the spectral floor and the maximum gain of the filter (which
is set to 1 in the recursive Wiener filter):

[0082] For this description, the gain function has to be separated into the part responsible
for the attenuation before speech onsets

and the part for amplification after speech onsets

[0083] A third possibility is to modify also the overestimation factor:

Inspection of noise reduction filters from the first and second modification shows
that there are only small differences between the characteristics. The parameter settings
for τ_os, τ_offs, α_os and the choice of the gain prototype function are of much greater
influence. Therefore, the simpler multiplicative modification G_os(Ωµ, l) will be used
for the evaluation.
For the evaluation of the speech onset enhancement method, the recursive Wiener filter
with G_os(Ωµ, l) has been compared to a standard recursive Wiener filter G_rec(Ωµ, l).
This has mainly been done on the basis of a logarithmic spectral distance (LSD)
measure. Comparison of the noise reduction filter coefficients for several characteristics
gives a qualitative impression of the opening/closing properties of a filter. Finally,
listening tests give useful criteria that help to judge intelligibility and the
amount of artifacts such as musical tones.
[0085] The idea in using an LSD measure is to create a signal x(t) = s(t)+ b(t) corrupted
with background noise, where the speech component s(t) and the noise b(t) are known.
A noise reduction filter is computed for the disturbed signal x(t) and the two signal
components are passed through this filter separately. Then, the distortion measures
LSD_speech and LSD_noise can be calculated between the original and the filtered signal.
Ideally, the speech component is passed through the filter unchanged, leading to a
distance close to zero, whereas large distortions can be expected for the noise
component. Based on these two measures, it is possible to judge the noise suppression
ability and how much the speech components are affected. The LSD between two
time-varying spectra S(e^jΩµ, l) and Ŝ(e^jΩµ, l) is defined as

$$\mathrm{LSD} = \frac{1}{D} \sum_{l:\,K_l>0} \sqrt{\frac{1}{K_l} \sum_{\mu} M(\Omega_\mu, l)\left(20\log_{10}\frac{\max\!\left(|S(e^{j\Omega_\mu}, l)|,\, S_{\min}\right)}{\max\!\left(|\hat{S}(e^{j\Omega_\mu}, l)|,\, \hat{S}_{\min}\right)}\right)^{2}}$$
The two variables S_min and Ŝ_min give lower bounds for the values that enter the
measure. The components that enter are selected by the binary mask M(Ωµ, l), defined as

The normalization factor K_l counts the number of components that are used for the
distance measure in each time frame. In order to avoid a division by zero, it is
defined as

The variable D gives the number of signal frames that are used for the calculation
of the LSD, i.e., the number of signal frames with K_l > 0.
For evaluating the performance of the modified recursive Wiener filter G_os(Ωµ, l),
a set of 616 filters has been computed with all possible combinations of the parameters

τ_os ∈ [50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100] ms
τ_offs ∈ [0, -5, -10, -15, -20, -25, -30, -35] ms
α_os ∈ [0, 2, 4, 6, 8, 10, 12]

(11 × 8 × 7 = 616 combinations).
The signal that has been used is the "Sauerkraut is served once a week" utterance
used throughout this text, mixed with background noise from a car driving at a constant
speed of 160 km/h. The SNR during speech activity is adjusted to 1 dB. Then, only the
speech and only the noise components have been processed with these filters and the
distances LSD_speech^os and LSD_noise^os have been calculated. As reference for the
comparison, a recursive Wiener filter has been designed and the measures
LSD_speech^rec and LSD_noise^rec have been evaluated.
As mentioned earlier, a good filter is characterized by a small LSD for speech and
a large value for noise. Obviously, these two requirements are difficult to meet at
the same time and an increase in one category usually falls together with an increase
in the other one. For better comparison of the two filter types, the differences
ΔLSD_speech = LSD_speech^rec - LSD_speech^os and ΔLSD_noise = LSD_noise^os - LSD_noise^rec
can be calculated. They are defined such that a positive value means that the onset
sharpening approach gives better results in the LSD sense. This is apparently not
always the case and again it can be seen that an improvement in noise suppression
leads to worse results for speech and vice versa. However, there are some combinations
where a positive value can be achieved in both disciplines, e.g., for the parameter
combination τ_os = 100 ms, τ_offs = -30 ms.
The gain factor α_os basically only scales the distance measure. For the special case
of α_os = 0, the modified filter reduces to the standard recursive Wiener filter and
thus gives the same LSD.
[0093] Comparing the noise reduction filter coefficients for a Wiener filter, a recursive
Wiener filter and a recursive Wiener filter modified by onset detection for several
subbands and different filter parameters gives the following result. The same signal
as for the LSD measures has been used for the design of these filters.
[0094] The parameters that have been used are

[0095] But of course other parameters are possible, too.
[0096] The Wiener filter opens more often, which potentially results in musical tones. It
has also been seen that the recursive Wiener filter opens a bit later, which can be
corrected by the onset sharpening modification.
First of all it should be noticed that the offset τ_offs = -30 ms is fairly large
compared to the duration of τ_os = 100 ms. This causes the modified filter coefficient
near a speech onset to increase, then decrease for a few frames, and finally grow
again with the opening of the recursive Wiener filter. Apparently, this is not optimal
behavior, even though this parameter combination gave the best results in the LSD.
This gives rise to the assumption that a different prototype function should be used.
Good candidates would be flatter around their maximum, avoiding the "crash down" near
0.18 and 0.5 seconds. At any rate, a filter that opens earlier seems to give benefits
in terms of the LSD measure.
However, for larger gains α_os, the noise floor was decreased at the expense of
increased filtering artifacts.
Fig. 11 shows a procedure where the attenuation factor and its derivative, together
with the threshold, are shown for mel-band 25 (corresponding to frequencies between
3.6 and 4.4 kHz). Because it could happen that the derivative is greater than γ for
several consecutive frames, a sliding time window of 100 ms duration is also applied.
Within this time window, only one detection is allowed. Furthermore, if an onset has
been detected in a certain mel-band, the neighboring mel-band towards lower frequencies
of the same frame l is also marked as a speech onset.
[0100] The signal that has been used is the utterance "Sauerkraut is served once a week"
from the TIMIT database that has also been used before. The clean speech is mixed
with background noise recorded in a car driving at a speed of 160 km/h to form an
SNR of 1 dB during speech activity.
1. A method for adaptive spectral transformation for acoustic speech signals comprising
the steps of
receiving at least one spectral input representation corresponding to at least one
window of a time domain input signal of acoustic speech,
selecting, from the spectral input representations, at least one selected spectral representation
to be transformed,
assigning the at least one selected spectral representation to one of a set of cluster
centres, wherein
the cluster centres are defined on the basis of spectral representations of windowed
acoustic speech segments of a speech corpus by a clustering algorithm,
spectral class representations are assigned to the cluster centres and are elements
of a code book and
the code book links to each spectral class representation at least one spectral transformation
which enhances the corresponding spectral class representation,
transforming each selected spectral representation to a spectral output representation,
wherein the applied transformation corresponds to the at least one spectral transformation
linked to the cluster centre which is assigned to the respective selected spectral
representation, and
providing the spectral output representations to synthesize an acoustic speech signal.
2. Method as claimed in claim 1, wherein assigning the at least one selected spectral
representation to one of a set of cluster centres includes
calculating distance measures between the selected spectral representation and all
the spectral class representations of the code book, and
assigning the at least one selected spectral representation to the cluster centre
with the shortest distance measures between the selected spectral representation and
the spectral class representations of the cluster centre.
3. Method as claimed in claim 2, wherein calculating distance measures includes calculating
feature vectors for the spectral representations and the distance measures are distances
between the feature vectors.
4. Method as claimed in claim 3, wherein the feature vectors are calculated from the
spectral representations by a filtering transformation, preferably with a mel-filterbank,
wherein the mel-filterbank optionally uses overlapping triangular windows with widths
variable with frequency.
5. Method as claimed in one of claims 1 to 4, wherein the code book includes at least
eight, preferably thirty-two, optionally 128 cluster centres.
6. Method as claimed in one of claims 1 to 5, wherein the spectral class representations
for the cluster centres are averaged spectral representations averaged over spectral
representations of corresponding cluster elements for respective classes.
7. Method as claimed in one of claims 1 to 6, wherein the definition of cluster centres
by the clustering algorithm is made on the basis of a preselected sub-corpus of the
speech corpus.
8. Method as claimed in claim 7, wherein preselecting a sub-corpus includes the reduction
to spectral representations of the speech corpus which have a spectral centroid lying
above a threshold frequency, preferably above 3 kHz.
9. Method as claimed in one of claims 1 to 8, wherein one of the at least one spectral
transformation linked to each spectral class representation of the code book is a
spectral compression transformation mapping a frequency interval of the selected spectral
representation to a smaller frequency interval of the spectral output representation.
10. Method as claimed in claim 9, wherein the spectral compression transformation linked
to each spectral class representation of the code book is compressing a bandwidth
of 0 to 8 kHz to a bandwidth of 0 to 4 kHz, preferably with a compression only at
the upper or lower end of the bandwidth, optionally linear at least in the middle
frequency range, corresponding to no compression when the slope is equal to 1.
11. Method as claimed in one of claims 1 to 7, wherein one of the at least one spectral
transformation linked to each spectral class representation of the code book is a
formant boosting gain function.
12. Method as claimed in claim 11, wherein the formant boosting gain function linked to
each spectral class representation of the code book is amplifying at least one, preferably
two or three, of the formants at low frequencies.
13. Method as claimed in claim 8, wherein selecting at least one selected spectral representation
to be transformed includes calculating the spectral centroid of each spectral input
representation and selecting spectral input representations which have a spectral
centroid lying above a threshold frequency, preferably above 3 kHz.
14. Method as claimed in one of claims 1 to 13, wherein selecting at least one selected
spectral representation to be transformed includes detecting at least one of speech
activity and background noise level and selecting spectral input representations with
speech to be transformed.
15. A computer program comprising program code means adapted to perform all the steps
of any one of the claims 1 to 14 when said program is run on a computer.
1. Verfahren zur adaptiven spektralen Transformation für akustische Sprachsignale, mit
folgenden Verfahrensschritten:
Erhalt wenigstens einer spektralen Eingangs-Repräsentation, welche wenigstens einem
Fenster eines Zeitbereichs-Eingangssignals für akustische Sprache entspricht,
Auswahl mindestens einer zu transformierenden spektralen Eingangs-Repräsentation aus
den spektralen Eingangs-Repräsentationen,
Zuordnen der ausgewählten spektralen Repräsentationen zu einem Cluster-Zentrum eines
Satzes von Cluster-Zentren, wobei
die Cluster-Zentren durch einen Clustering-Algorithmus auf der Basis der spektralen
Repräsentationen von Zeitfenstern zugeordneten Segmenten eines Sprach-Corpus definiert
sind,
den Cluster-Zentren spektrale Klassen-Repräsentationen zugeordnet werden, die Elemente
eines Code-Buches sind und
das Code-Buch jeder spektralen Klassen-Repräsentation wenigstens eine spektrale Transformation
zuweist, welche die entsprechende spektrale Klassen-Repräsentation verbessert,
Transformieren jeder ausgewählten spektralen Repräsentation in eine spektrale Ausgangs-Repräsentation,
wobei die angewendete Transformation der wenigstens einen spektralen Transformation
entspricht, welche dem Cluster-Zentrum zugewiesen ist, welches der jeweiligen ausgewählten
spektralen Repräsentation zurgeordnet ist, und
Bereitstellen der spektralen Ausgangs-Repräsentationen, um ein akustisches Sprach-signal
zu synthetisieren.
2. Verfahren nach Anspruch 1, bei dem das Zuordnen der ausgewählten spektralen Repräsentationen
zu einem Cluster-Zentrum eines Satzes von Cluster-Zentren folgendes umfasst.
Berechnen von Distanzmassen zwischen der ausgewählten spektralen Repräsentation und
allen spektralen Klassen-Repräsentationen des Code-Buches, und
Zuordnen der wenigstens einen ausgewählten spektralen Repräsentation zum Cluster-Zentrum
mit den kürzesten Distanzmassen zwischen der ausgewählten spektralen Repräsentation
und den spektralen Klassen-Repräsentationen des Cluster-Zentrums
3. Verfahren nach Anspruch 2, bei dem das Berechnen der Distanzmasse folgendes umfasst:
Berechnen der Merkmalsvektoren für die spektralen Repräsentationen, und die Distanzmasse
sind Distanzen zwischen den Merkmalsvektoren.
4. Verfahren nach Anspruch 3, bei dem die Merkmalsvektoren durch eine Filterungs-Transformation
aus den spektralen Repräsentationen errechnet werden, vorzugweise mit einer Mel-Filterbank,
wobei die Mel-Filterbank gegebenenfalls einander überlappende, dreieckige Fenster
mit mit der Frequenz variablen Breiten benutzt.
5. Verfahren nach einem der Ansprüche 1 bis 4, bei dem das Code-Buch zumindest acht,
vorzugsweise zwei und dreissig, gegebenenfalls 128 Cluster-Zentren aufweist.
6. Verfahren nach einem der Ansprüche 1 bis 5, bei dem die spektralen Klassen-Repräsentationen
für die Cluster-Zentren gemittelte spektrale Repräsentationen sind, welche über spektrale
Repräsentationen entsprechender Cluster-Elemente für die jeweiligen Klassen gemittelte
worden sind.
7. Verfahren nach einem der Ansprüche 1 bis 6, bei dem die Definition von Klassen-Zentren
durch den Clustering-Algorithmus auf der Basis eines vorgewählten Sub-Corpus des Sprach-Corpus
vorgenommen wird.
8. Verfahren nach Anspruch 7, bei dem die Vorauswahl eines Sub-Corpus die Reduktion auf
spektrale Repräsentationen des Sprach-Corpus umfasst, die einen spektralen Mittelpunkt
haben, der oberhalb einer Schwellwertfrequenz liegt, vorzugsweise oberhalb von 3 kHz,
9. Method according to any one of claims 1 to 8, in which one of the at least one spectral transformation assigned to each spectral class representation of the code book is a spectral compression transformation, which maps a frequency interval of the selected spectral representation onto a smaller frequency interval of the spectral output representation.
10. Method according to claim 9, in which the spectral compression transformation assigned to each spectral class representation of the code book compresses a bandwidth of 0 to 8 kHz to a bandwidth of 0 to 4 kHz, preferably with compression only at the upper or lower end of the bandwidth, and optionally linear at least in the middle frequency range, corresponding to no compression where the slope is equal to 1.
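One way to realise the compression of claims 9 and 10 is a piecewise frequency warp: identity (slope 1, i.e. no compression) over the lower and middle band, with compression only towards the upper end, so that 0 to 8 kHz is squeezed into 0 to 4 kHz. The break frequency and compression ratio below are illustrative assumptions, not claimed values.

    import numpy as np

    def compress_spectrum(spectrum_8k, fs_in=16000.0):
        n = len(spectrum_8k)
        f_in = np.linspace(0.0, fs_in / 2.0, n)        # input axis, 0 to 8 kHz
        # Warped target frequency for every input bin: slope 1 up to 2 kHz,
        # then 2 to 8 kHz compressed linearly into 2 to 4 kHz.
        f_out = np.where(f_in <= 2000.0,
                         f_in,
                         2000.0 + (f_in - 2000.0) / 3.0)
        out_axis = np.linspace(0.0, 4000.0, n // 2 + 1)
        # Resample the warped magnitude spectrum onto a regular output grid.
        return np.interp(out_axis, f_out, spectrum_8k)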
11. Method according to any one of claims 1 to 7, in which one of the at least one spectral transformation assigned to each spectral class representation of the code book is a formant amplification function.
12. Method according to claim 11, in which the formant amplification function assigned to each spectral class representation of the code book amplifies at least one, preferably two or three, of the formants at low frequencies.
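A crude sketch of a formant amplification function in the sense of claims 11 and 12: pick the strongest local maxima of the magnitude spectrum below a cutoff as formant candidates and raise them by a fixed gain. The peak picking, cutoff and gain are all assumptions; real formant tracking would be more involved.

    import numpy as np

    def amplify_low_formants(spectrum, fs, n_formants=2,
                             cutoff_hz=3000.0, gain=2.0):
        freqs = np.linspace(0.0, fs / 2.0, len(spectrum))
        out = spectrum.copy()
        # Local maxima below the cutoff serve as naive formant candidates.
        peaks = [i for i in range(1, len(spectrum) - 1)
                 if spectrum[i] > spectrum[i - 1]
                 and spectrum[i] > spectrum[i + 1]
                 and freqs[i] < cutoff_hz]
        # Amplify the n_formants strongest low-frequency peaks.
        for i in sorted(peaks, key=lambda j: spectrum[j],
                        reverse=True)[:n_formants]:
            out[i - 1:i + 2] *= gain
        return out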
13. Method according to claim 8, in which the selection of at least one selected spectral representation to be transformed comprises computing the spectral centroid of each spectral input representation, and selecting those spectral input representations that have a spectral centroid above a threshold frequency, preferably above 3 kHz.
14. Method according to any one of claims 1 to 13, in which the selection of at least one selected spectral representation to be transformed comprises determining at least the voice activity or the background noise level, and selecting the spectral input representations containing speech to be transformed.
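Claim 14 leaves the detector open; as a placeholder, a short-time-energy test can gate which input windows are handed to the transformation. The threshold is an arbitrary assumption.

    import numpy as np

    def select_active_windows(windows, energy_threshold=1e-4):
        # Trivial stand-in for a voice activity detector: keep windows
        # whose mean energy suggests speech rather than background noise.
        return [w for w in windows
                if float(np.mean(np.square(w))) > energy_threshold]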
15. Computer program with program code means suitable for carrying out all the steps of any one of claims 1 to 14 when the program is run on a computer.