[0001] The present invention generally relates to the processing of voice signals in order
to enhance characteristics which are useful for a further technical use of the processed
voice signal. The invention particularly relates to the enhancement and extraction
of formants from audio signals such as e.g. speech signals.
[0002] The proposed processing is useful e.g. for hearing aids, automatic speech recognition
and the training of artificial speech synthesis with the extracted formants.
[0003] "Formants" are the distinguishing or meaningful frequency components of human speech.
According to one definition (see e.g. http://en.wikipedia.org/wiki/Formants also for
more details and citations) a formant is a peak in an acoustic frequency spectrum
which results from the resonant frequencies of any acoustical system (acoustical tube).
It is most commonly invoked in phonetics or acoustics involving the resonant frequencies
of vocal tracts.
[0004] The detection of formants is useful e.g. in the framework of speech recognition systems
and speech synthesizing systems. Today's speech recognition systems work very well
in well-controlled, low-noise environments but show severe performance degradation
when the distance between the speaker and the microphone varies or noise is present.
The formant frequencies, i. e. the resonance frequencies of the vocal tract, are one
of the cues for speech recognition.
[0005] Vowels are recognized mainly on the basis of the formant frequencies and their
transitions, and the formants also play a very important role for consonants.
[0006] Known speech recognition systems follow a purely probabilistic approach and use the
formant frequencies and transitions only implicitly. The features they use approximate
the positions of the formants but no explicit formant extraction or tracking is performed.
[0007] A further application of formant transitions is their use for speech synthesis.
Current synthesis systems based on the concatenation of prerecorded blocks (diphone
concatenation) perform significantly better than those that use formants or vocal
tract filter shapes directly. However, this is due to the difficulty of finding the
right parameterization of these models rather than to an intrinsic problem of the sound
generation. For example, driving such a formant-based synthesis system with parameters
extracted from measurements on humans produces natural-sounding speech. Formant extraction
algorithms can be used to determine the articulation parameters
from large corpora of speech, and a learning algorithm can be developed which determines
their correct setting during the speech synthesis process.
[0008] Due to their key role, a vast variety of algorithms for extracting formants has been
published. In most known approaches the formant frequencies, and hence the poles of the
vocal tract filter, are modeled directly, e.g. via Linear Predictive Coding or via an AM-FM
modulation model. Another approach evaluates the phase information to decide
whether a spectral peak is a formant.
[0009] It is also known to use bandpass filters and first-order LPC analysis to extract
the formants. The bandpass center frequencies are adapted based on the location found
in the previous time step. Additionally, a voiced/unvoiced decision is incorporated
in the formant extraction.
Object of the present invention
[0010] It is the object of the present invention to propose an improved approach to enhance
formants of audio signals, preferably speech signals.
[0011] This object is achieved by means of the features of the independent claims. The dependent
claims develop further the central idea of the present invention.
[0012] A first aspect of the invention relates to a method for enhancing the formants of
an audio signal, the method comprising the following steps:
- a.) applying a frequency conversion on an audio signal,
- b.) enhancing the formant tracks via filtering in the spectral domain.
[0013] The size of the filters used in step b.) can be adapted, in a configuration step,
depending on center frequencies of the frequency conversion step.
[0014] The size of the filters used in step b.) can be adapted corresponding to the spectral
resolution of the frequency conversion step.
[0015] The size of the filters used in step b.) can be adapted corresponding to expected
formants which e.g. occur typically in speech signals.
[0016] Before step b.) the fundamental frequency of the audio signal can be estimated and
then essentially eliminated.
[0017] Before step b.) the spectral distribution of the excitation of the acoustic tube
can be estimated and an amplification of the spectrogram with the inverse of this
distribution can be performed.
[0018] After step a.) the envelope of the signal can be determined e.g. via rectification
and low-pass filtering.
[0019] A Gammatone filter bank can be used for the frequency conversion step.
[0020] A reconstructive filtering can be applied on the result of step b.).
[0021] The reconstructive filtering can use filters adapted to the expected formants of
the supplied audio signal, and the reconstruction is done by adding the impulse response
of the used filters weighted with the response when filtering with said filter.
[0022] Even Gabor filters can be used for the reconstructive filtering.
[0023] The width of the reconstructing filters is adapted corresponding to the spectral
resolution of the frequency conversion step or the mean bandwidths of preset formants
expected to be present in the supplied audio signal.
[0024] The enhanced formants can then be extracted from the signal for further use.
[0025] The method of enhancing the formants can be used e.g. for speech enhancement.
[0026] The method can be used together with a tracking algorithm in order to carry out a
speech recognition on the supplied audio signal.
[0027] The method can be used to train artificial speech synthesis systems with the extracted
formants.
[0028] The invention also relates to a computer program product, implementing such a method.
[0029] The invention further relates to a hearing aid comprising a computing unit designed
to implement such a method.
[0030] Further features, objects and advantages of the invention will now be explained with
reference to the figures of the enclosed drawings.
Figure 1: Spectrogram of a speech signal after application of a Gammatone filter bank
and envelope calculation. The sentence is a German male speaker saying "Ich hätte
gerne eine Zugverbindung für morgen." ("I would like a train connection for tomorrow.")
from the Kiel Corpus of Spontaneous Speech.
Figure 2: Spectrogram after elimination of the fundamental frequency for the same
sentence as in Fig. 1.
Figure 3: Spectrogram after pre-emphasis for the same sentence as in Fig. 1.
Figure 4: Spectrogram after filtering with a Mexican Hat whose width is adapted to
the center frequency, for the same sentence as in Fig. 1.
Figure 5: Spectrogram from Fig. 4 after normalization at each sample.
Figure 6: Enhanced spectrogram of a synthesized speech signal ("Five women played
basketball"). The true formant tracks of the first four formants are depicted by black
and yellow dashed lines.
Figure 7: Enhanced spectrogram of a synthesized speech signal when babble noise was
added at 20 dB. The true formant tracks of the first four formants are depicted by
dashed lines.
Figure 8: Enhanced spectrogram of a synthesized speech signal when babble noise was
added at 10 dB. The true formant tracks of the first four formants are depicted by
dashed lines.
Figure 9: Enhanced spectrogram of a synthesized speech signal when babble noise was
added at 0 dB. The true formant tracks of the first four formants are depicted by
dashed lines.
Figure 10: Schematic flow chart of the method.
Figure 11: Schematic flow chart of a reconstructive filtering.
[0031] The invention proposes a method and a system (see figure 10) which enhances the formants
in the spectrogram and allows a subsequent extraction of the enhanced formants.
Frequency conversion:
[0032] The invention proposes to apply e.g. a Gammatone filter bank to a supplied audio
signal in order to obtain a spectro-temporal representation of the signal.
In any case the audio signal is converted into the frequency domain.
[0033] The first stage in the system as shown in figure 10 is the application of a Gammatone
filter bank to the signal. The filter bank has e.g. 128 channels ranging from e.g.
80 Hz to 5 kHz. From this signal the envelope is calculated via rectification and low-pass
filtering. The result of this processing can be seen in Fig. 1.
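The envelope calculation can be illustrated with a minimal pure-Python sketch. Half-wave rectification is followed by a first-order low-pass filter; the 50 Hz cutoff, the sampling rate and all names are illustrative assumptions, not part of the invention:

```python
import math

def envelope(channel, fs=16000.0, cutoff=50.0):
    """Envelope of one filter-bank channel: half-wave rectification
    followed by a first-order (leaky-integrator) low-pass filter."""
    a = math.exp(-2.0 * math.pi * cutoff / fs)  # IIR smoothing coefficient
    env, y = [], 0.0
    for x in channel:
        r = max(x, 0.0)            # half-wave rectification
        y = a * y + (1.0 - a) * r  # low-pass smoothing
        env.append(y)
    return env

# A 200 Hz test tone yields a non-negative, slowly varying envelope.
sig = [math.sin(2 * math.pi * 200 * n / 16000.0) for n in range(1600)]
env = envelope(sig)
```

In a full system this would be applied per channel of the filter-bank output to form the spectrogram of Fig. 1.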
Estimation and subsequent elimination of the fundamental frequency:
[0034] In order to reduce the impact of the fundamental frequency on the position of the
formants, especially the first formant, the fundamental frequency of voiced signal
parts can be estimated and subsequently eliminated from the spectrogram.
[0035] In the excitation signal of the vocal tract the energy of the fundamental frequency
is normally much higher than that of the harmonics. As a consequence of this unbalanced
excitation of the first formant, with high energy at the fundamental frequency and
significantly lower energy at the adjacent harmonics, it is difficult to extract its
correct location. For this reason the invention proposes to eliminate the fundamental
frequency of voiced signal parts from the spectrogram.
[0036] For example, an algorithm based on a histogram of zero-crossing distances can be used to
estimate the fundamental frequency. In principle any pitch estimation algorithm can
be used for this estimation.
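A hedged sketch of one plausible zero-crossing variant follows; it does not reproduce the exact cited algorithm, and the sampling rate and test tone are illustrative:

```python
import math
from collections import Counter

def estimate_f0(signal, fs):
    """Rough pitch estimate from a histogram of zero-crossing distances:
    the most frequent distance between successive rising zero crossings
    is taken as the pitch period."""
    crossings = [n for n in range(1, len(signal))
                 if signal[n - 1] < 0.0 <= signal[n]]
    distances = [b - a for a, b in zip(crossings, crossings[1:])]
    if not distances:
        return None  # no voiced content detected
    period, _ = Counter(distances).most_common(1)[0]
    return fs / period

fs = 16000.0
sig = [math.sin(2 * math.pi * 100 * n / fs) for n in range(1600)]
f0 = estimate_f0(sig, fs)  # close to 100 Hz for this test tone
```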
[0037] For the elimination of the fundamental frequency the filter channels in the neighborhood
of the estimated fundamental frequency are set to the noise floor. In order to recreate
smooth transitions after the elimination of the fundamental frequency and to reduce
the computational load, a smoothing in the time domain and an optional sub-sampling
are performed.
[0038] The results of this processing can be seen in Fig. 2.
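A minimal sketch of the elimination step might look as follows; the relative neighborhood width and the noise-floor value are illustrative assumptions:

```python
def suppress_f0(frame, center_freqs, f0, rel_width=0.5, noise_floor=1e-6):
    """Set all filter channels whose center frequency lies within
    +/- rel_width * f0 of the estimated fundamental to the noise floor."""
    return [noise_floor if abs(cf - f0) < rel_width * f0 else e
            for cf, e in zip(center_freqs, frame)]

frame = [1.0, 5.0, 2.0, 0.5, 0.3]           # channel energies of one frame
cfs   = [80.0, 100.0, 200.0, 400.0, 800.0]  # channel center frequencies (Hz)
out = suppress_f0(frame, cfs, f0=100.0)
# the channels at 80 Hz and 100 Hz fall inside the +/- 50 Hz neighborhood
```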
Filtering in the spectro-temporal domain in order to enhance the formants
[0039] In a next step the high frequencies are emphasized.
[0040] A filtering along the channel axis is performed. During the filtering the size of
the filtering kernel is changed in a position-dependent manner, i.e. with wide kernels at low
frequencies and narrow kernels at high frequencies. This takes into account the logarithmic
arrangement of the center frequencies in the Gammatone filter bank.
[0041] The energy of the glottal excitation signal shows a general decay with frequency.
Therefore low formants, being excited by low harmonics, have much more energy than high
formants. Similarly, the noise-like excitation, whose energy lies mostly in the
high frequencies, has a much lower overall energy than the harmonic excitation. As
a consequence, in a speech signal the energy in the low frequencies is much higher
than in the high frequencies. To overcome this problem we perform a pre-emphasis of
the spectrogram. This pre-emphasis raises the energy of the high frequencies (compare
Fig. 3).
[0042] A known way is to use a high-pass filter, but as the audio signal is already represented
in the spectro-temporal domain, the invention proposes to weight the energy of the
filter channels with an exponentially decreasing weight from the high to the low frequencies.
Subsequently a smoothing along the frequency axis is carried out. Via this smoothing
the energy of the single harmonics is spread and peaks form at the formant locations.
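The channel weighting can be sketched as follows; the per-channel decay factor is an illustrative assumption. The highest channel keeps its energy and lower channels are attenuated progressively:

```python
def pre_emphasis(frame, decay=0.98):
    """Weight channel energies with an exponentially decreasing weight
    from the highest to the lowest channel."""
    n = len(frame)
    return [e * decay ** (n - 1 - i) for i, e in enumerate(frame)]

out = pre_emphasis([1.0] * 5)
# the lowest channel is attenuated the most, the highest not at all
```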
[0043] When using a filter bank with a logarithmic arrangement of center frequencies, as
in the case of the Gammatone filter bank, the size of the smoothing kernel has to
be set depending on the center frequencies. It has to be wide at low frequencies, where
the filter bandwidths, and hence the increments of the center frequencies, are small,
in order to cover the necessary frequency range.
[0044] In contrast, it is made small at high frequencies, where the filter bandwidths are
large. As smoothing kernel a Gaussian can be used, but we achieved
better results with a Mexican Hat (Difference of Gaussians). The Mexican Hat operator
enhances line-like structures and suppresses the regions in between them.
[0045] Figure 4 shows the results of this operation. The resulting spectrogram contains
negative values due to the application of the Mexican Hat. Depending on the further
processing it can be beneficial to set these negative values to zero, but in our case
they have been kept as they permit a better enhancement of the formant tracks.
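The position-dependent Mexican Hat smoothing along the channel axis can be sketched as follows; the kernel truncation at three standard deviations and the 1.6-sigma ratio between the two Gaussians are illustrative assumptions:

```python
import math

def dog_kernel(sigma, ratio=1.6):
    """Mexican Hat as a Difference of Gaussians with widths sigma and ratio*sigma."""
    half = int(3 * ratio * sigma)
    xs = range(-half, half + 1)
    def gauss(s):
        return [math.exp(-x * x / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
                for x in xs]
    return [a - b for a, b in zip(gauss(sigma), gauss(ratio * sigma))]

def smooth_channels(frame, sigmas):
    """Filter along the channel axis with a kernel whose width sigmas[i]
    depends on the channel index (wide at low, narrow at high channels)."""
    n = len(frame)
    out = []
    for i in range(n):
        k = dog_kernel(sigmas[i])
        half = len(k) // 2
        acc = 0.0
        for j, w in enumerate(k):
            idx = i + j - half
            if 0 <= idx < n:  # ignore contributions outside the channel range
                acc += w * frame[idx]
        out.append(acc)
    return out

# A single harmonic peak produces a ridge centered at the same channel.
frame = [0.0] * 21
frame[10] = 1.0
out = smooth_channels(frame, [2.0] * 21)
```

Note that, as in the description, the result may contain negative values because of the negative side lobes of the Mexican Hat.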
[0046] As can be seen, the formant structure is now clearly visible as dark ridges in the
spectrogram. Finally, a normalization of the values to the maximum at each sample
is performed (compare Fig. 5). By doing so, the formants are visible also in signal
parts where the energy is relatively low.
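The per-sample normalization can be sketched as follows; the epsilon guard against silent frames is an illustrative addition:

```python
def normalize_frame(frame, eps=1e-9):
    """Normalize one spectrogram frame (one time step) to its maximum,
    so that formants remain visible in low-energy signal parts."""
    m = max(frame)
    if m <= eps:
        return list(frame)  # leave (near-)silent frames unchanged
    return [v / m for v in frame]

out = normalize_frame([0.5, 2.0, 1.0])
```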
[0048] Results for the clean signal can be seen in Fig. 6. The correct formant tracks for
the first four formants are given by the dashed line. As can be seen from the plot
our algorithm represents the formants quite accurately.
[0049] In Fig. 7 the result of the same signal with babble noise added at an SNR of 20 dB
is shown. The location of the peaks is hardly affected by the additional noise.
[0050] In Fig. 8 we further increased the noise level to 10 dB. As a consequence the ridges
in the enhanced spectrogram show more discontinuities but their location is still
correct.
[0051] Finally we added babble noise at 0 dB in Fig. 9. The discontinuities increase further
with the decreasing SNR but the location of the ridges does not change significantly
even for such low SNR values.
Reconstruction step
[0052] In order to further enhance the formant frequencies an optional reconstruction step
can be performed on the result of the filtering in the spectral domain. Figure 11
shows a block diagram of the reconstructive filtering.
[0053] To this end, the result of the previous enhancement (smoothing along the frequency
axis) is filtered with a set of n parallel filters whose impulse responses are adapted
to the expected structure. This can for example be a set of n even Gabor filters with
different orientations and frequencies.
[0054] Gabor filters are known to the skilled person and can be defined as linear filters
whose impulse response is defined by a harmonic function multiplied by a Gaussian
function.
[0055] For the reconstruction, the impulse responses (receptive fields) of these filters
are respectively weighted with the corresponding response obtained when applying the filter
to the data, i.e. the result of the preceding filtering step. Therefore, during the
reconstruction each filter does not generate only a single point at its center
but a structure corresponding to the whole area of its impulse response.
[0056] Finally, all these responses are added up to form the resulting spectral representation,
which is the result of the reconstruction. As a consequence, the result will show structures
in accordance with the impulse responses of the filters (e.g. lines when even Gabor
filters are used). This is due, among other things, to the fact that the set of Gabor filters
used is not complete and hence cannot reconstruct the original data perfectly,
but only a subset with properties defined by the filters used (line structures in our
case). The width of these reconstruction filters can also be adapted in accordance
with the spectral resolution or the expected formant bandwidth.
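The reconstructive filtering described above can be sketched as follows; the Gabor parameterization and kernel sizes are illustrative assumptions, and the nested loops are written for clarity rather than efficiency:

```python
import math

def even_gabor(size, theta, wavelength, sigma):
    """Even (cosine-phase) 2D Gabor patch with orientation theta."""
    half = size // 2
    patch = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)
            g = math.exp(-(x * x + y * y) / (2 * sigma * sigma))
            row.append(g * math.cos(2 * math.pi * xr / wavelength))
        patch.append(row)
    return patch

def reconstruct(data, kernels):
    """At every position, the response of each kernel weights that kernel's
    impulse response; all weighted impulse responses are summed up."""
    h, w = len(data), len(data[0])
    out = [[0.0] * w for _ in range(h)]
    for k in kernels:
        half = len(k) // 2
        for i in range(h):
            for j in range(w):
                r = 0.0  # response of the kernel centered at (i, j)
                for di in range(-half, half + 1):
                    for dj in range(-half, half + 1):
                        if 0 <= i + di < h and 0 <= j + dj < w:
                            r += k[di + half][dj + half] * data[i + di][j + dj]
                # add the impulse response weighted with the response
                for di in range(-half, half + 1):
                    for dj in range(-half, half + 1):
                        if 0 <= i + di < h and 0 <= j + dj < w:
                            out[i + di][j + dj] += r * k[di + half][dj + half]
    return out

# A single point is spread into the line-like structure of the Gabor kernel.
data = [[0.0] * 9 for _ in range(9)]
data[4][4] = 1.0
out = reconstruct(data, [even_gabor(5, 0.0, 4.0, 1.5)])
```

Because the filter set is deliberately incomplete, the output emphasizes the line structures matching the kernels rather than reproducing the input exactly, as described above.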
Applications of the invention
[0057] The invention can be applied to the enhancement of speech signals, especially for
the hearing impaired as it is known that enhancing the formants increases intelligibility
for them.
[0058] Combined with a tracking algorithm it is possible to use it for speech recognition
or the learning of parameters for formant based speech synthesis.
1. A method for enhancing the formants of an audio signal, the method comprising the
following steps:
a.) applying a frequency conversion on an audio signal,
b.) enhancing the formant tracks via filtering in the spectral domain.
2. The method according to claim 1,
wherein step b.) is carried out using a smoothing with a defined smoothing kernel.
3. The method according to claim 1 or 2,
wherein the size of filters used in step b.) is adapted depending on center frequencies
of the frequency conversion step.
4. The method according to claim 3,
wherein the size of filters used in step b.) is adapted corresponding to the spectral
resolution of the frequency conversion step.
5. The method according to claim 4,
wherein the size of filters used in step b.) is adapted corresponding to preset expected
formants.
6. The method according to any of the preceding claims,
wherein before step b.) the fundamental frequency of the audio signal is estimated
and then eliminated.
7. The method according to any of the preceding claims,
wherein before step b.) the spectral distribution of the excitation of the acoustic
tube is estimated and an amplification of the spectrogram with the inverse of this
distribution is performed.
8. The method according to any of the preceding claims,
wherein after step a.) the envelope of the signal is determined e.g. via rectification
and low-pass filtering.
9. The method according to any of the preceding claims,
wherein a Gammatone filter bank is used for the frequency conversion step a.).
10. The method according to any of the preceding claims,
wherein a reconstructive filtering is applied on the result of step b.).
11. The method according to claim 10,
wherein the reconstructive filtering uses filters adapted to expected formants of
the supplied audio signal, and
the reconstruction is done by adding the impulse response of the used filters weighted
with the response when filtering with said filter.
12. The method according to claim 11,
wherein even Gabor filters are used for the reconstructive filtering.
13. The method according to claim 11 or 12,
wherein the width of the reconstructing filters is adapted corresponding to the spectral
resolution of the frequency conversion step or the mean bandwidths of preset formants
expected to be present in the supplied audio signal.
14. The method according to any of the preceding claims, comprising the further step of
extracting the enhanced formants.
15. Use of a method according to any of the preceding claims for speech enhancement.
16. Use of a method according to any of claims 1 to 14 together with a tracking algorithm
in order to carry out an automatic speech recognition on the supplied audio signal.
17. Use of a method according to any of claims 1 to 14 to train artificial speech synthesis
systems with the extracted formants.
18. A computer program product, implementing a method according to any of the claims 1
to 14 when run on a computing device.
19. A hearing aid comprising a computing unit designed to implement a method according
to any of claims 1 to 14.