[0001] The present invention relates to an apparatus and a method for automatic classification
of audio signals comprising the combination of features of independent claims 1 and
12, respectively.
[0002] There is a growing amount of both audio and video data / signals available on the
Internet and in a variety of storage media, e.g. CDs or digital video discs. Furthermore,
said audio and video data is provided by a huge number of television stations as an analogue
or digital audio or video signal.
[0003] Currently, there is a demand for the ability to search for segments of interest
/ important events (e.g. certain topics, persons, events or plots) in said audio
and/or video signal.
[0004] In this regard, only self-contained activities (events) having a certain minimum
importance (important events) are accounted for.
[0005] Said self-contained activities / important events might be the different notices
mentioned in a newsmagazine or different pieces of music reproduced in a radio show,
for example. If the programme is a certain football match, for example, said self-contained
activities / important events might be kick-off, penalty kick, throw-in etc.
[0006] In the following, said self-contained activities (events) that are included in a
certain programme and meet a minimum importance are called "important events" or "contents".
[0007] The traditional audio / video tape recorder sample playback mode for browsing and
skimming an analogue audio / video signal is cumbersome and inflexible. The reason
for this problem is that the signal is treated as a linear block of samples. No searching
functionality (except fast forward and fast reverse) is provided.
[0008] To address this problem some modern audio / video tape recorders offer the possibility
to set indexes either manually or automatically each time a recording operation is
started to allow automatic recognition of certain sequences of video signals. A
disadvantage of said indexes is that they are not adapted to individually
identify a certain sequence of audio / video signals.
[0009] On the other hand, digital audio / video discs contain digital data (digitised audio
/ video signals), wherein tracks or chapters are added to the digital data during
the production of the digital disc. Said tracks / chapters normally only allow identification
of separate portions of data / the story line. In particular, said chapters do
not allow identification of certain important events / contents (self-contained activities
/ events having a certain minimum importance) contained in the data. Furthermore,
said tracks / chapters are not neutral since they are provided by the manufacturer
of the digital disc.
[0010] An obvious solution for the problem of handling large amounts of audio / video signals
would be to manually segment the signals of each programme into segments according
to their important events and to provide detailed information with respect to the signal
included in said segments.
[0011] Due to the immense amount of e.g. sequences comprised in the available audio / video
signals, manual segmentation is extremely time-consuming and thus expensive. Therefore,
this approach is not practicable to process a huge amount of audio / video signals.
[0012] To solve the above problem approaches for automatic segmentation of audio / video
signals with respect to important events / contents comprised in the signals have
been recently proposed.
[0013] Possible application areas for such an automatic segmentation of audio / video signals
are digital libraries or the Internet, for example.
[0014] The known approaches for the segmentation process comprise fragmenting, automatic
classification and automatic segmentation of the raw signals.
[0015] "Fragmenting" is performed to partition the raw signals into fragments of a suitable
length for further processing. The fragments each comprise a suitable amount of signal.
Thus, the accuracy of the following classification and segmentation process
depends on the length of said fragments.
[0016] "Classification" stands for a raw discrimination of the signals comprised in the
fragments with respect to the origin of the signals (e.g. speech, music, noise, silence
and gender of speaker). Classification usually is performed by signal analysis techniques
based on audio class classifying models. Thus, classification results in a sequence
of fragments, which are discriminated with respect to the origin of the signals comprised
in the fragments.
[0017] "Segmentation" stands for segmenting the raw signal into individual sequences of
cohesive fragments wherein each sequence contains a content (self-contained activity
of a minimum importance) included in the signals of said sequence. Segmentation can
be performed based on content classifying rules.
[0018] Each content comprises all the fragments, which belong to the respective self-contained
activity comprised in the raw signal (e.g. a goal, a penalty kick of a football match
or different news during a news magazine or different pieces of music of a music sampler).
[0019] A segmentation apparatus 40 for automatic segmentation of audio signals according
to the prior art is shown in Fig. 5.
[0020] The effect of said segmentation apparatus 40 on an audio signal 50 is shown in Fig.
6.
[0021] The segmentation apparatus 40 comprises audio signal input means 42 for supplying
a raw audio signal 50 via an audio signal entry port 41.
[0022] In the present example, said raw audio signal 50 is part of a video signal stored
in a suitable video format in a hard disc 48.
[0023] Alternatively, said raw audio signal might be a real-time signal, e.g. an audio
signal of a conventional television channel.
[0024] The audio signals 50 supplied by the audio signal input means 42 are transmitted
to audio signal fragmenting means 43. The audio signal fragmenting means 43 partitions
the audio signals 50 (and the respective video signals) into audio fragments 51 (and
corresponding video fragments) of a predetermined length.
[0025] The audio fragments 51 generated by the audio signal fragmenting means 43 are further
transmitted to classifying means 44.
[0026] The classifying means 44 discriminates the audio fragments 51 into predetermined audio
classes 52 based on predetermined audio class classifying models by analysing acoustic
characteristics of the audio signal 50 comprised in the audio fragments 51, whereby
each audio class identifies a kind of audio signals included in the respective audio
fragment.
[0027] Each of the audio class classifying models allocates a combination of certain acoustic
characteristics of an audio signal to a certain kind of audio signal.
[0028] Here, the acoustic characteristics for the audio class classifying model identifying
the kind of audio signals "silence" are "low energy level" and "low zero cross rate"
of the audio signal comprised in the respective audio fragment, for example.
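For illustration, the "low energy level" / "low zero cross rate" criterion for the audio class "silence" can be sketched as follows; the function names and thresholds below are illustrative assumptions and are not specified in the present application:

```python
def zero_cross_rate(samples):
    # Fraction of consecutive sample pairs whose signs differ.
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / max(len(samples) - 1, 1)

def energy(samples):
    # Mean squared amplitude of the audio fragment.
    return sum(s * s for s in samples) / max(len(samples), 1)

def is_silence(samples, energy_thresh=1e-4, zcr_thresh=0.05):
    # A fragment is classified as "silence" when both measures are low.
    return (energy(samples) < energy_thresh
            and zero_cross_rate(samples) < zcr_thresh)
```

In practice the thresholds would be tuned on training material; the sketch only shows how the two acoustic characteristics combine into one audio class decision.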
[0029] In the present example an audio class and a corresponding audio class classifying
model are provided for each of "silence" (class 1), "speech" (class 2), "cheering/clapping"
(class 3) and "music" (class 4).
[0030] Said audio class classifying models are stored in the classifying means 44.
[0031] The audio fragments 51, discriminated into audio classes 52 by the classifying means 44, are
supplied to segmenting means 45.
[0032] A plurality of predetermined content classifying rules is stored in the segmenting
means 45. Each content classifying rule allocates a certain sequence of audio classes
of consecutive audio clips to a certain content / important event.
[0033] In the present example a content classifying rule is provided for each of a "free kick"
(content 1), a "goal" (content 2), a "foul" (content 3) and "end of game" (content 4).
[0034] It is evident that the contents comprised in the audio signals are each composed of
a sequence of consecutive audio fragments. This is visualised by the segmented
signal 53 of Fig. 6.
[0035] Since each audio fragment can be discriminated into an audio class, each content /
important event comprised in the audio signals is also composed of a sequence of
corresponding audio classes of consecutive audio fragments.
[0036] Therefore, by comparing a certain sequence of audio classes of consecutive audio
fragments that belongs to the audio signals with the sequences of audio classes of
consecutive audio fragments that belong to the content classifying rules, the segmenting
means 45 detects a rule that matches the respective sequence of audio classes.
[0037] In consequence, the content allocated to said rule is allocated to the respective
sequence of consecutive audio fragments that belongs to the audio signals.
[0038] Thus, based on said content classifying rules the segmenting means 45 segments the
classified audio signals provided by the classifying means 44 into a sequence of
contents 53 (self-contained activities).
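This matching of audio class sequences against content classifying rules may be sketched as follows; the rule table is a hypothetical example in the spirit of contents 1 to 4 above, not the actual rules of the apparatus:

```python
# Hypothetical content classifying rules: each rule allocates a certain
# sequence of audio classes of consecutive fragments to a content.
CONTENT_RULES = {
    ("speech", "silence", "cheering/clapping", "music"): "goal",
    ("speech", "silence", "speech"): "free kick",
}

def segment(class_sequence):
    # Greedily match rule patterns against the classified fragment sequence;
    # returns (content, start_index, end_index) for every detected content.
    contents = []
    i = 0
    while i < len(class_sequence):
        for pattern, content in CONTENT_RULES.items():
            if tuple(class_sequence[i:i + len(pattern)]) == pattern:
                contents.append((content, i, i + len(pattern)))
                i += len(pattern)
                break
        else:
            i += 1  # no rule matches at this fragment; advance by one
    return contents
```

The content allocated to a matched rule is thereby allocated to the respective run of consecutive fragments, as described for the segmenting means 45.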
[0039] In the present example, an output file generation means 46 is used to generate a
video output file containing the audio signals 50, the corresponding video signals
and information regarding the corresponding sequence of contents 53.
[0040] Said output file is stored via a signal output port 47 into a hard disc 48.
[0041] By using a video playback apparatus 49 the video output files stored in the hard
disc 48 can be played back.
[0042] In the present example, the video playback apparatus 49 is a digital video recorder
which is further capable of extracting or selecting individual contents comprised in the
video output file based on the information regarding the sequence of contents 53 comprised
in the video output file.
[0043] Thus, segmentation of audio signals with respect to their contents / important events
is performed by the segmentation apparatus 40 shown in Fig. 5.
[0044] A stochastic signal model frequently used with classification of audio signals /
data is the HIDDEN MARKOV MODEL, which is explained in detail in the essay "A Tutorial
on Hidden Markov Models and Selected Applications in Speech Recognition" by Lawrence
R. RABINER published in the Proceedings of the IEEE, Vol. 77, No. 2, February 1989.
[0045] Different approaches for audio-classification-segmentation with respect to speech,
music, silence and gender are disclosed in the paper "Speech/Music/Silence and Gender
Detection Algorithm" of Hadi HARB, Liming CHEN and Jean-Yves AULOGE published by the
Lab. ICTT Dept. Mathematiques - Informatiques, ECOLE CENTRALE DE LYON. 36, avenue
Guy de Collongue B.P. 163, 69131 ECULLY Cedex, France.
[0046] In general, the above paper is directed to discrimination of an audio channel into
speech/music/silence and noise, which helps improve scene segmentation. Four approaches
for audio class discrimination are proposed: A "model-based approach" where models
for each audio class are created, the models being based on low level features of
the audio data such as cepstrum and MFCC. A "metric-based segmentation approach" uses
distances between neighbouring windows for segmentation. A "rule-based approach" comprises
creation of individual rules for each class wherein the rules are based on high and
low level features. Finally, a "decoder-based approach" uses the hidden Markov model
of a speech recognition system wherein the hidden Markov model is trained to give
the class of an audio signal.
[0047] Furthermore, this paper describes in detail speech, music and silence properties
to allow generation of rules describing each class according to the "rule-based approach"
as well as gender detection to detect the gender of a speech signal.
[0048] Zhu LIU and Yao WANG of the Polytechnic University Brooklyn, USA together with Tsuhan
CHEN of the Carnegie Mellon University, Pittsburgh, USA disclose "Audio Feature Extraction
and Analysis for Scene Segmentation and Classification". This paper describes the
use of associated audio information for video scene analysis of video data to discriminate
five types of TV programs, namely commercials, basketball games, football games, news
reports and weather forecasts.
[0049] According to this paper the audio data is divided into a plurality of clips, each
clip comprising a plurality of frames.
[0050] A set of low level audio features comprising analysis of volume contour, pitch contour
and frequency domain features such as bandwidth is proposed for classification of the
audio data contained in each clip.
[0051] Using a clustering analysis, the linear separability of different classes is examined
to separate the video sequence into the above five types of TV programs.
[0052] Three layers of audio understanding are discriminated in this paper: In a "low-level
acoustic characteristics layer" low level generic features such as loudness, pitch
period and bandwidth of an audio signal are analysed. In an "intermediate-level acoustic
signature layer" the object that produces a particular sound is determined by comparing
the respective acoustic signal with signatures stored in a database. In a "high level
semantic-model" some
α priori known semantic rules about the structure of audio in different scene types (e.g.
only speech in news reports and weather forecasts; speech together with noisy background
in commercials) are used.
[0053] To segment the audio data, sequences of audio classes of consecutive audio clips
are used. Thus, depending on the sequence of audio classes of consecutive audio clips
(e.g. speech-silence-cheering/clapping-music) a suitable number of consecutive audio
clips (e.g. 4) is allocated to a segment comprising one important event (e.g. "goal").
[0054] To further enhance accuracy of the above-described method, it is proposed to combine
the analysis of the audio data of video data with an analysis of the visual information
comprised in the video data (e.g. the respective colour patterns and shape of imaged
objects).
[0055] The patent US 6,185,527 discloses a system and method for indexing an audio stream
for subsequent information retrieval and for skimming, gisting and summarising the
audio stream. The system and method includes use of special audio prefiltering such
that only relevant speech segments that are generated by a speech recognition engine
are indexed. Specific indexing features are disclosed that improve the precision and
recall of an information retrieval system used after indexing for word spotting. The
described method includes rendering the audio stream into intervals, with each interval
including one or more segments. For each segment of an interval it is determined whether
the segment exhibits one or more predetermined audio features such as a particular
range of zero crossing rates, a particular range of energy, and a particular range
of spectral energy concentration. The audio features are heuristically determined
to represent respective audio events, including silence, music, speech, and speech
on music. Also, it is determined whether a group of intervals matches a heuristically
predefined meta pattern such as continuous uninterrupted speech, concluding ideas,
hesitations and emphasis in speech, and so on, and the audio stream is then indexed
based on the interval classification and meta pattern matching, with only relevant
features being indexed to improve subsequent precision of information retrieval. Also,
alternatives for longer terms generated by the speech recognition engine are indexed
along with respective weights, to improve subsequent recall.
[0056] Thus, it is inter alia proposed to automatically provide a summary of an audio stream
or to gain an understanding of the gist of an audio stream.
[0057] Don KIMBER and Lynn WILCOX describe algorithms, which generate indices from automatic
acoustic segmentation, in the essay "Acoustic Segmentation for Audio Browsers". These
algorithms use hidden Markov models to segment audio into segments corresponding to
different speakers or acoustic classes. Types of proposed acoustic classes include
speech, silence, laughter, non-speech sounds and garbage, wherein garbage is defined
as non-speech sound not explicitly modelled by the other class models.
[0058] An implementation of the known methods is proposed by George TZANETAKIS and Perry
COOK in the essay "MARSYAS: A framework for audio analysis" wherein a client-server
architecture is used.
[0059] A summary and definition of acoustic characteristics frequently used to discriminate
audio signals into audio classes is given in the paper "Content Analysis for Audio
Classification and Segmentation" of Lie LU and Hong-Jiang ZHANG which was published
in the IEEE transactions on speech and audio processing, vol. 10, No. 7 of October
2002. The definitions given in this paper apply to the definitions of the acoustic
characteristics used in the present patent application.
[0060] Further approaches regarding classification and segmentation of audio signals are
described in the essays "ROBUST HMM-BASED SPEECH/MUSIC SEGMENTATION" of Jitendra AJMERA,
Iain A. McCOWAN and Hervé BOURLARD, Dalle Molle Institute for Perceptual Artificial
Intelligence, P.O. Box 592, CH-1920 Martigny, Switzerland and "A Robust Audio Classification
and Segmentation Method" of Lie LU, Hao JIANG and HongJiang ZHANG, Microsoft research,
China.
[0061] It is a disadvantage with the above described classification apparatus and methods
that music containing singing voice frequently is misclassified as speech, particularly
when the singing voice is the dominant signal component.
Music is a very general term that covers a huge variety of audio signals such as
different instrumental sounds, singing voice with instrumental sound and also pure
singing voice, although in real-life applications pure singing voice is not common.
A robust speech/music classification should be able to distinguish speech from music
regardless of the type of music, i.e. pure instrumental sound, singing voice etc.
[0063] However, as singing voice is more similar to speech than to music in general,
state-of-the-art speech/music classification systems usually fail to classify an audio
signal containing music correctly when there is only a singing voice, or a dominant
singing voice, in the signal.
[0064] Although it is possible to add singing voice to the training material to train the
audio class classifying model for the audio class "music", due to the presence of
many other music signals, such as instrumental music, orchestra, pop etc., such an
approach usually does not improve the classification performance. This is a crucial
drawback of the prior art since music comprising singing voice is among the most common
music available in real-life applications.
[0065] It is the object of the present invention to overcome the above-cited disadvantage
and to provide an apparatus and a method for automatic classification of audio signals
that provides an enhanced accuracy when classifying an audio signal comprising a singing
voice.
[0066] The above object is solved by an apparatus for automatic classification of audio
signals comprising the combination of features of independent claim 1.
[0067] Furthermore, the above object is solved by a method for automatic classification
of audio signals comprising the combination of features of independent claim 12.
[0068] Further developments are set forth in the respective dependent claims.
[0069] According to a preferred embodiment of the present invention an apparatus for automatic
classification of audio signals comprises signal input means for supplying audio signals,
audio signal fragmenting means for partitioning audio signals supplied by the signal
input means into audio fragments of a predetermined length, feature extracting means
for analysing acoustic characteristics of the audio signals comprised in the audio
fragments and classifying means for discriminating the audio fragments provided by
the audio signal fragmenting means into a predetermined audio class based on predetermined
audio class classifying models by using acoustic characteristics of the audio signals
comprised in the audio fragments, wherein a predetermined audio class classifying
model is provided for each audio class and each audio class represents a respective
kind of audio signals comprised in the corresponding audio fragment, wherein an individual
predetermined audio class classifying model is provided for at least each audio class
"speech", "music" and "singing voice".
[0070] Since an individual predetermined audio class classifying model is provided for at
least each audio class "speech", "music" and "singing voice", an audio class classifying
model specialised in singing voice included in the raw audio signal is provided. Thus,
a singing voice can be identified in a raw audio signal with high accuracy.
[0071] Advantageously, the inventive apparatus for automatic classification of audio signals
further comprises a classifier database comprising the predetermined audio class classifying
models, wherein the classifying means discriminates the audio fragments provided by
the audio signal fragmenting means into predetermined audio classes based on the audio
class classifying models stored in the classifier database.
[0072] By the provision of a classifier database comprising audio class classifying models,
audio class classifying models that are specialised (trained) for a certain kind of
audio signal might be used. The usage of specialised audio class classifying models
significantly enhances accuracy of the classification of the audio signals.
[0073] Favourably, the classifying means further allocates audio fragments discriminated
into the audio class "singing voice" to the audio class "music".
[0074] Thus, once an audio fragment has been discriminated into the audio class "singing
voice" said fragment additionally or finally is allocated into the audio class "music".
[0075] Alternatively, as a "singing voice" comprised in an audio signal is very similar
to a "speech" audio signal, the accuracy of the inventive apparatus is significantly
enhanced by further discriminating audio fragments allocated to the audio class "speech"
into the audio classes "speech" and "singing voice".
[0076] Furthermore, it is profitable that the acoustic characteristics analysed in the audio
signals comprised in the audio fragments by the feature extracting means include volume
standard deviation and/or volume dynamic range and/or high zero crossing rate ratio
and/or low short-term energy ratio and/or spectral flux and/or zero crossing rate
and/or energy/loudness and/or sub-band energy rate and/or mel-cepstral frequency components
and/or frequency centroid and/or bandwidth and/or line spectrum frequencies and/or
roll-off.
[0077] It is preferred that the audio class classifying models are provided as hidden Markov
models and/or Neural Networks and/or Gaussian Mixture Models and/or decision trees.
[0078] Advantageously, the audio class model for the audio class "singing voice" is trained
by a training audio signal comprising pure singing voice, only.
[0079] Thus, a suitable audio class model for the audio class "singing voice" can be achieved
in a very easy and reliable way.
[0080] According to a preferred embodiment, the apparatus for automatic classification of
audio signals further comprises segmentation means for segmenting classified audio
signals into individual audio windows consisting of sequences of cohesive audio fragments
based on predetermined content classifying rules by analysing a sequence of audio
classes of cohesive audio fragments provided by the classifying means, wherein each
sequence of cohesive audio fragments segmented by the segmentation means corresponds
to an individual content included in the audio signal.
[0081] According to this embodiment, it is further profitable if the segmentation means
allocates a predefined number of audio fragments to an audio window, determines the
number of audio fragments of each audio class comprised in the audio window and allocates
the majority audio class to the respective audio window.
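The majority-class allocation described above can be sketched as follows (the window size and class labels are illustrative assumptions):

```python
from collections import Counter

def window_majority_class(fragment_classes, window_size):
    # Group consecutive fragments into windows of window_size fragments and
    # label each window with the audio class occurring most often inside it.
    windows = []
    for start in range(0, len(fragment_classes), window_size):
        window = fragment_classes[start:start + window_size]
        majority = Counter(window).most_common(1)[0][0]
        windows.append(majority)
    return windows
```

Because each window simply takes the most frequent audio class of its fragments, complicated content classifying rules can be avoided, as stated above.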
[0082] Thus, the allocation of audio classes in the audio window is used to segment the
audio signal. Complicated content classifying rules can be avoided.
[0083] It is beneficial if each audio fragment generated by the audio signal fragmenting
means corresponds to a frame consisting of a predefined number N of signal samples.
[0084] Furthermore, it is preferred that the inventive apparatus for automatic classification
of audio signals further comprises signal output means for generating an output file,
wherein the output file contains the raw audio signal supplied to the signal input
means and an information signal comprising information regarding to the audio classes
and / or the audio windows and / or contents included in the raw signal.
[0085] Provision of such an information signal allows a distinct identification of the audio
classes and audio windows extracted from the raw audio signals. Search engines and
signal playback means can handle such an output file with ease. Therefore, a search
for an audio window of a certain content comprised in the output file can be performed
with ease.
[0086] Furthermore, the above object is solved by a method for automatic classification
of audio signals comprising the following steps:
- partitioning audio signals into audio fragments of a predetermined length;
- analysing acoustic characteristics of the audio signals comprised in the audio fragments;
and
- discriminating the audio fragments into a predetermined audio class based on predetermined
audio class classifying models by using acoustic characteristics of the audio signals
comprised in the audio fragments, wherein a predetermined audio class classifying
model is provided for each audio class and each audio class represents a respective
kind of audio signals comprised in the corresponding audio fragment;
wherein the step of discriminating the audio fragments into a predetermined audio
class is performed by using an individual predetermined audio class classifying model
for at least each audio class "speech", "music" and "singing voice".
[0087] Preferably, the method further comprises the step of providing a classifier database
comprising the predetermined audio class classifying models, wherein the step of discriminating
the audio fragments into a predetermined audio class is performed by using the audio
class classifying models stored in the classifier database.
[0088] Favourably, the method further comprises the step of allocating the audio fragments
discriminated into the audio class "singing voice" to the audio class "music".
[0089] Alternatively, it is beneficial if the method further comprises the step of discriminating
the audio fragments allocated to the audio class "speech" into the audio classes "speech"
and "singing voice".
[0090] Moreover, it is preferred that the step of analysing acoustic characteristics in
the audio signals comprised in the audio fragments includes analysis of volume standard
deviation and/or volume dynamic range and/or high zero crossing rate ratio and/or
low short-term energy ratio and/or spectral flux and/or zero crossing rate and/or
energy/loudness and/or sub-band energy rate and/or mel-cepstral frequency components
and/or frequency centroid and/or bandwidth and/or line spectrum frequencies and/or
roll-off.
[0091] Favourably, the audio class classifying models are provided as hidden Markov models
and/or Neural Networks and/or Gaussian Mixture Models and/or decision trees.
[0092] Moreover, it is beneficial if the method further comprises the step of training the
audio class model for the audio class "singing voice" by a training audio signal comprising
pure singing voice, only.
[0093] According to a preferred embodiment of the present invention the method further comprises
the steps of analysing a sequence of audio classes of cohesive audio fragments and
segmenting classified audio signals into individual audio windows consisting of sequences
of cohesive audio fragments based on predetermined content classifying rules by using
the analyses of said sequence of audio classes of cohesive audio fragments, wherein
each sequence of cohesive audio fragments corresponds to an individual content included
in the audio signal.
[0094] It is further preferred that the method further comprises the steps of allocating
a predefined number of audio fragments to an audio window, determining the number
of audio fragments of each audio class comprised in the audio window and allocating
the majority audio class to the respective audio window.
[0095] Furthermore, the above object is solved by a software product comprising a series
of state elements that are adapted to be processed by a data processing means of a
terminal such that a method according to one of the claims 12 to 20 may be executed
thereon.
[0096] In the following detailed description, the present invention is explained by reference
to the accompanying drawings, in which like reference characters refer to like parts
throughout the views, wherein:
- Fig. 1A
- shows a block diagram of an apparatus for automatic classification of audio signals
according to a preferred embodiment of the present invention;
- Fig. 1B
- schematically shows the effect the inventive apparatus for automatic classification
of audio signals has on audio signals;
- Fig. 2
- shows a flow diagram of a preferred embodiment of the inventive method for automatic
classification of audio signals;
- Fig. 3
- shows a flow diagram of an alternative embodiment of the inventive method for automatic
classification of audio signals;
- Fig. 4
- shows a flow diagram of a further embodiment of the inventive method for automatic
classification of audio signals;
- Fig. 5
- shows a block diagram of a segmentation apparatus according to the prior art; and
- Fig. 6
- schematically shows the effect the segmentation apparatus according to the prior art
has on audio signals.
[0097] Fig. 1A shows a block diagram of an apparatus for automatic classification of audio
signals according to a preferred embodiment of the present invention.
[0098] Fig. 1B schematically shows the effect the inventive apparatus for automatic classification
of audio signals has on audio signals.
[0099] In the present embodiment, a raw audio signal 2 is supplied via an input port 12
to signal input means 3 of the inventive apparatus 1 for automatic classification
of audio signals.
[0100] In the present example, the raw audio signal 2 provided to the signal input means
3 is a digital video data file which is stored on a suitable recording medium 48 (e.g.
a hard disc or a digital video disc).
[0101] The digital video data file is composed of at least an audio signal, a picture
signal and an information signal.
[0102] Alternatively, the raw signals 2 provided to the signal input means 3 might be real-time
video signals of a conventional television channel or audio signals of a radio
broadcasting station.
[0103] According to this preferred embodiment, the inventive apparatus 1 for automatic classification
of audio signals is included into a digital video recorder, which is not shown in
the figures.
[0104] Alternatively, the apparatus for automatic classification of audio signals might
be included in a different digital audio / video apparatus, such as a personal computer
or workstation, or might even be provided as separate equipment (e.g. a set-top box).
[0105] The signal input means 3 converts the raw signals 2 into a suitable format.
[0106] Audio signals comprised in the raw signal 2 provided to signal input means 3 via
the input port 12 are read out by the signal input means 3 and transmitted to audio
signal fragmenting means 4.
[0107] The audio signal fragmenting means 4 partitions said audio signals 2 into audio fragments
41, 42, 43, ..., 4N of a predetermined length.
[0108] Said audio fragments 41, 42, 43, ..., 4N preferably each constitute the smallest
unit of audio signal analysis.
[0109] In the present embodiment, one audio fragment comprises one frame of audio (video)
signals and is about 10 milliseconds in length.
[0110] It is obvious for a skilled person that the audio fragments alternatively might comprise
more than one frame of audio (video) signals.
[0111] Alternatively, one frame might comprise more or less than 10 milliseconds of audio
signals (preferably between 4 and 20 milliseconds of audio signals, e.g. 6, 8, 12
or 14 milliseconds of audio signals).
[0112] According to an alternative embodiment more than one frame is comprised in an audio
fragment. In this case it is evident for a person skilled in the art that the audio signals
comprised in each audio fragment might be further divided into a plurality of frames
of e.g. 512 samples. In this case it is advantageous if consecutive frames are shifted
by 180 samples with respect to the respective antecedent frame. This subdivision allows
a precise and easy processing of the audio signals comprised in each audio fragment.
[0113] It is important to emphasise that the audio signal fragmenting means 4 does not necessarily
subdivide the audio signals 2 into audio fragments 41, 42, 43, ..., 4N in a literal
sense. In the present embodiment, the audio signal fragmenting means 4 merely defines
fragments comprising a suitable amount of audio signal within the audio signals.
[0114] In the present example, the audio signal fragmenting means 4 generates a meta data
file defining audio fragments 41, 42, 43, ..., 4N in the audio signal 2 while the
audio signal itself remains unaltered.
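Defining fragments in a meta data file while leaving the audio signal itself untouched might be sketched like this; the concrete meta data layout and field names are assumptions for illustration, not taken from the document:

```python
def build_fragment_metadata(total_ms, fragment_ms=10):
    """Describe consecutive fragments of a signal by start/end times
    only; the audio signal itself is not modified (hypothetical
    meta data layout)."""
    fragments = []
    index = 1
    for start in range(0, total_ms, fragment_ms):
        fragments.append({
            "index": index,
            "start_ms": start,
            "end_ms": min(start + fragment_ms, total_ms),
        })
        index += 1
    return fragments
```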
[0115] The audio fragments defined by the audio signal fragmenting means 4 are transmitted
to feature extracting means 5.
[0116] The feature extracting means 5 analyses acoustic characteristics of audio signals
comprised in the audio fragments 41, 42, 43,..., 4N.
[0117] In the present embodiment, the volume standard deviation, volume dynamic range,
high zero crossing rate ratio, low short-term energy ratio, spectral flux, zero crossing
rate, energy/loudness, sub-band energy rate, mel-cepstral frequency components, frequency
centroid, bandwidth, line spectrum frequencies and roll-off of the signals comprised
in the audio fragments 41, 42, 43, ..., 4N are analysed by the feature extracting means 5.
[0118] It is obvious for a skilled person that it might be sufficient to analyse a subset
of the above acoustic characteristics.
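Two of the listed acoustic characteristics, zero crossing rate and short-term energy, are simple enough to sketch directly. This Python fragment uses common textbook definitions; the document itself prescribes no particular formulas:

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

def short_term_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)
```

Characteristics of this kind, computed per fragment, form the feature vectors passed on to the classifying means.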
[0119] The acoustic characteristics of audio signals comprised in the audio fragments 41,
42, 43, ..., 4N are output to classifying means 6 by the feature extracting means
5.
[0120] The classifying means 6 automatically discriminates the audio fragments 41, 42, 43
provided by the audio signal fragmenting means 4 into a predetermined audio class
61, 62, 63 by using the acoustic characteristics of the audio signals comprised in
the audio fragments 41, 42, 43 analysed by the feature extracting means 5. Each audio
class 61, 62, 63 represents a respective kind of audio signals comprised in the corresponding
audio fragment 41, 42, 43.
[0121] Discrimination is performed by the classifying means 6 based on predetermined audio
class classifying models 71, 72, 73 which are stored in a classifier database 7.
[0122] A predetermined audio class classifying model 71, 72, 73 is provided in the classifier
database 7 for each audio class 61, 62, 63. According to the present invention, an
individual predetermined audio class classifying model 71, 72, 73 is provided for
at least each audio class 61, 62, 63 "speech", "music" and "singing voice". The audio
class 63 "singing voice" alternatively might be referred to as "a cappella music".
[0123] Since an individual predetermined audio class classifying model 71, 72, 73 is provided
for at least each audio class 61, 62, 63 "speech", "music" and "singing voice", an
audio class classifying model 73 specialised in singing voice included in the raw
audio signal is provided. Thus, a singing voice can be identified in a raw audio signal
2 with high accuracy.
[0124] Furthermore, by the provision of a classifier database 7 comprising audio class classifying
models 71, 72, 73, audio class classifying models 71, 72, 73 that are specialised
(trained) for a certain kind of audio signal 2 might be used. The usage of specialised
audio class classifying models 71, 72, 73 significantly enhances accuracy of the classification
of the audio signals 2.
[0125] In the present example, the predetermined audio class classifying models 71, 72,
73 are stored in the classifier database 7 as Gaussian Mixture Models (GMM).
[0126] Alternatively, the audio class classifying models might even be provided e.g. as
Neural Networks and/or hidden Markov models and/or decision trees.
[0127] To achieve a suitable audio class model for the audio class "singing voice" in a
very easy and reliable way, the audio class model 73 for the audio class 63 "singing
voice" is trained by a training audio signal comprising pure singing voice, only.
[0128] In this respect, training is performed by analysing a plurality of raw signals consisting
of "singing voice" only, and varying the parameters of the audio class model 73 for the
audio class "singing voice" until a satisfactory accuracy for the correct identification
of "singing voice" in the raw signal is achieved by the audio class model 73.
[0129] By using self-learning models the variation of the parameters can be automated.
[0130] The training signal might be provided by a large database (not shown in the Figures).
[0131] In case the audio class models are Gaussian Mixture Models (GMM), a linear combination
of Gaussian probability density functions (pdfs) is used to model the pdf of the
signals belonging to a given audio class. Component Gaussians can have full or
diagonal covariance matrices. GMM parameters such as the individual Gaussians and their
weight factors are tuned to suit the training signals. A GMM can approximate
any continuous pdf well. The dimensions of the component Gaussians depend on the parametrisation
of the underlying acoustic signal. It can be, for example, a set of Mel Frequency Cepstral
Coefficients (MFCC) and their derivatives computed over signal frames and windows
(sequences of frames), as well as other spectral and time characteristics
such as spectral centroid, spectral flux, zero crossing rate etc.
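For the diagonal-covariance case, evaluating such a mixture and classifying by the best-scoring model can be sketched as follows. This is an assumed minimal Python implementation; the parameter layout and function names are hypothetical:

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance
    Gaussian mixture: one weight per component, and per component a
    list of means and variances, one entry per feature dimension."""
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        log_pdf = 0.0
        for xi, mi, vi in zip(x, mu, var):
            log_pdf += -0.5 * (math.log(2 * math.pi * vi)
                               + (xi - mi) ** 2 / vi)
        total += w * math.exp(log_pdf)
    return math.log(total)

def classify(x, models):
    """Pick the audio class whose GMM gives the highest likelihood."""
    return max(models, key=lambda name: gmm_log_likelihood(x, *models[name]))
```

In this sketch the classifier database corresponds to the `models` dictionary, mapping each audio class name to its trained mixture parameters.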
[0132] In the present embodiment, said classifier database 7 is a conventional hard disc.
Alternatively, e.g. an EEPROM or a FLASH-memory might be used.
[0133] It is obvious for a skilled person that the discrimination of the audio fragments
41, 42, 43 is not necessarily performed in a literal sense, but might be performed
e.g. by automatically generating a meta file (information signal) dedicated to the
(raw) audio signal 2, the meta file comprising e.g. pointers to identify the audio
fragments 41, 42, 43 and the corresponding audio classes 61, 62, 63 in the audio signal
2.
[0134] In the present embodiment, said pointers contained in the meta file identify both
the location and the audio class 61, 62, 63 of the fragments 41, 42, 43 comprised
in the audio signals 2.
[0135] Favourably, the classifying means 6 allocates audio fragments 41, 42, 43 discriminated
into the audio class 63 "singing voice" to the audio class 62 "music".
[0136] Thus, once an audio fragment 41, 42, 43 has been discriminated into the audio class
63 "singing voice" it is additionally allocated / re-categorised into the audio class
62 "music" as it is shown in Fig.2.
[0137] Alternatively, as it is shown in Fig. 3, the audio fragments allocated to the audio
class 61 "speech" are further discriminated into the audio classes 61 "speech" and
63 "singing voice" to increase the accuracy of the inventive apparatus 1. The reason
is that a "singing voice" comprised in an audio signal 2 is very similar to the audio
signal "speech". Thus, a signal that seems to contain "speech" has to be further examined
to detect whether the presumed "speech" is real "speech" or indeed a "singing voice".
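This re-examination of presumed "speech" fragments amounts to a second classification stage. In the Python sketch below the dedicated speech-versus-singing discriminator is passed in as a callable, since its internals are not fixed here; all names are hypothetical:

```python
def refine_speech_labels(labels, features, speech_vs_singing):
    """Re-examine fragments initially labelled "speech".

    `speech_vs_singing` is a hypothetical callable that inspects a
    fragment's features and returns "speech" or "singing voice";
    fragments of other classes are passed through unchanged.
    """
    refined = []
    for label, feature in zip(labels, features):
        if label == "speech":
            refined.append(speech_vs_singing(feature))
        else:
            refined.append(label)
    return refined
```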
[0138] After discrimination into audio classes 61, 62, 63 by the classifying means 6, the
classified audio fragments 9 are transmitted to a segmentation means 8.
[0139] Said segmentation means 8 segments the classified audio signals 9 into individual
audio windows 81, 82, 83 consisting of sequences of cohesive audio fragments 41, 42,
43 based on predetermined content classifying rules by analysing a sequence of audio
classes 61, 62, 63 of cohesive audio fragments 41, 42, 43 provided by the classifying
means 6. Each sequence of cohesive audio fragments 41, 42, 43 segmented by the segmentation
means 8 corresponds to an individual content included in the audio signal 2.
[0140] Contents are self-contained activities comprised in the audio signals of a certain
programme that meet a certain minimum importance.
[0141] The length of time of the contents comprised in the audio signals of a programme
usually differs. Thus, each content comprises a certain individual number of cohesive
audio fragments 41, 42, 43.
[0142] If the programme is news, for example, the contents are the different notices mentioned
in the news. If the programme is football, for example, said contents are kick-off,
penalty kick, throw-in, goal, etc. If the programme is a music sampler, said contents
are the individual pieces of music, for example.
[0143] As said before, the contents comprised in the audio signal are each composed of a
sequence of consecutive audio fragments 41, 42, 43. Since each audio fragment 41, 42,
43 is discriminated into an audio class 61, 62, 63, each content is also composed of a
sequence of the corresponding audio classes 61, 62, 63 of consecutive audio fragments
41, 42, 43.
[0144] Therefore, by comparing the sequences of audio classes 61, 62, 63 of consecutive
audio fragments 41, 42, 43 which belong to the contents of the respective audio signal
with the sequences of audio classes 61, 62, 63 of consecutive audio fragments 41,
42, 43 which belong to the content classifying rules, it is possible to find content
classifying rules which are adapted to identify the respective content.
[0145] The function of the content classifying rules will become more apparent by the following
example:
[0146] The sequence of audio classes of cohesive audio fragments 41, 42, 43 for the content
classifying rule identifying the content "goal" might be "speech" 61, "singing voice"
63 and "music" 62.
[0147] Thus, in case the sequence of audio classes of cohesive audio fragments 41, 42, 43
"speech" 61, "singing voice" 63 and "music" 62 is to be segmented by the segmentation
means 8, the content 1 "goal" will be allocated to said sequence of audio fragments
41, 42, 43 (window 81 of Fig. 1B).
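The "goal" example can be expressed as a lookup from class sequences to contents. In this minimal Python sketch the single rule is taken from the example above; representing the rule set as a dictionary is an assumption for illustration:

```python
CONTENT_RULES = {
    # hypothetical rule table; this entry follows the "goal" example
    ("speech", "singing voice", "music"): "goal",
}

def match_content(class_sequence, rules=CONTENT_RULES):
    """Return the content whose rule matches the sequence of audio
    classes of cohesive fragments, or None if no rule applies."""
    return rules.get(tuple(class_sequence))
```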
[0148] According to an alternative embodiment, the segmentation means 8 allocates a predefined
number of audio fragments 41, 42, 43, ..., 4N to an audio window 81, 82, 83, ...,
8N, determines the number of audio fragments 41, 42, 43, ..., 4N of each audio class
61, 62, 63 comprised in the audio window and allocates the majority audio class to
the respective audio window.
[0149] Thus, the allocation of audio classes 61, 62, 63 in the audio window 81, 82, 83 is
used to segment the audio signal. In consequence, complicated content classifying
rules can be avoided.
[0150] This will become more apparent by the following example:
[0151] In case an audio window 8N comprises a sequence of audio fragments 4N-3, 4N-2, 4N-1,
4N of the audio classes 62, 63, 61, 61 the audio class 61 automatically will be determined
as being the majority audio class by the segmentation means 8. Thus, the audio class
61 automatically will be allocated by the segmentation means 8 to the respective audio
window 8N as content k (see Fig. 1B).
[0152] In case no majority audio class can be identified due to a tie-situation, additional
rules may be provided to rank preferred audio classes.
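The majority allocation of this alternative embodiment, including a tie-break ranking of preferred audio classes, can be sketched as follows; the concrete preference order used for ties is a hypothetical example:

```python
from collections import Counter

def window_majority_class(window_classes,
                          tie_rank=("speech", "music", "singing voice")):
    """Allocate to the window the audio class occurring most often
    among its fragments; ties are resolved by the (hypothetical)
    preference order `tie_rank`."""
    counts = Counter(window_classes)
    best = max(counts.values())
    tied = [c for c, n in counts.items() if n == best]
    return min(tied, key=tie_rank.index)
```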
[0153] Furthermore, the inventive apparatus 1 for automatic classification of audio signals
2 comprises signal output means 10.
[0154] Said signal output means 10 automatically generates an output file 11 containing
the raw signal 2 supplied by the signal input means 3 and an information signal (meta
file) comprising information regarding the fragments, audio classes, windows and
contents included in the raw signal 2.
[0155] Search engines and signal playback means can handle a correspondingly processed signal
11 with ease. Therefore, a search for the audio classes and contents comprised in
the output file 11 is facilitated.
[0156] The output file 11 is output by the signal output means 10 via an output port 13.
[0157] The signal output via said output port 13 might be stored into a suitable recording
medium 48 which might be a conventional hard disc or optical disc, for example.
[0158] In the following, preferred methods for automatic classification of audio signals
are explained in detail by reference to Figs. 2, 3 and 4:
[0159] In a first step S1 (not shown in the Figures) audio signals 2 automatically are partitioned
into audio fragments 41, 42, 43 of a predetermined length. In the present preferred
embodiment, each of the fragments 41, 42, 43 has a length of one frame.
[0160] In the following step S2 acoustic characteristics of the audio signals comprised
in the audio fragments 41, 42, 43 are analysed.
[0161] Said acoustic characteristics include volume standard deviation and/or volume dynamic
range and/or high zero crossing rate ratio and/or low short-term energy ratio and/or
spectral flux and/or zero crossing rate and/or energy/loudness and/or sub-band energy
rate and/or mel-cepstral frequency components and/or frequency centroid and/or bandwidth
and/or line spectrum frequencies and/or roll-off.
[0162] A classifier database 7 containing predetermined audio class classifying models 71,
72, 73 is provided in the following method step S4. A predetermined audio class classifying
model 71, 72, 73 is provided for each audio class 61, 62, 63 and each audio class
61, 62, 63 represents a respective kind of audio signals comprised in the corresponding
audio fragment 41, 42, 43 of the raw audio signal 2.
[0163] According to the present invention, an individual predetermined audio class classifying
model 71, 72, 73 for at least each audio class 61, 62, 63 "speech", "music" and "singing
voice" is provided.
[0164] In the present embodiment, the audio class classifying models 71, 72, 73 are provided
as hidden Markov models. Alternatively, Neural Networks or Gaussian Mixture Models
or decision trees might be used.
[0165] To achieve a suitable audio class model 73 for the audio class 63 "singing voice",
in a preparatory step S7 the audio class model 73 for the audio class 63 "singing
voice" has been trained by a training audio signal consisting of pure singing voice,
only.
[0166] By using acoustic characteristics of the audio signals comprised in the audio fragments
41, 42, 43 analysed in the preceding method step S2 the audio fragments 41, 42, 43
are discriminated into predetermined audio classes 61, 62, 63 based on predetermined
audio class classifying models 71, 72, 73 stored in the classifier database 7 that
has been provided in method step S4.
[0167] In case an audio fragment 41, 42, 43 is discriminated in step S3 into the audio class
63 "singing voice", said audio fragment 41, 42, 43 is further allocated to the audio
class 62 "music" in the following step S5 as it is shown in Fig. 2.
[0168] Alternatively, in case an audio fragment 41, 42, 43 is discriminated in step S3 into
the audio class 61 "speech", said audio fragment 41, 42, 43 is further discriminated
into the audio classes 61 "speech" and 63 "singing voice" in the following step S6
as it is shown in Fig. 3.
[0169] A sequence of audio classes 61, 62, 63 of cohesive audio fragments 41, 42, 43 is
analysed in the following method step S8.
[0170] In the following step S9, classified audio signals 9 provided by method steps S3,
S5 and S6 are segmented into individual audio windows 81, 82, 83 consisting of sequences
of cohesive audio fragments 41, 42, 43 based on predetermined content classifying
rules. Said segmentation is performed by using the sequence of audio classes 61, 62,
63 of cohesive audio fragments 41, 42, 43 analysed in method step S8. Each sequence
of cohesive audio fragments 41, 42, 43 corresponds to an individual content included
in the audio signal.
[0171] According to a preferred embodiment said segmentation might be performed as follows:
[0172] In a first sub-step S10, a predefined number of audio fragments 41, 42, 43 is allocated
to an audio window 81, 82, 83.
[0173] In a second sub-step S11 the number of audio fragments 41, 42, 43 of each audio class
61, 62, 63 comprised in the audio window 81, 82, 83 is determined.
[0174] In a third sub-step S12 the majority audio class 61, 62, 63 is allocated to the respective
audio window 81, 82, 83.
[0175] Thus, a segmented audio signal is provided (see Fig. 4).
[0176] According to a further alternative embodiment of the inventive method (not shown
in the figures), a Viterbi algorithm is used for step S9 of segmenting classified audio
signals.
[0177] In this case, the steps S3, S5 and S6 of discriminating audio fragments and the steps
S8 and S9 of analysing a sequence of audio classes and segmenting classified audio
signals are combined into a joint optimisation of the best state sequence of audio
fragments that explains the observation with the highest possible likelihood, given
the model.
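This joint optimisation can be sketched as follows. The Python illustration assumes per-fragment log-likelihoods (e.g. from the audio class models) and replaces full transition probabilities by a single fixed switching penalty, which is a simplifying assumption:

```python
def viterbi_segment(loglik, classes, switch_penalty=2.0):
    """Most likely audio class per fragment, jointly over the sequence.

    `loglik[t][c]` is the log-likelihood of fragment t under class c;
    `switch_penalty` (an assumed stand-in for log transition
    probabilities) discourages spurious class changes, so the best
    state sequence directly yields a smooth segmentation.
    """
    score = {c: loglik[0][c] for c in classes}
    back = []
    for frame in loglik[1:]:
        new_score, pointers = {}, {}
        for c in classes:
            # best predecessor state, accounting for the switch penalty
            prev = max(
                classes,
                key=lambda p: score[p] - (0.0 if p == c else switch_penalty),
            )
            new_score[c] = (score[prev]
                            - (0.0 if prev == c else switch_penalty)
                            + frame[c])
            pointers[c] = prev
        score = new_score
        back.append(pointers)
    # backtrace from the best final state
    path = [max(score, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))
```

With a sufficiently large penalty, an isolated fragment whose likelihood mildly favours another class is absorbed into the surrounding segment instead of producing a one-fragment segment of its own.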
[0178] To enhance clarity of Fig. 1, supplementary means as power supply, buffer memories
etc. are not shown.
[0179] In the embodiment of the present invention shown in Fig. 1, separate microprocessors
are used for the signal input means 3, the audio signal fragmenting means 4, the feature
extracting means 5, the classifying means 6, the segmentation means 8 and the signal
output means 10.
[0180] Alternatively, one single microcomputer might be used to incorporate the signal input
means, the audio signal fragmenting means, the feature extracting means, the classifying
means, the segmentation means and the signal output means.
[0181] Further alternatively, the signal input means and the signal output means might be
incorporated in one common microcomputer and the audio signal fragmenting means, the
feature extracting means, the classifying means and the segmentation means might be
incorporated in another common microcomputer.
[0182] Preferably, the inventive apparatus for automatic classification of audio signals
might be integrated into a digital audio / video recorder or set-top box or realised
by use of a conventional personal computer or workstation.
[0183] According to a further embodiment of the present invention (which is not shown in
the figures), the above object is solved by a software product comprising a series
of state elements that are adapted to be processed by a data processing means of a
terminal such that a method according to one of the claims 12 to 20 may be executed
thereon.
[0184] Said terminal might be a personal computer or video recording/reproducing apparatus,
for example.
[0185] In summary, the inventive apparatus and method for automatic classification of audio
signals uses an easy and reliable way for classification of audio signals comprising
a singing voice.
[0186] Since an individual predetermined audio class classifying model is provided for at
least each audio class "speech", "music" and "singing voice", an audio class classifying
model specialised in singing voice included in the raw audio signal is provided. Thus,
singing voice can be identified in a raw audio signal with high accuracy.
1. Apparatus (1) for automatic classification of audio signals comprising:
- signal input means (3) for supplying audio signals (2);
- audio signal fragmenting means (4) for partitioning audio signals (2) supplied by
the signal input means (3) into audio fragments (41, 42, 43) of a predetermined length;
- feature extracting means (5) for analysing acoustic characteristics of the audio
signals comprised in the audio fragments (41, 42, 43); and
- classifying means (6) for discriminating the audio fragments (41, 42, 43) provided
by the audio signal fragmenting means (4) into a predetermined audio class (61, 62,
63) based on predetermined audio class classifying models (71, 72, 73) by using acoustic
characteristics of the audio signals comprised in the audio fragments (41, 42, 43),
wherein a predetermined audio class classifying model (71, 72, 73) is provided for
each audio class (61, 62, 63) and each audio class (61, 62, 63) represents a respective
kind of audio signals comprised in the corresponding audio fragment (41, 42, 43);
characterised in that
an individual predetermined audio class classifying model (71, 72, 73) is provided
for at least each audio class (61, 62, 63) "speech", "music" and "singing voice".
2. Apparatus for automatic classification of audio signals according to claim 1,
characterised in that
the apparatus (1) for automatic classification of audio signals further comprises:
- a classifier database (7) comprising the predetermined audio class classifying models
(71, 72, 73);
wherein the classifying means (6) discriminates the audio fragments (41, 42, 43)
provided by the audio signal fragmenting means (4) into predetermined audio classes
(61, 62, 63) based on the audio class classifying models (71, 72, 73) stored in the
classifier database (7).
3. Apparatus for automatic classification of audio signals according to claim 1 or 2,
characterised in that
the classifying means (6) further allocates audio fragments (41, 42, 43) discriminated
into the audio class (63) "singing voice" to the audio class (62) "music".
4. Apparatus for automatic classification of audio signals according to claim 1 or 2,
characterised in that
the classifying means (6) further discriminates audio fragments (41, 42, 43) allocated
to the audio class (61) "speech" into the audio classes (61, 63) "speech" and "singing
voice".
5. Apparatus for automatic classification of audio signals according to one of the preceding
claims,
characterised in that
the acoustic characteristics analysed in the audio signals comprised in the audio
fragments (41, 42, 43) by the feature extracting means (5) include volume standard
deviation and/or volume dynamic range and/or high zero crossing rate ratio and/or
low short-term energy ratio and/or spectral flux and/or zero crossing rate and/or
energy/loudness and/or sub-band energy rate and/or mel-cepstral frequency components
and/or frequency centroid and/or bandwidth and/or line spectrum frequencies and/or
roll-off.
6. Apparatus for automatic classification of audio signals according to one of the preceding
claims,
characterised in that
the audio class classifying models (71, 72, 73) are provided as hidden Markov models
and/or Neural Networks and/or Gaussian Mixture Models and/or decision trees.
7. Apparatus for automatic classification of audio signals according to one of the preceding
claims,
characterised in that
the audio class model (73) for the audio class (63) "singing voice" is trained by
a training audio signal comprising pure singing voice, only.
8. Apparatus for automatic classification of audio signals according to one of the preceding
claims,
characterised in that the apparatus (1) for automatic classification of audio signals further comprises:
- segmentation means (8) for segmenting classified audio signals (9) into individual
audio windows (81, 82, 83) consisting of sequences of cohesive audio fragments (41,
42, 43) based on predetermined content classifying rules by analysing a sequence of
audio classes (61, 62, 63) of cohesive audio fragments (41, 42, 43) provided by the
classifying means (6), wherein each sequence of cohesive audio fragments (41, 42,
43) segmented by the segmentation means (8) corresponds to an individual content included
in the audio signal.
9. Apparatus for automatic classification of audio signals according to claim 8,
characterised in that the segmentation means (8)
- allocates a predefined number of audio fragments (41, 42, 43) to an audio window
(81, 82, 83),
- determines the number of audio fragments (41, 42, 43) of each audio class (61, 62,
63) comprised in the audio window (81, 82, 83) and
- allocates the majority audio class (61, 62, 63) to the respective audio window (81,
82, 83).
10. Apparatus for automatic classification of audio signals according to one of the preceding
claims,
characterised in that
each audio fragment (41, 42, 43) generated by the audio signal fragmenting means (4)
corresponds to a frame consisting of a predefined number N of signal samples.
11. Apparatus for automatic classification of audio signals according to one of the preceding
claims,
characterised in that the apparatus (1) for automatic classification of audio signals further comprises:
- signal output means (10) for generating an output file (11);
wherein the output file contains the raw audio signal (2) supplied to the signal
input means (3) and an information signal comprising information regarding the
audio classes (61, 62, 63) and / or the audio windows (81, 82, 83) and / or contents
included in the raw signal (2).
12. Method for automatic classification of audio signals comprising the following steps:
- (S1) partitioning audio signals (2) into audio fragments (41, 42, 43) of a predetermined
length;
- (S2) analysing acoustic characteristics of the audio signals comprised in the audio
fragments (41, 42, 43); and
- (S3) discriminating the audio fragments (41, 42, 43) into a predetermined audio
class (61, 62, 63) based on predetermined audio class classifying models (71, 72,
73) by using acoustic characteristics of the audio signals comprised in the audio
fragments (41, 42, 43), wherein a predetermined audio class classifying model (71,
72, 73) is provided for each audio class (61, 62, 63) and each audio class (61, 62,
63) represents a respective kind of audio signals comprised in the corresponding audio
fragment (41, 42, 43);
characterised in that
the step (S3) of discriminating the audio fragments (41, 42, 43) into a predetermined
audio class (61, 62, 63) is performed by using an individual predetermined audio class
classifying model (71, 72, 73) for at least each audio class (61, 62, 63) "speech",
"music" and "singing voice".
13. Method for automatic classification of audio signals according to claim 12,
characterised in that the method further comprises the step of
- (S4) providing a classifier database (7) comprising the predetermined audio class
classifying models (71, 72, 73);
wherein the step (S3) of discriminating the audio fragments (41, 42, 43) into a predetermined
audio class (61, 62, 63) is performed by using the audio class classifying models
(71, 72, 73) stored in the classifier database (7).
14. Method for automatic classification of audio signals according to claim 12 or 13,
characterised in that the method further comprises the step of
- (S5) allocating the audio fragments (41, 42, 43) discriminated into the audio class
(63) "singing voice" to the audio class (62) "music".
15. Method for automatic classification of audio signals according to one of the claims
12 to 13,
characterised in that the method further comprises the step of
- (S6) discriminating the audio fragments (41, 42, 43) allocated to the audio class
(61) "speech" into the audio classes (61, 63) "speech" and "singing voice".
16. Method for automatic classification of audio signals according to one of the claims
12 to 15,
characterised in that
the step (S2) of analysing acoustic characteristics in the audio signals comprised
in the audio fragments (41, 42, 43) includes analysis of volume standard deviation
and/or volume dynamic range and/or high zero crossing rate ratio and/or low short-term
energy ratio and/or spectral flux and/or zero crossing rate and/or energy/loudness
and/or sub-band energy rate and/or mel-cepstral frequency components and/or frequency
centroid and/or bandwidth and/or line spectrum frequencies and/or roll-off.
17. Method for automatic classification of audio signals according to one of the claims
12 to 16,
characterised in that
the audio class classifying models (71, 72, 73) are provided as hidden Markov models
and/or Neural Networks and/or Gaussian Mixture Models and/or decision trees.
18. Method for automatic classification of audio signals according to one of the claims
12 to 17,
characterised in that the method further comprises the step of
- (S7) training the audio class model (73) for the audio class (63) "singing voice"
by a training audio signal comprising pure singing voice, only.
19. Method for automatic classification of audio signals according to one of the claims
12 to 18,
characterised in that the method further comprises the steps of
- (S8) analysing a sequence of audio classes (61, 62, 63) of cohesive audio fragments
(41, 42, 43); and
- (S9) segmenting classified audio signals (9) into individual audio windows (81,
82, 83) consisting of sequences of cohesive audio fragments (41, 42, 43) based on
predetermined content classifying rules by using the analyses of said sequence of
audio classes (61, 62, 63) of cohesive audio fragments (41, 42, 43), wherein each
sequence of cohesive audio fragments (41, 42, 43) corresponds to an individual content
included in the audio signal.
20. Method for automatic classification of audio signals according to claim 19,
characterised in that the method further comprises the steps of
- (S10) allocating a predefined number of audio fragments (41, 42, 43) to an audio
window (81, 82, 83);
- (S11) determining the number of audio fragments (41, 42, 43) of each audio class
(61, 62, 63) comprised in the audio window (81, 82, 83); and
- (S12) allocating the majority audio class (61, 62, 63) to the respective audio window
(81, 82, 83).
21. Software product comprising a series of state elements that are adapted to be processed
by a data processing means of a terminal such that a method according to one of the
claims 12 to 20 may be executed thereon.