(11) EP 1 667 106 B1

(12) EUROPEAN PATENT SPECIFICATION

(45) Mention of the grant of the patent:
25.11.2009 Bulletin 2009/48

(22) Date of filing: 06.12.2004

(51) International Patent Classification (IPC):

(54) Method for generating an audio signature
Verfahren zur Erstellung einer Audiosignatur
Procédé de génération d'une signature audio

(84) Designated Contracting States:
DE FR GB

(43) Date of publication of application:
07.06.2006 Bulletin 2006/23

(73) Proprietor: Sony Deutschland GmbH
10785 Berlin (DE)

(72) Inventor:
Kemp, Thomas
Stuttgart Technology Center
70327 Stuttgart (DE)

(74) Representative: Müller - Hoffmann & Partner
Patentanwälte
Innere Wiener Strasse 17
81667 München (DE)
(56) References cited:

- RIBBROCK A ET AL: "A full-text retrieval approach to content-based audio identification", Proceedings of the 2003 IEEE Radar Conference, Huntsville, AL, May 5-8, 2003, IEEE Radar Conference, New York, NY: IEEE, US, 9 December 2002 (2002-12-09), pages 194-197, XP010642545, ISBN: 0-7803-7920-9
- CANO P ET AL: "A review of algorithms for audio fingerprinting", Proceedings of the 2003 IEEE Radar Conference, Huntsville, AL, May 5-8, 2003, IEEE Radar Conference, New York, NY: IEEE, US, 9 December 2002 (2002-12-09), pages 169-173, XP010642539, ISBN: 0-7803-7920-9
- KURTH F ED - INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS: "A ranking technique for fast audio identification", Proceedings of the 2003 IEEE Radar Conference, Huntsville, AL, May 5-8, 2003, IEEE Radar Conference, New York, NY: IEEE, US, 9 December 2002 (2002-12-09), pages 186-189, XP010642543, ISBN: 0-7803-7920-9
Note: Within nine months from the publication of the mention of the grant of the European patent, any person may give notice to the European Patent Office of opposition to the European patent granted. Notice of opposition shall be filed in a written reasoned statement. It shall not be deemed to have been filed until the opposition fee has been paid. (Art. 99(1) European Patent Convention).
[0001] The present invention relates to a method for analyzing audio data.
[0002] The provision of characterizing data which may identify or characterize audio data, for instance samples of speech or music, becomes more and more important, as it is necessary, for instance in customer devices, to have means for robustly and reliably identifying such audio data, for instance within a database comprising a large plurality of music samples or the like. However, known identification schemes are connected with a high computational burden when it comes to clearly and reliably identifying audible samples by providing characteristic properties of such samples.
[0004] It is an aspect of the present invention to provide a method for analyzing audio data which is capable of reliably and robustly characterizing and identifying audio data at a comparably reduced computational burden.
[0005] The object underlying the present invention is achieved by a method for analyzing
audio data with the features of independent claim 1. Preferred embodiments of the
inventive method for analyzing audio data are within the scope of the dependent subclaims.
The object underlying the present invention is also achieved by a method for classifying
audio data, by an apparatus for analyzing audio data, by a computer program product,
as well as by a computer readable storage means according to claims 22, 27, 28, and
29 respectively.
[0006] It is therefore a key idea of the present invention to provide a method for analyzing audio data in which the process of generating a signature for said audio data is based on a feature set which contains a plurality of time domain related features, to base the process of analyzing on said plurality of time domain related features, and further to build up said signature for said audio data so as to at least one of contain, describe, and represent a plurality of temporal positions or ranges of a plurality of characteristics of said time domain related features within said audio data. Therefore, said signature can be derived while avoiding Fourier transform processes or other transforming algorithms having a high computational burden. Therefore, the inventive method is fast, easy to design, and at the same time reliable and simple to apply.
[0007] It is of particular advantage if audio data of a finite length are used, in particular
with a determined beginning and a determined ending.
[0008] According to a preferred embodiment of the inventive method for analyzing audio data
additionally or alternatively absolute temporal positions of characteristics of time
domain related features are involved in said step (C) of generating said signature
for said audio data.
[0009] According to a further preferred embodiment of the inventive method for analyzing
audio data additionally or alternatively relative temporal positions of characteristics
of time domain related features are involved in said step (C) of generating said signature
for said audio data.
[0010] It is of further advantage if, additionally or alternatively, in accordance with another embodiment of the inventive method for analyzing audio data, characteristics of the same time domain related feature are involved in said step (C) of generating said signature for said audio data.
[0011] It may, additionally or alternatively, in accordance with another embodiment of the inventive method for analyzing audio data, be preferred that characteristics of time domain related features of the same type are involved in said step (C) of generating said signature for said audio data.
[0012] In accordance with another embodiment of the inventive method for analyzing audio data, characteristics of time domain related features of different types may additionally or alternatively be involved in said step (C) of generating said signature for said audio data.
[0013] According to a still further preferred embodiment of the inventive method for analyzing audio data, additionally or alternatively at least one of the strength, the time course of the strength and local extrema of the strength of time domain related features are used as characteristics.
[0014] Additionally or alternatively, time domain related features of the same type may
be combined.
[0015] Time domain related features of different types may also be combined in addition
or as an alternative.
[0016] For instance, an energy contour of the audio data is used as a time domain related
feature.
[0017] Additionally or alternatively, a zero crossing rate of the audio data is used as a time domain related feature.
[0018] Analogue-to-digital converted data are used as said audio data.
[0019] Down-sampled data may be used as said audio data in addition or as an alternative,
in particular at a rate of 22050 Hz.
[0020] Additionally or alternatively, stereo-to-mono converted data are used as said audio
data.
[0021] Preferably, data having a frame structure are used as said audio data.
[0022] In accordance with another embodiment of the inventive method for analyzing audio data, said frame structure may be formed by a sequence of consecutive time frames, each time frame having a given and in particular fixed frame size and each time frame being separated from consecutive time frames by a given and in particular fixed frame shift.
[0023] Said frame shift may be chosen to be less than the frame size.
[0024] For instance, the frame size may be about 16 ms and the frame shift may be about
10 ms.
[0025] The temporal positions may be given as indices with respect to said time frames of
the frame structure underlying the audio data.
[0026] In accordance with a still further embodiment, the inventive method for analyzing audio data comprises a step (A) of providing said audio data that comprises at least one of receiving, reproducing, and generating said audio data.
[0027] According to the invention, the temporal positions in said signature are given in
the order of the magnitude or strength of the respective features.
[0028] According to a further aspect of the present invention, a method for classifying
audio data is provided which comprises a step (S1) of providing audio data as input
data, a step (S2) of generating a signature for said audio data, a step (S3) of providing
a comparison signature, a step (S4) of comparing said signature for said audio data
with said comparison signature and thereby generating comparison data, and a step
(S5) of providing as classification result said comparison data as output data, wherein
said signature for said audio data and in particular said comparison signature are
obtained according to the inventive method for analyzing audio data.
[0029] According to a preferred embodiment of the inventive method for classifying audio
data said step S3 of providing a comparison signature may comprise a step of providing
additional audio data as additional input data and a step of generating an additional
signature for said additional audio data. Said additional signature for said additional
audio data may then be used as said comparison signature.
[0030] Additionally or alternatively, said additional signature for said additional audio
data and thereby said comparison signature may be obtained according to the inventive
method for analyzing audio data.
[0031] Further additionally or alternatively, at least two samples of audio data may be
compared with respect to each other - one of said samples being assigned to said derived
signature and the other one of said samples being assigned to said additional signature/comparison
signature - in particular by comparing said signature and said additional signature/said
comparison signature with respect to coinciding features and/or with respect to differing
features.
[0032] Further additionally or alternatively, said at least two samples of audio data to be compared with respect to each other may be compared with respect to each other based on time domain related features first in a comparing pre-process and then based on additional features, e. g. based on features more complicated to calculate and/or based on frequency domain related features, in a more detailed comparing process.
[0033] In the sense of the present invention, in all cases of all these methods and the like, music, speech, noise and any combination thereof may be used as audio data or as pre-forms thereof.
[0034] According to a further aspect of the present invention, an apparatus for analyzing
audio data is provided which is adapted and which comprises means for carrying out
a method for analyzing audio data according to the present invention and the steps
thereof.
[0035] According to a further aspect of the present invention a computer program product
is provided comprising computer program means which is adapted to realize a method
for analyzing audio data according to the present invention and the steps thereof,
when it is executed on a computer or a digital signal processing means.
[0036] Additionally a computer readable storage medium is provided which comprises a computer
program product according to the present invention.
[0037] These and further aspects of the present invention will be further discussed in the
following:
[0038] The present invention in particular relates to a unique identifier for audio data and more particularly to a unique identifier for music.
[0039] With today's large capacity mp3 players and the corresponding flourishing music data
exchange, there is a substantial problem with the identification of music. If two
devices wish to communicate with each other about e. g. user preferences, or if a
server based application should e. g. recommend music based on user play list histories,
or if two databases should be merged, there is the need of a unique identifier for
music. Such an identifier must be unique enough so that one piece of music is never
confused with a different one, but must be robust enough so that different versions
of the same music (different cut, amplification, compression) result in the same identifier.
We propose a signal processing based tag which can be computed with very low effort directly on the music signal and which satisfies the above constraints.
[0040] It is proposed to e. g. extract a variety of time domain features or time domain related features from a piece of music and to use the time stamps (i.e. where the time domain feature has occurred) as a signature describing this file. It is in particular proposed not to use the absolute location or temporal position but the relative distance of the time domain features as a part of the signatures. Furthermore, it is proposed to combine time domain features of several types, for example local maxima of the energy contour, local minima of the energy contour, maxima and minima of the zero crossing rate, into one signature, where the time differences between features within one type and also between types are all used to identify a given piece of music.
[0041] In order to speed up the searching process, we propose to use the song length as
a restriction in the search. Only songs that have a similar length to the incoming
song are considered for the comparison, which speeds up the search by a factor of
100 or more.
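The length restriction described above can be sketched as a simple pre-filter. The tolerance value and the database layout are assumptions, since the text gives no concrete figures:

```python
# Hypothetical pre-filter: keep only database songs whose length is close
# to that of the incoming song. The +-5 s tolerance is an assumption.
def candidates_by_length(database, query_length_s, tolerance_s=5.0):
    return [song for song in database
            if abs(song["length_s"] - query_length_s) <= tolerance_s]
```

Comparing only against songs of similar length yields the claimed speed-up because the expensive signature comparison then runs on only a small fraction of the collection.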
[0042] All the analysis may be carried out on frames. A frame is a time slice of the original
music piece with a length of e. g. about 16 ms. Frames are spaced 10 ms apart from
each other.
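The framing step can be sketched as follows, assuming a mono signal sampled at 22050 Hz; the loop structure is an illustrative assumption, only the frame size and frame shift follow the text:

```python
# Sketch of the framing described above: 16 ms frames spaced 10 ms apart
# (a mono signal sampled at 22050 Hz is assumed).
def make_frames(signal, sample_rate=22050, frame_ms=16, shift_ms=10):
    frame_size = int(sample_rate * frame_ms / 1000)   # about 352 samples
    frame_shift = int(sample_rate * shift_ms / 1000)  # about 220 samples
    frames = []
    start = 0
    while start + frame_size <= len(signal):
        frames.append(signal[start:start + frame_size])
        start += frame_shift
    return frames
```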
[0043] Specifically, it is proposed to use the following time domain features on suitably
down sampled and stereo-to-mono converted audio files. A typical parameter for this
is a sampling rate of about 22050 Hz.
[0044] The following properties may be used as features:
- EPEAK
- is the difference of the mean value of the short-time energy as computed over a long
time period (+- 50 frames) and the mean value of the short time energy as computed
over a short time period (+- 2 frames).
- EVALLEY
- is EPEAK multiplied by (-1).
- ZCRPEAK
- is the difference of the mean value of the frame zero crossing rate as computed over
a long time period (+- 50 frames) and the mean value of the frame zero crossing rate
computed over a short time period (+-2 frames).
- ZCRVALLEY
- is ZCRPEAK multiplied by (-1).
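A minimal sketch of the four features listed above is given below. The per-frame energy and zero-crossing computations and the window handling at the file borders are assumptions; only the long-term/short-term mean difference follows the text:

```python
# Sketch of the four per-frame features listed above. Border handling
# (shrinking the window at the file edges) is an assumption.
def short_time_energy(frames):
    # per-frame energy: sum of squared samples
    return [sum(s * s for s in frame) for frame in frames]

def zero_crossing_rate(frames):
    # per-frame count of sign changes between neighbouring samples
    return [sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
            for frame in frames]

def mean_window(values, i, half_width):
    lo = max(0, i - half_width)
    hi = min(len(values), i + half_width + 1)
    return sum(values[lo:hi]) / (hi - lo)

def peak_feature(values, long_hw=50, short_hw=2):
    # difference of the mean over a long period (+-50 frames) and the
    # mean over a short period (+-2 frames), per frame, as in the text
    return [mean_window(values, i, long_hw) - mean_window(values, i, short_hw)
            for i in range(len(values))]

# EPEAK     = peak_feature(short_time_energy(frames))
# EVALLEY   = [-v for v in EPEAK]
# ZCRPEAK   = peak_feature(zero_crossing_rate(frames))
# ZCRVALLEY = [-v for v in ZCRPEAK]
```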
[0045] The values of the above-mentioned time domain features (there are as many per feature
as there are frames in the audio file) are sorted, and the highest N of them are retained
(and sorted by value) to form the signature. N is dependent on the type of the feature.
Typical choices could be N(EPEAK) = 20, N(EVALLEY) = 15, N(ZCRPEAK) = 10, N(ZCRVALLEY)
= 10.
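The retention of the N strongest frames per feature might look as follows; the dictionary layout of the signature is an assumption:

```python
# Sketch: retain the frame indices of the N strongest values per feature,
# sorted by decreasing value, as in the text. The dict layout is assumed.
N_PER_FEATURE = {"EPEAK": 20, "EVALLEY": 15, "ZCRPEAK": 10, "ZCRVALLEY": 10}

def top_n_indices(values, n):
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return order[:n]

def build_signature(features):
    # features: mapping from feature name to its per-frame values
    return {name: top_n_indices(vals, N_PER_FEATURE[name])
            for name, vals in features.items()}
```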
[0046] The signature may be formed by the frame indices corresponding to the times or temporal locations of the peaks of the respective time domain features. The signature may therefore be formed as an n-tuple of respective index numbers or values.
[0047] When two signatures are compared to check whether they correspond to the same piece of music, all possible differences between peaks in each of the time domain features have to be taken into account as potential shifts, since the two signals can be time-shifted versions of each other. Shifts exceeding a predefined threshold can be discarded, which greatly speeds up the comparison.
[0048] After all possible shifts have been extracted, the peaks of all individual time domain
features are compared with each other. Identical time instances are counted as a hit.
The strength of a hit is modified according to the relative position in the respective
list, e. g. if the strongest peak in file 1 matches a strong peak in file 2, this
will give a higher evidence of identity than if two weak peaks are matching. Peaks
in different features are not compared to each other. As a result of this step, for
each of the time domain features and every potential shift, a hit strength is obtained.
[0049] The hits from different features are then linearly combined with combination weights
Wi, where i denotes the time domain feature used, to yield final similarity scores,
conditioned on tentative shift. If any of the final similarity scores exceeds a predefined
threshold, it is concluded that the two files are identical and are shifted with the
corresponding shift relative to each other.
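The shift-and-count comparison of paragraphs [0047] to [0049] can be sketched as follows. The rank weighting, the shift threshold, and the score combination details are assumptions; only the overall flow follows the text:

```python
# Sketch of the signature comparison: enumerate candidate shifts from
# peak-index differences, count rank-weighted hits per feature and shift,
# and linearly combine them. Weighting and thresholds are assumptions.
def candidate_shifts(sig1, sig2, max_shift=500):
    shifts = set()
    for name in sig1:
        for a in sig1[name]:
            for b in sig2.get(name, []):
                if abs(a - b) <= max_shift:   # discard excessive shifts early
                    shifts.add(a - b)
    return shifts

def hit_strength(sig1, sig2, shift):
    # indices earlier in each list belong to stronger peaks, so a hit
    # between two strong peaks is weighted higher (assumed scheme)
    strengths = {}
    for name in sig1:
        s = 0.0
        for rank1, a in enumerate(sig1[name]):
            for rank2, b in enumerate(sig2.get(name, [])):
                if a - b == shift:             # identical time instance
                    s += 1.0 / (1 + rank1 + rank2)
        strengths[name] = s
    return strengths

def similarity(sig1, sig2, weights):
    # best linearly combined score over all tentative shifts
    best = 0.0
    for shift in candidate_shifts(sig1, sig2):
        hits = hit_strength(sig1, sig2, shift)
        score = sum(weights[name] * hits.get(name, 0.0) for name in weights)
        best = max(best, score)
    return best
```

Two files would be declared identical when `similarity` exceeds a predefined threshold, the best shift giving their relative offset.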
[0050] Since all proposed features are computed purely in the time domain, the entire processing
is extremely fast and can be run on low powered hardware platforms.
[0051] With the proposed invention we can uniquely identify a piece of music in a music collection, independently from existing metadata like ID3 tags or the like. Wrong ID3 tags, which would otherwise render further processing impossible, can even be corrected. By computing the tag directly on the signal, there is complete independence from metadata, which is the main advantage of the proposed technology.
[0052] The proposed signature is fast to compute and particularly small. Also, it is symmetric,
i. e. the effort to compare two identifiers or signatures is small on the server side
and comparable with the effort to compute an identifier or signature.
[0053] The invention will now be explained based on preferred embodiments thereof and by
taking reference to the accompanying and schematical figures.
- Fig. 1
- is a schematical flow chart describing a preferred embodiment of the inventive method
for analyzing audio data.
- Fig. 2A, 2B
- are schematical graphical representations which elucidate the process for obtaining
the structure for a signature for audio data which can be obtained according to preferred
embodiments of the inventive method for analyzing audio data.
- Fig. 3
- is a schematical flow chart which elucidates a preferred embodiment of the inventive
method for classifying audio data.
[0054] In the following, functionally and structurally similar or equivalent element structures will be denoted with the same reference symbols. A detailed description will not be repeated at each of their occurrences.
[0055] Fig. 1 is a schematical flow chart for elucidating an embodiment of the inventive
method for analyzing audio data AD.
After a start-up procedure a sample of audio data AD is received as an input I. In the following step B said audio data AD are analyzed with respect to a feature set FS. Said feature set FS contains or is built up by a plurality of time domain features TDFk. Said feature set FS may be provided externally or internally. In the following step C a signature SAD for said audio data AD with respect to the given feature set FS is generated.
[0057] In a final step D said signature SAD for said audio data AD is provided and output as output data O.
[0058] Figs. 2A and 2B schematically demonstrate different possibilities for obtaining a signature SAD for a given audio data sample AD.
[0059] The graph of Fig. 2A shows a distinct time domain feature TDFk and its strength or amplitude as a function of time t. Therefore, in the graphical representation of Fig. 2A the abscissa is assigned to the time t and the ordinate is assigned to the amplitude or strength of the time domain feature TDFk.
[0060] The graphical representation of Fig. 2A also contains an indication of the respective
frames, into which the time course is sub-divided. Each of the frames has a given
number which is defined by the consecutive enumeration of the sequence of time frames.
Each frame has a frame size FrSi and directly neighboured frames are shifted by a
frame shift FrSh. Both parameters are defined as shown in the graphical representation
of Fig. 2A.
[0061] As can be seen from the graphical representation of Fig. 2A there exist five local
maxima with respect to the amplitude of the time domain feature TDFk. Each of said
local maxima is localized within a given frame. In the order of their temporal occurrence
the frames which contain the local extrema or local maxima are frames 3, 5, 8, 16,
and 23.
[0062] Below the graphical representation, examples for distinct signatures SAD for the given audio data sample AD are demonstrated. In the first example, the signature SAD is just the 5-tuple given by the frame indices or frame numbers, i.e. SAD := <3,5,8,16,23>. Therefore, this first example for a signature SAD uses the maxima of the time domain feature TDFk and shows the time ordered sequence of the respective frame indices starting at the very beginning of the audio data sample AD.
[0063] Sometimes, it is more convenient to compare not the time ordering but an ordering with respect to the magnitudes of the local maxima, i.e. to use an ordering for the frame indices which corresponds to the order of decreasing amplitudes of the maxima: the index for the most prominent maximum is mentioned first, followed by the frame index containing the next less prominent maximum, and so on. Therefore, according to this particular embodiment the respective signature SAD has the form SAD := <5,3,16,8,23>.
[0064] It is also possible to use a strict relative time ordering, i.e. without taking into account the strength of the local maxima. The signature then consists of the differences of frame indices between consecutive maxima in their temporal order: the difference between the first and the second maximum (5 - 3 = 2), followed by the difference between the second and the third maximum (8 - 5 = 3), and so on. In this case the derived signature SAD has the form SAD := <2,3,8,7>.
[0065] Finally, it is also possible to use a relative time relationship together with an ordering with respect to the strength of the local maxima of the time domain feature TDFk. That means that the ordering starts relative to the temporal position of the most prominent local maximum, which in the case of Fig. 2A is the maximum of frame 5. Then the index difference to the next less prominent maximum situated in frame 3 is given, i.e. the value -2 is obtained. Then the difference in index values with respect to the next less prominent maximum situated in frame 16 is calculated, which yields the value 13. Accordingly, it follows that the next values for the signature are -8 and 15. SAD therefore has the form SAD := <-2, 13, -8, 15>.
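Using the example peaks of Fig. 2A, the four signature forms of paragraphs [0062] to [0065] can be reproduced in a few lines; the strength ordering 5, 3, 16, 8, 23 is taken from the text:

```python
# The five local maxima of Fig. 2A: frame indices in temporal order and
# the same indices ordered by decreasing peak strength (per the text).
peaks_by_time = [3, 5, 8, 16, 23]
peaks_by_strength = [5, 3, 16, 8, 23]

def consecutive_diffs(indices):
    # differences between consecutive entries of an index list
    return [b - a for a, b in zip(indices, indices[1:])]

sig_abs_time = list(peaks_by_time)          # <3,5,8,16,23>, paragraph [0062]
sig_abs_strength = list(peaks_by_strength)  # <5,3,16,8,23>, paragraph [0063]
sig_rel_time = consecutive_diffs(peaks_by_time)          # <2,3,8,7>, [0064]
sig_rel_strength = consecutive_diffs(peaks_by_strength)  # <-2,13,-8,15>, [0065]
```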
[0066] The resulting structures for SAD in the described cases are summarized in Fig. 2B.
[0067] Fig. 3 is a schematical flow chart for an embodiment for a method for classifying
audio data AD which can also be referred to as a method for comparing audio data or
audio samples.
[0068] After a start-up or set-up procedure in a first step S1 an audio data sample AD to
be classified is received as an input I'.
[0069] In a following step S2 said audio data sample AD is analyzed with respect to a received feature set FS in accordance with the inventive method for analyzing audio data AD by taking into account time domain features TDFk in a step S2a. In a further sub-step S2b the respective signature SAD for the received audio data sample AD is generated as a function of said audio data as well as of said feature set, i.e. SAD := SAD(AD,FS).
[0070] In the following step S3 a comparison signature CS is obtained either as a fixed value or from a database or by generating said comparison signature from a further audio data sample AD' in the same manner as has been done with the received audio data sample AD.
[0071] In a further step S4 the signature SAD for the received audio data sample AD is compared
with respect to the comparison signature CS. Thereby comparison data CD are generated
which are representative and descriptive for the common features or differences between
the signature SAD for the received audio data sample and the comparison signature
CS.
[0072] Finally, in a step S5 the comparison data CD as a function of said signature SAD for the received audio data sample AD and the comparison signature CS are provided as an output O'.
Reference Symbols
[0073]
- AD
- audio data, audio data sample
- AD'
- audio data, audio data sample, additional audio data
- CD
- comparison data
- CS
- comparison signature
- Fj
- Feature; j = 1, ..., n
- Fr
- time frame within audio data
- FrSh
- frame shift
- FrSi
- frame size
- FS
- feature set
- I
- input, input data
- I'
- input, input data
- O
- output, output data
- O'
- output, output data
- SAD
- signature for received audio data AD
- SAD'
- additional signature for received additional audio data AD'
- TDFk
- time domain (related) feature; k = 1, ..., m
1. Method for analyzing audio data (AD), comprising:
a step (B) of analyzing said audio data (AD), thereby determining a plurality of time
domain features (TDFk),
a step (C) of generating a signature (SAD) for said audio data (AD), said signature
being descriptive, representative or characteristic for said audio data (AD), and
a step (D) of providing as an analysis result, said signature (SAD) for said audio
data (AD) as output data (O),
wherein said signature (SAD) for said audio data (AD) contains, describes, or represents
a plurality of temporal positions of said plurality of time domain features (TDFk)
within said audio data (AD),
characterized in that temporal positions in said signature (SAD) are given in the order of the magnitude
or strength of respective features.
2. Method according to claim 1, wherein audio data (AD) of a finite length are used,
in particular with a determined beginning and a determined ending.
3. Method according to any one of the preceding claims, wherein absolute temporal positions
of characteristics of time domain features (TDFk) are involved in said step (C) of
generating said signature (SAD) for said audio data (AD).
4. Method according to any one of the preceding claims, wherein relative temporal positions
of time domain features (TDFk) are involved in said step (C) of generating said signature
(SAD) for said audio data (AD).
5. Method according to any one of the preceding claims, wherein characteristics of the
same time domain feature (TDFk) are involved in said step (C) of generating said signature
(SAD) for said audio data (AD).
6. Method according to any one of the preceding claims, wherein characteristics of time
domain features (TDFk) of the same type are involved in said step (C) of generating
said signature (SAD) for said audio data (AD).
7. Method according to any one of the preceding claims, wherein characteristics of time
domain features (TDFk) of different types are involved in said step (C) of generating
said signature (SAD) for said audio data (AD).
8. Method according to any one of the preceding claims, wherein at least one of the strength,
the time course of the strength and local extrema of the strength of time domain
features (TDFk) are used as characteristics.
9. Method according to any one of the preceding claims, wherein time domain features
(TDFk) of the same type are combined.
10. Method according to any one of the preceding claims, wherein time domain features
(TDFk) of different types are combined.
11. Method according to any one of the preceding claims, wherein an energy contour of
the audio data (AD) is used as a time domain feature (TDFk).
12. Method according to any one of the preceding claims, wherein a zero crossing rate
of the audio data (AD) is used as a time domain feature (TDFk).
13. Method according to any one of the preceding claims, wherein analogue-to-digital converted
data are used as said audio data (AD).
14. Method according to any one of the preceding claims, wherein down-sampled data are
used as said audio data (AD), in particular at a rate of 22050 Hz.
15. Method according to any one of the preceding claims, wherein stereo-to-mono converted
data are used as said audio data (AD).
16. Method according to any one of the preceding claims, wherein data having a frame structure
are used as said audio data (AD).
17. Method according to claim 16, wherein said frame structure is formed by a sequence
of consecutive time frames (Fr), each time frame (Fr) having a given and in particular
fixed frame size (FrSi) and each time frame (Fr) being separated from consecutive
time frames (Fr) by a given and in particular fixed frame shift (FrSh).
18. Method according to claim 17, wherein the frame shift (FrSh) is chosen to be less
than the frame size (FrSi).
19. Method according to any one of the preceding claims 17 or 18, wherein the frame size
(FrSi) is about 16 ms and the frame shift (FrSh) is about 10 ms.
20. Method according to any one of the preceding claims 16 to 19, wherein temporal positions
are given as indices with respect to said time frames (Fr) of the frame structure
underlying the audio data (AD).
21. Method according to any one of the preceding claims, comprising a step (A) of providing
said audio data (AD) that comprises at least one of receiving, reproducing, and generating
said audio data (AD).
22. Method for classifying audio data, comprising:
- a step (S1) of providing audio data (AD) as input data (I'),
- a step (S2) of generating a signature (SAD) for said audio data (AD),
- a step (S3) of providing a comparison signature (CS),
- a step (S4) of comparing said signature (SAD) for said audio data (AD) with said
comparison signature (CS) and thereby generating comparison data (CD), and
- a step (S5) of providing as classification result said comparison data (CD) as output
data (O'),
- wherein said signature (SAD) for said audio data (AD) and said comparison signature
(CS) are obtained according to a method for analyzing audio data according to any
one of the claims 1 to 21.
23. Method according to claim 22,
wherein said step (S3) of providing a comparison signature (CS) comprises:
- a step of providing additional audio data (AD') as additional input data and
- a step of generating an additional signature for said additional audio data (AD'),
and
wherein said additional signature for said additional audio data (AD') is used as
said comparison signature (CS).
24. Method according to claim 23,
wherein said additional signature for said additional audio data (AD') and thereby
said comparison signature (CS) are obtained according to a method for analyzing audio
data according to any one of the claims 1 to 21.
25. Method according to any one of the preceding claims 22 to 24,
wherein at least two samples of audio data (AD, AD') are compared with respect to
each other - one (AD) of said samples (AD, AD') being assigned to said derived signature
(SAD) and the other one (AD') of said samples (AD') being assigned to said additional
signature or said comparison signature (CS) - in particular by comparing said signature
(SAD) and said additional signature or said comparison signature (CS) with respect
to coinciding features and/or with respect to differing features.
26. Method according to claim 25,
wherein said at least two samples of audio data (AD, AD') to be compared with respect
to each other are compared with respect to each other based on time domain features
(TDFk) first in a comparing pre-process and then based on additional features, e.
g. based on features more complicated to calculate and/or based on frequency domain
related features, in a more detailed comparing process.
27. Apparatus for analyzing audio data,
which is adapted and which comprises means for carrying out a method for analyzing
audio data according to any one of the claims 1 to 21 and the steps thereof and/or
the method for classifying audio data according to any one of the claims 22 to 26
and the steps thereof.
28. Computer program product,
comprising computer program means which is adapted to realize a method for analyzing
audio data according to any one of the claims 1 to 21 and the steps thereof and/or
the method for classifying audio data according to any one of the claims 22 to 26
and the steps thereof, when it is executed on a computer or a digital signal processing
means.
29. Computer readable storage medium,
comprising a computer program product according to claim 28.