TECHNICAL FIELD
[0001] The present disclosure relates generally to audio recognition and in particular to
determining if sound is natural or artificial.
BACKGROUND
[0002] This section is intended to introduce the reader to various aspects of art, which
may be related to various aspects of the present disclosure that are described and/or
claimed below. This discussion is believed to be helpful in providing the reader with
background information to facilitate a better understanding of the various aspects
of the present disclosure. Accordingly, it should be understood that these statements
are to be read in this light, and not as admissions of prior art.
[0003] Audio (acoustic, sound) recognition is particularly suitable for monitoring people's
activity, as it is relatively non-intrusive, requires no detectors other than
microphones, and is relatively accurate.
[0004] Figure 1 illustrates a generic conventional audio classification pipeline 100 that
comprises an audio sensor 110 capturing a raw audio signal, a preprocessing module
120 that prepares the captured audio for a features extraction module 130 that outputs
extracted features to a classifier module 140 that uses entries in an audio database
150 to label audio that is then output.
[0005] The labelled audio can then be used to, for example, determine the activities of
persons (and even pets) in the location where the audio was captured. Knowledge of
the activities can be used in situations like e-health, care of children or the elderly,
and home security. In addition, parents could use the knowledge to determine what
their children do when they are alone at home: for instance, after school, are they
doing their homework or watching television?
[0006] In some of these cases, it can be important to distinguish between natural sound
- for example talking or singing persons, or a barking dog - and artificial sound,
i.e. sound that is rendered by a rendering device such as a radio, a television or
a hi-fi system. It will be appreciated that persons talking on the television could
be mistaken for real persons discussing. So far, this issue appears to have no suitable
conventional solution.
[0007] It will be appreciated that there is a desire for a solution that addresses this
problem. The present principles provide such a solution.
SUMMARY OF DISCLOSURE
[0008] In a first aspect, the present principles are directed to a method for determining
if sound is artificial. At a device, a hardware input interface obtains a signal corresponding
to sound in an environment and at least one hardware processor calculates from the
signal at least one of a descriptor related to loudness and a descriptor related to
silence, and determines that the sound is artificial in case a variance of the descriptor
related to loudness is below a first threshold value or in case the descriptor related
to silence is below a second threshold value.
[0009] Various embodiments of the first aspect include:
- That the descriptor related to silence is a ratio of windows of the signal that are
silent to windows of the signal that are non-silent. Adjacent windows can be overlapping.
A window can be deemed as silent in case its Root Mean Square (RMS) power is below
a third threshold.
- That the descriptor related to loudness is a standard deviation for power of the signal.
[0010] In a second aspect, the present principles are directed to a device for determining
if sound is artificial, comprising a hardware input interface configured to obtain
a signal corresponding to sound in an environment, and at least one hardware processor
configured to calculate from the signal at least one of a descriptor related to loudness
and a descriptor related to silence, and determine that the sound is artificial in
case a variance of the descriptor related to loudness is below a first threshold value
or in case the descriptor related to silence is below a second threshold value.
[0011] Various embodiments of the second aspect include:
- That the descriptor related to silence is a ratio of windows of the signal that are
silent to windows of the signal that are non-silent. Adjacent windows can be overlapping.
A window can be deemed as silent in case its Root Mean Square (RMS) power is below
a third threshold.
- That the descriptor related to loudness is a standard deviation for power of the signal.
- That the input interface is configured to capture the sound. The input interface can
comprise a microphone.
- That the device further comprises an output interface for outputting information about
whether the sound is natural or artificial.
[0012] In a third aspect, the present principles are directed to a computer program comprising
program code instructions executable by a processor for implementing the method according
to the first aspect.
[0013] In a fourth aspect, the present principles are directed to a computer program product
which is stored on a non-transitory computer readable medium and comprises program
code instructions executable by a processor for implementing the method according
to the first aspect.
BRIEF DESCRIPTION OF DRAWINGS
[0014] Preferred features of the present principles will now be described, by way of non-limiting
example, with reference to the accompanying drawings, in which:
Figure 1 illustrates a generic conventional audio classification pipeline;
Figure 2 illustrates conventional downward compression with a hard knee;
Figure 3 illustrates a signal without dynamic range compression and the same signal
with dynamic range compression;
Figure 4 illustrates a device for audio distinction according to the present principles;
and
Figure 5 illustrates a flowchart for a method of audio distinction according to the
present principles.
DESCRIPTION OF EMBODIMENTS
[0015] One way of monitoring a person in order to, for instance, anticipate problems, is
to verify if the habits of the person are followed. To do this, it can be useful to
classify ambient sound in the person's location as:
- no sound, i.e., silence.
- natural ambient sound, such as for example people who are physically present speaking,
cooking, or a dog barking.
- artificial ambient sound, such as sound coming from a radio, a television, or a hi-fi
system. In this context, "artificial" means that the sound was processed for broadcast
or recording and subsequent rendering.
[0016] To detect artificial ambient sound, the present principles rely on the fact that
most artificial audio sources use dynamic range compression to enhance the sound and
to make it more present. The sound can for example be enhanced to avoid a clipping
effect or amplifier chain saturation, or to fit better into the Frequency Modulation
standard, which has a limited frequency spectrum range.
[0017] Dynamic range compression, which is a very common technique in the broadcast chain
and in media content workflows, amplifies parts of the audio signal with low amplitude
(upward compression), reduces the loud parts of the sound (downward compression),
or both. On the other hand, natural sounds tend to be characterized by a wider dynamic
range, which typically means that more low power sounds tend to be present in a natural
audio signal than in a dynamic range compressed audio signal. Hence, detecting such
dynamic differences within the sound can help differentiate between artificial and
natural sound.
Dynamic Range Compression
[0018] Dynamic Range Compression (DRC) for audio will now be described in further detail.
As already mentioned, DRC can amplify low sounds, attenuate high sounds, or both.
[0019] Figure 2 illustrates conventional downward compression with a hard knee. A compression
function curve 210 has a first part 212 that is neutral - i.e., an input level transformed
by this part results in an equal output level. The curve further has a second part
214 that meets the first part 212 at a hard knee 216. The second part 214 performs
downward compression, which means that an input level Li is transformed into a lower
output level Lo. Figure 2 also shows a threshold 220
that lies between the first part 212 and the second part 214. In this example, the
threshold 220 coincides with the hard knee 216, but it will be appreciated that in
case a soft knee is used, this will extend around the threshold 220 and comprise part
of the first part 212 and the second part 214 as well.
[0020] It will also be understood that for upward compression, the first part of the curve
would be flatter so that an input level results in a higher output level (except perhaps
at the hard knee). It will further be understood that the function can allow both
downward and upward compression, in which case the first part and the second part
can have identical slopes or different slopes.
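By way of non-limiting illustration, the hard-knee downward compression curve of Figure 2 can be sketched as a static transfer function on levels expressed in decibels. The function name compress_downward, the threshold of -20 dB and the compression ratio of 4 are assumptions chosen for the example; the present description does not specify any particular values.

```python
def compress_downward(level_db, threshold_db=-20.0, ratio=4.0):
    """Static hard-knee downward compression curve (levels in dB).

    threshold_db and ratio are hypothetical example values.
    """
    if level_db <= threshold_db:
        # Neutral part of the curve: output level equals input level.
        return level_db
    # Compressing part: the excess over the threshold is divided by the ratio,
    # so the output level is lower than the input level.
    return threshold_db + (level_db - threshold_db) / ratio


for level in (-40.0, -20.0, -8.0, 0.0):
    print(level, "->", compress_downward(level))
```

Levels at or below the threshold pass unchanged, while levels above it are pulled towards the threshold, which narrows the dynamic range of the output.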
[0021] DRC can for example be used:
- In public spaces to make music sound louder without having to increase the peak amplitude.
- In music production for a better mix between vocals and instruments.
- In voice processing to avoid sibilance.
- In broadcasting to fit a broadcast signal with narrow range, as will be explained
in more detail.
- In marketing to increase the impact of commercials.
- To protect circuitry in devices with amplifiers, and also to avoid clipping or saturation
effects.
- In hearing aids and headphones to make certain sounds more audible while others are
attenuated.
[0022] Figure 3 illustrates a signal 310 without dynamic range compression and the same
signal 320 with dynamic range compression. As can be seen, the loud parts have been
attenuated (downward compression) and the low parts have been amplified (upward compression).
[0023] In the case of FM (Frequency Modulation) radio broadcasting, the characteristic of
the frequency modulation limits the frequency spectrum range, which in turn limits
the acoustic dynamic range. If the frequency spectrum range is not respected, this
will result in spectrum overlaps and audio distortion. Simply reducing the amplitude
of the signal fed to the modulator so that the signal is never clipped requires a significant
reduction of the input signal, which results in a reduction in the signal-to-noise
ratio (SNR). A lower SNR in turn means that a listener will hear more transmission
noise, especially during the quieter parts of the transmission.
[0024] The effect on FM also applies to digital radio that includes an ADC (Analog to Digital
Converter) in front of the modulator and for which dynamic range is limited.
[0025] DRC can also protect the audio amplification chain as well as any speakers from
saturation when they are not dimensioned to render the natural dynamic range.
[0026] In addition, compressing FM broadcast radio or broadcast TV enables the high-power
amplifier transmitter required to broadcast the signal over the air to transmit using
a more constant output power. Doing so can increase the lifetime of the amplifier.
Indeed, the standardization community tries to find the best compromise between audio
quality for the end user and economy when it comes to the broadcasting infrastructure.
[0027] Further, Automatic Gain Control (AGC) is useful for microphone capture when a speaker
talks over a low background sound that should be shared with the audience. AGC aims
to provide a controlled-level output signal regardless of the input signal. In other
words, weak input signals are amplified and loud input signals are attenuated. The
outcome is a less dynamic sound that is suitable for network broadcasting.
[0028] It will be understood that an audio signal that is broadcast or streamed over the
air, a cellular network or a broadband network typically has a compressed acoustic
dynamic. Hence, a music or voice audio signal listened to through a speaker has different
dynamical properties than audio produced by natural sources like human voices, animal
sounds and (non-amplified) instruments.
[0029] As an effect of DRC is an amplification of sounds below a first amplitude threshold
and an attenuation of sounds above a second amplitude threshold (possibly the same
as the first threshold), it can be seen that sounds with DRC have a smaller amplitude
variance than natural sounds. In addition, most broadcast sources - television, radio,
music - tend to avoid silence. Therefore, the proportion of silence will be low for
artificial sound sources.
[0030] Hence to distinguish artificial ambient sound from natural ambient sound, a device
can analyse captured ambient sound to determine at least one of:
- if the variance of the amplitude is above (natural ambient sound) or below (artificial
ambient sound) a variance threshold value, and
- if the level of silence is above (natural ambient sound) or below (artificial ambient
sound) a silence threshold value.
[0031] One way of calculating the amplitude variance is as follows, but it will be appreciated
that other ways exist. First, the captured sound is divided into a number of sections
(or windows); the windows can be distinct, but generally overlap, with each subsequent
window starting at the middle of the preceding window. Each window has an index,
denoted k. The windows have the same size, denoted w. The captured sound,
i.e. the part for which it should be determined if it is natural or artificial, is
thus divided into a set of K possibly overlapping windows.
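The division into windows can be sketched as follows, purely as an illustration. The function name split_into_windows is hypothetical, and the choice to discard trailing samples that do not fill a complete window is an assumption not stated in the present description.

```python
def split_into_windows(samples, w=1024, overlap=True):
    """Split a list of samples into windows of size w.

    With overlap=True, each window starts at the middle of the preceding one.
    Trailing samples that do not fill a whole window are dropped (assumption).
    """
    hop = w // 2 if overlap else w
    return [samples[start:start + w]
            for start in range(0, len(samples) - w + 1, hop)]


samples = list(range(4096))
windows = split_into_windows(samples, w=1024)
print(len(windows))          # number of windows K
print(windows[1][0])         # first sample index of the second window
```

With 4096 samples and w = 1024, half-overlapping windows start every 512 samples, which gives K = 7 windows; without overlap, the same signal gives K = 4 windows.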
[0032] The Root Mean Square (RMS) power of the sound for the window k is defined as:
$$P_k = \sqrt{\frac{1}{w} \sum_{i=1}^{w} s_i^2}$$
where the $s_i$ are the w contiguous samples of the sound in the window k.
[0033] The size w may take the value of 1024, but other values such as 2048 have also
been contemplated.
[0034] The output for the different windows defines a series of instantaneous power values
$P_k$ for the K windows of the captured sound signal.
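As a minimal sketch, the RMS power of a window, i.e. the square root of the mean of the squared samples, can be computed as shown below. The function name rms_power is hypothetical, and the samples are assumed to be plain numeric values.

```python
import math


def rms_power(window):
    """Root Mean Square power of the samples in one window."""
    return math.sqrt(sum(s * s for s in window) / len(window))


print(rms_power([3.0, -3.0, 3.0, -3.0]))   # constant magnitude 3
print(rms_power([1.0, 0.0, 0.0, 0.0]))     # one non-zero sample out of four
```

Applying this function to each of the K windows yields the series of instantaneous power values used by both descriptors.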
[0035] The mean power can then be calculated as
$$\bar{P} = \frac{1}{K} \sum_{k=1}^{K} P_k$$
and the standard deviation as
$$\sigma_P = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left(P_k - \bar{P}\right)^2}$$
[0036] Since amplitude and power are inextricably linked, the standard deviation for the
power is also an indirect measure of the standard deviation for the amplitude.
[0037] It is preferred to obtain a normalised measure by dividing the standard deviation
by the mean power:
$$CV(P) = \frac{\sigma_P}{\bar{P}}$$
[0038] This coefficient of variation of the power is a first descriptor, related to loudness,
used to distinguish natural sounds from artificial sounds.
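The first descriptor can be sketched as follows, assuming that the standard deviation is the population standard deviation (division by K rather than K - 1), which the present description does not settle. The function name coefficient_of_variation and the handling of an all-zero power series are likewise assumptions.

```python
import math


def coefficient_of_variation(powers):
    """Standard deviation of the window powers divided by their mean."""
    K = len(powers)
    mean = sum(powers) / K
    if mean == 0.0:
        # Assumption: a capture with zero power everywhere has no variation.
        return 0.0
    variance = sum((p - mean) ** 2 for p in powers) / K
    return math.sqrt(variance) / mean


print(coefficient_of_variation([1.0, 1.0, 1.0, 1.0]))  # constant power
print(coefficient_of_variation([1.0, 3.0]))            # varying power
```

A constant power series gives a coefficient of 0, whereas a series alternating between low and high power gives a larger value, as expected for a sound with a wide dynamic range.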
[0039] To calculate the level of silence, first the windows whose RMS power is below a given
threshold τ are marked as 'silent'.
[0040] Then in an optional step, consecutive windows marked 'silent' are grouped in 'silent'
groups, and consecutive windows marked 'non-silent' are grouped in 'non-silent' groups.
The signal is therefore seen as a series of interleaved 'silent' and 'non-silent' groups.
To clean the signal of anomalous or outlying events, groups of 'non-silent' windows
smaller than a certain size (such as a few windows, e.g. three) are marked 'silent'.
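The marking of silent windows and the optional cleaning step can be sketched as below. The function name mark_silent, the parameter name min_group and the example threshold are hypothetical; the sketch treats a 'non-silent' group as too small when it contains fewer than min_group windows.

```python
from itertools import groupby


def mark_silent(powers, tau, min_group=3):
    """Return one flag per window, True meaning 'silent'.

    Windows with RMS power below tau are marked silent. Runs of non-silent
    windows shorter than min_group are then re-marked as silent.
    """
    flags = [p < tau for p in powers]
    cleaned = []
    for is_silent, group in groupby(flags):
        run = list(group)
        if not is_silent and len(run) < min_group:
            run = [True] * len(run)   # short burst treated as an outlier
        cleaned.extend(run)
    return cleaned


powers = [0.0, 0.0, 0.9, 0.0, 0.8, 0.8, 0.8, 0.0]
print(mark_silent(powers, tau=0.1))
```

In this example the isolated loud window at index 2 is re-marked as silent, while the run of three loud windows at indices 4 to 6 is kept as non-silent.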
[0041] Finally, the second descriptor, related to silence, for distinguishing sound is the
proportion of 'silent' windows over the number of windows K of the signal subject
to examination:
$$R_{silent} = \frac{K_{silent}}{K}$$
where $K_{silent}$ is the number of windows marked 'silent' and K is the number of windows
considered.
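As an illustration, the second descriptor can be computed from a list of per-window flags in which True means 'silent'. The function name silence_ratio and the flag representation are assumptions made for the example.

```python
def silence_ratio(silent_flags):
    """Proportion of windows marked 'silent' among all K windows."""
    return sum(silent_flags) / len(silent_flags)


flags = [True, True, False, False, False, True, False, False]
print(silence_ratio(flags))   # 3 silent windows out of 8
```

A value close to 0 indicates that almost no window is silent, which is the behaviour expected from broadcast sources that tend to avoid silence.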
[0042] In a variation, the detection of the silent windows may occur before the calculation
of the descriptor CV(P) explained above, and this descriptor may be computed only
on the windows which are marked as 'non silent'.
[0043] For a natural sound, both descriptors described above are expected to have a high
value: a high variation of the power for the first and a large proportion of silent
windows for the second. The opposite is expected for an artificial sound (power
constantly high, nearly no silent windows). This is used in a classification system
as set out hereafter.
[0044] To classify the sound, different possibilities exist. A first possibility is to take
the first and second descriptors as input to a supervised classifier that is trained
to separate the natural sound from the artificial sound. The supervised classifier
may for instance be based on a decision tree, using two thresholds corresponding to
the two descriptors.
[0045] A second possibility is to use a set of conditions such as:
IF descriptor1 > threshold1 THEN sound is artificial
ELSE IF descriptor2 > threshold2 THEN sound is natural
ELSE IF descriptor1 > threshold3 AND descriptor2 < threshold4 THEN sound is artificial
ELSE IF descriptor1 < threshold5 AND descriptor2 > threshold6 THEN sound is natural
where descriptor1 is the descriptor related to loudness and descriptor2 is the descriptor
related to silence, and the various thresholds are thresholds in the model used for
the determination.
[0046] Naturally, there are many ways of expressing the conditions, using different thresholds
that in addition may depend on many things such as locality and equipment.
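One simple way of expressing such conditions, given here only as a sketch, follows the criterion of the first aspect: the sound is deemed artificial when the descriptor related to loudness is below a first threshold or the descriptor related to silence is below a second threshold. The function name classify and the threshold values 0.5 and 0.1 are hypothetical; suitable values would have to be obtained from a trained model or from calibration for the locality and equipment.

```python
def classify(cv_power, silence_ratio, cv_threshold=0.5, silence_threshold=0.1):
    """Label a sound from its two descriptors.

    cv_threshold and silence_threshold are hypothetical example values.
    """
    if cv_power < cv_threshold or silence_ratio < silence_threshold:
        return "artificial"
    return "natural"


print(classify(cv_power=0.2, silence_ratio=0.02))  # compressed, no silence
print(classify(cv_power=1.4, silence_ratio=0.45))  # dynamic, many pauses
```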
[0047] Figure 4 illustrates a device for audio distinction 400 according to the present
principles. The device 400 comprises at least one hardware processing unit ("processor")
410 configured to execute instructions of a first software program and to process
audio for distinction, as described herein. The device 400 further comprises at least
one memory 420 (for example ROM, RAM and Flash or a combination thereof) configured
to store the software program and data required to distinguish sound. The device 400
can also comprise at least one user communications interface ("User I/O") 430 for
interfacing with a user.
[0048] The device 400 further comprises an input interface 440 and an output interface 450.
The input interface 440 is configured to obtain audio for distinguishing; the input
interface 440 can be adapted to capture audio, for example a microphone, but it can
also be an interface adapted to receive captured audio. The output interface 450 is
configured to output information about distinguished audio - is it natural or artificial
sound - for example for presentation on a screen or by transfer to a further device.
[0049] A non-transitory, computer-readable storage medium 460 includes a computer program
with instructions that, when executed by the processor 410, perform the methods described
herein.
[0050] The processor 410 can also be configured to use the distinction to determine user
activity as described in the background part of the description.
[0051] The device 400 is preferably implemented as a single device such as a gateway, but
its functionality can also be distributed over a plurality of devices.
[0052] In some cases, the processor 410 may have access to other data and use this data
to determine that the sound has been incorrectly classified, for example in case the
sound was classified as natural and the data originates from another device and indicates
that artificial sound is indeed rendered in the environment where the processor 410
is located. If this occurs regularly, it could mean that the classification model
used by the processor 410 is not accurate enough. In this case, the device 400 can
send anonymized descriptors that caused the incorrect classification to a server,
so that the global model can be adapted to these descriptors (i.e. recomputed with
those new inputs). The global model can then be distributed to the individual devices.
In such an implementation, a stream processing big-data infrastructure such as Storm
or Spark is particularly relevant.
[0053] Figure 5 illustrates a flowchart for a method of audio distinction according to the
present principles. In step S510 the device 400 obtains captured sound, either by
capturing it itself or receiving captured sound from another device. In step S520,
the processor 410 calculates power standard deviation, i.e., the first descriptor,
(as a measure of amplitude standard deviation) as already explained. In step S530,
the processor 410 calculates the silence level, i.e., the second descriptor, as already
described. Finally, in step S540, the processor uses the first and second descriptors
to determine if the captured sound is natural or artificial, as already described.
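The steps S520 to S540 can be put together in a short, non-limiting sketch operating on an already captured list of samples. All names (descriptors, is_artificial, tone), the threshold values, the 16 kHz sampling rate and the synthetic test signals are assumptions made for the example; the optional cleaning of short non-silent groups is omitted for brevity.

```python
import math


def descriptors(samples, w=1024, tau=0.01):
    """Return (coefficient of variation of power, silence ratio)."""
    hop = w // 2                                  # half-overlapping windows
    powers = []
    for start in range(0, len(samples) - w + 1, hop):
        window = samples[start:start + w]
        powers.append(math.sqrt(sum(s * s for s in window) / w))
    K = len(powers)
    mean = sum(powers) / K
    std = math.sqrt(sum((p - mean) ** 2 for p in powers) / K)
    cv = std / mean if mean > 0 else 0.0          # S520: first descriptor
    silent = sum(1 for p in powers if p < tau)
    return cv, silent / K                         # S530: second descriptor


def is_artificial(samples, cv_threshold=0.5, silence_threshold=0.1):
    """S540: hypothetical thresholds applied to the two descriptors."""
    cv, ratio = descriptors(samples)
    return cv < cv_threshold or ratio < silence_threshold


def tone(n, amplitude):
    """Synthetic 440 Hz tone sampled at 16 kHz, used only as test input."""
    return [amplitude * math.sin(2 * math.pi * 440 * i / 16000) for i in range(n)]


steady = tone(16384, 0.5)                                   # constant level
bursty = tone(4096, 0.8) + [0.0] * 8192 + tone(4096, 0.1)   # loud, pause, quiet
print(is_artificial(steady))
print(is_artificial(bursty))
```

The constant-level signal has almost no power variation and no silent windows, so it is labelled artificial, whereas the signal with a loud part, a long pause and a quiet part has both a high power variation and a high silence ratio, so it is labelled natural.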
[0054] The processor 410 can then for example output information on whether the sound is
natural or artificial through the output interface 450 or use this information internally
as input to other functions.
[0055] It will be appreciated that the present principles can provide a solution for audio
recognition that can enable:
- Respect of users' privacy since the sound can be distinguished in a device located
in the users' location rather than being sent to a device "in the cloud".
- A small footprint on the distinguishing device since it is sufficient to retain the
model, some variables and the present sound windows.
[0056] It should be understood that the elements shown in the figures may be implemented
in various forms of hardware, software or combinations thereof. Preferably, these
elements are implemented in a combination of hardware and software on one or more
appropriately programmed general-purpose devices, which may include a processor, memory
and input/output interfaces. Herein, the phrase "coupled" is defined to mean directly
connected to or indirectly connected with through one or more intermediate components.
Such intermediate components may include both hardware and software based components.
[0057] The present description illustrates the principles of the present disclosure. It
will thus be appreciated that those skilled in the art will be able to devise various
arrangements that, although not explicitly described or shown herein, embody the principles
of the disclosure and are included within its scope.
[0058] All examples and conditional language recited herein are intended for educational
purposes to aid the reader in understanding the principles of the disclosure and the
concepts contributed by the inventor to furthering the art, and are to be construed
as being without limitation to such specifically recited examples and conditions.
[0059] Moreover, all statements herein reciting principles, aspects, and embodiments of
the disclosure, as well as specific examples thereof, are intended to encompass both
structural and functional equivalents thereof. Additionally, it is intended that such
equivalents include both currently known equivalents as well as equivalents developed
in the future, i.e., any elements developed that perform the same function, regardless
of structure.
[0060] Thus, for example, it will be appreciated by those skilled in the art that the block
diagrams presented herein represent conceptual views of illustrative circuitry embodying
the principles of the disclosure. Similarly, it will be appreciated that any flow
charts, flow diagrams, state transition diagrams, pseudocode, and the like represent
various processes which may be substantially represented in computer readable media
and so executed by a computer or processor, whether or not such computer or processor
is explicitly shown.
[0061] The functions of the various elements shown in the figures may be provided through
the use of dedicated hardware as well as hardware capable of executing software in
association with appropriate software. When provided by a processor, the functions
may be provided by a single dedicated processor, by a single shared processor, or
by a plurality of individual processors, some of which may be shared. Moreover, explicit
use of the term "processor" or "controller" should not be construed to refer exclusively
to hardware capable of executing software, and may implicitly include, without limitation,
digital signal processor (DSP) hardware, read only memory (ROM) for storing software,
random access memory (RAM), and non-volatile storage.
[0062] Other hardware, conventional and/or custom, may also be included. Similarly, any
switches shown in the figures are conceptual only. Their function may be carried out
through the operation of program logic, through dedicated logic, through the interaction
of program control and dedicated logic, or even manually, the particular technique
being selectable by the implementer as more specifically understood from the context.
[0063] In the claims hereof, any element expressed as a means for performing a specified
function is intended to encompass any way of performing that function including, for
example, a) a combination of circuit elements that performs that function or b) software
in any form, including, therefore, firmware, microcode or the like, combined with
appropriate circuitry for executing that software to perform the function. The disclosure
as defined by such claims resides in the fact that the functionalities provided by
the various recited means are combined and brought together in the manner which the
claims call for. It is thus regarded that any means that can provide those functionalities
are equivalent to those shown herein.
CLAIMS
1. A method for determining if sound is artificial, the method comprising at a device
(400):
obtaining (S510), by a hardware input interface (440) a signal corresponding to sound
in an environment;
calculating (S520, S530), by at least one hardware processor (410) from the signal
at least one of a descriptor related to loudness and a descriptor related to silence;
and
determining (S540), by the at least one hardware processor (410), that the sound is
artificial in case a variance of the descriptor related to loudness is below a first
threshold value or in case the descriptor related to silence is below a second threshold
value.
2. The method of claim 1, wherein the descriptor related to silence is a ratio of windows
of the signal that are silent to windows of the signal that are non-silent.
3. The method of claim 2, wherein the adjacent windows are overlapping.
4. The method of claim 2 or 3, wherein a window is silent in case its Root Mean Square
(RMS) power is below a third threshold.
5. The method of any one of claims 1 to 4, wherein the descriptor related to loudness
is a standard deviation for power of the signal.
6. A device (400) for determining if sound is artificial, comprising:
a hardware input interface (440) configured to obtain a signal corresponding to sound
in an environment; and
at least one hardware processor (410) configured to:
calculate from the signal at least one of a descriptor related to loudness and a descriptor
related to silence; and
determine that the sound is artificial in case a variance of the descriptor related
to loudness is below a first threshold value or in case the descriptor related to
silence is below a second threshold value.
7. The device of claim 6, wherein the descriptor related to silence is a ratio of windows
of the signal that are silent to windows of the signal that are non-silent.
8. The device of claim 7, wherein the adjacent windows are overlapping.
9. The device of claim 7 or 8, wherein a window is silent in case its Root Mean Square
(RMS) power is below a third threshold.
10. The device of any one of claims 6 to 9, wherein the descriptor related to loudness
is a standard deviation for power of the signal.
11. The device of any one of claims 6 to 10, wherein the input interface (440) is configured
to capture the sound.
12. The device of claim 11, wherein the input interface (440) comprises a microphone.
13. The device of any one of claims 6 to 12, further comprising an output interface (450)
for outputting information about whether the sound is natural or artificial.
14. A computer program comprising instructions that, when executed, cause at least one
hardware processor (410) to perform the method of any one of claims 1-5.
15. A non-transitory, computer-readable storage medium (460) including instructions that,
when executed, cause at least one hardware processor (410) to perform the method of
any one of claims 1-5.