Technical Field
[0001] Various example embodiments relate generally to methods and apparatus for defending
against adversarial attacks.
Background
[0002] Many processing systems operate based on sensor information. The use of machine learning
systems in these processing systems is increasing, particularly in sensitive applications
including user identification and verification, online shopping and interaction with
private data. One example is the use of machine learning systems to process an input
signal, such as a sampled audio signal, to provide user identification and verification.
A machine learning system, typically a trained neural network, processes an input
signal and produces a classification outcome, for instance a user identification.
[0003] It is now recognized that many machine learning systems are vulnerable to adversarial
attacks. An adversarial attack is a perturbation added to an original signal which
can cause a machine learning system to produce an incorrect classification outcome.
Often the perturbation is of a small amplitude compared to the original signal. In
the case of an original audio signal, the perturbation added to the original audio
signal may be inaudible to a user.
Summary
[0004] In one embodiment, an apparatus comprises means configured to perform removing a
portion from a first signal. The portion is selected at random from a plurality of
portions comprising the first signal. The means are further configured to perform
replacing the portion removed from the first signal with a replacement portion.
[0005] In one embodiment, a method comprises removing a portion from a first signal, the
portion selected at random from a plurality of portions comprising the first signal.
The method further comprises replacing the portion removed from the first signal with
a replacement portion.
[0006] In one embodiment, a computer readable storage medium stores instructions which,
when executed by a computer, cause the computer to perform a method. The method comprises
removing a portion from a first signal, the portion selected at random from a plurality
of portions comprising the first signal. The method further comprises replacing the
portion removed from the first signal with a replacement portion.
Brief Description of the Drawings
[0007] Example embodiments will now be described with reference to the accompanying drawings,
in which:
FIG. 1 illustrates an example environment in which some embodiments described herein
may be employed;
FIG. 2 shows a method according to some embodiments described herein;
FIG. 3 illustrates an example misalignment of two signals S1' and S2';
FIG. 4 shows an apparatus according to some embodiments described herein; and
FIG. 5 depicts a high-level block diagram of an apparatus 400 suitable for use in
performing functions described herein according to some embodiments.
Detailed Description
[0008] Example embodiments will now be described, including methods and apparatus that remove
a portion from a first signal. The portion is selected at random from a plurality
of portions comprising the first signal. The methods and apparatus also replace the
portion removed from the first signal with a replacement portion, which may mitigate
the vulnerability of machine learning systems to adversarial attacks.
[0009] Functional blocks denoted as "means configured to perform ..." (a certain function)
shall be understood as functional blocks comprising circuitry that is adapted for
performing or configured to perform a certain function. A means being configured to
perform a certain function does, hence, not imply that such means necessarily is performing
said function (at a given time instant). Moreover, any entity described herein as
"means", may correspond to or be implemented as "one or more modules", "one or more
devices", "one or more units", etc. When provided by a processor, the functions may
be provided by a single dedicated processor, by a single shared processor, or by a
plurality of individual processors, some of which may be shared. Moreover, explicit
use of the term "processor" or "controller" should not be construed to refer exclusively
to hardware capable of executing software, and may implicitly include, without limitation,
digital signal processor (DSP) hardware, network processor, application specific integrated
circuit (ASIC), field programmable gate array (FPGA), read only memory (ROM) for storing
software, random access memory (RAM), and non-volatile storage. Other hardware, conventional
or custom, may also be included. Their function may be carried out through the operation
of program logic, through dedicated logic, through the interaction of program control
and dedicated logic, or even manually, the particular technique being selectable by
the implementer as more specifically understood from the context.
[0010] FIG. 1 illustrates an example environment in which embodiments may be employed. FIG.
1 shows a user 10, a first device 20 and a second device 30. Each of the devices 20,
30 may have at least one microphone (not shown) which records sound incident thereon.
For instance, if the user 10 speaks, the sound of the user 10 speaking, U, will arrive
at the or each microphone in each device 20, 30 and may be recorded. These recordings
may be processed by a machine learning system, for example to categorise the identity
of the user 10. In some cases, the machine learning system may be provided on one
or more of the devices 20, 30, while in other cases, the machine learning system may
be in a system remote from the devices 20, 30 and with which the devices 20, 30 may
communicate.
[0011] Also shown in FIG. 1 is a perturbation source 40 which emits a perturbation signal
P, for instance as a perturbation sound. The perturbation signal P will also be incident
on the microphones of the devices 20, 30, and so will also be recorded. Thus, what
the machine learning system processes is a combination of the user speaking U and
the perturbation sound signal P. This leads to the possibility of an adversarial attack
involving a perturbation sound signal P that is recorded together with user speech
U to cause the machine learning system to produce an incorrect classification.
[0012] It will be appreciated by the skilled person that the embodiments are not restricted
to two devices, and in some embodiments only one of the devices 20, 30 may be present,
while in other embodiments more than two devices may be present.
[0013] Referring to FIG. 2, there is shown a plurality of signals S1, S2, ... Sn, referred
to collectively as signals S. In some embodiments each of the signals S may be a recording
of the same audio sound, for instance by using a plurality of microphones, one for
each signal S, which may be present in one or more of the devices 20, 30. Each signal
S may be a digitized audio signal comprising a plurality of samples.
[0014] Some embodiments relate to a method 100 as shown in FIG. 2. The method 100 comprises,
at 102, removing a portion from each signal S. Each signal S comprises a plurality
of portions, wherein the portion removed from each signal S is selected at random
from the plurality of portions. In some embodiments, the selection of the portion
to be removed may be performed independently and at random for each of the signals
S, with the result that with high likelihood different portions may be removed from
each of the signals S. In some embodiments, random sub-sampling may be used to remove
the portion from each of the signals S. As will be apparent to a skilled person, in
some embodiments more than one portion may be removed from each signal S. Further,
in some embodiments a different number of portions may be removed from each of the
signals S1...Sn. In some embodiments, each portion removed from each signal S comprises
at least one sample.
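The removal at 102 can be sketched as follows. This is a minimal illustration only, assuming each signal is a list of samples divided into consecutive, equally sized portions; the function name `remove_random_portions`, the `portion_len` parameter and the use of `None` to mark removed samples are hypothetical choices and are not taken from the embodiments.

```python
import random


def remove_random_portions(signal, portion_len=4, num_portions=1, rng=None):
    """Return (masked_signal, removed_indices).

    The signal is treated as consecutive portions of `portion_len` samples
    (an assumed partitioning). `num_portions` portions are chosen uniformly
    at random, and their samples are set to None to mark them as removed.
    """
    rng = rng or random.Random()
    n_portions = (len(signal) + portion_len - 1) // portion_len  # ceiling division
    chosen = rng.sample(range(n_portions), k=min(num_portions, n_portions))
    masked = list(signal)
    removed = []
    for p in chosen:
        start = p * portion_len
        stop = min(start + portion_len, len(signal))
        for i in range(start, stop):
            masked[i] = None
            removed.append(i)
    return masked, sorted(removed)


# The selection is made independently for each of the signals S1...Sn.
signals = [[float(i) for i in range(16)] for _ in range(3)]
rng = random.Random(0)
masked_signals = [remove_random_portions(s, portion_len=4, rng=rng)[0] for s in signals]
```

Because each call draws its own portion index, different portions are likely, though not guaranteed, to be removed from each signal.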
[0015] The method 100 further comprises, at 104, replacing each portion removed from each
signal S with a replacement portion. In some embodiments, data imputation techniques
may be used to create each replacement portion from the signal with the portion removed
therefrom. In some embodiments, generative up-sampling may be used to create each
replacement portion from the corresponding signal with the portion removed therefrom.
In other embodiments, other generative models may be used which characterise explicitly
(such as fully visible belief networks or variational auto-encoders) or implicitly
(such as generative stochastic networks or generative adversarial networks) the probabilistic
distribution of a signal with the portion removed therefrom to create the replacement
portion for that signal.
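As a rough illustration of the replacement at 104, the sketch below fills removed samples by linear interpolation between the nearest remaining samples. Linear interpolation is used here only as a very simple stand-in for a data imputation technique; it is not one of the generative models named above, and the function name `impute_linear` and the `None` marker for removed samples are assumptions made for the example.

```python
def impute_linear(masked):
    """Fill None entries by linear interpolation between the nearest known samples.

    Removed samples at the start or end of the signal, which have a known
    neighbour on one side only, are filled by copying that neighbour.
    """
    known = [i for i, v in enumerate(masked) if v is not None]
    if not known:
        raise ValueError("cannot impute a signal with no remaining samples")
    filled = list(masked)
    for i, v in enumerate(masked):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = masked[right]
        elif right is None:
            filled[i] = masked[left]
        else:
            w = (i - left) / (right - left)  # position between the two neighbours
            filled[i] = (1 - w) * masked[left] + w * masked[right]
    return filled


masked = [0.0, 1.0, None, None, 4.0, 5.0, None]
restored = impute_linear(masked)
```

The replacement portion is computed only from the signal with the portion removed, so any perturbation that was confined to the removed samples does not carry over into the restored signal.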
[0016] Randomly removing a portion from each signal S and replacing each portion with a
replacement portion may reduce susceptibility to adversarial attacks, for instance
from the perturbation source 40. An attacker does not control which portions are removed
from each signal and replaced. Because the portions are selected at random, an attacker
cannot tailor the perturbation signal P to account for the removed portions.
In principle, an attacker could try to make
the perturbation signal P as sparse as possible in the time domain as a means to minimise
the chances that non-zero components of the perturbation signal P are present in the
removed portion(s). However, this would in effect decrease the support of the perturbation
signal P and would necessarily result in an increase in the amplitude of the non-zero
components of perturbation signal P. The increased amplitude of the non-zero components
of perturbation signal P may be noticeable by the user, thereby neutralising the effectiveness
of the adversarial attack.
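The trade-off described above can be made concrete with a small numerical sketch. It rests on two simplifying assumptions that are not stated in the embodiments: the perturbation needs a fixed total energy to be effective, and that energy is spread equally over its non-zero samples. The function names are hypothetical.

```python
import math


def peak_amplitude(energy, support):
    """Amplitude of each non-zero sample when `energy` is spread over `support` samples."""
    return math.sqrt(energy / support)


def miss_probability(n_samples, support, n_removed):
    """Probability that `n_removed` samples, drawn at random without replacement
    from `n_samples`, avoid every one of the `support` non-zero samples."""
    return math.comb(n_samples - support, n_removed) / math.comb(n_samples, n_removed)


# Shrinking the support from 100 to 4 samples raises the chance of escaping
# removal, but multiplies the per-sample amplitude by a factor of 5.
dense = (peak_amplitude(1.0, 100), miss_probability(1000, 100, 10))
sparse = (peak_amplitude(1.0, 4), miss_probability(1000, 4, 10))
```

Under these assumptions, a sparser perturbation is less likely to be hit by the random removal but is louder where it is present, which is consistent with the reasoning in the paragraph above.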
[0017] In FIG. 2, each signal S1, S2, ..., Sn with a replacement portion is denoted as signal
S1', S2', ..., Sn', respectively, and denoted collectively as signals S' hereafter.
[0018] In some embodiments, the method 100 further comprises, at 106, aligning the signals
S'. In practical systems, each of the signals S may have been sampled from a different
microphone. Thus there may be variations in the start time when each of the signals
S was first sampled, variations in the schedulers of the processors used to sample
the microphones, and differences between the processor loads, each of which might
lead to misalignment of the signals S and thus the signals S'. Such misalignment might
deteriorate the quality of a combined signal generated from the signals S', and therefore
might decrease the overall performance achievable by a machine learning system that
processed such a combined signal.
[0019] FIG. 3 illustrates an example misalignment of two signals S1' and S2'. The signal
S1' comprises a sequence of samples 200 which includes a replacement portion comprising
replacement sample 210. The signal S2' comprises a sequence of samples 220 which includes
a replacement portion comprising replacement sample 230.
[0020] In the illustrated example shown in FIG. 3, sampling of the signal S2' started after
sampling of the signal S1'. Further, in the illustrated example shown in FIG. 3, there
is some misalignment of the samples 200, 220 forming the signals S1' and S2', respectively.
[0021] Returning now to FIG. 2, at 106 the signals S' are aligned in the time domain. In
some embodiments, techniques such as Profile Hidden Markov Models or Continuous Profile
Generative Models may be used to align the signals S'. Other suitable techniques known
to the skilled person may be used in other embodiments. In some embodiments, subsections
of the signals S' are aligned - that is, the signals S' are piece-wise aligned where
each piece comprises one or more samples in a sequence of samples.
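For illustration only, the sketch below aligns two signals by searching for the integer lag that maximises their cross-correlation. This is a deliberately simpler technique than the Profile Hidden Markov Models or Continuous Profile Generative Models mentioned above, and it handles only a single global time offset rather than piece-wise alignment. The function names `best_lag` and `shift` are hypothetical.

```python
def best_lag(reference, other, max_lag):
    """Return the lag in [-max_lag, max_lag] that maximises
    sum(reference[i] * other[i + lag]) over the overlapping samples."""
    def score(lag):
        total = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(other):
                total += r * other[j]
        return total
    return max(range(-max_lag, max_lag + 1), key=score)


def shift(signal, lag):
    """Return a signal of the same length with output[i] = signal[i + lag],
    padding with zeros where i + lag falls outside the signal."""
    n = len(signal)
    return [signal[i + lag] if 0 <= i + lag < n else 0.0 for i in range(n)]


s1 = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0]
s2 = [0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0]  # same pulse, two samples later
lag = best_lag(s1, s2, max_lag=3)
aligned_s2 = shift(s2, lag)
```

In this example the pulse in `s2` lags the pulse in `s1` by two samples, so shifting `s2` by the estimated lag brings the two pulses into register.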
[0022] In some embodiments, the method 100 further comprises, at 108, combining the signals
S' into a combined signal Sout. In embodiments where the signals S' are aligned, the
aligned signals S' are combined. In one embodiment, the combined signal Sout may be
an average of the signals S'; however, in other embodiments alternative transformations
may be used to combine the signals S'.
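A minimal sketch of the averaging embodiment at 108 is given below. It assumes the signals S' have already been aligned and are represented as lists of samples; truncating to the shortest signal is an assumption made for the example, and the function name `combine_by_average` is hypothetical.

```python
def combine_by_average(signals):
    """Return the sample-wise mean of the given aligned signals.

    Signals of unequal length are truncated to the shortest one
    (an assumption of this sketch).
    """
    if not signals:
        raise ValueError("at least one signal is required")
    length = min(len(s) for s in signals)
    return [sum(s[i] for s in signals) / len(signals) for i in range(length)]


s_out = combine_by_average([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
```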
[0023] Referring again to the illustrated example in FIG. 3, an example combined signal
Sout is shown. The combined signal Sout comprises a sequence of samples 240. The samples
240 are formed by combining the signals S1' and S2'. Dashed lines 'A' in FIG. 3 are
used to denote the samples 200, 220 in the signals S1' and S2' which have been aligned
and combined to form samples 240 in the combined signal Sout.
[0024] Combining the signals S' into the combined signal Sout may further reduce susceptibility
to adversarial attacks. At any given moment in time, the perturbation signal P might
be present in only some of the signals S' given random removal of some portions in
each signal, ambient noise and hardware limitations on most microphones. In addition,
due to distances between the physical locations of the microphones used to capture
the signals S, there may be time offsets in the position of the perturbation
signal P relative to the user signal U. Combining the signals S' into the combined
signal Sout can be considered as a form of smoothing/interpolation across the different
signals S', such that the perturbation signal P being present in only some of the
signals S' results in the perturbation signal P having a reduced amplitude in the
combined signal Sout, which may reduce the effectiveness of the perturbation signal
P.
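The amplitude reduction can be written out for the averaging embodiment under idealised assumptions that are introduced here only for illustration: the n signals S' are perfectly aligned, the user signal U is identical in all of them, and at a given time t the perturbation P is present, with equal amplitude, in only a subset M of m signals.

```latex
S_{\mathrm{out}}[t]
  = \frac{1}{n}\sum_{i=1}^{n} S_i'[t]
  = \frac{1}{n}\Bigl(\sum_{i=1}^{n} U[t] + \sum_{i \in M} P[t]\Bigr)
  = U[t] + \frac{m}{n}\,P[t],
  \qquad m = |M| \le n .
```

For example, with n = 4 signals and the perturbation present in only m = 1 of them at time t, the perturbation would appear in Sout at one quarter of its original amplitude while the user signal is unchanged.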
[0025] Referring now to FIG. 2, in some embodiments, the method 100 further comprises, at
110, sending the combined signal Sout to a machine learning system. As will be described
in more detail below, in some embodiments the method 100 may be implemented within
a machine learning system as a pre-processing method, in which case 110 may be considered
as sending the combined signal Sout from a pre-processor to the machine learning system.
[0026] In some embodiments, there may be only a first signal S1, in which case aligning
the signals S' at 106 and combining the signals S' at 108 may be omitted.
[0027] FIG. 4 depicts a high-level block diagram of an apparatus 300 suitable for use in
performing functions described herein according to some embodiments. The apparatus
will be described with reference to signals S1...Sn (not shown) as described above
in relation to FIG. 2, and again collectively referred to as signals S. The apparatus
300 may comprise at least one microphone 310, each of which may record one of the
signals S1...Sn. In some embodiments at least one of the signals S1... Sn may be received
by the apparatus 300 from another device.
[0028] The apparatus 300 comprises means 320 configured to perform removing a portion from
each of the signals S. The portion removed from each signal S is selected at random
from a plurality of portions comprising that signal. In some embodiments, the selection
of the portion to be removed may be performed independently and at random for each
of the signals S, with the result that with high likelihood different portions may
be removed from each of the signals S. In some embodiments, means 320 may comprise
a random sub-sampler used to remove the portion from each of the signals S. As will
be apparent to a skilled person, in some embodiments means 320 may be configured to
remove more than one portion from each signal S. Further, in some embodiments means
320 may be configured to remove a different number of portions from each of the signals
S1...Sn. In some embodiments, each portion removed from each signal S comprises at
least one sample.
[0029] The apparatus 300 further comprises means 330 configured to perform replacing each
portion removed from each signal S with a replacement portion. In some embodiments,
means 330 may be configured to use data imputation techniques to create each replacement
portion from the signal with the portion removed therefrom. In some embodiments, means
330 comprises a generative up-sampler that creates each replacement portion from the
corresponding signal with the portion removed therefrom. In other embodiments, means
330 comprises another generative system, such as a fully visible belief network, a
variational auto-encoder, a generative stochastic network and/or a generative adversarial
network. In like manner to the description above in relation to FIG. 2, each signal
S1, S2, ..., Sn with a replacement portion is denoted as signal S1', S2', ..., Sn'
(not shown), respectively, and denoted collectively as signals S' hereafter.
[0030] In some embodiments, the apparatus 300 further comprises means 340 configured to
perform aligning the signals S'. In some embodiments, means 340 is configured to perform
aligning the signals S' in the time domain using Profile Hidden Markov Models or Continuous
Profile Generative Models. Other suitable techniques known to the skilled person may
be used in other embodiments. In some embodiments, means 340 are configured to perform
aligning subsections of the signals S', such that the signals S' are piece-wise aligned
where each piece comprises one or more samples in a sequence of samples.
[0031] In some embodiments, the apparatus 300 further comprises means 350 configured to
perform combining the signals S' into a combined signal Sout (not shown). In embodiments
where the signals S' are aligned, the aligned signals S' are combined. In one embodiment,
means 350 is configured to perform combining the signals S' by averaging the signals
S'; however, in other embodiments means 350 may be configured to use alternative transformations
to combine the signals S'.
[0032] In some embodiments, the apparatus 300 further comprises means 360 configured to
perform sending the combined signal Sout to a machine learning system. In some embodiments
the means 360 is configured to send the combined signal Sout to a machine learning
system remote from the apparatus 300, for instance over a communications network (not
shown). In other embodiments, the apparatus 300 may include the machine learning system
and the means 320, 330, 340, 350 and 360 may be implemented as a pre-processor within
the machine learning system. Such an arrangement may neutralise the effectiveness
of an adversarial attack implemented using malicious software in a device 20, 30,
for instance malicious software that adds a perturbation signal P to a signal recorded
by a microphone in the device 20, 30.
[0033] In some embodiments, the apparatus 300 may comprise a single microphone 310 that
is configured to record a first signal S1. In such embodiments where there is only
the first signal S1, the means 340 and 350 may be omitted.
[0034] Embodiments described herein may allow applications based on audio interfaces with
wearable and mobile devices 20, 30 to be used more safely in the presence of adversarial
attacks. For example, the devices 20, 30 may be any combination of laptop computer,
tablet, phone or watch with one microphone each. A user using such devices to access
sensitive information will typically be required to perform a two-factor authentication
process. Such authentication is currently performed without the ease of an audio interface,
given the present level of insecurity of audio authentication. Rather than using
the current two-factor authentication approach, where for example a login on one device
causes a confirmation request to be sent to a second device, in embodiments described
herein the user would simply speak, and the microphones on the devices 20, 30 would be
activated and record the user speaking. Thus the user's voice may be used to provide
authentication, as the processing of the recorded signals described in embodiments herein
may enable user verification by a machine learning system to be performed safely against
adversarial attacks.
[0035] Embodiments described herein may also be employed in earbuds or headsets with a microphone,
and may provide additional security, for instance when used together with a personal
assistant software to place orders or interact with various types of sensitive information.
[0036] Referring now to FIG. 5, in some embodiments, the means 320, 330, 340, 350 and 360
comprise at least one processor 410 (e.g., a central processing unit (CPU) and/or
other suitable processor(s)) and at least one memory 420 (e.g., random access memory
(RAM), read only memory (ROM), or the like). The means 320, 330, 340, 350 and 360
further comprise computer program code 430 and various input/output devices 440 (e.g.,
a user input device (such as a keyboard, a keypad, a mouse, a microphone, a touch-sensitive
display or the like), a user output device 450 (such as a display, a speaker, or the
like), and storage devices (e.g., a tape drive, a floppy drive, a hard disk drive,
a compact disk drive, non-volatile memory or the like)). The computer program code
430 can be loaded into the memory 420 and executed by the processor 410 to implement
functions as discussed herein and, thus, computer program code 430 (including associated
data structures) can be stored on a computer readable storage medium, e.g., RAM memory,
magnetic or optical drive or diskette, or the like.
[0037] It will be appreciated that the functions depicted and described herein may be implemented
in software (e.g., via implementation of software on one or more processors, for executing
on a general purpose computer (e.g., via execution by one or more processors) so as
to implement a special purpose computer, or the like) and/or may be implemented in
hardware (e.g., using a general purpose computer, one or more application specific
integrated circuits (ASIC), and/or any other hardware equivalents).
[0038] A further embodiment is a computer program product comprising a computer readable
storage medium having computer readable program code embodied therein, the computer
readable program code being configured to implement one of the above methods when
being loaded on a computer, a processor, or a programmable hardware component. In
some embodiments, the computer readable storage medium is non-transitory.
[0039] A person of skill in the art would readily recognize that steps of various above-described
methods can be performed by programmed computers. Herein, some embodiments are also
intended to cover program storage devices, e.g., digital data storage media, which
are machine or computer readable and encode machine-executable or computer-executable
programs of instructions where said instructions perform some or all of the steps
of methods described herein. The program storage devices may be, e.g., digital memories,
magnetic storage media such as magnetic disks and magnetic tapes, hard drives, or
optically readable digital data storage media. The embodiments are also intended to
cover computers programmed to perform said steps of methods described herein or (field)
programmable logic arrays ((F)PLAs) or (field) programmable gate arrays ((F)PGAs),
programmed to perform said steps of the above-described methods.
[0040] It should be appreciated by those skilled in the art that any block diagrams herein
represent conceptual views of illustrative circuitry embodying the principles of the
invention. Similarly, it will be appreciated that any flow charts, flow diagrams,
state transition diagrams, pseudo code, and the like represent various processes which
may be substantially represented in computer readable medium and so executed by a
computer or processor, whether or not such computer or processor is explicitly shown.
[0041] While aspects of the present disclosure have been particularly shown and described
with reference to the embodiments above, it will be understood by those skilled in
the art that various additional embodiments may be contemplated by the modification
of the disclosed machines, systems and methods without departing from the scope of
what is disclosed. Such embodiments should be understood to fall within the scope
of the present disclosure as determined based upon the claims and any equivalents
thereof.
1. An apparatus comprising means configured to perform:
removing a portion from a first signal, the portion selected at random from a plurality
of portions comprising the first signal; and
replacing the portion removed from the first signal with a replacement portion.
2. The apparatus of claim 1, wherein the means are further configured to perform:
generating the replacement portion from the first signal with the portion removed.
3. The apparatus of claim 1 or 2 further comprising:
a microphone configured to produce the first signal.
4. The apparatus of claim 1, wherein the means are further configured to perform:
removing a portion from each of a plurality of signals;
replacing the portion removed from each signal with a corresponding replacement portion;
and
combining the plurality of signals having replacement portions to obtain a combined
signal.
5. The apparatus of claim 4, wherein each signal comprises a plurality of portions, and
the means are further configured to perform:
selecting each portion removed from each signal independently and at random from the
plurality of portions.
6. The apparatus of claim 4 or 5, wherein the means are further configured to perform:
generating each replacement portion from the corresponding signal having the portion
removed.
7. The apparatus of any of claims 4 to 6, wherein the means are further configured to perform:
aligning the signals prior to combining the signals.
8. The apparatus of any of claims 4 to 7, wherein the means are further configured to perform:
removing a plurality of portions from each of the plurality of signals.
9. The apparatus of any of claims 4 to 8, wherein the means are further configured to
perform:
sending the combined signal to a machine learning system as an input thereto.
10. The apparatus of any of claims 4 to 9, further comprising:
a plurality of microphones configured to produce the plurality of signals.
11. The apparatus of any preceding claim, wherein the means comprises:
at least one processor; and
at least one memory including computer program code, the at least one memory and computer
program code configured to, with the at least one processor, cause the performance
of the apparatus.
12. A method comprising:
removing a portion from a first signal, the portion selected at random from a plurality
of portions comprising the first signal; and
replacing the portion removed from the first signal with a replacement portion.
13. The method of claim 12, further comprising:
removing a portion from each of a plurality of signals, each portion selected independently
and at random from a plurality of portions comprising each signal;
replacing the portion removed from each signal with a corresponding replacement portion;
aligning the plurality of signals; and
combining the plurality of signals having replacement portions.
14. The method of claim 12 or 13, further comprising:
removing a plurality of portions from each of the plurality of signals.
15. A computer readable storage medium storing instructions which, when executed by a
computer, cause the computer to perform a method, the method comprising:
removing a portion from a first signal, the portion selected at random from a plurality
of portions comprising the first signal; and
replacing the portion removed from the first signal with a replacement portion.