FIELD
[0001] The presently disclosed subject matter relates to a training method for training
a generator neural network, a method to generate further training data for a machine
learnable model, a method to train a machine learnable model, a training system for
training a generator neural network, a generator system for a generator neural network,
and an autonomous apparatus and a computer readable medium.
BACKGROUND
[0002] Machine learnable models find a wide application in many fields of technology. For
example, in parts production a machine learnable model may classify a produced part
as faulty from a sensor reading of the part, e.g., an image taken with an image sensor.
Automated quality control has the potential to greatly reduce the percentage of faulty
parts produced, e.g., components and the like. An image sensor coupled to a machine
learnable model can eliminate almost all parts with visible defects.
[0003] For example, in autonomous driving a machine learnable model may classify an object
in the environment of the autonomous vehicle as another car, cyclist, pedestrian and
so on, e.g., from a sensor reading of the environment. The sensor reading may be obtained
with sensors such as an image sensor, LIDAR and so on. A controller of the vehicle
will use the classification in making driving decisions. For example, the car may
need to reduce speed if a pedestrian appears to be about to cross the road, while
there is no need to adjust the driving for a sign waiting at the side of the road, unless
the sign is classified as a traffic sign, in which case such a need may arise, and
so on.
[0004] To train or test a machine learnable model, e.g., comprising a neural network, training
data is needed. Such training data may be obtained by taking sensor measurements in
an environment that is similar to the one expected to be encountered in practice.
However, obtaining the right kind or the right amount of training data is sometimes
hard. For example, there may be too little training data for the complexity of a particular
machine learnable model, while obtaining additional training data is costly or even
impossible. Another problem is that getting enough training data of the right kind
is difficult. For example, in the case of an autonomous vehicle, such as a car, if
it is currently summer, then obtaining additional training data in a winter landscape
will not be possible until it is winter. Another problem is that dangerous situations,
e.g., crashes and near crashes, occur only seldom and are hard to enact artificially.
Between 2009 and 2015, a fleet of autonomous cars traveled 1.3 million miles and was
involved in 11 crashes (see the article
'How Many Miles of Driving Would It Take to Demonstrate Autonomous Vehicle Reliability?',
by Nidhi Kalra, and Susan M. Paddock). Although any crash is one too many, for testing and training purposes, the sensor
data of 11 crashes is not much.
[0005] Additional training data may be generated using a generator neural network. In particular,
generative adversarial networks (GANs) are a powerful tool for data synthesis, e.g.,
for generating natural looking images, as well as for learning feature representations.
A generator neural network may be configured to generate synthesized sensor data
that looks like measured sensor data. For example, a generator neural network may
be configured to generate synthesized sensor data from scratch, e.g., taking a noise
vector as input and producing the synthesized sensor data as output. For example,
a generator neural network may be configured to transform existing measured sensor
data from one domain to another, e.g., from summer to winter, or from LIDAR sensor
data to image sensor data, etc. A generator neural network may be configured to take
an additional input, e.g., a class label, indicating to the generator neural network
the type of synthesized sensor data that is desired, e.g., the time of year, the type
of car in the synthesized sensor data, and so on. Another application is to transfer
measured sensor data from one modality to another, e.g., from radar to image data
or vice versa. This may make complementary data from different origins interchangeable.
[0006] A preferred way to train a generator neural network is together with a discriminator
neural network, e.g., in a so-called generative adversarial network (GAN). A GAN comprises
at least two neural networks: a generator, and a discriminator. The two networks are
trained jointly through adversarial training, e.g., the two networks compete in a
game, where the discriminator is optimized to distinguish original images from the
training set from images generated by the generator network. Conversely, the generator
is trained to make its generated images less and less distinguishable from those contained
in the training dataset. In their standard form, GANs generate independent samples
from their model of the training data distribution, and need no label information
for training or generation. For example, see the article
'Generative Adversarial Nets' by Ian J. Goodfellow, et al., included herein by reference. There may be multiple generator networks in
a GAN.
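For reference, the adversarial game described in the cited article is commonly written as the following objective. Here D denotes the discriminator, G the generator, p_data the training data distribution and p_z the noise distribution; these symbols follow the cited article and are not otherwise used in this document.

\[
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
\]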
[0007] An important subclass of GANs are conditional GANs (cGANs), which receive one or
more additional inputs to the discriminator network and/or the generator network,
and can thus generate new data conditioned on user-specified information. The basic
use case is to provide a class label, and then generate images of only that particular
class. Recently cGANs have been successfully used to generate photo-realistic synthetic
images of different classes; see, e.g., the article
'Large scale GAN training for high fidelity natural image synthesis', by Andrew Brock, et al. cGANs have also been successfully employed for image-to-image and label-to-image
translation tasks, aiming to generate realistic looking images while conditioning
on dense information such as images or label maps and making use of some prior information
about the scene; see, e.g., the article
'Semantic Image Synthesis with Spatially-Adaptive Normalization', by Taesung Park, et al.
SUMMARY
[0008] It would be advantageous to have an improved training method for training a generator
neural network configured to generate synthesized sensor data, and such methods are
defined in the claims.
[0009] For example, the method may comprise training the generator neural network together
with a discriminator neural network. The training may comprise generating synthesized
sensor data using the generator neural network. For example, the discriminator network
may be optimized to distinguish between measured sensor data and the synthesized sensor
data, while the generator network may be optimized to generate synthesized sensor
data which is indistinguishable from measured sensor data by the discriminator network.
[0010] Interestingly, the discriminator network may be configured to receive discriminator
input data which may comprise synthesized sensor data and/or measured sensor data,
and to produce as output localized distinguishing information. For example, the discriminator
input data may be either synthesized sensor data or measured sensor data, but the
discriminator input data may also be a combination of the two. The localized distinguishing
information may indicate for a plurality of sub-sets of the discriminator input data
if said sub-set corresponds to measured sensor data or to synthesized sensor data.
[0011] In a conventional GAN framework, the discriminator may be configured to output one
global decision about the discriminator input data, e.g., if it is measured or synthesized
sensor data, e.g., if it belongs to the real or fake class. The inventors found that
this global feedback information may be misleading to the generator: often the synthetic
sample looks partially real; however, if the discriminator classifies the whole sample
as fake, the generator would get a noisy signal that all parts of the sample are fake.
This may significantly slow down the training of the generator and may even lead to
a suboptimal solution during training.
[0012] In an embodiment, the discriminator is configured to provide localized distinguishing
information as output. The localized distinguishing information provides local feedback
to the generator neural network; for example, a particular sub-set of the discriminator
input looks real, e.g., appears to be drawn from the same distribution as the measured
sensor data, while another sub-set of the discriminator input looks fake, e.g., does
not appear to be drawn from the same distribution as the measured sensor data.
[0013] Thus the generator network training uses a training signal which comprises more information,
and thus helps training. On the other hand, a discriminator can find that part of
its input does not look real, even if the overall impression is that of measured sensor
data. Thus, the generator task of fooling the discriminator becomes more challenging
which improves the quality of generated samples.
[0014] The measured sensor data may be obtained from one or more sensors, some of which
may be spatial sensors, e.g., image sensors, LIDAR and so on. For example, the measured
sensor data may comprise a measured image obtained from an image sensor; and the synthesized
sensor data may comprise a synthesized image. Typically, the synthesized sensor data
and the measured sensor data have the same resolution. For example, the sensor data
may include sensor signals from sensors such as, e.g., video, radar, LIDAR, ultrasonic,
motion, imaging camera, and so on. Sensor data may comprise sensor signals from
multiple sensor modalities, e.g., image and radar sensor signals. The measured sensor
data may comprise audio data.
[0015] The measured and synthesized sensor data may comprise a plurality of values indicating
a plurality of sensor values, for example, pixels in image-like data, or samples
in audio-like data, and so on. The plurality of sub-sets may correspond to the plurality
of pixels. For example, the localized distinguishing information may indicate for
the plurality of values if said value corresponds to measured sensor data or synthesized
sensor data. For example, in case of an image, the localized distinguishing information
may comprise per-pixel information indicating whether the pixel appears to be measured
sensor data or not. The localized distinguishing information may have a lower resolution
than the discriminator input though. For example, the localized distinguishing information
may indicate for every n pixels, e.g., every 4 or 16 pixels etc. For example, the localized distinguishing
information may correspond to a checkerboard pattern on an image. In addition to the
localized distinguishing information, a global decision may also be output.
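As an illustration of localized distinguishing information at a lower resolution than the discriminator input, the following minimal sketch reduces a per-pixel target map to one value per block of pixels. Averaging over square blocks is an assumption made for this example only; the function name `block_targets` and the block size are hypothetical.

```python
def block_targets(pixel_map, block):
    """Average a per-pixel 0/1 map over non-overlapping block x block tiles.

    pixel_map -- 2D list, 1 = measured pixel, 0 = synthesized pixel
    block     -- tile edge length; height and width must be divisible by it
    """
    height, width = len(pixel_map), len(pixel_map[0])
    if height % block or width % block:
        raise ValueError("map size must be divisible by the block size")
    out = []
    for r0 in range(0, height, block):
        row = []
        for c0 in range(0, width, block):
            cells = [pixel_map[r][c]
                     for r in range(r0, r0 + block)
                     for c in range(c0, c0 + block)]
            row.append(sum(cells) / len(cells))  # fraction of measured pixels
        out.append(row)
    return out

pixel_map = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
coarse = block_targets(pixel_map, 2)  # one value for every 4 pixels
```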
[0016] For example, the discriminator network may be optimized for the localized distinguishing
information correctly indicating for the plurality of sub-sets of the discriminator
input data if said sub-set corresponds to measured sensor data or synthesized sensor
data. The localized distinguishing information may, e.g., comprise a plurality of
values corresponding to the plurality of sub-sets. For example, a value of 0 may indicate
synthesized sensor data while a value of 1 may indicate measured sensor data.
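One way to optimize the discriminator for such per-sub-set values is a binary cross-entropy averaged over all sub-sets. The sketch below assumes per-pixel sub-sets and discriminator outputs that are probabilities; the loss choice and the name `per_pixel_bce` are assumptions for illustration, not a statement of the loss used in any particular embodiment.

```python
import math

def per_pixel_bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a predicted map and a target map.

    pred   -- 2D list of probabilities; values near 1 mean "looks measured"
    target -- 2D list of labels; 1 = measured, 0 = synthesized
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count

# A 2x2 input whose left column is measured and right column is synthesized.
target_map = [[1, 0], [1, 0]]
good_pred = [[0.9, 0.1], [0.9, 0.1]]  # agrees with the target map
bad_pred = [[0.1, 0.9], [0.1, 0.9]]   # contradicts the target map
loss_good = per_pixel_bce(good_pred, target_map)
loss_bad = per_pixel_bce(bad_pred, target_map)
```

A prediction that agrees with the target map yields the lower loss, so minimizing this quantity pushes the localized output towards the correct per-pixel indication.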
[0017] For example, the discriminator network may sometimes be provided with a measured
sensor data, in which case all localized distinguishing information for all of the
plurality of sub-sets should indicate that the part is measured sensor data. For example,
the discriminator network may sometimes be provided with a synthesized sensor data,
in which case all localized distinguishing information for all of the plurality of
sub-sets should indicate that the part is synthesized sensor data.
[0018] Training for the discriminator network to provide strong localized distinguishing
information may be improved by also training the discriminator network on discriminator
inputs that are composed of part measured sensor data and part synthesized sensor
data. For example, one may obtain composed sensor data by obtaining part of the composed
sensor data from a measured sensor data and obtaining part, e.g., the remaining part,
from synthesized sensor data. One may combine multiple measured sensor data and/or
synthesized sensor data. The output of the discriminator network when applied on the
composed sensor data should correspond with the composing; for example, the localized
distinguishing information should indicate which part, e.g., pixels or samples, of
the discriminator input was obtained from measured sensor data and which part from
synthesized sensor data. The composed sensor data may be generated randomly.
[0019] In an embodiment, a part taken from measured or synthesized data is a connected and/or
convex part, e.g., a rectangle. For example, measured or synthesized data may be combined
with a so-called CutMix operation.
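The composition of measured and synthesized data with a rectangular part may be sketched as follows. This is a simplified illustration: the rectangle corners are drawn uniformly at random, which is an assumption of this example, and the names `rectangular_mask` and `cutmix` are hypothetical.

```python
import random

def rectangular_mask(height, width, rng):
    """Return a 0/1 mask with one random axis-aligned rectangle set to 1."""
    top = rng.randrange(height)
    left = rng.randrange(width)
    bottom = rng.randrange(top + 1, height + 1)  # exclusive bound
    right = rng.randrange(left + 1, width + 1)   # exclusive bound
    return [[1 if top <= r < bottom and left <= c < right else 0
             for c in range(width)] for r in range(height)]

def cutmix(measured, synthesized, mask):
    """Take values from `measured` where mask == 1, else from `synthesized`."""
    return [[m if k == 1 else s for m, s, k in zip(m_row, s_row, k_row)]
            for m_row, s_row, k_row in zip(measured, synthesized, mask)]

rng = random.Random(0)
measured = [[1.0] * 4 for _ in range(4)]      # stand-in for a measured image
synthesized = [[-1.0] * 4 for _ in range(4)]  # stand-in for a generated image
mask = rectangular_mask(4, 4, rng)
composed = cutmix(measured, synthesized, mask)
# The mask doubles as the target for the localized distinguishing information.
```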
[0020] Training the discriminator on composed sensor data causes a consistency regularization,
encouraging the encoder-decoder discriminator to focus more on semantic and structural
changes between real and fake images and to attend less to domain-preserving perturbations.
Moreover, it also helps to improve the localization ability of the decoder. This improves
the discriminator training, further enhancing the quality of generated samples.
[0021] In case composed sensor data is used, it is beneficial to output global distinguishing
information as well, e.g., indicating the proportion of the discriminator input data
that corresponds to measured sensor data. The global distinguishing information may
be trained from the correct proportion in the composed image.
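For example, the target for the global distinguishing information of a composed input may be computed from the composition mask. The following sketch assumes a 0/1 mask in which 1 marks values taken from measured sensor data; the name `measured_proportion` is hypothetical.

```python
def measured_proportion(mask):
    """Fraction of mask entries equal to 1, i.e., taken from measured data."""
    flat = [v for row in mask for v in row]
    return sum(flat) / len(flat)

mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
global_target = measured_proportion(mask)  # 4 of 16 values are measured
```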
[0022] Training the discriminator and/or generator network may use conventional training
techniques such as backpropagation, e.g., using techniques used in GAN training, e.g.,
using an optimizer such as Adam. The GAN may be a CycleGAN, but this is not necessary.
[0023] In an embodiment, the generator network may be configured for a domain translation
task. For example, the generator network may be configured to receive measured sensor
data from a first domain as input, and the generator network may be trained to
generate synthesized sensor data from a second domain. This can be used for many purposes.
For example,
- the first and second domain correspond to a different time of day and/or of the year,
and/or
- the first and second domain indicate a type of environment of a vehicle, and/or
- the first and second domain indicate a type of sensor data, and/or
- the first and second domain indicate an occlusion or desocclusion.
[0024] For example, to test a machine learnable model on hard to obtain test data, e.g.,
sensor data corresponding with dangerous situations, e.g., crashes and near crashes,
the generator may be applied to an example of the test data, and transfer it to a
different domain. For example, types of cars may be changed, time of day or time of
year may be changed, etc. Thus, measured sensor data obtained during a near collision,
say around noon in spring, may be converted to synthesized sensor data corresponding
to an evening in fall, yet still show a near collision. Using the synthesized sensor
data the machine learnable model may be tested for a wider range of near-collisions,
thus improving the safety of the autonomous apparatus in which the machine learnable
model is used.
[0025] In an embodiment, the training set comprises ground-truth class-labels for the measured
sensor data. A class-label may be provided as an additional input to the discriminator
network. For example, the discriminator network may be a conditional network receiving
the class label as input. Typically, a class label is also provided as an additional
input to the generator network, e.g., to indicate to the generator network to generate
synthesized sensor data according to the class label. The latter is not necessary
though, for example, a conditional discriminator network may be combined with multiple
unconditional generator networks.
[0026] The class label indicates a class of the discriminator input data. The discriminator
neural network may be optimized to distinguish if the discriminator input data corresponds
to the class. For example, the discriminator network may be trained to distinguish
between measured sensor data with the correct class label on the one hand and synthesized
sensor data or measured sensor data with an incorrect class label on the other hand.
This may also be indicated in the localized distinguishing information, e.g., per-pixel.
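The target for such a conditional discriminator may be sketched as follows, assuming per-pixel localized distinguishing information and a single class label per input. The function name `conditional_target_map` and the string labels are hypothetical and serve only as illustration.

```python
def conditional_target_map(is_measured, provided_label, true_label, height, width):
    """Per-pixel target: 1 only for measured data paired with its correct label."""
    is_real = is_measured and provided_label == true_label
    value = 1 if is_real else 0
    return [[value] * width for _ in range(height)]

# Measured data with the correct label is the only case treated as real.
real_correct = conditional_target_map(True, "winter", "winter", 2, 3)
# Measured data with an incorrect label is treated like synthesized data.
real_wrong = conditional_target_map(True, "summer", "winter", 2, 3)
# Synthesized data is never treated as real, whatever its label.
fake = conditional_target_map(False, "winter", "winter", 2, 3)
```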
[0027] The domain translation and data synthesis tasks with the encoder-decoder discriminator
in principle can be performed between any sensor signals. The proposed framework can
be used for data augmentation as well as domain transfer tasks. The generated samples
can be then used for training any data-driven method.
[0028] A class label may also be used for a generator network configured for a domain translation
task. For example, in an embodiment, a class-label may indicate a transformation goal
to the generator network. There may be a plurality of transformation goals, e.g.,
corresponding to a plurality of domains. The training data may be labeled with a domain
of the plurality of domains. The generator network may be configured to transform
measured sensor data to a domain according to the transformation goal. The discriminator
network may be configured to determine if the input sensor data satisfies the domain
according to the transformation goal.
[0029] In an embodiment, a transformation goal may comprise a time difference, the training
data being labeled with a timestamp. The generator network may be configured to transform
measured sensor data from a first timestamp to a second timestamp according to the
time difference. The discriminator network may be configured to receive as input a
first sensor data, a second sensor data and a time difference and to determine if
the first sensor data and the second sensor data satisfy the time difference. Any
one of the first and second sensor data may be synthesized data in which case, the
discriminator network may be trained to reject the images.
[0030] An interesting application of sensor data translation is occlusion and desocclusion.
The class label, e.g., a transformation goal may indicate an object which is to be
occluded, e.g., moved behind another object, or to be desoccluded, e.g., moved in
front of another object. For example, a pedestrian may be moved in front of or behind
a tree; a cyclist in front of or behind a car, and so on. The discriminator network may
be trained to verify if the object is indeed occluded or desoccluded. The class label
in this case may be a map indicating the object to be occluded/desoccluded. In an
embodiment, the generator network and the discriminator network receive data indicating an
object in the sensor data, and an indication of whether the object is to be occluded or desoccluded.
[0031] In an embodiment, the generator network and/or the discriminator network comprise
one or more neurons receiving at least part of the sensor data and optionally at least
part of the class label, e.g., transformation goal. For example, the generator network
and/or the discriminator network may be arranged to receive multiple channels as input,
at least one of the channels encoding the sensor data and/or noise data; optionally
at least one of the channels may encode for a class label or transformation goal.
For example, the generator network and/or the discriminator network may comprise multiple
layers.
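One possible way to encode a class label in additional input channels is to append one constant channel per class, filled with 1 for the selected class and 0 otherwise. This one-hot channel encoding is an assumption of the sketch below, and the name `add_label_channels` is hypothetical.

```python
def add_label_channels(sensor_channels, label, num_classes):
    """Append one constant channel per class to a list of 2D sensor channels.

    sensor_channels -- list of 2D lists, all with the same height and width
    label           -- integer class index in range(num_classes)
    """
    height = len(sensor_channels[0])
    width = len(sensor_channels[0][0])
    label_channels = [
        [[1.0 if k == label else 0.0] * width for _ in range(height)]
        for k in range(num_classes)
    ]
    return sensor_channels + label_channels

image = [[[0.5, 0.2], [0.1, 0.9]]]  # one sensor channel of size 2x2
network_input = add_label_channels(image, label=2, num_classes=3)
```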
[0032] The discriminator neural network may comprise an encoder network followed by a decoder
network. The encoder network may be configured to receive as input the discriminator
input data, and the decoder network is configured to receive as input the encoder
network output and to produce as output the localized distinguishing information.
Between the encoder network and decoder network there may be a bottleneck. The bottleneck
may foster correct encoding by the encoder network. For example, the encoder network
may be configured to produce the global distinguishing information as output. Training
for the global distinguishing information thus causes the encoder network to improve
the correct learning of encoding of the discriminator network input. For example,
the encoder network may be configured to down-sample the encoder input to arrive at
the encoding, e.g., the global distinguishing information.
[0033] The decoder network may receive as input the output of the encoder network. The output
of the encoder network may be the global distinguishing information. The decoder network
may be configured to produce localized distinguishing information which indicates
which parts of the discriminator input were real and which were synthesized. For
example, the decoder network may be configured to up-sample the decoder input, which
may comprise the encoder output.
[0034] There may be multiple skip-connections from layers in the encoder network to layers
in the decoder network. For example, a skip-connection may provide information
that allows the global distinguishing information to be up-scaled to localized distinguishing
information.
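The data flow through an encoder, a bottleneck, a decoder and skip-connections may be illustrated with the following toy sketch on one-dimensional data. It uses fixed averaging and repetition in place of learned layers, so it only shows how a global output and a localized output of full resolution can arise from one pass; all function names are hypothetical and the operations are assumptions of this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def downsample(signal):
    """Halve the length by averaging neighbouring pairs (stand-in for an encoder layer)."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]

def upsample(signal):
    """Double the length by repeating each value (stand-in for a decoder layer)."""
    return [v for v in signal for _ in range(2)]

def discriminate(x):
    """Return (global_score, per_sample_scores); len(x) must be divisible by 4."""
    # Encoder: progressively down-sample, keeping intermediate features.
    e0 = list(x)          # features at input resolution (identity here)
    e1 = downsample(e0)   # half resolution
    e2 = downsample(e1)   # quarter resolution, the bottleneck
    global_score = sigmoid(sum(e2) / len(e2))  # one decision for the whole input
    # Decoder: up-sample and merge with encoder features via skip-connections.
    d1 = [u + s for u, s in zip(upsample(e2), e1)]
    d0 = [u + s for u, s in zip(upsample(d1), e0)]
    return global_score, [sigmoid(v) for v in d0]

# First half positive ("looks measured"), second half negative ("looks synthesized").
x = [2.0] * 4 + [-2.0] * 4
global_score, local_scores = discriminate(x)
```

In this toy input the global score is undecided, while the per-sample scores still separate the two halves, which is the kind of localized feedback that a single global decision cannot provide.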
[0035] In an embodiment, the discriminator network is a U-net. A U-net is conventionally
used for image segmentation, for example, to segment organs in a medical image. By
comprising a U-net in the discriminator in the field of data generation, the U-net
may be trained to indicate which parts of its input are measured and which are synthesized
sensor data. Interestingly, a conventional U-net may be adapted so that the encoder
part provides a global output. This provides a training signal which may be used for
training of the encoder part.
[0036] The method of training a generator network may be used in a method to generate further
training data for a machine learnable model. For example, the machine learnable model
may be a classifier. For example, the machine learnable model may be configured to
receive measured sensor data as input and to generate a classification of the measured
sensor data as output. For example, the measured sensor data may be an image taken
with an image sensor. For example, the image may be of a produced part and the classification
may be if the part is defective. For example, the measured sensor data may be an image
taken in an environment of an autonomous apparatus and the classification may indicate
if there is a dangerous situation. The machine learnable model may also be a neural
network but this is not necessary. The machine learnable model may use other techniques,
e.g., SVM, random forests, and so on. The further training data may be used for training,
but may also or instead be used for testing.
[0037] For example, the method may comprise obtaining an initial training set for the machine
learnable model, the initial training set comprising measured sensor data obtained
from a sensor, and training a generator network from the initial training set using
an embodiment of the training method. The trained generator network may be applied
to generate the further training data. The machine learnable model may then be
trained and/or tested at least on the further training
data.
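The steps of this paragraph may be sketched as follows, assuming a generator that maps a noise vector to synthesized sensor data. The function `toy_generator` is a hypothetical stand-in for a trained generator network, and the placeholder values in the initial training set are invented for illustration.

```python
import math
import random

def toy_generator(noise):
    """Hypothetical stand-in for a trained generator: noise vector -> sensor data."""
    return [math.tanh(v) for v in noise]

def generate_further_training_data(generator, num_samples, noise_dim, rng):
    """Draw noise vectors and map each of them through the generator."""
    samples = []
    for _ in range(num_samples):
        noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
        samples.append(generator(noise))
    return samples

rng = random.Random(0)
initial_training_set = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]]  # placeholder measured data
further = generate_further_training_data(toy_generator, 5, 3, rng)
extended_training_set = initial_training_set + further
```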
[0038] A further aspect concerns a training system for training a generator neural network
configured to generate synthesized sensor data. A further aspect concerns a generator
system for a generator neural network arranged to generate synthesized sensor data.
A further aspect concerns an autonomous apparatus, e.g., an autonomous vehicle. For
example, the autonomous apparatus may be a computer-controlled machine, such as a
robot, a vehicle, a domestic appliance, a power tool, a manufacturing machine.
[0039] Embodiments of the methods and/or systems may be performed on one or more electronic
devices. For example, the electronic device may be a computer.
[0040] An embodiment of the methods may be implemented on a computer as a computer implemented
method, or in dedicated hardware, or in a combination of both. Executable code for
an embodiment of the method may be stored on a computer program product. Examples
of computer program products include memory devices, optical storage devices, integrated
circuits, servers, online software, etc. Preferably, the computer program product
comprises non-transitory program code stored on a computer readable medium for performing
an embodiment of the method when said program product is executed on a computer.
[0041] In an embodiment, the computer program comprises computer program code adapted to
perform all or part of the steps of an embodiment of the method when the computer
program is run on a computer. Preferably, the computer program is embodied on a computer
readable medium.
[0042] Another aspect of the presently disclosed subject matter is a method of making the
computer program available for downloading.
BRIEF DESCRIPTION OF THE DRAWINGS
[0043] Further details, aspects, and embodiments will be described, by way of example only,
with reference to the drawings. Elements in the figures are illustrated for simplicity
and clarity and have not necessarily been drawn to scale. In the figures, elements
which correspond to elements already described may have the same reference numerals.
In the drawings,
Figure 1a schematically shows an example of an embodiment of a generator neural network
and of a discriminator neural network,
Figure 1b schematically shows an example of an embodiment of a generator neural network,
Figure 1c schematically shows an example of an embodiment of a discriminator neural
network,
Figure 2a schematically shows an example of an embodiment of a training system,
Figure 2b schematically shows an example of an embodiment of a generator system,
Figure 2c schematically shows an example of an embodiment of a training system,
Figure 3 schematically shows an example of an embodiment of a training method,
Figure 4 schematically shows an example of an embodiment of a training system,
Figure 5 schematically shows examples of an embodiment of data in an embodiment of
a training method,
Figure 6 schematically shows examples of an embodiment of data in an embodiment of
a training method,
Figure 7a schematically shows a computer readable medium having a writable part comprising
a computer program according to an embodiment,
Figure 7b schematically shows a representation of a processor system according to
an embodiment,
Figure 8 schematically shows an example of an embodiment of a training system.
List of Reference Numerals in figures 1a-2c, 4-8:
[0044] The following list of references and abbreviations is provided for facilitating the
interpretation of the drawings and shall not be construed as limiting the claims.
- 100: a Generative Adversarial Network (GAN)
- 110: a generator neural network
- 120: an encoder part
- 130: a processing part
- 140: a decoder part
- 141: synthesized sensor data
- 151: generator neural network input
- 152: a class-label
- 160: a discriminator neural network
- 161: a discriminator neural network input
- 162: a class label
- 163: localized distinguishing information
- 164: global distinguishing information
- 172: an encoder part
- 174: a decoder part
- 175: skip connections
- 200: a training system
- 210: an optimizer
- 220: a generator unit
- 230: a discriminator unit
- 240: a training set storage
- 250: a generator system
- 252: an input unit
- 254: an output unit
- 260: a training system
- 263: a processor system
- 264: a memory
- 265: a communication interface
- 401: measured sensor data
- 402: synthesized sensor data
- 501: a measured sensor data
- 502: a synthesized sensor data
- 511, 512: a mask
- 521, 522: a target global distinguishing data
- 531, 532: a composed sensor data
- 541, 542: a localized distinguishing data
- 551, 552: a global distinguishing data
- 800: an environment
- 810: a car
- 810': an autonomous car
- 820: a sensor system
- 822: a controller
- 812: a pedestrian
- 830: a first training database
- 832: a second training database
- 840: a training system
- 842: a generator system
- 850: a machine learning system
- 852: a classifier
- 1000: a computer readable medium
- 1010: a writable part
- 1020: a computer program
- 1110: integrated circuit(s)
- 1120: a processing unit
- 1122: a memory
- 1124: a dedicated integrated circuit
- 1126: a communication element
- 1130: an interconnect
- 1140: a processor system
DETAILED DESCRIPTION OF EMBODIMENTS
[0045] While the presently disclosed subject matter is susceptible of embodiment in many
different forms, there are shown in the drawings and will herein be described in detail
one or more specific embodiments, with the understanding that the present disclosure
is to be considered as exemplary of the principles of the presently disclosed subject
matter and not intended to limit it to the specific embodiments shown and described.
[0046] In the following, for the sake of understanding, elements of embodiments are described
in operation. However, it will be apparent that the respective elements are arranged
to perform the functions being described as performed by them.
[0047] Further, the subject matter that is presently disclosed is not limited to the embodiments
only, but also includes every other combination of features described herein or recited
in mutually different dependent claims.
[0048] Figure 1a schematically shows an example of an embodiment of a generator neural network 110
and of a discriminator neural network 160. Generator neural network 110 and discriminator
neural network 160 are trained together as a GAN 100.
[0049] Generator neural network 110 is configured to receive a generator neural network
input 151 and to produce synthesized sensor data 141. The generator neural network
input 151 may comprise a random element, e.g., noise, e.g., a noise vector, which
may be used for generation of new synthesized sensor data.
[0050] Generator neural network 110 may be configured to receive an additional input: class
label 152, but this is optional. Class label 152 indicates a desired property of synthesized
sensor data 141, e.g., a desired domain. There may be more than one class label.
[0051] Generator neural network input 151 may comprise measured sensor data; generator neural
network 110 may be configured for a translation task, e.g., domain translation. For
example, generator neural network 110 may be configured to generate synthesized
sensor data like generator neural network input 151 but in a different domain. For
a domain translation task some kind of cycle regularization may be used during training.
Cycle regularization may be obtained by using multiple generator networks, or by configuring
the generator neural network as a conditional neural network, e.g., wherein a class
label indicates the desired domain transfer. A CycleGAN is not necessary though.
[0052] Generator network 110 may be used to generate synthesized sensor data 141. The generator
network may be optimized so that generated synthesized sensor data 141 is indistinguishable
from measured sensor data by discriminator network 160, e.g., so that, as far as
discriminator network 160 can distinguish, synthesized sensor data 141 appears as if
it was drawn from measured sensor data.
[0053] Discriminator neural network 160 is optimized to distinguish between measured sensor
data and synthesized sensor data. Discriminator neural network 160 may be configured
to receive a discriminator neural network input 161. For example, discriminator neural
network input 161 may comprise measured sensor data, e.g., sensor data obtained from
a sensor. In this case, discriminator neural network 160 may distinguish the discriminator
neural network input 161 as measured sensor data. For example, discriminator neural
network input 161 may comprise synthesized sensor data, e.g., synthesized sensor data
141. In this case, discriminator neural network 160 may distinguish the discriminator
neural network input 161 as synthesized.
[0054] For example, discriminator neural network input 161 may comprise data which is a
composite of real data, e.g., measured sensor data, and fake data, e.g., synthesized
sensor data. In this case, discriminator neural network 160 may distinguish the discriminator
neural network input 161 as measured sensor data in part and synthesized sensor data
in part.
[0055] Discriminator network 160 is configured to produce as output localized distinguishing
information 163. The localized distinguishing information indicates for a plurality
of sub-sets of the discriminator input data if said sub-set corresponds to measured
sensor data or to synthesized sensor data. For example, the discriminator input may
be partitioned into multiple sub-sets, and localized distinguishing information 163
may indicate per sub-set whether it is measured sensor data or synthesized sensor
data. For example, the discriminator input may be partitioned into its individual
pixels or samples; the localized distinguishing information can thus indicate on a
per-pixel or per-sample basis which appear to be measured sensor data, and which do
not.
[0056] Optionally, discriminator network 160 may be configured to receive an additional
input: a class label 162. In case a class label 162 is used, the discriminator network
may additionally verify if the discriminator input 161 is according to the class label.
For example, discriminator network 160 may only output an is-real output in case the
discriminator input 161 is both measured sensor data and according to the class label.
This may also be indicated on a per-subset basis.
[0057] Optionally, discriminator network 160 may be configured to produce a global distinguishing
information 164. For example, the global distinguishing information 164 may indicate
the size of the part of discriminator input 161 that appears to be measured
sensor data, and the size of the part that appears to be synthesized sensor
data. The sizes may be relative. It was found that global distinguishing information
164 may be used as a useful additional training signal. Nevertheless, global distinguishing
information 164 is optional as a similar effect may be obtained by only training from
the localized distinguishing information.
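For example, the relation between per-pixel decisions and relative sizes may be sketched as follows. This is only an illustration: the function name `global_from_local`, the 0.5 threshold and the example values are assumptions, and in the embodiments global distinguishing information 164 is produced by the discriminator network itself rather than derived in this way.

```python
def global_from_local(local_map, threshold=0.5):
    """Return the relative size of the part that appears measured (real).

    local_map is a 2D list of per-pixel scores in [0, 1], where higher
    means "more likely measured sensor data" (assumed convention).
    """
    decisions = [score >= threshold for row in local_map for score in row]
    return sum(decisions) / len(decisions)


# Hypothetical 2 x 4 map: the left half looks measured, the right half synthesized.
local_map = [
    [0.9, 0.8, 0.2, 0.1],
    [0.9, 0.7, 0.3, 0.1],
]
real_fraction = global_from_local(local_map)
fake_fraction = 1.0 - real_fraction
```

The two fractions are relative sizes and sum to one.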
[0058] Figure 1b schematically shows an example of an embodiment of a generator neural network. In
this embodiment, the generator neural network receives measured sensor data as part
of the input 151; this is not necessary, e.g., noise may be used instead or in addition.
[0059] The generator network of figure 1b comprises three parts: an encoder part 120, a
processing part 130 and a decoder part 140.
[0060] Encoder part 120 is configured to receive the input sensor data 151. Encoder part
120 may be configured with a so-called bottleneck at its output. Processing part 130
receives the output of the encoder part 120; decoder part 140 may receive the output
of the processing part. The optional class-label 152 may comprise a transformation
goal which is to be applied to one or more parts of the network. As shown in figure
1b, the class-label 152 is provided as input to the encoder part and as an input to
the processing part. Although not shown in figure 1b, it was found to be particularly
advantageous to supply the transformation goal as an input to the decoder part 140
as well.
[0061] In an embodiment, the class-label could be an input to the decoder part 140. In an
embodiment, the class-label could be an input to the decoder part 140 and to the encoder
part 120.
[0062] In an embodiment, encoder part 120 comprises multiple convolution layers, processing
part 130 comprises multiple residual layers and the decoder part comprises multiple
convolution layers. Various known types of layers may be added. For example, in an
embodiment, encoder part 120 comprises 5 convolution layers, processing part 130 comprises
4 residual layers and the decoder part comprises 5 convolution layers. The network
may be larger or smaller as desired, or may even be much larger.
[0063] Figure 1c schematically shows an example of an embodiment of a discriminator neural network.
The discriminator network of figure 1c comprises an encoder part 172 and a decoder
part 174.
[0064] Encoder part 172 is configured to receive the discriminator input 161, and optionally
the class-label 162 (if any). Encoder part 172 may be configured with a so-called
bottleneck at its output. Encoder part 172 may be configured to produce as output
a global distinguishing information 164. For example, encoder part 172 may be a conventional
discriminator network. Global distinguishing information 164 may be provided as an
output of the discriminator network. Since during training it is known how much of
discriminator input 161 is measured and how much is synthesized, a ground truth value
is available, and thus global distinguishing information 164 may be used as an additional
training signal, e.g., to train the encoder part 172.
[0065] Decoder part 174 may receive the output of the encoder part 172. Optionally, the
class-label 162 may also be received by the decoder part 174. In an embodiment, encoder
part 172 and decoder part 174 may comprise multiple convolution layers, etc. Decoder
part 174 is configured to produce as output the localized distinguishing information
163. The localized distinguishing information 163 provides more detailed information
on which parts of the discriminator input 161 look real and which do not. Layers
of the decoder part may receive input from corresponding layers in the encoder part,
e.g., so-called skip connections 175. Thus a layer in the decoder has access to information
in the encoder at a comparable resolution, e.g., the same resolution.
[0066] Figure 2a schematically shows an example of an embodiment of a training system 200. Training
system 200 is configured for training a generator neural network arranged to transform
measured sensor data into synthesized sensor data. For example, system 200 may comprise
a generator unit 220 configured for applying the generator neural network, and a discriminator
unit 230 configured for applying a discriminator neural network. For example, generator
unit 220 and/or discriminator unit 230 may comprise storage for storing parameters
of the respective neural networks. For example, generator unit 220 and/or discriminator
unit 230 may be configured to receive network inputs, apply the inputs and the parameters
according to the neural network type and to provide the network result on an output.
[0067] System 200 comprises an optimizer 210. Optimizer 210 is configured to train the generator
network together with the discriminator neural network. The generator network is optimized
to generate synthesized sensor data, and the discriminator network is optimized to
distinguish between measured sensor data and synthesized sensor data. In order to
train the two neural networks, optimizer 210 has access to a training set, e.g., as
stored in a training set storage 240. The training set comprises measured sensor data.
Sensor data may be image data, e.g., images, but may comprise instead or in addition
a wide variety of data, e.g., radar data, ultrasonic sensor data, etc. In an embodiment,
sensor data may be obtained from a sensor configured to produce two-dimensional data
characterizing an environment of the sensor. The sensor may be employed in a machine.
In an embodiment, at least part or all of the sensor measurements have domain information
and/or sensor time information, indicating the domain, e.g., the condition, environment
or environment type, in which the sensor data was obtained, and/or the time at which it
was obtained.
[0068] A sensor data item may comprise multiple conjoint sensor data, possibly of different sensor
modalities. For example, in the example of autonomous vehicles one sensor data item
may comprise one or more of image, radar, and other sensor data, typically concurrent
data recorded from multiple sensors. For example, system 200 may comprise a communication
interface for accessing the training set. Sensor data may be measured, e.g., as received
from a sensor, e.g., real, or true; or sensor data may be generated, e.g., as generated
by a generator unit, e.g., fake.
[0069] Once the generator network is sufficiently trained, e.g., after convergence or after
exhausting the training data, or after a preset number of training iterations, the
generator network may be used in an application, typically without the corresponding
discriminator network. For example, figure 2b schematically shows an example of an
embodiment of a generator system 250. Generator
system 250 is configured to apply a generator neural network, such as the generator
neural network trained by system 200, e.g., the generator neural network of generator
unit 220. Generator system 250 is thus arranged to generate synthesized sensor data.
System 250 may comprise an input unit 252. Input unit 252 may be configured for receiving
as input measured sensor data, e.g., in case of a domain transferring task. Input
unit 252 may be configured for receiving a noise component, e.g., in case of a generating
task. Input unit 252 may be configured for both noise and sensor data as well. Input
unit 252 might also be used to receive sensor data that was not measured but generated.
After generating the synthesized sensor data, the generated output sensor data may
be put on output 254, e.g., transmitted. For example, system 250 may comprise a communication
interface for receiving and/or transmitting the sensor data.
[0070] System 250 comprises a generator unit 220 configured to apply the trained generator
network to the received input measured sensor data. Typically, system 250 is configured
to perform further tasks. For example, system 250 may be configured to augment further
training data for a further neural network, e.g., for a classifier. System 250 and
system 200 may be the same system, or they may not be. Systems 200 and/or 250 may
be a single device or may comprise multiple devices.
[0071] Systems 200 and/or 250 may communicate with each other or with external storage or
input devices or output devices over a computer network. The computer network may
be an internet, an intranet, a LAN, a WLAN, etc. The computer network may be the Internet.
The systems comprise a connection interface which is arranged to communicate within
the system or outside of the system as needed. For example, the connection interface
may comprise a connector, e.g., a wired connector, e.g., an Ethernet connector, an
optical connector, etc., or a wireless connector, e.g., an antenna, e.g., a Wi-Fi,
4G or 5G antenna, etc.
[0072] The execution of system 200 and 250 is implemented in a processor system, e.g., one
or more processor circuits, examples of which are shown herein. Figures 1a, 1b, 1c,
2a and 2b show functional units that may be functional units of the processor system.
For example, figures 1a-2b may be used as a blueprint of a possible functional organization
of the processor system. The processor circuit(s) are not shown separate from the
units in these figures. For example, the functional units shown in figures 1a-2b may
be wholly or partially implemented in computer instructions that are stored at system
200 and 250, e.g., in an electronic memory of system 200 and 250, and are executable
by a microprocessor of system 200 and 250. In hybrid embodiments, functional units
are implemented partially in hardware, e.g., as coprocessors, e.g., neural network
coprocessors, and partially in software stored and executed on system 200 and 250.
Parameters of the network and/or training data may be stored locally, e.g., at system
200 and 250, or may be stored in cloud storage.
[0073] Figure 2c schematically shows an example of an embodiment of a training system 260. Training
system 260 may comprise a processor system 263, a memory 264, and a communication
interface 265. For example, the execution of system 200 may be implemented in a processor
system, e.g., one or more processor circuits, e.g., microprocessors, examples of which
are shown herein. Parameters of the networks and/or training data may be stored locally
at system 260 or may be stored in cloud storage.
[0074] Figure 3 schematically shows an example of an embodiment of a training method 300. Method
300 is a training method for training a generator neural network configured to generate
synthesized sensor data. Method 300 comprises
- accessing (310) a training set of measured sensor data obtained from a sensor,
- training (320) the generator neural network together with a discriminator neural network,
the training comprising
- generating (330) synthesized sensor data using the generator neural network,
- optimizing (340) the discriminator network to distinguish between measured sensor
data and synthesized sensor data,
- optimizing (350) the generator network to generate synthesized sensor data which is
indistinguishable from measured sensor data by the discriminator network, wherein
- the discriminator network is configured to receive (360) discriminator input data
comprising synthesized sensor data and/or measured sensor data, and to produce (370)
as output localized distinguishing information, the localized distinguishing information
indicating for a plurality of sub-sets of the discriminator input data if said sub-set
corresponds to measured sensor data or to synthesized sensor data.
[0075] In the various embodiments of system 100, 200 and 250, one or more communication
interfaces may be selected from various alternatives. For example, the interface may
be a network interface to a local or wide area network, e.g., the Internet, a storage
interface to an internal or external data storage, a keyboard, an application interface
(API), etc.
[0076] The systems 100, 200 and 250 may have a user interface, which may include well-known
elements such as one or more buttons, a keyboard, display, touch screen, etc. The
user interface may be arranged for accommodating user interaction for configuring
the systems, training the networks on a training set, or applying the system to new
sensor data, etc.
[0077] Storage may be implemented as an electronic memory, say a flash memory, or magnetic
memory, say hard disk or the like. Storage may comprise multiple discrete memories
together making up the storage, e.g., storage 264, 240, etc. Storage may comprise
a temporary memory, say a RAM. The storage may be cloud storage.
[0078] Systems 100, 200 or 250 may be implemented in a single device. Typically, the systems
100, 200 and 250 each comprise a microprocessor which executes appropriate software
stored at the system; for example, that software may have been downloaded and/or stored
in a corresponding memory, e.g., a volatile memory such as RAM or a non-volatile memory
such as Flash. Alternatively, the systems may, in whole or in part, be implemented
in programmable logic, e.g., as a field-programmable gate array (FPGA). The systems
may be implemented, in whole or in part, as a so-called application-specific integrated
circuit (ASIC), e.g., an integrated circuit (IC) customized for their particular use.
For example, the circuits may be implemented in CMOS, e.g., using a hardware description
language such as Verilog, VHDL, etc. In particular, systems 100, 200 and 250 may comprise
circuits for the evaluation of neural networks.
[0079] A processor circuit may be implemented in a distributed fashion, e.g., as multiple
sub-processor circuits. A storage may be distributed over multiple distributed sub-storages.
Part or all of the memory may be an electronic memory, magnetic memory, etc. For example,
the storage may have a volatile and a non-volatile part. Part of the storage may be
read-only.
[0080] Below several further optional refinements, details, and embodiments are illustrated.
Below it is assumed that the measured and synthesized sensor data comprise an image.
This is not necessary, however; instead of an image sensor, another type of sensor may
have been used to obtain the measured sensor data; such different sensor data may
also be used in addition to an image sensor.
[0081] GAN 100 and training system 200 may use a conventional training technique in part,
except that an additional training signal is available by comparing the localized
distinguishing information with the corresponding sub-sets in the discriminator input
161.
[0082] For example, in a conventional GAN training system that does not use localized distinguishing
information one could train two networks: a generator G and a discriminator D, by minimizing
the following competing objectives, e.g., in an alternating manner:
\[ \mathcal{L}_D = -\mathbb{E}_x[\log D(x)] - \mathbb{E}_z[\log(1 - D(G(z)))], \qquad \mathcal{L}_G = -\mathbb{E}_z[\log D(G(z))] \tag{1} \]
[0083] G aims to map a latent variable z ∼ p(z) sampled from a prior distribution to a
realistic-looking image, e.g., like measured sensor data, while D aims to distinguish
between real images x and generated images G(z), e.g., between measured sensor data and
synthesized sensor data. Ordinarily, G and D may be modeled as a decoder and an encoder
convolutional network, respectively. For example, the discriminator may output a value
between 1 and 0 to indicate how real the input seems.
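For example, the competing objectives of a conventional GAN may be sketched as below. The sketch assumes the common binary cross-entropy formulation, discriminator outputs strictly between 0 and 1, and batch averages in place of expectations; the function names are illustrative only.

```python
import math


def discriminator_loss(real_scores, fake_scores):
    """Conventional discriminator loss from scalar outputs D(x) and D(G(z)).

    All scores are assumed to lie strictly between 0 and 1.
    """
    real_term = -sum(math.log(d) for d in real_scores) / len(real_scores)
    fake_term = -sum(math.log(1.0 - d) for d in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_loss(fake_scores):
    """Generator loss: low when the discriminator rates G(z) as real."""
    return -sum(math.log(d) for d in fake_scores) / len(fake_scores)


# An undecided discriminator (0.5 everywhere) versus a confident one.
undecided = discriminator_loss([0.5, 0.5], [0.5, 0.5])
confident = discriminator_loss([0.9, 0.9], [0.1, 0.1])
```

A discriminator that separates the two kinds of input well obtains a lower loss than an undecided one, while the generator loss decreases as the scores of its outputs approach 1.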
[0084] In an embodiment, the discriminator D may be implemented as a combination of an encoder
and decoder, e.g., as a so-called U-Net. For example, the encoder part could be implemented
as a known discriminator, e.g., by reusing building blocks of the known discriminator
classification networks as an encoder part. The decoder part could be built by re-using
building blocks from known generator networks.
[0085] In other words, in an embodiment, the discriminator comprises a downsampling network
but also an upsampling network. An advantage of using a decoder part or an upsampling
part, is that the discriminator network produces an output signal of a higher resolution,
than what the output would be if only a downsampling or encoder network were used.
In particular, the discriminator can give localized feedback on the quality of the
synthesized sensor data, even on a per-pixel level. The resolution of the localized
distinguishing information may be the same as that of the synthesized sensor data,
but may be lower instead.
[0086] The downsampling network and the upsampling networks may be connected via a bottleneck,
as well as skip-connections that copy and concatenate feature maps from the encoder
and the decoder modules. We will refer to this discriminator as D^U. While a conventional
discriminator D(x) classifies a discriminator input image x into being real or fake, an
embodiment of discriminator D^U(x) may additionally perform this classification on a
per-pixel basis, segmenting image x into real and fake regions; in addition the
discriminator may still give a global image classification of x, e.g., from the encoder.
Figure 4 schematically shows an example of an embodiment of a training system. Shown in figure
4 is a discriminator which comprises an encoding or downsampling part 172, which may
produce global distinguishing information 164, and an upsampling part 174 which takes
the global distinguishing information 164 and upsamples it to local distinguishing
information 163. Global distinguishing information 164 may serve as a bottle-neck,
which may be used to train the encoder.
[0087] Also shown in figure 4 is a generator network 110, which receives a generator input
151, which may comprise a random component, e.g., a noise vector, and/or sensor data,
typically measured sensor data. Shown in figure 4 is that encoder part 172 of the
discriminator receives either measured sensor data 401, in this case an image of a person,
or synthesized sensor data 402, e.g., an output of generator network 110. As is shown below, the
training signal may be improved by combining sensor data 401 and 402. Decoder part
174 receives as input the global distinguishing information, which is to be upscaled,
but also skip connections 175, which provide information from layers in the encoder
part.
[0088] A skip connection 175 preferably connects layers of the same or similar resolution.
For example, layer i of encoder 172 may be connected to a layer n - f(i), wherein f is an
increasing function and n denotes the number of layers of decoder 174. If the number of
layers is the same, one may use the identity for f. Thus one or more layers of decoder 174
may receive as input the output of the previous layer of decoder 174 but also the output
or input of a corresponding layer in encoder 172.
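For example, the pairing of encoder and decoder layers may be sketched as below. The 1-based layer numbering, the dropping of pairs that fall outside the decoder, and the name `skip_connection_pairs` are assumptions made for illustration; the text fixes only the form n - f(i).

```python
def skip_connection_pairs(num_encoder_layers, num_decoder_layers, f=lambda i: i):
    """Pair encoder layer i with decoder layer n - f(i).

    Layers are numbered from 1 (assumed convention). f should be an
    increasing function; the default is the identity. Pairs whose decoder
    index falls outside 1..n are skipped.
    """
    n = num_decoder_layers
    pairs = []
    for i in range(1, num_encoder_layers + 1):
        j = n - f(i)
        if 1 <= j <= n:
            pairs.append((i, j))
    return pairs


# Same number of layers in encoder and decoder, identity for f.
pairs = skip_connection_pairs(4, 4)
```

With these assumptions early encoder layers are paired with late decoder layers, and the last encoder layer, which feeds the bottleneck, has no separate skip connection.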
[0089] The discriminator learns both global and local differences between real and fake
images, something which is helped by using both global and local outputs. Hereafter,
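we refer to the encoder module of the discriminator as D^U_enc and to the decoder module
as D^U_dec. A discriminator loss for use in training may be computed by taking the
decisions from both D^U_enc and D^U_dec, e.g.:
\[ \mathcal{L}_{D^U} = \mathcal{L}_{D^U_{\mathrm{enc}}} + \mathcal{L}_{D^U_{\mathrm{dec}}} \tag{2} \]
[0090] The loss for the encoder L_{D^U_enc} may be computed from the scalar output of
D^U_enc. For example, one may use:
\[ \mathcal{L}_{D^U_{\mathrm{enc}}} = -\mathbb{E}_x[\log D^U_{\mathrm{enc}}(x)] - \mathbb{E}_z[\log(1 - D^U_{\mathrm{enc}}(G(z)))] \tag{3} \]
[0091] The loss for the decoder L_{D^U_dec} may be computed as the mean decision over all
sub-sets, e.g., pixels (written here as a sum over pixels, which equals the mean up to a
constant factor):
\[ \mathcal{L}_{D^U_{\mathrm{dec}}} = -\mathbb{E}_x\Big[\sum_{i,j} \log [D^U_{\mathrm{dec}}(x)]_{i,j}\Big] - \mathbb{E}_z\Big[\sum_{i,j} \log\big(1 - [D^U_{\mathrm{dec}}(G(z))]_{i,j}\big)\Big] \tag{4} \]
[0092] Here, [D^U_dec(x)]_{i,j} and [D^U_dec(G(z))]_{i,j} refer to the discriminator
decision at pixel (i,j). In an embodiment, per-pixel outputs of
D^U_dec are derived based on global information from high-level features, enabled through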
the process of upsampling from the bottleneck, as well as more local information from
low-level features, mediated by the skip connections from the intermediate layers
of the encoder network. Note that an encoder/decoder architecture for the discriminator
is not strictly necessary. Instead, say a monolithic discriminator architecture comprising
multiple convolutional layers may be used. An advantage of encoder/decoder architecture
is that the global distinguishing information can be used as an additional training
signal.
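For example, a discriminator loss combining the global decision of the encoder module and the per-pixel decisions of the decoder module may be sketched as below for a single real and a single synthesized image. The cross-entropy form, the averaging over pixels, and all function names are assumptions for illustration; outputs are assumed to lie strictly between 0 and 1.

```python
import math


def encoder_loss(enc_real, enc_fake):
    """Global term from the scalar encoder outputs on a real and a fake image."""
    return -math.log(enc_real) - math.log(1.0 - enc_fake)


def decoder_loss(dec_real, dec_fake):
    """Local term: mean per-pixel decision over 2D decoder output maps."""
    real_pixels = [p for row in dec_real for p in row]
    fake_pixels = [p for row in dec_fake for p in row]
    real_term = -sum(math.log(p) for p in real_pixels) / len(real_pixels)
    fake_term = -sum(math.log(1.0 - p) for p in fake_pixels) / len(fake_pixels)
    return real_term + fake_term


def unet_discriminator_loss(enc_real, enc_fake, dec_real, dec_fake):
    """Sum of the global (encoder) and local (decoder) loss terms."""
    return encoder_loss(enc_real, enc_fake) + decoder_loss(dec_real, dec_fake)


# Hypothetical outputs of an undecided discriminator on 2 x 2 images.
half = [[0.5, 0.5], [0.5, 0.5]]
total = unet_discriminator_loss(0.5, 0.5, half, half)
```

Dropping the encoder term from the sum corresponds to training from the localized distinguishing information only.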
[0093] As the generator objective one may use:
\[ \mathcal{L}_G = -\mathbb{E}_z\Big[\log D^U_{\mathrm{enc}}(G(z)) + \sum_{i,j} \log [D^U_{\mathrm{dec}}(G(z))]_{i,j}\Big] \tag{5} \]
[0094] This loss function encourages the generator to focus on both global structures and
local details while synthesizing images in order to fool the more powerful discriminator
D^U. Loss functions such as the ones suggested herein may be used in an otherwise conventional
training system, e.g., ADAM, e.g., using GAN-style backpropagation. For example, in
an embodiment, the discriminator trains (1) for one or more iterations, followed by
the generator training (2) for one or more iterations. Phases (1) and (2) may be repeated
until convergence, or until a set number of iterations, etc. For example, the generator
may be repeatedly applied to obtain synthesized sensor data, and the discriminator
may be repeatedly applied to measured sensor data or synthesized sensor data, or a
composite. During discriminator training the loss of the discriminator may be optimized,
e.g., reduced. During generator training the loss of the generator may be optimized,
e.g., reduced.
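For example, a generator loss that uses both discriminator outputs, and the alternation between discriminator and generator phases, may be sketched as below. The averaging over pixels, the function names and the default step counts are assumptions for illustration; the actual parameter updates, e.g., by an optimizer such as Adam, are not shown.

```python
import math


def generator_loss(enc_fake, dec_fake):
    """Generator loss from the discriminator outputs on one synthesized image.

    enc_fake is the scalar global output, dec_fake a 2D list of per-pixel
    outputs; all values are assumed to lie strictly between 0 and 1.
    """
    pixels = [p for row in dec_fake for p in row]
    local_term = sum(math.log(p) for p in pixels) / len(pixels)
    return -(math.log(enc_fake) + local_term)


def training_phases(rounds, d_steps=1, g_steps=1):
    """Yield the phase names in order: d_steps of (1), then g_steps of (2)."""
    for _ in range(rounds):
        for _ in range(d_steps):
            yield "discriminator"
        for _ in range(g_steps):
            yield "generator"


schedule = list(training_phases(rounds=2, d_steps=2, g_steps=1))
loss_when_undecided = generator_loss(0.5, [[0.5, 0.5]])
```

In a training loop, each yielded phase would be followed by one optimization step of the corresponding network, repeated until convergence or until a set number of iterations.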
[0095] To further improve training a consistency regularization may be introduced to better
train the discriminator. This leads to a higher quality feedback signal from the discriminator
which in turn leads to a better generator neural network. The localized distinguishing
information, e.g., per-pixel decisions, of a well-trained D^U discriminator should be
equivariant under any class-domain-altering transformations
of images. However, this property is not guaranteed by the structure of the neural
networks.
[0096] To obtain such equivariancy, the discriminator may be regularized to focus more on
semantic and structural changes between real and fake data and to pay less attention
to arbitrary class-domain-preserving perturbations. Therefore, we propose the consistency
regularization of the discriminator, explicitly encouraging the decoder module D^U_dec
to output equivariant predictions for composed sensor data. For example, one way
to compose sensor data, e.g., images, is to cut and paste patches from images of different
classes, e.g., real and fake. For example, the transformation may be the so-called
CutMix transformation of real and fake samples. An advantage of CutMix in the context
of GAN training is that it does not alter the real and fake image patches used for
mixing, preserving their original class domain, and provides a large variety of possible
outputs. We visualize the CutMix augmentation strategy and the D^U predictions in Figure 5.
[0097] Figure 5 schematically shows examples of an embodiment of data in an embodiment of a training
method. Figure 5 shows a schematic visualization of the CutMix regularization and
the predictions of an embodiment of the discriminator, in this case a U-net, on the
CutMix images.
[0098] The first row of figure 5 shows a real image 501, e.g., measured sensor data, and
a fake image 502, e.g., synthesized sensor data. The second row shows binary masks
M, which may be used for the CutMix operation. In this row of figure 5 a white color
is used for real, and a black color for fake. Mask 511 will cause mostly fake image
in the transformed image, while mask 512 will cause mostly real image. The corresponding
target global distinguishing scores c are shown below. Target 521 of 0.28 indicates
that most of mask 511 is black, and target 522 of 0.68 indicates that most
of mask 512 is white. The discriminator is trained so that the global distinguishing
information matches the target scores c.
[0099] The fourth row schematically shows composed sensor data, in this case CutMix images
from real and fake samples. For example, image 531 may have been obtained by taking
the part from fake image 502 indicated by mask 511 and taking the part from real image
501 indicated by mask 511. For example, image 532 may have been obtained by taking
the part from fake image 502 indicated by mask 512 and taking the part from real image
501 indicated by mask 512. For example, if the images are images taken from the environment
of a vehicle, e.g., street scenes, then the composed images 531 and 532 may partly
show a realistic street scene and partly a less-realistic looking generated street
scene. The same holds for products, e.g., goods, that may or may not have production
defects. A composed sensor data may show the product but part of it corresponds to
measured sensor data, e.g., an image of the product as it rolls off a production line,
with the other part synthesized.
[0100] The fifth row shows corresponding real/fake segmentation maps of the discriminator
D^U with its predicted global distinguishing scores below it. Note that global distinguishing
score 551 is close to target 521 and global distinguishing score 552 is close to target
522.
[0101] The localized discriminator outputs 541 and 542 use a darker greyscale to indicate
likely synthesized sensor data and a lighter grey scale for likely measured sensor
data. In this case the colors in 541 and 542 are schematically indicated as uniform,
but in actual data, e.g., as shown in figure 6, the data will typically not be uniform,
but a mixture of darker and lighter greys indicating regions in which the discriminator
is more or less certain that the part is synthesized sensor data.
[0102] For example, in an embodiment a new training sample x̃ is composed for the
discriminator D^U by mixing measured sensor data x and synthesized sensor data
G(z) ∈ R^{W×H×C} with the mask M:
\[ \tilde{x} = M \odot x + (\mathbf{1} - M) \odot G(z) \tag{6} \]
where M ∈ {0,1}^{W×H} is the binary mask indicating if the pixel (i,j) comes from the
real (M_{i,j} = 1) or fake (M_{i,j} = 0) image, 1 is a binary mask filled with ones, and
⊙ is an element-wise multiplication. The class label c ∈ (0,1) for the new synthetic
image x̃ is assigned proportionally to the number of pixels coming from the real image, e.g.,
\[ c = \frac{\sum_{i,j} M_{i,j}}{W \cdot H} \]
Note that for the synthetic sample x̃, c and M are the ground truth for the encoder and
decoder modules of the discriminator D^U, respectively. Here the CutMix operator is applied
to purely generated synthesized sensor data, but the same applies to synthesized data which
is obtained by domain-transferring measured sensor data.
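For example, composing a sample from a real and a synthesized image with a binary mask, and deriving the corresponding target score c, may be sketched as below. Single-channel images stored as nested lists, and the names `cutmix` and `target_score`, are assumptions for illustration.

```python
def cutmix(real, fake, mask):
    """Compose an image: take the real pixel where mask is 1, the fake pixel where it is 0."""
    return [
        [m * r + (1 - m) * f for r, f, m in zip(real_row, fake_row, mask_row)]
        for real_row, fake_row, mask_row in zip(real, fake, mask)
    ]


def target_score(mask):
    """Fraction of pixels that come from the real image."""
    cells = [m for row in mask for m in row]
    return sum(cells) / len(cells)


# Hypothetical 2 x 2 example: only the top-left pixel is taken from the real image.
real = [[1.0, 1.0], [1.0, 1.0]]
fake = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1, 0], [0, 0]]

mixed = cutmix(real, fake, mask)
c = target_score(mask)
```

Here the mask serves as per-pixel ground truth for the decoder module and c as ground truth for the encoder module.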
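[0103] Given a CutMix operation, e.g., such as the one above, one can train the discriminator
to provide consistent per-pixel predictions, e.g.,
\[ D^U_{\mathrm{dec}}(\tilde{x}) \approx M \odot D^U_{\mathrm{dec}}(x) + (\mathbf{1} - M) \odot D^U_{\mathrm{dec}}(G(z)) \]
by introducing the consistency regularization loss term in the discriminator objective:
\[ \mathcal{L}^{\mathrm{cons}}_{D^U_{\mathrm{dec}}} = \Big\| D^U_{\mathrm{dec}}(\tilde{x}) - \big( M \odot D^U_{\mathrm{dec}}(x) + (\mathbf{1} - M) \odot D^U_{\mathrm{dec}}(G(z)) \big) \Big\|^2 \tag{7} \]
where ∥ · ∥ denotes a norm, such as the L2 norm. This consistency loss may then be taken
between the per-pixel output of D^U_dec on the CutMix image and the CutMix between outputs
of D^U_dec on real and fake images, penalizing the discriminator for inconsistent predictions.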
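For example, the consistency term may be sketched as below, comparing the decoder output on a composed image with the composition of the decoder outputs on the real and the fake image. The squared L2 norm, the nested-list representation and the function names are assumptions for illustration.

```python
def cutmix(a, b, mask):
    """Take values from a where mask is 1 and from b where mask is 0."""
    return [
        [m * x + (1 - m) * y for x, y, m in zip(a_row, b_row, m_row)]
        for a_row, b_row, m_row in zip(a, b, mask)
    ]


def consistency_loss(dec_on_mixed, dec_on_real, dec_on_fake, mask):
    """Squared L2 distance between D_dec(mixed image) and the mix of D_dec outputs."""
    target = cutmix(dec_on_real, dec_on_fake, mask)
    return sum(
        (p - t) ** 2
        for p_row, t_row in zip(dec_on_mixed, target)
        for p, t in zip(p_row, t_row)
    )


# Hypothetical per-pixel decoder outputs on a 1 x 2 image.
dec_on_real = [[0.9, 0.9]]
dec_on_fake = [[0.1, 0.1]]
mask = [[1, 0]]

consistent = consistency_loss([[0.9, 0.1]], dec_on_real, dec_on_fake, mask)
inconsistent = consistency_loss([[0.5, 0.5]], dec_on_real, dec_on_fake, mask)
```

The term is zero when the prediction on the composed image agrees pixel by pixel with the composed predictions, and grows as the two diverge.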
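[0104] The loss term in Eq. 7 can be included, e.g., added, to the discriminator objective
in Eq. 2, possibly with a weighting hyper-parameter λ:
\[ \mathcal{L}_{D^U} = \mathcal{L}_{D^U_{\mathrm{enc}}} + \mathcal{L}_{D^U_{\mathrm{dec}}} + \lambda \mathcal{L}^{\mathrm{cons}}_{D^U_{\mathrm{dec}}} \tag{8} \]
[0105] Hyper-parameter λ may, e.g., equal 1. The generator objective L_G may remain
unchanged, see Eq. 5. In an embodiment, with a U-Net GAN a non-saturating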
GAN objective formulation may be used.
[0106] The introduced consistency regularization as well as the U-Net architecture of the
discriminator can be combined with any other adversarial losses of the generator and
discriminator.
[0107] Figure 6 schematically shows examples of an embodiment of data in an embodiment of a training
method. From left to right, images obtained at successively later stages of the training are shown.
The top row shows synthesized sensor data generated by the generator network. The
bottom row shows the corresponding localized distinguishing information. Training
data comprised a set of images of people. The generator network is optimized to generate
images that appear to be drawn from the same distribution. In the top row, one can
see that images appear more realistic as the training progresses. The bottom row shows
the discriminator output; it can be seen that as the training progresses
the discriminator output becomes increasingly lighter grey, indicating that the discriminator
considers more of the image to be likely measured sensor data. For example, in the first
two images the top left of the head appears unrealistic which is reflected in a dark
patch in the top left of the localized distinguishing information.
[0108] The example images were obtained from an embodiment in which a U-Net type discriminator
network was used. The synthetic image samples are obtained from a fixed noise vector
at different training iterations. Brighter colors correspond to higher discriminator
confidence that a pixel is real (and darker colors to higher confidence that it is fake).
Note that the U-Net discriminator provides a very detailed and spatially coherent response
to the generator, enabling it to further improve the image quality, e.g., the man's
unnaturally large forehead is recognized as fake by the discriminator and is corrected by
the generator throughout
the training.
[0109] Embodiments may be used in GAN models for data synthesis and data augmentation. Their
use is particularly advantageous when collecting additional data is expensive or legally
not possible. In the context of autonomous driving this includes extreme situations,
like dangerously maneuvering cars or near-hit situations involving pedestrians.
[0110] For example, the methods, e.g., training methods, may be computer implemented methods.
For example, accessing training data, and/or receiving input data may be done using
a communication interface, e.g., an electronic interface, a network interface, a memory
interface, etc. For example, storing or retrieving parameters may be done from an
electronic storage, e.g., a memory, a hard drive, etc., e.g., parameters of the networks.
For example, applying a neural network to data of the training data, and/or adjusting
the stored parameters to train the network may be done using an electronic computing
device, e.g., a computer.
[0111] The neural networks, during training and/or during application, may have multiple
layers, which may include, e.g., convolutional layers and the like. For example, the
neural network may have at least 2, 5, 10, 15, 20 or 40 hidden layers, or more, etc.
The number of neurons in the neural network may, e.g., be at least 10, 100, 1000,
10000, 100000, 1000000, or more, etc.
[0112] Many different ways of executing the method are possible, as will be apparent to
a person skilled in the art. For example, the steps can be performed
in the shown order, but the order of the steps can also be varied, or some steps may be
executed in parallel. Moreover, in between steps other method steps may be inserted.
The inserted steps may represent refinements of the method such as described herein,
or may be unrelated to the method. For example, some steps may be executed, at least
partially, in parallel. Moreover, a given step may not have finished completely before
a next step is started.
[0113] Embodiments of the method may be executed using software, which comprises instructions
for causing a processor system to perform the method, e.g., method 300. Software may
only include those steps taken by a particular sub-entity of the system. The software
may be stored in a suitable storage medium, such as a hard disk, a floppy, a memory,
an optical disc, etc. The software may be sent as a signal along a wire, or wirelessly,
or using a data network, e.g., the Internet. The software may be made available for
download and/or for remote usage on a server. Embodiments of the method may be executed
using a bitstream arranged to configure programmable logic, e.g., a field-programmable
gate array (FPGA), to perform the method.
[0115] In an embodiment, the BigGAN generator and discriminator architectures are adopted
for the 256 × 256 and 128 × 128 resolutions with a channel multiplier ch = 64. The original
BigGAN discriminator downsamples the input image to a feature map of dimensions
16ch × 4 × 4, on which global sum pooling is applied to derive a 16ch-dimensional feature
vector that is classified into real or fake. In an embodiment, the BigGAN discriminator
is modified by copying the generator architecture and appending it to the 4 × 4 output
of the discriminator. In this embodiment, the features are successively upsampled via
ResNet blocks until the original image resolution (H × W) is reached. Furthermore, the
input to every decoder ResNet block is concatenated with the output features of the
encoder blocks that share the same intermediate resolution. In this way, high-level and
low-level information are integrated on the way to the output feature map. In this
embodiment, the decoder architecture is almost identical to the generator, with the
following exceptions: the number of channels of the final output is changed from 3 to
ch, a final block of 1 × 1 convolutions is appended to produce the 1 × H × W output map,
and no class-conditional BatchNorm is used in the decoder. Class information is provided
to D^U with projection to the ch-dimensional channel features of the U-Net encoder and
decoder output. In contrast to BigGAN, it was found beneficial not to use a hierarchical
latent space, but to directly feed the same input vector z to BatchNorm at every layer
in the generator. Furthermore, it was also found beneficial to remove the self-attention
layer in both encoder and decoder; experiments showed that these layers did not contribute
to performance yet led to memory overhead.
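The resolution bookkeeping of such an encoder-decoder discriminator may be illustrated with a short sketch. The function name `unet_discriminator_resolutions` and its return layout are hypothetical, and the sketch assumes that every encoder block halves and every decoder block doubles the spatial resolution, and that skip connections are used at every intermediate resolution between the bottleneck and the input; the embodiment does not need to follow these assumptions exactly.

```python
def unet_discriminator_resolutions(image_size, bottleneck=4, ch=64):
    """Trace spatial resolutions through a U-Net style discriminator.

    Assumption (not taken from the description): each encoder block halves
    the resolution and each decoder block doubles it, so image_size must be
    bottleneck * 2**k.
    """
    encoder = [image_size]
    while encoder[-1] > bottleneck:
        if encoder[-1] % 2 != 0:
            raise ValueError("image_size must be bottleneck * 2**k")
        encoder.append(encoder[-1] // 2)
    if encoder[-1] != bottleneck:
        raise ValueError("image_size must be bottleneck * 2**k")

    # Output resolution of each decoder block, from the bottleneck upwards.
    decoder = encoder[-2::-1]
    # Resolutions at which a decoder block input is concatenated with encoder
    # features of the same resolution (assumed: all but the final output).
    skips = decoder[:-1]

    return {
        "encoder": encoder,
        "decoder": decoder,
        "skips": skips,
        # Global sum pooling over the 16ch x 4 x 4 map gives a 16ch vector.
        "pooled_vector_dim": 16 * ch,
        # The final 1 x 1 convolutions give a per-pixel real/fake map.
        "output_map_shape": (1, image_size, image_size),
    }


if __name__ == "__main__":
    for name, value in unet_discriminator_resolutions(128).items():
        print(name, value)
```

For a 128 × 128 input and ch = 64, the sketch yields a 1024-dimensional pooled feature vector for the encoder head and a 1 × 128 × 128 map for the decoder head.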
[0116] Experiments were also performed on an unconditional embodiment. Note that the original
BigGAN is a class-conditional model. For the unconditional model, the class-conditional
BatchNorm is replaced with self-modulation, wherein the BatchNorm parameters are conditioned
only on the latent vector z, and the class projection is not used in the discriminator.
In this embodiment, these modifications provide a two-headed discriminator. While each
decoder head is already sufficient to train the network, it was found beneficial to compute
the GAN loss at both heads with equal weight. The hinge loss may be kept. Models that also
employ consistency regularization in the decoder output space benefit from using a
non-saturating loss.
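As an illustration of computing the loss at both heads with equal weight, the following is a minimal sketch. It assumes the commonly used hinge formulation for the discriminator, max(0, 1 - D(x)) on real data and max(0, 1 + D(G(z))) on generated data; the description only states that the hinge loss may be kept, so the exact form is an assumption, and the function names are hypothetical. Scores are plain Python lists so that the sketch runs without additional libraries.

```python
def hinge_d_loss(real_scores, fake_scores):
    """Hinge discriminator loss over flat lists of scores (assumed form)."""
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def flatten_maps(score_maps):
    """Flatten a batch of per-pixel score maps (lists of rows) into one list."""
    return [v for score_map in score_maps for row in score_map for v in row]


def two_headed_d_loss(enc_real, enc_fake, dec_real, dec_fake):
    """Sum the encoder-head and decoder-head losses with equal weight.

    enc_*: one scalar score per image (global real/fake decision).
    dec_*: one 2-D score map per image (per-pixel real/fake decision).
    """
    loss_enc = hinge_d_loss(enc_real, enc_fake)
    loss_dec = hinge_d_loss(flatten_maps(dec_real), flatten_maps(dec_fake))
    return loss_enc + loss_dec


if __name__ == "__main__":
    loss = two_headed_d_loss(
        enc_real=[2.0], enc_fake=[-2.0],
        dec_real=[[[0.5, 1.5]]], dec_fake=[[[-0.5, -1.5]]],
    )
    print(loss)
```

In this sketch the per-pixel term is averaged over all pixels of the batch, so that both heads contribute on a comparable scale before they are added.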
[0117] During training, for each iteration, a mini-batch of CutMix images (x̃, c, M) is
created with probability r_mix. This probability is increased linearly from 0 to 0.5 over
the first n epochs in order to give the generator time to learn how to synthesize more
realistic samples and not to give the discriminator too much power from the start. CutMix
images are created from the existing real and fake images in the mini-batch using binary
masks M. For sampling M, the original CutMix implementation is used: first, the combination
ratio c between the real and generated images is sampled from the uniform distribution
on (0, 1), and then the bounding box coordinates for the cropping regions of x and G(z)
are sampled uniformly so as to preserve the ratio c. Binary masks M also denote the target
for the decoder D^U_dec, while for the encoder D^U_enc soft targets c, the fraction of 1s
in M, are used. The weighting parameter λ is set to 1. Note that the consistency
regularization does not impose much overhead during training. Extra computational cost
comes only from feeding additional CutMix images through the discriminator while updating
its parameters.
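The sampling described above may be sketched as follows for single-channel images stored as nested lists. The box construction (side lengths scaled by the square root of the cut fraction, a uniformly sampled box centre, clipping at the image border) follows a commonly used CutMix pattern and is an assumption here, as are all function names. The mask uses 1 for pixels taken from the real image and 0 for pixels taken from the generated image.

```python
import math
import random


def r_mix_schedule(epoch, warmup_epochs, r_max=0.5):
    """Linearly increase the CutMix probability from 0 to r_max."""
    if warmup_epochs <= 0:
        return r_max
    return r_max * min(1.0, epoch / warmup_epochs)


def sample_cutmix_mask(height, width, rng):
    """Return (M, c): binary mask M and the fraction c of 1s in M."""
    ratio = rng.random()                  # combination ratio from U(0, 1)
    cut = math.sqrt(1.0 - ratio)          # box side fraction for that ratio
    cut_h, cut_w = int(height * cut), int(width * cut)
    cy, cx = rng.randrange(height), rng.randrange(width)  # uniform centre
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, height)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, width)
    mask = [[0 if (top <= i < bottom and left <= j < right) else 1
             for j in range(width)] for i in range(height)]
    # Recompute c from the mask, since clipping may change the box area.
    c = sum(sum(row) for row in mask) / (height * width)
    return mask, c


def cutmix(real, fake, mask):
    """Mix pixel-wise: M * x + (1 - M) * G(z)."""
    return [[m * r + (1 - m) * f for r, f, m in zip(r_row, f_row, m_row)]
            for r_row, f_row, m_row in zip(real, fake, mask)]


if __name__ == "__main__":
    rng = random.Random(0)
    mask, c = sample_cutmix_mask(8, 8, rng)
    real = [[1.0] * 8 for _ in range(8)]
    fake = [[0.0] * 8 for _ in range(8)]
    mixed = cutmix(real, fake, mask)
    if rng.random() < r_mix_schedule(epoch=5, warmup_epochs=10):
        print("CutMix batch used; decoder target is M, encoder target is", c)
```

Here the mask itself serves as the per-pixel target for the decoder head, and the recomputed fraction c serves as the soft target for the encoder head.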
[0118] The original training parameters of BigGAN may be adopted. In particular, one may
use a uniformly distributed noise vector z ∈ [-1, 1]^140 as input to the generator, and
the Adam optimizer with learning rates of 1e-4 and 5e-4 for G and D^U, respectively. It
was found beneficial in experiments to operate with considerably smaller mini-batch
sizes than BigGAN, e.g., batch sizes between 20 and 50.
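The stated settings may be collected in a small configuration sketch. The constant and function names are hypothetical; only the latent dimension, the sampling range, the two learning rates, and the batch-size range are taken from the description, and no other optimizer settings are assumed.

```python
import random

LATENT_DIM = 140          # dimension of the noise vector z
LEARNING_RATE_G = 1e-4    # Adam learning rate for the generator G
LEARNING_RATE_D = 5e-4    # Adam learning rate for the discriminator
BATCH_SIZE = 32           # example value within the stated range of 20 to 50


def sample_noise(batch_size, rng, dim=LATENT_DIM):
    """Draw a batch of noise vectors uniformly from [-1, 1]^dim."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(batch_size)]


if __name__ == "__main__":
    z = sample_noise(BATCH_SIZE, random.Random(0))
    print(len(z), len(z[0]))
```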
[0119] Figure 8 schematically shows an example of an embodiment of a training system 840.
[0120] Figure 8 shows an autonomous apparatus, in this case an autonomous car 810', situated
in an environment 800, e.g., a traffic situation. In environment 800 there may be
various objects, both static and dynamic, that affect how the apparatus 810 may be
controlled. A similar apparatus, in this case car 810, may be used to obtain measured
sensor data. For example, shown in figure 8 is a pedestrian 812 crossing the environment
behind car 810. Apparatus 810 may be autonomous but does not need to be. Apparatuses
810 and 810' may be the same except for an update in controller 822.
[0121] Car 810 may comprise a sensor system 820, e.g., comprising one or more image sensors,
radar, lidar and so on, to sense the environment of the car, e.g., environment 800.
For example, sensor system 820 may be configured to produce measured sensor data comprising
information on environment 800. Car 810 may comprise one or more actuators to move
the car through environment 800, e.g., wheels and motor.
[0122] Sensor data obtained from sensor system 820 may be stored in a first training database
830. A training system 840, e.g., configured for an embodiment of a training method
for training a generator neural network, may be configured to train a generator to
generate synthesized sensor data which appears to be drawn from first training database
830. Training system 840 may be configured to obtain an initial training set from first
database 830 and train a generator network from the initial training set. For example,
training system 840 may produce a generator network for use in a generator system
842. Generator system 842 may be used to generate additional sensor data, e.g., synthesized
sensor data. The synthesized sensor data may be stored in a second training database
832. The second training database 832 may also comprise the original measured sensor
data, e.g., taken from first database 830.
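The data flow from the first database to the second database may be sketched as follows. The function `build_second_database`, the tagging of samples by origin, and the stand-in generator are hypothetical and serve only to illustrate how synthesized sensor data may be combined with the original measured sensor data.

```python
import random


def build_second_database(measured, generator, n_synth, rng, latent_dim=140):
    """Combine measured sensor data with data synthesized by a generator.

    `generator` is any callable mapping a noise vector to a synthesized
    sample; here it stands in for a trained generator network.
    """
    synthesized = []
    for _ in range(n_synth):
        z = [rng.uniform(-1.0, 1.0) for _ in range(latent_dim)]
        synthesized.append(generator(z))
    # Keep track of the origin so that later stages can choose to use the
    # synthesized data for training, for testing, or for both.
    return ([("measured", s) for s in measured]
            + [("synthesized", s) for s in synthesized])


if __name__ == "__main__":
    first_database = ["frame_0", "frame_1", "frame_2"]
    stand_in_generator = lambda z: sum(z) / len(z)   # placeholder only
    second_database = build_second_database(
        first_database, stand_in_generator, n_synth=5, rng=random.Random(0))
    print(len(second_database))
```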
[0123] The synthesized training data may be generated with or without the use of class-labels.
For example, the measured training data in first database 830 may be labeled, e.g.,
by apparatus 810, by sensor 820, by a further device (not shown), or by a human. The
class labels may be used to generate synthesized sensor data of a particular kind,
e.g., with a nearby pedestrian. An unconditional generator neural network may be configured
to receive as input measured sensor data or a noise vector, or both. A conditional
generator neural network may likewise be configured to receive as input measured sensor
data or a noise vector, or both. Both types may be trained for pure generation, for domain
transfer, or for a combination, e.g., generation in the context of measured sensor
data.
[0124] A machine learning system 850 may be configured to train a machine learnable model
on the training data in second database 832. For example, the machine learnable model
may be a classifier. The machine learnable model may comprise a neural network, but
this is not necessary; for example, it may comprise an SVM, random forests, and so
on. Machine learning system 850 may be configured with a learning algorithm consistent
with the type of machine learnable model, e.g., SVM training or random forest training.
Machine learning system 850 may use the synthesized sensor data for training, for
testing, or for both. Machine learning system 850 produces a trained classifier
852. For example, classifier 852 may be configured to classify an object in the environment
of the apparatus from the measured sensor data.
[0125] The classifier 852 may be included in a controller for an autonomous apparatus 810',
e.g., like car 810. For example, a controller 822 may comprise classifier 852. Controller
822 may be configured to generate a control signal to control the autonomous apparatus
810'. Controller 822 may be configured to generate a control signal at least from
the object classified by the classifier. For example, if classifier 852 classifies
that environment 800 comprises a pedestrian like 812, then it is not safe to reverse
the car. The control signal may be configured to control the actuators, e.g., turning
and steering of the wheels and/or motor.
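A minimal sketch of how a controller might derive a control signal from the classifier output is given below. The function name, the class label "pedestrian", the action names, and the returned dictionary layout are all hypothetical; an actual controller would take many more inputs into account.

```python
def control_signal(detected_classes, requested_action):
    """Derive a control signal from classified objects (illustrative rule).

    If a pedestrian has been classified in the environment, a request to
    reverse the car is replaced by holding position.
    """
    if requested_action == "reverse" and "pedestrian" in detected_classes:
        return {"action": "hold", "reason": "pedestrian detected"}
    return {"action": requested_action, "reason": "no conflict"}


if __name__ == "__main__":
    print(control_signal({"pedestrian", "car"}, "reverse"))
    print(control_signal({"traffic sign"}, "reverse"))
```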
[0126] It will be appreciated that the presently disclosed subject matter also extends to
computer programs, particularly computer programs on or in a carrier, adapted for
putting the presently disclosed subject matter into practice. The program may be in
the form of source code, object code, a code intermediate between source code and object
code, such as a partially compiled form, or in any other form suitable for use in the implementation
of an embodiment of the method. An embodiment relating to a computer program product
comprises computer executable instructions corresponding to each of the processing
steps of at least one of the methods set forth. These instructions may be subdivided
into subroutines and/or be stored in one or more files that may be linked statically
or dynamically. Another embodiment relating to a computer program product comprises
computer executable instructions corresponding to each of the devices, units and/or
parts of at least one of the systems and/or products set forth.
[0127] Figure 7a shows a computer readable medium 1000 having a writable part 1010 comprising
a computer program 1020, the computer program 1020 comprising instructions for causing
a processor system to perform a training method according to an embodiment. The computer
program 1020 may be embodied on the computer readable medium 1000 as physical marks
or by magnetization of the computer readable medium 1000. However, any other suitable
embodiment is conceivable as well. Furthermore, it will be appreciated that, although
the computer readable medium 1000 is shown here as an optical disc, the computer readable
medium 1000 may be any suitable computer readable medium, such as a hard disk, solid
state memory, flash memory, etc., and may be non-recordable or recordable. The computer
program 1020 comprises instructions for causing a processor system to perform said
training method.
[0128] Figure 7b shows a schematic representation of a processor system 1140 according to
an embodiment of a training system, or generator system. The processor system comprises one or more
integrated circuits 1110. The architecture of the one or more integrated circuits
1110 is schematically shown in Figure 7b. Circuit 1110 comprises a processing unit
1120, e.g., a CPU, for running computer program components to execute a method according
to an embodiment and/or implement its modules or units. Circuit 1110 comprises a memory
1122 for storing programming code, data, etc. Part of memory 1122 may be read-only.
Circuit 1110 may comprise a communication element 1126, e.g., an antenna, connectors
or both, and the like. Circuit 1110 may comprise a dedicated integrated circuit 1124
for performing part or all of the processing defined in the method. Processor 1120,
memory 1122, dedicated IC 1124 and communication element 1126 may be connected to
each other via an interconnect 1130, say a bus. The processor system 1110 may be arranged
for contact and/or contact-less communication, using an antenna and/or connectors,
respectively.
[0129] For example, in an embodiment, processor system 1140, e.g., a training device, may
comprise a processor circuit and a memory circuit, the processor being arranged to
execute software stored in the memory circuit. For example, the processor circuit
may be an Intel Core i7 processor, ARM Cortex-R8, etc. In an embodiment, the processor
circuit may be ARM Cortex M0. The memory circuit may be a ROM circuit, or a non-volatile
memory, e.g., a flash memory. The memory circuit may be a volatile memory, e.g., an
SRAM memory. In the latter case, the device may comprise a non-volatile software interface,
e.g., a hard drive, a network interface, etc., arranged for providing the software.
[0130] While device 1140 is shown as including one of each described component, the various
components may be duplicated in various embodiments. For example, the processor 1120
may include multiple microprocessors that are configured to independently execute
the methods described herein or are configured to perform steps or subroutines of
the methods described herein such that the multiple processors cooperate to achieve
the functionality described herein. Further, where the device 1140 is implemented
in a cloud computing system, the various hardware components may belong to separate
physical systems. For example, the processor 1120 may include a first processor in
a first server and a second processor in a second server.
[0131] It should be noted that the above-mentioned embodiments illustrate rather than limit
the presently disclosed subject matter, and that those skilled in the art will be
able to design many alternative embodiments.
[0132] In the claims, any reference signs placed between parentheses shall not be construed
as limiting the claim. Use of the verb 'comprise' and its conjugations does not exclude
the presence of elements or steps other than those stated in a claim. The article
'a' or 'an' preceding an element does not exclude the presence of a plurality of such
elements. Expressions such as "at least one of" when preceding a list of elements
represent a selection of all or of any subset of elements from the list. For example,
the expression, "at least one of A, B, and C" should be understood as including only
A, only B, only C, both A and B, both A and C, both B and C, or all of A, B, and C.
The presently disclosed subject matter may be implemented by hardware comprising several
distinct elements, and by a suitably programmed computer. In the device claim enumerating
several parts, several of these parts may be embodied by one and the same item of
hardware. The mere fact that certain measures are recited in mutually different dependent
claims does not indicate that a combination of these measures cannot be used to advantage.
[0133] In the claims, references in parentheses refer to reference signs in drawings of exemplifying
embodiments or to formulas of embodiments, thus increasing the intelligibility of
the claim. These references shall not be construed as limiting the claim.