CROSS-REFERENCE TO RELATED APPLICATION
TECHNICAL FIELD
[0002] Embodiments of the present disclosure relate to the field of computers, and in particular
to a method and apparatus for labeling an object contour in a target image, a computer-readable
storage medium and an electronic device.
BACKGROUND
[0003] In the existing technology, it is often necessary to label and segment a target
object in an image. For example, the contour of a character in an image needs to be
labeled. In some cases, the contour of the character can be labeled by weakly supervised
learning methods.
[0004] However, existing weakly supervised learning methods usually adopt image-level classification
tags. If the above methods are adopted, the accuracy of labeling and segmenting objects
by the trained model is low.
[0005] In other words, in some cases in the existing technology, in the process of segmenting
and predicting the target object in the image to determine the contour of the target
object by weakly supervised learning, a problem of low accuracy of contour determination
is encountered.
SUMMARY
[0006] Embodiments of the present disclosure provide a method and apparatus for labeling
an object contour in a target image, a storage medium and an electronic device in
order to solve at least to a certain extent the problem of low accuracy in determining
a contour of a target object by weakly supervised learning in some cases in the existing
technology.
[0007] According to an embodiment of the present disclosure, a method for labeling an object
contour in a target image is provided. The method may include acquiring a target image
feature of a target image, where the target image includes a target object, and the
target object is of a target type. The method may further include inputting the target
image feature into a target generator, where the target generator is a generator in
a generative adversarial network trained by utilizing a sample image, the generative
adversarial network includes the target generator and a discriminator, the target
generator is configured to generate a first mask of the sample image upon acquiring
a first image feature of the sample image, and the discriminator is configured to,
upon receiving a sample image obtained after erasing pixels corresponding to the first
mask, identify the type of a sample object in the sample image obtained after erasing
the pixels, and the type of the sample object is used for training parameters in the target
generator. The method may further include acquiring a target mask of the target image
generated by the target generator, where the target mask is used for labeling a
contour of the target object.
[0008] According to another embodiment of the present disclosure, an apparatus for labeling
an object contour in a target image is provided. The apparatus may include a first
acquisition unit, a first input unit and a second acquisition unit. The first acquisition
unit is configured to acquire a target image feature of a target image, where the
target image includes a target object, and the target object is of a target type. The
first input unit is configured to input the target image feature into a target generator.
The target generator is a generator in a generative adversarial network trained by
using a sample image. The generative adversarial network includes the target generator
and a discriminator. The target generator is configured to generate a first mask of
the sample image upon acquiring a first image feature of the sample image. The discriminator
is configured to, upon receiving a sample image obtained after erasing pixels corresponding
to the first mask, identify the type of a sample object in the sample image obtained
after erasing the pixels. The type of the sample object is used for training parameters
in the target generator. The second acquisition unit is configured to acquire a target
mask of the target image generated by the target generator. The target mask is used
for labeling a contour of the target object.
[0009] According to yet another embodiment of the present disclosure, further provided is
a computer-readable storage medium having stored thereon computer programs which,
when executed by a processor, cause the processor to carry out any one of the above
methods.
[0010] According to yet another embodiment of the present disclosure, an electronic apparatus
is further provided. The electronic apparatus may include a memory and a processor,
where the memory stores computer programs which, when executed by the processor, cause
the processor to carry out any one of the above methods.
BRIEF DESCRIPTION OF DRAWINGS
[0011]
FIG. 1 is a schematic diagram of an application scenario of a method for labeling
an object contour in a target image according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of another application scenario of the method for labeling
an object contour in a target image according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of the method for labeling an object contour in a target image
according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a model of the method for labeling an
object contour in a target image according to an embodiment of the present disclosure;
FIG. 5 is a flowchart of another method for labeling an object contour in a target
image according to an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a model combination of a method for labeling
an object contour in a target image according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of an apparatus for labeling an object contour in a target
image according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of another apparatus for labeling an object contour in a
target image according to an embodiment of the present disclosure; and
FIG. 9 is a schematic diagram of an electronic device for labeling an object contour in a target image
according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
[0012] The embodiments of the present disclosure will be described in detail below
with reference to the accompanying drawings.
[0013] It is to be noted that the terms "first" and "second" in the description and claims
of the present disclosure and the accompanying drawings are used to distinguish similar
objects, and are not necessarily used to describe a specific order or precedence.
[0014] The method embodiments provided in the embodiments of the present disclosure may
be executed in mobile terminals, computer terminals or similar computing apparatuses.
By taking execution of the method in a mobile terminal as an example, FIG. 1 is a
hardware structure diagram of a mobile terminal for a method for labeling an object
contour in a target image according to an embodiment of the present disclosure. As
shown in FIG. 1, the mobile terminal may include one or more (only one is shown in
FIG. 1) processors 102 (the processor 102 may include, but is not limited to, a microcontroller
unit (MCU), a programmable logic device such as a field-programmable gate array (FPGA), or other processing apparatuses) and
a memory 104 configured to store data. The mobile terminal may further include a transmission
device 106 for communication functions and an input/output device 108. It should be
understood by a person having ordinary skills in the art that the structure shown
in FIG. 1 is merely illustrative and not intended to limit the structure of the mobile
terminal. For example, the mobile terminal may further include more or fewer components
than those shown in FIG. 1, or have a configuration different from that shown in FIG.
1.
[0015] The memory 104 may be configured to store computer programs, for example, software
programs and modules of applications, such as computer programs corresponding to the
method for labeling an object contour in a target image in the embodiments of the
present disclosure. The processor 102 executes various functional applications and
data processing, i.e., implementing the method described above, by running the computer
programs stored in the memory 104. The memory 104 may include high-speed random-access
memories, or may include non-volatile memories, such as one or more magnetic storage
devices, flash memories or other non-volatile solid-state memories. In some instances,
the memory 104 may further include memories remotely arranged relative to the processor
102. These remote memories may be connected to the mobile terminal via a network.
Examples of the network include, but are not limited to, the Internet, intranets, local area
networks, mobile communication networks, and combinations thereof.
[0016] The transmission device 106 is configured to receive or transmit data via a network.
Instances of the network may include wireless networks provided by a communication
provider of the mobile terminal. In one instance, the transmission device 106 includes
a Network Interface Controller (NIC), which can be connected to other network devices
through a base station to communicate with the Internet. In one instance, the transmission
device 106 may be a Radio Frequency (RF) module which is configured to communicate
with the Internet in a wireless manner.
[0017] The embodiments of the present disclosure can be run on the network architecture
shown in FIG. 2. As shown in FIG. 2, the network architecture includes: a terminal
202, a network 204 and a server 206. Data interaction can be performed between the
terminal 202 and the server 206 through the network 204.
[0018] The embodiment provides a method for labeling an object contour in a target image
that is executed in the mobile terminal or the network architecture. FIG. 3 is a flowchart
of the method for labeling an object contour in a target image according to the embodiment
of the present disclosure. As shown in FIG. 3, the flow may include steps S302 to
S306.
[0019] At S302, a target image feature of a target image is acquired, where the target image
includes a target object, and the target object is of a target type.
[0020] At S304, the target image feature is input into a target generator, where the target
generator is a generator in a generative adversarial network trained by utilizing
a sample image. The generative adversarial network includes the target generator and
a discriminator. The target generator is configured to generate a first mask of the
sample image upon acquiring a first image feature of the sample image. The discriminator
is configured to, upon receiving a sample image obtained after erasing pixels corresponding
to the first mask, identify the type of a sample object in the sample image obtained
after erasing the pixels. The type of the sample object is used for training parameters
in the target generator.
[0021] At S306, a target mask of the target image generated by the target generator is acquired,
where the target mask is used for labeling a contour of the target object.
[0022] By the above steps, since the target generator is used to generate the first mask
of the sample image and erase the pixels corresponding to the first mask in the process
of training the target generator, the image can be identified as a whole in the process
of training the discriminator, thereby facilitating the target generator to generate
a more accurate mask, improving the accuracy of the target generator and improving
the accuracy of labeling the contour of the target object in the target image by the
target generator. Therefore, the problem of low accuracy in identifying the object
contour can be solved, and the effect of improving the accuracy of identifying the
object contour can be achieved.
[0023] The execution body of the above steps may be, but is not limited to, a base station,
a terminal, a server, and the like.
[0024] The objective of the target generator is to generate a better segmentation mask of
an input image, so that the discriminator cannot determine the type of the target
object in the image obtained after erasing the pixels corresponding to the mask. The objective of the discriminator
is to identify the target type of the target object in the image as completely as
possible. In other words, for an image that contains an object, the target generator
is to generate a good enough first mask and erase the pixels in the image corresponding
to the first mask, so that the discriminator cannot determine the type of the object
in the image. The discriminator is to determine the type of the object in the image
by utilizing the content of the object that is not erased in the image.
[0025] The embodiments of the present disclosure may be applied to, but not limited to,
the process of identifying a contour of an object in an image. For example, for an
input image that contains a target object labeled with a target type, in the embodiments
of the present disclosure, a target mask can be generated by the trained generator
by inputting the image into the generator. The contour of the target object in the
image is labeled in the target mask, so that the image is semantically segmented.
[0026] For example, for an image that contains a cat and is labeled with a cat tag, after
the image is input into the target generator, a target mask is generated by the target
generator, and the contour of the cat is labeled in the target mask.
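For illustration only, the labeling process described above may be sketched as follows. The sketch assumes a PyTorch-style implementation; the toy FeatureExtractor and SegmentationGenerator modules, the 224x224 input size and the 0.5 binarization threshold are illustrative assumptions rather than requirements of the present disclosure.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Minimal stand-in for the pre-trained feature extraction network (the target model)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.conv(x)

class SegmentationGenerator(nn.Module):
    """Minimal stand-in for the target generator: upsamples the feature back to image size."""
    def __init__(self):
        super().__init__()
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )
    def forward(self, f):
        return self.deconv(f)

target_model = FeatureExtractor()
target_generator = SegmentationGenerator()

target_image = torch.rand(1, 3, 224, 224)  # a target image containing a target object of a target type
with torch.no_grad():
    target_image_feature = target_model(target_image)      # S302: acquire the target image feature
    target_mask = target_generator(target_image_feature)   # S304/S306: acquire the target mask
binary_mask = (target_mask > 0.5).float()  # the mask foreground labels the contour of the target object
```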
[0027] Prior to inputting the target image feature into the target generator, the method
further includes: acquiring the sample image; acquiring a first image feature of the
sample image; inputting the first image feature into the generative adversarial network,
and generating the first mask of the sample image by the target generator; erasing
pixels corresponding to the first mask to obtain a first image; inputting the first
image and the sample image into the discriminator to train the discriminator; and,
inputting the first image into the target generator to train the target generator.
[0028] In other words, in the embodiments of the present disclosure, the target generator
and the discriminator are pre-trained networks. The sample image is used during pre-training.
The sample image includes a first object labeled with a type. After the sample image
is input into the target generator, the target generator will generate a mask of the
sample image, and a target position in the image is labeled in the mask. Then, the
pixels at the target position are erased to obtain a first image, and the first image
and the sample image are input into the discriminator to train the discriminator.
After the discriminator is trained, the discriminator can output the type of the first
object in the first image. A target generator with a better output mask can be obtained
by training the target generator using the type and the first image.
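For illustration only, the erasing step that produces the first image may be sketched as follows, assuming images and masks are PyTorch tensors and that an erased pixel is simply set to zero (the zero-fill choice and the tensor shapes are assumptions, not requirements of the present disclosure).

```python
import torch

def erase_masked_pixels(sample_image: torch.Tensor, first_mask: torch.Tensor) -> torch.Tensor:
    """Erase the pixels of sample_image that fall inside first_mask to obtain the first image.

    sample_image: (N, 3, H, W) batch of sample images.
    first_mask:   (N, 1, H, W) soft mask in [0, 1] produced by the target generator.
    """
    # Keep pixels outside the mask and suppress pixels inside it.
    return sample_image * (1.0 - first_mask)

sample_image = torch.rand(2, 3, 224, 224)
first_mask = torch.rand(2, 1, 224, 224)
first_image = erase_masked_pixels(sample_image, first_mask)  # input to the discriminator
```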
[0029] Alternatively, in the embodiments of the present disclosure, it is also possible
that, a plurality of sample images are acquired, and a part of the plurality of sample
images are input into the target generator to generate first masks by the target generator.
Then, pixels corresponding to the first masks of this part of sample images are erased;
and, this part of sample images obtained after erasing the pixels and the remaining
sample images that are not input into the target generator are input into the discriminator
to train the discriminator.
[0030] The inputting the first image and the sample image into the discriminator to train
the discriminator includes: calculating a first loss of the discriminator after inputting
the first image and the sample image into the discriminator; and, adjusting parameters
in the discriminator by utilizing the first loss. In an embodiment, the parameters
in the discriminator can be adjusted in a case where the first loss is greater than
a first threshold. The first loss of the discriminator after the parameters are adjusted
is less than or equal to the first threshold.
[0031] During this process, the first loss of the discriminator needs to be calculated.
The larger the first loss is, the worse the convergence effect of the model is. Therefore,
it is necessary to adjust the values of parameters in the model. The values of the
parameters are continuously adjusted and the loss is continuously calculated until
the first loss is less than or equal to the first threshold, indicating that the parameters
of the model are appropriate and the model has sufficiently converged.
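For illustration only, this threshold-based adjustment of the discriminator may be sketched as follows. The toy discriminator, the use of a cross entropy as the first loss, stochastic gradient descent as the adjustment method, and the threshold value of 0.5 are all illustrative assumptions.

```python
import torch
import torch.nn as nn

discriminator = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy discriminator
optimizer = torch.optim.SGD(discriminator.parameters(), lr=0.01)
cross_entropy = nn.CrossEntropyLoss()
first_threshold = 0.5                      # illustrative value, not specified by the disclosure

first_image = torch.rand(8, 3, 32, 32)     # sample images with the masked pixels erased
sample_image = torch.rand(8, 3, 32, 32)    # original sample images
labels = torch.randint(0, 10, (8,))        # types of the sample objects

images = torch.cat([first_image, sample_image])
targets = torch.cat([labels, labels])

for _ in range(1000):                      # adjust parameters while the first loss exceeds the threshold
    first_loss = cross_entropy(discriminator(images), targets)
    if first_loss.item() <= first_threshold:
        break
    optimizer.zero_grad()
    first_loss.backward()
    optimizer.step()
```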
[0032] The inputting the first image into the target generator to train the target generator
includes: acquiring a first type of a first object in the first image output after
inputting the first image into the discriminator; calculating a second loss of the
target generator under the first type; and, adjusting parameters in the target generator
by utilizing the second loss. In an embodiment, the parameters in the target generator
can be adjusted in a case where the second loss is greater than a second threshold.
The second loss of the target generator after the parameters are adjusted is less
than or equal to the second threshold.
[0033] The loss also needs to be calculated in the process of training the target generator.
After the discriminator outputs the type of the first image, the second loss of the
target generator under this type is calculated. The larger the second loss is, the
worse the convergence effect of the model is. Therefore, it is necessary to adjust
the values of parameters in the model. The values of the parameters are continuously
adjusted and the loss is continuously calculated until the second loss is less than
or equal to the second threshold, indicating that the parameters of the model are
appropriate and the model has sufficiently converged.
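For illustration only, the corresponding adjustment of the target generator may be sketched as follows. The toy networks, the zero-fill erasing, the threshold value, and the use of a cross entropy computed on negated logits (i.e., after a softmin) as the second loss are illustrative assumptions consistent with, but not mandated by, the present disclosure.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the target generator and the trained discriminator.
target_generator = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.Sigmoid())
discriminator = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.SGD(target_generator.parameters(), lr=0.01)
cross_entropy = nn.CrossEntropyLoss()
second_threshold = 0.5                                       # illustrative value

sample_image = torch.rand(8, 3, 32, 32)

with torch.no_grad():
    first_image = sample_image * (1.0 - target_generator(sample_image))
    first_type = discriminator(first_image).argmax(dim=1)    # first type output by the discriminator

for _ in range(1000):   # adjust parameters while the second loss exceeds the threshold
    first_mask = target_generator(sample_image)              # first mask of the sample image
    first_image = sample_image * (1.0 - first_mask)          # erase the pixels under the mask
    logits = discriminator(first_image)
    # Cross entropy on negated logits, i.e. after a softmin: the generator is pushed to
    # make the discriminator unable to recognize the first type in the erased image.
    second_loss = cross_entropy(-logits, first_type)
    if second_loss.item() <= second_threshold:
        break
    optimizer.zero_grad()
    second_loss.backward()
    optimizer.step()
```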
[0034] The acquiring a target image feature of a target image includes: acquiring the target
image; inputting the target image into a target model, the target model being a model
obtained after deleting a fully-connected layer of a pre-trained first model; and,
acquiring the target image feature of the target image output by the target model.
[0035] That is, the target image feature of the target image is obtained through a target
model. The target model is a pre-trained model. By adopting the target model, after
the target image is input, the target model can obtain the target image feature of
the target image.
[0036] Prior to inputting the target image into the target model, the method further includes:
acquiring the sample image; training a second model by utilizing the sample image
to obtain the trained first model; and, deleting a fully-connected layer of the first
model to obtain the target model.
[0037] That is, in the embodiments of the present disclosure, the target model is a model
obtained by training the second model using the sample image to obtain the first model
and then deleting the fully-connected layer of the first model.
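For illustration only, obtaining the target model from the first model may be sketched as follows, assuming the first model is a torchvision ResNet-18 classifier (the choice of torchvision and ResNet-18, and the omission of the classification training itself, are illustrative assumptions).

```python
import torch
import torch.nn as nn
from torchvision import models

# The "second model": a conventional image classification network before training.
second_model = models.resnet18(num_classes=10)

# ... here the second model would be trained on the sample images to obtain the trained
# "first model"; the classification training loop is omitted in this sketch.
first_model = second_model

# The target model: the first model with its fully-connected layer deleted, so that it
# outputs a convolution feature instead of class scores. In practice the global average
# pooling may also be removed (children()[:-2]) to retain a spatial feature map; that
# choice is an implementation detail not specified here.
target_model = nn.Sequential(*list(first_model.children())[:-1])

target_image = torch.rand(1, 3, 224, 224)
target_image_feature = target_model(target_image)   # shape (1, 512, 1, 1) for ResNet-18
```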
[0038] The method for labeling an object contour in a target image will be described below
by way of an example.
[0039] The idea of the embodiments of the present disclosure is as follows. When a neural
network is trained for image classification, the trained network often does not focus
on all the features at the position of the whole target. Therefore, during the training
process, the pixels corresponding to the features that the network focuses on are erased
from the input image. In order to better identify the target in the image, the neural
network has to focus on the features of other parts at the position of the target in
the image. By continuously iterating in the above way, the neural network will finally
focus on all the features at the position of the whole target. This position distribution
is consistent with the distribution of the semantic segmentation mask of the object,
so that the target segmentation mask in the image is finally obtained by a classification
tag. The above idea is achieved in an adversarial training manner in the embodiments
of the present disclosure. The target generator is configured to generate the first
mask of the target in the sample image, and the discriminator is configured to determine
the type of the sample image obtained after the pixels corresponding to the first
mask are erased. During the training process, the target generator will confront the
discriminator to generate a better first mask to decrease the number of pixels of
the target in the image and weaken the discriminator's perception of the target in
the image. The discriminator will gradually focus on all the features at the position
of the target in the image in order to better identify the object in the image. Finally,
after the Nash equilibrium is reached, the target generator generates a good enough
mask, so that the discriminator cannot determine the type of the image obtained after
the mask is erased. The Nash equilibrium is a combination of best strategies, meaning
that the parameters in both the target generator and the discriminator are the best.
Specifically, it means that the target generator generates a first mask that just
blocks the object in the image. After the pixels corresponding to the first mask are
erased, the object in the image is just erased. Since the object has been erased,
the discriminator cannot determine the type of the object. However, if the first mask
does not completely block the object, the discriminator can determine the type of
the object by identifying a part of the object that is not blocked.
[0040] As shown in FIG. 4, the network structure in the embodiments of the present disclosure
can be mainly divided into three parts, i.e., a pre-trained feature extraction network,
a semantic segmentation generation network (target generator) and a discrimination
network (discriminator).
[0041] The pre-trained feature extraction network may adopt a conventional image classification
network (a second model) (e.g., Inception, ResNet, and the like, but not limited thereto),
and is pre-trained on a data set. After the network is trained to convergence (after
the first model is obtained), the last fully-connected layer of the network is deleted
to obtain the target model. A convolution feature output by the target model serves
as the input of the target generator. During the training of the second model, the
convolution in the pre-trained feature extraction network will be replaced with dilated
convolutions with different dilation parameters. Since a dilated convolution has
a larger receptive field than a general convolution, the second model can
have a more comprehensive perception of the target in the image, so that the perception
range of the trained target model tends to be closer to the semantic segmentation
mask of the target, facilitating the subsequent semantic segmentation generation network
to converge faster, and ensuring the stability of adversarial training.
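For illustration only, replacing an ordinary convolution with dilated convolutions having different dilation coefficients may be sketched as follows; the channel sizes and dilation values are illustrative assumptions.

```python
import torch
import torch.nn as nn

# An ordinary 3x3 convolution with a 3x3 receptive field.
ordinary_conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Dilated 3x3 convolutions with different dilation coefficients; a dilation of d gives an
# effective receptive field of (2*d + 1) x (2*d + 1) with the same number of parameters.
dilated_convs = nn.ModuleList([
    nn.Conv2d(64, 64, kernel_size=3, padding=d, dilation=d) for d in (1, 2, 4)
])

x = torch.rand(1, 64, 56, 56)
outputs = [conv(x) for conv in dilated_convs]   # all outputs keep the 56x56 spatial size
```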
[0042] The semantic segmentation generation network, also known as a segmentation prediction
generation network (target generator), uses the convolution feature of the pre-trained
feature extraction network as an input. The network gradually increases the width
and height of a feature map by a deconvolution layer until the size is consistent
with the size of the image input into the pre-trained network. Finally, semantic segmentation
prediction is performed on the target in the image.
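For illustration only, such a segmentation prediction generation network may be sketched as follows, assuming the pre-trained feature extraction network outputs a 512-channel feature map at 1/16 of the input resolution (these sizes are illustrative assumptions).

```python
import torch
import torch.nn as nn

class SegmentationGenerator(nn.Module):
    """Upsamples the convolution feature back to the input image size with deconvolution layers."""
    def __init__(self, in_channels: int = 512):
        super().__init__()
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 256, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, feature: torch.Tensor) -> torch.Tensor:
        return self.deconv(feature)   # each deconvolution layer doubles the width and height

feature = torch.rand(1, 512, 14, 14)       # e.g. a 224x224 input reduced by a factor of 16
mask = SegmentationGenerator()(feature)    # shape (1, 1, 224, 224): the predicted segmentation mask
```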
[0043] The discrimination network also adopts a conventional image classification network
(e.g., Inception, ResNet, and the like, but not limited thereto) to determine the
target in the image (including background type). The input of the discrimination network
mainly includes two parts (as shown in FIG. 4): the image obtained after erasing the
pixels corresponding to the prediction mask, and a real image B. Similarly, in order
to improve the identification accuracy of the discrimination network, the discrimination
network also adopts dilated convolutions with different dilation coefficients to
perceive the target in the image more comprehensively.
[0044] The training process is mainly divided into two steps. Firstly, the feature extraction
network is pre-trained on a data set. After the network is trained to convergence,
the fully-connected layer of the network is deleted, and a last convolution activation
feature of the network is used as the input of the segmentation prediction network.
Then, adversarial training is performed on the segmentation prediction generation
network and the discrimination network. As shown in FIG. 4, after an image A passes
through the feature extraction network and the segmentation prediction generation
network, a mask M of a target in the image A is predicted, and pixels corresponding
to the mask M in the image A are erased to obtain an image A'. Firstly, the discriminator
is trained using A' and the real image B by minimizing a classification loss; and
then, the target generator is trained using A' by minimizing a non-current classification
loss. The convergence is finally realized by repetitive iteration.
[0045] As shown in FIG. 5, at S502, the sample image needs to be acquired before training.
During the sample collection process, in order to achieve a better segmentation
effect in training, it is necessary to collect as many sample images as possible in
the application scenario. The image may also be acquired in the following ways: utilizing
images in various public indoor scenarios containing characters; acquiring pictures
in an actual application scenario; purchasing from a third-party data company; generating
by an image generation algorithm (e.g., a GAN); acquiring by a web crawler for academic
purposes; or the like.
[0046] At S504, after the data is acquired, the data needs to be cleaned and calibrated.
In order to better train the network, it is necessary to verify and check the acquired
data to ensure the completeness, uniformity, correctness or the like of samples. The
completeness means that the data set should contain all possible scenarios among application
scenarios to ensure the generalization ability of the trained model. For example,
the motion blur in the sample image caused by excessively fast motion of the target should
also be contained in the data set. The uniformity means that different types of samples
in the data set should be consistent in quantity as far as possible and should not
differ greatly. The correctness means that data labeling should have a clear labeling
standard to avoid confusion of labeling.
[0047] After the sample data is acquired, S506 may be executed to train the model. Firstly,
the second model is trained to obtain a first model so as to obtain a target model.
[0048] Classification training is performed on the pre-trained feature extraction network
(the second model) on the data set. During training, the loss is calculated by a cross
entropy through the following calculation formula:

\[ \mathrm{loss}(z, c) = -\log\!\left(\frac{\exp(z_c)}{\sum_{j}\exp(z_j)}\right) = -z_c + \log\sum_{j}\exp(z_j) \]

where z is a non-softmax output predicted by the network, and c is the type of the
label.
[0049] After the model is trained to convergence, the first model is obtained, and the fully-connected
layer for final classification in the first model is deleted to obtain a target model.
The output of the target model is used as the input of the subsequent segmentation
prediction network. If the amount of data is insufficient, the input data can be augmented
to improve the performance of the network.
[0050] Next, the target generator and the discriminator need to be trained. After the pre-trained
feature extraction network is trained, adversarial training is performed on the segmentation
prediction generation network and the discrimination network. At this time, the pre-trained
feature extraction network and the segmentation prediction network are trained as
a whole. However, compared with the segmentation prediction network, the pre-trained
feature extraction network will have a smaller learning rate.
- (1) For each data set batch, the discriminator is firstly trained, and real data and
the image obtained after erasing pixels corresponding to the mask are input into the
discriminator. After the discriminator outputs for the real data and the erased image pass through a softmax function, the loss
is calculated by a cross entropy.
- (2) The target generator is then trained. The image obtained after erasing the pixels
corresponding to the predicted mask is input into the discriminator, and the output of
the discriminator then passes through a softmin function. For the type output by the
discriminator for this erased image, the loss is calculated by a cross entropy (see the
sketch after step (3) below).
- (3) The steps (1) and (2) are repeated until the model converges.
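For illustration only, one iteration of steps (1) and (2) may be sketched as follows, under the same assumptions as the earlier sketches (toy networks, zero-fill erasing, and a cross entropy computed on negated logits to realize the softmin).

```python
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.Sigmoid())   # toy target generator
discriminator = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy discriminator
opt_d = torch.optim.SGD(discriminator.parameters(), lr=0.01)
opt_g = torch.optim.SGD(generator.parameters(), lr=0.01)
cross_entropy = nn.CrossEntropyLoss()

image_a = torch.rand(8, 3, 32, 32)        # images A passed through the generator
label_a = torch.randint(0, 10, (8,))      # classification tags of images A
image_b = torch.rand(8, 3, 32, 32)        # real images B
label_b = torch.randint(0, 10, (8,))      # classification tags of images B

# Step (1): train the discriminator on the erased image A' and the real image B.
mask_m = generator(image_a).detach()
image_a_erased = image_a * (1.0 - mask_m)                        # A': pixels under mask M erased
loss_d = (cross_entropy(discriminator(image_a_erased), label_a)
          + cross_entropy(discriminator(image_b), label_b))
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Step (2): train the generator on A'. The logits are negated so that the cross entropy
# acts after a softmin, i.e. the generator tries to make the erased image unrecognizable.
mask_m = generator(image_a)
image_a_erased = image_a * (1.0 - mask_m)
loss_g = cross_entropy(-discriminator(image_a_erased), label_a)
opt_g.zero_grad()
loss_g.backward()
opt_g.step()

# Step (3): repeat steps (1) and (2) over the data set batches until the model converges.
```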
[0051] The above process completes the training of the model.
[0052] Subsequently, at S508, the deployment and verification of the model may be performed.
After training, if it is necessary to deploy the network, as shown in FIG. 6, the
feature extraction network (target model) and the segmentation prediction generation
network (target generator) can be combined to obtain a complete segmentation prediction
network. The corresponding result of semantic segmentation prediction can be obtained
simply by inputting the original image data. The network can be applied to most semantic
segmentation application scenarios. For example, for an image that contains an object,
the result of semantic segmentation prediction may include the contour of the object
in the image, so that the object in the image can be labeled.
[0053] In order to verify the actual effect of the model, the mask output by the network
will be compared with the manually labeled actual mask. The prediction quality of
the mask can be evaluated by the Mean Intersection over Union (MIoU), which is defined as
follows:

\[ \mathrm{MIoU} = \frac{1}{N+1}\sum_{i=0}^{N}\frac{p_{ii}}{\sum_{j=0}^{N} p_{ij} + \sum_{j=0}^{N} p_{ji} - p_{ii}} \]

where N+1 denotes the number of classes (including the null class); N is an integer;
p_{ij} denotes the number of pixels in the image that are actually of class i but predicted
as class j; p_{ii} denotes the number of pixels that are actually of class i and predicted
as class i; and i and j are integers.
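For illustration only, computing the MIoU from a predicted class map and the manually labeled class map may be sketched as follows; skipping classes that are absent from both maps is a practical convention assumed here to avoid division by zero, and is not part of the formula above.

```python
import torch

def mean_iou(prediction: torch.Tensor, target: torch.Tensor, num_classes: int) -> float:
    """Compute the MIoU between a predicted class map and the manually labeled class map.

    prediction, target: integer tensors of shape (H, W) with values in [0, num_classes).
    num_classes: N + 1, the number of classes including the null class.
    """
    ious = []
    for i in range(num_classes):
        p_ii = ((prediction == i) & (target == i)).sum().item()   # pixels of class i predicted as i
        predicted_as_i = (prediction == i).sum().item()           # sum over j of p_ji
        actually_i = (target == i).sum().item()                   # sum over j of p_ij
        union = actually_i + predicted_as_i - p_ii
        if union > 0:
            ious.append(p_ii / union)                              # per-class intersection over union
    return sum(ious) / len(ious) if ious else 0.0

prediction = torch.randint(0, 3, (224, 224))
target = torch.randint(0, 3, (224, 224))
print(mean_iou(prediction, target, num_classes=3))
```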
[0054] In the method provided in the embodiment of the present disclosure, the generative
adversarial network is trained by erasing pixels corresponding to the mask. Compared
with supervised semantic segmentation training methods, in the embodiment of the present
disclosure, only classification tags are used, so that the workload of data labeling for
the neural network is greatly reduced, and the labor cost is reduced.
other semi-supervised semantic segmentation methods based on the class activation
map, in the embodiment of the present disclosure, excessive artificial prior experience
is not required, and no additional parameters need to be added, so that in a case
where the amount of training data is the same, the trained network has higher robustness
and better network universality, and the result of identification of the contour of
the object in the image is more accurate.
[0055] From the foregoing description of the implementations, it should be clearly understood
by those having ordinary skills in the art that the method according to the above
embodiments may be implemented by software and necessary general-purpose hardware
platforms. Of course, the method may also be implemented by hardware. However, in
many cases, the former is preferred. Based on this understanding, the technical schemes
of the present disclosure may be essentially embodied in form of software products,
or some of the technical schemes that contribute to the prior art may be embodied
in form of software products. The computer software products are stored in a storage
medium (e.g., ROM/RAM, magnetic disks or optical disks), and include a number of
instructions which, when executed by a terminal device (which may be a mobile terminal,
a computer, a server, a network device, and the like), cause the terminal device to
carry out the method according to the embodiments of the present disclosure.
[0056] In the embodiment, an apparatus for labeling an object contour in a target image
is further provided. The apparatus is configured to implement the above embodiments
and some implementations, and the description that has been given will not be repeated
here. As used hereinafter, the term "module" may be a combination of software and/or
hardware that implements predetermined functions. Although the apparatus described
in the following embodiments is preferably implemented by software, it is possible
and contemplated to implement the apparatus by hardware or a combination of software
and hardware.
[0057] FIG. 7 is a block diagram of an apparatus for labeling an object contour in a target
image according to an embodiment of the present disclosure. As shown in FIG. 7, the
apparatus may include a first acquisition unit 702, a first input unit 704 and a second
acquisition unit 706.
[0058] The first acquisition unit 702 is configured to acquire a target image feature of
a target image. The target image includes a target object, the target object being
of a target type.
[0059] The first input unit 704 is configured to input the target image feature into a target
generator. The target generator is a generator in a generative adversarial network
trained by utilizing a sample image. The generative adversarial network includes the
target generator and a discriminator. The target generator is configured to generate
a first mask of the sample image upon acquiring a first image feature of the sample
image. The discriminator is configured to, upon receiving a sample image obtained
after pixels corresponding to the first mask are erased, identify the type of a sample
object in the sample image obtained after the pixels are erased. The type is used
for training parameters in the target generator.
[0060] The second acquisition unit 706 is configured to acquire a target mask of the target
image generated by the target generator. The target mask is used for labeling a contour
of the target object.
[0061] FIG. 8 is a block diagram of another apparatus for labeling an object contour in a target
image according to an embodiment of the present disclosure. As shown in FIG. 8, in
addition to all modules shown in FIG. 7, the apparatus may further include a third
acquisition unit 802, a fourth acquisition unit 804, a second input unit 806, an erasing
unit 808, a third input unit 810 and a fourth input unit 812.
[0062] The third acquisition unit 802 is configured to acquire the sample image before the
target image feature is input into the target generator.
[0063] The fourth acquisition unit 804 is configured to acquire a first image feature of
the sample image.
[0064] The second input unit 806 is configured to input the first image feature into the
generative adversarial network and generate the first mask of the sample image by
the target generator.
[0065] The erasing unit 808 is configured to erase pixels corresponding to the first mask
to obtain a first image.
[0066] The third input unit 810 is configured to input the first image and the sample image
into the discriminator to train the discriminator.
[0067] The fourth input unit 812 is configured to input the first image into the target
generator to train the target generator.
[0068] The third input unit includes a first calculation module and a first adjustment module.
The first calculation module is configured to calculate a first loss of the discriminator
after the first image and the sample image are input into the discriminator. The first
adjustment module is configured to adjust parameters in the discriminator in a case
where the first loss is greater than a first threshold. The first loss of the discriminator
after the parameters are adjusted is less than or equal to the first threshold.
[0069] The fourth input unit includes a first acquisition module, a second calculation module
and a second adjustment module. The first acquisition module is configured to acquire
a first type of a first object in the first image output after the first image is
input into the discriminator. The second calculation module is configured to calculate
a second loss of the target generator under the first type. The second adjustment
module is configured to adjust parameters in the target generator in a case where
the second loss is greater than a second threshold, where the second loss of the target
generator after the parameters are adjusted is less than or equal to the second threshold.
[0070] The fourth acquisition unit includes a second acquisition module, an input module
and a third acquisition module. The second acquisition module is configured to acquire
the target image. The input module is configured to input the target image into the
target model, the target model being a model obtained after deleting a fully-connected
layer of a pre-trained first model. The third acquisition module is configured to
acquire the target image feature of the target image output by the target model.
[0071] The fourth acquisition unit further includes a fourth acquisition module, a training
module and a deletion module. The fourth acquisition module is configured to acquire
the sample image before the target image is input into the target model. The
training module is configured to train a second model by utilizing the sample image
to obtain the trained first model. The deletion module is configured to delete the
fully-connected layer of the first model to obtain the target model.
[0072] It is to be noted that the above modules may be implemented by software or hardware.
In the latter case, the modules may be implemented in the following way, but not limited
to: the modules are located in a same processor; or, the modules are located in different
processors in any combination.
[0073] According to an embodiment of the present disclosure, further provided is a computer-readable
storage medium having computer programs stored thereon which, when executed by a processor,
cause the processor to carry out any one of the methods.
[0074] In an exemplary embodiment, the computer-readable storage medium may include, but
is not limited to: USB flash drives, read-only memories (ROMs), random access memories (RAMs),
mobile hard disks, magnetic disks, optical disks, or various mediums that can store
computer programs.
[0075] With reference to FIG. 9, according to an embodiment of the present disclosure, further
provided is an electronic device. The electronic device may include a memory 902 and
a processor 901. The memory 902 stores computer programs which, when executed by the
processor 901, cause the processor to carry out any one of the methods.
[0076] In an exemplary embodiment, the electronic device may further include a transmission
device and an input/output device. The transmission device is connected to the processor
901, and the input/output device is connected to the processor 901.
[0077] In accordance with the present disclosure, since the target generator is used to
generate the first mask of the sample image and erase the pixels corresponding to
the first mask in the process of training the target generator, the image can be identified
as a whole in the process of training the discriminator, thereby facilitating the
target generator to generate a more accurate mask, improving the accuracy of the target
generator and improving the accuracy of labeling the contour of the target object
in the target image by the target generator. Therefore, the problem of low accuracy in
identifying the object contour can be solved, and the effect of improving the accuracy
of identifying the object contour can be achieved.
[0078] The specific examples in the embodiment may refer to the examples described in the
above embodiments and exemplary implementations, and will not be repeated in the embodiment.
[0079] Apparently, it should be understood by those having ordinary skills in the art that,
the modules or steps in the present disclosure may be implemented by a general computing
device, and may be integrated in a single computing device or distributed on a network
consisting of a plurality of computing devices. The modules or steps may be implemented
by program codes that may be executed by a computing device, so that they may be stored
in a storage device and executed by the computing device. In addition, in some cases,
the shown or described steps may be executed in an order different from the order
described herein. Alternatively, the modules or steps may be respectively manufactured into
integrated circuit modules, or some of them may be manufactured into a single integrated
circuit module. Therefore, the present disclosure is not limited to any particular combination
of hardware and software.
[0080] The foregoing description merely shows some embodiments of the present disclosure
and is not intended to limit the present disclosure. Various alterations and variations
may be made to the present disclosure by those having ordinary skills in the art.
Any modifications, equivalent replacements and improvements made without departing
from the principle of the present disclosure shall fall into the protection scope
of the present disclosure.
1. A method for labeling an object contour in a target image, comprising:
acquiring a target image feature of a target image, wherein the target image comprises
a target object of a target type;
inputting the target image feature into a target generator, wherein the target generator
is a generator in a generative adversarial network trained by utilizing a sample image,
the generative adversarial network comprises the target generator and a discriminator,
the target generator is configured to generate a first mask of the sample image upon
acquiring a first image feature of the sample image, and the discriminator is configured
to, upon receiving a sample image obtained after pixels corresponding to the first
mask are erased, identify the type of a sample object in the sample image obtained
after the pixels are erased, and the type of the sample object is used for training parameters
in the target generator; and
acquiring a target mask of the target image generated by the target generator, wherein
the target mask is used for labeling a contour of the target object.
2. The method of claim 1, prior to the inputting the target image feature into the target
generator, further comprising:
acquiring a first image feature of the sample image;
inputting the first image feature into the target generator to generate the first
mask of the sample image;
erasing pixels corresponding to the first mask to obtain a first image;
inputting the first image and the sample image into the discriminator to train the
discriminator; and
inputting the first image into the target generator to train the target generator.
3. The method of claim 2, wherein the inputting the first image and the sample image
into the discriminator to train the discriminator comprises:
calculating a first loss of the discriminator after the first image and the sample
image are input into the discriminator; and
adjusting parameters in the discriminator by utilizing the first loss.
4. The method of claim 2, wherein the inputting the first image into the target generator
to train the target generator comprises:
acquiring a first type of a first object in the first image output after the first
image is input into the discriminator;
calculating a second loss of the target generator under the first type; and
adjusting parameters in the target generator by utilizing the second loss.
5. The method of claim 2, wherein the acquiring a target image feature of a target image
comprises:
acquiring the target image;
inputting the target image into a target model, the target model being a model obtained
after deleting a fully-connected layer of a pre-trained first model; and
acquiring the target image feature of the target image output by the target model.
6. The method of claim 5, prior to the inputting the target image feature into the target
generator, further comprising:
acquiring the sample image;
training a second model by utilizing the sample image to obtain the trained first
model, the second model being the first model before training; and
deleting a fully-connected layer of the first model to obtain the target model.
7. The method of claim 5 or 6, wherein convolution layers of the discriminator and the
first model comprise dilated convolutions with different dilation coefficients.
8. An apparatus for labeling an object contour in a target image, comprising:
a first acquisition unit, configured to acquire a target image feature of a target
image, wherein the target image comprises a target object of a target type;
a first input unit, configured to input the target image feature into a target generator,
wherein the target generator is a generator in a generative adversarial network trained
by utilizing a sample image, the generative adversarial network comprises the target
generator and a discriminator, the target generator is configured to generate a first
mask of the sample image upon acquiring a first image feature of the sample image,
and the discriminator is configured to, upon receiving a sample image obtained
after pixels corresponding to the first mask are erased, identify the type of a sample
object in the sample image obtained after the pixels are erased, and the type of the sample
object is used for training parameters in the target generator; and
a second acquisition unit, configured to acquire a target mask of the target image
generated by the target generator, wherein the target mask is used for labeling a
contour of the target object.
9. A computer-readable storage medium having computer programs stored thereon which,
when executed by a processor, cause the processor to carry out the method of any one
of claims 1 to 7.
10. An electronic device, comprising a memory and a processor, the memory storing computer
programs which, when executed by the processor, cause the processor to carry out the
method of any one of claims 1 to 7.