RELATED APPLICATION
FIELD OF THE TECHNOLOGY
[0002] The present disclosure relates to the field of artificial intelligence technologies,
and in particular, to a method for detecting a keypoint of a to-be-detected object,
a training method, an apparatus, an electronic device, a computer-readable storage
medium, and a computer program product.
BACKGROUND OF THE DISCLOSURE
[0003] In the related art, keypoint detection of a three-dimensional face character generally falls into two types. The first type is based on traditional geometric analysis, and the second type is based on deep learning. In the first type, a keypoint positioning method based on geometric analysis relies on manually set rules and is difficult to apply to head models of different forms, so the robustness of the method is poor. In the second type, a three-dimensional head model is basically first rendered into two-dimensional images, and a two-dimensional convolutional neural network is then used to extract features and detect the corresponding keypoints; as a result, three-dimensional geometric information is inevitably lost. For these reasons, the accuracy of keypoint detection of a three-dimensional face character in the related art is low.
SUMMARY
[0004] The embodiments of the present disclosure provide a method for detecting a keypoint
of a to-be-detected object, a method for training a three-dimensional network model,
an apparatus, an electronic device, a computer-readable storage medium, and a computer
program product, to improve accuracy of performing keypoint detection through a three-dimensional
network model.
[0005] Technical solutions of the embodiments of the present disclosure are implemented
as follows.
[0006] An embodiment of the present disclosure provides a method for detecting a keypoint
of a to-be-detected object, including:
obtaining a three-dimensional mesh representing the to-be-detected object, and determining
vertices of the three-dimensional mesh and a connection relationship between the vertices;
performing feature extraction on the vertices of the three-dimensional mesh, to obtain
a vertex feature of the three-dimensional mesh;
performing global feature extraction on the to-be-detected object based on the vertex
feature, to obtain a global feature of the to-be-detected object, and performing local
feature extraction on the to-be-detected object based on the vertex feature and the
connection relationship between the vertices, to obtain a local feature of the to-be-detected
object; and
performing feature splicing based on the vertex feature, the global feature, and the
local feature, and performing detection on a keypoint of the to-be-detected object,
to obtain a position of the keypoint of the to-be-detected object on the to-be-detected
object.
[0007] An embodiment of the present disclosure provides an apparatus for detecting a keypoint
of a to-be-detected object, including:
an obtaining module, configured to obtain a three-dimensional mesh representing the
to-be-detected object, and determine vertices of the three-dimensional mesh and a
connection relationship between the vertices;
a first feature extraction module, configured to perform feature extraction on the
vertices of the three-dimensional mesh, to obtain a vertex feature of the three-dimensional
mesh;
a second feature extraction module, configured to perform global feature extraction
on the to-be-detected object based on the vertex feature, to obtain a global feature
of the to-be-detected object, and perform local feature extraction on the to-be-detected
object based on the vertex feature and the connection relationship between the vertices,
to obtain a local feature of the to-be-detected object; and
an output module, configured to perform detection on a keypoint of the to-be-detected
object based on the vertex feature, the global feature, and the local feature, to
obtain a position of the keypoint of the to-be-detected object on the to-be-detected
object.
[0008] An embodiment of the present disclosure provides a method for training a three-dimensional
network model. The three-dimensional network model includes at least a first feature
extraction layer, a second feature extraction layer, a third feature extraction layer,
and an output layer. The method includes:
acquiring an object training sample carrying a label, where the label indicates a
real position of a keypoint of the object training sample;
obtaining a training three-dimensional mesh configured for representing the object
training sample, and determining vertices of the training three-dimensional mesh and
a connection relationship between the vertices;
performing feature extraction on the vertices of the object training sample via the
first feature extraction layer, to obtain a vertex feature of the training three-dimensional
mesh;
performing global feature extraction on the object training sample based on the vertex
feature of the training three-dimensional mesh via the second feature extraction layer,
to obtain a global feature of the object training sample, and performing local feature
extraction on the object training sample based on the vertices of the training three-dimensional
mesh and the connection relationship between the vertices via the third feature extraction
layer, to obtain a local feature of the object training sample;
performing detection on the keypoint of the object training sample via the output
layer based on the vertex feature of the training three-dimensional mesh, the global
feature of the object training sample, and the local feature of the object training
sample, to obtain a position of the keypoint of the object training sample on the
object training sample; and
acquiring a difference between the position of the keypoint of the object training
sample and the label, and training the three-dimensional network model based on the
difference, to obtain a target three-dimensional network model, the target three-dimensional
network model being configured for detecting a keypoint of a to-be-detected object,
to obtain a position of the keypoint of the to-be-detected object on the to-be-detected
object.
[0009] An embodiment of the present disclosure provides an apparatus for training a three-dimensional
network model. The three-dimensional network model includes at least a first feature
extraction layer, a second feature extraction layer, a third feature extraction layer,
and an output layer. The apparatus includes:
an acquiring module, configured to acquire an object training sample carrying a label,
where the label indicates a real position of a keypoint of the object training sample;
an obtaining module, configured to obtain a training three-dimensional mesh configured
for representing the object training sample, and determine vertices of the training
three-dimensional mesh and a connection relationship between the vertices;
a first feature extraction module, configured to perform feature extraction on the
vertices of the object training sample via the first feature extraction layer, to
obtain a vertex feature of the training three-dimensional mesh;
a second feature extraction module, configured to perform global feature extraction
on the object training sample based on the vertex feature of the training three-dimensional
mesh via the second feature extraction layer, to obtain a global feature of the object
training sample, and perform local feature extraction on the object training sample
based on the vertices of the training three-dimensional mesh and the connection relationship
between the vertices via the third feature extraction layer, to obtain a local feature
of the object training sample;
an output module, configured to perform detection on the keypoint of the object training
sample via the output layer based on the vertex feature of the training three-dimensional
mesh, the global feature of the object training sample, and the local feature of the
object training sample, to obtain a position of the keypoint of the object training
sample on the object training sample; and
an update module, configured to acquire a difference between the position of the keypoint
of the object training sample and the label, and train the three-dimensional network
model based on the difference, to obtain a target three-dimensional network model,
the target three-dimensional network model being configured for detecting a keypoint
of a to-be-detected object, to obtain a position of the keypoint of the to-be-detected
object on the to-be-detected object.
[0010] An embodiment of the present disclosure provides an electronic device, including:
a memory, configured to store executable instructions; and
a processor, configured to implement, when executing the computer-executable instructions
stored in the memory, the method for detecting a keypoint of a to-be-detected object
according to the embodiments of the present disclosure.
[0011] An embodiment of the present disclosure provides an electronic device, including:
a memory, configured to store executable instructions; and
a processor, configured to implement, when executing the computer-executable instructions
stored in the memory, the method for training a three-dimensional network model according
to the embodiments of the present disclosure.
[0012] An embodiment of the present disclosure provides a computer-readable storage medium,
having computer-executable instructions stored therein. The computer-executable instructions,
when executed by a processor, cause the processor to perform the method for detecting
a keypoint of a to-be-detected object according to the embodiments of the present
disclosure.
[0013] An embodiment of the present disclosure provides a computer-readable storage medium,
having computer-executable instructions stored therein. The computer-executable instructions,
when executed by a processor, cause the processor to perform the method for training
a three-dimensional network model according to the embodiments of the present disclosure.
[0014] An embodiment of the present disclosure provides a computer program product. The
computer program product includes a computer program or computer-executable instructions.
The computer program or the computer-executable instructions are stored in a computer-readable
storage medium. A processor of an electronic device reads the computer program or
the computer-executable instructions from the computer-readable storage medium, and
the processor executes the computer program or the computer-executable instructions,
so that the electronic device performs the method for detecting a keypoint of a to-be-detected
object according to the embodiments of the present disclosure.
[0015] An embodiment of the present disclosure provides a computer program product. The
computer program product includes a computer program or computer-executable instructions.
The computer program or the computer-executable instructions are stored in a computer-readable
storage medium. A processor of an electronic device reads the computer program or
the computer-executable instructions from the computer-readable storage medium, and
the processor executes the computer program or the computer-executable instructions,
so that the electronic device performs the method for training a three-dimensional
network model according to the embodiments of the present disclosure.
[0016] The embodiments of the present disclosure have the following beneficial effects:
[0017] A three-dimensional mesh corresponding to a to-be-detected object is obtained, a
global feature and a local feature of the to-be-detected object are separately extracted
through construction of a dual-path feature extraction layer based on a vertex feature
and a connection relationship between vertices obtained by using the three-dimensional
mesh, and then a position of a keypoint on the to-be-detected object is obtained based
on the vertex feature obtained by using the three-dimensional mesh and the global
feature and the local feature obtained through extraction.
In this way, richer feature information of the to-be-detected object is extracted via
a plurality of feature extraction layers, and then detection is performed on the keypoint
of the to-be-detected object based on the rich feature information, so that accuracy
of three-dimensional keypoint detection is significantly improved.
BRIEF DESCRIPTION OF THE DRAWINGS
[0018]
FIG. 1 is a schematic architectural diagram of a keypoint detection system 100 according
to an embodiment of the present disclosure.
FIG. 2 is a schematic structural diagram of an electronic device according to an embodiment
of the present disclosure.
FIG. 3 is a schematic flowchart of a method for detecting a keypoint of a to-be-detected
object according to an embodiment of the present disclosure.
FIG. 4 is a schematic diagram of a three-dimensional mesh of a human head according
to an embodiment of the present disclosure.
FIG. 5 is a schematic flowchart of determining a local feature of each vertex according
to an embodiment of the present disclosure.
FIG. 6 is a schematic diagram of determining a correlation degree between a reference
vertex and another vertex by using an attention mechanism according to an embodiment
of the present disclosure.
FIG. 7 is a schematic diagram of positions of keypoints on a to-be-detected object
according to an embodiment of the present disclosure.
FIG. 8 is a schematic structural diagram of a three-dimensional network model according
to an embodiment of the present disclosure.
FIG. 9 is a schematic structural diagram of a third feature extraction layer according
to an embodiment of the present disclosure.
FIG. 10 is a schematic structural diagram of a three-dimensional network model according
to an embodiment of the present disclosure.
FIG. 11 is a schematic flowchart of a training process of a three-dimensional network
model according to an embodiment of the present disclosure.
FIG. 12 is a schematic diagram of patch simplification of a three-dimensional mesh
according to an embodiment of the present disclosure.
FIG. 13 is a schematic diagram of patch densification of a three-dimensional mesh
according to an embodiment of the present disclosure.
FIG. 14 is a schematic flowchart of a method for detecting a keypoint of a to-be-detected
object according to an embodiment of the present disclosure.
FIG. 15 is a schematic structural diagram of a graph convolutional neural network
according to an embodiment of the present disclosure.
FIG. 16 is a comparison diagram of a geodesic distance and a Euclidean distance according
to an embodiment of the present disclosure.
DESCRIPTION OF EMBODIMENTS
[0019] To make objectives, technical solutions, and advantages of the present disclosure
clearer, the following further describes the embodiments of the present disclosure
in detail with reference to the accompanying drawings. The described embodiments are
not to be considered as a limitation to the embodiments of the present disclosure.
All other embodiments obtained by a person of ordinary skill in the art without creative
efforts shall fall within the protection scope of the present disclosure.
[0020] In the following descriptions, the term "some embodiments" describes subsets of all
possible embodiments, but "some embodiments" may be the same subset or different subsets
of all possible embodiments, and can be combined with each other without conflict.
[0021] In the following descriptions, the terms "first", "second", and "third" are merely
for distinguishing between similar objects rather than representing a specific order
of the objects. A specific order or sequence of "first", "second", and "third" is interchangeable in proper circumstances, so that the embodiments of the present disclosure described herein can be implemented in an order other than that illustrated or described herein.
[0022] Unless otherwise defined, meanings of all technical and scientific terms used in
this specification are the same as those usually understood by a person skilled in
the art to which the present disclosure belongs. Terms used in this specification
are merely intended to describe the objectives of the embodiments of the present disclosure,
but are not intended to limit the present disclosure.
[0023] Before the embodiments of the present disclosure are further described in detail,
nouns and terms in the embodiments of the present disclosure are described, and the
nouns and the terms in the embodiments of the present disclosure are applicable to
the following explanations.
- (1) Three-dimensional mesh: A three-dimensional mesh is a manifold surface having
a topology structure, for example, a spherical surface divided into a combination of a plurality of vertices and a plurality of edges. In the present disclosure, the
three-dimensional mesh may be a three-dimensional face mesh. Herein, the three-dimensional
mesh is a graph structure. The term "mesh" herein can be understood as "mesh model".
That is, the "mesh" in the present disclosure can be replaced by "mesh model".
- (2) Client: A client is a program that corresponds to a server and that provides a
local service for a user. Except for some applications that can only be run locally,
the client is generally installed on an ordinary client machine, and needs to be run
in cooperation with the server. In other words, a corresponding server and service
program need to exist in a network to provide a corresponding service. Therefore,
a specific communication connection needs to be established between the client and
the server, to ensure normal running of the application.
- (3) Three-dimensional face keypoint detection: Three-dimensional face keypoint detection
means detecting three-dimensional coordinates of a series of face keypoints with preset
semantics given any three-dimensional face mesh model. Quantities of vertices and
patches of the three-dimensional face model are not limited. A keypoint with preset semantics refers to position information of a canthus, a corner of the mouth, a nose tip, a face contour, and the like. The semantics of the keypoints and the quantity of keypoints are determined by the specific task.
- (4) Graph neural network (GNN): A GNN is a type of artificial neural network, and
is configured for processing data that may be represented as a graph. In comparison
with a conventional two-dimensional convolutional neural network acting on a two-dimensional
image, the graph neural network expands an acting object into graph data that can
be represented in a three-dimensional mesh morphology. A key design element of the graph neural network is message passing, through which a graph node is iteratively updated by exchanging information with its neighbors.
- (5) Loss: A loss is configured for measuring a difference between an actual result
and a target result of a model, to perform model training and optimization.
- (6) Three-dimensional heatmap regression: Three-dimensional heatmap regression means that a graph neural network uses a heatmap as its output layer and forms a regression loss against a ground-truth heatmap; the neural network is trained through forward propagation and backpropagation so that its output fits the label, and keypoint coordinates are finally calculated based on the heatmap.
- (7) Three-dimensional (3D) scanner: A 3D scanner is a scientific instrument configured
to detect and analyze a shape (geometric structure) and appearance data (characteristics
such as a color and a surface albedo) of an object or an environment in the real world.
Collected data is usually configured for performing three-dimensional reconstruction
calculation, to create digital models of actual objects in the virtual world. These
models have a wide range of applications, such as industrial design, defect detection,
reverse engineering, robot guidance, landscape measurement, medical information, biological
information, and criminal identification.
- (8) Multi-layer perceptron (MLP): A multi-layer perceptron is an artificial neural
network with a forward structure for mapping a group of input vectors to a group of
output vectors. The MLP may be considered as a directed graph, and is formed by a
plurality of node layers. Each layer is fully connected to a next layer. Each node
other than an input node is a neuron (or referred to as a processing unit) with a
non-linear activation function.
- (9) Convolutional neural network (CNN): A convolutional neural network is a feedforward
neural network, generally formed by one or more convolutional layers (network layers
that use the convolution mathematical operation) and a fully connected layer at the end. Neurons inside the network respond to subregions of an input image, and such networks generally have excellent performance in the field of visual image processing.
- (10) Machine learning (ML): Machine learning is a multi-field interdiscipline, and
relates to a plurality of disciplines such as the probability theory, statistics,
the approximation theory, convex analysis, and the algorithm complexity theory. The
machine learning specializes in studying how a computer simulates or implements a
human learning behavior to acquire new knowledge or skills, and reorganize an existing
knowledge structure, to keep improving its performance. The machine learning is the
core of artificial intelligence, is a basic way to make the computer intelligent,
and is applied to various fields of the artificial intelligence. The machine learning
and deep learning generally include technologies such as an artificial neural network,
a belief network, reinforcement learning, transfer learning, inductive learning, and
learning from demonstrations.
- (11) Point cloud data: Point cloud data is a set of massive points of a surface feature
of a target, and is generally obtained through laser measurement or photogrammetry.
Point cloud data obtained through laser measurement includes three-dimensional coordinates
and laser reflection intensity. Such point cloud data is usually used to determine
a state of an object based on an echo characteristic and reflection intensity. Point
cloud data obtained through photogrammetry usually includes three-dimensional coordinates
and color information.
- (12) Graph attention network (GAT): A GAT is a new neural network architecture based
on graph-structured data.
[0024] As the artificial intelligence technology is researched and advances, research on and applications of the artificial intelligence technology are carried out in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, smart medicine, and smart customer service. It is believed that as the technology develops, the artificial intelligence technology is to be applied in more fields and play an increasingly important role.
[0025] The solutions provided in the embodiments of the present disclosure relate to a technology
such as a three-dimensional network model of artificial intelligence, and may also
be applied to fields such as a cloud technology and Internet of vehicles. Details
are specifically described in the following embodiments.
[0026] Referring to FIG. 1, FIG. 1 is a schematic architectural diagram of a keypoint detection
system 100 according to an embodiment of the present disclosure. To implement an application
scenario of keypoint detection (for example: when keypoint detection is performed on a face, three-dimensional scanning is first performed on the face by using a three-dimensional scanner, and a position of a keypoint on the face is then detected based on the three-dimensional scan data), a terminal (for example, a terminal 400 is shown) is connected to a server 200 via a network 300. The network 300 may be a wide area network, a local area network, or a combination thereof. The terminal 400 is configured to perform display for a user on a display interface (for example, a display interface 401-1 is shown) by using a client 401. The terminal 400 and the server 200 are connected to each other via a wired or wireless network.
[0027] The terminal 400 is configured to acquire three-dimensional scan data corresponding
to a to-be-detected object and send the three-dimensional scan data to the server
200.
[0028] The server 200 is configured to: receive the three-dimensional scan data; obtain,
based on the three-dimensional scan data, a three-dimensional mesh configured for
representing the to-be-detected object, and determine vertices of the three-dimensional
mesh and a connection relationship between the vertices; perform feature extraction
on the vertices of the three-dimensional mesh, to obtain a vertex feature of the three-dimensional
mesh; perform global feature extraction on the to-be-detected object based on the
vertex feature, to obtain a global feature of the to-be-detected object, and perform
local feature extraction on the to-be-detected object based on the vertex feature
and the connection relationship between the vertices, to obtain a local feature of
the to-be-detected object; perform detection on a keypoint of the to-be-detected
object based on the vertex feature, the global feature, and the local feature, to
obtain a position of the keypoint of the to-be-detected object on the to-be-detected
object; and send the position of the keypoint on the to-be-detected object to the
terminal 400.
[0029] The terminal 400 is further configured to display, based on the display interface,
the position of the keypoint on the to-be-detected object.
[0030] In some embodiments, the server 200 may be an independent physical server, may be
a server cluster formed by a plurality of physical servers or a distributed system,
or may be a cloud server that provides basic cloud computing services such as a cloud
service, a cloud database, cloud computing, a cloud function, cloud storage, a network
service, cloud communication, a middleware service, a domain name service, a security
service, a content delivery network (CDN), and a big data and artificial intelligence
platform. The terminal 400 may be a smartphone, a tablet computer, a notebook computer,
a desktop computer, a set-top box, a smart voice interaction device, a smart home
appliance, an in-vehicle terminal, an aircraft, a mobile device (for example, a mobile
phone, a portable music player, a personal digital assistant, a dedicated message
device, a portable game device, a smart speaker, and a smartwatch), or the like, but
is not limited thereto. The terminal device and the server may be directly or indirectly
connected in a wired or wireless communication manner. This is not limited in the
embodiments of the present disclosure.
[0031] Referring to FIG. 2, FIG. 2 is a schematic structural diagram of an electronic device
according to an embodiment of the present disclosure. During actual application, the
electronic device may be the server 200 or the terminal 400 shown in FIG. 1. Referring
to FIG. 2, the electronic device shown in FIG. 2 includes: at least one processor
410, a memory 450, at least one network interface 420, and a user interface 430. All
components in the terminal 400 are coupled together by using a bus system 440. The
bus system 440 is configured to implement connection and communication between the
components. In addition to a data bus, the bus system 440 further includes a power
bus, a control bus, and a state signal bus. However, for ease of clear description,
all types of buses are marked as the bus system 440 in FIG. 2.
[0032] The processor 410 may be an integrated circuit chip having a signal processing capability, for example, a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, any conventional processor, or the like.
[0033] The user interface 430 includes one or more output apparatuses 431 that enable media
content to be presented, and the output apparatuses 431 include one or more speakers
and/or one or more visual displays. The user interface 430 further includes one or
more input apparatuses 432, and the input apparatuses 432 include a user interface
component that helps user input, for example, a keyboard, a mouse, a microphone, a
touch display, a camera, or another input button or control.
[0034] The memory 450 may be a removable memory, a non-removable memory, or a combination
thereof. Exemplary hardware devices include a solid-state memory, a hard disk drive,
an optical disk drive, and the like. In some embodiments, the memory 450 includes
one or more storage devices physically located away from the processor 410.
[0035] The memory 450 includes a volatile memory or a non-volatile memory, or may include
both a volatile memory and a non-volatile memory. The non-volatile memory may be a
read-only memory (ROM), and the volatile memory may be a random access memory (RAM).
The memory 450 described in this embodiment of the present disclosure aims to include
any suitable type of memory.
[0036] In some embodiments, the memory 450 can store data to support various operations.
Examples of the data include a program, a module, a data structure, or a subset or
a superset thereof, which are described below by way of example.
An operating system 451 includes system programs configured to process various basic system services and execute hardware-related tasks, for example, a framework layer, a core library layer, and a driver layer, configured to implement various basic services and process hardware-based tasks.
[0038] A network communication module 452 is configured to reach another electronic device
through the one or more (wired or wireless) network interfaces 420. Exemplary network
interfaces 420 include: Bluetooth, wireless fidelity (Wi-Fi), a universal serial bus
(USB), and the like.
[0039] A presentation module 453 is configured to enable, through the one or more output
apparatuses 431 (for example, a display screen and a speaker) associated with the
user interface 430, information to be presented (for example, configured to operate
a peripheral device and a user interface displaying content and information).
[0040] An input processing module 454 is configured to detect user input or interaction
from the one or more input apparatuses 432 and translate the detected input or interaction.
[0041] In some embodiments, the apparatus provided in the embodiments of the present disclosure
may be implemented in a software manner. FIG. 2 shows a keypoint detection apparatus
455 stored in the memory 450. The keypoint detection apparatus 455 may be software
in a form of a program, a plug-in, or the like, including the following software modules:
an obtaining module 4551, a first feature extraction module 4552, a second feature
extraction module 4553, and an output module 4554. These modules are logical, and
therefore may be arbitrarily combined or further divided based on an implemented function.
Functions of the modules are described below.
[0042] In some other embodiments, the apparatus provided in the embodiments of the present
disclosure may be implemented in a hardware manner. In an example, the keypoint detection
apparatus provided in the embodiments of the present disclosure may be a processor
in a form of a hardware decoding processor, and the processor is programmed to perform
a keypoint detection method provided in the embodiments of the present disclosure.
For example, the processor in the form of the hardware decoding processor may be implemented
by using one or more application-specific integrated circuits (ASIC), a DSP, a programmable
logic device (PLD), a complex programmable logic device (CPLD), a field programmable
gate array (FPGA), or another electronic element.
[0043] In some embodiments, the terminal or the server may implement the keypoint detection
method provided in the embodiments of the present disclosure by running a computer
program. For example, the computer program may be a native program or a software module
in the operating system; may be a native application (APP), namely, a program that
needs to be installed in the operating system to run, such as an instant messaging
APP or a web browser APP; may be an applet, namely, a program that only needs to be
downloaded into a browser environment to run; or may be an applet that can be embedded
into any APP. In summary, the foregoing computer program may be any form of an application,
a module, or a plug-in.
[0044] Based on the foregoing descriptions of the keypoint detection system and the electronic
device provided in the embodiments of the present disclosure, the keypoint detection
method provided in the embodiments of the present disclosure is described below. During
actual implementation, the keypoint detection method provided in the embodiments of
the present disclosure may be implemented by a terminal or a server separately, or
by the terminal and the server together. An example in which the server 200 in FIG.
1 independently performs the keypoint detection method provided in the embodiments
of the present disclosure is used for description. Referring to FIG. 3, FIG. 3 is
a schematic flowchart of a keypoint detection method according to an embodiment of
the present disclosure. Operations shown are described below with reference to FIG.
3.
[0045] At 101: A server obtains a three-dimensional mesh configured for representing a to-be-detected
object, and determines vertices of the three-dimensional mesh and a connection relationship
between the vertices.
[0046] During actual implementation, obtaining a three-dimensional mesh configured for representing
a to-be-detected object may be directly receiving a three-dimensional mesh of a to-be-detected
object sent by another device, or may be implemented by using point cloud data (namely,
three-dimensional scan data) corresponding to the to-be-detected object. Herein, the
point cloud data indicates a set of massive points of a surface feature of the to-be-detected
object, and may generally be obtained through laser measurement or photogrammetry.
Specifically, the point cloud data corresponding to the to-be-detected object is first
acquired, and then the three-dimensional mesh configured for representing the to-be-detected
object is obtained based on the point cloud data, in other words, the three-dimensional
mesh corresponding to the to-be-detected object is constructed. Herein, there are
a plurality of manners of acquiring the point cloud data corresponding to the to-be-detected
object. The point cloud data may be prestored locally in a terminal, may be acquired
from the outside world (such as the Internet), or may be collected in real time, for
example, collected in real time by using a three-dimensional scanning apparatus such
as a three-dimensional scanner.
[0047] In some embodiments, when the point cloud data is collected in real time by using
the three-dimensional scanning apparatus such as the three-dimensional scanner, a
process of constructing the three-dimensional mesh corresponding to the to-be-detected
object specifically includes: scanning the to-be-detected object by using the three-dimensional
scanning apparatus, to obtain point cloud data of a geometric surface of the to-be-detected
object; and constructing the three-dimensional mesh corresponding to the to-be-detected
object based on the point cloud data. For example, referring to FIG. 4, FIG. 4 is
a schematic diagram of a three-dimensional mesh of a human head according to an embodiment
of the present disclosure. Based on FIG. 4, when the to-be-detected object is a face,
the three-dimensional scanner performs three-dimensional scanning on the human head,
to obtain point cloud data corresponding to the head; and the three-dimensional mesh
corresponding to the head is constructed based on the point cloud data.
[0048] A process of constructing the three-dimensional mesh corresponding to the to-be-detected
object based on the point cloud data may be as follows: First, the point cloud data
is preprocessed to obtain target point cloud data. The preprocessing includes operations
such as filtering, denoising, and point cloud registration. Herein, the filtering
may remove noise points, the denoising may further reduce noise and invalid points,
and the point cloud registration may align the point cloud data into a same coordinate
system. Then, mesh reconstruction is performed on the target point cloud data, to
obtain the three-dimensional mesh. The mesh reconstruction is a process of transforming
discrete target point cloud data into a three-dimensional mesh. Commonly used mesh
reconstruction algorithms include a mesh-based method, a voxel-based method, an implicit
function-based method, and the like. Herein, the mesh-based method transforms the target point cloud data into a triangular mesh, the voxel-based method transforms the target point cloud data into a voxel mesh, and the implicit function-based method uses an implicit function to represent the three-dimensional surface. A minimal reconstruction sketch is given below.
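To make this pipeline concrete, the following is a minimal sketch, assuming the open-source Open3D library is used (the present disclosure does not mandate any particular library, and the file name and parameter values are illustrative placeholders):

```python
import open3d as o3d  # open-source library; one possible implementation

# Load scanned point cloud data ("scan.ply" is a placeholder path).
pcd = o3d.io.read_point_cloud("scan.ply")

# Preprocessing: statistical outlier removal stands in for the
# filtering/denoising step described above.
pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)

# Normals are required by Poisson surface reconstruction.
pcd.estimate_normals()

# Mesh reconstruction: Poisson reconstruction is one implicit-function-based
# method; voxel-based or ball-pivoting methods are alternatives.
mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=8)
print(len(mesh.vertices), "vertices,", len(mesh.triangles), "triangles")
```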
[0049] In the embodiments of the present disclosure, data related to real-time scanning
or the like is included. When the embodiments of the present disclosure are applied
to a specific product or technology, user permission or consent needs to be obtained,
and collection, use, and processing of relevant data need to comply with relevant
laws, regulations, and standards of relevant countries and regions.
[0050] In a three-dimensional mesh model, vertices are the most fundamental units (basic
points) that define the geometric shape and structure of the three-dimensional mesh
model. These vertices are connected by edges to constitute the topological structure
of the three-dimensional mesh model. The vertices are connected to form polygons (such
as triangles or quadrilaterals), and these polygons are combined to form the surface
of the three-dimensional mesh model. The three-dimensional mesh model involved in
the present disclosure can be understood as a Graph structure. A Graph structure is
a data structure G(V, E) composed of a set of vertices and the connections between
them, where V represents the set of nodes (vertices) in the Graph, and E represents
the set of edges (connections between nodes/vertices) in the Graph. Therefore, after
obtaining the three-dimensional mesh model (i.e., Graph structure), it is easy to
determine the vertices and the connections between them based on the G (V, E) corresponding
to the three-dimensional mesh model. During actual implementation, the connection
relationship between the vertices of the three-dimensional mesh may be an inter-vertex
connection relationship matrix, which indicates whether there is an association between the vertices. The size of the matrix is N×N, and each value of the matrix is 0 or 1, where N is the quantity of vertices. When a vertex i is connected to a vertex j, the connection relationship $A_{ij}$ between the two vertices is 1; otherwise, it is 0.
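As an illustration, the connection relationship matrix described above can be assembled directly from the triangular patches of the mesh. The following is a minimal NumPy sketch; the function name and the toy mesh are hypothetical:

```python
import numpy as np

def adjacency_from_faces(num_vertices, faces):
    """Build the N x N connection relationship matrix described above:
    A[i, j] = 1 when vertices i and j share an edge, otherwise 0."""
    A = np.zeros((num_vertices, num_vertices), dtype=np.int8)
    for i, j, k in faces:                 # each triangular patch (i, j, k)
        A[i, j] = A[j, i] = 1
        A[j, k] = A[k, j] = 1
        A[i, k] = A[k, i] = 1
    return A

# Toy mesh: two triangles sharing the edge (1, 2).
faces = np.array([[0, 1, 2], [1, 2, 3]])
print(adjacency_from_faces(4, faces))
```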
[0051] For example, on a face, there is a connection relationship between vertices of a
three-dimensional mesh configured for indicating a position of an eye, and there is
no connection relationship between a vertex of the three-dimensional mesh configured
for indicating the position of the eye and a vertex of a three-dimensional mesh for
indicating a position of a chin.
[0052] At 102: Perform feature extraction on the vertices of the three-dimensional mesh,
to obtain a vertex feature of the three-dimensional mesh.
[0053] During actual implementation, the feature extraction is performed on the vertices
of the three-dimensional mesh, to obtain the vertex feature of the three-dimensional
mesh. The vertex feature includes positions of the corresponding vertices and information
about corresponding positions indicated by the corresponding vertices on a face. For
example, the vertex feature herein may be of size N×(6+X), where N represents the quantity of vertices of the three-dimensional mesh; 6 represents the dimensions occupied by the vertex coordinates and the normal vector, namely, the three coordinate dimensions of the vertex coordinates (x, y, z) plus the three dimensions of the normal vector; and X includes other characteristics of the vertices of the three-dimensional mesh, to be specific, the information about the corresponding positions indicated by the vertices on the face, such as a curvature and texture information. These other characteristics may be adjusted based on different data and tasks. In this way, when the present disclosure is applied to a model, these other characteristics are added in a training phase of the model to improve learning efficiency of the model.
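The following is a minimal sketch of assembling such an N×(6+X) vertex feature, assuming coordinates, normals, and one extra curvature channel; the helper name and shapes are illustrative:

```python
import numpy as np

def build_vertex_features(coords, normals, extras=None):
    """Assemble the N x (6 + X) vertex feature described above: three
    coordinate dimensions, three normal-vector dimensions, and X optional
    extra channels (e.g., curvature or texture values)."""
    parts = [coords, normals]
    if extras is not None:
        parts.append(extras)
    return np.concatenate(parts, axis=1)

N = 5
rng = np.random.default_rng(0)
features = build_vertex_features(rng.random((N, 3)),         # vertex coordinates
                                 rng.random((N, 3)),         # normal vectors
                                 extras=rng.random((N, 1)))  # X = 1 (curvature)
print(features.shape)  # (5, 7) = N x (6 + X)
```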
[0054] At 103: Perform global feature extraction on the to-be-detected object based on the
vertex feature, to obtain a global feature of the to-be-detected object, and perform
local feature extraction on the to-be-detected object based on the vertex feature
and the connection relationship between the vertices, to obtain a local feature of
the to-be-detected object.
[0055] After determining the connection relationship between the vertices of the three-dimensional
mesh and the vertex feature of the three-dimensional mesh, the global feature extraction
and the local feature extraction are separately performed on the to-be-detected object,
to obtain the global feature and the local feature of the to-be-detected object.
[0056] In some embodiments, a process of performing global feature extraction on the to-be-detected
object based on the vertex feature to obtain the global feature of the to-be-detected
object may be: first performing feature extraction on the to-be-detected object based
on the vertex feature; performing max pooling processing on an extracted feature,
to obtain a max pooling feature, so that all vertices share the max pooling feature;
and using the max pooling feature as the global feature of the to-be-detected object.
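A minimal sketch of this global-feature path follows, assuming a single shared linear map with ReLU as a stand-in for the learned per-vertex feature extraction (the weights are random placeholders, not trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, G = 5, 7, 16      # vertices, per-vertex feature dims, global feature dims

vertex_features = rng.standard_normal((N, F))
W = rng.standard_normal((F, G))          # stand-in for the learned transform

per_vertex = np.maximum(vertex_features @ W, 0.0)  # shared per-vertex map + ReLU
global_feature = per_vertex.max(axis=0)            # max pooling over all vertices
print(global_feature.shape)   # (16,): a single feature shared by all vertices
```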
[0057] In some embodiments, a process of performing local feature extraction on the to-be-detected
object based on the vertex feature and the connection relationship between the vertices
to obtain the local feature of the to-be-detected object may be: determining a local
feature of each vertex based on the vertex feature and the connection relationship
between the vertices; and determining the local feature of the to-be-detected object
based on the local feature of each vertex.
[0058] Herein, the global feature indicates overall features of the to-be-detected object,
such as a color feature, a texture feature, and a shape feature of the to-be-detected
object, and the local feature indicates detailed features of the to-be-detected object,
in other words, features extracted from a local region of the to-be-detected object,
such as features extracted from an edge, a corner, a point, a line, a curve, and a
region of a special attribute of the to-be-detected object. For example, when the
to-be-detected object is the face, the global feature may be a size, a shape, a position,
and the like of facial features on the face, and the local feature may be distribution
of facial muscles, a shape change of the facial features, and the like under different
expressions. Herein, the global feature is a low-level visual feature at the pixel level. Therefore, the global feature has characteristics such as good invariance, simple calculation, and intuitive representation, but is not well suited to cases of object overlap and occlusion. Local image features are rich in number in an image and have a small inter-feature correlation degree, so that in the cases of object overlap and occlusion, disappearance of some features does not affect detection and matching of other features. In this way, the global feature extraction and the local feature extraction are performed on the to-be-detected object, to acquire richer and more accurate features of the to-be-detected object, thereby improving accuracy of a keypoint detection result.
[0059] Next, a process of determining the local feature of each vertex based on the vertex
feature and the connection relationship between the vertices, and a process of determining
the local feature of the to-be-detected object based on the local feature of each
vertex are described separately.
[0060] For the process of determining the local feature of each vertex based on the vertex
feature and the connection relationship between the vertices, refer to FIG. 5 herein.
FIG. 5 is a schematic flowchart of determining the local feature of each vertex according
to an embodiment of the present disclosure. Based on FIG. 5, the process of determining
the local feature of each vertex based on the vertex feature and the connection relationship
between the vertices is implemented through operation 1031 to operation 1033. With
reference to FIG. 5, the following processing is performed for each vertex.
[0061] At 1031: Determine the vertex as a reference vertex, and determine a vertex feature
of the reference vertex and a vertex feature of another vertex based on a vertex feature
of each vertex in a three-dimensional mesh, the another vertex being any vertex other
than the reference vertex.
[0062] For example, the quantity of vertices in the three-dimensional mesh is N, the feature of each vertex is h, and its dimension is F. That is,

$$\mathbf{h} = \{h_1, h_2, \ldots, h_N\}, \quad h_i \in \mathbb{R}^F$$

[0063] A vertex i is used as a reference node, and $h_i$ is a vector with a size of F, namely, the feature of the reference node i. A vertex j is another vertex, and $h_j$ is a vector with a size of F, namely, the feature of the another node j. There is an edge connection relationship between the vertex i and the vertex j.
[0064] At 1032: Determine a correlation value between the reference vertex and the another
vertex based on the vertex feature of the reference vertex, the vertex feature of
the another vertex, and the connection relationship between the vertices, where the correlation value indicates a correlation degree between the reference vertex and the another vertex.
[0065] In some embodiments, a process of determining the correlation value between the reference
vertex and the another vertex based on the vertex feature of the reference vertex,
the vertex feature of the another vertex, and the connection relationship between
the vertices may be: determining the correlation degree between the reference vertex
and the another vertex by using an attention mechanism based on the vertex feature
of the reference vertex, the vertex feature of the another vertex, and the connection
relationship between the vertices. The correlation degree is an indicator for measuring
correlation strength between the reference vertex and the another vertex, and a magnitude
of the correlation degree may be calculated by using the following formula:

$$e_{ij} = \mathrm{attention}(W h_i, W h_j)$$

[0066] W is a weight matrix with a size of F×F, $h_i$ is the vertex feature of the reference vertex i, $h_j$ is the vertex feature of the another vertex j, attention indicates processing by using an attention mechanism, and $e_{ij}$ indicates the correlation degree between the reference vertex and the another vertex.
[0067] In some other embodiments, a process of determining the correlation value between the
reference vertex and the another vertex based on the vertex feature of the reference
vertex, the vertex feature of the another vertex, and the connection relationship
between the vertices may be: determining, based on the connection relationship between
the vertices, the reference vertex and another vertex that are connected to each other;
performing similarity matching on the reference vertex and the corresponding another vertex based on the vertex feature of the reference vertex and the vertex feature of the another vertex that is connected to the reference vertex, to obtain a similarity between the reference vertex and the corresponding another vertex (a corresponding similarity is obtained for each such connected vertex); and determining the similarity as the correlation degree between the reference vertex and the corresponding another vertex. A minimal similarity-matching sketch appears below.
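For this similarity-based variant, cosine similarity is one common choice (the present disclosure does not fix the similarity measure); a minimal sketch:

```python
import numpy as np

def cosine_similarity(h_i, h_j):
    """Similarity between the reference vertex feature h_i and the feature
    h_j of a connected vertex; the result is used as the correlation degree."""
    return float(h_i @ h_j / (np.linalg.norm(h_i) * np.linalg.norm(h_j) + 1e-12))

rng = np.random.default_rng(0)
h_i, h_j = rng.standard_normal(8), rng.standard_normal(8)
print(cosine_similarity(h_i, h_j))  # in [-1, 1]; higher = stronger correlation
```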
[0068] Then, normalization processing is performed on the correlation degree, to obtain the correlation value between the reference vertex and the another vertex. That is,

$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{q \in N_i} \exp(e_{iq})}$$

[0069] $\mathrm{softmax}_j$ indicates that normalization processing is used, $\alpha_{ij}$ indicates the correlation value between nodes i and j, exp indicates an exponential function with the natural constant e as the base, $N_i$ indicates the neighborhood formed by all other nodes that have a connection relationship with the reference node i, and q represents any vertex in the neighborhood.
[0070] For example, referring to FIG. 6, FIG. 6 is a schematic diagram of determining the correlation degree between the reference vertex and the another vertex by using the attention mechanism according to an embodiment of the present disclosure. In FIG. 6, $\alpha_{ij}$ indicated by 601 is the correlation value between nodes i and j, $Wh_i$ in a dashed box 602 is the vertex feature corresponding to the reference vertex i, $Wh_j$ in a dashed box 603 is the vertex feature corresponding to the another vertex j, and a is a weight vector. After the correlation degree between the reference vertex and the another vertex is determined based on $Wh_i$ and $Wh_j$, $\mathrm{softmax}_j$ processing, namely, the normalization processing, is performed on the correlation degree, to obtain the correlation value between the reference vertex and the another vertex.
[0071] Herein, a process of determining the correlation degree between the reference vertex and the another vertex by using the attention mechanism may specifically be: splicing the features $Wh_i$ and $Wh_j$ of the vertices i and j, calculating an inner product between the feature obtained through splicing and a weight vector a with a dimension of 2F, and obtaining the correlation value between the reference vertex and the another vertex through an activation function. That is,

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^T \left[Wh_i \parallel Wh_j\right]\right)\right)}{\sum_{q \in N_i} \exp\left(\mathrm{LeakyReLU}\left(a^T \left[Wh_i \parallel Wh_q\right]\right)\right)}$$

[0072] $N_i$ indicates the neighborhood formed by all other nodes that have a connection relationship with the reference node i, q represents any vertex in the neighborhood, $Wh_i \parallel Wh_j$ indicates the spliced feature obtained by splicing the features $Wh_i$ and $Wh_j$ of the vertices i and j, exp indicates an exponential function with the natural constant e as the base, LeakyReLU is a non-linear activation function, and a is a weight vector with a size of 2F.
[0073] For a method for determining the correlation degree, the correlation degree between
the reference vertex and the another vertex may alternatively be directly calculated
based on the vertex feature of the reference vertex, the vertex feature of the another
vertex, and the connection relationship between the vertices. There are a plurality
of methods for calculating the correlation degree, such as a Pearson correlation coefficient
and a Spearman's rank correlation coefficient.
[0074] At 1033: Determine a local feature of the reference vertex based on the correlation
value and the vertex feature of the another vertex.
[0075] During actual implementation, after the correlation value is obtained, when a quantity
of other vertices is one, a process of determining the local feature of the reference
vertex based on the correlation value and the vertex feature of the another vertex
may be: performing multiplication on the correlation value and the vertex feature
of the another vertex, to obtain a multiplication result; and determining the local
feature of the reference vertex based on the multiplication result. That is,

$$h_i' = \sigma\left(\alpha_{ij} W h_j\right)$$

[0076] σ is an activation function, $\alpha_{ij}$ is the correlation value between the reference vertex i and the another vertex j, $Wh_j$ indicates the vertex feature corresponding to the another vertex j, and $h_i'$ is the local feature of the reference vertex.
[0077] When the quantity of other vertices is more than one, a process of determining the local feature of the reference vertex based on the correlation values and the vertex features of the other vertices may be: performing, for each of the other vertices, multiplication on the correlation value and the vertex feature of the corresponding vertex, to obtain a multiplication result for that vertex; performing cumulative summation on the multiplication results of the other vertices, to obtain a summation result; and determining the local feature of the reference vertex based on the summation result. That is,

$$h_i' = \sigma\left(\sum_{j \in N_i} \alpha_{ij} W h_j\right)$$

[0078] σ is an activation function, $\alpha_{ij}$ is the correlation value between the reference vertex i and the another vertex j, $Wh_j$ indicates the vertex feature corresponding to the another vertex j, and $N_i$ indicates the neighborhood formed by all other nodes that have a connection relationship with the reference node i. A runnable sketch of this attention-based local feature extraction is given below.
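The following sketch strings operations 1031 to 1033 together into one attention-based layer, following the formulas above (plain NumPy; the weights are random placeholders, and ReLU is assumed for the activation σ):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def attention_layer(H, A, W, a):
    """One attention-based local feature layer following operations 1031-1033:
    e_iq = LeakyReLU(a^T [W h_i || W h_q]), alpha = softmax over N_i,
    h_i' = ReLU(sum_q alpha_iq W h_q)."""
    WH = H @ W                                     # W h_i for every vertex
    H_out = np.zeros_like(WH)
    for i in range(H.shape[0]):
        nbrs = np.nonzero(A[i])[0]                 # neighborhood N_i
        e = np.array([leaky_relu(a @ np.concatenate([WH[i], WH[q]]))
                      for q in nbrs])              # correlation degrees e_iq
        alpha = np.exp(e) / np.exp(e).sum()        # normalized correlation values
        H_out[i] = np.maximum((alpha[:, None] * WH[nbrs]).sum(axis=0), 0.0)
    return H_out

rng = np.random.default_rng(0)
N, F = 4, 8
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])                       # connection relationship matrix
H = rng.standard_normal((N, F))                    # vertex features
local = attention_layer(H, A, rng.standard_normal((F, F)),
                        rng.standard_normal(2 * F))
print(local.shape)   # (4, 8): one local feature h_i' per vertex
```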
[0079] A process of determining the local feature of the to-be-detected object based on the local feature of each vertex specifically includes: performing feature fusion on the local features of the vertices, to obtain a fused feature; and using the fused feature as the local feature of the to-be-detected object.
[0080] At 104: Perform detection on a keypoint of the to-be-detected object based on the
vertex feature, the global feature, and the local feature, to obtain a position of
the keypoint of the to-be-detected object on the to-be-detected object.
[0081] In some embodiments, a process of performing detection on the keypoint of the to-be-detected
object based on the vertex feature, the global feature, and the local feature to obtain
the position of the keypoint of the to-be-detected object on the to-be-detected object
may be: performing feature splicing on the vertex feature, the global feature, and
the local feature, to obtain a spliced feature of the to-be-detected object; and performing
detection on the keypoint of the to-be-detected object based on the spliced feature,
to obtain the position of the keypoint of the to-be-detected object on the to-be-detected
object. In this way, the spliced feature includes feature information of the vertex
feature, the global feature, and the local feature of the to-be-detected object, and
detection is performed on the keypoint of the to-be-detected object based on the spliced
feature. Therefore, with reference to the feature information of the vertex feature,
the global feature, and the local feature, in other words, through richer feature
information, the keypoint of the to-be-detected object is detected, thereby improving
accuracy of a keypoint detection result.
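A minimal sketch of the feature splicing described above, assuming the global feature is tiled to every vertex before concatenation (all dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, G, L = 5, 7, 16, 8    # vertices and illustrative feature dimensions

vertex_feat = rng.standard_normal((N, F))   # per-vertex features
global_feat = rng.standard_normal(G)        # one feature for the whole object
local_feat = rng.standard_normal((N, L))    # per-vertex local features

# Feature splicing: the global feature is tiled to every vertex and then
# concatenated with the vertex feature and the local feature.
spliced = np.concatenate(
    [vertex_feat, np.tile(global_feat, (N, 1)), local_feat], axis=1)
print(spliced.shape)   # (5, 31) = N x (F + G + L)
```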
[0082] In a method based on three-dimensional coordinate regression in the related art,
a point around the keypoint may be similar to the keypoint, and therefore, it is difficult
to accurately define the keypoint through a pixel position. A three-dimensional heatmap
in the present disclosure is a statistical chart that displays a plurality of pieces
of data by coloring a color block, in other words, displays each piece of data according
to a specified color mapping rule. For example, a large value is represented by a
dark color, and a small value is represented by a light color; or a large value is
represented by a warm tone, and a small value is represented by a cold tone. In this
way, the three-dimensional heatmap is outputted, and a probability that the keypoint
belongs to each vertex is displayed, so that local accuracy of a detection result
can be better ensured.
[0083] For example, referring to FIG. 7, FIG. 7 is a schematic diagram of positions of keypoints
on the to-be-detected object according to an embodiment of the present disclosure.
Based on FIG. 7, black points in FIG. 7 are the keypoints. When the to-be-detected
object is the face, the positions of the keypoints shown in FIG. 7 may be positions
of facial features of the face. Black points in a dashed box 701 are keypoints indicating a position of a forehead in the face, black points in dashed boxes 702 and 703 are keypoints indicating positions of eyes in the face, black points indicated by 704 and 705 are keypoints indicating positions of ears in the face, black points in a dashed box 706 are keypoints indicating a position of a nose in the face, black points in a dashed box 707 are keypoints indicating a position of a mouth in the face, black points indicated by 708 and 709 are keypoints indicating positions of cheeks in the face, and black points in a dashed box 710 are keypoints indicating a position of a chin in the face. Herein, detection is performed on the positions of the facial features of the to-be-detected object via an output layer based on the vertex feature, the global feature, and the local feature, to obtain a probability of a keypoint being at each vertex in the three-dimensional mesh, namely, a probability that each vertex in the three-dimensional mesh is a keypoint corresponding to a position of each facial feature; a three-dimensional heatmap corresponding to the three-dimensional mesh is generated based on each probability; and the position of the keypoint of the to-be-detected object on the to-be-detected object is determined based on the three-dimensional heatmap. To be specific, for the keypoint corresponding to the position of each facial feature, a vertex with a maximum probability is selected from a plurality of probabilities and determined as the corresponding keypoint, to determine the position of the facial feature based on the obtained keypoint.
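As a concrete reading of the selection step, the sketch below picks, for each facial-feature channel of an (N, K) heatmap, the vertex with the maximum probability and returns its coordinates. Array shapes and names are assumptions for illustration.

```python
import numpy as np

def keypoints_from_heatmap(heatmap, vertices):
    """Select one keypoint per facial-feature channel from a 3D heatmap.

    heatmap:  (N, K) probability that each of the N vertices is keypoint k
    vertices: (N, 3) vertex coordinates of the three-dimensional mesh
    returns:  (K, 3) coordinates of the selected keypoints
    """
    best = np.argmax(heatmap, axis=0)  # max-probability vertex index per keypoint
    return vertices[best]
```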
[0084] In some embodiments, the keypoint detection method herein may be further applied
to a three-dimensional network model. The three-dimensional network model includes
at least a first feature extraction layer, a second feature extraction layer, a third
feature extraction layer, and an output layer. Referring to FIG. 8, FIG. 8 is a schematic
structural diagram of the three-dimensional network model according to an embodiment
of the present disclosure. Based on FIG. 8, a process of performing feature extraction
on the vertices of the three-dimensional mesh to obtain a vertex feature of the three-dimensional
mesh may be: performing feature extraction on the vertices of the three-dimensional
mesh via the first feature extraction layer, to obtain the vertex feature of the three-dimensional
mesh. A process of performing global feature extraction on the to-be-detected object
based on the vertex feature to obtain a global feature of the to-be-detected object,
and performing local feature extraction on the to-be-detected object based on the
vertex feature and the connection relationship between the vertices to obtain a local
feature of the to-be-detected object may be: performing global feature extraction
on the to-be-detected object based on the vertex feature via the second feature extraction
layer, to obtain the global feature of the to-be-detected object; and performing local
feature extraction on the to-be-detected object based on the vertex feature and the
connection relationship between the vertices via the third feature extraction layer,
to obtain the local feature of the to-be-detected object. A process of performing
detection on the keypoint of the to-be-detected object based on the vertex feature,
the global feature, and the local feature, to obtain the position of the keypoint
of the to-be-detected object on the to-be-detected object may be: detecting the keypoint
of the to-be-detected object via the output layer based on the vertex feature, the
global feature, and the local feature, to obtain the position of the keypoint of the
to-be-detected object on the to-be-detected object.
[0085] In this way, the position of the keypoint on the to-be-detected object is detected
through the three-dimensional network model, so that accuracy of the detected position
is improved.
[0086] In some embodiments, the third feature extraction layer herein may include at least
two third feature extraction sublayers and a feature splicing sublayer. For example,
referring to FIG. 9, FIG. 9 is a schematic structural diagram of the third feature
extraction layer according to an embodiment of the present disclosure. Based on FIG.
9, a process of determining the local feature of each vertex based on the vertex feature
and the connection relationship between vertices via the third feature extraction
layer may be: performing the following processing for each of the vertices via each
of the third feature extraction sublayers: determining the vertex as the reference
vertex, and determining the vertex feature of the reference vertex and the vertex
feature of the another vertex based on the vertex feature of each vertex in the three-dimensional
mesh; determining the correlation value between the reference vertex and the another
vertex based on the vertex feature of the reference vertex, the vertex feature of
the another vertex, and the connection relationship between the vertices; determining
a local subfeature of the reference vertex based on the correlation value and the
vertex feature of the another vertex; and splicing the local subfeature obtained through
each third feature extraction sublayer via the feature splicing sublayer, to obtain
the local feature of the reference vertex. That is,

$$h_i' = \underset{k=1,\dots,K}{\operatorname{concat}}\;\sigma\!\left(\sum_{j \in N_i} \alpha_{ij}^{k} W^{k} h_j\right)$$

[0087] where k indexes the third feature extraction sublayers and K is a quantity of the sublayers, N_i indicates a neighborhood formed by all other nodes that have a connection relationship with a reference node i, σ is an activation function, α_ij is a correlation value between the reference vertex i and another vertex j, Wh_j indicates a vertex feature corresponding to the another vertex j, and concat indicates that splicing processing is used.
[0088] Herein, a process of determining the correlation value between the reference vertex
and the another vertex based on the vertex feature of the reference vertex, the vertex
feature of the another vertex, and the connection relationship between the vertices
is the same as the foregoing process. In addition, a process of determining the local
subfeature of the reference vertex based on the correlation value and the vertex feature
of the another vertex is the same as the foregoing process of determining the local
feature of the reference vertex based on the correlation value and the vertex feature
of the another vertex. Details are not described herein again.
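To make the sublayer computation concrete, the following NumPy sketch computes the correlation values and the local subfeature for one reference vertex, and splices the subfeatures of several sublayers, mirroring the concat formula above. The LeakyReLU scoring, the tanh standing in for σ, and the assumption that every vertex has at least one connected neighbor are illustrative choices, not requirements of the disclosure.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def local_subfeature(i, h, adj, W, a):
    """Local subfeature of reference vertex i from one third-feature-extraction sublayer.

    h: (N, F) vertex features; adj: (N, N) 0/1 connection relationship matrix;
    W: (F, F2) shared linear transform; a: (2*F2,) attention weight vector.
    Assumes vertex i has at least one connected neighbor.
    """
    wh = h @ W                                   # transformed vertex features
    neighbors = np.flatnonzero(adj[i])           # the other vertices connected to i
    # Correlation degree e_ij from the attention mechanism (splice, then score).
    e = np.array([leaky_relu(a @ np.concatenate([wh[i], wh[j]])) for j in neighbors])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                         # normalized correlation values alpha_ij
    # Weighted sum of the other vertices' features, then an activation sigma.
    return np.tanh(alpha @ wh[neighbors])

def local_feature(i, h, adj, sublayers):
    """Splice the local subfeatures of all K sublayers (the concat in the formula)."""
    return np.concatenate([local_subfeature(i, h, adj, W, a) for W, a in sublayers])
```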
[0089] In some embodiments, the three-dimensional network model further includes a first
feature splicing layer, a second feature splicing layer, and a fourth feature extraction
layer. For example, referring to FIG. 10, FIG. 10 is a schematic structural diagram
of the three-dimensional network model according to an embodiment of the present disclosure.
Based on FIG. 10, a process of performing detection on the keypoint of the to-be-detected
object via the output layer based on the vertex feature, the global feature, and the
local feature to obtain the position of the keypoint of the to-be-detected object
may be: performing feature splicing on the vertex feature, the global feature, and
the local feature via the first feature splicing layer, to obtain the spliced feature
of the to-be-detected object; performing local feature extraction on the to-be-detected
object based on the spliced feature via the fourth feature extraction layer, to obtain
a target local feature of the to-be-detected object; performing feature splicing on
the spliced feature, the global feature, and the target local feature via the second
feature splicing layer, to obtain a target spliced feature of the to-be-detected object;
and performing detection on the keypoint of the to-be-detected object based on the
target spliced feature via the output layer, to obtain the position of the keypoint
of the to-be-detected object on the to-be-detected object.
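The dataflow just described reduces to four steps. The sketch below wires them together, reusing the splice_features helper sketched earlier; extract_local and output_layer are placeholder callables standing in for the fourth feature extraction layer and the output layer.

```python
def detect_keypoints(vertex_feat, global_feat, local_feat, extract_local, output_layer):
    """Dataflow of FIG. 10: splice, extract again, splice again, then output."""
    spliced = splice_features(vertex_feat, global_feat, local_feat)        # first feature splicing layer
    target_local = extract_local(spliced)                                  # fourth feature extraction layer
    target_spliced = splice_features(spliced, global_feat, target_local)   # second feature splicing layer
    return output_layer(target_spliced)                                    # per-vertex keypoint probabilities
```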
[0090] The three-dimensional network model may further include a fifth feature extraction
layer and a third feature splicing layer. Therefore, local feature extraction is performed
on the to-be-detected object based on the target spliced feature via the fifth feature
extraction layer, to obtain a second target local feature; then feature splicing is
performed on the target spliced feature, the second target local feature, and the
global feature via the third feature splicing layer, to obtain a second target spliced
feature; and finally detection is performed on the keypoint of the to-be-detected
object based on the second target spliced feature via the output layer, to obtain
the position of the keypoint of the to-be-detected object on the to-be-detected object.
Herein, for a process of determining the local feature and the corresponding spliced
feature of the to-be-detected object in the three-dimensional network model, quantities
of feature extraction layers and feature splicing layers in the three-dimensional
network model may be more than one, and a process of obtaining the final spliced feature
via the plurality of feature extraction layers and feature splicing layers is as described
in the foregoing. Details are not described in this embodiment of the present disclosure.
[0091] Layer structures of the fourth feature extraction layer, the fifth feature extraction
layer, and the third feature extraction layer are the same, and processes of processing
the features are the same. Layer structures of the second feature splicing layer,
the third feature splicing layer, and the first feature splicing layer are the same,
and processes of processing the features are also the same. Further feature processing
is performed on the spliced feature via the fourth feature extraction layer, to obtain
the more accurate target local feature, and the feature splicing is performed on the
spliced feature, the global feature, and the obtained target local feature via the
second feature splicing layer, to perform detection on the keypoint of the to-be-detected
object based on the target spliced feature obtained by feature splicing. Correspondingly,
further feature processing is performed on the target spliced feature via the fifth
feature extraction layer, to obtain the more accurate second target local feature,
and the feature splicing is performed on the target spliced feature, the global feature,
and the obtained second target local feature via the third feature splicing layer,
to perform detection on the keypoint of the to-be-detected object based on the second
target spliced feature obtained by feature splicing.
[0092] In this way, feature extraction layers with a same structure and feature splicing layers with a same structure are disposed, and a process of performing local feature extraction and corresponding feature splicing on a to-be-detected object is repeated a plurality of times, so that accuracy of an extracted feature is improved, thereby improving accuracy of a keypoint detection result.
[0093] In some embodiments, before detection is performed on the keypoint of the to-be-detected
object based on the three-dimensional network model, the three-dimensional network
model further needs to be trained, so that the keypoint of the to-be-detected object
is detected based on a trained three-dimensional network model. Specifically, referring
to FIG. 11, FIG. 11 is a schematic flowchart of a training process of a three-dimensional
network model according to an embodiment of the present disclosure. Based on FIG.
11, the training process of the three-dimensional network model may be implemented
through the following operations.
[0094] At 201: A server acquires an object training sample carrying a label, the label indicating
a real position of a keypoint of the object training sample.
[0095] At 202: Obtain a training three-dimensional mesh configured for representing the
object training sample, and determine vertices of the training three-dimensional mesh
and a connection relationship between the vertices.
[0096] After the training three-dimensional mesh configured for representing the object
training sample is obtained, data enhancement may be further performed on the training
three-dimensional mesh, so that the three-dimensional network model is trained through
an enhanced training three-dimensional mesh. Specifically, a data enhancement method
for the training three-dimensional mesh is divided into patch simplification and patch
densification.
[0097] In some embodiments, when the patch simplification is performed on the training three-dimensional mesh, an edge optimization manner may be used. To be specific, the smallest edge between the vertices is found each time, and the corresponding two vertices are merged into one vertex. Specifically, an edge between any two vertices is acquired, and the edges are compared, to select the smallest edge from the edges as a target edge based on a comparison result; and then two vertices corresponding to the target edge are acquired, and the two vertices are merged into one vertex, to obtain the enhanced training three-dimensional mesh. For example, referring to FIG. 12, FIG. 12 is a schematic diagram of patch simplification of a three-dimensional mesh according to an embodiment of the present disclosure. Based on FIG. 12, there are ten vertices v1 to v10, and based on the ten vertices, edges including v1v2, v1v3, v1v4, v1v10, v1v9, v1v8, v5v2, v7v2, and v6v2 are formed. The edge between v1 and v2 is the smallest edge, and the two vertices are then merged into one vertex v, to obtain an enhanced training three-dimensional mesh.
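A minimal sketch of one simplification step follows, merging the endpoints of the smallest edge at their midpoint. The midpoint placement and the plain edge-length criterion are simplifying assumptions; a production implementation might instead use, for example, quadric error metrics.

```python
import numpy as np

def collapse_smallest_edge(vertices, edges):
    """One patch-simplification step: merge the two endpoints of the smallest edge.

    vertices: (N, 3) float array; edges: list of (i, j) vertex-index pairs.
    Returns the updated vertex array and edge list.
    """
    lengths = [np.linalg.norm(vertices[i] - vertices[j]) for i, j in edges]
    i, j = edges[int(np.argmin(lengths))]        # target edge = smallest edge
    merged = (vertices[i] + vertices[j]) / 2.0   # merge the two vertices into one vertex v
    vertices = np.vstack([np.delete(vertices, [i, j], axis=0), merged])
    v = len(vertices) - 1                        # index of the new vertex

    def remap(k):
        if k in (i, j):
            return v
        return k - sum(k > t for t in (i, j))    # account for the two deleted rows

    new_edges = {tuple(sorted((remap(a), remap(b)))) for a, b in edges}
    new_edges.discard((v, v))                    # drop the collapsed edge itself
    return vertices, sorted(new_edges)
```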
[0098] In some other embodiments, when the patch densification is performed on the training
three-dimensional mesh, barycentric coordinates are preferentially calculated for
a patch with a large area, and then the original triangular patch is divided into
three parts based on the barycentric coordinates. Specifically, at least one patch
is acquired, and comparison is performed on the patch. Based on a comparison result,
a patch with the largest area is selected from the plurality of patches as a target
patch. A center of gravity of the target patch and three vertices corresponding to
the target patch are determined, and then the original triangular patch is divided
into three parts based on barycentric coordinates and the three vertices. For example,
referring to FIG. 13, FIG. 13 is a schematic diagram of patch densification of a three-dimensional
mesh according to an embodiment of the present disclosure. Based on FIG. 13, there
are nine vertices A to I, and based on the nine vertices, eight triangular patches
are formed, to be specific, a patch between the vertices A, B, and C, a patch between
the vertices A, B, and I, a patch between the vertices H, B, and I, a patch between
the vertices H, B, and G, a patch between the vertices F, B, and G, a patch between
the vertices F, B, and E, a patch between the vertices D, B, and E, and a patch between
the vertices D, B, and C. Herein, the patch between the vertices A, B, and C is the target patch with the largest area. A center of gravity P of the target patch and the
corresponding vertices A, B, and C are determined, and then the original target patch
is divided into three parts based on P, A, B, and C, to obtain an enhanced training
three-dimensional mesh.
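One densification step can be sketched as follows: find the largest triangular patch, insert its center of gravity as a new vertex, and replace the patch with three smaller ones. The names and the in-place list mutation are illustrative.

```python
import numpy as np

def densify_largest_patch(vertices, faces):
    """One patch-densification step: split the largest triangle at its center of gravity.

    vertices: (N, 3) float array; faces: list of (a, b, c) vertex-index triples.
    """
    def area(f):
        a, b, c = (vertices[k] for k in f)
        return 0.5 * np.linalg.norm(np.cross(b - a, c - a))

    t = int(np.argmax([area(f) for f in faces]))          # target patch with the largest area
    a, b, c = faces.pop(t)
    p = (vertices[a] + vertices[b] + vertices[c]) / 3.0   # center of gravity P
    vertices = np.vstack([vertices, p])
    pi = len(vertices) - 1
    # Divide the original triangular patch into three parts based on P.
    faces += [(a, b, pi), (b, c, pi), (c, a, pi)]
    return vertices, faces
```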
[0099] Herein, a target quantity of vertices may be preset, to end a data enhancement process
of the training three-dimensional mesh. Specifically, in the data enhancement process
of the training three-dimensional mesh, a quantity of vertices of the enhanced training
three-dimensional mesh is acquired, the quantity of vertices is compared with the
preset target quantity of vertices, and the data enhancement of the training three-dimensional
mesh is ended based on a comparison result. Herein, when the patch simplification is performed on the training three-dimensional mesh, the data enhancement of the training three-dimensional mesh is ended when the comparison result represents that the quantity of vertices is less than the target quantity of vertices. When the patch densification is performed on the training three-dimensional mesh, the data enhancement of the training three-dimensional mesh is ended when the comparison result represents that the quantity of vertices is greater than the target quantity of vertices.
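Putting the two operations together, the termination rule can be sketched as a loop that stops once the preset target quantity of vertices is crossed, reusing the two helpers sketched above. Keeping the edge list and face list mutually consistent is glossed over here; a full implementation would rebuild the edges from the faces after each step.

```python
def enhance_mesh(vertices, edges, faces, target_vertex_count):
    """Run patch simplification or densification until the quantity of vertices
    crosses the preset target quantity of vertices."""
    if len(vertices) > target_vertex_count:
        while len(vertices) > target_vertex_count:   # simplify: ends once below the target
            vertices, edges = collapse_smallest_edge(vertices, edges)
    else:
        while len(vertices) < target_vertex_count:   # densify: ends once above the target
            vertices, faces = densify_largest_patch(vertices, faces)
    return vertices, edges, faces
```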
[0100] At 203: Perform feature extraction on the vertices of the object training sample
via a first feature extraction layer, to obtain a vertex feature of the training three-dimensional
mesh.
[0101] At 204: Perform global feature extraction on the object training sample based on
the vertex feature of the training three-dimensional mesh via a second feature extraction
layer, to obtain a global feature of the object training sample, and perform local
feature extraction on the object training sample based on the vertices of the training
three-dimensional mesh and the connection relationship between the vertices via a
third feature extraction layer, to obtain a local feature of the object training sample.
[0102] At 205: Perform detection on the keypoint of the object training sample via an output
layer based on the vertex feature of the training three-dimensional mesh, the global
feature of the object training sample, and the local feature of the object training
sample, to obtain a position of the keypoint of the object training sample on the
object training sample.
[0103] During actual implementation, the three-dimensional network model further includes
a first feature splicing layer. Therefore, a process of performing detection on the
keypoint of the object training sample via the output layer based on the vertex feature
of the training three-dimensional mesh, the global feature of the object training
sample, and the local feature of the object training sample, to obtain the position
of the keypoint of the object training sample on the object training sample may be:
performing feature splicing on the vertex feature of the training three-dimensional
mesh, the global feature of the object training sample, and the local feature of the
object training sample via the first feature splicing layer, to obtain a spliced feature
of the object training sample; and performing detection on the keypoint of the object
training sample based on the spliced feature of the object training sample via the
output layer, to obtain the position of the keypoint of the object training sample
on the object training sample.
[0104] At 206: Acquire a difference between the position of the keypoint of the object training
sample and the label, and train the three-dimensional network model based on the difference,
to obtain a target three-dimensional network model, the target three-dimensional network
model being configured for performing keypoint detection on a to-be-detected object,
to obtain a position of a keypoint of the to-be-detected object on the to-be-detected
object.
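A single training iteration over operations 201 to 206 can be sketched as follows in PyTorch. The disclosure does not fix a specific loss; the mean-squared error between the predicted heatmap and a target heatmap encoding the label is an assumption here, as are the model signature and names.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, vertices, adjacency, target_heatmap):
    """One training iteration of the three-dimensional network model.

    target_heatmap: (N, K) tensor encoding the real positions of the keypoints (the label).
    """
    optimizer.zero_grad()
    predicted = model(vertices, adjacency)        # (N, K) predicted heatmap
    # Difference between the detected positions and the label (MSE is an assumed choice).
    loss = F.mse_loss(predicted, target_heatmap)
    loss.backward()                               # train the model based on the difference
    optimizer.step()
    return loss.item()
```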
[0105] The following continues to describe the keypoint detection method provided in the
embodiments of the present disclosure. Referring to FIG. 14, FIG. 14 is a schematic
flowchart of a keypoint detection method according to an embodiment of the present
disclosure. Based on FIG. 14, the keypoint detection method provided in the embodiments
of the present disclosure is cooperatively implemented by a client and a server.
[0106] At 301: The client acquires, in response to an uploading operation of an object training
sample carrying a label, the object training sample carrying the label.
[0107] During actual implementation, the client may be a keypoint detection client disposed
in a terminal. A user triggers, based on a human-computer interaction interface of
the client, an uploading function in the human-computer interaction interface, to
enable the client to present an object selection interface on the human-computer interaction
interface. The user locally uploads, based on the object selection interface, the
object training sample carrying the label from the terminal, so that the client obtains
the uploaded object training sample.
[0108] In some embodiments, an object training sample may alternatively be captured by a
camera communicatively connected to a terminal. After capturing the object training
sample, the camera annotates a label on the object training sample, and then transmits
an object training sample carrying the label to the terminal, and the object training
sample carrying the label is automatically uploaded to the client by the terminal.
[0109] At 302: The client sends the object training sample to the server.
[0110] At 303: The server inputs the received object training sample to a three-dimensional
network model.
[0111] At 304: Perform detection on a keypoint of the object training sample based on the
three-dimensional network model, to obtain a position of the keypoint of the object
training sample.
[0112] At 305: Acquire a difference between the position of the keypoint of the object training
sample and the label, and train the three-dimensional network model based on the difference.
[0113] During actual implementation, the server completes training of the three-dimensional
network model by iterating the foregoing training process until a loss function converges.
[0114] At 306: The server generates a prompt message indicating that the training of the
three-dimensional network model is completed.
[0115] At 307: Send the prompt message to the client.
[0116] At 308: The client acquires point cloud data corresponding to a to-be-detected object
in response to an uploading operation of the point cloud data corresponding to the
to-be-detected object.
[0117] During actual implementation, the point cloud data corresponding to the to-be-detected
object may be prestored locally in the terminal, may be acquired from the outside
world (such as the Internet), or may be collected in real time, for example, collected
in real time by using a three-dimensional scanning apparatus such as a three-dimensional
scanner.
[0118] At 309: The client sends the point cloud data corresponding to the to-be-detected
object to the server in response to a keypoint detection instruction for the to-be-detected
object.
[0119] During actual implementation, the keypoint detection instruction for the to-be-detected
object may be automatically generated by the client under a specific trigger condition,
where for example, the keypoint detection instruction for the to-be-detected object
is automatically generated after the client acquires the point cloud data corresponding
to the to-be-detected object; may be sent to the client by another device communicatively
connected to the terminal; or may be generated after the user triggers a corresponding
determining function item based on the human-computer interaction interface of the
client.
[0120] At 310: The server inputs the received point cloud data corresponding to the to-be-detected
object to the three-dimensional network model, to enable the three-dimensional network
model to perform keypoint detection on the to-be-detected object, so as to obtain
a three-dimensional heatmap indicating a position of a keypoint of the to-be-detected
object on the to-be-detected object.
[0121] At 311: Send the three-dimensional heatmap configured for indicating the position
of the keypoint of the to-be-detected object on the to-be-detected object to the client.
[0122] At 312: The client displays the three-dimensional heatmap configured for indicating
the position of the keypoint of the to-be-detected object on the to-be-detected object.
[0123] During actual implementation, the client may display the three-dimensional heatmap
in the human-computer interaction interface of the client, may store the three-dimensional
heatmap locally in the terminal, may send the three-dimensional heatmap to the another
device communicatively connected to the terminal, or the like.
[0124] Through application of the foregoing embodiments of the present disclosure, a three-dimensional
mesh corresponding to a to-be-detected object is obtained, a global feature and a
local feature of the to-be-detected object are separately extracted through construction
of a dual-path feature extraction layer based on a vertex feature and a connection
relationship between vertices obtained by using the three-dimensional mesh, and then
a position of a keypoint on the to-be-detected object is obtained based on the vertex
feature obtained by using the three-dimensional mesh and the global feature and the
local feature obtained through extraction. In this way, richer feature information
of the to-be-detected object is extracted via a plurality of feature extraction layers,
and then detection is performed on the keypoint of the to-be-detected object based
on the rich feature information, so that accuracy of three-dimensional keypoint detection
is significantly improved.
[0125] An exemplary application of the embodiments of the present disclosure in an actual
application scenario is described below.
[0126] It is found that keypoint detection of a three-dimensional face character is generally
divided into two general types. The first general type is a method based on conventional
geometric analysis. Generally, a semantic keypoint of a three-dimensional head model
is directly positioned through sharp edge detection, curvature calculation, dihedral
angle calculation, normal vector calculation, and some specific geometric rules. For example, it may be assumed that the vertex with a maximum z-coordinate in a three-dimensional coordinate system is the keypoint at a nose tip. Sharp edge detection is performed below
the nose tip, and approximate regions of keypoints at the left and right corners of
a mouth may be roughly positioned with reference to a symmetry relationship. The second
general type is a method based on deep learning. In this general type of method, basically,
a three-dimensional head model is first rendered into a two-dimensional image, and
then a two-dimensional convolutional neural network is used to extract a feature,
to detect a corresponding keypoint. This type of method may be further divided into
different combination methods based on whether to perform multi-view detection and
whether to directly regress to a three-dimensional keypoint. For example, a common
combination method is to render only a front view of the three-dimensional head model
and record a rendering projection relationship, then detect two-dimensional keypoint
coordinates on the two-dimensional front view, and finally perform backward projection,
based on the known projection relationship, into a three-dimensional space, to obtain
final three-dimensional keypoint coordinates. Another combination method is to render
a plurality of views (for example, a front view and a side view), and respectively
input rendered views into different branches of a neural network model, so that the
neural network model directly regresses to three-dimensional keypoint coordinates
with reference to features of the two views.
[0127] However, for the foregoing first type of method, a conventional keypoint positioning
method based on the geometric analysis relies on manually set rules. For example,
during sharp edge detection, a threshold needs to be specified. This is an empirical
value, and is difficult to be applied to head models of different forms. Therefore,
robustness of the method is poor. For the foregoing second type of method, the method
based on the two-dimensional convolutional neural network has been successful in a
conventional two-dimensional image keypoint detection task. However, there are a plurality
of restrictions and disadvantages in directly applying the two-dimensional convolutional
neural network to three-dimensional keypoint detection. Specifically, first, a quantity of three-dimensional face models that can be acquired is far smaller than that of face images. In other words, a dataset is lacking, and therefore it is difficult to train the neural network effectively. Second, in a manner of rendering the three-dimensional face head model into the two-dimensional image, three-dimensional geometric information is inevitably lost. For example, for the front view, information about the back of a head is inevitably lacking. If a keypoint of the back of the head needs to be detected, the detection cannot be performed once the information is lost. Third, if a multi-view manner is used to avoid a problem of information loss as much as possible, features are extracted through a multi-branch network, and finally the neural network performs fusion and regression to three-dimensional coordinates. In this way, the neural network needs to learn the intrinsic connections between different views, and there may be a problem of convergence difficulty, thereby increasing training difficulty.
[0128] Based on this, the embodiments of the present disclosure provide a keypoint detection
method, an apparatus, an electronic device, a computer-readable storage medium, and
a computer program product, to effectively resolve a plurality of disadvantages of
the foregoing technical methods. Specifically, first, a three-dimensional face model
dataset is enhanced through patch simplification and patch densification, so that
a problem of lack of the three-dimensional head model dataset is resolved, and supervised
deep learning is provided with a guarantee of training data. Second, based on a graph neural network structure, a convolutional neural module is directly applied to a three-dimensional space. This avoids the loss of three-dimensional geometric information that is inherent in the method of performing detection in a two-dimensional space of rendered views, and also resolves a problem that intrinsic connections of different views are difficult to learn. Finally, a two-dimensional heatmap in a traditional sense is
expanded into a three-dimensional heatmap. In comparison with a manner of directly
regressing to three-dimensional coordinates, the three-dimensional heatmap can better
ensure local accuracy of a detection result.
[0129] Next, the technical solutions of the present disclosure are described from a product
side. Herein, the present disclosure provides a three-dimensional face keypoint detection
method that is based on a graph neural network structure and a three-dimensional heatmap.
This method may be integrated into a character animation tool set, to complete a transformation
matching process between different head models in cooperation with a non-rigid wrapping
algorithm. A specific product form herein may be a control. In response to a triggering
operation for the control, a keypoint detection request carrying related data of a
to-be-detected three-dimensional head model is sent to a remote server deployed with
the technical solutions of the present disclosure, to acquire a return result. Herein,
a manner in which the remote server is deployed facilitates iterative algorithm optimization,
and there is no need to update local plug-in code, thereby saving local computer resources.
[0130] Next, the technical solutions of the present disclosure are described below on a
technical side.
[0131] First, a graph convolutional neural network structure in the technical solutions
of the present disclosure is described. Specifically, a three-dimensional model (three-dimensional mesh) naturally has a graph structural relationship, and in the relationship, vertices are not as compactly and regularly arranged as pixels in a two-dimensional image. Therefore, it is inappropriate to directly use a traditional convolutional neural network, and a classic graph attention network (GAT) is introduced herein. Herein, for a GAT basic network included in the graph convolutional neural network structure, as shown in Formula (1), it is assumed that a graph structure (three-dimensional mesh) includes N nodes (vertices), where a feature vector (vertex feature) of each node is h, and a dimension of each node is F. Then, it is assumed that a node j is a neighbor of a node i (in other words, there is an edge connection relationship between i and j). In this case, importance (a correlation value) of the node j to the node i may be calculated by using an attention mechanism, as shown in Formula (2) and Formula (3). Specifically, a process of calculating the importance of the node j to the node i by using the attention mechanism may be: performing splicing on features Wh_i and Wh_j of the nodes i and j; calculating an inner product of the feature obtained through splicing and a weight vector a with a dimension of 2F, as shown in Formula (4); and determining, based on the importance of the node j to the node i, a feature vector (local feature) of the node i, as shown in Formula (6).
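For readability, the attention computations referenced above take the following standard form in the classic GAT literature (the disclosure's own formula numbering and exact notation are not reproduced here):

$$e_{ij} = \mathrm{LeakyReLU}\!\left(a^{\top}\left[W h_i \,\|\, W h_j\right]\right), \qquad \alpha_{ij} = \mathrm{softmax}_j\!\left(e_{ij}\right) = \frac{\exp(e_{ij})}{\sum_{m \in N_i} \exp(e_{im})}, \qquad h_i' = \sigma\!\left(\sum_{j \in N_i} \alpha_{ij} W h_j\right)$$

where $\|$ denotes the splicing of the two transformed features, $a$ is the weight vector with a dimension of $2F$, and $N_i$ is the neighborhood of the node $i$.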
[0132] During actual application, K feature vectors (local subfeatures) corresponding to
the node i may alternatively be obtained in a multi-layer GAT splicing manner, in
other words, by using K attention mechanisms, and the K feature vectors are spliced,
to obtain a final feature vector (local feature) corresponding to the node i, as shown
in Formula (7). In this way, based on the GAT property of relying only on edges (the connection relationship between vertices) rather than a complete graph structure, flexibility of a keypoint detection process is improved. In addition, an attention mechanism is
used, so that different weights can be assigned to different neighbor nodes, thereby
improving accuracy of the keypoint detection process.
[0133] Herein, after descriptions of the graph attention network (GAT), referring to FIG.
15, FIG. 15 is a schematic structural diagram of a graph convolutional neural network
according to an embodiment of the present disclosure. Herein, a three-dimensional
head model keypoint automatic detection neural network shown in FIG. 15 is constructed
based on the GAT. Based on FIG. 15, input data, namely, vertex data is N*(6+X) (vertices
of a three-dimensional mesh), N represents a quantity of vertices of a three-dimensional
model (three-dimensional mesh), 6 represents dimensions occupied by vertex coordinates
and a normal vector, and X includes other characteristics of the vertices of the three-dimensional
head model (three-dimensional mesh), including a curvature, texture information, and
the like. Herein, these other characteristics may be adjusted based on different data
and tasks. Generally, richer input characteristics are more beneficial to learning of the neural network. Aij is a vertex connection relationship matrix (the connection relationship between the vertices), and has a size of N*N. A value of Aij is 0 or 1. For example, if two vertices i and j are connected, Aij is 1; otherwise, Aij is 0.
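The connection relationship matrix can be built directly from the triangle faces of the mesh, as in the short sketch below; the function name and the face-list input format are illustrative.

```python
import numpy as np

def connection_matrix(num_vertices, faces):
    """Build the N*N vertex connection relationship matrix A from triangle faces:
    A[i, j] = 1 if vertices i and j share an edge, and 0 otherwise."""
    A = np.zeros((num_vertices, num_vertices), dtype=np.int8)
    for a, b, c in faces:
        for i, j in ((a, b), (b, c), (c, a)):
            A[i, j] = A[j, i] = 1
    return A
```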
[0134] Based on FIG. 15, a multi-layer perceptron (MLP) represents a plurality of fully-connected perceptual layers. The vertex data (vertices of the three-dimensional mesh) first passes through an MLP module with a hidden layer dimension of [128, 64], to obtain a preliminary hidden layer feature X1 (vertex feature). Then the vertex data is divided into two paths (global feature extraction and local feature extraction). One path continues to pass through an MLP module ([512, 1024]), max pooling is performed on an output feature X2, to acquire global feature information X3, and then the global feature information is shared by all N vertices, to determine a global feature N×X3. The other path passes through three groups of GAT modules. Each GAT module includes eight layers of attention base networks (heads). Herein, outputs of the three groups of GAT modules are spliced together to determine a local feature. Finally, the two paths of features are spliced and inputted into a final MLP module ([1024, 512, K]), to obtain final three-dimensional heatmap data of N*K (K is a quantity of keypoints), and the data is visualized on the three-dimensional head model, to obtain K three-dimensional heatmaps (one for each keypoint).
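A speculative PyTorch sketch of this dual-path wiring is given below. Only the MLP dimensions, the three GAT groups, and the eight heads per group come from the text; the class names, the 8 channels per head (so that each group outputs 64 channels), the activation choices, and the assumption that every vertex has at least one connected neighbor (for example, via self-loops) are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseGAT(nn.Module):
    """One GAT module: several attention heads over a dense N*N adjacency, heads spliced."""
    def __init__(self, dim_in, dim_head, heads=8):
        super().__init__()
        self.W = nn.Linear(dim_in, dim_head * heads, bias=False)
        self.a_src = nn.Parameter(torch.randn(heads, dim_head))
        self.a_dst = nn.Parameter(torch.randn(heads, dim_head))
        self.heads, self.dim_head = heads, dim_head

    def forward(self, x, adj):                                   # x: (N, dim_in), adj: (N, N) 0/1
        n = x.size(0)
        wh = self.W(x).view(n, self.heads, self.dim_head)        # (N, H, D)
        e = F.leaky_relu((wh * self.a_src).sum(-1).unsqueeze(1)      # score of vertex i ...
                         + (wh * self.a_dst).sum(-1).unsqueeze(0))   # ... plus score of vertex j
        e = e.masked_fill(adj.unsqueeze(-1) == 0, float("-inf"))     # keep connected pairs only
        alpha = torch.softmax(e, dim=1)                          # correlation values over neighbors j
        out = torch.einsum("ijh,jhd->ihd", alpha, wh)
        return out.reshape(n, -1)                                # splice the heads

class KeypointNet(nn.Module):
    """Dual-path sketch of the FIG. 15 network."""
    def __init__(self, dim_in, num_keypoints):
        super().__init__()
        self.mlp1 = nn.Sequential(nn.Linear(dim_in, 128), nn.ReLU(),
                                  nn.Linear(128, 64), nn.ReLU())
        self.mlp2 = nn.Sequential(nn.Linear(64, 512), nn.ReLU(),
                                  nn.Linear(512, 1024), nn.ReLU())
        self.gats = nn.ModuleList([DenseGAT(64, 8) for _ in range(3)])  # three GAT groups, 8 heads each
        self.head = nn.Sequential(nn.Linear(1024 + 3 * 64, 1024), nn.ReLU(),
                                  nn.Linear(1024, 512), nn.ReLU(),
                                  nn.Linear(512, num_keypoints))

    def forward(self, x, adj):                       # x: (N, 6 + X) vertex data
        x1 = self.mlp1(x)                            # preliminary hidden layer feature X1
        g = self.mlp2(x1).max(dim=0).values          # max pooling -> global feature X3
        g = g.unsqueeze(0).expand(x1.size(0), -1)    # share X3 across all N vertices
        h, locals_ = x1, []
        for gat in self.gats:
            h = gat(h, adj)                          # each group outputs 8 * 8 = 64 channels
            locals_.append(h)
        local = torch.cat(locals_, dim=1)            # splice the three GAT outputs
        return self.head(torch.cat([g, local], dim=1))   # (N, K) heatmap logits
```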
[0135] Because of the characteristics of the GAT module and the MLP module, a fixed quantity N of vertices is not needed in a same network structure. This enables three-dimensional face models with different quantities of vertices to be used as input of the neural network model in both a training stage and an actual use stage, thereby improving applicability of the present disclosure.
[0136] Second, the three-dimensional heatmap in the technical solutions of the present disclosure
is described. Because a heatmap of a three-dimensional mesh no longer has a structure
with compact coordinates of a two-dimensional image, in comparison with a two-dimensional
heatmap using a Euclidean distance, the three-dimensional heatmap uses a geodesic
distance herein. In this way, at the three-dimensional mesh level, in comparison with the Euclidean distance between two points, the geodesic distance between the two points, which follows a shortest path on the mesh graph structure, better indicates a characteristic of the three-dimensional surface. For example, referring to FIG. 16, FIG. 16 is a comparison
diagram of a geodesic distance and a Euclidean distance according to an embodiment
of the present disclosure. Based on FIG. 16, a straight line between two vertices
as indicated by 1602 is the Euclidean distance, and a curve as indicated by 1601 is
the corresponding geodesic distance.
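On the mesh graph, the geodesic distance can be approximated by a shortest path whose edge weights are Euclidean edge lengths, and a per-keypoint heatmap can then be defined to decay with that distance. The Dijkstra approximation and the Gaussian decay below are assumptions for illustration; the disclosure specifies the geodesic distance but not a particular kernel.

```python
import heapq
import numpy as np

def geodesic_distances(source, vertices, adjacency_list):
    """Approximate geodesic distance from `source` to every vertex: Dijkstra over
    the mesh graph with edge weights equal to Euclidean edge lengths."""
    dist = np.full(len(vertices), np.inf)
    dist[source] = 0.0
    queue = [(0.0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > dist[u]:
            continue
        for v in adjacency_list[u]:
            nd = d + np.linalg.norm(vertices[u] - vertices[v])
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(queue, (nd, v))
    return dist

def heatmap_target(keypoint_vertex, vertices, adjacency_list, sigma=1.0):
    """Three-dimensional heatmap over the mesh, peaking at the keypoint vertex and
    decaying with geodesic rather than Euclidean distance."""
    d = geodesic_distances(keypoint_vertex, vertices, adjacency_list)
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))
```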
[0137] When a graph neural network is trained and put into use, the three-dimensional heatmap
outputted by the neural network needs to be further transformed into final three-dimensional
keypoint coordinates. Herein, conventional manners of transforming a two-dimensional heatmap into two-dimensional coordinates include the following two operations: acquiring coordinates of the vertex at which the maximum probability is located (referred to as an argmax method), or weighting softmax probability expectations of a plurality of vertex coordinates (that is, a soft-argmax method), to obtain final keypoint coordinates. For the present disclosure, considering that a plurality of
three-dimensional coordinates are weighted according to the soft-argmax method, a
result does not necessarily fall on a three-dimensional mesh plane. Therefore, the
argmax method is directly used herein, in other words, coordinates of a vertex at
which a maximum probability is located are acquired, to determine final three-dimensional
keypoint coordinates.
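The contrast between the two readout methods can be made concrete as follows: the soft-argmax result is a weighted average of vertex coordinates and therefore generally does not fall on the mesh surface, which is the stated reason for preferring argmax here. Names and shapes are illustrative.

```python
import numpy as np

def argmax_keypoint(heatmap_k, vertices):
    """Coordinates of the vertex at which the maximum probability is located;
    the result always lies on the mesh."""
    return vertices[int(np.argmax(heatmap_k))]

def soft_argmax_keypoint(heatmap_k, vertices):
    """Softmax-weighted expectation of vertex coordinates; the weighted average
    generally does not fall on the three-dimensional mesh plane."""
    w = np.exp(heatmap_k - heatmap_k.max())
    w /= w.sum()
    return w @ vertices
```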
[0138] Finally, the data enhancement method in the present disclosure is described.
[0139] Different from a two-dimensional face image and two-dimensional keypoint data, three-dimensional face mesh data is extremely difficult to acquire in large quantities. Lack of data is a major problem that plagues supervised neural network learning. A graph neural network can learn sufficient detection capabilities only if a dataset is large enough and can cover different face forms. However, a three-dimensional face keypoint dataset is difficult to acquire, for causes in the following several aspects. Specifically, first, three-dimensional face mesh data is produced by art staff, and the production process is cumbersome, whereas generating a two-dimensional image requires only clicking a camera shutter once. Therefore, in datasets disclosed on the Internet or in academia, two-dimensional face images are rich, and corresponding three-dimensional face data is lacking. Second, for a keypoint detection task, manual annotation needs to be performed on a keypoint in advance (or annotation is implemented through initial automatic detection by an algorithm followed by a small amount of manual correction). Two-dimensional keypoint annotation work has previously been performed by a large number of people, and the annotation tool is not complex: essentially, only a specific pixel in an image needs to be annotated. However, keypoint annotation of a three-dimensional mesh is more difficult. For example, it is difficult for an annotator to confirm a face contour. Therefore, when three-dimensional data is lacking, it is not feasible to develop a corresponding annotation tool based on an existing three-dimensional face head model to manually annotate three-dimensional keypoints. Based on this, in the technical solutions of the present disclosure, data enhancement is performed on existing three-dimensional face model data based on patch simplification and patch densification, thereby providing normalized and reasonable training data for the graph neural network.
[0140] Herein, the data enhancement method is divided into the patch simplification and
the patch densification. For the patch simplification, an edge optimization manner
may be used, to be specific, the smallest edge between nodes is found each time, and
two nodes corresponding to the smallest edge are merged into one vertex, as shown
in FIG. 12. For the patch densification, barycentric coordinates are preferentially
calculated for a patch with a large area, and then the original triangular patch is
divided into three parts based on the barycentric coordinates, as shown in FIG. 13.
Herein, termination of operations of both the patch densification and the patch simplification
can be controlled by using a final quantity of target vertices.
[0141] In this way, in the present disclosure, a specific keypoint of a three-dimensional
game head model is automatically detected, so that an accurate and reliable keypoint
basis can be provided for subsequent registration work of the three-dimensional head
model. In comparison with the conventional manner of manually annotating and then performing head model registration, according to the present disclosure, excessive manual participation can be avoided, so that keypoint-dependent work such as registration of the three-dimensional head model can be automatically completed. This reduces the manual workload of the art staff, thereby speeding up an entire production process related to animation of a model character.
[0142] Further, in the present disclosure, based on supervised deep learning of a graph neural network, a position of the three-dimensional keypoint can be accurately and robustly predicted. In addition, a forward calculation speed of a deep learning model is extremely fast: the algorithm as a whole requires only one second to complete automatic annotation, while a manual manner usually takes several minutes. Therefore, the present disclosure has great practical value in terms of efficiency. In addition, a quantity of inputted vertices of a three-dimensional face model is not limited in the present disclosure. After supervised learning training is performed, the generated deep learning model can be widely applied to tasks of automatic detection of keypoints of three-dimensional head models with different vertex densities, and has strong applicability.
[0143] Through application of the foregoing embodiments of the present disclosure, a three-dimensional
mesh corresponding to a to-be-detected object is obtained, a global feature and a
local feature of the to-be-detected object are separately extracted through construction
of a dual-path feature extraction layer based on a vertex feature and a connection
relationship between vertices obtained by using the three-dimensional mesh, and then
a position of a keypoint on the to-be-detected object is obtained based on the vertex
feature obtained by using the three-dimensional mesh and the global feature and the
local feature obtained through extraction. In this way, richer feature information
of the to-be-detected object is extracted via a plurality of feature extraction layers,
and then detection is performed on the keypoint of the to-be-detected object based
on the rich feature information, so that accuracy of three-dimensional keypoint detection
is significantly improved.
[0144] The following continues to describe an exemplary structure in which an implementation
of the keypoint detection apparatus 455 provided in the embodiments of the present
disclosure is a software module. In some embodiments, as shown in FIG. 2, the software
module in the keypoint detection apparatus 455 stored in the memory 450 may include:
an obtaining module 4551, configured to obtain a three-dimensional mesh configured
for representing a to-be-detected object, and determine vertices of the three-dimensional
mesh and a connection relationship between the vertices;
a first feature extraction module 4552, configured to perform feature extraction on
the vertices of the three-dimensional mesh, to obtain a vertex feature of the three-dimensional
mesh; and
a second feature extraction module 4553, configured to perform global feature extraction
on the to-be-detected object based on the vertex feature, to obtain a global feature
of the to-be-detected object, and perform local feature extraction on the to-be-detected
object based on the vertex feature and the connection relationship between the vertices,
to obtain a local feature of the to-be-detected object; and
an output module 4554, configured to perform detection on a keypoint of the to-be-detected
object based on the vertex feature, the global feature, and the local feature, to
obtain a position of the keypoint of the to-be-detected object on the to-be-detected
object.
[0145] In some embodiments, the obtaining module 4551 is further configured to scan the
to-be-detected object by using a three-dimensional scanning apparatus, to obtain point
cloud data of a geometric surface of the to-be-detected object; and construct the
three-dimensional mesh corresponding to the to-be-detected object based on the point
cloud data.
[0146] In some embodiments, the second feature extraction module 4553 is further configured
to determine a local feature of each of the vertices based on the vertex feature and
the connection relationship between the vertices; and determine the local feature
of the to-be-detected object based on the local feature of each of the vertices.
[0147] In some embodiments, the second feature extraction module 4553 is further configured
to perform the following processing for each of the vertices: determining the vertex
as a reference vertex, and determining a vertex feature of the reference vertex and
a vertex feature of another vertex based on a vertex feature of each vertex in the
three-dimensional mesh; the another vertex being any vertex other than the reference
vertex; determining a correlation value between the reference vertex and the another
vertex based on the vertex feature of the reference vertex, the vertex feature of
the another vertex, and the connection relationship between the vertices, the correlation value indicating a magnitude of a correlation degree between the reference vertex and the another vertex; and determining a local feature of the reference vertex based
on the correlation value and the vertex feature of the another vertex.
[0148] In some embodiments, the second feature extraction module 4553 is further configured
to determine the correlation degree between the reference vertex and the another vertex
by using an attention mechanism based on the vertex feature of the reference vertex,
the vertex feature of the another vertex, and the connection relationship between
the vertices; and perform normalization processing on the correlation degree, to obtain
the correlation value between the reference vertex and the another vertex.
[0149] In some embodiments, when a quantity of other vertices is one, the second feature extraction
module 4553 is further configured to perform multiplication on the correlation value
and the vertex feature of the another vertex, to obtain a multiplication result; and
determine the local feature of the reference vertex based on the multiplication result.
[0150] In some embodiments, when a quantity of other vertices is more than one, the second
feature extraction module 4553 is further configured to perform, for each of the other
vertices, multiplication on the correlation value and a vertex feature of the another
corresponding vertex, to obtain a multiplication result of the another vertex; perform
cumulative summation on multiplication results of the other vertices, to obtain a
summation result; and determine the local feature of the reference vertex based on
the summation result.
[0151] In some embodiments, the second feature extraction module 4553 is further configured
to perform feature fusion on the local feature of each of the vertices based on the
local feature of each of the vertices, to obtain a fused feature; and use the fused
feature as the local feature of the to-be-detected object.
[0152] In some embodiments, the output module 4554 is further configured to perform feature
splicing on the vertex feature, the global feature, and the local feature, to obtain
a spliced feature of the to-be-detected object; and perform detection on the keypoint
of the to-be-detected object based on the spliced feature, to obtain the position
of the keypoint of the to-be-detected object on the to-be-detected object.
[0153] In some embodiments, the output module 4554 is further configured to perform detection
on the keypoint of the to-be-detected object based on the vertex feature, the global
feature, and the local feature, to obtain a probability of the keypoint being at each
of the vertices in the three-dimensional mesh; generate a three-dimensional heatmap
corresponding to the three-dimensional mesh based on the probability; and determine
the position of the keypoint of the to-be-detected object on the to-be-detected object
based on the three-dimensional heatmap.
[0154] In some embodiments, the apparatus is used in a three-dimensional network model.
The three-dimensional network model includes at least a first feature extraction layer,
a second feature extraction layer, a third feature extraction layer, and an output
layer. The first feature extraction module 4552 is further configured to perform feature
extraction on the vertices of the three-dimensional mesh via the first feature extraction
layer, to obtain the vertex feature of the three-dimensional mesh. The second feature
extraction module 4553 is further configured to perform global feature extraction
on the to-be-detected object based on the vertex feature via the second feature extraction
layer, to obtain the global feature of the to-be-detected object, and perform local
feature extraction on the to-be-detected object based on the vertex feature and the
connection relationship between the vertices via the third feature extraction layer,
to obtain the local feature of the to-be-detected object. The output module 4554 is
further configured to perform detection on the keypoint of the to-be-detected object
via the output layer based on the vertex feature, the global feature, and the local
feature, to obtain the position of the keypoint of the to-be-detected object on the
to-be-detected object.
[0155] In some embodiments, the three-dimensional network model further includes a first
feature splicing layer, a second feature splicing layer, and a fourth feature extraction
layer. The output module 4554 is further configured to perform feature splicing on
the vertex feature, the global feature, and the local feature via the first feature
splicing layer, to obtain the spliced feature of the to-be-detected object; perform
local feature extraction on the to-be-detected object based on the spliced feature
via the fourth feature extraction layer, to obtain a target local feature of the to-be-detected
object; perform feature splicing on the spliced feature, the global feature, and the
target local feature via the second feature splicing layer, to obtain a target spliced
feature of the to-be-detected object; and perform detection on the keypoint of the
to-be-detected object based on the target spliced feature via the output layer, to
obtain the position of the keypoint of the to-be-detected object on the to-be-detected
object.
[0156] The following continues to describe an exemplary structure in which an implementation
of an apparatus for training a three-dimensional network model provided in the embodiments
of the present disclosure is a software module. The three-dimensional network model
includes at least a first feature extraction layer, a second feature extraction layer,
a third feature extraction layer, and an output layer. The training apparatus includes
an acquiring module, an obtaining module, a first feature extraction module, a second
feature extraction module, an output module, and an update module.
[0157] The acquiring module is configured to acquire an object training sample carrying
a label, where the label indicates a real position of a keypoint of the object training
sample.
[0158] The obtaining module is configured to obtain a training three-dimensional mesh representing
the object training sample, and determine vertices of the training three-dimensional
mesh and a connection relationship between the vertices.
[0159] The first feature extraction module is configured to perform feature extraction on
the vertices of the object training sample via the first feature extraction layer,
to obtain a vertex feature of the training three-dimensional mesh.
[0160] The second feature extraction module is configured to perform global feature extraction
on the object training sample based on the vertex feature of the training three-dimensional
mesh via the second feature extraction layer, to obtain a global feature of the object
training sample, and perform local feature extraction on the object training sample
based on the vertices of the training three-dimensional mesh and the connection relationship
between the vertices via the third feature extraction layer, to obtain a local feature
of the object training sample.
[0161] The output module is configured to perform detection on the keypoint of the object
training sample via the output layer based on the vertex feature of the training three-dimensional
mesh, the global feature of the object training sample, and the local feature of the
object training sample, to obtain a position of the keypoint of the object training
sample on the object training sample.
[0162] The update module is configured to acquire a difference between the position of the
keypoint of the object training sample and the label, and train the three-dimensional
network model based on the difference, to obtain a target three-dimensional network
model, the target three-dimensional network model being configured for performing
keypoint detection on a to-be-detected object, to obtain a position of a keypoint
of the to-be-detected object on the to-be-detected object.
[0163] An embodiment of the present disclosure further provides an electronic device, including:
a memory, configured to store computer-executable instructions; and
a processor, configured to implement, when executing the computer-executable instructions
stored in the memory, the keypoint detection method or the method for training a three-dimensional
network model in the embodiments of the present disclosure, for example, the keypoint
detection method shown in FIG. 3, or the method for training a three-dimensional network
model shown in FIG. 11.
[0164] An embodiment of the present disclosure provides a computer program product or a
computer program. The computer program product or the computer program includes computer-executable
instructions. The computer-executable instructions are stored in a computer-readable
storage medium. A processor of an electronic device reads the computer-executable
instructions from the computer-readable storage medium, and the processor executes
the computer-executable instructions, to cause the electronic device to perform the
keypoint detection method or the method for training a three-dimensional network model
in the embodiments of the present disclosure, for example, the keypoint detection
method shown in FIG. 3 or the method for training a three-dimensional network model
shown in FIG. 11.
[0165] An embodiment of the present disclosure provides a computer-readable storage medium,
having computer-executable instructions stored therein. When the computer-executable
instructions are executed by a processor, the processor performs the keypoint detection
method or the method for training a three-dimensional network model provided in the
embodiments of the present disclosure, for example, the keypoint detection method
shown in FIG. 3 or the method for training a three-dimensional network model shown
in FIG. 11.
[0166] In some embodiments, the computer-readable storage medium may be a memory such as
an FRAM, a ROM, a PROM, an EPROM, an EEPROM, a flash memory, a magnetic surface memory,
a compact disc, or a CD-ROM, or may be various devices including one or any combination
of the foregoing memories.
[0167] In some embodiments, the computer-executable instruction may be written in any form
of programming language (including a compiled or interpreted language, or a declarative
or procedural language) in the form of a program, software, a software module, a script,
or code, and may be deployed in any form, including being deployed as an independent
program or being deployed as a module, a component, a subroutine, or another unit
suitable for use in a computing environment.
[0168] In an example, the computer-executable instructions may, but do not necessarily, correspond to a file in a file system, and may be stored in a part of a file for storing another
program or other data, for example, stored in one or more scripts in a hypertext markup
language (HTML) document, in a single file specifically configured for a discussed
program, or in a plurality of collaborative files (for example, files storing one
or more modules, a subprogram, or a code part).
[0169] In an example, the computer-executable instructions may be deployed to be executed on one electronic device, on a plurality of electronic devices located at one position, or on a plurality of electronic devices distributed at a plurality of positions and interconnected through a communication network.
[0170] In conclusion, the embodiments of the present disclosure have the following beneficial
effects:
- (1) Richer feature information of a to-be-detected object is extracted via a plurality of feature extraction layers, and keypoint detection is then performed on the to-be-detected object based on this rich feature information, so that accuracy of three-dimensional keypoint detection is significantly improved.
- (2) A graph attention network (GAT) is used, which relies only on edge connectivity rather than on a complete graph structure, so that flexibility of the keypoint detection process is improved. In addition, an attention mechanism is used, so that different weights can be assigned to different neighbor nodes, thereby improving accuracy of the keypoint detection process (see the sketch after this list).
- (3) A specific keypoint of a three-dimensional game head model is automatically detected, so that an accurate and reliable keypoint basis can be provided for subsequent registration work on the three-dimensional head model. Compared with the conventional manner of manually annotating keypoints and then performing head model registration, the present disclosure avoids excessive manual participation, so that keypoint-dependent work such as registration of the three-dimensional head model can be completed automatically. This reduces the workload of art staff, thereby speeding up the entire production process related to animation of a model character.
- (4) Based on supervised deep learning of a graph neural network, the position of a three-dimensional keypoint can be accurately predicted with strong robustness. In addition, the forward calculation speed of a deep learning model is extremely fast: the algorithm requires only one second in total to complete automatic annotation, whereas a manual manner usually takes several minutes. Therefore, the present disclosure has great practical value in terms of efficiency. In addition, the quantity of inputted vertices of a three-dimensional face model is not limited in the present disclosure. After supervised training, the generated deep learning model can be widely applied to tasks of automatically detecting keypoints of three-dimensional head models with different vertex densities, and has strong applicability.
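By way of illustration of beneficial effect (2), the following is a minimal sketch of a graph-attention layer that relies only on the mesh edges rather than on a complete graph structure, and that assigns different attention weights to different neighbor vertices. PyTorch, the name EdgeOnlyGATLayer, and all dimensions are assumptions made for this sketch and are not prescribed by the present disclosure.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EdgeOnlyGATLayer(nn.Module):
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.proj = nn.Linear(in_dim, out_dim, bias=False)  # shared projection of vertex features
            self.attn = nn.Linear(2 * out_dim, 1, bias=False)   # attention scorer over feature pairs

        def forward(self, x, edge_index):
            # x: (N, in_dim) vertex features; edge_index: (2, E) mesh edges only.
            h = self.proj(x)
            src, dst = edge_index  # each edge points from a neighbor to a reference vertex
            # Raw correlation degree between each reference vertex and its neighbor.
            e = F.leaky_relu(self.attn(torch.cat([h[dst], h[src]], dim=-1))).squeeze(-1)
            # Normalize per reference vertex over its neighbors only (edge-wise softmax).
            alpha = torch.exp(e - e.max())
            denom = torch.zeros(x.size(0)).index_add_(0, dst, alpha)
            alpha = alpha / (denom[dst] + 1e-9)
            # Weighted aggregation: different weights for different neighbor vertices.
            return torch.zeros_like(h).index_add_(0, dst, alpha.unsqueeze(-1) * h[src])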
[0171] The foregoing descriptions are merely the embodiments of the present disclosure,
and are not intended to limit the protection scope of the present disclosure. Any
modification, equivalent replacement, or improvement made without departing from the
spirit and principle of the present disclosure shall fall within the protection scope
of the present disclosure.
1. A method for detecting a keypoint of a to-be-detected object, performed by an electronic
device, the method comprising:
obtaining a three-dimensional mesh representing the to-be-detected object, and determining vertices of the three-dimensional mesh and a connection relationship between the vertices;
performing feature extraction on the vertices of the three-dimensional mesh, to obtain
a vertex feature of the three-dimensional mesh;
performing global feature extraction on the to-be-detected object using the vertex
feature, to obtain a global feature of the to-be-detected object;
performing local feature extraction on the to-be-detected object using the vertex
feature and the connection relationship between the vertices, to obtain a local feature
of the to-be-detected object; and
performing detection on a keypoint of the to-be-detected object using the vertex feature,
the global feature, and the local feature, to obtain a position of the keypoint of
the to-be-detected object on the to-be-detected object.
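By way of illustration only, the global feature extraction step of claim 1 admits, for example, a PointNet-style realization in which a shared per-vertex network is followed by an order-invariant pooling over all vertices. The layer sizes and the choice of max pooling below are assumptions for this sketch, not limitations of the claim.

    import torch
    import torch.nn as nn

    class GlobalFeatureExtractor(nn.Module):
        def __init__(self, in_dim=64, global_dim=256):  # illustrative sizes
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(in_dim, 128), nn.ReLU(),
                nn.Linear(128, global_dim), nn.ReLU(),
            )

        def forward(self, vertex_features):
            # vertex_features: (N, in_dim) -> one (global_dim,) feature for the whole object.
            h = self.mlp(vertex_features)
            return h.max(dim=0).values  # order-invariant pooling over all vertices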
2. The method of claim 1, wherein obtaining the three-dimensional mesh representing the to-be-detected object comprises:
scanning the to-be-detected object by using a three-dimensional scanning apparatus,
to obtain point cloud data of a geometric surface of the to-be-detected object; and
constructing the three-dimensional mesh corresponding to the to-be-detected object
using the point cloud data.
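By way of illustration only, claim 2 could be realized, for example, with Open3D and Poisson surface reconstruction as sketched below; the library, the reconstruction algorithm, and the file name scan.ply are assumptions, since the claim does not prescribe a particular reconstruction method.

    import open3d as o3d

    pcd = o3d.io.read_point_cloud("scan.ply")  # point cloud from a 3D scanning apparatus (assumed file)
    pcd.estimate_normals()                     # Poisson reconstruction needs per-point normals
    mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=8)
    vertices = mesh.vertices                   # vertices of the three-dimensional mesh
    triangles = mesh.triangles                 # connection relationship between the vertices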
3. The method of claim 1 or 2, wherein performing the local feature extraction on the
to-be-detected object using the vertex feature and the connection relationship between
the vertices, to obtain the local feature of the to-be-detected object comprises:
determining a respective local feature of each of the vertices using the vertex feature
and the connection relationship between the vertices; and
determining the local feature of the to-be-detected object using the respective local
feature of each of the vertices.
4. The method of claim 3, wherein determining the respective local feature of each of
the vertices using the vertex feature and the connection relationship between the
vertices comprises:
performing the following processing for each of the vertices:
determining the vertex as a reference vertex, and determining a vertex feature of
the reference vertex and a vertex feature of another vertex using a respective vertex
feature of each vertex in the three-dimensional mesh, wherein the another vertex is
any vertex other than the reference vertex;
determining a correlation value between the reference vertex and the another vertex
using the vertex feature of the reference vertex, the vertex feature of the another
vertex, and the connection relationship between the vertices, wherein the correlation
value indicates a magnitude of a correlation degree between the reference vertex and
the another vertex; and
determining a local feature of the reference vertex using the correlation value and
the vertex feature of the another vertex.
5. The method of claim 4, wherein determining the correlation value between the reference
vertex and the another vertex using the vertex feature of the reference vertex, the
vertex feature of the another vertex, and the connection relationship between the
vertices comprises:
determining the correlation degree between the reference vertex and the another vertex
by using an attention mechanism based on the vertex feature of the reference vertex,
the vertex feature of the another vertex, and the connection relationship between
the vertices; and
performing normalization processing on the correlation degree, to obtain the correlation
value between the reference vertex and the another vertex.
6. The method of claim 4, wherein when a quantity of other vertices is one, determining
the local feature of the reference vertex using the correlation value and the vertex
feature of the another vertex comprises:
performing multiplication on the correlation value and the vertex feature of the another
vertex, to obtain a multiplication result; and
determining the local feature of the reference vertex using the multiplication result.
7. The method of claim 4, wherein when a quantity of other vertices is more than one,
determining the local feature of the reference vertex using the correlation value
and the vertex feature of the another vertex comprises:
for each of the other vertices, performing multiplication on the correlation value between the reference vertex and the vertex and on the vertex feature of the vertex, to obtain a multiplication result of the vertex;
performing cumulative summation on multiplication results of the other vertices, to
obtain a summation result; and
determining the local feature of the reference vertex using the summation result.
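Claims 4 to 7 can be illustrated with a small worked example: raw correlation degrees are normalized into correlation values (claim 5), each neighbor's vertex feature is multiplied by its correlation value, and the products are summed (claims 6 and 7). All numbers below are illustrative only.

    import torch

    h_nbr = torch.tensor([[0.5, 0.5], [0.0, 1.0]])   # vertex features of two connected vertices
    e = torch.tensor([2.0, 1.0])                     # raw correlation degrees with the reference vertex
    alpha = torch.softmax(e, dim=0)                  # claim 5: normalization -> [0.731, 0.269]
    local = (alpha.unsqueeze(1) * h_nbr).sum(dim=0)  # claims 6-7: multiply, then cumulative summation
    # local = 0.731 * [0.5, 0.5] + 0.269 * [0.0, 1.0] = [0.366, 0.634]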
8. The method of claim 3, wherein determining the local feature of the to-be-detected
object using the respective local feature of each of the vertices comprises:
performing feature fusion on the local features of the vertices using the respective
local feature of each of the vertices, to obtain a fused feature; and
determining the fused feature as the local feature of the to-be-detected object.
9. The method of any one of claims 1 to 8, wherein performing detection on the keypoint
of the to-be-detected object using the vertex feature, the global feature, and the
local feature, to obtain the position of the keypoint of the to-be-detected object
on the to-be-detected object comprises:
performing feature splicing on the vertex feature, the global feature, and the local
feature, to obtain a spliced feature of the to-be-detected object; and
performing detection on the keypoint of the to-be-detected object using the spliced
feature, to obtain the position of the keypoint of the to-be-detected object on the
to-be-detected object.
10. The method of any one of claims 1 to 9, wherein performing detection on the keypoint
of the to-be-detected object using the vertex feature, the global feature, and the
local feature, to obtain the position of the keypoint of the to-be-detected object
on the to-be-detected object comprises:
performing detection on the keypoint of the to-be-detected object using the vertex
feature, the global feature, and the local feature, to obtain a probability of the
keypoint being at each of the vertices in the three-dimensional mesh;
generating a three-dimensional heatmap corresponding to the three-dimensional mesh
using the probability; and
determining the position of the keypoint of the to-be-detected object on the to-be-detected
object using the three-dimensional heatmap.
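By way of illustration, claims 9 and 10 may be read together as: concatenate the three features per vertex, map the spliced feature to a per-vertex probability for each keypoint (a three-dimensional heatmap over the mesh vertices), and take the position of the most probable vertex. The dimensions and the random placeholder inputs below are assumptions for this sketch.

    import torch
    import torch.nn as nn

    N, K = 5000, 68                    # vertices and keypoints (illustrative)
    vert_f = torch.randn(N, 64)        # vertex features
    glob_f = torch.randn(256)          # one global feature, broadcast to every vertex
    loc_f = torch.randn(N, 128)        # local features
    coords = torch.randn(N, 3)         # placeholder for the mesh's vertex coordinates

    spliced = torch.cat([vert_f, glob_f.expand(N, -1), loc_f], dim=1)  # claim 9: feature splicing
    head = nn.Linear(spliced.size(1), K)
    heatmap = torch.softmax(head(spliced), dim=0)  # claim 10: probability of each keypoint per vertex
    positions = coords[heatmap.argmax(dim=0)]      # (K, 3) keypoint positions on the object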
11. The method of any one of claims 1 to 10, wherein the method is applied to a three-dimensional
network model, the three-dimensional network model comprises at least a first feature
extraction layer, a second feature extraction layer, a third feature extraction layer,
and an output layer, and
wherein performing feature extraction on the vertices of the three-dimensional mesh,
to obtain the vertex feature of the three-dimensional mesh comprises:
performing feature extraction on the vertices of the three-dimensional mesh via the
first feature extraction layer, to obtain the vertex feature of the three-dimensional
mesh;
wherein performing global feature extraction on the to-be-detected object using the
vertex feature, to obtain the global feature of the to-be-detected object, and performing
local feature extraction on the to-be-detected object using the vertex feature and
the connection relationship between the vertices, to obtain the local feature of the
to-be-detected object comprises:
performing global feature extraction on the to-be-detected object using the vertex
feature via the second feature extraction layer, to obtain the global feature of the
to-be-detected object, and performing local feature extraction on the to-be-detected
object using the vertex feature and the connection relationship between the vertices
via the third feature extraction layer, to obtain the local feature of the to-be-detected
object; and
wherein performing detection on the keypoint of the to-be-detected object using the
vertex feature, the global feature, and the local feature, to obtain the position
of the keypoint of the to-be-detected object on the to-be-detected object comprises:
performing detection on the keypoint of the to-be-detected object via the output layer
using the vertex feature, the global feature, and the local feature, to obtain the
position of the keypoint of the to-be-detected object on the to-be-detected object.
12. The method of claim 11, wherein the three-dimensional network model further comprises a first feature splicing layer, a second feature splicing layer, and a fourth feature extraction layer, and
wherein performing detection on the keypoint of the to-be-detected object via the
output layer using the vertex feature, the global feature, and the local feature,
to obtain the position of the keypoint of the to-be-detected object on the to-be-detected
object comprises:
performing feature splicing on the vertex feature, the global feature, and the local feature via the first feature splicing layer, to obtain a spliced feature of the to-be-detected object;
performing local feature extraction on the to-be-detected object using the spliced
feature via the fourth feature extraction layer, to obtain a target local feature
of the to-be-detected object;
performing feature splicing on the spliced feature, the global feature, and the target
local feature via the second feature splicing layer, to obtain a target spliced feature
of the to-be-detected object; and
performing detection on the keypoint of the to-be-detected object using the target
spliced feature via the output layer, to obtain the position of the keypoint of the
to-be-detected object on the to-be-detected object.
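The layer composition recited in claims 11 and 12 might be assembled as sketched below, reusing the EdgeOnlyGATLayer sketched after the beneficial effects above for the third and fourth feature extraction layers. All layer types and sizes are assumptions; the claims do not limit the layers to these choices.

    import torch
    import torch.nn as nn
    # EdgeOnlyGATLayer is assumed to be defined as in the earlier sketch.

    class KeypointNet(nn.Module):
        def __init__(self, in_dim=3, num_keypoints=68):
            super().__init__()
            self.first = nn.Linear(in_dim, 64)                          # first feature extraction layer
            self.second = nn.Sequential(nn.Linear(64, 256), nn.ReLU())  # second (global) layer
            self.third = EdgeOnlyGATLayer(64, 128)                      # third (local) layer
            self.fourth = EdgeOnlyGATLayer(64 + 256 + 128, 128)         # fourth feature extraction layer
            self.out = nn.Linear(64 + 256 + 128 + 256 + 128, num_keypoints)  # output layer

        def forward(self, xyz, edge_index):
            n = xyz.size(0)
            vert = torch.relu(self.first(xyz))                        # vertex feature
            glob = self.second(vert).max(dim=0).values.expand(n, -1)  # global feature
            loc = self.third(vert, edge_index)                        # local feature
            spliced = torch.cat([vert, glob, loc], dim=1)             # first feature splicing layer
            target_loc = self.fourth(spliced, edge_index)             # target local feature
            target = torch.cat([spliced, glob, target_loc], dim=1)    # second feature splicing layer
            return torch.softmax(self.out(target), dim=0)             # per-vertex keypoint heatmap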
13. A method for training a three-dimensional network model, performed by an electronic
device, the three-dimensional network model comprising at least a first feature extraction
layer, a second feature extraction layer, a third feature extraction layer, and an
output layer, and the method comprising:
acquiring an object training sample carrying a label, wherein the label indicates
a real position of a keypoint of the object training sample;
obtaining a training three-dimensional mesh representing the object training sample,
and determining vertices of the training three-dimensional mesh and a connection relationship
between the vertices;
performing feature extraction on the vertices of the training three-dimensional mesh via the first feature extraction layer, to obtain a vertex feature of the training three-dimensional mesh;
performing global feature extraction on the object training sample using the vertex feature of the training three-dimensional mesh via the second feature extraction layer, to obtain a global feature of the object training sample, and performing local feature extraction on the object training sample using the vertex feature of the training three-dimensional mesh and the connection relationship between the vertices via the third feature extraction layer, to obtain a local feature of the object training sample;
performing detection on the keypoint of the object training sample via the output
layer using the vertex feature of the training three-dimensional mesh, the global
feature of the object training sample, and the local feature of the object training
sample, to obtain a position of the keypoint of the object training sample on the
object training sample; and
acquiring a difference between the position of the keypoint of the object training
sample and the label, and training the three-dimensional network model using the difference,
to obtain a target three-dimensional network model, wherein the target three-dimensional
network model is configured for detecting a keypoint of a to-be-detected object, to
obtain a position of the keypoint of the to-be-detected object on the to-be-detected
object.
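A minimal training loop in the sense of claim 13 is sketched below, assuming the KeypointNet of the previous sketch, labels given as the index of the annotated vertex for each keypoint, and a cross-entropy realization of the recited difference; the loss, the optimizer, and the hypothetical iterable train_samples are all assumptions.

    import torch

    model = KeypointNet()  # assumed model from the sketch after claim 12
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for xyz, edge_index, label_vertex_ids in train_samples:  # label_vertex_ids: (K,) annotated vertices
        heatmap = model(xyz, edge_index)  # (N, K) per-vertex probability for each keypoint
        # Difference between the predicted heatmap and the labeled keypoint vertices.
        picked = heatmap[label_vertex_ids, torch.arange(heatmap.size(1))]
        loss = -torch.log(picked + 1e-9).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()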
14. An apparatus for detecting a keypoint of a to-be-detected object, comprising:
an obtaining module, configured to: obtain a three-dimensional mesh representing the to-be-detected object, and determine vertices of the three-dimensional mesh and a connection relationship between the vertices;
a first feature extraction module, configured to perform feature extraction on the
vertices of the three-dimensional mesh, to obtain a vertex feature of the three-dimensional
mesh;
a second feature extraction module, configured to: perform global feature extraction
on the to-be-detected object using the vertex feature, to obtain a global feature
of the to-be-detected object, and perform local feature extraction on the to-be-detected
object using the vertex feature and the connection relationship between the vertices,
to obtain a local feature of the to-be-detected object; and
an output module, configured to perform detection on a keypoint of the to-be-detected
object using the vertex feature, the global feature, and the local feature, to obtain
a position of the keypoint of the to-be-detected object on the to-be-detected object.
15. An apparatus for training a three-dimensional network model, the three-dimensional
network model comprising at least a first feature extraction layer, a second feature
extraction layer, a third feature extraction layer, and an output layer, and the apparatus
comprising:
an acquiring module, configured to acquire an object training sample carrying a label,
wherein the label indicates a real position of a keypoint of the object training sample;
an obtaining module, configured to: obtain a training three-dimensional mesh representing
the object training sample, and determine vertices of the training three-dimensional
mesh and a connection relationship between the vertices;
a first feature extraction module, configured to perform feature extraction on the vertices of the training three-dimensional mesh via the first feature extraction layer, to obtain a vertex feature of the training three-dimensional mesh;
a second feature extraction module, configured to: perform global feature extraction on the object training sample using the vertex feature of the training three-dimensional mesh via the second feature extraction layer, to obtain a global feature of the object training sample, and perform local feature extraction on the object training sample using the vertex feature of the training three-dimensional mesh and the connection relationship between the vertices via the third feature extraction layer, to obtain a local feature of the object training sample;
an output module, configured to perform detection on the keypoint of the object training
sample via the output layer using the vertex feature of the training three-dimensional
mesh, the global feature of the object training sample, and the local feature of the
object training sample, to obtain a position of the keypoint of the object training
sample on the object training sample; and
an update module, configured to: acquire a difference between the position of the
keypoint of the object training sample and the label, and train the three-dimensional
network model using the difference, to obtain a target three-dimensional network model,
wherein the target three-dimensional network model is configured for detecting a keypoint
of a to-be-detected object, to obtain a position of the keypoint of the to-be-detected
object on the to-be-detected object.
16. An electronic device, comprising:
a memory, configured to store computer-executable instructions; and
a processor, configured to perform, when executing the computer-executable instructions
stored in the memory, the method for detecting a keypoint of a to-be-detected object
of any one of claims 1 to 12 or the method for training a three-dimensional network
model of claim 13.
17. A computer-readable storage medium having stored therein computer-executable instructions
that, when executed by a processor, cause the processor to perform the method for
detecting a keypoint of a to-be-detected object of any one of claims 1 to 12 or the
method for training a three-dimensional network model of claim 13.
18. A computer program product, comprising a computer program or computer-executable instructions,
the computer program or the computer-executable instructions implementing, when executed
by a processor, the method for detecting a keypoint of a to-be-detected object of
any one of claims 1 to 12 or the method for training a three-dimensional network model
of claim 13.