RELATED APPLICATION
FIELD OF THE TECHNOLOGY
[0002] This application relates to the field of artificial intelligence, and in particular,
to a video super-resolution technology.
BACKGROUND OF THE DISCLOSURE
[0003] With the development of image processing technologies, an image super-resolution
processing technology for reconstructing an observed low-resolution image into a corresponding
high-resolution image to improve the resolution of the original image has emerged.
[0004] In the related art, a generative network in a generative adversarial network is used
to generate high-resolution images with richer details. However, because the generative
adversarial network generates details randomly, the same objects in adjacent frames are
not completely aligned after super-resolution, or details added to an (i+1)th image frame
cannot be aligned with details added to an ith image frame. This results in visually
perceptible discontinuity in the video. In other words, an issue related to stability of
time series continuity is prone to occur, for example, an inter-frame jump.
SUMMARY
[0005] This application provides a video super-resolution processing method and apparatus,
a device, and a medium, so as to resolve an inter-frame jump problem during video
super-resolution by introducing a time series feature of different image frames. The
technical solutions are as follows:
[0006] According to one aspect of this application, a video super-resolution processing
method is provided, the method being performed by a computer device, and the method
including:
obtaining an (i+1)th image frame from a video, and obtaining an image feature of an ith image frame in the video and a long time series feature before the ith image frame, the image feature of the ith image frame and the long time series feature before the ith image frame being cached during super-resolution processing of the ith image frame and an (i-1)th image frame, respectively;
performing super-resolution prediction on the image feature of the ith image frame, the long time series feature before the ith image frame, and the (i+1)th image frame by using a generative network, to obtain a super-resolution image of
the (i+1)th image frame, an image feature of the (i+1)th image frame, and a long time series feature before the (i+2)th image frame; and
caching the image feature of the (i+1)th image frame and the long time series feature before the (i+2)th image frame;
i being a positive integer greater than 2.
[0007] According to another aspect of this application, a video super-resolution processing
apparatus is provided, the apparatus including:
an obtaining module, configured to: obtain an (i+1)th image frame from a video, and obtain an image feature of an ith image frame in the video and a long time series feature before the ith image frame, the image feature of the ith image frame and the long time series feature before the ith image frame being cached by a cache module during super-resolution processing of
the ith image frame and an (i-1)th image frame, respectively;
a generative network module, configured to perform super-resolution prediction on
the image feature of the ith image frame, the long time series feature before the ith image frame, and the (i+1)th image frame, to obtain a super-resolution image of the (i+1)th image frame, an image feature of the (i+1)th image frame, and a long time series feature before the (i+2)th image frame; and
the cache module, configured to cache the image feature of the (i+1)th image frame and the long time series feature before the (i+2)th image frame;
i being a positive integer greater than 2.
[0008] According to another aspect of this application, a computer device is provided, the
computer device including a processor and a memory, the memory storing at least one
instruction, at least one program, a code set, or an instruction set, and the at least
one instruction, the at least one program, the code set, or the instruction set being
loaded and executed by the processor to implement the video super-resolution processing
method according to the foregoing aspect.
[0009] According to another aspect of this application, a computer-readable storage medium
is provided, the computer-readable storage medium storing at least one instruction,
at least one program, a code set, or an instruction set, and the at least one instruction,
the at least one program, the code set, or the instruction set being loaded and executed
by a processor to implement the video super-resolution processing method according
to the foregoing aspect.
[0010] According to another aspect of this application, a computer program product is provided,
the computer program product including at least one program, the at least one program
being stored in a computer-readable storage medium, a processor of a computer device
reading the at least one program from the computer-readable storage medium, and the
processor executing the at least one program to cause the computer device to perform
the video super-resolution processing method according to the foregoing aspect.
[0011] The technical solutions provided in this application have at least the following
beneficial effects:
After super-resolution processing is performed on the ith image frame, if i is a positive
integer greater than 2, the image feature of the ith image frame and the long time series
feature before the ith image frame can be cached. The image feature of the ith image frame
can indicate information in a previous image frame, and the long time series feature before
the ith image frame can indicate time series information between several previous image frames.
Therefore, when super-resolution processing is performed on the (i+1)th image frame in the
video, to ensure time series stability between adjacent frames, the image feature of the ith
image frame and the long time series feature before the ith image frame can be obtained.
Thus, super-resolution prediction is performed on the image feature of the ith image frame,
the long time series feature before the ith image frame, and the (i+1)th image frame by using
the generative network. With reference to the image feature of the current image frame, the
image feature of the previous image frame, and the long time series feature before the previous
image frame, the super-resolution image of the (i+1)th image frame, the image feature of the
(i+1)th image frame, and the long time series feature before the (i+2)th image frame are
obtained. After that, the image feature of the (i+1)th image frame and the long time series
feature before the (i+2)th image frame continue to be cached, so as to be used for
super-resolution processing of a subsequent frame. Through caching of the long time series
feature of a previous frame, reference is made to the time series feature of the previous
image frame when super-resolution is performed on the (i+1)th image frame. In other words,
when super-resolution processing is performed to obtain details of the (i+1)th image frame,
reference is made to details of the previous image frames, so as to align details added to
the (i+1)th image frame with details added to the ith image frame. In this way, time series
stability between the adjacent frames can be ensured, so that no inter-frame jump occurs.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] To describe technical solutions of embodiments of this application more clearly,
the following briefly describes accompanying drawings required for describing the
embodiments. Apparently, the accompanying drawings in the following description show
merely some embodiments of this application, and a person of ordinary skill in the
art may still derive other drawings from these accompanying drawings without creative
efforts.
FIG. 1 is a diagram of a structure of a video super-resolution processing method according
to an exemplary embodiment of this application.
FIG. 2 is a flowchart of a video super-resolution processing method according to an
exemplary embodiment of this application.
FIG. 3 is a flowchart of a video super-resolution processing method according to an
exemplary embodiment of this application.
FIG. 4 is a flowchart of a feature extraction network according to an exemplary embodiment
of this application.
FIG. 5 is a flowchart of a feature fusion network according to an exemplary embodiment
of this application.
FIG. 6 is a flowchart of a feature fusion network according to an exemplary embodiment
of this application.
FIG. 7 is a flowchart of a video super-resolution processing method according to an
exemplary embodiment of this application.
FIG. 8 is a diagram of a structure of training in a video super-resolution processing
method according to an exemplary embodiment of this application.
FIG. 9 is a flowchart of training in a video super-resolution processing method according
to an exemplary embodiment of this application.
FIG. 10 is a flowchart of training in a video super-resolution processing method according
to an exemplary embodiment of this application.
FIG. 11 is a diagram of a structure of training in a video super-resolution processing
method according to an exemplary embodiment of this application.
FIG. 12 is a block diagram of a video super-resolution processing apparatus according
to an exemplary embodiment of this application.
FIG. 13 is a block diagram of a video super-resolution processing apparatus according
to an exemplary embodiment of this application.
FIG. 14 is a block diagram of a computer device according to an exemplary embodiment
of this application.
DESCRIPTION OF EMBODIMENTS
[0013] To make objectives, technical solutions, and advantages of this application clearer,
the following further describes implementations of this application in detail with
reference to accompanying drawings.
[0014] First, terms used in embodiments of this application are briefly described.
[0015] Neural network: an algorithmic mathematical model that imitates behavioral characteristics
of animal neural networks and performs distributed parallel information processing.
Depending on the complexity of the system, this type of network adjusts the connection
relationships between a large number of internal nodes, so as to achieve an objective of
processing information.
[0016] Generative adversarial network (GAN): includes a generative network and a discriminative
network. The generative network is configured to generate a super-resolution image,
and the discriminative network is configured to determine whether an image conforms
to distribution of a real image.
[0017] Image features: features for describing corresponding properties of an image or an
image area, including a color feature, a texture feature, a shape feature, and a spatial
relationship feature. Image features are usually extracted by a feature extraction
layer in a neural network. Image features may be represented by using vectors.
[0018] Time series: also termed a temporal sequence, which is a sequence of data
points arranged in chronological order. Usually, the time interval of a time series
is a constant value, for example, 1 second or 1 minute. A data point in an image
time series may be understood as one frame.
[0019] Resolution: may be subdivided into a display resolution, an image resolution, a print
resolution, a scan resolution, and the like. A resolution determines fineness of image
details. Usually, a higher resolution of an image indicates a larger quantity of pixels
included and a clearer image.
[0020] In the related art, resolutions of displays gradually increase, but a 4K or 8K resolution
cannot be reached when many videos are shot. A super-resolution technology can resolve
the problem by increasing video resolutions to adapt to corresponding displays. At
present, a super-resolution technology based on a deep neural network has a good effect
in image processing. However, details and textures in a result of a neural network
based on a pixel loss function are smooth, and a visual effect is poor. Compared with
the neural network based on a pixel loss function, a generative adversarial network
can generate a high-resolution image with more details and richer textures from each
image frame. However, because a generative adversarial network generates details randomly,
applying a super-resolution technology based on a generative adversarial network to
videos may result in a severe issue related to stability of time series continuity,
for example, an inter-frame jump. An inter-frame jump means that the same object in
adjacent frames is not completely aligned after super-resolution, or added details
and textures are not aligned, appear in spurts, and drift leftward or rightward, leading
to visual discontinuity and dissonance.
[0021] Aiming at the foregoing problem, this application provides a video super-resolution
processing method. The following describes the method.
[0022] FIG. 1 is a diagram of a structure of a video super-resolution processing method
according to an exemplary embodiment of this application.
[0023] An original video 110 is inputted into a generative adversarial network 100, and
a super-resolution video 120 corresponding to the original video 110 is outputted.
The original video 110 has a first resolution, and the super-resolution video 120
has a second resolution. The first resolution is lower than the second resolution.
[0024] The original video 110 includes several image frames. Performing super-resolution
processing on the original video 110 means performing super-resolution processing on
the several image frames in the original video 110 to obtain corresponding super-resolution
images, and then obtaining the super-resolution video 120 that includes the
super-resolution images. As shown in FIG. 1, the image frames in the original
video 110 are inputted into the generative adversarial network 100 according to a
time series of the video, to obtain a super-resolution image of the first image frame,
a super-resolution image of the second image frame, ..., and a super-resolution image
of an ith image frame. The next image frame to be inputted from the original video 110 is
the (i+1)th image frame.
[0025] In some embodiments, the generative adversarial network 100 includes a generative
network 101. The generative network 101 further includes a feature extraction network
1011, a feature fusion network 1012, and an upsampling network 1013. A low-resolution
image (the (i+1)th image frame) in the original video 110 is inputted into the generative
network 101, and an image feature of the (i+1)th image frame is obtained by using the
feature extraction network 1011. The image feature of the (i+1)th image frame, and the
image feature of the ith image frame and the long time series feature before the ith image
frame that are cached in a cache 102, are inputted into the feature fusion network 1012,
to obtain a long time series feature before the (i+2)th image frame. The image feature of
the (i+1)th image frame and the long time series feature before the (i+2)th image frame
are inputted into the upsampling network 1013 for prediction, and a super-resolution image
of the (i+1)th image frame is obtained.
[0026] The image feature of the ith image frame and the long time series feature before the
ith image frame are cached in the cache 102 when the generative network 101 processes the
ith image frame in the original video 110. When the generative network 101 processes the
(i+1)th image frame in the original video 110, the cache 102 provides the image feature of
the ith image frame and the long time series feature before the ith image frame to the
feature fusion network 1012, and in addition, continues to cache the image feature of the
(i+1)th image frame and the long time series feature before the (i+2)th image frame. This
implements cyclic use of image features and long time series features.
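The cyclic use of the cache can be illustrated with the following minimal sketch in Python,
assuming hypothetical feature_extraction, feature_fusion, and upsampling callables that stand
in for the feature extraction network 1011, the feature fusion network 1012, and the upsampling
network 1013; it shows only the data flow described above and is not the actual implementation.

def super_resolve_video(frames, feature_extraction, feature_fusion, upsampling):
    # Illustrative sketch of the cyclic cache (all callables are assumed interfaces).
    cached_image_feature = None      # image feature of the previous frame (cache 102)
    cached_long_ts_feature = None    # long time series feature before the previous frame
    outputs = []
    for frame in frames:
        image_feature = feature_extraction(frame)
        if cached_image_feature is None:
            # Cold start (first frame): no history is available yet; this is a
            # simplification of the first-frame handling described later.
            fused = image_feature
        else:
            # Fuse the current image feature with the cached image feature of the
            # previous frame and the cached long time series feature.
            fused = feature_fusion(image_feature, cached_image_feature, cached_long_ts_feature)
        sr_image = upsampling(image_feature, fused)
        outputs.append(sr_image)
        # Cache for the next frame: the current image feature and the newly fused
        # long time series feature.
        cached_image_feature = image_feature
        cached_long_ts_feature = fused
    return outputs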
[0027] FIG. 2 is a flowchart of a video super-resolution processing method according to
an exemplary embodiment of this application. The method is performed by a computer
device, and includes the following operations:
Operation 220: Obtain an (i+1)th image frame from a video, and obtain an image feature of an
ith image frame in the video and a long time series feature before the ith image frame, the
image feature of the ith image frame and the long time series feature before the ith image
frame being cached during super-resolution processing of the ith image frame.
[0028] The (i+1)th image frame is obtained from the original video that needs to be processed.
The (i+1)th image frame is an image frame that is in the original video and on which
super-resolution processing needs to be performed currently, and is a low-resolution image
frame. i is a positive integer greater than 2. In other words, a long time series feature may
be generated starting from the third image frame. Thus, corresponding image features and long
time series features are cached. Starting from the fourth image frame, super-resolution
processing may be performed by using the method provided in this embodiment.
[0029] When super-resolution operations are performed on the video, image frames in the
video are sequentially processed according to a time series (or a temporal sequence)
of the video. For example, the first image frame is processed first, then the second
image frame is processed, ..., and the ith image frame is processed. Usually, in a video
super-resolution operation method, an ith image frame is directly inputted into a generative
network, to obtain a super-resolution image of the ith image frame. With this method, when
super-resolution prediction is performed on an image frame, information in the image frame
is prone to loss. Therefore, image features and long time series features are introduced in
this embodiment of this application.
[0030] The long time series feature includes image features of a plurality of image frames
that are accumulated in a long time series according to an order in which super-resolution
operations are performed on image frames in a video. The long time series feature
includes information in several previous image frames in a super-resolution operation
process. The long time series is a time series with a time length greater than a threshold.
[0031] When super-resolution operations are performed on the first image frame and the second
image frame in the original video, a long time series feature in this case may be
considered as a null value or a preset value. This is not discussed in this embodiment.
[0032] In this embodiment, starting from the third image frame in the original video, the
long time series feature before the image frame is generated when super-resolution
processing is performed. Therefore, during super-resolution processing of the ith image
frame, the image feature of the ith image frame in the video and the long time series
feature before the ith image frame may be cached. The ith image frame is an image frame
that is in the video and on which super-resolution processing has been performed.
[0033] In some embodiments, the long time series feature before the ith image frame may be
cumulative features of all images from the first image frame to an (i-1)th image frame in
the original video according to a playing time series (or a playing order) of the original
video.
[0034] In some embodiments, the long time series feature before the ith image frame may be
cumulative features of the last several image frames (for example, a predetermined number of
image frames) before the ith image frame. For example, it may be cumulative features of the
last three image frames before the ith image frame, including an image feature of an (i-3)th
image frame, an image feature of an (i-2)th image frame, and an image feature of the (i-1)th
image frame; or it may be cumulative features of the last five image frames before the ith
image frame, including an image feature of an (i-5)th image frame, an image feature of an
(i-4)th image frame, an image feature of an (i-3)th image frame, an image feature of an
(i-2)th image frame, and an image feature of the (i-1)th image frame.
[0035] In some embodiments, the long time series feature before the ith image frame may be
determined based on a caching capability of a cache. If the caching capability of the cache
is larger, more information about the long time series feature before the ith image frame is
retained. If the caching capability of the cache is smaller, less information about the long
time series feature before the ith image frame is retained.
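If the history were stored as an explicit list of per-frame features (one possible reading of
the cumulative features described above), the cache capacity could be enforced with a
fixed-length buffer, as in the hedged sketch below; the embodiments themselves fuse the history
into a single long time series feature rather than storing per-frame features separately.

from collections import deque

class FeatureHistory:
    # Hypothetical bounded history: keeps the image features of at most
    # `capacity` previous frames and evicts the oldest one when full.
    def __init__(self, capacity=3):
        self.buffer = deque(maxlen=capacity)

    def push(self, image_feature):
        self.buffer.append(image_feature)

    def features(self):
        # With capacity 3, these are the features of the (i-3)th, (i-2)th,
        # and (i-1)th image frames when the ith frame is being processed.
        return list(self.buffer)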
[0036] Operation 240: Perform super-resolution prediction on the image feature of the ith image
frame, the long time series feature before the ith image frame, and the (i+1)th image frame by
using a generative network, and output a super-resolution image of the (i+1)th image frame, an
image feature of the (i+1)th image frame, and a long time series feature before the (i+2)th
image frame.
[0037] In some embodiments, the image feature of the ith image frame and the long time series
feature before the ith image frame that are cached, and the (i+1)th image frame that is obtained
from the video and on which super-resolution processing needs to be performed currently, are
inputted into the generative network for super-resolution prediction. The super-resolution image
of the (i+1)th image frame, the image feature of the (i+1)th image frame, and the long time
series feature before the (i+2)th image frame are outputted by the generative network.
[0038] Operation 260: Cache the image feature of the (i+1)th image frame and the long time
series feature before the (i+2)th image frame.
[0039] After the image feature of the (i+1)th image frame and the long time series feature
before the (i+2)th image frame are obtained, the image feature of the (i+1)th image frame and
the long time series feature before the (i+2)th image frame may be cached, so as to be used
for super-resolution processing of a next image frame.
[0040] In conclusion, according to the method provided in this embodiment, after super-resolution
processing is performed on the ith image frame, if i is a positive integer greater than 2, the
image feature of the ith image frame and the long time series feature before the ith image frame
can be cached. The image feature of the ith image frame can indicate information in a previous
image frame, and the long time series feature before the ith image frame can indicate time series
information between several previous image frames. Therefore, when super-resolution processing is
performed on the (i+1)th image frame in the video, to ensure time series stability between
adjacent frames, the image feature of the ith image frame and the long time series feature before
the ith image frame can be obtained. Thus, super-resolution prediction is performed on the image
feature of the ith image frame, the long time series feature before the ith image frame, and the
(i+1)th image frame by using the generative network. With reference to the image feature of the
current image frame, the image feature of the previous image frame, and the long time series
feature before the previous image frame, the super-resolution image of the (i+1)th image frame,
the image feature of the (i+1)th image frame, and the long time series feature before the (i+2)th
image frame are obtained. After that, the image feature of the (i+1)th image frame and the long
time series feature before the (i+2)th image frame are cached, so as to be used for
super-resolution processing of a subsequent frame. Through caching of the long time series
features of previous frames, reference is made to the time series features of the previous image
frames when super-resolution is performed on the (i+1)th image frame. In other words, when
super-resolution processing is performed to obtain details of the (i+1)th image frame, reference
is made to details of the previous image frames, so as to align details added to the (i+1)th
image frame with details added to the ith image frame. In this way, time series stability between
the adjacent frames can be ensured, so that no inter-frame jump occurs.
[0041] The generative network is configured to generate a super-resolution image of an image
frame. As shown in FIG. 1, the generative network 101 includes the feature extraction network
1011, the feature fusion network 1012, and the upsampling network 1013. The (i+1)th image frame
in the original video 110 is inputted into the generative network 101; the image feature of the
(i+1)th image frame is obtained by using the feature extraction network 1011 in the generative
network 101; the image feature of the ith image frame, the image feature of the (i+1)th image
frame, and the long time series feature before the ith image frame are fused into the long time
series feature before the (i+2)th image frame by using the feature fusion network 1012 in the
generative network; and the super-resolution image of the (i+1)th image frame is outputted
through prediction performed by the upsampling network 1013 in the generative network 101.
[0042] In some embodiments, operation 240 further includes the following sub-operations:
Operation 241: Perform feature extraction on the (i+1)th image frame by using the feature
extraction network, to obtain the image feature of the (i+1)th image frame.
[0043] In some embodiments, the feature extraction network is configured to output the image
feature of the (i+1)th image frame based on an input of the (i+1)th image frame. By using the
feature extraction network, an image corresponding to a low-resolution image frame in the
original video is mapped to an eigenspace, and an image feature of the low-resolution image
frame is extracted.
[0044] In a possible implementation, a convolutional network may be used as the feature
extraction network, and a plurality of convolution kernels of different sizes are
used to process an image.
[0045] In some embodiments, the convolutional network includes a first convolution kernel,
a second convolution kernel, a third convolution kernel, a fourth convolution kernel,
and a fifth convolution kernel. The first convolution kernel, the second convolution
kernel, and the fourth convolution kernel are 3*3 convolution kernels, and the third
convolution kernel and the fifth convolution kernel are 1*1 convolution kernels. An
original image is inputted into the first convolution kernel, the second convolution
kernel, and the third convolution kernel. An output end of the first convolution kernel
is connected to an input end of the fourth convolution kernel, and an output end of
the second convolution kernel, an output end of the third convolution kernel, and
an output end of the fourth convolution kernel are connected to an input end of the
fifth convolution kernel. The fifth convolution kernel outputs the image feature of the
original image.
[0046] The (i+1)th image frame is inputted into the convolutional network, and a first
convolution result is obtained after convolution by the first convolution kernel.
[0047] A second convolution result is obtained after convolution by the second convolution
kernel.
[0048] A third convolution result is obtained after convolution by the third convolution
kernel.
[0049] The first convolution result is inputted into the fourth convolution kernel, and
a fourth convolution result is obtained after convolution by the fourth convolution
kernel.
[0050] The second convolution result, the third convolution result, and the fourth convolution
result are inputted into the fifth convolution kernel, and the image feature of the (i+1)th
image frame is obtained after convolution by the fifth convolution kernel.
[0051] For example, as shown in FIG. 4, the (i+1)th image frame is inputted into the feature
extraction network 1011, and the image feature of the (i+1)th image frame can be outputted
correspondingly. The (i+1)th image frame is inputted into the feature extraction network 1011;
after the (i+1)th image frame moves through convolution kernels of different sizes, the image
feature of the (i+1)th image frame is outputted. A convolutional network 10121 is used as the
feature extraction network 1011. The (i+1)th image frame is inputted into the feature extraction
network 1011. After convolution by a first convolution kernel (3*3), a second convolution kernel
(3*3), and a third convolution kernel (1*1) that are parallel, the output of the first
convolution kernel (3*3) moves through a concatenated fourth convolution kernel (3*3), and the
output of the second convolution kernel (3*3), the output of the third convolution kernel (1*1),
and the output of the fourth convolution kernel (3*3) are inputted together into a fifth
convolution kernel (1*1). The image feature of the (i+1)th image frame is outputted.
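A sketch of this five-kernel branch structure, written with PyTorch for illustration, is given
below. The channel counts and the channel-wise concatenation before the fifth convolution kernel
are assumptions; the source specifies only the kernel sizes and the connections between kernels.

import torch
import torch.nn as nn

class FeatureExtraction(nn.Module):
    # Sketch of the five-kernel extraction branch (channel sizes are assumed).
    def __init__(self, in_channels=3, feat_channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, feat_channels, 3, padding=1)    # 3*3
        self.conv2 = nn.Conv2d(in_channels, feat_channels, 3, padding=1)    # 3*3
        self.conv3 = nn.Conv2d(in_channels, feat_channels, 1)               # 1*1
        self.conv4 = nn.Conv2d(feat_channels, feat_channels, 3, padding=1)  # 3*3, after conv1
        # Outputs of conv2, conv3, and conv4 are assumed to be concatenated
        # along the channel dimension before the final 1*1 kernel.
        self.conv5 = nn.Conv2d(3 * feat_channels, feat_channels, 1)         # 1*1

    def forward(self, frame):
        r1 = self.conv1(frame)
        r2 = self.conv2(frame)
        r3 = self.conv3(frame)
        r4 = self.conv4(r1)
        return self.conv5(torch.cat([r2, r3, r4], dim=1))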
[0052] In conclusion, according to the method provided in this embodiment, feature extraction
is performed on an image by using the feature extraction network, to obtain an image
feature of the corresponding image. The convolutional network is used and the plurality
of convolution kernels of different sizes are used, to fully extract the image feature
and retain more information in the image, so that a super-resolution prediction result
is more accurate.
[0053] Operation 242: Fuse the image feature of the ith image frame, the long time series
feature before the ith image frame, and the image feature of the (i+1)th image frame by using
the feature fusion network, to obtain the long time series feature before the (i+2)th image
frame.
[0054] In some embodiments, the feature fusion network is configured to perform feature fusion
on the image feature of the ith image frame, the long time series feature before the ith image
frame, and the image feature of the (i+1)th image frame, to output the long time series feature
before the (i+2)th image frame.
[0055] The feature fusion network is mainly configured to: align and extract features, concatenate
features that need to be fused, cross-mix, by means of channel shuffle, at least two
sets of features that need to be fused, and compress and extract cross-mixed features.
[0056] For example, as shown in FIG. 5, channel shuffle is performed on the image features of
the (i+1)th image frame and the ith image frame to obtain a mixed feature, and the mixed
feature is compressed and extracted, for example, by means of a deformable convolutional
network, to obtain a fused time series feature.
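Channel shuffle itself can be implemented with a reshape and transpose, as in the sketch below
(tensor shapes are illustrative only); the compression step that follows, for which the source
mentions a deformable convolutional network as one option, is omitted here.

import torch

def channel_shuffle(x, groups):
    # Cross-mix the channels of concatenated features; `groups` is the number
    # of feature sets that were concatenated.
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

# Example: mix the image features of the (i+1)th and ith image frames.
feat_i1 = torch.randn(1, 64, 32, 32)
feat_i = torch.randn(1, 64, 32, 32)
mixed_feature = channel_shuffle(torch.cat([feat_i1, feat_i], dim=1), groups=2)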
[0057] According to the method provided in this embodiment, image features are further fused
by using the feature fusion network, to obtain the long time series feature. This
increases a quantity of time series information during video image processing, so
that information in previous image frames in a video time series can be fully utilized
when a subsequent image is processed.
[0058] In some embodiments, the feature fusion network is a multi-phase fusion network, and
includes at least a first feature fusion layer and a second feature fusion layer. The first
feature fusion layer is configured to output the fused time series feature based on an input
of the image feature of the (i+1)th image frame and the image feature of the ith image frame.
The second feature fusion layer is configured to output the long time series feature before
the (i+2)th image frame based on an input of the fused time series feature obtained in the
first phase and the long time series feature before the ith image frame.
[0059] For example, as shown in FIG. 6, feature fusion is performed on the image feature of
the ith image frame, the long time series feature before the ith image frame, and the image
feature of the (i+1)th image frame by using the feature fusion network 1012. The image feature
of the image frame on which super-resolution processing needs to be performed currently and
the image feature of the previous image frame (the image feature of the (i+1)th image frame
and the image feature of the ith image frame) are inputted into a first feature fusion layer
10121. Then, the long time series feature before the previous image frame (the long time
series feature before the ith image frame) and an output of the first feature fusion layer
10121 are inputted together into a second feature fusion layer 10122, to output the long time
series feature before the (i+2)th image frame.
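A minimal two-phase fusion sketch is shown below; each fusion layer is modeled here as
channel-wise concatenation followed by a 3*3 convolution, which is an assumption, since the
source does not specify the internals of the fusion layers.

import torch
import torch.nn as nn

class TwoPhaseFusion(nn.Module):
    # Sketch of the two fusion phases described above (layer internals assumed).
    def __init__(self, feat_channels=64):
        super().__init__()
        # Phase 1: fuse the image features of the (i+1)th and ith image frames.
        self.fuse1 = nn.Conv2d(2 * feat_channels, feat_channels, 3, padding=1)
        # Phase 2: fuse the phase-1 result with the long time series feature
        # before the ith image frame.
        self.fuse2 = nn.Conv2d(2 * feat_channels, feat_channels, 3, padding=1)

    def forward(self, feat_i1, feat_i, long_ts_before_i):
        fused = self.fuse1(torch.cat([feat_i1, feat_i], dim=1))
        return self.fuse2(torch.cat([fused, long_ts_before_i], dim=1))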
[0060] According to the method provided in this embodiment, image features are further fused
by using two phases of fusion layers. This can further prevent introduction of an
artifact feature while improving an effect of feature fusion.
[0061] Operation 243: Perform prediction on the image feature of the (i+1)th image frame and
the long time series feature before the (i+2)th image frame by using the upsampling network,
to obtain the super-resolution image of the (i+1)th image frame.
[0062] In some embodiments, the upsampling network is configured to output the super-resolution
image of the (i+1)th image frame based on an input of the image feature of the (i+1)th image
frame and the long time series feature before the (i+2)th image frame.
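One way to realize such an upsampling head is a sub-pixel convolution (pixel shuffle), as in
the sketch below; the choice of operator, the scale factor, and the channel counts are
assumptions, since the source does not specify them.

import torch
import torch.nn as nn

class Upsampling(nn.Module):
    # Sketch of the upsampling prediction head (sub-pixel convolution assumed).
    def __init__(self, feat_channels=64, scale=4, out_channels=3):
        super().__init__()
        self.body = nn.Sequential(
            # The (i+1)th image feature and the long time series feature before
            # the (i+2)th frame are concatenated along the channel dimension.
            nn.Conv2d(2 * feat_channels, out_channels * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),  # rearrange channels into a higher-resolution image
        )

    def forward(self, feat_i1, long_ts_before_i2):
        return self.body(torch.cat([feat_i1, long_ts_before_i2], dim=1))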
[0063] Because super-resolution processing of the first image frame and super-resolution
processing of the second image frame are slightly different from the foregoing operations,
this application provides the following processing operations:
Operation 310: Obtain the first image frame from the video.
[0064] The first image frame is obtained from the original video. It is herein noted that
the terms "first image frame", "second image frame" and "third image frame" are used
to indicate a specific sequence of the image frames referred to in this embodiment;
the first image frame is an image frame at the very start of the video, the second
image frame immediately follows the first image frame, and the third image frame immediately
follows the second image frame in this embodiment.
[0065] Operation 320: Perform super-resolution prediction on the first image frame by using the generative
network, to obtain a super-resolution image of the first image frame and an image
feature of the first image frame.
[0066] In some embodiments, only the first image frame is inputted into the generative network.
The image feature of the first image frame is obtained by using the feature extraction
network in the generative network. The image feature of the first image frame is
inputted into the upsampling network, to obtain the super-resolution image of the
first image frame. The image feature of the first image frame is cached in the cache
for super-resolution prediction of a next image frame.
[0067] In some embodiments, long time series features of image frames in the original video
are preset. The first image frame is inputted into the generative network. The image
feature of the first image frame is obtained by using the feature extraction network
in the generative network. The preset long time series feature and the image feature
of the first image frame are inputted into the feature fusion network, to obtain a
fused time series feature. The image feature of the first image frame and the fused
time series feature are inputted into the upsampling network, to obtain the super-resolution
image of the first image frame. The image feature of the first image frame and the
fused time series feature are cached in the cache for super-resolution prediction
of a next image frame.
[0068] Operation 330: Obtain the second image frame from the video.
[0069] The second image frame is obtained from the original video.
[0070] Operation 340: Perform super-resolution prediction on the image feature of the first image frame
and the second image frame by using the generative network, to obtain a super-resolution
image of the second image frame and an image feature of the second image frame.
[0071] In some embodiments, the second image frame is inputted into the generative network.
The image feature of the second image frame is obtained by using the feature extraction
network in the generative network. The image feature of the first image frame that
is cached in the cache and the image feature of the second image frame are inputted
into the feature fusion network, to obtain a fused feature of the first image frame
and the second image frame. The image feature of the second image frame and the fused
feature are inputted into the upsampling network, to obtain the super-resolution image
of the second image frame.
[0072] In some embodiments, the long time series features of the image frames in the original
video are preset. The second image frame is inputted into the generative network.
The image feature of the second image frame is obtained by using the feature extraction
network in the generative network. The image feature of the second image frame, and
the image feature of the first image frame and the fused time series feature that
are cached in the cache are inputted into the feature fusion network, to obtain a
long time series feature before the second image frame. The image feature of the second
image frame and the long time series feature before the second image frame are inputted
into the upsampling network, to obtain the super-resolution image of the second image
frame.
[0073] Operation 350: Obtain the third image frame from the video.
[0074] The third image frame is obtained from the original video.
[0075] Operation 360: Perform super-resolution prediction on the image feature of the second image frame
and the third image frame by using the generative network, to obtain a super-resolution
image of the third image frame, an image feature of the third image frame, and a long
time series feature before a fourth image frame.
[0076] In some embodiments, the third image frame is inputted into the generative network.
The image feature of the third image frame is obtained by using the feature extraction
network in the generative network. The image feature of the second image frame that
is cached in the cache and the image feature of the third image frame are inputted
into the feature fusion network, to obtain a fused feature of the second image frame
and the third image frame. The image feature of the third image frame and the fused
feature are inputted into the upsampling network, to obtain the super-resolution image
of the third image frame.
[0077] In some embodiments, the long time series features of the image frames in the original
video are preset. The third image frame is inputted into the generative network. The
image feature of the third image frame is obtained by using the feature extraction
network in the generative network. The image feature of the third image frame, and
the image feature of the second image frame and the fused time series feature cached
in the cache are inputted into the feature fusion network, to obtain a long time series
feature before the fourth image frame. The image feature of the third image frame
and the long time series feature before the fourth image frame are inputted into the
upsampling network, to obtain the super-resolution image of the third image frame.
[0078] Operation 370: Cache the image feature of the third image frame and the long time series feature
before the fourth image frame.
[0079] Starting from the third image frame in the original video, real long time series
features may be generated. Therefore, the image feature of the third image frame and
the long time series feature before the fourth image frame may be cached, so as to
be used for super-resolution processing of a next image frame. After that, the foregoing
super-resolution prediction operation on the (i+1)th image frame may be performed.
[0080] A generative adversarial network usually includes a generative network and a discriminative
network. The generative network is configured to generate a super-resolution image,
and the discriminative network is configured to determine whether an image conforms
to distribution of a real image. The discriminative network needs to be trained while
the generative network is trained. The following embodiments mainly describe a method
for training the generative network.
[0081] FIG. 8 is a diagram of a structure of training in a video super-resolution method
according to an exemplary embodiment of this application.
[0082] In some embodiments, the training in the video super-resolution processing method
is mainly training specific to a generative network 101. Whether a super-resolution
image generated by the generative network 101 is accurate further needs to be determined
by a discriminative network 103 to obtain a discrimination result. Therefore, the
training in the video super-resolution processing method further requires training
of the discriminative network.
[0083] In some embodiments, a training process mainly includes: inputting a sample image
into the generative network 101, to obtain a super-resolution image of the sample
image through super-resolution prediction performed by the generative network 101;
inputting the sample image and the super-resolution image of the sample image into
the discriminative network 103 for discrimination, and outputting a discrimination
result; and training the generative network and the discriminative network alternately
based on the discrimination result and a loss function. The loss function includes
at least one of an inter-frame stability loss function, an adversarial loss function,
a perceptual loss function, and a pixel loss function.
[0084] FIG. 9 is a flowchart of training in a video super-resolution processing method according
to an exemplary embodiment of this application. This method includes the following
operations:
Operation 410: Cache an ith sample image frame and an (i+1)th sample image frame from a
sample video.
[0085] In some embodiments, a training set is given for training a generative network, and
the training set includes a sample video.
[0086] The ith sample image frame and the (i+1)th sample image frame are cached from the
sample video. The (i+1)th sample image frame is a current image, and the ith sample image
frame is a historical image in the sample video.
[0087] Operation 420: Predict a super-resolution image of the ith sample image frame and a
super-resolution image of the (i+1)th sample image frame by using the generative network.
[0088] In some embodiments, the generative network is configured to perform super-resolution
prediction on an image frame. An image frame on which super-resolution prediction
needs to be performed is inputted to obtain a super-resolution image of the corresponding
image frame. For example, the ith sample image frame and the (i+1)th sample image frame are
inputted, and the super-resolution image of the ith sample image frame and the super-resolution
image of the (i+1)th sample image frame are outputted through super-resolution prediction by
the generative network.
[0089] Operation 430: Discriminate between the super-resolution image of the (i+1)th sample
image frame and the (i+1)th sample image frame by using a discriminative network, to obtain a
discrimination result.
[0090] In some embodiments, the generative network is configured to predict a super-resolution
result of an image. The super-resolution result requires discrimination by the discriminative
network. For example, the super-resolution image of the (i+1)th sample image frame generated
by the generative network and the (i+1)th sample image frame are inputted into the
discriminative network, and the discriminative network determines whether the super-resolution
image is a real super-resolution image of the (i+1)th sample image frame or the
super-resolution image of the (i+1)th sample image frame generated by the generative network.
If the discrimination result outputted by the discriminator is true or 1, the super-resolution
image of the (i+1)th sample image frame generated by the generative network conforms to
distribution of the real super-resolution image of the (i+1)th sample image frame. If the
discrimination result outputted by the discriminator is false or 0, the super-resolution image
of the (i+1)th sample image frame generated by the generative network does not conform to
distribution of the real super-resolution image of the (i+1)th sample image frame.
[0091] Operation 440: Calculate an error loss between the sample image and the super-resolution image of
the sample image based on the discrimination result and a loss function.
[0092] In the related art, part of the reason for inter-frame jumps may be that the loss
functions used in existing super-resolution network training are mostly single-frame loss
functions that constrain the super-resolution result of each frame individually, with no
constraint on stability between adjacent frames. This results in inconsistent output results
between adjacent frames, a noticeable jump, and poor stability. Based on this, in a possible
implementation, in addition to several loss functions commonly used in a generative
adversarial network, for example, an adversarial loss function, this application further
provides an inter-frame stability loss function, which is configured for constraining a
change between adjacent image frames.
[0093] An inter-frame stability loss is a parameter that constrains stability of changes
between adjacent image frames in a video. The inter-frame stability loss mainly compares
a change between a super-resolution result of a current sample image frame and a super-resolution
result of a previous sample image frame, and a change between the two corresponding
sample images; and constrains the changes to be as close as possible or within a specific
threshold.
[0094] The corresponding error loss is calculated based on at least one loss function and
the result of discrimination that is between the sample image and the super-resolution
image of the sample image and that is outputted by the discriminative network.
[0095] Operation 450: Train the generative network and the discriminative network alternately based on
the error loss.
[0096] The error loss calculated based on the loss function is fed back to the generative
network and the discriminative network, to train the generative network and the discriminative
network alternately.
[0097] In some embodiments, training the generative network and the discriminative network
alternately includes fixing parameters of the generative network and training the
discriminative network; or fixing parameters of the discriminative network and training
the generative network; or training the generative network and the discriminative
network simultaneously.
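Alternating training can be expressed as toggling which network's parameters are updated in a
given step, as in the hedged sketch below; the optimizers, loss callables, and data loading are
placeholders, and the concrete loss terms are described in the following operations.

import torch

def train_alternately(generator, discriminator, data_loader, g_loss_fn, d_loss_fn,
                      num_epochs=1, lr=1e-4):
    # Placeholder optimizers; the source does not specify the optimizer.
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr)
    for _ in range(num_epochs):
        for lr_frames, hr_frames in data_loader:
            # Step 1: fix the generative network and train the discriminative network.
            with torch.no_grad():
                fake = generator(lr_frames)
            d_loss = d_loss_fn(discriminator(hr_frames), discriminator(fake))
            opt_d.zero_grad()
            d_loss.backward()
            opt_d.step()
            # Step 2: fix the discriminative network and train the generative network.
            fake = generator(lr_frames)
            g_loss = g_loss_fn(fake, hr_frames, discriminator(fake))
            opt_g.zero_grad()
            g_loss.backward()
            opt_g.step()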
[0098] To sum up, according to the method provided in this embodiment, the ith sample image
frame and the (i+1)th sample image frame in the sample video are cached, the super-resolution
images are predicted by using the generative network, discrimination is performed on the
super-resolution result by using the discriminative network, and in addition, the generative
network and the discriminative network are trained alternately with reference to the loss
function. Using a pixel-level loss function for constraining adjacent frames can stably
improve time series continuity of adjacent frames, so as to obtain a generative network
with more accurate generation results.
[0099] In some embodiments, after operation 430, training specific to the generative network
and training specific to the discriminative network may be included.
Training specific to the generative network:
[0100] As shown in FIG. 10, operation 440 further includes the following sub-operations:
Operation 441: Calculate an inter-frame stability loss between a first change and a second change
by using the inter-frame stability loss function.
[0101] Based on the ith sample image frame and the (i+1)th sample image frame, the generative
network performs super-resolution prediction to obtain the super-resolution image of the ith
sample image frame and the super-resolution image of the (i+1)th sample image frame. In this
case, the first change is a change between the ith sample image frame and the (i+1)th sample
image frame, and the second change is a change between the super-resolution image of the ith
sample image frame and the super-resolution image of the (i+1)th sample image frame.
[0102] In this embodiment of this application, to resolve an inter-frame jump problem, a
loss function used for the generative network during the training of the generative
network may be the inter-frame stability loss function. Therefore, based on the inter-frame
stability loss function, the inter-frame stability loss between the first change and
the second change is calculated, that is, a stability loss of a change between adjacent
frames of sample images and a stability loss of a change between super-resolution
images of the adjacent frames of sample images are calculated.
[0103] In conclusion, according to the method provided in this embodiment, the inter-frame
stability loss is used to constrain the generative network, and using the loss function
for constraining adjacent frames can stably improve time series continuity of adjacent
frames.
[0104] In some embodiments, optical flows are usually used for measuring changes between
adjacent image frames.
[0105] In some embodiments, a first optical flow between the ith sample image frame and the
(i+1)th sample image frame is calculated by using an optical flow network, a second optical
flow between the super-resolution image of the ith sample image frame and the super-resolution
image of the (i+1)th sample image frame is calculated by using the optical flow network, and
the inter-frame stability loss is calculated based on the first optical flow and the second
optical flow.
[0106] In some embodiments, a mean square error loss, also referred to as an L2 norm loss,
is used for calculating an average value of squared differences between actual values
and predicted values. For example, the inter-frame stability loss may be calculated as
(1/N) Σ_{i=1}^{N} ||F(g_{i+1}, g_i) - F(GT_{i+1}, GT_i)||^2, where i represents an ith frame,
N represents a maximum value of i, F( ) represents an optical flow, g_i represents an ith
sample image frame, g_{i+1} represents an (i+1)th sample image frame, F(g_{i+1}, g_i)
represents a first optical flow between the ith sample image frame and the (i+1)th sample
image frame, GT_i represents a super-resolution image of the ith sample image frame, GT_{i+1}
represents a super-resolution image of the (i+1)th sample image frame, and F(GT_{i+1}, GT_i)
represents a second optical flow between the super-resolution image of the ith sample image
frame and the super-resolution image of the (i+1)th sample image frame.
[0107] In some embodiments, a pre-trained optical flow network is used for calculating optical
flows, and the optical flow network is not optimized in a training process. The inter-frame
stability loss is calculated based on a mean square deviation between the first optical
flow and the second optical flow.
[0108] In some embodiments, the inter-frame stability loss is used for constraining the
changes between the adjacent image frames. Training the generative network based on
the inter-frame stability loss is mainly training the generative network based on
the changes between the adjacent image frames. The inter-frame stability loss is calculated
based on a difference between the first optical flow and the second optical flow.
The inter-frame stability loss is fed back to the generative network, to train the
generative network.
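Under the definitions above, the inter-frame stability loss can be computed as the mean square
error between the two optical flows, as in the sketch below; flow_net stands for a pre-trained,
frozen optical flow network, and the bilinear resizing used to align the two flows is an
assumption not stated in the source.

import torch
import torch.nn.functional as F

def inter_frame_stability_loss(flow_net, frame_i, frame_i1, sr_i, sr_i1):
    # First optical flow: between the adjacent low-resolution sample frames.
    with torch.no_grad():
        first_flow = flow_net(frame_i1, frame_i)
    # Second optical flow: between the corresponding super-resolution results
    # (gradients flow back to the generative network through sr_i and sr_i1).
    second_flow = flow_net(sr_i1, sr_i)
    if first_flow.shape[-2:] != second_flow.shape[-2:]:
        # Resolution alignment is not specified in the source; bilinear
        # resizing of the low-resolution flow is assumed for illustration.
        first_flow = F.interpolate(first_flow, size=second_flow.shape[-2:],
                                   mode="bilinear", align_corners=False)
    return F.mse_loss(second_flow, first_flow)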
[0109] For example, as shown in FIG. 11, a discriminative network 103 performs result discrimination
based on a sample image and a super-resolution image of the sample image, and outputs
a true (1) result or a false (0) result. The ith sample image frame and the (i+1)th sample
image frame are inputted into an optical flow network 104, to obtain the first optical flow.
The super-resolution image of the ith sample image frame and the super-resolution image of
the (i+1)th sample image frame are inputted into the optical flow network 104, to obtain the
second optical flow. The first optical flow and the second optical flow are substituted
into the inter-frame stability loss function, to calculate the inter-frame stability
loss.
[0110] Operation 442: Calculate a first error loss of the discrimination result based on the discrimination
result and an adversarial loss function.
[0111] The first error loss is an adversarial loss. The adversarial loss is a parameter
configured for adjusting an output result of the generative network and an output
result of the discriminative network, to make the output results tend to be consistent.
Based on the adversarial loss function, the discriminative network may be trained
in a process of training the generative network. Thus, the discriminative network
can determine a difference between a generation result of the generative network and
a real super-resolution image, and feed the difference back to the generative network,
so as to cyclically train the generative network and the discriminative network.
[0112] In some embodiments, the generative network is trained by using the adversarial loss
function. For example, D_adv_G = E_{x∼p(x)}[log(1 - D(I_g))], where D_adv_G represents a first
error loss, E_* represents an expected value of the function, x is a low-resolution image,
p(x) is distribution of the low-resolution image, D is a discriminative network, and I_g is a
super-resolution result inputted into the discriminative network, the super-resolution result
being generated by a generative network.
[0113] The first error loss between a super-resolution result predicted by the generative
network and a real sample image is calculated by using the adversarial loss function,
and the generative network is trained based on the discrimination result of the discriminative
network, so that the super-resolution result predicted by the generative network is
determined as the true (1) result by the discriminative network.
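A hedged sketch of this generator-side adversarial term is given below; the discriminative
network is assumed to output probabilities in (0, 1), and the small epsilon is added only for
numerical stability.

import torch

def adversarial_loss_g(discriminator, sr_result, eps=1e-8):
    # D_adv_G = E[log(1 - D(I_g))], where I_g is the super-resolution result
    # produced by the generative network; the expectation is taken as a mean
    # over the batch.
    d_out = discriminator(sr_result)
    return torch.log(1.0 - d_out + eps).mean()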
[0114] According to the method provided in this embodiment, optimization training of the
generative network is further implemented by using the adversarial loss function,
so that the generative network generates more real super-resolution results.
[0115] Operation 443: Calculate a second error loss between a feature of the (i+1)th sample
image frame and a feature of the super-resolution image of the (i+1)th sample image frame by
using a perceptual loss function.
[0116] The perceptual loss function constrains a super-resolution result and a sample result
in terms of eigenspace. A sample image and a corresponding super-resolution image
simultaneously move through a pre-trained convolutional neural network, such as a
visual geometry group (VGG) network, and corresponding features are generated, respectively.
A distance between the feature of the sample image and the feature of the corresponding
super-resolution image is constrained.
[0117] In some embodiments, the second error loss between the feature of the (i+1)th sample
image frame and the feature of the super-resolution image of the (i+1)th sample image frame
is calculated by using the perceptual loss function. The generative network is trained based
on the second error loss between the feature of the (i+1)th sample image frame and the feature
of the super-resolution image of the (i+1)th sample image frame.
[0118] In some embodiments, a mean square error loss function is used as the perceptual
loss function to calculate an average value of squared differences between image features
of real sample images and predicted super-resolution images, to train the generative
network. For example, D_perc = (1/N) Σ_{i=1}^{N} ||VGG(g_{i+1}) - VGG(GT_{i+1})||^2, where
D_perc represents a second error loss, i represents an ith frame, N represents a maximum
value of i, g_{i+1} is a super-resolution image of an (i+1)th sample image frame, GT_{i+1}
is the (i+1)th sample image frame, VGG(g_{i+1}) is an image feature of the super-resolution
image of the (i+1)th sample image frame, and VGG(GT_{i+1}) is an image feature of the (i+1)th
sample image frame.
[0119] According to the method provided in this embodiment, a sample image and a super-resolution
image are further constrained in terms of eigenspace by using the perceptual loss
function. In this way, a quantity of reference information of an image is larger,
and an effect of training is better.
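A sketch of the perceptual term is shown below; the use of VGG-19 and of a particular feature
layer is an assumption, since the source only mentions a pre-trained VGG network.

import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

# Frozen, pre-trained feature extractor (the layer choice is an assumption).
vgg_features = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features[:36].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)  # the VGG network is not optimized during training

def perceptual_loss(sr_image, sample_image):
    # Mean square error between the VGG features of the generated
    # super-resolution image and those of the sample image.
    return F.mse_loss(vgg_features(sr_image), vgg_features(sample_image))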
[0120] Operation 444: Calculate a third error loss between the super-resolution image of the
(i+1)th sample image frame and the (i+1)th sample image frame by using a pixel loss function.
[0121] The third error loss is a pixel loss. The pixel loss is a parameter configured for
supervising that a super-resolution image predicted by the generative network does
not deviate from an original low-resolution image. A large difference between a super-resolution
result and a sample image is prevented based on a pixel loss function.
[0122] In some embodiments, an error loss between the super-resolution image of the (i+1)
th sample image frame and the (i+1)
th sample image frame is calculated by using the pixel loss function. The generative
network is trained based on the error loss between the super-resolution image of the
(i+1)
th sample image frame and the (i+1)
th sample image frame.
[0123] In some embodiments, a mean square error loss function is used as the pixel loss function to calculate an average value of squared differences between the real sample images and the predicted super-resolution images of the sample images, to train the generative network. For example, the third error loss is calculated as (1/N) · Σ_{i=1}^{N} (g_{i+1} - GT_{i+1})^2, where i represents an ith frame, N represents a maximum value of i, g_{i+1} is a super-resolution image of an (i+1)th sample image frame, and GT_{i+1} is the (i+1)th sample image frame.
[0124] According to the method provided in this embodiment, a difference between a sample image and a super-resolution image is constrained within a specific range by using the pixel loss function, so that the training process is more stable.
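A corresponding sketch of the pixel loss reduces to a plain mean square error between stacked super-resolution frames and sample frames; stacking the N frame pairs along the batch dimension (an assumption of this sketch) realizes the 1/N average.

```python
import torch
import torch.nn.functional as F

def pixel_loss(sr_frames: torch.Tensor, gt_frames: torch.Tensor) -> torch.Tensor:
    # Average of squared differences between predicted super-resolution frames
    # g_{i+1} and sample frames GT_{i+1}, each of shape (N, C, H, W).
    return F.mse_loss(sr_frames, gt_frames)
```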
[0125] Operation 450 further includes the following sub-operations:
Operation 451: Train the generative network.
[0126] The generative network is trained based on at least one of the inter-frame stability loss function, the adversarial loss function, the perceptual loss function, and the pixel loss function, so that a super-resolution image that is of a sample image and that is predicted by the generative network is close to the sample image.
[0127] In some embodiments, only the inter-frame stability loss function may be used to
train the generative network, and other loss functions are not used.
[0128] In some embodiments, only the adversarial loss function may be used to train the
generative network, and other loss functions are not used.
[0129] In some embodiments, only the perceptual loss function may be used to train the generative
network, and other loss functions are not used.
[0130] In some embodiments, only the pixel loss function may be used to train the generative
network, and other loss functions are not used.
[0131] In some embodiments, the inter-frame stability loss function and the adversarial
loss function may be used to train the generative network, and other loss functions
are not used.
[0132] In some embodiments, the inter-frame stability loss function and the perceptual loss
function may be used to train the generative network, and other loss functions are
not used.
[0133] In some embodiments, the inter-frame stability loss function and the pixel loss function
may be used to train the generative network, and other loss functions are not used.
[0134] In some embodiments, the inter-frame stability loss function, the adversarial loss
function, and the perceptual loss function may be used to train the generative network,
and other loss functions are not used.
[0135] In some embodiments, the inter-frame stability loss function, the adversarial loss
function, and the pixel loss function may be used to train the generative network,
and other loss functions are not used.
[0136] In some embodiments, the inter-frame stability loss function, the perceptual loss
function, and the pixel loss function may be used to train the generative network,
and other loss functions are not used.
[0137] In some embodiments, the inter-frame stability loss function, the adversarial loss
function, the perceptual loss function, and the pixel loss function may all be used
to train the generative network.
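Where several of these loss functions are used together, they may be combined into a single training objective as a weighted sum. The sketch below is illustrative only: the helper name total_generator_loss and the default weights are assumptions, since this application does not prescribe particular weighting coefficients.

```python
import torch

def total_generator_loss(losses: dict, weights: dict = None) -> torch.Tensor:
    # Weighted sum over whichever of the four losses are enabled in a given
    # embodiment; the default weights are placeholders only.
    default = {"stability": 1.0, "adversarial": 0.1, "perceptual": 1.0, "pixel": 1.0}
    weights = weights if weights is not None else default
    return sum(weights[name] * value for name, value in losses.items())
```

For example, an embodiment that uses only the perceptual loss function and the pixel loss function would pass losses={"perceptual": d_perc, "pixel": d_pix}.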
Training specific to the discriminative network:
[0138] Operation 442: Calculate a first error loss of the discrimination result based on the discrimination
result and an adversarial loss function.
[0139] The discrimination result is obtained by using the discriminative network to discriminate
between the super-resolution image of the (i+1)
th sample image frame and the (i+1)
th sample image frame. An adversarial loss is a parameter configured for adjusting an
output result of the generative network and an output result of the discriminative
network, to make the output results tend to be consistent. Based on the adversarial
loss function, the discriminative network may be trained in a process of training
the generative network. Thus, the discriminative network can determine a difference
between a generation result of the generative network and a real super-resolution
image, and feed the difference back to the generative network, so as to cyclically
train the generative network and the discriminative network.
[0140] In some embodiments, the discriminative network is trained by using the adversarial loss function. For example, D_adv_D = -E_{x∼p(x)}[log(1 - D(I_g))] + E_{xr∼p(xr)}[log D(xr)], where x is a low-resolution image, p(x) is the distribution of the low-resolution image, xr is a super-resolution image, p(xr) is the distribution of the super-resolution image, D is a discriminative network, and I_g is a super-resolution result inputted into the discriminative network, the super-resolution result being generated by a generative network.
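A sketch of a discriminator-side loss of this kind is given below, written in the standard binary cross-entropy form in which the generated result I_g is labeled fake (0) and the reference image real (1); the sign convention, the detaching of the generated result, and the assumption of raw-logit outputs are choices of this sketch rather than requirements of this application.

```python
import torch
import torch.nn.functional as F

def adversarial_loss_d(discriminator: torch.nn.Module,
                       generated: torch.Tensor,
                       reference: torch.Tensor) -> torch.Tensor:
    # The discriminative network learns to separate the generated result I_g
    # from the reference image.
    fake_logits = discriminator(generated.detach())   # detach: only D is updated here
    real_logits = discriminator(reference)
    loss_fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    loss_real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    return loss_fake + loss_real
```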
[0141] Operation 450 further includes the following sub-operations:
Operation 452: Train the discriminative network.
[0142] The discriminative network is trained based on the adversarial loss function, so that the discriminative network can determine a difference between a super-resolution result predicted by the generative network and an accurate super-resolution result, thereby improving the accuracy of its discrimination result.
[0143] This application provides the video super-resolution processing method that ensures
stability of feature time series based on a constraint on adjacent frames. The generative
adversarial network is used as an entire network to generate a super-resolution image
with rich details and textures. The generative network uses a cyclic structure to
add information transfer between adjacent frames in a video, and uses inter-frame
information to improve a super-resolution effect and stability. In terms of loss functions,
a generative adversarial loss is used to increase textures and details, and in addition,
the image-level pixel loss function and the perceptual loss function are used to ensure
that a super-resolution result of a single frame does not significantly deviate from
an original image. Moreover, this application further provides an inter-frame time
series loss function to perform inter-frame constraint on adjacent super-resolution
results. This improves stability between video frames and reduces jumps, ensuring
visual perception of video continuity and consistency.
[0144] This application mainly provides two innovation points:
1. A feature circulation network is used as the generative network, and features instead of an image are used as an input for a time series. Advantages thereof are as follows:
[0145] Efficiency is higher. Features may be directly used as an input for a next frame
without a need for mapping to the eigenspace through a network. This can reduce a
time cost. A longer time series indicates a greater time advantage over an image input.
[0146] More information is retained in features as compared with an image. In addition,
a feature of a previous frame and a long time series feature are introduced. Through
an increase in a quantity of time series information, problems caused by an image
input, such as an information loss and an information error, are alleviated. Examples
of an information error include bending, an artifact, and the like.
[0147] 2. The constraint on adjacent frames is added to the training process. Advantages
thereof are as follows:
Using the pixel-level loss function for constraining adjacent frames can stably improve
time series continuity of adjacent frames.
[0148] Only the constraint on adjacent frames needs to be added to video forward sequence
training, so that stability of inter-frame super-resolution results can be constrained
and texture jumps in an output result can be reduced. Reverse sequence training is
not required.
[0149] The technical process of this application mainly includes two parts: the generative network and the training of the generative network. The generative network is a network that generates a super-resolution result, and is also used during actual application. The training of the generative network involves the discriminative network, the VGG network, the optical flow network, and the loss functions. Their main role is to supervise the super-resolution result in the training process, so as to constrain the generative network.
[0150] An input to the generative network includes a current image frame, a feature of a previous frame, and a long time series feature before the previous frame. After the input is passed through the structure of the generative network, a feature of the current frame, a long time series feature before the current frame, and a super-resolution image of the current frame are outputted. Both the feature of the current frame and the long time series feature before the current frame are used as an input to the generative network for a subsequent image. The super-resolution image enters the training phase, and is also outputted directly during application. The inputted image is first passed through the feature extraction network, which maps the image to the eigenspace; then, the extracted feature, together with the feature of the previous frame and the long time series feature before the previous frame, is inputted into the feature fusion network for fusion; and after the fused feature is processed through the upsampling network, the super-resolution result is obtained. The extracted feature and the fused feature in the generation phase are used as inputs for super-resolution of the next frame and of the frame after the next frame, respectively, to provide supplementary information.
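The data flow described above can be summarized in a short sketch; RecurrentSRGenerator and the three sub-modules passed to it are illustrative stand-ins for the feature extraction network, the feature fusion network, and the upsampling network, and the sketch only fixes how inputs and cached features move through the generative network.

```python
import torch
import torch.nn as nn

class RecurrentSRGenerator(nn.Module):
    def __init__(self, extractor: nn.Module, fusion: nn.Module, upsampler: nn.Module):
        super().__init__()
        self.extractor = extractor    # maps the input image to the eigenspace
        self.fusion = fusion          # multi-phase feature fusion network
        self.upsampler = upsampler    # produces the super-resolution image

    def forward(self, frame, prev_feat, long_feat):
        cur_feat = self.extractor(frame)                          # feature of the current frame
        fused_long = self.fusion(cur_feat, prev_feat, long_feat)  # long time series feature
        sr = self.upsampler(cur_feat, fused_long)                 # super-resolution image
        # cur_feat and fused_long are cached and reused as inputs for later frames
        return sr, cur_feat, fused_long
```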
[0151] The function of the feature extraction network is to map the inputted image to the eigenspace. Its input is the image, and its output is the extracted feature. The feature is also used as an input to the super-resolution process of the next frame. The feature extraction network used in this application is a convolutional network. To fully extract image features, convolutions with different receptive fields are used to process the input.
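As one possible reading of the paragraph above, convolutions with different receptive fields can be realized as parallel branches with different kernel sizes; the kernel sizes and channel counts below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    # Parallel 3x3 and 5x5 convolutions provide two receptive fields; their
    # outputs are concatenated into a single feature map.
    def __init__(self, in_ch: int = 3, feat_ch: int = 64):
        super().__init__()
        self.conv3 = nn.Conv2d(in_ch, feat_ch // 2, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(in_ch, feat_ch // 2, kernel_size=5, padding=2)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(torch.cat([self.conv3(x), self.conv5(x)], dim=1))
```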
[0152] In this application, the multi-phase feature fusion network is used as the feature fusion network to fuse the feature of the current frame, the feature of the previous frame, and the long time series feature before the previous frame. First, the feature of the current frame and the feature of the previous frame are fused, and the fused feature of the two frames is then fused with the long time series feature before the previous frame. A deformable convolution is mainly used to align and extract features. Two sets of features are concatenated, the features are then cross-mixed through a channel shuffle operation, and the mixed features are then compressed and extracted by means of the deformable convolution.
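A hedged sketch of one fusion stage follows: two feature sets are concatenated, cross-mixed by a channel shuffle, and then compressed by a deformable convolution whose sampling offsets are predicted by an auxiliary convolution. The offset predictor and channel counts are assumptions, and torchvision.ops.DeformConv2d is used here as a stand-in deformable convolution.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # Cross-mix the channels of the concatenated feature sets.
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class FusionStage(nn.Module):
    def __init__(self, feat_ch: int = 64, k: int = 3):
        super().__init__()
        self.offset = nn.Conv2d(2 * feat_ch, 2 * k * k, k, padding=k // 2)   # predicts sampling offsets
        self.deform = DeformConv2d(2 * feat_ch, feat_ch, k, padding=k // 2)  # aligns and compresses

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        mixed = channel_shuffle(torch.cat([feat_a, feat_b], dim=1), groups=2)
        return self.deform(mixed, self.offset(mixed))
```

In the multi-phase fusion described above, a first stage of this kind would fuse the current-frame feature with the previous-frame feature, and a second stage would fuse that result with the long time series feature before the previous frame.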
[0153] FIG. 12 is a block diagram of a video super-resolution processing apparatus according
to an exemplary embodiment of this application. The apparatus includes:
an obtaining module 1220, configured to: obtain an (i+1)th image frame from a video, and obtain an image feature of an ith image frame in the video and a long time series feature before the ith image frame, the image feature of the ith image frame and the long time series feature before the ith image frame being cached by a cache module 1210 during super-resolution processing
of the ith image frame and an (i-1)th image frame, respectively;
a generative network module 1230, configured to perform super-resolution prediction
on the image feature of the ith image frame, the long time series feature before the ith image frame, and the (i+1)th image frame, to obtain a super-resolution image of the (i+1)th image frame, an image feature of the (i+1)th image frame, and a long time series feature before the (i+2)th image frame; and
the cache module, configured to cache the image feature of the (i+1)th image frame and the long time series feature before the (i+2)th image frame;
i being a positive integer greater than 2.
[0154] In a possible implementation, as shown in FIG. 13, the generative network module
1230 includes a feature extraction network module 1231, a feature fusion network module
1232, and an upsampling network module 1233.
[0155] The feature extraction network module 1231 is configured to perform feature extraction
on the (i+1)
th image frame, to obtain the image feature of the (i+1)
th image frame.
[0156] The feature fusion network module 1232 is configured to fuse the image feature of
the i
th image frame, the long time series feature before the i
th image frame, and the image feature of the (i+1)
th image frame, to obtain the long time series feature before the (i+2)
th image frame.
[0157] The upsampling network module 1233 is configured to perform prediction on the image
feature of the (i+1)
th image frame and the long time series feature before the (i+2)
th image frame, to obtain the super-resolution image of the (i+1)
th image frame.
[0158] In a possible implementation, the feature fusion network module 1232 includes a first
feature fusion sub-module 12321 and a second feature fusion sub-module 12322.
[0159] The first feature fusion sub-module 12321 is configured to fuse the image feature
of the (i+1)
th image frame and the image feature of the i
th image frame, to obtain a fused time series feature.
[0160] The second feature fusion sub-module 12322 is configured to fuse the fused time series
feature and the long time series feature before the i
th image frame, to obtain the long time series feature before the (i+2)
th image frame.
[0161] In a possible implementation, the obtaining module 1220 is further configured to
obtain the first image frame from the video.
[0162] The generative network module 1230 is further configured to perform super-resolution
prediction on the first image frame, to obtain a super-resolution image of the first
image frame and an image feature of the first image frame.
[0163] The obtaining module 1220 is further configured to obtain the second image frame
from the video.
[0164] The generative network module 1230 is further configured to perform super-resolution
prediction on the image feature of the first image frame and the second image frame,
to obtain a super-resolution image of the second image frame and an image feature
of the second image frame.
[0165] The obtaining module 1220 is further configured to obtain the third image frame from
the video.
[0166] The generative network module 1230 is further configured to perform super-resolution
prediction on the image feature of the second image frame and the third image frame,
to obtain a super-resolution image of the third image frame, an image feature of the
third image frame, and a long time series feature before a fourth image frame.
[0167] The cache module is configured to cache the image feature of the third image frame
and the long time series feature before the fourth image frame.
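The frame-by-frame handling of the first three frames generalizes to a simple inference loop. The helper below is a sketch only; super_resolve_video is a hypothetical name, and it assumes the generative network module tolerates missing cached inputs (None) for the first and second frames.

```python
def super_resolve_video(frames, generator):
    # First frame: no cached inputs; second frame: only the previous-frame
    # feature; from the third frame on, both cached inputs are available.
    prev_feat, long_feat, outputs = None, None, []
    for frame in frames:
        sr, prev_feat, long_feat = generator(frame, prev_feat, long_feat)
        outputs.append(sr)   # the two returned features form the cache for the next frame
    return outputs
```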
[0168] In a possible implementation, the generative network module 1230 is obtained through
training in the following mode:
[0169] The cache module 1210 is further configured to cache an ith sample image frame and an (i+1)th sample image frame from a sample video.
[0170] The generative network module 1230 is configured to predict a super-resolution image
of the i
th sample image frame and a super-resolution image of the (i+1)
th sample image frame.
[0171] A calculation module 1250 is configured to calculate an inter-frame stability loss
between a first change and a second change by using an inter-frame stability loss
function. The first change is a change between the i
th sample image frame and the (i+1)
th sample image frame, and the second change is a change between the super-resolution
image of the i
th sample image frame and the super-resolution image of the (i+1)
th sample image frame. The inter-frame stability loss is configured for constraining
stability between adjacent image frames.
[0172] A training module 1260 is configured to train the generative network module based
on the inter-frame stability loss.
[0173] In a possible implementation, the calculation module 1250 is further configured to
calculate a first optical flow of the first change by using an optical flow network
module.
[0174] The calculation module 1250 is further configured to calculate a second optical flow
of the second change by using the optical flow network module.
[0175] The calculation module 1250 is further configured to substitute the first optical
flow and the second optical flow into the inter-frame stability loss function, to
calculate the inter-frame stability loss.
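A minimal sketch of how the two optical flows might be substituted into an inter-frame stability loss is given below; flow_net stands for the optical flow network module, and the use of an L1 distance between the two flows is an assumption of this sketch rather than the exact form used in this application.

```python
import torch
import torch.nn.functional as F

def inter_frame_stability_loss(flow_net, gt_i, gt_ip1, sr_i, sr_ip1):
    flow_first = flow_net(gt_i, gt_ip1)    # first change: between adjacent sample frames
    flow_second = flow_net(sr_i, sr_ip1)   # second change: between adjacent SR frames
    # Constrain the two changes to be consistent across adjacent frames.
    return F.l1_loss(flow_second, flow_first)
```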
[0176] In a possible implementation, a discriminative network module 1240 is configured
to discriminate between the super-resolution image of the (i+1)
th sample image frame and the (i+1)
th sample image frame, to obtain a discrimination result.
[0177] The calculation module 1250 is further configured to calculate a first error loss
of the discrimination result based on the discrimination result and an adversarial
loss function.
[0178] The training module 1260 is further configured to train the generative network and
the discriminative network alternately based on the first error loss.
[0179] The adversarial loss function is configured for constraining consistency between
a super-resolution result of the (i+1)
th sample image frame and the discrimination result.
[0180] In a possible implementation, the calculation module 1250 is further configured to
calculate a second error loss between a feature of the (i+1)
th sample image frame and a feature of the super-resolution image of the (i+1)
th sample image frame by using a perceptual loss function.
[0181] The training module 1260 is further configured to train the generative network based
on the second error loss.
[0182] The perceptual loss function is configured for constraining consistency between the
(i+1)
th sample image frame and the super-resolution image of the (i+1)
th sample image frame in terms of eigenspace.
[0183] In a possible implementation, the calculation module 1250 is further configured to
calculate a third error loss between the super-resolution image of the (i+1)
th sample image frame and the (i+1)
th sample image frame by using a pixel loss function.
[0184] The training module 1260 is further configured to train the generative network based
on the third error loss.
[0185] The pixel loss function is configured for constraining consistency between the super-resolution
image of the (i+1)
th sample image frame and the (i+1)
th sample image frame in terms of image content.
[0186] FIG. 14 is a diagram of a structure of a computer device according to an exemplary
embodiment of this application. For example, a computer device 1400 includes a central
processing unit (CPU) 1401, a system memory 1404 including a random access memory
(RAM) 1402 and a read-only memory (ROM) 1403, and a system bus 1405 connecting the
system memory 1404 and the central processing unit 1401. The computer device 1400
further includes a basic input/output (I/O) system 1406 assisting in transmitting
information between components in the computer, and a mass storage device 1407 configured
to store an operating system 1413, a client 1414, and another program module 1415.
[0187] The basic I/O system 1406 includes a display 1408 configured to display information,
and an input device 1409 configured for inputting information by a user, such as a
mouse or a keyboard. The display 1408 and the input device 1409 are both connected
to the central processing unit 1401 by using an input/output controller 1410 connected
to the system bus 1405. The basic I/O system 1406 may further include the input/output
controller 1410 configured to receive and process inputs from a plurality of other
devices such as a keyboard, a mouse, and an electronic stylus. Likewise, the input/output
controller 1410 further provides an output to a display screen, a printer, or another
type of output device.
[0188] The mass storage device 1407 is connected to the central processing unit 1401 by
using a mass storage controller (not shown) connected to the system bus 1405. The
mass storage device 1407 and a computer-readable medium associated with the mass storage
device 1407 provide non-volatile storage for the computer device 1400. To be specific,
the mass storage device 1407 may include a computer-readable medium (not shown) such
as a hard disk or a compact disc read-only memory (CD-ROM) drive.
[0189] The computer-readable medium may include a computer storage medium and a communication
medium. The computer storage medium includes a volatile or non-volatile, or removable
or non-removable medium that is implemented by using any method or technology and
that is configured for storing information such as a computer-readable instruction,
a data structure, a program module, or other data. The computer storage medium includes
a RAM, a ROM, an erasable programmable read-only memory (EPROM), an electrically erasable
programmable read-only memory (EEPROM), a flash memory or another solid-state memory
technology, a CD-ROM, a digital versatile disc (DVD) or another optical memory, a
tape cartridge, a magnetic tape, a magnetic disk memory, or another magnetic storage
device. Certainly, a person skilled in the art may learn that the computer storage
medium is not limited to the foregoing several types. The system memory 1404 and the
mass storage device 1407 may be collectively referred to as a memory.
[0190] According to the embodiments of this application, the computer device 1400 may further
be connected, through a network such as the Internet, to a remote computer on the
network and run. To be specific, the computer device 1400 may be connected to a network
1412 by using a network interface unit 1411 connected to the system bus 1405, or may
be connected to another type of network or a remote computer system (not shown) by
using a network interface unit 1411.
[0191] An exemplary embodiment of this application further provides a computer-readable
storage medium. The computer-readable storage medium stores at least one program.
The at least one program is loaded and executed by a processor to implement the video
super-resolution processing method according to the foregoing method embodiments.
[0192] An exemplary embodiment of this application further provides a computer program product.
The computer program product includes at least one program, and the at least one program
is stored in a readable storage medium. A processor of a computer device reads the
at least one program from the readable storage medium, and the processor executes
the at least one program to cause the computer device to perform the video super-resolution
processing method according to the foregoing method embodiments.
[0193] "A plurality of" mentioned in the specification means two or more. After considering
the specification and practicing the present disclosure, a person skilled in the art
may easily conceive of other implementations of this application. This application
is intended to cover any variations, uses, or adaptive changes of this application.
These variations, uses, or adaptive changes follow the general principles of this
application and include common general knowledge or common technical means in the
art, which are not disclosed in this application. The specification and embodiments
are considered as merely exemplary, and the actual scope and spirit of this application
are pointed out in the following claims.
[0194] A person of ordinary skill in the art may understand that all or some of the operations
of the foregoing embodiments may be implemented by hardware, or may be implemented
by a program instructing relevant hardware. The program may be stored in a computer-readable
storage medium. The storage medium may be a read-only memory, a magnetic disk, an
optical disc, or the like.
[0195] The foregoing descriptions are merely optional embodiments of this application, but
are not intended to limit this application. Any modification, equivalent replacement,
improvement, or the like made within the spirit and principle of this application
shall fall within the protection scope of this application.