TECHNICAL FIELD
[0001] Embodiments presented herein relate to a method, a video network node, a computer
program, and a computer program product for determining a time offset for a video
segment of a video stream using metadata.
BACKGROUND
[0002] Communications systems, for example implementing functionality of a content delivery
network (CDN), can be used to serve content, such as video streams, to end-users with
high availability and high performance. In some scenarios, additional content, such
as advertisements, is inserted at one or more places in the video stream before it
is delivered to the end-users.
[0003] In general terms, advertisement insertion concerns the insertion of new advertisement
segments into video streams, and advertisement replacement concerns the replacement
of existing advertisement segments in video streams with new advertisement segments.
Advertisement segments are commonly grouped together into consecutive sequences of
advertisements, each such sequence being denoted an "advertisement break". A television
(TV) program may have a pre-roll advertisement break (comprising a sequence of advertisements
before program start), any number of mid-roll advertisement breaks (each comprising
a sequence of advertisements in the middle of the program), and a post-roll advertisement
break (a sequence of advertisements after the end of the program). Pay TV operators
usually sell advertisement slots for a certain time window. Two examples are called
C3 and C7. For a C3 time window, for example, advertisement slots are sold for 3 days,
and between the time the TV program was aired and until 3 days afterwards the advertisements
must not be replaced. However, after the time period of 3 days, advertisement slots
sold under the C3 contract may be replaced with new advertisements.
[0004] The act of inserting advertisements at the beginning and/or end of advertisement
breaks, and/or replacing existing advertisements with new advertisements, requires accuracy
in identifying the first and last frame of the advertisement break. Without this accuracy,
advertisement insertion and advertisement replacement may result in a disruptive,
choppy, or jagged appearance of the video stream to the viewer. For smooth advertisement
insertion and advertisement replacement, the exact boundaries of the advertisement
break within the video stream need to be known.
[0005] TV operators have metadata regarding which advertisements were inserted into the video
stream, at which start and end times each advertisement is found in the stream, and
what the duration of each advertisement is. Such metadata can be stored in log files.
[0006] One mechanism for advertisement insertion and advertisement replacement could thus
be to use the metadata as is, which describes approximately the start and end times
of ad-breaks. However, it could be that the metadata of the log file is not well synchronized
with the video stream, thus resulting in new advertisements being inserted in the
middle of an existing advertisement, or replacing parts of TV programs and a prefix
or a suffix of an existing advertisement with new advertisements, instead of accurately
replacing existing advertisements within an advertisement break with new advertisements.
[0007] Although advertisements have been mentioned as an example where a video segment (as
defined by a single advertisement or an entire advertisement break) is to be replaced
or removed from a video stream, there are also other examples where a video segment
is to be replaced or removed from a video stream.
[0008] In view of the above, there is thus a need for an improved handling of video segments
in a video stream.
[0009] US 2014/196085 discloses methods and systems to insert advertisements and/or other supplemental
or replacement content into a stream of video content. In some example embodiments,
the methods and systems receive a request to replace a portion of video content currently
playing at a client device with supplemental video content, such as an advertisement.
In response to the request, the methods and systems determine one or more fingerprints
of the video content playing at the client device, identify one or more frames of
the video content at which to insert the supplemental video content based on the one
or more fingerprints, and insert the supplemental video content at the identified
one or more frames of the video content.
[0010] WO 2014/178872 discloses a method and system for manipulating a manifest. A server receives a request
for a manifest corresponding to a session identifier. The server retrieves from a
session server a session manifest based on the session identifier. The server adjusts
a session offset based on a difference in a session length represented by the session
manifest from a session length represented by a previous session manifest corresponding
to the session. When the session manifest comprises an address of an ad break, the
server identifies in a cache at least one advertisement to be inserted into the session
and replaces at least one address corresponding to at least one segment of the at
least one advertisement in the session manifest based on the difference. The server
transmits the session manifest to the smart appliance.
SUMMARY
[0011] An object of embodiments herein is to provide mechanisms for accurately identifying
a video segment in a video stream.
[0012] According to a first aspect there is presented a method for determining a time offset
for a video segment of a video stream using metadata. The metadata comprises time
information of at least one of a start time and an end time of the video segment.
The method is performed by a video network node. The method comprises extracting a
first video part and a second video part from the video stream. Each of the first
video part and the second video part comprises a common video segment. The method
comprises identifying a sequence of video frames in the first video part that represents
the common video segment, wherein identifying the sequence of video frames comprises:
identifying, in the first video part, a first sequence of video frames that is similar
to a second sequence of video frames in the second video part, and wherein the first
sequence of video frames has a time duration equal to the time duration of the video
segment, and determining that the first sequence of video frames is similar to the
second sequence of video frames in the second video part using an image similarity
measure between video frames in the first video part and video frames in the second
video part. The method comprises determining the time offset based on a time difference
between an end-point frame of the identified sequence of video frames and the time
information in the metadata.
[0013] According to a second aspect there is presented a video network node for determining a time
offset for a video segment of a video stream using metadata. The metadata comprises
time information of at least one of a start time and an end time of the video segment.
The video network node comprises processing circuitry and a storage medium. The storage
medium stores instructions that, when executed by the processing circuitry, cause
the video network node to perform operations, or steps. The operations, or steps,
cause the video network node to extract a first video part and a second video part
from the video stream. Each of the first video part and the second video part comprises
a common video segment. The operations, or steps, cause the video network node to
identify a sequence of video frames in the first video part that represents the common
video segment, wherein identifying the sequence of video frames comprises: identifying,
in the first video part, a first sequence of video frames that is similar to a second
sequence of video frames in the second video part, and wherein the first sequence
of video frames has a time duration equal to the time duration of the video segment,
and determining that the first sequence of video frames is similar to the second sequence
of video frames in the second video part using an image similarity measure between
video frames in the first video part and video frames in the second video part. The
operations, or steps, cause the video network node to determine the time offset based
on a time difference between an end-point frame of the identified sequence of video
frames and the time information in the metadata.
[0014] According to a third aspect there is presented a computer program for determining
a time offset for a video segment of a video stream using metadata, the computer program
comprising computer program code which, when run on a video network node, causes the
video network node to perform operations, or steps. The operations, or steps, cause
the video network node to extract a first video part and a second video part from
the video stream. Each of the first video part and the second video part comprises
a common video segment. The operations, or steps, cause the video network node to
identify a sequence of video frames in the first video part that represents the common
video segment, wherein identifying the sequence of video frames comprises: identifying,
in the first video part, a first sequence of video frames that is similar to a second
sequence of video frames in the second video part, and wherein the first sequence
of video frames has a time duration equal to the time duration of the video segment,
and determining that the first sequence of video frames is similar to the second sequence
of video frames in the second video part using an image similarity measure between
video frames in the first video part and video frames in the second video part. The
operations, or steps, cause the video network node to determine the time offset based
on a time difference between an end-point frame of the identified sequence of video
frames and the time information in the metadata.
[0015] Advantageously this method, this video network node, this computer program and this
computer program product enable accurate identification of the video segment in the
video stream. In turn, this enables efficient handling of video segments in the video
stream.
[0016] Advantageously this method, this video network node, this computer program and this
computer program product provide an accurate identification of the first and last
frames of the video segment.
[0017] Advantageously this method, this video network node, this computer program and this
computer program product need a comparatively small search window to accurately find
the first and last frames of the video segment.
[0018] Advantageously this method, this video network node, this computer program and this
computer program product enable, with the use of the metadata, identification of the video
segment even when the content of the video segment appears for the first time in the
video stream.
[0019] Advantageously this method, this video network node, this computer program and this
computer program product enable accurate determination of the time offset in scenarios
where the time offset is caused by transcoding, re-encoding, or other processing operations
occurring before the video stream is played out at a client node.
[0020] Advantageously this method, this video network node, this computer program and this
computer program product enable efficient separation of the video segment from the
video stream such that the video segment can be replaced or removed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0021] The inventive concept is now described, by way of example, with reference to the
accompanying drawings, in which:
Fig. 1 is a schematic diagram illustrating a communications system according to embodiments;
Fig. 2 schematically illustrates video streams according to an embodiment;
Figs. 3, 4, 9, 10, and 11 are flowcharts of methods according to embodiments;
Fig. 5 is a schematic illustration of similarity hashing according to an embodiment;
Fig. 6 is a schematic illustration of a similarity matrix according to an embodiment;
Fig. 7 is a schematic illustration of metadata according to an embodiment;
Fig. 8 is a schematic diagram illustrating part of the communications system of Fig.
1;
Fig. 12 is a schematic diagram showing functional units of a video network node according
to an embodiment;
Fig. 13 is a schematic diagram showing functional modules of a video network node
according to an embodiment; and
Fig. 14 shows one example of a computer program product comprising computer readable
storage medium according to an embodiment.
DETAILED DESCRIPTION
[0022] The inventive concept will now be described more fully hereinafter with reference
to the accompanying drawings, in which certain embodiments of the inventive concept
are shown. This inventive concept may, however, be embodied in many different forms
and should not be construed as limited to the embodiments set forth herein; rather,
these embodiments are provided by way of example so that this disclosure will be thorough
and complete, and will fully convey the scope of the inventive concept to those skilled
in the art. Like numbers refer to like elements throughout the description. Any step
or feature illustrated by dashed lines should be regarded as optional.
[0023] Fig. 1 is a schematic diagram illustrating a communications system 100 where embodiments
presented herein can be applied. The communications system 100 could implement the
functionality of a content delivery network and comprises a video streamer node 110,
a video network node 1200, a manifest manipulator node 130 (optional), a client node 140, and
content databases 150, 160 acting as video servers streaming and serving Uniform Resource
Locators (URLs) of the video segments to the client node 140. The video streamer node
110 issues metadata 120 (for example provided in a log file) which specifies advertisement
breaks within a video stream. The metadata 120 could describe a unique identity for
every advertisement and the approximate start time and end times of each advertisement.
In this respect, the start time and end times of each advertisement as given by the
metadata 120 could differ from the true start time and end times of each advertisement
by as little as a single video frame (corresponding to a duration in time of a fraction
of a second) or by as much as several video frames (corresponding to a duration in
time of more than a second). Further,
the start time and end times of each advertisement could be indicated by the insertion
of cue-tones in the video stream, which indicate the exact position of the ad-breaks.
However, not all video streams have cue-tones inserted.
[0024] The client node 140 is configured to request a manifest 170 from the video network
node 1200 upon playout of the video stream. In response to the request the video network
node 1200 returns a manipulated manifest 170 which contains segments of the video
stream from the original Content Origin database 150.
[0025] The video network node 1200 is configured to remove segments of old advertisements,
and to insert segments of new advertisements with pointers, such as URLs, pointing
to the Alternative Content Origin database 160 (instead of to the original advertisement
segments in the Content Origin database 150). The decisions of where the advertisements
are, that is, the decisions of which video segments to remove and where to insert
the video segments of the new advertisements, are made based on the information supplied
to the video network node 1200 by the metadata 120. For example, the metadata may
be supplied by the operator in terms of starting times of the original advertisements
when the video streamer node 110 inserts the advertisements into the video stream
for the first time.
[0026] The video network node 1200 is configured, for example, to replace old advertisements
within a recording of the video stream with new advertisements. The video network
node 1200 relies on accurate metadata 120 describing where the existing advertisements
are found. However, as mentioned above, the metadata may not be accurate and hence
the video network node 1200 may not be able to correctly replace the old advertisements
with the new advertisements.
[0027] The embodiments disclosed herein therefore relate to mechanisms for determining a
time offset for a video segment of a video stream using metadata 120. The time offset
results from the start time and end times of each advertisement as given by the metadata
120 not being accurate. In order to obtain such mechanisms there is provided a video
network node 1200, a method performed by the video network node 1200, a computer program
product comprising code, for example in the form of a computer program, that when
run on a video network node 1200, causes the video network node 1200 to perform the
method.
[0028] Figs. 3 and 4 are flowcharts illustrating embodiments of methods for determining
a time offset for a video segment 230' of a video stream 200 using metadata 120. The
methods are performed by the video network node 1200. The methods are advantageously
provided as computer programs 1420.
[0029] Reference is now made to Fig. 3 illustrating a method for determining a time offset
for a video segment 230' of a video stream 200 using metadata 120 as performed by
the video network node 1200 according to an embodiment. Parallel reference is made
to Fig. 2.
[0030] The video network node 1200 obtains as input metadata 120 and an approximate start
and/or end time of a video segment 230'. Fig. 2 at (a) and (b) schematically illustrates
a video stream 200. Fig. 2 at (a) shows that metadata 120 points out a starting point
of video segment 230'. That is, the metadata 120 comprises time information of at
least one of a start time and an end time of the video segment 230'. Start times and
end times given by the metadata 120 are only approximate, and the video network node
1200 is therefore configured to find this inaccuracy. Fig. 2 at (b) illustrates the
true location of the video segment 230. This location differs by a time offset to
from the approximate location of the video segment 230' as given by the metadata 120
in Fig. 2 at (a). The video network node 1200 is configured to download parts of the
video stream 200 in order to find the exact start time and/or end time of the video
segment 230' using the downloaded parts together with the metadata 120. Particularly,
the video network node 1200 is configured to perform step S102:
S102: The video network node 1200 extracts a first video part 210 and a second video
part 220 from the video stream 200, each of which comprising a common video segment
230, 240. That is, the first video part 210 and the second video part 220 are extracted
such that they both comprise a common video segment 230, 240 representing content
occurring in both the first video part 210 and the second video part 220.
[0031] In the illustrative example of Fig. 2 the first video part 210 has a duration t3
and the second video part 220 has a duration t5, and the common video segment 230,
240 has a duration t2 in the first video part 210 and a duration t4 in the second
video part 220. Further, the common video segment 230, 240 starts at a time offset
Δt = t1 from the start of the first video part 210.
[0032] S106: The video network node 1200 identifies a sequence of video frames in the first
video part 210 that represents the common video segment 230, 240. That is, the identified
sequence of video frames occurs somewhere in the first video part 210 and is thus
a sub-part of the first video part 210.
[0033] S108: The video network node 1200 determines the time offset to based on a time difference
between an end-point frame of the identified sequence of video frames and the time
information in the metadata.
[0034] Here, the end-point frame could be either the first frame of the identified sequence
of video frames or the last frame of the identified sequence of video frames. That
is, in an embodiment the end-point frame of the sequence of video frames is a first
occurring frame of the sequence of video frames, and the end-point frame constitutes
the beginning of the video segment. In an alternative embodiment the end-point frame
of the sequence of video frames is a last occurring frame of the sequence of video
frames, and the end-point frame constitutes the ending of the video segment.
[0035] The common video segment 230, 240 could be identical to the video segment 230'. Hence,
in such embodiments the first video part 210 and the second video part 220 both comprise
the content of the video segment (i.e., the content of the video segment 230' is identical
to the content of the video segments 230 and 240). The end-point frame of the identified
sequence is thus identical to an end-point frame of the video segment 230'. This is
the case in the illustrative example of Fig. 2.
[0036] However, it could be that neither the first video part 210 nor the second video part
220 comprises the video segment 230'. In such scenarios it can be assumed that there
is a known time difference between the sequence of video frames in the first video
part 210 and the video segment 230' such that the video network node 1200 can identify
an end-point frame of the video segment 230' by adding (or subtracting) this known
time difference to/from the end-point frame of the identified sequence in order to
determine the time offset to.
[0037] Embodiments relating to further details of determining the time offset to for the
video segment 230' of the video stream 200 using the metadata 120 as performed by
the video network node 1200 will now be disclosed.
[0038] Reference is now made to Fig. 4 illustrating methods for determining the time offset
to for the video segment 230' of the video stream 200 using the metadata 120 as performed
by the video network node 1200 according to further embodiments. It is assumed that
steps S102, S106, S108 are performed as described above with reference to Fig. 3 and
a repeated description thereof is therefore omitted.
[0039] There may be different ways to extract the first video part 210 and the second video
part 220 from the video stream 200. As disclosed above, the first video part 210 and
the second video part 220 are extracted such that they both comprise a common video
segment 230, 240. Further, according to the metadata 120 the approximate start time
and stop time of the video segment 230' are known. Hence, in scenarios where the common
video segment 230, 240 is identical to the video segment 230', the first video part 210
and the second video part 220 could be selected to at least comprise content corresponding
to the video segment 230'. The first video part 210 and the second video part 220
could thus be extracted by downloading the video stream 200 from the approximate start
time - Δt until the approximate end time + Δt. The value of Δt is taken to be large
enough to contain the maximum approximation error of the metadata. In view of the
above, the value of Δt could correspond to a single video frame (corresponding to
a duration in time of a fraction of a second) up to several video frames (corresponding
to a duration in time of more than a second).
[0040] There may be different ways to perform the identifying in step S106. Embodiments
relating thereto will now be described in turn.
[0041] As disclosed above, the metadata 120 comprises time information of at least one of
a start time and an end time of the video segment 230'. According to the claimed invention,
the metadata 120 comprises information of a time duration of the video segment 230'.
The sequence of video frames could then in above step S106 be identified such that
it has a time duration equal to the time duration of the video segment.
[0042] The sequence of video frames could in step S106 be identified using a similarity
measure. Particularly, according to an embodiment the video network node 1200 is configured
to perform step S106a as part of step S106 in order to identify the sequence of video
frames:
S106a: The video network node 1200 identifies, in the first video part 210, a first
sequence of video frames that is similar to a second sequence of video frames in the
second video part 220. A condition for this first sequence of video frames is that
it has a time duration equal to the time duration of the video segment (as given by
the metadata 120).
[0043] As disclosed above, the common video segment 230, 240 could be identical to the video
segment 230'. Hence, since the common video segment 230, 240 is part of the first
video part 210, the first sequence of video frames as identified in step S106a could
be identical to the video segment 230'.
[0044] However, as also disclosed above, it could be that neither the first video part 210
nor the second video part 220 comprises the video segment 230'. In such scenarios the
first sequence of video frames as identified in step S106a could be adjacent the video
segment 230' or even further separated from the video segment 230', again assuming
that there is a known time difference between the sequence of video frames in the
first video part 210 and the video segment 230'.
[0045] There could be different ways to identify the first sequence of video frames in step
S106a. According to the claimed invention an image similarity measure is determined
for all combinations (or a subset thereof) of video frames between the first video
part 210 and the second video part 220. Hence, according to an embodiment the video
network node 1200 is configured to perform step S106b as part of step S106 in order
to identify the sequence of video frames:
S106b: The video network node 1200 determines that the first sequence of video frames
(as identified in step S106a) in the first video part 210 is similar to the second
sequence of video frames in the second video part 220 using an image similarity measure
between video frames in the first video part 210 and video frames in the second video
part 220.
[0046] There could be different examples of image similarity measures that could be applied
in the determination in step S106b. Either the image similarity measure is determined
using the video frames of the first video part 210 and the second video part 220 as
is, or the image similarity measure is determined using processed video frames of
the first video part 210 and the second video part 220. One way to process the video
frames is to subject the video frames to similarity hashing. According to an embodiment
the image similarity measure is thus determined using similarity hashes of video frames
in the first video part 210 and similarity hashes of video frames in the second video
part 220. There are different ways to determine the similarity hashes (that is, to
perform similarity hashing on the video frames). One type of similarity hashing is
perceptual hashing, in which perceptually similar images obtain similar hash values
with small distance between them. In general terms, perceptual hashing is the use
of an algorithm that produces a snippet, or fingerprint, of various forms of multimedia.
Perceptual hash functions produce analogous hashes if features are similar, whereas cryptographic
hashing relies on the avalanche effect, where a small change in input value creates a
drastic change in output value. Further aspects of the similarity hashing will be
described below with reference to Fig. 5.
[0047] Fig. 5 is a schematic illustration of similarity hashing according to an embodiment.
Inputs as defined by the first video part 210 and the second video part 220 are decoded
by a decoder 510 (possibly using down-sampling as in step S104 to reduce the frame
rate) to produce respective sequences of frames 520a, 520b (denoted Frames1 and Frames2
in Fig. 5). The video frames 520a, 520b are then subjected to similarity hashing 530,
producing respective image hashes 540a, 540b (denoted Hashes1 and Hashes2 in Fig.
5). Each frame is thus represented by its own image hash.
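Purely by way of example, one possible similarity hash of this kind is the difference hash (dHash) sketched below; it is not mandated by the embodiments, but it has the required property that perceptually similar frames obtain hash values with a small Hamming distance. The frame arrays correspond to the decoded Frames1 and Frames2 of Fig. 5, and all names are illustrative:

```python
import numpy as np
from PIL import Image

def dhash(frame: np.ndarray, hash_size: int = 8) -> np.ndarray:
    """Difference hash of one decoded video frame (H x W x 3, uint8).

    Each of the hash_size * hash_size bits encodes whether the brightness
    increases between two horizontally adjacent pixels of a small grayscale
    thumbnail, so perceptually similar frames yield similar bit patterns.
    """
    img = Image.fromarray(frame).convert("L")          # grayscale
    img = img.resize((hash_size + 1, hash_size))       # (hash_size+1) x hash_size
    pixels = np.asarray(img, dtype=np.int16)
    return (pixels[:, 1:] > pixels[:, :-1]).flatten()  # 64 booleans by default

# One hash per frame, as in Fig. 5 (Hashes1 and Hashes2):
# hashes1 = np.array([dhash(f) for f in frames1])
# hashes2 = np.array([dhash(f) for f in frames2])
```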
[0048] Every image hash of a frame of the first video part 210 could be compared with every
image hash of a frame of the second video part 220. Alternatively, only a selected
subset of the image hashes of the first video part 210 are compared to the same selected
subset of image hashes of the second video part 220. The higher the similarity measure,
the more similar two frames are. Denote by S(i,j) the image similarity score between
the i:th frame of the first video part 210 and the j:th frame of the second video
part 220. S(i,j) is determined by comparing the image hash of frame i with the image
hash of frame j using an appropriate distance measure (e.g. dot-product).
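Continuing the sketch above (and substituting bit agreement of the boolean hashes for the dot-product mentioned as one possible distance measure), S(i,j) and the full set of pairwise scores could be computed as:

```python
import numpy as np

def similarity_matrix(hashes1: np.ndarray, hashes2: np.ndarray) -> np.ndarray:
    """S[i, j] = similarity between frame i of the first video part and
    frame j of the second video part, as the fraction of agreeing hash
    bits (1.0 means identical hashes).

    hashes1: (N1, B) boolean array; hashes2: (N2, B) boolean array.
    The broadcast below builds an (N1, N2, B) array, which is adequate
    for the short extracted video parts considered here.
    """
    disagree = hashes1[:, None, :] != hashes2[None, :, :]   # (N1, N2, B)
    return 1.0 - disagree.mean(axis=2)                      # (N1, N2)

# S = similarity_matrix(hashes1, hashes2)
# row_max = S.max(axis=1)   # first vector 630a (maximum per row, see Fig. 6)
# col_max = S.max(axis=0)   # second vector 630b (maximum per column)
```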
[0049] Fig. 6 is a schematic illustration of a similarity matrix 600 according to an embodiment.
Fig. 6 shows the similarity matrix 600 which holds at position (i,j) the similarity
score S(i,j). In the illustrative example of Fig. 6, darker entries in the similarity
matrix 600 represent higher similarity score and lighter entries in the similarity
matrix 600 represent lower similarity score. The maximum entry per row in the similarity
matrix 600 can be stored in a first vector 630a for the first video part 210 and the
maximum entry per column in the similarity matrix 600 can be stored in a second vector
630b for the second video part 220. The similarity matrix 600 can be interpreted as
a heat-map. A search can be made for the diagonal 610 in the similarity matrix 600
with the maximum similarity score. The position of this diagonal 610 yields the time
value of step S106c (by dividing the number of frames skipped from the main diagonal
of the similarity matrix 600 in order to reach the diagonal 610 by the frame rate
of the first video part 210). Further aspects of searching for the diagonal 610 in
the similarity matrix 600 will be disclosed below with reference to Fig. 11.
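As one conceivable realisation of this diagonal search (simplified in that it scores each diagonal by its mean similarity and omits the filtering of isolated high values discussed below), the offset of the best diagonal can be converted into the time value as follows:

```python
import numpy as np

def best_diagonal_offset(S: np.ndarray, frame_rate: float):
    """Return (offset_in_frames, time_value_in_seconds, score) for the
    diagonal 610 of the similarity matrix with the highest mean similarity.

    A positive offset means the common content starts later in the second
    video part than in the first.
    """
    n1, n2 = S.shape
    min_len = min(n1, n2) // 2            # ignore short diagonals in the corners
    best_k, best_score = 0, -np.inf
    for k in range(-(n1 - 1), n2):        # every diagonal of S
        d = np.diagonal(S, offset=k)
        if d.size >= min_len and d.mean() > best_score:
            best_k, best_score = k, d.mean()
    return best_k, best_k / frame_rate, best_score

# k, time_value, score = best_diagonal_offset(S, frame_rate=25.0)
```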
[0050] The image similarity measure is maximized when the first sequence of video frames
and the second sequence of video frames match each other. Hence, according to an embodiment
the video network node 1200 is configured to perform step S106c as part of step S106:
S106c: The video network node 1200 determines, in relation to a first occurring frame
of the first video part 210, a time value that maximizes the image similarity measure.
The time offset to is then determined based on the time value.
[0051] If the common video segment 230, 240 is identical to the video segment 230', then
the time offset to is identical to the time value determined in step S106c. Otherwise,
the known time difference between the sequence of video frames in the first video
part 210 and the video segment 230' needs to be added to the time value determined
in step S106c to yield the time offset to.
[0052] The image similarity measure could in step S106b be determined to comprise a sequence
of image similarity values. It could be that the sequence of image similarity values
comprises isolated high image similarity values. Such isolated high image similarity
values could be removed from the image similarity measure when determining the time
value in step S106c. That is, isolated high values 620 in the similarity matrix 600
could be removed before searching for the diagonal 610, in order to reduce the possibility
of false positives. Thus, elements representing isolated high image similarity values
could be removed from the matrix when determining the time value.
[0053] The similarity matrix 600 does not necessarily need to be a square matrix; it will
be a rectangular (non-square) matrix in case the first video part 210 and the second
video part 220 do not result in the same number of image hashes (for example by the
first video part 210 and the second video part 220 not containing the same number
of frames).
[0054] In order to reduce the execution time of at least above steps S106 and S108 the first
video part 210 and/or the second video part 220 could be down-sampled before steps
S106 and S108 are performed. Hence, according to an embodiment, the video network
node 1200 is configured to perform step S104 before steps S106 and S108:
S104: The video network node 1200 down-samples at least one of the first video part
210 and the second video part 220 before identifying the sequence of video frames
in step S106.
[0055] Down-sampling generally refers to reducing the frame rate of the first video part
210 and/or the second video part 220, such as using only every k:th frame, where
k>1 is an integer, or any other subset of frames. However, this does not exclude that,
additionally or alternatively, the resolution of the individual frames could be reduced.
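Continuing the illustrative names from the earlier sketches, a down-sampling of the frame rate by an assumed factor k could be as simple as:

```python
# Keep only every k-th frame before hashing and matching (step S104).
k = 5                                    # assumed down-sampling factor, k > 1
frames1_ds = frames1[::k]
frames2_ds = frames2[::k]
effective_frame_rate = frame_rate / k    # use this rate in the diagonal search
```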
[0056] An approximation of the time offset to could then be found using the thus down-sampled
at least one of the first video part 210 and the second video part 220. Hence, steps
S104, S106, and S108 could be iteratively performed at least two times. That is, step
S106 of identifying the sequence of video frames could be repeated for a new first
video part and a new second video part. The new first video part and the new second
video part are determined based on the sequence of video frames identified using the
down-sampled at least one of the first video part and the second video part. For example,
the new first video part and the new second video part could be selected based on the
time value determined in step S106c that maximizes the image similarity measure. That
is, a first approximation of the time offset to could be found using a down-sampled
first video part 210 and a down-sampled second video part 220 in an initial search
window, and a second, refined, approximation of the time offset to could be found
using a down-sampled first video part 210 and a down-sampled second video part 220
in a refined search window, where the refined search window is selected based on the
time value determined in step S106c that maximizes the image similarity measure in
the initial search window.
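A sketch of such a two-pass, coarse-to-fine search is given below; extract() and match() are placeholders for the extraction (step S102), hashing, and diagonal search sketched above, and all names and the window placement are assumptions for illustration only:

```python
def coarse_to_fine_offset(extract, match, approx_start, approx_end,
                          delta_t, frame_rate, k_coarse=10):
    """First approximate the offset on heavily down-sampled parts over the
    full search window, then refine at full frame rate in a window placed
    around the coarse estimate."""
    # Pass 1: coarse search over the whole padded window.
    p1, p2 = extract(approx_start - delta_t, approx_end + delta_t, k_coarse)
    coarse = match(p1, p2, frame_rate / k_coarse)

    # Pass 2: refined search with a few coarse frames of slack around
    # the coarse estimate.
    margin = 2.0 * k_coarse / frame_rate
    p1, p2 = extract(approx_start + coarse - margin,
                     approx_end + coarse + margin, 1)
    return match(p1, p2, frame_rate)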
[0057] There could be different actions for the video network node 1200 to perform upon
having determined the time offset to in step S108.
[0058] According to some aspects the video network node 1200 removes at least part of the
video segment 230', for example to replace it with a new video segment. Hence, according
to an embodiment, the video network node 1200 is configured to perform step S110a:
S110a: The video network node 1200 removes at least part of the video segment 230'
from the video stream 200 using the end-point frame of the identified sequence of
video frames as reference.
[0059] It could be that the video network node 1200 removes the entire video segment 230',
or even that the video network node 1200 removes more than just the video segment
230', such as the video segment 230' and an adjacent video segment or the video segment
230' and another video segment separated from the video segment 230' by a known time
difference. This could be in a case where the video segment 230' is a first video
segment of a composite video segment, and, for example, where the first video part
210 comprises the composite video segment. The video network node 1200 could, for
example, be configured to analyze the manifest 170 for the video stream 200 that the
client node 140 requests, and to remove only the video segment corresponding to an
advertisement break, thus allowing the replacement of the one or more of the advertisements
of the advertisement break with a video segment corresponding to one or more new advertisements
in a precise, frame-accurate manner, even when the metadata 120 is inaccurate.
[0060] According to some aspects the video network node 1200 does not perform any manipulation
of the video stream 200, such as removal or replacement of the video segment 230',
but instead informs the manifest manipulator node 130 of the determined time offset
to (for the manifest manipulator 130 to perform such manipulation). Hence, according
to an embodiment, the video network node 1200 is configured to perform step S110b:
S110b: The video network node 1200 provides information of the time offset to to a
manifest manipulator node 130.
[0061] Further aspects of determining the time offset to for the video segment 230' of the
video stream 200 using the metadata 120 as performed by the video network node 1200
and applicable to any of the above embodiments will now be described.
[0062] Fig. 7 gives an illustrative example of metadata 120. In the illustrative example
metadata of Fig. 7 there are three advertisement breaks, denoted Ad-break1, Ad-break2,
and Ad-break3. Ad-break1 starts with advertisement Ad-3801 and ends with advertisement
Ad-3807; ad-break2 starts with advertisement Ad-3805 and ends with advertisement Ad-3811;
ad-break3 starts with advertisement Ad-3809 and ends with advertisement Ad-3810. As
can be seen in Fig. 7, Ad-break2 comprises a segment denoted Ad-3805 that occurs also
in Ad-break1. Ad-3805 in Ad-break2 is adjacent Ad-3808 which does not occur in Ad-break1.
By using embodiments disclosed herein pairs of ad-breaks could be found such that
the first advertisement of the first ad-break appears somewhere within the second
ad-break. For example, Ad-3805 is the first advertisement in ad-break2 and it appears
somewhere within ad-break1 (as its fifth advertisement), so the pair (ad-break2, ad-break1)
has this property that the first advertisement of the first ad-break in the pair appears
somewhere within the second ad-break of the pair. Also, Ad-3809 appears as the first
advertisement in ad-break3 and somewhere within ad-break2 (it is the third advertisement
in ad-break2) so (ad-break3, ad-break2) is also a pair of advertisement breaks which
has this property that the first advertisement of the first ad-break in the pair appears
somewhere within the second ad-break of the pair. Hence, the exact start time and/or
end time for Ad-3805 in Ad-break2 (or Ad-break1) could be found using embodiments
disclosed herein, and similarly for Ad-3810. This is illustrated in Fig. 8.
[0063] Fig. 8 is a schematic diagram illustrating a part 100' of the communications system
in Fig. 1. Fig. 8 schematically illustrates a video network node 1200 taking as input
the metadata 120 (only part of the metadata of Fig. 7 is shown) and the video stream
200 from the video streamer node 110, and producing as output to a database
810 an accurate start time and end time of Ad-break2. In the database, data representing
the identifier of Ad-break2, the start time of Ad-break2 as given by the metadata,
the end time of Ad-break2 as given by the metadata, the determined accurate start
time of Ad-break2 as determined by the video network node 1200, and the determined
accurate end time of Ad-break2 as determined by the video network node 1200 is stored.
[0064] Fig. 9 is a flowchart of a particular embodiment for determining the time offset
to for the video segment 230' of the video stream 200 using the metadata 120 as performed
by the video network node 1200 based on at least some of the above disclosed embodiments.
[0065] S201: The video network node 1200 receives a request from a client node 140 to playout
the video stream 200 starting at time t.
[0066] S202: The video network node 1200 checks if the time t is close to an advertisement
break. If no, step S203 is entered, and if yes, step S204 is entered.
[0067] S203: The video network node 1200 enables playout of the requested video stream 200
starting at time t at the client node 140.
[0068] S204: The video network node 1200 checks if t is already stored in a database of
fixed times (Already-Fixed-Times-DB). If no, step S205 is entered, and if yes, step
S207 is entered.
[0069] S205: The video network node 1200 determines an initial start time t' from the time
t and Δt (see above for a definition of Δt).
[0070] S206: The video network node 1200 determines the exact start and end time of the
advertisement break. The variable t' is fixed to represent the exact start time of
the advertisement break and stored in Already-Fixed-Times-DB together with t.
[0071] S207: The video network node 1200 retrieves the exact start time t' from the Already-Fixed-Times-DB
using t.
[0072] S208: The video network node 1200 enables playout of the requested video stream 200
from time t to time t' at the client node 140.
[0073] S209: The video network node 1200 replaces the original advertisement with a new
advertisement to be played out at the client node 140 starting at time t'.
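The caching of fixed start times in steps S204-S207 could, purely as an illustration, look as follows; a plain dictionary stands in for the Already-Fixed-Times-DB and all names are hypothetical:

```python
already_fixed_times = {}   # stand-in for the Already-Fixed-Times-DB

def exact_break_start(t: float, determine_exact_start) -> float:
    """Return the exact start time t' of the advertisement break near t.

    On a cache miss the exact time is determined and stored together with t
    (steps S205-S206); on a hit it is simply retrieved (step S207).
    """
    if t not in already_fixed_times:
        already_fixed_times[t] = determine_exact_start(t)
    return already_fixed_times[t]
```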
[0074] Fig. 10 is a flowchart of a particular embodiment for determining the time difference
based on at least some of the above disclosed embodiments.
[0075] S301: The video network node 1200 extracts a first video part (denoted video1) and
a second video part (denoted video2) from the video stream, each of which comprising
a common video segment 230, 240.
[0076] S302: The video network node 1200 checks if the first video part is shorter than
the second video part. If yes, step S303 is entered, and else step S304 is entered.
S303: The video network node 1200 swaps the designations of the first video part
and the second video part such that the first video part is longer than the second
video part.
[0078] S304: The video network node 1200 identifies the first seconds, Y_Preff, of the first
video part and denotes this part of the first video part as Prefix1.
[0079] S305: The video network node 1200 searches for Prefix1 in the second video part using
an image similarity measure, e.g., as described with reference to Fig. 6.
[0080] S306: The video network node 1200 checks if a matching part in the second video part
is found. If yes, step S307 is entered, and if no, step S308 is entered.
[0081] S307: The video network node 1200 outputs the time value that maximizes the image
similarity measure in step S305.
[0082] S308: The video network node 1200 identifies the last seconds, Y_Suff, of the first
video part and denotes this part of the first video part as Suffix1.
[0083] S309: The video network node 1200 searches for Suffix1 in the second video part using
an image similarity measure, e.g., as described with reference to Fig. 6.
[0084] S310: The video network node 1200 outputs the time value that maximizes the image
similarity measure in step S309.
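The flow of steps S302-S310 could be sketched as below, where the video parts are represented as frame lists, Y_Preff and Y_Suff are given as frame counts, and search() is assumed to return the time value maximizing the image similarity measure (or None when no sufficiently similar part is found); these representations are assumptions for illustration:

```python
def find_time_value(video1, video2, y_pref, y_suff, search):
    """Sketch of Fig. 10: match a prefix of the longer part against the
    other part, falling back to a suffix if the prefix is not found."""
    if len(video1) < len(video2):          # S302-S303: make video1 the longer
        video1, video2 = video2, video1
    prefix1 = video1[:y_pref]              # S304: the first Y_Preff frames
    t = search(prefix1, video2)            # S305
    if t is not None:                      # S306-S307
        return t
    suffix1 = video1[-y_suff:]             # S308: the last Y_Suff frames
    return search(suffix1, video2)         # S309-S310
```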
[0085] Fig. 11 is a flowchart of an embodiment for searching for the diagonal 610 in the
similarity matrix 600.
[0086] Let X represent the expected number of frames of the video segment 230'. Further,
assume that the video segment 230' has a time duration d as given by the metadata
120. Further, let r represent the frame rate. That is, the first video part 210 and
the second video part 220 are sampled to have a frame rate r. Then X = r · d. The
video segment 230' is expected to represent a common video segment 230, 240 with a
length of X frames in both the first video part 210 and the second video part 220.
S401: The video network node 1200 searches the first vector 630a for the next sequence
of consecutive entries of (approximately) length X with high similarities (i.e., a sequence
of length X whose total similarity score is above a threshold).
[0088] S402: The video network node 1200 searches for a diagonal 610 starting at the row
indicated by the first entry in the sequence found in step S401.
[0089] S403: The video network node 1200 checks if a diagonal 610 is found. If no, step
S404 is entered, and if yes, step S405 is entered.
[0090] S404: The video network node 1200 determines that the video segment 230' was not
found, and hence that no advertisement break was found. Step S401 is entered once
again.
[0091] S405: The video network node 1200 determines that the video segment 230' was found,
and hence that an advertisement break was found.
[0092] S406: The video network node 1200 outputs the start and stop times of the video segment
230'.
[0093] Although advertisements have been mentioned as an example where a video segment
(as defined by a single advertisement or an entire advertisement break) is to be replaced
or removed from a video stream, the herein disclosed embodiments are not limited to
the handling of advertisements; rather, the herein disclosed embodiments are applicable
to any example where a particular video segment is to be accurately identified in a
video stream.
[0094] Fig. 12 schematically illustrates, in terms of a number of functional units, the
components of a video network node 1200 according to an embodiment. Processing circuitry
1210 is provided using any combination of one or more of a suitable central processing
unit (CPU), multiprocessor, microcontroller, digital signal processor (DSP), etc.,
capable of executing software instructions stored in a computer program product 1410
(as in Fig. 14), e.g. in the form of a storage medium 1230. The processing circuitry
1210 may further be provided as at least one application specific integrated circuit
(ASIC), or field programmable gate array (FPGA).
[0095] Particularly, the processing circuitry 1210 is configured to cause the video network
node 1200 to perform a set of operations, or steps, S102-S110b, S201-S209, S301-S310,
S401-S406, as disclosed above. For example, the storage medium 1230 may store the
set of operations, and the processing circuitry 1210 may be configured to retrieve
the set of operations from the storage medium 1230 to cause the video network node
1200 to perform the set of operations. The set of operations may be provided as a
set of executable instructions.
[0096] Thus the processing circuitry 1210 is thereby arranged to execute methods as disclosed
herein. The storage medium 1230 may also comprise persistent storage, which, for example,
can be any single one or combination of magnetic memory, optical memory, solid state
memory or even remotely mounted memory. The video network node 1200 may further comprise
a communications interface 1220 at least configured for communications with other
entities and devices. As such the communications interface 1220 may comprise one or
more transmitters and receivers, comprising analogue and digital components. The processing
circuitry 1210 controls the general operation of the video network node 1200 e.g.
by sending data and control signals to the communications interface 1220 and the storage
medium 1230, by receiving data and reports from the communications interface 1220,
and by retrieving data and instructions from the storage medium 1230. Other components,
as well as the related functionality, of the video network node 1200 are omitted in
order not to obscure the concepts presented herein.
[0097] Fig. 13 schematically illustrates, in terms of a number of functional modules, the
components of a video network node 1200 according to an embodiment. The video network
node 1200 of Fig. 13 comprises a number of functional modules: an extract module 1210a
configured to perform step S102, an identify module 1210c configured to perform step
S106, and a determine module 1210g configured to perform step S108. The video network
node 1200 of Fig. 13 may further comprise a number of optional functional modules,
such as any of a down-sample module 1210b configured to perform step S104, an identify
module 1210d configured to perform step S106a, a determine module 1210e configured
to perform step S106b, a determine module 1210f configured to perform step S106c,
a remove module 1210h configured to perform step S110a, and a provide module 1210i
configured to perform step S110b. In general terms, each functional module 1210a-1210i
may in one embodiment be implemented only in hardware and in another embodiment with
the help of software, i.e., the latter embodiment having computer program instructions
stored on the storage medium 1230 which when run on the processing circuitry 1210
makes the video network node 1200 perform the corresponding steps mentioned above
in conjunction with Fig. 13. It should also be mentioned that even though the modules
correspond to parts of a computer program, they do not need to be separate modules
therein, but the way in which they are implemented in software is dependent on the
programming language used. Preferably, one or more or all functional modules 1210a-1210i
may be implemented by the processing circuitry 1210, possibly in cooperation
with the communications interface 1220 and/or the storage medium 1230. The processing
circuitry 1210 may thus be configured to fetch instructions from the storage medium 1230
as provided by a functional module 1210a-1210i and to execute these instructions,
thereby performing any steps as disclosed herein.
[0098] The video network node 1200 may be provided as a standalone device or as a part of
at least one further device. For example, the video network node 1200 may be provided
in the manifest manipulator node 130. Alternatively, functionality of the video network
node 1200 may be distributed between at least two devices, or nodes. These at least
two nodes, or devices, may either be part of the same network part or may be spread
between at least two such network parts.
[0099] Thus, a first portion of the instructions performed by the video network node 1200
may be executed in a first device, and a second portion of the instructions
performed by the video network node 1200 may be executed in a second device; the herein
disclosed embodiments are not limited to any particular number of devices on which
the instructions performed by the video network node 1200 may be executed. Hence,
the methods according to the herein disclosed embodiments are suitable to be performed
by a video network node 1200 residing in a cloud computational environment. Therefore,
although a single processing circuitry 1210 is illustrated in Fig. 12 the processing
circuitry 1210 may be distributed among a plurality of devices, or nodes. The same
applies to the functional modules 1210a-1210i of Fig. 13 and the computer program
1420 of Fig. 14 (see below).
[0100] Fig. 14 shows one example of a computer program product 1410 comprising computer
readable storage medium 1430. On this computer readable storage medium 1430, a computer
program 1420 can be stored, which computer program 1420 can cause the processing circuitry
1210 and thereto operatively coupled entities and devices, such as the communications
interface 1220 and the storage medium 1230, to execute methods according to embodiments
described herein. The computer program 1420 and/or computer program product 1410 may
thus provide means for performing any steps as herein disclosed.
[0101] In the example of Fig. 14, the computer program product 1410 is illustrated as an
optical disc, such as a CD (compact disc) or a DVD (digital versatile disc) or a Blu-Ray
disc. The computer program product 1410 could also be embodied as a memory, such as
a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only
memory (EPROM), or an electrically erasable programmable read-only memory (EEPROM)
and more particularly as a non-volatile storage medium of a device in an external
memory such as a USB (Universal Serial Bus) memory or a Flash memory, such as a compact
Flash memory. Thus, while the computer program 1420 is here schematically shown as
a track on the depicted optical disk, the computer program 1420 can be stored in any
way which is suitable for the computer program product 1410.
[0102] The inventive concept has mainly been described above with reference to a few embodiments.
However, as is readily appreciated by a person skilled in the art, other embodiments
than the ones disclosed above are equally possible within the scope of the inventive
concept, as defined by the appended patent claims.
1. A method for determining a time offset (to) for a video segment (230') of a video
stream (200) using metadata (120), the metadata (120) comprising time information
of at least one of a start time and an end time of the video segment (230'), the method
being performed by a video network node (1200), the method being
characterized in comprising:
extracting (S102) a first video part (210) and a second video part (220) from the
video stream (200), each of which comprising a common video segment (230, 240);
identifying (S106) a sequence of video frames in the first video part (210) that
represents the common video segment (230, 240), wherein identifying the sequence of
video frames comprises:
identifying (S106a), in the first video part, a first sequence of video frames that
is similar to a second sequence of video frames in the second video part (220), and
wherein the first sequence of video frames has a time duration equal to the time duration
of the video segment (230'), and
determining (S106b) that the first sequence of video frames is similar to the second
sequence of video frames in the second video part using an image similarity measure
between video frames in the first video part and video frames in the second video
part; and
determining (S108) the time offset (to) based on a time difference between an end-point
frame of the identified sequence of video frames and the time information in the metadata.
2. The method according to claim 1, wherein the metadata comprises information of time
duration of the video segment, and wherein the sequence of video frames is identified
such that it has a time duration equal to the time duration of the video segment.
3. The method according to claim 1 or 2, wherein the first sequence of video frames
is identical to the video segment (230'), or wherein the first sequence of video frames
is adjacent the video segment (230').
4. The method according to claim 1, wherein the image similarity measure is determined
using similarity hashes of video frames in the first video part and similarity hashes
of video frames in the second video part.
5. The method according to claim 1, further comprising:
determining (S106c), in relation to a first occurring frame of the first video part,
a time value that maximizes the image similarity measure, and wherein the time offset
(to) is determined based on the time value.
6. The method according to claim 5, wherein the image similarity measure comprises a
sequence of image similarity values, and wherein isolated high image similarity values
are removed from the image similarity measure when determining the time value.
7. The method according to any of the preceding claims, further comprising:
down-sampling (S104) at least one of the first video part and the second video part
before said identifying the sequence of video frames.
8. The method according to claim 7, wherein the step of identifying the sequence of video
frames is repeated for a new first video part and a new second video part, wherein
the new first video part and the new second video part are determined based on the
sequence of video frames identified using the down-sampled at least one of the first
video part and the second video part.
9. The method according to any of the preceding claims, wherein the end-point frame of
the sequence of video frames is a first occurring frame of the sequence of video frames,
and wherein the end-point frame constitutes the beginning of the video segment, or wherein
the end-point frame of the sequence of video frames is a last occurring frame of the
sequence of video frames, and wherein the end-point frame constitutes the ending of the
video segment.
10. The method according to any of the preceding claims, further comprising:
removing (S110a) at least part of the video segment (230') from the video stream (200)
using the end-point frame of the identified sequence of video frames as reference.
11. The method according to any of claims 1 to 10, further comprising:
providing (S110b) information of the time offset (to) to a manifest manipulator node
(130).
12. The method according to any of the preceding claims, wherein the video segment is
a first video segment of a composite video segment, and wherein the first video part
comprises the composite video segment.
13. A video network node (1200) for determining a time offset (to) for a video segment
(230') of a video stream (200) using metadata (120), the metadata (120) comprising
time information of at least one of a start time and an end time of the video segment
(230'), the video network node (1200) comprising processing circuitry (1210), the
processing circuitry being
characterized in being configured to cause the video network node (1200) to:
extract a first video part (210) and a second video part (220) from the video stream
(200), each of which comprising a common video segment (230, 240);
identify a sequence of video frames in the first video part (210) that represents
the common video segment (230, 240), wherein identifying the sequence of video frames
comprises:
identifying, in the first video part (210), a first sequence of video frames that
is similar to a second sequence of video frames in the second video part (220), and
wherein the first sequence of video frames has a time duration equal to the time duration
of the video segment; and
determining that the first sequence of video frames is similar to the second sequence
of video frames in the second video part using an image similarity measure between
video frames in the first video part and video frames in the second video part; and
determine the time offset (to) based on a time difference between an end-point frame
of the identified sequence of video frames and the time information in the metadata.
14. A computer program (1420) for determining a time offset (to) for a video segment (230')
of a video stream (200) using metadata (120), the metadata (120) comprising time information
of at least one of a start time and an end time of the video segment (230'), the computer
program comprising computer code which, when run on processing circuitry (1210) of
a video network node (1200), causes the video network node (1200) to:
extract (S102) a first video part (210) and a second video part (220) from the video
stream (200), each of which comprises a common video segment (230, 240);
identify a sequence of video frames in the first video part (210) that represents
the common video segment (230, 240), wherein identifying the sequence of video frames
comprises:
identifying, in the first video part (210), a first sequence of video frames that
is similar to a second sequence of video frames in the second video part (220), and
wherein the first sequence of video frames has a time duration equal to the time duration
of the video segment; and
determining that the first sequence of video frames is similar to the second sequence
of video frames in the second video part using an image similarity measure between
video frames in the first video part and video frames in the second video part; and
determine (S108) the time offset (to) based on a time difference between an end-point
frame of the identified sequence of video frames and the time information in the metadata.
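For the similarity hashes recited in claim 4, one possible realization is a per-frame average hash compared via Hamming distance. The following is a minimal sketch in Python, assuming frames are available as 8-bit grayscale numpy arrays; the 8x8 hash size, the block-mean downscale, and the normalized Hamming similarity are illustrative assumptions, not requirements of the claims.

```python
# Illustrative per-frame similarity hash (average hash) and a similarity
# score between two hashes. Assumes grayscale frames as numpy arrays.
import numpy as np

def average_hash(frame: np.ndarray, hash_size: int = 8) -> int:
    """Downscale the frame to hash_size x hash_size block means and
    threshold at the mean, yielding a 64-bit hash for hash_size = 8."""
    h, w = frame.shape
    frame = frame[:h - h % hash_size, :w - w % hash_size]  # trim to a multiple
    blocks = frame.reshape(hash_size, frame.shape[0] // hash_size,
                           hash_size, frame.shape[1] // hash_size).mean(axis=(1, 3))
    bits = (blocks > blocks.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def hash_similarity(h1: int, h2: int, bits: int = 64) -> float:
    """Similarity in [0, 1]: one minus the normalized Hamming distance."""
    return 1.0 - bin(h1 ^ h2).count("1") / bits
```

Hashes of this kind tolerate small encoding differences between the two video parts, which is why a hash comparison can succeed where an exact byte-level comparison of frames would fail.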
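Claims 5 and 6 can then be read as a one-dimensional search: slide the second video part's hash sequence along the first video part, aggregate the per-frame similarities into one value per candidate time value, suppress isolated high values, and take the maximum. A sketch, assuming both parts were already hashed at a common frame rate fps and reusing hash_similarity from the previous sketch; the 3-tap median filter is one possible way of removing isolated high similarity values:

```python
def median3(values):
    """3-tap median filter; endpoints are kept as-is. A single spuriously
    high sample is replaced by a neighbor, so isolated peaks cannot win."""
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = sorted(values[i - 1:i + 2])[1]
    return out

def best_time_value(hashes1, hashes2, fps):
    """Time value in seconds, relative to the first occurring frame of the
    first video part, that maximizes the image similarity measure."""
    n = len(hashes2)
    scores = []
    for shift in range(len(hashes1) - n + 1):
        window = hashes1[shift:shift + n]
        scores.append(sum(hash_similarity(a, b)
                          for a, b in zip(window, hashes2)) / n)
    # The measure is a sequence of image similarity values, one per shift;
    # remove isolated high values before locating the maximum (claim 6).
    filtered = median3(scores)
    return filtered.index(max(filtered)) / fps
```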
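The down-sampling of claim 7 and the repetition of claim 8 combine naturally into a coarse-to-fine search: a first pass over temporally down-sampled hash sequences locates the match approximately and cheaply, after which new, much shorter first and second video parts are cut around that match and searched at full frame rate. A sketch under the same assumptions as above; the down-sampling factor of 10 and the one-coarse-step margin are illustrative choices:

```python
def coarse_to_fine(hashes1, hashes2, fps, step=10):
    """Two-pass search: down-sampled first (S104), then repeated at full
    rate on new video parts determined by the coarse match."""
    # Pass 1: keep every step-th frame hash of both parts (claim 7).
    coarse_s = best_time_value(hashes1[::step], hashes2[::step], fps / step)
    # Pass 2: the new first video part is a slice of the original around the
    # coarse match, padded by one coarse step on each side (claim 8).
    start = max(int(coarse_s * fps) - step, 0)
    end = min(start + len(hashes2) + 2 * step, len(hashes1))
    fine_s = best_time_value(hashes1[start:end], hashes2, fps)
    return start / fps + fine_s
```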
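Finally, the offset determination (S108) reduces to a single subtraction once the end-point frame is fixed: per claim 9, the end-point frame is either the first occurring frame of the identified sequence (marking the beginning of the video segment) or its last occurring frame (marking its ending). A sketch with purely illustrative variable names:

```python
def determine_time_offset(match_start_s, segment_duration_s,
                          metadata_start_s=None, metadata_end_s=None):
    """Time offset (to) between where the segment actually lies in the
    stream and where the metadata says it lies. Exactly one of the two
    metadata times is expected to be given."""
    if metadata_start_s is not None:
        # End-point frame = first occurring frame of the identified sequence.
        return match_start_s - metadata_start_s
    # End-point frame = last occurring frame of the identified sequence.
    return (match_start_s + segment_duration_s) - metadata_end_s

# Example: the metadata logs the segment at 120.0 s, but the identified
# sequence actually begins at 121.5 s -> offset of +1.5 s, by which every
# metadata time must be shifted before the segment is removed or
# replaced (S110a).
print(determine_time_offset(121.5, 30.0, metadata_start_s=120.0))  # 1.5
```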