Background
[0001] This relates generally to scalable video codecs.
[0002] Scalable video codecs enable different picture quality levels to be delivered to
different customers, depending on what type of service they prefer. Lower quality
video services may be less expensive than higher quality video services.
[0003] In a scalable video coder, video coded at a lower bit depth may be called the baseline layer and
video coded at a higher bit depth may be called the enhancement layer. The greater the bit depth, the
better the quality of the video.
[0004] In a scalable video codec, an encoder and decoder may be provided as one unit. In
some cases, only an encoder may be provided and, in other cases, only a decoder may
be provided. The scalable video coder enables the system to operate with at least
the baseline layer. Thus, in low cost systems, only the baseline layer may be utilized
and, in higher cost, more advanced systems, the enhancement layer may be utilized.
[0005] It is advantageous to derive the enhancement layer from the baseline layer. To this
end, inverse tone mapping may be utilized to increase the bit depth of the baseline
layer to the bit depth of the enhancement layer. In some cases, for example, the baseline
layer may be 8 bits per pixel and the enhancement layer may be 10, 12, or more bits per
pixel.
Brief Description of the Drawings
[0006]
Figure 1 is a schematic depiction of an encoder and decoder system in accordance with
one embodiment of the present invention;
Figure 2 is a depiction of an encoder and decoder system in accordance with another
embodiment of the present invention; and
Figure 3 is a system depiction for still another embodiment of the present invention.
Detailed Description
[0007] Referring to Figure 1, a scalable video codec includes an encoder 10 that communicates
with a decoder 12 over a video transmission channel or a video storage 14. Figure 1 shows
the encoder from one codec together with the decoder from another codec.
[0008] As an example, one computer may communicate over a network with another computer.
Each computer may have a codec, which includes both an encoder and a decoder, so that
information may be encoded at one node, transmitted over the network to the other
node, and then decoded there.
[0009] The codec shown in Figure 1 is a scalable video codec (SVC). This means that it is
capable of encoding and/or decoding information with different bit depths. Video sources
16 and 26 may be connected to the encoder 10. The video source 16 may use N-bit video
data, while the video source 26 may provide M-bit video data, where the bit depth
M is greater than the bit depth N. In other embodiments, more than two sources with
more than two bit depths may be provided.
[0010] In each case, the information from a video source is provided to an encoder. In the
case of the video source 16, of lower bit depth, the information is provided to a
baseline encoder 18. In the case of the video source 26, of higher bit depth, an enhancement
layer encoder 28 is utilized.
[0011] In addition, baseline decoded information at B from the baseline encoder 18 is inverse
tone mapped to increase its bit depth to M bits for use in enhancement layer encoding.
Thus, the decoded N-bit video is provided, in one embodiment, to an inverse tone mapping
unit 20. The inverse tone mapping unit 20 increases the bit depth and produces an M-bit
output to the enhancement layer encoder 28. The decoded stream B is also provided to
the tone mapping derivation 24. The tone mapping derivation 24 also receives information
from the M-bit video source 26. The output of the tone mapping derivation 24 is used
for the inverse tone mapping 20.
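By way of illustration only, the per-pixel application of such a mapping might be sketched as follows, assuming NumPy; the function name and the linear example table are invented for illustration, since the actual table is derived as described below:

```python
import numpy as np

def inverse_tone_map(frame_n_bit: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """Map an N-bit frame to M-bit values through a look-up table.

    frame_n_bit: N-bit pixel values in 0 .. 2**N - 1.
    lut: table of length 2**N whose entry j is the M-bit value
         predicted for N-bit value j.
    """
    # Indexing with the pixel array applies the table to every pixel.
    return lut[frame_n_bit]

# Illustrative linear table mapping 8-bit video to 10 bits.
lut = (np.arange(256) * 4).astype(np.uint16)
frame = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)
m_bit_prediction = inverse_tone_map(frame, lut)
```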
[0012] At the same time, the encoded output at A from the encoder 18 is output to the video
transmission or storage 14.
[0013] As a result of the use of the decoded stream B for the tone mapping derivation 24, the
coding residual in the enhancement layer encoder 28 may be reduced, improving coding
efficiency in some cases because of a better prediction in the encoder 28.
[0014] The encoder of Figure 1 may be consistent with the H.264 compression standard (advanced
video codec (AVC) and MPEG-4 Part 10), for example. The H.264 standard was
prepared by the Joint Video Team (JVT), which includes ITU-T SG16 Q.6, also known
as VCEG (Video Coding Experts Group), and ISO/IEC JTC1/SC29/WG11 (2003), known
as MPEG (Moving Picture Experts Group). H.264 is designed for applications in the area
of digital TV broadcast, direct broadcast satellite video, digital subscriber line
video, interactive storage media, multimedia messaging, digital terrestrial TV broadcast,
and remote video surveillance, to mention a few examples.
[0015] While one embodiment may be consistent with H.264 video coding, the present invention
is not so limited. Instead, embodiments may be used in a variety of video compression
systems including MPEG-2 (ISO/IEC 13818-1 (2000), available from the International
Organization for Standardization, Geneva, Switzerland) and VC1 (SMPTE 421M (2006),
available from SMPTE, White Plains, NY 10601).
[0016] The encoder provides information over the video transmission or video storage 14
for use by a decoder. The information provided may include the baseline
layer (BL) video stream, the inverse tone mapping (ITM) information, the filter taps
from the content analysis and filtering 42 (Figure 2), and the enhancement layer (EL) video stream. Some
of this information may be included in a packet header. For example, the inverse tone
mapping (ITM) information and the filter tap information may be provided in an appropriate
header in a packetized data transmission.
[0017] Upon receipt of the appropriate information in the decoder 12, the baseline decoder
30 decodes the information for N-bit video display by the display 32. However, if,
instead, enhancement layer equipment is provided, a higher bit depth display 40 may
be used. (Generally, both displays would not be included.) The baseline decoder
output, which is N bits, is converted to M-bit video using the inverse tone mapping unit
34, which also receives the ITM information describing the inverse tone mapping that was done
in the encoder 10.
[0018] The video decoder is self-deriving in that the information used to encode is also
available to the decoder. This same information can be accessed by the decoder to decode the
encoded stream without seeking that information from the encoder.
[0019] In general, any type of tone mapping may be utilized to increase the bit depth
of the baseline layer video, including inverse block-based scaling and inverse piecewise
linear mapping.
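As one illustration of inverse piecewise linear mapping, a look-up table may be built by linear interpolation between a few control points; a sketch assuming NumPy, with knot positions invented purely for illustration:

```python
import numpy as np

def piecewise_linear_lut(knots_n_bit, knots_m_bit, n_bits):
    """Build an N-bit -> M-bit LUT by piecewise linear interpolation
    between (N-bit value, M-bit value) control points."""
    x = np.arange(2 ** n_bits)
    return np.rint(np.interp(x, knots_n_bit, knots_m_bit)).astype(np.int64)

# Illustrative 8-bit -> 10-bit mapping that brightens the mid-range.
lut = piecewise_linear_lut([0, 128, 255], [0, 600, 1023], 8)
```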
[0020] The tone mapping derivation 24 in Figure 1 finds the relationship between the higher
and lower bit depth video. Usually, the mapping relationship is derived from statistical
features of the original high bit depth video and the original lower bit depth video
at the encoder side.
[0021] A look-up table (LUT) is built using a pixel x of the lower bit depth N input and
the co-located pixel y of the higher bit depth M input. By "co-located" it is intended
to refer to a pixel in the same location in the two pictures from the sources 16
and 26.
[0022] For every pixel x_i in the lower bit depth input and the co-located pixel y_i in the
higher bit depth input, let, for j = x_i,

sum_j = sum_j + y_i and num_j = num_j + 1;

then the j-th entry of the look-up table is LUT[j] = sum_j / num_j.
[0023] If num_j == 0, then LUT[j] is the weighted average of LUT[j-] and LUT[j+], where
j- and j+, if available, are the closest non-zero neighbors to the j-th entry.
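By way of illustration only, the derivation of paragraphs [0021] to [0023] might be sketched as follows, assuming NumPy; the inverse-distance weighting used to fill empty entries is one plausible reading of the "weighted average," and the function name is invented. The same sketch applies to paragraphs [0024] and [0025], with the decoded pixels z_i passed in place of the input pixels x_i:

```python
import numpy as np

def derive_lut(low: np.ndarray, high: np.ndarray, n_bits: int) -> np.ndarray:
    """Build LUT[j] = sum_j / num_j from co-located pixel pairs.

    low:  lower bit depth pixels x_i (or decoded pixels z_i).
    high: co-located higher bit depth pixels y_i.
    """
    size = 2 ** n_bits
    sums = np.zeros(size)
    nums = np.zeros(size)
    # Accumulate sum_j and num_j over all co-located pairs (j = x_i).
    np.add.at(sums, low.ravel(), high.ravel().astype(float))
    np.add.at(nums, low.ravel(), 1.0)

    lut = np.zeros(size)
    filled = nums > 0
    lut[filled] = sums[filled] / nums[filled]

    # Fill empty entries from the closest populated neighbors j- and j+,
    # weighting each by inverse distance to j (an assumed weighting).
    populated = np.flatnonzero(filled)
    for j in np.flatnonzero(~filled):
        below = populated[populated < j]
        above = populated[populated > j]
        if below.size and above.size:
            jm, jp = below[-1], above[0]
            wm, wp = 1.0 / (j - jm), 1.0 / (jp - j)
            lut[j] = (wm * lut[jm] + wp * lut[jp]) / (wm + wp)
        elif below.size:
            lut[j] = lut[below[-1]]
        elif above.size:
            lut[j] = lut[above[0]]
    return lut
```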
[0024] Instead of using the input pixel of the lower bit depth source 16, the decoded output
pixel from the baseline encoder 18 may be used with the input of the higher bit depth
source 26 to derive the mapping LUT. The pixel z is the decoded lower bit depth
N output and the co-located input pixel y is the higher bit depth M input. For every
pixel z_i in the lower bit depth decoded output and the co-located pixel y_i in the
higher bit depth input, let, for j = z_i,

sum_j = sum_j + y_i and num_j = num_j + 1;

then the j-th entry of the look-up table is LUT[j] = sum_j / num_j.
[0025] If num_j == 0, then LUT[j] is the weighted average of LUT[j-] and LUT[j+], where
j- and j+, if available, are the closest non-zero neighbors to the j-th entry.
[0026] In Figure 2, a content adaptive technique, using content analysis and filtering 42,
derives the tone mapping LUT. The pixel z is the decoded lower bit depth N output
and the co-located input pixel y is from the higher bit depth M input. If there are no edge
pixels in the neighborhood surrounding the target pixel z, then the target pixel
z may be replaced with the filtered pixel f to derive the tone mapping LUT.
[0027] For every pixel z_i in the lower bit depth decoded output and the co-located pixel
y_i in the higher bit depth input, if there is no edge pixel in the neighborhood of z_i,
let f_i be the filtered value of z_i and, for j = f_i,

sum_j = sum_j + y_i and num_j = num_j + 1.

[0028] The j-th entry of the look-up table is LUT[j] = sum_j / num_j.

[0029] If num_j == 0, then LUT[j] is the weighted average of LUT[j-] and LUT[j+], where
j- and j+, if available, are the closest non-zero neighbors to the j-th entry.
[0030] The Sobel edge operator is used for the content analysis and filtering 42 in one
embodiment. Given the target pixel z, the four directional Sobel kernels are:

E_h = [-1 -2 -1; 0 0 0; 1 2 1]
E_v = [-1 0 1; -2 0 2; -1 0 1]
E_P45 = [-2 -1 0; -1 0 1; 0 1 2]
E_N45 = [0 -1 -2; 1 0 -1; 2 1 0]

[0031] The edge metric (EM) for the target pixel z is formulated as the convolution of the
weightings above with its 3x3 neighborhood, NH9(z), as:

EM(z) = |NH9(z)*E_h| + |NH9(z)*E_v| + |NH9(z)*E_P45| + |NH9(z)*E_N45|
[0032] The use of two directions, E_v and E_h, may be sufficient for many applications.
The detection at 45 degrees further improves the edge detection, but with more computational
complexity.
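By way of illustration only, this edge metric might be sketched as follows, assuming NumPy and SciPy; the threshold that decides when a pixel counts as an edge is invented, as the text does not specify one:

```python
import numpy as np
from scipy.ndimage import correlate

# The four directional Sobel kernels named in the text.
E_h = np.array([[-1, -2, -1],
                [ 0,  0,  0],
                [ 1,  2,  1]], dtype=float)
E_v = E_h.T
E_P45 = np.array([[-2, -1, 0],
                  [-1,  0, 1],
                  [ 0,  1, 2]], dtype=float)
E_N45 = np.array([[0, -1, -2],
                  [1,  0, -1],
                  [2,  1,  0]], dtype=float)

def edge_metric(frame: np.ndarray) -> np.ndarray:
    """EM(z) = |NH9(z)*E_h| + |NH9(z)*E_v| + |NH9(z)*E_P45| + |NH9(z)*E_N45|.

    Correlation is used here; under the absolute value it is
    equivalent to the convolution written in the text.
    """
    f = frame.astype(float)
    return sum(np.abs(correlate(f, k, mode='nearest'))
               for k in (E_h, E_v, E_P45, E_N45))

def is_edge(frame: np.ndarray, threshold: float = 64.0) -> np.ndarray:
    """Classify edge pixels; the threshold value is illustrative only."""
    return edge_metric(frame) > threshold

# Example: mark edge pixels in a decoded 8-bit frame.
frame = np.random.randint(0, 256, size=(16, 16))
edges = is_edge(frame)
```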
[0033] Other content analysis techniques may be used for edge detection such as the Canny
algorithm and the derivative-based algorithm.
[0034] In Figure 2, the target pixel is filtered with the filter support coming from the
neighborhood pixels. A linear filter or an average filter may be used with the edge
detector in some embodiments.
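Putting paragraphs [0026] to [0034] together, the content adaptive derivation might be sketched as below, assuming NumPy/SciPy and an edge map such as the one produced by the is_edge() helper in the previous sketch; the 3x3 average filter, the rounding of the filtered value, and the exact neighborhood test are illustrative choices, not dictated by the text:

```python
import numpy as np
from scipy.ndimage import uniform_filter, maximum_filter

def derive_lut_content_adaptive(decoded_low: np.ndarray,
                                high: np.ndarray,
                                edge_map: np.ndarray,
                                n_bits: int) -> np.ndarray:
    """Derive the LUT from filtered, non-edge pixels.

    decoded_low: decoded N-bit baseline output z.
    high:        co-located M-bit input y.
    edge_map:    boolean edge map, e.g. is_edge(decoded_low).
    """
    # 3x3 average filter over the decoded output (an average filter
    # is one of the filters suggested in the text).
    f = uniform_filter(decoded_low.astype(float), size=3)

    # Keep only pixels with no edge pixel in their 3x3 neighborhood.
    edge_nearby = maximum_filter(edge_map.astype(np.uint8), size=3) > 0
    keep = ~edge_nearby

    # Accumulate sum_j and num_j, indexed by the rounded filtered value f.
    size = 2 ** n_bits
    j = np.clip(np.rint(f[keep]), 0, size - 1).astype(np.int64)
    sums = np.zeros(size)
    nums = np.zeros(size)
    np.add.at(sums, j, high[keep].astype(float))
    np.add.at(nums, j, 1.0)

    lut = np.zeros(size)
    filled = nums > 0
    lut[filled] = sums[filled] / nums[filled]
    # Empty entries would be filled from the nearest populated
    # neighbors, as in the basic derivation sketch earlier.
    return lut
```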
[0035] The definition of the neighborhood may be naturally aligned with the block sizes
specified in popular video coding standards such as SVC and H.264. The
block size can be 4x4, 8x4, 4x8, or 8x8, as examples. With this alignment, the tone
mapping derivation 24 is content adaptive and block-based. While a 3x3 neighborhood
may be used, other neighborhoods may also be used.
[0036] The tone mapping tables may be derived separately for the luma and chroma channels.
The luma LUT may be used for the mapping of luma pixels and the chroma LUT for the
mapping of chroma pixels. In some scenarios, one chroma table is shared
by both the Cb and Cr channels; alternatively, two individual tables may be used for Cb and
Cr, respectively.
[0037] In some embodiments, the tone mapping relationship is used to predict the pixels
of the higher bit depth by the use of the decoded pixels of lower bit depth and the
co-located input pixels of higher bit depth. Through the use of the decoded pixels
of the lower bit depth, instead of the input pixels of the lower bit depth, the coding
residual is reduced and better coding efficiency is achieved in some embodiments.
[0038] Content adaptive techniques utilize the neighboring pixels to produce a filtered
pixel as a substitute for the non-filtered decoded pixel in deriving the tone mapping
relationship. With the use of the neighborhood analysis, pixels on or near an edge
are eliminated in order to produce smoother pixels and to better predict the higher
bit depth pixels in some embodiments. Thus, better coding efficiency is achieved
in some cases. Because of the resulting self-derivation at the video decoder side,
no extra overhead needs to be transmitted from the video encoder side to the video decoder
side in some embodiments.
[0039] Referring to Figure 3, the encoders and decoders depicted in Figures 1 and 2 may,
in one embodiment, be part of a graphics processor 112. In some embodiments, the encoders
and decoders shown in Figures 1 and 2 may be implemented in hardware and, in other
embodiments, they may be implemented in software or firmware. In the case of a software
implementation, the pertinent code may be stored in any suitable semiconductor, magnetic
or optical memory, including the main memory 132. Thus, in one embodiment, source
code 139 may be stored in a machine readable medium, such as main memory 132, for
execution by a processor, such as the processor 100 or the graphics processor 112.
[0040] A computer system 130 may include a hard drive 134 and a removable medium 136, coupled
by a bus 104 to a chipset core logic 110. The core logic may couple to the graphics
processor 112 (via a bus 105) and to the main processor 100 in one embodiment. The graphics
processor 112 may also be coupled by a bus 106 to a frame buffer 114. The frame buffer
114 may be coupled by a bus 107 to a display screen 118, in turn coupled by a bus 108
to conventional components, such as a keyboard or mouse 120.
[0041] The blocks indicated in Figures 1 and 2 may constitute hardware or software components.
In the case of software components, the figures may indicate a sequence of instructions
that may be stored in a computer readable medium such as a semiconductor integrated
circuit memory, an optical storage device, or a magnetic storage device. In such case,
the instructions are executable by a computer or processor-based system that retrieves
the instructions from the storage and executes them. In some cases, the instructions
may be firmware, which may be stored in an appropriate storage medium. One result
of the execution of such instructions is the improvement of quality of pictures that
are ultimately displayed on a display screen.
[0042] References throughout this specification to "one embodiment" or "an embodiment" mean
that a particular feature, structure, or characteristic described in connection with
the embodiment is included in at least one implementation encompassed within the present
invention. Thus, appearances of the phrase "one embodiment" or "in an embodiment"
are not necessarily referring to the same embodiment. Furthermore, the particular
features, structures, or characteristics may be instituted in other suitable forms
other than the particular embodiment illustrated and all such forms may be encompassed
within the claims of the present application.
[0043] While the present invention has been described with respect to a limited number of
embodiments, those skilled in the art will appreciate numerous modifications and variations
therefrom. It is intended that the appended claims cover all such modifications and
variations as fall within the true spirit and scope of this present invention.
1. A method comprising:
using decoded, lower bit depth video for inverse tone mapping for higher bit depth
encoding; and
using an analysis of neighboring pixels for inverse tone mapping.
2. The method of claim 1 including increasing the bit depth of encoded baseline layer
video information.
3. The method of claim 2 including providing said increased bit depth video information
to an enhancement layer encoder.
4. The method of claim 1 including using self-deriving decoding.
5. The method of claim 1 including using a decoded, lower bit depth video for tone mapping
derivation.
6. The method of claim 1 including using a decoded output of a baseline layer encoder
for inverse tone mapping.
7. The method of claim 1 including using co-located pixels in lower and higher bit depth
video for tone mapping derivation.
8. The method of claim 1 including using filtered pixels in lower bit depth video for
tone mapping derivation.
9. The method of claim 1 including filtering the decoded, lower bit depth video before
inverse tone mapping.
11. The method of claim 1 including developing a tone mapping look-up table with said
neighboring pixels and co-located pixels in said lower and higher bit depth video.
12. An apparatus comprising:
a lower bit depth encoder having encoded and decoded video outputs;
a device to increase the bit depth of encoded video information using video from said
decoded video output; and
an inverse tone mapping unit to analyze neighboring pixels.
13. The apparatus of claim 12 wherein said device includes an inverse tone mapping unit.
14. The apparatus of claim 12 wherein said apparatus is an encoder.
15. The apparatus of claim 12 wherein said apparatus includes a decoder.
16. The apparatus of claim 15 wherein said decoder is self-deriving.
17. The apparatus of claim 12 including a baseline encoder coupled to said device.
18. The apparatus of claim 13 including a filter to filter said decoded video output.
19. The apparatus of claim 18 including an enhancement layer encoder coupled to said
filter.
20. The apparatus of claim 19 including an inverse tone mapping unit and a tone mapper
derivation, wherein an output of said filter is coupled to the inverse tone mapping
unit and the tone mapper derivation.
21. A video encoder comprising the apparatus of any of claims 12 to 20.
22. A video decoder comprising the apparatus of any of claims 12 to 20.
23. A video encoder adapted to carry out the method of any of claims 1 to 11.