
Video Coding with Semantic Image Analysis and Synthesis


 
Goal: Improve coding efficiency in video coding through texture analysis at the encoder side and texture synthesis at the decoder side, integrating semantic hints.
A new content-based approach for improved H.264/AVC video coding is presented. The framework is generic because it is based on a closed-loop texture analysis-by-synthesis algorithm that can automatically identify video quality impairments through artifact detectors and recover from them through appropriate countermeasures. The algorithm is also flexible, as it can in principle be integrated into any standards-compliant video codec. The fundamental assumption of our approach is that many video scenes can be classified into subjectively relevant and irrelevant textures. The texture categorization is performed by a texture analyzer at the encoder side, while the corresponding texture synthesizer at the decoder side replaces the subjectively irrelevant textures, given the side information generated by the texture analyzer. When the proposed approach is integrated into an H.264/AVC codec, bit rate savings of up to 33.3% are achieved compared to an H.264/AVC video codec without our approach.
Structure of the Semantic Coding Approach  
In this work, we have developed the closed-loop analysis-synthesis algorithm depicted in Fig. 1. The incoming video sequence is divided into overlapping groups of pictures (GoPs). The first GoP begins with the first I picture of the sequence and ends with the first P picture; between this I and P picture are B pictures. For example, when 3 B pictures are used, the first GoP has the structure IBBBP1 in temporal order. The second GoP begins with the last picture of the first GoP (the P1 picture) and ends with the next P picture; in our example, it has the structure P1BBBP2. I and P pictures are key pictures and are coded using MSE distortion and an H.264/AVC encoder. B pictures (between the key pictures) are candidates for a possible partial texture synthesis and are otherwise also coded using MSE distortion and H.264/AVC.
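
To make the GoP partitioning concrete, here is a minimal Python sketch (a hypothetical helper, not part of the codec) that splits a sequence of frame indices into overlapping GoPs as described above:

```python
def split_into_gops(num_frames, num_b=3):
    """Return GoPs as lists of frame indices; consecutive GoPs share
    their boundary key picture (I or P)."""
    gop_len = num_b + 2          # key picture + B pictures + key picture
    gops = []
    start = 0
    while start + gop_len - 1 < num_frames:
        gops.append(list(range(start, start + gop_len)))
        start += gop_len - 1     # the last key picture starts the next GoP
    return gops

# Example: 9 frames with 3 B pictures -> [[0, 1, 2, 3, 4], [4, 5, 6, 7, 8]],
# i.e. I B B B P1 and P1 B B B P2 in temporal order.
print(split_into_gops(9))
```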
 
Each GoP is analyzed by the texture analyzer (TA) and synthesized by the texture synthesizer (TS), given the (quantized) side information generated by the TA. The synthesized GoP is then submitted to the video quality assessment unit (VQA) for detection of possible spatial or temporal impairments in the reconstructed video.

Fig. 1: Principle of the closed-loop analysis-synthesis video coding approach


In subsequent iterations, the degrees of freedom of the system are explored by a state machine (SM) in search of even better side information. Once all relevant system states have been visited for the given input GoP, a rate-distortion decision is made and the optimized side information is transmitted to the decoder. Detail-irrelevant textures for which no rate-distortion gains can be achieved are coded by the reference codec, which acts as a fallback coding solution. Furthermore, the GoP structure used in our framework prevents infinite error propagation, as the key pictures are coded based on MSE.
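
The loop can be summarized in a toy, runnable Python sketch. Every component below (the candidate "states", the TA/TS models, the VQA threshold, the rate model) is a made-up stand-in; the real system operates on pictures and H.264/AVC bitstreams:

```python
def encode_gop(gop, states=(0.25, 0.5, 1.0), lagrange=0.1, vqa_limit=10.0):
    """Search all SM states for the side information with the lowest
    rate-distortion cost; return None to signal the fallback to the
    reference codec."""
    best_state, best_cost = None, float("inf")
    for state in states:                              # SM: visit all states
        side_info = state                             # TA output (toy scalar)
        synthesized = [x * side_info for x in gop]    # TS (toy model)
        distortion = sum((x - y) ** 2 for x, y in zip(gop, synthesized))
        if distortion > vqa_limit:                    # VQA: reject impairments
            continue
        rate = 1.0 / side_info                        # toy rate model
        cost = distortion + lagrange * rate           # rate-distortion cost
        if cost < best_cost:
            best_state, best_cost = state, cost
    return best_state                                 # None -> reference codec

print(encode_gop([1.0, 2.0, 3.0]))   # picks 1.0 in this toy case
```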
Bit-Rate Savings with the Semantic Coding Approach
We have integrated our approach into an H.264/AVC codec. The test sequences “Concrete”, “City”, “Preakness”, and “Coastguard” are used to demonstrate that an approximate representation of some rigid and non-rigid textures is possible without subjectively noticeable loss of quality.

Fig. 2: Bit rate savings w.r.t. quantization accuracy


 

The following set-up was used for the H.264/AVC codec: three B pictures, one reference picture for each P picture, CABAC entropy coding, rate-distortion optimization, and 30 Hz progressive video at CIF resolution. The quantization parameter QP was set to 16, 20, 24, 28, and 32. Fig. 2 depicts the bit rate savings obtained for each of the test sequences. Here we have assumed, and verified through visual inspection, that the MSE-coded and synthesized textures cannot be distinguished. It can be seen that the highest savings are measured for the highest quantization accuracy considered. The most substantial bit rate savings (33.3%) are measured for the “City” sequence. The bit rate savings decrease with the quantization accuracy because the volume of the side information remains constant over the different QP settings. All results are derived from decoded bitstreams, and the encoder is run automatically for each sequence.
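
The mechanism behind this trend can be illustrated with a small calculation. All rates below are invented purely for illustration; only the mechanism, a constant side-information rate set against a reference rate that falls with coarser quantization, comes from the text:

```python
SIDE_INFO = 10.0  # kbit/s; assumed constant over all QP settings (per the text)

for qp, reference_rate, replaced_texture_rate in [
    (16, 1000.0, 300.0),   # fine quantization: high rates (invented numbers)
    (24, 400.0, 120.0),
    (32, 150.0, 45.0),     # coarse quantization: low rates (invented numbers)
]:
    proposed_rate = reference_rate - replaced_texture_rate + SIDE_INFO
    saving = 100.0 * (reference_rate - proposed_rate) / reference_rate
    print(f"QP={qp}: {saving:.1f}% bit rate saving")
```

The constant side-information rate consumes an ever larger fraction of the shrinking total rate, so the relative saving decreases with QP, matching the behavior shown in Fig. 2.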

Video is basically a three-dimensional array of color pixels. Two dimensions serve as the spatial (horizontal and vertical) directions of the moving pictures, and one dimension represents the time domain. A frame is the set of all pixels that correspond to a single time instant; basically, a frame is the same as a still picture.

Video data contains spatial and temporal redundancy. Similarities can thus be encoded by merely registering differences within a frame (spatial) and/or between frames (temporal). Spatial encoding takes advantage of the fact that the human eye cannot distinguish small differences in color as easily as it can perceive changes in brightness, so very similar areas of color can be "averaged out" in a similar way to JPEG images (JPEG image compression FAQ, part 1/2). With temporal compression, only the changes from one frame to the next are encoded, since a large number of pixels will often be the same across a series of frames.
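
As a minimal sketch of the temporal idea, consider a toy difference coder on one-dimensional "frames" (real codecs operate on 2-D blocks with motion compensation):

```python
def frame_delta(prev, curr):
    """Return {pixel_index: new_value} for the pixels that changed."""
    return {i: c for i, (p, c) in enumerate(zip(prev, curr)) if p != c}

def apply_delta(prev, delta):
    """Reconstruct the current frame from the previous one plus the delta."""
    return [delta.get(i, p) for i, p in enumerate(prev)]

prev = [10, 10, 10, 10]
curr = [10, 12, 10, 10]
delta = frame_delta(prev, curr)          # {1: 12}: only one pixel changed
assert apply_delta(prev, delta) == curr  # bit-exact reconstruction
```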

Lossless compression
Some forms of data compression are lossless. This means that when the data is decompressed, the result is a
bit-for-bit perfect match with the original. While lossless compression of video is possible, it is rarely used, as
lossy compression results in far higher compression ratios at an acceptable level of quality.
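
The bit-for-bit claim can be made concrete with Python's built-in zlib, a general-purpose lossless codec (not a video codec):

```python
import zlib

original = b"example video payload " * 100
compressed = zlib.compress(original)
# Lossless round trip: decompression reproduces the input bit-for-bit.
assert zlib.decompress(compressed) == original
print(len(original), "->", len(compressed), "bytes")
```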

Intraframe versus interframe compression


One of the most powerful techniques for compressing video is interframe compression. Interframe
compression uses one or more earlier or later frames in a sequence to compress the current frame, while
intraframe compression uses only the current frame, which is effectively image compression.
The most commonly used method works by comparing each frame in the video with the previous one. If the frame contains areas where nothing has moved, the system simply issues a short command that copies that part of the previous frame, bit-for-bit, into the next one. If sections of the frame move in a simple manner, the compressor emits a slightly longer command that tells the decompressor to shift, rotate, lighten, or darken the copy; this is still much shorter than intraframe compression. Interframe compression works well for programs that will simply be played back by the viewer, but can cause problems if the video sequence needs to be edited.
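
A toy version of such a command stream follows; the COPY/SHIFT/RAW format is invented for illustration, since real codecs use motion vectors over 2-D blocks:

```python
def encode_block(prev, curr, start, size=8, max_shift=2):
    """Emit COPY if the block is unchanged, SHIFT dx if it matches the
    previous frame displaced by dx pixels, else fall back to RAW pixels."""
    block = curr[start:start + size]
    if prev[start:start + size] == block:
        return ("COPY",)
    for dx in range(-max_shift, max_shift + 1):
        lo, hi = start + dx, start + dx + size
        if 0 <= lo and hi <= len(prev) and prev[lo:hi] == block:
            return ("SHIFT", dx)
    return ("RAW", block)

prev = [0, 0, 1, 2, 3, 4, 0, 0]
curr = [0, 1, 2, 3, 4, 0, 0, 0]              # content moved left by one pixel
print(encode_block(prev, curr, 0, size=5))   # ("SHIFT", 1)
```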
Since interframe compression copies data from one frame to another, the following frames cannot be reconstructed properly if the original frame is simply cut out (or lost in transmission). Some video formats, such as DV, compress each frame independently using intraframe compression. Making 'cuts' in intraframe-compressed video is almost as easy as editing uncompressed video: one finds the beginning and end of each frame, copies each desired frame bit-for-bit, and discards the unwanted frames.
Another difference between intraframe and interframe compression is that with intraframe systems, each frame uses a similar amount of data. In most interframe systems, certain frames (such as "I frames" in MPEG-2) aren't allowed to copy data from other frames, and so require much more data than other frames nearby.
It is possible to build a computer-based video editor that spots the problems caused when I frames are edited out while other frames need them. This has allowed newer formats like HDV to be used for editing. However, this process demands much more computing power than editing intraframe-compressed video with the same picture quality.
