EE 5359 MULTIMEDIA PROCESSING
SPRING 2012
Interim report
Real time H.264 decoder implementation in robot control
Under guidance of
DR K R RAO
DEPARTMENT OF ELECTRICAL ENGINEERING
UNIVERSITY OF TEXAS AT ARLINGTON
SPRING 2012
Presented by
Saurabh Ghorpade
1000718850
Acronyms:
A.R. Drone: Augmented reality quadrotor drone.
API: Application programming interface.
CAVLC: Context adaptive variable length coding
DHCP: Dynamic host configuration protocol
DOF: Degrees of freedom
EOS: End of sequence
GCC: GNU compiler collection
GOB: Group of blocks
GOBSC: Group of block start code
GOBQuant: Group of block quantizer
IP: Internet protocol
JPEG: Joint photographic experts group
LiPo: Lithium polymer.
MBC: Coded macroblock bit
MBDES: Macroblock description code
MEMS: Microelectromechanical systems
OpenCV: Open source computer vision
PFORMAT: Picture format
PFRAME: Picture frame
POSIX : Portable operating system interface
PQUANT: Picture quantizer
PRESOLUTION: Picture resolution
PSC: Picture start code
PTYPE: Picture type
RLC: Run length coding
SDL: Simple direct media layer
UDP: User datagram protocol
UVLC: Universal variable length coding
WiFi: Wireless Fidelity.
Proposal:
This project presents a study of the H.264 decoder [1,6,7] and of algorithms for evaluating the inverse transforms and for packet analysis, suitable for a high-speed real-time implementation in C/C++ for robot control [4]. In addition, feature recognition will be incorporated so that the robot can be fine-tuned to follow objects of a particular shape.
Overview:
Owing to its speed and timing guarantees, C is becoming increasingly popular for real-time environments. The application is a flying robot (A.R. Drone [4]) that follows a colored object. The complete source was initially implemented in Java and, although it worked in terms of basic functionality, its response was sluggish [2]. There are several reasons for this: many classes must be loaded at startup; array bounds are validated unnecessarily; objects live on the heap (each carrying a virtual table and synchronization primitives), which is slow; the built-in garbage collector makes the application non-deterministic and consumes both memory and time; and the larger memory footprint causes swapping, which further reduces speed. Java also has no pointers, which in C enable fast direct memory access, and compiler optimizations and macros cannot be applied. Most importantly, a Java program is first compiled to bytecode, which the Java virtual machine executes, after which the operating system makes the system calls that access the wireless interface (WiFi, 802.11 [8]). Every operation therefore passes through at least two layers, whereas C can interact with the hardware directly.
Environment setup:
The robot communicates with the laptop over user datagram protocol (UDP) sockets. The robot runs its own router, which assigns an IP (Internet Protocol) address to the laptop via dynamic host configuration protocol (DHCP) once connected. Two threads therefore need to be written, corresponding to two ports: one for navigation/control of the robot and one for video stream analysis. The port numbers and IP address are given in the robot's specification sheet [3]. Tools/software required: Eclipse integrated development environment, Visual Studio 2010, and GCC [9] (GNU compiler collection). Operating systems: Linux and Windows.
H.264 Decoder:
Once the connection to the robot is established, the major part of the project begins: the H.264 decoder.
Before proceeding with the decoder implementation, the encoder implementation is studied in order to devise a correct decoding strategy.
An image is split into groups of blocks (GOBs), which correspond to 16-line-high horizontal strips of the image, as shown in fig 1.
Fig1. Image (left) decomposed into GOBs [3] (right)
Each GOB is split into macroblocks (fig 2), each representing a 16x16 pixel area.
Fig2. Structure of each GOB [3]
Each macroblock contains the information of a 16x16 image in YCbCr [5] format, type 4:2:0 (fig 4).
Fig3.Image decomposed into YCbCr [5]
The 16x16 image is finally stored in the memory as 6 blocks of 8x8 pixels:
• 4 Y blocks (y0, y1, y2, y3) to form the 16x16 pixels Y image of the luma component
(corresponding to a greyscale version of the original 16x16 RGB image).
• 2 blocks of down-sampled chroma components (computed from the original 16x16 RGB
image):
Cb: blue-difference component (8x8 values)
Cr: red-difference component (8x8 values)
Fig4.Macroblock. [5]
Layer of blocks
As shown in fig 4, a macroblock corresponds to four luminance ("luma") blocks and two color ("chroma") blocks.
Step 1: Each 8x8 block of the current macroblock is transformed by DCT [19].
Step 2: Each element of the transformed 8x8 block is quantized.
Step 3: The 8x8 block is then zigzag reordered.
Step 4: The 8x8 block is then encoded using UVLC (universal variable length coding). The complete process is represented in fig 6.
Fig6. Modified JPEG [3]
This is followed by specific block entropy-coding.
P264 codec overview [3]
P264 [3] is a simplified H.264 baseline profile [15]. It uses I and P frame types but not the B frame type, as shown in fig 7, which is similar to H.264. Its motion compensation precision is 1 pixel, as opposed to H.264's 1/4-pixel precision. The entropy coding used in P264 is run length coding coupled with UVLC, whereas in H.264 it is CAVLC.
Fig 7. P264 Vs H.264 [3]
I Frame
An I frame is a complete frame. No reference to any previous frame is needed to decode it. Like
H.264, P264 makes a spatial prediction for each macroblock based on the neighboring pixels.
Several modes are available for I macroblock:
• intra 16x16 - a prediction is made over the 16x16 macroblock.
• intra 4x4 - the macroblock is divided into 16 4x4 blocks. Each 4x4 block has its own intra
prediction
Once the intra prediction is done, it is subtracted from the current macroblock. The residual
data is then processed with classical steps : transform, quantization, entropy coding.
Luma intra 16x16 prediction
4 modes are available (fig8):
• VERTICAL - extends the 16 upper neighbor pixels over the whole 16x16 macroblock
• HORIZONTAL - extends the 16 left neighbor pixels over the whole 16x16 macroblock
• DC - fills the 16x16 block with the mean of the 16 upper and the 16 left neighbor pixels
• PLANE - interpolates the 16 upper and the 16 left neighbor pixels
Fig 8. Luma intra 16x16 prediction modes [3]
Luma intra 4x4 prediction
9 modes are available (fig9):
• VERTICAL_4x4_MODE
• HORIZONTAL_4x4_MODE
• DC_4x4_MODE
• DIAGONAL_DL_4x4_MODE
• DIAGONAL_DR_4x4_MODE
• VERTICAL_RIGHT_4x4_MODE
• HORIZONTAL_DOWN_4x4_MODE
• VERTICAL_LEFT_4x4_MODE
• HORIZONTAL_UP_4x4_MODE
Fig 9. Luma intra 4x4 prediction modes [3]
Chroma 8x8 prediction
For chroma prediction, 4 modes are available:
• DC
• HORIZONTAL
• VERTICAL
• PLANE
These modes are equivalent to their luma intra 16x16 counterparts, except for DC.
P Frame
While I frames use spatial prediction, P frames make predictions based on previously
encoded frames. For each macroblock, a reference is found in the previous frame by searching around the current position. The motion vector is the displacement between the reference in the previous picture and the current macroblock to be encoded. The best reference is subtracted from the current macroblock to form the residual data. The motion vector (fig 10) is transmitted in the data stream so that the decoder can rebuild the frame.
Fig 10.Motion vector estimation [3]
The motion vector has one-pixel precision for the luma component and half-pixel precision for the chroma components due to chroma subsampling; the chroma therefore needs to be interpolated to access sub-pixel positions. P264 does not allow macroblock fragmentation for motion estimation: only one motion vector is computed for the entire 16x16 macroblock. The reference frame is always the previously encoded/decoded frame.
Residual data
Once the intra/inter prediction is done, it is subtracted from the current macroblock. The residual
data is then processed with the following scheme:
Step 1: Split the residual macroblock into 16 4x4 luma blocks and 4 4x4 blocks for each chroma component
Step 2: Apply the 4x4 pseudo-DCT to each 4x4 block
Step 3: Quantize all 4x4 blocks
Step 4: If the current macroblock is encoded using a luma 16x16 prediction, collect the DC coefficients of all 16 4x4 luma blocks and apply the Hadamard transform [12]
Step 5: For each chroma component, collect the 4 chroma DC values and perform a 2x2 Hadamard transform
Step 6: Zigzag all AC blocks
Step 7: Entropy encoding - a mix of run length coding [13] and Huffman coding [14]
In intra 4x4 coding, for each 4x4 block the intra prediction is determined first, then the
residual 4x4 block is processed through steps 1 to 3. The 4x4 block is then reconstructed so
that the correct neighboring pixels are available for the next 4x4 block's intra prediction.
The order for luma (Y) and chroma (C) 4x4 block encoding is shown in fig 11.
Fig 11. Order of luma and chroma encoding [3]
The encoding process is summarized in the block diagram (fig 12.a); it is performed by the drone's hardware.
In the decoder block diagram (fig 12.b), the motion vector (MV) path is not shown; the MV is needed by the motion compensation block.
Fig 12. The block diagram of H.264 algorithm [15]
(a) Encoder
(b) Decoder
Transport layer
This section describes how the final data stream is generated.
Each picture consists of an image header followed by groups of data blocks and an
ending code (EOS, end of sequence).
The composition of each block-layer is shown in fig 13.
Fig 13. Block composition [3]
Picture start code (PSC) (22 bits)
The UVLC stream starts with a PSC (picture start code), which is 22 bits long:
0000 0000 0000 0000 1 00000
The P264 PSC is:
0000 0000 0000 0001 0 00000
A PSC is always byte aligned.
Picture format (PFORMAT) (2 bits)
The second field is the picture format, which can be one of the following: common intermediate format (CIF) or video graphics array (VGA)
• 00: forbidden
• 01: CIF
• 10: VGA
Picture resolution (PRESOLUTION) (3 bits)
Picture resolution, used in combination with the picture format (3 bits):
• 000: forbidden
• 001: for CIF it means sub-QCIF (Quarter CIF)
• 010: for CIF it means QCIF
• 011: for CIF it means CIF
• 100: for CIF it means 4-CIF
• 101: for CIF it means 16-CIF
Picture type (PTYPE) (3 bits)
Picture type:
• 000: INTRA picture
• 001: INTER picture
Picture quantizer (PQUANT) (5/6 bits)
UVLC codec: The PQUANT code is a 5-bit word giving the reference quantizer for the picture, ranging from 1 to 30.
P264 codec: The PQUANT code is a 6-bit word, ranging from 0 to 63.
Picture frame (PFRAME) (32 bits)
The frame number (32 bits).
Group of block start code (GOBSC) (22 bits)
Each GOB starts with a GOBSC (group of block start code), which is 22 bits long:
UVLC codec:
0000 0000 0000 0000 1xxx xx
P264 codec:
0000 0000 0000 0001 0xxx xx
A GOBSC is always byte aligned. The five least significant bits (xxxxx) give the blockline number.
Group of block quantizer (GOBQUANT) (5/6 bits)
Equivalent to PQUANT for the current GOB.
UVLC macroblocks Layer
Each macroblock consists of a macroblock header followed by the macroblock data.
MBC - Coded macroblock bit:
• Bit 0 : ‘1’ means a macroblock is present / ‘0’ means the macroblock is all zero.
• If MBC is 0, the following fields are omitted.
MBDES - macroblock description code:
• Bit 0 : ‘1’ means there are non-DC coefficients for block y0.
• Bit 1 : ‘1’ means there are non-DC coefficients for block y1.
• Bit 2 : ‘1’ means there are non-DC coefficients for block y2.
• Bit 3 : ‘1’ means there are non-DC coefficients for block y3.
• Bit 4 : ‘1’ means there are non-DC coefficients for block Cb.
• Bit 5 : ‘1’ means there are non-DC coefficients for block Cr.
• Bit 6 : ‘1’ means a differential quantization value (MBDIFF) follows this code. Not implemented; always 0.
• Bit 7 : always ‘1’, to avoid a zero byte.
MBDIFF – differential quantization: Not implemented
P264 macroblock Layer
There are 3 types of macroblocks in the transport layer:
• I frame with intra 16x16 prediction for the current macroblock
• I frame with intra 4x4 prediction for the current macroblock
• P frame
Macroblock intra 16x16:
INTRA LUMA TYPE – fragmentation used for intra prediction:
Bit 0 : ’0’ means intra 4x4, ’1’ means intra 16x16. INTRA LUMA TYPE is therefore set to 1 for an intra 16x16 macroblock.
INTRA CHROMA TYPE – intra mode for chroma component:
One of the four available intra chroma predictions, coded over 2 bits.
INTRA LUMA 16x16 MODE – 16x16 intra mode for luma component:
One of the four available intra luma 16x16 predictions, coded over 2 bits.
Y0 – Y15 – luma 4x4 blocks:
Each block (16 elements) is encoded using the method described in section Block Entropy coding
process.
CHROMA DATA – U and V blocks:
This segment is common to all types of macroblock.
Macroblock intra 4x4:
INTRA LUMA TYPE – fragmentation used for intra prediction:
Bit 0 : ’0’ means intra 4x4, ’1’ means intra 16x16. INTRA LUMA TYPE is therefore set to 0 for an intra 4x4 macroblock.
INTRA CHROMA TYPE – intra mode for chroma component:
One of the four available intra chroma predictions, coded over 2 bits.
INTRA LUMA 4x4 MODE - list of 16 intra 4x4 prediction:
Each intra 4x4 mode is one of the nine available intra 4x4 luma predictions (horizontal, vertical, vertical up, etc.).
Each element of the list is coded using a prediction based on the neighboring predictions. If
the prediction is correct, the element is coded using only 1 bit; if the prediction is wrong, 4 bits
are used.
DC Y – list of 16 DC values:
DC Y is a list of 16 elements gathering the DC values of the 16 4x4 blocks of the current
macroblock. This list is written into the data stream using the block-encoding method.
AC Yx – blocks of AC coefficients:
Each AC block (15 elements) is encoded with the block-encoding method.
CHROMA DATA – U and V blocks:
This segment is common to all types of macroblocks.
Inter macroblock:
PARTITION LIST – list of mb subdivisions for motion estimation: