EE 5359 MULTIMEDIA PROCESSING

SPRING 2012

Interim report

Real time H.264 decoder implementation in robot control

Under guidance of

DR K R RAO
DEPARTMENT OF ELECTRICAL ENGINEERING
UNIVERSITY OF TEXAS AT ARLINGTON

SPRING 2012

Presented by

Saurabh Ghorpade

1000718850

Acronyms:

A.R. Drone: Augmented reality quadrotor drone.

API: Application programming interface.

CAVLC: Context adaptive variable length coding

DHCP: Dynamic host configuration protocol

DOF: Degrees of freedom

EOS: End of sequence

GCC: GNU compiler collection

GOB: Group of blocks

GOBSC: Group of block start code

GOBQuant: Group of block quantizer

IP: Internet protocol

JPEG: Joint photographic experts group

LiPo: Lithium polymer.

MBC: Coded macroblock bit

MBDES: Macroblock description code

MEMS: Microelectromechanical systems

OpenCV: Open source computer vision

PFORMAT: Picture format

PFRAME: Picture frame

POSIX : Portable operating system interface

PQUANT: Picture quantizer

PRESOLUTION: Picture resolution

PSC: Picture start code

PTYPE: Picture type

RLC: Run length coding

SDL: Simple direct media layer

UDP: User datagram protocol

UVLC: Universal variable length coding

WiFi: Wireless Fidelity.

Proposal:

This project presents a study of the H.264 decoder [1,6,7] and of algorithms for evaluating the inverse transforms and performing packet analysis, suitable for high speed real time implementation in C/C++ for robot control [4]. In addition, feature recognition will be incorporated in order to fine tune the robot to follow objects of a particular shape.

Overview:

Owing to speed and timing requirements, C is becoming more and more popular for real time environments. The application is a flying robot (A.R. Drone [4]) that follows a colored object. The complete source was initially implemented in Java and, although it worked in terms of basic functionality, its response was sluggish [2]. There are several reasons for this: many classes need to be loaded at startup; arrays are bounds-checked on every access; objects live on the heap (each carrying a virtual table and synchronization primitives), which is slow; the built-in garbage collection makes the application non-deterministic and consumes both memory and time; and the larger memory footprint causes swapping, which further decreases speed. Java also has no pointers to allow fast direct memory access, and compiler optimizations and macros cannot be applied. The main reason, however, is that a Java program is first compiled to byte code, which the Java virtual machine then executes, with the operating system finally making the system calls that access the wireless interface (WiFi 802.11 [8]); the program therefore goes through a minimum of two extra layers, whereas C can interact with the hardware directly.

Environment setup:

The robot communicates with the laptop over a user datagram protocol (UDP) socket. The robot has its own router, which assigns an IP (Internet protocol) address to the laptop using the dynamic host configuration protocol (DHCP) once connected. Two threads therefore need to be written, corresponding to two ports (one for navigation/control of the robot and the other for video stream analysis). The port numbers and IP address are given in the specification sheet [3] of the robot. Tools/software required: Eclipse integrated development environment, Visual Studio 2010, GCC [9] (GNU compiler collection). Operating systems: Linux and Windows.
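The UDP sockets can be opened with the standard POSIX API. The sketch below shows one socket only; the address 192.168.1.1 and port 5556 are illustrative values to be checked against the specification sheet [3], and the navigation and video sockets are opened the same way.

```c
/* Sketch: opening a UDP socket to the drone.  The IP address and
   port below are assumptions to be checked against the robot's
   specification sheet [3]. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

int open_drone_socket(const char *drone_ip, int port,
                      struct sockaddr_in *drone_addr)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);   /* UDP socket */
    if (sock < 0) {
        perror("socket");
        return -1;
    }
    memset(drone_addr, 0, sizeof(*drone_addr));
    drone_addr->sin_family = AF_INET;
    drone_addr->sin_port   = htons(port);
    inet_pton(AF_INET, drone_ip, &drone_addr->sin_addr);
    return sock;
}
```

The returned descriptor is then used with sendto()/recvfrom() from the two threads.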

H.264 Decoder:

Once the connection to the robot is established, the major part of the project begins: the H.264 decoder.

Before proceeding with the decoder implementation, the encoder implementation is studied in order to derive the correct decoding strategy.

An image is split into groups of blocks (GOB), each corresponding to a 16-line-high horizontal strip of the image, as shown in fig 1.

Fig 1. Image (left) decomposed into GOBs (right) [3]

Each GOB is split into macroblocks (fig 2), each of which represents a 16x16 image.

Fig 2. Structure of each GOB [3]

Each macroblock contains the information of a 16x16 image in YCbCr [5] format, type 4:2:0 (fig 4).

Fig 3. Image decomposed into YCbCr [5]

The 16x16 image is finally stored in the memory as 6 blocks of 8x8 pixels:

• 4 Y blocks (y0, y1, y2, y3) to form the 16x16 pixel Y image of the luma component (corresponding to a greyscale version of the original 16x16 RGB image).

• 2 blocks of down-sampled chroma components (computed from the original 16x16 RGB image):

Cb: blue-difference component (8x8 values)

Cr: red-difference component (8x8 values)

Fig 4. Macroblock [5]

Layer of blocks

As shown in fig 4, a macroblock corresponds to four luminance ("luma") blocks and two color ("chroma") blocks.

Step 1: Each 8x8 block of the current macroblock is transformed by DCT [19].

Step 2: Each element of the transformed 8x8 block is quantized.

Step 3: The 8x8 block is then zigzag reordered.

Step 4: The 8x8 block is then encoded using UVLC (universal variable length coding). The complete process is represented in fig 6.

Fig 6. Modified JPEG [3]
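Step 3 above can be sketched as follows; the scan table is the standard 8x8 zigzag order (as used in JPEG), which is assumed here to match the order used by the drone's codec [3].

```c
/* Sketch of step 3: zigzag reordering of an 8x8 block before
   entropy coding.  The scan table below is the standard JPEG-style
   zigzag order; the codec's exact table should be checked in [3]. */
static const int zigzag8x8[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

/* Reads the 8x8 block in zigzag order into a linear array, so that
   low-frequency coefficients come first and runs of zeros cluster
   at the end (which helps the run length coding that follows). */
void zigzag_reorder(const short in[64], short out[64])
{
    for (int i = 0; i < 64; i++)
        out[i] = in[zigzag8x8[i]];
}
```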

This is followed by specific block entropy-coding.

P264 codec overview [3]

P264 [3] is a simplified H.264 baseline profile [15]. Like H.264, it uses I and P frame types, but not the B frame type, as shown in fig 7. Its motion compensation precision is 1 pixel, as opposed to H.264's 1/4 pixel precision. The entropy coding used in P264 is RLE coupled with UVLC, whereas in H.264 it is CAVLC.

Fig 7. P264 Vs H.264 [3]

I Frame

An I frame is a complete frame: no reference to any previous frame is needed to decode it. Like H.264, P264 makes a spatial prediction for each macroblock based on the neighboring pixels.

Several modes are available for an I macroblock:

• intra 16x16 - a prediction is made over the whole 16x16 macroblock.

• intra 4x4 - the macroblock is divided into 16 4x4 blocks; each 4x4 block has its own intra prediction.

Once the intra prediction is done, it is subtracted from the current macroblock. The residual data is then processed with the classical steps: transform, quantization, entropy coding.

Luma intra 16x16 prediction

4 modes are available (fig8):

• VERTICAL - extends the 16 upper neighbor pixels over the whole 16x16 macroblock

• HORIZONTAL - extends the 16 left neighbor pixels over the whole 16x16 macroblock

• DC - fills the 16x16 block with the mean of the 16 upper and the 16 left neighbor pixels

• PLANE - interpolates between the 16 upper and the 16 left neighbor pixels

Fig 8. Luma intra 16x16 prediction modes [3]
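As an illustration, the DC mode described above can be sketched as below; border handling for macroblocks without upper or left neighbors is omitted, and the rounding convention is an assumption.

```c
/* Sketch of the DC intra 16x16 mode: the prediction block is filled
   with the mean of the 16 upper and 16 left neighbor pixels
   (32 samples in total).  Handling of missing neighbors at picture
   borders is omitted; the rounding is an assumption. */
void intra16x16_dc(unsigned char pred[16][16],
                   const unsigned char top[16],
                   const unsigned char left[16])
{
    int sum = 0;
    for (int i = 0; i < 16; i++)
        sum += top[i] + left[i];
    unsigned char dc = (unsigned char)((sum + 16) / 32); /* rounded mean */
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            pred[y][x] = dc;        /* flat fill over the macroblock */
}
```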

Luma intra 4x4 prediction

9 modes are available (fig9):

• VERTICAL_4x4_MODE

• HORIZONTAL_4x4_MODE

• DC_4x4_MODE

• DIAGONAL_DL_4x4_MODE

• DIAGONAL_DR_4x4_MODE

• VERTICAL_RIGHT_4x4_MODE

• HORIZONTAL_DOWN_4x4_MODE

• VERTICAL_LEFT_4x4_MODE

• HORIZONTAL_UP_4x4_MODE

Fig 9. Luma intra 4x4 prediction modes [3]

Chroma 8x8 prediction

For chroma prediction, 4 modes are available:

• DC

• HORIZONTAL

• VERTICAL

• PLANE

These modes are equivalent to the luma intra 16x16 modes, except for DC.

P Frame

While an I frame performs a spatial prediction, a P frame makes predictions based on the previously encoded frames. For each macroblock, a reference is found in the previous frame by searching around the current position. The motion vector is the distance between the reference in the previous picture and the current macroblock to be encoded. The best reference is subtracted from the current macroblock to form the residual data. The motion vector (fig 10) is transmitted in the data stream so that the decoder can rebuild the frame.

Fig 10.Motion vector estimation [3]

The motion vector has one-pixel precision for the luma component and half-pixel precision for the chroma components due to chroma subsampling; chroma therefore needs to be interpolated to access sub-pixels. P264 does not allow macroblock fragmentation for motion estimation: only one motion vector is computed for the entire 16x16 macroblock. The reference frame is always the previously encoded/decoded frame.
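A minimal full-pel motion search matching this description might look as follows; the exhaustive window search and the frame layout (a single luma plane with a fixed stride) are illustrative assumptions, not the drone's actual implementation.

```c
/* Sketch of full-pel motion search: find, within a small window in
   the previous frame, the 16x16 reference that minimises the sum of
   absolute differences (SAD) against the current macroblock.
   One vector per 16x16 macroblock, as in P264. */
#include <limits.h>
#include <stdlib.h>

typedef struct { int dx, dy; } motion_vector;

motion_vector search_mv(const unsigned char *prev, const unsigned char *cur,
                        int stride, int mbx, int mby, int range)
{
    motion_vector best = {0, 0};
    int best_sad = INT_MAX;
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int sad = 0;
            for (int y = 0; y < 16; y++)
                for (int x = 0; x < 16; x++)
                    sad += abs(cur [(mby + y)      * stride + mbx + x] -
                               prev[(mby + dy + y) * stride + mbx + dx + x]);
            if (sad < best_sad) {
                best_sad = sad;
                best = (motion_vector){dx, dy};
            }
        }
    }
    return best;
}
```

Real encoders prune this exhaustive search; the window form is kept here only to show what the transmitted vector means.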

Residual data

Once the intra/inter prediction is done, it is subtracted from the current macroblock. The residual data is then processed with the following scheme:

Step 1: Split the residual macroblock into 16 4x4 luma blocks and 4 4x4 blocks for each chroma component.

Step 2: Apply the pseudo 4x4 DCT to each 4x4 block.

Step 3: Quantize all 4x4 blocks.

Step 4: If the current macroblock is encoded using a luma 16x16 prediction, collect the DC coefficients of each 4x4 luma block and apply the Hadamard transformation [12].

Step 5: For each chroma component, collect the 4 chroma DC values and perform a 2x2 Hadamard transform.

Step 6: Zigzag all AC blocks.

Step 7: Entropy encoding - a mix of run length coding [13] and Huffman coding [14].
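Step 5 above can be sketched as a plain 2x2 Hadamard butterfly on the four chroma DC values; the normalization convention is an assumption (see [12]).

```c
/* Sketch of step 5: the 2x2 Hadamard transform applied to the four
   chroma DC coefficients (one per 4x4 chroma block).  This is the
   standard butterfly; normalisation conventions vary, see [12]. */
void hadamard2x2(const int dc[4], int out[4])
{
    /* dc = { d00, d01, d10, d11 } in raster order */
    int a = dc[0] + dc[1], b = dc[0] - dc[1];
    int c = dc[2] + dc[3], d = dc[2] - dc[3];
    out[0] = a + c;   /* sum of all four values  */
    out[1] = b + d;   /* horizontal difference   */
    out[2] = a - c;   /* vertical difference     */
    out[3] = b - d;   /* diagonal difference     */
}
```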

In intra 4x4 coding, the intra prediction for each 4x4 block is determined first, then the residual 4x4 block is processed from step 1 to step 3. The 4x4 block is then reconstructed in order to have the correct neighboring pixels for the next 4x4 block's intra prediction.

The order of luma (Y) and chroma (C) 4x4 block encoding is shown in fig 11.

Fig 11. Order of luma and chroma encoding [3]

The encoder process is summarized in the block diagram (fig 12.a). This is performed by the hardware.

In the decoder block diagram (fig 12.b), the motion vector (MV) output of the entropy decoding is not shown; the MV is needed in the motion compensation block.

Fig 12. The block diagram of H.264 algorithm [15]

(a)  Encoder

(b)  Decoder

Transport layer

This section describes how the final data stream is generated.

For each picture, the data stream consists of an image header followed by groups of data blocks and an ending code (EOS, end of sequence).

The composition of each block-layer is shown in fig 13.

Fig 13. Block composition [3]

Picture start code (PSC) (22 bits)

The UVLC stream starts with a PSC (picture start code), which is 22 bits long:

0000 0000 0000 0000 1 00000

P264 PSC is:

0000 0000 0000 0001 0 00000

A PSC is always byte aligned.
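Since the PSC is byte aligned, the decoder can resynchronize by scanning the stream byte-wise. A minimal sketch for the P264 start code (first two bytes 0x00 0x01, then the top six bits of the third byte zero):

```c
/* Sketch: locating a byte-aligned P264 picture start code.  The PSC
   is 22 bits (0000 0000 0000 0001 0 00000): two bytes 0x00 0x01,
   then the top 6 bits of the third byte must be zero. */
int find_p264_psc(const unsigned char *buf, int len)
{
    for (int i = 0; i + 2 < len; i++) {
        if (buf[i] == 0x00 && buf[i + 1] == 0x01 &&
            (buf[i + 2] & 0xFC) == 0x00)   /* top 6 bits zero */
            return i;                      /* byte offset of the PSC */
    }
    return -1;  /* not found */
}
```

The UVLC PSC (0000 0000 0000 0000 1 00000) can be matched the same way with the pattern 0x00 0x00 and a third byte whose top six bits are 100000.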

Picture format (PFORMAT) (2 bits)

The second field is the picture format, which can be one of the following: common intermediate format (CIF) or video graphics array (VGA).

• 00: forbidden

• 01: CIF

• 10: VGA

Picture resolution (PRESOLUTION) (3 bits)

The picture resolution, used in combination with the picture format:

• 000: forbidden

• 001: for CIF it means sub-QCIF (Quarter CIF)

• 010: for CIF it means QCIF

• 011: for CIF it means CIF

• 100: for CIF it means 4-CIF

• 101: for CIF it means 16-CIF

Picture type (PTYPE) (3 bits)

Picture type:

• 000: INTRA picture

• 001: INTER picture

Picture quantizer (PQUANT) (5/6 bits)

UVLC codec: The PQUANT code is a 5-bit word giving the reference quantizer for the picture, which ranges from 1 to 30.

P264 codec: The PQUANT code is a 6-bit word and ranges from 0 to 63.

Picture frame (PFRAME) (32 bits)

The frame number (32 bits).
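The header fields above (PFORMAT through PFRAME) can be read with a small MSB-first bit reader. The sketch below assumes the fields follow the PSC in the order listed and uses the 6-bit P264 PQUANT; the reader itself is a generic helper, not the drone SDK's API.

```c
/* Sketch of parsing the picture header fields that follow the PSC,
   in stream order: PFORMAT (2 bits), PRESOLUTION (3), PTYPE (3),
   PQUANT (6 for P264), PFRAME (32).  The MSB-first bit reader below
   is a generic assumption; the exact layout is given in [3]. */
#include <stdint.h>

typedef struct {
    const unsigned char *buf;
    int bitpos;                      /* absolute bit position */
} bitreader;

static uint32_t read_bits(bitreader *br, int n)
{
    uint32_t v = 0;
    for (int i = 0; i < n; i++) {    /* one bit at a time, MSB first */
        int byte = br->bitpos >> 3, bit = 7 - (br->bitpos & 7);
        v = (v << 1) | ((br->buf[byte] >> bit) & 1);
        br->bitpos++;
    }
    return v;
}

typedef struct {
    uint32_t pformat, presolution, ptype, pquant, pframe;
} picture_header;

void parse_picture_header(bitreader *br, picture_header *h)
{
    h->pformat     = read_bits(br, 2);    /* 01: CIF, 10: VGA      */
    h->presolution = read_bits(br, 3);
    h->ptype       = read_bits(br, 3);    /* 000: INTRA, 001: INTER */
    h->pquant      = read_bits(br, 6);    /* P264: 6 bits, 0..63    */
    h->pframe      = read_bits(br, 32);   /* frame number           */
}
```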

Group of block start code (GOBSC) (22 bits)

Each GOB starts with a GOBSC (group of block start code), which is 22 bits long:

uvlc codec :

0000 0000 0000 0000 1xxx xx

p264 codec :

0000 0000 0000 0001 0xxx xx

A GOBSC is always byte aligned. The five least significant bits (xxxxx) encode the blockline number.

Group of block quantizer (GOBQUANT) (5/6 bits)

Equivalent to PQUANT for the current GOB.

UVLC macroblocks Layer

The data for each macroblock consists of a macroblock header followed by the macroblock data.

MBC - Coded macroblock bit:

• Bit 0: ‘1’ means there is a macroblock / ‘0’ means the macroblock is all zero.

• If MBC is 0, the following fields are omitted.

MBDES - macroblock description code:

• Bit 0: ‘1’ means there are non-DC coefficients for block y0.

• Bit 1: ‘1’ means there are non-DC coefficients for block y1.

• Bit 2: ‘1’ means there are non-DC coefficients for block y2.

• Bit 3: ‘1’ means there are non-DC coefficients for block y3.

• Bit 4: ‘1’ means there are non-DC coefficients for block Cb.

• Bit 5: ‘1’ means there are non-DC coefficients for block Cr.

• Bit 6: ‘1’ means a differential quantization (MBDIFF) value follows this code. Not implemented, always 0.

• Bit 7: always ‘1’ to avoid a zero byte.

MBDIFF – differential quantization: Not implemented
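Unpacking the MBDES byte into per-block flags might look as follows; the struct and function names are hypothetical.

```c
/* Sketch: unpacking the MBDES byte into per-block flags.  Bits 0..5
   indicate non-DC coefficients for y0..y3, Cb, Cr; bit 6 (MBDIFF)
   is not implemented and always 0; bit 7 is always 1 so that the
   byte is never zero.  Names here are illustrative. */
typedef struct {
    int has_ac[6];   /* y0, y1, y2, y3, Cb, Cr */
    int has_mbdiff;  /* always 0 in this codec  */
} mb_description;

int parse_mbdes(unsigned char mbdes, mb_description *d)
{
    if (!(mbdes & 0x80))
        return -1;              /* bit 7 must be set; stream error */
    for (int i = 0; i < 6; i++)
        d->has_ac[i] = (mbdes >> i) & 1;
    d->has_mbdiff = (mbdes >> 6) & 1;
    return 0;
}
```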

P264 macroblock Layer

There are 3 types of macroblocks in the transport layer:

• I frame with intra 16x16 prediction for the current macroblock

• I frame with intra 4x4 prediction for the current macrobock

• P frame

Macroblock intra 16x16:

INTRA LUMA TYPE – fragmentation used for intra prediction:

Bit 0: ‘0’ means intra 4x4, ‘1’ means intra 16x16. Thus INTRA LUMA TYPE is set to 1 for an intra 16x16 macroblock.

INTRA CHROMA TYPE – intra mode for chroma component:

One of the four available intra chroma predictions, coded over 2 bits.

INTRA LUMA 16x16 MODE – 16x16 intra mode for luma component:

One of the four available intra luma 16x16 predictions, coded over 2 bits.

Y0 – Y15 – luma 4x4 blocks:

Each block (16 elements) is encoded using the method described in the section on the block entropy coding process.

CHROMA DATA – U and V blocks:

This segment is common to all types of macroblock.

Macroblock intra 4x4:

INTRA LUMA TYPE – fragmentation used for intra prediction:

Bit 0: ‘0’ means intra 4x4, ‘1’ means intra 16x16. Thus INTRA LUMA TYPE is set to 0 for an intra 4x4 macroblock.

INTRA CHROMA TYPE – intra mode for chroma component:

One of the four available intra chroma predictions, coded over 2 bits.

INTRA LUMA 4x4 MODE - list of 16 intra 4x4 predictions:

Each intra 4x4 mode is one of the nine available intra 4x4 luma predictions (horizontal, vertical, vertical up, and so on).

Each element of the list is coded using a prediction based on the neighboring predictions: if the prediction is correct, the element is coded using only 1 bit; if the prediction is wrong, 4 bits are used.
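This 1-bit/4-bit scheme can be sketched in the H.264 style (a flag bit, then a 3-bit index that skips the predicted mode); whether P264 uses exactly this index mapping is an assumption.

```c
/* Sketch of the 1-or-4-bit signalling for an intra 4x4 mode: a
   "most probable mode" is derived from the neighbouring blocks; if
   it matches the actual mode, a single '1' bit is sent, otherwise a
   '0' flag plus a 3-bit index among the eight remaining modes
   (H.264-style scheme, assumed here to match P264's rule). */
int encode_i4x4_mode(int mode, int predicted_mode,
                     int *bits_out /* code word */, int *nbits /* length */)
{
    if (mode == predicted_mode) {
        *bits_out = 1;           /* '1': prediction was right */
        *nbits = 1;
    } else {
        /* '0' flag, then a 3-bit index that skips the predicted mode */
        int idx = (mode < predicted_mode) ? mode : mode - 1;
        *bits_out = (0 << 3) | idx;
        *nbits = 4;
    }
    return *nbits;
}
```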

DC Y – list of 16 DC values:

DC Y is a list of 16 elements which gathers the DC values from the 16 4x4 blocks of the current macroblock. This list is written in the data stream using the block-encoding method.

AC Yx – block of AC coeff:

Each AC block (15 elements) is encoded with the block-encoding method.

CHROMA DATA – U and V blocks:

This segment is common to all types of macroblock.

Inter macroblock :

PARTITION LIST – list of mb subdivisions for motion estimation: