INTERNATIONAL ORGANISATION FOR STANDARDISATION

ORGANISATION INTERNATIONALE DE NORMALISATION

ISO/IEC JTC1/SC29/WG11

CODING OF MOVING PICTURES AND AUDIO

ISO/IEC JTC1/SC29/WG11

MPEG2003/m10158

Oct. 2003, Brisbane, AU

Title: An Overlapped Block Motion Estimation for MC-EZBC

Authors: Yongjun Wu, Robert A. Cohen, and John W. Woods

Source: Rensselaer Polytechnic Institute

Status: Contribution

Summary

This input document summarizes a new contribution to scalable interframe video coding. We have extended overlapped block motion compensation (OBMC) to bidirectional hierarchical variable size block matching (HVSBM) in the MC-EZBC coder [1]. We present experimental results for Test 1b of the recent Call for Evidence [2]. Significant reduction in block artifacts is noted along with objective improvement in PSNR.

1. Background

This section describes MC-EZBC video codec for Test 1b. At the encoder, we use a fully scalable version of MC-EZBC to generate the bitstream and decode this bitstream at the desired bit rate and temporal resolution. Here, we use only Y-based motion estimation and do not employ spatial transition filtering (SLTF/SHTF) [6].

1.1 Bit rates

The MC-EZBC bit stream is fully embedded, hence we were able to maintain the bit rates at virtually the exact values. The spatial and temporal scalability is available by multiples of 2.

1.2 Random access points

A GOP size of 16 frames translates to 0.533 seconds at 30 fps. In a sequence of 300 frames, we processed 288 frames (18 GOPs of size 16) for the experiments.

1.3 Encoding and decoding delay

Encoding and decoding delay are both equal to 1 GOP, which in our case is 16 frames, except for Mobile, where we used a GOP size of 32.

1.4 Bi-directional HVSBM

Motion estimation is first done through hierarchical variable-size block matching, with block sizes varying between 64´64 and 4´4. We estimate the motion vectors for the blocks in the current frame by finding a match in the reference frame. For efficient motion estimation, we perform this search using a 3 level spatial pyramid. For CIF sequences (352´288) we use search range of ±4 at the lowest spatial resolution and use a refinement range of ±1 at its higher spatial resolutions. Thus at the first stage of temporal analysis the effective search range is ±22. As we move down the temporal pyramid, we double the search range.

The motion estimation is bi-directional, using one previous reference frame and one future reference frame. Thus, the number of reference frames is two. For a given block, if a good match is not found in the future frame, we search the past frame, and the better of the two matches is chosen as the reference block. When this bi-directional match results in too many unconnected pixels, we decide to discontinue further temporal analysis for that frame pair, and based on a threshold value. We refer to this as adaptive GOP size.

Coding of one GOP requires that all frames of the GOP be in the memory. The amount of memory (in bytes) needed for storage of YUV 4:2:0 data is

1.5 ´ |GOP| ´ sizeof(float) ´ frame_width ´ frame_height,

where |GOP| is the number of frames in a GOP, and sizeof(float) is the number of bytes in a floating-point number, in our case 4. Also, the motion vectors are stored in quad-tree format and their storage space is determined by the density of the motion field.

After we form the full motion vector quad-tree with bi-directional HVSBM, we prune it back in the sense of rate-distortion optimization using for CIF sequences.

1.5 Layered structure

The motion field we use is nonscalable. So for a given spatiotemporal resolution we need all the motion vectors up to that temporal level and the remaining bits are then assigned to the spatiotemporal subbands. Of course, a scalable motion field would alleviate this problem at low rates. An extension to scalable motion vector coding should be feasible.

2. OBMC after HVSBM

We apply overlapped block motion compensation (OBMC) to the variable size blocks after pruning. In our OBMC framework, we view any data received by the decoder prior to the decoding of frameas a source of information about the true motion field. For simplicity, we limit the valid information for each block to be the two horizontal and two vertical direct neighbors [3], and we assume stationarity of the image and block motion field.

2.1 Probability Weighting Windows

According to the above assumption, the probability weighting window used is symmetric left-to-right and top-to-bottom. Here we use modified a 2-D bilinear window, whose performance is only a little worse than iterated optimal window design (OWD) [4]. Each block size has its corresponding weighting window. The following are the probability weighting windows for 4x4 blocks:

, ,

.

(1) shows probability coefficients for 4x4 block itself, (2) is for probability coefficients from its right neighbor, and (3) is for probability coefficients from its bottom neighbor.

2.2 Shrinking scheme for variable block sizes

The set of modified 2-D bilinear windows only gives out the probability weighting coefficients between blocks with the same size. We adopt a shrinking scheme among blocks with different sizes to reduce smoothing at a motion discontinuity. Specifically, there are 3 cases for the neighbor of a given block:

(a)  Neighbor block size is the same as that of current block. We simply use the probability coefficients between blocks with the same size.

(b)  Neighbor block size is larger than that of current block. We deal it in the same way as (a), by using the same window size as the current block.

(c)  Neighbor block size is smaller than that of current block. Then the small neighbor has a smaller effective area. We set the probability coefficients for the motion vector from this neighbor to be the same as if the current block had the same block size. Beyond the affected area, self weighting coefficients are increased to compensate.

Since a large block probably has different a motion vector from its small neighbor, this scheme can reduce smoothing at a motion discontinuity.

2.3 OBMC for all blocks

After pruning in bi-directional HVSBM, there are 3 kinds of blocks in a high temporal frame: DEFAULT (connected) block with motion vector between current frame and next frame, and with prediction and update step in lifting implementation; INTRABLOCK with the same kind of motion vector as DEFAULT block, but with prediction step only; REVERSE block with motion vector between current frame and previous frame, and with compensation step only. OBMC is done for all blocks no matter what kind of motion vector this block and its neighbors have. Then each pixel can have up to three motion vectors, and its reference frame may be the next frame, the previous frame, or a combination of the two. Its compensated value comes from a weighted average of the predicted sub-pixels in its reference frame(s).

In comparison to earlier transition filtering schemes SLTF/SHTF [6], our OBMC applies to all the blocks and not just to those with different classifications. So it is not meant as a transition between different classes of blocks. Also, there is no iterative procedure in the former method to optimize over the use of multiple block references. No additional overhead must be transmitted to realize OBMC since all blocks are treated.

2.4 Iterative procedure in OBMC

Standard block motion compensation allows each vector of motion field to be estimated independently of all the other vectors. However, OBMC creates an interdependence between motion vectors that makes such a decoupled estimation suboptimal. Moreover, OBMC specifies a non-causal neighborhood. There does not exist a block scanning order such for every block, all its neighbor blocks are scanned before it. So we use an iterative estimation search procedure for optimized motion estimation [4], which ensures that MAD converges to a local minimum. For each pixel in some block , a residual error image is computed as,

This represents the motion compensation error that results when vector , namely the motion vector from its own block, is omitted and all the neighbors’ motion vectors are fixed. Then we further optimize according to,

Here and are the probability coefficients from weighting windows.

There are two parameters left for design: sweep number and adjustment distance each time . According to our experiment, MAD of high temporal frame converges to its local minimum very quickly as shown in Fig.1. So we choose sweep number, and adjustment distance, corresponding to each time. From the experiments below, we now find about 0.1 dB Mean PSNR (defined in Section 4) gain due to the iteration procedure.

3. Lifting implementation for OBMC

We do OBMC in a lifting implementation, namely with the prediction and update steps as normal. The specific equations for OBMC in lifting implementation are as follows,

,

.

OBMC regards the motion vector field as random. That means pixel in frame B has motion vector with probability from its corresponding probability weighting window as stated above, and is compensated by the weighted average of the predicted sub-pixels. In these equations is the nearest

Fig.1 Convergence behavior in iterative procedure. We plot MAD of high temporal frame 5 at temporal level 1 in Football sequence, CIF format, is the number of iterations, is the adjustment distance in each iteration, unit pixel.

integer to . Although the form of the low temporal frame seems the same as that without OBMC, actually OBMC affects both high temporal frame and low temporal frame. From our experiments, we find the following two points

1.  Low temporal frame without OBMC will have a few block artifacts, which can result from the block boundaries in the high temporal frame. However, the low temporal frame with OBMC will not have any block artifacts, due to the smoothing effect of OBMC in the high temporal frame.

2.  Compared with an original frame, both kinds of low temporal frames with and without OBMC are a little blurred. The blur effect in a low temporal frame without OBMC comes from the inaccuracy of motion vectors in those areas. But the blur effect in a low temporal frames with OBMC comes from a combination of the smoothing effect of OBMC and the inaccuracy of motion vectors. The blur effects are similar in the two kinds of low temporal frames.

As a result we visually prefer OBMC’s low temporal frame and find it more suitable for further stages of MCTF. (Please refer to Figs. 2 and 3 at end of this document.)

4. Experimental results

This section contains the results of our experiments with the MC-EZBC coder with and without OBMC for Test 1b. Note: in the experiment, motion estimation is done on Y component only instead of YUV simultaneously. All the results are reported in terms of the Y, U, V-component PSNR and the Mean PSNR, defined as

Mean PSNR = (4´PSNR_Y + PSNR_U + PSNR_V)/6.

4.1 Test 1b results

In test 1b there are 3 CIF sequences Canoa, Mobile Calendar and Football. With a GOP size of 16, we process 208 and 256 frames of Canoa and Football, respectively. For Mobile Calendar we process 288 frames with GOP of 32. The PSNR results are listed in Table 1. References for the lower frame rate videos are the unquantized outputs of MCTF. There are two references though: low temporal frames with OBMC and low temporal frames without OBMC. So at lower than full frame rate, we have two entries in Table 1. Case (1) uses reference with OBMC. Case (2) uses reference without OBMC.

Sequence / Rate / Coder / Y(dB) / U(dB) / V(dB) / Mean(dB)
Canoa
30 Hz
/
2048
/ WITHOUT OBMC / 32.60 / 38.50 / 38.19 / 34.52
WITH OBMC / 33.59 / 39.80 / 39.64 / 35.63
1024
/ WITHOUT OBMC / 28.94 / 35.79 / 34.76 / 31.05
WITH OBMC / 29.79 / 37.30 / 36.78 / 32.21
Canoa
15 Hz (1)
/
512
/ WITHOUT OBMC / 28.47 / 34.95 / 33.70 / 30.42
WITH OBMC / 29.12 / 36.09 / 35.13 / 31.28
256
/ WITHOUT OBMC / 25.37 / 32.54 / 30.44 / 27.41
WITH OBMC / 25.64 / 33.33 / 31.70 / 27.93
Canoa
15 Hz (2)
/
512
/ WITHOUT OBMC / 28.75 / 35.11 / 33.81 / 30.65
WITH OBMC / 29.10 / 36.10 / 35.11 / 31.27
256
/ WITHOUT OBMC / 25.54 / 32.65 / 30.52 / 27.56
WITH OBMC / 25.75 / 33.39 / 31.73 / 28.08

Football

30 Hz
/
2048
/ WITHOUT OBMC / 36.47 / 39.90 / 41.50 / 37.88
WITH OBMC / 37.45 / 41.20 / 42.78 / 38.96
1024
/ WITHOUT OBMC / 32.28 / 36.27 / 38.55 / 33.99
WITH OBMC / 33.19 / 37.66 / 39.90 / 35.05

Football

15 Hz (1)

/ 512 / WITHOUT OBMC / 31.37 / 34.91 / 37.47 / 32.98
WITH OBMC / 32.20 / 35.91 / 38.34 / 33.84

256

/ WITHOUT OBMC / 28.02 / 31.35 / 34.94 / 29.73
WITH OBMC / 28.37 / 32.05 / 35.60 / 30.19

Football

15 Hz (2)

/ 512 / WITHOUT OBMC / 31.76 / 35.05 / 37.59 / 33.28
WITH OBMC / 32.07 / 35.88 / 38.34 / 33.75

256

/ WITHOUT OBMC / 28.26 / 31.43 / 35.03 / 29.92
WITH OBMC / 28.49 / 32.08 / 35.63 / 30.28

Mobile

Calendar

30 Hz

/ 2048 / WITHOUT OBMC / 36.41 / 42.14 / 41.64 / 38.24
WITH OBMC / 36.82 / 42.56 / 42.00 / 38.64

1024

/ WITHOUT OBMC / 33.08 / 39.33 / 38.42 / 35.01
WITH OBMC / 33.45 / 40.07 / 38.90 / 35.46

Mobile

Calendar

15 Hz (1)

/ 512 / WITHOUT OBMC / 31.43 / 36.76 / 35.39 / 32.98
WITH OBMC / 32.04 / 36.71 / 35.88 / 33.46

256

/ WITHOUT OBMC / 27.00 / 31.31 / 30.40 / 28.28
WITH OBMC / 27.22 / 31.31 / 30.59 / 28.48

Mobile

Calendar

15 Hz (2)

/ 512 / WITHOUT OBMC / 31.74 / 36.83 / 35.48 / 33.21
WITH OBMC / 31.89 / 36.67 / 35.83 / 33.34

256

/ WITHOUT OBMC / 27.14 / 31.33 / 30.40 / 28.38
WITH OBMC / 27.22 / 31.31 / 30.59 / 28.48

Table 1: Test 1b results

We see an improvement in mean PSNR in Table 1 averaging about 1.1 dB for Canoa and Football at high bit rates, dropping significantly at lower bit rates. Improvement for Mobile Calendar though is only 0.4 dB even at high bit rates. These PSNR values without OBMC are lower than reported in our response to the CfE [5], which used a color or YUV motion estimation. Here the HVSBM is only done on the Y component. The case 1 reference (with OBMC) uniformly gives the higher PSNR improvement. Since this is the better looking reference, then maybe these are the figures to use going forward. But, case 2 PSNR figures are consistent with some earlier non-OBMC results.