ECE 734 Final Project Report

Acceleration of edge detection algorithm using PLX subword parallel ISA


Submitted by Sanghamitra Roy and Dongkeun Oh

Table of contents

1. Introduction and motivation

2. Overview of edge detection algorithm

3. Structure of Canny's algorithm and its hardware implementation

4. PLX subword parallel architecture: Overview

5. Algorithm for hardware implementation

6. Initial approach for hardware implementation

7. Memory efficient hardware implementation

8. Experimental results

9. Conclusion and future work

10. References

11. Appendix

1. Introduction and motivation

As the amount of multimedia information on the internet increases, there has been a remarkable rise in the demand for video-driven applications such as teleconferencing, videophones, and image-based multimedia services. Thus, the amount of video information to be transmitted over the network has increased, although the transmission rate of the network has not increased at the same pace. Low bit-rate video coding techniques have therefore become necessary to ease this bottleneck.

Very low bit-rate video coding algorithms can be divided into two categories. The first category consists of block-based algorithms such as H.261, H.263, MPEG-1, and MPEG-2. These algorithms are easy to implement and maintain a relatively good image quality at low bit rates. However, at very low bit rates, less than 28.8 kbps, blocking and mosquito artifacts become visible and the reconstructed image quality degrades. This is the reason why this strategy is not employed in MPEG-4.

The second category is object- or segmentation-based coding. Many techniques for object-based coding at very low bit rates have already been proposed. Object-based coding achieves a high compression rate by subdividing an image into a number of arbitrarily shaped objects and the background, and by performing motion estimation on the objects. The greatest advantage of this method is the ability to perform accurate motion estimation of the moving objects and to utilize the available bit rate efficiently by focusing on them. Consequently, the quality of the images produced by this method depends strongly on the quality of the object segmentation.

This object-oriented approach supports high resolution quality for each individual object. Accurate motion representation of the object is the key to good motion compensation, for coding purposes as well as for image format conversion. However, most object-based coding approaches are computationally expensive because of the complexity of the segmentation and motion estimation techniques.

Object segmentation and recognition is also a primary step of computer vision. Object recognition is used in many areas such as traffic monitoring and robot vision. While a single image provides a snapshot of a scene, different frames of a video taken over time represent the dynamics in the scene, making it possible to capture the motion in the sequence. The recognition of a moving object must be performed in real time, and a robot vision system must process many frames per second, so a high-performance image processor is required.

Edge detection, or object segmentation, is a crucial part of object recognition. Edge features, which are recognized as an important aspect of human visual perception, are commonly used in shape analysis. Decomposing an image into low-frequency blocks and blocks containing visually important features such as edges or lines requires criteria based on the visual continuity of the image.

The objective of this project is to improve the computational power of an image processor by accelerating the edge detection algorithm. In particular, we propose to enhance the performance of edge detection using subword parallelism, and to implement the algorithm with the PLX subword parallel ISA.

2. Overview of edge detection algorithm

An edge in an image corresponds to a discontinuity in the intensity surface of the underlying scene: a jump in intensity from one pixel to the next. Edge detection significantly reduces the amount of data and filters out useless information, while preserving the important structural properties of an image. There are many ways to perform edge detection, but the majority of methods may be grouped into two categories, gradient and Laplacian. The gradient method detects edges by looking for the maxima and minima in the first derivative of the image. The Laplacian method searches for zero crossings in the second derivative of the image to find edges. An edge has the one-dimensional shape of a ramp, and calculating the derivative of the image can highlight its location. Suppose we have the following signal, with an edge shown by the jump in intensity in figure 1:

Figure 1: Intensity profile of pixels in 1D line

If we take the gradient of this signal (which, in one dimension, is just the first derivative with respect to t) we get the following as shown in figure 2:

Figure 2: 1st Derivative of pixel intensity

Clearly, the derivative shows a maximum located at the center of the edge in the original signal. This method of locating an edge is characteristic of the "gradient filter" family of edge detection filters and includes the Sobel method. A pixel location is declared an edge location if the value of the gradient exceeds some threshold. As mentioned before, edge pixels have higher gradient values than the pixels surrounding them. So once a threshold is set, we can compare the gradient value to the threshold and detect an edge whenever the threshold is exceeded. Furthermore, when the first derivative is at a maximum, the second derivative is zero. As a result, an alternative way of finding the location of an edge is to locate the zeros in the second derivative. This method is known as the Laplacian, and the second derivative of the signal is shown in figure 3:

Figure 3: 2nd Derivative of pixel intensity

Based on this one-dimensional analysis, the theory can be carried over to two-dimensions as long as there is an accurate approximation to calculate the derivative of a two-dimensional image.

The gradient of an image f(x,y) at location (x,y) is defined as the vector

∇f = [Gx, Gy]^T = [∂f/∂x, ∂f/∂y]^T    (1)

It is well known from vector analysis that the gradient vector points in the direction of maximum rate of change of f at coordinates (x,y).

The magnitude of this vector,

|∇f| = mag(∇f) = [Gx^2 + Gy^2]^(1/2)    (2)

is an important quantity in the edge detection process: it gives the maximum rate of increase of f(x,y) per unit distance in the direction of ∇f. The direction of the gradient vector,

α(x,y) = tan^(-1)(Gy / Gx)    (3)

represents the direction angle of the vector at (x,y).

Computation of the gradient of an image is based on obtaining the partial derivatives ∂f/∂x and ∂f/∂y at every pixel location. Let us consider the 3*3 area shown in figure 4.

z1 / z2 / z3
z4 / z5 / z6
z7 / z8 / z9

Figure 4: 3*3 region of an image

Figure 5: Masks based on (a) Prewitt (b) Sobel operators

At the pixel z5 of the 3*3 region, the first-order derivatives given by the Prewitt operator are

Gx = (z7 + z8 + z9) - (z1 + z2 + z3)
Gy = (z3 + z6 + z9) - (z1 + z4 + z7)    (4)

The magnitude computed using equation (2) is computationally expensive, so we use an approximation based on absolute values:

|∇f| ≈ |Gx| + |Gy|    (5)

3. Structure of Canny’s algorithm and its hardware implementation

Structure of Canny's algorithm

We selected Canny's algorithm for the implementation of the edge detection process because it is considered a standard method of edge detection. The base program source was found at [8], and we modified this program to produce sample data and binary files for PLX simulation, and to simulate the whole edge detection process. Some sample outputs of edge detection using our C code are shown in figure 7.

The program consists of five main function modules: gaussian_smoothe, derivative_x_y, magnitude_x_y, non_max_supp, and apply_hysteresis.

In the first stage, the program performs linear filtering with a Gaussian kernel to smooth the noise in the image. Here, the pixel color data is converted into grayscale values. In the second stage, the edge strength and direction for each pixel in the smoothed image are computed. This is done by differentiating the image in two orthogonal directions and computing the partial derivatives. The third stage calculates the gradient magnitude, originally the root sum of squares of the derivatives, and the gradient direction, computed as the arctangent of the ratio of the derivatives. We use the sum of the absolute values of the two derivatives to approximate the magnitude.

Figure 6: Derivative masks for Canny’s algorithm

Figure 7: Original images and edge-detected images using our C code

Figure 8: Block diagram of Canny's algorithm program using the AQtime profiler

In the fourth stage, candidate edge pixels are identified as the pixels that survive a thinning process called non-maximal suppression. In this process, the edge strength of each candidate edge pixel is set to zero if it is not larger than the edge strength of the two adjacent pixels in the gradient direction. Thresholding is then done on the thinned edge magnitude image using hysteresis. In the fifth stage, hysteresis, two edge strength thresholds are used. All candidate edge pixels below the lower threshold are labeled as non-edges, and all pixels above the lower threshold that can be connected to any pixel above the higher threshold through a chain of edge pixels are labeled as edge pixels.

Hardware implementation and Project purpose

The acceleration of this algorithm can be done in various ways. Upon inspection of the various routines in the C code implementation, we found that the derivative_x_y function module is a good candidate for conversion to PLX ISA using subword parallelism. Before proceeding to the hardware implementation of the derivative computation, we briefly review the running time of the whole edge detection algorithm.


Figure 9: Snapshot of the profile of Canny’s program produced by AQtime

Figure 10: Running time weight of each program module in Canny’s Edge detection program

Figure 9 shows the runtime profile of Canny's program produced by the AQtime profiler, and figure 10 shows the percentage of the running time spent in each module. From figures 9 and 10, it can be observed that the Gaussian smoothing stage occupies half of the running time of the program. This is because it uses many multiplication and division operations for each pixel. Accelerating Gaussian smoothing would therefore require high-performance multiplication and division modules. The PLX ISA does not directly support division, so its hardware implementation goes beyond the scope of this project. The second stage of Canny's algorithm, i.e., the computation of the derivatives, is a good candidate for hardware acceleration using subword parallelism because its loop structure is symmetric and can be easily parallelized with the PLX subword ISA. This derivative calculation step is also common to all edge detection algorithms, so speeding it up can benefit other edge detection schemes too. The other three function modules can be accelerated using the same approach. Thus, we concentrate on the hardware implementation of this part and leave the implementation of the other parts for future work.

4. PLX subword parallel architecture: Overview

We give a very brief overview of the PLX architecture in this section; we avoid going into too much detail as it is already part of our course. PLX is a small subword parallel instruction set architecture developed at Princeton University. It uses SIMD-type instructions for parallel operation and faster performance. PLX supports 1-, 2-, 4-, or 8-byte subwords. It has 32 general-purpose registers which may be 32, 64, or 128 bits wide. This wordsize scalability allows design tradeoffs between performance and cost. PLX also supports predicated execution. Reading or writing data from memory requires aligned memory addresses (4/8/16-byte boundaries).

5. Algorithm for hardware implementation

We picked the partial derivative estimation algorithm for hardware implementation and optimization using the PLX instruction set architecture. This algorithm calculates the partial x-derivative and partial y-derivative for all pixel values of the image. The algorithm contains basic addition and subtraction operations using the derivative mask of figure 6. The C code snippet for this algorithm is shown below.

for(r = 0; r < rows; r++)
{
    pos = r * cols;
    del_x[pos] = s[pos + 1] - s[pos];
    pos++;
    for(c = 1; c < (cols - 1); c++, pos++) {
        del_x[pos] = s[pos + 1] - s[pos - 1];
    }
    del_x[pos] = s[pos] - s[pos - 1];
}

for(c = 0; c < cols; c++)
{
    pos = c;
    del_y[pos] = s[pos + cols] - s[pos];
    pos += cols;
    for(r = 1; r < (rows - 1); r++, pos += cols) {
        del_y[pos] = s[pos + cols] - s[pos - cols];
    }
    del_y[pos] = s[pos] - s[pos - cols];
}

As can be seen, the algorithm is mostly symmetric, except for the first and last calculations of every row and column. We have the following objectives for implementing this algorithm on the PLX subword parallel architecture:

i) Explore parallelism in the algorithm by loop unrolling

ii) Minimize memory accesses to reduce the execution time bottleneck

iii) Design memory access and data fetching to maximize cache hits

iv) Optimize the PLX code for faster performance

v) Verify the accuracy of the PLX implementation against the C code

6. Initial approach for hardware implementation

Our algorithms have been implemented for 100x100 pixel images having a total of 10,000 pixels. At first we implemented the above algorithm by unrolling the symmetric inner loops as shown below:

for(r = 0; r < 100; r++)
{
    pos = r * 100;
    del_x[pos] = s[pos + 1] - s[pos];
    pos++;
    for(c = 1; c < 25; c++, pos += 4) {
        del_x[pos]     = s[pos + 1] - s[pos - 1];
        del_x[pos + 1] = s[pos + 2] - s[pos];
        del_x[pos + 2] = s[pos + 3] - s[pos + 1];
        del_x[pos + 3] = s[pos + 4] - s[pos + 2];
    }
    ……
    del_x[pos] = s[pos] - s[pos - 1];
}

for(c = 0; c < 100; c++)
{
    pos = c;
    del_y[pos] = s[pos + 100] - s[pos];
    pos += 100;
    for(r = 1; r < 25; r++, pos += 400) {
        del_y[pos]       = s[pos + 100] - s[pos - 100];
        del_y[pos + 100] = s[pos + 200] - s[pos];
        del_y[pos + 200] = s[pos + 300] - s[pos + 100];
        del_y[pos + 300] = s[pos + 400] - s[pos + 200];
    }
    ……
    del_y[pos] = s[pos] - s[pos - 100];
}

The data is stored as 2 bytes per pixel, in sequential row-wise order in memory. The arrangement of the data in memory is shown in figure 11. Pixels 0-99 are stored sequentially as 2 bytes each, followed by pixels 100-199, and so on. As observed from the code, the first and last derivatives of each row and column are calculated using different masks than the intermediate derivatives. So, in this initial implementation we use subword parallel operations for the intermediate pixels {1,2,3,4}, {5,6,7,8}, … along the rows and {100,200,300,400}, {101,201,301,401}, … along the columns. Thus, if we want to load pixels 1,2,3,4 into a register as shown in figure 11, we need two loads from memory, as these pixels are not aligned with the 8-byte address boundaries. Also, loading the column-wise data {100,200,300,400} into a single register is a problem, as the data is arranged in row-wise order in memory.

Figure 11: Memory mapping for the initial algorithm

The initial implementation helped us learn the basics of coding an algorithm in PLX. We also verify our results by interfacing our PLX algorithm with the C code, and we calculate the speedup of our PLX implementation with respect to the Intel x86 architecture. The following are the major issues we faced during our initial implementation.

i) Interfacing data with the C code: we use a short integer representation of pixels in C, each requiring 2 bytes. We use fread/fwrite to read/write binary data between the C code and the PLX code.

ii) Loops are implemented using predicated jump instructions in PLX.

iii) Load alignment: in PLX, data must be loaded from an address that is a multiple of 4 or 8 bytes to avoid a trap. As we have seen, our data is slightly misaligned with this requirement.

The initial implementation gives a 2X speedup in terms of the number of machine cycles with respect to the C code. But given the potential of subword parallelism, there is scope for much improvement. In particular, we can design better memory management schemes to improve the efficiency of our algorithm and its cache performance. In the next section we describe our final hardware implementation, optimized for better memory management and faster performance.

7. Memory efficient hardware implementation

In our new implementation, we perform both the x- and y-derivative calculations in the same loop. Thus we merge the two nested loops and then perform loop unrolling. Also, we perform 16*2 = 32 derivative calculations in a single iteration. The unrolled algorithm code is shown below. Note that we have used vector notation to simplify our code: x[a : b : c] means {x[a], x[a + b], x[a + 2b], ..., x[c]}, and x[a : d] means {x[a], x[a + 1], x[a + 2], ..., x[d]}.

C code optimized for subword parallel hardware implementation

for(r = 0; r < 25; r++)
{
    pos = r * 400;
    for(c = 0; c < 25; c++, pos += 4) {
        if(c == 0) {
            del_x[pos : 100 : pos + 300] =
                s[pos + 1 : 100 : pos + 301] - s[pos : 100 : pos + 300];
        } else {
            del_x[pos : 100 : pos + 300] =
                s[pos + 1 : 100 : pos + 301] - s[pos - 1 : 100 : pos + 299];
        }
        del_x[pos + 1 : 100 : pos + 301] =
            s[pos + 2 : 100 : pos + 302] - s[pos : 100 : pos + 300];
        del_x[pos + 2 : 100 : pos + 302] =
            s[pos + 3 : 100 : pos + 303] - s[pos + 1 : 100 : pos + 301];
        del_x[pos + 3 : 100 : pos + 303] =
            s[pos + 4 : 100 : pos + 304] - s[pos + 2 : 100 : pos + 302];
        if(r == 0) {
            del_y[pos : pos + 3] =
                s[pos + 100 : pos + 103] - s[pos : pos + 3];
        } else {
            del_y[pos : pos + 3] =
                s[pos + 100 : pos + 103] - s[pos - 100 : pos - 97];
        }
        del_y[pos + 100 : pos + 103] =
            s[pos + 200 : pos + 203] - s[pos : pos + 3];
        del_y[pos + 200 : pos + 203] =
            s[pos + 300 : pos + 303] - s[pos + 100 : pos + 103];
        del_y[pos + 300 : pos + 303] =
            s[pos + 400 : pos + 403] - s[pos + 200 : pos + 203];
    }
}

Using the above code we can perform better memory management for our algorithm. As shown in figure 12, we load 4x4 blocks of pixel data from memory into four registers. We calculate the y-derivative for those 16 pixels using subword parallel operations. Then we perform a matrix transpose operation in PLX to get the data in column-wise order, so that we can perform the x-derivative calculation for all 16 pixels. As the boundary pixels have a different derivative mask, we parallelize them along with the intermediate pixels by using predicated execution. For details of this implementation, please refer to our optimized PLX code in the Appendix. Since memory loads are always from 8-byte aligned addresses, we minimize memory access time and obtain better cache performance. For boundary data needed between two iterations, we perform local data communication instead of reloading the data from memory.