Implementation of JPEG 2000 component algorithm—DWT in TI TMS32060

Peng Zhang and Xun Zhang

Department of Electrical and Computer Engineering

University of Wisconsin - Madison

,

1

Abstract: In this project, we implemented 2D-DWT algorithm, which is used as a very important algorithm in JPEG2000, in TI TMS320C6. The next step is to optimize our implantation by unfolding and retiming. Lastly, we compared the result with existing implementations.

1.Introduction

To address higher performance of still image encoding, a new image compression standard is currently being developed, the JPEG2000. Itis not only intended to provide rate-distortion andsubjective image quality performance superior toexisting standards, but also to provide features andfunctionalities that current standards can either notaddress efficiently or in many cases cannot address atall.

DWT is the key algorithm using in JPEG2000. The standard works on image tiles. The term ‘tiling’ refers to the partition of the original (source) image into rectangular non-overlapping blocks (tiles), which are compressed independently, as though they were entirely distinct images. The tile components are decomposed into different decomposition levels using DWT.

TI TMS320C6 series DSP is popular. Implementation of DWT on this platform is a good way to get familiar with the DSP and DWT algorithm, as well as the way to apply algorithm transformation techniques. By doing this project, we will get to know how to map the signal processing algorithm onto a processor, which is a goal of this course.

The remainder of this paper is organized in thefollowing fashion: Section 2 contains the description of Jpeg 2000 and DWT algorithm.Section 3is about the implementation on TMS320 C62x platform. Section 4 talks about the optimization of the assembly code with loop pipelining.Brief conclusions are presented in Section 5.

2.Jpeg2000 and DWT Algorithms

JPEG 2000 is intended to be a new still image coding standard for different types of still images (bi-level, gray-level, color, multi-component, hyper-component) with different characteristics (natural, scientific, medical, remote sensing, text, rendered graphics, etc.) allowing different imaging models (client/server, real-time transmission, image library archival, limited buffer and bandwidth resources, etc.) preferably within a unified and integrated system. This coding system is intended to provide low bit-rate operations with rate-distortion and subjective image quality performance superior to existing standards, without sacrificing performance at other points in the rate-distortion spectrum. In addition, this system could include many modern features such as error resilience, up-to-lossless embedded bit-stream, region of interest coding, progressive coding by resolution and by accuracy, etc.

The jpeg2000 coding mode is 2D-Discrete Wavelet Transform, including non-reversible and reversible transforms

The coder is essentially a bit-plane coder, using the same Layered Zero Coding (LZC) techniques which have been employed in a number of embedded Wavelet coders and were originally proposed by Taubman and Zakhor. In fact, many of the ideas presented in the VM4.0, including the use of separate code blocks and post-compression rate-distortion optimization are taken directly from that work and Dr. Taubman's own doctoral dissertation (UC Berkeley, 1994). The key additions are:

  • The use of fractional bit-planes, in which the quantization symbols for any given quantization layer (or bit-plane) are coded in a succession of separate passes, rather than just one pass
  • A simple embedded quad-tree algorithm is used to identify whether or not each of a collection of "sub-blocks" contains any non-zero (significant) samples at each quantization layer, so that the encoding and decoding algorithms need only visit those samples which lie within sub-blocks which are known to have significant samples.

Discrete Wavelet Transform

Wavelet-based analysis of signals is an interesting, and relatively recent, new tool. Similar to Fourier series analysis, where sinusoids are chosen as the basis function, wavelet analysis is also based on a decomposition of a signal using an orthonormal (typically, although not necessarily) family of basis functions. Unlike a sine wave, a wavelet has its energy concentrated in time. Sinusoids are useful in analyzing periodic and time-invariant phenomena, while wavelets are well suited for the analysis of transient, time-varying signals.

A wavelet expansion is similar in form to the well-known Fourier series expansion, but is defined by a two-parameter family of functions

Wherej and k are integers and the functions are the wavelet expansion functions. As indicated earlier, they usually form an orthogonal basis. The two-parameter expansion coefficients are called the discrete wavelet transform (DWT) coefficients of and Equation (8.6.1) is known as the synthesis formula (i.e., inverse transformation). The coefficients are given by

The wavelet basis functions are a two-parameter family of functions that are related to a function called the generating or mother wavelet by

Wherek is the translation and j is the dilation or compression parameter. Therefore, wavelet basis functions are obtained from a single wavelet by translation and scaling. There is, however, no single and universal mother wavelet function. The mother wavelet must simply satisfy a small set of conditions and is typically selected based on the signal processing problem domain. Almost all useful wavelet systems satisfy the multi-resolution condition. This means that given an approximation of a signal using translations of a mother wavelet up to some chosen scale, we can achieve a better approximation by using expansion signals with half the width and half as wide translation steps. This is conceptually similar to improving frequency resolution by doubling the number of harmonics (i.e., halving the fundamental harmonic) in a Fourier series expansion. It is important to note that the wavelet functions never actually enter into the calculation of the discrete wavelet transform.

Image transforms

The 1D discrete wavelet transform is calculated using Mallat's algorithm. The transform coefficients, 's and 's at different scales, are calculated using the following convolution-like expressions:

Wherej denotes the resolution and k is the index for the samples. The operation defined in Equation (8.6.4) is a linear digital filtering operation using filters h and g, followed by down-sampling. The co-efficients 's and 's are known respectively as the level j scaling and wavelet coefficients. The top-level coefficients represent the original signal. For a signal of length N, where N is a power of two, we get . In such a case, the iteration may be repeated J times with the last stage being of length one, one scaling coefficient, and one wavelet coefficient. Note that the iteration is performed over the scaling coefficients only. Here is an illustration for.

Filters h and g are FIR quadrature-mirror filters known as the scaling and wavelet filters, respectively. The scaling filter is a lowpass filter, while the wavelet filter is highpass. For an even length scaling filter, the two are related by the following formula:

,

1D-DWT

The 1D series form of the inverse discrete wavelet transform (IDWT) is given by the following formula:

Wherej denotes the scale and k is the vector index. The operation defined in Equation (8.6.7) is an up-sampling operation followed by linear digital filtering. At scale j, the scaling and wavelet coefficients are combined to form the level scaling coefficients. The top-level sequence represents the original signal. For a signal of length N, where N is a power of two, we get . In such a case, the iteration may be repeated J times, beginning with two one-sample long sequences and ending with a sequence N samples long.

The basis sequences of the discrete wavelet transform are given by the rows of the transformation matrix. Here we show the sequences of length .

2D-DWT

Similarly, the 2D discrete wavelet transform is computed using 2D filtering and down-sampling operations. At each scale, the coefficients are given by the following formulas:




where the filters , , , and are formed by vector outer products of the quadrature mirror filters h and g.

, ,

,.

The 2D series formulation of the inverse discrete wavelet transform is given by

where the filters , , , and are formed by vector outer products of the quadrature mirror filters h and g (see Equation (8.6.9)).

3.TMS320 C6 implementation

TMS320C6000™ Platformoffers a broad portfolio of the industry's fastest DSPs running at clock speeds up to 1 GHz. In this project, TMS320C62x™ fixed-point generation is chosen to implement the DWT algorithm.

Implementation in C

Forward and Inverse 5-3 and 9-7 2D-DWT were implemented. Source file will be attached at the end of the report.

In order to focus on the TMS320C62x™ implementation and optimization, only forward 5-3 2-D DWT was chosen to be developed on TMS320C62x™.

Forward 5-3 wavelet transform in 2-D includes several functions:

Forward lazy transformation

// Forward lazy transform.
void dwt_deinterleave (int *a, int n, int x)

{
int dn, sn, i;
int *b;
dn=n/2;
sn=(n+1)/2;
b=(int*)malloc(n*sizeof(int));
for (i=0; i<sn; i++)
b[i]=a[2*i*x];
for (i=0; i<dn; i++)
b[sn+i]=a[(2*i+1)*x];
for (i=0; i<n; i++)
a[i*x]=b[i];
free(b);
}

Forward 5-3 wavelet transform in 1-D

// Forward 5-3 wavelet transform in 1-D.
void dwt_encode_1(int *a, int n, int x)
{
int dn, sn, i;
dn=n/2;
sn=1+(n-1)/2;
for (i=0; i<dn; i++)
D(i)-=(S_(i)+S_(i+1))>1;
for (i=0; i<sn; i++)
S(i)+=(D_(i-1)+D_(i)+2)>2;
dwt_deinterleave(a, n, x);
}

Forward 5-3 wavelet transform in 2-D

// Forward 5-3 wavelet transform in 2-D.

/*
* Apply a reversible inverse DWT transform to a component of an image
* a: samples of the component
* w: width of the component
* h: height of the component
* l: number of decomposition levels in the DWT
*/
void dwt_encode(int* a, int w, int h, int l)
{
int i, j, rw, rh;
for (i=0; i<l; i++) {
rw=int_ceildivpow2(w, i);
rh=int_ceildivpow2(h, i);
for (j=0; j<rw; j++)
dwt_encode_1(a+j, rh, w);
for (j=0; j<rh; j++)
dwt_encode_1(a+j*w, rw, 1);
}
}

Implementation in Assembly with TMS320 C6000™ Compiler

TI's C6000 has Compile Tools, which are co-developed with the DSP architecture and declared to offer the embedded software developer best-in-class performance, with industry-leading global view analysis and architecture-specific optimizations that include interactive profiling, tuning and feedback.

One of the major areas of focus for the C6000 compiler has been VLIW architecture specific optimizations. Since the release 1.0 of the C6000 tools in February 1997, TI has continued to drive C performance improvements on key DSP MIPS intensive code. Software pipelining, data path partitioning, inner and nested loop optimization, unrolling, predication forms the basis of the C6000 specific enhancements that are included.

In addition to this core set of optimizations, TI offers something unique in the industry called compiler feedback, telling the developer exactly how the compiler did and what potential areas could be improved. This, used in conjunction with the compiler tutorial feedback solutions, can be used to provide the developer valuable insight when tuning C code.

Assembly code was generated with TMS320C6000™ Compiler Tools, using 3rd level optimization [see attachment 2].

4.Assembly code optimization

TMS320C6000™ Compiler has done the optimization on speed. However, potential enhancement could still be achieved.

For an example, there is a “for” loop in function dwt_deinterleave:

for (i=0; i<sn; i++)

b[i]=a[2*i*x];

C Compiler translates this loop into the following assembly code:

  • ;** ------*
  • L1:
  • [ B0] B .S1 L1 ; |32|
  • || LDW .D1T1 *A0,A5 ; |32|
  • ADD .S1 A6,A0,A0 ; |32|
  • [ B0] SUB .D2 B0,1,B0 ; |32|
  • NOP 2
  • STW .D1T1 A5,*A3++ ; |32|
  • ;** ------*

0 / … / 6i+1 / 6i+2 / 6i+3 / 6i+4 / 6i+5 / 6i+6
… / (load a[2ix]) / sn-1
&a[0] / … / &a[2(i+1)x]
… / a[i]
&b[0] / … / &b[i+1]
… / b[i]=a[i]

Table 1 shows the execution of this loop. Before clock cycle 6*i+1, address of a[2*i*x] has been stored in register A0. In clock cycle 6*i+1, loading a[2*i*x] to register A5 is dispatched. Loop control variable stored in B0 is also checked to decide whether the loop kernel will be executed again after 6 clock cycles or not. In clock cycle 6*i+2, address of a[2*(i+1)*x] is computed and stored in register A0 for the following loop. In clock cycle 6*i+3, Loop control variable is decreased and stored in B0. After 2 clock cycles without any operations,the value of a[2*i*x] has been loaded to A5, and the store of a[2*(i+1)*x] to b[i] whose address is the value of register A3 is dispatched, the value of A3 also incremented to point to the memory of

b[i+1].

Table 1. loop kernel execution of speed optimized assembly code

In this code, it takes 6 clock cycles to finish each loop. Assume sn = n+1, it takes6*(n+1) clock cycles to finish the whole “for” loop. Nevertheless, more improvement of speed could be realized.

Basically the idea is to pipeline 6 loops together while sn > 6.

Take the same “for” loop for the example, the pipeline optimized assembly code is shown as following:

  • ;** ------*
  • ; PIPELINED LOOP PRE-PROCESS
  • L2:
  • ADD .S1 A3,A4,A4
  • || SUB .D2 B0,7,B0
  • || LDW .D1T1 *A4,A6
  • L3:
  • MV .S2X A0,B4
  • || [ B0] B .S1 L4
  • || ADD .L1 A3,A4,A0
  • || [ B0] SUB .D2 B0,1,B0
  • || LDW .D1T1 *A4,A0
  • ADD .L1 A3,A0,A0
  • || [ B0] SUB .D2 B0,1,B0
  • || LDW .D1T1 *A0,A0
  • || [ B0] B .S1 L4
  • [ B0] B .S1 L4
  • || ADD .L1 A3,A0,A0
  • || [ B0] SUB .D2 B0,1,B0
  • || LDW .D1T1 *A0,A0
  • ADD .L1 A3,A0,A0
  • || [ B0] SUB .D2 B0,1,B0
  • || LDW .D1T1 *A0,A0
  • || [ B0] B .S1 L4
  • MV .S2X A6,B5
  • || [ B0] B .S1 L4
  • || ADD .L1 A3,A0,A4
  • || [ B0] SUB .D2 B0,1,B0
  • || LDW .D1T1 *A0,A0
  • ;** ------*
  • ; PIPELINED LOOPKERNEL
  • L4:
  • STW .D2T2 B5,*B4++
  • || MV .S2X A0,B5
  • || [ B0] B .S1 L4
  • || ADD .L1 A3,A4,A4
  • || [ B0] SUB .L2 B0,1,B0
  • || LDW .D1T1 *A4,A0
  • ;** ------*
  • ; PIPELINED LOOP PAST-PROCESS
  • L5:
  • MV .S2X A0,B5
  • || STW .D2T2 B5,*B4++
  • MV .S2X A0,B5
  • || STW .D2T2 B5,*B4++
  • MV .S2X A0,B5
  • || STW .D2T2 B5,*B4++
  • MVC .S2 B6,CSR
  • || MV .L2X A0,B5
  • || STW .D2T2 B5,*B4++
  • MV .S2X A0,B5
  • || STW .D2T2 B5,*B4++
  • STW .D2T2 B5,*B4++
  • ;** ------*

Table 2 shows the execution of the optimized loop, with assumption sn = n+1.

0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / … / n-5
sn-7 / sn-8 / sn-9 / sn-10 / sn-11 / sn-12 / sn-13 / … / 0
&a[0] / &a[2x] / &a[4x] / &a[6x] / &a[8x] / &a[10x] / &a[12x] / &a[14x] / …
a[0] / a[2x] / …
&b[0] / &b[1] / …
b[0]=a[0] / …
… / n+1 / n+2 / n+3 / n+4 / n+5 / n+6 / n+7

… / &a[2(n+1)x]
… / a[2(n-5)x] / a[2(n-4)x] / a[2(n-3)x] / a[2(n-2)x] / a[2(n-1)x] / a[2nx]
… / &b[n-5] / &b[n-4] / &b[n-3] / &b[n-2] / &b[n-1] / &b[n]
… / b[n-6]=
a[2(n-6)x] / b[n-5]=
a[2(n-5)x] / b[n-4]=a[2(n-4)x] / b[n-3]=a[2(n-3)x] / b[n-2]=a[2(n-2)x] / b[n-1]=a[2(n-1)x] / b[n]=a[2nx]

Table 2. loop kernel execution of pipeline optimized assembly code

Clock cycle 1 to 6 is the pre-process of the pipeline. 6 loading instructions are dispatched and 6 pipeline kernel executions (L4) are scheduled with branch instruction.

From clock cycle 7, pipeline loop kernel begins to execute. In clock cycle i, store value of register B5 (a[2*(i-7)*x]) to b[i-7] is dispatched. In the same clock cycle, (a[2*(i-6)*x]) is moved to B5 for the following store operation, loading a[2*(i-1)*x] for cycle i+6 is dispatched, address of a[2*i*x] and b[i-6] is calculated. At the same time, Loop control variable stored in B0 is also checked to decide whether the loop kernel will be executed again after 6 clock cycles or not.

In this way, the throughput of the pipeline could achieve 1 / cycle.

Clock cycle n+2 to n+7 is the pre-process of the pipeline. 6 store instructions are dispatched, and in clock cycle n+9, the loop will be finished.

With pipelining the loop, it takes n+9 clock cycles to finish the whole “for” loop while assuming sn = n+1.

Pipeline is used to optimize all the loops during which no functions are called. The optimized code is attached.

5.Conclusion

In this project, 2-D DWT algorithm is implemented on TMS320C6000™ platform. C6000 Compile Tools are used to generate the assembly code, which has been tested, and pipeline of 6 iterations is used to achieve a higher speed performance. In this way, the loop which takes 6*(n+1) clock cycles to finish could be completed in n+9 cycles.

More methods and techniques could be applied for the optimization, which calls for the following study.

References

[1] M. Rabbani, R. Joshi, An overview of the JPEG2000 still imagecompression standard, Signal Processing: Image Communication 17 (2002) 3–48

[2] M. Vetterli, J. Kovacevic, Wavelet and Subband Coding, Prentice-Hall, Englewood Cliffs, NJ, 1995.

[3] J.D. Villasenor, B. Belzer, J. Liao, Wavelet filter evaluation for image compression, IEEE Trans. Image Process. 4 (8) (August 1995) 1053–1060.

[4]Writing Interruptible Looped Code for theTMS320C6x, TI

[5] DSP TMS320C62x/C67x Programmer’s Guide, TI

Attachments

  1. 2-D DWT implementation in C
  2. fix.h:

#ifndef __FIX_H
#define __FIX_H
int fix_mul(int a, int b);
#endif

  • fix.c:

#include "fix.h"
#ifdef WIN32
#define int64 __int64
#else
#define int64 long long
#endif
// Multiply two fixed-precision rational numbers.
int fix_mul(int a, int b) {
return (int)((int64)a*(int64)b>13);
}

  • int.h:

#ifndef __INT_H
#define __INT_H
int int_min(int a, int b);
int int_max(int a, int b);
int int_clamp(int a, int min, int max);
int int_abs(int a);
int int_ceildiv(int a, int b);
int int_ceildivpow2(int a, int b);
int int_floordivpow2(int a, int b);
int int_floorlog2(int a);
#endif

  • int.c:

#include "int.h"

// Get the minimum of two integers.
int int_min(int a, int b) {
return a<b?a:b;
}
// Get the maximum of two integers.
int int_max(int a, int b) {
return a>b?a:b;
}
// Clamp an integer inside an interval.
int int_clamp(int a, int min, int max) {
if (a<min) return min;
if (a>max) return max;
return a;
}
//Get absolute value of integer.
int int_abs(int a) {
return a<0?-a:a;
}
// Divide an integer and round upwards.
int int_ceildiv(int a, int b) {
return (a+b-1)/b;
}
// Divide an integer by a power of 2 and round upwards.
int int_ceildivpow2(int a, int b) {
return (a+(1<b)-1)>b;
}
// Divide an integer by a power of 2 and round downwards.
int int_floordivpow2(int a, int b) {
return a>b;
}
// Get logarithm of an integer and round downwards.
int int_floorlog2(int a) {
int l;
for (l=0; a>1; l++) {
a>=1;
}
return l;
}

  • dwt.h:

#ifndef __DWT_H
#define __DWT_H
/*
* Apply a reversible DWT transform to a component of an image
* a: samples of the component
* w: width of the component
* h: height of the component
* l: number of decomposition levels in the DWT
*/
void dwt_encode(int* a, int w, int h, int l);
/*
* Apply a reversible inverse DWT transform to a component of an image
* a: samples of the component
* w: width of the component
* h: height of the component
* l: number of decomposition levels in the DWT
*/
void dwt_decode(int* a, int w, int h, int l);
/*
* Get the gain of a subband for the reversible DWT
* orient: number that identifies the subband (0->LL, 1->HL, 2->LH, 3->HH)
*/
int dwt_getgain(int orient);
/*
* Get the norm of a wavelet function of a subband at a specified level for the reversible DWT
* level: level of the wavelet function
* orient: band of the wavelet function
*/
double dwt_getnorm(int level, int orient);
/*
* Apply an irreversible DWT transform to a component of an image
* a: samples of the component
* w: width of the component
* h: height of the component
* l: number of decomposition levels in the DWT
*/
void dwt_encode_real(int* a, int w, int h, int l);
/*
* Apply an irreversible inverse DWT transform to a component of an image
* a: samples of the component
* w: width of the component
* h: height of the component
* l: number of decomposition levels in the DWT
*/
void dwt_decode_real(int* a, int w, int h, int l);
/*
* Get the gain of a subband for the irreversible DWT
* orient: number that identifies the subband (0->LL, 1->HL, 2->LH, 3->HH)
*/
int dwt_getgain_real(int orient);
/*
* Get the norm of a wavelet function of a subband at a specified level for the irreversible DWT
* level: level of the wavelet function
* orient: band of the wavelet function
*/
double dwt_getnorm_real(int level, int orient);
#endif

  • dwt.c:

#include "dwt.h"
#include "int.h"
#include "fix.h"
#include <stdlib.h>
#define S(i) a[x*(i)*2]
#define D(i) a[x*(1+(i)*2)]
#define S_(i) ((i)<0?S(0):((i)>=sn?S(sn-1):S(i)))
#define D_(i) ((i)<0?D(0):((i)>=dn?D(dn-1):D(i)))
// This table contains the norms of the 5-3 wavelets for different bands.
double dwt_norms[4][10]={
{1.000, 1.500, 2.750, 5.375, 10.68, 21.34, 42.67, 85.33, 170.7, 341.3},
{1.038, 1.592, 2.919, 5.703, 11.33, 22.64, 45.25, 90.48, 180.9},
{1.038, 1.592, 2.919, 5.703, 11.33, 22.64, 45.25, 90.48, 180.9},
{.7186, .9218, 1.586, 3.043, 6.019, 12.01, 24.00, 47.97, 95.93}
};
// This table contains the norms of the 9-7 wavelets for different bands.
double dwt_norms_real[4][10]={
{1.000, 1.965, 4.177, 8.403, 16.90, 33.84, 67.69, 135.3, 270.6, 540.9},
{2.022, 3.989, 8.355, 17.04, 34.27, 68.63, 137.3, 274.6, 549.0},
{2.022, 3.989, 8.355, 17.04, 34.27, 68.63, 137.3, 274.6, 549.0},
{2.080, 3.865, 8.307, 17.18, 34.71, 69.59, 139.3, 278.6, 557.2}
};
// Forward lazy transform.
void dwt_deinterleave(int *a, int n, int x) {
int dn, sn, i;
int *b;
dn=n/2;
sn=(n+1)/2;
b=(int*)malloc(n*sizeof(int));
for (i=0; i<sn; i++)
b[i]=a[2*i*x];
for (i=0; i<dn; i++)
b[sn+i]=a[(2*i+1)*x];
for (i=0; i<n; i++)
a[i*x]=b[i];
free(b);
}
// Inverse lazy transform.
void dwt_interleave(int *a, int n, int x) {
int dn, sn, i;
int *b;
dn=n/2;
sn=(n+1)/2;
b=(int*)malloc(n*sizeof(int));
for (i=0; i<sn; i++)
b[2*i]=a[i*x];
for (i=0; i<dn; i++)
b[2*i+1]=a[(sn+i)*x];
for (i=0; i<n; i++)
a[i*x]=b[i];
free(b);
}
// Forward 5-3 wavelet tranform in 1-D.
void dwt_encode_1(int *a, int n, int x)
{
int dn, sn, i;
dn=n/2;
sn=1+(n-1)/2;
for (i=0; i<dn; i++)
D(i)-=(S_(i)+S_(i+1))>1;
for (i=0; i<sn; i++)
S(i)+=(D_(i-1)+D_(i)+2)>2;
dwt_deinterleave(a, n, x);
}
// Inverse 5-3 wavelet tranform in 1-D.
void dwt_decode_1(int *a, int n, int x)
{
int dn, sn, i;
dn=n/2;
sn=1+(n-1)/2;
dwt_interleave(a, n, x);
for (i=0; i<sn; i++)
S(i)-=(D_(i-1)+D_(i)+2)>2;
for (i=0; i<dn; i++)
D(i)+=(S_(i)+S_(i+1))>1;
}
// Forward 5-3 wavelet tranform in 2-D.
void dwt_encode(int* a, int w, int h, int l)
{
int i, j, rw, rh;
for (i=0; i<l; i++) {
rw=int_ceildivpow2(w, i);
rh=int_ceildivpow2(h, i);
for (j=0; j<rw; j++)
dwt_encode_1(a+j, rh, w);
for (j=0; j<rh; j++)
dwt_encode_1(a+j*w, rw, 1);
}
}
// Inverse 5-3 wavelet tranform in 2-D.
void dwt_decode(int* a, int w, int h, int l)
{
int i, j, rw, rh;
for (i=l-1; i>=0; i--) {
rw=int_ceildivpow2(w, i);
rh=int_ceildivpow2(h, i);
for (j=0; j<rh; j++)
dwt_decode_1(a+j*w, rw, 1);
for (j=0; j<rw; j++)
dwt_decode_1(a+j, rh, w);
}
}
// Get gain of 5-3 wavelet transform.
int dwt_getgain(int orient) {
if (orient==0) return 0;
if (orient==1 || orient==2) return 1;
return 2;
}
// Get norm of 5-3 wavelet.
double dwt_getnorm(int level, int orient) {
return dwt_norms[orient][level];
}
// Forward 9-7 wavelet transform in 1-D.
void dwt_encode_1_real(int *a, int n, int x)
{
int dn, sn, i;
dn=n/2;
sn=1+(n-1)/2;
for (i=0; i<dn; i++)
D(i)-=fix_mul(S_(i)+S_(i+1), 12993);
for (i=0; i<sn; i++)
S(i)-=fix_mul(D_(i-1)+D_(i), 434);
for (i=0; i<dn; i++)
D(i)+=fix_mul(S_(i)+S_(i+1), 7233);
for (i=0; i<sn; i++)
S(i)+=fix_mul(D_(i-1)+D_(i), 3633);
for (i=0; i<dn; i++)
D(i)=fix_mul(D(i), 5038);
for (i=0; i<sn; i++)
S(i)=fix_mul(S(i), 6660);
dwt_deinterleave(a, n, x);
}
// Inverse 9-7 wavelet transform in 1-D.
void dwt_decode_1_real(int *a, int n, int x)
{
int dn, sn, i;
dn=n/2;
sn=1+(n-1)/2;
dwt_interleave(a, n, x);
for (i=0; i<sn; i++)
S(i)=fix_mul(S(i), 10076);
for (i=0; i<dn; i++)
D(i)=fix_mul(D(i), 13320);
for (i=0; i<sn; i++)
S(i)-=fix_mul(D_(i-1)+D_(i), 3633);
for (i=0; i<dn; i++)
D(i)-=fix_mul(S_(i)+S_(i+1), 7233);
for (i=0; i<sn; i++)
S(i)+=fix_mul(D_(i-1)+D_(i), 434);
for (i=0; i<dn; i++)
D(i)+=fix_mul(S_(i)+S_(i+1), 12993);
}
// Forward 9-7 wavelet transform in 2-D.
void dwt_encode_real(int* a, int w, int h, int l)
{
int i, j, rw, rh;
for (i=0; i<l; i++) {
rw=int_ceildivpow2(w, i);
rh=int_ceildivpow2(h, i);
for (j=0; j<rw; j++)
dwt_encode_1_real(a+j, rh, w);
for (j=0; j<rh; j++)
dwt_encode_1_real(a+j*w, rw, 1);
}
}
// Inverse 9-7 wavelet transform in 2-D.
void dwt_decode_real(int* a, int w, int h, int l)
{
int i, j, rw, rh;
for (i=l-1; i>=0; i--) {
rw=int_ceildivpow2(w, i);
rh=int_ceildivpow2(h, i);
for (j=0; j<rh; j++)
dwt_decode_1_real(a+j*w, rw, 1);
for (j=0; j<rw; j++)
dwt_decode_1_real(a+j, rh, w);
}
}
// Get gain of 9-7 wavelet transform.
int dwt_getgain_real(int orient) {
return 0;
}
// Get norm of 9-7 wavelet.
double dwt_getnorm_real(int level, int orient) {
return dwt_norms_real[orient][level];
}