Technical University Hamburg-Harburg

Technische Informatik (TI 3)

Prof. Dr. Siegfried M. Rump

Applying Concepts of Software Reuse to the Implementation of Data Warehouse ETL Systems

October, 2001

Jiayang Zhou

Content

STATEMENT

ACKNOWLEDGEMENTS

1.Introduction

1.1.Description of the work

1.2.Scenarios

1.3.The structure of this work

2.Fundamental of software reuse

2.1.What is software reuse?

2.2.Why is software reuse important?

2.3.Economics of software reuse

2.4.Where does software reuse pay off?

2.5.Upon what concept is software reuse based?

2.6.Principles of object-oriented software reuse

2.6.1.Information hiding

2.6.2.Modularity

2.6.3.Adaptability

2.6.4.Modification

2.7.State of the art

3.Data Warehouse Loader: analysis example of software reuse

3.1.Introduction of data warehouse ETL systems

3.1.1.Definition of data warehouse ETL systems

3.1.2.Requirements of data warehouse ETL systems

3.1.3.Context requirement of data warehouse application

3.1.4.Other usage of Data Warehouse Loader

3.2.Architecture of a data warehousing system

3.2.1.Operational data source

3.2.2.Data-transfer

3.2.3.Data warehouse

3.2.3.1.Subject-orientation

3.2.3.2.Integration

3.2.3.3.Time variance

3.2.3.4.Non-volatility

3.2.3.5.Difference between data warehouse and operational systems

3.2.4.Analysis

3.2.5.Presentation

3.2.6.Metadata

3.2.7.Process management

3.2.8.User administration

3.3.The role of Data Warehouse Loader

3.3.1.Functionality of Data Warehouse Loader

3.3.2.Software reusability consideration of Data Warehouse Loader

4.The Implementation of Data Warehouse Loader

4.1.The description of Data Warehouse Loader

4.2.The architecture of Data Warehouse Loader

4.3.Loader-engine: the operation mode of Data Warehouse Loader

4.3.1.Advantages of workflow architecture

4.3.2.Major task of the workflow of Data Warehouse Loader

4.3.2.1.Extraction

4.3.2.2.Transformation

4.3.2.3.Figuring out difference

4.3.2.4.Importing

4.3.2.5.Merging

4.3.2.6.Retrieving

4.4.Loader-interface: interface concept of Data Warehouse Loader

4.4.1.Extraction-interface

4.4.2.Transformation-interface

4.4.3.Database-interface

4.4.4.Record-interface

4.5.Format of intermediate files

4.6.Sorting the linked list of record objects

4.7.Graphic user interface of Data Warehouse Loader

5.Reuse Analysis of Data Warehouse Loader

5.1.Reuse development process of Data Warehouse Loader

5.2.Applying concepts of software reuse

5.2.1.Code reuse

5.2.2.Adaptability

5.2.3.Modularity

5.2.4.Interface

5.3.Reuse architecture analysis of Data Warehouse Loader

6.Summary

6.1.Summary of this work

6.2.Lesson learned

Appendix

STATEMENT

I hereby declare that the present work has been undertaken by myself, using only the help that is referred to within this thesis.

Jiayang Zhou

Hamburg, October 2001

ACKNOWLEDGEMENTS

I would like to express my thanks to Prof. Dr. Siegfried M. Rump, Mr. Stefan Krey, and Mr. Lutz Russek for their essential help and advice throughout the development of this work. Sometimes their help was not merely academic. Since this work was undertaken at sd&m AG (software design & management), I would also like to thank all my colleagues at sd&m AG, Hamburg. Their cooperation and help are appreciated as well.

INTRODUCTION

1.Introduction

1.1.Description of the work

Software reuse is a process of implementing or updating software systems using existing software assets. Software assets can be defined as software components, objects, software requirement analysis, design models, domain architecture, database schema, code, documentation, manuals, standards, test scenarios, and plans. Software reuse may occur within a software system, across similar systems, or in widely different systems. This process provides ways to reduce costs, shorten schedules, and produce quality products.

The importance of software reuse lies in its benefits of providing quality and reliable software in a relatively short time. The computer industry has demonstrated that software reuse generates a significant return on investment by reducing cost, time and effort while increasing the quality, productivity, and maintainability of software systems throughout the software life cycle. In a word, software reuse is advantageous because it:

  • Increases productivity
  • Enhances quality
  • Saves cost
  • Reduces software development schedules
  • Reduces maintenance
  • Enhances standardization
  • Increases portability
  • Contributes to the evolution of a common component warehouse
  • Increases performance

Software reuse is now considered an integral principle of the software engineering process, and it can be approached in a manner similar to the development of computer hardware products. [1]

In this work, the fundamental concept of software reuse with an object-oriented approach is examined. This covers object-oriented software reuse strategies, the reuse paradigm, and the reuse process. The mere use of a certain programming language does not guarantee software reusability; the language must be accompanied by reuse technology, such as tools and methodologies.

Moreover, the general concept of software reuse is applied to the implementation of a data warehouse ETL (Extraction, Transformation and Loading) system. This ETL system, called Data Warehouse Loader, is implemented in Java and therefore serves as an analysis example of software reusability with an object-oriented approach. The standard architecture of a data warehouse application and the role of Data Warehouse Loader within it are explained. The detailed implementation of Data Warehouse Loader is illustrated, including its overall architecture, workflow, and interface concept. Finally, a reuse analysis is conducted to identify the features that make this software highly reusable, that is, to illustrate the relationship between the general reuse concept and the actual implementation scheme of Data Warehouse Loader. In short, Data Warehouse Loader is implemented in a manner that places special emphasis on increasing software reusability, and its overall architecture is designed to favor the application of software reuse concepts. On the other hand, implementing Data Warehouse Loader in Java has the drawback of some performance degradation.

1.2.Scenarios

Software development has had problems since its inception. The cost of software development is constantly increasing. Many projects are challenged or never completed. A challenged project is one that is completed with cost overruns and schedule delays. The percentage of failures is greater than that of successfully completed projects (see Figure 1.1). The computer industry has sought an easy way to reduce the costs and shorten the schedules required for software development, while producing quality software with fewer errors. [1]

Software reuse should be considered as one of the solutions. Software reuse means that ideas and code are developed once and then used to solve many software problems in order to enhance productivity, reliability and quality. Reuse applies not only to source-code fragments, but to all the intermediate products generated during software development, including documentation, system specifications, design architecture and so on.

Reusability is a big issue these days. Pretested software should be used so that cost and time can be saved. The development of object-oriented software means modeling a problem as a set of types or classes from which the objects are created. This set is partitioned into a hierarchical categorization that emphasizes reuse by relegating common characteristics and behaviors to the highest possible level. Once this modeling has been done, coding (the translation of algorithms into programs) is easier because it consists merely of creating the necessary objects from the defined classes and invoking their behavioral operations. Reusable software requires a planned, analyzed, and structured design that withstands thorough testing for functionality, reliability, and modularity. [1]

Here an object-oriented approach to software development is preferred because it leads to reusable classes. Objects are discrete software components that contain data and procedures together. Systems are partitioned based on objects, and data determine the structure of the software. In contrast, data-oriented or event-oriented analysis and design treat operations and data as distinct and loosely coupled: operations determine the structure of the system, and data are of secondary importance. As a result, the cost of software development grows exponentially.
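The idea of relegating common characteristics and behaviors to the highest possible level of a class hierarchy can be sketched in Java as follows. This is a minimal illustration only: the names `HierarchySketch`, `DataSource`, `FileSource`, and `DatabaseSource` are hypothetical and are not taken from Data Warehouse Loader.

```java
// Illustrative sketch; all class names here are assumptions, not part of the thesis software.
class HierarchySketch {

    /** Common characteristics (the name) and behavior (describe) live at the top level. */
    abstract static class DataSource {
        private final String name;

        DataSource(String name) {
            this.name = name;
        }

        // Shared behavior, written once and inherited by every subclass.
        String describe() {
            return kind() + " source '" + name + "'";
        }

        // Only the part that differs is left to the subclasses.
        abstract String kind();
    }

    static class FileSource extends DataSource {
        FileSource(String name) {
            super(name);
        }

        @Override
        String kind() {
            return "file";
        }
    }

    static class DatabaseSource extends DataSource {
        DatabaseSource(String name) {
            super(name);
        }

        @Override
        String kind() {
            return "database";
        }
    }

    public static void main(String[] args) {
        // Coding reduces to creating objects and invoking their operations.
        DataSource[] sources = {
            new FileSource("orders.csv"),
            new DatabaseSource("bookings")
        };
        for (DataSource source : sources) {
            System.out.println(source.describe());
        }
    }
}
```

Adding a further kind of source in this sketch requires only one new subclass that supplies `kind()`; the shared `describe()` logic is reused unchanged.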

In 1998, sd&m AG (software design & management) completed a project called START-MDB (Management Database) to build a data warehouse application for START Holding in Frankfurt. One part of this data warehouse application is Data Warehouse Loader, which extracts data from different operational data sources, transforms the data into the required format, and loads the data into the target data warehouse. This sd&m Data Warehouse Loader is implemented in C, which is suitable in this case, since the operation mode of Data Warehouse Loader is non-object-oriented. Because ANSI C was chosen as the programming language, it is possible to migrate between different system platforms, for example, from Windows NT to Unix. On account of the requirements of the data warehouse application, this sd&m Data Warehouse Loader was designed to be reusable from the beginning. That means it can be adapted to different data transformation schemes, different operational data sources and different target data warehouses. However, it is difficult to realize this reusability with C. The code is hard to read, and it needs to be recompiled whenever a new database is added or a new data transformation scheme is required.

Therefore, the task of this work is to implement Data Warehouse Loader in Java. From the programming language point of view, Java has several advantages, as follows:

  • Java is object-oriented from the ground up, which means it was explicitly designed from the start to be object-oriented. C, however, is not an object-oriented language.
  • Java has a facility called Interface, whose name indicates its primary use: specifying a set of methods that represent a particular class interface, which can be implemented individually in a number of different classes. All of these classes then share this common interface, and the methods in it can be called polymorphically. C does not have the interface concept.
  • Java is compiled to a machine-independent low-level code called byte code. This byte code is then interpreted by the Java Virtual Machine running on the particular machine. This gives Java code platform independence, which means that the same byte code can be run on any of a huge variety of machines with different operating systems. Porting a Java program to another machine does not even require recompilation. The cost is a run-time slowdown of up to a factor of 5.
  • The Java Virtual Machine can carry out a number of checks that a program is running properly, for example, on array bounds, memory access, viruses in byte code and so on. Accordingly, a Java program is more robust and secure than a C program.
  • When a C program requests memory to use as workspace, it must keep track of it and return it to the operating system when it ceases to use it. This requires extra programming and extra care. In Java, this task is carried out automatically by garbage collection. An object that is no longer used is automatically destroyed and its memory is released.
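The Interface facility described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `InterfaceSketch`, `Transformation`, `TrimTransformation`, `UpperCaseTransformation`, and `applyAll` are hypothetical and do not reproduce the actual interfaces of Data Warehouse Loader.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative sketch; every name below is an assumption made for this example.
class InterfaceSketch {

    /** A common interface: the set of methods every implementing class must provide. */
    interface Transformation {
        String transform(String record);
    }

    /** One implementation: removes leading and trailing whitespace. */
    static class TrimTransformation implements Transformation {
        @Override
        public String transform(String record) {
            return record.trim();
        }
    }

    /** Another, independent implementation: converts the record to upper case. */
    static class UpperCaseTransformation implements Transformation {
        @Override
        public String transform(String record) {
            return record.toUpperCase(Locale.ROOT);
        }
    }

    /** Calls transform polymorphically; it only knows the interface, not the classes. */
    static String applyAll(List<Transformation> steps, String record) {
        String result = record;
        for (Transformation step : steps) {
            result = step.transform(result);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Transformation> steps = Arrays.<Transformation>asList(
                new TrimTransformation(), new UpperCaseTransformation());
        System.out.println(applyAll(steps, "  hamburg ")); // prints "HAMBURG"
    }
}
```

Because `applyAll` depends only on the interface, a new transformation class can be added in this sketch without changing the calling code.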

Additionally, Java is now used worldwide. The management trend in most firms is to adopt Java programs in their organization. That is also true for management database systems. Therefore, implementing Data Warehouse Loader in Java will make this software more acceptable to the market.

1.3.The structure of this work

Chapter 1 gives the description of the work and the wider scenario, which includes the current situation of software development, the importance of software reuse and the motivation for this project.

Chapter 2 introduces the fundamental concepts of software reuse and some state-of-the-art software reuse technology, which is important for a better understanding of software architecture design in favor of reusability. It discusses the definition, importance, economics, and basis of software reuse, among other topics.

Chapter 3 introduces Data Warehouse Loader, which is the analysis example of software reuse in this work. First, an introduction to ETL systems, of which Data Warehouse Loader is one, is given. Together with the explanation of the standard architecture of a data warehouse application, in which Data Warehouse Loader resides, the role of Data Warehouse Loader is introduced, namely its functionality and its reuse considerations.

Chapter 4 describes the detailed implementation of Data Warehouse Loader. It presents the practical part of this work, including the overall architecture, the workflow and the interface concept of Data Warehouse Loader. This chapter explains how the classes relate to each other, how each stage of the Data Warehouse Loader workflow works, how the intermediate files are formatted, how each linked list of record objects is sorted, and so on.

Chapter 5 is the link between the theoretical concept of software reuse and the practical implementation of Data Warehouse Loader. In this chapter, the reuse analysis shows how the abstract concept is applied.

Chapter 6 offers the conclusion of the whole project. Some lessons learned during the general software reuse process and some drawbacks of this work are also shown.

The Appendix contains the references of this work.

FUNDAMENTAL OF SOFTWARE REUSE

2.Fundamental of software reuse

2.1.What is software reuse?

Software reuse is defined as the process of implementing or updating software systems using existing software assets. Software reuse can occur within a system, across similar systems, or in widely different systems. The term “asset” was selected to express that software can have lasting value. Reusable software assets include more than just code. Requirements, designs, models, algorithms, tests, documents, and many other products of the software process can be reused. [2]

Software reuse is a concept to acquire high-leverage software, which has the potential to be reused across applications. However, as in many cases, taking a simple idea and making it happen in reality often is not as easy as it sounds. Details have to be worked out before the concept can be made to work in practice.

2.2.Why is software reuse important?

Systematic software reuse revolves around the planned development and exploitation of reusable software assets within or across applications and product lines. Its primary goal is to save money, time, or both. It succeeds when the amount of resources required to deliver an acceptable product is reduced. It tries to take advantage of software that exists or can be purchased off the shelf. It motivates addressing the management, technical, and people issues that inhibit reuse. At its most basic, software reuse is motivated by the desire to get the job done cheaply and quickly.

At this point, one might ask why software reuse is especially important. Are there many firms doing it? Do most developers build their software to be reused? Has the underlying technology needed for software reuse been around for years? Are guidelines for this systematic reuse practice available? Are there examples that illustrate successful reuse stories?

Unfortunately, the answer to each of these questions was no until recently. Most practitioners have not figured out how to do it in a repeatable and systematic manner. The reason is that the technology needed simply was not available until recently. The arrival of object-oriented approaches and languages, domain engineering methods, integrated software development environments and new process paradigms makes broad-spectrum software reuse possible. Advances in software architecture provide us with the foundation for software reuse, while a consensus on related standards provides us with the building codes.

Figure 2.1 Reuse maturity distributions

For the most part, software reuse tends to be done ad hoc in most firms. As illustrated in Figure 2.1, most of the firms whose software reuse processes have been evaluated using a reuse maturity model [21] are not using the state of the art. Reuse processes are not well defined and practices are not institutionalized in the majority of the firms. This analysis assumes that the processes that organizations use to manage product lines, architectures, and software reuse should be part of their business practice framework. Reuse considerations need to be incorporated into each of the five levels of process maturity identified by the model: Level 1 (ad hoc), Level 2 (repeatable), Level 3 (defined), Level 4 (managed), Level 5 (optimizing). Please see Table 2.1. [2]

Maturity level / Name / Characteristics
1 / Ad hoc reuse / Reuse occurs ad hoc; reuse is neither repeatable nor managed
2 / Project-wide reuse / Reuse is a product of a project, not a process; reuse is repeatable on a project-by-project basis
3 / Organization-wide reuse / Reuse assets are a product of the process; reuse is part of the way the organization does business
4 / Product-line reuse / Reusable assets are a product of the process; reuse is viewed as a business unto itself
5 / Broad-spectrum reuse / Reuse is an integral part of the corporate culture; processes are optimized with reuse in mind

Table 2.1: Process Maturity Models

2.3.Economics of software reuse

With the recent push to downsize or outsource, software costs have to be cut. Most improvement strategies pursued today aim either to reduce the inputs needed to finish the job (such as people, time, and equipment) or to increase the outputs generated per unit of input.

This dual nature of software productivity can be represented notionally using the following equation [2]:

Productivity = Outputs / Inputs used to generate the results

When focusing on the equation’s input side, software engineers can be equipped with more workstations, CASE tools, mature processes, and the like. With this approach, an automation strategy is used to obtain more output from the people. Just the reverse happens when the focus is on the output side of the equation. Instead of concentrating on improving staff efficiency, the emphasis is on reusing existing assets to get more output per unit of input. In either case, the strategies employed tend to be complementary. For example, increased automation can lead to increased reuse.

2.4.Where does software reuse pay off?

Industry has realized a significant payoff by instituting systematic software reuse practices. For example, Wayne Lim of Hewlett-Packard reported the following benefits attributable to their software reuse initiative in IEEE Software magazine [7]: