A View over Distributed Databases

By

Vinod Bobba

Introduction:

Over the past decade, distributed databases have been progressing toward addressing the more complex issues in database management. The next generation of database applications will be larger and more complex: the amount of data will be greater, and there will be a need to support complex objects, multimedia, and rules processing. There is therefore a need to change DBMS algorithms and the way they integrate data. This trend represents a move toward distributed database access in a more complex environment. Organizations will need to integrate large amounts of data in a distributed setting, where the problems of size, security, and consistency are difficult to manage. A distributed database is defined as a global database that appears centralized to its users but whose portions are distributed across a variety of interconnected sites. In this paper I point out some of the advantages over a centralized database, but I do not attempt a full comparison of the two, as each has its own complicating factors. I then cover some of the core topics, with specific emphasis on data distribution, fragmentation, and distributed concurrency control; I also discuss the state of distributed database technology and, finally, the pros and cons of the technology. Distributed databases are important because they allow a centralized database to be expanded over a network, which brings benefits such as performance and availability.

Overview:

One of the driving factors is that distributed databases reflect organizational structure: the database fragments (discussed later) are located at different sites within the departments that use them. This gives each department more control over its data while still allowing the data to be accessed collectively. Unlike a centralized database, where a server failure halts the whole system, in a distributed database system a fault in one database does not affect the others. Performance is the most important factor driving the move toward distributed databases. Data is located near the site of greatest demand, and the database systems themselves are parallelized, allowing load to be balanced among servers; each node acts as both a client and a server, and database systems can be added, modified, or removed from the distributed database without affecting the other systems.

Point of View:

Turning to the types of distributed databases: if the data is distributed and all servers run the same DBMS software, we have a homogeneous distributed database system. If different sites run different DBMSs and are connected in such a way that data from multiple sites can be accessed, we call it a heterogeneous distributed database. These ideas are fairly generic, and current industry vendors define them in their own ways. Oracle, for example, defines a homogeneous system as a network with two or more Oracle databases residing on one or more machines, and a heterogeneous system as a network with at least one non-Oracle database system. The definitions vary, but the underlying concepts of distributed databases are the same. The question, then, is how these different DBMSs mask their differences in a heterogeneous system. The short answer I found is well-accepted standards for gateway protocols, which provide an API that exposes DBMS functionality to external applications [1].
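As a concrete illustration of the gateway idea, the following Python sketch shows how adapters for different DBMS products could expose one common API to the distributed layer; all class and method names here are invented for illustration, not taken from any real gateway product.

```python
# Minimal sketch of a gateway-style abstraction: each adapter wraps a
# different DBMS behind one common interface so the distributed layer
# can query any site the same way. Class and method names are illustrative.
from abc import ABC, abstractmethod


class DatabaseGateway(ABC):
    """Uniform API that hides the differences between DBMS products."""

    @abstractmethod
    def execute(self, query: str) -> list[tuple]:
        """Run a query at the remote site and return rows as tuples."""


class OracleGateway(DatabaseGateway):
    def execute(self, query: str) -> list[tuple]:
        # In a real system this would translate the request into the
        # vendor's native protocol; here we just simulate a result.
        return [("oracle-site", query)]


class PostgresGateway(DatabaseGateway):
    def execute(self, query: str) -> list[tuple]:
        return [("postgres-site", query)]


def run_everywhere(sites: dict[str, DatabaseGateway], query: str) -> dict[str, list[tuple]]:
    """Send the same logical query to every site through its gateway."""
    return {name: gw.execute(query) for name, gw in sites.items()}


if __name__ == "__main__":
    sites = {"hq": OracleGateway(), "branch": PostgresGateway()}
    print(run_everywhere(sites, "SELECT * FROM accounts"))
```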

I have already mentioned some of the potential advantages of distributed databases, but in my reading I learned that realizing them depends heavily on placing fragments effectively. This is where fragmentation comes in: breaking a relation into smaller relations and storing the fragments at different sites. The motivation arises from how data is stored in a distributed DBMS: if a relation resides at a remote site, accessing it incurs message-passing costs. To reduce this overhead, a single relation is fragmented across several sites. Fragmenting alone is not sufficient, however; the fragments should be stored at the sites where they are most frequently accessed. So the question is how fragmentation is done.

Fragmentation is done in two ways. In horizontal fragmentation (rows), each fragment consists of a subset of the rows of the original relation. In vertical fragmentation (columns), each fragment consists of a subset of the columns of the original relation. After fragmenting comes allocating the fragments. As explained in one of my references, fragment allocation can be nonredundant, where exactly one copy of each fragment exists across all sites, or redundant, where fragments are replicated at multiple sites; the choice depends on which approach best improves performance.

Since the fragments together make up the global database, a distributed database works efficiently only if the fragments are distributed so as to minimize the total cost (volume) of transmission. The volume of data transmitted depends on the query type: some queries can be handled locally while others require communication with other sites, and, for example, SELECT and DELETE operations generate less traffic than a JOIN. This suggests that the integrated problem of partitioning and allocation decisions is too complex to solve to optimality, so all we can do is select a model or criterion for allocation that suits the system. The reference I studied on this topic gives a new linear-integer formulation focused on allocating fragments so that the total transmission cost is minimized. The formulation also yields some useful observations: the authors found that the allocation problem is easier to solve when the number of fragments is small, even if the number of sites is large. This conclusion, however, is specific to their formulation [2].
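To make the two fragmentation styles concrete, here is a small Python sketch; the relation, the predicate, and the column choices are all made up for illustration.

```python
# Toy illustration of horizontal and vertical fragmentation of a relation,
# here represented as a list of row dictionaries.
EMPLOYEES = [
    {"id": 1, "name": "Asha", "dept": "sales",   "salary": 50000},
    {"id": 2, "name": "Bala", "dept": "finance", "salary": 60000},
    {"id": 3, "name": "Chen", "dept": "sales",   "salary": 55000},
]


def horizontal_fragment(relation, predicate):
    """Keep a subset of the ROWS (e.g. only the rows a given site uses)."""
    return [row for row in relation if predicate(row)]


def vertical_fragment(relation, columns):
    """Keep a subset of the COLUMNS; the key is kept so fragments can be rejoined."""
    keep = {"id"} | set(columns)
    return [{c: row[c] for c in row if c in keep} for row in relation]


# Fragment by department (horizontal) and project out the payroll columns
# (vertical); each fragment would then be allocated to the site that
# accesses it most often.
sales_fragment = horizontal_fragment(EMPLOYEES, lambda r: r["dept"] == "sales")
payroll_fragment = vertical_fragment(EMPLOYEES, ["salary"])
print(sales_fragment)
print(payroll_fragment)
```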

The next aspect I want to discuss in my paper is concurrency control. Before going into the topic, let us start with transactions. In a distributed database, a transaction is submitted at one site but can access data at other sites, and the activity of a transaction at a given site is called a subtransaction. When a transaction is submitted at one site, the transaction manager at that site breaks it up into a collection of one or more subtransactions that execute at different sites, submits them to the transaction managers at those sites, and coordinates their activity. Because the data is distributed, concurrency control and recovery require a great deal of attention. We now consider how lock and unlock requests are handled in a distributed environment.
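The following sketch illustrates one way a transaction manager might group a transaction's operations into per-site subtransactions; the site names, fragment catalog, and SQL strings are all invented, and real systems also track state, handle failures, and coordinate commit.

```python
# Hedged sketch: route each operation of a transaction to the site that
# holds the fragment it touches, yielding one subtransaction per site.
from collections import defaultdict

# Assumed fragment catalog: which site stores which fragment.
FRAGMENT_LOCATION = {
    "employees_sales": "site_A",
    "employees_finance": "site_B",
    "projects": "site_C",
}


def split_into_subtransactions(operations):
    """Group a transaction's operations by the site of the fragment they access.

    Each operation is a (fragment, sql_text) pair; the result maps a site
    to the list of operations that form its subtransaction.
    """
    subtransactions = defaultdict(list)
    for fragment, sql in operations:
        site = FRAGMENT_LOCATION[fragment]
        subtransactions[site].append(sql)
    return dict(subtransactions)


txn = [
    ("employees_sales", "UPDATE ... SET salary = salary * 1.05"),
    ("projects", "INSERT INTO projects VALUES (...)"),
]
print(split_into_subtransactions(txn))
# The coordinating transaction manager would now submit each list to the
# transaction manager at that site and coordinate their activity.
```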

In this environment, lock management can be distributed in three ways: centralized, primary copy, and fully distributed. In the centralized approach, a single site is in charge of handling lock and unlock requests for all objects. In the primary copy approach, one copy of each object is designated as the primary copy, and all requests to lock or unlock any copy of that object are handled by the lock manager at the site holding the primary copy, regardless of where the requested copy is stored. In the fully distributed approach, requests are handled by the lock manager at the site where the copy is stored. Clearly, the centralized approach is vulnerable to a single point of failure, and the primary copy approach requires communication with the primary site; the fully distributed approach avoids both problems and works well when reads are more frequent than writes. Deadlock detection is closely related to concurrency control, and as in centralized databases, deadlocks must be detected and resolved. Each site maintains a local waits-for graph, and a cycle in that graph indicates a deadlock. There are cases, however, where a deadlock is not visible in any local graph. A distributed deadlock detection algorithm must then be used, in which sites share their waits-for graphs; such algorithms must also account for phantom deadlocks, where delays in propagating local information to other sites cause the algorithm to identify deadlocks that do not really exist [1].
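To illustrate the waits-for-graph idea, the sketch below (a simplified model with invented transaction and site names) merges local graphs into a global one and searches it for a cycle; because the local graphs may be stale by the time they are merged, a detected cycle can be a phantom deadlock.

```python
# Sketch of distributed deadlock detection: each site keeps a local
# waits-for graph, the graphs are merged, and a cycle in the merged graph
# signals a (possible) deadlock.
def merge_graphs(local_graphs):
    """Union several local waits-for graphs (txn -> set of txns it waits for)."""
    merged = {}
    for graph in local_graphs:
        for txn, waits_for in graph.items():
            merged.setdefault(txn, set()).update(waits_for)
    return merged


def has_cycle(graph):
    """Depth-first search for a cycle in the waits-for graph."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True          # back edge: cycle found
        visiting.add(node)
        for nxt in graph.get(node, ()):
            if visit(nxt):
                return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(n) for n in list(graph))


site_1 = {"T1": {"T2"}}          # at site 1, T1 waits for T2
site_2 = {"T2": {"T1"}}          # at site 2, T2 waits for T1
print(has_cycle(merge_graphs([site_1, site_2])))  # True: a global deadlock
```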

Even though recovery is not the focus of my paper, I would like to make a point about it so that the discussion of concurrency control does not end abruptly. Once deadlocks are detected, they are resolved by aborting the transactions that caused them, and commit protocols such as two-phase commit (2PC) coordinate the activities of the different sites involved in a transaction; in other words, they enforce atomicity. In two-phase commit, the most widely used commit protocol, each transaction has a designated coordinator site, and its subtransactions are executed at subordinate sites. The protocol ensures that the changes made by any transaction are recoverable. If the coordinator site crashes, the subordinates must wait for it to recover; this blocking can be avoided in other protocols, but those have other overheads that are outside our scope [3].
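The sketch below shows only the basic decision logic of the coordinator side of two-phase commit; the class names are my own, and logging, message passing, timeouts, and crash recovery are all omitted.

```python
# Minimal sketch of the coordinator's decision logic in two-phase commit.
# Subordinates are simulated by simple objects that vote in phase 1 and
# then apply the coordinator's global decision in phase 2.
class Subordinate:
    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = "active"

    def prepare(self):
        # Phase 1: vote YES only if the subtransaction can be made durable.
        self.state = "prepared" if self.will_vote_yes else "aborted"
        return self.will_vote_yes

    def finish(self, decision):
        # Phase 2: apply the coordinator's global decision.
        self.state = decision


def two_phase_commit(subordinates):
    """Return 'committed' only if every subordinate voted YES in phase 1."""
    votes = [sub.prepare() for sub in subordinates]
    decision = "committed" if all(votes) else "aborted"
    for sub in subordinates:
        sub.finish(decision)
    return decision


subs = [Subordinate("site_A"), Subordinate("site_B", will_vote_yes=False)]
print(two_phase_commit(subs))   # 'aborted' because site_B voted NO
```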

With respect to better performance in distributed databases, we have already discussed fragmentation; another approach I came across is exploiting the inherent parallelism. Here, interquery parallelism refers to executing multiple queries at the same time, while intraquery parallelism refers to breaking a single query into subqueries and executing them at different sites. These work well if the databases are read-only; if not, proper concurrency control and recovery protocols must be in place. Another point that struck me is that some commercial systems employ two alternative execution models to improve performance. In the first, the database is open for queries (read-only) during office hours and updates are batched; after hours, the updates are executed sequentially, with time-multiplexing between read activity and update activity. In the second, two copies of the database are maintained, one for querying and one for updates from application programs, and at regular intervals the updates are copied to the query database. Even here, proper concurrency control and recovery protocols are required to keep the copies synchronized, but the second approach avoids much of the transaction-handling overhead [3].
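As a small illustration of intraquery parallelism, the sketch below runs one subquery per fragment concurrently and unions the partial results; the fragment contents and site names are invented, and the remote scan is simulated with a short sleep.

```python
# Illustration of intraquery parallelism: one logical query is broken into
# per-fragment subqueries that run concurrently, and their partial results
# are merged at the coordinating site.
import time
from concurrent.futures import ThreadPoolExecutor

FRAGMENTS = {
    "site_A": [("Asha", "sales"), ("Chen", "sales")],
    "site_B": [("Bala", "finance")],
}


def scan_fragment(site):
    """Pretend to run the subquery at a remote site."""
    time.sleep(0.1)                 # stand-in for network delay + local execution
    return FRAGMENTS[site]


def parallel_select():
    """Run one subquery per fragment in parallel and union the partial results."""
    with ThreadPoolExecutor() as pool:
        partial_results = list(pool.map(scan_fragment, FRAGMENTS))
    return [row for rows in partial_results for row in rows]


print(parallel_select())
```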

Finally:

In distributed databases, database scaling is easy because the system is fully modular compared to a centralized one: extension only requires adding processing and storage capacity. We need to understand that a linear increase in processing and storage power cannot be attained because of distribution overheads, but performance can still be increased considerably. Unlike database scaling, network scaling is a problem in this kind of environment: the network is unpredictable, and it has been observed that there are no network architectures and protocols for distributed databases that deal well with slow WANs. As technology changes, distributed databases must also handle more complex data and systems beyond relational databases, such as ORDBMSs and OODBMSs, which adds further complexity to the distributed environment [3]. Thus there is a need for advanced methodologies and measures that can deal with a wide variety of data, and it is necessary to understand the performance models by thoroughly understanding the underlying technologies and algorithms. There is also a need for efficient replica control protocols and advanced transaction models to improve processing, which distributed databases must provide in increasingly heterogeneous environments. Finally, future research should proceed along these lines so that distributed databases can keep pace with changes in technology.

References:

[1] Oracle, "29 Distributed Database Concepts," Oracle Database 10g documentation, http://www.stanford.edu/dept/itss/docs/oracle/10g/server.101/b10739/ds_concepts.htm

[2] Syam Menon, "Allocating Fragments in Distributed Databases," IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 7, July 2005, http://csdl.computer.org/dl/trans/td/2005/07/l0577.pdf

[3] M. Tamer Özsu and Patrick Valduriez, "Distributed Database Systems: Where Are We Now?", IEEE Computer, Aug. 1991, http://csdl.computer.org/dl/mags/co/1991/08/r8068.pdf