The Role of Cryptography in the Database as a Service Model
Hemavathy Alaganandam
Contents
The Role of Cryptography in the Database as a Service Model
Contents
1.INTRODUCTION
2.DAS Scenario
3.Data Privacy 1st challenge
3.1Software level encryption
3.2Hardware level encryption
3.3Encryption Penalty
4.Data Privacy 2nd Challenge
4.1Relation Encryption and Storage Model
4.2Mapping Conditions (MapCond)
4.3Implementing Relational Operators over Encrypted Relations
4.4Problems with the Strategy
5.CONCLUSIONS
6.REFERENCES
1.INTRODUCTION
"Database as a Service" model provides users powerto create, store, modify, and retrieve data from anywhere inthe world, as long as they have access to the Internet. Itintroduces several challenges, an important issue being dataprivacy. There are two main privacy issues. First, the owner ofthe data needs to be assured that the data stored on theservice-provider site is protected against data thefts fromoutsiders. Second, data needs to be protected even fromthe service providers, if the providers themselves cannot betrusted.
In this paper, I focus on the research made towards the first and second challenge.I specifically focused on techniques to execute SQL queriesover encrypted data. The strategy in the papers I read was to process as much ofthe query as possible at the service providers' site, withouthaving to decrypt the data. Decryption and the remainderof the query processing are performed at the client site. The basic idea was similar in most papers with each paper trying to overcome the drawback in the other solutions.
The rest of the paper is organized as follows. Section 2presents the Database as a Service Scenario.Section 3 & Section 4 discusses the data privacy challenges and solutions.Iconclude the paper in Section 5.
2.DAS Scenario
The DAS scenario involves mainly four entities (see Figure1):
- Data owner: an organization that produces data to be made available for controlled external release;
- User: human entity that presents requests (queries) to the system;
- Client : front-end that transforms the user queries into queries on the encrypted data stored on the server;
- Server: an organization that receives the encrypted data from a data owner and makes them available fordistribution to clients.
Clients and data owners are assumed to trust the server to faithfully maintain outsourced data. Specifically, the server is relied upon for the availability of outsourced databases. However, the server is assumed not to be trusted with the confidentiality of the actual database content. The server should be prevented from making unauthorized access to the data stored in the database. To this purpose, the data owner encrypts her data and gives the encrypted database to the server. The end users, instead, are trusted to access the database, according to the data owner’s policy.
Figure 1: The service-provider architecture.
3.Data Privacy 1st challenge
If database as aservice is to be successful, and customer data is to resideon the site of the database service provider, then the serviceprovider needs to find a way to preserve the privacy of theuser data. There needs to be security measure in place sothat even if the data is stolen, the thief cannot make sense ofit.Encryption is the perfect technique to solve this problem.There are two dimensions to encryption support indatabases. One is the granularity of data to be encryptedor decrypted. The field, the row and the page, typically4KB, are the alternatives. The field may appear to be thebest choice, because it would minimize the number of bytesencrypted. However, practical methodsof embedding encryption within relational databasesentail a significant start up cost for an encryption operation.Row or the page level encryption amortizes this cost overlarger data. The second dimension is software versus hardwarelevel implementation of encryption algorithms.
3.1Software level encryption
First of all symmetric ciphers do much better than asymmetric. However for example if we use Blowfish which is a 64-bit block cipher, which means that data is encrypted and decrypted
in 64-bit chunks. This has implication on short data. Even 8-bit data, when encrypted by the algorithmwill result in 64 bits.
In the paper [4] Blowfish implementation was registered into thedatabase as a user defined function (UDF). Once it was registered, it could be usedto encrypt the data in one or more fields - whenever datawas inserted into the chosen fields, the values are encryptedbefore being stored. On read access, the stored data is decryptedbefore being operated upon.For example, if we were to encrypt the columndiscount of a table called lineitem using the user defined function called ”encrypt”, and decrypt it by the userdefined function ”decrypt” one would use the followingSQL command to insert data into the table lineitem:
insert into lineitem (discount)
values (encrypt(10,key))
The statement to select the encrypted field is given next:
select decrypt(discount,key)
from lineitem
where custid = 300
In this approach the creator of the encrypted data suppliesthe key, and the database provides the encryption function.Only those users who are given the key can decrypt thedata using the decryption algorithm. Since the key is ownedby the creator, and not stored at the site of the database serviceprovider, unauthorized person who may get hold ofdisk files can not get hold of the key. In fact, even employeesof the database service provider do not have accessto the encryption key. The full security provided by the encryptionalgorithm is inherited by the data in the database.
3.2Hardware level encryption
Specialized encryption hardware, the IBM S/390 CryptographicCoprocessor, is available under IBM OS/390 environmentwith Integrated Cryptographic Service Facility(ICSF) libraries. IBM DB2f or OS/390 provides a facilitycalled ”editproc” (or edit routine), which can be associated
with a database table. An edit routine is invoked for a wholerowof the database table, whenever the row is accessed bythe DBMS.An encryption/decryption edit routine can be registered forthe tables. When a read/write request arrives for a rowin one of these tables, the edit routine invokes encryption/decryption algorithm, which is implemented in hardware,for whole row. In[4], they used the algorithm optionfor encryption hardware.
3.3Encryption Penalty
The response time for a query on encrypteddata will increase due to boththe cost of decryption as well as routine and/or hardwareinvocations in DB2. This increase is referred to as the encryptionpenalty. The software field level encryption was found to be particularlyCPU intensive.As the number of rows increases,query execution time grows very sharply in software levelencryption. On the other hand, hardwarelevel encryption showed almost perfectly linear increase.
4.Data Privacy 2nd Challenge
The second challenge is that of "total" data privacy, whichis more complex since it includes protection from the databaseprovider. The requirement is that encrypted data may notbe decrypted at the provider site. A straightforward approach is to transmit the requisite encrypted tables from theserver (at the provider site) to the client, decrypt the tables,and execute the query at the client. But this approach mitigates almost every advantage of the service-provider model,
since now primary data processing has to occur on clientmachines. For this reason the encrypted database is augmentedwith additional information (index) allowscertain amount of query processing to occur at the serverwithout jeopardizing data privacy. The client also maintains metadata for translating user queries to the appropriate representation on the
server, and performs post-processing on server query results.Based on the auxiliary information stored,in [3] Hacigumus et al develop techniques to split an original query over unencrypted relationsinto (1) a corresponding query over encrypted relations torun on the server, and (2) a client query for post-processingresults of the Server query.
4.1Relation Encryption and Storage Model
For each relation R(Ai, A 2 , . . . , An), an encrypted relation:Rs (etuple, A1s, A2s, …, Ans)
is stored on the server. The attribute etuplestores an encrypted string that corresponds to a tuple in relation R. Each attribute Ais corresponds to the index for the attribute Ai that will be usedfor query processing at the server. For example, consider arelation emp below that stores information about employees.
eid / Ename / Salary / Addr / Did23 / Tom / 70K / Maple / 40
860 / Mary / 60K / Main / 80
320 / John / 50K / River / 50
875 / Jerry / 55K / Hopewell / 110
The emp table is mapped to a corresponding table at theserver:
emp s ( etuple, eid s, ename s, salary s, addr s, did s)
It is only necessary to create an index for attributes involvein search and join predicates. In the above example, if it isknown that there would be no query that involves attributeaddr in either a selection or a join, then the index on thisattribute need not be created.
Partition Functions
The domain of values (Di)of attribute R.Aiare mapped into partitions { p1 , . . . ,pi}, such that
(1)these partitions taken together cover the whole domain; and
(2) any two partitions do not overlap.
As an example, consider the attribute eid of the emp tableabove. Suppose the values of domain of this attribute lie inthe range [0, 1000]. Assume that the whole range is dividedinto 5 partitions:
partition(emp.eid) ={[0, 200], (200,400], (400,600], (600,800], (800, 1000]}
Different attributes may be partitioned using different partition functions. It should be clear that the partition of attribute Ai corresponds to a splitting of its domain into aset of buckets. Any histogram-construction technique, suchas MaxDiff, equi-width, or equi-depth, could be used
to create partitioning of attributes.
Identification Functions
The identification function assigns an identifier identR.Ai (pj) to each partitionpj of attribute Ai.
For instance, as shown below, an identifier is assigned to each range of emp id’s
2 / 7 / 5 / 1 / 4[0,200] / (200,400] / (400,600] / (600,800] / (800, 1000]
Partition and identification functionsof emp ID
The ident function value should be unique, so a collision free hash function is a good choice. For example in the case where a partition corresponds to a numeric range, the hash function may use the start and / or end values of a range.
Mapping Functions
The mapping function MapR.Aitakes care of mapping a value v inthe domain of attribute Ai to the identifier of the partitionto which v belongs: MapR.Ai(V) = identR.Ai (pj), where pj
is the partition that contains v.In the example above, the following table shows somevalues of the mapping function for attribute emp.eid.
Forinstance, Mapemp.eld(23) = 2, Mapemp.eid(860) = 4
There are two types of mapping functions:
1. Order preserving: A mapping function MapR.Aiiscalled order preservingif for any two values vi and vjin the domain of Ai, if vi < vj, then MapR.Ai(Vi) <MapR.Ai (Vj).
2. Random: A mapping function is called random if it isnot order preserving.
A random mapping function provides superior privacycompared to its corresponding order preserving mapping. The choice, whether a mapping functionis order preserving or not, affects query translation. Query translation issimplified using an order-preserving mapping function.
Storing Encrypted Data
For each tuplet = ( a1 , a2 , . . . ,ai) in R, the relation R s stores a tuple:
(encrypt( {al, a2, . . . , an}), MapR.A1(a1),MapR.A2 ( a 2 ) , . . . , MapR.Ai (ai))
where encrypt is the function used to encrypt a tuple ofthe relation. For instance, the following is the encryptedrelation emp s stored on the server:
Etuple / eids / Enames / Salarys / addrs / Dids1100110011110010… / 2 / 19 / 81 / 18 / 2
1000000000011101… / 4 / 31 / 59 / 41 / 4
1111101000010001… / 7 / 7 / 7 / 22 / 2
1010101010111110… / 4 / 71 / 49 / 22 / 4
Corresponding Employee table in the server
The first column etuple contains the string correspondingto the encrypted tuples in emp. For instance, the first tuple is encrypted to "1100110011110010..." that is equal to
encrypt(23, Tom, 7OK, Maple, 40). Any block cipher technique such as AES,RSA , Blowfish , DES etc., can be used toencrypt the tuples.The second column corresponds to the index on the employee ids. For example, value for attribute eid in the firsttuple is 23, and its corresponding partition is [0, 200]. Sincethis partition is identified to 2, we store the value "2" as
the identifier of the eid for this tuple.
In general, the notation "E" ("Encrypt") mapsa relation R to its encrypted representation. That is, givenrelation R( A1, A2, . . . , A,~), relation E( R) is RS (etuple, A1 s,A2 s , . . . , Ans). In the above example, E(emp) is the tableemp s .
Decryption Functions
Given the operator E that maps a relation to its encrypted representation, the inverse operator D maps the encrypted representation to its corresponding unencrypted representation. That is, D(Rs) = R. In the example above, D(emps) = emp. That is, D (temp s) will decrypt all of the encrypted columnsin temp s and drop the auxiliary columns corresponding tothe indices.
4.2Mapping Conditions (MapCond)
For each relation, the server side stores the encrypted tuples, along with the attribute indices determined by theirmapping functions. Meanwhile, the client stores the metadata about the specific indices, such as the information aboutthe partitioning of attributes, the mapping functions, etc.The client utilizes this information to translate a given queryQ to its server-side representation Qs, which is then executed by the server.
Let us consider different query conditions :
Condition Attribute op Values (op can be =,<,>,<=,>=)
1) Attribute = Value: Here the mapcond function would just map the value to a partition identifier.
eg. MapCond(eid = 860) = eid s = 4since eid = 860 is mapped to 4 by the mapping function ofthis attribute.
2)Attribute < or >Value: Depending uponwhether or not the mapping function MapAi of the attributeis order-preserving or random, different translations are possible
• Order preserving: In this case, the translation is straight forward:
Mapcond(Ai < v) => Ai s = MapAi(V)
• Random: Mapcond(Ai > v) = Ai s in Map>Ai(v).
Mapcond(eid < 280) => eid s in {2, 7}since all employee ids less than 280 have two partitions
[0,200] and (200,400], whose identifiers are {2, 7}.
Condition Attribute op Attribute (op can be =,<,>,<=,>=)
1) Attribute1 = Attribute2: Such a condition might arisein a join or selection. We consider all possible pairs of partitions ofAi and Aj that overlap.
Partions / Ident(empdid) / Partition / Ident(mgrdid)[0,100] / 2 / [0,200] / 9
(100,200] / 4 / (200,400] / 8
(200,300] / 3
(300,400] / 1
For instance, the table above shows the partition and identification functions of two attributes emp.did and mgr.did.Then condition emp.did = mgr.did is translated to the following condition C1:
C1 : (emp s.did s = 2 AND mgr s.did s = 9)
V (emp s.did s = 4 AND mgr s.did s = 9)
V (empS.did s = 3 AND mgrS.did S = 8)
V (empS.did s = 1 AND mgrS.did s = 8).
2) Attribute1 < Attribute2 can be dealt in the same way
4.3Implementing Relational Operators over Encrypted Relations
This section describes how individual relational operators (such as selections, joins and groupingoperators) can be implemented in the proposed database architecture. I‘ve shown a few examples below. The strategy is to partition the computation ofthe operators across the client and the server. Specifically,we will attempt to compute a superset of answers generated by the operator using the attribute indices stored atthe server. These answers will then be filtered at the clientafter decryption to generate the true results. Work done at the client is tried to minimize as much aspossible.
The Selection Operator: Consider a selection operation O'c(R) on a relation R, where C is a condition specified on one or more of the attributes A1, A2,.. •, An of R. Astraightforward implementation of such an operator in ourenvironment is to transmit the relation R s from the serverto the client. Then the client decrypts the result using theD operator, and implements the selection. This strategy,however, pushes the entire work of implementing the selection to the client. In addition, the entire encrypted relationneeds to be transmitted from the server to the client. Analternative mechanism is to partially compute the selectionoperator at the server using the indices associated with theattributes in C, and push the results to the client. The clientdecrypts the results and filters out tuples that do not satisfyC.
Eg. Selection(eid<395 AND did=140 (emp)) is translated to
Selection on client side[Decryption(Selection on server side based on Mapcond (empS))], where
thecondition on the server is:Mapcond(C) = (eid s in[2, 7] ANDdid s =4)
The query is first executed at the server based on the corresponding Mapcond. The results are decrypted at the client and then the selection is once again done to eliminate the spurious rows.
The Join Operatorwould be similar as above. The join will again be done on set of rows returned by the server which eliminates the false rows.
The Grouping and Aggregation Operator: The basic idea is this : the grouping is done at the server side (ofcourse it will have rows that shouldn’t belong in there). The server does not perform anyaggregation since it does not have anyvalues for those attributes. The resultsarereturned to the client, which performs the grouping operation again. This operation can be implemented very efficiently,since every tuple belonging to a single group say x will bein a single x s, group computed by the server. As a result,the client only needs to consider tuples in a single x s, groupwhen computing the groups corresponding to x. Of course,the aggregation functions specified will be computedat the client, since their computation requires that tuples befirst decrypted.
Sorting Operation : It is similar to the grouping operation. The amount of work done at the client in post-processing depends upon whether or not theattributes listed in the sort have order-preserving mappings. If theattributes have order-preserving mappings, then the resultsreturned by the server are presorted upto within a partition. Thus, sorting the results is a simple local operationover a single partition. Alternatively, even if the mappingis not order preserving, it is useful to compute grouping at theserver to reduce the amount of client work. Since the tupleshave been grouped by the server, sorting can be implementedefficiently using a merge-sort algorithm.
Thus Given a query Q a strategy can be developed to split the computation of Q across the serverand the client. The server will use the implementation ofthe relational operators discussed in the previous section tocompute as much of the query as possible, relegating the remainder of the computation to the client. The objective is tocome up with the "best" query plan for Q that minimizes theexecution cost.
4.4Problems with the Strategy
I will briefly go through in this section some of the problems with the Hacigumus et al approach and list a few solutions proposed in other papers.
1) It assumes that the client has complete access to the query result. However, this assumption does not fit real world applications, where different users may have different access privileges. In [1] Damiani et all investigated a solution for implementing through cryptography a selective access policy where they introduced a method to exploit a tree hierarchy for key management.