Unit 5 : Advanced Topics
Syllabus
Distributed Databases: Architecture, Data Storage, Transaction
Processing, Query processing and optimization - NOSQL
Databases: Introduction - CAP Theorem - Document Based
systems - Key value Stores - Column Based Systems - Graph
Databases. Database Security: Security issues - Access control
based on privileges - Role Based access control - SQL Injection -
Statistical Database security - Flow control - Encryption and
Public Key infrastructures - Challenges.
Distributed Databases:
Definition of distributed databases:
A distributed database system consists of loosely coupled
sites (computers) that share no physical components, and
each site is associated with a database system.
The software that maintains and manages the working of
distributed databases is called a distributed database
management system.
The database system that runs on each site is independent
of the others. Refer Fig. 5.1.1.
The transactions can access data at one or more sites.
(1) Homogeneous databases
• In this system, all the sites are aware of the other sites present
in the system and they all cooperate in processing the user's request.
• Each site present in the system surrenders part of its autonomy
in terms of the right to change schemas or software.
• The homogeneous database system appears as a single system
to the user.
(2) Heterogeneous databases
• A heterogeneous database system is one in which different sites
have different schemas or software. Refer Fig. 5.1.3.
• The participating sites are not aware of the other sites present in
the system.
• These sites provide only limited facilities for cooperation in
transaction processing.
Architecture
• Following is an architecture of distributed databases. In this
architecture, a local database is maintained by each site.
• The sites are interconnected by a communication network.
Data Storage
There are two approaches to storing a relation r in a distributed
database - (1) Data replication and (2) Data fragmentation.
Data Replication
• Concept: Data replication means storing a copy (replica) of a
relation or its fragments at two or more sites.
• There are two methods of data replication - (1) Full
replication and (2) Partial replication.
• Full replication: In this approach the entire relation is stored
at all the sites. Fully redundant databases are those in which
every site contains a copy of the entire database.
• Partial replication: In this approach only some fragments of
the relation are replicated at the sites.
Advantages:
(1) Availability: Data replication facilitates increased availability
of data.
(2) Parallelism: Queries can be processed by several sites in
parallel.
(3) Faster accessing: The relation r is locally available at each
site, hence data accessing becomes faster.
Disadvantages:
(1) Increased cost of update: The major disadvantage of data
replication is the increased cost of updates. That means every
replica of relation r at every site must be updated whenever a user
makes an update to the relation.
(2) Increased complexity in concurrency control: It becomes
complex to implement the concurrency control mechanism for all
the sites containing replica.
Data Fragmentation
• Concept: Data fragmentation is a division of relation r into
fragments r1,r2, r3,...,rn which contain sufficient information to
reconstruct relation r.
• There are two approaches to data fragmentation - (1) Horizontal
fragmentation and (2) Vertical fragmentation.
• Horizontal fragmentation: In this approach, each tuple of r is
assigned to one or more fragments. If relation R is fragmented into
r1 and r2, then to reconstruct R from these fragments we must use
the union operation. That means R = r1 ∪ r2.
• Vertical fragmentation: In this approach, the relation r is
fragmented based on one or more columns. Each fragment must retain
the primary key so that the original relation can be reconstructed.
If relation R is fragmented into r1 and r2 using vertical fragmentation,
then to reconstruct the original relation R we must use the join
operation. That means R = r1 ⋈ r2.
• For example - Consider the following relation r:
Student(RollNo, Marks, City)
Suppose some sample tuples are inserted into this relation.
Horizontal Fragmentation 1:
SELECT * FROM Student WHERE Marks > 50 AND City = 'Pune'
We will get the matching tuples.
Horizontal Fragmentation 2:
SELECT * FROM Student WHERE Marks > 50 AND City = 'Mumbai'
We will get the matching tuples.
Vertical Fragmentation 1:
SELECT RollNo, Marks FROM Student
Vertical Fragmentation 2:
SELECT RollNo, City FROM Student
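As shown below, the original relation can be reconstructed from the fragments. This is a minimal SQL sketch; the fragment table names (Student_Pune, Student_Mumbai, Student_RollMarks, Student_RollCity) are illustrative, not from the original text.
-- Reconstructing from horizontal fragments: take the union of the fragments
SELECT * FROM Student_Pune
UNION
SELECT * FROM Student_Mumbai;
-- Reconstructing from vertical fragments: join them on the common key RollNo
SELECT m.RollNo, m.Marks, c.City
FROM Student_RollMarks m JOIN Student_RollCity c ON m.RollNo = c.RollNo;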
Transaction Processing
Basic Concepts
In a distributed system, a transaction initiated at one site can access
or update data at other sites. Let us discuss various basic
concepts used during transaction processing in distributed
systems -
• Local and global transactions:
A transaction Ti is said to be local if it is initiated at site Si and
can access or update data at site Si only.
A transaction Ti initiated at site Si is said to be global if it can
access or update data at sites Si, Sj, Sk and so on.
• Coordinating and participating sites:
The site at which the transaction is initiated is called coordinating
site. The participating sites are those sites at which the sub-
transactions are executing. For example - If site S1 initiates the
transaction T1 then it is called coordinating site. Now assume that
transaction T1 (initiated at S1) can access site S2 and S3. Then
sites S2 and S3 are called participating sites.
To access the data on site S2, the transaction T1 needs another
transaction T12 on site S2; similarly, to access the data on site S3,
the transaction T1 needs some transaction, say T13, on site S3.
Then transactions T12 and T13 are called sub-transactions. The
above described scenario can be represented by the following Fig.
5.1.6.
• Transaction manager :
The transaction manager manages the execution of those
transactions (or subtransactions) that access data stored in a
local site. The tasks of the transaction manager are -
(1) To maintain a log for recovery purposes.
(2) To participate in coordinating the concurrent execution of the
transactions executing at that site.
• Transaction coordinator:
The transaction coordinator coordinates the execution of the
various transactions (both local and global) initiated at that site.
The tasks of Transaction coordinator are -
(1) Starting the execution of transactions that originate at the site.
(2) Distributing subtransactions to appropriate sites for execution.
Let TC denote the transaction coordinator and TM denote the
transaction manager; then the system architecture can be
represented as shown in the figure.
Failure Modes
There are four types of failure modes,
1. Failure of site
2. Loss of messages
3. Failure of communication link
4. Network partition
The most common type of failure in a distributed system is the loss
or corruption of messages. The system uses the Transmission Control
Protocol (TCP) to handle such errors. TCP is a standard connection-
oriented protocol that delivers messages reliably from one end to
the other.
• If two nodes are not directly connected, messages from one to
the other must be routed through a sequence of communication
links. If a communication link fails, the messages are rerouted
over alternative links.
• A system is partitioned if it has been split into two or more
subsystems, called partitions, that cannot communicate with each
other. Lack of connection between the subsystems also causes
failures in a distributed system.
Commit Protocols
Two Phase Commit Protocol
• Atomicity is an important property of any transaction
processing. What is this atomicity property? It means that
either the transaction will execute completely or it will not execute
at all.
• The commit protocol ensures the atomicity across the sites in
following ways -
i) A transaction which executes at multiple sites must either be
committed at all the sites, or aborted at all the sites.
ii) Not acceptable to have a transaction committed at one site and
aborted at another.
• There are two kinds of sites involved in this protocol -
• One coordinating site
• One or more participating sites.
Two phase commit protocol
This protocol works in two phases - i) Voting phase and ii)
Decision phase.
Phase 1: Obtaining decision or voting phase
Step 1: Coordinator site Ci asks all participants to prepare to
commit transaction Ti.
• Ci adds the record <prepare T> to the log and writes the log
to stable storage.
• It then sends a prepare T message to all participating sites at
which T executed.
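The participating sites then vote by replying ready T or abort T, and in the decision phase the coordinator commits T only if every site voted ready; otherwise it aborts T. A minimal Python sketch of the coordinator's logic is shown below; the participants list and the send() and collect_votes() helpers are hypothetical stand-ins for the messaging layer, not part of the original text.
# A minimal sketch of the two phase commit coordinator, for illustration only.
def two_phase_commit(tid, participants, send, collect_votes):
    # Phase 1 (voting phase): write <prepare T> to the log, then ask every site to prepare
    print("log: <prepare %s>" % tid)
    for site in participants:
        send(site, ("prepare", tid))
    votes = collect_votes(tid, participants)   # each site replies "ready" or "abort"
    # Phase 2 (decision phase): commit only if all sites voted ready, otherwise abort
    decision = "commit" if all(v == "ready" for v in votes) else "abort"
    print("log: <%s %s>" % (decision, tid))    # record the decision in stable storage
    for site in participants:
        send(site, (decision, tid))            # every site commits or aborts uniformly
    return decision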
Failure of site
There are various cases at which failure may occur,
(1) Failure of participating sites
• If any of the participating sites fails, then when the
participating site Si recovers, it examines the log entries made by
it to take the decision about executing transaction T.
• If the log contains <commit T> record: participating site
executes redo (T)
• If the log contains <abort T> record: participating site executes
undo (T)
• If the log contains <ready T> record: participating site must
consult Coordinating site to take decision about execution of
transaction T.
• If T committed, redo (T)
• If T aborted, undo (T)
• If the log of the participating site contains no record of T, that
means Si failed before responding to the prepare T message from the
coordinating site. In this case it must abort T.
(2) Failure of coordinator
• If coordinator fails while the commit protocol for T is executing
then participating sites must take decision about execution of
transaction T:
i) If an active participating site contains a <commit T> record in its
log, then T must be committed.
ii) If an active participating site contains an <abort T> record in its
log, then T must be aborted.
iii) If some active participating site does not contain a <ready T>
record in its log, then the failed coordinator Ci cannot have
decided to commit T. It can therefore abort T.
iv) If none of the above cases holds, then all active participating
sites must have a <ready T> record in their logs, but no additional
control records (such as <abort T> or <commit T>). In this case the
active sites must wait for the coordinator site Ci to recover, to find
the decision.
The two phase commit protocol has a blocking problem.
What is the blocking problem?
It is a stage at which active participating sites may have to wait for
the failed coordinator site to recover.
The solution to this problem is to use the three phase commit
protocol.
Three Phase Commit Protocol
• The three phase commit protocol is an extension of the two phase
commit protocol that eliminates the blocking problem.
• Various assumptions that are made for the three phase commit
protocol are -
• No network partitioning.
• At any point at least one site must be up.
• At most k sites (participating as well as coordinating) can
fail.
• Phase 1: This phase is similar to phase 1 of two phase protocol.
That means Coordinator site Ci asks all participants to prepare to
commit transaction Ti. The coordinator then makes the decision
about commit or abort based on the response from all the
participating sites.
• Phase 2: In phase 2 the coordinator makes a decision as in two
phase commit, called the pre-commit decision <Pre-commit, T>,
and records it at multiple (at least k) participating sites.
• Phase 3: In phase 3, the coordinator sends the commit/abort message
to all participating sites.
• Under the three phase protocol, the knowledge of the pre-commit
decision can be used to commit despite coordinator site failure.
That means if the coordinating site fails, one of the
participating sites becomes the new coordinating site and
consults the other participating sites about the pre-commit
message they possess. Using this pre-commit message, the decision
about commit/abort is taken by the new coordinating site.
• This protocol avoids the blocking problem as long as fewer than k
sites fail.
Advantage of three phase commit protocol
(1) It avoids the blocking problem.
Disadvantage of three phase commit protocol
(1) The overhead is increased.
Query Processing in DBMS
Query Optimization
NoSQL Databases
Introduction
NoSQL stands for "not only SQL".
It is a non-tabular database system that stores data
differently than relational tables. There are various
types of NoSQL databases such as document, key-value,
wide-column and graph.
Using NoSQL we can maintain flexible schemas, and
these schemas can be scaled easily with large
amounts of data.
Need
The NoSQL database technology is usually adopted for the following
reasons -
The NoSQL databases are often used for handling big data
as a part of fundamental architecture.
The NoSQL databases are used for storing and modelling
structured, semi-structured and unstructured data.
For efficient execution of the database with high availability,
NoSQL is used.
The NoSQL database is non-relational, so it scales out
better than relational databases and these can be designed
with web applications.
For easy scalability, the NoSQL is used.
Features
NoSQL does not follow the relational model.
It is either schema free or has a relaxed schema. That means
it does not require a specific definition of the schema.
Multiple NoSQL databases can be executed in distributed
fashion.
It can process both unstructured and semi-structured data.
NoSQL databases have higher scalability.
It is cost effective.
It supports the data in the form of key-value pair, wide
columns and graphs.
CAP Theorem:
Introduction:
The CAP theorem involves three properties of a distributed database system -
• Consistency
• Availability
• Partition tolerance
Consistency:
In the CAP theorem, consistency means that all nodes in a
distributed database system should have the same data at any
given point. This ensures that all clients accessing the system will
see the same data, regardless of the node they are connected
to. When a client performs a read operation, they should receive
the result of the most recent write.
Availability:
In the CAP theorem, availability means that a distributed database
system should always be able to respond to client requests. When a
client requests data, it should receive a response regardless of the
state of any individual node, i.e. the system must be operational at
all times.
Partition Tolerance:
Sometimes, due to some disruption in the network, nodes get
separated into groups such that they cannot communicate with
each other. This is called a network partition. In the CAP theorem,
partition tolerance means that a distributed database system should
continue to work even if there is a network partition.
For example: If the nodes of a cluster are split into two groups that
cannot reach each other, the system should still keep serving requests;
during such a partition it can preserve either consistency or
availability, but not both.
Key-Value Store
• Key-value pair is the simplest type of NoSQL database.
• It is designed in such a way to handle lots of data and heavy
load.
• In the key-value storage the key is unique and the value can be
JSON, string or binary objects.
• For example -
{ "Customer": [
{"id": 1, "name": "Ankita"},
{"id": 2, "name": "Kavita"}
] }
Here id and name are the keys, and 1, 2, "Ankita", "Kavita" are the
values corresponding to those keys.
Key-value stores help the developer to store schema-less data.
They work best for content such as shopping carts.
DynamoDB, Riak and Redis are some famous examples of key-value
stores.
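The idea can be sketched minimally in Python; here a plain dictionary stands in for a real key-value server such as Redis or DynamoDB, and the keys and values are purely illustrative.
# A minimal sketch of a key-value store holding shopping cart contents.
store = {}
def put(key, value):
    store[key] = value        # the key is unique; the value can be JSON-like data, a string or binary
def get(key):
    return store.get(key)     # returns None if the key is absent
put("cart:customer:1", {"items": [{"sku": "A101", "qty": 2}], "total": 450})
print(get("cart:customer:1"))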
Document Based Systems
• The document store make use of key-value pair to store and
retrieve data.
• The document is stored in the form of XML and JSON.
• The document stores appear the most natural among NoSQL
database types.
• It is most commonly used due to flexibility and ability to query on
any field.
• For example -
{
"id": 101,
"Name": "AAA",
"City" : "Pune"
}
MongoDB and CouchDB are two popular document-oriented
NoSQL databases.
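A minimal sketch of storing and querying such a document, assuming a locally running MongoDB instance and the pymongo driver (the database and collection names are illustrative):
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017")    # assumes MongoDB is running locally
db = client["college"]                               # illustrative database name
students = db["students"]                            # illustrative collection name
# Store a document; no fixed schema is required
students.insert_one({"id": 101, "Name": "AAA", "City": "Pune"})
# Query on any field of the document
print(students.find_one({"City": "Pune"}))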
Graph Databases
The graph database is typically used in the applications where the
relationships among the data elements is an important aspect.
The connections between elements are called links or
relationships. In a graph database, connections are first-class
elements of the database, stored directly. In relational databases,
links are implied, using data to express the relationships.
The graph database has two components -
1) Node: The entity itself. For example - a person or a student.
2) Edge: The relationship among the entities.
For example - a student node can be connected to a course node by an
'enrolled in' edge; a minimal sketch follows.
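A minimal Python sketch of the node/edge idea (a real graph database such as Neo4j stores these natively; the names below are illustrative):
# Nodes are the entities; edges are the relationships stored as first-class elements.
nodes = {
    "s1": {"label": "Student", "name": "AAA"},
    "c1": {"label": "Course", "name": "DBMS"},
}
edges = [("s1", "ENROLLED_IN", "c1")]
# Traverse the relationships starting from node s1
for src, rel, dst in edges:
    if src == "s1":
        print(nodes[src]["name"], rel, nodes[dst]["name"])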
Threats to Database
Threats to a database result in loss or degradation of data.
There are three kinds of loss that occur due to threats to a database -
(1) Loss of Integrity:
• Database integrity means information must be protected from
improper modification.
• Modification to database can be performed by inserting, deleting
or modifying the data.
• Integrity is lost if unauthorized changes are made to the data
intentionally or accidentally.
• If the loss of data integrity is not corrected and work is continued,
it results in inaccuracy, fraud, or erroneous decisions.
(2) Loss of Availability:
• Database availability means making the database objects
available to the authorized users.
(3) Loss of Confidentiality:
• Confidentiality means protection of data from unauthorized
disclosure of information.
• The loss of confidentiality results in loss of public confidence,
embarrassment, or legal action against the organization.
Control Measures
• There are four major control measures used to provide security
on data in database.
1. Access control
2. Interface control
3. Flow control
4. Data encryption
• Access Control: The most common security problem is
unauthorized access to a computer system. Generally this
access is aimed at obtaining information or making malicious
changes in the database. The security mechanism of a DBMS
must include provisions for restricting access to the database
system as a whole. This function is called access control (a
privilege-based sketch is given after this list).
• Inference Control: This measure provides security for
statistical databases. Statistical databases are
used to provide statistical information based on some criteria.
These databases may contain information about a particular age
group, income level, education level and so on. Access to
sensitive individual information must be prevented while using the
statistical databases. The corresponding measure, inference control,
prevents the user from completing any inference channel.
• Flow Control: It is a kind of control measure which prevents
information from flowing in such a way that it reaches
unauthorized users. Channels that are pathways for information to
flow implicitly in ways that violate the security policy of an
organization are called covert channels.
• Data Encryption: The data encryption is a control measure
used to secure the sensitive data. In this technique, the data is
encoded using some coding algorithm. An unauthorized user who
accesses encoded data will have difficulty deciphering it, but
authorized users are given decoding or decrypting algorithms (or
keys) to decipher the data.
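Access control based on privileges can be illustrated with a small SQL sketch; the account, role and table names below are illustrative, and the exact syntax varies slightly between DBMSs.
-- Grant the account clerk1 the privilege to read and insert into Student
GRANT SELECT, INSERT ON Student TO clerk1;
-- Grant all privileges on Student to dba1, allowing it to pass them on
GRANT ALL PRIVILEGES ON Student TO dba1 WITH GRANT OPTION;
-- Take the INSERT privilege back from clerk1
REVOKE INSERT ON Student FROM clerk1;
-- Role based access control: bundle privileges into a role and grant the role
CREATE ROLE data_entry;
GRANT SELECT, INSERT ON Student TO data_entry;
GRANT data_entry TO clerk1;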
SQL Injection
• Consider the following code, which builds a query directly from user input -
uname = request.POST['username']
passwd = request.POST['password']
query = "SELECT id FROM users WHERE username='" + uname + "' AND password='" + passwd + "'"
database.execute(query)
• Here the two input fields - one for the user name and another for
the password - are vulnerable to SQL injection.
• The attacker can attack using these fields and alter the SQL
query to get the access to the database.
• They could use a trick on the password field by adding the condition
OR 1 = 1
to the password field.
• As a result the query becomes (assuming username 'user1' and
password 'password')
• SELECT id FROM users WHERE username='user1' AND
password='password' OR 1 = 1
• Because of the OR 1 = 1 condition, the WHERE clause matches every
row, so the query returns the first id from the users table no matter
what the username and password are. That means even if we enter a
wrong username or password, the query still gets executed because
the OR 1 = 1 part comes out to be true.
• The first id returned by the above query for the users table is
normally the administrator's. In this way, the attacker not only
bypasses authentication but also gains administrator privileges.
How to prevent SQL injection?
• The first way to prevent SQL injection is to validate every input
field.
• Another method is to make use of parameterized queries, also called
prepared statements. In this way, the application code never uses the
input directly (see the sketch after this list).
• Web Application Firewalls (WAF) are also used to filter out
malicious SQL.
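A minimal sketch of a parameterized query, using Python's built-in sqlite3 module for illustration (the same idea applies to other database drivers; the table and data are illustrative):
import sqlite3
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'admin', 'secret')")
uname = "user1"
passwd = "password' OR 1 = 1"   # the injected text is treated as plain data, not SQL
# Placeholders (?) keep the user input out of the SQL text itself
row = conn.execute(
    "SELECT id FROM users WHERE username = ? AND password = ?",
    (uname, passwd),
).fetchone()
print(row)   # None - the injection attempt does not match any user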
Statistical Database Security
• Statistical databases contain statistical data about various
populations.
• A population is a set of tuples of a table that satisfy some
selection criteria.
• The statistical database may contain confidential data about
individuals. For example - a database maintained by a
government agency about citizens is a statistical database.
Statistical database security helps in protecting the sensitive
information present in the statistical database from user access.
• Users are permitted to use only some portion of the statistical
database. They cannot have access to the complete database system.
For instance - In an employees database, a user is not
permitted to access the information about an individual employee's salary.
• Statistical aggregate functions such as COUNT, MAX, MIN,
AVERAGE and STANDARD DEVIATION are used in these queries,
which are therefore called statistical queries.
• The possibility of accessing individual information from statistical
queries is reduced in the following ways (a small sketch follows this list) -
• No statistical query is permitted whenever the number of tuples
in the population specified by the selection condition falls below
some threshold.
• Prohibit sequences of queries that refer repeatedly to the same
population of tuples.
• Partition the database. That means records are stored in
groups of some minimum size. A query can refer to any
complete group but never to subsets of records within a group.
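A minimal sketch of the threshold rule, again using Python's built-in sqlite3 module; the table, the data and the threshold value are illustrative.
import sqlite3
THRESHOLD = 5   # minimum population size for which a statistical query is answered
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, city TEXT, salary REAL)")
conn.execute("INSERT INTO Employee VALUES ('A', 'Pune', 50000), ('B', 'Pune', 60000)")
def average_salary(city):
    # First check how many tuples fall in the requested population
    (count,) = conn.execute("SELECT COUNT(*) FROM Employee WHERE city = ?", (city,)).fetchone()
    if count < THRESHOLD:
        return None   # refuse: answering could reveal information about individuals
    (avg,) = conn.execute("SELECT AVG(salary) FROM Employee WHERE city = ?", (city,)).fetchone()
    return avg
print(average_salary("Pune"))   # None, because only 2 tuples match the criteria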
Flow Control
• Flow control is a mechanism that regulates the flow of
information among accessible objects.
• A flow between two objects obj1 and obj2 occurs when program
reads values from obj1 and writes values to the object obj2.
• The flow control checks that the information contained in one
object should not get transferred to the less protected object.
• The flow policy specifies the channels along which the
information is allowed to move.
• The simple flow policy specifies two classes of information -
confidential (C) and nonconfidential (N). According to this flow policy,
only the flow of information from the confidential class to the
nonconfidential class is not allowed.
Covert Channel
• A covert channel is a type of attack that creates a capability to
transfer information between processes that are not
supposed to be allowed to communicate.
• This covert channel violates the security policy.
• The covert channel allows information to pass from a higher
classification level to a lower classification level through improper
means.
• Security experts believe that one way to avoid covert
channels is for programmers not to gain access to
sensitive data.
Encryption and Public Key Infrastructures
Cryptology is a technique of encoding and decoding messages,
so that they cannot be understood by anybody except the sender
and the intended recipient.
There are various encoding and decoding schemes which are
called as encryption schemes. The sender and recipient of the
message decide on an encoding and decoding scheme and use it
for communication.
The process of encoding messages is known as encryption. The
original text sent by the sender is called plaintext, and the
encrypted form of the plaintext is called ciphertext. This
encrypted text travels through the network. When it reaches the
receiving computer, the recipient decodes the message to extract
the correct meaning out of it. This process is called decryption.
The sender applies the encryption algorithm and recipient applies
the decryption algorithm. Both the sender and the receiver must
agree on this algorithm for any meaningful communication. The
algorithm basically takes one text as input and produces another
as the output. Therefore, the algorithm contains the intelligence
for transforming message.
For example: Suppose we want to send some message through e-mail
and we wish that nobody except the friend should be able to
understand it. Then the message can be encoded using some
intelligence, for example by encoding the alphabets A to Z with a
substitution table in which the last three letters are placed in
reverse order, then the first three letters follow in straight manner,
and so on until all the letters A to Z are encoded. Now if we write
the message
"SEND SOME MONEY"
it will be
QBSA QRTB TRSBN
This coded message is called cipher text.
There are a variety of coding methods that can be used.
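A minimal Python sketch of a substitution cipher of this kind; the mapping used here simply reverses the alphabet and is illustrative, not the exact table from the example above.
import string
PLAIN = string.ascii_uppercase
CIPHER = PLAIN[::-1]                       # A->Z, B->Y, ..., Z->A
ENCODE = str.maketrans(PLAIN, CIPHER)
DECODE = str.maketrans(CIPHER, PLAIN)
message = "SEND SOME MONEY"
ciphertext = message.translate(ENCODE)     # encryption: plaintext -> ciphertext
print(ciphertext)                          # HVMW HLNV NLMVB
print(ciphertext.translate(DECODE))        # decryption recovers the plaintext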
Types of Cryptography
There are two types of encryption schemes, based on the keys used for
encryption and decryption.
1. Symmetric key encryption: It is also known as secret key
encryption. In this method, only one key is used. The same key is
shared by sender and receiver for encryption and decryption of
messages. Hence both parties must agree upon the key before
any transmission begins and nobody else should know about it. At
the sender's end, the key is used to change the original message
into an encoded form. At the receiver's end using the same key
the encoded message is decrypted and original message is
obtained. Data Encryption Standard (DES) uses this approach.
The problem with this approach is that of key agreement and
distribution.
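A minimal sketch of symmetric (secret key) encryption in Python, using the third-party cryptography package for illustration (it uses a modern cipher rather than DES itself):
from cryptography.fernet import Fernet
# Both parties must agree on this one secret key beforehand
key = Fernet.generate_key()
cipher = Fernet(key)
ciphertext = cipher.encrypt(b"SEND SOME MONEY")   # sender encrypts with the shared key
print(cipher.decrypt(ciphertext))                 # receiver decrypts with the same key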
2. Asymmetric key encryption: It is also known as public key
encryption. In this method, two different keys are used: one key is
used for encryption and the other key must be used for decryption.
No other key can decrypt the message, not even the original key
used for encryption.
One of the two keys is known as the public key and the other is the
private key. Suppose there are two users X and Y:
• X will convey its public key to Y, but the private key of X will be
known to X only.
• Similarly, Y keeps its own private key secret, and X knows Y's
public key.
When X and Y want to communicate:
1. If X wants to send a message to Y, then first of all X encrypts
the message using Y's public key. For that purpose it is
necessary that X knows Y's public key.
2. X then sends this encrypted message to Y.
3. Now using Y's private key, Y decrypts X's message. Note that
only Y knows his private key. It is not possible for Y to decrypt the
message using X's public key.
4. When Y wants to send a message to X then using X's public
key Y will encrypt the message and will send the encrypted
message to X. On the other hand, X will use its own private key to
decrypt this message. Here again Y will not know the private key
of X.
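A minimal sketch of this exchange in Python, again using the cryptography package; the key size and padding choices are illustrative.
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
# Y generates a key pair; the public key is shared with X, the private key stays with Y
y_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
y_public = y_private.public_key()
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
# X encrypts with Y's public key; only Y's private key can decrypt it
ciphertext = y_public.encrypt(b"SEND SOME MONEY", oaep)
print(y_private.decrypt(ciphertext, oaep))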
Digital Signature
A digital signature is a mathematical scheme for demonstrating
the authenticity of a digital message or document. If the recipient
gets a message with digital signature then he believes that the
message was created by a known sender.
Digital signatures are commonly used for software distribution,
financial transactions, and in other cases where it is important to
detect forgery or tampering.
When X and Y want to communicate with each other:
1. X encrypts the original plaintext message into ciphertext by
using Y's public key.
2. Then X executes an algorithm on the original plaintext to
calculate a message digest, also known as a hash. This algorithm
takes the original plaintext in binary format and applies the hashing
algorithm; as an output, a small string of binary digits gets created.
This hashing algorithm is public and anyone can use it. The most
popular message digest algorithms are MD5 and SHA-1. X then
encrypts the message digest; for this, it uses its own private key.
3. X now combines the ciphertext and its digital signature (i.e
encrypted message digest) and it is sent over the network to Y.
4. Y receives the ciphertext and X's digital signature. Y has to
decrypt both of these. Y first decrypts ciphertext back to plaintext.
For this, it uses its own private key. Thus, Y gets the message
itself in a secure manner.
5. Now to ensure that the message has come from the intended
sender Y takes X's digital signature and decrypts it. This gives Y
the message digest as was generated by X. The X had encrypted
the message digest to form a digital signature using its own
private key. Therefore, Y uses X's public key for decrypting the
digital signature.
6. Hash algorithm to generate the message digest is public.
Therefore, Y can also use it.
7. Now there are two message digests, one created by X and the
other computed by Y. Y now simply compares the two message
digests. If the two match, Y can be sure that the message came
indeed from X and not from someone else.
Thus with digital signatures, confidentiality, authenticity as well as
message integrity are assured.
The other important feature supported by digital signature is non-
repudiation. That is, a sender cannot refuse having sent a
message. Since the digital signature requires the private key of
the sender, once a message is digitally signed, it can be legally
proven that the sender had indeed sent the message.
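A minimal sketch of signing and verifying a message digest in Python, again with the cryptography package (RSA with PSS padding and SHA-256 are used here purely for illustration):
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
# X signs with its own private key; anyone holding X's public key can verify
x_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
x_public = x_private.public_key()
message = b"SEND SOME MONEY"
pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH)
signature = x_private.sign(message, pss, hashes.SHA256())   # hashes and signs in one call
x_public.verify(signature, message, pss, hashes.SHA256())   # raises InvalidSignature if tampered
print("signature verified")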
Challenges
Following are the challenges faced by the database security
system -
(1) Data Quality
• The database community needs solutions to assess the quality
of data. The quality of data can be assessed by simple
mechanisms such as quality stamps that are posted on web sites.
• The database community may need more effective techniques of
integrity semantic verification for assessing the quality of data.
• Application-level recovery techniques are also used to repair
incorrect data.
(2) Intellectual Property Rights
• Everywhere there is increasing use of the internet and intranets,
due to which there are chances of unauthorized duplication
and distribution of content. Hence digital watermarking
techniques are used to protect the content from unauthorized
access or ownership claims.
• However, research is needed to develop techniques for
preventing intellectual property rights violations.
(3) Database Survivability
• It is desired that the database systems must continue to work
even after information warfare attacks.
• The goal of information warfare attacker is to damage the
organization's operation.
• Following are the corrective actions for handling this situation -
• Confinement : Take immediate action to eliminate attacker's
access to the system. Isolate the affected components to avoid
further spread.
• Damage Assessment: Determine the extent of the problem.
• Reconfiguration: Reconfiguration allows the system to remain in
operation in a degraded mode while recovery is going on.
• Repair: Recover the corrupted or lost data by repairing or
reinstalling the system.
• Fault treatment: Identify the weaknesses exploited in the attack
and take steps to prevent a recurrence.