Unit 5

Unit 5 covers advanced topics in distributed databases, including architecture, data storage, transaction processing, and security issues. It discusses the advantages and disadvantages of distributed databases, types of databases, and transaction management protocols such as two-phase and three-phase commit protocols. The document also explains query processing steps, including parsing, translation, optimization, and evaluation.
Unit 5 Advanced Topics

Syllabus
Distributed Databases: Architecture, Data Storage, Transaction
Processing, Query processing and optimization - NOSQL
Databases: Introduction - CAP Theorem - Document Based
systems - Key value Stores - Column Based Systems - Graph
Databases. Database Security: Security issues - Access control
based on privileges - Role Based access control - SQL Injection -
Statistical Database security - Flow control - Encryption and
Public Key infrastructures - Challenges.

Distributed Databases:
Definition of distributed databases:
 A distributed database system consists of loosely coupled sites (computers) that share no physical components; each site is associated with a database system.
 The software that maintains and manages the working of distributed databases is called a distributed database management system (DDBMS).
 The database system that runs on each site is independent of the others. Refer Fig. 5.1.1.
 Transactions can access data at one or more sites.

Advantages of distributed database system


 Data processing is fast, as several sites participate in request processing.
 The reliability and availability of the system are high.
 It has reduced operating cost.
 It is easier to expand the system by adding more sites.
 It has improved sharing ability and local autonomy.

Disadvantages of distributed database system


 The system becomes complex to manage and control.
 Security issues must be carefully managed.
 The system requires deadlock handling during transaction processing; otherwise the entire system may end up in an inconsistent state.
 There is a need for some standardization for the processing of a distributed database system.

Difference between distributed DBMS and centralized DBMS

Uses of distributed system:


 Distributed databases are often used by organizations that have numerous offices in different geographical locations. Typically, an individual branch interacts primarily with the data that pertain to its own operations, with a much less frequent need for general company data. In such a situation, distributed systems are useful.
 Using a distributed system, one can grant permissions on individual sections of the overall database, for better internal and external protection.
 If we need to add a new location to a business, it is simple to create an additional node within the database, making distribution highly scalable.

Types of Distributed Databases


There are two types of distributed databases -
(1) Homogeneous databases
• Homogeneous databases are database systems in which all sites have identical software running on them. Refer Fig. 5.1.2.

• In this system, all the sites are aware of the other sites present in the system, and they all cooperate in processing the user's request.
• Each site present in the system surrenders part of its autonomy in terms of its right to change schemas or software.
• The homogeneous database system appears as a single system to the user.
(2) Heterogeneous databases
• Heterogeneous databases are database systems in which different sites have different schemas or software. Refer Fig. 5.1.3.

• The participating sites are not aware of the other sites present in the system.
• These sites provide limited facilities for cooperation in transaction processing.

Architecture
• The following is the architecture of a distributed database. In this architecture a local database is maintained by each site.
• Each site is interconnected by a communication network.

When a user makes a request for particular data at site Si, it is first searched for in the local database. If the data is not present in the local database, the request for that data is passed to all the other sites via the communication network. Each site then searches for that data in its local database. When the data is found at a particular site, say Sj, it is transmitted to site Si via the communication network.

Data Storage
There are two approaches to storing a relation r in a distributed database -

(1) Replication: The system maintains multiple copies of the data, stored at different sites, for faster retrieval and fault tolerance.
(2) Fragmentation: The relation is partitioned into several fragments stored at distinct sites.

Data Replication
• Concept: Data replication means storing a copy or replica of a relation or its fragments at two or more sites.
• There are two methods of data replication: (1) Full replication (2) Partial replication.
• Full replication: In this approach the entire relation is stored at all the sites. A fully redundant database is one in which every site contains a copy of the entire database.
• Partial replication: In this approach only some fragments of the relation are replicated at the sites.
Advantages:
(1) Availability: Data replication facilitates increased availability of data.
(2) Parallelism: Queries can be processed by several sites in parallel.
(3) Faster accessing: The relation r is locally available at each site, hence data access becomes faster.
Disadvantages:
(1) Increased cost of updates: The major disadvantage of data replication is the increased cost of updates. Each replica of relation r must be updated at all sites whenever a user requests an update to the relation.
(2) Increased complexity in concurrency control: It becomes complex to implement the concurrency control mechanism across all the sites containing replicas.
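The trade-off above can be made concrete with a short sketch. This is a minimal, illustrative Python model of full replication; the Site and ReplicatedRelation classes and the sample tuple are assumptions for the example, not from the text:

```python
# Minimal sketch of full replication: every site keeps a copy of the
# relation, so reads are local but every update must reach all sites.

class Site:
    def __init__(self, name):
        self.name = name
        self.replica = {}          # local copy of relation r: key -> tuple

class ReplicatedRelation:
    def __init__(self, sites):
        self.sites = sites

    def read(self, site, key):
        # Faster accessing: the data is locally available at each site.
        return site.replica.get(key)

    def update(self, key, value):
        # Increased cost of updates: every replica must be written.
        for site in self.sites:
            site.replica[key] = value

sites = [Site("S1"), Site("S2"), Site("S3")]
r = ReplicatedRelation(sites)
r.update(1, ("Ankita", "Pune"))
print(r.read(sites[2], 1))   # the read is served locally at S3
```

The single `update` call touching all three sites is exactly the increased update cost the text describes.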

Data Fragmentation
• Concept: Data fragmentation is a division of relation r into
fragments r1,r2, r3,...,rn which contain sufficient information to
reconstruct relation r.
• There are two approaches of data fragmentation - (1) Horizontal
fragmentation and (2) Vertical fragmentation.
• Horizontal fragmentation: In this approach, each tuple of r is assigned to one or more fragments. If relation R is fragmented into r1 and r2, then to bring these fragments back to R we must use the union operation. That means R = r1 ∪ r2.
• Vertical fragmentation: In this approach, the relation r is fragmented based on one or more columns. If relation R is fragmented into r1 and r2 using vertical fragmentation, then to bring these fragments back to the original relation R we must use the join operation. That means R = r1 ⋈ r2.
• For example - Consider following relation r
Student(RollNo, Marks, City)
The values in this schema are inserted as

Horizontal Fragmentation 1:
SELECT * FROM Student WHERE Marks > 50 AND City = 'Pune'
We will get

Horizontal Fragmentation 2:
SELECT * FROM Student WHERE Marks > 50 AND City = 'Mumbai'
We will get

Vertical Fragmentation 1:
SELECT RollNo FROM Student

Vertical Fragmentation 2:
SELECT City FROM Student
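The fragmentation examples above can be tried end-to-end in an in-memory SQLite database. This is an illustrative sketch; the sample rows are assumed, and the key column RollNo is kept in both vertical fragments so the relation can be rebuilt with a join:

```python
import sqlite3

# Sketch of horizontal and vertical fragmentation of Student(RollNo, Marks, City).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Student (RollNo INTEGER, Marks INTEGER, City TEXT)")
con.executemany("INSERT INTO Student VALUES (?,?,?)",
                [(1, 70, 'Pune'), (2, 55, 'Mumbai'), (3, 40, 'Pune')])

# Horizontal fragments: subsets of tuples; reconstruct with UNION.
h1 = con.execute("SELECT * FROM Student WHERE Marks > 50 AND City='Pune'").fetchall()
h2 = con.execute("SELECT * FROM Student WHERE Marks > 50 AND City='Mumbai'").fetchall()

# Vertical fragments: subsets of columns; keeping the key RollNo in each
# fragment lets the relation be rebuilt with a JOIN on RollNo.
rebuilt = con.execute("""
    SELECT a.RollNo, a.Marks, b.City
    FROM (SELECT RollNo, Marks FROM Student) a
    JOIN (SELECT RollNo, City  FROM Student) b ON a.RollNo = b.RollNo
""").fetchall()
print(rebuilt)   # the same tuples as the original relation
```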

Transaction Processing
Basic Concepts
In distributed system transaction initiated at one site can access
or update data at other sites. Let us discuss various basic
concepts used during transaction processing in distributed
systems -
• Local and global transactions :
A transaction Ti is said to be local if it is initiated at site Si and accesses or updates data at site Si only.
A transaction Ti initiated at site Si is said to be global if it can access or update data at sites Si, Sj, Sk and so on.
• Coordinating and participating sites:
The site at which the transaction is initiated is called coordinating
site. The participating sites are those sites at which the sub-
transactions are executing. For example - If site S1 initiates the
transaction T1 then it is called coordinating site. Now assume that
transaction T1 (initiated at S1) can access site S2 and S3. Then
sites S2 and S3 are called participating sites.
To access the data on site S2, transaction T1 needs another transaction T12 on site S2; similarly, to access the data on site S3, transaction T1 needs another transaction, say T13, on site S3. The transactions T12 and T13 are called sub-transactions. The scenario described above can be represented by the following Fig. 5.1.6.

• Transaction manager :
The transaction manager manages the execution of those transactions (or sub-transactions) that access data stored at a local site. Its tasks are:
(1) Maintaining the log for recovery purposes.
(2) Participating in coordinating the concurrent execution of the transactions executing at that site.
• Transaction coordinator:
The transaction coordinator coordinates the execution of the
various transactions (both local and global) initiated at that site.
The tasks of Transaction coordinator are -
(1) Starting the execution of transactions that originate at the site.
(2) Distributing sub-transactions to appropriate sites for execution.
If TC denotes the transaction coordinator and TM denotes the transaction manager, then the system architecture can be represented as,

Failure Modes
There are four types of failure modes,
1. Failure of site
2. Loss of messages
3. Failure of communication link
4. Network partition
• The most common type of failure in a distributed system is the loss or corruption of messages. The system uses the Transmission Control Protocol (TCP) to handle such errors; this is a standard connection-oriented protocol in which a message is transmitted reliably from one end to the other.
• If two nodes are not directly connected, messages from one to the other must be routed through a sequence of communication links. If a communication link fails, the messages are rerouted over alternative links.
• A system is partitioned if it has been split into two subsystems, called partitions. Lack of connection between the subsystems also causes failure in a distributed system.

Commit Protocols
Two Phase Commit Protocol
• Atomicity is an important property of any transaction processing. What is the atomicity property? It means that either the transaction executes completely or it does not execute at all.
• The commit protocol ensures atomicity across the sites in the following ways -
i) A transaction which executes at multiple sites must either be committed at all the sites, or aborted at all the sites.
ii) It is not acceptable to have a transaction committed at one site and aborted at another.
• There are two types of sites involved in this protocol -
• One coordinating site
• One or more participating sites.
Two phase commit protocol
This protocol works in two phases - i) Voting phase and ii) Decision phase.
Phase 1: Obtaining a decision (voting phase)
Step 1: Coordinator site Ci asks all participants to prepare to commit transaction T.
• Ci adds the record <prepare T> to the log and writes the log to stable storage.
• It then sends prepare T messages to all participating sites at which T executed.
Step 2: Upon receiving the message, the transaction manager at the participating site determines whether it can commit the transaction.
• If not, it adds a record <no T> to the log and sends an abort T message to the coordinating site Ci.
• If the transaction can be committed, then:
• Add the record <ready T> to the log.
• Force all log records for T to stable storage.
• Send a ready T message to Ci.

Phase 2: Recording the decision phase
• T can be committed if Ci received a ready T message from all the participating sites; otherwise T must be aborted.
• The coordinator adds a decision record, <commit T> or <abort T>, to the log and forces the record onto stable storage. Once the record reaches stable storage, the decision is irrevocable (even if failures occur).
• The coordinator sends a message to each participant informing it of the decision (commit or abort).
• Participants take the appropriate action locally.
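The two phases above can be sketched as a toy simulation. This models only the voting logic; real logging, timeouts and message loss are omitted, and the vote values are illustrative inputs:

```python
# Toy simulation of the two-phase commit decision rule described above.

def two_phase_commit(votes):
    """votes: dict mapping site name -> True ('ready T') or False ('no T')."""
    # Phase 1 (voting): the coordinator has sent <prepare T>; each
    # participant replies ready T or abort T after forcing its log.
    all_ready = all(votes.values())
    # Phase 2 (decision): commit only if every participant voted ready;
    # the same decision is then sent to every participant.
    decision = "commit" if all_ready else "abort"
    return {site: decision for site in votes}

print(two_phase_commit({"S2": True, "S3": True}))   # every site commits
print(two_phase_commit({"S2": True, "S3": False}))  # one 'no T' aborts all
```

A single negative vote aborts the transaction everywhere, which is exactly how the protocol guarantees atomicity across sites.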

Failure of site
There are various cases at which failure may occur,
(1) Failure of a participating site
• If a participating site fails, then when that site Si recovers it examines the log entries it made in order to decide the fate of the transaction.
• If the log contains a <commit T> record: the participating site executes redo(T).
• If the log contains an <abort T> record: the participating site executes undo(T).
• If the log contains a <ready T> record: the participating site must consult the coordinating site to decide the fate of transaction T.
• If T committed, redo(T).
• If T aborted, undo(T).
• If the log of the participating site contains no record for T, then Si failed before responding to the prepare T message from the coordinating site. In this case it must abort T.
(2) Failure of the coordinator
• If the coordinator fails while the commit protocol for T is executing, then the participating sites must decide the fate of transaction T:
i) If an active participating site contains a <commit T> record in its log, then T must be committed.
ii) If an active participating site contains an <abort T> record in its log, then T must be aborted.
iii) If some active participating site does not contain a <ready T> record in its log, then the failed coordinator Ci cannot have decided to commit T; T can therefore be aborted.
iv) If none of the above cases holds, then all active participating sites must have a <ready T> record in their logs, but no additional control records (such as <abort T> or <commit T>). In this case the active sites must wait for the coordinator site Ci to recover in order to learn the decision.
The two-phase commit protocol has a blocking problem.
What is the blocking problem?
It is a situation in which active participating sites may have to wait for a failed coordinator site to recover.
The solution to this problem is to use the three-phase commit protocol.
Three Phase Commit Protocol
• The three-phase commit protocol is an extension of the two-phase commit protocol that eliminates the blocking problem.
• The assumptions made for the three-phase commit protocol are -
• No network partitioning.
• At any point at least one site must be up.
• At most k sites (participating as well as coordinating) can fail.
• Phase 1: This phase is similar to phase 1 of the two-phase commit protocol. That means coordinator site Ci asks all participants to prepare to commit transaction T. The coordinator then makes the decision to commit or abort based on the responses from all the participating sites.
• Phase 2: In phase 2 the coordinator makes a decision as in two-phase commit, called the pre-commit decision <Precommit T>, and records it at multiple (at least k) participating sites.
• Phase 3: In phase 3, the coordinator sends the commit/abort message to all participating sites.
• Under the three-phase protocol, knowledge of the pre-commit decision can be used to commit despite a coordinator site failure. If the coordinating site fails, one of the participating sites becomes the coordinating site and consults the other participating sites for the pre-commit message they hold. Using this pre-commit message, the new coordinating site takes the decision to commit or abort.
• This protocol avoids the blocking problem as long as fewer than k sites fail.
Advantage of three phase commit protocol
(1) It avoids the blocking problem.
Disadvantage of three phase commit protocol
(1) The overhead is increased.
Query Processing in DBMS

Query processing is the activity performed in extracting data from the database. Query processing takes several steps to fetch the data from the database. The steps involved are:

1. Parsing and translation
2. Optimization
3. Evaluation

Query processing works in the following way:
Parsing and Translation

Query processing includes certain activities for data retrieval. User queries are written in a high-level database language such as SQL; they must be translated into expressions that can be used at the physical level of the file system. After this, the actual evaluation of the query, together with a variety of query-optimizing transformations, takes place. Thus, before processing a query, the system needs to translate it. SQL, or Structured Query Language, is the most suitable choice for humans, but it is not suitable as the internal representation of the query within the system; relational algebra is well suited for that internal representation.
The translation process in query processing is handled by the parser. When a user executes a query, the parser in the system checks the syntax of the query and verifies the names of the relations, tuples, and attributes used in it, in order to generate the internal form of the query. The parser creates a tree of the query, known as the 'parse tree', and then translates it into relational algebra form, replacing any uses of views in the query along the way. Thus, we can understand the working of query processing from the diagram described below:

Suppose a user executes a query. As we have learned, there are various methods of extracting data from the database. Suppose in SQL a user wants to fetch the records of the employees whose salary is greater than 10000. The following query is used:

select emp_name from Employee where salary > 10000;

To make the system understand the user query, it needs to be translated into relational algebra form. We can bring this query into relational algebra form as:
π emp_name (σ salary>10000 (Employee))
Note that the selection on salary must be applied before the salary attribute is projected away.
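The translation can be exercised end-to-end with SQLite, which parses the SQL, builds its internal plan (its analogue of the relational-algebra form), and evaluates it. The table contents below are illustrative:

```python
import sqlite3

# Run the salary query above against a small in-memory Employee table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employee (emp_name TEXT, salary INTEGER)")
con.executemany("INSERT INTO Employee VALUES (?,?)",
                [("AAA", 12000), ("BBB", 9000), ("CCC", 15000)])

# pi_emp_name( sigma_salary>10000 (Employee) )
rows = con.execute(
    "SELECT emp_name FROM Employee WHERE salary > 10000").fetchall()
print(rows)   # [('AAA',), ('CCC',)]
```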

After translating the given query, each relational algebra operation can be executed using one of several different algorithms. In this way, query processing begins its work.
Evaluation

In addition to the relational algebra translation, the translated relational algebra expression must be annotated with instructions specifying how to evaluate each operation. Thus, after translating the user query, the system executes a query evaluation plan.
Query Evaluation Plan
o In order to fully evaluate a query, the system needs to
construct a query evaluation plan.
o The annotations in the evaluation plan may refer to the
algorithms to be used for the particular index or the specific
operations.
o Such relational algebra with annotations is referred to
as Evaluation Primitives. The evaluation primitives carry
the instructions needed for the evaluation of the operation.
o Thus, a query evaluation plan defines a sequence of
primitive operations used for evaluating a query. The query
evaluation plan is also referred to as the query execution
plan.
o A query execution engine is responsible for generating the
output of the given query. It takes the query execution plan,
executes it, and finally makes the output for the user query.

Query Optimization

The process of selecting an efficient execution plan for processing a query is known as query optimization.

After query parsing, the parsed query is delivered to the query optimizer. The optimizer calculates how many different ways there are in which the query can run, generates various execution plans for the parsed query, and selects the plan with the lowest estimated cost. The catalog manager assists the optimizer in selecting the optimum plan to perform the query by generating the cost of each plan.
Purpose of the Query Optimizer in DBMS

The optimizer tries to come up with the best possible execution plan for a SQL statement.

Among all the candidate plans reviewed, the optimizer chooses the plan with the lowest cost. The optimizer computes costs based on the available facts. The cost computation takes into account query execution factors such as I/O, CPU, and communication for a certain query in a given context.

Sr. No   Class   Name     Role
01       10      Shreya   CR
02       10      Ritik    -

For example, consider a query that requests information about students who are in leadership roles, such as being a class representative. If the optimizer statistics show that 50% of students are in positions of leadership, the optimizer may decide that a full table scan is the most efficient. However, if the data show that just a small number of students are in positions of leadership, reading an index followed by table access by row id may be more efficient than a full table scan.
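SQLite's EXPLAIN QUERY PLAN offers a small, concrete view of such plan choices. This sketch assumes an index named idx_role on the Role column; the exact plan text varies across SQLite versions:

```python
import sqlite3

# Build the Student table from the example and index the Role column,
# then ask SQLite which plan it chose for the selective predicate.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Student (SrNo INTEGER, Class INTEGER, Name TEXT, Role TEXT)")
con.executemany("INSERT INTO Student VALUES (?,?,?,?)",
                [(1, 10, "Shreya", "CR"), (2, 10, "Ritik", None)])
con.execute("CREATE INDEX idx_role ON Student(Role)")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT Name FROM Student WHERE Role = 'CR'").fetchall()
for row in plan:
    print(row[-1])   # e.g. SEARCH Student USING INDEX idx_role (Role=?)
```

With the equality predicate on an indexed column, the optimizer prefers an index search over a full table scan, mirroring the discussion above.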

Because the database has so many internal statistics and tools at its disposal, the optimizer is frequently in a better position than the user to decide the best way to execute a statement. As a result, the optimizer is used for all SQL statements.
NOSQL Databases

Introduction
 NoSQL stands for "not only SQL".
 It is a non-tabular database system that stores data differently than relational tables. There are various types of NoSQL databases, such as document, key-value, wide-column and graph.
 Using NoSQL we can maintain flexible schemas, and these schemas can be scaled easily to handle large amounts of data.

Need
NoSQL database technology is usually adopted for the following reasons -
 NoSQL databases are often used for handling big data as part of a fundamental architecture.
 NoSQL databases are used for storing and modelling structured, semi-structured and unstructured data.
 NoSQL is used for the efficient execution of a database with high availability.
 A NoSQL database is non-relational, so it scales out better than relational databases, and it can be designed for web applications.
 NoSQL is used for easy scalability.

Features
 NoSQL does not follow any relational model.
 It is either schema-free or has a relaxed schema. That means it does not require a specific definition of the schema.
 Multiple NoSQL databases can be executed in a distributed fashion.
 It can process both unstructured and semi-structured data.
 NoSQL has higher scalability.
 It is cost-effective.
 It supports data in the form of key-value pairs, wide columns and graphs.

Comparison between RDBMS and NoSQL

CAP Theorem:

Introduction:

What is the CAP theorem?

Almost a couple of decades ago, Eric Brewer introduced the idea that there exists a trade-off among consistency, availability and partition tolerance. This trade-off is commonly known as the CAP theorem and has been extensively discussed ever since. The CAP theorem states that a distributed database can have at most two of these three properties: Consistency, Availability and Partition tolerance. One can make trade-offs between the three parameters based on the system requirements.

Table of contents:
 Consistency

 Availability

 Partition tolerance

 Importance of CAP Theorem

 CAP Theorem Database Architecture

Consistency:
In the CAP theorem, consistency means that all nodes in a distributed database system should have the same data at any given point. This ensures that all clients accessing the system will see the same data, regardless of the node they are connected to. When a client performs a read operation, it should receive the most recent written value.

To maintain consistency, the node where the write operation is performed has the responsibility of instantly replicating the data to all other nodes. Consistency is thus crucial for preserving the integrity and accuracy of the data in a distributed database system.

Availability:
In the CAP theorem, availability means that a distributed database system should always be able to respond to clients' requests: when a client requests data, it should receive a response regardless of the state of any individual node, i.e. the system must be operational at all times.

Ensuring availability is important for maintaining functionality and usability. To achieve availability, a distributed system must have redundant nodes or mechanisms in place to ensure it can continue to operate even if some nodes fail.

Partition Tolerance:
Sometimes, due to a disruption in the network, nodes get separated into groups such that they cannot communicate with each other. This is called a network partition. In the CAP theorem, partition tolerance means that a distributed database system should continue to work even if there is a network partition.

This is an essential requirement because it allows the system to handle a network partition without compromising availability or consistency. To achieve this we can use approaches like replication, redundancy, consensus algorithms and load balancing.
Importance of the CAP theorem:
In a distributed database system, network failures can disrupt the system's ability to access and update data. Therefore, partition tolerance is a must-have requirement. But achieving partition tolerance often requires making trade-offs with consistency and availability. Why?

Maintaining consistency can come at the cost of increased latency and reduced availability, as the system may need to wait for all nodes to agree on the current state of the data before responding to client requests. Meanwhile, maintaining availability can lead to data inconsistency or conflicts, as nodes may return different or stale versions of the data.

CAP Theorem Database Architecture:

In practice, different distributed database systems make different trade-offs based on their specific use cases and requirements. For example:

Relational databases (e.g., MySQL) typically prioritize consistency and availability (CA) because they are often used in situations where strong consistency is crucial.

NoSQL databases (e.g., Cassandra) often prioritize availability and partition tolerance (AP) because they are designed to handle large-scale distributed data with a tolerance for eventual consistency.

NewSQL databases (e.g., Google Spanner) aim for a balance between CA and CP by using advanced techniques like synchronized clocks to provide strong consistency while maintaining high availability and partition tolerance.

CA: The system prioritizes consistency and availability but doesn't handle network partitions well. In the case of a network partition, it might choose to become unavailable in order to maintain consistency.

CP: The system prioritizes consistency and partition tolerance but sacrifices availability. In the face of a network partition, it will maintain consistency by refusing some requests.

AP: The system prioritizes availability and partition tolerance but sacrifices strong consistency. It continues to operate and respond to requests even if that means returning stale data or data that is not consistent across all nodes.
Types of NoSQL Database
There are four types of NoSQL databases and those are -
1. Key-value store
2. Document store
3. Graph based
4. Wide column store
Let us discuss them in detail.

Key-Value Store
• Key-value pair is the simplest type of NoSQL database.
• It is designed in such a way to handle lots of data and heavy
load.
• In the key-value storage the key is unique and the value can be
JSON, string or binary objects.
• For example -
{"Customer":
  [
    {"id": 1, "name": "Ankita"},
    {"id": 2, "name": "Kavita"}
  ]
}
Here id and name are the keys, and 1, 2, "Ankita", "Kavita" are the values corresponding to those keys.
Key-value stores help the developer to store schema-less data. They work best for things like shopping cart contents.
DynamoDB, Riak and Redis are some famous examples of key-value stores.
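A minimal in-memory sketch, in the spirit of the stores named above (not their actual APIs), shows the key-value model: a unique key mapping to an opaque value:

```python
import json

# Toy key-value store: a unique key maps to an opaque value
# (JSON text, a plain string, or bytes); no schema is imposed.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

cart = KVStore()   # e.g. shopping-cart contents keyed per customer
cart.put("customer:1", json.dumps({"id": 1, "name": "Ankita"}))
print(json.loads(cart.get("customer:1"))["name"])   # Ankita
```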
Document Based Systems
• A document store makes use of key-value pairs to store and retrieve data.
• Documents are stored in forms such as XML and JSON.
• Document stores appear the most natural among the NoSQL database types.
• They are commonly used due to their flexibility and the ability to query on any field.
• For example -
{
"id": 101,
"Name": "AAA",
"City" : "Pune"
}
MongoDB and CouchDB are two popular document oriented
NoSQL database.
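The "query on any field" idea can be sketched with plain Python dictionaries standing in for JSON documents; the find helper below is illustrative, not MongoDB's actual driver API:

```python
# Schema-less documents: records in one collection need not share fields.
docs = [
    {"id": 101, "Name": "AAA", "City": "Pune"},
    {"id": 102, "Name": "BBB", "City": "Mumbai"},
    {"id": 103, "Name": "CCC"},           # no City field at all
]

def find(collection, query):
    """Return documents whose fields match every key/value in `query`."""
    return [d for d in collection
            if all(d.get(k) == v for k, v in query.items())]

print(find(docs, {"City": "Pune"}))   # matches document 101 only
```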

Column Based Systems

• The column store model is similar to a traditional relational database, except that the columns are created for each row rather than being predefined by the table structure.
• In this model the number of columns is not fixed for each record.
• Column databases can quickly aggregate the values of a given column.
• For example -

Column store databases are widely used to manage data warehouses and business intelligence. HBase and Cassandra are examples of column-based databases.
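The fast-aggregation claim can be sketched by contrasting row-oriented and column-oriented layouts of the same data; the values are illustrative:

```python
# Row-oriented layout: one record per row, all columns together.
row_store = [
    {"RollNo": 1, "Marks": 70},
    {"RollNo": 2, "Marks": 55},
    {"RollNo": 3, "Marks": 40},
]

# Column-oriented layout: one contiguous array per column, so
# aggregating a single column touches no other data.
col_store = {
    "RollNo": [1, 2, 3],
    "Marks":  [70, 55, 40],
}

print(sum(col_store["Marks"]))   # 165, read from one array only
```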

Graph Databases
A graph database is typically used in applications where the relationships among the data elements are an important aspect. The connections between elements are called links or relationships. In a graph database, connections are first-class elements of the database, stored directly. In relational databases, links are implied, using data to express the relationships.
A graph database has two components -
1) Nodes: The entities themselves. For example - people, students.
2) Edges: The relationships among the entities.
For example -

Graph databases are mostly used for social networks, logistics and spatial data. Examples of graph databases are Neo4j, InfiniteGraph and OrientDB.
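The two components can be sketched directly; the Graph class and the sample social-network data are illustrative, not Neo4j's API:

```python
# Nodes (entities) and edges (relationships) stored as first-class
# elements, so a relationship lookup needs no join over foreign keys.
class Graph:
    def __init__(self):
        self.nodes = {}                 # node id -> properties
        self.edges = []                 # (from_id, relationship, to_id)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, rel, dst):
        self.edges.append((src, rel, dst))

    def neighbours(self, node_id, rel):
        return [d for (s, r, d) in self.edges if s == node_id and r == rel]

g = Graph()
g.add_node("alice", kind="Person")
g.add_node("bob", kind="Person")
g.add_edge("alice", "FRIEND_OF", "bob")     # a social-network relationship
print(g.neighbours("alice", "FRIEND_OF"))   # ['bob']
```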
Database Security
o Definition: Database security is a technique that deals with the protection of data against unauthorized access and modification.
o Database security is an important aspect of any database management system, as it deals with sensitive data and enterprise information.
o Database security allows or disallows users from performing actions on the database objects.
Security Issues
Types of Security
Database security addresses following issues -
(1) Legal Issues: There are many legal and ethical issues with respect to the right to access information. For example - if some sensitive information is present in the database, it must not be accessed by an unauthorized person.
(2) Policy Issues: There are government and organizational policies that tell us what kinds of information should be made publicly accessible.
(3) System Issues: Under this issue, it is decided whether a security function should be handled at the hardware level, the operating system level, or the database level.
(4) Data and User Level Issues: Many organizations identify multiple security levels and categorize data and users based on these classifications. The security policy of the organization must incorporate these levels when permitting access to different levels of users.

Threats to Database
Threats to a database result in the loss or degradation of data.
Three kinds of loss can occur due to threats to a database:
(1) Loss of Integrity:
• Database integrity means information must be protected from
improper modification.
• Modifications to a database can be performed by inserting,
deleting or updating data.
• Integrity is lost if unauthorized changes are made to data,
intentionally or accidentally.
• If the loss of integrity is not corrected and work continues, it
results in inaccuracy, fraud, or erroneous decisions.
(2) Loss of Availability:
• Database availability means making the database objects
available to authorized
users.
(3) Loss of Confidentiality:
• Confidentiality means protection of data from unauthorized
disclosure of information.
• The loss of confidentiality results in loss of public confidence,
embarrassment, or legal action against the organization.

Control Measures
• There are four major control measures used to provide security
of data in a database:
1. Access control
2. Inference control
3. Flow control
4. Data encryption
• Access Control: The most common security problem is
unauthorized access to a computer system. Generally this
access aims at obtaining information or making malicious
changes in the database. The security mechanism of a DBMS
must include provisions for restricting access to the database
system as a whole; this function is called access control.
• Inference Control: This measure addresses the security
problems of statistical databases. Statistical databases are
used to provide statistical information based on some criteria.
These databases may contain information about a particular age
group, income level, education level and so on. Access to
sensitive information about individuals must be prevented while
using statistical databases. The corresponding measure, which
prevents the user from completing any such inference channel,
is called inference control.
• Flow Control: It is a kind of control measure which prevents
information from flowing in such a way that it reaches
unauthorized users. Channels that are pathways for information to
flow implicitly in ways that violate the security policy of an
organization are called covert channels.
• Data Encryption: Data encryption is a control measure
used to secure sensitive data. In this technique, the data is
encoded using some coding algorithm. An unauthorized user who
accesses encoded data will have difficulty deciphering it, but
authorized users are given decoding or decrypting algorithms (or
keys) to decipher the data.

Database Security and DBA


• DBA stands for Database Administrator, who is the central
authority for managing the database system.
• DBA is responsible for granting privileges to users who want to
use the database system.
• The DBA has a DBA account in the DBMS, sometimes
called a system or superuser account, which provides powerful
capabilities that are not made available to regular database
accounts and users.
• The DBA makes use of some special commands that perform the
following types of actions:
1. Account Creation: This command creates a new account
and password for a single user or a group of users.
2. Privilege Granting: This command allows the DBA to grant
privileges to certain accounts.
3. Privilege Revocation: This command allows the DBA to
revoke (cancel) privileges from certain accounts.
4. Security Level Assignment: This action assigns a user account
to the appropriate security clearance level.
Access Control Based on Privileges or Discretionary Access
Control
Discretionary Access Control (DAC) is an access control
mechanism based on privileges.
Types of discretionary privileges: The DBMS must provide
selective access to each relation in the database based on specific
accounts. This selective access is controlled through privileges.
There are two levels for assigning privileges for using the database
system:
o The account level: At this level, the DBA specifies the
particular privileges that each account holds
independently of the relations in the database.
o Relation (or table) level: At this level, the DBA can
control the privilege to access each individual relation
or view in the database.
o For granting privileges, the access control
mechanism follows an authorization model for
discretionary privileges known as the access matrix
model.
o The access matrix is a table with rows and columns. It
defines the access permissions.
o The rows of a matrix M represent subjects (users,
accounts, programs)
o The columns represent objects (relations, records,
columns, views, operations).
o Each position M(i, j) in the matrix represents the types
of privileges (read, write, update) that subject i
holds on object j.
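As a sketch of the access matrix model, the matrix M can be represented as a mapping from (subject, object) pairs to privilege sets. The subjects, objects and privileges below are invented for illustration:

```python
# Access matrix M: rows are subjects, columns are objects, and each
# cell M(i, j) holds the set of privileges subject i has on object j.
access_matrix = {
    ("dba", "Employee"): {"read", "write", "update"},
    ("clerk", "Employee"): {"read"},
    ("clerk", "Salary"): set(),  # no privileges at all
}

def has_privilege(subject, obj, privilege):
    """Check M(i, j): does the subject hold the privilege on the object?"""
    return privilege in access_matrix.get((subject, obj), set())

print(has_privilege("dba", "Employee", "update"))  # True
print(has_privilege("clerk", "Salary", "read"))    # False
```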
• Discretionary Access Control allows each user or subject to
control access to their own data.
• In DAC, owner of resource restricts access to the resources
based on the identity of users.
• DAC is typically the default access control mechanism for most
desktop operating systems.
• Each resource object on a DAC-based system has an Access
Control List (ACL) associated with it.
• An ACL contains a list of users and groups to which the owner
has permitted access, together with the level of access for each
user or group.
• For example - The ACL is an object-centered description of
access rights, as follows:
test1.doc: {Prajkta: read}
test2.exe: {Ankita: execute}, {Prajkta: execute}
test3.com: {Ankita: execute, read}, {Prajkta: execute, read, write}
• Object access is determined during Access Control List (ACL)
authorization and is based on user identification and/or group
membership.
• Under DAC a user can only set access permissions for
resources which they already own.
• Similarly, a hypothetical user A cannot change the access control
for a file that is owned by user B. User A can, however, set
access permissions on a file that he/she owns.
• User may transfer object ownership to another user(s).
• User may determine the access type of other users.
• The DAC is easy to implement access control model.
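The object-centered ACLs from the example above can be checked with a small lookup function. This is only a sketch of ACL authorization, with the resources and users taken from that example:

```python
# Each resource maps to per-user permission sets, mirroring the ACLs
# listed in the example (test1.doc, test2.exe, test3.com).
acl = {
    "test1.doc": {"Prajkta": {"read"}},
    "test2.exe": {"Ankita": {"execute"}, "Prajkta": {"execute"}},
    "test3.com": {"Ankita": {"execute", "read"},
                  "Prajkta": {"execute", "read", "write"}},
}

def is_allowed(user, resource, action):
    """Authorize by looking up the user's entry in the resource's ACL."""
    return action in acl.get(resource, {}).get(user, set())

print(is_allowed("Prajkta", "test3.com", "write"))  # True
print(is_allowed("Ankita", "test1.doc", "read"))    # False
```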
Advantages:
(1) It is flexible.
(2) It has simple and efficient access right management.
(3) It is scalable. That means more users can be added without
added complexity.
Disadvantages:
(1) It increases the risk that data will be made accessible to users
that should not necessarily be given access.
(2) There is no control over information flow as one user can
transfer ownership to another user.
Role Based Access Control
• It is based on the concept that privileges and other permissions
are associated with organizational roles, rather than individual
users. Individual users are then assigned to appropriate roles.
• For example, an accountant in a company will be assigned to
the Accountant role, gaining access to all the resources permitted
for all accountants on the system. Similarly, a software engineer
might be assigned to the Developer role.
• In an RBAC system, the roles are centrally managed by the
administrator. The administrators determine what roles exist
within their companies and then map these roles to job functions
and tasks.
• Roles can effectively be implemented using security groups. The
security groups are created representing each role. Then
permissions and rights are assigned to these groups. Next, simply
add the appropriate users to the appropriate security groups,
depending on their roles or job functions.
• A user can have more than one role. And more than one user
can have the same role.
• Role hierarchies can be used to match natural relations between
roles. For example - A lecturer can create a role student and give
it a privilege "read course material".
• Role Based Access Control (RBAC) is also known as
non-discretionary access control.
• RBAC security strategy is widely used by most organizations for
deployment of commercial and off-the-shelf products.
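The role-to-permission mapping described above can be sketched as follows. The roles, permissions and users are invented for illustration; note that one user may hold several roles:

```python
# Permissions attach to roles, never directly to users.
role_permissions = {
    "Accountant": {"view_ledger", "post_entry"},
    "Developer": {"read_code", "commit_code"},
    "Student": {"read_course_material"},
}

# Users are assigned roles; a user can have more than one role.
user_roles = {
    "asha": {"Accountant"},
    "ravi": {"Developer", "Student"},
}

def can(user, permission):
    """A user holds a permission if any of their roles grants it."""
    return any(permission in role_permissions.get(r, set())
               for r in user_roles.get(user, set()))

print(can("ravi", "read_course_material"))  # True
print(can("asha", "commit_code"))           # False
```

Changing what accountants may do then means editing the `Accountant` role once, rather than touching every accountant's account, which is the maintenance advantage RBAC is known for.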
Advantages:
(1) The security is more easily maintained by limiting unnecessary
access to sensitive information based on each user's established
role within the organization.
(2) All the roles can be aligned with the organizational structure of
the business and users can do their jobs more efficiently and
autonomously.
Disadvantages:
(1) It is necessary to understand each user's functionality in depth
so that roles can be properly assigned.
(2) If roles are not assigned properly then inappropriate access
rights create severe security problems for the database system.
SQL Injection
 SQL injection is a type of code injection technique that can
destroy a database. In this technique, malicious code is
placed in SQL statements via web page input. These
statements can control a database server behind a web
application.
 Attackers can use SQL injection vulnerabilities to bypass
application security measures. They can go around
authentication and authorization of a web page or web
application and retrieve the content of the entire SQL
database. They can also use SQL injection to add, modify
and delete records in the database.
 An SQL injection vulnerability may affect any website or web
application that uses an SQL database such as MySQL,
Oracle, SQL Server or others.
How SQL Injection Works?
• To make an SQL injection attack, an attacker must first find
vulnerable user inputs within the web page or web
application. A web page or web application that has an SQL
injection vulnerability uses such user input directly in an SQL
query. The attacker can craft input content; such content is
often called a malicious payload and is the key part of the attack.
After the attacker sends this content, malicious SQL commands
are executed in the database.
• SQL is a query language that was designed to manage data
stored in relational databases. You can use it to access,
modify and delete data. Many web applications and websites
store all their data in SQL databases. In some cases, you can also
use SQL commands to run operating system commands.
Therefore, a successful SQL Injection attack can have very
serious consequences.
Example of SQL Injection
• Following is an example of how an SQL injection vulnerability
works in a simple web application having two input fields - one
for the user name and another for the password.
• This example has a table named users with the columns
username and password.

uname = request.POST['username']
passwd = request.POST['password']
query = "SELECT id FROM users WHERE username='" + uname + "' AND password='" + passwd + "'"
database.execute(query)

• Here both input fields - one for the user name and another for
the password - are vulnerable to SQL injection.
• The attacker can attack using these fields and alter the SQL
query to get the access to the database.
• They could use a trick on the password field by adding an
OR 1 = 1
statement to it.
• As a result the query becomes (assuming the username 'user1'
and the password 'password'):
• SELECT id FROM users WHERE username='user1' AND
password='password' OR 1 = 1
• Because of the OR 1 = 1 condition, the WHERE clause matches
every row of the users table no matter what the username and
password are. That means even if we enter a wrong username
or password, the query still succeeds, because the OR 1 = 1
part always evaluates to true.
• The query above returns the first id from the users table, and
the first id normally belongs to the administrator. In this way, the
attacker not only bypasses authentication but also gains
administrator privileges.
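The bypass can be reproduced against a throwaway in-memory SQLite database. The table and credentials below are invented for illustration, and the payload uses a closing quote plus a `--` comment, a common variant of the OR 1 = 1 trick:

```python
import sqlite3

# Throwaway in-memory database with one illustrative account.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'admin', 's3cret')")

uname = "admin"
passwd = "wrong' OR 1 = 1 --"   # attacker-controlled input

# The vulnerable string concatenation from the example above: the
# payload closes the password literal, appends OR 1 = 1, and comments
# out the trailing quote, so the WHERE clause is always true.
query = ("SELECT id FROM users WHERE username='" + uname +
         "' AND password='" + passwd + "'")
row = conn.execute(query).fetchone()
print(row)  # (1,) - authentication bypassed despite the wrong password
```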
How to prevent SQL injection?
• The first way to prevent SQL injection is to validate every input
field.
• Another method is to use parameterized queries, also called
prepared statements. This way, the application code never uses
the input directly in the query text.
• Web Application Firewalls (WAFs) are also used to filter out
malicious SQL.
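For comparison, here is a minimal sketch of the parameterized query (prepared statement) defence, again against an illustrative in-memory SQLite database. The same malicious payload is now treated as an ordinary string value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'admin', 's3cret')")

uname = "admin"
passwd = "wrong' OR 1 = 1 --"   # the same malicious payload as before

# ? placeholders: the SQL text and the values travel separately, so the
# payload cannot change the query structure - it is just a wrong password.
row = conn.execute(
    "SELECT id FROM users WHERE username=? AND password=?",
    (uname, passwd),
).fetchone()
print(row)  # None - the injection attempt fails
```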
Statistical Database Security
• Statistical databases contain statistical data about various
populations.
• A population is a set of tuples of a table that satisfy some
selection criteria.
• A statistical database may contain confidential data about
individuals. For example - a database maintained by a
government agency is a statistical database. Statistical
database security helps protect the sensitive information
in a statistical database from improper user access.
• Users are permitted to use only some portion of a statistical
database; they cannot access the complete database.
For instance - in an employee database, an ordinary user is not
permitted to access information about an individual employee's salary.
• Statistical aggregate functions such as COUNT, MAX, MIN,
AVERAGE and STANDARD DEVIATION are used in these
queries, which are called statistical queries.
• The possibility of inferring individual information from statistical
queries is reduced in the following ways:
• No statistical query is permitted whenever the number of tuples
in the population specified by the selection condition falls below
some threshold.
• Prohibit sequences of queries that refer repeatedly to the same
population of tuples.
• Partition the database. That means records are stored in
groups of some minimum size. The query can refer to any
complete group but never to subsets of records within a group.
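The first of these measures, the population-size threshold, can be sketched as follows. The threshold value and employee records are invented for illustration:

```python
# Refuse a statistical query when the selected population is smaller
# than a minimum, so aggregates cannot single out one individual.
THRESHOLD = 3  # illustrative minimum population size

employees = [
    {"name": "A", "dept": "Sales", "salary": 40000},
    {"name": "B", "dept": "Sales", "salary": 45000},
    {"name": "C", "dept": "Sales", "salary": 50000},
    {"name": "D", "dept": "HR", "salary": 60000},
]

def avg_salary(predicate):
    """AVERAGE over the population selected by the predicate."""
    population = [e for e in employees if predicate(e)]
    if len(population) < THRESHOLD:
        raise PermissionError("population below threshold; query refused")
    return sum(e["salary"] for e in population) / len(population)

print(avg_salary(lambda e: e["dept"] == "Sales"))  # 45000.0 - allowed
try:
    avg_salary(lambda e: e["dept"] == "HR")  # population of 1 - refused
except PermissionError as err:
    print(err)
```

Here the HR query would reveal one individual's exact salary, so it is refused, while the Sales aggregate covers enough people to be released.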
Flow Control
• Flow control is a mechanism that regulates the flow of
information among accessible objects.
• A flow between two objects obj1 and obj2 occurs when a program
reads values from obj1 and writes values to obj2.
• Flow control checks that the information contained in one
object does not get transferred to a less protected object.
• The flow policy specifies the channels along which the
information is allowed to move.
• The simplest flow policy specifies two classes of information -
confidential (C) and non-confidential (N). Under this policy, only
the flow of information from the confidential class to the
non-confidential class is not allowed.
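The simple C/N flow policy above can be sketched as an explicit set of permitted flows; everything is allowed except confidential to non-confidential:

```python
# Permitted (source, destination) class pairs. The pair ("C", "N") is
# deliberately absent: confidential data must not leak downward.
ALLOWED_FLOWS = {("N", "N"), ("N", "C"), ("C", "C")}

def check_flow(src_class, dst_class):
    """Permit a read-from/write-to flow only if the policy allows it."""
    if (src_class, dst_class) not in ALLOWED_FLOWS:
        raise PermissionError(
            f"flow {src_class} -> {dst_class} violates the flow policy")

check_flow("N", "C")       # fine: moving data to a more protected class
try:
    check_flow("C", "N")   # confidential leaking to non-confidential
except PermissionError as err:
    print(err)
```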
Covert Channel
• A covert channel is a type of attack that creates a capability to
transfer information between processes that are not
supposed to be allowed to communicate.
• This covert channel violates the security policy.
• A covert channel allows information to pass from a higher
classification level to a lower classification level through improper
means.
• Security experts believe that one way to avoid covert
channels is for programmers not to gain access to
sensitive data.
Encryption and Public Key Infrastructures
Cryptology is a technique of encoding and decoding messages,
so that they cannot be understood by anybody except the sender
and the intended recipient.
There are various encoding and decoding schemes, which are
called encryption schemes. The sender and recipient of the
message decide on an encoding and decoding scheme and use it
for communication.
The process of encoding messages is known as encryption. The
sender sends the original text, called plaintext. The encrypted
form of the plaintext is called ciphertext. This encrypted text
travels through the network. When it reaches the receiving
computer, the recipient decodes the message to extract the
correct meaning out of it. This process is called decryption.
The sender applies the encryption algorithm and recipient applies
the decryption algorithm. Both the sender and the receiver must
agree on this algorithm for any meaningful communication. The
algorithm basically takes one text as input and produces another
as the output. Therefore, the algorithm contains the intelligence
for transforming message.
For example: Suppose we want to send a message by e-mail and
we wish that nobody except our friend should be able to
understand it. The message can then be encoded using some
scheme. For example, the alphabets A to Z can be encoded as
follows (the substitution table figure is omitted here): the last
three letters are placed in reverse order, then the first three
letters in straight order, and continuing this logic the letters
A to Z are encoded. Now if we write the message
"SEND SOME MONEY"
it will be
QBSA QRTB TRSBN
This coded message is called cipher text.
There are a variety of coding methods that can be used.

Types of Cryptography
There are two types of encryption schemes, based on the keys
used for encryption and decryption.
1. Symmetric key encryption: It is also known as secret key
encryption. In this method, only one key is used. The same key is
shared by sender and receiver for encryption and decryption of
messages. Hence both parties must agree upon the key before
any transmission begins and nobody else should know about it. At
the sender's end, the key is used to change the original message
into an encoded form. At the receiver's end using the same key
the encoded message is decrypted and original message is
obtained. Data Encryption Standard (DES) uses this approach.
The problem with this approach is that of key agreement and
distribution.
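The shared-key idea can be illustrated with a toy XOR cipher. This is not a real cipher such as DES; it is only meant to show that the same key, applied with the same operation, both encrypts and decrypts:

```python
def xor_cipher(data: bytes, key: bytes) -> bytes:
    """XOR each byte with the repeating key; applying it twice with the
    same key recovers the original bytes (a toy symmetric scheme)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

shared_key = b"secret"               # agreed on by sender and receiver
ciphertext = xor_cipher(b"SEND SOME MONEY", shared_key)
plaintext = xor_cipher(ciphertext, shared_key)  # same key decrypts
print(plaintext)  # b'SEND SOME MONEY'
```

The key agreement problem mentioned above is visible even here: both parties must somehow share `shared_key` securely before any message can be exchanged.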
2. Asymmetric key encryption: It is also known as public key
encryption. In this method, different keys are used. One key is
used for encryption and other key must be used for decryption.
No other key can decrypt the message - not even the original key
used for encryption.
One of the two keys is known as the public key and the other is the
private key. Suppose there are two users X and Y.
• If X wants to send a message to Y, then X will convey its public
key to Y, but the private key of X will be known to X only.
• Similarly, only Y knows the private key of Y, and X knows Y's
public key.
When X and Y want to communicate:
1. If X wants to send a message to Y, then first of all X encrypts
the message using Y's public key. For that purpose it is
necessary that X knows Y's public key.
2. X then sends this encrypted message to Y.
3. Now using Y's private key, Y decrypts X's message. Note that
only Y knows his private key; the message cannot be decrypted
using Y's public key.
4. When Y wants to send a message to X then using X's public
key Y will encrypt the message and will send the encrypted
message to X. On the other hand, X will use its own private key to
decrypt this message. Here again Y will not know the private key
of X.

Digital Signature
A digital signature is a mathematical scheme for demonstrating
the authenticity of a digital message or document. If the recipient
gets a message with digital signature then he believes that the
message was created by a known sender.
Digital signatures are commonly used for software distribution,
financial transactions, and in other cases where it is important to
detect forgery or tampering.
When X and Y want to communicate with each other:
1. X encrypts the original plaintext message into ciphertext by
using Y's public key.
2. Then X runs an algorithm on the original plaintext to
calculate a message digest, also known as a hash. The algorithm
takes the original plaintext in binary format and applies a hashing
function; as output, a small string of binary digits is created.
The hashing algorithm is public and anyone can use it. The most
popular message digest algorithms are MD5 and SHA-1. X then
encrypts the message digest using its own private key.
3. X now combines the ciphertext and its digital signature (i.e.,
the encrypted message digest) and sends both over the network to Y.
4. Y receives the ciphertext and X's digital signature. Y has to
decrypt both of these. Y first decrypts ciphertext back to plaintext.
For this, it uses its own private key. Thus, Y gets the message
itself in a secure manner.
5. Now to ensure that the message has come from the intended
sender Y takes X's digital signature and decrypts it. This gives Y
the message digest as generated by X. X had encrypted
the message digest to form a digital signature using its own
private key. Therefore, Y uses X's public key for decrypting the
digital signature.
6. Hash algorithm to generate the message digest is public.
Therefore, Y can also use it.
7. Now there are two message digests, one created by X and
the other by Y. Y simply compares the two message
digests. If the two match, Y can be sure that the message came
indeed from X and not from someone else.
Thus, with a digital signature, confidentiality, authenticity and
message integrity are all assured.
The other important feature supported by digital signature is non-
repudiation. That is, a sender cannot refuse having sent a
message. Since the digital signature requires the private key of
the sender, once a message is digitally signed, it can be legally
proven that the sender had indeed sent the message.
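The message-digest step can be tried directly with Python's hashlib, which provides the MD5 and SHA-1 algorithms named above (both are now considered broken for security use and appear here only because the text mentions them). Any change to the message yields a completely different digest, which is how tampering is detected:

```python
import hashlib

# Step 2 of the scheme: hash the plaintext to a fixed-size digest.
message = b"SEND SOME MONEY"
digest = hashlib.sha1(message).hexdigest()

# A tampered message produces a digest that will not match.
tampered = hashlib.sha1(b"SEND MORE MONEY").hexdigest()

print(len(digest))          # 40 hex characters (160-bit SHA-1 digest)
print(digest == tampered)   # False: the comparison in step 7 fails
```

In the full scheme, it is this digest (not the message) that X encrypts with its private key to form the digital signature.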
Challenges
Following are the challenges faced by the database security
system -
(1) Data Quality
• The database community needs solutions to assess the quality
of data. The quality of data can be assessed by a simple
mechanism such as quality stamps posted on web sites.
• The database community may also need more effective
techniques of integrity semantics verification for assessing the
quality of data.
• Application-level recovery techniques are also used to repair
incorrect data.
(2) Intellectual Property Rights
• With the increasing use of the internet and intranets, there are
growing chances of unauthorized duplication and distribution of
content. Hence digital watermarking techniques are used to
protect content from unauthorized access or false claims of
ownership.
• However, research is needed to develop the techniques for
preventing intellectual property right violation.
(3) Database Survivability
• It is desired that database systems continue to work
even after information warfare attacks.
• The goal of information warfare attacker is to damage the
organization's operation.
• Following are the corrective actions for handling this situation -
• Confinement: Take immediate action to eliminate the attacker's
access to the system. Isolate the affected components to avoid
further spread.
• Damage Assessment: Determine the extent of the problem.
• Reconfiguration: Reconfiguration allows the system to remain
in operation in a degraded mode while recovery proceeds.
• Repair: Recover corrupted or lost data by repairing or
reinstalling the system.
• Fault treatment: Identify the weakness exploited in the attack
and take steps to prevent a recurrence.
