Advanced Database Systems Module
University of Gondar
Editor:
January 2023
Gondar, ETHIOPIA
Table of Contents
Chapter One: Transaction Management and Concurrency Control
1.1 Transaction
1.2 Transaction Support
1.3 Properties of Transaction
1.4 Concurrency Control
1.4.1 Lost Update Problem
1.4.2 Uncommitted Dependency Problem
1.4.4 Inconsistent Analysis Problem
1.5 Concept of Serializability
1.6 Recoverability
1.7 Concurrency Control Techniques
1.7.1 Locking
1.7.2 Timestamping
1.8 Database Recovery
1.9 Transactions and Recovery
1.10 Recovery Facilities
1.11 Recovery Techniques
1.11.1 Deferred Update
1.11.2 Immediate Update
1.11.3 Shadow Paging
Chapter Two: Query Processing and Optimization
2.1 Introduction
2.2 Query Processing
2.3 Query Optimization
2.4 Phases of Query Processing
2.5 Query Decomposition
2.6 Transformation Rules for RA Operations
2.7 Database Statistics
2.8 Pipelining
Chapter Three: Database Integrity, Security and Recovery
Transaction processing systems are systems with large databases and hundreds of concurrent
users executing database transactions. Examples of such systems include airline reservations,
banking, credit card processing, online retail purchasing, stock markets, supermarket checkouts,
and many other applications. These systems require high availability and fast response time for
hundreds of concurrent users. We define the concept of a transaction, which is used to represent a
logical unit of database processing that must be completed in its entirety to ensure correctness. A
transaction is typically implemented by a computer program, which includes database commands
such as retrievals, insertions, deletions, and updates.
The access mode can be specified as READ ONLY or READ WRITE. The default is READ
WRITE, unless the isolation level of READ UNCOMMITTED is specified (see below), in which
case READ ONLY is assumed. A mode of READ WRITE allows select, update, insert, delete, and
create commands to be executed. A mode of READ ONLY, as the name implies, is simply for data
retrieval.
The diagnostics area size option, DIAGNOSTICS SIZE n, specifies an integer value n, which
indicates the number of conditions that can be held simultaneously in the diagnostics area. These
conditions supply feedback information (errors or exceptions) to the user or program on the most
recently executed SQL statements.
The isolation level option is specified using the statement ISOLATION LEVEL <isolation>,
where the value for <isolation> can be READ UNCOMMITTED, READ COMMITTED,
REPEATABLE READ, or SERIALIZABLE. The default isolation level is SERIALIZABLE,
although some systems use READ COMMITTED as their default. The use of the term
SERIALIZABLE here is based on not allowing violations that cause dirty read, unrepeatable read,
and phantoms, and it is thus not identical to the way serializability is formally defined. If a transaction executes at a
lower isolation level than SERIALIZABLE, then one or more of the following three violations
may occur:
1. Dirty read. A transaction T1 may read the update of a transaction T2, which has not
yet committed. If T2 fails and is aborted, then T1 would have read a value that does not
exist and is incorrect.
2. Nonrepeatable read. A transaction T1 may read a given value from a table. If another
transaction T2 later updates that value and T1 reads that value again, T1 will see a
different value.
3. Phantoms. A transaction T1 may read a set of rows from a table, perhaps based on
some condition specified in the SQL WHERE-clause. Now suppose that a transaction
T2 inserts a new row r that also satisfies the WHERE-clause condition used in T1, into
the table used by T1. The record r is called a phantom record because it was not there when T1
starts but is there when T1 ends. T1 may or may not see the phantom, a row that previously did
not exist. If the equivalent serial order is T1 followed by T2, then the record r should not be seen;
but if it is T2 followed by T1, then the phantom record should be in the result given to T1. If the
system cannot ensure the correct behavior, then it does not deal with the phantom record problem.
Table 1.1 summarizes the possible violations for the different isolation levels. An entry of Yes
indicates that a violation is possible and an entry of No indicates that it is not possible. READ
UNCOMMITTED is the most forgiving, and SERIALIZABLE is the most restrictive in that it
avoids all three of the problems mentioned above.

Table 1.1 Possible violations based on isolation level
                          Type of Violation
Isolation Level       Dirty Read   Nonrepeatable Read   Phantom
READ UNCOMMITTED      Yes          Yes                  Yes
READ COMMITTED        No           Yes                  Yes
REPEATABLE READ       No           No                   Yes
SERIALIZABLE          No           No                   No
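The options above can be combined in a single SET TRANSACTION statement followed by the body of the transaction. The sketch below is illustrative only: it assumes a hypothetical EMPLOYEE(Fname, Lname, Ssn, Dno, Salary) table, and the DIAGNOSTICS SIZE clause is not supported by every DBMS.

SET TRANSACTION READ WRITE,
    ISOLATION LEVEL SERIALIZABLE,
    DIAGNOSTICS SIZE 5;
INSERT INTO EMPLOYEE (Fname, Lname, Ssn, Dno, Salary)
VALUES ('Robert', 'Smith', '991004321', 2, 35000);   -- new employee in department 2
UPDATE EMPLOYEE
SET Salary = Salary * 1.1                             -- raise for all of department 2
WHERE Dno = 2;
COMMIT;
-- If any statement fails, the application issues ROLLBACK instead of COMMIT, which
-- removes the inserted row and restores the updated salaries to their previous values.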
The above transaction consists of first inserting a new row in the EMPLOYEE table and then
updating the salary of all employees who work in department 2. If an error occurs on any of
the SQL statements, the entire transaction is rolled back. This implies that any updated salary
(by this transaction) would be restored to its previous value and that the newly inserted row
would be removed. As we have seen, SQL provides a number of transaction-oriented features.
The DBA or database programmers can take advantage of these options to try improving
transaction performance by relaxing serializability if that is acceptable for their applications.
Snapshot isolation will ensure that the phantom record problem does not occur, since the
database transaction, or in some cases the database statement, will only see the records that
were committed in the database at the time the transaction starts. Any insertions, deletions, or
updates that occur after the transaction starts will not be seen by the transaction.
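Some DBMSs expose snapshot isolation as an explicit level. For example, SQL Server accepts the statement below (assuming snapshot isolation has been enabled for the database); in PostgreSQL, the REPEATABLE READ level is implemented with snapshots.

SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
-- every read in this transaction sees only data committed before the transaction started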
Loss of T2's update avoided by preventing T1 from reading balx until after update.
T4 updates balx to £200 but it aborts, so balx should be back at original value of £100.
T3 has read new value of balx (£200) and uses value as basis of £10 reduction, giving a new
balance of £190, instead of £90.
Problem avoided by preventing T3 from reading balx until after T4 commits or aborts.
Meantime, T5 has transferred £10 from balx to balz, so T6 now has wrong result (£10 too high).
Problem avoided by preventing T6 from reading balx and balz until after T5 completed updates.
Serial schedule: Schedule where operations of each transaction are executed consecutively
without any interleaved operations from other transactions.
No guarantee that results of all serial executions of a given set of transactions will be
identical.
Nonserial schedule: Schedule where operations from set of concurrent transactions are
interleaved. Objective of serializability is to find nonserial schedules that allow transactions
to execute concurrently without interfering with one another.
In other words, want to find nonserial schedules that are equivalent to some serial schedule.
Such a schedule is called serializable.
Two operations in a schedule conflict as follows: (a) if two transactions only read a data item,
they do not conflict and order is not important; (b) if two transactions either read or write
completely separate data items, they do not conflict and order is not important; (c) if one
transaction writes a data item and another reads or writes the same data item, the order of
execution is important.
Conflict equivalence: two schedules are said to be conflict equivalent if the relative order of any
two conflicting operations is the same in both schedules.
Serializability premises
Conflict serializable schedule orders any conflicting operations in same way as some serial
execution.
Under the constrained write rule (a transaction updates a data item based on its old value, which
is first read by the transaction), a precedence graph can be used to test a schedule for conflict
serializability.
T9 is transferring £100 from one account with balance balx to another account with balance
baly.
1.6 Recoverability
Serializability identifies schedules that maintain database consistency, assuming no transaction
fails.
Could also examine recoverability of transactions within schedule.
If transaction fails, atomicity requires effects of transaction to be undone.
Durability states that once transaction commits, its changes cannot be undone (without running
another, compensating, transaction).
Recoverable schedule
A schedule where, for each pair of transactions Ti and Tj, if Tj reads a data item previously written
by Ti, then the commit operation of Ti precedes the commit operation of Tj.
1.7 Concurrency Control Techniques
Two basic concurrency control techniques are locking and timestamping. Both are conservative
(pessimistic) approaches: they delay transactions in case they conflict with other transactions at
some time in the future.
Optimistic methods assume conflict is rare and only check for conflicts at commit time.
1.7.1 Locking
Transaction uses locks to deny access to other transactions and so prevent incorrect updates.
Generally, a transaction must claim a shared (read) or exclusive (write) lock on a data item before
read or write.
Lock prevents another transaction from modifying item or even reading it, in the case of a write
lock.
If transaction has shared lock on item, can read but not update item.
If transaction has exclusive lock on item, can both read and update item.
Reads cannot conflict, so more than one transaction can hold shared locks simultaneously on
same item.
Some systems allow transaction to upgrade read lock to an exclusive lock, or downgrade
exclusive lock to a shared lock.
For two transactions above, a valid schedule using these rules is:
S = {write_lock(T9, balx), read(T9, balx), write(T9, balx), unlock(T9, balx), write_lock(T10, balx),
read(T10, balx), write(T10, balx), unlock(T10, balx), write_lock(T10, baly), read(T10, baly), write(T10,
baly), unlock(T10, baly), commit(T10), write_lock(T9, baly), read(T9, baly), write(T9, baly), unlock(T9,
baly), commit(T9) }
If at start, balx = 100, baly = 400, result should be: balx = 220, baly = 330, if T9 executes before
T10, or
balx = 210, baly = 340, if T10 executes before T9.
Problem is that transactions release locks too soon, resulting in loss of total isolation and
atomicity.
To guarantee serializability, need an additional protocol concerning the positioning of lock and
unlock operations in every transaction.
Transaction follows 2PL protocol if all locking operations precede first unlock operation in the
transaction.
Cascading Rollback
However, problems can occur with interpretation of when locks can be released.
T14 aborts.
Since T15 is dependent on T14, T15 must also be rolled back. Since T16 is dependent on T15, it too
must be rolled back. This is called cascading rollback.
To prevent this with 2PL, leave release of all locks until end of transaction.
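Many SQL dialects (e.g., PostgreSQL, MySQL, Oracle) let a transaction request row locks explicitly with SELECT ... FOR UPDATE. The sketch below (PostgreSQL-style, using a hypothetical account table) holds all locks until COMMIT and therefore follows rigorous 2PL:

BEGIN;
SELECT balance FROM account WHERE acc_no = 'balx' FOR UPDATE;   -- exclusive lock on balx
UPDATE account SET balance = balance - 100 WHERE acc_no = 'balx';
SELECT balance FROM account WHERE acc_no = 'baly' FOR UPDATE;   -- exclusive lock on baly
UPDATE account SET balance = balance + 100 WHERE acc_no = 'baly';
COMMIT;  -- both locks are released only here, so no other transaction sees a partial transfer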
Deadlock
An impasse that may result when two (or more) transactions are each waiting for locks held by the
other to be released.
Only one way to break deadlock: abort one or more of the transactions.
Timeouts.
Deadlock prevention.
Timeouts
Transaction that requests lock will only wait for a system-defined period of time.
If lock has not been granted within this period, lock request times out.
In this case, DBMS assumes transaction may be deadlocked, even though it may not be, and it
aborts and automatically restarts the transaction.
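Some systems make this waiting period configurable. In PostgreSQL, for instance, a session-level lock timeout can be set (the value here is only an example):

SET lock_timeout = '5s';  -- any lock request waiting longer than 5 seconds is aborted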
Deadlock Prevention
DBMS looks ahead to see if transaction would cause deadlock and never allows deadlock to
occur.
Wait-Die - only an older transaction can wait for a younger one; otherwise the transaction is
aborted (dies) and restarted with the same timestamp. (Why the same timestamp? So that the
restarted transaction eventually becomes the oldest active one and cannot be aborted forever,
which avoids starvation.)
Wound-Wait - only a younger transaction can wait for an older one. If older transaction requests
lock held by younger one, younger one is aborted (wounded).
"Wound": the younger transaction rolls back (if it cannot finish in a small interval of time) and
gives the lock to the older one.
Several issues:
1.7.2 Timestamping
Transactions ordered globally so that older transactions, transactions with smaller timestamps,
get priority in the event of conflict.
No locks so no deadlock.
Timestamp: A unique identifier created by DBMS that indicates relative starting time of a
transaction. Can be generated by using system clock at time transaction started, or by incrementing
a logical counter every time a new transaction starts.
Timestamping – procedure
Read/write proceeds only if last update on that data item was carried out by an older transaction.
Timestamping - Read(x)
If ts(T) < WTS(x), x has already been updated by a younger (later) transaction, so the read would
be obsolete; T is aborted and restarted with a new timestamp. (Here ts(T) is the timestamp of
transaction T, and WTS(x) and RTS(x) are the timestamps of the last transactions to write and
read x, respectively.)
Otherwise the read proceeds and RTS(x) is set to max(RTS(x), ts(T)).
Timestamping - Write(x)
If ts(T) < RTS(x), a younger transaction has already read the value of x; it would be an error to
update it now, so T is aborted and restarted with a new timestamp.
If ts(T) < WTS(x), a younger transaction has already written x; the write can safely be ignored
(the ignore obsolete write rule) and T does not need to be restarted.
Otherwise the write proceeds and WTS(x) is set to ts(T).
Optimistic Techniques
Based on assumption that conflict is rare and more efficient to let transactions proceed without
delays to ensure serializability.
Three phases:
• Read
• Validation
• Write
Optimistic Techniques - Read Phase
Transaction reads values from database and stores them in local variables. Updates are applied
to a local copy of the data.
Optimistic Techniques - Validation Phase
Follows the read phase. For an update transaction, checks whether the transaction leaves the
database in a consistent state, with serializability maintained.
Granularity of data items: the size of the data item chosen as the unit of protection can range
from the entire database, to a file, to a record, to a field value of a record.
Tradeoff: coarser granularity gives lower concurrency but less locking overhead; finer
granularity gives higher concurrency but more overhead.
Hierarchy of Granularity
1.8 Database Recovery
Stable storage represents information that has been replicated in several nonvolatile storage
media with independent failure modes.
Types of Failures
System crashes, media failures such as disk head crashes, application software errors, natural
physical disasters, carelessness or unintentional destruction of data, and deliberate corruption or
destruction of data (sabotage).
If transaction had not committed at failure time, recovery manager has to undo (rollback)
any effects of that transaction for atomicity.
Partial undo - only one transaction has to be undone. Global undo - all transactions have
to be undone.
Example
DBMS starts at time t0, but fails at time tf. Assume data for transactions T2 and T3 have been
written to secondary storage.
T1 and T6 have to be undone. In absence of any other information, recovery manager has to
redo T2, T3, T4, and T5.
The log file contains two types of records:
Transaction records.
Checkpoint records.
Each transaction record contains:
Transaction identifier.
Type of log record (transaction start, insert, update, delete, abort, commit).
Identifier of the data item affected by the database action (insert, delete, and update operations).
Checkpointing
Checkpoint: point of synchronization between the database and the log file. All buffers are
force-written to secondary storage.
When failure occurs, redo all transactions that committed since the checkpoint and undo all
transactions active at time of crash.
In previous example, with checkpoint at time tc, changes made by T2 and T3 have been written to
secondary storage.
Thus, only T4 and T5 need to be redone; T1 and T6 must still be undone.
1.11 Recovery Techniques
Deferred Update
• Updates are not written to the database until after a transaction has reached its commit point.
• If a transaction fails before commit, it will not have modified the database and so no undoing of
changes is required.
• May be necessary to redo the updates of committed transactions as their effect may not have
reached the database.
Immediate Update
Shadow Paging
Chapter Two: Query Processing and Optimization
2.1 Introduction
In network and hierarchical DBMSs, low-level procedural query language is generally embedded
in high-level programming language.
With declarative languages such as SQL, user specifies what data is required rather than how it
is to be retrieved.
Aims of QP:
• transform a query written in a high-level language (e.g. SQL) into a correct and efficient
execution strategy expressed in a low-level language (implementing the relational algebra);
• execute the strategy to retrieve the required data.
As there are many equivalent transformations of same high-level query, aim of QO is to choose
one that minimizes resource usage.
SELECT *
FROM Staff s, Branch b
WHERE s.branchNo = b.branchNo AND b.city = 'London';
Three equivalent relational algebra queries corresponding to this SQL statement are:
1. σ(s.branchNo=b.branchNo) ∧ (city='London')(Staff × Branch)
2. σcity='London'(Staff ⋈s.branchNo=b.branchNo Branch)
3. Staff ⋈s.branchNo=b.branchNo (σcity='London'(Branch))
Assume that Staff contains n tuples, Branch contains m tuples, there are no indexes or sort keys
on either relation, and the results of intermediate operations are written to disk (the cost of
writing the final result is ignored).
Analysis 1:
i. Read each tuple from the two relations: n + m reads.
ii. Create a table holding the Cartesian product: n × m writes.
iii. Test each tuple of step ii against the selection predicate: n × m reads.
Total number of disk accesses: 2(n × m) + n + m.
Analysis 2:
i. Read each tuple from the two relations: n + m reads.
ii. Create a table holding the join: n writes.
iii. Test each tuple of step ii against the selection predicate: n reads.
Total number of disk accesses: 3n + m.
Analysis 3:
i. Read each tuple from the two relations: n + m reads.
ii. Create a table holding the join: n writes.
iii. Test each tuple of step ii: n reads.
Cartesian product and join operations much more expensive than selection, and third option
significantly reduces size of relations being joined together.
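For instance, with hypothetical figures of n = 1000 Staff tuples and m = 50 Branch tuples, the first strategy costs 2(1000 × 50) + 1000 + 50 = 101,050 disk accesses, whereas the second costs 3(1000) + 50 = 3,050.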
Advantages of dynamic query optimization arise from fact that information is up to date.
Disadvantages are that performance of query is affected, time may limit finding optimum
strategy.
Advantages of static query optimization are removal of runtime overhead, and more time to find
optimum strategy.
Disadvantages arise from fact that chosen execution strategy may no longer be optimal
when query is run.
2.5 Query Decomposition
Query decomposition transforms a high-level query into a relational algebra query and checks
that the query is syntactically and semantically correct. The typical stages of query
decomposition are:
analysis,
normalization,
semantic analysis,
simplification,
query restructuring.
Analysis
Analysis – Example
SELECT staff_no
FROM Staff
WHERE position > 10;
Comparison '>10' is incompatible with the data type of position, which is a variable-length
character string.
Finally, query transformed into some internal representation more suitable for processing.
Example - Relational Algebra Tree (RAT)
Normalization
Semantic Analysis
Algorithms to determine correctness exist only for queries that do not contain disjunction
and negation.
Create node for each relation and node for result. Create edges between two nodes that
represent a join, and edges between nodes that represent projection.
Relation connection graph not fully connected, so query is not correctly formulated.
Normalized attribute connection graph has cycle between nodes c.maxRent and 0 with
negative valuation sum, so query is contradictory.
Simplification
transforms query to semantically equivalent but more easily and efficiently computed
form.
Typically, access restrictions, view definitions, and integrity constraints are considered.
Assuming user has appropriate access privileges, first apply well-known idempotency
rules of boolean algebra.
2.6 Transformation Rules for RA Operations
Cascade of Selection: conjunctive Selection operations can be broken into a cascade of
individual Selections:
σp∧q∧r(R) = σp(σq(σr(R)))
Commutativity of Selection:
σp(σq(R)) = σq(σp(R))
For example:
σposition='Manager'(σsalary>20000(Staff)) = σsalary>20000(σposition='Manager'(Staff))
Commutativity of Theta join (and Cartesian product):
R ⋈p S = S ⋈p R
R × S = S × R
If the selection predicate involves only attributes of one of the join relations, the Selection and
Join (or Cartesian product) operations commute:
σp(R ⋈r S) = σp(R) ⋈r S, where the predicate p involves only attributes of R
If the selection predicate is a conjunctive predicate having the form (p ∧ q), where p involves
only attributes of R and q only attributes of S, the Selection and Theta join operations commute
as:
σp∧q(R ⋈r S) = σp(R) ⋈r σq(S)
For example:
σposition='Manager' ∧ city='London'(Staff ⋈Staff.branchNo=Branch.branchNo Branch) =
(σposition='Manager'(Staff)) ⋈Staff.branchNo=Branch.branchNo (σcity='London'(Branch))
Commutativity of Union and Intersection (but not Set difference):
R ∪ S = S ∪ R
R ∩ S = S ∩ R
R − S ≠ S − R
Commutativity of Selection and set operations (Union, Intersection, and Set difference).
(R S) T=S (R T)
40
(R S) T=S (R T)
R S) T=R (S T)
(R X S) X T = R X (S X T)
If join condition q involves attributes only from S and T, then Theta join is associative:
R p S) q r T=R p r (S q T)
For example:
For prospective renters of flats, find the properties that match their requirements and are owned
by owner CO93.
p.ownerNo = 'CO93';
Break conjunctive Selection operations into a cascade of individual Selection operations.
Use associativity of binary operations to rearrange leaf nodes so leaf nodes with most
restrictive Selection operations executed first.
Combine Cartesian product with subsequent selection whose predicate represents join
condition into a Join operation.
Use formulae that estimate costs for a number of options, and select one with lowest cost.
Consider only cost of disk access, which is usually dominant cost in QP.
Many estimates are based on cardinality of the relation, so need to be able to estimate this.
If statistics updated every time tuple is changed, this would impact performance.
DBMS could update statistics on a periodic basis, for example nightly, or whenever the
system is idle.
nTuples(R): the number of tuples in relation R; bFactor(R): the blocking factor of R (the number
of tuples of R that fit into one block).
nBlocks(R) = ⌈nTuples(R) / bFactor(R)⌉, the number of blocks required to store R.
minA(R), maxA(R): the minimum and maximum possible values for attribute A in relation R.
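For example, with hypothetical statistics nTuples(R) = 10,000 and bFactor(R) = 25, the relation occupies nBlocks(R) = ⌈10,000 / 25⌉ = 400 blocks.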
2.8 Pipelining
Materialization - output of one operation is stored in temporary relation for processing by
next. (heuristic approach)
Could also pipeline results of one operation to another without creating temporary relation.
Pipelining can save on cost of creating temporary relations and reading results back in
again.
Chapter Three: Database Integrity, Security and Recovery
Part or all of the corporate data may have strategic importance and therefore needs to be
kept secure and confidential.
Security considerations do not only apply to the data held in a database. Breaches of
security may affect other parts of the system, which may in turn affect the database.
Threat: any situation or event, whether intentional or unintentional, that will adversely affect a
system and consequently an organization.
Physical Level: concerned with securing the site containing the computer system; the site should
be physically secured. The backup systems should also be physically protected from access
except for authorized users.
Human Level: concerned with authorization of database users for access to the content at
different levels and privileges.
Operating System: concerned with the weakness and strength of the operating system
security on data files. Weakness may serve as a means of unauthorized access to the
database. This also includes protection of data in primary and secondary memory from
unauthorized access.
Database System: concerned with data access limits enforced by the database system, such as
passwords, isolated transactions, and so on.
Application Level: different application software needs to have its own security mechanism,
e.g., authentication.
Countermeasures (computer-based controls) include:
Authorization
Access controls
Views
Integrity
Encryption
RAID technology
Authorization
The granting of a right or privilege, which enables a subject to legitimately have access to
a system or a system's object.
Authentication is a mechanism that determines whether a user is who he or she claims to be.
Authorization controls can be built into the software, and govern not only what system or
object a specified user can access, but also what the user may do with it
Read Authorization: the user with this privilege is allowed only to read the content of the
data object.
Insert Authorization: the user with this privilege is allowed only to insert new records or
items to the data object.
Update Authorization: users with this privilege are allowed to modify content of
attributes but are not authorized to delete the records.
Delete Authorization: users with this privilege are only allowed to delete a record and not
anything else.
Index Authorization: deals with permission to create as well as delete an index table for
relation.
Account Creation: involves creating different accounts for different users as well as user
groups.
Privilege Grant: involves giving different levels of privileges for different users and user
groups.
Access control
A privilege allows a user to create or access (that is read, write, or modify) some database
object (such as a relation, view, and index) or to run certain DBMS utilities.
Privileges are granted to users to accomplish the tasks required for their jobs.
Almost all RDBMSs provide security at different levels and formats of data. This includes:
View Level: permission to data included in the view and not in the named relations
Hybrid (Relation/View): the case where only part of a single relation is made available
to users through View.
Any database access request will have the following three major components
Requested Operation: what type of operation (e.g., read, insert, update, delete) does the user
want to perform on the object?
Requested Object: on which resource or data of the database is the operation sought to
be applied?
Requesting User: who is the user requesting the operation on the specified object?
The database should be able to check for all the three components before processing any
request. The checking is performed by the security subsystem of the DBMS.
The SQL standard supports discretionary access control (DAC) through the GRANT and REVOKE commands.
The GRANT command gives privileges to users, and the REVOKE command takes away
privileges.
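A minimal sketch of these commands, using a hypothetical Staff table and hypothetical user names:

GRANT SELECT, INSERT ON Staff TO personnel_clerk;   -- read and insert privileges
GRANT SELECT ON Staff TO PUBLIC;                    -- read-only access for all users
REVOKE INSERT ON Staff FROM personnel_clerk;        -- take the insert privilege back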
DAC, while effective, has certain weaknesses. In particular, an unauthorized user can trick an
authorized user into disclosing sensitive data.
Mandatory access control (MAC) takes a different approach: each database object is assigned a
security class, each user is assigned a clearance for a security class, and rules are imposed on the
reading and writing of database objects by users.
MAC determines whether a user can read or write an object based on rules that involve the
security level of the object and the clearance of the user. These rules ensure that sensitive
data can never be 'passed on' to another user without the necessary clearance.
View
Is the dynamic result of one or more relational operations operating on the base relations
to produce another relation.
A view is a virtual relation that does not actually exist in the database, but is produced
upon request by a particular user, at the time of request.
The view mechanism provides a powerful and flexible security mechanism by hiding parts
of the database from certain users
Using a view is more restrictive than simply having certain privileges granted to a user on
the base relation(s)
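For example, a view over a hypothetical Staff table can hide the salary column entirely; users who are granted access to the view never see the underlying column:

CREATE VIEW StaffPublic AS
SELECT staffNo, fName, lName, branchNo    -- salary deliberately omitted
FROM Staff;

GRANT SELECT ON StaffPublic TO front_desk;  -- front_desk can query the view but not Staff itself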
Backup
Process of periodically taking a copy of the database and log file (and possibly programs)
to offline storage media.
Journaling
Process of keeping and maintaining a log file (or journal) of all changes made to database
to enable effective recovery in event of failure.
Restoring the database is done by restoring it from the latest backup and then applying the
log file.
Integrity
Prevents data from becoming invalid, and hence giving misleading or incorrect results.
Different types of constraints (illustrated in the sketch after the list below):
Primary key
Default
Foreign key
Unique key
Check
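A sketch showing all of these constraint types in one table definition (the Staff table, its columns, and the referenced Branch table are hypothetical and assumed to exist):

CREATE TABLE Staff (
    staffNo   CHAR(5)      PRIMARY KEY,                   -- primary key
    email     VARCHAR(50)  UNIQUE,                        -- unique key
    position  VARCHAR(20)  DEFAULT 'Assistant',           -- default value
    salary    NUMERIC(9,2) CHECK (salary > 0),            -- check constraint
    branchNo  CHAR(4)      REFERENCES Branch(branchNo)    -- foreign key
);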
Encryption
The encoding of the data by a special algorithm that renders the data unreadable by any
program without the decryption key.
Encryption can be used both for storing data securely in a shared/multiuser environment and for
transmitting data securely. A cryptosystem involves:
Encryption key
Encryption algorithm
Decryption key
Decryption algorithm
Types of cryptosystem
Symmetric encryption – uses the same key for both encryption and decryption and relies
on safe communication lines for exchanging the key.
Asymmetric encryption – uses different keys for encryption and decryption (typically a public
key and a private key).
Generally, symmetric algorithms are much faster to execute on a computer than asymmetric
ones; on the other hand, asymmetric algorithms are more secure than symmetric algorithms.
RAID (Redundant Array of Independent Disks): the hardware that the DBMS is running on must
be fault-tolerant, meaning that the DBMS should continue to operate even if one of the hardware
components fails.
This suggests having redundant components that can be seamlessly integrated into the working
system whenever one or more components fail.
The main hardware components that should be fault-tolerant include disk drives, disk
controllers, CPU, power supplies, and cooling fans.
Disk drives are the most vulnerable components with the shortest times between failure of
any of the hardware components.
Performance is increased through data striping: the data is segmented into equal-size
partitions (the striping unit), which are transparently distributed across multiple disks.
Reliability is improved through storing redundant information across the disks using a
parity scheme or an error-correcting scheme.
Statistical databases, which are used to produce statistics about populations of individuals,
should have special security mechanisms so that confidential information about those
individuals will not be disclosed to users.
Only queries with statistical aggregate functions like Average, Sum, Min, Max, Standard
Deviation, Mid, Count, etc should be executed.
To prevent the user from making inferences about individuals from the retrieved data, one can
also impose a constraint on the minimum number of records (tuples) in the resulting relation by
setting a threshold.
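For example, a statistical query over a hypothetical EMPLOYEE table can be restricted to aggregates over groups above a minimum size (the threshold of 5 is used here only for illustration):

SELECT Dno, AVG(Salary) AS avg_salary, COUNT(*) AS group_size
FROM EMPLOYEE
GROUP BY Dno
HAVING COUNT(*) >= 5;   -- groups smaller than the threshold are suppressed to prevent inference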
For data transmitted across a network, the sender can be sure the receiver is genuine
(non-fabrication), and the sender cannot deny that he or she sent it (non-repudiation).
Some Measures include:
Proxy servers
Firewalls
Message digest algorithms and digital signatures
Digital certificates
Kerberos
Secure sockets layer (SSL) and Secure HTTP (S-HTTP)
Secure Electronic Transactions (SET) and Secure Transaction Technology (STT)
• But it is not always the case that organizational data reside in one site.
• This demands that databases at different sites be integrated and synchronized, while keeping all
the facilities of the database approach.
• Even though integration of data implies centralized storage and control, in distributed
database systems the intention is different. Data is stored in different database systems in a
decentralized manner but act as if they are centralized through development of computer
networks.
• A distributed database system consists of loosely coupled sites that share no physical
component and database systems that run on each site are independent of each other.
• Concepts in DDBMS
• Replication: System maintains multiple copies of data, stored in different sites, for faster
retrieval and fault tolerance.
• Data transparency: Degree to which system user may remain unaware of the details of how
and where the data items are stored in a distributed system
• Local Transaction: transactions that access data only in that single site
Advantages of DDBMS
– Users at one site may be able to access data that is available at another site.
– Each site can retain some degree of control over local data
– If one site fails, the rest can continue operation, as long as a transaction does not demand data
from the failed site that is not replicated at other sites.
– If a query involves data from several sites, it may be possible to split the query into
sub-queries that can be executed at several sites in parallel (parallel processing).
Disadvantages of DDBMS
– Communication problems
– Sites may not be aware of each other and may provide only limited facilities for
cooperation in transaction processing
Another reason for the creation of object-oriented databases is the vast increase in the use of object-
oriented programming languages for developing software applications. Databases are fundamental
components in many software systems, and traditional databases are sometimes difficult to use
with software applications that are developed in an object-oriented programming language such as
C++ or Java. Object databases are designed so they can be directly—or seamlessly—integrated
with software that is developed using object-oriented programming languages.
An object typically has two components: state (value) and behavior (operations). It can have a
complex data structure as well as specific operations defined by the programmer. Objects in an
OOPL exist only during program execution; therefore, they are called transient objects. An OO
database can extend the existence of objects so that they are stored permanently in a database, and
hence the objects become persistent objects that exist beyond program termination and can be
retrieved later and shared by other programs. In other words, OO databases store persistent objects
permanently in secondary storage, and allow the sharing of these objects among multiple programs
and applications. This requires the incorporation of other well-known features of database
management systems, such as indexing mechanisms to efficiently locate the objects, concurrency
control to allow object sharing among concurrent programs, and recovery from failures.
Abstraction:
We now discuss four abstraction concepts that are used in semantic data models, such as the EER
model, as well as in KR schemes:
1) classification and instantiation, 2) identification, 3) specialization and generalization, and
4) aggregation and association.
2) Identification
Identification is the abstraction process whereby classes and objects are made uniquely identifiable
by means of some identifier. For example, a class name uniquely identifies a whole class within a
schema. An additional mechanism is necessary for telling distinct object instances apart by means
of object identifiers. Moreover, it is necessary to identify multiple manifestations in the database
of the same real-world object. For example, we may have a tuple <'Matthew
Clarke', '610618', '376-9821'> in a PERSON relation and another tuple <'301-54-0836', 'CS',
3.8> in a STUDENT relation that happen to represent the same real-world entity. There is no way
to identify the fact that these two database objects (tuples) represent the same real-world entity
unless we make a provision at design time for appropriate cross-referencing to supply this
identification. Hence, identification is needed at two levels:
■ To distinguish among database objects and classes
■ To identify database objects and to relate them to their real-world counterparts
Specialization is the process of classifying a class of objects into more specialized subclasses.
Generalization is the inverse process of generalizing several classes into a higher-level abstract
class that includes the objects in all these classes. Specialization is conceptual refinement,
whereas generalization is conceptual synthesis. Subclasses are used in the EER model to
represent specialization and generalization. We call the relationship between a subclass and its
superclass an IS-A-SUBCLASS-OF relationship, or simply an IS-A relationship.
Aggregation is an abstraction concept for building composite objects from their component
objects. There are three cases where this concept can be related to the EER model. The first case
is the situation in which we aggregate attribute values of an object to form the whole object. The
second case is when we represent an aggregation relationship as an ordinary relationship. The
third case, which the EER model does not provide for explicitly, involves the possibility of
combining objects that are related by a particular relationship instance into a higher-level
aggregate object. This is sometimes useful when the higher-level aggregate object is itself to be
related to another object.
We call the relationship between the primitive objects and their aggregate object IS-A-PARTOF;
the inverse is called IS-A-COMPONENT-OF. UML provides for all three types of aggregation.
– Data warehouse: an integrated, centralized, consolidated database that integrates data derived
from the entire organization.
• Consolidates data from multiple and diverse sources with diverse formats.
• Time variant: In contrast to the operational data that focus on current transactions, the
warehouse data represent the flow of data through time.
– Data warehouse contains data that reflect what happened last week, last month, past
five years, and so on.
• Nonvolatile: once data enter the data warehouse, they are never removed, because the data in
the warehouse represent the company's entire history.
– The data warehouse and operational environments are separated. Data warehouse
receives its data from operational databases.
• Ultimately, information is created from data warehouses; such information becomes the basis
for rational decision making.
• The data found in data warehouse is analyzed to discover previously unknown data
characteristics, relationships, dependencies, or trends.
Data mining is used for knowledge discovery, the process of searching data for unanticipated new
knowledge.
Traditional databases support online transaction processing (OLTP), which includes
insertions, updates, and deletions, while also supporting information query requirements.
Traditional relational databases are optimized to process queries that may touch a small part of
the database and transactions that deal with insertions or updates of a few tuples per relation.
Thus, they cannot be optimized for OLAP, DSS, or data mining. By contrast, data
warehouses are designed precisely to support efficient extraction, processing, and presentation
for analytic and decision-making purposes. In comparison to traditional databases, data
warehouses generally contain very large amounts of data from multiple sources that may include
databases from different data models and sometimes files acquired from independent systems
and platforms.
■ Client-server architecture
■ Multiuser support
■ Accessibility
■ Transparency
■ Intuitive data manipulation
■ Consistent reporting performance
■ Flexible reporting
Because they encompass large volumes of data, data warehouses are generally an order of
magnitude (sometimes two orders of magnitude) larger than the source databases. The sheer
volume of data (likely to be in terabytes or even petabytes) is an issue that has been dealt with
through enterprise-wide data warehouses, virtual data warehouses, and data marts:
■ Enterprise-wide data warehouses are huge projects requiring massive investment of time and
resources.
■ Virtual data warehouses provide views of operational databases that are materialized for
efficient access.
■ Data marts generally are targeted to a subset of the organization, such as a department, and are
more tightly focused.
Data mining is typically carried out with some end goals or applications. Broadly speaking, these
goals fall into the following classes: prediction, identification, classification, and optimization.
■ Prediction. Data mining can show how certain attributes within the data will behave in the future.
Examples of predictive data mining include the analysis of buying transactions to predict what
consumers will buy under certain discounts, how much sales volume a store will generate in a
given period, and whether deleting a product line will yield more profits. In such applications,
business logic is used coupled with data mining. In a scientific context, certain seismic wave
patterns may predict an earthquake with high probability.
■ Identification. Data patterns can be used to identify the existence of an item, an event, or an
activity. For example, intruders trying to break a system may be identified by the programs
executed, files accessed, and CPU time per session. In biological applications, existence of a gene
may be identified by certain sequences of nucleotide symbols in the DNA sequence. The area
known as authentication is a form of identification. It ascertains whether a user is indeed a specific
user or one from an authorized class, and involves a comparison of parameters or images or signals
against a database.
■ Classification. Data mining can partition the data so that different classes or categories can be
identified based on combinations of parameters. For example, customers in a supermarket can be
categorized into discount seeking shoppers, shoppers in a rush, loyal regular shoppers, shoppers
attached to name brands, and infrequent shoppers. This classification may be used in different
analyses of customer buying transactions as a post mining activity. Sometimes classification based
on common domain knowledge is used as an input to decompose the mining problem and make it
simpler. For instance, health foods, party foods, or school lunch foods are distinct categories in the
supermarket business. It makes sense to analyze relationships within and across categories as
separate problems. Such categorization may be used to encode the data appropriately before
subjecting it to further data mining.
■ Optimization. One eventual goal of data mining may be to optimize the use of limited resources
such as time, space, money, or materials and to maximize output variables such as sales or profits
under a given set of constraints.
As such, this goal of data mining resembles the objective function used in operations research
problems that deal with optimization under constraints.