
Unit V

Uploaded by

aryakadam348
Copyright
© © All Rights Reserved

Database Management System

DEPARTMENT OF INFORMATION TECHNOLOGY

Database Management Systems 1


Unit V

Advanced techniques, Databases and applications


Syllabus

Advanced techniques, Databases and applications:
• Database Architecture (2): Centralized, Client-Server, Parallel, Distributed, and Database Connectivity
• Decision Support Systems (1): Data Warehousing, Data Mining and Knowledge Discovery, BI
• Big Data & NoSQL (3): Big Data Analytics: Introduction, Applications, Challenges; Hadoop; XML; JSON; Structured vs Unstructured Databases
• NoSQL (1): CAP Theorem and BASE Properties, NoSQL Databases


Database Architecture

• The following are the different database architectures:
  - Centralized
  - Client-Server
  - Parallel
  - Distributed
  - Cloud Databases


Centralized Database Architecture
• A centralized database is stored at a single location, such as a mainframe computer.
• It is maintained and modified from that location only.
• It is accessed over a network connection such as a LAN or WAN.
• General-purpose computer system: one to a few CPUs and a number of device controllers, connected through a common bus that provides access to shared memory.

Fig: 1 Centralized Database Architecture


Centralized Database Architecture (Contd.)
• Single-user system (e.g., personal computer or workstation): a desktop unit for a single user, usually with only one CPU and one or two hard disks; the OS may support only one user.
• Multi-user system: more disks, more memory, multiple CPUs, and a multi-user OS. It serves a large number of users who are connected to the system via terminals. Such systems are often called server systems.
• Centralized databases are used by organizations such as colleges, companies and banks.


Advantages
• Data integrity is easier to maintain, as the whole database is stored at a single physical location.
• Data redundancy is minimal, as all the data is stored together and not scattered across different locations.
• A centralized database is more secure, as all data is held at a single location.
• Data is easily portable because it is stored in the same place.
• A centralized database is cheaper than other types of databases, as it requires less power and maintenance.


Disadvantages
• Searching the database can be time consuming, as all data is held at one central location.
• All data access traffic goes to the centralized database, which may create a bottleneck.
• If there are no database recovery measures in place and a system failure occurs, all the data in the database will be lost.


Client-Server Systems
• The client-server architecture of a database system has two logical components: the client and the server.
• Clients are generally personal computers or workstations, whereas the server is a large workstation, a mid-range computer system or a mainframe computer system.
• The applications and tools of the DBMS run on one or more client platforms, while the DBMS software resides on the server.

Fig: 2 Client Server Architecture


Client-Server Systems (contd…)

• The server computer is called the back end. It manages access structures, query evaluation and optimization, concurrency control and recovery.
• The client computer is called the front end. It consists of tools such as forms, report writers, and graphical user interface facilities.
• The interface between the front end and the back end is through SQL or through an application program interface.
• The above concept is depicted in the figure.

Fig: 3 Example of Client Server Architecture
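The front-end/back-end interface described above can be sketched in a few lines. This is only an illustration under stated assumptions: Python's built-in `sqlite3` module stands in for the back end (a real client-server DBMS would be reached over the network through its own driver), and the `student` table, its columns and the `run_query` helper are hypothetical names.

```python
import sqlite3

def run_query(conn, sql, params=()):
    """Front end: hand SQL text to the back end and return the rows."""
    cur = conn.execute(sql, params)
    return cur.fetchall()

# Back-end stand-in: an in-memory SQLite database (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])
conn.commit()

# The front end only knows SQL; storage, optimization and recovery
# stay on the back-end side of the interface.
rows = run_query(conn, "SELECT name FROM student WHERE roll_no = ?", (2,))
print(rows)
```

The point of the sketch is the division of labour: the client code contains no storage logic, only SQL passed through an API.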


Advantages
• Better functionality for the cost
• Flexibility in locating resources and expanding facilities
• Better user interfaces
• Easier maintenance


Disadvantages
• Programming cost is high in client/server environments, particularly in the initial phases.
• There is a lack of management tools for diagnosis, performance monitoring and tuning, and security control for the DBMS, client, operating system and networking environments.


Parallel Database System
• It has become impractical to manage voluminous data on a centralized system, as even the simplest query can take a long time.
• The solution is a Parallel Database System.
• In this system the database is distributed among multiple processors so that queries can be executed in parallel.
• It uses shared resources to handle massive data and so improve the performance of the whole system.


Types of Parallel Database System
• The following figures show the various parallel database architectures, which handle data through distribution and parallel query execution:
  - Shared Memory Architecture
  - Shared Disk Architecture
  - Shared Nothing Architecture
  - Hierarchical System Architecture


Shared Memory Database System Architecture
• A single memory is shared among many processors, as shown in the figure.
• Several processors are connected through an interconnection network to the main memory and disk setup.
• The interconnection network is usually a high-speed network (a bus, mesh, or hypercube), which makes sharing (transporting) data easy among the various components (processor, memory, and disk).

Fig: 4 Shared Memory Database System Architecture


Shared Disk Database System Architecture
• In the Shared Disk architecture, a single disk or single disk setup is shared among all the available processors, and every processor has its own private memory, as shown in the figure.

Fig: 5 Shared Disk Database System Architecture


Shared Nothing Database System Architecture
• Every processor has its own memory and disk setup.
• This setup may be considered a set of individual computers connected through a high-speed interconnection network, using, for example, regular network protocols and switches to share data between computers.

Fig: 6 Shared Nothing Database System Architecture
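A minimal sketch of how a shared-nothing system can answer a query in parallel: rows are hash-partitioned so that each node owns a disjoint slice, each node computes a partial result from its own data only, and a coordinator merges the partial results. The node count, the `(order_id, amount)` rows and the function names are assumptions for illustration, and threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 3  # hypothetical cluster size

def partition(rows, num_nodes):
    """Hash-partition rows by key so each node owns a disjoint slice."""
    nodes = [[] for _ in range(num_nodes)]
    for key, amount in rows:
        nodes[hash(key) % num_nodes].append((key, amount))
    return nodes

def local_sum(node_rows):
    """Work done independently on one node, using only its own data."""
    return sum(amount for _, amount in node_rows)

rows = [(i, i * 10) for i in range(1, 101)]  # (order_id, amount)
nodes = partition(rows, NUM_NODES)

# Each "node" evaluates its part of SUM(amount) in parallel.
with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
    partial_sums = list(pool.map(local_sum, nodes))

total = sum(partial_sums)  # the coordinator merges the partial results
print(total)
```

Because no memory or disk is shared, the only data that crosses the network in this sketch is one partial sum per node.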


Hierarchical Database System Architecture
• It is a combination of all the database system architectures described above.
• So it has the features of the Shared Memory, Shared Disk and Shared Nothing architectures.
• In the figure, a shared-nothing architecture is used at the top level, where nodes are connected by an interconnection network and do not share any memory or disk.
• Some nodes of the system use the Shared Memory approach with a few processors.
• Some other nodes use the Shared Disk architecture.

Fig: 7 Hierarchical Database System Architecture


Advantages
• A Hierarchical Database System is more flexible, as it combines the Shared Disk and Shared Memory architectures.
• Greater performance: because Shared Disk and Shared Memory are combined, data may be accessed directly from memory rather than through disk, which speeds up the system.
• Lower cost: disks are shared, so the money saved on disks can be used for other system enhancements.
• The system is more persistent than the other systems.


Distributed Database System

• It is a collection of multiple, logically interrelated databases distributed over a computer network.
• Logically, it looks like this:


Distributed Database System

• Data is stored at a number of sites; each site logically consists of a single processor.
• Processors at different sites are interconnected by a computer network.
  - There are no multiprocessors; those are parallel database systems.
• A distributed database is a database, not a collection of files: the data is logically related, as exhibited in the users' access patterns.
  - Relational data model
• A D-DBMS is a full-fledged DBMS.
  - It is not a remote file system and not a TP system.

Fig: 7 Distributed Database System
Advantages

• Increased reliability and availability
  - Reliability is defined as the probability that a system is running at a certain time, whereas availability is defined as the probability that the system is continuously available during a time interval.
• Easier expansion
  - In a distributed environment, expanding the system by adding more data, increasing database sizes, or adding more processors is much easier.
• Improved performance
  - Inter-query parallelism is achieved by executing multiple queries at different sites, and intra-query parallelism by breaking a query into a number of sub-queries that execute in parallel. Both lead to improved performance.


Disadvantages
• A distributed database is quite complex.
• It is more expensive, as it is complex and hence difficult to maintain.
• Because the system is distributed, the database needs stronger security, and each and every node must be secured as well.
• Data integrity is harder to maintain and data redundancy is harder to control.


Decision Support System
• A DSS is a system that helps managers
  - to optimize the decision-making process;
  - by quickly retrieving useful information from multiple blocks of data (raw data, transactions, …).
• A DSS has 3 fundamental components:
  - ETL tools (Extract, Transform and Load): extract the needed data from different sources; transform and adjust the data; load the selected data into the system (database, data warehouse, …).
  - Storage tools: mainly databases and data warehouses.
  - OLAP (Online Analytical Processing): offers interactive and complex multidimensional data queries with rapid execution time.

Fig: 8 Components of Decision Support System
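The three components can be sketched end to end on a toy scale. This is a hedged illustration, not a production pipeline: the two source lists, the `sales` table and the `transform` helper are hypothetical, an in-memory SQLite database stands in for the storage component, and a `GROUP BY` roll-up stands in for an OLAP query.

```python
import sqlite3

# Extract: two hypothetical source systems with inconsistent formats.
store_sales = [("2024-01-05", "pune", "1200"), ("2024-01-06", "Pune ", "800")]
web_sales = [("2024-01-05", "MUMBAI", "500")]

def transform(record):
    """Clean one raw record: trim and normalise the city, cast the amount."""
    day, city, amount = record
    return day, city.strip().title(), float(amount)

# Load: write the cleaned rows into the storage component.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (day TEXT, city TEXT, amount REAL)")
warehouse.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [transform(r) for r in store_sales + web_sales],
)

# OLAP-style roll-up: total sales along the "city" dimension.
totals = warehouse.execute(
    "SELECT city, SUM(amount) FROM sales GROUP BY city ORDER BY city"
).fetchall()
print(totals)
```

Note how the transform step is what makes the roll-up meaningful: without it, "pune", "Pune " and "Pune" would be counted as three different cities.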
Data Warehousing
• A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
• Subject-oriented: a data warehouse is used to analyze specific subjects. For example, "sales" can be a particular subject.
• Integrated: a data warehouse integrates data coming from many data sources, so that there is only one way to identify a specific data item.
• Time-variant: historical data is maintained in a data warehouse, since the history of the data can help decision making.
• Non-volatile: once data is integrated into the data warehouse, it is never modified or changed.

Fig: 9 Generic Data Warehouse Architecture


Data Mining

• It is the process of examining large amounts of data (i.e. Big Data) collected in a systematic way.
• It analyzes the data from many different dimensions or angles in order to extract useful information from it.
• It is also called Knowledge Discovery.
• The phases of the data mining process follow.

Fig: 10 Data Mining Process


Phases of Data Mining
• Problem analysis: select and analyze a complicated problem whose resolution will provide competitive advantages to the company.
• Data analysis: evaluate data quality, detect inadequacies, and analyze distributions and combinations.
• Data preparation: once the available data sources are identified, they need to be cleaned in order to eliminate any noise.
• Modeling: select modeling techniques to create one or more models, then test these models to check their validity.
• Evaluation: interpret and evaluate the created models to verify that they meet business needs.


Business Intelligence
• "The processes, technologies and tools needed to turn data into information, information into knowledge, and knowledge into plans that support decision making. BI encompasses data warehousing, business analytics and knowledge management."
• The term is used for concepts and methods that improve business decision making by using fact-based support systems.

Fig: 11 Business Intelligence Architecture


Benefits of Business Intelligence
• Improve management processes: planning, controlling, measuring and/or changing, resulting in increased revenues and reduced costs
• Improve operational processes: fraud detection, order processing, purchasing, etc., resulting in increased revenues and reduced costs
• Predict the future

Fig: 12 Benefits of Business Intelligence


Difference between Two Tier and Three Tier System

Fig: Two Tier Architecture and Three Tier Architecture
Difference between Two Tier and Three Tier System

Architecture type
  Two tier: Client-server architecture
  Three tier: Web-based application

Working
  Two tier: The client sends its request directly to the server and gets the response directly from the server. Communication takes place directly between client and server, with no intermediary. Because of this tight coupling, a 2-tier application runs faster.
  Three tier: Middleware sits between client and server. A client request goes to the middleware, the middleware sends it to the server, and vice versa.

Layers
  Two tier: 1. Design layer / client application (client tier). 2. Data layer / database (data tier).
  Three tier: 1. Design layer / presentation tier. 2. Business layer or logic layer / data access tier. 3. Data layer / data tier.

Security
  Two tier: Less secure, as the client can talk to the database directly.
  Three tier: Highly secure, as the client is not allowed to talk to the database directly.

Scalability
  Two tier: Poor.
  Three tier: Excellent, as requests can be load balanced between servers.

Reusability
  Two tier: Clients are mostly monolithic, so reuse is not possible.
  Three tier: More reuse, through service implementations.

Advantages
  Two tier: 1. Easy to maintain, and modification is fairly easy. 2. Communication is faster.
  Three tier: 1. Better reusability. 2. Improved data integrity. 3. Improved security: the client has no direct access to the database. 4. Forced separation of user interface logic and business logic. 5. Business logic sits on a small number of centralized machines (maybe just one). 6. Easy to maintain, manage and scale; loosely coupled, etc.

Disadvantages
  Two tier: 1. Application performance degrades as the number of users increases. 2. Cost-ineffective.
  Three tier: 1. Increased complexity and effort.
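The three-tier separation compared above can be sketched as three small pieces of code. All names here (`AccountRepository`, `AccountService`, the `account` table and the withdrawal rule) are hypothetical, and everything runs in one process for brevity; in a real deployment each tier would typically run on a different machine.

```python
import sqlite3

# Data tier: the only layer that touches the database.
class AccountRepository:
    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)"
        )
        self.conn.execute("INSERT INTO account VALUES (1, 100.0)")

    def get_balance(self, account_id):
        row = self.conn.execute(
            "SELECT balance FROM account WHERE id = ?", (account_id,)
        ).fetchone()
        return row[0]

    def set_balance(self, account_id, balance):
        self.conn.execute(
            "UPDATE account SET balance = ? WHERE id = ?", (balance, account_id)
        )

# Business tier (middleware): enforces the rules before data is changed.
class AccountService:
    def __init__(self, repo):
        self.repo = repo

    def withdraw(self, account_id, amount):
        balance = self.repo.get_balance(account_id)
        if amount <= 0 or amount > balance:
            raise ValueError("invalid withdrawal")
        self.repo.set_balance(account_id, balance - amount)
        return balance - amount

# Presentation tier: talks only to the service, never to the database.
service = AccountService(AccountRepository())
remaining = service.withdraw(1, 30.0)
print(remaining)
```

In a two-tier design the presentation code would issue the `UPDATE` itself, which is why the comparison rates it lower on security and reusability.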
Big Data: Introduction
• Data sources
  - The magnitude of data generated and shared by businesses, public administrations, numerous industrial and not-for-profit sectors, and scientific research
• Data formats
  - These data range from textual content (structured, semi-structured as well as unstructured) to multimedia content (e.g. videos, images, audio) on a multiplicity of platforms (e.g. machine-to-machine communications, social media sites, sensor networks, cyber-physical systems, and the Internet of Things [IoT]).


Big Data: Introduction
• Data amount
• Every day the world produces around
  - 2.5 quintillion bytes of data (1 exabyte equals 1 quintillion bytes, or 1 billion gigabytes), with 90% of the data generated in the world being unstructured.
  - Gantz and Reinsel (2012) assert that by 2020 the volume will be over 40 zettabytes (or 40 trillion gigabytes).
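The unit conversions quoted above can be checked with a few lines of arithmetic. The sketch assumes decimal (SI) units, which is what "1 exabyte equals 1 quintillion bytes" implies; the variable names are illustrative.

```python
GIGABYTE = 10**9     # 1 billion bytes
EXABYTE = 10**18     # 1 quintillion bytes
ZETTABYTE = 10**21

daily_bytes = 25 * 10**17                 # 2.5 quintillion bytes per day
daily_exabytes = daily_bytes / EXABYTE    # the same figure in exabytes

exabyte_in_gb = EXABYTE // GIGABYTE           # gigabytes in one exabyte
forty_zb_in_gb = 40 * ZETTABYTE // GIGABYTE   # gigabytes in 40 zettabytes

# 90% of the daily volume is said to be unstructured.
unstructured_exabytes = daily_bytes * 9 // 10 / EXABYTE

print(daily_exabytes, exabyte_in_gb, forty_zb_in_gb, unstructured_exabytes)
```

So 2.5 quintillion bytes is 2.5 exabytes a day, of which about 2.25 exabytes would be unstructured.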


Big Data Characteristics

• VOLUME: growing quantity of data, e.g. social media, behavioral, video
• VELOCITY: quickening speed of data, e.g. smart meters, process monitoring
• VARIETY: increase in types of data, e.g. app data, unstructured data

Source: Gartner, Feb 2001
Volume
• Petabytes, exabytes of data
• Volumes too great for a typical DBMS

Volume - Bytes Defined
• eBay data warehouse (2010) = 10 PB
• eBay will increase this 2.5 times by 2011
• Teradata > 10 PB
• Megabyte: 2^20 bytes or, loosely, one million bytes
• Gigabyte: 2^30 bytes or, loosely, one billion bytes
Velocity
• Massive amount of streaming data

Variety
• Massive sets of unstructured/semi-structured data from Web traffic, social media, sensors, and so on
Big Data Challenges
• Some of these challenges are a function of the characteristics of Big Data, some of its existing analysis methods and models, and some of the limitations of current data processing systems.
• Deciding what data are generated and collected
• Issues of privacy
• Ethical considerations relevant to mining such data
• High infrastructure costs


Big Data Challenges
• The broad challenges of Big Data can be grouped into three main categories, based on the data life cycle (Big Data characteristics):

Data challenges: Velocity; Variability; Visualization; Value
Process challenges: Data acquisition and warehousing; Data mining and cleansing; Data aggregation and integration; Data analysis and modelling; Data interpretation
Management challenges: Privacy; Security; Data governance; Data and information sharing; Cost/operational expenditures; Data ownership


Big Data Challenges
• Data challenges relate to the characteristics of the data itself (e.g. data volume, variety, velocity, veracity, volatility, quality, discovery).
• Process challenges relate to a series of "how" questions: how to capture data, how to integrate data, how to transform data, how to select the right model for analysis and how to provide the results.
• Management challenges cover, for example, privacy, security, governance and ethical aspects.




Big Data Analytics
• Descriptive analytics scrutinizes data and information to define the current state of a business situation in a way that makes developments, patterns and exceptions evident, in the form of standard reports, ad hoc reports, and alerts.
• Inquisitive analytics is about probing data to certify or reject business propositions, for example through analytical drill-downs into data, statistical analysis and factor analysis.
• Predictive analytics is concerned with forecasting and statistical modelling to determine future possibilities.


Big Data Analytics
• Prescriptive analytics is the stage where the predictions are used to prescribe (or recommend) the next set of things to be done. It is about optimization and randomized testing to assess how businesses can enhance their service levels.
• Pre-emptive analytics is about having the capacity to take precautionary actions on events that may undesirably influence organizational performance, for example by identifying possible perils and recommending mitigating strategies far ahead in time.


Analytics Models
• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What will happen?
• Prescriptive Analytics: How can we make it happen?

[Figure: the four analytics models plotted with DIFFICULTY on the horizontal axis and VALUE on the vertical axis; the curve carries the labels Information, Hindsight, Insight, Foresight and Optimization]
Big Data Applications

Hadoop
• The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, which is the MapReduce programming model.
• Hadoop splits files into large blocks and distributes them across the nodes in a cluster.
• It then transfers packaged code to the nodes to process the data in parallel.
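The MapReduce programming model can be illustrated with the classic word count, written here in plain Python rather than against the Hadoop API. The function names and the two sample "blocks" are assumptions for the sketch; in Hadoop the map calls would run on the nodes that hold each block, and the shuffle would move data across the network.

```python
from collections import defaultdict

def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}

# Two blocks of a file, as if stored on two different nodes.
blocks = ["big data needs big storage", "hadoop stores big data"]

mapped = [pair for block in blocks for pair in map_phase(block)]
counts = reduce_phase(shuffle(mapped))
print(counts["big"], counts["data"], counts["hadoop"])
```

Each `map_phase` call depends only on its own block, which is what allows the work to be spread across a cluster.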


Hadoop
• Hadoop HDFS: where Hadoop stores data. It is a file system that spans all the nodes in a Hadoop cluster: it links together the file systems on many local nodes to make them into one large file system. (IBM has an alternative file system for Hadoop named GPFS.)
• Hadoop MapReduce v1: an implementation for large-scale data processing.


Hadoop
• The MapReduce engine consists of:
  - JobTracker: receives jobs from client applications and sends orders to the TaskTrackers that are as near to the data as possible.
  - TaskTracker: runs on the cluster's nodes to receive orders from the JobTracker.
• YARN (the newer version of MapReduce):
  - Each cluster has a Resource Manager, and each data node runs a Node Manager.
  - For each job, one slave node acts as the Application Master, monitoring resources, tasks, etc.


Hadoop
 Hadoop is good for:
 Processing massive amounts of data through parallelism
 Handling a variety of data (structured, unstructured, semi-
structured)
 Using inexpensive commodity hardware
 Hadoop is not good for:
• Processing transactions (random access)
• When work cannot be parallelized
• Fast access to data
• Processing lots of small files
• Intensive calculations with small amounts of data



HDFS
 Hadoop Distributed File System (HDFS) principles
 Distributed, scalable, fault tolerant, high throughput
 Data access through MapReduce
 Files split into blocks (aka splits)
 3 replicas for each piece of data by default
 Can create, delete, and copy, but cannot update
 Designed for streaming reads, not random access
 Data locality is an important concept: processing data on or near
the physical storage to decrease transmission of data
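The block and replica principles can be sketched as follows. This is a simplification under stated assumptions: the 128 MB block size is assumed for the example, the DataNode names are made up, and the round-robin placement is not HDFS's actual placement policy; it only shows that each block ends up on three different nodes.

```python
import math

BLOCK_SIZE_MB = 128   # assumed block size for this sketch
REPLICATION = 3       # the default replica count stated above

def place_blocks(file_size_mb, datanodes):
    """Split a file into blocks and assign each block to REPLICATION nodes.

    Round-robin placement is a simplification, not HDFS's real policy.
    """
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(REPLICATION)
        ]
    return placement

# A 300 MB file on a hypothetical four-node cluster.
placement = place_blocks(300, ["dn1", "dn2", "dn3", "dn4"])
for block, nodes in placement.items():
    print(block, nodes)
```

With three copies of every block, the loss of any single node in this sketch still leaves two replicas of each block available.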



HDFS: architecture

 Master / Slave architecture


 NameNode
• Manages the file system namespace and metadata
• Regulates access to files by clients
 DataNode
• Many DataNodes per cluster
• Manages storage attached to the nodes
• Periodically reports status to NameNode
• Data is stored across multiple nodes
• Nodes and components will fail, so for reliability data is
replicated across multiple nodes



Types of Data
Data can be broadly classified into four types:
• Structured Data:
 Have a predefined model, which organizes data into a form that is relatively easy to
store, process, retrieve
and manage
 E.g., relational data
• Unstructured Data:
 Opposite of structured data
 E.g., Flat binary files containing text, video or audio
 Note: data is not completely devoid of a structure (e.g., an audio file may still have
an encoding structure and some metadata associated with it)
• Dynamic Data:
 Data that changes relatively frequently
 E.g., office documents and transactional entries in a financial database
• Static Data:
 Opposite of dynamic data, E.g. Medical imaging data from MRI or CT scans
Why Classify Data?
• Segmenting data into one of the following 4 quadrants can help in designing and developing a suitable storage solution:

Unstructured, dynamic: media production, eCAD, mCAD, office docs
Unstructured, static: media archive, broadcast, medical imaging
Structured, dynamic: transaction systems, ERP, CRM
Structured, static: BI, data warehousing

• Relational databases are usually used for structured data.
• File systems or NoSQL databases can be used for (static) unstructured data (more on these later).
XML
 XML is case sensitive
 All start tags must have end tags
 Elements must be properly nested
 XML declaration is the first statement
 Every document must contain a root element
 Attribute values must have quotation marks
 Certain characters are reserved for parsing



Building Blocks of XML
• Elements (tags) are the primary components of XML documents.

<AUTHOR id="123">
<FNAME> JAMES</FNAME>
<LNAME> RUSSEL</LNAME>
</AUTHOR>
<!-- I am a comment -->
• Attributes provide additional information about elements. Attribute values are set inside the element's start tag and must be quoted.
• Comments start with <!-- and end with -->


Example of XML
<Book>
<Title>My Life and Times</Title>
<Author>Paul McCartney</Author>
<Date>July, 1998</Date>
<ISBN>94303-12021-43892</ISBN>
<Publisher>McMillin Publishing</Publisher>
</Book>
<Book>
<Title>Illusions The Adventures of a Reluctant Messiah</Title>
<Author>Richard Bach</Author>
<Date>1977</Date>
<ISBN>0-440-34319-4</ISBN>
<Publisher>Dell Publishing Co.</Publisher>
</Book>
<Book>
<Title>The First and Last Freedom</Title>
<Author>J. Krishnamurti</Author>
<Date>1954</Date>
<ISBN>0-06-064831-7</ISBN>
<Publisher>Harper &amp; Row</Publisher>
</Book>
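A short sketch of reading such data with Python's standard `xml.etree.ElementTree` module. One assumption is added: the `<Book>` elements are wrapped in a `<Books>` root element, because a well-formed document needs exactly one root (the element name `Books` is made up for this example). Only two of the books are included to keep the sketch short.

```python
import xml.etree.ElementTree as ET

# The <Books> root is added for this sketch: a well-formed document needs one.
document = """<?xml version="1.0"?>
<Books>
  <Book>
    <Title>The First and Last Freedom</Title>
    <Author>J. Krishnamurti</Author>
    <Date>1954</Date>
    <Publisher>Harper &amp; Row</Publisher>
  </Book>
  <Book>
    <Title>Illusions The Adventures of a Reluctant Messiah</Title>
    <Author>Richard Bach</Author>
    <Date>1977</Date>
    <Publisher>Dell Publishing Co.</Publisher>
  </Book>
</Books>"""

root = ET.fromstring(document)

# Walk the child elements and pull out the text of selected tags.
titles = [book.findtext("Title") for book in root.findall("Book")]
first_publisher = root.find("Book").findtext("Publisher")

print(titles)
print(first_publisher)  # the parser turns &amp; back into a plain "&"
```

The `&amp;` in the publisher name shows the "reserved characters" rule in action: a literal `&` must be escaped in the document, and the parser restores it.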



JSON
 JSON stands for “JavaScript Object Notation”, a simple data
interchange format.
 It began as a notation for the world wide web.
 Since JavaScript exists in most web browsers, and JSON is based
on JavaScript, it’s very easy to support there.
 However, it has proven useful enough and simple enough that it is
now used in many other contexts that don’t involve web surfing.



JSON data structures

• Object: { "key1": "value1", "key2": "value2" }
• Array: [ "first", "second", "third" ]
• Number: 42, 3.1415926
• String: "This is a string"
• Boolean: true, false
• null: null
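A minimal sketch of how these structures map onto a programming language, using Python's standard `json` module. The sample record and its field names are made up for illustration.

```python
import json

# One JSON text that uses every structure listed above.
text = (
    '{"name": "Alice", "age": 18, "pi": 3.1415926, '
    '"tags": ["first", "second", "third"], "active": true, "manager": null}'
)

record = json.loads(text)   # parse JSON text into Python objects

# JSON object -> dict, array -> list, string -> str,
# number -> int or float, true/false -> bool, null -> None.
kinds = {key: type(value).__name__ for key, value in record.items()}
print(kinds)

round_trip = json.dumps(record, sort_keys=True)  # serialise back to JSON text
print(round_trip)
```

Parsing and serialising are inverse operations here: loading `round_trip` again gives back an equal `record`.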


The CAP Theorem
• The limitations of distributed databases can be described by the so-called CAP theorem:
  - Consistency: every node always sees the same data at any given instant (i.e., strict consistency)
  - Availability: the system continues to operate, even if nodes in a cluster crash, or some hardware or software parts are down due to upgrades
  - Partition tolerance: the system continues to operate in the presence of network partitions

CAP theorem: any distributed database with shared data can have at most two of the three desirable properties C, A and P.


BASE
• BASE stands for Basically Available, Soft state, Eventual consistency. A BASE system gives up strict consistency across the distributed system.


The BASE Properties
 The CAP theorem proves that it is impossible to guarantee strict
Consistency and Availability while being able to tolerate network
partitions
 This resulted in databases with relaxed ACID guarantees
 In particular, such databases apply the BASE properties:
 Basically Available: the system guarantees Availability
 Soft-State: the state of the system may change over time
 Eventual Consistency: the system will eventually
become consistent
Eventual Consistency
A database is termed Eventually Consistent if:
• All replicas will gradually become consistent in the absence of updates

[Figure: an "Update Webpage-A" event reaches one replica of Webpage-A and then propagates to the other replicas]

Eventual Consistency: A Main Challenge
• But what if the client accesses the data from different replicas?

[Figure: after the "Update Webpage-A" event, some replicas of Webpage-A are still out of date]

• Protocols like Read Your Own Writes (RYOW) can be applied!
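The challenge and the RYOW idea can be sketched with version numbers. This is one simple way to illustrate the guarantee under stated assumptions, not a description of any particular database: the `Replica` and `Client` classes, the version counter and the `propagate` function are all hypothetical.

```python
class Replica:
    """One copy of Webpage-A with a version number."""
    def __init__(self):
        self.version = 0
        self.value = "Webpage-A v0"

class Client:
    """Remembers the last version it wrote so it never reads older data."""
    def __init__(self):
        self.last_written = 0

    def write(self, primary, value):
        primary.version += 1
        primary.value = value
        self.last_written = primary.version

    def read(self, replicas):
        # Read Your Own Writes: skip replicas that lag behind our last write.
        for replica in replicas:
            if replica.version >= self.last_written:
                return replica.value
        raise RuntimeError("no replica is fresh enough yet")

def propagate(primary, replicas):
    """Background replication: copy the primary's state to every replica."""
    for replica in replicas:
        replica.version, replica.value = primary.version, primary.value

primary, stale = Replica(), Replica()
client = Client()
client.write(primary, "Webpage-A v1")

stale_read = stale.value                    # a naive read may see old data
safe_read = client.read([stale, primary])   # RYOW skips the stale replica
propagate(primary, [stale])                 # eventually all replicas agree
print(stale_read, safe_read, stale.value)
```

Before propagation the stale replica still serves the old page; the RYOW check prevents the writing client from seeing it, and after propagation all replicas converge.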


No-SQL
• Stands for "No SQL" or "Not Only SQL"
• A class of non-relational data storage systems
  - E.g. BigTable, Dynamo, PNUTS/Sherpa, ...
• Usually do not require a fixed table schema, nor do they use the concept of joins
• Distributed data storage systems
• All NoSQL offerings relax one or more of the ACID properties (see the CAP theorem)


NoSQL Data Storage: Classification

• Uninterpreted key/value, or "the big hash table"
  - Amazon S3 (Dynamo)
• Flexible schema
  - BigTable, Cassandra, HBase (ordered keys, semi-structured data)
  - Sherpa/PNUTS (unordered keys, JSON)
  - MongoDB (based on JSON)
  - CouchDB (name/value in text)


NoSQL Databases
To this end, a new class of databases emerged, which
mainly follow the BASE properties
 These were dubbed as NoSQL databases
 E.g., Amazon’s Dynamo and Google’s Bigtable

Main characteristics of NoSQL databases include:


 No strict schema requirements
 No strict adherence to ACID properties
 Consistency is traded in favor of Availability
Types of NoSQL Databases
Here is a limited taxonomy of NoSQL databases:
• Document Stores
• Graph Databases
• Key-Value Stores
• Columnar Databases
Document Stores
 Documents are stored in some standard format or encoding (e.g., XML,
JSON, PDF or Office Documents)
 These are typically referred to as Binary Large Objects
(BLOBs)
 Documents can be indexed
 This allows document stores to outperform traditional file
systems
 E.g., MongoDB and CouchDB (both can be queried using MapReduce)
Graph Databases
• Data are represented as vertices and edges

[Figure: a property graph with three vertices (Id: 1, Name: Alice, Age: 18), (Id: 2, Name: Bob, Age: 22) and (Id: 3, Name: Chess, Type: Group), connected by labelled edges: "knows" (Ids 100 and 101), "is_member" (Ids 102 and 105) and "Members" (Ids 103 and 104). Edges also carry a "Since" date, e.g. edge 102 has Since: 2005/07/01]

• Graph databases are powerful for graph-like queries (e.g., find the shortest path between two elements)
• E.g., Neo4j and VertexDB
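The shortest-path query mentioned above can be sketched with a breadth-first search over an adjacency list. The graph reuses the Alice, Bob and Chess vertices from the figure; the edge directions are ignored, and "Carol" is a hypothetical extra vertex added only so that the path has more than one hop.

```python
from collections import deque

# Adjacency list; "Carol" is a hypothetical extra vertex added for the sketch.
graph = {
    "Alice": ["Bob", "Chess"],
    "Bob": ["Alice", "Chess"],
    "Chess": ["Alice", "Bob", "Carol"],
    "Carol": ["Chess"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: returns a path with the fewest edges, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

path = shortest_path(graph, "Alice", "Carol")
print(path)
```

A graph database stores the adjacency directly, so a traversal like this follows edges instead of joining tables.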
Key-Value Stores
• Keys are mapped to (possibly) more complex values (e.g., lists)
• Keys can be stored in a hash table and can be distributed easily
• Such stores typically support regular CRUD (create, read, update, and delete) operations
  - That is, no joins and no aggregate functions
• E.g., Amazon DynamoDB and Apache Cassandra
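A minimal sketch of a key-value store with CRUD operations, where the key alone decides which node holds the value. The class name, the three-node layout and the byte-sum hash are assumptions chosen for a short, repeatable example; real systems use stronger hash functions and add replication.

```python
class KeyValueStore:
    """Keys are hashed to pick a node; each node is a plain hash table."""

    def __init__(self, num_nodes=3):
        self.nodes = [{} for _ in range(num_nodes)]

    def _node_for(self, key):
        # A simple, repeatable hash: sum of the key's bytes modulo node count.
        return self.nodes[sum(key.encode()) % len(self.nodes)]

    def create(self, key, value):
        self._node_for(key)[key] = value

    def read(self, key):
        return self._node_for(key).get(key)

    def update(self, key, value):
        if key in self._node_for(key):
            self._node_for(key)[key] = value

    def delete(self, key):
        self._node_for(key).pop(key, None)

store = KeyValueStore()
store.create("user:1", ["chess", "music"])   # the value can be a list
after_create = store.read("user:1")
store.update("user:1", ["chess"])
after_update = store.read("user:1")
store.delete("user:1")
after_delete = store.read("user:1")
print(after_create, after_update, after_delete)
```

Every operation touches exactly one node, which is why such stores distribute easily and why joins and aggregates across keys are not part of the interface.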
Columnar Databases
• Columnar databases are a hybrid of RDBMSs and key-value stores
• Values are stored in groups of zero or more columns, but in column order (as opposed to row order)

Records: (Alice, 3, 25), (Bob, 4, 19), (Carol, 0, 45), with columns A, B and C

Row-Order: Alice 3 25 | Bob 4 19 | Carol 0 45 (Record 1 = Alice 3 25)
Columnar (or Column-Order): Alice Bob Carol | 3 4 0 | 25 19 45 (Column A = Alice Bob Carol)
Columnar with Locality Groups: Alice Bob Carol | 3 25 4 19 0 45 (Column A = Group A; Column Family {B, C})

• Values are queried by matching keys
• E.g., HBase and Vertica
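The three layouts can be reproduced from the same three records. The sketch below uses plain Python lists to stand in for the on-disk order; the column names A, B and C and the variable names are illustrative.

```python
rows = [("Alice", 3, 25), ("Bob", 4, 19), ("Carol", 0, 45)]

# Row order: all values of record 1, then record 2, and so on.
row_order = [value for row in rows for value in row]

# Column order: all values of column A, then column B, then column C.
columns = {name: list(col) for name, col in zip(("A", "B", "C"), zip(*rows))}
column_order = columns["A"] + columns["B"] + columns["C"]

# Locality groups: column A alone, columns B and C stored together.
group_a = columns["A"]
family_bc = [v for pair in zip(columns["B"], columns["C"]) for v in pair]

# An aggregate over one column reads only that column's values.
total_c = sum(columns["C"])

print(row_order)
print(column_order)
print(group_a, family_bc, total_c)
```

The last line of the computation shows the usual motivation for column order: a query over column C never has to touch the names or column B.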
NoSQL Databases
 Key-value stores
• The simplest of the NoSQL databases, key-value stores have each item
in the database stored as an attribute name together with its value. Riak,
Voldemort, and Redis are examples.
 Wide-column stores
• Cassandra and HBase are examples of this type of database where data is
stored together in columns.
 Document databases
• MongoDB is the most well-known document database. These types of
databases store data in documents.
 Graph databases
• Neo4J and HyperGraphDB are popular examples of graph database
which are useful for data about networks.



NoSQL Products
• Cassandra
• CouchDB
• Hadoop & HBase
• MongoDB
• StupidDB
• Etc.


Common Advantages of NoSQL
Systems
 Cheap, easy to implement (open source)
 Data are replicated to multiple nodes (therefore identical and fault-
tolerant) and can be partitioned
 When data is written, the latest version is on at least one node and
then replicated to other nodes
 No single point of failure
 Easy to distribute
 Don't require a schema



What does NoSQL Not Provide?
 Joins
 Group by
 But PNUTS provides interesting materialized view approach to
joins/aggregation.
 ACID transactions
 SQL
 Integration with applications that are based on SQL



NoSQL
• NoSQL data storage systems make sense for applications that need to deal with very large semi-structured data
  - Log analysis
  - Social networking feeds
• Most of us work on organizational databases, which are not that large and have low update/query rates
  - Regular relational databases are the correct solution for such applications


Some Data and Metadata Standards

Metadata
  Relational: Data Definition Language (ISO)
  XML: XML Schema XSD (W3C), Namespaces (W3C)
  JSON: JSON Schema (IETF)

Constraints
  Relational: Integrity constraints in table definitions (ISO)
  XML: Schematron (ISO), RelaxNG (OASIS)
  JSON: -

Triggers
  Relational: Relational triggers
  XML: -
  JSON: -

Data Exchange Serializations
  Relational: The SQL standard (ISO) defines an XML serialization, but it is not widely used. There is no agreed JSON serialization. There are RDF serializations.
  XML: XML is a syntax widely used for data exchange: XML, Turtle (W3C)
  JSON: JSON is a serialized data format; there are XML representations of JSON too. Currently, the role of JSON is mainly data exchange between JavaScript clients and servers.

Annotations
  Relational: Not part of the relational model
  XML: Many kinds of annotations are defined for XML schemas and for XML data
  JSON: -

Query & CRUD Languages
  Relational: Data Manipulation Language (DML), SQL, SQL/XML (ISO)
  XML: XPath, XQuery (W3C) and SPARQL (W3C)
  JSON: JAQL, JSONiq, others for CRUD

Query & CRUD APIs
  Relational: Many for various programming languages, e.g., JDBC, ODBC
  XML: Various, includes some of the relational APIs and specific APIs, e.g., XQJ
  JSON: -

Collection
  Relational: Table, View, Database (ISO)
  XML: XML Collection Function (W3C)
  JSON: -

Transformation & Other Languages
  Relational: SQL (tables to tables)
  XML: XSLT (XML to text, includes XML); XForms; SQL XMLTABLE (XML to relational)
  JSON: JavaScript
