Advanced Database Management System Mod14

This chapter discusses Big Data and NoSQL, focusing on the characteristics of Big Data, including volume, velocity, and variety, and how these exceed traditional database capabilities. It also covers the Hadoop framework and its ecosystem, highlighting components like HDFS and MapReduce, as well as various NoSQL database models such as key-value, document, column-oriented, and graph databases. The chapter aims to equip readers with an understanding of modern data management technologies and their applications in business.

Database Systems: Design, Implementation, and Management, 14e
Module 14: Big Data and NoSQL

Coronel, Carlos and Morris, Steven, Database Systems: Design, Implementation, and Management, 14 Edition. © 2023
Cengage. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in 1
whole or in part.
Chapter Objectives (1 of 2)

By the end of this chapter, you should be able to:

1. Explain the role of Big Data in modern business

2. Describe the primary characteristics of Big Data and how these go beyond the
traditional “3 Vs”

3. Explain how the core components of the Hadoop framework operate

4. Identify the major components of the Hadoop ecosystem

5. Summarize the four major approaches of the NoSQL data model and how they
differ from the relational model

Chapter Objectives (2 of 2)

By the end of this chapter, you should be able to (continued):

6. Describe the characteristics of NewSQL databases

7. Understand how to work with document databases using MongoDB

8. Understand how to work with graph databases using Neo4j

Big Data

• Big Data refers to a set of data that displays the characteristics of volume, velocity,
and variety (the 3 Vs) to an extent that makes the data unsuitable for management
by a relational DBMS

• These characteristics can be defined as follows:


− Volume – the quantity of data to be stored
− Velocity – the speed at which data is entering the system
− Variety – the variations in the structure of the data to be stored

Volume

• Volume, the quantity of data to be stored, is a key characteristic of Big Data

• Scaling up is keeping the same number of systems but migrating each one to a
larger system

• Scaling out means that when the workload exceeds server capacity, it is spread
out across a number of servers

Velocity (1 of 2)

• Velocity refers to the rate at which new data enters the system as well as the rate at
which the data must be processed

• The velocity of processing can be broken down into two categories:


− Stream processing focuses on input processing and requires analysis of the data stream as it enters the system
 Scientists have created algorithms to decide ahead of time which data will
be kept
− Feedback loop processing refers to the analysis of the data to produce
actionable results

Velocity (2 of 2)

Figure 14.3 Feedback Loop Processing

Variety

• Variety refers to the vast array of formats and structures in which data may be
captured

• Structured data is data that has been organized to fit a predefined data model

• Unstructured data is data that is not organized to fit into a predefined data model

• Semistructured data combines elements of both – some parts of the data fit a
predefined model while other parts do not

• Relational databases rely on structured data

• One advantage of leaving data unstructured is the flexibility of being able to structure the data in different ways for different applications

Other Characteristics

• Variability refers to the changes in the meaning of data based on context

• Sentiment analysis is a method of text analysis that attempts to determine if a statement conveys a positive, negative, or neutral attitude about a topic

• Veracity refers to the trustworthiness of data

• Value refers to the degree to which the data can be analyzed for meaningful insight

• Visualization is the ability to graphically present data to make it understandable

• Polyglot persistence is the coexistence of a variety of data storage and management technologies within an organization’s infrastructure

Hadoop

• De facto standard for most Big Data storage and processing

• Hadoop is a Java-based framework for distributing and processing very large data
sets across clusters of computers

• The two most important components include the following:


− Hadoop Distributed File System (HDFS) is a low-level distributed file processing
system that can be used directly for data storage
− MapReduce is a programming model that supports processing large data sets

HDFS (1 of 3)

• The Hadoop Distributed File System (HDFS) approach to distributing data is based on the following key assumptions:
− High volume: Hadoop uses a default block size of 64 MB, which can be configured to even larger values
− Write-once, read-many: this model simplifies concurrency issues and improves
data throughput
− Streaming access: Hadoop is optimized for batch processing of entire files as a
continuous stream of data
− Fault tolerance: Hadoop is designed to replicate data across many different
devices so that when one fails, data is still available from another device

HDFS (2 of 3)

• Hadoop uses several types of nodes, which are computers that perform one or more
types of tasks within the system
− Data nodes store the actual file data
− The name node contains file system metadata
− The client node makes requests to the file system as needed to support user
applications

• The data node communicates with the name node and sends block reports and
heartbeats
− A block report is sent every 6 hours and informs the name node which blocks
are on that data node
− A heartbeat is used to let the name node know that the data node is still
available
HDFS (3 of 3)

Figure 14.4 Hadoop Distributed File System (HDFS)

MapReduce

• MapReduce is the computing framework used to process large data sets across
clusters

• A map function takes a collection of data and sorts and filters it into a set of key-
value pairs
− The map function is performed by a program called a mapper

• A reduce function summarizes the results of the map function into a single result
− The reduce function is performed by a program called a reducer

• The implementation of MapReduce complements the HDFS structure


− Job tracker is a central control program used to report on MapReduce processing
jobs
− Task tracker is a program responsible for running map and reduce tasks on a node
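The map, shuffle, and reduce phases described above can be sketched in plain Python. This is a toy word count illustrating the flow of key-value pairs through mappers and reducers, not Hadoop's actual Java API:

```python
from collections import defaultdict

def mapper(line):
    # Map: sort/filter the input into key-value pairs -- here, (word, 1).
    return [(word.lower(), 1) for word in line.split()]

def reducer(key, values):
    # Reduce: summarize all values emitted for one key into a single result.
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle phase: group mapper output by key before reducing.
grouped = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        grouped[key].append(value)

counts = dict(reducer(k, v) for k, v in grouped.items())
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

In Hadoop, the grouping step is performed by the framework across the cluster, with mappers running on the nodes that already hold the data blocks.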
Hadoop Ecosystem (1 of 3)

• Most organizations that use Hadoop also use a set of other related products that
interact and complement each other to produce an entire ecosystem of applications
and tools

• Like any ecosystem, the interconnected pieces are constantly evolving and their
relationships are changing, so it is a rather fluid situation

• MapReduce Simplification Applications


− Hive is a data warehousing system that sits on top of HDFS and supports its own
SQL-like language
− Pig is a tool for compiling a high-level scripting language, named Pig Latin, into MapReduce jobs for execution in Hadoop

Hadoop Ecosystem (2 of 3)

• Data Ingestion Applications


− Flume is a component for ingesting data in Hadoop
− Sqoop is a tool for converting data back and forth between a relational database
and the HDFS

• Direct Query Applications


− HBase is a column-oriented NoSQL database designed to sit on top of HDFS that quickly processes sparse datasets
− Impala was the first SQL-on-Hadoop application

Hadoop Ecosystem (3 of 3)

Figure 14.6 A Sample of the Hadoop Ecosystem

Hadoop Pushback

• Many organizations benefit from having a customized Hadoop ecosystem that is tailored to their specific needs in a manner that no other solution can duplicate
− However, the learning curve can be steep

• Companies such as IBM and Cloudera offer out-of-the-box Hadoop ecosystems called data platforms

• The perceived complications of Hadoop have helped to propel interest in alternative solutions, such as NoSQL databases

Knowledge Check Activity 14-1

• What is Big Data? Give a brief definition.

Knowledge Check Activity 14-1:
Answer
• What is Big Data? Give a brief definition.

Answer: Big Data is data of such volume, velocity, and/or variety that
it is difficult for traditional relational database technologies to store and
process it.

NoSQL

• NoSQL is the name given to a broad array of nonrelational database technologies that have been developed to address Big Data challenges

• The name does not describe what the NoSQL technologies are, but rather what they
are not

• There are hundreds of products that can be considered as being under the broadly
defined term NoSQL
− Most fit into one of four categories: key-value data stores, document databases,
column-oriented databases, and graph databases

Key-Value Databases

• Key-value (KV) databases are conceptually the simplest of the NoSQL data
models
− A KV database is a NoSQL database that stores data as a collection of key-value
pairs

• Key-value pairs are typically organized into buckets
− A bucket can roughly be thought of as the KV database equivalent of a table
− A bucket is a logical grouping of keys
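The KV model can be sketched in a few lines of Python. The bucket and key names here are invented for illustration; real products (Redis, Riak, etc.) each have their own APIs, but the core contract is the same: opaque values looked up by key, grouped into buckets:

```python
# Minimal key-value store sketch: buckets act roughly like tables,
# and each value is an opaque blob that the engine does not interpret.
store = {}  # bucket name -> {key: value}

def put(bucket, key, value):
    store.setdefault(bucket, {})[key] = value

def get(bucket, key):
    return store.get(bucket, {}).get(key)

# Hypothetical customer bucket; values happen to be JSON text,
# but the store neither knows nor cares.
put("customers", "cust:1001", '{"name": "Ramas", "balance": 120.5}')
put("customers", "cust:1002", '{"name": "Dunne"}')

print(get("customers", "cust:1001"))
```

Because lookups are by key only, there is no query language over the value contents; any filtering on the data inside a value happens in application code.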

Document Databases (1 of 2)

• Figure 14.7 Key-Value Database Storage

• Figure 14.8 Document Database Tagged Format

Document Databases (2 of 2)

• Document databases are conceptually similar to key-value databases
− A document database stores data in key-value pairs in which the value component is composed of a tag-encoded document

• JSON (JavaScript Object Notation) is a human-readable text format for data interchange that defines attributes and values in a document

• BSON (Binary JSON) is a computer-readable format for data interchange that expands the JSON format to include additional data types, including binary objects

• A collection, in document databases, is a logical storage unit that contains similar documents, roughly analogous to a table in a relational database
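A small, hypothetical product document shows what the tag-encoded value looks like. Python's standard json module is used here only to demonstrate the format; the field names are invented for illustration:

```python
import json

# A self-describing JSON document: attributes and values are tagged by name,
# and documents in the same collection need not share an identical structure.
doc_text = '''
{
  "sku": "P-1001",
  "description": "Claw hammer",
  "price": 12.99,
  "tags": ["tools", "hardware"]
}
'''

doc = json.loads(doc_text)   # parse the tagged text into a data structure
print(doc["price"])          # attributes are addressed by name, not by column
print(json.dumps(doc))       # serialize back to the interchange format
```

BSON stores the same logical document in a binary encoding with extra types (dates, binary objects), which is what MongoDB actually keeps on disk.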

Column-Oriented Databases (1 of 2)

• The term column-oriented database refers to the following two technologies:
− Column-centric storage, a storage technique in which data is stored in blocks which hold data from a single column across many rows
− Row-centric storage, a storage technique in which data is stored in blocks which hold data from all columns of a given set of rows

• A column family database is a NoSQL database that organizes data in key-value pairs with keys mapped to a set of columns in the value component

• A super column is a group of columns that are logically related

• In a column family database, a collection of columns or super columns related to a collection of rows is grouped together to create a column family
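The row-centric versus column-centric distinction can be sketched with the same data laid out both ways (the table and column names are invented for illustration):

```python
# Row-centric layout: each block holds all columns of a few rows
# (good for transactional reads/writes of whole records).
rows = [
    {"id": 1, "name": "Ann", "balance": 100},
    {"id": 2, "name": "Bob", "balance": 250},
    {"id": 3, "name": "Cam", "balance": 175},
]

# Column-centric layout: one block per column, holding that column's
# values across many rows (good for scans and aggregates of one column).
columns = {col: [r[col] for r in rows] for col in rows[0]}

# Aggregating a single column only touches that column's block.
print(sum(columns["balance"]))  # 525
```

A column family database applies the same idea at the storage level: rows in the same family keep their related columns physically together.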

Column-Oriented Databases (2 of 2)

Figure 14.10 Column Family Database

Graph Databases (1 of 2)

• A graph database is a NoSQL database based on graph theory to store data about
relationship-rich environments

• The primary components of graph databases are nodes, edges, and properties
− The node is a specific instance of something we want to keep data about
− An edge is a relationship between nodes
− Properties are the attributes or characteristics of a node or edge that are of
interest to the users

• A query in a graph database is called a traversal

Graph Databases (2 of 2)

Figure 14.11 Graph Database Representation

Knowledge Check Activity 14-2

• What are the four basic categories of NoSQL databases?

Knowledge Check Activity 14-2:
Answer
• What are the four basic categories of NoSQL databases?
• Answer: Key-value database, document databases, column family
databases, and graph databases.

Aggregate Awareness

• Key-value, document, and column family databases are aggregate aware
− Aggregate aware means that the data is collected or aggregated around a central topic or entity

• The aggregate aware database models achieve clustering efficiency by making each piece of data relatively independent

• Graph databases, like relational databases, are aggregate ignorant
− Aggregate ignorant models do not organize the data into collections based on a central entity
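An aggregate can be sketched as a single nested document. Instead of decomposing an order into separate ORDER and ORDER_LINE tables, all of the data about one order is kept together (the entity and field names here are invented for illustration):

```python
# Aggregate-aware sketch: one order document carries everything needed to
# process that order, so a single cluster node can serve it without joins.
order = {
    "order_id": 5237,
    "customer": {"name": "Ramas", "city": "Nashville"},
    "lines": [
        {"sku": "P-1001", "qty": 2, "price": 12.99},
        {"sku": "P-2040", "qty": 1, "price": 4.50},
    ],
}

# The whole aggregate is one independent piece of data.
total = sum(l["qty"] * l["price"] for l in order["lines"])
print(round(total, 2))  # 30.48
```

The trade-off is redundancy: customer data repeated across many order aggregates is the price paid for join-free, cluster-friendly access.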

NewSQL Databases

• NewSQL is a database model that attempts to provide ACID-compliant transactions across a highly distributed infrastructure

• Characteristics of NewSQL databases include the following:
− Have no proven track record
− Have been adopted by relatively few organizations

• NewSQL databases support:
− SQL as the primary interface
− ACID-compliant transactions

Working with Document Databases
Using MongoDB (1 of 3)
• MongoDB is a popular document database
− Among the NoSQL databases currently available, MongoDB has been one of the
most successful in penetrating the database market

• The name MongoDB comes from the word humongous, as its developers intended their new product to support extremely large data sets

• It is designed for the following:


− High availability
− High scalability
− High performance

Working with Document Databases
Using MongoDB (2 of 3)
• Importing Documents in MongoDB
− Refer to the text for an importation example and considerations

• Example of a MongoDB Query Using find()


− Methods are programmed functions used to manipulate objects
 The find() method retrieves objects from a collection that match the
restrictions provided
− Refer to the text for a query example
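The semantics of find() can be illustrated with a toy stand-in for a collection. In the real mongo shell the call would be, for example, db.products.find({ "price": { "$lt": 15 } }); the sketch below implements only equality matching and the $lt operator, with invented documents, purely to show what "retrieves objects matching the restrictions" means:

```python
# Toy collection: a list of documents (dicts).
products = [
    {"sku": "P-1001", "description": "Claw hammer", "price": 12.99},
    {"sku": "P-2040", "description": "Box of nails", "price": 4.50},
    {"sku": "P-3310", "description": "Cordless drill", "price": 89.00},
]

def find(collection, criteria):
    # Supports equality and the $lt operator -- a tiny subset of MongoDB.
    def matches(doc):
        for field, cond in criteria.items():
            if isinstance(cond, dict) and "$lt" in cond:
                if not doc.get(field, float("inf")) < cond["$lt"]:
                    return False
            elif doc.get(field) != cond:
                return False
        return True
    return [d for d in collection if matches(d)]

cheap = find(products, {"price": {"$lt": 15}})
print([d["sku"] for d in cheap])  # ['P-1001', 'P-2040']
```

MongoDB's real find() also accepts a projection document to limit which fields are returned, and supports many more operators ($gt, $in, $regex, and so on).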

Working with Document Databases
Using MongoDB (3 of 3)

Figure 14.12 Example of MongoDB Document Query

Working with Graph Databases Using
Neo4j
• Even though Neo4j is not yet as widely adopted as MongoDB, it has been one of the
fastest growing NoSQL databases

• Graph databases still work with concepts similar to entities and relationships
− The focus is on the relationships

• Graph databases are used in environments with complex relationships among entities
− Graph databases are heavily reliant on interdependence among their data

• Neo4j provides several interface options
− It was originally designed with Java programming in mind and optimized for interaction through a Java API
Creating Nodes in Neo4j

• Nodes in a graph database correspond to entity instances in a relational database

• In Neo4j, a label is the closest thing to the concept of a table from the relational
model
− A label is a tag that is used to associate a collection of nodes as being of the
same type or belonging to the same group

• Cypher is the interactive, declarative query language in Neo4j

• Nodes and relationships are created using a CREATE command
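A minimal Cypher sketch of the CREATE command, using invented labels and properties (run against a Neo4j instance, not standalone):

```cypher
// Create two labeled nodes with properties, then a relationship between them.
// (c:Customer ...) binds the variable c to a new node with label Customer.
CREATE (c:Customer {name: 'Dunne', city: 'Nashville'})
CREATE (p:Product {sku: 'P-1001', description: 'Claw hammer'})
CREATE (c)-[:PURCHASED {qty: 2}]->(p)
```

Note that the relationship itself carries properties (qty here), just as nodes do, and its direction is part of the data.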

Retrieving Node Data with MATCH and
WHERE
• Refer to the text for examples of the following:
− Retrieving node data with MATCH and WHERE
− Retrieving relationship data with MATCH and WHERE
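The general shape of such a traversal in Cypher, with invented labels and properties, looks like this (run against a Neo4j instance, not standalone):

```cypher
// Traversal: find what customers in Nashville purchased.
// MATCH describes the graph pattern; WHERE restricts it; RETURN projects.
MATCH (c:Customer)-[r:PURCHASED]->(p:Product)
WHERE = 'Nashville'
RETURN, p.description, r.qty
```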

Retrieving Relationship Data with
MATCH and WHERE

Figure 14.13 Neo4j Query Using MATCH/WHERE/RETURN

Knowledge Check Activity 14-3

• Explain what it means for a database to be aggregate aware.

Knowledge Check Activity 14-3:
Answer
• Explain what it means for a database to be aggregate aware.
• Answer: Aggregate aware means that the designer of the database has to be aware of the way the data in the database will be used, and then design the database around whichever component is central to that usage. Instead of decomposing the data structures to eliminate redundancy, an aggregate aware database collects, or aggregates, all of the data around a central component to minimize the structures required during processing.

Summary (1 of 2)

Now that the lesson has ended, you should be able to:

1. Explain the role of Big Data in modern business

2. Describe the primary characteristics of Big Data and how these go beyond the
traditional “3 Vs”

3. Explain how the core components of the Hadoop framework operate

4. Identify the major components of the Hadoop ecosystem

5. Summarize the four major approaches of the NoSQL data model and how they
differ from the relational model

Summary (2 of 2)

Now that the lesson has ended, you should be able to (continued):

6. Describe the characteristics of NewSQL databases

7. Understand how to work with document databases using MongoDB

8. Understand how to work with graph databases using Neo4j

