Computer Science & Engineering
CHANDIGARH UNIVERSITY, MOHALI
BIG Data Analytics
21CSH-471
BY : Urvashi
Assistant Professor (Chandigarh
University)
Contents to be covered in UNIT
2
UNIT-2 Big Data Technologies (Contact Hours: 15)
Chapter 1: Big Data Frameworks
Big Data Frameworks: Hadoop, Apache Spark, and their Comparison; NoSQL databases: MongoDB, Cassandra, and HBase; Big Data Visualization Tools: Tableau, Power BI, and Zeppelin; Real-Time Big Data Processing: Apache Storm and Flink; Emerging trends in Big Data Technologies.
Chapter 2: Big SQL and NoSQL Databases
Overview of SQL vs. NoSQL: Differences and Use Cases; Introduction to Big SQL: Big SQL Features (Scalability, support for structured and unstructured data), Query optimization Techniques in Big SQL; NoSQL Database Types: Key-Value stores (Redis, DynamoDB), Document stores (MongoDB, CouchDB), Column-family stores (Cassandra, HBase), Graph Databases (Neo4j); Advantages and limitations of Big SQL and NoSQL.
Chapter 3: AI in Big Data
Introduction to IBM Watson: Overview and capabilities of Watson AI, Watson’s role in Big data and decision-making; Key Watson Services: Watson Discovery, Watson Studio, and Watson Assistant, Integration of Watson with Big Data tools; AI and Machine Learning Applications in Big Data: Natural Language Processing (NLP), Sentiment Analysis and Predictive Analytics.
Course Outcomes
CO1 Understand the Fundamentals of Big Data.
CO2 Master Big Data Architecture and Tools
CO3 Explore the Hadoop Ecosystem and Data Processing Models
CO4 Develop Data Science Skills and Tools
CO5 Implement Real-Time Data Analytics and Visualization
HBase Data Model and Versioning
• Data organization concepts
• Namespaces
• Tables
• Column families
• Column qualifiers
• Columns
• Rows
• Data cells
• Data is self-describing
HBase Data Model and Versioning (cont'd.)
• HBase stores multiple versions of data items
• Timestamp associated with each version
• Each row in a table has a unique row key
• Table associated with one or more
column families
• Column qualifiers can be dynamically specified
as new table rows are created and inserted
• Namespace is collection of tables
• Cell holds a basic data item
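The versioning idea above can be sketched in a few lines of Python. This is a toy in-memory model, not the HBase API: the class name `VersionedTable`, the integer timestamps, and the value 'Manager' are assumptions made for illustration only. Each cell is addressed by row key, column family, and column qualifier, and keeps one value per timestamp.

```python
class VersionedTable:
    """Toy model of an HBase-style table with versioned cells (illustrative only)."""

    def __init__(self, column_families):
        # Column families are fixed when the table is created.
        self.column_families = set(column_families)
        # (row key, family, qualifier) -> {timestamp: value}
        self.cells = {}

    def put(self, row_key, column, value, timestamp):
        # Qualifiers are not declared in advance; only the family is checked.
        family, qualifier = column.split(":", 1)
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.cells.setdefault((row_key, family, qualifier), {})[timestamp] = value

    def get(self, row_key, column, max_versions=1):
        # Return up to max_versions (timestamp, value) pairs, newest first.
        family, qualifier = column.split(":", 1)
        versions = self.cells.get((row_key, family, qualifier), {})
        return sorted(versions.items(), reverse=True)[:max_versions]


table = VersionedTable(["Name", "Address", "Details"])
table.put("row1", "Details:Job", "Engineer", timestamp=1)
table.put("row1", "Details:Job", "Manager", timestamp=2)  # hypothetical newer version

latest = table.get("row1", "Details:Job")
history = table.get("row1", "Details:Job", max_versions=2)
```

Here `latest` holds only the newest version of the cell, while `history` returns both versions ordered by timestamp.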
(a) Creating a table:
create 'EMPLOYEE', 'Name', 'Address', 'Details'
(b) Inserting some row data in the EMPLOYEE table:
put 'EMPLOYEE', 'row1', 'Name:Fname', 'John'
put 'EMPLOYEE', 'row1', 'Name:Lname', 'Smith'
put 'EMPLOYEE', 'row1', 'Name:Nickname', 'Johnny'
put 'EMPLOYEE', 'row1', 'Details:Job', 'Engineer'
put 'EMPLOYEE', 'row1', 'Details:Review', 'Good'
put 'EMPLOYEE', 'row2', 'Name:Fname', 'Alicia'
put 'EMPLOYEE', 'row2', 'Name:Lname', 'Zelaya'
put 'EMPLOYEE', 'row2', 'Name:Mname', 'Jennifer'
put 'EMPLOYEE', 'row2', 'Details:Job', 'DBA'
put 'EMPLOYEE', 'row2', 'Details:Supervisor', 'James Borg'
put 'EMPLOYEE', 'row3', 'Name:Fname', 'James'
put 'EMPLOYEE', 'row3', 'Name:Minit', 'E'
put 'EMPLOYEE', 'row3', 'Name:Lname', 'Borg'
put 'EMPLOYEE', 'row3', 'Name:Suffix', 'Jr.'
put 'EMPLOYEE', 'row3', 'Details:Job', 'CEO'
put 'EMPLOYEE', 'row3', 'Details:Salary', '1,000,000'
(c) Some HBase basic CRUD operations:
Creating a table: create <tablename>, <column family>, <column family>, ...
Inserting Data: put <tablename>, <rowid>, <column family>:<column qualifier>, <value>
Reading Data (all data in a table): scan <tablename>
Retrieve Data (one item): get <tablename>, <rowid>
Figure 24.3 Examples in HBase. (a) Creating a table called EMPLOYEE with three
column families: Name, Address, and Details. (b) Inserting some row data in the EMPLOYEE
table; different rows can have different self-describing column qualifiers (Fname,
Lname, Nickname, Mname, Minit, Suffix, ... for column family Name; Job, Review, ... for column family Details).
HBase CRUD Operations
• Provides only low-level CRUD (create, read,
update, delete) operations
• Application programs implement more
complex operations
• Create
• Creates a new table and specifies one or more
column families associated with the table
• Put
• Inserts new data or new versions of existing data
items
• Get
• Retrieves data associated with a single row
• Scan
• Retrieves all the rows
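The four operations listed above can be mimicked with a small Python class. This is a sketch under stated assumptions: `ToyHBase` and its method signatures are hypothetical and only mirror the shell commands create, put, get, and scan; it keeps a single value per cell and is not a real HBase client.

```python
class ToyHBase:
    """Illustrative in-memory stand-in for HBase's low-level operations."""

    def __init__(self):
        # table name -> {"families": set of family names, "rows": {row key: {column: value}}}
        self.tables = {}

    def create(self, table, *families):
        # A table is created together with one or more column families.
        if not families:
            raise ValueError("a table needs at least one column family")
        self.tables[table] = {"families": set(families), "rows": {}}

    def put(self, table, row_key, column, value):
        # Inserts new data, or overwrites the existing value of the same cell.
        family = column.split(":", 1)[0]
        t = self.tables[table]
        if family not in t["families"]:
            raise KeyError(f"unknown column family: {family}")
        t["rows"].setdefault(row_key, {})[column] = value

    def get(self, table, row_key):
        # Retrieves the data associated with a single row.
        return dict(self.tables[table]["rows"].get(row_key, {}))

    def scan(self, table):
        # Retrieves all rows, in lexicographic row-key order.
        rows = self.tables[table]["rows"]
        return [(key, dict(rows[key])) for key in sorted(rows)]


db = ToyHBase()
db.create("EMPLOYEE", "Name", "Address", "Details")
db.put("EMPLOYEE", "row2", "Name:Fname", "Alicia")
db.put("EMPLOYEE", "row1", "Name:Fname", "John")
db.put("EMPLOYEE", "row1", "Details:Job", "Engineer")

row1 = db.get("EMPLOYEE", "row1")
all_rows = db.scan("EMPLOYEE")
```

Anything beyond these calls, such as joins or aggregates, would have to be written by the application on top of `get` and `scan`, which is the point made in the bullets above.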
HBase Storage and Distributed System
Concepts
• Each HBase table divided into several regions
• Each region holds a range of the row keys in the
table
• Row keys must be lexicographically ordered
• Each region has several stores
• Column families are assigned to stores
• Regions assigned to region servers for storage
• Master server responsible for
monitoring the region servers
• HBase uses Apache ZooKeeper and HDFS
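Because each region holds a contiguous range of lexicographically ordered row keys, finding the region for a row key amounts to a binary search over the region start keys. The sketch below assumes hypothetical split points and server names; it illustrates the lookup idea only and is not how an HBase client is actually called.

```python
from bisect import bisect_right

# Hypothetical layout: region i holds row keys in [start_keys[i], start_keys[i + 1]).
# The first region starts at the empty string, which sorts before every key.
region_start_keys = ["", "row2", "row5"]
region_servers = ["server-a", "server-b", "server-a"]  # assumed assignment


def find_region(row_key):
    """Return (region index, region server) for a row key."""
    index = bisect_right(region_start_keys, row_key) - 1
    return index, region_servers[index]


first = find_region("row1")
second = find_region("row2")
# Keys compare as strings, so "row10" sorts before "row2".
tenth = find_region("row10")
last = find_region("row9")
```

The `row10` case shows why row keys are often zero-padded in practice: string ordering, not numeric ordering, decides which region a row lands in.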
NOSQL Graph Databases
Neo4j
Graph databases
• Data represented as a graph
• Collection of vertices (nodes) and edges
• Possible to store data associated with
both individual nodes and individual edges
Neo4j
• Open source system
• Uses concepts of nodes and relationships
Neo4j (cont'd.)
• Path
• Traversal of part of the graph
• Typically used as part of a query to specify a pattern
• Schema optional in Neo4j
• Indexing and node identifiers
• Users can create indexes for the collection of
nodes that have a particular label
• One or more properties can be indexed
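The notions of labeled nodes, relationships with properties, a label-based index, and a one-step path can be sketched in plain Python. Everything here is an assumption for illustration: `ToyGraph`, its methods, and the sample values are hypothetical and unrelated to the Neo4j drivers.

```python
from collections import defaultdict


class ToyGraph:
    """Minimal property graph: nodes and relationships both carry properties."""

    def __init__(self):
        self.nodes = {}    # node id -> {"labels": set, "props": dict}
        self.edges = []    # (start id, relationship type, end id, props)
        self.indexes = {}  # (label, property) -> {value: set of node ids}
        self._next_id = 0

    def add_node(self, labels, **props):
        node_id = self._next_id
        self._next_id += 1
        self.nodes[node_id] = {"labels": set(labels), "props": props}
        return node_id

    def add_relationship(self, start, rel_type, end, **props):
        self.edges.append((start, rel_type, end, props))

    def create_index(self, label, prop):
        # Index one property over all nodes with the given label.
        # Simplification: nodes added later are not indexed automatically.
        index = defaultdict(set)
        for node_id, node in self.nodes.items():
            if label in node["labels"] and prop in node["props"]:
                index[node["props"][prop]].add(node_id)
        self.indexes[(label, prop)] = index

    def find(self, label, prop, value):
        return self.indexes[(label, prop)].get(value, set())

    def neighbors(self, start, rel_type):
        # One-step traversal along outgoing relationships of a given type.
        return [(end, props) for s, t, end, props in self.edges
                if s == start and t == rel_type]


g = ToyGraph()
emp = g.add_node(["EMPLOYEE"], Empid="2", Ename="Alicia")  # sample values
proj = g.add_node(["PROJECT"], Pno=2, Pname="ProductY")    # sample values
g.add_relationship(emp, "WorksOn", proj, Hours=10)
g.create_index("EMPLOYEE", "Empid")

# Path pattern: (EMPLOYEE with Empid '2') -[WorksOn]-> (project)
path_results = [(g.nodes[end]["props"]["Pname"], props["Hours"])
                for start in g.find("EMPLOYEE", "Empid", "2")
                for end, props in g.neighbors(start, "WorksOn")]
```

The index lookup picks the starting nodes, and the traversal then follows relationships from them, which is the usual shape of a path pattern in a graph query.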
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
The Cypher Query Language of Neo4j:
• Cypher query made up of clauses
• Result from one clause can be the input to
the next clause in the query
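The way one clause feeds the next can be imitated with chained Python functions. This is only an analogy under stated assumptions: the function names, the sample (employee, project) pairs, and the threshold of 2 are hypothetical, and the chain loosely mirrors a MATCH, WITH ... COUNT, WHERE, ORDER BY sequence rather than reproducing Cypher semantics.

```python
from collections import Counter

# Hypothetical WorksOn relationships as (employee name, project name) pairs.
works_on = [
    ("Smith", "ProductX"), ("Smith", "ProductY"), ("Smith", "ProductZ"),
    ("Wong", "ProductY"),
    ("Zelaya", "ProductX"), ("Zelaya", "ProductY"),
    ("Zelaya", "ProductZ"), ("Zelaya", "Newbenefits"),
]


def match(pairs):
    # MATCH: produce one row per matched (employee, project) pattern.
    return list(pairs)


def with_count(rows):
    # WITH e, COUNT(p) AS numOfprojs: group by employee and count projects.
    return list(Counter(employee for employee, _ in rows).items())


def where(rows, minimum):
    # WHERE numOfprojs > minimum: filter the grouped rows.
    return [(employee, n) for employee, n in rows if n > minimum]


def order_by_count(rows):
    # ORDER BY numOfprojs: sort ascending by the count.
    return sorted(rows, key=lambda row: row[1])


# Each stage consumes the previous stage's result, as Cypher clauses do.
result = order_by_count(where(with_count(match(works_on)), minimum=2))
```

With this sample data, `result` lists the employees who work on more than two projects, ordered by their project count.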
The Cypher Query Language of Neo4j (cont'd.)
Figure (cont'd.) Examples in Neo4j using the Cypher language. (d) Examples of simple Cypher queries:
1. MATCH (d:DEPARTMENT {Dno: '5'})-[:LocatedIn]->(loc)
   RETURN d.Dname, loc.Lname
2. MATCH (e:EMPLOYEE {Empid: '2'})-[w:WorksOn]->(p)
   RETURN e.Ename, w.Hours, p.Pname
3. MATCH (e)-[w:WorksOn]->(p:PROJECT {Pno: 2})
   RETURN p.Pname, e.Ename, w.Hours
4. MATCH (e)-[w:WorksOn]->(p)
   RETURN e.Ename, w.Hours, p.Pname
   ORDER BY e.Ename
5. MATCH (e)-[w:WorksOn]->(p)
   RETURN e.Ename, w.Hours, p.Pname
   ORDER BY e.Ename
   LIMIT 10
6. MATCH (e)-[w:WorksOn]->(p)
   WITH e, COUNT(p) AS numOfprojs
   WHERE numOfprojs > 2
   RETURN e.Ename, numOfprojs
   ORDER BY numOfprojs
7. MATCH (e)-[w:WorksOn]->(p)
   RETURN e, w, p
   ORDER BY e.Ename
   LIMIT 10
8. MATCH (e:EMPLOYEE ...
Neo4j Interfaces and Distributed System
Characteristics
• Enterprise edition versus community edition
• Enterprise edition supports caching,
clustering of data, and locking
• Graph visualization interface
• Subset of nodes and edges in a database
graph can be displayed as a graph
• Used to visualize query results
• Master-slave replication
• Caching
• Logical logs
Summary
• NOSQL systems focus on storage of “big data”
• General categories
• Document-based
• Key-value stores
• Column-based
• Graph-based
• Some systems use techniques spanning two
or more categories
• Consistency paradigms
Reference Books
TEXT BOOKS
1. Mohammed Guller, “Big Data Analytics with Spark”, Apress, 2015.
2. Tom Mitchell, “Machine Learning”, McGraw Hill, 3rd Edition, 1997.
3. Michael Minelli, Michele Chambers, Ambiga Dhiraj, “Big Data, Big Analytics:
Emerging Business Intelligence and Analytic Trends for Today’s
Business”, 1st Edition, Wiley CIO Series, 2013.
4. Arvind Sathi, “Big Data Analytics: Disruptive Technologies for
Changing the Game”, 1st Edition, IBM Corporation, 2012.
REFERENCE BOOKS
5. Chris Eaton, Dirk deRoos et al., “Understanding Big Data”, McGraw
Hill, 2012.
6. Vignesh Prajapati, “Big Data Analytics with R and Hadoop”, Packt
Publishing, 2013.
7. Jay Liebowitz, “Big Data and Business Analytics”, CRC Press, 2013.
For more insight
Web sources
1. https://www.alliant.edu/blog/4-top-online-resources-data-analytics?utm_source=chatgpt.com
2. https://www.coursera.org/articles/big-data-technologies?utm_source=chatgpt.com
3. https://careerfoundry.com/en/blog/data-analytics/where-to-find-free-datasets/?utm_source=chatgpt.com
THANK YOU
For queries
Email: [email protected]