
Computer Science & Engineering
CHANDIGARH UNIVERSITY, MOHALI

Big Data Analytics
21CSH-471

By: Ajay Pal Singh
Assistant Professor (Chandigarh University)
Contents to be covered in UNIT 2
UNIT-2: Big Data Technologies (Contact Hours: 15)

Chapter 1 (Big Data Frameworks): Hadoop, Apache Spark, and their comparison; NoSQL databases: MongoDB, Cassandra, and HBase; Big Data visualization tools: Tableau, Power BI, and Zeppelin; real-time Big Data processing: Apache Storm and Flink; emerging trends in Big Data technologies.

Chapter 2 (Big SQL and NoSQL Databases): Overview of SQL vs. NoSQL: differences and use cases; introduction to Big SQL: Big SQL features – scalability, support for structured and unstructured data, query optimization techniques in Big SQL; NoSQL database types: key-value stores (Redis, DynamoDB), document stores (MongoDB, CouchDB), column-family stores (Cassandra, HBase), graph databases (Neo4j); advantages and limitations of Big SQL and NoSQL.

Chapter 3 (AI in Big Data): Introduction to IBM Watson: overview and capabilities of Watson AI, Watson's role in Big Data and decision-making; key Watson services: Watson Discovery, Watson Studio, and Watson Assistant; integration of Watson with Big Data tools; AI and machine learning applications in Big Data: Natural Language Processing (NLP), sentiment analysis, and predictive analytics.
Course Outcomes
CO1 Understand the Fundamentals of Big Data.

CO2 Master Big Data Architecture and Tools.

CO3 Explore the Hadoop Ecosystem and Data Processing Models.

CO4 Develop Data Science Skills and Tools.

CO5 Implement Real-Time Data Analytics and Visualization.

Chapter 1: Big Data Frameworks

Big Data Frameworks: Hadoop, Apache Spark, and their Comparison
Overview of Hadoop
Hadoop is an open-source framework developed by the Apache Software
Foundation. It enables distributed storage and processing of large datasets
across clusters of computers using simple programming models. It is
particularly suited for batch processing.
Core Components of Hadoop
1. HDFS (Hadoop Distributed File System): A scalable and fault-tolerant file storage system.
2. MapReduce: A programming model for processing large datasets.
3. YARN (Yet Another Resource Negotiator): Manages resources and schedules tasks.
4. Hadoop Common: Utilities supporting other Hadoop modules.
Hadoop Distributed File System (HDFS)
HDFS is designed for high throughput access to data. It splits files into blocks
and distributes them across multiple nodes in the cluster, providing
redundancy and fault tolerance.
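As a quick illustration of the mechanics, here is a small Python sketch (ours, not part of HDFS) that computes how a file would be split and replicated, assuming the common defaults of 128 MB blocks and a replication factor of 3:

```python
import math

def hdfs_block_layout(file_size_mb, block_size_mb=128, replication=3):
    # 128 MB blocks and 3x replication are common defaults; the real values
    # come from the cluster's hdfs-site.xml, so treat these as assumptions.
    num_blocks = math.ceil(file_size_mb / block_size_mb)  # last block may be partial
    raw_storage_mb = file_size_mb * replication           # each block lives on 3 nodes
    return num_blocks, raw_storage_mb

# A 500 MB file -> 4 blocks (three full 128 MB blocks plus one 116 MB block),
# occupying 1500 MB of raw cluster storage.
print(hdfs_block_layout(500))  # (4, 1500)
```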
MapReduce
MapReduce is the data processing engine of Hadoop. It uses two steps: Map,
which processes input data into key-value pairs, and Reduce, which
aggregates results. This model simplifies large-scale data processing.
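As a concrete sketch, the classic word count can be written as two small Python scripts and run with Hadoop Streaming; the file names mapper.py and reducer.py are our own choice, and the exact path to the streaming jar varies by installation.

```python
#!/usr/bin/env python3
# mapper.py -- the Map step: turn each input line into (word, 1) pairs.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- the Reduce step: Hadoop Streaming delivers mapper output
# sorted by key, so equal words arrive consecutively and can be summed.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

A typical invocation would be hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /in -output /out, with the jar path and HDFS paths adjusted to the cluster.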
YARN
YARN separates resource management from data processing, making Hadoop
more flexible and efficient. It enables multiple data processing engines to run
simultaneously on a single cluster.

Some common frameworks of Hadoop
• Hive – Uses HiveQL, a SQL-like language, for structuring data and for writing complicated MapReduce jobs over HDFS (see the query sketch after this list).
• Drill – Supports user-defined functions and is used for data exploration.
• Storm – Allows real-time processing and streaming of data.
• Spark – Contains a machine learning library (MLlib) for enhanced machine learning and is widely used for data processing. It also supports Java, Python, and Scala.
• Pig – Provides Pig Latin, a SQL-like language, and performs data transformation on unstructured data.
• Tez – Reduces the complexity of Hive and Pig and helps their jobs run faster.
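To keep every example in this deck in Python, here is a sketch of a Hive-style aggregation run through Spark's Hive support rather than the Hive CLI; the web_logs table is hypothetical:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark read Hive tables and run HiveQL-style SQL.
spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical table: the ten most-requested URLs from a web_logs table.
top_pages = spark.sql("""
    SELECT url, COUNT(*) AS hits
    FROM web_logs
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()
```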
Advantages of Hadoop
Hadoop offers scalability, cost efficiency, flexibility in handling various
data types, and resilience to hardware failures. It is a fundamental tool
for managing big data.
• Ability to store a large amount of data.
• High flexibility.
• Cost-effective.
• High computational power.
• Tasks are independent.
• Linear scaling.

Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. Its programming model is easy.
4. It offers huge, flexible storage.
5. It is low cost.
Disadvantages:
• Not very effective for small data.
• Hard cluster management.
• Has stability issues.
• Security concerns.
• Complexity: Hadoop can be complex to set up and maintain,
especially for organizations without a dedicated team of experts.
• Latency: Hadoop is not well-suited for low-latency workloads and may
not be the best choice for real-time data processing.
• Limited Support for Real-time Processing: Hadoop’s batch-oriented
nature makes it less suited for real-time streaming or interactive data
processing use cases.
• Limited Support for Structured Data: Hadoop is designed to work
with unstructured and semi-structured data; it is not well-suited for
structured data processing.
• Data Security: Hadoop does not provide built-in security features
such as data encryption or user authentication, which can make it
difficult to secure sensitive data.
• Limited Support for Ad-hoc Queries: Hadoop’s MapReduce programming
model is not well-suited for ad-hoc queries.
• Limited Support for Graph and Machine Learning: Hadoop’s core
component HDFS and MapReduce are not well-suited for graph and
machine learning workloads.
• Cost: Hadoop can be expensive to set up and maintain, especially for
organizations with large amounts of data.
• Data Loss: In the event of a hardware failure, the data stored in a single
node may be lost permanently.
• Data Governance: Data governance is a critical aspect of data
management, yet Hadoop does not provide built-in features for managing
data lineage, data quality, data cataloging, and data auditing.

Introduction to Apache Spark
Apache Spark is a unified analytics engine designed for large-scale data
processing. It supports batch processing, real-time data streaming,
machine learning, and graph processing, making it highly versatile.
According to Databricks' definition, "Apache Spark is a lightning-fast
unified analytics engine for big data and machine learning. It was
originally developed at UC Berkeley in 2009."
History of Spark:
Spark started in 2009 in a UC Berkeley R&D lab, now known as AMPLab.
In 2010, Spark became open source under a BSD license, and in June 2013
the project was transferred to the ASF (Apache Software Foundation).

Key Features of Spark
1. Speed: In-memory processing accelerates data tasks (see the caching sketch after this list).
2. Ease of Use: APIs in Java, Python, Scala, and R.
3. Versatility: Support for various workloads like streaming and machine learning.
4. Fault Tolerance: Automatic recovery mechanisms.
5. Graph Processing: Spark is also useful for graph processing; dedicated tools such as Neo4j or Apache Giraph were previously used for this.
6. Real-Time and Batch: Spark can process data in both real-time and batch mode.
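A minimal PySpark sketch of the speed feature: cache() keeps a dataset in cluster memory after the first action, so later queries skip re-reading from disk. The input path and the status column are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

events = spark.read.parquet("hdfs:///data/events")  # hypothetical path

events.cache()       # mark the dataset for in-memory storage
events.count()       # first action: reads from disk and fills the cache
events.filter(events.status == 500).count()  # second query is served from memory
```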

Components of Apache Spark

1. Spark Core: All the functionality provided by Apache Spark is built
on top of Spark Core. It delivers speed by providing in-memory
computation capability.
• It contains the basic functionality of Spark (task scheduling, memory
management, fault recovery, interacting with storage systems).
• It is home to the API that defines RDDs.
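A minimal RDD sketch against Spark Core in PySpark; the data is generated in-process purely to show the map/reduce flow over partitions:

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

# An RDD: an immutable collection split into 8 partitions and processed in parallel.
nums = sc.parallelize(range(1, 1001), 8)

# Transformations (map) are lazy; the action (reduce) triggers the computation.
total = nums.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)  # 333833500, the sum of squares 1..1000
```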
2. Spark SQL (structured data): The Spark SQL component is built on top
of Spark Core and provides structured processing of data. It offers
standard access to a range of data sources.
• It is a Spark package for working with structured data.
• It supports many sources of data, including Hive tables, Parquet, and JSON.
• It allows developers to intermix SQL with programmatic data
manipulation supported by RDDs in Python, Scala, and Java.
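A short PySpark sketch of this intermixing of SQL and programmatic manipulation; the input path and the age and city fields are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

people = spark.read.json("hdfs:///data/people.json")  # hypothetical JSON source

# Programmatic DataFrame manipulation...
adults = people.filter(F.col("age") >= 18)

# ...intermixed with plain SQL over the same data.
adults.createOrReplaceTempView("adults")
spark.sql("SELECT city, COUNT(*) AS n FROM adults GROUP BY city").show()
```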
3. Spark Streaming: Spark Streaming enables scalable, high-throughput,
fault-tolerant stream processing of live data streams. Spark can access
data from sources like Flume or a TCP socket.
The functionality of this module is:
• Enables processing of live streams of data, like log files generated
by production web servers.
• The APIs defined in this module are quite similar to the Spark Core
RDD APIs.
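A minimal streaming word-count sketch in PySpark using the DStream API described here; the host and port are placeholders, and a real deployment might read from Flume or Kafka instead:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-demo")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Live text stream from a TCP socket (placeholder host/port).
lines = ssc.socketTextStream("localhost", 9999)

# Note how closely this mirrors the RDD API: a classic streaming word count.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```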
4. MLlib (machine learning): MLlib in Spark is a scalable machine
learning library that contains various machine learning algorithms
(a training sketch follows after item 5).
5. GraphX (graph processing): GraphX is an API for graphs and
graph-parallel computation, used for network analytics over stored data.
Clustering, classification, traversal, searching, and pathfinding are
also possible on graphs.
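A minimal MLlib training sketch using the DataFrame-based pyspark.ml API (the modern entry point to MLlib); the tiny inline dataset and column names are invented for illustration:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Invented toy data: two numeric features and a binary label.
df = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.5, 0.5, 1.0), (0.1, 0.9, 0.0)],
    ["f1", "f2", "label"],
)

# Spark ML models expect the features packed into a single vector column.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = assembler.transform(df)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
model.transform(train).select("label", "prediction").show()
```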
Uses of Apache Spark: The main applications of the Spark framework are:
• The data generated by different systems is often not consistent enough
to combine for analysis. To obtain consistent data from such systems we
can use processes like extract, transform, and load (ETL), and Spark
reduces their time and cost because these steps are implemented very
efficiently in it (see the ETL sketch after this list).
• It is tough to handle continuously generated data such as log files.
Spark is capable of working well with streams of data and reusing
operations.
• Because Spark can keep data in memory and run repeated queries
quickly, it makes it easy to try out the machine learning algorithms
best suited to a particular kind of data.
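A minimal ETL sketch in PySpark along these lines; the paths, the CSV format, and the column names are all assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-demo").getOrCreate()

# Extract: read raw, possibly inconsistent data (hypothetical path).
raw = spark.read.csv("hdfs:///landing/orders.csv", header=True, inferSchema=True)

# Transform: normalise the fields that downstream analysis depends on.
clean = (raw.dropna(subset=["order_id"])
            .withColumn("amount", F.col("amount").cast("double"))
            .withColumn("order_date", F.to_date("order_date")))

# Load: write the cleaned data to the warehouse area.
clean.write.mode("overwrite").parquet("hdfs:///warehouse/orders")
```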
Hadoop vs. Apache Spark

Hadoop and Spark Integration
Hadoop and Spark can be used together. Spark can run on top of HDFS
for storage, combining Hadoop's reliability with Spark’s processing
speed for hybrid solutions.
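A minimal sketch of the hybrid setup: Spark doing the processing while HDFS does the storage. The NameNode address and file path are placeholders for a real cluster:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-integration").getOrCreate()

# Spark reads directly from HDFS, so Hadoop's replicated storage backs
# Spark's fast in-memory processing.
logs = spark.read.text("hdfs://namenode:9000/logs/access.log")
print(logs.count())  # number of log lines stored in HDFS
```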

Use Cases of Hadoop


• Log file analysis
• Data archiving
• Fraud detection in banking
• Large-scale ETL processes

Tools Built on Hadoop
1. Hive: SQL-like querying for big data.
2. Pig: High-level scripting language.
3. HBase: Non-relational distributed database.

Ecosystem Around Spark

1. Kafka: Real-time message processing.
2. Delta Lake: Structured storage.
3. Airflow: Workflow orchestration.

Reference Books
TEXT BOOKS

1. Mohammed Guller, "Big Data Analytics with Spark", Apress, 2015.
2. Tom Mitchell, "Machine Learning", McGraw Hill, 3rd Edition, 1997.
3. Michael Minelli, Michele Chambers, Ambiga Dhiraj, "Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses", 1st Edition, Wiley CIO Series, 2013.
4. Arvind Sathi, "Big Data Analytics: Disruptive Technologies for Changing the Game", 1st Edition, IBM Corporation, 2012.

REFERENCE BOOKS
5. Chris Eaton, Dirk deRoos et al., "Understanding Big Data", McGraw Hill, 2012.
6. Vignesh Prajapati, "Big Data Analytics with R and Hadoop", Packt Publishing, 2013.
7. Jay Liebowitz, "Big Data and Business Analytics", CRC Press, 2013.
For more insight
Web sources:
1. https://www.alliant.edu/blog/4-top-online-resources-data-analytics
2. https://www.coursera.org/articles/big-data-technologies
3. https://careerfoundry.com/en/blog/data-analytics/where-to-find-free-datasets/
THANK YOU

For queries
Email: [email protected]
