
Computer Science & Engineering
CHANDIGARH UNIVERSITY, MOHALI

Big Data Analytics
21CSH-471

By: Ajay Pal Singh
Assistant Professor (Chandigarh University)
Contents to be covered in UNIT 2
UNIT-2: Big Data Technologies (Contact Hours: 15)

Chapter 1 (Big Data Frameworks): Hadoop, Apache Spark, and their comparison; NoSQL databases: MongoDB, Cassandra, and HBase; Big Data visualization tools: Tableau, Power BI, and Zeppelin; real-time Big Data processing: Apache Storm and Flink; emerging trends in Big Data technologies.

Chapter 2 (Big SQL and NoSQL Databases): Overview of SQL vs. NoSQL: differences and use cases; introduction to Big SQL: Big SQL features – scalability, support for structured and unstructured data, query optimization techniques in Big SQL; NoSQL database types: key-value stores (Redis, DynamoDB), document stores (MongoDB, CouchDB), column-family stores (Cassandra, HBase), graph databases (Neo4j); advantages and limitations of Big SQL and NoSQL.

Chapter 3 (AI in Big Data): Introduction to IBM Watson: overview and capabilities of Watson AI, Watson's role in Big Data and decision-making; key Watson services: Watson Discovery, Watson Studio, and Watson Assistant; integration of Watson with Big Data tools; AI and machine learning applications in Big Data: Natural Language Processing (NLP), sentiment analysis, and predictive analytics.
Course Outcomes
CO1 Understand the Fundamentals of Big Data.

CO2 Master Big Data Architecture and Tools.

CO3 Explore the Hadoop Ecosystem and Data Processing Models.

CO4 Develop Data Science Skills and Tools.

CO5 Implement Real-Time Data Analytics and Visualization.

Chapter 1: Big Data Frameworks

Big Data Frameworks: Hadoop, Apache Spark, and their Comparison
Overview of Hadoop
Hadoop is an open-source framework developed by the Apache Software
Foundation. It enables distributed storage and processing of large datasets
across clusters of computers using simple programming models. It is
particularly suited for batch processing.
Core Components of Hadoop
1. HDFS (Hadoop Distributed File System): A scalable and fault-tolerant file storage system.
2. MapReduce: A programming model for processing large datasets.
3. YARN (Yet Another Resource Negotiator): Manages resources and schedules tasks.
4. Hadoop Common: Utilities supporting other Hadoop modules.
Hadoop Distributed File System (HDFS)
HDFS is designed for high throughput access to data. It splits files into blocks
and distributes them across multiple nodes in the cluster, providing
redundancy and fault tolerance.
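As a quick illustration of the mechanics, here is a small Python sketch (ours, not part of HDFS) that computes how a file would be split and replicated, assuming the common defaults of 128 MB blocks and a replication factor of 3:

```python
import math

def hdfs_block_layout(file_size_mb, block_size_mb=128, replication=3):
    # 128 MB blocks and 3x replication are common defaults; the real values
    # come from the cluster's hdfs-site.xml, so treat these as assumptions.
    num_blocks = math.ceil(file_size_mb / block_size_mb)  # last block may be partial
    raw_storage_mb = file_size_mb * replication           # each block lives on 3 nodes
    return num_blocks, raw_storage_mb

# A 500 MB file -> 4 blocks (three full 128 MB blocks plus one 116 MB block),
# occupying 1500 MB of raw cluster storage.
print(hdfs_block_layout(500))  # (4, 1500)
```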
MapReduce
MapReduce is the data processing engine of Hadoop. It uses two steps: Map,
which processes input data into key-value pairs, and Reduce, which
aggregates results. This model simplifies large-scale data processing.
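As a concrete sketch, the classic word count can be written as two small Python scripts and run with Hadoop Streaming; the file names mapper.py and reducer.py are our own choice, and the exact path to the streaming jar varies by installation.

```python
#!/usr/bin/env python3
# mapper.py -- the Map step: turn each input line into (word, 1) pairs.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- the Reduce step: Hadoop Streaming delivers mapper output
# sorted by key, so equal words arrive consecutively and can be summed.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

A typical invocation would be hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /in -output /out, with the jar path and HDFS paths adjusted to the cluster.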
YARN
YARN separates resource management from data processing, making Hadoop
more flexible and efficient. It enables multiple data processing engines to run
simultaneously on a single cluster.

Some common frameworks of Hadoop
• Hive – Uses HiveQL, a SQL-like language, for structuring data and for writing complicated MapReduce jobs over HDFS (see the query sketch after this list).
• Drill – Supports user-defined functions and is used for data exploration.
• Storm – Allows real-time processing and streaming of data.
• Spark – Contains a machine learning library (MLlib) for enhanced machine learning and is widely used for data processing. It also supports Java, Python, and Scala.
• Pig – Provides Pig Latin, a SQL-like language, and performs data transformation on unstructured data.
• Tez – Reduces the complexity of Hive and Pig and helps their jobs run faster.
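To keep every example in this deck in Python, here is a sketch of a Hive-style aggregation run through Spark's Hive support rather than the Hive CLI; the web_logs table is hypothetical:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark read Hive tables and run HiveQL-style SQL.
spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical table: the ten most-requested URLs from a web_logs table.
top_pages = spark.sql("""
    SELECT url, COUNT(*) AS hits
    FROM web_logs
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()
```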
Advantages of Hadoop
Hadoop offers scalability, cost efficiency, flexibility in handling various
data types, and resilience to hardware failures. It is a fundamental tool
for managing big data.
• Ability to store a large amount of data.
• High flexibility.
• Cost-effective.
• High computational power.
• Tasks are independent.
• Linear scaling.

Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. Its programming model is easy.
4. It offers huge, flexible storage.
5. It is low cost.
Disadvantages:
• Not very effective for small data.
• Hard cluster management.
• Has stability issues.
• Security concerns.
• Complexity: Hadoop can be complex to set up and maintain,
especially for organizations without a dedicated team of experts.
• Latency: Hadoop is not well-suited for low-latency workloads and may
not be the best choice for real-time data processing.
• Limited Support for Real-time Processing: Hadoop’s batch-oriented
nature makes it less suited for real-time streaming or interactive data
processing use cases.
• Limited Support for Structured Data: Hadoop is designed to work
with unstructured and semi-structured data; it is not well-suited for
structured data processing.
• Data Security: Hadoop does not provide built-in security features
such as data encryption or user authentication, which can make it
difficult to secure sensitive data.
• Limited Support for Ad-hoc Queries: Hadoop’s MapReduce programming
model is not well-suited for ad-hoc queries.
• Limited Support for Graph and Machine Learning: Hadoop’s core
component HDFS and MapReduce are not well-suited for graph and
machine learning workloads.
• Cost: Hadoop can be expensive to set up and maintain, especially for
organizations with large amounts of data.
• Data Loss: In the event of a hardware failure, the data stored in a single
node may be lost permanently.
• Data Governance: Data governance is a critical aspect of data
management, yet Hadoop does not provide built-in features for managing
data lineage, data quality, data cataloging, and data auditing.

Introduction to Apache Spark
Apache Spark is a unified analytics engine designed for large-scale data
processing. It supports batch processing, real-time data streaming,
machine learning, and graph processing, making it highly versatile.
According to Databricks' definition, "Apache Spark is a lightning-fast
unified analytics engine for big data and machine learning. It was
originally developed at UC Berkeley in 2009."
History of Spark:
Spark started in 2009 in a UC Berkeley R&D lab, now known as AMPLab.
In 2010, Spark became open source under a BSD license, and in June 2013
the project was transferred to the ASF (Apache Software Foundation).

Key Features of Spark
1. Speed: In-memory processing accelerates data tasks (see the caching sketch after this list).
2. Ease of Use: APIs in Java, Python, Scala, and R.
3. Versatility: Support for various workloads like streaming and machine learning.
4. Fault Tolerance: Automatic recovery mechanisms.
5. Graph Processing: Spark is also useful for graph processing; dedicated tools such as Neo4j or Apache Giraph were previously used for this.
6. Real-Time and Batch: Spark can process data in both real-time and batch mode.
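A minimal PySpark sketch of the speed feature: cache() keeps a dataset in cluster memory after the first action, so later queries skip re-reading from disk. The input path and the status column are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

events = spark.read.parquet("hdfs:///data/events")  # hypothetical path

events.cache()       # mark the dataset for in-memory storage
events.count()       # first action: reads from disk and fills the cache
events.filter(events.status == 500).count()  # second query is served from memory
```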

Components of Apache Spark

1. Spark Core: All the functionality provided by Apache Spark is built
on top of Spark Core. It delivers speed by providing in-memory
computation capability.
• It contains the basic functionality of Spark (task scheduling, memory
management, fault recovery, interacting with storage systems).
• It is home to the API that defines RDDs.
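A minimal RDD sketch against Spark Core in PySpark; the data is generated in-process purely to show the map/reduce flow over partitions:

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

# An RDD: an immutable collection split into 8 partitions and processed in parallel.
nums = sc.parallelize(range(1, 1001), 8)

# Transformations (map) are lazy; the action (reduce) triggers the computation.
total = nums.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)  # 333833500, the sum of squares 1..1000
```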
2. Spark SQL (structured data): The Spark SQL component is built on top
of Spark Core and provides structured processing of data. It offers
standard access to a range of data sources.
• It is a Spark package for working with structured data.
• It supports many sources of data, including Hive tables, Parquet, and JSON.
• It allows developers to intermix SQL with programmatic data
manipulation supported by RDDs in Python, Scala, and Java.
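A short PySpark sketch of this intermixing of SQL and programmatic manipulation; the input path and the age and city fields are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

people = spark.read.json("hdfs:///data/people.json")  # hypothetical JSON source

# Programmatic DataFrame manipulation...
adults = people.filter(F.col("age") >= 18)

# ...intermixed with plain SQL over the same data.
adults.createOrReplaceTempView("adults")
spark.sql("SELECT city, COUNT(*) AS n FROM adults GROUP BY city").show()
```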
3. Spark Streaming: Spark Streaming enables scalable, high-throughput,
fault-tolerant stream processing of live data streams. Spark can access
data from sources like Flume or a TCP socket.
The functionality of this module is:
• Enables processing of live streams of data, like log files generated
by production web servers.
• The APIs defined in this module are quite similar to the Spark Core
RDD APIs.
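A minimal streaming word-count sketch in PySpark using the DStream API described here; the host and port are placeholders, and a real deployment might read from Flume or Kafka instead:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-demo")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Live text stream from a TCP socket (placeholder host/port).
lines = ssc.socketTextStream("localhost", 9999)

# Note how closely this mirrors the RDD API: a classic streaming word count.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```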
4. MLlib (machine learning): MLlib in Spark is a scalable machine
learning library that contains various machine learning algorithms
(a training sketch follows after item 5).
5. GraphX (graph processing): GraphX is an API for graphs and
graph-parallel computation, used for network analytics over stored data.
Clustering, classification, traversal, searching, and pathfinding are
also possible on graphs.
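A minimal MLlib training sketch using the DataFrame-based pyspark.ml API (the modern entry point to MLlib); the tiny inline dataset and column names are invented for illustration:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Invented toy data: two numeric features and a binary label.
df = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.5, 0.5, 1.0), (0.1, 0.9, 0.0)],
    ["f1", "f2", "label"],
)

# Spark ML models expect the features packed into a single vector column.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = assembler.transform(df)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
model.transform(train).select("label", "prediction").show()
```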
Uses of Apache Spark: The main applications of the Spark framework are:
• The data generated by different systems is often not consistent enough
to combine for analysis. To obtain consistent data from such systems we
can use processes like extract, transform, and load (ETL), and Spark
reduces their time and cost because these steps are implemented very
efficiently in it (see the ETL sketch after this list).
• It is tough to handle continuously generated data such as log files.
Spark is capable of working well with streams of data and reusing
operations.
• Because Spark can keep data in memory and run repeated queries
quickly, it makes it easy to try out the machine learning algorithms
best suited to a particular kind of data.
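A minimal ETL sketch in PySpark along these lines; the paths, the CSV format, and the column names are all assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-demo").getOrCreate()

# Extract: read raw, possibly inconsistent data (hypothetical path).
raw = spark.read.csv("hdfs:///landing/orders.csv", header=True, inferSchema=True)

# Transform: normalise the fields that downstream analysis depends on.
clean = (raw.dropna(subset=["order_id"])
            .withColumn("amount", F.col("amount").cast("double"))
            .withColumn("order_date", F.to_date("order_date")))

# Load: write the cleaned data to the warehouse area.
clean.write.mode("overwrite").parquet("hdfs:///warehouse/orders")
```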
Hadoop vs. Apache Spark

Hadoop and Spark Integration
Hadoop and Spark can be used together. Spark can run on top of HDFS
for storage, combining Hadoop's reliability with Spark’s processing
speed for hybrid solutions.
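A minimal sketch of the hybrid setup: Spark doing the processing while HDFS does the storage. The NameNode address and file path are placeholders for a real cluster:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-integration").getOrCreate()

# Spark reads directly from HDFS, so Hadoop's replicated storage backs
# Spark's fast in-memory processing.
logs = spark.read.text("hdfs://namenode:9000/logs/access.log")
print(logs.count())  # number of log lines stored in HDFS
```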

Use Cases of Hadoop


• Log file analysis
• Data archiving
• Fraud detection in banking
• Large-scale ETL processes

Tools Built on Hadoop
1. Hive: SQL-like querying for big data.
2. Pig: High-level scripting language.
3. HBase: Non-relational distributed database.

Ecosystem Around Spark

1. Kafka: Real-time message processing.
2. Delta Lake: Structured storage.
3. Airflow: Workflow orchestration.

Reference Books
TEXT BOOKS

1. Mohammed Guller, "Big Data Analytics with Spark", Apress, 2015.
2. Tom Mitchell, "Machine Learning", McGraw Hill, 3rd Edition, 1997.
3. Michael Minelli, Michele Chambers, Ambiga Dhiraj, "Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses", 1st Edition, Wiley CIO Series, 2013.
4. Arvind Sathi, "Big Data Analytics: Disruptive Technologies for Changing the Game", 1st Edition, IBM Corporation, 2012.

REFERENCE BOOKS
5. Chris Eaton, Dirk deRoos et al., "Understanding Big Data", McGraw Hill, 2012.
6. Vignesh Prajapati, "Big Data Analytics with R and Hadoop", Packt Publishing, 2013.
7. Jay Liebowitz, "Big Data and Business Analytics", CRC Press, 2013.
For more insight
Web sources:
1. https://www.alliant.edu/blog/4-top-online-resources-data-analytics
2. https://www.coursera.org/articles/big-data-technologies
3. https://careerfoundry.com/en/blog/data-analytics/where-to-find-free-datasets/
THANK YOU

For queries
Email: [email protected]
