Please read this disclaimer before proceeding:
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only to the respective group /
learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
this document from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.
20IT503
Big Data Analytics
Department: IT
Batch/Year: 2020-24/III
Created by: K.SELVI,
Assistant Professor
Date: 30.07.2022
Table of Contents
S NO  CONTENTS                                              PAGE NO
1     Table of Contents                                     5
2     Course Objectives                                     6
3     Pre Requisites (Course Names with Code)               7
4     Syllabus (With Subject Code, Name, LTPC details)      8
5     Course Outcomes                                       10
6     CO-PO/PSO Mapping                                     11
7     Lecture Plan                                          12
8     Activity Based Learning                               13
9     UNIT I INTRODUCTION TO BIG DATA
      1.1  Data Science                                     14
      1.2  Fundamentals and Components                      16
      1.3  Types of Digital Data                            18
      1.4  Classification of Digital Data                   18
      1.5  Introduction to Big Data                         20
      1.6  Characteristics of Data                          21
      1.7  Evolution of Big Data                            24
      1.8  Big Data Analytics                               29
      1.9  Classification of Analytics                      32
      1.10 Top Challenges Facing Big Data                   42
      1.11 Importance of Big Data Analytics                 45
10    Assignments                                           46
11    Part A (Questions & Answers)                          47
12    Part B Questions                                      52
13    Supportive Online Certification Courses               53
14    Real Time Applications                                54
15    Content Beyond the Syllabus                           55
16    Assessment Schedule
17    Prescribed Text Books & Reference Books               57
18    Mini Project Suggestions                              58
Course Objectives
To Understand the Big Data Platform and its Use cases
To Provide an overview of Apache Hadoop
To Provide HDFS Concepts and Interfacing with HDFS
To Understand Map Reduce Jobs
Pre Requisites
CS8391 – Data Structures
CS8492 – Database Management System
Syllabus
20IT503 BIG DATA ANALYTICS                    L T P C
                                              3 0 0 3
UNIT I INTRODUCTION TO BIG DATA 9
Data Science – Fundamentals and Components –Types of Digital Data –
Classification of Digital Data – Introduction to Big Data – Characteristics of
Data – Evolution of Big Data – Big Data Analytics – Classification of
Analytics – Top Challenges Facing Big Data – Importance of Big Data
Analytics.
UNIT II DESCRIPTIVE ANALYTICS USING STATISTICS 9
Mean, Median and Mode – Standard Deviation and Variance – Probability –
Probability Density Function – Percentiles and Moments – Correlation and
Covariance – Conditional Probability – Bayes’ Theorem – Introduction to
Univariate, Bivariate and Multivariate Analysis – Dimensionality Reduction
using Principal Component Analysis (PCA) and LDA.
UNIT III PREDICTIVE MODELING AND MACHINE LEARNING 9
Linear Regression – Polynomial Regression – Multivariate Regression –
Bias/Variance Trade Off – K Fold Cross Validation – Data Cleaning and
Normalization – Cleaning Web Log Data – Normalizing Numerical Data –
Detecting Outliers – Introduction to Supervised And Unsupervised Learning
– Reinforcement Learning – Dealing with Real World Data – Machine
Learning Algorithms –Clustering.
UNIT IV BIG DATA HADOOP FRAMEWORK 9
Introducing Hadoop –Hadoop Overview – RDBMS versus Hadoop – HDFS
(Hadoop Distributed File System): Components and Block Replication –
Processing Data with Hadoop – Introduction to MapReduce – Features of
MapReduce – Introduction to NoSQL: CAP theorem – MongoDB: RDBMS
Vs. MongoDB – Mongo DB Database Model – Data Types and Sharding –
Introduction to Hive – Hive Architecture – Hive Query Language (HQL).
UNIT V PYTHON AND R PROGRAMMING 9
Python Introduction – Data Types – Arithmetic – Control Flow – Functions –
args – Strings – Lists – Tuples – Sets – Dictionaries. Case study: Using R,
Python, Hadoop, Spark and reporting tools to understand and analyze
real-world data sources in the following domains – financial,
insurance, healthcare – using the Iris and UCI datasets.
Course Outcomes
CO#   COs                                                      K Level
CO1   Identify Big Data and its Business Implications.         K3
CO2   List the components of Hadoop and Hadoop Eco-System      K4
CO3   Access and Process Data on Distributed File System       K4
CO4   Manage Job Execution in Hadoop Environment               K4
CO5   Develop Big Data Solutions using Hadoop Eco System       K4
CO-PO/PSO Mapping
CO#   PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2  PSO3
CO1 2 3 3 3 3 1 1 - 1 2 1 1 2 2 2
CO2 2 3 2 3 3 1 1 - 1 2 1 1 2 2 2
CO3 2 3 2 3 3 1 1 - 1 2 1 1 2 2 2
CO4 2 3 2 3 3 1 1 - 1 2 1 1 2 2 2
CO5 2 3 2 3 3 1 1 - 1 2 1 1 1 1 1
Lecture Plan
UNIT I INTRODUCTION TO BIG DATA 9
Data Science – Fundamentals and Components –Types of Digital Data – Classification of
Digital Data – Introduction to Big Data – Characteristics of Data – Evolution of Big Data –
Big Data Analytics – Classification of Analytics – Top Challenges Facing Big Data –
Importance of Big Data Analytics.
Session No.  Topics to be covered                                        Mode of delivery  Reference
1            Data Science – Fundamentals and Components                  Chalk & board     1
2            Types of Digital Data – Classification of Digital Data      Chalk & board     1
3            Introduction to Big Data – Characteristics of Data          Chalk & board     1
4            Evolution of Big Data                                       Chalk & board     1
5            Big Data Analytics                                          Chalk & board     1
6            Classification of Analytics                                 Chalk & board     1
7            Top Challenges Facing Big Data                              Chalk & board     1
8            Importance of Big Data Analytics                            Chalk & board     1
CONTENT BEYOND THE SYLLABUS : Use Cases of Hadoop (IBM Watson, LinkedIn,
Yahoo!)
NPTEL/OTHER REFERENCES / WEBSITES : -
1. EMC Education Services, Data Science and Big Data Analytics: Discovering,
Analyzing, Visualizing and Presenting Data, Wiley Publishers, 2015. (Chapter
1 and Chapter 10)
NUMBER OF PERIODS : Planned: 8 Actual:
DATE OF COMPLETION : Planned: Actual:
REASON FOR DEVIATION (IF ANY) :
CORRECTIVE MEASURES :
Signature of The Faculty Signature Of HoD
ACTIVITY BASED LEARNING
S NO  TOPICS
1     FLASH CARDS (https://quizlet.com/in/582911673/flash-cards-flash-cards/)
UNIT I INTRODUCTION TO BIG DATA
1.1 DATA SCIENCE
As the amount of data continues to grow, the need to leverage it
becomes more important. Data science involves using methods to analyse
massive amounts of data and extract the knowledge it contains. Data science
and big data evolved from statistics and traditional data management but are
now considered to be distinct disciplines. Data science is an evolutionary
extension of statistics capable of dealing with the massive amounts of data
produced today. It adds methods from computer science to the repertoire of
statistics.
Data science is the science of analysing raw data using
statistics and machine learning techniques with the purpose of
drawing conclusions about that information.
Data science is the science of extracting knowledge from data. In
other words, it is a science of drawing out hidden patterns amongst data
using statistical and mathematical techniques.
It employs techniques and theories drawn from many fields within
the broad areas of mathematics, statistics and information technology, including
machine learning, data engineering, probability models, statistical learning,
pattern recognition and learning, etc.
Data scientists work on massive datasets for weather prediction,
oil drilling, earthquake prediction, financial fraud, terrorist networks and
activities, global economic impacts, sensor logs, social media analytics,
customer churn, collaborative filtering (predicting users' interests),
regression analysis, etc. Data science is multi-disciplinary.
Data Scientist
Business Acumen (Expertise) Skills:
A data scientist should have the following abilities to play the role of a data scientist:
- Understanding of the domain
- Business strategy
- Problem solving
- Communication
- Presentation
- Keenness
Technology Expertise:
The following skills are required as far as technical expertise is concerned:
- Good database knowledge, such as RDBMS.
- Good NoSQL database knowledge, such as MongoDB, Cassandra, HBase, etc.
- Programming languages such as Java, Python, C++, etc.
- Open-source tools such as Hadoop.
- Data warehousing.
- Data mining.
- Visualization tools such as Tableau, Flare, Google Visualization APIs, etc.
Mathematics Expertise:
The following are the key skills a data scientist needs in order to
comprehend, interpret and analyze data:
- Mathematics, Statistics, Artificial Intelligence (AI)
- Algorithms, Machine Learning, Pattern Recognition
- Natural Language Processing
To sum it up, the data science process is:
- Collecting raw data from multiple data sources.
- Processing the data.
- Integrating the data and preparing clean datasets.
- Engaging in exploratory data analysis using models and algorithms.
- Preparing presentations using data visualizations.
- Communicating the findings to all stakeholders.
- Making faster and better decisions.
A minimal end-to-end sketch of these steps is shown below.
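The following is only a small, hedged sketch of the process above in Python, assuming a hypothetical file sales.csv with region, units and price columns; a real project would add far more processing, validation and modelling at every step.

```python
# A minimal sketch of the data science process, assuming a hypothetical
# CSV file "sales.csv" with columns "region", "units" and "price".
import pandas as pd
import matplotlib.pyplot as plt

# 1. Collect raw data from a source (here, a local CSV file).
raw = pd.read_csv("sales.csv")

# 2-3. Process and clean: drop duplicate rows and records with missing values.
clean = raw.drop_duplicates().dropna()

# 4. Explore: derive a revenue column and summarise it per region.
clean["revenue"] = clean["units"] * clean["price"]
summary = clean.groupby("region")["revenue"].agg(["count", "mean", "sum"])

# 5-6. Visualise and communicate the findings to stakeholders.
summary["sum"].plot(kind="bar", title="Revenue by region")
plt.show()
print(summary)
```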
1.2 DATA SCIENCE COMPONENTS:
The main components of Data Science are given below:
1. Statistics:
Statistics is one of the most important components of data science.
Statistics is a way to collect and analyze numerical data in large amounts
and find meaningful insights from it.
2. Domain Expertise:
In data science, domain expertise binds data science together.
Domain expertise means specialized knowledge or skills of a particular area.
In data science, there are various areas for which we need domain experts.
3. Data engineering:
Data engineering is a part of data science that involves
acquiring, storing, retrieving, and transforming data. Data engineering
also includes adding metadata (data about data) to the data.
4. Visualization:
Data visualization means representing data in a visual
context so that people can easily understand its significance. Data
visualization makes it easy to grasp huge amounts of data through visuals.
5. Advanced computing:
Advanced computing does the heavy lifting of data science. It
involves designing, writing, debugging, and maintaining the source code of
computer programs.
6. Mathematics:
Mathematics is a critical part of data science. Mathematics
involves the study of quantity, structure, space, and change. For a data
scientist, a good knowledge of mathematics is essential.
7. Machine learning:
Machine learning is the backbone of data science. Machine learning
is about training a machine so that it can act like a human brain. In data
science, we use various machine learning algorithms to solve problems, as
sketched below.
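As a small, hedged illustration of the machine learning component, the sketch below uses scikit-learn's KMeans (a clustering algorithm, also touched on later in the syllabus) to learn two groups from synthetic, unlabeled points. The data and parameters are invented for illustration; it is a toy example, not a prescribed workflow.

```python
# Toy machine learning example: KMeans clustering on synthetic 2-D data.
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two loose groups of points around (0, 0) and (5, 5).
rng = np.random.default_rng(42)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
points = np.vstack([group_a, group_b])

# "Training": the model learns the two group centres from the data alone.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("Cluster centres:\n", model.cluster_centers_)
print("Label of a new point near (5, 5):", model.predict([[4.8, 5.1]]))
```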
1.3 TYPES OF DIGITAL DATA:
As the Internet age continues to grow, we generate an
incomprehensible amount of data every second. So much so that the
number of data floating around the internet is estimated to reach 163
zettabytes by 2025. That’s a lot of tweets, selfies, purchases, emails,
blog posts, and any other piece of digital information that we can think
of.
In data science and big data, there are many different types
of data. Each of them tends to require different tools and techniques.
The main categories of data are these:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
1.4 CLASSIFICATION OF DIGITAL DATA
Digital data can be classified according to the following types:
• STRUCTURED DATA
• UNSTRUCTURED DATA
• SEMI-STRUCTURED DATA
STRUCTURED DATA:
Structured data has certain predefined organizational properties and
is present in structured or tabular schema, making it easier to analyze and sort.
In addition, thanks to its predefined nature, each field is discrete and can be
accessed separately or jointly along with data from other fields. This makes
structured data extremely valuable, making it possible to collect data from
various locations in the database quickly.
UNSTRUCTURED DATA:
Unstructured data entails information with no predefined conceptual
definitions and is not easily interpreted or analyzed by standard databases or
data models. Unstructured data accounts for the majority of big data and often
embeds information such as dates, numbers, and facts within free-form content.
Big data examples of this type include video and audio files, mobile activity,
satellite imagery, and NoSQL databases, to name a few. Photos we upload to
Facebook or Instagram and videos that we watch on YouTube or any other
platform contribute to the growing pile of unstructured data.
SEMI-STRUCTURED DATA:
Semi-structured data is a hybrid of structured and unstructured
data. This means that it inherits a few characteristics of structured data but
nonetheless contains information that fails to have a definite structure and does
not conform with relational databases or formal structures of data models. For
instance, JSON and XML are typical examples of semi-structured data.
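A short sketch of the three classes follows, assuming only the Python standard library and pandas; the records and fields are invented for illustration.

```python
# A small, hedged illustration of structured, semi-structured and
# unstructured data using made-up records.
import json
import pandas as pd

# Semi-structured: JSON has tags/keys but no rigid relational schema
# (the second record carries an extra field the first one lacks).
records = '[{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ravi", "city": "Chennai"}]'
semi_structured = json.loads(records)

# Structured: the same records forced into a tabular schema with fixed columns.
structured = pd.DataFrame(semi_structured)   # missing values become NaN
print(structured)

# Unstructured: free text with no predefined fields; a database cannot
# query "city" here without extra processing such as text mining.
unstructured = "Ravi from Chennai posted a video review of the new phone."
print(len(unstructured.split()), "words of free text")
```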
1.5 INTRODUCTION TO BIG DATA
Data is created constantly, and at an ever-increasing rate. Mobile
phones, social media, imaging technologies to determine a medical diagnosis—
all these and more create new data, and that must be stored somewhere for
some purpose.
Devices and sensors automatically generate diagnostic information
that needs to be stored and processed in real time. Merely keeping up with this
huge influx of data is difficult, but substantially more challenging is analyzing
vast amounts of it, especially when it does not conform to traditional notions of
data structure, to identify meaningful patterns and extract useful information.
These challenges of the data deluge present the opportunity to
transform business, government, science, and everyday life.
Several industries have led the way in developing their ability to
gather and exploit data:
• Credit card companies monitor every purchase their customers make and
can identify fraudulent purchases with a high degree of accuracy using rules
derived by processing billions of transactions.
• Mobile phone companies analyze subscribers’ calling patterns to determine,
for example, whether a caller’s frequent contacts are on a rival network. If
that rival network is offering an attractive promotion that might cause the
subscriber to defect, the mobile phone company can proactively offer the
subscriber an incentive to remain in her contract.
• For companies such as LinkedIn and Facebook, data itself is their primary
product. The valuations of these companies are heavily derived from the
data they gather and host, which contains more and more intrinsic value as
the data grows.
1.6 CHARACTERISTICS OF DATA
Three attributes stand out as defining Big Data characteristics:
Huge volume of data:
Rather than thousands or millions of rows, Big Data can be billions
of rows and millions of columns.
Complexity of data types and structures:
Big Data reflects the variety of new data sources, formats, and
structures, including digital traces being left on the web and other digital
repositories for subsequent analysis.
Speed of new data creation and growth:
Big Data can describe high-velocity data, with rapid data ingestion
and near real-time analysis.
Big data is high-volume, high-velocity and high-variety information
assets that demand cost-effective, innovative forms of information
processing for enhanced insight and decision making. Big data is
fundamentally about applying innovative and cost-effective techniques for
solving existing and future business problems whose resource requirements
(for data management space, computation resources, or immediate, in
memory representation needs) exceed the capabilities of traditional
computing environments.
(i) Volume
The name 'Big Data' itself is related to a size which is enormous. The size
of data plays a very crucial role in determining value out of data. Also, whether
particular data can actually be considered Big Data or not depends upon the
volume of data. Hence, 'Volume' is one characteristic which needs to be
considered while dealing with 'Big Data'.
(ii) Variety
Variety refers to heterogeneous sources and the nature of data, both
structured and unstructured. In earlier days, spreadsheets and databases
were the only sources of data considered by most applications. Nowadays,
data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc.
is also considered in analysis applications. This variety of unstructured
data poses certain issues for storing, mining and analyzing data.
(iii) Velocity
The term 'velocity' refers to the speed of generation of data. How
fast the data is generated and processed to meet the demands determines the real
potential in the data. Big Data velocity deals with the speed at which data flows
in from sources like business processes, application logs, networks and social
media sites, sensors, mobile devices, etc. The flow of data is massive and
continuous.
(iv) Variability
This refers to the inconsistency which can be shown by the data at
times, thus hampering the process of handling and managing the data
effectively.
Structured data:
Data containing a defined data type, format, and structure (that is,
transaction data, online analytical processing [OLAP] data cubes, traditional
RDBMS, CSV files, and even simple spreadsheets).
Semi-structured data:
Textual data files with a discernible pattern that enables parsing
(such as Extensible Markup Language [XML] data files that are self describing
and defined by an XML schema).
Quasi-structured data:
Textual data with erratic data formats that can be formatted with
effort, tools, and time (for instance, web clickstream data that may contain
inconsistencies in data values and formats; see the sketch after this list).
Unstructured data:
Data that has no inherent structure, which may include text
documents, PDFs, images, and video.
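As referenced in the quasi-structured item above, the sketch below shows how erratic, clickstream-like log lines (hypothetical ones, in a simplified format) can be turned into discrete fields with a regular expression; real web logs vary and need more robust parsing.

```python
# A sketch of turning quasi-structured data into structured rows, assuming
# hypothetical web-server log lines in a simplified clickstream-like format.
import re

log_lines = [
    '10.0.0.1 - - [30/Jul/2022:10:15:32] "GET /products/laptop HTTP/1.1" 200',
    '10.0.0.2 - - [30/Jul/2022:10:15:40] "GET /cart HTTP/1.1" 404',
]

# With some effort and a tool (a regular expression), the erratic text
# yields discrete fields: client IP, timestamp, requested URL, status code.
pattern = re.compile(r'(\S+) - - \[(.*?)\] "(\S+) (\S+) [^"]*" (\d{3})')

for line in log_lines:
    match = pattern.match(line)
    if match:
        ip, timestamp, method, url, status = match.groups()
        print({"ip": ip, "time": timestamp, "url": url, "status": int(status)})
```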
[Figure: Big Data characteristics and data structures; data growth is increasingly unstructured]
1.7 EVOLUTION OF BIG DATA
Where does ‘Big Data’ come from?
The term 'Big Data' has been in use since the early 1990s. Although it
is not exactly known who first used the term, most people credit John R. Mashey
(who at the time worked at Silicon Graphics) for making the term popular.
In its true essence, Big Data is not something that is completely new or
only of the last two decades. Over the course of centuries, people have been trying
to use data analysis and analytics techniques to support their decision-making
process. The ancient Egyptians around 300 BC already tried to capture all existing
'data' in the Library of Alexandria. Moreover, the Roman Empire used to carefully
analyze statistics of their military to determine the optimal distribution for their
armies.
However, in the last two decades, the volume and speed with which
data is generated have changed beyond measures of human comprehension. The
total amount of data in the world was 4.4 zettabytes in 2013. That is set to rise
steeply to 44 zettabytes by 2020. To put that in perspective, 44 zettabytes is
equivalent to 44 trillion gigabytes. Even with the most advanced technologies today,
it is impossible to analyze all this data. The need to process these increasingly
larger (and unstructured) data sets is how traditional data analysis transformed into
'Big Data' in the last decade.
To illustrate this development over time, the evolution of Big Data can
roughly be sub-divided into three main phases. Each phase has its own
characteristics and capabilities. In order to understand the context of Big Data
today, it is important to understand how each phase contributed to the
contemporary meaning of Big Data.
Big Data phase 1.0
Data analysis, data analytics and Big Data originate from the
longstanding domain of database management. This phase relies heavily on the
storage, extraction, and optimization techniques that are common for data stored in
Relational Database Management Systems (RDBMS).
Database management and data warehousing are considered the core
components of Big Data Phase 1. They provide the foundation of modern data
analysis as we know it today, using well-known techniques such as database
queries, online analytical processing and standard reporting tools.
Big Data phase 2.0
Since the early 2000s, the Internet and the Web began to offer unique
data collections and data analysis opportunities. With the expansion of web traffic
and online stores, companies such as Yahoo, Amazon and eBay started to analyze
customer behaviour by analysing click-rates, IP-specific location data and search
logs. This opened a whole new world of possibilities.
From a data analysis, data analytics, and Big Data point of view,
HTTP-based web traffic introduced a massive increase in semi-structured and
unstructured data. Besides the standard structured data types, organizations now
needed to find new approaches and storage solutions to deal with these new data
types in order to analyze them effectively. The arrival and growth of social media
data greatly aggravated the need for tools, technologies and analytics techniques
that were able to extract meaningful information out of this unstructured data.
Big Data phase 3.0
Although web-based unstructured content is still the main focus for
many organizations in data analysis, data analytics, and big data, the current
possibilities to retrieve valuable information are emerging out of mobile devices.
Mobile devices not only give the possibility to analyze behavioural data (such as
clicks and search queries), but also give the possibility to store and analyze
location-based data (GPS data).
With the advancement of these mobile devices, it is possible to track movement,
analyse physical behaviour and even health-related data (such as the number of
steps you take per day). This data provides a whole new range of opportunities,
from transportation to city design and health care.
Simultaneously, the rise of sensor-based internet-enabled devices is
increasing data generation like never before. Famously coined the
'Internet of Things' (IoT), millions of TVs, thermostats, wearables and even
refrigerators are now generating zettabytes of data every day. And the race to
extract meaningful and valuable information out of these new data sources has
only just begun.
A summary of the three phases of Big Data is listed in the table below:

Phase       Main data sources                          Core technologies and techniques
Phase 1.0   Structured data stored in RDBMS            Database management, data warehousing, OLAP, standard reporting
Phase 2.0   Web traffic, online stores, social media   Analysis of semi-structured and unstructured data (click-rates, IP-based location, search logs)
Phase 3.0   Mobile devices and IoT sensors             Behavioural, location-based (GPS) and sensor data analytics
Drivers of Big data:
To better understand the market drivers related to Big Data, it is
helpful to first understand some past history of data stores and the kinds of
repositories and tools to manage these data stores.
In the 1990s the volume of information was often measured in
terabytes. Most organizations analyzed structured data in rows and columns
and used relational databases and data warehouses to manage large stores of
enterprise information. The following decade saw a proliferation of different
kinds of data sources—mainly productivity and publishing tools such as content
management repositories and network-attached storage systems—to
manage this kind of information, and the data began to increase in size and
started to be measured at petabyte scales. In the 2010s, the information that
organizations try to manage has broadened to include many other kinds of
data:
1. Medical information, such as genomic sequencing and diagnostic imaging.
2. Photos and video footage uploaded to the World Wide Web.
3. Video surveillance, such as the thousands of video cameras spread across a
city.
4. Mobile devices, which provide geospatial location data of the users, as well
as metadata about text messages, phone calls, and application usage on smart
phones.
5. Smart devices, which provide sensor-based collection of information from
smart electric grids, smart buildings, and many other public and industry
infrastructures.
6. Nontraditional IT devices, including the use of radio-frequency identification
(RFID) readers, GPS navigation systems, and seismic processing.
The Big Data trend is generating an enormous amount of
information from many new sources. This data deluge requires advanced
analytics and new market players to take advantage of these opportunities and
new market dynamics, which will be discussed in the following section
1.8 BIG DATA ANALYTICS
BUSINESS INTELLIGENCE (BI) VERSUS BIG DATA
Following are the differences that one encounters when dealing with traditional BI and
big data.
In a traditional BI environment, all the enterprise's data is housed in a
central server, whereas in a big data environment data resides in a distributed
file system. The distributed file system scales by scaling out (increase) or
in (decrease) horizontally, as compared to a typical database server that scales
vertically.
In traditional BI, data is generally analysed in an offline mode,
whereas in big data it is analyzed in both real-time streaming as well as in
offline mode.
Traditional BI is about structured data, and it is here that data is
taken to the processing functions (move data to code), whereas big data is about
variety: structured, semi-structured, and unstructured data, and here the
processing functions are taken to the data (move code to data), as the toy
sketch below illustrates.
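The following is a toy, single-machine sketch of the "move code to data" idea: each node runs the same small map and reduce functions over its own local chunk, and only the small intermediate results travel over the network. This is plain Python that only mimics the pattern; it is not how Hadoop itself is programmed.

```python
# Toy sketch of the map/reduce "move code to data" pattern: a word count.
from collections import Counter
from functools import reduce

# Pretend each list is the data already sitting on a different node.
chunks_on_nodes = [
    ["big data needs new tools", "data keeps growing"],
    ["hadoop moves code to data", "big clusters store big data"],
]

def map_phase(lines):
    """Run locally on each node: emit word counts for that node's chunk."""
    return Counter(word for line in lines for word in line.split())

def reduce_phase(left, right):
    """Combine the small per-node results into a global count."""
    return left + right

partial_counts = [map_phase(chunk) for chunk in chunks_on_nodes]
total = reduce(reduce_phase, partial_counts, Counter())
print(total.most_common(3))
```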
STATE OF THE PRACTICE IN ANALYTICS
Current business problems provide many opportunities for
organizations to become more analytical and data driven, as shown in Table.
Business Driver                               Examples
Optimize business operations                  Sales, pricing, profitability, efficiency
Identify business risk                        Customer churn, fraud, default
Predict new business opportunities            Upsell, cross-sell, best new customer prospects
Comply with laws or regulatory requirements   Anti-Money Laundering, Fair Lending
The first three examples do not represent new problems.
Organizations have been trying to reduce customer churn, increase
sales, and cross-sell customers for many years.
What is new is the opportunity to fuse advanced analytical techniques
with Big Data to produce more impactful analyses for these traditional problems.
The last example portrays emerging regulatory requirements
Many compliance and regulatory laws have been in existence for
decades, but additional requirements are added every year, which represent
additional complexity and data requirements for organizations.
Laws related to anti-money laundering (AML) and fraud prevention
require advanced analytical techniques to comply with and manage properly.
DATA ANALYTICS LIFECYCLE:
Here is a brief overview of the main phases of the Data Analytics Lifecycle:
Phase 1- Discovery:
In Phase 1, the team learns the business domain, including relevant
history such as whether the organization or business unit has attempted similar
projects in the past from which they can learn. The team assesses the resources
available to support the project in terms of people, technology, time and data.
Important activities in this phase include framing the business problem as an
analytics challenge that can be addressed in subsequent phases and formulating
initial hypotheses (IHs) to test and begin learning the data.
Phase 2- Data preparation:
Phase 2 requires the presence of an analytic sandbox, in which the
team can work with data and perform analytics for the duration of the project.
The team needs to execute extract, load, and transform (ELT) or extract,
transform and load (ETL) to get data into the sandbox. The ELT and ETL are
sometimes abbreviated as ETLT. Data should be transformed in the ETLT process
so the team can work with it and analyze it. In this phase, the team also needs to
familiarize itself with the data thoroughly and take steps to condition the data.
Phase 3-Model planning:
Phase 3 is model planning, where the team determines the methods,
techniques and workflow it intends to follow for the subsequent model building
phase. The team explores the data to learn about the relationships between
variables and subsequently selects key variables and the most suitable models.
[Figure: Overview of the Data Analytics Lifecycle]
Phase 4-Model building:
In Phase 4, the team develops data sets for testing, training, and
production purposes. In addition, in this phase the team builds and executes
models based on the work done in the model planning phase. The team also
considers whether its existing tools will suffice for running the models, or if it
will need a more robust environment for executing models and workflows (for
example, fast hardware and parallel processing, if applicable).
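As a small, hedged illustration of the Phase 4 idea of separate training and test data, the sketch below fits a simple classifier on scikit-learn's built-in Iris dataset (also named in Unit V of the syllabus). Model building in practice involves far more iteration and tooling.

```python
# Minimal model-building sketch: train on one subset, test on a held-out one.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold back 30% of the rows so the model is judged on data it never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```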
Phase 5-Communicate results:
In Phase 5, the team, in collaboration with major stakeholders,
determines if the results of the project are a success or a failure based on the
criteria developed in Phase 1. The team should identify key findings, quantify
the business value, and develop a narrative to summarize and convey findings
to stakeholders.
Phase 6 - Operationalize:
In Phase 6, the team delivers final reports, briefings, code
and technical documents. In addition, the team may run a pilot project to
implement the models in a production environment.
1.9 CLASSIFICATION OF ANALYTICS:
1. BI Versus Data Science
2. Current Analytical Architecture (data flow)
3. Drivers of Big Data
4. Emerging Big Data Ecosystem and a New Approach to Analytics
1. Comparing BI with Data Science
(Adapted from: Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data)

Predictive Analytics and Data Mining (Data Science)
Typical Techniques and Data Types: Optimization, predictive modelling, forecasting, statistical analysis; structured/unstructured data, many types of sources, very large datasets.
Common Questions: What if ...? What's the optimal scenario for our business? What will happen next? What if these trends continue? Why is this happening?

Business Intelligence (BI)
Typical Techniques and Data Types: Standard and ad hoc reporting, dashboards, alerts, queries, details on demand; structured data, traditional sources, manageable datasets.
Common Questions: What happened last quarter? How many units sold? Where is the problem? In which situations?
2.Current Analytical Architecture:
[Figure: Typical Analytical Architecture]
1. For data sources to be loaded into the data warehouse, data needs to
be well understood, structured and normalized with the appropriate data type
definitions.
2. As a result of this level of control on the EDW (enterprise data
warehouse-on server or on cloud), additional local systems may emerge in the form
of departmental warehouses and local data marts that business users create to
accommodate their need for flexible analysis. However, these local systems reside
in isolation, often are not synchronized or integrated with other data stores and
may not be backed up.
3. In the data warehouse, data is read by additional applications across
the enterprise for BI and reporting purposes.
At the end of this workflow, analysts get data from the server. Because users generally
are not allowed to run custom or intensive analytics on production databases,
analysts create data extracts from the EDW to analyze data offline in R or other
local analytical tools, while the EDW continues to store and process critical data,
supporting enterprise applications and enabling corporate reporting activities.
Although reports and dashboards are still important for organizations,
most traditional data architectures prevent data exploration and more sophisticated
analysis.
3.Drivers of Big Data:
In the 1990s the volume of information was often measured in
terabytes. Most organizations analyzed structured data in rows and columns and
used relational databases and data warehouses to manage large amounts of
enterprise information.
[Figure: Data Evolution and the Rise of Big Data Sources]
The following decade (the 2000s) saw different kinds of data sources –
mainly productivity and publishing tools such as content management
repositories and network-attached storage systems – to manage this kind of
information, and the data began to increase in size and started to be measured
at petabyte scales.
In the 2010s, the information that organizations try to manage has
broadened to include many other kinds of data. In this era, everyone and
everything is leaving a digital footprint. These applications, which generate data
volumes that can be measured in exabyte scale, provide opportunities for new
analytics and driving new value for organizations. The data now comes from
multiple sources, like Medical information, Photos and video footage, Video
surveillance, Mobile devices, Smart devices, Nontraditional IT devices etc.
4. Emerging Big Data Ecosystem and a New Approach to Analytics:
[Figure: Emerging Big Data Ecosystem]
As the new ecosystem takes shape, there are four main groups of
players within this interconnected web.
1. Data devices and the "SensorNet" gather data from multiple
locations and continuously generate new data about this data. For each gigabyte
of new data created, an additional petabyte of data is created about that data.
For example, consider someone playing an online video game
through a PC, game console, or smartphone. In this case, the video game
provider captures data about the skill and levels attained by the player.
Intelligent systems monitor and log how and when the user plays the game. As a
consequence, the game provider can fine-tune the difficulty of the game,
suggest other related games that would most likely interest the user, and offer
additional equipment and enhancements for the character based on the user's
age, gender, and interests. This information may get stored locally or uploaded
to the game provider's cloud to analyze the gaming habits and opportunities for
upsell and cross-sell and identify typical profiles of specific kinds of users.
Smartphones provide another rich source of data. In addition to
messaging and basic phone usage, they store and transmit data about Internet
usage, SMS usage, and real- time location. This metadata can be used for analyzing
traffic patterns by scanning the density of smartphones in locations to track the
speed of cars or the relative traffic congestion on busy roads. In this way, GPS
devices in cars can give drivers real-time updates and offer alternative routes to
avoid traffic delays.
Retail shopping loyalty cards record not just the amount an individual
spends, but the locations of stores that person visits, the kinds of products
purchased, the stores where goods are purchased most often,
and the combinations of products purchased together. Collecting this data provides
insights into shopping and travel habits and the likelihood of successful
advertisement targeting for certain types of retail promotions.
2. Data collectors include entities that collect data from devices and
users. Examples include a cable TV provider tracking the shows a person
watches, which TV channels someone will and will not pay to watch on demand,
and the prices someone is willing to pay for premium TV content. Another
example is retail stores tracking the path a customer takes through their store
while pushing a shopping cart with an RFID chip, so they can gauge which
products get the most foot traffic using geospatial data collected from the RFID
chips.
3. Data aggregators make sense of the data collected from the various
entities in the "SensorNet" or the "Internet of Things." These organizations
compile data from the devices and usage patterns collected by government
agencies, retail stores and websites. In turn, they can choose to transform and
package the data as products to sell to list brokers, who may want to generate
marketing lists of people who may be good targets for specific ad campaigns.
4. Data users/buyers: These groups directly benefit from the data collected
and aggregated by others within the data value chain. Retail banks, acting as a
data buyer, may want to know which customers have the highest likelihood to
apply for a second mortgage or a home equity line of credit.
To provide input for this analysis, retail banks may purchase data
from a data aggregator. This kind of data may include demographic information
about people living in specific locations; people who appear to have a specific
level of debt, yet still have solid credit scores (or other characteristics such as
paying bills on time and having savings accounts) that can be used to infer credit
worthiness; and those who are searching the web for information about paying
off debts or doing home remodeling projects. Obtaining data from these various
sources and aggregators will enable a more targeted marketing campaign, which
would have been more challenging before Big Data due to the lack of
information or high-performing technologies.
Using technologies such as Hadoop to perform natural language
processing on unstructured, textual data from social media websites, users can
gauge the reaction to events such as presidential campaigns. People may, for
example, want to determine public sentiments toward a candidate by analyzing
related blogs and online comments. Similarly, data users may want to track and
prepare for natural disasters by identifying which areas a hurricane affects first
and how it moves, based on which geographic areas are tweeting about it or
discussing it via social media.
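The sketch below is a deliberately tiny, single-machine illustration of that idea: public reaction is gauged by counting positive and negative words in a few invented posts. Production systems would run proper natural language processing over far larger corpora on Hadoop or similar platforms.

```python
# Toy sentiment sketch on unstructured social media text (invented posts).
posts = [
    "great speech by the candidate, very inspiring",
    "terrible plan, the debate was a disaster",
    "inspiring ideas but a weak delivery",
]

positive_words = {"great", "inspiring", "good", "strong"}
negative_words = {"terrible", "disaster", "weak", "bad"}

score = 0
for post in posts:
    words = set(post.lower().split())
    score += len(words & positive_words) - len(words & negative_words)

# A score above zero leans positive, below zero leans negative.
print("Overall sentiment score:", score)
```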
KEY ROLES FOR THE NEW BIG DATA ECOSYSTEM:
1. Deep Analytical Talent is technically savvy, with strong
analytical skills. Members possess a combination of skills to handle raw,
unstructured data and to apply complex analytical techniques at massive
scales.
This group has advanced training in quantitative disciplines, such
as mathematics, statistics, and machine learning. To do their jobs, members
need access to a robust analytic sandbox or workspace where they can
perform large-scale analytical data experiments.
Examples of current professions fitting into this group include
statisticians, economists, mathematicians, and the new role of the Data
Scientist.
2. Data Savvy Professionals: This group has less technical depth but has a
basic knowledge of statistics or machine learning and can define key questions
that can be answered using advanced analytics.
These people tend to have a base knowledge of working with data,
or an appreciation for some of the work being performed by data scientists and
others with deep analytical talent.
Examples of data savvy professionals include financial analysts,
market research analysts, life scientists, operations managers, and business
and functional managers.
3. Technology and Data Enablers:
This group represents people providing technical expertise to
support analytical projects, such as provisioning and administrating analytical
sandboxes, and managing large-scale data architectures that enable
widespread analytics within companies and other organizations.
This role requires skills related to computer engineering,
programming, and database administration.
These three groups must work together closely to solve complex Big Data
challenges.
Most organizations are familiar with people in the latter two groups
mentioned, but the first group, Deep Analytical Talent, tends to be the newest
role for most and the least understood.
For simplicity, this discussion focuses on the emerging role of the
Data Scientist. It describes the kinds of activities that role performs and
provides a more detailed view of the skills needed to fulfill that role.
Activities of a data scientist:
There are three recurring sets of activities that data scientists
perform:
Reframe business challenges as analytics challenges. Specifically,
this is a skill to diagnose business problems, consider the core of a given
problem, and determine which kinds of analytical methods can be applied to
solve it.
Design, implement, and deploy statistical models and data mining
techniques on Big Data. This set of activities is mainly what people think about
when they consider the role of the Data Scientist: namely, applying complex or
advanced analytical methods to a variety of business problems using data.
Develop insights that lead to actionable recommendations. It is critical
to note that applying advanced methods to data problems does not necessarily
drive new business value. Instead, it is important to learn how to draw insights out
of the data and communicate them effectively.
Profile of a data scientist:
Data scientists are generally thought of as having five main sets of
skills and behavioral characteristics.
Quantitative skill: such as mathematics or statistics
Technical aptitude: namely, software engineering, machine learning, and
programming skills
Skeptical mind-set and critical thinking: It is important that data scientists
can examine their work critically rather than in a one-sided way.
Curious and creative: Data scientists are passionate about data and finding
creative ways to solve problems and portray information.
Communicative and collaborative:
Data scientists must be able to understand the business value in a
clear way and collaboratively work with other groups, including project sponsors
and key stakeholders.
Data scientists are generally comfortable using this blend of skills to
acquire, manage, analyze, and visualize data and tell compelling stories about it.
1.10 TOP CHALLENGES FACING BIG DATA:
Data volume:
Data today is growing at an exponential rate. This high tide of data
will continue to rise. The key questions are –
“will all this data be useful for analysis?”,
“Do we work with all this data or subset of it?”,
“How will we separate the knowledge from the noise?” etc.
Storage:
Cloud computing is the answer to managing infrastructure for big
data as far as cost-efficiency, elasticity and easy upgrading/downgrading are
concerned. However, concerns such as data security and privacy further
complicate the decision to host big data solutions outside the enterprise.
Data retention:
How long should one retain this data? Some data may be required for long-
term decisions, but some data may quickly become irrelevant and obsolete.
Skilled professionals:
In order to develop, manage and run applications that generate
insights, organizations need professionals who possess a high level of
proficiency in data science.
Other challenges:
Other challenges of big data are with respect to capture, storage,
search, analysis, transfer and security of big data.
Visualization:
Big data refers to datasets whose size is typically beyond the storage
capacity of traditional database software tools. There is no explicit definition of
how big the data set should be for it to be considered big data. Data
visualization (computer graphics) is becoming popular as a separate discipline.
There are very few data visualization experts.
There are mainly seven challenges of big data: scale, security,
schema, continuous availability, consistency, partition tolerance and data quality.
Scale:
Storage (RDBMS (Relational Database Management System) or NoSQL
(Not only SQL)) is one major concern that needs to be addressed to handle the
need for scaling rapidly and elastically. The need of the hour is a storage that can
best withstand the attack of large volume, velocity and variety of big data. Should
you scale vertically or should you scale horizontally?
Security:
Most of the NoSQL big data platforms have poor security mechanisms
(lack of proper authentication and authorization mechanisms) when it comes to
safeguarding big data. This is a gap that cannot be ignored, given that big data
carries credit card information, personal information and other sensitive data.
Schema:
Rigid schemas have no place. We want the technology to be able to fit
our big data and not the other way around. The need of the hour is dynamic
schemas. Static (pre-defined) schemas are obsolete.
Continuous availability:
The big question here is how to provide 24/7 support because almost
all RDBMS and NoSQL big data platforms have a certain amount of downtime built
in.
Consistency:
Should one opt for consistency or eventual consistency?
Partition tolerance:
How to build partition-tolerant systems that can take care of both
hardware and software failures?
Data quality:
How to maintain data quality, data accuracy, completeness, timeliness,
etc.? Do we have appropriate metadata in place?
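The sketch below shows, in a hedged way, how some of these quality dimensions (completeness, uniqueness, accuracy, timeliness) can be checked with pandas on a small invented customer table; real data-quality pipelines are much more extensive.

```python
# Basic data-quality checks on a hypothetical customer table.
import pandas as pd

customers = pd.DataFrame({
    "id":      [1, 2, 2, 4],
    "email":   ["a@x.com", None, "b@x.com", "not-an-email"],
    "updated": pd.to_datetime(["2022-07-01", "2021-01-15", "2022-06-20", "2019-03-02"]),
})

# Completeness: how many values are missing per column?
print(customers.isna().sum())

# Uniqueness: duplicated identifiers point to integration problems.
print("Duplicate ids:", customers["id"].duplicated().sum())

# Accuracy: a crude validity rule for the email field.
print("Invalid emails:", (~customers["email"].str.contains("@", na=False)).sum())

# Timeliness: records not refreshed in the last year are flagged as stale.
cutoff = pd.Timestamp("2022-07-30") - pd.DateOffset(years=1)
print("Stale records:", (customers["updated"] < cutoff).sum())
```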
1.11 IMPORTANCE OF BIG DATA ANALYTICS:
Let us study the various approaches to the analysis of data and what
each of them leads to.
Reactive-Business Intelligence:
What does Business Intelligence (BI) help us with? It allows the
businesses to make faster and better decisions by providing the right information
to the right person at the right time in the right format.
It is about analysis of the past or historical data and then displaying
the findings of the analysis or reports in the form of enterprise dashboards,
alerts, notifications, etc. It has support for both pre-specified reports as well as
ad hoc querying.
Reactive - Big Data Analytics:
Here the analysis is done on huge datasets but the approach is still
reactive as it is still based on static data.
Proactive - Analytics:
This is to support futuristic decision making by the use of data mining,
predictive modelling, text mining, and statistical analysis. This analysis is not
on big data, as it still uses traditional database management practices and
therefore has severe limitations on storage capacity and processing
capability.
Proactive - Big Data Analytics:
This is filtering through terabytes, petabytes and exabytes of
information to find the relevant data to analyze. This also includes high-
performance analytics to gain rapid insights from big data and the ability to
solve complex problems using more data.
Assignments
Q. No.  Question                                                            CO    K Level
1       i) Identify big data and the impacts of Big Data on business.      CO1   K4
        ii) Identify Big Data technologies and software used in real-time
        business.
Part-A Questions and Answers
1. What is big data? (CO1, K2)
Big data is high-volume, high-velocity and high-variety information assets
that demand cost-effective, innovative forms of information processing for
enhanced insight and decision making.
2. Mention the characteristics of big data. (CO1,K2)
The characteristics of big data are volume, variety, value, velocity and
veracity.
3.Differentiate structured and un-structured data. (CO1,K2)
Structured data is highly-organized and formatted in a way so it's easily
searchable in relational databases. Examples of structured data include
names, dates, addresses, credit card numbers, stock information, geo-
location, and more.
Unstructured data has no pre-defined format or organization, making it much
more difficult to collect, process, and analyze. Examples of unstructured data
include text, video, audio, mobile activity, social media activity, satellite
imagery, surveillance imagery – the list goes on and on.
4. Differentiate between big data and conventional data. (CO1,K2)
Big Data                                       Conventional Data
Huge data sets                                 Data set size in control
Unstructured data such as text, video          Normally structured data such as numbers
and audio                                      and categories, but it can take other
                                               forms as well
Hard to perform queries and analysis           Easy to perform queries and analysis
5. What is semi structured data? Give example. (CO1,K2)
Semi-structured data is data which does not conform to a
data model but has some structure. It lacks a fixed or rigid schema. It is the
data that does not reside in a relational database but that have some
organizational properties that make it easier to analyze. With some process,
we can store them in the relational database.
Examples are E-mails, XML and other markup languages, Binary
executables, TCP/IP packets, Zipped files.
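A brief sketch of that last point, using only the Python standard library on an invented XML snippet: the semi-structured document is flattened into rows and stored in a relational (SQLite) table.

```python
# Flattening semi-structured XML into rows of a relational SQLite table.
import sqlite3
import xml.etree.ElementTree as ET

xml_doc = """
<orders>
  <order id="101"><item>laptop</item><amount>55000</amount></order>
  <order id="102"><item>mouse</item><amount>700</amount></order>
</orders>
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, item TEXT, amount REAL)")

# "Some process": walk the XML tree and flatten each element into a row.
for order in ET.fromstring(xml_doc):
    conn.execute(
        "INSERT INTO orders VALUES (?, ?, ?)",
        (int(order.get("id")), order.find("item").text, float(order.find("amount").text)),
    )

print(conn.execute("SELECT * FROM orders").fetchall())
```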
6. What is veracity? (CO1,K2)
It refers to the quality of the data that is being analyzed. High
veracity data has many records that are valuable to analyze and that
contribute in a meaningful way to the overall results. Low veracity data, on
the other hand, contains a high percentage of meaningless data.
7. What is Data Science ? ( CO1 , K1 )
As the amount of data continues to grow, the need to leverage it
becomes more important. Data science involves using methods to analyze
massive amounts of data and extract the knowledge it contains.
Data science is an evolutionary extension of statistics capable of
dealing with the massive amounts of data produced today. It adds methods
from computer science to the repertoire of statistics.
8.Mention the benefits of implementing big data technology within
an organization. (CO1,K2)
Increasing Revenues
Lowering Costs
Increasing Productivity
Reducing Risks
9. List the major domains of big data use cases. (CO1,K2)
A scan of the list allows us to group most of those applications into
these categories:
Business intelligence, querying, reporting, searching
Improved performance for common data management operations
Non-database applications
Data mining and analytical applications
10. What is the role of data science in commercial companies? (CO1, K1)
Commercial companies in almost every industry use data science
and big data to gain insights into their customers, processes, staff, competition,
and products. Many companies use data science to offer customers a better
user experience, as well as to cross-sell, up-sell, and personalize their
offerings.
Human resource professionals use people analytics and text mining
to screen candidates, monitor the mood of employees, and study informal
networks among co-workers.
11. What are the characteristics of big data applications? (CO1,K2)
The big data approach is mostly suited to addressing or solving
business problems that are subject to one or more of the following criteria:
Data throttling
Computation-restricted throttling
Large data volumes
Significant data variety
Benefits from data parallelization
12. Mention some examples of big data applications. (CO1,K2)
The various examples of big data applications are:
Energy Network Monitoring and optimization
Credit fraud detection
Data profiling
Clustering and customer segmentation
Recommendation Engine
Price Modelling
13. Is cost reduction relevant to big data analytics? Justify.
(CO1, K3)
Yes. Big data analytics is considered a huge cost cutter because it provides
predictive analysis.
Predictive analysis uses many techniques from data mining, statistics,
modelling, machine learning and artificial intelligence to analyze current
data and make predictions about the future, thus helping companies in
understanding the market situation in advance.
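As a toy, hedged illustration of predictive analysis, the sketch below fits a straight-line trend to invented monthly sales figures with NumPy and projects the next month; real predictive models use richer techniques from the list above.

```python
# Toy predictive analysis: fit a trend on historical figures, project ahead.
import numpy as np

months = np.array([1, 2, 3, 4, 5, 6])
sales  = np.array([100, 110, 123, 129, 141, 152])   # hypothetical sales figures

# Fit a straight-line trend to the data seen so far.
slope, intercept = np.polyfit(months, sales, deg=1)

# Predict month 7 so the company can plan inventory and budgets ahead of time.
print("Forecast for month 7:", round(slope * 7 + intercept, 1))
```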
14. List out the key computing resources for big data
frameworks. (CO1,K2)
The four key computing resources for big data frameworks are
Processing capability
Memory
Storage
Network
15.Mention the key composition of the big data ecosystem
stack. (CO1,K2)
The key components of the big data ecosystem stack are
Apache Hbase, Apache Hive, Apache Pig, Apache Mahout, Apache Oozie,
Apache Zookeeper and Apache Sqoop.
Part-B Questions
Q. No.  Questions                                                        CO    K Level
1       Discuss in detail the evolution of big data.                     CO1   K2
2       Explain the best practices in big data analytics.                CO1   K2
3       Explain the characteristics of big data applications.            CO1   K2
4       i) Explain the types of digital data.                            CO1   K2
        ii) Explain the classification of digital data.
5       Explain the components of Data Science.                          CO1   K2
6       What are the three characteristics of big data? Explain the      CO1   K2
        differences between BI and Data Science.
7       Describe the current analytical architecture for data            CO1   K2
        scientists.
8       What are the key roles for the New Big Data Ecosystem?           CO1   K2
9       What are the key skill sets and behavioural characteristics      CO1   K2
        of a data scientist?
10      Write the importance of Big Data Analytics.                      CO1   K2
Supportive Online Certification Courses
Sl. No.  Courses                    Platform
1        Big Data Computing         Swayam
2        Python for Data Science    Swayam
Real Time Applications in Day to
Day life and to Industry
Sl. No. Questions
1  Explain the role of Big Data Analytics in the retail industry. (CO1, K4)
2  Explain the role of a data scientist in the analysis of weather data.
   (CO1, K4)
Content Beyond the Syllabus
USECASES OF HADOOP
1. IBM Watson
In 2011, IBM's computer system Watson participated in the U.S. television game show
Jeopardy against two of the best Jeopardy champions in the show's history. In the
game, the contestants are provided a clue such as "He likes his martinis shaken, not
stirred" and the correct response, phrased in the form of a question, would be, "Who
is James Bond?"
Over the three-day tournament, Watson was able to defeat the two human
contestants. To educate Watson, Hadoop was utilized to process various data
sources such as encyclopedias, dictionaries, news wire feeds, literature, and the entire
contents of Wikipedia. For each clue provided during the game,
Watson had to perform the following tasks in less than three
seconds.
Deconstruct the provided clue into words and phrases
Establish the grammatical relationship between the words and the phrases
Create a set of similar terms to use in Watson's search for a response
Use Hadoop to coordinate the search for a response across terabytes of data
Determine possible responses and assign their likelihood of being correct
Actuate the buzzer
Provide a syntactically correct response in English
Among other applications, Watson is being used in the medical profession to diagnose
patients and provide treatment recommendations.
2. LinkedIn
LinkedIn is an online professional network of 250 million users in 200 countries as
of early 2014. LinkedIn provides several free and subscription-based services, such as
company information pages, job postings, talent searches, social graphs of one's
contacts, personally tailored news feeds, and access to discussion groups, including
a Hadoop users group.
LinkedIn utilizes Hadoop for the following purposes:
Process daily production database transaction logs
Examine the users' activities such as views and clicks
Feed the extracted data back to the production systems
Restructure the data to add to an analytical database
Develop and test analytical models
3. Yahoo!
As of 2012, Yahoo! has one of the largest publicly announced Hadoop
deployments at 42,000 nodes across several clusters, utilizing 350 petabytes of raw
storage. Yahoo!'s Hadoop applications include the following:
Search index creation and maintenance
Web page content optimization
Web ad placement optimization
Spam filters
Ad-hoc analysis and analytic model development
Prior to deploying Hadoop, it took 26 days to process three years' worth of log
data. With Hadoop, the processing time was reduced to 20 minutes.
Text & Reference Books
Sl. Book Name & Author Book
No.
1 EMC Education Services, "Data Science and Big Data Text Book
Analytics: Discovering, Analyzing, Visualizing and
Presenting Data", Wiley publishers, 2015.
2 Jure Leskovec, Anand Rajaraman and Jeffrey David Text Book
Ullman, "Mining of Massive Datasets", Cambridge
University Press, 2014.
3 Gareth James, Daniela Witten, Trevor Hastie and Robert Text Book
Tibshirani, "An Introduction to Statistical Learning: with
Applications in R", Springer Texts in Statistics, Springer, 2017.
4 Dietmar Jannach and Markus Zanker, "Recommender Reference
Systems: An Introduction", Cambridge University Press, Book
2010.
5 Kim H. Pries and Robert Dunnigan, "Big Data Analytics: Reference
A Practical Guide for Managers " CRC Press, 2015. Book
6 Jimmy Lin and Chris Dyer, "Data-Intensive Text Reference
Processing with MapReduce", Synthesis Lectures on Book
Human Language Technologies, Vol. 3, No. 1, Pages 1-
177, Morgan Claypool
publishers, 2010.
Mini Project Suggestions
Sl. No.  Questions                                                       Platform
1        Analyze Your Personal Facebook Posting Habits — Are you         Python
         spending too much time posting on Facebook? The numbers
         don't lie, and you can find them in this beginner-to-
         intermediate Python data project. (CO1, K4)
Thank you
Disclaimer:
This document is confidential and intended solely for the educational purpose of RMK Group of
Educational Institutions. If you have received this document through email in error, please notify the
system manager. This document contains proprietary information and is intended only to the
respective group / learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender immediately by e-mail if you
have received this document by mistake and delete this document from your system. If you are not
the intended recipient you are notified that disclosing, copying, distributing or taking any action in
reliance on the contents of this information is strictly prohibited.