Unit 2 Big Data Analytics
Big Data analytics is a process used to extract meaningful insights, such as hidden patterns, unknown correlations, market
trends, and customer preferences. Big Data analytics provides various advantages—it can be used for better decision
making, preventing fraudulent activities, among other things.
Today, there are millions of data sources that generate data at a very rapid rate. These data sources are present across the
world. Some of the largest sources of data are social media platforms and networks. Let’s use Facebook as an example—it
generates more than 500 terabytes of data every day. This data includes pictures, videos, messages, and more.
Data also exists in different formats, like structured data, semi-structured data, and unstructured data. For example, in a
regular Excel sheet, data is classified as structured data—with a definite format. In contrast, emails fall under semi-
structured, and your pictures and videos fall under unstructured data. All this data combined makes up Big Data.
Organizations apply Big Data analytics in many ways, for example:
Using analytics to understand customer behavior in order to optimize the customer experience
Increasing operational efficiency by understanding where bottlenecks are and how to fix them
These are just a few examples — the possibilities are really endless when it comes to Big Data analytics. It all depends on
how you want to use it in order to improve your business.
1. Risk Management
Use Case: Banco de Oro, a Philippine banking company, uses Big Data analytics to identify fraudulent activities and
discrepancies. The organization leverages it to narrow down a list of suspects or root causes of problems.
2. Product Development and Innovations
Use Case: Rolls-Royce, one of the largest manufacturers of jet engines for airlines and armed forces across the globe, uses
Big Data analytics to analyze how efficient the engine designs are and if there is any need for improvements.
3. Quicker and Better Decision Making
Use Case: Starbucks uses Big Data analytics to make strategic decisions. For example, the company leverages it to decide
whether a particular location would be suitable for a new outlet. It analyzes several different factors, such as
population, demographics, accessibility of the location, and more.
4. Improve Customer Experience
Use Case: Delta Air Lines uses Big Data analysis to improve customer experiences. It monitors tweets to find out its
customers' experience regarding their journeys, delays, and so on. The airline identifies negative tweets and does what's
necessary to remedy the situation. By publicly addressing these issues and offering solutions, the airline builds good
customer relations.
Problem Definition
Problem definition is probably one of the most complex and most heavily neglected stages in the big data analytics
pipeline. In order to define the problem a data product will solve, experience is mandatory, and most aspiring data
scientists have little or no experience in this stage.
Project Description
The objective of this project is to develop a machine learning model to predict the hourly salary of people using their
curriculum vitae (CV) text as input.
Once the problem is defined, it is reasonable to classify it as one of the typical problem types:
Supervised classification
Supervised regression
Unsupervised learning
Learning to rank
The salary-prediction project described above is a supervised regression problem: the CV text is the input and the hourly salary is a continuous target, as sketched below.
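What follows is a minimal sketch of such a model, assuming scikit-learn is available; the CV snippets, salary figures, and pipeline choices are invented for illustration and are not the course's reference implementation.

# Minimal sketch: predict hourly salary from CV text (supervised regression).
# The CV snippets and salaries below are invented purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

cvs = [
    "Senior data engineer, 8 years of experience with Spark and Hadoop",
    "Junior web developer, HTML and CSS, 1 year of experience",
    "Machine learning researcher, PhD, Python, deep learning",
    "Office administrator, scheduling and data entry",
]
hourly_salary = [65.0, 22.0, 70.0, 18.0]  # continuous target (e.g., USD per hour)

# TF-IDF turns each CV into a bag-of-words vector; Ridge fits a regularized linear model.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("regressor", Ridge(alpha=1.0)),
])
model.fit(cvs, hourly_salary)

# Predict the hourly salary for a new, unseen CV.
print(model.predict(["Data analyst, 3 years of experience with SQL and Python"]))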
******************************************************
Data Collection
Data collection improves customer experience and drives better decision-making and overall growth for businesses across
the board. But what is data collection and how do organizations obtain it? This guide will give you a complete overview of
data collection, its methods, tools, tips, challenges, and more.
1. First-Party Data
First-party data is obtained directly from the consumer, through websites, social media platforms, apps, surveys, etc. With
rising concerns about privacy, first-party data has become more relevant than ever. It is highly reliable, accurate, and
valuable as no mediators are involved. Additionally, since companies have exclusive ownership of first-party data, it can be
utilized without restrictions.
First-party data helps you analyze the market and customers' needs. Additionally, this data comes with few usage
restrictions and offers a tailored consumer experience. First-party data includes customer relationship management
data, behavioral data, subscriptions, social media data, customer feedback, consumer purchase data, and survey data.
2. Second-Party Data
This is data collected from a trusted partner. Here, another business collects the data from consumers and then sells or
shares it as part of the partnership. Second-party and first-party data are similar in that they are both collected from reliable
sources. Second-party data is used by companies to develop better insights, build better predictive models, and scale their
businesses.
3. Third-Party Data
Data collected from an outside source, with no direct relationship between the business and the consumers, falls into
this category. This kind of data is often collected from various sources and then aggregated and sold to companies for
marketing purposes like cold calling or mailing lists. Third-party data can help businesses reach a wider audience and
improve their audience targeting. However, there is no guarantee that the data is reliable and collected with adherence to
privacy laws. Thus, caution is critical while dealing with third-party data.
Data collection can help improve services, understand consumer needs, refine business strategies, grow and retain
customers, and even sell the data as second-party data to other businesses at a profit.
Primary Data Collection
Primary data collection is the process of acquiring data directly from the source. This data is highly accurate as it is collected
first-hand. Primary data collection methods can be further categorized as quantitative and qualitative.
1.1: Quantitative methods are based on mathematical calculation and can be used to make reliable analyses and
predictions. Some popular quantitative data collection methods are smoothing techniques, barometric methods, and
time-series analysis; a small smoothing sketch follows this list.
1.2: Qualitative methods are used when the elements are not quantifiable. This is contextual data that is used to identify
the motivations of customers. Some popular qualitative data collection methods are interviews, the Delphi
technique, focus groups, questionnaires, and surveys.
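As a quick illustration of one quantitative technique mentioned above, the sketch below applies a simple moving average to smooth a small monthly sales series; the figures are invented and pandas is assumed to be available.

# Sketch: smoothing an invented monthly sales series with a 3-month moving average.
import pandas as pd

sales = pd.Series(
    [120, 135, 128, 150, 160, 148, 170, 182],
    index=pd.period_range("2023-01", periods=8, freq="M"),
)

# The rolling mean dampens month-to-month noise and makes the underlying trend easier to see.
smoothed = sales.rolling(window=3).mean()
print(pd.DataFrame({"sales": sales, "3-month average": smoothed}))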
Secondary Data Collection
Secondary data collection is the process of collecting data from various internal and external data sources. In this case, the
data is already available, which makes collection less time-consuming. Some secondary data sources include customer
relationship management software, sales reports, financial statements, press releases, the internet, business journals, and
executive summaries.
*********************************************************************************************************************************************
Data cleansing
What is data cleansing?
Data cleansing, also referred to as data cleaning or data scrubbing, is the process of fixing incorrect,
incomplete, duplicate or otherwise erroneous data in a data set. It involves identifying data errors and then
changing, updating or removing data to correct them. Data cleansing improves data quality and helps provide more
accurate, consistent and reliable information for decision-making.
Data cleansing is a key part of the overall data management process and one of the core components of data
preparation work that readies data sets for use in business intelligence (BI) and data science applications. It's
typically done by data quality analysts and engineers or other data management professionals. But data
scientists, BI analysts and business users may also clean data or take part in the data cleansing process for their own
applications.
Data cleansing, data cleaning and data scrubbing are often used interchangeably. For the most part, they're
considered to be the same thing. In some cases, though, data scrubbing is viewed as an element of data
cleansing that specifically involves removing duplicate, bad, unneeded or old data from data sets.
Data scrubbing also has a different meaning in connection with data storage. In that context, it's an automated
function that checks disk drives and storage systems to make sure the data they contain can be read and to identify any
errors.
Business operations and decision-making are increasingly data-driven, as organizations look to use data
analytics to help improve business performance and gain competitive advantages over rivals. As a result, clean
data is a must for BI and data science teams, business executives, marketing managers, sales reps and
operational workers. That's particularly true in retail, financial services and other data-intensive industries, but it
increasingly applies to organizations in all industries.
If data isn't properly cleansed, customer records and other business data may not be accurate and analytics
applications may provide faulty information. That can lead to flawed business decisions, misguided strategies,
missed opportunities and operational problems, which ultimately may increase costs and reduce revenue and
profits. IBM estimated that data quality issues cost organizations in the U.S. a total of $3.1 trillion in 2016, a figure that
is still widely cited.
Data cleansing addresses a range of errors and issues in data sets, including inaccurate, invalid, incompatible
and corrupt data. Some of those problems are caused by human error during the data entry process, while
others result from the use of different data structures, formats and terminology in separate systems throughout
an organization.
The types of issues that are commonly fixed as part of data cleansing projects include the following:
Typos and invalid or missing data. Data cleansing corrects various structural errors in data sets. For
example, that includes misspellings and other typographical errors, wrong numerical entries, syntax errors
and missing values, such as blank or null fields that should contain data.
Inconsistent data. Names, addresses and other attributes are often formatted differently from system to
system. For example, one data set might include a customer's middle initial, while another doesn't. Data
elements such as terms and identifiers may also vary. Data cleansing helps ensure that data is consistent across systems
so it can be analyzed accurately.
Duplicate data. Data cleansing identifies duplicate records in data sets and either removes or merges them
through the use of deduplication measures. For example, when data from two systems is combined, matching records
can be reconciled into a single entry.
Irrelevant data. Some data -- outliers or out-of-date entries, for example -- may not be relevant to analytics
applications and could skew their results. Data cleansing removes irrelevant data from data sets, which
streamlines data preparation and reduces the required amount of data processing and storage resources.
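To make a few of these fixes concrete, here is a minimal pandas sketch on an invented customer table; the column names, normalization rules, and fill strategy are assumptions chosen only for illustration.

# Sketch: cleaning an invented customer data set with pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice Smith", "alice smith", "Bob Jones", "Carol Lee"],
    "state": ["NY", "ny", "CA", None],
    "age": [34, 34, np.nan, 29],
})

# Inconsistent data: normalize formatting so equivalent values match across rows.
df["name"] = df["name"].str.title().str.strip()
df["state"] = df["state"].str.upper()

# Missing values: fill a missing numeric field with the column median and flag unknown states.
df["age"] = df["age"].fillna(df["age"].median())
df["state"] = df["state"].fillna("UNKNOWN")

# Duplicate data: drop records that are identical after normalization.
df = df.drop_duplicates(subset=["name", "state"])
print(df)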
The scope of data cleansing work varies depending on the data set and analytics requirements. For example, a
data scientist doing fraud detection analysis on credit card transaction data may want to retain outlier values
because they could be a sign of fraudulent purchases. But the data cleansing process typically includes the
following actions:
1. Inspection and profiling. First, data is inspected and audited to assess its quality level and identify issues
that need to be fixed. This step usually involves data profiling, which documents relationships between data
elements, checks data quality and gathers statistics on data sets to help find errors, discrepancies and other
problems.
2. Cleaning. This is the heart of the cleansing process, when data errors are corrected and inconsistent, duplicate,
redundant or invalid data is fixed or removed.
3. Verification. After the cleaning step is completed, the person or team that did the work should inspect the
data again to verify its cleanliness and make sure it conforms to internal data quality rules and standards.
4. Reporting. The results of the data cleansing work should then be reported to IT and business executives to
highlight data quality trends and progress. The report could include the number of issues found and corrected, along
with updated metrics on the data's quality levels.
The cleansed data can then be moved into the remaining stages of data preparation, starting with data structuring and
transformation.
A range of data characteristics and attributes can be used to measure data quality levels in connection with data
cleansing efforts. The cleanliness and overall quality of data sets are typically assessed along the following dimensions:
accuracy
completeness
consistency
integrity
timeliness
uniformity
validity
Data management teams create data quality metrics to track those characteristics, as well as things like error
rates and the overall number of errors in data sets. Many also try to calculate the business impact of data
quality problems and the potential business value of fixing them, partly through surveys and interviews with
business executives.
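As a small illustration, the sketch below computes a few such metrics (completeness, duplicate rate, and a crude validity check) on an invented table; the columns and checks are assumptions, not a standard set.

# Sketch: simple data quality metrics for an invented data set.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, 5],
    "email": ["a@x.com", None, "b@y.com", "not-an-email", "c@z.com"],
})

completeness = 1 - df.isna().mean()                           # share of non-missing values per column
duplicate_rate = df.duplicated(subset="customer_id").mean()   # share of repeated IDs
validity = df["email"].str.contains("@", na=False).mean()     # crude email validity rate

print("Completeness per column:\n", completeness)
print("Duplicate ID rate:", duplicate_rate)
print("Valid email rate:", validity)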
Done well, data cleansing provides the following business and data management benefits:
Improved decision-making. With more accurate data, analytics applications can produce better results.
That enables organizations to make more informed decisions on business strategies and operations.
More effective marketing and sales. Customer data is often wrong, inconsistent or out of date. Cleaning
up the data in customer relationship management and sales systems helps improve the effectiveness of marketing
campaigns and sales efforts.
Better operational performance. Clean, high-quality data helps organizations avoid inventory shortages,
delivery snafus and other business problems that can result in higher costs, lower revenues and damaged customer
relationships.
Increased use of data. Data has become a key corporate asset, but it can't generate business value if it
isn't used. By making data more trustworthy, data cleansing helps convince business managers and workers to rely on it
as part of their jobs.
Reduced data costs. Data cleansing stops data errors and issues from further propagating in systems and
analytics applications. In the long term, that saves time and money, because IT and data management
teams don't have to continue fixing the same errors in data sets.
Data cleansing and other data quality methods are also a key part of data governance programs, which aim to
ensure that the data in enterprise systems is consistent and gets used properly. Clean data is one of the cornerstones of
an effective data governance program.
******************************************************************************************************************************
Summarizing data
Data summarization is the science and art of conveying information more effectively and efficiently. Data summarization
is typically numerical, visual or a combination of the two. It is a key skill in data analysis - we use it to provide insights
both to others and to ourselves. Data summarization is also an integral part of exploratory data analysis.
In this chapter we focus on the basic techniques for univariate and bivariate data. Visualization and more advanced data
summarization techniques will be covered in later chapters.
We will be using R and ggplot2 but the contents of this chapter are meant to be tool-agnostic. Readers should use the
programming language and tools that they are most comfortable with. However, do not sacrifice expressiveness or
professionalism for the sake of convenience - if your current toolbox limits you in any way, learn new tools!
Univariate distributions come from various sources. It might be a theoretical distribution, an empirical distribution of a
data sample, a probabilistic opinion from a person, a posterior distribution of a parameter from a Bayesian model, and
many others. Descriptive statistics apply to all of these cases in the same way, regardless of the source of the
distribution.
Before we proceed with introducing the most commonly used descriptive statistics, we discuss their main purpose. The
main purpose of any sort of data summarization technique is to (a) reduce the time and effort of delivering information to
the reader in a way that (b) we lose as little relevant information as possible. That is, to compress the information.
All summarization methods do (a) but we must be careful to choose an appropriate method so that we also get (b).
Summarizing away relevant information can lead to misleading summaries, as we will illustrate with several examples.
Central tendency
The most common first summary of a distribution is its typical value, also known as the location or central tendency of a
distribution.
Given a sample of data, the estimate of the mean is the easiest to compute (we compute the average), but the median
and mode are more robust to outliers - extreme and possibly unrepresentative values.
In the case of unimodal approximately symmetrical distributions, such as the univariate normal distribution, all these
measures of central tendency will be similar and all will be an excellent summary of location. However, if the distribution
is asymmetrical (skewed), they will differ. In such cases it is our job to determine what information we want to convey and
which summary of central tendency is the most appropriate, if any.
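As a quick illustration (shown in Python here, though any tool works), the short sketch below computes all three measures on a small, invented right-skewed sample to show how they diverge.

# Sketch: mean, median, and mode on a small right-skewed (invented) sample.
import statistics

incomes = [20, 22, 22, 25, 27, 30, 250]  # one extreme value pulls the mean upward

print("mean:  ", statistics.mean(incomes))    # about 56.6, dragged up by the outlier
print("median:", statistics.median(incomes))  # 25, robust to the outlier
print("mode:  ", statistics.mode(incomes))    # 22, the most frequent value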
Dispersion
Once location is established, we are typically interested in whether the values of the distribution cluster close to the
location or are spread far from the location.
The most common ways of measuring such dispersion (or spread or scale) of a distribution are:
variance (mean of squared distances from the mean) or, more commonly, standard deviation (square root of the variance,
so we are on the same scale as the measurement),
median absolute deviation (median of absolute distances from the median),
quantile-based intervals, in particular the inter-quartile range (IQR) (the interval between the 1st and 3rd quartiles; 50% of
the mass/density lies in this interval).
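A short sketch computing these dispersion measures on the same kind of small, invented sample, assuming NumPy is available:

# Sketch: common dispersion measures for a small (invented) sample.
import numpy as np

x = np.array([20, 22, 22, 25, 27, 30, 250])

variance = x.var(ddof=1)                      # sample variance
std_dev = x.std(ddof=1)                       # standard deviation, same units as x
mad = np.median(np.abs(x - np.median(x)))     # median absolute deviation
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                                 # inter-quartile range

print(variance, std_dev, mad, iqr)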
******************************************************************************************************************************
Data Exploration
What is Exploratory Data Analysis?
In this article, we will discuss exploratory data analysis, which is one of the basic and essential steps of
a data science project. A data scientist spends almost 70% of the work on a project doing EDA of the
dataset.
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) refers to the method of studying and exploring data sets to
understand their main characteristics, discover patterns, locate outliers, and identify relationships
between variables. EDA is normally carried out as a preliminary step before undertaking more formal
statistical analyses or modeling.
The Foremost Goals of EDA
1. Data Cleaning: EDA involves examining the data for errors, missing values, and inconsistencies. It includes
techniques such as data imputation, handling missing data, and identifying and removing outliers.
2. Descriptive Statistics: EDA uses descriptive statistics to understand the central tendency, variability, and
distribution of variables. Measures like the mean, median, mode, standard deviation, range, and percentiles are
commonly used.
3. Data Visualization: EDA employs visual techniques to represent the data graphically. Visualizations such as
histograms, box plots, scatter plots, line plots, heatmaps, and bar charts help identify patterns, trends, and
relationships within the data.
4. Feature Engineering: EDA allows for the exploration of variables and their transformations to create new features
or derive meaningful insights. Feature engineering can involve scaling, normalization, binning, encoding categorical
variables, and creating interaction or derived variables.
5. Correlation and Relationships: EDA helps discover relationships and dependencies between variables. Techniques
such as correlation analysis, scatter plots, and cross-tabulations offer insights into the strength and direction of
relationships between variables.
6. Data Segmentation: EDA can involve dividing the data into meaningful segments based on certain criteria or
characteristics. This segmentation helps gain insights into specific subgroups within the data and can lead to more
focused analysis.
7. Hypothesis Generation: EDA aids in generating hypotheses or research questions based on the preliminary
exploration of the data. It helps form the foundation for further analysis and model building.
8. Data Quality Assessment: EDA allows for assessing the quality and reliability of the data. It involves checking for
data integrity, consistency, and accuracy to make sure the data is suitable for analysis.
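Several of these goals map to one-line checks in pandas. Below is a minimal sketch on an invented data frame; the column names are assumptions made purely for illustration.

# Sketch: first-pass EDA checks on an invented data frame.
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47, None, 51],
    "income": [30000, 42000, 81000, 39000, 95000],
    "segment": ["A", "B", "A", "B", "A"],
})

print(df.describe())                            # descriptive statistics (goal 2)
print(df.isna().sum())                          # missing values per column (goals 1 and 8)
print(df[["age", "income"]].corr())             # correlation between numeric variables (goal 5)
print(df.groupby("segment")["income"].mean())   # simple data segmentation (goal 6)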
Types of EDA
Depending on the number of variables being analyzed, EDA is often divided into univariate and multivariate analysis, but
several more specific types are commonly distinguished.
EDA refers to the process of examining and analyzing data sets to uncover patterns, identify relationships, and gain
insights. Various EDA techniques can be employed depending on the nature of the data and the goals of the analysis.
Here are some common types of EDA:
1. Univariate Analysis: This type of analysis focuses on examining individual variables in the data set. It involves
summarizing and visualizing a single variable at a time to understand its distribution, central tendency, spread, and
other relevant statistics. Techniques like histograms, box plots, bar charts, and summary statistics are commonly used
in univariate analysis.
2. Bivariate Analysis: Bivariate analysis involves exploring the relationship between two variables. It helps find
associations, correlations, and dependencies between pairs of variables. Scatter plots, line plots, correlation matrices,
and cross-tabulation are commonly used techniques in bivariate analysis.
3. Multivariate Analysis: Multivariate analysis extends bivariate analysis to include more than two variables. It aims
to understand the complex interactions and dependencies among multiple variables in a data set. Techniques such as
heatmaps, parallel coordinates, factor analysis, and principal component analysis (PCA) are used for multivariate
analysis.
4. Time Series Analysis: This type of analysis is mainly applied to data sets that have a temporal component. Time
series analysis involves examining and modeling patterns, trends, and seasonality in the data over time. Techniques
like line plots, autocorrelation analysis, moving averages, and ARIMA (AutoRegressive Integrated Moving Average)
models are commonly used in time series analysis.
5. Missing Data Analysis: Missing data is a common issue in datasets, and it can impact the reliability and validity of
the analysis. Missing data analysis involves identifying missing values, understanding the patterns of missingness, and
using suitable techniques to handle missing data. Techniques such as missing data patterns, imputation strategies, and
sensitivity analysis are employed in missing data analysis.
6. Outlier Analysis: Outliers are data points that significantly deviate from the general pattern of the data. Outlier
analysis involves identifying and understanding the presence of outliers, their potential causes, and their impact on the
analysis. Techniques such as box plots, scatter plots, z-scores, and clustering algorithms are used for outlier analysis.
7. Data Visualization: Data visualization is a critical part of EDA that involves creating visual representations of the
data to facilitate understanding and exploration. Various visualization techniques, including bar charts, histograms,
scatter plots, line plots, heatmaps, and interactive dashboards, are used to represent different kinds of data.
These are just a few examples of the types of EDA techniques that can be employed during data analysis. The choice of
techniques depends on the characteristics of the data, the research questions, and the insights sought from the analysis.
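As one concrete example of the techniques above, the sketch below flags outliers in a small, invented sample using both z-scores and the IQR rule; the 2-standard-deviation threshold is a common convention, not a fixed standard.

# Sketch: outlier analysis with z-scores and the IQR rule on invented data.
import numpy as np

x = np.array([10, 12, 11, 13, 12, 95, 11, 12])

# Z-score rule: flag values more than 2 standard deviations from the mean.
z = (x - x.mean()) / x.std(ddof=1)
print("z-score outliers:", x[np.abs(z) > 2])

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
mask = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
print("IQR outliers:", x[mask])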
*************************************************************************************
Data visualization converts large and small data sets into visuals that are easy for humans to understand and process.
Data visualization tools provide accessible ways to understand outliers, patterns, and trends in the data.
In the world of Big Data, data visualization tools and technologies are required to analyze vast amounts
of information.
Data visualizations are common in everyday life, and they usually appear in the form of graphs and charts. A
combination of multiple visualizations and bits of information is referred to as an infographic.
Data visualizations are used to discover unknown facts and trends. You can see visualizations in the form of
line charts to display change over time. Bar and column charts are useful for observing relationships and
making comparisons. A pie chart is a great way to show parts-of-a-whole. And maps are the best way to
share geographical data visually.
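To illustrate these chart choices, here is a minimal matplotlib sketch with invented numbers: a line chart for change over time, a bar chart for comparisons, and a pie chart for parts of a whole.

# Sketch: three basic chart types with matplotlib, using invented numbers.
import matplotlib.pyplot as plt

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))

# Line chart: change over time.
ax1.plot([2019, 2020, 2021, 2022], [120, 150, 170, 210])
ax1.set_title("Revenue over time")

# Bar chart: comparison between categories.
ax2.bar(["North", "South", "East", "West"], [40, 55, 30, 70])
ax2.set_title("Sales by region")

# Pie chart: parts of a whole.
ax3.pie([45, 30, 25], labels=["Product A", "Product B", "Product C"], autopct="%1.0f%%")
ax3.set_title("Market share")

plt.tight_layout()
plt.show()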
Today's data visualization tools go beyond the charts and graphs used in Microsoft Excel spreadsheets, displaying data
in more sophisticated ways such as dials and gauges, geographic maps, heat maps, pie charts, and fever charts.
American statistician and Yale professor Edward Tufte believes useful data visualizations consist of "complex ideas
communicated with clarity, precision, and efficiency."
To craft an effective data visualization, you need to start with clean data that is well-sourced and complete.
After the data is ready to visualize, you need to pick the right chart.
After you have decided the chart type, you need to design and customize your visualization to your liking.
Simplicity is essential - you don't want to add any elements that distract from the data.
One of the most advanced early examples of statistical graphics came when Charles Minard mapped Napoleon's invasion
of Russia. The map represents the size of the army and the path of Napoleon's retreat from Moscow, and ties that
information to temperature and time scales for a more in-depth understanding of the event.
Computers made it possible to process large amounts of data at lightning-fast speeds. Nowadays, data visualization has
become a fast-evolving blend of art and science that is certain to change the corporate landscape over the next few
years.
Importance of Data Visualization
Data visualization is important because of the way the human brain processes information. Using graphs and charts to
visualize large amounts of complex data is easier than poring over spreadsheets and reports.
Data visualization is an easy and quick way to convey concepts universally, and you can experiment with different views
of the data by making slight adjustments.
Data visualization tools have been essential for democratizing data and analytics and for making data-driven insights
available to workers throughout an organization. They are easier to operate than earlier versions of BI software or
traditional statistical analysis software. This has led to a rise in lines of business implementing data visualization tools
on their own, without support from IT.
Data visualization allows you to interact with data. Google, Apple, Facebook, and Twitter all ask better questions of their
data and make better business decisions by using data visualization.
Here are some of the top data visualization tools that help you visualize data:
1. Tableau
Tableau is a data visualization tool. You can create graphs, charts, maps, and many other graphics.
A Tableau desktop app is available for visual analytics. If you don't want to install Tableau software on your
desktop, a server solution allows you to visualize your reports online and on mobile.
A cloud-hosted service is also an option for those who want the server solution but don't want to set it up
themselves. Tableau's customers include Barclays, Pandora, and Citrix.
2. Infogram
Infogram is also a data visualization tool. Creating a visualization with it involves a few simple steps:
1. First, you choose among many templates and personalize them with additional visualizations like maps,
charts, videos, and images.
2. Then you are ready to share your visualization.
Infogram supports team accounts for journalists and media publishers, branded designs for companies and
enterprises, and classroom accounts for educational projects.
An infographic is a representation of information in a graphic format designed to make the data easily
understandable at a glance. Infographics are used to quickly communicate a message, to simplify the presentation
of large amounts of data, to show data patterns and relationships, and to monitor changes in variables
over time.
Infographics abound in almost any public environment, appearing in traffic signs, subway maps, tag clouds, musical
scores, and weather charts, among a huge number of other possibilities.
3. Chartblocks
Chartblocks is an easy-to-use online tool that requires no coding and builds visualizations from
databases, spreadsheets, and live feeds.
Your chart is created under the hood in HTML5 using the powerful JavaScript library D3.js. Your
visualizations are responsive and compatible with any screen size and device. You can also embed
your charts on any web page and share them on Facebook and Twitter.
4. Datawrapper
Datawrapper is aimed squarely at publishers and journalists. It has been adopted by The Washington Post, VOX, The
Guardian, BuzzFeed, The Wall Street Journal, and Twitter.
Datawrapper is an easy visualization tool, and it requires zero coding. You can upload your data and easily
create and publish a map or a chart. Custom layouts to integrate your visualizations perfectly on your
site and access to local area maps are also available.
5. Plotly
Plotly helps you create slick and sharp charts in just a few minutes, starting from a simple spreadsheet.
Plotly is used at Google, as well as by the US Air Force, Goji, and New York University.
Plotly is a very user-friendly visualization tool that you can get started with in a few minutes. If you are part of
a team of developers that wants to have a crack at it, an API is available for the JavaScript and Python languages.
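For reference, here is a minimal sketch of Plotly's Python API on an invented, spreadsheet-style table; plotly and pandas are assumed to be installed, and the column names are made up for the example.

# Sketch: a simple interactive chart with Plotly's Python API, using invented data.
import pandas as pd
import plotly.express as px

df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May"],
    "visitors": [1200, 1350, 1600, 1580, 1900],
})

# px.line builds an interactive line chart; fig.show() renders it in a browser or notebook.
fig = px.line(df, x="month", y="visitors", title="Monthly site visitors")
fig.show()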