Unit 2 Big Data Analytics
Big Data analytics is a process used to extract meaningful insights, such as hidden patterns, unknown correlations, market
trends, and customer preferences. Big Data analytics provides various advantages—it can be used for better decision
making, preventing fraudulent activities, among other things.
Today, there are millions of data sources that generate data at a very rapid rate. These data sources are present across the
world. Some of the largest sources of data are social media platforms and networks. Let’s use Facebook as an example—it
generates more than 500 terabytes of data every day. This data includes pictures, videos, messages, and more.
Data also exists in different formats, like structured data, semi-structured data, and unstructured data. For example, in a
regular Excel sheet, data is classified as structured data—with a definite format. In contrast, emails fall under semi-
structured, and your pictures and videos fall under unstructured data. All this data combined makes up Big Data.
Organizations apply Big Data analytics in many ways, for example:
Using analytics to understand customer behavior in order to optimize the customer experience
Increasing operational efficiency by understanding where bottlenecks are and how to fix them
These are just a few examples — the possibilities are really endless when it comes to Big Data analytics. It all depends on
how you want to use it in order to improve your business.
1. Risk Management
Use Case: Banco de Oro, a Philippine banking company, uses Big Data analytics to identify fraudulent activities and
discrepancies. The organization leverages it to narrow down a list of suspects or root causes of problems.
2. Product Development and Innovations
Use Case: Rolls-Royce, one of the largest manufacturers of jet engines for airlines and armed forces across the globe, uses
Big Data analytics to analyze how efficient the engine designs are and if there is any need for improvements.
3. Quicker and Better Decision Making
Use Case: Starbucks uses Big Data analytics to make strategic decisions. For example, the company leverages it to decide
whether a particular location would be suitable for a new outlet. It analyzes several different factors, such as
population, demographics, accessibility of the location, and more.
4. Improve Customer Experience
Use Case: Delta Air Lines uses Big Data analysis to improve customer experiences. It monitors tweets to find out its
customers' experience regarding their journeys, delays, and so on. The airline identifies negative tweets and does what's
necessary to remedy the situation. By publicly addressing these issues and offering solutions, the airline builds good
customer relations.
Problem Definition
Problem definition is probably one of the most complex and most heavily neglected stages in the big data analytics
pipeline. In order to define the problem a data product will solve, experience is mandatory, and most aspiring data
scientists have little or no experience in this stage.
Project Description
The objective of this project is to develop a machine learning model to predict the hourly salary of people using their
curriculum vitae (CV) text as input.
Once the problem is defined, it is reasonable to classify it as one of the typical problem types:
Supervised classification
Supervised regression
Unsupervised learning
Learning to rank
The salary-prediction project described above is a supervised regression problem: the CV text is the input and the hourly salary is a continuous target, as sketched below.
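What follows is a minimal sketch of such a model, assuming scikit-learn is available; the CV snippets, salary figures, and pipeline choices are invented for illustration and are not the course's reference implementation.

# Minimal sketch: predict hourly salary from CV text (supervised regression).
# The CV snippets and salaries below are invented purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

cvs = [
    "Senior data engineer, 8 years of experience with Spark and Hadoop",
    "Junior web developer, HTML and CSS, 1 year of experience",
    "Machine learning researcher, PhD, Python, deep learning",
    "Office administrator, scheduling and data entry",
]
hourly_salary = [65.0, 22.0, 70.0, 18.0]  # continuous target (e.g., USD per hour)

# TF-IDF turns each CV into a bag-of-words vector; Ridge fits a regularized linear model.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("regressor", Ridge(alpha=1.0)),
])
model.fit(cvs, hourly_salary)

# Predict the hourly salary for a new, unseen CV.
print(model.predict(["Data analyst, 3 years of experience with SQL and Python"]))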
******************************************************
Data Collection
Data collection improves customer experience and drives better decision-making and overall growth for businesses across
the board. But what is data collection and how do organizations obtain it? This guide will give you a complete overview of
data collection, its methods, tools, tips, challenges, and more.
1. First-Party Data
First-party data is obtained directly from the consumer, through websites, social media platforms, apps, surveys, etc. With
rising concerns about privacy, first-party data has become more relevant than ever. It is highly reliable, accurate, and
valuable as no mediators are involved. Additionally, since companies have exclusive ownership of first-party data, it can be
utilized without restrictions.
First-party data helps you analyze the market and customers' needs. Additionally, this data comes with few usage
restrictions and offers a tailored consumer experience. First-party data includes customer relationship management
data, behavioral data, subscriptions, social media data, customer feedback, consumer purchase data, and survey data.
2. Second-Party Data
This is data collected from a trusted partner. Here, another business collects the data from consumers and then sells or
shares it as part of the partnership. Second-party and first-party data are similar in that they are both collected from reliable
sources. Second-party data is used by companies to develop better insights, build better predictive models, and scale their
businesses.
3. Third-Party Data
Data collected from an outside source, with no direct relationship between the business and the consumers, falls into
this category. This kind of data is often collected from various sources and then aggregated and sold to companies for
marketing purposes like cold calling or mailing lists. Third-party data can help businesses reach a wider audience and
improve their audience targeting. However, there is no guarantee that the data is reliable and collected with adherence to
privacy laws. Thus, caution is critical while dealing with third-party data.
Data collection can help improve services, understand consumer needs, refine business strategies, grow and retain
customers, and even sell the data as second-party data to other businesses at a profit.
Primary Data Collection
Primary data collection is the process of acquiring data directly from the source. This data is highly accurate as it is collected
first-hand. Primary data collection methods can be further categorized as quantitative and qualitative.
1.1: Quantitative methods are based on mathematical calculation and can be used to make reliable analyses and
predictions. Some popular quantitative data collection methods are smoothing techniques, barometric methods, and
time-series analysis; a small smoothing sketch follows this list.
1.2: Qualitative methods are used when the elements are not quantifiable. This is contextual data that is used to identify
the motivations of customers. Some popular qualitative data collection methods are interviews, the Delphi
technique, focus groups, questionnaires, and surveys.
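As a quick illustration of one quantitative technique mentioned above, the sketch below applies a simple moving average to smooth a small monthly sales series; the figures are invented and pandas is assumed to be available.

# Sketch: smoothing an invented monthly sales series with a 3-month moving average.
import pandas as pd

sales = pd.Series(
    [120, 135, 128, 150, 160, 148, 170, 182],
    index=pd.period_range("2023-01", periods=8, freq="M"),
)

# The rolling mean dampens month-to-month noise and makes the underlying trend easier to see.
smoothed = sales.rolling(window=3).mean()
print(pd.DataFrame({"sales": sales, "3-month average": smoothed}))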
Secondary Data Collection
Secondary data collection is the process of collecting data from various internal and external data sources. In this case, the
data is already available, which makes collection less time-consuming. Some secondary data sources include customer
relationship management software, sales reports, financial statements, press releases, the internet, business journals, and
executive summaries.
*********************************************************************************************************************************************
Data cleansing
What is data cleansing?
Data cleansing, also referred to as data cleaning or data scrubbing, is the process of fixing incorrect,
incomplete, duplicate or otherwise erroneous data in a data set. It involves identifying data errors and then
changing, updating or removing data to correct them. Data cleansing improves data quality and helps provide more
accurate, consistent and reliable information for decision-making.
Data cleansing is a key part of the overall data management process and one of the core components of data
preparation work that readies data sets for use in business intelligence (BI) and data science applications. It's
typically done by data quality analysts and engineers or other data management professionals. But data
scientists, BI analysts and business users may also clean data or take part in the data cleansing process for their own
applications.
Data cleansing, data cleaning and data scrubbing are often used interchangeably. For the most part, they're
considered to be the same thing. In some cases, though, data scrubbing is viewed as an element of data
cleansing that specifically involves removing duplicate, bad, unneeded or old data from data sets.
Data scrubbing also has a different meaning in connection with data storage. In that context, it's an automated
function that checks disk drives and storage systems to make sure the data they contain can be read and to identify any
errors.
Business operations and decision-making are increasingly data-driven, as organizations look to use data
analytics to help improve business performance and gain competitive advantages over rivals. As a result, clean
data is a must for BI and data science teams, business executives, marketing managers, sales reps and
operational workers. That's particularly true in retail, financial services and other data-intensive industries, but it
increasingly applies to organizations in all industries.
If data isn't properly cleansed, customer records and other business data may not be accurate and analytics
applications may provide faulty information. That can lead to flawed business decisions, misguided strategies,
missed opportunities and operational problems, which ultimately may increase costs and reduce revenue and
profits. IBM estimated that data quality issues cost organizations in the U.S. a total of $3.1 trillion in 2016, a figure that
is still widely cited.
Data cleansing addresses a range of errors and issues in data sets, including inaccurate, invalid, incompatible
and corrupt data. Some of those problems are caused by human error during the data entry process, while
others result from the use of different data structures, formats and terminology in separate systems throughout
an organization.
The types of issues that are commonly fixed as part of data cleansing projects include the following:
Typos and invalid or missing data. Data cleansing corrects various structural errors in data sets. For
example, that includes misspellings and other typographical errors, wrong numerical entries, syntax errors
and missing values, such as blank or null fields that should contain data.
Inconsistent data. Names, addresses and other attributes are often formatted differently from system to
system. For example, one data set might include a customer's middle initial, while another doesn't. Data
elements such as terms and identifiers may also vary. Data cleansing helps ensure that data is consistent across systems
so it can be analyzed accurately.
Duplicate data. Data cleansing identifies duplicate records in data sets and either removes or merges them
through the use of deduplication measures. For example, when data from two systems is combined, matching records
can be reconciled into a single entry.
Irrelevant data. Some data -- outliers or out-of-date entries, for example -- may not be relevant to analytics
applications and could skew their results. Data cleansing removes irrelevant data from data sets, which
streamlines data preparation and reduces the required amount of data processing and storage resources.
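To make a few of these fixes concrete, here is a minimal pandas sketch on an invented customer table; the column names, normalization rules, and fill strategy are assumptions chosen only for illustration.

# Sketch: cleaning an invented customer data set with pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice Smith", "alice smith", "Bob Jones", "Carol Lee"],
    "state": ["NY", "ny", "CA", None],
    "age": [34, 34, np.nan, 29],
})

# Inconsistent data: normalize formatting so equivalent values match across rows.
df["name"] = df["name"].str.title().str.strip()
df["state"] = df["state"].str.upper()

# Missing values: fill a missing numeric field with the column median and flag unknown states.
df["age"] = df["age"].fillna(df["age"].median())
df["state"] = df["state"].fillna("UNKNOWN")

# Duplicate data: drop records that are identical after normalization.
df = df.drop_duplicates(subset=["name", "state"])
print(df)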
The scope of data cleansing work varies depending on the data set and analytics requirements. For example, a
data scientist doing fraud detection analysis on credit card transaction data may want to retain outlier values
because they could be a sign of fraudulent purchases. But the data cleansing process typically includes the
following actions:
1. Inspection and profiling. First, data is inspected and audited to assess its quality level and identify issues
that need to be fixed. This step usually involves data profiling, which documents relationships between data
elements, checks data quality and gathers statistics on data sets to help find errors, discrepancies and other
problems.
2. Cleaning. This is the heart of the cleansing process, when data errors are corrected and inconsistent, duplicate,
redundant or invalid data is fixed or removed.
3. Verification. After the cleaning step is completed, the person or team that did the work should inspect the
data again to verify its cleanliness and make sure it conforms to internal data quality rules and standards.
4. Reporting. The results of the data cleansing work should then be reported to IT and business executives to
highlight data quality trends and progress. The report could include the number of issues found and corrected, along
with updated metrics on the data's quality levels.
The cleansed data can then be moved into the remaining stages of data preparation, starting with data structuring and
transformation.
A range of data characteristics and attributes can be used to measure data quality levels in connection with data
cleansing efforts. The cleanliness and overall quality of data sets are typically assessed along the following dimensions:
accuracy
completeness
consistency
integrity
timeliness
uniformity
validity
Data management teams create data quality metrics to track those characteristics, as well as things like error
rates and the overall number of errors in data sets. Many also try to calculate the business impact of data
quality problems and the potential business value of fixing them, partly through surveys and interviews with
business executives.
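As a small illustration, the sketch below computes a few such metrics (completeness, duplicate rate, and a crude validity check) on an invented table; the columns and checks are assumptions, not a standard set.

# Sketch: simple data quality metrics for an invented data set.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, 5],
    "email": ["a@x.com", None, "b@y.com", "not-an-email", "c@z.com"],
})

completeness = 1 - df.isna().mean()                           # share of non-missing values per column
duplicate_rate = df.duplicated(subset="customer_id").mean()   # share of repeated IDs
validity = df["email"].str.contains("@", na=False).mean()     # crude email validity rate

print("Completeness per column:\n", completeness)
print("Duplicate ID rate:", duplicate_rate)
print("Valid email rate:", validity)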
Done well, data cleansing provides the following business and data management benefits:
Improved decision-making. With more accurate data, analytics applications can produce better results.
That enables organizations to make more informed decisions on business strategies and operations.
More effective marketing and sales. Customer data is often wrong, inconsistent or out of date. Cleaning
up the data in customer relationship management and sales systems helps improve the effectiveness of marketing
campaigns and sales efforts.
Better operational performance. Clean, high-quality data helps organizations avoid inventory shortages,
delivery snafus and other business problems that can result in higher costs, lower revenues and damaged customer
relationships.
Increased use of data. Data has become a key corporate asset, but it can't generate business value if it
isn't used. By making data more trustworthy, data cleansing helps convince business managers and workers to rely on it
as part of their jobs.
Reduced data costs. Data cleansing stops data errors and issues from further propagating in systems and
analytics applications. In the long term, that saves time and money, because IT and data management
teams don't have to continue fixing the same errors in data sets.
Data cleansing and other data quality methods are also a key part of data governance programs, which aim to
ensure that the data in enterprise systems is consistent and gets used properly. Clean data is one of the cornerstones of
an effective data governance program.
******************************************************************************************************************************
Summarizing data
Data summarization is the science and art of conveying information more effectively and efficiently. Data summarization
is typically numerical, visual or a combination of the two. It is a key skill in data analysis - we use it to provide insights
both to others and to ourselves. Data summarization is also an integral part of exploratory data analysis.
In this chapter we focus on the basic techniques for univariate and bivariate data. Visualization and more advanced data
summarization techniques will be covered in later chapters.
We will be using R and ggplot2 but the contents of this chapter are meant to be tool-agnostic. Readers should use the
programming language and tools that they are most comfortable with. However, do not sacrifice expressiveness or
professionalism for the sake of convenience - if your current toolbox limits you in any way, learn new tools!
Univariate distributions come from various sources. It might be a theoretical distribution, an empirical distribution of a
data sample, a probabilistic opinion from a person, a posterior distribution of a parameter from a Bayesian model, and
many others. Descriptive statistics apply to all of these cases in the same way, regardless of the source of the
distribution.
Before we proceed with introducing the most commonly used descriptive statistics, we discuss their main purpose. The
main purpose of any sort of data summarization technique is to (a) reduce the time and effort of delivering information to
the reader in a way that (b) we lose as little relevant information as possible. That is, to compress the information.
All summarization methods do (a) but we must be careful to choose an appropriate method so that we also get (b).
Summarizing away relevant information can lead to misleading summaries, as we will illustrate with several examples.
Central tendency
The most common first summary of a distribution is its typical value, also known as the location or central tendency of a
distribution.
Given a sample of data, the estimate of the mean is the easiest to compute (we compute the average), but the median
and mode are more robust to outliers - extreme and possibly unrepresentative values.
In the case of unimodal approximately symmetrical distributions, such as the univariate normal distribution, all these
measures of central tendency will be similar and all will be an excellent summary of location. However, if the distribution
is asymmetrical (skewed), they will differ. In such cases it is our job to determine what information we want to convey and
which summary of central tendency is the most appropriate, if any.
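As a quick illustration (shown in Python here, though any tool works), the short sketch below computes all three measures on a small, invented right-skewed sample to show how they diverge.

# Sketch: mean, median, and mode on a small right-skewed (invented) sample.
import statistics

incomes = [20, 22, 22, 25, 27, 30, 250]  # one extreme value pulls the mean upward

print("mean:  ", statistics.mean(incomes))    # about 56.6, dragged up by the outlier
print("median:", statistics.median(incomes))  # 25, robust to the outlier
print("mode:  ", statistics.mode(incomes))    # 22, the most frequent value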
Dispersion
Once location is established, we are typically interested in whether the values of the distribution cluster close to the
location or are spread far from the location.
The most common ways of measuring such dispersion (or spread or scale) of a distribution are:
variance (mean of squared distances from the mean) or, more commonly, standard deviation (square root of the variance,
so we are on the same scale as the measurement),
median absolute deviation (median of absolute distances from the median),
quantile-based intervals, in particular the inter-quartile range (IQR) (the interval between the 1st and 3rd quartiles; 50% of
the mass/density lies in this interval).
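A short sketch computing these dispersion measures on the same kind of small, invented sample, assuming NumPy is available:

# Sketch: common dispersion measures for a small (invented) sample.
import numpy as np

x = np.array([20, 22, 22, 25, 27, 30, 250])

variance = x.var(ddof=1)                      # sample variance
std_dev = x.std(ddof=1)                       # standard deviation, same units as x
mad = np.median(np.abs(x - np.median(x)))     # median absolute deviation
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                                 # inter-quartile range

print(variance, std_dev, mad, iqr)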
******************************************************************************************************************************
Data Exploration
What is Exploratory Data Analysis?
In this article, we will discuss exploratory data analysis, which is one of the basic and essential steps of
a data science project. A data scientist spends almost 70% of the work on a project doing EDA of the
dataset.
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) refers to the method of studying and exploring data sets to
understand their main characteristics, discover patterns, locate outliers, and identify relationships
between variables. EDA is normally carried out as a preliminary step before undertaking more formal
statistical analyses or modeling.
The Foremost Goals of EDA
1. Data Cleaning: EDA involves examining the data for errors, missing values, and inconsistencies. It includes
techniques such as data imputation, handling missing data, and identifying and removing outliers.
2. Descriptive Statistics: EDA uses descriptive statistics to understand the central tendency, variability, and
distribution of variables. Measures like the mean, median, mode, standard deviation, range, and percentiles are
commonly used.
3. Data Visualization: EDA employs visual techniques to represent the data graphically. Visualizations such as
histograms, box plots, scatter plots, line plots, heatmaps, and bar charts help identify patterns, trends, and
relationships within the data.
4. Feature Engineering: EDA allows for the exploration of variables and their transformations to create new features
or derive meaningful insights. Feature engineering can involve scaling, normalization, binning, encoding categorical
variables, and creating interaction or derived variables.
5. Correlation and Relationships: EDA helps discover relationships and dependencies between variables. Techniques
such as correlation analysis, scatter plots, and cross-tabulations offer insights into the strength and direction of
relationships between variables.
6. Data Segmentation: EDA can involve dividing the data into meaningful segments based on certain criteria or
characteristics. This segmentation helps gain insights into specific subgroups within the data and can lead to more
focused analysis.
7. Hypothesis Generation: EDA aids in generating hypotheses or research questions based on the preliminary
exploration of the data. It helps form the foundation for further analysis and model building.
8. Data Quality Assessment: EDA allows for assessing the quality and reliability of the data. It involves checking for
data integrity, consistency, and accuracy to make sure the data is suitable for analysis.
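Several of these goals map to one-line checks in pandas. Below is a minimal sketch on an invented data frame; the column names are assumptions made purely for illustration.

# Sketch: first-pass EDA checks on an invented data frame.
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47, None, 51],
    "income": [30000, 42000, 81000, 39000, 95000],
    "segment": ["A", "B", "A", "B", "A"],
})

print(df.describe())                            # descriptive statistics (goal 2)
print(df.isna().sum())                          # missing values per column (goals 1 and 8)
print(df[["age", "income"]].corr())             # correlation between numeric variables (goal 5)
print(df.groupby("segment")["income"].mean())   # simple data segmentation (goal 6)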
Types of EDA
Depending on the number of variables being analyzed, EDA is often divided into univariate and multivariate analysis, but
several more specific types are commonly distinguished.
EDA refers to the process of examining and analyzing data sets to uncover patterns, identify relationships, and gain
insights. Various EDA techniques can be employed depending on the nature of the data and the goals of the analysis.
Here are some common types of EDA:
1. Univariate Analysis: This type of analysis focuses on examining individual variables in the data set. It involves
summarizing and visualizing a single variable at a time to understand its distribution, central tendency, spread, and
other relevant statistics. Techniques like histograms, box plots, bar charts, and summary statistics are commonly used
in univariate analysis.
2. Bivariate Analysis: Bivariate analysis involves exploring the relationship between two variables. It helps find
associations, correlations, and dependencies between pairs of variables. Scatter plots, line plots, correlation matrices,
and cross-tabulation are commonly used techniques in bivariate analysis.
3. Multivariate Analysis: Multivariate analysis extends bivariate analysis to include more than two variables. It aims
to understand the complex interactions and dependencies among multiple variables in a data set. Techniques such as
heatmaps, parallel coordinates, factor analysis, and principal component analysis (PCA) are used for multivariate
analysis.
4. Time Series Analysis: This type of analysis is mainly applied to data sets that have a temporal component. Time
series analysis involves examining and modeling patterns, trends, and seasonality in the data over time. Techniques
like line plots, autocorrelation analysis, moving averages, and ARIMA (AutoRegressive Integrated Moving Average)
models are commonly used in time series analysis.
5. Missing Data Analysis: Missing data is a common issue in datasets, and it can impact the reliability and validity of
the analysis. Missing data analysis involves identifying missing values, understanding the patterns of missingness, and
using suitable techniques to handle missing data. Techniques such as missing data patterns, imputation strategies, and
sensitivity analysis are employed in missing data analysis.
6. Outlier Analysis: Outliers are data points that significantly deviate from the general pattern of the data. Outlier
analysis involves identifying and understanding the presence of outliers, their potential causes, and their impact on the
analysis. Techniques such as box plots, scatter plots, z-scores, and clustering algorithms are used for outlier analysis.
7. Data Visualization: Data visualization is a critical part of EDA that involves creating visual representations of the
data to facilitate understanding and exploration. Various visualization techniques, including bar charts, histograms,
scatter plots, line plots, heatmaps, and interactive dashboards, are used to represent different kinds of data.
These are just a few examples of the types of EDA techniques that can be employed during data analysis. The choice of
techniques depends on the characteristics of the data, the research questions, and the insights sought from the analysis.
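As one concrete example of the techniques above, the sketch below flags outliers in a small, invented sample using both z-scores and the IQR rule; the 2-standard-deviation threshold is a common convention, not a fixed standard.

# Sketch: outlier analysis with z-scores and the IQR rule on invented data.
import numpy as np

x = np.array([10, 12, 11, 13, 12, 95, 11, 12])

# Z-score rule: flag values more than 2 standard deviations from the mean.
z = (x - x.mean()) / x.std(ddof=1)
print("z-score outliers:", x[np.abs(z) > 2])

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
mask = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
print("IQR outliers:", x[mask])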
*************************************************************************************
Data visualization converts large and small data sets into visuals that are easy for humans to understand and process.
Data visualization tools provide accessible ways to understand outliers, patterns, and trends in the data.
In the world of Big Data, data visualization tools and technologies are required to analyze vast amounts
of information.
Data visualizations are common in everyday life, and they usually appear in the form of graphs and charts. A
combination of multiple visualizations and bits of information is referred to as an infographic.
Data visualizations are used to discover unknown facts and trends. You can see visualizations in the form of
line charts to display change over time. Bar and column charts are useful for observing relationships and
making comparisons. A pie chart is a great way to show parts-of-a-whole. And maps are the best way to
share geographical data visually.
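To illustrate these chart choices, here is a minimal matplotlib sketch with invented numbers: a line chart for change over time, a bar chart for comparisons, and a pie chart for parts of a whole.

# Sketch: three basic chart types with matplotlib, using invented numbers.
import matplotlib.pyplot as plt

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))

# Line chart: change over time.
ax1.plot([2019, 2020, 2021, 2022], [120, 150, 170, 210])
ax1.set_title("Revenue over time")

# Bar chart: comparison between categories.
ax2.bar(["North", "South", "East", "West"], [40, 55, 30, 70])
ax2.set_title("Sales by region")

# Pie chart: parts of a whole.
ax3.pie([45, 30, 25], labels=["Product A", "Product B", "Product C"], autopct="%1.0f%%")
ax3.set_title("Market share")

plt.tight_layout()
plt.show()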
Today's data visualization tools go beyond the charts and graphs used in Microsoft Excel spreadsheets, displaying data
in more sophisticated ways such as dials and gauges, geographic maps, heat maps, pie charts, and fever charts.
American statistician and Yale professor Edward Tufte believes useful data visualizations consist of "complex ideas
communicated with clarity, precision, and efficiency."
To craft an effective data visualization, you need to start with clean data that is well-sourced and complete.
After the data is ready to visualize, you need to pick the right chart.
After you have decided the chart type, you need to design and customize your visualization to your liking.
Simplicity is essential - you don't want to add any elements that distract from the data.
One of the most advanced early examples of statistical graphics came when Charles Minard mapped Napoleon's invasion
of Russia. The map represents the size of the army and the path of Napoleon's retreat from Moscow, and ties that
information to temperature and time scales for a more in-depth understanding of the event.
Computers made it possible to process large amounts of data at lightning-fast speeds. Nowadays, data visualization has
become a fast-evolving blend of art and science that is certain to change the corporate landscape over the next few
years.
Importance of Data Visualization
Data visualization is important because of the way the human brain processes information. Using graphs and charts to
visualize large amounts of complex data is easier than poring over spreadsheets and reports.
Data visualization is an easy and quick way to convey concepts universally, and you can experiment with different views
of the data by making slight adjustments.
Data visualization tools have been essential for democratizing data and analytics and for making data-driven insights
available to workers throughout an organization. They are easier to operate than earlier versions of BI software or
traditional statistical analysis software. This has led to a rise in lines of business implementing data visualization tools
on their own, without support from IT.
Data visualization allows you to interact with data. Google, Apple, Facebook, and Twitter all ask better questions of their
data and make better business decisions by using data visualization.
Here are some of the top data visualization tools that help you visualize data:
1. Tableau
Tableau is a data visualization tool. You can create graphs, charts, maps, and many other graphics.
A Tableau desktop app is available for visual analytics. If you don't want to install Tableau software on your
desktop, a server solution allows you to visualize your reports online and on mobile.
A cloud-hosted service is also an option for those who want the server solution but don't want to set it up
themselves. Tableau's customers include Barclays, Pandora, and Citrix.
2. Infogram
Infogram is also a data visualization tool. Creating a visualization with it involves a few simple steps:
1. First, you choose among many templates and personalize them with additional visualizations like maps,
charts, videos, and images.
2. Then you are ready to share your visualization.
Infogram supports team accounts for journalists and media publishers, branded designs for companies and
enterprises, and classroom accounts for educational projects.
An infographic is a representation of information in a graphic format designed to make the data easily
understandable at a glance. Infographics are used to quickly communicate a message, to simplify the presentation
of large amounts of data, to show data patterns and relationships, and to monitor changes in variables
over time.
Infographics abound in almost any public environment, appearing in traffic signs, subway maps, tag clouds, musical
scores, and weather charts, among a huge number of other possibilities.
3. Chartblocks
Chartblocks is an easy-to-use online tool that requires no coding and builds visualizations from
databases, spreadsheets, and live feeds.
Your chart is created under the hood in HTML5 using the powerful JavaScript library D3.js. Your
visualizations are responsive and compatible with any screen size and device. You can also embed
your charts on any web page and share them on Facebook and Twitter.
4. Datawrapper
Datawrapper is aimed squarely at publishers and journalists. It has been adopted by The Washington Post, VOX, The
Guardian, BuzzFeed, The Wall Street Journal, and Twitter.
Datawrapper is an easy visualization tool, and it requires zero coding. You can upload your data and easily
create and publish a map or a chart. Custom layouts to integrate your visualizations perfectly on your
site and access to local area maps are also available.
5. Plotly
Plotly helps you create slick and sharp charts in just a few minutes, starting from a simple spreadsheet.
Plotly is used at Google, as well as by the US Air Force, Goji, and New York University.
Plotly is a very user-friendly visualization tool that you can get started with in a few minutes. If you are part of
a team of developers that wants to have a crack at it, an API is available for the JavaScript and Python languages.
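For reference, here is a minimal sketch of Plotly's Python API on an invented, spreadsheet-style table; plotly and pandas are assumed to be installed, and the column names are made up for the example.

# Sketch: a simple interactive chart with Plotly's Python API, using invented data.
import pandas as pd
import plotly.express as px

df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May"],
    "visitors": [1200, 1350, 1600, 1580, 1900],
})

# px.line builds an interactive line chart; fig.show() renders it in a browser or notebook.
fig = px.line(df, x="month", y="visitors", title="Monthly site visitors")
fig.show()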