Data Analytics: Transforming Raw Data
Introduction to Data Analytics
Table of Contents
• Introduction
• Closing
Introduction
What is data analytics, and why is it important?
Data is much more than a collection of numbers.
But if you don’t do anything with it, it’s just
that—a nice collection taking up space on your
hard drive. Instead of filing your data away,
never to be seen again, put it to good use and
analyze your data to help tell the story of your
company’s growth.
There are many reasons why you should seek to analyze and understand your data.
For one, having a firm grasp of your company’s metrics can help you optimize your
processes and procedures. With optimal strategies, you’re more likely to see a
significant return on investment and be more efficient.
A second reason? Businesses that succeed are known for taking risks, but they
don’t usually take a chance without having an idea of the outcome. Data analytics
helps lower the dangers of high-stakes decisions by allowing you to review carefully
calculated outcomes.
The results of data analysis can also help with customer retention. Analyzing your
customers’ trends, patterns, and behaviors can help your team better market to those
individuals. When you understand what your customers want according to the data, you
can provide them with precisely what they need.
So, with data analytics, you can uncover your company’s patterns and trends and then
make better, more accurate assumptions, predictions, and conclusions for your team.
Ready to learn how to analyze your data? Let’s take a look at the fundamentals of
data analytics.
It’s essential to understand precisely what data is. Most likely, numbers and figures
come to mind when you hear “data.” But it’s so much more than that. Text snippets,
images, and videos are also classified as data. This is important because it means you
have options regarding the type of data you can collect. For example, you can run a text
or sentiment analysis on social media comments to gain insight into your customers’
thoughts and feelings regarding your product or service. In other words, you can count it
as data if it’s trackable.
Numbers, like dollars earned or lost, are essential to understanding how well your
business is doing. However, metrics like customer satisfaction are also valuable to help
you understand what your business is doing right and how you can improve.
Before we continue with how to implement data analytics in your business, we need to
define some useful terms. Use this section as a reference guide.
• Data storage or data silos: This refers to the places where your
data is stored. Think hard drives or data warehouses.
• Data ethics: Did you receive prior consent from participants in your data sets?
Do your data collection methods align with government policies like GDPR
and CCPA? It’s essential you can answer “yes” to both of these questions before
continuing with your analysis. Otherwise, you could end up in serious legal trouble.
• Classification: This is done when you set parameters for your data. You likely
see classification happening throughout your workday. If you have spam filters
turned on for your email inbox, this is classification. Spam filters are great at
detecting spam and separating it from essential emails.
• Clustering: This is the process of grouping data sets together. We’ll talk about
this later.
• Bias and variance: Bias refers to systematic errors, while variance
refers to errors that occur randomly. To get the best results from your data
analysis, bias and variance are best balanced.
• Overfitting and underfitting: Overfitting occurs when the results of your analysis
match the training data too closely. While this might not seem bad at first, it is.
It can mean your training and testing sets somehow combined and created an
error. You’ll need cross-validation to solve this, ensuring your testing data and
training data are separate sets. Underfitting is the opposite problem: the model is
too simple to capture the patterns in your data.
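The core requirement here, keeping your training and testing data in separate sets, can be sketched in a few lines of plain Python. The dataset below is invented for illustration:

```python
import random

# Hypothetical dataset: (hours_studied, passed) pairs.
data = [(h, h > 4) for h in range(1, 21)]

random.seed(42)
random.shuffle(data)

# Hold out 25% of the records for testing; never train on them.
split = int(len(data) * 0.75)
train, test = data[:split], data[split:]

assert not set(train) & set(test)  # the two sets must stay separate
print(len(train), len(test))  # 15 5
```

Real cross-validation tools repeat this split several times so every record gets a turn in the testing set.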
Artificial intelligence, or AI, is more than a trend; it’s a rapidly evolving technology
that’s here to stay. AI is a field of computer science whose goal is to mimic human
intelligence with machines. This human-like thinking allows the computer to detect patterns,
make predictions, and problem-solve. When introduced to data analytics, AI can help
data analysts quickly and efficiently run analyses on any given dataset.
There are several reasons to use AI in your business practices, particularly for data
analytics. Data analytics uses your business’s data to tell your company’s story. AI,
though, helps tell the complete data-driven story, allowing you and your team to
understand better what happened, why it happened, what’s happening, what’s likely to
happen, and what could happen.
One of the reasons a company chooses to implement data analytics into its processes
is to help it make decisions. Without AI, this responsibility lies solely on the analyst
to look at the data, compute the numbers, and present the options. With large datasets,
this can be challenging and time-consuming. There’s always the chance that a formula is
miscalculated or an essential piece of data is missing.
AI helps alleviate those problems. Artificial intelligence can quickly parse immense
volumes of data. This dramatically improves the accuracy and efficiency of data review,
leaving your analysts with more time to review results and consider what the data says.
AI can help with decision-making, too, as it can easily predict outcomes depending on
the analysis and model you choose. Those are just some of the benefits for your analysts.
Let’s not forget about your customers. AI technologies can learn from customer data and
predict products and services your customers will like based on past purchases.
AI can be a fantastic tool for your business operations. However, there are a few
drawbacks. AI and its algorithms are only as good as your datasets, meaning insufficient
data will lead to inaccurate results. You and your team will need to ensure your data is
ready for your applications, which could take significant time. AI is also not great at
detecting bias in a dataset, so you must ensure your data accurately represents
your customers.
With the help of AI, data analytics is a smart investment for any company, big or small.
But before you decide to run any kind of analysis, you should consider several important
factors, like which kind of analysis to run on your data. There are various types of data
analysis to help you discover the big picture of your company.
You’ll likely need to assign a member of each department as a data manager, too, but
they’ll work to maintain data on a smaller scale. This person can access the necessary
data relevant to their department. They’ll also be able to work closely with the data
management team and become their department’s point of contact for anything data-related.
It’s helpful to think of data as a life cycle. The first step of the cycle is data generation.
Data is generated from various sources, and each source may have relevance to your
business operations. There are three main types of data sources: first-party sources,
second-party sources, and third-party sources.
First-party sources are sources of information that your company generates itself. These
are sources of data where the data relates directly to your business operations. Social
media interactions, transactions and receipts, observations, cookies, and customer
survey results are considered first-party sources. Each source relates directly to your
business and how your customers interact with your websites, products, and services.
Be sure to keep a watchful eye on the data pipeline, too. If you notice a large amount
of insignificant data, it could be that something in the data pipeline is broken, causing
data points to be left out or corrupted before reaching their destination. If something is
broken, you should fix it as soon as possible to ensure the quality of your data.
Data quality assurance also includes validation. This means you should
continually be aware of collection methods to ensure they always comply with
data policies and rules. If not, unethical or illegal collection methods can land
your business in hot water with the federal government.
After the data has been assured for quality, preprocessed, and translated,
the next step is to input the data into the data management system. How
you store and organize your data is a key determining factor of what you
can do with it later on. So, if you haven’t already built or implemented a data
management system, pay particular attention to the next section. In the next
section, we’ll cover the different types of data storage and how your storage
methods can determine your analysis methods.
There are two main types of databases we need to discuss. Those are SQL
and NoSQL databases. An SQL database is a structured, relational database
that requires data to be translated into a readable language. This means the
data is stored and organized in a table or connected tables. This database
allows for easy analysis and modeling because the data is likely already
translated to a language an algorithm can read.
SQL databases are popular amongst data scientists because they follow the
ACID criteria well. Each letter of the acronym describes one of four criteria
necessary for data integrity as data moves throughout the system.
Let’s define ACID before we continue:
1) Atomicity: This term describes data transactions. It means that each transaction
against a dataset is counted as its own transaction. If, for some reason, the
transaction fails, it is not applied to the data. Instead, the data is reverted to its
original state.

2) Consistency: This property ensures that transactions remain consistent across
the database. It also maintains data integrity and ensures the absence of
data corruption.

3) Isolation: Data scientists can quickly encounter problems if multiple data
transactions coincide. Isolation ensures transactions do not interfere with
one another.

4) Durability: Durability refers to the permanence of a transaction. In other words,
once changes are made to a database during a transaction, those changes are
permanent and stored in the database.
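To make atomicity concrete, here is a small sketch using Python’s built-in sqlite3 module. The in-memory database and account balances are invented; the point is that when a transaction fails midway, none of its changes stick:

```python
import sqlite3

# In-memory database for illustration; a real system would use a file or server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

# Atomicity: both steps succeed together, or neither is applied.
try:
    with conn:  # opens a transaction; rolls back automatically on error
        conn.execute("UPDATE accounts SET balance = balance - 80 WHERE name = 'alice'")
        raise RuntimeError("simulated failure mid-transfer")
except RuntimeError:
    pass

# The failed transaction was rolled back, so alice still has her full balance.
balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'"
).fetchone()[0]
print(balance)  # 100
```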
NoSQL databases are slightly different from SQL databases and serve different
purposes. Some datasets can’t be immediately organized, so a structured database
is not the best choice of storage. That’s where NoSQL databases come into play.
NoSQL databases lack defined boundaries, models, and schemas. This means that
data can be stored in large pools on the framework. These large pools are called data
lakes. If you plan to run predictive analytics using AI technologies, particularly machine
learning or deep learning, data lakes are imperative because they can manage a
continuous data stream.
The questions “What happened?”, “Why did it happen?”, “What will happen?” and “What
should we do?” are not all answered by the same type of analysis, though. That’s why
there are various types of data analysis, and it’s essential to choose the correct type of
analysis before you begin answering any of your questions or making predictions.
Let’s pretend you have a complete data set and have no idea where to start with data
analytics. Before you do anything else with your data, like putting it in an algorithm for
forecasting or projections, you need to make sense of it. That’s where exploratory data
analysis comes in. Simply put, exploratory data analysis means looking at the data to
search for patterns and trends.
Data visualization can help you determine other essential information, too, like the
mean, median, mode, and range of your data set. The mean of a data set refers to the
average of your data. So, to find the mean, you’ll add the values in the data set
together and divide the sum by the total number of data points. The value you calculate represents the
average of your data set, or for the example of customers’ ages, the average age of
customers who purchase your products.
You can determine the median of the data set by putting your variables in order from
least to greatest. The median is the value directly in the middle of the data set.
The mode simply refers to the value that is the most common. Continuing with the
customers’ age example, if twelve of twenty customers are 26 years old and the other
eight vary in age, then 26 is the mode because it is the most common value.
Mean, median, and mode all describe the middle of the data set, but in different ways.
The range, however, represents the span between the lowest and highest values. You can
easily find the range of a data set by subtracting the lowest value from the highest.
Understanding the mean, median, mode, and range of your data set is helpful, as it will
provide the basis for what your company can and should expect. However, this is not the
only information you can gather through exploratory data analysis. You’ll want to look
at the outliers or data points that don’t necessarily align with the rest and calculate the
standard deviation to determine how much your data points differ from the average. This
is helpful because it will give you a solid understanding of your client base in the case of
the age example.
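All five measures can be computed with Python’s built-in statistics module. The customer ages below are invented to match the example:

```python
import statistics

# Hypothetical customer ages: twelve 26-year-olds plus eight others.
ages = [26] * 12 + [19, 22, 31, 34, 38, 41, 45, 52]

mean = statistics.mean(ages)         # sum of values / number of values
median = statistics.median(ages)     # middle value when sorted
mode = statistics.mode(ages)         # most common value
value_range = max(ages) - min(ages)  # highest minus lowest
stdev = statistics.stdev(ages)       # spread of points around the mean

print(mode, value_range)  # 26 33
```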
[Figure: bar chart from a beauty products sales order analysis, plotting the number of orders]
[Figure: sample dashboard charts showing product sales shares (smartphones, laptops, LCDs) and revenue by country]
Looking at and graphing your data set for exploratory data analysis can also help you
understand correlation and causation. Studying the correlation and causation of your
data set can help you understand which variables need to be in place for something else
to occur. For example, if you are looking at data related to new customer subscriptions,
you might notice more signups due to a one-day sale.
If exploratory data analysis is completed correctly, you’ll likely have more questions than
answers. Use the results of your exploratory data analysis to help form hypotheses for
further analysis of your data.
Descriptive analysis aims to answer the question, “What happened?” This kind of
analysis is helpful because it uncovers patterns and trends hidden in historical data. It’s
important to note that results from this kind of analysis should not be used to predict
future outcomes (predictions and forecasts are made in a different type of analysis that
we’ll cover later). Instead, descriptive analysis is designed to help make sense of past
operations so we can understand current business operation models.
When conducting descriptive analysis, it is necessary to complete all of the same steps
that you would do for exploratory data analysis. (Do you see why exploratory data
analysis is the first step? It’s the foundation of data analytics!) You’ll need to take the
necessary steps to gather your data from internal or external sources and clean it to
ensure it is usable. However, before you visualize your data, take a moment to explore it.
Data exploration is a vital step of descriptive analysis. This is part of the process where
you will plug your data into a spreadsheet (if it’s not already in one), run statistical
equations, or review it for its apparent characteristics, like trends or patterns. It’s helpful
to use tools, like artificial intelligence or built-in software for your spreadsheet, to help
you analyze your data. Then, when you have a solid understanding of the data, you can
visualize it, summarize it, and present your findings to your team. Hopefully, with your
team members’ input, you can begin to interpret what the data is telling you.
[Figure: scatter plot of library visits]
With the various tools available, descriptive data analysis is a simple process. If you’ve
taken the time to ensure your data is good, your analysis’s results should accurately
describe your business’s past metrics. Descriptive data analysis is helpful, too, because
you can keep track of key performance indicators, or KPIs.
The downside to descriptive data analytics is that it does not answer “why?” It just
describes what has happened. However, you can use the results to understand what is
and is not working for your company. Many stakeholders use the results of descriptive
data analytics to help determine what to do with their investments in your company, as
this kind of analysis is known to reveal red or green flags. This could be problematic if you
have slightly less-than-perfect numbers and skittish stakeholders.
Don’t let the stakeholders’ interest dissuade you from conducting descriptive data
analysis; this analysis is necessary for any business. It is helpful to know and understand
metrics like year-on-year growth, sales revenue and income reporting, shipping logistics,
and sales trends. You’ve likely seen descriptive analytics in action, too, in other ways.
Think social media engagement reporting and web traffic analysis. These are all data
points you can gather from descriptive analytics to help you understand what variables
were in place for your current business operations to exist.
If you want to use data analytics to get an idea of potential data projections and
forecasts, you’ll want to implement predictive data analysis. Let’s take a look at it now.
Before we get into how to use predictive analytics and its benefits, let’s take a minute to
review machine learning and deep learning. Machine learning is an artificial intelligence
technology that uses algorithms and models to make predictions based on the collected
data. Depending on the type of machine learning you use, you may need to program
certain algorithms for your specific data set. Some machine learning technologies do not
need to be programmed and can run as is.
Deep learning is a type of machine learning that processes data similarly to how a human
brain processes information. This type of learning uses neural networks, or connected
neurons that resemble the brain, to recognize complicated patterns and trends that
might have been missed during descriptive data analysis. Deep learning can review text,
pictures, video, or sounds to make predictions and provide valuable insight.
Think of predictive data analysis as a crystal ball. It combines machine learning and
deep learning to analyze patterns and trends in a data set, allowing you and your team
to gain insight into potential outcomes if you change or manipulate variables.
It’s important to understand that you shouldn’t feed data directly into the algorithm
without cleaning it. Missing information can have a significant impact on the
accuracy of your predictions. You wouldn’t want to use predictions made on misleading
information as it could negatively impact your business, defeating the whole purpose of
using predictive analytics in the first place.
Data fed into algorithms for predictive analysis must also be separated into two
groups: a training group and a testing group. The training group should contain as much
information about your data set as possible.
For example, let’s say you own a restaurant and notice a slight uptick in soup sales
on cloudy or rainy days. Because this is purely anecdotal, you want to be sure of your
findings and decide to use predictive data analytics to estimate future sales. To do this,
you should provide the algorithm with the number of bowls of soup sold and the weather
conditions for a set time, including sunny and cloudy or rainy days. Because you already
know how many bowls of soup were sold on cloudy days, you should be able to run the
analysis on the training set and compare results with your true historical data.
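A minimal sketch of that train-then-check idea in plain Python. The sales figures are invented, and real predictive models are far more sophisticated than averaging:

```python
# Hypothetical history: (weather, bowls_of_soup_sold) per day.
history = [
    ("rainy", 42), ("sunny", 18), ("rainy", 39), ("sunny", 21),
    ("cloudy", 35), ("sunny", 17), ("cloudy", 33), ("rainy", 45),
]
train, test = history[:6], history[6:]

# "Train": average bowls sold per weather condition in the training set.
totals = {}
for weather, sold in train:
    totals.setdefault(weather, []).append(sold)
model = {w: sum(v) / len(v) for w, v in totals.items()}

# "Test": compare predictions against held-out days.
for weather, actual in test:
    print(weather, model[weather], actual)

# Predict next week's sales from a weather forecast.
forecast = ["rainy", "rainy", "sunny"]
predicted = sum(model[w] for w in forecast)
print(round(predicted))  # 100
```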
Once the algorithm is trained on your data, you can feed new data into the algorithm,
like the following week’s weather forecast, and get an idea of how many bowls of soup
you might sell in the next week. This lowers the risk of making extra soup that doesn’t sell
because now you have a fairly accurate prediction of projected sales based on your
past numbers.
Lowering and mitigating risks is just one example of how predictive data analysis can
help your business. Predictive data analysis can do much more than just project potential
sales; it works best with real-time data. Many companies use this kind of analysis for
customer retention. Predictive data analysis can help pinpoint potential churn when
run with real-time data. With the analysis results, you and your team can take the
appropriate measures to stop churn before customers reach that point.
If you are an ecommerce business, predictive analytics can help you recommend
new products or services to your customers. Simply provide the algorithm with your
customers’ past behaviors and purchases. The algorithm will match your clients to
products or services based on what other customers with similar behaviors bought in
the past. It can also help prevent fraud by detecting suspicious user activity in your
operation systems, thus keeping your data secure.
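A toy sketch of that recommendation logic in plain Python. The customers and purchase histories are invented, and real systems use far richer similarity measures:

```python
# Hypothetical purchase histories.
purchases = {
    "ana": {"serum", "cleanser", "toner"},
    "ben": {"serum", "cleanser", "sunscreen"},
    "cam": {"shampoo"},
}

def recommend(customer):
    """Suggest items bought by the most similar customer (largest overlap)."""
    mine = purchases[customer]
    best = max(
        (other for other in purchases if other != customer),
        key=lambda other: len(mine & purchases[other]),
    )
    return purchases[best] - mine  # items they bought that this customer hasn't

print(recommend("ana"))  # {'sunscreen'}
```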
While predictive data analysis is suitable for forecasting, using the results to change your
business operations is always a risk because there is a chance variables could change, or
an unexpected hiccup could occur. Prescriptive data analytics accounts for the likelihood
of variables changing and can help make the best recommendations based on your data.
If you’ve ever wanted someone to help you make a business decision, you need to
consider using prescriptive data analysis. Unlike predictive analytics, which answers the
question “what could happen?” prescriptive data analytics helps you understand what
you should do and the outcomes you would face if you followed its recommendation.
Prescriptive data analytics is the most advanced stage of data analytics, helping you
take the guesswork out of your decisions.
Prescriptive data analytics is complex. First, you’ll need to define the question you want
to understand. Then, you’ll need to link the AI-powered algorithm with your data storage
system. This kind of analysis requires continuous historical, real-time, and internal and
external data to give you the most accurate outcomes.
Regression analysis
Regression analysis is a statistical model that depicts the relationship between two
variables: an independent variable and a dependent variable. Regression analysis
models are often considered the “go-to method” for data analytics because they explain
the relationship between the dependent and independent variables. Plus, we can use it to
predict future sales based on our historical data.
Regression analysis models give you an idea of what’s happening, reducing the likelihood
of assumption. It’s essential that you choose independent variables that matter;
otherwise, your models will be filled with insignificant data points. So, be as specific as
possible when determining your independent variables.
This looks like a lot of math, and if math isn’t your strong suit, don’t worry. That’s why
you’ve hired data analysts and use statistical programs. It’s helpful to understand these
terms, though, as they will give you a better understanding of model predictions. Let’s look at
a real-world example to better understand the equation and how to graph the data.
Let’s look at the soup sales again. Except this time, as a business owner, you want to
explore if the time of day impacts the number of soup sales. You love to have a bowl of
soup at lunch, and you want to assume that your customers do, too. However, you paid
attention in stats class and know you should never assume without looking at the
data first.
So, you first collect several weeks’ worth of data and track the number of soup sales
throughout the day. For example, one particular day, you might find that you do not sell
any soup the first hour your restaurant is open. But, in the fifth hour, you sell two bowls
of soup. Track this data across several weeks before running a regression analysis model.
As always, the most accurate data will produce the best results. Bad data will lead to an
inaccurate analysis.
In this example, the number of soup sales is the dependent variable (or Y). It’s called the
dependent variable because it depends on the value of the independent variable (or X),
which is the time of day.
[Figure: scatter plot with a fitted regression line, Y plotted against X]
Armed with weeks’ worth of historical data, including soup sales and the time of day, you
should plot those points on a graph. The graph’s x-axis, or the horizontal axis, represents
the time of day, and the y-axis represents the number of soup sales.
Now that you have a visual representation of your sales, look to see if there is a linear
pattern in your data. This graph shows a positive relationship between the time of
day and soup sales. Your data analyst or the program you’re using can determine the
regression line, which shows the line of best fit for your data. It’s important
to remember that there may be a small chance of error in the regression line. The
regression equation’s error term acts as an extra layer of insurance for estimating sales:
the smaller the error term, the more you can rely on the accuracy of the estimation.
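For the curious, the line of best fit can be computed by hand with ordinary least squares. The hours and sales figures below are invented:

```python
# Hypothetical data: hour of day (x) and bowls of soup sold (y).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 1, 1, 3, 2, 5, 6, 6]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares: slope = covariance(x, y) / variance(x).
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

# Predict sales for hour 9 using y = intercept + slope * x.
prediction = intercept + slope * 9
print(round(slope, 2), round(prediction, 1))  # 0.93 7.2
```

Statistical software reports these same quantities, along with the error term, without the manual arithmetic.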
Linear regression models, like the one pictured above, are among the most common
regression analysis models. But if a linear regression model doesn’t seem to represent
your data fully, there are other types of regression models to run.
There are a few important things we must consider when it comes to regression analysis
models. The first is that correlation does not always mean causation. In the example of
the soup sales, the time of day does not always determine the number of soup sales.
Although the time of day definitely influences when customers are most likely to have
soup, you also have to consider the hunger levels of customers and the type of soup you
are offering that day.
The second thing to remember is that you should be as specific as possible when
choosing your independent variables. Too broad of an independent variable will result
in inconsistent or useless results. The more accurate the data is, the better chance of an
accurate regression analysis.
Cluster analysis
The cluster analysis method, or clustering, involves grouping data points based on
similarities. This means that your data set might not have any target values, but with the
help of algorithms, you can sort your data into groups that make sense.
The same concept can be applied to your data sets. However, instead of manually sorting
your data to look for often hidden similarities, following various cluster models and using
the accompanying algorithm is helpful.
There are six different methods of clustering. Let’s take a look at each of them and
their algorithms.
Connectivity-based clustering
Connectivity-based clustering, also known as hierarchical clustering, centers around the
idea that each piece of data is connected to its neighbor based on its relationship, or
proximal distance, to its neighbor. If you use an algorithm to compute a connectivity-
based cluster, your results will be shown in a dendrogram.
[Figure: dendrogram of U.S. states grouped by hierarchical clustering, with height on the vertical axis]
Looking at the above example, you’ll notice that the overall data is split into several
different groups. Each data point within a group is then divided into another similar
subgroup. The x-axis lists the individual data points, while the y-axis represents the
distance at which clusters merge.
The rule of thumb for connectivity-based clustering is that if the data is similar to an
established cluster, it is sorted into that group. If dissimilar, it is placed farther away
from the established cluster and can form its own cluster if needed.
The algorithm you should use for this kind of clustering is called the BIRCH algorithm, or
Balanced Iterative Reducing and Clustering Using Hierarchies. Running this algorithm
is quick and efficient and works best with large data sets. Unlike other algorithms we’ll
discuss later, this algorithm makes only one pass through the data and needs only a few set
parameters to run well. Before running the algorithm, define the CF (clustering feature) tree
and its threshold. A CF tree consists of each subgroup, or leaf cluster, and each leaf cluster
can only grow as large as the threshold allows. Once a leaf cluster reaches the maximum
number of data points the threshold allows, a new leaf is formed.
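BIRCH itself is involved, but the hierarchical idea behind connectivity-based clustering can be sketched as repeatedly merging the two closest clusters. This is plain Python on one-dimensional invented points, using single-linkage distance:

```python
def agglomerate(points, stop_at=2):
    """Repeatedly merge the two closest clusters until stop_at remain."""
    clusters = [[p] for p in points]  # start with every point in its own cluster
    while len(clusters) > stop_at:
        # Find the pair of clusters with the smallest gap between any members.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                gap = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or gap < best[0]:
                    best = (gap, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

print(agglomerate([1, 2, 9, 10, 3]))  # [[1, 2, 3], [9, 10]]
```

Recording the order and distance of each merge is exactly what a dendrogram visualizes.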
Centroid Clustering
The Centroid Clustering method is the easiest of all clustering methods, making it the
most commonly used clustering technique. The most difficult part of this clustering
model is choosing the number of clusters, or k, you want your data set divided into and
assigning those clusters a vector value. A vector value simply refers to a collection of values
within a group. After those parameters are set, your data is sorted into the given set of
clusters based on how closely it matches the vector value.
[Figure: “Ideal Clustering” scatter plot of data points grouped around centroids]
This clustering method relies heavily on the K-means clustering algorithm. This algorithm
sorts data into groups according to the predefined k clusters. Each time the algorithm is
run, the center value, or centroid, of each cluster may change. The algorithm is run enough
times that optimal centroids are discovered; each optimal centroid is the average of all the
points in its cluster.
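The assignment-and-update loop at the heart of K-means can be sketched in plain Python. This uses one-dimensional invented data and fixed starting centroids for reproducibility; real implementations add random restarts and convergence checks:

```python
def kmeans_1d(points, centroids, iterations=10):
    """Sort 1-D points into len(centroids) clusters by nearest centroid."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: each point joins its nearest centroid's cluster.
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two obvious groups of hypothetical customer ages, with k = 2.
ages = [21, 23, 25, 24, 58, 61, 60, 63]
centroids, clusters = kmeans_1d(ages, centroids=[20, 70])
print(centroids)  # [23.25, 60.5]
```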
Density-based Clustering
Density-based clustering groups data points that fall in dense regions of the data set and
treats points in sparse regions as outliers. The most common algorithm for this method is
DBSCAN, or Density-Based Spatial Clustering of Applications with Noise.

[Figure: DBSCAN clusters with a labeled outlier]
Distribution-Based Clustering
The distribution-based clustering model takes an entirely different approach to clustering
compared to the previous models we have discussed. This model categorizes data into
groups based on the likelihood of a piece of data belonging to that group. Distribution-
based clustering works if there are predetermined central points. Once those points are
identified, it is put into that cluster if the data looks like it might belong.
[Figure: distribution-based clusters on a scatter plot of waiting time versus duration]
There are several algorithms that deal with distribution-based clustering, including
K-means clustering and DBSCAN.
Fuzzy Clustering
What do you do if your dataset can be classified into multiple clusters? That’s where
fuzzy clustering comes in. Fuzzy clustering describes a model where data points are
categorized first based on similarities to the central point. On the second test run, data
that has not yet been categorized is grouped based on the probability of belonging.
The most popular algorithm associated with fuzzy clustering is called the Fuzzy C-Means algorithm. This algorithm assigns each data point a membership value between 0 and 1 for each cluster, based on the point's distance from that cluster's central value.
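For illustration, here is a compact Fuzzy C-Means sketch in plain Python on invented one-dimensional data, using the standard fuzzifier m = 2 and deliberately rough starting centers. A real project would use a maintained library implementation.

```python
def fuzzy_c_means(xs, centers, iterations=50, m=2):
    u = []
    for _ in range(iterations):
        # Membership step: each point gets a degree of belonging (0..1)
        # to every cluster, based on its distance to each center.
        u = []
        for x in xs:
            d = [abs(x - c) + 1e-9 for c in centers]   # guard divide-by-zero
            u.append([1 / sum((d[i] / d[j]) ** (2 / (m - 1))
                              for j in range(len(centers)))
                      for i in range(len(centers))])
        # Center step: centers move to the membership-weighted mean.
        centers = [sum(u[k][i] ** m * xs[k] for k in range(len(xs))) /
                   sum(u[k][i] ** m for k in range(len(xs)))
                   for i in range(len(centers))]
    return centers, u

# Invented 1-D values around 1 and 5, with rough starting centers.
xs = [1.0, 1.2, 0.9, 5.0, 5.2, 4.9]
centers, u = fuzzy_c_means(xs, centers=[0.0, 6.0])
```

The key difference from K-means is that every point keeps a membership in every cluster; a point halfway between two centers would end up with memberships near 0.5 in each.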
Constraint-based Clustering
Algorithms and clustering methods are great for helping your analysts identify hidden
patterns within a data set. However, there are times when your analysts might already
expect how the data should be sorted. In these cases, the constraint-based clustering
model works best. This model allows the analyst to set parameters for the data, including the number of clusters, the number of data points allowed in each cluster, and each cluster's allowed dimensions.
[Figure: must-link and cannot-link constraints between data points. Image Source]
Several algorithms work for constraint-based clustering. It’s important to note that the
algorithms mentioned in this section are not an exhaustive list of possible algorithms,
and various algorithms work for multiple types of data modeling. If you use any software
to help plot your data, your software will likely suggest programmed algorithms to help
you best sort your data points, making it much more efficient for your analysts.
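As a tiny illustration of the idea, the sketch below checks whether a clustering honors "must-link" and "cannot-link" constraints, the two most common constraint types. The point names and cluster labels are hypothetical; full algorithms such as COP-KMeans build checks like these directly into the assignment step.

```python
def satisfies_constraints(labels, must_link, cannot_link):
    # Must-link pairs have to share a cluster label;
    # cannot-link pairs have to end up in different clusters.
    ok_must = all(labels[a] == labels[b] for a, b in must_link)
    ok_cannot = all(labels[a] != labels[b] for a, b in cannot_link)
    return ok_must and ok_cannot

# Hypothetical cluster assignments for four points.
labels = {"p1": 0, "p2": 0, "p3": 1, "p4": 1}
ok = satisfies_constraints(labels,
                           must_link=[("p1", "p2")],
                           cannot_link=[("p1", "p3")])   # True here
```

A constrained clustering algorithm simply refuses (or penalizes) any assignment for which a check like this fails.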
Simply put, a decision tree model consists of one question that points to
multiple options. Normally, the questions have binary answers, like yes or
no. As you move further down the tree, more questions and options may
present themselves. The beauty of a decision tree is you can weigh all of
the outcomes against the risks and rewards.
There are two decision tree types: categorical variable decision trees and
continuous variable decision trees. A categorical variable decision tree
is a simple model. It categorizes data based on the question provided.
Continuous variable decision trees, though, do not always provide a simple answer. These models are called regression trees because the outcome is a continuous value that may depend on multiple variables.
Like a living tree, you can cut out, or prune, branches of your decision tree that are based on inaccurate data (like noise or outliers). Decision trees are easy to understand, but large data sets can quickly become complicated. Therefore, you must ensure an appropriate sample size before running a decision tree model on your data.
[Figure: a decision tree with decision nodes, chance nodes, and end nodes. Image Source]
Decision tree models are great for evaluating business outcomes, and you can also
employ them in your systems to help suggest customer recommendations.
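Weighing outcomes against risks and rewards can be sketched as "rolling back" the tree: chance nodes average their branches by probability, and decision nodes pick the branch with the best value. The payoffs and probabilities below are hypothetical.

```python
def value(node):
    # End nodes carry a payoff; chance nodes average their branches by
    # probability; decision nodes pick the branch with the best value.
    if node["type"] == "end":
        return node["payoff"]
    if node["type"] == "chance":
        return sum(p * value(child) for p, child in node["branches"])
    return max(value(child) for _, child in node["branches"])  # decision node

launch = {"type": "chance", "branches": [
    (0.6, {"type": "end", "payoff": 100}),   # product succeeds
    (0.4, {"type": "end", "payoff": -50}),   # product fails
]}
hold = {"type": "end", "payoff": 10}         # safe, modest payoff
tree = {"type": "decision", "branches": [("launch", launch), ("hold", hold)]}
best = value(tree)   # 0.6*100 + 0.4*(-50) = 40, which beats holding (10)
```

Here the risky launch has an expected value of 40 against 10 for holding, so the rolled-back tree recommends launching despite the 40% chance of a loss.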
This analytical method assumes that time is an independent variable and all other
variables, regardless of what they are, depend upon the continuation of time. Large
amounts of data are collected over a series of evenly spaced intervals to get the most
accurate results from this kind of analysis. The massive volume of data helps to ensure
the consistency and reliability of your results, and it also cuts through noise and ensures
any detected trends or patterns are not influenced by outliers.
Time series data analysis is a simple idea that can quickly become complicated,
depending on how you want to run your data and the metrics you are looking for in your
dataset. This type of analysis is broken into two important classifications: stock time
series data and flow time series data. Think of stock time series data as a snapshot
of the collected data. The data within that snapshot is measured and assessed for its
patterns and trends. While stock time series data is only one period, flow time series data
refers to a continuous data flow over a predetermined time.
[Figure: temperature anomalies relative to the 1981-2010 average, 1860-2020; HadCRU and Berkeley Earth series, with the 95% confidence interval shown for Berkeley Earth. Image Source]
There are also general variations in this kind of analysis. For instance, if you want to look at notable trends, you'll run a functional analysis. If you want to see whether the pattern flows in one direction, you'll run a trend analysis. And if you want to see whether the data is consistent on a seasonal basis, you'll run a seasonal variation analysis.
No matter which kind of analysis you choose, a few key indicators concerning possible
patterns are essential to understand. When a trend is revealed, the data will most likely
follow a specific pattern, either in an increasing or decreasing direction. Some patterns
reveal seasonality, which means the pattern is regular and repeats at specified intervals,
like days or weeks. The data could also reveal a cyclic pattern, meaning the fluctuations
in the data do not follow a designated time. And finally, the pattern could be irregular,
completely random, and unpredictable.
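One simple way to surface a trend through noise is a moving average over evenly spaced observations. The sketch below uses invented weekly figures; real work would typically lean on a library such as pandas or statsmodels.

```python
def moving_average(series, window):
    # Average each run of `window` consecutive observations.
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# Invented weekly figures: noisy, but trending upward.
series = [10, 12, 11, 14, 13, 16, 15, 18]
smoothed = moving_average(series, window=3)
# The smoothed values rise steadily even though the raw data zigzags.
```

The raw series dips and jumps, but the smoothed values climb steadily, revealing the increasing trend; widening the window smooths more aggressively at the cost of detail.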
There are several benefits to running a time series analysis. Besides revealing patterns
and trends dependent on time, this analysis often provides better data visualization.
Depending on the analysis model you choose, either the Box-Jenkins ARIMA model (for a single variable) or the Box-Jenkins multivariate model (for several), you can plot one variable or multiple variables. Be
careful, though, with the temptation to plot all your known variables at once, as too many
variables can quickly become too complicated to understand, making it difficult to spot
any trend.
With the help of algorithms and machine learning, there are multiple ways to categorize and explore your data.
Time series analysis is useful across various business functions. Demand forecasting, financial analysis, resource and inventory management, and risk management can all draw on its results.
Now that you understand the different types of analysis you can do on your data, let’s
look at some best practices to ensure the best data analytics results.
Let’s look at how you can establish and follow best practices for each facet of
data analytics.
Data analytics helps you understand the risks and possible solutions to some of your
business questions. However, you and the team must take the time to understand
the decision, the relevant data, the analysis results, and the criteria for making a
decision. Clearly defining the context of a decision allows for thorough analysis and
understanding of the impact before executing new business initiatives.
Before running an analysis, take time to brainstorm with your team and make a list of
relevant KPIs. Your key metrics should be SMART indicators. SMART means that the
metric is specific, measurable, achievable, relevant, and time-specific. You should also
consider creating a balance of leading and lagging indicators. Leading indicators help predict future performance, and lagging indicators help you understand past performance.
Taking relevant KPIs into account and tracking them will help you and your team
effectively make more intelligent decisions with your data.
Ultimately, your KPIs help you make decisions for improved performance. Continually
auditing your metrics is a good practice to ensure you have the best data in your hands.
It’s a good idea to involve relevant stakeholders, including lawyers and policy analysts, to
help you create your documents. These policies should define what your company plans
to do with the data, who can review it, and what protections whistleblowers have if a violation occurs. You should also consider enacting a policy to help mitigate bias and
ensure fairness.
Again, you'll want to be transparent about how you handle your users' private data. Plus, this adds a layer of public accountability for compliance.
Ethical AI Practices
Artificial intelligence is a smart technology that can quickly become complicated and
hard to understand. Not understanding the AI technology you choose for your business
operations is a big no-no. If you cannot clearly explain your AI to another person, it can
seem like you are hiding a major part of your business practices. Hiding your business
practices, whether you intend to or not, is unethical and should be avoided.
Have you heard the phrase, “Explain it to me like I’m five?” This just means explaining a
topic on a level a five-year-old can understand. Keep this phrase in mind when choosing
an AI technology for your business. Having the ability to clearly explain the technologies
you use leads to transparency and trust.
Regulatory compliance extends to all facets of data analytics. Just as you ensure your data collection methods align with government regulations, you also need to confirm your privacy measures align with government policies. This includes managing the data lifecycle from start to finish. Ensure your data deletion policies are grounded in government rules to comply with regulatory standards.
A chart or a graph can aid your analysts in communicating the data to others,
too. Instead of talking about what they’ve learned by analyzing data, they can
quickly and effectively highlight insights on a graph or a chart to help provide a
visual aid for their audience. Plus, if they’re using this information to present to
new stakeholders, they can develop a diagram or a graph in your business colors
to impress your audience with your company branding.
If you are using data analytics to help plan or set goals, showing business
projections on a graph can be beneficial. A negative or positive trend is easily
recognizable. Depending on the direction, the positive or negative movement
can help your entire team understand what your goals intend to solve.
A graph or chart that is overly complicated is not a good visual at all. A complicated
visual often leads to more questions than answers. An excellent visual seeks to answer
just one specific question and features different colors representing various data
segments. It also keeps the data in context, meaning the graphic does not make
invalid claims.
When creating a visual, also keep your audience in mind. How your audience will view the
data will help you understand the best way to present it.
Effective data visualization relies on choosing the right type of visual to represent your
information. There are all kinds of graphs and charts you can use, including, but not
limited to, a line graph, bar graph, pie chart, histogram, or scatter plot. While each one
essentially performs the same function and is a visual representation, not every graph
or chart is an excellent match for your data. For example, it wouldn’t make sense to
represent revenue growth on a scatter plot. Instead, growth is best represented on a line
graph or bar graph.
Because data analysis helps tell your company’s story, you should choose the best visual
to represent your data accurately.
If you’re worried about visualizing data by hand, don’t sweat it. You can use numerous
techniques and tools to help make data visualization easier for everyone.
For starters, you will need some analyzed data before attempting any type of
visualization. Once you have your analyzed data, you can enter it into a spreadsheet,
like Google Sheets or Microsoft Excel, and use the built-in functions to create a graph
or chart. If you use Excel, grab a copy of our free Excel Graph Generator Template
to simplify the process. Some software, like your CRM or analytics platform, also has built-in features to graph the data it analyzes quickly and easily.
But you have options if you want to make your graphs by hand. Programs like Canva
are simple to use. Canva’s graphics also feature various customizable graphs and
charts. Not only can you make the visual fit your data, but you can also brand it to
your company’s colors. If you need a heat map, input your data into Hotjar and let the
program do the rest.
For an in-depth look at data visualization, check out the free ebook, An Introduction to
Data Visualization.
Closing
At first glance, analyzing your data can
seem daunting. However, with the right data
management system, a responsive algorithm,
and an appropriate method of analysis, you can
dive into the world of data analytics and uncover
patterns and trends within your collected data.