Introduction to Data Science
Chapter One
Introduction
Data science encompasses a set of principles, problem definitions, algorithms, and processes for
extracting non-obvious and useful patterns from large data sets. Many of the elements of data
science have been developed in related fields such as machine learning and data mining. In fact, the
terms data science, machine learning, and data mining are often used interchangeably. The
commonality across these disciplines is a focus on improving decision-making through the analysis
of data. However, although data science borrows from these other fields, it is broader in scope.
Machine learning (ML) focuses on the design and evaluation of algorithms for extracting patterns
from data. Data mining generally deals with the analysis of structured data and often implies an
emphasis on commercial applications. Data science takes all of these considerations into account
but also takes up other challenges, such as the capturing, cleaning, and transforming of unstructured
social media and web data; the use of big data technologies to store and process big, unstructured
data sets; and questions related to data ethics and regulation.
Using data science, we can extract different types of patterns. For example, we might want to
extract patterns that help us to identify groups of customers exhibiting similar behavior and tastes.
Skill Tracks of Data Science
Today’s data scientists have vastly different backgrounds, yet each conceptualizes the field, much like the blind men and the elephant, based on his or her professional training and application area. To make matters worse, most of us are not even fully aware of our own conceptualizations, much less the uniqueness of the experience from which they are derived. Data science has three main skill tracks: engineering, analysis, and modeling/inference.
Fig 1 Skill tracks of data science
There are representative skills in each track, and different tracks and combinations of tracks define different roles in data science. When people talk about machine learning and AI algorithms, they often overlook the critical data engineering work that makes everything possible; data engineering is the unseen part of the iceberg below the water surface.
Does your company need a data scientist? You are not ready for a data scientist if you don't have a data engineer yet: you need the ability to get data before you can make sense of it. If you only deal with small data sets in well-formatted files, you may be able to get by with plain text files such as CSV (comma-separated values) or even spreadsheets. As data increases in volume, variety, and velocity, data engineering becomes a sophisticated discipline in its own right.
Engineering
Data engineering is the foundation that makes everything else possible (figure 1.2). It mainly involves building the data pipeline infrastructure. In the (not that) old days, when data was stored on local servers, computers, or other devices, building the data infrastructure was a massive IT project that involved the software and server hardware to store the data and the ETL (extract, transform, and load) process. With the development of cloud computing, the new norm is to store and compute data in the cloud. Data engineering today is, at its core, software engineering with data flow as the focus, and the fundamental building block for automation is a data pipeline maintained through modular, well-commented code and version control.
Figure 2 Engineering track
1) Data Environment
Designing and setting up the entire environment to support the data science workflow is the prerequisite for data science projects. It may include setting up data storage in the cloud, a Kafka platform, Hadoop and Spark clusters, and so on. Every company has unique data conditions and needs, so the data environment will differ depending on the size of the data, the update frequency, the complexity of the analytics, compatibility with the back-end infrastructure, and (of course) the budget.
2) Data Management
Automated data collection is a common task. Depending on the stage of the company and the type of industry you are in, it may include parsing logs, web scraping, API queries, and interrogating data streams. Data management also includes constructing data schemas to support analytics and modeling needs, and ensuring that data is correct, standardized, and documented.
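As a minimal sketch of an automated API query, the snippet below polls a hypothetical REST endpoint in Python; the URL, token, and payload fields are placeholders rather than a real service.

import json
import requests  # third-party HTTP client

API_URL = "https://api.example.com/v1/events"        # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <YOUR_TOKEN>"}    # placeholder credential

def pull_events(since):
    """Query the API for events newer than `since` and return the parsed records."""
    response = requests.get(API_URL, headers=HEADERS, params={"since": since}, timeout=30)
    response.raise_for_status()        # fail loudly on HTTP errors
    return response.json()["events"]   # assumed payload structure

if __name__ == "__main__":
    with open("events.jsonl", "a", encoding="utf-8") as f:
        for record in pull_events(since="2024-01-01"):
            f.write(json.dumps(record) + "\n")   # append one JSON record per line

A scheduler (cron, Airflow, and so on) would then run such a script on a regular cadence as part of the automated collection pipeline.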
3) Production
If you want to integrate a model or analysis into the production system, you have to automate all of the data handling steps. This involves the whole pipeline, from data access and preprocessing through modeling to final deployment, and the system has to work smoothly with all of the existing software stacks. It therefore requires monitoring through robust measures, such as rigorous error handling, fault tolerance, and graceful degradation, to make sure the system keeps running and users stay happy.
Analysis
Analysis turns raw information into insights in a fast and often exploratory way. In general, an
analyst needs to have decent domain knowledge, do exploratory analysis efficiently, and present
the results using storytelling.
Figure 3 Analysis track
(1) Domain knowledge
Domain knowledge is the understanding of the organization or industry where you apply data
science. You can’t make sense of data without context. Some questions about the context are:
• What are the critical metrics for this kind of business?
• What are the business questions?
• What type of data do they have, and what does the data represent?
• How to translate a business need to a data problem?
• What has been tried before, and with what results?
• What are the accuracy-cost-time trade-offs?
• How can things fail?
• What are other factors not accounted for?
• Which assumptions are reasonable, and which are faulty?
(2) Exploratory Analysis
This type of analysis is about exploration and discovery. Rigorous conclusions are not the primary driver: the goal is to surface insights driven by correlation, not causation. Establishing causation requires more advanced statistical skills and is much more expensive in time and resources. Instead, this role helps the team look at as much data as possible so that decision-makers can get a sense of what is worth pursuing further. It often involves different ways to slice and aggregate data. An important caveat is to be careful not to draw conclusions beyond what the data supports. You do not need to write robust, production-level code to perform well in this role.
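For example, much of the "slice and aggregate" work can be done with a few lines of pandas; the toy columns below (segment, region, revenue) are made up for illustration.

import pandas as pd

# Toy transaction data; in practice this would come from a database or CSV export.
df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "C"],
    "region":  ["east", "west", "east", "west", "east"],
    "revenue": [120.0, 80.0, 200.0, 150.0, 90.0],
})

east = df[df["region"] == "east"]                                      # slice: keep one region
summary = east.groupby("segment")["revenue"].agg(["count", "mean"])    # aggregate by segment
print(summary)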
(3) Storytelling
Storytelling with data is critical to deliver insights and drive better decision making. It is the art of
telling people what the numbers signify. It usually requires data summarization, aggregation, and
visualization. It is crucial to answering the following questions before you begin down the path of
creating a data story.
• Who is your audience?
• What do you want your audience to know or do?
• How can you use data to help make your point?
A business-friendly report or an interactive dashboard is the typical outcome of the analysis.
Modeling/Inference
Modeling/Inference is a process that dives deeper into the data to discover patterns that are not
easily seen. It may be the most misunderstood track. When the general public thinks about data
science, the first thing that comes to mind might be fancy machine learning models. Despite the
over-representation of machine learning in the public’s mind, the truth is that you don’t have to
use machine learning to be a data scientist. Even data scientists who use machine learning in their
work spend less than 20% of their time working on machine learning. They spend most of their
time communicating with different stakeholders and collecting and cleaning data.
This track mainly focuses on three problems:
1) prediction,
2) explanation, and
3) causal inference.
Prediction focuses on forecasting what will happen based on what has happened; understanding each variable's role is not a concern. Black-box models, such as ensemble methods and deep learning, are often used to make predictions. Example problems are image recognition, machine translation, and recommendation. The next level of the ladder, intervention, requires model interpretability. Questions on this level involve not just seeing but changing; the question pattern is "what happens if I do ...?" For example, product managers often need to prioritize a list of features by user preference: they need to know what happens if we build feature A instead of feature B. Causal inference sits on the third level, the counterfactual. When an experiment is not possible and the cost of a wrong decision is too high, you need to use the existing data to answer a counterfactual question: what would have happened if we had acted differently?
Figure 4 Modeling/Inference
If we look at this track through the lens of the technical methods used, there are three types.
(1) Supervised learning
In supervised learning, each sample corresponds to a response measurement. There are two flavors
of supervised learning: regression and classification. In regression, the response is a real number,
such as the total net sales in 2017 for a company or the yield of wheat next year for a state. The goal
for regression is to approximate the response measurement. In classification, the response is a class
label, such as a dichotomous response of yes/no. The response can also have more than two
categories, such as four segments of customers. A supervised learning model is a function that maps
some input variables (X) with corresponding parameters (beta) to a response (y). The modeling
process is to adjust the values of the parameters so that the mapping fits the given responses; in other words, to minimize the discrepancy between the given responses and the model output. When the response y is a real-valued number, it is intuitive to define the discrepancy as the squared difference between the model output and the response. When y is categorical, there are other ways to measure the difference, such as the area under the receiver operating characteristic curve (AUC) or information gain.
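A minimal sketch of this setup using scikit-learn on synthetic data is shown below: a linear regression for a real-valued response and a logistic regression for a class label (the data and parameters are invented for illustration).

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # input variables X
beta = np.array([1.5, -2.0, 0.5])      # "true" parameters, unknown in practice

# Regression: real-valued response, fitted by minimizing the squared difference.
y_real = X @ beta + rng.normal(scale=0.1, size=200)
reg = LinearRegression().fit(X, y_real)
print("estimated parameters:", reg.coef_)

# Classification: dichotomous yes/no response.
y_class = (X @ beta > 0).astype(int)
clf = LogisticRegression().fit(X, y_class)
print("training accuracy:", clf.score(X, y_class))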
(2) Unsupervised learning
In unsupervised learning, there is no target variable. For a long time, the machine learning
community overlooked unsupervised learning except clustering. Moreover, many researchers
thought that clustering was the only form of unsupervised learning. One reason is that it is hard to
define the goal of unsupervised learning explicitly. Unsupervised learning can be used to do the
following:
• Identify a good internal representation or pattern of the input that is useful for subsequent supervised or reinforcement learning, such as finding clusters;
• Provide compact, low-dimensional representations of the input as a dimension-reduction tool, such as factor analysis;
• Provide a reduced number of uncorrelated learned features from the original variables, such as principal component regression.
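As a small sketch of the dimension-reduction use case, the example below applies principal component analysis to synthetic correlated features (the data is generated on the spot for illustration).

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# 100 samples of 10 correlated features and no target variable.
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 10)) + rng.normal(scale=0.05, size=(100, 10))

pca = PCA(n_components=2)
X_low = pca.fit_transform(X)               # compact, low-dimensional representation
print("explained variance ratio:", pca.explained_variance_ratio_)
print("reduced shape:", X_low.shape)       # (100, 2)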
(3) Customized model development
In most cases, after a business problem has been fully translated into a data science problem, a data scientist can solve it by applying out-of-the-box algorithms to the right data. But in some situations there isn't enough data to use any standard machine learning model, or the question doesn't fit neatly into the specifications of existing tools, or the model needs to incorporate prior domain knowledge. A data scientist may then need to develop new models to accommodate the subtleties of the problem at hand. For example, Bayesian models let you encode domain knowledge as a prior distribution in the modeling process.
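As a toy illustration of folding domain knowledge into a prior, the Beta-Binomial model below blends a prior belief about a conversion rate with a small observed sample; the numbers are invented for the example.

# Prior Beta(a, b) encodes the domain belief that the rate is high (around 0.8).
prior_a, prior_b = 8, 2
successes, trials = 3, 10          # small observed sample suggesting a lower rate

post_a = prior_a + successes
post_b = prior_b + (trials - successes)
posterior_mean = post_a / (post_a + post_b)
print(f"posterior mean: {posterior_mean:.2f}")   # 0.55, a compromise between prior and data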
Here is a list of questions that can help you decide which type of technique to use:
• Is your data labeled? This one is straightforward, since supervised learning needs labeled data.
• Do you want to deploy your model at scale? There is a fundamental difference between building and deploying models. It is like the difference between making bread and making a bread machine. One is a baker who will mix and bake ingredients according to recipes to make a variety of bread. One is a machine builder who builds a machine to automate the process and produce bread at scale.
• Is your data easy to collect? One of the major sources of cost in deploying machine learning is collecting, preparing, and cleaning the data, because model maintenance includes continuously collecting data to keep the model updated. If the data collection process requires too much human labor, the maintenance cost can be too high.
• Does your problem have a unique context? If so, you may not be able to find any off-the-shelf
method that can directly apply to your question and need to customize the model.
Tasks Data Science can solve
Data science is not a panacea, and there are problems data science can't help with. It is best to make that judgment as early in the analytics cycle as possible. First and foremost, after careful evaluation of the request, the data availability, the computation resources, and the modeling details, we need to tell customers, clients, and stakeholders honestly and clearly when we think data analytics can't answer their question. Often, we can tell them what we can do as an alternative. It is essential to "negotiate" what data science can do specifically; simply answering "we cannot do what you want" will end the collaboration. Listed below are some of the prerequisites that should be fulfilled before addressing a task using data science:
1. The question needs to be specific enough
Let us look at the two examples below:
• Question 1: How can I increase product sales?
• Question 2: Is the new promotional tool introduced at the beginning of this year boosting the annual sales of P1197 (a corn seed product) in Iowa and Wisconsin?
It is easy to see the difference between the two questions. Question 1 is grammatically correct, but it is not suitable for data analysis to answer. Why? It is too general. What is the response variable here? Product sales? Which product? Is it annual sales or monthly sales? What are the candidate predictors? We can hardly get any useful information from the question.
In contrast, question 2 is much more specific. From the analysis point of view, the response variable
is clearly “annual sales of P1197 in Iowa and Wisconsin”. Even if we don’t know all the
predictors, the variable of interest is “the new promotional tool introduced early this year.” We want
to study the impact of the promotion on sales. We can start there and figure out other variables that
need to be included in the model.
As data scientists, we may start with general questions from customers, clients, or stakeholders and eventually get to more specific, data-science-solvable questions through a series of communication, evaluation, and negotiation. Effective communication and in-depth knowledge about the business problem are essential to converting a general business question into a solvable analytical problem.
Domain knowledge helps data scientists communicate using the language other people can
understand and obtain the required context.
2. Accurate and relevant data
One cannot make a silk purse out of a sow's ear. Data scientists need relevant and accurate data. The supply problem mentioned above is a case in point: there was relevant data, but it was not sound, and all the later analytics based on that data was built on sand. Of course, data almost always contains noise, but the noise has to stay within a certain range. Generally speaking, the accuracy requirement for the independent variables of interest and the response variable is higher than for other variables. For question 2 above, these are the variables related to the "new promotion" and the "sales of P1197".
The data also has to be relevant to the question. If we want to predict which product consumers are most likely to buy in the next three months, we need historical purchasing data: the last purchase date, the invoice amount, coupons used, and so on. Information about customers' credit card numbers, ID numbers, and email addresses will not help much.
Often, data quality is more important than quantity.
The following are the tasks that can be carried out using data science:
1. Description
2. Comparison
3. Clustering
4. Classification
5. Regression
6. Optimization
1. Description
The primary analytic task is to summarize and explore a data set with descriptive statistics (mean, standard deviation, and so forth) and visualization methods. It is the most straightforward task and yet the most crucial and common one: we need to describe and explore the dataset before moving on to more complex analysis. For problems such as customer segmentation, after we cluster the sample, the next step is to figure out each cluster's profile by comparing the descriptive statistics of the various variables. Questions of this kind are:
• What is the annual income distribution?
• Are there any outliers?
• What are the mean active days of different accounts?
Data description is often used to check data, find the appropriate data preprocessing method, and
demonstrate the model results.
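In pandas, most of these descriptive questions reduce to a few calls; the column names below are illustrative only.

import pandas as pd

accounts = pd.DataFrame({
    "annual_income": [42000, 55000, 61000, 48000, 250000],   # note the possible outlier
    "active_days":   [12, 30, 25, 18, 29],
    "account_type":  ["basic", "basic", "premium", "basic", "premium"],
})

print(accounts["annual_income"].describe())                     # income distribution summary
print(accounts.groupby("account_type")["active_days"].mean())   # mean active days by account type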
2. Comparison
The first common modeling problem is to compare different groups. Is A better in some way than B? Or, with more groups: is there any difference among A, B, and C in a particular aspect? Here are some examples:
• Are males more inclined to buy our products than females?
• Are there any differences in customer satisfaction across business districts?
• Do soybeans carrying a particular gene have higher oil content?
For such problems, we usually start with summary statistics and visualization by group. After a preliminary visualization, we can test the differences between the treatment and control groups statistically. The most commonly used statistical tests are the chi-square test, the t-test, and ANOVA; there are also Bayesian alternatives. In the biology industry, for example in new drug development and crop breeding, fixed/random/mixed effect models are standard techniques.
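For instance, a two-sample t-test with SciPy on simulated satisfaction scores for two districts looks like this (the data is synthetic and the effect size is invented).

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
district_a = rng.normal(loc=7.2, scale=1.0, size=80)   # satisfaction scores, district A
district_b = rng.normal(loc=6.8, scale=1.0, size=80)   # satisfaction scores, district B

t_stat, p_value = stats.ttest_ind(district_a, district_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")           # a small p-value suggests a real difference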
3. Clustering
Clustering is a widespread task. It involves grouping similar items into clusters and can answer questions like:
• How many reasonable customer segments are there based on historical
purchase patterns?
• How are the customer segments different from each other?
Please note that clustering is unsupervised learning; there are no response variables. The most
common clustering algorithms include K-Means and Hierarchical Clustering.
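A minimal K-Means sketch on made-up purchase features (order frequency and average basket size) might look as follows.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Two synthetic purchase features for two latent customer groups.
X = np.vstack([rng.normal([2, 20], 1.0, size=(50, 2)),
               rng.normal([10, 80], 2.0, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(kmeans.labels_))
print("cluster centers:", kmeans.cluster_centers_)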
4. Classification
For classification problems, there are one or more label columns to define the ground truth of
classes. We use other features of the training dataset as explanatory variables for model training.
We can use the trained classifier to predict the labels of a new observation. Here are some example
questions:
• Is this customer likely to buy our product?
• Is the borrower going to pay us back?
• Is it spam email or not?
There are hundreds of different classifiers. In practice, we do not need to try all of them, only a handful of models that generally perform well. For example, the random forest algorithm is often used as a baseline model to set model performance expectations.
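A random-forest baseline in scikit-learn, on synthetic labeled data standing in for a "will buy / won't buy" problem, is sketched below.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)   # synthetic labeled data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

baseline = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("test accuracy:", baseline.score(X_test, y_test))   # sets the performance expectation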
5. Regression
In general, regression deals with questions like "how much is it?" and returns a numerical answer. In some cases, the model output needs to be constrained, for example kept nonnegative or rounded to the nearest integer (see the sketch after the example questions below). Regression is considered the most common problem in the data science world. Example questions include:
• What will be the temperature tomorrow?
• What is the projected net income for the next season?
• How much inventory should we have?
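For instance, an inventory-style regression where the raw prediction is truncated at zero and rounded to whole units could be sketched as follows (the data is synthetic).

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = np.maximum(0, 5 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200))   # nonnegative demand

model = LinearRegression().fit(X, y)
raw = model.predict(rng.normal(size=(5, 2)))
inventory = np.round(np.clip(raw, 0, None)).astype(int)   # coerce to nonnegative whole units
print(raw, inventory)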
6. Optimization
Optimization is another common type of problem in data science: finding an optimal solution by tuning a few controllable variables in the presence of other, non-controllable environmental variables. It is an extension of the comparison problem (see the sketch after the examples below) and can solve problems such as:
• What is the best route to deliver the packages?
• What is the optimal advertisement strategy to promote a new product?
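As a tiny example, allocating an advertising budget across two channels can be written as a linear program and solved with SciPy; every number below is invented for illustration.

from scipy.optimize import linprog

# Maximize 4*x1 + 3*x2 (expected conversions per unit spent on channels 1 and 2),
# subject to a total budget and per-channel caps. linprog minimizes, so negate the objective.
c = [-4, -3]
A_ub = [[1, 1]]               # x1 + x2 <= 100 (total budget)
b_ub = [100]
bounds = [(0, 70), (0, 70)]   # each channel capped at 70

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print("optimal spend:", result.x, "expected conversions:", -result.fun)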
Data Science Roles
The following factors are used to separate different roles in data science:
• How much business knowledge is required?
• Does it need to deploy code in the production environment?
• How frequently is data updated?
• How much engineering skill is required?
• How much math/stat knowledge is needed?
• Does the role work with structured or unstructured data?
Data infrastructure engineers work at the beginning of the data pipeline. They are software engineers who work in the production system and usually handle high-frequency data. They are responsible for bringing in data of different forms and formats and ensuring that data comes in smoothly and correctly. They work directly with other engineers (for example, data engineers and backend engineers) and typically don't need to know the data's business context or how data scientists will use it. For example, they may integrate the company's services with AWS/GCP/Azure services and set up an Apache Kafka environment to stream the events. People call a storage repository with vast raw data in its native format (XML, JSON, CSV, Parquet, etc.) a data lake.
As the number of data sources multiplies, having data scattered in various formats prevents the
organization from using the data to help with business decisions or building products. That is when
data engineers come to help. Data engineers transform, clean, and organize the data from the data
lake. They commonly design schemas, store data in queryable forms, and build and maintain data
warehouses. People call this cleaner and better-organized database a data mart (figure 1.6); it contains a subset of the data for specific business needs. Data engineers use technologies like Hadoop/Spark and SQL. Since the database serves non-engineers, data engineers need to know more about the business and how analytical personnel use the data. Some also have a basic understanding of machine learning so they can help deploy models developed by data/research scientists.
Business intelligence (BI) engineers and data analysts are close to the business, so they need to
know the business context well. The critical difference is that BI engineers build automated dashboards, so they are engineers: they are usually experts in SQL and have the engineering skill to write production-level code to construct the downstream data pipeline and automate their work.
Data analysts are technical but not engineers. They perform ad hoc analyses and deliver the results through presentations. The data is, most of the time, structured. They need to know the basics of coding (SQL or R/Python), but they rarely need to write production-level code. Many companies used to conflate this role with "data scientist," but it is now much better defined in mature companies.
Research scientists are experts who have a research background.
They do rigorous analyses and make causal inferences by framing experiments, developing hypotheses, and testing whether they hold. They are researchers who can create new models and publish peer-reviewed papers. Most small and mid-sized companies don't have this role.
Applied scientists fill the gap between data/research scientists and data engineers. They have a decent scientific background but are also experts in applying their knowledge and implementing solutions at scale. Their focus differs from that of research scientists: instead of scientific discovery, they focus on real-life applications, and they usually need to pass a coding bar.
In the past, some data scientist roles encapsulated statistics, machine learning, and algorithmic knowledge, including taking models from proof of concept to production. More recently, some of these responsibilities have moved to another role: the machine learning engineer. Larger companies often distinguish between data scientist and machine learning engineer roles. Machine learning engineer roles deal more with the algorithmic and machine learning side and strongly emphasize software engineering.
Pillars of Knowledge for a Data Scientist
It is well known that there are three pillars of essential knowledge for a successful data scientist.
(1) Analytics knowledge and toolsets
A successful data scientist needs a strong technical background in data mining, statistics, and machine learning. An in-depth understanding of modeling, combined with insight about the data, enables a data scientist to convert a business problem into a data science problem. Many chapters of this book focus on analytics knowledge and toolsets.
(2) Domain knowledge and collaboration
A successful data scientist needs in-depth domain knowledge to understand the business problem
well. For any data science project, the data scientist needs to collaborate with other team members.
Communication and leadership skills are critical for data scientists during the entire project cycle, especially when there is only one scientist on the project and that scientist has to decide the timeline and impact under uncertainty.
(3) (Big) data management and (new) IT skills
The last pillar is about computation environment and model implementation in a big data platform.
It used to be the most difficult pillar for a data scientist with a statistics background (i.e., one lacking computer science knowledge or programming skills). The good news is that, with the rise of cloud-based big data platforms, it is easier for a statistician to overcome this barrier. The "Big Data Cloud Platform" chapter of this book describes this pillar in detail.
Data Science Project Cycle
A data science project has various stages. Many textbooks and blogs focus on one or two specific
stages, and it is rare to see an end-to-end life cycle of a data science project. To get a good grasp
of the end-to-end process requires years of real-world experience. Seeing a holistic picture of the
whole cycle helps data scientists to better prepare for real-world applications.
Types of Data Science Projects
People often use "data science project" to describe any project that uses data to solve a business problem, including traditional business analytics, data visualization, or machine learning modeling. Here we limit the discussion to data science projects that involve data and some statistical or machine learning model and exclude basic analytics or visualization. The business problem itself
gives us the flavor of the project. We can view data as the raw ingredient to start with, and the
machine learning model makes the dish. The types of data used and the final model development
define the different kinds of data science projects.
1. Offline and Online Data
There are offline and online data. Offline data are historical data stored in databases or data
warehouses. With the development of data storage techniques, the cost to store a large amount of
data is low. Offline data are versatile and rich in general (for example, websites may track and keep
each user’s mouse position, click and typing information while the user is visiting the website). The
data is usually stored in a distributed system, and it can be extracted in batch to create features used
in model training. Online data are real-time information that flows to models to make automatic
actions. Real-time data can frequently change (for example, the keywords a customer is searching
for can change at any given time). Capturing and using real-time online data requires integrating a machine learning model into the production infrastructure. It used to be a steep learning curve for data scientists not familiar with computer engineering, but cloud infrastructure makes it much more manageable.
2. Offline Training and Offline Application
This type of data science project addresses a specific business problem that needs to be solved once or a few times, but the dynamic and disruptive nature of such business problems requires substantial work every time. One example of such a project is determining "whether a brand-new business workflow is going to improve efficiency." In this case, we often use internal/external offline data and business insight to build models. The final results are delivered as a report that answers the specific business question. It is similar to a traditional business intelligence project but with more focus on data and models. Sometimes the data size and model complexity are beyond the capacity of a single computer; then we need to use distributed storage and computation, since the model uses historical data.
3. Offline Training and Online Application
Another type of data science project uses offline data for training and applies the trained model to
real-time online data in the production environment. For example, we can use historical data to train
a personalized advertisement recommendation model that provides a real-time ad recommendation.
The model training uses historical offline data. The trained model then takes customers' online real-time data as input features and runs in real time to provide an automatic action. The model training is very similar to the "offline training, offline application" project, but putting the trained model into production imposes specific requirements. For example, the features used in offline training have to be available online in real time, and the model's online run time has to be short enough not to impact the user experience.
4. Online Training and Online Application
Some business problems are so dynamic that even yesterday's data is out of date. In this case, we can use online data to train the model and apply it in real time. We call this type of data science project "online training, online application." It requires high automation and low latency.
Problem Formulation and Project Planning Stage
A data-driven and fact-based planning stage is essential to ensure a successful data science project.
With the recent big data and data science hype, there is a high demand for data science projects to
create business value across different business sectors. Usually, the leaders of an organization are
those who initiate the data science project proposals. This top-down style of data science project typically has high visibility, with some human and computation resources pre-allocated. However,
it is crucial to understand the business problem first and align the goal across different teams,
including:
(1) the business team, which may include members from the business operation team, business
analytics, insight, and metrics reporting team;
(2) the technology team, which may include members from the database and data warehouse team,
data engineering team, infrastructure team, core machine learning team, and software development
team;
(3) the project, program, and product management team depending on the scope of the data science
project.
To start the conversation, we can ask everyone on the team the following questions:
• What are the pain points in the current business operation?
• What data are available, and how is the quality and quantity of the data?
• What might be the most significant impacts of a data science project?
• Is there any positive or negative impact on other teams?
• What computation resources are available for model training and model execution?
• Can we define key metrics to compare and quantify business value?
• Are there any data security, privacy, and legal concerns?
• What are the desired milestones, checkpoints, and timeline?
• Is the final application online or offline?
• Are the data sources online or offline?
After the planning stage, we should be able to define a set of key metrics related to the project,
identify some offline and online data sources, request needed computation resources, draft a
tentative timeline and milestones, and form a team of data scientists, data engineers, software developers, a project manager, and members from business operations. Data scientists should play a significant role in these discussions. If data scientists do not lead the project formulation and planning, the project may not meet the desired timeline and milestones.
Project Modeling Stage
As a data scientist, communicating any newly encountered difficulties or opportunities during the modeling stage to the entire team is essential to keeping the data science project on track.
Data cleaning, data wrangling, and exploratory data analysis are great starting points toward
modeling with the available data source identified at the planning stage. Meanwhile, abstracting the
business problem to be a set of statistical and machine learning problems is an iterative process.
Business problems can rarely be solved by using just one statistical or machine learning model.
Using a sequence of methods to decompose the business problem is one of the critical
responsibilities for a senior data scientist. The process requires iterative rounds of discussions with
the business and data engineering team based on each iteration’s new learnings.
Each iteration includes both data-related and model-related parts.
Data-Related Parts
Data cleaning, data preprocessing, and feature engineering are related procedures that aim to create
usable variables or features for statistical and machine learning models. A critical aspect of these data-related procedures is to make sure the data source is a good representation of the situation where the final trained model will be applied. An exact representation is rarely possible, and it is OK to start with a reasonable approximation. A data scientist must be clear about the assumptions, communicate the limitations of biased data to the team, and quantify its impact on the application.
Model-Related Part
There are different types of statistical and machine learning models, such as supervised learning,
unsupervised learning, and causal inference. For each type, there are various algorithms, libraries,
or packages readily available. To solve a business problem, we sometimes need to piece together a few methods at the model exploration and development stage. This stage also includes model training, validation, and testing to ensure the model works well in the production environment (i.e., it generalizes well and does not overfit). Model selection follows Occam's razor: choose the simplest among a set of compatible models. Before trying complicated models, it is good to get benchmarks from simple business rules, common-sense decisions, or standard models (such as a random forest for classification and regression problems).
Model Implementation and Post Production Stage
For offline application data science projects, the end product is often a detailed report with model
results and output. However, for online application projects, a trained model is just halfway from
the finish line. The offline data is stored and processed in a different environment from the online
production environment. Building the online data pipeline and implementing machine learning
models in a production environment requires a lot of additional work. Even though recent advances in cloud infrastructure have lowered the barrier dramatically, it still takes effort to implement an offline model in the online production system. Before we promote the model to production, there are two more steps to go:
1. Shadow mode
2. A/B testing
Shadow mode is like an observation period: the data pipeline and machine learning models run as if fully functional, but we only record the model output without taking any action. Some people call it a proof of concept (POC). During shadow mode, people frequently check the data pipeline and the model and detect bugs such as timeouts, missing features, version conflicts (for example, Python 2 vs. Python 3), data type mismatches, and so on.
During A/B testing, all incoming observations are randomly split into two groups: control and treatment. The control group skips the machine learning model, while the treatment group goes through it. People then monitor a list of pre-defined key metrics over a specific period to compare the control and treatment groups, and the differences in these metrics determine whether the machine learning model provides business value. Real applications can be complicated: there can be multiple treatment groups, or hundreds, even thousands, of A/B tests run by different teams at any given time in the same production environment. Once the A/B test shows that the model provides significant business value, we can put it into full production.
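For example, comparing a conversion metric between the two groups can be as simple as a two-sample test on the recorded outcomes; the conversion rates below are simulated, not real results.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
control   = rng.binomial(1, 0.100, size=5000)   # simulated conversions without the model
treatment = rng.binomial(1, 0.112, size=5000)   # simulated conversions with the model

t_stat, p_value = stats.ttest_ind(treatment, control)
lift = treatment.mean() - control.mean()
print(f"lift = {lift:.3%}, p = {p_value:.3f}")   # judge against the pre-defined metric and threshold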
Project Summary
The end-to-end data science project cycle is a complicated process that requires close collaboration among many teams. The data scientist, perhaps the only scientist on the team, has to lead the planning discussion and model development based on the available data and communicate key assumptions and uncertainties. A data science project may fail at any stage, and a clear end-to-end view of the project cycle helps avoid some mistakes.
Common Mistakes in Data Science
Data science projects can go wrong at different stages in many ways. Most textbooks and online
blogs focus on technical mistakes about machine learning models, algorithms, or theories, such as
detecting outliers and overfitting. It is important to avoid these technical mistakes. However, there
are common systematic mistakes across data science projects that are rarely discussed in textbooks.
Listed below are the common mistakes that can occur at different stages of a data science project.
1) Project Formulation Stage
The most challenging part of a data science project is problem formulation. Data science projects stem from the pain points of the business. The draft version of a project's goal is relatively vague, without much quantification, or is simply the gut feeling of the leadership team. Often multiple teams are involved in the initial project formulation stage, and they have different views, so it is easy to have misalignment across teams on resource allocation, milestone deliverables, and timeline. Data science team members with technical backgrounds are sometimes not even invited to the initial discussion at the problem formulation stage. It sounds ridiculous, but it is sadly true that a lot of resources are spent on solving the wrong problem, the number one systematic mistake in data science. Formulating a business problem into the right data science project requires an in-depth understanding of the business context, data availability and quality, computation infrastructure, and the methodology to leverage the data to quantify business value.
We see people over-promise on business value all the time, another common mistake that will doom the project from the beginning. With the big data and machine learning hype, leaders across many industries often have unrealistically high expectations of data science. This is especially true during enterprise transformation, when there is a strong push to adopt new technology to get value out of the data. The unrealistic expectations are based on assumptions that are way off the chart, made without checking the data availability, data quality, computation resources, and current best practices in the field. Even when the data science team has done some exploratory analysis at the problem formulation stage, project leaders sometimes ignore its data-driven voice. These two systematic
mistakes undermine the organization’s data science strategy. The higher the expectation, the bigger
the disappointment when the project cannot deliver business value. Data and business context are
essential to formulate the business problem and set reachable business value.
2) Project Planning Stage
Now suppose the data science project is formulated correctly with a reasonable expectation on the
business value. The next step is to plan the project by allocating resources, setting up milestones
and timelines, and defining deliverables. In most cases, project managers coordinate different teams
involved in the project and use agile project management tools similar to those in software
development. Unfortunately, the project management team may not have experience with data science projects and hence fails to account for the uncertainties at the planning stage. The fundamental difference between data science projects and other projects leads to another common mistake: being too optimistic about the timeline. For example, data exploration and data preparation may take 60% to 80% of the total time for a given data science project, but people often don't realize that.
When a lot of data has already been collected across the organization, people assume we have enough data for everything. This leads to another mistake: being too optimistic about data availability and quality. We do not need "big data"; we need data that can help us solve the problem. The data available may be of low quality, and we need to put substantial effort into cleaning it before we can use it. There are "unexpected" efforts required to bring the right and relevant data for a specific data
science project. To ensure smooth delivery of data science projects, we need to account for the
“unexpected” work at the planning stage.
3) Project Modeling Stage
Finally, we start to look at the data and fit some models. One common mistake at this stage is unrepresentative data. A model trained on historical data may not generalize to the future, and there is always some bias or unrepresentativeness in the data. As data scientists, we need to use data that is as close as possible to the situation where the model will apply and quantify the impact of the model output in production. Another mistake at this stage is overfitting and an obsession with complicated models. We can now easily get hundreds or even thousands of features, and the machine learning models are getting more complicated. People can use open-source libraries to try all kinds of models and are sometimes obsessed with complicated ones instead of using the simplest among a set of compatible models with similar results. Because the data used to build the models is always somewhat biased or unrepresentative, simpler models generalize better and have a higher chance of providing consistent business value once the model passes the test and is finally implemented in the production environment.
The existing data and methods at hand may be insufficient to solve the business problem. In that case, we can try to collect more data, do feature engineering, or develop new models. However, if there is a fundamental gap between the data and the business problem, the data scientist must make the tough decision to unplug the project. On the other hand, data science projects usually have high visibility and may be initiated by senior leadership. Even after the data science team has provided enough evidence that it can't deliver the expected business value, people may not want to stop the project, which leads to another common mistake at the modeling stage: taking too long to fail. The earlier we can stop a failing project, the better, because we can put valuable resources into other promising projects.
4) Model Implementation and Post-production Stage
Now suppose we have found a model that works great on the training and testing data. If it is an online application, we are only halfway there. The next step is to implement the model, which may sound like alien work to a data scientist without software engineering experience in the production system. The data engineering team can help with model production. However, as data scientists, we need
to know the potential mistakes at this stage. One big mistake is missing shadow mode and A/B
testing and assuming that the model performance at model training/testing stays the same in
the production environment. Unfortunately, the model trained and evaluated using historical data
nearly never performs the same in the production environment. The data used in the offline training
may be significantly different from online data, and the business context may have changed. If
possible, machine learning models in production should always go through shadow mode and A/B
testing to evaluate performance. In the model training stage, people usually focus on model performance, such as accuracy, without paying much attention to the model execution time. When a model runs online in real time, each instance's total run time (i.e., the model latency) should not impact the customer's user experience: nobody wants to wait even one second to see the results after clicking the "search" button. In the production stage, feature availability is crucial for running a real-time model, and engineering resources are essential for model production. However, in traditional companies it is common for a data science project to fail to scale to real-time applications due to a lack of computation capacity, engineering resources, or a non-tech culture and environment.
As the business problem evolves rapidly, the data and model in the production environment need to
change accordingly, or the model’s performance deteriorates over time. The online production
environment is more complicated than model training and testing. For example, when we pull online features from different sources, some may be missing at a specific time; the model may time out; and various pieces of software can cause version problems. We need regular checkups during the entire life cycle of the model, from implementation to retirement. Unfortunately, people often don't set up a monitoring system for data science projects, and this is another common mistake: missing the necessary online checkups.
A data science project may fail in different ways. However, a data science project can provide significant business value if we put data and business context at the center of the project, get familiar with the data science project cycle, and proactively identify and avoid these potential mistakes.
Here is a summary of the mistakes:
• Solving the wrong problem
• Over-promising on business value
• Being too optimistic about the timeline
• Being too optimistic about data availability and quality
• Using unrepresentative data
• Taking too long to fail
• Missing the necessary online checkups
Data Science process
Following a structured approach to data science helps you to maximize your chances of success in a
data science project at the lowest cost. It also makes it possible to take up a project as a team, with
each team member focusing on what they do best. Take care, however: this approach may not be
suitable for every type of project or be the only way to do good data science.
1. Setting the research goal
A project starts by understanding the what, the why, and the how of your project. What does the
company expect you to do? And why does management place such a value on your research? Is it
part of a bigger strategic picture or a “lone wolf” project originating from an opportunity someone
detected? Answering these three questions (what, why, how) is the goal of the first phase, so that
everybody knows what to do and can agree on the best course of action.
The outcome should be a clear research goal, a good understanding of the context, well-defined
deliverables, and a plan of action with a timetable. This information is then best placed in a project
charter. The length and formality can, of course, differ between projects and companies. In this
early phase of the project, people skills and business acumen are more important than great
technical prowess, which is why this part will often be guided by more senior personnel.
Spend time understanding the goals and context of your research
An essential outcome is the research goal that states the purpose of your assignment in a clear and
focused manner. Understanding the business goals and context is critical for project success.
Continue asking questions and devising examples until you grasp the exact business expectations,
identify how your project fits in the bigger picture, appreciate how your research is going to change
the business, and understand how they’ll use your results. Nothing is more frustrating than spending
months researching something until you have that one moment of brilliance and solve the
problem, but when you report your findings back to the organization, everyone immediately
realizes that you misunderstood their question. Don’t skim over this phase lightly. Many data
scientists fail here: despite their mathematical wit and scientific brilliance, they never seem to grasp
the business goals and context.
Create a project charter
Clients like to know upfront what they’re paying for, so after you have a good understanding
of the business problem, try to get a formal agreement on the deliverables.
All this information is best collected in a project charter. For any significant project this would be
mandatory.
A project charter requires teamwork, and your input covers at least the following:
■ A clear research goal
■ The project mission and context
■ How you’re going to perform your analysis
■ What resources you expect to use
■ Proof that it’s an achievable project, or a proof of concept
■ Deliverables and a measure of success
■ A timeline
Your client can use this information to make an estimation of the project costs and the data and
people required for your project to become a success.
2. Retrieving data
The next step in data science is to retrieve the required data. Sometimes you need to go into the
field and design a data collection process yourself, but most of the time you won’t be involved in
this step. Many companies will have already collected and stored the data for you, and what they
don’t have can often be bought from third parties. Don’t be afraid to look outside your organization
for data, because more and more organizations are making even high-quality data freely available
for public and commercial use.
Start with data stored within the company
Your first act should be to assess the relevance and quality of the data that’s readily available within
your company. Most companies have a program for maintaining key data, so much of the cleaning
work may already be done. This data can be stored in official data repositories such as databases,
data marts, data warehouses, and data lakes maintained by a team of IT professionals. The primary
goal of a database is data storage, while a data warehouse is designed for reading and analyzing that
data. A data mart is a subset of the data warehouse and geared toward serving a specific business
unit. While data warehouses and data marts are home to preprocessed data, data lakes contain data in its natural or raw format. But the possibility exists that your data still
resides in Excel files on the desktop of a domain expert.
Finding data even within your own company can sometimes be a challenge. As companies grow,
their data becomes scattered around many places. Knowledge of the data may be dispersed as
people change positions and leave the company. Documentation and metadata aren’t always the top
priority of a delivery manager, so it’s possible you’ll need to develop some Sherlock Holmes–like
skills to find all the lost bits.
Getting access to data is another difficult task. Organizations understand the value and sensitivity of
data and often have policies in place so everyone has access to what they need and nothing more.
These policies translate into physical and digital barriers called Chinese walls. These “walls” are
mandatory and well-regulated for customer data in most countries. This is for good reasons, too;
imagine everybody in a credit card company having access to your spending habits. Getting access
to the data may take time and involve company politics.
Don’t be afraid to shop around
If data isn’t available inside your organization, look outside your organization’s walls. Many
companies specialize in collecting valuable information. For instance, Nielsen and GFK are well
known for this in the retail industry. Other companies provide data so that you, in turn, can enrich
their services and ecosystem. Such is the case with Twitter, LinkedIn, and Facebook.
Although data is considered an asset more valuable than oil by certain companies, more and more
governments and organizations share their data for free with the world. This data can be of excellent
quality; it depends on the institution that creates and manages it.
Do data quality checks now to prevent problems later
Expect to spend a good portion of your project time doing data correction and cleansing, sometimes
up to 80%. The retrieval of data is the first time you’ll inspect the data in the data science process.
Most of the errors you'll encounter during the data-gathering phase are easy to spot, but being too careless will make you spend many hours solving data issues that could have been prevented during data import.
You’ll investigate the data during the import, data preparation, and exploratory phases. The
difference is in the goal and the depth of the investigation. During data retrieval, you check to see if
the data is equal to the data in the source document and look to see if you have the right data types.
This shouldn’t take too long; when you have enough evidence that the data is similar to the data you
find in the source document, you stop. With data preparation, you do a more elaborate check. If you
did a good job during the previous phase, the errors you find now are also present in the source
document. The focus is on the content of the variables: you want to get rid of typos and other data
entry errors and bring the data to a common standard among the data sets. For example, you might
correct USQ to USA and United Kingdom to UK. During the exploratory phase your focus shifts to
what you can learn from the data.
3. Data Preparation
The data received from the data retrieval phase is likely to be “a diamond in the rough.” Your task
now is to sanitize and prepare it for use in the modeling and reporting phase. Doing so is
tremendously important because your models will perform better and you’ll lose less time trying to
fix strange output. It can't be mentioned often enough: garbage in equals garbage out. Your
model needs the data in a specific form.
Cleansing data
Data cleansing is a subprocess of the data science process that focuses on removing errors in your
data so your data becomes a true and consistent representation of the processes it originates from.
By “true and consistent representation” we imply that at least two types of errors exist. The first
type is the interpretation error, such as when you take the value in your data for granted, like saying
that a person’s age is greater than 300 years. The second type of error points to inconsistencies
between data sources or against your company’s standardized values. An example of this class of
errors is putting “Female” in one table and “F” in another when they represent the same thing: that
the person is female. Another example is that you use Pounds in one table and Dollars in another.
Data Entry Errors
Data collection and data entry are error-prone processes. They often require human intervention,
and because humans are only human, they make typos or lose their concentration for a second and
introduce an error into the chain. But data collected by machines or computers isn’t free from errors
either. Errors can arise from human sloppiness, whereas others are due to machine or hardware
failure. Examples of errors originating from machines are transmission errors or bugs in the extract,
transform, and load phase (ETL). For small data sets you can check every value by hand. Detecting
data errors when the variables you study don’t have many classes can be done by tabulating the data
with counts.
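In pandas, such a tabulation is a one-liner; the typo "Godo" below is injected deliberately to show how it surfaces.

import pandas as pd

quality = pd.Series(["Good", "Good", "Bad", "Godo", "Good", "Bad"])   # "Godo" is a typo
print(quality.value_counts())
# Good    3
# Bad     2
# Godo    1   <- the low-count class reveals the data entry error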
Redundant Whitespace
Whitespaces tend to be hard to detect but cause errors like other redundant characters would. Who
hasn’t lost a few days in a project because of a bug that was caused by whitespaces at the end of a
string? You ask the program to join two keys and notice that observations are missing from the
output file. After looking for days through the code, you finally find the bug. Then comes the
hardest part: explaining the delay to the project stakeholders. The cleaning during the ETL phase
wasn’t well executed, and keys in one table contained a whitespace at the end of a string. This
caused a mismatch of keys such as “FR ” – “FR”, dropping the observations that couldn’t be
matched.
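A hedged sketch of the fix: strip leading and trailing whitespace from the key columns before joining (shown here with pandas; most databases offer a TRIM function for the same purpose):

import pandas as pd

sales = pd.DataFrame({"country": ["FR ", "DE"], "revenue": [100, 200]})
regions = pd.DataFrame({"country": ["FR", "DE"], "region": ["Europe", "Europe"]})

# Without cleaning, "FR " would not match "FR" and that observation would be dropped.
sales["country"] = sales["country"].str.strip()

print(sales.merge(regions, on="country", how="inner"))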
Fixing Capital Letter Mismatches
Capital letter mismatches are common. Most programming languages make a distinction between
“Brazil” and “brazil”. In this case you can solve the problem by applying a function that returns
both strings in lowercase, such as .lower() in Python: "Brazil".lower() == "brazil".lower() evaluates to True.
Outliers, impossible values and sanity checks
An outlier is an observation that seems to be distant from other observations or, more specifically,
one observation that follows a different logic or generative process than the other observations. The
easiest way to find outliers is to use a plot or a table with the minimum and maximum values.
Sanity checks are another valuable type of data check. Here you check the value against physically
or theoretically impossible values such as people taller than 3 meters or someone with an age of 299
years. Sanity checks can be directly expressed with rules:
check = 0 <= age <= 120
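A small sketch of how such a rule can be applied to an entire column, together with a quick minimum/maximum screen for outliers (pandas assumed):

import pandas as pd

df = pd.DataFrame({"age": [25, 42, 299, 31, -3]})

# Outlier screen: look at the extremes first.
print(df["age"].min(), df["age"].max())  # -3 299

# Sanity check as a rule: flag physically impossible ages.
df["age_ok"] = df["age"].between(0, 120)
print(df[~df["age_ok"]])  # the rows with age 299 and -3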
Dealing with Missing Values
Missing values aren’t necessarily wrong, but you still need to handle them separately; certain
modeling techniques can’t handle missing values. They might be an indicator that something went
wrong in your data collection or that an error happened in the ETL process.
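The usual options are to drop the incomplete observations or to impute a substitute value; which option is appropriate depends on the modeling technique. A minimal pandas sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52000, np.nan, 61000, np.nan]})

# Option 1: drop the observations with missing values.
dropped = df.dropna()

# Option 2: impute, for example with the column mean.
imputed = df.fillna(df["income"].mean())

print(dropped, imputed, sep="\n")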
Deviations from a Codebook
Detecting errors in larger data sets against a code book or against standardized values can be done
with the help of set operations. A code book is a description of your data, a form of metadata. It
contains things such as the number of variables per observation, the number of observations, and
what each encoding within a variable means. (For instance, “0” equals “negative” and “5” stands for “very positive”.)
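A short sketch of such a check using set operations, assuming a hypothetical rating variable that the codebook encodes as 0 through 5:

codebook_values = {0, 1, 2, 3, 4, 5}    # allowed encodings according to the codebook
observed_values = {0, 3, 5, 9}          # encodings actually found in the data

# Values present in the data but not described in the codebook.
print(observed_values - codebook_values)  # {9}

# Encodings described in the codebook but never observed.
print(codebook_values - observed_values)  # {1, 2, 4}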
Combining data from different data sources
Your data comes from several different places, and in this substep we focus on integrating these
different sources. Data varies in size, type, and structure, ranging from databases and Excel files to
text documents.
We focus on data in table structures in this chapter for the sake of brevity. It’s easy to fill entire
books on this topic alone, and we choose to focus on the data science process instead of presenting
scenarios for every type of data. But keep in mind that other types of data sources exist, such as
key-value stores, document stores, and so on, which we’ll handle in more appropriate places in the
book.
The Different Ways of Combining Data
You can perform two operations to combine information from different data sets. The first operation
is joining: enriching an observation from one table with information from another table. The second
operation is appending or stacking: adding the observations of one table to those of another table.
When you combine data, you have the option to create a new physical table or a virtual table by
creating a view. The advantage of a view is that it doesn’t consume more disk space. Let’s elaborate
a bit on these methods.
Joining Tables
Joining tables allows you to combine the information of one observation found in one table with the
information that you find in another table. The focus is on enriching a single observation. Let’s say
that the first table contains information about the purchases of a customer and the other table
contains information about the region where your customer lives.
To join tables, you use variables that represent the same object in both tables, such as a date, a
country name, or a Social Security number. These common fields are known as keys. When these
keys also uniquely define the records in the table, they are called primary keys.
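A sketch of such a join in pandas, using a hypothetical customer id as the key:

import pandas as pd

purchases = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [20, 35, 15]})
regions = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["North", "South", "North"]})

# Enrich each purchase with the region in which the customer lives.
enriched = purchases.merge(regions, on="customer_id", how="left")
print(enriched)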
Appending Tables
Appending or stacking tables is effectively adding observations from one table to another table.
The figure below shows an example of appending tables. One table contains the observations from the
month January and the second table contains observations from the month February. The result of
appending these tables is a larger one with the observations from January as well as February. The
equivalent operation in set theory would be the union, and this is also the command in SQL, the
common language of relational databases.
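In pandas the same stacking operation is a concatenation; in SQL it is UNION (or UNION ALL if duplicates should be kept). A minimal sketch with hypothetical monthly tables:

import pandas as pd

january = pd.DataFrame({"customer_id": [1, 2], "sales": [100, 150]})
february = pd.DataFrame({"customer_id": [1, 3], "sales": [120, 80]})

# Stack the February observations underneath the January ones.
combined = pd.concat([january, february], ignore_index=True)
print(combined)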
Using Views to Simulate Data Joins and Appends
To avoid duplication of data, you virtually combine data with views. In the previous example we
took the monthly data and combined it in a new physical table. The problem is that we duplicated
the data and therefore needed more storage space. In the example we’re working with, that may not
cause problems, but imagine that every table consists of terabytes of data; then it becomes
problematic to duplicate the data.
For this reason, the concept of a view was invented. A view behaves as if you’re working on a table,
but this table is nothing but a virtual layer that combines the tables for you.
Enriching Aggregated Measures
Data enrichment can also be done by adding calculated information to the table, such as the total
number of sales or what percentage of total stock has been sold in a certain region.
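A small sketch of enriching a table with an aggregated measure, here the share of each sale within its region (column names are hypothetical):

import pandas as pd

df = pd.DataFrame({"region": ["North", "North", "South"], "sales": [100, 300, 200]})

# Total sales per region, broadcast back onto every row.
df["region_total"] = df.groupby("region")["sales"].transform("sum")
df["share_of_region"] = df["sales"] / df["region_total"]
print(df)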
4. Transforming Data
Certain models require their data to be in a certain shape. Now that you’ve cleansed and integrated
the data, this is the next task you’ll perform: transforming your data so it takes a suitable form for
data modeling.
Reducing the number of variables
Sometimes you have too many variables and need to reduce the number because they don’t add new
information to the model. Having too many variables in your model makes the model difficult to
handle, and certain techniques don’t perform well when you overload them with too many input
variables.
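One widely used technique for this is principal component analysis (PCA), which replaces many correlated input variables with a small number of components. A hedged scikit-learn sketch on toy data:

import numpy as np
from sklearn.decomposition import PCA

# Toy data: 100 observations with 10 input variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Keep only enough components to explain 90% of the variance.
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)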
Turning Variables into Dummies
Variables can be turned into dummy variables. Dummy variables can take only two values: true (1) or false (0). They're used to indicate the presence or absence of a categorical effect that may explain the observation. In this case you'll make separate columns for the classes stored in one variable and
indicate it with 1 if the class is present and 0 otherwise. An example is turning one column named
Weekdays into the columns Monday through Sunday. You use an indicator to show if the
observation was on a Monday; you put 1 on Monday and 0 elsewhere. Turning variables into
dummies is a technique that’s used in modeling and is popular with, but not exclusive to,
economists.
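A minimal pandas sketch of the weekday example, using get_dummies:

import pandas as pd

df = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Monday", "Sunday"]})

# One column per class, 1 if the observation falls on that day and 0 otherwise.
dummies = pd.get_dummies(df["weekday"], prefix="is", dtype=int)
print(dummies)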
5. Data exploration
During exploratory data analysis you take a deep dive into the data. Information becomes much easier to grasp when shown in a picture; therefore, you mainly use graphical techniques to gain an
understanding of your data and the interactions between variables. This phase is about exploring
data, so keeping your mind open and your eyes peeled is essential during the exploratory data
analysis phase. The goal isn’t to cleanse the data, but it’s common that you’ll still discover
anomalies you missed before, forcing you to take a step back and fix them.
Figure: From top to bottom, a bar chart, a line plot, and a distribution plot, three graph types commonly used in exploratory analysis
These plots can be combined to provide even more insight.
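A small sketch of producing such plots with pandas and matplotlib, on a hypothetical monthly sales table:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"month": [1, 2, 3, 4], "sales": [100, 130, 90, 160]})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
df.plot.bar(x="month", y="sales", ax=axes[0], title="Bar chart")
df.plot.line(x="month", y="sales", ax=axes[1], title="Line plot")
df["sales"].plot.hist(ax=axes[2], title="Distribution")
plt.tight_layout()
plt.show()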
Another exploratory data analysis technique is brushing and linking. With brushing and linking
you combine and link different graphs and tables (or views) so changes in one graph are
automatically transferred to the other graphs.
6. Data Modeling
With clean data in place and a good understanding of the content, you’re ready to build models with
the goal of making better predictions, classifying objects, or gaining an understanding of the system
that you’re modeling. This phase is much more focused than the exploratory analysis step, because
you know what you’re looking for and what you want the outcome to be. The techniques you’ll use
now are borrowed from the field of machine learning, data mining, and/or statistics. Building a
model is an iterative process. The way you build your model depends on whether you go with
classic statistics or the somewhat more recent machine learning school, and the type of technique
you want to use. Either way, most models consist of the following main steps:
1 Selection of a modeling technique and variables to enter in the model
2 Execution of the model
3 Diagnosis and model comparison
Model and Variable Selection
You’ll need to select the variables you want to include in your model and a modeling technique.
Your findings from the exploratory analysis should already give a fair idea of what variables will
help you construct a good model. Many modeling techniques are available, and choosing the right
model for a problem requires judgment on your part. You’ll need to consider model performance
and whether your project meets all the requirements to use your model, as well as other factors:
■ Must the model be moved to a production environment and, if so, would it be easy to implement?
■ How difficult is the maintenance on the model: how long will it remain relevant if left untouched?
■ Does the model need to be easy to explain?
Model Execution
Luckily, most programming languages, such as Python, already have libraries such as StatsModels or Scikit-learn. These packages implement several of the most popular modeling techniques. Coding a model from scratch is a nontrivial task in most cases, so having these libraries available can speed up the process.
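A hedged sketch of fitting the same simple linear model with each library, on toy data:

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=50)

# StatsModels: classic statistical output (coefficients, standard errors, R-squared).
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.params)

# Scikit-learn: the same model with a prediction-oriented interface.
lr = LinearRegression().fit(X, y)
print(lr.intercept_, lr.coef_)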
Model diagnostics and model comparison
You’ll be building multiple models from which you then choose the best one based on multiple
criteria. Working with a holdout sample helps you pick the best-performing model. A holdout
sample is a part of the data you leave out of the model building so it can be used to evaluate the
model afterward. The principle here is simple: the model should work on unseen data. You use only
a fraction of your data to estimate the model and the other part, the holdout sample, is kept out of
the equation. The model is then unleashed on the unseen data and error measures are calculated to
evaluate it. Multiple error measures are available such as MSE and classification accuracy.
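A minimal scikit-learn sketch of the holdout idea, using mean squared error (MSE) as the error measure on toy data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.2, size=200)

# Keep 30% of the data out of model building as the holdout sample.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("holdout MSE:", mean_squared_error(y_test, model.predict(X_test)))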
7. Presentation and automation
After you’ve successfully analyzed the data and built a well-performing model, you’re ready to
present your findings to the world. This is an exciting part; all your hours of hard work have paid
off and you can explain what you found to the stakeholders.
Sometimes people get so excited about your work that you’ll need to repeat it over and over again
because they value the predictions of your models or the insights that you produced. For this reason,
you need to automate your models. This doesn’t always mean that you have to redo all of your
analysis all the time. Sometimes it’s sufficient that you implement only the model scoring; other
times you might build an application that automatically updates reports, Excel spreadsheets, or
PowerPoint presentations. The last stage of the data science process is where your soft skills will be
most useful, and yes, they’re extremely important.
Chapter Two
Data
As its name suggests, data science is fundamentally dependent on data. In its most basic form, a
datum or a piece of information is an abstraction of a real-world entity (person, object, or event).
The terms variable, feature, and attribute are often used interchangeably to denote an individual
abstraction. Each entity is typically described by a number of attributes. For example, a book might
have the following attributes: author, title, topic, genre, publisher, price, date published, word count,
number of chapters, number of pages, edition, ISBN, and so on. A data set consists of the data
relating to a collection of entities, with each entity described in terms of a set of attributes. In its
most basic form, a data set is organized in an n × m data matrix called the analytics record, where
n is the number of entities (rows) and m is the number of attributes (columns). In data science, the
terms data set and analytics record are often used interchangeably, with the analytics record being a
particular representation of a data set. Table 1 illustrates an analytics record for a data set of classic
books. Each row in the table describes one book. The terms instance, example, entity, object, case,
individual, and record are used in data science literature to refer to a row. So a data set contains a
set of instances, and each instance is described by a set of attributes.
The construction of the analytics record is a prerequisite of doing data science. In fact, the majority
of the time and effort in data science projects is spent on creating, cleaning, and updating the
analytics record. The analytics record is often constructed by merging information from many
different sources: data may have to be extracted from multiple databases, data warehouses, or
computer files in different formats (e.g., spreadsheets or csv files) or scraped from the web or social
media streams. There are many different types of attributes, and for each attribute type different
sorts of analysis are appropriate.
So understanding and recognizing different attribute types is a fundamental skill for a data scientist.
The standard types are numeric, nominal, and ordinal. Numeric attributes describe measurable
quantities that are represented using integer or real values. Numeric attributes can be measured on
either an interval scale or a ratio scale. Interval attributes are measured on a scale with a fixed but
arbitrary interval and arbitrary origin—for example, date and time measurements. It is appropriate
to apply ordering and subtraction operations to interval attributes, but other arithmetic operations
(such as multiplication and division) are not appropriate.
Ratio scales are similar to interval scales, but the scale of measurement possesses a true-zero origin.
A value of zero indicates that none of the quantity is being measured. A consequence of a ratio scale
having a true-zero origin is that we can describe a value on a ratio scale as being a multiple (or
ratio) of another value. Temperature is a useful example for distinguishing between interval and
ratio scales. A temperature measurement on the Celsius or Fahrenheit scale is an interval
measurement because a 0 value on either of these scales does not indicate zero heat.
Nominal (also known as categorical) attributes take values from a finite set. These values are names
(hence “nominal”) for categories, classes, or states of things. Examples of nominal attributes
include marital status (single, married, divorced) and beer type (ale, pale ale, pils, porter, stout,
etc.).
A binary attribute is a special case of a nominal attribute where the set of possible values is
restricted to just two values.
Ordinal attributes are similar to nominal attributes, with the difference that it is possible to apply a
rank order over the categories of ordinal attributes. For example, an attribute describing the
response to a survey question might take values from the domain “strongly dislike, dislike, neutral,
like, and strongly like.” There is a natural ordering over these values from “strongly dislike” to
“strongly like” (or vice versa depending on the convention being used).
Structured and Unstructured data
Structured data are data that can be stored in a table, and every instance in the table has the same
structure (i.e., set of attributes). As an example, consider the demographic data for a population,
where each row in the table describes one person and consists of the same set of demographic
attributes (name, age, date of birth, address, gender, education level, job status, etc.). Structured data
can be easily stored, organized, searched, reordered, and merged with other structured data. It is
relatively easy to apply data science to structured data because, by definition, it is already in a
format that is suitable for integration into an analytics record.
Unstructured data are data where each instance in the data set may have its own internal structure,
and this structure is not necessarily the same in every instance. Unstructured data are much more
common than structured data. For example, collections of human text (emails, tweets, text
messages, posts, novels, etc.) can be considered unstructured data, as can collections of sound,
image, music, video, and multimedia files. The variation in the structure between the different
elements means that it is difficult to analyze unstructured data in its raw form.
Cross Industry Standard Process for Data Mining (CRISP-DM)
The most commonly used process for data science projects is the Cross Industry Standard Process for Data Mining (CRISP-DM). In fact, CRISP-DM has regularly been in the number-one spot in various industry surveys for a number of years. The
primary advantage of CRISP-DM, the main reason why it is so widely used, is that it is designed to
be independent of any software, vendor, or data-analysis technique. CRISP-DM was originally
developed by a consortium of organizations consisting of leading data science vendors, end users,
consultancy companies, and researchers. The CRISP-DM life cycle consists of six stages: business
understanding, data understanding, data preparation, modeling, evaluation, and deployment, as
shown in the figure below. Data is at the center of all data science activities, and that is why the
CRISP-DM diagram has data at its center. The arrows between the stages indicate the typical
direction of the process. The process is semistructured, which means that a data scientist doesn’t
always move through these six stages in a linear fashion. Depending on the outcome of a particular
stage, a data scientist may go back to one of the stages.
In the first two stages, business understanding and data understanding, the data scientist is trying to
define the goals of the project by understanding the business needs and the data that the business
has available to it. In the early stages of a project, a data scientist will often iterate between focusing
on the business and exploring what data are available. This iteration typically involves identifying a
business problem and then exploring if the appropriate data are available to develop a data-driven
solution to the problem. If the data are available, the project can proceed; if not, the data scientist
will have to identify an alternative problem to tackle. During this stage of a project, a data scientist
will spend a great deal of time in meetings with colleagues in the business-focused departments
(e.g., sales, marketing, operations) to understand their problems and with the database
administrators to get an understanding of what data are available.
The focus of the data-preparation stage is the creation of a data set that can be used for the data
analysis. In general, creating this data set involves integrating data sources from a number of
databases. When an organization has a data warehouse, this integration can be relatively
straightforward. Once a dataset has been created, the quality of the data needs to be checked and
fixed. Typical data-quality problems include outliers and missing values. Checking the quality of
the data is very important because errors in the data can have a serious effect on the performance of
the data-analysis algorithms. The next stage of CRISP-DM is the modeling stage. This is the stage
where automatic algorithms are used to extract useful patterns from the data and to create models
that encode these patterns. A model is trained on a data set by running an ML algorithm on the data
set so as to identify useful patterns in the data and to return a model that encodes these patterns. In
some cases an ML algorithm works by fitting a template model structure to a data set by setting the
parameters of the template to good values for that data set (e.g., fitting a linear regression or neural
network model to a data set).
In other cases an ML algorithm builds a model in a piecewise fashion (e.g. growing a decision tree
one node at a time beginning at the root node of the tree). In most data science projects it is a model
generated by an ML algorithm that is ultimately the software that is deployed by an organization to
help it solve the problem the data science project is addressing. Each model is trained by a different
type of ML algorithm, and each algorithm looks for different types of patterns in the data. The last
two stages of the CRISP-DM process, evaluation and deployment, are focused on how the models
fit the business and its processes. The tests run during the modeling stage are focused purely on the
accuracy of the models for the data set. The evaluation phase involves assessing the models in the
broader context defined by the business needs. Does a model meet the business objectives of the
process? Is there any business reason why a model is inadequate? At this point in the process, it is
also useful for the data scientist to do a general quality assurance review on the project activities:
Was anything missed? Could anything have been done better? Based on the general assessment of
the models, the main decision made during the evaluation phase is whether any of the models
should be deployed in the business or another iteration of the CRISP-DM process is required to
create adequate models. Assuming the evaluation process approves a model or models, the project
moves into the final stage of the process: deployment.
The deployment phase involves examining how to deploy the selected models into the business
environment. This involves planning how to integrate the models into the organization’s technical
infrastructure and business processes. The best models are the ones that fit smoothly into current practices. The iterative nature of data science projects is perhaps the aspect of these projects that is
most often overlooked in discussions of data science. After a project has developed and deployed a
model, the model should be regularly reviewed to check that it still fits the business’s needs and that
it hasn’t become obsolete.
There are many reasons why a data-driven model can become obsolete: the business’s needs might
have changed; the process the model emulates and provides insight into might have changed (for
example, customer behavior changes, spam email changes, etc.); or the data streams the model uses
might have changed (for example, a sensor that feeds information into a model may have been
updated, and the new version of the sensor provides slightly different readings, causing the model to
be less accurate). The frequency of this review is dependent on how quickly the business ecosystem
and the data that the model uses evolve. Constant monitoring is needed to determine the best time to
go through the process again. For a data science project to succeed, a data scientist needs to have a
clear understanding of the business need that the project is trying to solve. The business
understanding stage of the process is really important. With regard to getting the right data for a
project, a survey of data scientists in 2016 found that 79 percent of their time is spent on data
preparation.
Figure: CRISP-DM stages and processes
Chapter Three
Big Data
Big data refers to data that cannot be processed by traditional/conventional methods. The set of
technologies used to do data science varies across organizations. The larger the organization or the
greater the amount of data being processed or both, the greater the complexity of the technology
ecosystem supporting the data science activities. In most cases, this ecosystem contains tools and
components from a number of different software suppliers, processing data in many different
formats. There is a spectrum of approaches from which an organization can select when developing
its own data science ecosystem. At one end of the spectrum, the organization may decide to invest
in a commercial integrated tool set. At the other end, it might build up a bespoke ecosystem by
integrating a set of open-source tools and languages.
In the big-data architecture diagram shown below, the three main areas consist of data sources, where all the data in an organization
are generated; data storage, where the data are stored and processed; and applications, where the
data are shared with consumers of these data.
All organizations have applications that generate and capture data about customers, transactions,
and operational data on everything to do with how the organization operates. Such data sources and
applications include customer management, orders, manufacturing, delivery, invoicing, banking,
finance, customer-relationship management (CRM), call center, enterprise resource planning (ERP)
applications, and so on. These types of applications are commonly referred to as online transaction
processing (OLTP) systems. For many data science projects, the data from these applications will be
used to form the initial input data set for the ML algorithms. Over time, the volume of data captured
by the various applications in the organization grows ever larger, and the organization will start to branch out to capture data that was previously ignored or simply not captured.
Figure: Big-data architecture
These newer data are commonly referred to as “big-data sources” because the volume of data that is captured is significantly higher than that captured by the organization's main operational applications. Some of the
common big-data sources include network traffic, logging data from various applications, sensor
data, weblog data, social media data, website data, and so on. In traditional data sources, the data
are typically stored in a database. However, because the applications associated with many of the
newer big-data sources are not primarily designed to store data long term—for example, with
streaming data—the storage formats and structures for this type of data vary from application to
application.
As the number of data sources increases, so does the challenge of being able to use these data for
analytics and for sharing them across the wider organization. The data-storage layer, shown in the
figure above, is typically used to address the data sharing and data analytics across an organization.
This layer is divided into two parts. The first part covers the typical data-sharing software used by
most organizations. The most popular form of traditional data integration and storage software is a
relational database management system (RDBMS). These traditional systems are often the
backbone of the business intelligence (BI) solutions within an organization. A BI solution is a user-friendly decision-support system that provides data aggregation, integration, and reporting as well
as analysis functionality. Depending on the maturity level of a BI architecture, it can consist of
anything from a basic copy of an operational application to an operational data store (ODS) to
massively parallel processing (MPP) BI database solutions and data warehouses.
Data warehousing is best understood as a process of data aggregation and analysis with the goal of
supporting decision making. However, the focus of this process is the creation of a well-designed
and centralized data repository, and the term data warehouse is sometimes used to denote this type
of data repository. In this sense, a data warehouse is a powerful resource for data science. From a
data science perspective, one of the major advantages of having a data warehouse in place is a much
shorter project time.
The key ingredient in any data science process is data, so it is not surprising that in many data
science projects the majority of time and effort goes into finding, aggregating, and cleaning the data
prior to their analysis. If a data warehouse is available in a company, then the effort and time that go
into data preparation on individual data science projects is often significantly reduced.
However, it is possible to do data science without a centralized data repository. Constructing a
centralized repository of data involves more than simply dumping the data from multiple
operational databases into a single database.
Merging data from multiple databases often requires much complex manual work to resolve
inconsistencies between the source databases.
Extraction, transformation, and load (ETL) is the term used to describe the typical processes and
tools used to support the mapping, merging, and movement of data between databases. The typical
operations carried out in a data warehouse are different from the simple operations normally applied
to a standard relational data model database. The term online analytical processing (OLAP) is used
to describe these operations. OLAP operations are generally focused on generating summaries of
historic data and involve aggregating data from multiple sources. The second part of the data-
storage layer deals with managing the data produced by an organization’s big-data sources. In this
architecture, the Hadoop platform is used for the storage and analytics of these big data. Hadoop is
an open-source framework developed by the Apache Software Foundation that is designed for the
processing of big data. It uses distributed storage and processing across clusters of commodity
servers. Applying the MapReduce programming model, it speeds up the processing of queries on
large data sets.
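As a toy illustration of the MapReduce idea (plain Python, not Hadoop code): a map step emits key-value pairs independently for each input, and a reduce step aggregates all pairs that share a key:

from collections import defaultdict

documents = ["big data big value", "data science uses big data"]

# Map step: each document is turned into (word, 1) pairs independently.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/reduce step: pairs with the same key are grouped and aggregated.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))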
Data analysis is associated with both sections of the data-storage layer in the figure above. Data
analysis can occur on the data in each section of the data layer, and the results from data analysis
can be shared between each section while additional data analysis is being performed. The data
from traditional sources frequently are relatively clean and information dense compared to the data
captured from big-data sources. However, the volume and real-time nature of many big-data sources
means that the effort involved in preparing and analyzing these big-data sources can be repaid in
terms of additional insights not available through the data coming from traditional sources. A
variety of data-analysis techniques developed across a number of different fields of research
(including natural-language processing, computer vision, and ML) can be used to transform
unstructured, low-density, low-value big data into high-density and high-value data. These high-
value data can then be integrated with the other high-value data from traditional sources for further
data analysis.
Modern Databases or Modern Traditional Databases
Modern databases are far more advanced than traditional relational databases. They can store and query data in a variety of different formats. In addition to the traditional relational formats, it is also possible to define object types, store documents, and store and query JSON objects, spatial data, and so on. Most modern databases also come with a large number of statistical functions, so many that some offer as many statistical functions as dedicated statistical applications. Using
the statistical functionality that is available in the databases in an organization may allow data
analytics to be performed in a more efficient and scalable manner using SQL. Furthermore, most
leading database vendors (including Oracle, Microsoft, IBM, and EnterpriseDB) have integrated
many ML algorithms into their databases, and these algorithms can be run using SQL. ML that is
built into the database engine and is accessible using SQL is known as in-database machine
learning. In-database ML can lead to quicker development of models and quicker deployment of
models and results to applications and analytic dashboards.
The advantages of in-database machine learning include:
No Data Movement
Faster performance
High security
Scalability
Real time deployment and environment
Production deployment
Five V’s of Big data
1. Volume
Let’s start with the chief characteristic, especially since “Big Data” was first coined to describe the enormous amount of information. Thus, the Volume characteristic is the defining criterion for whether a dataset can be regarded as Big Data or not.
Volume describes both the size and quantity of the data. However, the definition of Big Data can
change depending on the computing power available on the market at any given time. But
regardless of the type of devices used to collect and process the data, it doesn’t change that Big
Data’s volume is colossal, thanks to the vast number of sources sending the information.
2. Velocity
Velocity describes how rapidly the data is generated and how quickly it moves. This data flow
comes from sources such as mobile phones, social media, networks, servers, etc. Velocity covers the
data's speed, and it also describes how the information continuously flows. For instance, a consumer
with wearable tech that has a sensor connected to a network will keep gathering and sending data to
the source. It’s not a one-shot thing. Now picture millions of devices performing this action
simultaneously and perpetually, and you can see why volume and velocity are the two prominent
characteristics.
Velocity also factors in how quickly the raw Big Data information is turned into something an
organization will benefit from. When talking about the business sector, that translates into getting
actionable information and acting on it before the competition does. For something like the
healthcare industry, it's critical that medical data gathered by patient monitoring be quickly analyzed
for a patient's health.
3. Variety
Variety describes the diversity of the data types and its heterogeneous sources. Big Data information
draws from a vast quantity of sources, and not all of them provide the same level of value or
relevance.
The data, pulled from new sources located in-house and off-site, comes in three different types:
Structured Data: Also known as organized data, information with a defined length and format. An
Excel spreadsheet with customer names, e-mails, and cities is an example of structured data.
Unstructured Data: Unlike structured data, unstructured data covers information that can’t neatly fit
in the rigid, traditional row and column structure found in relational databases. Unstructured data
includes images, texts, and videos, to name a few. For example, if a company received 500,000
jpegs of their customers’ cats, that would qualify as unstructured data.
Semi-structured Data: As the name suggests, semi-structured data is information that features
associated information like metadata, although it doesn't conform to formal data structures. This
category includes e-mails, web pages, and TCP/IP packets.
4. Veracity
Veracity describes the data’s accuracy and quality. Since the data is pulled from diverse sources, the
information can have uncertainties, errors, redundancies, gaps, and inconsistencies. It's bad enough
when an analyst gets one set of data that has accuracy issues; imagine getting tens of thousands of
such datasets, or maybe even millions.
Veracity speaks to the difficulty and messiness of vast amounts of data. Excessive quantities of
flawed data lead to data analysis nightmares. On the other hand, insufficient amounts of Big Data
could result in incomplete information. Astute data analysts will understand that dealing with Big
Data is a balancing act involving all its characteristics.
5. Value
Although this is the last Big Data characteristic, it’s by no means the least important. After all, the
entire reason for wading through oceans of Big Data is to extract value! So unless analysts can take
that glut of data and turn it into an actionable resource that helps a business, it’s useless.
So, value in this context refers to the potential value Big Data can offer and directly relates to what
an organization can do with the processed data. The more insights derived from the Big Data, the
higher its value.
Chapter Four
Machine Learning
ML algorithms and techniques are applied primarily during the modeling stage of CRISP-DM. ML
involves a two-step process. First, an ML algorithm is applied to a data set to identify useful
patterns in the data. These patterns can be represented in a number of different ways. We describe
some popular representations later in this chapter, but they include decision trees, regression
models, and neural networks. These representations of patterns are known as “models,” which is
why this stage of the CRISP-DM life cycle is known as the “modeling stage.” Simply put, ML
algorithms create models from data, and each algorithm is designed to create models using a
particular representation (neural network or decision tree or other).
Second, once a model has been created, it is used for analysis. In some cases, the structure of the
model is what is important. A model structure can reveal what the important attributes are in a
domain. For example, in a medical domain we might apply an ML algorithm to a data set of stroke
patients and use the structure of the model to identify the factors that have a strong association with
stroke. In other cases, a model is used to label or classify new examples. For instance, the primary
purpose of a spam-filter model is to label new emails as either spam or not spam rather than to
reveal the defining attributes of spam email.
Types of Machine Learning
Supervised and Unsupervised Learning
The majority of ML algorithms can be classified as either supervised learning or unsupervised
learning. The goal of supervised learning is to learn a function that maps from the values of the
attributes describing an instance to the value of another attribute, known as the target attribute, of
that instance. For example, when supervised learning is used to train a spam filter, the algorithm
attempts to learn a function that maps from the attributes describing an email to a value (spam/not
spam) for the target attribute; the function the algorithm learns is the spam-filter model returned by
the algorithm. So in this context the pattern that the algorithm is looking for in the data is the
function that maps from the values of the input attributes to the values of the target attribute, and the
model that the algorithm returns is a computer program that implements this function. Supervised
learning works by searching through lots of different functions to find the function that best maps
between the inputs and output. However, for any data set of reasonable complexity there are so
many combinations of inputs and possible mappings to outputs that an algorithm cannot try all
possible functions. As a consequence, each ML algorithm is designed to look at or prefer certain
types of functions during its search. These preferences are known as the algorithm’s learning bias.
The real challenge in using ML is to find the algorithm whose learning bias is the best match for
a particular data set. Generally, this task involves experiments with a number of different algorithms
to find out which one works best on that data set.
Supervised learning is “supervised” because each of the instances in the data set lists both the input
values and the output (target) value for each instance. So the learning algorithm can guide its search
for the best function by checking how each function it tries matches with the data set, and at the
same time the data set acts as a supervisor for the learning process by providing feedback.
Obviously, for supervised learning to take place, each instance in the data set must be labeled with
the value of the target attribute. Often, however, the reason a target attribute is interesting is that it is
not easy to directly measure, and therefore it is not possible to easily create a data set of labeled
instances. In such scenarios, a great deal of time and effort is required to create a data set with the
target values before a model can be trained using supervised learning.
In unsupervised learning, there is no target attribute. As a consequence, unsupervised-learning
algorithms can be used without investing the time and effort in labeling the instances of the data set
with a target attribute. However, not having a target attribute also means that learning becomes
more difficult: instead of the specific problem of searching for a mapping from inputs to an output, the algorithm faces the more general problem of looking for regularities in the data, such as groups of similar instances.
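A hedged scikit-learn sketch of the contrast: the supervised learner is given both the input attributes and the target values, while the unsupervised learner sees only the inputs:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))             # input attributes
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # target attribute (labels)

# Supervised: learn a mapping from the inputs X to the target y.
clf = LogisticRegression().fit(X, y)

# Unsupervised: no target; look for structure (here, two clusters) in X alone.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(clf.predict(X[:3]), km.labels_[:3])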
Types of supervised learning
When a data set is composed of numeric attributes, then prediction models based on regression are
frequently used. Regression analysis estimates the expected (or average) value of a numeric target
attribute when all the input attributes are fixed. The first step in a regression analysis is to
hypothesize the structure of the relationship between the input attributes and the target. Then a
parameterized mathematical model of the hypothesized relationship is defined. This parameterized
model is called a regression function. You can think of a regression function as a machine that
converts inputs to an output value and of the parameters as the settings that control the behavior of a
machine. A regression function may have multiple parameters, and the focus of regression analysis
is to find the correct settings for these parameters.
When a linear relationship is assumed, the regression analysis is called linear regression. The
simplest application of linear regression is modeling the relationship between two attributes: an
input attribute X and a target attribute Y. In this simple linear-regression problem, the regression
function has the following form:
Y = ω0 + ω1X
This regression function is just the equation of a line (often written as y = mx + c) that is familiar to most people from high school geometry. The values ω0 and ω1 are the parameters of the
regression function. Modifying these parameters changes how the function maps from the input X to
the output Y. The parameter ω0 is the y-intercept (or c in high school geometry) that specifies
where the line crosses the vertical y axis when X is equal to zero. The parameter ω1 defines the
slope of the line.
Neural Networks and Deep Learning
A neural network consists of a set of neurons that are connected together. A neuron takes a set of
numeric values as input and maps them to a single output value. At its core, a neuron is simply a
multi-input linear-regression function. The only significant difference between the two is that in
a neuron the output of the multi-input linear-regression function is passed through another function
that is called the activation function.
These activation functions apply a nonlinear mapping to the output of the multi-input linear-
regression function. Two commonly used activation functions are the logistic function and tanh
function.
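A minimal sketch of a single neuron: a multi-input linear-regression function whose output is passed through a logistic activation function (the weights here are made up for illustration):

import numpy as np

def logistic(z):
    # Activation function: squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, bias):
    # Linear-regression part: weighted sum of the inputs plus a bias term.
    z = np.dot(weights, inputs) + bias
    # Nonlinear part: pass the result through the activation function.
    return logistic(z)

print(neuron(np.array([0.5, -1.2, 3.0]), np.array([0.8, 0.1, -0.4]), bias=0.2))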
One of the most exciting technical developments in the past 10 years has been the emergence of
deep learning. Deep-learning networks are simply neural networks that have multiple layers of
hidden units; in other words, they are deep in terms of the number of hidden layers
they have.
Decision Trees
Linear regression and neural networks work best with numeric inputs. If the input attributes in a
data set are primarily nominal or ordinal, however, then other ML algorithms and models, such as
decision trees, may be more appropriate.
A decision tree encodes a set of if-then-else rules in a tree structure. Figure 16 illustrates a decision
tree for deciding whether an email is spam or not. Rectangles with rounded corners represent tests
on attributes, and the square nodes indicate decision, or classification, nodes.
The tree above encodes the following rules: if the email is from an unknown sender, then it is spam;
if it isn’t from an unknown sender but contains suspicious words, then it is spam; if it is neither
from an unknown sender nor contains suspicious words, then it is not spam.
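Written out as plain Python, the rules that this tree encodes look roughly as follows (a sketch of what the learned tree represents, not of how it is learned):

def is_spam(unknown_sender: bool, contains_suspicious_words: bool) -> bool:
    # Root test: who sent the email?
    if unknown_sender:
        return True                   # spam
    # Second test, reached only for known senders.
    if contains_suspicious_words:
        return True                   # spam
    return False                      # not spam

print(is_spam(unknown_sender=False, contains_suspicious_words=True))  # True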
In a decision tree, the decision for an instance is made by starting at the top of the tree and
navigating down through the tree by applying a sequence of attribute tests to the instance. Each
node in the tree specifies one attribute to test, and the process descends the tree node by node by
choosing the branch from the current node with the label matching the value of the test attribute of
the instance. The final decision is the label of the terminating (or leaf) node that the instance
descends to.
Each path in a decision tree, from root to leaf, defines a classification rule composed of a sequence
of tests. The goal of a decision-tree-learning algorithm is to find a set of classification rules that
divide the training data set into sets of instances that have the same value for the target attribute.
The idea is that if a classification rule can separate out from a data set a subset of instances that
have the same target value, and if this classification rule is true for a new example (i.e., the example
goes down that path in the tree), then it is likely that the correct prediction for this new example is
the target value shared by all the training instances that fit this rule.