I
Day 2 of 50
Intro to Data Science.
A step-by-step guide to learn data science by Data Science East Africa team.
PREPARED BY DSEA
II
Have you ever wondered how Amazon and eBay suggest items
for you to buy?
How Gmail sorts your emails into spam and non-spam
categories?
How Netflix predicts which shows you will like?
How do they do it? These are a few of the questions we ponder from time to time. In reality, such
tasks are impossible without data. Data science is all about using data to solve
problems.
The problem could be a decision, such as identifying which email is spam and which is not; a
product recommendation, such as which movie to watch; or a prediction, such as who will
be the next President of Kenya. So the core job of a data scientist is to understand the data,
extract useful information from it and apply that information to solve problems.
Data science is a blend of various tools, algorithms and machine learning principles with the goal of
discovering hidden patterns in raw data.
III
Who is a data scientist
The person who practices the art of data science is called a data scientist. The term “data
scientist” reflects the fact that a data scientist draws heavily on scientific fields and their
applications, such as statistics and mathematics.
Josh Wills defined a data scientist as a person who is better at statistics than any software engineer
and better at software engineering than any statistician.
Data scientists crack complex data problems using strong expertise in certain
scientific disciplines. They work with elements of mathematics, statistics, computer
science and related fields (though they may not be experts in all of them).
They make heavy use of the latest technologies to find solutions and reach conclusions that are
crucial for an organization’s growth and development. Data scientists present data in a much more
useful form than the raw data available to them, which may come from structured as well as
unstructured sources.
IV
Data Science Life Cycle
Step 1: Define Problem Statement
Creating a well-defined problem statement is the first and most critical step in
data science. It is a brief description of the problem that you are going to
solve. Remember, all the effort and work you do after defining the
problem statement goes into solving it. The problem statement comes from
your client. Your client can be your boss or a colleague, or the problem can come
from your own personal project. The client tells you the problems they are facing.
Some examples are shown below.
I want to predict loan defaults for my credit department.
I want to increase revenue.
I want to recommend jobs to my clients.
Most of the time, the initial problem shared with you is vague
and ambiguous. For example, the problem statement “I want to increase
revenue” doesn’t tell you by how much to increase revenue (20% or 30%?),
for which products to increase it, or over what time
frame. You have to make the problem statement
clear, goal-oriented and measurable. This can be achieved by asking the
right set of questions.
But how can you ask better questions to create a well-defined
problem statement?
You should ask open-ended rather than closed-ended questions.
Open-ended questions help to uncover unknown unknowns. Unknown
unknowns are the things you don’t know that you don’t
know.
In these slides, we will work on the problem statement “Which club will win the English Premier League (EPL)?”
Step 2: Data Collection
You need to collect data that can help solve the problem.
Data collection is a systematic approach to gathering relevant
information from a variety of sources. Depending on the problem
statement, data collection methods fall broadly into two
categories.
First, when you have a unique problem and no related research has been
done on the subject, you need to collect new data. This method
is called primary data collection. For example, suppose you want
information on the average time that employees spend in a cafeteria
across companies. There is no public data available on this, but you
can collect the data through methods such as surveys,
interviews with employees and monitoring the time employees spend
in the cafeteria. This method is time-consuming.
The second method is to use data that is readily available or has been collected by
someone else. Such data can be found on the internet, in news articles,
government censuses, magazines and so on. This method is called
secondary data collection. It is less time-consuming than the
primary method.
For our problem statement on the EPL, you can collect and aggregate
data from various open data sources such as GitHub, Kaggle and
DataHub.
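As a rough illustration of working with secondary data, the sketch below parses a small results file with Python's standard csv module. The column names and the three rows are made up for the example; a real download from one of these sites will have its own layout.

```python
import csv
import io

# Illustrative stand-in for a downloaded results file. The column names and
# rows are assumptions for this sketch, not real EPL data.
SAMPLE_CSV = """home_team,away_team,home_goals,away_goals
Team A,Team B,2,1
Team B,Team C,0,0
Team C,Team A,1,3
"""

def load_matches(csv_text):
    """Parse CSV text into a list of dicts with integer goal counts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    matches = []
    for row in reader:
        # csv returns every field as a string, so convert the goal columns.
        row["home_goals"] = int(row["home_goals"])
        row["away_goals"] = int(row["away_goals"])
        matches.append(row)
    return matches

matches = load_matches(SAMPLE_CSV)
print(len(matches), "matches loaded")
```

For a file on disk you would pass an opened file object to csv.DictReader instead of the in-memory string used here.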
Step 3: Data Quality Check and Remediation
One of the most important aspects of data science, and one often
ignored by data scientists, is ensuring that the data used for analysis and
interpretation is of good quality.
After collecting the data, most people start analyzing it right away. Often,
they forget to do a sanity check on the data. If the data is of bad
quality, it can give misleading information. Simply said: “Garbage in,
garbage out”.
For example, if you start the analysis without ensuring data quality, you might get unexpected
results, such as Crystal Palace winning the next EPL. However, your domain knowledge of the EPL
tells you that the result looks inaccurate, as Crystal Palace has never even finished in the top 4 (just an
example for soccer fans).
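A sanity check of this kind can be sketched in a few lines. The example below flags missing values, impossible values and exact duplicate rows in a list of match records. The field names and the sample rows are hypothetical, and the particular checks are only a starting point; real data will need checks suited to its own quirks.

```python
def find_quality_issues(matches):
    """Return a list of (row_index, problem) tuples for suspicious rows."""
    fields = ("home_team", "away_team", "home_goals", "away_goals")
    issues = []
    seen = set()
    for i, m in enumerate(matches):
        # Missing values: None or an empty string.
        for field in fields:
            if m.get(field) in (None, ""):
                issues.append((i, f"missing {field}"))
        # Impossible values: a team cannot score a negative number of goals.
        for field in ("home_goals", "away_goals"):
            value = m.get(field)
            if isinstance(value, int) and value < 0:
                issues.append((i, f"negative {field}"))
        # Impossible fixture: a team cannot play against itself.
        if m.get("home_team") and m.get("home_team") == m.get("away_team"):
            issues.append((i, "team plays itself"))
        # Exact duplicates of an earlier row.
        key = tuple(m.get(field) for field in fields)
        if key in seen:
            issues.append((i, "duplicate row"))
        seen.add(key)
    return issues

# Hypothetical rows, deliberately containing problems.
sample = [
    {"home_team": "Team A", "away_team": "Team B", "home_goals": 2, "away_goals": 1},
    {"home_team": "Team A", "away_team": "Team B", "home_goals": 2, "away_goals": 1},
    {"home_team": "Team C", "away_team": "Team A", "home_goals": -1, "away_goals": 0},
    {"home_team": "Team B", "away_team": "", "home_goals": 1, "away_goals": None},
]

for row_index, problem in find_quality_issues(sample):
    print(f"row {row_index}: {problem}")
```

Remediation then depends on the problem: duplicates can usually be dropped, while missing or impossible values need to be corrected from the source or excluded.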
Step 4: Exploratory Data Analysis
Before you model the steps to arrive at a solution, it’s important to
analyze the data. This is the most exciting step, as it helps you build
familiarity with the data and extract useful insights. If you skip this
step, you might end up building inaccurate models and
choosing insignificant variables for your model.
To quote John Tukey, the pioneer of exploratory data analysis: “It is
important to understand what you CAN DO before you learn to
measure how WELL you seem to have DONE it.”
You can use descriptive statistics such as measures of central tendency and
measures of variability. Visualization methods such as graphs and
plots can also be used for analysis.
For example, you can calculate the average goals scored per match.
You can also check whether home advantage is a real thing.
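Both checks are simple descriptive statistics. The sketch below computes the average goals per match and the share of home wins, draws and away wins for a handful of made-up results; the tuple layout and the team names are assumptions for the example.

```python
# Hypothetical results: (home_team, away_team, home_goals, away_goals).
matches = [
    ("Team A", "Team B", 2, 1),
    ("Team B", "Team C", 0, 0),
    ("Team C", "Team A", 1, 3),
    ("Team A", "Team C", 1, 0),
]

def average_goals(matches):
    """Mean number of goals (both sides combined) per match."""
    return sum(h + a for _, _, h, a in matches) / len(matches)

def outcome_shares(matches):
    """Fraction of matches ending in a home win, a draw or an away win."""
    n = len(matches)
    home = sum(1 for _, _, h, a in matches if h > a)
    draws = sum(1 for _, _, h, a in matches if h == a)
    away = n - home - draws
    return {"home_win": home / n, "draw": draws / n, "away_win": away / n}

print("average goals per match:", average_goals(matches))
print("outcome shares:", outcome_shares(matches))
```

If home advantage is real, the home-win share over a full season should be clearly larger than the away-win share.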
A bar chart of Liverpool's home results shows their home advantage: of all their home matches, they
won 17, drew 2 and never lost.
However, for Leicester City there is no home advantage: a bar chart of their home results shows that of all their
home matches, they won 8, lost 8 and drew 3.
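The two home records can also be compared numerically. The sketch below uses the win, draw and loss counts quoted above and computes each club's home win rate and points per home game, using the standard league scoring of 3 points for a win and 1 for a draw. The function name and dictionary layout are illustrative choices.

```python
# Home records as quoted in the slides.
home_records = {
    "Liverpool": {"won": 17, "drawn": 2, "lost": 0},
    "Leicester City": {"won": 8, "drawn": 3, "lost": 8},
}

def summarise(record):
    """Return (matches played, win rate, points per game) for one record."""
    played = record["won"] + record["drawn"] + record["lost"]
    win_rate = record["won"] / played
    # Standard league scoring: 3 points for a win, 1 for a draw.
    points_per_game = (3 * record["won"] + record["drawn"]) / played
    return played, win_rate, points_per_game

for club, record in home_records.items():
    played, win_rate, ppg = summarise(record)
    print(f"{club}: played {played}, win rate {win_rate:.1%}, "
          f"points per home game {ppg:.2f}")
```

A full check of home advantage would compare these figures with the same clubs' away records, which the slides do not give.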
Data analysis is an iterative process that helps you get closer to the solution. Every iteration has a cost
associated with it, so it is advisable to plan properly and keep the number of
iterations down. You can play with the data, create your own graphs and learn to derive inferences. All of these
analyses are collectively known as exploratory data analysis (EDA).
Step 5: Data Modelling
Modelling means formulating every step and gathering the techniques
required to reach the solution. You need to list the flow of
calculations, which is nothing but the modelling steps to the
solution.
An important factor is how to perform the calculations. There are
various techniques in statistics and machine learning that you
can choose from based on the requirement.
For our EPL data, we chose to use statistical techniques to predict the winner in the current season.
Top 6 teams
From the data of the last three seasons, we have learnt that the EPL winner changes but the top 6
clubs remain the same. These clubs are Arsenal, Chelsea, Liverpool, Manchester City, Manchester United
and Tottenham Hotspur. The winner of the next season is very likely to come from these top 6 teams.
Rank Clubs on Player Skills
As we know, in football a player can be an attacker, defender, midfielder or goalkeeper. Each of these
positions requires a different set of physical skills. A player is assigned a score based on his physical
skills and the position he plays in. The scores are given on a scale of 0–100, and each skill
is assigned a weight.
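One plausible reading of this scoring is a weighted average of skill scores, with weights that depend on position. The sketch below follows that reading. The skill names, the weights and the example player are all invented for illustration, since the slides do not list the actual skills or weights used.

```python
# Hypothetical position-specific weights; the slides do not give the real ones.
POSITION_WEIGHTS = {
    "attacker": {"pace": 3, "shooting": 5, "passing": 2},
    "defender": {"pace": 2, "tackling": 5, "strength": 3},
}

def player_score(position, skills):
    """Weighted average of 0-100 skill scores for the given position."""
    weights = POSITION_WEIGHTS[position]
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * skills[name] for name in weights)
    # Dividing by the total weight keeps the result on the 0-100 scale.
    return weighted_sum / total_weight

# Made-up attacker for the example.
example_skills = {"pace": 80, "shooting": 90, "passing": 70}
print(player_score("attacker", example_skills))
```

A club's skill rating could then be built by aggregating its players' scores, for example by averaging them, although the slides do not say which aggregation was used.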
Rank Clubs on Past Performance
We rank the teams' past performance by the number of wins and the goal difference in the last season.
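A ranking like this can be sketched as a sort on two keys. The example below orders clubs by wins first and uses goal difference as the tie-breaker; that ordering is an assumption, as the slides do not say how the two measures are combined. The club names and figures are placeholders.

```python
# Hypothetical last-season figures: club -> (wins, goal difference).
last_season = {
    "Club A": (26, 45),
    "Club B": (26, 52),
    "Club C": (21, 30),
}

def rank_by_performance(stats):
    """Return {club: rank}, where rank 1 is the best past performance."""
    # Sort by wins, then goal difference, both in descending order.
    ordered = sorted(stats, key=lambda club: stats[club], reverse=True)
    return {club: rank for rank, club in enumerate(ordered, start=1)}

print(rank_by_performance(last_season))
```

Here Club A and Club B are level on wins, so the better goal difference decides which of them ranks first.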
Combine the Ranks to predict the winner
Giving 80% weight to the players’ skills and 20% weight to the clubs’ past performance, we get the final
ranking of the top 6 clubs. Based on these calculations, Tottenham Hotspur is likely to win this EPL.
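The combination step can be sketched as a weighted sum of the two rank positions, where the lowest combined value wins. Whether the original calculation combined rank positions or underlying scores is not stated, so treat this as one possible interpretation. The club names and ranks below are placeholders, not the actual results.

```python
def combine_ranks(skill_rank, performance_rank, skill_weight=0.8):
    """Order clubs by a weighted sum of two rank positions (lower is better)."""
    performance_weight = 1 - skill_weight
    combined = {
        club: skill_weight * skill_rank[club]
        + performance_weight * performance_rank[club]
        for club in skill_rank
    }
    # The smallest combined value is the predicted winner.
    return sorted(combined, key=combined.get)

# Hypothetical rank positions (1 = best) for three placeholder clubs.
skill_rank = {"Club A": 1, "Club B": 3, "Club C": 2}
performance_rank = {"Club A": 3, "Club B": 1, "Club C": 2}

final_order = combine_ranks(skill_rank, performance_rank)
print("predicted winner:", final_order[0])
```

Because skills carry four times the weight of past performance, Club A comes out on top here despite having the worst past performance of the three.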
Step 6: Data Communication
This is the final step, where you present the results of your analysis
to the stakeholders. You explain to them how you reached a specific
conclusion and what your critical findings are.
Most often you need to present your findings to a non-technical
audience, such as the marketing team or business executives. You
need to communicate the results in a way that is simple to understand,
so that the stakeholders can draw up an actionable plan
from them.
Know your audience and speak their language
You should know your audience and speak their language. For example, suppose you are presenting the prediction
of the EPL winner to a football fan. They may not understand the statistics you used, but they can relate to
the steps you followed to decide the winner.
Focus on values and outcomes
You should focus on value and outcomes. Your stakeholders are less interested in how you got the
data than in where it came from. Using credible sources helps build confidence in your
prediction.
Communicate assumptions and limitations
You should clearly communicate the important assumptions you have made and the limitations of your analysis. For example,
to calculate the overall rating of a team you have assumed a 3–4–3–1 formation for all teams. It’s
important that you communicate these to the stakeholders.
Although the goal is to predict the winner, there might be other critical findings that are relevant,
for example the teams with the best attack, midfield, defence and goalkeeping, or the best player for each
position in the league based on the players’ skills.
Such a unified view of the data in the form of graphs, charts and numbers is called a dashboard. You can
use Excel to create the dashboard.
Now, it is your turn. Take a problem and solve it as a data scientist.
V
Happy Learning!
DATA SCIENCE EAST AFRICA TEAM.
Sources and References
Intro to Data Science
Premier League Stats
Wikipedia: Data Science
Do you have
any questions?
We're always here for you
datsscienceeastafrica@[Link]
Twitter: DataScience_EA, TechMadi