Mdss Notes Unit I
PART I: DEFINING AI
PART II: BUILDING AND SUSTAINING A TEAM
WHAT IS DATA SCIENCE?
Data science is the in-depth study of massive amounts of data. It involves extracting meaningful insights from raw, structured, and unstructured data, processed using the scientific method, different technologies, and algorithms. Data science uses powerful hardware, programming systems, and efficient algorithms to solve data-related problems. It is the future of artificial intelligence.
Defining AI
AI is a sub-field of computer science and mathematics. AI is a general scientific field
that covers everything related to weak and strong AI.
Strong AI and weak AI are two branches of artificial intelligence. Strong AI, like an intelligent character from a sci-fi movie, would be able to think, learn, and perform tasks just like a human. Weak AI, the kind we encounter daily, focuses on doing one job well, such as recommending movies, giving us directions, or helping us pick the perfect playlist.
Some examples of weak AI are as follows:
Cars like Tesla with self-driving technology
Voice assistants like Alexa, Google Assistant, and Siri
Google Maps
Chatbots like ChatGPT and Bard
Recommendation systems like Amazon, Spotify, and Netflix
Spam filters on email
Examples of strong AI appear in sci-fi movies such as Wall-E, Big Hero 6, The Terminator, and Vision from Marvel.
Machine learning is a subfield of AI that studies algorithms that can adapt their behavior
based on incoming data without explicit instructions from a programmer.
Deep learning is a subfield of machine learning that studies a specific kind of machine
learning model called deep neural networks.
Data science is a multidisciplinary field that uses a set of tools to extract knowledge
from data and support decision making. Machine learning and deep learning are among
the main tools of data science.
Structured data is hard to gather and maintain, but it is the easiest to analyze. The
reason is
that we often collect it for this exact purpose. Structured data is typically stored inside computer databases and files. In digital advertising, ad networks put huge effort into collecting as much data as possible.
Let's take a data table with two columns of numbers: x and y. We can say that this data has two dimensions. Every value in this dataset is displayed on the following plot. For example, for x = 10, y will be equal to 8, which matches the real data points depicted as blue dots.
If we represent pixels in this image as numbers, we will end up with over twelve million
of them in each photo. In other words, our data has a dimensionality of 12 million.
Many machine learning algorithms suffer from a problem called the curse of
dimensionality.
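As a rough illustration (the exact resolution is an assumption, not stated in the notes), a 4000 × 3000 photo flattened into a feature vector gives the 12 million dimensions mentioned above:

```python
import numpy as np

image = np.zeros((3000, 4000))  # hypothetical grayscale photo, one number per pixel
features = image.flatten()      # one long vector of pixel values
print(features.shape)           # (12000000,) -> a dimensionality of 12 million
```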
The following two pictures help to describe the difference between object detection and
instance segmentation:
Thus, the main practical uses of deep learning in computer vision are essentially the
same
tasks at different levels of resolution:
Image classification: Determining the class of an image from a predetermined
set of categories
Object detection: Finding bounding boxes for objects inside an image and
assigning a class probability for each bounding box
Offline model testing encompasses all the model-evaluation processes that are
performed
before the model is deployed.
Before discussing online testing in detail, we must first define model errors and ways to
calculate them.
The difference between the real value and the model's approximation makes up the model error: Error = (real value) − (predicted value).
For regression problems, we can measure the error in the same units as the quantity the model predicts.
For example, if we predict house prices using a machine learning model and get a
prediction of $300,000 for a house with a real price of $350,000, we can say that the
error is
$350,000 - $300,000 = $50,000.
For classification problems in the simplest setting, we can measure the error as 0 for a correct answer and 1 for a wrong one. For example, for a cat/dog recognizer, we give an error of 1 if the model predicts that there is a cat in a dog photo, and 0 if it gives a correct answer.
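A minimal sketch of the two error definitions above, using the same house-price and cat/dog examples:

```python
# Regression error: difference between the real value and the prediction.
real_price, predicted_price = 350_000, 300_000
print(real_price - predicted_price)  # 50000

# Classification error in the simplest setting: 0 for a correct answer, 1 for a wrong one.
def zero_one_error(true_label, predicted_label):
    return 0 if true_label == predicted_label else 1

print(zero_one_error("dog", "cat"))  # 1 (the model saw a cat in a dog photo)
print(zero_one_error("dog", "dog"))  # 0 (correct answer)
```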
Decomposing errors
Suppose that our model makes a prediction and we know the real value. If this
prediction is incorrect, then there is some difference between the prediction and the
true value:
Part of this error comes from imperfections in our data (the irreducible error), and the other part from imperfections in our model (the reducible error).
Not all reducible errors are the same. We can decompose reducible errors further. For
example, look at the following diagram:
The red center of each target represents our goal (the real value), and the blue shots represent the model's predictions. In the first target, the model's aim is off: all predictions are close together, but they are far away from the center. This kind of error is called bias. The simpler our model is, the more bias it will have. For a simple model, the bias component can become dominant:
In the preceding plot, we try to model a complex relationship between variables with a simple straight line. This kind of model has high bias.
In the second target, all predictions are clustered around the true value, but the spread is too high. This kind of error is called variance: the more complex the model, the higher its variance. Thus, we have decomposed the model error into the following three components: bias, variance, and irreducible error.
For example, suppose we predict a housing price as a linear function of its size in square feet: price ≈ w × size + b, where w and b are parameters estimated from the data. There is a relationship between bias and variance. Predictive models show a property called the bias-variance tradeoff: the more biased a model is, the lower the variance component of its error, and conversely, the more variance a model has, the lower its bias will be.
Figure: The first simple model has low variance, and the second complex model has high
variance:
Understanding overfitting
The bias-variance trade-off goes hand in hand with a very important problem in machine learning called overfitting. If your model is too simple, it will make large errors. If it is too complex, it will memorize the training data too well and fail to generalize to unseen data.
To measure model error, we divide the data into a training set and a test set. We use the data in the training set to train our model, while the test set acts as unseen data. Now suppose that you have done the following:
1. Trained a model
2. Measured the error on the test data
3. Changed your model to improve the metrics
4. Repeated steps 1-3 ten times
5. Deployed the model to production
You looked at a score, and changed your model or data processing code several
consecutive times. In fact, you did several learning iterations by hand. By repeatedly
improving the test score, you indirectly disclosed information about the test data to
your model. When the metric values measured on a test set deviate from the metrics
measured on the real data, we say that the test data has leaked into our model. Data
leaks are notoriously hard to detect before they cause damage. To avoid them,
Data scientists use validation sets to tune model parameters and compare different models before choosing the best one. Then, the test data is used only as a final check that informs you about model quality on unseen data. After measuring the test metric scores, the only decision left to make is whether the model will proceed to testing in a real-world scenario.
In the following screenshot, you can see an example of a train/validation/test split of the
dataset:
Unfortunately, the following two problems persist when we use this approach:
The information about our test set might still leak into our solution after many
iterations. Test-set leakage does not disappear completely when you use the validation
set, it just becomes slower. To overcome this, change your test data from time to time.
Ideally, make a new test set for every model-deployment cycle.
You might overfit your validation data quickly, because of the train-measure-change feedback cycle you use for tuning your models.
To prevent overfitting, you can randomly select train and validation sets from your data
for each experiment. Randomly shuffle all available data, then select random train and
validation datasets by splitting the data into three parts according to proportions you
have chosen.
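A minimal sketch of such a random split, assuming a pandas DataFrame called data (the name and the 60/20/20 proportions are assumptions):

```python
from sklearn.model_selection import train_test_split

# Shuffle and split into three parts: 60% train, 20% validation, 20% test.
train, rest = train_test_split(data, test_size=0.4, shuffle=True, random_state=42)
validation, test = train_test_split(rest, test_size=0.5, shuffle=True, random_state=42)
print(len(train), len(validation), len(test))
```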
There is no general rule for how much training, validation, and testing data you should use. In general, more training data means a more accurate model, but it also means that you will have less data to assess the model's performance. The typical split for medium-sized datasets (up to 100,000 data points) is to use 60-80% of the data to train the model and use the rest for validation and testing.
The situation changes for large datasets.
If you have a dataset with 10,000,000 rows, using 30% for testing would mean 3,000,000 rows, which is likely to be overkill. Increasing test and validation set sizes yields diminishing returns. For some problems, you will get good results with 100,000 examples for testing, which amounts to a 1% test size. The more data you have, the lower the proportion you should use for testing.
Sometimes there is too little data. In those situations, taking 30-40% of the data for testing and validation might severely decrease the model's accuracy. You can apply a technique called cross-validation in data-scarce situations. With cross-validation, there is no need to create a separate validation or test set. Cross-validation proceeds in the following way:
1. You choose some fixed number of iterations—three, for example.
2. Split the dataset into three parts.
3. For each iteration, cross-validation uses 2/3 of the dataset as training data and
1/3 as validation data.
4. Train a model on each of the three train-validation set pairs.
5. Calculate the metric values using each validation set.
6. Aggregate the metrics into a single number by averaging all metric values.
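A minimal sketch of this procedure with scikit-learn; the feature matrix X, the labels y, and the choice of classifier are assumptions, not from the notes:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=3)  # three train/validation splits
print(scores.mean())                         # aggregate the metric by averaging
```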
RMSE penalizes large errors more than MAE. This property comes from the fact that RMSE uses squared errors, while MAE uses absolute values.
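A small made-up example showing this property: one large error of 100 pushes RMSE well above MAE.

```python
import numpy as np

y_true = np.array([100, 200, 300])
y_pred = np.array([110, 190, 400])               # one large error of 100

mae = np.mean(np.abs(y_true - y_pred))           # (10 + 10 + 100) / 3 = 40
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # sqrt((100 + 100 + 10000) / 3) ≈ 58.3
print(mae, rmse)                                 # RMSE > MAE: large errors weigh more
```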
Accuracy = ncorrect / N, where ncorrect is the number of correct predictions and N is the total number of predictions.
Let's assume the average probability of having pneumonia is 0.001%. That is, one
person out of 100,000 has the illness. If you had collected data on 200,000 people, it is
feasible that your dataset would contain only two positive cases. Imagine you have
asked a data scientist to build a machine
learning model that estimates pneumonia probability based on a patient's data. You
have said that you would only accept an accuracy of no less than 99.9%. Suppose that
someone created a dummy algorithm that always outputs zeros.
This model has no real value, but its accuracy on our data will be high as it will make
only two errors:
After looking at this table, we can see that the dummy model won't be helpful to anyone. It did not identify the two people with the condition as positive; we call those errors False Negatives (FN). The model also correctly identified all patients with no pneumonia, or True Negatives (TN), but it failed to diagnose the ill patients. A model that correctly identified the two ill cases would instead make two True Positive (TP) predictions.
In the following table, you can see two new metrics for summarizing different kinds of
errors, precision and recall:
For binary classification, precision and recall reduce the number of metrics we must work with to two. This is better, but not ideal. We can sum everything up into a single number by using a metric called the F1 score.
The F1 score is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall). F1 is 1 for a perfect classifier and 0 for the worst classifier. Because it considers both precision and recall, it does not suffer from the same problem as accuracy and is a better default metric for classification problems.
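A minimal sketch of the pneumonia example above: 200,000 patients, two positive cases, and a dummy model that always predicts "healthy". Accuracy looks excellent, while recall and F1 expose the useless model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = np.zeros(200_000, dtype=int)
y_true[:2] = 1                          # the two ill patients
y_pred = np.zeros(200_000, dtype=int)   # dummy model: always predicts "no pneumonia"

print(accuracy_score(y_true, y_pred))                    # 0.99999
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred))                      # 0.0 (both ill patients missed)
print(f1_score(y_true, y_pred))                          # 0.0
```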
Imbalanced classes
Imbalanced data refers to datasets where the target class has an uneven distribution of observations, i.e., one class label has a very high number of observations and the other has a very low number.
To illustrate this, let's take a trained model and generate predictions for the test dataset. If we calculate class assignments by taking lots of different thresholds, and then calculate the precision, recall, and F1 score for each of those assignments, we can depict each precision and recall value in a single plot:
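The plot itself is not reproduced here; a minimal sketch of how such a precision-recall curve could be produced is shown below, assuming a trained classifier model and a test set X_test, y_test (all assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Predicted probabilities for the positive class.
probabilities = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probabilities)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
```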
Online model testing
Even a great offline model testing pipeline won't guarantee that the model will perform
exactly the same in production. There are always risks that can affect your model
performance, such as the following:
The experiment setup for a hypothesis test splits test targets into two groups on
purpose.
We can try to use a single group instead. For instance, we can take one set of
measurements
with the old model. After the first part of the experiment is finished, we can deploy the
new
algorithm and measure its effect. Then, we compare two measurements made one after
another. What could go wrong? In fact, the results we get wouldn't mean anything.
Many
things could have changed in between our measurements, such as the following:
User preferences
General user mood
Popularity of our service
Average user profile
Any other attribute of users or businesses
All these hidden effects could change our measurements in unpredictable ways, which is why we need two groups: test and control. We must select these groups in such a way that the only difference between them is our hypothesis: it should be present in the test group and missing from the control group. To illustrate, in medical trials the control group is the one that gets the placebo. Suppose we want to test the positive effect of a new painkiller; the test group would receive the real drug, while the control group would receive a placebo.
Here are some examples of bad test setups:
The easiest way to create groups is random selection. Truly random selection may be
hard
to do in the real world, but is easy if you deal with internet services. There, you may just
randomly decide which version of your algorithm to use for each active user. Be sure to
always design experiment setups with an experienced statistician or data scientist, as
correct tests are notoriously hard to execute, especially in offline settings.
Statistical tests check the validity of the null hypothesis, that is, the assumption that the results you got occurred by chance. The opposite statement is called the alternative hypothesis. For instance, here is the hypothesis set for our ad model test:
Null hypothesis: The new model does not affect the ad service revenue.
Alternative hypothesis: The new model affects the ad service revenue.
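A minimal sketch of checking these hypotheses with a two-sample t-test; the per-user revenue samples below are made up:

```python
from scipy import stats

control_revenue = [1.2, 0.9, 1.4, 1.1, 0.8, 1.3]  # old model (control group)
test_revenue = [1.5, 1.1, 1.6, 1.4, 1.0, 1.7]     # new model (test group)

# Null hypothesis: the two groups have the same mean revenue.
t_stat, p_value = stats.ttest_ind(test_revenue, control_revenue)
if p_value < 0.05:
    print("Reject the null hypothesis: the new model affects revenue.")
else:
    print("Not enough evidence to reject the null hypothesis.")
```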
The amount of data you need to collect for conducting a hypothesis test depends on
several
factors:
Confidence level: The more statistical confidence you need, the more data is required
to support the evidence.
Statistical power: This measures the probability of detecting a significant difference, if
one exists. The more statistical power your test has, the lower the chance of false
negative responses.
Hypothesized difference and population variance: If your data has large variance,
you need to collect more data to detect a significant difference. If the difference
between the two means is smaller than population variance, you would need even more
data.
You can see how different test parameters determine their data hunger in the following
table:
In situations where you can trade off statistical rigor for speed and risk-aversion, there
is an
alternative approach called Multi-Armed Bandits (MABs). To understand how MABs
work, imagine yourself inside a casino with lots of slot machines. You know that some of
those machines yield better returns than others. Your task is to find the best slot
machine with a minimal number of trials. Thus, you try different (multi) arms of slot
machines (bandits) to maximize your reward. You can extend this situation to testing
multiple ad models: for each user, you must find a model that is most likely to increase
your ad revenue.
The most popular MAB algorithm is called an epsilon-greedy bandit. Despite the name,
the
inner workings of the method are simple:
1. Select a small number called epsilon. Suppose we have chosen 0.01.
2. Choose a random number between 0 and 1. This number will determine whether
MAB will explore or exploit a possible set of choices.
3. If the number is less than or equal to epsilon, make a choice at random and record the
reward after performing the action tied to your choice. We call this process
exploration – with a low probability, the MAB tries different actions at random to
find out their mean reward.
4. If your number is greater than epsilon, make the best choice according to the data
you have collected. We call this process exploitation – MAB exploits knowledge
it has collected to execute an action that has the best expected reward. MAB
selects the best action by averaging all recorded rewards for each choice and
selecting a choice with the greatest reward expectation.
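A minimal sketch of an epsilon-greedy bandit choosing between three hypothetical ad models; the simulated reward function is a made-up stand-in for observed ad revenue:

```python
import random

epsilon = 0.01
choices = ["model_a", "model_b", "model_c"]
rewards = {c: [] for c in choices}  # recorded rewards per choice

def choose():
    # Explore with probability epsilon (or while some choice has no data yet).
    if random.random() <= epsilon or not all(rewards.values()):
        return random.choice(choices)
    # Exploit: pick the choice with the best average recorded reward.
    return max(choices, key=lambda c: sum(rewards[c]) / len(rewards[c]))

def observe_reward(choice):
    # Simulated revenue per impression; in reality this comes from the ad service.
    means = {"model_a": 1.0, "model_b": 1.2, "model_c": 0.8}
    return random.gauss(means[choice], 0.1)

for _ in range(10_000):
    c = choose()
    rewards[c].append(observe_reward(c))

best = max(choices, key=lambda c: sum(rewards[c]) / len(rewards[c]))
print(best)  # most likely "model_b", the arm with the highest mean reward
```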
Project stakeholders: Represent people who are interested in the project; in other
words, your customers. They generate and prioritize high-level requirements and goals
for the project.
Project users: People who will use the solution you are building. They should be
involved in the requirements-specification process to present a practical view on
the system's usability.
Business analysts: The main business expert of the team. They help to shape
business requirements and help data scientists to understand the details about the
problem domain. They define business requirements in the form of a business
requirements document (BRD), or stories, and may act as a product owner in agile
teams.
System analysts: They define, shape, and maintain software and integration
requirements. They create a software requirements document (SRD). In simple
projects or Proof of Concepts (PoC), this role can be handled by other team members.
Data analysts: Analysis in data science projects often requires building complex database queries and visualizing data. Data analysts can support other team members by creating data marts and interactive dashboards and by deriving insights from data.
Data scientists: They create models, perform statistical analysis, and handle other
tasks related to data science. For most projects, it will be sufficient to select and apply
existing algorithms. An expert who specializes in applying existing algorithms to solve
practical problems is called a machine or deep learning engineer. However, some
projects may ask for research and the creation of new
state-of-the-art models. For these tasks, a machine or deep learning researcher will be a
better fit. For those readers with computer science backgrounds, we can loosely
describe the difference between a machine learning engineer and research scientist as
the difference between a software engineer and a computer scientist.
Data engineers: They handle all data preparation and data processing. In simple
projects, data scientists with data engineering skills can handle this role.
However, do not underestimate the importance of data engineers in projects with
serious data processing requirements. Big data technology stacks are very complicated
to set up and work with on a large scale, and there is no one better to handle this task
than data engineers.
Data science team manager: They coordinate all tasks for the data team, plan
activities, and control deadlines.
Let's now see the general flow of how roles can work together to build the final solution:
Let's look at the flow and delivery artifacts of each step:
One of the main activities of this department is to detect and prevent credit card fraud.
They do this by using a rule-based, fraud detection system. This system looks over all
credit
card transactions happening in the bank and checks whether any series of transactions
should be considered fraudulent. Each check is hardcoded and predetermined. They
have
heard that ML brings benefits over traditional, rule-based, fraud detection systems. So,
they
have asked Mary to implement a fraud detection model as a plugin for their existing
system. Mary has inquired about the datasets and operators of the current fraud
detection
system, and the department confirmed that they will provide all necessary data from
the
system itself. The only thing they need is a working model, and a simple software
integration. The staff were already familiar with common classification metrics, so they
were advised to use F1-score with k-fold cross-validation.
In this project setup, project stakeholders have already finished the business analysis
stage.
They have an idea and a success criterion. Mary has clean and easy access to the data
source from a single system, and stakeholders can define a task in terms of a
classification
problem. They have also defined a clear way to test the results. The software
integration
requirements are also simple. Thus, the role flow of the project is simplified to just a few
steps, which are all performed by a single data scientist role:
As a result, Mary has laid out the following steps to complete the tasks:
1. Create a machine learning model.
2. Test it.
3. Create a model training pipeline.
4. Create a simple integration API.
5. Document and communicate the results to the customer.
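As a hypothetical illustration of step 4, a fraud-scoring endpoint could look roughly like the Flask sketch below; the model file name, endpoint path, and feature fields are all assumptions, not details from the case:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
with open("fraud_model.pkl", "rb") as f:  # hypothetical trained classifier
    model = pickle.load(f)

@app.route("/score", methods=["POST"])
def score():
    transaction = request.get_json()  # e.g. {"amount": 125.0, "hour": 23}
    features = [[transaction["amount"], transaction["hour"]]]  # hypothetical features
    probability = model.predict_proba(features)[0][1]
    return jsonify({"fraud_probability": float(probability)})

if __name__ == "__main__":
    app.run()
```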
The retail company has over 10,000 stores across the country, and ensuring the proper
level
of service quality for all stores is becoming difficult. Each store has a fixed number of
staff
members working full-time. However, each store has a different number of visitors. This
number depends on the geographical location of the store, holidays, and possibly many
more unknown factors. As a result, some stores are overcrowded and some experience
a
low number of visits. Jonathan's idea is to change the staff employment strategy so that
customer satisfaction will rise, and the employee burden will even out.
Instead of hiring a fixed store team, he suggests making it elastic. He wants to create a
special mobile application that will allow stores to adjust their staff list with great speed.
In
this application, a store manager can create a task for a worker. The task can be as
short as
one hour or as long as one year. A pool of workers will see all vacant slots in nearby
stores.
If a worker with the necessary skills sees a task that interests them, they can accept it
and go
to the store. An algorithm will prepare task suggestions for the store manager, who will
issue work items to the pool. This algorithm will use multiple machine learning models
to
recommend tasks. One model would forecast the expected customer demand for each
store.
The second model, a computer vision algorithm, will measure the length of the lines in
the
store. This way, each store will use the right amount of workforce, changing it based on
demand levels. With the new app, store employees will be able to plan vacations and
ask
for a temporary replacement. Calculations show that this model will keep 50,000
workers
with an average load of 40 hours per week for each worker, and a pool of 10,000 part-
time
workers with an average load of 15 hours per week. Management considered this model
to
be more economically viable and agreed to perform a test of the system in one store. If
the
test is successful, they will continue expanding the new policy.
The next thing they ask Jonathan is to come up with an implementation plan. He now
needs to decompose the project into a series of tasks that are necessary to deploy the
system
in one store. Jonathan prepared the following decomposition:
1. Collect and document initial requirements for the mobile app, forecasting, and
computer vision models.
2. Collect and document non-functional requirements and Service Level
Agreements for the system.
3. Decide on necessary hardware resources for development, testing, and
production.
4. Find data sources with training data.
5. Create data adapters for exporting data from source systems.
6. Create a software system architecture. Choose the technology stack.
7. Create development, test, and production environments.
8. Develop a forecasting model (full ML project life cycle).
9. Develop a line-length recognition model (full ML project life cycle).
10. Develop a mobile app.
11. Integrate the mobile app with the models.
12. Deploy the system to the test environment and perform an end-to-end system
test.
13. Deploy the system to the production environment.
14. Start the system test in one store.
Each point in this plan could be broken down further into 10-20 additional tasks. A full
task
decomposition for this project could easily include 200 points, but we will stop at this
level,
as it is sufficient for discussion purposes.
In reality, plans constantly change, so Jonathan has also decided to use a software
development project management framework, such as SCRUM, to manage deadlines,
requirement changes, and stakeholder expectations
Key skills of a data scientist
An ideal data scientist is often described as a mix of the following three things:
Domain expertise: They should know the business domain well. Without it, task
decomposition and prioritization becomes impossible, and the project will inevitably go
astray.
Data science: A good understanding of the basic concepts behind data science and
machine learning is essential. Without it, you will be building a house without knowing
what a house is. It will streamline communication and help to create good task
decompositions and stories.
Software engineering: Knowledge of basic software engineering will ensure that the
manager can keep an eye on crucial aspects of a project, such as software architecture
and technical debt. Good software project managers have development experience, which teaches them the value of writing automated tests, refactoring, and building good architecture. Unfortunately, many data science projects suffer from bad software design choices. Taking shortcuts is only good in the short term; in the long term, they will come back to bite you. Projects tend to scale with time: as the team grows, the number of integrations increases, and new requirements arrive. Bad software design paralyzes the project, leaving you with only one option: a complete rewrite of the system.
Common flaws of technical interviews
Data scientist: To implement prototypes based on the task definitions provided by the
business analyst and systems analyst.
Backend software engineer: To integrate prototypes with external systems and
implement software based on the requirements.
User interface software engineer: To implement interactive UIs and visualizations for prototype presentation.
Next, Robert thought about team size constraints and created the following positions from the preceding roles: