MDSS Notes – Unit I

The document provides an overview of artificial intelligence (AI) and data science, explaining their definitions, applications, and limitations. It distinguishes between weak and strong AI, discusses machine learning and deep learning as key components of data science, and highlights their roles in various fields such as healthcare and business. Additionally, it covers model testing, error analysis, and the importance of data types in machine learning, emphasizing the challenges of overfitting and the need for proper validation techniques.


UNIT – I

PART I: DEFINING AI
PART II: BUILDING AND SUSTAINING A TEAM
WHAT IS DATA SCIENCE?
Data science is the deep study of massive amounts of data. It involves extracting meaningful insights from raw, structured, and unstructured data that is processed using the scientific method, different technologies, and algorithms.

Data science uses the most powerful hardware, programming systems, and most efficient algorithms to solve data-related problems. It is the future of artificial intelligence.

Defining AI
AI is a subfield of computer science and mathematics; it is a general scientific field that covers everything related to weak and strong AI.

Strong AI and Weak AI are two fascinating branches of artificial intelligence that capture our imagination. Strong AI, like a smart character from a sci-fi movie, could think, learn, and perform tasks just like humans. On the other hand, Weak AI, the kind we encounter daily, focuses on doing one job well, such as recommending movies, giving us directions, or even helping us pick the perfect playlist.
Some examples of weak AI are as follows:
• Cars with self-driving technology, such as Tesla
• Voice assistants such as Alexa, Google Assistant, and Siri
• Google Maps
• Chatbots such as ChatGPT and Bard
• Recommendation systems such as those used by Amazon, Spotify, and Netflix
• Spam filters on email

Examples of strong AI appear in sci-fi movies, such as WALL-E, Big Hero 6, The Terminator, and Vision from Marvel.

Machine learning is a subfield of AI that studies algorithms that can adapt their behavior
based on incoming data without explicit instructions from a programmer.

Deep learning is a subfield of machine learning that studies a specific kind of machine
learning model called deep neural networks.

Data science is a multidisciplinary field that uses a set of tools to extract knowledge
from data and support decision making. Machine learning and deep learning are among
the main tools of data science.

The influence of data science


• Data science has huge potential and affects our daily lives
• Healthcare – diagnosing and predicting diseases
• Business – finding new strategies, customer preferences, review analysis
• Sadly, it can also disrupt daily life
• Example: the Social Credit System experimented with by the Chinese government
• Potential applications in healthcare: diagnosing and predicting health issues
• Business applications: new strategies for winning new customers and personalized services

Limitations of Data Science
• Complexity
• Data security
• Misleading the team
• It is not a substitute for domain expertise
• Insufficient understanding of the technical side of data science can lead to serious problems
Introduction to machine learning
Machine learning is a scientific field that studies algorithms that can learn to perform tasks without specific instructions, relying on patterns discovered in data.
To illustrate this, we'll look at a warehouse security surveillance system. It monitors all surveillance cameras and identifies employees in the video feed. If the system does not recognize a person as an employee, it raises an alert. This setup uses two machine learning models: face detection and face recognition. First, the face detection model searches for faces in each frame of the video. Next, the face recognition model identifies a person as an employee by searching the face database. Neither model solves the employee identification task alone, yet each provides an insight that is part of the decision-making process, as the sketch below illustrates.
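As a minimal illustration of how two models feed one decision, here is a hypothetical sketch (detect_faces and recognize_face stand in for real face detection and face recognition models; they are not defined in the notes):

# Hypothetical sketch: combining a face detection model and a face recognition
# model into one employee-identification decision.
def frame_raises_alert(frame, employee_db, detect_faces, recognize_face):
    """Return True if the frame contains a face that is not a known employee."""
    for face in detect_faces(frame):                     # model 1: find faces in the frame
        employee_id = recognize_face(face, employee_db)  # model 2: look up the face
        if employee_id is None:                          # face not found in the database
            return True                                  # raise an alert
    return False                                         # everyone in the frame is known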

Data for machine learning models


We can divide the entire world's data into two categories: structured and unstructured.
Most data is unstructured. Images, audio recordings, documents, books, and articles all
represent unstructured data. Unstructured data is a natural byproduct of our lives
nowadays. Smartphones and social networks facilitate the creation of endless data
streams.
Nowadays, it takes very little effort to snap a photo or record a video.

Structured data is hard to gather and maintain, but it is the easiest to analyze. The reason is that we often collect it for this exact purpose. Structured data is typically stored inside computer databases and files. In digital advertising, ad networks apply huge effort to collect as much data as possible.

For example, let's take a data table with two columns of numbers: x and y. We can say that this data has two dimensions. Each value in this dataset is displayed on the following plot: for example, for x = 10, y will be equal to 8, which matches the real data points depicted as blue dots.

In contrast, if we represent the pixels of a single high-resolution photo as numbers, we will end up with over twelve million of them in each photo. In other words, our data has a dimensionality of 12 million.

Many machine learning algorithms suffer from a problem called the curse of
dimensionality.

Types of tasks that can be solved with ML


• House price estimation is a regression task.
• Predicting user ad clicks is a classification task.
• Predicting HDD utilization in a cloud storage service is a regression task.
• Identifying the risk of credit default is a classification task.
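As a rough sketch of the difference between the two task types, here is how they might look in scikit-learn (the toy data and the choice of models are illustrative assumptions, not from the notes):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: the target is a continuous quantity (e.g., a house price).
sizes = np.array([[50.0], [80.0], [120.0]])      # house size (toy data)
prices = np.array([150_000.0, 240_000.0, 360_000.0])
regressor = LinearRegression().fit(sizes, prices)
print(regressor.predict([[100.0]]))              # an estimated price, a number

# Classification: the target is a discrete label (e.g., will the user click the ad?).
features = np.array([[0.1], [0.4], [0.35], [0.8]])
clicked = np.array([0, 1, 0, 1])                 # 1 = clicked, 0 = did not click
classifier = LogisticRegression().fit(features, clicked)
print(classifier.predict([[0.5]]))               # a predicted class: 0 or 1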
Introduction to deep learning
Deep learning is a branch of machine learning that is based on artificial neural networks. It is capable of learning complex patterns and relationships within data.
Deep learning is fantastic at solving tasks with unstructured datasets.
To illustrate this, let's look at a machine learning competition called ImageNet. It contains over 14 million images, classified into 22,000 distinct categories. To solve ImageNet, an algorithm should learn to identify the object in each photo. While human performance on this task is around 95% accuracy, the best neural network model surpassed this level in 2015.
Diving into natural language processing
We write every day, whether it is documents, tweets, electronic messages, books, or emails. The list can go on and on. Using algorithms to understand natural language is difficult because our language is ambiguous, complex, and contains many exceptions and corner cases. The first attempts at natural language processing (NLP) were about building rule-based systems. Linguists carefully designed hundreds and thousands of rules to perform seemingly simple tasks, such as part-of-speech tagging. Deep learning models took the NLP world by storm. They can perform a wide range of NLP tasks with much better quality than previous-generation NLP models. Deep neural networks translate text into another language with near-human accuracy. They are also quite accurate at part-of-speech tagging.

Another interesting NLP problem is text classification. By labeling many texts as emotionally positive or negative, we can create a sentiment analysis model. As you already know, we can train this kind of model using supervised learning. Sentiment analysis models can give powerful insights when used to measure reactions to news or the general mood around Twitter hashtags.

Text classification is also used to solve automated email and document tagging. We can use neural networks to process large chunks of emails and to assign appropriate tags to them.

The pinnacle of practical NLP is the creation of dialog systems, or chatbots. Chatbots can be used to automate common scenarios at IT support departments and call centers. However, creating a bot that can reliably and consistently solve its task is not easy. Clients tend to communicate with bots in rather unexpected ways, so you will have a lot of corner cases to cover. NLP research is not yet at the point of providing an end-to-end conversational model that can solve this task.
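As a minimal sketch of the sentiment analysis idea described above (the tiny labeled dataset and the model choice are illustrative assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A tiny labeled dataset: 1 = emotionally positive, 0 = emotionally negative.
texts = ["great product, loved it", "terrible service, very slow",
         "works perfectly, very happy", "broke after one day, awful"]
labels = [1, 0, 1, 0]

# Bag-of-words features plus a linear classifier as a simple supervised baseline.
sentiment_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
sentiment_model.fit(texts, labels)
print(sentiment_model.predict(["slow and awful service"]))   # expected: [0]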
Exploring computer vision

In 2015, a deep neural network surpassed human performance on ImageNet. Since then, many computer vision algorithms have been rendered obsolete. Deep learning allows us not only to classify images, but also to do object detection and instance segmentation.

The following two pictures help to describe the difference between object detection and
instance segmentation:

Thus, the main practical uses of deep learning in computer vision are essentially the same tasks at different levels of resolution:

Image classification: Determining the class of an image from a predetermined set of categories

Object detection: Finding bounding boxes for objects inside an image and assigning a class probability to each bounding box

Instance segmentation: Performing pixel-wise segmentation of an image, outlining every object from a predetermined class list

Computer vision algorithms have found applications in cancer screening, handwriting recognition, face recognition, robotics, self-driving cars, and many other areas.

Another interesting direction in computer vision is generative models. A promising approach for training generative models is called Generative Adversarial Networks (GANs). You use two models to train a GAN: a generator and a discriminator. The generator creates images, and the discriminator tries to distinguish the real images from your dataset from the generated ones. Over time, the generator learns to create more realistic images, while the discriminator learns to identify more subtle mistakes in the image generation process.

We can also use GANs to perform conditional image generation. The word conditional means that we can specify some parameters for the generator. In particular, we can specify a type of object or texture that is being generated. For example, Nvidia's landscape generator software can transform a simple color-coded image, where specific colors represent soil, sky, water, and other objects, into realistic-looking photos.

Testing Your Models


There are two types of model testing:

• Offline model testing


• Online model testing

Offline model testing encompasses all the model-evaluation processes that are performed before the model is deployed. Before discussing online testing in detail, we must first define model errors and ways to calculate them.
The difference between the real value and the model's approximation makes up the model error. For regression problems, we can measure the error in the quantities that the model predicts. For example, if we predict house prices using a machine learning model and get a prediction of $300,000 for a house with a real price of $350,000, we can say that the error is $350,000 - $300,000 = $50,000.

For classification problems, in the simplest setting, we can measure the error as 0 for a correct guess and 1 for a wrong answer. For example, for a cat/dog recognizer, we give an error of 1 if the model predicts that there is a cat in a dog photo, and 0 if it gives the correct answer.
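A minimal sketch of both error types in code, reusing the numbers from the examples above:

# Regression error: difference between the real value and the model's prediction.
actual_price, predicted_price = 350_000, 300_000
print(actual_price - predicted_price)        # 50000

# Classification error in the simplest setting: 0 for a correct answer, 1 otherwise.
def zero_one_error(actual_label, predicted_label):
    return 0 if actual_label == predicted_label else 1

print(zero_one_error("dog", "cat"))          # 1, the model mistook a dog photo for a cat
print(zero_one_error("dog", "dog"))          # 0, correct answer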

Decomposing errors
Suppose that our model makes a prediction and we know the real value. If this prediction is incorrect, then there is some difference between the prediction and the true value. Part of this error will come from imperfections in our data, and the other part from imperfections in our model.

Not all reducible errors are the same. We can decompose reducible errors further. For
example, look at the following diagram:

The red center of each target represents our goal (the real value), and the blue shots represent the model predictions. On the first target, the model's aim is off: all predictions are close together, but they are far away from the target. This kind of error is called bias. The simpler our model is, the more bias it will have. For a simple model, the bias component can become prevailing:

In the preceding plot, we try to model a complex relationship between variables with a simple line. This kind of model has a high bias.

The second component of the model error is variance:

All predictions appear to be clustered around the true target, but the spread is too high. Thus, we have decomposed the model error into the following three components: bias, variance, and irreducible error.
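For squared-error loss, this decomposition has a standard form (written out here, since the notes name the components but do not show the formula):

$\mathbb{E}\big[(\text{true value} - \text{prediction})^2\big] = \text{Bias}^2 + \text{Variance} + \text{Irreducible error}$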

For example, we will predict a housing price as a linear function of its size in square
feet:
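Such a linear model can be written as follows (the coefficient names $w_0$ and $w_1$ are illustrative; the notes only state that the function is linear):

$\text{price} \approx w_0 + w_1 \cdot \text{size in square feet}$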

There is a relationship between bias and variance. Predictive models show a property called the bias-variance tradeoff: the more biased a model is, the lower the variance component of the error, and in reverse, the more variance it has, the lower its bias will be.
Figure: The first, simple model has low variance, and the second, complex model has high variance.

Understanding overfitting
The bias-variance trade-off goes hand in hand with a very important problem in machine learning called overfitting. If your model is too simple, it will cause large errors. If it is too complex, it will memorize the data too well.

To measure model error, we divide the data into a training set and a test set. We use the data in the training set to train our model; the test set acts as unseen data. Suppose that you have done the following:

1. Trained a model
2. Measured the error on the test data
3. Changed your model to improve the metrics
4. Repeated steps 1-3 ten times
5. Deployed the model to production
You looked at a score and changed your model or data processing code several consecutive times. In fact, you did several learning iterations by hand. By repeatedly improving the test score, you indirectly disclosed information about the test data to your model. When the metric values measured on a test set deviate from the metrics measured on real data, we say that the test data has leaked into our model. Data leaks are notoriously hard to detect before they cause damage. To avoid them, data scientists use validation sets to tune model parameters and compare different models before choosing the best one. Then, the test data is used only as a final check that informs you about model quality on unseen data. After measuring the test metric scores, the only decision left to make is whether the model will proceed to testing in a real-world scenario.

The following is an example of a train/validation/test split of a dataset:
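A minimal sketch of such a split with scikit-learn (the 60/20/20 proportions, the toy data, and the variable names are illustrative assumptions):

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for a real dataset.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First carve out the test set, then split the remainder into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25,
                                                  random_state=42)

# 0.25 of the remaining 80% gives a 60/20/20 train/validation/test split.
print(len(X_train), len(X_val), len(X_test))    # 60 20 20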
Unfortunately, the following two problems persist when we use this approach:
• The information about our test set might still leak into our solution after many iterations. Test-set leakage does not disappear completely when you use a validation set; it just becomes slower. To overcome this, change your test data from time to time. Ideally, make a new test set for every model-deployment cycle.
• You might overfit your validation data quickly because of the train-measure-change feedback cycle used for tuning your models.
To prevent overfitting, you can randomly select the train and validation sets from your data for each experiment. Randomly shuffle all the available data, then select random train and validation datasets by splitting the data into three parts according to the proportions you have chosen.

There is no general rule for how much training, validation, and testing data you should use. In general, more training data means a more accurate model, but it also means that you will have less data to assess the model's performance. The typical split for medium-sized datasets (up to 100,000 data points) is to use 60-80% of the data to train the model and use the rest for validation.

The situation changes for large datasets. If you have a dataset with 10,000,000 rows, using 30% for testing would comprise 3,000,000 rows, and it is likely that this amount would be overkill. Increasing test and validation set sizes yields diminishing returns. For some problems, you will get good results with 100,000 examples for testing, which would amount to a 1% test size. The more data you have, the lower the proportion you should use for testing.

The situation also changes if there is too little data. In those situations, taking 30%-40% of the data for testing and validation might severely decrease the model's accuracy. You can apply a technique called cross-validation in data-scarce situations. With cross-validation, there is no need to create a separate validation or test set. Cross-validation proceeds in the following way:
1. You choose some fixed number of iterations (three, for example).
2. Split the dataset into three parts.
3. For each iteration, cross-validation uses 2/3 of the dataset as training data and 1/3 as validation data.
4. Train a model for each of the three train-validation set pairs.
5. Calculate the metric values using each validation set.
6. Aggregate the metrics into a single number by averaging all metric values.

The following screenshot explains cross-validation visually:


Cross-validation has one main drawback: it requires significantly more computational resources to assess model quality.
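A minimal sketch of three-fold cross-validation with scikit-learn (the toy dataset, the model, and the F1 scoring choice are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# cv=3 splits the data into three parts; each part is used once for validation
# while the other two are used for training.
scores = cross_val_score(LogisticRegression(), X, y, cv=3, scoring="f1")
print(scores)           # one metric value per validation fold
print(scores.mean())    # aggregated into a single number by averaging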

Using technical metrics


For regression problems in particular, the most common metric is the root mean square error, or RMSE.
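Written out in standard form (the notes describe the formula's elements below without reproducing it):

$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\text{predicted}_i - \text{actual}_i\right)^2}$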

Let's examine the elements of this formula:


• N is the total number of data points.
• predicted - actual measures the error between ground truth and model
prediction.
• The Sigma sign at the start of the formula means sum.
Another popular way to measure regression errors is mean absolute error (MAE):
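In standard form:

$\text{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\text{predicted}_i - \text{actual}_i\right|$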

RMSE penalizes large errors more than MAE. This property comes from the fact that RMSE uses squared errors, while MAE uses absolute values.

For classification problems, the metric-calculation process is more involved. Let's imagine that we are building a binary classifier that estimates the probability of a person having pneumonia. To calculate how accurate the model is, we may just divide the number of correct answers by the number of rows in the dataset:
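In standard form:

$\text{accuracy} = \frac{n_{\text{correct}}}{N}$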

Here, n_correct is the number of correct predictions, and N is the total number of predictions.
Let's assume the average probability of having pneumonia is 0.001%; that is, one person out of 100,000 has the illness. If you had collected data on 200,000 people, it is feasible that your dataset would contain only two positive cases. Imagine you have asked a data scientist to build a machine learning model that estimates pneumonia probability based on a patient's data, and you have said that you would only accept an accuracy of no less than 99.9%. Suppose that someone created a dummy algorithm that always outputs zeros.

This model has no real value, but its accuracy on our data will be high, as it will make only two errors:
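Worked out from the numbers above, the dummy model's accuracy is (200,000 - 2) / 200,000 = 0.99999, or 99.999%, which clears the 99.9% requirement.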

Let's look at model predictions in more detail by constructing a confusion table:
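Reconstructed from the numbers above (a dummy model that always predicts "no pneumonia" for 200,000 patients, two of whom are actually ill), the confusion table looks as follows:

                          Predicted: pneumonia    Predicted: no pneumonia
Actual: pneumonia                0 (TP)                    2 (FN)
Actual: no pneumonia             0 (FP)              199,998 (TN)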

After looking at this table, we can see that the dummy model won't be helpful to anyone. It failed to identify the two people with the condition as positive; we call those errors False Negatives (FN). The model also correctly identified all patients with no pneumonia, or True Negatives (TN), but it failed to diagnose the ill patients correctly. A model that did flag those two cases would have made two True Positive (TP) predictions.

In the following table, you can see two new metrics for summarizing the different kinds of errors, precision and recall:
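In standard form, the two metrics are:

$\text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN}$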

For binary classification, precision and recall reduce the number of metrics we must work with to two. This is better, but not ideal. We can sum up everything into a single number by using a metric called the F1 score.

You can calculate F1 using the following formula:
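In standard form:

$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$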

F1 is 1 for a perfect classifier and 0 for the worst classifier. Because it considers both precision and recall, it does not suffer from the same problem as accuracy and is a better default metric for classification problems.

Imbalanced classes
Imbalanced data refers to datasets where the target class has an uneven distribution of observations, i.e., one class label has a very high number of observations and the other has a very low number of observations.
To illustrate this, let's take a trained model and generate predictions for the test dataset. If we calculate class assignments by taking lots of different thresholds, and then calculate the precision, recall, and F1 score for each of those assignments, we can depict each precision and recall value in a single plot:
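A minimal sketch of how those threshold-dependent precision and recall values can be computed with scikit-learn (the imbalanced toy dataset and the model are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly 5% positive observations.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]       # predicted probabilities

# Precision and recall for many different classification thresholds.
precision, recall, thresholds = precision_recall_curve(y_test, scores)
print(precision[:5], recall[:5])                 # the values that would be plotted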
Online model testing
Even a great offline model testing pipeline won't guarantee that the model will perform exactly the same in production. There are always risks that can affect your model performance, such as the following:

Humans: We can make mistakes and leave bugs in the code.
Data collection: Selection bias and incorrect data-collection procedures may disrupt the true metric values.
Changes: Real-world data may change and deviate from your training dataset, leading to unexpected model behavior.

The experiment setup for a hypothesis test splits test targets into two groups on purpose. We can try to use a single group instead. For instance, we can take one set of measurements with the old model. After the first part of the experiment is finished, we can deploy the new algorithm and measure its effect. Then, we compare the two measurements made one after another. What could go wrong? In fact, the results we get wouldn't mean anything. Many things could have changed in between our measurements, such as the following:

User preferences
General user mood
Popularity of our service
Average user profile
Any other attribute of users or businesses

All these hidden effects could affect our measurements in unpredictable ways, which is why we need two groups: test and control. We must select these groups in such a way that the only difference between them is our hypothesis: it should be present in the test group and missing from the control group. To illustrate, in medical trials, the control group is the one that gets the placebo. Suppose we want to test the positive effect of a new painkiller. Here are some examples of bad test setups:

The control group consists only of women.
The test and control groups are in different geographical locations.
You use biased interviews to preselect people for an experiment.

The easiest way to create groups is random selection. Truly random selection may be hard to do in the real world, but it is easy if you deal with internet services: there, you may just randomly decide which version of your algorithm to use for each active user. Be sure to always design experiment setups with an experienced statistician or data scientist, as correct tests are notoriously hard to execute, especially in offline settings.

Statistical tests check the validity of a null hypothesis, that is, that the results you got are by chance. The opposite result is called the alternative hypothesis. For instance, here is the hypothesis set for our ad model test:
Null hypothesis: The new model does not affect the ad service revenue.
Alternative hypothesis: The new model affects the ad service revenue.
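A minimal sketch of such a test (the notes do not prescribe a particular statistical test; a two-sample t-test on simulated per-user revenue is an assumption used here for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-user ad revenue: control group (old model) vs. test group (new model).
revenue_control = rng.normal(loc=1.00, scale=0.30, size=5000)
revenue_test = rng.normal(loc=1.03, scale=0.30, size=5000)

t_stat, p_value = stats.ttest_ind(revenue_test, revenue_control)
# A small p-value (for example, below 0.05) is evidence against the null hypothesis
# that the new model does not affect revenue.
print(t_stat, p_value)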
The amount of data you need to collect for conducting a hypothesis test depends on several factors:
Confidence level: The more statistical confidence you need, the more data is required to support the evidence.
Statistical power: This measures the probability of detecting a significant difference if one exists. The more statistical power your test has, the lower the chance of false negative responses.
Hypothesized difference and population variance: If your data has large variance, you need to collect more data to detect a significant difference. If the difference between the two means is smaller than the population variance, you would need even more data.

You can see how different test parameters determine their data hunger in the following
table:
In situations where you can trade off statistical rigor for speed and risk-aversion, there is an alternative approach called Multi-Armed Bandits (MABs). To understand how MABs work, imagine yourself inside a casino with lots of slot machines. You know that some of those machines yield better returns than others. Your task is to find the best slot machine with a minimal number of trials. Thus, you try different (multi) arms of slot machines (bandits) to maximize your reward. You can extend this situation to testing multiple ad models: for each user, you must find the model that is most likely to increase your ad revenue.

The most popular MAB algorithm is called the epsilon-greedy bandit. Despite the name, the inner workings of the method are simple (a code sketch follows the steps below):
1. Select a small number called epsilon. Suppose we have chosen 0.01.
2. Choose a random number between 0 and 1. This number will determine whether the MAB will explore or exploit a possible set of choices.
3. If the number is lower than or equal to epsilon, make a choice at random and record a reward after taking the action tied to your choice. We call this process exploration: the MAB tries different actions at random, with a low probability, to find out their mean reward.
4. If your number is greater than epsilon, make the best choice according to the data you have collected. We call this process exploitation: the MAB exploits the knowledge it has collected to execute the action that has the best expected reward. The MAB selects the best action by averaging all recorded rewards for each choice and selecting the choice with the greatest reward expectation.
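A minimal sketch of an epsilon-greedy bandit following the steps above (the simulated click rewards and the epsilon value are illustrative assumptions; forcing exploration until every arm has been tried once is an addition to avoid division by zero):

import random

def epsilon_greedy(pull_arm, n_arms, n_rounds, epsilon=0.01):
    """Run an epsilon-greedy bandit and return the average recorded reward per arm."""
    totals = [0.0] * n_arms    # sum of rewards recorded for each arm (choice)
    counts = [0] * n_arms      # how many times each arm has been tried

    for _ in range(n_rounds):
        if 0 in counts or random.random() <= epsilon:
            arm = random.randrange(n_arms)     # exploration: random choice
        else:
            # Exploitation: the arm with the greatest average recorded reward.
            arm = max(range(n_arms), key=lambda a: totals[a] / counts[a])
        reward = pull_arm(arm)                 # take the action, observe the reward
        totals[arm] += reward
        counts[arm] += 1

    return [totals[a] / counts[a] for a in range(n_arms)]

# Example: three ad models with different (unknown to the bandit) click rates.
true_rates = [0.02, 0.05, 0.04]
averages = epsilon_greedy(lambda a: 1 if random.random() < true_rates[a] else 0,
                          n_arms=3, n_rounds=100_000)
print(averages)    # the best arm's average should approach 0.05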

Online data testing


Machine learning models are sensitive to incoming data. Good models have a certain degree of generalization, but significant changes in the data, or in the underlying processes that generate the data, can lead the model's predictions astray. If online data significantly diverges from the test data, you can't be certain about model performance before performing online tests. If the test data differs from the training data, then your model won't work as expected. To overcome this, your system needs to monitor all incoming data and check its quality on the fly. Here are some typical checks (a code sketch follows the list):
• Missing values in mandatory data fields
• Minimum and maximum values
• Acceptable values of categorical data fields
• String data formats (dates, addresses)
• Target variable statistics (distribution checks, averages)
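A minimal sketch of such checks on an incoming batch with pandas (the column names, allowed values, and thresholds are illustrative assumptions):

import pandas as pd

def check_incoming_batch(df: pd.DataFrame) -> list:
    """Return a list of data-quality problems found in an incoming batch."""
    problems = []

    # Missing values in mandatory data fields.
    for column in ["user_id", "age", "country"]:
        if df[column].isna().any():
            problems.append(f"missing values in mandatory field '{column}'")

    # Minimum and maximum values.
    if not df["age"].between(0, 120).all():
        problems.append("age outside the [0, 120] range")

    # Acceptable values of categorical data fields.
    if not df["country"].isin(["US", "DE", "IN"]).all():
        problems.append("unexpected country code")

    # String data formats (dates).
    if pd.to_datetime(df["signup_date"], errors="coerce").isna().any():
        problems.append("unparseable signup_date")

    # Target variable statistics (average drifting away from the training data).
    if abs(df["target"].mean() - 0.05) > 0.02:
        problems.append("target average drifted from the expected 0.05")

    return problems

batch = pd.DataFrame({"user_id": [1, 2], "age": [25, 130], "country": ["US", "XX"],
                      "signup_date": ["2024-01-01", "not a date"], "target": [0, 1]})
print(check_incoming_batch(batch))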
Building and Sustaining a Team

An average data science team will include a business analyst, a system analyst, a data scientist, a data engineer, and a data science team manager. More complex projects may also benefit from the participation of a software architect and backend/frontend development teams.

Here are the core responsibilities of each team role:

Project stakeholders: Represent people who are interested in the project; in other
words, your customers. They generate and prioritize high-level requirements and goals
for the project.

Project users: People who will use the solution you are building. They should be
involved in the requirements-specification process to present a practical view on
the system's usability.

Let's look at the core responsibilities in the analysis team:

Business analysts: The main business expert of the team. They help to shape
business requirements and help data scientists to understand the details about the
problem domain. They define business requirements in the form of a business
requirements document (BRD), or stories, and may act as a product owner in agile
teams.

System analysts: They define, shape, and maintain software and integration
requirements. They create a software requirements document (SRD). In simple
projects or Proof of Concepts (PoC), this role can be handled by other team members.

Data analysts: Analysis in data science projects often requires building complex database queries and visualizing data. Data analysts can support other team members by creating data marts and interactive dashboards and by deriving insights from data.

Let's look at the core responsibilities in the data team:

Data scientists: They create models, perform statistical analysis, and handle other
tasks related to data science. For most projects, it will be sufficient to select and apply
existing algorithms. An expert who specializes in applying existing algorithms to solve
practical problems is called a machine or deep learning engineer. However, some
projects may ask for research and the creation of new
state-of-the-art models. For these tasks, a machine or deep learning researcher will be a
better fit. For those readers with computer science backgrounds, we can loosely
describe the difference between a machine learning engineer and research scientist as
the difference between a software engineer and a computer scientist.

Data engineers: They handle all data preparation and data processing. In simple
projects, data scientists with data engineering skills can handle this role.
However, do not underestimate the importance of data engineers in projects with
serious data processing requirements. Big data technology stacks are very complicated
to set up and work with on a large scale, and there is no one better to handle this task
than data engineers.
Data science team manager: They coordinate all tasks for the data team, plan
activities, and control deadlines.

Let's look at the core responsibilities in the software team:


Software teams should handle all additional requirements for building mobile, web, and desktop applications. Depending on the ramifications, the software development can be handled by a single developer, a single team, or even several teams. In large projects that comprise multiple systems, you may need the help of a software architect.

Let's now see the general flow of how roles can work together to build the final solution:
Let's look at the flow and delivery artifacts of each step:

1. The business analyst documents business requirements based on querying project stakeholders and users.
2. The system analyst documents system (technical) requirements based on business requirements and querying project stakeholders and users.
3. The data analyst supports the team by creating requested datamarts and
dashboards. They can be used by everyone on the team in the development
process, as well as in production. If the data analyst uses a Business Intelligence
tool, they can build dashboards directly for end users.
4. The data scientist (researcher) uses documented requirements and raw data to
build a model training pipeline and document data processing requirements that
should be used to prepare training, validation, and testing datasets.
5. The data engineer builds a production-ready data pipeline based on the prototype made in Step 4.
6. The data scientist (engineer) uses processed data to build a production-ready
model for training and prediction pipelines and all necessary integrations,
including model APIs.
7. The software team uses the complete model training and prediction pipelines to
build the final solution.

Case study 1 – Applying machine learning to prevent fraud in banks
Mary is working as a data scientist in a bank where the fraud analysis department has become interested in machine learning (ML). She is experienced in creating machine learning models and integrating them into existing systems by building APIs. Mary also has experience in presenting the results of her work to the customer.

One of the main activities of this department is to detect and prevent credit card fraud. They do this by using a rule-based fraud detection system. This system looks over all credit card transactions happening in the bank and checks whether any series of transactions should be considered fraudulent. Each check is hardcoded and predetermined. The department has heard that ML brings benefits over traditional rule-based fraud detection systems, so they have asked Mary to implement a fraud detection model as a plugin for their existing system. Mary has inquired about the datasets and operators of the current fraud detection system, and the department confirmed that they will provide all necessary data from the system itself. The only thing they need is a working model and a simple software integration. The staff were already familiar with common classification metrics, so they were advised to use the F1 score with k-fold cross-validation.

In this project setup, the project stakeholders have already finished the business analysis stage. They have an idea and a success criterion. Mary has clean and easy access to the data source from a single system, and the stakeholders can define the task in terms of a classification problem. They have also defined a clear way to test the results. The software integration requirements are also simple. Thus, the role flow of the project is simplified to just a few steps, all of which are performed by a single data scientist role. As a result, Mary has laid out the following steps to complete the task:
1. Create a machine learning model.
2. Test it.
3. Create a model training pipeline.
4. Create a simple integration API.
5. Document and communicate the results to the customer.

Case study 2 – Finding a home for machine learning in a retail company
Jonathan worked in retail for many years, so he knows the business side pretty well. He has also read some books and attended several data science events, so he understands the practical capabilities of data science. With knowledge of both business and data science, Jonathan can see how data science can change his environment. After writing out a list of ideas, he will evaluate them from the business viewpoint. Projects with the lowest complexity and highest value will become candidates for implementation.

The retail company has over 10,000 stores across the country, and ensuring the proper level of service quality for all stores is becoming difficult. Each store has a fixed number of staff members working full-time. However, each store has a different number of visitors. This number depends on the geographical location of the store, holidays, and possibly many more unknown factors. As a result, some stores are overcrowded and some experience a low number of visits. Jonathan's idea is to change the staff employment strategy so that customer satisfaction will rise and the employee burden will even out.

Instead of hiring a fixed store team, he suggests making it elastic. He wants to create a special mobile application that will allow stores to adjust their staff list with great speed. In this application, a store manager can create a task for a worker. The task can be as short as one hour or as long as one year. A pool of workers will see all vacant slots in nearby stores. If a worker with the necessary skills sees a task that interests them, they can accept it and go to the store. An algorithm will prepare task suggestions for the store manager, who will issue work items to the pool. This algorithm will use multiple machine learning models to recommend tasks. One model will forecast the expected customer demand for each store. The second model, a computer vision algorithm, will measure the length of the lines in the store. This way, each store will use the right amount of workforce, changing it based on demand levels. With the new app, store employees will be able to plan vacations and ask for a temporary replacement. Calculations show that this model will keep 50,000 workers with an average load of 40 hours per week for each worker, and a pool of 10,000 part-time workers with an average load of 15 hours per week. Management considered this model to be more economically viable and agreed to perform a test of the system in one store. If the test is successful, they will continue expanding the new policy.
The next thing they ask Jonathan is to come up with an implementation plan. He now needs to decompose the project into a series of tasks that are necessary to deploy the system in one store. Jonathan prepared the following decomposition:
1. Collect and document initial requirements for the mobile app, forecasting, and computer vision models.
2. Collect and document non-functional requirements and Service Level Agreements for the system.
3. Decide on the necessary hardware resources for development, testing, and production.
4. Find data sources with training data.
5. Create data adapters for exporting data from source systems.
6. Create a software system architecture. Choose the technology stack.
7. Create development, test, and production environments.
8. Develop a forecasting model (full ML project life cycle).
9. Develop a line-length recognition model (full ML project life cycle).
10. Develop a mobile app.
11. Integrate the mobile app with the models.
12. Deploy the system to the test environment and perform an end-to-end system test.
13. Deploy the system to the production environment.
14. Start the system test in one store.

Each point in this plan could be broken down further into 10-20 additional tasks. A full task decomposition for this project could easily include 200 points, but we will stop at this level, as it is sufficient for discussion purposes.
In reality, plans constantly change, so Jonathan has also decided to use a software development project management framework, such as SCRUM, to manage deadlines, requirement changes, and stakeholder expectations.
Key skills of a data scientist
An ideal data scientist is often described as a mix of the following three things:

Domain expertise: This includes knowledge of the environment data scientists are working in, such as healthcare, retail, insurance, or finance.
Software engineering: Even the most advanced model won't make a difference if it exists only as a pure mathematical abstraction. Data scientists need to know how to shape their ideas into a usable form.
Data science: Data scientists need to be proficient in mathematics, statistics, and one or more key areas of data science, such as machine learning, deep learning, or time series analysis.

Key skills of a data engineer


The key areas of knowledge for data engineers are as follows:
Software engineering: Software engineering skills are very important for data engineers. Data transformation code frequently suffers from bad design choices. Following the best practices of software design will ensure that all data processing jobs are modular, reusable, and easily readable.
Big data engineering: This includes distributed data processing frameworks, data streaming technologies, and various orchestration frameworks. Data engineers also need to be proficient with the main software architecture patterns related to data processing.
Database management and data warehousing: Relational, NoSQL, and in-memory databases.

Key skills of a data science manager


Management: A data science team manager should have a good understanding of the main software management methodologies, such as SCRUM and Kanban. They should also know approaches and specific strategies for managing data science projects.

Domain expertise: They should know the business domain well. Without it, task decomposition and prioritization become impossible, and the project will inevitably go astray.
Data science: A good understanding of the basic concepts behind data science and machine learning is essential. Without it, you will be building a house without knowing what a house is. It will streamline communication and help to create good task decompositions and stories.

Software engineering: Knowledge of basic software engineering will ensure that the manager can keep an eye on crucial aspects of a project, such as software architecture and technical debt. Good software project managers have development experience; this experience taught them the value of writing automated tests, refactoring, and building a good architecture. Unfortunately, many data science projects suffer from bad software design choices. Taking shortcuts is only good in the short term; in the long term, they will come back to bite you. Projects tend to scale with time; as the team increases, the number of integrations grows, and new requirements arrive. Bad software design paralyzes the project, leaving you with only one option: a complete rewrite of the system.
Common flaws of technical interviews

• Searching for candidates you don't need
The primary focus for a candidate will be on applying different techniques (data mining, statistical analysis, building prediction systems, recommendation systems) to large company datasets, applying machine learning models, and testing the effectiveness of different actions. The candidate must have strong technical expertise and be able to use a wide set of tools for data mining and data analysis. The candidate must be able to build and implement mathematical models, algorithms, and simulations.
• Discovering the purpose of the interview process
• Introducing values and ethics into the interview
• Designing good interviews
• Designing test assignments
• Interviewing for different data science roles

Achieving team Zen


A balanced team solves all incoming tasks efficiently and effortlessly. Each team member complements the others and helps them find solutions with their unique capabilities. However, balanced teams do not just magically come into existence when a skillful leader assembles their elite squad. Finding team Zen is not a momentary achievement; you need to work to find it. It may seem that some teams are stellar performers while others are underwhelming. It is crucial to realize that team performance is not a constant state. The best teams can become the worst, and vice versa. Each team should work to improve. Some teams may need less work, while others will be harder to change, but no team is hopeless.

Leadership and people management


In the Achieving team Zen section, we concluded that a team leader should not be at the core of the team, because this situation leads to severe organizational imbalance. Nonetheless, a team leader should be everywhere and nowhere at the same time. A team leader should facilitate team functioning, help every team member take part in the process, and mitigate any risks that threaten to disrupt the team's functions before they become real issues. A good team leader can substitute for and provide support to all or most of the roles in their team. They should have good expertise regarding the core roles of the team so that they can be helpful in as many of the team's activities as possible.
Leading by example
The simplest and most effective leadership advice was likely given to you in childhood: if you want to have good relationships with others, take the first step. Want people to trust you? Build trust incrementally by trusting other people. Want your team to be motivated? Be involved in the workflow, give everyone a helping hand, and show them progress. Management literature calls this principle leadership by example.

Using situational leadership


Defining tasks in a clear way
The situational leadership model is useful, but it does not tell you how to describe tasks so that your teammates will be in sync with your vision. First, you should be aware that your task descriptions should not be the same for every team member. Your descriptions will vary depending on competence and motivation levels. At the direction stage, a task description can resemble a low-level, detailed to-do list, while at the delegation stage, a task description may comprise just two sentences. However, no matter what stage you are at, your task description should be clear, time-bounded, and realistic. The SMART criteria help create such tasks. SMART states that any task should be as follows:

Specific: Be concrete and target a specific area or goal


Measurable: Have some indicator of progress
Assignable: Have an assignee(s) who will do the task
Realistic: State which results can be achieved when given a set of constraints and
available resources
Time-related: State a time constraint or a deadline
Developing empathy
Basic concepts such as leadership by example, situational leadership, and the SMART criteria will help you structure and measure your work as a team leader. However, there is another crucial component without which your team can fall apart, even with perfect execution of the formal leadership functions. This component is empathy. Being empathetic means understanding your own emotions, as well as other people's emotions. Our reactions are mostly irrational, and we often confuse our own feelings and emotions. For example, it is easy to confuse anger with fear; we can act angrily when in reality we are just afraid. Understanding your own emotions and learning to recognize others' will help you understand people and see the subtleties and motives behind their actions. Empathy helps us find the logic behind irrational actions so that we can react properly. It is easy to answer aggressive behavior with anger unless you can understand the motives behind those actions; if you can see those motives, acting angrily may seem foolish in the end. Empathy is the ultimate conflict-resolution tool; it helps build trust and coherence in the team.

Facilitating a growth mindset


The last key component of team building we will cover is a growth mindset. The constant growth of your team is necessary if you do not want it to fall apart. Without growth, there will be no motivation, no new goals, and no progress. We can distill team growth into two components: the global growth of the team as a whole and the local growth of each individual team member.

Case study – Creating a data science department

A large manufacturing company has decided to open a new data science department. They hired Robert as an experienced team leader and asked him to build the new department. The first thing Robert did was research the scenarios his department should handle. He discovered that the understanding of data science was still vague in the company: while some managers expected the department to build machine learning models and seek new data science use cases in the company, others wanted him to build data dashboards and reports. To create a balanced team, Robert first documented two team goals and confirmed that his views correctly summed up what the company's management wanted from the new team:

Data stewardship: Creating data marts, reports, and dashboards from the company's data warehouses based on incoming requests from management
Data science: Searching for advanced analytics use cases, implementing prototypes, and defending project ideas

From this information, Robert derived the following team roles:

Data team leader: To supervise the two data teams and coordinate projects at a higher level. The workload in this role is not expected to be high during the first year of the team's existence.

Data stewardship team:


Team leader/project manager: The goal definition clearly required someone to
manage incoming requests and plan the work.
Systems analyst: Someone to document visualization and reporting requirements that
come from the stakeholders.
Data analyst: To implement data reports and dashboards based on the requirements
defined by the systems analyst.

Data science team:


Team leader/project manager: Should be able to manage R&D processes in a
business environment.
Business analyst: To interview company management and search for potential data
science use cases.
Systems analyst: To search for data and provide requirements for the use cases
provided by the business analyst.

Data scientist: To implement prototypes based on the task definitions provided by the
business analyst and systems analyst.
Backend software engineer: To integrate prototypes with external systems and
implement software based on the requirements.
User interface software engineer: To implement interactive UIs and visualizations for prototype presentation.

Next, Robert thought about team size constraints and created the following positions from the preceding roles:

Data team leader:


Employees needed: 1. This position will be handled by Robert.

Data stewardship team:


Team leader: Employees needed: 1. Robert had no prior experience leading data
analysis teams, so he decided to hire an experienced manager who will work under his
supervision.
Systems analyst: Employees needed: 1.
Data analyst: Employees needed: 2.

Data science team:


Team leader: Employees needed: 1. During the first year, this position will be handled by Robert, as the volume of projects won't be so high that this work will interfere with the data team leader role. As he will be acting as the best expert on the team, Robert has carefully thought about how to delegate tasks using situational leadership, so that the data scientists on his team will be motivated and will have room to grow and take on more complex tasks along the way.
Business analyst: Employees needed: 1.
Systems analyst: Employees needed: 0. Robert has decided that there is no need to keep system and business analysis separate, since the system analysis can be performed cross-functionally by the whole team at the R&D stage.
Data scientist and backend software engineer: Employees needed: 2. Robert decided to combine the data scientist and backend software engineer into a single position, since the software engineering requirements for the prototype projects were easy enough to be handled by data scientists. He decided to hire two experts so that he could work on several ideas simultaneously.
