
DATA DIGEST | JULY 2021 EDITION

Bits & Bytes

What’s Inside:

LEADERSHIP SPEAKS
GREAT LEARNING SUCCESS STORY
GREAT LEARNING JOURNEY
LET’S LEARN!
SPOTLIGHT
DISCOVER
WHAT’S NEW?
DATA SCIENCE AT WORK
CAREER MAZE: DIRECTIONS
SOCIAL MEDIA BULLETIN
THAT’S A GOOD QUESTION!



Leadership Speaks

MALATHI RAMKUMAR
General Manager - Customer Success, Olympus Digital Campus

How do you keep your employees motivated during these challenging times?

“One of the biggest challenges during this tough phase is that we cannot meet the employees in person, which means there is a huge impact on communication. So we must ensure we do not keep the employees in the dark, and should keep the communication going. It is very important to keep calm, so that the people around us and our team are also calm. Rather than reacting adversely to mistakes or criticising people, it is better to point out mistakes in a calm manner and work together towards a solution.

It is also important to be flexible with our expectations so that employees can take care of their personal needs. Giving them more time away from work is a great motivator and will also inspire loyalty.”

How do you foster creativity among the team?

“When we speak of creativity, it means the environment needs to be free and limitless. We must create a supportive and open environment for the employees to be able to open up and think of creative ideas. In addition to this, the most important factor is time. If we do not give them adequate time, it will be difficult for them to come up with innovative ideas. Another aspect is how we react to failures and poor ideas. If there is no support for failures, the fear of failure will keep employees from thinking out of the box. If we provide adequate support and do not react adversely to a failed idea, they will continue to bring in more and more creative ideas and solutions.”

What’s your advice to the budding female leaders within the company?

“Though ‘equality’ is gaining a lot of momentum these days, which helps with the career growth of women, there is still a long way to go. As a woman, I feel we need to develop resilience: having a positive attitude, becoming better at regulating emotions, and so on. This would help us manage the multiple aspects of life: personal, family and work. Also, we may tend to overthink risks and decisions, but we should not be hesitant about them. We have to be confident about ourselves, put in the effort to assess the situation and go ahead with the decision making.

Last but not least, we should keep learning, be empathetic to our fellow colleagues and maintain work-life balance.“


Great Learning Success Story

SARAYU DIXIT
PGP-DSBA Alumna

“Hello, I am Sarayu from Bangalore. After having played various roles as Technical Lead, Project Manager and Senior Business Analyst for over 12 years in the R&D and service delivery sector, I thought my career needed a boost. After a lot of research, I was convinced and determined to join the Postgraduate Program in Data Science and Business Analytics (PGP-DSBA) by Great Learning, in collaboration with The University of Texas at Austin (McCombs School of Business).

The journey was not an easy one and was sometimes overwhelming, as I had to manage between work, family and weekend classes. All these challenges were well handled with great support and great mentoring from Great Learning faculty, associates and industry experts.

Great Learning provided me with a comprehensive learning path to become a data scientist and a good roadmap for my data science journey. The courses are very well planned and include a foundation in statistics, mathematical modelling, SQL, Python and R. All the programs are taught by industry experts and their learning support is awesome. I have benefited a lot from the course contents, which are available 24/7. Most of my Data Science and Machine Learning techniques are a result of this learning. I have also been able to participate in hackathons and international conferences organized by Great Learning, which enhanced my portfolio in Data Science and helped me build networks and learn from other people.

I have spent time learning new concepts and sharpening my acquired skills through projects. My capstone project at Great Learning was headed by one of the industry veterans, who assisted me from start to finish. The project gave me confidence in applying the various ML techniques that were taught during the courses. Also, within 3 months of my project completion, I was so well trained and equipped that I was given an opportunity to move internally into a Data Science related role at my company.

I understand that there has been a gender gap in the fields of Information Technology and Data Science ever since their advent; however, the field has expanded significantly over the years. Data Science communities with inspiring stories, and interactions with women who are well positioned in Data Science and data-related business, have gradually been growing to bring more women into the field. My Data Science journey continues. I’m still in the process of learning, and I keep pushing myself to learn new things.”


Great Learning Journey

SOWMIYA DESIKAN
PGP-DSBA Alumna

“A well-curated course that gave an insight into all the aspects of Data Science.”

With about 20 years of experience leading a function in the world’s leading shipping company, MAERSK, I had handled a gamut of roles around Business Excellence, Operations Management, Strategic Transformation, Digitization and People Leadership, while our organization was going through a transformation in Customer Journey and Digital Transformation. Your years of experience are not what matter; how much you are staying relevant to the industry and the developments around you is what is important.

The bold answer to this pondering was ‘Data’ and ‘Technology’, which are here to stay. Then what and where to pursue had to be pinned down. After a lot of exploration and consulting, the PGP-DSBA by Great Learning popped up. What kept me motivated was my belief in staying relevant: if not now, there was never going to be a better time than this, and what happened after was a deep revelation.

It was a well-curated course that gave an insight into all aspects of Data Science, topped up with functional applications as well. We were offered an able team of professors bringing not just theory but real experience to the classroom on how to solve complex problems through data and technology.

I admit that the first learning from this course, for me, was how much you don’t know! Thanks to Dr. PKV and the faculty members who helped deliver this for us. How they made the learning easy for us despite the complexity of the topics speaks volumes of the team of lecturers.

The career counseling session was truly an eye-opener for me. The counselor very subtly but strongly delivered the message that companies are not going to hire you just because you finish this program. It’s important to look at transitioning into roles in your own organization, and to learn and contribute before you explore opportunities outside, especially when you are not willing to let go of your accumulated experience and start as a fresher in this arena; so true for people like me with work experience.

I was fortunate that I had an opportunity in the function that I lead, where we had complex problems to solve involving huge volumes of data: a USD 900 million spend, 1 million transactions a month, and time-sensitive deliverables. I am thankful to my organization and manager as well, as I could literally apply the tip from the Career Counselor. During the same year as this course, I also worked on developing data-based products, partnering with startups and data teams, which helped me apply the learnings.

In parallel, the capstone project is another imperative part of the learning, which paved the way for people from different domains to come together to apply the learnings and solve a real-time problem. The diligent follow-up and coaching from Mr. PV Subramanian, and a great team to work with, resulted in immense learning to carry for a long time.

In Oct 2020, I transitioned to lead the Analytics and Metrics team for an InfoSec function at CitiCorp as a Sr. Vice President.

To those looking to join these programs, my suggestion would be: please be clear about the WHY and the WHAT you would like to do post this program. The program will offer you learning and exposure; however, it’s our clarity of thought, time commitment, and continued pursuit of learning that will take us ahead in the journey.”

“The capacity to learn is a gift, the ability to learn is a skill, and the willingness to learn is a choice”


Let’s Learn!

HIMANSHU MANROA
AVP, Research & Analytics Solutions, Datamatics | Pre-Sales & Marketing

“Move away from Predatory Pricing to Pricing Analytics for long-term sustainability”

“Pricing Analytics is highly useful not only in a booming economy but also otherwise. It is the key to ensuring not just growth but profitable growth. Predatory pricing is out of context for a sustainable business; so is hunch-based pricing. The way we at Datamatics look at it, Price Intelligence and Price Analytics offer discreet mechanisms to arrive at insights towards enhancing business value.

Pricing Analytics is not just Advanced Analytics; it is much more. Arriving at the right pricing is a complex task. To achieve it, businesses need to understand the different factors behind sustainability as well as profitability. At the same time, it is a no-brainer that attractive pricing plays a crucial role in generating and improving revenue. Analytics plays an important role in arriving at the right pricing. In fact, Predictive Analytics models based on historical data and the resultant sales spikes are the foundation for arriving at the right pricing.

Predictive Analytics allows businesses to decide on pricing. Here are some key strategies for arriving at the right Price Intelligence:

Customer KYC - Collect customer data and classify or segment customers into different categories. Develop premium, economy, and mid-market products with different pricing. The mechanism propagates not only customer satisfaction but also brand affinity through thick and thin, as well as ensuring profitable growth.

Value-based pricing - Customer retention and product value go hand in hand. The value of the business offerings is the key to retaining customers. Price Analytics allows tagging the right price to the right value.

Quick wins - Fast price rectifications to suit the market scenario, the offering’s value, and historical pricing allow businesses to tap some quick wins: for example, special pandemic pricing, festival bonanzas, New Year pricing, etc. Such periodic pricing allows registering high revenues in a short time.

80:20 rule - The rule holds in most business situations. Look for the 20% most profitable customers, who are responsible for 80% of the revenue. Follow their channel affinity and invest in those channels. Though these channels can turn out to be expensive, they tend to be the preferred channels of the 20% who are the most profitable revenue generators.

Price promotions - Price Analytics allows evaluating customers vis-a-vis the market scenario and determining the right promotional campaigns at the right time.

In a nutshell, effective pricing strategies and tactics allow businesses to improve promotion effectiveness by 25%, deliver a 2% to 7% improvement in sales, and improve revenue margins by 2% to 5%.

To summarize: market conditions are not static; they continuously change depending on geopolitical scenarios. Price Intelligence and Price Analytics powered by Predictive Analytics models allow businesses to dynamically change pricing to suit market conditions and ensure sustainability and profitability.”
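As a quick, hypothetical illustration of the 80:20 analysis described above, the sketch below ranks customers by revenue and flags the top 20%; the column names and figures are invented for the example, not Datamatics data.

```python
import pandas as pd

# Hypothetical customer revenue data; names and values are invented for this sketch.
df = pd.DataFrame({
    "customer_id": [f"C{i:03d}" for i in range(1, 11)],
    "revenue": [120, 95, 3400, 60, 2800, 150, 75, 1900, 45, 200],
})

# Rank customers by revenue and compute each one's cumulative revenue share.
df = df.sort_values("revenue", ascending=False).reset_index(drop=True)
df["cum_share"] = df["revenue"].cumsum() / df["revenue"].sum()

# Flag the top 20% of customers and check how much revenue they account for.
top_n = max(1, int(len(df) * 0.2))
top_customers = df.head(top_n)
print(top_customers)
print(f"Top 20% of customers contribute {top_customers['revenue'].sum() / df['revenue'].sum():.0%} of revenue")
```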


Spotlight

A DECISION SUPPORT TOOL FOR OPTIMUM LOAN TO ASSET VALUE

This is a project presented by Subramanian Gopalakrishnan, Apurva Dhingra, Sahil Linjhara, George Varghese and Ankush Kharbanda, PGP-DSBA students, at the AICTE-sponsored Online International Conference on Data Science, Machine Learning and its Applications (ICDML-2020). A follow-up paper was published in the conference journal.

Over the last few years, the credit industry in India has experienced exponential growth, and the retail loan book of Financial Institutions (FIs) in India is expected to double to Rs. 96 trillion by 2024. To remain competitive in retail lending, one of the major challenges faced by FIs is to maximize the loan amount with minimum processing time while ensuring the least number of defaults. In this study, our learners aimed to build a Machine Learning model to predict the probability that a new applicant will default on the first EMI, and to calculate the optimum Loan to Asset Value (LTV) for each applicant applying for a two-wheeler loan. The Loan to Asset Value (LTV) ratio is a financial term used by lenders to express the ratio of a loan to the value of the asset purchased.

The study sample belonged to a Non-Banking Financial Company (NBFC) and contained details of 2,33,154 applicants who applied for a two-wheeler loan. The dataset had information on KYC details, demographics, security assets, past loan records and the credit score of each applicant. In the data, it was observed that 22% of the applicants had defaulted on their first EMI. A machine learning model was built to predict the probability that a new applicant will default on their first EMI. The higher the Probability of Default (PD) of an applicant, the higher the risk of default. Based upon the PD, customers were bucketed into 3 buckets, i.e., Category A, Category B and Category C: Category A contains applicants with low risk and Category C contains applicants with high risk. Further, an optimized LTV was defined for each category, identified from the historical data and business requirements, in line with leading industry practices.

A combination of PD and optimized LTV can help the NBFC verify the eligibility of an applicant in no time and reduce the number of defaults. A prototype for an automation tool was built which recommends the optimum LTV range for a new applicant on entering their details. This tool would enable an applicant to check their eligibility in terms of LTV, which would, in turn, reduce the processing cost for the NBFC. This analysis can be further expanded to other types of loans.
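The write-up does not include the team's code, so the following is only a minimal sketch of the pattern it describes: train a classifier to output a Probability of Default (PD), then bucket applicants into Categories A, B and C. The features, model choice and cut-offs here are invented placeholders, not the team's actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic applicant data; the real study used KYC, demographics, credit scores, etc.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "credit_score": rng.integers(300, 900, 1000),
    "loan_amount": rng.integers(20_000, 120_000, 1000),
})
# Synthetic label: default is made more likely for lower credit scores.
y = (rng.random(1000) < (900 - X["credit_score"]) / 900).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability of Default (PD) for each test applicant.
pd_scores = model.predict_proba(X_test)[:, 1]

# Bucket applicants by PD; these cut-offs are arbitrary placeholders, whereas the
# study derived its categories from historical data and business requirements.
buckets = pd.cut(pd_scores, bins=[0, 0.2, 0.5, 1.0], labels=["A", "B", "C"])
print(pd.Series(buckets).value_counts())
```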


Discover

Python Dictionary – Everything you need to know

Python is a high-level, general-purpose, interpreted programming language. Its design philosophy emphasizes code readability through its notable use of significant indentation. Python’s language constructs and object-oriented approach help programmers write logical, clear code for small as well as large-scale projects. A Python dictionary can be created by assigning a sequence of elements inside curly {} brackets, separated by commas. A dictionary holds pairs of values, in which one is the key and the other is the corresponding element, together called a key:value pair. The values in a dictionary can be of various data types and can even be duplicated, whereas the keys cannot be duplicated and must be immutable.

LEARN MORE
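A minimal example of the key:value syntax described above; the sample data is invented for illustration.

```python
# Keys must be immutable (e.g. strings, numbers, tuples); values can be anything.
learner = {
    "name": "Sarayu",
    "city": "Bangalore",
    "courses": ["Statistics", "SQL", "Python", "R"],  # values may be collections
}

print(learner["city"])                # look up a value by key -> "Bangalore"
learner["experience_years"] = 12      # add or update an entry
for key, value in learner.items():    # iterate over key:value pairs
    print(key, "->", value)
```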
What is Image Processing?

Image Processing is a way to convert an image to a digital form and perform certain operations on it, in order to get an enhanced image or to extract useful information from it. It is a type of signal processing in which the input is an image, such as a video frame or photograph, and the output can be an image or features associated with that image. Usually, an Image Processing system treats images as two-dimensional signals and applies established signal processing methods to them. Image Processing lets an individual enhance the quality of an image, or gather actionable insights from an image and feed them to an algorithm for prediction. These libraries are commonly used for Image Processing in Python: Scikit-image, OpenCV, Mahotas, SimpleITK, SciPy, Pillow and Matplotlib.

LEARN MORE
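As a small illustration using the Pillow library mentioned above (the file names are placeholders, and any image file would do):

```python
from PIL import Image, ImageFilter  # Pillow

# Load an image, convert it to grayscale, sharpen it, and inspect it.
img = Image.open("photo.jpg")             # placeholder path
gray = img.convert("L")                   # "L" mode = single-channel grayscale
sharp = gray.filter(ImageFilter.SHARPEN)  # apply a built-in enhancement filter

print(sharp.size)   # (width, height) of the two-dimensional signal
sharp.save("photo_sharpened.jpg")
```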

NumPy Normalization Tutorial

NumPy is a widely used Python library for working with arrays. Now, since we can already create an array in Python using lists, why do we need NumPy? Well, NumPy provides a much faster way to work with arrays than traditional lists. Normalization is a process used to rescale the real values of a numeric attribute into a range from 0 to 1. It helps organize the data so that it appears similar across all areas and records. Data normalization has various advantages, such as redundancy reduction, complexity reduction, clarity and higher-quality data.

LEARN MORE
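A minimal sketch of the min-max normalization described above, rescaling an array into the 0-to-1 range; the values are invented.

```python
import numpy as np

values = np.array([12.0, 45.0, 7.0, 88.0, 33.0])

# Min-max normalization: (x - min) / (max - min) maps values into [0, 1].
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized)  # 7.0 -> 0.0, 88.0 -> 1.0, everything else in between
```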

What's New?

Want to steer your Machine Learning Operations (MLOps) journey with what is trending? Have a look!

Algorithmia is a single platform for all phases of machine learning operations and governance. It enables ML operations teams to work in synergy on complex AI applications in one central place. As of now, more than 100,000 engineers and data scientists are using the platform, including staff of the United Nations and various Fortune 500 organizations.

Data Version Control (DVC) is an open-source version control system for machine learning projects. It versions machine learning models and data collections such as datasets, and it was built to make machine learning models shareable and reproducible. DVC is designed to handle large files, datasets, machine learning models, metrics and code.

There are many such MLOps tools in prominent use today, and you can start learning some of them early to gain an edge.

Zcash to put funds in Tor to swap C for Rust. Smart move?

Zcash Open Major Grants (ZOMG) has put $670,000 into Tor to foster Arti, an implementation of Tor in a different language, Rust. Tor is coded in the C programming language as a standalone network proxy, and this codebase has begun showing signs of exhaustion: C lacks the high-level features needed for complex programming tasks, making development a slow and careful process that demands high skill and perseverance. Rust is a high-level, modern programming language with innovative features and safety properties. Tor has found that, since 2016, a portion of its bugs have been due to the C language. Arti is expected to settle a significant number of Tor's long-standing software issues. Tor now aims to become a communication isolation layer for Zcash and to enable communication tools for Zcash clients in a safe and anonymous environment.

Machine learning in determining the size of the basic building block: the cell

The basic building block of living organisms is the cell, and a human body is composed of nearly 15 trillion cells. Studies are going on around the world into the fundamental principles of the cell-size regulation system with the help of neural networks. Numerous experiments in the past have shown that machine learning, and particularly deep learning methods, can semi-automatically extract complex patterns from enormous datasets. These techniques accomplish this by eliminating less important dimensions and other noise factors.

The Institute of Industrial Science in Tokyo has developed an algorithm that applies machine learning techniques to estimate cell size and its reproducibility. The algorithm is powered by artificial neural networks, which avoid making prior assumptions and instead produce data-driven estimates. The technique reduces data noise and is helping researchers better understand the interplay between deterministic and noise-driven relationships in the data.

Data Science at Work!

Here are the ‘Data Science at Work’ stories for this edition.

Customer profiling of policyholders whose policies are being cancelled, to bring down the number of cancellations

Suchika Aggarwal, who is part of the Auditor’s Process Team, audits small businesses in 3 areas: Self Voluntary, Virtual Audit (phone/e-mail) and Physical Audit, and prepares documents as per USA government (NCCI) guidelines.

The physical audit comprises 80% of the total cost of the whole process. It was observed that some policies were getting cancelled due to miscommunication on the company’s behalf, caused by vague audit guidelines, leading to monetary loss, customer dissatisfaction and an impact on the company’s reputation. Hence, she decided to start working on the last 12 months’ data for policyholders whose policies got cancelled.

Techniques used: Customer profiling was done using EDA and the K-means clustering method (a small sketch follows below).

Impact generated:

Cost Saving: Each policy that was reissued saved up to $12,000 for the company.

Customer Satisfaction: The work was a delighter for underwriters, who used to receive complaints from customers about their policies being cancelled. This ultimately enhanced the company’s reputation.

Customer Retention: A few customers who were not keen on reissuing the policy were retained.
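The newsletter does not show Suchika's code; the following is only a minimal, hypothetical sketch of customer profiling with K-means, using invented features in place of the actual policy data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented policyholder features standing in for the real audit data.
rng = np.random.default_rng(42)
customers = pd.DataFrame({
    "premium": rng.normal(1200, 300, 200),
    "months_active": rng.integers(1, 60, 200),
    "audit_delay_days": rng.integers(0, 90, 200),
})

# K-means is distance-based, so scale the features before clustering.
scaled = StandardScaler().fit_transform(customers)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(scaled)
customers["segment"] = kmeans.labels_

# Profile each segment, e.g. to spot cancellation-prone groups.
print(customers.groupby("segment").mean())
```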
Alfiya Shaikh works at a firm as an associate program manager, and her diverse role involves multiple aspects of operations, research and project management.

NPS (Net Promoter Score) is a very important parameter for any product or service in the market. While checking the responses in the NPS data, she realized that there were too many subjective questions and responses; building a word cloud and analyzing the text was important.

There was a dire need to check the customer sentiment, the areas of improvement, which aspects of the customer journey needed more enhancement, the concerns raised by the customers, and the overall NPS scores and their graphical representations.

Techniques used: She used text mining tools such as NLTK, word clouds and lambda functions to get a clear picture of the various parameters/features collected in the NPS survey data (a small sketch follows below).

Observations: While performing the analysis, she removed the stop words, stemmed the data, removed the punctuation, and checked for the most frequently occurring words. This helped her dig deeper and gave her multiple insights on: customers’ overall sentiment, the concerns they faced during their journey, the factors that added to their delight, and the factors that contributed to negative or passive sentiments. These observations are in the process of being implemented.
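A minimal sketch of the kind of NLTK preprocessing described above (stop-word removal, stemming, frequency counts); the sample responses are invented, not the actual survey data.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

# Invented NPS free-text responses standing in for the real survey data.
responses = [
    "The claims process was slow and confusing.",
    "Great support team, very quick and helpful!",
    "Policy renewal was confusing; support was slow to respond.",
]

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

tokens = []
for text in responses:
    for word in word_tokenize(text.lower()):
        # Keep only alphabetic words that are not stop words, then stem them.
        if word.isalpha() and word not in stop_words:
            tokens.append(stemmer.stem(word))

# The most frequent terms hint at recurring themes (e.g. "slow", "confus", "support").
print(nltk.FreqDist(tokens).most_common(5))
```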


Career Maze: Directions

Which category of companies should I target to grow my career, and what should be my approach while looking for a career transition?

Be open to all the companies and domains where there is a data science or data analytics opportunity. Whichever field you want to foray into, it will require your foundation, that is, your technical skills, to be strong. In terms of the interview scenario, irrespective of the domain or industry, the first line of interview rounds will involve a technical assessment. You will be put into challenging situations such as writing code or solving a case study. The assessment is based on how comfortable you are with the complexity of the subject and your readiness to take up the real-time challenges you may face daily in the respective job role. This practice is common to most companies that offer job roles in the field of data science. Irrespective of the domain, the approach of your recruiter is more or less the same.

If you are not comfortable writing code by yourself and you seek help from the internet every time you solve a problem, this can be a hurdle for you in the interviews. To overcome this paralysis, there is only one remedy: “Practice”. Spend a minimum of 60 minutes regularly practising on some raw data sets and strengthening your basics.

It is also important to gather insights about the industry you are targeting for a transition. You need to know a little about everything: the history, key players, challenges, processes, and the new technologies and trends in the market for that industry. This will power your conversation with the recruiter in the second line of interviews, usually known as the HR round.

Should I be applying for job roles that demand 2-3+ years of experience when I have only 1 year of hands-on experience in data science tools and techniques?

Absolutely. If you are someone who is able to clear the basic technical round, then you can be considered further. Some criteria are just there to make sure that you are polished enough in the field. If you have practised effectively and are very comfortable with the subject matter, that will show during the interview. If you are not well versed in the subject matter, that too will show, and the interviewer will assess you based on your performance. The emphasis is on “Practice”. If you have a firm hold on data science tools and techniques and are able to deliver when examined, you will surely sail through the interviews. In any case, if the criteria are very stringent and three years of experience with data science tools and techniques is required on paper, do not get discouraged. There are many other roles out there in the industry that will suit you, and you should keep looking for them.

How can I shape my resume using my past work experience for a data science job role?

You can highlight some key projects that you have worked on, along with the achievements and the challenges that you overcame during your tenure. Having said that, if you are applying for a data science role, focus on presenting your data science knowledge predominantly. If you have worked with any tool before, explain how you used it to generate productivity in your organisation, for example by optimising a process. You can mention the successful data science projects you have worked on. Capstone projects are very attractive to recruiters, as they get to see your overall understanding of the industry. You may add self-assigned projects that you took up and that created an impact in your workplace. Mention the projects where you have done exceptionally well during your learning journey. The key ingredient is to highlight your accomplishments and the diverse challenges that you have faced and conquered. This goes to prove your mettle and how robust you are.


Social Media Bulletin

This is how our learners feel post completion of the course...


That's a good question!

In this edition, our question is: Which distance measure do you use the most while solving different ML problems
and why?

Arindam Sarkar says - “There are certain ML algorithms which have distance measures at their core, as their very foundation. Take the examples of KNN (Supervised Learning) or K-means Clustering (Unsupervised Learning): both use the Euclidean distance measure. The similarity of the features is determined with the help of this distance measure, be it in KNN for classification problems or in forming clusters with the K-means algorithm. Although Euclidean distance is by far the most commonly used distance measure across ML algorithms, the right choice depends on the algorithm being used.

Let us take the example of LDA (Linear Discriminant Analysis), where the Euclidean distance fails because of the following disadvantages:

Ignores the variance (or standard deviation)

Scale-dependent

Ignores correlation (in the case of multiple predictors, the correlations between them are ignored)

The LDA algorithm requires the new axis or axes to be created using two criteria, and, most importantly, both are considered simultaneously, i.e.:

Maximizing the distance between the means of each class or category

Minimizing the variation within each class or category

These criteria cannot be met using the Euclidean distance measure; hence, the Mahalanobis distance is applicable in this case, as it takes into account not only the distance from the centres of the classes but also the variances and correlations. To conclude, the applicability of a distance measure depends on how the ML algorithm works to achieve its specific objective, be it a Supervised or an Unsupervised Learning problem.”
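A small sketch of the point Arindam makes, comparing Euclidean distance to Mahalanobis distance (which folds in the covariance structure); the data here is synthetic.

```python
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

# Synthetic 2-D data with strongly correlated features.
rng = np.random.default_rng(1)
x = rng.normal(0, 1, 500)
data = np.column_stack([x, 0.9 * x + rng.normal(0, 0.2, 500)])

centre = data.mean(axis=0)
VI = np.linalg.inv(np.cov(data, rowvar=False))  # inverse covariance matrix

# A point lying along the correlation axis vs. one lying against it.
along = centre + np.array([1.0, 0.9])
against = centre + np.array([1.0, -0.9])

for name, p in [("along correlation", along), ("against correlation", against)]:
    print(f"{name}: euclidean={euclidean(p, centre):.2f}, "
          f"mahalanobis={mahalanobis(p, centre, VI):.2f}")
# Both points are equidistant from the centre in Euclidean terms, but Mahalanobis
# flags the point that violates the correlation structure as much farther away.
```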


Balaji says - “Distance measure is a very important concept in Data Science and comes into play frequently whether
we are talking about Unsupervised Machine Learning or Supervised Machine Learning. Though it is called Distance
measure, this term has more to do with “similarity” or “likeness” when we compare two observations. In most
contexts, where we apply Data Science, the Distance measure concept is applied in this sense. However, if we are
analyzing geographical data, for example with latitude/longitude values of location, then “distance” would actually
mean distance.

So, to compare the likeness or similarity between two observations, we need to have a quantifiable way of doing it.
For example, if we have two individuals – one male, 38 years old, with 15 years of work experience living in Mumbai
versus another who is male, 24 years of age, 3 years of experience living in Delhi. How do we quantify or compute
the similarity of these two persons with a third person and decide which two are more similar? This is where the
Distance Measures come in.

While there are several distance measures, like Euclidean, Manhattan, Chebyshev, Cosine, Hamming and Minkowski, choosing one that is appropriate is not a straightforward task. Some of these measures are more intuitive to understand than others: for example, Euclidean (the length of the straight line segment that connects two points), Manhattan (the distance between two points if we could move only at right angles), or Hamming distance (the number of features that have different values between two observations). The Minkowski measure is a mathematically generalized form of distance measure with a parameter K, with K=1 making it equivalent to Manhattan and K=2 making it equivalent to Euclidean.

I tend to use Euclidean more often than the other measures, as it is intuitive to understand how it is calculated and, more importantly, it works well in most scenarios. However, Hamming distance scores over the others if we have more categorical features (features with a limited number of unique values or text values). Minkowski offers the advantage of being able to “tune” the value of K by trying out different values and selecting the one that gives the best outcome. But when it comes to data dealing with geographical information, especially latitude/longitude values (which denote the unique coordinates of a location on Earth’s surface), the Haversine distance may be worth a try.”
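As a small sketch of the Minkowski equivalences described above (and detailed further below), using SciPy's standard distance functions; the toy vectors are invented, and SciPy calls the Minkowski parameter p rather than K.

```python
from scipy.spatial.distance import cityblock, euclidean, hamming, minkowski

u = [38, 15, 1]  # e.g. age, years of experience, city code (toy encoding)
v = [24, 3, 2]

print(euclidean(u, v))       # straight-line distance
print(cityblock(u, v))       # Manhattan: moves at right angles only
print(minkowski(u, v, p=1))  # equals the Manhattan distance
print(minkowski(u, v, p=2))  # equals the Euclidean distance
print(hamming(u, v))         # fraction of positions that differ (here 3/3 = 1.0)
```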

Jainesh says - “Choosing the right distance metric plays an important role in machine learning. A distance measure is a score that summarizes the relative difference between two objects in a problem or business domain. Knowing when to use which distance measure can help take you from a poor classifier to an accurate model. The four most used distance measures in machine learning are: Euclidean distance, Hamming distance, Manhattan distance and Minkowski distance.

The choice of distance measure varies from case to case, but the most widely used, specifically in data mining, is the Euclidean distance, which calculates the distance between two real-valued vectors. Below are a few pointers for deciding which distance measure to use when:

Euclidean Distance - Euclidean distance can be used with both float and integer data types. It is calculated as the square root of the sum of the squared differences between the two vectors. Euclidean distance is the shortest path between source and destination, making it a classic choice for clustering, where closer vicinity of the objects is preferred for categorization. Euclidean distance works great when the data is low-dimensional and the magnitude of the vectors is important to measure.

Euclidean distance is not scale-invariant, which means that the computed distances might be skewed depending on the units of the features; typically, the data needs to be normalized before using this distance measure. And as the dimensionality of your data increases, Euclidean distance becomes less useful.

Hamming Distance - Hamming distance is the number of values that are different between two vectors. It is typically used to compare two binary strings of equal length. It can also be used to compare how similar two strings are by counting the number of characters that differ between them. Hamming distance is used to measure the distance between categorical variables.

Hamming distance is difficult to use when the two vectors are not of equal length. Also, it is not advised to use this distance measure when magnitude is an important measure.

Manhattan Distance - The Manhattan distance, often called taxicab distance, calculates the distance between real-valued vectors. It refers to the distance between two vectors if they could only move at right angles; there is no diagonal movement involved in calculating the distance. When a dataset has discrete and/or binary attributes, Manhattan seems to work quite well, since it considers the paths that could realistically be taken within the values of those attributes.

It is more likely to give a higher distance value than Euclidean distance, since it does not calculate the shortest possible path. This does not necessarily cause issues, but it is something to consider.

Minkowski Distance - Minkowski distance is a metric used in a normed vector space (n-dimensional real space), which means it can be used wherever distances can be represented as vectors that have a length. This distance measure can be manipulated with an input parameter to closely resemble the other measures, with the parameter set to 1 giving the Manhattan distance and set to 2 giving the Euclidean distance.”

“An intellectual is someone whose mind watches itself.”
- Albert Camus

Editorial team

Hemant Verma Jenlyn Jude Miranda Tushar Ghosh Sampada Mall
