Bits & Bytes: Data Digest

What’s Inside:
LEADERSHIP SPEAKS
GREAT LEARNING SUCCESS STORY
Great Learning Journey
I admit that the first learning from this course for me was how much you don’t know!! Thanks to Dr. PKV and the faculty members who helped deliver this for us. Despite the expertise, how they made the learning easy for us speaks volumes of the team of lecturers.
Let’s Learn!
pricing. Here are some key strategies in arriving at the right Price Intelligence:
Spotlight
This is a project presented by Subramanian Gopalakrishnan, Apurva Dhingra, Sahil Linjhara, George Varghese and Ankush Kharbanda, PGP-DSBA students, at the AICTE-sponsored Online International Conference on Data Science, Machine Learning and its Applications (ICDML-2020). A follow-up paper was published in the conference journal.

Over the last few years, the credit industry in India has experienced exponential growth, and the retail loan book of Financial Institutions (FI) in India is expected to double to Rs. 96 trillion by 2024. To remain competitive in retail lending, one of the major challenges faced by FIs is to maximize the loan amount with minimum processing time while ensuring the least number of defaults. In this study, our learners aimed to build a machine learning model to predict the probability that a new applicant will default on the first EMI and to calculate the optimum Loan to Asset Value (LTV) for each applicant applying for a two-wheeler loan. The Loan to Asset Value (LTV) ratio is a financial term used by lenders to express the ratio of a loan to the value of the asset purchased.

The study sample belonged to a Non-Banking Financial Company (NBFC) and contained details of 2,33,154 applicants who applied for a two-wheeler loan. The dataset had information on KYC details, demographics, security assets, past loan records and the credit score of each applicant. In the data, it was observed that 22% of the applicants had defaulted on their first EMI. A machine learning model was built to predict the probability that a new applicant will default on their first EMI: the higher the Probability of Default (PD) of an applicant, the higher the risk of default. Based upon the PD, customers were bucketed into three categories, i.e., Category A, Category B and Category C. Category A contains applicants with low risk and Category C contains applicants with high risk. Further, an optimized LTV was defined for each category, identified based on historical data and business requirements in line with leading industry practices.

A combination of PD and optimized LTV might help the NBFC verify the eligibility of an applicant in no time and reduce the number of defaults. A prototype for an automation tool was built which recommends the optimum LTV range for a new applicant on entering their details. This tool would enable an applicant to check their eligibility in terms of LTV, which would, in turn, reduce the processing cost for the NBFC. This analysis can be further expanded to other types of loans.
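For readers who want to see the shape of such a pipeline, here is a minimal sketch; the file name, column names, model choice and PD cut-offs are assumptions for illustration, not the learners’ actual implementation.

    # Illustrative sketch only: column names, model and PD thresholds are assumed.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import GradientBoostingClassifier

    df = pd.read_csv("two_wheeler_loans.csv")          # hypothetical applicant data
    X = df[["age", "income", "credit_score", "ltv"]]   # assumed feature columns
    y = df["first_emi_default"]                        # 1 = defaulted on first EMI

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=42, stratify=y)
    model = GradientBoostingClassifier().fit(X_tr, y_tr)

    # Probability of Default (PD) for each test applicant
    pd_scores = model.predict_proba(X_te)[:, 1]

    # Bucket applicants by PD; the cut-offs here are placeholders
    risk = pd.cut(pd_scores, bins=[0, 0.1, 0.3, 1.0],
                  labels=["Category A", "Category B", "Category C"],
                  include_lowest=True)
    print(pd.Series(risk).value_counts())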
Discover
Here are the ‘Data Science at Work’ stories for this edition.

Customer profiling of policyholders whose policies are being cancelled, to bring down the number of cancellations.

Suchika Aggarwal, who is a part of the Auditor’s Process Team, audits small businesses in three areas: self voluntary, virtual audit (phone/e-mail), and physical audit, and prepares documents as per USA government (NCCI) guidelines.

The cost of the physical audit process comprises 80% of the total cost of the whole process. It was observed that some policies were getting cancelled due to miscommunication on the company’s behalf caused by vague audit guidelines, leading to monetary loss, customer dissatisfaction and an impact on the company’s reputation. Hence, she decided to start working on the last 12 months’ data for policyholders whose policies got cancelled.

NPS (Net Promoter Score) is a very important parameter for any and every product/service in the market. While checking the responses in the NPS data, she realized that there were too many subjective questions and responses in the data; building a word cloud for them and analyzing the text was important.

There was a dire need to check for customer sentiment, the areas of improvement, which aspects of the customer journey need more enhancement, what concerns were raised by the customers, and what the overall NPS scores and their graphical representations were.

Techniques used: She used text mining tools such as NLTK, word cloud, and lambda functions to get a clear picture of the various parameters/features which were collected in the NPS data survey.
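As a rough illustration of the kind of text-mining workflow described, here is a minimal sketch; the file name and column name are assumptions, not the actual code used.

    # Minimal sketch (assumed data and column names): tokenise free-text NPS
    # comments with NLTK and build a word cloud.
    import pandas as pd
    import nltk
    from nltk.corpus import stopwords
    from wordcloud import WordCloud

    nltk.download("punkt")
    nltk.download("stopwords")

    df = pd.read_csv("nps_survey.csv")                     # hypothetical survey export
    text = " ".join(df["comments"].dropna().astype(str))   # assumed free-text column

    tokens = [w.lower() for w in nltk.word_tokenize(text)
              if w.isalpha() and w.lower() not in stopwords.words("english")]

    wc = WordCloud(width=800, height=400, background_color="white")
    wc.generate(" ".join(tokens)).to_file("nps_wordcloud.png")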
Career Maze: Directions

Which category of companies should I target to grow my career, and what should be my approach while looking for a career transition?

Be open to all companies and domains wherever there is a data science or data analytics opportunity. Whichever field you want to foray into, it will require your foundation, that is, your technical skills, to be strong. In terms of the interview scenario, irrespective of the domain or industry, there will be a first line of interview rounds that involves a technical assessment. You will be put into challenging situations such as writing some code or solving a case study. The assessment is based on how comfortable you are with the complexity of the subject and your readiness to take up the real-time challenges that you may face on a daily basis in the respective job role. This practice is common for most companies that will offer you a job role in the field of data science. Irrespective of the domain, the approach of your recruiter is more or less the same.

If you are not comfortable with writing your code by yourself and you seek help from the internet every time you solve a problem, this can be a hurdle for you in the interviews. To overcome this paralysis, there is only one remedy: “Practice”. Spend a minimum of 60 minutes regularly practising on some raw data sets and strengthening your basics.

It is important to gather insights about the industry you are targeting for a transition. You need to know a little about everything: the history, key players, challenges, processes, new technologies and trends in the market for the industry. This will power your conversation with the recruiter in the second line of interviews, usually known as the HR round.

If you are very comfortable with the subject matter, that will show during the interview. If you are not very well versed in the subject matter, that too will show in the interview, and the interviewer will assess you based on your performance. The emphasis is on “Practice”. If you have a firm hold on data science tools and techniques and you are able to deliver when examined, you will surely sail through the interviews. In any case, if the criteria are very stringent and the requirement is for three years of experience with data science tools and techniques on paper, do not get discouraged. There are many other roles out there in the industry that will suit you, and you should keep looking for them.

How can I carve my resume using my past work experience for a data science job role?

You can highlight some key projects that you have worked upon and the achievements and challenges that you have overcome during your tenure. Having said that, if you are applying for a data science role, focus on presenting your data science knowledge predominantly. If you have worked with any tool before, explain how you used it to generate productivity in your organisation, like optimising a process. You can mention the successful projects you have worked on in data science. Capstone projects are very attractive for recruiters as they get to see your overall understanding of the industry. You may add self-assigned projects that you took up that have created an impact in your workplace. Mention the projects where you have done exceptionally well during your learning journey. The key ingredient is to highlight your accomplishments and the diverse challenges that you have faced and conquered. This goes on to prove your mettle and how robust you are.
In this edition, our question is: Which distance measure do you use the most while solving different ML problems
and why?
Arindam Sarkar says - “There are certain ML algorithms which have a distance measure at their core and as their foundation. Let’s take the examples of KNN (supervised learning) or K-means clustering (unsupervised learning) – both use the Euclidean distance measure. The similarity of the features is determined with the help of this distance measure, be it in KNN for classification problems or in forming clusters with the K-means algorithm. Euclidean distance is by far the most commonly used distance measure across ML algorithms, but having said that, it depends on the algorithm being used.

Let us take the example of LDA (Linear Discriminant Analysis), wherein Euclidean distance fails because of the following disadvantages:
Scale-dependent
Ignores correlation (in the case of multiple predictors, the correlations between them are ignored)

The LDA algorithm requires the new axis or axes to be created using two criteria, and most importantly both are considered simultaneously, i.e., maximising the separation between the class means and minimising the variation (scatter) within each class. These criteria cannot be met using the Euclidean distance measure; hence Mahalanobis distance is applicable in this case, as it takes into account not only the distance from the centres of the classes but also the variances and correlations. To conclude, it depends on how the ML algorithm works to achieve the specific objective (be it a supervised or unsupervised learning problem) for the applicability of the distance measure being used.”
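As a small illustration of that last point (not part of Arindam’s answer), the following sketch compares Euclidean and Mahalanobis distances on correlated, differently scaled features; the data is simulated.

    # Sketch (assumed example): Euclidean vs Mahalanobis on correlated features.
    import numpy as np
    from scipy.spatial.distance import euclidean, mahalanobis

    rng = np.random.default_rng(0)
    x1 = rng.normal(0, 1, 500)
    x2 = 10 * x1 + rng.normal(0, 2, 500)          # correlated, larger scale
    X = np.column_stack([x1, x2])

    VI = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance matrix

    centre = X.mean(axis=0)
    point = np.array([2.0, 20.0])                 # lies along the correlation axis

    print("Euclidean   :", euclidean(point, centre))
    print("Mahalanobis :", mahalanobis(point, centre, VI))
    # Mahalanobis shrinks distances along directions of high (co)variance,
    # which suits class-separation criteria such as LDA's.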
Balaji says - “Distance measure is a very important concept in Data Science and comes into play frequently whether
we are talking about Unsupervised Machine Learning or Supervised Machine Learning. Though it is called Distance
measure, this term has more to do with “similarity” or “likeness” when we compare two observations. In most
contexts, where we apply Data Science, the Distance measure concept is applied in this sense. However, if we are
analyzing geographical data, for example with latitude/longitude values of location, then “distance” would actually
mean distance.
So, to compare the likeness or similarity between two observations, we need a quantifiable way of doing it. For example, take two individuals – one male, 38 years old, with 15 years of work experience, living in Mumbai, and another male, 24 years of age, with 3 years of experience, living in Delhi. How do we quantify or compute the similarity of these two persons with a third person and decide which two are more similar? This is where distance measures come in.
While there are several distance measures like Euclidean, Manhattan, Chebyshev, Cosine, Hamming and Minkowski,
choosing one that is appropriate is not a straightforward task. Some of these measures are more intuitive to understand, for example Euclidean (the length of the straight line segment that connects two points), Manhattan (the distance between two points if we could move only at right angles), or Hamming distance (the number of features that have different values between two observations). The Minkowski measure is a mathematically generalized form of distance measure with a parameter K, with K=1 making it equivalent to Manhattan and K=2 making it equivalent to Euclidean.
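As a quick illustration (an assumed example, not part of Balaji’s answer), the scipy implementation shows the K=1 and K=2 equivalences directly.

    # Sketch: Minkowski reduces to Manhattan at p=1 and Euclidean at p=2.
    from scipy.spatial.distance import cityblock, euclidean, minkowski

    a = [38, 15]   # hypothetical person A: age, years of experience
    b = [24, 3]    # hypothetical person B

    print(minkowski(a, b, p=1), cityblock(a, b))   # both 26.0
    print(minkowski(a, b, p=2), euclidean(a, b))   # both ~18.44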
I tend to use Euclidean more often than the other measures, as it is intuitive to understand how it is calculated and, more importantly, it works well in most scenarios. However, Hamming distance scores over the others if we have more categorical features (features with a limited number of unique values or text values). Minkowski offers the
advantage of being able to “tune” the value of K by trying out different values and selecting the one that gives the
best outcome. But when it comes to data dealing with geographical information, especially the latitude/longitude
values (which denote the unique coordinates of a location on Earth’s surface), then Haversine distance may be worth
a try.”
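For the geographical case, here is a minimal Haversine sketch (an assumed example, not part of the original answer); the coordinates are approximate city centres.

    # Sketch: Haversine (great-circle) distance between two lat/long points, in km.
    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        dlat, dlon = lat2 - lat1, lon2 - lon1
        a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
        return 2 * 6371 * asin(sqrt(a))   # mean Earth radius ~6371 km

    # Approximate city-centre coordinates for Mumbai and Delhi
    print(round(haversine_km(19.076, 72.877, 28.613, 77.209)))  # roughly 1150 km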
Jainesh says - “Choosing the right distance metric plays an important role in machine learning. A distance measure is a score that summarizes the relative difference between two objects in a problem or in any business domain. Knowing when to use which distance measure can help you go from a poor classifier to an accurate model. The four most used distance measures in machine learning are: Euclidean distance, Hamming distance, Manhattan distance, and Minkowski distance.
The choice of distance measure varies from case to case, but the most widely used distance measure, in data mining specifically, is ‘Euclidean distance’, as it calculates the distance between two real-valued vectors. Below are a few pointers for deciding which distance measure to use when:
Euclidean Distance - Euclidean distance can be used with both float and integer data type values. It is calculated as the square root of the sum of the squared differences between the two vectors. Euclidean distance is the shortest path between source and destination, making it a classic choice for clustering, where objects in closer vicinity are grouped into the same category. Euclidean distance works well when the data is low-dimensional and the magnitude of the vectors is important.

Euclidean distance is not scale-invariant, which means that the computed distances might be skewed depending on the units of the features. Typically, the data needs to be normalized before using this distance measure. Also, the higher the dimensionality of your data, the less useful Euclidean distance becomes.
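To make the scale-sensitivity point concrete, here is a small sketch (an assumed illustration) comparing Euclidean distances before and after standardising the features.

    # Sketch (assumed example): Euclidean distance before and after scaling.
    # One feature in years, another in rupees, so the raw distance is dominated
    # by the rupee column until the data is standardised.
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[38.0, 90_000.0],    # hypothetical: age, monthly income
                  [24.0, 35_000.0],
                  [30.0, 88_000.0]])

    def euclid(u, v):
        return np.sqrt(np.sum((u - v) ** 2))   # sqrt of sum of squared differences

    print(euclid(X[0], X[2]), euclid(X[1], X[2]))      # raw: income column dominates

    Xs = StandardScaler().fit_transform(X)             # zero mean, unit variance per column
    print(euclid(Xs[0], Xs[2]), euclid(Xs[1], Xs[2]))  # scaled: both features contribute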
Hamming Distance - Hamming distance is the number of values that are different between two vectors. It is typically
used to compare two binary strings of equal length. It can also be used for strings to compare how similar they are
to each other by calculating the number of characters that are different from each other. Hamming distance is used
to measure the distance between categorical variables.
Hamming distance is difficult to use when two vectors are not of equal length. Also, it is not advised to use this
distance measure when the magnitude is an important measure.
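A quick sketch (an assumed example) of Hamming distance on equal-length binary strings and on categorical records follows.

    # Sketch: Hamming distance as the count of positions that differ.
    from scipy.spatial.distance import hamming

    s1, s2 = "1011101", "1001001"
    print(sum(c1 != c2 for c1, c2 in zip(s1, s2)))   # 2 positions differ

    u = [1, 0, 1, 1, 1, 0, 1]
    v = [1, 0, 0, 1, 0, 0, 1]
    # scipy returns the *proportion* of differing positions, so multiply by length
    print(hamming(u, v) * len(u))                    # 2.0

    # Categorical records of equal length compare the same way
    a = ["male", "Mumbai", "salaried"]
    b = ["male", "Delhi", "self-employed"]
    print(sum(x != y for x, y in zip(a, b)))         # 2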
Manhattan Distance - The Manhattan distance, often called taxicab distance, calculates the distance between real-valued vectors. It refers to the distance between two vectors if they could move only at right angles; there is no diagonal movement involved in calculating the distance. When a dataset has discrete and/or binary attributes, Manhattan seems to work quite well, since it considers the paths that could realistically be taken within the values of those attributes.

However, it is more likely to give a higher distance value than Euclidean distance, since it does not follow the shortest possible path. This does not necessarily cause issues, but it is something to consider.
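A short sketch (an assumed example) showing that Manhattan distance is never smaller than Euclidean distance on the same pair of vectors:

    # Sketch: Manhattan (taxicab) vs Euclidean on the same vectors.
    from scipy.spatial.distance import cityblock, euclidean

    u, v = [2, 3, 5], [7, 1, 4]
    print(cityblock(u, v))   # |2-7| + |3-1| + |5-4| = 8
    print(euclidean(u, v))   # sqrt(25 + 4 + 1) ~= 5.48
    # Manhattan sums axis-aligned moves, so it is always >= Euclidean.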
Minkowski Distance - Minkowski distance is a metric used in a normed vector space (n-dimensional real space), which means it can be used in any space where distances can be represented as a vector that has a length. This distance measure can be tuned via its input parameter to closely resemble others; for example, a parameter value of 1 gives the Manhattan distance and a value of 2 gives the Euclidean distance.”
Editorial team