Data Science Interview Questions
Answer :- prop.table() employed on top of the table() function, i.e., prop.table(table()), is the R function. It can be employed on any variable, but it makes most sense to employ it on a factor variable.
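For illustration, a minimal R sketch (the variable is assumed):
    # proportions of each level of a factor variable
    species <- factor(c("a", "b", "a", "a", "c"))
    prop.table(table(species))   # a: 0.6, b: 0.2, c: 0.2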
Answer :- lapply returns the output as a list, whereas sapply returns the output as a vector, matrix, or array.
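A minimal R sketch showing the difference:
    x <- list(a = 1:5, b = 6:10)
    lapply(x, mean)   # returns a list: $a 3, $b 8
    sapply(x, mean)   # returns a named numeric vector: a 3, b 8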
Can we represent the output of a classifier having more than two levels using a confusion matrix?
Answer :- We cannot use a confusion matrix when we have more than two levels in the output variable. Instead, we can use the CrossTable() function from the gmodels package.
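A minimal sketch, assuming actual and predicted are label vectors with more than two levels:
    library(gmodels)
    actual    <- factor(c("low", "mid", "high", "mid", "low"))
    predicted <- factor(c("low", "mid", "mid", "mid", "high"))
    CrossTable(actual, predicted)   # cross-tabulation of actual vs predicted labels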
What is Joint Probability?
Answer :- It is the probability of two events occurring at the same time. A classic example is the probability of an email being spam with the word lottery in it. Here the events are the email being spam and the email having the word lottery.
Answer :- The mean() function can be used to compute the accuracy. Within the parentheses, the actual labels have to be compared with the predicted labels.
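A minimal sketch (label vectors assumed):
    actual    <- c("yes", "no", "yes", "yes")
    predicted <- c("yes", "no", "no", "yes")
    mean(actual == predicted)   # accuracy = 0.75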
Answer :- Bayes' Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Mathematically it is given as P(A|B) = [P(B|A)P(A)]/P(B), where A and B are events. P(A|B), called the Posterior Probability, is the probability of event A (response) given that B (independent) has already occurred. P(B|A) is the likelihood of the training data, i.e., the probability of event B (independent) given that A (response) has already occurred. P(A) is the prior probability of the response variable and P(B) is the probability of the training data or evidence.
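A worked sketch of the spam example above (all probability values are assumed for illustration):
    p_spam         <- 0.20   # P(A): prior probability of spam (assumed)
    p_lottery_spam <- 0.40   # P(B|A): probability of "lottery" given spam (assumed)
    p_lottery      <- 0.10   # P(B): overall probability of "lottery" (assumed)
    p_lottery_spam * p_spam / p_lottery   # P(A|B) = 0.8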
Answer :- The fundamental assumption is that each independent variable independently and equally contributes to the outcome.
What is SVM?
Answer :- In SVM we plot each data point in n-dimensional space, with the value of each dimension being the value of a particular coordinate. Then, we perform classification by finding the hyperplane that differentiates the classes very well.
What are the tuning parameters of SVM?
Answer :- Kernel, Regularization, Gamma, and Margin are the tuning parameters of SVM.
Answer :- Kernel tricks are nothing but transformations applied on input variables which turn non-separable data into separable data. There are 9 different kernel tricks. Examples are Linear, RBF, Polynomial, etc.
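For illustration, a sketch using the e1071 package on the iris data set (the package choice is an assumption; the kernel is passed as an argument):
    library(e1071)
    model <- svm(Species ~ ., data = iris, kernel = "radial")   # try "linear" or "polynomial" too
    table(iris$Species, predict(model, iris))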
Is there a need to convert categorical variables into numeric in SVM? If yes, explain.
Answer :- All the categorical variables have to be converted to numeric by creating dummy variables, as all the data points have to be plotted in n-dimensional space. In addition to this, the tuning parameters like Kernel, Regularization, Gamma, and Margin involve mathematical computations that require numeric variables. This is an assumption of SVM.
Answer :- The value of the Regularization parameter tells the training model how much to avoid misclassifying each training observation.
Answer :- Gamma is the kernel coefficient in the kernel tricks RBF, Polynomial, and Sigmoid. Higher values of Gamma make the model more complex and can lead to overfitting.
Answer :- Margin is the distance from the separating hyperplane to the closest data points of each class. The larger the margin width, the better the classification. But even before achieving the maximum margin, the objective of the algorithm is to correctly classify the data points.
Answer :- Decision Tree is a supervised machine learning algorithm used for classification and regression analysis. It is a tree-like structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label.
Explain the different types of nodes in a decision tree and how they are selected.
Answer :- We have the Root Node, Internal Nodes, and Leaf Nodes in a decision tree. A Decision Tree starts at the Root Node, the first node of the decision tree. The data set is split based on the Root Node, and nodes are again selected to further split the already split data. This process of splitting the data goes on until we get leaf nodes, which are nothing but the classification labels. The process of selecting Root Nodes and Internal Nodes is done using the statistical measure called Gain.
Answer :- We say a data set is pure or homogeneous if all of its class labels are the same, and impure or heterogeneous if the class labels are different. Entropy, Gini Index, or Classification Error can be used to measure the impurity of the data set.
Answer :- The process of removing sub-nodes which contribute less power to the decision tree model is called Pruning.
Answer :- Pruning reduces the complexity of the model, which in turn reduces the overfitting problem of a Decision Tree. There are two strategies in Pruning: Postpruning - discard unreliable parts from the fully grown tree; Prepruning - stop growing a branch when the information becomes unreliable. Postpruning is the preferred one.
Answer :- Gain for any column is calculated by subtracting the information (entropy) of the dataset after splitting on that variable from the information of the entire dataset, i.e., Gain(Age) = Info(D) - Info_Age(D).
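A minimal sketch of the Gain calculation on toy data (labels and split values assumed):
    # entropy of a vector of class labels
    entropy <- function(y) {
      p <- table(y) / length(y)
      -sum(p * log2(p))
    }
    y   <- c("yes", "yes", "no", "no", "yes", "no")
    age <- c("old", "old", "old", "young", "young", "young")
    # Info_Age(D): weighted entropy of the groups formed by splitting on age
    info_age <- sum(sapply(split(y, age),
                           function(g) length(g) / length(y) * entropy(g)))
    entropy(y) - info_age   # Gain(Age)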
Answer :- Random Forest is an Ensemble Classifier. As opposed to building a single decision tree, random forest builds many decision trees and combines the output of all the decision trees to give a stable output.
Answer :- C50 and tree packages can be used to implement a decision tree algorithm in R.
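A minimal sketch with the C50 package on the iris data set:
    library(C50)
    model <- C5.0(Species ~ ., data = iris)   # fit a decision tree
    summary(model)                            # tree structure and error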
What is the R package to employ Random Forest in R?
Answer :- The randomForest package can be used to implement Random Forest in R.
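A minimal sketch on the iris data set:
    library(randomForest)
    model <- randomForest(Species ~ ., data = iris)   # 500 trees by default
    print(model)   # out-of-bag error estimate and confusion matrix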
How does Random Forest add randomness and build a better model?
Answer :- Instead of searching for the most important feature while splitting a node, it searches for
the best feature among a random subset of features. This results in a wide diversity that generally
results in a better model. Additional randomness can be added by using random thresholds for each
feature rather than searching for the best possible thresholds (like a normal decision tree does).
Answer :- Random Forest does not tend to overfit the model, it is unexcelled in reliable accuracy, works very well on large data sets, can handle thousands of input variables without variable deletion, outputs the significance of input variables, and handles outliers and missing values very well.
Answer :- Neural Network is a supervised machine learning algorithm which is inspired by the human nervous system and replicates the way the human brain is trained. It consists of Input Layers, Hidden Layers, and Output Layers.
Answer :- The main limitation of Random Forest is that a large number of trees can make the algorithm too slow and ineffective for real-time predictions. In most real-world applications the random forest algorithm is fast enough, but there can certainly be situations where run-time performance is important and other approaches would be preferred.
Answer :- Artificial Neural Networks, Recurrent Neural Networks, Convolutional Neural Networks, Boltzmann Machine Networks, and Hopfield Networks are examples of Neural Networks. There are a few other types as well.
Answer :- An activation function is used to convert an input signal of a node in an ANN to an output signal. That output signal is then used as an input in the next layer in the stack.
Answer :- Sigmoid or Logistic, Tanh or Hyperbolic Tangent, and ReLU or Rectified Linear Units are examples of activation functions in a neural network.
Answer :- Probability distributions are categorized into two kinds: Discrete and Continuous probability distributions. In a discrete probability distribution the underlying random variable is discrete, whereas in a continuous probability distribution the underlying random variable is continuous.
Answer :- A discrete random variable is a random variable that has countable values, such as a list of
non-negative integers.
Answer :- Number of students present, number of red marbles in a jar, number of heads when flipping
three coins, students' grade level are few of the examples of discrete random variables
Answer :- A continuous random variable is a random variable with a set of possible values (known as
the range) that is infinite and uncountable.
Answer :- Height of students in class, weight of students in class, time it takes to get to school, distance
traveled between classes are few of the examples of continuous random variables
Answer :- Expected value (EV), also known as the mean value, is the expected outcome of a given experiment, calculated as the weighted average of all possible values of a random variable based on their probabilities: E(X) = x1p1 + x2p2 + x3p3 + ... + xnpn.
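A minimal sketch (values and probabilities assumed):
    x <- c(1, 2, 3)         # possible values of the random variable
    p <- c(0.2, 0.5, 0.3)   # their probabilities
    sum(x * p)              # expected value = 2.1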
Answer :- Qualitative and Quantitative are the broader classifications in R; these are further classified into Nominal, Ordinal, Interval, and Ratio data types.
Answer :- A nominal data type is merely a name or a label. Languages spoken by a person and jersey numbers of football players are examples of the Nominal data type, whereas, on top of being a name or a label, the Ordinal data type has some natural ordering associated with it. Shirt sizes, Likert scale ratings, ranks in a competition, and the educational background of a person are examples of the Ordinal data type.
How is the Interval data type different from Ratio?
Answer :- Interval scales are numeric scales in which we know not only the order but also the exact differences between the values; the problem with interval scales is that they don't have a "true zero". Temperature and dates are examples of the Interval data type. The Ratio data type, in contrast, tells us about the order and the exact value between units, and it also has an absolute zero. Heights and weights of people and the length of an object are examples of the Ratio data type.
Answer :- Any statistical method, be it descriptive, predictive or prescriptive can be employed only
based on the data type of the variable. Incorrect identification of data types leads to incorrect
modeling which in turn leads to incorrect solution.
Answer :- Absolute zero means true absence of a value. We do not have any absolute zero in Interval
data type. One such example is 0 Celsius temperature which does not mean that the temperature is
absent.
Answer :- Factor data objects are used to store and process categorical data in R.
Answer :- A factor variable is a variable which can take only a limited set of values. In other words, the levels of a factor variable will be limited.
Answer :- The lazy evaluation of a function means, the argument is evaluated only if it is used inside
the body of the function. If there is no reference to the argument in the body of the function then it
is simply ignored.
Answer :- The subset() function is used to select variables and observations. The sample() function is used to choose a random sample of size n from a dataset.
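A minimal sketch on the built-in iris data set:
    subset(iris, Sepal.Length > 7, select = c(Sepal.Length, Species))   # filter rows, pick columns
    iris[sample(nrow(iris), 5), ]                                       # a random sample of 5 rows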
Answer :- This is the package which is loaded by default when the R environment is set up. It provides basic functionalities like input/output and arithmetic calculations in the R environment.
Answer :- When two vectors of different lengths are involved in an operation, the elements of the shorter vector are reused to complete the operation. This is called element recycling. Example - v1 <- c(4,1,0,6) and v2 <- c(2,4); then v1*v2 gives (8,4,0,24). The elements 2 and 4 are repeated.
Answer :- In R the data objects can be converted from one form to another. For example we can create
a data frame by merging many lists. This involves a series of R commands to bring the data into the
new format. This is called data reshaping.
Answer :- Binomial, Poisson, Negative Binomial, Geometric, and Hypergeometric are examples of Discrete Probability Distributions.
Answer :- Normal, Exponential, t, F, Chi-square, Uniform, and Weibull are a few examples of Continuous Probability Distributions.
Answer :- The Binomial Distribution can simply be thought of as the probability of a Success or Failure outcome in an experiment that is conducted multiple times. Examples: Head and Tail outcomes after tossing a coin, Pass or Fail after appearing for an examination.
Answer :- The normal distribution, or bell curve, is a probability function that describes how the values of a variable are distributed, characterized by its mean and standard deviation. Distributions of heights, weights, and salaries of people are examples of the Normal distribution.
Answer :- The Uniform Distribution is the simplest of all the statistical distributions. Sometimes also known as a rectangular distribution, it is a distribution that has constant probability. This distribution is defined by two parameters, a and b, a being the minimum value and b the maximum value. Example: the probability of a flight landing between 25 and 30 minutes when it is announced that the flight will be landing in 30 minutes. Continuous Uniform Distribution (resembles a rectangle) and Discrete Uniform Distribution (a rectangle in the form of dots) are the two types of Uniform Distribution.
What is T Distribution?
Answer :- The T distribution also known as, Student’s T-distribution is a probability distribution that is
used to estimate population parameters when the sample size is small and/or when the population
variance is unknown.
Explain the F Distribution.
Answer :- The probability distribution of the statistical measure 'F-statistic' is called the F Distribution. It is a skewed distribution used for ANOVA testing. The minimum value is 0 and there is no standard maximum value. Here the F statistic is nothing but the value that you get in the output of an ANOVA or Regression analysis. The F test tells you whether a group of variables is statistically significant.
Answer :- The Weibull distribution is particularly useful in reliability work since it is a general distribution which, by adjustment of the distribution parameters, can be made to model a wide range of life distribution characteristics of different classes of engineered items. The Weibull distribution is widely used to assess product reliability, analyze life data, and model failure times, i.e., it is widely used in Reliability and Survival Analysis. Based on the Beta (shape) parameter, the Weibull distribution can take the shape of different distributions: if Beta < 1 it resembles a Gamma-like shape, Beta = 1 gives the Exponential, Beta = 2 gives the Rayleigh, and Beta ≈ 3.5 approximates the Normal distribution.
When is it appropriate to employ a Bar plot?
Answer :- A Bar plot of a numerical variable will be cluttered and makes interpretation difficult, whereas it makes sense to employ a bar plot on a categorical variable, as we can interpret it efficiently.
Answer :- Variance is calculated to find how far the individual data points are from the mean, i.e., the dispersion in the data. It is calculated as the average of the squared differences of each data point from the mean, so we know for a fact that the units get squared. One way to get rid of squared units without resorting to standard deviation would be to take the absolute value instead of the square in the variance calculation. But the problem with taking absolutes is that it can lead to misleading results: for the two variables X1(4,4,-4,-4) and X2(7,1,-6,-2) you get the same mean absolute deviation of 4 for both, but different standard deviations of 4 and 4.74 when squaring is used. For this reason we resort to squaring the difference of each value from its mean. At this stage, if we interpret the dispersion of data based on Variance, it causes confusion as the values and units are squared. Hence, we resort to Standard Deviation.
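A sketch reproducing the numbers above (population formulas, i.e., dividing by n):
    x1 <- c(4, 4, -4, -4)
    x2 <- c(7, 1, -6, -2)
    mean(abs(x1 - mean(x1)))        # 4     -> same mean absolute deviation
    mean(abs(x2 - mean(x2)))        # 4
    sqrt(mean((x1 - mean(x1))^2))   # 4     -> different standard deviations
    sqrt(mean((x2 - mean(x2))^2))   # ~4.74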
Why is the probability associated with a single value of a continuous random variable considered to be zero?
Answer :- A continuous random variable takes an infinite number of possible values. As the number
of values assumed by the random variable is infinite, the probability of observing a single value is zero.
Answer :- Probability Sampling and Non-Probability Sampling are the broader classifications of sampling techniques. The difference between the two lies in whether the sample selection is based on randomization or not. With randomization, every element gets an equal chance to be picked and to be part of the sample for study.
Answer :- An error that occurs during the sampling process is referred to as a Sampling Error. It can be either a Systematic Sampling Error or a Random Sampling Error. Systematic sampling error is the fault of the investigation, but random sampling error is not.
Answer :- This Sampling technique uses randomization to make sure that every element of the
population gets an equal chance to be part of the selected sample. It’s alternatively known as random
sampling. Simple Random Sampling, Stratified Sampling, Systematic Sampling, Cluster Sampling, Multi
stage Sampling are the types of Probability Sampling
Answer :- This technique divides the elements of the population into small subgroups (strata) based
on the similarity in such a way that the elements within the group are homogeneous and
heterogeneous among the other subgroups formed. And then the elements are randomly selected
from each of these strata. We need to have prior information about the population to create
subgroups.
Answer :- The entire population is divided into clusters or sections and then the clusters are randomly selected. All the elements of a selected cluster are used for sampling. Clusters are identified using details such as age, sex, location, etc. Single Stage Cluster Sampling or Two Stage Cluster Sampling can be used to perform Cluster Sampling.
Answer :- It is the combination of one or more probability sampling techniques. The population is divided into multiple clusters and then these clusters are further divided and grouped into various subgroups (strata) based on similarity. One or more clusters can be randomly selected from each stratum. This process continues until the clusters can't be divided anymore. For example, a country can be divided into states, cities, urban and rural areas, and all the areas with similar characteristics can be merged together to form a stratum.
Answer :- It does not rely on randomization. This technique is more reliant on the researcher's ability to select elements for a sample. The outcome of sampling might be biased, and it is difficult for all the elements of the population to be part of the sample equally. This type of sampling is also known as non-random sampling. Convenience Sampling, Purposive Sampling, Quota Sampling, and Referral/Snowball Sampling are the types of Non-Probability Sampling.
Answer :- Here the samples are selected based on availability. This method is used when obtaining samples is rare or costly, so samples are selected based on convenience. For example, researchers prefer this during the initial stages of survey research, as it's quick and easy to deliver results.
Answer :- This type of sampling depends on some pre-set standard. It selects a representative sample from the population. The proportion of characteristics/traits in the sample should be the same as in the population. Elements are selected until exact proportions of certain types of data are obtained or sufficient data in different categories is collected. For example, if our population has 45% females and 55% males, then our sample should reflect the same percentages of males and females.
Answer :- This technique is used in situations where the population is completely unknown and rare. Therefore, we take help from the first element we select for the population and ask them to recommend other elements who will fit the description of the sample needed. This referral technique goes on, increasing the size of the population like a snowball. For example, it's used in situations involving highly sensitive topics like HIV/AIDS. Not all the victims will respond to the questions asked, so researchers can contact people they know or volunteers to get in touch with the victims and collect information.
Answer :- When errors are systematic, they bias the sample in one direction. Under these circumstances, the sample does not truly represent the population of interest. Systematic error occurs when the sample is not drawn properly. It can also occur if names are dropped from the sample list because some individuals were difficult to locate or uncooperative. Individuals dropped from the sample could be different from those retained. Those remaining could quite possibly produce a biased sample. Political polls often have special problems that make prediction difficult.
Answer :- Random sampling error, as contrasted to systematic sampling error, is often referred to as
chance error. Purely by chance, samples drawn from the same population will rarely provide identical
estimates of the population parameter of interest. These estimates will vary from sample to sample.
For example, if you were to flip 100 unbiased coins, you would not be surprised if you obtained 55
heads on one trial, 49 on another, 52 on a third, and so on.
Answer :- In Statistics, the 68-95-99.7 rule is also known as the empirical rule or three sigma rule. For a Gaussian distribution the mean (arithmetic average), median (central value), and mode (most frequent value) coincide. Here, the area under the curve within ±1s (1 sigma) includes 68% of all values (of the population), while ±2s (2 sigma) includes 95% and ±3s (3 sigma) includes 99.7% of all values.
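The rule can be verified with the standard normal CDF in R:
    pnorm(1) - pnorm(-1)   # ~0.683
    pnorm(2) - pnorm(-2)   # ~0.954
    pnorm(3) - pnorm(-3)   # ~0.997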
In order to come up with a Linear Regression output, a minimum of how many observations are required?
Answer :- The output of Linear Regression is in the form of the equation of a straight line, which requires at least 2 observations.
How can you say that Standard Normal Distribution is better than Normal
Distribution?
Answer :- It is inappropriate to say that Sam with a score of 80 in English Literature is better than Tom with a score of 60 in Psychology, as the variability of scores within the subjects may vary. In order to compare the scores of two different subjects, we need to standardize the deviations of the subjects and then compare the results. This can be done using the Z transformation, which gives 0 as the mean and 1 as the Standard Deviation for any normally distributed data. Assuming Mean=77, SD=3 for English Literature and Mean=56, SD=2 for Psychology, we get 1 and 2 as the z scores (SDs away from the mean) for English Literature and Psychology. Now you can say that Tom performed better than Sam.
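The z scores from the example, computed directly:
    (80 - 77) / 3   # Sam, English Literature: z = 1
    (60 - 56) / 2   # Tom, Psychology: z = 2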
Answer :- Often referred to as percentiles, quantiles are the points (values) in your data below which a certain proportion of the data falls. For example, the Median is also a quantile, the 50th percentile, below which 50% of the data falls.
How is a Normal Q-Q plot plotted, what is its use, and why is it called a Normal Q-Q plot?
Answer :- A Q-Q Plot, or Quantile-Quantile Plot, is plotted by placing the raw values (sample quantiles) of a variable on the Y axis against the corresponding theoretical quantiles on the X axis. It is used to assess the distribution of the underlying data; the distribution could be any of the theoretical distributions like Normal, Exponential, etc. Since mostly we are interested in finding whether the distribution of the underlying data (variable) follows the normal distribution or not, the Q-Q Plot is then called a Normal Q-Q Plot.
Answer :- The reference line indicates a perfectly normal distribution of the data. If most of the data points in a Normal Q-Q Plot fall along the reference line, then we say that the distribution of the underlying data (variable) follows the Normal Distribution.
What are the R functions to plot a Q-Q Plot and the reference line in a Q-Q Plot?
Answer :- qqnorm() is used to plot a Q-Q Plot, whereas qqline() is used to plot the reference line.
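A minimal sketch (a normally distributed sample is assumed, so the points should hug the line):
    x <- rnorm(100)
    qqnorm(x)
    qqline(x, col = "red")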
Answer :- Sample Variance refers to the variation of observations in a single sample whereas Sampling
Variance refers to the variation of a statistical measure (eg., Mean) among multiple samples.
How is Standard Deviation different from Standard Error?
Answer :- Standard Deviation and Standard Error are both measures of dispersion or spread. Standard Deviation uses population data while Standard Error uses sample data. Standard Error tells you how far a sample statistic (e.g., the sample mean) deviates from the actual population mean. This deviation is referred to as the Standard Error. The larger the sample size, the smaller the deviation (SE) between the sample mean and the population mean.
Answer :- The Central Limit Theorem describes the distribution of sample means. The distribution of sample means will be normal if the population data is normally distributed, or if the population data is not normal but the sample size is fairly large.
Answer :- We cannot trust a point estimate (for example a sample mean) to infer about the population mean, the reason being that if we draw another sample it is quite likely that we will get a different sample mean altogether. To overcome this problem, we come up with an interval associated with some confidence. This is achieved by including a Margin of Error with the Point Estimate, which gives us the Confidence Interval.
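A minimal sketch of a 95% confidence interval (sample data assumed; the z value is used for simplicity):
    x  <- c(12, 15, 11, 14, 13, 16, 12, 15)
    se <- sd(x) / sqrt(length(x))            # standard error
    mean(x) + c(-1, 1) * qnorm(0.975) * se   # point estimate ± margin of error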
Answer :- Yes, we have standard z values for different probability values. For example, 1.64 for 90%,
1.96 for 95%, & 2.58 for 99% probability values
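These can be reproduced with qnorm():
    qnorm(0.95)    # ~1.64 for 90%
    qnorm(0.975)   # ~1.96 for 95%
    qnorm(0.995)   # ~2.58 for 99%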
Answer :- We will not have standard t values for different probability values, reason being the
computation of t value includes degrees of freedom, which is dependent on the sample size. Hence
for the same probability with different degrees of freedom we get different t values.
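For example, the same probability with different degrees of freedom:
    qt(0.975, df = 5)      # ~2.57
    qt(0.975, df = 30)     # ~2.04
    qt(0.975, df = 1000)   # ~1.96, approaching the z value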
Answer :- Degrees of freedom are the number of independent values that a statistical analysis can estimate. You can also think of it as the number of values that are free to vary as you estimate parameters. They appear in the output of Hypothesis Tests, Probability Distributions, and Regression Analysis. Degrees of freedom equals your sample size minus the number of parameters you need to calculate during an analysis. It is usually a positive whole number.
Answer :- It is a way of testing whether the results of an experiment are valid and meaningful and have not occurred just by chance. If the results have happened just by chance, then the experiment cannot be repeated and is not reusable.
Answer :- The Null Hypothesis is a statement that is assumed to be true. On top of the Null Hypothesis we conduct various Hypothesis Tests to see if it holds true or not. The Null Hypothesis is denoted by H0.
Answer :- Naive Bayes is so 'naive' because it assumes that all of the features in a data set are equally important and independent. As we know, these assumptions are rarely true in real-world scenarios.
Answer :- Prior probability is the proportion of dependent variable in the data set. It is the closest
guess you can make about a class, without any further information.
Answer :- Covariance is a measure of how two variables change together. Covariances are difficult to compare. For example, if we calculate the covariances of salary ($) and age (years), we'll get different covariances which can't be compared because of their unequal scales. To combat such situations, we calculate correlation to get a value between -1 and 1, irrespective of the respective scales. Correlation is the standardized form of covariance.
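A minimal sketch (salary and age values assumed):
    salary <- c(30000, 45000, 60000, 80000)
    age    <- c(25, 32, 41, 50)
    cov(salary, age)   # scale-dependent, hard to compare across variable pairs
    cor(salary, age)   # standardized to the range -1 to 1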
Answer :- ANCOVA (analysis of covariance) is the technique to capture the association between continuous and categorical variables.
Answer :- In classification techniques, instead of evaluating only the predicted classes, a measure called Log Loss is used to evaluate the predicted probabilities for an observation.
Answer :- Cross Entropy is essentially similar to the log loss function used to measure the probabilities of an actual label. Generally, the term Log Loss is used for binary classification, whereas Cross Entropy is used for multi-class classification.
Given Decision Tree & Random Forest, which one do you think might create an overfitting problem and which one solves the overfitting problem?
Answer :- Decision Tree has a tendency to overfit because it tries to build as accurate a model as possible by selecting the root node and the internal nodes based on the measure Gain. Such a Decision Tree will behave very well on the training data but might not generalize its predictions on the test data. To overcome this problem, we have a reliable ensemble algorithm called Random Forest, which helps in tackling the overfitting problem by creating a lot of decision trees (each built using fewer input variables) rather than a single one. Finally, the results are combined based on majority voting or an average of all the results.
For a coefficient value of -0.65123 for the input variable cost.car, what has to be the interpretation of Log(Carpool/Car) in a multinomial regression?
Answer :- First of all, the sign (+ve/-ve) indicates the direction of the impact of the input variable on the output. In this case, for a unit increase in the input variable cost.car, Log(Carpool/Car) decreases by 0.65123.
For Logistic Regression, is it good practice to decide on the goodness of the model based on just accuracy, or is there anything else we can look at?
Answer :- The output of Logistic Regression is rich; you have multiple measures with which you can comment on the accuracy and reliability of the model. Probabilities (p-values) of the parameters, Null Deviance, Residual Deviance, stepAIC (to compare multiple models), the confusion matrix, overall accuracy, Sensitivity (Recall), Specificity, the ROC Curve, and the Lift Chart are the measures you might want to look at, based on the context of the business objective.
How does Multinomial Regression predict the probabilities of class labels, given the fact that you have more than 2 class labels?
Answer :- In a way, Multinomial Regression builds n-1 individual Logistic Regression models, where n is the number of class labels. Applying the exponential on either side of the n-1 model outputs and then solving them gives us the individual probabilities for the n class labels. Once we get the probabilities, we classify observations into the class labels.
Answer :- SVM is termed a black box technique because internally the algorithm applies complex transformations on the input variables based on the kernel trick applied. Although the math of these transformations is not hidden, it is fairly complex. Because of this complexity, SVM is known as a black box technique.
Why are Ensemble techniques preferred over other classification models?
Answer :- Firstly, ensemble techniques assure the reliability of the accuracy. This can also be achieved for non-ensemble techniques by employing various reliability techniques; one such popular technique is k-fold cross validation. Secondly, it is the intelligent way in which classifications are predicted in ensemble techniques.
Answer :- Imputing the missing values, normalization, SVD or PCA or Clustering, and similarity measures can be considered as pre-processing steps before Recommendation Systems.
What is the need of having Confidence and Lift Ratio, when you have the Support
measure?
Answer :- The Support measure helps us in filtering out all the possible combinations of rules, which are exponential in number. The effect of an Antecedent or Consequent being a generalized product cannot be filtered out just by defining Support. Confidence helps in filtering out Antecedents that are generalized products, and Lift Ratio helps in filtering out Consequents that are generalized ones.
Answer :- In User Based Collaborative Filtering, users act as rows and items as columns. Here we try
to find the similarity among the users.
Answer :- In Item Based Collaborative Filtering, items act as rows and users as columns. Here we try to find the similarity among the items.
How is Item based collaborative filtering different from User based collaborative
filtering?
Answer :- When compared to Users, the count of Items will be more, and in Item based collaborative filtering we try to find similarity among the items, which in turn makes the process computationally expensive. In addition to this, in User based collaborative filtering, by trying to find the similarity among the users we try to connect to the user's taste, whereas Item based collaborative filtering is somewhat similar to Market Basket Analysis, wherein we generalize the results.
Answer :- It is appropriate to normalize the data when we have values like ratings (1-5), as opposed to having values like purchased/not purchased or rated/not rated.
What is the first thing that you need to look at when you are given a dataset?
Answer :- The first check should be made on NA values. Check if there are any NA values present in
the data or not. If present, then impute the NA values rather than deleting the observations having
NAs.
Answer :- Missing data can reduce the power/fit of a model or can lead to a biased model making
incorrect predictions or classifications.
What could be the reasons for the presence of NA values in the data?
Answer :- Data Extraction and Data Collection are considered to be the major reasons for missing
values.
What are the reasons for the NAs while collecting the data?
Answer :- Missing completely at random, Missing at random, Missing that depends on unobserved
predictors, Missing that depends on the missing value itself are the reasons for NAs while collecting
the data.
Answer :- Listwise deletion, Pairwise deletion, Mean/Mode Substitution, Prediction model, KNN
Imputation, Hot Deck Imputation, Maximum Likelihood, Multiple Imputation are the various
imputation techniques
Answer :- In pairwise deletion, a statistic such as a correlation is computed for each combination of variables using all the observations that have values for both variables in the pair; an observation with a missing value is dropped only from the computations that involve the variable it is missing on.
Answer :- We divide the dataset into two parts: one with no missing values (train data) and the other with the missing values (test data). The variable with missing values is treated as the target variable. Next, we create a model to predict the target variable based on the other attributes of the training data set.
Answer :- There are two drawbacks to this approach. First, the model-estimated values are usually better behaved than the true values. Second, if there are no relationships between the attribute with missing values and the other attributes in the data set, then the model will not be precise for estimating missing values.
Answer :- In this method, the missing values of an attribute are imputed using the given number of attributes that are most similar to the attribute whose values are missing. The similarity of two attributes is determined using a distance function.
The choice of the K value is critical. A higher value of K would include attributes which are significantly different from what we need, whereas a lower value of K implies missing out on significant attributes.
What is logistic regression? Or state an example of when you have used logistic regression recently.
Answer :- Logistic Regression, often referred to as the logit model, is a technique to predict a binary outcome from a linear combination of predictor variables. For example, suppose you want to predict whether a particular political leader will win an election or not. In this case, the outcome of the prediction is binary, i.e., 0 or 1 (Win/Lose). The predictor variables here would be the amount of money spent on election campaigning for a particular candidate, the amount of time spent campaigning, etc.
Answer :- A subclass of information filtering systems that are meant to predict the preferences or
ratings that a user would give to a product. Recommender systems are widely used in movies, news,
research articles, products, social tags, music, etc.
Answer :- Cleaning data from multiple sources to transform it into a format that data analysts or data scientists can work with is a cumbersome process because, as the number of data sources increases, the time taken to clean the data increases exponentially due to the number of sources and the volume of data generated by these sources. It might take up to 80% of the time just to clean data, making it a critical part of the analysis task.
Answer :- These are descriptive statistical analysis techniques which can be differentiated based on
the number of variables involved at a given point of time. For example, the pie charts of sales based
on territory involve only one variable and can be referred to as univariate analysis.
Answer :- Machine learning is the science of getting computers to act without being explicitly programmed. Machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. It is so widespread that we unknowingly use it many times in our daily life.
Answer :- Supervised Machine Learning is employed for problem statements wherein an output variable of interest can be either classified or predicted.
Examples: KNN, Naive Bayes, SVM, Decision Tree, Random Forest, Neural Network
Answer :- In this category of Machine Learning, there won’t be any output variable to be either
predicted or classified. Instead the algorithm understands the patterns in the data.
Answer :- Classification Models are employed when the observations have to be classified in
categories and not predicted.
Examples being Cancerous and Non-cancerous tumor (2 categories), Bus, Rail, Car, Carpool
(>2 categories)
Answer :- KNN, Naive Bayes, SVM, Decision Tree, Random Forest, Neural Network
Answer :- It is because of the bottom-up approach, where initially each observation is considered to be a single cluster and gradually, based on the distance measure, individual clusters are paired and finally merged into one.
Answer :- When the clusters are as heterogeneous as possible and the observations within each cluster are as homogeneous as possible.
Answer :- None of these data science techniques are domain specific. They can be employed in any domain, provided data is available.
Example of clustering?
Answer :- Using variables like income, education, profession, age, number of children, etc., you come up with different clusters, and each cluster has people with similar socio-economic characteristics.
Answer :- It is better to employ clustering on normalized data, as you will get different results with and without normalization.
Answer :- Theoretically they range from -infinity to +infinity, but normally you have values between -3 and +3.
Answer :- The summary() command gives the distribution for numerical variables and the counts of observations for factor variables.
Answer :- The str() command gives the dimensions of your data frame. In addition, it gives the class of the dataset and the class of every variable.
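Both commands illustrated on the built-in iris data frame:
    summary(iris)   # distribution of numeric variables, counts for the factor variable
    str(iris)       # dimensions, class of the data frame, class of every variable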
Answer :- Linkage is the criterion based on which the distance between two clusters is computed. Single, Complete, and Average are a few examples of linkages.
Single - The distance between two clusters is defined as the shortest distance between two points in each cluster.
Complete - The distance between two clusters is defined as the longest distance between two points in each cluster.
Average - The distance between two clusters is defined as the average distance between each point in one cluster and every point in the other cluster.
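A minimal sketch on normalized iris measurements (the linkage is passed via the method argument):
    d  <- dist(scale(iris[, 1:4]))       # Euclidean distances on normalized data
    hc <- hclust(d, method = "single")   # also "complete" or "average"
    plot(hc)                             # dendrogram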
Answer :- In Hierarchical Clustering, the number of clusters is decided only after looking at the dendrogram.
Answer :- After computing the optimal clusters, an aggregate measure like the mean has to be computed on all variables, and then the resultant values for all the variables have to be interpreted across the clusters.