Statistics Interview Questions

Uploaded by

shanurudra177
Most Important Statistics Interview Questions

1. What is statistics?
Statistics is a branch of mathematics dealing with data collection, analysis, interpretation,
and presentation. It helps in making informed decisions by identifying patterns, trends, and
relationships within data.

2. What are the two main types of statistics?


The two main types of statistics are Descriptive Statistics and Inferential Statistics.

Descriptive Statistics:
• It deals with summarizing and organizing data.
• It includes measures such as mean, median, mode (central tendency) and variance,
standard deviation, range (dispersion).
• Graphical representations like histograms, box plots, and pie charts are also part of
descriptive statistics.

Inferential Statistics:
• It involves making predictions or inferences about a population based on a sample.
• It includes techniques such as hypothesis testing, confidence intervals, and
regression analysis.
• Common inferential tests include t-tests, chi-square tests, and ANOVA.

3. What are the different types of data?


There are two main types of data:
Quantitative (Numerical) Data and Qualitative (Categorical) Data.

Quantitative Data consists of numbers and represents measurable values. It is divided into:
• Discrete Data: Whole, countable numbers (e.g., number of students in a class).
• Continuous Data: Values that can take any value within a given interval (e.g., height,
weight).

Qualitative Data describes characteristics and is divided into:


• Nominal Data: Categories without a specific order (e.g., colors, gender).
• Ordinal Data: Categories with a meaningful order, but the difference between them
is not measurable (e.g., ratings like "good," "better," "best").

4. What is the difference between population and sample?


The population and sample are key concepts in statistics:
Population:
It refers to the entire group of individuals or observations that we are interested in studying.
Example: If we want to study the average height of students in a country, all students in that
country form the population.

Sample:
A sample is a subset of the population, selected for analysis to make inferences about the
whole population.
Example: If we collect height data from 1,000 students across different schools, that is a
sample of the total population.

5. What is the difference between mean, median, and mode?


Mean, median, and mode are measures of central tendency:

Mean is the average of all values. It works well for roughly symmetric data but is sensitive
to outliers.
Median is the middle value when the data is sorted. If the number of values is odd, the middle
one is the median. If even, the median is the average of the two middle values. It is more
robust against outliers than the mean.
Mode is the most frequently occurring value in the dataset. It is useful for categorical data.
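
As a small illustration of the three measures, here is a sketch using Python's standard `statistics` module. The salary figures are made up for this example; the last value is a deliberate outlier to show how the mean reacts while the median does not.

```python
import statistics

# Hypothetical salaries (in thousands); the last value is an extreme outlier.
salaries = [30, 32, 32, 35, 38, 40, 500]

mean_salary = statistics.mean(salaries)      # pulled upward by the outlier
median_salary = statistics.median(salaries)  # middle value of the sorted data
mode_salary = statistics.mode(salaries)      # most frequent value

print(mean_salary, median_salary, mode_salary)
```

The mean lands far above every typical salary, while the median stays in the middle of the data.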

6. What is standard deviation?


Standard deviation is a measure of how spread out the data points are from the mean. It
tells us how much the values in a dataset deviate from the average.
• A low standard deviation means the data points are close to the mean, while a high
standard deviation indicates that the data points are more spread out.
• It is widely used in statistics and machine learning to understand data variability and
consistency.

7. What is variance?
Variance is the average of the squared deviations of the data points from the mean. It tells
us how spread out the data is.
• A higher variance means more dispersion, while a lower variance means the data is
closer to the mean.
• It’s important in statistics and machine learning for understanding variability in data.
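
A minimal sketch of how variance and standard deviation relate, using Python's `statistics` module. The scores are invented. The population versus sample distinction (dividing by n or by n - 1) is a detail added here and is not discussed in the answers above.

```python
import math
import statistics

scores = [70, 75, 80, 85, 90]  # hypothetical test scores, mean = 80

pop_var = statistics.pvariance(scores)    # average squared deviation (divides by n)
sample_var = statistics.variance(scores)  # sample estimate (divides by n - 1)
pop_sd = math.sqrt(pop_var)               # standard deviation = square root of variance

print(pop_var, sample_var, round(pop_sd, 3))
```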

8. Why is standard deviation preferred over variance?


Variance is in squared units, making it hard to interpret.
Standard deviation (σ) is in the same unit as the data, making it more intuitive.

Example: If test scores have a variance of 25, the standard deviation is √25 = 5, which is
easier to understand in the context of scores.

9. What is skewness?
Skewness measures the asymmetry of a data distribution. It tells us whether the data is
symmetrically distributed or if it has a longer tail on one side.
• Positive Skew (Right-Skewed): The right tail is longer, meaning most values are
concentrated on the left. Example: Income distribution (few people earn very high
salaries).
• Negative Skew (Left-Skewed): The left tail is longer, meaning most values are
concentrated on the right. Example: Exam scores (if most students score high and
few score low).
• Zero Skewness: The distribution is symmetric, like a normal distribution.

Skewness is important in data analysis because it pulls the mean away from the median (toward
the longer tail) and can undermine methods, such as regression models, that assume roughly
symmetric errors.

10. What is kurtosis?


Kurtosis measures the tailedness of a distribution, indicating whether the data has heavy or
light tails compared to a normal distribution. It helps understand the presence of extreme
values (outliers).
• Leptokurtic (High Kurtosis, >3): Heavy tails with more extreme outliers (e.g., stock
market returns).
• Mesokurtic (Normal Kurtosis, ≈3): Similar to a normal distribution (e.g., IQ scores).
• Platykurtic (Low Kurtosis, <3): Light tails with fewer outliers (e.g., uniform
distribution).
Kurtosis is useful in risk analysis, finance, and statistics to assess the likelihood of extreme
deviations in data.
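
Both skewness and kurtosis can be computed from central moments. The sketch below uses the simple moment-based definitions (third moment over variance^1.5, fourth moment over variance^2), so a normal distribution has kurtosis of about 3, matching the thresholds above. The function names are chosen for this example, and statistical packages may apply small-sample corrections that give slightly different numbers.

```python
def central_moment(data, k):
    """k-th central moment: average of (x - mean)^k."""
    m = sum(data) / len(data)
    return sum((x - m) ** k for x in data) / len(data)

def skewness(data):
    """Moment-based skewness: 0 for symmetric data, > 0 for a longer right tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    """Moment-based (non-excess) kurtosis: about 3 for a normal distribution."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
print(skewness(symmetric), skewness(right_skewed), kurtosis(symmetric))
```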

11. When should we use a t-test instead of a z-test?


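• Z-test is used when the population variance is known (or the sample is large, typically
n > 30, so the sample variance is a reliable estimate).
• T-test is used when the population variance is unknown and must be estimated from the
sample, especially when the sample is small (n < 30).

Example: If we want to test whether the average height of people in a city is 5'7":
• If we have 100 observations, a Z-test is a reasonable choice.
• If we have only 15 observations, use a T-test.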
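
For the large-sample case, a one-sample Z-test can be sketched with the standard library alone. The numbers below (sample mean of 68 inches, a known population standard deviation of 3 inches) are assumptions made up for illustration, and `one_sample_z_test` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Two-sided z-test for a mean, assuming the population SD (sigma) is known."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# H0: mean height is 67 inches (5'7"); 100 observations with mean 68 inches.
z, p = one_sample_z_test(sample_mean=68.0, mu0=67.0, sigma=3.0, n=100)
print(round(z, 2), round(p, 5))
```

With only 15 observations and an estimated standard deviation, the same statistic would be compared against a t distribution with 14 degrees of freedom instead of the normal distribution.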

12. You have a dataset with extreme outliers. Which measure of central
tendency should you use?
The mean is sensitive to outliers, so I would use the median, which is robust to extreme
values.

13. How would you test if two drug treatments have different effects?
I would use a two-sample T-test to compare the average effect of each drug.

Steps:
Null Hypothesis (H₀): Both drugs have the same mean effect.
Alternative Hypothesis (H₁): The drugs have different mean effects.
Choose a significance level (α = 0.05).
Perform the T-test and check the p-value.
If p < 0.05, we reject H₀ and conclude the drugs have different effects.
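
The test statistic behind these steps can be sketched as follows. This uses Welch's version of the two-sample T-test (which does not assume equal variances), a choice made for this example. The measurements are invented. The standard library has no t distribution, so the sketch stops at the statistic and degrees of freedom; in practice a call such as `scipy.stats.ttest_ind(drug_a, drug_b, equal_var=False)` would also return the p-value.

```python
import math
import statistics

def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    va = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)  # squared standard error of mean(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

drug_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4]  # hypothetical responses to drug A
drug_b = [4.2, 4.8, 4.5, 4.9, 4.4, 4.6]  # hypothetical responses to drug B
t_stat, df = welch_t(drug_a, drug_b)
print(round(t_stat, 2), round(df, 1))
```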

14. How do you determine if a categorical variable is dependent on another categorical
variable?
I would use a Chi-Square Test for Independence.

Example:
H₀: Gender and product preference are independent.
H₁: Gender and product preference are not independent.
Compute the Chi-square statistic and check the p-value.
If p < 0.05, we reject H₀ and conclude that product preference is associated with gender.
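
For a 2x2 table, the Chi-square statistic and its p-value can be computed by hand. The counts below are hypothetical, and the sketch applies no continuity correction. It relies on the fact that, with 1 degree of freedom, the upper-tail probability equals erfc(sqrt(chi2 / 2)).

```python
import math

def chi_square_2x2(table):
    """Chi-square statistic and p-value for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))  # valid for 1 degree of freedom only
    return chi2, p_value

# Rows: two genders; columns: prefers product X, prefers product Y (made-up counts).
observed = [[30, 20], [15, 35]]
chi2, p = chi_square_2x2(observed)
print(round(chi2, 3), round(p, 4))
```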

15. What is the difference between correlation and causation?


Correlation measures the strength and direction of the relationship between two variables,
indicating how they move together. However, it does not imply that one causes the other.
Causation means that a change in one variable directly results in a change in another.

For example:
• Correlation: Ice cream sales and drowning incidents are positively correlated, but
eating ice cream doesn’t cause drowning. The real cause is hot weather.
• Causation: Smoking causes lung cancer, as proven through medical research.

16. What is probability?


Probability is the measure of how likely an event is to occur, ranging from 0 to 1, where:
• 0 means the event is impossible.
• 1 means the event is certain.

17. What are mutually exclusive events?


Events that cannot occur at the same time (e.g., getting both 3 and 5 on a single roll of a die).

18. What are independent events?


Independent events are events where the outcome of one event does not affect the
outcome of the other.
For example:
Tossing a coin twice: Getting heads on the first toss does not impact the second toss.
Rolling two dice: The number on one die does not influence the number on the other.

19. You have three groups of students, each taught with a different method.
How would you compare their exam scores?

Since we are comparing more than two groups, we use ANOVA (Analysis of Variance).

Steps:
H₀: All three teaching methods lead to the same average score.
H₁: At least one method produces a different average score.
Compute the F-statistic and p-value.
If p < 0.05, reject H₀, meaning at least one teaching method is significantly different.
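
The F-statistic in these steps is the ratio of between-group to within-group variability. A minimal sketch with invented exam scores follows. The p-value needs the F distribution, which the standard library does not provide; `scipy.stats.f_oneway` is one option for that.

```python
import statistics

def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA on a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_values)
    k, n = len(groups), len(all_values)
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of each score around its own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

method_a = [70, 72, 68, 71]  # hypothetical scores per teaching method
method_b = [80, 82, 79, 81]
method_c = [75, 74, 77, 76]
f_stat = one_way_anova_f([method_a, method_b, method_c])
print(round(f_stat, 2))
```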

20. What is Bayes' Theorem?


Bayes' Theorem describes how to update the probability of an event based on new
evidence: P(A|B) = P(B|A) × P(A) / P(B). It follows from the definition of conditional
probability and is widely applied in machine learning, spam filtering, and medical diagnosis.
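
A short worked example for the medical diagnosis case. The prevalence, sensitivity, and false positive rate below are assumed values chosen for illustration, and `posterior` is a hypothetical helper name.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) via Bayes' theorem for a yes/no hypothesis H and evidence E."""
    # Total probability of the evidence: true positives plus false positives.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# 1% of people have the disease; the test detects 99% of cases
# and wrongly flags 5% of healthy people.
p_disease_given_positive = posterior(prior=0.01, likelihood=0.99, false_positive_rate=0.05)
print(round(p_disease_given_positive, 3))
```

Even with an accurate test, the posterior probability is only about 1 in 6, because the disease is rare to begin with.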

21. What is the law of large numbers?


The Law of Large Numbers (LLN) states that as the sample size increases, the sample mean
gets closer to the population mean. In other words, the more observations we take, the
more accurate our estimate of the true average becomes.
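
A quick simulation makes this concrete: the average of fair die rolls should drift toward the true mean of 3.5 as the number of rolls grows. The seed and roll counts are arbitrary choices for this sketch.

```python
import random

def mean_of_die_rolls(n_rolls, seed=0):
    """Average of n_rolls simulated rolls of a fair six-sided die."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0
    for _ in range(n_rolls):
        total += rng.randint(1, 6)
    return total / n_rolls

for n in (10, 1_000, 100_000):
    print(n, mean_of_die_rolls(n))
```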

22. What are the types of probability distributions?


Probability distributions describe how values of a random variable are distributed. They are
mainly classified into discrete and continuous distributions.

1. Discrete Probability Distributions (For countable outcomes)


• Bernoulli Distribution – For a single trial with two outcomes (success/failure).
Example: Coin toss (Heads or Tails).
• Binomial Distribution – For multiple independent Bernoulli trials. Example: Number
of heads in 10 coin tosses.
• Poisson Distribution – Models rare events over time/space. Example: Number of
customer arrivals at a store per hour.
• Geometric Distribution – Counts the number of trials until the first success. Example:
Number of coin tosses needed to get the first heads.

2. Continuous Probability Distributions (For measurable outcomes)
• Normal (Gaussian) Distribution – Bell-shaped curve, widely used in statistics.
Example: Heights of people.
• Exponential Distribution – Models waiting time between events. Example: Time
between bus arrivals.
• Uniform Distribution – All values in an interval are equally likely. Example: A random
number drawn between 0 and 1. (Rolling a fair die is the discrete counterpart.)

23. What is the Binomial Distribution?


The Binomial Distribution is a discrete probability distribution that models the number of
successes in a fixed number of independent trials, where each trial has only two possible
outcomes (success or failure) and the same probability of success.
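
The probability of exactly k successes in n trials is C(n, k) × p^k × (1 - p)^(n - k). A minimal sketch, with `binomial_pmf` as a name chosen for this example:

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 5 heads in 10 fair coin tosses.
p_five_heads = binomial_pmf(5, 10, 0.5)
print(p_five_heads)
```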

24. What is the Poisson Distribution?


The Poisson Distribution is a discrete probability distribution that models the number of
times an event occurs in a fixed interval of time or space, given that the events occur
independently and at a constant average rate.
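
With an average rate of λ events per interval, the probability of seeing exactly k events is λ^k × e^(-λ) / k!. A small sketch, where the rate of 2 arrivals per hour is an assumed value:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return lam ** k * exp(-lam) / factorial(k)

# Hypothetical store with 2 customer arrivals per hour on average.
p_no_arrivals = poisson_pmf(0, 2.0)
p_three_arrivals = poisson_pmf(3, 2.0)
print(round(p_no_arrivals, 4), round(p_three_arrivals, 4))
```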

25. What is the Normal Distribution?


The Normal Distribution, also known as the Gaussian Distribution, is a symmetric, bell-
shaped probability distribution that describes how data values are distributed around the
mean.

26. What is the Central Limit Theorem?


The Central Limit Theorem (CLT) states that, regardless of the original distribution of a
population (provided it has a finite variance), the sampling distribution of the sample mean
approaches a normal distribution as the sample size increases, provided the samples are
independent and randomly selected.
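
The theorem can be checked with a small simulation. The sketch below draws samples from an exponential distribution (strongly right-skewed, with mean 1 and standard deviation 1) and looks at the distribution of their means. The sample size, repetition count, and seed are arbitrary choices.

```python
import random
import statistics

def sample_means(n, repetitions, seed=0):
    """Means of `repetitions` samples, each of size n, from an Exponential(1) population."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(repetitions)
    ]

means = sample_means(n=50, repetitions=2000)
# The means cluster around 1 with spread close to 1 / sqrt(50), about 0.14.
print(round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```

A histogram of `means` would look roughly bell-shaped even though the underlying population is far from normal.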

27. What is hypothesis testing?


Hypothesis testing is a statistical method used to determine whether there is enough
evidence to support a certain claim about a population based on sample data.
It starts with two hypotheses:
• Null Hypothesis (H₀): Assumes no effect or no difference.
• Alternative Hypothesis (H₁): Assumes an effect or a difference exists.

We then collect data, perform a statistical test (like a t-test or chi-square test), and calculate
a p-value. If the p-value is below a chosen significance level (like 0.05), we reject the null
hypothesis, meaning the evidence supports the alternative hypothesis.

For example, if we test whether a new marketing strategy increases sales, hypothesis testing
helps us determine if the observed increase is real or just due to random chance.

28. What is a null hypothesis (H₀) and an alternative hypothesis (H₁)?


• Null Hypothesis (H₀): The null hypothesis is a statement that assumes no effect, no
difference, or no relationship between variables. It represents the default
assumption that there is no significant change or association.
• Alternative Hypothesis (H₁ or Ha): The alternative hypothesis is a statement that
contradicts the null hypothesis. It suggests that there is a significant effect,
difference, or relationship between variables.

29. What is a Type I and Type II error?


In hypothesis testing, a Type I error occurs when we reject a true null hypothesis (H₀),
meaning we detect an effect that doesn’t actually exist. It’s also called a false positive.
A Type II error happens when we fail to reject a false null hypothesis, meaning we miss
detecting a real effect. This is called a false negative.
For example, in medical testing:
A Type I error would be diagnosing a healthy person as sick.
A Type II error would be failing to diagnose a sick person.

30. What is p-value?


A p-value is the probability of obtaining results at least as extreme as the observed data,
assuming that the null hypothesis (H₀) is true. It helps determine the strength of evidence
against H₀ in hypothesis testing.
A low p-value (at or below the chosen significance level, commonly 0.05) suggests strong
evidence against H₀, leading us to reject it.
A high p-value (above the significance level) means weak evidence against H₀, so we fail to
reject it.

31. What is confidence interval?


A confidence interval (CI) is a range of values used to estimate an unknown population
parameter, like the mean or proportion, with a certain level of confidence. For example, a 95%
CI comes from a procedure that captures the true value in about 95% of repeated samples.
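
A minimal sketch of a confidence interval for a mean, using the normal approximation (mean ± z × standard error). This is reasonable for large samples; for small samples a t critical value would be the usual substitute. The data and the helper name `mean_confidence_interval` are made up for illustration.

```python
import math
import statistics
from statistics import NormalDist

def mean_confidence_interval(data, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    mean = statistics.fmean(data)
    sem = statistics.stdev(data) / math.sqrt(len(data))  # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)       # about 1.96 for 95%
    return mean - z * sem, mean + z * sem

data = list(range(1, 101))  # hypothetical measurements 1..100
low, high = mean_confidence_interval(data)
print(round(low, 2), round(high, 2))
```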

32. What is linear regression?


Linear Regression is a statistical method used to model the relationship between a
dependent variable (Y) and one or more independent variables (X) by fitting a straight line to
the data.
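
For a single independent variable, the least-squares line has a closed form: slope = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², and intercept = ȳ - slope × x̄. A small sketch with made-up points that lie exactly on y = 2x + 1:

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

slope, intercept = fit_line([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
print(slope, intercept)
```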

33. What is logistic regression?


Logistic Regression is a statistical method used for binary classification problems, where the
dependent variable (Y) has only two possible outcomes, such as yes/no, 0/1, or true/false.

34. What is multicollinearity?


Multicollinearity occurs when independent variables in a regression model are highly
correlated with each other. This makes it difficult to determine the individual effect of each
variable on the dependent variable.

35. What is overfitting?


Overfitting occurs when a machine learning model learns too much from the training data,
capturing noise and random fluctuations instead of the underlying pattern. This leads to
high accuracy on training data but poor performance on new, unseen data.

36. What is underfitting?


Underfitting occurs when a machine learning model is too simple to capture the underlying
pattern in the data. This results in poor performance on both training and test data because
the model fails to learn meaningful relationships.

37. What is sampling?


Sampling is the process of selecting a subset of data from a larger population to analyze and
make inferences about the whole population. It is used when collecting data from the entire
population is impractical or too costly.

38. What is stratified sampling?


Stratified Sampling is a type of probability sampling where the population is divided into
homogeneous groups (strata) based on a specific characteristic (e.g., age, income, education
level), and samples are randomly selected from each group.
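
One way to sketch this in code: group the records by stratum, then draw the same fraction at random from each group. The record layout, the `stratified_sample` helper, and the proportional-allocation rule are assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Randomly sample the same fraction of records from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)  # group records by stratum
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60 people under 30 and 40 people aged 30 or more.
people = [{"id": i, "group": "under_30" if i < 60 else "30_plus"} for i in range(100)]
sample = stratified_sample(people, key=lambda person: person["group"], fraction=0.1)
print(len(sample))
```

The sample keeps the 60/40 split of the population, which a simple random sample of 10 people would not guarantee.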

39. What is a Z-score?


A Z-score (or standard score) measures how many standard deviations a data point is from
the mean of a dataset. It helps standardize data and compare values from different
distributions.
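
The formula is z = (x - mean) / standard deviation. A minimal sketch with invented data, using the population standard deviation:

```python
import statistics

def z_scores(data):
    """Standardize each value: (x - mean) / population standard deviation."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return [(x - mean) / sd for x in data]

values = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, standard deviation 2
scores = z_scores(values)
print(scores)
```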

40. What is an outlier?


An outlier is a data point that is significantly different from the rest of the dataset. It lies far
away from the mean and may indicate variability in the data, errors, or rare events.

41. What is bootstrapping?


Bootstrapping is a resampling technique used to estimate the sampling distribution of a
statistic by repeatedly drawing samples with replacement from the original dataset. It is
commonly used when the sample size is small, and the true population distribution is
unknown.
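
A sketch of one common use: a percentile bootstrap confidence interval for the mean. The data, the number of resamples, and the seed are arbitrary choices, and the percentile method is only one of several bootstrap interval constructions.

```python
import random
import statistics

def bootstrap_ci_mean(data, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    # Resample with replacement, same size as the original data, and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    lower_idx = int((1 - confidence) / 2 * n_resamples)
    upper_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lower_idx], means[upper_idx]

data = [12, 15, 9, 20, 14, 11, 17, 13, 16, 10]  # hypothetical small sample
low, high = bootstrap_ci_mean(data)
print(low, high)
```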

42. What is heteroscedasticity?
Heteroscedasticity refers to a situation in regression analysis where the variance of errors
(residuals) is not constant across all levels of the independent variable(s). This violates one
of the key assumptions of linear regression, which assumes homoscedasticity (constant
variance).

43. What is homoscedasticity?


Homoscedasticity means that the variance of the residuals (errors) in a regression model is
constant across all levels of an independent variable. It is an assumption in linear regression
that ensures reliable estimations.

44. What is R-squared in regression?


R-squared (R²) is a metric that tells us how well our regression model explains the variability
in the dependent variable. For a least-squares fit with an intercept it ranges from 0 to 1, where:
• R² = 1 means the model perfectly explains the variance in the data.
• R² = 0 means the model explains none of the variance.
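
R² can be computed as 1 - (residual sum of squares / total sum of squares). A small sketch with invented values, showing the two extreme cases:

```python
def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)              # total variation
    return 1 - ss_res / ss_tot

y = [1, 2, 3, 4]
perfect = r_squared(y, [1, 2, 3, 4])             # predictions match exactly
mean_only = r_squared(y, [2.5, 2.5, 2.5, 2.5])   # always predicting the mean
print(perfect, mean_only)
```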

45. What is the difference between parametric and non-parametric tests?


Parametric tests assume that the data follows a certain distribution (e.g., normal
distribution). Examples: t-test, ANOVA.
Non-parametric tests do not assume any specific distribution. Examples: Mann-Whitney U
test, Kruskal-Wallis test.

46. What is ANOVA?


ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or
more groups to determine if there is a significant difference between them. It helps analyze
whether the variation in data is due to real differences or just random chance.

47. What is the Chi-square test?


The Chi-Square Test is a statistical test used to determine whether there is a significant
association between categorical variables in a dataset. It checks whether the observed
frequencies differ from the expected frequencies by more than chance alone would explain.

48. What is the difference between F-test and T-test?


The F-test and T-test are both statistical tests, but they are used for different purposes:
• T-test is used to compare the means of two groups and determine if they are
significantly different.
• F-test is used to compare variances between two or more groups and is often used in
ANOVA to check for overall differences among multiple groups.

50. What is a time series?
A time series is a sequence of data points collected or recorded at regular time intervals.
Unlike regular datasets, time series data is ordered chronologically, making time a crucial
factor in analysis. It is used in forecasting and trend analysis. Examples: stock prices, weather
data.

51. What is A/B testing?


A/B testing is an experimental method where two versions (A and B) of a variable (e.g., a
webpage, advertisement) are tested to determine which one performs better based on a key
metric.
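
When the key metric is a conversion rate, one common way to compare the versions is a two-proportion Z-test. The answer above does not prescribe a specific test, so this is an illustrative choice, and the visitor and conversion counts are invented.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # shared rate under H0 (no difference)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Version A: 100 conversions from 1000 visitors; version B: 130 from 1000.
z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(round(z, 2), round(p, 4))
```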

52. What is Principal Component Analysis (PCA)?


PCA is a dimensionality reduction technique that transforms correlated variables into a
smaller set of uncorrelated variables called principal components, preserving as much
variance as possible.

53. What is regularization in regression?


Regularization is a technique used in regression to prevent overfitting by adding a penalty
term to the loss function. This helps keep the model simpler and improves its ability to
generalize to new data.
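
To see the penalty at work, consider the simplest possible case: one predictor, no intercept, and a squared (ridge-style) penalty on the slope. Minimizing Σ(y - b·x)² + penalty × b² gives b = Σxy / (Σx² + penalty). The setup and the `ridge_slope` name are assumptions made for this sketch.

```python
def ridge_slope(xs, ys, penalty):
    """Slope b minimizing sum((y - b*x)^2) + penalty * b^2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)

xs, ys = [1, 2, 3], [2, 4, 6]  # made-up data lying exactly on y = 2x
unpenalized = ridge_slope(xs, ys, penalty=0)   # ordinary least squares
shrunk = ridge_slope(xs, ys, penalty=14)       # larger penalty pulls the slope toward 0
print(unpenalized, shrunk)
```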

54. What is the bias-variance tradeoff?


The bias-variance tradeoff is a fundamental concept in machine learning that describes the
balance between two types of errors:

Bias – The error due to overly simplistic models that fail to capture patterns in the data
(underfitting).
Variance – The error due to overly complex models that fit noise in the training data
(overfitting).
A good model aims to find a balance between these two.
