Data Science Sample Interview Questions – Practical
🔹 1. Python for Data Science
1. Q: What is the purpose of using Pandas in Data Science?
A: Pandas is used for data manipulation and analysis. It provides data structures like
Series and DataFrame to handle structured data efficiently.
2. Q: How do you read a CSV file in Python using Pandas?
A: import pandas as pd; df = pd.read_csv('file.csv')
3. Q: What function would you use to check for null values in a DataFrame?
A: df.isnull().sum()
4. Q: How can you select a subset of columns in a DataFrame?
A: df[['column1', 'column2']]
5. Q: What is the difference between loc and iloc in Pandas?
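A: loc selects by index and column labels; iloc selects by integer position. Slices with
loc include the end label, while slices with iloc exclude the end position.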
6. Q: How do you handle missing data using Pandas?
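A: Use df.fillna(value) to fill missing values or df.dropna() to drop the rows (or
columns) that contain them. Both return a new DataFrame by default, so assign the result.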
7. Q: How do you group data in Pandas?
A: Use df.groupby('column').agg({'value': 'sum'}) for aggregation.
8. Q: What does the apply() function do in Pandas?
A: It applies a function along an axis of the DataFrame or Series.
9. Q: How do you merge two DataFrames?
A: pd.merge(df1, df2, on='key')
10. Q: How can you convert a column’s datatype in Pandas?
A: df['col'] = df['col'].astype(int) (this raises an error if the column contains missing
values, so fill or drop them first)
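As a rough illustration of answers 3, 5, 6, 7, 9, and 10 above, the sketch below runs those calls on two small DataFrames. The data, column names, and index labels are invented for the example and are not part of the original questions.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south"],
        "value": [10.0, None, 30.0, 40.0],
    },
    index=["a", "b", "c", "d"],
)
managers = pd.DataFrame({"region": ["north", "south"], "manager": ["Ana", "Raj"]})

null_counts = sales.isnull().sum()       # missing values per column
by_label = sales.loc["c", "value"]       # loc: row labelled "c"
by_position = sales.iloc[2, 1]           # iloc: third row, second column

filled = sales.fillna({"value": 0.0})    # returns a new DataFrame
totals = filled.groupby("region").agg({"value": "sum"})
merged = pd.merge(filled, managers, on="region")

# The cast is safe here only because the missing value was filled first.
filled["value"] = filled["value"].astype(int)
```

Here loc and iloc reach the same cell by two different routes, which is usually the quickest way to explain the difference in an interview.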
🔹 2. Statistics & Machine Learning Concepts
11. Q: What is the Central Limit Theorem?
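A: It states that, for independent observations drawn from a population with finite
variance, the sampling distribution of the sample mean approaches a normal distribution
as the sample size grows, regardless of the shape of the population distribution.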
12. Q: Define p-value.
A: The p-value is the probability of observing results at least as extreme as those
actually observed, assuming the null hypothesis is true. It is not the probability that
the null hypothesis is true.
13. Q: What is the difference between Type I and Type II error?
A: Type I is a false positive (rejecting a true null), Type II is a false negative (failing to
reject a false null).
14. Q: What does R-squared represent?
A: It represents the proportion of variance in the dependent variable explained by
the independent variables.
15. Q: What is multicollinearity?
A: It refers to high correlation between independent variables in regression, which
can distort results.
16. Q: How do you check for normal distribution in data?
A: Use histograms, Q-Q plots, or statistical tests like Shapiro-Wilk.
17. Q: What is hypothesis testing?
A: It is a statistical method to test assumptions (hypotheses) about a parameter in a
population using sample data.
18. Q: What is the difference between supervised and unsupervised learning?
A: Supervised uses labeled data (e.g., regression), unsupervised uses unlabeled data
(e.g., clustering).
19. Q: What is overfitting in machine learning?
A: When a model performs well on training data but poorly on unseen data.
20. Q: How can you prevent overfitting?
A: Use techniques like cross-validation, regularization, pruning, and reducing model
complexity.
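The Central Limit Theorem in answer 11 can be checked with a small simulation. The sketch below is only an illustration under assumed settings: an exponential population with rate 1 (so mean 1 and standard deviation 1), samples of size 50, and 2000 repetitions. None of these numbers come from the original text.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_mean(n):
    # Mean of n draws from a strongly skewed exponential population
    # (rate 1, so population mean 1 and standard deviation 1).
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))


n = 50
means = [sample_mean(n) for _ in range(2000)]

center = statistics.fmean(means)   # should be close to the population mean, 1
spread = statistics.stdev(means)   # should be close to sigma / sqrt(n)
expected_spread = 1.0 / n ** 0.5
```

Even though the population is far from normal, the sample means cluster around 1 with a spread near 1/sqrt(50); a histogram of means would look roughly bell-shaped.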
🔹 3. Data Wrangling & Visualization
21. Q: What is data wrangling?
A: It’s the process of cleaning and transforming raw data into a usable format.
22. Q: Name a few Python libraries for data visualization.
A: Matplotlib, Seaborn, Plotly.
23. Q: How do you plot a histogram using Seaborn?
A: sns.histplot(data=df, x='column')
24. Q: What is the difference between a bar plot and a histogram?
A: Bar plots show categorical data; histograms show frequency of numerical data
bins.
25. Q: How to identify outliers using box plots?
A: Outliers are shown as individual points beyond the whiskers, which by common
convention extend up to 1.5 times the interquartile range (IQR) from the quartiles.
26. Q: How can you handle categorical variables?
A: Through one-hot encoding or label encoding.
27. Q: How do you deal with duplicates in a dataset?
A: Use df = df.drop_duplicates(), which returns a new DataFrame without the duplicate rows.
28. Q: What does the describe() function do in Pandas?
A: It returns summary statistics (count, mean, standard deviation, minimum, quartiles,
maximum), for numerical columns by default.
29. Q: How to rename a column in Pandas?
A: df = df.rename(columns={'old': 'new'})
30. Q: What is feature scaling and why is it important?
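A: It brings features to a comparable scale so that no feature dominates because of its
units; this matters most for distance-based and gradient-based models. Common methods are
min-max scaling (MinMaxScaler) and standardization (StandardScaler).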
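Answers 26, 27, 29, and 30 above can be sketched together on a small DataFrame. The data and column names below are hypothetical, and the two scaling formulas are written out by hand rather than taken from a library so that the arithmetic is visible.

```python
import pandas as pd

# Hypothetical raw data containing one exact duplicate row.
raw = pd.DataFrame(
    {
        "old": [1.0, 2.0, 2.0, 3.0],
        "color": ["red", "blue", "blue", "red"],
    }
)

# Drop the duplicate row, then rename the numeric column.
clean = raw.drop_duplicates().rename(columns={"old": "new"})

# One-hot encode the categorical column.
encoded = pd.get_dummies(clean, columns=["color"])

col = clean["new"]
# Min-max scaling: maps the column onto the range [0, 1].
min_max = (col - col.min()) / (col.max() - col.min())
# Standardization: zero mean, unit variance (population std, ddof=0).
standardized = (col - col.mean()) / col.std(ddof=0)
```

In a real project the scaling parameters would be learned on the training split only and then reused on validation and test data.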
🔹 4. SQL & Databases
31. Q: What is a primary key?
A: A column, or combination of columns, that uniquely identifies each row in a table; its
values must be unique and cannot be NULL.
32. Q: How do you select all columns from a table?
A: SELECT * FROM table_name;
33. Q: How do you retrieve unique values from a column?
A: SELECT DISTINCT column_name FROM table_name;
34. Q: How to filter data using SQL?
A: Use the WHERE clause. Example: SELECT * FROM table_name WHERE age > 25;
35. Q: What is the difference between INNER JOIN and LEFT JOIN?
A: INNER JOIN returns only matching records; LEFT JOIN returns all records from the
left table and matching from the right.
36. Q: How do you find the average in SQL?
A: SELECT AVG(column_name) FROM table_name;
37. Q: How can you sort results in SQL?
A: Use the ORDER BY clause. Example: SELECT * FROM table_name ORDER BY column_name DESC;
38. Q: What does GROUP BY do?
A: It groups rows that share the same values in the specified columns so that aggregate
functions such as COUNT, SUM, or AVG can be computed per group.
39. Q: How do you limit results in SQL?
A: Use LIMIT n (MySQL/PostgreSQL) or TOP n (SQL Server).
40. Q: How to use subqueries in SQL?
A: By nesting a SELECT statement inside another. Example: SELECT name FROM
employees WHERE id IN (SELECT emp_id FROM sales);
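The queries in this section can be tried without a database server by using Python's built-in sqlite3 module. The sketch below assumes SQLite as the engine and invents two tiny tables (employees and sales, matching the names in answer 40); the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript(
    """
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    CREATE TABLE sales (emp_id INTEGER, amount REAL);
    INSERT INTO employees VALUES (1, 'Ana', 31), (2, 'Raj', 24), (3, 'Li', 45);
    INSERT INTO sales VALUES (1, 100.0), (1, 50.0), (3, 70.0);
    """
)


def rows(sql):
    """Run a query and return all result rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# WHERE filter with ORDER BY
older = rows("SELECT name FROM employees WHERE age > 25 ORDER BY name;")
# Aggregate function
avg_age = rows("SELECT AVG(age) FROM employees;")[0][0]
# Subquery: employees that appear in sales
sellers = rows(
    "SELECT name FROM employees "
    "WHERE id IN (SELECT emp_id FROM sales) ORDER BY name;"
)
# LEFT JOIN keeps employees without sales; their total is NULL (None)
totals = rows(
    "SELECT e.name, SUM(s.amount) FROM employees e "
    "LEFT JOIN sales s ON s.emp_id = e.id "
    "GROUP BY e.name ORDER BY e.name;"
)
# INNER JOIN keeps only matches; DISTINCT and LIMIT trim the result
matched = rows(
    "SELECT DISTINCT e.name FROM employees e "
    "INNER JOIN sales s ON s.emp_id = e.id ORDER BY e.name LIMIT 2;"
)
```

Comparing totals with matched shows the join difference from answer 35: Raj has no sales, so he appears only in the LEFT JOIN result.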
🔹 5. Machine Learning Algorithms
41. Q: What is linear regression?
A: A method to model the relationship between a dependent variable and one or
more independent variables using a straight line.
42. Q: What is logistic regression used for?
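A: For binary classification problems; it models the probability of the positive class,
and multinomial variants extend it to more than two classes.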
43. Q: Name three distance-based algorithms.
A: KNN, K-means, Hierarchical clustering.
44. Q: How does KNN work?
A: It classifies data points based on the majority class among their ‘k’ nearest
neighbors.
45. Q: What is the cost function of logistic regression?
A: Log-loss or binary cross-entropy.
46. Q: What is regularization?
A: A technique that adds a penalty on large coefficients to the loss function to reduce
overfitting. Common forms are L1 (lasso) and L2 (ridge).
47. Q: What is a decision tree?
A: A tree-based model that splits data into branches to make decisions.
48. Q: What are Random Forests?
A: An ensemble of decision trees used for classification and regression.
49. Q: What is Gradient Boosting?
A: A method that builds models sequentially to correct the previous model’s errors.
50. Q: What is cross-validation?
A: A method to evaluate model performance by dividing data into training and
validation sets multiple times.
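The KNN description in answer 44 is short enough to write out directly. The following is a minimal sketch using Euclidean distance and a plain majority vote; the function name knn_predict and the 2-D training points are hypothetical, and a production implementation would normally come from a library.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours.

    train: list of (features, label) pairs; query: a feature tuple.
    """
    # Sort training points by Euclidean distance to the query, keep k.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Count the labels of those neighbours and return the most common one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: two well-separated clusters.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

near_a = knn_predict(train, (1.1, 1.0))
near_b = knn_predict(train, (5.0, 5.1))
```

Because the prediction depends entirely on distances, features on very different scales should be scaled first, which ties back to feature scaling.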
🔹 6. Deep Learning & NLP
51. Q: What is deep learning?
A: A subset of machine learning that uses neural networks with multiple layers to
learn from data.
52. Q: What is a neural network?
A: A model inspired by the human brain, consisting of layers of nodes (neurons) to
learn complex patterns.
53. Q: What is the activation function?
A: A function that introduces non-linearity to the model. Examples: ReLU, Sigmoid,
Tanh.
54. Q: What is the use of an optimizer in neural networks?
A: It updates the weights of the network to minimize loss. Common optimizers: SGD,
Adam.
55. Q: What is backpropagation?
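A: An algorithm that computes the gradient of the loss with respect to every weight by
applying the chain rule backward through the layers; the optimizer then uses these
gradients to update the weights.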
56. Q: What is a convolutional neural network (CNN) used for?
A: For image processing tasks like classification and object detection.
57. Q: What is a recurrent neural network (RNN) used for?
A: For sequence data like text or time series.
58. Q: What is the vanishing gradient problem?
A: When gradients become very small during training, making it hard for the model
to learn.
59. Q: How can you mitigate vanishing gradients?
A: Use ReLU activation or architectures like LSTM/GRU.
60. Q: What is the difference between LSTM and GRU?
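A: Both are gated RNN variants. A GRU has two gates and no separate cell state, so it has
fewer parameters and usually trains faster; an LSTM has three gates plus a cell state,
which can help on tasks with very long-term dependencies.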
61. Q: What is word embedding?
A: A technique to represent text in vector space, e.g., Word2Vec, GloVe.
62. Q: What does TF-IDF stand for?
A: Term Frequency-Inverse Document Frequency; it measures word importance in
documents.
63. Q: What is tokenization in NLP?
A: The process of splitting text into words or sentences (tokens).
64. Q: What is lemmatization?
A: Reducing words to their base or dictionary form.
65. Q: What is the Bag of Words model?
A: A representation of text that counts word frequency while ignoring grammar and
word order.
66. Q: What is a stop word?
A: Common words (like 'the', 'is') removed during preprocessing to reduce noise.
67. Q: What is named entity recognition (NER)?
A: The task of identifying entities like people, organizations, and locations in text.
68. Q: What is a language model?
A: A model that learns to predict the probability of a sequence of words.
69. Q: How is NLP used in chatbots?
A: To process and understand user input using techniques like intent recognition and
response generation.
70. Q: What is sentiment analysis?
A: A technique to determine the sentiment (positive/negative/neutral) of text data.
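Tokenization, stop-word removal, Bag of Words, and TF-IDF (answers 62, 63, 65, and 66) fit into a few lines of plain Python. This is a sketch under stated assumptions: the three documents and the tiny stop list are invented, and the TF-IDF variant used here (term count divided by document length, times log(N / document frequency)) is only one of several in common use.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "is", "a"}  # tiny hypothetical stop list


def tokenize(text):
    """Lowercase, split into word tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]


docs = ["The cat is on the mat", "The dog is on the log", "A cat chased the dog"]
tokens = [tokenize(d) for d in docs]
bags = [Counter(t) for t in tokens]  # Bag of Words: counts, order ignored


def tf_idf(term, doc_index):
    """TF-IDF with tf = count / doc length and idf = log(N / df)."""
    tf = bags[doc_index][term] / len(tokens[doc_index])
    df = sum(1 for bag in bags if term in bag)  # documents containing term
    return tf * math.log(len(docs) / df) if df else 0.0


rare = tf_idf("mat", 0)    # appears in one document
common = tf_idf("cat", 0)  # appears in two documents
```

The word that occurs in fewer documents gets the higher score, which is the sense in which TF-IDF measures word importance.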
🔹 7. Projects & Business Case Studies
71. Q: Why are real-world projects important in data science?
A: They demonstrate the application of concepts to solve practical business
problems.
72. Q: What’s a good approach to solving a data science case study?
A: Understand the problem, explore the data, preprocess, model, and interpret
results.
73. Q: What kind of data is used in a customer churn project?
A: Customer demographics, transaction history, usage behavior, and service details.
74. Q: How would you approach a fraud detection problem?
A: Use supervised models on labeled data or anomaly detection on unlabeled data.
75. Q: In a sales forecasting project, what algorithms might you use?
A: Time series models like ARIMA, Prophet, or LSTM.
76. Q: What is EDA and why is it important?
A: Exploratory Data Analysis; it helps understand data distribution, patterns, and
anomalies.
77. Q: How can you present your data science project?
A: Use dashboards, reports, and presentations with visuals and key metrics.
78. Q: What is A/B testing used for?
A: To compare two versions of a product or model to determine which performs
better.
79. Q: How do you measure model performance for classification tasks?
A: Metrics like accuracy, precision, recall, F1-score, ROC-AUC.
80. Q: What is a confusion matrix?
A: A table that shows true vs predicted classifications to evaluate model
performance.
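The confusion matrix from answer 80 can be computed by hand for a binary problem. The sketch below uses made-up label lists and a hypothetical helper named confusion_matrix; it simply counts the four outcome types.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts


# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
accuracy = (cm["tp"] + cm["tn"]) / len(y_true)
```

For these labels the four cells are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, giving an accuracy of 0.75.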
🔹 8. Tools & Deployment
81. Q: What is Jupyter Notebook used for?
A: An interactive environment for writing and running code, especially in data
science.
82. Q: What is Anaconda?
A: A distribution of Python for scientific computing, including tools like Jupyter and
libraries like Pandas.
83. Q: What is Git used for?
A: Version control – tracking changes in code and collaboration.
84. Q: What is the purpose of Docker in Data Science?
A: To package projects with dependencies into containers for easy deployment.
85. Q: How does Flask help in ML deployment?
A: It’s a micro web framework for building REST APIs to serve machine learning
models.
86. Q: What is Streamlit?
A: A Python library to build interactive web apps for data science projects easily.
87. Q: What is the difference between REST API and a web app?
A: REST APIs allow data communication; web apps provide user interfaces.
88. Q: What is MLOps?
A: A practice to streamline and automate the lifecycle of ML models, including
development, deployment, and monitoring.
89. Q: What are pipelines in ML?
A: A way to automate a sequence of data processing and modeling steps.
90. Q: What is a model registry?
A: A central place to track, version, and manage machine learning models.
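To make the pipeline idea from answer 89 concrete, here is a hand-rolled sketch that chains one preprocessing step and one model behind a single fit/predict interface. All three classes are hypothetical and written only for this example; they are not the API of any particular library.

```python
class MinMaxScaler1D:
    """Hypothetical single-feature scaler, written for illustration."""

    def fit(self, xs):
        self.lo, self.hi = min(xs), max(xs)
        return self

    def transform(self, xs):
        span = (self.hi - self.lo) or 1.0  # avoid division by zero
        return [(x - self.lo) / span for x in xs]


class ThresholdClassifier:
    """Hypothetical model: predicts 1 above the midpoint of the class means."""

    def fit(self, xs, ys):
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == 0]
        self.cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, xs):
        return [int(x > self.cut) for x in xs]


class Pipeline:
    """Runs each transformer in order, then the model."""

    def __init__(self, transformers, model):
        self.transformers, self.model = transformers, model

    def fit(self, xs, ys):
        for t in self.transformers:
            xs = t.fit(xs).transform(xs)  # fit on training data only
        self.model.fit(xs, ys)
        return self

    def predict(self, xs):
        for t in self.transformers:
            xs = t.transform(xs)  # reuse the parameters learned in fit
        return self.model.predict(xs)


pipe = Pipeline([MinMaxScaler1D()], ThresholdClassifier())
pipe.fit([10, 20, 30, 80, 90, 100], [0, 0, 0, 1, 1, 1])
predictions = pipe.predict([15, 95])
```

The point of the pattern is that new data passes through exactly the same preprocessing that was fitted on the training data, which avoids a common source of leakage and deployment bugs.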
🔹 9. Evaluation & Best Practices
91. Q: How do you handle class imbalance?
A: Use techniques like SMOTE, class weighting, or oversampling/undersampling.
92. Q: What is precision and recall?
A: Precision = TP / (TP + FP), Recall = TP / (TP + FN)
93. Q: What is F1-score?
A: Harmonic mean of precision and recall; useful in imbalanced datasets.
94. Q: What is ROC-AUC?
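A: The area under the ROC curve, which plots the true positive rate against the false
positive rate across all classification thresholds. A value of 0.5 corresponds to random
guessing and 1.0 to perfect separation.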
95. Q: What’s the difference between bagging and boosting?
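A: Bagging trains models independently on bootstrap samples and averages them, which
mainly reduces variance (e.g., Random Forest); boosting trains models sequentially, each
focusing on the previous errors, which mainly reduces bias (e.g., XGBoost).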
96. Q: What are hyperparameters?
A: Settings chosen before training rather than learned from the data (e.g., learning
rate, tree depth).
97. Q: How do you tune hyperparameters?
A: Using Grid Search or Random Search with cross-validation.
98. Q: What is feature engineering?
A: Creating new input features from existing ones to improve model performance.
99. Q: What is dimensionality reduction?
A: Reducing the number of features using techniques like PCA or t-SNE.
100. Q: What are some best practices in data science projects?
A: Clear problem definition, clean data, reproducible code, proper documentation,
and model monitoring.
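The formulas in answers 92 and 93 can be applied to a small imbalanced example to show why accuracy alone can mislead. The counts below are invented for illustration (1000 samples, 10 of them positive), and the helper name precision_recall_f1 is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical imbalanced case: 1000 samples, only 10 positives.
tp, fp, fn, tn = 2, 1, 8, 989

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

Accuracy comes out above 0.99 even though the model finds only 2 of the 10 positives (recall 0.2), which is why F1 or recall is usually reported for imbalanced data.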