
DS - Sample Questions (Practical)

Uploaded by

almallugamer0420

Data Science Sample Interview Questions – Practical

🔹 1. Python for Data Science


1. Q: What is the purpose of using Pandas in Data Science?
A: Pandas is used for data manipulation and analysis. It provides data structures like
Series and DataFrame to handle structured data efficiently.
2. Q: How do you read a CSV file in Python using Pandas?
A: import pandas as pd; df = pd.read_csv('file.csv')
3. Q: What function would you use to check for null values in a DataFrame?
A: df.isnull().sum()
4. Q: How can you select a subset of columns in a DataFrame?
A: df[['column1', 'column2']]
5. Q: What is the difference between loc and iloc in Pandas?
A: loc selects rows and columns by label; iloc selects them by integer position.
6. Q: How do you handle missing data using Pandas?
A: Use df.fillna() to fill or df.dropna() to remove missing data.
7. Q: How do you group data in Pandas?
A: Use df.groupby('column').agg({'value': 'sum'}) for aggregation.
8. Q: What does the apply() function do in Pandas?
A: It applies a function along an axis of the DataFrame or Series.
9. Q: How do you merge two DataFrames?
A: pd.merge(df1, df2, on='key')
10. Q: How can you convert a column’s datatype in Pandas?
A: df['col'] = df['col'].astype(int)
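
The pandas calls from this section can be tried together on a small in-memory table. The sketch below is illustrative only: the `orders` and `customers` frames, their column names, and their values are invented for the example and do not come from the questions above.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [10.0, None, 5.0, 7.5],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Pune", "Delhi", "Pune"],
})

# Count missing values per column (Q3).
null_counts = orders.isnull().sum()

# Fill missing amounts with 0 (Q6); fillna returns a new DataFrame.
filled = orders.fillna({"amount": 0.0})

# loc selects by label, iloc by integer position (Q5).
first_by_label = filled.loc[0, "amount"]
first_by_position = filled.iloc[0, 1]

# Merge on the shared key, then total the amount per city (Q7, Q9).
merged = pd.merge(filled, customers, on="customer_id")
per_city = merged.groupby("city").agg({"amount": "sum"})

# Convert a column's datatype (Q10); here integers become strings.
merged["customer_id"] = merged["customer_id"].astype(str)
```

Here the row label 0 and the integer position 0 happen to point at the same row, so `loc` and `iloc` agree; they would differ after sorting or filtering the frame without resetting its index.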

🔹 2. Statistics & Machine Learning Concepts


11. Q: What is the Central Limit Theorem?
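A: It states that, for independent samples from a population with finite variance, the
sampling distribution of the sample mean approaches a normal distribution as the
sample size grows, regardless of the shape of the population distribution.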
12. Q: Define p-value.
A: The p-value is the probability of observing results at least as extreme as those
actually observed, assuming the null hypothesis is true.
13. Q: What is the difference between Type I and Type II error?
A: Type I is a false positive (rejecting a true null), Type II is a false negative (failing to
reject a false null).
14. Q: What does R-squared represent?
A: It represents the proportion of variance in the dependent variable explained by
the independent variables.
15. Q: What is multicollinearity?
A: It refers to high correlation between independent variables in regression, which
can distort results.
16. Q: How do you check for normal distribution in data?
A: Use histograms, Q-Q plots, or statistical tests like Shapiro-Wilk.
17. Q: What is hypothesis testing?
A: It is a statistical method to test assumptions (hypotheses) about a parameter in a
population using sample data.
18. Q: What is the difference between supervised and unsupervised learning?
A: Supervised uses labeled data (e.g., regression), unsupervised uses unlabeled data
(e.g., clustering).
19. Q: What is overfitting in machine learning?
A: When a model performs well on training data but poorly on unseen data.
20. Q: How can you prevent overfitting?
A: Use techniques like cross-validation, regularization, pruning, and reducing model
complexity.
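
The Central Limit Theorem from Q11 can be checked with a short simulation. This is a minimal sketch using only the standard library; the exponential population, the sample sizes, and the helper name `sample_means` are choices made for this example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(sample_size, n_samples=2000):
    """Draw n_samples samples from a skewed (exponential) population
    and return the mean of each sample."""
    return [
        statistics.fmean([random.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]

small = sample_means(sample_size=2)
large = sample_means(sample_size=100)

# Exp(1) has mean 1 and standard deviation 1, so the spread of the
# sample mean should be close to 1 / sqrt(sample_size).
spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)
center_large = statistics.fmean(large)

# For a roughly normal distribution, about 68% of values fall
# within one standard deviation of the mean.
within_one_sd = sum(
    abs(m - center_large) <= spread_large for m in large
) / len(large)
```

With a sample size of 100 the means cluster tightly around 1 and `within_one_sd` lands near the 68% expected of a normal distribution, even though the underlying population is strongly skewed.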

🔹 3. Data Wrangling & Visualization


21. Q: What is data wrangling?
A: It’s the process of cleaning and transforming raw data into a usable format.
22. Q: Name a few Python libraries for data visualization.
A: Matplotlib, Seaborn, Plotly.
23. Q: How do you plot a histogram using Seaborn?
A: sns.histplot(data=df, x='column')
24. Q: What is the difference between a bar plot and a histogram?
A: Bar plots show categorical data; histograms show frequency of numerical data
bins.
25. Q: How to identify outliers using box plots?
A: Outliers are shown as individual points beyond the whiskers, which by convention
extend up to 1.5 times the interquartile range (IQR) from the box.
26. Q: How can you handle categorical variables?
A: Through one-hot encoding or label encoding.
27. Q: How do you deal with duplicates in a dataset?
A: Use df = df.drop_duplicates(); the method returns a new DataFrame rather than
modifying df in place.
28. Q: What does the describe() function do in Pandas?
A: Provides summary statistics (count, mean, standard deviation, min, quartiles, max)
for numerical columns by default.
29. Q: How to rename a column in Pandas?
A: df = df.rename(columns={'old': 'new'}) (rename returns a new DataFrame, so assign
the result).
30. Q: What is feature scaling and why is it important?
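A: It brings features to a similar scale so that features with large ranges do not
dominate distance-based or gradient-based models; common methods are min-max
scaling and standardization (e.g., scikit-learn's MinMaxScaler and StandardScaler).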
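
The two scaling methods named in Q30 reduce to short formulas. The sketch below writes them out in plain Python to show the arithmetic; the `values` list and both function names are made up for the example, and in practice a library scaler would normally be used instead.

```python
import statistics

values = [2.0, 4.0, 6.0, 8.0]  # hypothetical feature column

def min_max_scale(xs):
    """Rescale values linearly to the range [0, 1].
    Assumes the values are not all identical."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Shift to mean 0 and scale to unit (population) standard deviation.
    Assumes the values are not all identical."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return [(x - mean) / sd for x in xs]

scaled = min_max_scale(values)
z_scores = standardize(values)
```

Min-max scaling fixes the output range, while standardization fixes the mean and spread; which one suits a model depends on whether bounded inputs or comparable variances matter more.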

🔹 4. SQL & Databases


31. Q: What is a primary key?
A: A column (or combination of columns) that uniquely identifies each row in a table
and cannot contain NULL values.
32. Q: How do you select all columns from a table?
A: SELECT * FROM table_name;
33. Q: How do you retrieve unique values from a column?
A: SELECT DISTINCT column_name FROM table_name;
34. Q: How to filter data using SQL?
A: Use the WHERE clause. Example: SELECT * FROM table_name WHERE age > 25;
35. Q: What is the difference between INNER JOIN and LEFT JOIN?
A: INNER JOIN returns only rows that match in both tables; LEFT JOIN returns all rows
from the left table plus the matching rows from the right, with NULLs where there is no
match.
36. Q: How do you find the average in SQL?
A: SELECT AVG(column_name) FROM table_name;
37. Q: How can you sort results in SQL?
A: Use ORDER BY clause.
38. Q: What does GROUP BY do?
A: It groups rows that have the same values in specified columns.
39. Q: How do you limit results in SQL?
A: Use LIMIT n (MySQL/PostgreSQL) or TOP n (SQL Server).
40. Q: How to use subqueries in SQL?
A: By nesting a SELECT statement inside another. Example: SELECT name FROM
employees WHERE id IN (SELECT emp_id FROM sales);
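
The SQL statements in this section can be run without a database server through Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the `employees` and `sales` rows are invented, and the syntax follows SQLite's dialect, which may differ in detail from MySQL, PostgreSQL, or SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    CREATE TABLE sales (emp_id INTEGER, amount REAL);
    INSERT INTO employees VALUES (1, 'Asha', 31), (2, 'Ben', 24), (3, 'Chen', 45);
    INSERT INTO sales VALUES (1, 100.0), (1, 50.0), (3, 70.0);
""")

# WHERE filter combined with ORDER BY (Q34, Q37).
older = conn.execute(
    "SELECT name FROM employees WHERE age > 25 ORDER BY age DESC"
).fetchall()

# LEFT JOIN keeps employees with no sales; their total is NULL (None) (Q35, Q38).
totals = conn.execute("""
    SELECT e.name, SUM(s.amount)
    FROM employees AS e
    LEFT JOIN sales AS s ON s.emp_id = e.id
    GROUP BY e.id
    ORDER BY e.id
""").fetchall()

# Subquery: employees that appear in the sales table (Q40).
sellers = conn.execute(
    "SELECT name FROM employees WHERE id IN (SELECT emp_id FROM sales) ORDER BY id"
).fetchall()

conn.close()
```

Replacing `LEFT JOIN` with `INNER JOIN` in the second query would drop the employee who has no sales rows, which is the difference Q35 describes.
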

🔹 5. Machine Learning Algorithms


41. Q: What is linear regression?
A: A method to model the relationship between a dependent variable and one or
more independent variables using a linear function (a straight line when there is a
single independent variable).
42. Q: What is logistic regression used for?
A: For classification problems, most commonly binary; it models the probability that an
observation belongs to the positive class.
43. Q: Name three distance-based algorithms.
A: KNN, K-means, Hierarchical clustering.
44. Q: How does KNN work?
A: It classifies data points based on the majority class among their ‘k’ nearest
neighbors.
45. Q: What is the cost function of logistic regression?
A: Log-loss or binary cross-entropy.
46. Q: What is regularization?
A: A technique to penalize large coefficients in regression to avoid overfitting (L1 and
L2).
47. Q: What is a decision tree?
A: A tree-based model that splits data into branches to make decisions.
48. Q: What are Random Forests?
A: An ensemble of decision trees used for classification and regression.
49. Q: What is Gradient Boosting?
A: A method that builds models sequentially to correct the previous model’s errors.
50. Q: What is cross-validation?
A: A method to evaluate model performance by dividing data into training and
validation sets multiple times.
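
The KNN procedure from Q44 fits in a few lines. The sketch below is a plain-Python illustration with Euclidean distance and an unweighted majority vote; the function name `knn_predict` and the toy training points are assumptions made for the example.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points.
    train is a list of (features, label) pairs; query is a feature tuple."""
    # Sort training points by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Majority vote among the labels of the k closest points.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-cluster training set.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

near_a = knn_predict(train, (1.1, 1.0))
near_b = knn_predict(train, (5.1, 5.0))
```

Because the vote depends on raw distances, features on very different scales should usually be scaled before KNN is applied.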

🔹 6. Deep Learning & NLP


51. Q: What is deep learning?
A: A subset of machine learning that uses neural networks with multiple layers to
learn from data.
52. Q: What is a neural network?
A: A model inspired by the human brain, consisting of layers of nodes (neurons) to
learn complex patterns.
53. Q: What is the activation function?
A: A function that introduces non-linearity to the model. Examples: ReLU, Sigmoid,
Tanh.
54. Q: What is the use of an optimizer in neural networks?
A: It updates the weights of the network to minimize loss. Common optimizers: SGD,
Adam.
55. Q: What is backpropagation?
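A: An algorithm that computes the gradient of the loss with respect to each weight by
applying the chain rule backward through the layers; the optimizer then uses these
gradients to update the weights.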
56. Q: What is a convolutional neural network (CNN) used for?
A: For image processing tasks like classification and object detection.
57. Q: What is a recurrent neural network (RNN) used for?
A: For sequence data like text or time series.
58. Q: What is the vanishing gradient problem?
A: When gradients become very small during training, making it hard for the model
to learn.
59. Q: How can you mitigate vanishing gradients?
A: Use ReLU activation or architectures like LSTM/GRU.
60. Q: What is the difference between LSTM and GRU?
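A: Both are gated RNN variants. GRUs use two gates and no separate cell state, so
they are simpler and usually faster to train; LSTMs use three gates plus a cell state,
which can help with long-term dependencies on some tasks.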
61. Q: What is word embedding?
A: A technique to represent text in vector space, e.g., Word2Vec, GloVe.
62. Q: What does TF-IDF stand for?
A: Term Frequency-Inverse Document Frequency; it measures word importance in
documents.
63. Q: What is tokenization in NLP?
A: The process of splitting text into words or sentences (tokens).
64. Q: What is lemmatization?
A: Reducing words to their base or dictionary form.
65. Q: What is the Bag of Words model?
A: A representation of text that counts word frequency while ignoring grammar and
word order.
66. Q: What is a stop word?
A: Common words (like 'the', 'is') removed during preprocessing to reduce noise.
67. Q: What is named entity recognition (NER)?
A: The task of identifying entities like people, organizations, and locations in text.
68. Q: What is a language model?
A: A model that learns to predict the probability of a sequence of words.
69. Q: How is NLP used in chatbots?
A: To process and understand user input using techniques like intent recognition and
response generation.
70. Q: What is sentiment analysis?
A: A technique to determine the sentiment (positive/negative/neutral) of text data.
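
Tokenization, Bag of Words, and TF-IDF (Q62, Q63, Q65) build on one another, as the sketch below shows on a toy corpus. The three documents are invented, the tokenizer is a naive whitespace split, and the TF-IDF formula is the simple unsmoothed variant; libraries typically apply smoothing and normalization on top of it.

```python
from collections import Counter
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]  # hypothetical toy corpus

# Tokenization: naive lowercase whitespace split.
tokenized = [doc.lower().split() for doc in docs]

# Bag of Words: per-document word counts, word order discarded.
bags = [Counter(tokens) for tokens in tokenized]

def tf_idf(term, doc_index):
    """Term frequency times log inverse document frequency (unsmoothed).
    Assumes the term occurs in at least one document."""
    tf = bags[doc_index][term] / len(tokenized[doc_index])
    docs_with_term = sum(1 for bag in bags if term in bag)
    return tf * math.log(len(docs) / docs_with_term)

score_the = tf_idf("the", 0)
score_cat = tf_idf("cat", 0)
```

Although "the" occurs twice in the first document and "cat" only once, "cat" scores higher because it appears in fewer documents, which is the sense in which TF-IDF measures word importance.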

🔹 7. Projects & Business Case Studies


71. Q: Why are real-world projects important in data science?
A: They demonstrate the application of concepts to solve practical business
problems.
72. Q: What’s a good approach to solving a data science case study?
A: Understand the problem, explore the data, preprocess, model, and interpret
results.
73. Q: What kind of data is used in a customer churn project?
A: Customer demographics, transaction history, usage behavior, and service details.
74. Q: How would you approach a fraud detection problem?
A: Use supervised models on labeled data or anomaly detection on unlabeled data.
75. Q: In a sales forecasting project, what algorithms might you use?
A: Time series models like ARIMA, Prophet, or LSTM.
76. Q: What is EDA and why is it important?
A: Exploratory Data Analysis; it helps understand data distribution, patterns, and
anomalies.
77. Q: How can you present your data science project?
A: Use dashboards, reports, and presentations with visuals and key metrics.
78. Q: What is A/B testing used for?
A: To compare two versions of a product or model to determine which performs
better.
79. Q: How do you measure model performance for classification tasks?
A: Metrics like accuracy, precision, recall, F1-score, ROC-AUC.
80. Q: What is a confusion matrix?
A: A table that shows true vs predicted classifications to evaluate model
performance.
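
The confusion matrix from Q80 is a tally of (actual, predicted) pairs. The following is a minimal sketch for the binary case; the `y_true` and `y_pred` lists are made-up labels used only to show the counting.

```python
from collections import Counter

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical predicted labels

# Count each (actual, predicted) pair.
counts = Counter(zip(y_true, y_pred))
tp = counts[(1, 1)]  # true positives
fn = counts[(1, 0)]  # false negatives
fp = counts[(0, 1)]  # false positives
tn = counts[(0, 0)]  # true negatives

# Accuracy is the share of predictions on the matrix diagonal.
accuracy = (tp + tn) / len(y_true)
```

The four counts are the inputs to the classification metrics listed in Q79.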

🔹 8. Tools & Deployment


81. Q: What is Jupyter Notebook used for?
A: An interactive environment for writing and running code, especially in data
science.
82. Q: What is Anaconda?
A: A distribution of Python for scientific computing, including tools like Jupyter and
libraries like Pandas.
83. Q: What is Git used for?
A: Version control – tracking changes in code and collaboration.
84. Q: What is the purpose of Docker in Data Science?
A: To package projects with dependencies into containers for easy deployment.
85. Q: How does Flask help in ML deployment?
A: It’s a micro web framework for building REST APIs to serve machine learning
models.
86. Q: What is Streamlit?
A: A Python library to build interactive web apps for data science projects easily.
87. Q: What is the difference between REST API and a web app?
A: REST APIs allow data communication; web apps provide user interfaces.
88. Q: What is MLOps?
A: A practice to streamline and automate the lifecycle of ML models, including
development, deployment, and monitoring.
89. Q: What are pipelines in ML?
A: A way to automate a sequence of data processing and modeling steps.
90. Q: What is a model registry?
A: A central place to track, version, and manage machine learning models.
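
The pipeline idea from Q89 amounts to running a fixed sequence of steps in order. The sketch below is plain Python, not the API of any particular library; `chain_steps`, `drop_missing`, and `scale_to_unit` are hypothetical names introduced for this example.

```python
def chain_steps(*steps):
    """Return a function that feeds the output of each step into the next."""
    def run(data):
        for step in steps:
            data = step(data)
        return data
    return run

# Hypothetical processing steps, named only for this example.
def drop_missing(rows):
    """Remove None entries."""
    return [r for r in rows if r is not None]

def scale_to_unit(rows):
    """Divide every value by the largest one (assumes positive values)."""
    top = max(rows)
    return [r / top for r in rows]

pipeline = chain_steps(drop_missing, scale_to_unit)
result = pipeline([2.0, None, 4.0, 8.0])
```

Bundling the steps into one callable means the same sequence is applied during training and at prediction time.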

🔹 9. Evaluation & Best Practices


91. Q: How do you handle class imbalance?
A: Use techniques like SMOTE, class weighting, or oversampling/undersampling.
92. Q: What is precision and recall?
A: Precision = TP / (TP + FP), Recall = TP / (TP + FN)
93. Q: What is F1-score?
A: Harmonic mean of precision and recall; useful in imbalanced datasets.
94. Q: What is ROC-AUC?
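A: The area under the ROC curve, which plots true positive rate against false positive
rate across all classification thresholds; 0.5 corresponds to random guessing and 1.0 to
perfect separation.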
95. Q: What’s the difference between bagging and boosting?
A: Bagging reduces variance (Random Forest); boosting reduces bias (XGBoost).
96. Q: What are hyperparameters?
A: Settings chosen before training rather than learned from the data (e.g., learning
rate, tree depth).
97. Q: How do you tune hyperparameters?
A: Using Grid Search or Random Search with cross-validation.
98. Q: What is feature engineering?
A: Creating new input features from existing ones to improve model performance.
99. Q: What is dimensionality reduction?
A: Reducing the number of features using techniques like PCA, or t-SNE when the goal
is mainly visualization.
100. Q: What are some best practices in data science projects?
A: Clear problem definition, clean data, reproducible code, proper documentation,
and model monitoring.
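
The formulas in Q92 and Q93 can be worked through numerically. This is a small sketch in plain Python; the counts passed in are invented, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts.
    Returns 0.0 for any metric whose denominator is zero (a convention
    chosen for this sketch)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

With these counts precision is 0.8 and recall is 0.5; the harmonic mean comes out below their arithmetic mean of 0.65, so F1 penalizes the gap between the two.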
