✅ 1) What is the Pandas library in Python?
Answer:
Pandas is a powerful, open-source Python library primarily used for data manipulation and analysis.
It provides two main data structures:
Series: a 1-dimensional labeled array.
DataFrame: a 2-dimensional labeled data structure, similar to a table (like an Excel sheet or a SQL table).
It is widely used in data science, machine learning, and data engineering.
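Example (a minimal sketch of both structures; the values are made up for illustration):
import pandas as pd
# Series: 1-dimensional labeled data
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
# DataFrame: 2-dimensional labeled data, like a table
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
print(s)
print(df)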
✅ 2) List some key features of Pandas.
Answer:
Fast and efficient data manipulation using DataFrames.
Tools for reading and writing data between in-memory data structures and different formats
(CSV, Excel, SQL).
Label-based indexing for rows and columns.
Handling missing data.
Grouping and aggregation.
Time series functionality.
Built-in plotting using Matplotlib.
✅ 3) What is the NumPy library in Python?
Answer:
NumPy (Numerical Python) is a library used for numerical computing. It provides support for:
N-dimensional arrays (ndarray)
Mathematical functions (e.g., mean, sum, std)
Linear algebra
Random number generation
It forms the foundation for libraries like Pandas and SciPy.
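Example (a small illustrative sketch):
import numpy as np
a = np.array([1, 2, 3, 4])
print(a.mean(), a.sum(), a.std())   # basic statistics on an ndarray
b = a.reshape(2, 2)                 # reshape into a 2x2 matrix
print(np.dot(b, b))                 # matrix multiplication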
✅ 4) What is the Matplotlib library?
Answer:
Matplotlib is a Python library used for creating static, animated, and interactive visualizations. It is
often used with NumPy and Pandas for plotting data. The most commonly used module in Matplotlib
is pyplot.
Example:
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [4, 5, 6])
plt.show()
✅ 5) What is the difference between Seaborn and Matplotlib?
Answer:
Purpose: Matplotlib is a low-level, general-purpose plotting library; Seaborn is a high-level interface built on Matplotlib.
Syntax: Matplotlib needs more manual styling; Seaborn is easier to use and comes with built-in themes.
Data: Matplotlib works with arrays; Seaborn works directly with Pandas DataFrames.
Example: plt.plot() vs. sns.lineplot()
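Example (illustrative only; the DataFrame and its columns are made up for the comparison):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({'Age': [25, 30, 35, 40], 'Salary': [30, 40, 50, 65]})
plt.plot(df['Age'], df['Salary'])            # Matplotlib: columns must be pulled out manually
sns.lineplot(x='Age', y='Salary', data=df)   # Seaborn: works directly with the DataFrame
plt.show()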
✅ 6) Are Sklearn and Scikit-learn the same? What is its use in data science?
Answer:
Yes, Sklearn and Scikit-learn are the same. sklearn is the importable module name for Scikit-learn, a
popular library for machine learning.
It provides tools for:
Classification (e.g., Naive Bayes, SVM)
Regression (e.g., Linear Regression)
Clustering (e.g., K-Means)
Model selection and evaluation
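Example (a minimal sketch using the built-in Iris dataset; the model choice is only illustrative):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=200)   # any classifier could be used here
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))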
✅ 7) What are some commonly used functions in the Pandas and NumPy libraries?
Answer:
Pandas:
read_csv(), head(), info(), describe()
groupby(), merge(), dropna(), fillna(), value_counts()
NumPy:
array(), arange(), linspace()
mean(), sum(), std(), reshape(), dot()
✅ 8) What is a DataFrame in Python?
Answer:
A DataFrame is a 2D labeled data structure with columns of potentially different types. It's part of
Pandas and resembles an Excel spreadsheet or a SQL table.
Example:
import pandas as pd
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
✅ 9) How to find duplicates in Python? (Python Command)
Answer:
df.duplicated() # Returns True for duplicate rows
df[df.duplicated()] # Filters duplicate rows
df.drop_duplicates() # Removes duplicates
✅ 10) What is the use of describe() command?
Answer:
df.describe() provides a statistical summary of the numeric (or all, if specified) columns in a DataFrame.
It includes:
Count
Mean
Standard deviation
Min, Max
25th, 50th, and 75th percentiles
Example:
df.describe(include='all')
✅ 11) What is the significance of Confusion Matrix?
Answer:
A confusion matrix is a performance measurement tool for classification models. It shows how many
predictions were:
True Positives (TP): Correctly predicted positive class
True Negatives (TN): Correctly predicted negative class
False Positives (FP): Incorrectly predicted as positive
False Negatives (FN): Incorrectly predicted as negative
It helps in calculating metrics like:
Accuracy
Precision
Recall
F1 Score
✅ 12) What is TP, TN, FP, FN in Confusion Matrix?
Answer:
TP (True Positive): The model correctly predicts the positive class.
TN (True Negative): The model correctly predicts the negative class.
FP (False Positive): The model predicts positive, but the actual class is negative (Type I error).
FN (False Negative): The model predicts negative, but the actual class is positive (Type II error).
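Example (a minimal sketch with made-up labels):
from sklearn.metrics import confusion_matrix
y_true = [1, 0, 1, 1, 0, 1]   # actual classes
y_pred = [1, 0, 0, 1, 1, 1]   # model predictions
# Rows are actual classes, columns are predicted classes (for labels [0, 1])
print(confusion_matrix(y_true, y_pred))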
✅ 13) What is Recall?
Answer:
Recall (also called Sensitivity or True Positive Rate) is the ratio of correctly predicted positive
observations to all actual positives.
\text{Recall} = \frac{TP}{TP + FN}
It answers: Out of all actual positives, how many did we correctly predict?
✅ 14) What is Precision?
Answer:
Precision is the ratio of correctly predicted positive observations to the total predicted positives.
\text{Precision} = \frac{TP}{TP + FP}
It answers: Out of all predicted positives, how many were actually positive?
✅ 15) What is F1 Score?
Answer:
The F1 Score is the harmonic mean of precision and recall. It balances the two metrics.
\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
Useful when you need a balance between Precision and Recall.
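Example (computing Precision, Recall, and F1 with scikit-learn on made-up labels):
from sklearn.metrics import precision_score, recall_score, f1_score
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of the two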
✅ 16) What is the need for Data Visualization in Data Science?
Answer:
Data visualization helps in:
Understanding trends, patterns, and outliers
Communicating insights effectively
Making data-driven decisions
Validating assumptions and hypotheses
Tools: Matplotlib, Seaborn, Tableau, Power BI
✅ 17) What is an Outlier?
Answer:
An outlier is a data point that differs significantly from other observations in a dataset.
They can arise due to:
Measurement errors
Data entry errors
True variability
Outliers can skew statistical results and should be handled carefully (e.g., removed or capped).
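Example (one common approach, capping with the IQR rule; the column and values are made up):
import pandas as pd
df = pd.DataFrame({'Salary': [30, 32, 35, 31, 33, 300]})   # 300 is an obvious outlier
q1, q3 = df['Salary'].quantile(0.25), df['Salary'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(df[(df['Salary'] < lower) | (df['Salary'] > upper)])   # flag the outliers
df['Salary'] = df['Salary'].clip(lower, upper)               # or cap them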
✅ 18) When to use Histogram and Pie Chart?
Answer:
Histogram: Use to show the distribution of a continuous variable (e.g., Age, Salary).
Pie Chart: Use to show the proportion/percentage of each category in a dataset (e.g., Gender, City).
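Example (illustrative data only):
import matplotlib.pyplot as plt
ages = [22, 25, 31, 34, 35, 41, 47, 52]
plt.hist(ages, bins=5)   # histogram: distribution of a continuous variable
plt.show()
plt.pie([60, 40], labels=['Male', 'Female'], autopct='%1.0f%%')   # pie chart: category proportions
plt.show()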
✅ 19) What are the challenges in Big Data Visualization?
Answer:
Scalability: Standard tools may not handle billions of rows.
Speed: Rendering large datasets takes time.
Interactivity: Real-time filtering and zooming becomes hard.
Storage: Large visual files consume memory.
Data Cleaning: Big data may have missing, inconsistent entries.
✅ 20) What is Joint Plot and Dist Plot?
Answer:
🔹 jointplot() – from Seaborn:
Combines scatter plot and histograms.
Useful for visualizing the relationship between two variables + distribution.
Example:
sns.jointplot(x='Age', y='Salary', data=df)
🔹 distplot() (deprecated, use displot()):
Plots a histogram + KDE (Kernel Density Estimate).
Shows distribution of a single variable.
Example:
sns.displot(df['Salary'], kde=True)
✅ 21) What are the tools used for Data Visualization?
Answer:
Popular tools for data visualization include:
🔹 Python Libraries:
Matplotlib – Basic plots (line, bar, scatter).
Seaborn – Statistical visualizations with better styling.
Plotly – Interactive plots.
Bokeh – Web-based visualizations.
Altair – Declarative charts.
🔹 BI Tools:
Tableau
Power BI
Google Data Studio
✅ 22) What is Data Wrangling?
Answer:
Data Wrangling (also known as Data Munging) is the process of cleaning, transforming, and
organizing raw data into a usable format.
Typical steps include:
Handling missing values
Converting data types
Removing duplicates
Normalizing data
Feature engineering
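Example (a small sketch touching a few of these steps; the data is made up):
import pandas as pd
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Bob', None],
                   'Age': ['25', '30', '30', '28']})
df = df.drop_duplicates()                   # remove duplicate rows
df['Age'] = df['Age'].astype(int)           # convert data types
df['Name'] = df['Name'].fillna('Unknown')   # handle missing values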
✅ 23) What is Data Transformation?
Answer:
Data Transformation is the process of converting data from one format or structure into another. It's
often used in:
Normalization/Standardization
Encoding categorical data
Aggregating values
Scaling numerical values
It prepares data for analysis or modeling.
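Example (encoding a categorical column and min-max scaling a numeric one; the column names are illustrative):
import pandas as pd
df = pd.DataFrame({'City': ['Pune', 'Mumbai', 'Pune'], 'Salary': [30000, 50000, 40000]})
df = pd.get_dummies(df, columns=['City'])   # encode categorical data
df['Salary'] = (df['Salary'] - df['Salary'].min()) / (df['Salary'].max() - df['Salary'].min())   # scale to [0, 1]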
✅ 24) What is the use of StandardScaler function in Python?
Answer:
StandardScaler from sklearn.preprocessing standardizes features by removing the mean and scaling
to unit variance (Z-score normalization).
Z = \frac{x - \mu}{\sigma}
It ensures that all features contribute equally to the model.
Example:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
✅ 25) What is Hadoop?
Answer:
Hadoop is an open-source framework used for storing and processing large datasets using a
distributed computing model.
Key components:
HDFS (Storage)
MapReduce (Processing)
YARN (Resource management)
Common (Libraries)
It enables parallel processing across multiple computers.
✅ 26) What is HDFS and MapReduce?
Answer:
HDFS (Hadoop Distributed File System): A distributed file system that stores data across
multiple machines. It breaks large files into blocks (default 128MB) and stores them
redundantly.
MapReduce: A programming model for processing large data in parallel. It consists of two steps (a plain-Python sketch follows):
Map step: Processes and filters data
Reduce step: Aggregates results
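The idea can be sketched in plain Python (this only illustrates the Map and Reduce steps; it is not Hadoop code):
from collections import Counter
lines = ["big data is big", "data is everywhere"]
# Map step: emit (word, 1) pairs from each line
mapped = [(word, 1) for line in lines for word in line.split()]
# Reduce step: aggregate the counts per word
reduced = Counter()
for word, count in mapped:
    reduced[word] += count
print(reduced)   # e.g., big: 2, data: 2, is: 2, everywhere: 1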
✅ 27) What are the components of the Hadoop Ecosystem?
Answer:
The Hadoop ecosystem includes:
HDFS – Storage layer
MapReduce – Processing layer
YARN – Resource manager
Hive – SQL-like querying
Pig – Data flow scripting
HBase – NoSQL database
Sqoop – Transfers data between Hadoop and RDBMS
Flume – Collects and transports log data
Zookeeper – Coordination service
Oozie – Workflow scheduler
✅ 28) What is Scala?
Answer:
Scala is a general-purpose programming language that combines object-oriented and functional
programming paradigms. It runs on the Java Virtual Machine (JVM) and is used heavily with Apache
Spark.
✅ 29) What are the features of Scala?
Answer:
Statically typed (like Java)
Supports functional and object-oriented programming
Type inference
Concise syntax
Interoperable with Java
Concurrency support (via Akka)
Used in big data frameworks like Spark
✅ 30) How is Scala different from Java?
Answer:
Programming Style: Scala supports both functional and object-oriented programming; Java is primarily object-oriented (functional features arrived only with Java 8).
Code Length: Scala code is concise; Java code is more verbose.
Type Inference: Scala has full type inference; Java offers only limited local inference (var, since Java 10).
Concurrency: Scala commonly uses the actor model (Akka); Java traditionally uses threads.
Use in Big Data: Scala is the native language of Apache Spark; Java sees more limited use.
✅ 31) What is Big Data?
Answer:
Big Data refers to extremely large datasets that are too complex or massive for traditional data
processing tools. It includes structured, semi-structured, and unstructured data from various sources
like social media, sensors, logs, and transactions.
✅ 32) What are the characteristics of Big Data? (The 5 V's)
Answer:
1. Volume – Huge amount of data.
2. Velocity – Speed at which data is generated and processed.
3. Variety – Different forms: text, image, video, etc.
4. Veracity – Accuracy and trustworthiness of data.
5. Value – Extracting useful insights from the data.
✅ 33) List phases in Data Science Life Cycle.
Answer:
1. Problem Understanding
2. Data Collection
3. Data Cleaning / Wrangling
4. Exploratory Data Analysis (EDA)
5. Feature Engineering
6. Model Building
7. Model Evaluation
8. Deployment
9. Monitoring and Maintenance
✅ 34) What is Central Tendency?
Answer:
Central Tendency refers to the measure that identifies the center of a dataset. The most common
measures are:
Mean (average)
Median (middle value)
Mode (most frequent value)
✅ 35) What is Dispersion?
Answer:
Dispersion measures how spread out the data is. It helps understand variability. Common measures:
Range
Variance
Standard Deviation
Interquartile Range (IQR)
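Example (computing these measures with NumPy on made-up data):
import numpy as np
data = np.array([10, 22, 13, 10, 21, 43, 77, 21, 10])
print(data.max() - data.min())                              # range
print(data.var())                                           # variance (population)
print(data.std())                                           # standard deviation
print(np.percentile(data, 75) - np.percentile(data, 25))    # interquartile range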
✅ 36) What are Mean, Median, Mode, Mid-range? Calculate for: 10, 22, 13, 10, 21, 43, 77, 21, 10
Answer:
Data: 10, 22, 13, 10, 21, 43, 77, 21, 10
Mean:
\text{Mean} = \frac{10 + 22 + 13 + 10 + 21 + 43 + 77 + 21 + 10}{9} = \frac{227}{9} \approx 25.22
Median (sorted: 10, 10, 10, 13, 21, 21, 22, 43, 77):
Middle value = 21
Mode: 10 (appears 3 times)
Mid-Range:
\text{Mid-Range} = \frac{\text{Min} + \text{Max}}{2} = \frac{10 + 77}{2} = 43.5
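The same results can be checked in Python using the statistics module:
import statistics
data = [10, 22, 13, 10, 21, 43, 77, 21, 10]
print(round(statistics.mean(data), 2))   # 25.22
print(statistics.median(data))           # 21
print(statistics.mode(data))             # 10
print((min(data) + max(data)) / 2)       # 43.5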
✅ 37) What is Variance?
Answer:
Variance measures the average squared deviation from the mean. It shows how much the data
spreads out.
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2
✅ 38) What is Standard Deviation?
Answer:
Standard deviation is the square root of variance. It shows how much the values deviate from the
mean.
\sigma = \sqrt{\text{Variance}}
If data is tightly clustered, SD is low; if spread out, SD is high.
✅ 39) What is Posterior Probability in Naive Bayes?
Answer:
Posterior Probability is the probability of a class (e.g., spam) given a set of features (e.g., words in an
email).
P(\text{Class} \mid \text{Data})
It is calculated using Bayes’ Theorem:
P(C \mid X) = \frac{P(X \mid C) \cdot P(C)}{P(X)}
✅ 40) What is Likelihood Probability in Naive Bayes?
Answer:
Likelihood is the probability of the features (data) given a class.
P(\text{Data} \mid \text{Class})
Example: In spam detection, it's the probability that certain words appear given that the email is
spam.
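A minimal spam-style sketch with scikit-learn (the texts and labels are made up; predict_proba shows the posterior probability from Q39):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win money now", "meeting at noon", "win a free prize", "lunch tomorrow"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = not spam
X = CountVectorizer().fit_transform(texts)
model = MultinomialNB().fit(X, labels)
# predict_proba returns the posterior probability P(class | data) for each class
print(model.predict_proba(X[0]))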
✅ 41) What is NLTK?
NLTK (Natural Language Toolkit) is a powerful Python library used for working with human language
data (text). It provides easy-to-use interfaces to:
Over 50 corpora and lexical resources such as WordNet.
Text processing libraries for classification, tokenization, stemming, tagging, parsing, and
semantic reasoning.
Wrappers for industrial-strength NLP libraries.
✅ Key Features:
Written in Python.
Good for educational and research purposes in NLP.
Helps in building Python programs to work with human language data.
✅ 42) What is Tokenization in NLP?
Tokenization is the process of splitting text into smaller units called tokens. Tokens can be:
Words
Sentences
Subwords
✅ Types of Tokenization:
Word Tokenization: Splits text into words.
Example: "I love Python" → ["I", "love", "Python"]
Sentence Tokenization: Splits text into sentences.
Example: "I love Python. It is powerful." → ["I love Python.", "It is powerful."]
Why Tokenization?
It’s the first step in NLP to break down raw text for further processing like parsing, tagging, etc.
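Example (using NLTK; the 'punkt' tokenizer data must be downloaded first):
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize
text = "I love Python. It is powerful."
print(word_tokenize(text))   # ['I', 'love', 'Python', '.', 'It', 'is', 'powerful', '.']
print(sent_tokenize(text))   # ['I love Python.', 'It is powerful.']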
✅ 43) What is Stemming?
Stemming is the process of reducing a word to its root form by chopping off derivational affixes.
✅ Example:
"playing", "played", "plays" → "play"
"running", "runner" → "run"
Note: Stemming is a rule-based process and may not always result in a real word.
Example: "studies" → "studi"
✅ Common Stemmer in NLTK:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmer.stem("playing") # Output: 'play'
✅ 44) What is Lemmatization?
Lemmatization is the process of reducing a word to its base or dictionary form, called a lemma.
Unlike stemming, lemmatization returns a valid word and considers the context (POS).
✅ Example:
"running", "ran" → "run"
"better" → "good"
✅ Lemmatization in NLTK:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("running", pos="v") # Output: 'run'
🔁 Stemming vs Lemmatization:
Stemming is faster, less accurate.
Lemmatization is slower, but more accurate.
✅ 45) What is a Corpus in NLP?
A Corpus is a large collection of text data used for training and evaluating NLP models.
✅ Types of Corpora:
Annotated corpora (tagged with POS, syntax)
Raw corpora (plain text)
Monolingual or multilingual
✅ Examples:
Brown Corpus
Gutenberg Corpus
WordNet (lexical database)
NLTK Example:
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
gutenberg.fileids() # Lists files in the Gutenberg corpus
✅ 46) What is the Spark Framework?
Apache Spark is an open-source distributed computing framework used for big data processing. It
supports:
Batch processing
Real-time stream processing
Machine learning
Graph processing
✅ Languages Supported: Scala, Python (PySpark), Java, R
✅ Why Spark?
Processes data faster than Hadoop MapReduce
In-memory computing
Built-in libraries for ML (MLlib), graph (GraphX), SQL (Spark SQL)
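For comparison, the same kind of processing can be done from Python with PySpark (a minimal word-count sketch; assumes PySpark is installed):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("WordCount").getOrCreate()
rdd = spark.sparkContext.parallelize(["big data is big", "data is everywhere"])
counts = (rdd.flatMap(lambda line: line.split())     # split lines into words
             .map(lambda word: (word, 1))            # Map: emit (word, 1) pairs
             .reduceByKey(lambda a, b: a + b))       # Reduce: sum counts per word
print(counts.collect())
spark.stop()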
Steps to run a Scala program on Windows using the Spark framework:
1. Copy the Scala file:
Save your .scala file (e.g., sum.scala) in the Spark folder:
C:\Program Files\Big Data\Spark
2. Open CMD in the Spark folder:
Open the Spark folder, type cmd in the address bar, and press Enter. This opens a command prompt at that path.
3. Start the Spark shell:
Type:
spark-shell
and press Enter. This starts the interactive Spark Scala shell.
4. Load the Scala file in the Spark shell:
Use the :load command to load your Scala file:
:load sum.scala
This runs the code inside sum.scala.
✅ Example:
If sum.scala contains:
val a = 5
val b = 10
val sum = a + b
println("Sum is: " + sum)
Output will be:
Sum is: 15