TYCS SEM-6 DATA SCIENCE UNIT-1 Notes
1. Define Data Science. Discuss in detail the applications of Data Science.
Definition:
Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms,
and systems to extract knowledge and insights from structured and unstructured data. It
combines elements of statistics, computer science, and domain expertise to analyze data and
support decision-making.
Scope of Data Science:
Data Science involves the entire lifecycle of data — from data collection, cleaning, and
processing to analysis, visualization, and prediction.
Comparison with Related Fields:
Business Intelligence (BI): Focuses on descriptive analysis (what happened).
Artificial Intelligence (AI): Focuses on building systems that can mimic human
intelligence.
Machine Learning (ML): A subset of AI that uses algorithms to learn from data.
Data Warehousing (DW): Deals with storing and managing large volumes of data
for analysis.
Data Science integrates all these to provide predictive and prescriptive insights.
Applications of Data Science:
1. Healthcare: Disease prediction, drug discovery, medical image analysis.
2. Finance: Fraud detection, risk analysis, algorithmic trading.
3. Retail and E-commerce: Customer behavior analysis, recommendation systems.
4. Social Media: Sentiment analysis, trend prediction.
5. Transportation: Route optimization, traffic prediction.
6. Manufacturing: Predictive maintenance, quality control.
7. Education: Personalized learning, student performance analysis.
2. Explain Data Warehousing (DW) and Data Mining (DM) in detail.
Data Warehousing (DW):
A Data Warehouse is a centralized repository that stores data collected from multiple
sources in a structured format. It supports querying and analysis for decision-making.
Features of Data Warehousing:
Subject-oriented: Focused on specific areas like sales, finance, etc.
Integrated: Data is collected and cleaned from multiple sources.
Time-variant: Stores historical data for analysis.
Non-volatile: Data is stable; it is only updated periodically.
Benefits:
Improved decision-making.
Easy access to consolidated data.
Enables historical and trend analysis.
Data Mining (DM):
Data Mining is the process of discovering hidden patterns, correlations, and trends within
large datasets using statistical and machine learning techniques.
Common Data Mining Techniques:
Classification: Assigning data into predefined categories.
Clustering: Grouping similar data points.
Association Rule Mining: Discovering relationships between variables (e.g., Market
Basket Analysis).
Regression: Predicting continuous values.
Anomaly Detection: Identifying unusual data points.
Relationship between DW and DM:
Data Warehousing provides the data infrastructure, while Data Mining extracts knowledge
and patterns from that data.
3. Explain in detail different Data Sources.
Data sources are the origins from which data is collected for analysis. Data can be
structured, unstructured, or semi-structured.
Types of Data Sources:
1. Databases:
o Relational Databases (MySQL, PostgreSQL).
o NoSQL Databases (MongoDB, Cassandra).
o Contain structured data stored in tables.
2. Files:
o CSV, Excel, JSON, XML, text files, etc.
o Easy to store and share small to medium datasets.
3. APIs (Application Programming Interfaces):
o Provide real-time access to online data (e.g., weather APIs, stock data APIs).
4. Web Scraping:
o Extracting data from websites using tools like BeautifulSoup, Scrapy.
5. Sensors and IoT Devices:
o Generate continuous data streams (e.g., temperature sensors, fitness trackers).
6. Social Media:
o Data from platforms like Twitter, Instagram, and Facebook used for sentiment
and trend analysis.
4. Describe in detail Data Transformation.
Definition:
Data Transformation is the process of converting data from its original format into a suitable
format for analysis or machine learning.
Common Data Transformation Techniques:
1. Scaling:
o Adjusting the range of numeric data (e.g., to 0–1 or −1 to 1).
o Techniques: Min-Max Scaling, Standardization (Z-score normalization).
2. Normalization:
Bringing all numeric data to a common scale without distorting differences in ranges — useful in distance-based algorithms like KNN and K-Means.
3. Encoding Categorical Variables:
Converting non-numeric data into numeric form.
o Label Encoding: Assigns an integer to each category.
o One-Hot Encoding: Creates binary columns for each category.
4. Binning (Discretization):
Converting continuous data into categorical bins (e.g., converting income into “Low,” “Medium,” “High”).
5. Aggregation:
Summarizing data (e.g., average monthly sales from daily sales data).
6. Log Transformation:
Reduces the impact of outliers and makes skewed data more normally distributed.
7. Feature Extraction and Dimensionality Reduction:
Techniques like PCA (Principal Component Analysis) reduce the number of features while retaining key information.
8. Data Merging and Integration:
Combining datasets from multiple sources for comprehensive analysis.
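Example (a minimal Python sketch of these transformations, assuming Pandas and scikit-learn are installed and using a small made-up dataset):
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Small illustrative dataset (hypothetical values)
df = pd.DataFrame({
    "income": [25000, 48000, 52000, 310000],
    "city": ["Mumbai", "Pune", "Mumbai", "Delhi"],
})

# Scaling: Min-Max (0-1) and Standardization (Z-score)
df["income_minmax"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()
df["income_zscore"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# Encoding: one-hot encode the categorical 'city' column
df = pd.get_dummies(df, columns=["city"])

# Binning: convert continuous income into Low/Medium/High categories
df["income_band"] = pd.cut(df["income"], bins=3, labels=["Low", "Medium", "High"])

# Log transformation: reduce the effect of the large outlier
df["income_log"] = np.log1p(df["income"])

print(df)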
Importance of Data Transformation:
Ensures data compatibility with ML algorithms.
Improves model accuracy and training efficiency.
Reduces redundancy and bias in data.
Enables effective visualization and interpretation.
Purpose:
Improve data quality.
Prepare data for machine learning algorithms.
Reduce complexity and improve model performance.
5. Discuss Feature Engineering and Time-Series Data.
Feature Engineering:
Definition:
Feature Engineering is the process of creating new input features or modifying existing ones
from raw data to improve the performance of machine learning models. It helps models
understand data patterns more effectively.
Objectives of Feature Engineering:
Improve model accuracy and performance.
Reduce noise and irrelevant information.
Simplify data representation for better learning.
Common Feature Engineering Techniques:
1. Feature Creation:
o Deriving new features from existing ones (e.g., creating “Total Purchase”
from quantity × price).
2. Feature Selection:
o Choosing only the most relevant variables that affect the target output.
3. Feature Scaling:
o Normalizing or standardizing numerical data for uniformity.
4. Encoding Categorical Variables:
o Converting text data into numeric form (e.g., One-Hot Encoding, Label
Encoding).
5. Binning:
o Grouping continuous values into bins or categories (e.g., age → youth, adult,
senior).
6. Handling Outliers and Missing Values:
o Replacing, removing, or imputing missing/outlier data.
7. Polynomial Features:
o Creating interaction or higher-order terms (e.g., x², xy) for nonlinear relationships.
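Example (a minimal sketch of feature creation, binning, and polynomial features, using Pandas and scikit-learn on a small made-up orders table; get_feature_names_out() assumes a recent scikit-learn version):
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical raw data
orders = pd.DataFrame({
    "quantity": [2, 5, 1, 3],
    "price": [150.0, 80.0, 999.0, 20.0],
    "customer_age": [19, 34, 52, 70],
})

# Feature creation: derive Total Purchase from quantity x price
orders["total_purchase"] = orders["quantity"] * orders["price"]

# Binning: group age into categories
orders["age_group"] = pd.cut(orders["customer_age"],
                             bins=[0, 25, 60, 120],
                             labels=["youth", "adult", "senior"])

# Polynomial features: interaction and squared terms of quantity and price
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(orders[["quantity", "price"]])
print(poly.get_feature_names_out())  # e.g. quantity, price, quantity^2, quantity*price, price^2
print(orders)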
Benefits of Feature Engineering:
Enhances model interpretability.
Reduces overfitting.
Helps in better generalization and prediction accuracy.
Time-Series Data:
Definition:
Time-Series Data is a sequence of data points collected or recorded at specific time intervals
(e.g., hourly, daily, monthly).
Examples:
Stock market prices over time.
Daily temperature readings.
Website traffic per day.
Sensor data from IoT devices.
Characteristics of Time-Series Data:
1. Time Dependency: Each data point depends on previous observations.
2. Trend: General increase or decrease in data over time.
3. Seasonality: Regular patterns repeating over a fixed period (e.g., sales rising during
holidays).
4. Noise: Random variation in data not explained by trends or seasonality.
Feature Engineering for Time-Series Data:
Lag Features: Using previous time steps as input variables.
Rolling Statistics: Calculating moving averages or rolling sums.
Time-Based Features: Extracting features like day, month, year, or hour.
Differencing: Removing trends to make data stationary.
Decomposition: Splitting data into trend, seasonal, and residual components.
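Example (a minimal Pandas sketch of lag, rolling, time-based, and differencing features on a made-up daily sales series):
import pandas as pd

# Hypothetical daily sales series
ts = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "sales": [100, 120, 130, 125, 140, 160, 155, 170, 180, 175],
})

# Lag features: previous one and two days as inputs
ts["lag_1"] = ts["sales"].shift(1)
ts["lag_2"] = ts["sales"].shift(2)

# Rolling statistics: 3-day moving average
ts["rolling_mean_3"] = ts["sales"].rolling(window=3).mean()

# Time-based features: day of week and month
ts["day_of_week"] = ts["date"].dt.dayofweek
ts["month"] = ts["date"].dt.month

# Differencing: remove trend to help make the series stationary
ts["diff_1"] = ts["sales"].diff()

print(ts.head())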
Applications:
Forecasting (weather, sales, energy consumption).
Anomaly detection in system logs or financial transactions.
Predictive maintenance in manufacturing.
UNIT-2
1. Define Data Wrangling. Discuss Data Wrangling Techniques.
Definition:
Data Wrangling (also known as Data Munging) is the process of cleaning, structuring, and
enriching raw data into a desired format for better decision-making and analysis.
Steps and Techniques in Data Wrangling:
1. Data Cleaning:
o Handling missing values, removing duplicates, fixing data types, correcting
errors.
2. Reshaping and Pivoting:
o Changing the structure of data tables (e.g., converting rows to columns or vice
versa).
3. Merging and Joining:
o Combining multiple datasets using keys or indexes.
4. Aggregating:
o Summarizing data values by applying functions like sum, mean, count.
5. Filtering and Subsetting:
o Selecting specific rows or columns based on conditions.
6. Feature Engineering:
o Creating new features from existing data to improve model performance.
7. Dummification:
o Converting categorical variables into binary (0/1) indicators.
8. Feature Scaling:
o Applying normalization or standardization to ensure equal feature importance.
Tools for Data Wrangling:
Python Libraries: Pandas, NumPy.
R Packages: dplyr, tidyr.
Other Tools: Excel, Power BI, Talend.
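Example (a minimal Pandas wrangling sketch on two small made-up tables, covering cleaning, merging, aggregating, and dummification):
import pandas as pd

# Hypothetical raw datasets
sales = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "cust_id": [10, 11, 11, 12, None],
    "amount": [250, 400, 400, None, 150],
    "region": ["North", "South", "South", "North", "East"],
})
customers = pd.DataFrame({"cust_id": [10, 11, 12], "segment": ["Retail", "Corporate", "Retail"]})

# Cleaning: drop duplicates and handle missing values
sales = sales.drop_duplicates()
sales["amount"] = sales["amount"].fillna(sales["amount"].mean())
sales = sales.dropna(subset=["cust_id"])
sales["cust_id"] = sales["cust_id"].astype(int)

# Merging: join with customer data on a key
merged = sales.merge(customers, on="cust_id", how="left")

# Aggregating: total amount per region
summary = merged.groupby("region")["amount"].sum()

# Dummification: convert the categorical 'segment' column to 0/1 indicators
merged = pd.get_dummies(merged, columns=["segment"])

print(summary)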
2. Explain Data Visualization Techniques in Detail.
Definition:
Data visualization is the graphical representation of data and information. It helps to
understand trends, patterns, and outliers in data using visual elements such as charts, graphs,
and plots.
Purpose of Data Visualization:
Simplifies complex data for better understanding.
Reveals hidden patterns and relationships.
Supports data-driven decision-making.
Helps communicate insights effectively.
Common Data Visualization Techniques:
1. Histogram:
o Displays the frequency distribution of a numeric variable.
o The x-axis represents data intervals (bins) and the y-axis shows frequencies.
o Useful for understanding data distribution and skewness.
Example: Distribution of students’ marks.
2. Scatter Plot:
o Represents the relationship between two numeric variables (X and Y).
o Each point corresponds to one observation.
o Helps detect correlations, trends, and outliers.
Example: Relationship between advertising budget and sales.
3. Box Plot:
o Displays data distribution based on five summary statistics: minimum, first
quartile (Q1), median, third quartile (Q3), and maximum.
o Helps identify outliers and data spread.
4. Bar Chart:
o Used to compare categorical data.
o Bars can be vertical or horizontal.
5. Pie Chart:
o Shows the proportion of categories in a dataset as slices of a circle.
o Useful for visualizing percentage or ratio data.
6. Heatmap:
o Uses colors to represent values in a matrix format.
o Commonly used for correlation analysis between variables.
7. Line Chart:
o Displays data trends over time (time-series data).
o Useful for tracking changes or trends.
Popular Tools and Libraries:
Python: Matplotlib, Seaborn, Plotly
R: ggplot2
BI Tools: Tableau, Power BI
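Example (a minimal Matplotlib/Seaborn sketch of a histogram, scatter plot, and box plot on randomly generated data):
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical data
rng = np.random.default_rng(42)
marks = rng.normal(loc=65, scale=12, size=200)     # students' marks
budget = rng.uniform(10, 100, size=50)             # advertising budget
sales = 3 * budget + rng.normal(0, 20, size=50)    # related sales figures

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Histogram: distribution of marks
axes[0].hist(marks, bins=20)
axes[0].set_title("Histogram of marks")

# Scatter plot: relationship between budget and sales
axes[1].scatter(budget, sales)
axes[1].set_title("Budget vs sales")

# Box plot: spread and outliers in marks
sns.boxplot(y=marks, ax=axes[2])
axes[2].set_title("Box plot of marks")

plt.tight_layout()
plt.show()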
3. Define Descriptive Statistics. Explain Mean, Median, Mode, and
Standard Deviation in Detail.
Definition:
Descriptive Statistics summarizes and describes the main features of a dataset through
numerical and graphical methods. It helps to understand the central tendency, variability, and
overall distribution of data.
Measures of Central Tendency:
1. Mean (Average):
o The sum of all values divided by the number of values.
o Example: the mean of 10, 20, 30 is (10 + 20 + 30) / 3 = 20.
2. Median:
o The middle value when the data is arranged in ascending order.
o If there is an even number of values, the median is the average of the two middle values.
o Less affected by outliers than the mean.
3. Mode:
o The value that occurs most frequently in the dataset.
o A dataset may have one mode, more than one mode, or no mode at all.
Measure of Dispersion:
4. Standard Deviation (SD):
o Measures how much the data values spread out around the mean.
o It is the square root of the variance (the average of the squared differences from the mean).
o A low SD indicates data points are close to the mean, and a high SD means high variability.
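Example (a minimal Pandas sketch computing these measures on a small made-up marks series; note that Pandas' std() returns the sample standard deviation by default):
import pandas as pd

marks = pd.Series([45, 67, 67, 72, 80, 85, 90])

print("Mean:", marks.mean())                 # sum of values / count
print("Median:", marks.median())             # middle value of the sorted data
print("Mode:", marks.mode().tolist())        # most frequent value(s)
print("Standard deviation:", marks.std())    # spread of values around the mean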
Importance:
Descriptive statistics are essential for data understanding, detecting errors, and providing the
foundation for inferential statistics.
4. Explain in Detail Classification and Regression Analysis.
Introduction to Supervised Learning:
Supervised learning involves training a model on labeled data — where both input and output
are known — to make predictions on new, unseen data.
A. Classification Analysis:
Definition:
Classification is the process of predicting a categorical (discrete) output label based on input
features.
Examples:
Spam vs. Non-Spam Email
Disease Detection (Positive/Negative)
Loan Approval (Yes/No)
Popular Algorithms:
Logistic Regression
Decision Tree
Random Forest
Support Vector Machine (SVM)
K-Nearest Neighbors (K-NN)
Naïve Bayes
Evaluation Metrics: Accuracy, Precision, Recall, F1-score.
B. Regression Analysis:
Definition:
Regression is used to predict a continuous numeric output variable based on input variables.
Types of Regression:
1. Simple Linear Regression: Models the relationship between one input variable and the output using a straight line.
2. Multiple Linear Regression: Uses two or more input variables to predict the output.
3. Polynomial Regression: Fits a curved (nonlinear) relationship using higher-order terms.
Examples: House price prediction, sales forecasting, temperature prediction.
Evaluation Metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R² score.
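Example (a minimal scikit-learn sketch of one classifier and one regressor on synthetic data; the dataset and metric choices are illustrative):
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, r2_score

# Classification: predict a discrete class label
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Classification accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Regression: predict a continuous numeric value
X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_train, y_train)
print("Regression R^2:", r2_score(y_test, reg.predict(X_test)))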
5. Define Bias and Variance. Discuss the Bias-Variance Tradeoff.
Bias:
Bias is the error introduced by simplifying a complex problem.
High Bias: The model is too simple and fails to capture underlying patterns
(underfitting).
Low Bias: The model fits the training data well.
Variance:
Variance is the error introduced due to the model’s sensitivity to small fluctuations in the
training data.
High Variance: Model fits training data too closely (overfitting).
Low Variance: Model performs consistently across datasets.
Bias-Variance Tradeoff:
The goal is to find a balance between bias and variance for optimal model
performance.
Underfitting: High bias, low variance → poor training and testing performance.
Overfitting: Low bias, high variance → excellent training performance but poor
testing accuracy.
Ideal Model:
Has both low bias and low variance — generalizes well to unseen data.
Visualization:
Model Type | Bias | Variance | Performance
Underfitting | High | Low | Poor
Good Fit | Low | Low | Best
Overfitting | Low | High | Poor
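Example (a minimal scikit-learn sketch illustrating the tradeoff by fitting polynomials of increasing degree to noisy data; typically the degree-1 model underfits and the degree-15 model overfits, showing low training error but higher test error):
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Noisy nonlinear data
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 80)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in [1, 4, 15]:  # too simple, balanced, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")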
6. Discuss Different Techniques for Evaluating Model Performance.
Model Evaluation helps determine how well a trained model performs on new, unseen data.
1. Confusion Matrix:
A table that compares predicted and actual classifications (a 2×2 table for binary classification).
| Predicted Positive | Predicted Negative
Actual Positive | True Positive (TP) | False Negative (FN)
Actual Negative | False Positive (FP) | True Negative (TN)
2. Accuracy:
Proportion of all predictions that are correct: Accuracy = (TP + TN) / (TP + TN + FP + FN).
3. Precision:
Proportion of predicted positives that are actually positive: Precision = TP / (TP + FP).
4. Recall (Sensitivity):
Proportion of actual positives that are correctly identified: Recall = TP / (TP + FN).
5. F1-Score:
Harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall).
6. ROC Curve and AUC:
ROC (Receiver Operating Characteristic) curve shows the tradeoff between true
positive rate and false positive rate.
AUC (Area Under Curve) measures the model’s ability to distinguish between
classes (1 = perfect model).
7. Cross-Validation:
Splits data into multiple folds to train and test models multiple times.
o K-Fold Cross Validation
o Stratified Cross Validation
8. Hyperparameter Tuning:
Optimizing model hyperparameters (e.g., tree depth, learning rate) using Grid Search or Random Search to improve accuracy.
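Example (a minimal scikit-learn sketch combining a confusion matrix, precision/recall/F1 report, ROC-AUC, cross-validation, and grid search; the model and parameter grid are illustrative):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

X, y = make_classification(n_samples=600, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion matrix plus accuracy/precision/recall/F1
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# ROC-AUC from predicted class probabilities
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# 5-fold cross-validation
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())

# Hyperparameter tuning with grid search over tree depth
grid = GridSearchCV(DecisionTreeClassifier(random_state=1),
                    {"max_depth": [2, 4, 6, None]}, cv=5)
grid.fit(X_train, y_train)
print("Best max_depth:", grid.best_params_)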
7. Explain Ensemble Learning with Bagging and Boosting in Detail.
Definition:
Ensemble Learning combines multiple machine learning models (called base learners) to
improve prediction accuracy and robustness.
Goal:
To achieve better generalization by reducing bias, variance, or both.
A. Bagging (Bootstrap Aggregating):
Concept:
Trains multiple models on different random subsets of training data (with
replacement).
Combines their outputs using averaging (for regression) or majority voting (for
classification).
Steps:
1. Create multiple random subsets of the dataset.
2. Train a model on each subset.
3. Combine predictions from all models.
Advantages:
Reduces variance (avoids overfitting).
Increases model stability.
Common Algorithms:
Random Forest (an ensemble of Decision Trees).
B. Boosting:
Concept:
Builds models sequentially.
Each new model corrects errors made by previous models.
Focuses more on misclassified examples.
Steps:
1. Train a weak learner (simple model).
2. Increase weights for misclassified samples.
3. Train the next learner on the updated data.
4. Combine all learners’ predictions.
Advantages:
Reduces bias.
Achieves high accuracy.
Common Algorithms:
AdaBoost (Adaptive Boosting)
Gradient Boosting (GBM)
XGBoost
LightGBM
Comparison:
Aspect Bagging Boosting
Training Parallel Sequential
Goal Reduce Variance Reduce Bias
Example Random Forest AdaBoost, XGBoost
Overfitting Less prone More prone (needs tuning)
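Example (a minimal scikit-learn sketch comparing a bagging-style ensemble with a boosting ensemble on synthetic data):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=800, n_features=12, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# Bagging-style ensemble: Random Forest (trees trained in parallel on bootstrap samples)
rf = RandomForestClassifier(n_estimators=100, random_state=7).fit(X_train, y_train)

# Boosting: AdaBoost (weak learners trained sequentially, reweighting misclassified samples)
ada = AdaBoostClassifier(n_estimators=100, random_state=7).fit(X_train, y_train)

print("Random Forest accuracy:", accuracy_score(y_test, rf.predict(X_test)))
print("AdaBoost accuracy:", accuracy_score(y_test, ada.predict(X_test)))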
UNIT-3
a) Explain Storytelling in Analysis in Detail
Definition:
Data storytelling is the process of communicating data-driven insights using a combination of
visuals, narrative, and context. It transforms complex analytical findings into
understandable and actionable messages for decision-makers.
Key Components of Data Storytelling:
1. Data:
o The factual foundation — quantitative or qualitative insights derived from
analysis.
o Must be accurate, relevant, and well-prepared.
2. Visuals:
o Graphs, charts, and infographics that present data patterns effectively.
o Help audiences quickly grasp insights.
3. Narrative:
o The story or explanation that connects the visuals and data together.
o It explains why the trends or changes are happening.
Steps in Effective Data Storytelling:
1. Understand the Audience:
o Tailor your story to the knowledge level and interests of your audience.
2. Define the Objective:
o Identify the key insight or message to communicate.
3. Select the Right Visuals:
o Use charts (bar, line, scatter, etc.) that best represent your data.
4. Provide Context:
o Explain background information and what the numbers mean.
5. Highlight Key Insights:
o Emphasize the main findings using color, annotations, or callouts.
6. Build a Narrative Flow:
o Structure the story with a beginning (problem), middle (analysis), and end
(insights/recommendations).
Importance of Data Storytelling:
Simplifies complex analytical results.
Engages both technical and non-technical audiences.
Encourages data-driven decisions.
Makes insights memorable and actionable.
Example:
Instead of just showing a sales decline chart, a story could be:
“Sales dropped by 15% in Q3 due to increased competition in the northern region. However,
a new marketing strategy launched in Q4 led to a 10% recovery.”
b) Discuss: Visualization Tools in Detail
Definition:
Visualization tools are software applications that help users represent data graphically to
identify patterns, trends, and insights quickly.
1. Python Visualization Libraries:
Matplotlib:
o Foundation library for static, interactive, and animated plots in Python.
o Supports bar charts, pie charts, line graphs, and histograms.
o Example: plt.plot(), plt.bar()
Seaborn:
o Built on top of Matplotlib.
o Provides high-level functions for statistical visualizations.
o Example: sns.heatmap(), sns.boxplot()
Plotly:
o Enables interactive, web-based visualizations.
o Used for dashboards and 3D plots.
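Example (a minimal Plotly Express sketch; it uses the Gapminder sample dataset bundled with Plotly Express rather than project data):
import plotly.express as px

# Sample dataset bundled with Plotly Express
df = px.data.gapminder().query("year == 2007")
fig = px.scatter(df, x="gdpPercap", y="lifeExp", size="pop", color="continent",
                 hover_name="country", log_x=True, title="Interactive scatter plot")
fig.show()  # opens an interactive, web-based chart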
2. Business Intelligence (BI) Tools:
Tableau:
o Drag-and-drop tool for creating interactive dashboards.
o Supports data blending, mapping, and storytelling features.
o Widely used in corporate analytics.
Power BI:
o Microsoft tool for business reporting and visualization.
o Integrates well with Excel, SQL, and cloud services.
Google Data Studio:
o Free online tool for creating interactive and shareable reports.
3. R Visualization Packages:
ggplot2:
o Implements the “Grammar of Graphics” for flexible data visualization.
o Ideal for statistical and academic visualizations.
Importance of Visualization Tools:
Help understand data quickly.
Enable interactive exploration.
Aid in effective storytelling and reporting.
Facilitate collaboration in business environments.
c) List and Discuss Data Management Activities
Definition:
Data Management refers to the administration, control, and organization of data
throughout its lifecycle — from creation and storage to analysis and archiving.
Major Data Management Activities:
1. Data Collection:
o Gathering raw data from various sources (databases, sensors, social media,
APIs).
2. Data Storage:
o Saving data in structured or unstructured formats (e.g., relational databases,
data lakes).
3. Data Integration:
o Combining data from multiple sources into a unified view.
4. Data Cleaning:
o Removing duplicates, correcting errors, handling missing values.
5. Data Transformation:
o Converting data into suitable formats for analysis (scaling, encoding,
normalization).
6. Data Analysis:
o Applying statistical and machine learning techniques to extract insights.
7. Data Security and Privacy:
o Ensuring protection against unauthorized access or breaches.
8. Data Backup and Recovery:
o Regularly saving copies to prevent loss due to hardware failure or human
error.
9. Data Governance:
o Establishing policies and rules for consistent and ethical data use.
10. Data Archiving:
o Storing older or less frequently used data for future reference or compliance.
Importance:
Efficient data management ensures accuracy, accessibility, consistency, and security — all
crucial for data-driven organizations.
d) Elaborate the Concept of Data Governance
Definition:
Data Governance refers to the framework of policies, processes, and responsibilities that
ensure data is managed effectively, securely, and ethically across an organization.
Key Objectives:
Maintain data quality, accuracy, and consistency.
Define ownership and accountability.
Ensure compliance with legal and regulatory standards (e.g., GDPR, HIPAA).
Enhance data security and privacy.
Key Components of Data Governance:
1. Data Ownership: Assign responsibility for each data asset.
2. Data Policies: Rules defining data usage, sharing, and storage.
3. Data Standards: Formats and naming conventions for uniformity.
4. Data Quality Management: Continuous monitoring for accuracy and completeness.
5. Compliance and Security: Protect data against misuse or leaks.
6. Metadata Management: Maintain information about data origin, meaning, and
usage.
Benefits:
Improves trust in data-driven decisions.
Minimizes data risks and errors.
Ensures regulatory compliance.
Increases organizational efficiency.
Example:
A bank enforcing strict access control for customer data and maintaining audit trails to ensure
data privacy.
e) Illustrate Extraction, Transformation, and Load (ETL) in Detail
Definition:
ETL is a process in data pipelines used to move data from multiple sources into a central
repository (like a Data Warehouse) after cleaning and transformation.
1. Extraction (E):
Collects data from various sources such as databases, files, web APIs, or sensors.
Can be full extraction (all data) or incremental extraction (only new/changed data).
Example: Extracting sales data from an SQL database.
2. Transformation (T):
Converts raw data into a structured and usable format.
Involves:
o Data Cleaning (handling missing values, removing duplicates)
o Data Integration (combining data from multiple sources)
o Data Formatting (converting date/time formats, encoding)
o Aggregation and summarization
Example: Converting product names to uppercase and summarizing sales by region.
3. Loading (L):
Moves the transformed data into the target system such as a Data Warehouse or
analytical database.
Can be:
o Batch Loading: Data loaded in bulk at scheduled intervals.
o Real-time Loading: Data loaded continuously as it changes.
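Example (a minimal ETL sketch using Pandas and SQLite; the file name raw_sales.csv, its columns, and the table name are hypothetical):
import sqlite3
import pandas as pd

# Extract: read raw sales data from a CSV file (hypothetical path and columns)
sales = pd.read_csv("raw_sales.csv")  # e.g. columns: product, region, amount, order_date

# Transform: clean, standardize, and aggregate
sales = sales.drop_duplicates()
sales["product"] = sales["product"].str.upper()             # standardize product names
sales["order_date"] = pd.to_datetime(sales["order_date"])   # fix date format
summary = sales.groupby("region", as_index=False)["amount"].sum()  # summarize sales by region

# Load: write the transformed data into a target database
conn = sqlite3.connect("warehouse.db")
summary.to_sql("regional_sales", conn, if_exists="replace", index=False)
conn.close()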
Tools Used for ETL:
Informatica PowerCenter
Talend
Microsoft SSIS
Apache NiFi
AWS Glue
Importance of ETL:
Ensures consistent, clean, and reliable data.
Enables data-driven reporting and analytics.
Improves data accessibility and performance in Business Intelligence systems.
f) Give the Importance of Data Quality
Definition:
Data Quality refers to the degree to which data is accurate, complete, reliable, and
relevant for its intended use.
Key Dimensions of Data Quality:
1. Accuracy: Data must represent reality correctly.
2. Completeness: No missing or incomplete records.
3. Consistency: Uniform data across systems and sources.
4. Timeliness: Data should be updated and available when needed.
5. Validity: Data must conform to defined formats and rules.
6. Uniqueness: Avoid duplication of records.
Importance of High Data Quality:
Improved Decision-Making: Reliable data leads to accurate insights.
Operational Efficiency: Reduces errors and rework.
Regulatory Compliance: Ensures adherence to data privacy and accuracy standards.
Customer Satisfaction: Better data leads to personalized and reliable services.
Cost Savings: Prevents loss caused by incorrect or duplicate data.
Example:
Poor data quality (e.g., wrong customer contact details) can lead to failed communications
and lost sales opportunities.
Data Quality Assurance Techniques:
Data validation checks
Automated cleaning scripts
Regular audits and monitoring
Use of master data management systems (MDM)
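Example (a minimal Pandas sketch of basic quality checks; the file customers.csv and its columns are hypothetical):
import pandas as pd

customers = pd.read_csv("customers.csv")  # hypothetical customer master file

# Completeness: count missing values per column
print(customers.isnull().sum())

# Uniqueness: detect duplicate customer IDs
print("Duplicate IDs:", customers["customer_id"].duplicated().sum())

# Validity: flag records whose email does not match a simple pattern
valid_email = customers["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
print("Invalid emails:", (~valid_email).sum())

# Consistency: check that age falls within a plausible range
print("Out-of-range ages:", (~customers["age"].between(0, 120)).sum())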
Q4. Common Questions
1. Differentiate between Structured and Unstructured Data
Definition:
Data in Data Science can be categorized into structured, unstructured, and semi-
structured forms depending on how it is organized and stored.
Feature | Structured Data | Unstructured Data
Definition | Data organized in fixed fields or columns, easily stored in tables. | Data without a predefined format or model.
Format | Rows and columns (tabular form). | Text, images, audio, video, emails, etc.
Storage | Stored in relational databases (SQL, Oracle). | Stored in data lakes, NoSQL databases, or file systems.
Processing | Easy to search and analyze using SQL queries. | Requires complex processing (e.g., NLP, image processing).
Examples | Sales records, employee database, financial transactions. | Social media posts, emails, images, sensor data.
Tools | SQL, Excel, Pandas. | Hadoop, Spark, TensorFlow.
Conclusion:
Structured data is quantitative and easy to analyze, while unstructured data is qualitative and
requires advanced analytical methods.
2. Explain Hyperparameter Tuning
Definition:
Hyperparameter Tuning is the process of finding the best combination of hyperparameters (the settings that control the learning process) for a machine learning algorithm.
Difference between Parameters and Hyperparameters:
Parameters: Learned automatically during training (e.g., weights in neural
networks).
Hyperparameters: Set manually before training (e.g., learning rate, tree depth,
number of clusters).
Common Hyperparameters:
For Decision Tree: max_depth, min_samples_split
For Random Forest: n_estimators (number of trees)
For SVM: C, kernel, gamma
For Neural Networks: learning_rate, batch_size, epochs
Techniques for Hyperparameter Tuning:
1. Grid Search:
o Tests all possible combinations of hyperparameters.
o Ensures best result but is computationally expensive.
2. Random Search:
o Randomly selects combinations to test.
o Faster than grid search with nearly similar accuracy.
3. Bayesian Optimization:
o Uses probability models to predict which combinations may perform best.
4. Automated Hyperparameter Tuning Tools:
o Optuna, Scikit-learn’s GridSearchCV, RandomizedSearchCV, Keras Tuner.
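Example (a minimal scikit-learn sketch comparing Grid Search and Random Search for an SVM on synthetic data; the parameter grid is illustrative):
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=6, random_state=3)

param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"], "gamma": ["scale", 0.1, 1]}

# Grid search: tries every combination (exhaustive but slower)
grid = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print("Grid search best params:", grid.best_params_, "score:", round(grid.best_score_, 3))

# Random search: samples a fixed number of combinations (faster)
rand = RandomizedSearchCV(SVC(), param_grid, n_iter=5, cv=5, random_state=3).fit(X, y)
print("Random search best params:", rand.best_params_)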
Importance:
Improves model accuracy and generalization.
Prevents underfitting or overfitting.
Optimizes training time and performance.
4. Discuss Any Three Libraries of Data Science
Data Science relies heavily on open-source libraries that simplify data handling, analysis, and
modeling.
1. NumPy (Numerical Python):
Used for numerical computation and array manipulation.
Provides tools for linear algebra, Fourier transform, and matrices.
Example:
import numpy as np
a = np.array([1, 2, 3])
print(a.mean())
2. Pandas:
Provides data structures like Series and DataFrame for data manipulation.
Used for reading, cleaning, and transforming structured data.
Example:
import pandas as pd
df = pd.read_csv("data.csv")
print(df.head())
3. Matplotlib / Seaborn:
Matplotlib: Used for static, publication-quality visualizations.
Seaborn: Built on Matplotlib for statistical visualization (heatmaps, box plots, etc.).
Example:
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x="Category", y="Sales", data=df)  # df from the Pandas example above
plt.show()
Other Popular Libraries:
Scikit-learn: Machine learning algorithms and model evaluation.
TensorFlow / PyTorch: Deep learning and neural networks.
Plotly: Interactive dashboards and plots.
5. Differentiate Between Underfitting and Overfitting
Aspect | Underfitting | Overfitting
Definition | Model is too simple to capture data patterns. | Model learns noise and patterns from training data too closely.
Training Accuracy | Low | Very High
Testing Accuracy | Low | Low (poor generalization)
Cause | Model complexity too low, insufficient training. | Model complexity too high, excessive training.
Example | Linear model for nonlinear data. | Deep model trained too long on small dataset.
Solution | Add more features, use a complex model. | Use regularization, early stopping, cross-validation.
Visualization Analogy:
Underfitting: Straight line through curved data.
Overfitting: Model fits every noise point in training data.
Goal:
Achieve the right balance — low bias and low variance — ensuring the model generalizes
well.