- Rohit
DSBDA Unit 3 Imp Points + PYQ’s
1. What is Driving the Data Deluge?
The term "data deluge" refers to the massive and rapid increase in the amount of data being generated, collected,
and stored in the digital age. The growth of digital technologies, connected devices, social media, and sensors is
creating a flood of data, overwhelming traditional storage and processing methods.
Examples:
1. Proliferation of Smart Devices (IoT):
Billions of Internet of Things (IoT) devices like smartwatches, smart TVs, smart meters, and industrial sensors are
generating real-time data continuously.
Example: A smart city uses traffic sensors, pollution monitors, and surveillance cameras, all producing continuous
data streams.
2. Social Media and User-Generated Content:
Platforms like Facebook, Instagram, YouTube, and Twitter encourage users to generate and share massive amounts
of data in the form of text, images, audio, and video.
Example: As of 2024, YouTube users upload 500+ hours of video every minute.
3. Cloud Computing and Storage Advancements:
Cheap and scalable storage solutions in the cloud allow organizations to collect and retain huge volumes of raw
data.
Example: Companies like Google and Amazon store customer behavior logs, purchase history, and preferences on a
massive scale.
4. Business and E-commerce Analytics:
Online platforms gather data to perform customer profiling, recommendation systems, fraud detection, etc.
Example: Amazon logs every user click, view, and purchase to enhance its recommendation engine.
5. Scientific and Medical Research:
Advanced scientific experiments, simulations, and medical imaging produce large datasets.
Example: The Large Hadron Collider (LHC) generates petabytes of data during particle collision experiments.
Medical Example: Genomic sequencing creates terabytes of data per person.
6. Multimedia Content Explosion:
High-resolution photos, 4K/8K videos, and virtual reality (VR) content significantly increase storage and
bandwidth demands.
Example: Netflix alone streams petabytes of video content daily across the globe.
2. Difference Between Data Science and Business Intelligence
• Definition:
o Data Science: An interdisciplinary field that uses algorithms, machine learning, and statistics to
extract insights and predictions from structured & unstructured data.
o Business Intelligence (BI): A process of analyzing historical and current business data to support
decision-making using dashboards, reports, and data warehouses.
• Focus:
o Data Science: Predictive and prescriptive analytics — "What will happen?" and "What should be
done?"
o Business Intelligence: Descriptive and diagnostic analytics — "What happened?" and "Why did it
happen?"
• Data Type:
o Data Science: Deals with structured, semi-structured, and unstructured data.
o Business Intelligence: Primarily works with structured data from relational databases.
• Tools & Technologies:
o Data Science: Python, R, TensorFlow, Spark, Hadoop, Jupyter, Scikit-learn.
o Business Intelligence: Power BI, Tableau, Microsoft Excel, SAP BI, QlikView, Oracle BI.
• Approach:
o Data Science: Data-driven and algorithm-centric; uses statistical modeling, data mining, AI, and
ML.
o Business Intelligence: Business-centric; uses reporting, querying, and OLAP (Online Analytical
Processing).
• Goal:
o Data Science: Discover patterns, build predictive models, and enable automation or innovation.
o Business Intelligence: Monitor performance, visualize KPIs, and support strategic decisions.
• Time Orientation:
o Data Science: Future-focused — predicts trends, behaviors, and outcomes.
o Business Intelligence: Past and present-focused — analyzes historical and real-time data.
• Users:
o Data Science: Data Scientists, AI Engineers, Machine Learning Experts.
o Business Intelligence: Business Analysts, Managers, Executives.
• Output Examples:
o Data Science: Predictive models, recommendation systems, classification algorithms.
o Business Intelligence: Dashboards, scorecards, summary reports, visual charts.
• Use Case Example:
o Data Science: Predicting customer churn using ML.
o Business Intelligence: Analyzing quarterly sales by region using bar charts.
(Write this answer in tabular format)
3. What are the Sources of Big Data?
Introduction:
Big Data refers to extremely large, complex, and fast-growing datasets. These originate from diverse sources and
are characterized by volume, variety, velocity, and veracity — the 4 V's of Big Data.
Major Sources of Big Data:
1. Social Media Platforms:
Platforms like Facebook, Twitter, YouTube, and Instagram generate vast data through posts, comments,
likes, photos, and videos.
Example: Twitter generates hundreds of millions of tweets daily.
2. IoT and Sensor Data:
Devices like smartwatches, vehicles, and city sensors continuously produce real-time data.
Example: Smart cities use sensors for traffic, pollution, and energy tracking.
3. Machine and Log Data:
Generated by servers, software, and networks, including system logs, error reports, and user access logs.
Example: Web server logs help analyze website performance and detect threats.
4. Transactional Data:
Collected from banking, retail, and e-commerce systems, including sales, payments, and bookings.
Example: Amazon logs every cart activity and transaction.
5. Mobile Devices and Apps:
Apps generate data about user location, preferences, and media use.
Example: Google Maps collects live location data for traffic analysis.
6. Multimedia Content:
Images, audio, and video from media apps, cameras, calls, and streaming platforms.
Example: Netflix tracks viewing habits for content recommendations.
7. Cloud-based Services:
Cloud systems (SaaS, IaaS, PaaS) collect user activity and resource usage.
Example: Google Drive logs file access and sharing behavior.
8. Web and Clickstream Data:
Data from user interactions on websites — clicks, time spent, and navigation paths.
Example: E-commerce sites analyze the user journey from search to checkout.
9. Public and Government Data:
Data from weather, census, transport, health, and law enforcement agencies.
Example: IMD generates climate data for forecasting and disaster planning.
10. Communication Data:
Generated from emails, chats, VoIP calls, and messaging services.
Example: WhatsApp exchanges billions of messages daily.
4. What is the Data Discovery Phase? List various activities involved in identifying potential data resources.
Introduction:
The Data Discovery Phase is the first step in data analysis where analysts identify, explore, and evaluate available
data to solve a business or research problem. It ensures only relevant, high-quality data is used in decision-making.
Definition:
Data Discovery involves locating and understanding data from multiple sources to uncover useful patterns and
insights.
Objectives:
• Find relevant and reliable data
• Understand structure and format
• Assess quality and completeness
• Detect trends and patterns
• Align data with business needs
Key Activities:
1. Data Source Identification:
Locate internal and external data sources like databases, APIs, sensors, websites, and cloud platforms.
2. Data Inventory and Cataloging:
List and describe existing data assets, including metadata like format, owner, and update frequency.
3. Stakeholder Consultation:
Collaborate with experts to understand critical datasets and align with business goals.
4. Data Profiling and Sampling:
Review samples to check types, missing values, inconsistencies, and outliers.
5. Evaluating Data Relevance:
Check if the data fits the problem — in terms of granularity, time range, and context.
6. Data Quality Assessment:
Assess accuracy, completeness, consistency, and timeliness.
7. Accessibility & Security Check:
Ensure legal, ethical, and secure access — review permissions, policies, and governance.
8. Data Relationship Mapping:
Identify links across datasets (e.g., foreign keys or common fields) for future merging.
9. Documentation:
Record findings, quality concerns, and source details for future use.
10. Tool-Based Exploration (Optional):
Use tools like Tableau Prep, Power BI, Alteryx, or Talend for visual exploration and profiling.
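A minimal sketch of the profiling and sampling activities above, assuming pandas and a hypothetical customers.csv source file (both are illustrative choices, not prescribed by the syllabus):
```python
import pandas as pd

# Load a candidate data source (file name is hypothetical)
df = pd.read_csv("customers.csv")

# Structure and format: column names, dtypes, memory footprint
df.info()

# Quality and completeness: missing values per column
print(df.isna().sum())

# Basic statistics to spot outliers and unexpected ranges
print(df.describe(include="all"))

# Inspect a small random sample instead of the full dataset
print(df.sample(5, random_state=42))
```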
5. Data Preparation Phase, Analytics Sandbox, and ETLT Process.
1. Data Preparation Phase:
This phase involves cleaning, organizing, and transforming raw data into a usable format for analysis or machine
learning. It ensures the data is accurate, complete, and ready for modeling or visualization.
2. Steps in Data Preparation:
• Data Collection: Gather data from sources like databases, APIs, files, and sensors.
• Data Integration: Combine data from multiple sources into one view.
• Data Cleaning: Fix missing values, errors, outliers, and duplicates.
• Data Transformation: Convert formats, normalize values, and create new features.
• Data Reduction: Reduce volume via feature selection, sampling, or aggregation.
• Data Validation: Check data quality and consistency.
• Data Formatting: Convert to analysis-ready formats.
• Documentation: Record all steps for reproducibility.
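A minimal sketch of the cleaning, transformation, and feature-creation steps above, using pandas on a small hypothetical dataset (column names and values are made up for illustration):
```python
import pandas as pd

# Hypothetical raw dataset; columns and values are illustrative
raw = pd.DataFrame({
    "age":    [25, None, 40, 40, 120],
    "income": [50000, 60000, None, None, 75000],
    "city":   ["Pune", "pune", "Mumbai", "Mumbai", "Delhi"],
})

# Data Cleaning: fill missing values, standardize categories, drop duplicates, cap an outlier
clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].median()).clip(upper=100)
clean["income"] = clean["income"].fillna(clean["income"].mean())
clean["city"] = clean["city"].str.title()
clean = clean.drop_duplicates()

# Data Transformation: normalize a numeric column and derive a new feature
clean["income_norm"] = (clean["income"] - clean["income"].min()) / (
    clean["income"].max() - clean["income"].min())
clean["is_senior"] = clean["age"] > 60

print(clean)
```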
3. Analytics Sandbox:
A secure, isolated environment for analysts to explore and test data without affecting production systems.
• Used for EDA, modeling, and "what-if" analysis
• Supports reproducibility and experimentation
• Protects original data
• Can be local or cloud-based
4. ETLT Process (Extract → Transform → Load → Transform):
• Extract (E): Pull raw data from source systems
• Transform (T1): Apply initial formatting before storage
• Load (L): Store data in a central warehouse or lake
• Transform (T2): Apply deeper transformations post-loading using SQL or tools.
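A minimal sketch of the ETLT flow, assuming pandas and a local SQLite database as a stand-in for the warehouse (file, table, and column names such as orders_raw.csv, order_date, and amount are hypothetical):
```python
import sqlite3
import pandas as pd

# Extract (E): pull raw data from a source system (CSV used as a stand-in)
raw = pd.read_csv("orders_raw.csv")          # hypothetical source file

# Transform (T1): light formatting before storage (column names, types)
raw.columns = [c.strip().lower() for c in raw.columns]
raw["order_date"] = pd.to_datetime(raw["order_date"])   # assumes this column exists

# Load (L): store the lightly cleaned data in a central store
conn = sqlite3.connect("warehouse.db")
raw.to_sql("orders", conn, if_exists="replace", index=False)

# Transform (T2): deeper transformation after loading, done in SQL
monthly = pd.read_sql_query(
    """
    SELECT strftime('%Y-%m', order_date) AS month,
           SUM(amount)                   AS total_sales   -- assumes an 'amount' column
    FROM orders
    GROUP BY month
    """,
    conn,
)
print(monthly)
conn.close()
```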
6. What is the Model Planning Phase? List the activities carried out and tools used in this phase.
1. Model Planning Phase:
This phase involves deciding how to solve a business problem using analytics. Data scientists choose modeling
techniques, define data inputs, plan evaluation, and prepare for model building.
Objectives:
• Choose suitable algorithms
• Plan feature engineering
• Define evaluation metrics
• Set up modeling workflow
• Prepare training/testing strategies
2. Key Activities:
• Problem Framing: Translate business goals into a modeling task (e.g., classification, clustering).
• Exploratory Data Analysis (EDA): Visualize data, identify patterns, and detect anomalies.
• Feature Planning: Select or create relevant features for modeling.
• Algorithm Selection: Choose ML/statistical methods (e.g., logistic regression, K-means).
• Data Splitting: Decide on train-test split or cross-validation.
• Evaluation Metrics: Select performance measures (e.g., Accuracy, AUC, RMSE).
• Environment Setup: Prepare tools, libraries, and compute resources.
• Documentation: Record all decisions and plans.
3. Common Tools:
• EDA & Visualization: Python (Pandas, Seaborn), R (ggplot2), Tableau, Power BI
• Statistical Analysis: R Studio, Python (StatsModels), SPSS, SAS
• Model Planning Tools: Jupyter Notebook, RapidMiner, KNIME, Weka
• Data Splitting: Scikit-learn (train_test_split), R (caret)
• Documentation: Word, Google Docs, Excel, Notion
• Cloud Sandbox: Google Colab, Azure ML Studio
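A minimal sketch of the data-splitting and evaluation-metric planning steps, assuming scikit-learn and its built-in breast cancer dataset as a stand-in for real project data:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Built-in dataset so the sketch is self-contained
X, y = load_breast_cancer(return_X_y=True)

# Data Splitting plan: 80/20 hold-out, stratified to preserve class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Evaluation plan: binary classification, so track accuracy and AUC
evaluation_plan = {"task": "classification", "metrics": ["accuracy", "roc_auc"]}

print(X_train.shape, X_test.shape)
print(evaluation_plan)
```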
7. What is the Model Building Phase? List the activities carried out and tools used in this phase.
1. Model Building Phase:
This phase involves training and evaluating machine learning or statistical models using the prepared data and
chosen algorithms.
Objectives:
• Train models on historical/labeled data
• Tune hyperparameters
• Evaluate model performance
• Save models for testing/deployment
2. Key Activities:
• Data Splitting: Split data into train/validation/test sets (e.g., 70-15-15).
• Model Coding: Implement models using Python, R, or other languages.
• Model Training: Train the model to learn from data.
• Model Evaluation: Use metrics like Accuracy, F1 Score, etc.
• Hyperparameter Tuning: Improve performance using Grid Search, Random Search, or automated tuning.
• Model Comparison: Compare different models to choose the best.
• Feature Importance: Analyze key features influencing predictions.
• Model Saving: Export trained models (e.g., with pickle, joblib).
• Documentation: Record configurations, results, and findings.
3. Tools Used:
• Languages: Python, R, Java
• ML Libraries (Python): Scikit-learn, TensorFlow, Keras, PyTorch, XGBoost
• ML Libraries (R): caret, randomForest, e1071, mlr
• Tuning Tools: GridSearchCV, Optuna, Hyperopt, R’s tuneGrid
• Notebooks: Jupyter, Google Colab, RStudio
• Visualization: Seaborn, Matplotlib, ggplot2, SHAP, LIME
• Automation: H2O.ai, TPOT, Google AutoML, Azure AutoML
4. Example Models:
• Classification: Logistic Regression, Decision Tree, SVM, Random Forest
• Regression: Linear Regression, Lasso, XGBoost
• Clustering: K-Means, DBSCAN
• Deep Learning: CNNs, RNNs
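A minimal sketch of the model building phase (training, tuning, evaluation, and saving), assuming scikit-learn, GridSearchCV, and joblib on a built-in dataset; the grid values and file name are illustrative:
```python
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Hyperparameter Tuning: small grid search with 5-fold cross-validation
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 5]},
    cv=5,
    scoring="f1",
)
grid.fit(X_train, y_train)

# Model Evaluation on the held-out test set
pred = grid.best_estimator_.predict(X_test)
print("Best params:", grid.best_params_)
print("Accuracy:", accuracy_score(y_test, pred))
print("F1 Score:", f1_score(y_test, pred))

# Model Saving for later testing/deployment
joblib.dump(grid.best_estimator_, "best_model.joblib")
```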
8. Explain the 'Communicate Results' and 'Operationalize' phases in the Data Science lifecycle.
1. Communicate Results Phase
Definition:
This phase involves presenting analytical findings to stakeholders in a clear and actionable way, bridging technical
insights with business decisions.
Objectives:
• Make insights understandable and actionable
• Guide decision-making
• Summarize findings visually and narratively
Key Activities:
• Data Visualization: Use charts, dashboards to show trends (e.g., Tableau, Power BI).
• Storytelling: Present the data journey clearly and logically.
• Model Reporting: Explain metrics like accuracy, AUC, RMSE in layman’s terms.
• Risk Analysis: Discuss data limitations, biases, and assumptions.
• Recommendations: Suggest data-driven actions.
• Interactive Dashboards: Share insights dynamically (e.g., Dash, Shiny).
Tools:
• Visualization: Tableau, Power BI, Matplotlib, Seaborn
• Docs/Slides: PowerPoint, Google Slides, Notion
• Reporting: Excel, Dash, Shiny
• Collaboration: Jupyter Notebook, Confluence
Outcome:
• Clear stakeholder understanding
• Informed decision-making
• Prepares for model deployment
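A minimal sketch of a visualization that could be shared in this phase, assuming matplotlib; the model names and accuracy values are hypothetical:
```python
import matplotlib.pyplot as plt

# Hypothetical model comparison results to be shared with stakeholders
models = ["Logistic Regression", "Random Forest", "XGBoost"]
accuracy = [0.86, 0.91, 0.93]

plt.bar(models, accuracy, color="steelblue")
plt.ylabel("Accuracy")
plt.title("Model Accuracy Comparison")
plt.ylim(0, 1)
plt.tight_layout()
plt.savefig("model_comparison.png")   # image can be embedded in a report or slide deck
```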
2. Operationalize Phase
Definition:
This is where models are deployed into production systems to generate ongoing business value.
Objectives:
• Deploy and integrate the model
• Monitor performance
• Ensure security, scalability, and maintainability
Key Activities:
• Model Deployment: Package model as API or service (e.g., Flask, FastAPI).
• Integration: Connect with business tools (e.g., CRM, dashboards).
• Monitoring: Track accuracy, usage, drift (e.g., MLflow, Prometheus).
• Retraining: Update models periodically with new data.
• Security: Apply privacy, compliance, access controls.
• User Training: Educate stakeholders on using the model effectively.
Tools & Platforms:
• Deployment: Docker, Kubernetes, AWS Lambda, Azure ML
• Monitoring: MLflow, Grafana
• CI/CD: Git, Jenkins
• Cloud: AWS SageMaker, Google AI Platform
Outcome:
• Model delivers value in real-world systems
• Supports automation and decision-making
• Ensures ROI from data science solutions
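A minimal sketch of the model deployment activity, assuming Flask and a model file saved earlier with joblib (the file name, route, and input format are hypothetical):
```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("best_model.joblib")   # hypothetical saved model

@app.route("/predict", methods=["POST"])
def predict():
    # Expect JSON like {"features": [[...], [...]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```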
9. What is ETL in Data Science?
1. Definition:
ETL stands for Extract, Transform, Load — a core data integration process used to collect data from multiple
sources, clean and format it, and load it into a target system like a data warehouse or database for analysis.
2. Components of ETL:
Extract: Data is pulled from various sources such as databases, flat files, APIs, web servers, etc.
Transform: Data is cleaned, standardized, aggregated, or enriched (e.g., handling missing values, converting
formats, deduplication).
Load: The final structured data is inserted into a storage system like a Data Warehouse, Data Lake, or Analytical
Sandbox.
3. Purpose of ETL:
Integrate data from disparate sources.
Ensure data quality and consistency.
Prepare data for analysis, visualization, and machine learning.
4. Tools Used in ETL:
Talend, Apache Nifi, Informatica, SSIS, Pentaho
Python (Pandas, Airflow), SQL Scripts
AWS Glue, Azure Data Factory, Google Dataflow
5. Example:
Extract: Sales data from MySQL, customer data from CSV
Transform: Merge datasets, remove duplicates, convert currencies
Load: Store into Snowflake data warehouse for BI reporting
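A minimal sketch of this example, using pandas with CSV files as stand-ins for the MySQL source and SQLite as a stand-in for the Snowflake warehouse (file names, columns, and the exchange rate are hypothetical):
```python
import sqlite3
import pandas as pd

# Extract: CSV files used as stand-ins for the sources above
sales = pd.read_csv("sales.csv")          # hypothetical: order_id, customer_id, amount_usd
customers = pd.read_csv("customers.csv")  # hypothetical: customer_id, name, country

# Transform: merge datasets, remove duplicates, convert currency
merged = sales.merge(customers, on="customer_id", how="left")
merged = merged.drop_duplicates()
merged["amount_inr"] = merged["amount_usd"] * 83.0   # assumed fixed exchange rate

# Load: write into a local SQLite database as a stand-in for the warehouse
conn = sqlite3.connect("warehouse.db")
merged.to_sql("sales_report", conn, if_exists="replace", index=False)
conn.close()
```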
10. Common Tools for Model Building and Model Selection for Data Analytics.
1. Model Building Tools:
Model building involves creating, training, and testing machine learning or statistical models using data. Common
tools include:
• Scikit-learn (Python): A powerful library for traditional ML algorithms (e.g., classification, regression).
• TensorFlow / Keras (Python): Deep learning frameworks for neural networks and AI models, supporting large-scale training.
• PyTorch (Python): Another deep learning library, widely used for research and production models.
• XGBoost: Gradient boosting framework known for high-performance models, especially for structured data.
• H2O.ai: A platform offering automated machine learning (AutoML) and scalable model building.
• R (caret, randomForest): R packages for model building, particularly for statistical models and decision trees.
2. Model Selection Tools:
Model selection focuses on choosing the best-performing model from a set of candidates based on evaluation
metrics.
• Grid Search: Exhaustive search to find the best hyperparameters for a given model (Scikit-learn, R).
• Random Search: Randomized search for hyperparameter tuning, faster than grid search.
• Cross-Validation: Assesses model performance by splitting data into k folds and validating across them.
• AutoML (H2O.ai, TPOT): Automatically searches for the best model and hyperparameters.
• Model Comparison: Compares different models (e.g., Random Forest vs Logistic Regression) using AUC, F1-score, RMSE.
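A minimal sketch of model comparison with cross-validation, assuming scikit-learn and its built-in breast cancer dataset; the candidate models are illustrative:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Model Comparison: 5-fold cross-validation, scored with AUC
candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(random_state=42),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```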
11. Key Roles for a Successful Analytics Project
• Data Scientist:
Develops predictive models, analyzes complex datasets, and extracts insights using machine learning and
statistics.
• Data Engineer:
Builds and maintains infrastructure (ETL pipelines, databases) to process and store data.
• Data Analyst:
Analyzes and visualizes data, generates reports, and presents business insights.
• Business Analyst:
Acts as a bridge between stakeholders and the data team, translating business needs into technical
requirements.
• Project Manager:
Manages project timelines, resources, and team coordination.
• Subject Matter Expert (SME):
Provides domain expertise to ensure the analysis aligns with business needs.
• Data Architect:
Designs the data infrastructure and ensures scalability and accessibility.
• IT Support/Operations:
Maintains the technical infrastructure and deploys models into production.
• Data Governance Specialist:
Ensures data privacy, security, and compliance with regulations.
12. Three Characteristics of Big Data
• Volume:
Refers to the large amount of data generated every day. This includes data from social media, sensors,
transactions, and more, often measured in terabytes or petabytes.
• Variety:
Refers to the different types of data generated, such as structured, unstructured, and semi-structured data
(e.g., text, images, videos, sensor data, logs).
• Velocity:
Refers to the speed at which data is generated, processed, and analyzed. This includes real-time or near-
real-time data streaming and requires fast processing.
13. Descriptive, Diagnostic, Predictive Analytics
• Descriptive Analytics
o Purpose: Explains what has happened in the past.
o Methods: Data aggregation, data mining, and reporting.
o Example: Monthly sales reports showing trends and patterns.
• Diagnostic Analytics
o Purpose: Explains why something happened.
o Methods: Root cause analysis, correlation analysis.
o Example: Analyzing why sales dropped in a specific region (e.g., competitor actions, seasonality).
• Predictive Analytics
o Purpose: Forecasts what is likely to happen in the future.
o Methods: Machine learning, regression analysis, time series forecasting.
o Example: Predicting future sales based on historical trends and seasonality.
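A minimal sketch contrasting the three types of analytics on a small hypothetical sales dataset, assuming pandas and NumPy:
```python
import numpy as np
import pandas as pd

# Hypothetical monthly sales data (12 months)
sales = pd.DataFrame({
    "month": range(1, 13),
    "sales": [100, 110, 120, 115, 130, 140, 150, 145, 160, 170, 180, 190],
    "ad_spend": [10, 11, 12, 11, 13, 14, 15, 14, 16, 17, 18, 19],
})

# Descriptive: what happened? (summary of past sales)
print("Average monthly sales:", sales["sales"].mean())

# Diagnostic: why did it happen? (correlation with advertising spend)
print("Correlation of sales with ad_spend:", sales["sales"].corr(sales["ad_spend"]))

# Predictive: what is likely to happen? (fit a linear trend and extrapolate)
slope, intercept = np.polyfit(sales["month"], sales["sales"], deg=1)
print("Forecast for month 13:", slope * 13 + intercept)
```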
14. Different Stakeholders of an Analytics Project and Their Expectations
• Business Executives/Managers
o Expectations: Actionable insights for strategic decision-making, improved business performance,
and ROI.
o Output: Clear, visual reports, dashboards, and recommendations that align with business goals.
• Data Scientists/Analysts
o Expectations: High-quality, cleaned, structured data to build accurate models and derive
meaningful insights.
o Output: Validated models, performance metrics, and detailed analytical reports.
• IT/Data Engineers
o Expectations: Scalable, secure data infrastructure and smooth integration of models into
production systems.
o Output: Well-structured data pipelines, operationalized models, and automated workflows.
• Marketing/Sales Teams
o Expectations: Insights into customer behavior, market trends, and predictive analysis for targeted
campaigns.
o Output: Targeted recommendations, segmentation analysis, and performance forecasts.
• Finance Team
o Expectations: Data-driven models to support financial forecasting, budgeting, and risk
management.
o Output: Predictive models, cost-benefit analysis, and financial risk assessments.
• End Users (e.g., Customers, Operators)
o Expectations: User-friendly interfaces, accurate predictions, and solutions that meet practical
needs.
o Output: Dashboards, real-time insights, and product recommendations.
• Regulatory/Compliance Authorities
o Expectations: Data privacy and compliance with industry regulations (e.g., GDPR).
o Output: Secure data practices, audits, and transparent reporting on model use and data handling.
15. Linear Regression and Difference Between Simple and Multiple Linear Regression
1. Linear Regression
o Definition: Linear Regression is a statistical method to model the relationship between a
dependent variable (Y) and one or more independent variables (X) by fitting a linear equation to
observed data.
o Formula:
Y = β₀ + β₁X + ε
Where:
▪ Y = Dependent variable
▪ X = Independent variable
▪ β₀ = Intercept
▪ β₁ = Coefficient of X
▪ ε = Error term
2. Simple Linear Regression
o Definition: A type of linear regression where the model is based on only one independent variable
to predict the dependent variable.
o Formula:
Y = β₀ + β₁X
o Use Case: Predicting outcomes based on a single factor (e.g., predicting house price based on
square footage).
3. Multiple Linear Regression
o Definition: A type of linear regression where the model uses two or more independent variables to
predict the dependent variable.
o Formula:
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ
o Use Case: Predicting outcomes based on multiple factors (e.g., predicting house price based on
square footage, location, and number of rooms).
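A minimal sketch of simple vs. multiple linear regression, assuming scikit-learn and NumPy with a small hypothetical housing dataset:
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical housing data: square footage, number of rooms, price
sqft  = np.array([1000, 1500, 1800, 2400, 3000]).reshape(-1, 1)
rooms = np.array([2, 3, 3, 4, 5]).reshape(-1, 1)
price = np.array([200000, 280000, 320000, 410000, 500000])

# Simple Linear Regression: one independent variable (square footage)
simple = LinearRegression().fit(sqft, price)
print("Simple: intercept =", round(simple.intercept_),
      "coefficient =", round(simple.coef_[0]))

# Multiple Linear Regression: two independent variables (sqft and rooms)
X = np.hstack([sqft, rooms])
multiple = LinearRegression().fit(X, price)
print("Multiple: intercept =", round(multiple.intercept_),
      "coefficients =", np.round(multiple.coef_))
```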