- Rohit
DSBDA Unit 3 Imp Points + PYQ’s
1. What is Driving the Data Deluge?
The term "data deluge" refers to the massive and rapid increase in the amount of data being generated, collected,
and stored in the digital age. The growth of digital technologies, connected devices, social media, and sensors is
creating a flood of data, overwhelming traditional storage and processing methods.
Examples:
1. Proliferation of Smart Devices (IoT):
Billions of Internet of Things (IoT) devices like smartwatches, smart TVs, smart meters, and industrial sensors are
generating real-time data continuously.
Example: A smart city uses traffic sensors, pollution monitors, and surveillance cameras, all producing continuous
data streams.
2. Social Media and User-Generated Content:
Platforms like Facebook, Instagram, YouTube, and Twitter encourage users to generate and share massive amounts
of data in the form of text, images, audio, and video.
Example: As of 2024, YouTube users upload 500+ hours of video every minute.
3. Cloud Computing and Storage Advancements:
Cheap and scalable storage solutions in the cloud allow organizations to collect and retain huge volumes of raw
data.
Example: Companies like Google and Amazon store customer behavior logs, purchase history, and preferences on a
massive scale.
4. Business and E-commerce Analytics:
Online platforms gather data to perform customer profiling, recommendation systems, fraud detection, etc.
Example: Amazon logs every user click, view, and purchase to enhance its recommendation engine.
5. Scientific and Medical Research:
Advanced scientific experiments, simulations, and medical imaging produce large datasets.
Example: The Large Hadron Collider (LHC) generates petabytes of data during particle collision experiments.
Medical Example: Genomic sequencing creates terabytes of data per person.
6. Multimedia Content Explosion:
High-resolution photos, 4K/8K videos, and virtual reality (VR) content significantly increase storage and
bandwidth demands.
Example: Netflix alone streams petabytes of video content daily across the globe.
2. Difference Between Data Science and Business Intelligence
• Definition:
o Data Science: An interdisciplinary field that uses algorithms, machine learning, and statistics to
extract insights and predictions from structured & unstructured data.
o Business Intelligence (BI): A process of analyzing historical and current business data to support
decision-making using dashboards, reports, and data warehouses.
• Focus:
o Data Science: Predictive and prescriptive analytics — "What will happen?" and "What should be
done?"
o Business Intelligence: Descriptive and diagnostic analytics — "What happened?" and "Why did it
happen?"
• Data Type:
o Data Science: Deals with structured, semi-structured, and unstructured data.
o Business Intelligence: Primarily works with structured data from relational databases.
• Tools & Technologies:
o Data Science: Python, R, TensorFlow, Spark, Hadoop, Jupyter, Scikit-learn.
o Business Intelligence: Power BI, Tableau, Microsoft Excel, SAP BI, QlikView, Oracle BI.
• Approach:
o Data Science: Data-driven and algorithm-centric; uses statistical modeling, data mining, AI, and
ML.
o Business Intelligence: Business-centric; uses reporting, querying, and OLAP (Online Analytical
Processing).
• Goal:
o Data Science: Discover patterns, build predictive models, and enable automation or innovation.
o Business Intelligence: Monitor performance, visualize KPIs, and support strategic decisions.
• Time Orientation:
o Data Science: Future-focused — predicts trends, behaviors, and outcomes.
o Business Intelligence: Past and present-focused — analyzes historical and real-time data.
• Users:
o Data Science: Data Scientists, AI Engineers, Machine Learning Experts.
o Business Intelligence: Business Analysts, Managers, Executives.
• Output Examples:
o Data Science: Predictive models, recommendation systems, classification algorithms.
o Business Intelligence: Dashboards, scorecards, summary reports, visual charts.
• Use Case Example:
o Data Science: Predicting customer churn using ML.
o Business Intelligence: Analyzing quarterly sales by region using bar charts.
(Write this answer in tabular format)
3. What are the Sources of Big Data?
Introduction:
Big Data refers to extremely large, complex, and fast-growing datasets. These originate from diverse sources and
are characterized by volume, variety, velocity, and veracity — the 4 V's of Big Data.
Major Sources of Big Data:
1. Social Media Platforms:
Platforms like Facebook, Twitter, YouTube, and Instagram generate vast data through posts, comments,
likes, photos, and videos.
Example: Twitter generates hundreds of millions of tweets daily.
2. IoT and Sensor Data:
Devices like smartwatches, vehicles, and city sensors continuously produce real-time data.
Example: Smart cities use sensors for traffic, pollution, and energy tracking.
3. Machine and Log Data:
Generated by servers, software, and networks, including system logs, error reports, and user access logs.
Example: Web server logs help analyze website performance and detect threats.
4. Transactional Data:
Collected from banking, retail, and e-commerce systems, including sales, payments, and bookings.
Example: Amazon logs every cart activity and transaction.
5. Mobile Devices and Apps:
Apps generate data about user location, preferences, and media use.
Example: Google Maps collects live location data for traffic analysis.
6. Multimedia Content:
Images, audio, and video from media apps, cameras, calls, and streaming platforms.
Example: Netflix tracks viewing habits for content recommendations.
7. Cloud-based Services:
Cloud systems (SaaS, IaaS, PaaS) collect user activity and resource usage.
Example: Google Drive logs file access and sharing behavior.
8. Web and Clickstream Data:
Data from user interactions on websites — clicks, time spent, and navigation paths.
Example: E-commerce sites analyze the user journey from search to checkout.
9. Public and Government Data:
Data from weather, census, transport, health, and law enforcement agencies.
Example: IMD generates climate data for forecasting and disaster planning.
10. Communication Data:
Generated from emails, chats, VoIP calls, and messaging services.
Example: WhatsApp exchanges billions of messages daily.
4. What is the Data Discovery Phase? List various activities involved in identifying potential data resources.
Introduction:
The Data Discovery Phase is the first step in data analysis where analysts identify, explore, and evaluate available
data to solve a business or research problem. It ensures only relevant, high-quality data is used in decision-making.
Definition:
Data Discovery involves locating and understanding data from multiple sources to uncover useful patterns and
insights.
Objectives:
• Find relevant and reliable data
• Understand structure and format
• Assess quality and completeness
• Detect trends and patterns
• Align data with business needs
Key Activities:
1. Data Source Identification:
Locate internal and external data sources like databases, APIs, sensors, websites, and cloud platforms.
2. Data Inventory and Cataloging:
List and describe existing data assets, including metadata like format, owner, and update frequency.
3. Stakeholder Consultation:
Collaborate with experts to understand critical datasets and align with business goals.
4. Data Profiling and Sampling:
Review samples to check types, missing values, inconsistencies, and outliers.
5. Evaluating Data Relevance:
Check if the data fits the problem — in terms of granularity, time range, and context.
6. Data Quality Assessment:
Assess accuracy, completeness, consistency, and timeliness.
7. Accessibility & Security Check:
Ensure legal, ethical, and secure access — review permissions, policies, and governance.
8. Data Relationship Mapping:
Identify links across datasets (e.g., foreign keys or common fields) for future merging.
9. Documentation:
Record findings, quality concerns, and source details for future use.
10. Tool-Based Exploration (Optional):
Use tools like Tableau Prep, Power BI, Alteryx, or Talend for visual exploration and profiling.
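A minimal sketch of the profiling and sampling activities above, assuming pandas and a hypothetical customers.csv source file (both are illustrative choices, not prescribed by the syllabus):
```python
import pandas as pd

# Load a candidate data source (file name is hypothetical)
df = pd.read_csv("customers.csv")

# Structure and format: column names, dtypes, memory footprint
df.info()

# Quality and completeness: missing values per column
print(df.isna().sum())

# Basic statistics to spot outliers and unexpected ranges
print(df.describe(include="all"))

# Inspect a small random sample instead of the full dataset
print(df.sample(5, random_state=42))
```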
5. Data Preparation Phase, Analytics Sandbox, and ETLT Process.
1. Data Preparation Phase:
This phase involves cleaning, organizing, and transforming raw data into a usable format for analysis or machine
learning. It ensures the data is accurate, complete, and ready for modeling or visualization.
2. Steps in Data Preparation:
• Data Collection: Gather data from sources like databases, APIs, files, and sensors.
• Data Integration: Combine data from multiple sources into one view.
• Data Cleaning: Fix missing values, errors, outliers, and duplicates.
• Data Transformation: Convert formats, normalize values, and create new features.
• Data Reduction: Reduce volume via feature selection, sampling, or aggregation.
• Data Validation: Check data quality and consistency.
• Data Formatting: Convert to analysis-ready formats.
• Documentation: Record all steps for reproducibility.
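A minimal sketch of the cleaning, transformation, and feature-creation steps above, using pandas on a small hypothetical dataset (column names and values are made up for illustration):
```python
import pandas as pd

# Hypothetical raw dataset; columns and values are illustrative
raw = pd.DataFrame({
    "age":    [25, None, 40, 40, 120],
    "income": [50000, 60000, None, None, 75000],
    "city":   ["Pune", "pune", "Mumbai", "Mumbai", "Delhi"],
})

# Data Cleaning: fill missing values, standardize categories, drop duplicates, cap an outlier
clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].median()).clip(upper=100)
clean["income"] = clean["income"].fillna(clean["income"].mean())
clean["city"] = clean["city"].str.title()
clean = clean.drop_duplicates()

# Data Transformation: normalize a numeric column and derive a new feature
clean["income_norm"] = (clean["income"] - clean["income"].min()) / (
    clean["income"].max() - clean["income"].min())
clean["is_senior"] = clean["age"] > 60

print(clean)
```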
3. Analytics Sandbox:
A secure, isolated environment for analysts to explore and test data without affecting production systems.
• Used for EDA, modeling, and "what-if" analysis
• Supports reproducibility and experimentation
• Protects original data
• Can be local or cloud-based
4. ETLT Process (Extract → Transform → Load → Transform):
• Extract (E): Pull raw data from source systems
• Transform (T1): Apply initial formatting before storage
• Load (L): Store data in a central warehouse or lake
• Transform (T2): Apply deeper transformations post-loading using SQL or tools.
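A minimal sketch of the ETLT flow, assuming pandas and a local SQLite database as a stand-in for the warehouse (file, table, and column names such as orders_raw.csv, order_date, and amount are hypothetical):
```python
import sqlite3
import pandas as pd

# Extract (E): pull raw data from a source system (CSV used as a stand-in)
raw = pd.read_csv("orders_raw.csv")          # hypothetical source file

# Transform (T1): light formatting before storage (column names, types)
raw.columns = [c.strip().lower() for c in raw.columns]
raw["order_date"] = pd.to_datetime(raw["order_date"])   # assumes this column exists

# Load (L): store the lightly cleaned data in a central store
conn = sqlite3.connect("warehouse.db")
raw.to_sql("orders", conn, if_exists="replace", index=False)

# Transform (T2): deeper transformation after loading, done in SQL
monthly = pd.read_sql_query(
    """
    SELECT strftime('%Y-%m', order_date) AS month,
           SUM(amount)                   AS total_sales   -- assumes an 'amount' column
    FROM orders
    GROUP BY month
    """,
    conn,
)
print(monthly)
conn.close()
```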
6. What is the Model Planning Phase? List the activities carried out and tools used in this phase.
1. Model Planning Phase:
This phase involves deciding how to solve a business problem using analytics. Data scientists choose modeling
techniques, define data inputs, plan evaluation, and prepare for model building.
Objectives:
• Choose suitable algorithms
• Plan feature engineering
• Define evaluation metrics
• Set up modeling workflow
• Prepare training/testing strategies
2. Key Activities:
• Problem Framing: Translate business goals into a modeling task (e.g., classification, clustering).
• Exploratory Data Analysis (EDA): Visualize data, identify patterns, and detect anomalies.
• Feature Planning: Select or create relevant features for modeling.
• Algorithm Selection: Choose ML/statistical methods (e.g., logistic regression, K-means).
• Data Splitting: Decide on train-test split or cross-validation.
• Evaluation Metrics: Select performance measures (e.g., Accuracy, AUC, RMSE).
• Environment Setup: Prepare tools, libraries, and compute resources.
• Documentation: Record all decisions and plans.
3. Common Tools:
• EDA & Visualization: Python (Pandas, Seaborn), R (ggplot2), Tableau, Power BI
• Statistical Analysis: R Studio, Python (StatsModels), SPSS, SAS
• Model Planning Tools: Jupyter Notebook, RapidMiner, KNIME, Weka
• Data Splitting: Scikit-learn (train_test_split), R (caret)
• Documentation: Word, Google Docs, Excel, Notion
• Cloud Sandbox: Google Colab, Azure ML Studio
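A minimal sketch of the data-splitting and evaluation-metric planning steps, assuming scikit-learn and its built-in breast cancer dataset as a stand-in for real project data:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Built-in dataset so the sketch is self-contained
X, y = load_breast_cancer(return_X_y=True)

# Data Splitting plan: 80/20 hold-out, stratified to preserve class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Evaluation plan: binary classification, so track accuracy and AUC
evaluation_plan = {"task": "classification", "metrics": ["accuracy", "roc_auc"]}

print(X_train.shape, X_test.shape)
print(evaluation_plan)
```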
7. What is the Model Building Phase? List the activities carried out and tools used in this phase.
1. Model Building Phase:
This phase involves training and evaluating machine learning or statistical models using the prepared data and
chosen algorithms.
Objectives:
• Train models on historical/labeled data
• Tune hyperparameters
• Evaluate model performance
• Save models for testing/deployment
2. Key Activities:
• Data Splitting: Split data into train/validation/test sets (e.g., 70-15-15).
• Model Coding: Implement models using Python, R, or other languages.
• Model Training: Train the model to learn from data.
• Model Evaluation: Use metrics like Accuracy, F1 Score, etc.
• Hyperparameter Tuning: Improve performance using Grid Search, Random Search, or automated tuning.
• Model Comparison: Compare different models to choose the best.
• Feature Importance: Analyze key features influencing predictions.
• Model Saving: Export trained models (e.g., with pickle, joblib).
• Documentation: Record configurations, results, and findings.
3. Tools Used:
• Languages: Python, R, Java
• ML Libraries (Python): Scikit-learn, TensorFlow, Keras, PyTorch, XGBoost
• ML Libraries (R): caret, randomForest, e1071, mlr
• Tuning Tools: GridSearchCV, Optuna, Hyperopt, R’s tuneGrid
• Notebooks: Jupyter, Google Colab, RStudio
• Visualization: Seaborn, Matplotlib, ggplot2, SHAP, LIME
• Automation: H2O.ai, TPOT, Google AutoML, Azure AutoML
4. Example Models:
• Classification: Logistic Regression, Decision Tree, SVM, Random Forest
• Regression: Linear Regression, Lasso, XGBoost
• Clustering: K-Means, DBSCAN
• Deep Learning: CNNs, RNNs
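A minimal sketch of the model building phase (training, tuning, evaluation, and saving), assuming scikit-learn, GridSearchCV, and joblib on a built-in dataset; the grid values and file name are illustrative:
```python
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Hyperparameter Tuning: small grid search with 5-fold cross-validation
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 5]},
    cv=5,
    scoring="f1",
)
grid.fit(X_train, y_train)

# Model Evaluation on the held-out test set
pred = grid.best_estimator_.predict(X_test)
print("Best params:", grid.best_params_)
print("Accuracy:", accuracy_score(y_test, pred))
print("F1 Score:", f1_score(y_test, pred))

# Model Saving for later testing/deployment
joblib.dump(grid.best_estimator_, "best_model.joblib")
```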
8. Explain the 'Communicate Results' and 'Operationalize' phases in the Data Science lifecycle.
1. Communicate Results Phase
Definition:
This phase involves presenting analytical findings to stakeholders in a clear and actionable way, bridging technical
insights with business decisions.
Objectives:
• Make insights understandable and actionable
• Guide decision-making
• Summarize findings visually and narratively
Key Activities:
• Data Visualization: Use charts, dashboards to show trends (e.g., Tableau, Power BI).
• Storytelling: Present the data journey clearly and logically.
• Model Reporting: Explain metrics like accuracy, AUC, RMSE in layman’s terms.
• Risk Analysis: Discuss data limitations, biases, and assumptions.
• Recommendations: Suggest data-driven actions.
• Interactive Dashboards: Share insights dynamically (e.g., Dash, Shiny).
Tools:
• Visualization: Tableau, Power BI, Matplotlib, Seaborn
• Docs/Slides: PowerPoint, Google Slides, Notion
• Reporting: Excel, Dash, Shiny
• Collaboration: Jupyter Notebook, Confluence
Outcome:
• Clear stakeholder understanding
• Informed decision-making
• Prepares for model deployment
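A minimal sketch of a visualization that could be shared in this phase, assuming matplotlib; the model names and accuracy values are hypothetical:
```python
import matplotlib.pyplot as plt

# Hypothetical model comparison results to be shared with stakeholders
models = ["Logistic Regression", "Random Forest", "XGBoost"]
accuracy = [0.86, 0.91, 0.93]

plt.bar(models, accuracy, color="steelblue")
plt.ylabel("Accuracy")
plt.title("Model Accuracy Comparison")
plt.ylim(0, 1)
plt.tight_layout()
plt.savefig("model_comparison.png")   # image can be embedded in a report or slide deck
```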
2. Operationalize Phase
Definition:
This is where models are deployed into production systems to generate ongoing business value.
Objectives:
• Deploy and integrate the model
• Monitor performance
• Ensure security, scalability, and maintainability
Key Activities:
• Model Deployment: Package model as API or service (e.g., Flask, FastAPI).
• Integration: Connect with business tools (e.g., CRM, dashboards).
• Monitoring: Track accuracy, usage, drift (e.g., MLflow, Prometheus).
• Retraining: Update models periodically with new data.
• Security: Apply privacy, compliance, access controls.
• User Training: Educate stakeholders on using the model effectively.
Tools & Platforms:
• Deployment: Docker, Kubernetes, AWS Lambda, Azure ML
• Monitoring: MLflow, Grafana
• CI/CD: Git, Jenkins
• Cloud: AWS SageMaker, Google AI Platform
Outcome:
• Model delivers value in real-world systems
• Supports automation and decision-making
• Ensures ROI from data science solutions
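A minimal sketch of the model deployment activity, assuming Flask and a model file saved earlier with joblib (the file name, route, and input format are hypothetical):
```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("best_model.joblib")   # hypothetical saved model

@app.route("/predict", methods=["POST"])
def predict():
    # Expect JSON like {"features": [[...], [...]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```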
9. What is ETL in Data Science?
1. Definition:
ETL stands for Extract, Transform, Load — a core data integration process used to collect data from multiple
sources, clean and format it, and load it into a target system like a data warehouse or database for analysis.
2. Components of ETL:
Extract: Data is pulled from various sources such as databases, flat files, APIs, web servers, etc.
Transform: Data is cleaned, standardized, aggregated, or enriched (e.g., handling missing values, converting
formats, deduplication).
Load: The final structured data is inserted into a storage system like a Data Warehouse, Data Lake, or Analytical
Sandbox.
3. Purpose of ETL:
Integrate data from disparate sources.
Ensure data quality and consistency.
Prepare data for analysis, visualization, and machine learning.
4. Tools Used in ETL:
Talend, Apache Nifi, Informatica, SSIS, Pentaho
Python (Pandas, Airflow), SQL Scripts
AWS Glue, Azure Data Factory, Google Dataflow
5. Example:
Extract: Sales data from MySQL, customer data from CSV
Transform: Merge datasets, remove duplicates, convert currencies
Load: Store into Snowflake data warehouse for BI reporting
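A minimal sketch of this example, using pandas with CSV files as stand-ins for the MySQL source and SQLite as a stand-in for the Snowflake warehouse (file names, columns, and the exchange rate are hypothetical):
```python
import sqlite3
import pandas as pd

# Extract: CSV files used as stand-ins for the sources above
sales = pd.read_csv("sales.csv")          # hypothetical: order_id, customer_id, amount_usd
customers = pd.read_csv("customers.csv")  # hypothetical: customer_id, name, country

# Transform: merge datasets, remove duplicates, convert currency
merged = sales.merge(customers, on="customer_id", how="left")
merged = merged.drop_duplicates()
merged["amount_inr"] = merged["amount_usd"] * 83.0   # assumed fixed exchange rate

# Load: write into a local SQLite database as a stand-in for the warehouse
conn = sqlite3.connect("warehouse.db")
merged.to_sql("sales_report", conn, if_exists="replace", index=False)
conn.close()
```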
10. Common Tools for Model Building and Model Selection for Data Analytics.
1. Model Building Tools:
Model building involves creating, training, and testing machine learning or statistical models using data. Common
tools include:
• Scikit-learn (Python): A powerful library for traditional ML algorithms (e.g., classification, regression).
• TensorFlow / Keras (Python): Deep learning frameworks for neural networks and AI models, supporting large-scale training.
• PyTorch (Python): Another deep learning library, widely used for research and production models.
• XGBoost: Gradient boosting framework known for high-performance models, especially for structured data.
• H2O.ai: A platform offering automated machine learning (AutoML) and scalable model building.
• R (caret, randomForest): R packages for model building, particularly for statistical models and decision trees.
2. Model Selection Tools:
Model selection focuses on choosing the best-performing model from a set of candidates based on evaluation
metrics.
• Grid Search: Exhaustive search to find the best hyperparameters for a given model (Scikit-learn, R).
• Random Search: Randomized search for hyperparameter tuning, faster than grid search.
• Cross-Validation: Assesses model performance by splitting data into k folds and validating across them.
• AutoML (H2O.ai, TPOT): Automatically searches for the best model and hyperparameters.
• Model Comparison: Compares different models (e.g., Random Forest vs Logistic Regression) using AUC, F1-score, RMSE.
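A minimal sketch of model comparison with cross-validation, assuming scikit-learn and its built-in breast cancer dataset; the candidate models are illustrative:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Model Comparison: 5-fold cross-validation, scored with AUC
candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(random_state=42),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```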
11. Key Roles for a Successful Analytics Project
• Data Scientist:
Develops predictive models, analyzes complex datasets, and extracts insights using machine learning and
statistics.
• Data Engineer:
Builds and maintains infrastructure (ETL pipelines, databases) to process and store data.
• Data Analyst:
Analyzes and visualizes data, generates reports, and presents business insights.
• Business Analyst:
Acts as a bridge between stakeholders and the data team, translating business needs into technical
requirements.
• Project Manager:
Manages project timelines, resources, and team coordination.
• Subject Matter Expert (SME):
Provides domain expertise to ensure the analysis aligns with business needs.
• Data Architect:
Designs the data infrastructure and ensures scalability and accessibility.
• IT Support/Operations:
Maintains the technical infrastructure and deploys models into production.
• Data Governance Specialist:
Ensures data privacy, security, and compliance with regulations.
12. Three Characteristics of Big Data
• Volume:
Refers to the large amount of data generated every day. This includes data from social media, sensors,
transactions, and more, often measured in terabytes or petabytes.
• Variety:
Refers to the different types of data generated, such as structured, unstructured, and semi-structured data
(e.g., text, images, videos, sensor data, logs).
• Velocity:
Refers to the speed at which data is generated, processed, and analyzed. This includes real-time or near-
real-time data streaming and requires fast processing.
13. Descriptive, Diagnostic, Predictive Analytics
• Descriptive Analytics
o Purpose: Explains what has happened in the past.
o Methods: Data aggregation, data mining, and reporting.
o Example: Monthly sales reports showing trends and patterns.
• Diagnostic Analytics
o Purpose: Explains why something happened.
o Methods: Root cause analysis, correlation analysis.
o Example: Analyzing why sales dropped in a specific region (e.g., competitor actions, seasonality).
• Predictive Analytics
o Purpose: Forecasts what is likely to happen in the future.
o Methods: Machine learning, regression analysis, time series forecasting.
o Example: Predicting future sales based on historical trends and seasonality.
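A minimal sketch contrasting the three types of analytics on a small hypothetical sales dataset, assuming pandas and NumPy:
```python
import numpy as np
import pandas as pd

# Hypothetical monthly sales data (12 months)
sales = pd.DataFrame({
    "month": range(1, 13),
    "sales": [100, 110, 120, 115, 130, 140, 150, 145, 160, 170, 180, 190],
    "ad_spend": [10, 11, 12, 11, 13, 14, 15, 14, 16, 17, 18, 19],
})

# Descriptive: what happened? (summary of past sales)
print("Average monthly sales:", sales["sales"].mean())

# Diagnostic: why did it happen? (correlation with advertising spend)
print("Correlation of sales with ad_spend:", sales["sales"].corr(sales["ad_spend"]))

# Predictive: what is likely to happen? (fit a linear trend and extrapolate)
slope, intercept = np.polyfit(sales["month"], sales["sales"], deg=1)
print("Forecast for month 13:", slope * 13 + intercept)
```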
14. Different Stakeholders of an Analytics Project and Their Expectations
• Business Executives/Managers
o Expectations: Actionable insights for strategic decision-making, improved business performance,
and ROI.
o Output: Clear, visual reports, dashboards, and recommendations that align with business goals.
• Data Scientists/Analysts
o Expectations: High-quality, cleaned, structured data to build accurate models and derive
meaningful insights.
o Output: Validated models, performance metrics, and detailed analytical reports.
• IT/Data Engineers
o Expectations: Scalable, secure data infrastructure and smooth integration of models into
production systems.
o Output: Well-structured data pipelines, operationalized models, and automated workflows.
• Marketing/Sales Teams
o Expectations: Insights into customer behavior, market trends, and predictive analysis for targeted
campaigns.
o Output: Targeted recommendations, segmentation analysis, and performance forecasts.
• Finance Team
o Expectations: Data-driven models to support financial forecasting, budgeting, and risk
management.
o Output: Predictive models, cost-benefit analysis, and financial risk assessments.
• End Users (e.g., Customers, Operators)
o Expectations: User-friendly interfaces, accurate predictions, and solutions that meet practical
needs.
o Output: Dashboards, real-time insights, and product recommendations.
• Regulatory/Compliance Authorities
o Expectations: Data privacy and compliance with industry regulations (e.g., GDPR).
o Output: Secure data practices, audits, and transparent reporting on model use and data handling.
15. Linear Regression and Difference Between Simple and Multiple Linear Regression
1. Linear Regression
o Definition: Linear Regression is a statistical method to model the relationship between a
dependent variable (Y) and one or more independent variables (X) by fitting a linear equation to
observed data.
o Formula:
Y = β₀ + β₁X + ε
Where:
▪ Y = Dependent variable
▪ X = Independent variable
▪ β₀ = Intercept
▪ β₁ = Coefficient of X
▪ ε = Error term
2. Simple Linear Regression
o Definition: A type of linear regression where the model is based on only one independent variable
to predict the dependent variable.
o Formula:
Y = β₀ + β₁X
o Use Case: Predicting outcomes based on a single factor (e.g., predicting house price based on
square footage).
3. Multiple Linear Regression
o Definition: A type of linear regression where the model uses two or more independent variables to
predict the dependent variable.
o Formula:
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ
o Use Case: Predicting outcomes based on multiple factors (e.g., predicting house price based on
square footage, location, and number of rooms).
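A minimal sketch of simple vs. multiple linear regression, assuming scikit-learn and NumPy with a small hypothetical housing dataset:
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical housing data: square footage, number of rooms, price
sqft  = np.array([1000, 1500, 1800, 2400, 3000]).reshape(-1, 1)
rooms = np.array([2, 3, 3, 4, 5]).reshape(-1, 1)
price = np.array([200000, 280000, 320000, 410000, 500000])

# Simple Linear Regression: one independent variable (square footage)
simple = LinearRegression().fit(sqft, price)
print("Simple: intercept =", round(simple.intercept_),
      "coefficient =", round(simple.coef_[0]))

# Multiple Linear Regression: two independent variables (sqft and rooms)
X = np.hstack([sqft, rooms])
multiple = LinearRegression().fit(X, price)
print("Multiple: intercept =", round(multiple.intercept_),
      "coefficients =", np.round(multiple.coef_))
```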