
Unit 2: Data Science Methodology

An Analytic Approach to the Capstone Project


In a data science project, we follow a step-by-step methodology (framework) to turn a real-world question
into a working solution. One popular framework is IBM’s 10‑stage Data Science Methodology (by John
Rollins), which guides you through defining the problem, collecting and preparing data, building models, and
deploying results 1 . Each stage answers key questions (for example, “What problem are we solving?” in
Business Understanding or “What data do we need?” in Data Requirements) 2 . The overall process is
iterative – we often loop back to earlier steps to refine our work 3.

Diagram: Example data science process. Raw data is collected and processed into a clean dataset. Exploratory
analysis and modeling produce a data product or visualization, which leads to decisions. Starting with a
business question, data is gathered and processed; we then explore the data, build models, and create
reports or apps. In practice this works in stages such as Business Understanding, Data Collection, Data
Preparation, Modeling, Evaluation, Deployment, and Feedback 2 1 . For example, at each step we
might ask “What are we trying to predict?” (Business Understanding) and “How will we measure
success?” (Evaluation). This keeps the project focused and structured.

• Business Understanding (Stage 1): Define the problem and goals from a real-world perspective.
Involve stakeholders to set objectives and success criteria 4 . For example, a company might ask:
“How can we increase snack sales by 10%?” or, in a student project, “Can we predict the cuisine of a
dish from its ingredients?” 4 .
• Analytic Approach (Stage 2): Choose how to solve the problem with data. Translate the problem
into a data science task (e.g. classification, regression, clustering). For example, “identifying cuisine
from ingredients” is a classification problem (we predict categories like Italian, Chinese, etc.) 5 . If
the goal were “predict tomorrow’s sales amount”, that would be a regression problem (predicting a
number) 5 .
• Data Requirements (Stage 3): Decide what data you need. Based on the analytic approach, list the
needed variables (features) and labels. For the cuisine example, you might need a table of recipes
with ingredient lists and cuisine type. Data could be structured (organized tables of numbers and
categories) or unstructured (free-form text, images, etc.) 6 . Structured data fits neatly in rows/
columns (like a spreadsheet of ingredient quantities), whereas unstructured data has no fixed format
(like recipe instructions or photos) 6 .
• Data Collection (Stage 4): Gather the required data. This could mean reading CSV files, querying
databases, scraping web pages, or collecting new measurements. For example, you might download
an online recipe database or survey people for likes/dislikes. Sometimes you can collect more data if
needed – modern tools often allow using very large datasets 7 .
• Data Understanding (Stage 5): Explore and summarize the data you collected. Use simple
descriptive statistics (counts, averages) and charts to see what the data looks like. Check for missing
or strange values, and make sure the data matches the problem. For example, plot how many
recipes belong to each cuisine to see if some cuisines are very rare. This stage helps spot errors or
new needs (e.g. you might realize you need more data for under-represented cuisines) 8 .
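
For instance, a quick way to check class balance at this stage is to count recipes per cuisine and plot the result. The sketch below assumes a hypothetical pandas DataFrame named recipes with a cuisine column; a real project would load its own dataset instead.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset: one row per recipe, with a 'cuisine' label column.
recipes = pd.DataFrame({
    "cuisine": ["italian", "mexican", "italian", "chinese", "italian"],
    "ingredients": [
        "tomato basil pasta", "tortilla beans salsa",
        "tomato mozzarella", "soy sauce ginger rice", "olive oil garlic pasta",
    ],
})

# Count how many recipes belong to each cuisine to spot under-represented classes.
counts = recipes["cuisine"].value_counts()
print(counts)

# A simple bar chart makes rare cuisines easy to see.
counts.plot(kind="bar", title="Recipes per cuisine")
plt.show()
```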

• Data Preparation (Stage 6): Clean and transform the data into a form suitable for modeling 9 .
Common tasks include handling missing values, removing duplicates, converting categories to
numbers, and engineering new features. For instance, you might turn each list of ingredients into
binary features (e.g. “tomato_present = yes/no”) or count the number of spicy ingredients. This often
takes the most time in a project 10 . Feature engineering (creating new explanatory variables) and
text processing (like extracting keywords from recipe names) happen here to improve model
performance 11 .
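
As a small sketch of this step, assuming each recipe stores its ingredients as one text string (the recipes frame and column names below are made up for illustration), we could create one binary column per ingredient of interest:

```python
import pandas as pd

recipes = pd.DataFrame({
    "cuisine": ["italian", "mexican", "chinese"],
    "ingredients": ["tomato basil pasta", "tortilla beans salsa", "soy sauce ginger"],
})

# One binary (0/1) feature per ingredient we care about, e.g. tomato_present.
for ingredient in ["tomato", "basil", "beans", "soy sauce"]:
    col = ingredient.replace(" ", "_") + "_present"
    recipes[col] = recipes["ingredients"].str.contains(ingredient).astype(int)

print(recipes)
```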

• Modeling (Stage 7): Build data models that address the analytic approach 12 . If doing classification
(like cuisine prediction), you might train a decision tree or k-nearest neighbors model using the
training set of your data. For regression (e.g. predicting price), you might use linear regression.
(Classification vs. regression is a key distinction: classification predicts categories, while regression
predicts numeric values 13 .) In unsupervised tasks (no clear label), you might do clustering
(grouping similar items) or dimensionality reduction. During modeling, you usually split your data
into a training set and a test set – the training set teaches the model, and the test set (kept separate) lets you evaluate how well the
model works on new data. You may also try multiple algorithms or parameters to find the best
performing model 14 .

• Evaluation (Stage 8): Assess how well the model solves the original problem 15 . For classification
models, common metrics include accuracy (percent correctly predicted), precision (how many
predicted positives are true positives), recall (how many actual positives were found), and the
F1‑score (harmonic mean of precision and recall) 16 17 18 . For regression models, common
metrics are Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared
Error (RMSE) (all measuring prediction error) 19 . For example, if our cuisine classifier is correct 80%
of the time, its accuracy is 0.80 16 . Precision and recall give insight in imbalanced cases (e.g. if some
cuisines are rare) 20 17 . We often use Train/Test Split and Cross-Validation to reliably estimate
these metrics (see below). If the model doesn’t meet our goals, we may return to earlier stages to
tweak the data or model and try again 3 .

Diagram: Splitting data into training and test sets. To evaluate a model, we split labeled data into a Training
set (used to build/train the model) and a Test set (used to evaluate performance on new data). Usually the
training set is larger. After training the model on the training set, we check its accuracy and other metrics on
the test set to see how well it generalizes.

Diagram: K-Fold Cross-Validation. To get a more reliable evaluation, we can use K‑Fold cross-validation 21 .
Here, the data is divided into K equal “folds”. We train the model K times: each time using K – 1 folds for
training and the remaining fold for testing, rotating which fold is held out. The final score is averaged over
all folds. This prevents “lucky” splits and uses all data for testing at some point 21 .

• Deployment (Stage 9): Put the model into production or use it to make decisions 22 . This could be
as simple as generating a report, or as involved as embedding the model in a website or app. For
example, you could integrate the cuisine classifier into a mobile app: users enter ingredients and the
app shows the predicted cuisine. Deployment often involves engineering work and collaboration
with others (e.g. software developers) 23 . We usually start small (pilot tests or limited release) to
ensure everything works.
• Feedback (Stage 10): Gather results from the deployed model and use them to improve it 24 . For
instance, track how often the cuisine predictions were correct in real use. This feedback (user data,
new observations, or stakeholder input) helps refine the model – perhaps by adding more data,
tweaking features, or retraining. The process is cyclical: over time you learn from how the model
performs and go back to earlier steps (like data prep or modeling) to make it better 3 .

As a capstone project progresses, teams should collaborate and discuss each of these stages. Using real
datasets and case studies (like predicting cuisine or sales) helps make the steps concrete. Always keep the
original question in mind, involve domain experts if possible, and document each step. By applying this
methodology with hands-on practice, students can systematically tackle their projects and learn the full
cycle of data science.

Data and Analytics Types


Data comes in different forms, and analytics comes in several types; examples of each are given below:

• Data Types:
• Structured data has a fixed format (like spreadsheets or databases). For example, a table of customer
ages and incomes or a CSV of recipe ingredients is structured.
• Unstructured data has no fixed schema. Examples include free text (recipe instructions, social media
comments), images (food photos), audio, or video. These need special processing (e.g. text parsing
or image analysis) 6 .

• (There is also semi-structured data, like JSON or XML files, which have some structure but not rigid
tables.)

• Analytics Types:

• Descriptive analytics: What happened? Summarizes past data (e.g. “There were 200 Italian recipes
this month”). It uses charts and tables to describe data.
• Diagnostic analytics: Why did it happen? Looks for causes (e.g. “We see more Italian recipes because
last month’s cooking contest was Italian-themed”). It may involve deeper analysis or data slicing.
• Predictive analytics: What might happen? Uses historical data to forecast future or unknown
outcomes (e.g. “Given the ingredients, predict the cuisine type” or “predict next month’s sales”). This
is where most modeling (classification/regression) comes in 25 .
• Prescriptive analytics: What should we do? Recommends actions to optimize results (e.g. “Based on
the predicted cuisine interest, we should feature more Italian recipes next month”). This often
follows predictive models and may involve optimization or business rules 26 .

Understanding these types helps clarify project goals. For instance, if you only summarize data, that’s
descriptive; if you build a model to predict cuisine from ingredients, that’s predictive (classification) 5 .

Modeling Techniques
Modeling turns data into predictions or insights. Some key ideas:

• Supervised vs Unsupervised:
• Supervised learning uses labeled data. If we have recipes labeled by cuisine, we can train a model to
predict the cuisine. This includes classification (predict categories) and regression (predict
numbers). For example, predicting cuisine is classification, while predicting calorie count from
ingredients would be regression 5 13 .

• Unsupervised learning finds patterns without labels. For example, clustering ingredients into groups
of similar flavor.

• Classification vs Regression: In classification tasks (categories), we measure accuracy, precision,
recall, F1, etc. In regression tasks (numbers), we use MAE, MSE, and RMSE to measure error. As a rule of
thumb: classification deals with discrete outcomes (like “Italian” or “Mexican”), while regression deals with
continuous values (like “price = $12.50”) 13 .

• Algorithms: Common classification algorithms include decision trees, random forests, logistic
regression, k-nearest neighbors, and neural networks. For regression, common models are linear
regression, random forests, or neural nets. Choice of algorithm depends on data size, type, and
project needs. In a student project, it’s fine to start with a simple model (like a decision tree) and then
try more complex ones as needed.

• Feature Engineering: Often the most creative part is converting raw data into model inputs. For text
(e.g. recipe names) you might extract keywords. For categorical data (like cuisine names), you might use
label encoding or one-hot encoding. Combining or scaling features can also help model performance 11 .
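
As a rough illustration of these encoding and scaling ideas (the column names below are invented; a real project would adapt them), one-hot encoding for a categorical input, label encoding for the target, and feature scaling might look like this with pandas and scikit-learn:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.DataFrame({
    "cuisine": ["italian", "mexican", "italian"],   # target label (categorical)
    "spice_level": ["mild", "hot", "mild"],          # categorical input feature
    "num_ingredients": [7, 12, 5],                   # numeric input feature
})

# One-hot encode the categorical input feature.
X = pd.get_dummies(df[["spice_level", "num_ingredients"]], columns=["spice_level"])

# Scale the numeric column so features have comparable ranges.
X["num_ingredients"] = StandardScaler().fit_transform(X[["num_ingredients"]]).ravel()

# Label-encode the target classes into integers (0, 1, ...).
y = LabelEncoder().fit_transform(df["cuisine"])

print(X)
print(y)
```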

Train-Test Split and Cross-Validation


When building a model, we must evaluate it fairly. The train-test split is a basic method: we divide our
dataset into a training set and a test set. The model learns only from the training set; then we see how well it
performs on the separate test set. This prevents “cheating” by testing on the same data we trained on.

To reduce randomness in a single split, we often use K-Fold cross-validation 21 . For example, in 5-fold CV
we split data into 5 parts. We train on 4 parts and test on the 1 remaining part, repeating 5 times so each
part is tested once. We then average the results. This method (illustrated below) provides a more stable
estimate of model performance, especially on small datasets 21 .

Diagram: K-Fold Cross-Validation. The data is split into K folds. Each time, one fold (shown in a light box) is
used as the test set (holdout) and the model is trained on the other K–1 folds (colored balls). By rotating
which fold is held out, we test on all data. We then average the model’s accuracy (or other metric) over all K
trials 21 .

In summary, the train-test split (shown above) and K-fold CV help ensure our model truly learns general
patterns rather than simply memorizing the training data.
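
A minimal sketch of both ideas with scikit-learn, using a toy feature matrix invented purely for illustration (real projects would use their prepared dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data: 100 samples, 4 binary ingredient features, 2 cuisine classes.
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(100, 4))
y = rng.integers(0, 2, size=100)

# 1) Simple train-test split: hold out 20% of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))

# 2) 5-fold cross-validation: train and test 5 times, then average the scores.
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
print("CV accuracy: %.2f (+/- %.2f)" % (scores.mean(), scores.std()))
```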

Model Evaluation Metrics
Depending on the task, we use different metrics to judge models:

• Classification Metrics:
• Accuracy: Fraction of total predictions that are correct 16 . Simple but can be misleading if classes
are imbalanced.
• Precision: Of all cases predicted as “positive” (e.g. predicted cuisine = Italian), what fraction were
truly positive (truly Italian)? 17
• Recall (Sensitivity): Of all actual positive cases, what fraction did we correctly predict? 20

• F1-Score: The harmonic mean of precision and recall 18 . It balances the two (gives a low score if
either precision or recall is low).

• (Other useful tools include a confusion matrix and ROC curves, but the main metrics above often
suffice for a simple project.)

• Regression Metrics:

• Mean Absolute Error (MAE): The average absolute difference between predicted and actual values.
• Mean Squared Error (MSE): The average of squared differences between predictions and actuals
19 .

• Root Mean Squared Error (RMSE): The square root of MSE (gives error in the same units as the
target). Lower values of these errors mean a better model.

Below is a quick summary:

Metric | Formula (concept) | Use
Accuracy | (TP + TN) / (total predictions) | Classification; fraction of correct labels 16
Precision | TP / (TP + FP) | Classification; how many predicted positives are correct 17
Recall | TP / (TP + FN) | Classification; how many actual positives were found 20
F1 Score | 2·(Precision·Recall) / (Precision + Recall) | Classification; harmonic mean of precision & recall 18
MAE | avg(|predicted − actual|) | Regression; average absolute prediction error
MSE | avg((predicted − actual)²) 19 | Regression; average squared prediction error 19
RMSE | sqrt(MSE) | Regression; square root of MSE (error in original units)

(TP = true positives, FP = false positives, FN = false negatives, TN = true negatives.)

These metrics guide us in deciding if a model is good enough. For instance, if our cuisine classifier has 80%
accuracy, precision 0.75, recall 0.85, and F1-score 0.80, we know it’s doing reasonably well, but it is making
some false-positive predictions (precision < recall). If not satisfied, we might try a different model or collect more data.
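
As a sketch of computing these metrics with scikit-learn (the label and value arrays below are invented for illustration, not real project results):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_absolute_error, mean_squared_error)

# --- Classification metrics (binary example: 1 = "Italian", 0 = "not Italian") ---
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))

# --- Regression metrics ---
actual    = [10.0, 12.5, 8.0]
predicted = [11.0, 12.0, 9.5]
mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)  # same units as the target
print("MAE:", mae, "MSE:", mse, "RMSE:", rmse)
```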

Example: Cuisine Classification


As a practical example, imagine a capstone project to identify a dish’s cuisine from its ingredients.
Applying the methodology:

• Business Understanding: We want to help a recipe website tag each dish with the correct cuisine
(Italian, Mexican, etc.). The business goal is easier browsing and better recommendations.
• Analytic Approach: This is a classification problem (predict one of several cuisine labels) 5 .
• Data Requirements: We need a dataset of recipes, each with ingredient lists and known cuisine labels.
• Data Collection: We gather recipes from a public API or dataset.
• Data Understanding: We explore the ingredients (e.g. some cuisines share ingredients). We might find
that some ingredients (like basil) are more common in Italian dishes, etc.
• Data Preparation: We convert each recipe’s ingredients into features. For example, create a binary
feature for each common ingredient (“tomato” yes/no). We clean any misspellings and handle
recipes that have missing ingredients.
• Modeling: We split into train/test. We train a classification model (e.g. decision tree) on the training
set to predict cuisine. We might also try logistic regression or k-NN and compare.
• Evaluation: We test on the held-out data. Suppose the model’s accuracy is 78%, precision (for each
cuisine) is around 0.75–0.80, and recall similar. We compute these metrics to judge quality. If
accuracy is too low, we iterate: perhaps engineer more features (like ingredient combinations) or try
a different algorithm.
• Deployment: Once happy with the model, we could deploy it as a web tool: users input ingredients, it
outputs the predicted cuisine.
• Feedback: We collect user feedback or new recipe data to see how the model performs in the real
world and update the model over time.

This case shows how each stage of the methodology is used in a real scenario, with hands-on steps (like
coding a train-test split or plotting ingredient frequencies) and discussion points (like which features seem
most important).
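
To make the walkthrough concrete, here is a compact, hypothetical end-to-end sketch: ingredient strings are turned into bag-of-words features, a decision tree is trained, and a held-out set is scored. The tiny dataset is invented; a real capstone would use many more recipes and more careful feature work.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Tiny invented dataset: ingredient text and cuisine label per recipe.
ingredients = [
    "tomato basil pasta olive oil", "tortilla beans salsa corn",
    "soy sauce ginger rice", "mozzarella tomato basil",
    "beans corn chili tortilla", "rice soy sauce scallion",
    "pasta garlic olive oil", "chili salsa corn beans",
]
cuisines = ["italian", "mexican", "chinese", "italian",
            "mexican", "chinese", "italian", "mexican"]

# Turn ingredient text into binary word-presence features.
X = CountVectorizer(binary=True).fit_transform(ingredients)

# Hold out part of the data for a fair evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, cuisines, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Accuracy on held-out recipes:", accuracy_score(y_test, model.predict(X_test)))
```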

Deployment and Feedback


After evaluation, a model is deployed if it meets the business needs 22 . Deployment could be presenting
results in a report, building a web app, or integrating the model into an existing system. For example, the
cuisine classifier might run on a website’s backend to auto-tag new recipes. Deployment often involves
additional engineering, and it’s tested in a live environment.
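
One small, common step in deployment is saving the trained model so a backend script or web app can load and reuse it. Below is a minimal sketch using joblib; the file name and the tiny stand-in model are assumptions for illustration, not a prescribed setup.

```python
import joblib
from sklearn.tree import DecisionTreeClassifier

# Stand-in for a model already trained during the Modeling stage.
model = DecisionTreeClassifier().fit([[0, 1], [1, 0]], ["italian", "mexican"])

# Save the fitted model to disk so another process (e.g. a web app) can use it.
joblib.dump(model, "cuisine_classifier.joblib")

# Later, in the deployed application, load the model and make a prediction.
loaded = joblib.load("cuisine_classifier.joblib")
print(loaded.predict([[0, 1]]))  # -> ['italian']
```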

Finally, collect feedback and monitor performance 24 . Did the deployed model actually improve outcomes
(e.g., click-through on suggested recipes)? If not, revisit earlier steps: maybe gather more data, retrain the
model, or adjust thresholds. This feedback loop ensures the data science work provides ongoing value and
adapts to new information 3 .

Key Takeaways: A successful data science project follows a clear methodology: define the problem, choose
an analysis approach, gather and prepare data, build and evaluate models, and deploy solutions. Use visual
aids (like flowcharts and tables), simple examples, and collaborative discussion to solidify these concepts.
Remember to split your data for fair testing, and use metrics (accuracy, precision/recall, MAE/MSE, etc.) to measure success 16 19 . By working
through each stage carefully and iterating based on feedback 3 , students can apply data science methods
effectively in their capstone projects.

1 3 4 5 7 8 9 10 11 12 14 15 22 23 24 Foundational Methodology for Data Science
https://tdwi.org/~/media/64511A895D86457E964174EDC5C4C7B1.PDF

2 Master Data Science Methodology: Foundations & Stages - CliffsNotes
https://www.cliffsnotes.com/study-notes/16130953

6 Structured vs. Unstructured Data: What’s the Difference? | IBM
https://www.ibm.com/think/topics/structured-vs-unstructured-data

13 Classification vs Regression in Machine Learning | GeeksforGeeks
https://www.geeksforgeeks.org/ml-classification-vs-regression/

16 17 20 Classification: Accuracy, recall, precision, and related metrics | Machine Learning | Google for Developers
https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall

18 12 Important Model Evaluation Metrics for Machine Learning (2025)
https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/

19 Mean Squared Error | GeeksforGeeks
https://www.geeksforgeeks.org/mean-squared-error/

21 Cross Validation in Machine Learning | GeeksforGeeks
https://www.geeksforgeeks.org/cross-validation-machine-learning/

25 26 Descriptive, predictive, diagnostic, and prescriptive analytics explained — a complete marketer’s guide
https://business.adobe.com/blog/basics/descriptive-predictive-prescriptive-analytics-explained
