
Unit 2: Data Science Methodology

An Analytic Approach to the Capstone Project


In a data science project, we follow a step-by-step methodology (framework) to turn a real-world question
into a working solution. One popular framework is IBM’s 10‑stage Data Science Methodology (by John
Rollins), which guides you through defining the problem, collecting and preparing data, building models, and
deploying results 1 . Each stage answers key questions (for example, “What problem are we solving?” in
Business Understanding or “What data do we need?” in Data Requirements) 2 . The overall process is
iterative – we often loop back to earlier steps to refine our work 3.

Diagram: Example data science process. Raw data is collected and processed into a clean dataset. Exploratory
analysis and modeling produce a data product or visualization, which leads to decisions. Starting with a
business question, data is gathered and processed; we then explore the data, build models, and create
reports or apps. In practice this works in stages such as Business Understanding, Data Collection, Data
Preparation, Modeling, Evaluation, Deployment, and Feedback 2 1 . For example, at each step we
might ask “What are we trying to predict?” (Business Understanding) and “How will we measure
success?” (Evaluation). This keeps the project focused and structured.

• Business Understanding (Stage 1): Define the problem and goals from a real-world perspective.
Involve stakeholders to set objectives and success criteria 4 . For example, a company might ask:
“How can we increase snack sales by 10%?” or, in a student project, “Can we predict the cuisine of a
dish from its ingredients?” 4 .
• Analytic Approach (Stage 2): Choose how to solve the problem with data. Translate the problem
into a data science task (e.g. classification, regression, clustering). For example, “identifying cuisine
from ingredients” is a classification problem (we predict categories like Italian, Chinese, etc.) 5 . If
the goal were “predict tomorrow’s sales amount”, that would be a regression problem (predicting a
number) 5 .
• Data Requirements (Stage 3): Decide what data you need. Based on the analytic approach, list the
needed variables (features) and labels. For the cuisine example, you might need a table of recipes
with ingredient lists and cuisine type. Data could be structured (organized tables of numbers and
categories) or unstructured (free-form text, images, etc.) 6 . Structured data fits neatly in rows/
columns (like a spreadsheet of ingredient quantities), whereas unstructured data has no fixed format
(like recipe instructions or photos) 6 .
• Data Collection (Stage 4): Gather the required data. This could mean reading CSV files, querying
databases, scraping web pages, or collecting new measurements. For example, you might download
an online recipe database or survey people for likes/dislikes. Sometimes you can collect more data if
needed – modern tools often allow using very large datasets 7 .
• Data Understanding (Stage 5): Explore and summarize the data you collected. Use simple
descriptive statistics (counts, averages) and charts to see what the data looks like. Check for missing
or strange values, and make sure the data matches the problem. For example, plot how many
recipes belong to each cuisine to see if some cuisines are very rare. This stage helps spot errors or
new needs (e.g. you might realize you need more data for under-represented cuisines) 8 .
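
For instance, a quick way to check class balance at this stage is to count recipes per cuisine and plot the result. The sketch below assumes a hypothetical pandas DataFrame named recipes with a cuisine column; a real project would load its own dataset instead.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset: one row per recipe, with a 'cuisine' label column.
recipes = pd.DataFrame({
    "cuisine": ["italian", "mexican", "italian", "chinese", "italian"],
    "ingredients": [
        "tomato basil pasta", "tortilla beans salsa",
        "tomato mozzarella", "soy sauce ginger rice", "olive oil garlic pasta",
    ],
})

# Count how many recipes belong to each cuisine to spot under-represented classes.
counts = recipes["cuisine"].value_counts()
print(counts)

# A simple bar chart makes rare cuisines easy to see.
counts.plot(kind="bar", title="Recipes per cuisine")
plt.show()
```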

• Data Preparation (Stage 6): Clean and transform the data into a form suitable for modeling 9 .
Common tasks include handling missing values, removing duplicates, converting categories to
numbers, and engineering new features. For instance, you might turn each list of ingredients into
binary features (e.g. “tomato_present = yes/no”) or count the number of spicy ingredients. This often
takes the most time in a project 10 . Feature engineering (creating new explanatory variables) and
text processing (like extracting keywords from recipe names) happen here to improve model
performance 11 .
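
As a small sketch of this step, assuming each recipe stores its ingredients as one text string (the recipes frame and column names below are made up for illustration), we could create one binary column per ingredient of interest:

```python
import pandas as pd

recipes = pd.DataFrame({
    "cuisine": ["italian", "mexican", "chinese"],
    "ingredients": ["tomato basil pasta", "tortilla beans salsa", "soy sauce ginger"],
})

# One binary (0/1) feature per ingredient we care about, e.g. tomato_present.
for ingredient in ["tomato", "basil", "beans", "soy sauce"]:
    col = ingredient.replace(" ", "_") + "_present"
    recipes[col] = recipes["ingredients"].str.contains(ingredient).astype(int)

print(recipes)
```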

• Modeling (Stage 7): Build data models that address the analytic approach 12 . If doing classification
(like cuisine prediction), you might train a decision tree or k-nearest neighbors model using the
training set of your data. For regression (e.g. predicting price), you might use linear regression.
(Classification vs. regression is a key distinction: classification predicts categories, while regression
predicts numeric values 13 .) In unsupervised tasks (no clear label), you might do clustering
(grouping similar items) or dimensionality reduction. During modeling, you usually split your data
into a training set and a test set – the training set teaches the model, and the test set (kept separate) lets you evaluate how well the
model works on new data. You may also try multiple algorithms or parameters to find the best
performing model 14 .

• Evaluation (Stage 8): Assess how well the model solves the original problem 15 . For classification
models, common metrics include accuracy (percent correctly predicted), precision (how many
predicted positives are true positives), recall (how many actual positives were found), and the
F1‑score (harmonic mean of precision and recall) 16 17 18 . For regression models, common
metrics are Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared
Error (RMSE) (all measuring prediction error) 19 . For example, if our cuisine classifier is correct 80%
of the time, its accuracy is 0.80 16 . Precision and recall give insight in imbalanced cases (e.g. if some
cuisines are rare) 20 17 . We often use Train/Test Split and Cross-Validation to reliably estimate
these metrics (see below). If the model doesn’t meet our goals, we may return to earlier stages to
tweak the data or model and try again 3 .

Diagram: Splitting data into training and test sets. To evaluate a model, we split labeled data into a Training
set (used to build/train the model) and a Test set (used to evaluate performance on new data). Usually the
training set is larger. After training the model on the training set, we check its accuracy and other metrics on
the test set to see how well it generalizes.

Diagram: K-Fold Cross-Validation. To get a more reliable evaluation, we can use K‑Fold cross-validation 21 .
Here, the data is divided into K equal “folds”. We train the model K times: each time using K – 1 folds for
training and the remaining fold for testing, rotating which fold is held out. The final score is averaged over
all folds. This prevents “lucky” splits and uses all data for testing at some point 21 .

• Deployment (Stage 9): Put the model into production or use it to make decisions 22 . This could be
as simple as generating a report, or as involved as embedding the model in a website or app. For
example, you could integrate the cuisine classifier into a mobile app: users enter ingredients and the
app shows the predicted cuisine. Deployment often involves engineering work and collaboration
with others (e.g. software developers) 23 . We usually start small (pilot tests or limited release) to
ensure everything works.
• Feedback (Stage 10): Gather results from the deployed model and use them to improve it 24 . For
instance, track how often the cuisine predictions were correct in real use. This feedback (user data,
new observations, or stakeholder input) helps refine the model – perhaps by adding more data,
tweaking features, or retraining. The process is cyclical: over time you learn from how the model
performs and go back to earlier steps (like data prep or modeling) to make it better 3 .

As a capstone project progresses, teams should collaborate and discuss each of these stages. Using real
datasets and case studies (like predicting cuisine or sales) helps make the steps concrete. Always keep the
original question in mind, involve domain experts if possible, and document each step. By applying this
methodology with hands-on practice, students can systematically tackle their projects and learn the full
cycle of data science.

Data and Analytics Types


Data comes in different forms, and analytics comes in several types; examples of each are given below:

• Data Types:
• Structured data has a fixed format (like spreadsheets or databases). For example, a table of customer
ages and incomes or a CSV of recipe ingredients is structured.
• Unstructured data has no fixed schema. Examples include free text (recipe instructions, social media
comments), images (food photos), audio, or video. These need special processing (e.g. text parsing
or image analysis) 6 .

• (There is also semi-structured data, like JSON or XML files, which have some structure but not rigid
tables.)

• Analytics Types:

• Descriptive analytics: What happened? Summarizes past data (e.g. “There were 200 Italian recipes
this month”). It uses charts and tables to describe data.
• Diagnostic analytics: Why did it happen? Looks for causes (e.g. “We see more Italian recipes because
last month’s cooking contest was Italian-themed”). It may involve deeper analysis or data slicing.
• Predictive analytics: What might happen? Uses historical data to forecast future or unknown
outcomes (e.g. “Given the ingredients, predict the cuisine type” or “predict next month’s sales”). This
is where most modeling (classification/regression) comes in 25 .
• Prescriptive analytics: What should we do? Recommends actions to optimize results (e.g. “Based on
the predicted cuisine interest, we should feature more Italian recipes next month”). This often
follows predictive models and may involve optimization or business rules 26 .

Understanding these types helps clarify project goals. For instance, if you only summarize data, that’s
descriptive; if you build a model to predict cuisine from ingredients, that’s predictive (classification) 5 .

Modeling Techniques
Modeling turns data into predictions or insights. Some key ideas:

• Supervised vs Unsupervised:
• Supervised learning uses labeled data. If we have recipes labeled by cuisine, we can train a model to
predict the cuisine. This includes classification (predict categories) and regression (predict
numbers). For example, predicting cuisine is classification, while predicting calorie count from
ingredients would be regression 5 13 .

• Unsupervised learning finds patterns without labels. For example, clustering ingredients into groups
of similar flavor.

• Classification vs Regression: In classification tasks (categories), we measure accuracy, precision,
recall, F1, etc. In regression tasks (numbers), we use MAE, MSE, and RMSE to measure error. As a rule of
thumb: classification deals with discrete outcomes (like “Italian” or “Mexican”), while regression deals with
continuous values (like “price = $12.50”) 13 .

• Algorithms: Common classification algorithms include decision trees, random forests, logistic
regression, k-nearest neighbors, and neural networks. For regression, common models are linear
regression, random forests, or neural nets. Choice of algorithm depends on data size, type, and
project needs. In a student project, it’s fine to start with a simple model (like a decision tree) and then
try more complex ones as needed.

• Feature Engineering: Often the most creative part is converting raw data into model inputs. For text
(e.g. recipe names) you might extract keywords. For categorical data (like cuisine names), you might use
label encoding or one-hot encoding. Combining or scaling features can also help model performance 11 .
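
As a rough illustration of these encoding and scaling ideas (the column names below are invented; a real project would adapt them), one-hot encoding for a categorical input, label encoding for the target, and feature scaling might look like this with pandas and scikit-learn:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.DataFrame({
    "cuisine": ["italian", "mexican", "italian"],   # target label (categorical)
    "spice_level": ["mild", "hot", "mild"],          # categorical input feature
    "num_ingredients": [7, 12, 5],                   # numeric input feature
})

# One-hot encode the categorical input feature.
X = pd.get_dummies(df[["spice_level", "num_ingredients"]], columns=["spice_level"])

# Scale the numeric column so features have comparable ranges.
X["num_ingredients"] = StandardScaler().fit_transform(X[["num_ingredients"]]).ravel()

# Label-encode the target classes into integers (0, 1, ...).
y = LabelEncoder().fit_transform(df["cuisine"])

print(X)
print(y)
```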

Train-Test Split and Cross-Validation


When building a model, we must evaluate it fairly. The train-test split is a basic method: we divide our
dataset into a training set and a test set. The model learns only from the training set; then we see how well it
performs on the separate test set. This prevents “cheating” by testing on the same data we trained on.

To reduce randomness in a single split, we often use K-Fold cross-validation 21 . For example, in 5-fold CV
we split data into 5 parts. We train on 4 parts and test on the 1 remaining part, repeating 5 times so each
part is tested once. We then average the results. This method (illustrated below) provides a more stable
estimate of model performance, especially on small datasets 21 .

Diagram: K-Fold Cross-Validation. The data is split into K folds. Each time, one fold (shown in a light box) is
used as the test set (holdout) and the model is trained on the other K–1 folds (colored balls). By rotating
which fold is held out, we test on all data. We then average the model’s accuracy (or other metric) over all K
trials 21 .

In summary, the train-test split (shown above) and K-fold CV help ensure our model truly learns general
patterns rather than simply memorizing the training data.
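
A minimal sketch of both ideas with scikit-learn, using a toy feature matrix invented purely for illustration (real projects would use their prepared dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data: 100 samples, 4 binary ingredient features, 2 cuisine classes.
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(100, 4))
y = rng.integers(0, 2, size=100)

# 1) Simple train-test split: hold out 20% of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))

# 2) 5-fold cross-validation: train and test 5 times, then average the scores.
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
print("CV accuracy: %.2f (+/- %.2f)" % (scores.mean(), scores.std()))
```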

Model Evaluation Metrics
Depending on the task, we use different metrics to judge models:

• Classification Metrics:
• Accuracy: Fraction of total predictions that are correct 16 . Simple but can be misleading if classes
are imbalanced.
• Precision: Of all cases predicted as “positive” (e.g. predicted cuisine = Italian), what fraction were
truly positive (truly Italian)? 17
• Recall (Sensitivity): Of all actual positive cases, what fraction did we correctly predict? 20

• F1-Score: The harmonic mean of precision and recall 18 . It balances the two (gives a low score if
either precision or recall is low).

• (Other useful tools include a confusion matrix and ROC curves, but the main metrics above often
suffice for a simple project.)

• Regression Metrics:

• Mean Absolute Error (MAE): The average absolute difference between predicted and actual values.
• Mean Squared Error (MSE): The average of squared differences between predictions and actuals
19 .

• Root Mean Squared Error (RMSE): The square root of MSE (gives error in the same units as the
target). Lower values of these errors mean a better model.

Below is a quick summary:

Metric | Formula (concept) | Use
Accuracy | (TP + TN) / (total predictions) | Classification; fraction of correct labels 16
Precision | TP / (TP + FP) | Classification; how many predicted positives are correct 17
Recall | TP / (TP + FN) | Classification; how many actual positives were found 20
F1 Score | 2·(Precision·Recall) / (Precision + Recall) | Classification; harmonic mean of precision & recall 18
MAE | avg(|predicted − actual|) | Regression; average absolute prediction error
MSE | avg((predicted − actual)²) 19 | Regression; average squared prediction error 19
RMSE | sqrt(MSE) | Regression; square root of MSE (error in original units)

(TP = true positives, FP = false positives, FN = false negatives, TN = true negatives.)

These metrics guide us in deciding if a model is good enough. For instance, if our cuisine classifier has 80%
accuracy, precision 0.75, recall 0.85, and F1-score 0.80, we know it’s doing reasonably well, but it is making
some false-positive predictions (precision < recall). If not satisfied, we might try a different model or collect more data.
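
As a sketch of computing these metrics with scikit-learn (the label and value arrays below are invented for illustration, not real project results):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_absolute_error, mean_squared_error)

# --- Classification metrics (binary example: 1 = "Italian", 0 = "not Italian") ---
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))

# --- Regression metrics ---
actual    = [10.0, 12.5, 8.0]
predicted = [11.0, 12.0, 9.5]
mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)  # same units as the target
print("MAE:", mae, "MSE:", mse, "RMSE:", rmse)
```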

Example: Cuisine Classification


As a practical example, imagine a capstone project to identify a dish’s cuisine from its ingredients.
Applying the methodology:

• Business Understanding: We want to help a recipe website tag each dish with the correct cuisine
(Italian, Mexican, etc.). The business goal is easier browsing and better recommendations.
• Analytic Approach: This is a classification problem (predict one of several cuisine labels) 5 .
• Data Requirements: We need a dataset of recipes, each with ingredient lists and known cuisine labels.
• Data Collection: We gather recipes from a public API or dataset.
• Data Understanding: We explore the ingredients (e.g. some cuisines share ingredients). We might find
that some ingredients (like basil) are more common in Italian dishes, etc.
• Data Preparation: We convert each recipe’s ingredients into features. For example, create a binary
feature for each common ingredient (“tomato” yes/no). We clean any misspellings and handle
recipes that have missing ingredients.
• Modeling: We split into train/test. We train a classification model (e.g. decision tree) on the training
set to predict cuisine. We might also try logistic regression or k-NN and compare.
• Evaluation: We test on the held-out data. Suppose the model’s accuracy is 78%, precision (for each
cuisine) is around 0.75–0.80, and recall similar. We compute these metrics to judge quality. If
accuracy is too low, we iterate: perhaps engineer more features (like ingredient combinations) or try
a different algorithm.
• Deployment: Once happy with the model, we could deploy it as a web tool: users input ingredients, it
outputs the predicted cuisine.
• Feedback: We collect user feedback or new recipe data to see how the model performs in the real
world and update the model over time.

This case shows how each stage of the methodology is used in a real scenario, with hands-on steps (like
coding a train-test split or plotting ingredient frequencies) and discussion points (like which features seem
most important).
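
To make the walkthrough concrete, here is a compact, hypothetical end-to-end sketch: ingredient strings are turned into bag-of-words features, a decision tree is trained, and a held-out set is scored. The tiny dataset is invented; a real capstone would use many more recipes and more careful feature work.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Tiny invented dataset: ingredient text and cuisine label per recipe.
ingredients = [
    "tomato basil pasta olive oil", "tortilla beans salsa corn",
    "soy sauce ginger rice", "mozzarella tomato basil",
    "beans corn chili tortilla", "rice soy sauce scallion",
    "pasta garlic olive oil", "chili salsa corn beans",
]
cuisines = ["italian", "mexican", "chinese", "italian",
            "mexican", "chinese", "italian", "mexican"]

# Turn ingredient text into binary word-presence features.
X = CountVectorizer(binary=True).fit_transform(ingredients)

# Hold out part of the data for a fair evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, cuisines, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Accuracy on held-out recipes:", accuracy_score(y_test, model.predict(X_test)))
```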

Deployment and Feedback


After evaluation, a model is deployed if it meets the business needs 22 . Deployment could be presenting
results in a report, building a web app, or integrating the model into an existing system. For example, the
cuisine classifier might run on a website’s backend to auto-tag new recipes. Deployment often involves
additional engineering, and it’s tested in a live environment.
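
One small, common step in deployment is saving the trained model so a backend script or web app can load and reuse it. Below is a minimal sketch using joblib; the file name and the tiny stand-in model are assumptions for illustration, not a prescribed setup.

```python
import joblib
from sklearn.tree import DecisionTreeClassifier

# Stand-in for a model already trained during the Modeling stage.
model = DecisionTreeClassifier().fit([[0, 1], [1, 0]], ["italian", "mexican"])

# Save the fitted model to disk so another process (e.g. a web app) can use it.
joblib.dump(model, "cuisine_classifier.joblib")

# Later, in the deployed application, load the model and make a prediction.
loaded = joblib.load("cuisine_classifier.joblib")
print(loaded.predict([[0, 1]]))  # -> ['italian']
```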

Finally, collect feedback and monitor performance 24 . Did the deployed model actually improve outcomes
(e.g., click-through on suggested recipes)? If not, revisit earlier steps: maybe gather more data, retrain the
model, or adjust thresholds. This feedback loop ensures the data science work provides ongoing value and
adapts to new information 3 .

Key Takeaways: A successful data science project follows a clear methodology: define the problem, choose
an analysis approach, gather and prepare data, build and evaluate models, and deploy solutions. Use visual
aids (like flowcharts and tables), simple examples, and collaborative discussion to solidify these concepts.
Remember to split your data for fair testing, and use metrics (accuracy, precision/recall, MAE/MSE, etc.) to measure success 16 19 . By working
through each stage carefully and iterating based on feedback 3 , students can apply data science methods
effectively in their capstone projects.

1 3 4 5 7 8 9 10 11 12 14 15 22 23 24 Foundational Methodology for Data Science
https://tdwi.org/~/media/64511A895D86457E964174EDC5C4C7B1.PDF

2 Master Data Science Methodology: Foundations & Stages - CliffsNotes
https://www.cliffsnotes.com/study-notes/16130953

6 Structured vs. Unstructured Data: What’s the Difference? | IBM
https://www.ibm.com/think/topics/structured-vs-unstructured-data

13 Classification vs Regression in Machine Learning | GeeksforGeeks
https://www.geeksforgeeks.org/ml-classification-vs-regression/

16 17 20 Classification: Accuracy, recall, precision, and related metrics | Machine Learning | Google for Developers
https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall

18 12 Important Model Evaluation Metrics for Machine Learning (2025)
https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/

19 Mean Squared Error | GeeksforGeeks
https://www.geeksforgeeks.org/mean-squared-error/

21 Cross Validation in Machine Learning | GeeksforGeeks
https://www.geeksforgeeks.org/cross-validation-machine-learning/

25 26 Descriptive, predictive, diagnostic, and prescriptive analytics explained — a complete marketer’s guide
https://business.adobe.com/blog/basics/descriptive-predictive-prescriptive-analytics-explained
