Data Science and Analytics

Data science is the entire discipline of deriving value from data, encompassing theory,
processes, and tools. Data analytics is the core practical process within data science focused
on examining datasets to draw conclusions.
A Deeper Look at Data Science
Data science is more than just analyzing data; it's a field built on three pillars:
Mathematics & Statistics: This is the theoretical foundation. It provides the methods for
quantifying uncertainty, designing experiments, and building sound models. It's the
knowledge that ensures the conclusions drawn from data are valid and not just random
chance.
Computer Science & Programming: This provides the practical tools to handle data at scale.
This includes skills in programming (like Python or R), database management (SQL), and
understanding algorithms and data structures.
Domain Expertise: This is the crucial context. It's the understanding of the specific field—be it
finance, healthcare, marketing, or manufacturing—that allows a data scientist to ask the right
questions, interpret the results correctly, and understand the real-world implications. An
insight is useless without the context to make it actionable.
The "Science" in Data Science
The field is called data science because it follows a systematic, evidence-based approach similar to the scientific method:

Formulate a Hypothesis: Start with a question or assumption (e.g., "We believe customers
who use our mobile app spend more than those who use the website.").
Design an Experiment: Plan how to test the hypothesis. This could involve an A/B test, where
you show one group of users one version of a product and a second group another version.
Collect & Process Data: Gather the data from the experiment or from existing sources.
Analyze the Results: Use statistical tests to determine if the results are significant.
Draw a Conclusion: Based on the evidence, either reject the hypothesis or conclude that the data supports it.
Communicate these findings.
Ethics in Data Science
A critical aspect of data science is ethics. Data gives power, and with that comes responsibility. Key ethical considerations include:

Bias: Data can reflect historical biases. If a hiring model is trained on past data where mostly
men were hired, it might learn to discriminate against women. A good data scientist must
identify and mitigate such biases.
Privacy: Handling sensitive personal data is a major responsibility. Data scientists must comply
with regulations like GDPR (in Europe) and ensure data is anonymized and secure.
Transparency: Models, especially those used for important decisions (like loan applications),
should be explainable. People have a right to know how a decision affecting them was
made.
The Four Types of Data Analytics in Detail
Data analytics is the engine of insight generation. Let's explore the four types with more depth.
[Figure: Data Science and Analytics structure. Data Science (the discipline of deriving value from data) contains Data Analytics (the core process of examining data for insights), supported by Mathematics & Statistics (foundation for valid conclusions), Computer Science & Programming (tools for handling data at scale), and Domain Expertise (context for actionable insights).]
1. Descriptive Analytics: "What happened?" (The Foundation)
This is the most common type of analytics, providing a clear summary of historical data. It's the foundation upon which all other analytics are built.
Goal: To create a clear, concise picture of the past.
Techniques:
Aggregation: Calculating sums, counts, and averages (e.g., total monthly revenue).
Measures of Central Tendency: Finding the mean (average), median (middle value), and
mode (most frequent value) to understand the "typical" data point.
Measures of Dispersion: Calculating the range, standard deviation, and variance to
understand how spread out the data is.
Tools: SQL for querying databases, spreadsheets (Excel, Google Sheets), and Business
Intelligence (BI) platforms like Tableau or Power BI for creating interactive dashboards.
Example: A hospital administrator uses a dashboard to see the average patient wait time for
each day of the last month. The dashboard shows that wait times peak on Mondays. This is a
descriptive insight.
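To make this concrete, here is a minimal sketch of descriptive analytics in pandas; the hospital visit data and column names are hypothetical.

```python
# A minimal descriptive-analytics sketch with pandas, using an invented
# DataFrame of hospital visits ("visit_date", "wait_minutes").
import pandas as pd

visits = pd.DataFrame({
    "visit_date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-08", "2024-01-09"]),
    "wait_minutes": [95, 42, 88, 38],
})

# Aggregation: count, average, and total wait per day of the week
by_day = visits.groupby(visits["visit_date"].dt.day_name())["wait_minutes"].agg(["count", "mean", "sum"])

# Central tendency and dispersion for the whole column
summary = visits["wait_minutes"].agg(["mean", "median", "std", "var", "min", "max"])
print(by_day)
print(summary)
```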
[Figure: Transforming raw data into descriptive insights: aggregation (sums, counts, averages), central tendency (mean, median, mode), and dispersion analysis (range, standard deviation, variance) turn raw, disorganized data into a concise picture of the past.]
2. Diagnostic Analytics: "Why did it happen?" (The Investigation)
This is the process of forensic data analysis. It takes descriptive data and digs deeper to find the causes.

Goal: To diagnose the root cause of an observed trend or event.


Techniques:
Drill-Down: The ability to click into a data point on a dashboard to see the underlying, more
detailed data.
Correlation Analysis: Identifying statistical relationships between variables. Crucial point: A
data analyst knows that correlation does not imply causation. Just because two things
happen at the same time doesn't mean one caused the other.
Root Cause Analysis: Methodologies like the "5 Whys" to trace a problem back to its origin.
Example: Seeing that wait times peak on Mondays (the "what"), the hospital analyst drills
down. They find the long waits are in the cardiology department. They run a correlation
analysis and find that the number of on-call cardiologists is lowest on Mondays. They now
have a strong hypothesis for the "why".
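A hedged sketch of that kind of correlation check in pandas follows; the daily staffing figures and column names are invented for illustration.

```python
# A minimal diagnostic sketch: correlate staffing with wait times.
# The daily numbers below are hypothetical.
import pandas as pd

daily = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "avg_wait_minutes": [95, 60, 55, 58, 62],
    "oncall_cardiologists": [2, 4, 5, 5, 4],
})

# Correlation between staffing and wait time (remember: correlation != causation)
corr = daily["avg_wait_minutes"].corr(daily["oncall_cardiologists"])
print(f"Correlation: {corr:.2f}")  # strongly negative here: fewer doctors, longer waits
```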
[Figure: Diagnosing the root cause of an observed trend: drill-down allows detailed exploration of the data behind a point; correlation analysis identifies statistical relationships between variables while cautioning against assuming causation; root cause analysis traces a problem back to its origin using methodologies like the "5 Whys".]

3. Predictive Analytics: "What is likely to happen?" (The Forecast)
This stage moves from looking at the past to forecasting the future. While complex machine learning is a powerful tool here, many predictive analytics rely on established statistical models.

Goal: To make educated guesses about future outcomes based on historical patterns.
Techniques (including non-ML):
Time-Series Analysis: Models like ARIMA or Moving Averages that use the temporal nature of
data to forecast future points (e.g., stock prices, sales).
Statistical Regression: Using statistical models to predict a value based on the relationship
between variables.
Example: A retail chain uses its past 3 years of sales data in a time-series model to forecast
the demand for winter coats for the upcoming season, helping them manage inventory.
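As a simple illustration (not the retailer's actual model), a naive moving-average forecast can be written in a few lines of pandas; the monthly sales figures below are invented, and a full ARIMA model would typically use a library such as statsmodels.

```python
# A minimal time-series forecast sketch using a 3-month moving average
# on invented monthly sales data.
import pandas as pd

sales = pd.Series(
    [120, 135, 150, 160, 155, 170, 180, 175, 190, 210, 230, 250],
    index=pd.period_range("2024-01", periods=12, freq="M"),
)

# Forecast the next month as the mean of the last 3 observed months
forecast = sales.rolling(window=3).mean().iloc[-1]
print(f"Naive 3-month moving-average forecast: {forecast:.0f}")
```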
[Figure: Choosing a predictive technique: time-series analysis uses temporal data to forecast future points (e.g., stock prices or sales); statistical regression predicts values based on variable relationships and is useful for understanding dependencies.]

4. Prescriptive Analytics: "What should we do?" (The Recommendation)
This is the most advanced and valuable stage. It uses the predictions from the previous stage to recommend specific actions to achieve a business goal.
Goal: To provide clear, data-driven recommendations on the best course of action.
Techniques:
Optimization: Using algorithms (e.g., linear programming) to find the best possible outcome
given a set of constraints (e.g., finding the most profitable product mix given production
limits).
Simulation: Creating models (e.g., a Monte Carlo simulation) to test the likely outcomes of
different decisions under various conditions.
Example: A shipping company uses a prescriptive model that takes a predictive traffic
forecast as an input. The model then runs an optimization algorithm to recommend the exact
delivery routes for its entire fleet that will minimize total fuel consumption and delivery time.
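A minimal optimization sketch using linear programming with SciPy is shown below; the product-mix profits and constraints are hypothetical and are not taken from the shipping example.

```python
# A minimal prescriptive-analytics sketch: linear programming for a
# hypothetical product-mix problem (maximize profit under resource limits).
from scipy.optimize import linprog

# Profits per unit of each product (linprog minimizes, so negate to maximize)
c = [-40, -30]

# Constraints: 2x + 1y <= 100 machine hours, 1x + 2y <= 80 labour hours
A_ub = [[2, 1], [1, 2]]
b_ub = [100, 80]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method="highs")
print("Optimal units of each product:", result.x)
print("Maximum profit:", -result.fun)
```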

[Figure: Choosing a prescriptive technique: optimization is best for finding the most efficient solution within constraints; simulation is ideal for testing outcomes under various conditions.]
a. Data Collection

Once you know your objective, you need the raw materials: data. This step involves
gathering all relevant information from various sources.

Data collection methods are the specific techniques used to gather information in a
systematic way to answer a question or test a hypothesis. The choice of method depends on
the research question, required data type, budget, and timeline.

[Figure: Choosing a data collection method: the research question, the type of data needed, the budget, and the timeline all influence which method is feasible.]

Here are the most common data collection methods in detail.

1. Surveys and Questionnaires

This method involves asking a structured set of questions to a group of individuals. Surveys
are extremely versatile and can be deployed through various channels like email, web forms,
or in person.

• Data Type: Primarily quantitative (e.g., ratings on a scale of 1-5, multiple-choice answers) but can also collect qualitative data through open-ended questions.
• Use Case: A company sending a customer satisfaction survey to gather feedback on a
recent purchase.
• Advantages: Highly scalable, cost-effective, and can reach a large and geographically
diverse audience quickly.
• Disadvantages: Can suffer from low response rates and response bias (where people
don't answer truthfully).

Customer satisfaction surveys at a glance:
• Pros: scalable, cost-effective, reaches a diverse audience.
• Cons: low response rates, response bias.

2. Interviews

Interviews are a qualitative method involving direct, one-on-one or small-group conversations with participants. They can be structured (with a fixed set of questions), semi-structured (with guiding questions but allowing for follow-ups), or unstructured (a free-flowing conversation).

[Figure: Choosing an interview type: structured interviews ensure consistency and comparability but may limit flexibility; semi-structured interviews balance consistency with flexibility and allow follow-up questions; unstructured interviews offer maximum flexibility and spontaneity but may lack consistency.]
• Data Type: Almost exclusively qualitative. They provide rich, in-depth insights into
participants' thoughts, feelings, and experiences.

• Use Case: A UX researcher conducting interviews with users to understand their frustrations with a new software feature.
• Advantages: Provides deep, nuanced understanding and allows for probing follow-up questions.
• Disadvantages: Time-consuming, not scalable, and the interviewer's presence can influence participant responses.

3. Observation

This method involves systematically watching and recording behaviors or events as they
occur in their natural setting. The researcher can be a participant (joining the group) or a
non-participant (watching from a distance).

• Data Type: Can be qualitative (e.g., detailed notes on interactions) or quantitative (e.g., counting the number of times a specific action occurs).

• Use Case: A retail consultant observing how shoppers navigate a store to identify
bottlenecks and optimize the layout.
• Advantages: Provides direct insight into actual behavior rather than self-reported
behavior, which can be more accurate.
• Disadvantages: Can be subject to observer bias, is time-intensive, and doesn't explain
the "why" behind behaviors.

Observational data collection at a glance:
• Pros: direct insight into actual behavior, more accurate than self-reported behavior.
• Cons: observer bias, time-intensive, does not explain the "why" behind behaviors.
4. Web Scraping

This is an automated technique for extracting large amounts of data from websites. A script
or "bot" is programmed to visit web pages and pull specific information, which is then saved
into a structured format like a spreadsheet.
• Data Type: Secondary data, which can be quantitative or qualitative.
• Use Case: An e-commerce analyst scraping competitor websites daily to track product
prices and stock levels.
• Advantages: Highly efficient for gathering massive public datasets.
• Disadvantages: Can be legally and ethically complex, and scripts can break if the
source website's layout changes.
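A hedged sketch of a scraper using requests and BeautifulSoup is shown below; the URL and CSS selectors are hypothetical, and any real scraper should respect the site's terms of service and robots.txt.

```python
# A minimal web-scraping sketch with requests and BeautifulSoup.
# The URL and the ".product" / ".name" / ".price" selectors are hypothetical.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"  # hypothetical page
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.select(".product"):  # hypothetical selector
    name = item.select_one(".name").get_text(strip=True)
    price = item.select_one(".price").get_text(strip=True)
    rows.append({"name": name, "price": price})

print(rows)
```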

Web scraping at a glance:
• Pros: efficient collection of large amounts of public data; cost-effective because collection is automated.
• Cons: legal and ethical issues (including compliance with data privacy laws); script fragility when website layouts change; data quality concerns (inaccurate or outdated data can affect analysis).

b. Data preprocessing

Data cleaning, also known as data cleansing or data scrubbing, is the process of detecting
and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset. It's a
foundational step in data preparation that transforms messy, raw data into a clean, reliable,
and usable format for analysis or machine learning.
Think of it like preparing ingredients before cooking; you must wash vegetables, remove bad
spots, and measure correctly before you can create a good meal. The same principle, often
summarized as "garbage in, garbage out," applies to data—low-quality data will always
produce low-quality results.

Key Tasks in Data Cleaning

Data cleaning isn't a single action but a collection of systematic tasks.

1. Handling Missing Values

This is the most common data cleaning task. It addresses records where data is missing (e.g.,
a blank cell in a spreadsheet).

• Why it's a problem: Many algorithms can't handle missing values, and they can skew
statistical calculations like means or sums.
• Common Techniques:
• Deletion: If a row has too many missing values, it might be best to remove the
entire row. Similarly, if a column is mostly empty, it might be dropped. This is a
simple but potentially costly approach as it leads to information loss.
• Imputation: This involves filling in the missing values. The method depends on
the data type:

• For numerical data, you can impute with the column's mean or median.
The median is often preferred as it's less sensitive to outliers.
• For categorical data, you can impute with the column's mode (the most
frequent value).
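A minimal pandas sketch of deletion and imputation, using an invented DataFrame:

```python
# A minimal missing-value handling sketch: deletion plus median/mode imputation.
# The customer data below is invented.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 35],
    "city": ["Delhi", "Mumbai", np.nan, "Delhi"],
})

# Deletion: drop rows where every value is missing (a threshold could also be used)
df = df.dropna(how="all")

# Imputation: median for numerical data, mode for categorical data
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```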
[Figure: Addressing missing values in data analysis: algorithm limitations (inability to process missing data, skewed statistical calculations), deletion techniques (row or column deletion), and imputation techniques (mean/median for numerical data, mode for categorical data).]

2. Correcting Inconsistent Data and Typos

This task focuses on standardizing data to ensure consistency across all records.

• Why it's a problem: A single entity can be represented in multiple ways, leading a
computer to treat them as different things.
• Examples:
• Categorical Data: A "Country" column might contain "USA," "U.S.," and "United
States." These should all be standardized to a single format, like "USA."
• Case Sensitivity: "New Delhi" and "new delhi" would be treated as different
categories unless standardized to a consistent case (e.g., title case).
• Formatting: Standardizing date formats (e.g., from "DD-MM-YYYY" to
"YYYY-MM-DD") or phone numbers.
[Figure: Sources of data inconsistency: representation problems, categorical data variation, case sensitivity issues, and formatting discrepancies.]
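A minimal pandas sketch of these standardization steps, with invented column values:

```python
# A minimal standardization sketch: unify category labels, case, and date formats.
# The "country", "city", and "order_date" values are invented.
import pandas as pd

df = pd.DataFrame({
    "country": ["USA", "U.S.", "United States"],
    "city": ["New Delhi", "new delhi", "NEW DELHI"],
    "order_date": ["01-02-2024", "15-02-2024", "28-02-2024"],  # DD-MM-YYYY
})

# Map variants of the same entity to one canonical label
df["country"] = df["country"].replace({"U.S.": "USA", "United States": "USA"})

# Standardize case
df["city"] = df["city"].str.title()

# Standardize the date format to YYYY-MM-DD
df["order_date"] = pd.to_datetime(df["order_date"], format="%d-%m-%Y").dt.strftime("%Y-%m-%d")
print(df)
```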

3. Handling Outliers and Structural Errors

This involves identifying and dealing with data points that are improbable or incorrectly
formatted.

• Outliers: These are data points that fall far outside the normal range of the data (e.g., a
person's age listed as 200). They can heavily distort statistical models. Outliers can be
removed, corrected if they are clear data entry errors, or capped (setting a maximum
value).

• Structural Errors: These are issues with the data's format, like a number column that
contains text (e.g., "N/A" instead of a blank). These must be corrected for the data type
to be consistent.
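A minimal pandas sketch, assuming a hypothetical "age" column that contains an improbable value and an "N/A" string; here outliers are capped with the common 1.5 * IQR rule.

```python
# A minimal sketch of fixing structural errors and capping outliers.
# The "age" values are invented.
import pandas as pd

df = pd.DataFrame({"age": ["25", "200", "N/A", "34", "41"]})

# Structural error: coerce non-numeric strings like "N/A" to missing values
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Outliers: cap values outside the 1.5 * IQR range
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["age"] = df["age"].clip(lower=lower, upper=upper)
print(df)
```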
[Figure: Handling outliers and structural errors: remove outliers that distort statistical models, correct clear data entry errors, cap extreme values to limit their impact, and correct structural errors so data formats stay consistent.]

4. Removing Duplicates

This task involves identifying and removing identical records from the dataset.

• Why it's a problem: Duplicate entries can lead to inflated counts and give certain data
points more weight than they should have during analysis, biasing the results.

• Example: A customer might appear twice in a sales database due to a system glitch.
Finding and removing one of these entries is essential for accurate customer analysis.
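A one-step pandas sketch on an invented sales table:

```python
# A minimal deduplication sketch: keep the first occurrence of each identical row.
import pandas as pd

sales = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "amount": [250, 400, 400, 150],
})

sales = sales.drop_duplicates()
print(sales)
```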
[Figure: Duplicate data biases analytical results: inflated counts caused by system glitches lead to biased results.]

c. Data transformation

Data transformation is the process of converting data from one format or structure to
another. In the context of data science and analytics, it involves applying a set of functions or
rules to raw data to make it suitable for analysis and, most importantly, to improve the
performance of machine learning models.
Think of it as preparing ingredients for a specific recipe. You might need to chop vegetables,
grind spices, or convert measurements—all transformations to make the ingredients work for
the final dish. Similarly, data must be transformed to work effectively with a given algorithm.
[Figure: Data transformation process: raw data undergoes format conversion, structure adjustment, and rule application to become suitable for analysis and to improve model performance.]

Key Data Transformation Techniques

Here are the most common and important data transformation techniques in detail.

1. Feature Scaling

Many machine learning algorithms are sensitive to the scale of the data. If one feature has a
very large range (e.g., salary from ₹5,00,000 to ₹50,00,000) and another has a small range
(e.g., years of experience from 1 to 20), the algorithm might incorrectly assume that
salary is more important. Feature scaling puts all features on a similar scale.
• Normalization (Min-Max Scaling): This technique rescales the data to a fixed range, typically 0 to 1. It's calculated as:
  X_norm = (X − X_min) / (X_max − X_min)
  It's useful but can be sensitive to outliers.
• Standardization (Z-score Normalization): This technique rescales data to have a mean of 0 and a standard deviation of 1. It's calculated as:
  X_std = (X − μ) / σ
  where μ is the mean and σ is the standard deviation. This is often the preferred method as it's less affected by outliers.
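A minimal scikit-learn sketch of both techniques, using invented salary and experience values:

```python
# A minimal feature-scaling sketch: Min-Max normalization vs. Z-score standardization.
# The salary and experience values are invented.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({
    "salary": [500000, 1200000, 3000000, 5000000],
    "experience_years": [1, 4, 12, 20],
})

# Normalization (Min-Max): rescales each feature to the range [0, 1]
normalized = MinMaxScaler().fit_transform(df)

# Standardization (Z-score): rescales each feature to mean 0, standard deviation 1
standardized = StandardScaler().fit_transform(df)
print(normalized)
print(standardized)
```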
[Figure: Choosing a feature scaling technique: normalization is useful for scaling data to a fixed range but is sensitive to outliers; standardization scales data to a mean of 0 and standard deviation of 1 and is less affected by outliers.]

2. Encoding Categorical Variables

Most machine learning models are based on mathematical equations and can only
understand numbers, not text. Encoding is the process of converting categorical text data
into a numerical format.
• Label Encoding: This technique assigns a unique integer to each category. For
example, in a "City" column, New Delhi might become 0, Mumbai might become 1,
and Bengaluru might become 2.
• Limitation: This can create a false sense of order (e.g., implying that Bengaluru >
Mumbai), which can mislead some models.
• One-Hot Encoding: This technique creates a new binary (0 or 1) column for each
unique category. If a row's city is "New Delhi," the "New Delhi" column will be 1 while
the "Mumbai" and "Bengaluru" columns will be 0.
• Benefit: It avoids the problem of creating a false order and is generally the
preferred method for nominal (unordered) categories.
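A minimal pandas sketch of both approaches on a hypothetical "city" column (pandas category codes stand in for label encoding here):

```python
# A minimal encoding sketch: integer label codes vs. one-hot columns.
import pandas as pd

df = pd.DataFrame({"city": ["New Delhi", "Mumbai", "Bengaluru", "Mumbai"]})

# Label encoding: one integer per category (can imply a false order)
df["city_label"] = df["city"].astype("category").cat.codes

# One-hot encoding: one binary column per category (preferred for nominal data)
one_hot = pd.get_dummies(df["city"], prefix="city")
print(df.join(one_hot))
```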

[Figure: Choose encoding wisely for model accuracy: label encoding needs only a single column but creates a false order; one-hot encoding needs multiple columns but avoids a false order.]

3. Handling Skewed Data

Some algorithms, especially linear models, assume that the numerical features follow a
normal (bell-shaped) distribution. If a feature's distribution is highly skewed (lopsided),
transforming it can improve model performance.
• Log Transformation: This involves taking the logarithm (log(x)) of each value in a
feature. It is very effective for reducing right-skewness (when the tail of the
distribution is on the right).
• Square Root Transformation: Taking the square root (sqrt(x)) is another, milder way to
reduce right-skewness.
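A minimal sketch of both transformations with NumPy, using an invented right-skewed income column (log1p is used so zero values are handled safely):

```python
# A minimal skew-reduction sketch: log and square-root transformations.
# The income values are invented and deliberately right-skewed.
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [20000, 25000, 30000, 45000, 60000, 1500000]})

df["income_log"] = np.log1p(df["income"])   # strong skew reduction
df["income_sqrt"] = np.sqrt(df["income"])   # milder skew reduction
print(df.skew())
```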

[Figure: Reducing right-skewness: log transformation is effective for significant skew reduction; square-root transformation gives milder skew reduction.]

4. Binning

Binning (or bucketing) is the process of converting a continuous numerical feature into a
categorical one by grouping values into "bins."
• Use Case: Converting a precise "Age" feature into categorical "Age Groups" like '0-18',
'19-35', '36-60', and '60+'.
• Benefit: It can help reduce the effect of small errors in the data and capture non-linear
patterns. For example, a person's spending habits might be similar within an age group
but change significantly between groups.
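A minimal pandas sketch using pd.cut with the age groups from the use case:

```python
# A minimal binning sketch: convert a continuous "age" column into age groups.
import pandas as pd

df = pd.DataFrame({"age": [5, 17, 23, 40, 65, 80]})

df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 18, 35, 60, 120],
    labels=["0-18", "19-35", "36-60", "60+"],
)
print(df)
```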
[Figure: Should the "Age" feature be binned? Binning reduces errors and captures non-linear patterns; leaving it unbinned maintains precision but may miss non-linear patterns.]

d. Data Reduction

Data reduction is the process of reducing the volume of a dataset while trying to preserve as
much of its important information as possible. The goal is to obtain a smaller, more
manageable representation of the original data that can be analyzed more efficiently without
sacrificing significant analytical quality.
Think of it like creating a high-quality summary of a long book. You remove the less critical
details but keep the main plot, characters, and themes, making it much faster to read and
understand.
Data reduction is essential for dealing with "Big Data" because massive datasets are
expensive to store and computationally intensive to analyze.
[Figure: Data reduction process: identify key information, remove redundancy, summarize the data.]
There are two main strategies for data reduction:

1. Dimensionality Reduction (Reducing Columns)

This strategy focuses on reducing the number of random variables or features (columns)
under consideration. Many datasets have redundant or irrelevant features that add
complexity without adding much information.

• Why it's done: To combat the "Curse of Dimensionality," where too many features can
actually make machine learning models perform worse. It also simplifies models and
speeds up training time.

• Key Techniques:
• Principal Component Analysis (PCA): This is the most popular dimensionality
reduction technique. PCA transforms the original set of features into a new,
smaller set of uncorrelated features called principal components. It works by
identifying the directions of maximum variance in the data and projecting the
data onto them. The first principal component captures the most information
(variance), the second captures the next most, and so on. You can then keep
just the first few components that represent the majority of the original data's
variance.

• Feature Selection: Instead of creating new features like PCA, this technique
selects a subset of the most relevant original features. It aims to find the best
combination of features that produce the best model performance.
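A minimal scikit-learn sketch of PCA on a hypothetical feature matrix; the data is random, and features are standardized first because PCA is sensitive to scale.

```python
# A minimal PCA sketch: keep enough components to explain 95% of the variance.
# X is random placeholder data standing in for a real feature matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # 100 rows, 10 features

X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.95)           # retain 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_)
```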
[Figure: Strategies for dimensionality reduction: Principal Component Analysis and feature selection both aim at improved model performance.]

2. Numerosity Reduction (Reducing Rows)

This strategy focuses on reducing the number of data entries or records (rows) in a dataset.
• Why it's done: To create a smaller, representative subset of the data that is faster to
process.

• Key Techniques:
• Sampling: This involves selecting a subset of the data that is representative of
the entire dataset.

• Simple Random Sampling: Every data point has an equal chance of being
selected.
• Stratified Sampling: The population is divided into subgroups (strata)
based on a certain characteristic (e.g., age groups, geographic regions).
A random sample is then taken from each subgroup. This ensures that
minority groups are properly represented in the sample.
• Data Aggregation: This involves combining multiple data points into a single
summary. For example, daily sales data can be aggregated into weekly,
monthly, or quarterly sales figures. This reduces the number of rows
significantly. For instance, 365 daily sales records can be reduced to just 12
monthly records.
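A minimal pandas sketch of simple random sampling, stratified sampling, and aggregation on an invented year of daily sales:

```python
# A minimal numerosity-reduction sketch: sampling and aggregation.
# The daily sales data and "region" column are invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=365, freq="D"),
    "region": rng.choice(["North", "South"], size=365),
    "sales": rng.integers(100, 1000, size=365),
})

# Simple random sampling: keep 10% of the rows
sample = daily.sample(frac=0.1, random_state=42)

# Stratified sampling: 10% from each region
stratified = daily.groupby("region", group_keys=False).apply(lambda g: g.sample(frac=0.1, random_state=42))

# Aggregation: 365 daily rows become 12 monthly totals
monthly = daily.resample("MS", on="date")["sales"].sum()
print(len(sample), len(stratified), len(monthly))
```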
[Figure: Numerosity reduction: sampling selects a representative subset of data; aggregation combines data points into summaries.]

e. Feature Engineering

Feature engineering is the creative and often iterative process of using domain knowledge to
select, transform, and create new input variables (features) from raw data to improve the
performance of a machine learning model. It is widely considered one of the most impactful
activities in the machine learning workflow.
Think of it like this: A machine learning model is like a student. You can give the student raw,
unorganized textbooks (raw data) and hope they figure everything out, or you can give them
well-structured summary notes with key concepts highlighted (engineered features). The
student with the better notes will almost always learn faster and perform better.
[Figure: Feature engineering cycle: select relevant features from raw data, transform them for better use, create new features from existing ones, evaluate model performance, and iterate by refining features based on the evaluation.]

Why is Feature Engineering So Important? ✨

The features you provide to your model have a huge influence on its performance. Better
features can:

• Improve Model Accuracy: Well-engineered features can expose the underlying patterns in the data more clearly, making it easier for the model to learn.

• Simplify Models: Good features can capture complex relationships in a simpler form,
allowing you to use less complex and faster models.

• Incorporate Domain Knowledge: It's the primary way to inject human expertise about
the problem domain into the model.
[Figure: The power of feature engineering: better features improve model accuracy, simplify models (making them faster), and enrich models with domain knowledge and human expertise.]

Key Feature Engineering Techniques

Feature engineering is a broad field, but most techniques fall into a few main categories.

1. Feature Creation

This involves creating entirely new features from one or more existing ones.

• From Dates and Times: A single timestamp column is rarely useful on its own. You can
create many valuable features from it:
• day_of_week, month, year, hour_of_day, week_of_year.

• A binary feature like is_weekend or is_holiday.


• From Combining Features: You can create interaction features that capture
relationships between variables.

• From height and weight, create a Body Mass Index (BMI) feature.
• In real estate, from price and square_feet, create a price_per_square_foot
feature.
• In e-commerce, create a cart_value feature by multiplying item_price by
quantity.
• From Text Data:
• word_count, character_count, or average_word_length in a text field.
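A minimal pandas sketch creating date-based and interaction features from a hypothetical e-commerce table:

```python
# A minimal feature-creation sketch: date/time features and an interaction feature.
# The order data and column names are invented.
import pandas as pd

df = pd.DataFrame({
    "order_time": pd.to_datetime(["2024-03-02 14:30", "2024-03-04 09:15"]),
    "item_price": [499.0, 1299.0],
    "quantity": [3, 1],
})

# Date/time features
df["day_of_week"] = df["order_time"].dt.day_name()
df["hour_of_day"] = df["order_time"].dt.hour
df["is_weekend"] = df["order_time"].dt.dayofweek.isin([5, 6]).astype(int)

# Interaction feature
df["cart_value"] = df["item_price"] * df["quantity"]
print(df)
```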

2. Feature Transformation

This involves modifying existing features to make them more suitable for a model.

• Binning (or Discretization): This technique converts a continuous numerical feature into a categorical one. For example, a precise age feature can be grouped into age_group categories like 'Child' (0-12), 'Teen' (13-19), 'Adult' (20-59), and 'Senior' (60+). This can help the model capture non-linear patterns.
• Log Transformation: Applying a logarithmic function to a feature can help handle
highly skewed data, making its distribution more normal, which benefits many linear
models.

3. Advanced Encoding Methods

While simple encoding is part of preprocessing, more advanced techniques are a form of
feature engineering.
• Frequency Encoding: Replacing a category with its frequency (the number of times it
appears in the dataset). This can capture the importance of a category.

• Target Encoding (Mean Encoding): This is a powerful technique where you replace a
category with the average value of the target variable for that category. For example,
if you're predicting customer churn, you could replace the category "New Delhi" with
the average churn rate of all customers from New Delhi. This directly encodes
information about the target variable but must be used carefully to avoid data leakage.
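A minimal pandas sketch of frequency and target encoding on an invented churn table; in practice, target encoding should be computed on training folds only to avoid leakage.

```python
# A minimal sketch of frequency encoding and target (mean) encoding.
# The "city" and "churned" values are invented.
import pandas as pd

df = pd.DataFrame({
    "city": ["New Delhi", "Mumbai", "New Delhi", "Bengaluru", "Mumbai", "New Delhi"],
    "churned": [1, 0, 0, 1, 0, 1],
})

# Frequency encoding: replace each category with its count in the dataset
df["city_freq"] = df["city"].map(df["city"].value_counts())

# Target encoding: replace each category with the mean churn rate for that category
df["city_target"] = df["city"].map(df.groupby("city")["churned"].mean())
print(df)
```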

4. Handling Missing Data as a Feature

Sometimes, the fact that data is missing is itself a valuable piece of information. Instead of
just imputing a missing value, you can create a new binary feature that indicates its absence.
• Example: In a dataset of loan applicants, some may not provide their annual_income.
You could create a feature called income_is_missing (1 if missing, 0 if present). This
absence might correlate with a higher risk of default.
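A minimal pandas sketch of this idea, with an invented loan table:

```python
# A minimal sketch of treating missingness as a feature: flag the absence
# before imputing so the signal is not lost. The income values are invented.
import numpy as np
import pandas as pd

df = pd.DataFrame({"annual_income": [550000.0, np.nan, 820000.0, np.nan]})

df["income_is_missing"] = df["annual_income"].isna().astype(int)
df["annual_income"] = df["annual_income"].fillna(df["annual_income"].median())
print(df)
```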
