Data Science and Analytics
Data science is the entire discipline of deriving value from data, encompassing theory,
processes, and tools. Data analytics is the core practical process within data science focused
on examining datasets to draw conclusions.
A Deeper Look at Data Science
Data science is more than just analyzing data; it's a field built on three pillars:
Mathematics & Statistics: This is the theoretical foundation. It provides the methods for
quantifying uncertainty, designing experiments, and building sound models. It's the
knowledge that ensures the conclusions drawn from data are valid and not just random
chance.
Computer Science & Programming: This provides the practical tools to handle data at scale.
This includes skills in programming (like Python or R), database management (SQL), and
understanding algorithms and data structures.
Domain Expertise: This is the crucial context. It's the understanding of the specific field—be it
finance, healthcare, marketing, or manufacturing—that allows a data scientist to ask the right
questions, interpret the results correctly, and understand the real-world implications. An
insight is useless without the context to make it actionable.
The "Science" in Data ScienceThe field is called data science because it follows a systematic,
evidence-based approach similar to the scientific method:
Formulate a Hypothesis: Start with a question or assumption (e.g., "We believe customers
who use our mobile app spend more than those who use the website.").
Design an Experiment: Plan how to test the hypothesis. This could involve an A/B test, where
you show one group of users one version of a product and a second group another version.
Collect & Process Data: Gather the data from the experiment or from existing sources.
Analyze the Results: Use statistical tests to determine if the results are significant.
Draw a Conclusion: Based on the evidence, either accept or reject the hypothesis.
Communicate these findings.
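To make the "analyze the results" step concrete, here is a minimal Python sketch of testing the mobile-app hypothesis above with a two-sample t-test; the spend figures, group sizes, and the 5% significance threshold are illustrative assumptions, not real data.

```python
import numpy as np
from scipy import stats

# Hypothetical spend per customer for each experiment group (assumed data).
app_spend = np.array([520, 610, 480, 700, 560, 630, 590, 545])
web_spend = np.array([470, 430, 510, 455, 490, 465, 440, 500])

# Welch's two-sample t-test: is the difference in mean spend significant?
t_stat, p_value = stats.ttest_ind(app_spend, web_spend, equal_var=False)

alpha = 0.05  # conventional significance threshold
if p_value < alpha:
    print(f"p = {p_value:.4f}: the difference is statistically significant.")
else:
    print(f"p = {p_value:.4f}: not enough evidence to reject the null hypothesis.")
```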
Ethics in Data Science
A critical aspect of data science is ethics. Data gives power, and with that comes responsibility. Key ethical considerations include:
Bias: Data can reflect historical biases. If a hiring model is trained on past data where mostly
men were hired, it might learn to discriminate against women. A good data scientist must
identify and mitigate such biases.
Privacy: Handling sensitive personal data is a major responsibility. Data scientists must comply
with regulations like GDPR (in Europe) and ensure data is anonymized and secure.
Transparency: Models, especially those used for important decisions (like loan applications),
should be explainable. People have a right to know how a decision affecting them was
made.
The Four Types of Data Analytics in Detail
Data analytics is the engine of insight generation. Let's explore the four types with more depth.
[Figure: Data Science and Analytics Structure. Data Science is the discipline of deriving value from data; Data Analytics is the core process of examining data for insights; Domain Expertise provides the context for actionable insights.]
[Figure: Descriptive Analysis flow. Raw, unprocessed data (disorganized and meaningless information) becomes clear insights, a concise picture of the past, through aggregation (calculate sums, counts, averages), central tendency (find mean, median, and mode), and dispersion (calculate range, standard deviation, variance).]
[Figure: Diagnostic analysis techniques. Root Cause Analysis traces a problem back to its origin using methodologies like the "5 Whys"; Correlation Analysis identifies statistical relationships between variables while cautioning against assuming causation; Drill-Down allows detailed exploration of data points to uncover underlying information.]
3. Predictive Analytics: "What is likely to happen?" (The Forecast)
This stage moves from looking at the past to forecasting the future. While complex machine learning is a powerful tool here, many predictive analyses rely on established statistical models.
Goal: To make educated guesses about future outcomes based on historical patterns.
Techniques (including non-ML):
Time-Series Analysis: Models like ARIMA or Moving Averages that use the temporal nature of
data to forecast future points (e.g., stock prices, sales).
Statistical Regression: Using statistical models to predict a value based on the relationship
between variables.
Example: A retail chain uses its past 3 years of sales data in a time-series model to forecast
the demand for winter coats for the upcoming season, helping them manage inventory.
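As a rough sketch of the idea (not the retailer's actual model), the snippet below forecasts the next period with a simple moving average in pandas; the sales figures and the three-month window are assumptions, and a full ARIMA model (for example via statsmodels) would be the more rigorous choice described above.

```python
import pandas as pd

# Hypothetical monthly winter-coat sales in units (assumed data).
sales = pd.Series(
    [120, 340, 510, 480, 150, 130, 360, 545, 500, 160, 140, 390],
    index=pd.period_range("2023-01", periods=12, freq="M"),
)

# Naive moving-average forecast: the mean of the last 3 months
# is used as the estimate for the next month.
window = 3
forecast_next = sales.tail(window).mean()
print(f"Forecast for the next month: {forecast_next:.0f} units")
```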
[Figure: Choosing a predictive analytics technique. Time-Series Analysis uses temporal data to forecast future points, suitable for stock prices or sales; Statistical Regression predicts values based on variable relationships, useful for understanding dependencies.]
[Figure: Choosing a prescriptive analytics technique. Optimization is best for finding the most efficient solution within constraints; Simulation is ideal for testing outcomes under various conditions.]
a. Data Collection
Once you know your objective, you need the raw materials: data. This step involves
gathering all relevant information from various sources.
Data collection methods are the specific techniques used to gather information in a
systematic way to answer a question or test a hypothesis. The choice of method depends on
the research question, required data type, budget, and timeline.
[Figure: Factors in choosing a data collection method. The nature of the research question guides the method choice; the type of data needed influences the method; financial constraints impact method selection; and time availability affects method feasibility.]
1. Surveys
This method involves asking a structured set of questions to a group of individuals. Surveys
are extremely versatile and can be deployed through various channels like email, web forms,
or in person.
[Table: Survey pros and cons. A key advantage is reaching a diverse audience.]
2. Interviews
[Figure: Choosing an interview type. Structured interviews ensure consistency and comparability across interviews but may limit flexibility; semi-structured interviews balance consistency with flexibility, allowing for follow-up questions; unstructured interviews offer maximum flexibility and spontaneity but may lack consistency.]
• Data Type: Almost exclusively qualitative. They provide rich, in-depth insights into
participants' thoughts, feelings, and experiences.
3. Observation
This method involves systematically watching and recording behaviors or events as they
occur in their natural setting. The researcher can be a participant (joining the group) or a
non-participant (watching from a distance).
• Use Case: A retail consultant observing how shoppers navigate a store to identify
bottlenecks and optimize the layout.
• Advantages: Provides direct insight into actual behavior rather than self-reported
behavior, which can be more accurate.
• Disadvantages: Can be subject to observer bias, is time-intensive, and doesn't explain
the "why" behind behaviors.
4. Web Scraping
This is an automated technique for extracting large amounts of data from websites. A script
or "bot" is programmed to visit web pages and pull specific information, which is then saved
into a structured format like a spreadsheet.
• Data Type: Secondary data, which can be quantitative or qualitative.
• Use Case: An e-commerce analyst scraping competitor websites daily to track product
prices and stock levels.
• Advantages: Highly efficient for gathering massive public datasets.
• Disadvantages: Can be legally and ethically complex, and scripts can break if the
source website's layout changes.
[Table: Web scraping pros and cons. A key drawback is data quality: the risk of inaccurate or outdated data affecting the analysis.]
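The sketch below shows the general shape of such a script, assuming a hypothetical page at example.com and made-up CSS class names (div.product, h2.name, span.price); a real scraper must also respect the target site's terms of service and robots.txt.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical product-listing page; the URL and selectors are assumptions.
url = "https://example.com/products"
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Pull each product's name and price into a structured list of records.
rows = []
for product in soup.select("div.product"):
    name = product.select_one("h2.name")
    price = product.select_one("span.price")
    if name and price:
        rows.append({"name": name.get_text(strip=True),
                     "price": price.get_text(strip=True)})

print(rows)  # could be written to CSV or a database for later analysis
```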
b. Data Preprocessing
Data cleaning, also known as data cleansing or data scrubbing, is the process of detecting
and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset. It's a
foundational step in data preparation that transforms messy, raw data into a clean, reliable,
and usable format for analysis or machine learning.
Think of it like preparing ingredients before cooking; you must wash vegetables, remove bad
spots, and measure correctly before you can create a good meal. The same principle, often
summarized as "garbage in, garbage out," applies to data—low-quality data will always
produce low-quality results.
1. Handling Missing Values
This is the most common data cleaning task. It addresses records where data is missing (e.g.,
a blank cell in a spreadsheet).
• Why it's a problem: Many algorithms can't handle missing values, and they can skew
statistical calculations like means or sums.
• Common Techniques:
• Deletion: If a row has too many missing values, it might be best to remove the
entire row. Similarly, if a column is mostly empty, it might be dropped. This is a
simple but potentially costly approach as it leads to information loss.
• Imputation: This involves filling in the missing values. The method depends on
the data type:
• For numerical data, you can impute with the column's mean or median.
The median is often preferred as it's less sensitive to outliers.
• For categorical data, you can impute with the column's mode (the most
frequent value).
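A minimal pandas sketch of these options is shown below; the column names and values are assumptions for illustration.

```python
import pandas as pd

# Small assumed dataset with gaps in a numerical and a categorical column.
df = pd.DataFrame({
    "age": [25, None, 40, 35, None],
    "city": ["Mumbai", "New Delhi", None, "Mumbai", "Mumbai"],
})

# Numerical column: impute with the median (less sensitive to outliers).
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: impute with the mode (the most frequent value).
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Deletion is the alternative, e.g. dropping rows with too many gaps:
# df = df.dropna(thresh=2)
print(df)
```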
2. Handling Inconsistent Data
This task focuses on standardizing data to ensure consistency across all records.
• Why it's a problem: A single entity can be represented in multiple ways, leading a
computer to treat them as different things.
• Examples:
• Categorical Data: A "Country" column might contain "USA," "U.S.," and "United
States." These should all be standardized to a single format, like "USA."
• Case Sensitivity: "New Delhi" and "new delhi" would be treated as different
categories unless standardized to a consistent case (e.g., title case).
• Formatting: Standardizing date formats (e.g., from "DD-MM-YYYY" to
"YYYY-MM-DD") or phone numbers.
3. Handling Outliers and Structural Errors
This involves identifying and dealing with data points that are improbable or incorrectly
formatted.
• Outliers: These are data points that fall far outside the normal range of the data (e.g., a
person's age listed as 200). They can heavily distort statistical models. Outliers can be
removed, corrected if they are clear data entry errors, or capped (setting a maximum
value).
• Structural Errors: These are issues with the data's format, like a number column that
contains text (e.g., "N/A" instead of a blank). These must be corrected for the data type
to be consistent.
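Here is one possible way to apply these ideas in pandas, using the common IQR rule; the 1.5 x IQR threshold and the sample values are assumptions.

```python
import pandas as pd

# Assumed column with an obvious data-entry outlier (age of 200).
ages = pd.Series([23, 35, 41, 29, 200, 38, 31])

# IQR rule: flag values far outside the typical range.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

ages_removed = ages[(ages >= lower) & (ages <= upper)]   # drop outliers
ages_capped = ages.clip(lower=lower, upper=upper)        # or cap them

# Structural errors: coerce text like "N/A" in a numeric column to NaN.
heights = pd.to_numeric(pd.Series(["170", "N/A", "182"]), errors="coerce")
```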
4. Removing Duplicates
This task involves identifying and removing identical records from the dataset.
• Why it's a problem: Duplicate entries can lead to inflated counts and give certain data
points more weight than they should have during analysis, biasing the results.
• Example: A customer might appear twice in a sales database due to a system glitch.
Finding and removing one of these entries is essential for accurate customer analysis.
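In pandas this is typically a one-liner, sketched below with assumed column names.

```python
import pandas as pd

# Assumed sales table where one customer was recorded twice by a glitch.
sales = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "amount": [250, 400, 400, 150],
})

# Drop rows that are exact copies, keeping the first occurrence.
deduped = sales.drop_duplicates()

# Or treat rows as duplicates based on a key column only.
deduped_by_id = sales.drop_duplicates(subset=["customer_id"], keep="first")
```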
c. Data Transformation
Data transformation is the process of converting data from one format or structure to
another. In the context of data science and analytics, it involves applying a set of functions or
rules to raw data to make it suitable for analysis and, most importantly, to improve the
performance of machine learning models.
Think of it as preparing ingredients for a specific recipe. You might need to chop vegetables,
grind spices, or convert measurements—all transformations to make the ingredients work for
the final dish. Similarly, data must be transformed to work effectively with a given algorithm.
[Figure: Data Transformation Process. Raw data is converted through format conversion, structure adjustment, and rule application.]
Here are the most common and important data transformation techniques in detail.
1. Feature Scaling
Many machine learning algorithms are sensitive to the scale of the data. If one feature has a
very large range (e.g., salary from ₹5,00,000 to ₹50,00,000) and another has a small range
(e.g., years of experience from 1 to 20), the algorithm might incorrectly assume that
salary is more important. Feature scaling puts all features on a similar scale.
• Normalization (Min-Max Scaling): This technique rescales the data to a fixed range,
typically 0 to 1. It's calculated as:
• X_norm = (X − X_min) / (X_max − X_min)
• It's useful but can be sensitive to outliers.
• Standardization (Z-score Normalization): This technique rescales data to have a mean
of 0 and a standard deviation of 1. It's calculated as:
• X_std = (X − μ) / σ
• where μ is the mean and σ is the standard deviation. This is often the preferred
method as it's less affected by outliers.
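Both formulas can be applied directly in pandas, as in the sketch below; the salary and experience values are assumptions, and scikit-learn's MinMaxScaler and StandardScaler provide the same transformations.

```python
import pandas as pd

# Assumed features on very different scales.
df = pd.DataFrame({
    "salary": [500000, 1200000, 2500000, 5000000],
    "experience_years": [1, 5, 12, 20],
})

# Normalization (min-max): rescale each column to the 0-1 range.
normalized = (df - df.min()) / (df.max() - df.min())

# Standardization (z-score): rescale to mean 0 and standard deviation 1.
standardized = (df - df.mean()) / df.std()

print(normalized, standardized, sep="\n\n")
```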
2. Encoding Categorical Data
Most machine learning models are based on mathematical equations and can only
understand numbers, not text. Encoding is the process of converting categorical text data
into a numerical format.
• Label Encoding: This technique assigns a unique integer to each category. For
example, in a "City" column, New Delhi might become 0, Mumbai might become 1,
and Bengaluru might become 2.
• Limitation: This can create a false sense of order (e.g., implying that Bengaluru >
Mumbai), which can mislead some models.
• One-Hot Encoding: This technique creates a new binary (0 or 1) column for each
unique category. If a row's city is "New Delhi," the "New Delhi" column will be 1 while
the "Mumbai" and "Bengaluru" columns will be 0.
• Benefit: It avoids the problem of creating a false order and is generally the
preferred method for nominal (unordered) categories.
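The sketch below contrasts the two approaches with pandas; the city values are taken from the example above.

```python
import pandas as pd

df = pd.DataFrame({"city": ["New Delhi", "Mumbai", "Bengaluru", "Mumbai"]})

# Label encoding: map each category to an integer (implies a false order).
df["city_label"] = df["city"].astype("category").cat.codes

# One-hot encoding: one binary column per category, no implied order.
one_hot = pd.get_dummies(df["city"], prefix="city")
print(df.join(one_hot))
```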
[Figure: Label encoding vs. one-hot encoding. Label encoding needs only a single column but creates a false order; one-hot encoding avoids a false order but needs multiple columns.]
3. Handling Skewed Distributions
Some algorithms, especially linear models, assume that the numerical features follow a
normal (bell-shaped) distribution. If a feature's distribution is highly skewed (lopsided),
transforming it can improve model performance.
• Log Transformation: This involves taking the logarithm (log(x)) of each value in a
feature. It is very effective for reducing right-skewness (when the tail of the
distribution is on the right).
• Square Root Transformation: Taking the square root (sqrt(x)) is another, milder way to
reduce right-skewness.
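A short NumPy sketch of both transformations follows; the income values are assumed, with one very large value creating the right skew.

```python
import numpy as np
import pandas as pd

# Assumed right-skewed feature: a few very large values form the right tail.
income = pd.Series([25000, 30000, 32000, 41000, 55000, 900000])

income_log = np.log1p(income)    # log transform (log1p also handles zeros)
income_sqrt = np.sqrt(income)    # milder square-root transform

# Skewness should shrink after each transformation.
print(income.skew(), income_sqrt.skew(), income_log.skew())
```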
4. Binning
Binning (or bucketing) is the process of converting a continuous numerical feature into a
categorical one by grouping values into "bins."
• Use Case: Converting a precise "Age" feature into categorical "Age Groups" like '0-18',
'19-35', '36-60', and '60+'.
• Benefit: It can help reduce the effect of small errors in the data and capture non-linear
patterns. For example, a person's spending habits might be similar within an age group
but change significantly between groups.
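The age-group example can be reproduced with pandas' cut function, as sketched below; the bin edges are assumptions matching the groups above.

```python
import pandas as pd

# Assumed precise ages to be grouped into coarser bands.
ages = pd.Series([5, 17, 24, 33, 45, 61, 72])

bins = [0, 18, 35, 60, 120]          # upper edge 120 covers the "60+" group
labels = ["0-18", "19-35", "36-60", "60+"]

age_groups = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)
print(age_groups)
```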
Should the "Age" feature be binned
into categorical groups?
No Binning
Maintains precision
Binning but may miss non-
linear patterns.
Reduces errors and
captures non-linear
patterns in data.
d. Data Reduction
Data reduction is the process of reducing the volume of a dataset while trying to preserve as
much of its important information as possible. The goal is to obtain a smaller, more
manageable representation of the original data that can be analyzed more efficiently without
sacrificing significant analytical quality.
Think of it like creating a high-quality summary of a long book. You remove the less critical
details but keep the main plot, characters, and themes, making it much faster to read and
understand.
Data reduction is essential for dealing with "Big Data" because massive datasets are
expensive to store and computationally intensive to analyze.
1. Dimensionality Reduction
This strategy focuses on reducing the number of random variables or features (columns)
under consideration. Many datasets have redundant or irrelevant features that add
complexity without adding much information.
• Why it's done: To combat the "Curse of Dimensionality," where too many features can
actually make machine learning models perform worse. It also simplifies models and
speeds up training time.
• Key Techniques:
• Principal Component Analysis (PCA): This is the most popular dimensionality
reduction technique. PCA transforms the original set of features into a new,
smaller set of uncorrelated features called principal components. It works by
identifying the directions of maximum variance in the data and projecting the
data onto them. The first principal component captures the most information
(variance), the second captures the next most, and so on. You can then keep
just the first few components that represent the majority of the original data's
variance.
• Feature Selection: Instead of creating new features like PCA, this technique
selects a subset of the most relevant original features. It aims to find the best
combination of features that produce the best model performance.
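A minimal scikit-learn sketch of PCA is shown below; the data is synthetic (ten features driven by three hidden factors) and the 90% variance threshold is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 correlated features generated from 3 hidden factors.
rng = np.random.default_rng(42)
hidden = rng.normal(size=(200, 3))
X = hidden @ rng.normal(size=(3, 10)) + rng.normal(scale=0.1, size=(200, 10))

# PCA is variance-based, so the features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

# Keep just enough principal components to explain 90% of the variance.
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_)
```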
2. Numerosity Reduction
This strategy focuses on reducing the number of data entries or records (rows) in a dataset.
• Why it's done: To create a smaller, representative subset of the data that is faster to
process.
• Key Techniques:
• Sampling: This involves selecting a subset of the data that is representative of
the entire dataset.
• Simple Random Sampling: Every data point has an equal chance of being
selected.
• Stratified Sampling: The population is divided into subgroups (strata)
based on a certain characteristic (e.g., age groups, geographic regions).
A random sample is then taken from each subgroup. This ensures that
minority groups are properly represented in the sample.
• Data Aggregation: This involves combining multiple data points into a single
summary. For example, daily sales data can be aggregated into weekly,
monthly, or quarterly sales figures. This reduces the number of rows
significantly. For instance, 365 daily sales records can be reduced to just 12
monthly records.
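The sketch below applies all three ideas to an assumed table of daily sales; the column names, the 10% sample fraction, and the region labels are illustrative.

```python
import pandas as pd

# Assumed daily sales with a region column to stratify on.
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=365, freq="D"),
    "region": (["North", "South", "East", "West"] * 92)[:365],
    "sales": range(365),
})

# Simple random sampling: keep 10% of the rows.
random_sample = df.sample(frac=0.10, random_state=42)

# Stratified sampling: 10% from each region, so small groups stay represented.
stratified = df.groupby("region", group_keys=False).sample(frac=0.10, random_state=42)

# Aggregation: collapse 365 daily rows into 12 monthly totals.
monthly = df.groupby(df["date"].dt.to_period("M"))["sales"].sum()
print(len(df), "->", len(monthly), "rows")
```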
e. Feature Engineering
Feature engineering is the creative and often iterative process of using domain knowledge to
select, transform, and create new input variables (features) from raw data to improve the
performance of a machine learning model. It is widely considered one of the most impactful
activities in the machine learning workflow.
Think of it like this: A machine learning model is like a student. You can give the student raw,
unorganized textbooks (raw data) and hope they figure everything out, or you can give them
well-structured summary notes with key concepts highlighted (engineered features). The
student with the better notes will almost always learn faster and perform better.
[Figure: Feature Engineering Cycle. Transform features (modify selected variables for better use), then evaluate model performance (assess how well the model performs).]
The features you provide to your model have a huge influence on its performance. Better
features can:
• Simplify Models: Good features can capture complex relationships in a simpler form,
allowing you to use less complex and faster models.
• Incorporate Domain Knowledge: It's the primary way to inject human expertise about
the problem domain into the model.
[Figure: The Power of Feature Engineering. Domain knowledge enriches models with human expertise; good features simplify models, making them faster; and better features enhance pattern recognition for better model accuracy.]
Feature engineering is a broad field, but most techniques fall into a few main categories.
1. Feature Creation
This involves creating entirely new features from one or more existing ones.
• From Dates and Times: A single timestamp column is rarely useful on its own. You can
create many valuable features from it:
• day_of_week, month, year, hour_of_day, week_of_year.
• From height and weight, create a Body Mass Index (BMI) feature.
• In real estate, from price and square_feet, create a price_per_square_foot
feature.
• In e-commerce, create a cart_value feature by multiplying item_price by
quantity.
• From Text Data:
• word_count, character_count, or average_word_length in a text field.
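The sketch below derives several of these features with pandas on a toy dataset combining the examples above; all column names are assumptions.

```python
import pandas as pd

# Toy dataset combining the examples above (all column names are assumptions).
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-03-01 09:30", "2024-03-02 18:45"]),
    "item_price": [499.0, 1299.0],
    "quantity": [3, 1],
    "height_m": [1.72, 1.80],
    "weight_kg": [68, 90],
})

# From dates and times: decompose the timestamp into model-friendly parts.
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["month"] = df["timestamp"].dt.month
df["hour_of_day"] = df["timestamp"].dt.hour

# From combinations of existing columns: BMI and cart value.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
df["cart_value"] = df["item_price"] * df["quantity"]
print(df)
```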
2. Feature Transformation
This involves modifying existing features to make them more suitable for a model.
While simple encoding is part of preprocessing, more advanced techniques are a form of
feature engineering.
• Frequency Encoding: Replacing a category with its frequency (the number of times it
appears in the dataset). This can capture the importance of a category.
• Target Encoding (Mean Encoding): This is a powerful technique where you replace a
category with the average value of the target variable for that category. For example,
if you're predicting customer churn, you could replace the category "New Delhi" with
the average churn rate of all customers from New Delhi. This directly encodes
information about the target variable but must be used carefully to avoid data leakage.
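Here is a simplified sketch of both encodings on an assumed churn dataset; a production version of target encoding would be fitted on the training data only (or with cross-validation) to avoid the leakage mentioned above.

```python
import pandas as pd

# Assumed churn data: "city" is categorical, "churned" is the 0/1 target.
df = pd.DataFrame({
    "city": ["New Delhi", "Mumbai", "New Delhi", "Bengaluru", "Mumbai", "New Delhi"],
    "churned": [1, 0, 1, 0, 1, 0],
})

# Frequency encoding: replace each city with how often it appears.
df["city_freq"] = df["city"].map(df["city"].value_counts())

# Target (mean) encoding: replace each city with its average churn rate.
df["city_target_enc"] = df["city"].map(df.groupby("city")["churned"].mean())
print(df)
```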
3. Missing Value Indicators
Sometimes, the fact that data is missing is itself a valuable piece of information. Instead of
just imputing a missing value, you can create a new binary feature that indicates its absence.
• Example: In a dataset of loan applicants, some may not provide their annual_income.
You could create a feature called income_is_missing (1 if missing, 0 if present). This
absence might correlate with a higher risk of default.
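A minimal sketch of this idea, using the hypothetical income_is_missing feature from the example:

```python
import pandas as pd

# Assumed loan-application data where some applicants omit their income.
df = pd.DataFrame({"annual_income": [550000, None, 820000, None, 640000]})

# Record the absence itself as a feature before filling the gap.
df["income_is_missing"] = df["annual_income"].isna().astype(int)
df["annual_income"] = df["annual_income"].fillna(df["annual_income"].median())
print(df)
```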