Data Collection & Preprocessing Techniques
UCSC0601
Presented By:
Dr. Abhinandan P. S
Department of CSE
KIT Kolhapur(Autonomous)
Unit 2
Data Collection and Preprocessing
• Data acquisition methods and sources,
• Exploratory data analysis (EDA) techniques,
• Data cleaning techniques:
• handling missing values,
• outliers, and noise,
• Data validation,
• Data transformation,
• Data reduction,
• Normalization Techniques
Data acquisition
• The methods and sources of data acquisition can vary depending on the
nature of the data, the specific requirements of a project, and the industry
involved.
Here are common data acquisition methods and sources:
Surveys and Questionnaires:
Method: Surveys and questionnaires involve asking individuals or organizations a set of
predefined questions to gather information.
Source: Responses can be collected through online surveys, paper forms, phone interviews, or
in-person interviews.
Sensors and Instruments:
Method: Sensors and instruments are used to measure physical or environmental parameters,
such as temperature, pressure, humidity, or GPS coordinates.
Source: Data from sensors can be collected in real-time and are commonly used in scientific
research, manufacturing, environmental monitoring, and healthcare.
Web Scraping:
Method: Web scraping involves extracting data from websites, typically for purposes such as
market research, competitive analysis, or data aggregation.
Source: Websites, online databases, and social media platforms are common sources for web
scraping.
Web Scraping Use Cases
E-commerce Price Monitoring:
• Retailers and competitors use web scraping to monitor and extract pricing information from e-commerce
websites. This allows them to adjust their own pricing strategies in real-time to stay competitive.
Job Market Analysis:
• Job aggregators and recruitment agencies use web scraping to gather job listings from various websites. This
data can be analyzed to understand job market trends, demand for specific skills, and salary ranges.
Travel and Flight Prices:
• Travel agencies and price comparison websites use web scraping to extract real-time data on flight prices,
hotel rates, and other travel-related information. This enables them to provide users with up-to-date and
competitive pricing.
Social Media Monitoring:
• Companies and brands use web scraping to monitor social media platforms for mentions, reviews, and
sentiment analysis. This helps them understand public perception and respond to customer feedback in real-
time.
Stock Market Data:
• Financial analysts and traders use web scraping to extract real-time stock market data, news, and financial
reports from various sources. This information is crucial for making timely investment decisions.
Weather Data Collection:
• Meteorological organizations use web scraping to gather real-time weather data from multiple sources.
This allows them to provide accurate and timely weather forecasts.
Competitor Analysis:
• Businesses use web scraping to track and analyze competitor prices, product offerings, and marketing
strategies. This helps them adjust their own strategies to stay competitive in the market.
News Aggregation:
• News websites and aggregators use web scraping to collect and display news articles from various
sources. This ensures that users have access to a diverse range of news sources in one location.
Healthcare Data Extraction:
• Researchers and healthcare professionals use web scraping to gather information on medical research,
drug prices, and healthcare statistics from different websites. This aids in data-driven decision-making in
the healthcare industry.
Real Estate Market Analysis:
• Real estate professionals use web scraping to collect data on property prices, rental rates, and market
trends. This information is valuable for making informed decisions in the real estate industry.
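As a rough illustration of the web-scraping method, here is a minimal Python sketch using the requests and BeautifulSoup libraries. The URL, CSS selectors, and page structure are hypothetical placeholders, and any real scraper must respect the target site's robots.txt and terms of use.
```python
# A minimal web-scraping sketch using requests and BeautifulSoup.
# The URL and CSS classes below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

def fetch_prices(url: str) -> list[dict]:
    """Download a page and extract (product, price) pairs."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()                      # stop on HTTP errors
    soup = BeautifulSoup(response.text, "html.parser")

    products = []
    for item in soup.select(".product"):             # hypothetical CSS class
        name = item.select_one(".product-name")
        price = item.select_one(".product-price")
        if name and price:
            products.append({"name": name.get_text(strip=True),
                             "price": price.get_text(strip=True)})
    return products

if __name__ == "__main__":
    for row in fetch_prices("https://example.com/catalog"):   # placeholder URL
        print(row)
```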
IoT Devices:
Method: The Internet of Things (IoT) involves the use of interconnected devices that
collect and exchange data. This data is often used for monitoring and automation.
Source: IoT devices, such as smart sensors, wearables, and connected appliances,
generate data that can be collected and analyzed.
Transaction Data:
Method: Transactional data is collected from various business transactions, including
sales, financial transactions, and customer interactions.
Source: Point-of-sale systems, e-commerce platforms, and financial institutions are
common sources of transactional data.
Remote Sensing:
Method: Remote sensing involves collecting data from a distance using satellite,
aircraft, or drone-based sensors.
Source: Remote sensing is used in applications such as environmental monitoring,
agriculture, and geographic information systems (GIS).
Mobile Apps and Devices:
Method: Mobile apps and devices often collect data about user behavior, preferences, and
location.
Source: Mobile phones, tablets, and wearables generate data that can be used for user
analytics, marketing, and personalized services.
Social Media Monitoring:
Method: Social media monitoring involves tracking and analyzing social media platforms to
gather information about trends, sentiment, and user interactions.
Source: Data is acquired from platforms like Twitter, Facebook, Instagram, and LinkedIn.
Internal Business Data:
Method: Companies collect and store data generated from their internal operations,
including sales records, customer interactions, and employee data.
Source: Enterprise resource planning (ERP) systems, customer relationship management
(CRM) systems, and other internal databases are sources of business data.
Publicly Available Datasets:
Method: Publicly available datasets are datasets that are made accessible for research or
analysis purposes.
Source: Government agencies, research institutions, and organizations often release datasets
for public use. Examples include census data, weather data, and social science datasets.
Data Objects and Attribute Types
• Data objects are entities or things for which data is collected.
These can be individuals, events, transactions, or any tangible
or conceptual entity.
• Real-world Example: In a retail business, data objects could
include customers, products, sales transactions, and employees.
Each of these entities represents a distinct data object within the
business context.
Attribute
• Definition: Attribute types refer to the characteristics or properties of data
objects. They describe different aspects or features of the data objects.
• Real-world Example: Using the retail example above:
• Categorical Attribute: The "Product Category" attribute for a product, which can be
categories like "Electronics," "Clothing," or "Home Appliances."
• Numerical Attribute: The "Price" attribute for a product, representing a numerical
value such as $49.99.
• Text Attribute: The "Customer Feedback" attribute, containing textual comments or
reviews provided by customers.
• Temporal Attribute: The "Purchase Date" attribute in a sales transaction, indicating
when a particular product was bought.
• Binary Attribute: The "Membership Status" attribute, representing whether a customer
is a member or not (e.g., 0 for non-member, 1 for member).
• Type of an attribute is determined by the set of possible values—nominal,
binary, ordinal, or numeric.
Nominal Attributes: Nominal means “relating to names.”
• The values of a nominal attribute are symbols or names of things.
• Each value represents some kind of category, code, or state, and so nominal attributes are also referred to as categorical.
• Nominal values do not have any meaningful order.
• Nominal values are also known as enumerations.
Exploratory Data Analysis (EDA)
• Exploratory Data Analysis (EDA) is a critical step in the data analysis
process where the focus is on understanding the main characteristics of
the data.
• EDA involves visualizing and summarizing key features, patterns, and
trends in the dataset to gain insights.
Here are some common techniques used in Exploratory Data Analysis:
Descriptive Statistics:
1. Description: Descriptive statistics provide a summary of the main characteristics of a
dataset, including measures such as mean, median, mode, range, variance, and
standard deviation.
2. Application: Descriptive statistics give a quick overview of the central tendency and
variability of the data.
• Dispersion indicates how far the data values lie from the center. The most common
measures of dispersion are variance, standard deviation, and interquartile range (IQR).
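Below is a small Python sketch of these descriptive statistics and dispersion measures using pandas and NumPy; the price values are made-up sample data.
```python
# A minimal sketch of descriptive statistics and dispersion measures
# using pandas/NumPy; the "prices" values are made-up sample data.
import numpy as np
import pandas as pd

prices = pd.Series([49.99, 15.50, 99.00, 23.75, 49.99, 310.00, 18.20])

print("mean   :", prices.mean())
print("median :", prices.median())
print("mode   :", prices.mode().tolist())
print("range  :", prices.max() - prices.min())
print("variance (sample):", prices.var())   # pandas uses ddof=1 by default
print("std dev  (sample):", prices.std())

q1, q3 = np.percentile(prices, [25, 75])
print("IQR    :", q3 - q1)
```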
Bivariate Analysis:
1. Description: Bivariate analysis explores relationships between two variables to
understand patterns and correlations.
2. Application: Scatter plots, heatmaps, and correlation matrices are used for bivariate
analysis. These techniques help identify trends and potential associations between
variables.
3. Real-world Example: Using a scatter plot to explore the correlation between
advertising spending and sales revenue in a marketing dataset.
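A minimal bivariate-analysis sketch using matplotlib, with made-up advertising-spend and revenue figures standing in for the marketing example above:
```python
# Scatter plot of two hypothetical variables (advertising spend vs. revenue).
import matplotlib.pyplot as plt

ad_spend = [10, 15, 20, 25, 30, 35, 40]        # made-up values, in thousands
revenue  = [80, 95, 110, 118, 135, 150, 158]   # made-up values, in thousands

plt.scatter(ad_spend, revenue)
plt.xlabel("Advertising spend (k)")
plt.ylabel("Sales revenue (k)")
plt.title("Bivariate analysis: spend vs. revenue")
plt.show()
```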
Multivariate Analysis:
1. Description: Multivariate analysis involves the simultaneous analysis of three or more
variables to identify complex patterns.
2. Application: Techniques like 3D plots, parallel coordinates plots, and multidimensional
scaling help visualize relationships in datasets with multiple variables.
3. Real-world Example: Using a 3D plot to explore the relationship between product
price, customer satisfaction, and sales volume in a retail dataset.
Correlation Analysis:
1. Description: Correlation analysis measures the strength and direction of the
linear relationship between two numerical variables.
2. Application: Scatter plots and correlation coefficients (e.g., Pearson
correlation) help identify associations between variables.
3. Real-world Example: Analyzing the correlation between hours of study and
exam scores in an educational dataset.
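A short sketch of the Pearson correlation coefficient with NumPy, using made-up study-hours and exam-score values in the spirit of the example above:
```python
# Pearson correlation between two hypothetical variables.
import numpy as np

hours  = np.array([2, 4, 5, 7, 8, 10])      # made-up sample
scores = np.array([55, 60, 66, 75, 80, 92])

r = np.corrcoef(hours, scores)[0, 1]        # Pearson correlation coefficient
print(f"Pearson r = {r:.3f}")               # close to +1 => strong positive linear relationship
```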
Outlier Detection:
1.Description: Outlier detection techniques identify data points that
deviate significantly from most of the data.
2.Application: Box plots, scatter plots, and statistical methods (e.g., z-
scores) can be employed to detect and analyze outliers.
3.Real-world Example: Using a box plot or z-score analysis to detect
outliers in a dataset of employee salaries.
• There are several techniques for detecting and dealing with outliers,
including visual inspection, statistical methods (such as Z-score or
Tukey's method), and machine learning algorithms (like isolation forests
or one-class SVM).
Example: compute the z-score for each value in the data 5, 6, 3, 4, 8 (worked sketch below).
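A worked sketch for this exercise using NumPy, with the population standard deviation (using the sample standard deviation would give slightly different z-scores):
```python
# Worked z-score example for the data given above: 5, 6, 3, 4, 8.
# z = (x - mean) / std; here the population standard deviation is used.
import numpy as np

x = np.array([5, 6, 3, 4, 8])
mean = x.mean()                    # 5.2
std = x.std()                      # population std (ddof=0) ~= 1.72
z = (x - mean) / std
print(mean, round(std, 2))
print(np.round(z, 2))              # ~= [-0.12  0.46 -1.28 -0.7   1.63]
```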
Missing Values Analysis:
1. Description: Missing values analysis involves identifying and handling missing data in the dataset.
2. Application: Visualization techniques, such as missing value matrices or heatmaps, help assess the
extent of missing data and inform imputation strategies.
3. Real-world Example: Creating a heatmap to visualize the presence of missing values in a dataset of
customer feedback, guiding decisions on imputation strategies.
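A minimal sketch of a missing-value heatmap with pandas and seaborn; the customer-feedback DataFrame below is made-up illustrative data.
```python
# Visualizing missing values with a heatmap; the DataFrame is made-up data.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "feedback":    ["good", None, "poor", None, "great"],
    "rating":      [5, 4, np.nan, 2, np.nan],
})

print(df.isnull().sum())              # count of missing values per column
sns.heatmap(df.isnull(), cbar=False)  # missing cells show up as bright bands
plt.title("Missing value heatmap")
plt.show()
```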
Interactive Visualizations:
1. Description: Interactive visualizations allow users to explore the data dynamically and gain insights
through user interactions.
2. Application: Tools like Plotly, Bokeh, or Tableau enable the creation of interactive dashboards and
visualizations for more flexible exploration.
3. Real-world Example: Developing an interactive dashboard using tools like Tableau to explore and
analyze sales performance, customer demographics, and product trends.
How can we fix these missing values? This is what’s
called imputation in statistics.
• Imputation is the act of replacing missing data with statistical estimates of
the missing values.
• There are quite a few imputation methods. However, when deciding which method
to use, there are some practical concerns to keep in mind.
4. Arbitrary Value Imputation
• This method replaces missing values with a fixed, arbitrary number. The most
commonly used numbers are -1, 0, 99, -999 (or other combinations of 9s).
Deciding which arbitrary number to use depends on the range of your data’s
distribution. For example, if your data lies between 1 and 100, it wouldn’t be
wise to use 1 or 99, because those values may already exist in your data, and
these placeholder numbers are meant to flag missing values.
5. Missing Category Imputation
• This method is used for categorical data. It involves labeling all
missing values in a categorical column as ‘missing’.
6. Missing Indicator Imputation
• This method adds an extra binary column that flags whether the original value
was missing (e.g., 1 for missing, 0 for observed); it is usually combined with
one of the other imputation methods.
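A short pandas sketch of three of the imputation ideas above (missing indicator, median imputation for a numeric column, and the “missing” category label); the DataFrame contents are made-up.
```python
# Three imputation methods from the list above, using pandas; made-up data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":      [25, np.nan, 40, 31, np.nan],
    "category": ["Electronics", None, "Clothing", None, "Electronics"],
})

# Missing indicator: flag which rows were originally missing
df["age_missing"] = df["age"].isna().astype(int)

# Numerical imputation with the median (a statistical estimate of the missing value)
df["age"] = df["age"].fillna(df["age"].median())

# Missing category imputation: label missing categories as 'missing'
df["category"] = df["category"].fillna("missing")

print(df)
```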
Data Validation
• The integrity of data becomes increasingly more important as more B2B firms
use data-driven techniques to enhance revenue and improve operational
efficiencies. The inability to trust business data gathered from a variety of
sources can sabotage an organization’s efforts to fulfill critical business
objectives.
• Inconsistent data standards, heterogeneous data systems, a lack of data governance,
manual processes, and so on are all issues these firms encounter. As a result of this
inability to trust data, data validation is required. Data validation allows
businesses to have more confidence in their data.
What is Data Validation?
• Data Validation is the process of ensuring that source data is accurate and of high
quality before using, importing, or otherwise processing it. Depending on the
destination constraints or objectives, different types of validation can be performed.
Validation is a type of data cleansing.
• When migrating and merging data, it is critical to ensure that data from various
sources and repositories conforms to business rules and does not become
corrupted due to inconsistencies in type or context.
• The goal is to generate data that is consistent, accurate, and complete in order to
avoid data loss and errors during the move.
What are the Types of Data Validation?
• Every organization will have its own set of rules for storing and maintaining
data.
• Setting basic data validation rules will assist your company in maintaining
organized standards that will make working with data more efficient.
• Most Data Validation procedures will run one or more of these checks to
ensure that the data is correct before it is stored in the database.
The following are the common Data Validation Types:
1) Data Type Check
• A Data Type check ensures that data entered into a field is of the correct data type.
• A field, for example, may only accept numeric data.
• The system should then reject any data containing other characters, such as letters or special symbols,
and an error message should be displayed.
2) Code Check
• A Code Check ensures that a field is chosen from a valid list of values or that certain formatting rules are
followed.
• For example, it is easier to verify the validity of a postal code by comparing it to a list of valid codes.
• Other items, such as country codes and NAICS industry codes, can be approached in the same way.
3) Range Check
• A Range Check will determine whether the input data falls within a given range. Latitude and longitude,
for example, are frequently used in geographic data. Latitude should be between -90 and 90, and
longitude should be between -180 and 180.
• Any values outside of this range are considered invalid.
4) Format Check
• Many data types have a predefined format. A Format Check will ensure that the data is in the
correct format. Date fields, for example, are stored in a fixed format such as “YYYY-MM-DD” or
“DD-MM-YYYY.” If the date is entered in any other format, it will be rejected.
• A National Insurance number looks like this: LL 99 99 99 L, where L can be any letter and 9 can
be any number.
5) Consistency Check
• A Consistency Check is a type of logical check that ensures data is entered in a logically
consistent manner. Checking if the delivery date for a parcel is after the shipping date is one
example.
6) Uniqueness Check
• Some data, such as IDs or e-mail addresses, are inherently unique.
• These fields in a database should most likely have unique entries.
• A Uniqueness Check ensures that an item is not entered into a database more than once.
7) Presence Check
• A Presence Check ensures that all mandatory fields are not left blank.
• If someone tries to leave the field blank, an error message will be displayed, and they will be unable
to proceed to the next step or save any other data that they have entered. A key field, for example,
cannot be left blank in most databases.
8) Length Check
• A Length Check ensures that the appropriate number of characters are entered into the field.
• It verifies that the entered character string is neither too short nor too long. Consider a password
that must be at least 8 characters long: the Length Check rejects any entry with fewer than
8 characters.
9) Look Up
• Look Up assists in reducing errors in a field with a limited set of values.
• It consults a table to find acceptable values. The fact that there are only 7 possible days in a week,
for example, ensures that the list of possible values is limited.
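A minimal Python sketch combining a few of the checks above (presence, data type, range, and format); the field names and rules are hypothetical.
```python
# A sketch of several validation checks; field names and rules are hypothetical.
import re

def validate_record(record: dict) -> list[str]:
    """Return a list of validation errors for one input record."""
    errors = []

    # Presence check: mandatory fields must not be blank
    for field in ("order_id", "order_date", "latitude", "longitude"):
        if record.get(field) in (None, ""):
            errors.append(f"{field} is required")

    # Data type check: latitude/longitude must be numeric
    try:
        lat = float(record["latitude"])
        lon = float(record["longitude"])
    except (KeyError, TypeError, ValueError):
        errors.append("latitude/longitude must be numeric")
    else:
        # Range check: latitude in [-90, 90], longitude in [-180, 180]
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            errors.append("latitude/longitude out of range")

    # Format check: date must look like YYYY-MM-DD
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(record.get("order_date", ""))):
        errors.append("order_date must be in YYYY-MM-DD format")

    return errors

print(validate_record({"order_id": "A1", "order_date": "2024-03-15",
                       "latitude": "16.7", "longitude": "74.2"}))   # []
```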
Data Transformation
• Data transformation in data analytics refers to the process of converting raw
data into a format that is suitable for analysis and modeling.
• The goal of data transformation is to prepare the data for data analysis so that
it can be used to extract useful insights and knowledge.
Data transformation typically involves several steps, including:
Data cleaning:
Removing or correcting errors, inconsistencies, and missing values in the data.
Data integration:
Combining data from multiple sources, such as databases and spreadsheets, into a single
format.
Data normalization:
Scaling the data to a common range of values, such as between 0 and 1, to facilitate
comparison and analysis.
Data reduction:
Reducing the dimensionality of the data by selecting a subset of relevant features or attributes.
Data discretization:
Converting continuous data into discrete categories or bins.
Data aggregation:
Combining data at different levels of granularity, such as by summing or averaging, to create
new features or attributes.
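A short pandas sketch of two of the steps above, data integration and data aggregation; the two DataFrames are made-up example data (normalization and discretization are illustrated in their own sections later).
```python
# Data integration and aggregation with pandas; made-up example data.
import pandas as pd

sales = pd.DataFrame({"customer_id": [1, 1, 2, 3],
                      "amount":      [120.0, 80.0, 200.0, 40.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "region":      ["West", "East", "West"]})

# Data integration: combine two sources on a common key
merged = sales.merge(customers, on="customer_id")

# Data aggregation: total sales amount per region
per_region = merged.groupby("region")["amount"].sum().reset_index()
print(per_region)
```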
Advantages of Data Transformation
• Improves Data Quality: Data transformation helps to improve the quality of
data by removing errors, inconsistencies, and missing values.
• Facilitates Data Integration: Data transformation enables the integration of
data from multiple sources, which can improve the accuracy and
completeness of the data.
• Improves Data Analysis: Data transformation helps to prepare the data for
analysis and modeling by normalizing, reducing dimensionality, and
discretizing the data.
• Increases Data Security: Data transformation can be used to mask sensitive
data, or to remove sensitive information from the data, which can help to
increase data security.
Disadvantages of Data Transformation
• Time-consuming:
Data transformation can be a time-consuming process, especially when dealing with large
datasets.
• Complexity:
Data transformation can be a complex process, requiring specialized skills and knowledge
to implement and interpret the results.
• Data Loss:
Data transformation can result in data loss, such as when discretizing continuous data, or
when removing attributes or features from the data.
• Biased transformation:
Data transformation can result in bias, if the data is not properly understood or used.
• High cost:
Data transformation can be an expensive process, requiring significant investments in
hardware, software, and personnel.
Data Reduction
• Data reduction techniques ensure the integrity of data while reducing
the data.
• Data reduction is a process that reduces the volume of original data
and represents it in a much smaller volume.
• Data reduction techniques are used to obtain a reduced representation
of the dataset that is much smaller in volume by maintaining the integrity
of the original data.
• By reducing the data, the efficiency of the data mining process is
improved, which produces the same analytical results.
Techniques of Data Reduction
• 1. Dimensionality Reduction:
• When a dataset contains weakly relevant attributes, we keep only the attributes
required for our analysis.
• Dimensionality reduction eliminates the remaining attributes from the data set
under consideration, thereby reducing the volume of the original data.
• It reduces data size as it eliminates outdated or redundant features.
Here are three methods of dimensionality reduction.
1. Wavelet Transform
2. Principal Component Analysis
3. Attribute Subset Selection
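A minimal Principal Component Analysis sketch using scikit-learn, one of the methods listed above; the 4-feature matrix is made-up example data.
```python
# PCA: project a 4-feature dataset onto its 2 strongest components.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4, 0.5, 1.0],
              [0.5, 0.7, 2.1, 2.0],
              [2.2, 2.9, 0.4, 1.1],
              [1.9, 2.2, 0.6, 0.9],
              [3.1, 3.0, 0.3, 1.2]])

pca = PCA(n_components=2)              # keep the 2 strongest components
X_reduced = pca.fit_transform(X)       # 5 samples x 2 features instead of 4

print(X_reduced.shape)                 # (5, 2)
print(pca.explained_variance_ratio_)   # share of variance each component keeps
```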
2. Numerosity Reduction:
The numerosity reduction reduces the original data volume and
represents it in a much smaller form.
This technique includes two types: parametric and non-parametric
numerosity reduction.
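A tiny sketch contrasting the two types on synthetic data: a parametric approach keeps only the parameters of a fitted model, while a non-parametric approach keeps a random sample of the rows.
```python
# Parametric vs. non-parametric numerosity reduction on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1000, dtype=float)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=x.size)

# Parametric: the 1000 y-values are replaced by 2 model parameters
slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 2), round(intercept, 2))        # ~= 3.0 and ~= 5.0

# Non-parametric: keep a 5% random sample of the original points
sample_idx = rng.choice(x.size, size=50, replace=False)
x_sample, y_sample = x[sample_idx], y[sample_idx]
print(x_sample.shape)                              # (50,)
```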
3. Data Cube Aggregation:
This technique is used to aggregate data in a simpler form. Data Cube Aggregation is a
multidimensional aggregation that uses aggregation at various levels of a data cube to represent
the original data set, thus achieving data reduction.
4. Data Compression
Data compression employs modification, encoding, or converting the structure of data in a way that
consumes less space.
Data compression involves building a compact representation of information by removing
redundancy and representing data in binary form.
Compression from which the original data can be fully restored is called lossless compression.
In contrast, compression in which some information is discarded, so the original form cannot be
exactly restored, is called lossy compression.
5. Discretization Operation
• The data discretization technique is used to divide the attributes of
the continuous nature into data with intervals. We replace many
constant values of the attributes with labels of small intervals. This
means that mining results are shown in a concise and easily
understandable way.
i. Top-down discretization: If you first consider one or a couple of
points (so-called breakpoints or split points) to divide the whole range
of attribute values, and then repeat this recursively on the resulting
intervals, the process is known as top-down discretization, also called splitting.
ii. Bottom-up discretization: If you first consider all the constant
values as split points and then discard some of them by merging
neighboring values into intervals, the process is called
bottom-up discretization, also called merging.
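A small discretization sketch using pandas.cut to split a continuous attribute into equal-width labeled intervals (a simple top-down style split); the ages are made-up values.
```python
# Equal-width binning of a continuous attribute with pandas.cut; made-up data.
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

# Split the whole value range into 3 equal-width intervals with labels
age_bands = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])
print(pd.DataFrame({"age": ages, "band": age_bands}))
```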
Normalization Techniques
Scaling to a Range
Scaling to a range is a good choice when both of the following conditions are met:
1. You know the approximate upper and lower bounds on your data with few or no outliers.
2. Your data is approximately uniformly distributed across that range.
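A minimal scaling-to-a-range sketch with NumPy, mapping made-up temperature readings onto [0, 1]:
```python
# Min-max scaling to the [0, 1] range; the temperature values are made up.
import numpy as np

temps = np.array([12.0, 18.0, 21.0, 26.0, 33.0])
scaled = (temps - temps.min()) / (temps.max() - temps.min())
print(np.round(scaled, 2))   # [0.   0.29 0.43 0.67 1.  ]
```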
Feature Clipping
• If your data set contains extreme outliers, you might try feature clipping,
which caps all feature values above (or below) a certain threshold to a fixed value.
For example, you could clip all temperature values above 40 to be exactly 40.
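A feature-clipping sketch with NumPy, capping values above 40 at exactly 40 as in the example above; the readings are made up.
```python
# Clip temperature readings above 40 to exactly 40.
import numpy as np

temps = np.array([31.0, 38.5, 44.2, 40.1, 29.0])    # made-up readings
clipped = np.clip(temps, a_min=None, a_max=40.0)
print(clipped)                                       # [31.  38.5 40.  40.  29. ]
```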
Log Scaling
• Log scaling computes the log of your values to compress a wide range to a
narrow range.
• Log scaling is helpful when a handful of your values have many points, while
most other values have few points. This data distribution is known as
the power law distribution. Movie ratings are a good example. In the chart
below, most movies have very few ratings (the data in the tail), while a few
have lots of ratings (the data in the head).
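A log-scaling sketch with NumPy on made-up, heavily skewed rating counts; log1p (the log of 1 + x) is used so that zero counts would not cause errors:
```python
# Compress a power-law-like range of rating counts with a log transform.
import numpy as np

rating_counts = np.array([3, 12, 45, 800, 25000, 1_500_000])   # made-up counts
log_scaled = np.log1p(rating_counts)      # log(1 + x) avoids log(0) issues
print(np.round(log_scaled, 2))            # [ 1.39  2.56  3.83  6.69 10.13 14.22]
```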
Z-score normalization
• In this technique, values are normalized based on the mean and
standard deviation of the data: z = (x − μ) / σ, where μ is the mean and σ is the standard deviation.
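A z-score normalization sketch with NumPy on made-up salary values; after normalization the values have mean 0 and standard deviation 1:
```python
# Z-score normalization: subtract the mean, divide by the standard deviation.
import numpy as np

salaries = np.array([30_000, 42_000, 55_000, 61_000, 120_000], dtype=float)
z = (salaries - salaries.mean()) / salaries.std()
print(np.round(z, 2))
print(round(z.mean(), 10), round(z.std(), 2))   # ~0.0 and 1.0
```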
Advantages:
1.Improved performance of machine learning algorithms: Normalization can help
to improve the performance of machine learning algorithms by scaling the
input features to a common scale. This can help to reduce the impact of
outliers and improve the accuracy of the model.
2.Better handling of outliers: Normalization can help to reduce the impact of
outliers by scaling the data to a common scale, which can make the outliers
less influential.
3.Improved interpretability of results: Normalization can make it easier to
interpret the results of a machine learning model, as the inputs will be on a
common scale.
4.Better generalization: Normalization can help to improve the generalization of
a model, by reducing the impact of outliers and by making the model less
sensitive to the scale of the inputs.
Disadvantages:
1.Loss of information: Normalization can result in a loss of information if the
original scale of the input features is important.
2.Impact on outliers: Normalization can make it harder to detect outliers as
they will be scaled along with the rest of the data.
3.Impact on interpretability: Normalization can make it harder to interpret the
results of a machine learning model, as the inputs will be on a common scale,
which may not align with the original scale of the data.
4.Additional computational costs: Normalization can add additional
computational costs to the data mining process, as it requires additional
processing time to scale the data.