Data Collection & Preprocessing Techniques
UCSC0601
Presented By:
Dr. Abhinandan P. S
Department of CSE
KIT Kolhapur(Autonomous)
Unit 2
Data Collection and Preprocessing
• Data acquisition methods and sources,
• Exploratory data analysis (EDA) techniques,
• Data cleaning techniques:
• handling missing values,
• outliers, and noise,
• Data validation,
• Data transformation,
• Data reduction,
• Normalization Techniques
Data acquisition
• The methods and sources of data acquisition can vary depending on the
nature of the data, the specific requirements of a project, and the industry
involved.
Here are common data acquisition methods and sources:
Surveys and Questionnaires:
Method: Surveys and questionnaires involve asking individuals or organizations a set of
predefined questions to gather information.
Source: Responses can be collected through online surveys, paper forms, phone interviews, or
in-person interviews.
Sensors and Instruments:
Method: Sensors and instruments are used to measure physical or environmental parameters,
such as temperature, pressure, humidity, or GPS coordinates.
Source: Data from sensors can be collected in real-time and are commonly used in scientific
research, manufacturing, environmental monitoring, and healthcare.
Web Scraping:
Method: Web scraping involves extracting data from websites, typically for purposes such as
market research, competitive analysis, or data aggregation.
Source: Websites, online databases, and social media platforms are common sources for web
scraping.
Web Scraping Use Cases
E-commerce Price Monitoring:
• Retailers and competitors use web scraping to monitor and extract pricing information from e-commerce
websites. This allows them to adjust their own pricing strategies in real-time to stay competitive.
Job Market Analysis:
• Job aggregators and recruitment agencies use web scraping to gather job listings from various websites. This
data can be analyzed to understand job market trends, demand for specific skills, and salary ranges.
Travel and Flight Prices:
• Travel agencies and price comparison websites use web scraping to extract real-time data on flight prices,
hotel rates, and other travel-related information. This enables them to provide users with up-to-date and
competitive pricing.
Social Media Monitoring:
• Companies and brands use web scraping to monitor social media platforms for mentions, reviews, and
sentiment analysis. This helps them understand public perception and respond to customer feedback in real-
time.
Stock Market Data:
• Financial analysts and traders use web scraping to extract real-time stock market data, news, and financial
reports from various sources. This information is crucial for making timely investment decisions.
Weather Data Collection:
• Meteorological organizations use web scraping to gather real-time weather data from multiple sources.
This allows them to provide accurate and timely weather forecasts.
Competitor Analysis:
• Businesses use web scraping to track and analyze competitor prices, product offerings, and marketing
strategies. This helps them adjust their own strategies to stay competitive in the market.
News Aggregation:
• News websites and aggregators use web scraping to collect and display news articles from various
sources. This ensures that users have access to a diverse range of news sources in one location.
Healthcare Data Extraction:
• Researchers and healthcare professionals use web scraping to gather information on medical research,
drug prices, and healthcare statistics from different websites. This aids in data-driven decision-making in
the healthcare industry.
Real Estate Market Analysis:
• Real estate professionals use web scraping to collect data on property prices, rental rates, and market
trends. This information is valuable for making informed decisions in the real estate industry.
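As a rough illustration of the web-scraping method, here is a minimal Python sketch using the requests and BeautifulSoup libraries. The URL, CSS selectors, and page structure are hypothetical placeholders, and any real scraper must respect the target site's robots.txt and terms of use.
```python
# A minimal web-scraping sketch using requests and BeautifulSoup.
# The URL and CSS classes below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

def fetch_prices(url: str) -> list[dict]:
    """Download a page and extract (product, price) pairs."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()                      # stop on HTTP errors
    soup = BeautifulSoup(response.text, "html.parser")

    products = []
    for item in soup.select(".product"):             # hypothetical CSS class
        name = item.select_one(".product-name")
        price = item.select_one(".product-price")
        if name and price:
            products.append({"name": name.get_text(strip=True),
                             "price": price.get_text(strip=True)})
    return products

if __name__ == "__main__":
    for row in fetch_prices("https://example.com/catalog"):   # placeholder URL
        print(row)
```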
IoT Devices:
Method: The Internet of Things (IoT) involves the use of interconnected devices that
collect and exchange data. This data is often used for monitoring and automation.
Source: IoT devices, such as smart sensors, wearables, and connected appliances,
generate data that can be collected and analyzed.
Transaction Data:
Method: Transactional data is collected from various business transactions, including
sales, financial transactions, and customer interactions.
Source: Point-of-sale systems, e-commerce platforms, and financial institutions are
common sources of transactional data.
Remote Sensing:
Method: Remote sensing involves collecting data from a distance using satellite,
aircraft, or drone-based sensors.
Source: Remote sensing is used in applications such as environmental monitoring,
agriculture, and geographic information systems (GIS).
Mobile Apps and Devices:
Method: Mobile apps and devices often collect data about user behavior, preferences, and
location.
Source: Mobile phones, tablets, and wearables generate data that can be used for user
analytics, marketing, and personalized services.
Social Media Monitoring:
Method: Social media monitoring involves tracking and analyzing social media platforms to
gather information about trends, sentiment, and user interactions.
Source: Data is acquired from platforms like Twitter, Facebook, Instagram, and LinkedIn.
Internal Business Data:
Method: Companies collect and store data generated from their internal operations,
including sales records, customer interactions, and employee data.
Source: Enterprise resource planning (ERP) systems, customer relationship management
(CRM) systems, and other internal databases are sources of business data.
Publicly Available Datasets:
Method: Publicly available datasets are datasets that are made accessible for research or
analysis purposes.
Source: Government agencies, research institutions, and organizations often release datasets
for public use. Examples include census data, weather data, and social science datasets.
Data Objects and Attribute Types
• Data objects are entities or things for which data is collected.
These can be individuals, events, transactions, or any tangible
or conceptual entity.
• Real-world Example: In a retail business, data objects could
include customers, products, sales transactions, and employees.
Each of these entities represents a distinct data object within the
business context.
Attribute
• Definition: Attribute types refer to the characteristics or properties of data
objects. They describe different aspects or features of the data objects.
• Real-world Example: Using the retail example above:
• Categorical Attribute: The "Product Category" attribute for a product, which can be
categories like "Electronics," "Clothing," or "Home Appliances."
• Numerical Attribute: The "Price" attribute for a product, representing a numerical
value such as $49.99.
• Text Attribute: The "Customer Feedback" attribute, containing textual comments or
reviews provided by customers.
• Temporal Attribute: The "Purchase Date" attribute in a sales transaction, indicating
when a particular product was bought.
• Binary Attribute: The "Membership Status" attribute, representing whether a customer
is a member or not (e.g., 0 for non-member, 1 for member).
• Type of an attribute is determined by the set of possible values—nominal,
binary, ordinal, or numeric.
Nominal Attributes: Nominal means “relating to names.”
• The values of a nominal attribute are symbols or names of things.
• Each value represents some kind of category, code, or state, and so nominal attributes are also referred to as categorical.
• Nominal values do not have any meaningful order.
• Nominal values are also known as enumerations.
Exploratory Data Analysis (EDA)
• Exploratory Data Analysis (EDA) is a critical step in the data analysis
process where the focus is on understanding the main characteristics of
the data.
• EDA involves visualizing and summarizing key features, patterns, and
trends in the dataset to gain insights.
Here are some common techniques used in Exploratory Data Analysis:
Descriptive Statistics:
1. Description: Descriptive statistics provide a summary of the main characteristics of a
dataset, including measures such as mean, median, mode, range, variance, and
standard deviation.
2. Application: Descriptive statistics give a quick overview of the central tendency and
variability of the data.
• Dispersion indicates how far the data values lie from the center. The most common
measures of dispersion are variance, standard deviation, and interquartile range (IQR).
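Below is a small Python sketch of these descriptive statistics and dispersion measures using pandas and NumPy; the price values are made-up sample data.
```python
# A minimal sketch of descriptive statistics and dispersion measures
# using pandas/NumPy; the "prices" values are made-up sample data.
import numpy as np
import pandas as pd

prices = pd.Series([49.99, 15.50, 99.00, 23.75, 49.99, 310.00, 18.20])

print("mean   :", prices.mean())
print("median :", prices.median())
print("mode   :", prices.mode().tolist())
print("range  :", prices.max() - prices.min())
print("variance (sample):", prices.var())   # pandas uses ddof=1 by default
print("std dev  (sample):", prices.std())

q1, q3 = np.percentile(prices, [25, 75])
print("IQR    :", q3 - q1)
```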
Bivariate Analysis:
1. Description: Bivariate analysis explores relationships between two variables to
understand patterns and correlations.
2. Application: Scatter plots, heatmaps, and correlation matrices are used for bivariate
analysis. These techniques help identify trends and potential associations between
variables.
3. Real-world Example: Using a scatter plot to explore the correlation between
advertising spending and sales revenue in a marketing dataset.
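A minimal bivariate-analysis sketch using matplotlib, with made-up advertising-spend and revenue figures standing in for the marketing example above:
```python
# Scatter plot of two hypothetical variables (advertising spend vs. revenue).
import matplotlib.pyplot as plt

ad_spend = [10, 15, 20, 25, 30, 35, 40]        # made-up values, in thousands
revenue  = [80, 95, 110, 118, 135, 150, 158]   # made-up values, in thousands

plt.scatter(ad_spend, revenue)
plt.xlabel("Advertising spend (k)")
plt.ylabel("Sales revenue (k)")
plt.title("Bivariate analysis: spend vs. revenue")
plt.show()
```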
Multivariate Analysis:
1. Description: Multivariate analysis involves the simultaneous analysis of three or more
variables to identify complex patterns.
2. Application: Techniques like 3D plots, parallel coordinates plots, and multidimensional
scaling help visualize relationships in datasets with multiple variables.
3. Real-world Example: Using a 3D plot to explore the relationship between product
price, customer satisfaction, and sales volume in a retail dataset.
Correlation Analysis:
1. Description: Correlation analysis measures the strength and direction of the
linear relationship between two numerical variables.
2. Application: Scatter plots and correlation coefficients (e.g., Pearson
correlation) help identify associations between variables.
3. Real-world Example: Analyzing the correlation between hours of study and
exam scores in an educational dataset.
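A short sketch of the Pearson correlation coefficient with NumPy, using made-up study-hours and exam-score values in the spirit of the example above:
```python
# Pearson correlation between two hypothetical variables.
import numpy as np

hours  = np.array([2, 4, 5, 7, 8, 10])      # made-up sample
scores = np.array([55, 60, 66, 75, 80, 92])

r = np.corrcoef(hours, scores)[0, 1]        # Pearson correlation coefficient
print(f"Pearson r = {r:.3f}")               # close to +1 => strong positive linear relationship
```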
Outlier Detection:
1.Description: Outlier detection techniques identify data points that
deviate significantly from most of the data.
2.Application: Box plots, scatter plots, and statistical methods (e.g., z-
scores) can be employed to detect and analyze outliers.
3.Real-world Example: Using a box plot or z-score analysis to detect
outliers in a dataset of employee salaries.
• There are several techniques for detecting and dealing with outliers,
including visual inspection, statistical methods (such as Z-score or
Tukey's method), and machine learning algorithms (like isolation forests
or one-class SVM).
Example: compute the z-score for each value in the data 5, 6, 3, 4, 8 (worked sketch below).
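A worked sketch for this exercise using NumPy, with the population standard deviation (using the sample standard deviation would give slightly different z-scores):
```python
# Worked z-score example for the data given above: 5, 6, 3, 4, 8.
# z = (x - mean) / std; here the population standard deviation is used.
import numpy as np

x = np.array([5, 6, 3, 4, 8])
mean = x.mean()                    # 5.2
std = x.std()                      # population std (ddof=0) ~= 1.72
z = (x - mean) / std
print(mean, round(std, 2))
print(np.round(z, 2))              # ~= [-0.12  0.46 -1.28 -0.7   1.63]
```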
Missing Values Analysis:
1. Description: Missing values analysis involves identifying and handling missing data in the dataset.
2. Application: Visualization techniques, such as missing value matrices or heatmaps, help assess the
extent of missing data and inform imputation strategies.
3. Real-world Example: Creating a heatmap to visualize the presence of missing values in a dataset of
customer feedback, guiding decisions on imputation strategies.
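A minimal sketch of a missing-value heatmap with pandas and seaborn; the customer-feedback DataFrame below is made-up illustrative data.
```python
# Visualizing missing values with a heatmap; the DataFrame is made-up data.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "feedback":    ["good", None, "poor", None, "great"],
    "rating":      [5, 4, np.nan, 2, np.nan],
})

print(df.isnull().sum())              # count of missing values per column
sns.heatmap(df.isnull(), cbar=False)  # missing cells show up as bright bands
plt.title("Missing value heatmap")
plt.show()
```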
Interactive Visualizations:
1. Description: Interactive visualizations allow users to explore the data dynamically and gain insights
through user interactions.
2. Application: Tools like Plotly, Bokeh, or Tableau enable the creation of interactive dashboards and
visualizations for more flexible exploration.
3. Real-world Example: Developing an interactive dashboard using tools like Tableau to explore and
analyze sales performance, customer demographics, and product trends.
How can we fix these missing values? This is what’s
called imputation in statistics.
• Imputation is the act of replacing missing data with statistical estimates of
the missing values.
• There are quite a few imputation methods. However, when deciding which method
to use, there are some practical concerns to keep in mind.
4. Arbitrary Value Imputation
• This method replaces missing values with a fixed, arbitrary number. The most
commonly used numbers are -1, 0, 99, -999 (or other combinations of 9s).
Deciding which arbitrary number to use depends on the range of your data’s
distribution. For example, if your data lies between 1 and 100, it wouldn’t be
wise to use 1 or 99, because those values may already exist in your data, and
these placeholder numbers are meant to flag missing values.
5. Missing Category Imputation
• This method is used for categorical data. It involves labeling all
missing values in a categorical column as ‘missing’.
6. Missing Indicator Imputation
• This method adds an extra binary column that flags whether the original value
was missing (e.g., 1 for missing, 0 for observed); it is usually combined with
one of the other imputation methods.
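A short pandas sketch of three of the imputation ideas above (missing indicator, median imputation for a numeric column, and the “missing” category label); the DataFrame contents are made-up.
```python
# Three imputation methods from the list above, using pandas; made-up data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":      [25, np.nan, 40, 31, np.nan],
    "category": ["Electronics", None, "Clothing", None, "Electronics"],
})

# Missing indicator: flag which rows were originally missing
df["age_missing"] = df["age"].isna().astype(int)

# Numerical imputation with the median (a statistical estimate of the missing value)
df["age"] = df["age"].fillna(df["age"].median())

# Missing category imputation: label missing categories as 'missing'
df["category"] = df["category"].fillna("missing")

print(df)
```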
Data Validation
• The integrity of data becomes increasingly more important as more B2B firms
use data-driven techniques to enhance revenue and improve operational
efficiencies. The inability to trust business data gathered from a variety of
sources can sabotage an organization’s efforts to fulfill critical business
objectives.
• Inconsistent data standards, heterogeneous data systems, a lack of data governance,
manual processes, and so on are all issues these firms encounter. As a result of this
inability to trust data, data validation is required. Data validation allows
businesses to have more confidence in their data.
What is Data Validation?
• Data Validation is the process of ensuring that source data is accurate and of high
quality before using, importing, or otherwise processing it. Depending on the
destination constraints or objectives, different types of validation can be performed.
Validation is a type of data cleansing.
• When migrating and merging data, it is critical to ensure that data from various
sources and repositories conforms to business rules and does not become
corrupted due to inconsistencies in type or context.
• The goal is to generate data that is consistent, accurate, and complete in order to
avoid data loss and errors during the move.
What are the Types of Data Validation?
• Every organization will have its own set of rules for storing and maintaining
data.
• Setting basic data validation rules will assist your company in maintaining
organized standards that will make working with data more efficient.
• Most Data Validation procedures will run one or more of these checks to
ensure that the data is correct before it is stored in the database.
The following are the common Data Validation Types:
1) Data Type Check
• A Data Type check ensures that data entered into a field is of the correct data type.
• A field, for example, may only accept numeric data.
• The system should then reject any data containing other characters, such as letters or special symbols,
and an error message should be displayed.
2) Code Check
• A Code Check ensures that a field is chosen from a valid list of values or that certain formatting rules are
followed.
• For example, it is easier to verify the validity of a postal code by comparing it to a list of valid codes.
• Other items, such as country codes and NAICS industry codes, can be approached in the same way.
3) Range Check
• A Range Check will determine whether the input data falls within a given range. Latitude and longitude,
for example, are frequently used in geographic data. Latitude should be between -90 and 90, and
longitude should be between -180 and 180.
• Any values outside of this range are considered invalid.
4) Format Check
• Many data types have a predefined format. A Format Check will ensure that the data is in the
correct format. Date fields, for example, are stored in a fixed format such as “YYYY-MM-DD” or
“DD-MM-YYYY.” If the date is entered in any other format, it will be rejected.
• A National Insurance number looks like this: LL 99 99 99 L, where L can be any letter and 9 can
be any number.
5) Consistency Check
• A Consistency Check is a type of logical check that ensures data is entered in a logically
consistent manner. Checking if the delivery date for a parcel is after the shipping date is one
example.
6) Uniqueness Check
• Some data, such as IDs or e-mail addresses, are inherently unique.
• These fields in a database should most likely have unique entries.
• A Uniqueness Check ensures that an item is not entered into a database more than once.
7) Presence Check
• A Presence Check ensures that all mandatory fields are not left blank.
• If someone tries to leave the field blank, an error message will be displayed, and they will be unable
to proceed to the next step or save any other data that they have entered. A key field, for example,
cannot be left blank in most databases.
8) Length Check
• A Length Check ensures that the appropriate number of characters are entered into the field.
• It verifies that the entered character string is neither too short nor too long. Consider a password
that must be at least 8 characters long: the Length Check rejects any entry with fewer than
8 characters.
9) Look Up
• Look Up assists in reducing errors in a field with a limited set of values.
• It consults a table to find acceptable values. The fact that there are only 7 possible days in a week,
for example, ensures that the list of possible values is limited.
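A minimal Python sketch combining a few of the checks above (presence, data type, range, and format); the field names and rules are hypothetical.
```python
# A sketch of several validation checks; field names and rules are hypothetical.
import re

def validate_record(record: dict) -> list[str]:
    """Return a list of validation errors for one input record."""
    errors = []

    # Presence check: mandatory fields must not be blank
    for field in ("order_id", "order_date", "latitude", "longitude"):
        if record.get(field) in (None, ""):
            errors.append(f"{field} is required")

    # Data type check: latitude/longitude must be numeric
    try:
        lat = float(record["latitude"])
        lon = float(record["longitude"])
    except (KeyError, TypeError, ValueError):
        errors.append("latitude/longitude must be numeric")
    else:
        # Range check: latitude in [-90, 90], longitude in [-180, 180]
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            errors.append("latitude/longitude out of range")

    # Format check: date must look like YYYY-MM-DD
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(record.get("order_date", ""))):
        errors.append("order_date must be in YYYY-MM-DD format")

    return errors

print(validate_record({"order_id": "A1", "order_date": "2024-03-15",
                       "latitude": "16.7", "longitude": "74.2"}))   # []
```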
Data Transformation
• Data transformation in data analytics refers to the process of converting raw
data into a format that is suitable for analysis and modeling.
• The goal of data transformation is to prepare the data for data analysis so that
it can be used to extract useful insights and knowledge.
Data transformation typically involves several steps, including:
Data cleaning:
Removing or correcting errors, inconsistencies, and missing values in the data.
Data integration:
Combining data from multiple sources, such as databases and spreadsheets, into a single
format.
Data normalization:
Scaling the data to a common range of values, such as between 0 and 1, to facilitate
comparison and analysis.
Data reduction:
Reducing the dimensionality of the data by selecting a subset of relevant features or attributes.
Data discretization:
Converting continuous data into discrete categories or bins.
Data aggregation:
Combining data at different levels of granularity, such as by summing or averaging, to create
new features or attributes.
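A short pandas sketch of two of the steps above, data integration and data aggregation; the two DataFrames are made-up example data (normalization and discretization are illustrated in their own sections later).
```python
# Data integration and aggregation with pandas; made-up example data.
import pandas as pd

sales = pd.DataFrame({"customer_id": [1, 1, 2, 3],
                      "amount":      [120.0, 80.0, 200.0, 40.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "region":      ["West", "East", "West"]})

# Data integration: combine two sources on a common key
merged = sales.merge(customers, on="customer_id")

# Data aggregation: total sales amount per region
per_region = merged.groupby("region")["amount"].sum().reset_index()
print(per_region)
```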
Advantages of Data Transformation
• Improves Data Quality: Data transformation helps to improve the quality of
data by removing errors, inconsistencies, and missing values.
• Facilitates Data Integration: Data transformation enables the integration of
data from multiple sources, which can improve the accuracy and
completeness of the data.
• Improves Data Analysis: Data transformation helps to prepare the data for
analysis and modeling by normalizing, reducing dimensionality, and
discretizing the data.
• Increases Data Security: Data transformation can be used to mask sensitive
data, or to remove sensitive information from the data, which can help to
increase data security.
Disadvantages of Data Transformation
• Time-consuming:
Data transformation can be a time-consuming process, especially when dealing with large
datasets.
• Complexity:
Data transformation can be a complex process, requiring specialized skills and knowledge
to implement and interpret the results.
• Data Loss:
Data transformation can result in data loss, such as when discretizing continuous data, or
when removing attributes or features from the data.
• Biased transformation:
Data transformation can result in bias, if the data is not properly understood or used.
• High cost:
Data transformation can be an expensive process, requiring significant investments in
hardware, software, and personnel.
Data Reduction
• Data reduction techniques ensure the integrity of data while reducing
the data.
• Data reduction is a process that reduces the volume of original data
and represents it in a much smaller volume.
• Data reduction techniques are used to obtain a reduced representation
of the dataset that is much smaller in volume by maintaining the integrity
of the original data.
• By reducing the data, the efficiency of the data mining process is
improved, which produces the same analytical results.
Techniques of Data Reduction
• 1. Dimensionality Reduction:
• When a dataset contains weakly relevant attributes, we keep only the attributes
required for our analysis.
• Dimensionality reduction eliminates the remaining attributes from the data set
under consideration, thereby reducing the volume of the original data.
• It reduces data size as it eliminates outdated or redundant features.
Here are three methods of dimensionality reduction.
1. Wavelet Transform
2. Principal Component Analysis
3. Attribute Subset Selection
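A minimal Principal Component Analysis sketch using scikit-learn, one of the methods listed above; the 4-feature matrix is made-up example data.
```python
# PCA: project a 4-feature dataset onto its 2 strongest components.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4, 0.5, 1.0],
              [0.5, 0.7, 2.1, 2.0],
              [2.2, 2.9, 0.4, 1.1],
              [1.9, 2.2, 0.6, 0.9],
              [3.1, 3.0, 0.3, 1.2]])

pca = PCA(n_components=2)              # keep the 2 strongest components
X_reduced = pca.fit_transform(X)       # 5 samples x 2 features instead of 4

print(X_reduced.shape)                 # (5, 2)
print(pca.explained_variance_ratio_)   # share of variance each component keeps
```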
2. Numerosity Reduction:
The numerosity reduction reduces the original data volume and
represents it in a much smaller form.
This technique includes two types: parametric and non-parametric
numerosity reduction.
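A tiny sketch contrasting the two types on synthetic data: a parametric approach keeps only the parameters of a fitted model, while a non-parametric approach keeps a random sample of the rows.
```python
# Parametric vs. non-parametric numerosity reduction on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1000, dtype=float)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=x.size)

# Parametric: the 1000 y-values are replaced by 2 model parameters
slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 2), round(intercept, 2))        # ~= 3.0 and ~= 5.0

# Non-parametric: keep a 5% random sample of the original points
sample_idx = rng.choice(x.size, size=50, replace=False)
x_sample, y_sample = x[sample_idx], y[sample_idx]
print(x_sample.shape)                              # (50,)
```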
3. Data Cube Aggregation:
This technique is used to aggregate data in a simpler form. Data Cube Aggregation is a
multidimensional aggregation that uses aggregation at various levels of a data cube to represent
the original data set, thus achieving data reduction.
4. Data Compression
Data compression employs modification, encoding, or converting the structure of data in a way that
consumes less space.
Data compression involves building a compact representation of information by removing
redundancy and representing data in binary form.
Compression from which the original data can be fully restored is called lossless compression.
In contrast, compression in which some information is discarded, so the original form cannot be
exactly restored, is called lossy compression.
5. Discretization Operation
• The data discretization technique is used to divide the attributes of
the continuous nature into data with intervals. We replace many
constant values of the attributes with labels of small intervals. This
means that mining results are shown in a concise and easily
understandable way.
i. Top-down discretization: If you first consider one or a couple of
points (so-called breakpoints or split points) to divide the whole range
of attribute values, and then repeat this recursively on the resulting
intervals, the process is known as top-down discretization, also called splitting.
ii. Bottom-up discretization: If you first consider all the constant
values as split points and then discard some of them by merging
neighboring values into intervals, the process is called
bottom-up discretization, also called merging.
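A small discretization sketch using pandas.cut to split a continuous attribute into equal-width labeled intervals (a simple top-down style split); the ages are made-up values.
```python
# Equal-width binning of a continuous attribute with pandas.cut; made-up data.
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

# Split the whole value range into 3 equal-width intervals with labels
age_bands = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])
print(pd.DataFrame({"age": ages, "band": age_bands}))
```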
Normalization Techniques
Scaling to a Range
Scaling to a range is a good choice when both of the following conditions are met:
1. You know the approximate upper and lower bounds on your data with few or no outliers.
2. Your data is approximately uniformly distributed across that range.
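A minimal scaling-to-a-range sketch with NumPy, mapping made-up temperature readings onto [0, 1]:
```python
# Min-max scaling to the [0, 1] range; the temperature values are made up.
import numpy as np

temps = np.array([12.0, 18.0, 21.0, 26.0, 33.0])
scaled = (temps - temps.min()) / (temps.max() - temps.min())
print(np.round(scaled, 2))   # [0.   0.29 0.43 0.67 1.  ]
```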
Feature Clipping
• If your data set contains extreme outliers, you might try feature clipping,
which caps all feature values above (or below) a certain threshold to a fixed value.
For example, you could clip all temperature values above 40 to be exactly 40.
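A feature-clipping sketch with NumPy, capping values above 40 at exactly 40 as in the example above; the readings are made up.
```python
# Clip temperature readings above 40 to exactly 40.
import numpy as np

temps = np.array([31.0, 38.5, 44.2, 40.1, 29.0])    # made-up readings
clipped = np.clip(temps, a_min=None, a_max=40.0)
print(clipped)                                       # [31.  38.5 40.  40.  29. ]
```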
Log Scaling
• Log scaling computes the log of your values to compress a wide range to a
narrow range.
• Log scaling is helpful when a handful of your values have many points, while
most other values have few points. This data distribution is known as
the power law distribution. Movie ratings are a good example. In the chart
below, most movies have very few ratings (the data in the tail), while a few
have lots of ratings (the data in the head).
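A log-scaling sketch with NumPy on made-up, heavily skewed rating counts; log1p (the log of 1 + x) is used so that zero counts would not cause errors:
```python
# Compress a power-law-like range of rating counts with a log transform.
import numpy as np

rating_counts = np.array([3, 12, 45, 800, 25000, 1_500_000])   # made-up counts
log_scaled = np.log1p(rating_counts)      # log(1 + x) avoids log(0) issues
print(np.round(log_scaled, 2))            # [ 1.39  2.56  3.83  6.69 10.13 14.22]
```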
Z-score normalization
• In this technique, values are normalized based on the mean and
standard deviation of the data: z = (x − μ) / σ, where μ is the mean and σ is the standard deviation.
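A z-score normalization sketch with NumPy on made-up salary values; after normalization the values have mean 0 and standard deviation 1:
```python
# Z-score normalization: subtract the mean, divide by the standard deviation.
import numpy as np

salaries = np.array([30_000, 42_000, 55_000, 61_000, 120_000], dtype=float)
z = (salaries - salaries.mean()) / salaries.std()
print(np.round(z, 2))
print(round(z.mean(), 10), round(z.std(), 2))   # ~0.0 and 1.0
```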
Advantages:
1.Improved performance of machine learning algorithms: Normalization can help
to improve the performance of machine learning algorithms by scaling the
input features to a common scale. This can help to reduce the impact of
outliers and improve the accuracy of the model.
2.Better handling of outliers: Normalization can help to reduce the impact of
outliers by scaling the data to a common scale, which can make the outliers
less influential.
3.Improved interpretability of results: Normalization can make it easier to
interpret the results of a machine learning model, as the inputs will be on a
common scale.
4.Better generalization: Normalization can help to improve the generalization of
a model, by reducing the impact of outliers and by making the model less
sensitive to the scale of the inputs.
Disadvantages:
1.Loss of information: Normalization can result in a loss of information if the
original scale of the input features is important.
2.Impact on outliers: Normalization can make it harder to detect outliers as
they will be scaled along with the rest of the data.
3.Impact on interpretability: Normalization can make it harder to interpret the
results of a machine learning model, as the inputs will be on a common scale,
which may not align with the original scale of the data.
4.Additional computational costs: Normalization can add additional
computational costs to the data mining process, as it requires additional
processing time to scale the data.