Lecture 2 Preprocessing

ME5041: Data Science for Mechanical Engineers

Lecture 2: Introduction to data science (Cont..) and Data

Prof. C. Balaji, Prof. Narasimhan Swaminathan

Department of Mechanical Engineering, IIT Madras
Analyzing data – Problem 3

Simulations were carried out for cooling an electronic chip using a microchannel heat sink with a square
cross-section. A parametric study was conducted by varying the inlet velocity (v) and inlet pressure (p). The average
outlet temperature (T) and average heat transfer coefficient (h) were extracted for each inlet velocity
and inlet pressure and are given in Table 1.
(i) Perform a trend analysis of the average outlet temperature against the heat transfer coefficient using the
graph sheet given.
(ii) Using the visual best-fit approach, plot the variation of outlet temperature with heat transfer
coefficient.
(iii) Determine the slope and intercept of this model or fit.
(iv) Based on the fit, predict the average outlet temperature if the heat transfer coefficient is 3000 W m⁻² K⁻¹.
Analyzing data – An example (Cont..)
Table 1
Sl. No   Average heat transfer coefficient   Average outlet
         at outlet, h (W m⁻² K⁻¹)            temperature, T (K)
1        2049.5                              414.8
2        2265.2                              395.8
3        2456.6                              382.2
4        2628.3                              371.9
5        2783.7                              364.0
6        2925.3                              357.6
7        3055.1                              352.4
8        3174.8                              348.1
9        3285.6                              344.4
10       3388.5                              341.3
11       3484.6                              338.6
12       3574.4                              336.2
13       3658.7                              334.1
Analyzing data – An example (Cont..)

[Figure: two panels, "Variation of h (W m⁻² K⁻¹)" and "Variation of T (K)", each plotted against observation number]

Quantity        Mean      Standard deviation
h (W m⁻² K⁻¹)   2979.25   515.46
T (K)           360.11    24.88
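
The mean and standard deviation quoted for h and T can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative, and the sample standard deviation (n - 1 in the denominator) is assumed because it agrees with the quoted values.

```python
import statistics

# Table 1 data: heat transfer coefficient h (W m^-2 K^-1) and outlet temperature T (K)
h = [2049.5, 2265.2, 2456.6, 2628.3, 2783.7, 2925.3, 3055.1,
     3174.8, 3285.6, 3388.5, 3484.6, 3574.4, 3658.7]
T = [414.8, 395.8, 382.2, 371.9, 364.0, 357.6, 352.4,
     348.1, 344.4, 341.3, 338.6, 336.2, 334.1]

# statistics.stdev is the sample standard deviation (divides by n - 1)
h_mean, h_std = statistics.mean(h), statistics.stdev(h)
T_mean, T_std = statistics.mean(T), statistics.stdev(T)

print(f"h: mean = {h_mean:.2f}, std = {h_std:.2f}")
print(f"T: mean = {T_mean:.2f}, std = {T_std:.2f}")
```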
Analyzing data – An example (Cont..)

[Figure: "T vs h", T (K) plotted against h (W m⁻² K⁻¹) for the data in Table 1]

Pearson correlation coefficient, r = -0.9804
Analyzing data – An example (Cont..)
Relationship (y = T in K, x = h in W m⁻² K⁻¹)

Quadratic fit: y = 2E-05 x^2 - 0.1675 x + 669.11

Linear fit: y = -0.0473 x + 501.03

Prediction: enter the "h" value

h value   Temperature (linear)   Temperature (quadratic)
3000      359.13                 346.61
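
The correlation coefficient, the two fits and the prediction at h = 3000 can be reproduced roughly as follows. This is a sketch that assumes NumPy is available; it uses least-squares fitting via `np.polyfit`, which may not be the tool used to produce the slide. Note that the tabulated quadratic value 346.61 is what the displayed, rounded coefficients give, so a full-precision fit may return a different number.

```python
import numpy as np

# Table 1 data
h = np.array([2049.5, 2265.2, 2456.6, 2628.3, 2783.7, 2925.3, 3055.1,
              3174.8, 3285.6, 3388.5, 3484.6, 3574.4, 3658.7])
T = np.array([414.8, 395.8, 382.2, 371.9, 364.0, 357.6, 352.4,
              348.1, 344.4, 341.3, 338.6, 336.2, 334.1])

# Pearson correlation coefficient between h and T
r = np.corrcoef(h, T)[0, 1]

# Least-squares fits; polyfit returns coefficients from the highest power down
slope, intercept = np.polyfit(h, T, 1)
quad_coeffs = np.polyfit(h, T, 2)

# Predictions at the requested heat transfer coefficient
h_new = 3000.0
T_linear = slope * h_new + intercept
T_quadratic = np.polyval(quad_coeffs, h_new)

print(f"r = {r:.4f}")
print(f"linear fit: T = {slope:.4f} h + {intercept:.2f}")
print(f"T(3000): linear = {T_linear:.2f} K, quadratic = {T_quadratic:.2f} K")
```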
Tools and other considerations for data science

Tools for data science:

o Ranging from simple MS Excel to MATLAB, Python, R and SQL.

Other Considerations in data science:

o Ethics, bias, privacy – Very important .
o But right now these are beyond the scope of this short course.
Statutory warning

“If you torture data enough,

it will confess to almost anything.”

- Anonymous
Data

o The raw material from which information is obtained.

o Before doing any data analysis or extracting science out of it, we need to
start with the process of gathering and sorting data.

o Even before this, the primordial requirement is the identification of the right
kinds of information sources.
Data
Types
Structured data
o The example we worked out a little while ago used structured data.

o Structured data need not be strictly numbers.

o Data with customer ID, Aadhaar number or mobile number are also
considered structured data.

Unstructured data
o These are data without labels.
Data Types
(Cont..)
o Consider data from a place where the pressure is between 1005 and 1013
mbar and there was a rainfall of 20 mm in the last three hours.

o However, it is not clear if a lower pressure would cause a lower rainfall or
vice versa, or if there is a correlation at all between the two.

o The challenges are greater in compiling and organizing unstructured data.

o Structured data is equivalent to machine language, making it easier for
computers to process.
Data Collection
Open data
o Such data are available from Governments, NGO, academic and scientific
communities and are usually available to all.
o E.g. IMF data, basic weather data, GFS data, ECMWF data and so on.
o Needless to say, apart from being available, such data need to be accessible, reusable,
properly described and complete; timely updates with support are also required.
Social media data
o A gold mine to analyze for R&D purposes, e.g. the Facebook API
(Application Programming Interface).
Data Collection (Cont..)
Multi-modal data
o Consider the explosion in the use of the internet in an emerging area like the
“Internet of Things”.
o These devices include sensors of various kinds, for example a multi-sensor
network for a building management system, and the data need not be just
numbers or text. They can be audio/video files, images, music, use of space
and so on.
o Classification, processing and so on are additional tasks involved in
handling multi-modal data.
Data storage and presentation:

o Data is stored in various formats.

o Text, for example, is stored as CSV or TSV.

o Other formats are available, like XML, JavaScript Object Notation (JSON) and so on
(this is really in the domain of CS).

o Barcodes and QR codes are also a kind of data storage.

Data Pre-processing

o Data cleaning
o Data integration
o Data transformation
o Data reduction
o Data discretization

Let us look at the above very briefly for the sake of completeness.
Data cleaning
o There can be many reasons for having “dirty” data, so intuitively
there are just as many ways to “clean” it.
a. Data managing/manipulating:
o “The dry bulb temperature at 7 UTC in Bangalore on 16th Jan 2025 was
30 °C, the RH 46% and pressure 1004 mbar”. This is in text form.

o The table conveys exactly the same information, but in an analysis-friendly form.

Table: Atmospheric condition in Bangalore on 16th Jan 2025.
S. No.   Quantity   Value   Unit
1.       DBT        30      °C
2.       RH         46      %
3.       Pressure   1004    mbar
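
The step from the text form to the tabular form can be sketched in Python. The regular expressions below are hypothetical and written only for this one sentence; real free text would need far more robust parsing.

```python
import csv
import io
import re

sentence = ("The dry bulb temperature at 7 UTC in Bangalore on 16th Jan 2025 "
            "was 30 °C, the RH 46% and pressure 1004 mbar")

# Hypothetical patterns: (quantity name, regex capturing the value, unit)
patterns = [
    ("DBT", r"was\s+(\d+(?:\.\d+)?)\s*°C", "°C"),
    ("RH", r"RH\s+(\d+(?:\.\d+)?)\s*%", "%"),
    ("Pressure", r"pressure\s+(\d+(?:\.\d+)?)\s*mbar", "mbar"),
]

rows = []
for quantity, pattern, unit in patterns:
    match = re.search(pattern, sentence)
    if match:  # skip quantities the sentence does not mention
        rows.append((quantity, float(match.group(1)), unit))

# Write the structured rows as CSV text, an analysis-friendly form
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Quantity", "Value", "Unit"])
writer.writerows(rows)
print(buffer.getvalue())
```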
Data cleaning (Cont..)

b. Handling missing data

o There can be many reasons for this to happen.
o How do we deal with it?
• Ignore the record.
• Inference-based filling.
• Use a global constant to fill all the missing values.
Data cleaning (Cont..)

c. Noisy data
o Sometimes the issue may be noisy or corrupted data.
o This can be due to faulty instruments, data entry problems or technology limitations.
o How do we deal with them?
• Rationalize decimal points as in a thermometer reading.
• Identify and remove outliers.
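
Both remedies can be sketched with NumPy. The outlier test below is Tukey's 1.5 × IQR rule, chosen here only as one common option; the lecture does not prescribe a method. The readings are the 700 hPa temperatures from the sounding example in this lecture, the raw values are the first three 1000 hPa entries of Problem 5, and the 0.1 instrument resolution is an assumption.

```python
import numpy as np

# Temperatures at the 700 hPa level on five consecutive days
readings = np.array([4.4, 6.2, 4.6, 5.2, 11.0])

# Tukey's rule: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(readings, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (readings < lower) | (readings > upper)
cleaned = readings[~is_outlier]

# Rationalize decimal places to the assumed instrument resolution of 0.1
raw = np.array([24.0784, 24.625, 24.2763])
rounded = np.round(raw, 1)

print("outliers:", readings[is_outlier])
print("rounded:", rounded)
```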
Sounding data

o University of Wyoming
(http://weather.uwyo.edu/upperair/sounding.html)
o Weather balloons record Pressure, Temperature, Dewpoint,
relative humidity, and wind velocity and direction.
Sample data
o Sample data collected at Chennai International Airport, on 27th Jan 2025
Handling missing data

o How do we fill the missing data on 27th Jan 2025?

Pressure (hPa)   01/25/25   01/26/25   01/27/25   01/28/25   01/29/25
1000             24         24.6       NA         25.2       24.2
914              20.6       18.6       NA         19.1       18.4
850              15.8       16.4       NA         15.2       15.4
793              12.2       11.9       NA         11.9       12.1
700              4.4        6.2        NA         5.2        11
600              5.2        5.4        NA         6          3.6
500              -2.1       -2.1       NA         -2.1       -3.1
Handling missing data – Example (Cont..)

How to fill the missing data on 27th Jan 2025?

Pressure (hPa)   01/25/25   01/26/25   01/27/25   01/28/25   01/29/25   Predicted (mean)
1000             24         24.6       25         25.2       24.2       24.5
914              20.6       18.6       19.2       19.1       18.4       19.175
850              15.8       16.4       16.2       15.2       15.4       15.7
793              12.2       11.9       11.6       11.9       12.1       12.025
700              4.4        6.2        4.6        5.2        11         6.7
600              5.2        5.4        5.8        6          3.6        5.05
500              -2.1       -2.1       -3.5       -2.1       -3.1       -2.35

Coefficient of Determination 0.9877
Handling missing data – Example (Cont..)

How to fill the missing data on 27th Jan 2025?

Pressure (hPa)   01/25/25   01/26/25   01/27/25   01/28/25   01/29/25   MEDIAN
1000             24         24.6       NA         25.2       24.2       24.4
914              20.6       18.6       NA         19.1       18.4       18.85
850              15.8       16.4       NA         15.2       15.4       15.6
793              12.2       11.9       NA         11.9       12.1       12
700              4.4        6.2        NA         5.2        11         5.7
600              5.2        5.4        NA         6          3.6        5.3
500              -2.1       -2.1       NA         -2.1       -3.1       -2.1
Handling missing data – Example (Cont..)

How to fill the missing data on 27th Jan 2025?

Pressure (hPa)   01/25/25   01/26/25   01/27/25   01/28/25   01/29/25   MEDIAN
1000             24         24.6       25         25.2       24.2       24.4
914              20.6       18.6       19.2       19.1       18.4       18.85
850              15.8       16.4       16.2       15.2       15.4       15.6
793              12.2       11.9       11.6       11.9       12.1       12
700              4.4        6.2        4.6        5.2        11         5.7
600              5.2        5.4        5.8        6          3.6        5.3
500              -2.1       -2.1       -3.5       -2.1       -3.1       -2.1

Coefficient of Determination 0.9922
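
The mean and median fills, and their coefficients of determination against the observed 27 Jan values, can be sketched with NumPy. The coefficient of determination is assumed here to be 1 - SS_res/SS_tot; this definition reproduces the values quoted on the slides, but the slides do not state which formula was used.

```python
import numpy as np

nan = np.nan
# Rows: 1000, 914, 850, 793, 700, 600, 500 hPa
# Columns: 25, 26, 27, 28, 29 Jan 2025, with 27 Jan treated as missing
temps = np.array([
    [24.0, 24.6, nan, 25.2, 24.2],
    [20.6, 18.6, nan, 19.1, 18.4],
    [15.8, 16.4, nan, 15.2, 15.4],
    [12.2, 11.9, nan, 11.9, 12.1],
    [4.4, 6.2, nan, 5.2, 11.0],
    [5.2, 5.4, nan, 6.0, 3.6],
    [-2.1, -2.1, nan, -2.1, -3.1],
])
# Values actually recorded on 27 Jan, used only to score the fills
observed = np.array([25.0, 19.2, 16.2, 11.6, 4.6, 5.8, -3.5])

# Fill each pressure level with the mean or median of the other four days
fill_mean = np.nanmean(temps, axis=1)
fill_median = np.nanmedian(temps, axis=1)

def r_squared(actual, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

r2_mean = r_squared(observed, fill_mean)
r2_median = r_squared(observed, fill_median)
print(f"R^2 (mean fill) = {r2_mean:.4f}, R^2 (median fill) = {r2_median:.4f}")
```

The median fill scores slightly higher here because it is less affected by the value 11 at 700 hPa on 29 Jan.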
Data Integration

o Combine data from multiple sources into a single coherent storage
place.
o Detect and resolve data value conflicts.
o Eg. Units of quantities of different databases may be different.
o Address issues of redundant/duplicate data.
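
A small sketch of these integration steps in plain Python follows. The two sources, their field names and their values are all hypothetical; the point is only to show unit reconciliation followed by duplicate removal.

```python
# Hypothetical source A: pressure in hPa, temperature in degrees Celsius
source_a = [
    {"station": "A", "pressure_hpa": 1000, "temp_c": 24.0},
    {"station": "A", "pressure_hpa": 850, "temp_c": 15.8},
]
# Hypothetical source B: pressure in Pa, temperature in kelvin
source_b = [
    {"station": "A", "pressure_pa": 85000, "temp_k": 288.95},  # same record as in A
    {"station": "A", "pressure_pa": 70000, "temp_k": 277.55},
]

def from_source_b(rec):
    """Convert a source-B record to the units and field names of source A."""
    return {
        "station": rec["station"],
        "pressure_hpa": rec["pressure_pa"] / 100,
        "temp_c": round(rec["temp_k"] - 273.15, 2),
    }

# Merge, keeping the first record seen for each (station, pressure level) key
merged = {}
for rec in source_a + [from_source_b(r) for r in source_b]:
    key = (rec["station"], rec["pressure_hpa"])
    merged.setdefault(key, rec)

integrated = list(merged.values())
print(integrated)
```

Keeping the first record is a simplification; resolving a genuine value conflict would require a rule for deciding which source to trust.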
Data Transformation

o Involves one or more of the following steps:

o Smoothing
o Aggregation
o Normalization
o Attribute or feature construction
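
Two of these steps, smoothing and normalization, can be sketched with NumPy. The 3-point moving average and min-max scaling are assumed choices; the lecture does not name specific methods. The values are the 1000 hPa temperatures from the sounding example.

```python
import numpy as np

values = np.array([24.0, 24.6, 25.0, 25.2, 24.2])

# Smoothing: 3-point moving average (only positions with a full window are kept)
window = 3
kernel = np.ones(window) / window
smoothed = np.convolve(values, kernel, mode="valid")

# Normalization: min-max scaling to the range [0, 1]
normalized = (values - values.min()) / (values.max() - values.min())

print("smoothed:  ", np.round(smoothed, 3))
print("normalized:", np.round(normalized, 3))
```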
Handling missing data - Example
o Problem 5: The following is the temperature data collected at seven pressure levels on 25th, 26th, 28th,
and 29th of Jan 2025, over the Chennai International Airport station.
Pressure 01/25/25 01/26/25 01/27/25 01/28/25 01/29/25
1000 24.0784 24.625 24.2763 24.252
914 20.65653 19.25757 19.1353 18.4423
850 15.8457 16.4423 16.2636 15.2535 15.47185
793 12.25553 11.9525 11.6796 12.17253
700 4.4457 6.2552 5.2142 11.2548
600 5.46868 5.88235 6.4542 5.6885
500 -2.155 2.175 -3.5673 -2.157

Perform the following data preprocessing operations:

1. Smoothen the data
2. Identify the outliers
3. Fill the missing data
4. Normalize the data
Data Aggregation
o Data cube aggregation – Say two or three dimensions of the data are
aggregated in a “data cube”.
Say x, y, z are the length, breadth and height of a cuboid.
Let us say x = 62 cm, y = 25 cm, z = 33 cm.
6-bit representation:
x = 62 cm   111110
y = 25 cm   011001
z = 33 cm   100001
Therefore, the aggregate is 111110011001100001.
o It is a binary representation of all three dimensions of the cuboid.
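
The packing of the three dimensions into one bit string can be written as a short Python sketch. The function names are illustrative, not from the lecture.

```python
def pack_dimensions(values, bits=6):
    """Concatenate fixed-width binary representations of non-negative integers."""
    for v in values:
        if not 0 <= v < 2 ** bits:
            raise ValueError(f"{v} does not fit in {bits} bits")
    return "".join(format(v, f"0{bits}b") for v in values)

def unpack_dimensions(bitstring, bits=6):
    """Split the bit string back into its integer fields."""
    return [int(bitstring[i:i + bits], 2) for i in range(0, len(bitstring), bits)]

packed = pack_dimensions([62, 25, 33])   # x, y, z in cm
unpacked = unpack_dimensions(packed)
print(packed)
print(unpacked)
```

With 6 bits per field, each dimension must lie between 0 and 63.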
Data Discretization

o Dividing or binning into manageable parts.

o E.g. all the temperature data can be split into cold, moderate or hot.
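
Binning temperatures into three classes can be sketched as follows. The bin edges of 15 °C and 30 °C and the sample temperatures are hypothetical; the lecture does not specify thresholds.

```python
import bisect

# Hypothetical bin edges in degrees Celsius
EDGES = [15.0, 30.0]
LABELS = ["cold", "moderate", "hot"]

def discretize(temp_c):
    """Return the label of the bin that temp_c falls into."""
    return LABELS[bisect.bisect_right(EDGES, temp_c)]

temps = [4.4, 15.8, 24.0, 36.5]
labels = [discretize(t) for t in temps]
print(labels)
```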
Likert Scale
A Likert scale is a psychometric scale commonly used in
research that employs questionnaires.
o This scale is the most widely used approach to scaling responses.
o Named after the American psychologist Rensis Likert (1903-1981).

o The format of a typical five-level Likert item, for example, could be:
Strongly disagree – say 2/10
Disagree – 4/10
Neutral – 6/10
Agree – 8/10
Strongly agree – 10/10
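
Converting Likert responses into numbers can be sketched as a simple lookup. The scores follow the five levels listed above; the responses are hypothetical.

```python
from statistics import mean

# Scores out of 10 for the five-level item described in the lecture
LIKERT_SCORES = {
    "Strongly disagree": 2,
    "Disagree": 4,
    "Neutral": 6,
    "Agree": 8,
    "Strongly agree": 10,
}

# Hypothetical responses from five participants
responses = ["Agree", "Neutral", "Strongly agree", "Agree", "Disagree"]

scores = [LIKERT_SCORES[r] for r in responses]
average_score = mean(scores)
print(f"scores = {scores}, average = {average_score}")
```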
Likert Scale

In order to dive deep into the problem of child mortality, the data can be
looked at by a target group, say a group of social scientists/public health
experts/paediatricians, and this group could be asked to rate the progress
we have made continent-wise/country-wise in the last decades in reducing the
infant mortality rate, as perceived by stakeholders in society.
Thank you!!
