ME5041: Data Science for Mechanical Engineers
Lecture 2: Introduction to data science (Cont..) and Data
Prof. C. Balaji, Prof. Narasimhan Swaminathan
Department of Mechanical Engineering, IIT Madras
Analyzing data – Problem 3
Simulations were carried out for cooling an electronic chip using a square cross-sectional microchannel heat
sink. A parametric study was conducted by varying inlet velocity (v) and inlet pressure (p). The average
outlet temperature (T) and average heat transfer coefficient (h) were extracted for the given inlet velocity
and inlet pressure and given in Table 1.
(i) Perform a trend analysis of the average outlet temperature against the heat transfer coefficient using the
graph sheet given.
(ii) Using the visual best-fit approach, plot the variation of outlet temperature with heat transfer
coefficient.
(iii) Determine the slope and intercept of this fit.
(iv) Based on the fit, predict the average outlet temperature if the heat transfer coefficient is 3000 W m⁻² K⁻¹.
Analyzing data – An example (Cont..)
Table 1
Sl. No   Average heat transfer coefficient   Average outlet
         at outlet, h (W m⁻² K⁻¹)            temperature, T (K)
1        2049.5                              414.8
2        2265.2                              395.8
3        2456.6                              382.2
4        2628.3                              371.9
5        2783.7                              364
6        2925.3                              357.6
7        3055.1                              352.4
8        3174.8                              348.1
9        3285.6                              344.4
10       3388.5                              341.3
11       3484.6                              338.6
12       3574.4                              336.2
13       3658.7                              334.1
Analyzing data – An example (Cont..)
[Figure: two panels, "Variation of h (W m⁻² K⁻¹)" and "Variation of T (K)", each plotted against observation number]
Quantity         Mean      Standard deviation
h (W m⁻² K⁻¹)    2979.25   515.46
T (K)            360.11    24.88
Analyzing data – An example (Cont..)
[Figure: "T vs h", scatter plot of T (K) against h (W m⁻² K⁻¹) for the 13 observations of Table 1]
Pearson correlation coefficient, r = -0.9804
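The summary statistics and the correlation coefficient quoted for Table 1 can be checked with a few lines of Python. This is a minimal sketch using only the standard library; it assumes the quoted standard deviations are the sample (n − 1) form, and the function name pearson_r is chosen here for illustration.

```python
import math
import statistics

# Table 1: heat transfer coefficient h (W m^-2 K^-1) and outlet temperature T (K)
h = [2049.5, 2265.2, 2456.6, 2628.3, 2783.7, 2925.3, 3055.1,
     3174.8, 3285.6, 3388.5, 3484.6, 3574.4, 3658.7]
T = [414.8, 395.8, 382.2, 371.9, 364.0, 357.6, 352.4,
     348.1, 344.4, 341.3, 338.6, 336.2, 334.1]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# statistics.stdev is the sample standard deviation (divides by n - 1)
mean_h, std_h = statistics.mean(h), statistics.stdev(h)
mean_T, std_T = statistics.mean(T), statistics.stdev(T)
r = pearson_r(h, T)

print(f"h: mean = {mean_h:.2f}, std = {std_h:.2f}")
print(f"T: mean = {mean_T:.2f}, std = {std_T:.2f}")
print(f"Pearson r = {r:.4f}")
```

A value of r close to -1 says that T falls almost linearly as h rises over this range; it does not by itself say that a straight line is the best model.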
Analyzing data – An example (Cont..)
Relationship (x = h, y = T)
Linear fit:     y = -0.0473x + 501.03
Quadratic fit:  y = 2E-05x^2 - 0.1675x + 669.11
Prediction: enter the "h" value
h (W m⁻² K⁻¹)   T, linear (K)   T, quadratic (K)
3000            359.13          346.61
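The linear fit can be reproduced by ordinary least squares on the Table 1 data. The sketch below is one way to do it in plain Python (the helper names linear_fit and quadratic_as_displayed are illustrative, not from the lecture). The quadratic is evaluated with the coefficients exactly as displayed; since the x^2 coefficient is shown to only one significant figure, and that term alone contributes 180 K at h = 3000, the quadratic prediction is sensitive to this rounding and a refit with unrounded coefficients may give a different value.

```python
# Table 1: heat transfer coefficient h (W m^-2 K^-1) and outlet temperature T (K)
h = [2049.5, 2265.2, 2456.6, 2628.3, 2783.7, 2925.3, 3055.1,
     3174.8, 3285.6, 3388.5, 3484.6, 3574.4, 3658.7]
T = [414.8, 395.8, 382.2, 371.9, 364.0, 357.6, 352.4,
     348.1, 344.4, 341.3, 338.6, 336.2, 334.1]


def linear_fit(x, y):
    """Least-squares slope and intercept of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def quadratic_as_displayed(x):
    """Quadratic evaluated with the rounded coefficients shown on the slide."""
    return 2e-05 * x ** 2 - 0.1675 * x + 669.11


slope, intercept = linear_fit(h, T)
h_query = 3000.0
T_linear = slope * h_query + intercept
T_quadratic = quadratic_as_displayed(h_query)

print(f"slope = {slope:.4f} K per W m^-2 K^-1, intercept = {intercept:.2f} K")
print(f"T at h = {h_query:.0f}: linear {T_linear:.2f} K, quadratic {T_quadratic:.2f} K")
```

Comparing both predictions with the tabulated temperatures on either side of h = 3000 (357.6 K at 2925.3 and 352.4 K at 3055.1) is a quick sanity check on any fit.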
Tools and other considerations for data science
Tools for data science:
o Starting from simple MS Excel to MATLAB, Python, R and SQL.
Other considerations in data science:
o Ethics, bias, privacy – very important.
o But right now these are beyond the scope of this short course.
Statutory warning
“If you torture data enough,
it will confess to almost anything.”
- Anonymous
Data
o The raw material from which information is obtained.
o Before doing any data analysis or extracting science out of it, we need to
start with the process of gathering and sorting data.
o Even before this, the primordial requirement is the identification of the right
kinds of information sources.
Data
Types
Structured data
o The example we worked out a little while ago used structured data.
o Structured data need not be strictly numbers.
o Data with customer ID, Aadhaar number or mobile number are also
considered structured data.
Unstructured data
o These are data without labels.
Data Types (Cont..)
o Consider data from a place where the pressure is between 1005 and 1013
mbar and there was a rainfall of 20 mm in the last three hours.
o However, it is not clear if a lower pressure would cause a lower rainfall or
vice versa, or if there is any correlation at all between the two.
o The challenges are greater in compiling and organizing unstructured data.
o Structured data is akin to machine language, making it easier for
computers to process.
Data Collection
Open data
o Such data are available from governments, NGOs, and academic and scientific
communities, and are usually available to all.
o E.g. IMF data, basic weather data, GFS data, ECMWF data and so on.
o Needless to say, apart from being open, such data need to be accessible, reusable,
properly described and complete; timely updates with support are also required.
Social media data
o A gold mine to analyze for R&D purposes, e.g. through the Facebook API
(Application Programming Interface).
Data Collection (Cont..)
Multi-modal data
o Consider the explosion in the use of the internet in an emerging area like the
“Internet of Things”.
o These devices include sensors of various kinds, for example a multi-sensor
network for a building management system, and the data need not be just
numbers or text. They can be audio/video files, images, music, use of space
and so on.
o Classification, processing and so on are additional tasks involved in
handling multi-modal data.
Data storage and presentation:
o Data is stored in various formats.
o Text, for example, is stored as CSV (comma-separated values) or TSV (tab-separated values).
o Other formats are available, like XML, JavaScript Object Notation (JSON) and so on
(this is really in the domain of CS).
o Barcodes and QR codes are also a kind of data storage.
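To make the storage formats concrete, the sketch below writes the same three records as CSV, TSV and JSON using Python's standard csv and json modules. The records are the first three rows of Table 1; the field names sl_no, h and T and the helper to_delimited are chosen here for illustration.

```python
import csv
import io
import json

# First three rows of Table 1 (h in W m^-2 K^-1, T in K)
records = [
    {"sl_no": 1, "h": 2049.5, "T": 414.8},
    {"sl_no": 2, "h": 2265.2, "T": 395.8},
    {"sl_no": 3, "h": 2456.6, "T": 382.2},
]


def to_delimited(rows, delimiter=","):
    """Serialize a list of dicts as delimiter-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]),
                            delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_delimited(records)                  # comma-separated values
tsv_text = to_delimited(records, delimiter="\t")  # tab-separated values
json_text = json.dumps(records, indent=2)         # JavaScript Object Notation

print(csv_text)
print(tsv_text)
print(json_text)
```

CSV and TSV keep only rows and columns, so every value comes back as text when read; JSON also records whether a value is a number or a string and allows nesting.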
Data Pre-processing
o Data cleaning
o Data integration
o Data transformation
o Data reduction
o Data discretization
Let us look at the above very briefly for the sake of completeness.
Data cleaning
o There can be many reasons for having “dirty” data and so intuitively, it is
apparent that there are as many ways to “clean” it.
a. Data managing/manipulating:
o “The dry bulb temperature at 7 UTC in Bangalore on 16th Jan 2025 was
30 °C, the RH 46% and pressure 1004 mbar.” This is in text form.
o The table conveys exactly the same information but in analysis-friendly form.
Table: Atmospheric condition in Bangalore on 16th Jan 2025.
S. No.   Quantity   Value   Unit
1.       DBT        30      °C
2.       RH         46      %
3.       Pressure   1004    mbar
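Going from the sentence to the table can be automated once the wording is predictable. The sketch below uses regular expressions written for this one sentence only; the patterns and the helper sentence_to_rows are assumptions for illustration, and free text in general needs far more robust parsing.

```python
import re

sentence = ("The dry bulb temperature at 7 UTC in Bangalore on 16th Jan 2025 was "
            "30 °C, the RH 46% and pressure 1004 mbar")

# Illustrative patterns: quantity name -> (regular expression, unit)
PATTERNS = {
    "DBT": (r"was\s+(\d+(?:\.\d+)?)\s*°C", "°C"),
    "RH": (r"RH\s+(\d+(?:\.\d+)?)\s*%", "%"),
    "Pressure": (r"pressure\s+(\d+(?:\.\d+)?)\s*mbar", "mbar"),
}


def sentence_to_rows(text):
    """Extract (quantity, value, unit) rows from the weather sentence."""
    rows = []
    for quantity, (pattern, unit) in PATTERNS.items():
        match = re.search(pattern, text)
        if match:  # quantities not found in the text are simply skipped
            rows.append((quantity, float(match.group(1)), unit))
    return rows


rows = sentence_to_rows(sentence)
for number, (quantity, value, unit) in enumerate(rows, start=1):
    print(f"{number}. {quantity:<9}{value:>7g} {unit}")
```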
Data cleaning (Cont..)
b. Handling missing data
o There can be many reasons for this to happen.
o How do we deal with it?
• Ignore the record.
• Inference-based filling.
• Use a global constant to fill all the missing values.
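The three options can be sketched on a single series with one gap. The values are the 1000 hPa temperatures from the Chennai example in this lecture, with None marking the missing day; the sentinel -999.0 and the choice of the mean as the "inference" are assumptions made here for illustration.

```python
import statistics

# 1000 hPa temperatures for five days; None marks the missing reading
series = [24.0, 24.6, None, 25.2, 24.2]


def drop_missing(values):
    """Option 1: ignore the records that have no value."""
    return [v for v in values if v is not None]


def fill_inferred(values):
    """Option 2: infer the gap from the available data (here, their mean)."""
    estimate = statistics.mean(drop_missing(values))
    return [estimate if v is None else v for v in values]


def fill_constant(values, constant=-999.0):
    """Option 3: replace every gap by one global constant (a sentinel)."""
    return [constant if v is None else v for v in values]


dropped = drop_missing(series)
inferred = fill_inferred(series)
flagged = fill_constant(series)

print(dropped)
print(inferred)
print(flagged)
```

Dropping shortens the series, inference keeps its length but adds an estimated value, and a sentinel keeps the gap visible but must be excluded before computing any statistics.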
Data cleaning (Cont..)
c. Noisy data
o Sometimes the issue may be noisy or corrupted data.
o This may be due to faulty instruments, data entry problems or technology limitations.
o How do we deal with them?
• Rationalize decimal points as in a thermometer reading.
• Identify and remove outliers.
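Both remedies can be sketched briefly. Rounding to the instrument resolution is direct; for outliers, the sketch below uses the median absolute deviation (MAD) with the conventional cut-off of 3.5 on the modified z-score. That rule is one common choice assumed here, not a method prescribed in this lecture, and the function names are illustrative.

```python
import statistics


def round_to_resolution(values, decimals=1):
    """Keep only the decimals the instrument can actually resolve."""
    return [round(v, decimals) for v in values]


def mad_outliers(values, threshold=3.5):
    """Values whose modified z-score, 0.6745 * |v - median| / MAD, exceeds threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:  # more than half the values are identical; the score is undefined
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Readings reported with more digits than a 0.1-degree thermometer can resolve
rounded = round_to_resolution([24.0784, 20.65653])

# 700 hPa temperatures over five consecutive days (from the Chennai example)
outliers = mad_outliers([4.4, 6.2, 4.6, 5.2, 11.0])

print(rounded)
print(outliers)
```

A flagged value is only a candidate for removal; whether 11.0 is an instrument fault or a real warm layer needs physical judgement.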
Sounding data
o University of Wyoming (http://weather.uwyo.edu/upperair/sounding.html)
o Weather balloons record pressure, temperature, dew point,
relative humidity, and wind velocity and direction.
Sample data
o Sample data collected at Chennai International Airport, on 27th Jan 2025
Handling missing data
o Fill the missing data on 27th Jan 2025?
Pressure   01/25/25   01/26/25   01/27/25   01/28/25   01/29/25
1000       24         24.6       ?          25.2       24.2
914        20.6       18.6       ?          19.1       18.4
850        15.8       16.4       ?          15.2       15.4
793        12.2       11.9       ?          11.9       12.1
700        4.4        6.2        ?          5.2        11
600        5.2        5.4        ?          6          3.6
500        -2.1       -2.1       ?          -2.1       -3.1
Handling missing data – Example (Cont..)
How to fill the missing data on 27th Jan 2025?
Handling missing data – Example (Cont..)
Pressure (hPa)   01/25/25   01/26/25   01/27/25   01/28/25   01/29/25   Predicted (mean)
1000             24         24.6       25         25.2       24.2       24.5
914              20.6       18.6       19.2       19.1       18.4       19.175
850              15.8       16.4       16.2       15.2       15.4       15.7
793              12.2       11.9       11.6       11.9       12.1       12.025
700              4.4        6.2        4.6        5.2        11         6.7
600              5.2        5.4        5.8        6          3.6        5.05
500              -2.1       -2.1       -3.5       -2.1       -3.1       -2.35
Coefficient of Determination 0.9877
Handling missing data – Example (Cont..)
Pressure (hPa)   01/25/25   01/26/25   01/27/25   01/28/25   01/29/25   MEDIAN
1000             24         24.6       ?          25.2       24.2       24.4
914              20.6       18.6       ?          19.1       18.4       18.85
850              15.8       16.4       ?          15.2       15.4       15.6
793              12.2       11.9       ?          11.9       12.1       12
700              4.4        6.2        ?          5.2        11         5.7
600              5.2        5.4        ?          6          3.6        5.3
500              -2.1       -2.1       ?          -2.1       -3.1       -2.1
Handling missing data - Example
Pressure (hPa)   01/25/25   01/26/25   01/27/25   01/28/25   01/29/25   MEDIAN
1000             24         24.6       25         25.2       24.2       24.4
914              20.6       18.6       19.2       19.1       18.4       18.85
850              15.8       16.4       16.2       15.2       15.4       15.6
793              12.2       11.9       11.6       11.9       12.1       12
700              4.4        6.2        4.6        5.2        11         5.7
600              5.2        5.4        5.8        6          3.6        5.3
500              -2.1       -2.1       -3.5       -2.1       -3.1       -2.1
Coefficient of Determination 0.9922
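The mean and median fills and their coefficients of determination can be recomputed as follows. The sketch assumes that the coefficient of determination is R² = 1 − SS_res/SS_tot, with the values recorded on 27th Jan as the observations and the fills as the predictions; the variable names are chosen here for illustration.

```python
import statistics

# Temperatures on 25, 26, 28 and 29 Jan 2025 (the four available days),
# keyed by pressure level in hPa
available = {
    1000: [24.0, 24.6, 25.2, 24.2],
    914: [20.6, 18.6, 19.1, 18.4],
    850: [15.8, 16.4, 15.2, 15.4],
    793: [12.2, 11.9, 11.9, 12.1],
    700: [4.4, 6.2, 5.2, 11.0],
    600: [5.2, 5.4, 6.0, 3.6],
    500: [-2.1, -2.1, -2.1, -3.1],
}
# Values recorded on 27 Jan 2025, used only to score the fills
actual = {1000: 25.0, 914: 19.2, 850: 16.2, 793: 11.6,
          700: 4.6, 600: 5.8, 500: -3.5}


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = statistics.mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


levels = list(available)
observed = [actual[p] for p in levels]
mean_fill = [statistics.mean(available[p]) for p in levels]
median_fill = [statistics.median(available[p]) for p in levels]

r2_mean = r_squared(observed, mean_fill)
r2_median = r_squared(observed, median_fill)
print(f"R^2 (mean fill)   = {r2_mean:.4f}")
print(f"R^2 (median fill) = {r2_median:.4f}")
```

The median does slightly better here mainly because of the 700 hPa row, where the reading of 11 on 29th Jan pulls the mean up to 6.7 while the median stays at 5.7.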
Data Integration
o Combine data from multiple sources into a single coherent storage
place.
o Detect and resolve data value conflicts.
o E.g. units of quantities in different databases may be different.
o Address issues of redundant/duplicate data.
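A minimal sketch of these steps for pressure records from several sources: convert every value to one unit, drop duplicates, and report conflicts instead of silently overwriting. The three sources, their values and the 0.5 hPa tolerance are hypothetical and serve only to illustrate the idea.

```python
# Conversion factors to hPa (1 mbar = 1 hPa = 100 Pa; 1 kPa = 10 hPa)
TO_HPA = {"hPa": 1.0, "mbar": 1.0, "Pa": 0.01, "kPa": 10.0}

# Hypothetical sources: (date, pressure value, unit)
source_a = [("2025-01-16", 1004.0, "mbar")]
source_b = [("2025-01-16", 100400.0, "Pa"), ("2025-01-17", 100.6, "kPa")]
source_c = [("2025-01-17", 1010.0, "hPa")]


def integrate(sources, tolerance_hpa=0.5):
    """Merge sources into {date: pressure in hPa}; return merged data and conflicts."""
    merged, conflicts = {}, []
    for source in sources:
        for date, value, unit in source:
            pressure = round(value * TO_HPA[unit], 1)  # common unit first
            if date not in merged:
                merged[date] = pressure
            elif abs(merged[date] - pressure) > tolerance_hpa:
                # same date, different value: flag it, keep the first record
                conflicts.append((date, merged[date], pressure))
            # otherwise it is a duplicate of an existing record and is dropped
    return merged, conflicts


merged, conflicts = integrate([source_a, source_b, source_c])
print(merged)
print(conflicts)
```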
Data Transformation
o Involves one or more of the following steps:
• Smoothing
• Aggregation
• Normalization
• Attribute or feature construction
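Two of these steps, smoothing and normalization, can be sketched in a few lines. The moving average and the min-max scaling below are common choices assumed here for illustration (other smoothers and scalings exist), applied to the 700 hPa temperatures from the Chennai example.

```python
def moving_average(values, window=3):
    """Smooth a series by averaging each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def min_max_normalize(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:  # a constant series has no spread to rescale
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# 700 hPa temperatures on five consecutive days
series = [4.4, 6.2, 4.6, 5.2, 11.0]

smoothed = moving_average(series)       # three values for a window of 3
normalized = min_max_normalize(series)  # same length as the input

print(smoothed)
print(normalized)
```

Min-max scaling is sensitive to extreme values: the single reading of 11.0 sets the top of the scale and squeezes the other four values towards 0.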
Handling missing data - Example
o Problem 5: The following is the temperature data collected at seven pressure levels on 25th, 26th, 28th,
and 29th of Jan 2025, over Chennai International Airport station.
Pressure 01/25/25 01/26/25 01/27/25 01/28/25 01/29/25
1000 24.0784 24.625 24.2763 24.252
914 20.65653 19.25757 19.1353 18.4423
850 15.8457 16.4423 16.2636 15.2535 15.47185
793 12.25553 11.9525 11.6796 12.17253
700 4.4457 6.2552 5.2142 11.2548
600 5.46868 5.88235 6.4542 5.6885
500 -2.155 2.175 -3.5673 -2.157
Perform the following data preprocessing operations:
1. Smooth the data
2. Identify the outliers
3. Fill the missing data
4. Normalize the data
Data Aggregation
o Data cube aggregation – Say two or three dimensions of the data are
aggregated in a “data cube”.
Say x, y, z are the length, breadth and height of a cuboid.
Let us say x = 62 cm, y = 25 cm, z = 33 cm.
6-bit representation:
x = 62 cm → 111110
y = 25 cm → 011001
z = 33 cm → 100001
Therefore, the aggregated code is 111110011001100001.
o It is simply a binary representation of all three dimensions of the cuboid.
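The packing of the three dimensions into one bit string, and its inverse, can be sketched as follows. The function names are chosen here for illustration; each dimension must fit in 6 bits, i.e. lie between 0 and 63.

```python
def encode_dimensions(dimensions, bits=6):
    """Concatenate the fixed-width binary codes of the given integers."""
    for d in dimensions:
        if not 0 <= d < 2 ** bits:
            raise ValueError(f"{d} does not fit in {bits} bits")
    return "".join(format(d, f"0{bits}b") for d in dimensions)


def decode_dimensions(code, bits=6):
    """Split the bit string into chunks of `bits` and convert each back to an integer."""
    return [int(code[i:i + bits], 2) for i in range(0, len(code), bits)]


code = encode_dimensions([62, 25, 33])  # x, y, z in cm
dims = decode_dimensions(code)

print(code)
print(dims)
```

The fixed width is what makes decoding possible: without the leading zero in 011001, the boundaries between x, y and z could not be recovered.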
Data Discretization
o Dividing or binning the data into manageable parts.
o E.g. all the temperature data can be split into cold, moderate or hot.
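Binning temperatures into the three classes can be sketched as below. The thresholds of 15 °C and 30 °C are hypothetical; in practice they would be chosen for the application.

```python
def temperature_bin(temp_c, cold_below=15.0, hot_from=30.0):
    """Map a temperature in deg C to 'cold', 'moderate' or 'hot' (assumed thresholds)."""
    if temp_c < cold_below:
        return "cold"
    if temp_c < hot_from:
        return "moderate"
    return "hot"


readings = [4.4, 24.0, 35.0]
labels = [temperature_bin(t) for t in readings]
print(list(zip(readings, labels)))
```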
Likert Scale
A Likert scale is a psychometric scale commonly used in
research that employs questionnaires.
o This scale is the most widely used approach to scaling responses.
o Named after the American psychologist Rensis Likert (1903-1981).
o The format of a typical five-level Likert item, for example, could be:
• Strongly disagree – say 2/10
• Disagree – 4/10
• Neutral – 6/10
• Agree – 8/10
• Strongly agree – 10/10
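Once each level has a score, a set of responses can be summarized numerically. The sketch below uses the 2 to 10 scoring of the five levels given above; the panel of responses is hypothetical.

```python
# Scores out of 10 for the five Likert levels
LIKERT_SCORES = {
    "strongly disagree": 2,
    "disagree": 4,
    "neutral": 6,
    "agree": 8,
    "strongly agree": 10,
}


def mean_score(responses):
    """Average score of a list of Likert responses (case-insensitive)."""
    scores = [LIKERT_SCORES[r.strip().lower()] for r in responses]
    return sum(scores) / len(scores)


panel = ["Agree", "Neutral", "Strongly agree", "Agree"]  # hypothetical responses
average = mean_score(panel)
print(f"Mean score = {average:.1f} / 10")
```

Averaging treats the levels as equally spaced, which is itself an assumption; reporting the count at each level avoids it.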
Likert Scale
In order to dive deep into the problem of child mortality, the data can be
looked at by a target group, say a group of social scientists, public health
experts and paediatricians, and this group could be asked to rate the progress
we have made, continent- or country-wise, over the last decades in reducing the
infant mortality rate, as perceived by stakeholders in society.
Thank you!!