simulate_data_cleaning_Part1

August 8, 2025

1 Methods to clean data


[1]: import pandas as pd

# path of office pc
df_missing_data = pd.read_csv("/home/aditya/github/Deep-learning-prerequisite/pandas/dataset/company_financials.csv")

# path of macbook
# df_missing_data = pd.read_csv("/home/machine_learning/github/Deep-learning-prerequisite/pandas/dataset/company_financials.csv")

df_missing_data

[1]: tickers eps revenue price people


0 Apple NaN 105.55 113.42 Tim Cook
1 Microsoft 1.59 -42.45 410.99 Satya Nadella
2 Google 3.94 118.78 83.55 Sundar Pichai
3 Amazon NaN 69.97 494.10 Andy Jassy
4 Facebook 1.3 85.42 397.51 Mark Zuckerberg
5 Tesla 1.3 -69.91 139.42 Elon Musk
6 Netflix 6.74 192.61 52.48 Reed Hastings
7 Intel n.a. 99.33 416.96 Pat Gelsinger
8 IBM 0.59 47.11 368.09 n.a.
9 Samsung 3.63 141.13 378.05 Jong-Hee Han
10 Oracle 0.61 38.96 397.07 Safra Catz
11 Cisco 0.6 110.44 83.32 Chuck Robbins
12 HP 2.73 2.02 211.31 Enrique Lores
13 Adobe -3.74 33.59 102.14 Shantanu Narayen
14 Nvidia -3.17 109.84 438.40 Jensen Huang
15 Qualcomm 0.31 136.92 330.48 Cristiano Amon
16 Salesforce -1.04 108.57 198.90 Marc Benioff
17 Spotify 2.94 94.22 78.60 Daniel Ek
18 Uber -0.72 84.94 189.94 Dara Khosrowshahi
19 Lyft -2.24 26.07 196.33 David Risher
20 Snap 6.4 NaN 378.32 Evan Spiegel
21 Dropbox 1.32 76.97 336.90 Drew Houston
22 Zoom 2.2 152.86 449.25 Eric Yuan
23 Twitter -2.27 117.18 262.50 Linda Yaccarino
24 PayPal n.a. 11.85 103.82 Alex Chriss

1.0.1 Here in this dataFrame we can see that the data is a bit messy
• In the people and eps columns we have the value n.a.
• In the revenue column we have 2 values that are negative.

We need to clean this mess before we start our analysis
• We need to replace the n.a. values with NaN.
• We also need to replace the negative values in the revenue column, because a company's revenue
cannot be negative; it will either be zero or greater than zero.

1.1 Remove n.a. values when creating dataFrame from a csv file
[2]: # path of pc
df_cleaned = pd.read_csv("/home/aditya/github/Deep-learning-prerequisite/pandas/dataset/company_financials.csv",
                         na_values=["not available", "n.a."])

# path of macbook
# df_cleaned = pd.read_csv("/home/machine_learning/github/Deep-learning-prerequisite/pandas/dataset/company_financials.csv",
#                          na_values=["not available", "n.a."])

df_cleaned

[2]: tickers eps revenue price people


0 Apple NaN 105.55 113.42 Tim Cook
1 Microsoft 1.59 -42.45 410.99 Satya Nadella
2 Google 3.94 118.78 83.55 Sundar Pichai
3 Amazon NaN 69.97 494.10 Andy Jassy
4 Facebook 1.30 85.42 397.51 Mark Zuckerberg
5 Tesla 1.30 -69.91 139.42 Elon Musk
6 Netflix 6.74 192.61 52.48 Reed Hastings
7 Intel NaN 99.33 416.96 Pat Gelsinger
8 IBM 0.59 47.11 368.09 NaN
9 Samsung 3.63 141.13 378.05 Jong-Hee Han
10 Oracle 0.61 38.96 397.07 Safra Catz
11 Cisco 0.60 110.44 83.32 Chuck Robbins
12 HP 2.73 2.02 211.31 Enrique Lores
13 Adobe -3.74 33.59 102.14 Shantanu Narayen
14 Nvidia -3.17 109.84 438.40 Jensen Huang
15 Qualcomm 0.31 136.92 330.48 Cristiano Amon
16 Salesforce -1.04 108.57 198.90 Marc Benioff
17 Spotify 2.94 94.22 78.60 Daniel Ek
18 Uber -0.72 84.94 189.94 Dara Khosrowshahi
19 Lyft -2.24 26.07 196.33 David Risher
20 Snap 6.40 NaN 378.32 Evan Spiegel
21 Dropbox 1.32 76.97 336.90 Drew Houston
22 Zoom 2.20 152.86 449.25 Eric Yuan
23 Twitter -2.27 117.18 262.50 Linda Yaccarino
24 PayPal NaN 11.85 103.82 Alex Chriss

As you can see all the n.a. values have been replaced with NaN values

1.2 Now let's see how we can remove negative values from the revenue column
[3]: import numpy as np
df_cleaned.loc[df_cleaned["revenue"] < 0, "revenue"] = np.nan
df_cleaned

[3]: tickers eps revenue price people


0 Apple NaN 105.55 113.42 Tim Cook
1 Microsoft 1.59 NaN 410.99 Satya Nadella
2 Google 3.94 118.78 83.55 Sundar Pichai
3 Amazon NaN 69.97 494.10 Andy Jassy
4 Facebook 1.30 85.42 397.51 Mark Zuckerberg
5 Tesla 1.30 NaN 139.42 Elon Musk
6 Netflix 6.74 192.61 52.48 Reed Hastings
7 Intel NaN 99.33 416.96 Pat Gelsinger
8 IBM 0.59 47.11 368.09 NaN
9 Samsung 3.63 141.13 378.05 Jong-Hee Han
10 Oracle 0.61 38.96 397.07 Safra Catz
11 Cisco 0.60 110.44 83.32 Chuck Robbins
12 HP 2.73 2.02 211.31 Enrique Lores
13 Adobe -3.74 33.59 102.14 Shantanu Narayen
14 Nvidia -3.17 109.84 438.40 Jensen Huang
15 Qualcomm 0.31 136.92 330.48 Cristiano Amon
16 Salesforce -1.04 108.57 198.90 Marc Benioff
17 Spotify 2.94 94.22 78.60 Daniel Ek
18 Uber -0.72 84.94 189.94 Dara Khosrowshahi
19 Lyft -2.24 26.07 196.33 David Risher
20 Snap 6.40 NaN 378.32 Evan Spiegel
21 Dropbox 1.32 76.97 336.90 Drew Houston
22 Zoom 2.20 152.86 449.25 Eric Yuan
23 Twitter -2.27 117.18 262.50 Linda Yaccarino
24 PayPal NaN 11.85 103.82 Alex Chriss

1.2.1 Explanation
.loc[]
• It is label-based indexing. It is used to access a group of rows and columns in a DataFrame
by their labels (names), not by their integer positions.
• Syntax: df.loc[row_condition, column_label]
  – row_condition: a condition or label(s) that tells pandas which rows you want.
  – column_label: the column(s) you want from those selected rows.
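As a small illustration of this pattern (a sketch using the columns above, not a cell from the
original notebook):

# select the eps of every company whose revenue exceeds 100
high_revenue_eps = df_cleaned.loc[df_cleaned["revenue"] > 100, "eps"]

# select the tickers and price columns for the row labelled 8 (IBM)
ibm_info = df_cleaned.loc[8, ["tickers", "price"]]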

1.3 How to manually replace a NaN value in a cell of a dataFrame when the value is known

Now in the people column we can see that IBM has a NaN value. I know that the CEO
of IBM is Arvind Krishna, and I want to replace the NaN with the actual CEO name in the people
column of the dataFrame.

[4]: df_cleaned.loc[8,"people"] = "Arvind Krishna"
df_cleaned

[4]: tickers eps revenue price people


0 Apple NaN 105.55 113.42 Tim Cook
1 Microsoft 1.59 NaN 410.99 Satya Nadella
2 Google 3.94 118.78 83.55 Sundar Pichai
3 Amazon NaN 69.97 494.10 Andy Jassy
4 Facebook 1.30 85.42 397.51 Mark Zuckerberg
5 Tesla 1.30 NaN 139.42 Elon Musk
6 Netflix 6.74 192.61 52.48 Reed Hastings
7 Intel NaN 99.33 416.96 Pat Gelsinger
8 IBM 0.59 47.11 368.09 Arvind Krishna
9 Samsung 3.63 141.13 378.05 Jong-Hee Han
10 Oracle 0.61 38.96 397.07 Safra Catz
11 Cisco 0.60 110.44 83.32 Chuck Robbins
12 HP 2.73 2.02 211.31 Enrique Lores
13 Adobe -3.74 33.59 102.14 Shantanu Narayen
14 Nvidia -3.17 109.84 438.40 Jensen Huang
15 Qualcomm 0.31 136.92 330.48 Cristiano Amon
16 Salesforce -1.04 108.57 198.90 Marc Benioff
17 Spotify 2.94 94.22 78.60 Daniel Ek
18 Uber -0.72 84.94 189.94 Dara Khosrowshahi
19 Lyft -2.24 26.07 196.33 David Risher
20 Snap 6.40 NaN 378.32 Evan Spiegel
21 Dropbox 1.32 76.97 336.90 Drew Houston
22 Zoom 2.20 152.86 449.25 Eric Yuan
23 Twitter -2.27 117.18 262.50 Linda Yaccarino
24 PayPal NaN 11.85 103.82 Alex Chriss

1.3.1 Explanation:
• df_cleaned.loc[8, "people"] = "Arvind Krishna": here 8 is the row label and "people" is
the column name.
• Using these two we locate the cell to be updated and assign a new value to it.

1.4 Handle n.a. values using a converter function

1.4.1 There is another way to achieve the same result
What if we replace the n.a. value in the people column of the IBM row with a converter? This can
be dangerous, because what I am about to do will take effect on the entire column.

[5]: # import csv and create a dataFrame
# convert that csv file into an excel sheet

# path of pc
df = pd.read_csv("/home/aditya/github/Deep-learning-prerequisite/pandas/dataset/company_financials.csv")

# path of macbook
# df = pd.read_csv("/home/machine_learning/github/Deep-learning-prerequisite/pandas/dataset/company_financials.csv")

df_cleaned.to_excel(excel_writer="test_converter.xlsx", header=True,
                    index=True, index_label='Index', startrow=0, startcol=0)

[6]: def people_col_converter(cell):
    if cell == "n.a.":
        return "Arvind Krishna"
    else:
        return cell

df_excel = pd.read_excel("test_converter.xlsx", engine="openpyxl", converters={
    "people": people_col_converter
})
df_excel

[6]: Index tickers eps revenue price people


0 0 Apple NaN 105.55 113.42 Tim Cook
1 1 Microsoft 1.59 NaN 410.99 Satya Nadella
2 2 Google 3.94 118.78 83.55 Sundar Pichai
3 3 Amazon NaN 69.97 494.10 Andy Jassy
4 4 Facebook 1.30 85.42 397.51 Mark Zuckerberg
5 5 Tesla 1.30 NaN 139.42 Elon Musk
6 6 Netflix 6.74 192.61 52.48 Reed Hastings
7 7 Intel NaN 99.33 416.96 Pat Gelsinger
8 8 IBM 0.59 47.11 368.09 Arvind Krishna
9 9 Samsung 3.63 141.13 378.05 Jong-Hee Han
10 10 Oracle 0.61 38.96 397.07 Safra Catz
11 11 Cisco 0.60 110.44 83.32 Chuck Robbins
12 12 HP 2.73 2.02 211.31 Enrique Lores
13 13 Adobe -3.74 33.59 102.14 Shantanu Narayen
14 14 Nvidia -3.17 109.84 438.40 Jensen Huang
15 15 Qualcomm 0.31 136.92 330.48 Cristiano Amon
16 16 Salesforce -1.04 108.57 198.90 Marc Benioff
17 17 Spotify 2.94 94.22 78.60 Daniel Ek
18 18 Uber -0.72 84.94 189.94 Dara Khosrowshahi
19 19 Lyft -2.24 26.07 196.33 David Risher
20 20 Snap 6.40 NaN 378.32 Evan Spiegel
21 21 Dropbox 1.32 76.97 336.90 Drew Houston
22 22 Zoom 2.20 152.86 449.25 Eric Yuan
23 23 Twitter -2.27 117.18 262.50 Linda Yaccarino
24 24 PayPal NaN 11.85 103.82 Alex Chriss

1.4.2 Explanation:
• The people_col_converter function will be called for every single cell in the people column.
• This function is called while the dataFrame is being created from the excel sheet present in
the file system.

1.4.3 Why this worked?
• The reason this worked is that the people column only had one missing (n.a.) value in the
entire column.

1.4.4 The reason to avoid this method
• If there were multiple n.a. values inside the people column, this method would have replaced
all of them with the value "Arvind Krishna", which of course is not desirable in this case.
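A safer variant of the converter, sketched below (an assumption on my part, not a cell from the
original notebook), would map the n.a. sentinel to a proper missing value instead of hard-coding
a name:

import numpy as np

def people_col_converter_safe(cell):
    # turn the "n.a." sentinel into NaN instead of guessing a specific CEO name
    return np.nan if cell == "n.a." else cell

df_excel_safe = pd.read_excel("test_converter.xlsx", engine="openpyxl",
                              converters={"people": people_col_converter_safe})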

2 How to handle missing data in pandas


[7]: # office pc path
df_missing_val = pd.read_csv("/home/aditya/github/Deep-learning-prerequisite/pandas/dataset/nyc_weather_data_missing_values.csv")

# macbook path
# df_missing_val = pd.read_csv("/home/machine_learning/github/Deep-learning-prerequisite/pandas/dataset/nyc_weather_data_missing_values.csv")

type(df_missing_val["date"][0])
df_missing_val

[7]: date temperature_celsius humidity_percent wind_speed_kmph \


0 2023-01-01 -2.9 87.4 11.9
1 2023-01-02 15.9 56.8 5.4
2 2023-01-03 22.8 48.7 11.3
3 2023-01-04 39.4 56.4 12.2
4 2023-01-05 7.8 82.9 7.1
.. … … … …
95 2023-04-06 33.0 36.3 4.1
96 2023-04-07 -4.4 58.1 8.0
97 2023-04-08 11.8 41.2 13.0
98 2023-04-09 14.7 44.4 NaN
99 2023-04-10 0.3 48.3 7.7

weather_condition
0 Snow
1 Snow
2 NaN
3 Rain
4 Rain
.. …
95 Fog
96 Fog
97 Sunny
98 Sunny

99 Rain

[100 rows x 5 columns]

The date column can contain date values in the form of strings, so to be safe we are going to
make sure that when we create the dataFrame from the csv file we convert those string
values in the date column into datetime timestamps.
• First we need to convert all the dates in the date column from type string to a datetime type.

[8]: # pc path
df_missing_val = pd.read_csv("/home/aditya/github/Deep-learning-prerequisite/pandas/dataset/nyc_weather_data_missing_values.csv",
                             parse_dates=["date"])

# macbook path
# df_missing_val = pd.read_csv("/home/machine_learning/github/Deep-learning-prerequisite/pandas/dataset/nyc_weather_data_missing_values.csv",
#                              parse_dates=["date"])

type(df_missing_val["date"][0])

[8]: pandas._libs.tslibs.timestamps.Timestamp

[9]: df_missing_val

[9]: date temperature_celsius humidity_percent wind_speed_kmph \


0 2023-01-01 -2.9 87.4 11.9
1 2023-01-02 15.9 56.8 5.4
2 2023-01-03 22.8 48.7 11.3
3 2023-01-04 39.4 56.4 12.2
4 2023-01-05 7.8 82.9 7.1
.. … … … …
95 2023-04-06 33.0 36.3 4.1
96 2023-04-07 -4.4 58.1 8.0
97 2023-04-08 11.8 41.2 13.0
98 2023-04-09 14.7 44.4 NaN
99 2023-04-10 0.3 48.3 7.7

weather_condition
0 Snow
1 Snow
2 NaN
3 Rain
4 Rain
.. …
95 Fog
96 Fog
97 Sunny
98 Sunny
99 Rain

[100 rows x 5 columns]

2.0.1 Set the date column as an index of the dataFrame instead of the default index
[10]: df_missing_val.set_index("date",inplace=True)
df_missing_val

[10]: temperature_celsius humidity_percent wind_speed_kmph \


date
2023-01-01 -2.9 87.4 11.9
2023-01-02 15.9 56.8 5.4
2023-01-03 22.8 48.7 11.3
2023-01-04 39.4 56.4 12.2
2023-01-05 7.8 82.9 7.1
… … … …
2023-04-06 33.0 36.3 4.1
2023-04-07 -4.4 58.1 8.0
2023-04-08 11.8 41.2 13.0
2023-04-09 14.7 44.4 NaN
2023-04-10 0.3 48.3 7.7

weather_condition
date
2023-01-01 Snow
2023-01-02 Snow
2023-01-03 NaN
2023-01-04 Rain
2023-01-05 Rain
… …
2023-04-06 Fog
2023-04-07 Fog
2023-04-08 Sunny
2023-04-09 Sunny
2023-04-10 Rain

[100 rows x 4 columns]

2.1 Find all the rows with the missing values in the dataFrame
[11]: df_missing_val[df_missing_val.isna().any(axis=1)]

[11]: temperature_celsius humidity_percent wind_speed_kmph \


date
2023-01-03 22.8 48.7 11.3
2023-01-06 -1.5 12.9 Rain
2023-01-10 65.2 6.5 Sunny

2023-01-15 NaN 74.3 8.9
2023-01-21 18.9 5.5 NaN
2023-01-22 5.3 32.9 NaN
2023-01-31 58.3 Fog NaN
2023-02-02 5.2 64.0 NaN
2023-02-04 -2.6 55.1 1.7
2023-02-06 NaN 64.4 8.7
2023-02-20 19.4 NaN 13.2
2023-02-24 15.4 16.7 11.1
2023-03-04 20.5 70.9 9.8
2023-03-08 15.6 NaN 16.3
2023-03-13 12.7 59.3 NaN
2023-03-15 NaN 46.6 15.6
2023-03-16 27.6 NaN 14.3
2023-03-23 NaN NaN 9.5
2023-03-24 NaN 63.3 5.7
2023-03-26 18.8 43.3 NaN
2023-03-28 35.5 NaN 5.9
2023-03-29 19.9 71.7 11.2
2023-04-09 14.7 44.4 NaN

weather_condition
date
2023-01-03 NaN
2023-01-06 NaN
2023-01-10 NaN
2023-01-15 Snow
2023-01-21 NaN
2023-01-22 Rain
2023-01-31 NaN
2023-02-02 Fog
2023-02-04 NaN
2023-02-06 Sunny
2023-02-20 Rain
2023-02-24 NaN
2023-03-04 NaN
2023-03-08 Rain
2023-03-13 NaN
2023-03-15 Snow
2023-03-16 Cloudy
2023-03-23 Fog
2023-03-24 Cloudy
2023-03-26 Fog
2023-03-28 Sunny
2023-03-29 NaN
2023-04-09 Sunny

2.1.1 Explanation:
df_missing_val.isna()
• Returns a DataFrame of the same shape where:
  – True means the value is missing (NaN)
  – False means it is not missing
.any(axis=1)
• This is a reduction operation that checks whether any value in each row is True.
So what is axis=1?
• axis=1 → operate row-wise
• axis=0 → operate column-wise
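A toy illustration of the axis difference (hypothetical data, not a cell from the original
notebook):

import pandas as pd
import numpy as np

toy = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})
print(toy.isna().any(axis=1))   # per row:    [False, True]
print(toy.isna().any(axis=0))   # per column: a -> True, b -> False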

2.1.2 Topics covered

• fillna to fill missing values in different ways
• interpolate to make a guess at missing values using interpolation
• dropna to drop rows with missing values (see the short sketch right after this list)
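dropna is not demonstrated in the cells below, so here is a minimal sketch of how it would be
used on this dataFrame (an assumption on my part, not a cell from the original notebook):

# drop every row that still contains at least one NaN
df_dropped = df_missing_val.dropna()

# drop a row only if every value in it is NaN
df_dropped_all = df_missing_val.dropna(how="all")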

2.1.3 Fillna method to replace NaN values with Zeros in the dataFrame with missing
values
[12]: new_df_fill_na = df_missing_val.fillna(0)
new_df_fill_na

[12]: temperature_celsius humidity_percent wind_speed_kmph \


date
2023-01-01 -2.9 87.4 11.9
2023-01-02 15.9 56.8 5.4
2023-01-03 22.8 48.7 11.3
2023-01-04 39.4 56.4 12.2
2023-01-05 7.8 82.9 7.1
… … … …
2023-04-06 33.0 36.3 4.1
2023-04-07 -4.4 58.1 8.0
2023-04-08 11.8 41.2 13.0
2023-04-09 14.7 44.4 0
2023-04-10 0.3 48.3 7.7

weather_condition
date
2023-01-01 Snow
2023-01-02 Snow

2023-01-03 0
2023-01-04 Rain
2023-01-05 Rain
… …
2023-04-06 Fog
2023-04-07 Fog
2023-04-08 Sunny
2023-04-09 Sunny
2023-04-10 Rain

[100 rows x 4 columns]

2.1.4 Find all the rows that have a zero in them in the dataFrame
[13]: new_df_fill_na[(new_df_fill_na==0).any(axis=1)]

[13]: temperature_celsius humidity_percent wind_speed_kmph \


date
2023-01-03 22.8 48.7 11.3
2023-01-06 -1.5 12.9 Rain
2023-01-10 65.2 6.5 Sunny
2023-01-15 0.0 74.3 8.9
2023-01-21 18.9 5.5 0
2023-01-22 5.3 32.9 0
2023-01-31 58.3 Fog 0
2023-02-02 5.2 64.0 0
2023-02-04 -2.6 55.1 1.7
2023-02-06 0.0 64.4 8.7
2023-02-20 19.4 0 13.2
2023-02-24 15.4 16.7 11.1
2023-03-04 20.5 70.9 9.8
2023-03-08 15.6 0 16.3
2023-03-13 12.7 59.3 0
2023-03-15 0.0 46.6 15.6
2023-03-16 27.6 0 14.3
2023-03-23 0.0 0 9.5
2023-03-24 0.0 63.3 5.7
2023-03-26 18.8 43.3 0
2023-03-28 35.5 0 5.9
2023-03-29 19.9 71.7 11.2
2023-04-09 14.7 44.4 0

weather_condition
date
2023-01-03 0
2023-01-06 0
2023-01-10 0

2023-01-15 Snow
2023-01-21 0
2023-01-22 Rain
2023-01-31 0
2023-02-02 Fog
2023-02-04 0
2023-02-06 Sunny
2023-02-20 Rain
2023-02-24 0
2023-03-04 0
2023-03-08 Rain
2023-03-13 0
2023-03-15 Snow
2023-03-16 Cloudy
2023-03-23 Fog
2023-03-24 Cloudy
2023-03-26 Fog
2023-03-28 Sunny
2023-03-29 0
2023-04-09 Sunny

2.1.5 Explanation :
This method fills every cell of the dataFrame that was NaN with 0.

2.1.6 Sometimes we may want to fillna but we don't want to fill the entire dataFrame
with 0. Maybe we want to specify different values for different columns

[14]: new_df = df_missing_val.fillna({


"temperature_celsius":0,
"humidity_percent":0,
"wind_speed_kmph":0,
"weather_condition":"no event"
})
new_df

[14]: temperature_celsius humidity_percent wind_speed_kmph \


date
2023-01-01 -2.9 87.4 11.9
2023-01-02 15.9 56.8 5.4
2023-01-03 22.8 48.7 11.3
2023-01-04 39.4 56.4 12.2
2023-01-05 7.8 82.9 7.1
… … … …
2023-04-06 33.0 36.3 4.1
2023-04-07 -4.4 58.1 8.0
2023-04-08 11.8 41.2 13.0

2023-04-09 14.7 44.4 0
2023-04-10 0.3 48.3 7.7

weather_condition
date
2023-01-01 Snow
2023-01-02 Snow
2023-01-03 no event
2023-01-04 Rain
2023-01-05 Rain
… …
2023-04-06 Fog
2023-04-07 Fog
2023-04-08 Sunny
2023-04-09 Sunny
2023-04-10 Rain

[100 rows x 4 columns]

Get all the rows that contain a zero in the dataFrame
[15]: new_df[(new_df==0).any(axis=1)]

[15]: temperature_celsius humidity_percent wind_speed_kmph \


date
2023-01-15 0.0 74.3 8.9
2023-01-21 18.9 5.5 0
2023-01-22 5.3 32.9 0
2023-01-31 58.3 Fog 0
2023-02-02 5.2 64.0 0
2023-02-06 0.0 64.4 8.7
2023-02-20 19.4 0 13.2
2023-03-08 15.6 0 16.3
2023-03-13 12.7 59.3 0
2023-03-15 0.0 46.6 15.6
2023-03-16 27.6 0 14.3
2023-03-23 0.0 0 9.5
2023-03-24 0.0 63.3 5.7
2023-03-26 18.8 43.3 0
2023-03-28 35.5 0 5.9
2023-04-09 14.7 44.4 0

weather_condition
date
2023-01-15 Snow
2023-01-21 no event
2023-01-22 Rain
2023-01-31 no event

2023-02-02 Fog
2023-02-06 Sunny
2023-02-20 Rain
2023-03-08 Rain
2023-03-13 no event
2023-03-15 Snow
2023-03-16 Cloudy
2023-03-23 Fog
2023-03-24 Cloudy
2023-03-26 Fog
2023-03-28 Sunny
2023-04-09 Sunny

Get all the rows that contain the value "no event"


[16]: new_df[(new_df=="no event").any(axis=1)]

[16]: temperature_celsius humidity_percent wind_speed_kmph \


date
2023-01-03 22.8 48.7 11.3
2023-01-06 -1.5 12.9 Rain
2023-01-10 65.2 6.5 Sunny
2023-01-21 18.9 5.5 0
2023-01-31 58.3 Fog 0
2023-02-04 -2.6 55.1 1.7
2023-02-24 15.4 16.7 11.1
2023-03-04 20.5 70.9 9.8
2023-03-13 12.7 59.3 0
2023-03-29 19.9 71.7 11.2

weather_condition
date
2023-01-03 no event
2023-01-06 no event
2023-01-10 no event
2023-01-21 no event
2023-01-31 no event
2023-02-04 no event
2023-02-24 no event
2023-03-04 no event
2023-03-13 no event
2023-03-29 no event

We should avoid replacing NaN with zeros

• Reason: if we replace all the NaN values with 0 in the columns temperature_celsius,
humidity_percent and wind_speed_kmph, it will skew the results when we calculate the mean of
those columns.

Another way of getting a better estimate is to carry forward the value from the previous
row into the row where the NaN is present (forward fill).
[17]: previous_fill_val_df = df_missing_val.fillna(method="ffill")
previous_fill_val_df

/tmp/ipykernel_3487556/…: FutureWarning: DataFrame.fillna with
'method' is deprecated and will raise in a future version. Use obj.ffill() or
obj.bfill() instead.
  previous_fill_val_df = df_missing_val.fillna(method="ffill")

[17]: temperature_celsius humidity_percent wind_speed_kmph \


date
2023-01-01 -2.9 87.4 11.9
2023-01-02 15.9 56.8 5.4
2023-01-03 22.8 48.7 11.3
2023-01-04 39.4 56.4 12.2
2023-01-05 7.8 82.9 7.1
… … … …
2023-04-06 33.0 36.3 4.1
2023-04-07 -4.4 58.1 8.0
2023-04-08 11.8 41.2 13.0
2023-04-09 14.7 44.4 13.0
2023-04-10 0.3 48.3 7.7

weather_condition
date
2023-01-01 Snow
2023-01-02 Snow
2023-01-03 Snow
2023-01-04 Rain
2023-01-05 Rain
… …
2023-04-06 Fog
2023-04-07 Fog
2023-04-08 Sunny
2023-04-09 Sunny
2023-04-10 Rain

[100 rows x 4 columns]

2.1.7 Explanation :
The warning
/tmp/ipykernel_2951661/…: FutureWarning: DataFrame.fillna with 'method' is
deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
means that passing method="ffill" or method="bfill" to .fillna() is going to be removed in
future versions of pandas; call .ffill() or .bfill() directly instead.

[18]: df_missing_val

[18]: temperature_celsius humidity_percent wind_speed_kmph \


date
2023-01-01 -2.9 87.4 11.9
2023-01-02 15.9 56.8 5.4
2023-01-03 22.8 48.7 11.3
2023-01-04 39.4 56.4 12.2
2023-01-05 7.8 82.9 7.1
… … … …
2023-04-06 33.0 36.3 4.1
2023-04-07 -4.4 58.1 8.0
2023-04-08 11.8 41.2 13.0
2023-04-09 14.7 44.4 NaN
2023-04-10 0.3 48.3 7.7

weather_condition
date
2023-01-01 Snow
2023-01-02 Snow
2023-01-03 NaN
2023-01-04 Rain
2023-01-05 Rain
… …
2023-04-06 Fog
2023-04-07 Fog
2023-04-08 Sunny
2023-04-09 Sunny
2023-04-10 Rain

[100 rows x 4 columns]

[19]: previous_fill_val_df = df_missing_val.ffill()


previous_fill_val_df

[19]: temperature_celsius humidity_percent wind_speed_kmph \


date
2023-01-01 -2.9 87.4 11.9
2023-01-02 15.9 56.8 5.4
2023-01-03 22.8 48.7 11.3
2023-01-04 39.4 56.4 12.2
2023-01-05 7.8 82.9 7.1
… … … …
2023-04-06 33.0 36.3 4.1
2023-04-07 -4.4 58.1 8.0
2023-04-08 11.8 41.2 13.0
2023-04-09 14.7 44.4 13.0

2023-04-10 0.3 48.3 7.7

weather_condition
date
2023-01-01 Snow
2023-01-02 Snow
2023-01-03 Snow
2023-01-04 Rain
2023-01-05 Rain
… …
2023-04-06 Fog
2023-04-07 Fog
2023-04-08 Sunny
2023-04-09 Sunny
2023-04-10 Rain

[100 rows x 4 columns]

Check whether any NaN values exist in any of the rows of the dataFrame
[20]: previous_fill_val_df[previous_fill_val_df.isna().any(axis=1)]

[20]: Empty DataFrame


Columns: [temperature_celsius, humidity_percent, wind_speed_kmph,
weather_condition]
Index: []

This confirms that our dataFrame in fact has no NaN values in it.

Another way of getting a better estimate is to carry backward the value from the next row
into the row where the NaN is present (backward fill).
[21]: df_missing_val

[21]: temperature_celsius humidity_percent wind_speed_kmph \


date
2023-01-01 -2.9 87.4 11.9
2023-01-02 15.9 56.8 5.4
2023-01-03 22.8 48.7 11.3
2023-01-04 39.4 56.4 12.2
2023-01-05 7.8 82.9 7.1
… … … …
2023-04-06 33.0 36.3 4.1
2023-04-07 -4.4 58.1 8.0
2023-04-08 11.8 41.2 13.0
2023-04-09 14.7 44.4 NaN
2023-04-10 0.3 48.3 7.7

weather_condition

date
2023-01-01 Snow
2023-01-02 Snow
2023-01-03 NaN
2023-01-04 Rain
2023-01-05 Rain
… …
2023-04-06 Fog
2023-04-07 Fog
2023-04-08 Sunny
2023-04-09 Sunny
2023-04-10 Rain

[100 rows x 4 columns]

[22]: next_fill_val_df = df_missing_val.bfill()


next_fill_val_df

[22]: temperature_celsius humidity_percent wind_speed_kmph \


date
2023-01-01 -2.9 87.4 11.9
2023-01-02 15.9 56.8 5.4
2023-01-03 22.8 48.7 11.3
2023-01-04 39.4 56.4 12.2
2023-01-05 7.8 82.9 7.1
… … … …
2023-04-06 33.0 36.3 4.1
2023-04-07 -4.4 58.1 8.0
2023-04-08 11.8 41.2 13.0
2023-04-09 14.7 44.4 7.7
2023-04-10 0.3 48.3 7.7

weather_condition
date
2023-01-01 Snow
2023-01-02 Snow
2023-01-03 Rain
2023-01-04 Rain
2023-01-05 Rain
… …
2023-04-06 Fog
2023-04-07 Fog
2023-04-08 Sunny
2023-04-09 Sunny
2023-04-10 Rain

[100 rows x 4 columns]

Let's confirm whether any NaN values still exist in the dataFrame
[23]: next_fill_val_df[next_fill_val_df.isna().any(axis=1)]

[23]: Empty DataFrame


Columns: [temperature_celsius, humidity_percent, wind_speed_kmph,
weather_condition]
Index: []

Again, this confirms that no NaN values exist in the dataFrame.

How to limit the number of times a value is allowed to be copied in these two methods
• Forward fill: carry forward the value from the previous row into the row where the NaN is
present.
• Backward fill: carry backward the value from the next row into the row where the NaN is
present.

[24]: df_missing_data.head(10)

[24]: tickers eps revenue price people


0 Apple NaN 105.55 113.42 Tim Cook
1 Microsoft 1.59 -42.45 410.99 Satya Nadella
2 Google 3.94 118.78 83.55 Sundar Pichai
3 Amazon NaN 69.97 494.10 Andy Jassy
4 Facebook 1.3 85.42 397.51 Mark Zuckerberg
5 Tesla 1.3 -69.91 139.42 Elon Musk
6 Netflix 6.74 192.61 52.48 Reed Hastings
7 Intel n.a. 99.33 416.96 Pat Gelsinger
8 IBM 0.59 47.11 368.09 n.a.
9 Samsung 3.63 141.13 378.05 Jong-Hee Han

Setting the limit to 1


• Here in this case the value will only be allowed to be copied once

[25]: copy_once_df = df_missing_val.ffill(limit=1)


copy_once_df.head(10)

[25]: temperature_celsius humidity_percent wind_speed_kmph \


date
2023-01-01 -2.9 87.4 11.9
2023-01-02 15.9 56.8 5.4
2023-01-03 22.8 48.7 11.3
2023-01-04 39.4 56.4 12.2
2023-01-05 7.8 82.9 7.1
2023-01-06 -1.5 12.9 Rain
2023-01-07 20.4 67.0 7.0
2023-01-08 36.8 83.6 5.7

2023-01-09 7.5 79.0 9.4
2023-01-10 65.2 6.5 Sunny

weather_condition
date
2023-01-01 Snow
2023-01-02 Snow
2023-01-03 Snow
2023-01-04 Rain
2023-01-05 Rain
2023-01-06 Rain
2023-01-07 Fog
2023-01-08 Cloudy
2023-01-09 Sunny
2023-01-10 Sunny

Let's check if this dataFrame has any NaN values in it


[26]: copy_once_df[copy_once_df.isna().any(axis=1)]

[26]: temperature_celsius humidity_percent wind_speed_kmph \


date
2023-01-22 5.3 32.9 NaN
2023-03-24 NaN 63.3 5.7

weather_condition
date
2023-01-22 Rain
2023-03-24 Cloudy
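For completeness, the same limit argument also works with backward fill (a sketch, not a cell
from the original notebook):

copy_once_back_df = df_missing_val.bfill(limit=1)   # copy each next value backwards at most once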

2.2 Clean data using Interpolation

Explanation as to why these methods are a bit restrictive when you are dealing with
values such as temperature Methods in question:
• Forward fill: carry forward the value from the previous row into the row where the NaN is
present.
• Backward fill: carry backward the value from the next row into the row where the NaN is
present.
Suppose that on day 1 we have a temperature of -2.9 degrees, on day 2 the dataFrame has a NaN
value, and on day 3 the dataFrame has a temperature of 14 degrees.
Case 1 (forward fill): day 1 -> -2.9, day 2 -> -2.9, day 3 -> 14
Case 2 (backward fill): day 1 -> -2.9, day 2 -> 14, day 3 -> 14
In both cases the temperature between days 1 and 3 does not rise gradually; it jumps abruptly
and two consecutive days share the same temperature. In real life that is not how it works: the
temperature rises gradually. So how can we deal with the NaN values in the temperature column,
or in the dataFrame as a whole?

2.3 What is interpolation ?
DataFrame.interpolate() is a pandas method used to fill in missing values (NaN) by estimating
or guessing what the values should be, based on the surrounding data. See the pandas
documentation for DataFrame.interpolate for the full details.
DataFrame.interpolate(method='linear', *, axis=0, limit=None, inplace=False,
limit_direction=None, limit_area=None, **kwargs)
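A toy version of the three-day example above, to show what linear interpolation does (assumed
values taken from the text, not a cell from the original notebook):

import pandas as pd
import numpy as np

temps = pd.Series([-2.9, np.nan, 14.0], index=["day 1", "day 2", "day 3"])
print(temps.interpolate())   # day 2 becomes (-2.9 + 14.0) / 2 = 5.55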

[27]: interpolated_val_df = df_missing_val.interpolate()


interpolated_val_df.head(10)

/tmp/ipykernel_3487556/…: FutureWarning: DataFrame.interpolate
with object dtype is deprecated and will raise in a future version. Call
obj.infer_objects(copy=False) before interpolating instead.
  interpolated_val_df = df_missing_val.interpolate()

[27]: temperature_celsius humidity_percent wind_speed_kmph \


date
2023-01-01 -2.9 87.4 11.9
2023-01-02 15.9 56.8 5.4
2023-01-03 22.8 48.7 11.3
2023-01-04 39.4 56.4 12.2
2023-01-05 7.8 82.9 7.1
2023-01-06 -1.5 12.9 Rain
2023-01-07 20.4 67.0 7.0
2023-01-08 36.8 83.6 5.7
2023-01-09 7.5 79.0 9.4
2023-01-10 65.2 6.5 Sunny

weather_condition
date
2023-01-01 Snow
2023-01-02 Snow
2023-01-03 NaN
2023-01-04 Rain
2023-01-05 Rain
2023-01-06 NaN
2023-01-07 Fog
2023-01-08 Cloudy
2023-01-09 Sunny
2023-01-10 NaN

2.3.1 Explanation:
• DataFrame df_missing_val has columns with object dtype (e.g. strings or mixed types).
• interpolate() is meant for numeric columns, and using it on object-type columns is being
deprecated.

[28]: interpolated_val_df = df_missing_val.select_dtypes(include=["number"]).interpolate()

interpolated_val_df.head(10)

[28]: temperature_celsius
date
2023-01-01 -2.9
2023-01-02 15.9
2023-01-03 22.8
2023-01-04 39.4
2023-01-05 7.8
2023-01-06 -1.5
2023-01-07 20.4
2023-01-08 36.8
2023-01-09 7.5
2023-01-10 65.2

Why is it giving me only the temperature_celsius column as a result?

• Because only temperature_celsius is currently recognized as a numeric column (int or float)
in the dataFrame.
• Hence, before I interpolate the dataFrame, I need to convert the columns humidity_percent and
wind_speed_kmph to numeric columns so that interpolation can actually work.

[29]: df_missing_val.head(10)

[29]: temperature_celsius humidity_percent wind_speed_kmph \


date
2023-01-01 -2.9 87.4 11.9
2023-01-02 15.9 56.8 5.4
2023-01-03 22.8 48.7 11.3
2023-01-04 39.4 56.4 12.2
2023-01-05 7.8 82.9 7.1
2023-01-06 -1.5 12.9 Rain
2023-01-07 20.4 67.0 7.0
2023-01-08 36.8 83.6 5.7
2023-01-09 7.5 79.0 9.4
2023-01-10 65.2 6.5 Sunny

weather_condition
date
2023-01-01 Snow
2023-01-02 Snow
2023-01-03 NaN
2023-01-04 Rain
2023-01-05 Rain
2023-01-06 NaN
2023-01-07 Fog

2023-01-08 Cloudy
2023-01-09 Sunny
2023-01-10 NaN

[30]: columns_to_convert = ["temperature_celsius", "humidity_percent", "wind_speed_kmph"]

df_missing_val[columns_to_convert] = df_missing_val[columns_to_convert].apply(pd.to_numeric, errors="coerce")

interpolated_val_df = df_missing_val.select_dtypes(include=["number"]).interpolate()

interpolated_val_df.head(10)

[30]: temperature_celsius humidity_percent wind_speed_kmph


date
2023-01-01 -2.9 87.4 11.90
2023-01-02 15.9 56.8 5.40
2023-01-03 22.8 48.7 11.30
2023-01-04 39.4 56.4 12.20
2023-01-05 7.8 82.9 7.10
2023-01-06 -1.5 12.9 7.05
2023-01-07 20.4 67.0 7.00
2023-01-08 36.8 83.6 5.70
2023-01-09 7.5 79.0 9.40
2023-01-10 65.2 6.5 9.30

2.3.2 Explanation:
• .apply(pd.to_numeric, errors="coerce")
  – apply(pd.to_numeric) applies the pd.to_numeric() function to each column in the list.
  – errors="coerce" turns any value that cannot be parsed as a number (e.g. the stray "Rain"
    entries) into NaN instead of raising an error.
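A quick illustration of what errors="coerce" does (toy values mirroring the messy column, not a
cell from the original notebook):

pd.to_numeric(pd.Series(["11.9", "Rain", "7.1"]), errors="coerce")
# -> 11.9, NaN, 7.1   (unparseable strings become NaN)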

2.3.3 Let's take another example of interpolation

[31]: # pc path
df_missing_weather_data = pd.read_excel("/home/aditya/github/Deep-learning-prerequisite/pandas/dataset/nyc_weather_missing_data_demo.xlsx")

# macbook path
# df_missing_weather_data = pd.read_excel("/home/machine_learning/github/Deep-learning-prerequisite/pandas/dataset/nyc_weather_missing_data_demo.xlsx", engine="openpyxl")

df_missing_weather_data

[31]: EST Temperature DewPoint Humidity Sea Level PressureIn \


0 1/1/2016 38.0 23 52 30.03
1 1/2/2016 36.0 18 46 30.02
2 1/3/2016 40.0 21 47 29.86
3 1/4/2016 25.0 9 44 30.05

4 1/5/2016 NaN -3 41 30.57
5 1/6/2016 33.0 4 35 30.50
6 1/7/2016 39.0 11 33 30.28
7 1/8/2016 39.0 29 64 30.20
8 1/9/2016 44.0 38 77 30.16
9 1/10/2016 50.0 46 71 29.59
10 1/11/2016 33.0 8 37 29.92
11 1/12/2016 35.0 15 53 29.85
12 1/13/2016 26.0 4 42 29.94
13 1/14/2016 NaN 12 47 29.95
14 1/15/2016 43.0 31 62 29.82
15 1/16/2016 47.0 37 70 29.52
16 1/17/2016 36.0 23 66 29.78
17 1/18/2016 25.0 6 53 29.83
18 1/19/2016 22.0 3 42 30.03
19 1/20/2016 32.0 15 49 30.13
20 1/21/2016 31.0 11 45 30.15
21 1/22/2016 26.0 6 41 30.21
22 1/23/2016 NaN 21 78 29.77
23 1/24/2016 NaN 11 53 29.92
24 1/25/2016 NaN 18 54 30.25
25 1/26/2016 43.0 29 56 30.03
26 1/27/2016 41.0 22 45 30.03
27 1/28/2016 37.0 20 51 29.90
28 1/29/2016 36.0 21 50 29.58
29 1/30/2016 34.0 16 46 30.01
30 1/31/2016 46.0 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover Events \


0 10 8.0 0 5 NaN
1 10 7.0 0 3 NaN
2 10 8.0 0 1 NaN
3 10 9.0 0 3 NaN
4 10 5.0 0 0 NaN
5 10 4.0 0 0 NaN
6 10 2.0 0 3 NaN
7 10 4.0 0 8 NaN
8 9 8.0 T 8 Rain
9 4 NaN 1.8 7 Rain
10 10 NaN 0 1 NaN
11 10 6.0 T 4 NaN
12 10 10.0 0 0 NaN
13 10 5.0 T 7 NaN
14 9 5.0 T 2 NaN
15 8 7.0 0.24 7 Rain
16 8 6.0 0.05 6 Fog-Snow
17 9 12.0 T 2 Snow

18 10 11.0 0 1 NaN
19 10 6.0 0 2 NaN
20 10 6.0 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16.0 2.31 8 Fog-Snow
23 8 6.0 T 3 Snow
24 10 3.0 0 2 NaN
25 10 7.0 0 2 NaN
26 10 7.0 T 3 Rain
27 10 5.0 0 1 NaN
28 10 8.0 0 4 NaN
29 10 7.0 0 0 NaN
30 10 5.0 0 0 NaN

WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
29 257.0
30 NaN

Let's check how many rows have NaN values in them
[32]: df_missing_weather_data[df_missing_weather_data.isna().any(axis=1)]

[32]: EST Temperature DewPoint Humidity Sea Level PressureIn \


0 1/1/2016 38.0 23 52 30.03
1 1/2/2016 36.0 18 46 30.02
2 1/3/2016 40.0 21 47 29.86
3 1/4/2016 25.0 9 44 30.05
4 1/5/2016 NaN -3 41 30.57
5 1/6/2016 33.0 4 35 30.50
6 1/7/2016 39.0 11 33 30.28
7 1/8/2016 39.0 29 64 30.20
9 1/10/2016 50.0 46 71 29.59
10 1/11/2016 33.0 8 37 29.92
11 1/12/2016 35.0 15 53 29.85
12 1/13/2016 26.0 4 42 29.94
13 1/14/2016 NaN 12 47 29.95
14 1/15/2016 43.0 31 62 29.82
18 1/19/2016 22.0 3 42 30.03
19 1/20/2016 32.0 15 49 30.13
20 1/21/2016 31.0 11 45 30.15
21 1/22/2016 26.0 6 41 30.21
22 1/23/2016 NaN 21 78 29.77
23 1/24/2016 NaN 11 53 29.92
24 1/25/2016 NaN 18 54 30.25
25 1/26/2016 43.0 29 56 30.03
26 1/27/2016 41.0 22 45 30.03
27 1/28/2016 37.0 20 51 29.90
28 1/29/2016 36.0 21 50 29.58
29 1/30/2016 34.0 16 46 30.01
30 1/31/2016 46.0 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover Events \


0 10 8.0 0 5 NaN
1 10 7.0 0 3 NaN
2 10 8.0 0 1 NaN
3 10 9.0 0 3 NaN
4 10 5.0 0 0 NaN
5 10 4.0 0 0 NaN
6 10 2.0 0 3 NaN
7 10 4.0 0 8 NaN
9 4 NaN 1.8 7 Rain
10 10 NaN 0 1 NaN
11 10 6.0 T 4 NaN
12 10 10.0 0 0 NaN
13 10 5.0 T 7 NaN
14 9 5.0 T 2 NaN

18 10 11.0 0 1 NaN
19 10 6.0 0 2 NaN
20 10 6.0 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16.0 2.31 8 Fog-Snow
23 8 6.0 T 3 Snow
24 10 3.0 0 2 NaN
25 10 7.0 0 2 NaN
26 10 7.0 T 3 Rain
27 10 5.0 0 1 NaN
28 10 8.0 0 4 NaN
29 10 7.0 0 0 NaN
30 10 5.0 0 0 NaN

WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
29 257.0
30 NaN

Let's run the interpolate function on this dataFrame


[33]: # df before interpolation
df_missing_weather_data

[33]: EST Temperature DewPoint Humidity Sea Level PressureIn \
0 1/1/2016 38.0 23 52 30.03
1 1/2/2016 36.0 18 46 30.02
2 1/3/2016 40.0 21 47 29.86
3 1/4/2016 25.0 9 44 30.05
4 1/5/2016 NaN -3 41 30.57
5 1/6/2016 33.0 4 35 30.50
6 1/7/2016 39.0 11 33 30.28
7 1/8/2016 39.0 29 64 30.20
8 1/9/2016 44.0 38 77 30.16
9 1/10/2016 50.0 46 71 29.59
10 1/11/2016 33.0 8 37 29.92
11 1/12/2016 35.0 15 53 29.85
12 1/13/2016 26.0 4 42 29.94
13 1/14/2016 NaN 12 47 29.95
14 1/15/2016 43.0 31 62 29.82
15 1/16/2016 47.0 37 70 29.52
16 1/17/2016 36.0 23 66 29.78
17 1/18/2016 25.0 6 53 29.83
18 1/19/2016 22.0 3 42 30.03
19 1/20/2016 32.0 15 49 30.13
20 1/21/2016 31.0 11 45 30.15
21 1/22/2016 26.0 6 41 30.21
22 1/23/2016 NaN 21 78 29.77
23 1/24/2016 NaN 11 53 29.92
24 1/25/2016 NaN 18 54 30.25
25 1/26/2016 43.0 29 56 30.03
26 1/27/2016 41.0 22 45 30.03
27 1/28/2016 37.0 20 51 29.90
28 1/29/2016 36.0 21 50 29.58
29 1/30/2016 34.0 16 46 30.01
30 1/31/2016 46.0 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover Events \


0 10 8.0 0 5 NaN
1 10 7.0 0 3 NaN
2 10 8.0 0 1 NaN
3 10 9.0 0 3 NaN
4 10 5.0 0 0 NaN
5 10 4.0 0 0 NaN
6 10 2.0 0 3 NaN
7 10 4.0 0 8 NaN
8 9 8.0 T 8 Rain
9 4 NaN 1.8 7 Rain
10 10 NaN 0 1 NaN
11 10 6.0 T 4 NaN
12 10 10.0 0 0 NaN

13 10 5.0 T 7 NaN
14 9 5.0 T 2 NaN
15 8 7.0 0.24 7 Rain
16 8 6.0 0.05 6 Fog-Snow
17 9 12.0 T 2 Snow
18 10 11.0 0 1 NaN
19 10 6.0 0 2 NaN
20 10 6.0 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16.0 2.31 8 Fog-Snow
23 8 6.0 T 3 Snow
24 10 3.0 0 2 NaN
25 10 7.0 0 2 NaN
26 10 7.0 T 3 Rain
27 10 5.0 0 1 NaN
28 10 8.0 0 4 NaN
29 10 7.0 0 0 NaN
30 10 5.0 0 0 NaN

WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN

27 NaN
28 NaN
29 257.0
30 NaN

[34]: inter_polated_df = df_missing_weather_data.interpolate()


inter_polated_df

/tmp/ipykernel_3487556/…: FutureWarning: DataFrame.interpolate
with object dtype is deprecated and will raise in a future version. Call
obj.infer_objects(copy=False) before interpolating instead.
  inter_polated_df = df_missing_weather_data.interpolate()

[34]: EST Temperature DewPoint Humidity Sea Level PressureIn \


0 1/1/2016 38.00 23 52 30.03
1 1/2/2016 36.00 18 46 30.02
2 1/3/2016 40.00 21 47 29.86
3 1/4/2016 25.00 9 44 30.05
4 1/5/2016 29.00 -3 41 30.57
5 1/6/2016 33.00 4 35 30.50
6 1/7/2016 39.00 11 33 30.28
7 1/8/2016 39.00 29 64 30.20
8 1/9/2016 44.00 38 77 30.16
9 1/10/2016 50.00 46 71 29.59
10 1/11/2016 33.00 8 37 29.92
11 1/12/2016 35.00 15 53 29.85
12 1/13/2016 26.00 4 42 29.94
13 1/14/2016 34.50 12 47 29.95
14 1/15/2016 43.00 31 62 29.82
15 1/16/2016 47.00 37 70 29.52
16 1/17/2016 36.00 23 66 29.78
17 1/18/2016 25.00 6 53 29.83
18 1/19/2016 22.00 3 42 30.03
19 1/20/2016 32.00 15 49 30.13
20 1/21/2016 31.00 11 45 30.15
21 1/22/2016 26.00 6 41 30.21
22 1/23/2016 30.25 21 78 29.77
23 1/24/2016 34.50 11 53 29.92
24 1/25/2016 38.75 18 54 30.25
25 1/26/2016 43.00 29 56 30.03
26 1/27/2016 41.00 22 45 30.03
27 1/28/2016 37.00 20 51 29.90
28 1/29/2016 36.00 21 50 29.58
29 1/30/2016 34.00 16 46 30.01
30 1/31/2016 46.00 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover Events \


0 10 8.000000 0 5 NaN

1 10 7.000000 0 3 NaN
2 10 8.000000 0 1 NaN
3 10 9.000000 0 3 NaN
4 10 5.000000 0 0 NaN
5 10 4.000000 0 0 NaN
6 10 2.000000 0 3 NaN
7 10 4.000000 0 8 NaN
8 9 8.000000 T 8 Rain
9 4 7.333333 1.8 7 Rain
10 10 6.666667 0 1 NaN
11 10 6.000000 T 4 NaN
12 10 10.000000 0 0 NaN
13 10 5.000000 T 7 NaN
14 9 5.000000 T 2 NaN
15 8 7.000000 0.24 7 Rain
16 8 6.000000 0.05 6 Fog-Snow
17 9 12.000000 T 2 Snow
18 10 11.000000 0 1 NaN
19 10 6.000000 0 2 NaN
20 10 6.000000 0 1 NaN
21 9 11.000000 0.01 3 Snow
22 1 16.000000 2.31 8 Fog-Snow
23 8 6.000000 T 3 Snow
24 10 3.000000 0 2 NaN
25 10 7.000000 0 2 NaN
26 10 7.000000 T 3 Rain
27 10 5.000000 0 1 NaN
28 10 8.000000 0 4 NaN
29 10 7.000000 0 0 NaN
30 10 5.000000 0 0 NaN

WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 313.0
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 192.5
14 101.0

15 340.0
16 345.0
17 293.0
18 297.5
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 280.2
26 274.4
27 268.6
28 262.8
29 257.0
30 257.0

This warning means that some columns are object-dtype columns (strings, or numbers stored as
strings):
/tmp/ipykernel_2443388/…: FutureWarning: DataFrame.interpolate with object dtype is
deprecated and will raise in a future version. Call obj.infer_objects(copy=False) before
interpolating instead.

Convert the numbers stored as strings into numbers


[35]: # convert the columns into numbers (newer pandas versions)
df_missing_weather_data.infer_objects(copy=False)

# in older pandas versions copy=False will raise an error; instead you will have to use this
# df_missing_weather_data.infer_objects()

# only select those columns whose data type is numeric and then interpolate them accordingly
interpolated_df = df_missing_weather_data.select_dtypes(include=["number"]).interpolate()

interpolated_df

[35]: Temperature DewPoint Humidity Sea Level PressureIn VisibilityMiles \


0 38.00 23 52 30.03 10
1 36.00 18 46 30.02 10
2 40.00 21 47 29.86 10
3 25.00 9 44 30.05 10
4 29.00 -3 41 30.57 10
5 33.00 4 35 30.50 10
6 39.00 11 33 30.28 10
7 39.00 29 64 30.20 10
8 44.00 38 77 30.16 9

9 50.00 46 71 29.59 4
10 33.00 8 37 29.92 10
11 35.00 15 53 29.85 10
12 26.00 4 42 29.94 10
13 34.50 12 47 29.95 10
14 43.00 31 62 29.82 9
15 47.00 37 70 29.52 8
16 36.00 23 66 29.78 8
17 25.00 6 53 29.83 9
18 22.00 3 42 30.03 10
19 32.00 15 49 30.13 10
20 31.00 11 45 30.15 10
21 26.00 6 41 30.21 9
22 30.25 21 78 29.77 1
23 34.50 11 53 29.92 8
24 38.75 18 54 30.25 10
25 43.00 29 56 30.03 10
26 41.00 22 45 30.03 10
27 37.00 20 51 29.90 10
28 36.00 21 50 29.58 10
29 34.00 16 46 30.01 10
30 46.00 28 52 29.90 10

WindSpeedMPH CloudCover WindDirDegrees


0 8.000000 5 281.0
1 7.000000 3 275.0
2 8.000000 1 277.0
3 9.000000 3 345.0
4 5.000000 0 333.0
5 4.000000 0 313.0
6 2.000000 3 293.0
7 4.000000 8 79.0
8 8.000000 8 76.0
9 7.333333 7 109.0
10 6.666667 1 289.0
11 6.000000 4 235.0
12 10.000000 0 284.0
13 5.000000 7 192.5
14 5.000000 2 101.0
15 7.000000 7 340.0
16 6.000000 6 345.0
17 12.000000 2 293.0
18 11.000000 1 297.5
19 6.000000 2 302.0
20 6.000000 1 312.0
21 11.000000 3 34.0
22 16.000000 8 42.0

23 6.000000 3 327.0
24 3.000000 2 286.0
25 7.000000 2 280.2
26 7.000000 3 274.4
27 5.000000 1 268.6
28 8.000000 4 262.8
29 7.000000 0 257.0
30 5.000000 0 257.0

Let’s handle dataFrame interpolation using time method The reason I want to use time
method is because our data is dependent on time and time i.e dates. Since we have date column I
wanted to use time method.
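A small sketch of what method="time" adds over the default linear method, on toy dates
(hypothetical values, not a cell from the original notebook; the gap between the dates is used
to weight the estimate):

import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, 40.0],
              index=pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-04"]))
print(s.interpolate(method="time"))   # 2016-01-02 -> 20.0 (one third of the way from 10 to 40)
print(s.interpolate())                # default linear ignores the dates and gives 25.0 instead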

[36]: df_missing_weather_data

[36]: EST Temperature DewPoint Humidity Sea Level PressureIn \


0 1/1/2016 38.0 23 52 30.03
1 1/2/2016 36.0 18 46 30.02
2 1/3/2016 40.0 21 47 29.86
3 1/4/2016 25.0 9 44 30.05
4 1/5/2016 NaN -3 41 30.57
5 1/6/2016 33.0 4 35 30.50
6 1/7/2016 39.0 11 33 30.28
7 1/8/2016 39.0 29 64 30.20
8 1/9/2016 44.0 38 77 30.16
9 1/10/2016 50.0 46 71 29.59
10 1/11/2016 33.0 8 37 29.92
11 1/12/2016 35.0 15 53 29.85
12 1/13/2016 26.0 4 42 29.94
13 1/14/2016 NaN 12 47 29.95
14 1/15/2016 43.0 31 62 29.82
15 1/16/2016 47.0 37 70 29.52
16 1/17/2016 36.0 23 66 29.78
17 1/18/2016 25.0 6 53 29.83
18 1/19/2016 22.0 3 42 30.03
19 1/20/2016 32.0 15 49 30.13
20 1/21/2016 31.0 11 45 30.15
21 1/22/2016 26.0 6 41 30.21
22 1/23/2016 NaN 21 78 29.77
23 1/24/2016 NaN 11 53 29.92
24 1/25/2016 NaN 18 54 30.25
25 1/26/2016 43.0 29 56 30.03
26 1/27/2016 41.0 22 45 30.03
27 1/28/2016 37.0 20 51 29.90
28 1/29/2016 36.0 21 50 29.58
29 1/30/2016 34.0 16 46 30.01
30 1/31/2016 46.0 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover Events \
0 10 8.0 0 5 NaN
1 10 7.0 0 3 NaN
2 10 8.0 0 1 NaN
3 10 9.0 0 3 NaN
4 10 5.0 0 0 NaN
5 10 4.0 0 0 NaN
6 10 2.0 0 3 NaN
7 10 4.0 0 8 NaN
8 9 8.0 T 8 Rain
9 4 NaN 1.8 7 Rain
10 10 NaN 0 1 NaN
11 10 6.0 T 4 NaN
12 10 10.0 0 0 NaN
13 10 5.0 T 7 NaN
14 9 5.0 T 2 NaN
15 8 7.0 0.24 7 Rain
16 8 6.0 0.05 6 Fog-Snow
17 9 12.0 T 2 Snow
18 10 11.0 0 1 NaN
19 10 6.0 0 2 NaN
20 10 6.0 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16.0 2.31 8 Fog-Snow
23 8 6.0 T 3 Snow
24 10 3.0 0 2 NaN
25 10 7.0 0 2 NaN
26 10 7.0 T 3 Rain
27 10 5.0 0 1 NaN
28 10 8.0 0 4 NaN
29 10 7.0 0 0 NaN
30 10 5.0 0 0 NaN

WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0

12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
29 257.0
30 NaN

[37]: time_interpolated_df = df_missing_weather_data.select_dtypes(include=["number"]).interpolate(method="time")

time_interpolated_df

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[37], line 1
----> 1 time_interpolated_df = df_missing_weather_data.select_dtypes(include=["number"]).interpolate(method="time")
      2 time_interpolated_df

[... internal pandas frames elided ...]

ValueError: time-weighted interpolation only works on Series or DataFrames with
a DatetimeIndex

Explanation :
• As you can see, this error occurred because our dataFrame does not have the date column set
as its index. In order to resolve this error and use the time method for interpolation, we
will have to set the date column as the index of this dataFrame.

Setting the date column as an index of the dataFrame


[ ]: df_missing_weather_data.set_index(df_missing_weather_data["EST"],inplace=True)
df_missing_weather_data

As you can see, our date column named EST has now been set as the index of
this dataFrame; we can move forward with our interpolation operation to remove
NaN from the dataFrame
[ ]: # convert the columns that store numbers as strings into numeric data types
df_missing_weather_data.infer_objects(copy=False)

Why are we setting copy=False in .infer_objects(copy=False) here? The copy parameter
controls whether pandas copies the underlying data when building the result.

| copy value | What it does                                                   |
| ---------- | -------------------------------------------------------------- |
| True       | Returns a new DataFrame backed by a copy of the data           |
| False      | Returns a new DataFrame but avoids copying data where possible |

You are telling pandas:
"Please try to convert any object columns to a more specific dtype without copying the
underlying data."
Note that the result is still returned as a DataFrame, so it has to be assigned back if you want
to keep the inferred dtypes.
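A small sketch of how the call would be kept (an assumption on my part; the original cell calls
infer_objects without assigning the result):

df_missing_weather_data = df_missing_weather_data.infer_objects(copy=False)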

[38]: df_missing_weather_data.select_dtypes(include=["number"]).interpolate(method="time")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[38], line 1
----> 1 df_missing_weather_data.select_dtypes(include=["number"]).interpolate(method="time")

[... internal pandas frames elided ...]

ValueError: time-weighted interpolation only works on Series or DataFrames with
a DatetimeIndex

Again I am getting this error? Hmm… Let's check the data type of the EST column
in the dataFrame

[39]: df_missing_weather_data["EST"].dtype

[39]: dtype('O')

Explanation : In pandas, O stands for the "object" dtype. This means the column may contain:
• strings
• mixed types (e.g., strings + numbers)
• Python objects (like datetime strings not yet converted)
So, "EST" is currently treated as a general object column, likely string data representing
dates/times.

[40]: df_missing_weather_data["EST"].apply(lambda x: isinstance(x,str)).all()

[40]: True

Since it returned True, that means all the values present in the EST column are of
string datatype. This means I need to convert all the values in this column into a datetime type
before assigning it as the index of this dataFrame

[41]: df_missing_weather_data["EST"] = pd.to_datetime(df_missing_weather_data["EST"], format="%m%d%Y", errors="coerce")

df_missing_weather_data["EST"].dtype

[41]: dtype('<M8[ns]')

[42]: df_missing_weather_data

[42]: EST Temperature DewPoint Humidity Sea Level PressureIn \


0 NaT 38.0 23 52 30.03
1 NaT 36.0 18 46 30.02
2 NaT 40.0 21 47 29.86
3 NaT 25.0 9 44 30.05
4 NaT NaN -3 41 30.57
5 NaT 33.0 4 35 30.50
6 NaT 39.0 11 33 30.28
7 NaT 39.0 29 64 30.20
8 NaT 44.0 38 77 30.16
9 NaT 50.0 46 71 29.59
10 NaT 33.0 8 37 29.92
11 NaT 35.0 15 53 29.85
12 NaT 26.0 4 42 29.94
13 NaT NaN 12 47 29.95
14 NaT 43.0 31 62 29.82
15 NaT 47.0 37 70 29.52
16 NaT 36.0 23 66 29.78
17 NaT 25.0 6 53 29.83
18 NaT 22.0 3 42 30.03
19 NaT 32.0 15 49 30.13
20 NaT 31.0 11 45 30.15
21 NaT 26.0 6 41 30.21
22 NaT NaN 21 78 29.77
23 NaT NaN 11 53 29.92
24 NaT NaN 18 54 30.25
25 NaT 43.0 29 56 30.03
26 NaT 41.0 22 45 30.03
27 NaT 37.0 20 51 29.90
28 NaT 36.0 21 50 29.58
29 NaT 34.0 16 46 30.01
30 NaT 46.0 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover Events \


0 10 8.0 0 5 NaN
1 10 7.0 0 3 NaN
2 10 8.0 0 1 NaN
3 10 9.0 0 3 NaN
4 10 5.0 0 0 NaN

5 10 4.0 0 0 NaN
6 10 2.0 0 3 NaN
7 10 4.0 0 8 NaN
8 9 8.0 T 8 Rain
9 4 NaN 1.8 7 Rain
10 10 NaN 0 1 NaN
11 10 6.0 T 4 NaN
12 10 10.0 0 0 NaN
13 10 5.0 T 7 NaN
14 9 5.0 T 2 NaN
15 8 7.0 0.24 7 Rain
16 8 6.0 0.05 6 Fog-Snow
17 9 12.0 T 2 Snow
18 10 11.0 0 1 NaN
19 10 6.0 0 2 NaN
20 10 6.0 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16.0 2.31 8 Fog-Snow
23 8 6.0 T 3 Snow
24 10 3.0 0 2 NaN
25 10 7.0 0 2 NaN
26 10 7.0 T 3 Rain
27 10 5.0 0 1 NaN
28 10 8.0 0 4 NaN
29 10 7.0 0 0 NaN
30 10 5.0 0 0 NaN

WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN

19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
29 257.0
30 NaN

2.3.4 Why did my string type dates get converted to NaT values in the EST column?
Explanation :
• Dates in the EST column are in the string format 1/3/2016.
• This format corresponds to the format string %m/%d/%Y.
• But in my code I accidentally used %m%d%Y.
• Which is why I got NaT values in the EST column of my dataFrame.
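A quick check of the two format strings on a single sample date from the column (a sketch, not a
cell from the original notebook):

pd.to_datetime("1/3/2016", format="%m/%d/%Y")                  # -> Timestamp('2016-01-03')
pd.to_datetime("1/3/2016", format="%m%d%Y", errors="coerce")   # -> NaT (format does not match)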

Now I have to fix this


[43]: # Let's create a new dataFrame from the excel sheet again
# pc path
df_missing_weather_data = pd.read_excel("/home/aditya/github/Deep-learning-prerequisite/pandas/dataset/nyc_weather_missing_data_demo.xlsx")

# macbook path
# df_missing_weather_data = pd.read_excel("/home/machine_learning/github/Deep-learning-prerequisite/pandas/dataset/nyc_weather_missing_data_demo.xlsx", engine="openpyxl")

df_missing_weather_data

[43]: EST Temperature DewPoint Humidity Sea Level PressureIn \


0 1/1/2016 38.0 23 52 30.03
1 1/2/2016 36.0 18 46 30.02
2 1/3/2016 40.0 21 47 29.86
3 1/4/2016 25.0 9 44 30.05
4 1/5/2016 NaN -3 41 30.57
5 1/6/2016 33.0 4 35 30.50
6 1/7/2016 39.0 11 33 30.28
7 1/8/2016 39.0 29 64 30.20
8 1/9/2016 44.0 38 77 30.16
9 1/10/2016 50.0 46 71 29.59
10 1/11/2016 33.0 8 37 29.92
11 1/12/2016 35.0 15 53 29.85
12 1/13/2016 26.0 4 42 29.94

13 1/14/2016 NaN 12 47 29.95
14 1/15/2016 43.0 31 62 29.82
15 1/16/2016 47.0 37 70 29.52
16 1/17/2016 36.0 23 66 29.78
17 1/18/2016 25.0 6 53 29.83
18 1/19/2016 22.0 3 42 30.03
19 1/20/2016 32.0 15 49 30.13
20 1/21/2016 31.0 11 45 30.15
21 1/22/2016 26.0 6 41 30.21
22 1/23/2016 NaN 21 78 29.77
23 1/24/2016 NaN 11 53 29.92
24 1/25/2016 NaN 18 54 30.25
25 1/26/2016 43.0 29 56 30.03
26 1/27/2016 41.0 22 45 30.03
27 1/28/2016 37.0 20 51 29.90
28 1/29/2016 36.0 21 50 29.58
29 1/30/2016 34.0 16 46 30.01
30 1/31/2016 46.0 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover Events \


0 10 8.0 0 5 NaN
1 10 7.0 0 3 NaN
2 10 8.0 0 1 NaN
3 10 9.0 0 3 NaN
4 10 5.0 0 0 NaN
5 10 4.0 0 0 NaN
6 10 2.0 0 3 NaN
7 10 4.0 0 8 NaN
8 9 8.0 T 8 Rain
9 4 NaN 1.8 7 Rain
10 10 NaN 0 1 NaN
11 10 6.0 T 4 NaN
12 10 10.0 0 0 NaN
13 10 5.0 T 7 NaN
14 9 5.0 T 2 NaN
15 8 7.0 0.24 7 Rain
16 8 6.0 0.05 6 Fog-Snow
17 9 12.0 T 2 Snow
18 10 11.0 0 1 NaN
19 10 6.0 0 2 NaN
20 10 6.0 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16.0 2.31 8 Fog-Snow
23 8 6.0 T 3 Snow
24 10 3.0 0 2 NaN
25 10 7.0 0 2 NaN
26 10 7.0 T 3 Rain

27 10 5.0 0 1 NaN
28 10 8.0 0 4 NaN
29 10 7.0 0 0 NaN
30 10 5.0 0 0 NaN

WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
29 257.0
30 NaN

Convert the EST column string dates into datetime values, this time using the format
%m/%d/%Y
[44]: df_missing_weather_data["EST"] = pd.
↪to_datetime(df_missing_weather_data["EST"],format = "%m/%d/%Y",␣

↪errors="coerce")

df_missing_weather_data

[44]: EST Temperature DewPoint Humidity Sea Level PressureIn \


0 2016-01-01 38.0 23 52 30.03

1 2016-01-02 36.0 18 46 30.02
2 2016-01-03 40.0 21 47 29.86
3 2016-01-04 25.0 9 44 30.05
4 2016-01-05 NaN -3 41 30.57
5 2016-01-06 33.0 4 35 30.50
6 2016-01-07 39.0 11 33 30.28
7 2016-01-08 39.0 29 64 30.20
8 2016-01-09 44.0 38 77 30.16
9 2016-01-10 50.0 46 71 29.59
10 2016-01-11 33.0 8 37 29.92
11 2016-01-12 35.0 15 53 29.85
12 2016-01-13 26.0 4 42 29.94
13 2016-01-14 NaN 12 47 29.95
14 2016-01-15 43.0 31 62 29.82
15 2016-01-16 47.0 37 70 29.52
16 2016-01-17 36.0 23 66 29.78
17 2016-01-18 25.0 6 53 29.83
18 2016-01-19 22.0 3 42 30.03
19 2016-01-20 32.0 15 49 30.13
20 2016-01-21 31.0 11 45 30.15
21 2016-01-22 26.0 6 41 30.21
22 2016-01-23 NaN 21 78 29.77
23 2016-01-24 NaN 11 53 29.92
24 2016-01-25 NaN 18 54 30.25
25 2016-01-26 43.0 29 56 30.03
26 2016-01-27 41.0 22 45 30.03
27 2016-01-28 37.0 20 51 29.90
28 2016-01-29 36.0 21 50 29.58
29 2016-01-30 34.0 16 46 30.01
30 2016-01-31 46.0 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover Events \


0 10 8.0 0 5 NaN
1 10 7.0 0 3 NaN
2 10 8.0 0 1 NaN
3 10 9.0 0 3 NaN
4 10 5.0 0 0 NaN
5 10 4.0 0 0 NaN
6 10 2.0 0 3 NaN
7 10 4.0 0 8 NaN
8 9 8.0 T 8 Rain
9 4 NaN 1.8 7 Rain
10 10 NaN 0 1 NaN
11 10 6.0 T 4 NaN
12 10 10.0 0 0 NaN
13 10 5.0 T 7 NaN
14 9 5.0 T 2 NaN

15 8 7.0 0.24 7 Rain
16 8 6.0 0.05 6 Fog-Snow
17 9 12.0 T 2 Snow
18 10 11.0 0 1 NaN
19 10 6.0 0 2 NaN
20 10 6.0 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16.0 2.31 8 Fog-Snow
23 8 6.0 T 3 Snow
24 10 3.0 0 2 NaN
25 10 7.0 0 2 NaN
26 10 7.0 T 3 Rain
27 10 5.0 0 1 NaN
28 10 8.0 0 4 NaN
29 10 7.0 0 0 NaN
30 10 5.0 0 0 NaN

WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN

29 257.0
30 NaN

[45]: df_missing_weather_data["EST"].dtype

[45]: dtype('<M8[ns]')

Now our EST column values have been successfully converted without any issue

Explanation
• M8 stands for datetime64 (M is for “datetime” in NumPy’s dtype notation).
• < means little-endian byte order
• [ns] means the precision in nanoseconds
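A quick way to confirm this (a standalone sketch, not a cell from this notebook): the two spellings name the same NumPy dtype.

import numpy as np

# '<M8[ns]' is just the byte-order-explicit spelling of datetime64[ns]
print(np.dtype("<M8[ns]") == np.dtype("datetime64[ns]"))  # True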

Let's verify whether the values in the EST column are of type string or not
[46]: df_missing_weather_data["EST"].apply(lambda x: isinstance(x,str)).all()

[46]: False

Since the result is False, the values stored in the EST column are not of type string.

Since I created a new dataFrame from the Excel sheet earlier, I don't have to reset the index
on the dataFrame.
[47]: df_missing_weather_data.set_index(df_missing_weather_data["EST"],inplace=True)
df_missing_weather_data

[47]: EST Temperature DewPoint Humidity Sea Level PressureIn \


EST
2016-01-01 2016-01-01 38.0 23 52 30.03
2016-01-02 2016-01-02 36.0 18 46 30.02
2016-01-03 2016-01-03 40.0 21 47 29.86
2016-01-04 2016-01-04 25.0 9 44 30.05
2016-01-05 2016-01-05 NaN -3 41 30.57
2016-01-06 2016-01-06 33.0 4 35 30.50
2016-01-07 2016-01-07 39.0 11 33 30.28
2016-01-08 2016-01-08 39.0 29 64 30.20
2016-01-09 2016-01-09 44.0 38 77 30.16
2016-01-10 2016-01-10 50.0 46 71 29.59
2016-01-11 2016-01-11 33.0 8 37 29.92
2016-01-12 2016-01-12 35.0 15 53 29.85
2016-01-13 2016-01-13 26.0 4 42 29.94
2016-01-14 2016-01-14 NaN 12 47 29.95
2016-01-15 2016-01-15 43.0 31 62 29.82
2016-01-16 2016-01-16 47.0 37 70 29.52
2016-01-17 2016-01-17 36.0 23 66 29.78

2016-01-18 2016-01-18 25.0 6 53 29.83
2016-01-19 2016-01-19 22.0 3 42 30.03
2016-01-20 2016-01-20 32.0 15 49 30.13
2016-01-21 2016-01-21 31.0 11 45 30.15
2016-01-22 2016-01-22 26.0 6 41 30.21
2016-01-23 2016-01-23 NaN 21 78 29.77
2016-01-24 2016-01-24 NaN 11 53 29.92
2016-01-25 2016-01-25 NaN 18 54 30.25
2016-01-26 2016-01-26 43.0 29 56 30.03
2016-01-27 2016-01-27 41.0 22 45 30.03
2016-01-28 2016-01-28 37.0 20 51 29.90
2016-01-29 2016-01-29 36.0 21 50 29.58
2016-01-30 2016-01-30 34.0 16 46 30.01
2016-01-31 2016-01-31 46.0 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover \


EST
2016-01-01 10 8.0 0 5
2016-01-02 10 7.0 0 3
2016-01-03 10 8.0 0 1
2016-01-04 10 9.0 0 3
2016-01-05 10 5.0 0 0
2016-01-06 10 4.0 0 0
2016-01-07 10 2.0 0 3
2016-01-08 10 4.0 0 8
2016-01-09 9 8.0 T 8
2016-01-10 4 NaN 1.8 7
2016-01-11 10 NaN 0 1
2016-01-12 10 6.0 T 4
2016-01-13 10 10.0 0 0
2016-01-14 10 5.0 T 7
2016-01-15 9 5.0 T 2
2016-01-16 8 7.0 0.24 7
2016-01-17 8 6.0 0.05 6
2016-01-18 9 12.0 T 2
2016-01-19 10 11.0 0 1
2016-01-20 10 6.0 0 2
2016-01-21 10 6.0 0 1
2016-01-22 9 NaN 0.01 3
2016-01-23 1 16.0 2.31 8
2016-01-24 8 6.0 T 3
2016-01-25 10 3.0 0 2
2016-01-26 10 7.0 0 2
2016-01-27 10 7.0 T 3
2016-01-28 10 5.0 0 1
2016-01-29 10 8.0 0 4
2016-01-30 10 7.0 0 0

2016-01-31 10 5.0 0 0

Events WindDirDegrees
EST
2016-01-01 NaN 281.0
2016-01-02 NaN 275.0
2016-01-03 NaN 277.0
2016-01-04 NaN 345.0
2016-01-05 NaN 333.0
2016-01-06 NaN NaN
2016-01-07 NaN 293.0
2016-01-08 NaN 79.0
2016-01-09 Rain 76.0
2016-01-10 Rain 109.0
2016-01-11 NaN 289.0
2016-01-12 NaN 235.0
2016-01-13 NaN 284.0
2016-01-14 NaN NaN
2016-01-15 NaN 101.0
2016-01-16 Rain 340.0
2016-01-17 Fog-Snow 345.0
2016-01-18 Snow 293.0
2016-01-19 NaN NaN
2016-01-20 NaN 302.0
2016-01-21 NaN 312.0
2016-01-22 Snow 34.0
2016-01-23 Fog-Snow 42.0
2016-01-24 Snow 327.0
2016-01-25 NaN 286.0
2016-01-26 NaN NaN
2016-01-27 Rain NaN
2016-01-28 NaN NaN
2016-01-29 NaN NaN
2016-01-30 NaN 257.0
2016-01-31 NaN NaN

Now let's run interpolation using the time method on the dataFrame


[48]: # convert the columns that have numbers in string format into data-types of␣
↪type number

# This will work on newer pandas version


# converted_val_in_df = df_missing_weather_data.infer_objects(copy=False)

# this will work on older pandas version


converted_val_in_df = df_missing_weather_data.infer_objects()

# select only those columns which have the values as type of numbers

new_processed_df = converted_val_in_df.select_dtypes(include=["number"]).
↪interpolate(method="time")

new_processed_df

[48]: Temperature DewPoint Humidity Sea Level PressureIn \


EST
2016-01-01 38.00 23 52 30.03
2016-01-02 36.00 18 46 30.02
2016-01-03 40.00 21 47 29.86
2016-01-04 25.00 9 44 30.05
2016-01-05 29.00 -3 41 30.57
2016-01-06 33.00 4 35 30.50
2016-01-07 39.00 11 33 30.28
2016-01-08 39.00 29 64 30.20
2016-01-09 44.00 38 77 30.16
2016-01-10 50.00 46 71 29.59
2016-01-11 33.00 8 37 29.92
2016-01-12 35.00 15 53 29.85
2016-01-13 26.00 4 42 29.94
2016-01-14 34.50 12 47 29.95
2016-01-15 43.00 31 62 29.82
2016-01-16 47.00 37 70 29.52
2016-01-17 36.00 23 66 29.78
2016-01-18 25.00 6 53 29.83
2016-01-19 22.00 3 42 30.03
2016-01-20 32.00 15 49 30.13
2016-01-21 31.00 11 45 30.15
2016-01-22 26.00 6 41 30.21
2016-01-23 30.25 21 78 29.77
2016-01-24 34.50 11 53 29.92
2016-01-25 38.75 18 54 30.25
2016-01-26 43.00 29 56 30.03
2016-01-27 41.00 22 45 30.03
2016-01-28 37.00 20 51 29.90
2016-01-29 36.00 21 50 29.58
2016-01-30 34.00 16 46 30.01
2016-01-31 46.00 28 52 29.90

VisibilityMiles WindSpeedMPH CloudCover WindDirDegrees


EST
2016-01-01 10 8.000000 5 281.0
2016-01-02 10 7.000000 3 275.0
2016-01-03 10 8.000000 1 277.0
2016-01-04 10 9.000000 3 345.0
2016-01-05 10 5.000000 0 333.0
2016-01-06 10 4.000000 0 313.0
2016-01-07 10 2.000000 3 293.0

2016-01-08 10 4.000000 8 79.0
2016-01-09 9 8.000000 8 76.0
2016-01-10 4 7.333333 7 109.0
2016-01-11 10 6.666667 1 289.0
2016-01-12 10 6.000000 4 235.0
2016-01-13 10 10.000000 0 284.0
2016-01-14 10 5.000000 7 192.5
2016-01-15 9 5.000000 2 101.0
2016-01-16 8 7.000000 7 340.0
2016-01-17 8 6.000000 6 345.0
2016-01-18 9 12.000000 2 293.0
2016-01-19 10 11.000000 1 297.5
2016-01-20 10 6.000000 2 302.0
2016-01-21 10 6.000000 1 312.0
2016-01-22 9 11.000000 3 34.0
2016-01-23 1 16.000000 8 42.0
2016-01-24 8 6.000000 3 327.0
2016-01-25 10 3.000000 2 286.0
2016-01-26 10 7.000000 2 280.2
2016-01-27 10 7.000000 3 274.4
2016-01-28 10 5.000000 1 268.6
2016-01-29 10 8.000000 4 262.8
2016-01-30 10 7.000000 0 257.0
2016-01-31 10 5.000000 0 257.0

Checking if there are any NaN values present in the dataFrame


[49]: new_processed_df[new_processed_df.isna().any(axis=1)]

[49]: Empty DataFrame


Columns: [Temperature, DewPoint, Humidity, Sea Level PressureIn,
VisibilityMiles, WindSpeedMPH, CloudCover, WindDirDegrees]
Index: []

Since there are no NaN values present, it is safe to say that the data has been significantly
cleaned in this particular dataFrame.

2.3.5 method (default = ‘linear’)


Determines the type of interpolation technique used to fill missing values. Common methods and
use cases:

| Method | Description | Use Case Example |
| ------ | ----------- | ---------------- |
| 'linear' | Fills using linear interpolation (straight line between points). | Time series or numerical data with a consistent trend. |
| 'time' | Interpolates based on the index, assuming it is datetime type. | Time-indexed DataFrames (like stock prices). |
| 'index' | Same as 'linear' but uses the index values instead of positions. | When the index carries meaningful numeric data. |
| 'nearest' | Use the nearest valid data point. | When you want to take the closest available value. |
| 'zero' | Polynomial interpolation of order 0 (constant). | When data is constant over segments. |
| 'slinear' | Spline interpolation of order 1 (similar to linear). | For smoother interpolation than basic linear. |
| 'quadratic' | Spline interpolation of order 2. | Curved data trends (like acceleration data). |
| 'cubic' | Spline interpolation of order 3. | Smooth curves with gentle oscillation (like motion tracking). |
| 'barycentric', 'polynomial' | Fit through points using polynomial interpolation. | Use cautiously; can cause overfitting. |
| 'pchip' | Piecewise Cubic Hermite Interpolating Polynomial. | Monotonic and shape-preserving curves. |
| 'akima' | Interpolation using Akima splines. | Reduces overshoots and artifacts. |
| 'spline' | Generic spline method (requires order). | Highly smooth interpolations. |
| 'pad' or 'ffill' | Fill forward using the previous valid value. | When values persist until changed. |
| 'backfill' or 'bfill' | Fill backward using the next valid value. | For retrospective value filling. |
| 'from_derivatives' | Interpolates from derivatives. | Advanced usage; rarely needed. |

slinear : Spline interpolation of order 1
• slinear stands for spline linear interpolation (a type of spline of degree 1).
• It's a more flexible and numerically stable version of linear interpolation.
• Under the hood it uses SciPy's interp1d(kind='slinear'), which interpolates based on piecewise linear splines between known data points.
• This method only works when the index is numeric or datetime.
• syntax : df.interpolate(method='slinear')

When to use slinear Use 'slinear' when : - You have time series or sequential data with gaps
(example : NaN values). - You want to smooth missing values, but still assume linear transitions
between known data points. - You want more numerical precision or SciPy backend compatibility
compared to simple linear interpolation.

Real-World Example:
• You’re a data analyst working with temperature readings from an IoT weather sensor that
recorded data every hour. Due to network issues, some data points are missing.

How Is It Different from ‘linear’?

| Method | Description |
| ------ | ----------- |
| 'linear' | Simple straight-line interpolation between two points |
| 'slinear' | Piecewise linear interpolation using scipy.interpolate.interp1d, generally more precise |

In practice, for short gaps, both yield similar results, but ‘slinear’ is better when working within
SciPy-compatible scientific or engineering data pipelines.
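As a quick side-by-side (a standalone sketch with made-up values, not a cell from this notebook; 'slinear' requires SciPy to be installed):

import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, 14.0, np.nan, 18.0])

print(s.interpolate(method="linear"))   # positional straight-line fill
print(s.interpolate(method="slinear"))  # SciPy piecewise-linear spline fill, same result for these interior gaps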

What is Piecewise Linear Interpolation?


• A Piecewise Linear Interpolation is a method of constructing new data points within the
range of discrete set of known data points using straight-line segments between each pair of
points.
• When to use :
– Data changes linearly or gradually.
– You want a fast, simple or predictable interpolation.
– Spline or polynomial interpolation might overfit or behave oddly with sharp changes.
• When not to use
– When data is expected to follow a curve pattern.

– When smoothness in slope is essential (example: animation, trajectory path).

[50]: # Simulate hourly temperature readings with gaps

import numpy as np

data = {
    "temperature" : [30.1, 31.5, np.nan, np.nan, 34.2, 35.0, np.nan, 36.1]
}
dataFrame = pd.DataFrame(data, index = pd.date_range("2025-8-4 00:00", periods = 8, freq="h"))

dataFrame

[50]: temperature
2025-08-04 00:00:00 30.1
2025-08-04 01:00:00 31.5
2025-08-04 02:00:00 NaN
2025-08-04 03:00:00 NaN
2025-08-04 04:00:00 34.2
2025-08-04 05:00:00 35.0
2025-08-04 06:00:00 NaN
2025-08-04 07:00:00 36.1

[51]: # performing interpolation using method slinear


dataFrame["slinear_interpolated_temperature"] = dataFrame.
↪interpolate(method="slinear")

dataFrame

[51]: temperature slinear_interpolated_temperature


2025-08-04 00:00:00 30.1 30.10
2025-08-04 01:00:00 31.5 31.50
2025-08-04 02:00:00 NaN 32.40
2025-08-04 03:00:00 NaN 33.30
2025-08-04 04:00:00 34.2 34.20
2025-08-04 05:00:00 35.0 35.00
2025-08-04 06:00:00 NaN 35.55
2025-08-04 07:00:00 36.1 36.10

[52]: import matplotlib.pyplot as plt

# plotting slinear
plt.figure(figsize=(8,4))
# I want to show the index, which is of datetime format, on the x axis of the graph plotted by pyplot, so I will have to use dataFrame.index

plt.plot(dataFrame.index,dataFrame["temperature"],color="red",linestyle="-",marker="o",label="Original temperature data points with gaps (NaN values)")

plt.plot(dataFrame.index,dataFrame("slinear_interpolated_temperature"),color="green",linestyle="--",marker="o",label="Interpolated temperature data using slinear method")

plt.title("Simulation of hourly temperature readings with gaps")

plt.xlabel("Time")
plt.ylabel("Temperature")
plt.legend()
plt.grid(True)
plt.show()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[52], line 6
      4 # I want to show the index, which is of datetime format, on the x axis of the graph plotted by pyplot, so I will have to use dataFrame.index
      5 plt.plot(dataFrame.index,dataFrame["temperature"],color="red",linestyle="-",marker="o",label="Original temperature data points with gaps (NaN values)")
----> 6 plt.plot(dataFrame.index,dataFrame("slinear_interpolated_temperature"),color="green",linestyle="--",marker="o",label="Interpolated temperature data using slinear method")
      7 plt.title("Simulation of hourly temperature readings with gaps")
      8 plt.xlabel("Time")

TypeError: 'DataFrame' object is not callable

What just happened? After inspecting the code I found out that I made a silly mistake in this
line: plt.plot(dataFrame.index,dataFrame("slinear_interpolated_temperature"),color="green",linestyle="--",marker="o",label="Interpolated
temperature data using slinear method"). Here I should have written
dataFrame["slinear_interpolated_temperature"] but instead I wrote
dataFrame("slinear_interpolated_temperature"). I know I am stupid, but hey, thanks
to my stupidity at least someone who is reading this notebook might avoid the mistake that I made.

Analysis : When I wrote dataFrame("slinear_interpolated_temperature") I am essentially
trying to call a dataFrame like a function, which causes Python to say: "You are treating a
DataFrame object like a callable function but it's not a callable!"

Let's fix the code and re-try plotting the interpolated data using the slinear method once
again
[53]: # plotting slinear
plt.figure(figsize=(8,4))
# I want to show the index, which is of datetime format, on the x axis of the graph plotted by pyplot, so I will have to use dataFrame.index

plt.plot(dataFrame.index,dataFrame["temperature"],color="red",linestyle="-",marker="o",label="Original temperature data points with gaps (NaN values)")

plt.plot(dataFrame.index,dataFrame["slinear_interpolated_temperature"],color="green",linestyle="--",marker="o",label="Interpolated temperature data using slinear method")

plt.title("Simulation of hourly temperature readings with gaps")

plt.xlabel("Time")
plt.ylabel("Temperature")
plt.legend()
plt.grid(True)
plt.show()

2.3.6 quadratic : Spline interpolation of order 2


Quadratic spline interpolation fits a second-degree polynomial (parabola) between known data
points, rather than a straight line (linear) or multiple connected lines (piecewise linear). This
allows it to capture curvature in the data, making it useful for smooth accelerating or decelerating
trends. Instead of drawing straight lines between points (like in linear), it draws parabolic curves
that gently follow the shape of the data. A quadratic function has the form: y = ax^2 + bx + c

Real world use cases : Example : Car acceleration. Imagine you are tracking the distance
traveled by a car every second. The car is accelerating, so the distance increases faster and faster.

| Time (s) | Distance (m) |
| -------- | ------------ |
| 0 | 0 |
| 1 | 3 |
| 2 | NaN |
| 3 | 18 |

[54]: df = pd.DataFrame(
    {
        "time":[0,1,2,3],
        "distance":[0, 3, np.nan, 18]
    }
)
df

[54]: time distance


0 0 0.0
1 1 3.0
2 2 NaN
3 3 18.0

[55]: df["distance_quadratic"] = df["distance"].interpolate(method="quadratic")


df

[55]: time distance distance_quadratic


0 0 0.0 0.0
1 1 3.0 3.0
2 2 NaN 9.0
3 3 18.0 18.0

[56]: import matplotlib.pyplot as plt

# plotting quadratic
plt.figure(figsize = (8,4))
plt.plot(df["time"],df["distance"],color="red",linestyle="-",marker="o",label="Original distance with gaps in data points (NaN values)")

plt.plot(df["time"],df["distance_quadratic"],color="green",linestyle="--",marker="o",label = "Quadratic interpolated distance with generated data points")

plt.title("Car acceleration over time")

plt.xlabel("Time")
plt.ylabel("Distance")
plt.legend()
plt.grid(True)
plt.show()

2.3.7 Difference between quadratic and Linear interpolation:
Scenario: Estimating car speed at a given time using recorded data. Let’s say you have a car’s
position (in meters) at different times (in seconds). You want to estimate its position at a time not
in the original dataset.

[57]: from scipy.interpolate import interp1d
import numpy as np

# Sample time and position data
time = np.array([0, 1, 2, 3, 4])
position = np.array([0, 2, 6, 12, 20])

# Create linear and quadratic interpolators


linear_interp = interp1d(time,position,kind = "linear")
quadratic_interp = interp1d(time,position,kind="quadratic")

# Perform interpolation on data points


t_interp = 2.5
pos_linear = linear_interp(t_interp)
pos_quadratic = quadratic_interp(t_interp)

# Print interpolated values


print(f"Linear Interpolation at t=2.5: {pos_linear:.2f} m")
print(f"Quadratic Interpolation at t=2.5: {pos_quadratic:.2f} m")

Linear Interpolation at t=2.5: 9.00 m


Quadratic Interpolation at t=2.5: 8.75 m

What is t_interp? In this interpolation example t_interp is the point at which I want to
estimate a value (example: a y-value) based on the surrounding known data.

Why 2.5?
• There are known data points at t=2 and t=3, and I want to estimate the value halfway between
them (i.e. at t = 2.5).
• This is where interpolation comes in: I estimate the value at a new point (2.5) using a mathematical model based on existing data.

Purpose of t_interp It is used to : - Compare interpolation methods (linear vs quadratic) by


estimating what value they produce at the same point - Visualize and understand how each method
behaves differently at non-observed locations. - It is the x-value where we want to estimate the
y-value

What scipy.interpolate gives you: scipy.interpolate.make_interp_spline or interp1d allows
you to: - Specify any x-values at which to interpolate (like t_interp = 2.5) - Get smooth curves at
finer resolutions - Use advanced interpolation types like quadratic, cubic spline or B-splines. So use
SciPy if you are interpolating for: - Plotting a smooth line - Predicting values at specific new points,
not just filling in NaNs - Doing more advanced curve fitting
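For example, a minimal sketch of the SciPy route (standalone, not a cell from this notebook; the sample points mirror the car data above): make_interp_spline fits a smooth spline that can then be evaluated at any x-values you choose.

import numpy as np
from scipy.interpolate import make_interp_spline

x = np.array([0, 1, 2, 3, 4])
y = np.array([0, 2, 6, 12, 20])

spline = make_interp_spline(x, y, k=3)   # cubic B-spline through the known points
x_fine = np.linspace(0, 4, 50)           # any x-values we like, not just the originals
y_fine = spline(x_fine)                  # smooth interpolated values
print(y_fine[:5])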

When to use df.interpolate() Use this when: - You want to fill missing values (NaNs) - You are working
within a time series or numeric sequence - You don't need high-resolution or custom x-points

| Task | Use Pandas | Use SciPy |
| ---- | ---------- | --------- |
| Fill missing values (NaNs) | Yes | No |
| Interpolate at arbitrary points | No | Yes |
| Plot smooth interpolated curves | No | Yes |
| More interpolation types (splines, etc.) | Limited | Yes |

[58]: # Create 100 data points between 0 and 4 for the time column
t_vals = np.linspace(0, 4, 100)
pos_linear_vals = linear_interp(t_vals)
pos_quadratic_vals = quadratic_interp(t_vals)

df = pd.DataFrame({
    "time" : t_vals, "position_linear_interpolated" : pos_linear_vals, "position_quadratic_interpolated" : pos_quadratic_vals
})
df

[58]: time position_linear_interpolated position_quadratic_interpolated


0 0.000000 0.000000 0.000000
1 0.040404 0.080808 0.042037
2 0.080808 0.161616 0.087338
3 0.121212 0.242424 0.135904

4 0.161616 0.323232 0.187736
.. … … …
95 3.838384 18.707071 18.571574
96 3.878788 19.030303 18.923783
97 3.919192 19.353535 19.279257
98 3.959596 19.676768 19.637996
99 4.000000 20.000000 20.000000

[100 rows x 3 columns]

[59]: # plot the difference between linear and quadratic interpolation using matplotlib

plt.figure(figsize=(10, 6))
plt.plot(time, position, 'o', label='Original Data')
plt.plot(t_vals, pos_linear_vals, '--', label='Linear Interpolation')
plt.plot(t_vals, pos_quadratic_vals, '-.', label='Quadratic Interpolation')
plt.scatter([t_interp], [pos_linear], color='blue', label='Linear @ 2.5s')
plt.scatter([t_interp], [pos_quadratic], color='green', label='Quadratic @ 2.5s')

plt.xlabel('Time (s)')
plt.ylabel('Position (m)')
plt.title('Linear vs Quadratic Interpolation')
plt.legend()
plt.grid(True)
plt.show()

When to use quadratic:
• When data shows smooth but accelerating or decelerating behavior.
• Ideal for physics, motion tracking, growth models, and sensor data with non-linear trends.
• More accurate than linear interpolation for curved data.

When not to use:
• If your data is noisy or fluctuates sharply, quadratic can overfit.
• If your data is truly linear, simpler methods like linear are better and more efficient.

2.3.8 cubic : Spline interpolation of order 3


Cubic interpolation uses third-degree polynomials to connect data points. Each segment between
two data points is modeled using a cubic equation: y = ax^3 + bx^2 + cx + d. This method
produces smooth, continuous curves with smooth transitions (the first and second derivatives are
continuous), making it especially useful for modeling natural or fluid motion.
Explanation: - First derivative: measures the slope or rate of change. - In motion, it represents
velocity. - Second derivative: measures the change in slope. - In motion, it represents acceleration.

What are the first and second derivatives? Cubic interpolation connects data points using a
third-degree polynomial of the form y = ax^3 + bx^2 + cx + d. This creates smooth curves between
points, unlike linear or quadratic interpolation, which may result in sharp bends or unnatural
transitions.

Derivatives in cubic interpolation:
• The first derivative of a function gives us the rate of change, or slope, at any given point.
• For the cubic polynomial y = ax^3 + bx^2 + cx + d, the first derivative is y' = 3ax^2 + 2bx + c.
• What it tells us: how steep the curve is at any point; in real-world terms, velocity, trend direction or growth rate.
• Why it matters: continuity of the first derivative ensures that there are no sharp turns or corners. Smooth transitions are essential in animation, physics simulations, robotics, financial forecasting and motion paths.

Second derivative: the second derivative gives us the rate of change of the rate of change, i.e. how the slope itself is changing.
• What it tells us: the curvature of the graph; in physical terms, acceleration, force or inflection points.
• Why it matters: continuity of the second derivative ensures no sudden changes in acceleration, which is crucial for:
  – Mechanical systems (example: robotic arms moving smoothly)
  – Video game animation (smooth character movement)
  – Spline fitting in data modeling (natural-looking curves)
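To see what "derivative continuity" means concretely, here is a small standalone sketch (not a cell from this notebook; the positions are made up) using scipy.interpolate.CubicSpline, which lets us evaluate derivatives directly:

import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 1.0, 1.3, 1.0, 0.0])   # made-up positions

cs = CubicSpline(x, y)
eps = 1e-6
# Slope (1st derivative) and curvature (2nd derivative) just left and right of
# the knot at x=2 are essentially equal, i.e. the derivatives are continuous.
print(cs(2 - eps, 1), cs(2 + eps, 1))
print(cs(2 - eps, 2), cs(2 + eps, 2))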

Why cubic interpolation is better Compared to linear interpolation, cubic interpolation: -


Matches not just position (data points), - But also velocity (first derivative) - And acceleration
(second derivative) at the joins This means both path and motion are smooth

Why derivative continuity matters : Imagine drawing a smooth curve through data points,
like tracking an object's movement: - Linear interpolation : connects points with straight lines.
The slope can suddenly jump at each point, so the movement looks jerky. - Cubic interpolation : uses
cubic polynomials between each pair of points and ensures: - No sharp corners (the curve is smooth). -
No sudden changes in velocity (first derivative is continuous). - No sudden changes in acceleration
(second derivative is continuous). - That means the motion appears fluid and natural, like the
movement of a hand or a car.

What does that mean?


• linear : Straight lines between points (piecewise linear) (order 1)
• quadratic : parabolic curves (order 2)
• cubic : more flexible curves, capturing subtle changes in trends direction or curvature.

Real world use cases : Let's say you're tracking the hand position of a person drawing a circle,
captured every few milliseconds. Because hand movement isn't jerky but fluid and curved, a
cubic interpolation gives a more natural fit to the motion. Example data:

| Time (s) | X Position |
| -------- | ---------- |
| 0 | 0 |
| 1 | 1 |
| 2 | NaN |
| 3 | 1 |
| 4 | 0 |

A 'linear' or 'quadratic' interpolation will make the curve look angular or not smooth enough.
'cubic' smoothly interpolates the missing position at t=2.

[60]: df = pd.DataFrame({
    "time":[0,1,2,3,4],
    "x_pos":[0, 1, np.nan, 1, 0]
})
df

[60]: time x_pos


0 0 0.0
1 1 1.0
2 2 NaN
3 3 1.0
4 4 0.0

[61]: df["x_pos"] = df["x_pos"].interpolate(method="cubic")


df

[61]: time x_pos


0 0 0.000000
1 1 1.000000
2 2 1.333333
3 3 1.000000
4 4 0.000000

Cubic interpolation Visual description :


• Produces smooth S-shaped transitions
• Avoids sharp angles
• Allows for inflection points (where the curve bends direction)

When to use 'cubic'


• Data represents smooth, natural motions

• You want visually pleasing interpolation (example : animations, simulations)
• You need smooth first and second derivatives (important in physics and motion modeling)
• common domains :
– Computer graphics and animation
– sensor based tracking
– climate and environmental modeling
– audio signal smoothing

When to avoid:
• Not ideal for very noisy or discontinuous data
• Overkill for simple trends (linear or non-linear)
• More computationally expensive than linear/quadratic

2.3.9 Let's look at three more examples where cubic-method interpolation can be used
in real life
Example 1 : A robotic arm records its X, Y, Z coordinates every 0.5 seconds, but due
to sensor lag, some positions are missing. Since robotic and human arm movements are
fluid and continuous, cubic interpolation ensures smooth transitions between known
positions.
[62]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = [10, 12, np.nan, 15, np.nan, np.nan, 21, 23, np.nan, 28]
time = list(range(len(data)))

df = pd.DataFrame({"time":time,"position":data})
df

[62]: time position


0 0 10.0
1 1 12.0
2 2 NaN
3 3 15.0
4 4 NaN
5 5 NaN
6 6 21.0
7 7 23.0
8 8 NaN
9 9 28.0

[63]: # cubic interpolation


df["interpolated_position"] = df["position"].interpolate(method = "cubic")
df

[63]: time position interpolated_position
0 0 10.0 10.000000
1 1 12.0 12.000000
2 2 NaN 13.508547
3 3 15.0 15.000000
4 4 NaN 16.833333
5 5 NaN 18.905983
6 6 21.0 21.000000
7 7 23.0 23.000000
8 8 NaN 25.200855
9 9 28.0 28.000000

[64]: # plot position and interpolated_position

plt.figure(figsize = (8,4))
plt.plot(df["time"],df["position"], "ro-", label = "Original position data with (NaNs)")

plt.plot(df["time"],df["interpolated_position"],"g--o",label="Cubic interpolated data")

plt.title("Robotic arm position over time")

plt.xlabel("Time")
plt.ylabel("X-Position")
plt.legend()
plt.grid(True)
plt.show()

Why cubic interpolation?

• Prevents jerky motion.
• Preserves natural motion curves
• Useful in game development , robotics path planning and animation rigs

Code explanation:
• Breakdown of "ro-":
– r -> red color
– o -> circle markers at data points
– - -> solid line
• So "ro-" plots:
– A red solid line with circle markers at the data points.
• Breakdown of "g--o":
– g -> green color
– -- -> dashed lines
– o -> circle markers
• So, g--o plots:
– A green dashed line with circle markers at the data points.

Other Examples:

Format String Description


'b^--' Blue dashed line with triangle markers
'ks:' Black dotted line with square markers
'm*-' Magenta solid line with star markers

Alternative (Using color=, linestyle=, marker=)


plt.plot(df["time"], df["position"], color="red", linestyle="-", marker="o")
plt.plot(df["time"], df["interpolated_position"], color="green", linestyle="--", marker="o")

Example 2: Simulated average monthly temperatures with missing entries


[65]: temperatures = [15.2, 16.8, np.nan, 20.1, 24.5, np.nan, np.nan, 26.2, 23.8, 20.0]

months = list(range(1, 11))

df = pd.DataFrame({"month":months,"temperature":temperatures})
df
df

[65]: month temperature


0 1 15.2
1 2 16.8
2 3 NaN
3 4 20.1
4 5 24.5
5 6 NaN
6 7 NaN
7 8 26.2

8 9 23.8
9 10 20.0

[66]: # interpolate data and handle NaN values using cubic method here
df["interpolated_temperature_data_using_cubic_method"] = df["temperature"].
↪interpolate(method="cubic")

df

[66]: month temperature interpolated_temperature_data_using_cubic_method


0 1 15.2 15.200000
1 2 16.8 16.800000
2 3 NaN 17.748067
3 4 20.1 20.100000
4 5 24.5 24.500000
5 6 NaN 27.126607
6 7 NaN 27.469726
7 8 26.2 26.200000
8 9 23.8 23.800000
9 10 20.0 20.000000

[67]: # Now plotting both original temperature data and interpolated data using cubic method in matplotlib

plt.figure(figsize=(12,8))
plt.plot(df["month"],df["temperature"],color="red",linestyle="-",marker="o",label="Original temperature with gaps in data points (example NaN values)")

plt.plot(df["month"],df["interpolated_temperature_data_using_cubic_method"],color="green",linestyle="--",marker="o",label="Interpolated temperature using cubic method")

plt.title("Simulated average monthly temperatures with missing entries")

plt.xlabel("Time in months")
plt.ylabel("Temperature in degree Celsius")
plt.legend()
plt.show()

Let’s see what is the difference between linear and cubic interpolation
[68]: df["linearly_interpolated_temperature"] = df["temperature"].
↪interpolate(method="linear")

df

[68]: month temperature interpolated_temperature_data_using_cubic_method \


0 1 15.2 15.200000
1 2 16.8 16.800000
2 3 NaN 17.748067
3 4 20.1 20.100000
4 5 24.5 24.500000
5 6 NaN 27.126607
6 7 NaN 27.469726
7 8 26.2 26.200000
8 9 23.8 23.800000
9 10 20.0 20.000000

linearly_interpolated_temperature
0 15.200000
1 16.800000
2 18.450000
3 20.100000

4 24.500000
5 25.066667
6 25.633333
7 26.200000
8 23.800000
9 20.000000

[69]: df["quadratic_interpolated_temperature"] = df["temperature"].


↪interpolate(method="quadratic")

df

[69]: month temperature interpolated_temperature_data_using_cubic_method \


0 1 15.2 15.200000
1 2 16.8 16.800000
2 3 NaN 17.748067
3 4 20.1 20.100000
4 5 24.5 24.500000
5 6 NaN 27.126607
6 7 NaN 27.469726
7 8 26.2 26.200000
8 9 23.8 23.800000
9 10 20.0 20.000000

linearly_interpolated_temperature quadratic_interpolated_temperature
0 15.200000 15.200000
1 16.800000 16.800000
2 18.450000 17.872503
3 20.100000 20.100000
4 24.500000 24.500000
5 25.066667 27.082618
6 25.633333 27.338344
7 26.200000 26.200000
8 23.800000 23.800000
9 20.000000 20.000000

[70]: # Now plotting the linear and cubic interpolated data along with the original data with missing gaps in the graph

plt.figure(figsize=(12,8))
plt.plot(df["month"],df["temperature"],color="red",linestyle="-",marker="o",label="Original temperature with gaps in data points (example NaN values)")

plt.plot(df["month"],df["interpolated_temperature_data_using_cubic_method"],color="green",linestyle="--",marker="o",label="Interpolated temperature using cubic method")

plt.plot(df["month"],df["linearly_interpolated_temperature"],color="orange",linestyle="-.",marker="o",label="Interpolated temperature using linear method")

plt.plot(df["month"],df["quadratic_interpolated_temperature"],color="blue",linestyle=":",marker="o",label="Interpolated temperature using quadratic method")

plt.title("Difference between cubic vs linear vs quadratic interpolated data")

plt.xlabel("Time in months")
plt.ylabel("Temperature in degree Celsius")
plt.legend()
plt.show()

Example 3 : Simulated Audio Waveform Reconstruction


[71]: amplitude = [0.1, 0.4, np.nan, np.nan, -0.2, -0.5, np.nan, 0.0, 0.3, np.nan]
frames = list(range(10))

df = pd.DataFrame({'frame': frames, 'amplitude': amplitude})


df

[71]: frame amplitude


0 0 0.1
1 1 0.4
2 2 NaN
3 3 NaN
4 4 -0.2

5 5 -0.5
6 6 NaN
7 7 0.0
8 8 0.3
9 9 NaN

[72]: df["interpolated_amplitude_using_cubic_method"] = df["amplitude"].


↪interpolate(method="cubic")

df

[72]: frame amplitude interpolated_amplitude_using_cubic_method


0 0 0.1 0.100000
1 1 0.4 0.400000
2 2 NaN 0.387216
3 3 NaN 0.155824
4 4 -0.2 -0.200000
5 5 -0.5 -0.500000
6 6 NaN -0.376081
7 7 0.0 0.000000
8 8 0.3 0.300000
9 9 NaN NaN

Since the interpolate function requires at least two surrounding data points to fill in the missing
values between them, the last value has been left as NaN. Hence it's better to ignore the last
row.
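(As an aside, a possible alternative to dropping the row, sketched here but not used in this notebook, is to forward-fill whatever the cubic interpolation leaves as NaN at the edge; amplitude_filled is a hypothetical name.)

# interpolate the interior gaps, then carry the last known value forward
amplitude_filled = df["amplitude"].interpolate(method="cubic").ffill()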

So I will be dropping the last row in favour of better results.


[73]: df_dropped_last_row = df.drop(9)
df_dropped_last_row

[73]: frame amplitude interpolated_amplitude_using_cubic_method


0 0 0.1 0.100000
1 1 0.4 0.400000
2 2 NaN 0.387216
3 3 NaN 0.155824
4 4 -0.2 -0.200000
5 5 -0.5 -0.500000
6 6 NaN -0.376081
7 7 0.0 0.000000
8 8 0.3 0.300000

[74]: # Now plotting the amplitude along with interpolated_amplitude_using_cubic_method using matplotlib

plt.figure(figsize=(12,8))
plt.plot(df["frame"], df["amplitude"], color="red", linestyle="-",marker = "o", label="Original amplitude with missing data points (example : NaN values)")

plt.plot(df["frame"],df["interpolated_amplitude_using_cubic_method"], color="green", linestyle="--",marker="o",label="Interpolated amplitude using cubic method")

plt.title("Audio Waveform Reconstruction")

plt.xlabel("Frame")
plt.ylabel("Amplitude")
plt.legend()
plt.show()

2.3.10 Let's look at more examples where nearest-method interpolation can be
used in real life
The nearest method replaces a missing value with the closest non-null value either before or after
the missing entry (whichever is closer in position).

Real world example : Temperature sensor readings. Imagine I am collecting temperature
data every second from a sensor, but due to noise or disconnection, some values are missing. If
you are only interested in quick approximations (like in real-time monitoring), you might want to
fill missing data with the nearest valid value.

[75]: temperature = [25.0, 25.1, np.nan, np.nan, 25.3, 25.4, np.nan, 25.6, np.nan, 25.9]

seconds = list(range(len(temperature)))

df = pd.DataFrame({"second":seconds, "temperature":temperature})
df

[75]: second temperature


0 0 25.0
1 1 25.1
2 2 NaN
3 3 NaN
4 4 25.3
5 5 25.4
6 6 NaN
7 7 25.6
8 8 NaN
9 9 25.9

[76]: df["nearest_interpolated_temperature"] = df["temperature"].


↪interpolate(method="nearest")

df

[76]: second temperature nearest_interpolated_temperature


0 0 25.0 25.0
1 1 25.1 25.1
2 2 NaN 25.1
3 3 NaN 25.3
4 4 25.3 25.3
5 5 25.4 25.4
6 6 NaN 25.4
7 7 25.6 25.6
8 8 NaN 25.6
9 9 25.9 25.9

Now plot the original temperature column and interpolated temperature column using
matplotlib
[77]: plt.figure(figsize = (8,4))
plt.plot(df["second"],df["temperature"],color="red",linestyle="-",marker="o",label="Original temperature with gaps (i.e. NaN values)")

plt.plot(df["second"],df["nearest_interpolated_temperature"],color="green",linestyle="--",marker="o",label="Nearest interpolated temperature values")

plt.title("Temperature data every second from a sensor")

plt.xlabel("Time in seconds")
plt.ylabel("Temperature in degree Celsius")
plt.legend()
plt.grid()
plt.show()

2.3.11 When to Use 'nearest'

| Situation | Why it's Good |
| --------- | ------------- |
| Real-time dashboards | Fast, simple estimation, prioritizing speed over accuracy |
| Categorical/Discrete data | Like filling in missing class labels, ratings, or modes where continuous trends don't matter |
| Quick fallback mechanism | When precision isn't critical but missing data breaks your system logic |

2.3.12 When not to use nearest

| Situation | Why It's Bad |
| --------- | ------------ |
| Smooth/continuous data | It can create sudden jumps or step artifacts in graphs or models |
| Scientific computation | Inaccurate interpolations can skew results; better to use 'linear' or 'cubic' |
| Predictive modeling | Introducing artificial flat zones can mislead machine learning models |

2.3.13 Let's look at more examples where the 'barycentric' and 'polynomial'
interpolation methods can be used in real life
Both ‘barycentric’ and ‘polynomial’ interpolation methods are based on polynomial interpolation,
where a single polynomial is fitted through all the given data points.

What is Polynomial Interpolation? Polynomial interpolation fits a single polynomial curve


through all your data points. For n data points, the interpolating polynomial will be of degree
n-1. - polynomial : Uses lagrange polynomial interpolation - barycentric : A numerically stable
variant of Lagrange interpolation (faster and more stable)

Important notes:
• Very flexible — can pass exactly through known points.
• But risky: overfitting, especially when data is noisy or there are many points (Runge's
phenomenon)
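To see Runge's phenomenon concretely, here is a small standalone sketch (not a cell from this notebook; the function and sample points are the classic textbook example, not the pH data):

import numpy as np

# Runge's function sampled at 11 equally spaced points in [-1, 1]
x = np.linspace(-1, 1, 11)
y = 1.0 / (1.0 + 25.0 * x**2)

coeffs_runge = np.polyfit(x, y, deg=10)    # degree-10 interpolating polynomial (may warn about conditioning)
x_fine = np.linspace(-1, 1, 201)
y_fit = np.polyval(coeffs_runge, x_fine)

# The true function stays between 0 and 1, but near the interval edges the
# fitted polynomial oscillates well outside the range of the data.
print(y_fit.min(), y_fit.max())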

Real-World Example 1: Calibrating a Scientific Instrument. You have a limited number
of calibration points for a scientific instrument (example: converting voltage readings to a pH level).
You can assume a perfect mathematical relationship between voltage and pH (no noise), so higher-
order interpolation might be acceptable.

[78]: voltage = np.array([0.1, 0.2, 0.4, 0.5, 0.7])
pH = np.array([1.0, 2.0, 3.8, 4.9, 7.2])

df = pd.DataFrame({
    "voltage":voltage, "pH":pH
})
df

[78]: voltage pH
0 0.1 1.0
1 0.2 2.0
2 0.4 3.8
3 0.5 4.9
4 0.7 7.2

[79]: df = df.set_index("voltage")
df

[79]: pH
voltage
0.1 1.0
0.2 2.0
0.4 3.8
0.5 4.9
0.7 7.2

[80]: # Interpolation at missing voltage points
new_index = np.linspace(0.1, 0.7, 100)
new_index

[80]: array([0.1 , 0.10606061, 0.11212121, 0.11818182, 0.12424242,


0.13030303, 0.13636364, 0.14242424, 0.14848485, 0.15454545,
0.16060606, 0.16666667, 0.17272727, 0.17878788, 0.18484848,
0.19090909, 0.1969697 , 0.2030303 , 0.20909091, 0.21515152,
0.22121212, 0.22727273, 0.23333333, 0.23939394, 0.24545455,
0.25151515, 0.25757576, 0.26363636, 0.26969697, 0.27575758,
0.28181818, 0.28787879, 0.29393939, 0.3 , 0.30606061,
0.31212121, 0.31818182, 0.32424242, 0.33030303, 0.33636364,
0.34242424, 0.34848485, 0.35454545, 0.36060606, 0.36666667,
0.37272727, 0.37878788, 0.38484848, 0.39090909, 0.3969697 ,
0.4030303 , 0.40909091, 0.41515152, 0.42121212, 0.42727273,
0.43333333, 0.43939394, 0.44545455, 0.45151515, 0.45757576,
0.46363636, 0.46969697, 0.47575758, 0.48181818, 0.48787879,
0.49393939, 0.5 , 0.50606061, 0.51212121, 0.51818182,
0.52424242, 0.53030303, 0.53636364, 0.54242424, 0.54848485,
0.55454545, 0.56060606, 0.56666667, 0.57272727, 0.57878788,
0.58484848, 0.59090909, 0.5969697 , 0.6030303 , 0.60909091,
0.61515152, 0.62121212, 0.62727273, 0.63333333, 0.63939394,
0.64545455, 0.65151515, 0.65757576, 0.66363636, 0.66969697,
0.67575758, 0.68181818, 0.68787879, 0.69393939, 0.7 ])

Here we are using np.linspace to generate data for our dataFrame:
• The data being generated here is the voltage data points
• The data is being generated between 0.1 and 0.7
• The number of data points being generated here is 100
• syntax :
– np.linspace(lower_limit, upper_limit, number of data points to generate)

Reasons for doing this : The original data has only 5 points → we want to interpolate pH at
intermediate voltages (like 0.11, 0.13, 0.33, etc.) for a smooth curve.

[81]: # re-indexing the dataFrame with the interpolated voltages in the new_index variable

interpolated_index = df.reindex(new_index)
interpolated_index

[81]: pH
voltage
0.100000 1.0
0.106061 NaN
0.112121 NaN
0.118182 NaN
0.124242 NaN

… …
0.675758 NaN
0.681818 NaN
0.687879 NaN
0.693939 NaN
0.700000 7.2

[100 rows x 1 columns]

Now that we have the desired dataFrame on which we can perform polynomial interpolation
for the pH values, let's move forward with that
[82]: interpolated_index["pH"] = interpolated_index["pH"].
↪interpolate(method="polynomial", order=4)

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[82], line 1
----> 1 interpolated_index["pH"] =␣
↪interpolated_index["pH"].interpolate(method="polynomial", order=4)

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/
↪[Link], in [Link](self, method, axis, limit, inplace,␣

↪limit_direction, limit_area, downcast, **kwargs)

8497 else:
8498 index = missing.get_interp_index(method, [Link])
-> 8499 new_data = obj._mgr.interpolate(
8500 method=method,
8501 index=index,
8502 limit=limit,
8503 limit_direction=limit_direction,
8504 limit_area=limit_area,
8505 inplace=inplace,
8506 downcast=downcast,
8507 **kwargs,
8508 )
8510 result = self._constructor_from_mgr(new_data, axes=new_data.axes)
8511 if should_transpose:

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/
↪internals/[Link], in [Link](self, inplace, **kwargs)

290 def interpolate(self, inplace: bool, **kwargs) -> Self:


--> 291 return self.apply_with_block(
292 "interpolate",
293 inplace=inplace,
294 **kwargs,
295 using_cow=using_copy_on_write(),
296 already_warned=_AlreadyWarned(),

297 )

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/
↪internals/[Link], in [Link](self, f, align_keys,␣

↪**kwargs)

361 applied = [Link](f, **kwargs)


362 else:
--> 363 applied = getattr(b, f)(**kwargs)
364 result_blocks = extend_blocks(applied, result_blocks)
366 out = type(self).from_blocks(result_blocks, [Link])

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/
↪internals/[Link], in [Link](self, method, index, inplace,␣
↪limit, limit_direction, limit_area, downcast, using_cow, already_warned,␣

↪**kwargs)

1794 copy, refs = self._get_refs_and_copy(using_cow, inplace)


1796 # Dispatch to the EA method.
-> 1797 new_values = self.array_values.interpolate(
1798 method=method,
1799 axis=[Link] - 1,
1800 index=index,
1801 limit=limit,
1802 limit_direction=limit_direction,
1803 limit_area=limit_area,
1804 copy=copy,
1805 **kwargs,
1806 )
1807 data = extract_array(new_values, extract_numpy=True)
1809 if (
1810 not copy
1811 and warn_copy_on_write()
1812 and already_warned is not None
1813 and not already_warned.warned_already
1814 ):

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/
↪arrays/numpy_.py:296, in [Link](self, method, axis,␣

↪index, limit, limit_direction, limit_area, copy, **kwargs)

293 out_data = self._ndarray.copy()


295 # TODO: assert we have floating dtype?
--> 296 missing.interpolate_2d_inplace(
297 out_data,
298 method=method,
299 axis=axis,
300 index=index,
301 limit=limit,
302 limit_direction=limit_direction,
303 limit_area=limit_area,
304 **kwargs,

305 )
306 if not copy:
307 return self

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/
↪[Link], in interpolate_2d_inplace(data, index, axis, method, limit,␣

↪limit_direction, limit_area, fill_value, mask, **kwargs)

391 _interpolate_1d(
392 indices=indices,
393 yvalues=yvalues,
(…)
401 **kwargs,
402 )
404 # error: Argument 1 to "apply_along_axis" has incompatible type
405 # "Callable[[ndarray[Any, Any]], None]"; expected "Callable[…,
406 # Union[_SupportsArray[dtype[<nothing>]], Sequence[_SupportsArray
407 # [dtype[<nothing>]]],␣
↪Sequence[Sequence[_SupportsArray[dtype[<nothing>]]]],

408 # Sequence[Sequence[Sequence[_SupportsArray[dtype[<nothing>]]]]],
409 #␣
↪Sequence[Sequence[Sequence[Sequence[_SupportsArray[dtype[<nothing>]]]]]]]]"

--> 410 np.apply_along_axis(func, axis, data)

File <__array_function__ internals>:180, in apply_along_axis(*args, **kwargs)

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/numpy/lib/
↪shape_base.py:379, in apply_along_axis(func1d, axis, arr, *args, **kwargs)

375 except StopIteration as e:


376 raise ValueError(
377 'Cannot apply_along_axis when any iteration dimensions are 0'
378 ) from None
--> 379 res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
381 # build a buffer for storing evaluations of func1d.
382 # remove the requested axis, and add the new ones on the end.
383 # laid out so that each write is contiguous.
384 # for a tuple index inds, buff[inds] = func1d(inarr_view[inds])
385 buff = zeros(inarr_view.shape[:-1] + [Link], [Link])

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/
↪[Link], in interpolate_2d_inplace.<locals>.func(yvalues)

388 def func(yvalues: [Link]) -> None:


389 # process 1-d slices in the axis direction
--> 391 _interpolate_1d(
392 indices=indices,
393 yvalues=yvalues,
394 method=method,
395 limit=limit,
396 limit_direction=limit_direction,

397 limit_area=limit_area_validated,
398 fill_value=fill_value,
399 bounds_error=False,
400 mask=mask,
401 **kwargs,
402 )

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/
↪[Link], in _interpolate_1d(indices, yvalues, method, limit,␣

↪limit_direction, limit_area, fill_value, bounds_error, order, mask, **kwargs)

524 yvalues[invalid] = [Link](


525 indices[invalid], indices[valid][indexer],␣
↪yvalues[valid][indexer]

526 )
527 else:
--> 528 yvalues[invalid] = _interpolate_scipy_wrapper(
529 indices[valid],
530 yvalues[valid],
531 indices[invalid],
532 method=method,
533 fill_value=fill_value,
534 bounds_error=bounds_error,
535 order=order,
536 **kwargs,
537 )
539 if mask is not None:
540 mask[:] = False

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/
↪[Link], in _interpolate_scipy_wrapper(x, y, new_x, method, fill_value,␣

↪bounds_error, order, **kwargs)

592 else:
593 kind = method
--> 594 terp = interpolate.interp1d(
595 x, y, kind=kind, fill_value=fill_value, bounds_error=bounds_error
596 )
597 new_y = terp(new_x)
598 elif method == "spline":
599 # GH #10633, #24014

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/scipy/interpolate/
↪_interpolate.py:397, in interp1d.__init__(***failed resolving arguments***)

394 yy = np.ones_like(self._y)
395 rewrite_nan = True
--> 397 self._spline = make_interp_spline(xx, yy, k=order,
398 check_finite=False)
399 if rewrite_nan:
400 self._call = self.__class__._call_nan_spline

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/scipy/interpolate/
↪_bsplines.py:1596, in make_interp_spline(x, y, k, t, bc_type, axis,␣

↪check_finite)

1593 nt = [Link] - k - 1
1595 if nt - n != nleft + nright:
-> 1596 raise ValueError("The number of derivatives at boundaries does not "
1597 f"match: expected {nt-n}, got {nleft}+{nright}")
1599 # bail out if the `y` array is zero-sized
1600 if [Link] == 0:

ValueError: The number of derivatives at boundaries does not match: expected 2,␣
↪got 0+0

2.3.14 Explanation :
This happens because interpolate(method="polynomial", order=4) needs enough known values
around each missing value to fit the polynomial smoothly, and almost no known values survive
the reindex.

2.3.15 Why my method fails :

• df.reindex(new_index) keeps only the labels that sit exactly on the new grid, so of the five
original voltages only 0.1 and 0.7 survive; every other row becomes NaN.
• interpolate(method="polynomial", order=4) then has to build an order-4 fit from just two
known points, which is not enough, and SciPy fails with the boundary-derivatives error above.
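A possible workaround (a sketch only, not executed in this notebook) is to keep the original five voltage points in the index so the interpolation always has enough known values, and to use the cubic method, which only needs four known points; the grid is then re-selected afterwards:

# keep the original known voltages alongside the new grid
combined_index = df.index.union(pd.Index(new_index))
workaround = df.reindex(combined_index)
workaround["pH"] = workaround["pH"].interpolate(method="cubic")
workaround = workaround.reindex(new_index)   # back to the 100-point grid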

Tried to solve by setting limit_direction to “both”


[83]: interpolated_index["plynomial_interpolated_pH"] = interpolated_index.
↪interpolate(method="polynomial",order=4, limit_direction="both")

interpolated_index

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[83], line 1
----> 1 interpolated_index["plynomial_interpolated_pH"] =␣
↪interpolated_index.interpolate(method="polynomial",order=4, limit_direction="both")

2 interpolated_index

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/
↪[Link], in [Link](self, method, axis, limit, inplace,␣

↪limit_direction, limit_area, downcast, **kwargs)

8497 else:
8498 index = missing.get_interp_index(method, [Link])
-> 8499 new_data = obj._mgr.interpolate(
8500 method=method,
8501 index=index,

8502 limit=limit,
8503 limit_direction=limit_direction,
8504 limit_area=limit_area,
8505 inplace=inplace,
8506 downcast=downcast,
8507 **kwargs,
8508 )
8510 result = self._constructor_from_mgr(new_data, axes=new_data.axes)
8511 if should_transpose:

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/
↪internals/[Link], in [Link](self, inplace, **kwargs)

290 def interpolate(self, inplace: bool, **kwargs) -> Self:


--> 291 return self.apply_with_block(
292 "interpolate",
293 inplace=inplace,
294 **kwargs,
295 using_cow=using_copy_on_write(),
296 already_warned=_AlreadyWarned(),
297 )

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/
↪internals/[Link], in [Link](self, f, align_keys,␣

↪**kwargs)

361 applied = [Link](f, **kwargs)


362 else:
--> 363 applied = getattr(b, f)(**kwargs)
364 result_blocks = extend_blocks(applied, result_blocks)
366 out = type(self).from_blocks(result_blocks, [Link])

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/
↪internals/[Link], in [Link](self, method, index, inplace,␣
↪limit, limit_direction, limit_area, downcast, using_cow, already_warned,␣

↪**kwargs)

1794 copy, refs = self._get_refs_and_copy(using_cow, inplace)


1796 # Dispatch to the EA method.
-> 1797 new_values = self.array_values.interpolate(
1798 method=method,
1799 axis=[Link] - 1,
1800 index=index,
1801 limit=limit,
1802 limit_direction=limit_direction,
1803 limit_area=limit_area,
1804 copy=copy,
1805 **kwargs,
1806 )
1807 data = extract_array(new_values, extract_numpy=True)
1809 if (

1810 not copy
1811 and warn_copy_on_write()
1812 and already_warned is not None
1813 and not already_warned.warned_already
1814 ):

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/
↪arrays/numpy_.py:296, in [Link](self, method, axis,␣

↪index, limit, limit_direction, limit_area, copy, **kwargs)

293 out_data = self._ndarray.copy()


295 # TODO: assert we have floating dtype?
--> 296 missing.interpolate_2d_inplace(
297 out_data,
298 method=method,
299 axis=axis,
300 index=index,
301 limit=limit,
302 limit_direction=limit_direction,
303 limit_area=limit_area,
304 **kwargs,
305 )
306 if not copy:
307 return self

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/
↪[Link], in interpolate_2d_inplace(data, index, axis, method, limit,␣

↪limit_direction, limit_area, fill_value, mask, **kwargs)

391 _interpolate_1d(
392 indices=indices,
393 yvalues=yvalues,
(…)
401 **kwargs,
402 )
404 # error: Argument 1 to "apply_along_axis" has incompatible type
405 # "Callable[[ndarray[Any, Any]], None]"; expected "Callable[…,
406 # Union[_SupportsArray[dtype[<nothing>]], Sequence[_SupportsArray
407 # [dtype[<nothing>]]],␣
↪Sequence[Sequence[_SupportsArray[dtype[<nothing>]]]],

408 # Sequence[Sequence[Sequence[_SupportsArray[dtype[<nothing>]]]]],
409 #␣
↪Sequence[Sequence[Sequence[Sequence[_SupportsArray[dtype[<nothing>]]]]]]]]"

--> 410 np.apply_along_axis(func, axis, data)

File <__array_function__ internals>:180, in apply_along_axis(*args, **kwargs)

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/numpy/lib/
↪shape_base.py:379, in apply_along_axis(func1d, axis, arr, *args, **kwargs)

375 except StopIteration as e:

376 raise ValueError(
377 'Cannot apply_along_axis when any iteration dimensions are 0'
378 ) from None
--> 379 res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
381 # build a buffer for storing evaluations of func1d.
382 # remove the requested axis, and add the new ones on the end.
383 # laid out so that each write is contiguous.
384 # for a tuple index inds, buff[inds] = func1d(inarr_view[inds])
385 buff = zeros(inarr_view.shape[:-1] + [Link], [Link])

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/
↪[Link], in interpolate_2d_inplace.<locals>.func(yvalues)

388 def func(yvalues: [Link]) -> None:


389 # process 1-d slices in the axis direction
--> 391 _interpolate_1d(
392 indices=indices,
393 yvalues=yvalues,
394 method=method,
395 limit=limit,
396 limit_direction=limit_direction,
397 limit_area=limit_area_validated,
398 fill_value=fill_value,
399 bounds_error=False,
400 mask=mask,
401 **kwargs,
402 )

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/
↪[Link], in _interpolate_1d(indices, yvalues, method, limit,␣

↪limit_direction, limit_area, fill_value, bounds_error, order, mask, **kwargs)

524 yvalues[invalid] = [Link](


525 indices[invalid], indices[valid][indexer],␣
↪yvalues[valid][indexer]

526 )
527 else:
--> 528 yvalues[invalid] = _interpolate_scipy_wrapper(
529 indices[valid],
530 yvalues[valid],
531 indices[invalid],
532 method=method,
533 fill_value=fill_value,
534 bounds_error=bounds_error,
535 order=order,
536 **kwargs,
537 )
539 if mask is not None:
540 mask[:] = False

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/
↪[Link], in _interpolate_scipy_wrapper(x, y, new_x, method, fill_value,␣

↪bounds_error, order, **kwargs)

592 else:
593 kind = method
--> 594 terp = interpolate.interp1d(
595 x, y, kind=kind, fill_value=fill_value, bounds_error=bounds_error
596 )
597 new_y = terp(new_x)
598 elif method == "spline":
599 # GH #10633, #24014

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/scipy/interpolate/
↪_interpolate.py:397, in interp1d.__init__(***failed resolving arguments***)

394 yy = np.ones_like(self._y)
395 rewrite_nan = True
--> 397 self._spline = make_interp_spline(xx, yy, k=order,
398 check_finite=False)
399 if rewrite_nan:
400 self._call = self.__class__._call_nan_spline

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/scipy/interpolate/
↪_bsplines.py:1596, in make_interp_spline(x, y, k, t, bc_type, axis,␣

↪check_finite)

1593 nt = [Link] - k - 1
1595 if nt - n != nleft + nright:
-> 1596 raise ValueError("The number of derivatives at boundaries does not "
1597 f"match: expected {nt-n}, got {nleft}+{nright}")
1599 # bail out if the `y` array is zero-sized
1600 if [Link] == 0:

ValueError: The number of derivatives at boundaries does not match: expected 2,␣
↪got 0+0

Ok, so let's try another way of implementing polynomial interpolation, using

np.polyfit
[84]: voltage = np.array([0.1, 0.2, 0.4, 0.5, 0.7])
pH = np.array([1.0, 2.0, 3.8, 4.9, 7.2])
coeffs = np.polyfit(voltage, pH, deg=4)
coeffs

[84]: array([-58.33333333, 95. , -49.41666667, 19.05 ,


-0.5 ])

coeffs = np.polyfit(voltage, pH, deg=4) Explaining this : This line calculates the coefficients
of a polynomial that best fits the data points (voltage, pH) using polynomial interpolation.

What is np.polyfit()?
• np.polyfit(x, y, deg) fits a polynomial of degree deg to the data (x, y).
• It returns the coefficients of the best-fit polynomial (from the highest degree to the constant term).

Why use a 4th-degree polynomial for 5 points? When using np.polyfit(voltage, pH,
deg=4), you are asking for a polynomial of degree 4 (i.e., the highest term is x^4) that fits your 5 data
points. - Because : - A polynomial of degree n can exactly fit n + 1 data points (if they are unique
and well-behaved). - You have 5 data points, so you can fit a polynomial of degree 4 that passes exactly
through all those points. This is called polynomial interpolation. A 4th-degree polynomial,
y = a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0, has 5 coefficients, from a0 to a4. To
solve for those 5 unknowns you need 5 equations, which you get from the 5 data points you have. So
this choice guarantees an exact fit: the curve will go exactly through each data point.
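A quick illustrative check of that claim (standalone sketch using the coeffs and voltage arrays defined above):

# The degree-4 polynomial reproduces the five calibration points (up to
# floating-point rounding), since 5 coefficients are pinned down by 5 equations.
print(np.polyval(coeffs, voltage))  # approximately [1.0, 2.0, 3.8, 4.9, 7.2]
print(pH)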

Why not a higher-degree polynomial? Using a polynomial of degree 5 or more with 5 points
leads to overfitting, or the system becomes underdetermined (more unknowns than equations).

Generate 100 voltage values between 0.1 and 0.7 using numpy


[85]: interpolated_voltage = np.linspace(0.1, 0.7, 100)
interpolated_voltage

[85]: array([0.1 , 0.10606061, 0.11212121, 0.11818182, 0.12424242,


0.13030303, 0.13636364, 0.14242424, 0.14848485, 0.15454545,
0.16060606, 0.16666667, 0.17272727, 0.17878788, 0.18484848,
0.19090909, 0.1969697 , 0.2030303 , 0.20909091, 0.21515152,
0.22121212, 0.22727273, 0.23333333, 0.23939394, 0.24545455,
0.25151515, 0.25757576, 0.26363636, 0.26969697, 0.27575758,
0.28181818, 0.28787879, 0.29393939, 0.3 , 0.30606061,
0.31212121, 0.31818182, 0.32424242, 0.33030303, 0.33636364,
0.34242424, 0.34848485, 0.35454545, 0.36060606, 0.36666667,
0.37272727, 0.37878788, 0.38484848, 0.39090909, 0.3969697 ,
0.4030303 , 0.40909091, 0.41515152, 0.42121212, 0.42727273,
0.43333333, 0.43939394, 0.44545455, 0.45151515, 0.45757576,
0.46363636, 0.46969697, 0.47575758, 0.48181818, 0.48787879,
0.49393939, 0.5 , 0.50606061, 0.51212121, 0.51818182,
0.52424242, 0.53030303, 0.53636364, 0.54242424, 0.54848485,
0.55454545, 0.56060606, 0.56666667, 0.57272727, 0.57878788,
0.58484848, 0.59090909, 0.5969697 , 0.6030303 , 0.60909091,
0.61515152, 0.62121212, 0.62727273, 0.63333333, 0.63939394,
0.64545455, 0.65151515, 0.65757576, 0.66363636, 0.66969697,
0.67575758, 0.68181818, 0.68787879, 0.69393939, 0.7 ])

[86]: interpolated_ph = np.polyval(coeffs, interpolated_voltage)
      interpolated_ph

[86]: array([1. , 1.07053317, 1.13936727, 1.20659518, 1.27230791,


1.33659457, 1.39954238, 1.46123668, 1.52176092, 1.58119664,
1.63962352, 1.69711934, 1.75375999, 1.80961947, 1.86476988,

1.91928147, 1.97322256, 2.02665959, 2.07965713, 2.13227784,
2.1845825 , 2.23663001, 2.28847737, 2.34017968, 2.39179018,
2.4433602 , 2.49493918, 2.54657469, 2.59831239, 2.65019606,
2.7022676 , 2.75456701, 2.8071324 , 2.86 , 2.91320414,
2.96677727, 3.02074995, 3.07515085, 3.13000674, 3.18534253,
3.24118121, 3.29754391, 3.35444983, 3.41191633, 3.46995885,
3.52859094, 3.58782429, 3.64766866, 3.70813196, 3.76922018,
3.83093745, 3.89328598, 3.95626611, 4.0198763 , 4.08411311,
4.14897119, 4.21444335, 4.28052046, 4.34719153, 4.41444368,
4.48226214, 4.55063024, 4.61952943, 4.68893928, 4.75883745,
4.82919972, 4.9 , 4.97121028, 5.04280068, 5.11473943,
5.18699287, 5.25952544, 5.33229971, 5.40527634, 5.47841413,
5.55166997, 5.62499885, 5.69835391, 5.77168636, 5.84494555,
5.91807892, 5.99103203, 6.06374857, 6.13617031, 6.20823714,
6.27988708, 6.35105624, 6.42167885, 6.49168724, 6.56101188,
6.62958131, 6.69732222, 6.76415939, 6.83001571, 6.89481219,
6.95846795, 7.02090021, 7.08202432, 7.14175373, 7.2 ])

[87]: df = pd.DataFrame({
          "interpolated_voltage": interpolated_voltage,
          "interpolated_ph": interpolated_ph
      })
      df

[87]: interpolated_voltage interpolated_ph


0 0.100000 1.000000
1 0.106061 1.070533
2 0.112121 1.139367
3 0.118182 1.206595
4 0.124242 1.272308
.. … …
95 0.675758 6.958468
96 0.681818 7.020900
97 0.687879 7.082024
98 0.693939 7.141754
99 0.700000 7.200000

[100 rows x 2 columns]

[88]: # plot the graph
      plt.plot(voltage, pH, 'o', label='Original Data')
      plt.plot(interpolated_voltage, interpolated_ph, '-', label='4th-degree Polynomial Interpolation')
      plt.xlabel("Voltage")
      plt.ylabel("pH")
      plt.legend()
      plt.grid(True)
      plt.title("Polynomial Interpolation (order 4)")
      plt.show()

Why assuming a relationship between pH and voltage is reasonable here, in idealized systems
In electrochemistry, the relationship between voltage (measured by a pH electrode) and pH is described
by the Nernst equation. For a typical glass pH electrode, this is simplified as E = E0 + (0.0591
x pH), or equivalently pH = (E - E0)/0.0591, where: - E = measured potential (voltage) in volts
- E0 = standard electrode potential (reference) - 0.0591 = voltage change per unit pH at 25 degrees
Celsius (room temperature). This equation implies a linear and predictable relationship between pH
and voltage under ideal conditions.
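As a small worked example of the idealized relation above, the sketch below converts a measured potential to pH using pH = (E - E0)/0.0591; the reference value E0 = 0.0 V is an assumption chosen purely for illustration:

def ph_from_voltage(E, E0=0.0, slope=0.0591):
    # idealized Nernst-style conversion; E and E0 in volts, slope in volts per pH unit
    return (E - E0) / slope

print(ph_from_voltage(0.1))   # ~1.69
print(ph_from_voltage(0.4))   # ~6.77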

When it's reasonable to assume a perfect relationship

• In calibration or simulation:
  – You're creating synthetic data (for example, for a data science or machine learning exercise) and want to model sensor behavior.
• In controlled lab settings:
  – Temperature, pressure, ionic strength and interfering substances are controlled, making the Nernst equation closely accurate.
• During sensor design/testing:
  – To benchmark ideal sensor behavior before accounting for real-world drift, hysteresis, or contamination.

When not to assume a perfect relationship

• In real-world applications:
  – Sensors degrade or get dirty.
  – Temperature affects readings (the Nernst slope changes).
  – Non-ideal interactions or impurities exist in the solution.
• If doing high-accuracy or regulatory work, a real-world calibration curve or correction algorithm is necessary.

Summary:

| Method | Description | When to Use | When to Avoid |
| --- | --- | --- | --- |
| 'polynomial' / 'barycentric' | Fits a single polynomial to all data points | Low-noise, few points, scientific modeling | High noise, large datasets, real-time |
| 'barycentric' | A stable version of polynomial interpolation | Use over 'polynomial' if available | Still suffers from the same core limitations |

2.3.16 Some points to remember:

• Use np.polyfit/np.polyval (or SciPy's barycentric interpolator) for global interpolation.
• Don't use reindex() and then interpolate with method="polynomial": it creates boundary NaNs.
• Polynomial interpolation is powerful but unstable for high degrees or extrapolation.
• For safe use, only interpolate within your range (see the sketch below).
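A minimal sketch of that last point, using an assumed toy Series: limit_area="inside" restricts filling to NaNs that sit between valid observations, so nothing outside the original range gets extrapolated.

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 4.0, 9.0, np.nan])

# only the interior NaN (position 2) is filled; the leading and trailing NaNs
# are left alone because limit_area="inside" forbids filling outside the
# observed range
print(s.interpolate(method="linear", limit_area="inside"))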

2.3.17 Let's look at three more examples where the 'pchip' interpolation method can be
used in real life
PCHIP stands for Piecewise Cubic Hermite Interpolating Polynomial. Unlike regular cubic spline interpolation
(which might overshoot or oscillate), PCHIP is: - Monotonic: preserves the increasing nature of
the data. - Shape-preserving: avoids sharp turns and unnatural wiggles. It's not smooth in second
derivatives, but it gives a more realistic curve for non-linear or step-like data.

What does "not smooth in second derivatives" mean? In calculus, the first derivative of
a curve gives the slope (i.e., how steep it is) and the second derivative tells you how the slope is
changing, i.e., the curvature. A curve is smooth in the second derivative if it has continuous curvature,
without any sudden changes in how it bends.

PCHIP (Piecewise Cubic Hermite Interpolating Polynomial):

• Ensures first-derivative continuity, so the curve doesn't have sharp turns.
• But the second derivative is not smooth across the whole curve; this means the rate of bending can suddenly change at segment boundaries.
• However, this is intentional: it avoids overshooting or introducing weird oscillations (as in spline or high-degree polynomial interpolation).

Why is this important? Because real-world data is not always smooth. If you enforce smooth
curvature (like a cubic spline), it may introduce unrealistic wiggles, especially in non-linear or step-
like data (example: sudden jumps or sharp transitions). PCHIP trades off a bit of mathematical
elegance (a smooth second derivative) for practical realism. A small sketch of this trade-off follows.
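Here is that sketch, with assumed toy data: the first derivative of a PCHIP fit is continuous, but evaluating the second derivative just left and right of an interior knot shows the jump described above.

import numpy as np
from scipy.interpolate import PchipInterpolator

x_pts = np.array([0.0, 1.0, 2.0, 3.0])
y_pts = np.array([0.0, 1.0, 1.1, 3.0])

pchip = PchipInterpolator(x_pts, y_pts)
d2 = pchip.derivative(2)   # second derivative of the piecewise cubic

eps = 1e-6
for knot in x_pts[1:-1]:
    left, right = float(d2(knot - eps)), float(d2(knot + eps))
    print(f"knot {knot}: second derivative left = {left:.3f}, right = {right:.3f}")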

How do unrealistic wiggles get introduced? When you enforce smooth curvature, as in
cubic spline interpolation, the algorithm tries to make the curve as smooth as
possible, especially in terms of the second derivative (curvature). This smoothness constraint often
causes unrealistic oscillations or "wiggles", especially when your data has: - Sudden jumps - Sharp
transitions - Flat regions followed by rapid changes

How a wiggle occurs (in simple terms): Imagine you are connecting a series of pegs with a
flexible ruler (representing a cubic spline). You are trying to bend it smoothly through all the pegs,
but if some pegs are suddenly higher or lower than the others, the ruler will curve sharply before or
after that point. This overshooting creates wiggles: the curve temporarily dips or peaks between
the actual data points. - Mathematical insight: - The function is continuous. - The first and second
derivatives are continuous. - But the spline doesn't know that your data might be non-smooth in reality, so it
creates an unrealistically smooth curve that doesn't respect real transitions.

Real world example Suppose you are logging the power consumption (in watts) of a smart
device every minute. The readings could look like this:

| Time | Power Usage | |
| --- | --- | --- |
| 10:00 AM | 10 W | |
| 10:01 AM | 12 W | |
| 10:02 AM | 13 W | |
| 10:03 AM | 300 W | ← Device turned ON suddenly |
| 10:04 AM | 305 W | |
| 10:05 AM | 12 W | ← Device turned OFF suddenly |

If you use cubic interpolation here to smooth between these points, it might create curves that dip below or rise above the realistic values before or after the sharp jump at 10:03 AM, even dipping below 10 W or overshooting above the actual peak, which never actually happened.

Let's implement it and see for ourselves what the difference is between PCHIP and
cubic interpolation of the data
[89]: data = {
          "time" : ['10:00', '10:01', '10:02', '10:03', '10:04', '10:05'],
          "power" : [ 10, 12, 15, 300, 302, 305]
      }
      df = pd.DataFrame(data)
      df

[89]: time power


0 10:00 10
1 10:01 12
2 10:02 15
3 10:03 300
4 10:04 302
5 10:05 305

[90]: # Let's check if the time column values are in string datatype or not
      df["time"].apply(lambda x : isinstance(x,str)).all()

[90]: True

Since the result is returning True that means the time column has the values in string
format instead of time
[91]: # converting the data stored in string dtype to time dtype in time column
df["time"] = pd.to_datetime(df["time"],format="%H:%M")
# Now lets check if the dtype has changed from string to datetime in time␣
↪column of the dataFrame

df["time"].apply(lambda x : isinstance(x,str)).all()

[91]: False

[92]: df

[92]: time power


0 1900-01-01 10:00:00     10
1 1900-01-01 10:01:00     12
2 1900-01-01 10:02:00     15
3 1900-01-01 10:03:00    300
4 1900-01-01 10:04:00    302
5 1900-01-01 10:05:00    305

[93]: # Now we have to convert the time into minutes


df["timestamp"] = (df["time"] - df["time"].min()).dt.total_seconds() / 60
df

[93]: time power timestamp


0 1900-01-01 10:00:00     10        0.0
1 1900-01-01 10:01:00     12        1.0
2 1900-01-01 10:02:00     15        2.0
3 1900-01-01 10:03:00    300        3.0
4 1900-01-01 10:04:00    302        4.0
5 1900-01-01 10:05:00    305        5.0

[94]: x = df["timestamp"]
      y = df["power"]
      # create 200 values between the minimum and maximum values present in the timestamp column
      x_dense = np.linspace(x.min(), x.max(), 200)
      x_dense

[94]: array([0. , 0.02512563, 0.05025126, 0.07537688, 0.10050251,


0.12562814, 0.15075377, 0.1758794 , 0.20100503, 0.22613065,
0.25125628, 0.27638191, 0.30150754, 0.32663317, 0.35175879,
0.37688442, 0.40201005, 0.42713568, 0.45226131, 0.47738693,
0.50251256, 0.52763819, 0.55276382, 0.57788945, 0.60301508,
0.6281407 , 0.65326633, 0.67839196, 0.70351759, 0.72864322,

0.75376884, 0.77889447, 0.8040201 , 0.82914573, 0.85427136,
0.87939698, 0.90452261, 0.92964824, 0.95477387, 0.9798995 ,
1.00502513, 1.03015075, 1.05527638, 1.08040201, 1.10552764,
1.13065327, 1.15577889, 1.18090452, 1.20603015, 1.23115578,
1.25628141, 1.28140704, 1.30653266, 1.33165829, 1.35678392,
1.38190955, 1.40703518, 1.4321608 , 1.45728643, 1.48241206,
1.50753769, 1.53266332, 1.55778894, 1.58291457, 1.6080402 ,
1.63316583, 1.65829146, 1.68341709, 1.70854271, 1.73366834,
1.75879397, 1.7839196 , 1.80904523, 1.83417085, 1.85929648,
1.88442211, 1.90954774, 1.93467337, 1.95979899, 1.98492462,
2.01005025, 2.03517588, 2.06030151, 2.08542714, 2.11055276,
2.13567839, 2.16080402, 2.18592965, 2.21105528, 2.2361809 ,
2.26130653, 2.28643216, 2.31155779, 2.33668342, 2.36180905,
2.38693467, 2.4120603 , 2.43718593, 2.46231156, 2.48743719,
2.51256281, 2.53768844, 2.56281407, 2.5879397 , 2.61306533,
2.63819095, 2.66331658, 2.68844221, 2.71356784, 2.73869347,
2.7638191 , 2.78894472, 2.81407035, 2.83919598, 2.86432161,
2.88944724, 2.91457286, 2.93969849, 2.96482412, 2.98994975,
3.01507538, 3.04020101, 3.06532663, 3.09045226, 3.11557789,
3.14070352, 3.16582915, 3.19095477, 3.2160804 , 3.24120603,
3.26633166, 3.29145729, 3.31658291, 3.34170854, 3.36683417,
3.3919598 , 3.41708543, 3.44221106, 3.46733668, 3.49246231,
3.51758794, 3.54271357, 3.5678392 , 3.59296482, 3.61809045,
3.64321608, 3.66834171, 3.69346734, 3.71859296, 3.74371859,
3.76884422, 3.79396985, 3.81909548, 3.84422111, 3.86934673,
3.89447236, 3.91959799, 3.94472362, 3.96984925, 3.99497487,
4.0201005 , 4.04522613, 4.07035176, 4.09547739, 4.12060302,
4.14572864, 4.17085427, 4.1959799 , 4.22110553, 4.24623116,
4.27135678, 4.29648241, 4.32160804, 4.34673367, 4.3718593 ,
4.39698492, 4.42211055, 4.44723618, 4.47236181, 4.49748744,
4.52261307, 4.54773869, 4.57286432, 4.59798995, 4.62311558,
4.64824121, 4.67336683, 4.69849246, 4.72361809, 4.74874372,
4.77386935, 4.79899497, 4.8241206 , 4.84924623, 4.87437186,
4.89949749, 4.92462312, 4.94974874, 4.97487437, 5. ])

[95]: from scipy.interpolate import PchipInterpolator, CubicSpline

      # create a pchip and cubic interpolator
      pchip = PchipInterpolator(x,y)
      cubic = CubicSpline(x,y)

[96]: # use pchip and cubic interpolator to get the y (power) values using pchip and␣
↪cubic interpolation method respectively

y_pchip = pchip(x_dense)
y_cubic = cubic(x_dense)

[97]: # plot the graph showing the difference between the pchip and cubic interpolation methods
      plt.figure(figsize=(8,4))
      plt.plot(x, y, marker="o", label="Original Data", color="black")
      plt.plot(x_dense, y_pchip, linestyle="-", label="PCHIP (Monotonic) interpolated data", linewidth=2)
      plt.plot(x_dense, y_cubic, linestyle="--", label="Cubic interpolated data", linewidth=2)
      plt.title("Difference between PCHIP and cubic interpolation method")
      plt.xlabel("Time (minutes)")
      plt.ylabel("Power (watts)")
      plt.legend()
      plt.grid(True)
      plt.tight_layout()
      plt.show()

As you can see, if you focus on the orange dashed line (the cubic interpolation), this is what it means
to introduce a wiggle when using the cubic interpolation method on data points that show
sudden jumps or sharp transitions.

When to Use PCHIP (Real-World Examples)


• Sensor Data in Industrial Systems :
– Suppose a temperature sensor outputs data from a furnace.
– The temperature rises sharply when the heater turns on.
– Cubic spline interpolation might overshoot in between, suggesting the temperature
dipped or spiked — which never happened.
– PCHIP preserves the monotonic (always increasing) nature and avoids creating those
false dips/peaks.

• Growth Curves for Children :
– A child’s height is monotonically increasing (it never decreases).
– Using PCHIP, we ensure that interpolated height values never decrease between checkups
— and we don’t introduce odd wiggles in the curve.

When Not to Use PCHIP

• Smooth Physical Systems with Continuous Curvature:
  – Example: modeling the motion of a pendulum.
  – Its position and velocity change smoothly; the second derivative matters a lot here, so the curve must be smooth.
  – Using PCHIP here could make the curve less physically realistic, as it doesn't enforce smooth acceleration or deceleration.
• When you need derivatives for further computation:
  – For example, in computational physics or fluid dynamics, you often need a smooth second derivative for solving equations.
  – PCHIP's piecewise nature and second-derivative discontinuities could lead to numerical instability.

Summary:

| Feature | PCHIP |
| --- | --- |
| Preserves Shape | Yes |
| Monotonicity | Yes |
| Smooth First Derivative | Yes |
| Smooth Second Derivative | No |
| Good for Step-like Data | Yes |
| Prone to Overfitting | No |
| Risk of Overshoot/Wiggle | Low |
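A minimal sketch of the overshoot row in this table, reusing the step-like power data from this section: PCHIP stays inside the range of the samples, while a cubic spline typically dips below and/or shoots above it around the jump.

import numpy as np
from scipy.interpolate import PchipInterpolator, CubicSpline

minutes = np.array([0, 1, 2, 3, 4, 5], dtype=float)
watts = np.array([10, 12, 15, 300, 302, 305], dtype=float)

dense = np.linspace(minutes.min(), minutes.max(), 500)
for name, interp in [("pchip", PchipInterpolator(minutes, watts)),
                     ("cubic", CubicSpline(minutes, watts))]:
    vals = interp(dense)
    print(name, "min:", round(float(vals.min()), 2), "max:", round(float(vals.max()), 2))
# pchip stays within [10, 305]; the cubic spline usually reports values outside
# that range near the jump (the overshoot/wiggle risk noted in the table)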

2.3.18 Let's look at three more examples where the 'akima' interpolation method can be
used in real life
Akima spline interpolation is a type of piecewise cubic interpolation that: - Uses information from
neighboring slopes to prevent overshooting or oscillations (wiggles). - Produces smooth and visually
realistic curves. - Is not globally smooth in second derivatives (i.e., it's not as smooth as a cubic
spline), but it handles sharp changes in the data much better.

When to use Akima interpolation Use Akima interpolation when: - Your data has sharp
transitions or discontinuities. - You want to avoid the oscillations that come from cubic spline interpolation.
- You want a balance between smoothness and local shape preservation.

When not to use Akima interpolation Avoid Akima interpolation when: - You need very
smooth second derivatives (example: for physical simulations involving acceleration or curvature).
- Your data is very clean and smooth, and a simple cubic spline would suffice. - You are working
with very large datasets where performance is critical; Akima is more computationally intensive
than linear interpolation.

3 Real-World Examples of When to Use Akima
1. Stock Price charting (with sudden jumps due to news)
   • Problem: You want to interpolate between stock prices, but there's a jump due to an announcement.
   • Why Akima? It avoids overshooting and artifacts around jumps.
2. Sensor Data with outliers or sudden changes
   • Example: water pressure spikes in industrial sensors.
   • Why Akima? It avoids wiggles and provides stable interpolation without overfitting to outliers.
3. Medical Signals (like ECG or Heart Rate)
   • ECG data has peaks (the QRS complex); you don't want spline interpolation to create artifacts around those.
   • Why Akima? It preserves sharp changes and avoids adding "fake" peaks.

2.3.19 Let's see what Akima interpolation looks like with a real-life example
[98]: from scipy.interpolate import Akima1DInterpolator, CubicSpline, PchipInterpolator

      # simulate stock price data (in usd)
      x = np.array([9, 10, 11, 12, 13])        # Time in hours (e.g., 9 AM to 1 PM)
      y = np.array([100, 102, 180, 182, 185])  # Sudden jump between 10 and 11 AM

      # Generate dense time points for smooth plotting
      x_dense = np.linspace(x.min(), x.max(), 300)
      print(f"x_dense ===> {x_dense}")

      # Create interpolators
      akima_interp = Akima1DInterpolator(x,y)
      cubic_interp = CubicSpline(x,y)
      pchip_interp = PchipInterpolator(x,y)

      # compute interpolated values
      y_pchip = pchip_interp(x_dense)
      y_akima = akima_interp(x_dense)
      y_cubic = cubic_interp(x_dense)

      # create a dataFrame
      df = pd.DataFrame({
          "time" : x_dense,
          "stock_prices_interpolated_akima_method" : y_akima,
          "stock_prices_interpolated_cubic" : y_cubic,
          "stock_prices_interpoalted_pchip" : y_pchip
      })
      df

x_dense ===> [ 9. 9.01337793 9.02675585 9.04013378 9.05351171


9.06688963

9.08026756 9.09364548 9.10702341 9.12040134 9.13377926 9.14715719
9.16053512 9.17391304 9.18729097 9.2006689 9.21404682 9.22742475
9.24080268 9.2541806 9.26755853 9.28093645 9.29431438 9.30769231
9.32107023 9.33444816 9.34782609 9.36120401 9.37458194 9.38795987
9.40133779 9.41471572 9.42809365 9.44147157 9.4548495 9.46822742
9.48160535 9.49498328 9.5083612 9.52173913 9.53511706 9.54849498
9.56187291 9.57525084 9.58862876 9.60200669 9.61538462 9.62876254
9.64214047 9.65551839 9.66889632 9.68227425 9.69565217 9.7090301
9.72240803 9.73578595 9.74916388 9.76254181 9.77591973 9.78929766
9.80267559 9.81605351 9.82943144 9.84280936 9.85618729 9.86956522
9.88294314 9.89632107 9.909699 9.92307692 9.93645485 9.94983278
9.9632107 9.97658863 9.98996656 10.00334448 10.01672241 10.03010033
10.04347826 10.05685619 10.07023411 10.08361204 10.09698997 10.11036789
10.12374582 10.13712375 10.15050167 10.1638796 10.17725753 10.19063545
10.20401338 10.2173913 10.23076923 10.24414716 10.25752508 10.27090301
10.28428094 10.29765886 10.31103679 10.32441472 10.33779264 10.35117057
10.36454849 10.37792642 10.39130435 10.40468227 10.4180602 10.43143813
10.44481605 10.45819398 10.47157191 10.48494983 10.49832776 10.51170569
10.52508361 10.53846154 10.55183946 10.56521739 10.57859532 10.59197324
10.60535117 10.6187291 10.63210702 10.64548495 10.65886288 10.6722408
10.68561873 10.69899666 10.71237458 10.72575251 10.73913043 10.75250836
10.76588629 10.77926421 10.79264214 10.80602007 10.81939799 10.83277592
10.84615385 10.85953177 10.8729097 10.88628763 10.89966555 10.91304348
10.9264214 10.93979933 10.95317726 10.96655518 10.97993311 10.99331104
11.00668896 11.02006689 11.03344482 11.04682274 11.06020067 11.0735786
11.08695652 11.10033445 11.11371237 11.1270903 11.14046823 11.15384615
11.16722408 11.18060201 11.19397993 11.20735786 11.22073579 11.23411371
11.24749164 11.26086957 11.27424749 11.28762542 11.30100334 11.31438127
11.3277592 11.34113712 11.35451505 11.36789298 11.3812709 11.39464883
11.40802676 11.42140468 11.43478261 11.44816054 11.46153846 11.47491639
11.48829431 11.50167224 11.51505017 11.52842809 11.54180602 11.55518395
11.56856187 11.5819398 11.59531773 11.60869565 11.62207358 11.63545151
11.64882943 11.66220736 11.67558528 11.68896321 11.70234114 11.71571906
11.72909699 11.74247492 11.75585284 11.76923077 11.7826087 11.79598662
11.80936455 11.82274247 11.8361204 11.84949833 11.86287625 11.87625418
11.88963211 11.90301003 11.91638796 11.92976589 11.94314381 11.95652174
11.96989967 11.98327759 11.99665552 12.01003344 12.02341137 12.0367893
12.05016722 12.06354515 12.07692308 12.090301 12.10367893 12.11705686
12.13043478 12.14381271 12.15719064 12.17056856 12.18394649 12.19732441
12.21070234 12.22408027 12.23745819 12.25083612 12.26421405 12.27759197
12.2909699 12.30434783 12.31772575 12.33110368 12.34448161 12.35785953
12.37123746 12.38461538 12.39799331 12.41137124 12.42474916 12.43812709
12.45150502 12.46488294 12.47826087 12.4916388 12.50501672 12.51839465
12.53177258 12.5451505 12.55852843 12.57190635 12.58528428 12.59866221
12.61204013 12.62541806 12.63879599 12.65217391 12.66555184 12.67892977
12.69230769 12.70568562 12.71906355 12.73244147 12.7458194 12.75919732
12.77257525 12.78595318 12.7993311 12.81270903 12.82608696 12.83946488
12.85284281 12.86622074 12.87959866 12.89297659 12.90635452 12.91973244

12.93311037 12.94648829 12.95986622 12.97324415 12.98662207 13. ]

[98]: time stock_prices_interpolated_akima_method \


0 9.000000 100.000000
1 9.013378 99.525195
2 9.026756 99.063993
3 9.040134 98.616391
4 9.053512 98.182392
.. … …
295 12.946488 184.815461
296 12.959866 184.861090
297 12.973244 184.907052
298 12.986622 184.953353
299 13.000000 185.000000

stock_prices_interpolated_cubic stock_prices_interpoalted_pchip
0 100.000000 100.000000
1 98.610726 100.000376
2 97.272002 100.001501
3 95.983327 100.003376
4 94.744200 100.005998
.. … …
295 182.608276 184.813870
296 183.171024 184.860183
297 183.757013 184.906643
298 184.366564 184.953249
299 185.000000 185.000000

[300 rows x 4 columns]

Plotting the graph of the interpolated values obtained using the Akima interpolation method

[99]: plt.figure(figsize = (10,6))
      plt.plot(x, y, marker="o", label="Original Stock prices with gaps", markersize=8, color="black")
      plt.plot(x_dense, y_akima, label="Akima Interpolation", linewidth=2, color="red")
      plt.title("Stock prices chart : Akima vs cubic interpolation")
      plt.xlabel("Time (Hour of day)")
      plt.ylabel("Stock price (USD)")
      plt.legend()
      plt.grid(True)
      plt.show()

Plotting the graph of the interpolated values using the cubic and Akima interpolation methods

[100]: plt.figure(figsize = (10,6))
       plt.plot(x, y, marker="o", label="Original Stock prices with gaps", markersize=8, color="black")
       plt.plot(x_dense, y_akima, label="Akima Interpolation", linewidth=2, color="red")
       plt.plot(x_dense, y_cubic, linestyle="--", color="green", linewidth=2, label="Cubic interpolation")
       plt.title("Stock prices chart : Akima vs cubic interpolation")
       plt.xlabel("Time (Hour of day)")
       plt.ylabel("Stock price (USD)")
       plt.legend()
       plt.grid(True)
       plt.show()

Plotting the graph of the interpolated values using the cubic vs Akima vs PCHIP interpolation methods

[101]: plt.figure(figsize = (10,6))
       plt.plot(x, y, marker="o", label="Original Stock prices with gaps", markersize=8, color="black")
       plt.plot(x_dense, y_akima, label="Akima Interpolation", linewidth=2, color="red")
       plt.plot(x_dense, y_cubic, linestyle="--", color="green", linewidth=2, label="Cubic interpolation")
       plt.plot(x_dense, y_pchip, linestyle="-.", color="orange", linewidth=2, label="PCHIP interpolation")
       plt.title("Stock prices chart : Akima vs cubic vs pchip interpolation")
       plt.xlabel("Time (Hour of day)")
       plt.ylabel("Stock price (USD)")
       plt.legend()
       plt.grid(True)
       plt.show()

2.3.20 Explanation of the graph of the Akima vs cubic vs PCHIP interpolation methods:
Key observations:

| Time (approx) | What's happening in the graph |
| --- | --- |
| 9.0 - 10.0 | Cubic (green) dips unrealistically below the original values, an overshoot caused by its curvature constraints. Akima (red) and PCHIP (orange) stay more stable and realistic. |
| 10.0 - 11.0 | All curves rise sharply due to the jump in stock price. Cubic again overshoots a bit, while Akima and PCHIP stay closer to the true values. |
| 11.0 - 13.0 | Cubic shows wiggles (up and down oscillations) that are not in the actual data. PCHIP and Akima give more stable curves. |

Why does Akima show a dip vs PCHIP?

1. PCHIP – Preserves monotonicity and shape
   • PCHIP stands for Piecewise Cubic Hermite Interpolating Polynomial.
   • It's designed to:
     – Avoid overshooting.
     – Preserve the original trend direction (i.e., no new extrema are introduced).
     – Produce visually pleasing, smooth curves without unnecessary bumps. PCHIP never introduces new local maxima/minima, which is why it avoids dips or bumps.
2. Akima – Smoother but not always monotonic
   • Akima interpolation is a spline-based method that blends data using slopes calculated from neighboring points.
   • It tries to produce very smooth curves, even for erratic data.
   • It is less sensitive to large changes in adjacent points, but may still introduce small oscillations or dips between points. Akima can introduce dips or local minima/maxima if the slope changes rapidly or unevenly between adjacent data points. A quick check of this is sketched below.
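Here is that check, a small sketch using the same monotonically increasing stock samples as above: it counts how many local extrema each interpolant introduces between the data points.

import numpy as np
from scipy.interpolate import PchipInterpolator, Akima1DInterpolator, CubicSpline

hours = np.array([9, 10, 11, 12, 13], dtype=float)
prices = np.array([100, 102, 180, 182, 185], dtype=float)   # monotone increasing

dense = np.linspace(hours.min(), hours.max(), 1000)

def n_extrema(vals):
    # a local extremum shows up as a sign change in the first difference
    return int(np.sum(np.diff(np.sign(np.diff(vals))) != 0))

for name, interp in [("pchip", PchipInterpolator(hours, prices)),
                     ("akima", Akima1DInterpolator(hours, prices)),
                     ("cubic", CubicSpline(hours, prices))]:
    print(name, "local extrema introduced:", n_extrema(interp(dense)))
# pchip should report 0 (no new maxima/minima); cubic usually reports a few,
# and akima may report one or two, depending on how the neighboring slopes combine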

2.3.21 Let's look at three more examples where the from_derivatives interpolation method can be used in real life
The from_derivatives interpolation method constructs an interpolating polynomial based on: -
Function values (y) - Derivative values (dy {first derivative}, ddy {second derivative}) at known data
points. This is known as Hermite interpolation (as opposed to Lagrange or spline interpolation).

How does it work? Mathematically, it uses Hermite polynomials, which enforce that both the
function and its derivatives match at the interpolation nodes. Unlike spline interpolation (which just
fits the function values), Hermite interpolation also ensures smooth changes by matching slopes (and
optionally higher derivatives). It's most often used via scipy.interpolate.BPoly.from_derivatives(x, y_ders).

Math Under the Hood

What are Hermite polynomials? Hermite polynomials are a sequence of orthogonal polynomials
used in mathematics, physics and numerical analysis, especially in problems involving
Gaussian weight functions, quantum mechanics and interpolation (like the Piecewise Cubic Hermite
Interpolating Polynomial, or PCHIP).

There are two main types:


• Physicist’s Hermite polynomials : 𝐻𝑛 (𝑥)
• Probabilist’s Hermite polynomials : 𝐻𝑒𝑛 (𝑥) They differ by a scaling factor, but both satisfy
recurrence relations and orthogonality.

Core Ideas
• Hermite polynomials 𝐻𝑛(𝑥) are solutions of the Hermite differential equation:
  – 𝑦″ − 2𝑥𝑦′ + 2𝑛𝑦 = 0
• They form an orthogonal basis with respect to the weight function 𝑤(𝑥) = 𝑒^(−𝑥²) (for the physicist's version)
• First few Hermite polynomials:
  – 𝐻0(𝑥) = 1
  – 𝐻1(𝑥) = 2𝑥
  – 𝐻2(𝑥) = 4𝑥² − 2
  – 𝐻3(𝑥) = 8𝑥³ − 12𝑥
  – 𝐻4(𝑥) = 16𝑥⁴ − 48𝑥² + 12

Applications:
1. Quantum mechanics
   • Hermite polynomials appear in the solution of the quantum harmonic oscillator.
2. Gaussian quadrature
   • They are used to construct Gauss-Hermite quadrature rules for integrals involving 𝑒^(−𝑥²).
3. Hermite interpolation
   • In numerical analysis, Hermite interpolation uses not only function values but also derivatives at the interpolation points.
   • PCHIP and from_derivatives are based on this principle, though not necessarily on Hermite polynomials themselves.
4. Machine learning and statistics
   • Used in expansions of functions in terms of orthogonal polynomials.

Recurrence Relation: You can compute them using 𝐻₀(𝑥) = 1, 𝐻₁(𝑥) = 2𝑥, and 𝐻_{𝑛+1}(𝑥) = 2𝑥·𝐻_𝑛(𝑥) − 2𝑛·𝐻_{𝑛−1}(𝑥). A minimal sketch of this recurrence follows.
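That sketch, comparing the recurrence against scipy.special.hermite at a few sample points:

import numpy as np
from scipy.special import hermite

def hermite_recurrence(n, x):
    # H_0 = 1, H_1 = 2x, H_{k+1} = 2x*H_k - 2k*H_{k-1}
    h_prev, h = np.ones_like(x), 2 * x
    if n == 0:
        return h_prev
    for k in range(1, n):
        h_prev, h = h, 2 * x * h - 2 * k * h_prev
    return h

sample_x = np.linspace(-2, 2, 5)
for n in range(5):
    assert np.allclose(hermite_recurrence(n, sample_x), hermite(n)(sample_x))
print("recurrence matches scipy.special.hermite for n = 0..4")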

[102]: import numpy as np
       import matplotlib.pyplot as plt
       from scipy.special import hermite

       # Generate 500 x values between -3 and 3
       x = np.linspace(-3, 3, 500)
       plt.figure(figsize=(8,4))
       for n in range(5):
           Hn = hermite(n)
           plt.plot(x, Hn(x), label=f"H_{n}(x)")

       plt.title("Hermite Polynomials (Physicist's)")
       plt.legend()
       plt.grid(True)
       plt.show()

Clarifying Confusion: PCHIP vs Hermite Polynomials
• PCHIP uses Hermite-style interpolation, which means it ensures continuity of the function value and the slope (1st derivative).
• It does not directly use Hermite polynomials; the name "Hermite" here refers to the interpolation style, not the specific set of orthogonal polynomials.

What are orthogonal polynomials? Orthogonal polynomials are a class of polynomials that
are mutually orthogonal under some inner product on a function space (usually defined over an
interval with a weight function). In simpler terms: two polynomials are orthogonal if their
weighted inner product (a kind of dot product) equals zero.

Formal definition of orthogonality Let 𝑤(𝑥) be a weight function defined on an interval [a,b].
Then a sequence of polynomials 𝑃𝑛(𝑥) is called orthogonal if ∫ₐᵇ 𝑃𝑛(𝑥)𝑃𝑚(𝑥)𝑤(𝑥)𝑑𝑥 = 0 whenever
𝑛 ≠ 𝑚. This means: - Each polynomial 𝑃𝑛 has degree n. - They are "independent" of each other
under the integration operation with the weight function 𝑤(𝑥). A small numerical check is sketched below.
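Here is that numerical check, a small sketch: under the Hermite weight w(x) = exp(-x^2), the inner product of H_2 and H_3 integrates to (numerically) zero, while the inner product of H_2 with itself does not.

import numpy as np
from scipy.integrate import quad
from scipy.special import hermite

def weighted_inner(n, m):
    Hn, Hm = hermite(n), hermite(m)
    integrand = lambda t: Hn(t) * Hm(t) * np.exp(-t**2)
    value, _ = quad(integrand, -np.inf, np.inf)
    return value

print(weighted_inner(2, 3))   # ~0      (orthogonal)
print(weighted_inner(2, 2))   # ~14.18  (2^n * n! * sqrt(pi) for n = 2, non-zero norm)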

Examples of orthogonal polynomials

| Polynomial Family | Interval | Weight Function 𝑤(𝑥) | Used In |
| --- | --- | --- | --- |
| Legendre | [−1, 1] | 𝑤(𝑥) = 1 | Physics, approximation |
| Chebyshev | [−1, 1] | 𝑤(𝑥) = 1/√(1−𝑥²) | Signal processing |
| Laguerre | [0, ∞) | 𝑤(𝑥) = 𝑒^(−𝑥) | Quantum mechanics |
| Hermite | (−∞, ∞) | 𝑤(𝑥) = 𝑒^(−𝑥²) | Probability (Gaussian) |

Intuition Think of how vectors can be orthogonal in geometry — at 90° to each other. Orthog-
onal polynomials are “perpendicular” under integration. This makes them perfect for: - Function
approximation (like Fourier series, but with polynomials). - Reducing error in numerical methods
(e.g., least squares). - Solving differential equations (like in physics and engineering).

Why Orthogonal Polynomials Matter


• Numerical Stability: They reduce correlation between terms in polynomial approximation.
• Fast Convergence: They’re used in spectral methods where approximations converge quickly.
• Efficient Representation: You can express complex functions with fewer terms (like Chebyshev
in fast transforms).

Resources used:
Orthogonal polynomials: [Link]
Hermite polynomials: [Link]

Example 1: Physics particle trajectory with known velocity at each point Use the
BPoly.from_derivatives() class in SciPy to implement from_derivatives.

[103]: import numpy as np
       import matplotlib.pyplot as plt
       from scipy.interpolate import BPoly

       x = [0,1,2]
       # first derivative gives velocity and second derivative gives acceleration
       y_ders = [
           [0, 10],   # at t=0, position=0, velocity=10
           [15, 5],   # at t=1, position=15, velocity=5
           [25, 0]    # at t=2, position=25, velocity=0
       ]

       poly = BPoly.from_derivatives(x,y_ders)
       x_vals = np.linspace(0,2,100)
       y_vals = poly(x_vals)

       # Plot the from_derivatives interpolant using matplotlib
       plt.figure(figsize=(8,4))
       plt.plot(x_vals, y_vals, color="red", linestyle="-", label="Interpolated Position")
       plt.plot(x, [yd[0] for yd in y_ders], color="green", linestyle="--", label="Data Points")
       plt.title("Trajectory interpolation with velocity info")
       plt.xlabel("Time")
       plt.ylabel("Position")
       plt.legend()
       plt.grid(True)
       plt.show()

Example 2: Thermodynamics –> Temperature vs Time with a known heating rate
[104]: x = [0,5,10] # time in minutes
       y_ders = [
           [25, 2],   # at 0 min: 25°C, heating at 2°C/min
           [35, 1],   # at 5 min: 35°C, heating at 1°C/min
           [40, 0]    # at 10 min: 40°C, stopped heating
       ]

       poly = BPoly.from_derivatives(x,y_ders)
       # generate 500 values between 0 and 10
       x_vals = np.linspace(0,10,500)
       y_vals = poly(x_vals)

       # plot the graph
       plt.figure(figsize=(8,4))
       plt.plot(x_vals, y_vals, color="red", linestyle="-", label="Interpolated Temperature")
       plt.plot(x, [yd[0] for yd in y_ders], color="green", linestyle="--", label="Data Points")
       plt.title("Temperature curve with known heating rates.")
       plt.xlabel("Time (min)")
       plt.ylabel("Temperature (°C)")
       plt.legend()
       plt.grid(True)
       plt.show()

When to use? Use it when: - You know or can estimate derivatives at the data points. - You
want smooth curves with custom control over slope and curvature. - You are simulating or modeling
physical systems (example: motion, thermodynamics).

When not to use

• You only know function values (use linear, pchip, cubic etc.).
• Derivatives are noisy, approximate or unknown.
• You don't need high continuity or slope control.
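For completeness: pandas also exposes this technique by name through .interpolate(), where the documentation describes method="from_derivatives" as referring to scipy.interpolate.BPoly.from_derivatives. A minimal sketch with an assumed toy Series:

import numpy as np
import pandas as pd

s = pd.Series([0.0, np.nan, 25.0, np.nan, 40.0])

# fills the NaNs with a piecewise-polynomial (Hermite-style) interpolant;
# requires SciPy to be installed
print(s.interpolate(method="from_derivatives"))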

2.4 Sometimes there might be situation where I have to drop the rows with
n.a. or NaN values
[105]: # pc path
df_missing_values = pd.read_excel("/home/aditya/github/
↪Deep-learning-prerequisite/pandas/dataset/nyc_weather_missing_data_demo.

↪xlsx")

# macbook path
# df_missing_values = pd.read_excel("/home/machine_learning/github/
↪Deep-learning-prerequisite/pandas/dataset/nyc_weather_missing_data_demo.

↪xlsx", engine = "openpyxl")

df_missing_values

[105]: EST Temperature DewPoint Humidity Sea Level PressureIn \


0 1/1/2016 38.0 23 52 30.03

1 1/2/2016 36.0 18 46 30.02
2 1/3/2016 40.0 21 47 29.86
3 1/4/2016 25.0 9 44 30.05
4 1/5/2016 NaN -3 41 30.57
5 1/6/2016 33.0 4 35 30.50
6 1/7/2016 39.0 11 33 30.28
7 1/8/2016 39.0 29 64 30.20
8 1/9/2016 44.0 38 77 30.16
9 1/10/2016 50.0 46 71 29.59
10 1/11/2016 33.0 8 37 29.92
11 1/12/2016 35.0 15 53 29.85
12 1/13/2016 26.0 4 42 29.94
13 1/14/2016 NaN 12 47 29.95
14 1/15/2016 43.0 31 62 29.82
15 1/16/2016 47.0 37 70 29.52
16 1/17/2016 36.0 23 66 29.78
17 1/18/2016 25.0 6 53 29.83
18 1/19/2016 22.0 3 42 30.03
19 1/20/2016 32.0 15 49 30.13
20 1/21/2016 31.0 11 45 30.15
21 1/22/2016 26.0 6 41 30.21
22 1/23/2016 NaN 21 78 29.77
23 1/24/2016 NaN 11 53 29.92
24 1/25/2016 NaN 18 54 30.25
25 1/26/2016 43.0 29 56 30.03
26 1/27/2016 41.0 22 45 30.03
27 1/28/2016 37.0 20 51 29.90
28 1/29/2016 36.0 21 50 29.58
29 1/30/2016 34.0 16 46 30.01
30 1/31/2016 46.0 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover Events \


0 10 8.0 0 5 NaN
1 10 7.0 0 3 NaN
2 10 8.0 0 1 NaN
3 10 9.0 0 3 NaN
4 10 5.0 0 0 NaN
5 10 4.0 0 0 NaN
6 10 2.0 0 3 NaN
7 10 4.0 0 8 NaN
8 9 8.0 T 8 Rain
9 4 NaN 1.8 7 Rain
10 10 NaN 0 1 NaN
11 10 6.0 T 4 NaN
12 10 10.0 0 0 NaN
13 10 5.0 T 7 NaN
14 9 5.0 T 2 NaN

15 8 7.0 0.24 7 Rain
16 8 6.0 0.05 6 Fog-Snow
17 9 12.0 T 2 Snow
18 10 11.0 0 1 NaN
19 10 6.0 0 2 NaN
20 10 6.0 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16.0 2.31 8 Fog-Snow
23 8 6.0 T 3 Snow
24 10 3.0 0 2 NaN
25 10 7.0 0 2 NaN
26 10 7.0 T 3 Rain
27 10 5.0 0 1 NaN
28 10 8.0 0 4 NaN
29 10 7.0 0 0 NaN
30 10 5.0 0 0 NaN

WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN

29 257.0
30 NaN

[106]: df_dropped_NaN = df_missing_values.dropna()


df_dropped_NaN

[106]: EST Temperature DewPoint Humidity Sea Level PressureIn \


8 1/9/2016 44.0 38 77 30.16
15 1/16/2016 47.0 37 70 29.52
16 1/17/2016 36.0 23 66 29.78
17 1/18/2016 25.0 6 53 29.83

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover Events \


8 9 8.0 T 8 Rain
15 8 7.0 0.24 7 Rain
16 8 6.0 0.05 6 Fog-Snow
17 9 12.0 T 2 Snow

WindDirDegrees
8 76.0
15 340.0
16 345.0
17 293.0

As you can see all the rows containing NaN values have been removed

Suppose there is a situation where I have to drop all the rows that have all NaN values.
[107]: df_drop_rows_having_all_NaN = df_missing_values.dropna(how="all")
df_drop_rows_having_all_NaN

[107]: EST Temperature DewPoint Humidity Sea Level PressureIn \


0 1/1/2016 38.0 23 52 30.03
1 1/2/2016 36.0 18 46 30.02
2 1/3/2016 40.0 21 47 29.86
3 1/4/2016 25.0 9 44 30.05
4 1/5/2016 NaN -3 41 30.57
5 1/6/2016 33.0 4 35 30.50
6 1/7/2016 39.0 11 33 30.28
7 1/8/2016 39.0 29 64 30.20
8 1/9/2016 44.0 38 77 30.16
9 1/10/2016 50.0 46 71 29.59
10 1/11/2016 33.0 8 37 29.92
11 1/12/2016 35.0 15 53 29.85
12 1/13/2016 26.0 4 42 29.94
13 1/14/2016 NaN 12 47 29.95
14 1/15/2016 43.0 31 62 29.82
15 1/16/2016 47.0 37 70 29.52

16 1/17/2016 36.0 23 66 29.78
17 1/18/2016 25.0 6 53 29.83
18 1/19/2016 22.0 3 42 30.03
19 1/20/2016 32.0 15 49 30.13
20 1/21/2016 31.0 11 45 30.15
21 1/22/2016 26.0 6 41 30.21
22 1/23/2016 NaN 21 78 29.77
23 1/24/2016 NaN 11 53 29.92
24 1/25/2016 NaN 18 54 30.25
25 1/26/2016 43.0 29 56 30.03
26 1/27/2016 41.0 22 45 30.03
27 1/28/2016 37.0 20 51 29.90
28 1/29/2016 36.0 21 50 29.58
29 1/30/2016 34.0 16 46 30.01
30 1/31/2016 46.0 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover Events \


0 10 8.0 0 5 NaN
1 10 7.0 0 3 NaN
2 10 8.0 0 1 NaN
3 10 9.0 0 3 NaN
4 10 5.0 0 0 NaN
5 10 4.0 0 0 NaN
6 10 2.0 0 3 NaN
7 10 4.0 0 8 NaN
8 9 8.0 T 8 Rain
9 4 NaN 1.8 7 Rain
10 10 NaN 0 1 NaN
11 10 6.0 T 4 NaN
12 10 10.0 0 0 NaN
13 10 5.0 T 7 NaN
14 9 5.0 T 2 NaN
15 8 7.0 0.24 7 Rain
16 8 6.0 0.05 6 Fog-Snow
17 9 12.0 T 2 Snow
18 10 11.0 0 1 NaN
19 10 6.0 0 2 NaN
20 10 6.0 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16.0 2.31 8 Fog-Snow
23 8 6.0 T 3 Snow
24 10 3.0 0 2 NaN
25 10 7.0 0 2 NaN
26 10 7.0 T 3 Rain
27 10 5.0 0 1 NaN
28 10 8.0 0 4 NaN
29 10 7.0 0 0 NaN

30 10 5.0 0 0 NaN

WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
29 257.0
30 NaN

Suppose I want to keep the rows that have at least one non-NaN value and drop every other row.
[108]: df_keep_the_row_with_one_NaN_val = df_missing_values.dropna(thresh=1)
df_keep_the_row_with_one_NaN_val

[108]: EST Temperature DewPoint Humidity Sea Level PressureIn \


0 1/1/2016 38.0 23 52 30.03
1 1/2/2016 36.0 18 46 30.02
2 1/3/2016 40.0 21 47 29.86
3 1/4/2016 25.0 9 44 30.05
4 1/5/2016 NaN -3 41 30.57
5 1/6/2016 33.0 4 35 30.50

6 1/7/2016 39.0 11 33 30.28
7 1/8/2016 39.0 29 64 30.20
8 1/9/2016 44.0 38 77 30.16
9 1/10/2016 50.0 46 71 29.59
10 1/11/2016 33.0 8 37 29.92
11 1/12/2016 35.0 15 53 29.85
12 1/13/2016 26.0 4 42 29.94
13 1/14/2016 NaN 12 47 29.95
14 1/15/2016 43.0 31 62 29.82
15 1/16/2016 47.0 37 70 29.52
16 1/17/2016 36.0 23 66 29.78
17 1/18/2016 25.0 6 53 29.83
18 1/19/2016 22.0 3 42 30.03
19 1/20/2016 32.0 15 49 30.13
20 1/21/2016 31.0 11 45 30.15
21 1/22/2016 26.0 6 41 30.21
22 1/23/2016 NaN 21 78 29.77
23 1/24/2016 NaN 11 53 29.92
24 1/25/2016 NaN 18 54 30.25
25 1/26/2016 43.0 29 56 30.03
26 1/27/2016 41.0 22 45 30.03
27 1/28/2016 37.0 20 51 29.90
28 1/29/2016 36.0 21 50 29.58
29 1/30/2016 34.0 16 46 30.01
30 1/31/2016 46.0 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover Events \


0 10 8.0 0 5 NaN
1 10 7.0 0 3 NaN
2 10 8.0 0 1 NaN
3 10 9.0 0 3 NaN
4 10 5.0 0 0 NaN
5 10 4.0 0 0 NaN
6 10 2.0 0 3 NaN
7 10 4.0 0 8 NaN
8 9 8.0 T 8 Rain
9 4 NaN 1.8 7 Rain
10 10 NaN 0 1 NaN
11 10 6.0 T 4 NaN
12 10 10.0 0 0 NaN
13 10 5.0 T 7 NaN
14 9 5.0 T 2 NaN
15 8 7.0 0.24 7 Rain
16 8 6.0 0.05 6 Fog-Snow
17 9 12.0 T 2 Snow
18 10 11.0 0 1 NaN
19 10 6.0 0 2 NaN

20 10 6.0 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16.0 2.31 8 Fog-Snow
23 8 6.0 T 3 Snow
24 10 3.0 0 2 NaN
25 10 7.0 0 2 NaN
26 10 7.0 T 3 Rain
27 10 5.0 0 1 NaN
28 10 8.0 0 4 NaN
29 10 7.0 0 0 NaN
30 10 5.0 0 0 NaN

WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
29 257.0
30 NaN

Now If I change the threshold to 10 then

[109]: df_keep_the_row_with_one_NaN_val = df_missing_values.dropna(thresh=10)
df_keep_the_row_with_one_NaN_val

[109]: EST Temperature DewPoint Humidity Sea Level PressureIn \


0 1/1/2016 38.0 23 52 30.03
1 1/2/2016 36.0 18 46 30.02
2 1/3/2016 40.0 21 47 29.86
3 1/4/2016 25.0 9 44 30.05
6 1/7/2016 39.0 11 33 30.28
7 1/8/2016 39.0 29 64 30.20
8 1/9/2016 44.0 38 77 30.16
9 1/10/2016 50.0 46 71 29.59
11 1/12/2016 35.0 15 53 29.85
12 1/13/2016 26.0 4 42 29.94
14 1/15/2016 43.0 31 62 29.82
15 1/16/2016 47.0 37 70 29.52
16 1/17/2016 36.0 23 66 29.78
17 1/18/2016 25.0 6 53 29.83
19 1/20/2016 32.0 15 49 30.13
20 1/21/2016 31.0 11 45 30.15
21 1/22/2016 26.0 6 41 30.21
22 1/23/2016 NaN 21 78 29.77
23 1/24/2016 NaN 11 53 29.92
26 1/27/2016 41.0 22 45 30.03
29 1/30/2016 34.0 16 46 30.01

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover Events \


0 10 8.0 0 5 NaN
1 10 7.0 0 3 NaN
2 10 8.0 0 1 NaN
3 10 9.0 0 3 NaN
6 10 2.0 0 3 NaN
7 10 4.0 0 8 NaN
8 9 8.0 T 8 Rain
9 4 NaN 1.8 7 Rain
11 10 6.0 T 4 NaN
12 10 10.0 0 0 NaN
14 9 5.0 T 2 NaN
15 8 7.0 0.24 7 Rain
16 8 6.0 0.05 6 Fog-Snow
17 9 12.0 T 2 Snow
19 10 6.0 0 2 NaN
20 10 6.0 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16.0 2.31 8 Fog-Snow
23 8 6.0 T 3 Snow
26 10 7.0 T 3 Rain

29 10 7.0 0 0 NaN

WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
6 293.0
7 79.0
8 76.0
9 109.0
11 235.0
12 284.0
14 101.0
15 340.0
16 345.0
17 293.0
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
26 NaN
29 257.0

Here thresh (i.e., threshold) is the minimum number of non-NaN values that a row must have
for it to be kept when the .dropna() method is run in pandas.
• Basically it's saying that:
  – If there are 11 columns in the dataFrame,
  – and there is a row where, out of those 11 columns, 10 have values and 1 has a NaN value,
  – then with .dropna(thresh=10) that row will not be dropped, because it has at least 10 non-NaN values; it remains in the dataFrame after processing.
  – All other rows that do not satisfy the thresh rule (fewer than 10 non-NaN values) will be dropped. A tiny sketch of this behaviour follows.
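That sketch, with a tiny assumed DataFrame:

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, np.nan],
})

print(toy.dropna(thresh=2))   # keeps rows 0 and 1 (each has at least 2 non-NaN values)
print(toy.dropna(thresh=3))   # keeps only row 0 (the only row with 3 non-NaN values)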

2.5 Insert the missing date in the dataFrame where the date column is set to
be the index of the dataFrame
Refer to this docs for more info : [Link]

Set the EST which is essentially a date type column as an index of the dataFrame.
• Convert the EST column from string to date type column
• Then set the EST column as an index to the dataFrame

Verify if the EST column has the string type values stored in it

[110]: df_missing_values["EST"].apply(lambda x: isinstance (x,str)).all()

[110]: True

The result of this expression : df_missing_values["EST"].apply(lambda x: isinstance


(x,str)).all() is returning True that means the values in EST column is in string
format.

Convert the values in the EST column to be from string type to date type
[111]: df_missing_values["EST"] = pd.to_datetime(df_missing_values["EST"],format="%m/
↪%d/%Y",errors="coerce")

df_missing_values

[111]: EST Temperature DewPoint Humidity Sea Level PressureIn \


0 2016-01-01 38.0 23 52 30.03
1 2016-01-02 36.0 18 46 30.02
2 2016-01-03 40.0 21 47 29.86
3 2016-01-04 25.0 9 44 30.05
4 2016-01-05 NaN -3 41 30.57
5 2016-01-06 33.0 4 35 30.50
6 2016-01-07 39.0 11 33 30.28
7 2016-01-08 39.0 29 64 30.20
8 2016-01-09 44.0 38 77 30.16
9 2016-01-10 50.0 46 71 29.59
10 2016-01-11 33.0 8 37 29.92
11 2016-01-12 35.0 15 53 29.85
12 2016-01-13 26.0 4 42 29.94
13 2016-01-14 NaN 12 47 29.95
14 2016-01-15 43.0 31 62 29.82
15 2016-01-16 47.0 37 70 29.52
16 2016-01-17 36.0 23 66 29.78
17 2016-01-18 25.0 6 53 29.83
18 2016-01-19 22.0 3 42 30.03
19 2016-01-20 32.0 15 49 30.13
20 2016-01-21 31.0 11 45 30.15
21 2016-01-22 26.0 6 41 30.21
22 2016-01-23 NaN 21 78 29.77
23 2016-01-24 NaN 11 53 29.92
24 2016-01-25 NaN 18 54 30.25
25 2016-01-26 43.0 29 56 30.03
26 2016-01-27 41.0 22 45 30.03
27 2016-01-28 37.0 20 51 29.90
28 2016-01-29 36.0 21 50 29.58
29 2016-01-30 34.0 16 46 30.01
30 2016-01-31 46.0 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover Events \

0 10 8.0 0 5 NaN
1 10 7.0 0 3 NaN
2 10 8.0 0 1 NaN
3 10 9.0 0 3 NaN
4 10 5.0 0 0 NaN
5 10 4.0 0 0 NaN
6 10 2.0 0 3 NaN
7 10 4.0 0 8 NaN
8 9 8.0 T 8 Rain
9 4 NaN 1.8 7 Rain
10 10 NaN 0 1 NaN
11 10 6.0 T 4 NaN
12 10 10.0 0 0 NaN
13 10 5.0 T 7 NaN
14 9 5.0 T 2 NaN
15 8 7.0 0.24 7 Rain
16 8 6.0 0.05 6 Fog-Snow
17 9 12.0 T 2 Snow
18 10 11.0 0 1 NaN
19 10 6.0 0 2 NaN
20 10 6.0 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16.0 2.31 8 Fog-Snow
23 8 6.0 T 3 Snow
24 10 3.0 0 2 NaN
25 10 7.0 0 2 NaN
26 10 7.0 T 3 Rain
27 10 5.0 0 1 NaN
28 10 8.0 0 4 NaN
29 10 7.0 0 0 NaN
30 10 5.0 0 0 NaN

WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN

14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
29 257.0
30 NaN

Let’s check if all the values in the EST column is of type string
[112]: df_missing_values["EST"].apply(lambda x: isinstance(x,str)).all()

[112]: False

[113]: df_missing_values["EST"].dtype

[113]: dtype('<M8[ns]')

The result is False hence the EST column values are not of type string i.e they have
been successfully converted to datetime object in this dataFrame

Now we can safely set the EST column as index of this dataFrame
[114]: df_missing_values.set_index(df_missing_values["EST"],inplace=True)
df_missing_values

[114]: EST Temperature DewPoint Humidity Sea Level PressureIn \


EST
2016-01-01 2016-01-01 38.0 23 52 30.03
2016-01-02 2016-01-02 36.0 18 46 30.02
2016-01-03 2016-01-03 40.0 21 47 29.86
2016-01-04 2016-01-04 25.0 9 44 30.05
2016-01-05 2016-01-05 NaN -3 41 30.57
2016-01-06 2016-01-06 33.0 4 35 30.50
2016-01-07 2016-01-07 39.0 11 33 30.28
2016-01-08 2016-01-08 39.0 29 64 30.20
2016-01-09 2016-01-09 44.0 38 77 30.16
2016-01-10 2016-01-10 50.0 46 71 29.59
2016-01-11 2016-01-11 33.0 8 37 29.92

2016-01-12 2016-01-12 35.0 15 53 29.85
2016-01-13 2016-01-13 26.0 4 42 29.94
2016-01-14 2016-01-14 NaN 12 47 29.95
2016-01-15 2016-01-15 43.0 31 62 29.82
2016-01-16 2016-01-16 47.0 37 70 29.52
2016-01-17 2016-01-17 36.0 23 66 29.78
2016-01-18 2016-01-18 25.0 6 53 29.83
2016-01-19 2016-01-19 22.0 3 42 30.03
2016-01-20 2016-01-20 32.0 15 49 30.13
2016-01-21 2016-01-21 31.0 11 45 30.15
2016-01-22 2016-01-22 26.0 6 41 30.21
2016-01-23 2016-01-23 NaN 21 78 29.77
2016-01-24 2016-01-24 NaN 11 53 29.92
2016-01-25 2016-01-25 NaN 18 54 30.25
2016-01-26 2016-01-26 43.0 29 56 30.03
2016-01-27 2016-01-27 41.0 22 45 30.03
2016-01-28 2016-01-28 37.0 20 51 29.90
2016-01-29 2016-01-29 36.0 21 50 29.58
2016-01-30 2016-01-30 34.0 16 46 30.01
2016-01-31 2016-01-31 46.0 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover \


EST
2016-01-01 10 8.0 0 5
2016-01-02 10 7.0 0 3
2016-01-03 10 8.0 0 1
2016-01-04 10 9.0 0 3
2016-01-05 10 5.0 0 0
2016-01-06 10 4.0 0 0
2016-01-07 10 2.0 0 3
2016-01-08 10 4.0 0 8
2016-01-09 9 8.0 T 8
2016-01-10 4 NaN 1.8 7
2016-01-11 10 NaN 0 1
2016-01-12 10 6.0 T 4
2016-01-13 10 10.0 0 0
2016-01-14 10 5.0 T 7
2016-01-15 9 5.0 T 2
2016-01-16 8 7.0 0.24 7
2016-01-17 8 6.0 0.05 6
2016-01-18 9 12.0 T 2
2016-01-19 10 11.0 0 1
2016-01-20 10 6.0 0 2
2016-01-21 10 6.0 0 1
2016-01-22 9 NaN 0.01 3
2016-01-23 1 16.0 2.31 8
2016-01-24 8 6.0 T 3

2016-01-25 10 3.0 0 2
2016-01-26 10 7.0 0 2
2016-01-27 10 7.0 T 3
2016-01-28 10 5.0 0 1
2016-01-29 10 8.0 0 4
2016-01-30 10 7.0 0 0
2016-01-31 10 5.0 0 0

Events WindDirDegrees
EST
2016-01-01 NaN 281.0
2016-01-02 NaN 275.0
2016-01-03 NaN 277.0
2016-01-04 NaN 345.0
2016-01-05 NaN 333.0
2016-01-06 NaN NaN
2016-01-07 NaN 293.0
2016-01-08 NaN 79.0
2016-01-09 Rain 76.0
2016-01-10 Rain 109.0
2016-01-11 NaN 289.0
2016-01-12 NaN 235.0
2016-01-13 NaN 284.0
2016-01-14 NaN NaN
2016-01-15 NaN 101.0
2016-01-16 Rain 340.0
2016-01-17 Fog-Snow 345.0
2016-01-18 Snow 293.0
2016-01-19 NaN NaN
2016-01-20 NaN 302.0
2016-01-21 NaN 312.0
2016-01-22 Snow 34.0
2016-01-23 Fog-Snow 42.0
2016-01-24 Snow 327.0
2016-01-25 NaN 286.0
2016-01-26 NaN NaN
2016-01-27 Rain NaN
2016-01-28 NaN NaN
2016-01-29 NaN NaN
2016-01-30 NaN 257.0
2016-01-31 NaN NaN

Now that the EST column has been set as the index, the EST column itself is redundant and can be dropped.
[115]: df_EST_index = df_missing_values.drop("EST",axis=1)
df_EST_index

[115]: Temperature DewPoint Humidity Sea Level PressureIn \
EST
2016-01-01 38.0 23 52 30.03
2016-01-02 36.0 18 46 30.02
2016-01-03 40.0 21 47 29.86
2016-01-04 25.0 9 44 30.05
2016-01-05 NaN -3 41 30.57
2016-01-06 33.0 4 35 30.50
2016-01-07 39.0 11 33 30.28
2016-01-08 39.0 29 64 30.20
2016-01-09 44.0 38 77 30.16
2016-01-10 50.0 46 71 29.59
2016-01-11 33.0 8 37 29.92
2016-01-12 35.0 15 53 29.85
2016-01-13 26.0 4 42 29.94
2016-01-14 NaN 12 47 29.95
2016-01-15 43.0 31 62 29.82
2016-01-16 47.0 37 70 29.52
2016-01-17 36.0 23 66 29.78
2016-01-18 25.0 6 53 29.83
2016-01-19 22.0 3 42 30.03
2016-01-20 32.0 15 49 30.13
2016-01-21 31.0 11 45 30.15
2016-01-22 26.0 6 41 30.21
2016-01-23 NaN 21 78 29.77
2016-01-24 NaN 11 53 29.92
2016-01-25 NaN 18 54 30.25
2016-01-26 43.0 29 56 30.03
2016-01-27 41.0 22 45 30.03
2016-01-28 37.0 20 51 29.90
2016-01-29 36.0 21 50 29.58
2016-01-30 34.0 16 46 30.01
2016-01-31 46.0 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover \


EST
2016-01-01 10 8.0 0 5
2016-01-02 10 7.0 0 3
2016-01-03 10 8.0 0 1
2016-01-04 10 9.0 0 3
2016-01-05 10 5.0 0 0
2016-01-06 10 4.0 0 0
2016-01-07 10 2.0 0 3
2016-01-08 10 4.0 0 8
2016-01-09 9 8.0 T 8
2016-01-10 4 NaN 1.8 7
2016-01-11 10 NaN 0 1

2016-01-12 10 6.0 T 4
2016-01-13 10 10.0 0 0
2016-01-14 10 5.0 T 7
2016-01-15 9 5.0 T 2
2016-01-16 8 7.0 0.24 7
2016-01-17 8 6.0 0.05 6
2016-01-18 9 12.0 T 2
2016-01-19 10 11.0 0 1
2016-01-20 10 6.0 0 2
2016-01-21 10 6.0 0 1
2016-01-22 9 NaN 0.01 3
2016-01-23 1 16.0 2.31 8
2016-01-24 8 6.0 T 3
2016-01-25 10 3.0 0 2
2016-01-26 10 7.0 0 2
2016-01-27 10 7.0 T 3
2016-01-28 10 5.0 0 1
2016-01-29 10 8.0 0 4
2016-01-30 10 7.0 0 0
2016-01-31 10 5.0 0 0

Events WindDirDegrees
EST
2016-01-01 NaN 281.0
2016-01-02 NaN 275.0
2016-01-03 NaN 277.0
2016-01-04 NaN 345.0
2016-01-05 NaN 333.0
2016-01-06 NaN NaN
2016-01-07 NaN 293.0
2016-01-08 NaN 79.0
2016-01-09 Rain 76.0
2016-01-10 Rain 109.0
2016-01-11 NaN 289.0
2016-01-12 NaN 235.0
2016-01-13 NaN 284.0
2016-01-14 NaN NaN
2016-01-15 NaN 101.0
2016-01-16 Rain 340.0
2016-01-17 Fog-Snow 345.0
2016-01-18 Snow 293.0
2016-01-19 NaN NaN
2016-01-20 NaN 302.0
2016-01-21 NaN 312.0
2016-01-22 Snow 34.0
2016-01-23 Fog-Snow 42.0
2016-01-24 Snow 327.0

2016-01-25 NaN 286.0
2016-01-26 NaN NaN
2016-01-27 Rain NaN
2016-01-28 NaN NaN
2016-01-29 NaN NaN
2016-01-30 NaN 257.0
2016-01-31 NaN NaN

2.6 Adding missing dates in the dataFrame where EST (date type column) is
set as an index in the dataFrame
Since I don't have any missing dates in this dataFrame, I will manually drop a few rows from it to simulate them.
[116]: df_dropped_dates = df_EST_index.drop(df_EST_index.index[["2016-01-03",␣
↪"2016-01-08" , "2016-01-09"]])
df_dropped_dates

---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[116], line 1
----> 1 df_dropped_dates = df_EST_index.
↪drop(df_EST_index.index[["2016-01-03", "2016-01-08" , "2016-01-09"]])
2 df_dropped_dates

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/indexes/base.py, in Index.__getitem__(self, key)

5407 if len(key) == 0 and len(key) != len(self):


   5408 warnings.warn(
5409 "Using a boolean indexer with length 0 on an Index with "
5410 "length greater than 0 is deprecated and will raise in a␣
↪"

(…)
5413 stacklevel=find_stack_level(),
5414 )
-> 5416 result = getitem(key)
5417 # Because we ruled out integer above, we always get an arraylike here
   5418 if result.ndim > 1:

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/arrays/datetimelike.py, in DatetimeLikeArrayMixin.__getitem__(self, key)

374 """
375 This getitem defers to the underlying array, which by-definition can
376 only handle list-likes, slices, and integer scalars
377 """
378 # Use cast as we know we will get back a DatetimeLikeArray or DTScalar,
379 # but skip evaluating the Union at runtime for performance
380 # (see [Link]

--> 381 result = cast("Union[Self, DTScalarOrNaT]", super().__getitem__(key))
382 if lib.is_scalar(result):
383 return result

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/
↪arrays/_mixins.py:292, in NDArrayBackedExtensionArray.__getitem__(self, key)

289 # error: Incompatible types in assignment (expression has type␣


↪"ExtensionArray",

290 # variable has type "Union[int, slice, ndarray]")


291 key = extract_array(key, extract_numpy=True) # type: ignore[assignment]
--> 292 key = check_array_indexer(self, key)
293 result = self._ndarray[key]
294 if lib.is_scalar(result):

File ~/miniconda3/envs/tensorflow/lib/python3.10/site-packages/pandas/core/indexers/utils.py, in check_array_indexer(array, indexer)

547 raise ValueError(


548 "Cannot index with an integer indexer containing NA values"
549 ) from err
550 else:
--> 551 raise IndexError("arrays used as indices must be of integer or␣
↪boolean type")

553 return indexer

IndexError: arrays used as indices must be of integer or boolean type

Explaination
• df_dropped_dates = df_EST_index.drop(df_EST_index.index[["2016-01-03",
"2016-01-08", "2016-01-09"]])
• This causes an error because df_EST_index.index[...] expects integer or boolean index-
ing, not string labels like "2016-01-03".
• df_EST_index.index[...] is trying to index the index using a list of strings, which is not
allowed — it’s not integer-based or boolean-based indexing.
• Instead, I was trying to remove rows by label, not by position. #### solution
• During my search I found out that I can pass those dates strings directly to .drop() without
using df_EST_index.index[...]..
• This works because drop can take labels (as strings) if the index is of datetime64 type —
and Pandas will internally convert your strings to datetimes and match them correctly.

[117]: # This code worked fine on my pc but its not working on my macbook for some␣
↪reason

# df_dropped_dates = df_EST_index.drop(["2016-01-03", "2016-01-08",␣


↪"2016-01-09"])

# df_dropped_dates

# writing this code to make it work on older version of pandas

dates_to_drop = pd.to_datetime(["2016-01-03", "2016-01-08", "2016-01-09"])
df_dropped_dates = df_EST_index.drop(dates_to_drop)
df_dropped_dates

[117]: Temperature DewPoint Humidity Sea Level PressureIn \


EST
2016-01-01 38.0 23 52 30.03
2016-01-02 36.0 18 46 30.02
2016-01-04 25.0 9 44 30.05
2016-01-05 NaN -3 41 30.57
2016-01-06 33.0 4 35 30.50
2016-01-07 39.0 11 33 30.28
2016-01-10 50.0 46 71 29.59
2016-01-11 33.0 8 37 29.92
2016-01-12 35.0 15 53 29.85
2016-01-13 26.0 4 42 29.94
2016-01-14 NaN 12 47 29.95
2016-01-15 43.0 31 62 29.82
2016-01-16 47.0 37 70 29.52
2016-01-17 36.0 23 66 29.78
2016-01-18 25.0 6 53 29.83
2016-01-19 22.0 3 42 30.03
2016-01-20 32.0 15 49 30.13
2016-01-21 31.0 11 45 30.15
2016-01-22 26.0 6 41 30.21
2016-01-23 NaN 21 78 29.77
2016-01-24 NaN 11 53 29.92
2016-01-25 NaN 18 54 30.25
2016-01-26 43.0 29 56 30.03
2016-01-27 41.0 22 45 30.03
2016-01-28 37.0 20 51 29.90
2016-01-29 36.0 21 50 29.58
2016-01-30 34.0 16 46 30.01
2016-01-31 46.0 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover \


EST
2016-01-01 10 8.0 0 5
2016-01-02 10 7.0 0 3
2016-01-04 10 9.0 0 3
2016-01-05 10 5.0 0 0
2016-01-06 10 4.0 0 0
2016-01-07 10 2.0 0 3
2016-01-10 4 NaN 1.8 7
2016-01-11 10 NaN 0 1
2016-01-12 10 6.0 T 4
2016-01-13 10 10.0 0 0

2016-01-14 10 5.0 T 7
2016-01-15 9 5.0 T 2
2016-01-16 8 7.0 0.24 7
2016-01-17 8 6.0 0.05 6
2016-01-18 9 12.0 T 2
2016-01-19 10 11.0 0 1
2016-01-20 10 6.0 0 2
2016-01-21 10 6.0 0 1
2016-01-22 9 NaN 0.01 3
2016-01-23 1 16.0 2.31 8
2016-01-24 8 6.0 T 3
2016-01-25 10 3.0 0 2
2016-01-26 10 7.0 0 2
2016-01-27 10 7.0 T 3
2016-01-28 10 5.0 0 1
2016-01-29 10 8.0 0 4
2016-01-30 10 7.0 0 0
2016-01-31 10 5.0 0 0

Events WindDirDegrees
EST
2016-01-01 NaN 281.0
2016-01-02 NaN 275.0
2016-01-04 NaN 345.0
2016-01-05 NaN 333.0
2016-01-06 NaN NaN
2016-01-07 NaN 293.0
2016-01-10 Rain 109.0
2016-01-11 NaN 289.0
2016-01-12 NaN 235.0
2016-01-13 NaN 284.0
2016-01-14 NaN NaN
2016-01-15 NaN 101.0
2016-01-16 Rain 340.0
2016-01-17 Fog-Snow 345.0
2016-01-18 Snow 293.0
2016-01-19 NaN NaN
2016-01-20 NaN 302.0
2016-01-21 NaN 312.0
2016-01-22 Snow 34.0
2016-01-23 Fog-Snow 42.0
2016-01-24 Snow 327.0
2016-01-25 NaN 286.0
2016-01-26 NaN NaN
2016-01-27 Rain NaN
2016-01-28 NaN NaN
2016-01-29 NaN NaN

2016-01-30 NaN 257.0
2016-01-31 NaN NaN

Now that I have a dataFrame that simulates the missing dates I may encounter in the real world, I can move forward.

Adding missing dates in the dataFrame where EST (date type column) is set as an
index in the dataFrame
[118]: dt = pd.date_range("2016-01-01", "2016-01-31")
       idx = pd.DatetimeIndex(dt)
       df_dropped_dates.reindex(idx)

[118]: Temperature DewPoint Humidity Sea Level PressureIn \


2016-01-01 38.0 23.0 52.0 30.03
2016-01-02 36.0 18.0 46.0 30.02
2016-01-03 NaN NaN NaN NaN
2016-01-04 25.0 9.0 44.0 30.05
2016-01-05 NaN -3.0 41.0 30.57
2016-01-06 33.0 4.0 35.0 30.50
2016-01-07 39.0 11.0 33.0 30.28
2016-01-08 NaN NaN NaN NaN
2016-01-09 NaN NaN NaN NaN
2016-01-10 50.0 46.0 71.0 29.59
2016-01-11 33.0 8.0 37.0 29.92
2016-01-12 35.0 15.0 53.0 29.85
2016-01-13 26.0 4.0 42.0 29.94
2016-01-14 NaN 12.0 47.0 29.95
2016-01-15 43.0 31.0 62.0 29.82
2016-01-16 47.0 37.0 70.0 29.52
2016-01-17 36.0 23.0 66.0 29.78
2016-01-18 25.0 6.0 53.0 29.83
2016-01-19 22.0 3.0 42.0 30.03
2016-01-20 32.0 15.0 49.0 30.13
2016-01-21 31.0 11.0 45.0 30.15
2016-01-22 26.0 6.0 41.0 30.21
2016-01-23 NaN 21.0 78.0 29.77
2016-01-24 NaN 11.0 53.0 29.92
2016-01-25 NaN 18.0 54.0 30.25
2016-01-26 43.0 29.0 56.0 30.03
2016-01-27 41.0 22.0 45.0 30.03
2016-01-28 37.0 20.0 51.0 29.90
2016-01-29 36.0 21.0 50.0 29.58
2016-01-30 34.0 16.0 46.0 30.01
2016-01-31 46.0 28.0 52.0 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover \


2016-01-01 10.0 8.0 0 5.0

2016-01-02 10.0 7.0 0 3.0
2016-01-03 NaN NaN NaN NaN
2016-01-04 10.0 9.0 0 3.0
2016-01-05 10.0 5.0 0 0.0
2016-01-06 10.0 4.0 0 0.0
2016-01-07 10.0 2.0 0 3.0
2016-01-08 NaN NaN NaN NaN
2016-01-09 NaN NaN NaN NaN
2016-01-10 4.0 NaN 1.8 7.0
2016-01-11 10.0 NaN 0 1.0
2016-01-12 10.0 6.0 T 4.0
2016-01-13 10.0 10.0 0 0.0
2016-01-14 10.0 5.0 T 7.0
2016-01-15 9.0 5.0 T 2.0
2016-01-16 8.0 7.0 0.24 7.0
2016-01-17 8.0 6.0 0.05 6.0
2016-01-18 9.0 12.0 T 2.0
2016-01-19 10.0 11.0 0 1.0
2016-01-20 10.0 6.0 0 2.0
2016-01-21 10.0 6.0 0 1.0
2016-01-22 9.0 NaN 0.01 3.0
2016-01-23 1.0 16.0 2.31 8.0
2016-01-24 8.0 6.0 T 3.0
2016-01-25 10.0 3.0 0 2.0
2016-01-26 10.0 7.0 0 2.0
2016-01-27 10.0 7.0 T 3.0
2016-01-28 10.0 5.0 0 1.0
2016-01-29 10.0 8.0 0 4.0
2016-01-30 10.0 7.0 0 0.0
2016-01-31 10.0 5.0 0 0.0

Events WindDirDegrees
2016-01-01 NaN 281.0
2016-01-02 NaN 275.0
2016-01-03 NaN NaN
2016-01-04 NaN 345.0
2016-01-05 NaN 333.0
2016-01-06 NaN NaN
2016-01-07 NaN 293.0
2016-01-08 NaN NaN
2016-01-09 NaN NaN
2016-01-10 Rain 109.0
2016-01-11 NaN 289.0
2016-01-12 NaN 235.0
2016-01-13 NaN 284.0
2016-01-14 NaN NaN
2016-01-15 NaN 101.0

2016-01-16 Rain 340.0
2016-01-17 Fog-Snow 345.0
2016-01-18 Snow 293.0
2016-01-19 NaN NaN
2016-01-20 NaN 302.0
2016-01-21 NaN 312.0
2016-01-22 Snow 34.0
2016-01-23 Fog-Snow 42.0
2016-01-24 Snow 327.0
2016-01-25 NaN 286.0
2016-01-26 NaN NaN
2016-01-27 Rain NaN
2016-01-28 NaN NaN
2016-01-29 NaN NaN
2016-01-30 NaN 257.0
2016-01-31 NaN NaN

As you can see, the missing dates have been added to the dataFrame with all of their
values set to NaN.

Now I will use interpolation with the time method to fill the NaN values based on the
dates.
[119]: # convert object-typed columns to their proper (numeric) dtypes where possible
# note: infer_objects returns a new dataFrame, it does not modify df_dropped_dates in place

# This call works on newer versions of pandas
# df_dropped_dates.infer_objects(copy=False)

# This call will work on older versions of pandas
df_dropped_dates.infer_objects()

# only select those columns whose data-type is numeric and then␣
↪interpolate using the time method
df_interpolate_using_time = df_dropped_dates.select_dtypes(include=["number"]).
↪interpolate(method="time",limit_direction="both")

df_interpolate_using_time

[119]: Temperature DewPoint Humidity Sea Level PressureIn \


EST
2016-01-01 38.00 23 52 30.03
2016-01-02 36.00 18 46 30.02
2016-01-04 25.00 9 44 30.05
2016-01-05 29.00 -3 41 30.57
2016-01-06 33.00 4 35 30.50
2016-01-07 39.00 11 33 30.28
2016-01-10 50.00 46 71 29.59
2016-01-11 33.00 8 37 29.92
2016-01-12 35.00 15 53 29.85

2016-01-13 26.00 4 42 29.94
2016-01-14 34.50 12 47 29.95
2016-01-15 43.00 31 62 29.82
2016-01-16 47.00 37 70 29.52
2016-01-17 36.00 23 66 29.78
2016-01-18 25.00 6 53 29.83
2016-01-19 22.00 3 42 30.03
2016-01-20 32.00 15 49 30.13
2016-01-21 31.00 11 45 30.15
2016-01-22 26.00 6 41 30.21
2016-01-23 30.25 21 78 29.77
2016-01-24 34.50 11 53 29.92
2016-01-25 38.75 18 54 30.25
2016-01-26 43.00 29 56 30.03
2016-01-27 41.00 22 45 30.03
2016-01-28 37.00 20 51 29.90
2016-01-29 36.00 21 50 29.58
2016-01-30 34.00 16 46 30.01
2016-01-31 46.00 28 52 29.90

VisibilityMiles WindSpeedMPH CloudCover WindDirDegrees


EST
2016-01-01 10 8.0 5 281.0
2016-01-02 10 7.0 3 275.0
2016-01-04 10 9.0 3 345.0
2016-01-05 10 5.0 0 333.0
2016-01-06 10 4.0 0 313.0
2016-01-07 10 2.0 3 293.0
2016-01-10 4 4.4 7 109.0
2016-01-11 10 5.2 1 289.0
2016-01-12 10 6.0 4 235.0
2016-01-13 10 10.0 0 284.0
2016-01-14 10 5.0 7 192.5
2016-01-15 9 5.0 2 101.0
2016-01-16 8 7.0 7 340.0
2016-01-17 8 6.0 6 345.0
2016-01-18 9 12.0 2 293.0
2016-01-19 10 11.0 1 297.5
2016-01-20 10 6.0 2 302.0
2016-01-21 10 6.0 1 312.0
2016-01-22 9 11.0 3 34.0
2016-01-23 1 16.0 8 42.0
2016-01-24 8 6.0 3 327.0
2016-01-25 10 3.0 2 286.0
2016-01-26 10 7.0 2 280.2
2016-01-27 10 7.0 3 274.4
2016-01-28 10 5.0 1 268.6

2016-01-29 10 8.0 4 262.8
2016-01-30 10 7.0 0 257.0
2016-01-31 10 5.0 0 257.0

2.6.1 What gives, why are the rows that were all NaN after reindexing missing from the
interpolated result?
Explanation:
• The reindexed dataFrame from the previous step was never assigned back: df_dropped_dates.reindex(idx)
returns a new dataFrame, so the interpolation above still ran on df_dropped_dates, which simply
does not contain 2016-01-03, 2016-01-08 and 2016-01-09.
• interpolate() itself never drops rows; to fill the newly added all-NaN rows, assign the reindexed
result first and then interpolate, as shown in the sketch below.
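
Here is a minimal sketch of the corrected flow (not one of the executed cells above; it assumes
df_dropped_dates from cell [117] is still in memory, and df_full / df_filled are just illustrative
names):

# keep the result of reindex, so the newly added all-NaN rows become part of the frame
dt = pd.date_range("2016-01-01", "2016-01-31")
df_full = df_dropped_dates.reindex(dt)

# method="time" weights the interpolation by the gap between dates, so the rows for
# 2016-01-03, 2016-01-08 and 2016-01-09 now get filled as well
df_filled = df_full.select_dtypes(include=["number"]).interpolate(
    method="time", limit_direction="both"
)
df_filled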

2.7 Handle missing data using the replace function


[120]: import numpy as np
import pandas as pd
# pc path
df_missing_data = pd.read_excel("/home/aditya/github/Deep-learning-prerequisite/
↪pandas/dataset/nyc_weather_missing_data_demo.xlsx")

# macbook path
# df_missing_data = pd.read_excel("/home/machine_learning/github/
↪Deep-learning-prerequisite/pandas/dataset/nyc_weather_missing_data_demo.

↪xlsx", engine="openpyxl")

df_missing_data

[120]: EST Temperature DewPoint Humidity Sea Level PressureIn \


0 1/1/2016 38.0 23 52 30.03
1 1/2/2016 36.0 18 46 30.02
2 1/3/2016 40.0 21 47 29.86
3 1/4/2016 25.0 9 44 30.05
4 1/5/2016 NaN -3 41 30.57
5 1/6/2016 33.0 4 35 30.50
6 1/7/2016 39.0 11 33 30.28
7 1/8/2016 39.0 29 64 30.20
8 1/9/2016 44.0 38 77 30.16
9 1/10/2016 50.0 46 71 29.59
10 1/11/2016 33.0 8 37 29.92
11 1/12/2016 35.0 15 53 29.85
12 1/13/2016 26.0 4 42 29.94
13 1/14/2016 NaN 12 47 29.95
14 1/15/2016 43.0 31 62 29.82
15 1/16/2016 47.0 37 70 29.52
16 1/17/2016 36.0 23 66 29.78
17 1/18/2016 25.0 6 53 29.83
18 1/19/2016 22.0 3 42 30.03

19 1/20/2016 32.0 15 49 30.13
20 1/21/2016 31.0 11 45 30.15
21 1/22/2016 26.0 6 41 30.21
22 1/23/2016 NaN 21 78 29.77
23 1/24/2016 NaN 11 53 29.92
24 1/25/2016 NaN 18 54 30.25
25 1/26/2016 43.0 29 56 30.03
26 1/27/2016 41.0 22 45 30.03
27 1/28/2016 37.0 20 51 29.90
28 1/29/2016 36.0 21 50 29.58
29 1/30/2016 34.0 16 46 30.01
30 1/31/2016 46.0 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover Events \


0 10 8.0 0 5 NaN
1 10 7.0 0 3 NaN
2 10 8.0 0 1 NaN
3 10 9.0 0 3 NaN
4 10 5.0 0 0 NaN
5 10 4.0 0 0 NaN
6 10 2.0 0 3 NaN
7 10 4.0 0 8 NaN
8 9 8.0 T 8 Rain
9 4 NaN 1.8 7 Rain
10 10 NaN 0 1 NaN
11 10 6.0 T 4 NaN
12 10 10.0 0 0 NaN
13 10 5.0 T 7 NaN
14 9 5.0 T 2 NaN
15 8 7.0 0.24 7 Rain
16 8 6.0 0.05 6 Fog-Snow
17 9 12.0 T 2 Snow
18 10 11.0 0 1 NaN
19 10 6.0 0 2 NaN
20 10 6.0 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16.0 2.31 8 Fog-Snow
23 8 6.0 T 3 Snow
24 10 3.0 0 2 NaN
25 10 7.0 0 2 NaN
26 10 7.0 T 3 Rain
27 10 5.0 0 1 NaN
28 10 8.0 0 4 NaN
29 10 7.0 0 0 NaN
30 10 5.0 0 0 NaN

WindDirDegrees

0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
29 257.0
30 NaN

Make EST, a date column, the index of the dataFrame

Let's check if the EST column is of string type


[121]: df_missing_data["EST"].apply(lambda x: isinstance(x,str)).all()

[121]: True

This goes to show that the EST column contains dates stored as strings

Convert the EST column values into datetime values


[122]: df_missing_data["EST"] = pd.to_datetime(df_missing_data["EST"],format="%m/%d/
↪%Y",errors="coerce")

df_missing_data

[122]: EST Temperature DewPoint Humidity Sea Level PressureIn \
0 2016-01-01 38.0 23 52 30.03
1 2016-01-02 36.0 18 46 30.02
2 2016-01-03 40.0 21 47 29.86
3 2016-01-04 25.0 9 44 30.05
4 2016-01-05 NaN -3 41 30.57
5 2016-01-06 33.0 4 35 30.50
6 2016-01-07 39.0 11 33 30.28
7 2016-01-08 39.0 29 64 30.20
8 2016-01-09 44.0 38 77 30.16
9 2016-01-10 50.0 46 71 29.59
10 2016-01-11 33.0 8 37 29.92
11 2016-01-12 35.0 15 53 29.85
12 2016-01-13 26.0 4 42 29.94
13 2016-01-14 NaN 12 47 29.95
14 2016-01-15 43.0 31 62 29.82
15 2016-01-16 47.0 37 70 29.52
16 2016-01-17 36.0 23 66 29.78
17 2016-01-18 25.0 6 53 29.83
18 2016-01-19 22.0 3 42 30.03
19 2016-01-20 32.0 15 49 30.13
20 2016-01-21 31.0 11 45 30.15
21 2016-01-22 26.0 6 41 30.21
22 2016-01-23 NaN 21 78 29.77
23 2016-01-24 NaN 11 53 29.92
24 2016-01-25 NaN 18 54 30.25
25 2016-01-26 43.0 29 56 30.03
26 2016-01-27 41.0 22 45 30.03
27 2016-01-28 37.0 20 51 29.90
28 2016-01-29 36.0 21 50 29.58
29 2016-01-30 34.0 16 46 30.01
30 2016-01-31 46.0 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover Events \


0 10 8.0 0 5 NaN
1 10 7.0 0 3 NaN
2 10 8.0 0 1 NaN
3 10 9.0 0 3 NaN
4 10 5.0 0 0 NaN
5 10 4.0 0 0 NaN
6 10 2.0 0 3 NaN
7 10 4.0 0 8 NaN
8 9 8.0 T 8 Rain
9 4 NaN 1.8 7 Rain
10 10 NaN 0 1 NaN
11 10 6.0 T 4 NaN
12 10 10.0 0 0 NaN

13 10 5.0 T 7 NaN
14 9 5.0 T 2 NaN
15 8 7.0 0.24 7 Rain
16 8 6.0 0.05 6 Fog-Snow
17 9 12.0 T 2 Snow
18 10 11.0 0 1 NaN
19 10 6.0 0 2 NaN
20 10 6.0 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16.0 2.31 8 Fog-Snow
23 8 6.0 T 3 Snow
24 10 3.0 0 2 NaN
25 10 7.0 0 2 NaN
26 10 7.0 T 3 Rain
27 10 5.0 0 1 NaN
28 10 8.0 0 4 NaN
29 10 7.0 0 0 NaN
30 10 5.0 0 0 NaN

WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN

27 NaN
28 NaN
29 257.0
30 NaN

Set EST as the index of the dataFrame


[123]: df_missing_data.set_index("EST",inplace=True)
df_missing_data

[123]: Temperature DewPoint Humidity Sea Level PressureIn \


EST
2016-01-01 38.0 23 52 30.03
2016-01-02 36.0 18 46 30.02
2016-01-03 40.0 21 47 29.86
2016-01-04 25.0 9 44 30.05
2016-01-05 NaN -3 41 30.57
2016-01-06 33.0 4 35 30.50
2016-01-07 39.0 11 33 30.28
2016-01-08 39.0 29 64 30.20
2016-01-09 44.0 38 77 30.16
2016-01-10 50.0 46 71 29.59
2016-01-11 33.0 8 37 29.92
2016-01-12 35.0 15 53 29.85
2016-01-13 26.0 4 42 29.94
2016-01-14 NaN 12 47 29.95
2016-01-15 43.0 31 62 29.82
2016-01-16 47.0 37 70 29.52
2016-01-17 36.0 23 66 29.78
2016-01-18 25.0 6 53 29.83
2016-01-19 22.0 3 42 30.03
2016-01-20 32.0 15 49 30.13
2016-01-21 31.0 11 45 30.15
2016-01-22 26.0 6 41 30.21
2016-01-23 NaN 21 78 29.77
2016-01-24 NaN 11 53 29.92
2016-01-25 NaN 18 54 30.25
2016-01-26 43.0 29 56 30.03
2016-01-27 41.0 22 45 30.03
2016-01-28 37.0 20 51 29.90
2016-01-29 36.0 21 50 29.58
2016-01-30 34.0 16 46 30.01
2016-01-31 46.0 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover \


EST
2016-01-01 10 8.0 0 5
2016-01-02 10 7.0 0 3

2016-01-03 10 8.0 0 1
2016-01-04 10 9.0 0 3
2016-01-05 10 5.0 0 0
2016-01-06 10 4.0 0 0
2016-01-07 10 2.0 0 3
2016-01-08 10 4.0 0 8
2016-01-09 9 8.0 T 8
2016-01-10 4 NaN 1.8 7
2016-01-11 10 NaN 0 1
2016-01-12 10 6.0 T 4
2016-01-13 10 10.0 0 0
2016-01-14 10 5.0 T 7
2016-01-15 9 5.0 T 2
2016-01-16 8 7.0 0.24 7
2016-01-17 8 6.0 0.05 6
2016-01-18 9 12.0 T 2
2016-01-19 10 11.0 0 1
2016-01-20 10 6.0 0 2
2016-01-21 10 6.0 0 1
2016-01-22 9 NaN 0.01 3
2016-01-23 1 16.0 2.31 8
2016-01-24 8 6.0 T 3
2016-01-25 10 3.0 0 2
2016-01-26 10 7.0 0 2
2016-01-27 10 7.0 T 3
2016-01-28 10 5.0 0 1
2016-01-29 10 8.0 0 4
2016-01-30 10 7.0 0 0
2016-01-31 10 5.0 0 0

Events WindDirDegrees
EST
2016-01-01 NaN 281.0
2016-01-02 NaN 275.0
2016-01-03 NaN 277.0
2016-01-04 NaN 345.0
2016-01-05 NaN 333.0
2016-01-06 NaN NaN
2016-01-07 NaN 293.0
2016-01-08 NaN 79.0
2016-01-09 Rain 76.0
2016-01-10 Rain 109.0
2016-01-11 NaN 289.0
2016-01-12 NaN 235.0
2016-01-13 NaN 284.0
2016-01-14 NaN NaN
2016-01-15 NaN 101.0

2016-01-16 Rain 340.0
2016-01-17 Fog-Snow 345.0
2016-01-18 Snow 293.0
2016-01-19 NaN NaN
2016-01-20 NaN 302.0
2016-01-21 NaN 312.0
2016-01-22 Snow 34.0
2016-01-23 Fog-Snow 42.0
2016-01-24 Snow 327.0
2016-01-25 NaN 286.0
2016-01-26 NaN NaN
2016-01-27 Rain NaN
2016-01-28 NaN NaN
2016-01-29 NaN NaN
2016-01-30 NaN 257.0
2016-01-31 NaN NaN

2.8 Use the replace function to replace -ve values with NaN in the dataFrame


• This is just to test what the replace function does.
• replace does not modify the original dataFrame; it returns a new copy with the changes
applied (a quick check of this follows the output below).

[124]: replaced_df = df_missing_data.replace(-3, np.nan)
replaced_df

[124]: Temperature DewPoint Humidity Sea Level PressureIn \


EST
2016-01-01 38.0 23.0 52 30.03
2016-01-02 36.0 18.0 46 30.02
2016-01-03 40.0 21.0 47 29.86
2016-01-04 25.0 9.0 44 30.05
2016-01-05 NaN NaN 41 30.57
2016-01-06 33.0 4.0 35 30.50
2016-01-07 39.0 11.0 33 30.28
2016-01-08 39.0 29.0 64 30.20
2016-01-09 44.0 38.0 77 30.16
2016-01-10 50.0 46.0 71 29.59
2016-01-11 33.0 8.0 37 29.92
2016-01-12 35.0 15.0 53 29.85
2016-01-13 26.0 4.0 42 29.94
2016-01-14 NaN 12.0 47 29.95
2016-01-15 43.0 31.0 62 29.82
2016-01-16 47.0 37.0 70 29.52
2016-01-17 36.0 23.0 66 29.78
2016-01-18 25.0 6.0 53 29.83
2016-01-19 22.0 3.0 42 30.03

2016-01-20 32.0 15.0 49 30.13
2016-01-21 31.0 11.0 45 30.15
2016-01-22 26.0 6.0 41 30.21
2016-01-23 NaN 21.0 78 29.77
2016-01-24 NaN 11.0 53 29.92
2016-01-25 NaN 18.0 54 30.25
2016-01-26 43.0 29.0 56 30.03
2016-01-27 41.0 22.0 45 30.03
2016-01-28 37.0 20.0 51 29.90
2016-01-29 36.0 21.0 50 29.58
2016-01-30 34.0 16.0 46 30.01
2016-01-31 46.0 28.0 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover \


EST
2016-01-01 10 8.0 0 5
2016-01-02 10 7.0 0 3
2016-01-03 10 8.0 0 1
2016-01-04 10 9.0 0 3
2016-01-05 10 5.0 0 0
2016-01-06 10 4.0 0 0
2016-01-07 10 2.0 0 3
2016-01-08 10 4.0 0 8
2016-01-09 9 8.0 T 8
2016-01-10 4 NaN 1.8 7
2016-01-11 10 NaN 0 1
2016-01-12 10 6.0 T 4
2016-01-13 10 10.0 0 0
2016-01-14 10 5.0 T 7
2016-01-15 9 5.0 T 2
2016-01-16 8 7.0 0.24 7
2016-01-17 8 6.0 0.05 6
2016-01-18 9 12.0 T 2
2016-01-19 10 11.0 0 1
2016-01-20 10 6.0 0 2
2016-01-21 10 6.0 0 1
2016-01-22 9 NaN 0.01 3
2016-01-23 1 16.0 2.31 8
2016-01-24 8 6.0 T 3
2016-01-25 10 3.0 0 2
2016-01-26 10 7.0 0 2
2016-01-27 10 7.0 T 3
2016-01-28 10 5.0 0 1
2016-01-29 10 8.0 0 4
2016-01-30 10 7.0 0 0
2016-01-31 10 5.0 0 0

Events WindDirDegrees
EST
2016-01-01 NaN 281.0
2016-01-02 NaN 275.0
2016-01-03 NaN 277.0
2016-01-04 NaN 345.0
2016-01-05 NaN 333.0
2016-01-06 NaN NaN
2016-01-07 NaN 293.0
2016-01-08 NaN 79.0
2016-01-09 Rain 76.0
2016-01-10 Rain 109.0
2016-01-11 NaN 289.0
2016-01-12 NaN 235.0
2016-01-13 NaN 284.0
2016-01-14 NaN NaN
2016-01-15 NaN 101.0
2016-01-16 Rain 340.0
2016-01-17 Fog-Snow 345.0
2016-01-18 Snow 293.0
2016-01-19 NaN NaN
2016-01-20 NaN 302.0
2016-01-21 NaN 312.0
2016-01-22 Snow 34.0
2016-01-23 Fog-Snow 42.0
2016-01-24 Snow 327.0
2016-01-25 NaN 286.0
2016-01-26 NaN NaN
2016-01-27 Rain NaN
2016-01-28 NaN NaN
2016-01-29 NaN NaN
2016-01-30 NaN 257.0
2016-01-31 NaN NaN

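As a quick check that replace really returned a copy (a small sketch, not one of the executed
cells; it only relies on the values shown in the outputs above):

# the original frame still has -3 for DewPoint on 2016-01-05 ...
assert df_missing_data.loc["2016-01-05", "DewPoint"] == -3
# ... while the copy returned by replace has NaN there
assert pd.isna(replaced_df.loc["2016-01-05", "DewPoint"])
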
Let's replace the NaN values in the Events column with 0 using the replace function
[125]: replaced_df["Events"] = replaced_df["Events"].replace(np.nan, 0)
replaced_df

[125]: Temperature DewPoint Humidity Sea Level PressureIn \


EST
2016-01-01 38.0 23.0 52 30.03
2016-01-02 36.0 18.0 46 30.02
2016-01-03 40.0 21.0 47 29.86
2016-01-04 25.0 9.0 44 30.05
2016-01-05 NaN NaN 41 30.57
2016-01-06 33.0 4.0 35 30.50
2016-01-07 39.0 11.0 33 30.28

2016-01-08 39.0 29.0 64 30.20
2016-01-09 44.0 38.0 77 30.16
2016-01-10 50.0 46.0 71 29.59
2016-01-11 33.0 8.0 37 29.92
2016-01-12 35.0 15.0 53 29.85
2016-01-13 26.0 4.0 42 29.94
2016-01-14 NaN 12.0 47 29.95
2016-01-15 43.0 31.0 62 29.82
2016-01-16 47.0 37.0 70 29.52
2016-01-17 36.0 23.0 66 29.78
2016-01-18 25.0 6.0 53 29.83
2016-01-19 22.0 3.0 42 30.03
2016-01-20 32.0 15.0 49 30.13
2016-01-21 31.0 11.0 45 30.15
2016-01-22 26.0 6.0 41 30.21
2016-01-23 NaN 21.0 78 29.77
2016-01-24 NaN 11.0 53 29.92
2016-01-25 NaN 18.0 54 30.25
2016-01-26 43.0 29.0 56 30.03
2016-01-27 41.0 22.0 45 30.03
2016-01-28 37.0 20.0 51 29.90
2016-01-29 36.0 21.0 50 29.58
2016-01-30 34.0 16.0 46 30.01
2016-01-31 46.0 28.0 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover \


EST
2016-01-01 10 8.0 0 5
2016-01-02 10 7.0 0 3
2016-01-03 10 8.0 0 1
2016-01-04 10 9.0 0 3
2016-01-05 10 5.0 0 0
2016-01-06 10 4.0 0 0
2016-01-07 10 2.0 0 3
2016-01-08 10 4.0 0 8
2016-01-09 9 8.0 T 8
2016-01-10 4 NaN 1.8 7
2016-01-11 10 NaN 0 1
2016-01-12 10 6.0 T 4
2016-01-13 10 10.0 0 0
2016-01-14 10 5.0 T 7
2016-01-15 9 5.0 T 2
2016-01-16 8 7.0 0.24 7
2016-01-17 8 6.0 0.05 6
2016-01-18 9 12.0 T 2
2016-01-19 10 11.0 0 1
2016-01-20 10 6.0 0 2

2016-01-21 10 6.0 0 1
2016-01-22 9 NaN 0.01 3
2016-01-23 1 16.0 2.31 8
2016-01-24 8 6.0 T 3
2016-01-25 10 3.0 0 2
2016-01-26 10 7.0 0 2
2016-01-27 10 7.0 T 3
2016-01-28 10 5.0 0 1
2016-01-29 10 8.0 0 4
2016-01-30 10 7.0 0 0
2016-01-31 10 5.0 0 0

Events WindDirDegrees
EST
2016-01-01 0 281.0
2016-01-02 0 275.0
2016-01-03 0 277.0
2016-01-04 0 345.0
2016-01-05 0 333.0
2016-01-06 0 NaN
2016-01-07 0 293.0
2016-01-08 0 79.0
2016-01-09 Rain 76.0
2016-01-10 Rain 109.0
2016-01-11 0 289.0
2016-01-12 0 235.0
2016-01-13 0 284.0
2016-01-14 0 NaN
2016-01-15 0 101.0
2016-01-16 Rain 340.0
2016-01-17 Fog-Snow 345.0
2016-01-18 Snow 293.0
2016-01-19 0 NaN
2016-01-20 0 302.0
2016-01-21 0 312.0
2016-01-22 Snow 34.0
2016-01-23 Fog-Snow 42.0
2016-01-24 Snow 327.0
2016-01-25 0 286.0
2016-01-26 0 NaN
2016-01-27 Rain NaN
2016-01-28 0 NaN
2016-01-29 0 NaN
2016-01-30 0 257.0
2016-01-31 0 NaN
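
For what it's worth, the same result can be obtained with fillna, which is purpose-built for
filling NaN values (a one-line sketch):

replaced_df["Events"] = replaced_df["Events"].fillna(0)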

How to replace values in multiple columns using the replace function

• I have chosen the WindDirDegrees column for this demo. Since it has NaN values, I am
going to replace them with an arbitrary negative number, -989, for this demo.

[126]: replaced_df["WindDirDegrees"] = replaced_df["WindDirDegrees"].replace(np.


↪NaN,-989)

replaced_df

[126]: Temperature DewPoint Humidity Sea Level PressureIn \


EST
2016-01-01 38.0 23.0 52 30.03
2016-01-02 36.0 18.0 46 30.02
2016-01-03 40.0 21.0 47 29.86
2016-01-04 25.0 9.0 44 30.05
2016-01-05 NaN NaN 41 30.57
2016-01-06 33.0 4.0 35 30.50
2016-01-07 39.0 11.0 33 30.28
2016-01-08 39.0 29.0 64 30.20
2016-01-09 44.0 38.0 77 30.16
2016-01-10 50.0 46.0 71 29.59
2016-01-11 33.0 8.0 37 29.92
2016-01-12 35.0 15.0 53 29.85
2016-01-13 26.0 4.0 42 29.94
2016-01-14 NaN 12.0 47 29.95
2016-01-15 43.0 31.0 62 29.82
2016-01-16 47.0 37.0 70 29.52
2016-01-17 36.0 23.0 66 29.78
2016-01-18 25.0 6.0 53 29.83
2016-01-19 22.0 3.0 42 30.03
2016-01-20 32.0 15.0 49 30.13
2016-01-21 31.0 11.0 45 30.15
2016-01-22 26.0 6.0 41 30.21
2016-01-23 NaN 21.0 78 29.77
2016-01-24 NaN 11.0 53 29.92
2016-01-25 NaN 18.0 54 30.25
2016-01-26 43.0 29.0 56 30.03
2016-01-27 41.0 22.0 45 30.03
2016-01-28 37.0 20.0 51 29.90
2016-01-29 36.0 21.0 50 29.58
2016-01-30 34.0 16.0 46 30.01
2016-01-31 46.0 28.0 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover \


EST
2016-01-01 10 8.0 0 5
2016-01-02 10 7.0 0 3
2016-01-03 10 8.0 0 1
2016-01-04 10 9.0 0 3

2016-01-05 10 5.0 0 0
2016-01-06 10 4.0 0 0
2016-01-07 10 2.0 0 3
2016-01-08 10 4.0 0 8
2016-01-09 9 8.0 T 8
2016-01-10 4 NaN 1.8 7
2016-01-11 10 NaN 0 1
2016-01-12 10 6.0 T 4
2016-01-13 10 10.0 0 0
2016-01-14 10 5.0 T 7
2016-01-15 9 5.0 T 2
2016-01-16 8 7.0 0.24 7
2016-01-17 8 6.0 0.05 6
2016-01-18 9 12.0 T 2
2016-01-19 10 11.0 0 1
2016-01-20 10 6.0 0 2
2016-01-21 10 6.0 0 1
2016-01-22 9 NaN 0.01 3
2016-01-23 1 16.0 2.31 8
2016-01-24 8 6.0 T 3
2016-01-25 10 3.0 0 2
2016-01-26 10 7.0 0 2
2016-01-27 10 7.0 T 3
2016-01-28 10 5.0 0 1
2016-01-29 10 8.0 0 4
2016-01-30 10 7.0 0 0
2016-01-31 10 5.0 0 0

Events WindDirDegrees
EST
2016-01-01 0 281.0
2016-01-02 0 275.0
2016-01-03 0 277.0
2016-01-04 0 345.0
2016-01-05 0 333.0
2016-01-06 0 -989.0
2016-01-07 0 293.0
2016-01-08 0 79.0
2016-01-09 Rain 76.0
2016-01-10 Rain 109.0
2016-01-11 0 289.0
2016-01-12 0 235.0
2016-01-13 0 284.0
2016-01-14 0 -989.0
2016-01-15 0 101.0
2016-01-16 Rain 340.0
2016-01-17 Fog-Snow 345.0

2016-01-18 Snow 293.0
2016-01-19 0 -989.0
2016-01-20 0 302.0
2016-01-21 0 312.0
2016-01-22 Snow 34.0
2016-01-23 Fog-Snow 42.0
2016-01-24 Snow 327.0
2016-01-25 0 286.0
2016-01-26 0 -989.0
2016-01-27 Rain -989.0
2016-01-28 0 -989.0
2016-01-29 0 -989.0
2016-01-30 0 257.0
2016-01-31 0 -989.0

2.8.1 Now I can use the replace function with a dictionary to replace values in multiple
columns of the dataFrame in a single call

[127]: replaced_df = replaced_df.replace({
"Events" : 0,
"WindDirDegrees": -989.0
}, np.nan)
replaced_df

[127]: Temperature DewPoint Humidity Sea Level PressureIn \


EST
2016-01-01 38.0 23.0 52 30.03
2016-01-02 36.0 18.0 46 30.02
2016-01-03 40.0 21.0 47 29.86
2016-01-04 25.0 9.0 44 30.05
2016-01-05 NaN NaN 41 30.57
2016-01-06 33.0 4.0 35 30.50
2016-01-07 39.0 11.0 33 30.28
2016-01-08 39.0 29.0 64 30.20
2016-01-09 44.0 38.0 77 30.16
2016-01-10 50.0 46.0 71 29.59
2016-01-11 33.0 8.0 37 29.92
2016-01-12 35.0 15.0 53 29.85
2016-01-13 26.0 4.0 42 29.94
2016-01-14 NaN 12.0 47 29.95
2016-01-15 43.0 31.0 62 29.82
2016-01-16 47.0 37.0 70 29.52
2016-01-17 36.0 23.0 66 29.78
2016-01-18 25.0 6.0 53 29.83
2016-01-19 22.0 3.0 42 30.03
2016-01-20 32.0 15.0 49 30.13
2016-01-21 31.0 11.0 45 30.15

2016-01-22 26.0 6.0 41 30.21
2016-01-23 NaN 21.0 78 29.77
2016-01-24 NaN 11.0 53 29.92
2016-01-25 NaN 18.0 54 30.25
2016-01-26 43.0 29.0 56 30.03
2016-01-27 41.0 22.0 45 30.03
2016-01-28 37.0 20.0 51 29.90
2016-01-29 36.0 21.0 50 29.58
2016-01-30 34.0 16.0 46 30.01
2016-01-31 46.0 28.0 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover \


EST
2016-01-01 10 8.0 0 5
2016-01-02 10 7.0 0 3
2016-01-03 10 8.0 0 1
2016-01-04 10 9.0 0 3
2016-01-05 10 5.0 0 0
2016-01-06 10 4.0 0 0
2016-01-07 10 2.0 0 3
2016-01-08 10 4.0 0 8
2016-01-09 9 8.0 T 8
2016-01-10 4 NaN 1.8 7
2016-01-11 10 NaN 0 1
2016-01-12 10 6.0 T 4
2016-01-13 10 10.0 0 0
2016-01-14 10 5.0 T 7
2016-01-15 9 5.0 T 2
2016-01-16 8 7.0 0.24 7
2016-01-17 8 6.0 0.05 6
2016-01-18 9 12.0 T 2
2016-01-19 10 11.0 0 1
2016-01-20 10 6.0 0 2
2016-01-21 10 6.0 0 1
2016-01-22 9 NaN 0.01 3
2016-01-23 1 16.0 2.31 8
2016-01-24 8 6.0 T 3
2016-01-25 10 3.0 0 2
2016-01-26 10 7.0 0 2
2016-01-27 10 7.0 T 3
2016-01-28 10 5.0 0 1
2016-01-29 10 8.0 0 4
2016-01-30 10 7.0 0 0
2016-01-31 10 5.0 0 0

Events WindDirDegrees
EST

2016-01-01 NaN 281.0
2016-01-02 NaN 275.0
2016-01-03 NaN 277.0
2016-01-04 NaN 345.0
2016-01-05 NaN 333.0
2016-01-06 NaN NaN
2016-01-07 NaN 293.0
2016-01-08 NaN 79.0
2016-01-09 Rain 76.0
2016-01-10 Rain 109.0
2016-01-11 NaN 289.0
2016-01-12 NaN 235.0
2016-01-13 NaN 284.0
2016-01-14 NaN NaN
2016-01-15 NaN 101.0
2016-01-16 Rain 340.0
2016-01-17 Fog-Snow 345.0
2016-01-18 Snow 293.0
2016-01-19 NaN NaN
2016-01-20 NaN 302.0
2016-01-21 NaN 312.0
2016-01-22 Snow 34.0
2016-01-23 Fog-Snow 42.0
2016-01-24 Snow 327.0
2016-01-25 NaN 286.0
2016-01-26 NaN NaN
2016-01-27 Rain NaN
2016-01-28 NaN NaN
2016-01-29 NaN NaN
2016-01-30 NaN 257.0
2016-01-31 NaN NaN
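
As a side note, replace also accepts a nested dictionary that maps a column name to
{old value: new value}, so the same per-column replacement can be written in one call
(a sketch using the values from above):

replaced_df = replaced_df.replace({
    "Events": {0: np.nan},
    "WindDirDegrees": {-989.0: np.nan},
})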

2.8.2 Let's perform interpolation using the time method, just for fun, shall we?

[128]: # convert object-typed columns to their proper (numeric) dtypes where possible
# note: infer_objects returns a new dataFrame, so this call alone does not change replaced_df

# This call works on newer versions of pandas
replaced_df.infer_objects(copy=False)
# replaced_df.infer_objects()

# perform interpolation using the time method on the numeric columns of the dataFrame
interpolated_df = replaced_df.select_dtypes(include=["number"]).
↪interpolate(method="time")

interpolated_df

[128]: Temperature DewPoint Humidity Sea Level PressureIn \


EST
2016-01-01 38.00 23.0 52 30.03

2016-01-02 36.00 18.0 46 30.02
2016-01-03 40.00 21.0 47 29.86
2016-01-04 25.00 9.0 44 30.05
2016-01-05 29.00 6.5 41 30.57
2016-01-06 33.00 4.0 35 30.50
2016-01-07 39.00 11.0 33 30.28
2016-01-08 39.00 29.0 64 30.20
2016-01-09 44.00 38.0 77 30.16
2016-01-10 50.00 46.0 71 29.59
2016-01-11 33.00 8.0 37 29.92
2016-01-12 35.00 15.0 53 29.85
2016-01-13 26.00 4.0 42 29.94
2016-01-14 34.50 12.0 47 29.95
2016-01-15 43.00 31.0 62 29.82
2016-01-16 47.00 37.0 70 29.52
2016-01-17 36.00 23.0 66 29.78
2016-01-18 25.00 6.0 53 29.83
2016-01-19 22.00 3.0 42 30.03
2016-01-20 32.00 15.0 49 30.13
2016-01-21 31.00 11.0 45 30.15
2016-01-22 26.00 6.0 41 30.21
2016-01-23 30.25 21.0 78 29.77
2016-01-24 34.50 11.0 53 29.92
2016-01-25 38.75 18.0 54 30.25
2016-01-26 43.00 29.0 56 30.03
2016-01-27 41.00 22.0 45 30.03
2016-01-28 37.00 20.0 51 29.90
2016-01-29 36.00 21.0 50 29.58
2016-01-30 34.00 16.0 46 30.01
2016-01-31 46.00 28.0 52 29.90

VisibilityMiles WindSpeedMPH CloudCover WindDirDegrees


EST
2016-01-01 10 8.000000 5 281.0
2016-01-02 10 7.000000 3 275.0
2016-01-03 10 8.000000 1 277.0
2016-01-04 10 9.000000 3 345.0
2016-01-05 10 5.000000 0 333.0
2016-01-06 10 4.000000 0 313.0
2016-01-07 10 2.000000 3 293.0
2016-01-08 10 4.000000 8 79.0
2016-01-09 9 8.000000 8 76.0
2016-01-10 4 7.333333 7 109.0
2016-01-11 10 6.666667 1 289.0
2016-01-12 10 6.000000 4 235.0
2016-01-13 10 10.000000 0 284.0
2016-01-14 10 5.000000 7 192.5

2016-01-15 9 5.000000 2 101.0
2016-01-16 8 7.000000 7 340.0
2016-01-17 8 6.000000 6 345.0
2016-01-18 9 12.000000 2 293.0
2016-01-19 10 11.000000 1 297.5
2016-01-20 10 6.000000 2 302.0
2016-01-21 10 6.000000 1 312.0
2016-01-22 9 11.000000 3 34.0
2016-01-23 1 16.000000 8 42.0
2016-01-24 8 6.000000 3 327.0
2016-01-25 10 3.000000 2 286.0
2016-01-26 10 7.000000 2 280.2
2016-01-27 10 7.000000 3 274.4
2016-01-28 10 5.000000 1 268.6
2016-01-29 10 8.000000 4 262.8
2016-01-30 10 7.000000 0 257.0
2016-01-31 10 5.000000 0 257.0

2.9 How to handle data where different types of measurement units are found
in the same column
• Example :
– The Temperature column may contain temperatures in both degrees Celsius and Fahrenheit
– The WindSpeed column may contain wind speeds in both mph and kph

[129]: # pc path
df_missing_data = pd.read_excel("/home/aditya/github/Deep-learning-prerequisite/
↪pandas/dataset/nyc_weather_different_measurement_type_demo.xlsx")

# macbook path
# df_missing_data = pd.read_excel("/home/machine_learning/github/
↪Deep-learning-prerequisite/pandas/dataset/

↪nyc_weather_different_measurement_type_demo.xlsx", engine="openpyxl")

df_missing_data

[129]: EST Temperature DewPoint Humidity Sea Level PressureIn \


0 1/1/2016 38 F 23 52 30.03
1 1/2/2016 36 F 18 46 30.02
2 1/3/2016 40 21 47 29.86
3 1/4/2016 25 9 44 30.05
4 1/5/2016 NaN -3 41 30.57
5 1/6/2016 33 C 4 35 30.50
6 1/7/2016 39 C 11 33 30.28
7 1/8/2016 39 29 64 30.20
8 1/9/2016 44 38 77 30.16
9 1/10/2016 50 46 71 29.59
10 1/11/2016 33 8 37 29.92

11 1/12/2016 35 15 53 29.85
12 1/13/2016 26 F 4 42 29.94
13 1/14/2016 NaN 12 47 29.95
14 1/15/2016 43 31 62 29.82
15 1/16/2016 47 37 70 29.52
16 1/17/2016 36 23 66 29.78
17 1/18/2016 25 6 53 29.83
18 1/19/2016 22 3 42 30.03
19 1/20/2016 32 15 49 30.13
20 1/21/2016 31 11 45 30.15
21 1/22/2016 26 6 41 30.21
22 1/23/2016 NaN 21 78 29.77
23 1/24/2016 NaN 11 53 29.92
24 1/25/2016 NaN 18 54 30.25
25 1/26/2016 43 29 56 30.03
26 1/27/2016 41 22 45 30.03
27 1/28/2016 37 20 51 29.90
28 1/29/2016 36 21 50 29.58
29 1/30/2016 34 16 46 30.01
30 1/31/2016 46 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover Events \


0 10 8 mph 0 5 NaN
1 10 7 kph 0 3 NaN
2 10 8 0 1 NaN
3 10 9 kph 0 3 NaN
4 10 5 0 0 NaN
5 10 4 mph 0 0 NaN
6 10 2 0 3 NaN
7 10 4 0 8 NaN
8 9 8 T 8 Rain
9 4 NaN 1.8 7 Rain
10 10 NaN 0 1 NaN
11 10 6 T 4 NaN
12 10 10 0 0 NaN
13 10 5 T 7 NaN
14 9 5 T 2 NaN
15 8 7 0.24 7 Rain
16 8 6 0.05 6 Fog-Snow
17 9 12 T 2 Snow
18 10 11 0 1 NaN
19 10 6 0 2 NaN
20 10 6 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16 2.31 8 Fog-Snow
23 8 6 T 3 Snow
24 10 3 0 2 NaN

25 10 7 0 2 NaN
26 10 7 T 3 Rain
27 10 5 0 1 NaN
28 10 8 0 4 NaN
29 10 7 0 0 NaN
30 10 5 0 0 NaN

WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
29 257.0
30 NaN

Now these types of situations are tricky to solve because


• I have to remove the Fahrenheit F and Celsius C suffixes from the Temperature column.
• Similarly, I will have to remove mph and kph from the WindSpeedMPH column.

The best way to handle this is to use a regex

[130]: #### The best way to handle this is to use a regex
new_df = df_missing_data.replace("[A-Za-z]","",regex=True)
new_df

[130]: EST Temperature DewPoint Humidity Sea Level PressureIn \


0 1/1/2016 38 23 52 30.03
1 1/2/2016 36 18 46 30.02
2 1/3/2016 40 21 47 29.86
3 1/4/2016 25 9 44 30.05
4 1/5/2016 NaN -3 41 30.57
5 1/6/2016 33 4 35 30.50
6 1/7/2016 39 11 33 30.28
7 1/8/2016 39 29 64 30.20
8 1/9/2016 44 38 77 30.16
9 1/10/2016 50 46 71 29.59
10 1/11/2016 33 8 37 29.92
11 1/12/2016 35 15 53 29.85
12 1/13/2016 26 4 42 29.94
13 1/14/2016 NaN 12 47 29.95
14 1/15/2016 43 31 62 29.82
15 1/16/2016 47 37 70 29.52
16 1/17/2016 36 23 66 29.78
17 1/18/2016 25 6 53 29.83
18 1/19/2016 22 3 42 30.03
19 1/20/2016 32 15 49 30.13
20 1/21/2016 31 11 45 30.15
21 1/22/2016 26 6 41 30.21
22 1/23/2016 NaN 21 78 29.77
23 1/24/2016 NaN 11 53 29.92
24 1/25/2016 NaN 18 54 30.25
25 1/26/2016 43 29 56 30.03
26 1/27/2016 41 22 45 30.03
27 1/28/2016 37 20 51 29.90
28 1/29/2016 36 21 50 29.58
29 1/30/2016 34 16 46 30.01
30 1/31/2016 46 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover Events \


0 10 8 0 5 NaN
1 10 7 0 3 NaN
2 10 8 0 1 NaN
3 10 9 0 3 NaN
4 10 5 0 0 NaN
5 10 4 0 0 NaN
6 10 2 0 3 NaN
7 10 4 0 8 NaN
8 9 8 8

9 4 NaN 1.8 7
10 10 NaN 0 1 NaN
11 10 6 4 NaN
12 10 10 0 0 NaN
13 10 5 7 NaN
14 9 5 2 NaN
15 8 7 0.24 7
16 8 6 0.05 6 -
17 9 12 2
18 10 11 0 1 NaN
19 10 6 0 2 NaN
20 10 6 0 1 NaN
21 9 NaN 0.01 3
22 1 16 2.31 8 -
23 8 6 3
24 10 3 0 2 NaN
25 10 7 0 2 NaN
26 10 7 3
27 10 5 0 1 NaN
28 10 8 0 4 NaN
29 10 7 0 0 NaN
30 10 5 0 0 NaN

WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0

23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
29 257.0
30 NaN

What the …. Why did it replace all the values in the Events column? Observations :
As you can see, this regex successfully
• removed C and F from the Temperature column
• removed mph and kph from the WindSpeedMPH column
• removed the event values from the Events column
Reason : Because every value in the Events column is alphabetic, this regex wiped out all the
values in the Events column of the dataFrame.

Fixing this issue


[131]: new_df = df_missing_data.replace({"Temperature":"[A-Za-z]", "WindSpeedMPH":
↪"[A-Za-z]"},"",regex=True)

new_df

[131]: EST Temperature DewPoint Humidity Sea Level PressureIn \


0 1/1/2016 38 23 52 30.03
1 1/2/2016 36 18 46 30.02
2 1/3/2016 40 21 47 29.86
3 1/4/2016 25 9 44 30.05
4 1/5/2016 NaN -3 41 30.57
5 1/6/2016 33 4 35 30.50
6 1/7/2016 39 11 33 30.28
7 1/8/2016 39 29 64 30.20
8 1/9/2016 44 38 77 30.16
9 1/10/2016 50 46 71 29.59
10 1/11/2016 33 8 37 29.92
11 1/12/2016 35 15 53 29.85
12 1/13/2016 26 4 42 29.94
13 1/14/2016 NaN 12 47 29.95
14 1/15/2016 43 31 62 29.82
15 1/16/2016 47 37 70 29.52
16 1/17/2016 36 23 66 29.78
17 1/18/2016 25 6 53 29.83
18 1/19/2016 22 3 42 30.03
19 1/20/2016 32 15 49 30.13
20 1/21/2016 31 11 45 30.15
21 1/22/2016 26 6 41 30.21
22 1/23/2016 NaN 21 78 29.77
23 1/24/2016 NaN 11 53 29.92
24 1/25/2016 NaN 18 54 30.25

25 1/26/2016 43 29 56 30.03
26 1/27/2016 41 22 45 30.03
27 1/28/2016 37 20 51 29.90
28 1/29/2016 36 21 50 29.58
29 1/30/2016 34 16 46 30.01
30 1/31/2016 46 28 52 29.90

VisibilityMiles WindSpeedMPH PrecipitationIn CloudCover Events \


0 10 8 0 5 NaN
1 10 7 0 3 NaN
2 10 8 0 1 NaN
3 10 9 0 3 NaN
4 10 5 0 0 NaN
5 10 4 0 0 NaN
6 10 2 0 3 NaN
7 10 4 0 8 NaN
8 9 8 T 8 Rain
9 4 NaN 1.8 7 Rain
10 10 NaN 0 1 NaN
11 10 6 T 4 NaN
12 10 10 0 0 NaN
13 10 5 T 7 NaN
14 9 5 T 2 NaN
15 8 7 0.24 7 Rain
16 8 6 0.05 6 Fog-Snow
17 9 12 T 2 Snow
18 10 11 0 1 NaN
19 10 6 0 2 NaN
20 10 6 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16 2.31 8 Fog-Snow
23 8 6 T 3 Snow
24 10 3 0 2 NaN
25 10 7 0 2 NaN
26 10 7 T 3 Rain
27 10 5 0 1 NaN
28 10 8 0 4 NaN
29 10 7 0 0 NaN
30 10 5 0 0 NaN

WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN

6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
29 257.0
30 NaN

As you can see, the data now looks much better and all the values in the Events column
are intact.
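
One caveat: after stripping the unit letters, the Temperature and WindSpeedMPH values are still
stored as strings/objects rather than numbers. A reasonable follow-up step (a sketch, assuming
new_df from the cell above) is to convert them with pd.to_numeric:

# errors="coerce" turns anything that still is not a valid number into NaN
new_df["Temperature"] = pd.to_numeric(new_df["Temperature"], errors="coerce")
new_df["WindSpeedMPH"] = pd.to_numeric(new_df["WindSpeedMPH"], errors="coerce")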

2.10 Replacing the Score column's text grades with numeric values in a
dataFrame
[132]: df = pd.DataFrame({
"Score" : ["exceptional", "average", "good", "poor", "average",␣
↪"exceptional"],
"Student" : ["Rollex", "Ballistic", "Barricade", "Bullet", "Artillery",␣
↪"Ignition"]
})
df

[132]: Score Student


0 exceptional Rollex
1 average Ballistic
2 good Barricade
3 poor Bullet
4 average Artillery
5 exceptional Ignition

[133]: df_replaced = df.replace(["poor", "average", "good", "exceptional"], [1, 2, 3, 4])
df_replaced

/tmp/ipykernel_3487556/[Link]: FutureWarning: Downcasting behavior in
`replace` is deprecated and will be removed in a future version. To retain the
old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to
the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
df_replaced = df.replace(["poor", "average", "good", "exceptional"], [1, 2, 3, 4])

[133]: Score Student


0 4 Rollex
1 2 Ballistic
2 3 Barricade
3 1 Bullet
4 2 Artillery
5 4 Ignition

Analysis of this warning Warning : /tmp/ipykernel_2713287/[Link]:

FutureWarning: Downcasting behavior in `replace` is deprecated and will
be removed in a future version. To retain the old behavior, explicitly
call `result.infer_objects(copy=False)`. To opt-in to the future behavior,
set `pd.set_option('future.no_silent_downcasting', True)` df_replaced =
df.replace(["poor", "average", "good", "exceptional"], [1, 2, 3, 4])
In older versions of pandas, replace() would automatically downcast object types (like strings) to
smaller types (like integers) if possible. In future versions, this behavior will change. So pandas is
warning you that :
• your replacement may result in a type change (like from object to int)
• in the future, this won't happen silently; you will have to explicitly opt in to or disable it.

solution :
[134]: pd.set_option('future.no_silent_downcasting', True)
df_replaced = df.replace(["poor", "average", "good", "exceptional"], [1, 2, 3, 4])

df_replaced

[134]: Score Student


0 4 Rollex
1 2 Ballistic
2 3 Barricade
3 1 Bullet
4 2 Artillery
5 4 Ignition
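
Alternatively, as the warning itself suggests, the explicit way to keep the old downcasting
behaviour without setting the global option is to call infer_objects on the result (a sketch;
the copy keyword exists on pandas 2.x, drop it on older versions):

df_replaced = (
    df.replace(["poor", "average", "good", "exceptional"], [1, 2, 3, 4])
      .infer_objects(copy=False)
)
df_replaced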

[ ]:
