Clean Data Method
August 8, 2025
[1]: import pandas as pd

# path of pc
df_missing_data = pd.read_csv("/home/aditya/github/Deep-learning-prerequisite/pandas/dataset/company_financials.csv")
# path of macbook
# df_missing_data = pd.read_csv("/home/machine_learning/github/Deep-learning-prerequisite/pandas/dataset/company_financials.csv")
df_missing_data
1.0.1 Here in this dataFrame we can see that the data is a bit messy
• In the people and eps columns we have the value n.a.
• In the revenue column we have 2 negative values

We need to clean this mess before we start our analysis
• We need to replace the n.a. values with NaN
• We also need to replace the negative values in the revenue column, because a company's revenue
cannot be negative; it will either be zero or positive.
1.1 Remove n.a. values when creating dataFrame from a csv file
[2]: # path of pc
df_cleaned = pd.read_csv("/home/aditya/github/Deep-learning-prerequisite/pandas/dataset/company_financials.csv",
                         na_values=["not available", "n.a."])
# path of macbook
# df_cleaned = pd.read_csv("/home/machine_learning/github/Deep-learning-prerequisite/pandas/dataset/company_financials.csv",
#                          na_values=["not available", "n.a."])
df_cleaned
As you can see all the n.a. values have been replaced with NaN values
1.2 Now let's see how we can remove negative values from the revenue column
[3]: import numpy as np

df_cleaned.loc[df_cleaned["revenue"] < 0, "revenue"] = np.nan
df_cleaned
1.2.1 Explanation
.loc[]
• It is label-based indexing. It is used to access a group of rows and columns in a DataFrame
by their labels (names), not by their integer positions.
• syntax : df.loc[row_condition, column_label]
– row_condition: A condition or label(s) that tells pandas which rows you want.
– column_label: The column(s) you want from those selected rows.
[4]: df_cleaned.loc[8, "people"] = "Arvind Krishna"
df_cleaned
1.3.1 Explanation:
• df_cleaned.loc[8, "people"] = "Arvind Krishna" : Here 8 is the row label and "people" is
the column name.
• Using these two we locate the cell to be updated and assign a new value to it.
# path of macbook
# df = pd.read_csv("/home/machine_learning/github/Deep-learning-prerequisite/pandas/dataset/company_financials.csv")
1.4.2 Explanation:
• The people_col_converter function will be called for every single cell in the people column.
• This function is called while the dataFrame is being created from the file present in the file
system; a sketch of how such a converter is wired in is shown below.
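The cell that actually wired people_col_converter into pd.read_csv is not preserved in this export; a minimal sketch of how a converter is typically passed, where the converter body itself is an assumption:

```python
# hypothetical converter: turn the literal string "n.a." into NaN for the people column
def people_col_converter(cell):
    return np.nan if cell == "n.a." else cell

df = pd.read_csv("/home/aditya/github/Deep-learning-prerequisite/pandas/dataset/company_financials.csv",
                 converters={"people": people_col_converter})
```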
1.4.3 Why this worked?
• This worked because the people column only had one NaN value in the entire column.
# pc path
df_missing_val = pd.read_csv("/home/aditya/github/Deep-learning-prerequisite/pandas/dataset/nyc_weather_data_missing_values.csv")
# macbook path
# df_missing_val = pd.read_csv("/home/machine_learning/github/Deep-learning-prerequisite/pandas/dataset/nyc_weather_data_missing_values.csv")
type(df_missing_val["date"][0])
df_missing_val
(truncated output: a 100-row DataFrame; only the weather_condition column survived the export, showing Snow, Snow, NaN, Rain, Rain, …, Fog, Fog, Sunny, Sunny, Rain)
The date column can contain date values in the form of strings, so to be safe we are going to
make sure that when we create the dataFrame from the csv file we convert those string
values in the date column into datetime timestamps.
• First we need to convert all the dates in the date column from type string to type date
[8]: # pc path
df_missing_val = pd.read_csv("/home/aditya/github/Deep-learning-prerequisite/pandas/dataset/nyc_weather_data_missing_values.csv",
                             parse_dates=["date"])
# macbook path
# df_missing_val = pd.read_csv("/home/machine_learning/github/Deep-learning-prerequisite/pandas/dataset/nyc_weather_data_missing_values.csv",
#                              parse_dates=["date"])
type(df_missing_val["date"][0])
[8]: pandas._libs.tslibs.timestamps.Timestamp
[9]: df_missing_val
(truncated output: [100 rows x 5 columns]; weather_condition shows Snow, Snow, NaN, Rain, Rain, …, Fog, Fog, Sunny, Sunny, Rain)
2.0.1 Set the date column as an index of the dataFrame instead of the default index
[10]: df_missing_val.set_index("date",inplace=True)
df_missing_val
(truncated output: the DataFrame is now indexed by date, from 2023-01-01 to 2023-04-10; the weather_condition column still contains NaN at 2023-01-03)
2.1 Find all the rows with the missing values in the dataFrame
[11]: df_missing_val[df_missing_val.isna().any(axis=1)]
(truncated output: every row containing at least one NaN, with dates ranging from 2023-01-03 to 2023-04-09; the NaNs are scattered across the numeric columns and weather_condition)
2.1.1 Explanation:
df_missing_val.isna()
Returns a boolean DataFrame of the same shape, where each cell is True if the original value is NaN and False otherwise.
.any(axis=1)
This is a reduction operation that checks if any value in each row is True.
So what is axis=1?
axis=1 → operate row-wise
2.1.3 Fillna method to replace NaN values with Zeros in the dataFrame with missing
values
[12]: new_df_fill_na = df_missing_val.fillna(0)
new_df_fill_na
(truncated output: the same DataFrame with every NaN replaced by 0; for example, the weather_condition on 2023-01-03 is now 0)
2.1.4 Find all the rows that contain a zero in the dataFrame
[13]: new_df_fill_na[(new_df_fill_na==0).any(axis=1)]
(truncated output: all rows where at least one column equals 0; the cells that used to be NaN now show 0 in both the numeric columns and weather_condition)
2.1.5 Explanation :
fillna(0) replaces every NaN in the entire dataFrame with 0.
2.1.6 Sometimes we may want to fillna but we don’t want to fill the entire dataFrame
with 0. Maybe we want to specify different values for different columns, as sketched below.
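The cell that produced the output below is not preserved in this export; a minimal sketch of what it presumably looked like, assuming the numeric columns were filled with 0 and weather_condition with the string "no event" (the numeric column names are assumptions):

```python
# hypothetical reconstruction of the per-column fillna cell (numeric column names assumed)
new_df = df_missing_val.fillna({
    "temperature_celsius": 0,
    "humidity_percent": 0,
    "wind_speed_kmph": 0,
    "weather_condition": "no event",
})
new_df
```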
(truncated output: the numeric NaNs are replaced with 0 and the missing weather_condition values with "no event"; for example, 2023-01-03 now reads "no event")
Get all the rows that contain a zero in the dataFrame
[15]: new_df[(new_df==0).any(axis=1)]
(truncated output: the rows where a numeric column equals 0, plus the weather_condition column in which the former NaNs now read "no event")
Another way of getting a better estimate is to carry the value forward from the previous row
into the row where the missing value is present (forward fill).
[17]: previous_fill_val_df = df_missing_val.fillna(method="ffill")
previous_fill_val_df
(truncated output: the forward-filled DataFrame; the weather_condition on 2023-01-03 is now Snow, copied from the previous row)
2.1.7 Explanation :
/tmp/ipykernel_2951661/…: FutureWarning: DataFrame.fillna with 'method' is
deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
previous_fill_val_df = df_missing_val.fillna(method="ffill")
This means that passing method="ffill" or method="bfill" to .fillna() is going to be removed in
future versions of pandas; call .ffill() or .bfill() directly instead.
[18]: df_missing_val
(truncated output: df_missing_val itself is unchanged; 2023-01-03 still shows NaN in weather_condition)
(truncated output: the forward-filled DataFrame displayed again; 2023-01-03 shows Snow)
Check whether any NaN values remain in any of the rows of the dataFrame
[20]: previous_fill_val_df[previous_fill_val_df.isna().any(axis=1)]
Another way of getting a better estimate is to carry the value backward from the next row
into the row where the missing value is present (backward fill).
[21]: df_missing_val
(truncated output: df_missing_val, still containing NaN at 2023-01-03)
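The backward-fill cell itself is missing from this export; a minimal sketch of what it presumably contained (the variable name next_fill_val_df is taken from the later cell [23]):

```python
# backward fill: copy each value upward from the next valid row (sketch of the missing cell)
next_fill_val_df = df_missing_val.bfill()
next_fill_val_df
```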
(truncated output: the backward-filled DataFrame; the weather_condition on 2023-01-03 is now Rain, copied from the next row)
Let's confirm whether any NaN values remain in the dataFrame
[23]: next_fill_val_df[next_fill_val_df.isna().any(axis=1)]
Again, the empty result confirms that no NaN values remain in the dataFrame
How to limit the number of times a value is allowed to be copied in these two methods (see the sketch below)
• Forward fill carries the value forward from the row before the missing one.
• Backward fill carries the value backward from the row after the missing one.
Both .ffill() and .bfill() accept a limit argument that caps how many consecutive NaNs a single value is allowed to fill.
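A minimal sketch of the limit argument, assuming the same df_missing_val frame:

```python
# each valid value may be copied forward into at most one following NaN
limited_ffill_df = df_missing_val.ffill(limit=1)
# each valid value may be copied backward into at most one preceding NaN
limited_bfill_df = df_missing_val.bfill(limit=1)
limited_ffill_df.head(10)
```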
[24]: df_missing_data.head(10)
(truncated output: the first 10 rows of a filled DataFrame, 2023-01-01 to 2023-01-10, with no NaN left in weather_condition; a second truncated output lists only 2023-01-22 Rain and 2023-03-24 Cloudy)
2.3 What is interpolation ?
DataFrame.interpolate() is a pandas method used to fill in missing values (NaN) by estimating
what the values should be, based on the surrounding data. Here is the link to the docs related to this topic :
[Link]
DataFrame.interpolate(method='linear', *, axis=0, limit=None, inplace=False,
limit_direction=None, limit_area=None, **kwargs)
(truncated output: weather_condition for 2023-01-01 to 2023-01-10, with NaN at rows 2, 5 and 9)
2.3.1 Explanation:
• DataFrame df_missing_val has columns with object dtype (e.g. strings or mixed types).
• interpolate() is meant for numeric columns, and using it on object-type columns is being
deprecated.
[28]: interpolated_val_df = df_missing_val.select_dtypes(include=["number"]).interpolate()
interpolated_val_df.head(10)
[28]: temperature_celsius
date
2023-01-01 -2.9
2023-01-02 15.9
2023-01-03 22.8
2023-01-04 39.4
2023-01-05 7.8
2023-01-06 -1.5
2023-01-07 20.4
2023-01-08 36.8
2023-01-09 7.5
2023-01-10 65.2
[29]: df_missing_val.head(10)
(truncated output: the original first 10 rows, still showing NaN in weather_condition at rows 2, 5 and 9)
# columns_to_convert is presumably a list of column names (defined in an earlier cell) whose values are stored as strings
df_missing_val[columns_to_convert] = df_missing_val[columns_to_convert].apply(pd.to_numeric, errors="coerce")
interpolated_val_df = df_missing_val.select_dtypes(include=["number"]).interpolate()
interpolated_val_df.head(10)
2.3.2 Explanation:
• .apply(pd.to_numeric, errors="coerce")
– apply(pd.to_numeric) applies the pd.to_numeric() function to each column in the list.
– errors="coerce" turns any value that cannot be parsed as a number into NaN instead of raising an error.
[31]: # pc path
df_missing_weather_data = pd.read_excel("/home/aditya/github/Deep-learning-prerequisite/pandas/dataset/nyc_weather_missing_data_demo.xlsx")
# macbook path
# df_missing_weather_data = pd.read_excel("/home/machine_learning/github/Deep-learning-prerequisite/pandas/dataset/nyc_weather_missing_data_demo.xlsx", engine="openpyxl")
df_missing_weather_data
(truncated output: a 31-row January 2016 weather DataFrame with columns EST, Temperature, DewPoint, Humidity, Sea Level PressureIn, VisibilityMiles, WindSpeedMPH, PrecipitationIn, CloudCover, Events and WindDirDegrees; NaNs are scattered across Temperature, WindSpeedMPH, Events and WindDirDegrees)
Let's check how many rows have NaN values in them
[32]: df_missing_weather_data[df_missing_weather_data.isna().any(axis=1)]
(truncated output: every row that contains at least one NaN, shown with the Events and WindDirDegrees columns)
[33]: (truncated output: the weather DataFrame displayed again, with its NaNs still visible in Temperature, WindSpeedMPH, Events and WindDirDegrees)
(truncated output: after interpolation the numeric gaps are filled, e.g. WindSpeedMPH at row 9 becomes 7.333333 and WindDirDegrees at row 5 becomes 313.0; Events, an object column, still contains NaN)
This warning means that the DataFrame contains object-type columns (strings, or numbers stored as strings):
/tmp/ipykernel_2443388/…: FutureWarning: DataFrame.interpolate with object dtype is deprecated and will raise in a future
version. Call obj.infer_objects(copy=False) before interpolating instead.
inter_polated_df = df_missing_weather_data.interpolate()

# in older pandas versions copy=False will raise an error; instead you will have to use this
# df_missing_weather_data.infer_objects()
# only select those columns whose data-type is numeric and then interpolate them accordingly
interpolated_df = df_missing_weather_data.select_dtypes(include=["number"]).interpolate()
interpolated_df
(truncated output: the numeric columns with their NaNs interpolated, e.g. Temperature at row 13 becomes 34.50 and WindDirDegrees at row 25 becomes 280.2)
Let's handle dataFrame interpolation using the time method The reason I want to use the time
method is that our data depends on time, i.e. on dates. Since we have a date column I
want to use the time method.
[36]: df_missing_weather_data
(truncated output: the weather DataFrame again, with NaNs still present in WindSpeedMPH, Events and WindDirDegrees)
[37]: time_interpolated_df = df_missing_weather_data.select_dtypes(include=["number"]).interpolate(method="time")
time_interpolated_df

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[37], line 1
----> 1 time_interpolated_df = df_missing_weather_data.select_dtypes(include=["number"]).interpolate(method="time")
      2 time_interpolated_df
(internal pandas frames omitted; the traceback ends with a ValueError saying that time-weighted interpolation only works on Series or DataFrames with a DatetimeIndex)
Explanation :
• As you can see, this error occurred because our dataFrame does not have the date column set
as its index. In order to solve this error and use the time method for interpolation, we
will have to set the date column as the index of this dataFrame.
As you can see, once our date column named EST has been set as the index of
this dataFrame we can move forward with our interpolation operation to remove
NaN from the dataFrame.
[ ]: # convert the columns that have numbers in string format into numeric data-types
df_missing_weather_data.infer_objects(copy=False)
Why are we setting copy=False in .infer_objects(copy=False) here ? The copy parameter
controls whether the method returns a new DataFrame or modifies the original DataFrame in-place.

| copy value | What it does                    |
| ---------- | ------------------------------- |
| True       | Returns a new DataFrame (copy)  |
| False      | Modifies the original DataFrame |

You are telling pandas:
“Please try to convert any object columns to a more specific dtype in-place —
do not create a new DataFrame.”
[38]: df_missing_weather_data.select_dtypes(include=["number"]).interpolate(method="time")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[38], line 1
----> 1 df_missing_weather_data.select_dtypes(include=["number"]).interpolate(method="time")
(internal pandas frames omitted; the traceback again ends in a ValueError because time-weighted interpolation needs a DatetimeIndex)
Again I am getting this error? Hmm… Let's check the data type of the EST column
in the dataFrame
[39]: df_missing_weather_data["EST"].dtype
[39]: dtype('O')
Explanation : In pandas, O stands for the "object" dtype, which usually means the column holds
Python objects such as strings.
[40]: df_missing_weather_data["EST"].apply(lambda x: isinstance(x, str)).all()
[40]: True
Since it returned True, that means all the values present in the EST column are of
string datatype. This means I need to convert all the values in this column into date type
before assigning it as an index to this dataFrame.
[41]: df_missing_weather_data["EST"] = pd.to_datetime(df_missing_weather_data["EST"], format="%m%d%Y", errors="coerce")
df_missing_weather_data["EST"].dtype
[41]: dtype('<M8[ns]')
[42]: df_missing_weather_data
(truncated output: the EST column is now entirely NaT because the date strings did not match the format that was passed; the other columns are unchanged)
2.3.4 Why did my string dates get converted to NaT values in the EST column?
Explanation : - Dates in the column EST are in the string format 1/3/2016. - This corresponds
to the format %m/%d/%Y. - But in my code I accidentally passed %m%d%Y. - Because errors="coerce" turns
every value that fails to parse into NaT, the whole EST column of my dataFrame became NaT.
↪xlsx")
# macbook path
# df_missing_weather_data = pd.read_excel("/home/machine_learning/github/
↪Deep-learning-prerequisite/pandas/dataset/nyc_weather_missing_data_demo.
↪xlsx",engine="openpyxl")
df_missing_weather_data
(truncated output: the freshly re-read DataFrame, with EST back as strings such as 1/14/2016 and the original NaNs intact)
Convert the EST column string dates into datetime values, this time using the
format %m/%d/%Y
[44]: df_missing_weather_data["EST"] = pd.to_datetime(df_missing_weather_data["EST"], format="%m/%d/%Y", errors="coerce")
df_missing_weather_data
(truncated output: EST now shows proper dates from 2016-01-02 through 2016-01-31; the Temperature NaNs remain)
(truncated output: the remaining columns, with NaNs still present in WindSpeedMPH, Events and WindDirDegrees)
[45]: df_missing_weather_data["EST"].dtype
[45]: dtype('<M8[ns]')
Now our EST column values have been successfully converted without any issue
Explanation
• M8 stands for datetime64 (M is for “datetime” in NumPy’s dtype notation).
• < means little-endian byte order
• [ns] means the precision is nanoseconds
Let's verify whether the values in this EST column are still of type string
[46]: df_missing_weather_data["EST"].apply(lambda x: isinstance(x,str)).all()
[46]: False
Since we are getting the result False, the values stored in the EST column
are no longer of type string
Since I created a fresh dataFrame from the excel sheet earlier, I don't have to
reset the index of the dataFrame
[47]: df_missing_weather_data.set_index(df_missing_weather_data["EST"],inplace=True)
df_missing_weather_data
(truncated output: the DataFrame is now indexed by EST; the EST values appear both as the index and as a regular column because the index was set from the Series rather than the column name)
# converted_val_in_df is presumably the DataFrame returned by the earlier infer_objects() call
# select only those columns whose values are numeric, then interpolate using the time method
new_processed_df = converted_val_in_df.select_dtypes(include=["number"]).interpolate(method="time")
new_processed_df
(truncated output: the time-interpolated numeric columns; e.g. WindSpeedMPH on 2016-01-10 becomes 7.333333 and WindDirDegrees on 2016-01-14 becomes 192.5)
Since there are no NaN values present, it is safe to say that the data has been significantly
cleaned in this particular dataFrame
The interpolation methods supported by DataFrame.interpolate() and when to use them:

| Method | Description | Typical use |
| ------ | ----------- | ----------- |
| 'slinear' | Spline interpolation of order 1 (similar to linear). | For smoother interpolation than basic linear. |
| 'quadratic' | Spline interpolation of order 2. | Curved data trends (like acceleration data). |
| 'cubic' | Spline interpolation of order 3. | Smooth curves with gentle oscillation (like motion tracking). |
| 'barycentric', 'polynomial' | Fit through points using polynomial interpolation. | Use cautiously; can cause overfitting. |
| 'pchip' | Piecewise Cubic Hermite Interpolating Polynomial. | Monotonic and shape-preserving curves. |
| 'akima' | Interpolation using Akima splines. | Reduces overshoots and artifacts. |
| 'spline' | Generic spline method (requires order). | Highly smooth interpolations. |
| 'pad' or 'ffill' | Fill forward using previous valid value. | When values persist until changed. |
| 'backfill' or 'bfill' | Fill backward using next valid value. | For retrospective value filling. |
| 'from_derivatives' | Interpolates from derivatives. | Advanced usage; rarely needed. |

### slinear : Spline interpolation of order 1
- slinear stands for spline linear interpolation (a type of spline of degree 1).
- It’s a more flexible and numerically stable version of linear interpolation.
- Under the hood it uses SciPy’s interp1d(kind='slinear'), which interpolates based on piecewise linear splines between known data points.
- This method only works when the index is numeric or datetime.
- syntax : DataFrame.interpolate(method='slinear')

When to use slinear Use 'slinear' when :
- You have time series or sequential data with gaps (example : NaN values).
- You want to smooth missing values, but still assume linear transitions between known data points.
- You want more numerical precision or SciPy backend compatibility compared to simple linear interpolation.

Real-World Example:
• You’re a data analyst working with temperature readings from an IoT weather sensor that
recorded data every hour. Due to network issues, some data points are missing.

| Method | Description |
| ------ | ----------- |
| 'linear' | Simple straight-line interpolation between two points |
| 'slinear' | Piecewise linear interpolation using scipy.interpolate.interp1d, generally more precise |

In practice, for short gaps, both yield similar results, but 'slinear' is better when working within
SciPy-compatible scientific or engineering data pipelines.
– When smoothness in slope is essential (example : animation, trajectory path)
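The cell that built this hourly frame and its slinear-interpolated column is not preserved in the export; a minimal sketch of what it presumably looked like, using the values visible in the output below (the exact timestamps are assumptions):

```python
import numpy as np
import pandas as pd

# hypothetical reconstruction: hourly readings for 2025-08-04 with gaps
dataFrame = pd.DataFrame(
    {"temperature": [30.1, 31.5, np.nan, np.nan, 34.2, 35.0, np.nan, 36.1]},
    index=pd.date_range("2025-08-04 00:00:00", periods=8, freq="h"),
)
# slinear needs a numeric or datetime index, which a date_range provides
dataFrame["slinear_interpolated_temperature"] = dataFrame["temperature"].interpolate(method="slinear")
dataFrame
```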
[50]: (truncated output: the hourly temperature column for 2025-08-04, showing 30.1, 31.5, NaN, NaN, 34.2, 35.0, NaN, 36.1; the timestamps were lost in this export)
plt.plot(dataFrame.index, dataFrame["temperature"], color="red", linestyle="-",
         marker="o", label="Original temperature data points with gaps (NaN values)")
plt.plot(dataFrame.index, dataFrame("slinear_interpolated_temperature"), color="green", linestyle="--", marker="o",
         label="Interpolated temperature data using slinear method")
plt.ylabel("Temperature")
plt.legend()
plt.grid(True)
plt.show()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[52], line 6
      4 # I want to show the index, which is of datetime format, on the x axis of the graph plotted by pyplot, so I use dataFrame.index
      5 plt.plot(dataFrame.index, dataFrame["temperature"], color="red", linestyle="-", ...)
----> 6 plt.plot(dataFrame.index, dataFrame("slinear_interpolated_temperature"), color="green", linestyle="--", marker="o", ...)

TypeError: 'DataFrame' object is not callable
What just happened? After inspecting the code I found out that I made a silly mistake in this
line: plt.plot(dataFrame.index, dataFrame("slinear_interpolated_temperature"), color="green", linestyle="--", marker="o", label="Interpolated temperature data using slinear method"). Here I should have written
dataFrame["slinear_interpolated_temperature"] but instead I wrote
dataFrame("slinear_interpolated_temperature"). I know I am stupid, but hey, thanks
to my stupidity at least someone reading this notebook might avoid the mistake that I made.
Analysis : When I wrote dataFrame("slinear_interpolated_temperature") I was essentially
trying to call a dataFrame like a function, which causes Python to say: "You are treating a
dataFrame object like a callable function but it's not callable!"
Let's fix the code and re-try plotting the interpolated data using the slinear method once
again
[53]: # plotting slinear
plt.figure(figsize=(8,4))
# to show the datetime index on the x axis of the graph plotted by pyplot, use dataFrame.index
plt.plot(dataFrame.index, dataFrame["temperature"], color="red", linestyle="-",
         marker="o", label="Original temperature data points with gaps (NaN values)")
plt.plot(dataFrame.index, dataFrame["slinear_interpolated_temperature"], color="green", linestyle="--", marker="o",
         label="Interpolated temperature data using slinear method")
Quadratic interpolation fits a second-degree polynomial, which allows it to capture curvature in the data,
making it useful for smoothly accelerating or decelerating trends. Instead of drawing straight lines
between points (like in linear), it draws parabolic curves that gently follow the shape of the data.
A quadratic function has the form : y = ax^2 + bx + c

Real world use cases : Example : Car acceleration. Imagine you are tracking the distance
traveled by a car every second. The car is accelerating, so the distance increases faster and faster.

| Time (s) | Distance (m) |
| -------- | ------------ |
| 0 | 0 |
| 1 | 3 |
| 2 | NaN |
| 3 | 18 |
[54]: df = pd.DataFrame(
    {
        "time": [0, 1, 2, 3],
        "distance": [0, 3, np.nan, 18]
    }
)
df
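The cell that added the distance_quadratic column used in the plot below is not preserved in this export; a minimal sketch of what it presumably contained:

```python
# quadratic interpolation fills the gap at time=2 with a parabola through the known points
df["distance_quadratic"] = df["distance"].interpolate(method="quadratic")
df
```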
plt.plot(df["time"], df["distance_quadratic"], color="green", linestyle="--", marker="o",
         label="Quadratic interpolated distance")
2.3.7 Difference between quadratic and Linear interpolation:
Scenario: Estimating car speed at a given time using recorded data. Let’s say you have a car’s
position (in meters) at different times (in seconds). You want to estimate its position at a time not
in the original dataset.
What is t_interp? In this interpolation example t_interp is the point at which I want to
estimate a value (for example a y-value) based on the surrounding known data.
Why 2.5?
• There are known data points at t=2 and t=3, and I want to estimate the value halfway between
them (i.e. at t = 2.5).
• This is where interpolation comes in: I estimate the value at a new point (2.5) using a math-
ematical model based on existing data. A sketch of how linear_interp and quadratic_interp might be built follows.
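The cells that defined time, position, linear_interp, quadratic_interp, t_interp, pos_linear and pos_quadratic are not preserved in this export; a minimal sketch of how they might look with SciPy, where the time/position sample data is inferred from the printed results and may not be exact:

```python
import numpy as np
from scipy.interpolate import interp1d

# sample data inferred from the later output: position of the car each second
time = np.array([0, 1, 2, 3, 4])
position = np.array([0.0, 2.0, 6.0, 12.0, 20.0])

# build callable interpolators over the known points
linear_interp = interp1d(time, position, kind="linear")
quadratic_interp = interp1d(time, position, kind="quadratic")

# estimate the position halfway between t=2 and t=3
t_interp = 2.5
pos_linear = linear_interp(t_interp)
pos_quadratic = quadratic_interp(t_interp)
```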
When to use DataFrame.interpolate() Use this when: - You want to fill missing
values (NaNs) - You are working within a time series or numeric sequence - You don’t need high-
resolution or custom x-points
[58]: # Create 100 data points between 0 and 4 for the time column
t_vals = np.linspace(0, 4, 100)
pos_linear_vals = linear_interp(t_vals)
pos_quadratic_vals = quadratic_interp(t_vals)
df = pd.DataFrame({
    "time": t_vals,
    "position_linear_interpolated": pos_linear_vals,
    "position_quadratic_interpolated": pos_quadratic_vals
})
df
(truncated output: 100 rows of time, position_linear_interpolated and position_quadratic_interpolated; both columns end at 20.0 at time 4.0)
[59]: # plot the difference between linear and quadratic interpolation using matplotlib
plt.figure(figsize=(10, 6))
plt.plot(time, position, 'o', label='Original Data')
plt.plot(t_vals, pos_linear_vals, '--', label='Linear Interpolation')
plt.plot(t_vals, pos_quadratic_vals, '-.', label='Quadratic Interpolation')
plt.scatter([t_interp], [pos_linear], color='blue', label='Linear @ 2.5s')
plt.scatter([t_interp], [pos_quadratic], color='green', label='Quadratic @ 2.5s')
plt.xlabel('Time (s)')
plt.ylabel('Position (m)')
plt.title('Linear vs Quadratic Interpolation')
plt.legend()
plt.grid(True)
plt.show()
When to use quadratic:
• When data shows smooth but accelerating or decelerating behavior.
• Ideal for physics, motion tracking, growth models, and sensor data with non-linear trends.
• More accurate than linear interpolation for curved data.
What is the first and second derivative: Cubic interpolation connects data points using a
third-degree polynomial of the form: y = ax^3 + bx^2 + cx + d. This creates smooth curves between
points, unlike linear or quadratic interpolation which may result in sharp bends or unnatural
transitions.
Derivative in cubic interpolation :
- The first derivative of a function gives us the rate of change or slope at any given point.
- For a cubic polynomial :
  - y = ax^3 + bx^2 + cx + d
  - y' = 3ax^2 + 2bx + c (this is the first derivative)
- What it tells us:
  - How steep a curve is at any point.
  - In real-world terms: velocity, trend direction or growth rate.
- Why it matters:
  - Continuity of first derivatives ensures that there are no sharp turns or corners.
  - Smooth transitions are essential in animation, physics simulations, robotics, financial forecasting and motion paths.
Second Derivative : The second derivative gives us the rate of change of the rate of change, essentially how the slope is changing.
- What it tells us:
  - The curvature of the graph.
  - In physical terms: acceleration, force or inflection points.
- Why it matters:
  - Continuity of the second derivative ensures no sudden changes in acceleration, which is crucial for :
    - Mechanical systems (example : robotic arms moving smoothly)
    - Video game animation (smooth character movement)
    - Spline fitting in data modeling (natural-looking data curves)
Why derivative continuity matters : Imagine drawing a smooth curve through data points,
like tracking an object’s movement:
- Linear interpolation : Connects points with straight lines. The slope can suddenly jump at each point, so the movement looks jerky.
- Cubic interpolation : Uses cubic polynomials between each pair of points and ensures:
  - No sharp corners (the curve is smooth).
  - No sudden changes in velocity (the first derivative is continuous).
  - No sudden changes in acceleration (the second derivative is continuous).
- That means the motion appears fluid and natural, like the movement of a hand or a car.
Real world use cases : Let’s say you’re tracking the hand position of a person drawing a circle,
captured every few milliseconds. Because hand movement isn’t jerky, it’s fluid and curved, a
cubic interpolation gives a more natural fit to the motion. Example data:

| Time (s) | X Position |
| -------- | ---------- |
| 0 | 0 |
| 1 | 1 |
| 2 | NaN |
| 3 | 1 |
| 4 | 0 |
A 'linear' or 'quadratic' interpolation will make the curve look angular or not smooth enough.
'cubic' smoothly interpolates the missing position at t=2
[60]: df = pd.DataFrame({
    "time": [0, 1, 2, 3, 4],
    "x_pos": [0, 1, np.nan, 1, 0]
})
df
• You want visually pleasing interpolation (example : animations, simulations)
• You need smooth first and second derivatives (important in physics and motion modeling)
• common domains :
– Computer graphics and animation
– sensor based tracking
– climate and environmental modeling
– audio signal smoothing
When to avoid:
• Not ideal for very noisy or discontinuous data
• Overkill for simple trends (linear or non-linear)
• More computationally expensive than linear/quadratic
2.3.9 Lets look at three more examples where cubic method interpolation can be used
in real life
Example 1 : A robotic arm records its X, Y, Z coordinates every 0.5 seconds, but due
to sensor lag, some positions are missing. Since robotic and human arm movements are
fluid and continuous, cubic interpolation ensures smooth transitions between known
positions.
[62]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = [10, 12, np.nan, 15, np.nan, np.nan, 21, 23, np.nan, 28]
time = list(range(len(data)))
df = pd.DataFrame({"time": time, "position": data})
df
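The cell [63] that added the interpolated_position column shown in the output below is not preserved in this export; presumably it contained something like:

```python
# cubic interpolation over the position gaps (sketch of the missing cell)
df["interpolated_position"] = df["position"].interpolate(method="cubic")
df
```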
[63]: time position interpolated_position
0 0 10.0 10.000000
1 1 12.0 12.000000
2 2 NaN 13.508547
3 3 15.0 15.000000
4 4 NaN 16.833333
5 5 NaN 18.905983
6 6 21.0 21.000000
7 7 23.0 23.000000
8 8 NaN 25.200855
9 9 28.0 28.000000
plt.plot(df["time"], df["interpolated_position"], "g--o", label="Cubic interpolated data")
• Prevents jerky motion.
• Preserves natural motion curves
• Useful in game development , robotics path planning and animation rigs
Code explanation:
• Breakdown of "ro-":
– r -> red color
– o -> circle markers at data points
– - -> solid line
• So "ro-" plots:
– A red solid line with circle markers at the data points.
• Breakdown of "g--o":
– g -> green color
– -- -> dashed lines
– o -> circle markers
• So, g--o plots:
– A green dashed line with circle markers at the data points.
Other Examples:
(truncated output: the tail of a month/temperature DataFrame; months 9 and 10 read 23.8 and 20.0)
[66]: # interpolate the data and handle NaN values using the cubic method here
df["interpolated_temperature_data_using_cubic_method"] = df["temperature"].interpolate(method="cubic")
df
[67]: # Now plotting both the original temperature data and the data interpolated using the cubic method in matplotlib
plt.figure(figsize=(12,8))
plt.plot(df["month"], df["temperature"], color="red", linestyle="-", marker="o",
         label="Original temperature data")
plt.plot(df["month"], df["interpolated_temperature_data_using_cubic_method"], color="green", linestyle="--",
         label="Cubic interpolated temperature data")
Let's see what the difference is between linear and cubic interpolation
[68]: df["linearly_interpolated_temperature"] = df["temperature"].interpolate(method="linear")
df
(truncated output: the linearly_interpolated_temperature column, rising from 15.2 at row 0 to 26.2 at row 7 and back down to 20.0 at row 9)
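The cell [69] that added the quadratic_interpolated_temperature column shown below is not preserved in this export; presumably it looked like this:

```python
# quadratic interpolation of the same temperature column (sketch of the missing cell)
df["quadratic_interpolated_temperature"] = df["temperature"].interpolate(method="quadratic")
df
```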
(truncated output: the linear and quadratic columns side by side; they agree at the known months and differ at the interpolated ones, e.g. row 2 is 18.45 with linear vs 17.87 with quadratic)
[70]: # Now plotting the linear, quadratic and cubic interpolated data along with the original data with missing gaps in the graph
plt.figure(figsize=(12,8))
plt.plot(df["month"], df["temperature"], color="red", linestyle="-", marker="o",
         label="Original temperature data")
plt.plot(df["month"], df["interpolated_temperature_data_using_cubic_method"], color="green", linestyle="--",
         label="Cubic interpolated temperature")
plt.plot(df["month"], df["linearly_interpolated_temperature"], color="orange", linestyle="-.",
         label="Linearly interpolated temperature")
plt.plot(df["month"], df["quadratic_interpolated_temperature"], color="blue", linestyle=":",
         label="Quadratic interpolated temperature")
(truncated output: the tail of a frame/amplitude DataFrame used for an audio example, e.g. amplitudes -0.5, NaN, 0.0, 0.3, NaN at frames 5 to 9)
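Neither the cell that built this frame/amplitude DataFrame nor the one that added interpolated_amplitude_using_cubic_method survived the export; a minimal sketch under the assumption that the few visible values are representative:

```python
# hypothetical audio-amplitude example (values partly assumed from the truncated output)
df = pd.DataFrame({
    "frame": range(10),
    "amplitude": [0.0, 0.2, np.nan, 0.4, 0.1, -0.5, np.nan, 0.0, 0.3, np.nan],
})
# cubic interpolation of the gaps; the trailing NaN cannot be interpolated (no valid point after it)
df["interpolated_amplitude_using_cubic_method"] = df["amplitude"].interpolate(method="cubic")
df
```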
Since the interpolate function requires at least two data points to fill in the
missing values between them, the last value has been left as NaN.
Hence it's better to ignore the last row
plt.figure(figsize=(12,8))
plt.plot(df["frame"], df["amplitude"], color="red", linestyle="-", marker="o",
         label="Original amplitude with missing data points (example : NaN values)")
plt.plot(df["frame"], df["interpolated_amplitude_using_cubic_method"],
         color="green", linestyle="--", marker="o", label="Interpolated amplitude using cubic method")
2.3.10 Let's look at three more examples where the nearest method interpolation can be
used in real life
The nearest method replaces a missing value with the closest non-null value either before or after
the missing entry (whichever is closer in position).
[75]: temperature = [25.0, 25.1, np.nan, np.nan, 25.3, 25.4, np.nan, 25.6, np.nan, 25.9]
seconds = list(range(len(temperature)))
df = pd.DataFrame({"second": seconds, "temperature": temperature})
df
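The cell [76] that added the nearest_interpolated_temperature column used in the plot below is not preserved; presumably it was:

```python
# nearest-neighbour interpolation: each NaN takes the closest surrounding value (sketch of the missing cell)
df["nearest_interpolated_temperature"] = df["temperature"].interpolate(method="nearest")
df
```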
Now plot the original temperature column and the interpolated temperature column using
matplotlib
[77]: plt.figure(figsize=(8,4))
plt.plot(df["second"], df["temperature"], color="red", linestyle="-", marker="o",
         label="Original temperature values")
plt.plot(df["second"], df["nearest_interpolated_temperature"], color="green", linestyle="--", marker="o",
         label="Nearest interpolated temperature values")
plt.legend()
plt.grid()
plt.show()
2.3.13 Let's look at three more examples where 'barycentric' and 'polynomial'
method interpolation can be used in real life
Both ‘barycentric’ and ‘polynomial’ interpolation methods are based on polynomial interpolation,
where a single polynomial is fitted through all the given data points.
Important notes:
• Very flexible — can pass exactly through known points.
• But risky — overfitting, especially when data is noisy or there are many points (Runge’s
phenomenon). The voltage/pH data used below is sketched right after this list.
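The cell that defined the voltage and pH lists used below is not preserved in this export; judging from the printed frame in cell [78], it presumably was:

```python
# pH-sensor calibration points (values taken from the output of cell [78])
voltage = [0.1, 0.2, 0.4, 0.5, 0.7]
pH = [1.0, 2.0, 3.8, 4.9, 7.2]
```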
df = pd.DataFrame({
    "voltage": voltage, "pH": pH
})
df
[78]: voltage pH
0 0.1 1.0
1 0.2 2.0
2 0.4 3.8
3 0.5 4.9
4 0.7 7.2
[79]: df = df.set_index("voltage")
df
[79]: pH
voltage
0.1 1.0
0.2 2.0
0.4 3.8
0.5 4.9
0.7 7.2
[80]: # Interpolation at missing voltage points
new_index = np.linspace(0.1, 0.7, 100)
new_index
Here we are using np.linspace to generate data for our data frame:
• The data being generated here is the voltage data points
• The data is generated between 0.1 and 0.7
• The number of data points being generated here is 100
• syntax :
– np.linspace(lower_limit, upper_limit, number of data points generated)
Reasons for doing this : The original data has only 5 points → we want to interpolate pH at
intermediate voltages (like 0.11, 0.13, 0.33, etc.) for a smooth curve. The reindexing step that produced the frame below is sketched next.
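The cell [81] that built interpolated_index is missing from the export; a minimal sketch of what it presumably did (reindex the voltage-indexed frame onto the union of the old and new voltage points, leaving pH as NaN at the new points; a plain reindex onto new_index alone is also possible):

```python
# combine the original voltages with the 100 generated ones; pH stays NaN at the new points
interpolated_index = df.reindex(df.index.union(new_index))
interpolated_index
```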
[81]: pH
voltage
0.100000 1.0
0.106061 NaN
0.112121 NaN
0.118182 NaN
0.124242 NaN
… …
0.675758 NaN
0.681818 NaN
0.687879 NaN
0.693939 NaN
0.700000 7.2
Now that we have the desired dataFrame on which we can perform polynomial interpolation
of the pH values, let's move forward with that
[82]: interpolated_index["pH"] = interpolated_index["pH"].interpolate(method="polynomial", order=4)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[82], line 1
----> 1 interpolated_index["pH"] = interpolated_index["pH"].interpolate(method="polynomial", order=4)
(internal pandas / SciPy frames omitted)
ValueError: The number of derivatives at boundaries does not match: expected 2, got 0+0
2.3.14 Explanation :
This is due to the fact that polynomial interpolation using interpolate(method="polynomial",
order=4) cannot handle missing values at the beginning or end of the DataFrame (NaNs at the
boundaries), because polynomial interpolation needs a full polynomial to fit smoothly – and it
requires enough known values around the missing value to do so.
[83]: interpolated_index["plynomial_interpolated_pH"] = interpolated_index.interpolate(method="polynomial", order=4, limit_direction="both")
interpolated_index

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[83], line 1
----> 1 interpolated_index["plynomial_interpolated_pH"] = interpolated_index.interpolate(method="polynomial", order=4, limit_direction="both")
      2 interpolated_index
(internal pandas / SciPy frames omitted)
ValueError: The number of derivatives at boundaries does not match: expected 2, got 0+0
What is np.polyfit()?
• np.polyfit(x, y, deg) fits a polynomial of degree deg to the data (x, y).
• It returns the coefficients of the best-fit polynomial (from the highest-degree term down to the constant term).
Why use a 4th-degree polynomial for 5 points? When using np.polyfit(voltage, pH, deg=4), you are asking for a polynomial of degree 4 (i.e. the highest term is x^4) that fits your 5 data points.
• A polynomial of degree n can exactly fit n + 1 data points (if they are unique and well-behaved).
• You have 5 data points, so you can fit a polynomial of degree 4 that passes exactly through all of them. This is called polynomial interpolation.
• A 4th-degree polynomial, y = a4·x^4 + a3·x^3 + a2·x^2 + a1·x + a0, has 5 coefficients, a0 to a4. To solve for those 5 unknowns you need 5 equations, which you get from the 5 data points. So this choice guarantees an exact fit: the curve goes exactly through each data point.
Why not a higher-degree polynomial? With only 5 points, a polynomial of degree 5 or more has more unknowns than equations, so the system is underdetermined: the fit is no longer unique and tends to overfit.
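To make this concrete, here is a minimal sketch of a degree-4 interpolating fit. The five calibration points below are made up for illustration; only np.polyfit and np.polyval from NumPy are assumed.

import numpy as np

# five hypothetical voltage/pH calibration points (illustrative values only)
voltage = np.array([0.10, 0.35, 0.60, 0.85, 1.10])
ph = np.array([2.0, 3.5, 5.1, 6.4, 7.2])

# degree-4 fit: 5 coefficients and 5 points, so the curve passes exactly through every point
coeffs = np.polyfit(voltage, ph, deg=4)

# evaluate the interpolating polynomial on a dense grid
voltage_dense = np.linspace(voltage.min(), voltage.max(), 200)
ph_dense = np.polyval(coeffs, voltage_dense)

# sanity check: the fit reproduces the original points (up to floating-point error)
print(np.allclose(np.polyval(coeffs, voltage), ph))  # True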
(output: an array of interpolated values on the dense grid, truncated in this export)
[87]: df = pd.DataFrame({
          "interpolated_voltage" : interpolated_voltage,
          "interpolated_ph" : interpolated_ph
      })
      df

      plt.xlabel("Voltage")
      plt.ylabel("pH")
      plt.legend()
      plt.grid(True)
      plt.title("Polynomial Interpolation (order 4)")
      plt.show()
Why is the relationship between pH and voltage assumed to be ideal here? In electrochemistry the relationship between voltage (measured by a pH electrode) and pH is described by the Nernst equation. For a typical glass pH electrode it is simplified as E = E0 + (0.0591 × pH), or equivalently pH = (E − E0)/0.0591, where:
• E = measured potential (voltage) in volts
• E0 = standard electrode potential (reference)
• 0.0591 = voltage change per unit pH at 25 degrees Celsius (room temperature)
This equation implies a linear and predictable relationship between pH and voltage under ideal conditions.
This idealized relationship is assumed here, for example:
– to benchmark ideal sensor behavior before accounting for real-world drift, hysteresis, or contamination.
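A minimal sketch of this idealized conversion (assuming the simplified 25 °C slope of 0.0591 V per pH unit; the reference value E0 below is purely illustrative):

def ph_from_voltage(e_volts, e0_volts=0.0):
    # ideal Nernst-style relation at 25 degrees Celsius: pH = (E - E0) / 0.0591
    return (e_volts - e0_volts) / 0.0591

# example: a reading 0.177 V above the reference corresponds to roughly pH 3
print(round(ph_from_voltage(0.177), 2))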
Summary:
2.3.17 Let's look at three more examples where the 'pchip' interpolation method can be used in real life
PCHIP stands for Piecewise Cubic Hermite Interpolating Polynomial. Unlike regular cubic spline interpolation (which might overshoot or oscillate), PCHIP is:
• Monotonic: preserves the monotonic (increasing or decreasing) trend of the data.
• Shape-preserving: avoids sharp turns and unnatural wiggles.
It is not smooth in second derivatives, but it gives a more realistic curve for non-linear or step-like data.
What does “not smooth in second derivatives” mean? In calculus, the first derivative of a curve gives the slope (i.e. how steep it is), and the second derivative tells you how the slope is changing, i.e. the curvature. A curve is smooth in the second derivative if it has continuous curvature, without any sudden changes in how it bends.
89
Why is this important? Because real-world data is not always smooth. If you enforce smooth curvature (as a cubic spline does), it may introduce unrealistic wiggles, especially in non-linear or step-like data (for example, sudden jumps or sharp transitions). PCHIP trades off a bit of mathematical elegance (a smooth second derivative) for practical realism.
How may unrealistic wiggles get introduced? When you enforce smooth curvature, such as with cubic spline interpolation, the algorithm tries to make the curve as smooth as possible, especially in terms of the second derivative (curvature). This smoothness constraint often causes unrealistic oscillations or “wiggles”, especially when your data has:
• Sudden jumps
• Sharp transitions
• Flat regions followed by rapid changes
How a wiggle occurs (in simple terms): Imagine you are connecting a series of pegs with a flexible ruler (representing a cubic spline). You are trying to bend it smoothly through all the pegs, but if some pegs are suddenly higher or lower than others, the ruler will curve sharply before or after that point. This overshooting creates wiggles: the curve temporarily dips or peaks between the actual data points.
• Mathematical insight:
  – The function is continuous.
  – The first and second derivatives are continuous.
  – But the spline does not know your data might be non-smooth in reality, so it creates an unrealistically smooth curve that does not respect real transitions.
Real world example: Suppose you are logging the power consumption (in watts) of a smart device every minute. The readings could look like this:

| Time     | Power Usage | Note                       |
|----------|-------------|----------------------------|
| 10:00 AM | 10 W        |                            |
| 10:01 AM | 12 W        |                            |
| 10:02 AM | 13 W        |                            |
| 10:03 AM | 300 W       | Device turned ON suddenly  |
| 10:04 AM | 305 W       |                            |
| 10:05 AM | 12 W        | Device turned OFF suddenly |

If you use cubic interpolation here to smooth between these points, it might create curves that dip below or rise above the realistic values before or after the sharp jump at 10:03 AM, even dipping below 10 W or overshooting above 305 W, which never actually happened.
Let's implement it and see for ourselves what the difference is between PCHIP and cubic interpolation of the data
[89]: data = {
"time" : ['10:00', '10:01', '10:02', '10:03', '10:04', '10:05'],
"power" : [ 10, 12, 15, 300, 302, 305]
}
df = pd.DataFrame(data)
df
[90]: # Let's check whether the time column values are of string datatype or not
df["time"].apply(lambda x : isinstance(x,str)).all()
90
[90]: True
Since the result is True, the time column holds its values as strings rather than as time objects
[91]: # converting the data stored in string dtype to time dtype in time column
df["time"] = pd.to_datetime(df["time"],format="%H:%M")
# Now lets check if the dtype has changed from string to datetime in time␣
↪column of the dataFrame
df["time"].apply(lambda x : isinstance(x,str)).all()
[91]: False
[92]: df
[94]: x = df["timestamp"]
y = df["power"]
# create 200 evenly spaced values between the minimum and maximum value present in the timestamp column
x_dense = np.linspace(x.min(), x.max(), 200)
x_dense
(output: 200 evenly spaced values between 0.0 and 5.0, truncated in this export)
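The cell that creates the pchip and cubic interpolator objects used in the next cell is not visible in this export; a minimal sketch, assuming SciPy's PchipInterpolator and CubicSpline, would be:

from scipy.interpolate import PchipInterpolator, CubicSpline

# build the two interpolators from the original (x, y) power samples
pchip = PchipInterpolator(x, y)
cubic = CubicSpline(x, y)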
[96]: # use pchip and cubic interpolator to get the y (power) values using pchip and␣
↪cubic interpolation method respectively
y_pchip = pchip(x_dense)
y_cubic = cubic(x_dense)
[97]: # plot the graph showing the difference between the pchip and cubic interpolation methods
      plt.figure(figsize=(8,4))
      plt.plot(x, y, marker="o", label="Original Data", color="black")
      plt.plot(x_dense, y_pchip, linestyle="-", label="PCHIP (Monotonic) interpolated data", linewidth=2)
      plt.plot(x_dense, y_cubic, linestyle="--", label="Cubic interpolated data", linewidth=2)
As you can see, if you focus on the orange dashed line here, this is what it means to introduce a wiggle when using the cubic interpolation method on data points that show sudden jumps or sharp transitions.
93
• Growth Curves for Children :
– A child’s height is monotonically increasing (it never decreases).
– Using PCHIP, we ensure that interpolated height values never decrease between checkups
— and we don’t introduce odd wiggles in the curve.
Summary:

| Feature                   | PCHIP |
|---------------------------|-------|
| Preserves shape           | Yes   |
| Monotonicity              | Yes   |
| Smooth first derivative   | Yes   |
| Smooth second derivative  | No    |
| Good for step-like data   | Yes   |
| Prone to overfitting      | No    |
| Risk of overshoot/wiggle  | Low   |
2.3.18 Let's look at three more examples where the 'akima' interpolation method can be used in real life
Akima spline interpolation is a type of piecewise cubic interpolation that:
• Uses information from neighboring slopes to prevent overshooting or oscillations (wiggles).
• Produces smooth and visually realistic curves.
• Is not globally smooth in second derivatives (i.e. it is not as smooth as a cubic spline), but it handles sharp changes in the data much better.
When to use Akima interpolation: use Akima interpolation when:
• Your data has sharp transitions or discontinuities.
• You want to avoid the oscillations that come from cubic spline interpolation.
• You want to balance smoothness against preserving local sharp features.
When not to use Akima interpolation: avoid Akima interpolation when:
• You need very smooth second derivatives (for example, physical simulations involving acceleration or curvature).
• Your data is very clean and smooth, and a simple cubic spline would suffice.
• You are working with very large datasets where performance is critical; Akima is more computationally intensive than linear interpolation.
94
3 Real-World Examples of When to Use Akima
1. Stock price charting (with sudden jumps due to news)
   • Problem: you want to interpolate between stock prices, but there is a jump due to an announcement.
   • Why Akima? It avoids overshooting and artifacts around jumps.
2. Sensor data with outliers or sudden changes
   • Example: water pressure spikes in industrial sensors.
   • Why Akima? It avoids wiggles and provides stable interpolation without overfitting to outliers.
3. Medical signals (like ECG or heart rate)
   • ECG data has peaks (the QRS complex); you don't want spline interpolation to create artifacts around those.
   • Why Akima? It preserves sharp changes and avoids adding “fake” peaks.
2.3.19 Let's see what Akima interpolation looks like with a real-life example
from scipy.interpolate import Akima1DInterpolator, CubicSpline, PchipInterpolator

# Create interpolators
akima_interp = Akima1DInterpolator(x,y)
cubic_interp = CubicSpline(x,y)
pchip_interp = PchipInterpolator(x,y)

# Evaluate each interpolator on the dense grid (needed for the dataFrame below)
y_akima = akima_interp(x_dense)
y_cubic = cubic_interp(x_dense)
y_pchip = pchip_interp(x_dense)

# create a dataFrame
df = pd.DataFrame({
    "time" : x_dense,
    "stock_prices_interpolated_akima_method" : y_akima,
    "stock_prices_interpolated_cubic" : y_cubic,
    "stock_prices_interpoalted_pchip" : y_pchip
})
df
(output truncated: the dense time column runs up to 13.0; the cubic and pchip price columns are reproduced below)
stock_prices_interpolated_cubic stock_prices_interpoalted_pchip
0 100.000000 100.000000
1 98.610726 100.000376
2 97.272002 100.001501
3 95.983327 100.003376
4 94.744200 100.005998
.. … …
295 182.608276 184.813870
296 183.171024 184.860183
297 183.757013 184.906643
298 184.366564 184.953249
299 185.000000 185.000000
97
Plotting the graph of the interpolated values using the cubic and Akima interpolation methods
[100]: plt.figure(figsize=(10,6))
       plt.plot(x, y, marker="o", label="Original Stock prices with gaps", markersize=8, color="black")
98
Plotting the graph of the interpolated values using the cubic vs Akima vs PCHIP interpolation methods
[101]: plt.figure(figsize=(10,6))
       plt.plot(x, y, marker="o", label="Original Stock prices with gaps", markersize=8, color="black")
       plt.plot(x_dense, y_pchip, linestyle="-.", color="orange", linewidth=2, label="PCHIP interpolation")
99
2.3.20 Explanation of the graph of the akima vs cubic vs pchip interpolation methods:
Key observations:

| Time (approx) | What's happening in the graph |
|---------------|-------------------------------|
| 9.0 - 10.0    | Cubic (green) dips unrealistically below the original values, an overshoot caused by its curvature constraint. Akima (red) and PCHIP (orange) stay more stable and realistic. |
| 10.0 - 11.0   | All curves rise sharply due to a jump in stock price. Cubic again overshoots a bit, while Akima and PCHIP stay closer to the true values. |
| 11.0 - 13.0   | Cubic shows wiggles (up-and-down oscillations) that are not in the actual data. PCHIP and Akima give more stable curves. |
100
• It is less sensitive to large changes in adjacent points but may still introduce small
oscillations or dips between points. Akima can introduce dips or local minima/maxima
if the slope changes rapidly or unevenly between adjacent data points.
2.3.21 Let's look at three more examples where the from_derivatives interpolation method can be used in real life
The from_derivatives interpolation method constructs an interpolating polynomial based on:
• Function values (y)
• Derivative values (dy {first derivative}, ddy {second derivative}) at known data points.
This is known as Hermite interpolation (as opposed to Lagrange or spline interpolation).
How does it work? Mathematically, it uses Hermite polynomials, which enforce that both the function and its derivatives match at the interpolation nodes. Unlike spline interpolation (which just fits the function values), Hermite interpolation also ensures smooth changes by matching slopes (and optionally higher derivatives). It is most often used via scipy.interpolate.BPoly.from_derivatives(x, y_ders).
What are Hermite polynomials? Hermite polynomials are a sequence of orthogonal polynomials used in mathematics, physics, and numerical analysis, especially in problems involving the Gaussian weight function, quantum mechanics, and interpolation (like the Piecewise Cubic Hermite Interpolating Polynomial, or PCHIP).
101
Core Ideas
• Hermite polynomials Hₙ(x) are solutions to the Hermite differential equation:
  – y″ − 2xy′ + 2ny = 0
• They form an orthogonal basis with respect to the weight function w(x) = e^(−x²) (for the physicists' version).
• First few Hermite polynomials:
  – H₀(x) = 1
  – H₁(x) = 2x
  – H₂(x) = 4x² − 2
  – H₃(x) = 8x³ − 12x
  – H₄(x) = 16x⁴ − 48x² + 12
Applications:
1. Quantum mechanics
   • Hermite polynomials appear in the solution to the quantum harmonic oscillator.
2. Gaussian quadrature
   • They are used to construct Gauss-Hermite quadrature rules for integrals involving e^(−x²).
3. Hermite interpolation
   • In numerical analysis, Hermite interpolation uses not only function values but also derivatives at the interpolation points.
   • PCHIP and from_derivatives are based on this principle, though not necessarily on Hermite polynomials themselves.
4. Machine learning and statistics
   • Used in expansions of functions in terms of orthogonal polynomials.
Recurrence relation: you can compute them using H₀(x) = 1, H₁(x) = 2x, Hₙ₊₁(x) = 2xHₙ(x) − 2nHₙ₋₁(x).
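As a quick, self-contained illustration of this recurrence (not part of the original notebook), the coefficients of the first few Hermite polynomials can be generated with NumPy:

import numpy as np

def hermite_coeffs(n):
    # returns polynomial coefficients (highest degree first) of H_n using
    # H_0 = 1, H_1 = 2x, H_{n+1} = 2x*H_n - 2n*H_{n-1}
    h_prev, h_curr = np.array([1.0]), np.array([2.0, 0.0])
    if n == 0:
        return h_prev
    for k in range(1, n):
        two_x_h = 2.0 * np.append(h_curr, 0.0)             # multiply H_k by 2x
        two_k_h = 2.0 * k * np.append([0.0, 0.0], h_prev)   # pad H_{k-1} to the same length
        h_prev, h_curr = h_curr, two_x_h - two_k_h
    return h_curr

print(hermite_coeffs(2))  # [ 4.  0. -2.]        ->  4x^2 - 2
print(hermite_coeffs(3))  # [ 8.  0. -12.  0.]   ->  8x^3 - 12x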
102
Clarifying confusion: PCHIP vs Hermite polynomials
• PCHIP uses Hermite-style interpolation, which means it ensures continuity of the function value and the slope (1st derivative).
• It does not directly use Hermite polynomials; the name “Hermite” here refers to the interpolation style, not the specific set of orthogonal polynomials.
What are orthogonal polynomials? Orthogonal polynomials are a class of polynomials that are mutually orthogonal under some inner product on a function space (usually defined over an interval with a weight function). In simpler terms: two polynomials are orthogonal if their weighted inner product (a kind of dot product) equals zero.
Formal definition of orthogonality: let w(x) be a weight function defined on an interval [a, b]. Then a sequence of polynomials Pₙ(x) is called orthogonal if ∫ₐᵇ Pₙ(x)Pₘ(x)w(x)dx = 0 whenever n ≠ m. This means:
• Each polynomial Pₙ has degree n.
• They are “independent” of each other under integration with the weight function w(x).
103
Intuition Think of how vectors can be orthogonal in geometry — at 90° to each other. Orthog-
onal polynomials are “perpendicular” under integration. This makes them perfect for: - Function
approximation (like Fourier series, but with polynomials). - Reducing error in numerical methods
(e.g., least squares). - Solving differential equations (like in physics and engineering).
Example 1: Physics, a particle trajectory with known velocity at each point. Use the BPoly.from_derivatives() constructor in SciPy to implement from_derivatives:

from scipy.interpolate import BPoly
import numpy as np

x = [0, 1, 2]
# each entry is [position, velocity]; the first derivative supplied here is the velocity
y_ders = [
    [0, 10],   # at t=0, position=0, velocity=10
    [15, 5],   # at t=1, position=15, velocity=5
    [25, 0]    # at t=2, position=25, velocity=0
]
poly = BPoly.from_derivatives(x, y_ders)
x_vals = np.linspace(0, 2, 100)
y_vals = poly(x_vals)
104
Example 2: Thermodynamics, temperature vs time with a known heating rate
[104]: x = [0, 5, 10]   # time in minutes
       y_ders = [
           [25, 2],   # at 0 min: 25°C, heating at 2°C/min
           [35, 1],   # at 5 min: 35°C, heating at 1°C/min
           [40, 0]    # at 10 min: 40°C, heating stopped
       ]
       poly = BPoly.from_derivatives(x, y_ders)
       # generate 500 values between 0 and 10
       x_vals = np.linspace(0, 10, 500)
       y_vals = poly(x_vals)
105
When to use? Use it when:
• You know or can estimate derivatives at the data points.
• You want smooth curves with custom control over slope and curvature.
• You are simulating or modeling physical systems (for example, motion or thermodynamics).
2.4 Sometimes there might be a situation where I have to drop the rows with n.a. or NaN values
[105]: # pc path
df_missing_values = pd.read_excel("/home/aditya/github/
↪Deep-learning-prerequisite/pandas/dataset/nyc_weather_missing_data_demo.
↪xlsx")
# macbook path
# df_missing_values = pd.read_excel("/home/machine_learning/github/
↪Deep-learning-prerequisite/pandas/dataset/nyc_weather_missing_data_demo.
df_missing_values
106
1 1/2/2016 36.0 18 46 30.02
2 1/3/2016 40.0 21 47 29.86
3 1/4/2016 25.0 9 44 30.05
4 1/5/2016 NaN -3 41 30.57
5 1/6/2016 33.0 4 35 30.50
6 1/7/2016 39.0 11 33 30.28
7 1/8/2016 39.0 29 64 30.20
8 1/9/2016 44.0 38 77 30.16
9 1/10/2016 50.0 46 71 29.59
10 1/11/2016 33.0 8 37 29.92
11 1/12/2016 35.0 15 53 29.85
12 1/13/2016 26.0 4 42 29.94
13 1/14/2016 NaN 12 47 29.95
14 1/15/2016 43.0 31 62 29.82
15 1/16/2016 47.0 37 70 29.52
16 1/17/2016 36.0 23 66 29.78
17 1/18/2016 25.0 6 53 29.83
18 1/19/2016 22.0 3 42 30.03
19 1/20/2016 32.0 15 49 30.13
20 1/21/2016 31.0 11 45 30.15
21 1/22/2016 26.0 6 41 30.21
22 1/23/2016 NaN 21 78 29.77
23 1/24/2016 NaN 11 53 29.92
24 1/25/2016 NaN 18 54 30.25
25 1/26/2016 43.0 29 56 30.03
26 1/27/2016 41.0 22 45 30.03
27 1/28/2016 37.0 20 51 29.90
28 1/29/2016 36.0 21 50 29.58
29 1/30/2016 34.0 16 46 30.01
30 1/31/2016 46.0 28 52 29.90
107
15 8 7.0 0.24 7 Rain
16 8 6.0 0.05 6 Fog-Snow
17 9 12.0 T 2 Snow
18 10 11.0 0 1 NaN
19 10 6.0 0 2 NaN
20 10 6.0 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16.0 2.31 8 Fog-Snow
23 8 6.0 T 3 Snow
24 10 3.0 0 2 NaN
25 10 7.0 0 2 NaN
26 10 7.0 T 3 Rain
27 10 5.0 0 1 NaN
28 10 8.0 0 4 NaN
29 10 7.0 0 0 NaN
30 10 5.0 0 0 NaN
WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
108
29 257.0
30 NaN
WindDirDegrees
8 76.0
15 340.0
16 345.0
17 293.0
As you can see all the rows containing NaN values have been removed
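The cell that produced the output above is not included in this export; presumably it was a plain .dropna() call, which by default drops every row containing at least one NaN. A sketch under that assumption:

df_rows_with_any_NaN_dropped = df_missing_values.dropna()  # default: how="any"
df_rows_with_any_NaN_dropped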
Suppose there is a situation where I have to drop all the rows that have all NaN values.
[107]: df_drop_rows_having_all_NaN = df_missing_values.dropna(how="all")
df_drop_rows_having_all_NaN
109
16 1/17/2016 36.0 23 66 29.78
17 1/18/2016 25.0 6 53 29.83
18 1/19/2016 22.0 3 42 30.03
19 1/20/2016 32.0 15 49 30.13
20 1/21/2016 31.0 11 45 30.15
21 1/22/2016 26.0 6 41 30.21
22 1/23/2016 NaN 21 78 29.77
23 1/24/2016 NaN 11 53 29.92
24 1/25/2016 NaN 18 54 30.25
25 1/26/2016 43.0 29 56 30.03
26 1/27/2016 41.0 22 45 30.03
27 1/28/2016 37.0 20 51 29.90
28 1/29/2016 36.0 21 50 29.58
29 1/30/2016 34.0 16 46 30.01
30 1/31/2016 46.0 28 52 29.90
110
30 10 5.0 0 0 NaN
WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
29 257.0
30 NaN
If I want to keep only the rows that have at least one non-NaN value and drop all the other rows.
[108]: df_keep_the_row_with_one_NaN_val = df_missing_values.dropna(thresh=1)
df_keep_the_row_with_one_NaN_val
111
6 1/7/2016 39.0 11 33 30.28
7 1/8/2016 39.0 29 64 30.20
8 1/9/2016 44.0 38 77 30.16
9 1/10/2016 50.0 46 71 29.59
10 1/11/2016 33.0 8 37 29.92
11 1/12/2016 35.0 15 53 29.85
12 1/13/2016 26.0 4 42 29.94
13 1/14/2016 NaN 12 47 29.95
14 1/15/2016 43.0 31 62 29.82
15 1/16/2016 47.0 37 70 29.52
16 1/17/2016 36.0 23 66 29.78
17 1/18/2016 25.0 6 53 29.83
18 1/19/2016 22.0 3 42 30.03
19 1/20/2016 32.0 15 49 30.13
20 1/21/2016 31.0 11 45 30.15
21 1/22/2016 26.0 6 41 30.21
22 1/23/2016 NaN 21 78 29.77
23 1/24/2016 NaN 11 53 29.92
24 1/25/2016 NaN 18 54 30.25
25 1/26/2016 43.0 29 56 30.03
26 1/27/2016 41.0 22 45 30.03
27 1/28/2016 37.0 20 51 29.90
28 1/29/2016 36.0 21 50 29.58
29 1/30/2016 34.0 16 46 30.01
30 1/31/2016 46.0 28 52 29.90
112
20 10 6.0 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16.0 2.31 8 Fog-Snow
23 8 6.0 T 3 Snow
24 10 3.0 0 2 NaN
25 10 7.0 0 2 NaN
26 10 7.0 T 3 Rain
27 10 5.0 0 1 NaN
28 10 8.0 0 4 NaN
29 10 7.0 0 0 NaN
30 10 5.0 0 0 NaN
WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
29 257.0
30 NaN
113
[109]: df_keep_the_row_with_one_NaN_val = df_missing_values.dropna(thresh=10)
df_keep_the_row_with_one_NaN_val
114
29 10 7.0 0 0 NaN
WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
6 293.0
7 79.0
8 76.0
9 109.0
11 235.0
12 284.0
14 101.0
15 340.0
16 345.0
17 293.0
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
26 NaN
29 257.0
Here thresh (i.e. threshold) means the minimum number of non-NaN values a row must have in order to be kept when the .dropna() method is run in pandas.
• Basically it is saying that:
  – If there are 11 columns in the dataFrame,
  – and there is a row where, out of the 11 columns, 10 of them have values and 1 of them has a NaN value,
  – then with df.dropna(thresh=10), every row that has at least 10 columns with data (and therefore at most 1 NaN) will not be dropped by .dropna(); it will remain in the dataFrame after processing.
  – All the rows that do not meet this thresh rule will be dropped.
A tiny illustration of this rule is sketched below.
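A tiny, self-contained illustration of the thresh rule (the toy dataFrame below is made up for this example):

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [2, 5, np.nan],
    "c": [3, np.nan, np.nan],
})
# keep only rows with at least 2 non-NaN values:
# row 0 has 3 non-NaN -> kept; row 1 has 1 -> dropped; row 2 has 0 -> dropped
print(toy.dropna(thresh=2))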
2.5 Insert the missing dates in the dataFrame where the date column is set as the index of the dataFrame
Refer to the pandas docs for more info: [Link]
Set the EST column, which is essentially a date-type column, as the index of the dataFrame.
• Convert the EST column from string type to datetime type.
• Then set the EST column as the index of the dataFrame.
Verify whether the EST column has string-type values stored in it.
115
[110]: df_missing_values["EST"].apply(lambda x: isinstance (x,str)).all()
[110]: True
Convert the values in the EST column to be from string type to date type
[111]: df_missing_values["EST"] = pd.to_datetime(df_missing_values["EST"],format="%m/
↪%d/%Y",errors="coerce")
df_missing_values
116
0 10 8.0 0 5 NaN
1 10 7.0 0 3 NaN
2 10 8.0 0 1 NaN
3 10 9.0 0 3 NaN
4 10 5.0 0 0 NaN
5 10 4.0 0 0 NaN
6 10 2.0 0 3 NaN
7 10 4.0 0 8 NaN
8 9 8.0 T 8 Rain
9 4 NaN 1.8 7 Rain
10 10 NaN 0 1 NaN
11 10 6.0 T 4 NaN
12 10 10.0 0 0 NaN
13 10 5.0 T 7 NaN
14 9 5.0 T 2 NaN
15 8 7.0 0.24 7 Rain
16 8 6.0 0.05 6 Fog-Snow
17 9 12.0 T 2 Snow
18 10 11.0 0 1 NaN
19 10 6.0 0 2 NaN
20 10 6.0 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16.0 2.31 8 Fog-Snow
23 8 6.0 T 3 Snow
24 10 3.0 0 2 NaN
25 10 7.0 0 2 NaN
26 10 7.0 T 3 Rain
27 10 5.0 0 1 NaN
28 10 8.0 0 4 NaN
29 10 7.0 0 0 NaN
30 10 5.0 0 0 NaN
WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
117
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
29 257.0
30 NaN
Let’s check if all the values in the EST column are of type string
[112]: df_missing_values["EST"].apply(lambda x: isinstance(x,str)).all()
[112]: False
[113]: df_missing_values["EST"].dtype
[113]: dtype('<M8[ns]')
The result is False, hence the EST column values are no longer of type string, i.e. they have been successfully converted to datetime objects in this dataFrame.
Now we can safely set the EST column as the index of this dataFrame.
[114]: df_missing_values.set_index(df_missing_values["EST"],inplace=True)
df_missing_values
118
2016-01-12 2016-01-12 35.0 15 53 29.85
2016-01-13 2016-01-13 26.0 4 42 29.94
2016-01-14 2016-01-14 NaN 12 47 29.95
2016-01-15 2016-01-15 43.0 31 62 29.82
2016-01-16 2016-01-16 47.0 37 70 29.52
2016-01-17 2016-01-17 36.0 23 66 29.78
2016-01-18 2016-01-18 25.0 6 53 29.83
2016-01-19 2016-01-19 22.0 3 42 30.03
2016-01-20 2016-01-20 32.0 15 49 30.13
2016-01-21 2016-01-21 31.0 11 45 30.15
2016-01-22 2016-01-22 26.0 6 41 30.21
2016-01-23 2016-01-23 NaN 21 78 29.77
2016-01-24 2016-01-24 NaN 11 53 29.92
2016-01-25 2016-01-25 NaN 18 54 30.25
2016-01-26 2016-01-26 43.0 29 56 30.03
2016-01-27 2016-01-27 41.0 22 45 30.03
2016-01-28 2016-01-28 37.0 20 51 29.90
2016-01-29 2016-01-29 36.0 21 50 29.58
2016-01-30 2016-01-30 34.0 16 46 30.01
2016-01-31 2016-01-31 46.0 28 52 29.90
119
2016-01-25 10 3.0 0 2
2016-01-26 10 7.0 0 2
2016-01-27 10 7.0 T 3
2016-01-28 10 5.0 0 1
2016-01-29 10 8.0 0 4
2016-01-30 10 7.0 0 0
2016-01-31 10 5.0 0 0
Events WindDirDegrees
EST
2016-01-01 NaN 281.0
2016-01-02 NaN 275.0
2016-01-03 NaN 277.0
2016-01-04 NaN 345.0
2016-01-05 NaN 333.0
2016-01-06 NaN NaN
2016-01-07 NaN 293.0
2016-01-08 NaN 79.0
2016-01-09 Rain 76.0
2016-01-10 Rain 109.0
2016-01-11 NaN 289.0
2016-01-12 NaN 235.0
2016-01-13 NaN 284.0
2016-01-14 NaN NaN
2016-01-15 NaN 101.0
2016-01-16 Rain 340.0
2016-01-17 Fog-Snow 345.0
2016-01-18 Snow 293.0
2016-01-19 NaN NaN
2016-01-20 NaN 302.0
2016-01-21 NaN 312.0
2016-01-22 Snow 34.0
2016-01-23 Fog-Snow 42.0
2016-01-24 Snow 327.0
2016-01-25 NaN 286.0
2016-01-26 NaN NaN
2016-01-27 Rain NaN
2016-01-28 NaN NaN
2016-01-29 NaN NaN
2016-01-30 NaN 257.0
2016-01-31 NaN NaN
Now that EST has been set as the index, the separate EST column is redundant and can be dropped.
[115]: df_EST_index = df_missing_values.drop("EST",axis=1)
df_EST_index
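Side note: passing the column name (rather than the column itself) to set_index drops the original column automatically, so the separate .drop() step would not be needed; a minimal sketch:

# equivalent to set_index(...) followed by drop("EST", axis=1),
# because drop=True is the default when a column label is passed
df_EST_index = df_missing_values.set_index("EST")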
120
[115]: Temperature DewPoint Humidity Sea Level PressureIn \
EST
2016-01-01 38.0 23 52 30.03
2016-01-02 36.0 18 46 30.02
2016-01-03 40.0 21 47 29.86
2016-01-04 25.0 9 44 30.05
2016-01-05 NaN -3 41 30.57
2016-01-06 33.0 4 35 30.50
2016-01-07 39.0 11 33 30.28
2016-01-08 39.0 29 64 30.20
2016-01-09 44.0 38 77 30.16
2016-01-10 50.0 46 71 29.59
2016-01-11 33.0 8 37 29.92
2016-01-12 35.0 15 53 29.85
2016-01-13 26.0 4 42 29.94
2016-01-14 NaN 12 47 29.95
2016-01-15 43.0 31 62 29.82
2016-01-16 47.0 37 70 29.52
2016-01-17 36.0 23 66 29.78
2016-01-18 25.0 6 53 29.83
2016-01-19 22.0 3 42 30.03
2016-01-20 32.0 15 49 30.13
2016-01-21 31.0 11 45 30.15
2016-01-22 26.0 6 41 30.21
2016-01-23 NaN 21 78 29.77
2016-01-24 NaN 11 53 29.92
2016-01-25 NaN 18 54 30.25
2016-01-26 43.0 29 56 30.03
2016-01-27 41.0 22 45 30.03
2016-01-28 37.0 20 51 29.90
2016-01-29 36.0 21 50 29.58
2016-01-30 34.0 16 46 30.01
2016-01-31 46.0 28 52 29.90
121
2016-01-12 10 6.0 T 4
2016-01-13 10 10.0 0 0
2016-01-14 10 5.0 T 7
2016-01-15 9 5.0 T 2
2016-01-16 8 7.0 0.24 7
2016-01-17 8 6.0 0.05 6
2016-01-18 9 12.0 T 2
2016-01-19 10 11.0 0 1
2016-01-20 10 6.0 0 2
2016-01-21 10 6.0 0 1
2016-01-22 9 NaN 0.01 3
2016-01-23 1 16.0 2.31 8
2016-01-24 8 6.0 T 3
2016-01-25 10 3.0 0 2
2016-01-26 10 7.0 0 2
2016-01-27 10 7.0 T 3
2016-01-28 10 5.0 0 1
2016-01-29 10 8.0 0 4
2016-01-30 10 7.0 0 0
2016-01-31 10 5.0 0 0
Events WindDirDegrees
EST
2016-01-01 NaN 281.0
2016-01-02 NaN 275.0
2016-01-03 NaN 277.0
2016-01-04 NaN 345.0
2016-01-05 NaN 333.0
2016-01-06 NaN NaN
2016-01-07 NaN 293.0
2016-01-08 NaN 79.0
2016-01-09 Rain 76.0
2016-01-10 Rain 109.0
2016-01-11 NaN 289.0
2016-01-12 NaN 235.0
2016-01-13 NaN 284.0
2016-01-14 NaN NaN
2016-01-15 NaN 101.0
2016-01-16 Rain 340.0
2016-01-17 Fog-Snow 345.0
2016-01-18 Snow 293.0
2016-01-19 NaN NaN
2016-01-20 NaN 302.0
2016-01-21 NaN 312.0
2016-01-22 Snow 34.0
2016-01-23 Fog-Snow 42.0
2016-01-24 Snow 327.0
122
2016-01-25 NaN 286.0
2016-01-26 NaN NaN
2016-01-27 Rain NaN
2016-01-28 NaN NaN
2016-01-29 NaN NaN
2016-01-30 NaN 257.0
2016-01-31 NaN NaN
2.6 Adding missing dates in the dataFrame where EST (a date-type column) is set as the index of the dataFrame
Since I don't actually have any missing dates in the dataFrame, I will manually drop a few rows to simulate missing dates.
[116]: df_dropped_dates = df_EST_index.drop(df_EST_index.index[["2016-01-03",␣
↪"2016-01-08" , "2016-01-09"]])
df_dropped_dates
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[116], line 1
----> 1 df_dropped_dates = df_EST_index.
↪drop(df_EST_index.index[["2016-01-03", "2016-01-08" , "2016-01-09"]])
2 df_dropped_dates
    ... (internal pandas frames omitted; the lookup ultimately fails in check_array_indexer, which rejects a list of strings as a positional indexer) ...
Explanation
• df_dropped_dates = df_EST_index.drop(df_EST_index.index[["2016-01-03", "2016-01-08", "2016-01-09"]])
• This causes an error because df_EST_index.index[...] expects integer or boolean indexing, not string labels like "2016-01-03".
• df_EST_index.index[...] tries to index the index object using a list of strings, which is not allowed; positional indexing only works with integers or booleans.
• Instead, I was trying to remove rows by label, not by position.
Solution
• During my search I found out that I can pass those date strings directly to .drop() without using df_EST_index.index[...].
• This works because drop can take labels (as strings) if the index is of datetime64 type; pandas will internally convert the strings to datetimes and match them correctly.
[117]: # This code worked fine on my pc but it's not working on my macbook for some reason
# df_dropped_dates
124
dates_to_drop = pd.to_datetime(["2016-01-03", "2016-01-08", "2016-01-09"])
df_dropped_dates = df_EST_index.drop(dates_to_drop)
df_dropped_dates
125
2016-01-14 10 5.0 T 7
2016-01-15 9 5.0 T 2
2016-01-16 8 7.0 0.24 7
2016-01-17 8 6.0 0.05 6
2016-01-18 9 12.0 T 2
2016-01-19 10 11.0 0 1
2016-01-20 10 6.0 0 2
2016-01-21 10 6.0 0 1
2016-01-22 9 NaN 0.01 3
2016-01-23 1 16.0 2.31 8
2016-01-24 8 6.0 T 3
2016-01-25 10 3.0 0 2
2016-01-26 10 7.0 0 2
2016-01-27 10 7.0 T 3
2016-01-28 10 5.0 0 1
2016-01-29 10 8.0 0 4
2016-01-30 10 7.0 0 0
2016-01-31 10 5.0 0 0
Events WindDirDegrees
EST
2016-01-01 NaN 281.0
2016-01-02 NaN 275.0
2016-01-04 NaN 345.0
2016-01-05 NaN 333.0
2016-01-06 NaN NaN
2016-01-07 NaN 293.0
2016-01-10 Rain 109.0
2016-01-11 NaN 289.0
2016-01-12 NaN 235.0
2016-01-13 NaN 284.0
2016-01-14 NaN NaN
2016-01-15 NaN 101.0
2016-01-16 Rain 340.0
2016-01-17 Fog-Snow 345.0
2016-01-18 Snow 293.0
2016-01-19 NaN NaN
2016-01-20 NaN 302.0
2016-01-21 NaN 312.0
2016-01-22 Snow 34.0
2016-01-23 Fog-Snow 42.0
2016-01-24 Snow 327.0
2016-01-25 NaN 286.0
2016-01-26 NaN NaN
2016-01-27 Rain NaN
2016-01-28 NaN NaN
2016-01-29 NaN NaN
126
2016-01-30 NaN 257.0
2016-01-31 NaN NaN
Now that I have a dataFrame that simulates the missing dates I might get in the real world, I can move forward.
Adding the missing dates to the dataFrame where EST (a date-type column) is set as the index:
[118]: dt = pd.date_range("2016-01-01", "2016-01-31")
       idx = pd.DatetimeIndex(dt)
       df_dropped_dates.reindex(idx)
127
2016-01-02 10.0 7.0 0 3.0
2016-01-03 NaN NaN NaN NaN
2016-01-04 10.0 9.0 0 3.0
2016-01-05 10.0 5.0 0 0.0
2016-01-06 10.0 4.0 0 0.0
2016-01-07 10.0 2.0 0 3.0
2016-01-08 NaN NaN NaN NaN
2016-01-09 NaN NaN NaN NaN
2016-01-10 4.0 NaN 1.8 7.0
2016-01-11 10.0 NaN 0 1.0
2016-01-12 10.0 6.0 T 4.0
2016-01-13 10.0 10.0 0 0.0
2016-01-14 10.0 5.0 T 7.0
2016-01-15 9.0 5.0 T 2.0
2016-01-16 8.0 7.0 0.24 7.0
2016-01-17 8.0 6.0 0.05 6.0
2016-01-18 9.0 12.0 T 2.0
2016-01-19 10.0 11.0 0 1.0
2016-01-20 10.0 6.0 0 2.0
2016-01-21 10.0 6.0 0 1.0
2016-01-22 9.0 NaN 0.01 3.0
2016-01-23 1.0 16.0 2.31 8.0
2016-01-24 8.0 6.0 T 3.0
2016-01-25 10.0 3.0 0 2.0
2016-01-26 10.0 7.0 0 2.0
2016-01-27 10.0 7.0 T 3.0
2016-01-28 10.0 5.0 0 1.0
2016-01-29 10.0 8.0 0 4.0
2016-01-30 10.0 7.0 0 0.0
2016-01-31 10.0 5.0 0 0.0
Events WindDirDegrees
2016-01-01 NaN 281.0
2016-01-02 NaN 275.0
2016-01-03 NaN NaN
2016-01-04 NaN 345.0
2016-01-05 NaN 333.0
2016-01-06 NaN NaN
2016-01-07 NaN 293.0
2016-01-08 NaN NaN
2016-01-09 NaN NaN
2016-01-10 Rain 109.0
2016-01-11 NaN 289.0
2016-01-12 NaN 235.0
2016-01-13 NaN 284.0
2016-01-14 NaN NaN
2016-01-15 NaN 101.0
128
2016-01-16 Rain 340.0
2016-01-17 Fog-Snow 345.0
2016-01-18 Snow 293.0
2016-01-19 NaN NaN
2016-01-20 NaN 302.0
2016-01-21 NaN 312.0
2016-01-22 Snow 34.0
2016-01-23 Fog-Snow 42.0
2016-01-24 Snow 327.0
2016-01-25 NaN 286.0
2016-01-26 NaN NaN
2016-01-27 Rain NaN
2016-01-28 NaN NaN
2016-01-29 NaN NaN
2016-01-30 NaN 257.0
2016-01-31 NaN NaN
As you can see, the missing dates have been added to the dataFrame, with all of their values as NaN.
Now I will use interpolation to fill the NaN values based on the dates, using the time method.
[119]: # convert string-typed numbers into proper numeric dtypes; this works on newer versions of pandas
       # df_dropped_dates.infer_objects(copy=False)

       # select only those columns whose data-type is numeric and then interpolate using the time method
       df_interpolate_using_time = df_dropped_dates.select_dtypes(include=["number"]).interpolate(method="time", limit_direction="both")
       df_interpolate_using_time
129
2016-01-13 26.00 4 42 29.94
2016-01-14 34.50 12 47 29.95
2016-01-15 43.00 31 62 29.82
2016-01-16 47.00 37 70 29.52
2016-01-17 36.00 23 66 29.78
2016-01-18 25.00 6 53 29.83
2016-01-19 22.00 3 42 30.03
2016-01-20 32.00 15 49 30.13
2016-01-21 31.00 11 45 30.15
2016-01-22 26.00 6 41 30.21
2016-01-23 30.25 21 78 29.77
2016-01-24 34.50 11 53 29.92
2016-01-25 38.75 18 54 30.25
2016-01-26 43.00 29 56 30.03
2016-01-27 41.00 22 45 30.03
2016-01-28 37.00 20 51 29.90
2016-01-29 36.00 21 50 29.58
2016-01-30 34.00 16 46 30.01
2016-01-31 46.00 28 52 29.90
130
2016-01-29 10 8.0 4 262.8
2016-01-30 10 7.0 0 257.0
2016-01-31 10 5.0 0 257.0
2.6.1 What gives? Why are the rows that had NaN in all of their columns missing from the interpolated result?
Explanation:
• The .reindex(idx) call in cell [118] returns a new dataFrame, and its result was never assigned back, so df_dropped_dates still does not contain the newly inserted dates.
• The interpolation above was therefore run on df_dropped_dates, which simply never had those rows; interpolation itself does not drop rows.
• To interpolate across the inserted dates, assign the reindexed frame first (for example, df_reindexed = df_dropped_dates.reindex(idx)) and interpolate that instead.
# macbook path
# df_missing_data = pd.read_excel("/home/machine_learning/github/
↪Deep-learning-prerequisite/pandas/dataset/nyc_weather_missing_data_demo.
↪xlsx", engine="openpyxl")
df_missing_data
131
19 1/20/2016 32.0 15 49 30.13
20 1/21/2016 31.0 11 45 30.15
21 1/22/2016 26.0 6 41 30.21
22 1/23/2016 NaN 21 78 29.77
23 1/24/2016 NaN 11 53 29.92
24 1/25/2016 NaN 18 54 30.25
25 1/26/2016 43.0 29 56 30.03
26 1/27/2016 41.0 22 45 30.03
27 1/28/2016 37.0 20 51 29.90
28 1/29/2016 36.0 21 50 29.58
29 1/30/2016 34.0 16 46 30.01
30 1/31/2016 46.0 28 52 29.90
WindDirDegrees
132
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
29 257.0
30 NaN
[121]: True
This goes to show that the EST column contains its dates in string form
df_missing_data
133
[122]: EST Temperature DewPoint Humidity Sea Level PressureIn \
0 2016-01-01 38.0 23 52 30.03
1 2016-01-02 36.0 18 46 30.02
2 2016-01-03 40.0 21 47 29.86
3 2016-01-04 25.0 9 44 30.05
4 2016-01-05 NaN -3 41 30.57
5 2016-01-06 33.0 4 35 30.50
6 2016-01-07 39.0 11 33 30.28
7 2016-01-08 39.0 29 64 30.20
8 2016-01-09 44.0 38 77 30.16
9 2016-01-10 50.0 46 71 29.59
10 2016-01-11 33.0 8 37 29.92
11 2016-01-12 35.0 15 53 29.85
12 2016-01-13 26.0 4 42 29.94
13 2016-01-14 NaN 12 47 29.95
14 2016-01-15 43.0 31 62 29.82
15 2016-01-16 47.0 37 70 29.52
16 2016-01-17 36.0 23 66 29.78
17 2016-01-18 25.0 6 53 29.83
18 2016-01-19 22.0 3 42 30.03
19 2016-01-20 32.0 15 49 30.13
20 2016-01-21 31.0 11 45 30.15
21 2016-01-22 26.0 6 41 30.21
22 2016-01-23 NaN 21 78 29.77
23 2016-01-24 NaN 11 53 29.92
24 2016-01-25 NaN 18 54 30.25
25 2016-01-26 43.0 29 56 30.03
26 2016-01-27 41.0 22 45 30.03
27 2016-01-28 37.0 20 51 29.90
28 2016-01-29 36.0 21 50 29.58
29 2016-01-30 34.0 16 46 30.01
30 2016-01-31 46.0 28 52 29.90
134
13 10 5.0 T 7 NaN
14 9 5.0 T 2 NaN
15 8 7.0 0.24 7 Rain
16 8 6.0 0.05 6 Fog-Snow
17 9 12.0 T 2 Snow
18 10 11.0 0 1 NaN
19 10 6.0 0 2 NaN
20 10 6.0 0 1 NaN
21 9 NaN 0.01 3 Snow
22 1 16.0 2.31 8 Fog-Snow
23 8 6.0 T 3 Snow
24 10 3.0 0 2 NaN
25 10 7.0 0 2 NaN
26 10 7.0 T 3 Rain
27 10 5.0 0 1 NaN
28 10 8.0 0 4 NaN
29 10 7.0 0 0 NaN
30 10 5.0 0 0 NaN
WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
135
27 NaN
28 NaN
29 257.0
30 NaN
136
2016-01-03 10 8.0 0 1
2016-01-04 10 9.0 0 3
2016-01-05 10 5.0 0 0
2016-01-06 10 4.0 0 0
2016-01-07 10 2.0 0 3
2016-01-08 10 4.0 0 8
2016-01-09 9 8.0 T 8
2016-01-10 4 NaN 1.8 7
2016-01-11 10 NaN 0 1
2016-01-12 10 6.0 T 4
2016-01-13 10 10.0 0 0
2016-01-14 10 5.0 T 7
2016-01-15 9 5.0 T 2
2016-01-16 8 7.0 0.24 7
2016-01-17 8 6.0 0.05 6
2016-01-18 9 12.0 T 2
2016-01-19 10 11.0 0 1
2016-01-20 10 6.0 0 2
2016-01-21 10 6.0 0 1
2016-01-22 9 NaN 0.01 3
2016-01-23 1 16.0 2.31 8
2016-01-24 8 6.0 T 3
2016-01-25 10 3.0 0 2
2016-01-26 10 7.0 0 2
2016-01-27 10 7.0 T 3
2016-01-28 10 5.0 0 1
2016-01-29 10 8.0 0 4
2016-01-30 10 7.0 0 0
2016-01-31 10 5.0 0 0
Events WindDirDegrees
EST
2016-01-01 NaN 281.0
2016-01-02 NaN 275.0
2016-01-03 NaN 277.0
2016-01-04 NaN 345.0
2016-01-05 NaN 333.0
2016-01-06 NaN NaN
2016-01-07 NaN 293.0
2016-01-08 NaN 79.0
2016-01-09 Rain 76.0
2016-01-10 Rain 109.0
2016-01-11 NaN 289.0
2016-01-12 NaN 235.0
2016-01-13 NaN 284.0
2016-01-14 NaN NaN
2016-01-15 NaN 101.0
137
2016-01-16 Rain 340.0
2016-01-17 Fog-Snow 345.0
2016-01-18 Snow 293.0
2016-01-19 NaN NaN
2016-01-20 NaN 302.0
2016-01-21 NaN 312.0
2016-01-22 Snow 34.0
2016-01-23 Fog-Snow 42.0
2016-01-24 Snow 327.0
2016-01-25 NaN 286.0
2016-01-26 NaN NaN
2016-01-27 Rain NaN
2016-01-28 NaN NaN
2016-01-29 NaN NaN
2016-01-30 NaN 257.0
2016-01-31 NaN NaN
[124]: replaced_df = df_missing_data.replace(-3, np.nan)
replaced_df
138
2016-01-20 32.0 15.0 49 30.13
2016-01-21 31.0 11.0 45 30.15
2016-01-22 26.0 6.0 41 30.21
2016-01-23 NaN 21.0 78 29.77
2016-01-24 NaN 11.0 53 29.92
2016-01-25 NaN 18.0 54 30.25
2016-01-26 43.0 29.0 56 30.03
2016-01-27 41.0 22.0 45 30.03
2016-01-28 37.0 20.0 51 29.90
2016-01-29 36.0 21.0 50 29.58
2016-01-30 34.0 16.0 46 30.01
2016-01-31 46.0 28.0 52 29.90
139
Events WindDirDegrees
EST
2016-01-01 NaN 281.0
2016-01-02 NaN 275.0
2016-01-03 NaN 277.0
2016-01-04 NaN 345.0
2016-01-05 NaN 333.0
2016-01-06 NaN NaN
2016-01-07 NaN 293.0
2016-01-08 NaN 79.0
2016-01-09 Rain 76.0
2016-01-10 Rain 109.0
2016-01-11 NaN 289.0
2016-01-12 NaN 235.0
2016-01-13 NaN 284.0
2016-01-14 NaN NaN
2016-01-15 NaN 101.0
2016-01-16 Rain 340.0
2016-01-17 Fog-Snow 345.0
2016-01-18 Snow 293.0
2016-01-19 NaN NaN
2016-01-20 NaN 302.0
2016-01-21 NaN 312.0
2016-01-22 Snow 34.0
2016-01-23 Fog-Snow 42.0
2016-01-24 Snow 327.0
2016-01-25 NaN 286.0
2016-01-26 NaN NaN
2016-01-27 Rain NaN
2016-01-28 NaN NaN
2016-01-29 NaN NaN
2016-01-30 NaN 257.0
2016-01-31 NaN NaN
Let's replace the NaN values in the Events column with 0 using the replace function
[125]: replaced_df["Events"] = replaced_df["Events"].replace(np.nan, 0)
replaced_df
140
2016-01-08 39.0 29.0 64 30.20
2016-01-09 44.0 38.0 77 30.16
2016-01-10 50.0 46.0 71 29.59
2016-01-11 33.0 8.0 37 29.92
2016-01-12 35.0 15.0 53 29.85
2016-01-13 26.0 4.0 42 29.94
2016-01-14 NaN 12.0 47 29.95
2016-01-15 43.0 31.0 62 29.82
2016-01-16 47.0 37.0 70 29.52
2016-01-17 36.0 23.0 66 29.78
2016-01-18 25.0 6.0 53 29.83
2016-01-19 22.0 3.0 42 30.03
2016-01-20 32.0 15.0 49 30.13
2016-01-21 31.0 11.0 45 30.15
2016-01-22 26.0 6.0 41 30.21
2016-01-23 NaN 21.0 78 29.77
2016-01-24 NaN 11.0 53 29.92
2016-01-25 NaN 18.0 54 30.25
2016-01-26 43.0 29.0 56 30.03
2016-01-27 41.0 22.0 45 30.03
2016-01-28 37.0 20.0 51 29.90
2016-01-29 36.0 21.0 50 29.58
2016-01-30 34.0 16.0 46 30.01
2016-01-31 46.0 28.0 52 29.90
141
2016-01-21 10 6.0 0 1
2016-01-22 9 NaN 0.01 3
2016-01-23 1 16.0 2.31 8
2016-01-24 8 6.0 T 3
2016-01-25 10 3.0 0 2
2016-01-26 10 7.0 0 2
2016-01-27 10 7.0 T 3
2016-01-28 10 5.0 0 1
2016-01-29 10 8.0 0 4
2016-01-30 10 7.0 0 0
2016-01-31 10 5.0 0 0
Events WindDirDegrees
EST
2016-01-01 0 281.0
2016-01-02 0 275.0
2016-01-03 0 277.0
2016-01-04 0 345.0
2016-01-05 0 333.0
2016-01-06 0 NaN
2016-01-07 0 293.0
2016-01-08 0 79.0
2016-01-09 Rain 76.0
2016-01-10 Rain 109.0
2016-01-11 0 289.0
2016-01-12 0 235.0
2016-01-13 0 284.0
2016-01-14 0 NaN
2016-01-15 0 101.0
2016-01-16 Rain 340.0
2016-01-17 Fog-Snow 345.0
2016-01-18 Snow 293.0
2016-01-19 0 NaN
2016-01-20 0 302.0
2016-01-21 0 312.0
2016-01-22 Snow 34.0
2016-01-23 Fog-Snow 42.0
2016-01-24 Snow 327.0
2016-01-25 0 286.0
2016-01-26 0 NaN
2016-01-27 Rain NaN
2016-01-28 0 NaN
2016-01-29 0 NaN
2016-01-30 0 257.0
2016-01-31 0 NaN
142
• I have chosen the WindDirDegrees column for this demo. Since it has NaN values, I am going to replace them with an arbitrary negative number, -989, for this demo (a sketch of that replacement follows).
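The cell performing this replacement is not visible in this export; a minimal sketch of what it presumably looked like, using the same -989 placeholder:

import numpy as np

# replace NaN in WindDirDegrees with the placeholder value -989
replaced_df["WindDirDegrees"] = replaced_df["WindDirDegrees"].replace(np.nan, -989)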
replaced_df
143
2016-01-05 10 5.0 0 0
2016-01-06 10 4.0 0 0
2016-01-07 10 2.0 0 3
2016-01-08 10 4.0 0 8
2016-01-09 9 8.0 T 8
2016-01-10 4 NaN 1.8 7
2016-01-11 10 NaN 0 1
2016-01-12 10 6.0 T 4
2016-01-13 10 10.0 0 0
2016-01-14 10 5.0 T 7
2016-01-15 9 5.0 T 2
2016-01-16 8 7.0 0.24 7
2016-01-17 8 6.0 0.05 6
2016-01-18 9 12.0 T 2
2016-01-19 10 11.0 0 1
2016-01-20 10 6.0 0 2
2016-01-21 10 6.0 0 1
2016-01-22 9 NaN 0.01 3
2016-01-23 1 16.0 2.31 8
2016-01-24 8 6.0 T 3
2016-01-25 10 3.0 0 2
2016-01-26 10 7.0 0 2
2016-01-27 10 7.0 T 3
2016-01-28 10 5.0 0 1
2016-01-29 10 8.0 0 4
2016-01-30 10 7.0 0 0
2016-01-31 10 5.0 0 0
Events WindDirDegrees
EST
2016-01-01 0 281.0
2016-01-02 0 275.0
2016-01-03 0 277.0
2016-01-04 0 345.0
2016-01-05 0 333.0
2016-01-06 0 -989.0
2016-01-07 0 293.0
2016-01-08 0 79.0
2016-01-09 Rain 76.0
2016-01-10 Rain 109.0
2016-01-11 0 289.0
2016-01-12 0 235.0
2016-01-13 0 284.0
2016-01-14 0 -989.0
2016-01-15 0 101.0
2016-01-16 Rain 340.0
2016-01-17 Fog-Snow 345.0
144
2016-01-18 Snow 293.0
2016-01-19 0 -989.0
2016-01-20 0 302.0
2016-01-21 0 312.0
2016-01-22 Snow 34.0
2016-01-23 Fog-Snow 42.0
2016-01-24 Snow 327.0
2016-01-25 0 286.0
2016-01-26 0 -989.0
2016-01-27 Rain -989.0
2016-01-28 0 -989.0
2016-01-29 0 -989.0
2016-01-30 0 257.0
2016-01-31 0 -989.0
2.8.1 Now I can use the .replace function with a dictionary to see whether I can replace values in multiple columns of the dataFrame at once
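The exact cell is not shown in this export; one way to do a per-column replacement with a nested dictionary (the mapping below is an assumption, chosen to match the output that follows, where the 0 and -989 placeholders are back to NaN) is:

import numpy as np

# keys are column names, values map old -> new within that column only
df_dict_replaced = replaced_df.replace({
    "Events": {0: np.nan},             # assumption: undo the 0 placeholder
    "WindDirDegrees": {-989: np.nan},  # assumption: undo the -989 placeholder
})
df_dict_replaced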
145
2016-01-22 26.0 6.0 41 30.21
2016-01-23 NaN 21.0 78 29.77
2016-01-24 NaN 11.0 53 29.92
2016-01-25 NaN 18.0 54 30.25
2016-01-26 43.0 29.0 56 30.03
2016-01-27 41.0 22.0 45 30.03
2016-01-28 37.0 20.0 51 29.90
2016-01-29 36.0 21.0 50 29.58
2016-01-30 34.0 16.0 46 30.01
2016-01-31 46.0 28.0 52 29.90
Events WindDirDegrees
EST
146
2016-01-01 NaN 281.0
2016-01-02 NaN 275.0
2016-01-03 NaN 277.0
2016-01-04 NaN 345.0
2016-01-05 NaN 333.0
2016-01-06 NaN NaN
2016-01-07 NaN 293.0
2016-01-08 NaN 79.0
2016-01-09 Rain 76.0
2016-01-10 Rain 109.0
2016-01-11 NaN 289.0
2016-01-12 NaN 235.0
2016-01-13 NaN 284.0
2016-01-14 NaN NaN
2016-01-15 NaN 101.0
2016-01-16 Rain 340.0
2016-01-17 Fog-Snow 345.0
2016-01-18 Snow 293.0
2016-01-19 NaN NaN
2016-01-20 NaN 302.0
2016-01-21 NaN 312.0
2016-01-22 Snow 34.0
2016-01-23 Fog-Snow 42.0
2016-01-24 Snow 327.0
2016-01-25 NaN 286.0
2016-01-26 NaN NaN
2016-01-27 Rain NaN
2016-01-28 NaN NaN
2016-01-29 NaN NaN
2016-01-30 NaN 257.0
2016-01-31 NaN NaN
2.8.2 Let's perform interpolation using the time method, just for fun, shall we?
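The cell that produces interpolated_df is not shown here; a minimal sketch, assuming the numeric columns of the frame from the previous step (called df_dict_replaced in the sketch above) are interpolated with the time method as before:

# interpolate only the numeric columns, using the datetime (EST) index
interpolated_df = df_dict_replaced.select_dtypes(include=["number"]).interpolate(
    method="time", limit_direction="both"
)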
interpolated_df
147
2016-01-02 36.00 18.0 46 30.02
2016-01-03 40.00 21.0 47 29.86
2016-01-04 25.00 9.0 44 30.05
2016-01-05 29.00 6.5 41 30.57
2016-01-06 33.00 4.0 35 30.50
2016-01-07 39.00 11.0 33 30.28
2016-01-08 39.00 29.0 64 30.20
2016-01-09 44.00 38.0 77 30.16
2016-01-10 50.00 46.0 71 29.59
2016-01-11 33.00 8.0 37 29.92
2016-01-12 35.00 15.0 53 29.85
2016-01-13 26.00 4.0 42 29.94
2016-01-14 34.50 12.0 47 29.95
2016-01-15 43.00 31.0 62 29.82
2016-01-16 47.00 37.0 70 29.52
2016-01-17 36.00 23.0 66 29.78
2016-01-18 25.00 6.0 53 29.83
2016-01-19 22.00 3.0 42 30.03
2016-01-20 32.00 15.0 49 30.13
2016-01-21 31.00 11.0 45 30.15
2016-01-22 26.00 6.0 41 30.21
2016-01-23 30.25 21.0 78 29.77
2016-01-24 34.50 11.0 53 29.92
2016-01-25 38.75 18.0 54 30.25
2016-01-26 43.00 29.0 56 30.03
2016-01-27 41.00 22.0 45 30.03
2016-01-28 37.00 20.0 51 29.90
2016-01-29 36.00 21.0 50 29.58
2016-01-30 34.00 16.0 46 30.01
2016-01-31 46.00 28.0 52 29.90
148
2016-01-15 9 5.000000 2 101.0
2016-01-16 8 7.000000 7 340.0
2016-01-17 8 6.000000 6 345.0
2016-01-18 9 12.000000 2 293.0
2016-01-19 10 11.000000 1 297.5
2016-01-20 10 6.000000 2 302.0
2016-01-21 10 6.000000 1 312.0
2016-01-22 9 11.000000 3 34.0
2016-01-23 1 16.000000 8 42.0
2016-01-24 8 6.000000 3 327.0
2016-01-25 10 3.000000 2 286.0
2016-01-26 10 7.000000 2 280.2
2016-01-27 10 7.000000 3 274.4
2016-01-28 10 5.000000 1 268.6
2016-01-29 10 8.000000 4 262.8
2016-01-30 10 7.000000 0 257.0
2016-01-31 10 5.000000 0 257.0
2.9 How to handle data where different measurement units are found in the same column
• Example:
  – The Temperature column may contain temperatures in both degrees Celsius and Fahrenheit.
  – The WindSpeed column may contain wind speeds in both mph and kph.
[129]: # pc path
df_missing_data = pd.read_excel("/home/aditya/github/Deep-learning-prerequisite/
↪pandas/dataset/nyc_weather_different_measurement_type_demo.xlsx")
# macbook path
# df_missing_data = pd.read_excel("/home/machine_learning/github/
↪Deep-learning-prerequisite/pandas/dataset/
↪nyc_weather_different_measurement_type_demo.xlsx", engine="openpyxl")
df_missing_data
149
11 1/12/2016 35 15 53 29.85
12 1/13/2016 26 F 4 42 29.94
13 1/14/2016 NaN 12 47 29.95
14 1/15/2016 43 31 62 29.82
15 1/16/2016 47 37 70 29.52
16 1/17/2016 36 23 66 29.78
17 1/18/2016 25 6 53 29.83
18 1/19/2016 22 3 42 30.03
19 1/20/2016 32 15 49 30.13
20 1/21/2016 31 11 45 30.15
21 1/22/2016 26 6 41 30.21
22 1/23/2016 NaN 21 78 29.77
23 1/24/2016 NaN 11 53 29.92
24 1/25/2016 NaN 18 54 30.25
25 1/26/2016 43 29 56 30.03
26 1/27/2016 41 22 45 30.03
27 1/28/2016 37 20 51 29.90
28 1/29/2016 36 21 50 29.58
29 1/30/2016 34 16 46 30.01
30 1/31/2016 46 28 52 29.90
150
25 10 7 0 2 NaN
26 10 7 T 3 Rain
27 10 5 0 1 NaN
28 10 8 0 4 NaN
29 10 7 0 0 NaN
30 10 5 0 0 NaN
WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
29 257.0
30 NaN
151
[130]: #### The best way to handle this is to use a regex
new_df = df_missing_data.replace("[A-Za-z]","",regex=True)
new_df
152
9 4 NaN 1.8 7
10 10 NaN 0 1 NaN
11 10 6 4 NaN
12 10 10 0 0 NaN
13 10 5 7 NaN
14 9 5 2 NaN
15 8 7 0.24 7
16 8 6 0.05 6 -
17 9 12 2
18 10 11 0 1 NaN
19 10 6 0 2 NaN
20 10 6 0 1 NaN
21 9 NaN 0.01 3
22 1 16 2.31 8 -
23 8 6 3
24 10 3 0 2 NaN
25 10 7 0 2 NaN
26 10 7 3
27 10 5 0 1 NaN
28 10 8 0 4 NaN
29 10 7 0 0 NaN
30 10 5 0 0 NaN
WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
153
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
29 257.0
30 NaN
What the …. Why did it replace all the values in the Events column? Observations: as you can see, this regex successfully
• removed C and F from the Temperature column,
• removed mph and kph from the WindSpeedMPH column,
• but it also removed the event values from the Events column.
Reason: everything in the Events column is alphabetic, hence this regex removed all the values in the Events column of the dataFrame.
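The corrected cell is not shown in this export; one way to restrict the regex to the columns that actually mix letters with numbers (the column list below is an assumption based on the observations above) is:

# apply the letter-stripping regex only to the affected columns,
# leaving the all-text Events column untouched
cols = ["Temperature", "WindSpeedMPH"]
new_df = df_missing_data.copy()
new_df[cols] = new_df[cols].replace("[A-Za-z]", "", regex=True)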
new_df
154
25 1/26/2016 43 29 56 30.03
26 1/27/2016 41 22 45 30.03
27 1/28/2016 37 20 51 29.90
28 1/29/2016 36 21 50 29.58
29 1/30/2016 34 16 46 30.01
30 1/31/2016 46 28 52 29.90
WindDirDegrees
0 281.0
1 275.0
2 277.0
3 345.0
4 333.0
5 NaN
155
6 293.0
7 79.0
8 76.0
9 109.0
10 289.0
11 235.0
12 284.0
13 NaN
14 101.0
15 340.0
16 345.0
17 293.0
18 NaN
19 302.0
20 312.0
21 34.0
22 42.0
23 327.0
24 286.0
25 NaN
26 NaN
27 NaN
28 NaN
29 257.0
30 NaN
As you can see, the data now looks much better: all the values in the Events column are intact.
2.10 Replacing the Score column values from character-type grades to number-type values in a dataFrame
[132]: df = [Link]({
"Score" : ["exceptional", "average", "good", "poor", "average",␣
↪"exceptional"],
})
df
156
[133]: df_replaced = df.replace(["poor", "average", "good", "exceptional"], [1, 2, 3, 4])
       df_replaced
Solution: on recent pandas versions the replace above can emit a FutureWarning about silent downcasting (the object column gets converted to an integer dtype); opting in to the future behavior avoids the warning:
[134]: pd.set_option('future.no_silent_downcasting', True)
df_replaced = df.replace(["poor", "average", "good", "exceptional"], [1, 2, 3, 4])
df_replaced
[ ]:
157