Pandas - Python Data Analysis Library
A Pandas Series contains a single column of data and an index. The index is a way to reference the rows of data in the Series.
Common examples of an index would be simply a monotonically increasing set of integers, or time/date stamps for time
series data.
A Pandas DataFrame can be thought of as a combination of several Series that share a common index. A table with
multiple column labels and a common index is an example of a DataFrame. These data structures are made clearer
through the examples that follow.
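As a minimal sketch of the relationship described above, the following builds two Series on a shared index and combines them into a DataFrame. The index labels and numerical values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Two Series that share the same index (hypothetical sample labels).
index = ["a", "b", "c"]
porosity = pd.Series([0.12, 0.16, 0.19], index=index, name="porosity")
permeability = pd.Series([6.2, 6.3, 92.3], index=index, name="permeability")

# Each Series becomes one column; the common index labels the rows.
frame = pd.DataFrame({"porosity": porosity, "permeability": permeability})
print(frame)
```

Selecting a single column, e.g. frame["porosity"], returns a Series again.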
Similarly to the way we import NumPy, it's idiomatic Python to import Pandas as
import pandas as pd
import numpy as np
import pandas as pd
Loading Data
Pandas offers some of the best utilities available for reading/parsing data from text files. The function read_csv has
numerous options for managing header/footer lines in files, parsing dates, selecting specific columns, etc. in
comma-separated value (CSV) files. The default index for the DataFrame is a set of monotonically increasing integers unless
otherwise specified with the keyword argument index_col .
There are similar functions for reading Microsoft Excel spreadsheets ( read_excel ) and fixed width formatted text
( read_fwf ).
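The effect of index_col can be sketched with a small in-memory CSV. The text below is a hypothetical stand-in for a file on disk (the column named well and all values are assumptions), read through io.StringIO so the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """well,X,Y,porosity
w1,565,1485,0.1184
w2,2585,1185,0.1566
w3,2065,2865,0.1920
"""

# Default behavior: a monotonically increasing integer index (0, 1, 2).
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col, the named column becomes the index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col="well")

print(default_df.index.tolist())
print(indexed_df.index.tolist())
```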
The file '200wells.csv' contains a dataset with X and Y coordinates, facies 1 and 2 (1 is sandstone and 2 is interbedded
sand and mudstone), porosity, permeability (mD), and acoustic impedance (kg/(m² ⋅ s ⋅ 10⁶)).
The head() member function for DataFrames displays the first 5 rows of the DataFrame. Optionally, you can specify
an argument (e.g. head(n=10) ) to display more or fewer rows.
df = pd.read_csv('datasets/200wells.csv'); df.head()
https://johnfoster.pge.utexas.edu/numerical-methods-book/ScientificPython_Pandas.html 1/14
13/12/2023 11:53 Pandas: Python Data Analysis Library
Summary Statistics
The DataFrame member function describe provides useful summary statistics such as the total number of samples, mean,
standard deviations, min/max, and quartiles for each column of the DataFrame.
df.describe()
Unnamed: 0 X Y facies_threshold_0.3 porosity permeability acoust
df[['porosity']].head()
porosity
0 0.1184
1 0.1566
2 0.1920
3 0.1621
4 0.1766
df[['porosity', 'permeability']].head()
porosity permeability
0 0.1184 6.170
1 0.1566 6.275
2 0.1920 92.297
3 0.1621 9.048
4 0.1766 7.123
porosity permeability
1 0.1566 6.275
2 0.1920 92.297
df.loc[1:2, 'porosity':'acoustic_impedance']
df.iloc[1:3, 3:5]
facies_threshold_0.3 porosity
1 1 0.1566
2 2 0.1920
df['porosity'].values
DataFrame Transformations
There are several member functions that allow for transformations of the DataFrame labels, adding/removing columns, etc.
To rename DataFrame column labels, we pass a Python dictionary where the keys are the current labels and the values
are the new labels. Using the keyword argument inplace = True has the same outcome as assigning the result back,
i.e. df = df.rename(...).
For example,
df.rename(columns={'facies_threshold_0.3': 'facies',
'permeability': 'perm',
'acoustic_impedance': 'ai'}, inplace = True)
df.head()
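The equivalence between inplace = True and reassignment can be checked on a small DataFrame. This is a sketch only: toy and its values are hypothetical stand-ins for the well data.

```python
import pandas as pd

# Hypothetical two-column frame using labels from the dataset.
toy = pd.DataFrame({"facies_threshold_0.3": [1, 2],
                    "permeability": [6.170, 6.275]})
mapping = {"facies_threshold_0.3": "facies", "permeability": "perm"}

# Without inplace, rename returns a new DataFrame and leaves toy unchanged.
renamed = toy.rename(columns=mapping)

# With inplace=True, the DataFrame itself is modified and None is returned.
toy_copy = toy.copy()
toy_copy.rename(columns=mapping, inplace=True)

print(renamed.columns.tolist())
print(renamed.equals(toy_copy))
```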
df['zero'] = np.zeros(len(df))
df.head()
Removing Columns
We can remove unwanted columns with the drop member function. The argument inplace = True modifies the existing
DataFrame in place in memory, i.e. 'zero' will no longer be accessible in any way in the DataFrame.
The argument axis = 1 refers to columns. The default is axis = 0 , in which case the positional argument is
expected to be an index label.
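A minimal sketch of the column removal described above, on a hypothetical DataFrame named toy. The columns keyword shown first is an alternative spelling of axis = 1 .

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a throwaway 'zero' column.
toy = pd.DataFrame({"porosity": [0.1184, 0.1566, 0.1920]})
toy["zero"] = np.zeros(len(toy))

# Returns a new DataFrame without the column; toy is unchanged here.
without_zero = toy.drop(columns="zero")

# Equivalent removal using axis=1, this time modifying toy in place.
toy.drop("zero", axis=1, inplace=True)

print(toy.columns.tolist())
```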
Removing Rows
We can remove the row indexed by 1 as follows.
Notice that we can chain member function calls, i.e. the drop function is immediately followed by head to display
the DataFrame with row index 1 removed.
df.drop(1).head()
Because the argument inplace = True was not given, the original DataFrame is unchanged.
df.head()
Sorting
We can sort the DataFrame in either ascending or descending order by any column label.
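Sorting by a column label might look like the sketch below, which uses the sort_values member function on a hypothetical DataFrame named toy.

```python
import pandas as pd

# Hypothetical frame; the rows are deliberately out of order.
toy = pd.DataFrame({"porosity": [0.1566, 0.1184, 0.1920],
                    "perm": [6.275, 6.170, 92.297]})

# Ascending order is the default.
ascending = toy.sort_values("porosity")

# Pass ascending=False for descending order.
descending = toy.sort_values("porosity", ascending=False)

print(ascending)
print(descending)
```

Each row keeps its original index label, so the index of the result is no longer in increasing order.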
Resetting Indices
In the previous example, the resulting indices are now out of order after the sorting operation. This can be fixed, if desired,
with the reset_index member function.
The reindexing operation could have been accomplished during the sort operation by passing the argument
ignore_index = True .
df.reset_index(inplace=True, drop=True)
df.head(n = 13)
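The two routes to a clean index can be compared directly. The sketch below uses a hypothetical DataFrame named toy and also shows what drop = True avoids: without it, the old index is kept as a new column.

```python
import pandas as pd

# Hypothetical frame; the rows are deliberately out of order.
toy = pd.DataFrame({"porosity": [0.1566, 0.1184, 0.1920]})

# Route 1: sort, then discard the old index and renumber from 0.
sorted_then_reset = toy.sort_values("porosity").reset_index(drop=True)

# Route 2: renumber during the sort itself.
sorted_ignoring_index = toy.sort_values("porosity", ignore_index=True)

# Without drop=True, the old index survives as a column named 'index'.
kept_old_index = toy.sort_values("porosity").reset_index()

print(sorted_then_reset.equals(sorted_ignoring_index))
print(kept_old_index.columns.tolist())
```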
Feature Engineering
In the field of data science, DataFrame column labels are often referred to as features. Feature engineering is the process of
creating new features and/or transforming features for further analysis. In the example below, we create two new features
through manipulations of existing features.
Mathematical operations can be performed directly on the DataFrame columns that are accessed by their labels.
Most NumPy functions such as where will work directly on Pandas DataFrame columns.
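A sketch of both ideas on a hypothetical DataFrame named toy. The new column names ( porosity_percent , perm_por_ratio , porosity_class ) and the 0.12 cutoff are assumptions made for illustration, not necessarily the features used with the well data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with porosity (fraction) and permeability (mD).
toy = pd.DataFrame({"porosity": [0.1184, 0.1566, 0.0784],
                    "perm": [6.170, 6.275, 1.015]})

# Arithmetic on columns is applied element by element.
toy["porosity_percent"] = toy["porosity"] * 100
toy["perm_por_ratio"] = toy["perm"] / toy["porosity"]

# np.where picks one of two values per row based on a condition.
toy["porosity_class"] = np.where(toy["porosity"] > 0.12, "high", "low")

print(toy)
```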
Feature Truncation
Here's an example where we use a conditional statement to assign a very low permeability value (0.0001 mD) for all porosity
values below a threshold. Of course, this is for demonstration; in practice, a much lower porosity threshold would likely be
applied.
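def kozeny_carmen_with_threshold(ϕ, m):
    # m: proportionality constant (passing it as an argument is an assumption)
    if ϕ > 0.12:
        # Kozeny-Carman-type estimate above the porosity threshold
        return m * ϕ ** 3 / (1 - ϕ) ** 2
    else:
        # very low permeability (mD) at or below the threshold
        return 0.0001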
Using the apply member function we can transform the 'porosity' column into permeability via the
kozeny_carmen_with_threshold function above.
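The use of apply can be sketched as follows. The function name kozeny_carmen_sketch , the default m = 1.0 , and the toy porosity values are all assumptions made so the example runs on its own; they are not the values used with the well data.

```python
import pandas as pd

def kozeny_carmen_sketch(phi, m=1.0, threshold=0.12):
    """Thresholded Kozeny-Carman-type estimate; m=1.0 is an assumed constant."""
    if phi > threshold:
        return m * phi ** 3 / (1 - phi) ** 2
    return 0.0001  # very low permeability (mD) at or below the threshold

# Hypothetical porosity values below, at, and above the threshold.
toy = pd.DataFrame({"porosity": [0.05, 0.12, 0.20]})

# apply calls the function once per element of the column.
toy["perm_kc"] = toy["porosity"].apply(kozeny_carmen_sketch)

print(toy)
```

Additional positional arguments can be forwarded through the args keyword of apply, e.g. apply(kozeny_carmen_sketch, args=(10.0,)) to pass a different m .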
... ... ... ... ... ... ... ... ... ...
195 141 1745 255 2 0.07840 1.01500 3.949 7.840 12.946429 low
197 110 2365 355 1 0.06726 0.15170 3.833 6.726 2.255427 low
198 198 3795 535 1 0.06092 0.01582 3.907 6.092 0.259685 low
199 152 3745 115 1 0.05000 0.01653 3.527 5.000 0.330600 low
Missing Features
It's very common in data analysis to have missing and/or invalid values in our DataFrames. Pandas offers several built-in
methods to identify and deal with missing data.
The at member function allows for fast selection of a single row/label position within a Series/DataFrame.
The isnull member function returns a boolean array that identifies rows with missing values.
df[df['porosity'].isnull()]
df.dropna(inplace=True)
df.head(n = 3)
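The steps above can be sketched end to end on a hypothetical DataFrame named toy: plant a missing value with at , locate it with isnull , and remove the affected row with dropna . All values are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with no missing values to start.
toy = pd.DataFrame({"porosity": [0.1184, 0.1566, 0.1920],
                    "perm": [6.170, 6.275, 92.297]})

# .at addresses a single cell by row label and column label.
toy.at[1, "porosity"] = np.nan

# Boolean mask that is True where 'porosity' is missing.
missing_mask = toy["porosity"].isnull()
rows_with_missing = toy[missing_mask]

# dropna returns a copy without rows containing missing values.
cleaned = toy.dropna()

print(rows_with_missing)
print(cleaned)
```

Because inplace = True was not passed here, toy still contains the row with the missing value.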
Create a Python dictionary where the keys are the desired DataFrame labels and the values are the associated data.
df_new = pd.DataFrame(df_dict)
df_new.sort_values('porosity', inplace=True, ignore_index=True)
df_new.head()
porosity permeability
0 0.000928 8.013109e-09
1 0.003655 4.916908e-07
2 0.003821 5.622156e-07
3 0.005831 2.005877e-06
4 0.005843 2.018193e-06
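The dictionary-based construction might be sketched as follows. The name example_dict and its values are hypothetical and unrelated to the data displayed above.

```python
import pandas as pd

# Keys become column labels; values supply the column data.
example_dict = {"porosity": [0.19, 0.12, 0.16],
                "permeability": [92.3, 6.2, 6.3]}

sketch_df = pd.DataFrame(example_dict)

# Sort by porosity and renumber the index in a single step.
sketch_df.sort_values("porosity", inplace=True, ignore_index=True)

print(sketch_df)
```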
Merging DataFrames
In this example, we'll take a couple of subsets from our original DataFrame and create a new one by joining them with the
concat function.
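A minimal sketch of joining two row subsets with concat , using a hypothetical DataFrame named toy . The choice of subsets is arbitrary.

```python
import pandas as pd

# Hypothetical frame to take subsets from.
toy = pd.DataFrame({"porosity": [0.1184, 0.1566, 0.1920, 0.1621],
                    "perm": [6.170, 6.275, 92.297, 9.048]})

first_rows = toy.iloc[:2]   # rows at positions 0 and 1
last_row = toy.iloc[3:]     # row at position 3

# Stack the subsets row-wise; original index labels are preserved.
combined = pd.concat([first_rows, last_row])

# ignore_index=True renumbers the result from 0.
combined_renumbered = pd.concat([first_rows, last_row], ignore_index=True)

print(combined)
print(combined_renumbered)
```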
Plotting DataFrames
Pandas has some built in automatic plotting methods for DataFrames. They are most useful for quick-look plots of
relationships between DataFrame columns, but they can be fully customized to make publication quality plots with
additional options available in the Matplotlib plotting library. The default plot member function will create a line plot of all
DataFrame labels as functions of the index.
df_new.plot();
Of course, this is not that useful a plot. What we are likely looking for is a relationship between the DataFrame columns. One
way to visualize this is to set the desired abscissa as the DataFrame index and create a plot.
df_new.set_index(['porosity']).plot();
Or explicitly pass the desired independent variable to the x keyword argument and the dependent variable to the y
keyword argument. This time we'll also explicitly create a scatter plot.
When the DataFrame columns are explicitly passed as arguments, the axis labels are correctly populated.
df_new.plot.scatter(x='porosity', y='permeability');
Displaying the raw data in the file with the UNIX head command
!head -n 5 datasets/200wells_out.csv
Further Reading
Further reading on Pandas is available in the official Pandas documentation.