Intro
Structure
Last week's topics:
1. Why Pandas
2. Name
3. Index
4. Filtering
Today's lecture:
5. DataFrame
6. Basic Visualization
Topic 1: Pandas
Dataframe
• Simplistically, a data frame is a table, with rows and columns.
• Each column in a data frame is a series object.
• Each row consists of the elements that share the same index label across those series.
Name     Age  City
Alice    24   New York
Bob      30   San Francisco
Charlie  22   Los Angeles
An example table (this is a DataFrame)
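To make the "each column is a Series" point concrete, here is a minimal sketch that rebuilds the example table by hand. The variable names `people` and `age_column` are illustrative only.

```python
import pandas as pd

# The example table from the slide, built from a dictionary of columns.
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 30, 22],
    'City': ['New York', 'San Francisco', 'Los Angeles'],
})

# Selecting a single column returns a Series object.
age_column = people['Age']
print(type(age_column).__name__)

# shape is (number of rows, number of columns).
print(people.shape)
```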
Creating a DataFrame from a Dictionary
You can create a DataFrame from a dictionary where keys are column names
and values are lists of data:
import pandas as pd
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
Output:
A B
0 1 4
1 2 5
2 3 6
Creating a DataFrame from a List of Dictionaries
Each dictionary in the list represents a row of data:
data = [{'A': 1, 'B': 4}, {'A': 2, 'B': 5}, {'A': 3, 'B': 6}]
df = pd.DataFrame(data)
Output:
A B
0 1 4
1 2 5
2 3 6
Reading Data from a CSV File
Use Pandas to read data from a CSV file with pd.read_csv():
df = pd.read_csv('file.csv')
CSV file content:
A,B
1,4
2,5
3,6
Output:
   A  B
0  1  4
1  2  5
2  3  6
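The same call can be tried without a file on disk. The sketch below feeds the CSV text through `io.StringIO`, which stands in for `'file.csv'`; the variable name `csv_text` is illustrative.

```python
import io
import pandas as pd

# CSV text standing in for the contents of 'file.csv'.
csv_text = "A,B\n1,4\n2,5\n3,6\n"

# read_csv accepts a path or any file-like object.
# The first line becomes the column names.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```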
Example data
• The data describes the G7 countries.
• Each row is a country, and each column is a property of that country.
We will start using pandas to wrangle the data.
G7 Stats
                Population  GDP            Surface Area  HDI    Continent
Canada          35.467      1,785,387.00   9,984,670     0.913  America
France          63.951      2,833,687.00   640,679       0.888  Europe
Germany         80.94       3,874,437.00   357,114       0.916  Europe
Italy           60.665      2,167,744.00   301,336       0.873  Europe
Japan           127.061     4,602,367.00   377,930       0.891  Asia
United Kingdom  64.511      2,950,039.00   242,495       0.907  Europe
United States   318.523     17,348,075.00  9,525,067     0.915  America
Creating `DataFrame`s manually can be tedious. Most of the time you'll be
pulling the data from a database, a CSV file, or the web. For this small
example, we build it by hand:
df = pd.DataFrame(
    {'Population': […], 'GDP': […], 'Surface Area': […], …},
    columns=['Population', 'GDP', …])
Dataframe Index
`DataFrame`s also have indexes. When none is given, pandas assigns an
auto-incrementing integer index (0, 1, 2, …), as in the outputs above.
We can replace it with the country names:
df.index = ['Canada','France','Germany','Italy',…]
df
Dataframe axes
You can inspect the column labels and the row labels of the table with
df.columns and df.index:
df.columns
df.index
Output:
Index(['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'], dtype='object')
Index(['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom',
       'United States'], dtype='object')
Dataframe basic info
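You can show basic information about a table with the df.info() method:
df.info()
For each column, the output lists the non-null count (so you can see
whether the series contains any null values) and the datatype (Dtype).
Question: why is a column's Dtype reported as object?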
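As a hint for the Dtype question, here is a small sketch using two rows of the G7 data. Numeric columns get a numeric dtype, while a text column does not. In pandas releases before 3.0 text is reported as `object`; newer releases may report a dedicated string dtype instead, so the sketch only checks numeric versus non-numeric. The name `numeric_cols` is illustrative.

```python
import pandas as pd

# Two rows of the G7 table are enough to see the dtypes.
df = pd.DataFrame(
    {'Population': [35.467, 63.951],
     'GDP': [1785387, 2833687],
     'Continent': ['America', 'Europe']},
    index=['Canada', 'France'])

# info() prints the non-null count and the dtype of each column.
df.info()

# dtypes gives the same datatype information as a Series.
print(df.dtypes)

# Only the columns holding numbers count as numeric; 'Continent' holds text.
numeric_cols = df.select_dtypes(include='number').columns.tolist()
print(numeric_cols)
```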
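You can also show the table's size and summary statistics:
df.size, df.info(), df.describe()
df.describe() reports, for each numeric column, the number of elements
(count), the mean, and the standard deviation, together with the minimum,
the maximum, and the 1st, 2nd, and 3rd quartiles (25%, 50%, 75%).
What do these values mean?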
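A minimal sketch of these calls on three G7 rows follows. Note that `size` is an attribute, not a method, so it takes no parentheses. The name `stats` is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94],
     'HDI': [0.913, 0.888, 0.916]},
    index=['Canada', 'France', 'Germany'])

# size is the total number of cells: rows * columns.
print(df.size)

# describe() returns a DataFrame with one row per statistic.
stats = df.describe()
print(stats)

# The '50%' row is the 2nd quartile, i.e. the median of each column.
print(stats.loc['50%'])
```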
Indexing, Selection and Slicing
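Individual columns can be selected with regular indexing, for example
df['Population']. Individual rows are selected with loc (by label) or
iloc (by position). In both cases the result is a Series:
df.loc['Canada']
df.iloc[0]
Output:
Population       35.467
GDP             1785387
Surface Area    9984670
HDI               0.913
Continent       America
Name: Canada, dtype: object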
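The three access patterns can be compared side by side. This is a sketch on a reduced copy of the G7 table; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 127.061],
     'GDP': [1785387, 2833687, 4602367],
     'Continent': ['America', 'Europe', 'Asia']},
    index=['Canada', 'France', 'Japan'])

# Row by index label.
row_by_label = df.loc['Canada']

# Row by integer position; position 0 is the same row here.
row_by_position = df.iloc[0]

# Column by name.
population = df['Population']

print(row_by_label)
print(population)
```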
Multiple columns can be selected by passing a list of column names. The result is a DataFrame:
df[['Population', 'GDP']]
Slicing works differently: it acts at the "row level", which can be
counterintuitive:
df[1:3]
Row-level selection works better with loc and iloc, which are
recommended over regular "direct slicing" df[:].
loc selects rows matching the given index:
df.loc['France':'Italy', 'Population']
iloc selects rows by the numeric position of the index:
df.iloc[[0, 1, -1]]
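The following sketch compares the three forms on the Population column of the G7 data. One difference worth noticing: a positional slice such as df[1:3] excludes the end position, while a label slice with loc includes the end label. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523]},
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'])

# Direct slicing acts on rows by position; the end position is excluded.
direct = df[1:3]

# loc slices by label; both end labels are included.
by_label = df.loc['France':'Italy', 'Population']

# iloc takes integer positions; -1 is the last row.
by_position = df.iloc[[0, 1, -1]]

print(list(direct.index))
print(list(by_label.index))
print(list(by_position.index))
```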
Conditional selection
Conditional selection works the same way as for a Series. A comparison returns a boolean Series:
df['Population'] > 70
Output:
Canada False
France False
Germany True
Italy False
Japan True
United Kingdom False
United States True
Name: Population, dtype: bool
Passing the boolean Series to loc keeps only the rows where the condition is True:
df.loc[df['Population'] > 70]
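Here is a sketch of the filter on the G7 data, with one extension that is not on the slide: combining two conditions with `&`. Each condition needs its own parentheses. The names `mask`, `large`, and `large_europe` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
     'Continent': ['America', 'Europe', 'Europe', 'Europe', 'Asia',
                   'Europe', 'America']},
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'])

# A boolean Series: True where the population is above 70.
mask = df['Population'] > 70

# Keep only the rows where the mask is True.
large = df.loc[mask]

# Combine conditions element-wise with & (and) or | (or).
large_europe = df.loc[(df['Population'] > 70) & (df['Continent'] == 'Europe')]

print(large)
print(large_europe)
```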
Remove stuff
Use drop to remove elements: rows when axis=0 and columns when axis=1.
You can also use axis='rows' or axis='columns':
df.drop(['Italy', 'Canada'], axis=0)
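A short sketch of both directions follows. Note that drop returns a new DataFrame and leaves the original untouched unless you reassign the result. The names `without_rows` and `without_columns` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 60.665],
     'HDI': [0.913, 0.888, 0.873]},
    index=['Canada', 'France', 'Italy'])

# Drop rows by index label.
without_rows = df.drop(['Italy', 'Canada'], axis=0)

# Drop columns by name.
without_columns = df.drop(['HDI'], axis='columns')

# The original DataFrame still has every row and column.
print(df.shape)
print(without_rows)
print(without_columns)
```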
Adding value
You can add a new column by assigning a Series. Rows missing from the
Series are automatically filled with NaN:
df['Language'] = pd.Series(['French', 'German', 'Italian'],
                           index=['France', 'Germany', 'Italy'], name='Language')
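The alignment happens on the index labels, not on position. The sketch below uses four G7 rows and checks which ones end up with a missing value.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94, 60.665]},
    index=['Canada', 'France', 'Germany', 'Italy'])

# The Series only has labels for three of the four rows.
df['Language'] = pd.Series(
    ['French', 'German', 'Italian'],
    index=['France', 'Germany', 'Italy'],
    name='Language')

# 'Canada' has no matching label, so its Language is missing.
print(df)
print(df['Language'].isna())
```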
You can add a new row similarly. DataFrame.append was removed in
pandas 2.0, so assign the new row through loc instead:
df.loc['China'] = pd.Series({'Population': 3, 'GDP': 5})
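On pandas 2.0 and later, where `DataFrame.append` no longer exists, one possible replacement is `pd.concat`. The sketch below turns the Series into a one-row DataFrame first, reusing the placeholder values 3 and 5 from the slide. The name `new_row` is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951],
     'GDP': [1785387, 2833687],
     'HDI': [0.913, 0.888]},
    index=['Canada', 'France'])

# The Series name becomes the index label of the new row.
new_row = pd.Series({'Population': 3, 'GDP': 5}, name='China')

# to_frame().T turns the Series into a one-row DataFrame,
# and concat stacks it under the existing rows.
df = pd.concat([df, new_row.to_frame().T])

# Columns that the new row does not provide (HDI) are filled with NaN.
print(df)
```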
Basic visualization
You can show the relationship between the population and GDP:
df.plot(kind='scatter', x='Population', y='GDP',
title='Population vs GDP')
You can also combine a bar plot with pandas functions such as groupby:
df.groupby('Continent')[['Population','GDP']].sum().plot(
kind='bar', title='Total Population and GDP by Continent')
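It can help to look at the aggregated table before plotting it. This sketch computes the per-continent totals from the G7 data and leaves the plot call as a comment, since plotting needs matplotlib installed. The name `totals` is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
     'GDP': [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
     'Continent': ['America', 'Europe', 'Europe', 'Europe', 'Asia',
                   'Europe', 'America']},
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'])

# Group the rows by continent and sum the two numeric columns.
totals = df.groupby('Continent')[['Population', 'GDP']].sum()
print(totals)

# totals.plot(kind='bar') would draw one group of bars per continent.
```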
Or show the pairwise correlation between the numeric columns:
df[['Population', 'GDP', 'Surface Area', 'HDI']].corr()
What are some insights from this correlation?
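As a starting point for reading the result, the sketch below computes the matrix from the G7 data. By default corr() uses the Pearson coefficient, so every value lies between -1 and 1, the diagonal is 1, and the matrix is symmetric. The name `corr` is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
     'GDP': [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
     'Surface Area': [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
     'HDI': [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915]},
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'])

# Pairwise Pearson correlation between the four numeric columns.
corr = df[['Population', 'GDP', 'Surface Area', 'HDI']].corr()
print(corr.round(3))

# A single pair can be read by row and column label.
print(corr.loc['Population', 'GDP'])
```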