Pandas

Intro
Structure

Last week's topics:
1. Why Pandas
2. Name
3. Index
4. Filtering

Today's lecture:
5. DataFrame
6. Basic Visualization
Topic 1: Pandas
Dataframe

• Put simply, a DataFrame is a table with rows and columns.
• Each column in a DataFrame is a Series object.
• Each row is made up of the elements at the same position in every Series.

Name     Age  City
Alice    24   New York
Bob      30   San Francisco
Charlie  22   Los Angeles

An example table: this is a DataFrame.
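To make the "each column is a Series" point concrete, here is a minimal sketch that rebuilds the example table. The variable names `people` and `ages` are chosen for illustration only.

```python
import pandas as pd

# The example table from the slide, built column by column.
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 30, 22],
    'City': ['New York', 'San Francisco', 'Los Angeles'],
})

# Selecting a single column returns a Series.
ages = people['Age']
print(type(ages).__name__)   # Series
print(people.shape)          # (3, 3): three rows, three columns
```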
Topic 1: Pandas
Dataframe
Creating a DataFrame from a Dictionary

You can create a DataFrame from a dictionary where the keys are column names
and the values are lists of data:

import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

Output:
   A  B
0  1  4
1  2  5
2  3  6
Topic 1: Pandas
Dataframe
Creating a DataFrame from a List of Dictionaries

Each dictionary in the list represents a row of data:

data = [{'A': 1, 'B': 4}, {'A': 2, 'B': 5}, {'A': 3, 'B': 6}]
df = pd.DataFrame(data)

Output:
   A  B
0  1  4
1  2  5
2  3  6
Topic 1: Pandas
Dataframe
Reading Data from a CSV File

Use pandas to read data from a CSV file with pd.read_csv():

df = pd.read_csv('file.csv')

CSV file content:
A,B
1,4
2,5
3,6

Output:
   A  B
0  1  4
1  2  5
2  3  6
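The read can be tried without a file on disk. The sketch below assumes the CSV text shown on the slide and feeds it to pd.read_csv() through an in-memory buffer. The `csv_text` variable is a stand-in for 'file.csv', not part of the original example.

```python
import io
import pandas as pd

# Stand-in for 'file.csv': the same text held in memory (an assumption
# made so the example runs without a file on disk).
csv_text = "A,B\n1,4\n2,5\n3,6\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```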
Topic 1: Pandas
Example data

• The data describes the G7 countries.
• Each row is a country; each column is a property of that country.
• We will start using pandas to wrangle the data.

G7 Stats

                Population  GDP            Surface Area  HDI    Continent
Canada          35.467      1,785,387.00   9,984,670     0.913  America
France          63.951      2,833,687.00   640,679       0.888  Europe
Germany         80.94       3,874,437.00   357,114       0.916  Europe
Italy           60.665      2,167,744.00   301,336       0.873  Europe
Japan           127.061     4,602,367.00   377,930       0.891  Asia
United Kingdom  64.511      2,950,039.00   242,495       0.907  Europe
United States   318.523     17,348,075.00  9,525,067     0.915  America
Topic 1: Pandas
Dataframe
Reading Data from a CSV File

Creating `DataFrame`s manually can be tedious. 99% of the time you'll be
pulling the data from a database, a CSV file or the web.

df = pd.DataFrame({'Population': […], 'GDP': […], 'Surface Area': […], …},
                  columns=['Population', 'GDP', …])
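As a concrete instance of the manual construction, the sketch below fills the dictionary with the values from the G7 Stats table. The exact lists the lecturer used are not shown on the slide, so treat this as one plausible reconstruction.

```python
import pandas as pd

# Values copied from the G7 Stats table, in table row order
# (Canada, France, Germany, Italy, Japan, United Kingdom, United States).
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
    'GDP': [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
    'Surface Area': [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
    'HDI': [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
    'Continent': ['America', 'Europe', 'Europe', 'Europe',
                  'Asia', 'Europe', 'America'],
}, columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])

print(df.shape)   # (7, 5)
print(df.head())
```

At this point the rows are still labelled 0 to 6, because no index was passed.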
Topic 1: Pandas
Dataframe
Dataframe Index

`DataFrame`s also have indexes. As you can see in the "table" above,
pandas has assigned an auto-incrementing integer index. We can reassign it
to the country names:
df.index = ['Canada', 'France', 'Germany', 'Italy', …]
df
Topic 1: Pandas
Dataframe
Dataframe axes

You can show the table's column labels and row labels with df.columns and
df.index:
df.columns
df.index
Output:
Index(['Population', 'GDP', 'Surface Area', 'HDI',
       'Continent'], dtype='object')

Index(['Canada', 'France', 'Germany', 'Italy', 'Japan',
       'United Kingdom', 'United States'], dtype='object')
Topic 1: Pandas
Dataframe
Dataframe basic info

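You can show basic information about a table with the df.info() method:

df.info()

For each column, the output lists:
• the non-null count, which shows whether the Series contains any null values
• the Dtype (data type)

Why is this Dtype object?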
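One way to explore the question is to compare the dtypes of a numeric column and a text column. The sketch below uses a three-country subset of the G7 table. It shows one common reason for an object Dtype (a column holding Python strings), which may or may not be the case the slide highlights. Newer pandas versions may report a dedicated string dtype instead of object.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 127.061],
     'Continent': ['America', 'Europe', 'Asia']},
    index=['Canada', 'France', 'Japan'],
)

df.info()                 # prints per-column non-null counts and dtypes
print(df.dtypes)          # the dtype of each column on its own
print(df.notna().sum())   # number of non-null values per column
```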


Topic 1: Pandas
Dataframe
Dataframe basic info

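You can also show the table's statistical information:

df.size, df.info(), df.describe()

Note that df.size is an attribute, not a method. The df.describe() output includes:
• the number of elements (count), the mean and the standard deviation
• the minimum, the 1st, 2nd and 3rd quartiles (25%, 50%, 75%) and the maximum

What does it mean?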
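To see what these statistics look like, here is a minimal sketch on a three-country subset of the G7 table. The `stats` name is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94],
     'HDI': [0.913, 0.888, 0.916]},
    index=['Canada', 'France', 'Germany'],
)

print(df.size)            # attribute, not a method: rows * columns
stats = df.describe()     # count, mean, std, min, quartiles, max
print(stats)
print(stats.loc['50%'])   # the 2nd quartile is the median of each column
```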
Topic 1: Pandas
Dataframe
Indexing, Selection and Slicing

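An individual row can be selected by label with loc or by position with
iloc (an individual column is selected with regular indexing, e.g.
df['Population']). In each case the result is a Series:

df.loc['Canada'], df.iloc[0]

Output:
Population       35.467
GDP              1785387
Surface Area     9984670
HDI              0.913
Continent        America
Name: Canada, dtype: object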
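The difference between selecting a row and selecting a column can be sketched as follows, assuming a two-country subset of the G7 table. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951], 'Continent': ['America', 'Europe']},
    index=['Canada', 'France'],
)

row_by_label = df.loc['Canada']    # row selected by index label
row_by_position = df.iloc[0]       # the same row, selected by position
column = df['Population']          # a column, selected by name

# Both kinds of selection return a Series; its name records where it came from.
print(row_by_label.name)           # Canada
print(column.name)                 # Population
```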
Topic 1: Pandas
Dataframe
Indexing, Selection and Slicing

Multiple columns can be selected similarly. The result is a DataFrame:

df[['Population', 'GDP']]

Topic 1: Pandas
Dataframe
Indexing, Selection and Slicing

Slicing works differently: it acts at "row level", which can be
counterintuitive:
df[1:3]
Topic 1: Pandas
Dataframe
Indexing, Selection and Slicing

Row-level selection works better with loc and iloc, which are
recommended over regular "direct slicing" df[:].

loc selects rows matching the given index labels:
df.loc['France':'Italy', 'Population']

iloc selects rows by the numeric position of the index:
df.iloc[[0, 1, -1]]
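The three forms of row selection can be compared side by side. This sketch assumes a five-country subset of the G7 table and shows that a label slice with loc includes its end label, while a direct slice is positional and excludes its end.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94, 60.665, 127.061]},
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan'],
)

by_position = df[1:3]                               # rows at positions 1 and 2
by_label = df.loc['France':'Italy', 'Population']   # label slice, end label included
picked = df.iloc[[0, 1, -1]]                        # first, second and last rows

print(by_position.index.tolist())   # ['France', 'Germany']
print(by_label.index.tolist())      # ['France', 'Germany', 'Italy']
print(picked.index.tolist())        # ['Canada', 'France', 'Japan']
```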
Topic 1: Pandas
Dataframe
Conditional selection

Conditional selection works the same way as with a Series. A comparison
returns a boolean Series:

df['Population'] > 70

Output:
Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: Population, dtype: bool
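As a sketch of how such a boolean mask is built and then used to filter rows, the example below uses the Population column from the G7 table. The `mask` and `large` names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523]},
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'],
)

mask = df['Population'] > 70   # boolean Series, one value per row
large = df.loc[mask]           # keeps only the rows where the mask is True

print(large.index.tolist())    # ['Germany', 'Japan', 'United States']
```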
Topic 1: Pandas
Dataframe
Conditional selection

Passing the boolean Series to loc keeps only the rows where the condition
is True:

df.loc[df['Population'] > 70]

Topic 1: Pandas
Dataframe
Remove stuff

Use drop to remove elements: rows when axis=0 and columns when axis=1.
You can also use axis="rows" or axis="columns":

df.drop(['Italy', 'Canada'], axis=0)
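A minimal sketch of both directions, assuming a three-country subset of the G7 table. Note that drop returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 60.665], 'HDI': [0.913, 0.888, 0.873]},
    index=['Canada', 'France', 'Italy'],
)

without_rows = df.drop(['Italy', 'Canada'], axis=0)   # drop rows by label
without_column = df.drop('HDI', axis=1)               # drop a column by name

print(without_rows.index.tolist())      # ['France']
print(list(without_column.columns))     # ['Population']
print(df.shape)                         # (3, 2): the original is untouched
```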
Topic 1: Pandas
Dataframe
Adding value

You can add a new column by assigning a Series. Rows whose index label is
missing from the Series are automatically filled with NaN:

df['Language'] = pd.Series(['French', 'German', 'Italian'],
                           index=['France', 'Germany', 'Italy'],
                           name='Language')
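The alignment on index labels can be checked with a small runnable sketch, assuming a four-country subset of the G7 table. Canada has no entry in the new Series, so it receives NaN.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94, 60.665]},
    index=['Canada', 'France', 'Germany', 'Italy'],
)

# The Series is matched to the DataFrame by index label, not by position.
df['Language'] = pd.Series(['French', 'German', 'Italian'],
                           index=['France', 'Germany', 'Italy'],
                           name='Language')

print(df)
print(df['Language'].isna())   # True only for Canada
```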
Topic 1: Pandas
Dataframe
Adding value

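You can add a new row similarly. DataFrame.append was removed in pandas 2.0,
so use pd.concat on current versions:

new_row = pd.DataFrame([{'Population': 3, 'GDP': 5}], index=['China'])
df = pd.concat([df, new_row])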
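A row can also be added by assigning a Series to a new label through loc. The sketch below assumes a two-country subset of the G7 table. Columns missing from the new row are filled with NaN.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951],
     'GDP': [1785387, 2833687],
     'HDI': [0.913, 0.888]},
    index=['Canada', 'France'],
)

# Assigning to a label that does not exist yet enlarges the DataFrame.
df.loc['China'] = pd.Series({'Population': 3, 'GDP': 5})

print(df)
```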
Topic 1: Pandas
Dataframe
Basic visualization

You can show the relationship between the population and GDP:
df.plot(kind='scatter', x='Population', y='GDP',
title='Population vs GDP')
Topic 1: Pandas
Dataframe
Basic visualization

You can combine a bar plot with pandas functions such as groupby:

df.groupby('Continent')[['Population', 'GDP']].sum().plot(
    kind='bar', title='Total Population and GDP by Continent')
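The aggregation behind the bar plot can be inspected on its own. This sketch uses the Population, GDP and Continent values from the G7 table and stops before plotting, since plotting additionally requires matplotlib to be installed. The `totals` name is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
    'GDP': [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
    'Continent': ['America', 'Europe', 'Europe', 'Europe',
                  'Asia', 'Europe', 'America'],
}, index=['Canada', 'France', 'Germany', 'Italy', 'Japan',
          'United Kingdom', 'United States'])

# One row per continent, with Population and GDP summed within each group.
totals = df.groupby('Continent')[['Population', 'GDP']].sum()
print(totals)

# totals.plot(kind='bar') would then draw one group of bars per continent.
```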
Topic 1: Pandas
Dataframe
Basic visualization

Or show the correlation of each column:


df[['Population', 'GDP', 'Surface Area', 'HDI']].corr()

• What are some insights from this correlation?
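To explore the question, the correlation matrix can be computed directly from the G7 table values. The sketch below only builds the matrix and reads off one entry. Interpreting it is left to the reader.

```python
import pandas as pd

df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
    'GDP': [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
    'Surface Area': [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
    'HDI': [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
})

# Pairwise Pearson correlation between the numeric columns.
corr = df[['Population', 'GDP', 'Surface Area', 'HDI']].corr()
print(corr.round(2))

# A single entry can be read by row and column label.
print(corr.loc['Population', 'GDP'])
```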
