2/24/2020 3.
3_group_by - Jupyter Notebook
Group By: Split, Apply, Combine
Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally
on some label or index: this is implemented in the so-called groupby operation. The name "group by" comes
from a command in the SQL database language, but it is perhaps more illuminative to think of it in the terms
first coined by Hadley Wickham of Rstats fame: split, apply, combine.
In [1]:
import numpy as np
import pandas as pd
localhost:8888/notebooks/Machine Learning/Python/3.3_group_by.ipynb 1/11
2/24/2020 3.3_group_by - Jupyter Notebook
In [8]:
df = [Link]({'key': ['A', 'B', 'C', 'A', 'B', 'C','B'],
'data': range(7)}, columns=['key', 'data'])
df
Out[8]:
key data
0 A 0
1 B 1
2 C 2
3 A 3
4 B 4
5 C 5
6 B 6
In [4]:
[Link]('key')
Out[4]:
<[Link] object at 0x0000010CC3FF41C8>
Notice that what is returned is not a set of DataFrames, but a DataFrameGroupBy object. This object is where
the magic is: you can think of it as a special view of the DataFrame, which is poised to dig into the groups but
does no actual computation until the aggregation is applied.
In [9]:
[Link]('key').sum()
Out[9]:
data
key
A 3
B 11
C 7
In [13]:
for (key, group) in [Link]('key'):
print("{0:20s} shape={1}".format(key, [Link]))
A shape=(2, 2)
B shape=(3, 2)
C shape=(2, 2)
localhost:8888/notebooks/Machine Learning/Python/3.3_group_by.ipynb 2/11
2/24/2020 3.3_group_by - Jupyter Notebook
In [15]:
nyc_flight = pd.read_csv('data/nyc_flights_2013.csv')
In [27]:
nyc_flight.head(3)
Out[27]:
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin
0 2013 1 1 517.0 2.0 830.0 11.0 UA N14228 1545 EWR
1 2013 1 1 533.0 4.0 850.0 20.0 UA N24211 1714 LGA
2 2013 1 1 542.0 2.0 923.0 33.0 AA N619AA 1141 JFK
SELECT carrier, count(*)
FROM df
GROUP BY carrier
In [26]:
n_by_flight = nyc_flight.groupby("carrier")["carrier"].count()
n_by_flight
Out[26]:
carrier
9E 2982
AA 5094
AS 115
B6 8144
DL 7160
EV 8298
F9 107
FL 527
HA 50
MQ 4137
OO 1
UA 8917
US 3152
VX 716
WN 1915
YV 103
Name: carrier, dtype: int64
localhost:8888/notebooks/Machine Learning/Python/3.3_group_by.ipynb 3/11
2/24/2020 3.3_group_by - Jupyter Notebook
In [3]:
data = pd.read_csv('data/phone_data.csv')
[Link](3)
Out[3]:
index date duration item month network network_type
0 0 15/10/14 06:58 34.429 data 2014-11 data data
1 1 15/10/14 06:58 13.000 call 2014-11 Vodafone mobile
2 2 15/10/14 14:46 23.000 call 2014-11 Meteor mobile
In [30]:
[Link]()
<class '[Link]'>
RangeIndex: 830 entries, 0 to 829
Data columns (total 7 columns):
index 830 non-null int64
date 830 non-null object
duration 830 non-null float64
item 830 non-null object
month 830 non-null object
network 830 non-null object
network_type 830 non-null object
dtypes: float64(1), int64(1), object(5)
memory usage: 45.5+ KB
In [4]:
import dateutil
# Convert date from string to date times
#data['date'] = data['date'].apply([Link], dayfirst=True)
data['date'] = pd.to_datetime(data['date'])
In [33]:
[Link]()
<class '[Link]'>
RangeIndex: 830 entries, 0 to 829
Data columns (total 7 columns):
index 830 non-null int64
date 830 non-null datetime64[ns]
duration 830 non-null float64
item 830 non-null object
month 830 non-null object
network 830 non-null object
network_type 830 non-null object
dtypes: datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 45.5+ KB
1. How many rows the dataset has?
localhost:8888/notebooks/Machine Learning/Python/3.3_group_by.ipynb 4/11
2/24/2020 3.3_group_by - Jupyter Notebook
In [35]:
data['item'].count()
Out[35]:
830
2. What was the longest phone call / data entry?
In [36]:
data['duration'].max()
Out[36]:
10528.0
3. How many seconds of phone calls are recorded in total?
In [7]:
data[data['item']=='call']['duration'].sum()
Out[7]:
92321.0
4. How many entries are there for each month?
In [38]:
data['month'].value_counts()
Out[38]:
2014-11 230
2015-01 205
2014-12 157
2015-02 137
2015-03 101
Name: month, dtype: int64
5. Number of non-null unique network entries
localhost:8888/notebooks/Machine Learning/Python/3.3_group_by.ipynb 5/11
2/24/2020 3.3_group_by - Jupyter Notebook
In [39]:
data['network'].nunique()
Out[39]:
6. Get the first entry for each month
In [40]:
[Link]('month').first()
Out[40]:
index date duration item network network_type
month
2014-11 0 2014-10-15 [Link] 34.429 data data data
2014-12 228 2014-11-13 [Link] 34.429 data data data
2015-01 381 2014-12-13 [Link] 34.429 data data data
2015-02 577 2015-01-13 [Link] 34.429 data data data
2015-03 729 2015-02-12 [Link] 69.000 call landline landline
7. Get the sum of the durations per month
In [41]:
[Link]('month')['duration'].sum()
Out[41]:
month
2014-11 26639.441
2014-12 14641.870
2015-01 18223.299
2015-02 15522.299
2015-03 22750.441
Name: duration, dtype: float64
8. Get the number of dates / entries in each month
localhost:8888/notebooks/Machine Learning/Python/3.3_group_by.ipynb 6/11
2/24/2020 3.3_group_by - Jupyter Notebook
In [42]:
[Link]('month')['date'].count()
Out[42]:
month
2014-11 230
2014-12 157
2015-01 205
2015-02 137
2015-03 101
Name: date, dtype: int64
9. What is the sum of durations, for calls only, to each network
In [43]:
data[data['item'] == 'call'].groupby('network')['duration'].sum()
Out[43]:
network
Meteor 7200.0
Tesco 13828.0
Three 36464.0
Vodafone 14621.0
landline 18433.0
voicemail 1775.0
Name: duration, dtype: float64
10. How many calls, sms, and data entries are in each month?
localhost:8888/notebooks/Machine Learning/Python/3.3_group_by.ipynb 7/11
2/24/2020 3.3_group_by - Jupyter Notebook
In [44]:
[Link](['month', 'item'])['date'].count()
Out[44]:
month item
2014-11 call 107
data 29
sms 94
2014-12 call 79
data 30
sms 48
2015-01 call 88
data 31
sms 86
2015-02 call 67
data 31
sms 39
2015-03 call 47
data 29
sms 25
Name: date, dtype: int64
11. How many calls, texts, and data are sent per month, split by network_type?
localhost:8888/notebooks/Machine Learning/Python/3.3_group_by.ipynb 8/11
2/24/2020 3.3_group_by - Jupyter Notebook
In [45]:
[Link](['month', 'network_type'])['date'].count()
Out[45]:
month network_type
2014-11 data 29
landline 5
mobile 189
special 1
voicemail 6
2014-12 data 30
landline 7
mobile 108
voicemail 8
world 4
2015-01 data 31
landline 11
mobile 160
voicemail 3
2015-02 data 31
landline 8
mobile 90
special 2
voicemail 6
2015-03 data 29
landline 11
mobile 54
voicemail 4
world 3
Name: date, dtype: int64
In [48]:
# produces Pandas Series
#[Link]('month')['duration'].sum()
# Produces Pandas DataFrame
[Link]('month')[['duration']].sum()
Out[48]:
duration
month
2014-11 26639.441
2014-12 14641.870
2015-01 18223.299
2015-02 15522.299
2015-03 22750.441
In [1]:
import pandas as pd
localhost:8888/notebooks/Machine Learning/Python/3.3_group_by.ipynb 9/11
2/24/2020 3.3_group_by - Jupyter Notebook
In [6]:
name = [Link](['ram','shyam','kiran','rishi'])
In [7]:
name
Out[7]:
0 ram
1 shyam
2 kiran
3 rishi
dtype: object
In [9]:
mark1 = [Link]([45, 67, 32, 65])
In [10]:
mark2 = [Link]([77, 34, 72, 55])
In [13]:
df=[Link]({name, mark1, mark2})
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-13-f00c768bf164> in <module>
----> 1 df=[Link]({name, mark1, mark2})
~\Anaconda3\lib\site-packages\pandas\core\[Link] in __hash__(self)
1884 raise TypeError(
1885 "{0!r} objects are mutable, thus they cannot be"
-> 1886 " hashed".format(self.__class__.__name__)
1887 )
1888
TypeError: 'Series' objects are mutable, thus they cannot be hashed
In [12]:
df
Out[12]:
0 1 2 3
0 ram shyam kiran rishi
1 45 67 32 65
2 77 34 72 55
localhost:8888/notebooks/Machine Learning/Python/3.3_group_by.ipynb 10/11
2/24/2020 3.3_group_by - Jupyter Notebook
In [ ]:
localhost:8888/notebooks/Machine Learning/Python/3.3_group_by.ipynb 11/11