Pandas Notes
Pandas Notes
Youtube : https://www.youtube.com/channel/UCIKLLQJ0qxdpKkrWyvWhkgg
1.Introduction to pandas
2.Top 5 or bottom 5 records
3.renaming the column names
4.statistical description
5.adding new column
6.dropping column
7.Setting any column as index
8.resettig index
9. selecting row with index position
10. selecting subsection of the dataset with LOC and iloc
11.how to drop the row
12.conditional filtering using "&" and "|"
13.apply function on single column
14.apply func on multiple column using lambda func
15.sorting method
16.min , max and their index position
17.value_counts/unique/nunique/replace/map function
18.treatment of duplicate values
19.nlargest/nsmallest to get highest/lowest records
20.sample of the dataset (by numbers or percentage)
21.handling missing data
22.isnull/notnull
23.dropping missing values
24.filling missing values
25.group by operation on pandas
26.combining dataframe-->concatenation
27.combining dataframe--> merging-->join--> inner/left/right/outer
28.text method on string data
29.cleaning the data
1. Introduction to pandas
Pandas Library
Pandas Dataframe
pandas dataframe it is two dimentional tabular data structure with labelled axis (rows and columns)
Dataframe is table of columns and rows in pandas that we can easily restructure and filter
Formal Defination:=A group of pandas series Object that share the same index
In [218]:
# ls ===> to get the content(or the list of the all the files in same folder) in the dire
In [2]:
import pandas as pd
import numpy as np
In [3]:
my_Data=np.random.randint(0,101,(4,3))
In [4]:
my_Data
Out[4]:
In [5]:
my_index=["india","Japan","Australia","newziland"]
In [6]:
column=["jan","feb","mar"]
In [7]:
df=pd.DataFrame(my_Data,index=my_index,columns=column)
df
Out[7]:
india 2 79 18
Japan 34 90 80
Australia 9 61 2
newziland 67 87 77
In [219]:
df=pd.read_csv("tips.csv")
2. Top 5 or bottom 5 records
In [220]:
df.head()
Out[220]:
Payer
total_bill tip sex smoker day time size price_per_person
Name
Christy
0 16.99 1.01 Female No Sun Dinner 2 8.49 3560325
Cunningham
Douglas
1 10.34 1.66 Male No Sun Dinner 3 3.45 4478071
Tucker
Travis
2 21.01 3.50 Male No Sun Dinner 3 7.00 601181
Walters
Nathaniel
3 23.68 3.31 Male No Sun Dinner 2 11.84 4676137
Harris
Tonya
4 24.59 3.61 Female No Sun Dinner 4 6.15 4832732
Carter
In [226]:
df.tail()
Out[226]:
Payer
total_bill tip sex smoker day time size price_per_person C
Name
239 Michael
29.03 5.92 Male No Sat Dinner 3 9.68 52960686
Avila
240 Monica
27.18 2.00 Female Yes Sat Dinner 2 13.59 35068061
Sanders
241 Keith
22.67 2.00 Male Yes Sat Dinner 2 11.34 60118916
Wong
242 Dennis
17.82 1.75 Male No Sat Dinner 2 8.91 43752
Dixon
243 Michelle
18.78 3.00 Female No Thur Dinner 2 9.39 35114516
Hardin
3. To get the name of all the columns and how to rename the name of the
columns
In [10]:
df.columns
Out[10]:
Out[11]:
In [228]:
In [229]:
df.head()
Out[229]:
Christy
0 16.99 1.01 Female No Sun Dinner 2 8.49 356032
Cunningham
Douglas
1 10.34 1.66 Male No Sun Dinner 3 3.45 447807
Tucker
Travis
2 21.01 3.50 Male No Sun Dinner 3 7.00 601181
Walters
Nathaniel
3 23.68 3.31 Male No Sun Dinner 2 11.84 467613
Harris
4 24.59 3.61 Female No Sun Dinner 4 6.15 Tonya Carter 483273
4. Statistical description
In [12]:
df.describe()
Out[12]:
df.describe().transpose()
Out[13]:
In [223]:
df[["total_bill"]].head(3)
Out[223]:
total_bill
0 16.99
1 10.34
2 21.01
In [224]:
df[["total_ bill","tip"]].head(3)
Out[224]:
total_bill tip
0 16.99 1.01
1 10.34 1.66
2 21.01 3.50
In [17]:
df["tip_percentage"]=round(df['tip']/df["total_bill"]*100,2)
In [18]:
df["tip_percentage"]
Out[18]:
0 5.94
1 16.05
2 16.66
3 13.98
4 14.68
...
239 20.39
240 7.36
241 8.82
242 9.82
243 15.97
Name: tip_percentage, Length: 244, dtype: float64
In [19]:
df.head()
Out[19]:
Christy
0 16.99 1.01 Female No Sun Dinner 2 8.49 3560325
Cunningham
Douglas
1 10.34 1.66 Male No Sun Dinner 3 3.45 4478071
Tucker
Travis
2 21.01 3.50 Male No Sun Dinner 3 7.00 601181
Walters
Nathaniel
11.84 4676137
3 23.68 3.31 Male No Sun Dinner 2
Harris
Tonya
4 24.59 3.61 Female No Sun Dinner 4 6.15 4832732
Carter
In [20]:
df.drop("tip_percentage",axis=1,inplace=True)
In [1]:
df.head()
Out[21]:
Christy
0 16.99 1.01 Female No Sun Dinner 2 8.49 3560325
Cunningham
Douglas
1 10.34 1.66 Male No Sun Dinner 3 3.45 4478071
Tucker
Travis
2 21.01 3.50 Male No Sun Dinner 3 7.00 601181
Walters
Nathaniel
3 23.68 3.31 Male No Sun Dinner 2 11.84 4676137
Harris
Tonya
4 24.59 3.61 Female No Sun Dinner 4 6.15 4832732
Carter
In [23]:
In [24]:
df.head(2)
Out[24]:
Payment
ID
Christy
Sun2959 16.99 1.01 Female No Sun Dinner 2 8.49
Cunn ingham
In [26]:
df=df.reset_index()
In [27]:
df.head(2)
Out[27]:
Christ
0 Sun2959 16.99 1.01 Female No Sun Dinner 2 8.49
Cunningha
3.45 Dougla
1 Sun4608 10.34 1.66 Male No Sun Dinner 3 Tucke
In [28]:
df.iloc[0]
Out[28]:
Payment ID Sun2959
total_bill 16.99
tip 1.01
sex Female
smoker No
day Sun
time Dinner
size 2
price_per_person 8.49
Payer Name Christy Cunningham
CC Number 3560325168603410
Name: 0, dtype: object
10. For selecting some subsection of the dataframe
In [29]:
df.iloc[1:5]
Out[29]:
Douglas
1 Sun4608 10.34 1.66 Male No Sun Dinner 3 3.45
Tucker
Travis
2 Sun4458 21.01 3.50 Male No Sun Dinner 3 7.00
Walters
Nathaniel
3 Sun5260 23.68 3.31 Male No Sun Dinner 2 11.84
Harris
6.15 Tonya
4 Sun2251 24.59 3.61 Female No Sun Dinner 4 Carter
In [30]:
df=df.set_index("Payment ID")
In [31]:
df.head(2)
Out[31]:
Payer
total_bill tip sex smoker day time size price_per_person
Name
Payment
ID
Christy
Sun2959 16.99 1.01 Female No Sun Dinner 2 8.49
Cunn ingham
df.loc[["Sun2959","Sun4608"]]
Out[32]:
Payer
total_bill tip sex smoker day time size price_per_person
Name
Payment
ID
Christy
Sun2959 16.99 1.01 Female No Sun Dinner 2 8.49
Cunn ingham
Douglas
Sun4608 10.34 1.66 Male No Sun Dinner 3 3.45 Tucker
In [33]:
df.head(2)
Out[33]:
Payer
total_bill tip sex smoker day time size price_per_person
Name
Payment
ID
Christy
Sun2959 16.99 1.01 Female No Sun Dinner 2 8.49
Cunn ingham
In [34]:
df.drop("Sun2959",axis=0)
Out[34]:
Payer
total_bill tip sex smoker day time size price_per_person
Name
Payment
ID
Travis
Sun4458 21.01 3.50 Male No Sun Dinner 3 7.00 60
Walters
typically in data analysis our datasets are large enough that we dont filter based on positon but based on
some condition
conditional formating allows us to select row based condtion on the column
this leads to discussion on organizing the data
condtion filtering
filter by single condition
filter by multiple condtion
check against multiple values
In [35]:
new_df=pd.read_csv("tips.csv")
new_df.head(2)
Out[35]:
Payer
total_bill tip sex smoker day time size price_per_person
Name
Christy
0 16.99 1.01 Female No Sun Dinner 2 8.49 3560325
Cunningham
Douglas
1 10.34 1.66 Male No Sun Dinner 3 3.45 4478071
Tucker
In [36]:
#let filter out for the bill values greater than 40 dollars
new_df[new_df["total_bill"]>40]
Out[36]:
Aaron
95 40.17 4.73 Male Yes Fri Dinner 4 10.04 180
Bentley
Heather
102 44.30 2.50 Female Yes Sat Dinner 3 14.77 379
Cohen
Eric
142 41.19 5.00 Male No Thur Lunch 5 8.24 43565
Andrews
Ryan
156 48.17 5.00 Male No Sun Dinner 6 8.03 35231
Gonzales
Gregory
170 50.81 10.00 Male Yes Sat Dinner 3 16.94 54738
Clark
Jose
182 45.35 3.50 Male Yes Sun Dinner 3 15.12 41122
Parsons
Stephen
184 40.55 3.00 Male Yes Sun Dinner 2 20.27 35477
Cox
Brooke
197 43.11 5.00 Female Yes Thur Lunch 4 10.78 55449
Soto
Alex
212 48.33 9.00 Male No Sat Dinner 4 12.08
Williamson
new_df["total_bill"]>40 ==> will only give the boolean values, that is true and false where as outer function
will filter out ouly true value of the dataframe
In [37]:
Out[37]:
total_bill 157
tip 157
sex 157
smoker 157
day 157
time 157
size 157
price_per_person 157
Payer Name 157
CC Number 157
Payment ID 157
dtype: int64
In [38]:
In [39]:
# show me the male who have paid the total bill more than 30
Out[39]:
James
56 38.01 3.00 Male Yes Sat Dinner 4 9.50 Christensen 3497936
DDS
In [40]:
new_df[(new_df["total_bill"]>30) | (new_df["sex"]=="Male")].tail(5)
Out[40]:
Payer
total_bill tip sex smoker day time size price_per_person
C
Name
237 Thomas
32.83 1.17 Male Yes Sat Dinner 2 16.42
42847226
Brown
238 Kimberly
35.83 4.67 Female No Sat Dinner 3 11.94 6761
Crane
239 Michael
29.03 5.92 Male No Sat Dinner 3 9.68 52960686
Avila
241 Keith
22.67 2.00 Male Yes Sat Dinner 2 11.34 60118916
Wong
242 Dennis
17.82 1.75 Male No Sat Dinner 2 8.91 43752
Dixon
In [41]:
In [42]:
option=["Sat","Sun"]
In [43]:
new_df[new_df["day"].isin(option)]
Out[43]:
Christy
0 16.99 1.01 Female No Sun Dinner 2 8.49 35603
Cunningham
Douglas
1 10.34 1.66 Male No Sun Dinner 3 3.45 44780
Tucker
Travis
2 21.01 3.50 Male No Sun Dinner 3 7.00 6011
Walters
Nathaniel
3 23.68 3.31 Male No Sun Dinner 2 11.84 46761
Harris
Tonya
4 24.59 3.61 Female No Sun Dinner 4 6.15 48327
Carter
... ... ... ... ... ... ... ... ... ...
Kimberly
238 35.83 4.67 Female No Sat Dinner 3 11.94 6
Crane
Michael
239 29.03 5.92 Male No Sat Dinner 3 9.68 52960
Avila
Monica
240 27.18 2.00 Female Yes Sat Dinner 2 13.59 35068
Sanders
241 22.67 2.00 Male Yes Sat Dinner 2 11.34 Keith Wong 6011
In [44]:
#what if i want to grab last four digit number of credit card number
#this can be acchieved by .apply function
In [45]:
def Last_Four(num):
return str(num)[-4:]
In [46]:
Last_Four(945723487)
Out[46]:
'3487'
In [47]:
In [48]:
new_df.head(2)
Out[48]:
Payer
total_bill tip sex smoker day time size price_per_person
Name
Christy
0 16.99 1.01 Female No Sun Dinner 2 8.49 3560325
Cunningham
In [49]:
In [50]:
def yelp(price):
if price<10:
return "$"
elif price >=10 and price <30:
return "$$"
else:
return "$$$"
In [51]:
new_df["Yelp"]=new_df["total_bill"].apply(yelp)
In [52]:
new_df.head(4)
Out[52]:
Christy
0 16.99 1.01 Female No Sun Dinner 2 8.49 3560325
Cunningham
Douglas
1 10.34 1.66 Male No Sun Dinner 3 3.45 4478071
Tucker
Travis
2 21.01 3.50 Male No Sun Dinner 3 7.00 601181
Walters
Nathaniel
3 23.68 3.31 Male No Sun Dinner 2 11.84 4676137
Harris
In [53]:
# problem statement is ==> based on the ratio totalbill and tip categories customer wheth
In [54]:
def quality(total_bill,tip):
if total_bill/tip>0.25:
return "Generous"
else:
return "other"
In [55]:
quality(16.00,1.01)
Out[55]:
'Generous'
In [86]:
new_df["Quality"]=new_df[["total_bill","tip"]].apply(lambda new_df:quality(new_df["total_
In [ ]:
In [88]:
new_df["Quality"]=np.vectorize(quality)(new_df["total_bill"],new_df["tip"])
In [89]:
#both will give the same result but vectorize is lil easy to remember and it works faster
In [91]:
df=pd.read_csv("tips.csv")
In [92]:
df.head(2)
Out[92]:
Payer
total_bill tip sex smoker day time size price_per_person
Name
Christy
0 16.99 1.01 Female No Sun Dinner 2 8.49 3560325
Cunningham
In [93]:
In [96]:
Out[96]:
Out[98]:
Tiffany
67 3.07 1.00 Female Yes Sat Dinner 1 3.07 43594
Brock
111 7.25 1.00 Female No Sat Dinner 1 7.25 Terri Jones 3559
Leah
92 5.75 1.00 Female Yes Fri Dinner 2 2.88 3508
Ramirez
Matthew
236 12.60 1.00 Male Yes Sat Dinner 2 6.30 35436
Myers
Christy
0 16.99 1.01 Female No Sun Dinner 2 8.49 3560
Cunningham
In [99]:
#what is max value of total bill and what is its index position
df["total_bill"].max()
Out[99]:
50.81
In [100]:
df["total_bill"].idxmax()
Out[100]:
170
max amount of total bill is 50.81 and its index position is 170
In [101]:
#what is min value of total bill and what is its index position
df["total_bill"].min()
Out[101]:
3.07
In [102]:
#what is min value of total bill and what is its index position
df["total_bill"].idxmin()
Out[102]:
67
In [103]:
df.iloc[67]
Out[103]:
total_bill 3.07
tip 1.0
sex Female
smoker Yes
day Sat
time Dinner
size 1
price_per_person 3.07
Payer Name Tiffany Brock
CC Number 4359488526995267
Payment ID Sat3455
Name: 67, dtype: object
In [104]:
df.corr()
Out[104]:
In [105]:
df["sex"].value_counts()
Out[105]:
Male 157
Female 87
Name: sex, dtype: int64
In [106]:
df["day"].unique()
Out[106]:
In [107]:
df["day"].nunique()
Out[107]:
In [110]:
# how to use replace method==> replace female with "F" and Male with "M"
df.head(2)
Out[110]:
Payer
total_bill tip sex smoker day time size price_per_person
Name
Christy
0 16.99 1.01 Female No Sun Dinner 2 8.49 3560325
Cunningham
In [111]:
df["sex"].replace(["Female","Male"],["F","M"])
Out[111]:
0 F
1 M
2 M
3 M
4 F
..
239 M
240 F
241 M
242 M
243 F
Name: sex, Length: 244, dtype: object
In [113]:
mymap={"Female":"F","Male":"M"}
In [114]:
df["sex"].map(mymap)
Out[114]:
0 F
1 M
2 M
3 M
4 F
..
239 M
240 F
241 M
242 M
243 F
Name: sex, Length: 244, dtype: object
In [117]:
df.duplicated()
Out[117]:
0 False
1 False
2 False
3 False
4 False
...
239 False
240 False
241 False
242 False
243 False
Length: 244, dtype: bool
df.drop_duplicates().head(5) # to drop the duplicates
Out[120]:
Christy
0 16.99 1.01 Female No Sun Dinner 2 8.49 3560325
Cunningham
Douglas
1 10.34 1.66 Male No Sun Dinner 3 3.45 4478071
Tucker
Travis
2 21.01 3.50 Male No Sun Dinner 3 7.00 601181
Walters
Nathaniel
3 23.68 3.31 Male No Sun Dinner 2 11.84 4676137
Harris
Tonya
4 24.59 3.61 Female No Sun Dinner 4 6.15 4832732
Carter
In [123]:
C:\Users\SSRVC\AppData\Local\Temp\ipykernel_14472\1154451165.py:2: FutureW
arning: Boolean inputs to the `inclusive` argument are deprecated in favou
r of `both` or `neither`.
df[df["total_bill"].between(10,30,inclusive=True)]
Out[123]:
Christy
0 16.99 1.01 Female No Sun Dinner 2 8.49 3560
Cunningham
Douglas
1 10.34 1.66 Male No Sun Dinner 3 3.45 4478
Tucker
Travis
2 21.01 3.50 Male No Sun Dinner 3 7.00 6011
Walters
Nathaniel
3 23.68 3.31 Male No Sun Dinner 2 11.84 4676
Harris
Tonya
4 24.59 3.61 Female No Sun Dinner 4 6.15 4832
Carter
... ... ... ... ... ... ... ... ... ...
Michael
239 29.03 5.92 Male No Sat Dinner 3 9.68 5296
Avila
Monica
240 27.18 2.00 Female Yes Sat Dinner 2 13.59 3506
Sanders
241 22.67 2.00 Male Yes Sat Dinner 2 11.34 Keith Wong 6011
Dennis
242 17.82 1.75 Male No Sat Dinner 2 8.91 4
Dixon
In [125]:
Out[125]:
In [126]:
Out[126]:
Tiffany
67 3.07 1.00 Female Yes Sat Dinner 1 3.07 43594
Brock
Leah
92 5.75 1.00 Female Yes Fri Dinner 2 2.88 3508
Ramirez
111 7.25 1.00 Female No Sat Dinner 1 7.25 Terri Jones 3559
Matthew
236 12.60 1.00 Male Yes Sat Dinner 2 6.30 35436
Myers
Christy
0 16.99 1.01 Female No Sun Dinner 2 8.49 3560
Cunningham
20. Random sample of the dataset by the number or percentage
In [128]:
Out[128]:
Payer
total_bill tip sex smoker day time size price_per_person C
Name
Out[129]:
Abigail
201 12.74 2.01 Female Yes Thur Lunch 2 6.37 358664
Parks
6.05 Glenda 57
188 18.15 3.50 Female Yes Sun Dinner 3
Wiggins
Jason
138 16.00 2.00 Male Yes Thur Lunch 2 8.00 356146
Burgess
Nicholas
218 7.74 1.44 Male Yes Sat Dinner 2 3.87 34051
Archer
Adam
69 15.01 2.09 Male Yes Sat Dinner 2 7.50 470092
Hall
James
154 19.77 2.00 Male No Sun Dinner 4 4.94 21316
Smith
Carly
205 16.47 3.23 Female Yes Thur Lunch 3 5.49 478
Reyes
Taylor
132 11.17 1.50 Female No Thur Lunch 2 5.58 601199
Gonzalez
Glenn
228 13.28 2.72 Male No Sat Dinner 2 6.64 50
Jones
Mario
126 8.52 1.48 Male No Thur Lunch 2 4.26 452440
Bradshaw
real world data will ofter be missing data for variety of reason
many machine learning models and statistical methods cant work with missing data in such case we
need to decide what to do with missing data
when reading missing values pandas will display them as NaN values
keep it
remove it
replace it
pros --> potential to save lot of data for use of training a model
Cons --> Hardest to do and somewhat arbitrary --> potentail to lead false conclusion
In [132]:
df=pd.read_csv("movie_scores.csv")
In [133]:
df.head()
Out[133]:
In [134]:
Out[134]:
df.notnull() # reverse of isnull will indicate by false wherever there is missing value
Out[135]:
In [ ]:
In [137]:
df[df["pre_movie_score"].notnull()]
Out[137]:
In [ ]:
#show me the rows where all the promovie score is not given
In [139]:
df[df["pre_movie_score"].isnull()]
Out[139]:
In [ ]:
#show me the rows where all the promovie score is not given but first name is given
In [140]:
Out[140]:
In [225]:
#help(df.dropna)
23. Dropping
In [146]:
# this will remove the row which contail even one missing value and will not consider oth
df.dropna()
Out[146]:
In [148]:
# to prevent from this we can use threshold==> it means only drop the row if it contain 1
df.dropna(thresh=1)
Out[148]:
In [ ]:
#help(df.fillna)
In [151]:
df.fillna("xjd")
Out[151]:
In [155]:
Out[155]:
0 8.0
1 7.0
2 7.0
3 6.0
4 7.0
Name: pre_movie_score, dtype: float64
there is one more method of filling the missing values it is interpolate when the category arrange in
linear order and there is missing values this function find interpolate value considering its linear order
df=pd.read_csv("mpg.csv")
df.head()
Out[157]:
chevr
0 18.0 8 307.0 130 3504 12.0 70 1 chev
ma
b
1 15.0 8 350.0 165 3693 11.5 70 1 sky
In [158]:
df["model_year"].value_counts()
Out[158]:
73 40
78 36
76 34
82 31
75 30
70 29
79 29
80 29
81 29
71 28
72 28
77 28
74 27
Name: model_year, dtype: int64
In [159]:
df.groupby(df["model_year"]).mean()
Out[159]:
model_year
model_year cylinders
model_year cylinders
often the data you need exist in two separate sources , fortunately pandas makes it easy to combine
these together
the simplest combination is if both sources already in the same format then concatenation through
pd.concat() call is that all needed
In [165]:
data_one={"A":["A0","A1","A2","A3"],"B":["B0","B1","B2","B3"]}
In [166]:
data_Two={"C":["C0","C1","C2","C3"],"D":["D0","D1","D2","D3"]}
In [168]:
df1=pd.DataFrame(data_one)
In [169]:
df2=pd.DataFrame(data_Two)
In [170]:
df1
Out[170]:
A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
In [171]:
df2
Out[171]:
C D
0 C0 D0
1 C1 D1
2 C2 D2
3 C3 D3
In [173]:
Out[173]:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
In [174]:
Out[174]:
A B C D
0 A0 B0 NaN NaN
1 A1 B1 NaN NaN
2 A2 B2 NaN NaN
3 A3 B3 NaN NaN
0 NaN NaN C0 D0
1 NaN NaN C1 D1
2 NaN NaN C2 D2
3 NaN NaN C3 D3
In [175]:
# we can know this not better approch to join the table
In [ ]:
# so to join the two table along the rows column name of both column must be matching
In [180]:
df2.columns=df1.columns
In [181]:
df2
Out[181]:
C D
0 C0 D0
1 C1 D1
2 C2 D2
3 C3 D3
In [182]:
df1
Out[182]:
C D
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
In [184]:
pd.concat([df1,df2],axis=0)
Out[184]:
C D
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
0 C0 D0
1 C1 D1
2 C2 D2
3 C3 D3
27. Combining Dataframe --> Merging
In [186]:
registrations = pd.DataFrame({'reg_id':[1,2,3,4],'name':['Andrew','Bobo','Claire','David'
logins = pd.DataFrame({'log_id':[1,2,3,4],'name':['Xavier','Andrew','Yolanda','Bobo']})
In [187]:
registrations
Out[187]:
reg_id name
0 1 Andrew
1 2 Bobo
2 3 Claire
3 4 David
In [188]:
logins
Out[188]:
log_id name
0 1 Xavier
1 2 Andrew
2 3 Yolanda
3 4 Bobo
In [189]:
In [190]:
pd.merge(registrations,logins,how="inner",on="name")
Out[190]:
0 1 Andrew 2
1 2 Bobo 4
left and right Merge
In [191]:
#left
pd.merge(registrations,logins,how="left",on="name")
Out[191]:
0 1 Andrew 2.0
1 2 Bobo 4.0
2 3 Claire NaN
3 4 David NaN
In [192]:
#Right
pd.merge(registrations,logins,how="right",on="name")
Out[192]:
0 NaN Xavier 1
1 1.0 Andrew 2
2 NaN Yolanda 3
3 2.0 Bobo 4
In [193]:
# outer merge
pd.merge(registrations,logins,how="outer",on="name")
Out[193]:
#split
email="[email protected]"
email.split("@")
Out[194]:
['milindgaur', 'gmail.com']
In [195]:
name="Milind"
name.isdigit()
Out[195]:
False
In [199]:
"7".isdigit()
Out[199]:
True
In [201]:
names=pd.Series(["Milind","Kanchan","Rohit",'Snehal'])
In [202]:
names
Out[202]:
0 Milind
1 Kanchan
2 Rohit
3 Snehal
dtype: object
In [203]:
names.str.upper()
Out[203]:
0 MILIND
1 KANCHAN
2 ROHIT
3 SNEHAL
dtype: object
In [204]:
names.str.capitalize()
Out[204]:
0 Milind
1 Kanchan
2 Rohit
3 Snehal
dtype: object
In [205]:
names.str.lower()
Out[205]:
0 milind
1 kanchan
2 rohit
3 snehal
dtype: object
In [215]:
In [216]:
messy_names
Out[216]:
0 SaChin
1 Kanchan
2 Rohit
3 SNehal
dtype: object
In [217]:
messy_names.str.strip().str.capitalize()
Out[217]:
0 Sachin
1 Kanchan
2 Rohit
3 Snehal
dtype: object