Python Data Analyses and Visualization

Applied to European city data

Ruthger Righart

[email protected]

Contents

  1. Introduction
  2. Jupyter notebook and loading packages
  3. Create a DataFrame manually [pd.DataFrame, head, tail]
  4. Selecting rows and columns [loc, ix]
  5. Data types [type, dtypes, astype]
  6. Basic properties [columns, index, values, shape, len]
  7. Load another DataFrame [pd.read_csv, value_counts]
  8. Append two DataFrames [append, pd.concat]
  9. Missing values [pd.isnull, count, dropna, describe]
  10. Sorting the DataFrame [sort_index, sort_values]
  11. Data selection by position [iloc]
  12. Conditional data selection [>,==,!=,&,^,|]
  13. Selection based on string [str.contains]
  14. Adding single rows [loc, pd.Series, append]
  15. Exploratory analyses using boxplots [describe, Matplotlib boxplot]
  16. Barplots [Matplotlib]
  17. Pie Charts [Matplotlib, Plotly]
  18. Scatterplots [Matplotlib]
  19. Replacing values [replace]
  20. Data categorization [pd.cut, pd.value_counts, Matplotlib histogram, groupby]
  21. Data conversions [loc, def]
  22. Merging data [pd.merge]
  23. More scatterplots [Matplotlib, Seaborn]
  24. Geographical mapping [np.where, basemap]
  25. Lineplots [sort_values, Matplotlib]
  26. Remove rows and columns [del, drop]
  27. Save DataFrame [to_csv]

Cities are interesting. Some people enjoy living in cities, others tend to avoid them and rather prefer the countryside. Whatever is your preference, one thing is true: Cities generate a huge amount of data that are good source for analyses.

The following is a brief tutorial to get quick insight into Python data analyses and visualization. It is meant to be an introduction for beginners1. Much of this is already documented on the web, and sometimes with more detail. So why another guide?

Many guides treat every topic separately and use new datasets everytime a new data problem is presented. There are howvever advantages of using the same dataset for learning a large variety of analyses. One advantage is time-related, in that all attention can be focused on coding without losing "switching" time understanding new datasets. More importantly, working on a single dataset one experiences the evolution from preprocessing till visualization.

The goal of this tutorial is not to solve an important data science problem. For this, anyways, the presented data is too small. The goal is to use an intuitive dataset to learn Python, intuitive in the sense that it contains data that everyone can understand without prior domain knowledge. It should therefore be suitable for someone who is a starter in Data Science.

We start this tutorial creating data, loading data, data munging and in the end several visualizations. Do not hesitate to contact me if you have any questions or suggestions: [email protected]

First start up IPython or Jupyter notebook. An installation guide can be found at the following page2.

The current tutorial treats a great variety of data types, various preprocessing steps (for ex.: selecting data, merging data, replacing data, finding missing values), data visualization (for ex.: boxplots, barplots, pie-charts, geographic mapping), from different packages (for ex. Matplotlib, Seaborn, Plotly).

In Python, several abbreviations are used, such as pd for Pandas, np for numpy etc. The abbreviations you can use for imported packages are arbitrary, but in this tutorial I used those that are used most commonly on platforms such as StackOverflow. The following packages are needed throughout the tutorial.

In [1]:
import os
import csv
import pandas as pd
import numpy as np
from datetime import datetime

The packages here below are needed for visualizations, such as Matplotlib, Seaborn, and Plotly:

In [2]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib import cm
import seaborn as sns
import plotly
import plotly.plotly as py
import plotly.graph_objs as go

The following packages will be used for geographic mapping:

In [3]:
from mpl_toolkits.basemap import Basemap as Basemap
from matplotlib.colors import rgb2hex
from matplotlib.patches import Polygon
from geopy.geocoders import Nominatim
import math

There is no problem of continuity when any of the visualizations or geographic mapping are skipped.

We are going to use data from European cities, consisting of categorical and numerical data. The data were obtained from Wikipedia. Temperatures are daily means in Celsius. Population is urban population. Altitude is the highest level reported.

We are going to create a DataFrame manually, later we will see how to load a DataFrame. After typing the following code, your DataFrame df1 is created.

In [4]:
data = {'Place': ['Amsterdam', 'Barcelona', 'Paris', 'Geneva', 'Munich', 'Athens', 'Vienna'],
        'Temp_Jan': [3.4, 11.8, 5.0, 1.5, 0.3, 9.9, 1.2],
        'Temp_Jul': [17.6, 25.7, 20.6, 20.2, 19.4, 28.5, 19.1],
        'Language': ['Dutch', 'Catalan and Spanish', 'French', 'French', 'German', 'Greek', 'German'],
        'Altitude': [-2, 12, np.nan, 375, 520, 338, 542],
        'MeasureDate':[np.nan, np.nan, '01-01-2013', '01-12-2015', '31-12-2015', np.nan, '01-01-2017'],
        'Population': [1351587,4740000,10601122,198072,1450381, 3090508, 2600000]}
df1 = pd.DataFrame(data, columns = ['Place', 'Temp_Jan', 'Temp_Jul', 'Language', 'Altitude','MeasureDate', 'Population'])

Inspecting the data will show us the following table. From now on, often the code .head() will be used. By default the first five rows are shown. Putting any number inbetween the parentheses will change the number of rows. The command .tail() does exactly the same but starting from the end.

In [5]:
df1.head(4)
Out[5]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
0 Amsterdam 3.4 17.6 Dutch -2.0 NaN 1351587
1 Barcelona 11.8 25.7 Catalan and Spanish 12.0 NaN 4740000
2 Paris 5.0 20.6 French NaN 01-01-2013 10601122
3 Geneva 1.5 20.2 French 375.0 01-12-2015 198072

To select only one column or variable, use the following:

In [6]:
df1['Temp_Jan'].head(3)
Out[6]:
0     3.4
1    11.8
2     5.0
Name: Temp_Jan, dtype: float64

Another notation that is used and that gives the same result:

In [7]:
df1.Temp_Jan.head(3)
Out[7]:
0     3.4
1    11.8
2     5.0
Name: Temp_Jan, dtype: float64

If you desire selecting multiple columns, do the following:

In [8]:
df1.loc[:,['Temp_Jan','Temp_Jul']].head(3)
Out[8]:
Temp_Jan Temp_Jul
0 3.4 17.6
1 11.8 25.7
2 5.0 20.6

To select a row, for example the first row, do the following. Python starts indexing from 0 (0=first row, 1=second row, etc.):

In [9]:
df1.ix[0]
Out[9]:
Place          Amsterdam
Temp_Jan             3.4
Temp_Jul            17.6
Language           Dutch
Altitude              -2
MeasureDate          NaN
Population       1351587
Name: 0, dtype: object

To select the first column:

In [10]:
df1[[0]].head(3)
Out[10]:
Place
0 Amsterdam
1 Barcelona
2 Paris

To select multiple rows:

In [11]:
df1[0:2]
Out[11]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
0 Amsterdam 3.4 17.6 Dutch -2.0 NaN 1351587
1 Barcelona 11.8 25.7 Catalan and Spanish 12.0 NaN 4740000

Or use the following code:

In [12]:
df1.ix[[0,1]]
Out[12]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
0 Amsterdam 3.4 17.6 Dutch -2.0 NaN 1351587
1 Barcelona 11.8 25.7 Catalan and Spanish 12.0 NaN 4740000

To select multiple columns:

In [13]:
df1[[0,1]].head(3)
Out[13]:
Place Temp_Jan
0 Amsterdam 3.4
1 Barcelona 11.8
2 Paris 5.0

To check the data type of df1.

In [14]:
type(df1)
Out[14]:
pandas.core.frame.DataFrame

Or to check the variable types:

In [15]:
df1.dtypes
Out[15]:
Place           object
Temp_Jan       float64
Temp_Jul       float64
Language        object
Altitude       float64
MeasureDate     object
Population       int64
dtype: object

Convert the variable Language into a categorical variable.

In [16]:
df1['Language'] = df1['Language'].astype('category')

After this we need to verify if we have gotten the right type.

In [17]:
df1['Language'].dtypes
Out[17]:
category

Now we change the column MeasureDate to a date variable

In [18]:
df1['MeasureDate'].dtypes
Out[18]:
dtype('O')
In [19]:
df1['MeasureDate']=pd.to_datetime(df1['MeasureDate'])
In [20]:
df1['MeasureDate'].dtypes
Out[20]:
dtype('<M8[ns]')
In [21]:
df1.dtypes
Out[21]:
Place                  object
Temp_Jan              float64
Temp_Jul              float64
Language             category
Altitude              float64
MeasureDate    datetime64[ns]
Population              int64
dtype: object

The DataFrame can be dissected in its column- and rownames, and the values inside:

In [22]:
df1.columns
Out[22]:
Index([u'Place', u'Temp_Jan', u'Temp_Jul', u'Language', u'Altitude',
       u'MeasureDate', u'Population'],
      dtype='object')
In [23]:
df1.index
Out[23]:
RangeIndex(start=0, stop=7, step=1)
In [24]:
df1.values
Out[24]:
array([['Amsterdam', 3.4, 17.6, 'Dutch', -2.0, NaT, 1351587],
       ['Barcelona', 11.8, 25.7, 'Catalan and Spanish', 12.0, NaT, 4740000],
       ['Paris', 5.0, 20.6, 'French', nan,
        Timestamp('2013-01-01 00:00:00'), 10601122],
       ['Geneva', 1.5, 20.2, 'French', 375.0,
        Timestamp('2015-01-12 00:00:00'), 198072],
       ['Munich', 0.3, 19.4, 'German', 520.0,
        Timestamp('2015-12-31 00:00:00'), 1450381],
       ['Athens', 9.9, 28.5, 'Greek', 338.0, NaT, 3090508],
       ['Vienna', 1.2, 19.1, 'German', 542.0,
        Timestamp('2017-01-01 00:00:00'), 2600000]], dtype=object)

And if we wanted to select the first column name, the following can be done:

In [25]:
df1.columns[0]
Out[25]:
'Place'

The size of the DataFrame can be appreciated by the shape command:

In [26]:
df1.shape
Out[26]:
(7, 7)

And the length of the DataFrame, or in other words, the number of rows, by the len command:

In [27]:
len(df1)
Out[27]:
7

We created a fairly small dataset. Most of the times it is possible to load data from a file. In the following section we will see how to load data using pd.read_csv. The file is from my GitHub page. CSV means that a file containing comma separated values is expected. An important remark: When loading a file, one needs to carefully check if the decimals are dots or commas (this can differ between countries and systems).3

In [28]:
url = 'https://raw.githubusercontent.com/RRighart/City/master/df2.csv'
df2 = pd.read_csv(url, parse_dates=True, delimiter=",", decimal=",")

So this should give the following data of cities:

In [29]:
df2.head(3)
Out[29]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
0 Stockholm -1.6 18.8 Swedish 0.0 NaT 1515017
1 Oslo -2.9 17.7 Norwegian 23.0 2016-01-01 942084
2 Helsinki -3.9 17.8 Finnish NaN 2016-03-31 1231595

The columns Language and MeasureDate are not the right class. So it would be best to change this directly.

In [30]:
df2.dtypes
Out[30]:
Place           object
Temp_Jan       float64
Temp_Jul       float64
Language        object
Altitude       float64
MeasureDate     object
Population       int64
dtype: object

Language should be of type "category".

In [31]:
df2['Language'] = df2['Language'].astype('category')

MeasureDate should be of type "datetime".

In [32]:
df2['MeasureDate']=pd.to_datetime(df2['MeasureDate'])
In [33]:
df2.dtypes
Out[33]:
Place                  object
Temp_Jan              float64
Temp_Jul              float64
Language             category
Altitude              float64
MeasureDate    datetime64[ns]
Population              int64
dtype: object

For categorical variables, it is essential to know the frequency of each category. Note that a slightly different notation, namely df2['Language'].value_counts(), works here as well.

In [34]:
df2.Language.value_counts()
Out[34]:
French        3
English       3
Swedish       2
Turkish       1
Spanish       1
Portuguese    1
Norwegian     1
Italian       1
Finnish       1
Dutch         1
Croatian      1
Name: Language, dtype: int64

If loading the data did not work, another indirect way is loading the data manually from my GitHub page: https://github.com/RRighart/City . Then check that the data are saved in your current directory. Use the following to check the path:

In [35]:
os.getcwd()
Out[35]:
'/yourdirectory'

Using the command os.listdir(os.curdir) will produce a list of files that is in your directory. Now it is possible to verify if your copied .csv file is really in the right folder. To change the directory you can use the command os.chdir('path').

It is often the case that data are acquired from disparate sources, like we have here. So now the trick is to bind the two sets together.

In [36]:
df = df1.append(df2, ignore_index=True)

Displaying the whole DataFrame df now we can see that there are 23 rows.

In [37]:
df
Out[37]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
0 Amsterdam 3.4 17.6 Dutch -2.0 NaT 1351587
1 Barcelona 11.8 25.7 Catalan and Spanish 12.0 NaT 4740000
2 Paris 5.0 20.6 French NaN 2013-01-01 10601122
3 Geneva 1.5 20.2 French 375.0 2015-01-12 198072
4 Munich 0.3 19.4 German 520.0 2015-12-31 1450381
5 Athens 9.9 28.5 Greek 338.0 NaT 3090508
6 Vienna 1.2 19.1 German 542.0 2017-01-01 2600000
7 Stockholm -1.6 18.8 Swedish 0.0 NaT 1515017
8 Oslo -2.9 17.7 Norwegian 23.0 2016-01-01 942084
9 Helsinki -3.9 17.8 Finnish NaN 2016-03-31 1231595
10 Lulea -5.1 20.7 Swedish 6.0 2014-12-31 75966
11 Edinburgh 7.0 19.1 English 47.0 NaT 1339380
12 Birmingham 6.7 21.3 English 140.0 NaT 2440986
13 Cork 5.6 15.3 English NaN NaT 208669
14 Antwerp 3.4 18.5 Dutch NaN 2016-01-01 517042
15 Lille 3.6 18.6 French NaN NaT 1015744
16 Bordeaux 6.6 21.4 French NaN NaT 851071
17 Malaga 12.1 25.5 Spanish 11.0 NaT 569130
18 Porto 9.5 20.6 Portuguese 0.0 NaT 1774000
19 Venice 3.3 23.0 Italian 1.0 NaT 264579
20 Ajaccio 9.0 22.9 French 38.0 NaT 68587
21 Rijeka 5.8 23.1 Croatian 0.0 NaT 213666
22 Istanbul 5.7 22.9 Turkish 39.0 2016-12-31 14657434

Another way this can be done is by the pd.concat function4:

In [38]:
pd.concat([df1, df2], ignore_index=True).head(4)
Out[38]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
0 Amsterdam 3.4 17.6 Dutch -2.0 NaT 1351587
1 Barcelona 11.8 25.7 Catalan and Spanish 12.0 NaT 4740000
2 Paris 5.0 20.6 French NaN 2013-01-01 10601122
3 Geneva 1.5 20.2 French 375.0 2015-01-12 198072

One of the first data inspections is if and where there are missing values (data that are not available for various reasons). As we can see, there are missing values in the data; missing values are labeled differently in Python for numeric data (NaN) and dates (NaT)5. First let's see how to detect where missing values occur.

The following command will indicate those cells with "True" where there is a missing value.

In [39]:
pd.isnull(df).head(3)
Out[39]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
0 False False False False False True False
1 False False False False False True False
2 False False False False True False False

The cells that mention "True" are missing values. They only occur for Altitude and MeasureDate. It is also possible to inspect specific variables, just to avoid the output of the whole datasheet.

In [40]:
pd.isnull(df['Altitude']).head(5)
Out[40]:
0    False
1    False
2     True
3    False
4    False
Name: Altitude, dtype: bool

There is a simple way to check the number of missing values per variable, using the is.null function.

In [41]:
df.isnull().sum()
Out[41]:
Place           0
Temp_Jan        0
Temp_Jul        0
Language        0
Altitude        6
MeasureDate    14
Population      0
dtype: int64

The inverse is also possible, that is counting the number of nonmissing values.

In [42]:
df.count()
Out[42]:
Place          23
Temp_Jan       23
Temp_Jul       23
Language       23
Altitude       17
MeasureDate     9
Population     23
dtype: int64

Statistics could be computed while ignoring the missing values. Another possibility is to drop the row(s) that contains any missing value, using df.dropna(how='any'). If you put how='all' it will only drop the row(s) if all values are missing. Note that only writing it to df will change the DataFrame. So if df = df.dropna(how='any') were used, the DataFrame df would have changed.

In [43]:
df.dropna(how='any')
Out[43]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
3 Geneva 1.5 20.2 French 375.0 2015-01-12 198072
4 Munich 0.3 19.4 German 520.0 2015-12-31 1450381
6 Vienna 1.2 19.1 German 542.0 2017-01-01 2600000
8 Oslo -2.9 17.7 Norwegian 23.0 2016-01-01 942084
10 Lulea -5.1 20.7 Swedish 6.0 2014-12-31 75966
22 Istanbul 5.7 22.9 Turkish 39.0 2016-12-31 14657434

Missing values are automatically omitted from summary statistics, such as describe().

In [44]:
df['Altitude'].describe()
Out[44]:
count     17.000000
mean     122.941176
std      191.653160
min       -2.000000
25%        1.000000
50%       23.000000
75%      140.000000
max      542.000000
Name: Altitude, dtype: float64

It is possible to replace values with NaNs that fulfill a certain condition, for example if Altitude is larger than 500 m. It is adviced to refrain from using chain indices, such as df['Altitude'][df['Altitude']>350]=np.nan6

In [45]:
df.loc[df['Altitude']>350, 'Altitude'] = np.nan 
In [46]:
df['Altitude']
Out[46]:
0      -2.0
1      12.0
2       NaN
3       NaN
4       NaN
5     338.0
6       NaN
7       0.0
8      23.0
9       NaN
10      6.0
11     47.0
12    140.0
13      NaN
14      NaN
15      NaN
16      NaN
17     11.0
18      0.0
19      1.0
20     38.0
21      0.0
22     39.0
Name: Altitude, dtype: float64

The columns of a DataFrame can be alphabetically sorted on the column name. For this we use sort_index with axis=1, and ascending=True.

In [47]:
df.sort_index(axis=1, ascending=True).head(3)
Out[47]:
Altitude Language MeasureDate Place Population Temp_Jan Temp_Jul
0 -2.0 Dutch NaT Amsterdam 1351587 3.4 17.6
1 12.0 Catalan and Spanish NaT Barcelona 4740000 11.8 25.7
2 NaN French 2013-01-01 Paris 10601122 5.0 20.6

It is also possible to sort the rows based on a certain column, using sort_values. If the column is a string, it is sorted alphabetically. If the column is numeric, it is ranked according to value. For example, in the following the DataFrame is sorted alphabetically on Place, and the result is written to df.

In [48]:
df = df.sort_values(by='Place')

Displaying the first ten rows we can confirm that the sorting worked.

In [49]:
df.head(5)
Out[49]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
20 Ajaccio 9.0 22.9 French 38.0 NaT 68587
0 Amsterdam 3.4 17.6 Dutch -2.0 NaT 1351587
14 Antwerp 3.4 18.5 Dutch NaN 2016-01-01 517042
5 Athens 9.9 28.5 Greek 338.0 NaT 3090508
1 Barcelona 11.8 25.7 Catalan and Spanish 12.0 NaT 4740000

It is sometimes important to inspect specific cells. For example, if you need to look up the data that is in the first row, first column, you could do the following:

In [50]:
df.iloc[0,0]
Out[50]:
'Ajaccio'

If you want to have multiple rows and columns, the following can be done:

In [51]:
df.iloc[:3,:2]
Out[51]:
Place Temp_Jan
20 Ajaccio 9.0
0 Amsterdam 3.4
14 Antwerp 3.4

Select rows within a given range, for example first till fourth column, and second till third row.

In [52]:
df.iloc[1:3,0:4]
Out[52]:
Place Temp_Jan Temp_Jul Language
0 Amsterdam 3.4 17.6 Dutch
14 Antwerp 3.4 18.5 Dutch

For selecting certain columns, but switching the position of the second and third

In [53]:
df.iloc[:,[1,3,2]].head(5)
Out[53]:
Temp_Jan Language Temp_Jul
20 9.0 French 22.9
0 3.4 Dutch 17.6
14 3.4 Dutch 18.5
5 9.9 Greek 28.5
1 11.8 Catalan and Spanish 25.7

There are circumstances in which you'd only want to select cases that fulfill a certain condition. The following only selects data if the average temperature in January is higher than 9 degrees Celsius:

In [54]:
df[df.Temp_Jan > 9]
Out[54]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
5 Athens 9.9 28.5 Greek 338.0 NaT 3090508
1 Barcelona 11.8 25.7 Catalan and Spanish 12.0 NaT 4740000
17 Malaga 12.1 25.5 Spanish 11.0 NaT 569130
18 Porto 9.5 20.6 Portuguese 0.0 NaT 1774000

Select cases where the average July temperature was equal to 28.5.

In [55]:
df[df.Temp_Jul == 28.5]
Out[55]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
5 Athens 9.9 28.5 Greek 338.0 NaT 3090508

Select cases if July temperature was unequal to 28.5.

In [56]:
df[df.Temp_Jul != 28.5].head(3)
Out[56]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
20 Ajaccio 9.0 22.9 French 38.0 NaT 68587
0 Amsterdam 3.4 17.6 Dutch -2.0 NaT 1351587
14 Antwerp 3.4 18.5 Dutch NaN 2016-01-01 517042

It is also possible to select data that fulfill multiple conditions, for example cities where the January temperature is larger than 5 and the July temperature is smaller than 17 degrees Celsius.

In [57]:
df[(df.Temp_Jan > 5) & (df.Temp_Jul < 17)]
Out[57]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
13 Cork 5.6 15.3 English NaN NaT 208669

The following selects the data if either January temperature is larger than 5 or July temperature is smaller than 17. Note that Cork is not selected in this case, since it fulfills both conditions.

In [58]:
df[(df.Temp_Jan > 5) ^ (df.Temp_Jul < 17)]
Out[58]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
20 Ajaccio 9.0 22.9 French 38.0 NaT 68587
5 Athens 9.9 28.5 Greek 338.0 NaT 3090508
1 Barcelona 11.8 25.7 Catalan and Spanish 12.0 NaT 4740000
12 Birmingham 6.7 21.3 English 140.0 NaT 2440986
16 Bordeaux 6.6 21.4 French NaN NaT 851071
11 Edinburgh 7.0 19.1 English 47.0 NaT 1339380
22 Istanbul 5.7 22.9 Turkish 39.0 2016-12-31 14657434
17 Malaga 12.1 25.5 Spanish 11.0 NaT 569130
18 Porto 9.5 20.6 Portuguese 0.0 NaT 1774000
21 Rijeka 5.8 23.1 Croatian 0.0 NaT 213666

The following includes the Cork case. It select the data if January temperature is larger than 5 and/or July temperature is smaller than 17. That, means it also includes those cases that fulfill both cases.

In [59]:
df[(df.Temp_Jan > 5) | (df.Temp_Jul < 17)]
Out[59]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
20 Ajaccio 9.0 22.9 French 38.0 NaT 68587
5 Athens 9.9 28.5 Greek 338.0 NaT 3090508
1 Barcelona 11.8 25.7 Catalan and Spanish 12.0 NaT 4740000
12 Birmingham 6.7 21.3 English 140.0 NaT 2440986
16 Bordeaux 6.6 21.4 French NaN NaT 851071
13 Cork 5.6 15.3 English NaN NaT 208669
11 Edinburgh 7.0 19.1 English 47.0 NaT 1339380
22 Istanbul 5.7 22.9 Turkish 39.0 2016-12-31 14657434
17 Malaga 12.1 25.5 Spanish 11.0 NaT 569130
18 Porto 9.5 20.6 Portuguese 0.0 NaT 1774000
21 Rijeka 5.8 23.1 Croatian 0.0 NaT 213666

Imagine that you have a very large dataset, and you wanted to check data for the city Paris. Of course by alphabetically arranging the data (as we have done) it is not that difficult to find back the data. But in some cases you would need a search function. To select a row based on a value in a categorical variable. So if we want to find back the "Paris" data, do the following:

In [60]:
df[df['Place'].str.contains('Paris')]
Out[60]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
2 Paris 5.0 20.6 French NaN 2013-01-01 10601122

And it also work if you are only using part of a string, for example "Stock"

In [61]:
df[df['Place'].str.contains('Stock')]
Out[61]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
7 Stockholm -1.6 18.8 Swedish 0.0 NaT 1515017

Another example, searching for "Spanish" gives the following hits:

In [62]:
df[df['Language'].str.contains('Spanish')]
Out[62]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
1 Barcelona 11.8 25.7 Catalan and Spanish 12.0 NaT 4740000
17 Malaga 12.1 25.5 Spanish 11.0 NaT 569130

Data can be added manually as well. Adding a row can be done in at least two ways:

First, we use df.loc to add a row at the right index, using len(df)+1, which takes the lenght of the DataFrame plus one. So this always places the new row at the end.

In [63]:
df.loc[len(df)+1,:]=['Rome', 7.5, 24.1, 'Italian', 21, np.nan, 4353775]

As we will see the new entry "Rome" is added in the last row.

In [64]:
df.tail(3)
Out[64]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
19 Venice 3.3 23.0 Italian 1.0 NaT 264579.0
6 Vienna 1.2 19.1 German NaN 2017-01-01 2600000.0
24 Rome 7.5 24.1 Italian 21.0 NaT 4353775.0

A second way to do this is by using append. We first make newrow:

In [65]:
newrow = pd.Series(['Madrid', 6.3, 25.6, 'Spanish', 667, np.nan, 6240000], index= ['Place', 'Temp_Jan', 'Temp_Jul', 'Language', 'Altitude', 'MeasureDate', 'Population'])
In [66]:
type(newrow)
Out[66]:
pandas.core.series.Series
In [67]:
newrow
Out[67]:
Place           Madrid
Temp_Jan           6.3
Temp_Jul          25.6
Language       Spanish
Altitude           667
MeasureDate        NaN
Population     6240000
dtype: object

If we have created the row, we use append.

In [68]:
df = df.append(newrow, ignore_index=True)
In [69]:
df.tail(3)
Out[69]:
Place Temp_Jan Temp_Jul Language Altitude MeasureDate Population
22 Vienna 1.2 19.1 German NaN 2017-01-01 00:00:00 2600000.0
23 Rome 7.5 24.1 Italian 21.0 NaT 4353775.0
24 Madrid 6.3 25.6 Spanish 667.0 NaN 6240000.0

To have a summary of a variable, use the describe() command. Note that median is the 50% indicator. It is also possible to use for ex. min(), mean(), median() and other stats separately. From these statistics it is directly clear that in July temperature is higher than January (not unexpected for European countries).

In [70]:
df['Temp_Jan'].describe()
Out[70]:
count    25.000000
mean      4.468000
std       4.630759
min      -5.100000
25%       1.500000
50%       5.600000
75%       7.000000
max      12.100000
Name: Temp_Jan, dtype: float64
In [71]:
df['Temp_Jul'].describe()
Out[71]:
count    25.000000
mean     21.120000
std       3.147353
min      15.300000
25%      18.800000
50%      20.600000
75%      23.000000
max      28.500000
Name: Temp_Jul, dtype: float64

We are going to create boxplots using Matplotlib. Before running any code, we make sure that the plot layout is reset to the origin. As you will see later, we are going to use Seaborn, and this may change the layout of other plots7

In [72]:
sns.reset_orig()

Boxplots can be used to check for outliers and distribution of the variables. To position the two boxplots, fig.add_subplot(a,b,c) is used, where a is number of rows, b is the number of columns, and c is the position that increments rowwise.

In [73]:
fig = plt.figure()

fig.add_subplot(1,2,1)
df[['Temp_Jan']].boxplot(sym='.')
fig.add_subplot(1,2,2)
df[['Temp_Jul']].boxplot(sym='.')
plt.show()

Not bad these plots. Matplotlib gives a quick insight on data distribution and possible outliers. But the range of the axes are different for the two plots, and unnecessary space is wasted for two y-axes. So the question arises if we could change this.

First, we are going to select the two columns and make a new temporary DataFrame called temp.

In [74]:
temp = df.iloc[:,[1,2]]

Second, we'd like to adapt our columnnames slightly, and the following code does that:

In [75]:
temp.columns = ['January','July']

The following code puts the two boxplots in one figure. We could add a line plt.ylim to adjust the range of the y-axis.

In [76]:
temp.boxplot(sym='.')
plt.ylim(-10, 30)
plt.show()