Topic 1
Data Analytics from Programmer’s Perspective
Data science has become very popular lately because everyone is so envious of FANG (Facebook, Amazon,
Netflix, Google) and BAT (Baidu, Alibaba, Tencent) being among the richest companies in the world because
of their powerful big data processing abilities.
We will investigate the true nature of data science, in particular the collection, cleaning,
visualisation, analysis and reporting of data. We will also introduce various data analysis software,
with a focus on Python data analysis tools.
In this subject, the terms data science, data processing, data analysis, business intelligence, etc. are
used to mean the same thing. According to Chekanov [2016, Chapter 12] and https://www.stoltzmaniac.com,
data science consists of the following components:
1. Data "collection" and extraction (Topic 1)
2. Data exploration and cleaning (Topic 2)
   + Understanding your data
   + Looking for red flags (warnings of danger)
   + Identifying things outside of the "normal" range
   + Deciding what to do with NaN or missing values
   + Discovering data with the wrong data type
   + Utilising the pandas library and pyjanitor to transform the data into tidy format
   + Descriptive statistics
3. Data visualisation (Topic 3)
4. Data organisation and query (Topic 4)
   + Determining whether or not you actually need a database
   + Choosing the right database: deciding between relational and NoSQL
   + Schema design & normalisation (UECS1203/UECS1403 Database System Fundamentals)
   + Using an ORM (SQLAlchemy) to insert data
   + Data mining (UECM3213/UECM3453 Data Mining)
5. Data analysis and predictive modelling (Topic 5)
   + Building a data pipeline with Python luigi
   + Error monitoring
   + Statistical learning or Machine Learning (UECM3993 Predictive Modelling)
6. Data interpretation, presentation and reporting (Topic 6)
In reality, simple data requires only some of the components, while a large collection of data may
involve all components.
The most popular software for simple data processing is Excel. However, each new version of Excel
introduces incompatibilities which can waste the user's time tuning the Excel file; a file cannot even be
saved across incompatible versions of Excel, which is what happens if one tries to save a 2010 Excel file in
Excel 2007. For other problems of Excel, see http://www.eusprig.org/horror-stories.htm.
Since more and more of us are trained with various programming skills in secondary schools or
universities, programming languages are becoming an important alternative to spreadsheet software
in data processing, especially large-scale data processing. Python and R are currently two of the most
popular programming languages which can be used freely and legally.
The main references are Kimball and Caserta [2004], Yau [2011], and Few [2006]. The main Python
reference is McKinney [2013].
A supplementary reference is Grus [2015]. Advanced techniques for data processing can be found in
books on data mining such as Witten et al. [2011] (using Weka) and Aggarwal [2015] (a theoretical textbook),
and in books on machine learning such as Coelho and Richert [2015], etc.
Course Outcomes:
CO1. Apply the data analytics concepts in business scenarios (PO5, C3)
CO2. Apply the Extract-Transform-Load (ETL) process (PO1, C3)
CO3. Assess descriptive analytics for business intelligence (PO2, C6)
CO4. Construct predictive models for different business applications (PO2, C5)
CO5. Develop dashboards for data visualization (PO3, P3)
Assessments:
+ Excel (Dr Chang Yun Fah): 20% Test + 30% Assignment/Presentation
+ Python:
  - 25% Test 2
    + Q1: 15% (CO3)
    + Q2: 10% (CO5)
  - Assignment/Presentation
    + Part 1: 10% (CO5)
    + Part 2: 10% (CO4)
    + Part 3: 5% (CO5)
§1.1 Business, Social and Scientific Data
Data come from various domains. A classification of data according to the application domains (e.g.
business domain, social science domain or scientific domain) is given below.
+ Agriculture: plant nutrients and classification, forest data, etc. (scientific)
+ Bioinformatics: important in finding out rare diseases (using DNA patterns) (scientific)
+ Climate + Weather: data in relation to global warming, etc., e.g. https://en.tutiempo.net/climate (scientific)
+ Computer Networks: e.g. http://www.caida.org/data/overview/ (scientific)
+ Earth Science: water resources, oceanographic data, earthquake data, etc. (scientific)
+ Economics: e.g. https://ourworldindata.org/ covers international trade, human resources, tax,
  corruption, etc. (business)
+ Education: US Scorecard data, PISA test scores (http://www.oecd.org/pisa/), etc. (social)
+ Energy: e.g. http://datasets.wri.org/dataset/globalpowerplantdatabase (business)
+ Finance: stock and derivatives data (business)
+ Geographical Information System (GIS): Waze, Google Maps, OpenStreetMap, etc. (business)
+ Government: e.g. Department of Statistics, Malaysia https://www.dosm.gov.my/v1/ (business)
+ Health care: food nutrients, https://www.gapminder.org/data/, etc. (social)
+ Image Processing: http://www.image-net.org/, animal images https://cvml.ist.ac.at/AwA2/,
  facial data http://www.face-rec.org/databases/, X-ray images http://dmery.ing.puc.cl/index.php/material/gdxray/, etc. (social)
+ Machine Learning: album or music related data such as https://www.imdb.com/interfaces/,
  https://github.com/mdeff/fma (social)
+ Museums (scientific)
+ Natural Language (social)
+ Neuro-science: MRI data https://openfmri.org/ (scientific)
+ Physics: particle physics data http://opendata.cern.ch/, cosmos observation https://icecube.wisc.edu/science/data,
  crystal structures http://www.crystallography.net/cod/, planetary observations such as
  https://exoplanetarchive.ipac.caltech.edu/ and https://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html,
  Sloan Digital Sky Survey https://www.sdss.org/, etc. (scientific)
+ Social Networks: https://www.gharchive.org/, http://snap.stanford.edu/data/higgs-twitter.html,
  http://help.sentiment140.com/for-students/, http://netsg.cs.sfu.ca/youtubedata/,
  https://webscope.sandbox.yahoo.com/catalog.php?datatype=g, etc. (social)
+ Social Sciences: https://github.com/emorisse/FBI-Hate-Crime-Statistics/tree/master/2013,
  http://www.europeansocialsurvey.org/data/, etc. (social)
+ Sports: https://www.jokecamp.com/blog/guide-to-football-and-soccer-data-and-apis/ (social)
+ Time Series: https://datamarket.com/data/list/?q=provider:tsdl, heart rate
  http://ecg.mit.edu/time-series/ (scientific)
+ Transportation: airline data, etc. (social)
Many data sets are very large, ranging from tens of megabytes to a few hundred terabytes. Unless the
data is more than a few terabytes, it is possible for us to process it using classical data analytic
programming tools instead of the big data analytic tools.
§1.2 Data Sources
There are many data sources, such as personal data, public data, classified data and business data.
According to the European Union (https://ec.europa.eu/info/law/law-topic/data-protection/reform/
what-personal-data_en), personal data is any information that relates to an identified or identifiable
living individual. Different pieces of information, which collected together can lead to the identification
of a particular person, also constitute personal data.
Personal data that has been de-identified, encrypted or pseudonymised but can be used to re-identify
a person remains personal data and falls within the scope of the GDPR.
Public data is usually provided by governments, non-profit organisations and non-governmental
organisations for the benefit of public interests. Examples of public data are listed below.
+ U.S. Government's open data: https://www.data.gov/;
+ U.K. Government's open data: https://data.gov.uk/;
+ Malaysia's Open Data Portal: http://www.data.gov.my/;
+ Google Trends: https://trends.google.com/trends/explore;
+ Wikipedia: https://www.wikipedia.org/;
+ CIA Factbook: https://www.cia.gov/library/publications/the-world-factbook/;
  - https://iancoleman.io/exploring-the-cia-world-factbook/
  - https://codingdisciple.com/cia-factbook-sql.html
  - https://github.com/MikeAnthony6/factbook
+ National Centers for Environmental Information: https://www.ncdc.noaa.gov/data-access/quick-links;
+ Earth Science Data Systems (ESDS) Program: https://earthdata.nasa.gov/;
+ The United Nations Children's Fund (UNICEF): https://data.unicef.org/.
Classified data are those data which the government or companies feel that disclosing can jeopardise
national, public or company interests. Examples are public health data, military data, etc.
Business data are data that belong to business entities. These are the data sources that a data
"scientist" or analyst needs to deal with.
Data collection refers to the gathering of information about objects. A business entity has to
collect customer data in order to provide services to customers. The data can be collected through the
web (e.g. online shops), customer service counters, etc. Apart from customer data, a business entity
also needs to store the data of its employees, business operations, etc. For a social network services
provider, the data of interest are the social connectivity of users, the products of interest to users (based
on tags), etc. [Russel, 2014] For a scientific institute, the data are normally collected from scientific
observations such as telescopes or measurements from lab apparatus.
§1.3 Computer Data Structures
In the past, there was no difference between data science and statistics. However, with the invention
of digital computers, data science has become a subject which uses computer software to store data and
extract useful information from data.
All computer software is written in some kind of programming language. We will explore how
programming languages represent data. As we will see in this section, programming languages support
basic data structures, derived data structures and user defined data structures. Note that some languages
prefer the term data types instead of data structures. We will treat them as the same thing.
The basic data structures are Boolean, character (or string), integer and floating point number. They
are closely related to the underlying computer architecture (UECS1013 Introduction to Computer Or-
ganisation and Architecture and UEEA2283 Computer Organisation and Architecture).
The basic data structures are limited, and high level programming languages were developed to
support more complex data structures. Around the 1960s, we had the Fortran language supporting data
structures for scientific applications, Lisp supporting data structures used in symbolic and artificial
intelligence applications, and COBOL supporting data structures for business applications. There were
other programming languages such as ALGOL 68, but they were not popular.
Entering the 21st century, there are many more high level programming languages such as Python,
R, C++, Go, Java, Scala, Kotlin, C#, etc. Out of all these high level programming languages, Python
may be the "easiest to learn" language. This may be one of the reasons for its popularity.
Contrary to what people try to portray, "data science" had always been important even before the in-
vention of the "digital computer". Before the 1900s, data processing (in the form of statistics) was applied to
business accounting (https://en.wikipedia.org/wiki/Accounting), insurance (https://en.wikipedia.org/wiki/Insurance) and financial data analysis.
In the 1960s, the rise of IBM was also due to the need for fast business data processing. The program-
ming language COBOL was developed by IBM and used in so many financial and business institutes
that, until today, there are still a lot of COBOL programs that require minor tuning and maintenance.
Since 2000, the growth of the Internet and social media has led to the development of "Big Data" industries.
"Big data" needs very specialised computer networks to stream, process and respond to user input. One
could probably learn about "big data" from UECS3223/UECS3473 Cloud Computing (one is the older
code while the other is the newer code, with the requirement that students need to get at least 40
in the final exam to pass).
In this section, we will learn the data structures used in the "old" programming language COBOL,
the "newer" programming languages such as Java, Python and R, as well as SQL, and a little bit about
proprietary systems.
§1.3.1 COBOL Data Structures
In the 1960s, writing a program was complex because one needed to use 80-column punch
cards to "write" the program, and then one "placed" a stack of punch cards into the computer card reader
to load the "program". Therefore, we have the following rules which seem strange in modern days:
+ The 7th column in each line can be used to identify a line as a comment. If it is an asterisk symbol
  *, the rest of the line is ignored.
+ Columns 8 through 11 in each line are referred to as the A margin.
+ Columns 12 through 72 in each line are referred to as the B margin.
+ Columns 73 to 80 are not used.
+ Every statement must end with a full stop ".".
+ A data name should use only letters, digits 0 to 9 and hyphens, and should not be more than 30
  characters long.
The data structures are declared using the Picture clause, i.e. a statement with the PIC keyword and
suitable "characters" (unused positions are set to spaces or zeros). The following are the characters
that can be used in Picture clauses according to Murach et al. [2005]:
Item type       Characters  Meaning                  Examples
Alphanumeric    X           Any character            X, XX, X(3)
Numeric         9           Digit                    99
                S           Sign                     S999
                V           Assumed decimal point    S9(5)V99
Numeric edited  9           Digit                    99
                Z           Zero-suppressed digit    ZZ9
                ,           Inserted comma           Z,ZZ9
                .           Inserted decimal point   ZZZ,ZZZ.99
                -           Minus sign if negative   -ZZ,ZZ9
The concept behind the COBOL program format is based on the structure of a document outline, with
a single top-level heading followed by subordinate levels. The levels of COBOL's hierarchy are PROGRAM,
DIVISION (ENVIRONMENT DIVISION, DATA DIVISION, PROCEDURE DIVISION), SECTION, PARAGRAPH, SENTENCE,
STATEMENT and CHARACTER. A SECTION contains zero or many PARAGRAPHs; a PARAGRAPH contains one or
more STATEMENTs. A STATEMENT is a line of execution which is made up of CHARACTERs. Two sample
COBOL programs are given below to illustrate the hierarchy.
Example 1.3.1. Read the following COBOL program and explain what it does.

IDENTIFICATION DIVISION.
PROGRAM-ID. Listing4-1.
AUTHOR. Michael Coughlan.

DATA DIVISION.
WORKING-STORAGE SECTION.
01 UserName        PIC X(20).

*> Receiving data item for DATE system variable: Format is YYMMDD
01 CurDate.
   02 CurYear      PIC 99.
   02 CurMonth     PIC 99.
   02 CurDay       PIC 99.

*> Receiving data item for DAY system variable: Format is YYDDD
01 DayOfYear.
   02 FILLER       PIC 99.
   02 YearDay      PIC 9(3).

*> Receiving data item for TIME: Format is HHMMSSss  (s = S/100)
01 CurTime.
   02 CurHour      PIC 99.
   02 CurMinute    PIC 99.
   02 FILLER       PIC 9(4).

*> Receiving data item for DATE YYYYMMDD system variable
01 Y2KDate.
   02 Y2KYear      PIC 9(4).
   02 Y2KMonth     PIC 99.
   02 Y2KDay       PIC 99.

*> Receiving data item for DAY YYYYDDD system variable
01 Y2KDayOfYear.
   02 Y2KDOY-Year  PIC 9(4).
   02 Y2KDOY-Day   PIC 999.

PROCEDURE DIVISION.
Begin.
   DISPLAY "Please enter your name - " WITH NO ADVANCING
   ACCEPT UserName
   ACCEPT CurDate FROM DATE
   *> GnuCOBOL returns the DayOfYear less by ONE DAY, a bug?
   ACCEPT DayOfYear FROM DAY
   ACCEPT CurTime FROM TIME
   ACCEPT Y2KDate FROM DATE YYYYMMDD
   ACCEPT Y2KDayOfYear FROM DAY YYYYDDD
   DISPLAY "Name is " UserName
   DISPLAY "Date is " CurDay "-" CurMonth "-" CurYear
   DISPLAY "Today is day " YearDay " of the year"
   DISPLAY "The time is " CurHour ":" CurMinute
   DISPLAY "Y2KDate is " Y2KDay SPACE Y2KMonth SPACE Y2KYear
   DISPLAY "Y2K Day of Year is " Y2KDOY-Day " of " Y2KDOY-Year
   STOP RUN.
Solution. The program prompts for the user's name, then uses ACCEPT ... FROM to read the system
date, day of the year and time, and displays the name together with the date, day of the year, time
and the four-digit-year (Y2K) versions of the date and day of the year.
Example 1.3.2. Read the following COBOL program and explain what it does.

IDENTIFICATION DIVISION.
PROGRAM-ID. Listing5-11.
AUTHOR. Michael Coughlan.
*> Accepts two numbers and an operator from the user.
*> Applies the appropriate operation to the two numbers.

DATA DIVISION.
WORKING-STORAGE SECTION.
01 Num1     PIC 9 VALUE 7.
01 Num2     PIC 9 VALUE 3.
01 Result   PIC --9.99 VALUE ZEROS.
01 Operator PIC X VALUE "=".
   88 ValidOperator VALUES "*", "+", "-", "/".

PROCEDURE DIVISION.
CalculateResult.
   DISPLAY "Enter a single digit number : " WITH NO ADVANCING
   ACCEPT Num1
   DISPLAY "Enter a single digit number : " WITH NO ADVANCING
   ACCEPT Num2
   DISPLAY "Enter the operator to be applied : "
       WITH NO ADVANCING
   ACCEPT Operator
   EVALUATE Operator
      WHEN "+" ADD Num2 TO Num1 GIVING Result
      WHEN "-" SUBTRACT Num2 FROM Num1 GIVING Result
      WHEN "*" MULTIPLY Num2 BY Num1 GIVING Result
      WHEN "/" DIVIDE Num1 BY Num2 GIVING Result ROUNDED
      WHEN OTHER DISPLAY "Invalid operator entered"
   END-EVALUATE
   IF ValidOperator
      DISPLAY "Result is = ", Result
   END-IF
   STOP RUN.
Solution. The program reads two single-digit numbers and an arithmetic operator, applies the operator
to the two numbers, and displays the result if the operator is valid.
As we shall see in a later section, a business intelligence system needs to handle "COBOL copy-
books" and EBCDIC character sets when extracting data from mainframe systems and certain mini-
computer systems such as the IBM AS/400 [Kimball and Caserta, 2004].
§1.3.2 Java Data Structures
Java is an important language for many business intelligence software and big data software. According
to https://towardsdatascience.com/8-open-source-big-data-tools-to-use-in-2018-e35cabd7cafd, the
following big data processing projects all run on Java:
+ Apache Hadoop: big data processing;
+ Apache Spark: up to 100 times faster than MapReduce;
+ Apache Storm: a real-time framework for data stream processing;
+ Apache Cassandra: one of the pillars behind Facebook's massive success, as it allows processing
  structured data sets distributed across a huge number of nodes across the globe.
Java has three different categories of data types:
1. Primitive data types
   (a) boolean
   (b) char (16-bit Unicode character)
   (c) byte (8-bit signed two's complement integer), short (16-bit), int (32-bit), long (64-bit)
   (d) float (32-bit IEEE 754 floating point), double (64-bit)
2. Derived data types (Java Collections Framework): they are made by using any other data type.
   (a) java.lang.String
   (b) java.util.Vector
   (c) java.util.List
   (d) java.util.Set
   (e) java.util.Map
3. User defined data types: classes and interfaces. They are normally a combination of primitive
   data types and derived data types.
User defined data types are sufficiently powerful in modelling many business entities, but some-
times the BigInteger or BigDecimal data types may be required.
§1.3.3 Python 3 and NumPy Data Structures
The data structures available in standard Python 3 are:
+ Basic data structures: Boolean, strings, integers (equivalent to Java's BigInteger), floating point
  values and complex floating point values (Java does not have the latter);
+ Derived data structures: tuple, list, set, dictionary (hashmap);
+ Record data structures, which are constructed using class;
+ Special data structures: NoneType and Object.
We will first investigate a few data structures in Python and then proceed to study how to "process"
data using Python programming.
Example 1.3.3 (Basic Data Structures). Write down the Python 3 statements to perform the following
instructions.
1. Express 12345 as (a) a string, (b) an integer, (c) a floating point value.
2. What is the representation of "true" and "false" in Python?
Solution. The basic data structures are created as follows.

astr = '12345'   # (a)
aint = 12345     # (b)
anum = 12345.0   # (c)
True, False
Example 1.3.4 (Derived Data Structures). Store the integer 1, string "abc" and float 2.0 as a (a) tuple,
(b) list and (c) set.
Solution. A tuple uses parentheses, a list uses square brackets and a set uses braces; note that a tuple
is immutable while lists and sets have mutating methods.

(1, 'abc', 2.0)
[1, 'abc', 2.0]
{1, 'abc', 2.0}
Example 1.3.5 (Derived Data Structures). Store the following data (Malaysia population data from
https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Malaysia_by_population) using a Python
dictionary.
Local government area  Total population | Local government area  Total population
Kuala Lumpur                  1,588,750 | Padawan                         278,485
Seberang Perai                  818,197 | Taiping                         245,182
Kajang                          795,522 | Miri                            234,541
Klang                           744,062 | Kulai                           234,532
Subang Jaya                     708,296 | Kangar                          225,590
George Town                     708,127 | Kuala Langat                    220,214
Ipoh                            657,892 | Kubang Pasu                     214,479
Petaling Jaya                   613,977 | Bintulu                         212,994
Selayang                        542,409 | Manjung                         211,113
Shah Alam                       541,306 | Batu Pahat                      209,461
Iskandar Puteri                 529,074 | Sepang                          207,354
Seremban                        515,490 | Kuala Selangor                  205,257
Johor Bahru                     497,067 | Muar                            201,148
Melaka City                     484,855 | Lahad Datu                      199,830
Ampang Jaya                     468,961 | Hulu Selangor                   194,387
Kota Kinabalu                   452,058 | Kinabatangan                    182,328
Sungai Petani                   443,488 | Pasir Mas                       180,878
Kuantan                         427,515 | Penampang                       176,607
Alor Setar                      405,523 | Alor Gajah                      173,712
Tawau                           397,673 | Keningau                        173,103
Sandakan                        396,290 | Kluang                          167,833
Kuala Terengganu                337,553 | Kemaman                         166,750
Kuching                         325,132 | Sibu                            162,676
Kota Bharu                      314,964 | Temerloh                        158,724
Kulim                           281,260 | Ketereh                         153,474
Solution. Python 3.6+ allows underscores as digit separators in numeric literals.

popul = {"Kuala Lumpur": 1_588_750, "Seberang Perai": 818_197}  # and so on for the rest of the table
Example 1.3.6. Translate the COBOL program in Example 1.3.1 to a Python program.
Solution.

import time

UserName = input("Please enter your name - ")
CurTime = time.localtime()
CurYear = str(CurTime.tm_year)[-2:]
CurMonth = CurTime.tm_mon
CurDay = CurTime.tm_mday
CurHour = CurTime.tm_hour
CurMinute = CurTime.tm_min
YearDay = CurTime.tm_yday
Y2KYear = CurTime.tm_year
Y2KMonth = CurTime.tm_mon
Y2KDay = CurTime.tm_mday
Y2KDOY_Year = CurTime.tm_year
Y2KDOY_Day = CurTime.tm_yday
print("Name is " + UserName)
print("Date is " + str(CurDay) + "-" + str(CurMonth) + "-" + CurYear)
print("Today is day %3d of the year" % (YearDay,))     # Tuple
print("The time is %02d:%02d" % (CurHour, CurMinute))  # Tuple
print("Y2KDate is %d %d %d" % (Y2KDay, Y2KMonth, Y2KYear))
print("Y2K Day of Year is %d of %d" % (Y2KDOY_Day, Y2KDOY_Year))
Example 1.3.7. Translate the COBOL program in Example 1.3.2 to a Python program.
Solution.
Num1 = float(input("Enter a single digit number : ")[0])
Num2 = float(input("Enter a single digit number : ")[0])
Operator = input("Enter the operator to be applied : ")[0]
ValidOperator = True
if Operator == '+':
    Result = Num1 + Num2
elif Operator == '-':
    Result = Num1 - Num2
elif Operator == '*':
    Result = Num1 * Num2
elif Operator == '/':
    Result = Num1 / Num2
else:
    ValidOperator = False
    print('Invalid operator entered')
if ValidOperator: print("Result is %.2f" % (Result,))  # Floating point
Unlike Java data structures, Python data structures are not suitable for large data processing be-
cause they are slow. The NumPy array is used as the basic data structure for the pandas module
(Section 1.5.2) to handle series and data frames.
§1.3.4 R Data Structures
The rich data structures and statistical packages provided by R make it a powerful open source business
intelligence platform. R has three categories of data types:
1. Basic Data Types
   (a) Logical (boolean)
   (b) Character (string)
   (c) Integer
   (d) Numeric
   (e) Complex
2. Derived Data Types
   (a) Date: Sys.Date()
   (b) Array: array(c(1:4))
   (c) List: list(1, "2", TRUE)
   (d) Matrix: matrix(c(1:4), 2, 2)
   (e) Time Series: ts(rep(1, 10))
   (f) Data Frame: data.frame(x=c(3,2,1), y=c("A","B","C"))
3. User defined types with S3 or S4 classes: setClass("student", slots=list(name="character",
   age="numeric", GPA="numeric"))
R data structures are rich, and R is used along with the Jupyter stack (Julia, Python, R) to enable
wide-scale statistical analysis and data visualisation. The Jupyter Notebook is one of the four most popular big
data visualisation tools, as it allows composing literally any analytical model from more than 9,000
CRAN (Comprehensive R Archive Network) algorithms and modules, running it in a convenient envi-
ronment, adjusting it on the go and inspecting the analysis results at once. The main benefits of using
R are as follows:
1. R can run inside the SQL server;
2. R runs on both Windows (https://mran.microsoft.com/open) and Linux servers;
3. R supports Apache Hadoop and Spark;
4. R is highly portable;
5. R easily scales from a single test machine to vast Hadoop data lakes.
§1.3.5 SQL Data Structures
Every SQL system has its own standard data structures. In SQLite (see https://www.sqlite.org/datatypes.html),
there are only the following data structures:
+ NULL. The value is a NULL value.
+ BLOB. The value is a blob of data, stored exactly as it was input.
+ TEXT. The value is a text string, stored using the database encoding (UTF-8, UTF-16BE or UTF-
  16LE).
+ INTEGER. The value is a signed integer, stored in 1, 2, 3, 4, 6, or 8 bytes depending on the
  magnitude of the value.
+ REAL. The value is a floating point value, stored as an 8-byte IEEE floating point number.
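To see these storage classes in action, the following minimal sketch uses Python's built-in sqlite3
module; the table and values are made up purely for illustration, and SQLite's typeof() function
reports the storage class actually used for each value.

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE demo (name TEXT, amount REAL, qty INTEGER, raw BLOB)")
conn.execute("INSERT INTO demo VALUES (?, ?, ?, ?)",
             ("widget", 1.99, 3, b"\x00\x01"))
# typeof() reports the storage class SQLite used for each stored value
for row in conn.execute("SELECT typeof(name), typeof(amount), "
                        "typeof(qty), typeof(raw) FROM demo"):
    print(row)  # ('text', 'real', 'integer', 'blob')
conn.close()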
However, PostgreSQL (see http://www.postgresqltutorial.com/postgresql-data-types/) has richer data
structures:
+ Boolean;
+ Character types such as char, varchar, and text;
+ Numeric types such as integer and floating-point number;
+ Temporal types such as date, time, timestamp, and interval;
+ UUID for storing Universally Unique Identifiers;
+ Arrays for storing strings, numbers, etc.;
+ JSON for storing JSON data;
+ hstore for storing key-value pairs;
+ Special types such as network address and geometric data.
Many companies store their data in SQL systems as a collection of tables using the various data structures
above.
However, the growth of the Internet has changed the scene of the "database" world from the old LAMP
or WAMP stacks to "cloud" and "NoSQL". So, when it comes to choosing a database, one of the biggest
decisions is picking a relational (SQL) or non-relational (NoSQL) data structure. While both are viable
options, there are certain key differences between the two that users must keep in mind when making
a decision.
SQL databases use structured query language (SQL) for defining and manipulating data. On one
hand, this is extremely powerful: SQL is one of the most versatile and widely-used options available,
making it a safe choice and especially great for complex queries. On the other hand, it can be restrictive.
SQL requires that you use predefined schemas to determine the structure of your data before you work
with it. In addition, all of your data must follow the same structure. This can require significant up-
front preparation. It also means that a change in the structure would be both difficult
and disruptive to your whole system.
Some examples of SQL databases include SQLite, PostgreSQL, MariaDB/MySQL, FirebirdDB, IBM
DB2, Oracle DB, and Microsoft SQL Server. A comprehensive list is given in https://en.wikipedia.org/wiki/List_of_relational_database_management_systems.
According to OvidPerl [2018], prior to the creation of SQL in the 1970s, all databases were NoSQL;
that is why we have SQL. There was a client who was running at a maximum of about 40 transactions
per second. They won a major contract that required 500 transactions per second and their technology
director told them they needed to switch to NoSQL to get better performance. We got them to around
700 transactions per second in about three weeks ... using PostgreSQL. Most of their performance
problems were simply a matter of technical debt and a poor use of their database.
The design and setting up of an SQL database system requires proper normalisation, configuration,
query tuning and some de-normalisation. Embracing NoSQL just because it is a trend is never the
correct answer. There are special purpose databases such as Hadoop that make sense in very specific
contexts. But if a relational database is a viable option, then that is what one should pick. They
won the debate about database design four decades ago for very good reasons.
Many NoSQL databases are just rehashes of those old, flawed designs that we discarded because
they do not work anywhere near as well. And in the few cases where NoSQL does have an interesting
trick, chances are the relational database can do it too (or will be able to very shortly).
§1.3.6 NoSQL Data Structures
According to https://en.wikipedia.org/wiki/NoSQL, NoSQL databases can be roughly classified into
the following categories:
+ Column-oriented: Accumulo, Apache Cassandra, Scylla, Apache Druid, HBase, Vertica, Google
  BigTable;
+ Document-oriented: Apache CouchDB, ArangoDB, BaseX, Clusterpoint, Couchbase, Cosmos
  DB, IBM Domino, MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB, RavenDB;
+ Key-value pair store: Aerospike, Apache Ignite, ArangoDB, Berkeley DB, Couchbase, Dynamo,
  FoundationDB, InfinityDB, MemcacheDB, MUMPS, Oracle NoSQL Database, OrientDB, Redis,
  Riak, SciDB, ZooKeeper;
+ Graph-based: AllegroGraph, ArangoDB, InfinityDB, Apache Giraph, MarkLogic, Neo4J, Ori-
  entDB, Virtuoso Universal Server;
+ Object database: Objectivity/DB, Perst, ZopeDB.
Since NoSQL databases are closely related to the Web, they support the JSON data structures:
1. null/empty: {"grade": } (a blank indicates null or empty);
2. object: a set of name/value pairs between {};
3. Boolean: {"result": true};
4. string: {"name": "Vivek"};
5. number: {"age": 20, "percentage": 82.44};
6. array: {"subjects": ["UECM1304", "UECM3013"]}.
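As a quick illustration, Python's standard json module maps these JSON types onto the Python data
structures of Section 1.3.3; the record below is made up for illustration.

import json

record = {"name": "Vivek",                       # string
          "age": 20, "percentage": 82.44,        # numbers
          "result": True,                        # Boolean -> true
          "grade": None,                         # null
          "subjects": ["UECM1304", "UECM3013"]}  # array
text = json.dumps(record)            # Python dict -> JSON text
print(text)
print(json.loads(text)["subjects"])  # JSON text -> Python dict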
A NoSQL database has a dynamic schema for unstructured data, and data is stored in many ways,
as mentioned earlier. This flexibility means that:
+ We can create documents without having to first define their structure;
+ Each document can have its own unique structure;
+ The syntax can vary from database to database; and
+ We can add fields as we go.
In most situations, SQL databases are vertically scalable, which means that we can increase the
load on a single server by increasing things like CPU, RAM or SSD. NoSQL databases, on the other
hand, are horizontally scalable. This means that we handle more traffic by sharding, or adding more
servers to the NoSQL database. It is like adding more floors to the same building versus adding more
buildings to the neighbourhood. The latter can ultimately become larger and more powerful, making
NoSQL databases the preferred choice for large or ever-changing data sets.
SQL databases are table-based, while NoSQL databases are either document-based, key-value pairs,
graph databases or wide-column stores. This makes relational SQL databases a better option for appli-
cations that require multi-row transactions, such as an accounting system, or for legacy systems that
were built for a relational structure.
According to Kleppmann [2017], companies which do not have petabytes of data should use a rela-
tional database instead of NoSQL just for scale, because building for scale may be a waste of effort.
§1.4 Business Intelligence Infrastructure
According to https://en.wikipedia.org/wiki/Business_intelligence, business intelligence consists
of the strategies and technologies for the data analysis of business information. The business intelli-
gence infrastructure is developed to organise data and turn them into knowledge. The components of
the infrastructure comprise [Kleppmann, 2017]:
+ a data store (databases);
+ a cache for fast data reading (e.g. Memcached, Redis);
+ an indexing system for finding data by keywords (e.g. Elasticsearch, Solr);
+ a stream processing pipeline for sending messages to other processes asynchronously; and/or
+ a batch processing pipeline for periodically processing a large amount of accumulated data.
There are a lot of commercial offerings of business intelligence, such as IBM Cognos, Microsoft
PowerBI, Oracle Business Intelligence Suite Enterprise Edition, Hitachi Data Systems, Plotly, Qlik, SAP,
SAS, Tableau Software, etc. (see https://en.wikipedia.org/wiki/Business_intelligence_software for
more). There are very limited "open source" business intelligence and data analytics tools which can
be found on the Internet. According to Octoparse [2019], some open source tools are:
1. KNIME Analytics Platform (https://www.knime.com/, Java based)
2. OpenRefine (http://openrefine.org/, previously Google Refine, Java based)
3. R Programming (https://www.r-project.org/, S language, C, C++)
4. Orange Data Mining Tool (https://orange.biolab.si/, https://github.com/biolab/orange3, Python
   based)
5. RapidMiner Studio (https://rapidminer.com/, https://github.com/rapidminer/rapidminer-studio,
   Java based)
6. Pentaho Data Integration Kettle (https://help.pentaho.com/Documentation/8.2/Products/Data_
   Integration, Java based)
7. Talend Forge (https://www.talend.com/, https://www.talendforge.org/sources, Java based)
8. Weka (https://www.cs.waikato.ac.nz/ml/weka/, Java based)
9. NodeXL (https://archive.codeplex.com/?p=nodexl, Excel, C# based)
10. Gephi (https://gephi.org/, Java based)
§1.5 Data Handling with Python
In data analytics, text formats are popular because they are easy to debug. Popular text formats include
JSON (popularised by the Web and JavaScript), XML and CSV [Kleppmann, 2017]. Popular binary formats
include Microsoft Office documents, Apache Thrift (by Facebook), Google's Protocol Buffers, Apache
Avro, etc.
In this section, we will study the Python packages that handle various data formats.
§1.5.1 Anaconda Python (on Windows)
The software "Anaconda Python 3.x" (which can be found on the Internet) has all the necessary mod-
ules, in particular the pandas module, for data processing with Python 3. The following libraries/mod-
ules are the essential components of the pandas module (and are available in Anaconda Python):
+ setuptools: 24.2.0 or higher
+ NumPy: 1.12.0 or higher
+ python-dateutil: 2.5.0 or higher
+ pytz
+ numexpr: for accelerating certain numerical operations. numexpr uses multiple cores as well
  as smart chunking and caching to achieve large speedups. If installed, must be version 2.6.1 or
  higher.
+ bottleneck: for accelerating certain types of NaN evaluations. bottleneck uses specialised Cython
  routines to achieve large speedups. If installed, must be version 1.2.0 or higher.
+ Cython: only necessary to build the development version. Version 0.28.2 or higher.
+ SciPy: miscellaneous statistical functions. Version 0.18.1 or higher.
+ xarray: pandas-like handling for > 2 dims, needed for converting Panels to xarray objects. Version
  0.7.0 or higher is recommended.
+ PyTables: necessary for HDF5-based storage. Version 3.4.2 or higher.
+ pyarrow (>= 0.9.0): necessary for feather-based storage.
+ Apache Parquet: either pyarrow (>= 0.7.0) or fastparquet (>= 0.2.1) for parquet-based storage.
  The snappy and brotli libraries are available for compression support.
+ SQLAlchemy: for SQL database support. Version 0.8.1 or higher recommended. Besides SQLAlchemy,
  you also need a database-specific driver. You can find an overview of supported drivers for each
  SQL dialect in the SQLAlchemy docs. Some common drivers are:
  - psycopg2: for PostgreSQL
  - pymysql: for MySQL
  - SQLite: for SQLite, this is included in Python's standard library by default.
+ matplotlib: for plotting. Version 2.0.0 or higher.
+ For Excel I/O:
  - xlrd/xlwt: Excel reading (xlrd), version 1.0.0 or higher required, and writing (xlwt)
  - openpyxl: openpyxl version 2.4.0 for writing xlsx files (xlrd >= 0.9.0)
  - XlsxWriter: alternative Excel writer
+ Jinja2: template engine for conditional HTML formatting.
+ blosc: for msgpack compression using blosc.
+ gcsfs: necessary for Google Cloud Storage access (gcsfs >= 0.1.0).
+ One of qtpy (requires PyQt or PySide), PyQt5, PyQt4, pygtk, xsel, or xclip: necessary to use
  read_clipboard(). Most package managers on Linux distributions will have xclip and/or xsel
  immediately available for installation.
+ pandas-gbq: for Google BigQuery I/O (pandas-gbq >= 0.8.0).
+ One of the following combinations of libraries is needed to use the top-level read_html() function:
  (a) BeautifulSoup4 4.2.1+ and html5lib; or (b) BeautifulSoup4 4.2.1+ and lxml.
According to McKinney [2013], NumPy, pandas, matplotlib, jupyter, SciPy, scikit-learn and statsmodels
are the essential Python libraries. For the over 200 packages available in Anaconda Python, refer to
https://docs.anaconda.com/anaconda/.
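A quick way to check which of these packages (and which versions) are present in a given installation
is pandas' built-in reporting utility; a minimal sketch (the versions printed will depend on the
Anaconda release):

import pandas as pd

# Prints the versions of pandas and its required/optional dependencies
# (NumPy, pytz, dateutil, matplotlib, SQLAlchemy, etc.)
pd.show_versions()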
§1.5.2 The Pandas Module & Its Data Structures
The standard Python is not suitable for data analysis because it does not provide the necessary tools
for easy array handling. The NumPy library provides those tools, and the pandas library is designed for
working with tabular data [McKinney, 2013, Chapter 5]. pandas depends on NumPy and Cython for array
structure processing support, and on dateutil and pytz for date-time and timezone processing support
(mentioned in Section 1.5.1).
To turn on data processing and analysis support, we need to load the necessary modules in Python
as follows.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels as sm
from urllib.request import urlopen
from bs4 import BeautifulSoup
Pandas provides two data structures for data processing, i.e. Series and DataFrame. A series is a
1D labelled array while a data frame is a 2D labelled array. A few important attributes of these data
structures ds are: ds.size, ds.dtypes, ds.index and ds.values.
Example 1.5.1. Construct a pandas series with the list from Example 1.3.4.
Solution.

sr = pd.Series([1, "abc", 2.0])
Example 1.5.2. Write a Python script to express the top 6 populations from the table in Example 1.3.5
as a pandas series.
Solution: A sample script is given below.

import pandas as pd
states = ["Kuala Lumpur", "Seberang Perai", "Kajang", "Klang", "Subang Jaya",
          "George Town"]
ts = pd.Series([1_588_750, 818_197, 795_522, 744_062, 708_296, 708_127],
               index=states)
print("The two key aspects of a Series are")
print("1. values =", ts.values, "of type", ts.dtype)
print("2. index =", ts.index, "of type", ts.index.dtype)
ts.index = ['KL', 'SP', 'Kj', 'Kl', 'SJ', 'GT']
print("After the index is changed, we have\nts =\n", ts, sep='')
Alternatively, the Python dictionary can also be used.
Example 1.5.3 (https://en.wikipedia.org/wiki/Birth_rate). The birth rate, which stands for the
number of live births per thousand of population per year, is essential for predicting the future population of
a country. By using years as the index, the birth rate can be represented as a pandas series. An interesting
exercise is to read the Malaysia Department of Statistics' Open Data from https://www.dosm.gov.my/
v1/index.php?r=column3/accordion&menu_id=alhRYUpWS3B4VYLYaVBOeUFONFpHUTOS and store it as a pan-
das series.
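A minimal sketch of the idea is given below; the birth-rate figures are made up purely for illustration
(the real values should come from the DOSM data).

import pandas as pd

# Hypothetical birth rates (live births per 1000 population), indexed by year
years = [2015, 2016, 2017, 2018]
birth_rate = pd.Series([17.4, 16.9, 16.3, 15.9], index=years)
print(birth_rate)
print("Mean birth rate:", birth_rate.mean())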
Example 1.5.4 (https://en.wikipedia.org/wiki/Heart_rate). A human's heart rate is a time series and
can be represented as a pandas series.
§1.5.3 Handling Excel Data
We rarely load data by writing a Python program. We usually load data from data files such as Excel
files. However, we need to beware that there are many Excel files (see https://en.wikipedia.org/
wiki/List_of_Microsoft_Office_filename_extensions):
+ xls: Legacy Excel worksheets 1997-2003 binary format
+ xlsx: Excel workbook (2007, 2010, 2013, ete. are incompatible)
+ axlsm: Excel macro-enabled workbook;
+ xltx: Excel template;
+ xltm: Excel macro-enabled template.
So when we process Excel files with Python, we need to install the appropriate modules. The
following are the standard modules from http://www.python-excel.org/ which handle different types
of Excel files:
+ openpyxl: this module is used for reading and writing Excel 2010 files;
+ XlsxWriter: this module can handle Excel 2010 files with better support for formatting and charts;
+ xlrd and xlwt: these modules are used for reading and writing Excel xls files.
There are other modules, such as pyexcel and xlutils, which have some extra functionality but may
depend on the modules above.
The pandas module's pd.read_excel('file.xlsx', index_col=None, header=None) depends on the above
modules.
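A minimal sketch of loading a worksheet into a data frame is shown below; the file name and sheet
name are hypothetical, and the appropriate reader module (e.g. openpyxl for xlsx files) must be
installed as noted above.

import pandas as pd

# Read one worksheet of a (hypothetical) Excel file into a DataFrame
df = pd.read_excel("population.xlsx", sheet_name="Sheet1", index_col=None)
print(df.head())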
§1.5.4 Handling CSV Data
A text file is easier to process compared to binary files, so many companies and many software packages
save their data to CSV files. CSV, which stands for Comma-Separated Values, is a text format to store tabular
data separated by "," in a file. TSV, which stands for Tab-Separated Values, separates table data by tabs
instead of commas.
The pandas function for reading a CSV file is df = pd.read_csv("file.csv"), and saving df can be
carried out with df.to_csv("newfile.csv").
Example 1.5.5. Download the world population data from https://ourworldindata.org/grapher/
world-population-by-world-regions-post-1820 and store it as a data frame.
Solution: The Python script is as follows.

import pandas as pd
# https://ourworldindata.org/grapher/world-population-by-world-regions-post-1820
pop_data = pd.read_csv("data/world-population-by-world-regions-post-1820.csv")
print(pop_data.head(4))  # Print the first 4 rows of the data
print(pop_data.tail(4))  # Print the last 4 rows of the data
print(pop_data.index)
print(pop_data.columns)
A similar script can be written in R:

pop.data = read.csv("data/world-population-by-world-regions-post-1820.csv",
                    header=TRUE)  # sep=","; dec="."; stringsAsFactors=FALSE
print(head(pop.data, 4))
print(tail(pop.data, 4))
print(rownames(pop.data))
print(colnames(pop.data))
§1.5.5 Handling “Web” Data
Web data are much more complex these days with the flood of JavaScript. Old-style Web data is
just simple HTML, as shown below.
<!-- https://www.w3schools.com/html/html_tables.asp -->
<html>
<body>

<h2>Table 1</h2>
<table>
  <tr>
    <th>Firstname</th>
    <th>Lastname</th>
    <th>Age</th>
  </tr>
  <tr>
    <td>Jill</td>
    <td>Smith</td>
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td>
    <td>94</td>
  </tr>
</table>

<h2>Table 2</h2>
<table>
  <tr>
    <th>Name</th>
    <th colspan="2">Telephone</th>
  </tr>
  <tr>
    <td>Bill Gates</td>
    <td>55577854</td>
    <td>55577855</td>
  </tr>
</table>

</body>
</html>
Modern Web data may include other data structures such as JSON and XML.
Pandas provides the functions pd.read_html and pd.read_json for dealing with simple HTML and JSON.
However, pd.read_json has problems with nested JSON files, and it is advisable to use Python's json module.
For the XML format, other Python libraries such as lxml.etree, ElementTree, etc. are required. Since
JSON and XML are more complicated, we will not pursue them further.
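Nevertheless, as a brief illustration, a nested JSON record can be loaded with the json module and
flattened into a data frame with pd.json_normalize (in older pandas versions the function lives in
pandas.io.json); the record below is made up for illustration.

import json
import pandas as pd

text = '{"name": "Vivek", "scores": {"UECM1304": 82, "UECM3013": 74}}'
record = json.loads(text)       # parse the JSON text into a Python dict
df = pd.json_normalize(record)  # flatten nested fields into columns
print(df)  # columns: name, scores.UECM1304, scores.UECM3013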
Example 1.5.6. Write a Python script to read the above HTML tables.
Solution. The pd.read_html function returns a list of data frames, one per table.

import pandas as pd
tbls = pd.read_html("tableeg.html")
for i, tbl in enumerate(tbls):
    print("="*8 + f"Table {i+1}" + "="*8)
    print(tbl)
When the HTML contains JavaScript, we will need to follow the StackOverflow advice of calling an external
browser to do some processing:

# https://stackoverflow.com/questions/25062365/python-parsing-html-table-generated-by-javascript
from pandas.io.html import read_html
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www1.nyse.com/about/listed/IPO_Index.html")
table = driver.find_element_by_xpath('//div[@class="sp5"]/table//table/..')
table_html = table.get_attribute('innerHTML')
df = read_html(table_html)[0]
print(df)
driver.close()
§1.5.6 Handling Proprietary Formats
There are many proprietary formats one needs to deal with in data analysis. For example, much social
science research uses the SPSS format or Stata format for storing data. On the other hand, many
companies use the SAS business intelligence system and store their data in SAS format.
Due to the popularity of Python, many software companies are providing Python support for their
formats. For example, according to https://blogs.sas.com/content/sasdummy/2017/04/08/python-to-sas-saspy/,
Python coders can now bring the power of SAS into their Python scripts. The project is SASPy, and
it is available on the SAS Software GitHub https://github.com/sassoftware/saspy. It works with SAS
9.4 and higher, and requires Python 3.x.
The savReaderWriter and pyreadstat modules can be used to read the SPSS format.
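A minimal sketch with pyreadstat is given below; the file name is hypothetical. Note that read_sav
returns both a pandas data frame and the file's metadata.

import pyreadstat

# df is a pandas DataFrame; meta holds variable labels, encodings, etc.
df, meta = pyreadstat.read_sav("survey.sav")
print(df.head())
print(meta.column_labels)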
§1.5.7 Handling SQL Database and Cloud Server
Retrieving data from an SQL database server or a cloud server is more complicated than opening files
because a connection to the server is required. We can use pd.read_sql to read and store the result
returned by the server.
For example, reading from a local "SQL database" is shown below.
from pandas.io import sql
import sqlite3

conn = sqlite3.connect('data.db')
query = 'SELECT * FROM tablename'
tbl = sql.read_sql(query, con=conn, parse_dates={'date': '%d/%m/%Y'})
print(tbl.head())
In general, the process can be simplified by using the SQLAlchemy module or other generalised Python
modules such as PugSQL (https://pugsql.org).
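A minimal sketch of the SQLAlchemy route is given below, reusing the same hypothetical database
file and table name as above.

import pandas as pd
from sqlalchemy import create_engine

# create_engine only builds a connection factory; nothing is queried yet
engine = create_engine("sqlite:///data.db")
tbl = pd.read_sql("SELECT * FROM tablename", con=engine,
                  parse_dates={"date": "%d/%m/%Y"})
print(tbl.head())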
Dealing with a cloud server is similar, but we sometimes need to use a special API. For example, to
deal with spreadsheet data on the Google cloud, we need to use the Google Sheets API, as pointed out in the
following articles.
+ https://towardsdatascience.com/accessing-google-spreadsheet-data-using-python-90a5bc214fd2
+ https://developers.google.com/sheets/api/quickstart/python
+ https://github.com/burnash/gspread
§1.6 Assignment Part 1
Based on your training experience, design a programme for your company to train the staff. Write up
a proposal which includes the following items:
+ The data structure of the training programme;
+ An update and review system for the programme structure;
+ An online testing system for the trainees;
+ A data analytical system for the results of the online trainees;
+ An expansion of the system to accommodate external trainees;
+ An evaluation of whether the data analytical system is worthwhile, listing the pros and cons of the
  system.