AMA1611
Data Analytics
Fundamentals
Introduction to Data Science
1.1 What is Data Science?
“Big data” has become a hot topic in recent years due to:
1. Rapid growth in the size and scope of datasets in various sectors
2. Advances in technology
“Big data” refers to any collection of data sets so large or complex that it is difficult to
process using traditional data management techniques.
Data science refers to the methods used to analyze massive amounts of data and extract
information from them.
Data science and big data evolved from statistics and traditional data management,
but are now considered distinct disciplines.
Characteristics of big data: 4 Vs
1. Volume - How much data is there?
2. Variety - How diverse are different types of data?
3. Velocity - At what speed is new data generated?
4. Veracity - How accurate is the data?
These four properties make big data different from the data found in traditional data
management tools.
The challenges they bring can be felt in almost every aspect of working with data:
data capture, curation, storage, search, sharing, transfer, and visualization.
Big data therefore calls for specialized techniques to extract insights.
Data science and big data are used almost everywhere in both commercial and non-
commercial settings:
1. to gain insights into their customers, processes, staff, competition, and products;
2. to offer customers a better user experience, to cross-sell, up-sell, and personalize their offerings.
Many governmental organizations not only rely on internal data scientists to discover
valuable information, but also share their data with the public.
You can use this data to gain insights or build data-driven applications.
1.2 Types of data
In data science and big data we will come across many different types of data. Each of
them tends to require different tools and techniques.
The main categories of data include:
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
Structured data: data that depends on a data model and resides in a fixed field within a
record. As such, it is often easy to store structured data in tables within databases or
Excel files.
SQL (Structured Query Language) is the preferred way to manage and query data that
resides in databases. Hierarchical data such as a family tree is also structured but it is
hard to store it in a traditional relational database.
Unstructured data: data that is not easy to fit into a data model because the content is
context-specific or varying. Examples: regular emails and posts on social media.
Although an email contains structured elements such as the sender, title, and body text, it
is difficult to analyze its context due to the variety in language. Similarly, it
is complicated to analyze the context of a post on social media due to the use of different
symbols and emoticons.
Natural language: a special type of unstructured data. It poses some challenges to
process since it requires knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion, and sentiment analysis, but models trained
in one domain do not generalize well to other domains.
Machine-generated data: information that’s automatically created by a computer,
process, application, or other machine without human intervention.
Machine-generated data is becoming a major source of data. The analysis of machine data relies
on highly scalable tools, due to its high volume and speed.
Examples of machine data are web server logs, call detail records, network event logs,
and telemetry.
Graph-based or network data: data that focuses on the relationships or adjacency of objects.
The graph structures use nodes, edges, and properties to represent and store graphical
data. It is a natural way to represent social networks, and its structure allows you to
calculate specific metrics such as the influence of a person and the shortest path
between two people.
Examples of graph-based data can be found on many social media websites such as
Twitter and LinkedIn.
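As a purely illustrative sketch (not part of the course material), the third-party Python library networkx, assumed here only as an example tool, can represent such a network and compute the metrics mentioned above:

# Illustrative sketch only: assumes the networkx library is installed.
import networkx as nx

# A tiny social network: nodes are people, edges are friendships.
G = nx.Graph()
G.add_edges_from([("Amy", "Ben"), ("Ben", "Carl"),
                  ("Carl", "Dina"), ("Amy", "Dina"), ("Dina", "Eve")])

# A simple measure of a person's influence: degree centrality.
print(nx.degree_centrality(G))

# Shortest path between two people.
print(nx.shortest_path(G, "Amy", "Eve"))    # ['Amy', 'Dina', 'Eve']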
Audio, image, and video: data types that pose specific challenges to a data scientist.
Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be
challenging for computers. For example, a company called DeepMind succeeded at
creating an algorithm that is capable of learning how to play video games. This algorithm
takes the video screen as input and learns to interpret everything via a complex process
of deep learning.
Streaming data can take almost any of the previous forms. In addition, the data flows
into the system when an event happens instead of being loaded into a data store in a
batch. Although it is not quite a different type of data, streaming data is treated here as
such because you need to adapt your process to deal with this type of information.
Examples are live sporting or music events, and the stock market.
1.3 Data science process
The data science process typically consists of six steps:
1. Setting the research goal
2. Data collection
3. Data preparation
4. Data exploration
5. Data modelling
6. Presentation and automation
(1) Setting the research goal
Data science is mostly applied in the context of an organization. When you perform a
data science project, you will first prepare a project charter, which contains information
such as what the research subject is, how the organization benefits from the research,
what data and resources you need, a concrete timetable, and deliverables.
(2) Data collection
The second step is to collect data. In this step, you ensure that you can use the data in
your project, which means checking the existence of, the quality of, and access to the data.
Data can also be delivered by third-party companies and takes many forms.
(3) Data preparation
Data collection is an error-prone process; in this step, you enhance the quality of the
data and prepare it for use in subsequent steps. This phase consists of three sub-phases, illustrated by the sketch after the list:
(i) data cleansing removes false values from a data source and inconsistencies across
data sources;
(ii) data integration enriches data sources by combining information from multiple data
sources;
(iii) data transformation ensures that the data is in a suitable format for use in your
models.
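As a purely illustrative sketch (the pandas library, file names, and column names are all assumptions rather than course material), the three sub-phases might look like this in Python:

# Illustrative sketch only: pandas plus hypothetical files and columns.
import pandas as pd

sales = pd.read_csv("sales.csv")            # hypothetical data source 1
customers = pd.read_csv("customers.csv")    # hypothetical data source 2

# (i) data cleansing: remove missing values and impossible amounts
sales = sales.dropna()
sales = sales[sales["amount"] >= 0]

# (ii) data integration: combine information from the two sources
data = sales.merge(customers, on="customer_id")

# (iii) data transformation: put the data in a suitable format for modelling
data["amount"] = data["amount"].astype(float)
data["month"] = pd.to_datetime(data["date"]).dt.month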
(4) Data exploration
Data exploration is concerned with building a deeper understanding of your data. You
try to understand how variables interact with each other, the distribution of the data,
and whether there are outliers. To achieve this you mainly use descriptive statistics,
visual techniques, and simple modelling.
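A minimal sketch of this step, continuing the hypothetical pandas DataFrame from the previous sketch, could be:

# Illustrative sketch only: `data` is the hypothetical DataFrame built above.
print(data.describe())                                   # count, mean, std, quartiles, min/max
print(data["amount"].quantile([0.25, 0.5, 0.75, 0.99]))  # distribution and a simple outlier check
print(data[["amount", "month"]].corr())                  # how two variables interact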
(5) Data modelling
In this step, you utilize models, domain knowledge, and insights about the data you
found in the previous steps to answer the research question. You select a technique
from the fields of statistics, machine learning, operations research, and so on. Building a
model is an iterative process that involves selecting the variables for the model,
executing the model, and model diagnostics.
(6) Presentation and automation
Finally, you present the results. These results can take many forms, from presentations to
research reports. Sometimes you will need to automate the execution of the process
because the business will want to use the insights you gained in another project or
enable an operational process to use the outcome from your model.
1.4 Application
After understanding the basic concepts of data science, we will look into some
applications of data analytics and artificial intelligence that make use of various types of
data.
(1) Map and traffic
We can now easily open our phone's map app and type in our destination. Artificial
intelligence (AI) provides users with a much better experience in their unique
surroundings. This application is based on AI algorithms that have been trained to
recognize and understand traffic. As a result, it suggests the best way to avoid traffic
congestion and bottlenecks, while informing users about the precise distance and time
to arrive at the destination.
(2) Face detection and recognition
Utilizing Face ID or face recognition for unlocking our phones or taking photos is a use of
AI that is presently essential in our daily lives. Human babies learn early to recognize facial
features such as eyes, lips, noses, and face shapes. Intelligent machines are trained to
recognize facial coordinates (x, y, w, and h, which form a square around the face as an
area of interest), landmarks (nose, eyes, etc.), and alignment (geometric structures).
Face recognition is also used by government facilities and at airports for monitoring and
security.
(3) Text and Language
When typing a document, there are built-in or downloadable auto-correcting tools that
check spelling, readability, other mistakes, and plagiarism. It takes humans a long time to
master a language and become fluent in it, but AI can master it in a relatively short time.
The AI algorithms often use deep learning, machine learning, and natural language
processing in order to detect inappropriate language use and recommend improvements.
Linguists and computer scientists collaborate to teach machines grammar, much as it is
taught in school. Machines are fed large volumes of high-quality text in a
machine-understandable form.
One example is Grammarly, a cloud-based
typing assistant that reviews spelling,
grammar, punctuation, clarity,
engagement, and delivery mistakes by
utilizing AI algorithms. It also allows users to
customize their style, tone, and context-
specific language.
(4) Healthcare
Infervision is using artificial intelligence and deep learning to save lives. In China, there
are not enough radiologists to keep up with the demand for checking 1.4 billion CT
scans each year for early signs of lung cancer. Infervision trained algorithms to augment
the work of radiologists, permitting them to diagnose cancer more efficiently and
accurately.
(5) Banking and finance
The banking and finance industry has a major impact on our daily lives: the world runs
on liquidity, and banks are the gatekeepers who control the flow. AI plays a role in
monitoring our accounts and alerting us to potential fraud. AI systems are trained to
examine vast samples of fraud data in order to identify patterns, so that we can be
alerted before fraud happens to us. If we run into a snag and contact our bank's
customer service, we are probably speaking with an AI bot. Even the largest financial
institutions use AI to analyze data in order to find the best ways to invest capital,
maximizing returns while minimizing risk.
Nowadays, AI is set to play an even larger role in
the industry, with major banks around the world
investing billions of dollars in AI technology, and
we will be able to see the results sooner rather
than later.
1.5 Computational tools for data science
Currently, many big data tools and frameworks exist. The big data ecosystem can be
grouped into technologies that have similar goals and functionalities. Data scientists use
many different technologies, but not all of them; Python and R are two popular examples.
Jupyter Notebook is an open-source web application that allows users to create, run,
manage, and share programs written in Julia, Python, or R.
1.6 Data types and structures
There are four basic data types in Python (a short sketch follows this list):
1. integer (int) – positive/negative/zero integers
2. floating (float) - floating point real values
3. string (str) - strings of multiple characters, enclosed by quotation marks
4. boolean (bool) - Boolean which is either True (1) / False (0) for logical operations
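A minimal sketch of the four types (the variable names are just examples):

a = -3          # int: a positive/negative/zero integer
b = 3.14        # float: a floating point real value
c = "hello"     # str: a string of characters, enclosed by quotation marks
d = True        # bool: either True (1) or False (0), for logical operations

print(type(a), type(b), type(c), type(d))
# <class 'int'> <class 'float'> <class 'str'> <class 'bool'>
print(d + 1)    # True behaves as 1, so this prints 2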
A variable is a reserved memory location used to store values.
A data structure is a way of storing and organizing data on a computer so that it can be
accessed and updated efficiently. To store data sequentially in memory, we can use an
array-like data structure. The list is a built-in data structure in Python.
Python coding will be taught in later chapters.
Note that there is no subject tutorial in the first teaching week. The first subject
tutorial will start in the second teaching week.
Coming Tutorial 1: discuss Tutorial 1 Questions 1, 2, and 3
1.7 Introduction to Jupyter Notebook
Jupyter Notebook is a web-based application for creating and sharing computational
documents. It offers a modern and powerful web interface to Python. To install, you
may:
Step (1): download Anaconda (choose Python 3 version)
https://docs.anaconda.com/anaconda/install/windows/
Step (2): install Anaconda
Step (3): to run Jupyter Notebook, open a terminal (or Anaconda Prompt) and type: jupyter notebook
You may follow the guidelines on the website for more details.
https://test-jupyter.readthedocs.io/en/latest/install.html
From the Jupyter dashboard you can check the programs that are running, click "New" and
then "Python 3" to create a new notebook, and select the folder or file you want to open
or edit.
The advantage of using Jupyter Notebook is that it provides an interactive interface that
allows users to view the outcome of their code. Let's take a very simple example:
evaluating the result of the arithmetic operation 1+1. Just type the code in a cell and
either click the "Run" button or press Alt+Enter to execute it.
(Screenshot of the notebook interface: a panel for editing and running cells; the first cell
is labelled In [1], its code appears in the cell, the corresponding output appears directly
below it, and the next empty cell is ready for input.)
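The cell in the screenshot is not reproduced here; based on the description above, it simply contains:

1 + 1
# Jupyter displays the value of the last expression in a cell; here it shows 2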
On Jupyter Notebook, you may try to execute some simple calculations using the
arithmetic operators listed below; a short sketch follows the table:
Operator Name
+ addition
- subtraction
* multiplication
/ division
** exponent
% modulus
// floor division
max(,) maximum
min(,) minimum
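Below is a small sketch of these operations in use (note that max() and min() are built-in functions rather than operators). You can type each line in a cell and run it:

7 + 2        # addition: 9
7 - 2        # subtraction: 5
7 * 2        # multiplication: 14
7 / 2        # division: 3.5
7 ** 2       # exponent: 49
7 % 2        # modulus (remainder): 1
7 // 2       # floor division: 3
max(7, 2)    # maximum: 7
min(7, 2)    # minimum: 2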
For example, the code below assigns the string 'ama' to x, the integer 123 to y, and the
float 1.23 to z. Click the "Run" button or press Alt+Enter to execute the code in a cell.
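The code in the original screenshot is not reproduced here; a reconstruction based on the description above would be:

x = 'ama'    # string
y = 123      # integer
z = 1.23     # float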
In Jupyter Notebook, you may check the value of each variable by typing its name in a cell
and running it. You may also check its data type using type().
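For example, continuing with the variables above:

x            # displays 'ama'
type(x)      # <class 'str'>
type(y)      # <class 'int'>
type(z)      # <class 'float'>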
In a Python list, notice that each member is labelled by an index starting from 0 (instead of 1).
To access a single member, we can use:
list_name[index]
For example, we can create a list of numbers:
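The code in the original screenshot is not reproduced here; based on the figure below, it creates the following list:

mylist = [2, 3, 5, 7, 11]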
The five values are stored in sequential order with indices 0 to 4, as pictured below.
mylist
index:  0   1   2   3   4
value:  2   3   5   7   11
If we access the member with index 2 in mylist, it gives:
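Reconstructed from the description (the slide shows a screenshot):

mylist[2]    # gives 5, the third member, since indexing starts at 0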
To create a sliced list with consecutive members in the original list, we can use:
list_name[starting_index:stopping_index]
Notice that the member at the stopping index is not included in the sliced list. For
example, if we form a sliced list from mylist with starting index 1 and stopping index 4:
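Reconstructed from the description (the slide shows a screenshot):

mylist[1:4]    # gives [3, 5, 7]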
The result is another list formed by the elements of mylist with indices 1, 2, and 3.
Lecture 2 will continue with Descriptive Statistics.
We will discuss a bit of Python in the second subject tutorial. However, please note that
Python will not be a topic in any subject test or subject examination.