AMA1611
Data Analytics
Fundamentals
Introduction to Data Science
1.1 What is Data Science?
“Big data” has become a hot topic in recent years due to:
1. Rapid growth in the size and scope of datasets in various sectors
2. Advances in technology
“Big data” refers to any collection of data sets so large or complex that it is difficult to
process using traditional data management techniques.
Data science refers to the methods used to analyze massive amounts of data and extract
information from them.
Data science and big data evolved from statistics and traditional data management,
but are now considered distinct disciplines.
Characteristics of big data: 4 Vs
1. Volume - How much data is there?
2. Variety - How diverse are different types of data?
3. Velocity - At what speed is new data generated?
4. Veracity - How accurate is the data?
These four properties make big data different from the data found in traditional data
management tools.
The challenges they bring can be felt in almost every aspect of working with data:
data capture, curation, storage, search, sharing, transfer, and visualization.
Big data therefore calls for specialized techniques to extract insights.
Data science and big data are used almost everywhere in both commercial and non-
commercial settings:
1. to gain insights into their customers, processes, staff, competition, and products;
2. to offer customers a better user experience, to cross-sell, up-sell, and personalize their offerings.
Many governmental organizations not only rely on internal data scientists to discover
valuable information, but also share their data with the public.
You can use this data to gain insights or build data-driven applications.
1.2 Types of data
In data science and big data we will come across many different types of data. Each of
them tends to require different tools and techniques.
The main categories of data include:
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
Structured data: data that depends on a data model and resides in a fixed field within a
record. As such, it is often easy to store structured data in tables within databases or
Excel files.
SQL (Structured Query Language) is the preferred way to manage and query data that
resides in databases. Hierarchical data such as a family tree is also structured but it is
hard to store it in a traditional relational database.
Unstructured data: data that is not easy to fit into a data model because the content is
context-specific or varying. Examples: regular emails and posts on social media.
Although an email contains structured elements such as the sender, title, and body text, it
is difficult to analyze its context due to the variety in language. Similarly, it
is complicated to analyze the context of a post on social media due to the use of different
symbols and emoticons.
Natural language: a special type of unstructured data. It poses some challenges to
process since it requires knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion, and sentiment analysis, but models trained
in one domain do not generalize well to other domains.
Machine-generated data: information that’s automatically created by a computer,
process, application, or other machine without human intervention.
Machine-generated data is becoming a major source of data. The analysis of machine data relies
on highly scalable tools, due to its high volume and speed.
Examples of machine data are web server logs, call detail records, network event logs,
and telemetry.
Graph-based or network data: data that focuses on the relationships or adjacency of objects.
The graph structures use nodes, edges, and properties to represent and store graphical
data. It is a natural way to represent social networks, and its structure allows you to
calculate specific metrics such as the influence of a person and the shortest path
between two people.
Examples of graph-based data can be found on many social media websites such as
Twitter and LinkedIn.
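As a purely illustrative sketch (not part of the course material), the third-party Python library networkx, assumed here only as an example tool, can represent such a network and compute the metrics mentioned above:

# Illustrative sketch only: assumes the networkx library is installed.
import networkx as nx

# A tiny social network: nodes are people, edges are friendships.
G = nx.Graph()
G.add_edges_from([("Amy", "Ben"), ("Ben", "Carl"),
                  ("Carl", "Dina"), ("Amy", "Dina"), ("Dina", "Eve")])

# A simple measure of a person's influence: degree centrality.
print(nx.degree_centrality(G))

# Shortest path between two people.
print(nx.shortest_path(G, "Amy", "Eve"))    # ['Amy', 'Dina', 'Eve']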
Audio, image, and video: data types that pose specific challenges to a data scientist.
Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be
challenging for computers. For example, a company called DeepMind succeeded at
creating an algorithm that is capable of learning how to play video games. This algorithm
takes the video screen as input and learns to interpret everything via a complex process
of deep learning.
Streaming data can take almost any of the previous forms. In addition, the data flows
into the system when an event happens instead of being loaded into a data store in a
batch. Although it is not quite a different type of data, streaming data is treated here as
such because you need to adapt your process to deal with this type of information.
Examples are live sporting or music events, and the stock market.
1.3 Data science process
The data science process typically consists of six steps:
1. Setting the research goal
2. Data collection
3. Data preparation
4. Data exploration
5. Data modelling
6. Presentation and automation
(1) Setting the research goal
Data science is mostly applied in the context of an organization. When you perform a
data science project, you will first prepare a project charter, which contains information
such as what the research subject is, how the organization benefits from the research,
what data and resources you need, a concrete timetable, and deliverables.
(2) Data collection
The second step is to collect data. In this step, you ensure that you can use the data in
your project, which means checking the existence of, the quality of, and access to the data.
Data can also be delivered by third-party companies and takes many forms.
(3) Data preparation
Data collection is an error-prone process; in this step, you enhance the quality of the
data and prepare it for use in subsequent steps. This phase consists of three sub-phases, illustrated by the sketch after the list:
(i) data cleansing removes false values from a data source and inconsistencies across
data sources;
(ii) data integration enriches data sources by combining information from multiple data
sources;
(iii) data transformation ensures that the data is in a suitable format for use in your
models.
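As a purely illustrative sketch (the pandas library, file names, and column names are all assumptions rather than course material), the three sub-phases might look like this in Python:

# Illustrative sketch only: pandas plus hypothetical files and columns.
import pandas as pd

sales = pd.read_csv("sales.csv")            # hypothetical data source 1
customers = pd.read_csv("customers.csv")    # hypothetical data source 2

# (i) data cleansing: remove missing values and impossible amounts
sales = sales.dropna()
sales = sales[sales["amount"] >= 0]

# (ii) data integration: combine information from the two sources
data = sales.merge(customers, on="customer_id")

# (iii) data transformation: put the data in a suitable format for modelling
data["amount"] = data["amount"].astype(float)
data["month"] = pd.to_datetime(data["date"]).dt.month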
(4) Data exploration
Data exploration is concerned with building a deeper understanding of your data. You
try to understand how variables interact with each other, the distribution of the data,
and whether there are outliers. To achieve this you mainly use descriptive statistics,
visual techniques, and simple modelling.
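A minimal sketch of this step, continuing the hypothetical pandas DataFrame from the previous sketch, could be:

# Illustrative sketch only: `data` is the hypothetical DataFrame built above.
print(data.describe())                                   # count, mean, std, quartiles, min/max
print(data["amount"].quantile([0.25, 0.5, 0.75, 0.99]))  # distribution and a simple outlier check
print(data[["amount", "month"]].corr())                  # how two variables interact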
(5) Data modelling
In this step, you utilize models, domain knowledge, and insights about the data you
found in the previous steps to answer the research question. You select a technique
from the fields of statistics, machine learning, operations research, and so on. Building a
model is an iterative process that involves selecting the variables for the model,
executing the model, and model diagnostics.
(6) Presentation and automation
Finally, you present the results. These results can take many forms, from presentations to
research reports. Sometimes you will need to automate the execution of the process
because the business will want to use the insights you gained in another project or
enable an operational process to use the outcome from your model.
1.4 Application
After understanding the basic concepts of data science, we will look into some
applications of data analytics and artificial intelligence that make use of various types of
data.
(1) Map and traffic
We can now easily open our phone's map app and type in our destination. Artificial
intelligence (AI) provides users with a much better experience in their unique
surroundings. This application is based on AI algorithms that have been trained to
recognize and understand traffic. As a result, it suggests the best way to avoid traffic
congestion and bottlenecks, while informing users about the precise distance and time
to arrive at the destination.
(2) Face detection and recognition
Utilizing Face ID or face recognition for unlocking our phones or taking photos is a use of
AI that is presently essential in our daily lives. Human babies learn early to recognize facial
features such as eyes, lips, noses, and face shapes. Intelligent machines are trained to
recognize facial coordinates (x, y, w, and h, which form a square around the face as an
area of interest), landmarks (nose, eyes, etc.), and alignment (geometric structures).
Face recognition is also used by government facilities and at airports for monitoring and
security.
(3) Text and Language
When typing a document, there are built-in or downloadable auto-correcting tools that
check spelling, readability, other mistakes, and plagiarism. It takes humans a long time to
master a language and become fluent in it, but AI can master it in a relatively short time.
The AI algorithms often use deep learning, machine learning, and natural language
processing in order to detect inappropriate language use and recommend improvements.
Linguists and computer scientists collaborate to teach machines grammar, much as it is
taught in school. Machines are fed large volumes of high-quality text in a
machine-understandable form.
One example is Grammarly, a cloud-based
typing assistant that reviews spelling,
grammar, punctuation, clarity,
engagement, and delivery mistakes by
utilizing AI algorithms. It also allows users to
customize their style, tone, and context-
specific language.
(4) Healthcare
Infervision is using artificial intelligence and deep learning to save lives. In China, there
are not enough radiologists to keep up with the demand for checking 1.4 billion CT
scans each year for early signs of lung cancer. Infervision trained algorithms to augment
the work of radiologists, permitting them to diagnose cancer more efficiently and
accurately.
(5) Banking and finance
The banking and finance industry has a major impact on our daily lives: the world runs
on liquidity, and banks are the gatekeepers who control the flow. AI plays a role in
monitoring our accounts and alerting us to potential fraud. AI systems are trained to
examine vast samples of fraud data in order to identify patterns, so that we can be
alerted before fraud happens to us. If we run into a snag and contact our bank's
customer service, we are probably speaking with an AI bot. Even the largest financial
institutions use AI to analyze data in order to find the best ways to invest capital,
maximizing returns while minimizing risk.
Nowadays, AI is set to play an even larger role in
the industry, with major banks around the world
investing billions of dollars in AI technology, and
we will be able to see the results sooner rather
than later.
1.5 Computational tools for data science
Currently, many big data tools and frameworks exist. The big data ecosystem can be
grouped into technologies that have similar goals and functionalities. Data scientists use
many different technologies, but not all of them; Python and R are two popular examples.
Jupyter Notebook is an open-source web application that allows users to create, run,
manage, and share programs written in Julia, Python, or R.
1.6 Data types and structures
There are four basic data types in Python (a short sketch follows this list):
1. integer (int) – positive/negative/zero integers
2. floating (float) - floating point real values
3. string (str) - strings of multiple characters, enclosed by quotation marks
4. boolean (bool) - Boolean which is either True (1) / False (0) for logical operations
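A minimal sketch of the four types (the variable names are just examples):

a = -3          # int: a positive/negative/zero integer
b = 3.14        # float: a floating point real value
c = "hello"     # str: a string of characters, enclosed by quotation marks
d = True        # bool: either True (1) or False (0), for logical operations

print(type(a), type(b), type(c), type(d))
# <class 'int'> <class 'float'> <class 'str'> <class 'bool'>
print(d + 1)    # True behaves as 1, so this prints 2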
A variable is a reserved memory location used to store values.
A data structure is a way of storing and organizing data on a computer so that it can be
accessed and updated efficiently. To store data sequentially in memory, we can use an
array-like data structure. The list is a built-in data structure in Python.
Python coding will be taught in later chapters.
Note that there is no subject tutorial in the first teaching week. The first subject
tutorial will start in the second teaching week.
Coming Tutorial 1: discuss Tutorial 1 Questions 1, 2, and 3
1.7 Introduction to Jupyter Notebook
Jupyter Notebook is a web-based application for creating and sharing computational
documents. It offers a modern and powerful web interface to Python. To install, you
may:
Step (1): download Anaconda (choose Python 3 version)
https://docs.anaconda.com/anaconda/install/windows/
Step (2): install Anaconda
Step (3): to run Jupyter Notebook, open a terminal (or Anaconda Prompt) and type: jupyter notebook
You may follow the guidelines on the website for more details.
https://test-jupyter.readthedocs.io/en/latest/install.html
From the Jupyter dashboard you can check the programs that are running, click "New" and
then "Python 3" to create a new notebook, and select the folder or file you want to open
or edit.
The advantage of using Jupyter Notebook is that it provides an interactive interface that
allows users to view the outcome of their code. Let's take a very simple example:
evaluating the result of the arithmetic operation 1+1. Just type the code in a cell and
either click the "Run" button or press Alt+Enter to execute it.
(Screenshot of the notebook interface: a panel for editing and running cells; the first cell
is labelled In [1], its code appears in the cell, the corresponding output appears directly
below it, and the next empty cell is ready for input.)
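The cell in the screenshot is not reproduced here; based on the description above, it simply contains:

1 + 1
# Jupyter displays the value of the last expression in a cell; here it shows 2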
On Jupyter Notebook, you may try to execute some simple calculations using the
arithmetic operators listed below; a short sketch follows the table:
Operator Name
+ addition
- subtraction
* multiplication
/ division
** exponent
% modulus
// floor division
max(,) maximum
min(,) minimum
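Below is a small sketch of these operations in use (note that max() and min() are built-in functions rather than operators). You can type each line in a cell and run it:

7 + 2        # addition: 9
7 - 2        # subtraction: 5
7 * 2        # multiplication: 14
7 / 2        # division: 3.5
7 ** 2       # exponent: 49
7 % 2        # modulus (remainder): 1
7 // 2       # floor division: 3
max(7, 2)    # maximum: 7
min(7, 2)    # minimum: 2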
For example, the code below assigns the string 'ama' to x, the integer 123 to y, and the
float 1.23 to z. Click the "Run" button or press Alt+Enter to execute the code in a cell.
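The code in the original screenshot is not reproduced here; a reconstruction based on the description above would be:

x = 'ama'    # string
y = 123      # integer
z = 1.23     # float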
In Jupyter Notebook, you may check the value of each variable by typing its name in a cell
and running it. You may also check its data type using type().
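For example, continuing with the variables above:

x            # displays 'ama'
type(x)      # <class 'str'>
type(y)      # <class 'int'>
type(z)      # <class 'float'>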
In a Python list, notice that each member is labelled by an index starting from 0 (instead of 1).
To access a single member, we can use:
list_name[index]
For example, we can create a list of numbers:
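The code in the original screenshot is not reproduced here; based on the figure below, it creates the following list:

mylist = [2, 3, 5, 7, 11]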
The five values are stored in sequential order with indices 0 to 4, as pictured below.
mylist
index:  0   1   2   3   4
value:  2   3   5   7   11
If we access the member with index 2 in mylist, it gives:
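Reconstructed from the description (the slide shows a screenshot):

mylist[2]    # gives 5, the third member, since indexing starts at 0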
To create a sliced list with consecutive members in the original list, we can use:
list_name[starting_index:stopping_index]
Notice that the member at the stopping index is not included in the sliced list. For
example, if we form a sliced list from mylist with starting index 1 and stopping index 4:
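Reconstructed from the description (the slide shows a screenshot):

mylist[1:4]    # gives [3, 5, 7]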
The result is another list formed by the elements of mylist with indices 1, 2, and 3.
Lecture 2 will continue with Descriptive Statistics.
We will discuss a bit of Python in the second subject tutorial. However, please note that
Python will not be a topic in any subject test or subject examination.