Unit I - Data Science

The document provides an introduction to data science, covering its benefits, the data science process, and the differences between big data and data science. It details various data types, including structured, unstructured, and machine-generated data, as well as the steps involved in data preparation, exploration, and modeling. Additionally, it introduces NumPy as a fundamental package for scientific computing in Python, highlighting its features and array creation methods.

Data Science

21CSS303T

1
Unit I
Unit-1: INTRODUCTION TO DATA SCIENCE (10 hours)
Benefits and uses of Data science, Facets of data, The data
science process

Introduction to Numpy: Numpy, creating array, attributes

Numpy Array objects: Creating Arrays, basic operations
(Array join, split, search, sort), Indexing, Slicing and iterating,
copying arrays, Array shape manipulation, Identity array, eye
function, Exploring Data using Series, Exploring Data using
Data Frames, Index objects, Re-index, Drop Entry, Selecting
Entries, Data Alignment, Rank and Sort, Summary Statistics,
Index Hierarchy

Data Acquisition: Gather information from different sources,
Web APIs, Open Data Sources, Web Scraping.

2
Big Data vs Data Science
• Big data is a blanket term for any collection of data sets so
large or complex that they become difficult to process using
traditional data management techniques such as relational
database management systems (RDBMS).

• Data science involves using methods to analyze massive


amounts of data and extract the knowledge it contains.

You can think of the relationship between big data and data
science as being like the relationship between crude oil and
an oil refinery. 3
Characteristics of Big Data
• Volume—How much data is there?
• Variety—How diverse are different types of data?
• Velocity—At what speed is new data generated?

4
Benefits and uses of data
science and big data
1. It’s in Demand
2. Abundance of Positions
3. A Highly Paid Career
4. Data Science is Versatile
5. Data Science Makes Data Better
6. Data Scientists are Highly Prestigious
7. No More Boring Tasks
8. Data Science Makes Products Smarter
9. Data Science can Save Lives

5
Facets of data
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming

6
Structured Data
• Structured data is data that depends on a data model and
resides in a fixed field within a record.

7
Unstructured data
• Unstructured data is data that isn’t easy to fit into a data
model because the content is context-specific or varying.

8
Natural language
• Natural language is a special type of unstructured data; it’s
challenging to process because it requires knowledge of
specific data science techniques and linguistics.

• The natural language processing community has had


success in entity recognition, topic recognition,
summarization, text completion, and sentiment analysis,
but models trained in one domain don’t generalize well to
other domains.

9
Machine-generated data
• Machine-generated data is information that’s automatically
created by a computer, process, application, or other
machine without human intervention.
• Machine-generated data is becoming a major data resource
and will continue to do so.

10
Machine-generated data

11
Graph-based or network
data
• “Graph data” can be a confusing term because any data can
be shown in a graph.
• “Graph” in this case points to mathematical graph theory.
• In graph theory, a graph is a mathematical structure to
model pair-wise relationships between objects.
• Graph or network data is, in short, data that focuses on
the relationship or adjacency of objects.
• The graph structures use nodes, edges, and properties to
represent and store graphical data.
• Graph-based data is a natural way to represent social
networks, and its structure allows you to calculate specific
metrics such as the influence of a person and the shortest
path between two people. 12
Graph-based or network
data

13
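As an illustration (not from the slides), both metrics mentioned above can be computed with the third-party networkx package; the people and edges here are invented:

import networkx as nx

g = nx.Graph()
g.add_edges_from([('Ann', 'Bob'), ('Bob', 'Carl'),
                  ('Ann', 'Dina'), ('Dina', 'Carl')])
print(nx.shortest_path(g, 'Ann', 'Carl'))  # one shortest path, e.g. ['Ann', 'Bob', 'Carl']
print(nx.degree_centrality(g))             # a simple proxy for a person's influence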
Audio, video and image
• Audio, image, and video are data types that pose specific
challenges to a data scientist.

14
Streaming
• While streaming data can take almost any of the previous
forms, it has an extra property.
• The data flows into the system when an event happens
instead of being loaded into a data store in a batch.

15
The Data Science
Process
16
The Data Science Process
• The data science process typically consists of six steps, as
you can see in the mind map

17
The Data Science Process

18
Setting the research goal

19
Setting the research goal
• Data science is mostly applied in the context of an
organization.
• The first step is defining a project charter, which covers:
• A clear research goal
• The project mission and context
• How you’re going to perform your analysis
• What resources you expect to use
• Proof that it’s an achievable project, or a proof of concept
• Deliverables and a measure of success
• A timeline

20
Retrieving data

21
Retrieving data
• Data can be stored in many forms, ranging from simple text
files to tables in a database.
• The objective now is acquiring all the data you need.

• Start with data stored within the company


• Databases
• Data marts
• Data warehouses
• Data lakes

22
Data Lakes
• A data lake is a centralized storage repository that holds a
massive amount of structured and unstructured data.
• According to Gartner, “it is a collection of storage instances
of various data assets additional to the originating data
sources.”

23
Data warehouse
• Data warehousing is about the collection of data from
varied sources for meaningful business insights.
• An electronic storage of a massive amount of information, it
is a blend of technologies that enable the strategic use of
data!

24
Data Mart

25
DWH vs DM
• Data Warehouse is a large repository of data collected from
different sources whereas Data Mart is only subtype of a data
warehouse.
• Data Warehouse is focused on all departments in an
organization whereas Data Mart focuses on a specific group.
• Data Warehouse designing process is complicated whereas the
Data Mart process is easy to design.
• Data Warehouse takes a long time for data handling whereas
Data Mart takes a short time for data handling.
• Comparing Data Warehouse vs Data Mart, Data Warehouse size
range is 100 GB to 1 TB+ whereas Data Mart size is less than
100 GB.
• When we differentiate Data Warehouse and Data Mart, the Data
Warehouse implementation process takes 1 month to 1 year,
whereas Data Mart takes a few months to complete the
implementation process. 26
DWH vs DL

27
28
Data Lakes
• Data lakes are a fairly new concept, and experts have
predicted that they might cause the death of data warehouses
and data marts.
• With the increase of unstructured data, data lakes will
become quite popular, but you will probably still prefer
keeping your structured data in a data warehouse.

29
Data Providers

30
Data Preparation: Cleansing,
integration and transformation

31
Cleansing data
• Data cleansing is a sub process of the data science
process that focuses on removing errors in your data so
your data becomes a true and consistent
representation of the processes it originates from.
• True and consistent representation
• interpretation error
• inconsistencies

32
Outliers

33
Data Entry Errors
• Data collection and data entry are error-prone processes.
• They often require human intervention, and because
humans are only human, they make typos or lose their
concentration for a second and introduce an error into
the chain. But data collected by machines or computers
isn’t free from errors either.
• Some errors arise from human sloppiness, whereas others
are due to machine or hardware failure.

34
Data Entry Errors

35
Redundant Whitespaces
• Whitespaces tend to be hard to detect but cause errors like
other redundant characters would.
• Capital letter mismatches are common.
• Most programming languages make a distinction between
"Brazil" and "brazil". In this case you can solve the problem
by applying a function that returns both strings in
lowercase, such as .lower() in Python: "Brazil".lower() ==
"brazil".lower() should evaluate to True.

36
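A minimal sketch of this kind of cleanup in Python (the column name and values are invented for illustration):

import pandas as pd

df = pd.DataFrame({'country': [' Brazil', 'brazil ', 'BRAZIL']})
# strip redundant whitespace, then normalize capitalization
df['country'] = df['country'].str.strip().str.lower()
print(df['country'].unique())  # ['brazil']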
Impossible values and
Sanity checks
• Sanity checks are another valuable type of data check.
• Sanity checks can be directly expressed with rules:
check = 0 <= age <= 120

37
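A small sketch of applying such a rule to a whole column at once (the data is invented):

import pandas as pd

ages = pd.Series([23, 45, 130, -1, 67])
check = (ages >= 0) & (ages <= 120)  # vectorized sanity check
print(ages[~check])                  # flags 130 and -1 for inspection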
Outliers
• An outlier is an observation that seems to be distant from
other observations or, more specifically, one observation
that follows a different logic or generative process than the
other observations.
• Find outliers 🡺 Use a plot or table

38
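Besides plots and tables, a common numeric rule of thumb is the 1.5 × IQR rule; a minimal sketch with invented data:

import numpy as np

data = np.array([10, 12, 11, 13, 12, 95])  # 95 looks suspicious
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(outliers)  # [95]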
Outliers

39
Handle missing data

40
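(The original slide shows a table of techniques for handling missing data. A minimal pandas sketch of two common ones, dropping and imputing, with invented data:)

import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25, np.nan, 31], 'city': ['NY', 'LA', None]})
print(df.dropna())  # omit rows that contain missing values
print(df.fillna({'age': df['age'].mean(), 'city': 'unknown'}))  # impute per column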
Deviations from a code
book
• A code book is a description of your data, a form of
metadata.
• It contains things such as the number of variables per
observation, the number of observations, and what each
encoding within a variable means.(For instance “0” equals
“negative”, “5” stands for “very positive”.)

41
Data Integration: Combining
data from different data
sources
• Joining 🡺 enriching an observation from one table with
information from another table
• Appending or Stacking 🡺adding the observations of one
table to those of another table.

42
Joining
• Joining 🡺 focus on enriching a single observation
• To join tables, you use variables that represent the same
object in both tables, such as a date, a country name, or a
Social Security number. These common fields are known as
keys.
• When these keys also uniquely define the records in the table
they are called Primary Keys

43
Appending
• Appending 🡺 effectively adding observations from one
table to another table.

44
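A small pandas sketch of both operations (the table and column names are invented):

import pandas as pd

clients = pd.DataFrame({'client_id': [1, 2], 'name': ['Ann', 'Bob']})
orders = pd.DataFrame({'client_id': [1, 1, 2], 'amount': [10, 20, 15]})
enriched = clients.merge(orders, on='client_id')  # joining on a key

more_clients = pd.DataFrame({'client_id': [3], 'name': ['Carl']})
stacked = pd.concat([clients, more_clients], ignore_index=True)  # appending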
Views
• To avoid duplication of data, you virtually combine data
with views
• Existing 🡺 needed more storage space
• A view behaves as if you’re working on a table, but this
table is nothing but a virtual layer that combines the tables
for you.

45
Views

46
Enriching aggregated
measures
• Data enrichment can also be done by adding calculated
information to the table, such as the total number of sales
or what percentage of total stock has been sold in a certain
region

47
Transforming data
• Certain models require their data to be in a certain shape.
• Transforming your data so it takes a suitable form for data
modeling.

48
Reducing the number of
variables
• Too many variables:
• don’t add new information to the model
• make the model difficult to handle
• certain techniques don’t perform well when you overload
them with too many input variables
• Data scientists use special methods to reduce the number
of variables but retain the maximum amount of data.

49
Turning variables into
dummies
• Dummy variables can only take two values: true (1) or
false (0). They’re used to indicate the absence or presence of a
categorical effect that may explain the observation.

50
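In pandas this is typically done with get_dummies(); a minimal sketch with an invented column:

import pandas as pd

df = pd.DataFrame({'city': ['New York', 'Chicago', 'New York']})
print(pd.get_dummies(df['city']))  # one 0/1 column per category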
Data Exploration

51
Data Exploration
• Information becomes much easier to grasp when shown in
a picture, therefore you mainly use graphical techniques to
gain an understanding of your data and the interactions
between variables.
• Visualization Techniques
• Simple graphs
• Histograms
• Sankey
• Network graphs

52
Bar Chart

53
Line Chart

54
Distribution

55
Overlaying

56
Brushing and Linking

57
STEP 5: BUILD THE MODELS

58
Data modeling

59
Data modeling
• Building a model is an iterative process.
• The way you build your model depends on whether you go
with classic statistics or the somewhat more recent
machine learning school, and the type of technique you
want to use.
• Models consist of the following main steps:
• 1 Selection of a modeling technique and variables to enter
in the model
• 2 Execution of the model
• 3 Diagnosis and model comparison

60
Model and variable
selection
❖ Must the model be moved to a production environment
and, if so, would it be easy to implement?
❖ How difficult is the maintenance on the model: how
long will it remain relevant if left untouched?
❖ Does the model need to be easy to explain?

61
Model execution

62
Model execution

63
Model execution

64
Introduction to
Numpy

65
NumPy Arrays
66
NumPy
•Numerical Python
• General-purpose array-processing package.
• High-performance multidimensional array object, and
tools for working with these arrays.
• Fundamental package for scientific computing with
Python.
• It is open-source software.

67
NumPy - Features
• A powerful N-dimensional array object
• Sophisticated (broadcasting) functions
• Tools for integrating C/C++ and Fortran code
• Useful linear algebra, Fourier transform, and random
number capabilities

68
Choosing NumPy over Python
list

69
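(The original slide shows a comparison figure. The key point can be sketched in code: NumPy arrays are stored compactly and operated on in C, so whole-array operations avoid Python-level loops.)

import numpy as np

lst = list(range(1_000_000))
arr = np.arange(1_000_000)

doubled_lst = [x * 2 for x in lst]  # Python loops element by element
doubled_arr = arr * 2               # one vectorized operation, much faster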
Array
• An array is a data type used to store multiple values
using a single identifier (variable name).
• An array contains an ordered collection of data
elements where each element is of the same type and
can be referenced by its index (position)

70
Array
• Similar to the indexing of lists
• Zero-based indexing
• [10, 9, 99, 71, 90 ]

71
NumPy Array
• Store lists of numerical data, vectors and matrices
• Large set of routines (built-in functions) for creating,
manipulating, and transforming NumPy arrays.
• NumPy array is officially called ndarray but commonly
known as array

72
Creation of NumPy Arrays
from List
• First we need to import the NumPy library
import numpy as np

73
Creation of Arrays

74
1. Using the NumPy functions
a. Creating one-dimensional array in NumPy
import numpy as np
array=np.arange(20)
array

Output:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12, 13, 14, 15, 16, 17, 18, 19])

75
1. Using the NumPy functions
a. check the dimensions by using np.shape()
np.shape(array)

Output:
(20, )

76
1. Using the NumPy functions
b. Creating two-dimensional arrays in NumPy
array=np.arange(20).reshape(4,5)
array

Output:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])

77
1. Using the NumPy functions
c. Using other NumPy functions
np.zeros((2,4))
np.ones((3,6))
np.full((2,2), 3)

Output:
array([[0., 0., 0., 0.],
[0., 0., 0., 0.]])
array([[1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1.]])

array([[3, 3], [3, 3]])


78
1. Using the NumPy functions

79
1. Using the NumPy functions

80
1. Using the NumPy functions
c. Using other NumPy functions

import numpy as np
a = np.zeros((2,4))
b = np.ones((3,6))
c = np.empty((2,3))   # uninitialized; contents are arbitrary
d = np.full((2,2), 3)
e = np.eye(3,3)
f = np.linspace(0, 10, num=4)
print(a)
print(b)
print(c)
print(d)
print(e)
print(f)

Output:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]]
[[1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1.]]
[[1.14137702e-316 0.00000000e+000 6.91583610e-310]
 [6.91583609e-310 6.91583601e-310 6.91583601e-310]]
[[3 3]
 [3 3]]
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
[ 0.          3.33333333  6.66666667 10.        ]

81
1. Using the NumPy functions
Sr No.  Function      Description
1       empty_like()  Return a new array with the same shape and type
2       ones_like()   Return an array of ones with the same shape and type
3       zeros_like()  Return an array of zeros with the same shape and type
4       full_like()   Return a full array with the same shape and type
5       asarray()     Convert the input to an array
6       geomspace()   Return evenly spaced numbers on a log scale
7       copy()        Returns a copy of the given object

82
1. Using the NumPy functions
Sr No.  Function      Description
8       diag()        Extract or construct a diagonal array
9       frombuffer()  Interpret a buffer as a 1-D array
10      fromfile()    Construct an array from a text or binary file
11      bmat()        Build a matrix object from a string, nested sequence, or array
12      mat()         Interpret the input as a matrix
13      vander()      Generate a Vandermonde matrix
14      triu()        Upper triangle of an array

83
1. Using the NumPy functions
Sr No.  Function        Description
15      tril()          Lower triangle of an array
16      tri()           An array with ones at and below the given diagonal and zeros elsewhere
17      diagflat()      A two-dimensional array with the flattened input as a diagonal
18      fromfunction()  Construct an array by executing a function over each coordinate
19      logspace()      Return numbers spaced evenly on a log scale
20      meshgrid()      Return coordinate matrices from coordinate vectors

84
2. Conversion from Python structure like lists

import numpy as np
list=[4,5,6]
print(list)
newarray=np.array(list)
print(newarray)

Output
[4, 5, 6]
[4 5 6]

85
Working with Ndarray
• np.ndarray(shape, type)
• Creates an array of the given shape with arbitrary
(uninitialized) values.
• np.array(array_object)
• Creates an array from the given list or tuple.
• np.zeros(shape)
• Creates an array of the given shape with all zeros.
• np.ones(shape)
• Creates an array of the given shape with all ones.
• np.full(shape, fill_value, dtype)
• Creates an array of the given shape filled with the
given value.
• np.arange(range)
• Creates an array with the specified range. 86
NumPy Basic Array Operations
There is a vast range of built-in operations that we can
perform on these arrays.
1. ndim – It returns the dimensions of the array.
2. size – It calculates number of elements in an array.
3. dtype – It can determine the data type of the element.
4. reshape – It provides a new view.
5. slicing – It extracts a particular set of elements.
6. linspace – Returns evenly spaced elements.
7. max/min , sum, sqrt
8. ravel – It converts the array into a single line.

87
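A quick sketch touching each of these attributes and methods on a small array:

import numpy as np

a = np.arange(1, 7)               # [1 2 3 4 5 6]
print(a.ndim, a.size, a.dtype)    # 1 6 int64 (dtype is platform dependent)
b = a.reshape(2, 3)               # a new 2x3 view of the same data
print(b[0, 1:3])                  # slicing: [2 3]
print(a.max(), a.min(), a.sum())  # 6 1 21
print(np.sqrt(a))                 # element-wise square roots
print(np.linspace(0, 1, 5))       # [0.   0.25 0.5  0.75 1.  ]
print(b.ravel())                  # back into a single line: [1 2 3 4 5 6]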
Arrays in NumPy

88
Checking Array Dimensions in
NumPy
import numpy as np
a = np.array(10)
b = np.array([1,1,1,1])
c = np.array([[1, 1, 1], [2,2,2]])
d = np.array([[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]]])
print(a.ndim) #0
print(b.ndim) #1
print(c.ndim) #2
print(d.ndim) #3

89
Higher Dimensional Arrays in NumPy

import numpy as np
arr = np.array([1, 1, 1, 1, 1], ndmin=10)
print(arr)
print('number of dimensions :', arr.ndim)

[[[[[[[[[[1 1 1 1 1]]]]]]]]]]
number of dimensions : 10

90
Indexing and Slicing in NumPy

91
Indexing & Slicing
Indexing
import numpy as np
arr = np.array([1,2,5,6,7])
print(arr[3]) #6

Slicing
import numpy as np
arr = np.array([1,2,5,6,7])
print(arr[2:5]) #[5 6 7]

92
Indexing and Slicing

93
Indexing and Slicing in 2-D

94
Copying Arrays
Copy from one array to another
• Method 1: Using np.empty_like() function
• Method 2: Using np.copy() function
• Method 3: Using Assignment Operator

95
Using np.empty_like( )
• This function returns a new array with the same shape and
type as a given array.
Syntax:
• numpy.empty_like(a, dtype = type)

96
Using np.empty_like( )
import numpy as np
ary = np.array([13,99,100,34,65,11,66,81,632,44])

print("Original array: ")
# printing the Numpy array
print(ary)

# Creating an empty Numpy array similar to ary
copy = np.empty_like(ary)

# Now copy the values of ary into copy, element-wise
# (plain "copy = ary" would only rebind the name, not copy the data)
copy[:] = ary

print("\nCopy of the given array: ")
# printing the copied array
print(copy)

97
Using np.empty_like( )

98
Using np.copy() function
• This function returns an array copy of the given object.
Syntax :
• numpy.copy(a)

# importing Numpy package


import numpy as np
org_array = np.array([1.54, 2.99, 3.42, 4.87, 6.94, 8.21, 7.65, 10.50,
77.5])
print("Original array: ")
print(org_array)
# Now copying the org_array to copy_array using np.copy() function
copy_array = np.copy(org_array)
print("\nCopied array: ")
# printing the copied Numpy array
print(copy_array) 99
Using Assignment Operator
import numpy as np
org_array = np.array([[99, 22, 33],[44, 77, 66]])
# Copying org_array to copy_array using the assignment operator
copy_array = org_array

# modifying org_array
org_array[1, 2] = 13

# checking if copy_array has remained the same
# printing original array
print('Original Array: \n', org_array)

# printing copied array
print('\nCopied Array: \n', copy_array)

# Both prints show the modified value 13: assignment does not copy
# the data, it only creates a new reference to the same array.
101
Iterating Arrays
• Iterating means going through elements one by one.
• As we deal with multi-dimensional arrays in numpy, we can do
this using basic for loop of python.
• If we iterate on a 1-D array it will go through each element one
by one.
• Iterate on the elements of the following 1-D array:
import numpy as np
arr = np.array([1, 2, 3])
for x in arr:
print(x)
Output:
1
2
3

102
Iterating Arrays
• Iterating 2-D Arrays
• In a 2-D array it will go through all the rows.
• If we iterate on a n-D array it will go through (n-1)th
dimension one by one.

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

for x in arr:
print(x)
Output:
[1 2 3]
[4 5 6]

103
Iterating Arrays
• To return the actual values, the scalars, we have to iterate
the arrays in each dimension.
arr = np.array([[1, 2, 3], [4, 5, 6]])
for x in arr:
for y in x:
print(y)

1
2
3
4
5
6

104
Iterating Arrays
• Iterating 3-D Arrays
• In a 3-D array it will go through all the 2-D arrays.

• import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

for x in arr:
print(x)

[[1 2 3]
 [4 5 6]]
[[ 7  8  9]
 [10 11 12]]

105
Iterating Arrays
• Iterating 3-D Arrays
• To return the actual values, the scalars, we have to iterate the
arrays in each dimension.

import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

for x in arr:
for y in x:
for z in y:
print(z)
Output: 1 2 3 … 12 (each scalar on its own line)

106

Iterating Arrays Using nditer()
• The function nditer() is a helping function that can be
used for very basic to very advanced iterations.
• Iterating on Each Scalar Element
• In basic for loops, iterating through each scalar of an array
we need to use n for loops, which can be difficult to write
for arrays with very high dimensionality.

import numpy as np

arr = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

for x in np.nditer(arr):
    print(x)

Output:
1
2
3
4
5
6
7
8
107
Identity array
• The identity array is a square array with ones on the main
diagonal.
• The identity() function return the identity array.

108
Identity
• numpy.identity(n, dtype = None) : Return an identity
matrix i.e. a square matrix with ones on the main diagonal

• Parameters:
• n : [int] Dimension n x n of output array
• dtype : [optional, float(by Default)] Data type of returned
array

109
Identity array
# 2x2 matrix with 1's on main diagonal
b = np.identity(2, dtype = float)
print("Matrix b : \n", b)
a = np.identity(4)
print("\nMatrix a : \n", a)

Output:
Matrix b :
[[ 1. 0.]
[ 0. 1.]]
Matrix a :
[[ 1. 0. 0. 0.]
[ 0. 1. 0. 0.]
[ 0. 0. 1. 0.]
[ 0. 0. 0. 1.]]

110
eye( )
• numpy.eye(R, C = None, k = 0, dtype = float)
: Return a matrix having 1’s on the diagonal and 0’s
elsewhere w.r.t. k.
• R : Number of rows
C : [optional] Number of columns; by default C = R
k : [int, optional, 0 by default]
Diagonal we require; k>0 means a diagonal above the
main diagonal, and vice versa.
dtype : [optional, float (by default)] Data type of returned
array.

111
eye( )

112
Identity( ) vs eye( )
• np.identity returns a square matrix (special case of a
2D-array) which is an identity matrix with the main
diagonal (i.e. 'k=0') as 1's and the other values as 0's. you
can't change the diagonal k here.
• np.eye returns a 2D-array, which fills the diagonal, i.e. 'k'
which can be set, with 1's and rest with 0's.
• So, the main advantage depends on the requirement. If you
want an identity matrix, you can go for identity right away,
or can call the np.eye leaving the rest to defaults.
• But, if you need a 1's and 0's matrix of a particular
shape/size or have a control over the diagonal you can go
for eye method.
113
Identity( ) vs eye( )
import numpy as np
print(np.eye(3,5,1))
print(np.eye(8,4,0))
print(np.eye(8,4,-1))
print(np.eye(8,4,-2))
print(np.identity(4))

114
Shape of an Array
• import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(arr.shape)

• Output: (2,4)

115
Reshaping arrays
• Reshaping means changing the shape of an array.
• The shape of an array is the number of elements in each
dimension.
• By reshaping we can add or remove dimensions or change
number of elements in each dimension.

116
Reshape From 1-D to 2-D
• import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

newarr = arr.reshape(4, 3)

print(newarr)

• Output:
• [[ 1 2 3]
• [ 4 5 6]
• [ 7 8 9]
• [10 11 12]] 117
Reshape From 1-D to 3-D
• The outermost dimension will have 2 arrays that contains 3 arrays, each with
2 elements
• import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2)
print(newarr)

Output:
[[[ 1 2]
[ 3 4]
[ 5 6]]

[[ 7 8]
[ 9 10]
[11 12]]]
118
Can we Reshape into any
Shape?
• Yes, as long as the elements required for reshaping are equal in
both shapes.
• We can reshape an 8 elements 1D array into 4 elements in 2
rows 2D array but we cannot reshape it into a 3 elements 3 rows
2D array as that would require 3x3 = 9 elements.
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])

newarr = arr.reshape(3, 3)

print(newarr)

Traceback (most recent call last):
  File "demo_numpy_array_reshape_error.py", line 5, in <module>
ValueError: cannot reshape array of size 8 into shape (3,3)
119
Flattening the arrays
• Flattening array means converting a multidimensional
array into a 1D array.

• import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
newarr = arr.reshape(-1)
print(newarr)

• Output: [1 2 3 4 5 6]

• There are a lot of functions for changing the shapes of
arrays in NumPy, such as flatten and ravel, and also for
rearranging the elements, such as rot90, flip, fliplr, and
flipud. These fall under the intermediate to advanced
sections of NumPy. 120
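One practical difference worth noting: flatten always returns a copy, while ravel returns a view when it can. A small sketch:

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
f = arr.flatten()  # always a copy
r = arr.ravel()    # a view when possible
arr[0, 0] = 99
print(f[0])  # 1  – the copy is unaffected
print(r[0])  # 99 – the view reflects the change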
Introduction to
Pandas
121
Pandas
• Pandas is a popular open-source data manipulation
and analysis library for Python.
• It provides easy-to-use data structures like DataFrame
and Series, which are designed to make working with
structured data fast, easy, and expressive.
• Pandas is widely used in data science, machine
learning, and data analysis for tasks such as data
cleaning, transformation, and exploration.

122
Series
• A Pandas Series is a one-dimensional array-like object that
can hold data of any type (integer, float, string, etc.).
• It is labelled, meaning each element has a unique identifier
called an index.
• A Series is like a column in a spreadsheet or a single
column of a database table.
• Series are a fundamental data structure in Pandas and are
commonly used for data manipulation and analysis tasks.
• They can be created from lists, arrays, dictionaries, and
existing Series objects.
• Series are also a building block for the more complex
Pandas DataFrame, which is a two-dimensional table-like
structure consisting of multiple Series objects. 123
Series
import pandas as pd

# Initializing a Series from a list
data = [1, 2, 3, 4, 5]
series_from_list = pd.Series(data)
print(series_from_list)

# Initializing a Series from a dictionary
data = {'a': 1, 'b': 2, 'c': 3}
series_from_dict = pd.Series(data)
print(series_from_dict)

# Initializing a Series with a custom index
data = [1, 2, 3, 4, 5]
index = ['a', 'b', 'c', 'd', 'e']
series_custom_index = pd.Series(data, index=index)
print(series_custom_index)

Output:
0    1
1    2
2    3
3    4
4    5
dtype: int64
a    1
b    2
c    3
dtype: int64
a    1
b    2
c    3
d    4
e    5
dtype: int64
124
Series - Indexing
• Each element in a Series has a corresponding index,
which can be used to access or manipulate the data.

print(series_from_list[0])
print(series_from_dict['b'])

Output
1
2

125
Series – Vectorized
Operations
• Series supports vectorized operations, allowing you to
perform arithmetic operations on the entire series
efficiently.

series_a = pd.Series([1, 2, 3])


series_b = pd.Series([4, 5, 6])
sum_series = series_a + series_b
print(sum_series)

Output
0 5
1 7
2 9
dtype: int64
126
Series – Alignment
• When performing operations between two Series
objects, Pandas automatically aligns the data based on
the index labels.
series_a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
series_b = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
sum_series = series_a + series_b
print(sum_series)

Output
a NaN
b 6.0
c 8.0
d NaN
127
dtype: float64
Series – NaN Handling
• Missing values, represented by NaN (Not a Number),
can be handled gracefully in Series operations.

series_a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])


series_b = pd.Series([4, 5], index=['b', 'c'])
sum_series = series_a + series_b
print(sum_series)

Output
a NaN
b 6.0
c 8.0
dtype: float64
128
DataFrame
• A Pandas DataFrame is a two-dimensional, tabular data
structure with rows and columns.
• It is similar to a spreadsheet or a table in a relational
database.
• The DataFrame has three main components:
• data, which is stored in rows and columns;
• rows, which are labeled by an index;
• columns, which are labeled and contain the actual data.

129
DataFrames
import pandas as pd

# Initializing a DataFrame from a dictionary
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
print(df)

# Initializing a DataFrame from a list of lists
data = [['John', 25, 'New York'],
        ['Alice', 30, 'Los Angeles'],
        ['Bob', 35, 'Chicago']]
columns = ['Name', 'Age', 'City']
df = pd.DataFrame(data, columns=columns)
print(df)

Output (both):
    Name  Age         City
0   John   25     New York
1  Alice   30  Los Angeles
2    Bob   35      Chicago
131
DataFrames - Indexing
• DataFrame provides flexible indexing options, allowing access to
rows, columns, or individual elements based on labels or integer
positions.

# Accessing a column
print(df['Name'])

# Accessing a row by label
print(df.loc[0])

# Accessing a row by integer position
print(df.iloc[0])

# Accessing an individual element
print(df.at[0, 'Name'])

Output:
0     John
1    Alice
2      Bob
Name: Name, dtype: object
Name        John
Age           25
City    New York
Name: 0, dtype: object
Name        John
Age           25
City    New York
Name: 0, dtype: object
John
132
DataFrames – Column Operations
• Columns in a DataFrame are Series objects, enabling
various operations such as arithmetic operations, filtering,
and sorting.

# Adding a new column
df['Salary'] = [50000, 60000, 70000]

# Filtering rows based on a condition
high_salary_employees = df[df['Salary'] > 60000]
print(high_salary_employees)

# Sorting DataFrame by a column
sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)

Output:
  Name  Age     City  Salary
2  Bob   35  Chicago   70000
    Name  Age         City  Salary
2    Bob   35      Chicago   70000
1  Alice   30  Los Angeles   60000
0   John   25     New York   50000
134
DataFrames – Handling
NaN
• DataFrames provide methods for handling missing or
NaN values, including dropping or filling missing
values.
• Note that dropna() and fillna() return new DataFrames;
they do not modify df in place (and this sample df has no
missing values, so the output is unchanged).

# Dropping rows with missing values
print(df.dropna())

# Filling missing values with a specified value
print(df.fillna(0))

Output (both):
    Name  Age         City  Salary
0   John   25     New York   50000
1  Alice   30  Los Angeles   60000
2    Bob   35      Chicago   70000
135
DataFrames – Grouping and
Aggregation
• DataFrames support group-by operations for
summarizing data and applying aggregation functions.

# Grouping by a column and calculating mean


avg_age_by_city = df.groupby('City')['Age'].mean()
print(avg_age_by_city)

City
Chicago 35.0
Los Angeles 30.0
New York 25.0
Name: Age, dtype: float64 136
Indexing
• Indexing is a fundamental operation for accessing and
manipulating data efficiently.
• It involves assigning unique identifiers or labels to data
elements, allowing for rapid retrieval and modification.

137
Indexing - Features
• Immutability: Once created, an index cannot be
modified.
• Alignment: Index objects are used to align data
structures like Series and DataFrames.
• Flexibility: Pandas offers various index types,
including integer-based, datetime, and custom
indices.

138
Index - Creation
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data, index=['A', 'B', 'C'])

139
Re-index
• Reindexing is the process of creating a new DataFrame
or Series with a different index.

• The reindex() method is used for this purpose.


import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]}
df = pd.DataFrame(data, index=['A', 'B', 'C'])
# Create a new index
new_index = ['A', 'B', 'D', 'E']

# Reindex the DataFrame


df_reindexed = df.reindex(new_index)
df_reindexed

Output:
    Name   Age
A  Alice  25.0
B    Bob  30.0
D    NaN   NaN
E    NaN   NaN
140
Drop Entry
• Dropping entries in data science refers to removing
specific rows or columns from a dataset.
• This is a common operation in data cleaning and
preprocessing to handle missing values, outliers, or
irrelevant information.

141
Drop Entry
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]}
df = pd.DataFrame(data)
df
# Drop column
newdf = df.drop("Age", axis='columns')

newdf
142
Selecting Entries –
Selecting by Position Created DataFrame

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
# Select the second row Selecting data by Position
df.iloc[1]

143
Selecting Entries –
Selecting by Condition Created DataFrame

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
# Select rows where Age is greater than 30 Selecting data by Condition
df[df['Age'] > 30]

144
Data Alignment
• Data alignment is intrinsic, which means that it's
inherent to the operations you perform.
• Pandas aligns data by their labels, not by their
position.
• align( ) function is used to align
• Used to align two data objects with each other according
to their labels.
• Used on both Series and DataFrame objects
• Returns a new object of the same type with labels
compared and aligned.
145
Data Alignment
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9] })
df2 = pd.DataFrame({
    'A': [10, 11],
    'B': [12, 13],
    'D': [14, 15] })
df1_aligned, df2_aligned = df1.align(df2, fill_value=np.nan)
147
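For reference (computed from the frames above), the aligned frames share the union of rows and columns, with NaN filling the gaps:

df1_aligned:
   A  B  C   D
0  1  4  7 NaN
1  2  5  8 NaN
2  3  6  9 NaN

df2_aligned:
      A     B   C     D
0  10.0  12.0 NaN  14.0
1  11.0  13.0 NaN  15.0
2   NaN   NaN NaN   NaN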
Rank
• Ranking is assigning ranks or positions to data elements
based on their values.
• Rank is returned based on position after sorting.
• Used when analyzing data with repetitive values or when you

need to identify the top or bottom entries.

148
Rank
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'Animal': ['fox', 'Kangaroo',
'deer', 'spider', 'snake'],
'Number_legs': [4, 2, 4, 8, np.nan]})

df

149
Rank
(df as created above:)
     Animal  Number_legs
0       fox          4.0
1  Kangaroo          2.0
2      deer          4.0
3    spider          8.0
4     snake          NaN
150
Rank
df['default_rank'] = df['Number_legs'].rank()
df['max_rank'] = df['Number_legs'].rank(method='max')
df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
df['pct_rank'] = df['Number_legs'].rank(pct=True)

df

151
Rank
The rank columns computed above (the original slides showed
these as screenshots):

     Animal  Number_legs  default_rank  max_rank  NA_bottom  pct_rank
0       fox          4.0           2.5       3.0        2.5     0.625
1  Kangaroo          2.0           1.0       1.0        1.0     0.250
2      deer          4.0           2.5       3.0        2.5     0.625
3    spider          8.0           4.0       4.0        4.0     1.000
4     snake          NaN           NaN       NaN        5.0       NaN
152–155
Sort
• Sort by the values along the axis
• Sort a pandas DataFrame by the values of one or more
columns
• Use the ascending parameter to change the sort order
• Sort a DataFrame by its index using .sort_index()
• Organize missing data while sorting values
• Sort a DataFrame in place using inplace set to True

156
Sort
import pandas as pd
age_list = [['Afghanistan', 1952, 8425333, 'Asia'],
['Australia', 1957, 9712569, 'Oceania'],
['Brazil', 1962, 76039390, 'Americas'],
['China', 1957, 637408000, 'Asia'],
['France', 1957, 44310863, 'Europe'],
['India', 1952, 3.72e+08, 'Asia'],
['United States', 1957, 171984000, 'Americas']]
df = pd.DataFrame(age_list, columns=['Country', 'Year',
'Population', 'Continent'])
df 157
Sort by Ascending Order
import pandas as pd
age_list = [['Afghanistan', 1952, 8425333, 'Asia'],

['Australia', 1957, 9712569, 'Oceania'],


['Brazil', 1962, 76039390, 'Americas'],
['China', 1957, 637408000, 'Asia'],
['France', 1957, 44310863, 'Europe'],
['India', 1952, 3.72e+08, 'Asia'],
['United States', 1957, 171984000, 'Americas']]
df = pd.DataFrame(age_list, columns=['Country', 'Year', 'Population', 'Continent'])
df.sort_values(by=['Country']) # sorting in ascending order

159
Sort by Descending Order
import pandas as pd
age_list = [['Afghanistan', 1952, 8425333, 'Asia'],

['Australia', 1957, 9712569, 'Oceania'],


['Brazil', 1962, 76039390, 'Americas'],
['China', 1957, 637408000, 'Asia'],
['France', 1957, 44310863, 'Europe'],
['India', 1952, 3.72e+08, 'Asia'],
['United States', 1957, 171984000, 'Americas']]
df = pd.DataFrame(age_list, columns=['Country', 'Year', 'Population', 'Continent'])
df.sort_values(by=['Population'], ascending=False) # sorting in descending order

160
Summary Statistics

1
Summary Statistics
• Summary statistics offer a quick and insightful overview of the main characteristics of a dataset.

• Common ways to calculate summary statistics in Pandas are:

• Using Descriptive Statistics using describe()

• Mean, Median, and Mode with mean(), median(), and mode()

• Correlation with corr() Method

2
Summary Statistics
import pandas as pd

# Creating a sample DataFrame


data = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 15, 25, 30]}
df = pd.DataFrame(data)
df

2
Summary Statistics – describe( )
• The describe() method is a powerful tool to generate descriptive statistics of a DataFrame.

• It provides a comprehensive summary, including count, mean, standard deviation, minimum, 25th

percentile, median, 75th percentile, and maximum


import pandas as pd
# Creating a sample DataFrame
data = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 15, 25, 30]}
df = pd.DataFrame(data)
df
# Using describe() to calculate summary statistics
summary_stats = df.describe()
summary_stats

2
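For the sample DataFrame above, describe() produces (values computed from the data):

              A          B
count  5.000000   5.000000
mean   3.000000  20.000000
std    1.581139   7.905694
min    1.000000  10.000000
25%    2.000000  15.000000
50%    3.000000  20.000000
75%    4.000000  25.000000
max    5.000000  30.000000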
Summary Statistics – mean(), median(), and mode()
• Pandas provides specific functions to calculate the mean, median, and mode of each column in a

DataFrame.
import pandas as pd
data = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 15, 25, 30]}
df = pd.DataFrame(data)
# Calculating mean, median, and mode
mean_values = df.mean()
median_values = df.median()
mode_values = df.mode().iloc[0] # mode() returns a DataFrame
print("Mean values:\n", mean_values)
print("\nMedian values:\n", median_values)
print("\nMode values:\n", mode_values)
2
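Expected output for the sample data (every value occurs exactly once, so mode() returns all values and iloc[0] picks the smallest):

Mean values:
A     3.0
B    20.0
dtype: float64

Median values:
A     3.0
B    20.0
dtype: float64

Mode values:
A     1
B    10
Name: 0, dtype: int64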
Summary Statistics – corr()
• Correlation measures the strength and direction of a linear relationship between two variables. The
corr() method in Pandas computes the pairwise correlation of columns, and it is particularly useful
when dealing with large datasets.

import pandas as pd
data = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 15, 25, 30]}
df = pd.DataFrame(data)
# Calculating correlation between columns
correlation_matrix = df.corr()
print("\nCorrelation Matrix:\n", correlation_matrix)

Output:
Correlation Matrix:
      A    B
A  1.0  0.9
B  0.9  1.0

2
Data Acquisition:
Gather information from different sources

1
Information Gathering
• Information gathering can be from a variety of sources.
• In principle, how data are collected depends on the nature of the research or the
phenomena being studied.
• Data collection is a crucial aspect in any level of research work.
• If data are inaccurately collected, it will surely impact the findings of the study, thereby leading to false
or worthless outcomes.

2
What is data collection?
• Data collection is a systematic method of collecting and measuring data gathered from different sources of
information in order to provide answers to relevant questions.
• An accurate evaluation of collected data can help researchers predict future phenomenon and trends.
• Data collection can be classified into two, namely: primary and secondary data.
• Primary data are raw data i.e. fresh and are collected for the first time.
• Secondary data, on the other hand, are data that were previously collected and tested.

2
Types of Information Sources
• Primary Sources: First-hand accounts or original data
• Surveys
• Interviews
• Experiments
• Observations
• Letters
• Secondary Sources: Information compiled or analyzed by others
• Books
• Articles
• Reports
• Websites
• Databases
2
Web API and Open Data Sources

1
API
• API stands for Application Programming Interface.
• An API is an interface that provides a set of functions, allowing programmers to access specific
features or the data of an application.

2
Web API
• A Web API is an application programming interface for the Web.
• A Browser API can extend the functionality of a web browser.
• A Server API can extend the functionality of a web server.

2
How do APIs work?
• An API acts as a communication medium between two programs or systems for functioning.
• The client is the user/customer (who sends the request), the medium is the application programming interface,
and the server is the backend (where the request is accepted and a response is provided).
• Steps followed in the working of APIs –
• The client initiates the request via the API's URI (Uniform Resource Identifier)
• The API makes a call to the server after receiving the request
• Then the server sends the response back to the API with the information
• Finally, the API transfers the data to the client

2
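A minimal sketch of the client side of this flow using the third-party requests library (the URL and parameters are hypothetical; real APIs document their own URIs):

import requests

# hypothetical endpoint; substitute a real API's URI and parameters
response = requests.get('https://api.example.com/weather',
                        params={'city': 'Chennai'})
if response.status_code == 200:
    data = response.json()  # the server's response, parsed from JSON
    print(data)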
Examples

• Payment APIs
• Find Location APIs
• Social Login APIs
• Weather APIs
• Blogging APIs
2
Open Data Sources
• Open data refers to information that can be freely accessed, used, modified, and shared by anyone.
• Open data becomes usable when made available in a common, machine-readable format.
• Open data must be licensed. Its license must permit people to use the data in any way they want, including
transforming, combining and sharing it with others, even commercially.

2
Types of Open Data
• Government Data: This includes a wide range of datasets such as demographics, economics, education,
healthcare, and environment.
• Scientific Data: Research data, climate data, and biological data are often made openly available.
• Corporate Data: Some companies release open data for marketing, social responsibility, or research purposes.
• Citizen-Generated Data: This includes data collected by individuals, often through social media or mobile apps.

2
Web Scraping

1
Web Scraping
• Web scraping is an automatic method to obtain large amounts of data from websites.

• Many large websites, like Google, Twitter, Facebook, StackOverflow, etc. have APIs that allow you to access their
data in a structured format.

2
Web Scraping
• Web scraping requires two parts, namely the crawler and the scraper.

• The crawler is an artificial intelligence algorithm that browses the web to search for the particular data required

by following the links across the internet.

• The scraper, on the other hand, is a specific tool created to extract data from the website.

Web
Scraping

Crawler Scraper

2
Working of Web Scrapers
• Web Scrapers can extract all the data on particular sites or the specific data that a user wants.

• When a web scraper needs to scrape a site, first the URLs are provided. Then it loads all the HTML code for those
sites, and a more advanced scraper might even extract all the CSS and JavaScript elements as well. The scraper
then obtains the required data from this HTML code and outputs it in the format specified by the user. Mostly,
this is in the form of an Excel spreadsheet or a CSV file, but the data can also be saved in other formats,
such as a JSON file.

2
Types of Web Scrapers
• Web Scrapers can be divided on the basis of many different criteria:

• Self-built Web Scrapers

• Pre-built Web Scrapers

• Browser extension

• Software Web Scrapers

• Cloud or Local Web Scrapers.

2
How is Web Scraping using Python done?
• Web scraping with Python can be done using three different frameworks:

• Scrapy

• Beautiful Soup

• Selenium

2
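A minimal Beautiful Soup sketch (the URL and the h3 selector are hypothetical; always check a site's terms and robots.txt before scraping):

import requests
from bs4 import BeautifulSoup

html = requests.get('https://example.com/books').text  # load the HTML code
soup = BeautifulSoup(html, 'html.parser')              # parse it
for title in soup.find_all('h3'):                      # extract the required data
    print(title.get_text(strip=True))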
Process of Web Scraping

2
Applications of Web Scrapers
