UNIT IV
PYTHON LIBRARIES FOR DATA WRANGLING
Dr A Mithila
CS3352 FOUNDATIONS OF DATA SCIENCE
SYLLABUS
– Basics Of NumPy Arrays
– Aggregations
– Computations On Arrays
– Comparisons, Masks, Boolean Logic
– Fancy Indexing
– Structured Arrays
– Data Manipulation With Pandas
– Data Indexing And Selection
– Operating On Data
– Missing Data
– Hierarchical Indexing
– Combining Datasets
– Aggregation And Grouping
– Pivot Tables
Data Wrangling
• Data wrangling is the process of gathering raw data and
transforming it into another format so that it can be
understood, accessed, and analyzed in less time, and can
better support decision-making.
• Data wrangling is also known as data munging.
Data Wrangling Process
1. Discovery: Before starting the wrangling process, think
critically about what may lie beneath your data, what
results you anticipate from it, and what you will use it
for once the wrangling process is complete. Once you've
determined your objectives, you can gather your data.
2. Organization: After you've gathered your raw data within
a particular dataset, you must structure your data. Due
to the variety and complexity of data types and sources,
raw data is often overwhelming at first glance.
3. Cleaning: When your data is organized, you can begin
cleaning your data. Data cleaning involves removing
outliers, formatting nulls, and eliminating duplicate
data. It is important to note that cleaning data collected
from web scraping methods might be more tedious than
cleaning data collected from a database: web data can be
highly unstructured and can take more time to clean than
structured data from a database.
4. Data enrichment: This step requires that you take a step back from
your data to determine if you have enough data to proceed.
Finishing the wrangling process without enough data may
compromise insights gathered from further analysis. For example,
investors looking to analyze product review data will want a
significant amount of data to portray the market and increase
investment intelligence.
5. Validation: After determining that you have gathered enough data,
you will need to apply validation rules to it. Validation rules,
applied repeatedly, confirm that your data is consistent throughout
your dataset. They also help ensure quality as well as security.
This step follows logic similar to that of data normalization, a
data standardization process that also relies on validation rules.
6. Publishing: The final step of the data munging process is data
publishing. Data publishing involves preparing the data for future
use. This may include providing notes and documentation of your
wrangling process and creating access for other users and
applications.
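The cleaning and validation steps above can be sketched with NumPy. This is a minimal illustration, not a prescribed method: the readings in raw are made up, repeated values are assumed to be duplicate records, and the outlier rule (more than 3 times the median) and the valid range 0 to 100 are arbitrary assumptions chosen for the example.

```python
import numpy as np

# Hypothetical raw readings: one missing value (nan), one repeated value,
# and one obvious outlier (250.0).
raw = np.array([12.0, 15.0, np.nan, 15.0, 14.0, 13.0, 250.0])

# Cleaning, step 1: drop missing values.
no_nulls = raw[~np.isnan(raw)]

# Cleaning, step 2: drop duplicates (np.unique also sorts the values).
deduped = np.unique(no_nulls)

# Cleaning, step 3: drop outliers. The rule "more than 3x the median"
# is an assumption made for this sketch only.
median = np.median(deduped)
cleaned = deduped[deduped <= 3 * median]

# Validation: rules every remaining value must satisfy
# (the 0..100 range is a hypothetical rule).
assert not np.isnan(cleaned).any()
assert ((cleaned >= 0) & (cleaned <= 100)).all()

print(cleaned)  # [12. 13. 14. 15.]
```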
The Basics of NumPy Arrays
• Data manipulation in Python is nearly synonymous with NumPy
array manipulation: even newer tools like Pandas are built around
the NumPy array.
• The examples in this unit use NumPy array manipulation to access
data and subarrays, and to split, reshape, and join arrays.
1. Attributes of arrays - Determining the size, shape, memory
consumption, and data types of arrays
2. Indexing of arrays - Getting and setting the value of individual
array elements
3. Slicing of arrays - Getting and setting smaller subarrays within a
larger array
4. Reshaping of arrays - Changing the shape of a given array
5. Joining and splitting of arrays - Combining multiple arrays into
one, and splitting one array into many
1. Attributes of arrays – shape, size, datatype
• NumPy's N-dimensional array type is ndarray
• An ndarray is a fixed-size array in memory whose elements all share
one data type, e.g. integers or floating point values
• ndarray.ndim : number of dimensions
• ndarray.shape : tuple giving the size of each dimension
• ndarray.size : total no. of elements in the array
• ndarray.dtype : data type of the array's elements
• ndarray.itemsize : size of each element in bytes
• ndarray.nbytes : total size of the array in bytes
import numpy as np
np.random.seed(0)  # seed for reproducibility
x1 = np.random.randint(10, size=6)  # one-dimensional array
x2 = np.random.randint(10, size=(3, 4))  # two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # three-dimensional array
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
print("dtype:", x3.dtype)
Output
x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size: 60
dtype: int64
print("itemsize:", x3.itemsize, "bytes")
print("nbytes:", x3.nbytes, "bytes")
Output:
itemsize: 8 bytes
nbytes: 480 bytes
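The relationship between these attributes can be checked directly: nbytes is expected to equal size times itemsize. The sketch below sets the dtype explicitly, because the default integer type (int64 above) can differ between platforms; the arrays a and b are illustrative names, not part of the slides.

```python
import numpy as np

# Explicit dtype so the result does not depend on the platform default.
a = np.zeros((3, 4, 5), dtype=np.int64)

print(a.ndim)      # 3
print(a.shape)     # (3, 4, 5)
print(a.size)      # 60
print(a.itemsize)  # 8
print(a.nbytes)    # 480

# Total memory = number of elements * bytes per element.
assert a.nbytes == a.size * a.itemsize

# A smaller data type reduces memory consumption.
b = a.astype(np.int16)
print(b.nbytes)    # 120
```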
2. Array Indexing: Accessing Single Elements
• [ ] – Index the elements of the array; indices start at 0, and
negative indices count from the end of the array
x1
Output : array([5, 0, 3, 3, 7, 9])
x1[0]
Output : 5
x1[4]
Output : 7
Positive and negative indices of x1:
index           [0]  [1]  [2]  [3]  [4]  [5]
value            5    0    3    3    7    9
negative index  [-6] [-5] [-4] [-3] [-2] [-1]
• In a multidimensional array, elements are accessed with a
comma-separated tuple of indices
Code         Output
x2           array([[3, 5, 2, 4],
                    [7, 6, 8, 8],
                    [1, 6, 7, 7]])
x2[0, 0]     3
x2[2, 0]     1
x2[2, -1]    7
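The same index notation can also be used to set values. The sketch below rebuilds x1 and x2 with the values shown above so that it runs on its own; note that NumPy arrays have a fixed type, so a floating-point value assigned into an integer array is silently truncated.

```python
import numpy as np

# Same values as x1 and x2 above, written out explicitly.
x1 = np.array([5, 0, 3, 3, 7, 9])
x2 = np.array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

# Setting a single element of a two-dimensional array.
x2[0, 0] = 12
print(x2[0])   # [12  5  2  4]

# Assigning a float to an integer array truncates the value.
x1[0] = 3.14159
print(x1)      # [3 0 3 3 7 9]
```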
3. Array Slicing: Accessing Subarrays
• Subarrays are accessed with the slice notation, marked by the colon
(:) character: x[start:stop:step]. Defaults: start=0, stop=size of
dimension, step=1
• One-dimensional subarrays :
Code                                   Output
x = np.arange(10)
x                                      array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
x[:5]   # first five elements          array([0, 1, 2, 3, 4])
x[5:]   # elements from index 5 on     array([5, 6, 7, 8, 9])
x[4:7]  # middle subarray              array([4, 5, 6])
• With a negative step, the defaults for start and stop are swapped
Code                                                   Output
x[::2]    # every other element                        array([0, 2, 4, 6, 8])
x[1::2]   # every other element, starting at index 1   array([1, 3, 5, 7, 9])
x[::-1]   # all elements, reversed                     array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
x[5::-2]  # reversed every other from index 5          array([5, 3, 1])
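Slicing extends to multidimensional arrays, with one slice per dimension separated by commas. The sketch below is an illustration under that assumption; it also shows that a NumPy slice is a view of the original data rather than a copy, so modifying the slice modifies the original array unless copy() is used. The names sub and sub_copy are illustrative.

```python
import numpy as np

x2 = np.array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

print(x2[:2, :3])  # first two rows, first three columns
print(x2[:, 0])    # first column: [3 7 1]
print(x2[0, :])    # first row:    [3 5 2 4]

# A slice is a view: changing it changes the original array.
sub = x2[:2, :2]
sub[0, 0] = 99
print(x2[0, 0])    # 99

# An explicit copy leaves the original untouched.
sub_copy = x2[:2, :2].copy()
sub_copy[0, 0] = 42
print(x2[0, 0])    # 99
```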
4. Reshaping of Arrays
• reshape() method: changes the shape of an array; the size of the
original array must match the size of the new shape
• To put the numbers 1 through 9 in a 3×3 grid:
Code                                       Output
grid = np.arange(1, 10).reshape((3, 3))
print(grid)                                [[1 2 3]
                                            [4 5 6]
                                            [7 8 9]]
• Conversion of a one-dimensional array into a two-dimensional row or column matrix:
Code                                       Output
x = np.array([1, 2, 3])
# row vector via reshape
x.reshape((1, 3))                          array([[1, 2, 3]])
# column vector via reshape
x.reshape((3, 1))                          array([[1],
                                                  [2],
                                                  [3]])
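Two related idioms, shown here as a sketch: np.newaxis inside a slice gives the same row and column vectors as reshape, and passing -1 for one dimension lets NumPy infer it from the array size. A reshape whose total size does not match raises ValueError. The names row, col, and failed are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3])

# Row and column vectors via np.newaxis instead of reshape.
row = x[np.newaxis, :]
col = x[:, np.newaxis]
print(row.shape)  # (1, 3)
print(col.shape)  # (3, 1)

# -1 asks NumPy to infer that dimension from the total size.
grid = np.arange(1, 10).reshape((3, -1))
print(grid.shape)  # (3, 3)

# The sizes must match: 10 elements cannot fill a 3x3 grid.
failed = False
try:
    np.arange(10).reshape((3, 3))
except ValueError as err:
    failed = True
    print("reshape failed:", err)
```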
5. Array Concatenation and Splitting (Joining and Splitting of Arrays)
• Concatenation of arrays
• Splitting of arrays
Concatenation of arrays
Code                                                        Output
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])                                      array([1, 2, 3, 3, 2, 1])
z = [99, 99, 99]
print(np.concatenate([x, y, z]))                            [ 1  2  3  3  2  1 99 99 99]
grid = np.array([[1, 2, 3], [4, 5, 6]])
np.concatenate([grid, grid])  # along the first axis        array([[1, 2, 3],
                                                                   [4, 5, 6],
                                                                   [1, 2, 3],
                                                                   [4, 5, 6]])
np.concatenate([grid, grid], axis=1)  # along the second    array([[1, 2, 3, 1, 2, 3],
                                      # axis                       [4, 5, 6, 4, 5, 6]])
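When the arrays being joined have different numbers of dimensions, np.vstack (vertical stack) and np.hstack (horizontal stack) can be clearer than np.concatenate. A minimal sketch with illustrative values:

```python
import numpy as np

x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

# Vertically stack a 1-D array on top of a 2-D array.
stacked_v = np.vstack([x, grid])
print(stacked_v)
# [[1 2 3]
#  [9 8 7]
#  [6 5 4]]

# Horizontally stack a column onto the right of a 2-D array.
y = np.array([[99],
              [99]])
stacked_h = np.hstack([grid, y])
print(stacked_h)
# [[ 9  8  7 99]
#  [ 6  5  4 99]]
```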
Splitting of arrays
Code                                     Output
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)                        [1 2 3] [99 99] [3 2 1]
np.hsplit and np.vsplit
Code                                     Output
grid = np.arange(16).reshape((4, 4))
grid                                     array([[ 0,  1,  2,  3],
                                                [ 4,  5,  6,  7],
                                                [ 8,  9, 10, 11],
                                                [12, 13, 14, 15]])
upper, lower = np.vsplit(grid, [2])
print(upper)                             [[0 1 2 3]
                                          [4 5 6 7]]
print(lower)                             [[ 8  9 10 11]
                                          [12 13 14 15]]
left, right = np.hsplit(grid, [2])
print(left)                              [[ 0  1]
                                          [ 4  5]
                                          [ 8  9]
                                          [12 13]]
print(right)                             [[ 2  3]
                                          [ 6  7]
                                          [10 11]
                                          [14 15]]
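In the examples above, the list passed to np.split gives the split points, so N split points produce N + 1 subarrays. The sketch below illustrates this and also shows np.array_split, which accepts a number of parts that does not divide the array evenly. The names parts, halves, and thirds are illustrative.

```python
import numpy as np

x = np.arange(10)

# Three split points give four subarrays.
parts = np.split(x, [2, 5, 8])
print(len(parts))                    # 4
print([p.tolist() for p in parts])   # [[0, 1], [2, 3, 4], [5, 6, 7], [8, 9]]

# An integer asks for that many equal parts (must divide evenly).
halves = np.split(x, 2)
print([p.tolist() for p in halves])  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]

# np.array_split tolerates an uneven division.
thirds = np.array_split(x, 3)
print([p.tolist() for p in thirds])  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```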