Unit IV Python Part1

The document outlines the syllabus for a course on data wrangling using Python libraries, focusing on NumPy and Pandas. It details the data wrangling process, which includes discovery, organization, cleaning, enrichment, validation, and publishing of data. Additionally, it covers the basics of NumPy arrays, including their attributes, indexing, slicing, reshaping, and concatenation.

Uploaded by

dineshk20005
Copyright © All Rights Reserved

UNIT IV

PYTHON LIBRARIES FOR DATA WRANGLING

Dr A Mithila
CS3352 FOUNDATIONS OF DATA SCIENCE
SYLLABUS

– Basics Of NumPy Arrays
– Aggregations
– Computations On Arrays
– Comparisons, Masks, Boolean Logic
– Fancy Indexing
– Structured Arrays
– Data Manipulation With Pandas
– Data Indexing And Selection
– Operating On Data
– Missing Data
– Hierarchical Indexing
– Combining Datasets
– Aggregation And Grouping
– Pivot Tables
Data Wrangling

• Data wrangling is the process of gathering, collecting, and transforming raw data into another format for better understanding, decision-making, accessing, and analysis in less time.
• Data wrangling is also known as data munging.
Data Wrangling Process

1. Discovery: Before starting the wrangling process, it is critical to think about what may lie beneath your data. Think carefully about what results you anticipate from your data and what you will use it for once the wrangling process is complete. Once you've determined your objectives, you can gather your data.
2. Organization: After you've gathered your raw data within a particular dataset, you must structure your data. Due to the variety and complexity of data types and sources, raw data is often overwhelming at first glance.
3. Cleaning: When your data is organized, you can begin cleaning it. Data cleaning involves removing outliers, formatting nulls, and eliminating duplicate data. Cleaning data collected by web scraping can be more tedious than cleaning data collected from a database: web data can be highly unstructured and require more time than structured data from a database.
4. Data enrichment: This step requires that you take a step back from your data to determine if you have enough data to proceed. Finishing the wrangling process without enough data may compromise insights gathered from further analysis. For example, investors looking to analyze product review data will want a significant amount of data to portray the market and increase investment intelligence.
5. Validation: After determining that you have gathered enough data, you will need to apply validation rules to your data. Validation rules, performed in repetitive sequences, confirm that your data is consistent throughout your dataset. Validation rules also help ensure quality as well as security. This step follows logic similar to that used in data normalization, a data standardization process involving validation rules.
6. Publishing: The final step of the data munging process is data publishing. Data publishing involves preparing the data for future use. This may include providing notes and documentation of your wrangling process and creating access for other users and applications.
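
The cleaning and validation steps above can be sketched with Pandas. This is only an illustration under stated assumptions: the `raw` table, its column names, and the rule that a valid rating lies between 1 and 5 are all hypothetical and do not come from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one duplicate row, one missing rating,
# and one implausible rating (50.0).
raw = pd.DataFrame({
    "product": ["A", "A", "B", "C", "D"],
    "rating": [4.0, 4.0, np.nan, 5.0, 50.0],
})

# Cleaning: drop exact duplicate rows, then rows with a missing rating.
clean = raw.drop_duplicates().dropna(subset=["rating"])

# Cleaning: remove outliers, defined here (as an assumption) as
# ratings outside the range 1..5.
clean = clean[clean["rating"].between(1, 5)].reset_index(drop=True)

# Validation: rules that must hold for every remaining row.
rules = {
    "rating in range": bool(clean["rating"].between(1, 5).all()),
    "no missing values": bool(clean.notna().all().all()),
    "no duplicates": not clean.duplicated().any(),
}

print(clean)
print(rules)
```

Keeping the validation rules in a dictionary makes it easy to re-run the same checks after every change to the dataset, which matches the "repetitive sequences" described in step 5.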
The Basics of NumPy Arrays

• Data manipulation in Python is nearly synonymous with NumPy array manipulation: even newer tools like Pandas are built around the NumPy array.
• The examples that follow use NumPy array manipulation to access data and subarrays, and to split, reshape, and join arrays.
The basic array manipulations covered here are:

1. Attributes of arrays - Determining the size, shape, memory consumption, and data types of arrays
2. Indexing of arrays - Getting and setting the value of individual array elements
3. Slicing of arrays - Getting and setting smaller subarrays within a larger array
4. Reshaping of arrays - Changing the shape of a given array
5. Joining and splitting of arrays - Combining multiple arrays into one, and splitting one array into many
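1. Attributes of arrays – shape, size, datatype

• N-dimensional arrays, or ndarray
• Fixed-size array in memory whose elements all share one data type (for example, integers or floating-point values)
• dtype – attribute that gives the data type of the array's elements
• shape – attribute that returns a tuple giving the size of each dimension
• The attributes report the number of dimensions, shape, size, and item size:
• ndarray.ndim : number of dimensions
• ndarray.shape : size of each dimension
• ndarray.size : total number of elements in the array
• ndarray.dtype : data type of the elements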
import numpy as np
np.random.seed(0)  # seed for reproducibility
x1 = np.random.randint(10, size=6)          # one-dimensional array
x2 = np.random.randint(10, size=(3, 4))     # two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # three-dimensional array
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
print("dtype:", x3.dtype)

Output:
x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size: 60
dtype: int64
print("itemsize:", x3.itemsize, "bytes")  # size of one element
print("nbytes:", x3.nbytes, "bytes")      # total size of the array

Output:
itemsize: 8 bytes
nbytes: 480 bytes
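
The relationship between these attributes can be checked directly. The following is a minimal sketch; the arrays `a` and `b` are illustrative names, and the dtypes are set explicitly because the default integer type can differ between platforms.

```python
import numpy as np

# Explicit dtypes make the item size predictable across platforms.
a = np.zeros((3, 4, 5), dtype=np.int64)
b = np.zeros((3, 4, 5), dtype=np.float32)

print(a.ndim, a.shape, a.size)   # 3 (3, 4, 5) 60
print(a.itemsize, a.nbytes)      # 8 480
print(b.itemsize, b.nbytes)      # 4 240

# size is the product of the entries of shape;
# nbytes is size multiplied by itemsize.
assert a.size == 3 * 4 * 5
assert a.nbytes == a.size * a.itemsize
```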
2. Array Indexing: Accessing Single Elements

[ ] – Index the elements of the array

Code      Output
x1        array([5, 0, 3, 3, 7, 9])
x1[0]     5
x1[4]     7

Index:            [0]   [1]   [2]   [3]   [4]   [5]
x1:                5     0     3     3     7     9
Negative index:   [-6]  [-5]  [-4]  [-3]  [-2]  [-1]
In a multidimensional array, items are accessed with a comma-separated tuple of indices:

Code        Output
x2          array([[3, 5, 2, 4], [7, 6, 8, 8], [1, 6, 7, 7]])
x2[0, 0]    3
x2[2, 0]    1
x2[2, -1]   7
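
The same index notation is used to set values. A short sketch, with `x1` and `x2` written out literally so that it runs without depending on the random seed:

```python
import numpy as np

x1 = np.array([5, 0, 3, 3, 7, 9])
x2 = np.array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

print(x1[-1])    # 9, the last element
print(x1[-2])    # 7, the second element from the end

x2[0, 0] = 12    # set a single element by index
print(x2[0])     # [12  5  2  4]

# Arrays have a fixed dtype: a float assigned into an
# integer array is silently truncated.
x1[0] = 3.14159
print(x1[0])     # 3
```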
3. Array Slicing: Accessing Subarrays

• Subarrays are accessed with slice notation, marked by the colon (:) character: x[start:stop:step]. Any omitted value takes its default: start=0, stop=size of dimension, step=1.
• One-dimensional subarrays:

Code                                        Output
x = np.arange(10)
x                                           array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
x[:5]    # first five elements              array([0, 1, 2, 3, 4])
x[5:]    # elements from index 5 onward     array([5, 6, 7, 8, 9])
x[4:7]   # middle subarray                  array([4, 5, 6])
Code                                                      Output
x[::2]    # every other element                           array([0, 2, 4, 6, 8])
x[1::2]   # every other element, starting at index 1      array([1, 3, 5, 7, 9])
x[::-1]   # all elements, reversed                        array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
x[5::-2]  # reversed every other element, from index 5    array([5, 3, 1])
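
Slicing works the same way along each dimension of a multidimensional array, with the slices separated by commas. The sketch below also shows that a slice is a view of the original data rather than a copy; the names `sub` and `sub_copy` are illustrative.

```python
import numpy as np

x2 = np.array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

print(x2[:2, :3])       # first two rows, first three columns
print(x2[:, 0])         # first column: [3 7 1]
print(x2[::-1, ::-1])   # rows and columns both reversed

# A slice is a view: modifying it changes the original array.
sub = x2[:2, :2]
sub[0, 0] = 99
print(x2[0, 0])         # 99

# Use copy() when an independent subarray is needed.
sub_copy = x2[:2, :2].copy()
sub_copy[0, 0] = 42
print(x2[0, 0])         # still 99
```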
4. Reshaping of Arrays

• reshape() method

To put the numbers 1 through 9 in a 3×3 grid:

grid = np.arange(1, 10).reshape((3, 3))
print(grid)

Output:
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Conversion of a one-dimensional array into a two-dimensional row or column matrix:

x = np.array([1, 2, 3])

# row vector via reshape
x.reshape((1, 3))

Output:
array([[1, 2, 3]])

# column vector via reshape
x.reshape((3, 1))

Output:
array([[1],
       [2],
       [3]])
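
Two related idioms are worth knowing alongside reshape(): `np.newaxis` inside a slice, and `-1` as a placeholder dimension. The sketch below is illustrative; it also shows that reshaping fails when the element counts do not match.

```python
import numpy as np

x = np.array([1, 2, 3])

row = x[np.newaxis, :]   # shape (1, 3), same result as x.reshape((1, 3))
col = x[:, np.newaxis]   # shape (3, 1), same result as x.reshape((3, 1))
print(row.shape, col.shape)   # (1, 3) (3, 1)

# -1 asks NumPy to infer that dimension from the array size.
grid = np.arange(12).reshape((3, -1))
print(grid.shape)             # (3, 4)

# The new shape must hold exactly the same number of elements.
try:
    np.arange(10).reshape((3, 3))
except ValueError as err:
    print("reshape failed:", err)
```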
5. Array Concatenation and Splitting (Joining and Splitting of Arrays)

• Concatenation of arrays
• Splitting of arrays
Concatenation of arrays

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])

Output:
array([1, 2, 3, 3, 2, 1])

# more than two arrays can be concatenated at once
z = [99, 99, 99]
print(np.concatenate([x, y, z]))

Output:
[ 1  2  3  3  2  1 99 99 99]

# two-dimensional arrays are concatenated along the first axis by default
grid = np.array([[1, 2, 3], [4, 5, 6]])
np.concatenate([grid, grid])

Output:
array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])

# concatenate along the second axis (zero-indexed)
np.concatenate([grid, grid], axis=1)

Output:
array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])
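
When the arrays have different numbers of dimensions, `np.vstack` (vertical stack) and `np.hstack` (horizontal stack) are often clearer than `np.concatenate`. A minimal sketch with illustrative values:

```python
import numpy as np

x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

# vstack: stack the 1-D array on top of the 2-D grid as a new row.
v = np.vstack([x, grid])
print(v)
# [[1 2 3]
#  [9 8 7]
#  [6 5 4]]

# hstack: append a column; y needs one row per row of grid.
y = np.array([[99],
              [99]])
h = np.hstack([grid, y])
print(h)
# [[ 9  8  7 99]
#  [ 6  5  4 99]]
```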
Splitting of arrays

# the list [3, 5] gives the split points; N split points produce N + 1 subarrays
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)

Output:
[1 2 3] [99 99] [3 2 1]

The related functions np.vsplit and np.hsplit split an array by rows and by columns, respectively:

grid = np.arange(16).reshape((4, 4))
grid

Output:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)

Output:
[[0 1 2 3]
 [4 5 6 7]]
[[ 8  9 10 11]
 [12 13 14 15]]

left, right = np.hsplit(grid, [2])
print(left)
print(right)

Output:
[[ 0  1]
 [ 4  5]
 [ 8  9]
 [12 13]]
[[ 2  3]
 [ 6  7]
 [10 11]
 [14 15]]
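
The second argument of `np.split` can also be a single integer, which requests that many equal parts. The sketch below contrasts the two forms and adds `np.array_split`, which accepts lengths that do not divide evenly; the variable names are illustrative.

```python
import numpy as np

x = np.arange(9)

# An integer N splits into N equal parts (N must divide the length).
a, b, c = np.split(x, 3)
print(a, b, c)         # [0 1 2] [3 4 5] [6 7 8]

# A list of N split points always yields N + 1 subarrays.
parts = np.split(x, [2, 6])
print(len(parts))      # 3
print(parts[1])        # [2 3 4 5]

# array_split tolerates lengths that do not divide evenly.
uneven = np.array_split(np.arange(10), 3)
print([p.tolist() for p in uneven])   # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```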
