
Datascience Notes Unit-2

Uploaded by

kudumulaindra

UNIT-II:

Introduction to NumPy: Understanding Data Types in Python, The Basics of NumPy
Arrays, Computation on NumPy Arrays: Universal Functions, Aggregations: Min, Max,
and Everything in Between, Sorting Arrays. Data Manipulation with Pandas: Installing
and Using Pandas, Introducing Pandas Objects, Data Indexing and Selection,
Operating on Data in Pandas, Visualization with Matplotlib: General Matplotlib Tips,
Importing matplotlib, Setting Styles, show() or No show()? How to Display Your Plots,
Saving Figures to File. Simple Line Plots and Simple Scatter Plots.

Data Types in Python:


Python data types classify data items. The type of a value determines which
operations can be performed on it. Since everything is an object in Python,
data types are classes and values are instances (objects) of these classes.
The following are the standard or built-in data types in Python:
• Numeric - int, float, complex
• Sequence Type - str, list, tuple
• Mapping Type - dict
• Boolean - bool
• Set Type - set

1. Numeric Data Types in Python


The numeric data type in Python represents data that has a numeric value. A
numeric value can be an integer, a floating-point number, or a complex number.
These values are defined by the int, float and complex classes in Python.
• Integers - Represented by the int class. It contains positive or negative
whole numbers (without fractions or decimals). Python integers have arbitrary
precision, so their size is limited only by available memory.
• Float - Represented by the float class. It is a real number with a
floating-point representation, written with a decimal point or in exponent
notation (for example 1.5 or 2e3).
• Complex Numbers - Represented by the complex class. It is written as
(real part) + (imaginary part)j, for example 2+3j.
Example:
a = 5
print(type(a))
b = 5.0
print(type(b))
c = 2 + 4j
print(type(c))
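The properties of the three numeric types can be checked directly. The following is a minimal sketch; the variable names are illustrative only.

```python
# Integers have arbitrary precision: this value does not fit in 64 bits.
big = 2 ** 100
print(big, type(big))

# A float literal can use a decimal point or exponent notation.
x = 2e3
print(x, type(x))

# A complex number exposes its real and imaginary parts as floats.
c = 2 + 3j
print(c.real, c.imag)
```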

2. Sequence Data Types in Python


A sequence data type in Python is an ordered collection of similar or different
Python data types. Sequences allow storing multiple values in an organized and
efficient fashion. There are several sequence data types in Python:
• Python String
• Python List
• Python Tuple

String Data Type


Python strings are immutable sequences of Unicode characters. Python has no
separate character data type; a character is simply a string of length one.
Strings are represented by the str class.
Strings in Python can be created using single quotes, double quotes or triple
quotes. We can access individual characters of a string using an index.
Example:
s = 'Welcome to the Geeks World'
print(s)
# check data type
print(type(s))
# access string with index
print(s[1])
print(s[2])
print(s[-1])
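Indexing extends naturally to slicing, and because strings are immutable any "change" produces a new string. A small sketch with an illustrative string:

```python
s = "Welcome"

# Slicing returns a new string.
first_three = s[:3]   # characters at positions 0, 1 and 2
last_char = s[-1]     # negative indices count from the end

# Strings are immutable, so item assignment raises TypeError.
try:
    s[0] = "w"
except TypeError as err:
    print("Cannot modify a string in place:", err)

# Build a new string instead of modifying the old one.
lowered = "w" + s[1:]
print(first_three, last_char, lowered)
```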
List Data Type
Lists are ordered, mutable collections of data, similar to arrays in other
languages. They are very flexible because the items in a list do not need to be
of the same type.
Creating a List in Python
Lists in Python are created by placing comma-separated items inside square
brackets [].
# Empty list
a = []
# list with int values
a = [1, 2, 3]
print(a)
# list with mixed int and string
b = ["Geeks", "For", "Geeks", 4, 5]
print(b)
Tuple Data Type
Just like a list, a tuple is an ordered collection of Python objects. The key
difference between a tuple and a list is that tuples are immutable: a tuple
cannot be modified after it is created.
Creating a Tuple in Python
In Python Data Types, tuples are created by placing a sequence of values separated
by a ‘comma’ with or without the use of parentheses for grouping the data
sequence. Tuples can contain any number of elements and of any datatype (like
strings, integers, lists, etc.).
tup1 = tuple([1, 2, 3, 4, 5])
# access tuple items
print(tup1[0])
print(tup1[-1])
print(tup1[-3])
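The difference in mutability between lists and tuples can be seen by trying the same assignment on both. This is a minimal sketch with made-up values:

```python
nums_list = [1, 2, 3]
nums_tuple = (1, 2, 3)

nums_list[0] = 99  # lists can be changed in place

try:
    nums_tuple[0] = 99  # tuples cannot
except TypeError as err:
    print("Tuples are immutable:", err)

# Parentheses are optional; the comma is what makes a tuple.
point = 4, 5
single = (7,)  # a one-element tuple needs a trailing comma
print(nums_list, nums_tuple, point, single)
```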
3. Boolean Data Type in Python
The Boolean data type has one of two built-in values, True or False, and is
represented by the class bool. Non-Boolean objects can also be evaluated in a
Boolean context: objects that evaluate to True are called truthy, and those
that evaluate to False are called falsy.
Example:
print(type(True))
print(type(False))
# print(type(true))  # NameError: Python is case-sensitive, so 'true' is not defined
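Evaluation of non-Boolean objects in a Boolean context can be checked with bool(). A short sketch using a hand-picked list of values:

```python
values = [0, 1, "", "text", [], [0], None]

# Zero, empty containers, empty strings and None are falsy; the rest are truthy.
for v in values:
    print(repr(v), "->", bool(v))

# The same rule is applied by `if` and by comprehension filters.
truthy = [v for v in values if v]
print(truthy)
```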

4. Set Data Type in Python


A set is an unordered collection of unique items. It is iterable and mutable,
has no duplicate elements, and the order of its elements is undefined.
Create a Set in Python:
Sets can be created by using the built-in set() function with an iterable
object, or by placing comma-separated items inside curly braces. The elements
of a set need not be of the same type, but each element must be hashable. Note
that {} creates an empty dictionary; use set() to create an empty set.
Example:
# set
unique_numbers = {1, 2, 3, 2, 4}
unique_numbers.add(5)
print("Unique Numbers:", unique_numbers)
print("Is 3 in set?", 3 in unique_numbers)
# initializing empty set
s1 = set()
s1 = set("GeeksForGeeks")
print("Set with the use of String: ", s1)
s2 = set(["Geeks", "For", "Geeks"])
print("Set with the use of List: ", s2)
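Because sets hold only unique elements, they are convenient for removing duplicates and for comparing collections. The sketch below uses the standard set operators on two small, made-up sets:

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

union = a | b      # elements in either set
common = a & b     # elements in both sets
only_a = a - b     # elements in a but not in b

# Converting a list to a set removes duplicates; sorted() gives a stable order.
deduped = sorted(set([3, 1, 3, 2, 1]))
print(union, common, only_a, deduped)
```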
5. Dictionary Data Type
A dictionary in Python is a collection that stores data like a map. Unlike
other Python data types that hold only a single value as an element, a
dictionary holds key: value pairs, which makes looking up a value by its key
fast. Within each pair the key is separated from its value by a colon (:),
and the pairs are separated from each other by commas.
Create a Dictionary in Python:
Values in a dictionary can be of any datatype and can be duplicated, whereas keys
can’t be repeated and must be immutable. The dictionary can also be created by the
built-in function dict().
# initialize empty dictionary
d = {}
d = {1: 'Geeks', 2: 'For', 3: 'Geeks'}
print(d)
# creating dictionary using dict() constructor
d1 = dict({1: 'Geeks', 2: 'For', 3: 'Geeks'})
print(d1)
# Accessing an element using key
d = {1: 'Geeks', 'name': 'For', 3: 'Geeks'}
print(d['name'])
# Accessing an element using get
print(d.get(3))
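The rule that keys must be immutable, and the difference between [] and get(), can be illustrated as follows. The keys and values are made up for the example:

```python
d = {"name": "For", (1, 2): "tuple key"}  # str and tuple keys are hashable

try:
    d[[1, 2]] = "list key"  # a list is mutable, so it cannot be a key
except TypeError as err:
    print("Invalid key:", err)

d["name"] = "Geeks"         # assigning to an existing key overwrites its value
missing = d.get("age")      # get() returns None for a missing key
fallback = d.get("age", 0)  # or a default value when one is given
print(d, missing, fallback)
```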

The Basics of NumPy Arrays


NumPy (Numerical Python) is a powerful library for numerical computations
in Python. Its core data structure is the ndarray, a multidimensional container
whose elements all have the same data type, optimized for numerical and
scientific computation in Python.
Installing NumPy in Python
To begin using NumPy, you need to install it first. This can be done through pip
command:
pip install numpy
Once installed, import the library with the alias np
import numpy as np
🔹 1. What is a NumPy Array?
A NumPy array (ndarray) is a grid of values, all of the same type, and is indexed by a
tuple of non-negative integers.
import numpy as np
arr = np.array([1, 2, 3])
print(arr) # [1 2 3]
print(type(arr)) # <class 'numpy.ndarray'>

🔹 2. Creating Arrays:
 To start using NumPy, import it as follows:

import numpy as np

• The NumPy array object is called ndarray. It allows us to work with arrays in
Python, and is created using the array() function.

Example:
import numpy as np

# Creating a 1D array
x = np.array([1, 2, 3])

# Creating a 2D array
y = np.array([[1, 2], [3, 4]])

# Creating a 3D array
z = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(x)
print(y)
print(z)

Key Attributes of NumPy Arrays


NumPy arrays have attributes that provide information about the array:
• shape: Returns the dimensions of the array.
• dtype: Returns the data type of the elements.
• ndim: Returns the number of dimensions.
Example:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape)
print(arr.dtype)
print(arr.ndim)
Output (the default integer dtype is platform-dependent and may be int32 on some systems)
(2, 3)
int64
2
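Because all elements of an array share one type, NumPy picks a common dtype when the input values are mixed. A minimal sketch, with illustrative array names:

```python
import numpy as np

ints = np.array([1, 2, 3])                       # all integers
mixed = np.array([1, 2.5, 3])                    # one float upcasts every element
forced = np.array([1, 2, 3], dtype=np.float64)   # dtype requested explicitly

print(ints.dtype, mixed.dtype, forced.dtype)
print(mixed.shape, mixed.ndim, mixed.size)
```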

Operations on NumPy Arrays:


NumPy supports element-wise and matrix operations, including addition,
subtraction, multiplication, and division:
Example:
import numpy as np
# Element-wise addition
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
print(x + y) # Output: [5 7 9]

# Matrix multiplication
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(np.dot(a, b))

Aggregations:
a.sum() # Sum of all elements
a.mean() # Average
a.max() # Maximum
a.min() # Minimum
a.std() # Standard deviation
🔹Reshaping and Transposing
a = np.arange(6) # [0 1 2 3 4 5]
a.reshape(2, 3) # [[0 1 2], [3 4 5]]

b = np.array([[1, 2], [3, 4]])


b.T # Transpose: [[1 3], [2 4]]

Indexing and Slicing


a = np.array([10, 20, 30, 40, 50])
print(a[1]) # 20
print(a[1:4]) # [20 30 40]
b = np.array([[1, 2, 3], [4, 5, 6]])
print(b[0, 2]) # 3 (row 0, col 2)
print(b[:, 1]) # [2 5] (all rows, col 1)
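Slices can also select sub-blocks and use a step. One point worth knowing is that a basic slice is a view onto the original data rather than a copy. A short sketch:

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6]])

block = b[:, 1:]                  # all rows, columns 1 and 2
every_other = np.arange(10)[::2]  # step of 2: [0 2 4 6 8]

# Writing through the view changes the original array.
block[0, 0] = 99
print(b[0, 1])

# Use .copy() when an independent array is needed.
independent = b[:, 1:].copy()
independent[0, 0] = -1
print(b[0, 1])
```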

Computation on NumPy Arrays


NumPy enables fast and efficient computation on arrays using vectorized
operations, meaning operations are applied element-wise without explicit loops.
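To make the idea of vectorization concrete, the sketch below computes the same squares with an explicit Python loop and with a single vectorized expression. The array contents are arbitrary.

```python
import numpy as np

values = np.arange(1, 6)  # [1 2 3 4 5]

# Explicit loop: one Python-level operation per element.
squares_loop = np.empty_like(values)
for i in range(len(values)):
    squares_loop[i] = values[i] ** 2

# Vectorized: the loop runs inside NumPy's compiled code.
squares_vec = values ** 2

print(squares_vec)
print(np.array_equal(squares_loop, squares_vec))
```

Both approaches give the same result; the vectorized form is shorter and is usually much faster on large arrays.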

🔹 1. Arithmetic Operations (Element-wise)


Given two arrays:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
You can perform operations directly:
a+b # array([5, 7, 9])
a-b # array([-3, -3, -3])
a*b # array([4, 10, 18])
a/b # array([0.25, 0.4 , 0.5 ])
a ** 2 # array([1, 4, 9])
Example1:
# Python code to perform arithmetic operations
import numpy as np
# Initializing the array
arr1 = np.arange(4, dtype=np.float64).reshape(2, 2)  # np.float_ was removed in NumPy 2.0
print('First array:')
print(arr1)
print('\nSecond array:')
arr2 = np.array([12, 12])
print(arr2)
print('\nAdding the two arrays:')
print(np.add(arr1, arr2))
print('\nSubtracting the two arrays:')
print(np.subtract(arr1, arr2))
print('\nMultiplying the two arrays:')
print(np.multiply(arr1, arr2))
Example2:
# Python code to perform mod function
# on NumPy array
import numpy as np
arr = np.array([5, 15, 20])
arr1 = np.array([2, 5, 9])
print('First array:')
print(arr)
print('\nSecond array:')
print(arr1)
print('\nApplying mod() function:')
print(np.mod(arr, arr1))
print('\nApplying remainder() function:')
print(np.remainder(arr, arr1))

🔹 2. Universal Functions (ufuncs)


NumPy provides ufuncs — fast, vectorized functions:

Function                  Description
np.add(a, b)              Element-wise addition
np.subtract(a, b)         Element-wise subtraction
np.multiply(a, b)         Element-wise multiplication
np.divide(a, b)           Element-wise division
np.power(a, b)            Raise elements to a power
np.exp(a)                 Exponential
np.log(a)                 Natural logarithm
np.sqrt(a)                Square root
np.sin(a) / np.cos(a)     Trigonometric

Example:
x = np.array([1, 2, 3])
np.exp(x) # [2.718, 7.389, 20.085]
Example2:
# Python code to perform power operation
import numpy as np
arr = np.array([5, 10, 15])
print('First array is:')
print(arr)
print('\nApplying power function:')
print(np.power(arr, 2))
print('\nSecond array is:')
arr1 = np.array([1, 2, 3])
print(arr1)
print('\nApplying power function again:')
print(np.power(arr, arr1))
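Binary ufuncs such as np.add and np.multiply also carry methods like reduce and accumulate, which apply the operation repeatedly along an array. This goes slightly beyond the examples above and is offered only as a brief sketch:

```python
import numpy as np

x = np.array([1, 2, 3, 4])

total = np.add.reduce(x)         # 1 + 2 + 3 + 4
running = np.add.accumulate(x)   # running sum of the elements
product = np.multiply.reduce(x)  # 1 * 2 * 3 * 4

print(total, running, product)
```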

🔹 3. Aggregation Functions
These compute summary statistics on array data:
a = np.array([1, 2, 3, 4])

a.sum() # 10
a.mean() # 2.5
a.max() # 4
a.min() # 1
a.std() # Standard deviation
a.var() # Variance
For multi-dimensional arrays:
b = np.array([[1, 2], [3, 4]])

b.sum(axis=0) # column-wise: [4, 6]


b.sum(axis=1) # row-wise: [3, 7]
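The axis argument works the same way for min, max and related functions. The sketch below uses a small, made-up table of scores (rows are students, columns are tests):

```python
import numpy as np

scores = np.array([[70, 85, 90],
                   [60, 95, 80]])

lowest = scores.min()                # smallest value in the whole array
best_per_test = scores.max(axis=0)   # maximum of each column
worst_per_row = scores.min(axis=1)   # minimum of each row
best_test = scores.argmax(axis=1)    # column position of each row's maximum
middle = np.median(scores)           # median of all values

print(lowest, best_per_test, worst_per_row, best_test, middle)
```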

🔹 4. Comparison and Boolean Logic


a = np.array([1, 2, 3])

a>2 # [False, False, True]


a == 2 # [False, True, False]

# Boolean indexing
a[a > 1] # [2, 3]
Logical operations:
np.any(a > 2) # True
np.all(a > 0) # True
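Boolean masks can be combined with & (and) and | (or). Each comparison needs its own parentheses because these operators bind more tightly than the comparisons. A short sketch:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

# Values that are greater than 2 AND even.
mask = (a > 2) & (a % 2 == 0)
selected = a[mask]

# Values that are less than 2 OR greater than 5.
either = a[(a < 2) | (a > 5)]

# True counts as 1, so summing a mask counts the matches.
count = int(np.sum(a > 2))
print(selected, either, count)
```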

🔹 5. Matrix Computation
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Element-wise product
A*B

# Matrix product (dot product)


np.dot(A, B)
A@B # Python 3.5+ shortcut for dot product

Broadcasting also lets arrays of different shapes be combined. Here the 1-D array B is added to each row of A:


A = np.array([[1, 2], [3, 4]])
B = np.array([1, 2])
A+B # [[2, 4], [4, 6]]
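The shape of the smaller array decides whether it is applied to each row or to each column. A minimal sketch with illustrative names:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])   # shape (2, 2)
row = np.array([10, 20])         # shape (2,): added to every row
col = np.array([[10], [20]])     # shape (2, 1): added to every column

plus_row = A + row
plus_col = A + col
doubled = A * 2                  # a scalar is broadcast to every element

print(plus_row)
print(plus_col)
print(doubled)
```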

Sorting Arrays:
Sorting means putting elements in an ordered sequence.
An ordered sequence is any sequence whose elements follow an order, such as
numeric or alphabetical, ascending or descending.
NumPy provides the function np.sort(), which returns a sorted copy of an array, and the ndarray method sort(), which sorts an array in place.
Example:
Sort the array:
import numpy as np
arr = np.array([3, 2, 0, 1])
print(np.sort(arr))
• You can also sort arrays of strings, or any other data type:
Example
Sort the array alphabetically:
import numpy as np
arr = np.array(['banana', 'cherry', 'apple'])
print(np.sort(arr))
Sorting a 2-D Array:
If you use np.sort() on a 2-D array, each row is sorted independently, because sorting is done along the last axis by default:
Example
Sort a 2-D array:
import numpy as np
arr = np.array([[3, 2, 4], [5, 0, 1]])
print(np.sort(arr))
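A few related operations are often needed alongside np.sort(): sorting in descending order, getting the order of indices with np.argsort(), sorting down the columns, and sorting in place. A brief sketch:

```python
import numpy as np

arr = np.array([3, 2, 0, 1])

sorted_copy = np.sort(arr)       # arr itself is unchanged
descending = np.sort(arr)[::-1]  # reverse the sorted copy
order = np.argsort(arr)          # indices that would sort arr

grid = np.array([[3, 2, 4], [5, 0, 1]])
by_column = np.sort(grid, axis=0)  # sort down each column

arr.sort()                       # sorts arr in place and returns None
print(sorted_copy, descending, order)
print(by_column)
```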

Data Manipulation with Pandas:

• Pandas is a Python library used for working with data sets.
• It has functions for analyzing, cleaning, exploring, and manipulating data.
• The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The
library was created by Wes McKinney in 2008.
• Pandas allows us to analyze big data and draw conclusions based on statistical
theories.
• Pandas can clean messy data sets, and make them readable and relevant.
Pandas is a powerful Python library used for data analysis and manipulation. It
provides two main data structures:
• Series – 1-dimensional labeled array
• DataFrame – 2-dimensional labeled data structure (like a table)

Installing Pandas

If you're using Jupyter Notebook or PyCharm, install Pandas from a terminal with:

pip install pandas

Or in a Jupyter Notebook cell:

!pip install pandas

Import Pandas in Python


Now that we have installed pandas on the system, let's see how we can import it
to make use of it.
For this, go to a Jupyter Notebook or open a Python file, and write the following
code:
import pandas as pd
Data Structures in Pandas Library:
Pandas provides two data structures for manipulating data, which are as follows:

1. Pandas Series:
A Pandas Series is a one-dimensional labeled array capable of holding data of any
type (integer, string, float, Python objects etc.). The axis labels are collectively
called the index.
A Series can be created by loading data from existing storage such as a SQL
database, a CSV file or an Excel file. It can also be created from lists,
dictionaries, scalar values, etc.

Example: Creating a series using the Pandas Library.


import pandas as pd
import numpy as np
ser = pd.Series()
print("Pandas Series: ", ser)

data = np.array(['r', 'a', 'm'])

ser = pd.Series(data)
print("Pandas Series:\n", ser)
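The labels of a Series do not have to be the default 0, 1, 2, ... positions. The following sketch, with made-up data, creates Series with custom labels from a list, a dictionary and a scalar:

```python
import pandas as pd

# From a list, with explicit labels.
marks = pd.Series([85, 90, 95], index=["a", "b", "c"])
by_label = marks["b"]        # access by label
by_position = marks.iloc[0]  # access by integer position

# From a dictionary: the keys become the index.
from_dict = pd.Series({"x": 1.5, "y": 2.5})

# From a scalar: the value is repeated for every label.
constant = pd.Series(7, index=[0, 1, 2])

print(marks)
print(from_dict)
print(constant)
```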

2. Pandas DataFrame:
A Pandas DataFrame is a two-dimensional data structure with labeled axes (rows and
columns). It can be created by loading data from existing storage such as a
SQL database, a CSV file or an Excel file. It can also be created from lists,
dictionaries, a list of dictionaries etc.

Example: Creating a DataFrame Using the Pandas Library


import pandas as pd

df = pd.DataFrame()
print(df)

lst = ['Data', 'For', 'Datascience']

df = pd.DataFrame(lst)
print(df)
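In practice a DataFrame is more often built from a dictionary of columns or a list of row dictionaries than from a flat list. A minimal sketch with made-up names and ages:

```python
import pandas as pd

# Dictionary of lists: each key becomes a column.
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# List of dictionaries: each dictionary becomes a row.
records = pd.DataFrame([
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "Age": 30},
])

print(df)
print(df.shape, list(df.columns))
```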

Indexing and Selecting Data with Pandas:


Indexing and selecting data helps us to efficiently retrieve specific rows, columns or
subsets of data from a DataFrame. Whether we're filtering rows based on
conditions, extracting particular columns or accessing data by labels or positions,
mastering these techniques helps to work effectively with large datasets.
1. Indexing Data using the [] Operator
The [] operator is the basic and frequently used method for indexing in Pandas. It
allows us to select columns and filter rows based on conditions. This method can be
used to select individual columns or multiple columns.
1. Selecting a Single Column
To select a single column, we simply refer to the column name inside square brackets:
import pandas as pd
data = pd.read_csv("/content/nba.csv")
df = pd.DataFrame(data)
df["column name"]
2. Selecting Multiple Columns
To select multiple columns, pass a list of column names inside the [] operator:
df[["colname1", "colname2"]]
2. Indexing with .loc[ ] (stop index included)
The .loc[] indexer is used for label-based indexing. It allows us to access rows and
columns by their labels. Unlike the indexing operator, it can select subsets of rows
and columns simultaneously, which offers flexibility in data retrieval. When slicing
by label, the stop label is included in the result.
1. Selecting a Single Row by Label
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 22],
'Score': [85, 90, 95]
}
df = pd.DataFrame(data, index=['a', 'b', 'c'])
# Access row 'b'
print(df.loc['b'])
2. Selecting a Single Value and Multiple Rows by Label
# Access 'Score' column of row 'a'
print(df.loc['a', 'Score'])
# Access multiple rows and columns
print(df.loc[['a', 'c'], ['Name', 'Score']])
3. Selecting Specific Rows and Columns
We can select specific rows and columns by providing lists of row labels and column
names:
Dataframe.loc[["row1", "row2"], ["column1", "column2", "column3"]]

4. Selecting All Rows and Specific Columns


We can select all rows and specific columns by using a colon (:) to indicate all rows,
followed by the list of column names:
Dataframe.loc[:, ["column1", "column2", "column3"]]
# here data is the NBA DataFrame loaded with pd.read_csv
all_rows_specific_columns = data.loc[:, ["Team", "Position", "Salary"]]
print(all_rows_specific_columns)

3. Indexing with .iloc[ ]


The .iloc[] indexer is used for position-based indexing. It allows us to access rows
and columns by their integer positions. It is similar to .loc[] but accepts only
integer positions, and unlike .loc[] its slices exclude the stop position.
1. Selecting a Single Row by Position
To select a single row using .iloc[] provide the integer position of the row:
import pandas as pd
data = pd.read_csv("/content/nba.csv", index_col="Name")
row = data.iloc[3]
print(row)
2. Selecting Multiple Rows by Position
We can select multiple rows by passing a list of integer positions:
rows = data.iloc[[3, 5, 7]]
print(rows)
3. Selecting Specific Rows and Columns by Position
We can select specific rows and columns by providing integer positions for both rows
and columns:
selection = data.iloc[[3, 4], [1, 2]]
print(selection)

4. Selecting All Rows and Specific Columns by Position


To select all rows and specific columns, use a colon [:] for all rows and a list of
column positions:
selection = data.iloc[:, [1, 2]]
print(selection)
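The difference between label-based and position-based slicing is easy to miss, so here is a small self-contained sketch. The DataFrame is made up for the example and does not come from the nba.csv file used above.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Score": [85, 90, 95]},
    index=["a", "b", "c"],
)

by_label = df.loc["a":"b"]   # rows 'a' and 'b': the stop label is included
by_position = df.iloc[0:1]   # row 0 only: the stop position is excluded

same_cell_label = df.loc["b", "Score"]  # row label, column label
same_cell_position = df.iloc[1, 1]      # row position, column position

print(by_label)
print(by_position)
print(same_cell_label, same_cell_position)
```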

4. Other Useful Indexing Methods


Pandas also provides several other methods that we may find useful for indexing and
manipulating DataFrames:
1. .head(): Returns the first n rows of a DataFrame
print(data.head(5))
2. .tail(): Returns the last n rows of a DataFrame
print(data.tail(5))
3. .describe(): Generates descriptive statistics of a DataFrame or Series, such as
count, mean, standard deviation, min, max and the 25%, 50% and 75% percentiles.
4. .shape: Gives the number of rows and columns of a DataFrame as a tuple.
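The four helpers above can be tried on any small DataFrame. The following sketch uses made-up numeric data:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [1.0, 2.0, 3.0, 4.0]})

first_two = df.head(2)   # first 2 rows
last_two = df.tail(2)    # last 2 rows
n_rows, n_cols = df.shape
summary = df.describe()  # one row per statistic, one column per numeric column

print(first_two)
print(last_two)
print(n_rows, n_cols)
print(summary)
```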

Some Numeric Operations in Pandas:

Pandas data structures such as Series and DataFrame make it easy to work with
numeric values. They simplify arithmetic operations, statistical calculations,
and data aggregation, which form the backbone of efficient data analysis.

Numeric Value Operations In Pandas Python


Below are examples of numeric value operations in Pandas Python.
• Arithmetic Operations
• Statistical Aggregation
• Element-wise Functions
• Comparison and Filtering
• Handling Missing Data
Arithmetic Operations
Pandas supports basic arithmetic operations such as addition, subtraction,
multiplication, and division on Series and DataFrames. Let's look at a simple
example:
import pandas as pd
# Creating two Series
series1 = pd.Series([1, 2, 3, 4])
series2 = pd.Series([5, 6, 7, 8])
# Addition
result_addition = series1 + series2
print(result_addition)
Output:
0     6
1     8
2    10
3    12
dtype: int64
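Arithmetic between two Series matches values by index label, not by position. The sketch below, with illustrative labels, shows what happens when the labels only partly overlap:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Labels present in only one Series give NaN in the result.
total = s1 + s2

# add() with fill_value treats a missing label as 0 instead.
filled = s1.add(s2, fill_value=0)

print(total)
print(filled)
```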
Statistical Aggregation
Pandas provides various statistical aggregation functions to summarize numeric data.
Examples include mean(), sum(), min(), max(), etc. Here's a snippet demonstrating
the use of the mean() function:
# Creating a DataFrame
data = {'A': [10, 20, 30], 'B': [5, 15, 25]}
df = pd.DataFrame(data)
# Calculating mean for each column
mean_values = df.mean()
print(mean_values)
Output:
A    20.0
B    15.0
dtype: float64
Element-wise Functions
Pandas allows the application of functions to Series or DataFrames on an element-
wise basis. This includes the ability to utilize NumPy functions, such as the square
root function, or apply custom functions. Here's an illustrative example using the
NumPy sqrt function.
import numpy as np
# Creating a DataFrame
data = {'A': [10, 20, 30], 'B': [5, 15, 25]}
df = pd.DataFrame(data)
# Applying element-wise square root
result_sqrt = np.sqrt(df)
print(result_sqrt)
Output:
          A         B
0  3.162278  2.236068
1  4.472136  3.872983
2  5.477226  5.000000
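Custom functions can be applied with apply(). In the sketch below, to_percent is a hypothetical helper written for this example; applied to a Series it runs once per element, while applied to a DataFrame it runs once per column.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [5, 15, 25]})

def to_percent(value):
    """Hypothetical helper: convert a whole-number percentage to a fraction."""
    return value / 100

# Series.apply calls the function on each element.
a_fraction = df["A"].apply(to_percent)

# DataFrame.apply calls the function on each column (a Series).
col_range = df.apply(lambda col: col.max() - col.min())

print(a_fraction)
print(col_range)
```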

Comparison and Filtering


You can use comparison operators to create Boolean masks for filtering data in
Pandas. For instance, keeping only the rows whose value is below a certain threshold:
import pandas as pd
data = {'col1': [1, 2, 3, 4, 5], 'col2': ['A', 'B', 'C', 'D', 'E']}
df = pd.DataFrame(data)
filtered_df = df[df['col1'] < 3]
print(filtered_df)
Output:
   col1 col2
0     1    A
1     2    B

Handling Missing Data:


Pandas provides methods for handling missing or NaN (Not a Number) values. The
fillna() function can be used to fill missing values with a specified constant:
import pandas as pd
import numpy as np
# Creating a Series with a missing value
series_with_nan = pd.Series([1, 2, np.nan, 4])
# Filling missing values with 0
filled_series = series_with_nan.fillna(0)
print(filled_series)
Output:
0    1.0
1    2.0
2    0.0
3    4.0
dtype: float64
Working with Missing Data in Pandas
In Pandas, missing data occurs when some values are missing or not collected
properly. These missing values are represented as:
• None: A Python object used to represent missing values in object-type arrays.
• NaN: A special floating-point value from NumPy which is recognized by all
systems that use IEEE floating-point standards.
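How the two markers behave can be checked with a couple of small Series. This is a sketch with made-up values:

```python
import numpy as np
import pandas as pd

# In numeric data, None is converted to NaN and the Series becomes float.
s = pd.Series([1, None, 3])
print(s.dtype)

# NaN never compares equal to itself, so use isnull() rather than ==.
print(np.nan == np.nan)
missing_mask = s.isnull().tolist()

# Aggregations skip NaN by default.
total = s.sum()

# Missing entries in text data are detected in the same way.
names = pd.Series(["Ann", None])
n_missing_names = int(names.isnull().sum())
print(missing_mask, total, n_missing_names)
```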

Checking Missing Values in Pandas


Pandas provides two important functions for detecting whether a value is
missing. They make data cleaning and preprocessing of a DataFrame or Series
easier:
1. Using isnull()
isnull() returns a DataFrame of Boolean values where True represents missing data
(NaN). This is useful when we want to find and fill missing data in a dataset.

Example 1: Finding Missing Values in a DataFrame


We will be using Numpy and Pandas libraries for this implementation.
import pandas as pd
import numpy as np
d = {'First Score': [100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(d)
mv = df.isnull()
print(mv)
output:
2. Checking for Non-Missing Values Using notnull()
notnull() function returns a DataFrame with Boolean values where True indicates
non-missing (valid) data. This function is useful when we want to focus only on the
rows that have valid, non-missing values.
Example 1: Identifying Non-Missing Values in a DataFrame

import pandas as pd
import numpy as np
d = {'First Score': [100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(d)
nmv = df.notnull()
print(nmv)
Output

Filling Missing Values in Pandas


The following functions allow us to replace missing values with a specified value,
or to estimate them by interpolation.
1. Using fillna()
fillna() is used to replace missing values (NaN) with a given value. Let's see
various examples of this.
Example 1: Fill Missing Values with Zero
import pandas as pd
import numpy as np
d = {'First Score': [100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(d)
df.fillna(0)
Output

Example 2: Fill with Previous Value (Forward Fill)


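The ffill() method fills each missing value with the previous valid value (forward
fill). Older code writes this as df.fillna(method='ffill'), which is deprecated in
recent pandas versions. A missing value with no earlier value in its column stays NaN.
df.ffill()
Output
   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2         90.0          56.0         80.0
3         95.0          56.0         98.0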

Example 3: Fill with Next Value (Backward Fill)


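The bfill() method fills each missing value with the next valid value (backward
fill). Older code writes this as df.fillna(method='bfill'), which is deprecated in
recent pandas versions. A missing value with no later value in its column stays NaN.
df.bfill()
Output
   First Score  Second Score  Third Score
0        100.0          30.0         40.0
1         90.0          45.0         40.0
2         95.0          56.0         80.0
3         95.0           NaN         98.0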

2. Using replace()
Use replace() function to replace NaN values with a specific value.
Example
import pandas as pd
import numpy as np
# Create a DataFrame with missing values
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, np.nan, 30, np.nan]
}
df = pd.DataFrame(data)
# Replace NaN with a specific value using replace()
df_replaced = df.replace(to_replace=np.nan, value=0)
print(df_replaced)
Output:
Name Age
0 Alice 25.0
1 Bob 0.0
2 Charlie 30.0
3 David 0.0

3. Using interpolate()
The interpolate() function fills missing values using interpolation techniques such as
the linear method.
Example
import pandas as pd
df = pd.DataFrame({"A": [12, 4, 5, None, 1],
"B": [None, 2, 54, 3, None],
"C": [20, 16, None, 3, 8],
"D": [14, 3, None, None, 6]})
print(df)
Output

Let's interpolate the missing values using the linear method. This method ignores the
index and treats the values as equally spaced.
df.interpolate(method ='linear', limit_direction ='forward')
Output
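To see what linear interpolation does to individual values, it helps to look at a single small Series. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Two gaps between 10 and 40 are filled with equally spaced values.
s = pd.Series([10.0, np.nan, np.nan, 40.0])
filled = s.interpolate(method="linear")
print(filled)

# With limit_direction='forward', a leading NaN has no earlier value
# to interpolate from, so it is left as NaN.
leading = pd.Series([np.nan, 2.0, np.nan, 4.0])
forward = leading.interpolate(method="linear", limit_direction="forward")
print(forward)
```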
Dropping Missing Values in Pandas

1. Dropping Rows with At Least One Null Value


Remove rows that contain at least one missing value.
Example
import pandas as pd
import numpy as np
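# named d rather than dict to avoid shadowing the built-in dict type
d = {'First Score': [100, 90, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score': [52, 40, 80, 98],
'Fourth Score': [np.nan, np.nan, np.nan, 65]}
df = pd.DataFrame(d)

df.dropna()

Output
   First Score  Second Score  Third Score  Fourth Score
3         95.0          56.0           98          65.0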

2. Dropping Rows with All Null Values


We can drop rows where all values are missing using dropna(how='all').

Example
d = {'First Score': [100, np.nan, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score': [52, np.nan, 80, 98],
'Fourth Score': [np.nan, np.nan, np.nan, 65]}
df = pd.DataFrame(d)

df.dropna(how='all')
Output
   First Score  Second Score  Third Score  Fourth Score
0        100.0          30.0         52.0           NaN
2          NaN          45.0         80.0           NaN
3         95.0          56.0         98.0          65.0

3. Dropping Columns with At Least One Null Value


To remove columns that contain at least one missing value we use dropna(axis=1).
Example
d = {'First Score': [100, np.nan, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score': [52, np.nan, 80, 98],
'Fourth Score': [60, 67, 68, 65]}
df = pd.DataFrame(d)
df.dropna(axis=1)
Output
   Fourth Score
0            60
1            67
2            68
3            65
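The three variants of dropna() can be compared side by side on one small DataFrame. The sketch below uses made-up data; note that dropna() returns a new DataFrame and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, np.nan, np.nan],
    "B": [4.0, 5.0, np.nan],
    "C": [7.0, 8.0, np.nan],
})

any_null = df.dropna()               # keep rows with no missing value
all_null = df.dropna(how="all")      # drop only rows that are entirely missing
complete_cols = all_null.dropna(axis=1)  # then drop columns that still have NaN

print(any_null)
print(all_null)
print(complete_cols)
print(df.shape)  # the original DataFrame is untouched
```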

Data Visualization using Matplotlib in Python:


Matplotlib is a widely used Python library for creating static, animated and
interactive data visualizations. It is built on top of NumPy and can handle
large datasets to create various types of plots such as line charts, bar
charts, scatter plots, etc. These visualizations help us to understand data better by
presenting it clearly through graphs and charts.
Matplotlib and Pyplot:
Matplotlib is a versatile toolkit that allows for the creation of static, animated, and
interactive visualizations in the Python programming language.
Generally, matplotlib offers two APIs:
• The pyplot API: a state-based interface for making plots through functions in matplotlib.pyplot.
• Object-Oriented API: Figure and Axes objects that are created and modified
explicitly, giving greater flexibility than pyplot.
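The two styles can be compared on the same data. This is a minimal sketch; selecting the non-interactive "Agg" backend is an assumption made here so that the code runs without a display, and it can be dropped in a notebook or desktop session.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

# pyplot API: functions act on the "current" figure and axes.
plt.figure()
plt.plot(x, y)
plt.title("pyplot API")

# Object-oriented API: create Figure and Axes objects and call their methods.
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title("Object-oriented API")
ax.set_xlabel("X-Axis")
ax.set_ylabel("Y-Axis")
```

The object-oriented style is usually easier to manage once a figure contains more than one plot.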
Matplotlib makes simple tasks easy and complex tasks possible.
Following are the key aspects of matplotlib:
• Matplotlib produces publication-quality plots.
• Matplotlib offers interactive figures whose visual style can be customized
as needed.
• Matplotlib can export to many file formats.
Applications of Matplotlib:
The most common applications of matplotlib include:
• Data Visualization: Many scientific research, data analytics, and machine
learning applications use Matplotlib to visualize data.
• Scientific Research: Matplotlib helps scientists visualize experimental data,
simulation findings, and statistical analysis. It improves data comprehension
and communication for researchers.
• Engineering: Matplotlib helps engineers to visualize sensor readings,
simulation findings, and design parameters. It is used for graphing in
mechanical, civil, aeronautical, and electrical engineering.
• Finance: Finance professionals use Matplotlib to visualize stock prices, market
trends, portfolio performance, and risk assessments. It helps analysts and
traders make decisions by visualizing complicated financial data in simple
graphics.
• Geospatial Analysis: Matplotlib, Basemap, and Cartopy are used to visualize
geographical data such as maps, satellite images, climate data, and GIS data.
Users may generate maps, plot geographical characteristics, and
overlay data for spatial analysis.
• Biology and Bioinformatics: Matplotlib helps biologists and bioinformaticians
visualize DNA sequences, protein structures, phylogenetic trees, and gene
expression patterns. It helps researchers to visualize complicated biological
processes.
• Education: Educational institutions use Matplotlib to teach data visualization,
programming, and scientific computing. Its easy-to-use visualization interface
makes it suited for high school and university students and teachers.
• Web Development: Web frameworks such as Flask and Django can embed Matplotlib
figures, typically rendered as images, in web pages and dashboards.
• Machine Learning: Machine learning projects visualize data distributions,
model performance metrics, decision boundaries, and training progress with
Matplotlib. It helps machine learning practitioners analyze algorithm behavior
and troubleshoot model-building concerns.
• Presentation and Publication: Matplotlib creates high-quality figures for
scientific research, reports, presentations, and posters. It offers many
customization options to optimize the look of a plot for publishing and
presentation.

Visualizing Data with Pyplot using Matplotlib:


Pyplot is a module in Matplotlib that provides a simple interface for creating plots. It
allows users to generate charts like line graphs, bar charts and histograms with
minimal code. Let’s explore some examples with simple code to understand how to
use it effectively.
1. Line Chart
A line chart is one of the most basic plots and is created using the plot() function. It
shows the relationship between two variables, X and Y, plotted on different axes.
Syntax:
matplotlib.pyplot.plot(x, y)
Parameter: x, y Coordinates for data points.
Example: This code plots a simple line chart with labeled axes and a title using
Matplotlib.
import matplotlib.pyplot as plt
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
plt.plot(x, y)
plt.title("Line Chart")
plt.ylabel('Y-Axis')
plt.xlabel('X-Axis')
plt.show()
output:

2. Bar Chart
Bar chart displays categorical data using rectangular bars whose lengths are
proportional to the values they represent. It can be plotted vertically or horizontally
to compare different categories.
Syntax:
matplotlib.pyplot.bar(x, height)
Parameters:
• x: Categories or positions on x-axis.
• height: Heights of the bars (y-axis values).
Example: This code creates a simple bar chart to show total bills for different days. X-
axis represents the days and Y-axis shows total bill amount.
import matplotlib.pyplot as plt
x = ['Thur', 'Fri', 'Sat', 'Sun']
y = [170, 120, 250, 190]
plt.bar(x, y)
plt.title("Bar Chart")
plt.xlabel("Day")
plt.ylabel("Total Bill")
plt.show()
output:

3. Histogram
Histogram shows the distribution of data by grouping values into bins.
The hist() function is used to create it, with X-axis showing bins and Y-axis showing
frequencies.
Syntax:
matplotlib.pyplot.hist(x, bins=None)
Parameter:
 x: Input data.
 bins: Number of bins (intervals) used to group the data. If None, Matplotlib's default of 10 bins is used.
Example: This code plots a histogram to show frequency distribution of total bill
values from the list x. It uses 10 bins and adds axis labels and a title for clarity.
import matplotlib.pyplot as plt
x = [7, 8, 9, 10, 10, 12, 12, 12, 13, 14, 14, 15, 16, 16, 17, 18, 18, 19, 20, 20,
21, 22, 23, 24, 25, 25, 26, 28, 30, 32, 35, 36, 38, 40, 42, 44, 48, 50]
plt.hist(x, bins=10, color='steelblue')
plt.title("Histogram")
plt.xlabel("Total Bill")
plt.ylabel("Frequency")
plt.show()
output:

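To see what the binning actually produces, the values returned by hist() can be inspected. The sketch below reuses the list x from the example; the variable names counts, edges and patches are just illustrative names for the three returned values.

```python
import matplotlib.pyplot as plt

x = [7, 8, 9, 10, 10, 12, 12, 12, 13, 14, 14, 15, 16, 16, 17, 18, 18, 19, 20, 20,
     21, 22, 23, 24, 25, 25, 26, 28, 30, 32, 35, 36, 38, 40, 42, 44, 48, 50]

# hist() returns the count in each bin, the bin edges and the drawn bars
counts, edges, patches = plt.hist(x, bins=10, color='steelblue')

print("Bin edges:", edges)        # 10 bins need 11 edges
print("Counts per bin:", counts)  # the counts add up to len(x)
```

The edges run from the smallest to the largest value in x, and each count is the height of the corresponding bar in the plot.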
4. Scatter Plot
Scatter plots are used to observe relationships between variables.
The scatter() function in matplotlib.pyplot draws a scatter plot.
Syntax:
matplotlib.pyplot.scatter(x, y)
Parameters: x, y: Coordinates of the points.
Example: This code creates a scatter plot to visualize the relationship between days
and total bill amounts using scatter().
import matplotlib.pyplot as plt
x = ['Thur', 'Fri', 'Sat', 'Sun', 'Thur', 'Fri', 'Sat', 'Sun']
y = [170, 120, 250, 190, 160, 130, 240, 200]
plt.scatter(x, y)
plt.title("Scatter Plot")
plt.xlabel("Day")
plt.ylabel("Total Bill")
plt.show()
output:

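Scatter plots are most informative when both variables are numeric. The sketch below assumes a small, made-up set of tip amounts to pair with the total bills (the tip values are hypothetical and not part of the notes) and uses the s and c arguments of scatter() to set marker size and colour.

```python
import matplotlib.pyplot as plt

# Total bills from the example above; the tips are hypothetical values
total_bill = [170, 120, 250, 190, 160, 130, 240, 200]
tip = [20, 12, 35, 22, 18, 15, 30, 25]

# s sets the marker size, c colours each point by its tip value
points = plt.scatter(total_bill, tip, s=60, c=tip, cmap='viridis')
plt.colorbar(points, label="Tip")
plt.title("Scatter Plot of Tip against Total Bill")
plt.xlabel("Total Bill")
plt.ylabel("Tip")
# plt.show()  # uncomment to display the figure when running as a script
```

With numeric data on both axes, an upward or downward trend in the cloud of points suggests a relationship between the two variables.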
5. Pie Chart
A pie chart is a circular chart used to show data as proportions or percentages. It is
created using the pie() function, where each slice (wedge) represents a part of the whole.
Syntax:
matplotlib.pyplot.pie(x, labels=None, autopct=None)
Parameter:
 x: Data values for pie slices.
 labels: Names for each slice.
 autopct: Format to display percentage (e.g., '%1.1f%%').
Example: This code creates a simple pie chart to visualize distribution of different car
brands. Each slice of pie represents the proportion of cars for each brand in the
dataset.
import matplotlib.pyplot as plt
cars = ['AUDI', 'BMW', 'FORD', 'TESLA', 'JAGUAR']
data = [23, 10, 35, 15, 12]
plt.pie(data, labels=cars)
plt.title("Pie Chart")
plt.show()
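
The autopct parameter listed above is not used in the example. A minimal sketch of it, assuming the same car data, prints each slice's share of the total with one decimal place:

```python
import matplotlib.pyplot as plt

cars = ['AUDI', 'BMW', 'FORD', 'TESLA', 'JAGUAR']
data = [23, 10, 35, 15, 12]

fig, ax = plt.subplots()
# '%1.1f%%' formats each share as a percentage with one decimal place
ax.pie(data, labels=cars, autopct='%1.1f%%')
ax.set_title("Pie Chart with Percentages")
# plt.show()  # uncomment to display the figure when running as a script
```

The values sum to 95, so the AUDI slice, for example, is labelled with 23/95 of the whole, shown as 24.2%.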

6. Box Plot
A box plot is a simple graph that shows how data is spread out. It displays the
median and quartiles as a box, draws whiskers out toward the minimum and maximum,
and marks points far from the box as outliers, which makes them easy to spot.
Syntax:
matplotlib.pyplot.boxplot(x, notch=False, vert=True)
Parameter:
 x: Data for which box plot is to be drawn (usually a list or array).
 notch: If True, draws a notch to show the confidence interval around the
median.
 vert: If True, boxes are vertical. If False, they are horizontal.
Example: This code creates a box plot to show the data distribution and compare
three groups using Matplotlib.
import matplotlib.pyplot as plt
data = [ [10, 12, 14, 15, 18, 20, 22],
[8, 9, 11, 13, 17, 19, 21],
[14, 16, 18, 20, 23, 25, 27] ]
plt.boxplot(data)
plt.xlabel("Groups")
plt.ylabel("Values")
plt.title("Box Plot")
plt.show()
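
The quantities a box plot draws can also be computed directly. The sketch below uses NumPy's percentile() on the first group from the example; the 1.5 × IQR fences correspond to Matplotlib's default whisker setting, and the variable names are illustrative.

```python
import numpy as np

group = [10, 12, 14, 15, 18, 20, 22]

# Minimum, first quartile, median, third quartile and maximum
minimum, q1, median, q3, maximum = np.percentile(group, [0, 25, 50, 75, 100])

# Points beyond these fences would be drawn as outliers by default
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print("Five-number summary:", minimum, q1, median, q3, maximum)
print("Outlier fences:", lower_fence, upper_fence)
```

For this group every value lies inside the fences, so the whiskers reach the minimum and maximum and no outliers are drawn.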

Saving figures in Matplotlib library:


Saving figures in Matplotlib is useful for preserving visualizations in various
formats so that they can be shared, reused or embedded in different contexts as
needed. Adjusting the file format and resolution lets you balance image quality
and file size based on your requirements.
Syntax:
The following is the syntax and parameters for using the savefig() method.
plt.savefig(fname, dpi=None, bbox_inches=None, pad_inches=0.1, format=None,
**kwargs)
Where,
 fname − The file name or path of the file to save the figure. The file extension
determines the file format such as ".png", ".pdf".
 dpi − Dots per inch i.e. resolution for the saved figure. Default is "None" which
uses the Matplotlib default.
 bbox_inches − Specifies which part of the figure to save. Use 'tight' to crop
the output to the plotted content, or pass a specific bounding box in inches.
By default the whole figure is saved.
 pad_inches − Padding around the figure when bbox_inches='tight'.
 format − Explicitly specify the file format. If 'None' the format is inferred from
the file extension in fname.
 kwargs − Additional keyword arguments specific to the chosen file format.
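The parameters above can be sketched in a short, self-contained script. This is an illustration under a few assumptions: the output goes to a temporary folder (a hypothetical location chosen so the script runs anywhere), the file names are made up, and the Agg backend is selected so that no display window is needed.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8])
ax.set_title("Simple Line Plot")

# Hypothetical output location: a fresh temporary folder
out_dir = tempfile.mkdtemp()

# Format is inferred from ".png"; dpi sets the resolution and
# bbox_inches='tight' trims the surrounding whitespace
png_path = os.path.join(out_dir, "lineplot.png")
fig.savefig(png_path, dpi=150, bbox_inches='tight', pad_inches=0.1)

# With no extension in the name, the format is given explicitly
pdf_path = os.path.join(out_dir, "lineplot_copy")
fig.savefig(pdf_path, format="pdf")

print("Saved:", png_path, "and", pdf_path)
```

When format is passed explicitly, the file name is used as given and no extension is added to it.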
Saving the plot in specified location:
In this example we create a simple line plot with the plot() function and then
save the image to the specified location under the specified filename. The
figure is saved before show() is called.
Example:
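import os
import matplotlib.pyplot as plt
# Data
x = [22, 1, 7, 2, 21, 11, 14, 5]
y = [24, 2, 12, 5, 5, 5, 9, 12]
plt.plot(x, y)
# Customize the plot (optional)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot')
# Create the target folder if it does not exist yet
os.makedirs('matplotlib/Savefig', exist_ok=True)
# Save the plot to a file, then display it
plt.savefig('matplotlib/Savefig/lineplot.png')
plt.show()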
Output:
On executing the above code we will get the following output −

Saving plot in .svg format:


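Here is another example of saving a plot with savefig(). This time the file
is written in SVG format, which is inferred from the .svg extension, and dpi is
set to 500. Because SVG is a vector format, dpi mainly matters for any
rasterized elements in the figure.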
Example:
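import os
import matplotlib.pyplot as plt
# Data
x = [22, 1, 7, 2, 21, 11, 14, 5]
y = [24, 2, 12, 5, 5, 5, 9, 12]
plt.plot(x, y)
# Customize the plot (optional)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot')
# Create the target folder if it does not exist yet
os.makedirs('matplotlib/Savefig', exist_ok=True)
# Save the plot as an SVG file, then display it
plt.savefig('matplotlib/Savefig/lineplot2.svg', dpi=500)
plt.show()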
Output
On executing the above code we will get the following output −

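A saved SVG is plain XML text, which can be checked without touching the disk. The sketch below is an illustration only: it writes the figure to an in-memory buffer (an assumption made to keep the example self-contained) and gives the format explicitly, since a buffer has no file extension to infer it from.

```python
import io

import matplotlib.pyplot as plt

x = [22, 1, 7, 2, 21, 11, 14, 5]
y = [24, 2, 12, 5, 5, 5, 9, 12]

fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title('Simple Line Plot')

# Write the SVG into memory instead of a file on disk
buffer = io.BytesIO()
fig.savefig(buffer, format="svg")

# Decode the bytes to inspect the XML markup
svg_text = buffer.getvalue().decode("utf-8")
print(svg_text[:60])
```

Because the drawing is stored as markup rather than pixels, an SVG can be scaled in a browser or document without losing sharpness.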