Unit-7 Data Science

UNIT-VII
Data Ethics, Building Bad Data Products, Trading Off Accuracy and Fairness,
Collaboration, Interpretability, recommendations, Biased Data, Data Protection
IPython, Mathematics, NumPy, pandas, scikit-learn, Visualization, R Up Hierarchical
Clustering.

Data Ethics and Building Ethical Data Products

Data ethics encompasses the principles and guidelines that dictate how data should be
ethically collected, used, and shared. As organizations increasingly rely on data-driven
decisions, the ethical implications become more complex and far-reaching. Ethical data
practices focus on ensuring privacy, fairness, transparency, and accountability in every phase
of data usage.

Core Principles of Data Ethics:

 Privacy: Respecting individuals’ rights and protecting their personal data.


 Transparency: Being clear about how data is collected, processed, and used,
especially when making automated decisions.
 Accountability: Ensuring that those who handle and process data are held responsible
for their actions, particularly in cases where harm occurs.
 Non-maleficence: Ensuring that the use of data doesn’t cause harm to individuals or
groups, either intentionally or unintentionally.

Building Bad Data Products

When organizations neglect these ethical guidelines, they risk creating "bad data products."
These are products or systems that fail to meet ethical standards and can have negative
consequences for users, communities, and businesses. Bad data products often arise due to
the following issues:

1. Bias in Data: If the data used to train models is biased, the resulting products will
perpetuate and even amplify these biases. This is especially dangerous in high-stakes
domains like hiring, criminal justice, and healthcare.
2. Poor Data Quality: If the data is incomplete, incorrect, or irrelevant, it undermines
the integrity of the data product. For example, flawed data used to assess loan
eligibility may result in unjust denials for qualified individuals.
3. Lack of Ethical Oversight: Failure to consider the broader ethical implications of
how data is used—such as neglecting the impact on marginalized groups or ignoring
privacy concerns—can lead to harm.
4. Failure to Involve Stakeholders: A lack of collaboration with impacted communities
and users can result in products that don't align with the needs or values of the people
they serve.
5. Legal and Regulatory Violations: Data products that don’t comply with data
protection laws (e.g., GDPR, CCPA) can lead to significant legal and financial
consequences for organizations.


Building ethical data products means avoiding these pitfalls by ensuring that ethical
considerations are baked into the product's design and development from the start.

Trading Off Accuracy and Fairness

One of the most challenging dilemmas in data science is balancing accuracy and fairness.
On the one hand, organizations often prioritize accuracy, as it ensures that predictions or
decisions made by a data product are correct or closely aligned with reality. On the other
hand, fairness seeks to ensure that the outcomes of a model or system are equitable for all
demographic groups, preventing the exacerbation of inequality or discrimination.

The tension between accuracy and fairness arises because optimizing for one can negatively
impact the other. For example, a model trained to maximize overall accuracy may end up
disproportionately benefiting certain groups (e.g., those with more representation in the data)
while disadvantaging underrepresented groups. This could be especially problematic in
sensitive areas like hiring, credit scoring, or criminal justice.

To navigate this trade-off, it’s important to:

 Acknowledge that fairness is context-dependent: What fairness looks like will vary
across contexts, and models may need to be adapted or adjusted to meet specific
ethical standards relevant to the context.
 Apply fairness metrics: Use established fairness metrics, such as equal opportunity
or demographic parity, to assess how well the model performs across different
groups.
 Iterate and balance: Continually evaluate the model's performance, making trade-
offs between accuracy and fairness as needed while considering the ethical
implications.

In some cases, improving fairness might involve accepting a reduction in accuracy, particularly if the model's unequal performance across groups could lead to significant harm for underrepresented or disadvantaged groups.
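As a minimal sketch of the metrics mentioned above (the data, column names, and groups here are hypothetical), the example below compares overall accuracy with a simple demographic-parity check, i.e. the rate of positive predictions for each group:

import pandas as pd

# Hypothetical evaluation data: true labels, model predictions, and a sensitive attribute
df = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 0, 1, 0],
    "y_pred": [1, 0, 1, 0, 0, 1, 0, 0],
    "group":  ["A", "A", "A", "A", "B", "B", "B", "B"],
})

accuracy = (df["y_true"] == df["y_pred"]).mean()       # overall accuracy
positive_rate = df.groupby("group")["y_pred"].mean()   # demographic-parity check
print(accuracy)
print(positive_rate)  # a large gap between groups signals a potential fairness issue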

Collaboration in Data Science

Collaboration is essential in developing ethical and effective data products. Data science and
machine learning are interdisciplinary fields, and creating data products that are both
technically robust and ethically sound requires input from a diverse set of stakeholders. Key
collaborators include:

1. Domain Experts: Experts who understand the subject area (e.g., healthcare, finance,
law) help ensure that the data and model are aligned with real-world needs and
constraints.
2. Ethicists: Professionals who can guide the development of models and products from
an ethical standpoint, ensuring that the impacts on stakeholders, particularly
vulnerable groups, are considered.
3. Engineers and Data Scientists: These professionals work to implement the technical
aspects of the data product, ensuring that algorithms and systems work as intended
while adhering to best practices for fairness, transparency, and security.


4. Legal and Regulatory Experts: Involving experts in data protection laws ensures
that the product complies with regulations such as GDPR, CCPA, and others, which
can have significant legal consequences if violated.
5. Community and Stakeholder Engagement: Engaging with the communities or
groups affected by a data product helps ensure that their needs and concerns are
addressed, and it can help identify any unintended consequences of a model or
product.

Collaboration not only ensures that the product is built with expertise from diverse fields but
also fosters ethical accountability across different stages of development.

Interpretability

Interpretability refers to the ability of humans to understand the decision-making process of a model. For data products, especially those that make decisions impacting individuals' lives, interpretability is essential for several reasons:

1. Transparency: Clear, understandable models allow users to see why a certain decision or prediction was made. This is crucial for building trust, particularly in high-stakes domains like healthcare, credit scoring, and law enforcement.
2. Accountability: When a model is interpretable, it’s easier to track down errors or
biases and understand why the model made a certain decision. This accountability
helps prevent harm and improves the product over time.
3. Regulatory Compliance: Some regulations, like the GDPR, require that automated
decisions be explainable to individuals affected by them. Lack of interpretability
could make it difficult for organizations to comply with such regulations.

Interpretability does not always mean that a model needs to be simple, but rather that it
should be explainable enough for humans to understand the factors influencing its decisions.
In practice, this may involve using simpler, more interpretable models (e.g., decision trees) or
implementing explainability techniques (e.g., SHAP values or LIME) for complex models
like neural networks.
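As a small, hedged sketch of what an interpretable model can look like in practice, the example below fits a shallow decision tree with scikit-learn and prints its decision rules and feature importances; SHAP and LIME are separate explainability libraries and are not shown here.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The learned rules can be printed and read directly, one simple form of interpretability
feature_names = ["sepal length", "sepal width", "petal length", "petal width"]
print(export_text(model, feature_names=feature_names))
print(model.feature_importances_)  # relative influence of each feature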

Recommendations for Ethical Data Product Development

To create ethically sound data products, organizations should:

1. Integrate Ethical Design: Build ethical considerations into the design process from
the start, rather than as an afterthought.
2. Embrace Fairness and Bias Mitigation: Prioritize fairness by using bias detection
and mitigation techniques, and regularly evaluate the model for potential biases.
3. Foster Transparency: Design models that are interpretable and explainable,
providing clear insights into how decisions are made.
4. Ensure Privacy and Security: Implement data protection measures, such as
encryption and access controls, and follow data protection regulations to safeguard
user data.
5. Involve Stakeholders: Include input from diverse stakeholders, including domain
experts, ethicists, and impacted communities, to ensure the product meets ethical
standards and serves everyone equitably.


6. Conduct Regular Audits: Continuously evaluate the performance and fairness of data products, making necessary adjustments to minimize harm and ensure positive outcomes.

Biased Data

Biased data is a key concern in data ethics, as it can lead to unfair and discriminatory
outcomes. Data bias can emerge in several forms:

1. Historical Bias: When data reflects societal inequalities or discriminatory practices, the model is likely to reinforce these biases. For example, using historical data to predict hiring decisions might perpetuate gender or racial biases.
2. Sampling Bias: If the data collected does not represent the population adequately, the
model will fail to generalize correctly, leading to poor performance for
underrepresented groups.
3. Measurement Bias: Biases may also arise due to the methods used to collect or label
data, which could inadvertently favor certain groups over others.

Mitigating bias involves:

 Collecting diverse and representative data: Ensuring that the data used to train the
model includes adequate representation of all relevant groups.
 Using fairness-enhancing algorithms: Implementing techniques to detect and reduce
bias in the model’s predictions.
 Regular auditing: Continuously testing models for biased outcomes and making adjustments as necessary (a minimal sketch of such an audit follows below).
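A recurring audit can be as simple as checking, with hypothetical column names, how well each group is represented in the data and how accurate the model is for each group:

import pandas as pd

results = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 0, 1, 0],
    "y_pred": [1, 0, 1, 0, 0, 0, 0, 1],
    "group":  ["A", "A", "A", "A", "B", "B", "B", "B"],
})

print(results["group"].value_counts(normalize=True))   # representation of each group
per_group_accuracy = (
    results.assign(correct=results["y_true"] == results["y_pred"])
           .groupby("group")["correct"].mean()
)
print(per_group_accuracy)   # a large accuracy gap suggests the model under-serves one group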

Data Protection

Data protection is vital in ensuring that personal information is handled responsibly. It is a fundamental aspect of data ethics, particularly in the context of stringent regulations like the GDPR and CCPA. Key elements of data protection include:

1. Encryption: Protecting sensitive data from unauthorized access by encrypting it both during transmission and storage.
2. Access Controls: Ensuring that only authorized individuals can access sensitive data.
3. Data Minimization: Collecting only the data necessary for the task at hand, to reduce
the risk of misuse or breach.
4. User Consent: Obtaining clear and informed consent from individuals before
collecting or using their data.
5. Data Rights: Giving individuals the right to access, correct, and delete their data.

Data protection is not just about legal compliance—it’s about maintaining trust with users
and ensuring that their personal data is treated with the utmost care.
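As one small illustration of encryption at rest (a sketch only, using the third-party cryptography package, which would need to be installed separately), a sensitive field can be encrypted with a symmetric key before being stored:

from cryptography.fernet import Fernet

key = Fernet.generate_key()      # in practice, store this key in a secrets manager, not in code
fernet = Fernet(key)

token = fernet.encrypt(b"alice@example.com")   # encrypt personal data before storing it
print(token)
print(fernet.decrypt(token))                   # only holders of the key can read it back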

Python libraries math, scipy, numpy, matplotlib


The Python programming language is valuable in itself for a number of reasons: it is effective and very widely used. In addition to everything else, Python is valuable for its set of libraries covering a variety of needs.

A number of Python libraries are widely used in mathematical calculations, computer modeling, economic calculations, machine learning, statistics, engineering, and other fields; we will consider some of them in this section.

These libraries save developers time and standardize work with mathematical functions and
algorithms, which puts Python code writing for many industries at a very high level.

Let's look at these libraries in order and determine which sections of development they are
responsible for and how they are interconnected.

MATH

To carry out calculations with real numbers, the Python language contains many additional
functions collected in a library (module) called math.

To use these functions, at the beginning of the program you need to import the math module, which is done with the command

import math

Python provides various operators for performing basic calculations, such as * for multiplication, % for modulo, and / for division. If you are developing a program in Python to perform certain tasks, you may need to work with trigonometric functions as well as complex numbers. You can access these by importing the math module, which provides hyperbolic, trigonometric, and logarithmic functions for real numbers. For complex numbers, you can use the cmath module. When comparing math vs NumPy, the math library is more lightweight, and it can be used for extensive computation as well.

The Python Math Library is the foundation for the rest of the math libraries that are written
on top of its functionality and functions defined by the C standard. Please refer to the python
math examples for more information.
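A few quick examples of how these functions are called once the module is imported:

import math

print(math.sqrt(2))                  # 1.4142135623730951
print(math.factorial(5))             # 120
print(math.gcd(12, 18))              # 6
print(math.log(math.e))              # 1.0 (natural logarithm)
print(math.log10(1000))              # 3.0
print(math.isclose(0.1 + 0.2, 0.3))  # True; safer than == for floating-point comparison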

Number-theoretic and representation functions

This part of the mathematical library is designed to work with numbers and their representations. It allows you to effectively carry out the necessary transformations with support for NaN (not a number) and infinity, and it is one of the most important sections of the Python math library. Below is a short list of these functions for Python 3. A more detailed description can be found in the documentation for the math library.

 math.ceil(x) - return the ceiling of x, the smallest integer greater than or equal to x
 math.comb(n, k) - return the number of ways to choose k items from n items without repetition and without order
 math.copysign(x, y) - return a float with the magnitude (absolute value) of x but the sign of y. On platforms that support signed zeros, copysign(1.0, -0.0) returns -1.0
 math.fabs(x) - return the absolute value of x
 math.factorial(x) - return x factorial as an integer. Raises ValueError if x is not integral or is negative
 math.floor(x) - return the floor of x, the largest integer less than or equal to x
 math.fmod(x, y) - return fmod(x, y), as defined by the platform C library
 math.frexp(x) - return the mantissa and exponent of x as the pair (m, e). m is a float and e is an integer such that x == m * 2**e exactly
 math.fsum(iterable) - return an accurate floating-point sum of values in the iterable
 math.gcd(a, b) - return the greatest common divisor of the integers a and b
 math.isclose(a, b, *, rel_tol=1e-09, abs_tol=0.0) - return True if the values a and b are close to each other and False otherwise
 math.isfinite(x) - return True if x is neither infinity nor a NaN, and False otherwise (note that 0.0 is considered finite)
 math.isinf(x) - return True if x is positive or negative infinity, and False otherwise
 math.isnan(x) - return True if x is a NaN (not a number), and False otherwise
 math.isqrt(n) - return the integer square root of the nonnegative integer n. This is the floor of the exact square root of n, or equivalently the greatest integer a such that a² ≤ n
 math.ldexp(x, i) - return x * (2**i). This is essentially the inverse of function frexp()
 math.modf(x) - return the fractional and integer parts of x. Both results carry the sign of x and are floats
 math.perm(n, k=None) - return the number of ways to choose k items from n items without repetition and with order
 math.prod(iterable, *, start=1) - calculate the product of all the elements in the input iterable. The default start value for the product is 1
 math.remainder(x, y) - return the IEEE 754-style remainder of x with respect to y
 math.trunc(x) - return the Real value x truncated to an Integral (usually an integer)

Power and logarithmic functions

The power and logarithmic functions section is responsible for exponential calculations, which are important in many areas of mathematics, engineering, and statistics. These functions cover natural logarithms and exponentials, base-2 logarithms, and logarithms with arbitrary bases.

 math.exp(x) - return e raised to the power x, where e = 2.718281… is the base of natural logarithms
 math.expm1(x) - return e raised to the power x, minus 1. Here e is the base of natural logarithms
 math.log(x[, base]) - with one argument, return the natural logarithm of x (to base e). With two arguments, return the logarithm of x to the given base, calculated as log(x)/log(base)

 math.log1p(x) - return the natural logarithm of 1+x (base e). The result is calculated in a
way that is accurate for x near zero
 math.log2(x) - return the base-2 logarithm of x. This is usually more accurate than log(x, 2)
 math.log10(x) - return the base-10 logarithm of x. This is usually more accurate than log(x,
10)
 math.pow(x, y) - return x raised to the power y
 math.sqrt(x) - return the square root of x

Trigonometric functions

Trigonometric functions, direct and inverse, are widely represented in the Python
Mathematical Library. They work with radian values, which is important. It is also possible
to carry out calculations with Euclidean functions.

 math.acos(x) - return the arc cosine of x, in radians
 math.asin(x) - return the arc sine of x, in radians
 math.atan(x) - return the arc tangent of x, in radians
 math.atan2(y, x) - return atan(y / x), in radians. The result is between -pi and pi
 math.cos(x) - return the cosine of x radians
 math.dist(p, q) - return the Euclidean distance between two points p and q, each given as a sequence (or iterable) of coordinates. The two points must have the same dimension
 math.hypot(*coordinates) - return the Euclidean norm, sqrt(sum(x**2 for x in coordinates)). This is the length of the vector from the origin to the point given by the coordinates
 math.sin(x) - return the sine of x radians
 math.tan(x) - return the tangent of x radians

Angular conversion

Converting degrees to radians and vice versa is a fairly common operation, so the developers have added these functions to the Python library. This allows you to write compact and understandable code.

 math.degrees(x) - convert angle x from radians to degrees
 math.radians(x) - convert angle x from degrees to radians

Hyperbolic functions

Hyperbolic functions are analogs of trigonometric functions that are based on hyperbolas
instead of circles.

 math.acosh(x) - return the inverse hyperbolic cosine of x
 math.asinh(x) - return the inverse hyperbolic sine of x
 math.atanh(x) - return the inverse hyperbolic tangent of x
 math.cosh(x) - return the hyperbolic cosine of x
 math.sinh(x) - return the hyperbolic sine of x

 math.tanh(x) - return the hyperbolic tangent of x

Special functions

The special functions section is responsible for error functions and gamma functions. These are frequently needed, so it was decided to implement them in the standard Python math library.

 math.erf(x) - return the error function at x
 math.erfc(x) - return the complementary error function at x
 math.gamma(x) - return the Gamma function at x
 math.lgamma(x) - return the natural logarithm of the absolute value of the Gamma function at x

Constants

The constant section provides ready-made values for basic constants and writes them with the
necessary accuracy for a given hardware platform, which is important for Python's portability
as a cross-platform language. Also, the very important values infinity and “not a number” are
defined in this section of the Python library.

 math.pi - the mathematical constant π = 3.141592…, to available precision
 math.e - the mathematical constant e = 2.718281…, to available precision
 math.tau - the mathematical constant τ = 6.283185…, to available precision. Tau is a circle constant equal to 2π, the ratio of a circle's circumference to its radius
 math.inf - a floating-point positive infinity. (For negative infinity, use -math.inf.) Equivalent to the output of float('inf')
 math.nan - a floating-point "not a number" (NaN) value. Equivalent to the output of float('nan')

Scipy

SciPy is an open-source library for the Python programming language, designed to perform scientific and engineering calculations.

The capabilities of this library are quite wide:

 Search for minima and maxima of functions


 Calculation of function integrals
 Support for special functions
 Signal processing
 Image processing
 Work with genetic algorithms
 Solving ordinary differential equations


SciPy in Python is a collection of mathematical algorithms and functions built as a NumPy extension. It greatly extends the capabilities of an interactive Python session by providing the
user with high-level commands and classes for managing and visualizing data. With SciPy,
an interactive Python session becomes a data processing and prototyping system competing
with systems such as MATLAB, IDL, Octave, R-Lab, and SciLab.

An additional advantage of Python-based SciPy is that it is also a fairly powerful programming language used in the development of complex programs and specialized
applications. Scientific applications also benefit from the development of additional modules
in numerous software niches by developers around the world. Everything from parallel
programming for the web to routines and database classes is available to the Python
programmer. All of these features are available in addition to the SciPy math library.

Packages for mathematical methods

SciPy is organized into sub-packages covering various scientific computing areas:

 cluster - clustering algorithms
 constants - physical and mathematical constants
 fftpack - Fast Fourier Transform subroutines
 integrate - integration and solution of ordinary differential equations
 interpolate - interpolation and smoothing splines
 io - input and output
 linalg - linear algebra
 ndimage - n-dimensional image processing
 odr - orthogonal distance regression
 optimize - optimization and root-finding routines
 signal - signal processing
 sparse - sparse matrices and related procedures
 spatial - spatial data structures and algorithms
 special - special functions
 stats - statistical distributions and functions
 weave - C/C++ integration (removed in recent SciPy versions)

The SciPy ecosystem includes general and specialized tools for data management and
computation, productive experimentation, and high-performance computing. Below, we
overview some key packages, though there are many more relevant packages.

Main components of SciPy

Data and computation:

 pandas, providing high-performance, easy-to-use data structures


 SymPy, for symbolic mathematics and computer algebra
 scikit-image is a collection of algorithms for image processing
 scikit-learn is a collection of algorithms and tools for machine learning
 h5py and PyTables can both access data stored in the HDF5 format


Productivity and high-performance computing:

 IPython, a rich interactive interface, letting you quickly process data and test ideas
 The Jupyter notebook provides IPython functionality and more in your web browser,
allowing you to document your computation in an easily reproducible form
 Cython extends Python syntax so that you can conveniently build C extensions, either to
speed up critical code or to integrate with C/C++ libraries
 Dask, Joblib or IPyParallel for distributed processing with a focus on numeric data

Quality assurance:

 nose, a framework for testing Python code, being phased out in preference for pytest
 numpydoc, a standard, and library for documenting Scientific Python libraries

SciPy provides a very wide and sought-after feature set:

 Clustering package (scipy.cluster)
 Constants (scipy.constants)
 Discrete Fourier transforms (scipy.fft)
 Integration and ODEs (scipy.integrate)
 Interpolation (scipy.interpolate)
 Input and output (scipy.io)
 Linear algebra (scipy.linalg)
 Miscellaneous routines (scipy.misc)
 Multi-dimensional image processing (scipy.ndimage)
 Orthogonal distance regression (scipy.odr)
 Optimization and root finding (scipy.optimize)
 Signal processing (scipy.signal)
 Sparse matrices (scipy.sparse)
 Sparse linear algebra (scipy.sparse.linalg)
 Compressed sparse graph routines (scipy.sparse.csgraph)
 Spatial algorithms and data structures (scipy.spatial)
 Special functions (scipy.special)
 Statistical functions (scipy.stats)
 Statistical functions for masked arrays (scipy.stats.mstats)
 Low-level callback functions

An example of how to calculate effectively on SciPy

In the tutorial "Basic functions — SciPy v1.4.1 Reference Guide", you can find how to work with polynomials, their derivatives, and integrals. With one line of code you can calculate a polynomial's derivative or integral. Imagine how many lines of code you would need to do this without this library. This is why it is so valuable in Python:

>>> from numpy import poly1d
>>> p = poly1d([3, 4, 5])
>>> print(p)
   2
3 x + 4 x + 5
>>> print(p * p)
   4      3      2
9 x + 24 x + 46 x + 40 x + 25
>>> print(p.integ(k=6))
   3     2
1 x + 2 x + 5 x + 6
>>> print(p.deriv())
6 x + 4
>>> p([4, 5])
array([ 69, 100])
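As another small sketch, this time of the integrate and optimize sub-packages (the integrand and the function being minimized below are arbitrary examples chosen for illustration):

from scipy import integrate, optimize

# Numerically integrate x**2 from 0 to 1 (the exact answer is 1/3)
value, error = integrate.quad(lambda x: x**2, 0, 1)
print(value)

# Find the minimum of (x - 2)**2 + 1
result = optimize.minimize_scalar(lambda x: (x - 2)**2 + 1)
print(result.x)   # approximately 2.0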

NUMPY

In early 2005, programmer and data scientist Travis Oliphant wanted to unite the community
around one project and created the NumPy library to replace the Numeric and NumArray
libraries. NumPy was created based on the Numeric code. The Numeric code was rewritten to
be easier to maintain, and new features could be added to the library. NumArray features
have been added to NumPy.

NumPy was originally part of the SciPy library. To allow other projects to use the NumPy
library, its code was placed in a separate package.

The source code for NumPy is publicly available. NumPy is licensed under the BSD license.

Purpose of the NumPy library

Mathematical algorithms implemented in interpreted languages, for example Python, often work much slower than the same algorithms implemented in compiled languages (for example, Fortran, C, and Java). The NumPy library provides implementations of computational algorithms in the form of functions and operators, optimized for working with multidimensional arrays. As a result, any algorithm that can be expressed as a sequence of operations on arrays (matrices) and implemented using NumPy works as fast as the equivalent code executed in MATLAB. If we compare NumPy vs math, we quickly find that NumPy has more advantages for computational methods compared to math.

Here are some of the features of Numpy:

 A powerful N-dimensional array object


 Sophisticated (broadcasting) functions
 Tools for integrating C/C++ and Fortran code
 Useful linear algebra, Fourier transform, and random number capabilities

What’s the difference between a Python list and a NumPy array?


As described in the NumPy documentation, “NumPy gives you an enormous range of fast
and efficient ways of creating arrays and manipulating numerical data inside them. While a
Python list can contain different data types within a single list, all of the elements in a NumPy
array should be homogenous. The mathematical operations that are meant to be performed on
arrays would be extremely inefficient if the arrays weren’t homogenous.”

NumPy User Features

Numpy provides the following features to the user:

 Array objects
 Constants
 Universal functions (ufunc)
 Routine
 Packaging (numpy.distutils)
 NumPy Distutils - Users Guide
 NumPy C-API
 NumPy internals
 NumPy and SWIG

NumPy basics:

 Data types
 Array creation
 I/O with NumPy
 Indexing
 Broadcasting
 Byte-swapping
 Structured arrays
 Writing custom array containers
 Subclassing ndarray

One of the main objects of NumPy is ndarray. It allows you to create multidimensional arrays of data of the same type and perform operations on them with great speed. Unlike sequences in Python, arrays in NumPy have a fixed size, and the elements of the array must all be of the same type. You can apply various mathematical operations to arrays, which are performed more efficiently than for Python sequences.

The next example shows how to work with linear algebra with NumPy. It is really simple and
easy-to-understand for Python users.

>>> import numpy as np

>>> a = np.array([[1.0, 2.0], [3.0, 4.0]])
>>> print(a)
[[ 1.  2.]
 [ 3.  4.]]

>>> a.transpose()
array([[ 1.,  3.],
       [ 2.,  4.]])

>>> np.linalg.inv(a)
array([[-2. ,  1. ],
       [ 1.5, -0.5]])

>>> u = np.eye(2)  # unit 2x2 matrix; "eye" represents "I"
>>> u
array([[ 1.,  0.],
       [ 0.,  1.]])
>>> j = np.array([[0.0, -1.0], [1.0, 0.0]])

>>> j @ j  # matrix product
array([[-1.,  0.],
       [ 0., -1.]])

>>> np.trace(u)  # trace
2.0

>>> y = np.array([[5.], [7.]])
>>> np.linalg.solve(a, y)
array([[-3.],
       [ 4.]])

>>> np.linalg.eig(j)
(array([ 0.+1.j,  0.-1.j]), array([[ 0.70710678+0.j,  0.70710678-0.j],
       [ 0.00000000-0.70710678j,  0.00000000+0.70710678j]]))

NumPy allows processing information without explicit loops. Please take a look at this article published by Brad Solomon about the advantages of NumPy: "It is sometimes said that Python, compared to low-level languages such as C++, improves development time at the expense of runtime. Fortunately, there are a handful of ways to speed up operation runtime in Python without sacrificing ease of use. One option suited for fast numerical operations is NumPy, which deservedly bills itself as the fundamental package for scientific computing with Python." It makes computation in Python really fast.
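A short sketch, with made-up numbers, of what this vectorized (loop-free) style looks like in practice:

import numpy as np

prices = np.array([10.0, 12.5, 9.99, 14.0])
quantities = np.array([3, 1, 4, 2])

# Element-wise operations replace an explicit Python loop over the items
totals = prices * quantities
print(totals)          # array of per-item totals
print(totals.sum())    # total of all items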

MATPLOTLIB

Neuroscientist John D. Hunter began developing matplotlib in 2003, mainly to emulate the plotting interface of MathWorks MATLAB. Matplotlib is today a product of the whole community: it is developed and supported by many people. John talked about the evolution of matplotlib at the SciPy conference in 2012.


Learning matplotlib at times can be a difficult process. The problem is not the lack of
documentation (which is very extensive, by the way). Difficulties may arise with the
following:

 The library itself is huge, at roughly 70,000 lines of code
 Matplotlib contains several different interfaces (ways to build a figure) and can interact with a large number of backends. (The backends determine how the plots are actually displayed, not just their internal structure)
 Despite the vastness, some of matplotlib's own documentation is seriously outdated. The library is still evolving, and many old examples floating around the web could be written with far less code in their modern version

Understanding that matplotlib's roots grow from MATLAB helps explain the existence of pylab. pylab is a module inside the matplotlib library that was built in to emulate the overall MATLAB style. It exists only to bring a number of functions and classes from NumPy and matplotlib into the namespace, which simplifies the transition for MATLAB users who never needed import statements. Former MATLAB users love this functionality, because with from pylab import * they can simply call plot() or array() directly, just as they did in MATLAB.

Key features of Matplotlib

One of the hallmarks of matplotlib is the hierarchy of its objects. If you have already worked through the matplotlib introductory manual, you may have already called something like plt.plot([1, 2, 3]). This one line indicates that the graph is actually a hierarchy of Python objects. By "hierarchy" we mean that each chart is based on a tree-like structure of matplotlib objects.

The Figure object is the most important outer container for matplotlib graphics, and it can include several Axes objects. The name can be a source of confusion: an Axes object is, in fact, what we mean by an individual graph or chart (rather than the plural of "axis", as you might expect).

You can think of the Figure object as a box-like container holding one or more Axes objects (the actual plots). Below the Axes objects, in hierarchical order, are smaller objects such as individual lines, tick marks, legends, and text boxes. Almost every "element" of a diagram is its own manipulable Python object, right down to the labels and markers.
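A minimal sketch of this hierarchy using the object-oriented interface, in which the Figure and a single Axes are created explicitly and the plot is drawn on the Axes:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()             # Figure is the container, Axes is the actual plot
ax.plot([1, 2, 3, 4], [1, 4, 9, 16])
ax.set_title("A single Axes inside a Figure")
ax.set_xlabel("x")
ax.set_ylabel("y")
plt.show()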


Matplotlib is a flexible, easily configurable package that, along with NumPy, SciPy, and
IPython, provides features similar to MATLAB. The package currently works with several
graphics libraries, including wxWindows and PyGTK.

Python code example for plotting

The Python code itself is quite simple and straightforward. Here's an example of a simple
plot:

import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
plt.plot(x, y, marker="o")
plt.show()

Types of graphs and charts

The package supports many types of graphs and charts:

 Line plots
 Scatter plots
 Bar charts and histograms
 Pie charts
 Stem plots
 Contour plots
 Vector fields (quiver plots)
 Spectrograms

The user can also specify the coordinate axes and grid, add labels and explanations, and use a logarithmic scale or polar coordinates.


Simple 3D graphics can be built using the mplot3d toolkit. There are other toolkits: for
mapping, for working with Excel, utilities for GTK and others. With Matplotlib, you can
make animated images.

Matplotlib can be technically and syntactically complex. To create a ready-made diagram, it can take half an hour of googling alone to combine everything and fine-tune the graph. However, understanding how matplotlib's interfaces interact with each other is an investment that can pay off.

Pandas
Pandas (styled as pandas) is an open-source software library designed for the Python programming language, focusing on data manipulation and analysis. It provides data structures like Series and DataFrames to efficiently clean, transform, and analyze large datasets, and it integrates seamlessly with other Python libraries, such as NumPy and Matplotlib.
It offers powerful functions for data transformation, aggregation, and visualization, which are important for credible analysis. Created by Wes McKinney in 2008, Pandas has grown to become a cornerstone of data analysis in Python, widely used by data scientists, analysts, and researchers worldwide. Pandas revolves around two primary data structures: Series (1D) for single columns and DataFrame (2D) for tabular data, similar to an Excel spreadsheet, enabling efficient data manipulation.
The name is derived from the term "panel data", an econometrics term for multidimensional data sets.
What is Pandas Used for?
With pandas, you can perform a wide range of data operations, including the following (a short example is sketched after this list):
Reading and writing data from various file formats like CSV, Excel, and SQL databases.


Cleaning and preparing data by handling missing values and filtering entries.
Merging and joining multiple datasets seamlessly.
Reshaping data through pivoting and stacking operations.
Conducting statistical analysis and generating descriptive statistics.
Visualizing data with integrated plotting capabilities.
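A brief, hypothetical sketch of a few of these operations (the column names and values are made up; in practice the data might come from pd.read_csv("sales.csv")):

import pandas as pd

df = pd.DataFrame({
    "region": ["East", "West", "East", "South"],
    "sales": [250, 310, 120, 95],
})

print(df.describe())            # descriptive statistics
print(df[df["sales"] > 100])    # filter rows
df.to_csv("sales_filtered.csv", index=False)   # write the data back out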
Learn Pandas
Now that we know what pandas is and what it is used for, let's move on to the tutorial part. In the sections below, arranged from basic to advanced, you will learn more about pandas.
Pandas Basics
In this section, we will explore the fundamentals of Pandas. We will start with an
introduction to Pandas, learn how to install it, and get familiar with its core functionalities.
Additionally, we will cover how to use Jupyter Notebook, a popular tool for interactive
coding. By the end of this section, we will have a solid understanding of how to set up and
start working with Pandas for data analysis.
 Pandas Introduction
 Pandas Installation
 Getting started with Pandas
 How To Use Jupyter Notebook
Pandas DataFrame
A DataFrame is a two-dimensional, size-mutable and potentially heterogeneous tabular data structure with labeled axes (rows and columns). Think of it as a table or a spreadsheet (a small example follows the list of topics below).
 Creating a DataFrame
 Pandas Dataframe Index
 Pandas Access DataFrame
 Indexing and Selecting Data with Pandas
 Slicing Pandas Dataframe
 Filter Pandas Dataframe with multiple conditions
 Merging, Joining, and Concatenating Dataframes
 Sorting Pandas DataFrame
 Pivot Table in Pandas
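A tiny sketch of creating and accessing a DataFrame (and a Series, covered next); the data is made up:

import pandas as pd

# A Series: one labeled column of values
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame: a table built from column name -> values
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena"],
    "score": [85, 92, 78],
})

print(s["b"])              # label-based access on a Series
print(df.loc[0, "name"])   # label-based access on a DataFrame row/column
print(df["score"].mean())  # column statistics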
Pandas Series
A Series is a one-dimensional labeled array capable of holding any data type (integers,
strings, floating-point numbers, Python objects, etc.). It’s similar to a column in a spreadsheet
or a database table.
 Creating a Series
 Accessing elements of a Pandas Series
 Binary Operations on Series


 Pandas Series Index() Methods


 Create a Pandas Series from array
Data Input and Output (I/O)
Pandas offers a variety of functions to read data from and write data to different file formats
as given below:
 Read CSV Files with Pandas
 Writing data to CSV Files
 Export Pandas dataframe to a CSV file
 Read JSON Files with Pandas
 Parsing JSON Dataset
 Exporting Pandas DataFrame to JSON File
 Working with Excel Files in Pandas
 Read Text Files with Pandas
 Text File to CSV using Python Pandas
Data Cleaning in Pandas
Data cleaning is an essential step in data preprocessing to ensure accuracy and consistency. Here are some articles to know more about it (a short cleaning sketch follows the list):
 Handling Missing Data
 Removing Duplicates
 Pandas Change Datatype
 Drop Empty Columns in Pandas
 String manipulations in Pandas
 String methods in Pandas
 Detect Mixed Data Types and Fix it
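A small, hypothetical cleaning sketch that touches several of the topics above:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, np.nan, 31, 31],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

df = df.drop_duplicates()                         # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].mean())    # fill missing ages with the mean
df = df.dropna(subset=["city"])                   # drop rows still missing a city
df["age"] = df["age"].astype(int)                 # change the data type
print(df)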
Pandas Operations
We will cover data processing, normalization, manipulation, and analysis, along with techniques for grouping and aggregating data. These concepts will help you efficiently clean, transform, and analyze datasets. By the end of this section, you'll be equipped with essential Pandas operations to handle real-world data effectively (a compact example follows the list of topics below).
 Data Processing with Pandas.
 Data Normalization in Pandas
 Data Manipulation in Pandas
 Data Analysis using Pandas
 Grouping and Aggregating with Pandas
 Different Types of Joins in Pandas
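A compact sketch, with made-up tables, combining a join with grouping and aggregation:

import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "amount": [120, 55, 80, 200],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["retail", "retail", "wholesale"],
})

# Join the two tables, then group and aggregate
merged = orders.merge(customers, on="customer_id", how="left")
summary = merged.groupby("segment")["amount"].agg(["sum", "mean"])
print(summary)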
Advanced Pandas Operations
In this section, we will explore advanced Pandas functionalities for deeper data analysis and
visualization. We will cover techniques for finding correlations, working with time series
data, and using Pandas’ built-in plotting functions for effective data visualization. By the end
of this section, you’ll have a strong grasp of advanced Pandas operations and how to apply
them to real-world datasets.

 Finding Correlation between Data


 Data Visualization with Pandas
 Pandas Plotting Functions for Data Visualization
 Basic of Time Series Manipulation Using Pandas
 Time Series Analysis & Visualization in Python
Hierarchical Clustering
Why hierarchical clustering?
Hierarchical clustering is a technique used to group similar data points together based on their similarity, creating a hierarchy or tree-like structure. The key idea is to begin with each data point as its own separate cluster and then progressively merge or split clusters based on their similarity.
Let's understand this with the help of an example.
Imagine you have four fruits with different weights: an apple (100g), a banana (120g), a
cherry (50g), and a grape (30g). Hierarchical clustering starts by treating each fruit as its own
group.
It then merges the closest groups based on their weights.
First, the cherry and grape are grouped together because they are the lightest.
Next, the apple and banana are grouped together.
Finally, all the fruits are merged into one large group, showing how hierarchical clustering
progressively combines the most similar data points.
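The same fruit example can be sketched in code with SciPy (the weights are those given above; the exact cluster numbering in the output may differ):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

weights = np.array([[100], [120], [50], [30]])   # apple, banana, cherry, grape (grams)
Z = linkage(weights, method="average")

# Cut the hierarchy into two clusters: light fruits vs heavy fruits
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)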
Getting Started with Dendrograms
A dendrogram is like a family tree for clusters. It shows how individual data points or groups
of data merge together. The bottom shows each data point as its own group, and as you move
up, similar groups are combined. The lower the merge point, the more similar the groups are.
It helps you see how things are grouped step by step.
The working of the dendrogram can be explained using the diagram described below:

Dendrogram


In this image, on the left side, there are five points labeled P, Q, R, S, and T. These represent
individual data points that are being clustered. On the right side, there’s a dendrogram, which
shows how these points are grouped together step by step.
At the bottom of the dendrogram, the points P, Q, R, S, and T are all separate.
As you move up, the closest points are merged into a single group.
The lines connecting the points show how they are progressively merged based on similarity.
The height at which they are connected shows how similar the points are to each other; the shorter the line, the more similar they are.
Types of Hierarchical Clustering
Now that we understand the basics of hierarchical clustering, let’s explore the two main types
of hierarchical clustering.
Agglomerative Clustering
Divisive clustering
Hierarchical Agglomerative Clustering
It is also known as the bottom-up approach or hierarchical agglomerative clustering (HAC). Unlike flat clustering, hierarchical clustering provides a structured way to group data. This clustering algorithm does not require us to prespecify the number of clusters. Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively agglomerate pairs of clusters until all clusters have been merged into a single cluster that contains all the data.

Workflow for Hierarchical Agglomerative clustering
Start with individual points: Each data point is its own cluster. For example, if you have 5 data points, you start with 5 clusters, each containing just one data point.
Calculate distances between clusters: Calculate the distance between every pair of clusters. Initially, since each cluster has one point, this is the distance between the two data points.
Merge the closest clusters: Identify the two clusters with the smallest distance and merge them into a single cluster.
Update the distance matrix: After merging, you now have one less cluster. Recalculate the distances between the new cluster and the remaining clusters.
Repeat steps 3 and 4: Keep merging the closest clusters and updating the distance matrix until you have only one cluster left.
Create a dendrogram: As the process continues, you can visualize the merging of clusters using a tree-like diagram called a dendrogram. It shows the hierarchy of how clusters are merged.
Python implementation of the above algorithm using the scikit-learn library:

from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
clustering = AgglomerativeClustering(n_clusters=2).fit(X)
print(clustering.labels_)

Output:
[1 1 1 0 0 0]
Hierarchical Divisive clustering
It is also known as the top-down approach. This algorithm also does not require us to prespecify the number of clusters. Top-down clustering requires a method for splitting a cluster that contains the whole data set and proceeds by splitting clusters recursively until individual data points have been split into singleton clusters.
Workflow for Hierarchical Divisive clustering:
Start with all data points in one cluster: Treat the entire dataset as a single large cluster.
Split the cluster: Divide the cluster into two smaller clusters. The division is typically done by
finding the two most dissimilar points in the cluster and using them to separate the data into
two parts.
Repeat the process: For each of the new clusters, repeat the splitting process:
Choose the cluster with the most dissimilar points.
Split it again into two smaller clusters.


Stop when each data point is in its own cluster: Continue this process until every data point is
its own cluster, or the stopping condition (such as a predefined number of clusters) is met.

Computing Distance Matrix
While merging two clusters, we check the distance between every pair of clusters and merge the pair with the least distance/most similarity. But the question is: how is that distance determined? There are different ways of defining inter-cluster distance/similarity. Some of them are:
Min Distance: Find the minimum distance between any two points of the cluster.
Max Distance: Find the maximum distance between any two points of the cluster.
Group Average: Find the average distance between every two points of the clusters.
Ward’s Method: The similarity of two clusters is based on the increase in squared error when
two clusters are merged.
Distance Matrix Comparison in Hierarchical Clustering
Implementation code for distance matrix comparison:
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

Z = linkage(X, 'ward')   # Ward distance

dendrogram(Z)            # plot the dendrogram

plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data point')
plt.ylabel('Distance')
plt.show()
Output: a dendrogram figure titled "Hierarchical Clustering Dendrogram".


Hierarchical clustering is a powerful unsupervised learning technique that organizes data into a tree-like structure, allowing us to visualize relationships between data points using a dendrogram. Unlike flat clustering methods, it does not require a predefined number of clusters and provides a structured way to explore data similarity.
Data Visualization
Data visualization is the graphical representation of information. In this guide, we will study what data visualization is and its importance, with use cases.
Understanding Data Visualization
Data visualization translates complex data sets into visual formats that are easier for the
human brain to understand. This can include a variety of visual tools such as:
Charts: Bar charts, line charts, pie charts, etc.
Graphs: Scatter plots, histograms, etc.


Maps: Geographic maps, heat maps, etc.
Dashboards: Interactive platforms that combine multiple visualizations.
The primary goal of data visualization is to make data more accessible and easier to interpret, allowing users to identify patterns, trends, and outliers quickly. This is particularly important in big data, where the large volume of information can be confusing without effective visualization techniques.
Why is Data Visualization Important?
Let's take an example. Suppose you compile data on a company's profits from 2013 to 2023 and create a line chart. It would be very easy to see the line going steadily up with a drop only in 2018. So you can observe in a second that the company has had continuous profits in all the years except a loss in 2018.
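The profit example above could be plotted with a few lines of matplotlib (the numbers below are invented purely for illustration; note the dip in 2018):

import matplotlib.pyplot as plt

years = list(range(2013, 2024))
profits = [2.1, 2.4, 2.9, 3.2, 3.8, -0.5, 1.9, 2.6, 3.1, 3.5, 4.0]   # hypothetical figures

plt.plot(years, profits, marker="o")
plt.title("Company profit, 2013-2023")
plt.xlabel("Year")
plt.ylabel("Profit")
plt.show()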
It would not be that easy to get this information so fast from a data table. This is just one
demonstration of the usefulness of data visualization. Let’s see some more reasons why
visualization of data is so important.

Importance of Data Visualization


1. Data Visualization Simplifies Complex Data
Large and complex data sets can be challenging to understand. Data visualization helps break down complex information into simpler, visual formats, making it easier for the audience to grasp. For example, in a scenario where sales data is visualized using a heat map in Tableau, states that have suffered a net loss are colored red. This visual makes it instantly obvious which states are underperforming.


2. Enhances Data Interpretation


Visualization highlights patterns, trends, and correlations in data that might be missed in raw data form. This enhanced interpretation helps in making informed decisions. Consider another Tableau visualization that demonstrates the relationship between sales and profit. It might show that higher sales do not necessarily equate to higher profits, a trend that could be difficult to find from raw data alone. This perspective helps businesses adjust strategies to focus on profitability rather than just sales volume.

3. Data Visualization Saves Time


It is definitely faster to gather insights from the data using data visualization than by just studying a table. In the Tableau heat-map example described above, it is very easy to identify the states that have suffered a net loss rather than a profit, because all the cells with a loss are coloured red. Compare this to a normal table where you would need to check each cell to see if it has a negative value to determine a loss. Visualizing data can save a lot of time in this situation.


4. Improves Communication
Visual representations of data make it easier to share findings with others, especially those who may not have a technical background. This is important in business, where stakeholders need to understand data-driven insights quickly. Consider a TreeMap visualization in Tableau showing the number of sales in each region of the United States, with the largest rectangle representing California due to its high sales volume. This visual context is much easier to grasp than a detailed table of numbers.

5. Data Visualization Tells a Data Story


Data visualization is also a medium to tell a data story to the viewers. The visualization can be used to present the data facts in an easy-to-understand form while telling a story and leading the viewers to an inevitable conclusion. This data story should have a good beginning, a basic plot, and an ending that it leads towards. For example, if a data analyst has to craft a data visualization for company executives detailing the profits of various products, then the data story can start with the profits and losses of multiple products and move on to recommendations on how to tackle the losses.
Best Practices for Visualizing Data
Effective data visualization is crucial for conveying insights accurately. Follow these best
practices to create compelling and understandable visualizations:
Audience-Centric Approach: Tailor visualizations to your audience’s knowledge level,
ensuring clarity and relevance. Consider their familiarity with data interpretation and adjust
the complexity of visual elements accordingly.
Design Clarity and Consistency: Choose appropriate chart types, simplify visual elements,
and maintain a consistent color scheme and legible fonts. This ensures a clear, cohesive, and
easily interpretable visualization.
Contextual Communication: Provide context through clear labels, titles, annotations, and
acknowledgments of data sources. This helps viewers understand the significance of the
information presented and builds transparency and credibility.
Engaging and Accessible Design: Design interactive features thoughtfully, ensuring they
enhance comprehension. Additionally, prioritize accessibility by testing visualizations for
responsiveness and accommodating various audience needs, fostering an inclusive and
engaging experience.
