Unit-7 Data Science
UNIT-VII
Data Ethics, Building Bad Data Products, Trading Off Accuracy and Fairness,
Collaboration, Interpretability, recommendations, Biased Data, Data Protection
IPython, Mathematics, NumPy, pandas, scikit-learn, Visualization, R, Hierarchical
Clustering.
Data Ethics
Data ethics encompasses the principles and guidelines that dictate how data should be
ethically collected, used, and shared. As organizations increasingly rely on data-driven
decisions, the ethical implications become more complex and far-reaching. Ethical data
practices focus on ensuring privacy, fairness, transparency, and accountability in every phase
of data usage.
Building Bad Data Products
When organizations neglect these ethical guidelines, they risk creating "bad data products."
These are products or systems that fail to meet ethical standards and can have negative
consequences for users, communities, and businesses. Bad data products often arise due to
the following issues:
1. Bias in Data: If the data used to train models is biased, the resulting products will
perpetuate and even amplify these biases. This is especially dangerous in high-stakes
domains like hiring, criminal justice, and healthcare.
2. Poor Data Quality: If the data is incomplete, incorrect, or irrelevant, it undermines
the integrity of the data product. For example, flawed data used to assess loan
eligibility may result in unjust denials for qualified individuals.
3. Lack of Ethical Oversight: Failure to consider the broader ethical implications of
how data is used—such as neglecting the impact on marginalized groups or ignoring
privacy concerns—can lead to harm.
4. Failure to Involve Stakeholders: A lack of collaboration with impacted communities
and users can result in products that don't align with the needs or values of the people
they serve.
5. Legal and Regulatory Violations: Data products that don’t comply with data
protection laws (e.g., GDPR, CCPA) can lead to significant legal and financial
consequences for organizations.
Building ethical data products means avoiding these pitfalls by ensuring that ethical
considerations are baked into the product's design and development from the start.
Trading Off Accuracy and Fairness
One of the most challenging dilemmas in data science is balancing accuracy and fairness.
On the one hand, organizations often prioritize accuracy, as it ensures that predictions or
decisions made by a data product are correct or closely aligned with reality. On the other
hand, fairness seeks to ensure that the outcomes of a model or system are equitable for all
demographic groups, preventing the exacerbation of inequality or discrimination.
The tension between accuracy and fairness arises because optimizing for one can negatively
impact the other. For example, a model trained to maximize overall accuracy may end up
disproportionately benefiting certain groups (e.g., those with more representation in the data)
while disadvantaging underrepresented groups. This could be especially problematic in
sensitive areas like hiring, credit scoring, or criminal justice.
Practical ways to manage this trade-off include:
Acknowledge that fairness is context-dependent: what fairness looks like will vary
across contexts, and models may need to be adapted or adjusted to meet the specific
ethical standards relevant to the context.
Apply fairness metrics: use established fairness metrics, such as equal opportunity
or demographic parity, to assess how well the model performs across different
groups (see the sketch after this list).
Iterate and balance: continually evaluate the model's performance, making trade-offs
between accuracy and fairness as needed while considering the ethical implications.
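Below is a minimal sketch of how the two fairness metrics named above can be computed.
The arrays y_true (true labels), y_pred (model predictions), and group (a binary
demographic indicator) are hypothetical names invented for this example, not from the text.

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # true outcomes (made up)
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])   # model predictions (made up)
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])    # binary demographic indicator

def demographic_parity_gap(y_pred, group):
    # Difference in positive-prediction rates between the two groups.
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equal_opportunity_gap(y_true, y_pred, group):
    # Difference in true-positive rates between the two groups.
    tpr0 = y_pred[(group == 0) & (y_true == 1)].mean()
    tpr1 = y_pred[(group == 1) & (y_true == 1)].mean()
    return abs(tpr0 - tpr1)

print(demographic_parity_gap(y_pred, group))         # 0.0 for this toy data
print(equal_opportunity_gap(y_true, y_pred, group))  # about 0.33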
Collaboration
Collaboration is essential in developing ethical and effective data products. Data science and
machine learning are interdisciplinary fields, and creating data products that are both
technically robust and ethically sound requires input from a diverse set of stakeholders. Key
collaborators include:
1. Domain Experts: Experts who understand the subject area (e.g., healthcare, finance,
law) help ensure that the data and model are aligned with real-world needs and
constraints.
2. Ethicists: Professionals who can guide the development of models and products from
an ethical standpoint, ensuring that the impacts on stakeholders, particularly
vulnerable groups, are considered.
3. Engineers and Data Scientists: These professionals work to implement the technical
aspects of the data product, ensuring that algorithms and systems work as intended
while adhering to best practices for fairness, transparency, and security.
4. Legal and Regulatory Experts: Involving experts in data protection laws ensures
that the product complies with regulations such as GDPR, CCPA, and others, which
can have significant legal consequences if violated.
5. Community and Stakeholder Engagement: Engaging with the communities or
groups affected by a data product helps ensure that their needs and concerns are
addressed, and it can help identify any unintended consequences of a model or
product.
Collaboration not only ensures that the product is built with expertise from diverse fields but
also fosters ethical accountability across different stages of development.
Interpretability
Interpretability does not always mean that a model needs to be simple, but rather that it
should be explainable enough for humans to understand the factors influencing its decisions.
In practice, this may involve using simpler, more interpretable models (e.g., decision trees) or
implementing explainability techniques (e.g., SHAP values or LIME) for complex models
like neural networks.
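As a concrete illustration of the interpretable-model route, here is a minimal sketch
using scikit-learn's export_text to print a decision tree's learned rules; the Iris
dataset serves only as stand-in data for the example.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Human-readable rules: each line shows the feature threshold used at a
# split, so the factors behind any prediction can be inspected directly.
print(export_text(tree, feature_names=load_iris().feature_names))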
Best practices for building ethical data products include:
1. Integrate Ethical Design: Build ethical considerations into the design process from
the start, rather than as an afterthought.
2. Embrace Fairness and Bias Mitigation: Prioritize fairness by using bias detection
and mitigation techniques, and regularly evaluate the model for potential biases.
3. Foster Transparency: Design models that are interpretable and explainable,
providing clear insights into how decisions are made.
4. Ensure Privacy and Security: Implement data protection measures, such as
encryption and access controls, and follow data protection regulations to safeguard
user data.
5. Involve Stakeholders: Include input from diverse stakeholders, including domain
experts, ethicists, and impacted communities, to ensure the product meets ethical
standards and serves everyone equitably.
Biased Data
Biased data is a key concern in data ethics, as it can lead to unfair and discriminatory
outcomes. Common strategies for detecting and mitigating data bias include:
Collecting diverse and representative data: ensuring that the data used to train the
model includes adequate representation of all relevant groups.
Using fairness-enhancing algorithms: implementing techniques to detect and reduce
bias in the model's predictions (a sketch follows this list).
Regular auditing: continuously testing models for biased outcomes, and making
adjustments as necessary.
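One minimal sketch of a bias-reducing step, assuming a hypothetical imbalanced dataset:
scikit-learn's class_weight="balanced" option upweights underrepresented classes during
training so they are not drowned out by the majority class.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                # made-up features
y = (rng.random(200) < 0.15).astype(int)     # heavily imbalanced labels

# "balanced" reweights samples inversely to class frequency.
model = LogisticRegression(class_weight="balanced").fit(X, y)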
Data Protection
Data protection is not just about legal compliance—it’s about maintaining trust with users
and ensuring that their personal data is treated with the utmost care.
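As one small illustration of treating data with care, here is a sketch of
pseudonymization: replacing direct identifiers with salted hashes before analysis.
The identifier value and the salt are hypothetical.

import hashlib

SALT = "keep-this-secret"  # in practice, store the salt securely, not in code

def pseudonymize(user_id: str) -> str:
    # One-way hash so the raw identifier never appears in the analysis data.
    return hashlib.sha256((SALT + user_id).encode()).hexdigest()

print(pseudonymize("alice@example.com"))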
Python Libraries for Data Science
Python is valuable not only as an effective and widely used programming language, but also
for its rich set of libraries covering a wide variety of needs. These libraries save developers
time and standardize work with mathematical functions and algorithms, which is why Python
code is written at a very high level across many industries. Let's look at these libraries in
order and determine which areas of development they are responsible for and how they are
interconnected.
MATH
To carry out calculations with real numbers, the Python language contains many additional
functions collected in a library (module) called math.
To use these functions, you need to import the math module at the beginning of the program,
which is done by the command:
import math
Python provides various operators for performing basic calculations, such as * for
multiplication, % for modulo, and / for division. Trigonometric functions and functions for
complex numbers, however, are not available as operators; you access them by importing the
math module, which provides hyperbolic, trigonometric, and logarithmic functions for real
numbers. For complex numbers, you can use the cmath module. When comparing math vs
NumPy, the math library is more lightweight and well suited to scalar computation, while
NumPy is designed for computation over whole arrays.
The Python math library is the foundation for the rest of the math libraries, which are
written on top of its functionality and of functions defined by the C standard.
Number-theoretic and representation functions
This part of the mathematical library is designed to work with numbers and their
representations. It allows you to effectively carry out the necessary transformations with
support for NaN (not a number) and infinity, and is one of the most important sections of the
Python math library. Below is a short list of these functions for Python 3; a more detailed
description can be found in the documentation for the math library.
math.ceil(x) - return the ceiling of x, the smallest integer greater than or equal to x
math.comb(n, k) - return the number of ways to choose k items from n items without
repetition and without order
math.copysign(x, y) - return a float with the magnitude (absolute value) of x but the sign of
y. On platforms that support signed zeros, copysign(1.0, -0.0) returns -1.0
math.fabs(x) - return the absolute value of x
math.factorial(x) - return x factorial as an integer. Raises ValueError if x is not integral or
is negative
math.floor(x) - return the floor of x, the largest integer less than or equal to x
math.fmod(x, y) - return fmod(x, y), as defined by the platform C library
math.frexp(x) - return the mantissa and exponent of x as the pair (m, e). m is a float and e
is an integer such that x == m * 2**e exactly
math.fsum(iterable) - return an accurate floating-point sum of values in the iterable
math.gcd(a, b) - return the greatest common divisor of the integers a and b
math.isclose(a, b, *, rel_tol=1e-09, abs_tol=0.0) - return True if the values a and b are
close to each other and False otherwise
math.isfinite(x) - return True if x is neither infinity nor a NaN, and False otherwise (note
that 0.0 is considered finite)
math.isinf(x) - return True if x is positive or negative infinity, and False otherwise
math.isnan(x) - return True if x is a NaN (not a number), and False otherwise
math.isqrt(n) - return the integer square root of the nonnegative integer n. This is the floor
of the exact square root of n, or equivalently the greatest integer a such that a² ≤ n
math.ldexp(x, i) - return x * (2**i). This is essentially the inverse of function frexp()
math.modf(x) - return the fractional and integer parts of x. Both results carry the sign of x
and are floats
math.perm(n, k=None) - return the number of ways to choose k items from n items without
repetition and with order
math.prod(iterable, *, start=1) - calculate the product of all the elements in the input
iterable. The default start value for the product is 1
math.remainder(x, y) - return the IEEE 754-style remainder of x with respect to y
math.trunc(x) - return the Real value x truncated to an Integral (usually an integer)
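A short demonstration of several of the functions listed above:

import math

print(math.ceil(3.2))                # 4
print(math.floor(3.8))               # 3
print(math.comb(5, 2))               # 10 ways to choose 2 of 5 items
print(math.gcd(12, 18))              # 6
print(math.isclose(0.1 + 0.2, 0.3))  # True, within the default tolerance
print(math.frexp(8.0))               # (0.5, 4), since 8.0 == 0.5 * 2**4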
Power and logarithmic functions
The power and logarithmic functions section is responsible for exponential calculations,
which are important in many areas of mathematics, engineering, and statistics. These
functions can work with natural logarithms and exponentials, base-2 logarithms, and
logarithms to arbitrary bases.
math.exp(x) - return e raised to the power x, where e = 2.718281… is the base of natural
logarithms
math.expm1(x) - return e raised to the power x, minus 1. Here e is the base of natural
logarithms
math.log(x[, base]) - with one argument, return the natural logarithm of x (to base e). With
two arguments, return the logarithm of x to the given base, calculated as log(x)/log(base)
math.log1p(x) - return the natural logarithm of 1+x (base e). The result is calculated in a
way that is accurate for x near zero
math.log2(x) - return the base-2 logarithm of x. This is usually more accurate than log(x, 2)
math.log10(x) - return the base-10 logarithm of x. This is usually more accurate than
log(x, 10)
math.pow(x, y) - return x raised to the power y
math.sqrt(x) - return the square root of x
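The power and logarithmic functions in action:

import math

print(math.exp(1))       # 2.718281828459045, i.e. e
print(math.log(math.e))  # 1.0, the natural logarithm
print(math.log(8, 2))    # 3.0, logarithm to an arbitrary base
print(math.log2(8))      # 3.0, usually more accurate than log(x, 2)
print(math.pow(2, 10))   # 1024.0
print(math.sqrt(16))     # 4.0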
Trigonometric functions
Trigonometric functions, direct and inverse, are widely represented in the Python math
library. They work with radian values, which is important to keep in mind. It is also possible
to carry out Euclidean distance calculations with math.hypot() and math.dist().
Angular conversion
Converting degrees to radians and vice versa is a fairly common need, so the library provides
math.degrees() and math.radians() for these conversions. This allows you to write compact
and understandable code.
Hyperbolic functions
Hyperbolic functions are analogs of trigonometric functions that are based on hyperbolas
instead of circles.
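A short demonstration covering the trigonometric, angular-conversion, and hyperbolic
functions described above:

import math

print(math.sin(math.radians(90)))  # 1.0, sine of 90 degrees
print(math.degrees(math.pi))       # 180.0
print(math.hypot(3, 4))            # 5.0, Euclidean norm
print(math.cosh(0))                # 1.0, hyperbolic cosine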
Special functions
The special functions section provides the error functions (math.erf(), math.erfc()) and the
gamma functions (math.gamma(), math.lgamma()). These come up often enough that they
were implemented in the standard Python math library.
Constants
The constants section provides ready-made values for basic constants (such as math.pi and
math.e) with the accuracy appropriate for a given hardware platform, which is important for
Python's portability as a cross-platform language. The very important values infinity
(math.inf) and "not a number" (math.nan) are also defined in this section of the library.
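The constants in action:

import math

print(math.pi)   # 3.141592653589793
print(math.e)    # 2.718281828459045
print(math.inf)  # floating-point positive infinity
print(math.nan)  # floating-point "not a number"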
SCIPY
SciPy is an open-source library for the Python programming language, designed to perform
scientific and engineering calculations.
The SciPy ecosystem includes general and specialized tools for data management and
computation, productive experimentation, and high-performance computing. Below, we
overview some key packages, though there are many more relevant packages.
IPython, a rich interactive interface, letting you quickly process data and test ideas
The Jupyter notebook provides IPython functionality and more in your web browser,
allowing you to document your computation in an easily reproducible form
Cython extends Python syntax so that you can conveniently build C extensions, either to
speed up critical code or to integrate with C/C++ libraries
Dask, Joblib or IPyParallel for distributed processing with a focus on numeric data
Quality assurance:
nose, a framework for testing Python code, being phased out in preference for pytest
numpydoc, a standard and library for documenting Scientific Python libraries
In the tutorial Basic functions — SciPy v1.4.1 Reference Guide, you can find how to work
with polynomials, their derivatives, and integrals. With one line of code, SciPy computes a
polynomial's derivative or integral. Imagine how many lines of code you would need to do
this without SciPy. This is why this library is valuable in Python:
>>> from numpy import poly1d
>>> p = poly1d([3, 4, 5])
>>> print(p)
   2
3 x + 4 x + 5
>>> print(p * p)
   4      3      2
9 x + 24 x + 46 x + 40 x + 25
>>> print(p.integ(k=6))
   3     2
1 x + 2 x + 5 x + 6
>>> print(p.deriv())
6 x + 4
>>> p([4, 5])
array([ 69, 100])
NUMPY
In early 2005, programmer and data scientist Travis Oliphant wanted to unite the community
around one project and created the NumPy library to replace the Numeric and NumArray
libraries. NumPy was created based on the Numeric code. The Numeric code was rewritten to
be easier to maintain, and new features could be added to the library. NumArray features
have been added to NumPy.
NumPy was originally part of the SciPy library. To allow other projects to use the NumPy
library, its code was placed in a separate package.
The source code for NumPy is publicly available. NumPy is licensed under the BSD license.
As described in the NumPy documentation, “NumPy gives you an enormous range of fast
and efficient ways of creating arrays and manipulating numerical data inside them. While a
Python list can contain different data types within a single list, all of the elements in a NumPy
array should be homogenous. The mathematical operations that are meant to be performed on
arrays would be extremely inefficient if the arrays weren’t homogenous.”
The NumPy reference documentation covers:
Array objects
Constants
Universal functions (ufunc)
Routines
Packaging (numpy.distutils)
NumPy Distutils - Users Guide
NumPy C-API
NumPy internals
NumPy and SWIG
NumPy basics:
Data types
Array creation
I/O with NumPy
Indexing
Broadcasting
Byte-swapping
Structured arrays
Writing custom array containers
Subclassing ndarray
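A minimal sketch of two topics from the list above, array creation and broadcasting
(operations between arrays of compatible but different shapes):

import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])      # a 2x3 array; all elements share one type
row = np.array([10, 20, 30])   # shape (3,)

# Broadcasting stretches `row` across both rows of `a`.
print(a + row)
# [[11 22 33]
#  [14 25 36]]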
One of the main objects of NumPy is ndarray. It allows you to create multidimensional
arrays of elements of the same type and perform operations on them at great speed. Unlike
sequences in Python, arrays in NumPy have a fixed size, and the elements of the array must
be of the same type. You can apply various mathematical operations to arrays, which are
performed more efficiently than for Python sequences.
The next example shows how to work with linear algebra with NumPy. It is really simple
and easy to understand for Python users.
>>> import numpy as np
>>> a = np.array([[1.0, 2.0], [3.0, 4.0]])
>>> a.transpose()
array([[ 1., 3.],
       [ 2., 4.]])
>>> np.linalg.inv(a)
array([[-2. ,  1. ],
       [ 1.5, -0.5]])
>>> j = np.array([[0.0, -1.0], [1.0, 0.0]])
>>> np.linalg.eig(j)
(array([ 0.+1.j, 0.-1.j]), array([[ 0.70710678+0.j, 0.70710678-0.j],
       [ 0.00000000-0.70710678j, 0.00000000+0.70710678j]]))
MATPLOTLIB
Neuroscientist John D. Hunter began developing matplotlib in 2003, mainly inspired by a
desire to emulate the plotting capabilities of MathWorks' MATLAB. Matplotlib is today a
product of the whole community: it is developed and supported by many people. John talked
about the evolution of matplotlib at the SciPy conference in 2012.
Learning matplotlib can at times be a difficult process. The problem is not a lack of
documentation (which is very extensive, by the way). Difficulties may arise with the
following:
The size of the library is huge in itself, about 70,000 lines of code
Matplotlib contains several different interfaces (ways to build a figure) and can interact
with a large number of backends. (The backends are responsible for how the charts are
actually displayed, not just for their internal structure)
Despite its vastness, some of matplotlib's own documentation is seriously outdated. The
library is still evolving, and many old examples found on the web could be written with
far less code in the library's current version
Understanding that matplotlib's roots grow from MATLAB helps explain the existence of
pylab. pylab is a module inside the matplotlib library that was built in to emulate the overall
MATLAB style. It exists only to bring a number of functions from NumPy and matplotlib
into the namespace, which simplifies the transition for MATLAB users who are not used to
needing import statements. Former MATLAB users love its functionality, because with from
pylab import * they can simply call plot() or array() directly, just like they did in MATLAB.
One of matplotlib's distinctive features is the hierarchy of its objects. If you have already
worked through the matplotlib introductory manual, you may have already called something
like plt.plot([1, 2, 3]). This one line indicates that the graph is actually a hierarchy of
Python objects. By "hierarchy" we mean that each chart is based on a tree-like structure of
matplotlib objects.
The Figure object is the outermost container for matplotlib graphics, and it can include
several Axes objects. The name can be a source of confusion: an Axes object is, in fact, what
we mean by an individual graph or chart (rather than the plural of "axis", as you might
expect).
You can think of the Figure object as a box-like container holding one or more Axes objects
(actual graphs). Below Axes objects in the hierarchy are smaller objects such as individual
lines, tick marks, legends, and text boxes. Almost every "element" of a chart is its own
manipulable Python object, right down to labels and markers. A sketch of this hierarchy
follows.
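A minimal sketch of the hierarchy: one Figure containing one Axes, with individual
elements (a line, a title, a legend) manipulated as objects.

import matplotlib.pyplot as plt

fig, ax = plt.subplots()            # Figure = container, Axes = the plot
line, = ax.plot([1, 2, 3], label="demo line")
ax.set_title("One Axes inside one Figure")
ax.legend()
plt.show()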
Matplotlib is a flexible, easily configurable package that, along with NumPy, SciPy, and
IPython, provides features similar to MATLAB. The package currently works with several
graphics libraries, including wxWindows and PyGTK.
The Python code itself is quite simple and straightforward. Here's an example of a simple
plot:
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
plt.plot(x, y, marker="o")
plt.show()
Simple 3D graphics can be built using the mplot3d toolkit. There are other toolkits: for
mapping, for working with Excel, utilities for GTK and others. With Matplotlib, you can
make animated images.
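A minimal 3D example using the mplot3d toolkit (a sketch):

import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # enables the mplot3d toolkit
t = np.linspace(0, 4 * np.pi, 200)
ax.plot(np.cos(t), np.sin(t), t)       # a simple helix
plt.show()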
Pandas
Pandas (styled as pandas) is an open-source software library designed for the Python
programming language, focusing on data manipulation and analysis. It provides data
structures like Series and DataFrames that make it easy to clean, transform, and analyze
large datasets, and it integrates seamlessly with other Python libraries, such
as NumPy and Matplotlib.
It offers powerful functions for data transformation, aggregation, and visualization, which
are important for credible analysis. Created by Wes McKinney in 2008, pandas has grown to
become a cornerstone of data analysis in Python, widely used by data scientists, analysts,
and researchers worldwide. Pandas revolves around two primary data structures: Series (1D)
for single columns and DataFrame (2D) for tabular data, much like an Excel spreadsheet,
enabling efficient data manipulation. The name is derived from the term "panel data", an
econometrics term for multidimensional data sets.
What is Pandas Used for?
With pandas, you can perform a wide range of data operations, including
Reading and writing data from various file formats like CSV, Excel, and SQL databases.
Cleaning and preparing data by handling missing values and filtering entries.
Merging and joining multiple datasets seamlessly.
Reshaping data through pivoting and stacking operations.
Conducting statistical analysis and generating descriptive statistics.
Visualizing data with integrated plotting capabilities.
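A minimal sketch of several of these operations, assuming a hypothetical sales.csv file
with region and price columns:

import pandas as pd

df = pd.read_csv("sales.csv")             # read data from a CSV file
df = df.dropna(subset=["price"])          # drop rows with missing prices
summary = df.groupby("region")["price"].describe()  # descriptive statistics
print(summary)
df.to_csv("sales_clean.csv")              # write the cleaned data back out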
Learn Pandas
Now that we know what pandas is and what it is used for, let's move on to the tutorial part.
The sections below, arranged from basic to advanced, will help you learn more about
pandas.
Pandas Basics
In this section, we will explore the fundamentals of Pandas. We will start with an
introduction to Pandas, learn how to install it, and get familiar with its core functionalities.
Additionally, we will cover how to use Jupyter Notebook, a popular tool for interactive
coding. By the end of this section, we will have a solid understanding of how to set up and
start working with Pandas for data analysis.
Pandas Introduction
Pandas Installation
Getting started with Pandas
How To Use Jupyter Notebook
Pandas DataFrame
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data
structure with labeled axes (rows and columns); think of it as a table or a spreadsheet.
Creating a DataFrame
Pandas Dataframe Index
Pandas Access DataFrame
Indexing and Selecting Data with Pandas
Slicing Pandas Dataframe
Filter Pandas Dataframe with multiple conditions
Merging, Joining, and Concatenating Dataframes
Sorting Pandas DataFrame
Pivot Table in Pandas
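A minimal sketch touching several topics from the list above (creating, accessing,
filtering, and sorting a DataFrame); the sample data is made up for illustration:

import pandas as pd

df = pd.DataFrame(
    {"name": ["Ana", "Ben", "Cara"],
     "age": [28, 34, 23],
     "city": ["Pune", "Delhi", "Pune"]}
)

print(df["age"].mean())           # column access
print(df[df["city"] == "Pune"])   # filtering with a boolean condition
print(df.sort_values("age"))      # sorting by a column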
Pandas Series
A Series is a one-dimensional labeled array capable of holding any data type (integers,
strings, floating-point numbers, Python objects, etc.). It’s similar to a column in a spreadsheet
or a database table.
Creating a Series
Accessing elements of a Pandas Series
Binary Operations on Series
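A minimal sketch of the topics above: creating a labeled Series, accessing an element,
and performing a binary operation aligned on the index.

import pandas as pd

s1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2 = pd.Series([1, 2, 3], index=["a", "b", "c"])

print(s1["b"])   # access by label -> 20
print(s1 + s2)   # element-wise addition aligned on the index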
Dendrogram
Consider five points labeled P, Q, R, S, and T; these represent individual data points that are
being clustered. A dendrogram shows how these points are grouped together step by step.
At the bottom of the dendrogram, the points P, Q, R, S, and T are all separate.
As you move up, the closest points are merged into a single group.
The lines connecting the points show how they are progressively merged based on similarity.
The height at which they are connected shows how similar the points are to each other: the
shorter the line, the more similar they are.
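A sketch reproducing the idea of such a figure with SciPy, using made-up coordinates for
the five points:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[1, 1], [1.5, 1], [5, 5], [5, 5.5], [9, 2]])
Z = linkage(points, method="single")   # record pairwise merges, closest first
dendrogram(Z, labels=["P", "Q", "R", "S", "T"])
plt.show()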
Types of Hierarchical Clustering
Now that we understand the basics of hierarchical clustering, let’s explore the two main types
of hierarchical clustering.
Agglomerative Clustering
Divisive Clustering
Hierarchical Agglomerative Clustering
It is also known as the bottom-up approach or hierarchical agglomerative clustering (HAC).
Unlike flat clustering, hierarchical clustering provides a structured way to group data. This
clustering algorithm does not require us to prespecify the number of clusters. Bottom-up
algorithms treat each data point as a singleton cluster at the outset and then successively
agglomerate pairs of clusters until all clusters have been merged into a single cluster that
contains all the data.
The algorithm proceeds as follows:
1. Start: treat each data point as its own cluster.
2. Calculate distances between clusters: calculate the distance between every pair of
clusters. Initially, since each cluster has one point, this is the distance between the two data
points.
3. Merge the closest clusters: identify the two clusters with the smallest distance and merge
them into a single cluster.
4. Update the distance matrix: after merging, you now have one less cluster. Recalculate the
distances between the new cluster and the remaining clusters.
5. Repeat steps 3 and 4: keep merging the closest clusters and updating the distance matrix
until you have only one cluster left.
6. Create a dendrogram: as the process continues, you can visualize the merging of clusters
using a tree-like diagram called a dendrogram. It shows the hierarchy of how clusters are
merged.
A Python implementation of the above algorithm using the scikit-learn library:
from sklearn.cluster import AgglomerativeClustering
import numpy as np
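# The original snippet ends after the imports above; the following completion
# is a sketch with made-up sample points, not the original course code.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Merge clusters bottom-up until two clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)   # e.g. [1 1 1 0 0 0] - one cluster per group of points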
Divisive clustering, the top-down approach, works in reverse: start with all points in a single
cluster and repeatedly split clusters. Stop when each data point is in its own cluster: continue
this process until every data point is its own cluster, or a stopping condition (such as a
predefined number of clusters) is met.
Data Visualization
4. Improves Communication
Visual representations of data make it easier to share findings with others, especially those
who may not have a technical background. This is important in business, where stakeholders
need to understand data-driven insights quickly. Consider a TreeMap visualization in
Tableau showing the number of sales in each region of the United States, with the largest
rectangle representing California due to its high sales volume. This visual context is much
easier to grasp than a detailed table of numbers.
A data story can, for example, start with the profits and losses of multiple products and then
move on to recommendations on how to tackle the losses.
Best Practices for Visualizing Data
Effective data visualization is crucial for conveying insights accurately. Follow these best
practices to create compelling and understandable visualizations:
Audience-Centric Approach: Tailor visualizations to your audience’s knowledge level,
ensuring clarity and relevance. Consider their familiarity with data interpretation and adjust
the complexity of visual elements accordingly.
Design Clarity and Consistency: Choose appropriate chart types, simplify visual elements,
and maintain a consistent color scheme and legible fonts. This ensures a clear, cohesive, and
easily interpretable visualization.
Contextual Communication: Provide context through clear labels, titles, annotations, and
acknowledgments of data sources. This helps viewers understand the significance of the
information presented and builds transparency and credibility.
Engaging and Accessible Design: Design interactive features thoughtfully, ensuring they
enhance comprehension. Additionally, prioritize accessibility by testing visualizations for
responsiveness and accommodating various audience needs, fostering an inclusive and
engaging experience.