
Dimensionality Reduction in Machine Learning
Edited by
Jamal Amani Rad
Choice Modelling Centre and Institute for Transport Studies
University of Leeds
Leeds, United Kingdom

Snehashish Chakraverty
Department of Mathematics
National Institute of Technology Rourkela
Sundargarh, Odisha, India

Kourosh Parand
International Business University
Toronto, ON, Canada
Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
Copyright © 2025 Elsevier Inc. All rights are reserved, including those for text and data mining, AI training, and similar
technologies.
Publisher’s note: Elsevier takes a neutral position with respect to territorial disputes or jurisdictional claims in its published
content, including in maps and institutional affiliations.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical,
including photocopying, recording, or any information storage and retrieval system, without permission in writing from the
publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our
arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be
found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may
be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our
understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any
information, methods, compounds, or experiments described herein. In using such information or methods they should be
mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any
injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or
operation of any methods, products, instructions, or ideas contained in the material herein.

ISBN: 978-0-443-32818-3

For information on all Morgan Kaufmann publications


visit our website at https://www.elsevier.com/books-and-journals

Publisher: Mara E. Conner


Acquisitions Editor: Chris Katsaropoulos
Editorial Project Manager: Debarati Roy
Production Project Manager: Vishnu T. Jiji
Cover Designer: Mark Rogers
Typeset by VTeX
To Noora
Contents

Contributors xvii
About the Editors xxi
Preface xxv
Jamal Amani Rad, Snehashish Chakraverty, and Kourosh Parand

PART 1 Introduction to machine learning and data lifecycle

1. Basics of machine learning 3


Amirhosein Ebrahimi, Hoda Vafaei Sefat, and Jamal Amani Rad

1.1. Data processing in machine learning (ML) 3


1.1.1. What is data? Feature? Pattern? 3
1.1.2. Understanding data processing 6
1.1.3. High-dimensional data 15
1.2. Types of learning problems 17
1.2.1. Supervised learning 17
1.2.2. Unsupervised learning 18
1.2.3. Semi-supervised learning 18
1.2.4. Reinforcement learning 19
1.3. Machine learning algorithms lifecycle 19
1.4. Python for machine learning 19
1.4.1. Python and packages installation 20
References 37

2. Essential mathematics for machine learning 39


Ali Araghian, Mohsen Razzaghi, Madjid Soltani, and Kourosh Parand


2.1. Vectors 39
2.1.1. Basic concepts 39
2.1.2. Linear independence 40
2.1.3. Orthogonality 40
2.2. Matrices 40
2.2.1. Basic concepts 40
2.2.2. Operations 41
2.2.3. Some definitions 42
2.2.4. Important matrix properties 44
2.2.5. Determinant 45
2.2.6. Row and column spaces 46
2.2.7. Rank of a matrix 47
2.3. Vector and matrix norms 47
2.3.1. Vector norms 48
2.3.2. Matrix norms 48
2.4. Eigenvalues and eigenvectors 49
2.4.1. A system of linear equations 49
2.4.2. Calculation of eigenvalues and eigenvectors 50
2.4.3. Cayley–Hamilton theorem 54
2.5. Matrix centering 54
2.6. Orthogonal projection 57
2.7. Definition of gradient 58
2.8. Definition of the Hessian matrix 58
2.9. Definition of a Jacobian 58
2.10. Optimization problem 59
2.10.1. Feasible solutions 59
2.10.2. Lagrangian function 60
2.10.3. Karush–Kuhn–Tucker conditions 61
References 62

PART 2 Linear methods for dimension reduction

3. Principal and independent component analysis methods 65


Mohammadnavid Ghader, Mostafa Abdolmaleki, and Hassan Dana Mazraeh

3.1. Introduction 65
3.1.1. History 65
3.1.2. Intuition 66
3.2. The PCA algorithm 67
3.2.1. Projection in one-dimensional space 67
3.2.2. Projection in two-dimensional space 70
3.2.3. Projection in r-dimensional space 71
3.2.4. Example 73
3.2.5. Additional discussion about PCA 78
3.3. Implementation 78
3.3.1. How to implement PCA algorithm in Python? 78
3.3.2. Parameter options 78
3.3.3. Attribute options 79
3.4. Advantages and limitations 81
3.5. Unveiling hidden dimensions in data 82
3.5.1. The need for kernel PCA 82
3.5.2. Discovering nonlinear relationships 83
3.5.3. Dimensionality reduction and disclosing hidden dimensions 83
3.6. The Kernel PCA algorithm 83
3.6.1. Data preprocessing 83
3.6.2. Kernel selection 84
3.6.3. Kernel matrix calculation 84
3.6.4. Centering data points in feature space 86
3.6.5. Example 87
3.7. Implementation of Kernel PCA 88
3.8. Independent Component Analysis 90

3.8.1. The Cocktail Party Problem 90


3.8.2. A comparison between PCA and ICA 91
3.8.3. Theoretical background of ICA 92
3.8.4. The ICA model 94
3.8.5. Algorithms for ICA 95
3.8.6. Ambiguity in ICA 99
3.8.7. Example 100
3.8.8. Example of implementing ICA 102
3.9. Conclusion 105
References 106

4. Linear discriminant analysis 109


Rambod Masoud Ansari, Mohammad Akhavan Anvari, Saleh Khalaj Monfared,
and Saeid Gorgin

4.1. Introduction to linear discriminant analysis 109


4.1.1. What is linear discriminant analysis? 109
4.1.2. How does linear discriminant analysis work? 109
4.1.3. Application of linear discriminant analysis 110
4.2. Understanding the LDA algorithm 111
4.2.1. Prerequisite 111
4.2.2. Fisher’s linear discriminant analysis 113
4.2.3. Linear algebra explanation 114
4.3. The advanced linear discriminant analysis algorithm 118
4.3.1. Statistical explanation 118
4.3.2. Linear discriminant analysis compared to principal
component analysis 118
4.3.3. Quadratic discriminant analysis 119
4.4. Implementing the linear discriminant analysis algorithm 119
4.4.1. Using LDA with Scikit-Learn 119
4.5. LDA parameters and attributes in Scikit-Learn 120
4.5.1. Parameter options 120

4.5.2. Attributes option 120


4.5.3. Worked example of linear discriminant analysis algorithm for dimensionality 121
4.5.4. Fitting LDA algorithm on MNIST dataset 121
4.5.5. LDA advantages and limitations 123
4.6. Conclusion 125
References 125

PART 3 Nonlinear methods for dimension reduction

5. Linear local embedding 129


Pouya Jafari, Ehsan Espandar, Fatemeh Baharifard, and Snehashish Chakraverty

5.1. Introduction 129


5.1.1. What is nonlinear dimensionality reduction? 129
5.1.2. Why do we need nonlinear dimensionality reduction? 130
5.1.3. What is embedding? 131
5.2. Locally linear embedding 132
5.2.1. k nearest neighbors 133
5.2.2. Distance metrics 133
5.2.3. Weights 134
5.2.4. Coordinates 135
5.3. Variations of LLE 138
5.3.1. Inverse LLE 138
5.3.2. Kernel LLE 138
5.3.3. Incremental LLE 140
5.3.4. Robust LLE 142
5.3.5. Weighted LLE 144
5.3.6. Landmark LLE for big data 145
5.3.7. Supervised and semi-supervised LLE 147
5.3.8. LLE with other manifold learning methods 148

5.4. Implementation and use cases 149


5.4.1. How to use LLE in Python? 149
5.4.2. Using LLE in MNIST 151
5.5. Conclusion 154
References 154

6. Multi-dimensional scaling 157


Sherwin Nedaei Janbesaraei, Amir Hosein Hadian Rasanan,
Mohammad Mahdi Moayeri, Hari Mohan Srivastava, and Jamal Amani Rad

6.1. Basics 157


6.1.1. Introduction to multi-dimensional scaling 157
6.1.2. Data in MDS 162
6.1.3. Proximity and distance 162
6.2. MDS models 164
6.2.1. Metric MDS 166
6.2.2. Torgerson’s method 166
6.2.3. Least square model 168
6.2.4. Non-metric MDS 168
6.2.5. The goodness of fit 170
6.2.6. Individual differences models 172
6.2.7. INDSCAL 173
6.2.8. Tucker–Messick model 173
6.2.9. PINDIS 174
6.2.10. Unfolding models 175
6.2.11. Non-metric uni-dimensional scaling 176
6.3. Kernel-based MDS 178
6.4. MDS in practice 180
6.4.1. MDS in Python 180
6.4.2. Conclusion 184
References 184

7. t-Distributed stochastic neighbor embedding 187


Mohammad Akhavan Anvari, Dara Rahmati, and Sunil Kumar

7.1. Introduction to t-SNE 187


7.1.1. What is t-SNE? 187
7.1.2. Why is t-SNE useful? 187
7.1.3. Prerequisite 188
7.1.4. Applications of t-SNE 190
7.2. Understanding the t-SNE algorithm 190
7.2.1. The t-SNE perplexity parameter 191
7.2.2. The t-SNE objective function 192
7.2.3. The t-SNE learning rate 195
7.2.4. Implementing t-SNE in practice 196
7.3. Visualizing high-dimensional data with t-SNE 202
7.3.1. Choosing the right number of dimensions 203
7.3.2. Interpreting t-SNE plots 204
7.4. Advanced t-SNE techniques 205
7.4.1. Using t-SNE for data clustering 205
7.4.2. Combining t-SNE with other dimensionality reduction methods 205
7.5. Conclusion and future directions 206
References 207

PART 4 Deep learning methods for dimensionality reduction

8. Feature extraction and deep learning 211


Abtin Mahyar, Hossein Motamednia, Pooryaa Cheraaqee, and
Azadeh Mansouri

8.1. The revolutionary history of deep learning: from biology to simple perceptrons and beyond 211
8.1.1. A brief history 211
8.1.2. Biological neurons 213

8.1.3. Artificial neurons: the perceptron 214


8.2. Deep neural networks 217
8.2.1. Deep feedforward networks 217
8.2.2. Convolutional networks 219
8.3. Learned features 227
8.3.1. Visualizing learned features 228
8.3.2. Deep feature extraction 239
8.3.3. Deep feature extraction applications 240
8.4. Conclusion 241
References 242

9. Autoencoders 245
Hossein Motamednia, Ahmad Mahmoudi-Aznaveh, and Artie W. Ng

9.1. Introduction to autoencoders 245


9.1.1. Generative modeling 245
9.2. Autoencoders for feature extraction 246
9.2.1. Latent variable 246
9.2.2. Representation learning 247
9.2.3. Feature learning approaches 248
9.3. Types of autoencoders 248
9.3.1. Denoising autoencoder (DAE) 249
9.3.2. Sparse autoencoder (SAE) 249
9.3.3. Contractive autoencoder (CAE) 249
9.3.4. Variational autoencoder (VAE) 250
9.4. Autoencoder and learned features applications 251
9.4.1. Language encoding 251
9.4.2. Vision models 261
9.4.3. Convolutional autoencoder 263
9.5. Conclusion 266
References 267

10. Dimensionality reduction in deep learning through group actions 269
Ebrahim Ardeshir-Larijani, Mohammad Saeed Arvenaghi,
Akbar Dehghan Nezhad, and Mohammad Sabokrou

10.1. Introduction 269


10.2. Geometric context of deep learning 270
10.3. Group actions, invariant and equivariant maps 272
10.4. Equivariant neural networks 277
10.4.1. Group equivariant neural networks 278
10.4.2. The general theory of group equivariant neural networks 280
10.5. Implementation of equivariant neural networks 286
10.5.1. Implementing groups and actions 286
10.5.2. Implementing equivariant convolution layers 287
10.6. Conclusion 291
References 291

Index 293
Contributors

Mostafa Abdolmaleki
Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid
Beheshti University, Tehran, Iran

Mohammad Akhavan Anvari


School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran

Jamal Amani Rad


Choice Modelling Centre and Institute for Transport Studies, University of Leeds, Leeds, United
Kingdom

Ali Araghian
Department of Computer Science, Faculty of Art and Science, Bishop’s University, Sherbrooke,
QC, Canada

Ebrahim Ardeshir-Larijani
Iran University of Science and Technology (IUST), Tehran, Iran
School of Computing, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran

Mohammad Saeed Arvenaghi


Iran University of Science and Technology (IUST), Tehran, Iran

Fatemeh Baharifard
School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran

Snehashish Chakraverty
Department of Mathematics, National Institute of Technology Rourkela, Rourkela, Odisha, India

Pooryaa Cheraaqee
School of Computer Science & Engineering, University of Westminster, London, United
Kingdom

Hassan Dana Mazraeh


Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid
Beheshti University, Tehran, Iran

Amirhosein Ebrahimi
Electrical and Computer Engineering, Carleton University, Ottawa, ON, Canada

Ehsan Espandar
Department of Computer Science, Iran University of Science and Technology, Tehran, Iran

Mohammadnavid Ghader
Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid
Beheshti University, Tehran, Iran

Saeid Gorgin
Department of Computer Engineering, Chosun University, Gwangju, Republic of Korea

Amir Hosein Hadian Rasanan


Faculty of Psychology, University of Basel, Basel, Switzerland

Pouya Jafari
Department of Computer Science, Iran University of Science and Technology, Tehran, Iran

Sherwin Nedaei Janbesaraei


Institute for Cognitive Sciences Studies (ICSS), Tehran, Iran

Sunil Kumar
Department of Mathematics, National Institute of Technology, Jamshedpur, Jharkhand, India

Ahmad Mahmoudi-Aznaveh
Cyberspace Research Institute, Shahid Beheshti University, Tehran, Iran

Abtin Mahyar
School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran

Azadeh Mansouri
Department of Electrical and Computer Engineering, Faculty of Engineering, Kharazmi
University, Tehran, Iran

Rambod Masoud Ansari


School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran

Mohammad Mahdi Moayeri


Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada

Saleh Khalaj Monfared


Department of Electrical and Computer Engineering, Worcester Polytechnic Institute,
Worcester, MA, United States

Hossein Motamednia
School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran

Akbar Dehghan Nezhad


Iran University of Science and Technology (IUST), Tehran, Iran
Artie W. Ng
International Business University, Toronto, ON, Canada
Kourosh Parand
International Business University, Toronto, ON, Canada
Dara Rahmati
Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
Mohsen Razzaghi
Department of Mathematics and Statistics, Mississippi State University, Mississippi State, MS,
United States
Mohammad Sabokrou
Okinawa Institute of Science and Technology, Okinawa, Japan
Hoda Vafaei Sefat
Electrical and Computer Engineering, Carleton University, Ottawa, ON, Canada
Madjid Soltani
Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON,
Canada
Hari Mohan Srivastava
Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada
About the Editors

Jamal Amani Rad


Choice Modelling Centre and Institute for Transport Studies, University of Leeds, Leeds,
United Kingdom

Jamal Amani Rad received his Ph.D. in Applied Mathematics (Scientific Computing)
from the Department of Computer Science at Shahid Beheshti University in 2015. His
academic career began with a one-year computational neurocognitive modeling postdoctoral fellowship at the Institute for Cognitive and Brain Sciences. During this period, he made significant advancements in the mathematical modeling of human cognitive processes, applying cutting-edge computational approaches to analyze neural and behavioral data. He has held a variety of prestigious academic positions. He served as an Assistant Professor at Shahid Beheshti University, where he established a strong reputation in the fields of cognitive modeling and computational psychology, mentoring graduate students and leading interdisciplinary research projects. In this role, he was responsible for
groundbreaking research integrating mathematical psychology, numerical methods, and
cognitive neuroscience, publishing numerous high-impact papers on topics ranging from
decision-making and reinforcement learning to neural network modeling. Since joining
the University of Leeds, Amani Rad has continued his innovative work as a Mathematical Psychologist at the Choice Modelling Centre and Institute for Transport Studies. His research is distinguished by its interdisciplinary nature, bridging the gaps between neuroscience, psychology, applied mathematics, and data science. Amani Rad's contributions to
computational cognitive modeling and neuropsychology are recognized internationally,
with over 140 peer-reviewed publications, multiple book chapters, and keynote presentations at prestigious conferences. In addition to his extensive publication record, Amani Rad has been involved in various high-profile research projects supported by national and international grants, focusing on decision-making models, stochastic processes, and neural
mechanisms in learning and behavior. His work has contributed to our understanding of
how mathematical and computational tools can be leveraged to explain complex cognitive
phenomena such as perceptual decision-making, learning, memory, and attention. Amani
Rad's editorial leadership is exemplified by his role as the editor of the forthcoming book Dimensionality Reduction in Machine Learning, published by Elsevier. This book showcases his deep expertise in mathematical modeling and data analysis techniques, aimed at reducing complexity in large-scale datasets, a topic that sits at the intersection of machine learning, applied mathematics, and cognitive psychology. Amani Rad is passionate about fostering collaboration across disciplines and institutions. His vision for future research includes the development of a research hub focused on integrating computational
models with real-world data to solve problems in cognitive neuroscience, artificial intelligence, and human decision-making.

Snehashish Chakraverty
Department of Mathematics, National Institute of Technology Rourkela, Sundargarh,
Odisha, India
Snehashish Chakraverty has 29 years of experience as a researcher and teacher.
Presently, he is working in the Department of Mathematics (Applied Mathematics Group),
National Institute of Technology Rourkela, Odisha, as a senior (HAG) professor. Before
this, he was with CSIR-Central Building Research Institute, Roorkee, India. After graduation from St. Columba's College (Ranchi University), his career started at the University of Roorkee (now the Indian Institute of Technology Roorkee), and he completed an MSc
(Mathematics) and MPhil (Computer Applications) there securing the first position in
the university. Dr. Chakraverty received his PhD from IIT Roorkee in 1992. Thereafter, he
did his postdoctoral research at the Institute of Sound and Vibration Research (ISVR),
University of Southampton, UK, and at the Faculty of Engineering and Computer Science, Concordia University, Canada. He was also a visiting professor at Concordia and McGill University, Canada, from 1997 to 1999 and a visiting professor at the University of Johannesburg, South Africa, during 2011–2014. He has authored/coauthored 16 books,
and published 333 research papers (to date) in journals and conferences, 2 more books
are in press, and 2 books are ongoing. He is on the editorial boards of various International Journals, Book Series, and Conferences. Prof. Chakraverty is the chief editor of the "International Journal of Fuzzy Computation and Modelling" (IJFCM), Inderscience Publisher, Switzerland (http://www.inderscience.com/ijfcm), associate editor of "Computational Methods in Structural Engineering, Frontiers in Built Environment," and is an editorial board member of "Springer Nature Applied Sciences," "IGI Research Insights Books," "Springer Book Series of Modeling and Optimization in Science and Technologies," "Coupled Systems Mechanics (Techno Press)," "Curved and Layered Structures (De Gruyter)," "Journal of Composites Science (MDPI)," "Engineering Research Express (IOP)," and "Applications and Applied Mathematics: An International Journal." He is also a reviewer for around 50 national and international journals of repute, and he was the President of the Section of Mathematical Sciences (including Statistics) of the "Indian Science Congress" (2015–2016) and the Vice President of the "Orissa Mathematical Society" (2011–2013). Prof.
Chakraverty is a recipient of prestigious awards, viz. Indian National Science Academy
(INSA) nomination under International Collaboration/Bilateral Exchange Program (with
Czech Republic), Platinum Jubilee ISCA Lecture Award (2014), CSIR Young Scientist (1997),
BOYSCAST (DST), UCOST Young Scientist (2007, 2008), Golden Jubilee Director’s (CBRI)
Award (2001), INSA International Bilateral Exchange Award (2010–2011 [selected but could not undertake], 2015 [selected]), Roorkee University Gold Medals (1987, 1988) for first positions in MSc and MPhil (Computer Applications), etc. He has already guided 15 PhD
students and 9 are ongoing. Prof. Chakraverty has undertaken around 16 research projects
as the principal investigator funded by international and national agencies totaling about
1.5 crores. A good number of international and national conferences, workshops, and
training programs have also been organized by him. His present research areas include
differential equations (ordinary, partial, and fractional), Numerical Analysis and Computational Methods, Structural Dynamics (FGM, Nano) and Fluid Dynamics, Mathematical Modeling and Uncertainty Modeling, Soft Computing, and Machine Intelligence (Artificial Neural Networks, Fuzzy, Interval, and Affine Computations).

Kourosh Parand
International Business University, Toronto, ON, Canada
Kourosh Parand received his Ph.D. in Applied Mathematics, specializing in Numerical
Analysis and Control, from Amirkabir University of Technology, Iran, in 2004. He previously served as a full professor at the Department of Data & Computer Science in the Faculty of Mathematics, as well as the Institute for Cognitive and Brain Sciences (ICBS) at the National University, Tehran, Iran. Parand has an outstanding academic background with expertise in numerical analysis (including spectral and meshless methods), cognitive science (epilepsy), mathematical physics, neural networks, data mining and analytics, business analytics, deep learning, and support vector machines. A globally recognized leader in applying spectral methods and machine learning techniques to nonlinear dynamical models, Parand has an H-index of 30. Throughout his career, he has held several prominent roles, including Chair of the Department of Data & Computer Science and Director of the Science and Technology Park (2018–2019) at National University. He has been awarded multiple times as the university's top researcher. Parand has undertaken sabbaticals at the University of Alberta (2003–2004) and the University of Waterloo (2019–2020, 2023–2024), with this current book written during his time at Waterloo. He has published over 250 papers in peer-reviewed journals and presented at numerous international conferences. Additionally, he has served as a reviewer and editorial board member for several prestigious journals. He has successfully supervised more than 70 Master's students, 11 Ph.D. candidates, and six postdoctoral fellows. Currently, Parand is a full-time professor at the International Business University in Toronto, Canada.
Preface
Machine learning is a rapidly growing field that has revolutionized how we approach data
analysis and problem-solving. With its ability to learn from data, machine learning has become integral to many fields, including computer science, engineering, and data science.
However, with the vast amounts of data now available, extracting meaningful information
from the data can be challenging. This is where dimension reduction and feature selection
come in. Dimension reduction is a technique used in data analysis to reduce the number
of features or variables in a dataset while retaining the most important information. This
can be done by transforming the original high-dimensional data into a lower-dimensional
space that preserves as much of the original structure as possible. There are several reasons
why dimension reduction is important in data analysis:
• Improving computational efficiency: High-dimensional data can be computationally
expensive to work with, and dimension reduction techniques can simplify the data and
make it easier to work with.
• Reducing noise: High-dimensional data often contains a lot of noise, making it difficult to extract meaningful patterns. By reducing the number of dimensions, dimension
reduction can help to filter out the noise and extract the most important patterns.
• Visualizing data: It can be challenging to visualize high-dimensional data, and dimension reduction can help to simplify the data and make it easier to plot and interpret.
• Improving accuracy: High-dimensional data can sometimes lead to overfitting, which occurs when a model is too complex and fits the training data too closely. By reducing the number of dimensions, dimension reduction can help to prevent overfitting and improve the accuracy of models.
Overall, dimension reduction is an important tool in data analysis that can help simplify
and clarify complex data and make extracting meaningful patterns and insights easier. This
book, "Dimensionality Reduction in Machine Learning," is aimed at anyone who wants
to learn the basics of machine learning and how to reduce the dimensionality of large
datasets. It contains 10 chapters that cover a wide range of topics, from the basics of machine learning to deep learning and feature selection.
Chapter 1 presents the basics of machine learning and explains the different types of
machine learning algorithms. Moreover, it covers the preliminaries of Python programming. Chapter 2 covers the essential mathematics for machine learning, including basic
algebra, linear algebra, probability, and optimization.
Chapters 3 and 4 are devoted to linear dimension reduction techniques, including Principal Component Analysis (PCA), Independent Component Analysis (ICA), Factor Analysis (FA), and Linear Discriminant Analysis (LDA). Similarly, Chapters 5 to 7 present the
nonlinear dimension reduction methods, including Local Linear Embedding (LLE), Multi-Dimensional Scaling (MDS), and t-distributed Stochastic Neighbor Embedding (t-SNE).
These chapters explain the principles behind each method and provide examples of how
they can be used in practice.
Chapter 8 discusses the use of deep learning for feature selection and provides an
overview of some of the most popular deep learning architectures. Chapter 9 covers autoencoder neural networks, a type of deep learning algorithm used for dimension reduction. Finally, Chapter 10 explains how deep learning can achieve dimension reduction
through group actions.
This book is designed to be accessible to anyone with a basic understanding of mathematics and programming. Each chapter includes practical examples and exercises to help
readers better understand the concepts covered. By the end of this book, readers will have
a solid understanding of dimensionality reduction and will be able to apply these techniques to their datasets. This book is designed to build on the knowledge gained in the previous book, and readers who work through the entire series will have a deep understanding of the theory and practice of machine learning. Whether you are a student or a
practitioner in machine learning, this book is an invaluable resource for anyone looking to
improve their understanding and expertise in this rapidly evolving field.

Jamal Amani Rad, Snehashish Chakraverty, and Kourosh Parand


1
Basics of machine learning
Amirhosein Ebrahimi a, Hoda Vafaei Sefat a, and Jamal Amani Rad b
a Electrical and Computer Engineering, Carleton University, Ottawa, ON, Canada b Choice Modelling Centre

and Institute for Transport Studies, University of Leeds, Leeds, United Kingdom

1.1 Data processing in machine learning (ML)


1.1.1 What is data? Feature? Pattern?
A collection of facts and statistics that have been collected as a result of a number of measurements, observations, or simulations is referred to as data. Data may be classified into
many categories, including numbers, words, photos, videos, sounds, etc. Data is not always
meaningful or comprehensible. To be more specific, one can easily absorb some abstract
information from a given image but cannot comprehend detailed information [1]. On the
other hand, extracting meaningful information from raw numbers gathered by a machine
in a factory, such as a machine’s average speed, downtimes, and the number of produced
products, is an impossible or time-consuming process [2]. Worse still, all these challenges
stem from the human perspective, and for computers, these raw data mean nothing [3].
The goal of interpreting and processing these data is to make them machine-readable so
that they can be used for more precise analysis and to take advantage of the machines’
high capability [5]. Once data has been processed, it can be sent to a computer, where it
can be analyzed using a number of different methods, each of which has the potential to
yield useful insights, see Fig. 1.1 [6].
Let us move on to discussing data patterns and features. Data features are any observable characteristics of the data that can be exploited in an analysis [7,20]. Features can be categorical, such as gender or movie genre, or numerical, such as age or income [10]. The selection and engineering of features from raw data is a key step in machine learning [8]. Features are used as inputs to machine learning algorithms to train models for prediction, classification, and other tasks [9]. Effective feature engineering requires domain expertise to select meaningful features [11]. Data patterns refer to interesting relationships, regularities, or structures that can be discovered in data [5]. Identifying patterns allows for
summarizing and interpreting complex datasets. To provide further clarity, if we take a
look at a dataset of housing prices, we can see that it contains characteristics such as the
number of bedrooms, location within the town, square footage, age of the house, etc. In
the field of machine learning, the properties of this dataset (e.g., number of bedrooms)
serve as the inputs to our model, while the output of our model is the price of the house.
The machine learning algorithm (model) has to learn to map the input features to the output. This takes place whenever the machine learning algorithms discover a certain pattern

FIGURE 1.1 At first sight, the computer cannot obtain the relevant information from the image of the cat. Once it
is processed (readable for a computer) using some techniques, it can extract useful information, as illustrated in the
figure.

in our data. A pattern is a set of data that follows a recognizable structure or form. Features can be utilized for identifying patterns. Overall, finding the pattern in data points is of great importance, as it helps to make more accurate predictions and provides a better understanding of the data and the underlying phenomena represented by the data. To illustrate, consider a dataset of car prices and the corresponding features, such as the manufacturer, model year, driven wheels, size and design, and engine horsepower. The pattern in this dataset can be identified as follows: the more luxurious the design and the more modern and renowned the manufacturer, the more expensive the car will be. This extracted pattern may be used to create a model that predicts the price of a car based on the features stated.
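To make this feature-to-output mapping concrete, the following is a minimal sketch in Python (the language used throughout this book). It assumes scikit-learn is installed; the feature values and prices are invented toy numbers, not data from any real dataset.

    # Toy sketch: learn a mapping from car features to price.
    # Features per row: [model year, engine horsepower]; all values are hypothetical.
    from sklearn.linear_model import LinearRegression

    X = [[2015, 120], [2018, 150], [2020, 200], [2022, 310]]
    y = [12000, 18500, 27000, 55000]     # prices in dollars (made up)

    model = LinearRegression()
    model.fit(X, y)                      # the model learns the pattern in the data
    print(model.predict([[2019, 180]]))  # predicted price for an unseen car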
Let us take a look at an image generated by the AI (Stable Diffusion 1.5) [12] in Fig. 1.2.
As denoted before, features are some parts of objects in an image. A triangle, for instance,
is easily recognizable as a triangle because of its three corners and three edges. Computers
use the same method to determine what is in a picture. Features are distinguishing characteristics of an image, such as its corners, size, color, edges, ridges, texture, etc. In Fig. 1.2 (left), there are many objects, including chairs, tables, clocks, pottery, vessels, books, etc.; all these objects can be easily detected using the distinguishing characteristics mentioned before. Fig. 1.2 (right) represents the objects detected by the machine using the YOLOv5 object detection model.

FIGURE 1.2 How is an object detected by its relevant features and patterns? A chair has four legs, a flat surface,
and a backing. The model has learned that these patterns and features are related to a chair. When it detects four
or three legs and other features of a chair in a picture, it assumes it is a chair.

1.1.1.1 Types of data


One of the most, if not the most, essential steps before dealing with and utilizing data is
to determine the type of data. Data is either structured or unstructured. The format of
structured data is predetermined and standard. Examples of structured data formats include database spreadsheets, XML and JSON files, etc. This standard format enables more efficient data analysis and more convenient manipulation of data [11]. In contrast, unstructured data does not belong to any particular format (such as images, texts, or videos), making it more difficult to work with for analysis and manipulation [13]. Primarily, data is classified into the following categories:
• Numerical: Information that can be measured and quantified, such as a person's age, height, weight, and income. Numeric data might be continuous or discrete. Continuous data may take any value within a particular range, such as height and weight, but discrete data can only take certain values, such as the number of body organs. Data in numerical form is straightforward to obtain and manipulate using mathematical operations.
• Categorical: Information that can be divided into groups. Categorical data, unlike numerical data, do not have numerical values, like color (e.g., red, green, blue). These forms of data are broadly classed as nominal and ordinal. Unlike nominal data, ordinal data refers to categories that have a rank or order, such as a person's degree of education (high school, undergraduate, graduate) or their level of satisfaction with a product or service (extremely dislike, dislike, neutral, like, extremely like).

• Text: Text data refers to information in the form of written words, such as documents,
comments on an online shop's products, etc. Natural language processing and information retrieval are two examples of the numerous applications for text data. The purpose
of natural language processing is to extract meaning from text by analyzing sentence
structure, word selection, and context. Information retrieval is the process of extracting
valuable data from a massive database. Stable Diffusion [12] is a remarkable illustration
of the current advancements in AI. The Stable Diffusion model employs deep learning
to convert text into images. This tool’s principal function is to generate detailed images
based on written descriptions.
• Image: Data in the form of pictures or movies. It records visual data such as forms, colors, and textures. In computers, a picture is represented as an array of integers, each representing a pixel that is saved as an 8-bit integer with a possible value range of 0 to 255. There are further alternatives such as 10-bit, 16-bit, etc.; nevertheless, 8-bit is the most popular option. Typically, 0 signifies black, and 255 indicates
white. A grayscale (black-and-white) picture has one channel of numbers, but a color
image has three channels for red, green, and blue (RGB). Image data may be utilized
in several contexts, including the medical and astronomical professions, as well as the
artistic world.
In the field of medicine, image data can be used to diagnose diseases and track the
effects of treatments. In the field of astronomy, image data can be analyzed to investigate the properties of celestial objects. Image data is helpful in data analysis because it can extract features and patterns, such as identifying objects and examining their form, color, and texture. Machine learning algorithms can also use image data; one example is image classification, whose aim is to automatically classify images into predetermined classes based on their content (a short sketch of how image data is laid out as arrays follows this list).
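As a brief illustration of the pixel representation described in the Image item above, the sketch below builds synthetic grayscale and RGB arrays with NumPy; the shapes and values are arbitrary examples, not taken from any real image.

    import numpy as np

    # A grayscale image: one channel of 8-bit integers in the range 0..255.
    gray = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)
    # A color image: three channels (red, green, blue).
    rgb = np.random.randint(0, 256, size=(28, 28, 3), dtype=np.uint8)

    print(gray.shape, int(gray.min()), int(gray.max()))  # (28, 28) and values within 0..255
    print(rgb.shape)                                      # (28, 28, 3)
    scaled = rgb / 255.0                                  # common preprocessing: rescale to [0, 1]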

1.1.2 Understanding data processing


As denoted before, data should be in a standard and suitable format. First, it should be converted such that the machine can understand it. Secondly, some parts of the data should be either modified, dropped, or, in most cases, expanded [14]. Consequently, this processing procedure can enhance performance and facilitate data utilization. Most raw data includes many components that are either redundant and should be removed, or deficient and should be expanded [11]. Imagine a bot that saves a record of the Bitcoin price for the past three years. After all this time, we have data describing the behavior of Bitcoin prices and want to use the data to predict the price in the future. Given the possibility that the bot may have encountered some problems during this data collection, the finalized data may contain many duplicates or missing values. Given the sensitivity of machine learning algorithms, this can hamper their capability. Thus it is of great significance to prepare your data well before applying any machine learning algorithm. The input data quality heavily influences the machine learning model's accuracy and usefulness. Machine learning models can be prone to producing unreliable and erroneous
outputs if fed with low-quality input data. However, improved prediction performance and
more reliable insights might result from using high-quality input data.

1.1.2.1 Data analysis


Data analysis is the process of working with data and, more specifically, using statistical
or logical techniques or tools to obtain valuable and efficient information. This procedure
can expand our comprehension of the data and make data evaluation more efficient. Several fields can benefit significantly from utilizing data analysis. For example, it can be used in healthcare to arrange huge amounts of information quickly to find answers or treatments for different diseases. It can also be used in manufacturing to determine how many
products need to be made based on data collected and analyzed from samples of demand
and to increase operating capacity and profits.
The goal of machine learning is to create computer programs that can detect and
comprehend the relationships and patterns that exist within data and then use that understanding to make hypotheses or take action on new data. Nevertheless, for any real
learning to occur, the data must be examined and organized in such a way that it is of
good quality and in a format that machine learning algorithms can readily use [15]. This is
where data analysis can come in handy. The goal of data analysis is to uncover hidden patterns, relationships, and significant characteristics within the data. This knowledge may
subsequently be used to guide the modeling process. Analyzing the data may reveal that
certain features are more predictive than others or that the data are imbalanced, in the sense that particular groups or categories are underrepresented in the sample. Typically, the data
analysis procedure involves several steps. One of these is data cleansing, which involves
identifying any errors or inconsistencies in the data, such as missing or erroneous information, and making the appropriate modifications to fix them. When the data has been
thoroughly cleansed, it must undergo preprocessing so that it may be transformed into
a form that machine learning algorithms can understand. This may include actions such
as feature normalization, feature scaling, and feature selection. Normalization is the process of putting the values of the features into a comparable range. This is required because
machine learning algorithms usually assume that data has been normalized. Failure to
normalize the data may result in incorrect results. Scaling the features involves modifying
their scale such that they all contain a comparable range, which can improve the accuracy
of the models. Feature selection refers to the process of identifying which features of the
data are the most important and removing any qualities that are extraneous or redundant.
You may begin modeling with the data once it has been organized and preprocessed. The
type of model used will be determined by the circumstances of the case and the information available. Three popular modeling approaches are linear regression, decision trees,
and neural networks. These models are trained using data to identify repeating themes
and interconnections in the data. Following training, the model may be used for more data
to provide predictions or judgments. The outcomes of the analysis are eventually delivered
through data visualization, which facilitates comprehension. This involves the use of visual
aids such as graphs, charts, and other similar tools to help identify trends and patterns in
the data. The simple act of seeing raw data may not be sufficient to produce the amount of
comprehension that can be acquired via the use of data visualization.
In a nutshell, data analysis is one of the most critical steps in the machine learning workflow; it includes cleaning, preprocessing, modeling, and visualizing the data. It is vital for the development of trustworthy and efficient machine learning models capable of drawing conclusions or making predictions based on new data.

1.1.2.2 Data preprocessing


Information may be gleaned from raw data by using various methods of manipulation,
transformation, and analysis, all of which go under the umbrella term data processing [16].
Broadly speaking, there are four data preprocessing techniques [17]:
• data cleaning;
• data transformation;
• data integration;
• data reduction.
As an example, imagine you bought a furnished house. You would probably start by
cleaning the house before moving in and using it [18]. You may need to purchase new items or eliminate those you do not need (dealing with missing values and removing NaNs and outliers) [16,19]. You may also be more comfortable with your own bed than the existing one, and you probably do not need two of the same products (removing duplicates) [17]. Furthermore, there is a high probability that the house you just bought needs to be repaired, e.g., several holes in the walls, a clogged sink, etc. (correcting errors). Finally, you can change the orientation and location of the furniture in a way that is more convenient for you (standardizing) [17]. This particular example gives an overall overview of data cleaning techniques in machine learning: handling missing values and outliers, standardizing data, removing duplicates, and correcting errors [16].
In the context of machine learning, especially in real-world datasets, missing values are a challenging problem. There are several solutions for addressing this difficulty, depending on the case. One option is to simply remove all the rows or columns from the dataset that include data that has been determined to be missing [16]. The threshold for removing rows and columns varies case by case; for example, if more than half of a row's values are null, this row can be removed. This is a tricky task, and one should consider whether deleting some parts of the data is worth losing information that could be highly influential to the model's output. There is a second solution, called imputation, to tackle the challenge of losing data in the previous method. The idea is simple: replace the missing values in the columns or rows with the remaining data's mean, median, or mode [16]. Note that the mean and median can only be applied to numeric data. In a dataset with columns or rows with categorical values, the most frequent categories will
be chosen to fill in the missing values. To illustrate, in a dataset related to cars, there is
a column called color, and there are a total of five colors (black, white, red, yellow, and
silver). Suppose 50% of the cars are black, 30% white, 10% silver, 5% red, and 5% yellow; in
the case of missing values, the color black will be used to fill the missing ones. A better strategy would be to fill 50% of the missing values with black, 30% with white, and so on, preserving the observed proportions.
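A minimal imputation sketch with pandas is shown below; the column names and values are hypothetical, and the code only illustrates filling a numeric column with its median and a categorical column with its most frequent value.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "price": [41000, 55000, np.nan, 63000, np.nan],
        "color": ["black", "white", np.nan, "black", "black"],
    })

    # Numeric column: fill missing entries with the median of the observed values.
    df["price"] = df["price"].fillna(df["price"].median())
    # Categorical column: fill missing entries with the most frequent category (mode).
    df["color"] = df["color"].fillna(df["color"].mode()[0])
    print(df)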
Outliers in a dataset are observations that deviate strongly from the rest of the data. For example, consider a dataset of housing prices in a specific part of town, with 10,000 houses worth between $40,000 and $70,000. Suppose there are some cases far from this range, e.g., $1,000,000. These observations, which are significantly different from the rest of the data, can be considered outliers and can negatively impact the accuracy of the prediction. Detecting outliers manually can be time-consuming and may catch only some of them. This objective can be achieved by taking advantage of some visualization techniques, such as plotting the data points (box plot, scatter plot, histogram, distribution plot, QQ plot), or by using statistical methods.
There are many statistical methods, among which the most common ones are the Z-score and the interquartile range (IQR). The Z-score technique defines a score for each data point and detects the outliers based on the calculated score and a pre-defined threshold. Based on the following formula (1.1), assuming the data come from a Gaussian distribution, the Z-score represents the number of standard deviations a data point is away from the mean:

z_score(x_i) = (x_i − μ) / σ,    (1.1)

where μ is the mean of the whole dataset and σ is its standard deviation. If z_score(x_i) is higher than a pre-defined threshold, x_i will be flagged as an outlier.
As it turns out, the Z-score method has certain flaws because it depends on the mean of the data, and extreme outlier values strongly affect the mean. As a remedy, there is another statistical technique that relies on the median, a robust measure of central tendency that is not affected by extreme values: the interquartile range (IQR). This approach is based on the quartiles, which divide the whole dataset into four equal parts:
• Q1: the first quartile, below which 25% of the data fall.
• Q2: the second quartile (the median), below which 50% of the data fall.
• Q3: the third quartile, below which 75% of the data fall.
The IQR is then defined as the range between the first and the third quartile, IQR = Q3 − Q1. A lower bound and an upper bound are then defined to detect outliers: any data point lower than Q1 − 1.5 ∗ IQR or higher than Q3 + 1.5 ∗ IQR will be flagged as an outlier.
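The two rules above can be sketched in a few lines of NumPy; the house prices below are synthetic. Note that on this toy sample the single extreme value inflates the standard deviation, so the Z-score rule (with a threshold of 3) misses it while the IQR rule flags it, which is exactly the weakness of the Z-score discussed above.

    import numpy as np

    prices = np.array([42000, 51000, 48000, 66000, 58000, 1000000])

    # Z-score rule: flag points more than 3 standard deviations from the mean.
    z = (prices - prices.mean()) / prices.std()
    z_outliers = prices[np.abs(z) > 3]          # empty here: the extreme value inflates the std

    # IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, q3 = np.percentile(prices, [25, 75])
    iqr = q3 - q1
    iqr_outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]
    print(z_outliers, iqr_outliers)             # the IQR rule catches the 1,000,000 house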
The process of transforming data onto a standard scale is known as data standardization. Standardization is beneficial when comparing variables with different units or scales. Also, to alleviate the computational load, having all data points on a specific small scale is a useful technique. For example, in computer vision, an image is an array of integers between 0 and 255. The image is typically scaled to the range between 0 and 1 before any image processing techniques are applied.
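Both forms of rescaling mentioned here can be written in one or two lines; the arrays below are toy examples.

    import numpy as np

    heights_cm = np.array([150.0, 165.0, 172.0, 181.0, 194.0])
    # Z-score standardization: zero mean, unit standard deviation.
    standardized = (heights_cm - heights_cm.mean()) / heights_cm.std()

    # Min-max style scaling of 8-bit image data from [0, 255] down to [0, 1].
    image = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)
    image_scaled = image / 255.0
    print(np.round(standardized, 2), float(image_scaled.max()))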
With a grasp of data cleansing methods, learning about data transformation is next on
the agenda. The term "data transformation" refers to the act of changing data from one
format into another for the sake of making it more useful for machine learning and data
analysis. There are several data transformation techniques, some of which are explained
in the data cleaning process. These methods are as follows: data smoothing, scaling, normalization, feature construction, data aggregation, and data imputation. Data smoothing is the approach for removing noise from data by taking advantage of some mathematical functions. This strategy can increase the accuracy of predictions by reducing the influence of noise and outliers and facilitating the discovery of patterns and trends in data. Scaling, as is evident, is the process of transforming data to a specified range, such as −1 to +1, and normalization is the approach for converting the mean of the data to 0 and the standard deviation to 1; by doing so, data comparison is facilitated, and the influence of outliers is
diminished. Constructing new features or values from preexisting ones using methods like
feature selection and feature scaling is known as feature construction. Hence, we are able
to enhance the performance of machine learning algorithms. Finally, data imputation and
data aggregation are complementary methods. Data aggregation is the process of combining many datasets into a single dataset, for instance, by replacing these data points with
their mean or total, which can be useful for dimension reduction. Data imputation is the
process of estimating missing values or utilizing the mean or median of the current set of
values to fill in these missing values.
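Two of the transformations just described, smoothing and aggregation, can be sketched with pandas as follows; the daily price series is synthetic, and the window and frequency choices are arbitrary.

    import pandas as pd

    prices = pd.Series(
        [101, 99, 104, 250, 103, 105, 102],  # one noisy spike at index 3
        index=pd.date_range("2024-01-01", periods=7, freq="D"),
    )

    smoothed = prices.rolling(window=3, center=True).median()  # data smoothing
    aggregated = prices.resample("W").mean()                   # weekly data aggregation
    print(smoothed)
    print(aggregated)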
Data integration is the process of merging data from several sources into a single
dataset. Data integration is crucial as it provides a consistent view of disparate data while
ensuring data accuracy. This method has multiple uses, including in the medical area,
where it can assist in diagnosing illnesses and problems by merging data from various patients and clinics.
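A minimal data-integration sketch with pandas: two hypothetical sources are merged on a shared key into a single dataset. The column names and values are invented for illustration.

    import pandas as pd

    patients = pd.DataFrame({"patient_id": [1, 2, 3], "age": [54, 61, 47]})
    labs = pd.DataFrame({"patient_id": [1, 2, 3], "glucose": [5.4, 7.1, 6.2]})

    merged = patients.merge(labs, on="patient_id", how="inner")  # one unified view
    print(merged)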
When a dataset is very large, data reduction is a procedure that takes the original data and compresses it into a much smaller representation. This increases the efficiency of data processing, and when the data quantity is modest, using computationally advanced algorithms becomes simpler and more efficient.
In conclusion, data processing is essential to the field of machine learning, as it guarantees that the input data is of high quality and is presented in a format that can be utilized by machine learning algorithms in an efficient manner. This results in more accurate and trustworthy findings and insights, which can be utilized to drive informed decision-making and offer value to enterprises and organizations. These results and insights may
also be used to drive innovation.

1.1.2.3 Data visualization


Without a shadow of a doubt, data visualization is a vital process in machine learning as it
offers graphical insights into information and data, especially high-dimensional data and
complex information. One can easily comprehend patterns and relationships within the
data if provided with a standard and suitable data visualization. Consequently, it facilitates
the process of making an informed decision about choosing an appropriate algorithm. Finally, data visualization can aid in the identification of trends and outliers in the data that could otherwise go unnoticed using more conventional methods of analysis. One of the most,
if not the most, essential prerequisites for presenting data in a logical and easy-to-follow
manner is to select the appropriate charts and graphs. It ought to be aesthetically pleasing
and factually correct, with no room for misinterpretation or confusion. Data visualization
also requires a deep understanding of the data being presented, as well as the context in
which it is being presented. In this section, ten commonly used data visualization techniques will be explained.
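Most of the charts described below can be produced in a few lines with Matplotlib. As a preview, the following sketch draws a simple line chart in the spirit of Fig. 1.3; the numbers are invented for illustration only.

    import matplotlib.pyplot as plt

    years = [1952, 1972, 1992, 2007]
    life_expectancy = [46.0, 55.0, 63.0, 69.0]  # hypothetical average values

    plt.plot(years, life_expectancy, marker="o")
    plt.xlabel("Year")
    plt.ylabel("Life expectancy (years)")
    plt.title("Life expectancy over time")
    plt.show()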
• Line chart: The simplest form of a chart that represents data as a series of connected
lines. It is often used to represent trends over time, such as changes in stock prices or
fluctuations in temperature. Fig. 1.3 is a line plot showing how the countries' life expectancy changes over time (1952–2007).

FIGURE 1.3 Illustration of a line chart.

• Bar chart: The value of different categories or data points is represented by rectangular bars in a bar chart. There are vertical and horizontal bar charts. Typically, the bars' length or height represents the value of the data points being measured. Bar charts are useful for comparing values across different categories or groups of data. For example, Fig. 1.4 represents the population of Iran from 1952 to 2007. The change of colors also
depicts how the population differs from time to time. Bar charts have the advantage
of being simple to read and understand, especially for non-experts. They are extremely
adaptable, with a wide range of variations and alterations available, including stacked
bar charts, grouped bar charts, and clustered bar charts.
• Pie chart: A pie chart is a type of circular data visualization that displays the proportions
or percentages of several data points by using ``slices'' to represent the data points. The
size of each slice of the pie corresponds to the proportion or percentage of the overall
dataset that corresponds to the particular category or data point that it represents. Pie
charts should be utilized when there are few data points; otherwise, when numerous
data points are allocated, the distinctions between the small proportions are difficult to
identify. To illustrate this, Fig. 1.5 left represents the population of Asia as a pie chart,
FIGURE 1.4 Illustration of a bar chart.

and each slice represents the percentage of each country over the total population. It is
evident that it is not possible to identify the share of the smaller countries in the population.
On the other hand, Fig. 1.5 right represents the population of each continent, and one
can easily comprehend the information in the visualization.

FIGURE 1.5 Illustration of a pie chart.

• Scatter plot: A scatter plot depicts the relationship between two variables. The independent variable is typically represented by the horizontal axis, and the vertical axis
typically represents the dependent variable. Scatter plots can be used to analyze the
relationship between two variables and discover any patterns or trends in the data.
A scatter plot, for example, can be used to demonstrate the relationship between a per­
son’s height and weight (Fig. 1.6) or the relationship between a company’s advertising
spending and its sales income. We may find any positive or negative correlation between
the variables, as well as any outliers or clusters of data points, by inspecting the
scatter plot.

FIGURE 1.6 Illustration of a scatter plot.

• Heat map: A heat map, in its most fundamental form, employs color to represent value
over two or more dimensions. Unlike pie charts, a heat map can present a large number of
data points in an easy-to-understand manner. Heat maps may be used to illustrate
data trends and highlight regions of interest. Like many other types of grid representations,
heat maps often have two axes, with the first axis indicating one dimension
of the data and the second axis representing the other. The value of the data point at the
intersection of the two dimensions is represented by a color allocated to each cell in the
grid. With the intensity of the color signifying the magnitude of the value being plotted,
the color scale can run from cold (e.g., blue) to warm (e.g., red). Fig. 1.7 illustrates the
average arrival delay for each airline by month in heat map visualization.
• Tree map: A data visualization approach known as a tree map displays hierarchical data
by nesting a group of rectangles together to form a tree-like structure. A measurement
of the data, such as its frequency, value, or proportion, is represented by the size of
each rectangle in the chart. The data are arranged into categories that give the tree map
its structure, and each category is shown as a rectangular block. The
size of the rectangle represents the proportion of the data that falls into that category.
Fig. 1.8 illustrates the life expectancy of countries in a tree map structure, and the color
and the size of each rectangle depict the value of the life expectancy rate.
• Box plot: The box plot visualizes the distribution of the data. It is helpful in displaying
the given data's median, quartiles, and outliers. It comprises a rectangle, or box, whose
bottom edge marks the first quartile (the 25th percentile of the data) and whose
upper edge marks the third quartile (the 75th percentile). The horizontal line in
FIGURE 1.7 Illustration of a heat map.

FIGURE 1.8 Illustration of a tree map.

the middle of the box represents the median of the data. Fig. 1.9 uses a box plot to show the
number of goals scored by six famous football players for their respective teams. In this
dataset the outliers do not carry a specific meaning, but, setting that aside, the value of
75 goals in Lionel Messi's box can be detected as an outlier.
• Histogram: A histogram is a helpful graphical representation of how many values of a
variable fall within specific ranges and is one of the standard diagrams for depicting
a dataset's distribution. It is similar to a bar chart, although it differs in some ways.
To be more explicit, a bar chart is the more useful tool when the dataset is discrete or
non-numerical. A histogram, in contrast, is an effective choice when the dataset is

FIGURE 1.9 Illustration of a box plot.

continuous and quantitative, when we need to compare the distributions of two datasets,
or when we want to display the outliers in a dataset. When the number of data points is
small or there are large gaps between the values of the data points, a histogram is not
a suitable fit.
In Fig. 1.10, we intend to compare the distribution of pixel values between two images.
The image on the left depicts a sunny day, while the image on the right depicts a night.
In the grayscale representation of images, pixels can be assigned values between 0 and
255. In this range, pixels with a value of 0 represent black, the darkest color, while pixels
with a value of 255 represent white, the brightest color. In general, pixels closer to the
white color are closer to the 255 value, while pixels closer to the black color are closer
to the 0 value. Therefore, the number of pixels with a value near 255 must be greater
in the brighter image than in the darker image, and the number of pixels with a value
near 0 must be more in the darker image than in the brighter image. We can exhibit the
frequency of pixel values in these two photos by using Histogram. The x-axis of these
two histograms represents the range of pixel values, 0 to 255, while the y-axis represents
the number of occurrences of each value in the photos. Consequently, for the brighter
image, the left Histogram indicates a more significant proportion of pixels with values
near 255, while for the darker image, the right Histogram indicates a greater proportion
of pixels with values near 0.

1.1.3 High-dimensional data


To begin, we define "high-dimensional data" as data where the number of features is
substantially larger than the number of observations or samples; for example, a dataset with
50 data points and 1000 features. Numerous important real-world datasets are of this sort;
hence dealing with them is of utmost importance. To illuminate
FIGURE 1.10 Comparing the pixel distribution of each image.

this topic, let us examine some real-world examples. The high-dimensional data type is
very common in the medical domain. Each individual as a data point may have several
features such as gender, age, weight, height, surgery history, blood pressure, heart rate,
respiratory rate, immune system status, etc. Thus, in this area, the number of features can
easily exceed the number of patients (observations). Furthermore, in genomic data, the
number of genes measured (features) is very large for each individual (sample), so these
datasets are frequently high-dimensional. Dealing with high-dimensional datasets is crucial
since so many real-world examples include such data. One
of the questions that comes to mind is why we cannot deal with these high-dimensional
data straightforwardly. The problem of overfitting is one of the significant challenges. When
the number of features in the dataset is large, it is easy for the model to overfit the noise in
the data rather than the main pattern (the concept of overfitting will be elaborated upon in
the subsequent sections). This can cause the model's generalization performance to suffer,
meaning it will not perform well on new data. In addition, working with high-dimensional
data can be computationally costly, as many machine learning techniques demand a
substantial amount of processing power and memory to process the data. With these points
in mind, it becomes clear that acquiring skills for handling high-dimensional data is crucial.
Many dimension reduction approaches have been proposed up to this point. One of them is
to drop features that do not offer valuable information, such as those with many missing
values or with values that are close together and have low variance. Principal component
analysis (PCA) is a widely used unsupervised machine learning method for dimension
reduction, t-distributed Stochastic Neighbor Embedding (t-SNE) is another unsupervised
technique that is often used for visualization, and Linear Discriminant Analysis (LDA) is a
supervised dimension reduction method. The following sections will go into greater depth
about the methodologies of PCA, t-SNE, and LDA.
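To make these ideas concrete, the following minimal sketch (ours, not part of the original
text) first drops low-variance features and then applies PCA using scikit-learn on a synthetic
high-dimensional dataset; the variance threshold and the number of components are arbitrary
illustrative choices.

import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA

# Synthetic high-dimensional data: 50 samples, 1000 features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))

# Drop features whose variance falls below an (illustrative) threshold
selector = VarianceThreshold(threshold=0.8)
X_selected = selector.fit_transform(X)

# Project the remaining features onto 10 principal components
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_selected)

print("Original shape:", X.shape)
print("After variance filtering:", X_selected.shape)
print("After PCA:", X_reduced.shape)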
1.2 Types of learning problems


Discovering the type of our learning problem is one of the most, if not the most, essential
tasks before implementing the learning task. It helps to determine whether the learning
activity is aligned with the task's purpose and desired outcome, and it often makes it simpler
to identify the most effective strategy to employ. Machine learning algorithms may be
classified in numerous ways, but the most frequent categories are as follows:
• Supervised learning;
• Unsupervised learning;
• Semi-supervised learning;
• Reinforcement learning.
Each of these categories has its own set of applications and algorithms, and this section
explains these forms of learning and the algorithms that go with them. Fig. 1.11 provides
an understandable graphical representation of various learning types.

FIGURE 1.11 Supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning are
the four primary categories of machine learning algorithms.

1.2.1 Supervised learning


In supervised learning, the data points should be labeled, and during the training, the
model learns to map the input to the output. More specifically, inputs and outputs are
known during the training, and the objective is finding the best function that can predict
outputs for unknown and new inputs. Consider a child who wants to know what a cat looks
like. One option is to show numerous images of the cat and explain to him/her that this is
a cat. After a while, when this child sees a cat on the street, he or she will recognize it as a
cat. This learning process is highly similar to supervised learning.
There are several types of problems based on supervised learning, and the most popular
ones are classification and regression. In classification tasks, the machine should predict
the category or class (of which there can be more than one) of the input data. These inputs
are actually feature vectors, and the outputs are the corresponding class or category of the
input, which are discrete values. Some famous supervised methods in the machine learning
area are logistic regression, decision trees, random forests, and support vector machines (SVM).
The purpose of regression tasks, on the other hand, is to predict outputs that are continuous
values within certain ranges by leveraging the features of the input data. This particular
form of machine learning problem is widely used. In the medical domain, for example,
each patient has unique characteristics such as age, gender, illness history, and so on.
Based on these characteristics, the implemented model may forecast the probability of
each individual having a specific disease, such as diabetes. For regression tasks, random
forests, polynomial regression, decision trees, and linear regression can all be employed.
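As a brief illustration (ours, not from the original text), the sketch below fits a linear
regression model with scikit-learn on a small synthetic dataset; the data-generating
coefficients and parameter values are invented for demonstration.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data: one input feature and a noisy linear target
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0.0, 1.0, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the model on the training data and evaluate it on the test data
model = LinearRegression()
model.fit(X_train, y_train)
print("Estimated slope:", model.coef_[0])
print("R^2 on the test set:", model.score(X_test, y_test))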

1.2.2 Unsupervised learning


The most prevalent method of labeling data is to identify it manually with human assistance.
In many circumstances, this is extremely time-consuming and inefficient, resulting in a
limited amount of labeled data. In such cases the entire learning process depends on the
machine alone, since the input data do not have a predetermined outcome, which looks like
a difficult task at first glance. In unsupervised learning tasks, where the outputs of all data
points are unknown and the data are unlabeled, the objective is to discover the data's
structure, patterns, or hidden relationships. Dimension reduction and clustering
are the methods that can be used in this type of learning. Assume that there is a movie
purchase application. This application maintains a database of clients and wishes to
categorize them based on various characteristics. The program may group clients by their
similarities, such as gender and movie genre preferences. This is when clustering algorithms
become advantageous; clustering is the process of grouping unlabeled data. Dimension
reduction methods, such as principal component analysis (PCA), are other unsupervised
learning methods, useful for decreasing the number of variables or features of the inputs for
easier analysis or more effective use of other approaches.
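A minimal clustering sketch (ours, not from the original text) is shown below; it groups
synthetic, unlabeled customer-like data with k-means from scikit-learn, and the two features
and their values are purely illustrative.

import numpy as np
from sklearn.cluster import KMeans

# Synthetic, unlabeled data: two illustrative features per client,
# e.g., age and number of movies purchased per year
rng = np.random.default_rng(0)
clients = np.vstack([
    rng.normal(loc=[25.0, 30.0], scale=5.0, size=(50, 2)),
    rng.normal(loc=[55.0, 80.0], scale=5.0, size=(50, 2)),
])

# Group the clients into two clusters based on feature similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(clients)
print("First ten cluster assignments:", labels[:10])
print("Cluster centers:\n", kmeans.cluster_centers_)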

1.2.3 Semi-supervised learning


All of the data points in the preceding sections (supervised and unsupervised learning) are
labeled (have known output) or unlabeled. In many real-world applications, the amount of
labeled data available is restricted, or gathering this labeled data is time-consuming; how­
ever, the knowledge obtained from labeled data is helpful during the training phase. With
this in mind, if we can utilize a dataset with some labeled data and some unlabeled data,
we can accomplish two goals simultaneously: using the information of labeled data and
overcoming the problem of the limited amount of labeled data available. Data in
semi-supervised learning tasks may be classified into two types: labeled and unlabeled. The
labeled data may be used to train the model, which can then be used to predict the labels of
the unlabeled data. For semi-supervised learning problems, several strategies may be used,
one of which is the graph-based method.
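To sketch the idea (this example is ours, not from the original text), the code below uses
scikit-learn's graph-based LabelPropagation on the iris data, hiding most of the labels by
marking them with -1, which is how scikit-learn denotes unlabeled points.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

iris = load_iris()
X, y_true = iris.data, iris.target

# Pretend that only about 10% of the labels are known; mark the rest as -1
rng = np.random.default_rng(0)
unlabeled_mask = rng.random(len(y_true)) > 0.1
y_partial = y_true.copy()
y_partial[unlabeled_mask] = -1

# The graph-based model propagates labels from labeled to unlabeled points
model = LabelPropagation()
model.fit(X, y_partial)
inferred = model.transduction_
accuracy = (inferred[unlabeled_mask] == y_true[unlabeled_mask]).mean()
print("Accuracy on the originally unlabeled points:", accuracy)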

1.2.4 Reinforcement learning


Reinforcement learning is concerned with training an agent to make effective decisions in
order to maximize the cumulative reward over a series of actions that the agent performs.
The actual objective is to identify the optimal policy for mapping states to actions, which
can be learned through the trial-and-error process of the agent interacting with its
environment and receiving feedback in the form of rewards or punishments. This type of
learning has pervasive real-world applications, for example in neurocognitive psychology
[21,22].
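As a rough, self-contained sketch of these ideas (ours, not from the original text), the code
below runs tabular Q-learning on a toy corridor environment in which the agent earns a
reward of 1 for reaching the rightmost state; the environment and all parameter values are
illustrative assumptions.

import numpy as np

n_states, n_actions = 4, 2            # toy corridor; actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))   # action-value estimates
alpha, gamma, epsilon = 0.1, 0.9, 0.3
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != n_states - 1:                     # rightmost state is terminal
        if rng.random() < epsilon:                   # explore
            action = int(rng.integers(n_actions))
        else:                                        # exploit current estimates
            action = int(np.argmax(Q[state]))
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: nudge the estimate toward reward + discounted value
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print("Greedy action in each non-terminal state:", np.argmax(Q[:-1], axis=1))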

1.3 Machine learning algorithms lifecycle


Let us have a look at the general structure of the machine learning algorithm now that
we have a basic understanding of what it is and how it operates. Suppose you are a chef
who wants to create a pizza recipe. First, you need to gather all the ingredients you need to
make the pizza and check that they are fresh, of good quality, and free of contaminants.
Furthermore, you need to write down the measurements in standard units that
everyone who reads your recipe can understand and use. This is Data Preprocessing in the
machine learning lifecycle. Among all the ingredients you have, some are not essential for
making the pizza, so you need to extract only the most essential ingredients that make the
pizza taste delicious. Additionally, you need to make the recipe balanced in terms of quality
and nutrition; since certain ingredients reduce the pizza's healthfulness, cutting back on
their amounts is essential. These two steps can be referred to as Dimension Reduction and
Feature Extraction in machine learning. Now that all the essential ingredients are gath­
ered, one can decide on the best cooking techniques in order to make the pizza. You may
experiment with various cooking techniques and recipe variants to determine which pro­
vides the best results. Once the ideal, or, let us say, most tasty and nutritious process has
been determined, it is used to bake the pizza. In machine learning terminology, this process
is called Model Selection and Model fitting. When everything is done, you take a bite to
see if it is up to par. You can also ask others to taste your pizza to receive any feedback that
may help make it better, which is called Model Evaluation. Fig. 1.12 represents the machine
learning lifecycle in a more illustrative manner.

1.4 Python for machine learning


With the ever-growing demand for machine learning applications, familiarity with a
programming language well suited to such tasks is of great significance.
FIGURE 1.12 Machine learning lifecycle.

In this part of the book, we introduce a popular high-level programming language called
Python. Web development, scientific computing, data analysis, and machine learning are
just a few of the many fields where Python is used extensively. Python takes much of the
drudgery out of programming for newcomers, largely because of several factors: the
existence of various libraries and frameworks that make it easy to implement machine
learning algorithms, an easy-to-understand syntax, and a large and active community that
speeds up problem solving.

1.4.1 Python and packages installation


Python installation
The first step is installing Python on your system. The installation procedure for three
operating systems, Windows, macOS, and Ubuntu, will be discussed in the following
paragraphs:
• Windows: Depending on the system type (64-bit or 32-bit operating system), download
the latest Python version from https://www.python.org/downloads/windows/ and run the
downloaded .exe file. During the installation, make sure to select the option
Add python.exe to PATH. After the installation, to check whether Python was successfully
installed, type python in the Windows command prompt (CMD). Python is installed if the
Python version is printed on the first line.
• Ubuntu: Open the terminal by pressing Ctrl + Alt + T, update the system package index by
entering sudo apt update, and then enter the command sudo apt install python3. To
check whether Python was successfully installed, type python3 in the terminal. Python is
installed if the Python version is printed on the first line.
• macOS: Download the installer package from Python's official website https://www.
python.org/downloads/macos/. After the download is completed, install the package.
Once the installation is complete, the installer will automatically open Python's installation
directory in a new Finder window. To check whether it was installed successfully, launch
IDLE from that directory, and it will open a Python shell.

Python packages
This section dives into some popular and useful Python packages.
• NumPy: NumPy is one of the most well-known and widely used scientific computing
packages in Python. By providing multi-dimensional array objects for various math
operations, together with compatibility with other popular packages, NumPy paves the
way for fast, optimized execution and meets the needs of scientists, developers, and
programmers. NumPy's efficiency comes from its implementation: most of the
computational heavy lifting is done in optimized C code, so operations on arrays and
matrices run much faster than if they were implemented in pure Python, in part by
leveraging multi-core processors and hardware optimizations. Aside from computational
efficiency, it also uses optimized memory management techniques that reduce the
amount of memory needed for large arrays and matrices.
Below are some examples of the Python code that uses the NumPy library:
1 import numpy as np
2

3 # create a 1-dimensional NumPy array


4 arr = np.array([1, 2, 3, 4, 5])
5
6 # perform basic operations on the array
7 print("Original array:", arr)
8 print("Sum of array elements:", np.sum(arr))
9 print("Mean of array elements:", np.mean(arr))
10 print("Standard deviation of array elements:", np.std(arr))
11

12 # create a 2-dimensional NumPy array


13 arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
14
15 # perform basic operations on the array
16 print("Original 2D array:")
17 print(arr2d)
18 print("Sum of array elements:", np.sum(arr2d))
19 print("Mean of array elements:", np.mean(arr2d))
20 print("Standard deviation of array elements:", np.std(arr2d))
21

22 # perform element-wise operations on the array


23 print("Array elements squared:")
24 print(np.square(arr2d))
25 print("Array elements raised to power of 3:")
26 print(np.power(arr2d, 3))

In this example, we first import the NumPy package using the alias "np". Then, we cre­
ate a 1-dimensional NumPy array called "arr" containing the values 1 through 5. We
then perform basic operations on this array, including calculating the sum, mean, and
standard deviation of its elements.
Next, we create a 2-dimensional NumPy array called "arr2d" and perform the same
basic operations on this array. We also demonstrate how to perform element-wise op­
erations on the array, such as squaring and raising each element to the power of 3.

1 # create a 3x3 NumPy array filled with zeros


2 arr_zeros = np.zeros((3, 3))
3 print(arr_zeros)
4
5 # create a 4x2 NumPy array filled with ones
6 arr_ones = np.ones((4, 2))
7 print(arr_ones)
8
9 # create a 3x3 identity matrix
10 identity_matrix = np.eye(3)
11 print(identity_matrix)

In this example, we use the NumPy functions "zeros", "ones", and "eye" to create ar­
rays of different dimensions. The "zeros" function creates an array filled with zeros,
the "ones" function creates an array filled with ones, and the "eye" function creates an
identity matrix (i.e., a matrix with ones on the diagonal and zeros elsewhere).

1 # create a 2-dimensional NumPy array


2 arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
3

4 # print the second row of the array


5 print(arr_2d[1])
6
7 # print the third column of the array
8 print(arr_2d[:, 2])
9
10 # print the subarray consisting of
11 # The first two rows and first two columns
12 print(arr_2d[:2, :2])

In this example, we create a 2-dimensional NumPy array called "arr_2d". We then


demonstrate how to slice and index the array to extract specific rows, columns, and sub­
arrays. The syntax for slicing and indexing in NumPy is similar to that of Python lists,
but with some additional features such as the ability to use colons to specify ranges.

1 # create two 2-dimensional NumPy arrays


2 arr1 = np.array([[1, 2], [3, 4]])
3 arr2 = np.array([[5, 6], [7, 8]])
4

5 # perform matrix multiplication of the two arrays


6 print(np.matmul(arr1, arr2))
7
8 # calculate the dot product of the two arrays
9 print(np.dot(arr1, arr2))
10
11 # calculate the eigenvalues and eigenvectors of a matrix
12 mat = np.array([[1, 2], [2, 1]])
13 eigenvalues, eigenvectors = np.linalg.eig(mat)
14 print("Eigenvalues:", eigenvalues)
15 print("Eigenvectors:")
16 print(eigenvectors)

In this example, we create two 2-dimensional NumPy arrays called "arr1" and "arr2".
We then demonstrate how to perform matrix multiplication and dot products using
NumPy functions such as "matmul" and "dot". Finally, we show how to calculate the
eigenvalues and eigenvectors of a matrix using the "eig" function from the NumPy
"linalg" module.
• Pandas: On every list of essential prerequisites for better manipulation and analysis of
data in the machine learning domain, Pandas ranks near the top. Pandas provides a
high-level interface for working with structured data, including loading any source of
data with performing various statistical analysis techniques. Similar to spreadsheets,
the data structure in Pandas is called DataFrame. It includes a two-dimensional table
with columns and rows. Pandas can be handy in data science, data analysis, and
machine learning tasks. Pandas is compatible with NumPy, meaning it can be passed
NumPy objects. All the data processing techniques explained in Section 1.1.2.2 can be
done with Pandas, including data cleaning, transformation, etc. One of Pandas'
advantages over basic Python data structures is that the latter are sometimes awkward
and slow when dealing with a large dataset.
Below are some examples of the Python code that uses the Pandas library:

1 import pandas as pd
2
3 # Load data from a CSV file
4 data = pd.read_csv('data.csv')
5
6 # Display the first few rows of the data
7 print(data.head())

The first line imports the Pandas library and renames it "pd" for convenience. The sec­
ond line uses the read_csv() function from Pandas to load data from a CSV file called
'data.csv' and store it in a Pandas DataFrame called "data". The third line uses the
head() function to display the first few rows of the data DataFrame.

1 # Select a subset of columns from the data


2 subset = data[['column1', 'column2']]
3

4 # Filter the data to only include rows


5 # where column1 is greater than 5
6 filtered = data[data['column1'] > 5]

The first line selects a subset of columns from the "data" DataFrame and stores the
result in a new DataFrame called "subset". The second line filters the "data" DataFrame
to only include rows where the value in the "column1" column is greater than 5, and
stores the result in a new DataFrame called "filtered".

1 # Group the data by a categorical variable


2 # and calculate the mean of a numeric variable
3 grouped = data.groupby('category')['numeric_variable'].mean()
4

5 # Calculate summary statistics for a numeric variable


6 summary_stats = data['numeric_variable'].describe()

The first line groups the "data" DataFrame by a categorical variable (presumably a col­
umn in the DataFrame) and calculates the mean of a numeric variable (presumably
another column in the DataFrame). The result is stored in a new Pandas Series called
"grouped". The second line calculates summary statistics (e.g., count, mean, standard
deviation) for a numeric variable in the "data" DataFrame using the describe() func­
tion, and stores the result in a new Pandas Series called "summary_stats".
1 # Add a new column to the data based on the values of existing ones
2 data['new_column'] = data['column1'] + data['column2']
3

4 # Replace missing values in a column with the mean of the column


5 mean_value = data['column1'].mean()
6 data['column1'] = data['column1'].fillna(mean_value)
7

The first line adds a new column called "new_column" to the "data" DataFrame that is the
sum of the values in the "column1" and "column2" columns. The second line calculates
the mean of the "column1" column in the "data" DataFrame and stores it in a variable
called "mean_value". Then, it replaces any missing values in the "column1" column with
the mean value using the fillna() function.
Now, we show some examples that use both NumPy arrays and Pandas:

1 import pandas as pd
2 import numpy as np
3
4 # Create a 2D NumPy array
5 arr = np.array([[1, 2], [3, 4]])
6

7 # Convert the array to a Pandas DataFrame


8 df = pd.DataFrame(arr, columns=['col1', 'col2'])
9
10 # Display the DataFrame
11 print(df)

In this example, we first import the necessary libraries, Pandas and NumPy. Next, we
create a 2D NumPy array using the np.array() function. We then convert this array to a
Pandas DataFrame using the pd.DataFrame() function, specifying the column names as
columns=['col1', 'col2']. Finally, we print the resulting DataFrame using the print()
function.

1 # Calculate the square of a Pandas column using NumPy


2 df['col3'] = np.square(df['col2'])
3

4 # Display the updated DataFrame


5 print(df)

In this example, we perform a vectorized operation on a Pandas column using NumPy’s


np.square() function. Specifically, we square each element of the 'col2' column and as­
sign the resulting values to a new column 'col3'. We then print the updated DataFrame
using the print() function.
1 # Calculate the mean of a Pandas column using NumPy


2 mean_value = np.mean(df['col1'])
3

4 # Calculate the standard deviation of a Pandas column using NumPy


5 std_dev = np.std(df['col2'])
6
7 # Display the results
8 print('Mean:', mean_value)
9 print('Standard deviation:', std_dev)

In this example, we use NumPy’s np.mean() and np.std() functions to calculate the
mean and standard deviation of two different columns in a Pandas DataFrame. We then
print the resulting values using the print() function.

1 # Define a function to apply to the DataFrame


2 def my_function(x):
3 return x ** 2 + 1
4

5 # Apply the function to a Pandas column using NumPy's


6 # vectorize() function
7 df['col4'] = np.vectorize(my_function)(df['col1'])
8

9 # Display the updated DataFrame


10 print(df)

In this example, we define a function my_function(x) that squares its input x, adds 1,
and returns the result. We then apply this function to a Pandas column 'col1' using
NumPy’s np.vectorize() function. The resulting values are assigned to a new column
'col4'. Finally, we print the updated DataFrame using the print() function.
• Plotly: In this section, we introduce a data visualization package for Python. Instead of
Matplotlib, we introduce Plotly. Plotly is newer than Matplotlib, and it allows you to
generate aesthetically pleasing interactive graphs with just a few lines of code. Plotly's
suite of data analysis tools extends far beyond its visualization capabilities, covering
areas such as statistical analysis, machine learning, and natural language processing.
These tools make it possible to undertake exploratory data analysis, data science, and
machine learning inside a single platform.
Here is the source code for the visualizations of Section 1.1.2.3:
1 import plotly.express as px
2 # Request to get the data in DataFrame format
3 df = px.data.gapminder().query(
4 "country in ['Iran','Zimbabwe','Yemen, Rep.',
5 'Saudi Arabia','China','Afghanistan','United States',
6 'Nigeria','Sweden','Germany']")
7 # Creating Line figure object
8 fig = px.line(df, x="year", y="lifeExp",
9 color='country',
10 category_orders=
11 {'country':
12 ['Afghanistan','China','Germany','Iran','Nigeria',
13 'Saudi Arabia','Sweden','United States','Yemen, Rep.',
14 'Zimbabwe']},
15 line_dash='country',line_shape="linear",
16 render_mode="svg")
17 # Plotting the data
18 fig.show()

The code above is used to generate Fig. 1.3. The first line imports the necessary modules
for creating visualizations with Plotly. The plotly.express module provides a simplified
interface for creating many common types of visualizations. The third line loads
a built-in dataset from Plotly Express called gapminder, which contains information
about countries' populations, life expectancy, and income levels over time. The query()
method filters the data to only include information for the specified countries. The
eighth line creates a new Plotly figure object using the px.line() method. The df argument
specifies the dataset to use for the visualization. The x and y arguments specify
the variables to use for the x- and y-axes, respectively. The color argument specifies
the variable to use for coloring the lines based on the different countries. The
category_orders argument specifies the order in which the countries should be displayed
in the legend. The line_dash and line_shape arguments specify the style of the
lines in the plot. The render_mode argument specifies the format in which to render the
plot.

1 # Read the data using the Pandas library


2 pd_Sactter = pd.read_csv("WandH.csv")
3 # Creating Scatter figure object
4 fig = px.scatter(pd_Sactter, x='Height (in)', y='Weight (lb)')
5 # Plotting the data
6 fig.show()

The code above is used to generate Fig. 1.6. The first line loads a CSV file called
"WandH.csv", containing data about the weights and heights of some people, into a Pandas
DataFrame called pd_Sactter. The pd.read_csv() function is used to read the CSV
file and convert it into a DataFrame. The second line creates a new Plotly figure object
using the px.scatter() method. The pd_Sactter argument specifies the dataset to use
for the visualization. The x and y arguments specify the variables to use for the x- and
y-axes, respectively.
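The original text does not list the code for the histogram in Fig. 1.10; the snippet below is
our own minimal sketch of how a comparable pixel-value histogram could be drawn with
Plotly Express, using synthetic pixel values and an invented column name.

import numpy as np
import pandas as pd
import plotly.express as px

# Stand-in for the grayscale pixel values of an image (range 0--255)
rng = np.random.default_rng(0)
pixels = pd.DataFrame({"pixel_value": rng.integers(0, 256, size=10000)})

# Creating Histogram figure object
fig = px.histogram(pixels, x="pixel_value", nbins=64)
# Plotting the data
fig.show()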
• Scikit-learn: Scikit-learn is an open-source library designed for machine learning tasks.
Scikit-learn was born as a Google Summer of Code project, standing for Scientific
Python Toolkit. Scikit-learn provides a wide range of machine learning tools, from su­
pervised to unsupervised, including class­fication, regression, clustering, dimensional­
ity reduction, model selection, and preprocessing. This section of the chapter provides
insightful examples of the well-documented and well-loved Python machine learning
library, Scikit-learn.
Below is an example of how we use Scikit-learn for the classification task:

1 # import necessary libraries


2 from sklearn import datasets
3 from sklearn.model_selection import train_test_split
4 from sklearn.neighbors import KNeighborsClassifier
5 from sklearn.metrics import accuracy_score
6

7 # load the iris dataset


8 iris = datasets.load_iris()
9

10 # split the dataset into training and testing sets


11 X_train, X_test, y_train, y_test = train_test_split(
12 iris.data,
13 iris.target,
14 test_size=0.3,
15 random_state=42)
16
17 # create a KNN classifier object
18 knn = KNeighborsClassifier(n_neighbors=3)
19

20 # fit the classifier to the training data


21 knn.fit(X_train, y_train)
22
23 # make predictions on the test data
24 y_pred = knn.predict(X_test)
25

26 # calculate the accuracy of the model


27 accuracy = accuracy_score(y_test, y_pred)
28
29 # print the accuracy
30 print("Accuracy:", accuracy)

Lines 2--5 import the necessary libraries for this classification task. Sklearn
is the main library we will be using for this task. The datasets module is used to import the
iris dataset, train_test_split is used to split the data into training and testing sets,
KNeighborsClassifier is the classifier algorithm we will be using, and accuracy_score
is used to calculate the accuracy of the model. The eighth line loads the iris dataset
from sklearn.datasets and stores it in the iris variable. The Iris dataset contains
measurements of three species of iris flowers; due to its simplicity and small size, it is
often used as a toy example in machine learning and data analysis. Iris comprises 150
samples (50 per species), and each sample consists of four input variables: sepal length,
sepal width, petal length, and petal width. The target variable (output) is the iris flower
species, encoded as an integer from 0 to 2. Line 11 splits the dataset into training and
testing sets using the
train_test_split() function. iris.data contains the input features and iris.target
contains the output or target values. The test_size parameter specifies the fraction
of the dataset to use for testing, and random_state seeds the random shuffling so that the
split into train and test sets is reproducible. Line
eighteen creates a KNeighborsClassifier object with three neighbors. This algorithm
classifies new instances based on the k nearest neighbors to each instance in the
training dataset. Line twenty-one fits the classifier to the training data using the fit()
method of the KNeighborsClassifier object knn. This step is where the model learns how
to classify new instances. Line twenty-four makes predictions on the test data
using the predict() method of the KNeighborsClassifier object knn. The predicted values
are stored in the y_pred variable. Finally, line twenty-seven calculates the
accuracy of the model by comparing the predicted values with the actual values in the
test set, using the accuracy_score() function from sklearn.metrics.
In Section 1.1.2, all preprocessing techniques were explained. Some of these preprocessing
methods will now be described using the Scikit-learn preprocessing library, which includes
the following popular functions:
1. StandardScaler: Scales the data to a mean of zero and a variance of 1. The code
below is an example of utilizing StandardScaler. Line five loads the Boston Housing
dataset and saves the input and output into X and y, respectively (note that load_boston
has been removed from recent versions of scikit-learn, where a similar dataset such as
the California housing dataset can be used instead). Line ten instantiates a
StandardScaler object. Line eleven fits the StandardScaler object to the input data (X)
and transforms it using the fit_transform() method, which scales the data to have
zero mean and unit variance.

1 from sklearn.datasets import load_boston


2 from sklearn.preprocessing import StandardScaler
3
4 # Load the Boston Housing dataset
5 boston = load_boston()
6 X = boston.data
7 y = boston.target
8

9 # Scale the data using StandardScaler


10 scaler = StandardScaler()
11 X_scaled = scaler.fit_transform(X)

2. MinMaxScaler: Scales the data to a fixed range, typically between 0 and 1. In the
previous code, instead of using StandardScaler, we can use MinMaxScaler.
3. RobustScaler: Scales the data using statistics that are robust to outliers.
4. LabelEncoder: Encodes categorical labels as integers from 0 to n_classes-1. When the
dataset contains non-numeric labels, we can use LabelEncoder, as in the code below, to
convert them into numerical values that the machine can work with.

1 from sklearn.datasets import load_iris


2 from sklearn.preprocessing import LabelEncoder
3

4 # Load the iris dataset


5 iris = load_iris()
6 X = iris.data
7 y = iris.target
8
9 # Convert the target variable
10 # to numerical values using LabelEncoder
11 le = LabelEncoder()
12 y_encoded = le.fit_transform(y)

5. OneHotEncoder: Converts categorical variables into a set of binary variables (see the
short sketch after this list).


6. SimpleImputer: Imputes missing values using either the column’s mean, median, or
most frequent value.

1 from sklearn.impute import SimpleImputer
2
3 # (strategy options: mean, median, most_frequent, constant)
4 imputer = SimpleImputer(strategy='mean')
5 imputer.fit(X)
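Below is a minimal sketch (ours, not from the original text) of OneHotEncoder, referenced
in item 5 above; the color values are invented for illustration.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A small categorical feature with three possible values
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(colors)   # returned as a sparse matrix
print(encoder.categories_)
print(one_hot.toarray())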

Pipeline optimization
For a machine learning task, let us say classification, many classification algorithms
and preprocessing techniques exist. To determine which combination of these results
in the best performance, one can put together the sequence of steps involved
in training a model by using Scikit-Learn's pipeline library. The effectiveness of every
learning model also depends on picking parameters that optimize performance. Let us say
we want to classify hand-sign images using RandomForestClassifier. After loading
the dataset and splitting it into train and test sets, we can reach our objective using the
following Python code:
1 from sklearn.decomposition import PCA


2 from sklearn.preprocessing import StandardScaler
3 from sklearn.ensemble import RandomForestClassifier
4 from sklearn.model_selection import GridSearchCV
5 from sklearn.metrics import accuracy_score
6
7 # Scale the data
8 scaler = StandardScaler()
9 X_train = scaler.fit_transform(X_train)
10 X_test = scaler.transform(X_test)
11

12 # Perform PCA on the data


13 pca = PCA(n_components=120)
14 X_train = pca.fit_transform(X_train)
15 X_test = pca.transform(X_test)
16

17 # Train a random forest classifier


18 forest = RandomForestClassifier(n_estimators=20)
19 forest.fit(X_train, y_train)
20

21 # Make predictions on the test set and compute accuracy


22 test_predictions = forest.predict(X_test)
23 accuracy = accuracy_score(test_predictions, y_test)
24 print("Accuracy with RandomForest: {0:.6f}%".format(accuracy))

Lines 8--10 repeat the previous preprocessing procedure. Using the PCA function from
the sklearn.decomposition package, lines 13--15 conduct PCA on the standardized data.
As its name implies, principal component analysis (PCA) is used to reduce the number
of dimensions in a dataset while keeping the majority of its useful information.
Here, we keep 120 principal components. As seen in lines 18 and 19, the
sklearn.ensemble module's RandomForestClassifier is used to train a random forest
classifier on the reduced-dimensionality data. Random forests are a type of ensemble
learning in which many decision trees are built at training time, with the mode of the
classes (classification) or the mean prediction (regression) from the trees serving as the
output. The final lines use the trained random forest classifier to generate predictions on
the test set and then calculate the accuracy with the accuracy_score function from the
sklearn.metrics package.

The prediction accuracy for test images with a random forest classifier is about 68%.
Now, let us utilize the pipeline and gridsearch and see how much the accuracy can be
increased.
1 # Use a pipeline to perform PCA, scaling, and classification


2 pipe = Pipeline([
3 ('scaler', StandardScaler()),
4 ('pca', PCA(n_components=120)),
5 ('forest', RandomForestClassifier(n_estimators=20))
6 ])
7

8 # Define the parameter grid for hyperparameter tuning


9 param_grid = {
10 'pca__n_components': [60, 80, 100],
11 'forest__n_estimators': [20, 30, 40, 50]
12 }
13

14 # Use gridsearch to find the best hyperparameters


15 estimator = GridSearchCV(pipe, param_grid, verbose=2)
16 estimator.fit(X_train, y_train)
17

18 # Set the pipeline parameters to the best hyperparameters found


19 pipe.set_params(**estimator.best_params_)
20 pipe.fit(X_train, y_train)
21

22 # Make predictions on the test set and compute accuracy


23 test_predictions = pipe.predict(X_test)
24 accuracy = accuracy_score(test_predictions, y_test)
25 print("Accuracy with RandomForest: {0:.6f}%".format(accuracy))

Using the Pipeline class from the Scikit-Learn library (imported from sklearn.pipeline),
the pipeline's three stages are defined. Input features are first scaled with StandardScaler,
then the dimensionality of the scaled features is reduced using PCA with 120 components,
and finally, classification is performed with a random forest classifier that employs 20
decision trees. The parameter grid for hyperparameter tuning is defined using a dictionary
with two keys: pca__n_components and forest__n_estimators. The first key specifies the
number of principal components to keep in PCA, and the second key specifies the number
of decision trees in the random forest classifier. Gridsearch is performed using GridSearchCV
from scikit-learn to find the best hyperparameters for the pipeline. The estimator variable is
set to GridSearchCV with the pipeline, the parameter grid, and verbose=2 to show the
progress of the gridsearch. The fit method is called with the training data (X_train and
y_train) to perform the gridsearch. The best hyperparameters are set for the pipeline
using pipe.set_params(**estimator.best_params_), and the fit method is called again
with the training data to train the pipeline with the best hyperparameters. The pipeline
is used to make predictions on the test set (X_test), and the accuracy of the predictions
is computed using accuracy_score from scikit-learn. The accuracy is printed to the console
with 6 decimal places.
• TensorFlow: Preprocessing the data, building the model, and training and evaluating the
model are all made easier by the TensorFlow package, which is a machine learning
platform. This Google software package is widely used and open source. One of the
benefits of this package is that it can be executed on a wide variety of devices, including
desktop computers, cloud clusters, mobile devices (iOS and Android), central process­
ing units (CPUs), and graphics processing units (GPUs). Moreover, users can train and
create their model with minimal code, making it both highly abstract and convenient
for developers. In addition, TensorFlow supplies developers with top-notch visualiza­
tion tools to simplify the debugging and performance evaluation processes. Another
important ben­fit of this package is its flexibility. Image and speech detection, NLP,
class­fication, recurrent neural networks, and many more machine learning tasks are
all supported by TensorFlow.
The following is a straightforward emotion recognition task that employs the most im­
portant TensorFlow functions. Emotion recognition, commonly referred to as facial
expression recognition, can also be applied to natural or artistic imagery [4]. These
tasks can be performed by using machine learning techniques. The shape of the eye­
brows, mouth, and eyes, for example, reveal human emotion in numerous scenarios.
These facial features in the image can be analyzed by machine learning algorithms to
distinguish emotions such as sadness, anger, happiness, surprise, and others. Several
industries, including business, medicine, and psychology, can benefit from this type
of learning task. The dataset used in this task has 28,709 rows and two
columns. The pixels column holds the pixels of a grayscale 48-by-48 image, while the
emotion column provides a number between 0 and 6. The aim is to distinguish seven
emotions, which are numbered from 0 to 6 in increasing order: angry, disgusted, fearful,
happy, sad, surprised, and neutral. First and foremost, the relevant libraries required for
this task must be imported. The first library is TensorFlow, which is used to create, run,
and evaluate the model. The second library is Numpy, which is used to create Numpy
arrays and execute computations, and the third library is pandas, which is used to load
the dataset. On line five, the sklearn function train_test_split is used to divide the
dataset into two parts: train and test data. The dataset is loaded into the data_train
dataframe on line seven to preprocess and manipulate the data before feeding it into
the machine learning model.

1 # Import necessary libraries


2 import tensorflow as tf
3 import numpy as np
4 import pandas as pd
5 from sklearn.model_selection import train_test_split
6 # Load data from a CSV file
7 data_train = pd.read_csv("data/train.csv")
8

Now, the dataset is split into two parts, with 80% going to training and 20% going to
testing.
1 # x_data and y_data are assumed to hold the pixel arrays and emotion
2 # labels prepared from data_train
3 x_train, x_test, y_train, y_test = train_test_split(x_data,
4                                                     y_data,
5                                                     test_size=0.2,
6                                                     random_state=42)

This task takes advantage of transfer learning. By utilizing the Keras API provided by
TensorFlow, the tf.keras.applications module can be used to make available pre-trained
neural networks such as VGG16, ResNet, MobileNet, and others. These models
are typically trained on large datasets and achieve great performance. One such pre-trained
deep neural network is the VGG16 model, which can be used for image classification.
In this facial expression recognition task, the VGG16 model is utilized due
to its depth, simplicity, pre-training on a big dataset, and high performance. In the following
code, the vgg16 variable stores the VGG16 model pre-trained on the ImageNet
dataset. The weights option is set to 'imagenet', which implies that the model's pre-trained
weights will be downloaded from the Internet and can be used for this task. As
the name suggests, tf.keras.Input is used to define the neural network's input layer.
Its shape parameter is set to (48, 48, 1), reflecting the
dimensions and number of channels of the images in the dataset. Although the supplied
dataset is grayscale, the VGG16 model is trained on RGB data. As a result, the input must be
converted to RGB (three channels). The Lambda layer accepts as an argument a function
that is applied to the neural network's input; the grayscale images are converted
to RGB by using tf.keras.layers.Lambda on line number five. Finally, on
line seven, the above-mentioned pre-trained model is applied to the neural network's
input.

1 vgg16 = tf.keras.applications.VGG16(include_top=False,
2 weights='imagenet')
3

4 inputs = tf.keras.Input(shape=(48, 48, 1))


5 x = tf.keras.layers.Lambda(lambda x:
6 tf.image.grayscale_to_rgb(x))(inputs)
7 x = vgg16(x)

The neural network’s architecture is implemented at this step. The Flatten function is
used in line one of the code below. This is due to the fact that the input of the fully con­
nected network, in which every neuron on the presented layer is connected to every
neuron on the previous layer, should be a one-dimensional array rather than a multi­
dimensional array. Moreover, the network architecture is built on lines two through
four. This is a fully connected network, and we can determine the number of neurons
in each layer and the activation function of each layer by using the Dense function. The
first layer contains 128, the second layer has 64, and the third layer has 32 neuronal
units, and all three layers have the same activation function, which is relu. As previously
stated, this task can recognize seven emotions, making it a classification task
with seven classes. As a result, the output layer in line five should have seven neurons,
and the softmax activation function, which provides a probability distribution over the
classes, is appropriate for this layer. The model of the task is constructed on the last line
by utilizing the Model function and assigning the input and output layers of the network.

1 x = tf.keras.layers.Flatten()(x)
2 x = tf.keras.layers.Dense(128, activation='relu')(x)
3 x = tf.keras.layers.Dense(64, activation='relu')(x)
4 x = tf.keras.layers.Dense(32, activation='relu')(x)
5 outputs = tf.keras.layers.Dense(7, activation='softmax')(x)
6 model = tf.keras.Model(inputs, outputs)

In line one of the code below, TensorFlow's config module is used to set an experimental
option for the TensorFlow optimizer; specifically, it adds the loop-fusion pass
to the optimization passes. Loop fusion is an optimization approach that
combines numerous loops into a single loop to enhance computational performance.
In line four, the compile function is used to specify the loss function, optimizer,
and metrics to be employed during training. In this task, the loss function is categorical
cross-entropy, Adam is the optimizer, and accuracy is the metric to evaluate. In line
seven, the fit function is used to train the model with the training data. The
validation_split argument defines the fraction of training data to be utilized for validation
during training. The epochs and batch_size arguments specify the number of training
epochs and the batch size to be used, respectively, during training. The history object
stores the model's training history, including the loss and accuracy on both the training and
validation sets at each epoch. This can be used for analysis and visualization of the
training process.

1 tf.config.optimizer.set_experimental_options(
2 {"passes": ["loop-fusion"]})
3
4 model.compile(optimizer='adam',
5 loss='categorical_crossentropy',
6 metrics=['accuracy'])
7 history = model.fit(x_train, y_train, validation_split=0.01,
8 epochs=10, batch_size= 64)

After the training step, it is time to evaluate the model using test data. In many machine
learning frameworks, the model.predict technique is used to apply a trained model to
new input data and generate an output prediction. In the code below, this function is
used for a single image and the model's output is stored in the y variable. As y is an
array of seven numbers, the position of its maximum value indicates the input image's class.
The argmax function is used on line two to retrieve the index of the maximum among these
seven values.
1 y = model.predict(Image)
2 np.argmax(y)

Fig. 1.13 depicts an example of the model’s output. The output for each image consists
of seven numbers between 0 and 1, with the highest value reflecting the emotion of the
individual depicted in the image. For the child seen in this illustration, the model correctly
identifies his emotion as that of a happy individual.

FIGURE 1.13 Given that the child's face is happy, the model should recognize his feelings as those of a happy
individual. In the right diagram, the highest value is for the happy class, demonstrating that the model accurately
identified the child's emotions.

References
[1] V. Dhar, Data science and prediction, Communications of the ACM 56 (2013) 64--73.
[2] F. Provost, T. Fawcett, Data Science for Business: What You Need to Know About Data Mining and
Data-Analytic Thinking, O’Reilly Media, Inc., 2013.
[3] T. Hastie, R. Tibshirani, J. Friedman, J. Friedman, The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, Springer, 2009.
[4] M. Baradaran, P. Zohari, A. Mahyar, H. Motamednia, D. Rahmati, S. Gorgin, Cultural-aware AI model
for emotion recognition, in: 2024 13th Iranian/3rd International Machine Vision and Image Process­
ing Conference (MVIP), 2024, pp. 1--6.
[5] I. Witten, E. Frank, Data mining: practical machine learning tools and techniques with Java imple­
mentations, ACM Sigmod Record 31 (2002) 76--77.
[6] J. Han, J. Pei, H. Tong, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2022.
[7] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning
Research 3 (2003) 1157--1182.
[8] P. Domingos, A few useful things to know about machine learning, Communications of the ACM 55
(2012) 78--87.
[9] T. Hastie, R. Tibshirani, J. Friedman, T. Hastie, R. Tibshirani, J. Friedman, Overview of supervised
learning, in: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2009,
pp. 9--41.
[10] M. Kuhn, K. Johnson, Others Applied Predictive Modeling, Springer, 2013.
[11] A. Zheng, A. Casari, Feature Engineering for Machine Learning: Principles and Techniques for Data
Scientists, O’Reilly Media, Inc., 2018.
[12] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-resolution image synthesis with latent
diffusion models, 2021.
[13] S. Madden, From databases to big data, IEEE Internet Computing 16 (2012) 4--6.
[14] S. Suthaharan, Machine learning models and algorithms for big data classification, Integrated Series
on Information Systems 36 (2016) 1--12.
[15] J. Han, M. Kamber, J. Pei, Data Mining Concepts and Techniques, Morgan Kaufmann Publishers, San
Francisco, CA, 2001, pp. 335--391.
[16] I. Witten, E. Frank, M. Hall, C. Pal, M. Data, Practical machine learning tools and techniques, Data
Mining 2 (2005) 403--413.
[17] D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999.
[18] G. James, D. Witten, T. Hastie, R. Tibshirani, Others: An Introduction to Statistical Learning, Springer,
2013.
[19] C. Aggarwal, C. Aggarwal, An Introduction to Outlier Analysis, Springer, 2017.
[20] J.A. Rad, S. Chakraverty, K. Parand, Learning with Fractional Orthogonal Kernel Classifiers in Support
Vector Machines: Theory, Algorithms, and Applications, Springer, 2023.
[21] S. Ghaderi, J.A. Rad, M. Hemami, R. Khosrowabadi, Dysfunctional feedback processing in metham­
phetamine abusers: evidence from neurophysiological and computational analysis, Neuropsycholo­
gia 197 (2024) 108847.
[22] S. Ghaderi, J.A. Rad, M. Hemami, R. Khosrowabadi, The role of reinforcement learning in shaping the
decision policy in methamphetamine use disorders, Journal of Choice Modelling 50 (2024) 100469.
2
Essential mathematics for machine
learning
Ali Araghian a, Mohsen Razzaghi b, Madjid Soltani c, and Kourosh Parand d
a Department of Computer Science, Faculty of Art and Science, Bishop’s University, Sherbrooke, QC, Canada
b Department of Mathematics and Statistics, Mississippi State University, Mississippi State, MS, United States
c Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada
d International Business University, Toronto, ON, Canada

2.1 Vectors
An ordered set of $n$ real numbers is known as an $n$-dimensional vector, with the numbers
themselves being called the components of the vector. Vectors come in two forms, column
vectors and row vectors. A column vector, or simply vector, is denoted by [1]:

$$
v = \begin{pmatrix} v_1 \\ v_2 \\ v_3 \\ \vdots \\ v_n \end{pmatrix},
$$

and the row form of the same vector equals the transpose of the column form, i.e.:

$$
v^T = (v_1, v_2, v_3, \ldots, v_n).
$$

The set of all $n$-dimensional vectors is called the vector space $\mathbb{R}^n$.


Let $S$ be a set of vectors in $\mathbb{R}^n$. Then $S$ is a subspace of $\mathbb{R}^n$ if any linear combination of
two vectors in $S$ is also in $S$. In more formal terms, $S$ is called a subspace of $\mathbb{R}^n$ if $s_1, s_2 \in S$
implies $c_1 s_1 + c_2 s_2 \in S$, where $c_1$ and $c_2$ are any scalars.

2.1.1 Basic concepts


Two vectors $u = (u_1, u_2, u_3, \ldots, u_n)$ and $v = (v_1, v_2, v_3, \ldots, v_n)$ are considered equal if:
$$u_i = v_i, \quad \forall\, i = 1, 2, \ldots, n. \tag{2.1}$$
Also, the vector addition of two $n$-dimensional vectors $u = (u_1, u_2, u_3, \ldots, u_n)$ and $v = (v_1, v_2, v_3, \ldots, v_n)$ is defined as a new vector $z = (z_1, z_2, z_3, \ldots, z_n)$, denoted by $z = u + v$,
with its components defined by $z_i = u_i + v_i$, $i = 1, 2, \ldots, n$.

On the other hand, scalar multiplication of a vector $v = (v_1, v_2, v_3, \ldots, v_n)$ and a scalar $c$ is defined as a new vector $w = (w_1, w_2, w_3, \ldots, w_n)$, written as $w = cv$ or equivalently $w = vc$, whose components are given by $w_i = c v_i$, $i = 1, 2, \ldots, n$.
The inner product or dot product of two vectors $u$ and $v$ is a scalar denoted by $u \cdot v$ and given by:
$$u \cdot v = u_1 v_1 + u_2 v_2 + \ldots + u_n v_n.$$
Also, the length of a vector $v = (v_1, v_2, v_3, \ldots, v_n)$, denoted by $\|v\|$, is $\sqrt{v^T v}$.

2.1.2 Linear independence


A set of vectors $\{m_1, m_2, \ldots, m_k\}$ in $\mathbb{R}^n$ is defined as linearly dependent if there exist scalars $c_1, c_2, \ldots, c_k$, not all zero, such that $c_1 m_1 + \ldots + c_k m_k = 0$. Otherwise, the set is called linearly independent.

Example 1. The set $S = \{v_1, v_2, v_3, v_4\}$ in $\mathbb{R}^3$ with $v_1 = (3, 0, -3)$, $v_2 = (-1, 1, 2)$, $v_3 = (4, 2, -2)$, $v_4 = (2, 1, 1)$ is linearly dependent, since $2v_1 + 2v_2 - v_3 + 0v_4 = 0$.

2.1.3 Orthogonality
Two vectors $u, v$ in $\mathbb{R}^n$ are orthogonal or perpendicular if $u^T v = 0$. The angle $\theta$ between two vectors $u$ and $v$ is calculated by:
$$\cos(\theta) = \frac{u^T v}{\|u\| \, \|v\|},$$
and two vectors $u$ and $v$ are orthogonal if and only if $\theta = \frac{\pi}{2}$.
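The vector operations above are straightforward to check numerically. The short NumPy sketch below is an illustration only (the vectors and variable names are ours, not from the text): it computes a dot product, vector lengths, the angle between two vectors, and tests orthogonality.

```python
import numpy as np

u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, 0.0, -1.0])

dot = u @ v                      # inner product u.v
length_u = np.sqrt(u @ u)        # ||u|| = sqrt(u^T u)
length_v = np.linalg.norm(v)     # equivalent built-in Euclidean norm

# angle from cos(theta) = u^T v / (||u|| ||v||)
cos_theta = dot / (length_u * length_v)
theta = np.arccos(cos_theta)

print(dot, length_u, theta)
print("orthogonal:", np.isclose(dot, 0.0))  # u and v are orthogonal iff u^T v = 0
```

For this particular pair the dot product is zero, so the computed angle is $\pi/2$.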

2.2 Matrices
In this section, the basic concepts of matrices, matrix operations, and some important
definitions are described, which are used in data science, machine learning, deep learning,
data mining, dimension reduction, and other fields.

2.2.1 Basic concepts


A matrix is a rectangular array of numbers, symbols, or expressions arranged in rows and
columns. It is a fundamental concept in linear algebra and is used to represent and manip­
ulate systems of linear equations, transformations, and various mathematical operations.
These operations are widely used in data science and machine learning. The size of a ma­
trix is specified by its number of rows and columns, often denoted as $m \times n$, where $m$ represents the number of rows and $n$ represents the number of columns. A matrix $A$ therefore has the form:
$$
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix}
$$
and is denoted by $A = (a_{ij})_{m \times n}$ or simply $A = (a_{ij})$. In this situation, $A$ is said to be of order $m \times n$.

2.2.2 Operations
Matrix operations are fundamental processes in mathematics, particularly in linear alge­
bra. In these subsections, we will explain matrix addition, scalar multiplication, and matrix
multiplication.

2.2.2.1 Matrix addition


The sum of two m × n matrices A = (aij ) and B = (bij ), denoted by C = A + B, is again an
m × n matrix computed by adding the corresponding elements:

cij = aij + bij . (2.2)


Example 2. Suppose that $A = \begin{pmatrix} 1 & 2 & 3 \\ -2 & -5 & 6 \end{pmatrix}$, $B = \begin{pmatrix} 3 & -1 & 4 \\ 2 & 4 & 1 \end{pmatrix}$, and $C = \begin{pmatrix} 2 & 4 \\ 6 & 5 \end{pmatrix}$, then
$$A + B = \begin{pmatrix} 4 & 1 & 7 \\ 0 & -1 & 7 \end{pmatrix},$$
while $A + C$ is not defined.

2.2.2.2 Scalar multiplication


If c is a scalar, then cA is a matrix given by multiplying each element in A by c, i.e.:

cA = (caij ).

2.2.2.3 Matrix multiplication

Let $A$ and $B$ be $m \times n$ and $n \times p$ matrices, respectively; then their product, denoted by $AB$, is an $m \times p$ matrix given by:
$$(AB)_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}, \quad i = 1, \ldots, m, \; j = 1, \ldots, p. \tag{2.3}$$

Example 3. Let $A = \begin{pmatrix} 1 & 2 & 4 \\ 2 & 6 & 0 \end{pmatrix}$ and $B = \begin{pmatrix} 4 & 1 & 4 & 3 \\ 0 & -1 & 3 & 1 \\ 2 & 7 & 5 & 2 \end{pmatrix}$, then $AB = \begin{pmatrix} 12 & 27 & 30 & 13 \\ 8 & -4 & 26 & 12 \end{pmatrix}$.
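As a quick numerical check of these operations, the following NumPy sketch (an illustration, not part of the original text) reproduces Examples 2 and 3.

```python
import numpy as np

A = np.array([[1, 2, 3], [-2, -5, 6]])
B = np.array([[3, -1, 4], [2, 4, 1]])
print(A + B)          # matrix addition (Example 2)
print(3 * A)          # scalar multiplication

A3 = np.array([[1, 2, 4], [2, 6, 0]])
B3 = np.array([[4, 1, 4, 3], [0, -1, 3, 1], [2, 7, 5, 2]])
print(A3 @ B3)        # matrix multiplication (Example 3), a 2x4 matrix
```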

2.2.3 Some definitions


A matrix is square if $m = n$. An identity matrix is a square matrix with 1s along the main diagonal and 0s elsewhere and is denoted by $I$. For any matrix $A$ we have:
$$AI = IA = A.$$
We define $I_n$ as follows:
$$
I_n = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{pmatrix}_{n \times n}.
$$
The trace of a square matrix $A_{n \times n}$, denoted by $\mathrm{tr}(A)$, is defined to be the sum of the elements along its main diagonal, i.e.:
$$\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii}.$$

Let $A = (a_{ij})_{m \times n}$ and $B = (b_{ij})_{p \times q}$. $A$ and $B$ are considered equal, denoted $A = B$, if and only if:
$$m = p, \quad n = q, \quad a_{ij} = b_{ij} \quad \forall\, i = 1, \ldots, m, \; j = 1, \ldots, n.$$
The transpose of a matrix $A$ of order $m \times n$, denoted by $A^T$, is a matrix of order $n \times m$ with rows and columns interchanged:
$$\left(A^T\right)_{ij} = a_{ji}, \quad i = 1, \ldots, n, \; j = 1, \ldots, m. \tag{2.4}$$
Also, a matrix $A$ is called symmetric if $A^T = A$.

Example 4. Matrix $A = \begin{pmatrix} 11 & 12 & -24 \\ 12 & -54 & 68 \\ -24 & 68 & 38 \end{pmatrix}$ is a symmetric matrix.

A matrix is said to be a zero matrix if aij = 0, ∀i, j . Also, a square matrix A is said to be
upper triangular if aij = 0 for i > j . The transpose of an upper triangular matrix is lower
triangular, that is, A is lower triangular if aij = 0 for i < j .

Example 5. The matrix below is an upper triangular matrix:
$$A = \begin{pmatrix} -1 & 3 & -2 \\ 0 & 6 & 4 \\ 0 & 0 & 5 \end{pmatrix}.$$

A square matrix $A$ is said to be upper Hessenberg if $a_{ij} = 0$ for $i > j + 1$. The transpose of an upper Hessenberg matrix is lower Hessenberg; that is, $A$ is a lower Hessenberg matrix if $a_{ij} = 0$ for $j > i + 1$. A matrix that is both upper and lower Hessenberg, i.e., $a_{ij} = 0$ if $|i - j| > 1$, is tridiagonal.

Example 6. Matrix $A = \begin{pmatrix} 8 & 4 & 0 \\ 13 & 15 & 0 \\ 0 & 0 & -3 \end{pmatrix}$ is tridiagonal.

An $n \times n$ matrix $A$ is said to be invertible or nonsingular if there exists an $n \times n$ matrix $B$ such that:
$$AB = BA = I,$$
where $I$ is the identity matrix. The inverse of $A$ is denoted by $A^{-1}$. It can be shown that the inverse of a matrix is unique. A matrix that is not invertible is called singular. The most important properties of the inverse operator are:
1. $(AB)^{-1} = B^{-1} A^{-1}$;
2. $A^{-n} = \left(A^{-1}\right)^n = A^{-1} A^{-1} \cdots A^{-1}$;
3. $\left(A^{-1}\right)^{-1} = A$;
4. $\left(A^n\right)^{-1} = \left(A^{-1}\right)^n$;
5. $(\lambda A)^{-1} = \frac{1}{\lambda} A^{-1}$, if $\lambda \neq 0$.
An elementary matrix is a matrix that differs from the identity matrix by a single ele­
mentary row operation. There are three types of elementary row operations:
1. Row switching: A row within the matrix is switched with another row.
2. Row multiplication: Each element in a row is multiplied by a non-zero constant.
3. Row addition: A row is replaced with the sum of that row and a multiple of another row.
Example 7. Both matrices $A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$ and $B = \begin{pmatrix} 1 & 0 & 3 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$ are elementary matrices.

A square matrix $A$ is orthogonal if $AA^T = A^T A = I$, i.e., $A^{-1} = A^T$.

Example 8. Matrix $A = \begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \\ 0 & 0 & 1 \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 \end{pmatrix}$ is orthogonal:
$$
AA^T = \begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \\ 0 & 0 & 1 \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 \end{pmatrix}
\begin{pmatrix} \frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & 0 & -\frac{1}{\sqrt{2}} \\ 0 & 1 & 0 \end{pmatrix}
= \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix},
$$
$$
A^T A = \begin{pmatrix} \frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & 0 & -\frac{1}{\sqrt{2}} \\ 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \\ 0 & 0 & 1 \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 \end{pmatrix}
= \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.
$$

Two matrices $A$ and $B$ are called similar, denoted $A \sim B$, if there exists a nonsingular matrix $P$ such that $P^{-1} A P = B$.

Example 9. Let $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} -1 & -1 \\ 4 & 6 \end{pmatrix}$, then $A \sim B$ with $P = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}$.

A square matrix $A$ is called diagonalizable if it is similar to a diagonal matrix, i.e., if there exists an invertible matrix $P$ such that $P^{-1} A P$ is a diagonal matrix.

Example 10. Let us consider $A = \begin{pmatrix} 1 & 1 \\ -2 & 4 \end{pmatrix}$ and $P = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}$; then $A$ is diagonalizable since
$$P^{-1} A P = \begin{pmatrix} 2 & -1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ -2 & 4 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}.$$

A square matrix $A$ is called diagonally dominant if, for every row of the matrix, the magnitude of the diagonal element in the row is equal to or larger than the sum of the magnitudes of the other elements in the row. In more formal terms:
$$|a_{ii}| \geq \sum_{j \neq i} |a_{ij}|, \quad \forall i. \tag{2.5}$$
If a strict inequality is used instead of a weak inequality, matrix $A$ is a strictly diagonally dominant matrix.

Example 11. Matrix $A = \begin{pmatrix} -3 & -1 & 2 \\ -3 & 4 & 0 \\ -1 & -2 & 3 \end{pmatrix}$ is diagonally dominant, while $B = \begin{pmatrix} -3 & -1 & 1 \\ -3 & 4 & 0 \\ -1 & -2 & 4 \end{pmatrix}$ is strictly diagonally dominant.

The conjugate transpose of an $m \times n$ matrix $A$ with complex entries is the $n \times m$ matrix $A^*$ obtained from $A$ by computing its transpose and replacing each element with its complex conjugate.

Example 12. If $A = \begin{pmatrix} 3 + 2i & 4 - 5i \\ -6i & 9 + i\sqrt{2} \\ 7 - i & 6 + 7i \end{pmatrix}$, then its conjugate transpose is
$$A^* = \begin{pmatrix} 3 - 2i & 6i & 7 + i \\ 4 + 5i & 9 - i\sqrt{2} & 6 - 7i \end{pmatrix}.$$

A square matrix $A$ is called a Hermitian matrix if $A = A^*$.

2.2.4 Important matrix properties


It can easily be proven that matrices hold the following properties under addition, multi­
plication, trace, and transpose operations [1]:

1. $A_{n \times m} + B_{n \times m} = B_{n \times m} + A_{n \times m}$;
2. $A_{n \times m} + (B_{n \times m} + C_{n \times m}) = (A_{n \times m} + B_{n \times m}) + C_{n \times m}$;
3. $A_{n \times m} (B_{m \times k} C_{k \times p}) = (A_{n \times m} B_{m \times k}) C_{k \times p}$;
4. $A_{n \times m} (B_{m \times k} + C_{m \times k}) = A_{n \times m} B_{m \times k} + A_{n \times m} C_{m \times k}$;
5. $0_{n \times m} + A_{n \times m} = A_{n \times m} + 0_{n \times m} = A_{n \times m}$;
6. $0_{n \times m} A_{m \times k} = 0_{n \times k}$;
7. $A_{n \times m} 0_{m \times k} = 0_{n \times k}$;
8. $\lambda (A_{n \times m} + B_{n \times m}) = \lambda A_{n \times m} + \lambda B_{n \times m}$, $\lambda \in \mathbb{R}$;
9. $\left(A_{n \times m}^T\right)^T = A_{n \times m}$;
10. $(A_{n \times m} + B_{n \times m})^T = A_{n \times m}^T + B_{n \times m}^T$;
11. $(A_{n \times m} B_{m \times k})^T = B_{m \times k}^T A_{n \times m}^T$;
12. $\mathrm{tr}(A_{n \times n} + B_{n \times n}) = \mathrm{tr}(A_{n \times n}) + \mathrm{tr}(B_{n \times n})$;
13. $\mathrm{tr}(A_{n \times n} B_{n \times n}) = \mathrm{tr}(B_{n \times n} A_{n \times n})$.

2.2.5 Determinant
For a square matrix $A$, the determinant is a scalar value denoted $\det(A)$ or simply $|A|$. For a $2 \times 2$ matrix $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$, the determinant is:
$$|A| = \begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc.$$
Similarly, for a $3 \times 3$ matrix $A = \begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix}$, the determinant is:
$$|A| = \begin{vmatrix} a & b & c \\ d & e & f \\ g & h & i \end{vmatrix} = a \begin{vmatrix} e & f \\ h & i \end{vmatrix} - b \begin{vmatrix} d & f \\ g & i \end{vmatrix} + c \begin{vmatrix} d & e \\ g & h \end{vmatrix} = aei + bfg + cdh - ceg - bdi - afh.$$
Through the Laplace formula, this procedure can be generalized and performed recursively to calculate the determinant of any $n \times n$ matrix. The Laplace formula expresses the determinant of a matrix in terms of its minors. The minor $M_{ij}$ is defined as the determinant of the $(n-1) \times (n-1)$ matrix that results from $A$ by removing its $i$th row and $j$th column. Calculation of the determinant through the Laplace formula is given by:
$$|A| = \sum_{j=1}^{n} (-1)^{i+j} a_{ij} M_{ij} = \sum_{i=1}^{n} (-1)^{i+j} a_{ij} M_{ij},$$
where the expression $(-1)^{i+j} M_{ij}$ is known as a cofactor.



Example 13. The determinant of $A = \begin{pmatrix} -2 & 2 & -3 \\ -1 & 1 & 3 \\ 2 & 0 & -1 \end{pmatrix}$, through the Laplace expansion along the second column, is given by:
$$
|A| = (-1)^{1+2} \times 2 \times \begin{vmatrix} -1 & 3 \\ 2 & -1 \end{vmatrix} + (-1)^{2+2} \times 1 \times \begin{vmatrix} -2 & -3 \\ 2 & -1 \end{vmatrix} + (-1)^{3+2} \times 0 \times \begin{vmatrix} -2 & -3 \\ -1 & 3 \end{vmatrix} = 10 + 8 + 0 = 18.
$$
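In code, determinants are normally computed with a library routine rather than by Laplace expansion, whose cost grows factorially with the matrix size. The sketch below (illustrative; the helper name is ours) checks Example 13 both ways.

```python
import numpy as np

A = np.array([[-2, 2, -3], [-1, 1, 3], [2, 0, -1]], dtype=float)
print(np.linalg.det(A))   # approx 18.0

def det_laplace(M):
    """Recursive Laplace expansion along the first row (small matrices only)."""
    n = M.shape[0]
    if n == 1:
        return M[0, 0]
    total = 0.0
    for j in range(n):
        minor = np.delete(np.delete(M, 0, axis=0), j, axis=1)
        total += (-1) ** j * M[0, j] * det_laplace(minor)
    return total

print(det_laplace(A))     # also 18.0
```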

Important determinant properties: It can be shown that for square matrices $A$ and $B$, determinants have the following properties:
1. $\det(A) = \det(A^T)$;
2. $\det(cA) = c^n \det(A)$, where $c$ is a scalar;
3. $\det(AB) = \det(A)\det(B)$;
4. If two rows or two columns of $A$ are identical, then $\det(A) = 0$;
5. If a row or column of $A$ is zero, then $\det(A) = 0$;
6. If $B$ is a matrix obtained from $A$ by interchanging two rows or two columns, then $\det(B) = -\det(A)$;
7. The determinant of a triangular matrix is the product of its diagonal entries;
8. Adding a scalar multiple of one column to another column does not change the value of the determinant;
9. If $A \sim B$, then $\det(A) = \det(B)$;
10. $\det\left(A^{-1}\right) = \frac{1}{\det(A)}$;
11. $A^{-1} = \frac{1}{\det(A)} \begin{pmatrix} A_{11} & \cdots & A_{1n} \\ \vdots & \ddots & \vdots \\ A_{n1} & \cdots & A_{nn} \end{pmatrix}^T$, where $A_{ij} = (-1)^{i+j} M_{ij}$.

2.2.6 Row and column spaces
Suppose that $A = (a_{ij})_{m \times n}$. We define the row vectors of $A$ as:
$$r_1 = (a_{11}, a_{12}, \ldots, a_{1n}), \; \ldots, \; r_m = (a_{m1}, a_{m2}, \ldots, a_{mn}) \tag{2.6}$$
and the column vectors of $A$ as:
$$c_1 = (a_{11}, a_{21}, \ldots, a_{m1})^T, \; \ldots, \; c_n = (a_{1n}, a_{2n}, \ldots, a_{mn})^T. \tag{2.7}$$
The subspace spanned by the row vectors of $A$ is called the row space of $A$, and the subspace spanned by the column vectors of $A$ is called the column space of $A$.

2.2.7 Rank of a matrix
The rank of a matrix $A$, denoted by $\mathrm{rank}(A)$, is the dimension of the column space of $A$. For example, let us consider
$$
A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad
B = \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{pmatrix}, \quad
C = \begin{pmatrix} 1 & 2 \\ 2 & 4 \\ 0 & 0 \end{pmatrix}, \quad
D = \begin{pmatrix} 1 & 2 & 1 \\ -2 & -3 & 1 \\ 3 & 5 & 0 \end{pmatrix}.
$$
We can easily see that $\mathrm{rank}(A) = 2$, $\mathrm{rank}(B) = 2$, $\mathrm{rank}(C) = 1$, $\mathrm{rank}(D) = 2$.
An $m \times n$ matrix $A$ is called full column rank if its columns are linearly independent. Full row rank is similarly defined. A matrix is said to be full rank if it has either full column or full row rank.
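These ranks can be verified with NumPy's matrix_rank routine, which estimates the rank from the singular values; a small illustrative check:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[1, 2], [3, 4], [5, 6]])
C = np.array([[1, 2], [2, 4], [0, 0]])
D = np.array([[1, 2, 1], [-2, -3, 1], [3, 5, 0]])

for name, M in [("A", A), ("B", B), ("C", C), ("D", D)]:
    print(name, np.linalg.matrix_rank(M))   # prints 2, 2, 1, 2
```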

Important matrix rank properties:
1. $\mathrm{rank}(A) \leq \min(m, n)$, $\forall A_{m \times n}$;
2. $\mathrm{rank}(I_n) = n$;
3. $\mathrm{rank}(A) = \mathrm{rank}(A^T)$;
4. The following statements are equivalent:
   a. $A_{n \times n}$ is invertible;
   b. $\det(A) \neq 0$;
   c. $\mathrm{rank}(A) = n$;
   d. The row vectors of $A$ are linearly independent;
   e. The column vectors of $A$ are linearly independent.

2.3 Vector and matrix norms

Vector and matrix norms play an important role in the realms of data science and machine learning, offering fundamental tools for quantifying the size, magnitude, and distance of data structures [3]. These norms provide a standardized way to measure the complexity and variability of datasets, enabling practitioners to identify outliers, assess convergence rates in optimization algorithms, and regulate the impact of individual data points on model training. By defining clear notions of distance and similarity, norms facilitate clustering, classification, and dimensionality reduction techniques. Moreover, they underpin the stability analysis of algorithms and models, aiding in the understanding of generalization bounds and overfitting. Whether employed in regularization strategies, loss functions, or validation procedures, matrix and vector norms serve as essential building blocks that empower data scientists and machine learning engineers to develop robust and effective solutions in the face of diverse and complex datasets.

2.3.1 Vector norms
Let $x = (x_1, x_2, \ldots, x_n)^T$ be an $n$-vector in $\mathbb{R}^n$. A vector norm, denoted by $\|x\|$, is a real-valued continuous function of the components $x_1, x_2, \ldots, x_n$ of $x$, defined on $\mathbb{R}^n$; it has the following properties:
1. $\|x\| > 0$ for every non-zero vector $x$; $\|x\| = 0$ if and only if $x$ is the zero vector;
2. $\|\alpha x\| = |\alpha| \, \|x\|$, for all $x$ in $\mathbb{R}^n$ and for all scalars $\alpha$;
3. $\|x + y\| \leq \|x\| + \|y\|$ for all $x$ and $y$ in $\mathbb{R}^n$.
It is easy to verify that the following are vector norms:
1. $\|x\|_1 = |x_1| + |x_2| + \ldots + |x_n|$, the one norm or sum norm;
2. $\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \ldots + x_n^2}$, the Euclidean norm or two norm;
3. $\|x\|_\infty = \max(|x_1|, |x_2|, \ldots, |x_n|)$, the infinity norm or maximum norm.
In general, if $p$ is a real number greater than or equal to 1, the $p$-norm, or Hölder norm, is defined by:
$$\|x\|_p = \left( |x_1|^p + |x_2|^p + \ldots + |x_n|^p \right)^{\frac{1}{p}}.$$
Example 14. Let $x = (5, 2, -1)^T$, then
$$\|x\|_1 = 5 + 2 + 1 = 8, \quad \|x\|_2 = \sqrt{25 + 4 + 1} = \sqrt{30}, \quad \|x\|_\infty = \max(5, 2, 1) = 5.$$
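The norms in Example 14 can be reproduced with np.linalg.norm by passing the order p; a brief illustrative check:

```python
import numpy as np

x = np.array([5, 2, -1])
print(np.linalg.norm(x, 1))       # 8.0, one norm
print(np.linalg.norm(x, 2))       # sqrt(30) ~ 5.477, Euclidean norm
print(np.linalg.norm(x, np.inf))  # 5.0, infinity norm
print(np.linalg.norm(x, 3))       # general p-norm with p = 3
```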

An important property of the Hölder norm is the Hölder inequality:
$$\left| x^T y \right| \leq \|x\|_p \, \|y\|_q, \quad \text{where} \quad \frac{1}{p} + \frac{1}{q} = 1.$$
A special case of the Hölder inequality is the Cauchy–Schwarz inequality, which is defined as follows:
$$\left| x^T y \right| \leq \|x\|_2 \, \|y\|_2.$$

2.3.2 Matrix norms
Let $A$ be an $n \times m$ matrix. Then, analogous to the vector norm, we define a matrix norm $\|A\|$ with the following properties:
1. $\|A\| > 0$; $\|A\| = 0$ if and only if $A$ is a zero matrix;
2. $\|\alpha A\| = |\alpha| \, \|A\|$ for any scalar $\alpha$;
3. $\|A + B\| \leq \|A\| + \|B\|$.
The most common forms of matrix norms are:
1. $\|A\|_F = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2 \right)^{\frac{1}{2}}$, the Frobenius norm;
2. $\|A\|_1 = \max_{1 \leq j \leq n} \sum_{i=1}^{m} |a_{ij}|$, the maximum column-sum norm;
3. $\|A\|_\infty = \max_{1 \leq i \leq m} \sum_{j=1}^{n} |a_{ij}|$, the maximum row-sum norm.

Example 15. Let $A = \begin{pmatrix} 1 & 2 & 0 \\ 5 & 4 & -1 \\ -2 & 0 & 6 \end{pmatrix}$, then we have:
$$\|A\|_1 = \max\{1 + 5 + 2, \; 2 + 4 + 0, \; 0 + 1 + 6\} = 8$$
and
$$\|A\|_\infty = \max\{1 + 2 + 0, \; 5 + 4 + 1, \; 2 + 0 + 6\} = 10.$$
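The same routine covers the matrix norms of Example 15 when given a matrix and the appropriate ord argument; a short illustrative check:

```python
import numpy as np

A = np.array([[1, 2, 0], [5, 4, -1], [-2, 0, 6]])
print(np.linalg.norm(A, 1))       # 8.0, maximum column-sum norm
print(np.linalg.norm(A, np.inf))  # 10.0, maximum row-sum norm
print(np.linalg.norm(A, 'fro'))   # Frobenius norm
```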

2.4 Eigenvalues and eigenvectors


Eigenvalues and eigenvectors of a matrix have many uses in data science, machine learn­
ing, and other fields. A method of calculation of these values is described in this section.
Before giving an explanation of how they are obtained, a brief overview of a method
for solving a system of linear equations, which is necessary for eigenvalues analysis, is
given.

2.4.1 A system of linear equations


A system of linear equations is d­fined as follows:

$$
\begin{aligned}
a_{1,1} x_1 + a_{1,2} x_2 + \cdots + a_{1,n} x_n &= b_1, \\
a_{2,1} x_1 + a_{2,2} x_2 + \cdots + a_{2,n} x_n &= b_2, \\
&\;\;\vdots \\
a_{n,1} x_1 + a_{n,2} x_2 + \cdots + a_{n,n} x_n &= b_n.
\end{aligned} \tag{2.8}
$$

In system (2.8) ai,j and bi are known and xi is unknown, which should be determined.
System (2.8) can be expressed as follows:
$$
\underbrace{\begin{pmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n,1} & a_{n,2} & \cdots & a_{n,n}
\end{pmatrix}}_{A_{n \times n}}
\underbrace{\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}}_{x_{n \times 1}}
=
\underbrace{\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}}_{b_{n \times 1}}.
$$

Therefore we have a system represented by Ax = b. In this system, the unknown vector


x must be determined, given that the constant vector b and the coefficient matrix A are
known. A linear system Ax = b is called a homogeneous system if b = 0, otherwise it is
called a non-homogeneous system.
In solving a linear system, there are two conditions based on the determinant of the
coefficient matrix A as follows:
Condition 1: If the determinant of the coefficient matrix A is non-zero, A is an invertible
matrix and the unknown vector x can be uniquely determined as follows:

x = A−1 b.

Condition 2: If the determinant of the coefficient matrix $A$ is zero, $A$ is not an invertible matrix. For such a condition, there are two cases:
1. If $(\operatorname{adj} A)\, b \neq 0$, then the system of linear equations $Ax = b$ has no solution. In this case, the system is called inconsistent.
2. If $(\operatorname{adj} A)\, b = 0$, then the system of linear equations $Ax = b$ has either infinitely many solutions or is inconsistent.
Note that $\operatorname{adj} A$ is the adjugate of the coefficient matrix $A$ and equals the transpose of the cofactor matrix of $A$.

2.4.2 Calculation of eigenvalues and eigenvectors


For a square matrix A ∈ Rn×n , a non-zero vector x ∈ Cn , and a scalar value λ ∈ C, if

Ax = λx, (2.9)

λ is an eigenvalue of A and x is a corresponding eigenvector of A. To calculate these values,


Eq. (2.9) can be expressed as follows:

(A − λI )x = 0. (2.10)

Calculation of |A − λI | gives a polynomial as follows:

p(λ) = λn + c1 λn−1 + · · · + cn−1 λ + cn , (2.11)

where c1 , c2 , . . . , cn are constant values and λ1 , λ2 , . . . , λn are eigenvalues of matrix A. This


polynomial is called the characteristic polynomial of A.

As described in Section 2.4.1, system (2.10) has only the trivial solution $x = 0$ if $|A - \lambda I| \neq 0$. However, we are interested in non-zero vectors $x$. Therefore, to obtain non-zero solutions for $x$, the value of $|A - \lambda I|$ must be zero. When $|A - \lambda I| = 0$, the characteristic equation of $A$ is obtained as follows:

p(λ) = λn + c1 λn−1 + · · · + cn−1 λ + cn = 0. (2.12)

The eigenvalues λ1 , λ2 , . . . , λn are the roots of Eq. (2.12). Also, for each eigenvalue of the
matrix A, any vector x that sati­fies the following equation is called an eigenvector corre­
sponding to that eigenvalue:
Ax = λx.

Example 16. Calculate the eigenvalues and eigenvectors of the matrix $A$:
$$A = \begin{bmatrix} 8 & 4 \\ 3 & 7 \end{bmatrix}_{2 \times 2}.$$
Solution: First, by constructing the characteristic equation of the matrix $A$, we calculate the eigenvalues of matrix $A$ as follows:
$$
|A - \lambda I| = 0, \quad
\begin{vmatrix} 8 - \lambda & 4 \\ 3 & 7 - \lambda \end{vmatrix} = 0, \quad
(8 - \lambda)(7 - \lambda) - 12 = 0, \quad \lambda^2 - 15\lambda + 44 = 0
\;\Rightarrow\; \lambda_1 = 4, \; \lambda_2 = 11.
$$
Now, the eigenvector $x$ corresponding to the first eigenvalue $\lambda_1 = 4$ can be obtained from the following equations:
$$
Ax = 4x, \quad
\begin{bmatrix} 8 & 4 \\ 3 & 7 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 4 \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
\;\Rightarrow\;
\begin{bmatrix} 8x_1 + 4x_2 \\ 3x_1 + 7x_2 \end{bmatrix} = \begin{bmatrix} 4x_1 \\ 4x_2 \end{bmatrix}.
$$
Therefore we will have a linear system as follows:
$$
\begin{aligned}
8x_1 + 4x_2 &= 4x_1, \\
3x_1 + 7x_2 &= 4x_2.
\end{aligned} \tag{2.13}
$$

Any solution of linear system (2.13) is an eigenvector of the matrix $A$. Since the two equations in (2.13) are linearly dependent, the system has infinitely many solutions satisfying $x_1 = -x_2$. For instance, the vector $x = [1, -1]^T$ is an eigenvector of the matrix $A$ corresponding to the eigenvalue $\lambda_1 = 4$. To verify that this vector satisfies $Ax = \lambda x$, we calculate $Ax$ and $\lambda x$ separately below:
$$
Ax = \begin{bmatrix} 8 & 4 \\ 3 & 7 \end{bmatrix} \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 4 \\ -4 \end{bmatrix}, \quad
\lambda x = 4 \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 4 \\ -4 \end{bmatrix}.
$$

Furthermore, the eigenvector $x$ corresponding to the second eigenvalue $\lambda_2 = 11$ can be obtained from the following equations:
$$
Ax = 11x, \quad
\begin{bmatrix} 8 & 4 \\ 3 & 7 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 11 \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
\;\Rightarrow\;
\begin{bmatrix} 8x_1 + 4x_2 \\ 3x_1 + 7x_2 \end{bmatrix} = \begin{bmatrix} 11x_1 \\ 11x_2 \end{bmatrix}.
$$
Therefore we will have a linear system as follows:
$$
\begin{aligned}
8x_1 + 4x_2 &= 11x_1, \\
3x_1 + 7x_2 &= 11x_2.
\end{aligned} \tag{2.14}
$$

Any solution of linear system (2.14) is an eigenvector of the matrix $A$. Since the two equations in (2.14) are linearly dependent, the system has infinitely many solutions satisfying $x_1 = \frac{4}{3} x_2$. For instance, the vector $x = [\frac{4}{3}, 1]^T$ is an eigenvector of the matrix $A$ corresponding to the eigenvalue $\lambda_2 = 11$. To verify that this vector satisfies $Ax = \lambda x$, we calculate $Ax$ and $\lambda x$ separately below:
$$
Ax = \begin{bmatrix} 8 & 4 \\ 3 & 7 \end{bmatrix} \begin{bmatrix} \frac{4}{3} \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{44}{3} \\ 11 \end{bmatrix}, \quad
\lambda x = 11 \begin{bmatrix} \frac{4}{3} \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{44}{3} \\ 11 \end{bmatrix}.
$$
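In practice, eigenvalues and eigenvectors are computed numerically. The sketch below (illustrative; np.linalg.eig returns eigenvectors normalized to unit length and in no guaranteed order, so they may differ from the hand-derived vectors by a scalar factor) checks Example 16.

```python
import numpy as np

A = np.array([[8.0, 4.0], [3.0, 7.0]])
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)   # the values 11 and 4, in some order

# each column of eigvecs is an eigenvector; verify A v = lambda v
for lam, v in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ v, lam * v))   # True, True
```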

Example 17. Calculate the eigenvalues and eigenvectors of the matrix $A$:
$$A = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}_{2 \times 2}.$$
Solution: First, by formulating the characteristic equation of the matrix $A$, we calculate the eigenvalues of the matrix $A$ as follows:
$$
|A - \lambda I| = 0, \quad
\begin{vmatrix} -\lambda & 1 \\ -1 & -\lambda \end{vmatrix} = 0, \quad
\lambda^2 + 1 = 0.
$$

This equation has no real root. Therefore the eigenvalues of the matrix $A$ are the complex numbers
$$\lambda_1 = i, \qquad \lambda_2 = -i.$$
Now, the eigenvector $x$ corresponding to the first eigenvalue $\lambda_1 = i$ can be obtained from the following equations:
$$
Ax = ix, \quad
\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = i \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
\;\Rightarrow\;
\begin{bmatrix} x_2 \\ -x_1 \end{bmatrix} = \begin{bmatrix} x_1 i \\ x_2 i \end{bmatrix}.
$$
Therefore we will have a linear system as follows:
$$
\begin{aligned}
x_2 &= x_1 i, \\
-x_1 &= x_2 i.
\end{aligned} \tag{2.15}
$$

Any solution of linear system (2.15) is an eigenvector of the matrix $A$. Since the two equations in (2.15) are linearly dependent, the system has infinitely many solutions satisfying $x_2 = x_1 i$. For instance, the vector $x = [1, i]^T$ is an eigenvector of the matrix $A$ corresponding to the eigenvalue $\lambda_1 = i$. To verify that this vector satisfies $Ax = \lambda x$, we calculate $Ax$ and $\lambda x$ separately below:
$$
Ax = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ i \end{bmatrix} = \begin{bmatrix} i \\ -1 \end{bmatrix}, \quad
\lambda x = i \begin{bmatrix} 1 \\ i \end{bmatrix} = \begin{bmatrix} i \\ -1 \end{bmatrix}.
$$

Furthermore, the eigenvector $x$ corresponding to the second eigenvalue $\lambda_2 = -i$ can be obtained from the following equations:
$$
Ax = -ix, \quad
\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = -i \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
\;\Rightarrow\;
\begin{bmatrix} x_2 \\ -x_1 \end{bmatrix} = \begin{bmatrix} -x_1 i \\ -x_2 i \end{bmatrix}.
$$
Therefore we will have a linear system as follows:
$$
\begin{aligned}
x_2 &= -x_1 i, \\
-x_1 &= -x_2 i.
\end{aligned} \tag{2.16}
$$

Any solution of linear system (2.16) is an eigenvector of the matrix $A$. Since the two equations in (2.16) are linearly dependent, the system has infinitely many solutions satisfying $x_2 = -x_1 i$. For instance, the vector $x = [1, -i]^T$ is an eigenvector of the matrix $A$ corresponding to the eigenvalue $\lambda_2 = -i$. To verify that this vector satisfies $Ax = \lambda x$, we calculate $Ax$ and $\lambda x$ separately below:
$$
Ax = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ -i \end{bmatrix} = \begin{bmatrix} -i \\ -1 \end{bmatrix}, \quad
\lambda x = -i \begin{bmatrix} 1 \\ -i \end{bmatrix} = \begin{bmatrix} -i \\ -1 \end{bmatrix}.
$$

2.4.3 Cayley–Hamilton theorem


Theorem 1. Every square matrix $A$ satisfies its own characteristic equation. In other words, if $p(\lambda)$ is the characteristic polynomial of the matrix $A$, then $p(A) = 0$.

Proof. For a proof of this theorem, see reference [1].

Example 18. Show that the following matrix $A$ satisfies its own characteristic equation:
$$A = \begin{bmatrix} 8 & 4 \\ 3 & 7 \end{bmatrix}_{2 \times 2}.$$
Solution: We have already shown that the characteristic polynomial of the matrix $A$ is
$$p(\lambda) = \lambda^2 - 15\lambda + 44.$$
Now, we calculate $p(A) = A^2 - 15A + 44I$. Therefore we will have:
$$
p(A) = \begin{bmatrix} 8 & 4 \\ 3 & 7 \end{bmatrix} \begin{bmatrix} 8 & 4 \\ 3 & 7 \end{bmatrix} - 15 \begin{bmatrix} 8 & 4 \\ 3 & 7 \end{bmatrix} + 44 \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}.
$$
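Example 18 can also be confirmed numerically; a brief illustrative sketch:

```python
import numpy as np

A = np.array([[8.0, 4.0], [3.0, 7.0]])
p_A = A @ A - 15 * A + 44 * np.eye(2)       # p(A) = A^2 - 15A + 44I
print(np.allclose(p_A, np.zeros((2, 2))))   # True: A satisfies its characteristic equation
```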

2.5 Matrix centering


Matrix centering is a statistical preprocessing technique used to make the columns or rows of a matrix have zero mean. This process involves subtracting the mean of each column or row from every element in that column or row. Matrix centering simplifies the data structure, making it easier to identify patterns and relationships. This technique is particularly useful in multivariate analysis, such as principal component analysis (PCA), where the relative distances between data points are important for understanding underlying patterns.
Let us take into account a matrix $A \in \mathbb{R}^{n \times m}$, which can be depicted by its rows, denoted as $A = [a_1, a_2, \ldots, a_n]^T$, or by its columns $A = [b_1, b_2, \ldots, b_m]$, where $a_i$ and $b_j$ stand for the $i$th row and $j$th column of $A$, respectively. The left-centering matrix is defined as follows:
$$H_L = I - \frac{1}{n} \mathbf{1}\mathbf{1}^T, \tag{2.17}$$
where $H_L \in \mathbb{R}^{n \times n}$, $\mathbf{1} = [1, 1, \ldots, 1]^T \in \mathbb{R}^n$, and $I \in \mathbb{R}^{n \times n}$. Left multiplying this matrix by $A$, denoted as $H_L A$, eliminates the average value of $A$'s columns from each of its columns:
$$H_L A = A - \frac{1}{n} \mathbf{1}\mathbf{1}^T A = A - \mu_{\text{columns}}, \tag{2.18}$$
where $\mu_{\text{columns}} \in \mathbb{R}^{n \times m}$ is created by replicating the row vector $\mu_c = [\alpha_1, \ldots, \alpha_m]$ $n$ times. Here $\mu_c$ is the mean of the columns of matrix $A$, and $\alpha_i$ is the mean value of the $i$th column of matrix $A$.

Example 19. For the following matrix $A$, show that Eq. (2.18) is valid:
$$A = \begin{bmatrix} 2 & -2 & 4 \\ 3 & 3 & 2 \end{bmatrix}_{2 \times 3}.$$
Solution: First, we calculate matrix $H_L$ as follows:
$$H_L = I_{2 \times 2} - \frac{1}{2} \mathbf{1}\mathbf{1}^T = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \frac{1}{2} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 0.5 & -0.5 \\ -0.5 & 0.5 \end{bmatrix}.$$
Then, we calculate $H_L A$ as follows:
$$H_L A = \begin{bmatrix} 0.5 & -0.5 \\ -0.5 & 0.5 \end{bmatrix} \begin{bmatrix} 2 & -2 & 4 \\ 3 & 3 & 2 \end{bmatrix} = \begin{bmatrix} -0.5 & -2.5 & 1 \\ 0.5 & 2.5 & -1 \end{bmatrix}. \tag{2.19}$$
Now, we calculate $\mu_c$ and $\mu_{\text{columns}}$ as follows:
$$\mu_c = \begin{bmatrix} \frac{2+3}{2} & \frac{-2+3}{2} & \frac{4+2}{2} \end{bmatrix} = \begin{bmatrix} 2.5 & 0.5 & 3 \end{bmatrix}, \quad
\mu_{\text{columns}} = \begin{bmatrix} \mu_c \\ \mu_c \end{bmatrix} = \begin{bmatrix} 2.5 & 0.5 & 3 \\ 2.5 & 0.5 & 3 \end{bmatrix}.$$
Then, we calculate $A - \mu_{\text{columns}}$ as follows:
$$A - \mu_{\text{columns}} = \begin{bmatrix} 2 & -2 & 4 \\ 3 & 3 & 2 \end{bmatrix} - \begin{bmatrix} 2.5 & 0.5 & 3 \\ 2.5 & 0.5 & 3 \end{bmatrix} = \begin{bmatrix} -0.5 & -2.5 & 1 \\ 0.5 & 2.5 & -1 \end{bmatrix}. \tag{2.20}$$

The results obtained from (2.19) and (2.20) are the same.
The right-centering matrix of a matrix $A \in \mathbb{R}^{n \times m}$ is defined as follows:
$$H_R = I - \frac{1}{m} \mathbf{1}\mathbf{1}^T, \tag{2.21}$$
where $H_R \in \mathbb{R}^{m \times m}$, $\mathbf{1} = [1, 1, \ldots, 1]^T \in \mathbb{R}^m$, and $I \in \mathbb{R}^{m \times m}$. Right multiplying matrix $A$ by $H_R$, denoted as $A H_R$, eliminates the average value of $A$'s rows from each of its rows:
$$A H_R = A - \frac{1}{m} A \mathbf{1}\mathbf{1}^T = [A^T - \mu_{\text{rows}}]^T, \tag{2.22}$$
where $\mu_{\text{rows}} \in \mathbb{R}^{m \times n}$ is created by replicating the row vector $\mu_r = [\beta_1, \ldots, \beta_n]$ $m$ times. Here $\mu_r$ contains the means of the rows of matrix $A$, and $\beta_i$ is the mean value of the $i$th row of matrix $A$.
Example 20. For the following matrix $A$, show that Eq. (2.22) is valid:
$$A = \begin{bmatrix} 2 & -2 & 4 \\ 3 & 3 & 2 \end{bmatrix}_{2 \times 3}.$$
Solution: First, we calculate matrix $H_R$ as follows:
$$H_R = I_{3 \times 3} - \frac{1}{3} \mathbf{1}\mathbf{1}^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} - \frac{1}{3} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} = \begin{bmatrix} 0.667 & -0.333 & -0.333 \\ -0.333 & 0.667 & -0.333 \\ -0.333 & -0.333 & 0.667 \end{bmatrix}.$$
Then, we calculate $A H_R$ as follows:
$$A H_R = \begin{bmatrix} 2 & -2 & 4 \\ 3 & 3 & 2 \end{bmatrix} \begin{bmatrix} 0.667 & -0.333 & -0.333 \\ -0.333 & 0.667 & -0.333 \\ -0.333 & -0.333 & 0.667 \end{bmatrix} = \begin{bmatrix} 0.667 & -3.333 & 2.667 \\ 0.333 & 0.333 & -0.667 \end{bmatrix}. \tag{2.23}$$
Now, we calculate $\mu_r$ and $\mu_{\text{rows}}$ as follows:
$$\mu_r = \begin{bmatrix} \frac{2-2+4}{3} & \frac{3+3+2}{3} \end{bmatrix} = \begin{bmatrix} 1.333 & 2.667 \end{bmatrix}, \quad
\mu_{\text{rows}} = \begin{bmatrix} \mu_r \\ \mu_r \\ \mu_r \end{bmatrix} = \begin{bmatrix} 1.333 & 2.667 \\ 1.333 & 2.667 \\ 1.333 & 2.667 \end{bmatrix}.$$
Then, we calculate $[A^T - \mu_{\text{rows}}]^T$ as follows:
$$[A^T - \mu_{\text{rows}}]^T = \left( \begin{bmatrix} 2 & 3 \\ -2 & 3 \\ 4 & 2 \end{bmatrix} - \begin{bmatrix} 1.333 & 2.667 \\ 1.333 & 2.667 \\ 1.333 & 2.667 \end{bmatrix} \right)^T = \begin{bmatrix} 0.667 & -3.333 & 2.667 \\ 0.333 & 0.333 & -0.667 \end{bmatrix}. \tag{2.24}$$

The results obtained from (2.23) and (2.24) are the same.
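Both centering operations are one-liners in NumPy; the sketch below (with our own variable names) reproduces Examples 19 and 20.

```python
import numpy as np

A = np.array([[2.0, -2.0, 4.0], [3.0, 3.0, 2.0]])
n, m = A.shape

H_L = np.eye(n) - np.ones((n, n)) / n   # left-centering matrix (Eq. 2.17)
H_R = np.eye(m) - np.ones((m, m)) / m   # right-centering matrix (Eq. 2.21)

print(H_L @ A)                             # column-centered A (Eq. 2.18)
print(A - A.mean(axis=0))                  # same result via column means

print(A @ H_R)                             # row-centered A (Eq. 2.22)
print(A - A.mean(axis=1, keepdims=True))   # same result via row means
```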

2.6 Orthogonal projection


In data mining, particularly in principal component analysis (PCA), understanding or­
thogonal projection is crucial for its application in dimensionality reduction and feature
selection.
It is important to know that by reducing the dimensionality of the data, PCA with or­
thogonal projection can increase the performance and efficiency of machine learning and
data mining algorithms applied to the dataset.
Let x, y ∈ Rn be two n-dimensional vectors. An orthogonal projection of the vector y in
the direction of another vector x is given in Fig. 2.1.

FIGURE 2.1 Orthogonal projection.

The vector $v$ is called the orthogonal projection of $y$ onto $x$ and is calculated by:
$$v = \left( \frac{x^T y}{x^T x} \right) x. \tag{2.25}$$
The vector $r = y - v$ is often interpreted as the residual vector, i.e., the component of $y$ orthogonal to $x$.

Proof. Write $v = cx$ for some scalar $c$, so that $r = y - v = y - cx$. Since $v$ and $r$ must be orthogonal,
$$
v^T r = v^T (y - v) = (cx)^T (y - cx) = c\, x^T y - c^2\, x^T x = 0
\;\Rightarrow\; c = \frac{x^T y}{x^T x}
\;\Rightarrow\; v = \left( \frac{x^T y}{x^T x} \right) x.
$$
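Eq. (2.25) translates directly into code; a minimal sketch with example vectors of our own choosing:

```python
import numpy as np

def project(y, x):
    """Orthogonal projection of y onto the direction of x (Eq. 2.25)."""
    return (x @ y) / (x @ x) * x

x = np.array([2.0, 0.0, 1.0])
y = np.array([1.0, 3.0, 2.0])
v = project(y, x)
r = y - v                              # residual component of y
print(v, r, np.isclose(x @ r, 0.0))    # residual is orthogonal to x
```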

2.7 Definition of gradient


Consider the function f : Rn → R. In mathematics, specifically in calculus and vector anal­
ysis, the gradient ∇f is a vector composed of the partial derivatives as follows:
$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right) \in \mathbb{R}^n.$$

The gradient represents a vector field that points in the direction of the greatest rate of
increase of a scalar-valued function at each point in space.

2.8 Definition of the Hessian matrix


The second derivative of a function $f$ with respect to its variables is called the Hessian matrix, denoted by:
$$
H = \nabla^2 f = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix} \in \mathbb{R}^{n \times n},
$$
where $H_{i,j} = \frac{\partial^2 f}{\partial x_i \partial x_j}$. The Hessian matrix is symmetric (when the second partial derivatives are continuous). For a convex function $f$, the Hessian matrix is positive semidefinite.

2.9 Definition of a Jacobian


Suppose $f : \mathbb{R}^n \to \mathbb{R}^m$ is a function such that each of its first-order partial derivatives exists on $\mathbb{R}^n$; this means that the function is at least once differentiable.
The Jacobian matrix of $f$ is defined as follows [2,3]:
$$
J = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right]
= \begin{bmatrix} \nabla^T f_1 \\ \nabla^T f_2 \\ \vdots \\ \nabla^T f_m \end{bmatrix}
= \begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix} \in \mathbb{R}^{m \times n}.
$$
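When closed-form derivatives are inconvenient, these quantities can be approximated with finite differences. The rough sketch below (an illustration, not a production-grade differentiator) approximates the gradient of a scalar function; rows of a Hessian or Jacobian can be built the same way, one partial derivative at a time.

```python
import numpy as np

def grad_fd(f, x, h=1e-6):
    """Central-difference approximation of the gradient of a scalar function f."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]   # f(x1, x2) = x1^2 + 3 x1 x2
x0 = np.array([1.0, 2.0])
print(grad_fd(f, x0))   # approx [2*x1 + 3*x2, 3*x1] = [8, 3]
```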

2.10 Optimization problem


Optimization problems are fundamental in data mining and dimension reduction tasks, where the objective is to determine optimal solutions in high-dimensional data. Optimization problems are used in clustering, classification, regression, dimension reduction, anomaly detection, and feature selection.
Optimization problems can be divided into two types: unconstrained and constrained problems. Suppose $f : \mathbb{R}^n \to \mathbb{R}$ with domain $D$, where $x \in D$ and $x \in \mathbb{R}^n$. The unconstrained minimization of $f(x)$ is:
$$\underset{x}{\text{Minimize}} \quad f(x),$$
where $x$ is called the optimization variable and the function $f$ is the objective function or the cost function.
Another type of optimization problem consists of some equality and/or inequality constraints. A constrained optimization problem can be modeled as follows:
$$
\begin{aligned}
\underset{x}{\text{Minimize}} \quad & f(x) \\
\text{subject to} \quad & g_i(x) \leq 0, \quad i \in \{1, 2, \ldots, n_1\}, \\
& h_j(x) = 0, \quad j \in \{1, 2, \ldots, n_2\},
\end{aligned} \tag{2.27}
$$
where $f(x)$ is the cost function or objective function, every $g_i(x) \leq 0$ is an inequality constraint, and every $h_j(x) = 0$ is called an equality constraint.
It is important to note that every minimization problem can be changed to a maximization problem as follows:
$$
\begin{aligned}
\underset{x}{\text{Minimize}} \quad & f(x) \\
\text{subject to} \quad & g_i(x) \leq 0, \quad i \in \{1, 2, \ldots, n_1\}, \\
& h_j(x) = 0, \quad j \in \{1, 2, \ldots, n_2\},
\end{aligned}
\quad \equiv \quad
\begin{aligned}
\underset{x}{\text{Maximize}} \quad & -f(x) \\
\text{subject to} \quad & g_i(x) \leq 0, \quad i \in \{1, 2, \ldots, n_1\}, \\
& h_j(x) = 0, \quad j \in \{1, 2, \ldots, n_2\}.
\end{aligned}
$$

2.10.1 Feasible solutions


For problem (2.27), $x^\circ \in D$ is a feasible solution if
$$g_i(x^\circ) \leq 0 \quad \forall i \in \{1, 2, \ldots, n_1\}, \qquad h_j(x^\circ) = 0 \quad \forall j \in \{1, 2, \ldots, n_2\}.$$
Let $S = \{x^\circ \in D \mid g_i(x^\circ) \leq 0 \;\forall i, \; h_j(x^\circ) = 0 \;\forall j\}$. Then $S$ is the set of feasible solutions of problem (2.27).

2.10.2 Lagrangian function


The Lagrangian function for problem (2.27) is $L : \mathbb{R}^n \times \mathbb{R}^{n_1} \times \mathbb{R}^{n_2} \to \mathbb{R}$, defined as follows:
$$
L(x, \lambda, \mu) = f(x) + \sum_{i=1}^{n_1} \lambda_i g_i(x) + \sum_{i=1}^{n_2} \mu_i h_i(x)
= f(x) + \lambda^T g(x) + \mu^T h(x), \tag{2.28}
$$
where $\{\lambda_i\}_{i=1}^{n_1}$ and $\{\mu_i\}_{i=1}^{n_2}$ are the Lagrangian multipliers (dual variables) corresponding to the inequality and equality constraints, respectively.
Eq. (2.28) is called the Lagrangian relaxation of problem (2.27). It is important to note that if problem (2.27) is changed to a maximization problem, then the Lagrangian function is as follows:
$$L(x, \lambda, \mu) = -f(x) + \sum_{i=1}^{n_1} \lambda_i g_i(x) + \sum_{i=1}^{n_2} \mu_i h_i(x).$$

Lagrangian relaxation is a method commonly used to solve constrained optimization


problems by transforming them into unconstrained optimization problems. The regular­
ized objective function, often called the Lagrangian function or Lagrangian relaxation, is
obtained from the original problem formulation by incorporating penalty terms.
This method is useful when dealing with large-scale optimization problems with com­
plicated constraints including equality and inequality constraints.

Example 21. Solve the following optimization problem:
$$
\begin{aligned}
\underset{x_1, x_2}{\text{Minimize}} \quad & f(x_1, x_2) = x_1 + x_2 \\
\text{subject to} \quad & x_1^2 + x_2^2 = 1.
\end{aligned} \tag{2.29}
$$
Solution:
$$L(x_1, x_2, \mu) = f(x_1, x_2) + \mu \left( x_1^2 + x_2^2 - 1 \right),$$
$$\nabla_{x_1, x_2, \mu} L(x_1, x_2, \mu) = \left( \frac{\partial L}{\partial x_1}, \frac{\partial L}{\partial x_2}, \frac{\partial L}{\partial \mu} \right) = \left( 1 + 2\mu x_1, \; 1 + 2\mu x_2, \; x_1^2 + x_2^2 - 1 \right) = 0,$$
$$
\begin{cases}
1 + 2\mu x_1 = 0 \\
1 + 2\mu x_2 = 0 \\
x_1^2 + x_2^2 = 1
\end{cases}
\;\Rightarrow\; x_1 = x_2 = -\frac{1}{2\mu}, \quad \mu \neq 0.
$$
By substituting into $x_1^2 + x_2^2 - 1 = 0$, we will have:
$$\frac{1}{4\mu^2} + \frac{1}{4\mu^2} - 1 = 0 \;\Rightarrow\; \mu = \pm\frac{\sqrt{2}}{2}.$$
Hence, the stationary points $(x_1, x_2, \mu)$ are as follows:
$$\left( \frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}, -\frac{\sqrt{2}}{2} \right), \qquad \left( -\frac{\sqrt{2}}{2}, -\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2} \right).$$
Since the objective is to minimize the function, the solution is:
$$f\left( -\frac{\sqrt{2}}{2}, -\frac{\sqrt{2}}{2} \right) = -\sqrt{2}.$$
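The same problem can be solved numerically, for instance with SciPy's SLSQP solver for constrained minimization. The following is an illustrative sketch; the starting point and the choice of solver are arbitrary choices of ours.

```python
import numpy as np
from scipy.optimize import minimize

objective = lambda x: x[0] + x[1]                                  # f(x1, x2) = x1 + x2
constraint = {"type": "eq", "fun": lambda x: x[0] ** 2 + x[1] ** 2 - 1}

result = minimize(objective, x0=np.array([0.5, -0.5]),
                  method="SLSQP", constraints=[constraint])
print(result.x)     # approx [-0.7071, -0.7071]
print(result.fun)   # approx -1.4142 = -sqrt(2)
```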

2.10.3 Karush–Kuhn–Tucker conditions
In optimization, the Karush–Kuhn–Tucker (KKT) conditions are a first-derivative test for a solution of a nonlinear programming problem such as (2.27) to be optimal. They are named after the mathematicians William Karush, Harold W. Kuhn, and Albert W. Tucker, and they extend the Lagrange multiplier method to constrained optimization problems containing both equality and inequality constraints [3].
The optimal primal variable $x^*$ and the optimal dual variables $\lambda^*, \mu^*$ must satisfy the KKT conditions. Writing the Lagrangian in vector form,
$$
L(x, \lambda, \mu) = f(x) + \lambda^T g(x) + \mu^T h(x) = f(x) + \alpha^T \begin{pmatrix} g(x) \\ h(x) \end{pmatrix},
$$
with
$$
g(x) = \begin{bmatrix} g_1(x) \\ g_2(x) \\ \vdots \\ g_{n_1}(x) \end{bmatrix}, \quad
h(x) = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_{n_2}(x) \end{bmatrix}, \quad
\lambda = \begin{bmatrix} \lambda_1 \\ \vdots \\ \lambda_{n_1} \end{bmatrix}, \quad
\mu = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_{n_2} \end{bmatrix}, \quad
\alpha = \begin{bmatrix} \lambda \\ \mu \end{bmatrix},
$$
the KKT conditions consist of the following four types of conditions:

1. Stationarity condition:
$$\nabla_x L(x, \lambda, \mu) = \nabla_x f(x) + \sum_{i=1}^{n_1} \lambda_i \nabla_x g_i(x) + \sum_{i=1}^{n_2} \mu_i \nabla_x h_i(x) = 0.$$

2. Primal feasibility:
$$g_i(x^*) \leq 0, \quad \forall i \in \{1, \ldots, n_1\}, \qquad h_j(x^*) = 0, \quad \forall j \in \{1, \ldots, n_2\}.$$
3. Dual feasibility:
$$\lambda_i^* \geq 0, \quad \forall i \in \{1, \ldots, n_1\}.$$
4. Complementary slackness:
$$\lambda_i^* \, g_i(x^*) = 0, \quad \forall i \in \{1, \ldots, n_1\}.$$

These conditions provide a powerful method for solving optimization problems with con­
straints and are fundamental in data mining, machine learning, and dimension reduction
methods.

References
[1] Stephen Andrilli, David Hecker, Elementary Linear Algebra, 6th edition, Academic Press, 2022.
[2] Biswa Nath Datta, Numerical Linear Algebra and Applications, 2nd edition, Society for Industrial and
Applied Mathematics, USA, ISBN 0898716853, 2010.
[3] Benyamin Ghojogh, Mark Crowley, Fakhri Karray, Ali Ghodsi, Elements of Dimensionality Reduction
and Manifold Learning, 1st edition, Springer, Cham, 2023.
3
Principal and independent
component analysis methods
Mohammadnavid Ghader, Mostafa Abdolmaleki, and
Hassan Dana Mazraeh
Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid Beheshti University,
Tehran, Iran

3.1 Introduction
PCA is a fascinating statistical technique that allows us to explore and uncover hidden
patterns and relationships in high-dimensional datasets. It is a versatile tool that has revo­
lutionized the way we analyze and understand complex data in fields such as neuroscience
(e.g., [15--17]), genetics, and artificial intelligence.
By reducing the dimensionality of the data, PCA enables us to visualize and interpret
large datasets more effectively [3]. The principal components derived from the data allow
us to identify the most relevant and informative variables that contribute to the overall
variance in the dataset. This process can reveal interesting insights into the underlying
structure of the data, providing a new perspective on complex systems and phenomena.
Moreover, PCA can be used for data compression and feature extraction, making it an
indispensable tool in machine learning and computer vision applications. By transforming
the original data into a lower-dimensional space, PCA allows us to build more efficient and
accurate models that can generalize better to new data.
Overall, PCA is a powerful and elegant technique that has broad applications in sci­
entific research, engineering, and industry. It has played a crucial role in advancing our
understanding of complex systems and has opened up new avenues for exploration and
discovery in a wide range of fields.

3.1.1 History
PCA is a statistical technique widely used to analyze data and reduce its dimensionality. It
was first introduced by mathematician Karl Pearson [5] in 1901 to analyze the correlations
between different variables. However, the modern form of PCA, which involves finding the
principal components of a dataset by computing its eigenvalues and eigenvectors, was
developed independently by several statisticians and mathematicians in the mid-20th cen­
tury.
One of the earliest formulations of PCA in its modern form was introduced by Harold
Hotelling [2] in 1933, who described it as a method for finding the linear combinations of
variables that have the highest variance. However, it was not until the 1950s and 1960s that
PCA gained widespread recognition as a powerful tool for data analysis and dimensionality
reduction.
During this time, several researchers made significant contributions to the develop­
ment of PCA. For example, the mathematician John W. Tukey introduced the concept of
principal components analysis in 1964, which involves finding the eigenvectors of the
covariance matrix of a dataset. Additionally, the statistician Herman Wold developed a re­
lated technique known as PLS (Partial Least Squares) regression, commonly used to model
relationships between variables in high-dimensional datasets.
Since then, PCA has become a widely used tool in various fields, including data science,
machine learning, and statistics. Its applications range from image and signal processing
to finance and biology. PCA finds applications in various fields, including image process­
ing, signal processing, finance, genetics, and social sciences. It can be used for image
compression, denoising, face recognition, portfolio optimization, gene expression anal­
ysis, and social network analysis. Understanding the specific applications and limitations
of PCA in different fields can be beneficial in utilizing it effectively for specific use cases.
In recent years, there has also been significant research into developing more advanced
techniques for PCA, such as nonlinear PCA and sparse PCA [6], which can better handle
complex and high-dimensional data.

3.1.2 Intuition
PCA is a technique that helps in understanding and summarizing the information present
in a high-dimensional dataset. The main intuition behind PCA is to identify the most im­
portant patterns and relationships between the variables in the dataset and to represent
them in a simpler, more understandable form.
To understand how PCA works, imagine that you have a large dataset with many vari­
ables. Each variable may contain some information that is important, but it may also
contain a lot of noise or redundancy. With so many variables, it can be difficult to see the
underlying patterns and relationships in the data.
PCA helps in reducing the dimensionality of the dataset by finding a new set of vari­
ables that are a linear combination of the original variables. These new variables are called
principal components, and they capture the most important information in the data. Each
principal component is a linear combination of the original variables, with the weights
determined by the amount of variation present in each variable.
The first principal component captures the most amount of variation present in the
data, and each subsequent principal component captures the maximum amount of re­
maining variation, subject to being uncorrelated with the previous principal components.
By representing the data in terms of these principal components, we can reduce the di­
mensionality of the data while retaining the most important information.

PCA is a powerful tool for data analysis and has many applications in fields such as
finance, image and signal processing, and biology. It helps in identifying the most impor­
tant patterns and relationships in high-dimensional data and represents them in a simpler,
more understandable form.

3.2 The PCA algorithm


Principal Component Analysis is a common method of dimensionality reduction and data
representation. Using PCA, lower-dimensional representations of high-dimensional struc­
tures are created while maintaining their most salient features. The procedure operates by
finding the directions, or main components, along which the data differ the most. These
principal components are orthogonal, so they can be assumed to describe various aspects
of the data without being redundant. In the following, we will give the mathematical ex­
planation of the PCA algorithm.

3.2.1 Projection in one-dimensional space


First, let us assume that we want to reduce the dimension of the data to one. This means
that all the data points are projected on a line. In this projection, we seek to achieve two
goals. The first goal is to select the projected points so they have the highest variance.
The second goal is to minimize the sum of the distances of each point in the original
dataset with its projection. First, we try to achieve the first goal. We assume that our origi­
nal dataset is D, and the vector w represents the direction of the line where the images of
the data points on this line have the highest variance. Without loss of generality, we assume
that the vector w is unity:

$$\|w\|_2^2 = w^T w = 1. \tag{3.1}$$
If we do not impose a limit on the size of $w$, the variance could be made arbitrarily large simply by increasing $\|w\|$, which would make the problem ill-posed. We also assume that the data points are centered and have a mean of zero ($\mu = 0$). If $x_i$ is a point in the original dataset and $x_i'$ is its projection onto the vector $w$, then $x_i'$ is obtained from the following equation:
$$x_i' = \left( \frac{w^T x_i}{w^T w} \right) w = \left( w^T x_i \right) w = a_i w, \tag{3.2}$$

where ai is a scalar representing the coordinates of point xi along w. It should be noted
that since the original data were centered, the mean coordinates of the projected points
will still be equal to zero (μw = 0).

Now, it is time to calculate the variance of the projected points:
$$
\begin{aligned}
\sigma_w^2 &= \frac{1}{n} \sum_{i=1}^{n} (a_i - \mu_w)^2
= \frac{1}{n} \sum_{i=1}^{n} \left( w^T x_i \right)^2
= \frac{1}{n} \sum_{i=1}^{n} \left( w^T x_i \right)\left( w^T x_i \right)^T \\
&= \frac{1}{n} \sum_{i=1}^{n} w^T \left( x_i x_i^T \right) w
= w^T \left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T \right) w
= w^T \Sigma w,
\end{aligned} \tag{3.3}
$$
where $\Sigma$ is the covariance matrix of the centered dataset $D$. To maximize the variance, we have to solve a constrained optimization problem: the variance should be maximized subject to $\|w\|^2 = w^T w = 1$. Using a Lagrange multiplier, we turn this constrained optimization problem into an unconstrained optimization problem as follows:
$$\max_{w} \; J(w) = w^T \Sigma w - \alpha \left( w^T w - 1 \right). \tag{3.4}$$
Now, it is enough to take the derivative of $J(w)$ with respect to $w$ and set it to zero:
$$
\frac{\partial}{\partial w} J(w) = 0, \quad
\frac{\partial}{\partial w} \left( w^T \Sigma w - \alpha \left( w^T w - 1 \right) \right) = 0, \quad
2\Sigma w - 2\alpha w = 0, \quad
\Sigma w = \alpha w. \tag{3.5}
$$
The last expression in Eq. (3.5) is exactly the eigenvalue equation, in which $\alpha$ is an eigenvalue of the covariance matrix $\Sigma$ and $w$ is a corresponding eigenvector. If we multiply both sides of Eq. (3.5) from the left by $w^T$, we obtain:
$$w^T \Sigma w = w^T \alpha w, \quad \sigma_w^2 = \alpha\, w^T w, \quad \sigma_w^2 = \alpha. \tag{3.6}$$

The above equation shows that if we want to maximize the variance, it is enough to
choose the largest eigenvalue of the covariance matrix and consider the corresponding
eigenvector as the value of w. The largest eigenvalue of the covariance matrix is called λ1 ,
and the corresponding eigenvector is called w1 . w1 is the direction in which the projected
data has the highest variance, called the first principal component. Also, the value of the
variance of the data in the direction of w1 is equal to λ1 .
Now, we move on to achieve the second goal. That is, we want to minimize the recon­
struction error. The formula for mean squared error is written as follows:

$$
\begin{aligned}
\mathrm{MSE}(w) &= \frac{1}{n} \sum_{i=1}^{n} \left\| x_i - x_i' \right\|^2
= \frac{1}{n} \sum_{i=1}^{n} \left( x_i - x_i' \right)^T \left( x_i - x_i' \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \|x_i\|^2 - 2 x_i^T x_i' + \left( x_i' \right)^T x_i' \right).
\end{aligned} \tag{3.7}
$$

According to the projection formula in Eq. (3.2), we have $x_i' = \left( w^T x_i \right) w$. By substituting this into Eq. (3.7), we will have:
$$
\begin{aligned}
\mathrm{MSE}(w) &= \frac{1}{n} \sum_{i=1}^{n} \left( \|x_i\|^2 - 2 x_i^T \left( w^T x_i \right) w + \left( \left( w^T x_i \right) w \right)^T \left( w^T x_i \right) w \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \|x_i\|^2 - 2 \left( w^T x_i \right)\left( x_i^T w \right) + \left( w^T x_i \right)\left( x_i^T w \right)\left( w^T w \right) \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \|x_i\|^2 - w^T x_i x_i^T w \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \|x_i\|^2 - w^T \left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T \right) w
= \frac{1}{n} \sum_{i=1}^{n} \|x_i\|^2 - w^T \Sigma w.
\end{aligned} \tag{3.8}
$$

Since the data in $D$ are centered ($\mu = 0$), the variance of the original data is equal to:
$$\mathrm{var}(D) = \frac{1}{n} \sum_{i=1}^{n} \|x_i - 0\|^2 = \frac{1}{n} \sum_{i=1}^{n} \|x_i\|^2. \tag{3.9}$$
From Eq. (3.3), we also know that the variance of the projected data points is $\sigma_w^2 = w^T \Sigma w$. By inserting Eqs. (3.3) and (3.9) into Eq. (3.8), we will have:
$$\mathrm{MSE}(w) = \frac{1}{n} \sum_{i=1}^{n} \|x_i\|^2 - w^T \Sigma w = \mathrm{var}(D) - \sigma_w^2. \tag{3.10}$$
Since the original data points are fixed and do not change throughout the dimension reduction operation, their variance is also fixed; the value of $\mathrm{var}(D)$ is a particular, constant number. The important conclusion from the above equation is that if we maximize the variance of the projected points, the reconstruction error is automatically minimized. Therefore the two goals that we initially set for dimension reduction are equivalent to each other, and if one of them is achieved, the other will also be achieved. The minimum amount of reconstruction error is equal to:
$$\mathrm{MSE}(w) = \mathrm{var}(D) - \lambda_1, \tag{3.11}$$
where $\lambda_1$ is the largest eigenvalue of $\Sigma$.

3.2.2 Projection in two-dimensional space


We seek to reduce the dimensions of the data in $D$ to two dimensions. It is again assumed that the data in $D$ are centered ($\mu = 0$) and that the first principal component $w_1$ has already been calculated. To find the second direction $w_2$ that maximizes the variance of the projected data, we first write the variance formula in the direction of $w_2$:
$$\sigma_{w_2}^2 = w_2^T \Sigma w_2. \tag{3.12}$$
We need the vector $w_2$ to be a unit vector and orthogonal to the vector $w_1$. Hence:
$$w_2^T w_1 = 0, \qquad w_2^T w_2 = 1. \tag{3.13}$$
Now, to maximize the variance of the projected data in the $w_2$ direction, we must solve a constrained optimization problem. Using Lagrange multipliers, we turn this into an unconstrained optimization problem:
$$\max_{w_2} \; J(w_2) = w_2^T \Sigma w_2 - \alpha \left( w_2^T w_2 - 1 \right) - \beta \left( w_2^T w_1 - 0 \right). \tag{3.14}$$

We take the derivative of $J(w_2)$ with respect to $w_2$ and set the result equal to zero:
$$2\Sigma w_2 - 2\alpha w_2 - \beta w_1 = 0. \tag{3.15}$$
If we multiply both sides of Eq. (3.15) from the left by $w_1^T$, we will have:
$$2 w_1^T \Sigma w_2 - 2\alpha w_1^T w_2 - \beta w_1^T w_1 = 0 \;\Rightarrow\; 2 w_2^T \Sigma w_1 - \beta = 0. \tag{3.16}$$
Since $w_1$ is an eigenvector of matrix $\Sigma$, we have:
$$\Sigma w_1 = \lambda_1 w_1. \tag{3.17}$$
By inserting this into Eq. (3.16), we will have:
$$\beta = 2 w_2^T \Sigma w_1 = 2\lambda_1 w_2^T w_1 = 0. \tag{3.18}$$

Thus we find that $\beta$ is equal to zero. Hence, we can update Eq. (3.15) as follows:
$$2\Sigma w_2 - 2\alpha w_2 - \beta w_1 = 0 \;\Rightarrow\; 2\Sigma w_2 - 2\alpha w_2 = 0 \;\Rightarrow\; \Sigma w_2 = \alpha w_2. \tag{3.19}$$
We multiply both sides of Eq. (3.19) from the left by $w_2^T$:
$$w_2^T \Sigma w_2 = w_2^T \alpha w_2 \;\Rightarrow\; \sigma_{w_2}^2 = \alpha\, w_2^T w_2 = \alpha. \tag{3.20}$$

If we want to have the largest variance in the $w_2$ direction, we must choose the second largest eigenvalue of $\Sigma$. The corresponding eigenvector will be our second principal component.

3.2.3 Projection in r-dimensional space


We now seek to reduce the dimensions of the data in $D$ to $r$ dimensions. It is again assumed that the data in $D$ are centered ($\mu = 0$) and that the first $r - 1$ principal components have already been calculated. In other words, $\lambda_1, \lambda_2, \cdots, \lambda_{r-1}$ are the first $r - 1$ eigenvalues and $w_1, w_2, \cdots, w_{r-1}$ are the corresponding eigenvectors. To compute the vector of the $r$th direction, we require that $w_r$ is a unit vector ($\|w_r\|^2 = w_r^T w_r = 1$) and that $w_r$ is orthogonal to all previously calculated vectors $w_i$, i.e., $w_i^T w_r = 0$ for $1 \leq i \leq r - 1$. The variance of the projected data in the direction of $w_r$ is calculated from the following formula:
$$\sigma_{w_r}^2 = w_r^T \Sigma w_r. \tag{3.21}$$

Now, to maximize the variance of the projected data in the $w_r$ direction, we must solve a constrained optimization problem. Using Lagrange multipliers, we turn this into an unconstrained optimization problem:
$$\max_{w_r} \; J(w_r) = w_r^T \Sigma w_r - \alpha \left( w_r^T w_r - 1 \right) - \sum_{i=1}^{r-1} \beta_i \left( w_r^T w_i - 0 \right). \tag{3.22}$$
In Eq. (3.22), we take the derivative of $J(w_r)$ with respect to $w_r$ and set the result equal to zero:
$$2\Sigma w_r - 2\alpha w_r - \sum_{i=1}^{r-1} \beta_i w_i = 0. \tag{3.23}$$
If we multiply both sides of Eq. (3.23) from the left by $w_k^T$, for $1 \leq k \leq r - 1$ we will have:
$$2 w_k^T \Sigma w_r - 2\alpha w_k^T w_r - \sum_{i=1}^{r-1} \beta_i w_k^T w_i = 0. \tag{3.24}$$
Since $w_k^T w_r = 0$ and $w_k^T w_i = 0$ for $i \neq k$, while $w_k^T w_k = 1$, we can simplify the expression:
$$2 w_k^T \Sigma w_r - \beta_k w_k^T w_k - \sum_{\substack{i=1 \\ i \neq k}}^{r-1} \beta_i w_k^T w_i = 0 \;\Rightarrow\; 2 w_r^T \Sigma w_k - \beta_k = 0. \tag{3.25}$$
Since $w_k$ is an eigenvector of matrix $\Sigma$, we have:
$$\Sigma w_k = \lambda_k w_k. \tag{3.26}$$
By inserting this into Eq. (3.25), we have:
$$\beta_k = 2 w_r^T \Sigma w_k = 2 w_r^T \lambda_k w_k = 2\lambda_k w_r^T w_k = 0. \tag{3.27}$$

Thus we realize that $\beta_i$ is equal to zero for all $i < r$. Hence, we can update Eq. (3.23):
$$2\Sigma w_r - 2\alpha w_r - \sum_{i=1}^{r-1} \beta_i w_i = 0 \;\Rightarrow\; 2\Sigma w_r - 2\alpha w_r = 0 \;\Rightarrow\; \Sigma w_r = \alpha w_r. \tag{3.28}$$
We multiply both sides of Eq. (3.28) from the left by $w_r^T$:
$$w_r^T \Sigma w_r = w_r^T \alpha w_r \;\Rightarrow\; \sigma_{w_r}^2 = \alpha\, w_r^T w_r = \alpha. \tag{3.29}$$

If we want to have the largest variance in the $w_r$ direction, we must choose the $r$th largest eigenvalue of $\Sigma$. The corresponding eigenvector will be our $r$th principal component.
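Putting the derivation together, a projection onto the top $r$ eigenvectors of the covariance matrix can be sketched in a few lines of NumPy. The function below is an illustrative implementation under the conventions used above (rows are samples; the data are centered first); it is not the only possible realization.

```python
import numpy as np

def pca_project(X, r):
    """Project the rows of X onto the r leading principal components."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)           # covariance matrix (N - 1 denominator)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues in decreasing order
    W = eigvecs[:, order[:r]]                # top-r eigenvectors as columns
    return Xc @ W, eigvals[order[:r]]

X = np.random.default_rng(0).normal(size=(100, 5))
X_reduced, top_eigvals = pca_project(X, 2)
print(X_reduced.shape, top_eigvals)          # (100, 2) and the two largest variances
```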

3.2.4 Example
Suppose the following data table exists. Reduce the dimensions of the data to two dimen­
sions using the PCA algorithm:

row X1 X2 X3
1 3 4 0
2 1 3 2
3 6 5 1
4 2 4 5

• Step 1: First, we need to centralize the data. This means to make the average of each
feature equal to zero. Therefore we first calculate the average values of each feature:

$$\bar{X}_1 = \frac{1}{4}(3 + 1 + 6 + 2) = 3, \quad \bar{X}_2 = \frac{1}{4}(4 + 3 + 5 + 4) = 4, \quad \bar{X}_3 = \frac{1}{4}(0 + 2 + 1 + 5) = 2.$$

Now, from the data of each row, we subtract the mean of the feature it belongs to and form the centered data matrix:
$$D = \begin{bmatrix} 0 & 0 & -2 \\ -2 & -1 & 0 \\ 3 & 1 & -1 \\ -1 & 0 & 3 \end{bmatrix}.$$
We can also perform the centering operation with the left-centering matrix. First, we collect the dataset into a matrix:
$$A = \begin{bmatrix} 3 & 4 & 0 \\ 1 & 3 & 2 \\ 6 & 5 & 1 \\ 2 & 4 & 5 \end{bmatrix}.$$
Then, we form matrix $H_L$ using the formula
$$H_L = I - \frac{1}{n} \mathbf{1}\mathbf{1}^T. \tag{3.30}$$
In our example, with $n = 4$:
$$H_L = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} - \frac{1}{4} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix} = \begin{bmatrix} \frac{3}{4} & -\frac{1}{4} & -\frac{1}{4} & -\frac{1}{4} \\ -\frac{1}{4} & \frac{3}{4} & -\frac{1}{4} & -\frac{1}{4} \\ -\frac{1}{4} & -\frac{1}{4} & \frac{3}{4} & -\frac{1}{4} \\ -\frac{1}{4} & -\frac{1}{4} & -\frac{1}{4} & \frac{3}{4} \end{bmatrix}.$$
If we multiply $A$ from the left by $H_L$, we obtain:
$$D = H_L A = \begin{bmatrix} \frac{3}{4} & -\frac{1}{4} & -\frac{1}{4} & -\frac{1}{4} \\ -\frac{1}{4} & \frac{3}{4} & -\frac{1}{4} & -\frac{1}{4} \\ -\frac{1}{4} & -\frac{1}{4} & \frac{3}{4} & -\frac{1}{4} \\ -\frac{1}{4} & -\frac{1}{4} & -\frac{1}{4} & \frac{3}{4} \end{bmatrix} \begin{bmatrix} 3 & 4 & 0 \\ 1 & 3 & 2 \\ 6 & 5 & 1 \\ 2 & 4 & 5 \end{bmatrix} = \begin{bmatrix} 0 & 0 & -2 \\ -2 & -1 & 0 \\ 3 & 1 & -1 \\ -1 & 0 & 3 \end{bmatrix}.$$

• Step 2: At this stage, we must first calculate the pairwise covariance of the features.
A point that should be considered when calculating the covariance is that if we use the
entire population as data, we use N in the denominator. If the sample is used as data,
we put N − 1 in the denominator:

$$
\begin{aligned}
\mathrm{Cov}(X_1, X_1) &= \frac{1}{N-1} \sum_{k=1}^{N} x_{k1}^2 = \frac{1}{3}\left( 0^2 + (-2)^2 + 3^2 + (-1)^2 \right) = \frac{14}{3}, \\
\mathrm{Cov}(X_1, X_2) &= \frac{1}{N-1} \sum_{k=1}^{N} x_{k1}\, x_{k2} = \frac{1}{3}\left[ (0)(0) + (-2)(-1) + (3)(1) + (-1)(0) \right] = \frac{5}{3}, \\
\mathrm{Cov}(X_1, X_3) &= \frac{1}{N-1} \sum_{k=1}^{N} x_{k1}\, x_{k3} = \frac{1}{3}\left[ (0)(-2) + (-1)(0) + (1)(-1) + (0)(3) \right] = -\frac{1}{3}, \\
\mathrm{Cov}(X_2, X_2) &= \frac{1}{N-1} \sum_{k=1}^{N} x_{k2}^2 = \frac{1}{3}\left( 0^2 + (-1)^2 + 1^2 + 0^2 \right) = \frac{2}{3}, \\
\mathrm{Cov}(X_2, X_3) &= \frac{1}{N-1} \sum_{k=1}^{N} x_{k2}\, x_{k3} = \frac{1}{3}\left[ (0)(-2) + (-1)(0) + (1)(-1) + (0)(3) \right] = \frac{1}{3}, \\
\mathrm{Cov}(X_3, X_3) &= \frac{1}{N-1} \sum_{k=1}^{N} x_{k3}^2 = \frac{1}{3}\left( (-2)^2 + 0^2 + (-1)^2 + 3^2 \right) = \frac{14}{3}.
\end{aligned}
$$
Now, we form the covariance matrix of the data:
$$\Sigma = \begin{bmatrix} \frac{14}{3} & \frac{5}{3} & -\frac{1}{3} \\ \frac{5}{3} & \frac{2}{3} & \frac{1}{3} \\ -\frac{1}{3} & \frac{1}{3} & \frac{14}{3} \end{bmatrix}.$$

• Step 3: In this step, we find the eigenvalues of the covariance matrix $\Sigma$. The characteristic equation of the covariance matrix is:
$$\det(\Sigma - \lambda I) = 0, \qquad
\begin{vmatrix} \frac{14}{3} - \lambda & \frac{5}{3} & -\frac{1}{3} \\ \frac{5}{3} & \frac{2}{3} - \lambda & \frac{1}{3} \\ -\frac{1}{3} & \frac{1}{3} & \frac{14}{3} - \lambda \end{vmatrix} = 0.$$
For convenience of calculation, we use a property of determinants and factor $\frac{1}{3}$ out of each row, which multiplies the determinant by $\frac{1}{27}$:
$$\frac{1}{27} \begin{vmatrix} 14 - 3\lambda & 5 & -1 \\ 5 & 2 - 3\lambda & 1 \\ -1 & 1 & 14 - 3\lambda \end{vmatrix} = 0
\;\Rightarrow\;
\begin{vmatrix} 14 - 3\lambda & 5 & -1 \\ 5 & 2 - 3\lambda & 1 \\ -1 & 1 & 14 - 3\lambda \end{vmatrix} = 0.$$
Now, we expand the determinant of the above matrix:
$$(14 - 3\lambda)\left[ (2 - 3\lambda)(14 - 3\lambda) - 1 \right] - 5\left[ 5(14 - 3\lambda) + 1 \right] - 1\left[ 5 - (2 - 3\lambda)(-1) \right] = 0.$$

By simplifying the above expression, we will have:
$$-27\lambda^3 + 270\lambda^2 - 675\lambda + 16 = 0.$$
The eigenvalues we were looking for are the roots of the above equation, whose values are:
$$\lambda_1 = \frac{16}{3}, \qquad \lambda_2 = \frac{7}{3} + \frac{4}{\sqrt{3}}, \qquad \lambda_3 = \frac{7}{3} - \frac{4}{\sqrt{3}}.$$
• Step 4: To reduce the dimensions of the original data to two dimensions, we only need the eigenvectors corresponding to the two largest eigenvalues; however, in this example we calculate all three eigenvectors. For $\lambda_1 = \frac{16}{3}$ we solve $(\Sigma - \lambda_1 I)\, v_1 = 0$:
$$\begin{bmatrix} -\frac{2}{3} & \frac{5}{3} & -\frac{1}{3} \\ \frac{5}{3} & -\frac{14}{3} & \frac{1}{3} \\ -\frac{1}{3} & \frac{1}{3} & -\frac{2}{3} \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}.$$
Now, we form the following system of equations:
$$\begin{cases} -\frac{2}{3} x + \frac{5}{3} y - \frac{1}{3} z = 0 \\ \frac{5}{3} x - \frac{14}{3} y + \frac{1}{3} z = 0 \\ -\frac{1}{3} x + \frac{1}{3} y - \frac{2}{3} z = 0 \end{cases}$$
For greater simplicity, we multiply both sides of all equations by three:
$$\begin{cases} -2x + 5y - z = 0 \\ 5x - 14y + z = 0 \\ -x + y - 2z = 0 \end{cases}$$
After solving the above system, the vector $v_1$ is obtained (up to scale) as:
$$v_1 = \begin{bmatrix} -3 \\ -1 \\ 1 \end{bmatrix}.$$

The two eigenvectors $v_2$ and $v_3$ are calculated similarly and are equal (up to scale) to:
$$v_2 = \begin{bmatrix} 2 - \sqrt{3} \\ -5 + 3\sqrt{3} \\ 1 \end{bmatrix}, \qquad v_3 = \begin{bmatrix} 2 + \sqrt{3} \\ -5 - 3\sqrt{3} \\ 1 \end{bmatrix}.$$

• Step 5: To calculate the principal components, it is enough to normalize the eigenvectors obtained in the previous step:
$$e_1 = \frac{v_1}{\|v_1\|} = \begin{bmatrix} -0.9045 \\ -0.3015 \\ 0.3015 \end{bmatrix}, \quad
e_2 = \frac{v_2}{\|v_2\|} = \begin{bmatrix} 0.2543 \\ 0.1862 \\ 0.9490 \end{bmatrix}, \quad
e_3 = \frac{v_3}{\|v_3\|} = \begin{bmatrix} 0.3423 \\ -0.9351 \\ 0.0917 \end{bmatrix}.$$

• Step 6: In this step, we select the principal components with as many dimensions as we
want in the destination space and place them in a matrix. In this example, as we want to
reduce the dimensions of the data to two dimensions, we choose the first two principal
components:
\[
P = \begin{bmatrix}
-0.9045 & 0.2543 \\
-0.3015 & 0.1862 \\
0.3015 & 0.9490
\end{bmatrix}.
\]

Now, it is enough to multiply the centered data matrix D by the matrix P:

\[
A = DP = \begin{bmatrix}
0 & 0 & -2 \\
-2 & -1 & 0 \\
3 & 1 & -1 \\
-1 & 0 & 3
\end{bmatrix}
\begin{bmatrix}
-0.9045 & 0.2543 \\
-0.3015 & 0.1862 \\
0.3015 & 0.9490
\end{bmatrix}
= \begin{bmatrix}
-0.603 & -1.898 \\
2.1105 & -0.6948 \\
-3.3165 & 0.0001 \\
1.809 & 2.5927
\end{bmatrix}.
\]

As can be seen, the matrix A represents the data mapped in the two-dimensional space
that we formed by the principal components. Therefore we were able to reduce the di­
mensions of the original data from three dimensions to two dimensions.
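For readers who want to check these numbers, a minimal NumPy sketch is given below. It starts from the covariance matrix Σ obtained in Step 2 and the centered data matrix D used in Step 6 (both copied from the text); eigenvectors are only determined up to sign, so some columns of the projected data may come out negated.

import numpy as np

# Covariance matrix from Step 2 and centered data matrix D from Step 6
cov = np.array([[14, 5, -1],
                [ 5, 2,  1],
                [-1, 1, 14]]) / 3.0
D = np.array([[ 0,  0, -2],
              [-2, -1,  0],
              [ 3,  1, -1],
              [-1,  0,  3]], dtype=float)

# Eigendecomposition (np.linalg.eigh returns eigenvalues in ascending order)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort descending
print(eigvals[order])                      # approx. [5.333, 4.643, 0.024]

# Matrix of the two leading principal components, then project the data
P = eigvecs[:, order[:2]]
print(D @ P)                               # matches the matrix A above, up to column signs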

3.2.5 Additional discussion about PCA


Variants of PCA extend the original algorithm to address its limitations. For instance, Ker­
nel PCA handles nonlinear relationships through kernel functions, while Incremental PCA
permits processing large datasets in batches. Sparse PCA introduces sparsity constraints
for more interpretable results, and Robust PCA [4] handles outliers using robust estimators
like median absolute deviation (MAD). These variants offer solutions in scenarios where
traditional PCA may not suffice.
Preprocessing and scaling play crucial roles in PCA. Proper data preprocessing includes
handling missing values, categorical variables, and scaling features to ensure similar mag­
nitudes. PCA’s sensitivity to feature variances underscores the importance of proper scal­
ing; features with larger variances can dominate principal component computations if not
appropriately scaled.
While PCA reduces data dimensionality, interpreting resulting principal components
may pose challenges, especially in complex datasets. Combining PCA with other tech­
niques can enhance effectiveness or address limitations. For instance, it can precede ma­
chine learning algorithms as a preprocessing step or be paired with dimensionality reduc­
tion techniques like t-SNE or UMAP for comprehensive data analysis.
PCA provides insights into variance explained by each principal component, aiding un­
derstanding of their contribution to overall data variability. Scree plots, displaying eigen­
values or variances of principal components, help determine optimal component reten­
tion, facilitating informed decisions on dimensionality reduction.
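As a small illustration of the scree-plot idea mentioned above, the following sketch (using scikit-learn and Matplotlib on an assumed synthetic dataset) plots the explained-variance ratio of each component together with its cumulative sum; the "elbow" of the curve is a common heuristic for choosing how many components to retain.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Assumed synthetic data: 200 samples with an approximate 3-dimensional structure
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_

components = np.arange(1, len(ratios) + 1)
plt.plot(components, ratios, "o-", label="per component")
plt.plot(components, np.cumsum(ratios), "s--", label="cumulative")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.title("Scree plot")
plt.legend()
plt.show()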

3.3 Implementation
3.3.1 How to implement PCA algorithm in Python?
Principal Component Analysis (PCA) is a popular technique for dimensionality reduction
in machine learning and data analysis. It is commonly used to reduce the dimensionality of
a dataset with many features while retaining the most important information. In Python,
you can quickly implement PCA using the PCA module from the sklearn.decomposition
module in scikit-learn, which is a popular machine learning library. The PCA module pro­
vides an easy-to-use implementation of PCA with various options and functionalities.

3.3.2 Parameter options


• n_components: This parameter specifies the number of components (or principal components) you want to reduce the data to. It can be an integer value or None. If n_components is not specified, it defaults to None, which means that all the components will be kept (i.e., the data will not be reduced).
• svd_solver: This parameter specifies the algorithm for singular value decomposition (SVD), which is the method used by PCA to compute the principal components. It can take one of the following values: ‘auto’, ‘full’, ‘arpack’, ‘randomized’.

• whiten: This parameter controls whether or not to whiten the data. Whitening is a pre­
processing step that scales the principal components to have unit variances, which
can be helpful in some cases. If whitening is set to True, the transformed data will be
whitened.
• random_state: This parameter controls the random number generator used for ran­
domized SVD solver, if applicable. It can take an integer value or a NumPy RandomState
object.
• copy: This parameter controls whether or not to make a copy of the input data. If copy
is set to True, the input data will be copied before performing PCA, which can be helpful
if you want to keep the original data unchanged.
• iterated_power: This parameter controls the number of iterations for the power
method used in randomized SVD solver.
• tol: This parameter specifies the tolerance for convergence of the SVD computation. It
is a small positive number, and the default value is 0.0.

3.3.3 Attribute options


• components_: This attribute contains the principal components (or eigenvectors) of
the input data, arranged as a matrix of shape (n_components, n_features). Each row of
the matrix represents a principal component, and the values in the row represent the
coefficients of the original features in that component.
• explained_variance_ratio_: This attribute contains the proportion of the total variance in the input data explained by each principal component. It is an array of shape (n_components,), where n_components is the number of components specified during PCA. The values in the array represent the percentage of the total variance explained by each component, sorted in descending order.
• explained_variance_: This attribute contains the variance explained by each principal component. It is an array of shape (n_components,), where n_components is the number of components specified during PCA. The values in the array represent the variance explained by each component, sorted in descending order.
• mean_: This attribute contains the input data’s mean (or average) along each feature axis. It is an array of shape (n_features,), where n_features is the number of features in the input data. The values in the array represent the mean value of each feature.
• n_components_: This attribute contains the number of components used for PCA. It can be helpful if you specify n_components as ‘mle’ (maximum likelihood estimation) during PCA.
• singular_values_: This attribute contains the singular values, which are the square roots of the eigenvalues of the covariance matrix of the input data. The singular values represent the amount of variance explained by each principal component. It is an array of shape (n_components,), where n_components is the number of components specified during PCA.

3.3.3.1 Example of implementing PCA


1. Import the necessary libraries:
You must import the necessary libraries, including NumPy for numerical computations
and sklearn.decomposition for PCA implementation.

import numpy as np
from sklearn.decomposition import PCA

2. Load or prepare your dataset:


You need to load or prepare your dataset, which should be a numerical dataset with
multiple features. Ensure the dataset is correctly formatted as a NumPy array or a Pan­
das DataFrame.

# Generate some example data


X = np.random.rand(100, 5)

3. Create a PCA instance:


You must create an instance of the PCA class representing the PCA model. You can spec­
ify the desired number of components, solver, and other options as parameters.

# Create PCA object


pca = PCA(n_components=2)

4. Fit the PCA model:


You need to fit the PCA model to the data using the fit() or fit_transform() method. This computes the PCA model’s principal components and other attributes based on the input data; fit_transform() additionally returns the data projected onto the selected components.

# Fit and transform the data


X_pca = pca.fit_transform(X)

5. Access the results:


Once the PCA model is fitted, you can access attributes such as the principal compo­
nents, explained variance ratio, and other information. These attributes can be useful
for further analysis and visualization of the reduced-dimensional data.

# X_pca now contains the reduced-dimensional data

# Access the principal components (eigenvectors)


principal_components = pca.components_

# Access the explained variance ratio


explained_variance_ratio = pca.explained_variance_ratio_

3.4 Advantages and limitations


Principal Component Analysis is a popular dimensionality reduction technique with sev­
eral advantages and limitations. Some of the advantages of PCA as a dimensionality tech­
nique include the following:
• Dimensionality reduction: PCA reduces the dimensionality of the dataset by transform­
ing the original features into a lower-dimensional space while retaining the essential
information. This can be useful for reducing computational complexity, memory us­
age, and visualization of high-dimensional data.
• Feature extraction: PCA identifies the principal components that capture the maximum
variance in the data, which often correspond to the most important underlying patterns
or structures in the data. These principal components can be used as new features for
machine learning models, potentially improving their performance.
• Data visualization: PCA can visualize high-dimensional data in lower-dimensional
space, such as 2D or 3D plots, which can help understand the data’s structure and rela­
tionships.
• Noise reduction: PCA can help reduce the impact of noise or irrelevant features in the
data by projecting the data onto a lower-dimensional space that captures the essential
information.
Some of the limitations of PCA as a dimensionality technique include the following:
• Loss of interpretability: PCA creates new features (principal components) that are lin­
ear combinations of the original features, which may not have direct interpretability
or meaningful physical units. This can make it challenging to interpret the results and
understand the underlying patterns in the data.
• Assumes linearity: PCA assumes that the underlying relationships in the data are linear,
which may not be true for all datasets. If the data has complex nonlinear relationships,
PCA may not be as effective in capturing the essential patterns in the data.
• Information loss: PCA projects the data onto a lower-dimensional space, which may
result in some loss of information. The reduced-dimensional data may not fully capture
all the details of the original data, which can affect the performance of downstream
analysis or modeling.

• Sensitivity to outliers: PCA can be sensitive to outliers in the data, as outliers can
strongly influence the computation of the principal components and their corresponding variances. Outliers can result in skewed or distorted principal components, leading
to potentially misleading results.
• Computational complexity: The computation of PCA involves matrix operations, which
can be computationally expensive for large datasets. Additionally, some variants of
PCA, such as kernel PCA, can be more computationally intensive.
• Assumes normality: PCA assumes that the data follows a multivariate normal distribu­
tion, which may not be true for all datasets. If the data has non-normal distributions,
PCA results may be less reliable.
• Determination of the number of components: The optimal number of components to
retain in PCA is unclear and may require subjective judgment or additional analysis,
such as scree plots or cross-validation.
It is essential to consider these advantages and limitations of PCA when applying it to a specific dataset or problem. PCA may not always be the best choice, so its suitability should be assessed carefully based on the data’s characteristics and the specific objectives of the analysis.

3.5 Unveiling hidden dimensions in data


In the realm of data analysis, uncovering meaningful insights from complex datasets often
requires transcending the boundaries of linear relationships. Traditional methods, while
effective, can fall short when confronted with intricate, nonlinear patterns that lie beneath
the surface. Enter Kernel Principal Component Analysis (Kernel PCA) [7], a powerful tech­
nique designed to address this limitation and reveal the latent structure within data that
linear approaches might miss.

3.5.1 The need for kernel PCA


Principal Component Analysis (PCA) has long been a cornerstone of dimensionality reduc­
tion and data visualization. It provides a means to capture the most prominent variations
within data by identifying orthogonal axes of maximum variance. However, PCA’s effective­
ness hinges on the assumption of linearity—the belief that relationships between variables
are best described by straight lines or planes. This assumption, while often valid, fails to
encompass the richness of nonlinear data interactions prevalent in many real-world sce­
narios.
Kernel PCA steps into this void, acknowledging that data often possess intricate rela­
tionships that defy linearity. Imagine scenarios where data points form intricate clusters
or spiral patterns. Attempting to fit such data into a linear framework might result in the
loss of valuable information. Kernel PCA transcends these constraints by leveraging the
concept of kernels, allowing it to tap into the nonlinear nature of data relationships and
reveal hidden dimensions that remain unexplored by linear techniques.

3.5.2 Discovering nonlinear relationships


In essence, Kernel PCA operates on the principle of transforming data into a higher­
dimensional space where nonlinear relationships become more apparent. This transfor­
mation is achieved using kernel functions, which compute the similarity or inner product
between pairs of data points in this new space. The key distinction is that the trans­
formation is performed implicitly—data is not explicitly transformed into the higher­
dimensional space. This allows Kernel PCA to remain computationally feasible, even when
the dimensionality of the transformed space is significantly larger.
Kernel functions come in various forms, each suitable for different data characteristics.
The Gaussian (Radial Basis Function) kernel, polynomial kernel, and sigmoid kernel are a
few examples, each tailored to specific data scenarios. This flexibility empowers analysts
to adapt the technique to the unique nuances of their data, making Kernel PCA a versatile
tool across various domains.

3.5.3 Dimensionality reduction and disclosing hidden dimensions


The ultimate goal of Kernel PCA is to extract the principal components in this higher­
dimensional space. These principal components represent the axes along which the data
exhibits the most variance, akin to traditional PCA. However, in this context, they capture
nonlinear variations that linear PCA would overlook.
Moreover, Kernel PCA facilitates dimensionality reduction while preserving essential information. By projecting data points onto a reduced set of principal components, it creates a lower-dimensional representation that retains the critical patterns embedded within the data. This is particularly valuable when dealing with high-dimensional datasets where simplification is necessary for visualization, analysis, or downstream tasks.

3.6 The Kernel PCA algorithm


Kernel Principal Component Analysis (Kernel PCA) is a dimensionality reduction tech­
nique designed to capture complex and nonlinear relationships within data. It builds upon
the traditional Principal Component Analysis (PCA) by transforming the data into a higher­
dimensional feature space using kernel functions. This transformation allows Kernel PCA
to reveal patterns and structures that standard linear PCA might miss.

3.6.1 Data preprocessing


The first step involves centering the data to ensure that its mean is zero along each dimen­
sion. This is achieved by subtracting the mean of each feature from the corresponding data
points.

3.6.2 Kernel selection


Kernel Principal Component Analysis (Kernel PCA) hinges on the selection of appropriate
kernel functions that map data points into higher-dimensional spaces where nonlinear re­
lationships are more visible. The choice of the kernel is crucial, as it significantly influences
the accuracy and efficacy of Kernel PCA in revealing hidden data structures. Here are some
of the kernels that can be used in Kernel PCA:
• Gaussian (Radial Basis Function) kernel: Suitable for local patterns and clusters. It captures proximity-based relationships, emphasizing nearby points and their influence on each other. The formula of this kernel is:

\[
K(x, y) = \exp\!\left(-\frac{\lVert x - y \rVert^{2}}{2\sigma^{2}}\right). \qquad (3.31)
\]

• Polynomial kernel: Effective for data with polynomial relationships. The parameters c and d control the offset and the degree of the polynomial, respectively. The formula of this kernel is:

\[
K(x, y) = (x^{T} y + c)^{d}. \qquad (3.32)
\]

• Sigmoid kernel: Captures sigmoid-shaped relationships. The parameters α and c determine the slope and intercept of the sigmoid curve. The formula of this kernel is:

\[
K(x, y) = \tanh(\alpha\, x^{T} y + c). \qquad (3.33)
\]

• Linear kernel: While linear, it is valuable for comparing the performance of nonlinear kernels. It is akin to the standard dot product in traditional PCA. The formula of this kernel is:

\[
K(x, y) = x^{T} y. \qquad (3.34)
\]

Choosing the right kernel is both an art and a science, involving a blend of domain
knowledge, experimentation, and understanding the data’s characteristics. A comparison
between different kernel functions is provided in [8,9,18]. Also, a Python implementation
of more complex kernel functions can be found in [10,18].
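As a minimal sketch (not the implementations referenced in [10,18]), the four kernels listed above can be written directly in NumPy; the hyperparameters sigma, c, d, and alpha below are user choices, not values prescribed by the text.

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)), Eq. (3.31)
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def polynomial_kernel(x, y, c=1.0, d=2):
    # K(x, y) = (x^T y + c)^d, Eq. (3.32)
    return (np.dot(x, y) + c) ** d

def sigmoid_kernel(x, y, alpha=0.01, c=0.0):
    # K(x, y) = tanh(alpha * x^T y + c), Eq. (3.33)
    return np.tanh(alpha * np.dot(x, y) + c)

def linear_kernel(x, y):
    # K(x, y) = x^T y, Eq. (3.34)
    return np.dot(x, y)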

3.6.3 Kernel matrix calculation


The kernel matrix calculation is a pivotal step in Kernel Principal Component Analysis
(Kernel PCA). This matrix captures the pairwise similarities or inner products between
data points after they have been mapped to a higher-dimensional space using a chosen
kernel function. The kernel matrix forms the foundation for subsequent computations, al­
lowing Kernel PCA to unveil complex relationships within data that traditional PCA might
overlook.
The kernel matrix encapsulates how data points interact with each other in the trans­
formed space, emphasizing their underlying similarities or dissimilarities. By quantifying

these relationships, the kernel matrix provides a comprehensive view of the data’s struc­
ture in the higher-dimensional domain.
The kernel matrix K is calculated using a chosen kernel function K (x, y). The kernel
function measures the similarity or inner product between pairs of data points in the orig­
inal space. The kernel matrix is symmetric and captures the pairwise similarities between
all data points:
K(xi , xj ) = K(xj , xi ). (3.35)

Here, K represents the kernel function, xj , xi are data points from the input space, and
K(xi , xj ) computes the similarity in the transformed space.
The choice of the kernel function is crucial, as it influences how data points are transformed and how their relationships are quantified in the higher-dimensional space. Common kernel functions include the Gaussian (Radial Basis Function) kernel, polynomial
kernel, and sigmoid kernel. The selection depends on the data’s characteristics and the
type of patterns expected.
The kernel matrix is symmetric, reflecting the fact that K(xi , xj ) is the same as K(xj , xi ).
Additionally, the diagonal elements K(xi , xi ) represent the self-similarity of data points and
are typically positive.
Creating the kernel matrix requires computing the kernel function for each pair of data
points, resulting in a matrix of size n × n for n data points. This can be computationally ex­
pensive, particularly for large datasets. To address this, techniques like the ``kernel trick''
are employed, allowing the computation to be carried out implicitly without explicitly
transforming the data.

3.6.3.1 Example
In this example, we consider three data points and derive the resulting kernel matrix from
the homogeneous quadratic kernel:

row f1 f2
X1 −3 2
X2 0 1
X3 2 4

In the first step, we should calculate the elements of the kernel matrix:

\[
\begin{aligned}
K(X_1, X_1) &= (X_1^{T} X_1)^{2} = \big((-3)(-3) + (2)(2)\big)^{2} = 169, \\
K(X_1, X_2) &= K(X_2, X_1) = (X_1^{T} X_2)^{2} = \big((-3)(0) + (2)(1)\big)^{2} = 4, \\
K(X_1, X_3) &= K(X_3, X_1) = (X_1^{T} X_3)^{2} = \big((-3)(2) + (2)(4)\big)^{2} = 4, \\
K(X_2, X_2) &= (X_2^{T} X_2)^{2} = \big((0)(0) + (1)(1)\big)^{2} = 1, \\
K(X_2, X_3) &= K(X_3, X_2) = (X_2^{T} X_3)^{2} = \big((0)(2) + (1)(4)\big)^{2} = 16, \\
K(X_3, X_3) &= (X_3^{T} X_3)^{2} = \big((2)(2) + (4)(4)\big)^{2} = 400.
\end{aligned}
\]

Hence, the kernel matrix is as follows:

\[
K = \begin{bmatrix}
K(X_1, X_1) & K(X_1, X_2) & K(X_1, X_3) \\
K(X_2, X_1) & K(X_2, X_2) & K(X_2, X_3) \\
K(X_3, X_1) & K(X_3, X_2) & K(X_3, X_3)
\end{bmatrix}
= \begin{bmatrix}
169 & 4 & 4 \\
4 & 1 & 16 \\
4 & 16 & 400
\end{bmatrix}.
\]
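The kernel matrix above can be checked with a few lines of NumPy: the homogeneous quadratic kernel is simply \((x^{T}y)^{2}\) evaluated for every pair of points.

import numpy as np

# The three data points of the example, one per row
X = np.array([[-3, 2],
              [ 0, 1],
              [ 2, 4]], dtype=float)

# Homogeneous quadratic kernel applied to all pairs: K[i, j] = (x_i^T x_j)^2
K = (X @ X.T) ** 2
print(K)   # [[169.   4.   4.]
           #  [  4.   1.  16.]
           #  [  4.  16. 400.]]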

3.6.4 Centering data points in feature space


In the derivation above, we assumed that the transformed data points are centered in the feature space, yet we only ever used the kernel matrix generated from these points. It turns out that we can obtain this centered kernel matrix without access to the explicit feature-space data, using the following procedure. As with ordinary centering, we first need the mean of the data points. The mean of the points in the feature space is given as:

\[
\mu_{\phi} = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i). \qquad (3.36)
\]

We cannot explicitly compute the mean point in feature space since we do not have
access to φ(xi ). Nonetheless, we can compute the squared norm of the mean as follows:

\[
\lVert \mu_{\phi} \rVert^{2} = \mu_{\phi}^{T}\mu_{\phi}
= \left(\frac{1}{n}\sum_{i=1}^{n}\phi(x_i)\right)^{T}\!\left(\frac{1}{n}\sum_{j=1}^{n}\phi(x_j)\right)
= \frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\phi(x_i)^{T}\phi(x_j)
= \frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}K(x_i, x_j). \qquad (3.37)
\]

According to the preceding derivation, the squared norm of the mean in feature space
is just the average of the values in the kernel matrix K. We can center each point in feature
space by subtracting the mean from it, as follows:

φ̂(xi ) = φ(xi ) − μφ . (3.38)

The centered kernel matrix, that is the kernel matrix over centered points in feature
space, is given as:

K̂ = K̂(xi , xj ), (3.39)

where each cell corresponds to the kernel between centered points, that is:

\[
\begin{aligned}
\hat{K}(x_i, x_j) &= \hat{\phi}(x_i)^{T}\hat{\phi}(x_j) \\
&= \big(\phi(x_i) - \mu_{\phi}\big)^{T}\big(\phi(x_j) - \mu_{\phi}\big) \\
&= \phi(x_i)^{T}\phi(x_j) - \phi(x_i)^{T}\mu_{\phi} - \mu_{\phi}^{T}\phi(x_j) + \mu_{\phi}^{T}\mu_{\phi} \\
&= K(x_i, x_j) - \frac{1}{n}\sum_{k=1}^{n}\phi(x_i)^{T}\phi(x_k) - \frac{1}{n}\sum_{k=1}^{n}\phi(x_k)^{T}\phi(x_j) + \lVert\mu_{\phi}\rVert^{2} \\
&= K(x_i, x_j) - \frac{1}{n}\sum_{k=1}^{n}K(x_i, x_k) - \frac{1}{n}\sum_{k=1}^{n}K(x_k, x_j) + \frac{1}{n^{2}}\sum_{a=1}^{n}\sum_{b=1}^{n}K(x_a, x_b).
\end{aligned} \qquad (3.40)
\]

In other words, using only the kernel function, we can construct the centered kernel
matrix. The centered kernel matrix can be compactly expressed as follows across all pair­
wise pairs of points:

\[
\begin{aligned}
\hat{K} &= K - \frac{1}{n}\mathbf{1}_{n\times n}K - \frac{1}{n}K\mathbf{1}_{n\times n} + \frac{1}{n^{2}}\mathbf{1}_{n\times n}K\mathbf{1}_{n\times n} \\
&= \Big(I - \frac{1}{n}\mathbf{1}_{n\times n}\Big)\,K\,\Big(I - \frac{1}{n}\mathbf{1}_{n\times n}\Big),
\end{aligned} \qquad (3.41)
\]

where \(\mathbf{1}_{n\times n}\) is the n × n matrix with all entries equal to 1.
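Eq. (3.41) translates directly into code; a minimal sketch, assuming a precomputed kernel matrix K, is:

import numpy as np

def center_kernel_matrix(K):
    # K_hat = (I - 1/n * 1_{n x n}) K (I - 1/n * 1_{n x n}), as in Eq. (3.41)
    n = K.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    return C @ K @ C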

3.6.5 Example
To follow Example 3.6.3.1, we perform Kernel PCA over these data points. First, we need to
find the centered kernel matrix:
\[
\hat{K} = \Big(I - \frac{1}{n}\mathbf{1}_{n\times n}\Big)K\Big(I - \frac{1}{n}\mathbf{1}_{n\times n}\Big)
= \begin{bmatrix}
\frac{2}{3} & -\frac{1}{3} & -\frac{1}{3} \\
-\frac{1}{3} & \frac{2}{3} & -\frac{1}{3} \\
-\frac{1}{3} & -\frac{1}{3} & \frac{2}{3}
\end{bmatrix}
\begin{bmatrix}
169 & 4 & 4 \\
4 & 1 & 16 \\
4 & 16 & 400
\end{bmatrix}
\begin{bmatrix}
\frac{2}{3} & -\frac{1}{3} & -\frac{1}{3} \\
-\frac{1}{3} & \frac{2}{3} & -\frac{1}{3} \\
-\frac{1}{3} & -\frac{1}{3} & \frac{2}{3}
\end{bmatrix}
= \begin{bmatrix}
\frac{359}{3} & \frac{20}{3} & -\frac{379}{3} \\
\frac{20}{3} & \frac{167}{3} & -\frac{187}{3} \\
-\frac{379}{3} & -\frac{187}{3} & \frac{566}{3}
\end{bmatrix}.
\]

In the next step, we find the eigenvalues and eigenvectors, and we normalize the eigenvectors corresponding to the non-zero eigenvalues. For a more detailed view of how to normalize the eigenvectors, see [1]:

\[
\lambda_1 = 297.209, \quad a_1 = \begin{bmatrix} -0.571 \\ -0.220 \\ 0.791 \end{bmatrix}
\;\Rightarrow\;
v_1 = \frac{1}{\sqrt{\lambda_1}}\, a_1 = \frac{1}{17.240}\, a_1 = \begin{bmatrix} -0.033 \\ -0.013 \\ 0.046 \end{bmatrix},
\]
\[
\lambda_2 = 66.791, \quad a_2 = \begin{bmatrix} -0.584 \\ -0.786 \\ 0.203 \end{bmatrix}
\;\Rightarrow\;
v_2 = \frac{1}{\sqrt{\lambda_2}}\, a_2 = \frac{1}{8.173}\, a_2 = \begin{bmatrix} -0.071 \\ -0.096 \\ 0.025 \end{bmatrix}.
\]

Now, it is time to project all the data onto the target space. Since the original data in this example have 2 dimensions, we choose the eigenvector that corresponds to the largest eigenvalue of \(\hat{K}\) and project the data onto this eigenvector with this formula:

\[
x_{\mathrm{new}} = \hat{K} v_1.
\]

The numerical calculation on the example’s dataset is as follows:

\[
\begin{bmatrix}
\frac{359}{3} & \frac{20}{3} & -\frac{379}{3} \\
\frac{20}{3} & \frac{167}{3} & -\frac{187}{3} \\
-\frac{379}{3} & -\frac{187}{3} & \frac{566}{3}
\end{bmatrix}
\begin{bmatrix} -0.033 \\ -0.013 \\ 0.046 \end{bmatrix}
= \begin{bmatrix} -9.845 \\ -3.790 \\ 13.635 \end{bmatrix}.
\]

Hence our projected dataset is:

row f1
X1 −9.845
X2 −3.790
X3 13.635
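The whole worked example can be reproduced with a short NumPy sketch (a verification aid, not a general Kernel PCA implementation); since eigenvectors are only determined up to sign, the projected values may come out with the opposite sign.

import numpy as np

X = np.array([[-3, 2], [0, 1], [2, 4]], dtype=float)
n = X.shape[0]

# Homogeneous quadratic kernel matrix and its centered version, Eq. (3.41)
K = (X @ X.T) ** 2
C = np.eye(n) - np.ones((n, n)) / n
K_hat = C @ K @ C

# Eigendecomposition of the centered kernel matrix (eigenvalues in ascending order)
eigvals, eigvecs = np.linalg.eigh(K_hat)
lam1, a1 = eigvals[-1], eigvecs[:, -1]     # largest eigenvalue and its unit eigenvector

# Scale the eigenvector by 1/sqrt(lambda_1) and project the data
v1 = a1 / np.sqrt(lam1)
print(lam1)           # approx. 297.2
print(K_hat @ v1)     # approx. [-9.85, -3.79, 13.64], possibly with flipped signs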

3.7 Implementation of Kernel PCA


1. Import the necessary libraries:
Import necessary libraries, including NumPy for numerical operations, Matplotlib for
plotting, and scikit-learn for KernelPCA and dataset generation.

# Step 1: Import necessary libraries


import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

2. Load or prepare your dataset:


Now, you should load your dataset. In this example, we generate synthetic data using the make_swiss_roll function from scikit-learn, which creates a 3D Swiss roll dataset.

# Step 2: Generate synthetic data (3D Swiss roll)

X, color = make_swiss_roll(n_samples=1000, random_state=0)

3. Create a KernelPCA instance


Create a ‘KernelPCA’ instance. Specify the kernel (in this case, ‘rbf’ for Gaussian), the
gamma parameter, and the number of components to project onto (in this example, 2).

# Step 3: Create a KernelPCA instance


kernel_pca = KernelPCA(kernel='rbf', gamma=0.1,
n_components=2)

4. Fit and transform data using KernelPCA:


Fit and transform the data using the fit_transform method of the KernelPCA instance.

# Step 4: Fit and transform data using KernelPCA


X_kernel_pca = kernel_pca.fit_transform(X)

5. Plot the results:


Plot two coordinates of the original Swiss roll and the two-dimensional embedding obtained from KernelPCA using Matplotlib, coloring the points by their position along the roll.

# Step 5: Plot the results


plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 2], c=color, cmap=plt.cm.Spectral)
plt.title('Original Data')

plt.subplot(1, 2, 2)
plt.scatter(X_kernel_pca[:, 0], X_kernel_pca[:, 1], c=color, cmap=plt.cm.Spectral)
plt.title('Kernel PCA')

plt.tight_layout()
plt.show()

Remember that kernel parameters like gamma should be tuned according to your data
characteristics and objectives. This example demonstrates Kernel PCA using synthetic
data, but you can replace the data generation step with your own dataset and modify the
kernel and other parameters accordingly.
Kernel Principal Component Analysis (Kernel PCA) is a powerful extension of traditional Principal Component Analysis (PCA) that addresses the limitations of linear dimensionality reduction by incorporating nonlinear transformations. It is beneficial when data exhibits complex and nonlinear patterns that linear methods cannot effectively capture.

Kernel PCA achieves this by projecting data into a higher-dimensional feature space using
kernel functions, where nonlinear relationships can be exposed and analyzed.

3.8 Independent Component Analysis


Independent Component Analysis (ICA) is an effective statistical method that permits the
separation of a multivariate signal into its constituent independent components. Unlike
traditional techniques that expect linear dependencies among variables, ICA assumes that
the observed signals are generated by linear combinations of underlying independent
sources. These sources are statistically independent and non-Gaussian, making ICA partic­
ularly suitable for scenarios where the sources exhibit complex and intricate relationships.
The motivation behind ICA lies in scenarios where we have access to mixed obser­
vations but lack information about the original sources. For example, in cocktail party
problem scenarios, multiple speakers’ voices are recorded by a microphone array, result­
ing in a mixture of voices. ICA aims to disentangle these mixed signals and recover the
underlying independent sources, providing insights into the underlying structure of the
data.
The key to ICA’s success is its ability to exploit the statistical properties of the data.
By maximizing measures of statistical independence, such as negentropy or kurtosis, ICA
iteratively estimates the mixing matrix and the independent components. Through this
process, it uncovers the hidden factors driving the observed signals, offering valuable in­
sights into complex systems across a wide range of disciplines.

3.8.1 The Cocktail Party Problem


Consider yourself at a lively cocktail party, surrounded by conversations, laughter, music,
and the clinking of glasses. In the midst of this cacophony, you concentrate on a friend’s
voice, listening to their words while blocking out the background noise. The ability of hu­
mans to isolate a single audio source from a mix of sounds has long fascinated researchers,
and it serves as the foundation for investigating the ``Cocktail Party Problem'' in signal pro­
cessing.
The Cocktail Party Problem describes a circumstance in which multiple overlapping au­
dio signals are detected by a group of microphones. Each microphone captures a unique
combination of these signals based on its position in relation to the sound sources. The
challenge is to separate these mixed signals into their original, individual components
without prior knowledge of the sources. This task, while seemingly simple for the human
brain, poses a significant challenge for automated systems.
Independent Component Analysis provides an effective solution to the Cocktail Party
Problem. ICA is a computational method for separating a multivariate signal into addi­
tive and independent components. The method assumes that the individual signals are
statistically independent and not Gaussian. Under these conditions, ICA can successfully
identify and separate the original signals from their mixtures.

Consider a simple example of two people speaking simultaneously in a room with


two microphones. Each microphone captures a different combination of the two voices.
The recorded signals are thus a complex blend of the original voices, with each recording
featuring different contributions from each speaker. The goal of ICA in this context is to
``unmix'' these recordings in order to retrieve individual voices.
By applying ICA to the mixed signals, we can approximate the original sources. The
algorithm iteratively adjusts its parameters until it obtains a set of outputs that are as inde­
pendent from one another as possible. The result is two distinct signals, each representing
one of the original voices, effectively resolving the Cocktail Party Problem.
This problem is a classic example of blind source separation [13], with the ``blind'' as­
pect referring to a lack of knowledge about the mixing process or the original sources.
ICA’s ability to address the blind source separation problem demonstrates its utility in a
wide range of applications, including audio processing and speech recognition, as well as
biomedical signal analysis. ICA provides us with a tool for delving into the complexities of
sound and signal processing, and it also allows the discovery of hidden structures within
complex, multilayered datasets.

3.8.2 A comparison between PCA and ICA


In the data analysis landscape, Principal Component Analysis and Independent Compo­
nent Analysis emerge as the primary means for identifying hidden data structures. While
they navigate different statistical terrains (PCA via variance maximization, ICA via statistical independence), they share similar mathematical machinery and a common goal: revealing the underlying patterns in complex datasets.

3.8.2.1 Similarities between PCA and ICA


Dimensionality reduction: Both PCA and ICA can be used to reduce the dimensionality of data. While ICA does not initially intend to do so, later variations and PCA can both convert a high-dimensional dataset into a lower-dimensional space, attempting to retain as much important information as possible. This reduction is accomplished by identifying a set of components or axes that best represent variance (PCA) or independent sources (ICA) in the data.
Feature extraction: ICA and PCA are both considered methods in the category of feature extraction algorithms. They break the original features into new components that can be used as new features in machine learning models. These extracted features can enhance the performance of subsequent analyses by highlighting underlying patterns or structures that may not be visible in the raw data.
Orthogonal transformations: On a computational level, both PCA and ICA use orthog­
onal transformations to derive their components. In PCA, the transformation produces
orthogonal principal components that maximize variance. Although the goal of ICA is sta­
tistical independence rather than orthogonality, the algorithm frequently uses orthogonal
operations to find independent components.

Preprocessing steps: The preprocessing steps for PCA and ICA are similar, such as cen­
tering data by subtracting the mean. This step is critical for both methods because it
ensures that the analysis is not influenced by the mean values of various features. Furthermore, scaling the data to unit variance is common in PCA to ensure equal weight across
features, and similar considerations can apply to ICA, depending on the context and data.
Matrix factorization: Both approaches can be viewed through the lens of matrix fac­
torization. PCA divides the data matrix into a set of orthogonal vectors (principal compo­
nents) and their corresponding scores. Similarly, ICA breaks down the data into a mixing
matrix and a set of independent components. This matrix factorization perspective em­
phasizes the structural analysis provided by both methods, despite the fact that they factor
the data using different criteria.

3.8.2.2 Distinction between PCA and ICA


Different criteria: PCA seeks to identify the directions (principal components) that maxi­
mize variance in the dataset. In contrast, ICA seeks components that are statistically inde­
pendent of one another. This distinction is critical because maximizing variance does not
always result in statistical independence, especially in the presence of non-Gaussian data
distributions.
Assumptions about data: PCA assumes that the principal components are orthogonal,
which is a natural consequence of the technique’s emphasis on variance. However, ICA
does not make this assumption. Instead, it uses the assumption of non-Gaussianity to
achieve source separation. The non-Gaussian assumption is critical to ICA’s effectiveness
because it enables the method to identify and separate independent sources.
Interpretability of components: The PCA-derived components are frequently used to
understand the directions of maximum variance in data, which can be useful for feature
reduction and compression. The ICA components, on the other hand, are interpreted as
the underlying sources or factors that produce the observed data. These components can
provide more detailed insights into the structure and origins of the data, revealing hidden
patterns that PCA may miss.
In conclusion, the comparison of PCA and ICA demonstrates that both methodologies
provide distinct pathways to data transparency. Their differences inform our strategy, al­
lowing us to tailor it to the data’s unique narrative. However, their commonalities still serve
as a foundation for a comprehensive understanding.

3.8.3 Theoretical background of ICA


Independent Component Analysis [11] uses statistical independence and the Central Limit
Theorem (CLT) to separate mixed signals. Let us go through these concepts briefly.
Statistical independence: The statistical independence principle underpins Indepen­
dent Component Analysis. This concept is essential for separating mixed signals into their
individual independent components. Formally, two random variables, X and Y , are con­
sidered statistically independent if and only if their joint probability density function (pdf )

factors into the product of their marginal pdfs:

PX,Y (x, y) = PX (x) · PY (y). (3.42)

Expanding this concept to the multivariate case, let \(S = [s_1, s_2, \ldots, s_n]^{T}\) be a vector of independent components; the joint pdf of \(S\) then factorizes as:

\[
P_S(s) = \prod_{i=1}^{n} P_{S_i}(s_i). \qquad (3.43)
\]

Central Limit Theorem (CLT): This theorem is pivotal for understanding the distributional properties of mixed signals in ICA. It states that for n independent and identically distributed (i.i.d.) random variables \(X_i\) with mean μ and variance σ², the sum \(S_n = \sum_{i=1}^{n} X_i\) (or equivalently, the sample mean) converges in distribution to a normal distribution as n approaches infinity:

\[
\frac{S_n - n\mu}{\sigma\sqrt{n}} \;\xrightarrow{\;d\;}\; \mathcal{N}(0, 1). \qquad (3.44)
\]

This theorem demonstrates an important insight used in ICA: when independent, non­
Gaussian signals are linearly mixed, the distribution of the resulting signal tends to be
Gaussian. This is because, according to the CLT, the sum of independent variables has a
distribution that is more Gaussian than any of the individual summands, assuming they
are not Gaussian.
By utilizing the CLT, ICA functions on the assumption that the latent independent com­
ponents S are less Gaussian than the observed mixed signals X. Finding a transformation
that maximizes non-Gaussianity is, therefore, necessary to separate the mixed signals and
retrieve the original independent components.
A common measure of non-Gaussianity used in ICA algorithms like FastICA is kurtosis, defined for a random variable X as:

\[
\mathrm{Kurtosis}(X) = E\!\left[\left(\frac{X - \mu}{\sigma}\right)^{4}\right] - 3, \qquad (3.45)
\]

where E denotes expectation, μ is the mean, and σ is the standard deviation of X. For a Gaussian distribution, the kurtosis is zero; it is negative for sub-Gaussian distributions and positive for super-Gaussian distributions. Thus, any deviation from zero indicates non-Gaussianity.
For a discrete random variable X with possible values \(\{x_1, x_2, \ldots, x_n\}\) and a probability mass function \(P(X = x_i) = p_i\), the entropy \(H(X)\) is defined as:

\[
H(X) = -\sum_{i=1}^{n} p_i \log(p_i), \qquad (3.46)
\]

where the logarithm is typically taken to base 2, and the unit of entropy is bits. The base
of the logarithm can be changed to e for natural units (nats), or to 10 for digits, but in the
context of information theory, base 2 is the most common.
For a continuous random variable Y with probability density function (pdf) f(y), the concept of entropy extends to differential entropy \(H(Y)\), defined as:

\[
H(Y) = -\int_{-\infty}^{\infty} f(y)\,\log\big(f(y)\big)\,dy. \qquad (3.47)
\]

It is important to note that differential entropy can be negative, and the interpretation
is less straightforward than the discrete case due to the different scaling properties of con­
tinuous variables.
Negentropy is defined as \(J(Y) = H(Y_{\mathrm{gauss}}) - H(Y)\), where \(Y_{\mathrm{gauss}}\) is a Gaussian random variable with the same variance (or covariance) as Y. Since entropy is maximal for a Gaussian distribution, \(J(Y)\) is always non-negative and is zero only for Gaussian variables. ICA seeks to maximize negentropy to find the independent components.
In essence, ICA algorithms like FastICA iteratively adjust the demixing matrix W (where
W = A−1 ) to maximize non-Gaussianity, thereby estimating the independent components.
The optimization process may include maximizing measures such as negentropy or min­
imizing mutual information between the estimated components, taking advantage of the
Central Limit Theorem’s inverse relationship between non-Gaussianity and statistical in­
dependence.
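The following short sketch illustrates the excess-kurtosis measure of Eq. (3.45) numerically on three assumed sample distributions: a Gaussian (kurtosis near zero), a uniform (sub-Gaussian, negative kurtosis), and a Laplacian (super-Gaussian, positive kurtosis).

import numpy as np

def excess_kurtosis(x):
    # Eq. (3.45): E[((X - mu) / sigma)^4] - 3; zero for a Gaussian variable
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3

rng = np.random.default_rng(0)
print(excess_kurtosis(rng.normal(size=100_000)))    # close to 0  (Gaussian)
print(excess_kurtosis(rng.uniform(size=100_000)))   # about -1.2  (sub-Gaussian)
print(excess_kurtosis(rng.laplace(size=100_000)))   # about +3    (super-Gaussian)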

3.8.4 The ICA model


3.8.4.1 The ICA model
The Independent Component Analysis [12] model is based on a linear mixture model,
which assumes that the observed multi-dimensional data is a linear combination of un­
known latent variables. These latent variables, also called independent components, are
assumed to be non-Gaussian and statistically independent of one another.
Consider the observed data matrix \(X \in \mathbb{R}^{n\times m}\), where n is the number of variables (or sensors) and m is the number of observations. The ICA model posits that X can be expressed as:

\[
X = AS. \qquad (3.48)
\]

Here:
-- A ∈ Rn×n is the unknown mixing matrix. Each column of A represents the contribution
of each independent component to a particular observed variable.
-- S ∈ Rn×m is the matrix of independent components. Each row of S corresponds to one
independent component over all observations.
The goal of ICA is to estimate the mixing matrix A and the independent components S,
given the observed data X. This is achieved by finding the unmixing matrix W = A−1 , such
that:
S = W X. (3.49)

3.8.4.2 Assumptions of the ICA model


• Statistical independence: We have discussed how S components are statistically inde­
pendent, underpinning ICA and distinguishing it from PCA.
• Non-Gaussianity: We have covered the vital assumption of component non-Gaussianity,
essential for separating independent components due to the Central Limit Theorem.
• Stability of the mixing process: The mixing matrix A is assumed to remain constant over
time. This implies that the way in which the independent components combine to form
the observed data remains constant, allowing a stable solution to be found.
• No or minimal noise: While not always explicitly stated, the basic ICA model often as­
sumes that the observed data is not significantly corrupted by noise. However, extended
models exist that can account for noise.
• Identifiability conditions: It is assumed that at most one of the independent components can have a Gaussian distribution. Furthermore, the mixing matrix A should be square (n × n) and of full rank to ensure that the model is identifiable, meaning there is a unique solution for S and A.

In addition to the previously stated foundational assumptions, the ICA model originally
included a constraint on the dimensions of the observed data and the sources, that is, the
number of sources and observations (sensors) must be equal. This assumption is encap­
sulated in the requirement that the mixing matrix A ∈ Rn×n be square, implying that for
n observed variables, there exist n underlying independent components. Mathematically,
this requirement ensures that A is invertible, which is critical for the model to solve for the
unmixing matrix W = A−1 .
However, this assumption has a significant impact on the applicability of ICA as, in
many real-world scenarios, the number of observed data channels (sensors) does not
equal the number of sources. Recognizing this limitation, subsequent versions of the ICA
model relaxed the assumption, allowing the number of sources to differ from the number
of observations. In these cases, the mixing matrix A is no longer square, and direct inver­
sion to obtain W is not possible.
By relaxing the constraint on the number of sources and observations, later ICA variants
became more versatile, allowing them to be used in fields such as neuroimaging, telecom­
munications, and finance, where the assumption of equal numbers of sources and sensors
is frequently untenable.

3.8.5 Algorithms for ICA


There are numerous algorithms in the topic of Independent Component Analysis that can
extract independent components from observed data. Here, we will take a look at two sig­
nificant variations: Projection Pursuit and FastICA. These methods exemplify the various
strategies used in ICA, demonstrating key approaches to overcoming the challenge of sig­
nal separation with precision and efficiency.

3.8.5.1 Projection Pursuit


Projection Pursuit is an approach to determining the most ``interesting'' directions or pro­
jections in multivariate data. The primary goal is to find projections that maximize a spe­
cific index of interestingness, which in the context of ICA refers to non-Gaussianity. The
rationale is that projections that deviate significantly from Gaussianity are more likely to
correspond to the underlying independent components we want to recover.
Before exploring the intricacies of Projection Pursuit, it is vital to preprocess the ob­
served data X to streamline the problem and enhance the algorithm’s efficiency. This
preprocessing generally entails two steps: centering and whitening.
Centering involves modifying the data so that each variable (feature) has a mean of
zero. Mathematically, if x̄ represents the mean vector of the observed data matrix X, the
centered data matrix Xc is derived as:

Xc = X − 1x̄ T , (3.50)

where 1 is a column vector of ones. Centering ensures the analysis is conducted relative
to the data’s mean, aligning with common practices in numerous statistical learning algo­
rithms.
Whitening is a transformation aiming to convert the observed variables into a new set
of variables that are uncorrelated and have unit variance. For whitening, the Singular Value
Decomposition (SVD) of the centered data matrix Xc is employed, denoted as Xc = U SV T ,
where S is the diagonal matrix of singular values, and U and V contain the left and right
singular vectors, respectively. The whitened data Xw is then obtained by:

Xw = S −1/2 U T Xc . (3.51)

This ensures that the components of \(X_w\) are normalized, facilitating the identification of the independent components by putting them on an equal footing in terms of variance.
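A minimal centering-and-whitening sketch in the spirit of Eqs. (3.50)–(3.51) is shown below. It assumes one observation per row and uses the sample-covariance scaling, so the whitened features come out uncorrelated with unit variance; this is a slight notational variant of the SVD formula above, not a verbatim transcription of it.

import numpy as np

def center_and_whiten(X):
    """Center the columns of X (rows = observations) and whiten them so that the
    resulting features are uncorrelated with unit variance."""
    Xc = X - X.mean(axis=0)                              # centering, Eq. (3.50)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)    # Xc = U S V^T
    keep = s > 1e-12                                     # drop numerically zero directions
    # Project onto the right singular vectors and rescale each direction
    return Xc @ Vt[keep].T / (s[keep] / np.sqrt(len(X) - 1))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])
Xw = center_and_whiten(X)
print(np.cov(Xw, rowvar=False))   # approximately the identity matrix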
Projection Pursuit involves iteratively searching for projections that maximize non-Gaussianity. Let w represent a projection vector; the goal is to find the set of w that maximizes a non-Gaussianity index J(·) for the projected data \(w^{T}X_w\). The optimization problem can be formulated as:

\[
\max_{w}\; J(w^{T} X_w), \qquad (3.52)
\]

subject to \(\lVert w \rVert = 1\) to ensure the scale invariance of the projection.


Several indices of non-Gaussianity can be used, including kurtosis and negentropy. The
optimization typically involves gradient-based methods or fixed-point iteration schemes,
depending on the specific index used.
Let us break down the steps involved:
1. Initialization: Start with the whitened data matrix \(X_w\) obtained from the original data X through centering and whitening. Initialize the projection vector w with random values, ensuring \(\lVert w \rVert = 1\) (i.e., w is unit-norm).

2. Maximize non-Gaussianity: For each projection vector w, optimize the non-Gaussianity


index J (w T Xw ). This can be done using a gradient ascent method, where the update
rule might look something like:

\[
w_{\mathrm{new}} = w + \eta\,\nabla J(w^{T} X_w). \qquad (3.53)
\]

3. Orthogonalization: To ensure the newly updated projection vector w is orthogonal to


the previously found projection vectors, apply the Gram–Schmidt process:


\[
w \leftarrow w - \sum_{j=1}^{i-1} (w^{T} w_j)\, w_j, \qquad (3.54)
\]

where i is the current iteration, and j runs over all previously calculated projection vec­
tors. Normalize w again after orthogonalization.
4. Convergence check: Assess the convergence of the algorithm by checking if the change
in w between iterations falls below a predetermined threshold, or if a maximum num­
ber of iterations is reached:

if w − wprev  < threshold, then stop. (3.55)

5. Deflation: Once a projection vector w converges, it is used to deflate the data, ensuring subsequent vectors find new independent components. This step is crucial in the iterative process to uncover all independent components one by one. Deflation is typically integrated into the orthogonalization step in practice.
6. Repeat Steps 2--5: Continue the process for the next projection vector, initializing w anew (if not all components have been found), and repeating the optimization, orthogonalization, and deflation steps until all independent components are extracted.

3.8.5.2 FastICA
FastICA [14], an established method for Independent Component Analysis, aims to ex­
tract components one at a time by maximizing the non-Gaussianity of projections of a
pre-whitened data matrix X ∈ RN ×M . This pre-whitened matrix is built as previously dis­
cussed, with the centered data and unit variance along each dimension.
FastICA measures non-Gaussianity using a nonlinear function f(u), its first derivative g(u), and its second derivative g′(u). These functions are chosen to reflect the characteristics of non-Gaussian distributions:
-- For general purposes, a good choice is:

\[
f(u) = \log\cosh(u), \qquad g(u) = \tanh(u), \qquad g'(u) = 1 - \tanh^{2}(u). \qquad (3.56)
\]

-- For robustness, an alternative set of functions is:

\[
f(u) = -e^{-u^{2}/2}, \qquad g(u) = u\,e^{-u^{2}/2}, \qquad g'(u) = (1 - u^{2})\,e^{-u^{2}/2}. \qquad (3.57)
\]

3.8.5.2.1 The FastICA algorithm for single-component extraction


1. Initialization: Start with a random weight vector \(w \in \mathbb{R}^{N}\) that is normalized to ensure \(\lVert w \rVert = 1\).
2. Iterative update: The weight vector is updated as follows:

\[
w^{+} \leftarrow E\!\left\{X_w\, g(w^{T}X_w)^{T}\right\} - E\!\left\{g'(w^{T}X_w)\right\} w. \qquad (3.58)
\]

Here, \(E\{\cdot\}\) denotes averaging over all column vectors of matrix \(X_w\). This step is crucial as it moves w in the direction that maximizes non-Gaussianity, as quantified by g and g′.
3. Normalization: Update the weight vector by normalizing it to unit length:

\[
w \leftarrow \frac{w^{+}}{\lVert w^{+} \rVert}. \qquad (3.59)
\]

This step ensures the magnitude of w does not inflate undesirably.
4. Convergence check: The algorithm iterates the update step until w converges, that is, until changes between iterations fall below a predetermined threshold.

3.8.5.2.2 Multiple component extraction


FastICA’s extension to extract multiple independent components builds on the framework
established for single-component extraction. The process entails iterating over several
components while ensuring their mutual independence through orthogonality.
Input: Number of desired components C, where C ≤ N, and pre-whitened matrix X.
1. Initialization of components:
a. Begin with the pre-whitened data matrix Xw ∈ RN ×M , where each column repre­
sents an N-dimensional sample.
b. Specify the desired number of independent components C, ensuring C ≤ N.
c. Initialize C random weight vectors wp ∈ RN for p = 1, . . . , C, each normalized to unit
length.
2. Iterative optimization for each component:
a. Update each weight vector \(w_p\) to maximize non-Gaussianity:

\[
w_p^{+} \leftarrow \frac{1}{M}\, X_w\, g(w_p^{T}X_w)^{T} - \frac{1}{M}\, g'(w_p^{T}X_w)\,\mathbf{1}_M\, w_p, \qquad (3.60)
\]

where \(\mathbf{1}_M\) is a column vector of ones of dimension M, g(·) is the first derivative, and g′(·) is the second derivative of the non-quadratic function used to measure non-Gaussianity.
b. Normalize the updated weight vector to maintain unit length:

\[
w_p \leftarrow \frac{w_p^{+}}{\lVert w_p^{+} \rVert}. \qquad (3.61)
\]

3. Enforcing independence through orthogonalization:



a. For each updated weight vector \(w_p\), ensure it is orthogonal to all previously extracted components:

\[
w_p \leftarrow w_p - \sum_{j=1}^{p-1} (w_p^{T} w_j)\, w_j. \qquad (3.62)
\]

4. Convergence criterion: Assess convergence for each weight vector wp based on the
magnitude of change across iterations. Repeat the optimization and orthogonalization
steps until convergence is achieved for all components.
5. Constructing the output:
a. Form the unmixing matrix \(W \in \mathbb{R}^{N\times C}\) by compiling the converged weight vectors:

\[
W = \begin{bmatrix} w_1 & \cdots & w_C \end{bmatrix}. \qquad (3.63)
\]

b. Obtain the matrix of independent components \(S \in \mathbb{R}^{C\times M}\) by projecting the whitened data \(X_w\) onto the weight vectors:

\[
S = W^{T} X_w. \qquad (3.64)
\]
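A compact from-scratch sketch of the deflation scheme described above is given below, with g = tanh. It assumes the input X_w has already been centered and whitened (rows = signal dimensions, columns = samples) and is meant as an illustration rather than production code.

import numpy as np

def fastica_deflation(Xw, n_components, max_iter=200, tol=1e-6, seed=0):
    """Deflation-based FastICA on pre-whitened data Xw of shape (N, M),
    following the updates in Eqs. (3.60)-(3.62) with g = tanh."""
    rng = np.random.default_rng(seed)
    N, M = Xw.shape
    W = np.zeros((N, n_components))

    for p in range(n_components):
        w = rng.normal(size=N)
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            wx = w @ Xw                                    # projections, shape (M,)
            g, g_prime = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
            w_new = (Xw @ g) / M - g_prime.mean() * w       # Eq. (3.60)
            # Orthogonalize against previously found components, Eq. (3.62)
            w_new -= W[:, :p] @ (W[:, :p].T @ w_new)
            w_new /= np.linalg.norm(w_new)                  # Eq. (3.61)
            converged = np.abs(np.abs(w_new @ w) - 1) < tol
            w = w_new
            if converged:
                break
        W[:, p] = w

    return W.T @ Xw, W      # estimated sources S = W^T Xw, and the weight matrix

# Usage sketch (assumes Xw has already been centered and whitened):
# S_est, W = fastica_deflation(Xw, n_components=2)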

3.8.6 Ambiguity in ICA


Even though Independent Component Analysis is a powerful tool for blind source sepa­
ration and feature extraction, it is prone to ambiguities regarding the scale, sign, and per­
mutation of the estimated components. These ambiguities stem from the mathematical
properties and assumptions that support the ICA model. Understanding these ambigui­
ties is critical for accurately interpreting ICA results.
Scale ambiguity: In the ICA model, the observed data X is assumed to be generated as
a linear combination of independent components S through a mixing matrix A:

X = AS. (3.65)

Given X, ICA aims to recover S and A by estimating an unmixing matrix W = A−1 , such
that:
Sest = W X. (3.66)

However, if S is scaled by a non-zero scalar k and simultaneously A is scaled by 1/k, the observed data X remains unchanged:

\[
X = \left(\frac{A}{k}\right)(kS) = AS. \qquad (3.67)
\]

This shows that the ICA model cannot determine the true scale of the components,
resulting in scale ambiguity.
Sign ambiguity: ICA also exhibits sign ambiguity. Since the statistical independence criterion used by ICA does not differentiate between a component and its negative, both \(s_i\) and \(-s_i\) are equally valid solutions. Mathematically, for any component \(s_i\) in S:

\[
X = AS = A_{\mathrm{mod}} S_{\mathrm{mod}}, \qquad (3.68)
\]

where \(S_{\mathrm{mod}}\) is S with the row \(s_i\) replaced by \(-s_i\), and \(A_{\mathrm{mod}}\) is A with the corresponding column negated. Thus ICA cannot determine the signs of the independent components.
Permutation ambiguity: Finally, ICA suffers from permutation ambiguity. The order in
which the independent components are recovered is not fixed. This is because the model’s
goal is to maximize statistical independence without any inherent ordering of the compo­
nents. If P is a permutation matrix, then both S and P S will lead to valid decompositions:

X = AS = AP T P S = Anew Snew , (3.69)

where Anew = AP T is just a reordering of the columns of A.

3.8.7 Example
We will start with a 2x2 source matrix S, and a randomly initialized 2x2 mixing matrix A,
where:
\[
S = \begin{bmatrix} 0 & 2 \\ 2 & 0 \end{bmatrix}
\quad\text{and}\quad
A = \begin{bmatrix} 0.4236 & 0.4375 \\ 0.6458 & 0.8917 \end{bmatrix}.
\]

The mixing process yields:


\[
X = AS = \begin{bmatrix} 1.2917 & 0.8473 \\ 1.7835 & 0.8751 \end{bmatrix}.
\]

The mean vector x̄ of the observed data X is computed as:


\[
\bar{x} = \begin{bmatrix} 1.0695 \\ 1.3294 \end{bmatrix}.
\]

This mean vector is used to center the data, resulting in the centered data matrix Xc being:
\[
X_c = \begin{bmatrix} 0.2222 & -0.2222 \\ 0.4542 & -0.4542 \end{bmatrix}.
\]

After centering the data, we performed Singular Value Decomposition (SVD) on the
centered matrix Xc , obtaining the matrices U , S, and V T . The SVD results in:
\[
U = \begin{bmatrix} -0.4395 & -0.8982 \\ -0.8982 & 0.4395 \end{bmatrix}, \qquad
S = \begin{bmatrix} 0.7151 & 0 \\ 0 & 2.1591\times 10^{-19} \end{bmatrix}, \qquad
V^{T} = \begin{bmatrix} -0.7071 & 0.7071 \\ -0.7071 & -0.7071 \end{bmatrix}.
\]

We then transformed the singular values matrix S into \(S^{-1/2}\) to use for whitening. The whitened data \(X_w\) is computed as:

\[
X_w = S^{-1/2} U^{T} X_c
= \begin{bmatrix} 1.1825 & 0 \\ 0 & 2.1521\times 10^{9} \end{bmatrix}
\begin{bmatrix} -0.4395 & -0.8982 \\ -0.8982 & 0.4395 \end{bmatrix}
\begin{bmatrix} 0.2222 & -0.2222 \\ 0.4542 & -0.4542 \end{bmatrix}
= \begin{bmatrix} -0.5980 & 0.5980 \\ 1.1525\times 10^{-7} & -1.1525\times 10^{-7} \end{bmatrix}.
\]

For p = 1, using the pre-whitened matrix \(X_w\) obtained above and a randomly initialized vector \(w_1\), we calculate the following values:

\[
(w_1^{T} X_w)^{T} = \begin{bmatrix} -0.3282 \\ 0.3282 \end{bmatrix}, \qquad
g(w_1^{T} X_w)^{T} = \tanh(w_1^{T} X_w)^{T} = \begin{bmatrix} -0.3169 \\ 0.3169 \end{bmatrix},
\]
\[
g'(w_1^{T} X_w)^{T} = 1 - \tanh^{2}(w_1^{T} X_w)^{T} = \begin{bmatrix} 0.8996 \\ 0.8996 \end{bmatrix}, \qquad
\mathbf{1}_M w_1 = 1.2640, \qquad M = 2.
\]

The updated value \(w_1^{+}\) is:

\[
w_1^{+} = \frac{1}{M} X_w\, g(w_1^{T}X_w)^{T} - \frac{1}{M}\, g'(w_1^{T}X_w)\,\mathbf{1}_M\, w_1
= \frac{1}{2}\begin{bmatrix} 0.3789 \\ 2.0974\times 10^{-9} \end{bmatrix} - \frac{1}{2}\begin{bmatrix} 0.4937 \\ 0.6433 \end{bmatrix}
= \begin{bmatrix} -0.0574 \\ -0.3217 \end{bmatrix}.
\]

For p = 1, and given there are no previous \(w_j\) to orthogonalize against (since p = 1 is the first iteration), we directly normalize the updated vector \(w_1^{+}\) to obtain the normalized \(w_1\):

\[
w_1 = \frac{1}{0.3267}\begin{bmatrix} -0.0574 \\ -0.3217 \end{bmatrix} = \begin{bmatrix} -0.1756 \\ -0.9845 \end{bmatrix}.
\]

After repeating the process for 100 iterations with p = 1, the final vector w1 converges to
approximately:
\[
w_1 = \begin{bmatrix} -1.9274\times 10^{-7} \\ -1.0000 \end{bmatrix}.
\]

For p = 2, representing the second component, and with a newly initialized vector \(w_2\), the calculations yield:

\[
(w_2^{T} X_w)^{T} = \begin{bmatrix} -0.2494 \\ 0.2494 \end{bmatrix}, \qquad
g(w_2^{T} X_w)^{T} = \tanh(w_2^{T} X_w)^{T} = \begin{bmatrix} -0.2443 \\ 0.2443 \end{bmatrix},
\]
\[
g'(w_2^{T} X_w)^{T} = 1 - \tanh^{2}(w_2^{T} X_w)^{T} = \begin{bmatrix} 0.9403 \\ 0.9403 \end{bmatrix}, \qquad
\mathbf{1}_M w_2 = 1.1373, \qquad M = 2.
\]

The updated value \(w_2^{+}\) for the second component is:

\[
w_2^{+} = \frac{1}{M} X_w\, g(w_2^{T}X_w)^{T} - \frac{1}{M}\, g'(w_2^{T}X_w)\,\mathbf{1}_M\, w_2
= \frac{1}{2}\begin{bmatrix} 0.2921 \\ 1.6172\times 10^{-9} \end{bmatrix} - \frac{1}{2}\begin{bmatrix} 0.3921 \\ 0.6773 \end{bmatrix}
= \begin{bmatrix} -0.0500 \\ -0.3387 \end{bmatrix}.
\]

For p = 2, the orthogonalization and normalization steps yield the following. The vector \(w_2\) after the orthogonalization step (subtracting its projection onto \(w_1\)):

\[
w_2 = \begin{bmatrix} -0.0500 \\ -0.3387 \end{bmatrix}
- \left(\begin{bmatrix} -0.0500 & -0.3387 \end{bmatrix}\begin{bmatrix} -1.9274\times 10^{-7} \\ -1.0000 \end{bmatrix}\right)
\begin{bmatrix} -1.9274\times 10^{-7} \\ -1.0000 \end{bmatrix}
= \begin{bmatrix} -0.0500 \\ 9.6327\times 10^{-9} \end{bmatrix}.
\]

The normalized vector \(w_2\) after normalization:

\[
w_2 = \frac{1}{0.0499}\begin{bmatrix} -0.0500 \\ 9.6327\times 10^{-9} \end{bmatrix} = \begin{bmatrix} -1.0000 \\ 1.9274\times 10^{-7} \end{bmatrix}.
\]

The unmixing matrix is given by horizontally stacking the converged vectors:

\[
W = \begin{bmatrix} -1.9274\times 10^{-7} & -1.0000 \\ -1.0000 & 1.9274\times 10^{-7} \end{bmatrix},
\]

and finally, the estimated sources can be computed as:

\[
S = W^{T} X_w
= \begin{bmatrix} -1.9274\times 10^{-7} & -1.0000 \\ -1.0000 & 1.9274\times 10^{-7} \end{bmatrix}
\begin{bmatrix} -0.5980 & 0.5980 \\ 1.1525\times 10^{-7} & -1.1525\times 10^{-7} \end{bmatrix}
= \begin{bmatrix} -0.5980 & 0.5980 \\ 1.1525\times 10^{-7} & -1.1525\times 10^{-7} \end{bmatrix}.
\]

3.8.8 Example of implementing ICA


1. Import the necessary libraries:

First, import the required Python libraries. We will need sklearn.decomposition for ac­
cessing the FastICA implementation. Here, we only use NumPy to create our dataset.

import numpy as np
from sklearn.decomposition import FastICA

2. Load or prepare your dataset:


Your dataset should be a numerical matrix with multiple features, where each feature
may represent a mixed signal or variable. Here, for demonstration, we will create a syn­
thetic dataset that simulates mixed signals.

# Generate synthetic data


# Assume S contains independent, non-Gaussian signals
t = np.linspace(0, 100, 1000)
S = np.c_[np.sin(t),                 # smooth sinusoidal source
          np.sign(np.sin(3 * t))]    # square-wave source at a different frequency
A = np.array([[1, 1], [0.5, 2]])     # Mixing matrix
X = S.dot(A.T)                       # Mixed signals

3. Create a FastICA instance:


Instantiate the FastICA class from Scikit-learn, specifying the number of components
to extract. You can also adjust other parameters according to your requirements.

# Initialize FastICA with the desired number of components
ica = FastICA(n_components=2)

4. Fit the ICA model:


Apply the ICA model to your data using the fit_transform() method. This method fits
the model and returns the estimated independent components in one step.

# Fit ICA on the mixed signals and transform the data
S_ica = ica.fit_transform(X)

5. Access the results:


After fitting the model, you can access various attributes of the FastICA object for analy­
sis and visualization. For example, you can retrieve the estimated mixing and unmixing
matrices, as well as the independent components themselves.
# S_ica contains the estimated independent components

# Access the estimated mixing matrix
mixing_matrix = ica.mixing_

# Access the unmixing matrix
unmixing_matrix = ica.components_

# Compare the true and estimated signals, e.g., by checking that the mixing
# model reconstructs the observations:
# np.allclose(X, S_ica.dot(mixing_matrix.T) + ica.mean_)
This example demonstrates the basic steps required to implement ICA on synthetic
data. The same process applies to real-world datasets, with the potential need for addi­
tional preprocessing steps depending on the data’s nature.
Advantages of ICA:
• Blind source separation: One of ICA’s most notable capabilities is the ability to perform
blind source separation. It can separate mixed signals into their original components
without prior knowledge of the mixing process. This capability is especially useful in
applications such as audio signal processing, biomedical signal analysis, and image
processing, where the underlying sources or mixing mechanisms are unknown before­
hand.
• Robustness to noise: ICA demonstrates remarkable robustness to additive noise, mak­
ing it suitable for real-world applications where noise is an inevitable factor. Its under­
lying assumptions do not require the absence of noise but rather focus on the statistical
properties of the signal sources, thereby retaining effectiveness even in noisy environ­
ments.
• Discovery of hidden factors: ICA may determine hidden or latent variables that influence observed data. In fields like finance and genomics, this ability enables researchers
to identify underlying factors that drive complex patterns, providing insights into the
mechanisms at work.
• No requirement for prior models: Unlike other analysis methods, which require a pre­
defined model of the data or its sources, ICA makes only a few assumptions about the
sources. Its primary requirements are statistical independence and non-Gaussianity of
the sources, which make it a versatile tool suitable for a wide range of datasets.
• Application in various fields: The versatility of ICA extends its utility beyond signal pro­
cessing to other fields such as neuroimaging (e.g., fMRI analysis), telecommunications,
and finance, demonstrating its adaptability to various types of data and analysis de­
mands.
Limitations of ICA:
• Assumption of independence: The requirement that the source signals be statistically
independent is central to ICA. In practice, however, complete signal independence may
not always be possible, potentially limiting ICA’s effectiveness in certain applications.
• Non-Gaussianity requirement: ICA's reliance on non-Gaussian sources can be
problematic. While it allows for the separation of mixed signals, it also means that ICA
may struggle with data where this assumption is not true, such as when using purely
Gaussian sources.
• Order and scale indeterminacy: ICA does not specify the order or exact scale of the sep­
arated components. The output components may appear in a different order or scale
than the original sources, necessitating additional steps for interpretation.
• Complexity in determining the number of components: Estimating the correct number
of independent components is critical for successful ICA application; however, this can
be difficult and may necessitate domain-specific knowledge or additional criteria.
• Sensitivity to preprocessing: Preprocessing steps like centering, whitening, and dimen­
sionality reduction can have a significant impact on ICA’s performance. Inappropriate
preprocessing can result in suboptimal separation outcomes.

3.9 Conclusion
In conclusion, PCA is widely used for dimensionality reduction, feature extraction, and
data visualization. It can be a powerful tool for understanding high-dimensional data’s
underlying patterns and structures. PCA offers several advantages, including reducing di­
mensionality, extracting important features, visualizing data, and reducing noise. How­
ever, PCA also has limitations, such as loss of interpretability, assumptions of linearity
and normality, sensitivity to outliers, and determination of the number of components.
PCA is a valuable technique in various fields, including machine learning, data analysis,
image processing, and signal processing. It can help reduce computational complexity,
improve model performance, and gain insights from complex data. However, it should be
used judiciously, considering the data’s specific characteristics and the analysis’s objec­
tives. It is essential to understand PCA’s assumptions, limitations, and potential trade-offs
and carefully evaluate its suitability for a given dataset or problem. Other techniques, such
as nonlinear dimensionality reduction methods, may be more appropriate when data has
complex nonlinear relationships. PCA is a powerful technique that is valuable in data anal­
ysis and machine learning tasks. However, it should be used cautiously and considering the
specific requirements and limitations of the data and problem at hand.
Kernel PCA offers a bridge between linear and nonlinear methods, allowing data ana­
lysts to explore and extract valuable insights from complex datasets that defy linear mod­
eling. The Kernel PCA technique offers various benefits and
considerations. First, it excels in capturing nonlinear relationships within data, revealing
complex structures that may otherwise remain hidden. At the same time, it effectively
reduces data dimensionality while preserving key characteristics for visualization, explo­
ration, and analysis. Moreover, Kernel PCA provides flexibility by utilizing different kernel
functions suited to specific data patterns, thereby enhancing its applicability in diverse
fields like image analysis, genetics, and finance.
However, important factors need to be carefully considered when using Kernel PCA.
The selection of the appropriate kernel is crucial and requires a thoughtful decision based
on domain knowledge and data comprehension. Parameter tuning is also critical and re­
quires precise adjustments to achieve optimal results and prevent overfitting risks. Fur­
thermore, managing the complexity of computations can be challenging, especially with
large datasets, but techniques like approximation methods and efficient algorithms can
help address these issues.
Interpreting principal components in transformed spaces can be difficult, particularly
with complex kernels. Additionally, addressing concerns about overfitting, especially with
small datasets or overly complex kernels, necessitates using regularization techniques and
cross-validation methods. Therefore, while Kernel PCA offers significant advantages, care­
fully considering these factors is essential for successful implementation. By embracing
the strengths of Kernel PCA and understanding its considerations, practitioners can har­
ness its potential to reveal hidden structures and relationships within their data.
Independent Component Analysis is a potent method for revealing concealed pat­
terns within intricate datasets. By leveraging the statistical independence of the underlying
sources, it enables the untangling of mixed signals without prior knowledge of the sources.
Although it has demonstrated promising outcomes across various domains, its effective­
ness can be influenced by the selection of parameters and the assumptions made about
the data. Nevertheless, ICA remains a valuable tool for tasks in signal processing and data
analysis that necessitate blind source separation.

4
Linear discriminant analysis
Rambod Masoud Ansari a, Mohammad Akhavan Anvari a,
Saleh Khalaj Monfared b, and Saeid Gorgin c
a School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
b Department of Electrical and Computer Engineering, Worcester Polytechnic Institute, Worcester, MA,

United States c Department of Computer Engineering, Chosun University, Gwangju, Republic of Korea

4.1 Introduction to linear discriminant analysis


Working with high-dimensional data such as images can be problematic: the sheer number of
dimensions makes interpretation and visualization of useful features difficult. Linear
Discriminant Analysis (LDA) is a supervised machine learning technique for dimension
reduction and classification [6]. The original technique was developed in 1936 by Ronald
A. Fisher and was named linear discriminant analysis or Fisher discriminant analysis, and
it was originally a two-class technique [24]. Linear discriminant analysis has been used in many
applications involving high-dimensional data, like face recognition, decision-making, and
image retrieval [2]. Linear discriminant analysis is an excellent tool for dimension reduc­
tion, preserving as much information as is suitable for our prediction task.

4.1.1 What is linear discriminant analysis?


The Linear Discriminant Analysis technique was developed to transform the features into
a lower-dimensional space, which maximizes the ratio of the between-class variance to the
within-class variance; hence Linear Discriminant Analysis guarantees maximum class sep­
arability [2]. Due to maximum class separability, Linear Discriminant Analysis is a suitable
method as a preprocessing technique. The computation time is proportionally fast, mak­
ing it a desirable technique for big datasets. The magnitude of the eigenvalue in Linear
Discriminant Analysis describes the importance of the corresponding eigenspace con­
cerning classification performance. Linear Discriminant Analysis fails when the mean of
distributions is shared because it becomes impossible to find a new axis that makes those
classes linearly separable. This is one disadvantage of linear discriminant analysis; we use
nonlinear discriminant analysis in such cases [4].

4.1.2 How does linear discriminant analysis work?


Linear Discriminant Analysis projects the data onto the new axis to maximize the separa­
tion of the different categories. The challenge of using Linear Discriminant analysis is to
detect whether we have two classes of data targets or more than two classes; Linear dis­
criminant analysis works differently in these two situations [2]. When we have two classes
in the target variable, we can find the distance with the difference between the mean of
class one and the mean of class two; LDA projects data on a new axis with maximum sep­
arability. In the situation that we have more than two classes as our target variables, LDA
projects data on two new axes, finds a central point to all of the data, then calculates the
distance from each category mean to that central point [1].

4.1.3 Application of linear discriminant analysis


As discussed in the previous sections, Linear Discriminant Analysis has been successfully
used for many applications, including classification, decision-making, and recognition. In this section,
we want to discuss the details of the applications and how LDA can help us to classify the
data better.

4.1.3.1 Face recognition


Face recognition is a computer application for identifying and verifying a person’s face
based on facial attributes. Linear Discriminant Analysis is one of the techniques used for
dimensionality reduction and recording an outstanding performance in face recognition.
This method reduces the dimension while preserving as much class-discriminatory in­
formation as possible. LDA is used in face recognition to reduce the number of features
to a reasonable number before classifying the face dataset, and it is called Fisher’s face.
Another advantage of LDA in face recognition is feature extraction; LDA can be used to
extract significant features that show the basic structure of the face [7,8]. Face recognition
in humans can be complex and challenging due to facial expressions, facial hair, and envi­
ronmental lighting that can change the results of the face recognition process. Conversely,
the algorithm should recognize the human face from pictures. Therefore, lighting condi­
tions, image quality, and pose are essential in face recognition systems [9]. LDA is a linear
method for dimensionality reduction; hence, it can only find linear decision boundaries
between the classes. Therefore, LDA may not be accurate enough to classify faces that are
not linearly separable [9]. Thakur, Sing, and their colleagues proposed a new method for
face recognition by using Kernelized Linear Discriminant Analysis and Radial Basis Func­
tion Neural Network together. A kernelized version of LDA can handle nonlinear data by
projecting the data into a higher-dimensional space. This method used KFLDA as a di­
mensionality reduction algorithm to improve the performance of the RBF Neural Networks
for classifying face images. An RBF neural network is suitable for class­fication problems
where the classes are not linearly separable. This method can achieve good performance
in detecting face images. This method was expensive because it needed many resources
and time to run the algorithm. Merging KFLDA and the RBF Neural Network to create a
new algorithm was a promising new approach for face recognition [10]. Interested readers
can refer to [11--14] for a comprehensive tutorial on different kernel functions and how to
implement them in Python.

4.1.3.2 Medical cases


Linear Discriminant Analysis can help to classify a patient’s disease or disorder from differ­
ent perspectives such as: 1 -- Stages of disease or disorder (Stage in medical cases shows the
severity and spread of the disease. It is often used to define cancer progression, usually
classified from 1 to 4. The stage of a disease is an essential factor for determining appro­
priate treatments.) [16]; 2 -- Anatomical location of disease or disorder. For example, LDA
can be used to classify a dataset into these three stages of a disease like mild, moderate, or
severe categories. This classification could help medical staff make their work faster and
more accurate. That is just a simple example of using LDA in medical cases; another ex­
ample of LDA in medical cases is predicting the risk of heart disease [21] by analyzing the
factors related to heart diseases like age, gender, family history, blood pressure, and smok­
ing status. The LDA method can predict a person’s heart disease risk and help doctors to
monitor the person with medical information and save lives. LDA can also be used to cre­
ate a model for emergency medical services, classify patients based on the severity of their
illness, and prioritize those with higher-risk diseases to save time and lives [17]. Coronary
heart disease is one of the leading causes of death worldwide. Jung-Gi Yang and Jung-
Kwon Kim [18] used linear discriminant analysis and an adaptive network-based
fuzzy inference system to build a model that can predict the risk of coronary heart disease.
They applied this model to the Korean National Health and Nutrition Examination survey.
The method achieved a high prediction rate of 80.2%. This accuracy was higher than other
methods that had been used in 2013.
Mammography is the most effective method for early diagnosis of breast cancer. How­
ever, radiologists cannot diagnose cancer from mammograms in every case. Computer­
aided diagnosis systems are being developed to reduce the error rate in the detection of
breast cancer. Heang-Ping Chan, Datong Wei, and colleagues used linear discriminant
analysis to classify mammographic images with masses from normal tissue. They extracted
eight features from 168 mammograms with biopsy-proven cancer and 504 mammograms
of normal tissue. The results showed that LDA could classify cancerous tissue from normal
tissue with an accuracy of 84%. This percentage is higher than the accuracy of radiologists,
who typically have an accuracy of 70--80%.

4.2 Understanding the LDA algorithm


4.2.1 Prerequisite
This section will discuss the prerequisites needed to understand the LDA algorithm. Un­
derstanding the prerequisites of LDA is essential for providing accurate and reliable results
in dimensionality reduction and classification. They can also help in feature
selection by identifying which variables contribute most to separating the classes.
4.2.1.1 Standard deviation


Standard deviation is a measure to show the scattering of members in datasets. A low Stan­
dard Deviation shows data clustered around the mean, and a high Standard Deviation
shows more widely spread data. The Standard Deviation formula is:

$$\sigma = \sqrt{\frac{\sum_i (x_i - \mu)^2}{N}}, \qquad (4.1)$$
where σ is the Standard Deviation, μ is the mean, and x_i is a data point. To illustrate the calculation of the standard deviation, consider the following example dataset with values X = [5, 3, 8, 7, 2]. The first step is to determine the mean (μ):
$$\mu = \frac{\sum_i X_i}{N}, \qquad (4.2)$$
where the sum runs over all values in X and N is the total number of values. In this case μ = 5. Substituting into Eq. (4.1):
$$\sigma = \sqrt{5.2} \approx 2.28,$$
which is the Standard Deviation of X.
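As a quick check, the population standard deviation above can be reproduced with NumPy; this is only a verification sketch of Eq. (4.1), not part of the LDA algorithm itself:

import numpy as np

X = np.array([5, 3, 8, 7, 2])
mu = X.mean()                              # 5.0
sigma = np.sqrt(((X - mu) ** 2).mean())    # explicit form of Eq. (4.1)
print(sigma, np.std(X))                    # both print ~2.28 (np.std uses ddof=0 by default)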

4.2.1.2 Variance
Variance, the average squared difference between the values and the mean, measures how spread out the numbers in a dataset are. The greater the between-class variance, the farther apart the classes are. The variance formula is:
$$\sigma^2 = \frac{\sum_i (x_i - \mu)^2}{N}, \qquad (4.3)$$
where σ² is the variance, μ is the mean, x_i is a data point, and N is the number of values [15]. Consider X = [2, 4, 2, 8]. In this example,
$$\mu = 4, \qquad \sum_i (X_i - \mu)^2 = 24.$$
According to (4.3), dividing by the number of values N gives the variance:
$$\sigma^2 = \frac{24}{4} = 6.$$

4.2.1.3 Covariance
Covariance shows the relationship between two variables: it measures how much the two variables vary together around their means. In finance, for instance, covariance describes the directional relationship between the returns on two assets. A low covariance magnitude means the linear dependence between the two variables is weak, and vice versa. Covariance and variance are related; both measure how data points are distributed around a mean [15]:
$$\mathrm{cov}_{x,y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{N-1}, \qquad (4.4)$$
where cov_{x,y} is the covariance, x_i and y_i are the data values, x̄ and ȳ are their means, and N is the number of data values. For example:
$$X = [2, 4, 2, 8], \qquad Y = [1, 3, 5, 7].$$
We first calculate the mean of each variable:
$$\bar{x} = 4, \qquad \bar{y} = 4,$$
$$\mathrm{cov}_{x,y} = \frac{16}{3} = 5.33.$$
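The variance and covariance examples can be verified the same way; note that Eq. (4.3) divides by N while Eq. (4.4) divides by N − 1, which matches NumPy's defaults for np.var and np.cov, respectively. A small verification sketch:

import numpy as np

X = np.array([2, 4, 2, 8])
Y = np.array([1, 3, 5, 7])

print(np.var(X))           # 6.0   -- population variance, Eq. (4.3)
print(np.cov(X, Y)[0, 1])  # ~5.33 -- sample covariance with N-1, Eq. (4.4)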

4.2.1.4 Maximum Likelihood classification


Maximum Likelihood is a statistical method for estimating the parameter probability dis­
tribution based on observed data points [20]. In the context of linear discriminant analysis,
Maximum Likelihood estimates the parameters of Gaussian distributions for each class
based on the training data. Then, given a new dataset, Gaussian distributions are used to
calculate the probability of assigning new data points to each class. In image classification, for example, the assumed class-conditional distributions are used to compute the probability that a given pixel belongs to each class, and the pixel is assigned to the class with the highest probability. To obtain the likelihood of the observed data points, we use the Gaussian density:
$$f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right). \qquad (4.5)$$

Symbols that come after the semicolon in f (x; μ, σ ) show the parameters of distribu­
tion, μ is the mean and σ is the Standard Deviation.
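A one-line implementation of the density in Eq. (4.5) makes the Maximum Likelihood idea concrete: once μ and σ are estimated for a class, new observations can be scored under that class. The values plugged in below are simply the mean and standard deviation from Section 4.2.1.1 and are used only for illustration:

import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density, Eq. (4.5)."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

print(gaussian_pdf(6.0, mu=5.0, sigma=2.28))   # ~0.16: density of observing x = 6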

4.2.2 Fisher’s linear discriminant analysis


Fisher’s linear discriminant analysis (FLDA) is a statistical method used in dimension re­
duction, classification, and pattern recognition. This technique was developed by Ronald
A. Fisher in 1936; since then, it has been used in many applications like image classifi­
cation, face recognition, and medical cases [2]. FLDA aims to find a linear function that
maximizes the ratio of between-class variance and minimizes the within-class variance
[23]. With this technique, we can project the data in the lower dimension while keeping
the maximum possible distance between classes.

4.2.2.1 Key differences between FLDA and LDA


Linear and Fisher linear discriminant analyses are used for linear dimensionality reduction
and classification; the two methods are similar but have some key differences [23]:
• Objective function:
The LDA provides maximum separability by minimizing the within-class variance and
maximizing the between-class variance. At the same time, FLDA seeks a linear pro­
jection that maximizes the ratio of between-class variance to the sum of within-class
variance.
• Data distribution:
LDA expects the data to be normally distributed, but FLDA has no expectations.
• Regularization:
LDA does not have a regularization term, while FLDA assumes a regularization term
that can help control overfitting.

4.2.3 Linear algebra explanation

FIGURE 4.1 The relation of between-class and within-class variance.

In the first step of linear discriminant analysis, we need to determine the mean of each
class and the overall mean. In the next step, we need to determine the scatter matrix for
within-class variance (SW ) and between-classes (SB ) [5] (see Fig. 4.1). The goal of linear
discriminant analysis is to find a linear projection that maximizes the ratio of between-class scat­
ter to within-class scatter. Thus we must compute the eigenvectors and eigenvalues of
$S_W^{-1} S_B$ [19].
Proceeding with these steps, the objective is:
$$J(w) = \frac{w^T S_B w}{w^T S_W w}. \qquad (4.6)$$
To optimize w, we take the derivative of this fraction and set it to zero:
$$\frac{\partial J}{\partial w} = 0. \qquad (4.7)$$
The derivative of a quotient can be expressed as:
$$\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{g(x)f'(x) - f(x)g'(x)}{g(x)^2}, \qquad (4.8)$$
which, applied to (4.6) with the numerator of the derivative set to zero, gives:
$$(W^T S_W W)\, S_B W - (W^T S_B W)\, S_W W = 0, \qquad (4.9)$$
then, by dividing both sides by $(W^T S_W W)$:
$$S_B W - J\, S_W W = 0 \;\Rightarrow\; S_W^{-1} S_B W = J W. \qquad (4.10)$$
Here J is an eigenvalue and $W_i$ is an eigenvector of $S_W^{-1} S_B$. That is an overview of how linear discriminant analysis works. To determine $S_B$, we first calculate the mean of each group, $\mu_1$ and $\mu_2$, and the total mean
$$\mu = \frac{n_1}{n_1+n_2}\,\mu_1 + \frac{n_2}{n_1+n_2}\,\mu_2,$$
where $n_1$ is the number of rows in $w_1$ and $n_2$ is the number of rows in $w_2$:
$$S_{B_i} = n_i (\mu_i - \mu)^T (\mu_i - \mu), \qquad (4.11)$$
$$S_B = \sum_i S_{B_i}. \qquad (4.12)$$

Next, we compute $S_W$. For each class we center the data at its mean and denote the centered data by $d_j$:
$$S_{W_j} = d_j^T d_j, \qquad (4.13)$$
$$S_W = \sum_j S_{W_j}. \qquad (4.14)$$
With $S_W$ and $S_B$ we can determine the eigenvalues and eigenvectors of $S_W^{-1} S_B$. The data can then be projected onto a lower dimension where it can be worked with more easily.

4.2.3.1 Eigenvectors and eigenvalues


Eigenvectors and eigenvalues are powerful tools with numerous applications in linear alge­bra, the branch of mathematics concerned with linear equations and their properties. In linear discriminant analysis, eigenvectors and eigenvalues transform the data to a lower-dimensional space with
maximum separability between and within classes [22]. A is an n × n matrix. To determine
the eigenvalues and eigenvectors of A, we must address the task of solving the character­
istic equation:
det (A − λI ) = 0,

where I is the identity matrix and λ is an eigenvalue.

Example:
$$A = \begin{bmatrix} 2 & -4 \\ -1 & -1 \end{bmatrix},$$
$$\det(A - \lambda I) = \det\left(\begin{bmatrix} 2 & -4 \\ -1 & -1 \end{bmatrix} - \lambda\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right) = \det\begin{bmatrix} 2-\lambda & -4 \\ -1 & -1-\lambda \end{bmatrix} = \lambda^2 - \lambda - 6 = 0$$
$$\Rightarrow \lambda_1 = 3, \qquad \lambda_2 = -2.$$

To determine the eigenvectors, we substitute each eigenvalue into the equation $(A - \lambda_i I)x_i = 0$, where A is the matrix, $\lambda_i$ is the eigenvalue, I is the identity matrix, and $x_i = \begin{bmatrix}x_1\\ x_2\end{bmatrix}$ is the eigenvector associated with $\lambda_i$. For $\lambda_1 = 3$, we have:
$$\left(\begin{bmatrix}2 & -4\\ -1 & -1\end{bmatrix} - 3\begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix}\right)\begin{bmatrix}x_1\\ x_2\end{bmatrix} = 0 \;\Rightarrow\; \begin{bmatrix}-1 & -4\\ -1 & -4\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix} = \begin{bmatrix}0\\ 0\end{bmatrix},$$
$$-x_1 - 4x_2 = 0, \qquad -x_1 - 4x_2 = 0.$$
These two equations are identical. Setting $x_2 = s$ gives $x_1 = -4s$, so all eigenvectors of $\lambda_1 = 3$ are multiples of the vector
$$\begin{bmatrix}-4s\\ s\end{bmatrix}.$$
For $\lambda_2 = -2$, we have:
$$\left(\begin{bmatrix}2 & -4\\ -1 & -1\end{bmatrix} + 2\begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix}\right)\begin{bmatrix}x_1\\ x_2\end{bmatrix} = 0 \;\Rightarrow\; \begin{bmatrix}4 & -4\\ -1 & 1\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix} = \begin{bmatrix}0\\ 0\end{bmatrix},$$
$$4x_1 - 4x_2 = 0, \qquad -x_1 + x_2 = 0.$$
These two equations are equivalent, so $x_1 = x_2 = t$, and all eigenvectors of $\lambda_2 = -2$ are multiples of the vector
$$\begin{bmatrix}t\\ t\end{bmatrix}.$$
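The same eigenvalues and eigenvectors can be checked numerically; a minimal NumPy sketch (the printed order of the eigenvalues may differ, and the eigenvector columns are returned normalized, i.e., as scaled versions of [-4, 1] and [1, 1]):

import numpy as np

A = np.array([[2.0, -4.0],
              [-1.0, -1.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # 3 and -2
print(eigenvectors)   # columns proportional to [-4, 1] and [1, 1]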

4.2.3.2 Examples
Consider the sample data from two classes $w_1$ and $w_2$ as follows:
$$w_1 = \begin{bmatrix}1 & 2\\ 2 & 3\\ 3 & 3\\ 4 & 5\\ 5 & 5\end{bmatrix}, \qquad w_2 = \begin{bmatrix}4 & 2\\ 5 & 0\\ 5 & 2\\ 3 & 2\\ 5 & 3\\ 6 & 3\end{bmatrix}.$$

Calculate the mean for each class. Estimating the mean of each class is an essential step in LDA that lets us understand the data distribution and identify the dimensions that best separate the classes:
$$\mu_1 = \begin{bmatrix}3 & 3.6\end{bmatrix}, \qquad \mu_2 = \begin{bmatrix}4.67 & 2\end{bmatrix},$$
where $n_i$ is the number of rows in class i and n is the total number of samples, $n = n_1 + n_2$. The overall mean is
$$\mu = \frac{n_1}{n}\mu_1 + \frac{n_2}{n}\mu_2 = \frac{5}{11}\mu_1 + \frac{6}{11}\mu_2 = \begin{bmatrix}3.91 & 2.727\end{bmatrix}.$$
During this step, we need to determine $S_B$ for each class. According to formula (4.11):
$$S_{B_1} = 5\begin{bmatrix}-0.91 & 0.87\end{bmatrix}^T\begin{bmatrix}-0.91 & 0.87\end{bmatrix} = \begin{bmatrix}4.13 & -3.97\\ -3.97 & 3.81\end{bmatrix}.$$
Repeating the same steps for $S_{B_2}$:
$$S_{B_2} = \begin{bmatrix}3.44 & -3.31\\ -3.31 & 3.17\end{bmatrix}.$$
$S_B$ is defined by the summation of the $S_{B_i}$ (4.12):
$$S_B = \begin{bmatrix}7.58 & -7.27\\ -7.27 & 6.98\end{bmatrix}.$$

The within-class scatter matrix ($S_W$) must be determined to calculate the eigenvalues and eigenvectors. To compute it, we first subtract each class mean from the data of that class:
$$d_1 = \begin{bmatrix}-2 & -1.6\\ -1 & -0.6\\ 0 & -0.6\\ 1 & 1.4\\ 2 & 1.4\end{bmatrix}, \qquad d_2 = \begin{bmatrix}-0.67 & 0\\ 0.33 & -2\\ 0.33 & 0\\ -1.67 & 0\\ 0.33 & 1\\ 1.33 & 1\end{bmatrix}.$$
Now, determine the $S_{W_i}$ from formula (4.13); we find $S_{W_1}$ and $S_{W_2}$:
$$S_{W_1} = \begin{bmatrix}10 & 8\\ 8 & 7.2\end{bmatrix}, \qquad S_{W_2} = \begin{bmatrix}5.3 & 1\\ 1 & 6\end{bmatrix}.$$
$S_W$ is given by the summation of the $S_{W_j}$ (4.14):
$$S_W = \begin{bmatrix}15.3 & 9\\ 9 & 13.2\end{bmatrix},$$
and we need to place $S_W^{-1}$ and $S_B$ into formula (4.10) to determine the eigenvalues and eigenvectors. First, we invert $S_W$ and insert it into (4.10):
$$S_W^{-1} = \begin{bmatrix}0.11 & -0.07\\ -0.07 & 0.13\end{bmatrix}, \quad S_B = \begin{bmatrix}7.58 & -7.27\\ -7.27 & 6.98\end{bmatrix} \;\Rightarrow\; S_W^{-1} S_B = \begin{bmatrix}1.37 & -1.32\\ -1.49 & 1.43\end{bmatrix}$$
$$\Rightarrow \lambda_1 = 2.81, \qquad \lambda_2 = -0.0027,$$
$$W = \begin{bmatrix}0.68 & -0.69\\ -0.74 & -0.72\end{bmatrix}.$$
In this example, we want to reduce the dimension from two-dimensional space to one-dimensional space. To achieve this, we project with $y_i = w_i W_1$, selecting the eigenvector $W_1$ associated with the larger eigenvalue, which best achieves the expected dimensionality reduction:
$$y_1 = \begin{bmatrix}1 & 2\\ 2 & 3\\ 3 & 3\\ 4 & 5\\ 5 & 5\end{bmatrix}\begin{bmatrix}0.68\\ -0.74\end{bmatrix} = \begin{bmatrix}-0.79\\ -0.85\\ -0.18\\ -0.97\\ -0.29\end{bmatrix}, \qquad y_2 = w_2 W_1 = \begin{bmatrix}1.24\\ 3.39\\ 1.92\\ 0.56\\ 1.18\\ 1.86\end{bmatrix}.$$
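The whole worked example can be reproduced in a few lines of NumPy; the sketch below follows Eqs. (4.11)-(4.14) and (4.10) directly and is meant as a companion to the hand calculation, not as a production implementation:

import numpy as np

w1 = np.array([[1, 2], [2, 3], [3, 3], [4, 5], [5, 5]], dtype=float)
w2 = np.array([[4, 2], [5, 0], [5, 2], [3, 2], [5, 3], [6, 3]], dtype=float)

mu1, mu2 = w1.mean(axis=0), w2.mean(axis=0)
mu = (len(w1) * mu1 + len(w2) * mu2) / (len(w1) + len(w2))

# Between-class scatter, Eqs. (4.11)-(4.12)
SB = len(w1) * np.outer(mu1 - mu, mu1 - mu) + len(w2) * np.outer(mu2 - mu, mu2 - mu)

# Within-class scatter, Eqs. (4.13)-(4.14)
d1, d2 = w1 - mu1, w2 - mu2
SW = d1.T @ d1 + d2.T @ d2

# Eigendecomposition of S_W^{-1} S_B, Eq. (4.10)
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
W1 = eigvecs[:, np.argmax(eigvals.real)]   # direction with the largest eigenvalue

# Project both classes onto the leading discriminant direction
y1, y2 = w1 @ W1, w2 @ W1
print(eigvals)   # approximately 2.8 and ~0
print(y1, y2)    # matches the hand-computed projections up to sign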

4.3 The advanced linear discriminant analysis algorithm


4.3.1 Statistical explanation
Standard regression methods cannot produce meaningful predictions for qualitative (categorical) responses. Hence,
we use a technique better suited to qualitative data, such as Linear Discriminant Analysis.

4.3.2 Linear discriminant analysis compared to principal component


analysis
Linear Discriminant Analysis is like Principal Component Analysis (PCA) in some ways;
both algorithms reduce dimensions. However, while Linear Discriminant Analysis focuses on
maximizing the separability among the categories, PCA chooses new lower-dimensional axes
that preserve as much of the data's variance as possible. Another major difference between these
two algorithms is that linear discriminant analysis is a supervised dimension reduction
technique. In contrast, principal component analysis is an unsupervised technique for di­mension reduction [3].

FIGURE 4.2 LDA versus PCA on the Iris dataset.

The Iris dataset is a small dataset that is commonly used for classification. When applied
to the Iris dataset, LDA and PCA both obtain good results, as illustrated in Fig. 4.2. That is
because Iris is a simple dataset with only four features; on more complex datasets, the differ­
ences and advantages of each dimension reduction method become more apparent. This will be
discussed in Section 4.5.4.
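A short Scikit-Learn sketch of the comparison behind Fig. 4.2 (the dataset and both estimators come from Scikit-Learn; the plotting details are incidental and only one of several reasonable ways to draw the figure):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)                             # unsupervised: ignores y
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)   # supervised: uses y

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y); ax1.set_title("PCA")
ax2.scatter(X_lda[:, 0], X_lda[:, 1], c=y); ax2.set_title("LDA")
plt.show()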

4.3.3 Quadratic discriminant analysis


As the name suggests, Quadratic Discriminant Analysis (QDA) produces a quadratic decision sur­face. It is almost like Linear Discriminant Analysis, except that the assumed covariance ma­trix can differ between classes [4]. Both quadratic and linear discriminant analysis assume that each class has an un­derlying Gaussian distribution; Linear Discriminant Analysis is the particular case of QDA in which the same covariance matrix is assumed for all classes, $\Sigma_k = \Sigma$ for all $k$.
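For completeness, Scikit-Learn exposes both estimators, so the effect of allowing a separate covariance matrix per class can be checked directly; the sketch below is only an illustration on the small Iris dataset, not a general statement about which classifier performs better:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
print(cross_val_score(LinearDiscriminantAnalysis(), X, y).mean())      # shared covariance matrix
print(cross_val_score(QuadraticDiscriminantAnalysis(), X, y).mean())   # one covariance per class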

4.4 Implementing the linear discriminant analysis algorithm


4.4.1 Using LDA with Scikit-Learn
The linear discriminant analysis technique is easy to implement, which makes it an ac­
cepted method among data scientists. Scikit-Learn gives access to the linear discriminant
analysis algorithm for the projection of datasets from a higher dimension to a lower dimen­
sion with maximum separability; linear discriminant analysis is a classifier with a linear
decision boundary. The Scikit-Learn library provides a number of parameters and at­
tributes for linear discriminant analysis.
4.5 LDA parameters and attributes in Scikit-Learn


4.5.1 Parameter options
• n_components: This parameter sets the number of components to keep after di­mension reduction. In Scikit-Learn, n_components is set to None by default and should be an integer.
• Solver: This parameter chooses the algorithm for linear discriminant analysis. Scikit­
Learn has three possible options to use (svd, lsqr, eigen), and the default is svd.
• tol: This parameter determines the tolerance for a singular value and is only used if
solver is svd. The type of tol is float and by default is set to 1.0e−4 .
• store_covariance: This specifies whether to compute and store the class covariance matrices. The type of store_covariance is Boolean, and it is set to False by default.
• priors: Another important hyperparameter is priors, which specifies the prior class prob­abilities. With this parameter, we can put a manual weight on each class during classi­fication; the default value of priors is None.
• shrinkage: Applies shrinkage when estimating the covariance matrices. This pa­rameter can help to prevent overfitting when the number of features is larger than the number of samples; the default value of shrinkage is None, but it can be set to 'auto' or a float (see the sketch after this list).
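A small sketch of how these parameters combine in practice; the 'eigen' solver is chosen here because it supports shrinkage (the default 'svd' solver does not), and the specific values are illustrative only:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Keep two discriminant directions, use the eigenvalue-decomposition solver,
# and let Scikit-Learn choose the shrinkage intensity automatically.
lda = LinearDiscriminantAnalysis(n_components=2, solver="eigen", shrinkage="auto")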

4.5.2 Attributes option


• classes_: This attribute stores the unique class labels in training data.
• means_: This attribute stores the mean value of each feature of each class in the training
set.
• coef_: This attribute stores the weight vector(s) of the linear decision boundary that separates the classes. coef_ is a NumPy array.
• priors_: Represents the prior probabilities of each class; it has an array shape.
• n_features_in_: This attribute stores the number of features in the training dataset. Its type is integer, and it can be accessed after fitting with the .fit() method.
• explained_variance_ratio_: This attribute returns the percentage of explained variance
by each of the selected n_components; this attribute is only available when ‘eigen’ or
‘svd’ solver is used.
4.5.2.1 Using linear discriminant analysis


# Import LDA from sklearn.discriminant_analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Instantiate the LDA estimator
lda = LinearDiscriminantAnalysis()

# Fit LDA on the data: fit() takes the feature matrix x and the class labels y
lda.fit(x, y)

4.5.3 Worked example of linear discriminant analysis algorithm for


dimensionality
An example code shows how to fit linear discriminant analysis on datasets.

# Import libraries
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Load dataset
data = load_digits()

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data.data,
                                                    data.target, test_size=0.3)

# Create an LDA estimator
lda = LinearDiscriminantAnalysis()

# Fit the LDA algorithm on the training data
lda.fit(X_train, y_train)

4.5.4 Fitting LDA algorithm on MNIST dataset


Another application of linear discriminant analysis is classifying the MNIST dataset; the
MNIST dataset contains 70 000 images of handwritten digits zero to nine, split into two
categories, data and targets (Fig. 4.3). The main goal of the MNIST dataset is to develop an
algorithm that can classify digits correctly.

FIGURE 4.3 MNIST dataset includes 70 000 images of 10 unique handwritten digits.

1. Importing important libraries:

import numpy as np
from sklearn.datasets import fetch_openml  # MNIST dataset via scikit-learn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

2. Loading MNIST dataset from scikit-learn library:

mnist = fetch_openml('mnist_784')
X = mnist.data
y = mnist.target

3. Fit LDA on MNIST dataset:


from sklearn.model_selection import train_test_split

# Use 1000 training samples; the remaining digits form the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=1000, shuffle=True, random_state=42)

# Fit LDA on the training data, then transform it.
lda = LDA(n_components=2)
X_lda = lda.fit(X_train, y_train).transform(X_train)

4. Produce scatter plot:

px.scatter(x=X_lda[:, 0], y=X_lda[:, 1], color=y_train,
           width=800, height=400)

FIGURE 4.4 Fitting LDA on 1000 data points of MNIST dataset.

Linear discriminant analysis performance depends on various factors; one factor is the
amount of data available. As the dataset grows, the difference between LDA and PCA
performance becomes more pronounced.
As is evident from Figs. 4.4 and 4.5, the LDA performance is much better than PCA on
the MNIST dataset. While PCA fails to separate the classes clearly, LDA achieves a
lower dimensionality with maximum separability.

4.5.5 LDA advantages and limitations


LDA is a great supervised algorithm for dimension reduction and classification. However,
like any other algorithm, it has some limitations [4]:
FIGURE 4.5 Fitting PCA on 1000 data points of MNIST dataset.

• Restricted to linear boundaries:


LDA assumes the decision boundary between classes is linear. It may perform inaccu­
rately when the true decision boundary is nonlinear.
• Not appropriate for imbalanced datasets:
LDA needs balanced data to work correctly. If there are not enough sam­ples compared to the number of features, LDA cannot work efficiently. LDA assumes the data is normally distributed with an equal covariance matrix for each class, which often does not hold for imbalanced datasets.
• Normality of data:
If the data does not follow a normal distribution, LDA cannot perform efficiently, because its underlying assumptions no longer hold.
• Sensitive to outliers: LDA can be affected by outliers in the data. LDA assumes the co­variance matrices of the classes are equal, and outliers can distort the estimated covariance matrices and degrade LDA performance.

4.5.5.1 Future linear discriminant analysis algorithm


The linear discriminant analysis algorithm is a popular algorithm for dimension reduction
and class­fication; this algorithm was formulated for the first time in 1936; while the basic
algorithm remained the same, the LDA has seen some improvement over the years, but
still needs more in the future.
The performance of the linear discriminant analysis algorithm in detecting outliers
could improve; at this moment, LDA presumes the data follow a normal distribution even
when data has outliers that make the result inaccurate.
The second potential improvement could be the scaling; implementing the linear dis­
criminant analysis on high-dimensional data can be computationally expensive. A future
version of the LDA can merge distributed computing methods to obtain a better result in
scaling.
4.6 Conclusion
Linear discriminant analysis is a powerful and efficient method for dimension reduction
and classification; this method can be used in many fields, such as image recognition, doc­ument classification, and image segmentation. The linear discriminant analysis technique is easy to use and often provides competitive prediction accuracy compared with other methods.

References
[1] S. Balakrishnama, A. Ganapathiraju, Linear Discriminant Analysis: A Brief Tutorial, 1998.
[2] C. Li, B. Wang, Fisher Linear Discriminant Analysis, 2014.
[3] S. Molin, Hands-On Data Analysis with Pandas: Efficiently Perform Data Collection, Wrangling, Anal­
ysis, and Visualization Using Python, Packt Publishing, 2019.
[4] G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning: With Applications
in R, second edition, Springer, 2021.
[5] C.M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics,
Springer, 2006.
[6] J. Ye, R. Janardan, Q. Li, Two-dimensional linear discriminant analysis, in: L. Saul, Y. Weiss, L. Bottou
(Eds.), Advances in Neural Information Processing Systems, vol. 17, 2004.
[7] M.M. Solomon, M.S. Meena, J. Kaur, Challenges in face recognition systems, International Journal of
Recent Technology and Engineering (2020).
[8] B. Bozorgtabar, H. Azami, F. Noorian, Illumination invariant face recognition using fuzzy LDA and
FFNN, International Journal of Computer Applications (2012).
[9] R. Kumar, S. Ohol, Linear discriminant analysis for human face recognition, International Journal of Computer Applications (2017).
[10] S. Thakur, J.K. Sing, D.K. Basu, M. Nasipuri, Face recognition using kernel Fisher linear discriminant
analysis and RBF neural network, International Journal of Pattern Recognition and Artificial Intelli­
gence 26 (03) (2012) 1250019.
[11] A.H. Hadian Rasanan, S. Nedaei Janbesaraei, D. Baleanu, Fractional Chebyshev kernel functions: the­
ory and application, in: J.A. Rad, K. Parand, S. Chakraverty (Eds.), Learning with Fractional Orthogonal
Kernel Classifiers in Support Vector Machines, in: Industrial and Applied Mathematics, Springer, Sin­
gapore, 2023.
[12] A.H. Hadian Rasanan, J. Amani Rad, M.S. Tameh, A. Atangana, Fractional Jacobi kernel functions:
theory and application, in: J. Amani Rad, K. Parand, S. Chakraverty (Eds.), Learning with Fractional
Orthogonal Kernel Classifiers in Support Vector Machines, in: Industrial and Applied Mathematics,
Springer, Singapore, 2023.
[13] A.H. Hadian Rasanan, S. Nedaei Janbesaraei, A. Azmoon, M. Akhavan, J. Amani Rad, Classifica­
tion using orthogonal kernel functions: tutorial on ORSVM package, in: J. Amani Rad, K. Parand,
S. Chakraverty (Eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support Vector Ma­
chines, in: Industrial and Applied Mathematics, Springer, Singapore, 2023.
[14] J.A. Rad, S. Chakraverty, K. Parand, Learning with Fractional Orthogonal Kernel Classifiers in Support
Vector Machines: Theory, Algorithms, and Applications, Springer, 2023.
[15] W. Härdle, L. Simar, Applied Multivariate Statistical Analysis, 3rd edition, Springer, 2012.
[16] P.J. Pardo, A.P. Georgopoulos, J.T. Kenny, T.A. Stuve, L. Robert, A classification of adolescent psychotic disorders using linear discriminant analysis, 2006.
[17] N. Rathore, P.K. Jain, M. Parida, A sustainable model for emergency medical services in developing
countries: a novel approach using partial outsourcing and machine learning, 2022.
[18] J.-G. Yang, J.-K. Kim, U.-G. Kang, Coronary heart disease optimization system on adaptive-network­
based fuzzy inference system and linear discriminant analysis (ANFIS-LDA), 2013.
[19] J. Friedman, T. Hastie, R. Tibshirani, The elements of statistical learning: data mining, inference, and
prediction, Springer, 2001.
[20] H. Qarehdaghi, J.A.E.Z-C. DM Rad, Fast, simple, robust, and accurate estimation of circular diffusion
model parameters, Psychonomic Bulletin & Review (2024), in press.
[21] C. Ricciardi, G. Cerullo, F. Fornarelli, G. Iannucci, Linear discriminant analysis for coronary artery
disease prediction, Journal of Medical Systems (2020).
[22] R.A. Horn, C.R. Johnson, Matrix Analysis, Cambridge University Press, 2012.
[23] A. Glenberg, M. Andrzejewski, An Introduction to Statistical Reasoning, second edition, Routledge,
2018.
[24] R.A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (2) (1936)
179--188.
5
Linear local embedding
Pouya Jafari a, Ehsan Espandar a, Fatemeh Baharifard b, and
Snehashish Chakraverty c
a Department of Computer Science, Iran University of Science and Technology, Tehran, Iran b School of

Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran c Department of
Mathematics, National Institute of Technology Rourkela, Rourkela, Odisha, India

5.1 Introduction
5.1.1 What is nonlinear dimensionality reduction?
Dimension reduction refers to a collection of techniques used to transform high-dimen­
sional data into a lower-dimensional space while maintaining the relationships and struc­
ture between data points. This is carried out because high-dimensional data can be dif­
ficult and expensive to analyze and interpret, due to the sheer number of dimensions
involved. Dimension reduction techniques aim to reduce the dimensionality of these
datasets while retaining as much relevant information as possible and thereby simplify­
ing the data analysis process [1,2].
Dimensionality reduction techniques can be broadly categorized as linear and nonlin­
ear methods. Linear methods attempt to preserve the most variance in the data through
a linear transformation. One example of a widely used linear method is Principal Com­
ponent Analysis (PCA) [3--6], which finds a set of orthogonal axes that capture the most
variance in the data and projects the data onto those axes. PCA can be used to reduce the
noise in images. Another method for dimension reduction is Linear Discriminant Analysis
(LDA) [7]. Linear discriminant analysis’s goal is to maximize the separation between classes
and minimize the variation within each class. This means LDA is a supervised method that
finds a linear combination of the original features and maps the data into a lower di­
mension. By applying PCA or LDA to the image data, you can identify the most important
features, remove the noise, and make machine learning algorithms faster. This can be use­
ful in a variety of applications such as facial recognition, object detection, and medical
imaging.
Linear dimensionality reduction methods such as PCA assume the data lies on a linear subspace. How­
ever, in many situations, data can lie in a nonlinear subspace. In these cases, data struc­
tures like nonlinear manifolds, nonlinear interactions, non-Gaussian distributions, and
high-dimensional data may not be properly recognized by linear methods. Hence, non­
linear dimension reduction methods attempt to preserve nonlinear relationships between
data points by preserving the local structure of the data. One example of a nonlinear
method is Locally Linear Embedding (LLE) [8], which represents each data point as a lin­
ear combination of its neighbors, which is a nonlinear relationship between points, and
then finds a low-dimensional embedding by optimizing a cost function that preserves
the local structure of the data. LLE can be used to identify the underlying structure of
high-dimensional data, including images, texts, and gene expressions. By reducing the di­
mensionality of the data, LLE can identify the most important features that contribute to
image recognition, natural language processing and particular diseases. For an example of
applying the LLE method, we have embedded high-dimensional data like handwritten im­
ages of digits ‘4’ and ‘9’ by LLE and transformed our images into two-dimensional space in
Fig. 5.1. This figure indicates that images of digit ‘4’ are close to each other and ‘9’s behave
in the same way.

FIGURE 5.1 Dimension reduction on MNIST dataset using LLE for digits ‘4’ and ‘9’.

Nonlinear dimensionality reduction techniques have a wide range of applications in


fields such as computer vision, natural language processing, and bioinformatics. For ex­
ample, these techniques are used for feature extraction and image classification in com­
puter vision, document clustering, and topic modeling in natural language processing, and
gene expression analysis and drug discovery in bioinformatics [1,9].

5.1.2 Why do we need nonlinear dimensionality reduction?


High-dimensional data is ubiquitous in many fields, such as biology, finance, and com­
puter science [10]. This data can be difficult to visualize, process, and analyze due to its
high dimensionality. Furthermore, high-dimensional data may contain complex nonlin­
ear relationships between features, which can be difficult to model with linear methods.
Dimensionality reduction techniques can alleviate these challenges by transforming
high-dimensional data into lower-dimensional spaces while preserving its essential char­
acteristics. Linear dimensionality reduction methods, such as PCA and LDA, are widely
used due to their efficiency and ease of implementation. However, they make the assump­
tion that the relationship between the features is linear, which may not hold for many
real-world datasets including image datasets.
Nonlinear dimensionality reduction techniques are needed to capture the nonlinear
relationships between features that may exist in high-dimensional data. These techniques
model the data as a nonlinear function of the input variables, allowing for a more accurate
representation of the data structure. Furthermore, nonlinear dimensionality reduction
techniques can reveal the underlying structure of the data, which may not be visible in
the original high-dimensional space.
For example, consider a dataset of images of handwritten digits. In the high-dimen­
sional space, each image is represented as a vector of pixel values. Linear dimensionality
reduction techniques may not be able to capture the complex relationships between the
pixels that correspond to the curvature of the digit strokes or the angles between them.
Nonlinear techniques, such as t-SNE [11] or Isomap [2], can capture these relationships
and produce a more informative visualization of the data.
Nonlinear dimensionality reduction techniques can also be useful for clustering or
classification tasks. In high-dimensional data, the inherent structure of the data may be
obscured by noise or irrelevant features. Nonlinear dimensionality reduction can help
identify the relevant features and reduce the impact of noise in the data, leading to more
accurate clustering or classification results.
In conclusion, nonlinear dimensionality reduction techniques are necessary to han­
dle high-dimensional data with complex nonlinear relationships between features. These
techniques can capture the underlying structure of the data, reveal the relevant features,
and reduce the impact of noise, leading to more accurate modeling, visualization, and
analysis of the data.

5.1.3 What is embedding?


An embedding is a mathematical concept used in machine learning, data analysis, nat­
ural language processing, and computer vision to represent complex data or objects in
a lower-dimensional space [12,13]. The goal is to reduce the dimensionality of the data
while preserving certain properties or relationships between the objects. There are several
techniques for constructing embeddings, but they all aim to maintain the structure of the
original data.
One common approach for constructing embeddings is to use dimensionality reduc­
tion techniques such as multi-dimensional scaling (MDS), PCA, LLE or t-SNE. These meth­
ods aim to find a lower-dimensional representation of the data that captures as much of
the original structure as possible. For instance, if one needs to visualize a dataset of high­
dimensional points in a two-dimensional plot, applying PCA or MDS can help obtain a
two-dimensional embedding that shows the structure of the data. One interesting use of
LLE is microarray data analysis [14]. The microarray data are high dimensional in nature
and we can use them for cancer classification. t-SNE is particularly useful when visualizing
high-dimensional data in two or three dimensions.
Machine learning algorithms can also learn embeddings from data. The primary goal
is to learn an embedding that is useful for tasks such as classification, clustering, or rec­
ommendation. For instance, in natural language processing, word embeddings are used
to represent words as vectors that capture their semantic meaning. These embeddings
can be used as input to neural networks for sentiment analysis, language translation, or
question-answering.
In image recognition, deep convolutional neural networks can be used to learn feature
representations of images that capture important features of the images, such as edges,
textures, and shapes. These embeddings can be used for tasks such as object recognition
or image captioning.
In summary, embeddings are a crucial component of many modern data-driven tech­
nologies. They are constructed using various techniques, including dimensionality reduc­
tion and machine learning algorithms, and are useful for several applications in fields such
as machine learning, data analysis, and computer vision.

5.2 Locally linear embedding


One of the most important tasks in the data science world is feature selection and extrac­
tion. Let us imagine you have a dataset with thousands of features and you want to train
a model on it. This number of features can be a problem. The computation power needed
to train on such data is too high and not efficient. Feature selection [15], refers to choos­
ing a subset of our features that are more useful to us. However, sometimes finding these
features is not that easy and features can be complicated. In this case, we have another
concept named feature extraction. Feature extraction algorithms [16], are techniques to
make data more relevant and easy to analyze and compute. One of these techniques that
is a nonlinear algorithm is Locally Linear Embedding.
Locally linear embedding (LLE) [8] is an embedding method, which is a dimension
reduction method that we can use for manifold learning and feature extraction. The al­
gorithm is based on a simple principle, ``the points that are close to each other in high­
dimensional space must be close to each other in reduced space'' and this principle also
works for far points. You can imagine that LLE takes a folded line and straightens it in a
way that close points in that folded line are still close in a straight line in embedded space.
For illustration, in Fig. 5.2 there are sphere shaped data points in R3 space and with LLE
we can transform this sphere into a R2 space and by saving the features, the points that
were close to each other in R3 space are still close to each other in R2 space. In fact, we just
unfolded the sphere with the LLE algorithm. Hence, we can summarize the LLE algorithm
in the three following steps:
1. For each high-dimensional data point {x_i ∈ R^d}_{i=1}^{n}, find its k nearest neighbors (we will
discuss this algorithm in the next section).
2. Find weights between every point and its neighbors.
3. Find {y_i ∈ R^p}_{i=1}^{n}, which are the new coordinates in reduced space, where p ≪ d.

FIGURE 5.2 Unfolding (embedding) a sphere from 3D space into 2D space with LLE.

5.2.1 k nearest neighbors


A classic supervised machine learning algorithm for classification is the k-nearest-neighbors algorithm [17]. k nearest neighbors (k-NN) is a non-parametric supervised algorithm, which means there is no learnable parameter in this algorithm. The idea is that, when we want to predict which class a new point x belongs to, we find its k nearest neighbors according to a distance metric and predict the class that appears most often among those neighbors. As you may have guessed, the k-NN algorithm does most of its computation at test time rather than at training time. The parameter k is a hyperparameter that we should choose based on our data and experiments.
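To make the neighbor search concrete, the short sketch below uses scikit-learn's NearestNeighbors on synthetic data; the value k = 5, the array shapes, and the variable names are purely illustrative. This is the same kind of neighbor query that LLE performs in its first step.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic data: 100 points in R^3 (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Find the k = 5 nearest neighbors of every point (Euclidean metric by default)
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is returned as its own neighbor
distances, indices = nn.kneighbors(X)

# Drop the first column (the point itself) to keep only the k true neighbors
neighbor_idx = indices[:, 1:]
print(neighbor_idx.shape)   # (100, 5)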
As mentioned, different metrics can be used to calculate the distance between points,
some of which will be explained below.

5.2.2 Distance metrics


A distance metric is a function that defines the distance between two elements, which can be anything from data points in R^N to words in a dictionary. In this case, our elements are points in R^N, and if d(x, y) is a distance function, it must have these properties:
1. The distance from a point to itself is 0: d(x, x) = 0.
2. The distance between two points cannot be negative: d(x, y) ≥ 0.
3. The distance between x and y must be the same as y and x: d(x, y) = d(y, x).
4. For points x, y, and z: d(x, z) ≤ d(x, y) + d(y, z).
Here, we explain some common metrics.
• Minkowski distance: Minkowski distance is a measure to calculate the distance between
two points in N-dimensional space. You may have heard of the Euclidean distance or the Manhattan distance; both are special cases of the Minkowski distance. The general formula of the Minkowski distance is:

$$\mathrm{distance}(X, Y) = \left( \sum_{i=1}^{N} |x_i - y_i|^r \right)^{\frac{1}{r}},$$

where X and Y are two arbitrary N-dimensional points. Here, if you put r = 1 you obtain
Manhattan distance, and r = 2 you obtain Euclidean distance.
• Jaccard distance: In Minkowski distance, we measure the distance between points in
R^N, but what if we want to define a metric for sets? For this, we introduce the Jaccard index and, from it, the Jaccard distance. The Jaccard index is:

$$J_{\mathrm{index}}(A, B) = \frac{|A \cap B|}{|A \cup B|},$$

where A and B are two arbitrary sets. This index measures the similarity between two sets, but for a distance we need the opposite, hence:

$$\mathrm{distance}(A, B) = 1 - J_{\mathrm{index}}(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}.$$

• Cosine distance: The cosine distance compares points by direction, so points that are close in the cosine sense can be very far apart in the Minkowski sense. It is based on the cosine of the angle between two vectors, computed from their dot product. For two vectors X and Y, each with N elements, the cosine similarity is:

$$\cos(\alpha) = \frac{X \cdot Y}{\|X\|\,\|Y\|} = \frac{\sum_{i=1}^{N} X_i Y_i}{\sqrt{\sum_{i=1}^{N} X_i^2}\,\sqrt{\sum_{i=1}^{N} Y_i^2}},$$

and the cosine distance is commonly taken as distance(X, Y) = 1 − cos(α), so that vectors pointing in the same direction are at distance zero.

The introduced distances, as well as other available distances, can help us quantify how similar or different two points are. How should one choose among them? It depends on the intended application. In practice, the cosine metric often works much better with face image datasets: by comparing the directions of face embeddings rather than their magnitudes, we obtain better evaluations, because variations in pose and clothing mean the raw features are not necessarily close. An example is the ArcFace embedding [18], which uses cosine similarity. For other data, such as manifold and curve datasets like the sphere shown in Fig. 5.2, the Euclidean distance works better.
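For concreteness, the three metrics above can be written in a few lines of plain NumPy; the helper names and toy inputs are our own, and the cosine helper follows the 1 − cos(α) convention discussed above.

import numpy as np

def minkowski_distance(x, y, r=2):
    """Minkowski distance; r = 1 gives Manhattan, r = 2 gives Euclidean."""
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 minus the Jaccard index."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def cosine_distance(x, y):
    """1 minus the cosine of the angle between x and y."""
    cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - cos_sim

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 1.0, 4.0])
print(minkowski_distance(x, y, r=1), minkowski_distance(x, y, r=2))
print(jaccard_distance({"a", "b", "c"}, {"b", "c", "d"}))
print(cosine_distance(x, y))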

5.2.3 Weights
After finding the k nearest neighbors for each point in our data, we have to find weights. In
Fig. 5.3 you can see that we find five nearest neighbors of xi ∈ R3 with Euclidean distance
and then map them in R2 space with the found weights. Remember these weights are not
the new distances!

FIGURE 5.3 Embedding R3 space to R2 space using five neighbors.

The idea is to find the best weights for reconstruction. We will discuss this reconstruction later; for now, the weights are what let us travel between the folded (original) space and the embedded space. Finding the weights amounts to an optimization problem:

$$\underset{W}{\text{minimize}} \quad \mathcal{E}(W) = \sum_{i=1}^{n} \Big\| x_i - \sum_{j=1}^{k} w_{ij}\, x_{i_j} \Big\|_2^2,$$
$$\text{subject to} \quad \sum_{j=1}^{k} w_{ij} = 1, \quad i \in \{1, 2, \ldots, n\}.$$

In this problem, n is the number of data points, k is the number of neighbors per data point, W is the weight matrix, and $w_{ij}$ is the weight between the ith data point and its jth neighbor $x_{i_j}$. As you can see, the constraint $\sum_{j=1}^{k} w_{ij} = 1$ means that the weights of every point sum to one. A natural question is whether the weights can be negative under this optimization problem: they can, since only their sum is constrained.
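For readers who want to see the weight step concretely, the snippet below solves the constrained least squares problem for a single point using the standard closed form based on the local Gram matrix (the same (C + λI)^{-1}·1 expression reappears in Section 5.3.2); the function name, the toy data, and the small regularization term are our own choices.

import numpy as np

def lle_weights_for_point(x_i, neighbors, reg=1e-3):
    """Reconstruction weights of x_i from its k neighbors (rows of `neighbors`)."""
    k = neighbors.shape[0]
    Z = neighbors - x_i                      # shift neighbors so x_i is the origin
    G = Z @ Z.T                              # local k x k Gram matrix
    G += reg * np.trace(G) * np.eye(k) / k   # regularize in case G is singular
    w = np.linalg.solve(G, np.ones(k))       # solve G w = 1
    return w / w.sum()                       # enforce the sum-to-one constraint

rng = np.random.default_rng(0)
x_i = rng.normal(size=3)
neighbors = x_i + 0.1 * rng.normal(size=(5, 3))
w = lle_weights_for_point(x_i, neighbors)
print(w, w.sum())                            # weights sum to 1 (and can be negative)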

5.2.4 Coordinates
After solving the weights optimization problem, in the next step, we should embed data to
a lower-dimensional embedding space with our weights. Here again, we have an optimization problem for finding the embedded coordinates:

$$\underset{Y}{\text{minimize}} \quad \sum_{i=1}^{n} \Big\| y_i - \sum_{j=1}^{n} w_{ij}\, y_j \Big\|_2^2,$$
$$\text{subject to} \quad \frac{1}{n} \sum_{i=1}^{n} y_i y_i^{T} = I, \qquad \sum_{i=1}^{n} y_i = 0,$$

where I is the identity matrix, Y contains the new coordinates in the reduced space ($y_i \in \mathbb{R}^p$, where p is the dimension of the reduced space), and $w_{ij}$ is the weight we calculated for data point $x_i$ and its jth neighbor.
In the above formula, there are points that are not neighbors of point $x_i$, so there are no weights for them; in this case, we put zero for their weights:

$$w_{ij} = \begin{cases} w_{ij}, & \text{if } x_j \in k\text{-NN}(x_i),\\ 0, & \text{otherwise}. \end{cases}$$

Another feature that we understand from this optimization problem is that the sum of our
embedded coordinates is 0. Hence, this optimization problem makes sure that the mean
of embedded data is 0!
To solve this problem, let us define indicator vectors $1_i = [0, \ldots, 1, \ldots, 0]^T$ whose ith element is one, so that we can rewrite $y_i$ as $Y^T 1_i$. After rewriting the second term in the same way, we can state the problem in this new form:

$$\sum_{i=1}^{n} \big\| Y^T 1_i - Y^T w_i \big\|_2^2 = \big\| Y^T (I - W^T) \big\|_F^2,$$

where each row of W (namely $w_i^T$) is the weight vector of the corresponding data point and $\|\cdot\|_F$ is the Frobenius norm of a matrix. We continue to simplify the problem:

$$\big\| Y^T (I - W^T) \big\|_F^2 = \mathrm{tr}\big( (I - W)\, Y Y^T (I - W)^T \big) = \mathrm{tr}\big( Y^T (I - W)^T (I - W)\, Y \big) = \mathrm{tr}(Y^T M Y).$$

Here, tr(·) means the trace of a matrix, and (I − W) can be regarded as the Laplacian of W, since every row of W sums to one and therefore every row of (I − W) sums to zero. If we consider $M = (I - W)^T (I - W)$, the final equation is:

$$\underset{Y}{\text{minimize}} \quad \mathrm{tr}(Y^T M Y),$$
$$\text{subject to} \quad \frac{1}{n} Y^T Y = I, \qquad Y^T \mathbf{1} = \mathbf{0},$$

where $\mathbf{1} \in \mathbb{R}^n$ is the vector of ones and $\mathbf{0} \in \mathbb{R}^p$ is the vector of zeros. Now, we approach solving this problem by assuming that the second constraint is satisfied implicitly [8] and write the Lagrangian of the final equation as:

$$L = \mathrm{tr}(Y^T M Y) - \mathrm{tr}\Big( \Lambda^T \Big( \frac{1}{n} Y^T Y - I \Big) \Big),$$

where $\Lambda \in \mathbb{R}^{p \times p}$ is a diagonal matrix of Lagrange multipliers. To solve the problem, we take the derivative of L and set it equal to zero:

$$\frac{\partial L}{\partial Y} = 2MY - \frac{2}{n} Y \Lambda = 0 \quad \Rightarrow \quad MY = Y\Big( \frac{1}{n}\Lambda \Big),$$

which is an eigenvalue problem for M [19]: the columns of Y are eigenvectors of M, and the corresponding eigenvalues are the diagonal elements of $\frac{1}{n}\Lambda$.
In summary, LLE is a technique used to reduce the dimensionality of high-dimensional
data while preserving its local structure. LLE first constructs a graph by connecting each
data point to its k nearest neighbors in the high-dimensional space. It then finds a set of
low-dimensional embeddings that preserve the local reconstruction relationships between neighboring points in the graph. The LLE technique is based on the assumption that high-dimensional data points can be approximated linearly by their nearest neighbors, and the algorithm minimizes the difference between the original data points and their reconstructions from those neighbors. The implementation of standard LLE is reviewed below, so that the variants of LLE discussed in the next section can refer back to these steps:
1. Parameter selection: Choose the number of nearest neighbors, k, and the number of
dimensions for the low-dimensional embedding, p.
2. Finding nearest neighbors: Calculate the pairwise Euclidean distances between all data
points then find the k nearest neighbors for each point.
3. Computing weights: For each data point, compute the weights that minimize the cost of reconstructing the point from its neighbors. This is done by solving a constrained least squares problem: the weights W are computed by minimizing the cost function $\mathcal{E}(W) = \sum_{i} \| x_i - \sum_{j} w_{ij}\, x_{i_j} \|_2^2$, where j runs over the set of k nearest neighbors of $x_i$, subject to the constraint that $\sum_{j} w_{ij} = 1$ for each i.
4. Computing embeddings: The weights are then used to compute the low-dimensional embeddings. This is done by solving an eigenvalue problem on the matrix $(I - W)^T (I - W)$, where I is the identity matrix and W is the weight matrix. A compact from-scratch sketch of these four steps is given below.
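As promised, here is a compact from-scratch sketch of the four steps, using NumPy, SciPy, and scikit-learn's neighbor search; the function name, the regularization constant, and the random test data are illustrative choices rather than the chapter's reference implementation.

import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import NearestNeighbors

def standard_lle(X, k=10, p=2, reg=1e-3):
    """Compact (non-optimized) sketch of the four LLE steps listed above."""
    n = X.shape[0]
    # Step 2: k nearest neighbors of every point (excluding the point itself)
    idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X, return_distance=False)[:, 1:]
    # Step 3: constrained least-squares reconstruction weights
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[idx[i]] - X[i]
        G = Z @ Z.T
        G += reg * np.trace(G) * np.eye(k) / k      # regularize in case G is singular
        w = np.linalg.solve(G, np.ones(k))
        W[i, idx[i]] = w / w.sum()                  # enforce the sum-to-one constraint
    # Step 4: eigenvectors of M = (I - W)^T (I - W) with the smallest non-zero eigenvalues
    I_n = np.eye(n)
    M = (I_n - W).T @ (I_n - W)
    _, vecs = eigh(M)                               # eigenvalues returned in ascending order
    return vecs[:, 1:p + 1]                         # skip the constant eigenvector (eigenvalue ~ 0)

Y = standard_lle(np.random.default_rng(0).normal(size=(200, 5)), k=10, p=2)
print(Y.shape)   # (200, 2)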

5.3 Variations of LLE


In the previous section, we reviewed implementation steps for standard LLE and now we
can explain the implementation of variants of LLE including inverse LLE, kernel LLE, in­
cremental LLE, robust LLE, weighted LLE, landmark LLE, supervised and semi-supervised
LLE, using references to standard LLE steps.

5.3.1 Inverse LLE


Inverse LLE (ILLE) is a technique used for nonlinear dimensionality reduction [20]. It is
designed to reconstruct high-dimensional data points from their low-dimensional em­
beddings. Inverse LLE is an inverse problem of LLE. It takes a set of low-dimensional
embeddings obtained from LLE and aims to reconstruct the original high-dimensional
data points that generated those embeddings. To achieve this, ILLE identifies the k nearest
neighbors for each embedding in the low-dimensional space and uses them to interpolate
the high-dimensional data points. This interpolation process is computationally expensive
and involves solving a set of optimization problems to minimize the difference between
the interpolated high-dimensional data points and the original data points [21]. Here are
the ILLE implementation steps:
(a) Finding nearest neighbors in embedded space: For each low-dimensional point, find
its k nearest neighbors in the embedded space.
(b) Computing weights: For all low-dimensional points, compute the weights that mini­
mize the reconstruction cost of the point from its neighbors in the embedded space.
This can be done by solving a constrained least squares problem, similar to the for­
ward LLE process.
(c) High-dimensional point reconstruction: The weights are used to reconstruct each high-dimensional point from its neighbors. This is done by taking a weighted sum of the high-dimensional coordinates of the neighbors, using the weights computed in the previous step, via the formula $x_i = \sum_j w_{ij}\, x_j$, where j runs over the set of k nearest neighbors of $x_i$ in the embedded space (a minimal sketch of this reconstruction is given at the end of this subsection).
Inverse LLE can be useful in a variety of applications, such as image and speech pro­
cessing, where the original high-dimensional data may not be easily accessible, but a low­
dimensional representation is available. It can also be useful in data compression where
the low-dimensional embeddings can be used to store the data more efficiently. LLE can­
not be used in these cases due to a lack of original data.
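One way the reconstruction of step (c) can be sketched is shown below, assuming we keep a set of paired low- and high-dimensional points from a forward LLE run; the function name, the regularization term, and the toy data are our own choices, not the implementation of [20,21].

import numpy as np
from sklearn.neighbors import NearestNeighbors

def inverse_lle_reconstruct(y_new, Y_known, X_known, k=5, reg=1e-3):
    """Approximate the high-dimensional point behind the embedding y_new.

    Y_known / X_known hold corresponding low- and high-dimensional points we already have.
    """
    idx = NearestNeighbors(n_neighbors=k).fit(Y_known).kneighbors(
        y_new.reshape(1, -1), return_distance=False)[0]
    # Reconstruction weights computed in the *embedded* space (step b)
    Z = Y_known[idx] - y_new
    G = Z @ Z.T
    G += reg * np.trace(G) * np.eye(k) / k
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()
    # Weighted sum of the corresponding high-dimensional neighbors (step c)
    return w @ X_known[idx]

# Toy usage with random paired data (purely illustrative)
rng = np.random.default_rng(0)
X_known = rng.normal(size=(50, 10))
Y_known = rng.normal(size=(50, 2))
x_hat = inverse_lle_reconstruct(Y_known[0], Y_known, X_known, k=5)
print(x_hat.shape)   # (10,)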

5.3.2 Kernel LLE


Kernel Locally Linear Embedding (KLLE) is a nonlinear dimensionality reduction tech­
nique that aims to preserve the local structure of high-dimensional data in a low­
dimensional space using a kernel function [22]. KLLE is an extension of the standard LLE
algorithm and is particularly useful for nonlinear manifolds. Kernel methods including
kernel LLE use a concept called the kernel trick [23]. In dimensionality reduction, the ker­
nel trick is utilized to enable performing linear operations on data that has been mapped
into a higher-dimensional space using a nonlinear function called a kernel function. This is
beneficial for kernel LLE as it can capture more intricate relationships between data points and preserve more of the original data's local structure. However, selecting an appropriate kernel function and parameter values is critical to prevent overfitting or underfitting of the
data. The steps to implement KLLE are as follows:
(a) Kernel mapping : A kernel function K(x, y) must be chosen to map the data points
from the original space into a higher-dimensional space. Gaussian kernel, polynomial
kernel, and sigmoid kernel are the usual kernel functions. For instance, the Gaussian
kernel is defined as:

$$K(x, y) = \exp\!\Big( -\frac{\|x - y\|_2^2}{2\sigma^2} \Big),$$
where the σ parameter controls the width of the Gaussian function. A discussion on
the geometric properties of various kernel functions and a tutorial on how to imple­
ment them in Python are presented in [24--27].
(b) Finding nearest neighbors: For all data points in the mapped space, find its k nearest
neighbors using the Euclidean distance metric. The distance between two points xi
and $x_j$ in the mapped space is computed using the kernel function:

$$d(x_i, x_j) = \sqrt{K(x_i, x_i) - 2K(x_i, x_j) + K(x_j, x_j)}$$

(a short sketch of computing these kernel distances is given at the end of this subsection).

(c) Computing weights: For all data points, compute the weights that reconstruct the
point from its neighbors in the mapped space in the best way. Computing weights
needs solving a least squares problem with constraints, which is given by:

$$\underset{W}{\text{minimize}} \quad \mathcal{E}(W) = \sum_{i=1}^{n} \Big\| \Phi(x_i) - \sum_{j=1}^{k} w_{ij}\, \Phi(x_{i_j}) \Big\|_2^2,$$
$$\text{subject to} \quad \sum_{j=1}^{k} w_{ij} = 1, \quad i \in \{1, 2, \ldots, n\}.$$

Here, $\Phi(x_i)$ is the mapped data point, $\Phi(x_{i_j})$ is the jth nearest neighbor of $\Phi(x_i)$ in the mapped space, and $w_{ij}$ is the weight representing how well $\Phi(x_i)$ is reconstructed from $\Phi(x_{i_j})$. To solve this optimization problem, a matrix equation can be formed for each data point's weight vector and solved using a method like the Moore–Penrose pseudo-inverse:

$$w_i = (C + \lambda I)^{-1} \mathbf{1}, \qquad w_i \leftarrow \frac{w_i}{\sum_{j} w_{ij}}.$$

Here, C is the Gram matrix of the neighbors, λ is a regularization parameter, I is the identity matrix, and $\mathbf{1}$ is a column vector of ones. For more details, you can read about the Moore–Penrose method in [28].

(d) Computing embeddings: Compute the low-dimensional embedding vectors that are best reconstructed by the weights calculated in the previous step. Here, another optimization problem needs to be solved, which is given by:

$$\underset{Y}{\text{minimize}} \quad \sum_{i=1}^{n} \Big\| y_i - \sum_{j=1}^{n} w_{ij}\, y_j \Big\|_2^2,$$
$$\text{subject to} \quad \frac{1}{n} \sum_{i=1}^{n} y_i y_i^{T} = I, \qquad \sum_{i=1}^{n} y_i = 0.$$

Here, $y_i$ is the low-dimensional embedding of $\Phi(x_i)$, $y_j$ is the low-dimensional embedding of the jth nearest neighbor of $\Phi(x_i)$, and I is the identity matrix. To solve this optimization problem, you can compute the eigenvectors and eigenvalues of the matrix $M = (I - W)^T (I - W)$ and then select the eigenvectors corresponding to the p smallest non-zero eigenvalues as the columns of the embedding matrix Y. The value of p is chosen during parameter selection, as in standard LLE.
Kernel LLE is a powerful technique for reducing dimensionality that can capture com­
plex relationships between data points better than other LLE methods. KLLE utilizes the
kernel function to map the original data into a higher-dimensional space where linear op­
erations can be performed to maintain the local structure of the data. This makes KLLE
much more useful for data with nonlinear dependencies, such as natural language pro­
cessing or computer vision datasets. In addition, KLLE has been successfully used for unsupervised feature learning, allowing it to learn complex hierarchical representations of data without explicit labeling.
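As a rough sketch of steps (a) and (b), the snippet below builds a Gaussian kernel matrix with scikit-learn's rbf_kernel and derives the implied feature-space distances; the data, the value of σ, and the variable names are illustrative only.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Kernel matrix K(x_i, x_j) with a Gaussian (RBF) kernel; gamma = 1 / (2 * sigma^2)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
sigma = 1.0
K = rbf_kernel(X, X, gamma=1.0 / (2 * sigma ** 2))

# Distances in the (implicit) mapped space, used to find neighbors for kernel LLE
diag = np.diag(K)
D = np.sqrt(np.maximum(diag[:, None] - 2 * K + diag[None, :], 0.0))
print(D.shape)   # (100, 100)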

5.3.3 Incremental LLE


Big data refers to large datasets that are not easy to process by traditional methods. Data
with a large number of entries demands substantial computational and statistical resources and can end up with a higher false discovery rate. LLE in particular is not always practical for big datasets, as it requires all the data to be stored in memory (RAM); this is a problem for datasets that cannot fit into the available RAM capacity. Incremental LLE (INLLE) is an extension of the LLE algorithm, which is a popular nonlinear dimensionality reduction technique [29]. The main advantage of incremental LLE is that it can handle large datasets that cannot
be stored in memory by processing the data incrementally, one batch at a time. The algo­
rithm first computes the LLE embeddings for the initial batch of data and then updates the
embeddings, when new data is added.
Incremental LLE is an approach that addresses memory problems by dividing the
dataset into smaller batches and computing the low-dimensional embeddings for each
batch, separately. Then, the embeddings are merged to obtain the final low-dimensional
representation of the entire dataset. In this way, incremental LLE can handle larger
datasets than standard LLE without requiring excessive memory. However, there are some
disadvantages to using incremental LLE. One drawback is that splitting the data into
batches can result in a loss of information and introduce artifacts in the embeddings. An­
other issue is that the choice of batch size and overlap can affect the quality of the final
embedding and requires careful selection. The steps to implement INLLE are as follows:
(a) Initialization: Assume we already have n data points and the embedding has been obtained by the standard LLE method. The eigenvectors Y are orthonormal, so the matrix Y is orthogonal. With respect to Section 5.2.4, we have $MY = Y(\frac{1}{n}\Lambda)$, and this can be restated as:

$$Y^T M Y = \frac{1}{n}\Lambda.$$

Y represents the matrix of embedded coordinates obtained from the original data points. Let us consider the truncated Y, so we keep p eigenvalues. $\Lambda$ represents the diagonal matrix of eigenvalues in the LLE formulation, and M is the weighting matrix used in LLE to determine the local relationships between data points. Therefore $Y \in \mathbb{R}^{n \times p}$, $\Lambda \in \mathbb{R}^{p \times p}$, and $M \in \mathbb{R}^{n \times n}$.
(b) Adding new data points: Consider we have ℓ new data points. The equation becomes:

$$Y_U^T M_U Y_U = \frac{1}{n}\Lambda_U.$$

Here, the U index is related to updated points (union of old and new points) and so
YU is in R(n+ℓ)×p and MU is in R(n+ℓ)×(n+ℓ) .
(c) Updating the eigenvalues: Since we keep the smallest eigenvalues when truncating, the eigenvalues in both $\Lambda_U$ and $\Lambda$ are very small, so we can say that approximately $\Lambda_U \approx \Lambda$.
(d) Updating the embedded coordinates: Solve the following optimization problem to update the coordinates of the new data points in the embedded space:

$$\underset{Y_U}{\text{minimize}} \quad \Big\| Y_U^T M_U Y_U - \frac{1}{n}\Lambda \Big\|_F^2,$$
$$\text{subject to} \quad \frac{1}{n} Y_U^T Y_U = I, \qquad Y_U^T \mathbf{1} = \mathbf{0}.$$

Here, $\|\cdot\|_F$ represents the Frobenius norm.


(e) Solving the optimization problem: This optimization problem can be solved using the interior point method [30]. The Lagrangian of this problem is (ignoring the second constraint):

$$L = \Big\| Y_U^T M_U Y_U - \frac{1}{n}\Lambda \Big\|_F^2 - \mathrm{tr}\Big( \Lambda'^{\,T} \Big( \frac{1}{n} Y_U^T Y_U - I \Big) \Big),$$

where Λ′ holds the Lagrange multipliers of the remaining constraint. The derivative of this Lagrangian with respect to $Y_U$ is:

$$\frac{\partial L}{\partial Y_U} = 4\, M_U Y_U \Big( Y_U^T M_U Y_U - \frac{1}{n}\Lambda \Big).$$

(f) Final embedding: The solution $Y_U \in \mathbb{R}^{(n+\ell) \times p}$ obtained by the optimization contains the row-wise p-dimensional embeddings of both the old and new data.
Overall, incremental LLE is a powerful tool for dimensionality reduction in large, evolv­
ing datasets. It has applications in a variety of fields. By allowing for the incremental addi­
tion of new data points and the efficient computation of embeddings for large datasets, it
opens up new possibilities for data analysis and machine learning. However, as with any
algorithm, it is important to carefully tune the parameters and consider the limitations and
assumptions of the method to obtain the best results.
The stability of incremental LLE can be influenced by the selection of the batch size. If
the batch size is too small, the low-dimensional embeddings may not accurately represent
the local structure of the data. Conversely, if the batch size is too large, incremental LLE
may not be able to capture the global structure of the data. For instance, suppose there is
a dataset with a complex and nonlinear structure, where the local structure varies greatly
across the dataset. If the batch size is chosen to be very small, incremental LLE might only
capture the local structure of each batch, resulting in low-dimensional embeddings that
fail to r­flect the global structure of the dataset. Alternatively, if the batch size is set to be
very large, incremental LLE may not be able to capture the fine details of the local structure,
and the embeddings might lose significant information.
In real-world applications, the batch size for incremental LLE depends on the size and
complexity of the dataset, as well as the available computational resources. Finding an
optimal batch size that balances capturing the local and global structure of the data may
require some experimentation.

5.3.4 Robust LLE


An outlier is a data point that is significantly different from other data points in a dataset.
Outliers can occur in any type of dataset, including numerical, categorical, and textual
data. They can be the result of measurement errors, data processing errors, or represent ac­
tual extreme values. Outliers can cause several problems in data analysis, such as skewing
statistical measures, affecting model accuracy, and reducing the interpretability of results.
They can also lead to false assumptions about the underlying data distribution and re­
lationships between variables. Therefore it is important to identify and handle outliers
appropriately before conducting any analysis or modeling.
Robust Locally Linear Embedding (RLLE) is a variant of Locally Linear Embedding (LLE)
that is designed to handle outliers and noise in the data [31,32]. RLLE achieves this by
using a robust estimate of the reconstruction weights that are less sensitive to outliers.
There are several methods to implement robust LLE and in this chapter we will discuss the
implementation steps of RLLE using the least squares problem method given by [31]:

(a) Initialization: Initialize the reliability weights, biases, and PCA projection matrices for
each data point xi .
(b) Iteration steps: Iterate between the following steps until convergence:
i. For all data points like $x_i$, use PCA to minimize the error of weighted reconstruction by solving the following least squares problem:

$$\underset{P_i}{\text{minimize}} \quad \sum_{j=1}^{k} a_{ij}\, e_{ij} = \sum_{j=1}^{k} a_{ij}\, \big\| x_{i_j} - b_i - P_i\, y_{i_j} \big\|_2^2,$$

where $b_i \in \mathbb{R}^d$ is a bias, $P_i \in \mathbb{R}^{d \times p}$ is the PCA projection matrix, $y_{i_j} \in \mathbb{R}^p$ is the embedding of $x_{i_j}$, $a_{ij}$ for j = 1 to k are the reliability weights, and $e_{ij}$ is the reconstruction error between the original data point $x_{i_j}$ and its reconstruction using the weighted PCA projection matrix and bias.
ii. Update the bias $b_i$ using the following formula:

$$b_i = \frac{\sum_{j=1}^{k} a_{ij}\, x_{i_j}}{\sum_{j=1}^{k} a_{ij}}.$$

iii. Update the columns of $P_i$ as the top p eigenvectors of the covariance matrix over the neighbors:

$$S_i = \frac{1}{k} \sum_{j=1}^{k} a_{ij}\, (x_{i_j} - b_i)(x_{i_j} - b_i)^T.$$

iv. Use the Huber formula to update the reliability weights $\{a_{ij}\}_{j=1}^{k}$ (see the short sketch at the end of this subsection):

$$a_{ij} = \begin{cases} 1, & \text{if } e_{ij} \le c_i,\\[4pt] \dfrac{c_i}{e_{ij}}, & \text{if } e_{ij} > c_i, \end{cases}$$

where $e_{ij}$ is defined in the previous steps and $c_i$ is the mean error residual, i.e., $c_i = \frac{1}{k} \sum_{j=1}^{k} e_{ij}$.
(c) Calculating mean weights: Calculate the mean reliability weight over the neighbors of each point $x_i$ as $s_i = \frac{1}{k} \sum_{j=1}^{k} a_{ij}$.
(d) Weighting the objective function: Weight the objective function of the standard LLE embedding optimization problem using the mean reliability weights to make the embeddings robust to outliers:

$$\underset{Y}{\text{minimize}} \quad \sum_{i=1}^{n} s_i\, \Big\| y_i - \sum_{j=1}^{n} w_{ij}\, y_j \Big\|_2^2,$$

with the constraints of the standard LLE embedding.



Hence, we have changed the optimization problem in a manner that is robust to outliers, so solving it yields an embedding that is far less distorted by outlying points.
Stock prices, medical data, and customer data are some examples of datasets with prob­
able outliers. In the context of dimensionality reduction techniques like LLE and RLLE,
outliers can lead to distortions in the low-dimensional embeddings, making it difficult to
accurately capture the structure of the original high-dimensional data. However, RLLE is
designed to be more robust to outliers than LLE, by using a robust estimator of the local
covariance matrix to better handle the influence of outliers. As a result, RLLE may be a
better choice for datasets with outliers, compared to LLE.
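A minimal sketch of the Huber-style update in step (b)iv, assuming we already have the reconstruction residuals of one point's k neighbors; the function name and the toy residuals are our own.

import numpy as np

def huber_reliability_weights(errors):
    """Reliability weights for one point's k neighbors from their reconstruction errors.

    errors: array of shape (k,) with the residuals e_{ij} from the weighted PCA step.
    """
    c_i = errors.mean()                       # mean residual acts as the Huber threshold
    weights = np.ones_like(errors)
    large = errors > c_i
    weights[large] = c_i / errors[large]      # down-weight neighbors with large residuals
    return weights

e = np.array([0.1, 0.2, 0.15, 2.0, 0.12])     # one outlier-like residual
print(huber_reliability_weights(e))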

5.3.5 Weighted LLE


Weighted LLE (WLLE) is an extension of LLE that allows for the incorporation of user­
defined weights to control the contribution of each data point to the local reconstruction
of its neighbors [33]. The idea behind WLLE is to adjust the reconstruction weights matrix
in such a way that data points that are more important or informative for the task at hand
receive a higher weight. Weighted LLE can be implemented using various methods, in this
section we will discuss two of them including weighted LLE for deformed distributed data
and weighted LLE using the probability of occurrence.

5.3.5.1 Weighted LLE for deformed distributed data


The steps to implement this method are as follows:
(a) Computing weighted distance: For all pairs of data points $x_i$ and $x_j$, compute the weighted distance using the formula:

$$\mathrm{dist}(x_i, x_j) = \frac{\|x_i - x_j\|_2}{\alpha_i + \beta_i\, \dfrac{(x_i - x_j)^T \tau_i}{\|x_i - x_j\|_2}}.$$

Here, ||xi − xj ||2 is the Euclidean distance between xi and xj , αi and βi are constants,
and τi is a data-dependent parameter.
(b) Calculating parameters: Compute the parameters $\alpha_i$, $\beta_i$, and $\tau_i$ using the formulas:

$$\tau_i = \frac{m_i}{\|m_i\|_2}, \qquad \alpha_i = \frac{q_i}{c_1}, \qquad \beta_i = \frac{\|m_i\|_2}{c_2},$$
where mi is the average of vectors from xi to its k nearest neighbors, qi is the average
of the squared lengths of these vectors, and c1 and c2 are constants that depend on
the dimensionality of the input space, d [8].
(c) Finding nearest neighbors: Compute the k nearest neighbors for each data point using
weighted distance. This makes the graph of k nearest neighbors.
(d) Computing weights: For all data points, compute the weights that minimize the cost
of reconstructing the point from its neighbors. This needs to solve a constrained least
squares problem, similar to the standard LLE process as we mentioned before.

(e) Computing embeddings: Compute the low-dimensional embeddings using weights.


Here, solving an eigenvalue problem on the matrix (I − W )T (I − W ) is needed, where
I is the identity matrix and W is the weight matrix.

5.3.5.2 Weighted LLE using the probability of occurrence


In this method, similar steps need to be taken, but the distance and the Gram matrix are
weighted by the probabilities of occurrence of the data points:
(a) Computing weighted distance: For all pairs of data points $x_i$ and $x_j$, compute the weighted distance using the formula:

$$\mathrm{dist}^2(x_i, x_j) = \frac{\|x_i - x_j\|_2^2}{p_i},$$

where $p_i$ is the probability of occurrence of data point $x_i$ (a small sketch of this weighting is given at the end of this section).


(b) Finding nearest neighbors: This step is similar to the previous method.
(c) Computing weighted Gram matrix: Compute the Gram matrix and weight its elements by the probabilities of occurrence of the data points:

$$\tilde{G}_i(a, b) = p_i\, p_j\, G_i(a, b),$$

where $p_i$ and $p_j$ are the probabilities of occurrence of data points $x_i$ and $x_j$, respectively.
(d) Computing weights: This step is also similar to the previous method and the standard
LLE section.
(e) Computing embeddings: Compute the low-dimensional embeddings using weights
similar to the previous method.
In both methods, the final steps of the algorithm are the same as in standard LLE. The
most important difference is how distance and weight are computed.
The main advantage of weighted LLE is that it allows the user to incorporate prior
knowledge or domain-specific information into the embedding process by assigning dif­
ferent weights to the data points.
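A small sketch of the probability-weighted distance of Section 5.3.5.2, assuming uniform occurrence probabilities as a toy example; the helper name and data are illustrative. Note that the resulting matrix need not be symmetric, which is acceptable because neighbors are searched from each point's own row.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def probability_weighted_distances(X, p):
    """Squared distances divided by the probability of occurrence of the source point.

    X: data matrix (n, d); p: array of shape (n,) with p_i > 0 for every point.
    Row i of the result holds dist^2(x_i, x_j) = ||x_i - x_j||^2 / p_i.
    """
    D2 = euclidean_distances(X, squared=True)
    return D2 / p[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
p = np.full(6, 1.0 / 6)          # uniform occurrence probabilities as a toy example
print(probability_weighted_distances(X, p).shape)   # (6, 6)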

5.3.6 Landmark LLE for big data


Landmark LLE (LLLE) is an extension of the LLE algorithm designed for handling big data
[34]. LLE is a popular technique for nonlinear dimensionality reduction, but it is not scal­
able for datasets that cannot fit into memory. Landmark LLE solves this issue by processing
data incrementally. As we have mentioned before in Section 5.3.3, Landmark LLE, similar
to incremental LLE, is for handling large datasets but they are different in their approach.
Landmark LLE is a technique that involves selecting a small subset of data points called
landmarks and calculating low-dimensional embeddings for each data point concerning
these landmarks. This technique helps reduce the computational complexity of the LLE al­
gorithm by computing the pairwise distances only between the landmarks and data points,
rather than between all data points. Landmark LLE is especially useful when the dataset is
too large to fit into memory but the landmarks can be loaded into memory. In contrast,
incremental LLE is a technique that processes data in smaller batches, calculates the low­
dimensional embeddings for each batch separately, and then combines them to produce
the final embedding. This technique allows incremental LLE to handle larger datasets that
exceed the available memory by processing only a subset of the data at a time. These are
the steps to implement LLLE using locally linear landmarks:
(a) Landmark selection: Choose a subset of the data points $X \in \mathbb{R}^{d \times n}$ as your landmarks and call it $\tilde{X} \in \mathbb{R}^{d \times m}$, where $m \ll n$. That is, m landmarks are chosen from the n original data points.
(b) Calculating projection matrix: Compute the projection matrix $\tilde{U} \in \mathbb{R}^{n \times m}$ that maps the landmark embedding $\tilde{Y} \in \mathbb{R}^{m \times p}$ to the full embedding $Y \in \mathbb{R}^{n \times p}$; this is given by $Y = \tilde{U} \tilde{Y}$. The projection also works for the input space, where it is given by $X^T = \tilde{U} \tilde{X}^T$.
(c) Optimization for finding the projection matrix: To find the optimal projection matrix, the following optimization problem must be solved:

$$\underset{\tilde{U}}{\text{minimize}} \quad \sum_{i=1}^{n} \big\| x_i - \tilde{X} \tilde{u}_i \big\|_2^2,$$
$$\text{subject to} \quad \mathbf{1}^T \tilde{u}_i = 1, \quad i \in \{1, \ldots, n\}.$$

The solution to this optimization problem is given by:

$$\tilde{u}_i = \frac{(\tilde{G}_i)^{-1} \mathbf{1}}{\mathbf{1}^T (\tilde{G}_i)^{-1} \mathbf{1}},$$

where $\tilde{G}_i = (x_i \mathbf{1}^T - \tilde{X})^T (x_i \mathbf{1}^T - \tilde{X})$ and $\mathbf{1}$ is a vector of ones (a small sketch of this computation is given at the end of this subsection).
(d) Landmark embedding: The embeddings of the landmark points are computed by solving the following optimization problem:

$$\underset{\tilde{Y}}{\text{minimize}} \quad \mathrm{tr}\big( \tilde{Y}^T \tilde{U}^T M \tilde{U} \tilde{Y} \big),$$
$$\text{subject to} \quad \frac{1}{n} \tilde{Y}^T \tilde{U}^T \tilde{U} \tilde{Y} = I.$$

Here, $\tilde{M} = \tilde{U}^T M \tilde{U}$, and $\tilde{M}$ is a square matrix of size m × m, where m is the number of landmarks selected. The entries of the matrix $\tilde{M}$ are computed based on the pairwise distances and relationships between the landmark points. The solution to this optimization problem is given by the p smallest eigenvectors of $\tilde{M}$, not considering eigenvectors with zero eigenvalues.
(e) Data embedding : Eventually, use the obtained embeddings of the m landmarks in or­
der to approximate the embeddings of all n data points, by the Y = Ũ Ỹ formula as
given in step (b).

Landmark LLE using locally linear landmarks is particularly useful for large datasets, where $m \ll n$, because the computational complexity is greatly reduced by only computing the embeddings for the m landmarks and then approximating the embeddings for the remaining data points.
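The projection weights of step (c) can be sketched as follows; this is not the reference implementation of [34], and the small regularization term is our own addition for numerical stability.

import numpy as np

def landmark_projection_matrix(X, landmarks, reg=1e-3):
    """Row i holds the weights expressing x_i as an affine combination of the landmarks.

    X: (n, d) data, landmarks: (m, d) chosen subset; returns U of shape (n, m).
    """
    n, m = X.shape[0], landmarks.shape[0]
    U = np.zeros((n, m))
    ones = np.ones(m)
    for i in range(n):
        Z = landmarks - X[i]                      # differences between landmarks and x_i
        G = Z @ Z.T                               # m x m Gram matrix G_i
        G += reg * np.trace(G) * np.eye(m) / m    # regularization for numerical stability
        u = np.linalg.solve(G, ones)
        U[i] = u / (ones @ u)                     # normalize so the weights sum to one
    return U

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
landmarks = X[rng.choice(50, size=10, replace=False)]
U = landmark_projection_matrix(X, landmarks)
print(U.shape, U.sum(axis=1)[:3])                 # (50, 10), rows sum to 1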

5.3.7 Supervised and semi-supervised LLE


Locally Linear Embedding is a popular nonlinear dimensionality reduction technique that
has been widely used in various fields. However, standard LLE has its limitations; in particular, it is purely unsupervised and makes no use of any label information that may be available. Supervised LLE and semi-supervised LLE are extensions of LLE that address this by incorporating extra information to guide the dimensionality reduction. They differ from landmark and incremental LLE, which only address scalability and rely solely on the inherent structure of the data itself.

5.3.7.1 Supervised LLE


Supervised LLE (SLLE) is a variant of LLE that incorporates class labels to guide the em­
bedding process [35]. The objective of SLLE is to preserve the local structure of the data
while also preserving the class labels. The steps to implement SLLE are as follows:
(a) Distance matrix modification: In SLLE, the Euclidean distance matrix $D \in \mathbb{R}^{n \times n}$ is changed to artificially increase the inter-class variance of the data. This is done by increasing the distances between points that belong to different classes. The new modified distance matrix D' is given by (a small sketch of this modification appears after this list):

$$D' = D + \theta\, d_{\max}\, (\mathbf{1}\mathbf{1}^T - \Delta).$$

Here, $\mathbf{1}\mathbf{1}^T \in \mathbb{R}^{n \times n}$ is the matrix of ones, $d_{\max} \in \mathbb{R}$ represents the diameter of the data, given by $d_{\max} = \max_{i,j}(\|x_i - x_j\|_2)$, and $\Delta$ is a matrix with the following elements:

$$\Delta(i, j) = \begin{cases} 1, & \text{if } c_i = c_j,\\ 0, & \text{otherwise}, \end{cases}$$

where $c_i$ represents the class label of $x_i$, and θ ∈ [0, 1]. When θ = 0, SLLE reduces to unsupervised LLE; when θ = 1, SLLE is fully supervised; and when θ ∈ (0, 1), we have partially supervised SLLE.
(b) k-nearest-neighbors graph: Find the k-nearest-neighbors graph using the modified distances, in the same way as in the standard LLE section.
(c) LLE algorithm: The rest of the algorithm is identical to standard LLE; the only point where SLLE differs from the other LLE algorithms is its modified distance matrix.
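A small sketch of the distance modification in step (a); the function name, the θ value, and the toy labels are illustrative.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def slle_distances(X, labels, theta=0.5):
    """Distance matrix used by SLLE: inter-class distances are inflated by theta * d_max."""
    D = euclidean_distances(X)
    d_max = D.max()                                              # diameter of the data
    delta = (labels[:, None] == labels[None, :]).astype(float)   # 1 where classes match
    return D + theta * d_max * (1.0 - delta)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
D_mod = slle_distances(X, labels, theta=0.5)                     # theta in (0, 1): partially supervised
print(D_mod.shape)   # (8, 8)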

5.3.7.2 Semi-supervised LLE


The semi-supervised LLE (SSLLE) algorithm has several advantages over its supervised
counterpart [36]. For one, it can handle missing labels and noisy label assignments, which
are common in real-world datasets. Additionally, it can be used to learn embeddings in a
low-dimensional space that captures both the geometric structure and the semantic re­
lationships between data points. This makes it useful for a wide range of applications,
including image and speech recognition, natural language processing, and bioinformat­
ics. The steps to implement SSLLE are as follows:
(a) Distance matrix modification: In semi-supervised LLE, not all data points have labels, but a certain number of them do. The distances are computed as:

$$D'(i, j) = \begin{cases} \sqrt{1 - e^{-\bar{D}^2(i,j)/\gamma}} - \theta, & \text{if } c_i = c_j,\\[4pt] \sqrt{1 - e^{-\bar{D}^2(i,j)/\gamma}}, & \text{if } x_i \text{ or } x_j \text{ is unlabeled},\\[4pt] \sqrt{e^{\bar{D}^2(i,j)/\gamma}}, & \text{otherwise}. \end{cases}$$

Here, $\gamma = \mathrm{average}_{i,j}(\|x_i - x_j\|_2)$ and $\bar{D}(i, j) = \frac{D(i,j)}{m_i \times m_j}$, where $m_i = \mathrm{average}_r(\|x_i - x_r\|_2)$ for all $r \in \{1, \ldots, n\}$.
(b) k-nearest-neighbors graph: Similar to the supervised method.
(c) LLE algorithm: Similar to the supervised method.
The main technique of both the supervised and semi-supervised methods is to compute modified distances between data points based on their class labels; the rest of the problem is then handled by the standard LLE machinery applied to these modified distances.
In summary, semi-supervised LLE is a powerful technique for learning embeddings in
a low-dimensional space that preserves both the geometric structure and the semantic
relationships between data points. By incorporating label information in the optimization
problem, it can handle missing or noisy labels and provide more informative embeddings.
The algorithm has a wide range of applications and can be used in combination with other
machine learning methods to improve performance on various tasks.

5.3.8 LLE with other manifold learning methods


LLE is a nonlinear embedding method, but not every step in it is nonlinear: finding the k nearest neighbors, for example, commonly relies on the Euclidean distance, which ignores the curvature of the manifold. However, we can use other distances that are better suited to manifolds and curves. In this section, we discuss the geodesic distance [37]; the resulting method is called ISOLLE. The geodesic distance is the length of the shortest path between two data points along the manifold, and finding it is not as simple as computing the Euclidean distance, since we must travel along the manifold itself. ISOLLE approximates this distance with a simple idea: construct a graph with a k-NN algorithm that uses the Euclidean distance, and then find the shortest path between two data points using a classic shortest-path algorithm on the weighted graph. Since the edge weights are Euclidean distances, Dijkstra's algorithm does the job with good time complexity, O(n log₂ n). Gathering all these ideas, the final formula for the geodesic distance is:

$$D^{(g)}_{ij} = \min_{r} \sum_{t=2}^{z} \| r_{t-1} - r_t \|_2.$$

In this formula, $r = (r_1, \ldots, r_z)$ is a sequence of data points with $r_t \in \{x_i\}_{i=1}^n$, $r_1 = x_i$, and $r_z = x_j$, and the matrix $D^{(g)}$ is the geodesic distance matrix obtained once these distances have been computed. The rest of ISOLLE is then the same as LLE.
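A possible sketch of the ISOLLE distance step, combining scikit-learn's kneighbors_graph with SciPy's Dijkstra-based shortest_path; the data and parameter values are illustrative, and the resulting matrix D_geo would then replace the Euclidean distances when selecting neighbors.

import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

# Approximate geodesic distances: Euclidean k-NN graph + shortest paths (Dijkstra)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

knn_graph = kneighbors_graph(X, n_neighbors=10, mode="distance")   # sparse weighted graph
D_geo = shortest_path(knn_graph, method="D", directed=False)       # Dijkstra on the graph

print(D_geo.shape)   # (200, 200) geodesic distance matrix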

5.4 Implementation and use cases


5.4.1 How to use LLE in Python?
Utilizing LLE inside a Python environment is possible using the ``sklearn.manifold'' library.
This library includes a considerable number of manifold learning embeddings and algo­
rithms, which can be easily accessed. In this case, our goal is to make use of the library on
the S curve manifold. The S curve is simply a 3D point cloud shaped like the Latin letter 'S'. In order to use the above-mentioned library, you can import it from ``scikit-learn'' and use it for embedding in the following way:

from sklearn.manifold import LocallyLinearEmbedding

embedding = LocallyLinearEmbedding(
    n_neighbors=5,        # number of neighbors used for each point
    n_components=2,       # number of reduced coordinates
    reg=1e-3,             # regularization constant
    eigen_solver="auto",  # solver for the eigenvectors: 'auto', 'arpack', or 'dense'
    max_iter=100,         # maximum number of iterations
    method="standard",    # 'standard', 'hessian', 'modified', or 'ltsa': the
                          # different variations of LLE available in sklearn.manifold
)

Scikit-learn provides various facilities and hyperparameters. As is clear in the code, the ``LocallyLinearEmbedding'' class has an ``n_neighbors'' parameter, which is the number of nearest neighbors extracted for each data point. ``n_components'' is the dimension of the lower-dimensional space that the original data is reduced to. The parameter ``reg'' is the regularization constant; users are generally advised to leave it at its default value, since regularization is sensitive and scikit-learn's developers have chosen a constant that works well in practice. The ``eigen_solver'', as it literally sounds, is the solver used for the eigenvector computation. ``max_iter'' is the maximum number of iterations used when fitting the embedding, and, last but not least, ``method'' selects the algorithm used to find the embedding.
At this stage, our goal is to use LLE on the S curve but to achieve this, data points need
to be created. The ``s_curve'' has already been implemented in the ``sklearn.datasets'' li­
brary as a dataset. A number of 3600 data points are utilized for the curve, then it has been
plotted and the results are illustrated in Fig. 5.4.

import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d

from sklearn import datasets

n_samples = 3600
S_points, S_color = datasets.make_s_curve(n_samples, random_state=0)

x, y, z = S_points.T
fig, ax = plt.subplots(subplot_kw={'projection': '3d'})
ax.scatter(x, y, z, c=S_color, cmap=plt.cm.rainbow)
plt.show()

FIGURE 5.4 S curve dataset.

Next, our intention is to reduce the above-mentioned S curve from R^3 to R^2 using LLE. The value 10 is chosen as the parameter for the k nearest neighbors, and 2 as the dimension of the reduced space. Standard LLE is the method that is used, so an LLE object is created and assigned to a variable with a similar name in the following piece of code. In line number 10, the data points of the S curve are fitted, transformed, and saved into the coordinates. Once plotted, the reduced coordinates reveal the 2D data points that can be seen in Fig. 5.5.

1 from sklearn.manifold import LocallyLinearEmbedding


2
3 n_neighbors = 10
4 n_components = 2
5

6 LLE = LocallyLinearEmbedding(method = "standard",


7 n_neighbors = n_neighbors,
8 n_components = n_components)
9

10 coordinates = LLE.fit_transform(S_points)
11 x, y = coordinates.T
12 plt.scatter(x, y, c = S_color, cmap=plt.cm.rainbow)
13 plt.show()

FIGURE 5.5 Embedded S curve with standard LLE.

As you can see in Fig. 5.5, the S curve has transformed successfully, satisfying the condi­
tions mentioned in Section 5.2. Hence, the points have kept their distances in R2 , relative
to the distances they used to have in R3 .
In addition, LLE in scikit-learn offers different methods. Fig. 5.6 shows the outcomes on the S curve using standard LLE, modified LLE [38], and PCA, respectively. The comparison between standard and modified LLE indicates that LLE variants can provide usefully reduced data for analysis purposes. With PCA, both distances and shapes remain largely the same, but because its principles differ, it cannot be claimed that the reduced coordinates present the unfolded data, i.e., the same curve in reduced dimensions. Note that the output of LLE on the S curve differs slightly between Figs. 5.5 and 5.6; the reason for these small differences is the random state and the choices scikit-learn makes when solving the optimization problem. Now, let us test LLE on a more applied dataset in the next section.

5.4.2 Using LLE in MNIST


A well-known recommended dataset to use, while working on data science-related fields
is the MNIST dataset. The MNIST dataset includes a massive number of images made of
digits ‘0’ to ‘9’, where images are in 28 × 28 resolution. This resolution seems to be high and
complex indeed, so it may cause a slow recognition for our application. In this case, our purpose is to use LLE for dimensionality reduction, followed by a k-nearest-neighbor classifier. For our demo project, the dataset needs to be loaded first, and because of time and processing issues, a smaller version of the dataset named ``digits'' is used in ``scikit-learn''.

FIGURE 5.6 Comparing standard LLE, modified LLE, and PCA.
As is clear in the given code, the dataset is loaded into two arrays, where X holds the images of the digits and y the digit that each image represents. The original MNIST images have a resolution of 28 × 28, so a flattened image lives in a 784-dimensional space; the smaller ``digits'' dataset uses 8 × 8 images, i.e., 64 dimensions. Our goal is to make use of LLE in order to reduce this high-dimensional space to just 2 dimensions.

from sklearn.datasets import load_digits

digits = load_digits(n_class=6)
X, y = digits.data, digits.target

print("X Shape: ", X.shape)
print("Y Shape: ", y.shape)

Next, the dataset is split into train and test sets for evaluation. It is declared that 10
percent of existing data will be used for the test set and the other 90 percent for training:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

Therefore LLE may be used for our dimensionality reduction purpose. As before, LLE needs to be imported from ``sklearn.manifold'', with 10 as the value of k for the nearest neighbors and 2 as the dimension of the reduced space. At the next stage, LLE is fitted, i.e., its weights are learned on the training set, and then both the training set and the test set are transformed with it. Fig. 5.7 illustrates the embedding in R^2.

import matplotlib.pyplot as plt
from sklearn.manifold import LocallyLinearEmbedding

embedding = LocallyLinearEmbedding(
    n_neighbors=10, n_components=2, method="modified")
transitions_train = embedding.fit_transform(X_train)
transitions_test = embedding.transform(X_test)

emb_x, emb_y = transitions_train.T
plt.scatter(emb_x, emb_y, c=y_train)

FIGURE 5.7 The digits dataset after embedding.

Now, k-NN may be used as a classifier. k-NN was introduced in Section 5.2.1 to help with finding the LLE weights; here it is used for classification. We use k-NN in ``scikit-learn'' as follows:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(transitions_train, y_train)

pred = knn.predict(transitions_test)

Ultimately, model evaluation is done by checking accuracy, precision, and recall. Accuracy shows how often the model predicts the correct label and is calculated by dividing the number of correct predictions by the size of the test set. Precision measures how many of the predicted positives are actually correct, and recall measures how many of the actual positives are recovered. Let TP be true positives, FP false positives, TN true negatives, and FN false negatives; accuracy, precision, and recall can then be calculated as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}.$$
The built-in scoring functions of ``sklearn.metrics'' are used for the evaluation of this project.

from sklearn.metrics import accuracy_score, precision_score, recall_score

print("Accuracy: ", accuracy_score(y_test, pred))
print("Precision: ", precision_score(y_test, pred, average="macro"))
print("Recall: ", recall_score(y_test, pred, average="macro"))

The evaluation on the test set gives an accuracy of 0.954, a precision of 0.954, and a recall of 0.955, which are quite reliable rates.

5.5 Conclusion
LLE is a powerful nonlinear technique for dimensionality reduction that has found many applications, including in computer vision, bioinformatics, and image processing. In this chapter, we learned about the simple principle that LLE uses for dimension reduction and defined an optimization problem to implement it. We transformed the problem into an eigenvector problem and then expanded the idea with other approaches by explaining the variants of LLE, such as weighted LLE, inverse LLE, kernel LLE, etc. Then, we used LLE as a tool in data science and machine learning applications. We used the scikit-learn library with Python to implement LLE and compared it with other dimension reduction methods like PCA. After that, we used this technique on the MNIST dataset to solve a categorical classification problem and evaluated it with several metrics. As we have seen, LLE is not the best embedding that could be found, but it is not the worst either; with the right data, LLE can be very helpful.

References
[1] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science
290 (5500) (2000) 2323--2326.
[2] J.B. Tenenbaum, V.D. Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality
reduction, Science 290 (5500) (2000) 2319--2323.
[3] M. Ringnér, What is principal component analysis?, Nature Biotechnology 26 (3) (2008) 303--304.
[4] A. Ghaderi-Kangavari, J.A. Rad, M.D. Nunez, A general integrative neurocognitive modeling frame­
work to jointly describe EEG and decision-making on single trials, Computational Brain & Behavior 6
(2023) 317--376.
[5] A. Ghaderi-Kangavari, K. Parand, R. Ebrahimpour, M.D. Nunez, J.A. Rad, How spatial attention af­
fects the decision process: looking through the lens of Bayesian hierarchical diffusion model & EEG
analysis, Journal of Cognitive Psychology 35 (2023) 456--479.
[6] A. Ghaderi-Kangavari, J.A. Rad, K. Parand, M.D. Nunez, Neuro-cognitive models of single-trial EEG
measures describe latent effects of spatial attention during perceptual decision making, Journal of
Mathematical Psychology 111 (2022) 102725.

[7] P. Xanthopoulos, P.M. Pardalos, T.B. Trafalis, Linear discriminant analysis, in: Robust Data Mining,
2013, pp. 27--33.
[8] B. Ghojogh, A. Ghodsi, F. Karray, M. Crowley, Locally linear embedding and its variants: tutorial and
survey, arXiv preprint, arXiv:2011.10925, 2020.
[9] O. Bousquet, U. Luxburg, G. Rätsch (Eds.), Advanced Lectures on Machine Learning: ML Summer
Schools 2003, Revised Lectures, Canberra, Australia, February 2--14, 2003, Tübingen, Germany, August
4--16, 2003, vol. 3176, Springer, 2011.
[10] C.M. Bishop, N.M. Nasrabadi, Pattern Recognition and Machine Learning, vol. 4, Springer, New York,
2006, p. 738.
[11] L. Van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research
9 (11) (2008).
[12] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
[13] K.P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
[14] S. Chao, C. Lihui, Feature dimension reduction for microarray data analysis using locally linear em­
bedding, in: Proceedings of the 3rd Asia-Pacific Bioinformatics Conference, 2005, pp. 211--217.
[15] J. Li, K. Cheng, S. Wang, F. Morstatter, R.P. Trevino, J. Tang, H. Liu, Feature selection: a data perspective,
ACM Computing Surveys 50 (6) (2017) 1--45.
[16] I. Guyon, S. Gunn, M. Nikravesh, L.A. Zadeh (Eds.), Feature Extraction: Foundations and Applications,
vol. 207, Springer, 2008.
[17] T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory
13 (1) (1967) 21--27.
[18] J. Deng, J. Guo, N. Xue, S. Zafeiriou, ArcFace: additive angular margin loss for deep face recogni­
tion, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019,
pp. 4690--4699.
[19] B. Ghojogh, F. Karray, M. Crowley, Locally linear image structural embedding for image structure
manifold learning, in: International Conference on Image Analysis and Recognition, Springer Inter­
national Publishing, Cham, August 2019, pp. 126--138.
[20] L.K. Saul, S.T. Roweis, Think globally, fit locally: unsupervised learning of low dimensional manifolds,
Journal of Machine Learning Research 4 (Jun 2003) 119--155.
[21] M. Delkhosh, K. Parand, A.H. Hadian-Rasanan, A development of Lagrange interpolation, Part I: the­
ory, arXiv preprint, arXiv:1904.12145, 2019.
[22] X. Zhao, S. Zhang, Facial expression recognition using local binary patterns and discriminant kernel
locally linear embedding, EURASIP Journal on Advances in Signal Processing 2012 (2012) 1--9.
[23] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press,
2004.
[24] A.H. Hadian Rasanan, S. Nedaei Janbesaraei, D. Baleanu, Fractional Chebyshev kernel functions: the­
ory and application, in: J. Amani Rad, K. Parand, S. Chakraverty (Eds.), Learning with Fractional
Orthogonal Kernel Classifiers in Support Vector Machines, in: Industrial and Applied Mathematics,
Springer, Singapore, 2023.
[25] A.H. Hadian Rasanan, J. Amani Rad, M.S. Tameh, A. Atangana, Fractional Jacobi kernel functions:
theory and application, in: J. Amani Rad, K. Parand, S. Chakraverty (Eds.), Learning with Fractional
Orthogonal Kernel Classifiers in Support Vector Machines, in: Industrial and Applied Mathematics,
Springer, Singapore, 2023.
[26] A.H. Hadian Rasanan, S. Nedaei Janbesaraei, A. Azmoon, M. Akhavan, J. Amani Rad, Classification using orthogonal kernel functions: tutorial on ORSVM package, in: J. Amani Rad, K. Parand, S. Chakraverty (Eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support Vector Machines, in: Industrial and Applied Mathematics, Springer, Singapore, 2023.
[27] J.A. Rad, S. Chakraverty, K. Parand, Learning with Fractional Orthogonal Kernel Classifiers in Support
Vector Machines: Theory, Algorithms, and Applications, Springer, 2023.
[28] J.C.A. Barata, M.S. Hussein, The Moore–Penrose pseudoinverse: a tutorial review of the theory, Brazil­
ian Journal of Physics 42 (2012) 146--165.
[29] O. Kouropteva, O. Okun, M. Pietikäinen, Incremental locally linear embedding, Pattern Recognition
38 (10) (2005) 1764--1767.
[30] S.P. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[31] H. Chang, D.Y. Yeung, Robust locally linear embedding, Pattern Recognition 39 (6) (2006) 1053--1065.

[32] Y. Zhang, D. Ye, Y. Liu, Robust locally linear embedding algorithm for machinery fault diagnosis, Neu­
rocomputing 273 (2018) 323--332.
[33] Y. Pan, S.S. Ge, A. Al Mamun, Weighted locally linear embedding for dimension reduction, Pattern
Recognition 42 (5) (2009) 798--811.
[34] M. Vladymyrov, M.Á. Carreira-Perpinán, Locally linear landmarks for large-scale manifold learning,
in: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD
2013, Prague, Czech Republic, September 23--27, 2013, Proceedings, Part III 13, Springer, Berlin, Hei­
delberg, 2013, pp. 256--271.
[35] D. De Ridder, O. Kouropteva, O. Okun, M. Pietikäinen, R.P. Duin, Supervised locally linear embedding,
in: International Conference on Artificial Neural Networks, Springer, Berlin, Heidelberg, June 2003,
pp. 333--341.
[36] S. Zhang, K.W. Chau, Dimension reduction using semi-supervised locally linear embedding for plant
leaf classification, in: Emerging Intelligent Computing Technology and Applications: 5th International Conference on Intelligent Computing, ICIC 2009, Ulsan, South Korea, September 16--19, 2009,
Proceedings 5, Springer, Berlin, Heidelberg, 2009, pp. 948--955.
[37] B. Ghojogh, A. Ghodsi, F. Karray, M. Crowley, Multidimensional scaling, Sammon mapping, and
isomap: tutorial and survey, arXiv preprint, arXiv:2009.08136, 2020.
[38] Z. Zhang, H. Zha, Principal manifolds and nonlinear dimensionality reduction via tangent space
alignment, SIAM Journal on Scientific Computing 26 (1) (2004) 313--338.
6
Multi-dimensional scaling
Sherwin Nedaei Janbesaraei a, Amir Hosein Hadian Rasanan b,
Mohammad Mahdi Moayeri c, Hari Mohan Srivastava d, and
Jamal Amani Rad e
a Institute for Cognitive Sciences Studies (ICSS), Tehran, Iran b Faculty of Psychology, University of Basel,

Basel, Switzerland c Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada
d Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada e Choice Modelling

Centre and Institute for Transport Studies, University of Leeds, Leeds, United Kingdom

6.1 Basics
Multi-dimensional scaling is a family of statistical methods that helps researchers uncover
the latent structures or relationships hidden within a dataset. It allows them to explore
underlying themes, or dimensions, that explain the similarities or dissimilarities (i.e., dis­
tances) between data points. While heavily influenced by psychological studies in the 20th
century, the concept of MDS has a richer history. The groundwork for MDS was laid by
Eckart and Young in their work on approximating matrices with lower-rank ones [2]. Young
and Householder further contributed by studying distances in Euclidean spaces and their
relationship to lower-rank matrices [3]. The term ``multi-dimensional scaling'' itself first
appeared in an article by M.W. Richardson titled ``Multidimensional psychophysics'' [1].
Following these foundational works, Torgerson introduced metric MDS (classical
MDS), which focuses on preserving distances between data points [4]. A decade later,
Shepard (1962) introduced non-metric MDS, which relaxes the strict distance preserva­
tion requirement [8]. The field of MDS has seen further advancements with significant
contributions from researchers like Kruskal (1964) on non-metric MDS [5] and least square
MDS, Coombs on unfolding models [6], and Horan [9] and Carroll [10] on three-way MDS
models (INDSCAL and IDIOSCAL).

6.1.1 Introduction to multi-dimensional scaling


MDS is a group of techniques within exploratory data analysis (EDA) and dimensional­
ity reduction. It aims to transform data with many dimensions (features or attributes)
into a lower-dimensional space for easier visualization and analysis. High dimensionality
can arise from numerous observations (data points) or dimensions (features) within the
data. Imagine evaluating stocks for investment. Consider many attributes, such as consis­
tent earnings growth, earnings per share, price-to-earnings ratio, capital expenditures, and
market value. In this scenario, the ``observations'' are the companies, and the ``dimensions''
are the attributes used for comparison. While a large number of observations or dimensions
can provide more information, it also increases the complexity of data processing
and analysis. Imagine trying to visualize the distance between two points on a sphere -- it’s
not as straightforward as calculating the Euclidean distance in two or three dimensions.
This phenomenon is known as the ``curse of dimensionality''. As the number of dimen­
sions increases, the volume of space grows exponentially, making it difficult to efficiently
store, manipulate, and visualize high-dimensional data. Techniques like MDS become cru­
cial for overcoming these challenges by effectively reducing the dimensionality of the data
while preserving the most important relationships and information.
MDS works by transforming observations from a high-dimensional space into a lower­
dimensional one. Crucially, this transformation aims to preserve the essential relation­
ships and information within the data. The resulting lower-dimensional representation
allows for easier visualization and understanding while retaining the characteristics of the
original data points. This use case makes MDS a powerful mapping technique as well. Con­
sider applying MDS to customer survey responses. MDS can reveal hidden patterns and
relationships by mapping the responses into a two-dimensional space. Customers with
similar responses will be positioned closer, allowing you to identify distinct customer seg­
ments. Thus MDS is a valuable tool for market research and targeted marketing campaigns.
In essence, MDS acts as a dimensionality reduction and visualization tool (identifying clus­
ters) and a data exploration tool (revealing customer relationships).
According to the literature, data points encompass objects, features, variables, subjects,
or any measurable entity or concept. In psychology studies, data points may include stim­
uli, subjects’ characteristics relevant to the experiment, their response times or decisions,
or any other psychological measurements [19--27]. Discerning whether two comparable
items are similar or identical is a straightforward task for humans, involving either consid­
ering them as identical or discriminating between two distinct items.
MDS leverages proximity to reveal hidden structures. The core concept lies in the as­
sumption that data points close together in the original high-dimensional space are more
similar, while distant points are likely dissimilar. MDS quantifies this notion of proximity
by calculating distances between data points in a Euclidean space. The resulting output
is a ``map'' visually representing these spatial relationships. Points with smaller distances,
indicating more marked similarity, are positioned closer together on the map, while dis­
similar points are placed farther apart [14]. To illustrate, consider a dataset recording the
flight times (in minutes) from Bristol Airport, located in western England, to various des­
tinations. Fig. 6.1 summarizes this data. For example, flights from Bristol to Geneva take
nearly 100 min.
While clustering this small dataset might seem straightforward, real-world scenarios
often involve much larger data volumes. When dealing with extensive datasets, even es­
timating potential clusters can be complicated. To utilize MDS in this situation, we first
need a distance matrix representing the dissimilarities between data points. In our case,
this translates to the flight times between Bristol Airport and various destinations (Fig. 6.2).
MDS leverages this matrix to identify the optimal lower-dimensional space that best pre­
serves the relationships within the data.

FIGURE 6.1 This horizontal bar chart illustrates the flight times (in minutes) from Bristol Airport to various desti­
nations. As shown, Las Palmas has the longest flight time at approximately 250 min, while the flights to Guernsey,
Belfast, and Dublin almost take the same time, around 70 min.

FIGURE 6.2 This figure depicts a distance matrix illustrating the relative distances between Bristol Airport and
various destinations. It complements Fig. 6.1 by providing a different perspective for comparison. While Fig. 6.1 fo­
cused on flight times, this matrix allows you to visualize the relative distances between destinations. Note that the
distances are not shown to scale and may not directly correspond to flight times due to factors like wind patterns
and air traffic control. For better understanding of this figure and its details, please refer to the online version.

Fig. 6.3 depicts the initial state of the Bristol flight data. Each destination is represented
by a circle and a label, connected by lines indicating their similarity (proximity) based on
flight duration. For example, destinations like Faro, Malaga, and Krakow appear close to­
gether, suggesting similar flight times from Bristol. This is confirmed in Fig. 6.1, where
flight durations to these destinations are around 150 min. The distance matrix (Fig. 6.2)
further reinforces this by showing near-zero distances between these locations (distance
(Faro, Krakow) = 0, distance (Faro, Malaga) = 0.0685, distance (Malaga, Krakow) = 0.0685).
This proximity between data points is reflected in the clustering patterns. The lines con­
necting them in Figs. 6.3 and 6.4 highlight these similarities. Additionally, the color scheme
in these figures might indicate potential clusters, which will be further explored (or con­
firmed) in the final clustering results.

FIGURE 6.3 This figure builds upon the distance matrix presented in Fig. 6.2 by visually exploring potential clus­
ters of flight times. By referencing the flight times from Fig. 6.1, destinations with similar flight durations can be
identified. However, the current visualization might benefit from further refinement for clearer interpretation. For
better understanding of this figure and its details, please refer to the online version.

As previously mentioned, the MDS analysis plot is a powerful tool for visualizing data
and uncovering potential clusters. By projecting the data points onto a two-dimensional
space based on pairwise distances, MDS facilitates identifying clusters and analyzing any
underlying patterns within the data.
Fig. 6.4 presents the results of the MDS analysis. This visualization allows us to read­
ily assess whether flight-time-based clusters of destinations emerge based on the two-way
distances between all data points. MDS leverages pairwise distances to determine the sim­
ilarity or dissimilarity between data points. While the specific configuration of the plot may
vary slightly across different MDS runs, the underlying clustering pattern will remain con­
sistent.
MDS can be broken down into three main steps, as Torgerson described [4]. The first
step involves the input data, which can be any dataset commonly encountered in machine
learning. This data typically takes the form of a table containing multiple observations
(data points) described by various attributes or features.
1. Data points are embedded in a space of dimensionality z, where z corresponds to the
number of features or attributes associated with each data point.

FIGURE 6.4 This figure presents a clustered visualization of the distance matrix from Fig. 6.2 using the Multi­
Dimensional Scaling (MDS) algorithm. Compared to the raw distance matrix, this visualization facilitates easier
exploration and understanding of the relationships between destinations. Distinct clusters emerge, highlighting
groups of destinations that are relatively close in distance from Bristol Airport. For better understanding of this
figure and its details, please refer to the online version.

2. The distance between each pair of labels is calculated and transformed using a distance
function. Also, the lowest achievable dimension is determined. Dij is the distance be­
tween obji and objj .
3. The output is the observations in the space of dimension y (where y < z), which is now
plausible to visualize and explore.
The heart of the MDS procedure lies in the second step, where a function minimizes
the discrepancy between distances in the high-dimensional space (n) and the lower­
dimensional space (m). This optimization step defines different MDS models based on the
chosen distance function. However, to delve deeper into these models, we first need a solid
understanding of the concept of proximity in general. We need a few basic definitions that
are used extensively in the literature on MDS before we can move forward:
• n refers to the number of objects, observations, stimuli, or other concepts treated as
input data into the MDS algorithm.
• Proximity is a generic term that refers to the relation of two data points from input data
in terms of similarity or dissimilarity. A high pij denotes that i and j are similar.
• Dissimilarity denoted as δij is a proximity aspect, in contrast to the similarity and refers
to what extent two data points i and j differ in terms of distance, specifically in geomet­
rical space.

• X is the configuration of n points in m-dimensional space. Such a configuration is not


unique but arbitrary. Finding the optimal configuration is the goal of the MDS algorithm.
In a mathematical definition, X is a matrix of size n × m, the relative coordinates
of n points on m Cartesian axis. X =(xi1 , xi2 , . . . , xim ) are the coordinates of point i with
respect to axes 1,2 to m. Note that dij (X) used in this chapter refers to the fact that the
distance between data points i and j is a function of the coordinates X.
• f (.) refers to a mapping function. The mapping function determines the way ob­
served proximities are transformed into estimated distances. Therefore f (pij ) denotes
the mapping function on the proximity of the i and j, that is sometimes written as
f : pij → f (pij ) to point to the transformation process and also d̂ij to point to the out­
put distance of the mapping function. Note that in the case of Ordinal input data (as it
is in non-metric MDS) d̂ij are called disparities.

6.1.2 Data in MDS


MDS applies to the data that are gathered among related objects, entities, or stimuli. As
a result of the differences in the data sources, the collection methods for data are more
complex. The most tiring part of MDS analysis is probably the data collection and prepa­
ration part. There is a possibility, however, of observing or measuring data by simply asking
participants about their judgments about the subject matter. However, there are many use
cases where one needs to transform some derived attributes of the original data to achieve
the desired results, such as the Bristol example presented in the previous section. It is im­
portant to note that in both cases, the data may come in different scales ranging from
nominal (categorical) to ordinal (an ordered set of quantitative or qualitative data), inter­
val (quantitative data whose difference between two points can have a significant impact,
such as temperature), and ratio scales, which are essentially the same as interval scales but
have a definite zero point. The proximity measure is a common means of comparing mea­
sured data points in all measurement scales. Proximity measures compare the distances
between two points in a given space. This can be used to measure the similarities and dif­
ferences between two points and identify clusters of points that are closely related to one
another.

6.1.3 Proximity and distance


The definition of proximity is introduced and elaborated upon by Coombs [6] and Shep­
ard [8]. Proximity is the closeness or nearness of two objects, observations, stimuli, or
data points in space or time. Two objects of interest can be compared regarding spatial or
temporal closeness. Therefore one can determine the relationship between those objects
over a spectrum from similar (completely overlapping) to dissimilar (far away). A common
method of judging such things is using proximity. Direct proximity comes directly from the
assessment of similarity or dissimilarity ratings of data points. For example, some subjects
can be interviewed to determine if the objects of interest are similar or dissimilar. Multiple
types of direct proximity are introduced, such as pairwise comparison [16], where subjects
are asked to rate similar or dissimilar data points on a pre-defined scale in a pairwise fash­
ion. In the ranking method, subjects are asked to sort some paired objects in ascending or
descending order according to the similarity extent of pairs [12]. A sophisticated method
is called Q-sort, introduced by William Stephenson [17], which is indeed an instrument in
Q methodology, a research method commonly used in psychology and social sciences to
study people’s viewpoint over a subject such as their preference or belief in a matter under
study. Q-sorting, as the fourth step in Q-methodology, is the task of rank ordering of Q-sets
by participant. Q-set or Q-samples are smaller sets sampled from all possible statements
about a specific topic. Participants are asked to judge a statement from a Q-set and order
it according to a pre-defined scale [18]. Through Q-sorting, the researcher can understand
the preferences and opinions of the participants on a given topic. By understanding the
similarities and differences between the participants’ Q-sorts, the researcher can gain in­
sight into the structure of the underlying population.
Another popular method is called the anchor stimulus method. In the assessment, one
of the objects will be considered an anchor or a baseline. Thus the remaining objects
will be compared to the baseline [12]. In addition to direct judgments, proximity can be
derived from indirect measurements. Derived proximities are calculated for a pair of vari­
ables, usually as a correlation or distance matrix [11]. The correlation of two features can
be calculated using multiple methods such as Spearman correlation for monotonic rela­
tionships and Pearson correlation for continuous data. However, in a general form, the
correlation between two items, p, and q, for N individuals can be determined using the
following equation:
r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}},    (6.1)

where −1 ≤ r ≤ 1, N is the number of individuals, and x̄ and ȳ are the averages over all xi
and yi. The value of r indicates to what extent and in what direction the responses of individu­
als to X and Y follow a similar pattern. In other words, given an individual's response
to X, how similar or different is her response to Y likely to be? A positive r
indicates a positive, correlated relation between X and Y, meaning both
rise or fall together. In contrast, a negative r indicates the opposite relation: an
increase in X, for example, is accompanied by a decrease in Y.
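As a minimal illustration of Eq. (6.1), the correlation of two made-up response vectors and its conversion into a dissimilarity can be computed with NumPy as follows; the variable names and values are assumptions for the example only.

```python
import numpy as np

# Hypothetical responses of N = 6 individuals to two items X and Y.
x = np.array([3.0, 5.0, 2.0, 4.0, 4.5, 1.0])
y = np.array([2.5, 4.0, 2.0, 3.5, 5.0, 1.5])

# Pearson correlation r as in Eq. (6.1).
r = np.corrcoef(x, y)[0, 1]

# A correlation is a similarity; one simple conversion to a dissimilarity
# (among several options) is delta = 1 - r.
delta = 1.0 - r
print(f"r = {r:.3f}, dissimilarity = {delta:.3f}")
```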
Multiple equations can be used to calculate the similarity of quantitative data points.
Note that in the case of dichotomous (binary) variables, usually the calculated similarity
coefficients will be transformed into dissimilarity using a transformation equation such as
δij = 1 − sij . This is the common method for qualitative variables. Suppose there are more
attributes than just the flight time between cities in the Bristol Airport example. Consider
ticket price and flight speed as other comparison metrics. Therefore matrix X is a matrix
of size N × m, where N is the number of destinations and m is the number of attributes.

Hence, the distance matrix can be obtained using a distance measure such as Euclidean:

d_{pq}(X) = \sqrt{\sum_{i=1}^{m} (x_{ip} - x_{iq})^2},    (6.2)

where p and q are two observations. m is the number of attributes. A summary of some
of the most commonly used proximity measures can be found in Table 6.2. In the case of
binary variables of two different objects p and q, the proximity measures or the similarity
coefficients can be determined according to the following description:
There are four possible combinations of binary data for two observations, p and q.
These can be summarized as follows:
1. When p is 1 (observation p is true):
a. If q is also 1 (observation q is also true), then this combination is represented by a.
b. If q is 0 (observation q is false), then this combination is represented by b.
2. When p is 0 (observation p is false):
a. If q is 1 (observation q is true), then this combination is represented by c.
b. If q is 0 (observation q is false), then this combination is represented by d.
Using these four rules, one can use the equations from Table 6.1 to calculate similarity
coefficients or distance of two observations p and q.

Table 6.1  Proximity measurement equations for binary data. Values a, b, c, and d are defined above.

Proximity measure              Equation
Simple matching coefficient    s_pq = (a + d) / (a + b + c + d)
Russell, Rao                   s_pq = a / (a + b + c + d)
Jaccard coefficient            s_pq = a / (a + b + c)
Czekanowski, Sørensen, Dice    s_pq = 2a / (2a + b + c)
Ochiai                         s_pq = a / \sqrt{(a + b)(a + c)}
Braun, Blanquet                s_pq = a / max{(a + b), (c + d)}
Simpson                        s_pq = a / min{(a + b), (a + c)}
Mountford                      s_pq = 2a / (a(b + c) + 2bc)
Phi                            s_pq = (ad − bc) / \sqrt{(a + b)(a + c)(b + d)(c + d)}
Yule                           s_pq = (ad − bc) / (ad + bc)
Hamman                         s_pq = (a − (b + c) + d) / (a + b + c + d)

6.2 MDS models


Input data, which has been discussed in the previous sections, is the first significant aspect of the
MDS technique. In this section, we focus on the different MDS models.
Table 6.2  Summary of the most used proximity measures. The Minkowski model is a generalization of the Euclidean and City-Block models: the Minkowski distance is equal to the Euclidean distance when p = 2 and to the City-Block distance when p = 1. In addition, if p = ∞ it is called the Chebyshev distance.

Proximity measure     Equation
Euclidean (L2)        d_{pq}(X) = \sqrt{\sum_{i=1}^{m} (x_{ip} - x_{iq})^2}
Weighted Euclidean    d_{pq}(X) = \sqrt{\sum_{i=1}^{m} w_i (x_{ip} - x_{iq})^2}
City-Block (L1)       d_{pq}(X) = \sum_{i=1}^{m} |x_{ip} - x_{iq}|
Minkowski (Lp)        d_{pq}(X) = \left( \sum_{i=1}^{m} |x_{ip} - x_{iq}|^{p} \right)^{1/p}
Canberra distance     d_{pq}(X) = \sum_{i=1}^{m} |x_{ip} - x_{iq}| / (|x_{ip}| + |x_{iq}|)
Bray–Curtis           d_{pq}(X) = \sum_{i=1}^{m} |x_{ip} - x_{iq}| / \sum_{i=1}^{m} (x_{ip} + x_{iq})
Chord distance        d_{pq}(X) = \sqrt{\sum_{i=1}^{m} (\sqrt{x_{ip}} - \sqrt{x_{iq}})^2}
Angular separation    d_{pq}(X) = \sum_{i=1}^{m} x_{ip} x_{iq} / \sqrt{(\sum_{i=1}^{m} x_{ip}^2)(\sum_{i=1}^{m} x_{iq}^2)}
Divergence            d_{pq}(X) = \frac{1}{m} \sum_{i=1}^{m} (x_{ip} - x_{iq})^2 / (x_{ip} + x_{iq})^2
Soergel               d_{pq}(X) = \sum_{i=1}^{m} |x_{ip} - x_{iq}| / \sum_{i=1}^{m} \max(x_{ip}, x_{iq})
Bhattacharyya         d_{pq}(X) = \sqrt{\sum_{i=1}^{m} (\sqrt{x_{ip}} - \sqrt{x_{iq}})^2}
Wave–Hedges           d_{pq}(X) = \frac{1}{m} \sum_{i=1}^{m} \left( 1 - \min(x_{ip}, x_{iq}) / \max(x_{ip}, x_{iq}) \right)

By the MDS model, we mean a geometric representation
of distances between variables. As elaborated in the previous section, many ways exist to
assess the distances and create the corresponding matrix. The first step to gain such rep­
resentation is to change the raw input data to a distance matrix, as shown in Fig. 6.2. MDS
models use the distance matrix relations to find the most precise geometrical representa­
tion. Technically, MDS models estimate distances by mapping them from (dis)similarity
coefficients. The mapping function of the MDS model is the main difference between dif­
ferent models. The mapping function determines the number of dimensions (or axes) to
represent the data and how to map the data points on the axes. The model then creates
a matrix of distances between the data points, which can be used to create a visual rep­
resentation of the data. This new visual representation gives us a better understanding of
patterns and relationships within the data. However, in a more general form, one can map
the distance between i and j from the proximity/dissimilarity using the following equation:

d_{ij} = f(\delta_{ij}) + \epsilon_{ij},    (6.3)

where i and j are two data points, f is a mapping function such as the functions from
Table 6.2, and \epsilon_{ij} is the error corresponding to the measurement.

6.2.1 Metric MDS


Metric models are the first models proposed for MDS, assuming input data are measured
on an interval or ratio scale. One of the main metric models is the classical or Torgerson
metric model, and the other is the least square model. Torgerson [4] proposed the classic
MDS (CMDS) model in 1952. CMDS is a mathematical model that uses the dissimilarity
ratings provided by a group of individuals to calculate the distances between the data
points on a map. It considers the people’s subjective opinions and tries to find the most
accurate representation of the distances between the points. Assume a map of the ma­
jor cities of a country. The trivial solution to finding the distances between cities is using
a ruler to measure the distances directly and multiply the measured values by the map
scale. Ratios and intervals are preserved perfectly in such measurements. CMDS works
similarly. This model utilizes the Euclidean distance as the distance measure to create a
complete and symmetric similarity matrix from given data points in the Euclidean space.
The distance between two cities a and b with coordinates x and y, is achievable using the
Euclidean formula:

\mathrm{distance}_{a,b} = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}.    (6.4)
The CMDS model recreates the map using Euclidean distances, although in most cases,
the proximity of the data points is used. In CMDS, the observed distance δij equals dij in
Euclidean space. A metric MDS model starts with a proximity matrix (distance matrix).
As shown in Fig. 6.2, that is a symmetric square matrix, where δij = δj i (considering that
there is some equality of values, in Fig. 6.4 we have avoided any duplicates). In the related
example, the model tries to find a set of 14 points in the 1-dimensional space, such that
the distance between any pairs of the 14 points in the 1-dimensional space is as similar
as possible to the corresponding distance between cities in the original proximity matrix
(dij ≈ δij ). Usually, this mapping is achieved using the Euclidean equation (6.2):

f (δij ) = dij (X), (6.5)

where δij is the original proximity/dissimilarity of two observations i and j, and dij is the
transformed proximity (the estimated distance) in the target vector space X. f indicates
the transformation approach to achieve the best fit for the model. Metric MDS or Interval
MDS, as it is called due to the nature of its input data, tries to preserve the linearity of the
input data in the resultant distances. The standard model of metric MDS is as follows:

f : \delta_{ij} \to a + b\,\delta_{ij} = d_{ij}(X),    (6.6)

where a and b are free parameters to scale the input data. This linear transformation keeps
meaningful information.

6.2.2 Torgerson’s method


Suppose the input data includes N data points, each one is from dimension m, and there­
fore, the dissimilarity between data point i and j is δij. The CMDS model creates N projections
from m-dimensional space to k-dimensional space (k < m) and finds the best
arrangement of the projections such that the Euclidean distance dij in dimension k re­
sembles the δij in dimension m, for all pairs of i and j. In other words, the direct observed
proximity/dissimilarity δij , between all pairs of i and j is assumed to be equal to the corre­
sponding distance in Euclidean space:

\delta_{ij} = d_{ij} = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2}.    (6.7)

CMDS tries to minimize the following equation:



\sum_{i<j} (\delta_{ij} - d_{ij})^2,    (6.8)

where δij is the dissimilarity between i and j (given form distance/proximity matrix) and
dij is the mapped distance between i and j, obtained from the mapping function of the
CMDS.
Torgerson's method incorporates a double-centered version of the distance/proximity
matrix (Δ), denoted as B, by applying a transformation to the squared version of the proximity
matrix Δ*, in which all elements of Δ are squared. Hence, the first step in Torgerson's
solution is calculating Δ* and the second step requires calculating the double-centered
matrix using:
B = -\frac{1}{2}\, C \Delta^{*} C,    (6.9)
where C is a centering matrix that can be obtained using an identity matrix I, all-ones
matrix J, and the number of data points N :

C = I - \frac{1}{N} J_N.    (6.10)
Technically, the later mentioned transformation yields a matrix with the sum of all rows,
all columns, and overall sums of all entries equal to zero. To be more precise, for each δij ,
Torgerson proved that the corresponding dissimilarity in B can be obtained from:

\delta_{ij}^{*} = -0.5\left(\delta_{ij}^{2} - \delta_{i.}^{2} - \delta_{.j}^{2} + \delta_{..}^{2}\right),    (6.11)

where δ²i. is the mean value of the ith row in the squared distance matrix, δ².j is the mean
value of the jth column in the squared distance matrix, and δ².. is the mean value of all entries in the
squared distance matrix. Lastly, a Singular Value Decomposition (SVD) on matrix B yields
the output. The SVD of matrix B provides a low-rank approximation of the matrix, which
can be used to extract meaningful information from the data. The first two dimensions of
SVD represent the data in a new lower-dimensional space. The following steps summarize
the CMDS model:
1. Construct the squared proximity matrix Δ*.
2. Apply the double centering solution using Eqs. (6.9) and (6.10).
3. Find the k largest positive eigenvalues of B (B = XX^T) and the corresponding eigenvectors E_k.
4. The k-dimensional configuration of N objects can be derived from X = E_k Λ_k^{1/2}, where Λ_k is the diagonal matrix of the k largest eigenvalues of matrix B.
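The four steps above can be prototyped in a few lines of NumPy. The following is a minimal sketch of classical (Torgerson) MDS, assuming a symmetric dissimilarity matrix as input; it is meant for illustration rather than as a production implementation.

```python
import numpy as np

def classical_mds(delta, k=2):
    """Classical (Torgerson) MDS: embed a symmetric dissimilarity matrix
    `delta` into k dimensions, following Eqs. (6.9)-(6.10)."""
    n = delta.shape[0]
    delta_sq = delta ** 2                      # step 1: squared proximities
    C = np.eye(n) - np.ones((n, n)) / n        # centering matrix, Eq. (6.10)
    B = -0.5 * C @ delta_sq @ C                # double centering, Eq. (6.9)
    eigvals, eigvecs = np.linalg.eigh(B)       # B is symmetric
    order = np.argsort(eigvals)[::-1][:k]      # k largest eigenvalues
    L = np.diag(np.sqrt(np.clip(eigvals[order], 0, None)))
    return eigvecs[:, order] @ L               # X = E_k Lambda_k^{1/2}

# Example: dissimilarities of four points on a line at 0, 1, 3, and 6.
pts = np.array([0.0, 1.0, 3.0, 6.0])
delta = np.abs(pts[:, None] - pts[None, :])
X = classical_mds(delta, k=1)
print(np.round(np.abs(X[:, 0] - X[0, 0]), 3))  # recovers 0, 1, 3, 6
```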

6.2.3 Least square model


In addition to Torgerson’s method, the least square approach to metric MDS uses Sam­
mon’s loss function on the lower dimension to find mapped distances. This is an iterative
method and consists of multiple steps. An arbitrary initial state of the points must be de­
termined as a first step. Then, the dissimilarity matrix δij should be formed by computing
the Euclidean distances between all pairs of data points. Points close to each other in
higher-dimensional space should also be close in lower-dimensional space, and vice versa.
Therefore the goodness of mapping requires an iterative control. Sammon’s loss function
judges this:
\mathrm{loss}(d_{ij}, \delta_{ij}) = \frac{\sum_{i<j} \delta_{ij}^{-1} \left(d_{ij} - \delta_{ij}\right)^{2}}{\sum_{i<j} \delta_{ij}},    (6.12)
where δij is the observed dissimilarity in higher-dimensional space and dij is the distance
in low-dimensional space between points i and j.
By minimizing Eq. (6.12) one can reach the minimum difference between inter­
point distances in higher-dimensional space and the corresponding distance in low­
dimensional space. Moreover, Sammon’s loss function tries to preserve the topology by
applying more weight to the small inter-point distances through the δij⁻¹ factor.
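As an illustration of the least square idea, the sketch below evaluates Sammon's loss of Eq. (6.12) and lowers it by a naive finite-difference gradient descent; the learning rate, iteration count, and toy data are arbitrary choices for the example, and practical implementations use analytic gradients or Newton-type updates instead.

```python
import numpy as np

def sammon_loss(D_high, D_low, eps=1e-12):
    """Sammon's loss of Eq. (6.12): delta-weighted squared discrepancy between
    high-dimensional dissimilarities and low-dimensional distances."""
    mask = ~np.eye(D_high.shape[0], dtype=bool)
    d, delta = D_low[mask], D_high[mask]
    return np.sum((d - delta) ** 2 / (delta + eps)) / np.sum(delta)

def pairwise(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

# Toy data: 20 points in 5 dimensions, mapped to 2 dimensions by
# naive gradient descent on the loss (numerical gradient for brevity).
rng = np.random.default_rng(0)
X_high = rng.normal(size=(20, 5))
D_high = pairwise(X_high)
Y = rng.normal(scale=1e-2, size=(20, 2))     # arbitrary initial configuration
lr = 0.5
for _ in range(200):
    base = sammon_loss(D_high, pairwise(Y))
    grad = np.zeros_like(Y)
    for idx in np.ndindex(Y.shape):          # finite-difference gradient
        Y_eps = Y.copy()
        Y_eps[idx] += 1e-5
        grad[idx] = (sammon_loss(D_high, pairwise(Y_eps)) - base) / 1e-5
    Y -= lr * grad
print("final Sammon loss:", sammon_loss(D_high, pairwise(Y)))
```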

6.2.4 Non-metric MDS


Non-metric MDS was proposed through critical contributions of Shepard [8] and Kruskal [5].
Considering proximities as distances in metric models has some inherent restrictions. In
psychology, for example, there is a difference between the preferences of subjects when
sorting different objects, as distance is not meaningful or straightforward. Whenever the
ordinal property is the feature of interest, non-metric models are proposed. Non-metric
MDS is proposed as an alternative to metric models because it allows for a more accurate
representation of ordinal data. For instance, when studying the preferences of subjects in
sorting different objects, the difference between preferences can be accurately represented
using non-metric models, whereas metric models may not capture this difference as well.
Non-metric MDS models are similar to metric models with the same goal. In non­
metric MDS, dissimilarities are calculated using the ordinal properties of the data. In
other words, the assumption is that the proximities are on the ordinal scale. Therefore this
method is also called Ordinal MDS. The main difference between metric and non-metric
MDS approaches is the solution to the relation between the observed original dissimilarities
in higher-dimensional space and the model-derived distances in lower-dimensional
space. As already mentioned, this relation is considered linear in the metric model. Still,
according to Shepard, the original proximities are considered monotonically related to the
derived proximity in the non-metric model. Non-metric MDS tries to find the optimally
scaled proximities (through monotonic transformation), which is also called disparities,
pseudo-distance, and also fitted distance; d̂ = f (proximity). Even so, this difference is not
very important as the output of both approaches is fairly similar. However, the non-metric
methods provide a better fit for data.
Non-metric or ordinal MDS creates a new configuration in a lower-dimensional space
through an iterative optimization method controlled by a cost function. Unlike the scaling
solution of metric models, in which new dimensions are added to the previously estimated
ones, non-metric solutions estimate new ones simultaneously. The result will be discarded
if the desired goodness of fit is not reached. The solution will be pursued in an additional
dimension for a new configuration. The general form of the non-metric model is the fol­
lowing:
\delta_{ij} = f(d_{ij}) = f\!\left(\sqrt{\sum_{k} (x_{ik} - x_{jk})^{2}}\right),    (6.13)

where δij is the observed distance or the original proximity/dissimilarity, dij is the model­
derived distance with the rank order similar to the δij as much as possible, xik and xj k are
the estimated coordinates used to compute the dij , and f is a monotonic function. Let us
assume the resulting configuration in the higher dimension is X. Therefore in the non­
metric method, the mapping equation from proximity/dissimilarity to distance is:

f : \delta_{ij} \to d_{ij}(X),    (6.14)

f : \delta_{ij} < \delta_{kl} \to d_{ij}(X) \le d_{kl}(X).    (6.15)

For all pairs of data points i, j, k, and l, Eq. (6.14) is monotone such that (6.15) holds
true. Eq. (6.15) defines the monotonic relation between all such subsets of points and
refers to the weak monotonic relation [5], which only requires that there should not be any
inversion in rank orders: if δij < δkl then dij ≤ dkl. This is in contrast to the strong
monotonic relation [30], which is more restrictive and does not allow estimated distances to be
equal (if δij < δkl then dij < dkl). Technically, it points to the fact that only ordering
matters in non-metric MDS.
The solution starts with N data points that need to be scaled. The distances in the initial
configuration barely align with the original proximities/dissimilarities of the data points.
Therefore we need to move the points so that the distances will be monotonic concerning
the dissimilarities. However, in real cases, we often cannot reach this status. In such cases,
the solution will be stopped when moving the points does not improve the consistency
between dissimilarities and distances, which means we have reached an approximation.
In other words, non-metric MDS seeks a configuration in which distances between pairs
of points are to be as consistent as possible with the original proximities/dissimilarities.

Although checking the consistency of observed distances and estimated distances requires
a comparison between the rank orders, similar to many model-based techniques, MDS
models use fit measures to compare such consistency.

6.2.5 The goodness of fit


The goodness of fit of a non-metric model explains how successful it is in fitting the ob­
served data. Technically, this is a measure to quantify prediction precision. Although many
approaches have been proposed for this quantification test, they all rely on the discrepancy
between the generated or predicted data and the observed data. The mentioned discrep­
ancy is the error of the fitting procedure; accordingly, by considering it as a loss function
and trying to minimize it, the MDS algorithm may reach the most accurate estimation. In
MDS literature, this fit measure is called STRESS (Standardized Residual Sum of Squares).
The STRESS measure is used to evaluate the ability of the MDS algorithm to capture the
patterns of the observed dataset. The smaller the STRESS value, the better the fit of the gen­
erated model and the more accurate the quantification of the data is. The STRESS function
measures the deviation of the estimated distances in geometric space from the corresponding
observed dissimilarities. Multiple formulations have been proposed to calculate the MDS STRESS.
Some of the most frequent ones are as follows:

S_1 = \sqrt{\frac{\sum (d_{ij} - \hat{d}_{ij})^{2}}{\sum d_{ij}^{2}}},    (6.16)

S_2 = \sqrt{\frac{\sum (d_{ij} - \hat{d}_{ij})^{2}}{\sum (d_{ij} - \bar{d})^{2}}},    (6.17)

SS_1 = \sqrt{\frac{\sum (d_{ij}^{2} - \hat{d}_{ij}^{2})^{2}}{\sum d_{ij}^{4}}},    (6.18)

and

SS_2 = \sqrt{\frac{\sum (d_{ij}^{2} - \hat{d}_{ij}^{2})^{2}}{\sum (d_{ij}^{2} - \overline{d^{2}})^{2}}},    (6.19)

where dij is the model-estimated distance and d̂ij denotes the disparities. Disparities are the
appropriate and optimal transformations of the proximities, and d̄ is the mean of the distances
estimated by the model. Eq. (6.16), called Kruskal's Stress or the Normalized Stress, is
usually designated as S1. Eq. (6.17), designated as S2, differs from Kruskal's STRESS only in the
denominator, where the discrepancy of estimated distances from disparities
is normalized by the sum of squared deviations from the mean instead of the sum of the
squared distances. Eqs. (6.18) and (6.19) are called Young's Stress functions.
In all four STRESS equations, the sum of squared differences between the distances and the
corresponding disparities is divided by a normalizing factor. Consequently, the output will vary
between 0 (optimal fit) and 1 (worst fit). Moreover, the normalization makes the output
independent of the size and scale of the configuration.
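For reference, Kruskal's Stress-1 of Eq. (6.16) can be computed directly from a matrix of configuration distances and a matrix of disparities, as in the following minimal sketch (the toy matrices are illustrative):

```python
import numpy as np

def kruskal_stress1(d, d_hat):
    """Kruskal's Stress-1 of Eq. (6.16), computed over the upper triangle
    of the distance matrix d and the disparity matrix d_hat."""
    iu = np.triu_indices_from(d, k=1)
    num = np.sum((d[iu] - d_hat[iu]) ** 2)
    den = np.sum(d[iu] ** 2)
    return np.sqrt(num / den)

# Perfect fit -> stress 0; a small perturbation -> small positive stress.
d = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]], dtype=float)
print(kruskal_stress1(d, d))          # 0.0
print(kruskal_stress1(d, d * 1.1))    # > 0
```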
Stress is a measure of how much information is lost during MDS. The optimal MDS
solution has Stress=0, meaning nothing is lost during MDS. Nevertheless, interpreting the
output value of the Stress equation is tricky, and there is no clear consensus on how it
should be interpreted. However, this is a legitimate question for MDS users: what is an
``adequate'' or ``acceptable'' value for Stress? The answer to this question is not certain, but
some ideas exist. Kruskal recommends some values that are important for interpretation,
including (0.20, poor), (0.10, fair), (0.05, good), (0.025, excellent), and (0.00, optimal), but
in a real research context these values are not much help. According to a statistical analysis
method for STRESS, explained in detail by Borg et al. [11], one can use the STRESS values
obtained from scaling random data as a baseline.
Fig. 6.5 summarizes the simulation result. Simulation is performed for Interval (metric)
and Ordinal (non-metric) MDS models and also for three different dimensions to compare
the effects of dimensions on the STRESS value. The other effect under consideration is the
effect of the number of data points (observations, stimulus, or subjects). A color scheme
is used to highlight the variation of the Stress value according to the change in dimen­
sions and data points. In both models, the increase in the number of dimensions causes
the STRESS value to decrease, whereas the increase in the number of data points increases
the STRESS too. However, the effect of increasing the dimension is slight at a lower num­
ber of points; conversely, as the number of points increases (given a fixed dimension), the
amount of change in STRESS decreases. As an example, when n = 5, the mean value of
STRESS is 0.0117. Increasing the number of points to 10 affects the STRESS such that it increases to
0.1898, which is a significant leap equal to 0.1781. This is in contrast to the effect of the
increase in points from 100 to 200, which causes only a 0.0129 increase in mean stress.
The STRESS values in Fig. 6.5 are from scaling random data. Therefore to answer the last
question about the ``good'' possible value of STRESS, one should assume that the yielded
STRESS values of a particular MDS should be better (smaller) than the corresponding value
in this table. The STRESS values can be used to find the most appropriate number of
dimensions during the scaling procedure in MDS. The squared STRESS values, such as
Eq. (6.16), indicate how much of the variance in the disparities the model can describe. S1 = 0.17
means that 83% of the data variance can be justified by the model. Therefore when the num­
ber of points is fixed, the algorithm seeks the dimension that explains most of (optimally,
the whole of ) the data.
To summarize, the STRESS values measure the discrepancy between the distances
represented in the MDS model and the original distances in the dataset. The algorithm
searches for a dimension that minimizes the STRESS value, thus maximizing the amount

FIGURE 6.5 Summary of the simulation of the stress values for different configurations. 500 stress values were
calculated per setting. Each setting includes a specific number of observations (8 different sets), a fixed dimension
(3 different sets), and also for Interval (metric) MDS and Ordinal (non-metric) MDS models. The green color denotes
low values in contrast to red, which denotes high values. The color scheme used highlights how absolute values
vary in response to the simulation’s parameters change. For both models, an increase in dimension causes the stress
value to decrease, whereas increasing the number of data points (observations) causes the stress to increase. For
better understanding of this figure and its details, please refer to the online version.

of variance the model can explain. When the STRESS value reaches 0, it means that the
model has explained all the variance.
Two additional techniques, the Scree plot and Shepard diagrams, help judge if the
MDS solution is adequate. The Scree plot compares the STRESS value against the num­
ber of dimensions. In the Scree plot, we are looking for the point where more dimensions
do not yield significantly lower STRESS. This is called the elbow point. The Shepard dia­
gram visually represents the differences between the distances of the original data and the
distances of the transformed data. If the two match closely, then we can say that the MDS
solution is adequate. Thus the better fit has a smaller spread of points in this diagram.
In summary, the following steps are the core of non-metric MDS:

1. A random configuration X is chosen, usually from a normal distribution.
2. The distances between the points are calculated.
3. The optimal monotonic transformation of the proximities (the disparities) is found.
4. The STRESS value is minimized by trying various configurations.
5. Determine whether the STRESS value is acceptable using some criteria; exit the algorithm, or else
return to the second step for another cycle.
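A minimal sketch of these steps using the MDS class of scikit-learn (with metric=False for ordinal MDS) is given below; the dataset is synthetic, the parameter values are arbitrary choices, and the reported stress_ attribute is scikit-learn's raw stress rather than one of the normalized formulas above.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

# Toy high-dimensional data and its dissimilarity matrix.
rng = np.random.default_rng(1)
X_high = rng.normal(size=(30, 8))
delta = pairwise_distances(X_high)

# Ordinal (non-metric) MDS for several target dimensions; comparing the
# resulting stress values is one way to choose the embedding dimension.
for k in (1, 2, 3):
    mds = MDS(n_components=k, metric=False, dissimilarity="precomputed",
              n_init=4, max_iter=300, random_state=0)
    X_low = mds.fit_transform(delta)
    print(f"dim={k}: stress={mds.stress_:.4f}")
```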

6.2.6 Individual differences models


The solution provided by metric and non-metric MDS models yields the configuration X
averaged over all individuals. Such a configuration contains the nomothetic information.
However, in many cases, an idiographic study is required, in which we are trying to under­
stand the differences among individuals in terms of the group configuration in which they
are located. Suppose n individuals are asked to judge m stimuli for a simple explanation of
the situation. As a result, there will be n dissimilarity matrices.
A group configuration Xik contains multiple data points in a given dimension k. An
individual configuration is denoted by xiks, which is defined as follows:

xiks = Xik · wks , (6.20)

where wks is the weight of participant s's configuration regarding the group configuration;
the value of this weight can be interpreted as the importance of each participant's configuration.
Considering Eqs. (6.7) and (6.20), one can define the (dis)similarity (disparities) of
individual configurations as:
\delta_{ijs} = \sqrt{\sum_{k} w_{ks}^{2} \left(x_{ik} - x_{jk}\right)^{2}}.    (6.21)
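A small NumPy sketch of Eqs. (6.20)-(6.21) follows: a hypothetical group configuration is rescaled by one subject's dimension weights, and the corresponding weighted distances are computed. All names and values are made up for illustration.

```python
import numpy as np

# Group configuration X: n = 4 points in k = 2 dimensions (illustrative).
X_group = np.array([[0.0, 0.0], [1.0, 0.5], [0.2, 1.5], [2.0, 2.0]])

# Dimension weights w_ks for one subject s (Eq. (6.20)): this subject
# weights the first dimension much more heavily than the second.
w_s = np.array([1.5, 0.3])
X_subject = X_group * w_s                     # individual configuration

def weighted_distances(X, w):
    """Pairwise distances of Eq. (6.21) with per-dimension weights w."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt(((w ** 2) * diff ** 2).sum(-1))

# Plain Euclidean distances in the subject's space equal the weighted
# distances computed on the group configuration.
print(np.round(weighted_distances(X_group, w_s), 3))
```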

Two basic approaches were proposed for Individual Differences analysis. The first one
is the Weighted Euclidean Model and defines an individual configuration for each partic­
ipant. This model is sometimes referred to as Individual Differences Scaling (INDSCAL).
The second approach is Procrustean Individual Differences Scaling (PINDIS). In the lat­
ter method, each individual matrix is first scaled separately, and then an overall pairwise
comparison between all individuals is carried out through Procrustean analysis.

6.2.7 INDSCAL
The basic idea of Individual Differences Scaling proposed by Carrol and Chang [10] is that
among individuals performing a judgment task over several objects or stimuli, each may
take into account one or more attributes concerning a general construct. For example,
if some individuals are asked to rate different airline companies, one's judgment may be
more influenced by the welfare services, whereas another individual thinks the fastest
flight is more important. Therefore the group space is constructed by all individual spaces
(attributes) that are exceptional cases of the group space, and we want to know how these
individual spaces differ.
INDSCAL works by assigning a weight to each individual space that implies the partici­
pants’ attention, salience, or importance of the relevant dimension (the judged attribute).
Consequently, each participant has their unique set of weights across each dimension.
INDSCAL works on the individual sets to assess how those are (dis)similar. However, we
are not proceeding any further regarding the INDSCAL in this chapter, as the relevant al­
gorithm requires more elaboration. However, it is worth mentioning that other versions of
the INDSCAL have also been proposed. For more information refer to [13,15] and [11].

6.2.8 Tucker–Messick model


Averaging over individuals as used in the Weighted Euclidean Model leads to information
loss [28]. Moreover, comparing individual sets (the unique sets of individuals as explained
in INDSCAL) is expensive. Tucker and Messick suggested constructing a special matrix
X from the dissimilarities to overcome this drawback. This matrix is such that its columns are
equal to the number of individuals and its rows are equal to all the ½n(n − 1) possible ob­
ject(stimulus)-pairs. Hence, the matrix X can be approximated using the Singular Value
Decomposition (SVD), such that:

X = U_k \Sigma_k V_k^{T},

where the matrix U_k contains the principal coordinates in the space of pairs of objects and \Sigma_k V_k^{T} gives
the principal coordinates in a space for individuals.
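The construction can be sketched in NumPy as follows, using random dissimilarity matrices for three hypothetical individuals; the truncation level k = 2 and the way the individual-space coordinates are formed (printed as V_k Σ_k) are meant only as an illustration of the description above.

```python
import numpy as np
from itertools import combinations

# Hypothetical dissimilarity matrices for 3 individuals over n = 4 objects.
rng = np.random.default_rng(2)
n, n_subj = 4, 3
deltas = []
for _ in range(n_subj):
    A = rng.random((n, n))
    A = (A + A.T) / 2
    np.fill_diagonal(A, 0.0)
    deltas.append(A)

# Rows = the n(n-1)/2 object pairs, columns = individuals.
pairs = list(combinations(range(n), 2))
X = np.array([[d[i, j] for d in deltas] for (i, j) in pairs])

# Truncated SVD, X ~= U_k S_k V_k^T with k = 2.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
print("object-pair space coordinates:\n", np.round(U_k, 3))
print("individual space coordinates:\n", np.round(V_k @ S_k, 3))
```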

6.2.9 PINDIS
The Procrustean INdividual Differences Scaling (PINDIS) model was introduced after the
INDSCAL. The basic assumption in this model is that a basic scaling (of any type) is car­
ried out for each matrix that yields the relevant configurations. Let us denote the individual
configurations with Xi for simplicity. Finally, to compare those configurations, Pro­
crustes analysis will be used. This analysis starts with a centroid configuration according
to the method suggested by Gower [31]. Then, multiple transformations are applied to in­
dividual configurations to be as close as possible to the centroid configuration. Procrustes
analysis includes seven steps:
1. tr(X_i^T X_i) = 1, meaning all individual configurations should have mean squared dis­
tances to the origin equal to unity.
2. The second configuration is rotated to the first one. Then, the first estimate of the cen­
troid configuration is Z = ½(X1 + X2).
3. Now, the third configuration X3 is rotated to the Z from the previous step. The new
estimate of Z is obtained as the weighted average of X3 and Z.
4. The last step is repeated until all configurations are included.
5. Now, all configurations are rotated to Z and the relevant goodness of fit is calculated as

G = \frac{1}{N} \sum_{i} \left( 1 - R_i^2(X_i, Z) \right),

where R_i^2(X_i, Z) is the Procrustes statistic for the rotation of X_i to Z and N is the num­
ber of configurations.
6. The updated estimate of the centroid configuration Z is calculated using the aver­
age of the latest rotated configurations X_i, and consequently G is updated.
7. The last step is repeated until G converges. The final Z is the centroid configuration.
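A simplified sketch of this iterative centroid construction, using scipy.spatial.procrustes and equal-weight averaging (rather than the weighted average of step 3), is shown below; scipy's disparity value stands in for the Procrustes statistic R², and the configurations are randomly generated for illustration only.

```python
import numpy as np
from scipy.spatial import procrustes

# Hypothetical individual configurations: 5 points in 2 dimensions each,
# built as rotated and noise-perturbed copies of a common base.
rng = np.random.default_rng(3)
base = rng.normal(size=(5, 2))
rotations = (np.eye(2),
             np.array([[0.0, -1.0], [1.0, 0.0]]),
             np.array([[-1.0, 0.0], [0.0, -1.0]]))
configs = [base @ R + rng.normal(scale=0.05, size=(5, 2)) for R in rotations]

# Steps 2-4 (simplified): rotate each configuration onto the running
# centroid Z and average; scipy's procrustes also centers and rescales.
m1, m2, _ = procrustes(configs[0], configs[1])
Z = (m1 + m2) / 2
for Xi in configs[2:]:
    Zs, Xi_rot, _ = procrustes(Z, Xi)
    Z = (Zs + Xi_rot) / 2

# Step 5: goodness of fit G, averaging 1 - disparity over all configurations.
disparities = [procrustes(Z, Xi)[2] for Xi in configs]
G = 1 - np.mean(disparities)
print("centroid configuration:\n", np.round(Z, 3))
print("goodness of fit G:", round(G, 3))
```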
The model generated through the PINDIS procedure is a basic model denoted as P0 that
is achieved using some admissible transformations. This basic model provides the poorest
fit. Other types of transformation (non-admissible) can also construct better models to
achieve a better fit. There is a hierarchy of models as follows:
1. A basic model in which only the admissible transformations are applied:

R_1(R_i, Z) = \sum_{i} \mathrm{tr}\!\left[(X_i R_i - Z)^{T} (X_i R_i - Z)\right],

which is minimized over X_1, \dots, X_n.
2. Dimension weighting model: The dimensions of the group space are weighted and rotated.
The following equation should be minimized:

R_2(R_i, Z) = \sum_{i} \mathrm{tr}\!\left[(X_i R_i - Z S W)^{T} (X_i R_i - Z S W)\right],

where W is a diagonal matrix and S^T S = I. As a non-admissible transformation in this
model, Z is rotated by S before applying weights to the dimensions.
3. Idiosyncratic dimension weighting: In this model, the weights of the dimensions can be
different per individual:

R_3(R_i, Z) = \sum_{i} \mathrm{tr}\!\left[(X_i R_i - Z S W_i)^{T} (X_i R_i - Z S W_i)\right],

minimized over X_1, \dots, X_n.
4. Vector weighting, individual origins model: This is the same as the previous model;
however, the origin of Z for each individual can be moved to a better position for a
better fit:

R_5(R_i, Z) = \sum_{i} \mathrm{tr}\!\left[(X_i R_i - V_i (Z - \mathbf{1} t_i^{T}))^{T} (X_i R_i - V_i (Z - \mathbf{1} t_i^{T}))\right],

where t_i is a translation vector applied to the centroid configuration of the ith individual.
5. Double weighting model: At this level, both dimension weighting (as in the second
model) and vector weighting are applicable:

R_6(R_i, Z) = \sum_{i} \mathrm{tr}\!\left[(X_i R_i - V_i (Z - \mathbf{1} t_i^{T}) W_i)^{T} (X_i R_i - V_i (Z - \mathbf{1} t_i^{T}) W_i)\right].

The first model in this hierarchy provides the poorest fit, while the last has the best fit.
For more elaboration on PINDIS models and application examples, please refer to [13,15]
and [11].

6.2.10 Unfolding models


Unfolding is another family of MDS models proposed for preference data. In many studies,
such as questionnaires, subjects’ personal preferences are investigated. Generally, there
are two types of preference data: ranking and rating. Suppose some individuals are asked
to rank four different airlines. A possible result will be similar to Table 6.3. In this example,
four individuals were asked to rank four different airlines, while each airline can be char­
acterized regarding its judged position on each of the underlying attributes. Participants
determine the point corresponding to each item according to their distance from an ideal
point, representing their maximum preference for an attribute. To that end, each attribute
can be considered an axis to construct a multi-dimensional space. In such a space, the
distance can be interpreted literally as the distance between the participant’s ideal point
projected on the axis and the point the participant has determined for an airline. In this
example, subjects would rank items from the most preferred as 1 and the least preferred as 4.
Note that ranks can also be designated using any kind of ordered character sets, such as
A, B, C, . . . instead of numerical sets. Each cell in Table 6.3 contains a number referring
to the ranks of a specific airline, as an individual prefers. Note that these rankings can be
considered as (dis)similarities. Smaller values for an airline and a subject ID indicate how
similar individuals’ preferences are. The second type, rating data, is typically the result of
a questionnaire. Table 6.3 can also be considered as rating data with a slight difference. For
example, imagine that the 4 individuals were asked to rate the four airlines. This is actually scor­
ing: instead of ranking the choices from 1st to 4th, individuals rate each airline from 1 to 5. Usually,
rather than 1 to 5, scoring is between 1 and 100.

Table 6.3  Ranking of 4 different airlines (A, B, C, and D) by 4 individuals. This table includes a sample of preference data of the ranking type. '#1' denotes the most preferred and '#4' refers to the least preferred. Similarly, this can be viewed as distance to an ideal point, such that '#1' can be interpreted as 'as close as possible' to the ideal point of the subject (similar), whereas '#4' is 'as far away from the ideal point' (dissimilar).

Subject   Rank #1   Rank #2   Rank #3   Rank #4
S1        A         B         C         D
S2        B         A         C         D
S3        A         C         B         D
S4        C         D         B         A

6.2.11 Non-metric uni-dimensional scaling


The unfolding models can be categorized as uni-dimensional and multi-dimensional
models. Moreover, the metric and non-metric unfolding models are introduced. The non­
metric uni-dimensional unfolding model, the first unfolding model, was introduced by
Coombs (1950). Consider the ranking task of Table 6.3: according to Coombs, individuals and
objects (airlines) can be represented by points on a straight line, such that the distance
between the points representing an individual and points representing the airlines has the
same rank order as the original table. Coombs introduced the J and I scales. Scales are
two separate lines. The J scale is the line that points representing individuals and objects
(airlines) are placed upon. The individual's personal preference ordering, i.e., each individual's
preferred order of the objects or stimuli, is called an I scale. Table 6.4 summarizes the
preference orders of the individuals according to ranked lists in Table 6.3.

Table 6.4  The I scale of Table 6.3, containing the preference orderings.

Interval    I1      I2      I3
Ordering    ABCD    BACD    CDBA

The I and J scales can be generalized to higher dimensions; here, however, we
discuss the one-dimensional space (a line). Some basic definitions are needed before proceed­
ing to the J scale. On a line, only one point divides the distance between two other distinct
points equally. This is called the midpoint. Fig. 6.6 depicts two points on a line and the
relevant midpoint. All points on the right side of the midpoint are closer to the point B
rather than A (denoted as B>A) and vice versa. Concerning our example (rating task of
airlines), one can assume these points are airlines. Therefore all points on the right side
of the midpoint denote that the airline B is preferred over A. There is only the midpoint
that A=B, meaning the subject has no preference between A and B. It is worth mentioning
that in two-dimensional space (a plane), instead of the midpoint, a line exists that plays
a similar role. Moreover, in three-dimensional space (a cube), a plane divides the space
between two distant objects equally. In many resources, such a boundary is called a hy­
perplane, a surface of dimension just one less than the original space. There are excellent
resources available for detailed information about multi-dimensional unfolding models,
such as [29], [15], and [11].

FIGURE 6.6 This shows the midpoint of two distinct points, A and B. All points on the right side of the midpoint are
closer to point B, indicating the relative preference for option B.

Fig. 6.7 is the J scale of our example. The I scale of the table can be determined from the J
scale through a folding method. However, for the first examples, we simply explain the re­
lation, and only for the third one do we use the folding method to generate the relevant
scale.
Four airlines are denoted by A, B, C, and D. According to the Coombs [7] I Scale and
J Scale model, all points are represented on a straight line. Distinct colors are assigned to
each point. The midpoints are also denoted by color bars. As already discussed, the interval
between point A and the midpoint AB refers to the preference for A over B.
The preferences in Table 6.3 can be distinguished by the J scale. For example, the pref­
erence of S1 is ``ABCD'' in Table 6.3. In other words, A is preferred to B, thus the relevant

FIGURE 6.7 The J scale of the Coombs model for non-metric uni-dimensional unfolding. Objects (airlines in our
example) are indicated using color dots on the line. The midpoints are indicated by color bars. The corresponding
intervals are denoted by In . For better understanding of this figure and its details, please refer to the online ver­
sion.

interval in the J scale is the segment between point A and the midpoint AB in Fig. 6.7. All
points in this segment have the property A>B (preference of A over B).
B is preferred to C, which corresponds to the interval on the left side of the midpoint BC, and
finally C is preferred to D, which similarly refers to the interval on the left side of the midpoint
CD. The intersection of all intervals determined for the preference of S1 is the segment satisfying A>B. This
interval is denoted as I1 in the I scale in Table 6.4.
As another example, S2 preferred B over A; therefore, it refers to all points on the right
side of the midpoint AB. Consequently, A is preferred to C by subject S2. The corresponding
interval is the segment on the left side of the midpoint AC. C is also preferred to D, which
refers to the points on the left side of the midpoint CD. The intersection of these intervals is the
interval I2 as depicted in Fig. 6.7.
For the third example, ``CDBA'' (the 4th row of Table 6.3), we use the folding technique to
find the relevant interval (I3 in Table 6.4 and Fig. 6.7). As Coombs originally did, the folding
method is easily practiced using a handkerchief. In this method, by knowing the place of
objects on a line and the midpoints, we want to find the interval such that, if any
point from that interval is chosen as an axis to fold the whole line, the resulting order of the objects
is the same as the given one (``CDBA''). Fig. 6.8 illustrates the steps of folding. As the
given preference starts with C and then D, we know that the target should lie on the left
side interval of midpoint CD. S4 has preferred B after D. Therefore the interval on the right
side of the midpoint BD can be determined. The intersection between the last mentioned
intervals gives the interval between BD and CD. By choosing any axis from this interval and
folding the line as depicted in Fig. 6.8 (sub-figures a and b), one can reach the same order
of preference. All points on the mentioned interval produce similar results. Nevertheless,
not all rankings can be accommodated, such as ``ACBD'' (the 3rd row of Table 6.3).

6.3 Kernel-based MDS


MDS methods are a versatile toolbox, that is not limited to what has been discussed so
far. One variant is Kernel-based MDS, which takes dimensionality reduction to the next
level. These methods incorporate mathematical functions called kernels to capture com­
plex, nonlinear relationships in the data. Essentially, the kernels are used to transform the
data such that distances in the new space reflect intricate similarities or dissimilarities,

FIGURE 6.8 The folding method is used to find the relevant I scale of a preference. By folding the J scale over a
suitable axis, one is able to achieve the same order of objects as in a given preference. For better understanding of
this figure and its details, please refer to the online version.

enabling the model to uncover patterns that would otherwise remain hidden in a purely
Euclidean framework.
Kernel methods have been a hot topic in machine learning research [40], and one of the most iconic examples is the Support Vector Machine (SVM). SVM uses kernel functions to transform data into a new feature space, where it can efficiently compute inner products between data points, with the aim of finding the best hyperplane that separates the classes. This approach is known as the ``kernel trick''.
The beauty of the kernel trick lies in its ability to sidestep a major computational hurdle. Normally, transforming data into higher dimensions would be a costly and time-consuming process. The kernel trick allows us, instead of explicitly calculating the coordinates in the higher-dimensional space, to directly obtain the inner product of the transformed data points xi and xj in the feature space through a kernel function K(xi, xj). Here, the choice of the kernel function matters, as different kernel functions capture different relationships in the data. Some popular kernel functions include linear, radial basis function (RBF), polynomial, and, more recently, fractional orthogonal kernels such as fractional Chebyshev [33,34], fractional Gegenbauer [33,35], fractional Legendre [33,36], and fractional Jacobi [33,37] kernel functions (a practical tutorial and implementation of these kernels are available in [33,38]). Each kernel has its strengths and weaknesses, and selecting the right one can make all the difference in uncovering the hidden insights in your data.
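As a minimal sketch of the kernel trick (an illustration using scikit-learn's pairwise kernel helpers, not code from this chapter; the choices of gamma and degree are arbitrary), the following builds Gram matrices of pairwise similarities without ever computing coordinates in the implicit feature space.

import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))             # 5 points in a 3-dimensional input space

K_lin = linear_kernel(X)                # plain inner products <xi, xj>
K_rbf = rbf_kernel(X, gamma=0.5)        # exp(-gamma * ||xi - xj||^2)
K_poly = polynomial_kernel(X, degree=3)

print(K_rbf.shape)                      # (5, 5) Gram matrix of pairwise similarities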
The kernel method has also been studied in the context of PCA, leading to the kernel PCA of Schölkopf et al. [39]. This involves computing the eigenvalues and eigenvectors of a kernel matrix derived from the input data, yielding the principal components in the transformed feature space. The benefit of the kernel trick has never been limited to these algorithms. Further investigations highlight that metric MDS can be seen as a special case of kernel PCA with a Euclidean distance metric [41]. Pairwise similarities can be used to define a kernel matrix
and derive an embedding without losing any essential properties of the original dissimilarity structure [32]. (For the mathematical foundations and a comprehensive understanding of kernel methods and kernel MDS, one can refer to [32,40,42,43].)
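As a rough numerical illustration of this connection (a sketch, not the book's code), one can double-center a squared Euclidean distance matrix to obtain a Gram matrix and feed it to scikit-learn's KernelPCA with a precomputed kernel; up to sign and scaling, the resulting coordinates match what classical (Torgerson) MDS produces.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))

# squared Euclidean distances and double-centering: B = -0.5 * J * D^2 * J
D2 = euclidean_distances(X, squared=True)
n = D2.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J                    # Gram matrix implied by the distances

# treat B as a precomputed kernel; its leading eigenvectors give the embedding
kpca = KernelPCA(n_components=2, kernel="precomputed")
Y = kpca.fit_transform(B)                # classical-MDS-like 2D coordinates
print(Y.shape)                           # (30, 2)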

6.4 MDS in practice


To date, multiple software tools have been introduced to implement MDS algorithms and techniques. The first software implementations of MDS, written mainly in FORTRAN and intended for mainframes, were introduced by names such as Kruskal, Torgerson, Young, Guttman, and Lingoes. These implementations were the first to provide an accessible means of computing MDS solutions that had previously been obtainable only through manual calculation, and the first to leverage the computing power of mainframes to reduce the time required to compute MDS solutions. Software tools such as MDSCAL, TORSCA, and SSA were later developed into more capable programs that also ran on PCs, for example, SMACOF. After decades of development, multiple reliable and comprehensive implementations of MDS algorithms are now available for use in real-world cases. Among the big names, the SPSS multi-dimensional scaling procedure (ALSCAL), the smacof and MASS packages for R, and the scikit-learn Python package are probably the most well-known.

6.4.1 MDS in Python


Python is an open-source, high-level, general-purpose programming language widely used in scientific programming. The initial version of Python was released in 1991, and more than three decades of development have created a robust and comprehensive analytical programming ecosystem. Practically any algorithm can be coded in Python; hence, programmers and researchers have already developed many packages. Specifically for MDS, there are multiple packages indexed in PyPI (Python's package repository). The scikit-learn package is the most well-known Python package supporting a wide variety of machine learning algorithms in the categories Classification, Regression, Clustering, Dimensionality Reduction, Model Selection, and Preprocessing. Manifold learning techniques are collected in the sklearn.manifold module of scikit-learn. From this module, we use the MDS class for metric or non-metric MDS analysis. Some of the essential parameters that the MDS class accepts are as follows; for detailed documentation of this class, one can refer to the scikit-learn website:
• n_components: The number of dimensions of the solution space. The default is 2.
• metric: A Boolean value that selects metric (True) or non-metric (False) MDS analysis.
• dissimilarity: Determines how the dissimilarity (distance) matrix is calculated. Possible values: 'euclidean' and 'precomputed'. The default is 'euclidean', which refers to the Euclidean formula; 'precomputed' is used for any formula other than Euclidean (refer to Table 6.2).
• normalized_stress: For non-metric MDS, whether to use normalized stress (stress-1) or the raw stress value. Valid values are a Boolean or ``auto''; the default is False.
Moreover, there exist some attributes of fitted MDS objects:
• embedding_: The positions of the dataset points in the embedding space.
• stress_: The goodness-of-fit statistic: 0 indicates a perfect fit, 0.025 excellent, 0.05 good, 0.1 fair, and 0.2 poor.
• n_iter_: The number of iterations performed while finding the best goodness of fit.
There are more parameters, attributes, and functions in the MDS class of the sklearn.manifold module. For a detailed explanation, it is suggested to refer to the original reference page on the scikit-learn website.1 The MDS class likewise implements the fit() and fit_transform() methods, as do all other dimensionality reduction classes in scikit-learn. Scikit-learn also provides some datasets that are available via sklearn.datasets. The load_digits dataset is a subset of the UCI ML handwritten digits dataset.2 This dataset has the following specifications:
• Classes: 10.
• Samples per class: 180.
• Total samples: 1797.
• Dimensionality: 64.
Moreover, distance measures are implemented in sklearn.metrics. For the following example, we use the Euclidean and Manhattan distance measures. The following code demonstrates an example. Recall that the input dimension is 64 while the output dimension is 3. embedding is a sklearn.manifold.MDS object that holds many values, such as the stress value, the distance matrix, and the attributes of the parent MDS object. By calling embedding.dissimilarity_matrix_, one can access the distance matrix, and embedding.stress_ returns the calculated stress value.

1 https://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html.
2 https://archive.ics.uci.edu/ml/data-sets/Optical+Recognition+of+Handwritten+Digits.

# import required packages
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import manhattan_distances, euclidean_distances
import sklearn.datasets as dt
import plotly.express as px

# load a sample dataset.


X, y = dt.load_digits( return_X_y=True)

# Run the metric analysis


embedding = MDS(n_components=3,
normalized_stress=False,random_state=0)

# fit the 'X' from the input dataset


dist_ecu = embedding.fit_transform(X)

# plot the metric MDS results in 3D


fig = px.scatter_3d(None, x=dist_ecu[:,0],
y=dist_ecu[:,1], z=dist_ecu[:,2], color=y,)

fig.update_layout(title_text="load_digits",
showlegend=False,
scene_camera=dict(up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=-0.1),
eye=dict(x=1.25, y=1.5, z=1)),
margin=dict(l=0, r=0, b=0, t=0),
scene = dict(xaxis=dict(backgroundcolor='white',
color='black',
gridcolor='#f0f0f0',
title_font=dict(size=10),
tickfont=dict(size=10),
),
yaxis=dict(backgroundcolor='white',
color='black',
gridcolor='#f0f0f0',
title_font=dict(size=10),
tickfont=dict(size=10),
),
zaxis=dict(backgroundcolor='lightgrey',
color='black',
gridcolor='#f0f0f0',
title_font=dict(size=10),
tickfont=dict(size=10),
)))

# Update marker size


fig.update_traces(marker=dict(size=3,
line=dict(color='black', width=0.1)))

fig.update(layout_coloraxis_showscale=False)
fig.show()

Through metric MDS analysis of a 64-dimensional dataset, it is possible to visualize it in 3 dimensions. The preceding code block is sample code for visualizing the MDS output in 3 dimensions; by setting n_components=2, a two-dimensional embedding is obtained instead. Fig. 6.9 shows the result of the metric MDS analysis of the 64-dimensional dataset in 3D and 2D space using the Euclidean distance measure.

FIGURE 6.9 3D and 2D visualization of a 64-dimensional dataset using metric MDS. a) The plot of the load-digits
dataset using the Euclidean distance measure in 3D space. b) The plot of the load-digits dataset using the Eu­
clidean distance measure in 2D space. For better understanding of this figure and its details, please refer to the
online version.

By setting dissimilarity='precomputed' and supplying the Manhattan distance matrix manhattan_distances(X), one can visualize the original dataset in 2D and 3D L1 (Manhattan) space, as sketched below. Fig. 6.10 shows the output of the metric MDS using the Manhattan distance measure.
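A minimal sketch of this precomputed-dissimilarity variant (reusing X, MDS, and manhattan_distances from the example above; the plotting code is omitted for brevity):

# metric MDS on a precomputed Manhattan (L1) distance matrix
dist_manhattan = manhattan_distances(X)

embedding_l1 = MDS(n_components=3, dissimilarity='precomputed',
                   normalized_stress=False, random_state=0)
coords_l1 = embedding_l1.fit_transform(dist_manhattan)

print(coords_l1.shape)          # (1797, 3)
print(embedding_l1.stress_)     # raw stress of the configuration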

FIGURE 6.10 3D and 2D visualization of a 64-dimensional dataset using metric MDS. a) The plot of the load-digits
dataset applying the Manhattan distance measure in 3D space. b) The plot of the load-digits dataset using the
Manhattan distance measure in 2D space. For better understanding of this figure and its details, please refer to the
online version.

6.4.2 Conclusion
MDS is a powerful and widely used statistical technique for visualizing and analyzing data. By creating a graphical representation of the data points, researchers can uncover patterns that might not be obvious from the raw numbers; for example, the technique can be used to compare and contrast different items or entities with respect to particular attributes.
At its core, multi-dimensional scaling is a way of transforming data into a visual representation. This is done by taking data points that represent different items or entities and plotting them on a low-dimensional (typically two-dimensional) graph. The data points are then organized based on their similarities and differences, taking into account the distances or dissimilarities between them.
MDS can be used for various applications, including market research, customer segmentation, product positioning, and more. A variety of software tools have been developed for the different types of MDS; for the Python programming language, the scikit-learn, mvlearn, and Orange packages are recommended options.
Overall, MDS is a powerful and versatile tool that can uncover patterns and relationships in data, helping researchers to better understand their data and make more informed decisions.

References
[1] M.W. Richardson, Multidimensional psychophysics, Psychological Bulletin 35 (1938) 659--660.
[2] C. Eckart, G. Young, The approximation of one matrix by another of lower rank, Psychometrika 1 (3)
(1936) 211--218.
[3] G. Young, A.S. Householder, Discussion of a set of points in terms of their mutual distances, Psy­
chometrika 3 (1) (1938) 19--22.
[4] W.S. Torgerson, Multidimensional scaling: I. Theory and method, Psychometrika 17 (4) (1952)
401--419.
[5] J.B. Kruskal, Nonmetric multidimensional scaling: a numerical method, Psychometrika 29 (2) (1964)
115--129.
[6] C.H. Coombs, A theory of data, 1964.
[7] C. Coombs, Psychological scaling without a unit of measurement, Psychological Review 57 (3) (1950)
145--158.
[8] R.N. Shepard, The analysis of proximities: multidimensional scaling with an unknown distance func­
tion. I, Psychometrika 27 (2) (1962) 125--140.
[9] C.B. Horan, Multidimensional scaling: combining observations when individuals have different per­
ceptual structures, Psychometrika 34 (2) (1969) 139--165.
[10] J.D. Carroll, J.J. Chang, Analysis of individual differences in multidimensional scaling via an N-way
generalization of ``Eckart-Young'' decomposition, Psychometrika 35 (3) (1970) 283--319.
[11] I. Borg, P.J. Groenen, P. Mair, Applied Multidimensional Scaling, Springer Science & Business Media,
2012.
[12] I. Borg, P. Groenen, Modern Multidimensional Scaling: Theory and Applications, Springer Science &
Business Media, 2013.
[13] C.S. Ding, Fundamentals of Applied Multidimensional Scaling for Educational and Psychological Re­
search, Springer, New York, 2018.
[14] M.C. Hout, M.H. Papesh, S.D. Goldinger, Multidimensional scaling, Wiley Interdisciplinary Reviews:
Cognitive Science 4 (1) (2013) 93--103.
[15] D.R. Cox, D.V. Hinkley, D. Rubin, B.W. Silverman (Eds.), Monographs on Statistics and Applied Proba­
bility, Chapman & Hall, 1984.
[16] D.H. Krantz, A. Tversky, Similarity of rectangles: an analysis of subjective dimensions, Journal of
Mathematical Psychology 12 (1) (1975) 4--34.
[17] W. Stephenson, The study of behavior: Q-technique and its methodology, 1953.
[18] J. He, B.Y. Hu, X. Fan, Q-sort technique, in: V. Zeigler-Hill, T. Shackelford (Eds.), Encyclopedia of Per­
sonality and Individual Differences, Springer, Cham, 2017.
[19] A.H. Hadian-Rasanan, L. Fontanesi, N.J. Evans, C. Manning, C. Huang-Pollock, D. Matzke, A. Heath­
cote, J. Rieskamp, M. Speekenbrink, S. Palminteri, M.J. Frank, C.G. Lucas, J.R. Busemeyer, R. Ratcliff,
J.A. Rad, Beyond discrete-choice options, Trends in Cognitive Sciences 28 (2024) 857--870.
[20] H. Qarehdaghi, J.A. Rad, EZ-CDM: fast, simple, robust, and accurate estimation of circular diffusion
model parameters, Psychonomic Bulletin & Review 31 (2024) 2058--2091.
[21] A.H. Hadian-Rasanan, J.A. Rad, D.K. Sewell, Are there jumps in evidence accumulation, and what, if anything, do they reflect psychologically? An analysis of Lévy-Flights models of decision-making, Psychonomic Bulletin & Review 31 (2024) 32--48.
[22] S. Ghaderi, J.A. Rad, M. Hemami, R. Khosrowabadi, Dysfunctional feedback processing in metham­
phetamine abusers: evidence from neurophysiological and computational analysis, Neuropsycholo­
gia 197 (2024) 108847.
[23] S. Ghaderi, J.A. Rad, M. Hemami, R. Khosrowabadi, The role of reinforcement learning in shaping the
decision policy in methamphetamine use disorders, Journal of Choice Modelling 50 (2024) 100469.
[24] A. Ghaderi-Kangavari, J.A. Rad, M.D. Nunez, A general integrative neurocognitive modeling frame­
work to jointly describe EEG and decision-making on single trials, Computational Brain & Behavior 6
(2023) 317--376.
[25] A.H.H. Rasanan, N.J. Evans, J. Rieskamp, J.A. Rad, Numerical approximation of the first-passage time
distribution of time-varying diffusion decision models: a mesh-free approach, Engineering Analysis
with Boundary Elements 151 (2023) 227--243.
[26] A. Ghaderi-Kangavari, K. Parand, R. Ebrahimpour, M.D. Nunez, J.A. Rad, How spatial attention af­
fects the decision process: looking through the lens of Bayesian hierarchical diffusion model & EEG
analysis, Journal of Cognitive Psychology 35 (2023) 456--479.
[27] A. Ghaderi-Kangavari, J.A. Rad, K. Parand, M.D. Nunez, Neuro-cognitive models of single-trial EEG
measures describe latent effects of spatial attention during perceptual decision making, Journal of
Mathematical Psychology 111 (2022) 102725.
[28] L.R. Tucker, S. Messick, An individual differences model for multidimensional scaling, Psychometrika
28 (4) (1963) 333--367.
[29] J.F. Bennett, W.L. Hays, Multidimensional unfolding: determining the dimensionality of ranked pref­
erence data, 1960.
[30] L. Guttman, A general nonmetric technique for finding the smallest coordinate space for a configuration of points, Psychometrika 33 (1968) 469--506.
[31] J.C. Gower, Generalized Procrustes analysis, Psychometrika 40 (1975) 33--51.
[32] A. Webb, A kernel approach to metric multidimensional scaling, in: T. Caelli, A. Amin, R.P.W. Duin, D.
de Ridder, M. Kamel (Eds.), Structural, Syntactic, and Statistical Pattern Recognition, SSPR/SPR 2002,
in: Lecture Notes in Computer Science, vol. 2396, Springer, Berlin, Heidelberg, 2002.
[33] J.A. Rad, S. Chakraverty, K. Parand, Learning with Fractional Orthogonal Kernel Classifiers in Support Vector Machines: Theory, Algorithms, and Applications, Springer, 2023.
[34] A.H. Hadian Rasanan, S. Nedaei Janbesaraei, D. Baleanu, Fractional Chebyshev kernel functions: theory and application, in: J.A. Rad, K. Parand, S. Chakraverty (Eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support Vector Machines, in: Industrial and Applied Mathematics, Springer, Singapore, 2023.
[35] S. Nedaei Janbesaraei, A. Azmoon, D. Baleanu, Fractional Gegenbauer kernel functions: theory and application, in: J.A. Rad, K. Parand, S. Chakraverty (Eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support Vector Machines, in: Industrial and Applied Mathematics, Springer, Singapore, 2023.
[36] A. Azmoon, S. Chakraverty, S. Kumar, Fractional Legendre kernel functions: theory and application, in: J.A. Rad, K. Parand, S. Chakraverty (Eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support Vector Machines, in: Industrial and Applied Mathematics, Springer, Singapore, 2023.
[37] A.H. Hadian Rasanan, J.A. Rad, M.S. Tameh, A. Atangana, Fractional Jacobi kernel functions: theory and application, in: J.A. Rad, K. Parand, S. Chakraverty (Eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support Vector Machines, in: Industrial and Applied Mathematics, Springer, Singapore, 2023.
[38] A.H. Hadian Rasanan, S. Nedaei Janbesaraei, A. Azmoon, M. Akhavan, J. Amani Rad, Classification using orthogonal kernel functions: tutorial on ORSVM package, in: J.A. Rad, K. Parand, S. Chakraverty (Eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support Vector Machines, in: Industrial and Applied Mathematics, Springer, Singapore, 2023.
[39] B. Schölkopf, A. Smola, K.R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10 (5) (1998) 1299--1319.
[40] T. Hofmann, B. Schölkopf, A.J. Smola, Kernel methods in machine learning, 2008.
[41] C. Williams, On a connection between kernel PCA and metric multidimensional scaling, Advances in
Neural Information Processing Systems 13 (2000).
[42] N. Saeed, H. Nam, M.I.U. Haq, D.B. Muhammad Saqib, A survey on multidimensional scaling, ACM
Computing Surveys 51 (3) (2018) 1--25.
[43] A. Yuille, Lecture 11. Kernel PCA and multidimensional scaling (MDS), CS1234, Johns Hopkins Uni­
versity, Baltimore, MD, 2014.
7
t-Distributed stochastic neighbor
embedding
Mohammad Akhavan Anvari a, Dara Rahmati b, and Sunil Kumar c
a School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
b Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
c Department of Mathematics, National Institute of Technology, Jamshedpur, Jharkhand, India

7.1 Introduction to t-SNE


7.1.1 What is t-SNE?
t-SNE is an algorithm, developed by Maaten and Hinton [1], that is primarily used in machine learning to reduce the dimensionality of high-dimensional, complex datasets. Dimension reduction is an essential technique nowadays, as many real-world datasets contain numerous features, making it challenging to visualize and understand the relationships between the data points [6]. t-SNE is a nonlinear dimensionality reduction algorithm that maps the high-dimensional data points to a lower-dimensional space while preserving the relationships between the data points as much as possible. t-SNE is suitable for assessing and visualizing complex and nonlinear relationships in the data; unlike linear methods such as PCA, it does not rely on a linear relationship between the data points. Instead, it models the local associations between the data points, enabling the discovery of nonlinear patterns and structures. This makes it an effective mechanism for examining and comprehending complex datasets in many fields such as biology, computer vision, and natural language processing.

7.1.2 Why is t-SNE useful?


t-SNE is advantageous because it permits the display and investigation of high-dimensional datasets. In numerous real-world applications, the data we work with possess many attributes, making it challenging to comprehend the interrelationships between the data points. t-SNE provides a solution by reducing the dimensionality of the data, shifting it from a high-dimensional space to a lower-dimensional one while preserving as many of the relationships between the data points as possible. Its ability to show nonlinear correlations in the data, which are generally difficult to observe in high-dimensional space, is one of its key advantages. This renders it an effective tool for exploratory data analysis and pattern recognition. The capacity of t-SNE to generate highly interpretable data visualizations is an added advantage: the visualization illustrates the connections between the data points, allowing us to spot clusters of similar data points and their linkages.
This facilitates the collection of insights into the data structure that may not be readily apparent in a high-dimensional setting.

7.1.3 Prerequisite
7.1.3.1 Gaussian distribution
In statistics, continuous random variables are often modeled by the Gaussian distribution, also known as the normal distribution, whose development is primarily credited to the German mathematician Carl Friedrich Gauss. Its two most notable features are the mean, which represents the central tendency of the data and marks the highest point of the distribution, and the standard deviation, which represents the spread or dispersion of the data. Entropy quantifies the degree of disorder in a statistical distribution. An essential property of the Gaussian distribution is that it is a maximum entropy distribution: among all probability distributions with a given mean and variance, the Gaussian has the greatest entropy, i.e., the maximum uncertainty or randomness.
The use of Gaussian distributions in t-SNE is essential for several reasons:
• Nonlinearity: The t-SNE algorithm is a nonlinear technique for reducing the dimen­
sionality of data. This approach can capture intricate relationships between data points
that cannot be achieved by linear methods such as PCA. The utilization of Gaussian
distributions in t-SNE enables the modeling of pairwise similarities between points in
a nonlinear manner, thereby facilitating the capture of nonlinear relationships.
• Smoothness: The Gaussian distribution is characterized by a continuous probability
density function, indicating its smoothness. The smoothness attribute plays a crucial
role in maintaining the local structure of data in the lower-dimensional representation
by ensuring smooth and continuous pairwise similarities between data points.
• Robustness: The Gaussian distribution is a widely studied probability distribution with
numerous established statistical properties. Its suitability for modeling pairwise simi­
larities between data points makes it a robust choice, contributing to the stability and
reliability of the t-SNE algorithm.
• Parameterization: The Gaussian distribution is characterized by two parameters, the mean and the variance, which can be adjusted to change the distribution's shape. Tuning the parameters of the Gaussian kernel allows the t-SNE algorithm to be fine-tuned to particular datasets and the visualization to be optimized.

7.1.3.2 Student’s t-distribution


Student's t-distribution, also known as the t-distribution, is a probability distribution with heavier tails than the Gaussian or normal distribution. It is frequently employed in statistical inference, especially for testing hypotheses and estimating confidence intervals when the sample size is limited. The t-distribution is defined by its degrees of freedom (df), which determine the distribution's shape. Like the Gaussian distribution, it is symmetric and bell-shaped, but its tails are thicker. As the degrees of freedom increase, the t-distribution approaches the Gaussian distribution. The tails are the primary distinction between the Gaussian distribution and the t-distribution: the Gaussian distribution has narrower tails and therefore assigns lower probability to extreme values or outliers, whereas the t-distribution has fatter tails and assigns greater probability to extreme values or deviations. Consequently, the t-distribution is more resistant to outliers than the Gaussian distribution. In statistical inference, when the sample size is small and the data contain outliers, the t-distribution can provide more accurate estimates of population parameters such as the mean and standard deviation; in t-SNE, its heavy tails help preserve the local structure of the data in the lower-dimensional space.
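A small sketch (not from the book) using SciPy makes the difference in tail probabilities concrete; the single degree of freedom used here matches the Student-t kernel that t-SNE later employs in the low-dimensional space.

from scipy.stats import norm, t

# probability of observing a value at least 3 standard units from the center
x = 3.0
p_gauss = 2 * norm.sf(x)        # two-sided tail of the standard normal
p_t1 = 2 * t.sf(x, df=1)        # two-sided tail of Student's t with df = 1

print(f"Gaussian tail beyond |x| = 3: {p_gauss:.4f}")   # ~0.0027
print(f"t (df = 1) tail beyond |x| = 3: {p_t1:.4f}")    # ~0.2048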

7.1.3.3 Gradient descent algorithm


Gradient descent [4], also known as steepest descent in older terminology, is a classical iterative first-order optimization technique used to find optima of differentiable functions. The algorithm employs the gradient of a function, i.e., the derivatives of the function at a specific point, which give the slope of the function at that point. To determine the slope of a function at a specific point x, formula (7.1) is used:

\[ \text{Slope} = \frac{f(x + \Delta x) - f(x)}{\Delta x}. \qquad (7.1) \]

In this formula, f is a continuous function and \(\Delta x\) is the change in x; for precision, we take \(\Delta x\) to be an arbitrarily small value \(\varepsilon\).
The gradient descent algorithm starts from a random point on the function and searches for a local minimum. At the current point it computes the gradient, which indicates a one-step descent toward a lower value of the function; the algorithm then moves to this new position and repeats the procedure. Executing numerous iterations of this process leads to either a local or a global minimum of the function. The update can be formulated as

\[ \theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t), \qquad (7.2) \]

where f is the target function we want to minimize, \(\nabla f(\theta_t)\) is its gradient evaluated at the current point, \(\theta_t\) and \(\theta_{t+1}\) are the iterates at steps t and t + 1, and \(\eta\) is the step size (learning rate).
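To make the update rule concrete, here is a tiny sketch (an illustration, not code from the chapter) that minimizes f(x) = (x − 3)^2 with a fixed step size.

# minimal gradient descent on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
def grad_f(x):
    return 2.0 * (x - 3.0)

theta = 10.0      # arbitrary starting point
eta = 0.1         # step size (learning rate)
for _ in range(100):
    theta = theta - eta * grad_f(theta)

print(theta)      # converges toward the minimizer x = 3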
In t-SNE, the optimization of the lower-dimensional representation of the data is carried out with gradient descent, which aims to minimize the Kullback–Leibler (KL) divergence [5] between the high-dimensional and lower-dimensional probability distributions. The modifications to the lower-dimensional representation are made by following the negative gradient of the cost function. This brings the lower-dimensional points closer to their corresponding neighbors in the high-dimensional space, as will be elaborated in subsequent sections.

7.1.4 Applications of t-SNE


As mentioned in earlier sections, t-SNE is primarily used for dimension reduction, especially for data with nonlinear relationships; in some cases, however, scientists also use the t-SNE algorithm as a step toward clustering. The algorithm is adaptable enough to be applied across many different industries. The most common applications of t-SNE include the following:
• In natural language processing (NLP) applications such as document categorization and sentiment analysis, t-SNE can be used to visualize high-dimensional word vectors, helping scientists understand relational links between words and sentences [2].
• In image recognition, t-SNE can be used to visualize the activations of deep neural networks to understand how they process image data.
• In gene expression analysis, t-SNE can be used to display gene expression data in a way that reveals patterns and correlations between genes and biological samples [3].
• In recommender systems, such as content-based recommendation of movies or music, t-SNE can be used to show the relationships between users' behavior and candidate content.
These are only selected applications of t-SNE. The algorithm's capacity to analyze and investigate high-dimensional data makes it a significant aid in various disciplines.

7.2 Understanding the t-SNE algorithm


The t-SNE algorithm aims to map high-dimensional data points to a low-dimensional
space while maintaining their pairwise similarity relation. In other words, it attempts to
embed each data point in a lower-dimensional space so that related points remain close
together in the low-dimensional space. At the same time, a more significant distance sep­
arates dissimilar points. The t-SNE algorithm works in two main stages: First, it builds
a probability distribution over pairs of high-dimensional items, then constructs a corre­
sponding distribution over pairs of low-dimensional objects and minimizes the Kullback--
Leibler divergence between the two distributions. The stages are described in full below.
Step 1. Constructing the high-dimensional probability distribution:
In the first step of the t-SNE algorithm, a probability distribution is constructed over pairs of high-dimensional objects (e.g., data points). The goal of this step is to create a probability distribution that captures the similarity between pairs of objects, with similar objects having a higher probability of being selected. To achieve this, t-SNE places a Gaussian probability distribution around each data point, centered at that point, with a variance determined by a user-defined parameter called the perplexity. The perplexity controls the balance between preserving global and local structure and is typically set between 5 and 50. The probability that point j is similar to point i is given by the conditional probability

\[ p_{j|i} = \frac{\exp\!\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}, \qquad (7.3) \]

where \(x_i\) and \(x_j\) are the high-dimensional feature vectors corresponding to data points i and j, \(\|\cdot\|\) denotes the Euclidean distance between the two vectors, and \(\sigma_i^2\) is the variance of the Gaussian centered at \(x_i\), which is set such that the perplexity is satisfied. The denominator is a normalization constant that ensures that the probabilities sum to one over all pairs of points.
Step 2. Constructing the low-dimensional probability distribution:
In the second phase of the t-SNE method, a probability distribution over pairs of low-dimensional objects (e.g., the embedded points in a 2D or 3D space) is produced. Note that the initial low-dimensional projection can be random, or another dimensionality reduction algorithm such as PCA can be used as a starting point. This step's objective is to generate a probability distribution representing the similarity between pairs of embedded points, with related points having a greater chance of being picked. To do this, t-SNE places a Student-t probability distribution around each embedded point, with the shape of the distribution determined by its number of degrees of freedom (df). The probability that point j is similar to point i in the embedded space is given by:

\[ q_{j|i} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq i} \left(1 + \|y_i - y_k\|^2\right)^{-1}}, \qquad (7.4) \]

where \(y_i\) and \(y_j\) are the low-dimensional feature vectors corresponding to embedded points i and j, and the heavy-tailed Student-t kernel with one degree of freedom replaces the Gaussian used in the high-dimensional space. The denominator is a normalization constant that ensures that the probabilities sum to one over all pairs of points.
Step 3. Minimizing the Kullback–Leibler divergence:
In the last phase of the t-SNE technique, the Kullback–Leibler divergence between the two probability distributions created in steps 1 and 2 is minimized using the gradient descent algorithm. At each iteration, the algorithm computes the gradient of the divergence with respect to the positions of the data points in the low-dimensional space and updates their positions accordingly. In this way, gradient descent helps t-SNE find a good representation of the data in lower dimensions. In the following sections, we will first discuss the perplexity parameter and then the Kullback–Leibler divergence, which measures the difference between two probability distributions.

7.2.1 The t-SNE perplexity parameter


In t-SNE, the perplexity hyperparameter regulates the balance between local and global structure in the low-dimensional embedding. It can be interpreted as a smooth measure of the effective number of nearest neighbors used to generate the high-dimensional similarity matrix.

The formula to compute perplexity is:

\[ \mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad (7.5) \]

where \(P_i\) is the conditional probability distribution over the nearest neighbors of point i in the high-dimensional space, and \(H(P_i)\) is the Shannon entropy of \(P_i\), defined as:

\[ H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}. \qquad (7.6) \]

The perplexity value controls the number of nearest neighbors considered in each local neighborhood and is commonly set between 5 and 50, though there is no single optimal perplexity for all datasets. When the perplexity is low, the method concentrates more on the local data structure, and the resulting embedding preserves the fine-grained features of the data points inside each neighborhood. However, this may be achieved at the cost of not capturing the global data structure and may result in deformed clusters. When the perplexity is large, the method concentrates more on the global structure of the data, and the resulting embedding preserves the data's overall similarity structure. This may come at the expense of losing the fine-grained features of the data points within each neighborhood.
Consider an example to understand how perplexity affects the behavior of t-SNE. Sup­
pose we have a dataset of 1000 images of handwritten digits [13] (from 0 to 9), each rep­
resented as a 784-dimensional vector. We want to visualize the dataset in 2-dimensional
space using t-SNE. We construct a high-dimensional similarity matrix based on the Eu­
clidean distance between the data points. After that, we set the perplexity to 30 and then
execute t-SNE to obtain a 2-dimensional embedding. The resulting embedding in Fig. 7.1
shows clear clusters of digits, with similar digits grouped (e.g., 0s with other 0s, 1s with
other 1s, etc.). Next, we decrease the perplexity to a value of 5 and rerun t-SNE. The resulting embedding in Fig. 7.2 shows more fine-grained detail, with each digit spread into a different cloud; however, some clusters are less well-defined, and some digits are mixed in the embedding. Finally, we increase the perplexity to a value of 50 and rerun t-SNE. The results in Fig. 7.3 show a more precise global structure, with the digits separated into distinct clusters, although some fine-grained details are lost and some clusters are less well-separated.
In conclusion, the perplexity parameter in t-SNE controls the balance between local
and global structure in the low-dimensional embedding, and the optimal value depends
on the specific characteristics of the dataset and the task at hand.
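The following sketch (illustrative only, not the book's code) computes the Shannon entropy and the corresponding perplexity of a single row of conditional probabilities, as in Eqs. (7.5) and (7.6).

import numpy as np

def perplexity(p_row):
    # perplexity of one conditional distribution P_i (Eqs. (7.5)-(7.6))
    p = p_row[p_row > 0]              # ignore zero entries (0 * log 0 = 0)
    H = -np.sum(p * np.log2(p))       # Shannon entropy in bits
    return 2.0 ** H

# four equally likely neighbors give an effective neighborhood of size 4
print(perplexity(np.array([0.25, 0.25, 0.25, 0.25])))   # 4.0
# a more concentrated distribution has a smaller effective neighborhood
print(perplexity(np.array([0.7, 0.1, 0.1, 0.1])))       # ~2.56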

7.2.2 The t-SNE objective function


Kullback–Leibler divergence is a measure of the difference between two probability distributions, often denoted as P and Q. It measures how much information is lost when we use Q to approximate P. Mathematically, KL divergence is defined as follows:

\[ D_{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}, \qquad (7.7) \]

where P and Q are two probability distributions over the same event space, and i indexes the events in the space. The KL divergence is always non-negative, and it equals zero exactly when P and Q are identical. KL divergence is not symmetric, meaning that \(D_{KL}(P\|Q)\) and \(D_{KL}(Q\|P)\) can have different values.

FIGURE 7.1 Illustration of the MNIST dataset in a 2D plot using the t-SNE algorithm with the perplexity parameter set to 30.

FIGURE 7.2 Illustration of the MNIST dataset in a 2D plot using the t-SNE algorithm with the perplexity parameter set to 5.

FIGURE 7.3 Illustration of the MNIST dataset in a 2D plot using the t-SNE algorithm with the perplexity parameter set to 50.
Let us consider an example to understand KL divergence better. Suppose we have a coin
with an unknown bias. We toss the coin ten times and observe seven heads and three tails.
We want to know the probability of getting heads or tails in the next toss. We can model
the coin toss using two probability distributions: the accurate distribution, P , which rep­
resents the actual bias of the coin, and the estimated distribution, Q, which we compute
based on the observed data. Let us say we believe that the coin is fair, so we set Q to be a
uniform distribution:

\[ P(H) = p, \quad P(T) = 1 - p, \qquad Q(H) = Q(T) = 0.5. \qquad (7.8) \]

We can now calculate the KL divergence between P and Q to measure how different our estimated distribution is from the actual distribution. Using the formula above, we obtain:

\[ D_{KL}(P \,\|\, Q) = P(H) \log \frac{P(H)}{Q(H)} + P(T) \log \frac{P(T)}{Q(T)} = p \log(2p) + (1 - p) \log\big(2(1 - p)\big). \qquad (7.9) \]

If we assume that p = 0.7 and take logarithms base 2, we can calculate the value of \(D_{KL}(P\|Q)\) to be approximately 0.12. This tells us that the estimated distribution Q is not a perfect approximation of the true distribution P: we lose about 0.12 bits of information when we use Q to model the coin toss, compared to using the true distribution P. KL divergence is used in many areas of machine learning, including information theory, signal processing, and statistics. It is often used as a loss function in training generative models, where the goal is to approximate a complex, high-dimensional probability distribution with a simpler, lower-dimensional distribution.
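A quick numerical check of this example (a sketch, not from the book), using scipy.stats.entropy with base 2 so that the result is in bits:

import numpy as np
from scipy.stats import entropy

p = 0.7
P = np.array([p, 1 - p])        # "true" coin distribution
Q = np.array([0.5, 0.5])        # assumed fair coin

print(entropy(P, Q, base=2))    # D_KL(P || Q), roughly 0.12 bits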
In t-SNE, the high-dimensional probability distribution P and the low-dimensional
probability distribution Q are constructed based on the pairwise similarity of data points.
The goal is to find a mapping of the data points from the high-dimensional space to the
low-dimensional space that preserves their pairwise similarities. For this, t-SNE minimizes
the KL divergence between P and Q by adjusting the embedding of each data point until
the two probability distributions are as close as possible. The optimization is performed
iteratively using gradient descent, and the embedding is adjusted to minimize the cost function:

\[ C = \sum_i KL(P_i \,\|\, Q_i) = \sum_{i=1}^{n} \sum_{j=1}^{n} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}, \qquad (7.10) \]

where \(p_{j|i}\) is the probability that point j is similar to point i in the high-dimensional space, and \(q_{j|i}\) is the probability that the embedded points j and i are similar in the low-dimensional space.
The gradient of the cost function with respect to the embedding of point i in the low-dimensional space is given by:

\[ \frac{\partial C}{\partial y_i} = 4 \sum_j \left(p_{j|i} - q_{j|i}\right)\left(y_i - y_j\right)\left(1 + \|y_i - y_j\|^2\right)^{-1}, \qquad (7.11) \]

where \(y_i\) is the low-dimensional embedding of point i, \(y_j\) is the low-dimensional embedding of point j, and \(\|\cdot\|\) denotes the Euclidean distance between the two embeddings.
The gradient is used to update the embedding of each data point in the low-dimensional
space during each iteration of the optimization process. By minimizing the cost function,
t-SNE produces a low-dimensional embedding of the high-dimensional data, preserving
the pairwise similarities between the data points as much as possible.

7.2.3 The t-SNE learning rate


The learning rate in t-SNE controls the step size of the gradient descent algorithm used
to optimize the low-dimensional embedding. It determines how much the embedding
changes in each iteration of the algorithm. The formula to update the low-dimensional
embedding at each iteration is:

yi (t + 1) = yi (t) + η(t) × di (t), (7.12)

where yi (t) is the 2-dimensional embedding of data point i at iteration t, η(t) is the learning
rate at iteration t, and di (t) is the gradient of the cost function with respect to yi (t). The
learning rate is usually initialized to a high value and gradually reduced throughout the
optimization. This is done to prevent the algorithm from becoming stuck in local optima
and to allow it to explore different regions of the low-dimensional space. One common
strategy for reducing the learning rate is to use a power-law decay, such as:

\[ \eta(t) = \frac{\eta_0}{1 + a t}, \qquad (7.13) \]

where \(\eta_0\) is the initial learning rate, t is the current iteration, and a is a constant that controls the rate of decay. The learning rate also interacts with the gradient magnitudes in the high-dimensional space. When the gradient magnitudes are large, a high learning rate can lead to unstable behavior and oscillations in the embedding. To prevent this, t-SNE uses early exaggeration, which multiplies the pairwise similarities in the high-dimensional space by a factor of 4 in the early iterations of the algorithm. This exaggerates the distances between the data points and creates larger gradients, which allows the low-dimensional embedding to separate clusters more quickly. To illustrate how the learning rate influences the behavior of t-SNE, consider the following example. Imagine we have a collection of 1000 photographs of handwritten numbers (0--9), where each image is recorded as a 784-dimensional vector. Using t-SNE, we wish to depict the dataset in two dimensions. The learning rate is initialized at 1, and 10 iterations of t-SNE are performed. The resulting embedding in Fig. 7.4 displays a few clusters of numbers, but most of the digits are still jumbled. Then, we decrease the learning rate to 0.5 and run t-SNE for 10 more iterations. The result in Fig. 7.5 displays groups of identical digits that are more clearly defined. Lastly, we decrease the learning rate to 0.1 and run t-SNE for ten additional iterations. Fig. 7.6 demonstrates even more distinct clusters, with minimal overlap between the digits.
The learning rate parameter governs the step size of the gradient descent process used
to optimize the low-dimensional embedding. The unique properties of the dataset and
the objective at hand determine the ideal value. A high learning rate can lead to unstable
behavior, whereas a low learning rate can result in slow convergence and being stuck in
local optima.
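As a tiny illustration of the power-law schedule in Eq. (7.13) (a sketch with arbitrary constants, not taken from the book):

# learning-rate schedule eta(t) = eta0 / (1 + a * t) with arbitrary constants
eta0, a = 200.0, 0.01
schedule = [eta0 / (1.0 + a * t) for t in range(0, 1001, 250)]
print([round(eta, 2) for eta in schedule])   # [200.0, 57.14, 33.33, 23.53, 18.18]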

7.2.4 Implementing t-SNE in practice


To achieve a better understanding of the t-SNE algorithm, we are going to implement it from scratch. First, we need to understand the pseudocode of the algorithm, so we reproduce the pseudocode from the original t-SNE paper:

FIGURE 7.4 Illustration of the MNIST dataset in a 2D plot using t-SNE algorithm with the perplexity parameter set
to 30 and a learning rate of 1.

FIGURE 7.5 Illustration of the MNIST dataset in a 2D plot using the t-SNE algorithm with the perplexity parameter
set to 30 and a learning rate of 0.5.

FIGURE 7.6 Illustration of the MNIST dataset in a 2D plot using the t-SNE algorithm with the perplexity parameter
set to 30 and a learning rate of 0.1.

Algorithm 1 Simple version of t-Distributed Stochastic Neighbor Embedding.

Data: dataset χ = {x1 , x2 , ..., xn },
cost function parameters: perplexity Perp,
optimization parameters: number of iterations T , learning rate η, momentum α(t).
Result: low-dimensional data representation Y (T ) = {y1 , y2 , ..., yn }.
begin
    compute pairwise affinities p_{j|i} with perplexity Perp (using Eq. (7.3))
    set p_{ij} = (p_{j|i} + p_{i|j}) / (2n)
    sample initial solution Y (0) = {y1 , y2 , ..., yn } from N(0, 10^{-4} I)
    for t = 1 to T do
        compute low-dimensional affinities q_{ij} (using Eq. (7.4))
        compute gradient ∂C/∂Y (using Eq. (7.11))
        set Y (t) = Y (t−1) + η ∂C/∂Y + α(t)(Y (t−1) − Y (t−2))
    end for
end

Step 1: Import necessary libraries: We will need NumPy for numerical computations and matplotlib for data visualization.

import numpy as np
import matplotlib.pyplot as plt

Step 2: Preparing the data: t-SNE is a dimensionality reduction algorithm, so you


will need to start with high-dimensional data. In this example, let us assume you
have a dataset with n samples and m features.

X = ...  # your data, a NumPy array with shape (n, m)
n, m = X.shape

Step 3: Compute pairwise similarity: t-SNE operates on pairwise similarity between


samples. Compute the pairwise Euclidean distance between all samples in your dataset.

from sklearn.metrics.pairwise import euclidean_distances

distances = euclidean_distances(X)

Step 4: Compute similarity matrix: Transform the pairwise distance matrix into a simi­
larity matrix using the Gaussian kernel.

def gaussian_kernel(distances, sigma=1.0):


return np.exp(-(distances ** 2) / (2 * (sigma ** 2)))

similarities = gaussian_kernel(distances)

Step 5: Initialize the embedding: Initialize a low-dimensional embedding for your data.
This can be done randomly or using another method.

# the desired dimensionality of the embedding


embedding_dim = 2
# initialize embedding randomly
Y = np.random.randn(n, embedding_dim)

Step 6: Define the cost function: The t-SNE cost function is defined as the Kullback–Leibler (KL) divergence between the high-dimensional and low-dimensional distributions.

def kl_divergence(P, Q):


return np.sum(P * np.log(P / Q))

def compute_joint_probabilities(similarities, perplexity):


# Compute pairwise conditional probability
# using Gaussian kernel
sum_s = np.sum(similarities)
P = similarities / sum_s
P = np.maximum(P, 1e-12) # to prevent division by zero
P *= perplexity
P = np.maximum(P, 1e-12) # to prevent division by zero
P = P / np.sum(P)
return P

def compute_gradient(P, Q, Y):
    # Compute the gradient of the KL cost with respect to Y (Eq. (7.11))
    n = Y.shape[0]
    grad = np.zeros_like(Y)
    pq_diff = P - Q
    for i in range(n):
        diff = Y[i] - Y                                    # differences y_i - y_j, shape (n, d)
        weights = 1.0 / (1.0 + np.sum(diff ** 2, axis=1))  # Student-t kernel weights
        grad[i] = 4.0 * np.sum((pq_diff[i] * weights)[:, np.newaxis] * diff, axis=0)
    return grad

Step 7: Optimize the cost function: Use gradient descent to optimize the cost function
with respect to the low-dimensional embedding.

def optimize_embedding(Y, P, learning_rate=200, n_iter=1000):
    # Optimize the embedding using gradient descent
    for _ in range(n_iter):
        # low-dimensional affinities with the Student-t kernel, normalized to sum to 1
        num = 1.0 / (1.0 + euclidean_distances(Y) ** 2)
        np.fill_diagonal(num, 0.0)
        Q = np.maximum(num / np.sum(num), 1e-12)
        grad = compute_gradient(P, Q, Y)
        Y = Y - learning_rate * grad
    return Y

P = compute_joint_probabilities(similarities, perplexity=30)
Y = optimize_embedding(Y, P)

Step 8: Visualize the low-dimensional embedding: Finally, you can visualize the low­
dimensional embedding using a scatter plot.

plt.scatter(Y[:, 0], Y[:, 1], c=color_labels)  # color_labels: one label per sample
plt.show()



For ease of use, this algorithm is already implemented in scikit-learn; we can simply feed data into it and create a plot for better understanding. We will use this package to run the t-SNE algorithm on the Iris dataset.
Step 1: Importing: Scikit-Learn and other useful packages.

import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.datasets import load_iris
import plotly.express as px

Step 2: Loading: The Iris dataset.

iris = load_iris()
x = iris.data
y = iris.target

Step 3: Initializing: t-SNE as an object with n components for selecting the dimension
of output data. (For this example we use default values for hyperparameters of the t-SNE
algorithm.)

tsne = TSNE(n_components=2, verbose=1, random_state=123)


x_transformed = tsne.fit_transform(x)

Step 4: Preparing output data for visualization.

df = pd.DataFrame()
df["y"] = y
df["component_1"] = x_transformed[:,0]
df["component_2"] = x_transformed[:,1]

Step 5: Visualizing output data.

px.scatter(data_frame=df, x="component_1",
y="component_2",
color=df.y.tolist(),
labels={
"component_1": "Component 1",
"component_2": "Component 2",
},
title="Iris dataset T-SNE projection",
width=1024, height=1024)

FIGURE 7.7 Illustration of the Iris dataset in a 2D plot using the t-SNE algorithm with default parameters.

Fig. 7.7 illustrates the result of the algorithm: each label is shown in its own individual color. It is worth pointing out that one of the labels is located a considerable distance from the others, whereas the other two labels have a similar structure but differ in the values of their components.
In the same way as before, we will try to run the algorithm on the MNIST dataset, which is significantly more complicated than the Iris dataset. The parameter settings will again be the scikit-learn defaults.
In Fig. 7.8, we see that each label, a digit in this dataset, is differentiated by color. However, as is clear, some data points from two or three different digits collide with each other. The reason for this is, first, that the structure of those digits is similar and the algorithm could not capture the difference. Secondly, the dimensionality of the MNIST dataset, 784 features, is significantly higher than that of the Iris dataset, which is 4. Due to this, the algorithm struggles to distinguish some similar digits. In what follows, we will explore our alternatives for resolving this issue.

7.3 Visualizing high-dimensional data with t-SNE


It is essential to note that t-SNE is not suitable for all datasets and should be selected with caution. For instance, it may not perform optimally on highly structured or noisy datasets, and the algorithm can be computationally costly, particularly for massive datasets. Nevertheless, with correct parameter adjustment and preparation, t-SNE can be an effective visualization tool for complicated datasets.

FIGURE 7.8 Illustration of the MNIST dataset in a 2D plot using the t-SNE algorithm with default parameters.

7.3.1 Choosing the right number of dimensions


In the t-SNE method, the number of dimensions in the lower-dimensional space is a hyper­
parameter that the user must provide. A smaller number of dimensions generally results
in an easier-to-understand representation, but it may also lead to information loss. On the
other hand, a visualization with more dimensions may gather more information, but it
may also be more challenging to comprehend. The most common number of dimensions
used in t-SNE visualizations is 2, which enables the data to be readily displayed on a 2D
plot. Depending on the dataset and the task, it may be necessary to experiment with sev­
eral dimensions to obtain the appropriate display. One approach to choosing the correct
number of dimensions is a trial-and-error process, where the user tries different dimen­
sions and evaluates the resulting visualizations. Another technique is to employ heuristics,
such as picking the number of dimensions that captures a particular percentage of the
data’s variation or the number of dimensions that best isolates the various data clusters.
Ultimately, the number of dimensions for a t-SNE visualization is determined by the characteristics of the data and the research objectives.
It is important to remember that t-SNE is primarily a visualization tool and should be
used with other analytical techniques to comprehend and analyze high-dimensional data
thoroughly.
Using the t-SNE algorithm with its default parameters, we can obtain the 3D plot of the MNIST dataset shown in Fig. 7.9, which lets us visualize the similarities and differences between handwritten digits in more detail than the 2D plot from the previous section.

FIGURE 7.9 Illustration of the MNIST dataset in a 3D plot using the t-SNE algorithm with default parameters.

7.3.2 Interpreting t-SNE plots


t-SNE plots can be challenging, but there are a few key concepts to remember that help
make sense of the visualizations:
• Understanding the basic idea behind t-SNE. The idea is to create a new representation
of the data that is easier to visualize and analyze while preserving the essential relation­
ships between the data points.
• Clusters of data points are the first thing to search for in a plot created by the t-SNE algorithm. Each cluster represents a collection of related data points in the high-dimensional space. On the output plot, clusters may be represented by distinct colors or shapes, making them simpler to detect. It is important to note that the distances between points in a t-SNE plot are not meaningless, but they do not directly represent distances in the high-dimensional space: points that are near each other in the t-SNE plot are likely to be close in the high-dimensional space, whereas points that are far apart may not be correspondingly distant.
• Outliers are data points that lie far from any cluster. These may indicate unique or uncommon data items that are important to find and evaluate.
• After identifying the clusters and outliers, the next step is to look for patterns and trends
in the data. For example, specific clusters might be close or far apart, or some are more
tightly packed than others.
• Considering the context of the data that are analyzed is very important. Understanding
the data context can help draw more meaningful conclusions from the t-SNE plot.

Finally, interpreting t-SNE plots requires understanding the basic concept behind t­
SNE, identifying clusters and outliers, looking for patterns and trends, and considering the
data context.

7.4 Advanced t-SNE techniques


This section will dive into two critical aspects of this algorithm worth exploring. The first
aspect we will discuss is the usage of t-SNE for clustering datasets. Clustering refers to di­
viding a large dataset into smaller, more manageable subsets or clusters based on their
inherent similarities or dissimilarities. The second aspect we will explore is the usage
of t-SNE in conjunction with other dimensionality reduction methods. While t-SNE is a
powerful tool, it can be even more effective when combined with dimensionality reduc­
tion methods such as principal component analysis (PCA) or linear discriminant analysis
(LDA). This combination can further reduce the complexity of the data and highlight the
essential features for analysis.

7.4.1 Using t-SNE for data clustering


t-SNE is a widely used algorithm for visualizing high-dimensional data in a low-dimensional space. While it can be treated as a clustering method, it is important to note that it was not designed solely for clustering purposes [7]. It is better used as a dimensionality reduction algorithm, serving as a preprocessing step for clustering algorithms such as k-means, DBSCAN, and others. With the help of t-SNE, this task becomes much more accessible and efficient: the algorithm identifies and highlights patterns and relationships within the data that may not be readily apparent through other methods.
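A minimal sketch of this preprocessing pattern (illustrative, not the book's code), embedding the digits data with t-SNE and then clustering the embedding with k-means:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

X, y = load_digits(return_X_y=True)

# t-SNE as a preprocessing step: embed the 64-dimensional digits in 2D
Z = TSNE(n_components=2, random_state=0).fit_transform(X)

# cluster in the embedded space
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(Z)
print(labels[:20])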

7.4.2 Combining t-SNE with other dimensionality reduction methods


While t-SNE is a powerful tool for visualizing data, it has limitations, such as its sensitivity to the choice of parameters, like the perplexity, and its computational complexity on larger datasets, which is one of the most significant drawbacks of this algorithm. In addition, the t-SNE algorithm sometimes struggles to maintain the global structure of the data, which might result in inaccurate representations of the information. Using t-SNE together with other dimensionality reduction methods is one way to work around these constraints. This combined approach is often used to speed up the computation of t-SNE and reduce its sensitivity to the choice of hyperparameters [9].
PCA is a linear dimensionality reduction technique that finds the principal components
of the data. These principal components represent the directions in which the data varies
the most, and they can be used to reduce the dimensionality of the data while retaining
most of its variance [10]. By applying PCA before t-SNE, we can reduce the dimensionality
of the data and remove any linear dependencies between the features, making the t-SNE
computation more efficient and effective. In addition, by selecting the number of principal components based on the variance we want to retain, we can also control the amount of information passed to t-SNE, which helps produce meaningful visualizations. For example, a DCGAN can be used to extract spectro-spatial features from hyperspectral images, after which t-SNE performs dimensionality reduction and clustering on the extracted features [11].
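A minimal sketch of the PCA-then-t-SNE pipeline described above (scikit-learn; the 50-component choice is an illustrative assumption, not a prescription):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)

# Step 1: PCA removes linear redundancy and keeps most of the variance
X_pca = PCA(n_components=50, random_state=0).fit_transform(X)

# Step 2: t-SNE embeds the compressed data in two dimensions for visualization
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)
print(X_2d.shape)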
On the other hand, one could use t-SNE to first reduce the dimensionality of the data to
a lower-dimensional space and then apply a different dimensionality reduction method,
such as PCA or LDA (Linear Discriminant Analysis) [12], to reduce the dimensionality of
the data further. Employed in this way as a preprocessing step, t-SNE can help preserve the global structure while lowering the sensitivity of the second dimensionality reduction algorithm to its parameters. Nevertheless, combining t-SNE with other dimensionality reduction strategies introduces additional difficulties and trade-offs: the computational cost of the combined technique may be higher, and the resulting visualizations may be harder to interpret.

7.5 Conclusion and future directions


t-SNE is a robust algorithm for visualizing high-dimensional datasets in lower dimensions,
particularly useful for clustering and dimensionality reduction tasks. Due to its ability to
capture nonlinear relationships in the data and preserve its local structure, t-SNE offers various benefits over other dimensionality reduction techniques such as PCA. It is, nevertheless, susceptible to overfitting and sensitive to the selection of parameters, so interactive exploration may be required to choose settings and verify outcomes [8]. Despite these limitations, t-SNE has proven to be a valuable tool for exploratory
data analysis and pattern recognition. Computer vision, bioinformatics, and natural lan­
guage processing are just a few of the many industries that have succeeded in using t-SNE
for displaying high-dimensional information in lower dimensions. Due to its ability to
meaningfully visualize complex high-dimensional data, it has been put to use in a wide
variety of contexts, such as the visualization of gene expression patterns in single-cell RNA
sequencing data, the identification of image clusters for use in computer vision, and the
investigation of the structure of natural language data.
Research on t-SNE may advance along several paths in the future. One area of focus is developing techniques for automatically identifying good parameter settings, which would address one of the algorithm's primary issues: its sensitivity to the chosen parameters. Another is developing ways to lower the computational complexity on larger datasets, which is currently very demanding. In conclusion, t-SNE is a practical approach for representing high-dimensional data and may be used in a wide variety of contexts. While obstacles remain, t-SNE research promises to continue advancing our understanding of complex data structures and relationships.

References
[1] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008).
[2] G. Hu, M. Ahmed, M. L’Abbé, Natural language processing and machine learning approaches for food
categorization and nutrition quality prediction compared with traditional methods, The American
Journal of Clinical Nutrition 117 (2023) 553--563, https://www.sciencedirect.com/science/article/pii/
S0002916522105526.
[3] D. Kobak, P. Berens, The art of using t-SNE for single-cell transcriptomics, Nature Communications
10 (2019) 5416, https://doi.org/10.1038/s41467-019-13056-x.
[4] H. Robbins, A stochastic approximation method, The Annals of Mathematical Statistics 22 (1951)
400--407.
[5] I. Csiszar, I-divergence geometry of probability distributions and minimization problems, Annals of
Probability 3 (1975) 146--158, https://doi.org/10.1214/aop/1176996454.
[6] M. Blum, M. Nunes, D. Prangle, S. Sisson, A comparative review of dimension reduction methods in
approximate Bayesian computation, Statistical Science 28 (2013) 189--208, https://doi.org/10.1214/
12-STS406.
[7] G. Linderman, S. Steinerberger, Clustering with t-SNE, provably, CoRR, arXiv:1706.02582, 2017.
[8] M. Wattenberg, F. Viégas, I. Johnson, How to use t-SNE effectively, Distill (2016), http://distill.pub/
2016/misread-tsne.
[9] H. Huang, Y. Wang, C. Rudin, E. Browne, Towards a comprehensive evaluation of dimension reduction
methods for transcriptomic data visualization, Communications Biology 5 (2022) 719, https://doi.
org/10.1038/s42003-022-03628-x.
[10] K. Pearson, LIII. On lines and planes of closest fit to systems of points in space, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (1901) 559--572, https://doi.org/10.1080/14786440109462720.
[11] R. Silva, P. Melo-Pinto, t-SNE: a study on reducing the dimensionality of hyperspectral data for the regression problem of estimating oenological parameters, Artificial Intelligence in Agriculture 7 (2023) 58--68, https://www.sciencedirect.com/science/article/pii/S2589721723000053.
[12] P. Cohen, P. Cohen, S.G. West, L.S. Aiken, Applied Multiple Regression/Correlation Analysis for the
Behavioral Sciences, Psychology Press, 1983.
[13] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition,
Proceedings of the IEEE 86 (11) (1998) 2278--2324, https://doi.org/10.1109/5.726791.
8
Feature extraction and deep
learning
Abtin Mahyar a, Hossein Motamednia a,b, Pooryaa Cheraaqee c, and
Azadeh Mansouri d
a School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
b Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
c School of Computer Science & Engineering, University of Westminster, London, United Kingdom
d Department of Electrical and Computer Engineering, Faculty of Engineering, Kharazmi University, Tehran, Iran

8.1 The revolutionary history of deep learning: from biology to simple perceptrons and beyond
8.1.1 A brief history
Warren McCulloch and Walter Pitts demonstrated the concept of mimicking neurons and
their functions in animal brains by electrical circuits and mathematical equations in work
published in 1943. They suggested using a network of interconnected neurons to solve
challenging issues [1].
Moreover, Donald Hebb introduced a learning principle known as Hebb’s Rule or Heb­
bian learning. This principle mimics the way our brains learn, where neurons are activated
and connected with other neurons. Initially, these connections have weak weights during
the early stages of learning. However, with repeated exposure to the stimulus, the weights
of these connections gradually increase, becoming stronger over time. Siegrid Löwel aptly describes this phenomenon as "neurons that fire together, wire together" [2].
Some intuitive examples of this process in our body are when practicing a musical
instrument, having an exercise routine, or learning a new language. This is because rep­
etition in each of these processes helps to build up muscle memory and makes the task
feel more natural, or in our scenario, tune the weights of connected neurons in the brain
and body.
The ability to simulate the early variations of neural networks was ultimately made pos­
sible between 1950 and 1960 thanks to advances in computer technology and hardware.
Stanford developed the first effective artificial neural networks, ADALINE and MADALINE,
which were able to predict the upcoming bit in an input sequence by identifying binary
patterns in streaming bits from a phone line [9].
After the success of neural networks in this period, funding and interest in Artificial Intelligence (AI) research decreased significantly, a period now denoted as the First AI Winter.

The von Neumann architecture, a traditional computing model, has been dominating the
computing scene since its invention in 1945. It is based on the idea of separating mem­
ory and processing units, allowing for faster and more efficient data processing. During
this time, neural network research was left behind as the focus shifted to traditional com­
puter designs. By using von Neumann architecture, computers were able to process large
amounts of data quickly and accurately. However, this approach had its limitations when
it came to dealing with complex problems such as natural language processing or pattern
recognition tasks. Neural networks offered a promising alternative for these types of tasks,
but they were not widely adopted until after 1970 when advances in hardware technology
made them more accessible.
Significant developments in this field, such as new architectures and improved training methods, led in the early 1980s to a surge of interest in a new way of thinking about the connections between the neurons of these networks, in which the connections are viewed as analogous to how people interact with one another.
This surge of interest and growth in neural networks did not last long: progress was slow, and other powerful machine learning techniques such as Support Vector Machines (SVMs) were invented and achieved better accuracy and performance than the basic neural network architectures of the time. This period is called the Second AI Winter. Some applications of SVMs are discussed in [10--13].
The advancement of neural networks was later accelerated by the emergence of multilayer networks, in which multiple layers of neurons are stacked on top of one another and interconnected. As a result, it became possible to go beyond conventional procedures and to process and assess data at higher degrees of complexity.
Deep neural networks have thus grown more crucial in tackling issues across various
areas, including healthcare, finance, and engineering. Neural networks have developed
into indispensable tools for businesses due to their capacity to quickly evaluate enormous
amounts of data.

FIGURE 8.1 History and progress of development of AI.

The interest in neural networks and deep learning has been on a steady rise for the past few years, and there does not seem to be any indication that it will slow down anytime soon. In fact, given the current progression and its potential for improvement, it is unlikely
that we will witness another AI winter in the near future. Fig. 8.1 illustrates a summarized
history of AI.

8.1.2 Biological neurons


Fig. 8.2 illustrates a simple neuron and, in the most straightforward way, the process by which it transforms information. A neuron consists of 3 principal components:
• Dendrites: These accumulate signals from other neurons or the peripheral environ­
ment in the form of electricity and pass them to the cell body (input gate).
• Cell body: This contains the nucleus and other critical components of the cell, making it the core section of a neuron. Also known as the soma, it contains genetic information, maintains the neuron's structure, and provides the energy that drives its activities. In its resting state, a cell body holds a resting potential of about −70 to −90 mV, a steady charge maintained between action potentials. Incoming signals from the dendrites increase or decrease this potential depending on the type of neuron sending the signal, and if the potential reaches a threshold (around −55 mV), an action potential is triggered: an electrical signal is transmitted to the connected neurons through the axon and its terminal branches. We refer to this process as firing and to the signal as a spike or impulse [20--22].
• Axon: This carries messages from the cell body to other neurons, muscles, or glands. The branches at the end of an axon are joined to the dendrites of other neurons; this junction is called a synapse. When a spike crosses a synapse, it affects the potential of the receiving neuron's dendrites [23--25].

FIGURE 8.2 Simple representation of a biological neuron.

The brain is thought to have 10 billion neurons, each of which is connected to 10 000
other neurons. We can carry out complicated calculations, such as finding patterns, making deliberate choices, and adapting our behavior in response to environmental changes,
thanks to this vast network. Facial recognition, language understanding, problem-solving,
and memorizing are some of the most intricate calculations that our brains perform regu­
larly.
Despite the lack of a complete understanding of how biological neural networks (BNNs)
work, scientists have started to map out some of their basic architecture. It seems that neu­
rons are often organized in consecutive layers, especially in the cerebral cortex -- the outer
layer of our brain. Researchers believe that this architecture allows for faster processing
and more efficient communication between different parts of the brain. This discovery has
opened up new opportunities for understanding neurological architectures and enhanc­
ing artificial ones by mimicking their structures and interactions.

8.1.3 Artificial neurons: the perceptron


The perceptron model is a type of artificial neural network developed in the 1950s by Frank
Rosenblatt [3]. It is an algorithm used to classify linearly separable data and is based on
biological neural networks.
A perceptron has one output and one or more inputs, each with a corresponding weight. The products of the inputs and their corresponding weights, summed together, determine the output's strength. The output is turned on if this sum rises above a predetermined level:

$$y = \mathrm{step}(z), \qquad z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = \mathbf{x}^T \mathbf{w}. \tag{8.1}$$

This activation is performed by applying a step function to the weighted sum of the inputs and emitting the result. Usually, the Heaviside step function or the sign function is used as the step function for this model:

$$\mathrm{heaviside}(z) = \begin{cases} 0 & z < 0 \\ 1 & z \ge 0 \end{cases}, \qquad \mathrm{sgn}(z) = \begin{cases} -1 & z < 0 \\ 0 & z = 0 \\ 1 & z > 0 \end{cases}. \tag{8.2}$$
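As a small illustrative sketch (plain NumPy; the function names are ours), the step activation and the perceptron output of Eq. (8.1) can be written as:

import numpy as np

def heaviside(z):
    # 0 for z < 0, 1 for z >= 0, as in Eq. (8.2)
    return np.where(z < 0, 0, 1)

def perceptron_output(x, w):
    # Weighted sum followed by a hard threshold, as in Eq. (8.1)
    z = x @ w
    return heaviside(z)

x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.2])
print(perceptron_output(x, w))   # prints 0 for this example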

The perceptron model operates similarly to a single neuron in the human brain. In both
cases, the neurons take information from other neurons and then activate or suppress the
particular signal. Nevertheless, there are variations in how the weights are determined.
It is capable of classifying only linearly separable data, indicating its inability to address
nonlinear challenges. This paradigm can be adapted to address nonlinear and even com­
plex situations [33--37]. As illustrated in Fig. 8.3, if the weighted sum exceeds a threshold
(here it is 0), it outputs the positive class (square). Otherwise, it outputs the negative class
(triangle); therefore, it can be used in a binary classification task such as iris flower classi­
fication based on petal length and width.
For performing multi-class classification tasks, it is also possible to have multiple units
in the output layer of a perceptron in which each unit is connected to all of the inputs.

FIGURE 8.3 Perceptron model for linearly separable data.

We call this layer a fully connected layer or a dense layer in which every unit (neuron) in
this layer is connected to all the units (neurons) in the previous layer (i.e., inputs). Each of
these connections has a specific weight that will be tuned during the training process in
order for each neuron to predict the desired value for the corresponding input. Moreover,
an additional bias term is usually added that provides the model with a starting point,
a direction the weights should take, and a way to adjust the weights to best fit the data.
Without bias terms, neural networks would take longer to learn the dataset, as they would
need to discover the best weights through trial and error. An illustration of a perceptron
model for multi-class classification is depicted in Fig. 8.4.

FIGURE 8.4 A perceptron model with N input neurons, multiple output neurons, and a bias term for multi-class
classification problems.

The Hebb learning rule is used during the perceptron model’s training procedure. Hebb
asserts that neurons that fire together wire together, which means that when two neurons are engaged simultaneously, they become more tightly coupled, as was described in
the first section. Input is given to the neurons during training, which causes them to be­
come activated and create patterns. The Hebb rule then serves to reinforce these patterns,
strengthening the link between the two neurons. Until the network is trained to detect the
input pattern and produce the required response, this process is repeated for each input
instance. More specifically, the learning rule alters the weights of the connections in the model in a way that reduces the error between the predictions and the actual values. The perceptron's learning rule can be derived as follows.
The output of the perceptron in terms of the model parameters and the input data:

$$\hat{y} = \sum_{i=1}^{n} w_i x_i + b. \tag{8.3}$$

Use this equation to calculate the loss function. We can use either the Mean Squared
Error (MSE) or the Cross-Entropy (CE) loss function. For this example, we will use the MSE
loss:
$$L = \frac{1}{2N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2. \tag{8.4}$$

Use the chain rule to calculate the derivative of the loss function with respect to the
model parameters:
$$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_i} = \frac{1}{N} \left( \hat{y} - y_i \right) x_i. \tag{8.5}$$
Finally, use the derivative to update the model parameters:
$$w_{i,j}^{t+1} = w_{i,j}^{t} + \eta \left( y_j - \hat{y}_j \right) x_i. \tag{8.6}$$

In the above equation, the weight of the connection between the i-th input neuron and the j-th output neuron (i.e., $w_{i,j}$) is updated after feeding one training instance at time step t. $\hat{y}_j$ is the predicted output of the j-th output neuron for the corresponding input instance. η is the learning rate, a crucial parameter that regulates the magnitude of the weight adjustments during training. It requires careful tuning
through trial and testing to find the optimal value that minimizes error by adjusting the
weights effectively.
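A minimal sketch of one update step of Eq. (8.6) for a single training instance (plain NumPy; the sizes and values are arbitrary illustrations):

import numpy as np

eta = 0.1                               # learning rate
x = np.array([1.0, 0.5])                # one input instance with n = 2 features
W = np.zeros((2, 3))                    # weights: n inputs x m output neurons
b = np.zeros(3)                         # one bias per output neuron
y = np.array([1, 0, 0])                 # desired outputs

y_hat = np.where(x @ W + b >= 0, 1, 0)  # current predictions of the output units
W += eta * np.outer(x, y - y_hat)       # weight update of Eq. (8.6)
b += eta * (y - y_hat)                  # corresponding bias update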
Perceptron models are single-layer neural networks that can only classify linearly sep­
arable data. This means that they are not capable of solving more complex problems such
as image recognition or natural language processing.
By stacking multiple perceptron layers and employing nonlinear activation functions
instead of step functions to generate probability rather than using hard thresholds, we can
overcome the drawbacks and simplicity of the perceptron model. The previously men­
tioned ANN is known as the Multi-Layer Perceptron (MLP), and it will be further discussed
in the next section along with additional deep learning model variations for other applica­
tions.

8.2 Deep neural networks


8.2.1 Deep feedforward networks
Now that we understand what a perceptron model is and its functionality, it is a good time
to introduce a multi-layer perceptron (MLP). More formally, it consists of multiple layers
of interconnected nodes. By stacking multiple layers, we can build an MLP.
In a perceptron, input data is passed through a single layer of neurons. Each neuron
carries out the same simple mathematical operation to generate a single output. However,
in more complex neural networks, the input data is propagated through multiple layers
of neurons. At each layer, new outputs are generated, serving as inputs to the next layer.
Fig. 8.5 demonstrates an MLP with multiple hidden layers.

FIGURE 8.5 Schematic of an MLP with n hidden layers and an output layer.

In this network, the number of neurons in each layer can vary (an adjustable hyperpa­
rameter). Typically, each neuron in a layer is connected to every neuron in the previous
layer. During training, the weights of these connections can be adjusted. An optimization
technique like gradient descent is employed during training to minimize the difference
between the network’s expected and actual outputs.
In 1986, a paper introducing backpropagation was published that revolutionized this field of research, inaugurated new developments, and brought much attention to neural networks [8]. By using this algorithm, we can calculate the gradient of the model's
error with respect to every single parameter in the model; therefore we can understand the
association between each weight and bias term in the model to the error and tune them to
reduce that using an optimizer.
Gradient descent iteratively updates the model parameters by moving in the direc­
tion of the negative gradient of the loss function with respect to the model’s parameters.
The gradient indicates how rapidly the loss function changes concerning the parameters,

with the negative gradient pointing toward the steepest decrease in the loss. Through this
process, gradient descent gradually converges to a minimum of the loss function by con­
tinuously adjusting the parameters in this direction.
The process of computing the gradient and updating the parameters is repeated until
the error reaches a minimum or a stopping criterion is met, such as a maximum number
of iterations or a minimum change in the loss.
Training an MLP has two stages: forward pass and backward propagation. In the for­
ward pass, let us assume that the input to the network is x, and the network has L layers,
including the input layer and the output layer. The output of the j-th layer is represented by $a^j$, an $n_j$-dimensional vector, where $n_j$ is the number of units in the j-th layer. The activation function of the j-th layer is represented by $g^j$:

$$a^0 = x, \qquad z^j = W^j a^{j-1} + b^j, \qquad a^j = g^j(z^j), \quad j = 1, 2, \ldots, L. \tag{8.7}$$

The final output of the network, $a^L$, represents the prediction of the network. This is then used by the loss function to calculate the loss. More specifically, assume that the loss function is denoted by C, and the weight matrix between neurons in layer j − 1 and layer j is represented by $W^j$.
To perform backpropagation, we need to calculate the effect of each weight in the network on the loss, where $w_{ij}^{k}$ represents the weight of the connection from neuron i in layer k − 1 to neuron j in layer k:
$$\frac{\partial C}{\partial w_{ij}^{k}}. \tag{8.8}$$

According to the equations in (8.7) we can calculate the above term by applying the
chain rule:

$$\frac{\partial C}{\partial w_{ij}^{k}} = \frac{\partial C}{\partial z^{k}} \cdot \frac{\partial z^{k}}{\partial w_{ij}^{k}} = \frac{\partial C}{\partial z^{k}} \cdot a^{k-1}, \qquad \frac{\partial C}{\partial z^{k}} = \frac{\partial C}{\partial a^{k}} \cdot \frac{\partial a^{k}}{\partial z^{k}} = \frac{\partial C}{\partial a^{k}} \cdot g^{k\prime}(z^{k}). \tag{8.9}$$
The term $\frac{\partial C}{\partial a^{k}}$ can be calculated in the last layer using the loss function by measuring the difference between the predicted outputs and the actual outputs. For the other layers, since a given neuron can be connected to multiple neurons in the next layer, and those neurons are themselves connected to further neurons, the term is computed as a summation:

$$\frac{\partial C}{\partial a_{i}^{k}} = \sum_{a:\, a_{i}^{k} \to a^{k+1}} \frac{\partial C}{\partial a} \cdot \frac{\partial a}{\partial a_{i}^{k}} = \sum_{a:\, a_{i}^{k} \to a^{k+1}} \frac{\partial C}{\partial a} \cdot g^{(k+1)\prime}(a^{k}) \cdot w_{(a_i,\, a)}, \tag{8.10}$$

where the summation runs over every connection from neuron i in layer k to the connected neurons in layer k + 1. We can summarize the above equations into a single equation:
$$\frac{\partial C}{\partial w_{ij}^{k}} = a^{k-1} \cdot g^{k\prime}(z^{k}) \cdot \frac{\partial C}{\partial a_{i}^{k}}. \tag{8.11}$$
The same equation can be written for bias terms as follows:

$$\frac{\partial C}{\partial b^{k}} = \frac{\partial C}{\partial z^{k}} = \frac{\partial C}{\partial a^{k}} \cdot g^{k\prime}(z^{k}). \tag{8.12}$$

By applying the gradient descent algorithm, we adjust the weights and biases of the network using the terms we have just calculated so as to reduce the loss. The learning rate η controls the magnitude of these changes; it is a hyperparameter that should be tuned so that the model converges toward a good minimum:
$$W^{(k)} = W^{(k)} - \eta \frac{\partial C}{\partial W^{(k)}}, \qquad b^{(k)} = b^{(k)} - \eta \frac{\partial C}{\partial b^{(k)}}. \tag{8.13}$$
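The following NumPy sketch condenses Eqs. (8.7)-(8.13) for a small two-layer network with sigmoid activations and a squared-error loss; the layer sizes and random values are arbitrary, so it is a didactic illustration rather than a reference implementation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))                 # one input sample with 4 features
y = np.array([[1.0]])                       # target value
W1, b1 = rng.normal(size=(8, 4)), np.zeros((8, 1))
W2, b2 = rng.normal(size=(1, 8)), np.zeros((1, 1))
eta = 0.1

# Forward pass, Eq. (8.7)
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# Backward pass: chain rule as in Eqs. (8.9)-(8.12), using sigmoid'(z) = a(1 - a)
delta2 = (a2 - y) * a2 * (1 - a2)           # dC/dz2 for C = 0.5 * (a2 - y)^2
delta1 = (W2.T @ delta2) * a1 * (1 - a1)    # dC/dz1

# Gradient descent update, Eq. (8.13)
W2 -= eta * delta2 @ a1.T; b2 -= eta * delta2
W1 -= eta * delta1 @ x.T;  b1 -= eta * delta1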

8.2.2 Convolutional networks


In the late 1950s, David H. Hubel and Torsten Wiesel conducted multiple experiments on the visual cortexes of cats and monkeys to gain insight into the structure of this part of the brain. In [5,6], they showed that many neurons in the brain have a small local receptive field; these fields can overlap with one another and together cover the whole visual field. Their main point was that a neuron in the visual cortex does not take the whole visual field as its input; only a small part of that field feeds each neuron. Moreover, they showed that the functionality of these neurons differs from
each other. More specifically, some of them react to simple patterns in the image such as
lines or curves but some of them react to more complex patterns, and some of them have
larger receptive fields than others.
With these studies and developments in computation power, in the 1980s neocognitron
[7] and then LeNet [4] were introduced as the first inspired deep learning models of visual
system structure. The main idea of these types of models is to convolve a kernel on the
input image instead of using the whole image as the input and reduce the size of the input
image while extracting the most important features from it.

With the development in neural networks, scientists concluded that MLPs are not well
suited for most of the complex tasks that are easy for the human brain, such as object
detection and recognition or more general image-related tasks, as they require a large
number of parameters to accurately represent the data and extract the most important
features from it. Images have more characteristics than other data types, so there are more weights and biases to tune, which is computationally expensive because each neuron in an MLP is linked to every neuron in the adjacent layer.
On the other hand, Convolutional Neural Networks (CNNs) reduce this number of pa­
rameters by using convolutional layers, which apply a filter to the input data and extract
features from it. This reduces the number of parameters needed to represent the data, as
only a few filters are required instead of a large number of weights and biases. Addition­
ally, CNNs use pooling layers that further reduce the number of parameters by combining
similar features together.
Another problem with using an MLP for image-related tasks is that the outputs should show little or no variation under shifting, scaling, and other forms of distortion; however, because the layers of an MLP are fully connected, its weights and biases are not invariant to these transformations. This means that if the input data is shifted, scaled, or distorted in any way, the weights and biases of the MLP will no longer be valid for making predictions. This can lead to poor performance on unseen data or even complete failure.
In contrast, CNNs (Convolutional Neural Networks) are more robust to these types of
transformations because they use convolutional layers that are invariant to shifts and scal­
ing. This means that even if the input data is shifted or scaled, the convolutional layers will
still be able to extract useful features from it. Additionally, CNNs also have pooling layers
that can help reduce distortion caused by rotations and other forms of transformation. As
a result, CNNs are generally better at handling shifting, scaling, and other forms of distor­
tion than MLPs.

8.2.2.1 Convolution
A convolution layer consists of filters (kernels) applied to an input image or feature map.
Each filter is a small matrix that is used to detect specific features in the input. The output
of the convolution layer is a feature map, which contains information about the detected
features. Values of the kernel matrix are tuned during the training phase in order to specify
which kernel extracts what features and aggregates the most important features repre­
sented in the image with respect to their spatial relations.
The convolution process involves calculating the dot product between each element
in the filter and its corresponding element in the input, as depicted in Fig. 8.6. This is
achieved by sliding the filter across the input picture or feature map. The resulting dot
product is summed up and stored in the feature map. This process is repeated for each el­
ement in the filter until all items have been processed. The feature map generated contains
information about the features detected in the input picture.
The convolution operation for one-dimensional space can be written as follows in
which a one-dimensional kernel slides on a 1-dimensional signal. t represents the time

FIGURE 8.6 Visualization of a 2D convolution by a 3 × 3 kernel.

variable of the convolving function f applied to function g. The convolution operation is written with an ∗, and the output is calculated by integrating the product of f and g over all values of τ:
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau. \tag{8.14}$$

The operation in two-dimensional space is the same, except that the kernel and the function are both two-dimensional. Here, x and y index the location in the output signal, whose value is obtained by integrating the product of f and g over all values of u and v:
$$(f * g)(x, y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(u, v)\, g(x - u,\, y - v)\, du\, dv. \tag{8.15}$$

Since we are working with images that consist of pixels, the space is discrete and the above formula becomes the following. Here, $f_1$ and $f_2$ are the dimensions of the kernel and g is the original image. The pixel at location (m, n) is calculated by multiplying the kernel values with the pixels at locations $(m - f_1 + 1,\, n - f_2 + 1), \ldots, (m, n)$:
$$(f * g)(m, n) = \sum_{i=0}^{f_1 - 1} \sum_{j=0}^{f_2 - 1} f(i, j)\, g(m - i,\, n - j). \tag{8.16}$$

For input images with multiple channels, an extra summation over channels is added to the formula, meaning that the values calculated for each channel are added to form the final output, as illustrated in Fig. 8.7. Moreover, we can use multiple kernels and perform a separate convolution with each one in order to extract assorted features from the input. Here, we convolve kernel k over the input image $in$ with C channels to calculate pixel (m, n) of the k-th output channel. An extra learnable bias term $b_k$ is defined for each kernel in the convolution layer and added to the formula:
$$(f * g)_{m,n,k} = \sum_{i=0}^{f_1 - 1} \sum_{j=0}^{f_2 - 1} \sum_{c=0}^{C - 1} in_{m+i,\, n+j,\, c} \cdot w_{i,j,c,k} + b_k. \tag{8.17}$$
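A direct (unoptimized) NumPy translation of Eq. (8.17), written as a sketch with illustrative variable names, is:

import numpy as np

def conv2d(inp, kernels, biases):
    # inp: (H, W, C); kernels: (f1, f2, C, K); biases: (K,). 'Valid' convolution, Eq. (8.17)
    H, W, C = inp.shape
    f1, f2, _, K = kernels.shape
    out = np.zeros((H - f1 + 1, W - f2 + 1, K))
    for k in range(K):
        for m in range(out.shape[0]):
            for n in range(out.shape[1]):
                # Sum over the kernel window and all input channels, then add the bias
                out[m, n, k] = np.sum(inp[m:m+f1, n:n+f2, :] * kernels[:, :, :, k]) + biases[k]
    return out

img = np.random.rand(8, 8, 3)      # toy 8x8 image with 3 channels
w = np.random.rand(3, 3, 3, 2)     # two 3x3 kernels
b = np.zeros(2)
print(conv2d(img, w, b).shape)     # (6, 6, 2)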

FIGURE 8.7 Visualization of a 3D convolution by a 3 × 3 × 3 kernel.

The size of the filter (f1 and f2 ; usually f1 = f2 in convolutional layers and these are
odd for retaining the symmetry in forward and backward propagation) determines how
much information it can detect from an input image or feature map. Larger filters can
detect more complex patterns, while smaller filters can detect simpler patterns. By stacking
multiple convolution layers together, CNNs can learn increasingly complex patterns from
their inputs and produce more accurate results [32].
In other words, each convolution layer removes infrequent subpatterns (disturbances)
and extracts frequent subpatterns. In these layers, multiple feature maps may be generated
and each one extracts an independent frequent subpattern.
In forward propagation, when multiple convolutional layers are stacked together, this operation is performed on the outputs of the previous layer by the kernels of the current layer (for simplicity, we consider images with one channel and a single kernel per layer). Here, $o^{L-1}$ is the output of the previous layer L − 1 and $w^L$ is the kernel of the current layer, consisting of the weights to be tuned:
$$x_{m,n}^{L} = \sum_{i=0}^{f-1} \sum_{j=0}^{f-1} o_{m+i,\, n+j}^{L-1} \cdot w_{i,j}^{L} + b^{L}. \tag{8.18}$$
i=0 j =0

After the output feature map is calculated, the activation function of that layer is applied element-wise. Here, a is the activation function of layer L:
$$o_{m,n}^{L} = a(x_{m,n}^{L}). \tag{8.19}$$

We use backpropagation to update the weights of the kernels in each layer. First, we calculate the error E between the predicted outputs and the original labels or values, using a loss function appropriate to the task at hand. We then need the gradients of the loss with respect to the model parameters, which are used to update those parameters during training. We refer to this value as $\frac{\partial E}{\partial w_{m',n'}^{L}}$, which indicates the influence on the error of a change in the value located at (m', n') of the kernel in layer L. It can be calculated using the above equations and the chain rule:
$$\frac{\partial E}{\partial w_{m',n'}^{L}} = \sum_{i=0}^{H-f} \sum_{j=0}^{W-f} \frac{\partial E}{\partial x_{i,j}^{L}} \cdot \frac{\partial x_{i,j}^{L}}{\partial w_{m',n'}^{L}}. \tag{8.20}$$

From (8.18), we can write:


$$\frac{\partial x_{i,j}^{L}}{\partial w_{m',n'}^{L}} = \frac{\partial}{\partial w_{m',n'}^{L}} \left( \sum_{a=0}^{f-1} \sum_{b=0}^{f-1} o_{i+a,\, j+b}^{L-1} \cdot w_{a,b}^{L} + b^{L} \right). \tag{8.21}$$

Now, if we expand the above equation, all the terms except the one with a = m' and b = n' are equal to zero, and we have:
$$\frac{\partial x_{i,j}^{L}}{\partial w_{m',n'}^{L}} = \frac{\partial}{\partial w_{m',n'}^{L}} \left( o_{i+m',\, j+n'}^{L-1} \cdot w_{m',n'}^{L} \right) = o_{i+m',\, j+n'}^{L-1}. \tag{8.22}$$

Substituting this value into (8.20) gives the following formula. It can be implemented as a convolution of $o^{L-1}$ with the error matrix $\frac{\partial E}{\partial x^{L}}$ as a kernel rotated by 180 degrees; the term $\frac{\partial E}{\partial x_{i,j}^{L}}$ expresses how much the loss changes for a change in the value at location (i, j) of the output of layer L:
$$\frac{\partial E}{\partial w_{m',n'}^{L}} = \sum_{i=0}^{H-f} \sum_{j=0}^{W-f} \frac{\partial E}{\partial x_{i,j}^{L}} \cdot o_{m'+i,\, n'+j}^{L-1}. \tag{8.23}$$

To calculate $\frac{\partial E}{\partial x_{i',j'}^{L}}$, note from the convolution operation explained above that a value located at (i', j') affects the outputs from $(i' - f + 1,\, j' - f + 1)$ to $(i', j')$; we denote this set of locations by Q. Hence, we have:
$$\frac{\partial E}{\partial x_{i',j'}^{L}} = \sum_{(i,j) \in Q} \frac{\partial E}{\partial x_{i,j}^{L+1}} \cdot \frac{\partial x_{i,j}^{L+1}}{\partial x_{i',j'}^{L}}. \tag{8.24}$$

From (8.18), for $x_{Q}^{L+1}$ we can write:
$$\frac{\partial x_{Q}^{L+1}}{\partial x_{i',j'}^{L}} = \frac{\partial}{\partial x_{i',j'}^{L}} \left( \sum_{m=0}^{f-1} \sum_{n=0}^{f-1} o_{m'+i'-m,\, n'+j'-n}^{L} \cdot w_{m,n}^{L+1} + b^{L+1} \right). \tag{8.25}$$

By substituting (8.19) into the above equation, all the terms except those with m = m' and n = n' are equal to zero, and we have:
$$\frac{\partial x_{Q}^{L+1}}{\partial x_{i',j'}^{L}} = w_{m,n}^{L+1} \cdot a'(x_{i',j'}^{L}). \tag{8.26}$$

By substituting this equation into (8.24) and then into (8.23), we obtain:
$$\frac{\partial E}{\partial w_{m',n'}^{L}} = \frac{\partial E}{\partial x_{m',n'}^{L+1}} * \mathrm{rotation}_{180}\{ w_{m,n}^{L+1} \} \cdot a'(x_{m',n'}^{L}). \tag{8.27}$$

The term $\frac{\partial E}{\partial x_{m',n'}^{L+1}}$ is available for the last layer, where it is calculated by applying the loss function to the predicted and actual values. We can therefore update the parameters of the model in a backward manner, from the last layer to the first. The weights and bias terms of each layer are updated as follows:
$$w_{i,j} = w_{i,j} - \eta \cdot \frac{\partial E}{\partial w_{i,j}}, \qquad b_k = b_k - \eta \cdot \frac{\partial E}{\partial b_k}. \tag{8.28}$$
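As a check on Eq. (8.23), the weight gradient of a single-channel 'valid' convolution can be computed in a few lines of NumPy; this is an illustrative sketch only:

import numpy as np

def conv_weight_grad(inp, dE_dx):
    # inp: (H, W) input of the layer; dE_dx: gradient of the loss w.r.t. the layer's
    # pre-activation output. Returns dE/dw with the shape of the kernel, Eq. (8.23).
    H, W = inp.shape
    h, w = dE_dx.shape
    f1, f2 = H - h + 1, W - w + 1       # kernel size implied by the 'valid' output size
    grad = np.zeros((f1, f2))
    for m in range(f1):
        for n in range(f2):
            grad[m, n] = np.sum(dE_dx * inp[m:m+h, n:n+w])
    return grad

inp = np.random.rand(6, 6)
dE_dx = np.random.rand(4, 4)            # error for the output of a 3x3 kernel
print(conv_weight_grad(inp, dE_dx).shape)   # (3, 3)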
Subpatterns that appear in the first convolution layer are the main edges, lines, or
curves of the input image. In the next layers, these patterns are aggregated with the spatial
information and form more complex patterns such as corners and smaller sub-skeletons
and finally in the last convolution layer they form the main sub-skeleton.
There are some other techniques used in convolutional layers that will be described fur­
ther in the following. One of them is padding that is used to increase the spatial size of the
input image. By adding padding to the input image, the size of the output feature map can
be controlled to preserve the spatial dimensions of the input image. In this technique, bor­
ders of the image are filled artificially with a certain value (usually zero, i.e., zero padding)
to keep spatial dimensions constant across kernels, as depicted in Fig. 8.8. It is useful when
having large receptive fields with strides. Generally, there are two types of padding used in
these layers:
• Valid padding: no zero values are added to the input image. This indicates that the size
of the output feature map is smaller than the size of the input picture. The size of the
output feature map is determined as follows, where I is the height of the input image,

FIGURE 8.8 Visualization of a convolution operation using padding.



J is the width of the input image, H is the height of the filter, W is the width of the filter,
and K is the number of filters:

(I − H + 1) × (J − W + 1) × K. (8.29)

• Same padding: In same padding, zero values are added to the input image so that the
spatial size of the output feature map is the same as the input image size. The number
of zeros added to each side of the input image can be calculated as follows, where ph is
the number of zeros added to the height of the input image, and pw is the number of
zeros added to the width of the input image:

$$p_h = \frac{H - 1}{2}, \qquad p_w = \frac{W - 1}{2}. \tag{8.30}$$

The same padding provides for the preservation of the input picture’s spatial dimen­
sions, which is advantageous for tasks like object detection and image classification where
the input image’s spatial information is crucial.
The other technique used in convolutional layers is stride. This is a hyperparameter that
determines the step size of the filter as it slides over the input image, as demonstrated in
Fig. 8.9. The stride controls the spatial size of the output feature map and can be thought of
as down-sampling the input image. The formula to calculate the output feature map size
for a given input image and filter size is as follows, where I is the height of the input image, J is the width of the input image, H is the height of the filter, W is the width of the filter, S is the stride, and K is the number of filters:

$$\left( \left\lfloor \frac{I - H}{S} \right\rfloor + 1 \right) \times \left( \left\lfloor \frac{J - W}{S} \right\rfloor + 1 \right) \times K. \tag{8.31}$$

Increasing the stride can reduce the spatial size of the output feature map, resulting in
a decrease in the amount of computation required for each layer, albeit it is a destructive

FIGURE 8.9 A convolution operation using 1 and 2 as the stride.



operation and can also result in a loss of spatial information from the input image. The
choice of stride will depend on the specific problem and desired output size.
General formulas for the height and width of an image after passing through a convolutional layer can be summarized as follows:
$$H_{out} = \left\lfloor \frac{H_{in} + 2P - H}{S} \right\rfloor + 1, \qquad W_{out} = \left\lfloor \frac{W_{in} + 2P - W}{S} \right\rfloor + 1. \tag{8.32}$$
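These output-size formulas can be checked with a short helper function (a sketch; integer floor division implements the rounding):

def conv_output_size(h_in, w_in, f_h, f_w, padding=0, stride=1):
    # Eq. (8.32): spatial dimensions of the output of a convolutional layer
    h_out = (h_in + 2 * padding - f_h) // stride + 1
    w_out = (w_in + 2 * padding - f_w) // stride + 1
    return h_out, w_out

print(conv_output_size(224, 224, 3, 3, padding=1, stride=1))   # (224, 224): 'same'-style padding
print(conv_output_size(224, 224, 3, 3, padding=0, stride=2))   # (111, 111)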

8.2.2.2 Pooling
Another typical layer used in CNNs is the pooling layer. Pooling layers are mainly used to reduce the spatial size of the feature maps, which decreases the computation required for each layer and helps control overfitting, which can occur when the model has too many parameters.
As with convolutional layers, in pooling layers a kernel of a certain size slides over the image, but the functionality is different. There are three main types of pooling layers:
• Max pooling: The output value is chosen to be the highest value within a range of values
in the input feature map. Since it has been proven to be more successful in lowering
the spatial size of feature maps while keeping crucial information, it is more frequently
utilized than average pooling.
• Average pooling: The output value is chosen to be the average value within a range of
values in the input feature map.
• Global pooling: Instead of employing a kernel that focuses on a small window of values in the input feature map, global pooling takes an entire channel of the input as its set of values and computes either their maximum (global max pooling) or their average (global average pooling). This operation generates a vector with one value per input channel.
The formulas for the size of the output feature map for the average and max pooling
layers are:

$$H_{out} = \left\lfloor \frac{H_{in} - H}{S} \right\rfloor + 1, \qquad W_{out} = \left\lfloor \frac{W_{in} - W}{S} \right\rfloor + 1. \tag{8.33}$$
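A max pooling layer can likewise be sketched directly in NumPy (illustrative, single channel):

import numpy as np

def max_pool2d(x, size=2, stride=2):
    # Eq. (8.33): non-overlapping max pooling on a single-channel input
    H, W = x.shape
    h_out = (H - size) // stride + 1
    w_out = (W - size) // stride + 1
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.arange(16.0).reshape(4, 4)
print(max_pool2d(x))   # [[ 5.  7.] [13. 15.]]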

Since their kernels do not have any weights to tune, they do not increase the complex­
ity of the network, in contrast to convolution layers. These layers will be applied to every
channel presented in the input image; therefore, input and output images will have the
same number of channels. Using these layers gives the model a degree of invariance to scale and rotation, meaning that if the location or orientation of an object in the input image changes, the network will still detect that object and extract the same features from it. This ability is useful in tasks that do not depend on fine details, such as classification or detection. Fig. 8.10 demonstrates the processes of performing max pooling and average pooling on an input image.

FIGURE 8.10 Visualization of a max pooling layer with 2 as the stride.

These layers are typically used after the convolutional layers and before the fully con­
nected layers. They can be stacked together multiple times to reduce the spatial size of the
feature maps at different stages of the network. Another point worth mentioning is that these layers are destructive. For example, a pooling layer with a 2 × 2 kernel and a stride of 2 reduces the spatial dimension of the input by a factor of four, which can discard some of the most important features of the original image. As a result, in some cases, such as semantic segmentation, using these layers is inadvisable: the network must preserve spatial locations, and that information should not be eliminated. Consequently, using these layers frequently does not necessarily increase the performance of the model.

8.3 Learned features


In this section, we will move from the basics of neural networks and CNNs to practical
usage. We will learn how to actually use them with different tasks, and we will see how they
extract features from data. We will also explore visualizations of the extracted features at
each layer of a trained deep model. Finally, we will build a face verification system using a
pre-trained deep learning model. This will demonstrate how powerful these networks are
at extracting important details from data.

8.3.1 Visualizing learned features


8.3.1.1 Perceptron
Using the numpy library, we create a perceptron model from scratch in this example. The
model makes a prediction on whether an input belongs to a class with the label 1 or 0. The
main objective of this implementation is to demonstrate that the perceptron model works best with data that is linearly separable. When the data is not linearly separable, however, the perceptron performs poorly, since it uses no nonlinear activation function and can only adjust the weights of a linear decision boundary.
In this implementation, we begin by creating a class called Perceptron, which consists of
two methods and an initialization function. Upon instantiation of an object from this class,
the user specifies the learning rate and the number of iterations for the weight update pro­
cedure. The fit method takes as input the training data and its corresponding labels and
updates the model’s weights by calculating the loss and utilizing Eq. (8.6). Additionally,
it aggregates the loss values across different iterations and returns them for visualization
and debugging purposes. In the predict method, input feature vectors are provided to the
model along with their corresponding labels. The method uses the updated weights ob­
tained from the training process to predict the labels. If the prediction is positive, the
corresponding record is assigned to class 1; otherwise, it is assigned to class 0.

import numpy as np

class Perceptron(object):
    # Initialize and set the learning rate and number of iterations
    def __init__(self, learning_rate=0.01, num_iterations=100):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations

    def fit(self, X, y):
        # Initialize the weights for the input layer with bias
        self.weights = np.zeros(X.shape[1] + 1)

        # List to store errors
        history = []

        for i in range(self.num_iterations):
            # Get the prediction from the model
            y_hat = self.predict(X)

            # Calculate the loss
            loss = y_hat - y

            # Update the weights
            self.weights[0] -= self.learning_rate * loss.sum()       # bias
            self.weights[1:] -= self.learning_rate * (X.T.dot(loss))

            history.append(np.mean(np.abs(loss)))
        return history

    def predict(self, X):
        # Add the bias to the input
        X_biased = np.c_[np.ones(X.shape[0]), X]

        # Calculate the product of weights and inputs
        product = np.dot(X_biased, self.weights)

        # If the product is positive classify the input as 1, else 0
        return np.where(product >= 0, 1, 0)

We employ two datasets with distinct objectives. The first dataset is linearly separable,
which presents the model with the task of fitting a single line to differentiate records with
different labels. The second dataset is the moon dataset, which simulates nonlinearly sep­
arable data. This dataset consists of two interleaving half-circles. The decision boundaries
predicted by the model are depicted in Fig. 8.11.
The findings demonstrate how successfully the Perceptron identifies the decision boundary and categorizes the inputs when a linear separator exists. The accuracy was lower on the nonlinearly separable data: no single line can separate it, so the data would need a nonlinear decision boundary. The notion mentioned previously is confirmed by the subsequent figure. The Perceptron's error per epoch during training is shown in Fig. 8.12. As can be ob­

FIGURE 8.11 Performance of the Perceptron on different datasets.

served, the model can converge to a global minimum in the first epochs when training on
linearly separable data. However, when training on nonlinearly separable data, the model
cannot converge to a global minimum nor further reduce the error from the first epochs.
The search for a better decision boundary oscillates greatly, but it is unable to converge on
an ideal outcome.

FIGURE 8.12 Error per epochs when training Perceptron on different data.

8.3.1.2 MLP
We perform the same task as in the previous subsection in order to show that adding nonlinearity and multiple layers through an MLP can significantly improve the performance of the model. Here, we implement a neural network using TensorFlow with 4 layers comprising 32, 16, 8, and 1 units, respectively. We used ReLU as the activation function for

the first 3 layers and a sigmoid for the final layer to predict the probability of each input belonging to each class. We used Adam as the optimizer and binary cross-entropy as the loss, since we only have 2 classes. We train the model for 50 epochs with a batch size of 32.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Create the MLP model
model = Sequential()
model.add(Dense(32, input_dim=2, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Train the model (X_train / y_train: the training split of the dataset described above)
model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)

FIGURE 8.13 Performance of an MLP on the moon dataset.

Results are illustrated in Fig. 8.13. The left figure demonstrates that the model can accurately identify the decision boundary. To show how well the model separates the data points from one another, we also visualize the mapped space of the layer before the topmost layer, which has 8 units. We use PCA to reduce this representation from 8 to 2 dimensions in order to plot it. As can be seen, in the mapped space the blue data points have greater values on the X-axis, a fact the model can exploit to categorize the data. This is a compelling finding that again illustrates that neural network

models can be used to reduce the dimensionality of the data while extracting the most
important features from it to further be used in desired tasks.
For the next case, we make use of the MNIST handwritten digit dataset. We will use an MLP with 6 layers to approach this problem and show the capability of neural networks to extract valuable characteristics from the input data while reducing its dimension. The last layer contains 10 classification units, whereas the widths of the earlier hidden layers decrease from 256 to 16 by a factor of 2. With the exception of the last layer, which uses softmax as the activation function, we utilized the tanh function for the other layers. As we have 10 classes, we used Adam as the optimizer and sparse categorical cross-entropy as the loss. We train the model for 10 epochs with a batch size of 32.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Define the model
model = Sequential([
    Dense(256, input_dim=784, activation='tanh'),
    Dense(128, activation='tanh'),
    Dense(64, activation='tanh'),
    Dense(32, activation='tanh'),
    Dense(16, activation='tanh'),
    Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='Adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
history = model.fit(X_train,
                    y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(X_test, y_test))

Outcomes are demonstrated in the following figures. In Fig. 8.14a we use PCA to reduce
the original picture dimensions in the dataset to two in order to demonstrate them. Also,
using the outputs from the fifth layer of Fig. 8.14b, we show the same images after having
them feedforwarded through the network. As can be seen, the model does a valuable job
of differentiating data points from one another by moving data points from the same class
closer to one another and farther away from clusters of different classes. This will support
the model’s data class­fication in the bottom layer. In addition, outputs from the fifth layer
show a decent representation of the data when compared to Fig. 8.14a and may be used
for other tasks as well.

8.3.1.3 CNN
In order to illustrate the power of CNNs in dimension reduction and feature extraction
tasks, we demonstrate the extracted features in convolution layers in the subsequent sec­
tion. For this case, we consider a classification task on the Dogs vs. Cats dataset. This

FIGURE 8.14 Effects of MLP on the MNIST dataset.

dataset comprises many pictures of dogs and cats and the labels that go with them. We
created a neural network with six convolutional layers, each with a kernel size of three,
and stacked them. To minimize the dimension of the data and maintain the most crucial
characteristics for the classification task, we add a max pooling layer with a kernel size of 2

after each of two consecutive convolution layers. The 2D extracted features are then con­
verted to 1D vectors using a Flatten layer, which will then feed the vectors forward to two
fully connected layers with 128 and 2 units, respectively. All layers use the ReLU activation function except the final layer, whose activation for the classification task is softmax. The model was trained for 10 epochs with a batch size of 32, and we used Adam as the optimizer.

import tensorflow as tf
from tensorflow.keras import layers

# Build the model
model = tf.keras.Sequential([
    layers.Conv2D(16, (3, 3), activation='relu'),
    layers.Conv2D(16, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(2, activation='softmax')
])

# Compile the model
optim = tf.keras.optimizers.Adam(learning_rate=1e-4)
model.compile(optimizer=optim,
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the model on the image generators prepared for this dataset
history = model.fit(generator_train,
                    epochs=10,
                    validation_data=generator_valid)

In Fig. 8.15, we visualize the different features extracted by the convolution layers for a
single random record in the test set. Eight different channels in the outputs of each con­
volution layer are illustrated as heatmaps. As can be seen, in the first convolution layers the model extracts simple textures or attributes such as horizontal and vertical lines or edges. Going further through the network, more complex textures such as muzzles, ears, eyes, or lips appear in the visualization. Additionally, this informative, low-dimensional representation of the original image can be reused in other tasks to reduce complexity and the required time while boosting performance.
Next, we will use this type of neural network to construct a face verification system in order to show the capability of CNNs to extract features from input data. To fulfill this objective, we must first create a database of pictures and their related feature vectors; a CNN module will be used to extract features from the input photos and save them in the database. We will then construct the face verification procedure,

FIGURE 8.15 Visualization of features extracted by convolutional layers in the model.

which takes an image and extracts its feature vector once the database has been established. The next step is to search the database for the stored feature vector most akin to the new one; the measure we use to compare feature vectors is cosine similarity. If the similarity exceeds a certain threshold, we claim to have found the corresponding face and return the person's name. If no comparable face is discovered, we report that no similar face exists in the dataset. An overview of the described face verification system is illustrated in Fig. 8.16.
We will use the LFW (Labeled Faces in the Wild) dataset, which is composed of pictures
of well-known individuals labeled with their names. As the CNN feature extractor, we will utilize the VGGFace model, a VGG model that has been trained on a large dataset of faces and is therefore well suited to extracting facial features. Compared with models pre-trained on generic datasets such as ImageNet,

FIGURE 8.16 Visualization of the verification system.

a model pre-trained on faces will greatly improve performance for this task.

To begin, we load the VGGFace model. Setting include_top=False allows us to exclude the topmost layer, which was previously used for classification during training, as we intend to use this model solely as a feature extractor. Additionally, we specify the input images' shape and apply global average pooling in the model's architecture. Finally, we add a Flatten layer to convert the model's final representation of an image into a 1D vector, thereby creating our feature extractor model.
The extract_features method takes the path of an image, reads it, and returns the final output of the model described above by forward passing the image. build_database is called only once at the beginning of the procedure. It takes as input a directory path containing one folder of images per person, where each folder is named after the person whose images it contains. The method returns a Python dictionary of key-value pairs, where the key is the path to each image and the value is the extracted feature vector calculated with the aforementioned function. This database is then stored and later used to verify an input image.

import os
import numpy as np
from keras.preprocessing import image
from keras.layers import Flatten
from keras.models import Model
from keras_vggface.vggface import VGGFace

# Load the VGGFace model with pretrained weights
# (image_size is assumed to be defined earlier in the chapter)
vggface_model = VGGFace(include_top=False,
                        input_shape=image_size,
                        pooling='avg')

# Add a Flatten layer to convert the output feature map to a 1D vector
flatten_layer = Flatten()(vggface_model.output)
feature_extractor = Model(inputs=vggface_model.input,
                          outputs=flatten_layer)

# Define a function to extract features from an image using the model
def extract_features(img_path):
    img = image.load_img(img_path, target_size=image_size)
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    features = feature_extractor.predict(x)
    return features.flatten()

# Define a function to build the face database
def build_database(data_dir):
    database = {}
    for name in os.listdir(data_dir):
        person_dir = os.path.join(data_dir, name)
        for img_name in os.listdir(person_dir):
            img_path = os.path.join(person_dir, img_name)
            database[img_path] = extract_features(img_path)
    return database

The main objective of the verify_face method is to take the path of an image, the aforementioned database, and a threshold value, and to verify whether the image already exists in the database. First, it reads the input image and extracts features from it using the VGGFace model. Then, it searches the database for the nearest feature vector, using cosine similarity as the metric. If the closest vector's distance from the input image's feature vector is smaller than the threshold, the method verifies the picture and returns the path of the corresponding database image. If not, ``Unknown'' will be returned.

from scipy.spatial.distance import cosine

# Define a function to perform face verification
def verify_face(img_path, database, threshold=0.5):
    # Extract features from the input image
    input_features = extract_features(img_path)
    # Search the face database for the closest match
    min_distance = float('inf')
    min_path = None
    for path, features in database.items():
        distance = cosine(input_features, features)
        if distance < min_distance:
            min_distance = distance
            min_path = path
    # Check if the closest match is below the threshold
    if min_distance < threshold:
        return min_path
    else:
        return "Unknown"

We divided the data into two sets: one for testing and the other for training. On the test set, the system's accuracy is 68%. Below, we illustrate the results of two sample tests to show how well the system works. In Fig. 8.17a the system verified the input image properly, whereas in Fig. 8.17b the system predicted the input image's label inaccurately. As can be observed, there are visual similarities between the two mistakenly matched photos. Nevertheless, we may improve the system's performance and accuracy by adjusting the threshold for similarity comparison and retraining the model on this dataset.
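As a rough sketch of how the threshold could be tuned, the hypothetical snippet below sweeps it over a labeled held-out set; test_pairs (a list of (image_path, true_name) tuples) is an assumed variable, and the person's name is recovered from the folder layout used by build_database.

import os
import numpy as np

def accuracy_at_threshold(test_pairs, database, threshold):
    correct = 0
    for img_path, true_name in test_pairs:
        match = verify_face(img_path, database, threshold=threshold)
        # The database key is an image path whose parent folder is the person's name
        predicted = os.path.basename(os.path.dirname(match)) if match != "Unknown" else "Unknown"
        correct += int(predicted == true_name)
    return correct / len(test_pairs)

for t in np.linspace(0.2, 0.8, 7):
    print("threshold=%.2f  accuracy=%.3f" % (t, accuracy_at_threshold(test_pairs, database, t)))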

FIGURE 8.17 Performance of the verification system on the LFW dataset.



8.3.2 Deep feature extraction


In early attempts, semi-supervised multi-task learning of deep convolutional representations was investigated. In that scenario, representations are learned on a set of related problems and applied to new tasks with too few training examples. The model is considered a deep transfer learning architecture based on supervised pre-training [14]. A deep convolutional model is first trained in a fully supervised setting. Then, features extracted from various parts of the network are evaluated on generic vision tasks, and the generalization of the CNN features to other datasets is assessed. Moreover, the effect of network depth on performance can be analyzed qualitatively and quantitatively.
CNNs consist of convolutional layers that automatically learn increasingly complex features with different granularity, patterns, and structures. The convolutional feature maps at the first layers illustrate simple shape patterns, e.g., different types of edges. The next layers model more complex mid-level representations, such as different parts of objects or even object types with a simple visual appearance. Complex object types are formulated at the last layers using combinations of the patterns learned in previous layers. Commonly, the last fully connected layer generates the network's classification.
In [16], a visualization technique was used to give insight into the function of intermediate feature layers and the operation of the classifier in a CNN. This analysis shows that the features from the last convolutional layer may not be the best choice for all object types with different levels of semantic granularity. Although the fully connected layers combine different feature maps, useful information, such as correlations among feature maps, is not considered.
As mentioned above, each layer of a CNN can be represented by a series of feature maps that show the structural information of the input. Consider layer l of a CNN with N_f kernels: N_f feature maps can be created (each feature map is the output of applying one kernel). The feature map's size is denoted S_f, where S_f is the feature map's height × width. For each layer, vectorizing and stacking up the feature maps creates a matrix M^l of size N_f × S_f. In any layer, a Gram matrix can be calculated from the feature map matrix:

GM_{ij}^{l} = \sum_{k} M_{ik}^{l} M_{jk}^{l}. (8.34)

In fact, the Gram matrix captures the relations among the feature maps. These deep features were first explored to provide a multi-scale representation of texture information for style transfer in [15].
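A minimal NumPy sketch of Eq. (8.34) is given below; feature_maps is assumed to be the output of one convolutional layer for a single image, with shape (height, width, Nf).

import numpy as np

def gram_matrix(feature_maps):
    h, w, nf = feature_maps.shape
    # Vectorize and stack the Nf feature maps into a matrix M of size Nf x Sf,
    # where Sf = height * width
    M = feature_maps.reshape(h * w, nf).T
    # GM[i, j] = sum_k M[i, k] * M[j, k]
    return M @ M.T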
The correlation between features in various layers of a pre-trained convolutional neural network (CNN) can also be considered a convenient feature for other tasks. The correlation can capture intricate relationships between visual patterns and represent higher-level concepts. An exemplary instance of this is the correlation between features in the convolutional layers of a CNN, which can be employed to produce saliency maps that emphasize the most significant regions of an image for tasks such as object recognition or image captioning. However, it should be noted that the correlation between features may differ depending on the architecture and training data of the pre-trained CNN. As a result, it may require fine-tuning or some adaptation for specific tasks.
In [17] and [15] the authors try to reconstruct the input image by employing various layers of the original VGG network. The lower layers of a convolutional neural network (CNN) capture low-level features, which play a crucial role in reconstructing the overall structure of an image. The upper layers, in contrast, capture high-level features that are not necessarily important for reconstructing exact pixel values. As a result, when utilizing features from a pre-trained CNN, it may be more effective to use features from the lower layers of the network for reconstruction or other tasks requiring detailed pixel information, whereas features from the upper layers may be more suitable for higher-level understanding tasks.
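As an illustration of this idea, the sketch below extracts features from one lower and one upper layer of a pre-trained VGG16; the layer names 'block1_conv2' and 'block5_conv3' are chosen only for illustration, and inputs are assumed to be already preprocessed for VGG.

import tensorflow as tf

vgg = tf.keras.applications.VGG16(weights='imagenet', include_top=False)
feature_model = tf.keras.Model(
    inputs=vgg.input,
    outputs=[vgg.get_layer('block1_conv2').output,   # low-level features (edges, textures)
             vgg.get_layer('block5_conv3').output])  # high-level features (object parts)

low_feats, high_feats = feature_model(tf.random.uniform((1, 224, 224, 3)))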

8.3.3 Deep feature extraction applications


Extracting deep features has many practical uses across diverse domains, including com­
puter vision, natural language processing, and speech recognition.
In [28] deep features are used to define a perceptual loss that captures high-level perceptual and semantic differences between images. In fact, instead of comparing input images directly, feature vectors extracted from a pre-trained network are compared.
Feature reconstruction loss is employed in image superresolution to estimate the distinction between the high-resolution image and the reconstructed one. A pre-trained deep neural network, for example, VGG or ResNet, is employed to extract feature maps from both images, which are then utilized to calculate the loss as the average squared error between the feature maps of the high-resolution and reconstructed images:

L_{feat} = \| \phi(y_{HR}) - \phi(G(y_{LR})) \|_2^2, (8.35)

where L_{feat} is the feature reconstruction loss, y_{HR} and y_{LR} are the high-resolution and low-resolution images, respectively, G is the superresolution model used to generate the reconstructed image from the low-resolution input, and \phi is a pre-trained deep neural network used to extract the feature maps.
The feature reconstruction loss penalizes the estimated output when its content deviates from the target image. For the style transfer task, the loss should instead penalize differences in style: colors, textures, and structural information:

L_{style} = \| GM(y) - GM(\hat{y}) \|_F^2, (8.36)

in which GM denotes the Gram matrix, which can be computed using (8.34) for the selected layer.
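The following hedged sketch shows how the feature reconstruction loss (8.35) and the style loss (8.36) could be computed with a pre-trained VGG16 playing the role of \phi; the choice of 'block3_conv3' as the feature layer is an assumption made for illustration, and images are assumed to be VGG-preprocessed batches.

import tensorflow as tf

vgg = tf.keras.applications.VGG16(weights='imagenet', include_top=False)
phi = tf.keras.Model(vgg.input, vgg.get_layer('block3_conv3').output)

def gram(feats):
    # Gram matrix of Eq. (8.34) computed batch-wise on a (b, h, w, c) tensor
    b, h, w, c = feats.shape
    m = tf.reshape(feats, (b, h * w, c))
    return tf.matmul(m, m, transpose_a=True)          # shape (b, c, c)

def feature_reconstruction_loss(y_hr, y_reconstructed):
    return tf.reduce_mean(tf.square(phi(y_hr) - phi(y_reconstructed)))

def style_loss(y, y_hat):
    return tf.reduce_mean(tf.square(gram(phi(y)) - gram(phi(y_hat))))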

Deep features have been widely explored in the analysis of medical images, including X-rays, CT scans, and MRI scans. These features can be used to detect abnormalities, to diagnose, and to monitor patients effectively. Medical records and laboratory findings can also be utilized for deep feature extraction to diagnose illnesses: the features obtained from such data can be employed to train a classifier capable of identifying distinct ailments and predicting their degrees of severity. Pre-training scenarios are a hot research topic that has attracted attention in the medical domain because of the challenges posed by medical data, such as data scarcity, lack of annotation, privacy constraints, imbalanced data, and computational complexity [27].
In [18] deep features extracted from two different layers, containing high-level and middle-level features, are employed for breast mass classification. In [19] a pre-trained GoogLeNet model is utilized to extract distinctive features from brain MRI images; subsequently, various classifier models are incorporated to classify the extracted features effectively. In [26] several pre-trained DCNNs, including AlexNet, ResNet50, GoogLeNet, VGG-16, ResNet101, VGG-19, Inception-v3, and InceptionResNetV2, are investigated, and the potential of these diverse pre-trained models through transfer learning in classifying pathological brain images is analyzed and reported.
Transfer learning is a commonly utilized approach for transferring knowledge from one domain to another. In medical imaging, the common approach applies deep features derived from a network pre-trained on ImageNet, despite the dissimilarities in tasks and image characteristics. In [29] a series of investigations on diverse medical image benchmark datasets is conducted to explore the relationship between transfer learning, data size, inductive bias, and the distance between the source and target domains. The results show that transfer learning is advantageous in most cases.
In [30] it is shown that ImageNet pre-training helps less when the characteristics of the two domains differ; in fact, it demonstrates that transfer learning may not always lead to significant enhancements in performance. In this regard, [31] illustrates that in the first layers, deep features resemble Gabor-like filters and do not seem exclusive to a particular dataset or task. As a result, these features can be reused for many datasets and tasks. As we go deeper into a network, the features extracted from deeper layers progressively assume a more task-oriented nature.

8.4 Conclusion
Using pre-trained networks for feature extraction has become a popular technique in many applications due to their ability to learn useful representations of the input data. As a case in point, in image recognition, the deep features extracted from a convolutional neural network can capture important characteristics of the input images at various levels. The network learns these features during the training process, and they can further be used to represent the input data of another task. Features extracted by the first layers are low-level features that are not specific to a particular dataset or task; these features provide general characteristics that make them appropriate for reuse in many other tasks. Features transition from general to specific when going deeper through the network. This transition should be further studied in detail.

References
[1] W.S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity, The Bulletin of
Mathematical Biophysics 5 (1943) 115--133.
[2] D.O. Hebb, The Organization of Behavior: A Neuropsychological Theory, Psychology Press, 2005.
[3] F. Rosenblatt, The perceptron: a probabilistic model for information storage and organization in the
brain, Psychological Review 65 (6) (1958) 386.
[4] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition,
Proceedings of the IEEE 86 (11) (1998) 2278--2324.
[5] D.H. Hubel, Single unit activity in striate cortex of unrestrained cats, Journal of Physiology 147 (2)
(1959) 226.
[6] D.H. Hubel, T.N. Wiesel, Receptive fields of single neurones in the cat’s striate cortex, Journal of Phys­
iology 148 (3) (1959) 574.
[7] K. Fukushima, S. Miyake, Neocognitron: a self-organizing neural network model for a mechanism of
visual pattern recognition, in: Competition and Cooperation in Neural Nets: Proceedings of the US--
Japan Joint Seminar Held at Kyoto, Japan, February 15--19, 1982, Springer, Berlin, Heidelberg, 1982,
pp. 267--285.
[8] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning Internal Representations by Error Propagation,
California Univ. San Diego La Jolla Inst. for Cognitive Science, 1985.
[9] B. Widrow, M.A. Lehr, 30 years of adaptive neural networks: perceptron, Madaline, and backpropaga­
tion, Proceedings of the IEEE 78 (9) (1990) 1415--1442.
[10] J.A. Rad, S. Chakraverty, K. Parand, Learning with Fractional Orthogonal Kernel Classifiers in Support
Vector Machines: Theory, Algorithms, and Applications, Springer, 2023.
[11] A.H. Hadian Rasanan, A.G. Khoee, M. Jani, Solving distributed-order fractional equations by LS-SVR,
in: J. Amani Rad, K. Parand, S. Chakraverty (Eds.), Learning with Fractional Orthogonal Kernel Classi­
fiers in Support Vector Machines, in: Industrial and Applied Mathematics, Springer, Singapore, 2023.
[12] A.H. Hadian Rasanan, S. Nedaei Janbesaraei, D. Baleanu, Fractional Chebyshev kernel functions: the­
ory and application, in: J. Amani Rad, K. Parand, S. Chakraverty (Eds.), Learning with Fractional
Orthogonal Kernel Classifiers in Support Vector Machines, in: Industrial and Applied Mathematics,
Springer, Singapore, 2023.
[13] A.H. Hadian Rasanan, J. Amani Rad, M.S. Tameh, A. Atangana, Fractional Jacobi kernel functions:
theory and application, in: J. Amani Rad, K. Parand, S. Chakraverty (Eds.), Learning with Fractional
Orthogonal Kernel Classifiers in Support Vector Machines, in: Industrial and Applied Mathematics,
Springer, Singapore, 2023.
[14] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, Decaf: a deep convolutional
activation feature for generic visual recognition, in: International Conference on Machine Learning,
2014, pp. 647--655.
[15] L. Gatys, A. Ecker, M. Bethge, Image style transfer using convolutional neural networks, in: Proceed­
ings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414--2423.
[16] M. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: Computer Vision--
ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6--12, 2014, Proceedings, Part
I 13, 2014, pp. 818--833.
[17] A. Mahendran, A. Vedaldi, Understanding deep image representations by inverting them, in: Proceed­
ings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5188--5196.
[18] Z. Jiao, X. Gao, Y. Wang, J. Li, A deep feature based framework for breast masses classification, Neurocomputing 197 (2016) 221--231.
[19] S. Deepak, P. Ameer, Brain tumor classification using deep CNN features via transfer learning, Computers in Biology and Medicine 111 (2019) 103345.
[20] M.M. Moayeri, J.A. Rad, K. Parand, Desynchronization of stochastically synchronized neural popula­
tions through phase distribution control: a numerical simulation approach, Nonlinear Dynamics 104
(2021) 2363--2388.
[21] M. Hemami, J.A. Rad, K. Parand, Phase distribution control of neural oscillator populations using
local radial basis function meshfree technique with application in epileptic seizures: a numerical
simulation approach, Communications in Nonlinear Science and Numerical Simulation 103 (2021)
105961.
[22] M.M. Moayeri, J.A. Rad, K. Parand, Dynamical behavior of reaction–diffusion neural networks and
their synchronization arising in modeling epileptic seizure: a numerical simulation study, Computers
& Mathematics with Applications 80 (2020) 1887--1927.
[23] M.M. Moayeri, A.H. Hadian, S. Latifi, K. Parand, J.A. Rad, An efficient space-splitting method for sim­
ulating brain neurons by neuronal synchronization to control epileptic activity, Engineering with
Computers (2020) 1--28.
[24] M. Hemami, J.A. Rad, K. Parand, The use of space-splitting RBF-FD technique to simulate the con­
trolled synchronization of neural networks arising from brain activity modeling in epileptic seizures,
Journal of Computational Science 42 (2020) 101090.
[25] M. Hemami, K. Parand, J.A. Rad, Numerical simulation of reaction–diffusion neural dynamics models
and their synchronization/desynchronization: application to epileptic seizures, Computers & Math­
ematics with Applications 78 (2019) 3644--3677.
[26] T. Kaur, T. Gandhi, Deep convolutional neural networks with transfer learning for automated brain
image classification, Machine Vision and Applications 31 (2020) 20.
[27] Y. Qiu, F. Lin, W. Chen, M. Xu, Pre-training in medical data: a survey, Machine Intelligence Research
20 (2023) 147--179.
[28] J. Johnson, A. Alahi, L. Fei-Fei, Perceptual losses for real-time style transfer and super-resolution,
in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, the Netherlands, October
11--14, 2016, Proceedings, Part II 14, 2016, pp. 694--711.
[29] C. Matsoukas, J. Haslum, M. Sorkhei, M. Söderberg, K. Smith, What makes transfer learning work
for medical images: feature reuse & other factors, in: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 2022, pp. 9225--9234.
[30] K. He, R. Girshick, P. Dollár, Rethinking ImageNet pre-training, in: Proceedings of the IEEE/CVF Inter­
national Conference on Computer Vision, 2019, pp. 4918--4927.
[31] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks?,
Advances in Neural Information Processing Systems 27 (2014).
[32] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv
preprint, arXiv:1409.1556, 2014.
[33] M. Omidi, B. Arab, A. Rasanan, J. Rad, K. Parand, Learning nonlinear dynamics with behavior or­
dinary/partial/system of the differential equations: looking through the lens of orthogonal neural
networks, Engineering with Computers 38 (2022) 1635--1654.
[34] A. Hadian Rasanan, N. Bajalan, K. Parand, J. Rad, Simulation of nonlinear fractional dynamics arising
in the modeling of cognitive decision making using a new fractional neural network, Mathematical
Methods in the Applied Sciences 43 (2020) 1437--1466.
[35] A. Hadian-Rasanan, D. Rahmati, S. Gorgin, K. Parand, A single layer fractional orthogonal neural net­
work for solving various types of Lane–Emden equation, New Astronomy 75 (2020) 101307.
[36] A.H. Hadian-Rasanan, J.A. Rad, D.K. Sewell, Are there jumps in evidence accumulation, and what,
if anything, do they reflect psychologically? An analysis of Lévy-Flights models of decision-making,
Psychonomic Bulletin & Review 31 (2024) 32--48.
[37] A. Ghaderi-Kangavari, J.A. Rad, M.D. Nunez, A general integrative neurocognitive modeling frame­
work to jointly describe EEG and decision-making on single trials, Computational Brain & Behavior 6
(2023) 317--376.
9
Autoencoders
Hossein Motamednia a,b, Ahmad Mahmoudi-Aznaveh c, and Artie W. Ng d
a School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
b Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
c Cyberspace Research Institute, Shahid Beheshti University, Tehran, Iran
d International Business University, Toronto, ON, Canada

9.1 Introduction to autoencoders


Autoencoders are mainly known as an effective model for unsupervised representation learning. Since the decoder can reproduce images from the latent variables in a deterministic manner, data can be generated by modifying the latent variable; autoencoders can therefore also be considered generative models. The encoder and decoder are the two components that jointly make up an autoencoder. The encoder processes the input and extracts features, and the decoder maps the extracted features back to an output within the input domain. The extracted features are coded by the hidden layers of the encoder to represent the input. This representation is task-dependent and influenced by the input data, the model architecture, and the training pipeline. The decoder may then utilize the encoded features to generate output data, or the features may be fed to different generative or discriminative models. These features are sufficient to represent the input data contextually.
According to [12], autoencoders were first introduced in [13]. The proposed model had a simple architecture with one hidden layer, and its goal was to reconstruct the input as output. Currently, autoencoders are constructed with billions of parameters and include different network architecture components.

9.1.1 Generative modeling


There are two primary categories of machine learning (ML) models: generative and discriminative. A generative model generates new output within a pre-defined domain; for instance, the decoder part of an autoencoder can be employed as a generator. In contrast, discriminative models are trained to distinguish groups of input data. The generative model discovers the joint distribution of input and output data, whereas the discriminative model discovers the conditional probability of the output given the input [3,14].

The generator model is designed to learn how to convert the input data, drawn from a distribution z, into a new distribution called P(x). In doing so, the generator aims to rebuild the data to match the target data distribution closely. The mapping of an input data distribution to a target data distribution using a generative model is depicted in Fig. 9.1. The generative model maps the data x_i to y_i, where x_i is a member of our input data and y_i is a member of P̂(x). To generate output data with the same distribution as the objective data, the model learns to characterize the input data distribution and capture the relationships between features so that the output can be generated. The model discovers the parameters that minimize the loss function, which is designed to bring the output distribution closer to the target distribution P(x).

FIGURE 9.1 Scheme of the generative model formulation.

The generative models learn the context of the input data to capture the data distri­
bution, and then transfer the encoded data to the target distribution. Autoencoders, as
well-known forms of generative models, are analyzed in detail later in this chapter. The ca­
pability of an autoencoder in data embedding will also be studied for representative tasks
in human language and computer vision.

9.2 Autoencoders for feature extraction


The autoencoder represents the input as a latent variable and then reconstructs an output in the input domain. The model compares the reconstructed output with the desired output and, under the loss function, the autoencoder is trained to represent the input as a feature vector. Each dimension of the feature vector can represent a distinct attribute of the input. A human cannot distinguish these characteristics at a glance, but a model can employ this feature vector as input and make accurate predictions. This advantage of encoding the input permits autoencoders to extract features from a variety of input forms, including images and text.
This section addresses the notation of deep generative encoding, followed by a review of learning algorithms in this domain. Finally, we study the use of deep learning (DL) models as feature extractors.

9.2.1 Latent variable


Autoencoders encode inputs in the latent space. Because the encoded features are not directly interpretable by humans, this space is referred to as latent. In this space, features are compressed, and the input's primary attributes are stored; these features are sufficient for recreating the input. In effect, semantic features are encoded in the latent space, and using these features, the decoder can reconstruct the input semantically. Fig. 9.2 depicts a trained encoder that accepts images as input.

FIGURE 9.2 Contextual image representation utilizing a pre-trained encoder and presenting features in two-dimensional space.

The autoencoder in Fig. 9.2 is trained for an encoding-decoding application. During training, the network learns to represent the input semantically. In this instance, all inputs are animal species, and the network learns to represent them as disjoint clusters. In this sample, the feature vector produced by the trained encoder is converted into 2D feature points to demonstrate the relation between samples with similar content.

9.2.2 Representation learning


ML algorithms can represent raw data in other forms. For autoencoders, the training data, the training process, and the network architecture are all important. The network learns to encode the inputs in the latent space, and the decoder transforms the feature vector into an output in the target domain. The target space could be an image or text, depending on the task the model is trained for. In the process of training such a model, the encoder side learns to represent the input as the features required for reconstruction, and this representation can be used in other downstream tasks. For example, in an image classification model, the encoder learns to represent the semantic features of an image: edge, line, and shape features are groups of features represented in the latent space. The training process affects which features the model extracts.

9.2.3 Feature learning approaches


In machine learning, there are multiple approaches to feature learning. They fall into three categories:
• Rule-based models.
• Traditional machine learning.
• Representation learning.
Rule-based models depend on specified attributes. In conventional techniques, an expert determines the parameters and then develops a model to extract handcrafted characteristics. These models are still appropriate for some applications. However, in most real-world scenarios, rule-based models cannot accurately estimate the behavior of the data and fit the optimal function to the problem.
Traditional ML algorithms extract statistical features to address this issue. The statistical features are then utilized to fit a function. These characteristics are more resilient than rule-based techniques but are still domain-specific.
Later, more complicated ML models introduced the concept of representation learning. Such models, followed by DL models, extract features from the domain of the input data. During training, the model learns to extract characteristics from raw data while, for example, a classification or regression model is simultaneously built on these features. Learning the features during classification/regression requires a huge amount of labeled data. In this case, unsupervised feature learning can be helpful, and autoencoders are appropriate for learning a feature space in an unsupervised manner.
Various architectures are known for implementing feature extraction with autoencoders. An autoencoder is composed of two key parts: an encoder and a decoder. The input data is encoded by the encoder and represented in the latent space. Subsequently, the features are mapped to the expected output, often a visual representation, by the decoder. These models aim to recreate the same or similar inputs for both language and visual tasks. The encoder encodes the context-related information in the latent space, and another model can utilize the context of these attributes to analyze or comprehend input images or text.

9.3 Types of autoencoders


In the previous section, the internal mechanism of autoencoders was discussed. In this section, we look at the different types of autoencoders, and later in this chapter we investigate the performance of these models. In the last few years, various sorts of autoencoders have been developed for different purposes; some of the popular models are reviewed in this section. In general, autoencoders learn to represent data, and each of the autoencoders mentioned in this section presents a mechanism to represent data. The representation mechanism is designed to handle specific issues and to capture some aspect of the data.
Denoising autoencoders (DAE) [17], sparse autoencoders (SAE) [19], contractive autoencoders (CAE) [16], variational autoencoders (VAE) [18], and vector quantized variational autoencoders (VQ-VAE) [1] are examples of such representative models. We discuss the architecture of each model and the usage of the models in particular tasks.

9.3.1 Denoising autoencoder (DAE)


The denoising autoencoder recovers original data that has been corrupted with noise [17]. During training, this network learns to represent the characteristics of the noise-free content in order to reproduce undistorted output. In this concept, the network is trained to suppress noise features in the latent space.
Fig. 9.3 depicts the DAE's general architecture. Input images are noisy, while output images are undistorted. The DAE, similar to the base AE, comprises two primary components, an encoder and a decoder, as shown in the figure. However, in this model, the noise distribution is learned for removal, and the distribution of the visual content is learned for reconstruction.

FIGURE 9.3 The structure of a denoising autoencoder (DAE). It comprises an encoder that compresses the input
data into latent space and a decoder that reconstructs the original input data.

As stated, the extracted features on the encoder side correspond to the features that reflect the content of the image. This capability of the DAE encoder is utilized in other relevant tasks. Image classification algorithms may have difficulty accurately classifying noisy images; however, the performance of these models can be enhanced by utilizing features retrieved with a DAE. This characteristic can also be extended to other tasks.
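A minimal denoising-autoencoder sketch is given below; it assumes x_train holds clean samples flattened to 784 dimensions and scaled to [0, 1] (an MNIST-like setting), adds Gaussian noise to the inputs, and uses the clean samples as reconstruction targets.

import numpy as np
import tensorflow as tf

# Corrupt the inputs with Gaussian noise; the clean data remain the targets
noisy_train = np.clip(x_train + 0.2 * np.random.randn(*x_train.shape), 0.0, 1.0)

inputs = tf.keras.Input(shape=(784,))
h = tf.keras.layers.Dense(128, activation='relu')(inputs)        # encoder
z = tf.keras.layers.Dense(32, activation='relu')(h)              # latent code
h_dec = tf.keras.layers.Dense(128, activation='relu')(z)         # decoder
outputs = tf.keras.layers.Dense(784, activation='sigmoid')(h_dec)

dae = tf.keras.Model(inputs, outputs)
dae.compile(optimizer='adam', loss='mse')
dae.fit(noisy_train, x_train, epochs=10, batch_size=128)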

9.3.2 Sparse autoencoder (SAE)


Sparsity is an attractive characteristic of an efficient feature representation [19]. This kind of autoencoder adds a penalty term to the loss function to induce a sparse representation. This internal representation is usually exploited for another task, such as classification. There are different ways to impose the sparsity constraint, such as L1 regularization of the activations or a KL-divergence penalty that pushes the expected average activation toward a pre-defined target distribution.
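As a small sketch of the first option, the latent layer below carries an L1 activity regularizer that pushes most activations toward zero; the input dimension of 784 is an illustrative assumption.

import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))
# The activity regularizer penalizes the magnitude of the latent activations,
# which encourages a sparse internal representation
code = tf.keras.layers.Dense(
    64, activation='relu',
    activity_regularizer=tf.keras.regularizers.l1(1e-5))(inputs)
outputs = tf.keras.layers.Dense(784, activation='sigmoid')(code)

sae = tf.keras.Model(inputs, outputs)
sae.compile(optimizer='adam', loss='mse')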

9.3.3 Contractive autoencoder (CAE)


In this type of autoencoder, the aim is to obtain robust features. In other words, it is expected that the extracted features are invariant or insensitive to small variations of the input [16]. It is proposed to penalize the sensitivity of the representation to the input: the Frobenius norm of the Jacobian of the encoder with respect to the input is used as the regularizer. It should be noted that robustness of the representation is also considered in other types of autoencoders, such as DAEs.
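A possible sketch of the contractive penalty using automatic differentiation is shown below; encoder and decoder are assumed to be small Keras models, and the squared Frobenius norm of the encoder Jacobian is added to a mean-squared reconstruction loss.

import tensorflow as tf

def contractive_loss(x, encoder, decoder, lam=1e-4):
    with tf.GradientTape() as tape:
        tape.watch(x)
        h = encoder(x)                      # latent representation, shape (batch, d)
    # Jacobian of the latent code with respect to the input, shape (batch, d, input_dim)
    jacobian = tape.batch_jacobian(h, x)
    frob = tf.reduce_sum(tf.square(jacobian), axis=[1, 2])
    x_hat = decoder(h)
    recon = tf.reduce_mean(tf.square(x - x_hat), axis=-1)
    return tf.reduce_mean(recon + lam * frob)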

9.3.4 Variational autoencoder (VAE)


A variational autoencoder (VAE) is an autoencoder in which a fixed encoding is not learned [18]. Instead, a probability distribution over the latent variable is learned for each data point. This leads to a continuous internal representation: similar latent variables are decoded to related outputs. This semantic latent space can be exploited to generate new data by sampling from the learned distribution. In this case, a regularization term is added to the loss function to ensure the encoded data is not a fixed point: a KL-divergence term encourages the internally represented data to follow a normal distribution (or any predetermined distribution).

FIGURE 9.4 The variational autoencoder architecture. The model transforms input data into a probabilistic distri­
bution within the latent space, after which the decoder regenerates the original input data using sampled latent
points.

The loss function is given by

L(\theta, \phi; x) = -\mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] + \mathrm{KL}[q_\phi(z|x) \| p(z)], (9.1)

where \theta and \phi are the parameters of the decoder and encoder, x is the input data, z is the latent variable, p_\theta(x|z) is the likelihood function, q_\phi(z|x) is the approximate posterior distribution over the latent variables, and p(z) is the prior distribution. After sampling from the latent distribution, the decoder receives the randomly sampled latent point and tries to reconstruct the input. This method is also illustrated in Fig. 9.4.
Since the sampling operation is not differentiable, an alternative method for generating samples, the reparameterization trick, has been proposed. The reparameterization trick introduces a new variable \epsilon that is sampled from a fixed distribution, typically a standard Gaussian, and then transformed into a sample by applying a deterministic function of the encoder outputs (mean and standard deviation):

z = \mu + \sigma \cdot \epsilon. (9.2)
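A minimal TensorFlow sketch of the reparameterization trick (9.2) and of the loss (9.1) is given below; mu and log_var are assumed to be the encoder outputs for a batch, and a Gaussian likelihood is assumed so that the reconstruction term reduces to a squared error.

import tensorflow as tf

def sample_latent(mu, log_var):
    eps = tf.random.normal(tf.shape(mu))        # epsilon ~ N(0, I)
    return mu + tf.exp(0.5 * log_var) * eps     # z = mu + sigma * epsilon

def vae_loss(x, x_reconstructed, mu, log_var):
    # Reconstruction term (negative Gaussian log-likelihood up to constants)
    recon = tf.reduce_sum(tf.square(x - x_reconstructed), axis=-1)
    # KL divergence between q(z|x) = N(mu, sigma^2) and the prior N(0, I)
    kl = -0.5 * tf.reduce_sum(1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1)
    return tf.reduce_mean(recon + kl)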

9.4 Autoencoder and learned features applications


Learned characteristics are applicable to a variety of machine and human tasks. This book examines two primary application domains: language models and vision models. Machine translation models are examples of representation learning in language models. In machine translation models, the source-language input sequence is converted to the target-language output sequence; the model maps the encoded sequence in the latent space to the target sequence. Another language task is text generation, in which the model generates a new sequence in the same language as the input sequence, with semantic and contextual relationships between the two sentences. Other types of applications come from vision tasks. One of the fundamental tasks is image reconstruction, which can be divided into subtasks such as image compression, image denoising, and image superresolution. In image compression, the purpose is to reconstruct the exact input image from the encoded input image. The encoded input image retains the input image's features but requires fewer bits; in other words, the input image is compressed by the model's representation. Image denoising and superresolution are similar tasks that try to reconstruct a rectified version of the input image: during training, they learn to represent the critical information of the input image that enables the decoder to generate clean or superresolved images.
Training models for the introduced language or vision tasks enables them to represent the input sentence or image semantically and contextually. This representation can further be employed in different models. For instance, a model trained on an image compression dataset can be used for a classification model: the trained encoder is used as a feature extractor, and the extracted features are fed to the classification model, which is trained with these features as input and classification labels as targets. The same scenario may be employed in object detection or regression models. In the rest of this section, the introduced concepts are reviewed and the details of implementing and training the models are studied.

9.4.1 Language encoding


Natural language processing (NLP) models process human text and generate text, a class, or a number representing an attribute of the input text. NLP models behave similarly to models that process time series, so before processing the input text, these models preprocess and map the input text to a series of numbers. The model then processes the series to generate a new series or to classify the raw data. During data processing, the model learns to extract features from the input data before classifying or remapping the input series. Most NLP models process the series using recurrent neural networks, which capture temporal characteristics and are therefore appropriate for time-series processing. Fig. 9.5 shows a recurrent neural network-based model that maps a sequence of input elements to a new sequence of length n. Each hidden layer in this architecture learns to represent some features of the input data. With neural networks' ability to analyze input data from multiple perspectives, additional analyses can be performed on the input data.

FIGURE 9.5 A language model that converts input sequence token IDs into a different sequence of token IDs in
another domain by employing a deep recurrent neural network architecture.

Autoencoder-style natural language processing models map the input sequence to a new sequence. Modern machine translation models employ language autoencoders: a source-language input sequence is mapped to a target-language output sequence. For this purpose, language autoencoders that represent token relationships in the latent space perform adequately. In this scenario, the model learns to extract features that reveal the grammatical and semantic relationships between words or tokens. The language encoder represents the input sequence as an informative distribution in the latent space; this distribution captures the input features that contribute to the meaning and context of the fed sentence. The decoder then maps this distribution to the new space representing the target token sequence. The language autoencoder's ability to accurately represent the meaning of a token sequence makes it useful for other NLP tasks; for instance, this representation can be used in classification or regression tasks.
Fig. 9.5 illustrates a language model that maps an input token sequence to a target token sequence. The model includes a sentence tokenization module that preprocesses the input sentence before it goes through neural network processing. The input sentence is converted into a sequence of tokens, where a token is the shortest informative portion of a phrase. Each specific language model includes a vocabulary that keeps the language units. In accordance with the model vocabulary, document terms are broken into smaller units and compared to the units in the vocabulary; the matched tokens are replaced with the corresponding vocabulary unit IDs. If no matching unit is found, that portion of the sentence is marked as unknown. As tokenization alone is not informative and does not reveal the relationship between terms, the output of tokenization should not be considered a meaningful representation.
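As a small illustration of this step, the sketch below tokenizes sentences with a fixed-size vocabulary and maps out-of-vocabulary words to an unknown token; train_sentences is an assumed list of raw training sentences.

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=10000, oov_token='<unk>')
tokenizer.fit_on_texts(train_sentences)            # builds the vocabulary
# Words outside the vocabulary are replaced by the ID of '<unk>'
token_ids = tokenizer.texts_to_sequences(["a boy is playing guitar"])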
To process the dependency between tokens, models such as deep neural networks are applied to the sequence of tokens. Neural networks can encapsulate nonlinear dynamics [25,26,28,29] and interpolate functions that convert the input into a space expressing the relationship among tokens [27]. Each token in the series corresponds to a number that the network can comprehend. The deep neural network shown in Fig. 9.6 comprises convolution layers, which extract the local dependency between tokens and low-level features, and recurrent layers, which extract the temporal dependency between tokens in the series. One or more recurrent gates make up a recurrent layer, and each gate can process an array of tokens. In the diagram shown, each layer consists of a single gate, and each layer's output serves as the input for the subsequent recurrent layer. Recurrent gates are designed to infer the relationship between inputs and extract characteristics that indicate the relationship between tokens. The hidden layers of this model produce coded features in the latent space that represent the semantic or grammatical relations between tokens. This representation may be used by any model that makes decisions based on the concept of the input document. In this model, the encoding is used by a decoder to generate output sequences in another language that have a semantic correlation with the input sequences.

FIGURE 9.6 The architecture of a language model. The input word sequences are converted into a sequence of token IDs, which the model then processes to generate the desired result.
As described above, a common application of language autoencoders is machine translation. Datasets for training these models are called parallel sentence datasets and consist of pairs of sentences, one from a source language and one from a target language. The OPUS dataset [21], which was collected from the web, and the WMT datasets, which accompany the challenges of the same name [22,23], are examples of parallel datasets for text translation. Fig. 9.7 shows several samples from a translation dataset; the samples are from English to Hindi and English to Persian. A network trained on such a corpus learns the relations between words and learns a representation that maps the source sentence in English to the target sentence. This model has its own encoding; training the same architecture on other types of datasets produces different representations in the latent space. Sentence pairs in translation datasets share the same meaning: for instance, in the dataset in Fig. 9.7 the words and grammar of paired sentences differ, but the meanings are similar. As a result, after training, the model learns to represent semantic features of the source language that help reconstruct the sequence of tokens in the target language.

FIGURE 9.7 Parallel sentence dataset for training a machine translation model. The dataset consists of sentence
pairs, each serving as a translation of the other.

The following snippet of code implements a simple text autoencoder. In this model, a long short-term memory (LSTM) network is used to infer temporal features in the text sequence.

1 # Sentence Preprocessing
2 tokenized_seq = tokenizer.texts_to_sequences(sentences)
3 padded_tokenized_seq = pad_sequences(tokenized_seq)
4
5 # Encoder Model
6 embd = Embedding(num_words, 128)  # 128-dimensional embeddings (illustrative size)
7
8 input = Input(in_shape)
9 x = embd(input)
10 lstm_rnn = LSTM(256)(x)
11 enc_model = Model(inputs=input, outputs=lstm_rnn)
12 enc_output = enc_model(input)
13
14 # Decoder Model (the encoder vector is repeated seq_len times, the padded length)
15 dec_lstm_rnn = LSTM(128, return_sequences=True)(RepeatVector(seq_len)(enc_output))
16 dec_output = Dense(num_words, activation='softmax')(dec_lstm_rnn)
17
18 # AutoEnc Model
19 mnt_model = Model(input, dec_output)
20 mnt_model.compile(optimizer, loss_fn)
21 mnt_model.fit(inputs_seq, outputs_seq)

The implemented algorithm consists of four main parts. First, input sequences are tokenized, and the token arrays are padded to the same length. In the next part, the model architecture is implemented. An embedding layer is applied to the input at line 9. In the tokenization step, the sequence of tokens is transformed into positive integers; before further processing, these numbers should be changed into dense vectors to allow more accurate predictions. This is exactly the role of the embedding layer: transforming positive integers into dense vectors. Then, the vectors are fed to recurrent layers to extract temporal features. This simple encoder architecture only extracts temporal features to represent the input sentence. The extracted features are decoded to reconstruct a new sequence of tokens in the new language domain. The model compares the output of the decoder with the target output and updates the weights according to the result of the loss function.
In the training process, the encoder learns to represent the input as features that allow the decoder to reconstruct correct outputs. The encoder tries to encode features belonging to the input context, features that relate the input to the output. In the context of machine translation, mostly grammatical and semantic features create this relationship. In traditional human translation, the concept of a sentence is first extracted and then expressed with the same concepts in the target language to generate the translation. Neural network training follows the same procedure: the network learns to reconstruct the output sentence by considering the meaning of the input sentence. In this scenario, the network tries to encode something about meaning in the latent space. A latent space with features about the input context is rich enough to be used in another task that requires contextualized features from the input text.
To train the neural translation model, a parallel sentence dataset like the one presented in Fig. 9.7 is used. The dataset consists of English texts and their human translations into Spanish. The dataset is split into two sets: English sentences as inputs and the corresponding Spanish sentences as labels. Both corpora are preprocessed and tokenized in the first step, and all sentences in English and Spanish are padded to the same size. If the length of a token sequence exceeds the threshold, the rest of the sequence is omitted; if the sequence length is less than the threshold, the remainder is padded with zeros. In the training stage, the model is trained to transform input sentences in English into output sentences in Spanish. The model was trained for 10 epochs with the Adam optimizer.
In the previous scenario, the network learns temporal features, but local features can also be learned with convolutional layers. The following model processes the series using one-dimensional convolution layers embedded in both the encoder and decoder. Before passing the data to the recurrent layer, local (spatial) features are processed, and then temporal features are extracted. The order could also be reversed; in either case, two different types of features are extracted. The mentioned model with convolutional layers can be implemented as follows.

1 ...
2 # Encoder Model
3 ...
4
5 input = Input(in_shape)
6 x = embd(input)
7 conv_1d = Conv1D(128, kernel_size=3, padding='same')(x)  # filter count is illustrative
8 lstm_rnn = LSTM(256)(conv_1d)
9 ...
10
11 # Decoder Model
12 dec_lstm_rnn = LSTM(128, return_sequences=True)(RepeatVector(seq_len)(enc_output))
13 dec_conv_1d = Conv1D(128, kernel_size=3, padding='same')(dec_lstm_rnn)
14 dec_output = Dense(num_words)(dec_conv_1d)
15
16 ...

In the previous model architecture, the model learns contextual language relations to map the source input sequence to the target output sequence. The encoder within the trained model may be utilized for additional language-related tasks, since it acquires the ability to encode the contextual characteristics of the input. This functionality, for instance, might be utilized to make contextual comparisons between different texts. Fig. 9.8 presents the concept of a text similarity model, which assesses the similarity between two text sequences semantically. The model assesses sentences in the same language, but it could be generalized to multiple languages.
Classic models compare lexical features, such as common words or characters, between two sentences. However, for words with multiple meanings, models based on lexical features make errors in the comparison phase. Modern methods, such as language models based on deep neural networks, comprehend the context of language; in this case, the model produces different representations for the same words in different sentences. To this end, deep-learning-based methods are commonly used for text similarity.
The trained encoder of the text translation model, for instance, can be used for this purpose. The encoder encodes the input sentence in the latent space as a feature vector, and the feature vectors of two sentences can be compared with similarity measures such as cosine similarity, which measures the cosine of the angle between two feature vectors. In the equation below, the similarity between the feature representations of the source and target sentences is calculated:

\text{cosine-similarity} = \frac{R_s \cdot R_t}{\|R_s\| \, \|R_t\|}. (9.3)
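A short sketch of this comparison is given below; it assumes the tokenizer, pad_sequences, padded_tokenized_seq, and enc_model objects defined in the translation listing above.

import numpy as np

def sentence_similarity(sent_a, sent_b):
    # Tokenize and pad both sentences to the length used during training
    seqs = pad_sequences(tokenizer.texts_to_sequences([sent_a, sent_b]),
                         maxlen=padded_tokenized_seq.shape[1])
    r_s, r_t = enc_model.predict(seqs)
    return float(np.dot(r_s, r_t) / (np.linalg.norm(r_s) * np.linalg.norm(r_t)))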

On the other hand, the encoder can be trained with a semantic textual similarity (STS) dataset. An STS dataset includes sentence pairs with their corresponding semantic similarity scores (Table 9.1). These sentences are scored by human subjects selected for a standard experiment; commonly, more than twenty subjects score each pair. The mean of the subjects' scores is then computed and reported as the pair's semantic similarity score. Another method for generating such datasets is crowdsourcing, where the experiment is broadcast to thousands of subjects all over the world who are requested to score the dataset. Because so many subjects participate in the experiment, the scores are considered reliable. In the first case, the organizers verify that the scores are accurately submitted; in the second, the large number of subjects supports the experiment's accuracy. A model that accurately fits such a dataset can therefore be trusted.

Table 9.1 Semantic text similarity dataset.

Sentence 1                                Sentence 2                           Similarity score
``A man is cutting up a potato.''         ``A man is cutting up carrots.''     2.375
``A kid is playing guitar.''              ``A boy is playing a guitar.''       3.8
``A boy is playing guitar.''              ``A man is playing a guitar.''       3.2
``A man is playing guitar.''              ``A boy is playing a guitar.''       3.2
``A little boy is playing a keyboard.''   ``A boy is playing key board.''      4.4

To obtain a more accurate representation for semantic textual similarity, different types of models have been developed. A common algorithm is based on a language model encoder that is trained on a huge text dataset; the encoder part is then reused to train the final model. As shown in Fig. 9.8, the encoder is used twice for each sample pair: each sentence in the pair is encoded, and the distance between the two feature vectors is computed with a similarity function. The model parameters are then updated to minimize the distance between the output score and the actual score.

FIGURE 9.8 The framework of a semantic textual similarity algorithm utilizing a language model. To semantically
compare two sentences, both are encoded using a language model and their similarity is assessed using a similarity
criterion.

Different frameworks have been proposed to implement STS algorithms. Some of them are based on traditional methods like Gensim,1 LASER [4], and scikit-learn.2 In this chapter sbert3 [2] is employed to implement an STS model. This framework implements the essential methods required for monolingual or multilingual STS models. The most famous pre-trained language models, based on BERT [7], XLM-RoBERTa [11], and DistilBERT [8], are also provided to be used in other language tasks or to be fine-tuned.

1 https://radimrehurek.com/gensim/.
2 https://scikit-learn.org/stable/.
This library extends the idea of sentence embedding to provide language models for STS. The majority of earlier language encoders, including word2vec and fastText [9,10], encode individual words. In models like fastText, words are represented context-free; this implies that a word's representation remains constant across all sentences and document texts, and the same holds for other static embeddings such as GloVe. Modern models instead encode words according to the sentence context, producing different representations for the same word in different sentences. The model in SentenceBERT [6] encodes words in the context of the sentence and also introduces a pooling layer that aggregates all word embeddings in a sentence to represent the semantic meaning of the sentence.
The next sample code describes how to implement and train a basic STS model using this library. Both ``SentenceTransformer'' and ``models'' are the main classes used in this library. The backbone is loaded from a pre-trained model in lines 4 and 5. Then, a pooling layer is defined in line 7 to compute the sentence representation at the output. The rest of the implementation defines the whole model and prepares the training dataset. For the loss function, cosine similarity is selected as a common method for measuring the similarity between two vectors. In the last lines of the snippet, the model is trained on the training dataset for ten epochs. Models trained from scratch should be trained for at least 100 to 300 epochs; however, models that start from pre-trained weights, like the one used here, converge in a few epochs, so 10 epochs are enough to fit the model on the training samples. In this scenario, the model was trained on data in which all sentence pairs are from one language and reached an accuracy of about 85 percent.

3 https://www.sbert.net/.

1 from sentence_transformers import SentenceTransformer, models, losses
2 from torch.utils.data import DataLoader
3
4 sent_model = models.Transformer('pre_trained_model',
5                                 max_seq_length=256)
6
7 pooling_model = models.Pooling(
8     sent_model.get_word_embedding_dimension())
9
10 model = SentenceTransformer(
11     modules=[sent_model, pooling_model])
12
13 train_dataloader = DataLoader(
14     train_examples, shuffle=True, batch_size=16)
15
16 train_loss = losses.CosineSimilarityLoss(model)
17
18 model.fit(train_objectives=[(train_dataloader, train_loss)],
19           epochs=10, warmup_steps=100)

The presented model can be considered one of the language encoding applications. Most STS models are bound to one language (monolingual models) because most datasets are generated for a single language. Recently, some multilingual STS datasets have been created, but only some of the widely used languages are covered. To deal with training multilingual models without a multilingual STS dataset, [6] introduces an algorithm for training multilingual STS models using monolingual models. In this method, a state-of-the-art STS model is used to train a multilingual model.
To train a model for multilingual semantic textual similarity, the setup is split into a teacher and a student model. The teacher model is a model trained on an STS dataset until it has fully converged. The student model is a language encoding model that is not trained on STS data but on a multilingual task such as text generation or text classification.
The whole pipeline is shown in Fig. 9.9. The top language model is the teacher model and the bottom one is the student model. Input sequence 1 is in the domain language on which the teacher model was trained. For instance, in this example, we want to train a model that produces Persian and English text representations for semantic textual similarity. The teacher model is trained on an English corpus, and the dataset used to train the student model is a parallel sentence dataset built for translation tasks. In each training iteration, a Persian sentence and its corresponding English sentence are fed to the student model. The reference representation is produced by the teacher model from the English sentence. The loss criterion pushes both student representations, for the Persian and the English sentence, toward the reference representation from the teacher model. Either MSE or cosine similarity (9.3) can serve as the loss function. The student model weights are then updated according to the loss value. Once the model has converged, the student representation can be used for both Persian and English semantic textual similarity checks. The student model is thus able to check monolingual similarity in Persian or English as well as cross-language similarity between Persian and English.

FIGURE 9.9 The training pipeline for a semantic textual similarity model with knowledge distillation as introduced
in [6]. The student model acquires semantic representations of the target language utilizing source language rep­
resentations produced by the teacher model.
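
As a minimal sketch of the objective just described (this is an added illustration, not the sbert implementation; `teacher_embed` and `student_embed` are hypothetical functions assumed to return sentence embeddings as PyTorch tensors), the distillation loss can be written as follows.

import torch.nn.functional as F

def distillation_loss(teacher_embed, student_embed, english_sents, persian_sents):
    # Teacher reference embeddings of the English sentences (kept fixed)
    target = teacher_embed(english_sents).detach()
    # The student should match the reference for both languages
    loss_en = F.mse_loss(student_embed(english_sents), target)
    loss_fa = F.mse_loss(student_embed(persian_sents), target)
    return loss_en + loss_fa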

In order to execute the described workflow for multilingual semantic textual similarity, we again use the SBERT library introduced earlier and its training utilities. The next code snippet demonstrates the procedure. In line 2 the teacher model is loaded, and in lines 5 to 12 the student model is constructed. The teacher model is a state-of-the-art model for the STS task on English text. The student model is also a large language encoding model, but it was trained on another language task instead of STS. The training data loaded in line 14 is translation data consisting of parallel Persian–English texts. The loss function, defined in line 17, is again cosine similarity. In the last part, the student model is fit on the training data with the given loss function and optimizer.
The default learning rate and other hyperparameters are set in the fit function and, as for the previous monolingual model, the model converged within 10 epochs. To evaluate the model, a test set from an STS dataset is required. However, as previously mentioned, the primary challenge is the lack of STS datasets for many languages, such as Persian. Instead, the model can be evaluated on translation tasks, and if it reaches a specific accuracy, it can be considered effective [6].

1 ...
2 teacher_model = SentenceTransformer(
3     teacher_model_name)
4
5 sent_model = models.Transformer(
6     'pre_trained_model', max_seq_length=256)
7
8 pooling_model = models.Pooling(
9     sent_model.get_word_embedding_dimension())
10
11 student_model = SentenceTransformer(
12     modules=[sent_model, pooling_model])
13
14 train_dataloader = DataLoader(
15     train_examples, shuffle=True, batch_size=16)
16
17 train_loss = losses.CosineSimilarityLoss(student_model)
18
19 student_model.fit(train_objectives=[(train_dataloader, train_loss)],
20     evaluator=evaluation.SequentialEvaluator(evaluators,
21         main_score_function=lambda scores: np.mean(scores)),
22     epochs=num_epochs,
23     warmup_steps=num_warmup_steps,
24     evaluation_steps=num_evaluation_steps,
25     output_path=output_path,
26     save_best_model=True,
27     optimizer_params={'lr': 2e-5, 'eps': 1e-6,
28         'correct_bias': False}
29 )
30

9.4.2 Vision models


Vision models receive images as input and generate output images or numerical sequences that represent a property of the input image. The input image can be in color or grayscale. The model processes this image and returns the expected output. During this process, the hidden layers of the model represent the input in a latent space and map the encoded input to a new output, which may be a class label or a new image. An image model is drawn in Fig. 9.10. As illustrated, the neural network processes the input image and maps it to a new feature space. Then, a custom model, which may be a classifier or an image decoder, maps these features to the expected output space. This output space is a label for classification tasks or an image for image decoding tasks.
In this scenario, the features extracted by the neural network are crucial. They represent different aspects of the input with respect to the task the network was trained on. The objective function minimized or maximized during training largely determines the nature of these features, since they are optimized to minimize the proposed loss function. The remainder of this chapter will

FIGURE 9.10 A custom model training approach using features extracted from a pre-trained encoder. The encoder,
which is part of an autoencoder trained on different types of data, is subsequently employed to train the new
model.

demonstrate that the features extracted by the model's hidden layers can be applied to other vision tasks.
To implement a simple vision autoencoder, an encoder–decoder model based on fully connected layers is presented in the next code snippet. Each fully connected layer consists of neurons that transform the input vector linearly using a weight matrix. The architecture is implemented with the tensorflow-keras framework. The input image is flattened and fed to the encoder. The encoder extracts features and codes the input data in the latent space. Then, the decoder reconstructs the input image from the coded features. The loss function uses MSE to make the reconstructed image close to the input image. In the last part, the optimizer is defined and the model is fit on the training data.

# Encoder model
from tensorflow.keras.layers import Input, Flatten, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=in_shape)          # e.g., in_shape = (28, 28) for MNIST
x = Flatten()(inputs)                   # flatten the image into a vector of num_pixels
dense_1 = Dense(256, activation='relu')(x)
dense_2 = Dense(128, activation='relu')(dense_1)
dense_3 = Dense(64, activation='relu')(dense_2)
dense_4 = Dense(32, activation='relu')(dense_3)      # latent code
enc_model = Model(inputs=inputs, outputs=dense_4)
enc_output = enc_model(inputs)

# Decoder model
dec_dense_1 = Dense(32, activation='relu')(enc_output)
dec_dense_2 = Dense(64, activation='relu')(dec_dense_1)
dec_dense_3 = Dense(128, activation='relu')(dec_dense_2)
dec_dense_4 = Dense(256, activation='relu')(dec_dense_3)
dec_output = Dense(num_pixels, activation='sigmoid')(dec_dense_4)

# Autoencoder model: reconstruct the flattened input image
mnt_model = Model(inputs=inputs, outputs=dec_output)
mnt_model.compile(optimizer='adam', loss='mse')
mnt_model.fit(inputs_seq, outputs_seq)  # outputs_seq: the flattened input images

FIGURE 9.11 A number of samples from the MNIST dataset. Each sample represents a category of data contained
within this dataset.

This model was trained on the MNIST dataset [5]. The MNIST dataset of handwritten digits consists of 60 000 training images covering 10 digit classes; 10 000 samples are also provided for evaluating models [15]. Fig. 9.11 shows 10 samples from the MNIST dataset, one instance per digit class. The images are 28 × 28 pixels, so when an image is flattened by the input layer, the resulting vector has size 784. The label in this example is the input image itself, and the image is reconstructed as a one-dimensional vector of size 784. The model converged within an appropriate number of training epochs. It learns to represent the input as feature vectors on the encoder side and to reconstruct the input image on the decoder side. The encoder codes the image in the latent space; the coded feature vector captures the structure of the input image and can be used to describe it in the compressed domain.
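
As a minimal sketch of this idea (an added illustration, not part of the chapter's code), the trained encoder from the snippet above can be frozen and reused as a feature extractor for a digit classifier, in the spirit of Fig. 9.10; `enc_model` is assumed to be the encoder defined earlier.

from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

enc_model.trainable = False                          # freeze the encoder weights
latent = enc_model.output                            # 32-dimensional latent code
logits = Dense(10, activation='softmax')(latent)     # 10 MNIST digit classes
clf_model = Model(inputs=enc_model.input, outputs=logits)
clf_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
# clf_model.fit(train_images, train_labels, epochs=5)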

9.4.3 Convolutional autoencoder


A convolutional autoencoder (CAE) encodes the input and extracts features by stacking convolutional layers. It then up-samples the features using deconvolutional layers to reproduce the input image. Due to the computational efficiency and advantages of convolutional layers, such as capturing local spatial features, this architecture can be used efficiently to reproduce images [20]. As described in [20], a max pooling layer is also used to learn biologically plausible features. Fig. 9.12 shows a convolutional autoencoder in general.

FIGURE 9.12 The architecture of the convolutional autoencoder model. Both the encoder and decoder in this archi­
tecture are constructed using convolutional layers.

In vision problems, the input images can be very large. For example, for an image with a resolution of 128 × 128, a fully connected model would need 16 384 neurons in the first layer. To reduce the size of the network, convolutional layers are therefore employed in order to process features locally and minimize the total number of parameters in the model. A CAE can capture spatial features in images and code them in the latent space. Since the decoder on the other side tries to reproduce the input image, the most critical features that describe the image are coded in the latent space. These features describe the image content and make the image semantically distinguishable from other images. This semantic representation can also be used in other tasks, such as image similarity, image detection, and classification. As a result, this capability makes the CAE a common choice for feature extraction.
A simple CAE is implemented in the next code snippet. In this architecture, four convolution layers are stacked on the encoder side and four deconvolution layers on the decoder side. The encoder codes spatially related features in the latent space, and the decoder tries to reconstruct the exact image in the output layer.

# Encoder model
from tensorflow.keras.layers import Input, Conv2D, Conv2DTranspose
from tensorflow.keras.models import Model

inputs = Input(shape=in_shape)          # e.g., (H, W, C); H and W divisible by 16
conv_1 = Conv2D(256, 3, strides=2, padding='same', activation='relu')(inputs)
conv_2 = Conv2D(128, 3, strides=2, padding='same', activation='relu')(conv_1)
conv_3 = Conv2D(64, 3, strides=2, padding='same', activation='relu')(conv_2)
conv_4 = Conv2D(32, 3, strides=2, padding='same', activation='relu')(conv_3)
enc_model = Model(inputs=inputs, outputs=conv_4)
enc_output = enc_model(inputs)

# Decoder model: transposed convolutions up-sample the coded features
dec_conv_1 = Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu')(enc_output)
dec_conv_2 = Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(dec_conv_1)
dec_conv_3 = Conv2DTranspose(128, 3, strides=2, padding='same', activation='relu')(dec_conv_2)
dec_conv_4 = Conv2DTranspose(256, 3, strides=2, padding='same', activation='relu')(dec_conv_3)
dec_output = Conv2D(in_shape[-1], 3, padding='same', activation='sigmoid')(dec_conv_4)

# Autoencoder model
mnt_model = Model(inputs=inputs, outputs=dec_output)
mnt_model.compile(optimizer, loss_fn)   # optimizer and loss_fn are placeholders
mnt_model.fit(inputs_seq, outputs_seq)
22

Fig. 9.13 shows the architecture implemented in the CAE snippet above. The left-side blocks show the elements of the encoder and the right-side blocks those of the decoder. This architecture is presented in [24] for learned image compression tasks. In learned image compression models, the input image is compressed in the latent space. Both the quality of the reconstructed image on the decoder side and the size of the compressed representation on the encoder side matter, so during training the loss function considers both the bits per pixel required to encode the image and the quality of the reconstruction.
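
A minimal sketch of such a rate–distortion objective (an assumed form L = R + λ·D in the spirit of [24]; `bits_per_pixel` and the trade-off weight `lmbda` are placeholders, not taken from the chapter's code) could look as follows.

import tensorflow as tf

def rate_distortion_loss(original, reconstructed, bits_per_pixel, lmbda=0.01):
    # Distortion: how far the reconstruction is from the input (MSE)
    distortion = tf.reduce_mean(tf.square(original - reconstructed))
    # Rate: estimated bits per pixel needed to encode the latent representation
    rate = tf.reduce_mean(bits_per_pixel)
    return rate + lmbda * distortion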
Different datasets can be used to train models for learned image compression. A dataset frequently used by many learned image compression models is the CLIC dataset.

FIGURE 9.13 Learned image compression model architecture. The model comprises a down-sampling section, a
quantization method, an arithmetic encoding/decoding part, and an up-sampling section. The purpose of this ap­
proach is to achieve maximum compression during encoding and to provide output during decoding that closely
resembles the input.

FIGURE 9.14 Challenge on learned image compression (CLIC) dataset test set samples.

The CLIC dataset was introduced in the Challenge on Learned Image Compression.4 The challenge is a CVPR workshop covering image compression, video compression, and perceptual metrics. Five versions of the dataset were generated for the five challenge events held since 2018. Each version consists of a train and a test set, and most learned image compression methods are trained on this data and report accuracy on it. Fig. 9.14 shows samples from the 2022 CLIC test set, which is employed in this chapter for training and for exploring the feature extraction benefits of learned image compression autoencoders. The CLIC 2022 dataset is split into a training set of about 1000 images and a test set of about 400 images. The images were collected from the Internet, specifically from the Unsplash5 website. The images differ in resolution, size, and aspect ratio. Since a trained model for learned image compression should work with images of any size and reconstruct them at the input size, the test set also contains images of different sizes and ratios. Most samples are natural images, and the challenge focuses on natural image compression methods.

4 http://compression.cc/.
5 https://unsplash.com/.

FIGURE 9.15 Images generated by the trained image compression model under various training scenarios. In each scenario, the model was trained for a different number of epochs.

The presented model was trained for 5, 50, 100, and 200 epochs. As shown in Fig. 9.15, the overall picture structure is already reconstructed after 5 epochs; the small details are then captured by the network after 50 to 100 epochs.

9.5 Conclusion
Learning data representations can be considered the major advantage of deep learning systems; feature learning is its most distinguishing aspect. However, end-to-end learning requires a huge amount of labeled data. Unsupervised representation learning is an effective way to obtain representative features. An autoencoder is a type of neural network trained in an unsupervised manner. In order to achieve a more effective internal representation, various kinds of autoencoders have been proposed, mainly obtained by adding an appropriate regularization term. In this chapter, well-known autoencoder models are reviewed, including denoising, sparse, contractive, variational, and vector quantized variational autoencoders. It should be noted that the autoencoder can capture the generation process of most real data and can therefore be used as an effective generative model. Finally, some case studies are investigated. The effective internal representation of an autoencoder makes it the de facto standard in the learning-based image compression algorithms explored in this chapter.

References
[1] A. Van Den Oord, O. Vinyals, et al., Neural discrete representation learning, Advances in Neural Infor­
mation Processing Systems 30 (2017).
[2] N. Reimers, I. Gurevych, Sentence-BERT: sentence embeddings using Siamese BERT-networks, arXiv
preprint, arXiv:1908.10084, 2019.
[3] Open AI, Generative models, https://openai.com/research/generative-models, 2023. (Accessed 16
April 2023).
[4] Holger Schwenk, Matthijs Douze, Learning joint multilingual sentence representations with neural
machine translation, arXiv preprint, arXiv:1704.04154, 2017.
[5] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition,
Proceedings of the IEEE 86 (1998) 2278--2324.
[6] N. Reimers, I. Gurevych, Making monolingual sentence embeddings multilingual using knowledge
distillation, arXiv preprint, arXiv:2004.09813, 2020.
[7] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR, arXiv:1810.04805, 2018.
[8] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, in: NeurIPS EMC^2 Workshop, 2019.
[9] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin, Advances in pre-training distributed word
representations, in: Proceedings of the International Conference on Language Resources and Evalu­
ation (LREC 2018), 2018.
[10] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space,
arXiv preprint, arXiv:1301.3781, 2013.
[11] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L.
Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, CoRR, arXiv:
1911.02116, 2019.
[12] Suraj Srinivas, R. Venkatesh Babu, Deep learning in neural networks: an overview, Computer Science
(2015).
[13] D.H. Ballard, Modular learning in neural networks, in: Proc. AAAI, 1987, pp. 279--284.
[14] Google Developer, Background: what is a generative model?, https://developers.google.com/
machine-learning/gan/generative/, 2023. (Accessed 16 April 2023).
[15] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition,
Proceedings of the IEEE 86 (11) (1998) 2278--2324.
[16] S. Rifai, P. Vincent, X. Muller, X. Glorot, Y. Bengio, Contractive auto-encoders: explicit invariance
during feature extraction, in: Proceedings of the 28th International Conference on International Con­
ference on Machine Learning, June 2011, pp. 833--840.
[17] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P. Manzagol, L. Bottou, Stacked denoising autoencoders:
learning useful representations in a deep network with a local denoising criterion, Journal of Machine
Learning Research 11 (2010).
[18] D. Kingma, M. Welling, Auto-encoding variational Bayes, in: 2nd International Conference on Learn­
ing Representations, ICLR 2014, Banff, AB, Canada, April 14--16, 2014, Conference Track Proceedings,
2014.
[19] D. Arpit, Y. Zhou, H. Ngo, V. Govindaraju, Why regularized auto-encoders learn sparse represen­
tation?, in: Proceedings of the 33rd International Conference on Machine Learning, vol. 48, 2016,
pp. 136--144, https://proceedings.mlr.press/v48/arpita16.html.

[20] J. Masci, U. Meier, D. Cireşan, J. Schmidhuber, Stacked convolutional auto-encoders for hierarchical feature extraction, in: Artificial Neural Networks and Machine Learning–ICANN 2011: 21st International Conference on Artificial Neural Networks, Espoo, Finland, June 14--17, 2011, Proceedings, Part I 21, Springer, Berlin, Heidelberg, 2011, pp. 52--59.
[21] J. Tiedemann, Parallel data, tools and interfaces in OPUS, LREC 2012 (2012) 2214--2218.
[22] O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. Yepes, P. Koehn, V. Lo­
gacheva, C. Monz, Others findings of the 2016 conference on machine translation, in: Proceedings of
the First Conference on Machine Translation: Volume 2, Shared Task Papers, 2016, pp. 131--198.
[23] O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post, H. Saint­
Amand, Others findings of the 2014 workshop on statistical machine translation, in: Proceedings of
the Ninth Workshop on Statistical Machine Translation, 2014, pp. 12--58.
[24] J. Ballé, V. Laparra, E. Simoncelli, End-to-end optimized image compression, arXiv preprint, arXiv:
1611.01704, 2016.
[25] M. Omidi, B. Arab, A. Rasanan, J. Rad, K. Parand, Learning nonlinear dynamics with behavior or­
dinary/partial/system of the differential equations: looking through the lens of orthogonal neural
networks, Engineering with Computers 38 (2022) 1635--1654.
[26] A. Hadian Rasanan, N. Bajalan, K. Parand, J. Rad, Simulation of nonlinear fractional dynamics arising
in the modeling of cognitive decision making using a new fractional neural network, Mathematical
Methods in the Applied Sciences 43 (2020) 1437--1466.
[27] A. Hadian-Rasanan, D. Rahmati, S. Gorgin, K. Parand, A single layer fractional orthogonal neural net­
work for solving various types of Lane–Emden equation, New Astronomy 75 (2020) 101307.
[28] A.H. Hadian-Rasanan, J.A. Rad, D.K. Sewell, Are there jumps in evidence accumulation, and what, if anything, do they reflect psychologically? An analysis of Lévy-Flights models of decision-making, Psychonomic Bulletin & Review 31 (2024) 32--48.
[29] A. Ghaderi-Kangavari, J.A. Rad, M.D. Nunez, A general integrative neurocognitive modeling frame­
work to jointly describe EEG and decision-making on single trials, Computational Brain & Behavior 6
(2023) 317--376.
10
Dimensionality reduction in deep
learning through group actions
Ebrahim Ardeshir-Larijani a,b, Mohammad Saeed Arvenaghi a,
Akbar Dehghan Nezhad a, and Mohammad Sabokrou c
a Iran University of Science and Technology (IUST), Tehran, Iran b School of Computing, Institute for Research

in Fundamental Sciences (IPM), Tehran, Iran c Okinawa Institute of Science and Technology, Okinawa, Japan

10.1 Introduction
In recent years, we have witnessed the blossoming of the technological ability to process very large amounts of data and, in particular, to apply complex machine learning (ML) to a wide spectrum of applications. Whether these applications come from the life sciences, such as protein folding and drug discovery, or from the quest to understand the fundamental aspects of nature through data from high-energy physics experiments or astronomical observations, there is one unifying goal: processing high-dimensional data.
Among the successful techniques for applying machine learning to high-volume, high-dimensional data, deep learning (DL) is nowadays ubiquitous in AI applications. Across the numerous applications of DL and the large body of research on it, DL's success stems from two concrete attributes:
1. A layered structure that facilitates learning features from data.
2. An optimization algorithm, typically gradient descent, carried out by propagating gradients among the layers.
Despite this simplistic view of DL's structure, the task of learning a general function over high-dimensional data turns out to be complicated. Nevertheless, in real-world applications, we often deal with high-dimensional data that has an underlying structure. One approach, which attracts many researchers, is to expose the geometrical structure of the data in such DL applications. This promising field is called Geometric Deep Learning [1] and studies data representations in different geometrical structures such as groups and manifolds. Moreover, it aims at unifying different DL architectures such as CNNs, RNNs, and GNNs [1] through common geometrical properties such as symmetry and invariance. Note that this unification itself is inspired by mathematics, in particular the so-called Erlangen program [16].
A central question for GDL is under what conditions, and how, the geometric structures described above are preserved. In many cases, such as convolutional neural networks (CNNs), capturing properties such as locality and translational symmetry is prevalent in many ML tasks. This is because convolutions are among the operations that are equivariant with respect to transformations such as translations.
Many studies have tried to generalize these concepts, borrowing from mathematical subjects such as group representation theory and differential geometry. In the seminal work of Kondor and Trivedi (ICML 2018) [3], it was proved that a neural network is group equivariant if and only if its structure is convolutional. Furthermore, Cohen et al. (NeurIPS 2019) [2] extended these concepts to homogeneous spaces (see Definition 7).
In this chapter, we first review the background needed for understanding the following sections. Then, we explain the main concepts of groups, group actions, and equivariant maps in Section 10.3. Next, we introduce group equivariant networks, in particular the G-CNN and its generalizations, in Section 10.4. A guide on how to implement some of these concepts in Python is given in Section 10.5. Finally, Section 10.6 concludes the chapter with several directions for future work.

10.2 Geometric context of deep learning


The central question that this chapter tries to answer is: ``Why do we need to understand the geometry of data in ML, particularly DL, and how can it be exploited for efficient learning (i.e., by reducing the complexity of data samples and models)?''. For example, suppose we want an interpolation of a function f. In low dimensions (e.g., d = 2), the function has the form f : X → R and we need a guarantee that the estimation error can be controlled. Thus we assume f is 1-Lipschitz (locally smooth): a perturbation of the input, measured by a norm ‖·‖, leads to a controlled output deviation:

$$\forall x, x' \in \mathcal{X} : \quad \|f(x) - f(x')\| \le \|x - x'\|.$$

It has been shown that a two-layer perceptron can approximate a two-dimensional 1-Lipschitz function with arbitrary error ε [8]. Most importantly, a sample size of O(ε^{-2}) is needed for this task. However, for arbitrary d, approximating a Lipschitz function up to ε leads to exponential growth of the sample complexity, namely of order ε^{-d} [8]. Fig. 10.1 [1] depicts this phenomenon graphically.
Although the problem described above is a challenge when we are dealing with high­
dimensional data, many applications in the real world involve data with a geometric con­
text. In modern DNN architectures, such as CNNs, the structure of convolution r­flects
such a geometric nature, which is learned by the network. Suppose we have the convolu­
tion of two functions as follows:

f ∗ g(x) = f (t)k(x − t)dt.
t∈R

FIGURE 10.1 Sample complexity of Lipschitz functions approximation.

Here, the geometrical context is symmetry, i.e., a transformation that ensures invariance. An instance of such a symmetry is the shift operator, Sh_y(f)(x) = f(x − y). For example, the classification of images is invariant under shifts. Moreover, convolution has the property Sh_y(f ∗ k) = Sh_y(f) ∗ k for a filter k. This means that no matter where the filter k is applied, the classification (or any response to detecting a pattern that triggers k) is invariant under shift. The latter property is called shift equivariance. In practice, the filter k is learned by the CNN, and together with backpropagation, this makes CNNs much more successful than fully connected networks [8]. Fig. 10.2 illustrates shift equivariance applied to an image.

FIGURE 10.2 Shift equivariance.

In general, transformations such as shifts or rotations form a group, i.e., a symmetry group. This led researchers to the even more general concept of group equivariance, as mentioned in Section 10.1, where the notion of equivariance is characterized in terms of the actions of a group. For further details, see the next section.
We conclude this section by observing that the concept of symmetry is prevalent across many domains, as Hermann Weyl describes it: ``Symmetry is a vast subject and has significance in art and nature. Mathematics lies in symmetry's root, and it would be very hard to find a better one on which to demonstrate the working of the mathematical intellect.''

10.3 Group actions, invariant and equivariant maps


In this section, we first define the main concepts surrounding groups, and then we define group actions on spaces, in particular homogeneous spaces. Furthermore, we introduce two important applications of group actions, namely the construction of invariant and equivariant maps. These concepts will be used in the following sections. We include several examples to clarify them.

Definition 1 (Group). Let G be a non-empty set with a well-defined binary operation ∘ such that for each ordered pair g1, g2 ∈ G, g1 ∘ g2 is also in G. The pair (G, ∘) is a group if and only if:
(i) there exists an identity element e ∈ G such that e ∘ g = g = g ∘ e for all g ∈ G;
(ii) any element g in G has an inverse g^{-1} ∈ G such that g ∘ g^{-1} = g^{-1} ∘ g = e;
(iii) the binary operation ∘ is associative: a ∘ (b ∘ c) = (a ∘ b) ∘ c for all a, b, c ∈ G.

Using the composition of mappings in Rn as the binary operation ∘, one can prove that the symmetries of a subset S ⊂ Rn form a group, called the symmetry group of S. More formally, in a metric space X, a symmetry g ∈ G of a set S ⊂ X is an isometry (a distance-preserving transformation) that maps S to itself, g(S) = S. The transformation g keeps S invariant as a whole while permuting its parts. The symmetries of S form a mathematical group (G, ∘), closed under composition of transformations ∘, called the symmetry group of S.

Example 1. Symmetries of a subset S ⊂ Rn form a symmetry group GS of S. All the sym­


metries of Rn form the Euclidean group E.

Definition 2 (Orbit). An orbit of a point x ∈ Rn under group G, or the G-orbit of x, is G(x) =


{g(x)|g ∈ G}.

Definition 3 (Discrete group). A discrete group G is a subgroup of E such that for any
x ∈ Rn and any sphere Br = {y|y ∈ Rn , ||y|| ≤ r} there is only a finite number of points in the
G-orbit of x that are contained in Br .

Definition 4 (Stabilizer subgroup). For every x in X, we define the stabilizer subgroup of x (also called the isotropy group or little group) as the set of all elements in G that fix x: Gx = {g ∈ G | g(x) = x}.

Definition 5 (Point group). A point group G is a symmetry group that leaves a single point
x fixed. In other words, G = Gx , here Gx is the stabilizer subgroup on x.

Example 2. In crystallography, a space group is a group G of operations that leave the infinitely extended, regularly repeating pattern of a crystal unchanged. In R3 there are 230 such groups. In other words, such a group G as a whole leaves no point invariant, i.e., for every possible x ∈ Rn, the stabilizer subgroup Gx is either a proper subgroup of G, Gx ⊂ G, or empty.

Example 3 (Tiling). A plane tiling1 T is a countable family of closed sets T = {T1, T2, · · · } that cover the plane without gaps or overlaps. More explicitly, the union of the sets T1, T2, · · · , known as the tiles or motifs of T, is the whole plane, and the interiors of the sets Ti are pairwise disjoint, that is, int(Ti) ∩ int(Tj) = ∅ for i ≠ j, and ∪i Ti is the whole plane.

The type of symmetry operation considered here is the isometry, or congruence transformation. An isometry is any mapping of the n-dimensional Euclidean space En onto itself that leaves distances between points unchanged. Thus, if we denote the isometry by T : En → En, and P and Q are any two points, then the distance between P and Q is equal to the distance between their images T(P) and T(Q).
The commonly accepted notions of symmetry in two dimensions can be analyzed into four basic geometrical elements:
(i) Rotation of the motif about some center by some angle.
(ii) Reflection of a motif in the plane about a line moves its reflected image to where it would appear if you viewed it using a mirror placed on the line. Another way to make a reflection is to fold a piece of paper and trace the figure onto the other side of the fold.
(iii) Translation: translating a motif means moving it without rotating or reflecting it. We can describe a translation by stating how far it moves an object and in what direction.
(iv) Glide reflection: a glide reflection combines a reflection with a translation along the direction of the mirror line.
The symmetry group of the pattern is the set of all isometries that map the pattern
onto itself. Therefore the symmetry group of a set X = E2 is the set that consists of all the
symmetries of X = E2 .
A mature mathematical theory for periodic patterns has been known for over a century, namely the theory of crystallographic groups. These are groups composed of symmetries of periodic patterns in n-dimensional Euclidean space. An essential result of this theory is the answer to the first part of Hilbert's 18th problem: regardless of the dimension n, and despite an infinite number of possible instantiations of periodic patterns, the number of distinct symmetry groups for periodic patterns in any Euclidean space Rn is always finite! These groups are often referred to as crystallographic groups. For example, for two-dimensional monochrome patterns, there are 17 wallpaper groups covering the whole plane.

1 See the mathematical treatise ``Tilings and Patterns'' by Grünbaum and Shephard [15].
There are several examples of groups other than those mentioned earlier. Here, we list some groups that we may encounter later in this chapter:

Example 4.
• Translation group (Rd, +).
• Rotation group: the set of all rotations (e.g., in two dimensions, SO(2)) forms a group.
• Unitary group U(1) = {e^{iφ} | φ ∈ [0, 2π)}: the group of complex phases, which appears, for instance, as phase factors in quantum computing.
• General linear group GL(d) = {g ∈ R^{d×d} | det(g) ≠ 0}: this group is used for obtaining effective group representations, described later.

An important aspect of groups is their ability to act on other spaces. This is characterized by the group action, a crucial concept that shows how a group can transform a space while preserving its underlying structure.

Definition 6 (Group action). Let G be a group and X be a set. A left group action is a map:

$$\cdot : G \times X \longrightarrow X, \qquad (g, x) \longmapsto g \cdot x,$$

satisfying the following compatibility conditions with the group structure:
• associativity: for all g, h ∈ G and x ∈ X, (gh) · x = g · (h · x);
• identity: for all x ∈ X, e · x = x.

To picture a group action visually, consider the group SO(2) (see Example 4) acting on a two-dimensional scalar field, as illustrated in Fig. 10.3. In other words, group actions provide a means to transform a structure (such as a set of data points) that already exists.

FIGURE 10.3 SO(2) group action on a 2D scalar field.
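
To make Definition 6 concrete, the following small numerical check (an illustration added here, not taken from the chapter) verifies the compatibility conditions for SO(2) acting on points of the plane by rotation matrices.

import numpy as np

def rot(theta):
    # 2x2 rotation matrix: an element of SO(2)
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

x = np.array([1.0, 2.0])
g, h = rot(np.pi / 3), rot(np.pi / 7)

# associativity/compatibility: (g h) . x == g . (h . x)
assert np.allclose((g @ h) @ x, g @ (h @ x))
# identity: e . x == x
assert np.allclose(rot(0.0) @ x, x)
# the action also preserves the norm (see Example 6 below)
assert np.isclose(np.linalg.norm(g @ x), np.linalg.norm(x))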



The spaces on which the neural networks considered in this work (such as CNNs) act have the important property that a group action behaves transitively on them. In fact, the convolutional feature space can be characterized by:
1. The type of field, such as a vector field, tensor field, etc.
2. The homogeneous space (see Definition 7) over which the field is defined.

Definition 7 (Transitive group action). A group action on a space X is transitive if for all x, y ∈ X there exists g ∈ G such that y = g · x. In that case, X is called a homogeneous space.

In the following, we explain two properties of maps between spaces equipped with group actions. The first is invariance of a map under the action of a group on a set. The second is equivariance, where a map intertwines two group actions; the two notions are therefore closely related.

Definition 8 (Invariant map). Let ·X be a group action of G on a set X. We say a function f : X → Y is invariant if and only if the following is satisfied: for all g ∈ G and x ∈ X, f(g ·X x) = f(x).

Example 5 (Permutation invariant neural network with sensory data). In [4], permutation invariant neural networks whose inputs are sensory and which interact with the environment are studied. That work is inspired by the human brain's ability to adapt to sudden changes in the order of sensory data, whereas ML models normally need retraining in this case. As an example, for the game Cart-Pole Swing-Up, they design a network with input [x, ẋ, sin(θ), cos(θ), θ̇]; the agent observes these inputs and is rewarded whenever x gets closer to zero and cos(θ) closer to 1. The agent must cope with permuted input data, and thus the network behaves in a permutation invariant manner, as visualized in Fig. 10.4.

Example 6 (Rotation and norm). Consider SO(2), the group of rotations in 2D. Let | · | be the vector norm (playing the role of the function f in Definition 8); then we have

∀g ∈ SO(2), v ∈ R2 : |g · v| = |v|.

This means that no matter how a vector is rotated in 2D, its norm is invariant.

The importance of group invariant maps is that they offer a standard way of construct­
ing equivariant maps [5,6] that will be discussed next.

Definition 9 (Equivariant maps). Let ·X and ·Y be group actions of G on sets X and Y, respectively. We say a function f : X → Y is G-equivariant iff the two actions commute with f:

∀g ∈ G, x ∈ X : f(g ·X x) = g ·Y f(x).

In other words, the diagram in Fig. 10.5 is commutative.



FIGURE 10.4 Permutation invariant cart-pole swing-up.

FIGURE 10.5 Equivariant map.

Example 7 (Semantic segmentation). A classic instance of a problem that requires true


equivariance is the widely encountered task of semantic segmentation in computer vision.
To put it simply, the reason behind this is that the output of semantic segmentation is
a mask that assigns labels to individual pixels, and this mask should undergo the same
transformations as the input image.
To illustrate the concepts discussed, Gerken et al. [14] presented a semantic segmenta­
tion model as an example, see Fig. 10.6.

FIGURE 10.6 Semantic segmentation model [14]. The semantic segmentation model consists of several layers that process the input image. The first layer, called fin, takes the input image and maps it from the Z2 domain (pixel grid) to RGB values. The input image has a support region defined as [0, W] × [0, H], where W and H represent the width and height of the image, respectively. Following fin, the first convolutional layer is applied, resulting in a feature map f2, which maps the input image into an N2-dimensional space, where N2 is the number of filters used in the convolution. After a point-wise activation, the second convolution generates the feature map fout. Each point in the domain is associated with an Nout-dimensional vector, where Nout is the number of output classes.

Example 8 (Rotations around a projection). Consider rotations about a fixed axis in 3D; they form a copy of SO(2), the group of rotations in 2D. Rotations around the projection axis are compatible with the projection onto the plane orthogonal to that axis: the order in which these operations are performed does not affect the outcome. For example:

$$
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} x \\ y \\ z \end{pmatrix}
=
\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} x \\ y \\ z \end{pmatrix}.
$$
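
The identity above can be checked numerically; the following short snippet (an added illustration, not from the chapter) verifies that projecting after a 3D rotation about the z-axis equals applying the corresponding 2D rotation after projecting.

import numpy as np

theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])   # rotation about the z-axis
R2 = np.array([[np.cos(theta), -np.sin(theta)],
               [np.sin(theta),  np.cos(theta)]])        # rotation in the plane
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])                         # projection onto the xy-plane

v = np.array([1.0, 2.0, 3.0])
assert np.allclose(P @ Rz @ v, R2 @ P @ v)              # the two orders agree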

Example 9 (Translation equivariant convolution). Translation equivariant convolution plays a crucial role in fields including computer vision and natural language processing. It is particularly valuable in tasks such as image recognition and sentiment analysis, where spatial relationships and the sequential order of words are important. Translation equivariance ensures that when the input is shifted, the output is shifted in the same way rather than changing arbitrarily. See Fig. 10.7, which illustrates this concept.
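
The following short check (an added illustration using circular convolution, not part of the chapter) verifies this property numerically: shifting the input and then convolving gives the same result as convolving and then shifting the output.

import numpy as np

def circ_conv(a, b):
    # circular convolution via the FFT: (a * b)[n] = sum_m a[m] b[(n - m) mod N]
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

f = np.random.rand(32)      # signal
k = np.random.rand(32)      # filter
s = 3                       # shift amount

shift_then_conv = circ_conv(np.roll(f, s), k)
conv_then_shift = np.roll(circ_conv(f, k), s)
assert np.allclose(shift_then_conv, conv_then_shift)   # shift equivariance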

10.4 Equivariant neural networks


In this section, we discuss an important tool in geometric deep learning for reducing model and sample complexity by taking advantage of underlying geometrical structures. Furthermore, we study neural networks that act equivariantly, namely group equivariant convolutional neural networks (G-CNNs). Finally, we review two seminal results: the first [3] shows that the only operation preserving equivariance in a neural architecture is convolution, and the second [2] adopts a more general approach and characterizes G-CNNs in terms of their (symmetry) group, the underlying homogeneous space, and the feature space (field) type.

FIGURE 10.7 Translation equivariant convolution.

10.4.1 Group equivariant neural networks


Group equivariant neural networks have demonstrated their ability to decrease both sam­
ple and model complexity, particularly in difficult tasks that involve input transformations
like arbitrary rotations. They make use of principles from group representation theory,
non-commutative harmonic analysis, and differential geometry, which are not commonly
found in machine learning.
Kondor and Trivedi [3] have shown that a neural network is considered group equivari­
ant if and only if its architecture follows a convolutional structure. Therefore in this section,
our attention is directed toward studying the characteristics of group equivariant convo­
lutional neural networks (G-CNNs).
A typical deep neural network can be represented as a sequence of affine operations, denoted Wi, whose parameters are optimized, followed by nonlinearities, denoted σi. The output of the network, fout, can be expressed as:

$$f_{\text{out}} = W_n(\ldots \sigma_2(W_2(\sigma_1(W_1 f_{\text{in}}))) \ldots).$$

In convolutional neural networks, these operations are typically convolutions with an added bias term. The most commonly used nonlinearity is the pointwise rectified linear unit (ReLU), defined as σ(xi) = max(xi, 0).
Equivariant neural networks make use of equivariant operations and symmetries
present in the data in order to decrease both the complexity of the model and the number
of training samples needed.

Example 10 (Finite group CNNs). In the case of a finite group, convolution is expressed as follows:

$$(f \ast k)(g) = \frac{1}{|G|} \sum_{h \in G} f(h)\, k(h^{-1} g).$$

The operation mentioned, which achieves rotation equivariance, has been effectively uti­
lized on discrete subgroups of SO(3) (the group of rotations in 3D).
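
As an added toy implementation (not from the chapter) of the finite-group convolution above, consider the cyclic rotation group C4 = {0, 1, 2, 3} (rotations by multiples of 90°, composed by addition modulo 4), with f and k stored as dictionaries over the group elements; the snippet also checks the equivariance of the resulting operation.

import numpy as np

G = [0, 1, 2, 3]                      # C4; composition is addition mod 4
inv = lambda g: (-g) % 4

def group_conv(f, k, g):
    # (f * k)(g) = (1/|G|) sum_h f(h) k(h^{-1} g)
    return sum(f[h] * k[(inv(h) + g) % 4] for h in G) / len(G)

f = {g: np.random.rand() for g in G}
k = {g: np.random.rand() for g in G}

# Acting on f by the generator (a 90° rotation) shifts the output in the same way
f_rot = {g: f[(g - 1) % 4] for g in G}
lhs = [group_conv(f_rot, k, g) for g in G]
rhs = [group_conv(f, k, (g - 1) % 4) for g in G]
assert np.allclose(lhs, rhs)          # group equivariance of the convolution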

Example 11 (Spherical CNNs and cross-correlation). Spherical Convolutional Neural Net­


works (CNNs) have been developed as an extension of traditional CNNs to handle spheri­
cal data (Cohen et al. [9]; Esteves et al. [10]).
Cohen et al. proposed a technique for spherical CNNs where spherical functions, de­
noted as f, are transformed into functions on SO(3) through spherical cross-correlation
with a filter k:

$$(f \star k)(g) = \int_{x \in S^2} k(g^{-1} x)\, f(x)\, dx.$$

If k is a rotated version of f, the correlation reaches its maximum value when g represents the rotation that aligns k with f. It is important to note that f and k are functions defined on S2, while f ⋆ k is a function on SO(3).
On the other hand, Esteves et al. proposed a purely spherical convolutional network, where the inputs, filters, and feature maps are functions defined on the sphere S2. The main operation is a spherical convolution that uses the 3D rotation group:

$$(f \ast k)(x) = \int_{g \in SO(3)} f(g v)\, k(g^{-1} x)\, dg.$$

Here, v represents a fixed point on the sphere (the north pole).


In a recent paper, Esteves et al. [11], scaled up spherical CNNs to handle significantly
larger operations and feature resolutions compared to the previous work. These models
leverage generalized Fourier transforms to compute convolutions on the sphere, enabling
3D rotation equivariance instead of translation equivariance.

Example 12 (The Clebsch–Gordan networks). Kondor et al. [12] introduced an SO(3)­


equivariant neural network architecture designed for spherical data, which operates ex­
clusively in Fourier space. This approach overcomes a significant limitation of previous
models that required frequent switching between Fourier space and ``real'' space. This is
accomplished by adopting an unconventional approach where the Clebsch–Gordan de­
composition serves as the sole source of nonlinearity.

Example 13 (The 3D steerable CNNs). Weiler et al. [13] introduced 3D Steerable CNNs, which are a category of SE(3)-equivariant networks that encode data using different types of fields over R3. Note that the group E(3) differs from most of the groups encountered so far in that it is not commutative. The authors provided an extensive theoretical framework for 3D Steerable CNNs and demonstrated that convolutions using SO(3)-steerable filters offer a highly versatile approach for achieving equivariant mappings between fields. This finding establishes SE(3)-equivariant networks as a universally applicable class of architectures. The implementation of 3D Steerable CNNs involves making only minor adjustments to the code of a 3D CNN, and they can be easily converted back to a conventional 3D CNN after the training process. The results obtained from their experiments confirm that 3D Steerable CNNs exhibit equivariance, and they demonstrate outstanding accuracy and data efficiency in tasks such as amino acid propensity prediction and protein structure classification.
The above examples can be unified through the formal definition of group equivariant neural networks, as follows.
Definition 10 (Group equivariant neural networks [7]). Assume that G is a group, and X1 and X2 are two sets that possess G-actions:

$$T_g : X_1 \to X_1, \qquad T'_g : X_2 \to X_2.$$

Consider vector spaces V1 and V2, and let T and T′ denote the induced actions of the group G on LV1(X1) and LV2(X2), respectively. A map ψ : LV1(X1) → LV2(X2) (linear or nonlinear) is equivariant with respect to the action of G, or G-equivariant, if

$$\psi(T_g(f)) = T'_g(\psi(f)), \qquad \forall f \in L_{V_1}(X_1),$$

for any group element g ∈ G; see Fig. 10.8.

FIGURE 10.8 Commutative diagram of equivariance.

10.4.2 The general theory of group equivariant neural networks


CNNs have achieved remarkable success in the field of image recognition due to their ability to maintain translation equivariance. In recent times, there have been numerous efforts to extend this framework to diverse domains such as graphs and manifolds (the latter will be explained in this section). Here, we focus on two major initiatives regarding the generalization of equivariant networks. First, however, we need to define the following preliminary concepts:

Definition 11 (Coset space). In the context of a group G, if we have a subgroup H and an element g of G, we can define the left coset gH as the set gH = {gh | h ∈ H}. These left cosets form a partition of G and are collectively referred to as the left coset space G/H. Similarly, we can define the right cosets, and the collection of these right cosets is known as the right coset space H\G.

In accordance with Kondor and Trivedi [3] and Esteves [8], we limit the derivation to discrete compact groups, noting that the same principles apply to continuous compact groups by replacing summations with integrals. We introduce the group convolution between f : G → C and k : G → C as:

$$(f \ast k)(g) = \sum_{u \in G} f(g u^{-1})\, k(u).$$

To study functions on homogeneous spaces G/H of a group G (see Definitions 11 and 7), we introduce the projection operator ↓ and the lifting operator ↑ for two functions f1 : G → C and f2 : G/H → C. The operators are defined as follows:

$$(\downarrow f_1)(gH) = \frac{1}{|H|} \sum_{u \in gH} f_1(u), \qquad (\uparrow f_2)(g) = f_2(gH).$$
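
To make the projection and lifting operators concrete, the following toy computation (an added illustration, not from the chapter) works them out for G = Z4 with the subgroup H = {0, 2}.

G = [0, 1, 2, 3]                    # the cyclic group Z4 (addition mod 4)
H = [0, 2]                          # a subgroup of index 2

def coset(g):                       # the left coset gH, represented as a frozenset
    return frozenset((g + h) % 4 for h in H)

def project(f1):                    # (down f1)(gH) = (1/|H|) sum_{u in gH} f1(u)
    return {c: sum(f1[u] for u in c) / len(H) for c in {coset(g) for g in G}}

def lift(f2):                       # (up f2)(g) = f2(gH)
    return {g: f2[coset(g)] for g in G}

f1 = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
print(project(f1))                  # {frozenset({0, 2}): 2.0, frozenset({1, 3}): 3.0}
print(lift(project(f1)))            # {0: 2.0, 1: 3.0, 2: 2.0, 3: 3.0}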

We now give a generalized form of group convolution that allows for inputs defined on homogeneous spaces. For functions f : G/H1 → C and k : G/H2 → C, the group convolution can be written as:

$$(f \ast k)(g) = \sum_{u \in G} \uparrow f(g u^{-1})\, \uparrow k(u).$$

When both H1 and H2 are equal to the trivial subgroup {e}, the group convolution simplifies to

$$(f \ast k)(g) = \sum_{u \in G} f(g u^{-1})\, k(u).$$

Since the functions f and k can be defined either on the group G or on a homogeneous space G/H, there are four possible combinations to consider:
1. f : G → C and k : G/H → C

In this case, the group convolution can be expressed as $(f \ast k)(g) = \sum_{u \in G} f(g u^{-1}) \uparrow k(u)$. Therefore we can define the convolution as a function on G/H:

$$(f \ast k)(gH) = \sum_{u \in G} f(g u^{-1})\, k(uH).$$


2. f : G/H → C and k : G → C

In this case, the group convolution can be expressed as $(f \ast k)(g) = \sum_{u \in G} \uparrow f(g u^{-1})\, k(u)$. To simplify it, we can define a function $\tilde{k} : H \backslash G \to \mathbb{C}$, where $\tilde{k}(Hv) = \sum_{h \in H} k(hv)$. With this definition, the convolution reduces to:

$$(f \ast k)(g) = \sum_{Hv \in H \backslash G} f(g v^{-1} H)\, \tilde{k}(Hv).$$

3. f : G/H1 → C and k : G/H2 → C

In this case, the group convolution can be expressed as $(f \ast k)(g) = \sum_{u \in G} \uparrow f(g u^{-1}) \uparrow k(u)$. To simplify it, we can define a function $\tilde{k} : H_1 \backslash G / H_2 \to \mathbb{C}$, where $\tilde{k}(H_1 g H_2) = \sum_{h \in H_1} k(h g H_2)$. With this definition, the convolution reduces to:

$$(f \ast k)(gH_2) = \sum_{H_1 v \in H_1 \backslash G} f(g v^{-1} H_1)\, \tilde{k}(H_1 v H_2).$$

4. f : G → C and k : G → C

In this case, the group convolution can be expressed as $(f \ast k)(g) = \sum_{u \in G} f(g u^{-1})\, k(u)$. In the continuous setting, the group convolution becomes:

$$(f \ast k)(g) = \int_{u \in G} f(u)\, k(u^{-1} g)\, du = \int_{u \in G} f(g u)\, k(u^{-1})\, du.$$

We now assert that the generalized convolution

$$(f \ast k)(g) = \sum_{u \in G} \uparrow f(g u^{-1})\, \uparrow k(u)$$

represents the most comprehensive class of equivariant operations.

Theorem 1. A linear map between fi : G/Hi → C and fj : G/Hj → C is equivariant to the action of G if and only if it can be represented as a generalized convolution $(f \ast k)(g) = \sum_{u \in G} \uparrow f(g u^{-1}) \uparrow k(u)$ with some filter k : Hi\G/Hj → C, i.e., fj = fi ∗ k.

The proof of this theorem and a more detailed description of the cases above can be found in Esteves [8].
Following Cohen et al. [2], we employ the framework of fiber bundles to describe the generalization of the findings presented so far. Before that, however, we need to understand the concept of a manifold, which has numerous applications in mathematics, and in particular in geometry. A mapping ψ : En → En of the n-dimensional Euclidean space onto itself is called a homeomorphism, or topological transformation, if it is one-to-one and bicontinuous. Bicontinuity means that both ψ and ψ−1 are continuous.

Definition 12. An n-dimensional topological manifold (or simply a topological n-manifold) is a Hausdorff space M with a countable basis that satisfies the following conditions:
(1) Every point m ∈ M has a neighborhood Um that is homeomorphic to an open subset of an n-dimensional real vector space E. In this case, we write dim M = n.
(2) A chart for a topological n-manifold M is a triple (U, u, V), where U is an open subset of M, V is an open subset of an n-dimensional real vector space E, and u : U → V is a homeomorphism. As the chart (U, u, V) is determined by the pair (U, u), we will usually denote a chart by (U, u).
(3) An atlas on an n-manifold M is a family of charts {(Uα, uα) | α ∈ J}, where J is an arbitrary indexing set, such that the sets Uα form a covering of M: M = ∪α∈J Uα. An atlas is called countable (or finite) if the index set is countable (or finite).

Definition 13. Let M be a topological manifold and let {(Uα, uα) | α ∈ J} be an atlas for M. Consider two neighborhoods Uα, Uβ such that Uαβ = Uα ∩ Uβ ≠ ∅. Then, a homeomorphism

$$u_{\alpha\beta} : u_\beta(U_{\alpha\beta}) \to u_\alpha(U_{\alpha\beta})$$

is defined by uαβ = uα ∘ uβ−1. This map is called the identification map for Uα and Uβ. By definition, uγβ ∘ uβα = uγα in uα(Uαβγ), and uαα(x) = x for x ∈ uα(Uα). These relations imply that the inverse of uαβ is uβα. The atlas {(Uα, uα)} is called smooth if all its identification maps are smooth (as mappings between open subsets of real vector spaces).
Two smooth atlases are equivalent if their union is again a smooth atlas; i.e., {(Uα, uα)} and {(Vi, vi)} are equivalent if all the maps

$$v_i \circ u_\alpha^{-1} : u_\alpha(U_\alpha \cap V_i) \to v_i(U_\alpha \cap V_i)$$

and their inverses are smooth. A smooth structure on M is an equivalence class of smooth atlases on M. A topological manifold endowed with a smooth structure is called a smooth manifold.

Example 14. Examples of manifolds.
(1) Spheres: Let E be an n-dimensional Euclidean space with inner product ⟨·, ·⟩. The unit sphere is Sn−1 = {x ∈ E | ⟨x, x⟩ = 1}. Sn−1 is a Hausdorff space with a countable basis in the relative topology. Let a ∈ Sn−1 and U+ = Sn−1 − {a}, U− = Sn−1 − {−a}. Define maps u+ : U+ → a⊥, u− : U− → a⊥ by

$$u_+(x) = \frac{x - \langle x, a\rangle\, a}{1 - \langle x, a\rangle}, \qquad u_-(x) = \frac{x - \langle x, a\rangle\, a}{1 + \langle x, a\rangle}.$$

Then, {(Ui, ui) | i = +, −} is a smooth atlas for Sn−1. Moreover, the atlas obtained in this way from a second point b ∈ Sn−1 is equivalent to this one. Thus the smooth manifold structure of Sn−1 is independent of the choice of a.

(2) Projective spaces: Consider the equivalence relation on Sn whose equivalence classes are the pairs {x, −x}, x ∈ Sn, and introduce the quotient topology on the set of equivalence classes. We call the result the real projective n-space, RPn.
To construct a smooth atlas on RPn, consider the projection π : Sn → RPn given by π(x) = {x, −x}. If O is an open set in Sn such that x ∈ O implies −x ∉ O, then π(O) is open in RPn and π : O → π(O) is a homeomorphism. Now, let {(Uα, uα)} be an atlas for Sn such that, if x ∈ Uα, then −x ∉ Uα. Then, {(π(Uα), uα ∘ π−1)} is a smooth atlas for RPn.
(3) Tori: Denote the elements of Rn by x = (x1, · · · , xn), xi ∈ R. Define an equivalence relation on Rn by x ∼ x′ if and only if xi − x′i ∈ Z, i = 1, · · · , n. Let the set of equivalence classes, with the quotient topology, be denoted by Tn, and let π : Rn → Tn be the canonical projection. Consider the smooth atlas for Rn given by {(Ua, ua)}, a ∈ Rn, where

$$U_a = \{x \in \mathbb{R}^n \mid |x_i - a_i| < \tfrac{1}{4},\ i = 1, \cdots, n\}, \qquad u_a(x) = x.$$

Then, {(π(Ua), ua ∘ π−1)} is a smooth atlas for Tn.

Let B be a connected space with a basepoint b0 ∈ B, and p : E → B be a continuous


map. A fiber space over a space B is a triple (B, E, p) of the space B, a space E, and a
continuous map p of E into B.

Definition 14 (Fiber bundle with fiber F). The map p : E → B is a locally trivial fibration, or fiber bundle, with (typical) fiber F if it satisfies the following properties:
(i) p−1(b0) = F;
(ii) the map p : E → B is surjective;
(iii) for every point x ∈ B there is an open neighborhood Ux ⊆ B and a ``fiber preserving homeomorphism'' ψUx : p−1(Ux) → Ux × F, that is, a homeomorphism such that projecting Ux × F onto Ux after applying ψUx coincides with p on p−1(Ux) (the corresponding diagram commutes).
The space B is called the base space of the fiber space, p the projection, and for b ∈ B, the subspace p−1(b) of E (which is closed if {b} is closed) is the fiber over b (in E).

Example 15.
(i) The projection map X × F → X is the trivial fibration over X with fiber F.
(ii) Let S1 ⊆ C be the unit circle with basepoint 1 ∈ S1. Consider the map fn : S1 → S1 given by fn(z) = zn. Then, fn is a locally trivial fibration whose fiber is a set of n distinct points (the nth roots of unity in S1).
(iii) Let exp : R → S1 be given by exp(t) = e2πit ∈ S1. Then, exp is a locally trivial fibration with fiber the integers Z.

Definition 15 (Vector bundles). An n-dimensional vector bundle over a field K (written (V, E, B)) is a locally trivial fibration p : E → B whose fiber is an n-dimensional K-vector space V, satisfying the additional requirement that the local trivialization ψ : p−1(U) → U × V induces K-linear transformations on each fiber. That is, restricted to each x ∈ U, ψ defines a K-linear transformation (and thus an isomorphism) ψ : p−1(x) → {x} × V.

Definition 16 (Associated vector bundle). An associated vector bundle can be constructed from a principal bundle p : E → E/G. Let V be a vector space; we define an equivalence relation on E × V using the group action of G, denoted ∼_G, where (x, y) ∼_G (xg, ρ(g)^{−1} y), with ρ a representation of G on V and g ∈ G. The associated vector bundle can then be represented as the bundle p_F : (E × V)/∼_G → B, where p_F([x, y]) = p(x).

Definition 17 (Fiber bundles). A fiber bundle consists of three topological spaces E, B, F, and a continuous map p : E → B, such that the following condition is satisfied: each b ∈ B has an open neighborhood U_b and a homeomorphism

    ψ : U_b × F → p^{−1}(U_b),

such that p ◦ ψ = proj_1. The pre-image p^{−1}(b), frequently denoted by F_b, is called the fiber over b. A fiber bundle is said to be smooth if E, B, and F are smooth manifolds, p is a smooth map, and the ψ above can be chosen to be diffeomorphisms.

Describing feature maps of G-CNNs as sections of associated vector bundles is a convenient approach. In cases where the features correspond to vectors on a homogeneous space G/H (meaning that each point x in G/H is associated with a feature vector), it is
desirable to ensure equivariance with respect to the group G. The bundle p : G → G/H
is a principal bundle, which means we can create a related vector bundle based on it. To
construct this vector bundle, we need to choose a vector space V and a representation of
H on V . The choice of V is flexible and can vary depending on the situation. In the case of
scalar fields, we can have multiple channels, and the representation ρ = I (identity) can be
used. In a more general scenario, we can use a direct sum of vector spaces Vj with different
dimensions, where each Vj has its own representation ρj . In this setup, the sections of the
bundle are equivariant feature fields.
To handle the transformation between layers in a G-CNN (Group Convolutional Neural Network), we adopt an approach similar to that of traditional neural networks: the transformation is restricted to a linear operation with learnable parameters, followed by a nonlinearity. However, it is crucial that this transformation preserve equivariance, which imposes certain constraints.
In the context of a G-CNN, the features in a layer can be viewed as functions (Mackey functions) f : G → V, where G is the group and V is the vector space. These functions satisfy the property that, for any h ∈ H, the equation f(gh) = ρ(h^{−1})f(g) holds. This property ensures that the relation

    (g, f(g)) → (gh, f(gh)) = (gh, ρ(h^{−1})f(g))

is maintained, as defined in Definition 16.


From a practical perspective, defining f in this way may seem redundant and ineffi­
cient since the function operates on the entire space G. However, it proves to be useful for
algebraic manipulation and ensures that the transformation sati­fies the necessary equiv­
ariance property.
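To make this condition concrete, the following sketch (our own illustration, using the hypothetical choices G = Z_4, H = {0, 2}, V = R, and ρ the sign representation of H) builds a Mackey function and checks the constraint f(gh) = ρ(h^{−1})f(g) elementwise.

import torch

# Hypothetical finite setting: G = Z_4 (addition mod 4), H = {0, 2} as a subgroup,
# V = R, and rho the sign representation of H: rho(0) = +1, rho(2) = -1.
G = list(range(4))
H = [0, 2]
rho = {0: 1.0, 2: -1.0}

def h_inverse(h):
    # Inverse in Z_4 (written additively)
    return (-h) % 4

# A Mackey function f : G -> V is determined by its values on coset representatives
# of G/H; here we choose f(0) and f(1) freely and extend using the constraint.
f = torch.zeros(4)
f[0], f[1] = 0.7, -1.3
f[2] = rho[h_inverse(2)] * f[0]     # f(0 + 2) = rho(2^{-1}) f(0)
f[3] = rho[h_inverse(2)] * f[1]     # f(1 + 2) = rho(2^{-1}) f(1)

# Check f(gh) = rho(h^{-1}) f(g) for all g in G and h in H
for g in G:
    for h in H:
        assert torch.isclose(f[(g + h) % 4], rho[h_inverse(h)] * f[g])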

Theorem 2. Any linear map that preserves equivariance between feature fields on homogeneous spaces can be represented as a cross-correlation operation (such as explained in Example 7), thus reducing to a lower dimension.
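Theorem 2 can be illustrated numerically in the simplest setting of signals on the cyclic group Z_n, where the group acts by cyclic shifts and the group cross-correlation can be written out directly. The sketch below is our own illustration of this special case, not the general construction behind the theorem.

import torch

n = 8
torch.manual_seed(0)
signal = torch.randn(n)
psi = torch.randn(n)                 # an arbitrary filter on Z_n

def shift(f, g):
    # Action of g in Z_n on a signal: (g.f)(x) = f(x - g)
    return torch.roll(f, shifts=g)

def cross_correlation(f, psi):
    # (f * psi)(g) = sum_x f(x) psi(x - g), the group cross-correlation on Z_n
    return torch.stack([(f * shift(psi, g)).sum() for g in range(n)])

g = 3
lhs = cross_correlation(shift(signal, g), psi)   # transform the input, then correlate
rhs = shift(cross_correlation(signal, psi), g)   # correlate, then transform the output

assert torch.allclose(lhs, rhs, atol=1e-5)       # equivariance of cross-correlation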

10.5 Implementation of equivariant neural networks


In this section, we demonstrate how to implement some of the concepts introduced above in Python, using libraries such as Matplotlib and PyTorch. Our aim is to show that these concepts are not purely theoretical but have practical applications as well. The code examples illustrate how they can be translated into tangible implementations for solving real-world problems across various domains. Throughout this section, we guide you in defining group theory concepts in code and in constructing equivariant convolution layers.

10.5.1 Implementing groups and actions


We start by showing how to define group theory concepts in code, which allows us to work with symmetries and transformations computationally. This makes the concepts applicable in fields such as computer vision, where objects often exhibit symmetries that can be leveraged for better understanding and analysis. By representing and manipulating groups in code, we can build algorithms that exploit these symmetries, enhancing tasks such as object recognition, image classification, and image generation.
In Python, we can represent groups as sets of elements and define operations that satisfy certain properties. For example, suppose we want to define a cyclic group of order n.

class CyclicGroup:
    """The cyclic group Z_n, with addition modulo n as the group operation."""

    def __init__(self, n):
        self.n = n

    def operation(self, r: int, s: int) -> int:
        # Group operation: addition modulo n
        return (r + s) % self.n

    def inverse(self, r: int) -> int:
        # Inverse element: r + (n - r) = 0 (mod n)
        return (self.n - r) % self.n

The provided code defines a class called CyclicGroup, which represents a cyclic group with a given order n. A cyclic group is a mathematical structure consisting of a set of elements and a binary operation that combines two elements; in this case, the operation used is addition modulo n. In addition to the operation method, the inverse method calculates the inverse of a given element within the cyclic group.
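A short usage example (ours, not part of the listing above) makes the group axioms tangible for n = 4:

group = CyclicGroup(4)

assert group.operation(3, 2) == 1                   # 3 + 2 = 5, which is 1 mod 4
assert group.operation(0, 3) == 3                   # 0 acts as the identity element
assert group.operation(3, group.inverse(3)) == 0    # every element has an inverse

# Associativity: (r + s) + t equals r + (s + t) for all elements
for r in range(4):
    for s in range(4):
        for t in range(4):
            assert group.operation(group.operation(r, s), t) == \
                   group.operation(r, group.operation(s, t))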
Group actions are an essential aspect of group theory, and they can be implemented in code as well. One common example is the group action of rotation, where elements of a group act on a set by rotating its elements. To implement group actions, we can define functions or methods that perform the desired action on the elements of the set. For instance, if we have a set of points representing coordinates in a 2D plane (Fig. 10.9), we can define a rotation function that takes a point and rotates it by a specified angle around a given center (a sketch for point sets is given after the listing below). Rotating an image by any angle can be easily implemented with PyTorch:

import torch
import matplotlib.pyplot as plt
from torchvision.transforms.functional import rotate

# A simple test image: a horizontal gradient of shape (batch, channels, H, W)
x = torch.zeros(1, 1, 256, 256)
x[0, 0, :, :] = torch.linspace(0, 1, steps=256)

# Act on the image by a 90-degree rotation
gx = rotate(x, 90)

fig, ax = plt.subplots(1, 2)

ax[0].imshow(x[0, 0].numpy())
ax[0].set_title('Original Image')

ax[1].imshow(gx[0, 0].numpy())
ax[1].set_title('Rotated Image')

plt.show()
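The same group action can be applied to point sets rather than images. The sketch below (our own, complementing the listing above) rotates a batch of 2D points about an arbitrary center, as described earlier in this section; the helper rotate_points is ours.

import math
import torch

def rotate_points(points: torch.Tensor, angle_deg: float,
                  center: torch.Tensor) -> torch.Tensor:
    # Rotate (N, 2) points counter-clockwise by angle_deg around center
    theta = math.radians(angle_deg)
    rotation = torch.tensor([[math.cos(theta), -math.sin(theta)],
                             [math.sin(theta),  math.cos(theta)]])
    return (points - center) @ rotation.T + center

points = torch.tensor([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
center = torch.tensor([0.0, 0.0])

# A 90-degree rotation about the origin sends (1, 0) to (0, 1)
rotated = rotate_points(points, 90.0, center)
assert torch.allclose(rotated[0], torch.tensor([0.0, 1.0]), atol=1e-6)

# Composing four 90-degree rotations is the identity -- the cyclic group C4 again
out = points
for _ in range(4):
    out = rotate_points(out, 90.0, center)
assert torch.allclose(out, points, atol=1e-5)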

10.5.2 Implementing equivariant convolution layers


FIGURE 10.9 Group action.

Equivariant Convolution Layers are designed to ensure that the learned features in the layer maintain equivariance to the desired transformations. The core concept of equivariance is that, when the input undergoes a specific transformation, the output should transform in a corresponding manner. This property is crucial, as it allows the network to interpret and process data while respecting the underlying symmetries. In the case of rotation equivariant convolution, the layer is equivariant to image rotations: rotating the input image results in a corresponding rotation of the output feature maps. By enforcing equivariance, Equivariant Convolution Layers enable neural networks to capture and utilize symmetry-based patterns effectively (Fig. 10.10).
Equivariant Convolution Layers can be implemented using specific classes and functions provided by libraries like PyTorch. Let us take a look at an example code snippet that demonstrates the implementation of a rotation equivariant convolution layer.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.transforms.functional import rotate

The code begins by importing the necessary libraries: PyTorch and its modules nn (neural network) and F (functional), together with the rotate function from torchvision.

# Define the Rotation Equivariant Convolution Layer
class EquivariantConvolutionRotation(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size):
        super(EquivariantConvolutionRotation, self).__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        # Learnable filter bank of shape (out_channels, in_channels, k, k)
        self.filters = nn.Parameter(torch.randn(out_channels,
                                                in_channels,
                                                kernel_size,
                                                kernel_size))

    def rotate_filters(self):
        # Sum the filters over all four 90-degree rotations, which makes
        # the resulting filter bank invariant under such rotations
        filters = self.filters
        rotated_filters = torch.zeros_like(filters)
        for angle in range(4):
            rotated_filters += rotate(filters, angle * 90)
        return rotated_filters

    def forward(self, x):
        rotated_filters = self.rotate_filters()
        # "same" padding keeps the spatial dimensions of the input
        output = F.conv2d(x, rotated_filters,
                          padding=self.kernel_size // 2)
        return output

FIGURE 10.10 Equivariant convolution layer.

The code defines the EquivariantConvolutionRotation class, which inherits from nn.Module. This class represents the rotation equivariant convolutional layer. The __init__ method initializes the layer by defining its parameters, including the number of input and output channels and the kernel size. It also creates a learnable parameter called filters using nn.Parameter, which represents the filters of the convolutional layer.
The rotate_filters method is responsible for rotating the filters in the layer. It initializes a tensor with the same shape as the filters and then iterates over four angles (0, 90, 180, and 270 degrees). For each angle, it rotates the filters by that angle using the rotate function from torchvision.transforms.functional and adds the result to the tensor.
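Summing the filters over all four rotations makes the resulting filter bank invariant under 90-degree rotations, and this invariance is exactly what lets the convolution commute with such rotations. A quick sanity check (our own addition, relying on rotate acting as an exact pixel permutation for multiples of 90 degrees) can be run on a fresh instance of the layer:

# The summed filter bank is itself unchanged by a further 90-degree rotation
demo_layer = EquivariantConvolutionRotation(in_channels=1, out_channels=4, kernel_size=3)
summed = demo_layer.rotate_filters()
assert torch.allclose(rotate(summed, 90), summed, atol=1e-6)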

The forward method defines the forward pass of the layer. It first calls the rotate_filters method to obtain the rotated filters. Then, it applies these filters to the input tensor x using F.conv2d (2D convolution in PyTorch), with the padding set so that the output has the same spatial dimensions as the input.

# Define input and layer parameters
in_channels = 1
out_channels = 64
kernel_size = 3

# Create the EquivariantConvolutionRotation layer
layer = EquivariantConvolutionRotation(in_channels=in_channels,
                                       out_channels=out_channels,
                                       kernel_size=kernel_size)
layer.eval()

# Create input tensor x and rotate it by 90 degrees
x = torch.zeros(1, 1, 256, 256)
x[0, 0, :, :] = torch.linspace(0, 1, steps=256)
gx = rotate(x, 90)

# Pass x and gx through the layer
psi_x = layer(x)
psi_gx = layer(gx)

# Rotate the output of psi_x by 90 degrees
g_psi_x = rotate(psi_x, 90)

# Visualize the tensors
fig, ax = plt.subplots(1, 3, figsize=(10, 18))

ax[0].imshow(x[0, 0].numpy())
ax[0].set_title('x')

ax[1].imshow(g_psi_x[0, 0].detach().numpy())
ax[1].set_title(r'$g.\psi(x)$')

ax[2].imshow(psi_gx[0, 0].detach().numpy())
ax[2].set_title(r'$\psi(g.x)$')

plt.show()

# Check equivariance: applying the layer and rotating commute
assert torch.allclose(psi_gx, g_psi_x, atol=1e-6, rtol=1e-6)

By adding Equivariant Convolutional Layers to a neural network architecture, we can take advantage of operations that preserve symmetries. This allows us to build models that can recognize and understand patterns regardless of how the input data is arranged or oriented.

To create an Equivariant Neural Network, we can stack multiple Equivariant Convolutional Layers together with suitable activation functions and pooling operations. We can also include other layers, depending on what the task requires.
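A minimal sketch of such a stack (our own illustration, not a listing from the chapter) combines the rotation equivariant layer defined above with pointwise nonlinearities and a global pooling step; global average pooling over the spatial positions turns the equivariant feature maps into rotation-invariant descriptors that can feed a classifier. The class name EquivariantNet and the layer sizes are our own choices.

class EquivariantNet(nn.Module):
    # A small illustrative network built from the layer defined above
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv1 = EquivariantConvolutionRotation(1, 16, kernel_size=3)
        self.conv2 = EquivariantConvolutionRotation(16, 32, kernel_size=3)
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        x = F.relu(self.conv1(x))      # pointwise nonlinearities preserve equivariance
        x = F.relu(self.conv2(x))
        x = x.mean(dim=(2, 3))         # global average pooling gives rotation invariance
        return self.classifier(x)

model = EquivariantNet()
logits = model(torch.zeros(2, 1, 64, 64))   # a batch of two 64x64 grayscale images
print(logits.shape)                         # torch.Size([2, 10])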

10.6 Conclusion
In this chapter, we have explained the theory of equivariant deep neural networks and how they reduce the underlying model and sample complexity by using the geometry of data. More specifically, sample complexity is reduced when, for example, symmetric data can be projected to a lower-dimensional space, and model reduction happens when the model preserves, or in other words comprehends, the geometrical structure without additional layers and parameters. An important network of this type is the CNN, in which the convolution operator adheres to geometrical structure such as invariance. In our view, the crux of this advantage lies in the concept of group action, which captures the underlying ``semantics'' of the data and facilitates the definition of equivariance. We aimed at simplifying the seminal results on the generalization of group action CNNs (G-CNNs) and at clarifying the mathematical structures involved, such as manifolds and fiber bundles, via a number of illustrative examples. We hope this chapter gives an accessible grasp of the very interesting area of equivariant neural networks.
As for future work, we envisage that using group actions as a mechanism of separation of concerns, where actions and structures are dealt with separately, is a fruitful line for further investigation. One aspect lies in the limitation that the group structure incurs, namely the existence of an inverse for every element and thus the reversibility of actions. What happens if we have a more relaxed algebraic structure than a group, while preserving invariance to some extent? Also, is it possible to use some kind of hyperstructure to encode several actions while they remain invariant? We conjecture that this would lead to even more model reduction and thus better performance in processing high-dimensional data.

References
[1] Michael M. Bronstein, et al., Geometric deep learning: grids, groups, graphs, geodesics, and gauges,
https://doi.org/10.48550/arXiv.2104.13478, May 2021.
[2] Taco Cohen, et al., A general theory of equivariant CNNs on homogeneous spaces, https://doi.org/
10.48550/arXiv.1811.02017, Jan. 2020.
[3] Risi Kondor, Shubhendu Trivedi, On the generalization of equivariance and convolution in neural
networks to the action of compact groups, https://doi.org/10.48550/arXiv.1802.03690, Nov. 2018.
[4] Yujin Tang, David Ha, The sensory neuron as a transformer: permutation-invariant neural networks
for reinforcement learning, https://doi.org/10.48550/arXiv.2109.02869, Sep. 2021.
[5] Soledad Villar, et al., Scalars are universal: equivariant machine learning, structured like classical
physics, https://doi.org/10.48550/arXiv.2106.06610, Feb. 2023.
[6] Mehran Shakerinava, et al., Structuring representations using group invariants, Advances in Neural
Information Processing Systems 35 (Dec. 2022) 34162--34174.
[7] Taco S. Cohen, Max Welling, Group equivariant convolutional networks, https://doi.org/10.48550/
arXiv.1602.07576, June 2016.
[8] Carlos Esteves, Theoretical aspects of group equivariant neural networks, http://arxiv.org/abs/2004.
05154, Apr. 2020.

[9] Taco S. Cohen, et al., Spherical CNNs, openreview.net, https://openreview.net/forum?id=Hkbd5xZRb, 2018.
[10] Carlos Esteves, et al., Learning SO(3) equivariant representations with spherical CNNs, https://doi.
org/10.48550/arXiv.1711.06721, Sep. 2018.
[11] Carlos Esteves, et al., Scaling spherical CNNs, https://doi.org/10.48550/arXiv.2306.05420, June 2023.
[12] Risi Kondor, et al., Clebsch-Gordan nets: a fully Fourier space spherical convolutional neural network,
https://doi.org/10.48550/arXiv.1806.09231, Nov. 2018.
[13] Maurice Weiler, et al., 3D steerable CNNs: learning rotationally equivariant features in volumetric
data, in: Advances in Neural Information Processing Systems, vol. 31, Curran Associates, Inc., 2018,
pp. 10381--10392.
[14] Jan E. Gerken, et al., Geometric deep learning and equivariant neural networks, https://doi.org/10.
48550/arXiv.2105.13926, May 2021.
[15] Branko Grünbaum, G.C. Shephard, Tilings and Patterns, second edition, Dover Publications, Inc.,
2016.
[16] Felix Klein, A comparative review of recent researches in geometry, Bulletin of the New York Mathe­
matical Society 2 (10) (1900) 215--249.